* [PATCH] netfilter/ipvs: expire no destination UDP connections when expire_nodest_conn=1
@ 2020-05-15  1:35 Andrew Sy Kim
  2020-05-15 18:07 ` Julian Anastasov
  2020-05-17 17:16 ` [PATCH] netfilter/ipvs: immediately expire UDP connections matching unavailable destination if expire_nodest_conn=1 Andrew Sy Kim
  0 siblings, 2 replies; 31+ messages in thread
From: Andrew Sy Kim @ 2020-05-15  1:35 UTC (permalink / raw)
  Cc: kim.andrewsy, Wensong Zhang, Simon Horman, Julian Anastasov,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David S. Miller, Jakub Kicinski, open list:IPVS, open list:IPVS,
	open list:NETFILTER, open list:NETFILTER, open list
When expire_nodest_conn=1 and an IPVS destination is deleted, IPVS
doesn't expire connections with the IP_VS_CONN_F_ONE_PACKET flag set (any
UDP connection). If there are many UDP packets to a virtual server from a
single client and a destination is deleted, many packets are silently
dropped whenever an existing connection entry with the same source port
exists. This patch ensures IPVS also expires UDP connections when a
packet matches an existing connection with no destinations.
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
---
 net/netfilter/ipvs/ip_vs_core.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index aa6a603a2425..f0535586fe75 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -2116,8 +2116,7 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
 		else
 			ip_vs_conn_put(cp);
 
-		if (sysctl_expire_nodest_conn(ipvs) &&
-		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
+		if (sysctl_expire_nodest_conn(ipvs)) {
 			/* try to expire the connection immediately */
 			ip_vs_conn_expire_now(cp);
 		}
-- 
2.20.1
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: expire no destination UDP connections when expire_nodest_conn=1
  2020-05-15  1:35 [PATCH] netfilter/ipvs: expire no destination UDP connections when expire_nodest_conn=1 Andrew Sy Kim
@ 2020-05-15 18:07 ` Julian Anastasov
  2020-05-17 17:27   ` Andrew Kim
  2020-05-17 17:16 ` [PATCH] netfilter/ipvs: immediately expire UDP connections matching unavailable destination if expire_nodest_conn=1 Andrew Sy Kim
  1 sibling, 1 reply; 31+ messages in thread
From: Julian Anastasov @ 2020-05-15 18:07 UTC (permalink / raw)
  To: Andrew Sy Kim
  Cc: Wensong Zhang, Simon Horman, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, David S. Miller, Jakub Kicinski, open list:IPVS,
	open list:IPVS, open list:NETFILTER, open list:NETFILTER,
	open list
	Hello,
On Thu, 14 May 2020, Andrew Sy Kim wrote:
> When expire_nodest_conn=1 and an IPVS destination is deleted, IPVS
> doesn't expire connections with the IP_VS_CONN_F_ONE_PACKET flag set (any
> UDP connection). If there are many UDP packets to a virtual server from a
> single client and a destination is deleted, many packets are silently
> dropped whenever an existing connection entry with the same source port
> exists. This patch ensures IPVS also expires UDP connections when a
> packet matches an existing connection with no destinations.
> 
> Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> ---
>  net/netfilter/ipvs/ip_vs_core.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index aa6a603a2425..f0535586fe75 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -2116,8 +2116,7 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
>  		else
>  			ip_vs_conn_put(cp);
	Above ip_vs_conn_put() should free the ONE_PACKET
connections because:
- such connections never start timer, they are designed
to exist just to schedule the packet, then they are released.
- noone takes extra references
	So, ip_vs_conn_put() simply calls ip_vs_conn_expire()
where connections should be released immediately. As result,
we can not access cp after this point here. That is why we work
just with 'flags' below...
	Note that not every UDP connection has ONE_PACKET
flag, it is present if you configure it for the service.
Do you have -o/--ops flag? If not, the UDP connection
should expire before the next jiffie. This is the theory,
in practice, you may observe some problem...
> -		if (sysctl_expire_nodest_conn(ipvs) &&
> -		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
> +		if (sysctl_expire_nodest_conn(ipvs)) {
>  			/* try to expire the connection immediately */
>  			ip_vs_conn_expire_now(cp);
>  		}
	You can also look at the discussion which resulted in
the last patch for this place:
http://archive.linuxvirtualserver.org/html/lvs-devel/2018-07/msg00014.html
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCH] netfilter/ipvs: immediately expire UDP connections matching unavailable destination if expire_nodest_conn=1
  2020-05-15  1:35 [PATCH] netfilter/ipvs: expire no destination UDP connections when expire_nodest_conn=1 Andrew Sy Kim
  2020-05-15 18:07 ` Julian Anastasov
@ 2020-05-17 17:16 ` Andrew Sy Kim
  2020-05-18 19:10   ` Julian Anastasov
  1 sibling, 1 reply; 31+ messages in thread
From: Andrew Sy Kim @ 2020-05-17 17:16 UTC (permalink / raw)
  Cc: kim.andrewsy, David S. Miller, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Wensong Zhang, Simon Horman, Julian Anastasov,
	Jakub Kicinski, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, netdev, lvs-devel, linux-kernel,
	netfilter-devel, coreteam
If expire_nodest_conn=1 and a UDP destination is deleted, IPVS should
also expire all matching connections immiediately instead of waiting for
the next matching packet. This is particulary useful when there are a
lot of packets coming from a few number of clients. Those clients are
likely to match against existing entries if a source port in the
connection hash is reused. When the number of entries in the connection
tracker is large, we can significantly reduce the number of dropped
packets by expiring all connections upon deletion.
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
---
 include/net/ip_vs.h             |  7 ++++++
 net/netfilter/ipvs/ip_vs_conn.c | 38 +++++++++++++++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_core.c |  5 -----
 net/netfilter/ipvs/ip_vs_ctl.c  |  9 ++++++++
 4 files changed, 54 insertions(+), 5 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 83be2d93b407..deecf1344676 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1049,6 +1049,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_conn_reuse_mode;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_expire_nodest_conn;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return ipvs->sysctl_schedule_icmp;
@@ -1209,6 +1214,8 @@ struct ip_vs_conn * ip_vs_conn_out_get_proto(struct netns_ipvs *ipvs, int af,
 					     const struct sk_buff *skb,
 					     const struct ip_vs_iphdr *iph);
 
+void ip_vs_conn_flush_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest);
+
 /* Get reference to gain full access to conn.
  * By default, RCU read-side critical sections have access only to
  * conn fields and its PE data, see ip_vs_conn_rcu_free() for reference.
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 02f2f636798d..c69dfbbc3416 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -1366,6 +1366,44 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
 		goto flush_again;
 	}
 }
+
+/*	Flush all the connection entries in the ip_vs_conn_tab with a
+ *	matching destination.
+ */
+void ip_vs_conn_flush_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest)
+{
+	int idx;
+	struct ip_vs_conn *cp, *cp_c;
+
+	rcu_read_lock();
+	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
+		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
+			if (cp->ipvs != ipvs)
+				continue;
+
+			if (cp->dest != dest)
+				continue;
+
+			/* As timers are expired in LIFO order, restart
+			 * the timer of controlling connection first, so
+			 * that it is expired after us.
+			 */
+			cp_c = cp->control;
+			/* cp->control is valid only with reference to cp */
+			if (cp_c && __ip_vs_conn_get(cp)) {
+				IP_VS_DBG(4, "del controlling connection\n");
+				ip_vs_conn_expire_now(cp_c);
+				__ip_vs_conn_put(cp);
+			}
+			IP_VS_DBG(4, "del connection\n");
+			ip_vs_conn_expire_now(cp);
+		}
+		cond_resched_rcu();
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(ip_vs_conn_flush_dest);
+
 /*
  * per netns init and exit
  */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index aa6a603a2425..0139fa597d76 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -694,11 +694,6 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_nat_icmp_send;
 }
 
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
-{
-	return ipvs->sysctl_expire_nodest_conn;
-}
-
 #else
 
 static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 8d14a1acbc37..f87c03622874 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1225,6 +1225,15 @@ ip_vs_del_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
 	 */
 	__ip_vs_del_dest(svc->ipvs, dest, false);
 
+	/*	If expire_nodest_conn is enabled and protocol is UDP,
+	 *	attempt best effort flush of all connections with this
+	 *	destination.
+	 */
+	if (sysctl_expire_nodest_conn(svc->ipvs) &&
+	    dest->protocol == IPPROTO_UDP) {
+		ip_vs_conn_flush_dest(svc->ipvs, dest);
+	}
+
 	LeaveFunction(2);
 
 	return 0;
-- 
2.20.1
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: expire no destination UDP connections when expire_nodest_conn=1
  2020-05-15 18:07 ` Julian Anastasov
@ 2020-05-17 17:27   ` Andrew Kim
  2020-05-17 17:30     ` Andrew Kim
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Kim @ 2020-05-17 17:27 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Wensong Zhang, Simon Horman, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, David S. Miller, Jakub Kicinski, open list:IPVS,
	open list:IPVS, open list:NETFILTER, open list:NETFILTER,
	open list
Hi Julian,
Thanks for getting back to me, that makes sense.
Would you be opposed to trying to expire all UDP connections matching
a deleted destination only if expire_nodest_conn=1?
Even today with `expire_nodest_conn=1`, many packets could be dropped
if there are many requests from a single client
trying to reuse client ports matching a deleted destination. Setting
`expire_nodest_conn=1` and reducing the UDP timeout
helps but deleting all connections when the destination is deleted
seems more efficient.
Looking forward to hearing your thoughts,
Andrew Sy Kim
On Fri, May 15, 2020 at 2:07 PM Julian Anastasov <ja@ssi.bg> wrote:
>
>
>         Hello,
>
> On Thu, 14 May 2020, Andrew Sy Kim wrote:
>
> > When expire_nodest_conn=1 and an IPVS destination is deleted, IPVS
> > doesn't expire connections with the IP_VS_CONN_F_ONE_PACKET flag set (any
> > UDP connection). If there are many UDP packets to a virtual server from a
> > single client and a destination is deleted, many packets are silently
> > dropped whenever an existing connection entry with the same source port
> > exists. This patch ensures IPVS also expires UDP connections when a
> > packet matches an existing connection with no destinations.
> >
> > Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> > ---
> >  net/netfilter/ipvs/ip_vs_core.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> > index aa6a603a2425..f0535586fe75 100644
> > --- a/net/netfilter/ipvs/ip_vs_core.c
> > +++ b/net/netfilter/ipvs/ip_vs_core.c
> > @@ -2116,8 +2116,7 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
> >               else
> >                       ip_vs_conn_put(cp);
>
>         Above ip_vs_conn_put() should free the ONE_PACKET
> connections because:
>
> - such connections never start timer, they are designed
> to exist just to schedule the packet, then they are released.
> - noone takes extra references
>
>         So, ip_vs_conn_put() simply calls ip_vs_conn_expire()
> where connections should be released immediately. As result,
> we can not access cp after this point here. That is why we work
> just with 'flags' below...
>
>         Note that not every UDP connection has ONE_PACKET
> flag, it is present if you configure it for the service.
> Do you have -o/--ops flag? If not, the UDP connection
> should expire before the next jiffie. This is the theory,
> in practice, you may observe some problem...
>
> > -             if (sysctl_expire_nodest_conn(ipvs) &&
> > -                 !(flags & IP_VS_CONN_F_ONE_PACKET)) {
> > +             if (sysctl_expire_nodest_conn(ipvs)) {
> >                       /* try to expire the connection immediately */
> >                       ip_vs_conn_expire_now(cp);
> >               }
>
>         You can also look at the discussion which resulted in
> the last patch for this place:
>
> http://archive.linuxvirtualserver.org/html/lvs-devel/2018-07/msg00014.html
>
> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: expire no destination UDP connections when expire_nodest_conn=1
  2020-05-17 17:27   ` Andrew Kim
@ 2020-05-17 17:30     ` Andrew Kim
  0 siblings, 0 replies; 31+ messages in thread
From: Andrew Kim @ 2020-05-17 17:30 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Wensong Zhang, Simon Horman, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, David S. Miller, Jakub Kicinski, open list:IPVS,
	open list:IPVS, open list:NETFILTER, open list:NETFILTER,
	open list
I sent a new patch diff based on my ask above. Please take a look :)
Thanks,
Andrew Sy Kim
On Sun, May 17, 2020 at 1:27 PM Andrew Kim <kim.andrewsy@gmail.com> wrote:
>
> Hi Julian,
>
> Thanks for getting back to me, that makes sense.
>
> Would you be opposed to trying to expire all UDP connections matching
> a deleted destination only if expire_nodest_conn=1?
> Even today with `expire_nodest_conn=1`, many packets could be dropped
> if there are many requests from a single client
> trying to reuse client ports matching a deleted destination. Setting
> `expire_nodest_conn=1` and reducing the UDP timeout
> helps but deleting all connections when the destination is deleted
> seems more efficient.
>
> Looking forward to hearing your thoughts,
>
> Andrew Sy Kim
>
>
> On Fri, May 15, 2020 at 2:07 PM Julian Anastasov <ja@ssi.bg> wrote:
> >
> >
> >         Hello,
> >
> > On Thu, 14 May 2020, Andrew Sy Kim wrote:
> >
> > > When expire_nodest_conn=1 and an IPVS destination is deleted, IPVS
> > > doesn't expire connections with the IP_VS_CONN_F_ONE_PACKET flag set (any
> > > UDP connection). If there are many UDP packets to a virtual server from a
> > > single client and a destination is deleted, many packets are silently
> > > dropped whenever an existing connection entry with the same source port
> > > exists. This patch ensures IPVS also expires UDP connections when a
> > > packet matches an existing connection with no destinations.
> > >
> > > Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> > > ---
> > >  net/netfilter/ipvs/ip_vs_core.c | 3 +--
> > >  1 file changed, 1 insertion(+), 2 deletions(-)
> > >
> > > diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> > > index aa6a603a2425..f0535586fe75 100644
> > > --- a/net/netfilter/ipvs/ip_vs_core.c
> > > +++ b/net/netfilter/ipvs/ip_vs_core.c
> > > @@ -2116,8 +2116,7 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
> > >               else
> > >                       ip_vs_conn_put(cp);
> >
> >         Above ip_vs_conn_put() should free the ONE_PACKET
> > connections because:
> >
> > - such connections never start timer, they are designed
> > to exist just to schedule the packet, then they are released.
> > - noone takes extra references
> >
> >         So, ip_vs_conn_put() simply calls ip_vs_conn_expire()
> > where connections should be released immediately. As result,
> > we can not access cp after this point here. That is why we work
> > just with 'flags' below...
> >
> >         Note that not every UDP connection has ONE_PACKET
> > flag, it is present if you configure it for the service.
> > Do you have -o/--ops flag? If not, the UDP connection
> > should expire before the next jiffie. This is the theory,
> > in practice, you may observe some problem...
> >
> > > -             if (sysctl_expire_nodest_conn(ipvs) &&
> > > -                 !(flags & IP_VS_CONN_F_ONE_PACKET)) {
> > > +             if (sysctl_expire_nodest_conn(ipvs)) {
> > >                       /* try to expire the connection immediately */
> > >                       ip_vs_conn_expire_now(cp);
> > >               }
> >
> >         You can also look at the discussion which resulted in
> > the last patch for this place:
> >
> > http://archive.linuxvirtualserver.org/html/lvs-devel/2018-07/msg00014.html
> >
> > Regards
> >
> > --
> > Julian Anastasov <ja@ssi.bg>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: immediately expire UDP connections matching unavailable destination if expire_nodest_conn=1
  2020-05-17 17:16 ` [PATCH] netfilter/ipvs: immediately expire UDP connections matching unavailable destination if expire_nodest_conn=1 Andrew Sy Kim
@ 2020-05-18 19:10   ` Julian Anastasov
  2020-05-18 19:54     ` Andrew Kim
  0 siblings, 1 reply; 31+ messages in thread
From: Julian Anastasov @ 2020-05-18 19:10 UTC (permalink / raw)
  To: Andrew Sy Kim
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Wensong Zhang, Simon Horman, Jakub Kicinski, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, netdev, lvs-devel,
	linux-kernel, netfilter-devel, coreteam
	Hello,
On Sun, 17 May 2020, Andrew Sy Kim wrote:
> If expire_nodest_conn=1 and a UDP destination is deleted, IPVS should
> also expire all matching connections immiediately instead of waiting for
> the next matching packet. This is particulary useful when there are a
> lot of packets coming from a few number of clients. Those clients are
> likely to match against existing entries if a source port in the
> connection hash is reused. When the number of entries in the connection
> tracker is large, we can significantly reduce the number of dropped
> packets by expiring all connections upon deletion.
> 
> Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> ---
>  include/net/ip_vs.h             |  7 ++++++
>  net/netfilter/ipvs/ip_vs_conn.c | 38 +++++++++++++++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_core.c |  5 -----
>  net/netfilter/ipvs/ip_vs_ctl.c  |  9 ++++++++
>  4 files changed, 54 insertions(+), 5 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index 02f2f636798d..c69dfbbc3416 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -1366,6 +1366,44 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
>  		goto flush_again;
>  	}
>  }
> +
> +/*	Flush all the connection entries in the ip_vs_conn_tab with a
> + *	matching destination.
> + */
> +void ip_vs_conn_flush_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest)
> +{
> +	int idx;
> +	struct ip_vs_conn *cp, *cp_c;
> +
> +	rcu_read_lock();
> +	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> +		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
> +			if (cp->ipvs != ipvs)
> +				continue;
> +
> +			if (cp->dest != dest)
> +				continue;
> +
> +			/* As timers are expired in LIFO order, restart
> +			 * the timer of controlling connection first, so
> +			 * that it is expired after us.
> +			 */
> +			cp_c = cp->control;
> +			/* cp->control is valid only with reference to cp */
> +			if (cp_c && __ip_vs_conn_get(cp)) {
> +				IP_VS_DBG(4, "del controlling connection\n");
> +				ip_vs_conn_expire_now(cp_c);
> +				__ip_vs_conn_put(cp);
> +			}
> +			IP_VS_DBG(4, "del connection\n");
> +			ip_vs_conn_expire_now(cp);
> +		}
> +		cond_resched_rcu();
	Such kind of loop is correct if done in another context:
1. kthread
or
2. delayed work: mod_delayed_work(system_long_wq, ...)
	Otherwise cond_resched_rcu() can schedule() while holding
__ip_vs_mutex. Also, it will add long delay if many dests are
removed.
	If such loop analyzes instead all cp->dest for 
IP_VS_DEST_F_AVAILABLE, it should be done after calling
__ip_vs_conn_get().
>  static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 8d14a1acbc37..f87c03622874 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -1225,6 +1225,15 @@ ip_vs_del_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
>  	 */
>  	__ip_vs_del_dest(svc->ipvs, dest, false);
>  
> +	/*	If expire_nodest_conn is enabled and protocol is UDP,
> +	 *	attempt best effort flush of all connections with this
> +	 *	destination.
> +	 */
> +	if (sysctl_expire_nodest_conn(svc->ipvs) &&
> +	    dest->protocol == IPPROTO_UDP) {
> +		ip_vs_conn_flush_dest(svc->ipvs, dest);
	Above work should be scheduled from __ip_vs_del_dest().
Check for UDP is not needed, sysctl_expire_nodest_conn() is for
all protocols.
	If the flushing is complex to implement, we can still allow
rescheduling for unavailable dests:
- first we should move this block above the ip_vs_try_to_schedule() 
block because:
	1. the scheduling does not return unavailabel dests, even
	for persistence, so no need to check new connections for
	the flag
	2. it will allow to create new connection if dest for
	existing connection is unavailable
	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
		/* the destination server is not available */
		if (sysctl_expire_nodest_conn(ipvs)) {
			bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
			ip_vs_conn_expire_now(cp);
			__ip_vs_conn_put(cp);
			if (uses_ct)
				return NF_DROP;
			cp = NULL;
		} else {
			__ip_vs_conn_put(cp);
			return NF_DROP;
		}
	}
	if (unlikely(!cp)) {
		int v;
		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
			return v;
	}
	Before now, we always waited one jiffie connection to expire,
now one packet will:
- schedule expiration for existing connection with unavailable dest,
as before
- create new connection to available destination that will be found
first in lists. But it can work only when sysctl var "conntrack" is 0,
we do not want to create two netfilter conntracks to different
real servers.
	Note that we intentionally removed the timer_pending() check
because we can not see existing ONE_PACKET connections in table.
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: immediately expire UDP connections matching unavailable destination if expire_nodest_conn=1
  2020-05-18 19:10   ` Julian Anastasov
@ 2020-05-18 19:54     ` Andrew Kim
  2020-05-19 11:46       ` Marco Angaroni
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Kim @ 2020-05-18 19:54 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Wensong Zhang, Simon Horman, Jakub Kicinski, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, open list:IPVS,
	open list:IPVS, open list, open list:NETFILTER,
	open list:NETFILTER
Hi Julian,
Thank you for getting back to me. I will update the patch based on
your feedback shortly.
Regards,
Andrew
On Mon, May 18, 2020 at 3:10 PM Julian Anastasov <ja@ssi.bg> wrote:
>
>
>         Hello,
>
> On Sun, 17 May 2020, Andrew Sy Kim wrote:
>
> > If expire_nodest_conn=1 and a UDP destination is deleted, IPVS should
> > also expire all matching connections immiediately instead of waiting for
> > the next matching packet. This is particulary useful when there are a
> > lot of packets coming from a few number of clients. Those clients are
> > likely to match against existing entries if a source port in the
> > connection hash is reused. When the number of entries in the connection
> > tracker is large, we can significantly reduce the number of dropped
> > packets by expiring all connections upon deletion.
> >
> > Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> > ---
> >  include/net/ip_vs.h             |  7 ++++++
> >  net/netfilter/ipvs/ip_vs_conn.c | 38 +++++++++++++++++++++++++++++++++
> >  net/netfilter/ipvs/ip_vs_core.c |  5 -----
> >  net/netfilter/ipvs/ip_vs_ctl.c  |  9 ++++++++
> >  4 files changed, 54 insertions(+), 5 deletions(-)
> >
>
> > diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> > index 02f2f636798d..c69dfbbc3416 100644
> > --- a/net/netfilter/ipvs/ip_vs_conn.c
> > +++ b/net/netfilter/ipvs/ip_vs_conn.c
> > @@ -1366,6 +1366,44 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
> >               goto flush_again;
> >       }
> >  }
> > +
> > +/*   Flush all the connection entries in the ip_vs_conn_tab with a
> > + *   matching destination.
> > + */
> > +void ip_vs_conn_flush_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest)
> > +{
> > +     int idx;
> > +     struct ip_vs_conn *cp, *cp_c;
> > +
> > +     rcu_read_lock();
> > +     for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> > +             hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
> > +                     if (cp->ipvs != ipvs)
> > +                             continue;
> > +
> > +                     if (cp->dest != dest)
> > +                             continue;
> > +
> > +                     /* As timers are expired in LIFO order, restart
> > +                      * the timer of controlling connection first, so
> > +                      * that it is expired after us.
> > +                      */
> > +                     cp_c = cp->control;
> > +                     /* cp->control is valid only with reference to cp */
> > +                     if (cp_c && __ip_vs_conn_get(cp)) {
> > +                             IP_VS_DBG(4, "del controlling connection\n");
> > +                             ip_vs_conn_expire_now(cp_c);
> > +                             __ip_vs_conn_put(cp);
> > +                     }
> > +                     IP_VS_DBG(4, "del connection\n");
> > +                     ip_vs_conn_expire_now(cp);
> > +             }
> > +             cond_resched_rcu();
>
>         Such kind of loop is correct if done in another context:
>
> 1. kthread
> or
> 2. delayed work: mod_delayed_work(system_long_wq, ...)
>
>         Otherwise cond_resched_rcu() can schedule() while holding
> __ip_vs_mutex. Also, it will add long delay if many dests are
> removed.
>
>         If such loop analyzes instead all cp->dest for
> IP_VS_DEST_F_AVAILABLE, it should be done after calling
> __ip_vs_conn_get().
>
> >  static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
> > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> > index 8d14a1acbc37..f87c03622874 100644
> > --- a/net/netfilter/ipvs/ip_vs_ctl.c
> > +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> > @@ -1225,6 +1225,15 @@ ip_vs_del_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
> >        */
> >       __ip_vs_del_dest(svc->ipvs, dest, false);
> >
> > +     /*      If expire_nodest_conn is enabled and protocol is UDP,
> > +      *      attempt best effort flush of all connections with this
> > +      *      destination.
> > +      */
> > +     if (sysctl_expire_nodest_conn(svc->ipvs) &&
> > +         dest->protocol == IPPROTO_UDP) {
> > +             ip_vs_conn_flush_dest(svc->ipvs, dest);
>
>         Above work should be scheduled from __ip_vs_del_dest().
> Check for UDP is not needed, sysctl_expire_nodest_conn() is for
> all protocols.
>
>         If the flushing is complex to implement, we can still allow
> rescheduling for unavailable dests:
>
> - first we should move this block above the ip_vs_try_to_schedule()
> block because:
>
>         1. the scheduling does not return unavailabel dests, even
>         for persistence, so no need to check new connections for
>         the flag
>
>         2. it will allow to create new connection if dest for
>         existing connection is unavailable
>
>         if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
>                 /* the destination server is not available */
>
>                 if (sysctl_expire_nodest_conn(ipvs)) {
>                         bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
>
>                         ip_vs_conn_expire_now(cp);
>                         __ip_vs_conn_put(cp);
>                         if (uses_ct)
>                                 return NF_DROP;
>                         cp = NULL;
>                 } else {
>                         __ip_vs_conn_put(cp);
>                         return NF_DROP;
>                 }
>         }
>
>         if (unlikely(!cp)) {
>                 int v;
>
>                 if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
>                         return v;
>         }
>
>         Before now, we always waited one jiffie connection to expire,
> now one packet will:
>
> - schedule expiration for existing connection with unavailable dest,
> as before
>
> - create new connection to available destination that will be found
> first in lists. But it can work only when sysctl var "conntrack" is 0,
> we do not want to create two netfilter conntracks to different
> real servers.
>
>         Note that we intentionally removed the timer_pending() check
> because we can not see existing ONE_PACKET connections in table.
>
> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: immediately expire UDP connections matching unavailable destination if expire_nodest_conn=1
  2020-05-18 19:54     ` Andrew Kim
@ 2020-05-19 11:46       ` Marco Angaroni
  2020-05-19 14:18         ` Andrew Kim
  2020-05-19 19:46         ` Julian Anastasov
  0 siblings, 2 replies; 31+ messages in thread
From: Marco Angaroni @ 2020-05-19 11:46 UTC (permalink / raw)
  To: Andrew Kim
  Cc: Julian Anastasov, David S. Miller, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Wensong Zhang, Simon Horman, Jakub Kicinski,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	open list:IPVS, open list:IPVS, open list, open list:NETFILTER,
	open list:NETFILTER
Hi Andrew, Julian,
could you please confirm if/how this patch is changing any of the
following behaviours, which I’m listing below as per my understanding
?
When expire_nodest is set and real-server is unavailable, at the
moment the following happens to a packet going through IPVS:
a) TCP (or other connection-oriented protocols):
   the packet is silently dropped, then the following retransmission
causes the generation of a RST from the load-balancer to the client,
which will then re-open a new TCP connection
b) UDP:
   the packet is silently dropped, then the following retransmission
is rescheduled to a new real-server
c) UDP in OPS mode:
   the packet is rescheduled to a new real-server, as no previous
connection exists in IPVS connection table, and a new OPS connection
is created (but it lasts only the time to transmit the packet)
d) UDP in OPS mode + persistent-template:
   the packet is rescheduled to a new real-server, as previous
template-connection is invalidated, a new template-connection is
created, and a new OPS connection is created (but it lasts only the
time to transmit the packet)
It seems to me that you are trying to optimize case a) and b),
avoiding the first step where the packet is silently dropped and
consequently avoiding the retransmission.
And contextually expire also all the other connections pointing to the
unavailable real-sever.
However I'm confused about the references to OPS mode.
And why you need to expire all the connections at once: if you expire
on a per connection basis, the client experiences the same behaviour
(no more re-transmissions), but you avoid the complexities of a new
thread.
Maybe also the documentation of expire_nodest_conn sysctl should be updated.
When it's stated:
        If this feature is enabled, the load balancer will expire the
        connection immediately when a packet arrives and its
        destination server is not available, then the client program
        will be notified that the connection is closed
I think it should be at least "and the client program" instead of
"then the client program".
Or a more detailed explanation.
Thanks
Marco Angaroni
Il giorno lun 18 mag 2020 alle ore 22:06 Andrew Kim
<kim.andrewsy@gmail.com> ha scritto:
>
> Hi Julian,
>
> Thank you for getting back to me. I will update the patch based on
> your feedback shortly.
>
> Regards,
>
> Andrew
>
> On Mon, May 18, 2020 at 3:10 PM Julian Anastasov <ja@ssi.bg> wrote:
> >
> >
> >         Hello,
> >
> > On Sun, 17 May 2020, Andrew Sy Kim wrote:
> >
> > > If expire_nodest_conn=1 and a UDP destination is deleted, IPVS should
> > > also expire all matching connections immiediately instead of waiting for
> > > the next matching packet. This is particulary useful when there are a
> > > lot of packets coming from a few number of clients. Those clients are
> > > likely to match against existing entries if a source port in the
> > > connection hash is reused. When the number of entries in the connection
> > > tracker is large, we can significantly reduce the number of dropped
> > > packets by expiring all connections upon deletion.
> > >
> > > Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> > > ---
> > >  include/net/ip_vs.h             |  7 ++++++
> > >  net/netfilter/ipvs/ip_vs_conn.c | 38 +++++++++++++++++++++++++++++++++
> > >  net/netfilter/ipvs/ip_vs_core.c |  5 -----
> > >  net/netfilter/ipvs/ip_vs_ctl.c  |  9 ++++++++
> > >  4 files changed, 54 insertions(+), 5 deletions(-)
> > >
> >
> > > diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> > > index 02f2f636798d..c69dfbbc3416 100644
> > > --- a/net/netfilter/ipvs/ip_vs_conn.c
> > > +++ b/net/netfilter/ipvs/ip_vs_conn.c
> > > @@ -1366,6 +1366,44 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
> > >               goto flush_again;
> > >       }
> > >  }
> > > +
> > > +/*   Flush all the connection entries in the ip_vs_conn_tab with a
> > > + *   matching destination.
> > > + */
> > > +void ip_vs_conn_flush_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest)
> > > +{
> > > +     int idx;
> > > +     struct ip_vs_conn *cp, *cp_c;
> > > +
> > > +     rcu_read_lock();
> > > +     for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> > > +             hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
> > > +                     if (cp->ipvs != ipvs)
> > > +                             continue;
> > > +
> > > +                     if (cp->dest != dest)
> > > +                             continue;
> > > +
> > > +                     /* As timers are expired in LIFO order, restart
> > > +                      * the timer of controlling connection first, so
> > > +                      * that it is expired after us.
> > > +                      */
> > > +                     cp_c = cp->control;
> > > +                     /* cp->control is valid only with reference to cp */
> > > +                     if (cp_c && __ip_vs_conn_get(cp)) {
> > > +                             IP_VS_DBG(4, "del controlling connection\n");
> > > +                             ip_vs_conn_expire_now(cp_c);
> > > +                             __ip_vs_conn_put(cp);
> > > +                     }
> > > +                     IP_VS_DBG(4, "del connection\n");
> > > +                     ip_vs_conn_expire_now(cp);
> > > +             }
> > > +             cond_resched_rcu();
> >
> >         Such kind of loop is correct if done in another context:
> >
> > 1. kthread
> > or
> > 2. delayed work: mod_delayed_work(system_long_wq, ...)
> >
> >         Otherwise cond_resched_rcu() can schedule() while holding
> > __ip_vs_mutex. Also, it will add long delay if many dests are
> > removed.
> >
> >         If such loop analyzes instead all cp->dest for
> > IP_VS_DEST_F_AVAILABLE, it should be done after calling
> > __ip_vs_conn_get().
> >
> > >  static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
> > > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> > > index 8d14a1acbc37..f87c03622874 100644
> > > --- a/net/netfilter/ipvs/ip_vs_ctl.c
> > > +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> > > @@ -1225,6 +1225,15 @@ ip_vs_del_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
> > >        */
> > >       __ip_vs_del_dest(svc->ipvs, dest, false);
> > >
> > > +     /*      If expire_nodest_conn is enabled and protocol is UDP,
> > > +      *      attempt best effort flush of all connections with this
> > > +      *      destination.
> > > +      */
> > > +     if (sysctl_expire_nodest_conn(svc->ipvs) &&
> > > +         dest->protocol == IPPROTO_UDP) {
> > > +             ip_vs_conn_flush_dest(svc->ipvs, dest);
> >
> >         Above work should be scheduled from __ip_vs_del_dest().
> > Check for UDP is not needed, sysctl_expire_nodest_conn() is for
> > all protocols.
> >
> >         If the flushing is complex to implement, we can still allow
> > rescheduling for unavailable dests:
> >
> > - first we should move this block above the ip_vs_try_to_schedule()
> > block because:
> >
> >         1. the scheduling does not return unavailabel dests, even
> >         for persistence, so no need to check new connections for
> >         the flag
> >
> >         2. it will allow to create new connection if dest for
> >         existing connection is unavailable
> >
> >         if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
> >                 /* the destination server is not available */
> >
> >                 if (sysctl_expire_nodest_conn(ipvs)) {
> >                         bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
> >
> >                         ip_vs_conn_expire_now(cp);
> >                         __ip_vs_conn_put(cp);
> >                         if (uses_ct)
> >                                 return NF_DROP;
> >                         cp = NULL;
> >                 } else {
> >                         __ip_vs_conn_put(cp);
> >                         return NF_DROP;
> >                 }
> >         }
> >
> >         if (unlikely(!cp)) {
> >                 int v;
> >
> >                 if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> >                         return v;
> >         }
> >
> >         Before now, we always waited one jiffie connection to expire,
> > now one packet will:
> >
> > - schedule expiration for existing connection with unavailable dest,
> > as before
> >
> > - create new connection to available destination that will be found
> > first in lists. But it can work only when sysctl var "conntrack" is 0,
> > we do not want to create two netfilter conntracks to different
> > real servers.
> >
> >         Note that we intentionally removed the timer_pending() check
> > because we can not see existing ONE_PACKET connections in table.
> >
> > Regards
> >
> > --
> > Julian Anastasov <ja@ssi.bg>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: immediately expire UDP connections matching unavailable destination if expire_nodest_conn=1
  2020-05-19 11:46       ` Marco Angaroni
@ 2020-05-19 14:18         ` Andrew Kim
  2020-05-19 19:46         ` Julian Anastasov
  1 sibling, 0 replies; 31+ messages in thread
From: Andrew Kim @ 2020-05-19 14:18 UTC (permalink / raw)
  To: Marco Angaroni
  Cc: Julian Anastasov, David S. Miller, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, Wensong Zhang, Simon Horman, Jakub Kicinski,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	open list:IPVS, open list:IPVS, open list, open list:NETFILTER,
	open list:NETFILTER
Hi Marco,
> could you please confirm if/how this patch is changing any of the?
> following behaviours, which I’m listing below as per my understanding?
This patch would optimize b) where the chances of packets being silently
dropped is lower.
> However I'm confused about the references to OPS mode.
> And why you need to expire all the connections at once: if you expire
> on a per connection basis, the client experiences the same behaviour
> (no more re-transmissions), but you avoid the complexities of a new
> thread.
I agree for TCP the client would experience the same behavior. This is mostly
problematic for UDP when there are many entries in the connection hash
from the same client to 1 destination. If we only expire connections
on receiving
packets, there can be many dropped packets before all entries are expired (or
we reached the UDP timeout). Expiring all UDP packets immediately would
reduce the number of packets dropped significantly.
Thanks,
Andrew Sy Kim
On Tue, May 19, 2020 at 7:46 AM Marco Angaroni <marcoangaroni@gmail.com> wrote:
>
> Hi Andrew, Julian,
>
> could you please confirm if/how this patch is changing any of the
> following behaviours, which I’m listing below as per my understanding
> ?
>
> When expire_nodest is set and real-server is unavailable, at the
> moment the following happens to a packet going through IPVS:
>
> a) TCP (or other connection-oriented protocols):
>    the packet is silently dropped, then the following retransmission
> causes the generation of a RST from the load-balancer to the client,
> which will then re-open a new TCP connection
> b) UDP:
>    the packet is silently dropped, then the following retransmission
> is rescheduled to a new real-server
> c) UDP in OPS mode:
>    the packet is rescheduled to a new real-server, as no previous
> connection exists in IPVS connection table, and a new OPS connection
> is created (but it lasts only the time to transmit the packet)
> d) UDP in OPS mode + persistent-template:
>    the packet is rescheduled to a new real-server, as previous
> template-connection is invalidated, a new template-connection is
> created, and a new OPS connection is created (but it lasts only the
> time to transmit the packet)
>
> It seems to me that you are trying to optimize case a) and b),
> avoiding the first step where the packet is silently dropped and
> consequently avoiding the retransmission.
> And contextually expire also all the other connections pointing to the
> unavailable real-sever.
>
> However I'm confused about the references to OPS mode.
> And why you need to expire all the connections at once: if you expire
> on a per connection basis, the client experiences the same behaviour
> (no more re-transmissions), but you avoid the complexities of a new
> thread.
>
> Maybe also the documentation of expire_nodest_conn sysctl should be updated.
> When it's stated:
>
>         If this feature is enabled, the load balancer will expire the
>         connection immediately when a packet arrives and its
>         destination server is not available, then the client program
>         will be notified that the connection is closed
>
> I think it should be at least "and the client program" instead of
> "then the client program".
> Or a more detailed explanation.
>
> Thanks
> Marco Angaroni
>
>
> Il giorno lun 18 mag 2020 alle ore 22:06 Andrew Kim
> <kim.andrewsy@gmail.com> ha scritto:
> >
> > Hi Julian,
> >
> > Thank you for getting back to me. I will update the patch based on
> > your feedback shortly.
> >
> > Regards,
> >
> > Andrew
> >
> > On Mon, May 18, 2020 at 3:10 PM Julian Anastasov <ja@ssi.bg> wrote:
> > >
> > >
> > >         Hello,
> > >
> > > On Sun, 17 May 2020, Andrew Sy Kim wrote:
> > >
> > > > If expire_nodest_conn=1 and a UDP destination is deleted, IPVS should
> > > > also expire all matching connections immiediately instead of waiting for
> > > > the next matching packet. This is particulary useful when there are a
> > > > lot of packets coming from a few number of clients. Those clients are
> > > > likely to match against existing entries if a source port in the
> > > > connection hash is reused. When the number of entries in the connection
> > > > tracker is large, we can significantly reduce the number of dropped
> > > > packets by expiring all connections upon deletion.
> > > >
> > > > Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> > > > ---
> > > >  include/net/ip_vs.h             |  7 ++++++
> > > >  net/netfilter/ipvs/ip_vs_conn.c | 38 +++++++++++++++++++++++++++++++++
> > > >  net/netfilter/ipvs/ip_vs_core.c |  5 -----
> > > >  net/netfilter/ipvs/ip_vs_ctl.c  |  9 ++++++++
> > > >  4 files changed, 54 insertions(+), 5 deletions(-)
> > > >
> > >
> > > > diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> > > > index 02f2f636798d..c69dfbbc3416 100644
> > > > --- a/net/netfilter/ipvs/ip_vs_conn.c
> > > > +++ b/net/netfilter/ipvs/ip_vs_conn.c
> > > > @@ -1366,6 +1366,44 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
> > > >               goto flush_again;
> > > >       }
> > > >  }
> > > > +
> > > > +/*   Flush all the connection entries in the ip_vs_conn_tab with a
> > > > + *   matching destination.
> > > > + */
> > > > +void ip_vs_conn_flush_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest)
> > > > +{
> > > > +     int idx;
> > > > +     struct ip_vs_conn *cp, *cp_c;
> > > > +
> > > > +     rcu_read_lock();
> > > > +     for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> > > > +             hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
> > > > +                     if (cp->ipvs != ipvs)
> > > > +                             continue;
> > > > +
> > > > +                     if (cp->dest != dest)
> > > > +                             continue;
> > > > +
> > > > +                     /* As timers are expired in LIFO order, restart
> > > > +                      * the timer of controlling connection first, so
> > > > +                      * that it is expired after us.
> > > > +                      */
> > > > +                     cp_c = cp->control;
> > > > +                     /* cp->control is valid only with reference to cp */
> > > > +                     if (cp_c && __ip_vs_conn_get(cp)) {
> > > > +                             IP_VS_DBG(4, "del controlling connection\n");
> > > > +                             ip_vs_conn_expire_now(cp_c);
> > > > +                             __ip_vs_conn_put(cp);
> > > > +                     }
> > > > +                     IP_VS_DBG(4, "del connection\n");
> > > > +                     ip_vs_conn_expire_now(cp);
> > > > +             }
> > > > +             cond_resched_rcu();
> > >
> > >         Such kind of loop is correct if done in another context:
> > >
> > > 1. kthread
> > > or
> > > 2. delayed work: mod_delayed_work(system_long_wq, ...)
> > >
> > >         Otherwise cond_resched_rcu() can schedule() while holding
> > > __ip_vs_mutex. Also, it will add long delay if many dests are
> > > removed.
> > >
> > >         If such loop analyzes instead all cp->dest for
> > > IP_VS_DEST_F_AVAILABLE, it should be done after calling
> > > __ip_vs_conn_get().
> > >
> > > >  static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
> > > > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> > > > index 8d14a1acbc37..f87c03622874 100644
> > > > --- a/net/netfilter/ipvs/ip_vs_ctl.c
> > > > +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> > > > @@ -1225,6 +1225,15 @@ ip_vs_del_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
> > > >        */
> > > >       __ip_vs_del_dest(svc->ipvs, dest, false);
> > > >
> > > > +     /*      If expire_nodest_conn is enabled and protocol is UDP,
> > > > +      *      attempt best effort flush of all connections with this
> > > > +      *      destination.
> > > > +      */
> > > > +     if (sysctl_expire_nodest_conn(svc->ipvs) &&
> > > > +         dest->protocol == IPPROTO_UDP) {
> > > > +             ip_vs_conn_flush_dest(svc->ipvs, dest);
> > >
> > >         Above work should be scheduled from __ip_vs_del_dest().
> > > Check for UDP is not needed, sysctl_expire_nodest_conn() is for
> > > all protocols.
> > >
> > >         If the flushing is complex to implement, we can still allow
> > > rescheduling for unavailable dests:
> > >
> > > - first we should move this block above the ip_vs_try_to_schedule()
> > > block because:
> > >
> > >         1. the scheduling does not return unavailabel dests, even
> > >         for persistence, so no need to check new connections for
> > >         the flag
> > >
> > >         2. it will allow to create new connection if dest for
> > >         existing connection is unavailable
> > >
> > >         if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
> > >                 /* the destination server is not available */
> > >
> > >                 if (sysctl_expire_nodest_conn(ipvs)) {
> > >                         bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
> > >
> > >                         ip_vs_conn_expire_now(cp);
> > >                         __ip_vs_conn_put(cp);
> > >                         if (uses_ct)
> > >                                 return NF_DROP;
> > >                         cp = NULL;
> > >                 } else {
> > >                         __ip_vs_conn_put(cp);
> > >                         return NF_DROP;
> > >                 }
> > >         }
> > >
> > >         if (unlikely(!cp)) {
> > >                 int v;
> > >
> > >                 if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> > >                         return v;
> > >         }
> > >
> > >         Before now, we always waited one jiffie connection to expire,
> > > now one packet will:
> > >
> > > - schedule expiration for existing connection with unavailable dest,
> > > as before
> > >
> > > - create new connection to available destination that will be found
> > > first in lists. But it can work only when sysctl var "conntrack" is 0,
> > > we do not want to create two netfilter conntracks to different
> > > real servers.
> > >
> > >         Note that we intentionally removed the timer_pending() check
> > > because we can not see existing ONE_PACKET connections in table.
> > >
> > > Regards
> > >
> > > --
> > > Julian Anastasov <ja@ssi.bg>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: immediately expire UDP connections matching unavailable destination if expire_nodest_conn=1
  2020-05-19 11:46       ` Marco Angaroni
  2020-05-19 14:18         ` Andrew Kim
@ 2020-05-19 19:46         ` Julian Anastasov
  2020-05-24 21:31           ` [PATCH] netfilter/ipvs: immediately expire no destination connections in kthread " Andrew Sy Kim
  1 sibling, 1 reply; 31+ messages in thread
From: Julian Anastasov @ 2020-05-19 19:46 UTC (permalink / raw)
  To: Marco Angaroni
  Cc: Andrew Kim, David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Wensong Zhang, Simon Horman, Jakub Kicinski, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, open list:IPVS,
	open list:IPVS, open list, open list:NETFILTER,
	open list:NETFILTER
[-- Attachment #1: Type: text/plain, Size: 3860 bytes --]
	Hello,
On Tue, 19 May 2020, Marco Angaroni wrote:
> Hi Andrew, Julian,
> 
> could you please confirm if/how this patch is changing any of the
> following behaviours, which I’m listing below as per my understanding
> ?
> 
> When expire_nodest is set and real-server is unavailable, at the
> moment the following happens to a packet going through IPVS:
> 
> a) TCP (or other connection-oriented protocols):
>    the packet is silently dropped, then the following retransmission
> causes the generation of a RST from the load-balancer to the client,
> which will then re-open a new TCP connection
	Yes. It seems we can not create new connection in
all cases, we should also check with is_new_conn().
	What we have is that two cases are possible depending on 
conn_reuse_mode, the state of existing connection and whether
netfilter conntrack is used:
	1. setup expire for old conn, then drop packet
	2. setup expire for old conn, then create new
	conn to schedule the packet
	When expiration is set, the timer will fire in the
next jiffie to remove the connection from hash table. Until
removed, the connection still can cause drops. Sometimes
we can simply create new connection with the same tuple,
so it is possible both connections to coexist for one jiffie
but the old connection is not reached on lookup.
> b) UDP:
>    the packet is silently dropped, then the following retransmission
> is rescheduled to a new real-server
	Yes, we drop while old conn is not expired yet
> c) UDP in OPS mode:
>    the packet is rescheduled to a new real-server, as no previous
> connection exists in IPVS connection table, and a new OPS connection
> is created (but it lasts only the time to transmit the packet)
	Yes, OPS is not affected.
> d) UDP in OPS mode + persistent-template:
>    the packet is rescheduled to a new real-server, as previous
> template-connection is invalidated, a new template-connection is
> created, and a new OPS connection is created (but it lasts only the
> time to transmit the packet)
	Yes, the existing template is ignored when its server
is unavailable.
> It seems to me that you are trying to optimize case a) and b),
> avoiding the first step where the packet is silently dropped and
> consequently avoiding the retransmission.
> And contextually expire also all the other connections pointing to the
> unavailable real-sever.
	The change will allow immediate scheduling in a new
connection for any protocol when netfilter conntrack is not
used:
- TCP: avoids retransmission for SYN
- UDP: reduces drops from 1 jiffie to 0 (no drops)
	But this single jiffie compared to the delay between
real server failure and the removal from the IPVS table can be
negligible. Of course, if real server is removed while it is
working, with this change we should not see any UDP drops.
> However I'm confused about the references to OPS mode.
> And why you need to expire all the connections at once: if you expire
> on a per connection basis, the client experiences the same behaviour
> (no more re-transmissions), but you avoid the complexities of a new
> thread.
	Such flushing can help when conntrack is used in which
case the cost is a retransmission or downtime for one jiffie.
> Maybe also the documentation of expire_nodest_conn sysctl should be updated.
> When it's stated:
> 
>         If this feature is enabled, the load balancer will expire the
>         connection immediately when a packet arrives and its
>         destination server is not available, then the client program
>         will be notified that the connection is closed
> 
> I think it should be at least "and the client program" instead of
> "then the client program".
> Or a more detailed explanation.
	Yes, if the packet is SYN we can create new connection.
If it is ACK, the retransmission will get RST.
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCH] netfilter/ipvs: immediately expire no destination connections in kthread if expire_nodest_conn=1
  2020-05-19 19:46         ` Julian Anastasov
@ 2020-05-24 21:31           ` Andrew Sy Kim
  2020-05-26 21:24             ` Julian Anastasov
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Sy Kim @ 2020-05-24 21:31 UTC (permalink / raw)
  Cc: kim.andrewsy, Wensong Zhang, Simon Horman, Julian Anastasov,
	David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Jakub Kicinski, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, netdev, lvs-devel, linux-kernel,
	netfilter-devel, coreteam
If expire_nodest_conn=1 and a destination is deleted, IPVS should
also expire all matching connections immiediately instead of waiting for
the next matching packet. This is particulary useful when there are a
lot of packets coming from a few number of clients. Those clients are
likely to match against existing entries if a source port in the
connection hash is reused. When the number of entries in the connection
tracker is large, we can significantly reduce the number of dropped
packets by expiring all connections upon deletion in a kthread.
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
---
 include/net/ip_vs.h             | 12 +++++++++
 net/netfilter/ipvs/ip_vs_conn.c | 48 +++++++++++++++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_core.c | 45 +++++++++++++------------------
 net/netfilter/ipvs/ip_vs_ctl.c  | 16 +++++++++++
 4 files changed, 95 insertions(+), 26 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 83be2d93b407..d1fbead6965d 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1049,6 +1049,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_conn_reuse_mode;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_expire_nodest_conn;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return ipvs->sysctl_schedule_icmp;
@@ -1209,6 +1214,13 @@ struct ip_vs_conn * ip_vs_conn_out_get_proto(struct netns_ipvs *ipvs, int af,
 					     const struct sk_buff *skb,
 					     const struct ip_vs_iphdr *iph);
 
+
+int ip_vs_conn_flush_dest(void *data);
+struct ip_vs_conn_flush_dest_tinfo {
+	struct netns_ipvs *ipvs;
+	struct ip_vs_dest *dest;
+};
+
 /* Get reference to gain full access to conn.
  * By default, RCU read-side critical sections have access only to
  * conn fields and its PE data, see ip_vs_conn_rcu_free() for reference.
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 02f2f636798d..111fa0e287a2 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -1366,6 +1366,54 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
 		goto flush_again;
 	}
 }
+
+/*	Thread function to flush all the connection entries in the
+ *	ip_vs_conn_tab with a matching destination.
+ */
+int ip_vs_conn_flush_dest(void *data)
+{
+	struct ip_vs_conn_flush_dest_tinfo *tinfo = data;
+	struct netns_ipvs *ipvs = tinfo->ipvs;
+	struct ip_vs_dest *dest = tinfo->dest;
+
+	int idx;
+	struct ip_vs_conn *cp, *cp_c;
+
+	IP_VS_DBG_BUF(4, "flushing all connections with destination %s:%d",
+		      IP_VS_DBG_ADDR(dest->af, &dest->addr), ntohs(dest->port));
+
+	rcu_read_lock();
+	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
+		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
+			if (cp->ipvs != ipvs)
+				continue;
+
+			if (cp->dest != dest)
+				continue;
+
+			/* As timers are expired in LIFO order, restart
+			 * the timer of controlling connection first, so
+			 * that it is expired after us.
+			 */
+			cp_c = cp->control;
+			/* cp->control is valid only with reference to cp */
+			if (cp_c && __ip_vs_conn_get(cp)) {
+				IP_VS_DBG(4, "del controlling connection\n");
+				ip_vs_conn_expire_now(cp_c);
+				__ip_vs_conn_put(cp);
+			}
+
+			IP_VS_DBG(4, "del connection\n");
+			ip_vs_conn_expire_now(cp);
+		}
+		cond_resched_rcu();
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ip_vs_conn_flush_dest);
+
 /*
  * per netns init and exit
  */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index aa6a603a2425..ff052e57e054 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -24,6 +24,7 @@
 
 #include <linux/module.h>
 #include <linux/kernel.h>
+#include <linux/kthread.h>
 #include <linux/ip.h>
 #include <linux/tcp.h>
 #include <linux/sctp.h>
@@ -694,11 +695,6 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_nat_icmp_send;
 }
 
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
-{
-	return ipvs->sysctl_expire_nodest_conn;
-}
-
 #else
 
 static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
@@ -2095,36 +2091,33 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
 		}
 	}
 
-	if (unlikely(!cp)) {
-		int v;
-
-		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
-			return v;
-	}
-
-	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
-
 	/* Check the server status */
-	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
+	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
 		/* the destination server is not available */
 
-		__u32 flags = cp->flags;
-
-		/* when timer already started, silently drop the packet.*/
-		if (timer_pending(&cp->timer))
-			__ip_vs_conn_put(cp);
-		else
-			ip_vs_conn_put(cp);
+		if (sysctl_expire_nodest_conn(ipvs)) {
+			bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
 
-		if (sysctl_expire_nodest_conn(ipvs) &&
-		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
-			/* try to expire the connection immediately */
 			ip_vs_conn_expire_now(cp);
+			__ip_vs_conn_put(cp);
+			if (uses_ct)
+				return NF_DROP;
+			cp = NULL;
+		} else {
+			__ip_vs_conn_put(cp);
+			return NF_DROP;
 		}
+	}
 
-		return NF_DROP;
+	if (unlikely(!cp)) {
+		int v;
+
+		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
+			return v;
 	}
 
+	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
+
 	ip_vs_in_stats(cp, skb);
 	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
 	if (cp->packet_xmit)
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 8d14a1acbc37..fa48268368fc 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1163,6 +1163,22 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
 	list_add(&dest->t_list, &ipvs->dest_trash);
 	dest->idle_start = 0;
 	spin_unlock_bh(&ipvs->dest_trash_lock);
+
+	/*	If expire_nodest_conn is enabled, expire all connections
+	 *	immediately in a kthread.
+	 */
+	if (sysctl_expire_nodest_conn(ipvs)) {
+		struct ip_vs_conn_flush_dest_tinfo *tinfo = NULL;
+
+		tinfo = kcalloc(1, sizeof(struct ip_vs_conn_flush_dest_tinfo),
+				GFP_KERNEL);
+		tinfo->ipvs = ipvs;
+		tinfo->dest = dest;
+
+		IP_VS_DBG(3, "flushing connections in kthread\n");
+		kthread_run(ip_vs_conn_flush_dest,
+			    tinfo, "ipvs-flush-dest-conn");
+	}
 }
 
 
-- 
2.20.1
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: immediately expire no destination connections in kthread if expire_nodest_conn=1
  2020-05-24 21:31           ` [PATCH] netfilter/ipvs: immediately expire no destination connections in kthread " Andrew Sy Kim
@ 2020-05-26 21:24             ` Julian Anastasov
  2020-05-26 21:47               ` Andrew Kim
  0 siblings, 1 reply; 31+ messages in thread
From: Julian Anastasov @ 2020-05-26 21:24 UTC (permalink / raw)
  To: Andrew Sy Kim; +Cc: Wensong Zhang, Simon Horman, lvs-devel, netfilter-devel
	Hello,
	Long CC list trimmed...
On Sun, 24 May 2020, Andrew Sy Kim wrote:
> If expire_nodest_conn=1 and a destination is deleted, IPVS should
> also expire all matching connections immiediately instead of waiting for
> the next matching packet. This is particulary useful when there are a
> lot of packets coming from a few number of clients. Those clients are
> likely to match against existing entries if a source port in the
> connection hash is reused. When the number of entries in the connection
> tracker is large, we can significantly reduce the number of dropped
> packets by expiring all connections upon deletion in a kthread.
> 
> Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> ---
>  include/net/ip_vs.h             | 12 +++++++++
>  net/netfilter/ipvs/ip_vs_conn.c | 48 +++++++++++++++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_core.c | 45 +++++++++++++------------------
>  net/netfilter/ipvs/ip_vs_ctl.c  | 16 +++++++++++
>  4 files changed, 95 insertions(+), 26 deletions(-)
> 
>  /* Get reference to gain full access to conn.
>   * By default, RCU read-side critical sections have access only to
>   * conn fields and its PE data, see ip_vs_conn_rcu_free() for reference.
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index 02f2f636798d..111fa0e287a2 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -1366,6 +1366,54 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
>  		goto flush_again;
>  	}
>  }
> +
> +/*	Thread function to flush all the connection entries in the
> + *	ip_vs_conn_tab with a matching destination.
> + */
> +int ip_vs_conn_flush_dest(void *data)
> +{
> +	struct ip_vs_conn_flush_dest_tinfo *tinfo = data;
> +	struct netns_ipvs *ipvs = tinfo->ipvs;
> +	struct ip_vs_dest *dest = tinfo->dest;
	Starting a kthread just for single dest can cause storms
when many dests are used. IMHO, we should work this way:
- do not use kthreads: they are hard to manage, start only from
process context (we can not retry from timer if creating a kthread
fails for some reason).
- use delayed_work, similar to our defense_work but this time
we should use queue_delayed_work(system_long_wq,...) instead of 
schedule_delayed_work(). Just cancel_delayed_work_sync() is needed
to stop the work. The callback function will not start the
work timer.
- we will use one work for the netns (struct netns_ipvs *ipvs):
__ip_vs_del_dest() will start it for next jiffie (delay=1) to
catch more dests for flusing. As result, the first destination
will start the work timer, other dests will do nothing while timer
is pending. When timer expires, the work is queued to worker,
so next dests will start the timer again, even while the work
is executing the callback function.
> +
> +	int idx;
> +	struct ip_vs_conn *cp, *cp_c;
> +
> +	IP_VS_DBG_BUF(4, "flushing all connections with destination %s:%d",
> +		      IP_VS_DBG_ADDR(dest->af, &dest->addr), ntohs(dest->port));
	We will not provide current dest. Still, above was not
safe because we do not hold reference to dest.
> +
> +	rcu_read_lock();
> +	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> +		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
> +			if (cp->ipvs != ipvs)
> +				continue;
> +
> +			if (cp->dest != dest)
> +				continue;
			struct ip_vs_dest *dest = cp->dest;
			if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
				continue;
			We can access dest because under RCU grace period
			we have access to the cp->dest fields. But we 
			should read cp->dest once as done above.
> +
> +			/* As timers are expired in LIFO order, restart
> +			 * the timer of controlling connection first, so
> +			 * that it is expired after us.
> +			 */
> +			cp_c = cp->control;
> +			/* cp->control is valid only with reference to cp */
> +			if (cp_c && __ip_vs_conn_get(cp)) {
> +				IP_VS_DBG(4, "del controlling connection\n");
> +				ip_vs_conn_expire_now(cp_c);
> +				__ip_vs_conn_put(cp);
> +			}
> +
> +			IP_VS_DBG(4, "del connection\n");
> +			ip_vs_conn_expire_now(cp);
> +		}
> +		cond_resched_rcu();
		if (!ipvs->enable)
			break;
		Abort immediately if netns cleanup is started.
> +	}
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(ip_vs_conn_flush_dest);
> +
>  /*
>   * per netns init and exit
>   */
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index aa6a603a2425..ff052e57e054 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -24,6 +24,7 @@
>  
>  #include <linux/module.h>
>  #include <linux/kernel.h>
> +#include <linux/kthread.h>
>  #include <linux/ip.h>
>  #include <linux/tcp.h>
>  #include <linux/sctp.h>
> @@ -694,11 +695,6 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
>  	return ipvs->sysctl_nat_icmp_send;
>  }
>  
> -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> -{
> -	return ipvs->sysctl_expire_nodest_conn;
> -}
> -
>  #else
>  
>  static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
> @@ -2095,36 +2091,33 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
>  		}
>  	}
>  
> -	if (unlikely(!cp)) {
> -		int v;
> -
> -		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> -			return v;
> -	}
> -
> -	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> -
>  	/* Check the server status */
> -	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
> +	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
>  		/* the destination server is not available */
>  
> -		__u32 flags = cp->flags;
> -
> -		/* when timer already started, silently drop the packet.*/
> -		if (timer_pending(&cp->timer))
> -			__ip_vs_conn_put(cp);
> -		else
> -			ip_vs_conn_put(cp);
> +		if (sysctl_expire_nodest_conn(ipvs)) {
> +			bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
>  
> -		if (sysctl_expire_nodest_conn(ipvs) &&
> -		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
> -			/* try to expire the connection immediately */
>  			ip_vs_conn_expire_now(cp);
> +			__ip_vs_conn_put(cp);
> +			if (uses_ct)
> +				return NF_DROP;
> +			cp = NULL;
> +		} else {
> +			__ip_vs_conn_put(cp);
> +			return NF_DROP;
>  		}
> +	}
>  
> -		return NF_DROP;
> +	if (unlikely(!cp)) {
> +		int v;
> +
> +		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> +			return v;
>  	}
>  
> +	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> +
	The above code in ip_vs_in() is correct.
>  	ip_vs_in_stats(cp, skb);
>  	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
>  	if (cp->packet_xmit)
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 8d14a1acbc37..fa48268368fc 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -1163,6 +1163,22 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
>  	list_add(&dest->t_list, &ipvs->dest_trash);
>  	dest->idle_start = 0;
>  	spin_unlock_bh(&ipvs->dest_trash_lock);
> +
> +	/*	If expire_nodest_conn is enabled, expire all connections
> +	 *	immediately in a kthread.
> +	 */
> +	if (sysctl_expire_nodest_conn(ipvs)) {
	Looks like we should not start work when 'cleanup' is true, it 
indicates that we are doing final release of all resources.
	if (sysctl_expire_nodest_conn(ipvs) && !cleanup)
		queue_delayed_work(system_long_wq, &ipvs->expire_nodest_work, 1);
> +		struct ip_vs_conn_flush_dest_tinfo *tinfo = NULL;
> +
> +		tinfo = kcalloc(1, sizeof(struct ip_vs_conn_flush_dest_tinfo),
> +				GFP_KERNEL);
> +		tinfo->ipvs = ipvs;
> +		tinfo->dest = dest;
> +
> +		IP_VS_DBG(3, "flushing connections in kthread\n");
> +		kthread_run(ip_vs_conn_flush_dest,
> +			    tinfo, "ipvs-flush-dest-conn");
> +	}
>  }
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: immediately expire no destination connections in kthread if expire_nodest_conn=1
  2020-05-26 21:24             ` Julian Anastasov
@ 2020-05-26 21:47               ` Andrew Kim
  2020-05-28  1:41                 ` [PATCH] netfilter/ipvs: queue delayed work to expire no destination connections " Andrew Sy Kim
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Kim @ 2020-05-26 21:47 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Wensong Zhang, Simon Horman, open list:IPVS, open list:NETFILTER
Hi Julian,
Thanks for getting back to me. This is my first patch to the kernel so
I really appreciate your patience reviewing it.
I will update the patch based on your comments shortly.
Regards,
Andrew Sy Kim
On Tue, May 26, 2020 at 5:25 PM Julian Anastasov <ja@ssi.bg> wrote:
>
>
>         Hello,
>
>         Long CC list trimmed...
>
> On Sun, 24 May 2020, Andrew Sy Kim wrote:
>
> > If expire_nodest_conn=1 and a destination is deleted, IPVS should
> > also expire all matching connections immiediately instead of waiting for
> > the next matching packet. This is particulary useful when there are a
> > lot of packets coming from a few number of clients. Those clients are
> > likely to match against existing entries if a source port in the
> > connection hash is reused. When the number of entries in the connection
> > tracker is large, we can significantly reduce the number of dropped
> > packets by expiring all connections upon deletion in a kthread.
> >
> > Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> > ---
> >  include/net/ip_vs.h             | 12 +++++++++
> >  net/netfilter/ipvs/ip_vs_conn.c | 48 +++++++++++++++++++++++++++++++++
> >  net/netfilter/ipvs/ip_vs_core.c | 45 +++++++++++++------------------
> >  net/netfilter/ipvs/ip_vs_ctl.c  | 16 +++++++++++
> >  4 files changed, 95 insertions(+), 26 deletions(-)
> >
>
> >  /* Get reference to gain full access to conn.
> >   * By default, RCU read-side critical sections have access only to
> >   * conn fields and its PE data, see ip_vs_conn_rcu_free() for reference.
> > diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> > index 02f2f636798d..111fa0e287a2 100644
> > --- a/net/netfilter/ipvs/ip_vs_conn.c
> > +++ b/net/netfilter/ipvs/ip_vs_conn.c
> > @@ -1366,6 +1366,54 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
> >               goto flush_again;
> >       }
> >  }
> > +
> > +/*   Thread function to flush all the connection entries in the
> > + *   ip_vs_conn_tab with a matching destination.
> > + */
> > +int ip_vs_conn_flush_dest(void *data)
> > +{
> > +     struct ip_vs_conn_flush_dest_tinfo *tinfo = data;
> > +     struct netns_ipvs *ipvs = tinfo->ipvs;
> > +     struct ip_vs_dest *dest = tinfo->dest;
>
>         Starting a kthread just for single dest can cause storms
> when many dests are used. IMHO, we should work this way:
>
> - do not use kthreads: they are hard to manage, start only from
> process context (we can not retry from timer if creating a kthread
> fails for some reason).
>
> - use delayed_work, similar to our defense_work but this time
> we should use queue_delayed_work(system_long_wq,...) instead of
> schedule_delayed_work(). Just cancel_delayed_work_sync() is needed
> to stop the work. The callback function will not start the
> work timer.
>
> - we will use one work for the netns (struct netns_ipvs *ipvs):
> __ip_vs_del_dest() will start it for next jiffie (delay=1) to
> catch more dests for flusing. As result, the first destination
> will start the work timer, other dests will do nothing while timer
> is pending. When timer expires, the work is queued to worker,
> so next dests will start the timer again, even while the work
> is executing the callback function.
>
> > +
> > +     int idx;
> > +     struct ip_vs_conn *cp, *cp_c;
> > +
> > +     IP_VS_DBG_BUF(4, "flushing all connections with destination %s:%d",
> > +                   IP_VS_DBG_ADDR(dest->af, &dest->addr), ntohs(dest->port));
>
>         We will not provide current dest. Still, above was not
> safe because we do not hold reference to dest.
>
> > +
> > +     rcu_read_lock();
> > +     for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> > +             hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
> > +                     if (cp->ipvs != ipvs)
> > +                             continue;
> > +
> > +                     if (cp->dest != dest)
> > +                             continue;
>
>                         struct ip_vs_dest *dest = cp->dest;
>
>                         if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
>                                 continue;
>
>                         We can access dest because under RCU grace period
>                         we have access to the cp->dest fields. But we
>                         should read cp->dest once as done above.
>
> > +
> > +                     /* As timers are expired in LIFO order, restart
> > +                      * the timer of controlling connection first, so
> > +                      * that it is expired after us.
> > +                      */
> > +                     cp_c = cp->control;
> > +                     /* cp->control is valid only with reference to cp */
> > +                     if (cp_c && __ip_vs_conn_get(cp)) {
> > +                             IP_VS_DBG(4, "del controlling connection\n");
> > +                             ip_vs_conn_expire_now(cp_c);
> > +                             __ip_vs_conn_put(cp);
> > +                     }
> > +
> > +                     IP_VS_DBG(4, "del connection\n");
> > +                     ip_vs_conn_expire_now(cp);
> > +             }
> > +             cond_resched_rcu();
>
>                 if (!ipvs->enable)
>                         break;
>
>                 Abort immediately if netns cleanup is started.
>
> > +     }
> > +     rcu_read_unlock();
> > +
> > +     return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(ip_vs_conn_flush_dest);
> > +
> >  /*
> >   * per netns init and exit
> >   */
> > diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> > index aa6a603a2425..ff052e57e054 100644
> > --- a/net/netfilter/ipvs/ip_vs_core.c
> > +++ b/net/netfilter/ipvs/ip_vs_core.c
> > @@ -24,6 +24,7 @@
> >
> >  #include <linux/module.h>
> >  #include <linux/kernel.h>
> > +#include <linux/kthread.h>
> >  #include <linux/ip.h>
> >  #include <linux/tcp.h>
> >  #include <linux/sctp.h>
> > @@ -694,11 +695,6 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
> >       return ipvs->sysctl_nat_icmp_send;
> >  }
> >
> > -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> > -{
> > -     return ipvs->sysctl_expire_nodest_conn;
> > -}
> > -
> >  #else
> >
> >  static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
> > @@ -2095,36 +2091,33 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
> >               }
> >       }
> >
> > -     if (unlikely(!cp)) {
> > -             int v;
> > -
> > -             if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> > -                     return v;
> > -     }
> > -
> > -     IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> > -
> >       /* Check the server status */
> > -     if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
> > +     if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
> >               /* the destination server is not available */
> >
> > -             __u32 flags = cp->flags;
> > -
> > -             /* when timer already started, silently drop the packet.*/
> > -             if (timer_pending(&cp->timer))
> > -                     __ip_vs_conn_put(cp);
> > -             else
> > -                     ip_vs_conn_put(cp);
> > +             if (sysctl_expire_nodest_conn(ipvs)) {
> > +                     bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
> >
> > -             if (sysctl_expire_nodest_conn(ipvs) &&
> > -                 !(flags & IP_VS_CONN_F_ONE_PACKET)) {
> > -                     /* try to expire the connection immediately */
> >                       ip_vs_conn_expire_now(cp);
> > +                     __ip_vs_conn_put(cp);
> > +                     if (uses_ct)
> > +                             return NF_DROP;
> > +                     cp = NULL;
> > +             } else {
> > +                     __ip_vs_conn_put(cp);
> > +                     return NF_DROP;
> >               }
> > +     }
> >
> > -             return NF_DROP;
> > +     if (unlikely(!cp)) {
> > +             int v;
> > +
> > +             if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> > +                     return v;
> >       }
> >
> > +     IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> > +
>
>         The above code in ip_vs_in() is correct.
>
> >       ip_vs_in_stats(cp, skb);
> >       ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
> >       if (cp->packet_xmit)
> > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> > index 8d14a1acbc37..fa48268368fc 100644
> > --- a/net/netfilter/ipvs/ip_vs_ctl.c
> > +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> > @@ -1163,6 +1163,22 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
> >       list_add(&dest->t_list, &ipvs->dest_trash);
> >       dest->idle_start = 0;
> >       spin_unlock_bh(&ipvs->dest_trash_lock);
> > +
> > +     /*      If expire_nodest_conn is enabled, expire all connections
> > +      *      immediately in a kthread.
> > +      */
> > +     if (sysctl_expire_nodest_conn(ipvs)) {
>
>         Looks like we should not start work when 'cleanup' is true, it
> indicates that we are doing final release of all resources.
>
>         if (sysctl_expire_nodest_conn(ipvs) && !cleanup)
>                 queue_delayed_work(system_long_wq, &ipvs->expire_nodest_work, 1);
>
> > +             struct ip_vs_conn_flush_dest_tinfo *tinfo = NULL;
> > +
> > +             tinfo = kcalloc(1, sizeof(struct ip_vs_conn_flush_dest_tinfo),
> > +                             GFP_KERNEL);
> > +             tinfo->ipvs = ipvs;
> > +             tinfo->dest = dest;
> > +
> > +             IP_VS_DBG(3, "flushing connections in kthread\n");
> > +             kthread_run(ip_vs_conn_flush_dest,
> > +                         tinfo, "ipvs-flush-dest-conn");
> > +     }
> >  }
>
> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCH] netfilter/ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-05-26 21:47               ` Andrew Kim
@ 2020-05-28  1:41                 ` Andrew Sy Kim
  2020-05-28 17:26                   ` Julian Anastasov
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Sy Kim @ 2020-05-28  1:41 UTC (permalink / raw)
  Cc: Julian Anastasov, Wensong Zhang, Simon Horman, lvs-devel,
	netfilter-devel, Andrew Sy Kim
When expire_nodest_conn=1 and a destination is deleted, IPVS does not
expire the existing connections until the next matching incoming packet.
If there are many connection entries from a single client to a single
destination, many packets may get dropped before all the connections are
expired (more likely with lots of UDP traffic). An optimization can be
made where upon deletion of a destination, IPVS queues up delayed work
to immediately expire any connections with a deleted destination. This
ensures any reused source ports from a client (within the IPVS timeouts)
are scheduled to new real servers instead of silently dropped.
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
---
 include/net/ip_vs.h             |  9 ++++++
 net/netfilter/ipvs/ip_vs_conn.c | 53 +++++++++++++++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_core.c | 44 +++++++++++----------------
 net/netfilter/ipvs/ip_vs_ctl.c  | 12 ++++++++
 4 files changed, 92 insertions(+), 26 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 83be2d93b407..1636100f7da5 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -884,6 +884,9 @@ struct netns_ipvs {
 	atomic_t		nullsvc_counter;
 	atomic_t		conn_out_counter;
 
+	/* delayed work for expiring no dest connections */
+	struct delayed_work	expire_nodest_conn_work;
+
 #ifdef CONFIG_SYSCTL
 	/* 1/rate drop and drop-entry variables */
 	struct delayed_work	defense_work;   /* Work handler */
@@ -1049,6 +1052,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_conn_reuse_mode;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_expire_nodest_conn;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return ipvs->sysctl_schedule_icmp;
@@ -1232,6 +1240,7 @@ struct ip_vs_conn *ip_vs_conn_new(const struct ip_vs_conn_param *p, int dest_af,
 				  __be16 dport, unsigned int flags,
 				  struct ip_vs_dest *dest, __u32 fwmark);
 void ip_vs_conn_expire_now(struct ip_vs_conn *cp);
+void expire_nodest_conn_handler(struct work_struct *work);
 
 const char *ip_vs_state_name(const struct ip_vs_conn *cp);
 
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 02f2f636798d..5e802b7fabbb 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -28,6 +28,7 @@
 #include <linux/module.h>
 #include <linux/vmalloc.h>
 #include <linux/proc_fs.h>		/* for proc_net_* */
+#include <linux/workqueue.h>
 #include <linux/slab.h>
 #include <linux/seq_file.h>
 #include <linux/jhash.h>
@@ -1366,6 +1367,58 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
 		goto flush_again;
 	}
 }
+
+/*	Handler for delayed work for expiring no
+ *	destination connections
+ */
+void expire_nodest_conn_handler(struct work_struct *work)
+{
+	int idx;
+	struct ip_vs_conn *cp, *cp_c;
+	struct ip_vs_dest *dest;
+	struct netns_ipvs *ipvs;
+
+	ipvs = container_of(work, struct netns_ipvs,
+			    expire_nodest_conn_work.work);
+
+	/* netns clean up started, aborted delayed work */
+	if (!ipvs->enable)
+		return;
+
+	/* only start delayed work if expire_nodest_conn=1 */
+	if (!ipvs->sysctl_expire_nodest_conn)
+		return;
+
+	rcu_read_lock();
+	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
+		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
+			if (cp->ipvs != ipvs)
+				continue;
+
+			dest = cp->dest;
+			if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
+				continue;
+
+			/* As timers are expired in LIFO order, restart
+			 * the timer of controlling connection first, so
+			 * that it is expired after us.
+			 */
+			cp_c = cp->control;
+			/* cp->control is valid only with reference to cp */
+			if (cp_c && __ip_vs_conn_get(cp)) {
+				IP_VS_DBG(4, "del controlling connection\n");
+				ip_vs_conn_expire_now(cp_c);
+				__ip_vs_conn_put(cp);
+			}
+
+			IP_VS_DBG(4, "del connection\n");
+			ip_vs_conn_expire_now(cp);
+		}
+		cond_resched_rcu();
+	}
+	rcu_read_unlock();
+}
+
 /*
  * per netns init and exit
  */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index aa6a603a2425..c8389936d51a 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -694,11 +694,6 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_nat_icmp_send;
 }
 
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
-{
-	return ipvs->sysctl_expire_nodest_conn;
-}
-
 #else
 
 static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
@@ -2095,36 +2090,33 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
 		}
 	}
 
-	if (unlikely(!cp)) {
-		int v;
-
-		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
-			return v;
-	}
-
-	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
-
 	/* Check the server status */
-	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
+	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
 		/* the destination server is not available */
 
-		__u32 flags = cp->flags;
-
-		/* when timer already started, silently drop the packet.*/
-		if (timer_pending(&cp->timer))
-			__ip_vs_conn_put(cp);
-		else
-			ip_vs_conn_put(cp);
+		if (sysctl_expire_nodest_conn(ipvs)) {
+			bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
 
-		if (sysctl_expire_nodest_conn(ipvs) &&
-		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
-			/* try to expire the connection immediately */
 			ip_vs_conn_expire_now(cp);
+			__ip_vs_conn_put(cp);
+			if (uses_ct)
+				return NF_DROP;
+			cp = NULL;
+		} else {
+			__ip_vs_conn_put(cp);
+			return NF_DROP;
 		}
+	}
 
-		return NF_DROP;
+	if (unlikely(!cp)) {
+		int v;
+
+		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
+			return v;
 	}
 
+	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
+
 	ip_vs_in_stats(cp, skb);
 	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
 	if (cp->packet_xmit)
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 8d14a1acbc37..6eb1afa30c74 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1163,6 +1163,13 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
 	list_add(&dest->t_list, &ipvs->dest_trash);
 	dest->idle_start = 0;
 	spin_unlock_bh(&ipvs->dest_trash_lock);
+
+	/*	If expire_nodest_conn is enabled and we're not cleaning up,
+	 *	queue up delayed work to expire all no destination connections
+	 */
+	if (sysctl_expire_nodest_conn(ipvs) && !cleanup)
+		queue_delayed_work(system_long_wq,
+				   &ipvs->expire_nodest_conn_work, 1);
 }
 
 
@@ -4065,6 +4072,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
 	schedule_delayed_work(&ipvs->defense_work, DEFENSE_TIMER_PERIOD);
 
+	/* Init delayed work for expiring no dest conn */
+	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
+			  expire_nodest_conn_handler);
+
 	return 0;
 }
 
@@ -4072,6 +4083,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
 {
 	struct net *net = ipvs->net;
 
+	cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
 	cancel_delayed_work_sync(&ipvs->defense_work);
 	cancel_work_sync(&ipvs->defense_work.work);
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
-- 
2.20.1
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-05-28  1:41                 ` [PATCH] netfilter/ipvs: queue delayed work to expire no destination connections " Andrew Sy Kim
@ 2020-05-28 17:26                   ` Julian Anastasov
  2020-06-08 17:22                     ` Andrew Sy Kim
  2020-06-08 17:34                     ` Andrew Sy Kim
  0 siblings, 2 replies; 31+ messages in thread
From: Julian Anastasov @ 2020-05-28 17:26 UTC (permalink / raw)
  To: Andrew Sy Kim; +Cc: Wensong Zhang, Simon Horman, lvs-devel, netfilter-devel
	Hello,
On Wed, 27 May 2020, Andrew Sy Kim wrote:
> When expire_nodest_conn=1 and a destination is deleted, IPVS does not
> expire the existing connections until the next matching incoming packet.
> If there are many connection entries from a single client to a single
> destination, many packets may get dropped before all the connections are
> expired (more likely with lots of UDP traffic). An optimization can be
> made where upon deletion of a destination, IPVS queues up delayed work
> to immediately expire any connections with a deleted destination. This
> ensures any reused source ports from a client (within the IPVS timeouts)
> are scheduled to new real servers instead of silently dropped.
> 
> Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> ---
>  include/net/ip_vs.h             |  9 ++++++
>  net/netfilter/ipvs/ip_vs_conn.c | 53 +++++++++++++++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_core.c | 44 +++++++++++----------------
>  net/netfilter/ipvs/ip_vs_ctl.c  | 12 ++++++++
>  4 files changed, 92 insertions(+), 26 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 83be2d93b407..1636100f7da5 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -884,6 +884,9 @@ struct netns_ipvs {
>  	atomic_t		nullsvc_counter;
>  	atomic_t		conn_out_counter;
>  
> +	/* delayed work for expiring no dest connections */
> +	struct delayed_work	expire_nodest_conn_work;
	All expire_nodest_conn code should be under CONFIG_SYSCTL,
so this should go below.
> +
>  #ifdef CONFIG_SYSCTL
>  	/* 1/rate drop and drop-entry variables */
>  	struct delayed_work	defense_work;   /* Work handler */
> @@ -1049,6 +1052,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
>  	return ipvs->sysctl_conn_reuse_mode;
>  }
>  
> +static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> +{
> +	return ipvs->sysctl_expire_nodest_conn;
> +}
> +
	This is the CONFIG_SYSCTL code, you have to move the
empty function for the !CONFIG_SYSCTL case too. In general, we
try to hide such ifdefs in ip_vs.h, still the .c files can
ifdef whole functions. As result, for all these sysctl vars
we have two function variants depending on CONFIG_SYSCTL.
>  static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
>  {
>  	return ipvs->sysctl_schedule_icmp;
> @@ -1232,6 +1240,7 @@ struct ip_vs_conn *ip_vs_conn_new(const struct ip_vs_conn_param *p, int dest_af,
>  				  __be16 dport, unsigned int flags,
>  				  struct ip_vs_dest *dest, __u32 fwmark);
>  void ip_vs_conn_expire_now(struct ip_vs_conn *cp);
> +void expire_nodest_conn_handler(struct work_struct *work);
	Keep this in ip_vs_ctl.c under CONFIG_SYSCTL.
Here add ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs);
under CONFIG_SYSCTL.
>  const char *ip_vs_state_name(const struct ip_vs_conn *cp);
>  
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index 02f2f636798d..5e802b7fabbb 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -28,6 +28,7 @@
>  #include <linux/module.h>
>  #include <linux/vmalloc.h>
>  #include <linux/proc_fs.h>		/* for proc_net_* */
> +#include <linux/workqueue.h>
>  #include <linux/slab.h>
>  #include <linux/seq_file.h>
>  #include <linux/jhash.h>
> @@ -1366,6 +1367,58 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
>  		goto flush_again;
>  	}
>  }
> +
> +/*	Handler for delayed work for expiring no
> + *	destination connections
> + */
> +void expire_nodest_conn_handler(struct work_struct *work)
> +{
> +	int idx;
> +	struct ip_vs_conn *cp, *cp_c;
> +	struct ip_vs_dest *dest;
> +	struct netns_ipvs *ipvs;
> +
> +	ipvs = container_of(work, struct netns_ipvs,
> +			    expire_nodest_conn_work.work);
> +
> +	/* netns clean up started, aborted delayed work */
> +	if (!ipvs->enable)
> +		return;
	This is of no use here, it should be near
cond_resched_rcu(), so that we can abort immediately,
not after seconds...
> +
> +	/* only start delayed work if expire_nodest_conn=1 */
> +	if (!ipvs->sysctl_expire_nodest_conn)
> +		return;
> +
	May be not needed because work was queued after
such check. And we want to avoid using an ipvs field,
we prefer the sysctl_expire_nodest_conn() function
because such fields should not be compiled when
!CONFIG_SYSCTL, which is not the case for many of them
but this needs special cleanup with another patch...
	Just like ip_vs_random_dropentry(), keep the below
logic in ip_vs_conn.c in separate function, eg.
ip_vs_expire_nodest_conn_flush(ipvs), under CONFIG_SYSCTL,
called from the work callback which is in ip_vs_ctl.c.
We want functions that walk ip_vs_conn_tab in ip_vs_conn.c.
> +	rcu_read_lock();
> +	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> +		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
> +			if (cp->ipvs != ipvs)
> +				continue;
> +
> +			dest = cp->dest;
> +			if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
> +				continue;
> +
> +			/* As timers are expired in LIFO order, restart
> +			 * the timer of controlling connection first, so
> +			 * that it is expired after us.
> +			 */
> +			cp_c = cp->control;
> +			/* cp->control is valid only with reference to cp */
> +			if (cp_c && __ip_vs_conn_get(cp)) {
> +				IP_VS_DBG(4, "del controlling connection\n");
> +				ip_vs_conn_expire_now(cp_c);
> +				__ip_vs_conn_put(cp);
> +			}
> +
> +			IP_VS_DBG(4, "del connection\n");
> +			ip_vs_conn_expire_now(cp);
> +		}
> +		cond_resched_rcu();
> +	}
> +	rcu_read_unlock();
> +}
> +
>  /*
>   * per netns init and exit
>   */
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index aa6a603a2425..c8389936d51a 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -694,11 +694,6 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
>  	return ipvs->sysctl_nat_icmp_send;
>  }
>  
> -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> -{
> -	return ipvs->sysctl_expire_nodest_conn;
> -}
> -
>  #else
>  
	At this place there is second sysctl_expire_nodest_conn()
that should be moved to ip_vs.h
>  static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
> @@ -2095,36 +2090,33 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
>  		}
>  	}
>  
> -	if (unlikely(!cp)) {
> -		int v;
> -
> -		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> -			return v;
> -	}
> -
> -	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> -
>  	/* Check the server status */
> -	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
> +	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
>  		/* the destination server is not available */
>  
> -		__u32 flags = cp->flags;
> -
> -		/* when timer already started, silently drop the packet.*/
> -		if (timer_pending(&cp->timer))
> -			__ip_vs_conn_put(cp);
> -		else
> -			ip_vs_conn_put(cp);
> +		if (sysctl_expire_nodest_conn(ipvs)) {
> +			bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
>  
> -		if (sysctl_expire_nodest_conn(ipvs) &&
> -		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
> -			/* try to expire the connection immediately */
>  			ip_vs_conn_expire_now(cp);
> +			__ip_vs_conn_put(cp);
> +			if (uses_ct)
> +				return NF_DROP;
> +			cp = NULL;
> +		} else {
> +			__ip_vs_conn_put(cp);
> +			return NF_DROP;
>  		}
> +	}
>  
> -		return NF_DROP;
> +	if (unlikely(!cp)) {
> +		int v;
> +
> +		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> +			return v;
>  	}
>  
> +	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> +
>  	ip_vs_in_stats(cp, skb);
>  	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
>  	if (cp->packet_xmit)
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 8d14a1acbc37..6eb1afa30c74 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -1163,6 +1163,13 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
>  	list_add(&dest->t_list, &ipvs->dest_trash);
>  	dest->idle_start = 0;
>  	spin_unlock_bh(&ipvs->dest_trash_lock);
> +
> +	/*	If expire_nodest_conn is enabled and we're not cleaning up,
> +	 *	queue up delayed work to expire all no destination connections
> +	 */
> +	if (sysctl_expire_nodest_conn(ipvs) && !cleanup)
> +		queue_delayed_work(system_long_wq,
> +				   &ipvs->expire_nodest_conn_work, 1);
	May be we should put this in ifdef CONFIG_SYSCTL.
Another option is to have ip_vs_enqueue_expire_nodest_conns()
in ip_vs.h, again in two variants, to avoid ifdef here.
>  }
>  
>  
> @@ -4065,6 +4072,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>  	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
>  	schedule_delayed_work(&ipvs->defense_work, DEFENSE_TIMER_PERIOD);
>  
> +	/* Init delayed work for expiring no dest conn */
> +	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
> +			  expire_nodest_conn_handler);
> +
>  	return 0;
>  }
>  
> @@ -4072,6 +4083,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
>  {
>  	struct net *net = ipvs->net;
>  
> +	cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
>  	cancel_delayed_work_sync(&ipvs->defense_work);
>  	cancel_work_sync(&ipvs->defense_work.work);
>  	unregister_net_sysctl_table(ipvs->sysctl_hdr);
> -- 
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCH] netfilter/ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-05-28 17:26                   ` Julian Anastasov
@ 2020-06-08 17:22                     ` Andrew Sy Kim
  2020-06-08 17:29                       ` Andrew Kim
  2020-06-08 17:34                     ` Andrew Sy Kim
  1 sibling, 1 reply; 31+ messages in thread
From: Andrew Sy Kim @ 2020-06-08 17:22 UTC (permalink / raw)
  Cc: Julian Anastasov, Wensong Zhang, Simon Horman, lvs-devel,
	netfilter-devel, Andrew Sy Kim
When expire_nodest_conn=1 and a destination is deleted, IPVS does not
expire the existing connections until the next matching incoming packet.
If there are many connection entries from a single client to a single
destination, many packets may get dropped before all the connections are
expired (more likely with lots of UDP traffic). An optimization can be
made where upon deletion of a destination, IPVS queues up delayed work
to immediately expire any connections with a deleted destination. This
ensures any reused source ports from a client (within the IPVS timeouts)
are scheduled to new real servers instead of silently dropped.
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
---
 include/net/ip_vs.h             | 28 ++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_conn.c | 44 ++++++++++++++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_core.c | 45 +++++++++++++--------------------
 net/netfilter/ipvs/ip_vs_ctl.c  | 22 ++++++++++++++++
 4 files changed, 112 insertions(+), 27 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 83be2d93b407..3eb03d9440ce 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -885,6 +885,8 @@ struct netns_ipvs {
 	atomic_t		conn_out_counter;
 
 #ifdef CONFIG_SYSCTL
+	/* delayed work for expiring no dest connections */
+	struct delayed_work	expire_nodest_conn_work;
 	/* 1/rate drop and drop-entry variables */
 	struct delayed_work	defense_work;   /* Work handler */
 	int			drop_rate;
@@ -1049,6 +1051,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_conn_reuse_mode;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_expire_nodest_conn;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return ipvs->sysctl_schedule_icmp;
@@ -1136,6 +1143,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return 1;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return 0;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return 0;
@@ -1505,6 +1517,22 @@ static inline int ip_vs_todrop(struct netns_ipvs *ipvs)
 static inline int ip_vs_todrop(struct netns_ipvs *ipvs) { return 0; }
 #endif
 
+#ifdef CONFIG_SYSCTL
+/* Enqueue delayed work for expiring no dest connections
+ * Only run when sysctl_expire_nodest=1
+ */
+static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs *ipvs)
+{
+	if (sysctl_expire_nodest_conn(ipvs))
+		queue_delayed_work(system_long_wq,
+				   &ipvs->expire_nodest_conn_work, 1);
+}
+
+void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs);
+#else
+static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs) {}
+#endif
+
 #define IP_VS_DFWD_METHOD(dest) (atomic_read(&(dest)->conn_flags) & \
 				 IP_VS_CONN_F_FWD_MASK)
 
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 02f2f636798d..88eb7c2433aa 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -28,6 +28,7 @@
 #include <linux/module.h>
 #include <linux/vmalloc.h>
 #include <linux/proc_fs.h>		/* for proc_net_* */
+#include <linux/workqueue.h>
 #include <linux/slab.h>
 #include <linux/seq_file.h>
 #include <linux/jhash.h>
@@ -1366,6 +1367,49 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
 		goto flush_again;
 	}
 }
+
+#ifdef CONFIG_SYSCTL
+void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs)
+{
+	int idx;
+	struct ip_vs_conn *cp, *cp_c;
+	struct ip_vs_dest *dest;
+
+	rcu_read_lock();
+	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
+		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
+			if (cp->ipvs != ipvs)
+				continue;
+
+			dest = cp->dest;
+			if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
+				continue;
+
+			/* As timers are expired in LIFO order, restart
+			 * the timer of controlling connection first, so
+			 * that it is expired after us.
+			 */
+			cp_c = cp->control;
+			/* cp->control is valid only with reference to cp */
+			if (cp_c && __ip_vs_conn_get(cp)) {
+				IP_VS_DBG(4, "del controlling connection\n");
+				ip_vs_conn_expire_now(cp_c);
+				__ip_vs_conn_put(cp);
+			}
+
+			IP_VS_DBG(4, "del connection\n");
+			ip_vs_conn_expire_now(cp);
+		}
+		cond_resched_rcu();
+
+		/* netns clean up started, aborted delayed work */
+		if (!ipvs->enable)
+			return;
+	}
+	rcu_read_unlock();
+}
+#endif
+
 /*
  * per netns init and exit
  */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index aa6a603a2425..2508a9caeae8 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -694,16 +694,10 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_nat_icmp_send;
 }
 
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
-{
-	return ipvs->sysctl_expire_nodest_conn;
-}
-
 #else
 
 static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
 static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs) { return 0; }
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs) { return 0; }
 
 #endif
 
@@ -2095,36 +2089,33 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
 		}
 	}
 
-	if (unlikely(!cp)) {
-		int v;
-
-		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
-			return v;
-	}
-
-	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
-
 	/* Check the server status */
-	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
+	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
 		/* the destination server is not available */
 
-		__u32 flags = cp->flags;
-
-		/* when timer already started, silently drop the packet.*/
-		if (timer_pending(&cp->timer))
-			__ip_vs_conn_put(cp);
-		else
-			ip_vs_conn_put(cp);
+		if (sysctl_expire_nodest_conn(ipvs)) {
+			bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
 
-		if (sysctl_expire_nodest_conn(ipvs) &&
-		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
-			/* try to expire the connection immediately */
 			ip_vs_conn_expire_now(cp);
+			__ip_vs_conn_put(cp);
+			if (uses_ct)
+				return NF_DROP;
+			cp = NULL;
+		} else {
+			__ip_vs_conn_put(cp);
+			return NF_DROP;
 		}
+	}
 
-		return NF_DROP;
+	if (unlikely(!cp)) {
+		int v;
+
+		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
+			return v;
 	}
 
+	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
+
 	ip_vs_in_stats(cp, skb);
 	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
 	if (cp->packet_xmit)
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 8d14a1acbc37..24736efac85c 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -210,6 +210,17 @@ static void update_defense_level(struct netns_ipvs *ipvs)
 	local_bh_enable();
 }
 
+/*	Handler for delayed work for expiring no
+ *	destination connections
+ */
+static void expire_nodest_conn_handler(struct work_struct *work)
+{
+	struct netns_ipvs *ipvs;
+
+	ipvs = container_of(work, struct netns_ipvs,
+			    expire_nodest_conn_work.work);
+	ip_vs_expire_nodest_conn_flush(ipvs);
+}
 
 /*
  *	Timer for checking the defense
@@ -1163,6 +1174,12 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
 	list_add(&dest->t_list, &ipvs->dest_trash);
 	dest->idle_start = 0;
 	spin_unlock_bh(&ipvs->dest_trash_lock);
+
+	/*	Queue up delayed work to expire all no estination connections.
+	 *	No-op when CONFIG_SYSCTL is disabled.
+	 */
+	if (!cleanup)
+		ip_vs_enqueue_expire_nodest_conns(ipvs);
 }
 
 
@@ -4065,6 +4082,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
 	schedule_delayed_work(&ipvs->defense_work, DEFENSE_TIMER_PERIOD);
 
+	/* Init delayed work for expiring no dest conn */
+	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
+			  expire_nodest_conn_handler);
+
 	return 0;
 }
 
@@ -4072,6 +4093,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
 {
 	struct net *net = ipvs->net;
 
+	cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
 	cancel_delayed_work_sync(&ipvs->defense_work);
 	cancel_work_sync(&ipvs->defense_work.work);
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
-- 
2.20.1
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-06-08 17:22                     ` Andrew Sy Kim
@ 2020-06-08 17:29                       ` Andrew Kim
  0 siblings, 0 replies; 31+ messages in thread
From: Andrew Kim @ 2020-06-08 17:29 UTC (permalink / raw)
  Cc: Julian Anastasov, Wensong Zhang, Simon Horman, open list:IPVS,
	open list:NETFILTER
Missed a few changes in the patch I just sent, will send an updated one shortly.
On Mon, Jun 8, 2020 at 1:22 PM Andrew Sy Kim <kim.andrewsy@gmail.com> wrote:
>
> When expire_nodest_conn=1 and a destination is deleted, IPVS does not
> expire the existing connections until the next matching incoming packet.
> If there are many connection entries from a single client to a single
> destination, many packets may get dropped before all the connections are
> expired (more likely with lots of UDP traffic). An optimization can be
> made where upon deletion of a destination, IPVS queues up delayed work
> to immediately expire any connections with a deleted destination. This
> ensures any reused source ports from a client (within the IPVS timeouts)
> are scheduled to new real servers instead of silently dropped.
>
> Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> ---
>  include/net/ip_vs.h             | 28 ++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_conn.c | 44 ++++++++++++++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_core.c | 45 +++++++++++++--------------------
>  net/netfilter/ipvs/ip_vs_ctl.c  | 22 ++++++++++++++++
>  4 files changed, 112 insertions(+), 27 deletions(-)
>
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 83be2d93b407..3eb03d9440ce 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -885,6 +885,8 @@ struct netns_ipvs {
>         atomic_t                conn_out_counter;
>
>  #ifdef CONFIG_SYSCTL
> +       /* delayed work for expiring no dest connections */
> +       struct delayed_work     expire_nodest_conn_work;
>         /* 1/rate drop and drop-entry variables */
>         struct delayed_work     defense_work;   /* Work handler */
>         int                     drop_rate;
> @@ -1049,6 +1051,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
>         return ipvs->sysctl_conn_reuse_mode;
>  }
>
> +static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> +{
> +       return ipvs->sysctl_expire_nodest_conn;
> +}
> +
>  static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
>  {
>         return ipvs->sysctl_schedule_icmp;
> @@ -1136,6 +1143,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
>         return 1;
>  }
>
> +static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> +{
> +       return 0;
> +}
> +
>  static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
>  {
>         return 0;
> @@ -1505,6 +1517,22 @@ static inline int ip_vs_todrop(struct netns_ipvs *ipvs)
>  static inline int ip_vs_todrop(struct netns_ipvs *ipvs) { return 0; }
>  #endif
>
> +#ifdef CONFIG_SYSCTL
> +/* Enqueue delayed work for expiring no dest connections
> + * Only run when sysctl_expire_nodest=1
> + */
> +static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs *ipvs)
> +{
> +       if (sysctl_expire_nodest_conn(ipvs))
> +               queue_delayed_work(system_long_wq,
> +                                  &ipvs->expire_nodest_conn_work, 1);
> +}
> +
> +void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs);
> +#else
> +static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs) {}
> +#endif
> +
>  #define IP_VS_DFWD_METHOD(dest) (atomic_read(&(dest)->conn_flags) & \
>                                  IP_VS_CONN_F_FWD_MASK)
>
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index 02f2f636798d..88eb7c2433aa 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -28,6 +28,7 @@
>  #include <linux/module.h>
>  #include <linux/vmalloc.h>
>  #include <linux/proc_fs.h>             /* for proc_net_* */
> +#include <linux/workqueue.h>
>  #include <linux/slab.h>
>  #include <linux/seq_file.h>
>  #include <linux/jhash.h>
> @@ -1366,6 +1367,49 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
>                 goto flush_again;
>         }
>  }
> +
> +#ifdef CONFIG_SYSCTL
> +void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs)
> +{
> +       int idx;
> +       struct ip_vs_conn *cp, *cp_c;
> +       struct ip_vs_dest *dest;
> +
> +       rcu_read_lock();
> +       for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> +               hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
> +                       if (cp->ipvs != ipvs)
> +                               continue;
> +
> +                       dest = cp->dest;
> +                       if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
> +                               continue;
> +
> +                       /* As timers are expired in LIFO order, restart
> +                        * the timer of controlling connection first, so
> +                        * that it is expired after us.
> +                        */
> +                       cp_c = cp->control;
> +                       /* cp->control is valid only with reference to cp */
> +                       if (cp_c && __ip_vs_conn_get(cp)) {
> +                               IP_VS_DBG(4, "del controlling connection\n");
> +                               ip_vs_conn_expire_now(cp_c);
> +                               __ip_vs_conn_put(cp);
> +                       }
> +
> +                       IP_VS_DBG(4, "del connection\n");
> +                       ip_vs_conn_expire_now(cp);
> +               }
> +               cond_resched_rcu();
> +
> +               /* netns clean up started, aborted delayed work */
> +               if (!ipvs->enable)
> +                       return;
> +       }
> +       rcu_read_unlock();
> +}
> +#endif
> +
>  /*
>   * per netns init and exit
>   */
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index aa6a603a2425..2508a9caeae8 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -694,16 +694,10 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
>         return ipvs->sysctl_nat_icmp_send;
>  }
>
> -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> -{
> -       return ipvs->sysctl_expire_nodest_conn;
> -}
> -
>  #else
>
>  static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
>  static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs) { return 0; }
> -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs) { return 0; }
>
>  #endif
>
> @@ -2095,36 +2089,33 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
>                 }
>         }
>
> -       if (unlikely(!cp)) {
> -               int v;
> -
> -               if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> -                       return v;
> -       }
> -
> -       IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> -
>         /* Check the server status */
> -       if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
> +       if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
>                 /* the destination server is not available */
>
> -               __u32 flags = cp->flags;
> -
> -               /* when timer already started, silently drop the packet.*/
> -               if (timer_pending(&cp->timer))
> -                       __ip_vs_conn_put(cp);
> -               else
> -                       ip_vs_conn_put(cp);
> +               if (sysctl_expire_nodest_conn(ipvs)) {
> +                       bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
>
> -               if (sysctl_expire_nodest_conn(ipvs) &&
> -                   !(flags & IP_VS_CONN_F_ONE_PACKET)) {
> -                       /* try to expire the connection immediately */
>                         ip_vs_conn_expire_now(cp);
> +                       __ip_vs_conn_put(cp);
> +                       if (uses_ct)
> +                               return NF_DROP;
> +                       cp = NULL;
> +               } else {
> +                       __ip_vs_conn_put(cp);
> +                       return NF_DROP;
>                 }
> +       }
>
> -               return NF_DROP;
> +       if (unlikely(!cp)) {
> +               int v;
> +
> +               if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> +                       return v;
>         }
>
> +       IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> +
>         ip_vs_in_stats(cp, skb);
>         ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
>         if (cp->packet_xmit)
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 8d14a1acbc37..24736efac85c 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -210,6 +210,17 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>         local_bh_enable();
>  }
>
> +/*     Handler for delayed work for expiring no
> + *     destination connections
> + */
> +static void expire_nodest_conn_handler(struct work_struct *work)
> +{
> +       struct netns_ipvs *ipvs;
> +
> +       ipvs = container_of(work, struct netns_ipvs,
> +                           expire_nodest_conn_work.work);
> +       ip_vs_expire_nodest_conn_flush(ipvs);
> +}
>
>  /*
>   *     Timer for checking the defense
> @@ -1163,6 +1174,12 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
>         list_add(&dest->t_list, &ipvs->dest_trash);
>         dest->idle_start = 0;
>         spin_unlock_bh(&ipvs->dest_trash_lock);
> +
> +       /*      Queue up delayed work to expire all no estination connections.
> +        *      No-op when CONFIG_SYSCTL is disabled.
> +        */
> +       if (!cleanup)
> +               ip_vs_enqueue_expire_nodest_conns(ipvs);
>  }
>
>
> @@ -4065,6 +4082,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>         INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
>         schedule_delayed_work(&ipvs->defense_work, DEFENSE_TIMER_PERIOD);
>
> +       /* Init delayed work for expiring no dest conn */
> +       INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
> +                         expire_nodest_conn_handler);
> +
>         return 0;
>  }
>
> @@ -4072,6 +4093,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
>  {
>         struct net *net = ipvs->net;
>
> +       cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
>         cancel_delayed_work_sync(&ipvs->defense_work);
>         cancel_work_sync(&ipvs->defense_work.work);
>         unregister_net_sysctl_table(ipvs->sysctl_hdr);
> --
> 2.20.1
>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCH] netfilter/ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-05-28 17:26                   ` Julian Anastasov
  2020-06-08 17:22                     ` Andrew Sy Kim
@ 2020-06-08 17:34                     ` Andrew Sy Kim
  2020-06-08 20:20                       ` Andrew Sy Kim
  1 sibling, 1 reply; 31+ messages in thread
From: Andrew Sy Kim @ 2020-06-08 17:34 UTC (permalink / raw)
  Cc: Julian Anastasov, Wensong Zhang, Simon Horman, lvs-devel,
	netfilter-devel, Andrew Sy Kim
When expire_nodest_conn=1 and a destination is deleted, IPVS does not
expire the existing connections until the next matching incoming packet.
If there are many connection entries from a single client to a single
destination, many packets may get dropped before all the connections are
expired (more likely with lots of UDP traffic). An optimization can be
made where upon deletion of a destination, IPVS queues up delayed work
to immediately expire any connections with a deleted destination. This
ensures any reused source ports from a client (within the IPVS timeouts)
are scheduled to new real servers instead of silently dropped.
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
---
 include/net/ip_vs.h             | 29 +++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_conn.c | 43 +++++++++++++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_core.c | 45 +++++++++++++--------------------
 net/netfilter/ipvs/ip_vs_ctl.c  | 22 ++++++++++++++++
 4 files changed, 112 insertions(+), 27 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 83be2d93b407..49ca61765298 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -14,6 +14,7 @@
 #include <linux/spinlock.h>             /* for struct rwlock_t */
 #include <linux/atomic.h>               /* for struct atomic_t */
 #include <linux/refcount.h>             /* for struct refcount_t */
+#include <linux/workqueue.h>
 
 #include <linux/compiler.h>
 #include <linux/timer.h>
@@ -885,6 +886,8 @@ struct netns_ipvs {
 	atomic_t		conn_out_counter;
 
 #ifdef CONFIG_SYSCTL
+	/* delayed work for expiring no dest connections */
+	struct delayed_work	expire_nodest_conn_work;
 	/* 1/rate drop and drop-entry variables */
 	struct delayed_work	defense_work;   /* Work handler */
 	int			drop_rate;
@@ -1049,6 +1052,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_conn_reuse_mode;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_expire_nodest_conn;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return ipvs->sysctl_schedule_icmp;
@@ -1136,6 +1144,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return 1;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return 0;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return 0;
@@ -1505,6 +1518,22 @@ static inline int ip_vs_todrop(struct netns_ipvs *ipvs)
 static inline int ip_vs_todrop(struct netns_ipvs *ipvs) { return 0; }
 #endif
 
+#ifdef CONFIG_SYSCTL
+/* Enqueue delayed work for expiring no dest connections
+ * Only run when sysctl_expire_nodest=1
+ */
+static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs *ipvs)
+{
+	if (sysctl_expire_nodest_conn(ipvs))
+		queue_delayed_work(system_long_wq,
+				   &ipvs->expire_nodest_conn_work, 1);
+}
+
+void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs);
+#else
+static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs) {}
+#endif
+
 #define IP_VS_DFWD_METHOD(dest) (atomic_read(&(dest)->conn_flags) & \
 				 IP_VS_CONN_F_FWD_MASK)
 
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 02f2f636798d..8396ab098953 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -1366,6 +1366,49 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
 		goto flush_again;
 	}
 }
+
+#ifdef CONFIG_SYSCTL
+void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs)
+{
+	int idx;
+	struct ip_vs_conn *cp, *cp_c;
+	struct ip_vs_dest *dest;
+
+	rcu_read_lock();
+	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
+		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
+			if (cp->ipvs != ipvs)
+				continue;
+
+			dest = cp->dest;
+			if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
+				continue;
+
+			/* As timers are expired in LIFO order, restart
+			 * the timer of controlling connection first, so
+			 * that it is expired after us.
+			 */
+			cp_c = cp->control;
+			/* cp->control is valid only with reference to cp */
+			if (cp_c && __ip_vs_conn_get(cp)) {
+				IP_VS_DBG(4, "del controlling connection\n");
+				ip_vs_conn_expire_now(cp_c);
+				__ip_vs_conn_put(cp);
+			}
+
+			IP_VS_DBG(4, "del connection\n");
+			ip_vs_conn_expire_now(cp);
+		}
+		cond_resched_rcu();
+
+		/* netns clean up started, aborted delayed work */
+		if (!ipvs->enable)
+			return;
+	}
+	rcu_read_unlock();
+}
+#endif
+
 /*
  * per netns init and exit
  */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index aa6a603a2425..2508a9caeae8 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -694,16 +694,10 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_nat_icmp_send;
 }
 
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
-{
-	return ipvs->sysctl_expire_nodest_conn;
-}
-
 #else
 
 static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
 static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs) { return 0; }
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs) { return 0; }
 
 #endif
 
@@ -2095,36 +2089,33 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
 		}
 	}
 
-	if (unlikely(!cp)) {
-		int v;
-
-		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
-			return v;
-	}
-
-	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
-
 	/* Check the server status */
-	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
+	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
 		/* the destination server is not available */
 
-		__u32 flags = cp->flags;
-
-		/* when timer already started, silently drop the packet.*/
-		if (timer_pending(&cp->timer))
-			__ip_vs_conn_put(cp);
-		else
-			ip_vs_conn_put(cp);
+		if (sysctl_expire_nodest_conn(ipvs)) {
+			bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
 
-		if (sysctl_expire_nodest_conn(ipvs) &&
-		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
-			/* try to expire the connection immediately */
 			ip_vs_conn_expire_now(cp);
+			__ip_vs_conn_put(cp);
+			if (uses_ct)
+				return NF_DROP;
+			cp = NULL;
+		} else {
+			__ip_vs_conn_put(cp);
+			return NF_DROP;
 		}
+	}
 
-		return NF_DROP;
+	if (unlikely(!cp)) {
+		int v;
+
+		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
+			return v;
 	}
 
+	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
+
 	ip_vs_in_stats(cp, skb);
 	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
 	if (cp->packet_xmit)
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 8d14a1acbc37..24736efac85c 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -210,6 +210,17 @@ static void update_defense_level(struct netns_ipvs *ipvs)
 	local_bh_enable();
 }
 
+/*	Handler for delayed work for expiring no
+ *	destination connections
+ */
+static void expire_nodest_conn_handler(struct work_struct *work)
+{
+	struct netns_ipvs *ipvs;
+
+	ipvs = container_of(work, struct netns_ipvs,
+			    expire_nodest_conn_work.work);
+	ip_vs_expire_nodest_conn_flush(ipvs);
+}
 
 /*
  *	Timer for checking the defense
@@ -1163,6 +1174,12 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
 	list_add(&dest->t_list, &ipvs->dest_trash);
 	dest->idle_start = 0;
 	spin_unlock_bh(&ipvs->dest_trash_lock);
+
+	/*	Queue up delayed work to expire all no estination connections.
+	 *	No-op when CONFIG_SYSCTL is disabled.
+	 */
+	if (!cleanup)
+		ip_vs_enqueue_expire_nodest_conns(ipvs);
 }
 
 
@@ -4065,6 +4082,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
 	schedule_delayed_work(&ipvs->defense_work, DEFENSE_TIMER_PERIOD);
 
+	/* Init delayed work for expiring no dest conn */
+	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
+			  expire_nodest_conn_handler);
+
 	return 0;
 }
 
@@ -4072,6 +4093,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
 {
 	struct net *net = ipvs->net;
 
+	cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
 	cancel_delayed_work_sync(&ipvs->defense_work);
 	cancel_work_sync(&ipvs->defense_work.work);
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
-- 
2.20.1
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH] netfilter/ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-06-08 17:34                     ` Andrew Sy Kim
@ 2020-06-08 20:20                       ` Andrew Sy Kim
  2020-06-08 20:24                         ` Andrew Kim
  2020-06-15 19:24                         ` Julian Anastasov
  0 siblings, 2 replies; 31+ messages in thread
From: Andrew Sy Kim @ 2020-06-08 20:20 UTC (permalink / raw)
  Cc: Julian Anastasov, Wensong Zhang, Simon Horman, lvs-devel,
	netfilter-devel, Andrew Sy Kim
When expire_nodest_conn=1 and a destination is deleted, IPVS does not
expire the existing connections until the next matching incoming packet.
If there are many connection entries from a single client to a single
destination, many packets may get dropped before all the connections are
expired (more likely with lots of UDP traffic). An optimization can be
made where upon deletion of a destination, IPVS queues up delayed work
to immediately expire any connections with a deleted destination. This
ensures any reused source ports from a client (within the IPVS timeouts)
are scheduled to new real servers instead of silently dropped.
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
---
 include/net/ip_vs.h             | 29 +++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_conn.c | 43 +++++++++++++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_core.c | 45 +++++++++++++--------------------
 net/netfilter/ipvs/ip_vs_ctl.c  | 22 ++++++++++++++++
 4 files changed, 112 insertions(+), 27 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 83be2d93b407..49ca61765298 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -14,6 +14,7 @@
 #include <linux/spinlock.h>             /* for struct rwlock_t */
 #include <linux/atomic.h>               /* for struct atomic_t */
 #include <linux/refcount.h>             /* for struct refcount_t */
+#include <linux/workqueue.h>
 
 #include <linux/compiler.h>
 #include <linux/timer.h>
@@ -885,6 +886,8 @@ struct netns_ipvs {
 	atomic_t		conn_out_counter;
 
 #ifdef CONFIG_SYSCTL
+	/* delayed work for expiring no dest connections */
+	struct delayed_work	expire_nodest_conn_work;
 	/* 1/rate drop and drop-entry variables */
 	struct delayed_work	defense_work;   /* Work handler */
 	int			drop_rate;
@@ -1049,6 +1052,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_conn_reuse_mode;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_expire_nodest_conn;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return ipvs->sysctl_schedule_icmp;
@@ -1136,6 +1144,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return 1;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return 0;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return 0;
@@ -1505,6 +1518,22 @@ static inline int ip_vs_todrop(struct netns_ipvs *ipvs)
 static inline int ip_vs_todrop(struct netns_ipvs *ipvs) { return 0; }
 #endif
 
+#ifdef CONFIG_SYSCTL
+/* Enqueue delayed work for expiring no dest connections
+ * Only run when sysctl_expire_nodest=1
+ */
+static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs *ipvs)
+{
+	if (sysctl_expire_nodest_conn(ipvs))
+		queue_delayed_work(system_long_wq,
+				   &ipvs->expire_nodest_conn_work, 1);
+}
+
+void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs);
+#else
+static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs) {}
+#endif
+
 #define IP_VS_DFWD_METHOD(dest) (atomic_read(&(dest)->conn_flags) & \
 				 IP_VS_CONN_F_FWD_MASK)
 
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 02f2f636798d..f0d744e8c716 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -1366,6 +1366,49 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
 		goto flush_again;
 	}
 }
+
+#ifdef CONFIG_SYSCTL
+void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs)
+{
+	int idx;
+	struct ip_vs_conn *cp, *cp_c;
+	struct ip_vs_dest *dest;
+
+	rcu_read_lock();
+	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
+		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
+			if (cp->ipvs != ipvs)
+				continue;
+
+			dest = cp->dest;
+			if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
+				continue;
+
+			/* As timers are expired in LIFO order, restart
+			 * the timer of controlling connection first, so
+			 * that it is expired after us.
+			 */
+			cp_c = cp->control;
+			/* cp->control is valid only with reference to cp */
+			if (cp_c && __ip_vs_conn_get(cp)) {
+				IP_VS_DBG(4, "del controlling connection\n");
+				ip_vs_conn_expire_now(cp_c);
+				__ip_vs_conn_put(cp);
+			}
+
+			IP_VS_DBG(4, "del connection\n");
+			ip_vs_conn_expire_now(cp);
+		}
+		cond_resched_rcu();
+
+		/* netns clean up started, abort delayed work */
+		if (!ipvs->enable)
+			return;
+	}
+	rcu_read_unlock();
+}
+#endif
+
 /*
  * per netns init and exit
  */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index aa6a603a2425..2508a9caeae8 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -694,16 +694,10 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_nat_icmp_send;
 }
 
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
-{
-	return ipvs->sysctl_expire_nodest_conn;
-}
-
 #else
 
 static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
 static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs) { return 0; }
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs) { return 0; }
 
 #endif
 
@@ -2095,36 +2089,33 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
 		}
 	}
 
-	if (unlikely(!cp)) {
-		int v;
-
-		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
-			return v;
-	}
-
-	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
-
 	/* Check the server status */
-	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
+	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
 		/* the destination server is not available */
 
-		__u32 flags = cp->flags;
-
-		/* when timer already started, silently drop the packet.*/
-		if (timer_pending(&cp->timer))
-			__ip_vs_conn_put(cp);
-		else
-			ip_vs_conn_put(cp);
+		if (sysctl_expire_nodest_conn(ipvs)) {
+			bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
 
-		if (sysctl_expire_nodest_conn(ipvs) &&
-		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
-			/* try to expire the connection immediately */
 			ip_vs_conn_expire_now(cp);
+			__ip_vs_conn_put(cp);
+			if (uses_ct)
+				return NF_DROP;
+			cp = NULL;
+		} else {
+			__ip_vs_conn_put(cp);
+			return NF_DROP;
 		}
+	}
 
-		return NF_DROP;
+	if (unlikely(!cp)) {
+		int v;
+
+		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
+			return v;
 	}
 
+	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
+
 	ip_vs_in_stats(cp, skb);
 	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
 	if (cp->packet_xmit)
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 8d14a1acbc37..9e53f517f138 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -210,6 +210,17 @@ static void update_defense_level(struct netns_ipvs *ipvs)
 	local_bh_enable();
 }
 
+/* Handler for delayed work for expiring no
+ * destination connections
+ */
+static void expire_nodest_conn_handler(struct work_struct *work)
+{
+	struct netns_ipvs *ipvs;
+
+	ipvs = container_of(work, struct netns_ipvs,
+			    expire_nodest_conn_work.work);
+	ip_vs_expire_nodest_conn_flush(ipvs);
+}
 
 /*
  *	Timer for checking the defense
@@ -1163,6 +1174,12 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
 	list_add(&dest->t_list, &ipvs->dest_trash);
 	dest->idle_start = 0;
 	spin_unlock_bh(&ipvs->dest_trash_lock);
+
+	/* Queue up delayed work to expire all no estination connections.
+	 * No-op when CONFIG_SYSCTL is disabled.
+	 */
+	if (!cleanup)
+		ip_vs_enqueue_expire_nodest_conns(ipvs);
 }
 
 
@@ -4065,6 +4082,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
 	schedule_delayed_work(&ipvs->defense_work, DEFENSE_TIMER_PERIOD);
 
+	/* Init delayed work for expiring no dest conn */
+	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
+			  expire_nodest_conn_handler);
+
 	return 0;
 }
 
@@ -4072,6 +4093,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
 {
 	struct net *net = ipvs->net;
 
+	cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
 	cancel_delayed_work_sync(&ipvs->defense_work);
 	cancel_work_sync(&ipvs->defense_work.work);
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
-- 
2.20.1
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-06-08 20:20                       ` Andrew Sy Kim
@ 2020-06-08 20:24                         ` Andrew Kim
  2020-06-15 19:24                         ` Julian Anastasov
  1 sibling, 0 replies; 31+ messages in thread
From: Andrew Kim @ 2020-06-08 20:24 UTC (permalink / raw)
  Cc: Julian Anastasov, Wensong Zhang, Simon Horman, open list:IPVS,
	open list:NETFILTER
Last patch just fixes some styling issues.
On Mon, Jun 8, 2020 at 4:20 PM Andrew Sy Kim <kim.andrewsy@gmail.com> wrote:
>
> When expire_nodest_conn=1 and a destination is deleted, IPVS does not
> expire the existing connections until the next matching incoming packet.
> If there are many connection entries from a single client to a single
> destination, many packets may get dropped before all the connections are
> expired (more likely with lots of UDP traffic). An optimization can be
> made where upon deletion of a destination, IPVS queues up delayed work
> to immediately expire any connections with a deleted destination. This
> ensures any reused source ports from a client (within the IPVS timeouts)
> are scheduled to new real servers instead of silently dropped.
>
> Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> ---
>  include/net/ip_vs.h             | 29 +++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_conn.c | 43 +++++++++++++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_core.c | 45 +++++++++++++--------------------
>  net/netfilter/ipvs/ip_vs_ctl.c  | 22 ++++++++++++++++
>  4 files changed, 112 insertions(+), 27 deletions(-)
>
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 83be2d93b407..49ca61765298 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -14,6 +14,7 @@
>  #include <linux/spinlock.h>             /* for struct rwlock_t */
>  #include <linux/atomic.h>               /* for struct atomic_t */
>  #include <linux/refcount.h>             /* for struct refcount_t */
> +#include <linux/workqueue.h>
>
>  #include <linux/compiler.h>
>  #include <linux/timer.h>
> @@ -885,6 +886,8 @@ struct netns_ipvs {
>         atomic_t                conn_out_counter;
>
>  #ifdef CONFIG_SYSCTL
> +       /* delayed work for expiring no dest connections */
> +       struct delayed_work     expire_nodest_conn_work;
>         /* 1/rate drop and drop-entry variables */
>         struct delayed_work     defense_work;   /* Work handler */
>         int                     drop_rate;
> @@ -1049,6 +1052,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
>         return ipvs->sysctl_conn_reuse_mode;
>  }
>
> +static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> +{
> +       return ipvs->sysctl_expire_nodest_conn;
> +}
> +
>  static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
>  {
>         return ipvs->sysctl_schedule_icmp;
> @@ -1136,6 +1144,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
>         return 1;
>  }
>
> +static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> +{
> +       return 0;
> +}
> +
>  static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
>  {
>         return 0;
> @@ -1505,6 +1518,22 @@ static inline int ip_vs_todrop(struct netns_ipvs *ipvs)
>  static inline int ip_vs_todrop(struct netns_ipvs *ipvs) { return 0; }
>  #endif
>
> +#ifdef CONFIG_SYSCTL
> +/* Enqueue delayed work for expiring no dest connections
> + * Only run when sysctl_expire_nodest=1
> + */
> +static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs *ipvs)
> +{
> +       if (sysctl_expire_nodest_conn(ipvs))
> +               queue_delayed_work(system_long_wq,
> +                                  &ipvs->expire_nodest_conn_work, 1);
> +}
> +
> +void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs);
> +#else
> +static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs) {}
> +#endif
> +
>  #define IP_VS_DFWD_METHOD(dest) (atomic_read(&(dest)->conn_flags) & \
>                                  IP_VS_CONN_F_FWD_MASK)
>
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index 02f2f636798d..f0d744e8c716 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -1366,6 +1366,49 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
>                 goto flush_again;
>         }
>  }
> +
> +#ifdef CONFIG_SYSCTL
> +void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs)
> +{
> +       int idx;
> +       struct ip_vs_conn *cp, *cp_c;
> +       struct ip_vs_dest *dest;
> +
> +       rcu_read_lock();
> +       for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> +               hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
> +                       if (cp->ipvs != ipvs)
> +                               continue;
> +
> +                       dest = cp->dest;
> +                       if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
> +                               continue;
> +
> +                       /* As timers are expired in LIFO order, restart
> +                        * the timer of controlling connection first, so
> +                        * that it is expired after us.
> +                        */
> +                       cp_c = cp->control;
> +                       /* cp->control is valid only with reference to cp */
> +                       if (cp_c && __ip_vs_conn_get(cp)) {
> +                               IP_VS_DBG(4, "del controlling connection\n");
> +                               ip_vs_conn_expire_now(cp_c);
> +                               __ip_vs_conn_put(cp);
> +                       }
> +
> +                       IP_VS_DBG(4, "del connection\n");
> +                       ip_vs_conn_expire_now(cp);
> +               }
> +               cond_resched_rcu();
> +
> +               /* netns clean up started, abort delayed work */
> +               if (!ipvs->enable)
> +                       return;
> +       }
> +       rcu_read_unlock();
> +}
> +#endif
> +
>  /*
>   * per netns init and exit
>   */
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index aa6a603a2425..2508a9caeae8 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -694,16 +694,10 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
>         return ipvs->sysctl_nat_icmp_send;
>  }
>
> -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> -{
> -       return ipvs->sysctl_expire_nodest_conn;
> -}
> -
>  #else
>
>  static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
>  static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs) { return 0; }
> -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs) { return 0; }
>
>  #endif
>
> @@ -2095,36 +2089,33 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
>                 }
>         }
>
> -       if (unlikely(!cp)) {
> -               int v;
> -
> -               if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> -                       return v;
> -       }
> -
> -       IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> -
>         /* Check the server status */
> -       if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
> +       if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
>                 /* the destination server is not available */
>
> -               __u32 flags = cp->flags;
> -
> -               /* when timer already started, silently drop the packet.*/
> -               if (timer_pending(&cp->timer))
> -                       __ip_vs_conn_put(cp);
> -               else
> -                       ip_vs_conn_put(cp);
> +               if (sysctl_expire_nodest_conn(ipvs)) {
> +                       bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
>
> -               if (sysctl_expire_nodest_conn(ipvs) &&
> -                   !(flags & IP_VS_CONN_F_ONE_PACKET)) {
> -                       /* try to expire the connection immediately */
>                         ip_vs_conn_expire_now(cp);
> +                       __ip_vs_conn_put(cp);
> +                       if (uses_ct)
> +                               return NF_DROP;
> +                       cp = NULL;
> +               } else {
> +                       __ip_vs_conn_put(cp);
> +                       return NF_DROP;
>                 }
> +       }
>
> -               return NF_DROP;
> +       if (unlikely(!cp)) {
> +               int v;
> +
> +               if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> +                       return v;
>         }
>
> +       IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> +
>         ip_vs_in_stats(cp, skb);
>         ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
>         if (cp->packet_xmit)
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 8d14a1acbc37..9e53f517f138 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -210,6 +210,17 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>         local_bh_enable();
>  }
>
> +/* Handler for delayed work for expiring no
> + * destination connections
> + */
> +static void expire_nodest_conn_handler(struct work_struct *work)
> +{
> +       struct netns_ipvs *ipvs;
> +
> +       ipvs = container_of(work, struct netns_ipvs,
> +                           expire_nodest_conn_work.work);
> +       ip_vs_expire_nodest_conn_flush(ipvs);
> +}
>
>  /*
>   *     Timer for checking the defense
> @@ -1163,6 +1174,12 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
>         list_add(&dest->t_list, &ipvs->dest_trash);
>         dest->idle_start = 0;
>         spin_unlock_bh(&ipvs->dest_trash_lock);
> +
> +       /* Queue up delayed work to expire all no estination connections.
> +        * No-op when CONFIG_SYSCTL is disabled.
> +        */
> +       if (!cleanup)
> +               ip_vs_enqueue_expire_nodest_conns(ipvs);
>  }
>
>
> @@ -4065,6 +4082,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>         INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
>         schedule_delayed_work(&ipvs->defense_work, DEFENSE_TIMER_PERIOD);
>
> +       /* Init delayed work for expiring no dest conn */
> +       INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
> +                         expire_nodest_conn_handler);
> +
>         return 0;
>  }
>
> @@ -4072,6 +4093,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
>  {
>         struct net *net = ipvs->net;
>
> +       cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
>         cancel_delayed_work_sync(&ipvs->defense_work);
>         cancel_work_sync(&ipvs->defense_work.work);
>         unregister_net_sysctl_table(ipvs->sysctl_hdr);
> --
> 2.20.1
>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-06-08 20:20                       ` Andrew Sy Kim
  2020-06-08 20:24                         ` Andrew Kim
@ 2020-06-15 19:24                         ` Julian Anastasov
  2020-07-01 21:24                           ` Andrew Sy Kim
  1 sibling, 1 reply; 31+ messages in thread
From: Julian Anastasov @ 2020-06-15 19:24 UTC (permalink / raw)
  To: Andrew Sy Kim; +Cc: Wensong Zhang, Simon Horman, lvs-devel, netfilter-devel
	Hello,
On Mon, 8 Jun 2020, Andrew Sy Kim wrote:
> When expire_nodest_conn=1 and a destination is deleted, IPVS does not
> expire the existing connections until the next matching incoming packet.
> If there are many connection entries from a single client to a single
> destination, many packets may get dropped before all the connections are
> expired (more likely with lots of UDP traffic). An optimization can be
> made where upon deletion of a destination, IPVS queues up delayed work
> to immediately expire any connections with a deleted destination. This
> ensures any reused source ports from a client (within the IPVS timeouts)
> are scheduled to new real servers instead of silently dropped.
> 
> Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
	Looks good to me. But you have to base it on the
following patch that I'll finally post when the trees are open.
You have to use ip_vs_conn_del() in ip_vs_expire_nodest_conn_flush()
and the logic as in ip_vs_conn_flush().
	Here is what you can test:
==========================================================
ipvs: avoid expiring many connections from timer
Add new functions ip_vs_conn_del() and ip_vs_conn_del_put()
to release many IPVS connections in process context.
They are suitable for connections found in table
when we do not want to overload the timers.
Currently, the change is useful for the dropentry delayed
work but it will be used also in following patch
when flushing connections to failed destinations.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/netfilter/ipvs/ip_vs_conn.c | 53 +++++++++++++++++++++++----------
 net/netfilter/ipvs/ip_vs_ctl.c  |  6 ++--
 2 files changed, 42 insertions(+), 17 deletions(-)
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 02f2f636798d..b3921ae92740 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -807,6 +807,31 @@ static void ip_vs_conn_rcu_free(struct rcu_head *head)
 	kmem_cache_free(ip_vs_conn_cachep, cp);
 }
 
+/* Try to delete connection while not holding reference */
+static void ip_vs_conn_del(struct ip_vs_conn *cp)
+{
+	if (del_timer(&cp->timer)) {
+		/* Drop cp->control chain too */
+		if (cp->control)
+			cp->timeout = 0;
+		ip_vs_conn_expire(&cp->timer);
+	}
+}
+
+/* Try to delete connection while holding reference */
+static void ip_vs_conn_del_put(struct ip_vs_conn *cp)
+{
+	if (del_timer(&cp->timer)) {
+		/* Drop cp->control chain too */
+		if (cp->control)
+			cp->timeout = 0;
+		__ip_vs_conn_put(cp);
+		ip_vs_conn_expire(&cp->timer);
+	} else {
+		__ip_vs_conn_put(cp);
+	}
+}
+
 static void ip_vs_conn_expire(struct timer_list *t)
 {
 	struct ip_vs_conn *cp = from_timer(cp, t, timer);
@@ -827,14 +852,17 @@ static void ip_vs_conn_expire(struct timer_list *t)
 
 		/* does anybody control me? */
 		if (ct) {
+			bool has_ref = !cp->timeout && __ip_vs_conn_get(ct);
+
 			ip_vs_control_del(cp);
 			/* Drop CTL or non-assured TPL if not used anymore */
-			if (!cp->timeout && !atomic_read(&ct->n_control) &&
+			if (has_ref && !atomic_read(&ct->n_control) &&
 			    (!(ct->flags & IP_VS_CONN_F_TEMPLATE) ||
 			     !(ct->state & IP_VS_CTPL_S_ASSURED))) {
 				IP_VS_DBG(4, "drop controlling connection\n");
-				ct->timeout = 0;
-				ip_vs_conn_expire_now(ct);
+				ip_vs_conn_del_put(ct);
+			} else if (has_ref) {
+				__ip_vs_conn_put(ct);
 			}
 		}
 
@@ -1317,8 +1345,7 @@ void ip_vs_random_dropentry(struct netns_ipvs *ipvs)
 
 drop:
 			IP_VS_DBG(4, "drop connection\n");
-			cp->timeout = 0;
-			ip_vs_conn_expire_now(cp);
+			ip_vs_conn_del(cp);
 		}
 		cond_resched_rcu();
 	}
@@ -1341,19 +1368,15 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
 		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
 			if (cp->ipvs != ipvs)
 				continue;
-			/* As timers are expired in LIFO order, restart
-			 * the timer of controlling connection first, so
-			 * that it is expired after us.
-			 */
+			if (atomic_read(&cp->n_control))
+				continue;
 			cp_c = cp->control;
-			/* cp->control is valid only with reference to cp */
-			if (cp_c && __ip_vs_conn_get(cp)) {
+			IP_VS_DBG(4, "del connection\n");
+			ip_vs_conn_del(cp);
+			if (cp_c && !atomic_read(&cp_c->n_control)) {
 				IP_VS_DBG(4, "del controlling connection\n");
-				ip_vs_conn_expire_now(cp_c);
-				__ip_vs_conn_put(cp);
+				ip_vs_conn_del(cp_c);
 			}
-			IP_VS_DBG(4, "del connection\n");
-			ip_vs_conn_expire_now(cp);
 		}
 		cond_resched_rcu();
 	}
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 412656c34f20..1a231f518e3f 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -224,7 +224,8 @@ static void defense_work_handler(struct work_struct *work)
 	update_defense_level(ipvs);
 	if (atomic_read(&ipvs->dropentry))
 		ip_vs_random_dropentry(ipvs);
-	schedule_delayed_work(&ipvs->defense_work, DEFENSE_TIMER_PERIOD);
+	queue_delayed_work(system_long_wq, &ipvs->defense_work,
+			   DEFENSE_TIMER_PERIOD);
 }
 #endif
 
@@ -4063,7 +4064,8 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	ipvs->sysctl_tbl = tbl;
 	/* Schedule defense work */
 	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
-	schedule_delayed_work(&ipvs->defense_work, DEFENSE_TIMER_PERIOD);
+	queue_delayed_work(system_long_wq, &ipvs->defense_work,
+			   DEFENSE_TIMER_PERIOD);
 
 	return 0;
 }
-- 
2.26.2
==========================================================
> ---
>  include/net/ip_vs.h             | 29 +++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_conn.c | 43 +++++++++++++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_core.c | 45 +++++++++++++--------------------
>  net/netfilter/ipvs/ip_vs_ctl.c  | 22 ++++++++++++++++
>  4 files changed, 112 insertions(+), 27 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 83be2d93b407..49ca61765298 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -14,6 +14,7 @@
>  #include <linux/spinlock.h>             /* for struct rwlock_t */
>  #include <linux/atomic.h>               /* for struct atomic_t */
>  #include <linux/refcount.h>             /* for struct refcount_t */
> +#include <linux/workqueue.h>
>  
>  #include <linux/compiler.h>
>  #include <linux/timer.h>
> @@ -885,6 +886,8 @@ struct netns_ipvs {
>  	atomic_t		conn_out_counter;
>  
>  #ifdef CONFIG_SYSCTL
> +	/* delayed work for expiring no dest connections */
> +	struct delayed_work	expire_nodest_conn_work;
>  	/* 1/rate drop and drop-entry variables */
>  	struct delayed_work	defense_work;   /* Work handler */
>  	int			drop_rate;
> @@ -1049,6 +1052,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
>  	return ipvs->sysctl_conn_reuse_mode;
>  }
>  
> +static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> +{
> +	return ipvs->sysctl_expire_nodest_conn;
> +}
> +
>  static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
>  {
>  	return ipvs->sysctl_schedule_icmp;
> @@ -1136,6 +1144,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
>  	return 1;
>  }
>  
> +static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> +{
> +	return 0;
> +}
> +
>  static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
>  {
>  	return 0;
> @@ -1505,6 +1518,22 @@ static inline int ip_vs_todrop(struct netns_ipvs *ipvs)
>  static inline int ip_vs_todrop(struct netns_ipvs *ipvs) { return 0; }
>  #endif
>  
> +#ifdef CONFIG_SYSCTL
> +/* Enqueue delayed work for expiring no dest connections
> + * Only run when sysctl_expire_nodest=1
> + */
> +static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs *ipvs)
> +{
> +	if (sysctl_expire_nodest_conn(ipvs))
> +		queue_delayed_work(system_long_wq,
> +				   &ipvs->expire_nodest_conn_work, 1);
> +}
> +
> +void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs);
> +#else
> +static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs) {}
> +#endif
> +
>  #define IP_VS_DFWD_METHOD(dest) (atomic_read(&(dest)->conn_flags) & \
>  				 IP_VS_CONN_F_FWD_MASK)
>  
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index 02f2f636798d..f0d744e8c716 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -1366,6 +1366,49 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
>  		goto flush_again;
>  	}
>  }
> +
> +#ifdef CONFIG_SYSCTL
> +void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs)
> +{
> +	int idx;
> +	struct ip_vs_conn *cp, *cp_c;
> +	struct ip_vs_dest *dest;
> +
> +	rcu_read_lock();
> +	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> +		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
> +			if (cp->ipvs != ipvs)
> +				continue;
> +
> +			dest = cp->dest;
> +			if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
> +				continue;
> +
> +			/* As timers are expired in LIFO order, restart
> +			 * the timer of controlling connection first, so
> +			 * that it is expired after us.
> +			 */
> +			cp_c = cp->control;
> +			/* cp->control is valid only with reference to cp */
> +			if (cp_c && __ip_vs_conn_get(cp)) {
> +				IP_VS_DBG(4, "del controlling connection\n");
> +				ip_vs_conn_expire_now(cp_c);
> +				__ip_vs_conn_put(cp);
> +			}
> +
> +			IP_VS_DBG(4, "del connection\n");
> +			ip_vs_conn_expire_now(cp);
> +		}
> +		cond_resched_rcu();
> +
> +		/* netns clean up started, abort delayed work */
> +		if (!ipvs->enable)
> +			return;
> +	}
> +	rcu_read_unlock();
> +}
> +#endif
> +
>  /*
>   * per netns init and exit
>   */
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index aa6a603a2425..2508a9caeae8 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -694,16 +694,10 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
>  	return ipvs->sysctl_nat_icmp_send;
>  }
>  
> -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> -{
> -	return ipvs->sysctl_expire_nodest_conn;
> -}
> -
>  #else
>  
>  static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
>  static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs) { return 0; }
> -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs) { return 0; }
>  
>  #endif
>  
> @@ -2095,36 +2089,33 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
>  		}
>  	}
>  
> -	if (unlikely(!cp)) {
> -		int v;
> -
> -		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> -			return v;
> -	}
> -
> -	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> -
>  	/* Check the server status */
> -	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
> +	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
>  		/* the destination server is not available */
>  
> -		__u32 flags = cp->flags;
> -
> -		/* when timer already started, silently drop the packet.*/
> -		if (timer_pending(&cp->timer))
> -			__ip_vs_conn_put(cp);
> -		else
> -			ip_vs_conn_put(cp);
> +		if (sysctl_expire_nodest_conn(ipvs)) {
> +			bool uses_ct = ip_vs_conn_uses_conntrack(cp, skb);
>  
> -		if (sysctl_expire_nodest_conn(ipvs) &&
> -		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
> -			/* try to expire the connection immediately */
>  			ip_vs_conn_expire_now(cp);
> +			__ip_vs_conn_put(cp);
> +			if (uses_ct)
> +				return NF_DROP;
> +			cp = NULL;
> +		} else {
> +			__ip_vs_conn_put(cp);
> +			return NF_DROP;
>  		}
> +	}
>  
> -		return NF_DROP;
> +	if (unlikely(!cp)) {
> +		int v;
> +
> +		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> +			return v;
>  	}
>  
> +	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> +
>  	ip_vs_in_stats(cp, skb);
>  	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
>  	if (cp->packet_xmit)
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 8d14a1acbc37..9e53f517f138 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -210,6 +210,17 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>  	local_bh_enable();
>  }
>  
> +/* Handler for delayed work for expiring no
> + * destination connections
> + */
> +static void expire_nodest_conn_handler(struct work_struct *work)
> +{
> +	struct netns_ipvs *ipvs;
> +
> +	ipvs = container_of(work, struct netns_ipvs,
> +			    expire_nodest_conn_work.work);
> +	ip_vs_expire_nodest_conn_flush(ipvs);
> +}
>  
>  /*
>   *	Timer for checking the defense
> @@ -1163,6 +1174,12 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
>  	list_add(&dest->t_list, &ipvs->dest_trash);
>  	dest->idle_start = 0;
>  	spin_unlock_bh(&ipvs->dest_trash_lock);
> +
> +	/* Queue up delayed work to expire all no estination connections.
> +	 * No-op when CONFIG_SYSCTL is disabled.
> +	 */
> +	if (!cleanup)
> +		ip_vs_enqueue_expire_nodest_conns(ipvs);
>  }
>  
>  
> @@ -4065,6 +4082,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>  	INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
>  	schedule_delayed_work(&ipvs->defense_work, DEFENSE_TIMER_PERIOD);
>  
> +	/* Init delayed work for expiring no dest conn */
> +	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
> +			  expire_nodest_conn_handler);
> +
>  	return 0;
>  }
>  
> @@ -4072,6 +4093,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
>  {
>  	struct net *net = ipvs->net;
>  
> +	cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
>  	cancel_delayed_work_sync(&ipvs->defense_work);
>  	cancel_work_sync(&ipvs->defense_work.work);
>  	unregister_net_sysctl_table(ipvs->sysctl_hdr);
> -- 
> 2.20.1
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH] netfilter/ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-06-15 19:24                         ` Julian Anastasov
@ 2020-07-01 21:24                           ` Andrew Sy Kim
  2020-07-02  4:33                             ` Julian Anastasov
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Sy Kim @ 2020-07-01 21:24 UTC (permalink / raw)
  Cc: Julian Anastasov, Wensong Zhang, Simon Horman, lvs-devel,
	netfilter-devel, Andrew Sy Kim
When expire_nodest_conn=1 and a destination is deleted, IPVS does not
expire the existing connections until the next matching incoming packet.
If there are many connection entries from a single client to a single
destination, many packets may get dropped before all the connections are
expired (more likely with lots of UDP traffic). An optimization can be
made where upon deletion of a destination, IPVS queues up delayed work
to immediately expire any connections with a deleted destination. This
ensures any reused source ports from a client (within the IPVS timeouts)
are scheduled to new real servers instead of silently dropped.
This patch was applied on top of Julian's patch "ipvs: allow
connection reuse for unconfirmed conntrack":
https://marc.info/?l=netfilter-devel&m=159361668110823&w=2
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
---
 include/net/ip_vs.h             | 29 ++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_conn.c | 36 +++++++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_core.c | 47 ++++++++++++++-------------------
 net/netfilter/ipvs/ip_vs_ctl.c  | 22 +++++++++++++++
 4 files changed, 107 insertions(+), 27 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 011f407b76fe..91a9e1d590a6 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -14,6 +14,7 @@
 #include <linux/spinlock.h>             /* for struct rwlock_t */
 #include <linux/atomic.h>               /* for struct atomic_t */
 #include <linux/refcount.h>             /* for struct refcount_t */
+#include <linux/workqueue.h>
 
 #include <linux/compiler.h>
 #include <linux/timer.h>
@@ -886,6 +887,8 @@ struct netns_ipvs {
 	atomic_t		conn_out_counter;
 
 #ifdef CONFIG_SYSCTL
+	/* delayed work for expiring no dest connections */
+	struct delayed_work	expire_nodest_conn_work;
 	/* 1/rate drop and drop-entry variables */
 	struct delayed_work	defense_work;   /* Work handler */
 	int			drop_rate;
@@ -1051,6 +1054,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_conn_reuse_mode;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_expire_nodest_conn;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return ipvs->sysctl_schedule_icmp;
@@ -1138,6 +1146,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return 1;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return 0;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return 0;
@@ -1507,6 +1520,22 @@ static inline int ip_vs_todrop(struct netns_ipvs *ipvs)
 static inline int ip_vs_todrop(struct netns_ipvs *ipvs) { return 0; }
 #endif
 
+#ifdef CONFIG_SYSCTL
+/* Enqueue delayed work for expiring no dest connections
+ * Only run when sysctl_expire_nodest=1
+ */
+static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs *ipvs)
+{
+	if (sysctl_expire_nodest_conn(ipvs))
+		queue_delayed_work(system_long_wq,
+				   &ipvs->expire_nodest_conn_work, 1);
+}
+
+void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs);
+#else
+static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs) {}
+#endif
+
 #define IP_VS_DFWD_METHOD(dest) (atomic_read(&(dest)->conn_flags) & \
 				 IP_VS_CONN_F_FWD_MASK)
 
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index b3921ae92740..7a5f57949fb9 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -1389,6 +1389,42 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
 		goto flush_again;
 	}
 }
+
+#ifdef CONFIG_SYSCTL
+void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs)
+{
+	int idx;
+	struct ip_vs_conn *cp, *cp_c;
+	struct ip_vs_dest *dest;
+
+	rcu_read_lock();
+	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
+		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
+			if (cp->ipvs != ipvs)
+				continue;
+
+			dest = cp->dest;
+			if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
+				continue;
+
+			cp_c = cp->control;
+			IP_VS_DBG(4, "del connection\n");
+			ip_vs_conn_del(cp);
+			if (cp_c && !atomic_read(&cp_c->n_control)) {
+				IP_VS_DBG(4, "del controlling connection\n");
+				ip_vs_conn_del(cp_c);
+			}
+		}
+		cond_resched_rcu();
+
+		/* netns clean up started, abort delayed work */
+		if (!ipvs->enable)
+			return;
+	}
+	rcu_read_unlock();
+}
+#endif
+
 /*
  * per netns init and exit
  */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index b4a6b7662f3f..e3668a6e54e4 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -694,16 +694,10 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_nat_icmp_send;
 }
 
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
-{
-	return ipvs->sysctl_expire_nodest_conn;
-}
-
 #else
 
 static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
 static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs) { return 0; }
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs) { return 0; }
 
 #endif
 
@@ -2097,36 +2091,35 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
 		}
 	}
 
-	if (unlikely(!cp)) {
-		int v;
-
-		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
-			return v;
-	}
-
-	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
-
 	/* Check the server status */
-	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
+	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
 		/* the destination server is not available */
+		if (sysctl_expire_nodest_conn(ipvs)) {
+			bool old_ct = ip_vs_conn_uses_old_conntrack(cp, skb);
 
-		__u32 flags = cp->flags;
-
-		/* when timer already started, silently drop the packet.*/
-		if (timer_pending(&cp->timer))
-			__ip_vs_conn_put(cp);
-		else
-			ip_vs_conn_put(cp);
+			if (!old_ct)
+				cp->flags &= ~IP_VS_CONN_F_NFCT;
 
-		if (sysctl_expire_nodest_conn(ipvs) &&
-		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
-			/* try to expire the connection immediately */
 			ip_vs_conn_expire_now(cp);
+			__ip_vs_conn_put(cp);
+			if (old_ct)
+				return NF_DROP;
+			cp = NULL;
+		} else {
+			__ip_vs_conn_put(cp);
+			return NF_DROP;
 		}
+	}
 
-		return NF_DROP;
+	if (unlikely(!cp)) {
+		int v;
+
+		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
+			return v;
 	}
 
+	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
+
 	ip_vs_in_stats(cp, skb);
 	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
 	if (cp->packet_xmit)
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 4af83f466dfc..f984d2c881ff 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -210,6 +210,17 @@ static void update_defense_level(struct netns_ipvs *ipvs)
 	local_bh_enable();
 }
 
+/* Handler for delayed work for expiring no
+ * destination connections
+ */
+static void expire_nodest_conn_handler(struct work_struct *work)
+{
+	struct netns_ipvs *ipvs;
+
+	ipvs = container_of(work, struct netns_ipvs,
+			    expire_nodest_conn_work.work);
+	ip_vs_expire_nodest_conn_flush(ipvs);
+}
 
 /*
  *	Timer for checking the defense
@@ -1164,6 +1175,12 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
 	list_add(&dest->t_list, &ipvs->dest_trash);
 	dest->idle_start = 0;
 	spin_unlock_bh(&ipvs->dest_trash_lock);
+
+	/* Queue up delayed work to expire all no destination connections.
+	 * No-op when CONFIG_SYSCTL is disabled.
+	 */
+	if (!cleanup)
+		ip_vs_enqueue_expire_nodest_conns(ipvs);
 }
 
 
@@ -4086,6 +4103,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	queue_delayed_work(system_long_wq, &ipvs->defense_work,
 			   DEFENSE_TIMER_PERIOD);
 
+	/* Init delayed work for expiring no dest conn */
+	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
+			  expire_nodest_conn_handler);
+
 	return 0;
 }
 
@@ -4093,6 +4114,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
 {
 	struct net *net = ipvs->net;
 
+	cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
 	cancel_delayed_work_sync(&ipvs->defense_work);
 	cancel_work_sync(&ipvs->defense_work.work);
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
-- 
2.20.1
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH] netfilter/ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-07-01 21:24                           ` Andrew Sy Kim
@ 2020-07-02  4:33                             ` Julian Anastasov
  2020-07-08 13:58                               ` [PATCH net-next] " Andrew Sy Kim
  0 siblings, 1 reply; 31+ messages in thread
From: Julian Anastasov @ 2020-07-02  4:33 UTC (permalink / raw)
  To: Andrew Sy Kim; +Cc: Wensong Zhang, Simon Horman, lvs-devel, netfilter-devel
	Hello,
On Wed, 1 Jul 2020, Andrew Sy Kim wrote:
> When expire_nodest_conn=1 and a destination is deleted, IPVS does not
> expire the existing connections until the next matching incoming packet.
> If there are many connection entries from a single client to a single
> destination, many packets may get dropped before all the connections are
> expired (more likely with lots of UDP traffic). An optimization can be
> made where upon deletion of a destination, IPVS queues up delayed work
> to immediately expire any connections with a deleted destination. This
> ensures any reused source ports from a client (within the IPVS timeouts)
> are scheduled to new real servers instead of silently dropped.
> 
> This patch was applied on top of Julian's patch "ipvs: allow
> connection reuse for unconfirmed conntrack":
> https://marc.info/?l=netfilter-devel&m=159361668110823&w=2
	Above paragraph should be after the following --- line
> 
> Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> ---
>  include/net/ip_vs.h             | 29 ++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_conn.c | 36 +++++++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_core.c | 47 ++++++++++++++-------------------
>  net/netfilter/ipvs/ip_vs_ctl.c  | 22 +++++++++++++++
>  4 files changed, 107 insertions(+), 27 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 011f407b76fe..91a9e1d590a6 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -14,6 +14,7 @@
>  #include <linux/spinlock.h>             /* for struct rwlock_t */
>  #include <linux/atomic.h>               /* for struct atomic_t */
>  #include <linux/refcount.h>             /* for struct refcount_t */
> +#include <linux/workqueue.h>
>  
>  #include <linux/compiler.h>
>  #include <linux/timer.h>
> @@ -886,6 +887,8 @@ struct netns_ipvs {
>  	atomic_t		conn_out_counter;
>  
>  #ifdef CONFIG_SYSCTL
> +	/* delayed work for expiring no dest connections */
> +	struct delayed_work	expire_nodest_conn_work;
>  	/* 1/rate drop and drop-entry variables */
>  	struct delayed_work	defense_work;   /* Work handler */
>  	int			drop_rate;
> @@ -1051,6 +1054,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
>  	return ipvs->sysctl_conn_reuse_mode;
>  }
>  
> +static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> +{
> +	return ipvs->sysctl_expire_nodest_conn;
> +}
> +
>  static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
>  {
>  	return ipvs->sysctl_schedule_icmp;
> @@ -1138,6 +1146,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
>  	return 1;
>  }
>  
> +static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> +{
> +	return 0;
> +}
> +
>  static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
>  {
>  	return 0;
> @@ -1507,6 +1520,22 @@ static inline int ip_vs_todrop(struct netns_ipvs *ipvs)
>  static inline int ip_vs_todrop(struct netns_ipvs *ipvs) { return 0; }
>  #endif
>  
> +#ifdef CONFIG_SYSCTL
> +/* Enqueue delayed work for expiring no dest connections
> + * Only run when sysctl_expire_nodest=1
> + */
> +static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs *ipvs)
> +{
> +	if (sysctl_expire_nodest_conn(ipvs))
> +		queue_delayed_work(system_long_wq,
> +				   &ipvs->expire_nodest_conn_work, 1);
> +}
> +
> +void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs);
> +#else
> +static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs) {}
> +#endif
> +
>  #define IP_VS_DFWD_METHOD(dest) (atomic_read(&(dest)->conn_flags) & \
>  				 IP_VS_CONN_F_FWD_MASK)
>  
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index b3921ae92740..7a5f57949fb9 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -1389,6 +1389,42 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
>  		goto flush_again;
>  	}
>  }
> +
> +#ifdef CONFIG_SYSCTL
> +void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs)
> +{
> +	int idx;
> +	struct ip_vs_conn *cp, *cp_c;
> +	struct ip_vs_dest *dest;
> +
> +	rcu_read_lock();
> +	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> +		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
> +			if (cp->ipvs != ipvs)
> +				continue;
> +
> +			dest = cp->dest;
> +			if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
> +				continue;
> +
	It is better if we have here:
			if (atomic_read(&cp->n_control))
				continue;
> +			cp_c = cp->control;
> +			IP_VS_DBG(4, "del connection\n");
> +			ip_vs_conn_del(cp);
> +			if (cp_c && !atomic_read(&cp_c->n_control)) {
> +				IP_VS_DBG(4, "del controlling connection\n");
> +				ip_vs_conn_del(cp_c);
> +			}
> +		}
> +		cond_resched_rcu();
> +
> +		/* netns clean up started, abort delayed work */
> +		if (!ipvs->enable)
> +			return;
> +	}
> +	rcu_read_unlock();
> +}
> +#endif
> +
>  /*
>   * per netns init and exit
>   */
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index b4a6b7662f3f..e3668a6e54e4 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -694,16 +694,10 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
>  	return ipvs->sysctl_nat_icmp_send;
>  }
>  
> -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> -{
> -	return ipvs->sysctl_expire_nodest_conn;
> -}
> -
>  #else
>  
>  static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
>  static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs) { return 0; }
> -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs) { return 0; }
>  
>  #endif
>  
> @@ -2097,36 +2091,35 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
>  		}
>  	}
>  
> -	if (unlikely(!cp)) {
> -		int v;
> -
> -		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> -			return v;
> -	}
> -
> -	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> -
>  	/* Check the server status */
> -	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
> +	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
>  		/* the destination server is not available */
> +		if (sysctl_expire_nodest_conn(ipvs)) {
> +			bool old_ct = ip_vs_conn_uses_old_conntrack(cp, skb);
>  
> -		__u32 flags = cp->flags;
> -
> -		/* when timer already started, silently drop the packet.*/
> -		if (timer_pending(&cp->timer))
> -			__ip_vs_conn_put(cp);
> -		else
> -			ip_vs_conn_put(cp);
> +			if (!old_ct)
> +				cp->flags &= ~IP_VS_CONN_F_NFCT;
>  
> -		if (sysctl_expire_nodest_conn(ipvs) &&
> -		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
> -			/* try to expire the connection immediately */
>  			ip_vs_conn_expire_now(cp);
> +			__ip_vs_conn_put(cp);
> +			if (old_ct)
> +				return NF_DROP;
> +			cp = NULL;
> +		} else {
> +			__ip_vs_conn_put(cp);
> +			return NF_DROP;
>  		}
> +	}
>  
> -		return NF_DROP;
> +	if (unlikely(!cp)) {
> +		int v;
> +
> +		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> +			return v;
>  	}
>  
> +	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> +
>  	ip_vs_in_stats(cp, skb);
>  	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
>  	if (cp->packet_xmit)
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 4af83f466dfc..f984d2c881ff 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -210,6 +210,17 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>  	local_bh_enable();
>  }
>  
> +/* Handler for delayed work for expiring no
> + * destination connections
> + */
> +static void expire_nodest_conn_handler(struct work_struct *work)
> +{
> +	struct netns_ipvs *ipvs;
> +
> +	ipvs = container_of(work, struct netns_ipvs,
> +			    expire_nodest_conn_work.work);
> +	ip_vs_expire_nodest_conn_flush(ipvs);
> +}
>  
>  /*
>   *	Timer for checking the defense
> @@ -1164,6 +1175,12 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
>  	list_add(&dest->t_list, &ipvs->dest_trash);
>  	dest->idle_start = 0;
>  	spin_unlock_bh(&ipvs->dest_trash_lock);
> +
> +	/* Queue up delayed work to expire all no destination connections.
> +	 * No-op when CONFIG_SYSCTL is disabled.
> +	 */
> +	if (!cleanup)
> +		ip_vs_enqueue_expire_nodest_conns(ipvs);
>  }
>  
>  
> @@ -4086,6 +4103,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>  	queue_delayed_work(system_long_wq, &ipvs->defense_work,
>  			   DEFENSE_TIMER_PERIOD);
>  
> +	/* Init delayed work for expiring no dest conn */
> +	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
> +			  expire_nodest_conn_handler);
> +
>  	return 0;
>  }
>  
> @@ -4093,6 +4114,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
>  {
>  	struct net *net = ipvs->net;
>  
> +	cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
>  	cancel_delayed_work_sync(&ipvs->defense_work);
>  	cancel_work_sync(&ipvs->defense_work.work);
>  	unregister_net_sysctl_table(ipvs->sysctl_hdr);
> -- 
> 2.20.1
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCH net-next] netfilter/ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-07-02  4:33                             ` Julian Anastasov
@ 2020-07-08 13:58                               ` Andrew Sy Kim
  2020-07-08 16:00                                 ` Julian Anastasov
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Sy Kim @ 2020-07-08 13:58 UTC (permalink / raw)
  Cc: Julian Anastasov, Wensong Zhang, Simon Horman, lvs-devel,
	netfilter-devel, Andrew Sy Kim
When expire_nodest_conn=1 and a destination is deleted, IPVS does not
expire the existing connections until the next matching incoming packet.
If there are many connection entries from a single client to a single
destination, many packets may get dropped before all the connections are
expired (more likely with lots of UDP traffic). An optimization can be
made where upon deletion of a destination, IPVS queues up delayed work
to immediately expire any connections with a deleted destination. This
ensures any reused source ports from a client (within the IPVS timeouts)
are scheduled to new real servers instead of silently dropped.
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
---
This patch was applied on top of Julian's patch "ipvs: allow
connection reuse for unconfirmed conntrack":
https://marc.info/?l=netfilter-devel&m=159361668110823&w=2
 include/net/ip_vs.h             | 29 ++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_conn.c | 39 +++++++++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_core.c | 47 ++++++++++++++-------------------
 net/netfilter/ipvs/ip_vs_ctl.c  | 22 +++++++++++++++
 4 files changed, 110 insertions(+), 27 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 011f407b76fe..91a9e1d590a6 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -14,6 +14,7 @@
 #include <linux/spinlock.h>             /* for struct rwlock_t */
 #include <linux/atomic.h>               /* for struct atomic_t */
 #include <linux/refcount.h>             /* for struct refcount_t */
+#include <linux/workqueue.h>
 
 #include <linux/compiler.h>
 #include <linux/timer.h>
@@ -886,6 +887,8 @@ struct netns_ipvs {
 	atomic_t		conn_out_counter;
 
 #ifdef CONFIG_SYSCTL
+	/* delayed work for expiring no dest connections */
+	struct delayed_work	expire_nodest_conn_work;
 	/* 1/rate drop and drop-entry variables */
 	struct delayed_work	defense_work;   /* Work handler */
 	int			drop_rate;
@@ -1051,6 +1054,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_conn_reuse_mode;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_expire_nodest_conn;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return ipvs->sysctl_schedule_icmp;
@@ -1138,6 +1146,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return 1;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return 0;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return 0;
@@ -1507,6 +1520,22 @@ static inline int ip_vs_todrop(struct netns_ipvs *ipvs)
 static inline int ip_vs_todrop(struct netns_ipvs *ipvs) { return 0; }
 #endif
 
+#ifdef CONFIG_SYSCTL
+/* Enqueue delayed work for expiring no dest connections
+ * Only run when sysctl_expire_nodest=1
+ */
+static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs *ipvs)
+{
+	if (sysctl_expire_nodest_conn(ipvs))
+		queue_delayed_work(system_long_wq,
+				   &ipvs->expire_nodest_conn_work, 1);
+}
+
+void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs);
+#else
+static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs) {}
+#endif
+
 #define IP_VS_DFWD_METHOD(dest) (atomic_read(&(dest)->conn_flags) & \
 				 IP_VS_CONN_F_FWD_MASK)
 
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index b3921ae92740..a5e9b2d55e57 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -1389,6 +1389,45 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
 		goto flush_again;
 	}
 }
+
+#ifdef CONFIG_SYSCTL
+void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs)
+{
+	int idx;
+	struct ip_vs_conn *cp, *cp_c;
+	struct ip_vs_dest *dest;
+
+	rcu_read_lock();
+	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
+		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
+			if (cp->ipvs != ipvs)
+				continue;
+
+			dest = cp->dest;
+			if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
+				continue;
+
+			if (atomic_read(&cp->n_control))
+				continue;
+
+			cp_c = cp->control;
+			IP_VS_DBG(4, "del connection\n");
+			ip_vs_conn_del(cp);
+			if (cp_c && !atomic_read(&cp_c->n_control)) {
+				IP_VS_DBG(4, "del controlling connection\n");
+				ip_vs_conn_del(cp_c);
+			}
+		}
+		cond_resched_rcu();
+
+		/* netns clean up started, abort delayed work */
+		if (!ipvs->enable)
+			return;
+	}
+	rcu_read_unlock();
+}
+#endif
+
 /*
  * per netns init and exit
  */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index b4a6b7662f3f..e3668a6e54e4 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -694,16 +694,10 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_nat_icmp_send;
 }
 
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
-{
-	return ipvs->sysctl_expire_nodest_conn;
-}
-
 #else
 
 static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
 static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs) { return 0; }
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs) { return 0; }
 
 #endif
 
@@ -2097,36 +2091,35 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
 		}
 	}
 
-	if (unlikely(!cp)) {
-		int v;
-
-		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
-			return v;
-	}
-
-	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
-
 	/* Check the server status */
-	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
+	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
 		/* the destination server is not available */
+		if (sysctl_expire_nodest_conn(ipvs)) {
+			bool old_ct = ip_vs_conn_uses_old_conntrack(cp, skb);
 
-		__u32 flags = cp->flags;
-
-		/* when timer already started, silently drop the packet.*/
-		if (timer_pending(&cp->timer))
-			__ip_vs_conn_put(cp);
-		else
-			ip_vs_conn_put(cp);
+			if (!old_ct)
+				cp->flags &= ~IP_VS_CONN_F_NFCT;
 
-		if (sysctl_expire_nodest_conn(ipvs) &&
-		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
-			/* try to expire the connection immediately */
 			ip_vs_conn_expire_now(cp);
+			__ip_vs_conn_put(cp);
+			if (old_ct)
+				return NF_DROP;
+			cp = NULL;
+		} else {
+			__ip_vs_conn_put(cp);
+			return NF_DROP;
 		}
+	}
 
-		return NF_DROP;
+	if (unlikely(!cp)) {
+		int v;
+
+		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
+			return v;
 	}
 
+	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
+
 	ip_vs_in_stats(cp, skb);
 	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
 	if (cp->packet_xmit)
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 4af83f466dfc..f984d2c881ff 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -210,6 +210,17 @@ static void update_defense_level(struct netns_ipvs *ipvs)
 	local_bh_enable();
 }
 
+/* Handler for delayed work for expiring no
+ * destination connections
+ */
+static void expire_nodest_conn_handler(struct work_struct *work)
+{
+	struct netns_ipvs *ipvs;
+
+	ipvs = container_of(work, struct netns_ipvs,
+			    expire_nodest_conn_work.work);
+	ip_vs_expire_nodest_conn_flush(ipvs);
+}
 
 /*
  *	Timer for checking the defense
@@ -1164,6 +1175,12 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
 	list_add(&dest->t_list, &ipvs->dest_trash);
 	dest->idle_start = 0;
 	spin_unlock_bh(&ipvs->dest_trash_lock);
+
+	/* Queue up delayed work to expire all no destination connections.
+	 * No-op when CONFIG_SYSCTL is disabled.
+	 */
+	if (!cleanup)
+		ip_vs_enqueue_expire_nodest_conns(ipvs);
 }
 
 
@@ -4086,6 +4103,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	queue_delayed_work(system_long_wq, &ipvs->defense_work,
 			   DEFENSE_TIMER_PERIOD);
 
+	/* Init delayed work for expiring no dest conn */
+	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
+			  expire_nodest_conn_handler);
+
 	return 0;
 }
 
@@ -4093,6 +4114,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
 {
 	struct net *net = ipvs->net;
 
+	cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
 	cancel_delayed_work_sync(&ipvs->defense_work);
 	cancel_work_sync(&ipvs->defense_work.work);
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
-- 
2.20.1
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH net-next] netfilter/ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-07-08 13:58                               ` [PATCH net-next] " Andrew Sy Kim
@ 2020-07-08 16:00                                 ` Julian Anastasov
  2020-07-08 16:06                                   ` [PATCHv2 net-next] ipvs: " Andrew Sy Kim
  0 siblings, 1 reply; 31+ messages in thread
From: Julian Anastasov @ 2020-07-08 16:00 UTC (permalink / raw)
  To: Andrew Sy Kim; +Cc: Wensong Zhang, Simon Horman, lvs-devel, netfilter-devel
	Hello,
On Wed, 8 Jul 2020, Andrew Sy Kim wrote:
> When expire_nodest_conn=1 and a destination is deleted, IPVS does not
> expire the existing connections until the next matching incoming packet.
> If there are many connection entries from a single client to a single
> destination, many packets may get dropped before all the connections are
> expired (more likely with lots of UDP traffic). An optimization can be
> made where upon deletion of a destination, IPVS queues up delayed work
> to immediately expire any connections with a deleted destination. This
> ensures any reused source ports from a client (within the IPVS timeouts)
> are scheduled to new real servers instead of silently dropped.
> 
> Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> 
	Looks good to me, thanks!
	There is one empty line here that can be removed
before applying the patch. Also, post any future patches with
version tag, eg. PATCHv2 net-next, etc. For IPVS, "ipvs:"
prefix should be enough.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---
> This patch was applied on top of Julian's patch "ipvs: allow
> connection reuse for unconfirmed conntrack":
> https://marc.info/?l=netfilter-devel&m=159361668110823&w=2
> 
>  include/net/ip_vs.h             | 29 ++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_conn.c | 39 +++++++++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_core.c | 47 ++++++++++++++-------------------
>  net/netfilter/ipvs/ip_vs_ctl.c  | 22 +++++++++++++++
>  4 files changed, 110 insertions(+), 27 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 011f407b76fe..91a9e1d590a6 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -14,6 +14,7 @@
>  #include <linux/spinlock.h>             /* for struct rwlock_t */
>  #include <linux/atomic.h>               /* for struct atomic_t */
>  #include <linux/refcount.h>             /* for struct refcount_t */
> +#include <linux/workqueue.h>
>  
>  #include <linux/compiler.h>
>  #include <linux/timer.h>
> @@ -886,6 +887,8 @@ struct netns_ipvs {
>  	atomic_t		conn_out_counter;
>  
>  #ifdef CONFIG_SYSCTL
> +	/* delayed work for expiring no dest connections */
> +	struct delayed_work	expire_nodest_conn_work;
>  	/* 1/rate drop and drop-entry variables */
>  	struct delayed_work	defense_work;   /* Work handler */
>  	int			drop_rate;
> @@ -1051,6 +1054,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
>  	return ipvs->sysctl_conn_reuse_mode;
>  }
>  
> +static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> +{
> +	return ipvs->sysctl_expire_nodest_conn;
> +}
> +
>  static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
>  {
>  	return ipvs->sysctl_schedule_icmp;
> @@ -1138,6 +1146,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
>  	return 1;
>  }
>  
> +static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> +{
> +	return 0;
> +}
> +
>  static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
>  {
>  	return 0;
> @@ -1507,6 +1520,22 @@ static inline int ip_vs_todrop(struct netns_ipvs *ipvs)
>  static inline int ip_vs_todrop(struct netns_ipvs *ipvs) { return 0; }
>  #endif
>  
> +#ifdef CONFIG_SYSCTL
> +/* Enqueue delayed work for expiring no dest connections
> + * Only run when sysctl_expire_nodest=1
> + */
> +static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs *ipvs)
> +{
> +	if (sysctl_expire_nodest_conn(ipvs))
> +		queue_delayed_work(system_long_wq,
> +				   &ipvs->expire_nodest_conn_work, 1);
> +}
> +
> +void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs);
> +#else
> +static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs) {}
> +#endif
> +
>  #define IP_VS_DFWD_METHOD(dest) (atomic_read(&(dest)->conn_flags) & \
>  				 IP_VS_CONN_F_FWD_MASK)
>  
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index b3921ae92740..a5e9b2d55e57 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -1389,6 +1389,45 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
>  		goto flush_again;
>  	}
>  }
> +
> +#ifdef CONFIG_SYSCTL
> +void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs)
> +{
> +	int idx;
> +	struct ip_vs_conn *cp, *cp_c;
> +	struct ip_vs_dest *dest;
> +
> +	rcu_read_lock();
> +	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> +		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
> +			if (cp->ipvs != ipvs)
> +				continue;
> +
> +			dest = cp->dest;
> +			if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
> +				continue;
> +
> +			if (atomic_read(&cp->n_control))
> +				continue;
> +
> +			cp_c = cp->control;
> +			IP_VS_DBG(4, "del connection\n");
> +			ip_vs_conn_del(cp);
> +			if (cp_c && !atomic_read(&cp_c->n_control)) {
> +				IP_VS_DBG(4, "del controlling connection\n");
> +				ip_vs_conn_del(cp_c);
> +			}
> +		}
> +		cond_resched_rcu();
> +
> +		/* netns clean up started, abort delayed work */
> +		if (!ipvs->enable)
> +			return;
> +	}
> +	rcu_read_unlock();
> +}
> +#endif
> +
>  /*
>   * per netns init and exit
>   */
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index b4a6b7662f3f..e3668a6e54e4 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -694,16 +694,10 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
>  	return ipvs->sysctl_nat_icmp_send;
>  }
>  
> -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> -{
> -	return ipvs->sysctl_expire_nodest_conn;
> -}
> -
>  #else
>  
>  static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
>  static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs) { return 0; }
> -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs) { return 0; }
>  
>  #endif
>  
> @@ -2097,36 +2091,35 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
>  		}
>  	}
>  
> -	if (unlikely(!cp)) {
> -		int v;
> -
> -		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> -			return v;
> -	}
> -
> -	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> -
>  	/* Check the server status */
> -	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
> +	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
>  		/* the destination server is not available */
> +		if (sysctl_expire_nodest_conn(ipvs)) {
> +			bool old_ct = ip_vs_conn_uses_old_conntrack(cp, skb);
>  
> -		__u32 flags = cp->flags;
> -
> -		/* when timer already started, silently drop the packet.*/
> -		if (timer_pending(&cp->timer))
> -			__ip_vs_conn_put(cp);
> -		else
> -			ip_vs_conn_put(cp);
> +			if (!old_ct)
> +				cp->flags &= ~IP_VS_CONN_F_NFCT;
>  
> -		if (sysctl_expire_nodest_conn(ipvs) &&
> -		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
> -			/* try to expire the connection immediately */
>  			ip_vs_conn_expire_now(cp);
> +			__ip_vs_conn_put(cp);
> +			if (old_ct)
> +				return NF_DROP;
> +			cp = NULL;
> +		} else {
> +			__ip_vs_conn_put(cp);
> +			return NF_DROP;
>  		}
> +	}
>  
> -		return NF_DROP;
> +	if (unlikely(!cp)) {
> +		int v;
> +
> +		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> +			return v;
>  	}
>  
> +	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> +
>  	ip_vs_in_stats(cp, skb);
>  	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
>  	if (cp->packet_xmit)
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 4af83f466dfc..f984d2c881ff 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -210,6 +210,17 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>  	local_bh_enable();
>  }
>  
> +/* Handler for delayed work for expiring no
> + * destination connections
> + */
> +static void expire_nodest_conn_handler(struct work_struct *work)
> +{
> +	struct netns_ipvs *ipvs;
> +
> +	ipvs = container_of(work, struct netns_ipvs,
> +			    expire_nodest_conn_work.work);
> +	ip_vs_expire_nodest_conn_flush(ipvs);
> +}
>  
>  /*
>   *	Timer for checking the defense
> @@ -1164,6 +1175,12 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
>  	list_add(&dest->t_list, &ipvs->dest_trash);
>  	dest->idle_start = 0;
>  	spin_unlock_bh(&ipvs->dest_trash_lock);
> +
> +	/* Queue up delayed work to expire all no destination connections.
> +	 * No-op when CONFIG_SYSCTL is disabled.
> +	 */
> +	if (!cleanup)
> +		ip_vs_enqueue_expire_nodest_conns(ipvs);
>  }
>  
>  
> @@ -4086,6 +4103,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>  	queue_delayed_work(system_long_wq, &ipvs->defense_work,
>  			   DEFENSE_TIMER_PERIOD);
>  
> +	/* Init delayed work for expiring no dest conn */
> +	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
> +			  expire_nodest_conn_handler);
> +
>  	return 0;
>  }
>  
> @@ -4093,6 +4114,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
>  {
>  	struct net *net = ipvs->net;
>  
> +	cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
>  	cancel_delayed_work_sync(&ipvs->defense_work);
>  	cancel_work_sync(&ipvs->defense_work.work);
>  	unregister_net_sysctl_table(ipvs->sysctl_hdr);
> -- 
> 2.20.1
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCHv2 net-next] ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-07-08 16:00                                 ` Julian Anastasov
@ 2020-07-08 16:06                                   ` Andrew Sy Kim
  2020-07-08 16:12                                     ` Pablo Neira Ayuso
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Sy Kim @ 2020-07-08 16:06 UTC (permalink / raw)
  Cc: Julian Anastasov, Wensong Zhang, Simon Horman, lvs-devel,
	netfilter-devel, Andrew Sy Kim
When expire_nodest_conn=1 and a destination is deleted, IPVS does not
expire the existing connections until the next matching incoming packet.
If there are many connection entries from a single client to a single
destination, many packets may get dropped before all the connections are
expired (more likely with lots of UDP traffic). An optimization can be
made where upon deletion of a destination, IPVS queues up delayed work
to immediately expire any connections with a deleted destination. This
ensures any reused source ports from a client (within the IPVS timeouts)
are scheduled to new real servers instead of silently dropped.
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
---
This patch was applied on top of Julian's patch "ipvs: allow
connection reuse for unconfirmed conntrack":
https://marc.info/?l=netfilter-devel&m=159361668110823&w=2
 include/net/ip_vs.h             | 29 ++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_conn.c | 39 +++++++++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_core.c | 47 ++++++++++++++-------------------
 net/netfilter/ipvs/ip_vs_ctl.c  | 22 +++++++++++++++
 4 files changed, 110 insertions(+), 27 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 011f407b76fe..91a9e1d590a6 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -14,6 +14,7 @@
 #include <linux/spinlock.h>             /* for struct rwlock_t */
 #include <linux/atomic.h>               /* for struct atomic_t */
 #include <linux/refcount.h>             /* for struct refcount_t */
+#include <linux/workqueue.h>
 
 #include <linux/compiler.h>
 #include <linux/timer.h>
@@ -886,6 +887,8 @@ struct netns_ipvs {
 	atomic_t		conn_out_counter;
 
 #ifdef CONFIG_SYSCTL
+	/* delayed work for expiring no dest connections */
+	struct delayed_work	expire_nodest_conn_work;
 	/* 1/rate drop and drop-entry variables */
 	struct delayed_work	defense_work;   /* Work handler */
 	int			drop_rate;
@@ -1051,6 +1054,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_conn_reuse_mode;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_expire_nodest_conn;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return ipvs->sysctl_schedule_icmp;
@@ -1138,6 +1146,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return 1;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return 0;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return 0;
@@ -1507,6 +1520,22 @@ static inline int ip_vs_todrop(struct netns_ipvs *ipvs)
 static inline int ip_vs_todrop(struct netns_ipvs *ipvs) { return 0; }
 #endif
 
+#ifdef CONFIG_SYSCTL
+/* Enqueue delayed work for expiring no dest connections
+ * Only run when sysctl_expire_nodest=1
+ */
+static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs *ipvs)
+{
+	if (sysctl_expire_nodest_conn(ipvs))
+		queue_delayed_work(system_long_wq,
+				   &ipvs->expire_nodest_conn_work, 1);
+}
+
+void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs);
+#else
+static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs) {}
+#endif
+
 #define IP_VS_DFWD_METHOD(dest) (atomic_read(&(dest)->conn_flags) & \
 				 IP_VS_CONN_F_FWD_MASK)
 
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index b3921ae92740..a5e9b2d55e57 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -1389,6 +1389,45 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
 		goto flush_again;
 	}
 }
+
+#ifdef CONFIG_SYSCTL
+void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs)
+{
+	int idx;
+	struct ip_vs_conn *cp, *cp_c;
+	struct ip_vs_dest *dest;
+
+	rcu_read_lock();
+	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
+		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
+			if (cp->ipvs != ipvs)
+				continue;
+
+			dest = cp->dest;
+			if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
+				continue;
+
+			if (atomic_read(&cp->n_control))
+				continue;
+
+			cp_c = cp->control;
+			IP_VS_DBG(4, "del connection\n");
+			ip_vs_conn_del(cp);
+			if (cp_c && !atomic_read(&cp_c->n_control)) {
+				IP_VS_DBG(4, "del controlling connection\n");
+				ip_vs_conn_del(cp_c);
+			}
+		}
+		cond_resched_rcu();
+
+		/* netns clean up started, abort delayed work */
+		if (!ipvs->enable)
+			return;
+	}
+	rcu_read_unlock();
+}
+#endif
+
 /*
  * per netns init and exit
  */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index b4a6b7662f3f..e3668a6e54e4 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -694,16 +694,10 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_nat_icmp_send;
 }
 
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
-{
-	return ipvs->sysctl_expire_nodest_conn;
-}
-
 #else
 
 static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
 static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs) { return 0; }
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs) { return 0; }
 
 #endif
 
@@ -2097,36 +2091,35 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
 		}
 	}
 
-	if (unlikely(!cp)) {
-		int v;
-
-		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
-			return v;
-	}
-
-	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
-
 	/* Check the server status */
-	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
+	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
 		/* the destination server is not available */
+		if (sysctl_expire_nodest_conn(ipvs)) {
+			bool old_ct = ip_vs_conn_uses_old_conntrack(cp, skb);
 
-		__u32 flags = cp->flags;
-
-		/* when timer already started, silently drop the packet.*/
-		if (timer_pending(&cp->timer))
-			__ip_vs_conn_put(cp);
-		else
-			ip_vs_conn_put(cp);
+			if (!old_ct)
+				cp->flags &= ~IP_VS_CONN_F_NFCT;
 
-		if (sysctl_expire_nodest_conn(ipvs) &&
-		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
-			/* try to expire the connection immediately */
 			ip_vs_conn_expire_now(cp);
+			__ip_vs_conn_put(cp);
+			if (old_ct)
+				return NF_DROP;
+			cp = NULL;
+		} else {
+			__ip_vs_conn_put(cp);
+			return NF_DROP;
 		}
+	}
 
-		return NF_DROP;
+	if (unlikely(!cp)) {
+		int v;
+
+		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
+			return v;
 	}
 
+	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
+
 	ip_vs_in_stats(cp, skb);
 	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
 	if (cp->packet_xmit)
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 4af83f466dfc..f984d2c881ff 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -210,6 +210,17 @@ static void update_defense_level(struct netns_ipvs *ipvs)
 	local_bh_enable();
 }
 
+/* Handler for delayed work for expiring no
+ * destination connections
+ */
+static void expire_nodest_conn_handler(struct work_struct *work)
+{
+	struct netns_ipvs *ipvs;
+
+	ipvs = container_of(work, struct netns_ipvs,
+			    expire_nodest_conn_work.work);
+	ip_vs_expire_nodest_conn_flush(ipvs);
+}
 
 /*
  *	Timer for checking the defense
@@ -1164,6 +1175,12 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
 	list_add(&dest->t_list, &ipvs->dest_trash);
 	dest->idle_start = 0;
 	spin_unlock_bh(&ipvs->dest_trash_lock);
+
+	/* Queue up delayed work to expire all no destination connections.
+	 * No-op when CONFIG_SYSCTL is disabled.
+	 */
+	if (!cleanup)
+		ip_vs_enqueue_expire_nodest_conns(ipvs);
 }
 
 
@@ -4086,6 +4103,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	queue_delayed_work(system_long_wq, &ipvs->defense_work,
 			   DEFENSE_TIMER_PERIOD);
 
+	/* Init delayed work for expiring no dest conn */
+	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
+			  expire_nodest_conn_handler);
+
 	return 0;
 }
 
@@ -4093,6 +4114,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
 {
 	struct net *net = ipvs->net;
 
+	cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
 	cancel_delayed_work_sync(&ipvs->defense_work);
 	cancel_work_sync(&ipvs->defense_work.work);
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
-- 
2.20.1
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCHv2 net-next] ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-07-08 16:06                                   ` [PATCHv2 net-next] ipvs: " Andrew Sy Kim
@ 2020-07-08 16:12                                     ` Pablo Neira Ayuso
  2020-07-08 16:14                                       ` Andrew Kim
  2020-07-08 16:16                                       ` [PATCH " Andrew Sy Kim
  0 siblings, 2 replies; 31+ messages in thread
From: Pablo Neira Ayuso @ 2020-07-08 16:12 UTC (permalink / raw)
  To: Andrew Sy Kim
  Cc: Julian Anastasov, Wensong Zhang, Simon Horman, lvs-devel,
	netfilter-devel
On Wed, Jul 08, 2020 at 12:06:18PM -0400, Andrew Sy Kim wrote:
> When expire_nodest_conn=1 and a destination is deleted, IPVS does not
> expire the existing connections until the next matching incoming packet.
> If there are many connection entries from a single client to a single
> destination, many packets may get dropped before all the connections are
> expired (more likely with lots of UDP traffic). An optimization can be
> made where upon deletion of a destination, IPVS queues up delayed work
> to immediately expire any connections with a deleted destination. This
> ensures any reused source ports from a client (within the IPVS timeouts)
> are scheduled to new real servers instead of silently dropped.
Is this the same patch ?
https://patchwork.ozlabs.org/project/netfilter-devel/patch/20200708135854.28944-1-kim.andrewsy@gmail.com/
Julian has "Signed-off-by:" previous patch and this v2 does not say
what has been updated.
Thanks.
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCHv2 net-next] ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-07-08 16:12                                     ` Pablo Neira Ayuso
@ 2020-07-08 16:14                                       ` Andrew Kim
  2020-07-08 16:16                                       ` [PATCH " Andrew Sy Kim
  1 sibling, 0 replies; 31+ messages in thread
From: Andrew Kim @ 2020-07-08 16:14 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Julian Anastasov, Wensong Zhang, Simon Horman, open list:IPVS,
	open list:NETFILTER
Sorry -- I think I misunderstood what Julian said. Nothing has changed
aside from some formatting on the patch file.
Will resend this patch with the v2 in the subject removed.
On Wed, Jul 8, 2020 at 12:12 PM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>
> On Wed, Jul 08, 2020 at 12:06:18PM -0400, Andrew Sy Kim wrote:
> > When expire_nodest_conn=1 and a destination is deleted, IPVS does not
> > expire the existing connections until the next matching incoming packet.
> > If there are many connection entries from a single client to a single
> > destination, many packets may get dropped before all the connections are
> > expired (more likely with lots of UDP traffic). An optimization can be
> > made where upon deletion of a destination, IPVS queues up delayed work
> > to immediately expire any connections with a deleted destination. This
> > ensures any reused source ports from a client (within the IPVS timeouts)
> > are scheduled to new real servers instead of silently dropped.
>
> Is this the same patch ?
>
> https://patchwork.ozlabs.org/project/netfilter-devel/patch/20200708135854.28944-1-kim.andrewsy@gmail.com/
>
> Julian has "Signed-off-by:" previous patch and this v2 does not say
> what has been updated.
>
> Thanks.
^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCH net-next] ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-07-08 16:12                                     ` Pablo Neira Ayuso
  2020-07-08 16:14                                       ` Andrew Kim
@ 2020-07-08 16:16                                       ` Andrew Sy Kim
  2020-07-08 17:19                                         ` Julian Anastasov
  1 sibling, 1 reply; 31+ messages in thread
From: Andrew Sy Kim @ 2020-07-08 16:16 UTC (permalink / raw)
  Cc: Julian Anastasov, Wensong Zhang, Simon Horman, lvs-devel,
	netfilter-devel, Andrew Sy Kim
When expire_nodest_conn=1 and a destination is deleted, IPVS does not
expire the existing connections until the next matching incoming packet.
If there are many connection entries from a single client to a single
destination, many packets may get dropped before all the connections are
expired (more likely with lots of UDP traffic). An optimization can be
made where upon deletion of a destination, IPVS queues up delayed work
to immediately expire any connections with a deleted destination. This
ensures any reused source ports from a client (within the IPVS timeouts)
are scheduled to new real servers instead of silently dropped.
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
---
This patch was applied on top of Julian's patch "ipvs: allow
connection reuse for unconfirmed conntrack":
https://marc.info/?l=netfilter-devel&m=159361668110823&w=2
 include/net/ip_vs.h             | 29 ++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_conn.c | 39 +++++++++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_core.c | 47 ++++++++++++++-------------------
 net/netfilter/ipvs/ip_vs_ctl.c  | 22 +++++++++++++++
 4 files changed, 110 insertions(+), 27 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 011f407b76fe..91a9e1d590a6 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -14,6 +14,7 @@
 #include <linux/spinlock.h>             /* for struct rwlock_t */
 #include <linux/atomic.h>               /* for struct atomic_t */
 #include <linux/refcount.h>             /* for struct refcount_t */
+#include <linux/workqueue.h>
 
 #include <linux/compiler.h>
 #include <linux/timer.h>
@@ -886,6 +887,8 @@ struct netns_ipvs {
 	atomic_t		conn_out_counter;
 
 #ifdef CONFIG_SYSCTL
+	/* delayed work for expiring no dest connections */
+	struct delayed_work	expire_nodest_conn_work;
 	/* 1/rate drop and drop-entry variables */
 	struct delayed_work	defense_work;   /* Work handler */
 	int			drop_rate;
@@ -1051,6 +1054,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_conn_reuse_mode;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_expire_nodest_conn;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return ipvs->sysctl_schedule_icmp;
@@ -1138,6 +1146,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return 1;
 }
 
+static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
+{
+	return 0;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return 0;
@@ -1507,6 +1520,22 @@ static inline int ip_vs_todrop(struct netns_ipvs *ipvs)
 static inline int ip_vs_todrop(struct netns_ipvs *ipvs) { return 0; }
 #endif
 
+#ifdef CONFIG_SYSCTL
+/* Enqueue delayed work for expiring no dest connections
+ * Only run when sysctl_expire_nodest=1
+ */
+static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs *ipvs)
+{
+	if (sysctl_expire_nodest_conn(ipvs))
+		queue_delayed_work(system_long_wq,
+				   &ipvs->expire_nodest_conn_work, 1);
+}
+
+void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs);
+#else
+static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs) {}
+#endif
+
 #define IP_VS_DFWD_METHOD(dest) (atomic_read(&(dest)->conn_flags) & \
 				 IP_VS_CONN_F_FWD_MASK)
 
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index b3921ae92740..a5e9b2d55e57 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -1389,6 +1389,45 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
 		goto flush_again;
 	}
 }
+
+#ifdef CONFIG_SYSCTL
+void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs)
+{
+	int idx;
+	struct ip_vs_conn *cp, *cp_c;
+	struct ip_vs_dest *dest;
+
+	rcu_read_lock();
+	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
+		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
+			if (cp->ipvs != ipvs)
+				continue;
+
+			dest = cp->dest;
+			if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
+				continue;
+
+			if (atomic_read(&cp->n_control))
+				continue;
+
+			cp_c = cp->control;
+			IP_VS_DBG(4, "del connection\n");
+			ip_vs_conn_del(cp);
+			if (cp_c && !atomic_read(&cp_c->n_control)) {
+				IP_VS_DBG(4, "del controlling connection\n");
+				ip_vs_conn_del(cp_c);
+			}
+		}
+		cond_resched_rcu();
+
+		/* netns clean up started, abort delayed work */
+		if (!ipvs->enable)
+			return;
+	}
+	rcu_read_unlock();
+}
+#endif
+
 /*
  * per netns init and exit
  */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index b4a6b7662f3f..e3668a6e54e4 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -694,16 +694,10 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_nat_icmp_send;
 }
 
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
-{
-	return ipvs->sysctl_expire_nodest_conn;
-}
-
 #else
 
 static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
 static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs) { return 0; }
-static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs) { return 0; }
 
 #endif
 
@@ -2097,36 +2091,35 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
 		}
 	}
 
-	if (unlikely(!cp)) {
-		int v;
-
-		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
-			return v;
-	}
-
-	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
-
 	/* Check the server status */
-	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
+	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
 		/* the destination server is not available */
+		if (sysctl_expire_nodest_conn(ipvs)) {
+			bool old_ct = ip_vs_conn_uses_old_conntrack(cp, skb);
 
-		__u32 flags = cp->flags;
-
-		/* when timer already started, silently drop the packet.*/
-		if (timer_pending(&cp->timer))
-			__ip_vs_conn_put(cp);
-		else
-			ip_vs_conn_put(cp);
+			if (!old_ct)
+				cp->flags &= ~IP_VS_CONN_F_NFCT;
 
-		if (sysctl_expire_nodest_conn(ipvs) &&
-		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
-			/* try to expire the connection immediately */
 			ip_vs_conn_expire_now(cp);
+			__ip_vs_conn_put(cp);
+			if (old_ct)
+				return NF_DROP;
+			cp = NULL;
+		} else {
+			__ip_vs_conn_put(cp);
+			return NF_DROP;
 		}
+	}
 
-		return NF_DROP;
+	if (unlikely(!cp)) {
+		int v;
+
+		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
+			return v;
 	}
 
+	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
+
 	ip_vs_in_stats(cp, skb);
 	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
 	if (cp->packet_xmit)
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 4af83f466dfc..f984d2c881ff 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -210,6 +210,17 @@ static void update_defense_level(struct netns_ipvs *ipvs)
 	local_bh_enable();
 }
 
+/* Handler for delayed work for expiring no
+ * destination connections
+ */
+static void expire_nodest_conn_handler(struct work_struct *work)
+{
+	struct netns_ipvs *ipvs;
+
+	ipvs = container_of(work, struct netns_ipvs,
+			    expire_nodest_conn_work.work);
+	ip_vs_expire_nodest_conn_flush(ipvs);
+}
 
 /*
  *	Timer for checking the defense
@@ -1164,6 +1175,12 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
 	list_add(&dest->t_list, &ipvs->dest_trash);
 	dest->idle_start = 0;
 	spin_unlock_bh(&ipvs->dest_trash_lock);
+
+	/* Queue up delayed work to expire all no destination connections.
+	 * No-op when CONFIG_SYSCTL is disabled.
+	 */
+	if (!cleanup)
+		ip_vs_enqueue_expire_nodest_conns(ipvs);
 }
 
 
@@ -4086,6 +4103,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	queue_delayed_work(system_long_wq, &ipvs->defense_work,
 			   DEFENSE_TIMER_PERIOD);
 
+	/* Init delayed work for expiring no dest conn */
+	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
+			  expire_nodest_conn_handler);
+
 	return 0;
 }
 
@@ -4093,6 +4114,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
 {
 	struct net *net = ipvs->net;
 
+	cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
 	cancel_delayed_work_sync(&ipvs->defense_work);
 	cancel_work_sync(&ipvs->defense_work.work);
 	unregister_net_sysctl_table(ipvs->sysctl_hdr);
-- 
2.20.1
^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH net-next] ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-07-08 16:16                                       ` [PATCH " Andrew Sy Kim
@ 2020-07-08 17:19                                         ` Julian Anastasov
  2020-07-15 18:54                                           ` Pablo Neira Ayuso
  0 siblings, 1 reply; 31+ messages in thread
From: Julian Anastasov @ 2020-07-08 17:19 UTC (permalink / raw)
  To: Andrew Sy Kim; +Cc: Wensong Zhang, Simon Horman, lvs-devel, netfilter-devel
	Hello,
On Wed, 8 Jul 2020, Andrew Sy Kim wrote:
> When expire_nodest_conn=1 and a destination is deleted, IPVS does not
> expire the existing connections until the next matching incoming packet.
> If there are many connection entries from a single client to a single
> destination, many packets may get dropped before all the connections are
> expired (more likely with lots of UDP traffic). An optimization can be
> made where upon deletion of a destination, IPVS queues up delayed work
> to immediately expire any connections with a deleted destination. This
> ensures any reused source ports from a client (within the IPVS timeouts)
> are scheduled to new real servers instead of silently dropped.
> 
> Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
	OK, patch content is same, subject has "ipvs:" prefix,
empty line after Signed-off-by is removed, so this patch can
be applied without any modifications.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
	Andrew can read Documentation/process/submitting-patches.rst
after line 751 about using version tags and logging changes
between versions after the --- line.
> ---
> This patch was applied on top of Julian's patch "ipvs: allow
> connection reuse for unconfirmed conntrack":
> https://marc.info/?l=netfilter-devel&m=159361668110823&w=2
> 
>  include/net/ip_vs.h             | 29 ++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_conn.c | 39 +++++++++++++++++++++++++++
>  net/netfilter/ipvs/ip_vs_core.c | 47 ++++++++++++++-------------------
>  net/netfilter/ipvs/ip_vs_ctl.c  | 22 +++++++++++++++
>  4 files changed, 110 insertions(+), 27 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 011f407b76fe..91a9e1d590a6 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -14,6 +14,7 @@
>  #include <linux/spinlock.h>             /* for struct rwlock_t */
>  #include <linux/atomic.h>               /* for struct atomic_t */
>  #include <linux/refcount.h>             /* for struct refcount_t */
> +#include <linux/workqueue.h>
>  
>  #include <linux/compiler.h>
>  #include <linux/timer.h>
> @@ -886,6 +887,8 @@ struct netns_ipvs {
>  	atomic_t		conn_out_counter;
>  
>  #ifdef CONFIG_SYSCTL
> +	/* delayed work for expiring no dest connections */
> +	struct delayed_work	expire_nodest_conn_work;
>  	/* 1/rate drop and drop-entry variables */
>  	struct delayed_work	defense_work;   /* Work handler */
>  	int			drop_rate;
> @@ -1051,6 +1054,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
>  	return ipvs->sysctl_conn_reuse_mode;
>  }
>  
> +static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> +{
> +	return ipvs->sysctl_expire_nodest_conn;
> +}
> +
>  static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
>  {
>  	return ipvs->sysctl_schedule_icmp;
> @@ -1138,6 +1146,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
>  	return 1;
>  }
>  
> +static inline int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> +{
> +	return 0;
> +}
> +
>  static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
>  {
>  	return 0;
> @@ -1507,6 +1520,22 @@ static inline int ip_vs_todrop(struct netns_ipvs *ipvs)
>  static inline int ip_vs_todrop(struct netns_ipvs *ipvs) { return 0; }
>  #endif
>  
> +#ifdef CONFIG_SYSCTL
> +/* Enqueue delayed work for expiring no dest connections
> + * Only run when sysctl_expire_nodest=1
> + */
> +static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs *ipvs)
> +{
> +	if (sysctl_expire_nodest_conn(ipvs))
> +		queue_delayed_work(system_long_wq,
> +				   &ipvs->expire_nodest_conn_work, 1);
> +}
> +
> +void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs);
> +#else
> +static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs) {}
> +#endif
> +
>  #define IP_VS_DFWD_METHOD(dest) (atomic_read(&(dest)->conn_flags) & \
>  				 IP_VS_CONN_F_FWD_MASK)
>  
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index b3921ae92740..a5e9b2d55e57 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -1389,6 +1389,45 @@ static void ip_vs_conn_flush(struct netns_ipvs *ipvs)
>  		goto flush_again;
>  	}
>  }
> +
> +#ifdef CONFIG_SYSCTL
> +void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs)
> +{
> +	int idx;
> +	struct ip_vs_conn *cp, *cp_c;
> +	struct ip_vs_dest *dest;
> +
> +	rcu_read_lock();
> +	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> +		hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[idx], c_list) {
> +			if (cp->ipvs != ipvs)
> +				continue;
> +
> +			dest = cp->dest;
> +			if (!dest || (dest->flags & IP_VS_DEST_F_AVAILABLE))
> +				continue;
> +
> +			if (atomic_read(&cp->n_control))
> +				continue;
> +
> +			cp_c = cp->control;
> +			IP_VS_DBG(4, "del connection\n");
> +			ip_vs_conn_del(cp);
> +			if (cp_c && !atomic_read(&cp_c->n_control)) {
> +				IP_VS_DBG(4, "del controlling connection\n");
> +				ip_vs_conn_del(cp_c);
> +			}
> +		}
> +		cond_resched_rcu();
> +
> +		/* netns clean up started, abort delayed work */
> +		if (!ipvs->enable)
> +			return;
> +	}
> +	rcu_read_unlock();
> +}
> +#endif
> +
>  /*
>   * per netns init and exit
>   */
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index b4a6b7662f3f..e3668a6e54e4 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -694,16 +694,10 @@ static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs)
>  	return ipvs->sysctl_nat_icmp_send;
>  }
>  
> -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs)
> -{
> -	return ipvs->sysctl_expire_nodest_conn;
> -}
> -
>  #else
>  
>  static int sysctl_snat_reroute(struct netns_ipvs *ipvs) { return 0; }
>  static int sysctl_nat_icmp_send(struct netns_ipvs *ipvs) { return 0; }
> -static int sysctl_expire_nodest_conn(struct netns_ipvs *ipvs) { return 0; }
>  
>  #endif
>  
> @@ -2097,36 +2091,35 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int
>  		}
>  	}
>  
> -	if (unlikely(!cp)) {
> -		int v;
> -
> -		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> -			return v;
> -	}
> -
> -	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> -
>  	/* Check the server status */
> -	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
> +	if (cp && cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
>  		/* the destination server is not available */
> +		if (sysctl_expire_nodest_conn(ipvs)) {
> +			bool old_ct = ip_vs_conn_uses_old_conntrack(cp, skb);
>  
> -		__u32 flags = cp->flags;
> -
> -		/* when timer already started, silently drop the packet.*/
> -		if (timer_pending(&cp->timer))
> -			__ip_vs_conn_put(cp);
> -		else
> -			ip_vs_conn_put(cp);
> +			if (!old_ct)
> +				cp->flags &= ~IP_VS_CONN_F_NFCT;
>  
> -		if (sysctl_expire_nodest_conn(ipvs) &&
> -		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
> -			/* try to expire the connection immediately */
>  			ip_vs_conn_expire_now(cp);
> +			__ip_vs_conn_put(cp);
> +			if (old_ct)
> +				return NF_DROP;
> +			cp = NULL;
> +		} else {
> +			__ip_vs_conn_put(cp);
> +			return NF_DROP;
>  		}
> +	}
>  
> -		return NF_DROP;
> +	if (unlikely(!cp)) {
> +		int v;
> +
> +		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
> +			return v;
>  	}
>  
> +	IP_VS_DBG_PKT(11, af, pp, skb, iph.off, "Incoming packet");
> +
>  	ip_vs_in_stats(cp, skb);
>  	ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pd);
>  	if (cp->packet_xmit)
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 4af83f466dfc..f984d2c881ff 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -210,6 +210,17 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>  	local_bh_enable();
>  }
>  
> +/* Handler for delayed work for expiring no
> + * destination connections
> + */
> +static void expire_nodest_conn_handler(struct work_struct *work)
> +{
> +	struct netns_ipvs *ipvs;
> +
> +	ipvs = container_of(work, struct netns_ipvs,
> +			    expire_nodest_conn_work.work);
> +	ip_vs_expire_nodest_conn_flush(ipvs);
> +}
>  
>  /*
>   *	Timer for checking the defense
> @@ -1164,6 +1175,12 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, struct ip_vs_dest *dest,
>  	list_add(&dest->t_list, &ipvs->dest_trash);
>  	dest->idle_start = 0;
>  	spin_unlock_bh(&ipvs->dest_trash_lock);
> +
> +	/* Queue up delayed work to expire all no destination connections.
> +	 * No-op when CONFIG_SYSCTL is disabled.
> +	 */
> +	if (!cleanup)
> +		ip_vs_enqueue_expire_nodest_conns(ipvs);
>  }
>  
>  
> @@ -4086,6 +4103,10 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>  	queue_delayed_work(system_long_wq, &ipvs->defense_work,
>  			   DEFENSE_TIMER_PERIOD);
>  
> +	/* Init delayed work for expiring no dest conn */
> +	INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
> +			  expire_nodest_conn_handler);
> +
>  	return 0;
>  }
>  
> @@ -4093,6 +4114,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
>  {
>  	struct net *net = ipvs->net;
>  
> +	cancel_delayed_work_sync(&ipvs->expire_nodest_conn_work);
>  	cancel_delayed_work_sync(&ipvs->defense_work);
>  	cancel_work_sync(&ipvs->defense_work.work);
>  	unregister_net_sysctl_table(ipvs->sysctl_hdr);
> -- 
> 2.20.1
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH net-next] ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1
  2020-07-08 17:19                                         ` Julian Anastasov
@ 2020-07-15 18:54                                           ` Pablo Neira Ayuso
  0 siblings, 0 replies; 31+ messages in thread
From: Pablo Neira Ayuso @ 2020-07-15 18:54 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Andrew Sy Kim, Wensong Zhang, Simon Horman, lvs-devel,
	netfilter-devel
On Wed, Jul 08, 2020 at 08:19:55PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Wed, 8 Jul 2020, Andrew Sy Kim wrote:
> 
> > When expire_nodest_conn=1 and a destination is deleted, IPVS does not
> > expire the existing connections until the next matching incoming packet.
> > If there are many connection entries from a single client to a single
> > destination, many packets may get dropped before all the connections are
> > expired (more likely with lots of UDP traffic). An optimization can be
> > made where upon deletion of a destination, IPVS queues up delayed work
> > to immediately expire any connections with a deleted destination. This
> > ensures any reused source ports from a client (within the IPVS timeouts)
> > are scheduled to new real servers instead of silently dropped.
> > 
> > Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
> 
> 	OK, patch content is same, subject has "ipvs:" prefix,
> empty line after Signed-off-by is removed, so this patch can
> be applied without any modifications.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
Applied, thanks.
^ permalink raw reply	[flat|nested] 31+ messages in thread
end of thread, other threads:[~2020-07-15 18:54 UTC | newest]
Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-05-15  1:35 [PATCH] netfilter/ipvs: expire no destination UDP connections when expire_nodest_conn=1 Andrew Sy Kim
2020-05-15 18:07 ` Julian Anastasov
2020-05-17 17:27   ` Andrew Kim
2020-05-17 17:30     ` Andrew Kim
2020-05-17 17:16 ` [PATCH] netfilter/ipvs: immediately expire UDP connections matching unavailable destination if expire_nodest_conn=1 Andrew Sy Kim
2020-05-18 19:10   ` Julian Anastasov
2020-05-18 19:54     ` Andrew Kim
2020-05-19 11:46       ` Marco Angaroni
2020-05-19 14:18         ` Andrew Kim
2020-05-19 19:46         ` Julian Anastasov
2020-05-24 21:31           ` [PATCH] netfilter/ipvs: immediately expire no destination connections in kthread " Andrew Sy Kim
2020-05-26 21:24             ` Julian Anastasov
2020-05-26 21:47               ` Andrew Kim
2020-05-28  1:41                 ` [PATCH] netfilter/ipvs: queue delayed work to expire no destination connections " Andrew Sy Kim
2020-05-28 17:26                   ` Julian Anastasov
2020-06-08 17:22                     ` Andrew Sy Kim
2020-06-08 17:29                       ` Andrew Kim
2020-06-08 17:34                     ` Andrew Sy Kim
2020-06-08 20:20                       ` Andrew Sy Kim
2020-06-08 20:24                         ` Andrew Kim
2020-06-15 19:24                         ` Julian Anastasov
2020-07-01 21:24                           ` Andrew Sy Kim
2020-07-02  4:33                             ` Julian Anastasov
2020-07-08 13:58                               ` [PATCH net-next] " Andrew Sy Kim
2020-07-08 16:00                                 ` Julian Anastasov
2020-07-08 16:06                                   ` [PATCHv2 net-next] ipvs: " Andrew Sy Kim
2020-07-08 16:12                                     ` Pablo Neira Ayuso
2020-07-08 16:14                                       ` Andrew Kim
2020-07-08 16:16                                       ` [PATCH " Andrew Sy Kim
2020-07-08 17:19                                         ` Julian Anastasov
2020-07-15 18:54                                           ` Pablo Neira Ayuso
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).