All of lore.kernel.org
 help / color / mirror / Atom feed
From: Marcelo Ricardo Leitner <mleitner@redhat.com>
To: Julian Anastasov <ja@ssi.bg>
Cc: lvs-devel@vger.kernel.org, hannes@redhat.com, jbrouer@redhat.com
Subject: Re: [PATCH RFC] ipvs: reschedule new connections if previous was on FIN_WAIT or TIME_WAIT
Date: Wed, 10 Dec 2014 10:34:18 -0200	[thread overview]
Message-ID: <54883DCA.7050906@redhat.com> (raw)
In-Reply-To: <alpine.LFD.2.11.1412092248390.2374@ja.home.ssi.bg>

On 09-12-2014 21:37, Julian Anastasov wrote:
>
> 	Hello,
>
> On Mon, 8 Dec 2014, Marcelo Ricardo Leitner wrote:
>
>> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
>> ---
>>
>> Notes:
>>      Hi,
>>
>>      We have a report that not doing so may cause poor load balacing if
>>      applications reuse src port. With a patch like this, it would make
>>      new SYNs on a given connection to drop the old one and start a new
>>      one.
>
> 	People complained about UDP, such as RADIUS, etc.
> I guess, for TCP it is some local client that can benefit
> from balancing, it can be also a local testing tool.

So it's a new type of issue, ok. I don't have info on what their application 
is, just that it ends up reusing port numbers.

>>      One could say that this reuse can be done on purpose and carefully
>>      as a way to cause poor load balancing to cause a DoS.
>>
>>      Thing is, I'm unsure if we really should do this, as it may end up
>>      doing more harm than good.
>>
>>      WDYT? And if we do additional checks, like at least validating seq
>>      number, would it be better?
>
> 	I think, checking of SEQ will not help, tw_recycle
> works by checking the timestamp option, SEQ of SYN is
> arbitrary.

I thought for time wait assassination it would have to use a bigger seq 
together with the timestamps. But anyway, Windows (since Vista, even 8) 
doesn't use timestamps by default, unfortunately. Windows is not really 
involved, it's just to have that in mind, as by relying on timestamps we would 
be limiting the solution.

> 	Also, not all connections can be expired, for
> example controlling conns (FTP CTL), i.e. our attempt
> to expire conn will not succeed if FTP DATA is still in
> progress (cp->n_control != 0 check) or in some FW/TW state.

Interesting. I hadn't thought about this case.

>>      Thanks,
>>      Marcelo
>>
>>   net/netfilter/ipvs/ip_vs_core.c | 15 ++++++++++++---
>>   1 file changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
>> index 990decba1fe418e36e59a1f081fcf0e47188da29..e81a9ac3c7e4e25fb14953b7faa4ace054f51274 100644
>> --- a/net/netfilter/ipvs/ip_vs_core.c
>> +++ b/net/netfilter/ipvs/ip_vs_core.c
>> @@ -1036,6 +1036,14 @@ static inline bool is_new_conn(const struct sk_buff *skb,
>>   	}
>>   }
>>
>> +static inline bool is_new_conn_expected(const struct ip_vs_conn *cp)
>> +{
>> +	if (cp->protocol != IPPROTO_TCP)
>> +		return false;
>> +	return (cp->state == IP_VS_TCP_S_TIME_WAIT) ||
>> +		(cp->state == IP_VS_TCP_S_FIN_WAIT);
>
> 	For conns with cp->flags & IP_VS_CONN_F_NOOUTPUT the
> FIN_WAIT state is problematic, we see only packets from
> client (eg. DR method) and can not tell if server has sent its
> FIN packet. I.e. diverting large transfer from one real to
> another will lead to sending of FIN+ACKs from client to
> new place.
>
> 	So, I'm not sure if we should restrict the FW state, eg:
>
> 	return (cp->state == IP_VS_TCP_S_TIME_WAIT) ||
> 	       (cp->state == IP_VS_TCP_S_FIN_WAIT &&
> 		(cp->flags & IP_VS_CONN_F_NOOUTPUT) &&
> 		time_after_eq(jiffies + cp->timeout,
> 			      cp->timer.expires + 1 * HZ));
>
> 	I.e. for INPUT+OUTPUT IP_VS_TCP_S_FIN_WAIT is
> not expired because only one side sent FIN. For

And when the other side send it too, it will be in TIME_WAIT then. Ok.

> INPUT-ONLY we can allow expiration in IP_VS_TCP_S_FIN_WAIT,
> for example, when FIN+ACK was not seen recently, say in
> last second?

That's interesting.

> 	If such checks look dangerous we can try to expire
> only in IP_VS_TCP_S_TIME_WAIT.

Or only enable it within some sysctl, like we have expire_nodest_conn? We 
could have a "(tcp_)rebalance_on_port_reuse" or something like that.

>> +}
>> +
>>   /* Handle response packets: rewrite addresses and send away...
>>    */
>>   static unsigned int
>> @@ -1642,9 +1650,10 @@ ip_vs_in(unsigned int hooknum, struct sk_buff *skb, int af)
>>   	 */
>>   	cp = pp->conn_in_get(af, skb, &iph, 0);
>>
>> -	if (unlikely(sysctl_expire_nodest_conn(ipvs)) && cp && cp->dest &&
>> -	    unlikely(!atomic_read(&cp->dest->weight)) && !iph.fragoffs &&
>> -	    is_new_conn(skb, &iph)) {
>> +	if (cp && cp->dest && !iph.fragoffs && is_new_conn(skb, &iph) &&
>> +	    ((unlikely(sysctl_expire_nodest_conn(ipvs)) &&
>> +	      unlikely(!atomic_read(&cp->dest->weight))) ||
>> +	     unlikely(is_new_conn_expected(cp)))) {
>
> 	This check can be optimized:
>
> 	if (!iph.fragoffs && is_new_conn(skb, &iph) && cp &&
> 	    ((unlikely(sysctl_expire_nodest_conn(ipvs)) && cp->dest &&
> 	      unlikely(!atomic_read(&cp->dest->weight))) ||
> 	     unlikely(is_new_conn_expected(cp)))) {

Yup. Okay.

I'll wait on the sysctl thing before posting as all the rest seems aligned. 
Thanks!

   Marcelo

>>   		ip_vs_conn_expire_now(cp);
>>   		__ip_vs_conn_put(cp);
>>   		cp = NULL;
>> --
>> 1.9.3
>
> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>
>


  reply	other threads:[~2014-12-10 12:34 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-08 14:27 [PATCH RFC] ipvs: reschedule new connections if previous was on FIN_WAIT or TIME_WAIT Marcelo Ricardo Leitner
2014-12-08 14:29 ` Marcelo Ricardo Leitner
2014-12-09 23:37 ` Julian Anastasov
2014-12-10 12:34   ` Marcelo Ricardo Leitner [this message]
2014-12-10 16:56     ` Marcelo Ricardo Leitner
2014-12-10 21:27       ` Julian Anastasov
2014-12-11 12:44         ` Marcelo Ricardo Leitner
2014-12-11 22:39           ` Julian Anastasov
2014-12-12 11:58             ` Marcelo Ricardo Leitner
2014-12-30 19:49               ` Marcelo Ricardo Leitner
2015-01-05 22:26                 ` Julian Anastasov
2015-01-06 12:12                   ` Marcelo Ricardo Leitner
2015-01-06 21:06                     ` Julian Anastasov
2015-01-07 15:52                       ` Marcelo Ricardo Leitner
2015-01-07 19:31                         ` Julian Anastasov
2015-01-08 12:35                           ` Marcelo Ricardo Leitner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54883DCA.7050906@redhat.com \
    --to=mleitner@redhat.com \
    --cc=hannes@redhat.com \
    --cc=ja@ssi.bg \
    --cc=jbrouer@redhat.com \
    --cc=lvs-devel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.