From: Vlad Yasevich <vyasevich@gmail.com>
To: Daniel Borkmann <dborkman@redhat.com>, davem@davemloft.net
Cc: netdev@vger.kernel.org, linux-sctp@vger.kernel.org,
Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>,
Alexander Sverdlin <alexander.sverdlin@nsn.com>
Subject: Re: [PATCH net] Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
Date: Mon, 14 Apr 2014 19:57:54 +0000
Message-ID: <534C3DC2.9070604@gmail.com>
In-Reply-To: <1397504717-19566-1-git-send-email-dborkman@redhat.com>
On 04/14/2014 03:45 PM, Daniel Borkmann wrote:
> This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
> to reflect real state of the receiver's buffer") as it introduced a
> serious performance regression on SCTP over IPv4 and IPv6, though not
> as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.
>
> Current state:
>
> [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
> iperf version 3.0.1 (10 January 2014)
> Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
> Time: Fri, 11 Apr 2014 17:56:21 GMT
> Connecting to host 192.168.241.3, port 5201
> Cookie: Lab200slot2.1397238981.812898.548918
> [ 4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
> Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
> [ ID] Interval Transfer Bandwidth
> [ 4] 0.00-1.09 sec 20.8 MBytes 161 Mbits/sec
> [ 4] 1.09-2.13 sec 10.8 MBytes 86.8 Mbits/sec
> [ 4] 2.13-3.15 sec 3.57 MBytes 29.5 Mbits/sec
> [ 4] 3.15-4.16 sec 4.33 MBytes 35.7 Mbits/sec
> [ 4] 4.16-6.21 sec 10.4 MBytes 42.7 Mbits/sec
> [ 4] 6.21-6.21 sec 0.00 Bytes 0.00 bits/sec
> [ 4] 6.21-7.35 sec 34.6 MBytes 253 Mbits/sec
> [ 4] 7.35-11.45 sec 22.0 MBytes 45.0 Mbits/sec
> [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec
> [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec
> [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec
> [ 4] 11.45-12.51 sec 16.0 MBytes 126 Mbits/sec
> [ 4] 12.51-13.59 sec 20.3 MBytes 158 Mbits/sec
> [ 4] 13.59-14.65 sec 13.4 MBytes 107 Mbits/sec
> [ 4] 14.65-16.79 sec 33.3 MBytes 130 Mbits/sec
> [ 4] 16.79-16.79 sec 0.00 Bytes 0.00 bits/sec
> [ 4] 16.79-17.82 sec 5.94 MBytes 48.7 Mbits/sec
> (etc)
>
> [root@Lab200slot2 ~]# iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
> iperf version 3.0.1 (10 January 2014)
> Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
> Time: Fri, 11 Apr 2014 19:08:41 GMT
> Connecting to host 2001:db8:0:f101::1, port 5201
> Cookie: Lab200slot2.1397243321.714295.2b3f7c
> [ 4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
> Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
> [ ID] Interval Transfer Bandwidth
> [ 4] 0.00-1.00 sec 169 MBytes 1.42 Gbits/sec
> [ 4] 1.00-2.00 sec 201 MBytes 1.69 Gbits/sec
> [ 4] 2.00-3.00 sec 188 MBytes 1.58 Gbits/sec
> [ 4] 3.00-4.00 sec 174 MBytes 1.46 Gbits/sec
> [ 4] 4.00-5.00 sec 165 MBytes 1.39 Gbits/sec
> [ 4] 5.00-6.00 sec 199 MBytes 1.67 Gbits/sec
> [ 4] 6.00-7.00 sec 163 MBytes 1.36 Gbits/sec
> [ 4] 7.00-8.00 sec 174 MBytes 1.46 Gbits/sec
> [ 4] 8.00-9.00 sec 193 MBytes 1.62 Gbits/sec
> [ 4] 9.00-10.00 sec 196 MBytes 1.65 Gbits/sec
> [ 4] 10.00-11.00 sec 157 MBytes 1.31 Gbits/sec
> [ 4] 11.00-12.00 sec 175 MBytes 1.47 Gbits/sec
> [ 4] 12.00-13.00 sec 192 MBytes 1.61 Gbits/sec
> [ 4] 13.00-14.00 sec 199 MBytes 1.67 Gbits/sec
> (etc)
>
> After patch:
>
> [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
> iperf version 3.0.1 (10 January 2014)
> Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
> Time: Mon, 14 Apr 2014 16:40:48 GMT
> Connecting to host 192.168.240.3, port 5201
> Cookie: Lab200slot2.1397493648.413274.65e131
> [ 4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
> Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
> [ ID] Interval Transfer Bandwidth
> [ 4] 0.00-1.00 sec 240 MBytes 2.02 Gbits/sec
> [ 4] 1.00-2.00 sec 239 MBytes 2.01 Gbits/sec
> [ 4] 2.00-3.00 sec 240 MBytes 2.01 Gbits/sec
> [ 4] 3.00-4.00 sec 239 MBytes 2.00 Gbits/sec
> [ 4] 4.00-5.00 sec 245 MBytes 2.05 Gbits/sec
> [ 4] 5.00-6.00 sec 240 MBytes 2.01 Gbits/sec
> [ 4] 6.00-7.00 sec 240 MBytes 2.02 Gbits/sec
> [ 4] 7.00-8.00 sec 239 MBytes 2.01 Gbits/sec
>
> With this revert applied, SCTP performance is back to normal on latest
> upstream for IPv4 and IPv6 and matches the throughput of a 3.4.2 test
> kernel; throughput is steady and interval reports are smooth again.
>
> Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
> Reported-by: Peter Butler <pbutler@sonusnet.com>
> Reported-by: Dongsheng Song <dongsheng.song@gmail.com>
> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
> Tested-by: Peter Butler <pbutler@sonusnet.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
> Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com>
> Cc: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
The base approach is sound. The idea is to calculate rwnd based
on the receive buffer space available. The algorithm chosen, however,
gives a much higher preference to small data and penalizes large
data transfers. We need to figure out something else here.
-vlad
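For reference, the buffer-based computation that this revert removes can be
modelled in userspace roughly as follows. This is only a sketch taken from
the '-' lines of the associola.c hunk quoted below; the function name
rwnd_from_buffer and the standalone int parameters are assumptions made for
illustration and are not kernel API.

```c
#include <assert.h>

/* Hypothetical userspace model of the removed computation:
 * rwnd is half of the receive buffer space that is still free,
 * or 0 once the buffer is full or overcommitted.
 * sk_rcvbuf and rx_count stand in for sk->sk_rcvbuf and the
 * rmem_alloc counter read in the original code.
 */
static unsigned int rwnd_from_buffer(int sk_rcvbuf, int rx_count)
{
	assert(sk_rcvbuf >= 0);

	if (sk_rcvbuf - rx_count > 0)
		return (unsigned int)(sk_rcvbuf - rx_count) >> 1;
	return 0;
}
```

With a 200000-byte buffer and 150000 bytes charged, this model yields an
rwnd of 25000; once the charged memory reaches the buffer size it yields 0.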
> ---
> As commit ef2820a735f7 affects kernels with 3.11 and onwards, this
> needs a rework by Matija for net-next again, so that this fix can
> go back to -stable and restore performance for 3.11-3.15 kernels.
>
> include/net/sctp/structs.h | 14 +++++++-
> net/sctp/associola.c | 82 ++++++++++++++++++++++++++++++++++++----------
> net/sctp/sm_statefuns.c | 2 +-
> net/sctp/socket.c | 6 ++++
> net/sctp/ulpevent.c | 8 ++---
> 5 files changed, 87 insertions(+), 25 deletions(-)
>
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index 6ee76c8..d992ca3 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -1653,6 +1653,17 @@ struct sctp_association {
> /* This is the last advertised value of rwnd over a SACK chunk. */
> __u32 a_rwnd;
>
> + /* Number of bytes by which the rwnd has slopped. The rwnd is allowed
> + * to slop over a maximum of the association's frag_point.
> + */
> + __u32 rwnd_over;
> +
> + /* Keeps track of rwnd pressure. This happens when we have
> + * a window, but not receive buffer (i.e. small packets). This one
> + * is released slowly (1 PMTU at a time).
> + */
> + __u32 rwnd_press;
> +
> /* This is the sndbuf size in use for the association.
> * This corresponds to the sndbuf size for the association,
> * as specified in the sk->sndbuf.
> @@ -1881,7 +1892,8 @@ void sctp_assoc_update(struct sctp_association *old,
> __u32 sctp_association_get_next_tsn(struct sctp_association *);
>
> void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
> -void sctp_assoc_rwnd_update(struct sctp_association *, bool);
> +void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
> +void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
> void sctp_assoc_set_primary(struct sctp_association *,
> struct sctp_transport *);
> void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 4f6d6f9..39579c3 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -1395,35 +1395,44 @@ static inline bool sctp_peer_needs_update(struct sctp_association *asoc)
> return false;
> }
>
> -/* Update asoc's rwnd for the approximated state in the buffer,
> - * and check whether SACK needs to be sent.
> - */
> -void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool update_peer)
> +/* Increase asoc's rwnd by len and send any window update SACK if needed. */
> +void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
> {
> - int rx_count;
> struct sctp_chunk *sack;
> struct timer_list *timer;
>
> - if (asoc->ep->rcvbuf_policy)
> - rx_count = atomic_read(&asoc->rmem_alloc);
> - else
> - rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
> + if (asoc->rwnd_over) {
> + if (asoc->rwnd_over >= len) {
> + asoc->rwnd_over -= len;
> + } else {
> + asoc->rwnd += (len - asoc->rwnd_over);
> + asoc->rwnd_over = 0;
> + }
> + } else {
> + asoc->rwnd += len;
> + }
>
> - if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> - asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
> - else
> - asoc->rwnd = 0;
> + /* If we had window pressure, start recovering it
> + * once our rwnd had reached the accumulated pressure
> + * threshold. The idea is to recover slowly, but up
> + * to the initial advertised window.
> + */
> + if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
> + int change = min(asoc->pathmtu, asoc->rwnd_press);
> + asoc->rwnd += change;
> + asoc->rwnd_press -= change;
> + }
>
> - pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
> - __func__, asoc, asoc->rwnd, rx_count,
> - asoc->base.sk->sk_rcvbuf);
> + pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
> + __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
> + asoc->a_rwnd);
>
> /* Send a window update SACK if the rwnd has increased by at least the
> * minimum of the association's PMTU and half of the receive buffer.
> * The algorithm used is similar to the one described in
> * Section 4.2.3.3 of RFC 1122.
> */
> - if (update_peer && sctp_peer_needs_update(asoc)) {
> + if (sctp_peer_needs_update(asoc)) {
> asoc->a_rwnd = asoc->rwnd;
>
> pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
> @@ -1445,6 +1454,45 @@ void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool update_peer)
> }
> }
>
> +/* Decrease asoc's rwnd by len. */
> +void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
> +{
> + int rx_count;
> + int over = 0;
> +
> + if (unlikely(!asoc->rwnd || asoc->rwnd_over))
> + pr_debug("%s: association:%p has asoc->rwnd:%u, "
> + "asoc->rwnd_over:%u!\n", __func__, asoc,
> + asoc->rwnd, asoc->rwnd_over);
> +
> + if (asoc->ep->rcvbuf_policy)
> + rx_count = atomic_read(&asoc->rmem_alloc);
> + else
> + rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
> +
> + /* If we've reached or overflowed our receive buffer, announce
> + * a 0 rwnd if rwnd would still be positive. Store the
> + * potential pressure overflow so that the window can be restored
> + * back to original value.
> + */
> + if (rx_count >= asoc->base.sk->sk_rcvbuf)
> + over = 1;
> +
> + if (asoc->rwnd >= len) {
> + asoc->rwnd -= len;
> + if (over) {
> + asoc->rwnd_press += asoc->rwnd;
> + asoc->rwnd = 0;
> + }
> + } else {
> + asoc->rwnd_over = len - asoc->rwnd;
> + asoc->rwnd = 0;
> + }
> +
> + pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
> + __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
> + asoc->rwnd_press);
> +}
>
> /* Build the bind address list for the association based on info from the
> * local endpoint and the remote peer.
> diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
> index 01e0024..ae9fbeb 100644
> --- a/net/sctp/sm_statefuns.c
> +++ b/net/sctp/sm_statefuns.c
> @@ -6178,7 +6178,7 @@ static int sctp_eat_data(const struct sctp_association *asoc,
> * PMTU. In cases, such as loopback, this might be a rather
> * large spill over.
> */
> - if ((!chunk->data_accepted) && (!asoc->rwnd ||
> + if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
> (datalen > asoc->rwnd + asoc->frag_point))) {
>
> /* If this is the next TSN, consider reneging to make
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index e13519e..ff20e2d 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -2115,6 +2115,12 @@ static int sctp_recvmsg(struct kiocb *iocb, struct sock *sk,
> sctp_skb_pull(skb, copied);
> skb_queue_head(&sk->sk_receive_queue, skb);
>
> + /* When only partial message is copied to the user, increase
> + * rwnd by that amount. If all the data in the skb is read,
> + * rwnd is updated when the event is freed.
> + */
> + if (!sctp_ulpevent_is_notification(event))
> + sctp_assoc_rwnd_increase(event->asoc, copied);
> goto out;
> } else if ((event->msg_flags & MSG_NOTIFICATION) ||
> (event->msg_flags & MSG_EOR))
> diff --git a/net/sctp/ulpevent.c b/net/sctp/ulpevent.c
> index 8d198ae..85c6465 100644
> --- a/net/sctp/ulpevent.c
> +++ b/net/sctp/ulpevent.c
> @@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(struct sctp_ulpevent *event,
> skb = sctp_event2skb(event);
> /* Set the owner and charge rwnd for bytes received. */
> sctp_ulpevent_set_owner(event, asoc);
> - sctp_assoc_rwnd_update(asoc, false);
> + sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
>
> if (!skb->data_len)
> return;
> @@ -1011,7 +1011,6 @@ static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
> {
> struct sk_buff *skb, *frag;
> unsigned int len;
> - struct sctp_association *asoc;
>
> /* Current stack structures assume that the rcv buffer is
> * per socket. For UDP style sockets this is not true as
> @@ -1036,11 +1035,8 @@ static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
> }
>
> done:
> - asoc = event->asoc;
> - sctp_association_hold(asoc);
> + sctp_assoc_rwnd_increase(event->asoc, len);
> sctp_ulpevent_release_owner(event);
> - sctp_assoc_rwnd_update(asoc, true);
> - sctp_association_put(asoc);
> }
>
> static void sctp_ulpevent_release_frag_data(struct sctp_ulpevent *event)
>
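The accounting restored by the patch above can be followed more easily in a
small userspace model. This is a sketch only: struct rwnd_model, the
model_* function names and the buffer_full flag are hypothetical stand-ins
for the sctp_association fields and the rx_count >= sk_rcvbuf check, and the
SACK/timer handling of sctp_assoc_rwnd_increase() is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the sctp_association fields involved. */
struct rwnd_model {
	uint32_t rwnd;       /* current receive window */
	uint32_t rwnd_over;  /* bytes accepted beyond the window */
	uint32_t rwnd_press; /* window withheld under buffer pressure */
	uint32_t pathmtu;    /* recovery step for rwnd_press */
};

/* Models sctp_assoc_rwnd_decrease(): charge len bytes on receive.
 * buffer_full models the rx_count >= sk_rcvbuf check.
 */
static void model_rwnd_decrease(struct rwnd_model *m, uint32_t len,
				int buffer_full)
{
	assert(m);

	if (m->rwnd >= len) {
		m->rwnd -= len;
		if (buffer_full) {
			/* Park the remaining window as pressure. */
			m->rwnd_press += m->rwnd;
			m->rwnd = 0;
		}
	} else {
		/* Window exhausted: remember the overshoot. */
		m->rwnd_over = len - m->rwnd;
		m->rwnd = 0;
	}
}

/* Models sctp_assoc_rwnd_increase() without the SACK logic. */
static void model_rwnd_increase(struct rwnd_model *m, uint32_t len)
{
	assert(m);

	if (m->rwnd_over) {
		if (m->rwnd_over >= len) {
			m->rwnd_over -= len;
		} else {
			m->rwnd += len - m->rwnd_over;
			m->rwnd_over = 0;
		}
	} else {
		m->rwnd += len;
	}

	/* Release parked window at most one PMTU per call. */
	if (m->rwnd_press && m->rwnd >= m->rwnd_press) {
		uint32_t change = m->pathmtu < m->rwnd_press ?
				  m->pathmtu : m->rwnd_press;

		m->rwnd += change;
		m->rwnd_press -= change;
	}
}
```

In this model, window parked in rwnd_press is handed back only after rwnd
has climbed to at least rwnd_press, and then by no more than one pathmtu
per call, which matches the "recover slowly" comment in the patch.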