* [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
@ 2014-01-17 7:01 Matija Glavinic Pecotic
2014-01-21 22:36 ` David Miller
` (4 more replies)
0 siblings, 5 replies; 13+ messages in thread
From: Matija Glavinic Pecotic @ 2014-01-17 7:01 UTC (permalink / raw)
To: linux-sctp; +Cc: Alexander Sverdlin, netdev
The current implementation of the (a)rwnd calculation can lead to severe
performance issues and to associations stalling completely. These problems are
described below, and a solution is proposed which improves lksctp's robustness
under congestion.
1) Sudden drop of a_rwnd and incomplete window recovery afterwards
sctp_assoc_rwnd_decrease accounts only for the payload size (sctp data); the
size of the sk_buff, which is charged against the receiver buffer, is not
accounted for in rwnd. In theory this should not be a problem, as the actual
buffer size is double the amount requested on the socket (SO_RCVBUF). The
problem is that this scales badly for data smaller than sizeof(struct sk_buff).
E.g. in 4G (LTE) networks, the link interfacing the radio side will carry a
large portion of traffic of this size (less than 100B).
An example of a sudden drop and incomplete window recovery is given below.
Node B exhibits the problematic behavior. Node A initiates the association and
B is configured to advertise an rwnd of 10000. A sends messages of size 43B
(the size of a typical sctp message in a 4G (LTE) network). On B, data is left
in the buffer by not reading the socket in userspace.
Let's examine when we hit the pressure state and declare rwnd to be 0, for a
scenario with the above parameters (rwnd = 10000, chunk size = 43, each chunk
sent in a separate sctp packet).
The logic is implemented in sctp_assoc_rwnd_decrease:
socket_buffer (see below) is the maximum size which can be held in the socket
buffer (sk_rcvbuf); currently_alloced is the amount of data currently
allocated (rx_count).
A simple expression shows after how many packets, with the above parameters,
we enter the pressure state:
We start with the condition which has to be met in order to enter the pressure
state:
socket_buffer < currently_alloced;
currently_alloced is the size of the sctp packets received so far and not yet
delivered to userspace. x is the number of chunks/packets (since there is no
bundling, and each chunk is delivered in a separate packet, we can regard each
chunk as an sctp packet and, importantly, as having its own sk_buff):
socket_buffer < x*each_sctp_packet;
each_sctp_packet is the sctp chunk size + sizeof(struct sk_buff).
socket_buffer is twice the initially requested size of the socket buffer,
which in the case of sctp is twice the a_rwnd requested:
2*rwnd < x*(payload+sizeof(struct sk_buff));
sizeof(struct sk_buff) is 190 (3.13.0-rc4+). As stated above, rwnd is 10000
and each payload is 43:
20000 < x*(43+190);
x > 20000/233;
x ~> 84;
After ~84 messages, the pressure state is entered and 0 rwnd is advertised,
even though only 84*43B ~= 3612B of sctp data has been received. This is why
an external observer notices a sudden drop from 6474 to 0, as shown in the
example below:
IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
<...>
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]
--> Sudden drop
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
At this point, rwnd_press stores the current rwnd value so it can later be
restored in sctp_assoc_rwnd_increase. This, however, never happens, because
the condition to start slowly returning rwnd_press back into rwnd is never
met: rwnd, after hitting 0, must first reach rwnd_press again, growing only by
the amounts read from userspace. Let us look at the values in the above
example. The initial a_rwnd is 10000, pressure was hit when rwnd was ~6500,
and the amount of actual sctp data waiting to be delivered to userspace is
~3500. When userspace starts to read, sctp_assoc_rwnd_increase is credited
only with the sctp data, which is ~3500. The condition is never met, and once
userspace has read all data, rwnd stays at 3569.
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
--> At this point userspace read everything, rwnd recovered only to 3569
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
Reproduction is straightforward: it is enough for the sender to send packets
smaller than sizeof(struct sk_buff) and for the receiver to keep them in its
buffers.
2) Minute size window for associations sharing the same socket buffer
In case multiple associations share the same socket and the same socket buffer
(sctp.rcvbuf_policy = 0), different scenarios exist in which congestion on one
of the associations can permanently drop the rwnd of the other association(s).
The situation is typically observed as one association suddenly having its
rwnd dropped to the size of the last packet received and never recovering
beyond that point. Different scenarios lead to it, but all have in common that
one of the associations (let it be the association from 1)) has nearly
depleted the socket buffer, and the other association charges the socket
buffer with just enough to trigger the pressure. That association enters the
pressure state, sets rwnd_press and announces a 0 rwnd.
When data is read by userspace, a situation similar to 1) occurs: rwnd
increases only by the amount read by userspace, but rwnd_press is high enough
that the association never has enough credit to reach rwnd_press and restore
its previous state. This is a special case of 1), made worse by the fact that,
in the worst case, only one packet sits in the buffer and rwnd is increased
only by its size. The consequence is an association with a very low maximum
rwnd ('minute size', in our case down to 43B, the size of the packet which
caused the pressure) and as such unusable.
The scenario happened frequently in the field and in labs after congestion
(link breaks, different probabilities of packet drop, packet reordering),
typically with scenario 1) preceding it. Here is a deterministic scenario for
reproduction:
From node A, establish two associations on the same socket, with rcvbuf_policy
set to share one common buffer (sctp.rcvbuf_policy = 0). On association 1,
repeat the scenario from 1), that is, bring it down to 0 and restore it,
observing scenario 1). Use a small payload size (here we use 43). Once rwnd is
'recovered', bring it down close to 0 again, so that just one more packet
would close it. As a consequence, association 2 is able to receive (at least)
one more packet, which brings it into the pressure state. E.g. if association
2 had an rwnd of 10000 and the packet received was 43B, and we enter the
pressure state at this point, rwnd_press will be 9957. Once the payload is
delivered to userspace, rwnd increases by 43, but the conditions to restore
rwnd to its original state, just as in 1), will never be satisfied.
--> Association 1, between A.y and B.12345
IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.55915: sctp (1) [COOKIE ACK]
--> Association 2, between A.z and B.12346
IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
IP B.12346 > A.55915: sctp (1) [COOKIE ACK]
--> Deplete socket buffer by sending messages of size 43B over association 1
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
<...>
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]
--> Sudden drop on 1
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
--> Here userspace reads; rwnd 'recovered' to 3698. Now deplete again using
association 1 so there is room in the buffer for only one more packet
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
--> Socket buffer is almost depleted, but there is space for one more packet,
send it over association 2, size 43B
IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
--> Immediate drop
IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
--> Read everything from the socket; both associations recover up to the
maximum rwnd they are capable of reaching. Note that association 1
recovered to 3698, and association 2 only to 43
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
A careful reader might wonder why it is necessary to reproduce 1) prior to
reproducing 2). It simply makes it easier to observe when to send the packet
over association 2 that will push that association into the pressure state.
Proposed solution:
Both problems share the same root cause: improper scaling of the socket buffer
with rwnd. A solution which takes sizeof(struct sk_buff) into account while
calculating rwnd is not possible, because there is no linear relationship
between the amount charged on increase/decrease and the IP packet in which the
payload arrived. Even if such a solution were pursued, the complexity of the
code would increase. Given the nature of the current rwnd handling, the slow
increase of rwnd (in sctp_assoc_rwnd_increase) after the pressure state is
entered is rational, but it gives the sender a false representation of the
current buffer space. Furthermore, it implements an additional congestion
control mechanism which is defined by the implementation, not by the standard.
The proposed solution simplifies the whole algorithm, bearing in mind the
definition from the RFC:
o Receiver Window (rwnd): This gives the sender an indication of the space
available in the receiver's inbound buffer.
Core of the proposed solution is given with these lines:
sctp_assoc_rwnd_account:
if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
else
asoc->rwnd = 0;
We advertise to the sender (half of) the actual space we have. 'Half' is in
parentheses depending on whether you regard the size of the socket buffer as
SO_RCVBUF (the size visible from userspace) or as twice that amount (the size
as seen from kernelspace).
In this way the sender is given a good approximation of our buffer space,
regardless of the buffer policy - we always advertise what we have. The
proposed solution fixes the described problems and removes the need for the
rwnd restoration algorithm. Finally, as the proposed solution is a
simplification, some lines of code, along with some bytes in struct
sctp_association, are saved.
Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
---
--- net-next.orig/net/sctp/associola.c
+++ net-next/net/sctp/associola.c
@@ -1367,44 +1367,35 @@ static inline bool sctp_peer_needs_updat
return false;
}
-/* Increase asoc's rwnd by len and send any window update SACK if needed. */
-void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
+/* Account asoc's rwnd for the approximated state in the buffer,
+ * and check whether SACK needs to be sent.
+ */
+void sctp_assoc_rwnd_account(struct sctp_association *asoc, int check_sack)
{
+ int rx_count;
struct sctp_chunk *sack;
struct timer_list *timer;
- if (asoc->rwnd_over) {
- if (asoc->rwnd_over >= len) {
- asoc->rwnd_over -= len;
- } else {
- asoc->rwnd += (len - asoc->rwnd_over);
- asoc->rwnd_over = 0;
- }
- } else {
- asoc->rwnd += len;
- }
+ if (asoc->ep->rcvbuf_policy)
+ rx_count = atomic_read(&asoc->rmem_alloc);
+ else
+ rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
- /* If we had window pressure, start recovering it
- * once our rwnd had reached the accumulated pressure
- * threshold. The idea is to recover slowly, but up
- * to the initial advertised window.
- */
- if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
- int change = min(asoc->pathmtu, asoc->rwnd_press);
- asoc->rwnd += change;
- asoc->rwnd_press -= change;
- }
+ if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
+ asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
+ else
+ asoc->rwnd = 0;
- pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
- __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
- asoc->a_rwnd);
+ pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
+ __func__, asoc, asoc->rwnd, rx_count,
+ asoc->base.sk->sk_rcvbuf);
/* Send a window update SACK if the rwnd has increased by at least the
* minimum of the association's PMTU and half of the receive buffer.
* The algorithm used is similar to the one described in
* Section 4.2.3.3 of RFC 1122.
*/
- if (sctp_peer_needs_update(asoc)) {
+ if (check_sack && sctp_peer_needs_update(asoc)) {
asoc->a_rwnd = asoc->rwnd;
pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
@@ -1426,45 +1417,6 @@ void sctp_assoc_rwnd_increase(struct sct
}
}
-/* Decrease asoc's rwnd by len. */
-void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
-{
- int rx_count;
- int over = 0;
-
- if (unlikely(!asoc->rwnd || asoc->rwnd_over))
- pr_debug("%s: association:%p has asoc->rwnd:%u, "
- "asoc->rwnd_over:%u!\n", __func__, asoc,
- asoc->rwnd, asoc->rwnd_over);
-
- if (asoc->ep->rcvbuf_policy)
- rx_count = atomic_read(&asoc->rmem_alloc);
- else
- rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
-
- /* If we've reached or overflowed our receive buffer, announce
- * a 0 rwnd if rwnd would still be positive. Store the
- * the potential pressure overflow so that the window can be restored
- * back to original value.
- */
- if (rx_count >= asoc->base.sk->sk_rcvbuf)
- over = 1;
-
- if (asoc->rwnd >= len) {
- asoc->rwnd -= len;
- if (over) {
- asoc->rwnd_press += asoc->rwnd;
- asoc->rwnd = 0;
- }
- } else {
- asoc->rwnd_over = len - asoc->rwnd;
- asoc->rwnd = 0;
- }
-
- pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
- __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
- asoc->rwnd_press);
-}
/* Build the bind address list for the association based on info from the
* local endpoint and the remote peer.
--- net-next.orig/include/net/sctp/structs.h
+++ net-next/include/net/sctp/structs.h
@@ -1653,17 +1653,6 @@ struct sctp_association {
/* This is the last advertised value of rwnd over a SACK chunk. */
__u32 a_rwnd;
- /* Number of bytes by which the rwnd has slopped. The rwnd is allowed
- * to slop over a maximum of the association's frag_point.
- */
- __u32 rwnd_over;
-
- /* Keeps treack of rwnd pressure. This happens when we have
- * a window, but not recevie buffer (i.e small packets). This one
- * is releases slowly (1 PMTU at a time ).
- */
- __u32 rwnd_press;
-
/* This is the sndbuf size in use for the association.
* This corresponds to the sndbuf size for the association,
* as specified in the sk->sndbuf.
@@ -1892,8 +1881,7 @@ void sctp_assoc_update(struct sctp_assoc
__u32 sctp_association_get_next_tsn(struct sctp_association *);
void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
-void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
-void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
+void sctp_assoc_rwnd_account(struct sctp_association *, int);
void sctp_assoc_set_primary(struct sctp_association *,
struct sctp_transport *);
void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
--- net-next.orig/net/sctp/sm_statefuns.c
+++ net-next/net/sctp/sm_statefuns.c
@@ -6176,7 +6176,7 @@ static int sctp_eat_data(const struct sc
* PMTU. In cases, such as loopback, this might be a rather
* large spill over.
*/
- if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
+ if ((!chunk->data_accepted) && (!asoc->rwnd ||
(datalen > asoc->rwnd + asoc->frag_point))) {
/* If this is the next TSN, consider reneging to make
--- net-next.orig/net/sctp/socket.c
+++ net-next/net/sctp/socket.c
@@ -2097,7 +2097,7 @@ static int sctp_recvmsg(struct kiocb *io
* rwnd is updated when the event is freed.
*/
if (!sctp_ulpevent_is_notification(event))
- sctp_assoc_rwnd_increase(event->asoc, copied);
+ sctp_assoc_rwnd_account(event->asoc, 1);
goto out;
} else if ((event->msg_flags & MSG_NOTIFICATION) ||
(event->msg_flags & MSG_EOR))
--- net-next.orig/net/sctp/ulpevent.c
+++ net-next/net/sctp/ulpevent.c
@@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(s
skb = sctp_event2skb(event);
/* Set the owner and charge rwnd for bytes received. */
sctp_ulpevent_set_owner(event, asoc);
- sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
+ sctp_assoc_rwnd_account(asoc, 0);
if (!skb->data_len)
return;
@@ -1035,7 +1035,7 @@ static void sctp_ulpevent_release_data(s
}
done:
- sctp_assoc_rwnd_increase(event->asoc, len);
+ sctp_assoc_rwnd_account(event->asoc, 1);
sctp_ulpevent_release_owner(event);
}
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
2014-01-17 7:01 [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer Matija Glavinic Pecotic
@ 2014-01-21 22:36 ` David Miller
2014-01-22 0:08 ` Vlad Yasevich
2014-01-22 2:12 ` Vlad Yasevich
` (3 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: David Miller @ 2014-01-21 22:36 UTC (permalink / raw)
To: matija.glavinic-pecotic.ext; +Cc: linux-sctp, alexander.sverdlin, netdev
From: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Date: Fri, 17 Jan 2014 08:01:24 +0100
> The current implementation of the (a)rwnd calculation can lead to severe
> performance issues and to associations stalling completely. These problems are
> described below, and a solution is proposed which improves lksctp's robustness
> under congestion.
It would be fantastic if an SCTP expert would review this patch, it's
been rotting in patchwork for 4 days.
Thanks.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
2014-01-21 22:36 ` David Miller
@ 2014-01-22 0:08 ` Vlad Yasevich
0 siblings, 0 replies; 13+ messages in thread
From: Vlad Yasevich @ 2014-01-22 0:08 UTC (permalink / raw)
To: David Miller, matija.glavinic-pecotic.ext
Cc: linux-sctp, alexander.sverdlin, netdev
On 01/21/2014 05:36 PM, David Miller wrote:
> From: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
> Date: Fri, 17 Jan 2014 08:01:24 +0100
>
>> The current implementation of the (a)rwnd calculation can lead to severe
>> performance issues and to associations stalling completely. These problems are
>> described below, and a solution is proposed which improves lksctp's robustness
>> under congestion.
>
> It would be fantastic if an SCTP expert would review this patch, it's
> been rotting in patchwork for 4 days.
>
Apologies. It's been on my to do list.
-vlad
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
2014-01-17 7:01 [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer Matija Glavinic Pecotic
2014-01-21 22:36 ` David Miller
@ 2014-01-22 2:12 ` Vlad Yasevich
2014-01-22 19:21 ` Matija Glavinic Pecotic
2014-01-22 12:30 ` Neil Horman
` (2 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Vlad Yasevich @ 2014-01-22 2:12 UTC (permalink / raw)
To: Matija Glavinic Pecotic, linux-sctp; +Cc: Alexander Sverdlin, netdev
On 01/17/2014 02:01 AM, Matija Glavinic Pecotic wrote:
[ snip long description ...]
> The proposed solution simplifies the whole algorithm, bearing in mind the definition from the RFC:
>
> o Receiver Window (rwnd): This gives the sender an indication of the space
> available in the receiver's inbound buffer.
>
> Core of the proposed solution is given with these lines:
>
> sctp_assoc_rwnd_account:
> if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
> else
> asoc->rwnd = 0;
>
> We advertise to the sender (half of) the actual space we have. 'Half' is in
> parentheses depending on whether you regard the size of the socket buffer as
> SO_RCVBUF (the size visible from userspace) or as twice that amount (the size
> as seen from kernelspace).
> In this way the sender is given a good approximation of our buffer space,
> regardless of the buffer policy - we always advertise what we have. The
> proposed solution fixes the described problems and removes the need for the
> rwnd restoration algorithm. Finally, as the proposed solution is a
> simplification, some lines of code, along with some bytes in struct
> sctp_association, are saved.
>
> Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
>
This is the correct approach. However there is one issue and
a few cosmetic suggestions.
> ---
>
> --- net-next.orig/net/sctp/associola.c
> +++ net-next/net/sctp/associola.c
> @@ -1367,44 +1367,35 @@ static inline bool sctp_peer_needs_updat
> return false;
> }
>
> -/* Increase asoc's rwnd by len and send any window update SACK if needed. */
> -void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
> +/* Account asoc's rwnd for the approximated state in the buffer,
> + * and check whether SACK needs to be sent.
> + */
> +void sctp_assoc_rwnd_account(struct sctp_association *asoc, int check_sack)
Maybe sctp_assoc_rwnd_update()?
'check_sack' isn't a very descriptive name for what we are doing. Maybe
'update_peer'? Also, since it is either 0 or 1, just make it a bool.
> {
> + int rx_count;
> struct sctp_chunk *sack;
> struct timer_list *timer;
>
> - if (asoc->rwnd_over) {
> - if (asoc->rwnd_over >= len) {
> - asoc->rwnd_over -= len;
> - } else {
> - asoc->rwnd += (len - asoc->rwnd_over);
> - asoc->rwnd_over = 0;
> - }
> - } else {
> - asoc->rwnd += len;
> - }
> + if (asoc->ep->rcvbuf_policy)
> + rx_count = atomic_read(&asoc->rmem_alloc);
> + else
> + rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
>
> - /* If we had window pressure, start recovering it
> - * once our rwnd had reached the accumulated pressure
> - * threshold. The idea is to recover slowly, but up
> - * to the initial advertised window.
> - */
> - if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
> - int change = min(asoc->pathmtu, asoc->rwnd_press);
> - asoc->rwnd += change;
> - asoc->rwnd_press -= change;
> - }
> + if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> + asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
> + else
> + asoc->rwnd = 0;
>
> - pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
> - __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
> - asoc->a_rwnd);
> + pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
> + __func__, asoc, asoc->rwnd, rx_count,
> + asoc->base.sk->sk_rcvbuf);
>
> /* Send a window update SACK if the rwnd has increased by at least the
> * minimum of the association's PMTU and half of the receive buffer.
> * The algorithm used is similar to the one described in
> * Section 4.2.3.3 of RFC 1122.
> */
> - if (sctp_peer_needs_update(asoc)) {
> + if (check_sack && sctp_peer_needs_update(asoc)) {
> asoc->a_rwnd = asoc->rwnd;
>
> pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
> @@ -1426,45 +1417,6 @@ void sctp_assoc_rwnd_increase(struct sct
> }
> }
>
> -/* Decrease asoc's rwnd by len. */
> -void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
> -{
> - int rx_count;
> - int over = 0;
> -
> - if (unlikely(!asoc->rwnd || asoc->rwnd_over))
> - pr_debug("%s: association:%p has asoc->rwnd:%u, "
> - "asoc->rwnd_over:%u!\n", __func__, asoc,
> - asoc->rwnd, asoc->rwnd_over);
> -
> - if (asoc->ep->rcvbuf_policy)
> - rx_count = atomic_read(&asoc->rmem_alloc);
> - else
> - rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
> -
> - /* If we've reached or overflowed our receive buffer, announce
> - * a 0 rwnd if rwnd would still be positive. Store the
> - * the potential pressure overflow so that the window can be restored
> - * back to original value.
> - */
> - if (rx_count >= asoc->base.sk->sk_rcvbuf)
> - over = 1;
> -
> - if (asoc->rwnd >= len) {
> - asoc->rwnd -= len;
> - if (over) {
> - asoc->rwnd_press += asoc->rwnd;
> - asoc->rwnd = 0;
> - }
> - } else {
> - asoc->rwnd_over = len - asoc->rwnd;
> - asoc->rwnd = 0;
> - }
> -
> - pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
> - __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
> - asoc->rwnd_press);
> -}
>
> /* Build the bind address list for the association based on info from the
> * local endpoint and the remote peer.
> --- net-next.orig/include/net/sctp/structs.h
> +++ net-next/include/net/sctp/structs.h
> @@ -1653,17 +1653,6 @@ struct sctp_association {
> /* This is the last advertised value of rwnd over a SACK chunk. */
> __u32 a_rwnd;
>
> - /* Number of bytes by which the rwnd has slopped. The rwnd is allowed
> - * to slop over a maximum of the association's frag_point.
> - */
> - __u32 rwnd_over;
> -
> - /* Keeps treack of rwnd pressure. This happens when we have
> - * a window, but not recevie buffer (i.e small packets). This one
> - * is releases slowly (1 PMTU at a time ).
> - */
> - __u32 rwnd_press;
> -
> /* This is the sndbuf size in use for the association.
> * This corresponds to the sndbuf size for the association,
> * as specified in the sk->sndbuf.
> @@ -1892,8 +1881,7 @@ void sctp_assoc_update(struct sctp_assoc
> __u32 sctp_association_get_next_tsn(struct sctp_association *);
>
> void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
> -void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
> -void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
> +void sctp_assoc_rwnd_account(struct sctp_association *, int);
> void sctp_assoc_set_primary(struct sctp_association *,
> struct sctp_transport *);
> void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
> --- net-next.orig/net/sctp/sm_statefuns.c
> +++ net-next/net/sctp/sm_statefuns.c
> @@ -6176,7 +6176,7 @@ static int sctp_eat_data(const struct sc
> * PMTU. In cases, such as loopback, this might be a rather
> * large spill over.
> */
> - if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
> + if ((!chunk->data_accepted) && (!asoc->rwnd ||
> (datalen > asoc->rwnd + asoc->frag_point))) {
>
> /* If this is the next TSN, consider reneging to make
> --- net-next.orig/net/sctp/socket.c
> +++ net-next/net/sctp/socket.c
> @@ -2097,7 +2097,7 @@ static int sctp_recvmsg(struct kiocb *io
> * rwnd is updated when the event is freed.
> */
> if (!sctp_ulpevent_is_notification(event))
> - sctp_assoc_rwnd_increase(event->asoc, copied);
> + sctp_assoc_rwnd_account(event->asoc, 1);
This is not going to do anything. The event has not been freed, thus
rmem_alloc has not changed. So, the rwnd will not change.
> goto out;
> } else if ((event->msg_flags & MSG_NOTIFICATION) ||
> (event->msg_flags & MSG_EOR))
> --- net-next.orig/net/sctp/ulpevent.c
> +++ net-next/net/sctp/ulpevent.c
> @@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(s
> skb = sctp_event2skb(event);
> /* Set the owner and charge rwnd for bytes received. */
> sctp_ulpevent_set_owner(event, asoc);
> - sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
> + sctp_assoc_rwnd_account(asoc, 0);
>
> if (!skb->data_len)
> return;
> @@ -1035,7 +1035,7 @@ static void sctp_ulpevent_release_data(s
> }
>
> done:
> - sctp_assoc_rwnd_increase(event->asoc, len);
> + sctp_assoc_rwnd_account(event->asoc, 1);
Same here. The below call to sctp_ulpevent_release_owner() will
finally update rmem_alloc, so the above call to rwnd_account()
will not trigger a window update.
Thanks
-vlad
> sctp_ulpevent_release_owner(event);
> }
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
2014-01-17 7:01 [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer Matija Glavinic Pecotic
2014-01-21 22:36 ` David Miller
2014-01-22 2:12 ` Vlad Yasevich
@ 2014-01-22 12:30 ` Neil Horman
2014-01-22 14:18 ` Vlad Yasevich
2014-01-22 13:41 ` David Laight
2014-01-23 0:16 ` Vlad Yasevich
4 siblings, 1 reply; 13+ messages in thread
From: Neil Horman @ 2014-01-22 12:30 UTC (permalink / raw)
To: Matija Glavinic Pecotic; +Cc: linux-sctp, Alexander Sverdlin, netdev
On Fri, Jan 17, 2014 at 08:01:24AM +0100, Matija Glavinic Pecotic wrote:
> The current implementation of the (a)rwnd calculation can lead to severe
> performance issues and to associations stalling completely. These problems are
> described below, and a solution is proposed which improves lksctp's robustness
> under congestion.
>
> 1) Sudden drop of a_rwnd and incomplete window recovery afterwards
>
> Data accounted in sctp_assoc_rwnd_decrease takes only the payload size (sctp
> data), but the size of the sk_buff, which is charged against the receiver
> buffer, is not accounted in rwnd. Theoretically, this should not be a problem,
> as the actual buffer size is double the amount requested on the socket
> (SO_RCVBUF). The problem is that this scales badly for data smaller than
> sizeof(struct sk_buff). E.g. in 4G (LTE) networks, the link interfacing the
> radio side will carry a large portion of traffic of this size (less than 100B).
>
> An example of the sudden drop and incomplete window recovery is given below.
> Node B exhibits the problematic behavior. Node A initiates the association and
> B is configured to advertise an rwnd of 10000. A sends messages of size 43B
> (the size of a typical sctp message in a 4G (LTE) network). On B, data is left
> in the buffer by not reading the socket in userspace.
>
> Let's examine when we hit the pressure state and declare rwnd to be 0 for the
> scenario with the above parameters (rwnd = 10000, chunk size = 43, each chunk
> sent in a separate sctp packet).
>
> The logic is implemented in sctp_assoc_rwnd_decrease:
>
> socket_buffer (see below) is the maximum size which can be held in the socket
> buffer (sk_rcvbuf). currently_alloced is the amount of data currently
> allocated (rx_count).
>
> A simple expression shows after how many packets, for the above parameters, we
> enter the pressure state:
>
> We start by condition which has to be met in order to enter pressure state:
>
> socket_buffer < currently_alloced;
>
> currently_alloced is represented as the size of the sctp packets received so
> far and not yet delivered to userspace. x is the number of chunks/packets
> (since there is no bundling and each chunk is delivered in a separate packet,
> we can treat each chunk as an sctp packet, and, importantly, as having its
> own sk_buff):
>
> socket_buffer < x*each_sctp_packet;
>
> each_sctp_packet is the sctp chunk size + sizeof(struct sk_buff). socket_buffer
> is twice the initially requested size of the socket buffer, which in the case
> of sctp is twice the a_rwnd requested:
>
> 2*rwnd < x*(payload + sizeof(struct sk_buff));
>
> sizeof(struct sk_buff) is 190 (3.13.0-rc4+). As stated above, rwnd is 10000
> and each payload size is 43:
>
> 20000 < x*(43+190);
>
> x > 20000/233;
>
> x ~> 84;
>
> After ~84 messages, the pressure state is entered and 0 rwnd is advertised,
> while only 84*43B ~= 3612B of sctp data has been received. This is why an
> external observer notices a sudden drop from 6474 to 0, as shown in the
> example:
>
> IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
> IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
> IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
> IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
> <...>
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]
>
> --> Sudden drop
>
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
>
> At this point, rwnd_press stores the current rwnd value so it can later be
> restored in sctp_assoc_rwnd_increase. This, however, doesn't happen, as the
> condition to start slowly increasing rwnd until rwnd_press is returned to rwnd
> is never met. The condition is not met since rwnd, after hitting 0, must first
> reach rwnd_press through the amounts read from userspace. Let us observe the
> values in the above example. The initial a_rwnd is 10000, pressure was hit
> when rwnd was ~6500, and the amount of actual sctp data currently waiting to
> be delivered to userspace is ~3500. When userspace starts to read,
> sctp_assoc_rwnd_increase will be credited only for sctp data, which is ~3500.
> The condition is never met, and when userspace has read all data, rwnd stays
> at 3569.
>
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
>
> --> At this point userspace read everything, rwnd recovered only to 3569
>
> IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
> IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
>
> Reproduction is straightforward: it is enough for the sender to send packets
> of size less than sizeof(struct sk_buff) and for the receiver to keep them in
> its buffers.
>
> 2) Minute size window for associations sharing the same socket buffer
>
> In case multiple associations share the same socket and the same socket buffer
> (sctp.rcvbuf_policy = 0), different scenarios exist in which congestion on one
> of the associations can permanently drop the rwnd of the other association(s).
>
> The situation will typically be observed as one association suddenly having
> its rwnd dropped to the size of the last packet received and never recovering
> beyond that point. Different scenarios lead to it, but all have in common that
> one of the associations (let it be the association from 1)) nearly depleted
> the socket buffer, and the other association charged the socket buffer just
> enough to start the pressure. This association will enter the pressure state,
> set rwnd_press and announce 0 rwnd.
> When data is read by userspace, a situation similar to 1) will occur: rwnd
> will increase just by the size read by userspace, but rwnd_press will be high
> enough that the association doesn't have enough credit to reach rwnd_press and
> restore its previous state. This case is a special case of 1), and worse, as
> in the worst case there is only one packet in the buffer, by whose size rwnd
> will be increased. The consequence is an association with a very low maximum
> rwnd ('minute size', in our case down to 43B - the size of the packet which
> caused the pressure) and as such unusable.
>
> This scenario happened frequently in the field and in labs after a congestion
> state (link breaks, different probabilities of packet drop, packet reordering)
> and with scenario 1) preceding. A deterministic scenario for reproduction is
> given here:
>
> From node A, establish two associations on the same socket, with rcvbuf_policy
> set to share one common buffer (sctp.rcvbuf_policy = 0). On association 1,
> repeat the scenario from 1), that is, bring it down to 0 and restore it.
> Observe scenario 1). Use a small payload size (here we use 43). Once rwnd has
> 'recovered', bring it down close to 0, so that just one more packet would
> close it. As a consequence, association number 2 is able to receive (at least)
> one more packet, which will bring it into the pressure state. E.g. if
> association 2 had an rwnd of 10000 and the packet received was 43, and we
> enter pressure at this point, rwnd_press will be 9957. Once the payload is
> delivered to userspace, rwnd will increase by 43, but the conditions to
> restore rwnd to its original state, just as in 1), will never be satisfied.
>
> --> Association 1, between A.y and B.12345
>
> IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
> IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
> IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
> IP B.12345 > A.55915: sctp (1) [COOKIE ACK]
>
> --> Association 2, between A.z and B.12346
>
> IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
> IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
> IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
> IP B.12346 > A.55915: sctp (1) [COOKIE ACK]
>
> --> Deplete socket buffer by sending messages of size 43B over association 1
>
> IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
>
> <...>
>
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
> IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]
>
> --> Sudden drop on 1
>
> IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
>
> --> Here userspace read, rwnd 'recovered' to 3698; now deplete again using
> association 1 so there is room in the buffer for only one more packet
>
> IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
> IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
>
> --> The socket buffer is almost depleted, but there is space for one more
> packet; send one, of size 43B, over association 2
>
> IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
> IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
>
> --> Immediate drop
>
> IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
>
> --> Read everything from the socket; both associations recover up to the
> maximum rwnd they are capable of reaching. Note that association 1
> recovered up to 3698, while association 2 recovered only to 43
>
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
> IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
> IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
> IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
> IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
>
> A careful reader might wonder why it is necessary to reproduce 1) prior to
> reproducing 2). It is simply easier to observe when to send the packet over
> association 2 which will push the association into the pressure state.
>
> Proposed solution:
>
> Both problems share the same root cause: improper scaling of the socket
> buffer with rwnd. A solution in which sizeof(struct sk_buff) is taken into
> account while calculating rwnd is not possible, due to the fact that there is
> no linear relationship between the amount of data charged in increase/decrease
> and the IP packet in which the payload arrived. Even if such a solution were
> followed, the complexity of the code would increase. Given the nature of the
> current rwnd handling, the slow increase (in sctp_assoc_rwnd_increase) of rwnd
> after the pressure state is entered is rational, but it gives the sender a
> false representation of the current buffer space. Furthermore, it implements
> an additional congestion control mechanism which is defined by the
> implementation rather than by the standard.
>
> The proposed solution simplifies the whole algorithm, keeping in mind the
> definition from the RFC:
>
> o Receiver Window (rwnd): This gives the sender an indication of the space
> available in the receiver's inbound buffer.
>
> Core of the proposed solution is given with these lines:
>
> sctp_assoc_rwnd_account:
> if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
> else
> asoc->rwnd = 0;
>
> We advertise to the sender (half of) the actual space we have. 'Half' is in
> parentheses because it depends on whether you view the size of the socket
> buffer as SO_RCVBUF or twice that amount, i.e. the size visible from userspace
> versus from kernelspace.
> In this way the sender is given a good approximation of our buffer space,
> regardless of the buffer policy - we always advertise what we have. The
> proposed solution fixes the described problems and removes the need for the
> rwnd restoration algorithm. Finally, as the proposed solution is a
> simplification, some lines of code, along with some bytes in struct
> sctp_association, are saved.
>
> Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
>
General comment - While this seems to make sense to me generally speaking,
doesn't it currently violate section 6 of the RFC?
A SCTP receiver MUST be able to receive a minimum of 1500 bytes in
one SCTP packet. This means that a SCTP endpoint MUST NOT indicate
less than 1500 bytes in its Initial a_rwnd sent in the INIT or INIT
ACK.
Since we set the initial rwnd value to the larger of sk->sk_rcvbuf/2 or
SCTP_MIN_RWND (1500 bytes), won't we potentially advertise half that amount? It
seems we need to double the minimum socket receive buffer here.
Neil
> ---
>
> --- net-next.orig/net/sctp/associola.c
> +++ net-next/net/sctp/associola.c
> @@ -1367,44 +1367,35 @@ static inline bool sctp_peer_needs_updat
> return false;
> }
>
> -/* Increase asoc's rwnd by len and send any window update SACK if needed. */
> -void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
> +/* Account asoc's rwnd for the approximated state in the buffer,
> + * and check whether SACK needs to be sent.
> + */
> +void sctp_assoc_rwnd_account(struct sctp_association *asoc, int check_sack)
> {
> + int rx_count;
> struct sctp_chunk *sack;
> struct timer_list *timer;
>
> - if (asoc->rwnd_over) {
> - if (asoc->rwnd_over >= len) {
> - asoc->rwnd_over -= len;
> - } else {
> - asoc->rwnd += (len - asoc->rwnd_over);
> - asoc->rwnd_over = 0;
> - }
> - } else {
> - asoc->rwnd += len;
> - }
> + if (asoc->ep->rcvbuf_policy)
> + rx_count = atomic_read(&asoc->rmem_alloc);
> + else
> + rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
>
> - /* If we had window pressure, start recovering it
> - * once our rwnd had reached the accumulated pressure
> - * threshold. The idea is to recover slowly, but up
> - * to the initial advertised window.
> - */
> - if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
> - int change = min(asoc->pathmtu, asoc->rwnd_press);
> - asoc->rwnd += change;
> - asoc->rwnd_press -= change;
> - }
> + if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> + asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
> + else
> + asoc->rwnd = 0;
>
> - pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
> - __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
> - asoc->a_rwnd);
> + pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
> + __func__, asoc, asoc->rwnd, rx_count,
> + asoc->base.sk->sk_rcvbuf);
>
> /* Send a window update SACK if the rwnd has increased by at least the
> * minimum of the association's PMTU and half of the receive buffer.
> * The algorithm used is similar to the one described in
> * Section 4.2.3.3 of RFC 1122.
> */
> - if (sctp_peer_needs_update(asoc)) {
> + if (check_sack && sctp_peer_needs_update(asoc)) {
> asoc->a_rwnd = asoc->rwnd;
>
> pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
> @@ -1426,45 +1417,6 @@ void sctp_assoc_rwnd_increase(struct sct
> }
> }
>
> -/* Decrease asoc's rwnd by len. */
> -void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
> -{
> - int rx_count;
> - int over = 0;
> -
> - if (unlikely(!asoc->rwnd || asoc->rwnd_over))
> - pr_debug("%s: association:%p has asoc->rwnd:%u, "
> - "asoc->rwnd_over:%u!\n", __func__, asoc,
> - asoc->rwnd, asoc->rwnd_over);
> -
> - if (asoc->ep->rcvbuf_policy)
> - rx_count = atomic_read(&asoc->rmem_alloc);
> - else
> - rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
> -
> - /* If we've reached or overflowed our receive buffer, announce
> - * a 0 rwnd if rwnd would still be positive. Store the
> - * the potential pressure overflow so that the window can be restored
> - * back to original value.
> - */
> - if (rx_count >= asoc->base.sk->sk_rcvbuf)
> - over = 1;
> -
> - if (asoc->rwnd >= len) {
> - asoc->rwnd -= len;
> - if (over) {
> - asoc->rwnd_press += asoc->rwnd;
> - asoc->rwnd = 0;
> - }
> - } else {
> - asoc->rwnd_over = len - asoc->rwnd;
> - asoc->rwnd = 0;
> - }
> -
> - pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
> - __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
> - asoc->rwnd_press);
> -}
>
> /* Build the bind address list for the association based on info from the
> * local endpoint and the remote peer.
> --- net-next.orig/include/net/sctp/structs.h
> +++ net-next/include/net/sctp/structs.h
> @@ -1653,17 +1653,6 @@ struct sctp_association {
> /* This is the last advertised value of rwnd over a SACK chunk. */
> __u32 a_rwnd;
>
> - /* Number of bytes by which the rwnd has slopped. The rwnd is allowed
> - * to slop over a maximum of the association's frag_point.
> - */
> - __u32 rwnd_over;
> -
> - /* Keeps treack of rwnd pressure. This happens when we have
> - * a window, but not recevie buffer (i.e small packets). This one
> - * is releases slowly (1 PMTU at a time ).
> - */
> - __u32 rwnd_press;
> -
> /* This is the sndbuf size in use for the association.
> * This corresponds to the sndbuf size for the association,
> * as specified in the sk->sndbuf.
> @@ -1892,8 +1881,7 @@ void sctp_assoc_update(struct sctp_assoc
> __u32 sctp_association_get_next_tsn(struct sctp_association *);
>
> void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
> -void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
> -void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
> +void sctp_assoc_rwnd_account(struct sctp_association *, int);
> void sctp_assoc_set_primary(struct sctp_association *,
> struct sctp_transport *);
> void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
> --- net-next.orig/net/sctp/sm_statefuns.c
> +++ net-next/net/sctp/sm_statefuns.c
> @@ -6176,7 +6176,7 @@ static int sctp_eat_data(const struct sc
> * PMTU. In cases, such as loopback, this might be a rather
> * large spill over.
> */
> - if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
> + if ((!chunk->data_accepted) && (!asoc->rwnd ||
> (datalen > asoc->rwnd + asoc->frag_point))) {
>
> /* If this is the next TSN, consider reneging to make
> --- net-next.orig/net/sctp/socket.c
> +++ net-next/net/sctp/socket.c
> @@ -2097,7 +2097,7 @@ static int sctp_recvmsg(struct kiocb *io
> * rwnd is updated when the event is freed.
> */
> if (!sctp_ulpevent_is_notification(event))
> - sctp_assoc_rwnd_increase(event->asoc, copied);
> + sctp_assoc_rwnd_account(event->asoc, 1);
> goto out;
> } else if ((event->msg_flags & MSG_NOTIFICATION) ||
> (event->msg_flags & MSG_EOR))
> --- net-next.orig/net/sctp/ulpevent.c
> +++ net-next/net/sctp/ulpevent.c
> @@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(s
> skb = sctp_event2skb(event);
> /* Set the owner and charge rwnd for bytes received. */
> sctp_ulpevent_set_owner(event, asoc);
> - sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
> + sctp_assoc_rwnd_account(asoc, 0);
>
> if (!skb->data_len)
> return;
> @@ -1035,7 +1035,7 @@ static void sctp_ulpevent_release_data(s
> }
>
> done:
> - sctp_assoc_rwnd_increase(event->asoc, len);
> + sctp_assoc_rwnd_account(event->asoc, 1);
> sctp_ulpevent_release_owner(event);
> }
>
> --
>
* RE: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
2014-01-17 7:01 [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer Matija Glavinic Pecotic
` (2 preceding siblings ...)
2014-01-22 12:30 ` Neil Horman
@ 2014-01-22 13:41 ` David Laight
2014-01-22 14:52 ` Vlad Yasevich
2014-01-23 0:16 ` Vlad Yasevich
4 siblings, 1 reply; 13+ messages in thread
From: David Laight @ 2014-01-22 13:41 UTC (permalink / raw)
To: 'Matija Glavinic Pecotic', linux-sctp@vger.kernel.org
Cc: Alexander Sverdlin, netdev@vger.kernel.org
From: Matija Glavinic Pecotic
> Implementation of (a)rwnd calculation might lead to severe performance issues
> and associations completely stalling. These problems are described and solution
> is proposed which improves lksctp's robustness in congestion state.
>
> 1) Sudden drop of a_rwnd and incomplete window recovery afterwards
>
> Data accounted in sctp_assoc_rwnd_decrease takes only payload size (sctp data),
> but size of sk_buff, which is blamed against receiver buffer, is not accounted
> in rwnd. Theoretically, this should not be the problem as actual size of buffer
> is double the amount requested on the socket (SO_RECVBUF). Problem here is
> that this will have bad scaling for data which is less then sizeof sk_buff.
> E.g. in 4G (LTE) networks, link interfacing radio side will have a large portion
> of traffic of this size (less then 100B).
...
>
> Proposed solution:
>
> Both problems share the same root cause, and that is improper scaling of socket
> buffer with rwnd. Solution in which sizeof(sk_buff) is taken into concern while
> calculating rwnd is not possible due to fact that there is no linear
> relationship between amount of data blamed in increase/decrease with IP packet
> in which payload arrived. Even in case such solution would be followed,
> complexity of the code would increase. Due to nature of current rwnd handling,
> slow increase (in sctp_assoc_rwnd_increase) of rwnd after pressure state is
> entered is rationale, but it gives false representation to the sender of current
> buffer space. Furthermore, it implements additional congestion control mechanism
> which is defined on implementation, and not on standard basis.
>
> Proposed solution simplifies whole algorithm having on mind definition from rfc:
>
> o  Receiver Window (rwnd): This gives the sender an indication of the space
>    available in the receiver's inbound buffer.
>
> Core of the proposed solution is given with these lines:
>
> sctp_assoc_rwnd_account:
> 	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> 		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
> 	else
> 		asoc->rwnd = 0;
>
> We advertise to sender (half of) actual space we have. Half is in the braces
> depending whether you would like to observe size of socket buffer as SO_RECVBUF
> or twice the amount, i.e. size is the one visible from userspace, that is,
> from kernelspace.
> In this way sender is given with good approximation of our buffer space,
> regardless of the buffer policy - we always advertise what we have. Proposed
> solution fixes described problems and removes necessity for rwnd restoration
> algorithm. Finally, as proposed solution is simplification, some lines of code,
> along with some bytes in struct sctp_association are saved.

IIRC the 'size' taken of the socket buffer is the skb's 'true size' and that
includes any padding before and after the actual rx data. For short packets
the driver may have copied the data into a smaller skb, for long packets it
is likely to be more than that of a full length ethernet packet.
In either case it can be significantly more than sizeof(sk_buff) (190?) plus
the size of the actual data.

I'm also not sure that continuously removing 'credit' is a good idea.
I've done a lot of comms protocol code, removing credit and 'window
slamming' acks are not good ideas.

Perhaps the advertised window should be bounded by the configured socket
buffer size, and only reduced if the actual space isn't likely to be large
enough given the typical overhead of the received data.

Similarly, as the window is opened after congestion it should be increased
by the amount of data actually removed (not the number of free bytes).
When there is a significant amount of space the window could be increased
faster - allowing a smaller number of larger skb carrying more data be queued.

As a matter of interest, how does TCP handle this?

	David
* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
2014-01-22 12:30 ` Neil Horman
@ 2014-01-22 14:18 ` Vlad Yasevich
2014-01-22 19:34 ` Matija Glavinic Pecotic
0 siblings, 1 reply; 13+ messages in thread
From: Vlad Yasevich @ 2014-01-22 14:18 UTC (permalink / raw)
To: Neil Horman, Matija Glavinic Pecotic
Cc: linux-sctp, Alexander Sverdlin, netdev
On 01/22/2014 07:30 AM, Neil Horman wrote:
> On Fri, Jan 17, 2014 at 08:01:24AM +0100, Matija Glavinic Pecotic wrote:
>>
>> Proposed solution simplifies whole algorithm having on mind definition from rfc:
>>
>> o  Receiver Window (rwnd): This gives the sender an indication of the space
>>    available in the receiver's inbound buffer.
>>
>> Core of the proposed solution is given with these lines:
>>
>> sctp_assoc_rwnd_account:
>> 	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>> 		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
>> 	else
>> 		asoc->rwnd = 0;
>>
>> We advertise to sender (half of) actual space we have. Half is in the braces
>> depending whether you would like to observe size of socket buffer as SO_RECVBUF
>> or twice the amount, i.e. size is the one visible from userspace, that is,
>> from kernelspace.
>> In this way sender is given with good approximation of our buffer space,
>> regardless of the buffer policy - we always advertise what we have. Proposed
>> solution fixes described problems and removes necessity for rwnd restoration
>> algorithm. Finally, as proposed solution is simplification, some lines of code,
>> along with some bytes in struct sctp_association are saved.
>>
>> Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
>> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
>>
>
>
> General comment - While this seems to make sense to me generally speaking,
> doesn't it currently violate section 6 of the RFC?
>
>
> A SCTP receiver MUST be able to receive a minimum of 1500 bytes in
> one SCTP packet. This means that a SCTP endpoint MUST NOT indicate
> less than 1500 bytes in its Initial a_rwnd sent in the INIT or INIT
> ACK.
>
> Since we set the initial rwnd value to the larger of sk->sk_rcvbuf/2 or
> SCTP_MIN_RWND (1500 bytes), won't we potentially advertize half that
amount?
Not initially. The initial window will still be advertised properly. Once
we receive the first packet and have consumed some space, we'll advertise
half of the available receive buffer. It is perfectly OK to advertise a
window smaller than MIN_WINDOW in the middle of the transfer.
> It seems we need to double the minimum socket receive buffer here.
Not here specifically, but yes. It is already broken, and this patch
doesn't change the current behavior. This is something SCTP may need to do
separately.
-vlad
>
> Neil
>
>
* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
2014-01-22 13:41 ` David Laight
@ 2014-01-22 14:52 ` Vlad Yasevich
2014-01-22 15:30 ` David Laight
0 siblings, 1 reply; 13+ messages in thread
From: Vlad Yasevich @ 2014-01-22 14:52 UTC (permalink / raw)
To: David Laight, 'Matija Glavinic Pecotic',
linux-sctp@vger.kernel.org
Cc: Alexander Sverdlin, netdev@vger.kernel.org
On 01/22/2014 08:41 AM, David Laight wrote:
> From: Matija Glavinic Pecotic
>> Implementation of (a)rwnd calculation might lead to severe
performance issues
>> and associations completely stalling. These problems are described
and solution
>> is proposed which improves lksctp's robustness in congestion state.
>>
>> 1) Sudden drop of a_rwnd and incomplete window recovery afterwards
>>
>> Data accounted in sctp_assoc_rwnd_decrease takes only payload size
(sctp data),
>> but size of sk_buff, which is blamed against receiver buffer, is not
accounted
>> in rwnd. Theoretically, this should not be the problem as actual size
of buffer
>> is double the amount requested on the socket (SO_RECVBUF). Problem
here is
>> that this will have bad scaling for data which is less then sizeof
sk_buff.
>> E.g. in 4G (LTE) networks, link interfacing radio side will have a
large portion
>> of traffic of this size (less then 100B).
> ...
>>
>> Proposed solution:
>>
>> Both problems share the same root cause, and that is improper scaling of socket
>> buffer with rwnd. Solution in which sizeof(sk_buff) is taken into concern while
>> calculating rwnd is not possible due to fact that there is no linear
>> relationship between amount of data blamed in increase/decrease with IP packet
>> in which payload arrived. Even in case such solution would be followed,
>> complexity of the code would increase. Due to nature of current rwnd handling,
>> slow increase (in sctp_assoc_rwnd_increase) of rwnd after pressure state is
>> entered is rationale, but it gives false representation to the sender of current
>> buffer space. Furthermore, it implements additional congestion control mechanism
>> which is defined on implementation, and not on standard basis.
>>
>> Proposed solution simplifies whole algorithm having on mind definition from rfc:
>>
>> o Receiver Window (rwnd): This gives the sender an indication of the space
>>   available in the receiver's inbound buffer.
>>
>> Core of the proposed solution is given with these lines:
>>
>> sctp_assoc_rwnd_account:
>>         if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>>                 asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
>>         else
>>                 asoc->rwnd = 0;
>>
>> We advertise to sender (half of) actual space we have. Half is in the braces
>> depending whether you would like to observe size of socket buffer as SO_RECVBUF
>> or twice the amount, i.e. size is the one visible from userspace, that is,
>> from kernelspace.
>> In this way sender is given with good approximation of our buffer space,
>> regardless of the buffer policy - we always advertise what we have. Proposed
>> solution fixes described problems and removes necessity for rwnd restoration
>> algorithm. Finally, as proposed solution is simplification, some lines of code,
>> along with some bytes in struct sctp_association are saved.
>
> IIRC the 'size' taken of the socket buffer is the skb's 'true size' and that
> includes any padding before and after the actual rx data. For short packets
> the driver may have copied the data into a smaller skb, for long packets it
> is likely to be more than that of a full length ethernet packet.
> In either case it can be significantly more than sizeof(sk_buff) (190?) plus
> the size of the actual data.
SCTP currently doesn't support GRO, so each packet is limited to
ethernet packet plus sk_buff overhead. What throws a real monkey
wrench into this whole accounting business is SCTP bundling. If you
bundle multiple messages into the single packet, accounting for it
is a mess.
>
> I'm also not sure that continuously removing 'credit' is a good idea.
> I've done a lot of comms protocol code, removing credit and 'window
> slamming' acks are not good ideas.
This patch doesn't continuously remove 'credit'. It advertises the
closest approximation of the window that we support and computes it
the same way as we do for Initial Window (available sk_rcvbuff >> 1).
As the receiver drains the receive queue, available buffer will increase
and the available window will grow.
In fact, the current implementation does more 'window slamming' than
this proposed patch.
>
> Perhaps the advertised window should be bounded by the configured socket
> buffer size, and only reduced if the actual space isn't likely to be large
> enough given the typical overhead of the received data.
Problem is there is no typical overhead due to the message-oriented
nature of SCTP, and the socket buffer limits the entire buffer space
(overhead included). What happens when the socket buffer is
consumed faster than the window due to overhead being significantly
larger than the payload? You have "window slamming" of the worst
kind. The available window slowly decreases until it hits the point
of receive buffer exhaustion and then it slams shut.
This patch is better in that it gradually reduces the window based
on available buffer space, providing a more accurate depiction of what
is happening on the receiver.
>
> Similarly, as the window is opened after congestion it should be increased
> by the amount of data actually removed (not the number of free bytes).
> When there is a significant amount of space the window could be increased
> faster - allowing a smaller number of larger skb carrying more data be queued.
>
> As a matter of interest, how does TCP handle this?
TCP also calculates its available window based on available receive
buffer space.
-vlad
>
> David
>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
2014-01-22 14:52 ` Vlad Yasevich
@ 2014-01-22 15:30 ` David Laight
2014-01-22 19:48 ` Matija Glavinic Pecotic
0 siblings, 1 reply; 13+ messages in thread
From: David Laight @ 2014-01-22 15:30 UTC (permalink / raw)
To: 'Vlad Yasevich', 'Matija Glavinic Pecotic',
linux-sctp@vger.kernel.org
Cc: Alexander Sverdlin, netdev@vger.kernel.org
From: Vlad Yasevich
...
> > IIRC the 'size' taken of the socket buffer is the skb's 'true size'
> > and that includes any padding before and after the actual rx data.
> > For short packets the driver may have copied the data into a smaller
> > skb, for long packets it is likely to be more than that of a full
> > length ethernet packet.
> > In either case it can be significantly more than sizeof(sk_buff)
> > (190?) plus the size of the actual data.
>
> SCTP currently doesn't support GRO, so each packet is limited to
> ethernet packet plus sk_buff overhead.

The ethernet driver might still pass up a 2k buffer, or even a 4k one.
Especially if it supports GRO for TCP.

> What throws a real monkey
> wrench into this whole accounting business is SCTP bundling.  If you
> bundle multiple messages into the single packet, accounting for it
> is a mess.

How about dividing the 'truesize' by the reference count?
(except that isn't quite right...)
I assume there are multiple skb headers but only one data buffer?
At least the chunks are all for the same connection so end up on
one socket (except I remember some other horrid stuff for datagrams).

> > I'm also not sure that continuously removing 'credit' is a good idea.
> > I've done a lot of comms protocol code, removing credit and 'window
> > slamming' acks are not good ideas.
>
> This patch doesn't continuously remove 'credit'.  It advertises the
> closest approximation of the window that we support and computes it
> the same way as we do for Initial Window (available sk_rcvbuff >> 1).
> As the receiver drains the receive queue, available buffer will increase
> and the available window will grow.

Let's assume the socket buffer size is 100k.
We advertise a window of 50k.
We now receive 100 bytes of data, the remote has 49900 window left.
The 'truesize' is something like 190+64(ish)+100+tail_pad, say 400.
Socket buffer space is reduced to 99600 and any ack would give 49800.
So we have reduced the window by 100 bytes.
With that much space it probably doesn't matter much.

However if the connection is receive limited then the remote system
will have a second packet in flight that uses the rest of the window.
We then receive an in-sequence but out-of-window packet that refers
to window that we had previously given to the remote system.

I don't know what the sctp (or tcp) code does with such packets.
In my experience it is best to treat them as valid data (unless
there are larger free memory issues) and ack them at some point.
Hopefully the rules of the underlying protocol let you do this!

If you discard these packets then every packet gets sent twice
(or even more often if the data is very short).

	David
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
2014-01-22 2:12 ` Vlad Yasevich
@ 2014-01-22 19:21 ` Matija Glavinic Pecotic
0 siblings, 0 replies; 13+ messages in thread
From: Matija Glavinic Pecotic @ 2014-01-22 19:21 UTC (permalink / raw)
To: ext Vlad Yasevich; +Cc: linux-sctp, Alexander Sverdlin, netdev
Hello Vlad,
On 01/22/2014 03:12 AM, ext Vlad Yasevich wrote:
> On 01/17/2014 02:01 AM, Matija Glavinic Pecotic wrote:
> [ snip long description ...]
>> Proposed solution simplifies whole algorithm having on mind definition from rfc:
>>
>> o Receiver Window (rwnd): This gives the sender an indication of the space
>> available in the receiver's inbound buffer.
>>
>> Core of the proposed solution is given with these lines:
>>
>> sctp_assoc_rwnd_account:
>> if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>> asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
>> else
>> asoc->rwnd = 0;
>>
>> We advertise to sender (half of) actual space we have. Half is in the braces
>> depending whether you would like to observe size of socket buffer as SO_RECVBUF
>> or twice the amount, i.e. size is the one visible from userspace, that is,
>> from kernelspace.
>> In this way sender is given with good approximation of our buffer space,
>> regardless of the buffer policy - we always advertise what we have. Proposed
>> solution fixes described problems and removes necessity for rwnd restoration
>> algorithm. Finally, as proposed solution is simplification, some lines of code,
>> along with some bytes in struct sctp_association are saved.
>>
>> Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
>> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
>>
>
> This is the correct approach. However there is one issue and
> a few cosmetic suggestions.
>
>> ---
>>
>> --- net-next.orig/net/sctp/associola.c
>> +++ net-next/net/sctp/associola.c
>> @@ -1367,44 +1367,35 @@ static inline bool sctp_peer_needs_updat
>> return false;
>> }
>>
>> -/* Increase asoc's rwnd by len and send any window update SACK if needed. */
>> -void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
>> +/* Account asoc's rwnd for the approximated state in the buffer,
>> + * and check whether SACK needs to be sent.
>> + */
>> +void sctp_assoc_rwnd_account(struct sctp_association *asoc, int check_sack)
>
> Maybe sctp_assoc_rwnd_update()?
>
> 'check_sack' isn't a very descriptive name for what we are doing. May
> be 'update_peer'? Also, since it is either 0 or 1, just make a bool.
these would be more appropriate names, I agree.
>> {
>> + int rx_count;
>> struct sctp_chunk *sack;
>> struct timer_list *timer;
>>
>> - if (asoc->rwnd_over) {
>> - if (asoc->rwnd_over >= len) {
>> - asoc->rwnd_over -= len;
>> - } else {
>> - asoc->rwnd += (len - asoc->rwnd_over);
>> - asoc->rwnd_over = 0;
>> - }
>> - } else {
>> - asoc->rwnd += len;
>> - }
>> + if (asoc->ep->rcvbuf_policy)
>> + rx_count = atomic_read(&asoc->rmem_alloc);
>> + else
>> + rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
>>
>> - /* If we had window pressure, start recovering it
>> - * once our rwnd had reached the accumulated pressure
>> - * threshold. The idea is to recover slowly, but up
>> - * to the initial advertised window.
>> - */
>> - if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
>> - int change = min(asoc->pathmtu, asoc->rwnd_press);
>> - asoc->rwnd += change;
>> - asoc->rwnd_press -= change;
>> - }
>> + if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>> + asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
>> + else
>> + asoc->rwnd = 0;
>>
>> - pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
>> - __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
>> - asoc->a_rwnd);
>> + pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
>> + __func__, asoc, asoc->rwnd, rx_count,
>> + asoc->base.sk->sk_rcvbuf);
>>
>> /* Send a window update SACK if the rwnd has increased by at least the
>> * minimum of the association's PMTU and half of the receive buffer.
>> * The algorithm used is similar to the one described in
>> * Section 4.2.3.3 of RFC 1122.
>> */
>> - if (sctp_peer_needs_update(asoc)) {
>> + if (check_sack && sctp_peer_needs_update(asoc)) {
>> asoc->a_rwnd = asoc->rwnd;
>>
>> pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
>> @@ -1426,45 +1417,6 @@ void sctp_assoc_rwnd_increase(struct sct
>> }
>> }
>>
>> -/* Decrease asoc's rwnd by len. */
>> -void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
>> -{
>> - int rx_count;
>> - int over = 0;
>> -
>> - if (unlikely(!asoc->rwnd || asoc->rwnd_over))
>> - pr_debug("%s: association:%p has asoc->rwnd:%u, "
>> - "asoc->rwnd_over:%u!\n", __func__, asoc,
>> - asoc->rwnd, asoc->rwnd_over);
>> -
>> - if (asoc->ep->rcvbuf_policy)
>> - rx_count = atomic_read(&asoc->rmem_alloc);
>> - else
>> - rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
>> -
>> - /* If we've reached or overflowed our receive buffer, announce
>> - * a 0 rwnd if rwnd would still be positive. Store the
>> - * the potential pressure overflow so that the window can be restored
>> - * back to original value.
>> - */
>> - if (rx_count >= asoc->base.sk->sk_rcvbuf)
>> - over = 1;
>> -
>> - if (asoc->rwnd >= len) {
>> - asoc->rwnd -= len;
>> - if (over) {
>> - asoc->rwnd_press += asoc->rwnd;
>> - asoc->rwnd = 0;
>> - }
>> - } else {
>> - asoc->rwnd_over = len - asoc->rwnd;
>> - asoc->rwnd = 0;
>> - }
>> -
>> - pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
>> - __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
>> - asoc->rwnd_press);
>> -}
>>
>> /* Build the bind address list for the association based on info from the
>> * local endpoint and the remote peer.
>> --- net-next.orig/include/net/sctp/structs.h
>> +++ net-next/include/net/sctp/structs.h
>> @@ -1653,17 +1653,6 @@ struct sctp_association {
>> /* This is the last advertised value of rwnd over a SACK chunk. */
>> __u32 a_rwnd;
>>
>> - /* Number of bytes by which the rwnd has slopped. The rwnd is allowed
>> - * to slop over a maximum of the association's frag_point.
>> - */
>> - __u32 rwnd_over;
>> -
>> - /* Keeps treack of rwnd pressure. This happens when we have
>> - * a window, but not recevie buffer (i.e small packets). This one
>> - * is releases slowly (1 PMTU at a time ).
>> - */
>> - __u32 rwnd_press;
>> -
>> /* This is the sndbuf size in use for the association.
>> * This corresponds to the sndbuf size for the association,
>> * as specified in the sk->sndbuf.
>> @@ -1892,8 +1881,7 @@ void sctp_assoc_update(struct sctp_assoc
>> __u32 sctp_association_get_next_tsn(struct sctp_association *);
>>
>> void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
>> -void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
>> -void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
>> +void sctp_assoc_rwnd_account(struct sctp_association *, int);
>> void sctp_assoc_set_primary(struct sctp_association *,
>> struct sctp_transport *);
>> void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
>> --- net-next.orig/net/sctp/sm_statefuns.c
>> +++ net-next/net/sctp/sm_statefuns.c
>> @@ -6176,7 +6176,7 @@ static int sctp_eat_data(const struct sc
>> * PMTU. In cases, such as loopback, this might be a rather
>> * large spill over.
>> */
>> - if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
>> + if ((!chunk->data_accepted) && (!asoc->rwnd ||
>> (datalen > asoc->rwnd + asoc->frag_point))) {
>>
>> /* If this is the next TSN, consider reneging to make
>> --- net-next.orig/net/sctp/socket.c
>> +++ net-next/net/sctp/socket.c
>> @@ -2097,7 +2097,7 @@ static int sctp_recvmsg(struct kiocb *io
>> * rwnd is updated when the event is freed.
>> */
>> if (!sctp_ulpevent_is_notification(event))
>> - sctp_assoc_rwnd_increase(event->asoc, copied);
>> + sctp_assoc_rwnd_account(event->asoc, 1);
>
> This is not going to do anything. The event has not been freed, thus
> rmem_alloc has not changed. So, the rwnd will not change.
You are right. With the current approach we can only remove this code.
I cannot imagine a potential problem with it. I simply do not see a situation in which the receiver's behavior regarding reading the whole message would depend on the current state of rwnd.
In fact, the behavior you pointed out is actually in sync with the whole idea. I would say it is safe to remove it.
>> goto out;
>> } else if ((event->msg_flags & MSG_NOTIFICATION) ||
>> (event->msg_flags & MSG_EOR))
>> --- net-next.orig/net/sctp/ulpevent.c
>> +++ net-next/net/sctp/ulpevent.c
>> @@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(s
>> skb = sctp_event2skb(event);
>> /* Set the owner and charge rwnd for bytes received. */
>> sctp_ulpevent_set_owner(event, asoc);
>> - sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
>> + sctp_assoc_rwnd_account(asoc, 0);
>>
>> if (!skb->data_len)
>> return;
>> @@ -1035,7 +1035,7 @@ static void sctp_ulpevent_release_data(s
>> }
>>
>> done:
>> - sctp_assoc_rwnd_increase(event->asoc, len);
>> + sctp_assoc_rwnd_account(event->asoc, 1);
>
> Same here. The below call to sctp_ulpevent_release_owner() will
> finally update the rmem_alloc, so the a above call to rwnd_account()
> will not trigger a window update.
And if we have a SACK on the way, it will contain the approximated rwnd minus the size of the last message received. Ideally, we would here first account rmem_alloc, update rwnd, and then put the association.
The only thing which comes to my mind is to rewrite this part exactly as described.
Any suggestions from your side?
Thanks,
Matija
> Thanks
> -vlad
>
>> sctp_ulpevent_release_owner(event);
>> }
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
2014-01-22 14:18 ` Vlad Yasevich
@ 2014-01-22 19:34 ` Matija Glavinic Pecotic
0 siblings, 0 replies; 13+ messages in thread
From: Matija Glavinic Pecotic @ 2014-01-22 19:34 UTC (permalink / raw)
To: ext Vlad Yasevich, Neil Horman; +Cc: linux-sctp, Alexander Sverdlin, netdev
Hello Neil, Vlad,
On 01/22/2014 03:18 PM, ext Vlad Yasevich wrote:
> On 01/22/2014 07:30 AM, Neil Horman wrote:
>> On Fri, Jan 17, 2014 at 08:01:24AM +0100, Matija Glavinic Pecotic wrote:
>>>
>>> Proposed solution simplifies whole algorithm having on mind definition from rfc:
>>>
>>> o Receiver Window (rwnd): This gives the sender an indication of the space
>>>   available in the receiver's inbound buffer.
>>>
>>> Core of the proposed solution is given with these lines:
>>>
>>> sctp_assoc_rwnd_account:
>>>         if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
>>>                 asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
>>>         else
>>>                 asoc->rwnd = 0;
>>>
>>> We advertise to sender (half of) actual space we have. Half is in the braces
>>> depending whether you would like to observe size of socket buffer as SO_RECVBUF
>>> or twice the amount, i.e. size is the one visible from userspace, that is,
>>> from kernelspace.
>>> In this way sender is given with good approximation of our buffer space,
>>> regardless of the buffer policy - we always advertise what we have. Proposed
>>> solution fixes described problems and removes necessity for rwnd restoration
>>> algorithm. Finally, as proposed solution is simplification, some lines of code,
>>> along with some bytes in struct sctp_association are saved.
>>>
>>> Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
>>> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
>>>
>>
>>
>> General comment - While this seems to make sense to me generally speaking,
>> doesn't it currently violate section 6 of the RFC?
>>
>>
>> A SCTP receiver MUST be able to receive a minimum of 1500 bytes in
>> one SCTP packet. This means that a SCTP endpoint MUST NOT indicate
>> less than 1500 bytes in its Initial a_rwnd sent in the INIT or INIT
>> ACK.
>>
>> Since we set the initial rwnd value to the larger of sk->sk_rcvbuf/2 or
>> SCTP_MIN_RWND (1500 bytes), won't we potentially advertize half that amount?
>
> Not initially. Initial window will still be advertized properly. Once
> we receive the first packet and consumed some space, we'll advertize
> half of available receive buffer. It is perfectly OK to advertize a
> window smaller the MIN_WINDOW in the middle of the transfer.
I agree to this, although we might be in gray area here:
Advertised Receiver Window Credit (a_rwnd): 32 bits (unsigned
integer)
This value represents the dedicated buffer space, in number of
bytes, the sender of the INIT has reserved in association with
this window. During the life of the association, this buffer
space SHOULD NOT be lessened (i.e., dedicated buffers taken away
from this association); however, an endpoint MAY change the value
of a_rwnd it sends in SACK chunks.
this might be considered as taking away buffer space, although I would agree with the point below about doubling.
This however opens another question which you should be aware of. This patch breaks regression in two TCs:
test_timetolive and test_sctp_sendrcvmsg.
This is simply due to 'honest' rwnd reporting. Both of these TCs share code in which the initial rwnd is set very low, and later the socket buffer is increased while counting on the fact that rwnd will stay as initially set. In the TCs, this latter rwnd is fetched from the socket and used as the value for the message size, which in the end breaks the test, as the message to be sent is too big.
An important difference from the current implementation is that changes of SO_RCVBUF will also change a_rwnd. It is not a big problem to add code which keeps the idea but bounds rwnd to the initially set rwnd; however, we haven't found that mandated by the rfc.
Thanks,
Matija
>> It seems we need to double the minimum socket receive buffer here.
>
> Not here specifically, but yes. It is already broken and this patch
> doesn't change current behavior. This is something SCTP may need to do
> separately.
>
> -vlad
>
>>
>> Neil
>>
>>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
2014-01-22 15:30 ` David Laight
@ 2014-01-22 19:48 ` Matija Glavinic Pecotic
0 siblings, 0 replies; 13+ messages in thread
From: Matija Glavinic Pecotic @ 2014-01-22 19:48 UTC (permalink / raw)
To: ext David Laight
Cc: 'Vlad Yasevich', linux-sctp@vger.kernel.org,
Alexander Sverdlin, netdev@vger.kernel.org
Hello David,
On 01/22/2014 04:30 PM, ext David Laight wrote:
> From: Vlad Yasevich
> ...
>>> IIRC the 'size' taken of the socket buffer is the skb's 'true size'
>>> and that includes any padding before and after the actual rx data.
>>> For short packets the driver may have copied the data into a smaller
>>> skb, for long packets it is likely to be more than that of a full
>>> length ethernet packet.
>>> In either case it can be significantly more than sizeof(sk_buff)
>>> (190?) plus the size of the actual data.
>>
>> SCTP currently doesn't support GRO, so each packet is limited to
>> ethernet packet plus sk_buff overhead.
>
> The ethernet driver might still pass up a 2k buffer, or even a 4k one.
> Especially if it supports GRO for TCP.
>
>> What throws a real monkey
>> wrench into this whole accounting business is SCTP bundling. If you
>> bundle multiple messages into the single packet, accounting for it
>> is a mess.
>
> How about dividing the 'truesize' by the reference count?
> (except that isn't quite right...)
> I assume there are multiple skb headers but only one data buffer?
> At least the chunks are all for the same connection so end up on
> one socket (except I remember some other horrid stuff for datagrams).
>
>>> I'm also not sure that continuously removing 'credit' is a good idea.
>>> I've done a lot of comms protocol code, removing credit and 'window
>>> slamming' acks are not good ideas.
>>
>> This patch doesn't continuously remove 'credit'. It advertises the
>> closest approximation of the window that we support and computes it
>> the same way as we do for Initial Window (available sk_rcvbuff >> 1).
>> As the receiver drains the receive queue, available buffer will increase
>> and the available window will grow.
>
> Let's assume the socket buffer size is 100k.
> We advertise a window of 50k.
> We now receive 100 bytes of data, the remote has 49900 window left.
> The 'truesize' is something like 190+64(ish)+100+tail_pad, say 400.
> Socket buffer space is reduced to 99600 and any ack would give 49800.
> So we have reduced the window by 100 bytes.
> With that much space it probably doesn't matter much.
>
> However if the connection is receive limited then the remote system
> will have a second packet in flight that uses the rest of the window.
> We then receive an in-sequence but out-of-window packet that refers
> to window that we had previously given to the remote system.
AFAIK, if the packet is not discarded due to a depleted buffer, it will normally be processed.
In case it is discarded, the sender will have to resend it since it won't be acked.
> I don't know what the sctp (or tcp) code does with such packets.
> In my experience it is best to treat them as valid data (unless
> there are larger free memory issues) and ack them at some point.
> Hopefully the rules of the underlying protocol let you do this!
>
> If you discard these packets then every packet gets sent twice
> (or even more often if the data is very short).
For small packets we will not get into trouble, since SCTP has the same SWS (silly window syndrome) avoidance algorithm as TCP.
Regards,
Matija
> David
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
2014-01-17 7:01 [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer Matija Glavinic Pecotic
` (3 preceding siblings ...)
2014-01-22 13:41 ` David Laight
@ 2014-01-23 0:16 ` Vlad Yasevich
4 siblings, 0 replies; 13+ messages in thread
From: Vlad Yasevich @ 2014-01-23 0:16 UTC (permalink / raw)
To: linux-sctp
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2014-01-23 0:16 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-17 7:01 [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer Matija Glavinic Pecotic
2014-01-21 22:36 ` David Miller
2014-01-22 0:08 ` Vlad Yasevich
2014-01-22 2:12 ` Vlad Yasevich
2014-01-22 19:21 ` Matija Glavinic Pecotic
2014-01-22 12:30 ` Neil Horman
2014-01-22 14:18 ` Vlad Yasevich
2014-01-22 19:34 ` Matija Glavinic Pecotic
2014-01-22 13:41 ` David Laight
2014-01-22 14:52 ` Vlad Yasevich
2014-01-22 15:30 ` David Laight
2014-01-22 19:48 ` Matija Glavinic Pecotic
2014-01-23 0:16 ` Vlad Yasevich
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox