* [PATCH net-next] tcp: reduce memory needs of out of order queue
@ 2011-10-14 7:19 Eric Dumazet
2011-10-14 7:42 ` David Miller
0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2011-10-14 7:19 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Many drivers allocate a big skb to store a single TCP frame
(WIFI drivers, or NICs using PAGE_SIZE fragments).

It is now common for skb->truesize to be bigger than 4096 for a ~1500
byte TCP frame.

TCP sessions with large RTT and packet losses can fill their Out Of
Order queue with such oversized skbs and hit their sk_rcvbuf limit,
triggering a prune of the complete OFO queue before the missing
packet(s) can arrive and let skbs move from the OFO queue to the
receive queue.

This patch adds a skb_reduce_truesize() helper and uses it for all skbs
queued into the OFO queue.

Spending some time on a copy is worth the pain, since it gives SACK
processing a chance to complete across the RTT barrier.

This greatly improves user experience, with no added cost on the fast path.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
net/ipv4/tcp_input.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c1653fe..1d10edb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4426,6 +4426,25 @@ static inline int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
return 0;
}
+/*
+ * Caller wants to reduce memory needs before queueing skb.
+ * The (expensive) copy should not be done in the fast path.
+ */
+static struct sk_buff *skb_reduce_truesize(struct sk_buff *skb)
+{
+ if (skb->truesize > 2 * SKB_TRUESIZE(skb->len)) {
+ struct sk_buff *nskb;
+
+ nskb = skb_copy_expand(skb, skb_headroom(skb), 0,
+ GFP_ATOMIC | __GFP_NOWARN);
+ if (nskb) {
+ __kfree_skb(skb);
+ skb = nskb;
+ }
+ }
+ return skb;
+}
+
static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
{
struct tcphdr *th = tcp_hdr(skb);
@@ -4553,6 +4572,11 @@ drop:
SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
+ /* Since this skb might stay in the ofo queue a long time, try to
+ * reduce its truesize (if it's too big) to avoid future pruning.
+ * Many drivers allocate large buffers even to hold tiny frames.
+ */
+ skb = skb_reduce_truesize(skb);
skb_set_owner_r(skb, sk);
if (!skb_peek(&tp->out_of_order_queue)) {
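For context, the SKB_TRUESIZE() macro used by the helper above had just
been added to the tree around this time; from memory it expands to roughly
the following (a sketch, not a verbatim copy of include/linux/skbuff.h):

/* Estimated memory cost of an skb holding X bytes of payload: the payload
 * plus the cache-aligned sizes of the skb head structures.
 */
#define SKB_TRUESIZE(X) ((X) +						\
			 SKB_DATA_ALIGN(sizeof(struct sk_buff)) +	\
			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))

So the helper copies the skb only when its actual truesize is more than
twice this estimate for skb->len bytes of payload.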
* Re: [PATCH net-next] tcp: reduce memory needs of out of order queue
2011-10-14 7:19 [PATCH net-next] tcp: reduce memory needs of out of order queue Eric Dumazet
@ 2011-10-14 7:42 ` David Miller
2011-10-14 8:05 ` Eric Dumazet
2011-10-14 15:50 ` Rick Jones
0 siblings, 2 replies; 14+ messages in thread
From: David Miller @ 2011-10-14 7:42 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 14 Oct 2011 09:19:51 +0200
> Many drivers allocates big skb to store a single TCP frame.
> (WIFI drivers, or NIC using PAGE_SIZE fragments)
>
> Its now common to get skb->truesize bigger than 4096 to store a ~1500
> bytes TCP frame.
>
> TCP sessions with large RTT and packet losses can fill their Out Of
> Order queue with such oversized skbs, and hit their sk_rcvbuf limit,
> starting a pruning of complete OFO queue, without giving chance to
> receive the missing packet(s) and moving skbs from OFO to receive queue.
>
> This patch adds skb_reduce_truesize() helper, and uses it for all skbs
> queued into OFO queue.
>
> Spending some time to perform a copy is worth the pain, since it permits
> SACK processing to have a chance to complete over the RTT barrier.
>
> This greatly improves user experience, without added cost on fast path.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
No objection from me, although I wish wireless drivers were able to
size their SKBs more appropriately. I wonder how many problems that
look like "OMG we gotz da Buffer Bloat, arrr!" are actually due to
this truesize issue.
I think such large truesize SKBs will cause problems even in non-loss
situations, in that the receive buffer will hit its limits more
quickly. I'm not sure that the receive buffer autotuning is built to
handle this sort of scenario as a common occurrence.
You might want to check if this is the actual root cause of your
problems. If the receive buffer autotuning doesn't expand the receive
buffer enough to hold two windows worth of these large truesize SKBs,
that's the real reason why we end up pruning.
We have to decide if these kinds of SKBs are acceptable as a normal
situation for MSS sized frames. And if they are then it's probably
a good idea to adjust the receive buffer autotuning code too.
Although I realize it might be difficult, getting rid of these weird
SKBs in the first place would be ideal.
It would also be a good idea to put the truesize inaccuracies into
perspective when selecting how to fix this. Truesize is there to keep
a 1 byte packet from being accounted as 1 byte when it really consumes
a 256 byte SKB plus metadata; a case with such a high ratio of wastage
is the important one to catch.
On the other hand, using 2048 bytes for a 1500 byte packet and claiming
the truesize is 1500 + sizeof(metadata)... that might be an acceptable
lie to tell :-) This is especially true if it allows an easy solution
to this wireless problem.
Just some thoughts... and I wonder if the wireless thing is due to
some hardware limitation or similar.
* Re: [PATCH net-next] tcp: reduce memory needs of out of order queue
2011-10-14 7:42 ` David Miller
@ 2011-10-14 8:05 ` Eric Dumazet
2011-10-14 17:33 ` Eric Dumazet
2011-10-14 15:50 ` Rick Jones
1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2011-10-14 8:05 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Friday, 14 October 2011 at 03:42 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 14 Oct 2011 09:19:51 +0200
>
> > Many drivers allocates big skb to store a single TCP frame.
> > (WIFI drivers, or NIC using PAGE_SIZE fragments)
> >
> > Its now common to get skb->truesize bigger than 4096 to store a ~1500
> > bytes TCP frame.
> >
> > TCP sessions with large RTT and packet losses can fill their Out Of
> > Order queue with such oversized skbs, and hit their sk_rcvbuf limit,
> > starting a pruning of complete OFO queue, without giving chance to
> > receive the missing packet(s) and moving skbs from OFO to receive queue.
> >
> > This patch adds skb_reduce_truesize() helper, and uses it for all skbs
> > queued into OFO queue.
> >
> > Spending some time to perform a copy is worth the pain, since it permits
> > SACK processing to have a chance to complete over the RTT barrier.
> >
> > This greatly improves user experience, without added cost on fast path.
> >
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>
> No objection from me, although I wish wireless drivers were able to
> size their SKBs more appropriately. I wonder how many problems that
> look like "OMG we gotz da Buffer Bloat, arrr!" are actually due to
> this truesize issue.
>
> I think such large truesize SKBs will cause problems even in non loss
> situations, in that the receive buffer will hit it's limits more
> quickly. I not sure that the receive buffer autotuning is built to
> handle this sort of scenerio as a common occurance.
>
> You might want to check if this is the actual root cause of your
> problems. If the receive buffer autotuning doesn't expand the receive
> buffer enough to hold two windows worth of these large truesize SKBs,
> that's the real reason why we end up pruning.
>
> We have to decide if these kinds of SKBs are acceptable as a normal
> situation for MSS sized frames. And if they are then it's probably
> a good idea to adjust the receive buffer autotuning code too.
>
> Although I realize it might be difficult, getting rid of these weird
> SKBs in the first place would be ideal.
>
> It would also be a good idea to put the truesize inaccuracies into
> perspective when selecting how to fix this. It's trying to prevent
> 1 byte packets not accounting for the 256 byte SKB and metadata.
> That kind of case with such a high ratio of wastage is important.
>
> On the other hand, using 2048 bytes for a 1500 byte packet and claiming
> the truesize is 1500 + sizeof(metadata)... that might be an acceptable
> lie to tell :-) This is especially true if it allows an easy solution
> to this wireless problem.
>
> Just some thoughts... and I wonder if the wireless thing is due to
> some hardware limitation or similar.
>
This patch specifically addresses the OFO problem, trying to lower
memory usage for machines handling a lot of sockets (proxies, for example).

For the general case, I believe we have to tune/change
tcp_win_from_space() to take into account the general tendency toward
fat skbs.

sysctl_tcp_adv_win_scale is not fine-grained enough today, and the
default value (2) causes too many collapses. It is also a very complex
setting; I am pretty sure nobody knows how to use it.

With the default of 2:
tcp_win_from_space(int space) -> 75% of space

The only other practical choices in current kernels are 1 (or -1):
tcp_win_from_space(int space) -> 50% of space

or -2:
tcp_win_from_space(int space) -> 25% of space
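For reference, the helper in question looks roughly like this (a
from-memory sketch of the 3.x-era include/net/tcp.h, not a verbatim copy):

extern int sysctl_tcp_adv_win_scale;

/* A positive tcp_adv_win_scale advertises space - space/2^n as window,
 * keeping the rest as overhead headroom; a non-positive value advertises
 * only space/2^n.
 */
static inline int tcp_win_from_space(int space)
{
	return sysctl_tcp_adv_win_scale <= 0 ?
		(space >> (-sysctl_tcp_adv_win_scale)) :
		space - (space >> sysctl_tcp_adv_win_scale);
}

With space = 87380 (the default tcp_rmem middle value), that is a window
of ~65535 bytes at the default setting of 2, ~43690 at 1 or -1, and
~21845 at -2.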
* Re: [PATCH net-next] tcp: reduce memory needs of out of order queue
2011-10-14 7:42 ` David Miller
2011-10-14 8:05 ` Eric Dumazet
@ 2011-10-14 15:50 ` Rick Jones
2011-10-14 16:00 ` Eric Dumazet
2011-10-14 22:12 ` Rick Jones
1 sibling, 2 replies; 14+ messages in thread
From: Rick Jones @ 2011-10-14 15:50 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev
On 10/14/2011 12:42 AM, David Miller wrote:
> No objection from me, although I wish wireless drivers were able to
> size their SKBs more appropriately. I wonder how many problems that
> look like "OMG we gotz da Buffer Bloat, arrr!" are actually due to
> this truesize issue.
I think the buffer bloat folks are looking at latency through transmit
queues - now perhaps some of their latency is really coming from
retransmissions caused by packets being dropped due to overfilled
socket buffers, but I'm pretty sure they are clever enough to look for that.
> I think such large truesize SKBs will cause problems even in non loss
> situations, in that the receive buffer will hit it's limits more
> quickly. I not sure that the receive buffer autotuning is built to
> handle this sort of scenerio as a common occurance.
I believe that may be the case - at least during something like:
netperf -t TCP_RR -H <host> -l 30 -- -b 256 -D
which on an otherwise quiet test setup will report a non-trivial number
of retransmissions - either via looking at netstat -s output, or by
adding local_transport_retrans,remote_transport_retrans to an output
selector for netperf (eg -o
throughput,burst_size,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end)
(I plan on providing more data after a laptop has gone through some
upgrades)
> You might want to check if this is the actual root cause of your
> problems. If the receive buffer autotuning doesn't expand the receive
> buffer enough to hold two windows worth of these large truesize SKBs,
> that's the real reason why we end up pruning.
>
> We have to decide if these kinds of SKBs are acceptable as a normal
> situation for MSS sized frames. And if they are then it's probably
> a good idea to adjust the receive buffer autotuning code too.
>
> Although I realize it might be difficult, getting rid of these weird
> SKBs in the first place would be ideal.
That means a semi-arbitrary alloc/copy in drivers, even when/if the
wasted space isn't going to be a problem, no? The TCP_RR test above
would run "just fine" if the burst size were much smaller, but with an
arbitrary allocate/copy it would take a service-demand and thus
transaction-rate hit.
> It would also be a good idea to put the truesize inaccuracies into
> perspective when selecting how to fix this. It's trying to prevent
> 1 byte packets not accounting for the 256 byte SKB and metadata.
> That kind of case with such a high ratio of wastage is important.
>
> On the other hand, using 2048 bytes for a 1500 byte packet and claiming
> the truesize is 1500 + sizeof(metadata)... that might be an acceptable
> lie to tell :-) This is especially true if it allows an easy solution
> to this wireless problem.
Is the wireless problem strictly a wireless problem? Many of the
drivers where Eric has been fixing the truesize accounting have been
wired devices, no?
rick jones
* Re: [PATCH net-next] tcp: reduce memory needs of out of order queue
2011-10-14 15:50 ` Rick Jones
@ 2011-10-14 16:00 ` Eric Dumazet
2011-10-14 16:11 ` Eric Dumazet
2011-10-14 22:12 ` Rick Jones
1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2011-10-14 16:00 UTC (permalink / raw)
To: Rick Jones; +Cc: David Miller, netdev
On Friday, 14 October 2011 at 08:50 -0700, Rick Jones wrote:
> Is the wireless problem strictly a wireless problem? Many of the
> drivers where Eric has been fixing the truesize accounting have been
> wired devices no?
Yes, but the goal of such fixes is to make the bugs show up on said
wired devices too ;)

About WIFI, I get these TCP collapses on two different machines, one of
them using the drivers/net/wireless/rt2x00 driver.

Extract from drivers/net/wireless/rt2x00/rt2x00queue.h:
/**
* DOC: Entry frame size
*
* Ralink PCI devices demand the Frame size to be a multiple of 128 bytes,
* for USB devices this restriction does not apply, but the value of
* 2432 makes sense since it is big enough to contain the maximum fragment
* size according to the ieee802.11 specs.
* The aggregation size depends on support from the driver, but should
* be something around 3840 bytes.
*/
#define DATA_FRAME_SIZE 2432
#define MGMT_FRAME_SIZE 256
#define AGGREGATION_SIZE 3840
You can see why we end up with skb->truesize > 4096 buffers (rough
arithmetic below).
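A back-of-the-envelope check; the structure sizes below are assumptions
(typical 2011-era x86_64 values), not exact figures:

#include <stdio.h>

#define DATA_FRAME_SIZE   2432	/* rt2x00 RX buffer, from the header above */
#define SHARED_INFO_SIZE   320	/* assumed ~sizeof(struct skb_shared_info) */
#define SK_BUFF_SIZE       240	/* assumed ~sizeof(struct sk_buff) */

/* kmalloc() serves the skb head from power-of-two slab caches */
static unsigned int roundup_pow_of_two(unsigned int x)
{
	unsigned int r = 1;

	while (r < x)
		r <<= 1;
	return r;
}

int main(void)
{
	/* 2432 + ~320 rounds up to 4096, then add struct sk_buff itself */
	unsigned int head = roundup_pow_of_two(DATA_FRAME_SIZE + SHARED_INFO_SIZE);

	printf("truesize ~ %u bytes for a <= 1500 byte frame\n",
	       head + SK_BUFF_SIZE);
	return 0;
}

With those assumed sizes, the head allocation alone rounds up to 4096
bytes, so truesize lands around 4.3 KB even for a sub-1500 byte frame.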
I liked doing the copybreak only when needed; I found the OFO case was
responsible for the collapses most of the time.

We could also do the copybreak for frames queued into the regular
receive_queue, if the current wmem_alloc is above 25% of rcvbuf space...
* Re: [PATCH net-next] tcp: reduce memory needs of out of order queue
2011-10-14 16:00 ` Eric Dumazet
@ 2011-10-14 16:11 ` Eric Dumazet
0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2011-10-14 16:11 UTC (permalink / raw)
To: Rick Jones; +Cc: David Miller, netdev
On Friday, 14 October 2011 at 18:00 +0200, Eric Dumazet wrote:
> Now we also could do the copybreak for frames queued into regular
> receive_queue, if current wmem_alloc is above 25% of rcvbuf space...
I mean rmem_alloc of course...
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c1653fe..0fe0828 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4426,6 +4426,25 @@ static inline int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
return 0;
}
+/*
+ * Caller wants to reduce memory needs before queueing skb.
+ * The (expensive) copy should not be done in the fast path.
+ */
+static struct sk_buff *skb_reduce_truesize(struct sk_buff *skb)
+{
+ if (skb->truesize > 2 * SKB_TRUESIZE(skb->len)) {
+ struct sk_buff *nskb;
+
+ nskb = skb_copy_expand(skb, skb_headroom(skb), 0,
+ GFP_ATOMIC | __GFP_NOWARN);
+ if (nskb) {
+ __kfree_skb(skb);
+ skb = nskb;
+ }
+ }
+ return skb;
+}
+
static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
{
struct tcphdr *th = tcp_hdr(skb);
@@ -4475,6 +4494,10 @@ queue_and_out:
tcp_try_rmem_schedule(sk, skb->truesize))
goto drop;
+ if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf >> 2) {
+ skb = skb_reduce_truesize(skb);
+ th = tcp_hdr(skb);
+ }
skb_set_owner_r(skb, sk);
__skb_queue_tail(&sk->sk_receive_queue, skb);
}
@@ -4553,6 +4576,11 @@ drop:
SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
+ /* Since this skb might stay in the ofo queue a long time, try to
+ * reduce its truesize (if it's too big) to avoid future pruning.
+ * Many drivers allocate large buffers even to hold tiny frames.
+ */
+ skb = skb_reduce_truesize(skb);
skb_set_owner_r(skb, sk);
if (!skb_peek(&tp->out_of_order_queue)) {
* Re: [PATCH net-next] tcp: reduce memory needs of out of order queue
2011-10-14 8:05 ` Eric Dumazet
@ 2011-10-14 17:33 ` Eric Dumazet
0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2011-10-14 17:33 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Friday, 14 October 2011 at 10:05 +0200, Eric Dumazet wrote:
> This patch specifically addresses the OFO problem, trying to lower
> memory usage for machines handling lot of sockets (proxies for example)
Well, this needs a bit more thought, so please zap the patch.
Thanks
* Re: [PATCH net-next] tcp: reduce memory needs of out of order queue
2011-10-14 15:50 ` Rick Jones
2011-10-14 16:00 ` Eric Dumazet
@ 2011-10-14 22:12 ` Rick Jones
2011-10-14 23:18 ` David Miller
2011-10-15 6:39 ` Eric Dumazet
1 sibling, 2 replies; 14+ messages in thread
From: Rick Jones @ 2011-10-14 22:12 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev
> I believe that may be the case - at least during something like:
>
> netperf -t TCP_RR -H <host> -l 30 -- -b 256 -D
>
> which on an otherwise quiet test setup will report a non-trivial number
> of retransmissions - either via looking at netstat -s output, or by
> adding local_transport_retrans,remote_transport_retrans to an output
> selector for netperf (eg -o
> throughput,burst_size,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end)
>
>
> (I plan on providing more data after a laptop has gone through some
> upgrades)
So, a test as above from a system running 2.6.38-11-generic to a system
running 3.0.0-12-generic. On the sender we have:
raj@tardy:~/netperf2_trunk$ netstat -s > before; src/netperf -H
raj-8510w.americas.hpqcorp.net -t tcp_rr -- -b 256 -D -o
throughput,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end
; netstat -s > after
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to internal-host.americas.hpqcorp.net (16.89.245.115) port 0 AF_INET :
nodelay : first burst 256
Throughput,Local Transport Retransmissions,Remote Transport
Retransmissions,Local Send Socket Size Final,Remote Recv Socket Size Final
76752.43,274,0,16384,98304
274 retransmissions at the sender. The "beforeafter" of that on the sender:
raj@tardy:~/netperf2_trunk$ cat delta.send
Ip:
766747 total packets received
12 with invalid addresses
0 forwarded
0 incoming packets discarded
766735 incoming packets delivered
734689 requests sent out
0 dropped because of missing route
Icmp:
0 ICMP messages received
0 input ICMP message failed.
ICMP input histogram:
destination unreachable: 0
echo requests: 0
echo replies: 0
0 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 0
echo request: 0
echo replies: 0
IcmpMsg:
InType0: 0
InType3: 0
InType8: 0
OutType0: 0
OutType3: 0
OutType8: 0
Tcp:
2 active connections openings
0 passive connection openings
0 failed connection attempts
0 connection resets received
0 connections established
766727 segments received
734408 segments send out
274 segments retransmited
0 bad segments received.
0 resets sent
Udp:
7 packets received
0 packets to unknown port received.
0 packet receive errors
7 packets sent
UdpLite:
TcpExt:
0 packets pruned from receive queue because of socket buffer overrun
0 ICMP packets dropped because they were out-of-window
0 TCP sockets finished time wait in fast timer
2 delayed acks sent
0 delayed acks further delayed because of locked socket
Quick ack mode was activated 0 times
170856 packets directly queued to recvmsg prequeue.
1204 bytes directly in process context from backlog
170678 bytes directly received in process context from prequeue
592090 packet headers predicted
170626 packets header predicted and directly queued to user
1375 acknowledgments not containing data payload received
174911 predicted acknowledgments
150 times recovered from packet loss by selective acknowledgements
0 congestion windows recovered without slow start by DSACK
0 congestion windows recovered without slow start after partial ack
299 TCP data loss events
TCPLostRetransmit: 9
0 timeouts after reno fast retransmit
0 timeouts after SACK recovery
253 fast retransmits
14 forward retransmits
6 retransmits in slow start
0 other TCP timeouts
1 SACK retransmits failed
0 times receiver scheduled too late for direct processing
0 packets collapsed in receive queue due to low socket buffer
0 DSACKs sent for old packets
0 DSACKs received
0 connections reset due to unexpected data
0 connections reset due to early user close
0 connections aborted due to timeout
0 times unabled to send RST due to no memory
TCPDSACKIgnoredOld: 0
TCPDSACKIgnoredNoUndo: 0
TCPSackShifted: 0
TCPSackMerged: 1031
TCPSackShiftFallback: 240
TCPBacklogDrop: 0
IPReversePathFilter: 0
IpExt:
InMcastPkts: 0
OutMcastPkts: 0
InBcastPkts: 1
InOctets: -1012182764
OutOctets: -1436530450
InMcastOctets: 0
OutMcastOctets: 0
InBcastOctets: 147
and then the deltas on the receiver:
raj@raj-8510w:~/netperf2_trunk$ cat delta.recv
Ip:
734669 total packets received
0 with invalid addresses
0 forwarded
0 incoming packets discarded
734669 incoming packets delivered
766696 requests sent out
0 dropped because of missing route
Icmp:
0 ICMP messages received
0 input ICMP message failed.
ICMP input histogram:
destination unreachable: 0
0 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
IcmpMsg:
InType3: 0
Tcp:
0 active connections openings
2 passive connection openings
0 failed connection attempts
0 connection resets received
0 connections established
734651 segments received
766695 segments send out
0 segments retransmited
0 bad segments received.
0 resets sent
Udp:
1 packets received
0 packets to unknown port received.
0 packet receive errors
1 packets sent
UdpLite:
TcpExt:
28 packets pruned from receive queue because of socket buffer overrun
0 delayed acks sent
0 delayed acks further delayed because of locked socket
19 packets directly queued to recvmsg prequeue.
0 bytes directly in process context from backlog
667 bytes directly received in process context from prequeue
727842 packet headers predicted
9 packets header predicted and directly queued to user
161 acknowledgments not containing data payload received
229704 predicted acknowledgments
6774 packets collapsed in receive queue due to low socket buffer
TCPBacklogDrop: 276
IpExt:
InMcastPkts: 0
OutMcastPkts: 0
InBcastPkts: 17
OutBcastPkts: 0
InOctets: 38973144
OutOctets: 40673137
InMcastOctets: 0
OutMcastOctets: 0
InBcastOctets: 1816
OutBcastOctets: 0
this is an otherwise clean network, no errors reported by ifconfig or
ethtool -S, and the packet rate was well within the limits of 1 GbE and
the ProCurve 2724 switch between the two systems.
From just a very quick look it looks like tcp_v[46]_rcv is called,
finds that the socket is owned by the user, attempts to add to the
backlog, but the path called by sk_add_backlog does not seem to make any
attempts to compress things, so when the quantity of data is << the
truesize it starts tossing babies out with the bathwater.
rick jones
* Re: [PATCH net-next] tcp: reduce memory needs of out of order queue
2011-10-14 22:12 ` Rick Jones
@ 2011-10-14 23:18 ` David Miller
2011-10-15 6:54 ` Eric Dumazet
2011-10-15 6:39 ` Eric Dumazet
1 sibling, 1 reply; 14+ messages in thread
From: David Miller @ 2011-10-14 23:18 UTC (permalink / raw)
To: rick.jones2; +Cc: eric.dumazet, netdev
From: Rick Jones <rick.jones2@hp.com>
Date: Fri, 14 Oct 2011 15:12:04 -0700
> From just a very quick look it looks like tcp_v[46]_rcv is called,
> finds that the socket is owned by the user, attempts to add to the
> backlog, but the path called by sk_add_backlog does not seem to make
> any attempts to compress things, so when the quantity of data is <<
> the truesize it starts tossing babies out with the bathwater.
This is why I don't believe the right fix is to add bandaids all
around the TCP layer.
The wastage has to be avoided at a higher level.
* Re: [PATCH net-next] tcp: reduce memory needs of out of order queue
2011-10-14 22:12 ` Rick Jones
2011-10-14 23:18 ` David Miller
@ 2011-10-15 6:39 ` Eric Dumazet
2011-10-17 16:47 ` Rick Jones
1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2011-10-15 6:39 UTC (permalink / raw)
To: Rick Jones; +Cc: David Miller, netdev
On Friday, 14 October 2011 at 15:12 -0700, Rick Jones wrote:
Thanks Rick
> So, a test as above from a system running 2.6.38-11-generic to a system
> running 3.0.0-12-generic. On the sender we have:
>
> raj@tardy:~/netperf2_trunk$ netstat -s > before; src/netperf -H
> raj-8510w.americas.hpqcorp.net -t tcp_rr -- -b 256 -D -o
> throughput,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end
> ; netstat -s > after
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
> to internal-host.americas.hpqcorp.net (16.89.245.115) port 0 AF_INET :
> nodelay : first burst 256
> Throughput,Local Transport Retransmissions,Remote Transport
> Retransmissions,Local Send Socket Size Final,Remote Recv Socket Size Final
> 76752.43,274,0,16384,98304
>
> 274 retransmissions at the sender. The "beforeafter" of that on the sender:
>
> raj@tardy:~/netperf2_trunk$ cat delta.send
> Tcp:
> 2 active connections openings
> 0 passive connection openings
> 0 failed connection attempts
> 0 connection resets received
> 0 connections established
> 766727 segments received
> 734408 segments send out
> 274 segments retransmited
Exactly the count of frames dropped because the receiver's sk_rmem_alloc +
backlog.len hit the receiver's sk_rcvbuf:
static inline bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb)
{
unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);
return qsize + skb->truesize > sk->sk_rcvbuf;
}
static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *skb)
{
if (sk_rcvqueues_full(sk, skb))
return -ENOBUFS;
__sk_add_backlog(sk, skb);
sk->sk_backlog.len += skb->truesize;
return 0;
}
In very old kernels we had no limit on the backlog, so we could queue a
lot of extra skbs in it and eventually consume all kernel memory (OOM).

refs: commit c377411f249 (net: sk_add_backlog() take rmem_alloc into
account)
      commit 6b03a53a5ab7 (tcp: use limited socket backlog)
      commit 8eae939f14003 (net: add limit for socket backlog)

Now that we enforce a limit, it is better to choose a correct limit /
TCP window combination so that normal traffic doesn't trigger drops at
the receiver.
> 0 bad segments received.
> 0 resets sent
> Udp:
> 7 packets received
> 0 packets to unknown port received.
> 0 packet receive errors
> 7 packets sent
> UdpLite:
> TcpExt:
> 0 packets pruned from receive queue because of socket buffer overrun
> 0 ICMP packets dropped because they were out-of-window
> 0 TCP sockets finished time wait in fast timer
> 2 delayed acks sent
> 0 delayed acks further delayed because of locked socket
> Quick ack mode was activated 0 times
> 170856 packets directly queued to recvmsg prequeue.
> 1204 bytes directly in process context from backlog
> 170678 bytes directly received in process context from prequeue
> 592090 packet headers predicted
> 170626 packets header predicted and directly queued to user
> 1375 acknowledgments not containing data payload received
> 174911 predicted acknowledgments
> 150 times recovered from packet loss by selective acknowledgements
> 0 congestion windows recovered without slow start by DSACK
> 0 congestion windows recovered without slow start after partial ack
> 299 TCP data loss events
> TCPLostRetransmit: 9
> 0 timeouts after reno fast retransmit
> 0 timeouts after SACK recovery
> 253 fast retransmits
> 14 forward retransmits
> 6 retransmits in slow start
> 0 other TCP timeouts
> 1 SACK retransmits failed
> 0 times receiver scheduled too late for direct processing
> 0 packets collapsed in receive queue due to low socket buffer
> 0 DSACKs sent for old packets
> 0 DSACKs received
> 0 connections reset due to unexpected data
> 0 connections reset due to early user close
> 0 connections aborted due to timeout
> 0 times unabled to send RST due to no memory
> TCPDSACKIgnoredOld: 0
> TCPDSACKIgnoredNoUndo: 0
> TCPSackShifted: 0
> TCPSackMerged: 1031
> TCPSackShiftFallback: 240
> TCPBacklogDrop: 0
> IPReversePathFilter: 0
> IpExt:
> InMcastPkts: 0
> OutMcastPkts: 0
> InBcastPkts: 1
> InOctets: -1012182764
> OutOctets: -1436530450
> InMcastOctets: 0
> OutMcastOctets: 0
> InBcastOctets: 147
>
> and then the deltas on the receiver:
>
> raj@raj-8510w:~/netperf2_trunk$ cat delta.recv
> Ip:
> 734669 total packets received
> 0 with invalid addresses
> 0 forwarded
> 0 incoming packets discarded
> 734669 incoming packets delivered
> 766696 requests sent out
> 0 dropped because of missing route
> Icmp:
> 0 ICMP messages received
> 0 input ICMP message failed.
> ICMP input histogram:
> destination unreachable: 0
> 0 ICMP messages sent
> 0 ICMP messages failed
> ICMP output histogram:
> IcmpMsg:
> InType3: 0
> Tcp:
> 0 active connections openings
> 2 passive connection openings
> 0 failed connection attempts
> 0 connection resets received
> 0 connections established
> 734651 segments received
> 766695 segments send out
> 0 segments retransmited
> 0 bad segments received.
> 0 resets sent
> Udp:
> 1 packets received
> 0 packets to unknown port received.
> 0 packet receive errors
> 1 packets sent
> UdpLite:
> TcpExt:
> 28 packets pruned from receive queue because of socket buffer overrun
> 0 delayed acks sent
> 0 delayed acks further delayed because of locked socket
> 19 packets directly queued to recvmsg prequeue.
> 0 bytes directly in process context from backlog
> 667 bytes directly received in process context from prequeue
> 727842 packet headers predicted
> 9 packets header predicted and directly queued to user
> 161 acknowledgments not containing data payload received
> 229704 predicted acknowledgments
> 6774 packets collapsed in receive queue due to low socket buffer
> TCPBacklogDrop: 276
Yes, these two counters explain it all.

1) "6774 packets collapsed in receive queue due to low socket buffer"

We spend a _lot_ of cpu time in the "collapsing" process: taking several
skbs and building a compound one (using one PAGE and trying to fill all
the available bytes in it with contiguous parts).

Doing this work is of course a last desperate attempt before the far
more painful:

2) TCPBacklogDrop: 276

We plainly drop incoming messages because too much kernel memory is used
by the socket.
> IpExt:
> InMcastPkts: 0
> OutMcastPkts: 0
> InBcastPkts: 17
> OutBcastPkts: 0
> InOctets: 38973144
> OutOctets: 40673137
> InMcastOctets: 0
> OutMcastOctets: 0
> InBcastOctets: 1816
> OutBcastOctets: 0
>
> this is an otherwise clean network, no errors reported by ifconfig or
> ethtool -S, and the packet rate was well within the limits of 1 GbE and
> the ProCurve 2724 switch between the two systems.
>
> From just a very quick look it looks like tcp_v[46]_rcv is called,
> finds that the socket is owned by the user, attempts to add to the
> backlog, but the path called by sk_add_backlog does not seem to make any
> attempts to compress things, so when the quantity of data is << the
> truesize it starts tossing babies out with the bathwater.
>
Rick, could you redo the test with the following on the receiver:

echo 1 >/proc/sys/net/ipv4/tcp_adv_win_scale

If you still see collapses/retransmits, you could then try:

echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale

Thanks!
* Re: [PATCH net-next] tcp: reduce memory needs of out of order queue
2011-10-14 23:18 ` David Miller
@ 2011-10-15 6:54 ` Eric Dumazet
2011-10-17 0:53 ` David Miller
0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2011-10-15 6:54 UTC (permalink / raw)
To: David Miller; +Cc: rick.jones2, netdev
On Friday, 14 October 2011 at 19:18 -0400, David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Fri, 14 Oct 2011 15:12:04 -0700
>
> > From just a very quick look it looks like tcp_v[46]_rcv is called,
> > finds that the socket is owned by the user, attempts to add to the
> > backlog, but the path called by sk_add_backlog does not seem to make
> > any attempts to compress things, so when the quantity of data is <<
> > the truesize it starts tossing babies out with the bathwater.
>
> This is why I don't believe the right fix is to add bandaids all
> around the TCP layer.
>
> The wastage has to be avoided at a higher level.
We can't do that at a higher level without smart hardware (like NIU) or
adding a copy.

It's a tradeoff between space and speed.

Most drivers have to allocate a large skb1 and post it to the hardware
to receive a frame (the length is unknown in advance; only the max
length is known).

Some drivers have a copybreak feature, copying small incoming frames
into a smaller skb2 (skb2->truesize < skb1->truesize), something like
the sketch below. This strategy saves memory for small frames, not for
1500 byte frames.
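A minimal copybreak sketch, for illustration only; it is not lifted from
any particular driver, and the 256 byte threshold is an arbitrary
assumption:

#include <linux/skbuff.h>
#include <linux/netdevice.h>

#define RX_COPYBREAK 256	/* arbitrary threshold, drivers vary */

/* If the received frame is small, copy it into a right-sized skb and
 * free the oversized RX buffer, trading a little CPU for an accurate
 * truesize. Returns the skb that should be passed up the stack.
 */
static struct sk_buff *rx_copybreak(struct net_device *dev,
				    struct sk_buff *big)
{
	struct sk_buff *small;

	if (big->len > RX_COPYBREAK)
		return big;

	small = netdev_alloc_skb_ip_align(dev, big->len);
	if (!small)
		return big;	/* no memory: keep the large skb */

	skb_copy_to_linear_data(small, big->data, big->len);
	skb_put(small, big->len);
	dev_kfree_skb_any(big);
	return small;
}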
I think the problem is in the TCP layer (and maybe in other protocols):

1) Either tune rcvbuf to allow more memory to be used for a particular
TCP window, or lower the TCP window to allow fewer packets in flight
for a given rcvbuf.

2) TCP collapsing already tries to reduce the memory cost of a TCP
socket with many packets in its OFO queue, but fixing 1) would keep
these collapses from happening in the first place. People wanting high
TCP bandwidth [ with, say, more than 500 in-flight packets per session ]
can certainly afford enough memory.
* Re: [PATCH net-next] tcp: reduce memory needs of out of order queue
2011-10-15 6:54 ` Eric Dumazet
@ 2011-10-17 0:53 ` David Miller
2011-10-17 7:02 ` Eric Dumazet
0 siblings, 1 reply; 14+ messages in thread
From: David Miller @ 2011-10-17 0:53 UTC (permalink / raw)
To: eric.dumazet; +Cc: rick.jones2, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 15 Oct 2011 08:54:42 +0200
> I think the problem is in TCP layer (and maybe in other protocols) :
>
> 1) Either tune rcvbuf to allow more memory to be used, for a particular
> tcp window,
>
> Or lower TCP window to allow less packets in flight for a given
> rcvbuf.
>
> 2) TCP COLLAPSE already is trying to reduce memory costs of a tcp socket
> with many packets in OFO queue. But fixing 1) would make these collapses
> never happen in the first place. People wanting high TCP bandwidth
> [ with say more than 500 in-flight packets per session ] can certainly
> afford having enough memory.
So perhaps the best solution is to divorce truesize from such driver
and device details? If there is one calculation, then TCP need only
be concerned with one case.
Look at how confusing and useless tcp_adv_win_scale ends up being for
this problem.
Therefore I'll make the mostly-serious proposal that truesize be
something like "initial_real_total_data + sizeof(metadata)"
So if a device receives a 512 byte packet, it's:
512 + sizeof(metadata)
It still provides the necessary protection that truesize is meant to
provide, yet sanitizes all of the receive and send buffer overhead
handling.
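Purely to make that proposal concrete, a hypothetical helper (this is
not an existing kernel API, and the name is invented):

#include <linux/skbuff.h>

/* Hypothetical accounting per the proposal above: charge only the bytes
 * actually carried plus fixed per-skb metadata, regardless of how large
 * a buffer the driver really allocated underneath.
 */
static inline unsigned int proposed_truesize(const struct sk_buff *skb)
{
	return skb->len +
	       SKB_DATA_ALIGN(sizeof(struct sk_buff)) +
	       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
}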
TCP should be absolutely, and completely, impervious to details like
how buffering needs to be done for some random wireless card. Just
the mere fact that using a larger buffer in a driver ruins TCP
performance indicates a serious design failure.
* Re: [PATCH net-next] tcp: reduce memory needs of out of order queue
2011-10-17 0:53 ` David Miller
@ 2011-10-17 7:02 ` Eric Dumazet
0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2011-10-17 7:02 UTC (permalink / raw)
To: David Miller; +Cc: rick.jones2, netdev
On Sunday, 16 October 2011 at 20:53 -0400, David Miller wrote:
> So perhaps the best solution is to divorce truesize from such driver
> and device details? If there is one calculation, then TCP need only
> be concerned with one case.
>
> Look at how confusing and useless tcp_adv_win_scale ends up being for
> this problem.
>
> Therefore I'll make the mostly-serious propsal that truesize be
> something like "initial_real_total_data + sizeof(metadata)"
>
> So if a device receives a 512 byte packet, it's:
>
> 512 + sizeof(metadata)
>
That would probably OOM in stress situations, with thousands of sockets.
> It still provides the necessary protection that truesize is meant to
> provide, yet sanitizes all of the receive and send buffer overhead
> handling.
>
> TCP should be absoultely, and completely, impervious to details like
> how buffering needs to be done for some random wireless card. Just
> the mere fact that using a larger buffer in a driver ruins TCP
> performance indicates a serious design failure.
>
I don't think it's a design failure. It's the same problem as computing
the TCP window from the rcvspace (the memory we allow the socket to
consume) based on the MSS: if the sender uses 1-byte frames only, the
receiver hits the memory limit and performance drops.

Right now our TCP window tuning really assumes too much: perfect
MSS-sized skbs costing _exactly_ MSS + sizeof(metadata), while we
already know the real slab cost is higher:

__roundup_pow_of_two(MSS + sizeof(struct skb_shared_info)) +
	SKB_DATA_ALIGN(sizeof(struct sk_buff))

and now, with paged-frag devices:

PAGE_SIZE + SKB_DATA_ALIGN(sizeof(struct sk_buff))

We assume the sender behaves correctly and that drivers don't use 64KB
pages to store a single 72-byte frame.

I would say the first thing the TCP stack must respect is the memory
limits the admin set for it. That's what skb->truesize is for.

# cat /proc/sys/net/ipv4/tcp_rmem
4096 87380 4127616

In this case, we allow up to 4 Mbytes of receiver memory per session,
not 20 or 30 Mbytes...

We must translate this into a TCP window suitable for current hardware.
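Plugging rough numbers into those formulas shows the gap; the structure
sizes here are assumptions (typical of x86_64 at the time), and MSS = 1448
assumes Ethernet with TCP timestamps:

#include <stdio.h>

#define MSS              1448
#define PAGE_SZ          4096
#define SHARED_INFO_SIZE  320	/* assumed ~sizeof(struct skb_shared_info) */
#define SK_BUFF_SIZE      240	/* assumed ~sizeof(struct sk_buff) */
#define ALIGN_UP(x)      (((x) + 63U) & ~63U)	/* cache-line SKB_DATA_ALIGN */

static unsigned int roundup_pow_of_two(unsigned int x)
{
	unsigned int r = 1;

	while (r < x)
		r <<= 1;
	return r;
}

int main(void)
{
	unsigned int assumed = MSS + ALIGN_UP(SK_BUFF_SIZE);
	unsigned int linear  = roundup_pow_of_two(MSS + SHARED_INFO_SIZE) +
			       ALIGN_UP(SK_BUFF_SIZE);
	unsigned int paged   = PAGE_SZ + ALIGN_UP(SK_BUFF_SIZE);

	printf("assumed per-MSS cost: %u\n", assumed);	/* ~1704 */
	printf("linear skb cost:      %u\n", linear);	/* ~2304 */
	printf("paged-frag cost:      %u\n", paged);	/* ~4352 */
	return 0;
}

So a window sized as if each MSS cost ~1.7 KB can, with paged-frag
drivers, consume roughly 2.5x that much real memory.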
* Re: [PATCH net-next] tcp: reduce memory needs of out of order queue
2011-10-15 6:39 ` Eric Dumazet
@ 2011-10-17 16:47 ` Rick Jones
0 siblings, 0 replies; 14+ messages in thread
From: Rick Jones @ 2011-10-17 16:47 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
>
> Rick, could you redo the test, using following bit on receiver :
>
> echo 1 >/proc/sys/net/ipv4/tcp_adv_win_scale
raj@tardy:~/netperf2_trunk$ netstat -s > before; src/netperf -H
raj-8510w.americas.hpqcorp.net -t tcp_rr -- -b 256 -D -o
throughput,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end
; netstat -s > after
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0
(0.0.0.0) port 0 AF_INET to internal-host.americas.hpqcorp.net
(16.89.245.115) port 0 AF_INET : nodelay : first burst 256
Throughput,Local Transport Retransmissions,Remote Transport
Retransmissions,Local Send Socket Size Final,Remote Recv Socket Size Final
78527.68,289,0,16384,98304
Deltas on the receiver:
TcpExt:
27 packets pruned from receive queue because of socket buffer overrun
0 TCP sockets finished time wait in fast timer
0 delayed acks sent
0 delayed acks further delayed because of locked socket
Quick ack mode was activated 0 times
19 packets directly queued to recvmsg prequeue.
0 bytes directly in process context from backlog
670 bytes directly received in process context from prequeue
739983 packet headers predicted
14 packets header predicted and directly queued to user
127 acknowledgments not containing data payload received
235774 predicted acknowledgments
0 other TCP timeouts
6553 packets collapsed in receive queue due to low socket buffer
0 DSACKs sent for old packets
TCPBacklogDrop: 294
So, moving on to:
> If you still have collapses/retransmits, you then could try :
>
> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
raj@tardy:~/netperf2_trunk$ netstat -s > before; src/netperf -H
raj-8510w.americas.hpqcorp.net -t tcp_rr -- -b 256 -D -o
throughput,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end
; netstat -s > after
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to internal-host.americas.hpqcorp.net (16.89.245.115) port 0 AF_INET :
nodelay : first burst 256
Throughput,Local Transport Retransmissions,Remote Transport
Retransmissions,Local Send Socket Size Final,Remote Recv Socket Size Final
95981.83,0,0,121200,156600
No retransmissions in that one.
rick