* Spurious "TCP: too many of orphaned sockets", unable to allocate sockets @ 2010-08-25 7:16 Anton Blanchard 2010-08-25 7:17 ` [PATCH] tcp: Fix sysctl_tcp_max_orphans when PAGE_SIZE != 4k Anton Blanchard 2010-08-25 7:59 ` Spurious "TCP: too many of orphaned sockets", unable to allocate sockets David Miller 0 siblings, 2 replies; 17+ messages in thread From: Anton Blanchard @ 2010-08-25 7:16 UTC (permalink / raw) To: netdev; +Cc: miltonm Hi, We have a machine running a network test that regularly hits: TCP: too many of orphaned sockets Which comes from: int orphan_count = percpu_counter_read_positive( sk->sk_prot->orphan_count); sk_mem_reclaim(sk); if (tcp_too_many_orphans(sk, orphan_count)) { if (net_ratelimit()) printk(KERN_INFO "TCP: too many of orphaned " "sockets\n"); tcp_set_state(sk, TCP_CLOSE); tcp_send_active_reset(sk, GFP_ATOMIC); NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY); } Looking closer we have: # cat /proc/sys/net/ipv4/tcp_max_orphans 4096 # grep processor /proc/cpuinfo | wc -l 128 The problem is we are using percpu_counter_read_positive, so the value can be out num_online_cpus() * percpu_counter_batch. percpu_counter_batch is going to be 32, so we might be out by 32 * 128 = 4k. Considering tcp_max_orphans is 4k that explains the spurious printout and the inability to allocate sockets. A couple of issues: 1. We size sysctl_tcp_max_orphans based on some second order heuristic that uses pages which could be anything from 4k to 64k: /* Try to be a bit smarter and adjust defaults depending * on available memory. */ for (order = 0; ((1 << order) << PAGE_SHIFT) < (tcp_hashinfo.bhash_size * sizeof(struct inet_bind_hashbucket)); order++) ; if (order >= 4) { tcp_death_row.sysctl_max_tw_buckets = 180000; sysctl_tcp_max_orphans = 4096 << (order - 4); sysctl_max_syn_backlog = 1024; } else if (order < 3) { tcp_death_row.sysctl_max_tw_buckets >>= (3 - order); sysctl_tcp_max_orphans >>= (3 - order); sysctl_max_syn_backlog = 128; } I'll follow up with a patch to fix this for PAGE_SIZE != 4k 2. Even with this fixed we could hit the original issue. We have been known to test on 1024 thread boxes and we would have the possibility of 32 * 1024 = 32k slack in the percpu counters. On this box tcp_max_orphans will be 64k after the fix which is a bit close for comfort. Should we do anything here? Anton ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH] tcp: Fix sysctl_tcp_max_orphans when PAGE_SIZE != 4k 2010-08-25 7:16 Spurious "TCP: too many of orphaned sockets", unable to allocate sockets Anton Blanchard @ 2010-08-25 7:17 ` Anton Blanchard 2010-08-25 7:39 ` Eric Dumazet 2010-08-25 17:50 ` Eric Dumazet 2010-08-25 7:59 ` Spurious "TCP: too many of orphaned sockets", unable to allocate sockets David Miller 1 sibling, 2 replies; 17+ messages in thread From: Anton Blanchard @ 2010-08-25 7:17 UTC (permalink / raw) To: netdev; +Cc: miltonm We were hard coding 4096 when sizing sysctl_tcp_max_orphans which causes problems when PAGE_SIZE is not 4k. We calculate an order based on PAGE_SHIFT so the count should be based on PAGE_SIZE Signed-off-By: Milton Miller <miltonm@bga.com> Signed-off-by: Anton Blanchard <anton@samba.org> --- Index: powerpc.git/net/ipv4/tcp.c =================================================================== --- powerpc.git.orig/net/ipv4/tcp.c 2010-08-25 17:04:51.190305401 +1000 +++ powerpc.git/net/ipv4/tcp.c 2010-08-25 17:05:15.463884764 +1000 @@ -3270,7 +3270,7 @@ void __init tcp_init(void) ; if (order >= 4) { tcp_death_row.sysctl_max_tw_buckets = 180000; - sysctl_tcp_max_orphans = 4096 << (order - 4); + sysctl_tcp_max_orphans = PAGE_SIZE << (order - 4); sysctl_max_syn_backlog = 1024; } else if (order < 3) { tcp_death_row.sysctl_max_tw_buckets >>= (3 - order); ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] tcp: Fix sysctl_tcp_max_orphans when PAGE_SIZE != 4k 2010-08-25 7:17 ` [PATCH] tcp: Fix sysctl_tcp_max_orphans when PAGE_SIZE != 4k Anton Blanchard @ 2010-08-25 7:39 ` Eric Dumazet 2010-08-25 7:59 ` David Miller 2010-08-25 17:50 ` Eric Dumazet 1 sibling, 1 reply; 17+ messages in thread From: Eric Dumazet @ 2010-08-25 7:39 UTC (permalink / raw) To: Anton Blanchard; +Cc: netdev, miltonm Le mercredi 25 août 2010 à 17:17 +1000, Anton Blanchard a écrit : > We were hard coding 4096 when sizing sysctl_tcp_max_orphans which > causes problems when PAGE_SIZE is not 4k. We calculate an order based on > PAGE_SHIFT so the count should be based on PAGE_SIZE > > Signed-off-By: Milton Miller <miltonm@bga.com> > Signed-off-by: Anton Blanchard <anton@samba.org> > --- > > Index: powerpc.git/net/ipv4/tcp.c > =================================================================== > --- powerpc.git.orig/net/ipv4/tcp.c 2010-08-25 17:04:51.190305401 +1000 > +++ powerpc.git/net/ipv4/tcp.c 2010-08-25 17:05:15.463884764 +1000 > @@ -3270,7 +3270,7 @@ void __init tcp_init(void) > ; > if (order >= 4) { > tcp_death_row.sysctl_max_tw_buckets = 180000; > - sysctl_tcp_max_orphans = 4096 << (order - 4); > + sysctl_tcp_max_orphans = PAGE_SIZE << (order - 4); > sysctl_max_syn_backlog = 1024; > } else if (order < 3) { > tcp_death_row.sysctl_max_tw_buckets >>= (3 - order); > -- Considering your previous mail, I suggest you add sysctl_tcp_max_orphans = max(sysctl_tcp_max_orphans, percpu_counter_batch * num_possible_cpus()); ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] tcp: Fix sysctl_tcp_max_orphans when PAGE_SIZE != 4k 2010-08-25 7:39 ` Eric Dumazet @ 2010-08-25 7:59 ` David Miller 0 siblings, 0 replies; 17+ messages in thread From: David Miller @ 2010-08-25 7:59 UTC (permalink / raw) To: eric.dumazet; +Cc: anton, netdev, miltonm From: Eric Dumazet <eric.dumazet@gmail.com> Date: Wed, 25 Aug 2010 09:39:01 +0200 > Considering your previous mail, I suggest you add > > sysctl_tcp_max_orphans = max(sysctl_tcp_max_orphans, > percpu_counter_batch * num_possible_cpus()); That seems unnecessary, see my other reply to Anton on how to fix this skew problem. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] tcp: Fix sysctl_tcp_max_orphans when PAGE_SIZE != 4k 2010-08-25 7:17 ` [PATCH] tcp: Fix sysctl_tcp_max_orphans when PAGE_SIZE != 4k Anton Blanchard 2010-08-25 7:39 ` Eric Dumazet @ 2010-08-25 17:50 ` Eric Dumazet 2010-08-25 23:57 ` David Miller 1 sibling, 1 reply; 17+ messages in thread From: Eric Dumazet @ 2010-08-25 17:50 UTC (permalink / raw) To: Anton Blanchard; +Cc: netdev, miltonm Le mercredi 25 août 2010 à 17:17 +1000, Anton Blanchard a écrit : > We were hard coding 4096 when sizing sysctl_tcp_max_orphans which > causes problems when PAGE_SIZE is not 4k. We calculate an order based on > PAGE_SHIFT so the count should be based on PAGE_SIZE > > Signed-off-By: Milton Miller <miltonm@bga.com> > Signed-off-by: Anton Blanchard <anton@samba.org> > --- > > Index: powerpc.git/net/ipv4/tcp.c > =================================================================== > --- powerpc.git.orig/net/ipv4/tcp.c 2010-08-25 17:04:51.190305401 +1000 > +++ powerpc.git/net/ipv4/tcp.c 2010-08-25 17:05:15.463884764 +1000 > @@ -3270,7 +3270,7 @@ void __init tcp_init(void) > ; > if (order >= 4) { > tcp_death_row.sysctl_max_tw_buckets = 180000; > - sysctl_tcp_max_orphans = 4096 << (order - 4); > + sysctl_tcp_max_orphans = PAGE_SIZE << (order - 4); > sysctl_max_syn_backlog = 1024; > } else if (order < 3) { > tcp_death_row.sysctl_max_tw_buckets >>= (3 - order); > -- In fact, existing code makes litle sense.... (tcp_hashinfo.bhash_size * sizeof(struct inet_bind_hashbucket)) is much bigger if spinlock debugging is on. Its strange to select bigger limits in this case (where kernel structures are also bigger) bhash_size max is 65536, and we get this value even for small machines. Sizing would probably be better if using ehash_size instead of bhash_size Maybe remove the 'order' loop and use ehash_size, already a result of the available memory or thash_entries tunable. unsigned int ehash_size = tcp_hashinfo.ehash_mask + 1; tcp_death_row.sysctl_max_tw_buckets = cnt / 2; sysctl_tcp_max_orphans = cnt / 2; sysctl_max_syn_backlog = min(128, cnt / 256); ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] tcp: Fix sysctl_tcp_max_orphans when PAGE_SIZE != 4k 2010-08-25 17:50 ` Eric Dumazet @ 2010-08-25 23:57 ` David Miller 2010-08-26 0:38 ` Anton Blanchard 0 siblings, 1 reply; 17+ messages in thread From: David Miller @ 2010-08-25 23:57 UTC (permalink / raw) To: eric.dumazet; +Cc: anton, netdev, miltonm From: Eric Dumazet <eric.dumazet@gmail.com> Date: Wed, 25 Aug 2010 19:50:34 +0200 > In fact, existing code makes litle sense.... > > (tcp_hashinfo.bhash_size * sizeof(struct inet_bind_hashbucket)) > > is much bigger if spinlock debugging is on. Its strange to select bigger > limits in this case (where kernel structures are also bigger) > > bhash_size max is 65536, and we get this value even for small machines. > > Sizing would probably be better if using ehash_size instead of > bhash_size Yes, that would avoid debugging unduly influencing this tunable. Also, I think the "4096" constant is a reference to thing we now call SK_MEM_QUANTUM. Back when this sizing code was added in: commit 1f28b683339f74f9664b77532f4a2f1aad512451 Author: davem <davem> Date: Sun Jan 16 05:10:52 2000 +0000 Merge in TCP/UDP optimizations and bug fixing from softnet patches. Softnet patch set decreases size by approx. 300k of the netdev-vger-cvs tree, we used the '4096' constant explicitly. > Maybe remove the 'order' loop and use ehash_size, already a result of > the available memory or thash_entries tunable. > > unsigned int ehash_size = tcp_hashinfo.ehash_mask + 1; > > tcp_death_row.sysctl_max_tw_buckets = cnt / 2; > sysctl_tcp_max_orphans = cnt / 2; > sysctl_max_syn_backlog = min(128, cnt / 256); Yeah, something like the following, Anton? diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 197b9b7..403c029 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -3209,7 +3209,7 @@ void __init tcp_init(void) { struct sk_buff *skb = NULL; unsigned long nr_pages, limit; - int order, i, max_share; + int order, i, max_share, cnt; unsigned long jiffy = jiffies; BUILD_BUG_ON(sizeof(struct tcp_skb_cb) > sizeof(skb->cb)); @@ -3258,22 +3258,11 @@ void __init tcp_init(void) INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain); } - /* Try to be a bit smarter and adjust defaults depending - * on available memory. - */ - for (order = 0; ((1 << order) << PAGE_SHIFT) < - (tcp_hashinfo.bhash_size * sizeof(struct inet_bind_hashbucket)); - order++) - ; - if (order >= 4) { - tcp_death_row.sysctl_max_tw_buckets = 180000; - sysctl_tcp_max_orphans = 4096 << (order - 4); - sysctl_max_syn_backlog = 1024; - } else if (order < 3) { - tcp_death_row.sysctl_max_tw_buckets >>= (3 - order); - sysctl_tcp_max_orphans >>= (3 - order); - sysctl_max_syn_backlog = 128; - } + cnt = tcp_hashinfo.ehash_mask + 1; + + tcp_death_row.sysctl_max_tw_buckets = cnt / 2; + sysctl_tcp_max_orphans = cnt / 2; + sysctl_max_syn_backlog = min(128, cnt / 256); /* Set the pressure threshold to be a fraction of global memory that * is up to 1/2 at 256 MB, decreasing toward zero with the amount of ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] tcp: Fix sysctl_tcp_max_orphans when PAGE_SIZE != 4k 2010-08-25 23:57 ` David Miller @ 2010-08-26 0:38 ` Anton Blanchard 2010-08-26 3:53 ` David Miller ` (2 more replies) 0 siblings, 3 replies; 17+ messages in thread From: Anton Blanchard @ 2010-08-26 0:38 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, netdev, miltonm Hi Dave, > Yeah, something like the following, Anton? Thanks! I tested on a 512GB box. One thing: # cat /proc/sys/net/ipv4/tcp_max_syn_backlog 128 We probably want to use max(): > - sysctl_max_syn_backlog = min(128, cnt / 256); > + sysctl_max_syn_backlog = max(128, cnt / 256); With that change: # cat /proc/sys/net/ipv4/tcp_max_orphans 262144 # cat /proc/sys/net/ipv4/tcp_max_tw_buckets 262144 # cat /proc/sys/net/ipv4/tcp_max_syn_backlog 2048 Tested-by: Anton Blanchard <anton@samba.org> Anton ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] tcp: Fix sysctl_tcp_max_orphans when PAGE_SIZE != 4k 2010-08-26 0:38 ` Anton Blanchard @ 2010-08-26 3:53 ` David Miller 2010-08-26 6:36 ` Anton Blanchard 2010-08-26 4:45 ` Eric Dumazet 2010-08-26 5:15 ` [PATCH] tcp: fix three tcp sysctls tuning Eric Dumazet 2 siblings, 1 reply; 17+ messages in thread From: David Miller @ 2010-08-26 3:53 UTC (permalink / raw) To: anton; +Cc: eric.dumazet, netdev, miltonm From: Anton Blanchard <anton@samba.org> Date: Thu, 26 Aug 2010 10:38:34 +1000 > # cat /proc/sys/net/ipv4/tcp_max_syn_backlog > 128 > > We probably want to use max(): > >> - sysctl_max_syn_backlog = min(128, cnt / 256); >> + sysctl_max_syn_backlog = max(128, cnt / 256); > > With that change: ... > Tested-by: Anton Blanchard <anton@samba.org> Thanks a lot Anton. Did you test my percpu orphan fix yet? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] tcp: Fix sysctl_tcp_max_orphans when PAGE_SIZE != 4k 2010-08-26 3:53 ` David Miller @ 2010-08-26 6:36 ` Anton Blanchard 0 siblings, 0 replies; 17+ messages in thread From: Anton Blanchard @ 2010-08-26 6:36 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, netdev, miltonm Hi Dave, > Did you test my percpu orphan fix yet? The machine is running some tests, but I will either get time on it or replicate the issue on a similar box. Anton ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] tcp: Fix sysctl_tcp_max_orphans when PAGE_SIZE != 4k 2010-08-26 0:38 ` Anton Blanchard 2010-08-26 3:53 ` David Miller @ 2010-08-26 4:45 ` Eric Dumazet 2010-08-26 5:15 ` [PATCH] tcp: fix three tcp sysctls tuning Eric Dumazet 2 siblings, 0 replies; 17+ messages in thread From: Eric Dumazet @ 2010-08-26 4:45 UTC (permalink / raw) To: Anton Blanchard; +Cc: David Miller, netdev, miltonm Le jeudi 26 août 2010 à 10:38 +1000, Anton Blanchard a écrit : > Hi Dave, > > > Yeah, something like the following, Anton? > > Thanks! I tested on a 512GB box. One thing: > > # cat /proc/sys/net/ipv4/tcp_max_syn_backlog > 128 > > We probably want to use max(): > > > - sysctl_max_syn_backlog = min(128, cnt / 256); > > + sysctl_max_syn_backlog = max(128, cnt / 256); > I believe I make this min()/max() error once per week, sorry ;) In my mind, I say "I want a minimum value of 128", then I write : val = min(128, val); instead of val = max(128, val); Oh well... > With that change: > > # cat /proc/sys/net/ipv4/tcp_max_orphans > 262144 > # cat /proc/sys/net/ipv4/tcp_max_tw_buckets > 262144 > # cat /proc/sys/net/ipv4/tcp_max_syn_backlog > 2048 > > > Tested-by: Anton Blanchard <anton@samba.org> > > Anton Seems pretty good to me, thanks ! BTW, we now auto-limit ehash to 512*1024 slots, so you probably have same results on a 128GB or 4096GB machine ;) ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH] tcp: fix three tcp sysctls tuning 2010-08-26 0:38 ` Anton Blanchard 2010-08-26 3:53 ` David Miller 2010-08-26 4:45 ` Eric Dumazet @ 2010-08-26 5:15 ` Eric Dumazet 2010-08-26 6:02 ` David Miller 2 siblings, 1 reply; 17+ messages in thread From: Eric Dumazet @ 2010-08-26 5:15 UTC (permalink / raw) To: Anton Blanchard, David Miller; +Cc: netdev, miltonm As discovered by Anton Blanchard, current code to autotune tcp_death_row.sysctl_max_tw_buckets, sysctl_tcp_max_orphans and sysctl_max_syn_backlog makes litle sense. The bigger a page is, the less tcp_max_orphans is : 4096 on a 512GB machine in Anton's case. (tcp_hashinfo.bhash_size * sizeof(struct inet_bind_hashbucket)) is much bigger if spinlock debugging is on. Its wrong to select bigger limits in this case (where kernel structures are also bigger) bhash_size max is 65536, and we get this value even for small machines. A better ground is to use size of ehash table, this also makes code shorter and more obvious. Based on a patch from Anton, and another from David. Reported-and-tested-by: Anton Blanchard <anton@samba.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- net/ipv4/tcp.c | 24 +++++++----------------- 1 file changed, 7 insertions(+), 17 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 197b9b7..e2add5f 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -3209,7 +3209,7 @@ void __init tcp_init(void) { struct sk_buff *skb = NULL; unsigned long nr_pages, limit; - int order, i, max_share; + int i, max_share, cnt; unsigned long jiffy = jiffies; BUILD_BUG_ON(sizeof(struct tcp_skb_cb) > sizeof(skb->cb)); @@ -3258,22 +3258,12 @@ void __init tcp_init(void) INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain); } - /* Try to be a bit smarter and adjust defaults depending - * on available memory. - */ - for (order = 0; ((1 << order) << PAGE_SHIFT) < - (tcp_hashinfo.bhash_size * sizeof(struct inet_bind_hashbucket)); - order++) - ; - if (order >= 4) { - tcp_death_row.sysctl_max_tw_buckets = 180000; - sysctl_tcp_max_orphans = 4096 << (order - 4); - sysctl_max_syn_backlog = 1024; - } else if (order < 3) { - tcp_death_row.sysctl_max_tw_buckets >>= (3 - order); - sysctl_tcp_max_orphans >>= (3 - order); - sysctl_max_syn_backlog = 128; - } + + cnt = tcp_hashinfo.ehash_mask + 1; + + tcp_death_row.sysctl_max_tw_buckets = cnt / 2; + sysctl_tcp_max_orphans = cnt / 2; + sysctl_max_syn_backlog = max(128, cnt / 256); /* Set the pressure threshold to be a fraction of global memory that * is up to 1/2 at 256 MB, decreasing toward zero with the amount of ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] tcp: fix three tcp sysctls tuning 2010-08-26 5:15 ` [PATCH] tcp: fix three tcp sysctls tuning Eric Dumazet @ 2010-08-26 6:02 ` David Miller 2010-08-26 6:21 ` Eric Dumazet 0 siblings, 1 reply; 17+ messages in thread From: David Miller @ 2010-08-26 6:02 UTC (permalink / raw) To: eric.dumazet; +Cc: anton, netdev, miltonm From: Eric Dumazet <eric.dumazet@gmail.com> Date: Thu, 26 Aug 2010 07:15:53 +0200 > As discovered by Anton Blanchard, current code to autotune > tcp_death_row.sysctl_max_tw_buckets, sysctl_tcp_max_orphans and > sysctl_max_syn_backlog makes litle sense. > > The bigger a page is, the less tcp_max_orphans is : 4096 on a 512GB > machine in Anton's case. > > (tcp_hashinfo.bhash_size * sizeof(struct inet_bind_hashbucket)) > is much bigger if spinlock debugging is on. Its wrong to select bigger > limits in this case (where kernel structures are also bigger) > > bhash_size max is 65536, and we get this value even for small machines. > > A better ground is to use size of ehash table, this also makes code > shorter and more obvious. > > Based on a patch from Anton, and another from David. > > Reported-and-tested-by: Anton Blanchard <anton@samba.org> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> I had something in my tree already, but since I didn't push it out it's just as easy for me to use this patch instead :-) Thanks everyone! ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] tcp: fix three tcp sysctls tuning 2010-08-26 6:02 ` David Miller @ 2010-08-26 6:21 ` Eric Dumazet 0 siblings, 0 replies; 17+ messages in thread From: Eric Dumazet @ 2010-08-26 6:21 UTC (permalink / raw) To: David Miller; +Cc: anton, netdev, miltonm Le mercredi 25 août 2010 à 23:02 -0700, David Miller a écrit : > I had something in my tree already, but since I didn't push it out > it's just as easy for me to use this patch instead :-) > > Thanks everyone! You're welcome, thanks David. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Spurious "TCP: too many of orphaned sockets", unable to allocate sockets 2010-08-25 7:16 Spurious "TCP: too many of orphaned sockets", unable to allocate sockets Anton Blanchard 2010-08-25 7:17 ` [PATCH] tcp: Fix sysctl_tcp_max_orphans when PAGE_SIZE != 4k Anton Blanchard @ 2010-08-25 7:59 ` David Miller 2010-08-25 8:20 ` David Miller 1 sibling, 1 reply; 17+ messages in thread From: David Miller @ 2010-08-25 7:59 UTC (permalink / raw) To: anton; +Cc: netdev, miltonm From: Anton Blanchard <anton@samba.org> Date: Wed, 25 Aug 2010 17:16:26 +1000 > We have a machine running a network test that regularly hits: > > TCP: too many of orphaned sockets > > Which comes from: > > int orphan_count = percpu_counter_read_positive( > sk->sk_prot->orphan_count); > > sk_mem_reclaim(sk); > if (tcp_too_many_orphans(sk, orphan_count)) { ... > 2. Even with this fixed we could hit the original issue. We have been known to > test on 1024 thread boxes and we would have the possibility of 32 * 1024 > = 32k slack in the percpu counters. On this box tcp_max_orphans will be > 64k after the fix which is a bit close for comfort. Should we do anything here? Solution seems simple, if the too many orphan check triggers, simply redo the check using the expensive but more accurate per-cpu counter read (which avoids the skew) to make sure. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Spurious "TCP: too many of orphaned sockets", unable to allocate sockets 2010-08-25 7:59 ` Spurious "TCP: too many of orphaned sockets", unable to allocate sockets David Miller @ 2010-08-25 8:20 ` David Miller 2010-08-25 8:47 ` Eric Dumazet 0 siblings, 1 reply; 17+ messages in thread From: David Miller @ 2010-08-25 8:20 UTC (permalink / raw) To: anton; +Cc: netdev, miltonm, eric.dumazet From: David Miller <davem@davemloft.net> Date: Wed, 25 Aug 2010 00:59:29 -0700 (PDT) > Solution seems simple, if the too many orphan check triggers, simply > redo the check using the expensive but more accurate per-cpu counter > read (which avoids the skew) to make sure. Something like this: tcp: Combat per-cpu skew in orphan tests. As reported by Anton Blanchard when we use percpu_counter_read_positive() to make our orphan socket limit checks, the check can be off by up to num_cpus_online() * batch (which is 32 by default) which on a 128 cpu machine can be as large as the default orphan limit itself. Fix this by doing the full expensive sum check if the optimized check triggers. Reported-by: Anton Blanchard <anton@samba.org> Signed-off-by: David S. Miller <davem@davemloft.net> diff --git a/include/net/tcp.h b/include/net/tcp.h index df6a2eb..eaa9582 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -268,11 +268,21 @@ static inline int between(__u32 seq1, __u32 seq2, __u32 seq3) return seq3 - seq2 >= seq1 - seq2; } -static inline int tcp_too_many_orphans(struct sock *sk, int num) +static inline bool tcp_too_many_orphans(struct sock *sk, int shift) { - return (num > sysctl_tcp_max_orphans) || - (sk->sk_wmem_queued > SOCK_MIN_SNDBUF && - atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2]); + struct percpu_counter *ocp = sk->sk_prot->orphan_count; + int orphans = percpu_counter_read_positive(ocp); + + if (orphans << shift > sysctl_tcp_max_orphans) { + orphans = percpu_counter_sum_positive(ocp); + if (orphans << shift > sysctl_tcp_max_orphans) + return true; + } + + if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF && + atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2]) + return true; + return false; } /* syncookies: remember time of last synqueue overflow */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 176e11a..197b9b7 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2011,11 +2011,8 @@ adjudge_to_death: } } if (sk->sk_state != TCP_CLOSE) { - int orphan_count = percpu_counter_read_positive( - sk->sk_prot->orphan_count); - sk_mem_reclaim(sk); - if (tcp_too_many_orphans(sk, orphan_count)) { + if (tcp_too_many_orphans(sk, 0)) { if (net_ratelimit()) printk(KERN_INFO "TCP: too many of orphaned " "sockets\n"); diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index 808bb92..c35b469 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -66,18 +66,18 @@ static void tcp_write_err(struct sock *sk) static int tcp_out_of_resources(struct sock *sk, int do_reset) { struct tcp_sock *tp = tcp_sk(sk); - int orphans = percpu_counter_read_positive(&tcp_orphan_count); + int shift = 0; /* If peer does not open window for long time, or did not transmit * anything for long time, penalize it. */ if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset) - orphans <<= 1; + shift++; /* If some dubious ICMP arrived, penalize even more. */ if (sk->sk_err_soft) - orphans <<= 1; + shift++; - if (tcp_too_many_orphans(sk, orphans)) { + if (tcp_too_many_orphans(sk, shift)) { if (net_ratelimit()) printk(KERN_INFO "Out of socket memory\n"); ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: Spurious "TCP: too many of orphaned sockets", unable to allocate sockets 2010-08-25 8:20 ` David Miller @ 2010-08-25 8:47 ` Eric Dumazet 2010-08-25 9:28 ` David Miller 0 siblings, 1 reply; 17+ messages in thread From: Eric Dumazet @ 2010-08-25 8:47 UTC (permalink / raw) To: David Miller; +Cc: anton, netdev, miltonm Le mercredi 25 août 2010 à 01:20 -0700, David Miller a écrit : > From: David Miller <davem@davemloft.net> > Date: Wed, 25 Aug 2010 00:59:29 -0700 (PDT) > > > Solution seems simple, if the too many orphan check triggers, simply > > redo the check using the expensive but more accurate per-cpu counter > > read (which avoids the skew) to make sure. > > Something like this: > > tcp: Combat per-cpu skew in orphan tests. > > As reported by Anton Blanchard when we use > percpu_counter_read_positive() to make our orphan socket limit checks, > the check can be off by up to num_cpus_online() * batch (which is 32 > by default) which on a 128 cpu machine can be as large as the default > orphan limit itself. > > Fix this by doing the full expensive sum check if the optimized check > triggers. > > Reported-by: Anton Blanchard <anton@samba.org> > Signed-off-by: David S. Miller <davem@davemloft.net> Very nice ! tcp_too_many_orphans() might be a bit large to still be inlined ... Acked-by: Eric Dumazet <eric.dumazet@gmail.com> > > diff --git a/include/net/tcp.h b/include/net/tcp.h > index df6a2eb..eaa9582 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -268,11 +268,21 @@ static inline int between(__u32 seq1, __u32 seq2, __u32 seq3) > return seq3 - seq2 >= seq1 - seq2; > } > > -static inline int tcp_too_many_orphans(struct sock *sk, int num) > +static inline bool tcp_too_many_orphans(struct sock *sk, int shift) > { > - return (num > sysctl_tcp_max_orphans) || > - (sk->sk_wmem_queued > SOCK_MIN_SNDBUF && > - atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2]); > + struct percpu_counter *ocp = sk->sk_prot->orphan_count; > + int orphans = percpu_counter_read_positive(ocp); > + > + if (orphans << shift > sysctl_tcp_max_orphans) { > + orphans = percpu_counter_sum_positive(ocp); > + if (orphans << shift > sysctl_tcp_max_orphans) > + return true; > + } > + > + if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF && > + atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2]) > + return true; > + return false; > } > > /* syncookies: remember time of last synqueue overflow */ > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 176e11a..197b9b7 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -2011,11 +2011,8 @@ adjudge_to_death: > } > } > if (sk->sk_state != TCP_CLOSE) { > - int orphan_count = percpu_counter_read_positive( > - sk->sk_prot->orphan_count); > - > sk_mem_reclaim(sk); > - if (tcp_too_many_orphans(sk, orphan_count)) { > + if (tcp_too_many_orphans(sk, 0)) { > if (net_ratelimit()) > printk(KERN_INFO "TCP: too many of orphaned " > "sockets\n"); > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c > index 808bb92..c35b469 100644 > --- a/net/ipv4/tcp_timer.c > +++ b/net/ipv4/tcp_timer.c > @@ -66,18 +66,18 @@ static void tcp_write_err(struct sock *sk) > static int tcp_out_of_resources(struct sock *sk, int do_reset) > { > struct tcp_sock *tp = tcp_sk(sk); > - int orphans = percpu_counter_read_positive(&tcp_orphan_count); > + int shift = 0; > > /* If peer does not open window for long time, or did not transmit > * anything for long time, penalize it. */ > if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset) > - orphans <<= 1; > + shift++; > > /* If some dubious ICMP arrived, penalize even more. */ > if (sk->sk_err_soft) > - orphans <<= 1; > + shift++; > > - if (tcp_too_many_orphans(sk, orphans)) { > + if (tcp_too_many_orphans(sk, shift)) { > if (net_ratelimit()) > printk(KERN_INFO "Out of socket memory\n"); > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Spurious "TCP: too many of orphaned sockets", unable to allocate sockets 2010-08-25 8:47 ` Eric Dumazet @ 2010-08-25 9:28 ` David Miller 0 siblings, 0 replies; 17+ messages in thread From: David Miller @ 2010-08-25 9:28 UTC (permalink / raw) To: eric.dumazet; +Cc: anton, netdev, miltonm From: Eric Dumazet <eric.dumazet@gmail.com> Date: Wed, 25 Aug 2010 10:47:37 +0200 > tcp_too_many_orphans() might be a bit large to still be inlined ... We can look into that once this propagates to net-next-2.6 > Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Thanks for reviewing. ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2010-08-26 6:39 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-08-25 7:16 Spurious "TCP: too many of orphaned sockets", unable to allocate sockets Anton Blanchard 2010-08-25 7:17 ` [PATCH] tcp: Fix sysctl_tcp_max_orphans when PAGE_SIZE != 4k Anton Blanchard 2010-08-25 7:39 ` Eric Dumazet 2010-08-25 7:59 ` David Miller 2010-08-25 17:50 ` Eric Dumazet 2010-08-25 23:57 ` David Miller 2010-08-26 0:38 ` Anton Blanchard 2010-08-26 3:53 ` David Miller 2010-08-26 6:36 ` Anton Blanchard 2010-08-26 4:45 ` Eric Dumazet 2010-08-26 5:15 ` [PATCH] tcp: fix three tcp sysctls tuning Eric Dumazet 2010-08-26 6:02 ` David Miller 2010-08-26 6:21 ` Eric Dumazet 2010-08-25 7:59 ` Spurious "TCP: too many of orphaned sockets", unable to allocate sockets David Miller 2010-08-25 8:20 ` David Miller 2010-08-25 8:47 ` Eric Dumazet 2010-08-25 9:28 ` David Miller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).