* [PATCH 0/2] udp: rescan hash2 list if chains crossed. @ 2023-07-26 11:51 David Laight 2023-07-26 12:05 ` [PATCH 1/2] Move hash calculation inside udp4_lib_lookup2() David Laight 2023-07-26 12:05 ` [PATCH 2/2] Rescan the hash2 list if the hash chains have got cross-linked David Laight 0 siblings, 2 replies; 14+ messages in thread From: David Laight @ 2023-07-26 11:51 UTC (permalink / raw) To: willemdebruijn.kernel@gmail.com, davem@davemloft.net, dsahern@kernel.org, Eric Dumazet, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org Commit ca065d0cf80fa removed the rescan of the hash2 list (local address + local port) during udp receive if the final socket wasn't on the expected hash chain. While unusual, udp sockets can get rehashed (and without an rcu delay) without this rescan received packets can generate an unexpected ICMP port unreachable instead of being delivered to a local socket. The rescan could be checked for every socket, but the chances of the hash chain being diverted twice and ending up in the correct list is pretty minimal. This is the 'smoking gun' for some ICMP port unreachable messages being received on long-lived UDP sockets on 127.0.0.1. The failures are rare and being seen on AWS - so it is pretty impossible to test the patch. David Laight (2): Move hash calculation inside udp4_lib_lookup2() Rescan the hash2 list if the hash chains have got cross-linked. net/ipv4/udp.c | 36 +++++++++++++++++++++++------------- 1 file changed, 23 insertions(+), 13 deletions(-) -- 2.17.1 - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/2] Move hash calculation inside udp4_lib_lookup2() 2023-07-26 11:51 [PATCH 0/2] udp: rescan hash2 list if chains crossed David Laight @ 2023-07-26 12:05 ` David Laight 2023-07-26 13:44 ` Paolo Abeni 2023-07-26 12:05 ` [PATCH 2/2] Rescan the hash2 list if the hash chains have got cross-linked David Laight 1 sibling, 1 reply; 14+ messages in thread From: David Laight @ 2023-07-26 12:05 UTC (permalink / raw) To: 'willemdebruijn.kernel@gmail.com', 'davem@davemloft.net', 'dsahern@kernel.org', 'Eric Dumazet', 'kuba@kernel.org', 'pabeni@redhat.com', 'netdev@vger.kernel.org' Pass the udptable address into udp4_lib_lookup2() instead of the hash slot. While ipv4_portaddr_hash(net, IP_ADDR_ANY, 0) is constant for each net (the port is an xor) the value isn't saved. Since the hash function doesn't get simplified when passed zero the hash might as well be computed inside udp4_lib_lookup2(). This reduces the cache footprint and allows additional checks. Signed-off-by: David Laight <david.laight@aculab.com> --- net/ipv4/udp.c | 22 +++++++++------------- 1 file changed, 9 insertions(+), 13 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 42a96b3547c9..ad64d6c4cd99 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -439,12 +439,18 @@ static struct sock *udp4_lib_lookup2(struct net *net, __be32 saddr, __be16 sport, __be32 daddr, unsigned int hnum, int dif, int sdif, - struct udp_hslot *hslot2, + struct udp_table *udptable, struct sk_buff *skb) { + unsigned int hash2, slot2; + struct udp_hslot *hslot2; struct sock *sk, *result; int score, badness; + hash2 = ipv4_portaddr_hash(net, daddr, hnum); + slot2 = hash2 & udptable->mask; + hslot2 = &udptable->hash2[slot2]; + result = NULL; badness = 0; udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { @@ -495,18 +501,12 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, int sdif, struct udp_table *udptable, struct sk_buff *skb) { unsigned short hnum = ntohs(dport); - unsigned int hash2, slot2; - struct udp_hslot *hslot2; struct sock *result, *sk; - hash2 = ipv4_portaddr_hash(net, daddr, hnum); - slot2 = hash2 & udptable->mask; - hslot2 = &udptable->hash2[slot2]; - /* Lookup connected or non-wildcard socket */ result = udp4_lib_lookup2(net, saddr, sport, daddr, hnum, dif, sdif, - hslot2, skb); + udptable, skb); if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED) goto done; @@ -525,13 +525,9 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, goto done; /* Lookup wildcard sockets */ - hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum); - slot2 = hash2 & udptable->mask; - hslot2 = &udptable->hash2[slot2]; - result = udp4_lib_lookup2(net, saddr, sport, htonl(INADDR_ANY), hnum, dif, sdif, - hslot2, skb); + udptable, skb); done: if (IS_ERR(result)) return NULL; -- 2.17.1 - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] Move hash calculation inside udp4_lib_lookup2() 2023-07-26 12:05 ` [PATCH 1/2] Move hash calculation inside udp4_lib_lookup2() David Laight @ 2023-07-26 13:44 ` Paolo Abeni 2023-07-26 14:02 ` David Laight 2023-07-26 15:42 ` David Laight 0 siblings, 2 replies; 14+ messages in thread From: Paolo Abeni @ 2023-07-26 13:44 UTC (permalink / raw) To: David Laight, 'willemdebruijn.kernel@gmail.com', 'davem@davemloft.net', 'dsahern@kernel.org', 'Eric Dumazet', 'kuba@kernel.org', 'netdev@vger.kernel.org' On Wed, 2023-07-26 at 12:05 +0000, David Laight wrote: > Pass the udptable address into udp4_lib_lookup2() instead of the hash slot. > > While ipv4_portaddr_hash(net, IP_ADDR_ANY, 0) is constant for each net > (the port is an xor) the value isn't saved. > Since the hash function doesn't get simplified when passed zero the hash Are you sure? could you please objdump and compare the binary code generated before and after the patch? In theory all the callers up to __jhash_final() included should be inlined, and the compiler should be able to optimze at least rol32(0, <n>). Cheers, Paolo ^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH 1/2] Move hash calculation inside udp4_lib_lookup2() 2023-07-26 13:44 ` Paolo Abeni @ 2023-07-26 14:02 ` David Laight 2023-07-26 14:03 ` Eric Dumazet 2023-07-26 15:42 ` David Laight 1 sibling, 1 reply; 14+ messages in thread From: David Laight @ 2023-07-26 14:02 UTC (permalink / raw) To: 'Paolo Abeni', 'willemdebruijn.kernel@gmail.com', 'davem@davemloft.net', 'dsahern@kernel.org', 'Eric Dumazet', 'kuba@kernel.org', 'netdev@vger.kernel.org' From: Paolo Abeni > Sent: 26 July 2023 14:44 > > On Wed, 2023-07-26 at 12:05 +0000, David Laight wrote: > > Pass the udptable address into udp4_lib_lookup2() instead of the hash slot. > > > > While ipv4_portaddr_hash(net, IP_ADDR_ANY, 0) is constant for each net > > (the port is an xor) the value isn't saved. > > Since the hash function doesn't get simplified when passed zero the hash > > Are you sure? could you please objdump and compare the binary code > generated before and after the patch? In theory all the callers up to > __jhash_final() included should be inlined, and the compiler should be > able to optimze at least rol32(0, <n>). I looked the hash is 20+ instructions and pretty much all of them appeared twice. (I'm in the wrong building to have a buildable kernel tree.) It has to be said that ipv4_portaddr_hash(net, IPADDR_ANY, port) could just be net_hash_mix(net) ^ port. (Or maybe you could use a different random value.) I'm not even sure the relatively expensive mixing of 'saddr' is needed - it is one of the local IPv4 addresses. Mixing in the remote address for connected sockets might be useful for any server code that uses a lot of connected udp sockets - but that isn't done. We will have hundreds of udp sockets with different ports that are not connected (I don't know if they get bound to a local address). So a remote address hash wouldn't help. If you look at the generated code for __udp4_lib_lookup() it is actually quite horrid. Too many called functions with too many parameters. Things spill to the stack all the time. The reuse_port code made it a whole lot worse. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] Move hash calculation inside udp4_lib_lookup2() 2023-07-26 14:02 ` David Laight @ 2023-07-26 14:03 ` Eric Dumazet 2023-07-26 14:06 ` David Laight 0 siblings, 1 reply; 14+ messages in thread From: Eric Dumazet @ 2023-07-26 14:03 UTC (permalink / raw) To: David Laight Cc: Paolo Abeni, willemdebruijn.kernel@gmail.com, davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org, netdev@vger.kernel.org On Wed, Jul 26, 2023 at 4:02 PM David Laight <David.Laight@aculab.com> wrote: > > From: Paolo Abeni > > Sent: 26 July 2023 14:44 > > > > On Wed, 2023-07-26 at 12:05 +0000, David Laight wrote: > > > Pass the udptable address into udp4_lib_lookup2() instead of the hash slot. > > > > > > While ipv4_portaddr_hash(net, IP_ADDR_ANY, 0) is constant for each net > > > (the port is an xor) the value isn't saved. > > > Since the hash function doesn't get simplified when passed zero the hash > > > > Are you sure? could you please objdump and compare the binary code > > generated before and after the patch? In theory all the callers up to > > __jhash_final() included should be inlined, and the compiler should be > > able to optimze at least rol32(0, <n>). > > I looked the hash is 20+ instructions and pretty much all of > them appeared twice. > (I'm in the wrong building to have a buildable kernel tree.) > > It has to be said that ipv4_portaddr_hash(net, IPADDR_ANY, port) > could just be net_hash_mix(net) ^ port. > (Or maybe you could use a different random value.) > > I'm not even sure the relatively expensive mixing of 'saddr' > is needed - it is one of the local IPv4 addresses. > Mixing in the remote address for connected sockets might > be useful for any server code that uses a lot of connected > udp sockets - but that isn't done. > > We will have hundreds of udp sockets with different ports that > are not connected (I don't know if they get bound to a local > address). So a remote address hash wouldn't help. > > If you look at the generated code for __udp4_lib_lookup() > it is actually quite horrid. > Too many called functions with too many parameters. > Things spill to the stack all the time. > > The reuse_port code made it a whole lot worse. > Maybe ... Please show us performance numbers. If less than 1%, I would not bother changing this code, making future backports more risky. ^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH 1/2] Move hash calculation inside udp4_lib_lookup2() 2023-07-26 14:03 ` Eric Dumazet @ 2023-07-26 14:06 ` David Laight 0 siblings, 0 replies; 14+ messages in thread From: David Laight @ 2023-07-26 14:06 UTC (permalink / raw) To: 'Eric Dumazet' Cc: Paolo Abeni, willemdebruijn.kernel@gmail.com, davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org, netdev@vger.kernel.org From: Eric Dumazet > Sent: 26 July 2023 15:04 > > On Wed, Jul 26, 2023 at 4:02 PM David Laight <David.Laight@aculab.com> wrote: > > > > From: Paolo Abeni > > > Sent: 26 July 2023 14:44 > > > > > > On Wed, 2023-07-26 at 12:05 +0000, David Laight wrote: > > > > Pass the udptable address into udp4_lib_lookup2() instead of the hash slot. > > > > > > > > While ipv4_portaddr_hash(net, IP_ADDR_ANY, 0) is constant for each net > > > > (the port is an xor) the value isn't saved. > > > > Since the hash function doesn't get simplified when passed zero the hash > > > > > > Are you sure? could you please objdump and compare the binary code > > > generated before and after the patch? In theory all the callers up to > > > __jhash_final() included should be inlined, and the compiler should be > > > able to optimze at least rol32(0, <n>). > > > > I looked the hash is 20+ instructions and pretty much all of > > them appeared twice. > > (I'm in the wrong building to have a buildable kernel tree.) > > > > It has to be said that ipv4_portaddr_hash(net, IPADDR_ANY, port) > > could just be net_hash_mix(net) ^ port. > > (Or maybe you could use a different random value.) > > > > I'm not even sure the relatively expensive mixing of 'saddr' > > is needed - it is one of the local IPv4 addresses. > > Mixing in the remote address for connected sockets might > > be useful for any server code that uses a lot of connected > > udp sockets - but that isn't done. > > > > We will have hundreds of udp sockets with different ports that > > are not connected (I don't know if they get bound to a local > > address). So a remote address hash wouldn't help. > > > > If you look at the generated code for __udp4_lib_lookup() > > it is actually quite horrid. > > Too many called functions with too many parameters. > > Things spill to the stack all the time. > > > > The reuse_port code made it a whole lot worse. > > > > Maybe ... Please show us performance numbers. > > If less than 1%, I would not bother changing this code, making future > backports more risky. I didn't :-) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH 1/2] Move hash calculation inside udp4_lib_lookup2() 2023-07-26 13:44 ` Paolo Abeni 2023-07-26 14:02 ` David Laight @ 2023-07-26 15:42 ` David Laight 1 sibling, 0 replies; 14+ messages in thread From: David Laight @ 2023-07-26 15:42 UTC (permalink / raw) To: 'Paolo Abeni', 'willemdebruijn.kernel@gmail.com', 'davem@davemloft.net', 'dsahern@kernel.org', 'Eric Dumazet', 'kuba@kernel.org', 'netdev@vger.kernel.org' From: Paolo Abeni > Sent: 26 July 2023 14:44 > > On Wed, 2023-07-26 at 12:05 +0000, David Laight wrote: > > Pass the udptable address into udp4_lib_lookup2() instead of the hash slot. > > > > While ipv4_portaddr_hash(net, IP_ADDR_ANY, 0) is constant for each net > > (the port is an xor) the value isn't saved. > > Since the hash function doesn't get simplified when passed zero the hash > > Are you sure? could you please objdump and compare the binary code > generated before and after the patch? In theory all the callers up to > __jhash_final() included should be inlined, and the compiler should be > able to optimze at least rol32(0, <n>). From an oldish build I have lurking: 0000000000003960 <__udp4_lib_lookup>: 3960: e8 00 00 00 00 callq 3965 <__udp4_lib_lookup+0x5> 3961: R_X86_64_PC32 __fentry__-0x4 3965: 48 83 ec 48 sub $0x48,%rsp 3969: 66 41 c1 c0 08 rol $0x8,%r8w 396e: 48 89 5c 24 18 mov %rbx,0x18(%rsp) 3973: 48 89 6c 24 20 mov %rbp,0x20(%rsp) 3978: 89 f5 mov %esi,%ebp 397a: 4c 89 64 24 28 mov %r12,0x28(%rsp) 397f: 4c 89 6c 24 30 mov %r13,0x30(%rsp) 3984: 44 0f b7 e2 movzwl %dx,%r12d 3988: 4c 89 74 24 38 mov %r14,0x38(%rsp) 398d: 4c 89 7c 24 40 mov %r15,0x40(%rsp) 3992: 49 89 fe mov %rdi,%r14 3995: 8b bf 40 01 00 00 mov 0x140(%rdi),%edi 399b: 45 0f b7 f8 movzwl %r8w,%r15d 399f: 48 8b 5c 24 58 mov 0x58(%rsp),%rbx 39a4: 48 8b 54 24 60 mov 0x60(%rsp),%rdx 39a9: 45 89 cd mov %r9d,%r13d 39ac: 81 ef 0d 41 52 21 sub $0x2152410d,%edi 39b2: 89 fe mov %edi,%esi 39b4: 8d 04 39 lea (%rcx,%rdi,1),%eax 39b7: 48 89 54 24 10 mov %rdx,0x10(%rsp) 39bc: c1 c6 0e rol $0xe,%esi 39bf: 44 89 e2 mov %r12d,%edx 39c2: f7 de neg %esi 39c4: 41 89 f0 mov %esi,%r8d 39c7: 31 f0 xor %esi,%eax 39c9: 41 c1 c0 0b rol $0xb,%r8d 39cd: 44 29 c0 sub %r8d,%eax 39d0: 41 89 c0 mov %eax,%r8d 39d3: 31 c7 xor %eax,%edi 39d5: 41 c1 c0 19 rol $0x19,%r8d 39d9: 44 29 c7 sub %r8d,%edi 39dc: 41 89 f8 mov %edi,%r8d 39df: 31 fe xor %edi,%esi 39e1: 41 c1 c0 10 rol $0x10,%r8d 39e5: 44 29 c6 sub %r8d,%esi 39e8: 41 89 f0 mov %esi,%r8d 39eb: 31 f0 xor %esi,%eax 39ed: 41 c1 c0 04 rol $0x4,%r8d 39f1: 44 29 c0 sub %r8d,%eax 39f4: 45 89 f8 mov %r15d,%r8d 39f7: 31 c7 xor %eax,%edi 39f9: c1 c0 0e rol $0xe,%eax 39fc: 29 c7 sub %eax,%edi 39fe: 89 f8 mov %edi,%eax 3a00: c1 c7 18 rol $0x18,%edi 3a03: 31 f0 xor %esi,%eax 3a05: 89 ee mov %ebp,%esi 3a07: 29 f8 sub %edi,%eax 3a09: 4c 89 f7 mov %r14,%rdi 3a0c: 44 31 f8 xor %r15d,%eax 3a0f: 23 43 10 and 0x10(%rbx),%eax 3a12: 48 c1 e0 04 shl $0x4,%rax 3a16: 48 03 43 08 add 0x8(%rbx),%rax 3a1a: 48 89 44 24 08 mov %rax,0x8(%rsp) 3a1f: 8b 44 24 50 mov 0x50(%rsp),%eax 3a23: 89 04 24 mov %eax,(%rsp) 3a26: e8 05 f7 ff ff callq 3130 <udp4_lib_lookup2> 3a2b: 48 85 c0 test %rax,%rax 3a2e: 74 32 je 3a62 <__udp4_lib_lookup+0x102> 3a30: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax 3a36: ba 00 00 00 00 mov $0x0,%edx 3a3b: 48 8b 5c 24 18 mov 0x18(%rsp),%rbx 3a40: 48 0f 43 c2 cmovae %rdx,%rax 3a44: 48 8b 6c 24 20 mov 0x20(%rsp),%rbp 3a49: 4c 8b 64 24 28 mov 0x28(%rsp),%r12 3a4e: 4c 8b 6c 24 30 mov 0x30(%rsp),%r13 3a53: 4c 8b 74 24 38 mov 0x38(%rsp),%r14 3a58: 4c 8b 7c 24 40 mov 0x40(%rsp),%r15 3a5d: 48 83 c4 48 add $0x48,%rsp 3a61: c3 retq 3a62: 41 8b 96 40 01 00 00 mov 0x140(%r14),%edx 3a69: 45 89 e9 mov %r13d,%r9d 3a6c: 45 89 f8 mov %r15d,%r8d 3a6f: 4c 89 f7 mov %r14,%rdi 3a72: 81 ea 0d 41 52 21 sub $0x2152410d,%edx 3a78: 89 d0 mov %edx,%eax 3a7a: c1 c0 0e rol $0xe,%eax 3a7d: f7 d8 neg %eax 3a7f: 89 c1 mov %eax,%ecx 3a81: 89 c6 mov %eax,%esi 3a83: 31 d1 xor %edx,%ecx 3a85: c1 c6 0b rol $0xb,%esi 3a88: 29 f1 sub %esi,%ecx 3a8a: 89 ce mov %ecx,%esi 3a8c: 31 ca xor %ecx,%edx 3a8e: c1 c6 19 rol $0x19,%esi 3a91: 29 f2 sub %esi,%edx 3a93: 89 d6 mov %edx,%esi 3a95: 31 d0 xor %edx,%eax 3a97: c1 c6 10 rol $0x10,%esi 3a9a: 29 f0 sub %esi,%eax 3a9c: 89 c6 mov %eax,%esi 3a9e: 31 c1 xor %eax,%ecx 3aa0: c1 c6 04 rol $0x4,%esi 3aa3: 29 f1 sub %esi,%ecx 3aa5: 89 ee mov %ebp,%esi 3aa7: 31 ca xor %ecx,%edx 3aa9: c1 c1 0e rol $0xe,%ecx 3aac: 29 ca sub %ecx,%edx 3aae: 48 8b 4c 24 60 mov 0x60(%rsp),%rcx 3ab3: 31 d0 xor %edx,%eax 3ab5: c1 c2 18 rol $0x18,%edx 3ab8: 29 d0 sub %edx,%eax 3aba: 44 89 e2 mov %r12d,%edx 3abd: 48 89 4c 24 10 mov %rcx,0x10(%rsp) 3ac2: 44 31 f8 xor %r15d,%eax 3ac5: 23 43 10 and 0x10(%rbx),%eax 3ac8: 31 c9 xor %ecx,%ecx 3aca: 48 c1 e0 04 shl $0x4,%rax 3ace: 48 03 43 08 add 0x8(%rbx),%rax 3ad2: 48 89 44 24 08 mov %rax,0x8(%rsp) 3ad7: 8b 44 24 50 mov 0x50(%rsp),%eax 3adb: 89 04 24 mov %eax,(%rsp) 3ade: e8 4d f6 ff ff callq 3130 <udp4_lib_lookup2> 3ae3: e9 48 ff ff ff jmpq 3a30 <__udp4_lib_lookup+0xd0> The two copies of the hash are pretty much the same. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 2/2] Rescan the hash2 list if the hash chains have got cross-linked. 2023-07-26 11:51 [PATCH 0/2] udp: rescan hash2 list if chains crossed David Laight 2023-07-26 12:05 ` [PATCH 1/2] Move hash calculation inside udp4_lib_lookup2() David Laight @ 2023-07-26 12:05 ` David Laight 2023-07-26 13:36 ` Paolo Abeni 2023-07-26 13:37 ` Eric Dumazet 1 sibling, 2 replies; 14+ messages in thread From: David Laight @ 2023-07-26 12:05 UTC (permalink / raw) To: 'willemdebruijn.kernel@gmail.com', 'davem@davemloft.net', 'dsahern@kernel.org', 'Eric Dumazet', 'kuba@kernel.org', 'pabeni@redhat.com', 'netdev@vger.kernel.org' udp_lib_rehash() can get called at any time and will move a socket to a different hash2 chain. This can cause udp4_lib_lookup2() (processing incoming UDP) to fail to find a socket and an ICMP port unreachable be sent. Prior to ca065d0cf80fa the lookup used 'hlist_nulls' and checked that the 'end if list' marker was on the correct list. Rather than re-instate the 'nulls' list just check that the final socket is on the correct list. The cross-linking can definitely happen (see earlier issues with it looping forever because gcc cached the list head). Fixes: ca065d0cf80fa ("udp: no longer use SLAB_DESTROY_BY_RCU") Signed-off-by: David Laight <david.laight@aculab.com> --- net/ipv4/udp.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index ad64d6c4cd99..ed92ba7610b0 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -443,6 +443,7 @@ static struct sock *udp4_lib_lookup2(struct net *net, struct sk_buff *skb) { unsigned int hash2, slot2; + unsigned int hash2_rescan; struct udp_hslot *hslot2; struct sock *sk, *result; int score, badness; @@ -451,9 +452,12 @@ static struct sock *udp4_lib_lookup2(struct net *net, slot2 = hash2 & udptable->mask; hslot2 = &udptable->hash2[slot2]; +rescan: + hash2_rescan = hash2; result = NULL; badness = 0; udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { + hash2_rescan = udp_sk(sk)->udp_portaddr_hash; score = compute_score(sk, net, saddr, sport, daddr, hnum, dif, sdif); if (score > badness) { @@ -467,6 +471,16 @@ static struct sock *udp4_lib_lookup2(struct net *net, badness = score; } } + + /* udp sockets can get moved to a different hash chain. + * If the chains have got crossed then rescan. + */ + if ((hash2_rescan & udptable->mask) != slot2) { + /* Ensure hslot2->head is reread */ + barrier(); + goto rescan; + } + return result; } -- 2.17.1 - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] Rescan the hash2 list if the hash chains have got cross-linked. 2023-07-26 12:05 ` [PATCH 2/2] Rescan the hash2 list if the hash chains have got cross-linked David Laight @ 2023-07-26 13:36 ` Paolo Abeni 2023-07-26 13:37 ` Eric Dumazet 1 sibling, 0 replies; 14+ messages in thread From: Paolo Abeni @ 2023-07-26 13:36 UTC (permalink / raw) To: David Laight, 'willemdebruijn.kernel@gmail.com', 'davem@davemloft.net', 'dsahern@kernel.org', 'Eric Dumazet', 'kuba@kernel.org', 'netdev@vger.kernel.org' Hi, On Wed, 2023-07-26 at 12:05 +0000, David Laight wrote: > udp_lib_rehash() can get called at any time and will move a > socket to a different hash2 chain. > This can cause udp4_lib_lookup2() (processing incoming UDP) to > fail to find a socket and an ICMP port unreachable be sent. > > Prior to ca065d0cf80fa the lookup used 'hlist_nulls' and checked > that the 'end if list' marker was on the correct list. > > Rather than re-instate the 'nulls' list just check that the final > socket is on the correct list. > > The cross-linking can definitely happen (see earlier issues with > it looping forever because gcc cached the list head). > > Fixes: ca065d0cf80fa ("udp: no longer use SLAB_DESTROY_BY_RCU") > Signed-off-by: David Laight <david.laight@aculab.com> > --- > net/ipv4/udp.c | 14 ++++++++++++++ > 1 file changed, 14 insertions(+) > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > index ad64d6c4cd99..ed92ba7610b0 100644 > --- a/net/ipv4/udp.c > +++ b/net/ipv4/udp.c > @@ -443,6 +443,7 @@ static struct sock *udp4_lib_lookup2(struct net *net, > struct sk_buff *skb) > { > unsigned int hash2, slot2; > + unsigned int hash2_rescan; > struct udp_hslot *hslot2; > struct sock *sk, *result; > int score, badness; > @@ -451,9 +452,12 @@ static struct sock *udp4_lib_lookup2(struct net *net, > slot2 = hash2 & udptable->mask; > hslot2 = &udptable->hash2[slot2]; > > +rescan: > + hash2_rescan = hash2; > result = NULL; > badness = 0; > udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { > + hash2_rescan = udp_sk(sk)->udp_portaddr_hash; > score = compute_score(sk, net, saddr, sport, > daddr, hnum, dif, sdif); > if (score > badness) { > @@ -467,6 +471,16 @@ static struct sock *udp4_lib_lookup2(struct net *net, > badness = score; > } > } > + > + /* udp sockets can get moved to a different hash chain. > + * If the chains have got crossed then rescan. > + */ > + if ((hash2_rescan & udptable->mask) != slot2) { > + /* Ensure hslot2->head is reread */ > + barrier(); udp_portaddr_for_each_entry_rcu() uses (indirectly) rcu_dereference_raw() to access the head. That implies a READ_ONCE(). Additional barriers for re-read should not be necessary. What about IPV6? Cheers, Paolo ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] Rescan the hash2 list if the hash chains have got cross-linked. 2023-07-26 12:05 ` [PATCH 2/2] Rescan the hash2 list if the hash chains have got cross-linked David Laight 2023-07-26 13:36 ` Paolo Abeni @ 2023-07-26 13:37 ` Eric Dumazet 2023-07-26 14:13 ` David Laight 1 sibling, 1 reply; 14+ messages in thread From: Eric Dumazet @ 2023-07-26 13:37 UTC (permalink / raw) To: David Laight Cc: willemdebruijn.kernel@gmail.com, davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org On Wed, Jul 26, 2023 at 2:06 PM David Laight <David.Laight@aculab.com> wrote: > > udp_lib_rehash() can get called at any time and will move a > socket to a different hash2 chain. > This can cause udp4_lib_lookup2() (processing incoming UDP) to > fail to find a socket and an ICMP port unreachable be sent. > > Prior to ca065d0cf80fa the lookup used 'hlist_nulls' and checked > that the 'end if list' marker was on the correct list. > > Rather than re-instate the 'nulls' list just check that the final > socket is on the correct list. > > The cross-linking can definitely happen (see earlier issues with > it looping forever because gcc cached the list head). > > Fixes: ca065d0cf80fa ("udp: no longer use SLAB_DESTROY_BY_RCU") > Signed-off-by: David Laight <david.laight@aculab.com> > --- Hi David, thanks a lot for the investigations. I do not think this is the proper fix. UDP rehash has always been buggy, because we lack an rcu grace period between the removal of the socket from the old hash bucket to the new one. We need to stuff a synchronize_rcu() somewhere in udp_lib_rehash(), and that might not be easy [1] and might add unexpected latency to some real time applications. ([1] : Not sure if we are allowed to sleep in udp_lib_rehash()) Also note that adding a synchronize_rcu() would mean the socket would not be found anyway by some incoming packets. I think that rehash is tricky to implement if you expect that all incoming packets must find the socket, wherever it is located. > net/ipv4/udp.c | 14 ++++++++++++++ > 1 file changed, 14 insertions(+) > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > index ad64d6c4cd99..ed92ba7610b0 100644 > --- a/net/ipv4/udp.c > +++ b/net/ipv4/udp.c > @@ -443,6 +443,7 @@ static struct sock *udp4_lib_lookup2(struct net *net, > struct sk_buff *skb) > { > unsigned int hash2, slot2; > + unsigned int hash2_rescan; > struct udp_hslot *hslot2; > struct sock *sk, *result; > int score, badness; > @@ -451,9 +452,12 @@ static struct sock *udp4_lib_lookup2(struct net *net, > slot2 = hash2 & udptable->mask; > hslot2 = &udptable->hash2[slot2]; > > +rescan: > + hash2_rescan = hash2; > result = NULL; > badness = 0; > udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { > + hash2_rescan = udp_sk(sk)->udp_portaddr_hash; > score = compute_score(sk, net, saddr, sport, > daddr, hnum, dif, sdif); > if (score > badness) { > @@ -467,6 +471,16 @@ static struct sock *udp4_lib_lookup2(struct net *net, > badness = score; > } > } > + > + /* udp sockets can get moved to a different hash chain. > + * If the chains have got crossed then rescan. > + */ > + if ((hash2_rescan & udptable->mask) != slot2) { This is only going to catch one of the possible cases. If we really want to add extra checks in this fast path, we would have to check all found sockets, not only the last one. ^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH 2/2] Rescan the hash2 list if the hash chains have got cross-linked. 2023-07-26 13:37 ` Eric Dumazet @ 2023-07-26 14:13 ` David Laight 2023-07-26 14:21 ` Eric Dumazet 0 siblings, 1 reply; 14+ messages in thread From: David Laight @ 2023-07-26 14:13 UTC (permalink / raw) To: 'Eric Dumazet' Cc: willemdebruijn.kernel@gmail.com, davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org From: Eric Dumazet > Sent: 26 July 2023 14:37 > > On Wed, Jul 26, 2023 at 2:06 PM David Laight <David.Laight@aculab.com> wrote: > > > > udp_lib_rehash() can get called at any time and will move a > > socket to a different hash2 chain. > > This can cause udp4_lib_lookup2() (processing incoming UDP) to > > fail to find a socket and an ICMP port unreachable be sent. > > > > Prior to ca065d0cf80fa the lookup used 'hlist_nulls' and checked > > that the 'end if list' marker was on the correct list. > > > > Rather than re-instate the 'nulls' list just check that the final > > socket is on the correct list. > > > > The cross-linking can definitely happen (see earlier issues with > > it looping forever because gcc cached the list head). > > > > Fixes: ca065d0cf80fa ("udp: no longer use SLAB_DESTROY_BY_RCU") > > Signed-off-by: David Laight <david.laight@aculab.com> > > --- > > Hi David, thanks a lot for the investigations. > > I do not think this is the proper fix. > > UDP rehash has always been buggy, because we lack an rcu grace period > between the removal of the socket > from the old hash bucket to the new one. > > We need to stuff a synchronize_rcu() somewhere in udp_lib_rehash(), > and that might not be easy [1] > and might add unexpected latency to some real time applications. > ([1] : Not sure if we are allowed to sleep in udp_lib_rehash()) I'm also not sure that the callers would always like the potentially long rcu sleep. > Also note that adding a synchronize_rcu() would mean the socket would > not be found anyway by some incoming packets. > > I think that rehash is tricky to implement if you expect that all > incoming packets must find the socket, wherever it is located. I thought about something like the checks done for reading multi-word counters. Something like requiring the updater to increment a count on entry and exit and have the reader rescan with the lock held if the count is odd or changes. The problem is that a single 'port unreachable' can be treated as a fatal error by the receiving application. So you really don't want to be sending them. > > > > net/ipv4/udp.c | 14 ++++++++++++++ > > 1 file changed, 14 insertions(+) > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > > index ad64d6c4cd99..ed92ba7610b0 100644 > > --- a/net/ipv4/udp.c > > +++ b/net/ipv4/udp.c > > @@ -443,6 +443,7 @@ static struct sock *udp4_lib_lookup2(struct net *net, > > struct sk_buff *skb) > > { > > unsigned int hash2, slot2; > > + unsigned int hash2_rescan; > > struct udp_hslot *hslot2; > > struct sock *sk, *result; > > int score, badness; > > @@ -451,9 +452,12 @@ static struct sock *udp4_lib_lookup2(struct net *net, > > slot2 = hash2 & udptable->mask; > > hslot2 = &udptable->hash2[slot2]; > > > > +rescan: > > + hash2_rescan = hash2; > > result = NULL; > > badness = 0; > > udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { > > + hash2_rescan = udp_sk(sk)->udp_portaddr_hash; > > score = compute_score(sk, net, saddr, sport, > > daddr, hnum, dif, sdif); > > if (score > badness) { > > @@ -467,6 +471,16 @@ static struct sock *udp4_lib_lookup2(struct net *net, > > badness = score; > > } > > } > > + > > + /* udp sockets can get moved to a different hash chain. > > + * If the chains have got crossed then rescan. > > + */ > > + if ((hash2_rescan & udptable->mask) != slot2) { > > This is only going to catch one of the possible cases. > > If we really want to add extra checks in this fast path, we would have > to check all found sockets, > not only the last one. I did think about that. Being hit by a single rehash is very unlikely. Being hit by two that also put you back onto the original chain really isn't going to happen. Putting the check inside the loop would save the test when the hash list is empty - probably common for the first lookup. The code in compute_score() is pretty horrid so maybe it wouldn't really be noticeable. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] Rescan the hash2 list if the hash chains have got cross-linked. 2023-07-26 14:13 ` David Laight @ 2023-07-26 14:21 ` Eric Dumazet 2023-07-26 14:39 ` David Laight 0 siblings, 1 reply; 14+ messages in thread From: Eric Dumazet @ 2023-07-26 14:21 UTC (permalink / raw) To: David Laight Cc: willemdebruijn.kernel@gmail.com, davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org On Wed, Jul 26, 2023 at 4:13 PM David Laight <David.Laight@aculab.com> wrote: > > From: Eric Dumazet > > Sent: 26 July 2023 14:37 > > > > On Wed, Jul 26, 2023 at 2:06 PM David Laight <David.Laight@aculab.com> wrote: > > > > > > udp_lib_rehash() can get called at any time and will move a > > > socket to a different hash2 chain. > > > This can cause udp4_lib_lookup2() (processing incoming UDP) to > > > fail to find a socket and an ICMP port unreachable be sent. > > > > > > Prior to ca065d0cf80fa the lookup used 'hlist_nulls' and checked > > > that the 'end if list' marker was on the correct list. > > > > > > Rather than re-instate the 'nulls' list just check that the final > > > socket is on the correct list. > > > > > > The cross-linking can definitely happen (see earlier issues with > > > it looping forever because gcc cached the list head). > > > > > > Fixes: ca065d0cf80fa ("udp: no longer use SLAB_DESTROY_BY_RCU") > > > Signed-off-by: David Laight <david.laight@aculab.com> > > > --- > > > > Hi David, thanks a lot for the investigations. > > > > I do not think this is the proper fix. > > > > UDP rehash has always been buggy, because we lack an rcu grace period > > between the removal of the socket > > from the old hash bucket to the new one. > > > > We need to stuff a synchronize_rcu() somewhere in udp_lib_rehash(), > > and that might not be easy [1] > > and might add unexpected latency to some real time applications. > > ([1] : Not sure if we are allowed to sleep in udp_lib_rehash()) > > I'm also not sure that the callers would always like the potentially > long rcu sleep. > > > Also note that adding a synchronize_rcu() would mean the socket would > > not be found anyway by some incoming packets. > > > > I think that rehash is tricky to implement if you expect that all > > incoming packets must find the socket, wherever it is located. > > I thought about something like the checks done for reading > multi-word counters. > Something like requiring the updater to increment a count on > entry and exit and have the reader rescan with the lock held > if the count is odd or changes. > Can you describe what user space operation is done by your precious application, triggering a rehash in the first place ? Maybe we can think of something less disruptive in the kernel. (For instance, you could have a second socket, insert it in the new bucket, then remove the old socket) > The problem is that a single 'port unreachable' can be treated > as a fatal error by the receiving application. > So you really don't want to be sending them. Well, if your application needs to run with old kernels, and or transient netfilter changes (some firewall setups do not use iptables-restore) better be more resilient to transient ICMP messages anyway. > > > > > > > > net/ipv4/udp.c | 14 ++++++++++++++ > > > 1 file changed, 14 insertions(+) > > > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > > > index ad64d6c4cd99..ed92ba7610b0 100644 > > > --- a/net/ipv4/udp.c > > > +++ b/net/ipv4/udp.c > > > @@ -443,6 +443,7 @@ static struct sock *udp4_lib_lookup2(struct net *net, > > > struct sk_buff *skb) > > > { > > > unsigned int hash2, slot2; > > > + unsigned int hash2_rescan; > > > struct udp_hslot *hslot2; > > > struct sock *sk, *result; > > > int score, badness; > > > @@ -451,9 +452,12 @@ static struct sock *udp4_lib_lookup2(struct net *net, > > > slot2 = hash2 & udptable->mask; > > > hslot2 = &udptable->hash2[slot2]; > > > > > > +rescan: > > > + hash2_rescan = hash2; > > > result = NULL; > > > badness = 0; > > > udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { > > > + hash2_rescan = udp_sk(sk)->udp_portaddr_hash; > > > score = compute_score(sk, net, saddr, sport, > > > daddr, hnum, dif, sdif); > > > if (score > badness) { > > > @@ -467,6 +471,16 @@ static struct sock *udp4_lib_lookup2(struct net *net, > > > badness = score; > > > } > > > } > > > + > > > + /* udp sockets can get moved to a different hash chain. > > > + * If the chains have got crossed then rescan. > > > + */ > > > + if ((hash2_rescan & udptable->mask) != slot2) { > > > > This is only going to catch one of the possible cases. > > > > If we really want to add extra checks in this fast path, we would have > > to check all found sockets, > > not only the last one. > > I did think about that. > Being hit by a single rehash is very unlikely. > Being hit by two that also put you back onto the original > chain really isn't going to happen. > > Putting the check inside the loop would save the test when the > hash list is empty - probably common for the first lookup. > > The code in compute_score() is pretty horrid so maybe it > wouldn't really be noticeable. > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH 2/2] Rescan the hash2 list if the hash chains have got cross-linked. 2023-07-26 14:21 ` Eric Dumazet @ 2023-07-26 14:39 ` David Laight 2023-07-26 14:48 ` Eric Dumazet 0 siblings, 1 reply; 14+ messages in thread From: David Laight @ 2023-07-26 14:39 UTC (permalink / raw) To: 'Eric Dumazet' Cc: willemdebruijn.kernel@gmail.com, davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org From: Eric Dumazet > Sent: 26 July 2023 15:22 ... > Can you describe what user space operation is done by your precious application, > triggering a rehash in the first place ? We've no idea what is causing the rehash. There are a lot of sockets that are receiving RTP audio. But they are only created, bound and then deleted. The 'best guess' is something to do with ipsec tunnels being created, deleted or rehashed. > > Maybe we can think of something less disruptive in the kernel. > (For instance, you could have a second socket, insert it in the new bucket, > then remove the old socket) > > > The problem is that a single 'port unreachable' can be treated > > as a fatal error by the receiving application. > > So you really don't want to be sending them. > > Well, if your application needs to run with old kernels, and or > transient netfilter changes (some firewall setups do not use > iptables-restore) > better be more resilient to transient ICMP messages anyway. This is being done for the specific pair of sockets that caused grief. For this setup they were on 127.0.0.1 but that isn't always true. But they would be expected to be on a local network. Reading between the lines of the comment in ipv4/icmp.c it is reasonable to assume that ICMP_PORT_UNREACH be treated as a fatal error (ie not a transient one). So really the Linux kernel ought to try quite hard to not generate them when the port exists. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] Rescan the hash2 list if the hash chains have got cross-linked. 2023-07-26 14:39 ` David Laight @ 2023-07-26 14:48 ` Eric Dumazet 0 siblings, 0 replies; 14+ messages in thread From: Eric Dumazet @ 2023-07-26 14:48 UTC (permalink / raw) To: David Laight Cc: willemdebruijn.kernel@gmail.com, davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org On Wed, Jul 26, 2023 at 4:39 PM David Laight <David.Laight@aculab.com> wrote: > > From: Eric Dumazet > > Sent: 26 July 2023 15:22 > ... > > Can you describe what user space operation is done by your precious application, > > triggering a rehash in the first place ? > > We've no idea what is causing the rehash. > There are a lot of sockets that are receiving RTP audio. > But they are only created, bound and then deleted. > > The 'best guess' is something to do with ipsec tunnels > being created, deleted or rehashed. > > > > > Maybe we can think of something less disruptive in the kernel. > > (For instance, you could have a second socket, insert it in the new bucket, > > then remove the old socket) > > > > > The problem is that a single 'port unreachable' can be treated > > > as a fatal error by the receiving application. > > > So you really don't want to be sending them. > > > > Well, if your application needs to run with old kernels, and or > > transient netfilter changes (some firewall setups do not use > > iptables-restore) > > better be more resilient to transient ICMP messages anyway. > > This is being done for the specific pair of sockets that caused grief. > For this setup they were on 127.0.0.1 but that isn't always true. > But they would be expected to be on a local network. > > Reading between the lines of the comment in ipv4/icmp.c > it is reasonable to assume that ICMP_PORT_UNREACH be treated > as a fatal error (ie not a transient one). > So really the Linux kernel ought to try quite hard to not > generate them when the port exists. Sure, then please add the synchronize_rcu() call, because it won't affect you. You could add a probe to try to identify what is causing a rehash. perf probe -a udp_lib_rehash perf record -a -g -e probe:udp_lib_rehash sleep 60 ... perf script ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2023-07-26 15:42 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-07-26 11:51 [PATCH 0/2] udp: rescan hash2 list if chains crossed David Laight 2023-07-26 12:05 ` [PATCH 1/2] Move hash calculation inside udp4_lib_lookup2() David Laight 2023-07-26 13:44 ` Paolo Abeni 2023-07-26 14:02 ` David Laight 2023-07-26 14:03 ` Eric Dumazet 2023-07-26 14:06 ` David Laight 2023-07-26 15:42 ` David Laight 2023-07-26 12:05 ` [PATCH 2/2] Rescan the hash2 list if the hash chains have got cross-linked David Laight 2023-07-26 13:36 ` Paolo Abeni 2023-07-26 13:37 ` Eric Dumazet 2023-07-26 14:13 ` David Laight 2023-07-26 14:21 ` Eric Dumazet 2023-07-26 14:39 ` David Laight 2023-07-26 14:48 ` Eric Dumazet
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).