* [PATCH v2 net-next 0/3] net: Add incoming CPU mask to sockets
From: Tom Herbert @ 2015-05-26 16:34 UTC (permalink / raw)
To: davem, netdev
Added matching of CPU to a socket CPU mask. This is useful for TCP
listeners and unconnected UDP. This works with SO_REUSEPORT to steer
packets to listener sockets based on CPU affinity. These patches
allow steering packets to listeners based on numa locality. This is
only useful for passive connections.
v2:
- Add cache alignment for fields used in socket lookup in sock_common
- Added UDP test results
Tested: 200 TCP_RR streams with --r 512,512 run against an echo server
that sets both SO_REUSEPORT and SO_INCOMING_CPU_MASK. In the test case,
taskset was used to pin each server instance to the set of CPUs
corresponding to a numa node, and SO_INCOMING_CPU_MASK was set to the
same set of CPUs.
CPU utilization is reported from the echo server side.
IPv4
No INCOMING_CPU_MASK
83.48% CPU utilization
1627173 tps
106/185/382 50/90/99% latencies
With INCOMING_CPU_MASK
77.61% CPU utilization
1669853 tps
103/181/378 50/90/99% latencies
IPv6
No INCOMING_CPU_MASK
84.82% CPU utilization
1551730 tps
111/195/392 50/90/99% latencies
With INCOMING_CPU_MASK
79.25% CPU utilization
1571911 tps
110/191/381 50/90/99% latencies
Tested: 200 UDP_RR streams with --r 512,512 run against an echo server
that sets both SO_REUSEPORT and SO_INCOMING_CPU_MASK. In the test case,
taskset was used to pin each server instance to the set of CPUs
corresponding to a numa node, and SO_INCOMING_CPU_MASK was set to the
same set of CPUs.
IPv4
No INCOMING_CPU_MASK
54.76% CPU utilization
1603752.3 tps
95/217/753 50/90/99% latencies
With INCOMING_CPU_MASK
53.25% CPU utilization
1717618 tps
117/221/502 50/90/99% latencies
IPv6
No INCOMING_CPU_MASK
61.62% CPU utilization
1276757 tps
125/279/629 50/90/99% latencies
With INCOMING_CPU_MASK
66.63% CPU utilization
1314527 tps
149/318/564 50/90/99% latencies
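For illustration, here is a minimal userspace sketch of the server setup
used in the tests above: each echo-server instance pins itself to the CPUs
of one NUMA node (the equivalent of the taskset pinning) and sets the
proposed SO_INCOMING_CPU_MASK to the same set. The option value (51) and
the unsigned-long bitmap format are taken from patch 3; libc does not
define the constant, the helper name is hypothetical, and error handling
is abbreviated.

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#ifndef SO_INCOMING_CPU_MASK
#define SO_INCOMING_CPU_MASK 51	/* value proposed for asm-generic in patch 3 */
#endif

/* Hypothetical helper: one listener instance tied to the CPUs of a NUMA node. */
static int make_node_listener(unsigned short port, const cpu_set_t *node_cpus)
{
	struct sockaddr_in6 addr = {
		.sin6_family = AF_INET6,
		.sin6_addr = IN6ADDR_ANY_INIT,
		.sin6_port = htons(port),
	};
	int one = 1;
	int fd;

	/* Pin the process to the node's CPUs, as taskset does in the test. */
	sched_setaffinity(0, sizeof(*node_cpus), node_cpus);

	fd = socket(AF_INET6, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	/* Several instances bind the same port ... */
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

	/* ... and each asks to be preferred for packets arriving on its own
	 * CPUs.  cpu_set_t is an array of unsigned longs, so its size is a
	 * multiple of sizeof(unsigned long) as the setsockopt path requires.
	 */
	setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU_MASK,
		   node_cpus, sizeof(*node_cpus));

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, SOMAXCONN) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

In the benchmarks above, the CPU set for each instance would be the cpulist
of one node, e.g. as reported in /sys/devices/system/node/node0/cpulist.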
Tom Herbert (3):
net: Add cache alignment in sock_common for socket lookup fields
kernel: Make compat bitmap functions externally visible
net: Add incoming CPU mask to sockets
arch/alpha/include/uapi/asm/socket.h | 2 +
arch/avr32/include/uapi/asm/socket.h | 2 +
arch/cris/include/uapi/asm/socket.h | 2 +
arch/frv/include/uapi/asm/socket.h | 2 +
arch/ia64/include/uapi/asm/socket.h | 2 +
arch/m32r/include/uapi/asm/socket.h | 2 +
arch/mips/include/uapi/asm/socket.h | 2 +
arch/mn10300/include/uapi/asm/socket.h | 2 +
arch/parisc/include/uapi/asm/socket.h | 2 +
arch/powerpc/include/uapi/asm/socket.h | 2 +
arch/s390/include/uapi/asm/socket.h | 2 +
arch/sparc/include/uapi/asm/socket.h | 2 +
arch/xtensa/include/uapi/asm/socket.h | 2 +
include/linux/compat.h | 2 +
include/net/sock.h | 39 ++++++++++++++++-
include/uapi/asm-generic/socket.h | 2 +
kernel/compat.c | 7 ++-
net/compat.c | 56 ++++++++++++++++++++++++
net/core/sock.c | 80 ++++++++++++++++++++++++++++++++++
net/ipv4/inet_hashtables.c | 3 ++
net/ipv4/udp.c | 6 +++
net/ipv6/inet6_hashtables.c | 3 ++
net/ipv6/udp.c | 3 ++
23 files changed, 224 insertions(+), 3 deletions(-)
--
1.8.1
* [PATCH v2 net-next 1/3] net: Add cache alignment in sock_common for socket lookup fields
From: Tom Herbert @ 2015-05-26 16:34 UTC (permalink / raw)
To: davem, netdev
Use ____cacheline_aligned_in_smp to keep the fields used in socket
lookup in their own cachelines. These are read-mostly fields and will
often be accessed across CPUs (which would be very common with
SO_REUSEPORT, for instance).
Signed-off-by: Tom Herbert <tom@herbertland.com>
---
include/net/sock.h | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 26c1c31..bcf6114 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -211,7 +211,13 @@ struct sock_common {
struct hlist_node skc_node;
struct hlist_nulls_node skc_nulls_node;
};
- int skc_tx_queue_mapping;
+
+ /* Cachelines above this point are read mostly and are used in socket
+ * lookup.
+ */
+ int skc_tx_queue_mapping
+ ____cacheline_aligned_in_smp;
+
atomic_t skc_refcnt;
/* private: */
int skc_dontcopy_end[0];
--
1.8.1
* [PATCH v2 net-next 2/3] kernel: Make compat bitmap functions externally visible
From: Tom Herbert @ 2015-05-26 16:34 UTC (permalink / raw)
To: davem, netdev
Export compat_get_bitmap and compat_put_bitmap. Make
compat_get_user_cpu_mask non-static, add a prototype for it in
compat.h, and export it.
Signed-off-by: Tom Herbert <tom@herbertland.com>
---
include/linux/compat.h | 2 ++
kernel/compat.c | 7 +++++--
2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/linux/compat.h b/include/linux/compat.h
index ab25814..b52bc0c 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -380,6 +380,8 @@ long compat_get_bitmap(unsigned long *mask, const compat_ulong_t __user *umask,
unsigned long bitmap_size);
long compat_put_bitmap(compat_ulong_t __user *umask, unsigned long *mask,
unsigned long bitmap_size);
+int compat_get_user_cpu_mask(compat_ulong_t __user *user_mask_ptr,
+ unsigned len, struct cpumask *new_mask);
int copy_siginfo_from_user32(siginfo_t *to, struct compat_siginfo __user *from);
int copy_siginfo_to_user32(struct compat_siginfo __user *to, const siginfo_t *from);
int get_compat_sigevent(struct sigevent *event,
diff --git a/kernel/compat.c b/kernel/compat.c
index 24f0061..0b74ad9 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -599,8 +599,8 @@ COMPAT_SYSCALL_DEFINE5(waitid,
return copy_siginfo_to_user32(uinfo, &info);
}
-static int compat_get_user_cpu_mask(compat_ulong_t __user *user_mask_ptr,
- unsigned len, struct cpumask *new_mask)
+int compat_get_user_cpu_mask(compat_ulong_t __user *user_mask_ptr,
+ unsigned len, struct cpumask *new_mask)
{
unsigned long *k;
@@ -612,6 +612,7 @@ static int compat_get_user_cpu_mask(compat_ulong_t __user *user_mask_ptr,
k = cpumask_bits(new_mask);
return compat_get_bitmap(k, user_mask_ptr, len * 8);
}
+EXPORT_SYMBOL_GPL(compat_get_user_cpu_mask);
COMPAT_SYSCALL_DEFINE3(sched_setaffinity, compat_pid_t, pid,
unsigned int, len,
@@ -927,6 +928,7 @@ long compat_get_bitmap(unsigned long *mask, const compat_ulong_t __user *umask,
return 0;
}
+EXPORT_SYMBOL_GPL(compat_get_bitmap);
long compat_put_bitmap(compat_ulong_t __user *umask, unsigned long *mask,
unsigned long bitmap_size)
@@ -967,6 +969,7 @@ long compat_put_bitmap(compat_ulong_t __user *umask, unsigned long *mask,
return 0;
}
+EXPORT_SYMBOL_GPL(compat_put_bitmap);
void
sigset_from_compat(sigset_t *set, const compat_sigset_t *compat)
--
1.8.1
* [PATCH v2 net-next 3/3] net: Add incoming CPU mask to sockets
From: Tom Herbert @ 2015-05-26 16:34 UTC (permalink / raw)
To: davem, netdev
Added matching of CPU to a socket CPU mask. This is useful for TCP
listeners and unconnected UDP. This works with SO_REUSEPORT to steer
packets to listener sockets based on CPU affinity.
In this patch:
- Add SO_INCOMING_CPU_MASK
- Add a CPU mask pointer to struct sock
- Get/setsockopt to get/set the mask on a socket
- Compat functions for the sockopts
- Add sk_match_incoming_cpu_mask to check whether the running CPU is in
a socket's mask
- Call sk_match_incoming_cpu_mask from inet compute_score and UDP
functions for IPv4 and IPv6
Signed-off-by: Tom Herbert <tom@herbertland.com>
---
arch/alpha/include/uapi/asm/socket.h | 2 +
arch/avr32/include/uapi/asm/socket.h | 2 +
arch/cris/include/uapi/asm/socket.h | 2 +
arch/frv/include/uapi/asm/socket.h | 2 +
arch/ia64/include/uapi/asm/socket.h | 2 +
arch/m32r/include/uapi/asm/socket.h | 2 +
arch/mips/include/uapi/asm/socket.h | 2 +
arch/mn10300/include/uapi/asm/socket.h | 2 +
arch/parisc/include/uapi/asm/socket.h | 2 +
arch/powerpc/include/uapi/asm/socket.h | 2 +
arch/s390/include/uapi/asm/socket.h | 2 +
arch/sparc/include/uapi/asm/socket.h | 2 +
arch/xtensa/include/uapi/asm/socket.h | 2 +
include/net/sock.h | 31 +++++++++++++
include/uapi/asm-generic/socket.h | 2 +
net/compat.c | 56 ++++++++++++++++++++++++
net/core/sock.c | 80 ++++++++++++++++++++++++++++++++++
net/ipv4/inet_hashtables.c | 3 ++
net/ipv4/udp.c | 6 +++
net/ipv6/inet6_hashtables.c | 3 ++
net/ipv6/udp.c | 3 ++
21 files changed, 210 insertions(+)
diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 9a20821..eae65a2 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -92,4 +92,6 @@
#define SO_ATTACH_BPF 50
#define SO_DETACH_BPF SO_DETACH_FILTER
+#define SO_INCOMING_CPU_MASK 51
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
index 2b65ed6..89515e3 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -85,4 +85,6 @@
#define SO_ATTACH_BPF 50
#define SO_DETACH_BPF SO_DETACH_FILTER
+#define SO_INCOMING_CPU_MASK 51
+
#endif /* _UAPI__ASM_AVR32_SOCKET_H */
diff --git a/arch/cris/include/uapi/asm/socket.h b/arch/cris/include/uapi/asm/socket.h
index e2503d9f..65fcf0e 100644
--- a/arch/cris/include/uapi/asm/socket.h
+++ b/arch/cris/include/uapi/asm/socket.h
@@ -87,6 +87,8 @@
#define SO_ATTACH_BPF 50
#define SO_DETACH_BPF SO_DETACH_FILTER
+#define SO_INCOMING_CPU_MASK 51
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index 4823ad1..1af3b78 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -85,5 +85,7 @@
#define SO_ATTACH_BPF 50
#define SO_DETACH_BPF SO_DETACH_FILTER
+#define SO_INCOMING_CPU_MASK 51
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index 59be3d8..7ef59d3 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -94,4 +94,6 @@
#define SO_ATTACH_BPF 50
#define SO_DETACH_BPF SO_DETACH_FILTER
+#define SO_INCOMING_CPU_MASK 51
+
#endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index 7bc4cb2..53a697c 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -85,4 +85,6 @@
#define SO_ATTACH_BPF 50
#define SO_DETACH_BPF SO_DETACH_FILTER
+#define SO_INCOMING_CPU_MASK 51
+
#endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index dec3c85..063d59d 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -103,4 +103,6 @@
#define SO_ATTACH_BPF 50
#define SO_DETACH_BPF SO_DETACH_FILTER
+#define SO_INCOMING_CPU_MASK 51
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index cab7d6d..3c9f8e9 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -85,4 +85,6 @@
#define SO_ATTACH_BPF 50
#define SO_DETACH_BPF SO_DETACH_FILTER
+#define SO_INCOMING_CPU_MASK 51
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index a5cd40c..557a09b 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -84,4 +84,6 @@
#define SO_ATTACH_BPF 0x402B
#define SO_DETACH_BPF SO_DETACH_FILTER
+#define SO_INCOMING_CPU_MASK 0x402C
+
#endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index c046666..a72fac6 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -92,4 +92,6 @@
#define SO_ATTACH_BPF 50
#define SO_DETACH_BPF SO_DETACH_FILTER
+#define SO_INCOMING_CPU_MASK 51
+
#endif /* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 296942d..b901044 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -91,4 +91,6 @@
#define SO_ATTACH_BPF 50
#define SO_DETACH_BPF SO_DETACH_FILTER
+#define SO_INCOMING_CPU_MASK 51
+
#endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index e6a16c4..95835a1 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -81,6 +81,8 @@
#define SO_ATTACH_BPF 0x0034
#define SO_DETACH_BPF SO_DETACH_FILTER
+#define SO_INCOMING_CPU_MASK 0x0035
+
/* Security levels - as per NRL IPv6 - don't actually do anything */
#define SO_SECURITY_AUTHENTICATION 0x5001
#define SO_SECURITY_ENCRYPTION_TRANSPORT 0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 4120af0..0167812 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -96,4 +96,6 @@
#define SO_ATTACH_BPF 50
#define SO_DETACH_BPF SO_DETACH_FILTER
+#define SO_INCOMING_CPU_MASK 51
+
#endif /* _XTENSA_SOCKET_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index bcf6114..8407c3b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -123,6 +123,11 @@ typedef struct {
#endif
} socket_lock_t;
+struct rcu_cpumask {
+ struct rcu_head rcu;
+ unsigned long cpumask[0];
+};
+
struct sock;
struct proto;
struct net;
@@ -150,6 +155,7 @@ typedef __u64 __bitwise __addrpair;
* @skc_node: main hash linkage for various protocol lookup tables
* @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
* @skc_tx_queue_mapping: tx queue number for this connection
+ * @skc_incoming_cpu_mask: CPU mask for listeners
* @skc_refcnt: reference count
*
* This is the minimal network layer representation of sockets, the header
@@ -212,9 +218,12 @@ struct sock_common {
struct hlist_nulls_node skc_nulls_node;
};
+ struct rcu_cpumask __rcu *skc_incoming_cpu_mask;
+
/* Cachelines above this point are read mostly and are used in socket
* lookup.
*/
+
int skc_tx_queue_mapping
____cacheline_aligned_in_smp;
@@ -314,6 +323,7 @@ struct sock {
#define sk_node __sk_common.skc_node
#define sk_nulls_node __sk_common.skc_nulls_node
#define sk_refcnt __sk_common.skc_refcnt
+#define sk_incoming_cpu_mask __sk_common.skc_incoming_cpu_mask
#define sk_tx_queue_mapping __sk_common.skc_tx_queue_mapping
#define sk_dontcopy_begin __sk_common.skc_dontcopy_begin
@@ -2220,6 +2230,27 @@ static inline bool sk_fullsock(const struct sock *sk)
return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
}
+static inline bool sk_match_incoming_cpu_mask(const struct sock *sk)
+{
+ struct rcu_cpumask *mask;
+ bool ret = false;
+
+ if (!sk->sk_incoming_cpu_mask)
+ return ret;
+
+ rcu_read_lock();
+
+ mask = rcu_dereference(sk->sk_incoming_cpu_mask);
+ if (likely(mask) &&
+ cpumask_test_cpu(raw_smp_processor_id(),
+ to_cpumask(mask->cpumask)))
+ ret = true;
+
+ rcu_read_unlock();
+
+ return ret;
+}
+
void sock_enable_timestamp(struct sock *sk, int flag);
int sock_get_timestamp(struct sock *, struct timeval __user *);
int sock_get_timestampns(struct sock *, struct timespec __user *);
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 5c15c2a..d41c8b9 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -87,4 +87,6 @@
#define SO_ATTACH_BPF 50
#define SO_DETACH_BPF SO_DETACH_FILTER
+#define SO_INCOMING_CPU_MASK 51
+
#endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/compat.c b/net/compat.c
index 5cfd26a..f9fc5ce 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -351,6 +351,23 @@ static int do_set_sock_timeout(struct socket *sock, int level,
return err;
}
+static int do_set_incoming_cpu_mask(struct socket *sock, int level,
+ int optname, char __user *optval, unsigned int optlen)
+{
+ compat_ulong_t __user *user_mask_ptr =
+ (compat_ulong_t __user *)optval;
+ struct cpumask __user *mask = compat_alloc_user_space(cpumask_size());
+ int err;
+
+ err = compat_get_user_cpu_mask(user_mask_ptr, optlen, mask);
+ if (err)
+ return err;
+
+ return sock_setsockopt(sock, level, optname,
+ (char __user *)cpumask_bits(mask),
+ cpumask_size());
+}
+
static int compat_sock_setsockopt(struct socket *sock, int level, int optname,
char __user *optval, unsigned int optlen)
{
@@ -360,6 +377,10 @@ static int compat_sock_setsockopt(struct socket *sock, int level, int optname,
if (optname == SO_RCVTIMEO || optname == SO_SNDTIMEO)
return do_set_sock_timeout(sock, level, optname, optval, optlen);
+ if (optname == SO_INCOMING_CPU_MASK)
+ return do_set_incoming_cpu_mask(sock, level, optname,
+ optval, optlen);
+
return sock_setsockopt(sock, level, optname, optval, optlen);
}
@@ -419,11 +440,46 @@ static int do_get_sock_timeout(struct socket *sock, int level, int optname,
return err;
}
+static int do_get_incoming_cpu_mask(struct socket *sock, int level,
+ int optname, char __user *optval, unsigned int __user *optlen)
+{
+ compat_ulong_t __user *user_mask_ptr =
+ (compat_ulong_t __user *)optval;
+ struct cpumask __user *mask = compat_alloc_user_space(cpumask_size());
+ int len, err;
+
+ if (get_user(len, optlen))
+ return -EFAULT;
+
+ if ((len * BITS_PER_BYTE) < nr_cpu_ids)
+ return -EINVAL;
+ if (len & (sizeof(compat_ulong_t) - 1))
+ return -EINVAL;
+
+ if (put_user(cpumask_size(), optlen))
+ return -EFAULT;
+
+ err = sock_getsockopt(sock, level, optname,
+ (char __user *)cpumask_bits(mask), optlen);
+ if (err == 0)
+ if (get_user(len, optlen) ||
+ compat_put_bitmap(user_mask_ptr,
+ cpumask_bits(mask), len * 8))
+ err = -EFAULT;
+
+ return err;
+}
+
static int compat_sock_getsockopt(struct socket *sock, int level, int optname,
char __user *optval, int __user *optlen)
{
if (optname == SO_RCVTIMEO || optname == SO_SNDTIMEO)
return do_get_sock_timeout(sock, level, optname, optval, optlen);
+
+ if (optname == SO_INCOMING_CPU_MASK)
+ return do_get_incoming_cpu_mask(sock, level, optname,
+ optval, optlen);
+
return sock_getsockopt(sock, level, optname, optval, optlen);
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 29124fc..25fc8a7 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -672,6 +672,71 @@ bool sk_mc_loop(struct sock *sk)
}
EXPORT_SYMBOL(sk_mc_loop);
+static int do_set_incoming_cpu_mask(struct sock *sk, char __user *optval,
+ unsigned int optlen)
+{
+ struct rcu_cpumask *new_mask, *old_mask;
+ unsigned long *k;
+
+ old_mask = rcu_dereference_protected(sk->sk_incoming_cpu_mask,
+ sock_owned_by_user(sk));
+
+ if (optlen == 0) {
+ RCU_INIT_POINTER(sk->sk_incoming_cpu_mask, NULL);
+ if (old_mask) {
+ kfree_rcu(old_mask, rcu);
+ return 0;
+ }
+ }
+
+ if (optlen & (sizeof(unsigned long) - 1))
+ return -EINVAL;
+
+ new_mask = kzalloc(sizeof(*new_mask) + cpumask_size(), GFP_KERNEL);
+ if (!new_mask)
+ return -ENOMEM;
+
+ k = cpumask_bits(to_cpumask(new_mask->cpumask));
+ if (copy_from_user(k, optval, min_t(int, optlen, cpumask_size())))
+ return -EFAULT;
+
+ rcu_assign_pointer(sk->sk_incoming_cpu_mask, new_mask);
+
+ if (old_mask)
+ kfree_rcu(old_mask, rcu);
+
+ return 0;
+}
+
+static int do_get_incoming_cpu_mask(struct sock *sk, char __user *optval,
+ unsigned int __user *optlen,
+ unsigned int len)
+{
+ struct rcu_cpumask *mask;
+ unsigned long *k;
+ int err = 0;
+
+ if (len < cpumask_size())
+ return -EINVAL;
+
+ if (len & (sizeof(unsigned long) - 1))
+ return -EINVAL;
+
+ rcu_read_lock();
+
+ mask = rcu_dereference(sk->sk_incoming_cpu_mask);
+
+ k = cpumask_bits(to_cpumask(mask->cpumask));
+ if (copy_to_user(optval, k, cpumask_size()))
+ err = -EFAULT;
+ else
+ put_user(cpumask_size(), optlen);
+
+ rcu_read_unlock();
+
+ return err;
+}
+
/*
* This is meant for all protocols to use and covers goings on
* at the socket level. Everything here is generic.
@@ -990,6 +1055,10 @@ set_rcvbuf:
sk->sk_max_pacing_rate);
break;
+ case SO_INCOMING_CPU_MASK:
+ ret = do_set_incoming_cpu_mask(sk, optval, optlen);
+ break;
+
default:
ret = -ENOPROTOOPT;
break;
@@ -1250,6 +1319,9 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
v.val = sk->sk_incoming_cpu;
break;
+ case SO_INCOMING_CPU_MASK:
+ return do_get_incoming_cpu_mask(sk, optval, optlen, len);
+
default:
/* We implement the SO_SNDLOWAT etc to not be settable
* (1003.1g 7).
@@ -1429,6 +1501,7 @@ EXPORT_SYMBOL(sk_alloc);
static void __sk_free(struct sock *sk)
{
struct sk_filter *filter;
+ struct rcu_cpumask *incoming_cpu_mask;
if (sk->sk_destruct)
sk->sk_destruct(sk);
@@ -1440,6 +1513,12 @@ static void __sk_free(struct sock *sk)
RCU_INIT_POINTER(sk->sk_filter, NULL);
}
+ incoming_cpu_mask = rcu_dereference(sk->sk_incoming_cpu_mask);
+ if (incoming_cpu_mask) {
+ kfree_rcu(incoming_cpu_mask, rcu);
+ RCU_INIT_POINTER(sk->sk_incoming_cpu_mask, NULL);
+ }
+
sock_disable_timestamp(sk, SK_FLAGS_TIMESTAMP);
if (atomic_read(&sk->sk_omem_alloc))
@@ -1543,6 +1622,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
newsk->sk_err = 0;
newsk->sk_priority = 0;
newsk->sk_incoming_cpu = raw_smp_processor_id();
+ RCU_INIT_POINTER(newsk->sk_incoming_cpu_mask, NULL);
atomic64_set(&newsk->sk_cookie, 0);
/*
* Before updating sk_refcnt, we must commit prior changes to memory
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 3766bdd..2e9a95f 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -184,6 +184,9 @@ static inline int compute_score(struct sock *sk, struct net *net,
return -1;
score += 4;
}
+
+ if (sk_match_incoming_cpu_mask(sk))
+ score += 4;
}
return score;
}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d10b7e0..dc6a3da 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -375,6 +375,9 @@ static inline int compute_score(struct sock *sk, struct net *net,
score += 4;
}
+ if (sk_match_incoming_cpu_mask(sk))
+ score += 4;
+
return score;
}
@@ -418,6 +421,9 @@ static inline int compute_score2(struct sock *sk, struct net *net,
score += 4;
}
+ if (sk_match_incoming_cpu_mask(sk))
+ score += 4;
+
return score;
}
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 871641b..8cc4ba9 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -114,6 +114,9 @@ static inline int compute_score(struct sock *sk, struct net *net,
return -1;
score++;
}
+
+ if (sk_match_incoming_cpu_mask(sk))
+ score += 4;
}
return score;
}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index c2ec416..a0c9a80 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -182,6 +182,9 @@ static inline int compute_score(struct sock *sk, struct net *net,
score++;
}
+ if (sk_match_incoming_cpu_mask(sk))
+ score++;
+
return score;
}
--
1.8.1
* Re: [PATCH v2 net-next 0/3] net: Add incoming CPU mask to sockets
From: Eric Dumazet @ 2015-05-26 17:18 UTC (permalink / raw)
To: Tom Herbert; +Cc: davem, netdev
On Tue, 2015-05-26 at 09:34 -0700, Tom Herbert wrote:
> Added matching of CPU to a socket CPU mask. This is useful for TCP
> listeners and unconnected UDP. This works with SO_REUSEPORT to steer
> packets to listener sockets based on CPU affinity. These patches
> allow steering packets to listeners based on numa locality. This is
> only useful for passive connections.
>
> v2:
> - Add cache alignment for fields used in socket lookup in sock_common
> - Added UDP test results
What about the feedback I gave earlier Tom ???
This cannot work for TCP in its current state.
* Re: [PATCH v2 net-next 1/3] net: Add cache alignment in sock_common for socket lookup fields
From: Eric Dumazet @ 2015-05-26 17:19 UTC (permalink / raw)
To: Tom Herbert; +Cc: davem, netdev
On Tue, 2015-05-26 at 09:34 -0700, Tom Herbert wrote:
> Use ____cacheline_aligned_in_smp to keep the fields used on socket
> lookup in their own cachelines. These are read only fields and will
> be often accessed on accross CPUs (would be very common with
> SO_REUSEPORT for instance).
>
> Signed-off-by: Tom Herbert <tom@herbertland.com>
> ---
> include/net/sock.h | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 26c1c31..bcf6114 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -211,7 +211,13 @@ struct sock_common {
> struct hlist_node skc_node;
> struct hlist_nulls_node skc_nulls_node;
> };
> - int skc_tx_queue_mapping;
> +
> + /* Cachelines above this point are read mostly and are used in socket
> + * lookup.
> + */
> + int skc_tx_queue_mapping
> + ____cacheline_aligned_in_smp;
> +
> atomic_t skc_refcnt;
> /* private: */
> int skc_dontcopy_end[0];
No, we do not want to increase the size of sock_common with such a
hammer.
I am pretty sure you never tested this patch, since you placed the
attribute at the wrong place anyway.
* Re: [PATCH v2 net-next 1/3] net: Add cache alignment in sock_common for socket lookup fields
From: Eric Dumazet @ 2015-05-26 17:54 UTC (permalink / raw)
To: Tom Herbert; +Cc: davem, netdev
On Tue, 2015-05-26 at 10:19 -0700, Eric Dumazet wrote:
> No, we do not want to increase the size of sock_common with such a
> hammer.
Current sizeof(struct sock_common) is 0x78 bytes.
So moving 2 read_mostly pointers into this structure would be enough to
make the first 2 cache lines read mostly.
One candidate would be sk_rx_dst, as it is used in early demux.
* Re: [PATCH v2 net-next 0/3] net: Add incoming CPU mask to sockets
From: Tom Herbert @ 2015-05-26 18:00 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David S. Miller, Linux Kernel Network Developers
On Tue, May 26, 2015 at 10:18 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2015-05-26 at 09:34 -0700, Tom Herbert wrote:
>> Added matching of CPU to a socket CPU mask. This is useful for TCP
>> listeners and unconnected UDP. This works with SO_REUSEPORT to steer
>> packets to listener sockets based on CPU affinity. These patches
>> allow steering packets to listeners based on numa locality. This is
>> only useful for passive connections.
>>
>> v2:
>> - Add cache alignment for fields used in socket lookup in sock_common
>> - Added UDP test results
>
> What about the feedback I gave earlier Tom ???
>
> This cannot work for TCP in its current state.
>
It does work and it fixes cache server locality issues we are seeing.
Right now half of our connections are persistently crossing numa nodes
on receive -- this is having a big negative impact. Yes, there may be
edge conditions where the SYN goes to a different CPU than the rest of the
flow (probably need RFS or flow director for that problem), and that
sounds like something nice to fix, but this patch is not dependent on
that. Besides, did you foresee an API change would be required?
>
* Re: [PATCH v2 net-next 1/3] net: Add cache alignment in sock_common for socket lookup fields
From: Tom Herbert @ 2015-05-26 18:02 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David S. Miller, Linux Kernel Network Developers
On Tue, May 26, 2015 at 10:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2015-05-26 at 10:19 -0700, Eric Dumazet wrote:
>
>> No, we do not want to increase the size of sock_common with such a
>> hammer.
>
> Current sizeof(struct sock_common) is 0x78 bytes.
>
> So moving 2 read_mostly pointers into this structure would be enough to
> make 2 first cache lines read mostly.
>
Right, the problem is keeping refcnt out of those cachelines.
> One candidate would be sk_rx_dst, as it is used in early demux.
>
>
* Re: [PATCH v2 net-next 0/3] net: Add incoming CPU mask to sockets
From: Eric Dumazet @ 2015-05-26 18:19 UTC (permalink / raw)
To: Tom Herbert; +Cc: David S. Miller, Linux Kernel Network Developers
On Tue, 2015-05-26 at 11:00 -0700, Tom Herbert wrote:
> On Tue, May 26, 2015 at 10:18 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Tue, 2015-05-26 at 09:34 -0700, Tom Herbert wrote:
> >> Added matching of CPU to a socket CPU mask. This is useful for TCP
> >> listeners and unconnected UDP. This works with SO_REUSEPORT to steer
> >> packets to listener sockets based on CPU affinity. These patches
> >> allow steering packets to listeners based on numa locality. This is
> >> only useful for passive connections.
> >>
> >> v2:
> >> - Add cache alignment for fields used in socket lookup in sock_common
> >> - Added UDP test results
> >
> > What about the feedback I gave earlier Tom ???
> >
> > This cannot work for TCP in its current state.
> >
> It does work and it fixes cache server locality issues we are seeing.
> Right now half of our connections are persistently crossing numa nodes
> on receive-- this is having big negative impact. Yes, there may be
> edge conditions where SYN goes to a different CPU than the rest of the
> flow (probably need RFS or flow director for that problem), and that
> sounds like something nice to fix, but this patch is not dependent on
> that. Besides, did you foresee an API change would be required?
With the current stack, there is no guarantee that SYN and ACK packets
are handled by the same cpu.
These are not edge conditions, but real ones, even with RFS.
Not everyone tweaks /proc/irq/*/smp_affinity
The default still allows cpus to be almost random (affinity=fffffff)
It was partly for these reasons that SO_REUSEPORT (for TCP) could not
use a cpu number, but a flow hash, to select the target socket.
* Re: [PATCH v2 net-next 0/3] net: Add incoming CPU mask to sockets
From: Tom Herbert @ 2015-05-26 20:01 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David S. Miller, Linux Kernel Network Developers
On Tue, May 26, 2015 at 11:19 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2015-05-26 at 11:00 -0700, Tom Herbert wrote:
>> On Tue, May 26, 2015 at 10:18 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Tue, 2015-05-26 at 09:34 -0700, Tom Herbert wrote:
>> >> Added matching of CPU to a socket CPU mask. This is useful for TCP
>> >> listeners and unconnected UDP. This works with SO_REUSEPORT to steer
>> >> packets to listener sockets based on CPU affinity. These patches
>> >> allow steering packets to listeners based on numa locality. This is
>> >> only useful for passive connections.
>> >>
>> >> v2:
>> >> - Add cache alignment for fields used in socket lookup in sock_common
>> >> - Added UDP test results
>> >
>> > What about the feedback I gave earlier Tom ???
>> >
>> > This cannot work for TCP in its current state.
>> >
>> It does work and it fixes cache server locality issues we are seeing.
>> Right now half of our connections are persistently crossing numa nodes
>> on receive-- this is having big negative impact. Yes, there may be
>> edge conditions where SYN goes to a different CPU than the rest of the
>> flow (probably need RFS or flow director for that problem), and that
>> sounds like something nice to fix, but this patch is not dependent on
>> that. Besides, did you foresee an API change would be required?
>
> With current stack, there is no guarantee SYN and ACK packets are
> handled by same cpu.
>
> These are no edge conditions, but real ones, even with RFS.
>
> Not everyone tweaks /proc/irq/*/smp_affinity
>
> Default is still allowing cpus being almost random (affinity=fffffff)
>
In that case there's no guarantee that any two packets in a flow will
hit the same CPU so there's no way to establish affinity to the
interrupt anyway. RFS would work okay to get affinity of the soft
processing, but there would be no point in trying to do any affinity
with the incoming cpu, so this feature wouldn't help.
The general problem is that the flow hash and/or RX CPU for a flow are
not guaranteed to be persistent for a connection. UDP doesn't have a
problem with this since every RX UDP packet can be independently
steered to a good socket in SO_REUSEPORT. For TCP we only get to make
this decision once for the whole lifetime of the flow, which means
that eventually it may turn out to have been made "wrong". These patches
don't try to fix that problem; for that I believe we're going to need to do
something a little more radical :-)
> That was partly for these reasons that SO_REUSEPORT (for TCP) could not
> use cpu number, but a flow hash to select the target socket.
>
>
>
* Re: [PATCH v2 net-next 0/3] net: Add incoming CPU mask to sockets
From: Eric Dumazet @ 2015-05-26 22:42 UTC (permalink / raw)
To: Tom Herbert; +Cc: David S. Miller, Linux Kernel Network Developers
On Tue, 2015-05-26 at 13:01 -0700, Tom Herbert wrote:
> In that case there's no guarantee that any two packets in a flow will
> hit the same CPU so there's no way to establish affinity to the
> interrupt anyway. RFS would work okay to get affinity of the soft
> processing, but there would be no point in trying to do any affinity
> with incoming cpu so this feature wouldn't help.
This is why I think this patch can hurt users.
RPS/RFS/smp_affinity/SO_REUSEPORT/cpuhotplug are only hints that would
never break TCP session establishment, even with hash collisions and
sockets being added to or deleted from SO_REUSEPORT pools.
It works in _all_ situations. Even the crazy/stupid setups.
Your patch breaks this rule, without any clear documentation on how to
make sure everything is properly set up.
I am not sure my tcp listener stuff will be finished for 4.2, because I
had to spend a lot of time on more urgent stuff lately, like reviewing
patches ;)
I would prefer your patch be added for 4.3, not before.
Thread overview: 12+ messages
2015-05-26 16:34 [PATCH v2 net-next 0/3] net: Add incoming CPU mask to sockets Tom Herbert
2015-05-26 16:34 ` [PATCH v2 net-next 1/3] net: Add cache alignment in sock_common for socket lookup fields Tom Herbert
2015-05-26 17:19 ` Eric Dumazet
2015-05-26 17:54 ` Eric Dumazet
2015-05-26 18:02 ` Tom Herbert
2015-05-26 16:34 ` [PATCH v2 net-next 2/3] kernel: Make compat bitmap functions externally visible Tom Herbert
2015-05-26 16:34 ` [PATCH v2 net-next 3/3] net: Add incoming CPU mask to sockets Tom Herbert
2015-05-26 17:18 ` [PATCH v2 net-next 0/3] " Eric Dumazet
2015-05-26 18:00 ` Tom Herbert
2015-05-26 18:19 ` Eric Dumazet
2015-05-26 20:01 ` Tom Herbert
2015-05-26 22:42 ` Eric Dumazet