* [PATCH RFC V4 1/2] net: Introduce helper functions to get the per cpu data
2015-08-30 5:59 [PATCH RFC V4 0/2] Optimize the snmp stat aggregation for large cpus Raghavendra K T
@ 2015-08-30 5:59 ` Raghavendra K T
2015-08-30 5:59 ` [PATCH RFC V4 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once Raghavendra K T
2015-08-31 4:49 ` [PATCH RFC V4 0/2] Optimize the snmp stat aggregation for large cpus David Miller
2 siblings, 0 replies; 5+ messages in thread
From: Raghavendra K T @ 2015-08-30 5:59 UTC (permalink / raw)
To: davem, kuznet, jmorris, yoshfuji, kaber
Cc: jiri, edumazet, hannes, tom, azhou, ebiederm, ipm,
nicolas.dichtel, serge.hallyn, joe, netdev, linux-kernel,
raghavendra.kt, anton, nacc, srikar
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
include/net/ip.h | 10 ++++++++++
net/ipv4/af_inet.c | 41 +++++++++++++++++++++++++++--------------
2 files changed, 37 insertions(+), 14 deletions(-)
diff --git a/include/net/ip.h b/include/net/ip.h
index d5fe9f2..93bf12e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -202,10 +202,20 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 #define NET_ADD_STATS_BH(net, field, adnd) SNMP_ADD_STATS_BH((net)->mib.net_statistics, field, adnd)
 #define NET_ADD_STATS_USER(net, field, adnd) SNMP_ADD_STATS_USER((net)->mib.net_statistics, field, adnd)
 
+u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offct);
 unsigned long snmp_fold_field(void __percpu *mib, int offt);
 #if BITS_PER_LONG==32
+u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct,
+			 size_t syncp_offset);
 u64 snmp_fold_field64(void __percpu *mib, int offt, size_t sync_off);
 #else
+static inline u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct,
+				       size_t syncp_offset)
+{
+	return snmp_get_cpu_field(mib, cpu, offct);
+
+}
+
 static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_off)
 {
 	return snmp_fold_field(mib, offt);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9532ee8..302e36b 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1448,38 +1448,51 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family,
 }
 EXPORT_SYMBOL_GPL(inet_ctl_sock_create);
 
+u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offt)
+{
+	return *(((unsigned long *)per_cpu_ptr(mib, cpu)) + offt);
+}
+EXPORT_SYMBOL_GPL(snmp_get_cpu_field);
+
 unsigned long snmp_fold_field(void __percpu *mib, int offt)
 {
 	unsigned long res = 0;
 	int i;
 
 	for_each_possible_cpu(i)
-		res += *(((unsigned long *) per_cpu_ptr(mib, i)) + offt);
+		res += snmp_get_cpu_field(mib, i, offt);
 	return res;
 }
 EXPORT_SYMBOL_GPL(snmp_fold_field);
 
 #if BITS_PER_LONG==32
+u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct,
+			 size_t syncp_offset)
+{
+	void *bhptr;
+	struct u64_stats_sync *syncp;
+	u64 v;
+	unsigned int start;
+
+	bhptr = per_cpu_ptr(mib, cpu);
+	syncp = (struct u64_stats_sync *)(bhptr + syncp_offset);
+	do {
+		start = u64_stats_fetch_begin_irq(syncp);
+		v = *(((u64 *)bhptr) + offct);
+	} while (u64_stats_fetch_retry_irq(syncp, start));
+
+	return v;
+}
+EXPORT_SYMBOL_GPL(snmp_get_cpu_field64);
+
 u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_offset)
 {
 	u64 res = 0;
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
-		void *bhptr;
-		struct u64_stats_sync *syncp;
-		u64 v;
-		unsigned int start;
-
-		bhptr = per_cpu_ptr(mib, cpu);
-		syncp = (struct u64_stats_sync *)(bhptr + syncp_offset);
-		do {
-			start = u64_stats_fetch_begin_irq(syncp);
-			v = *(((u64 *) bhptr) + offt);
-		} while (u64_stats_fetch_retry_irq(syncp, start));
-
-		res += v;
+		res += snmp_get_cpu_field64(mib, cpu, offt, syncp_offset);
 	}
 	return res;
 }
--
1.7.11.7
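The retry loop in snmp_get_cpu_field64() above exists because a 32-bit kernel
cannot load a 64-bit counter in a single atomic access: the u64_stats_sync
seqcount lets the reader detect a concurrent writer and re-read. A rough
userspace analogue of that pattern, with invented names and deliberately
simplified memory ordering (this is a sketch of the idea, not the kernel's
u64_stats API):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for u64_stats_sync: a sequence counter guards a
 * 64-bit value that a 32-bit reader cannot load in one atomic access.
 * A production seqlock needs stronger memory-ordering guarantees. */
struct stat64 {
	atomic_uint seq;   /* odd while a writer is mid-update */
	uint64_t    val;
};

static void stat64_add(struct stat64 *s, uint64_t delta)
{
	atomic_fetch_add_explicit(&s->seq, 1, memory_order_acquire);
	s->val += delta;                     /* may tear on 32-bit hardware */
	atomic_fetch_add_explicit(&s->seq, 1, memory_order_release);
}

static uint64_t stat64_read(struct stat64 *s)
{
	unsigned int start;
	uint64_t v;

	do {
		/* wait out any in-flight writer, then snapshot */
		do {
			start = atomic_load_explicit(&s->seq, memory_order_acquire);
		} while (start & 1);
		v = s->val;
		/* retry if a writer ran while we were reading */
	} while (atomic_load_explicit(&s->seq, memory_order_acquire) != start);

	return v;
}

int main(void)
{
	struct stat64 s = { .seq = 0, .val = 0 };

	stat64_add(&s, 42);
	printf("%llu\n", (unsigned long long)stat64_read(&s));
	return 0;
}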
* [PATCH RFC V4 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
2015-08-30 5:59 [PATCH RFC V4 0/2] Optimize the snmp stat aggregation for large cpus Raghavendra K T
2015-08-30 5:59 ` [PATCH RFC V4 1/2] net: Introduce helper functions to get the per cpu data Raghavendra K T
@ 2015-08-30 5:59 ` Raghavendra K T
2015-08-30 15:52 ` Eric Dumazet
2015-08-31 4:49 ` [PATCH RFC V4 0/2] Optimize the snmp stat aggregation for large cpus David Miller
2 siblings, 1 reply; 5+ messages in thread
From: Raghavendra K T @ 2015-08-30 5:59 UTC (permalink / raw)
To: davem, kuznet, jmorris, yoshfuji, kaber
Cc: jiri, edumazet, hannes, tom, azhou, ebiederm, ipm,
nicolas.dichtel, serge.hallyn, joe, netdev, linux-kernel,
raghavendra.kt, anton, nacc, srikar
Docker container creation time increased linearly from around 1.6 sec to
7.5 sec (at 1000 containers), and perf data showed 50% overhead in
snmp_fold_field.

reason: __snmp6_fill_stats64 currently calls snmp_fold_field, which walks
the per-cpu data of a single item at a time (repeated for around 36 items).

idea: This patch aggregates the statistics by walking all the items of
each cpu sequentially, which reduces cache misses.

Docker creation got faster by more than 2x after the patch.
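For illustration, the loop-order change boils down to the following
userspace sketch, where each cpu's counters are modelled as one contiguous
block the way they sit in a per-cpu area (ITEMS, CPUS, and the helper names
are invented for the sketch, not kernel code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ITEMS 36   /* roughly IPSTATS_MIB_MAX */
#define CPUS  128

/* Before: for each item, visit every cpu's block. Each pass strides
 * across all per-cpu blocks, so their cachelines keep being refetched,
 * ITEMS times over. */
static void fold_per_item(uint64_t stats[ITEMS], uint64_t mib[CPUS][ITEMS])
{
	for (int i = 0; i < ITEMS; i++) {
		stats[i] = 0;
		for (int c = 0; c < CPUS; c++)
			stats[i] += mib[c][i];
	}
}

/* After: for each cpu, walk all of its items sequentially, so every
 * per-cpu block is streamed through exactly once. */
static void fold_per_cpu(uint64_t stats[ITEMS], uint64_t mib[CPUS][ITEMS])
{
	memset(stats, 0, ITEMS * sizeof(uint64_t));
	for (int c = 0; c < CPUS; c++)
		for (int i = 0; i < ITEMS; i++)
			stats[i] += mib[c][i];
}

int main(void)
{
	static uint64_t mib[CPUS][ITEMS];
	uint64_t a[ITEMS], b[ITEMS];

	for (int c = 0; c < CPUS; c++)
		for (int i = 0; i < ITEMS; i++)
			mib[c][i] = c + i;

	fold_per_item(a, mib);
	fold_per_cpu(b, mib);
	printf("%s\n", memcmp(a, b, sizeof(a)) ? "mismatch" : "same totals");
	return 0;
}

Both orders compute the same totals; only the memory access pattern differs.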
Result:
                        Before     After
Docker creation time    6.836s     3.25s
cache miss              2.7%       1.41%
perf before:
    50.73%  docker   [kernel.kallsyms]  [k] snmp_fold_field
     9.07%  swapper  [kernel.kallsyms]  [k] snooze_loop
     3.49%  docker   [kernel.kallsyms]  [k] veth_stats_one
     2.85%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock

perf after:
    10.57%  docker   docker             [.] scanblock
     8.37%  swapper  [kernel.kallsyms]  [k] snooze_loop
     6.91%  docker   [kernel.kallsyms]  [k] snmp_get_cpu_field
     6.67%  docker   [kernel.kallsyms]  [k] veth_stats_one
changes/ideas suggested:
Using a buffer on the stack (Eric), use of memset (David), using memcpy
in place of put_unaligned() (Joe).
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
net/ipv6/addrconf.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
Changes in V4:
- remove the 'item' variable and use IPSTATS_MIB_MAX to avoid a sparse
  warning (Eric); also remove the 'item' parameter (Joe)
- add the missing memset of the padding

Changes in V3:
- use memset to initialize the temp buffer in the leaf function (David)
- use memcpy to copy the buffer data to stats instead of put_unaligned() (Joe)
- move the buffer definition to the leaf function __snmp6_fill_stats64() (Eric)

Changes in V2:
- allocate the stat calculation buffer on the stack (Eric)
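Taken together, the V2-V4 notes describe one idiom: accumulate into an
aligned u64 buffer on the stack, then publish it with a single memcpy,
which is alignment-safe where a field-by-field store needed put_unaligned().
A small standalone sketch of that idiom (NFIELDS and fill_stats() are
invented names for illustration, not the kernel code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NFIELDS 4   /* stands in for IPSTATS_MIB_MAX */

/* Accumulate into an aligned local buffer, then copy it out in one go.
 * memcpy makes no alignment assumptions, so the destination (a netlink
 * attribute payload, in the patch) may sit at any byte offset. */
static void fill_stats(void *dst, size_t bytes)
{
	uint64_t buff[NFIELDS];
	size_t pad = bytes - sizeof(buff);

	memset(buff, 0, sizeof(buff));
	buff[0] = NFIELDS;
	/* ... per-cpu accumulation into buff[1..] would go here ... */

	memcpy(dst, buff, sizeof(buff));
	memset((char *)dst + sizeof(buff), 0, pad);   /* zero the tail padding */
}

int main(void)
{
	/* deliberately misaligned destination inside a byte buffer */
	char raw[1 + NFIELDS * sizeof(uint64_t) + 8];
	uint64_t first;

	fill_stats(raw + 1, sizeof(raw) - 1);
	memcpy(&first, raw + 1, sizeof(first));       /* alignment-safe read back */
	printf("first field: %llu\n", (unsigned long long)first);
	return 0;
}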
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 21c2c81..4abc10c 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4624,18 +4624,24 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib,
 }
 
 static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
-					int items, int bytes, size_t syncpoff)
+					int bytes, size_t syncpoff)
 {
-	int i;
-	int pad = bytes - sizeof(u64) * items;
+	int i, c;
+	u64 buff[IPSTATS_MIB_MAX];
+	int pad = bytes - sizeof(u64) * IPSTATS_MIB_MAX;
+
 	BUG_ON(pad < 0);
 
-	/* Use put_unaligned() because stats may not be aligned for u64. */
-	put_unaligned(items, &stats[0]);
-	for (i = 1; i < items; i++)
-		put_unaligned(snmp_fold_field64(mib, i, syncpoff), &stats[i]);
+	memset(buff, 0, sizeof(buff));
+	buff[0] = IPSTATS_MIB_MAX;
 
-	memset(&stats[items], 0, pad);
+	for_each_possible_cpu(c) {
+		for (i = 1; i < IPSTATS_MIB_MAX; i++)
+			buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff);
+	}
+
+	memcpy(stats, buff, IPSTATS_MIB_MAX * sizeof(u64));
+	memset(&stats[IPSTATS_MIB_MAX], 0, pad);
 }
 
 static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
@@ -4643,8 +4649,8 @@ static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
 {
 	switch (attrtype) {
 	case IFLA_INET6_STATS:
-		__snmp6_fill_stats64(stats, idev->stats.ipv6,
-				     IPSTATS_MIB_MAX, bytes, offsetof(struct ipstats_mib, syncp));
+		__snmp6_fill_stats64(stats, idev->stats.ipv6, bytes,
+				     offsetof(struct ipstats_mib, syncp));
 		break;
 	case IFLA_INET6_ICMP6STATS:
 		__snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes);
--
1.7.11.7
* Re: [PATCH RFC V4 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
2015-08-30 5:59 ` [PATCH RFC V4 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once Raghavendra K T
@ 2015-08-30 15:52 ` Eric Dumazet
0 siblings, 0 replies; 5+ messages in thread
From: Eric Dumazet @ 2015-08-30 15:52 UTC (permalink / raw)
To: Raghavendra K T
Cc: davem, kuznet, jmorris, yoshfuji, kaber, jiri, edumazet, hannes,
tom, azhou, ebiederm, ipm, nicolas.dichtel, serge.hallyn, joe,
netdev, linux-kernel, anton, nacc, srikar
On Sun, 2015-08-30 at 11:29 +0530, Raghavendra K T wrote:
> Docker container creation time increased linearly from around 1.6 sec to
> 7.5 sec (at 1000 containers), and perf data showed 50% overhead in
> snmp_fold_field.
>
> reason: __snmp6_fill_stats64 currently calls snmp_fold_field, which walks
> the per-cpu data of a single item at a time (repeated for around 36 items).
>
> idea: This patch aggregates the statistics by walking all the items of
> each cpu sequentially, which reduces cache misses.
>
> Docker creation got faster by more than 2x after the patch.
>
> Result:
>                         Before     After
> Docker creation time    6.836s     3.25s
> cache miss              2.7%       1.41%
>
> perf before:
>     50.73%  docker   [kernel.kallsyms]  [k] snmp_fold_field
>      9.07%  swapper  [kernel.kallsyms]  [k] snooze_loop
>      3.49%  docker   [kernel.kallsyms]  [k] veth_stats_one
>      2.85%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock
>
> perf after:
>     10.57%  docker   docker             [.] scanblock
>      8.37%  swapper  [kernel.kallsyms]  [k] snooze_loop
>      6.91%  docker   [kernel.kallsyms]  [k] snmp_get_cpu_field
>      6.67%  docker   [kernel.kallsyms]  [k] veth_stats_one
>
> changes/ideas suggested:
> Using a buffer on the stack (Eric), use of memset (David), using memcpy
> in place of put_unaligned() (Joe).
>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
Acked-by: Eric Dumazet <edumazet@google.com>
Thanks.
* Re: [PATCH RFC V4 0/2] Optimize the snmp stat aggregation for large cpus
2015-08-30 5:59 [PATCH RFC V4 0/2] Optimize the snmp stat aggregation for large cpus Raghavendra K T
2015-08-30 5:59 ` [PATCH RFC V4 1/2] net: Introduce helper functions to get the per cpu data Raghavendra K T
2015-08-30 5:59 ` [PATCH RFC V4 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once Raghavendra K T
@ 2015-08-31 4:49 ` David Miller
2 siblings, 0 replies; 5+ messages in thread
From: David Miller @ 2015-08-31 4:49 UTC (permalink / raw)
To: raghavendra.kt
Cc: kuznet, jmorris, yoshfuji, kaber, jiri, edumazet, hannes, tom,
azhou, ebiederm, ipm, nicolas.dichtel, serge.hallyn, joe, netdev,
linux-kernel, anton, nacc, srikar
From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Date: Sun, 30 Aug 2015 11:29:40 +0530
> While creating 1000 containers, perf showed a lot of time spent in
> snmp_fold_field on a large cpu system.
>
> The current patch series tries to improve this by reordering the
> statistics gathering.
...
Series applied, thanks.