netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] openvswitch: fix flow stats accounting when node 0 is not possible
@ 2016-09-09 20:41 Thadeu Lima de Souza Cascardo
  2016-09-09 20:41 ` [PATCH 2/2] openvswitch: use percpu flow stats Thadeu Lima de Souza Cascardo
  0 siblings, 1 reply; 3+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2016-09-09 20:41 UTC (permalink / raw)
  To: dev-yBygre7rU0TnMu66kgdUjQ
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David Miller, Eric Dumazet

On a system with only node 1 as possible, all statistics is going to be
accounted on node 0 as it will have a single writer.

However, when getting and clearing the statistics, node 0 is not going
to be considered, as it's not a possible node.

Tested that statistics are not zero on a system with only node 1
possible. Also compile-tested with CONFIG_NUMA off.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
---

I am providing this intermediate patch, that will be thrown out by the next one,
in case there is any need to backport this fix.

---
 net/openvswitch/flow.c       | 6 ++++--
 net/openvswitch/flow_table.c | 5 +++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 0ea128e..3609f37 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -142,7 +142,8 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
 	*tcp_flags = 0;
 	memset(ovs_stats, 0, sizeof(*ovs_stats));
 
-	for_each_node(node) {
+	/* We open code this to make sure node 0 is always considered */
+	for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map)) {
 		struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[node]);
 
 		if (stats) {
@@ -165,7 +166,8 @@ void ovs_flow_stats_clear(struct sw_flow *flow)
 {
 	int node;
 
-	for_each_node(node) {
+	/* We open code this to make sure node 0 is always considered */
+	for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map)) {
 		struct flow_stats *stats = ovsl_dereference(flow->stats[node]);
 
 		if (stats) {
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index d073fff..957a3c3 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -148,8 +148,9 @@ static void flow_free(struct sw_flow *flow)
 		kfree(flow->id.unmasked_key);
 	if (flow->sf_acts)
 		ovs_nla_free_flow_actions((struct sw_flow_actions __force *)flow->sf_acts);
-	for_each_node(node)
-		if (flow->stats[node])
+	/* We open code this to make sure node 0 is always considered */
+	for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map))
+		if (node != 0 && flow->stats[node])
 			kmem_cache_free(flow_stats_cache,
 					(struct flow_stats __force *)flow->stats[node]);
 	kmem_cache_free(flow_cache, flow);
-- 
2.7.4

_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH 2/2] openvswitch: use percpu flow stats
  2016-09-09 20:41 [PATCH 1/2] openvswitch: fix flow stats accounting when node 0 is not possible Thadeu Lima de Souza Cascardo
@ 2016-09-09 20:41 ` Thadeu Lima de Souza Cascardo
  2016-09-12 19:42   ` [ovs-dev] " pravin shelar
  0 siblings, 1 reply; 3+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2016-09-09 20:41 UTC (permalink / raw)
  To: dev; +Cc: netdev, Eric Dumazet, David Miller

Instead of using flow stats per NUMA node, use it per CPU. When using
megaflows, the stats lock can be a bottleneck in scalability.

On a E5-2690 12-core system, usual throughput went from ~4Mpps to
~15Mpps when forwarding between two 40GbE ports with a single flow
configured on the datapath.

This has been tested on a system with possible CPUs 0-7,16-23. After
module removal, there were no corruption on the slab cache.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
---
 net/openvswitch/flow.c       | 43 +++++++++++++++++++++++--------------------
 net/openvswitch/flow.h       |  4 ++--
 net/openvswitch/flow_table.c | 23 ++++++++++++-----------
 3 files changed, 37 insertions(+), 33 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 3609f37..2970a9f 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -29,6 +29,7 @@
 #include <linux/module.h>
 #include <linux/in.h>
 #include <linux/rcupdate.h>
+#include <linux/cpumask.h>
 #include <linux/if_arp.h>
 #include <linux/ip.h>
 #include <linux/ipv6.h>
@@ -72,32 +73,33 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
 {
 	struct flow_stats *stats;
 	int node = numa_node_id();
+	int cpu = get_cpu();
 	int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);
 
-	stats = rcu_dereference(flow->stats[node]);
+	stats = rcu_dereference(flow->stats[cpu]);
 
-	/* Check if already have node-specific stats. */
+	/* Check if already have CPU-specific stats. */
 	if (likely(stats)) {
 		spin_lock(&stats->lock);
 		/* Mark if we write on the pre-allocated stats. */
-		if (node == 0 && unlikely(flow->stats_last_writer != node))
-			flow->stats_last_writer = node;
+		if (cpu == 0 && unlikely(flow->stats_last_writer != cpu))
+			flow->stats_last_writer = cpu;
 	} else {
 		stats = rcu_dereference(flow->stats[0]); /* Pre-allocated. */
 		spin_lock(&stats->lock);
 
-		/* If the current NUMA-node is the only writer on the
+		/* If the current CPU is the only writer on the
 		 * pre-allocated stats keep using them.
 		 */
-		if (unlikely(flow->stats_last_writer != node)) {
+		if (unlikely(flow->stats_last_writer != cpu)) {
 			/* A previous locker may have already allocated the
-			 * stats, so we need to check again.  If node-specific
+			 * stats, so we need to check again.  If CPU-specific
 			 * stats were already allocated, we update the pre-
 			 * allocated stats as we have already locked them.
 			 */
-			if (likely(flow->stats_last_writer != NUMA_NO_NODE)
-			    && likely(!rcu_access_pointer(flow->stats[node]))) {
-				/* Try to allocate node-specific stats. */
+			if (likely(flow->stats_last_writer != -1) &&
+			    likely(!rcu_access_pointer(flow->stats[cpu]))) {
+				/* Try to allocate CPU-specific stats. */
 				struct flow_stats *new_stats;
 
 				new_stats =
@@ -114,12 +116,12 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
 					new_stats->tcp_flags = tcp_flags;
 					spin_lock_init(&new_stats->lock);
 
-					rcu_assign_pointer(flow->stats[node],
+					rcu_assign_pointer(flow->stats[cpu],
 							   new_stats);
 					goto unlock;
 				}
 			}
-			flow->stats_last_writer = node;
+			flow->stats_last_writer = cpu;
 		}
 	}
 
@@ -129,6 +131,7 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
 	stats->tcp_flags |= tcp_flags;
 unlock:
 	spin_unlock(&stats->lock);
+	put_cpu();
 }
 
 /* Must be called with rcu_read_lock or ovs_mutex. */
@@ -136,15 +139,15 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
 			struct ovs_flow_stats *ovs_stats,
 			unsigned long *used, __be16 *tcp_flags)
 {
-	int node;
+	int cpu;
 
 	*used = 0;
 	*tcp_flags = 0;
 	memset(ovs_stats, 0, sizeof(*ovs_stats));
 
-	/* We open code this to make sure node 0 is always considered */
-	for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map)) {
-		struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[node]);
+	/* We open code this to make sure cpu 0 is always considered */
+	for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, cpu_possible_mask)) {
+		struct flow_stats *stats = rcu_dereference_ovsl(flow->stats[cpu]);
 
 		if (stats) {
 			/* Local CPU may write on non-local stats, so we must
@@ -164,11 +167,11 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
 /* Called with ovs_mutex. */
 void ovs_flow_stats_clear(struct sw_flow *flow)
 {
-	int node;
+	int cpu;
 
-	/* We open code this to make sure node 0 is always considered */
-	for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map)) {
-		struct flow_stats *stats = ovsl_dereference(flow->stats[node]);
+	/* We open code this to make sure cpu 0 is always considered */
+	for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, cpu_possible_mask)) {
+		struct flow_stats *stats = ovsl_dereference(flow->stats[cpu]);
 
 		if (stats) {
 			spin_lock_bh(&stats->lock);
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 03378e7..7725ffc 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -172,14 +172,14 @@ struct sw_flow {
 		struct hlist_node node[2];
 		u32 hash;
 	} flow_table, ufid_table;
-	int stats_last_writer;		/* NUMA-node id of the last writer on
+	int stats_last_writer;		/* CPU id of the last writer on
 					 * 'stats[0]'.
 					 */
 	struct sw_flow_key key;
 	struct sw_flow_id id;
 	struct sw_flow_mask *mask;
 	struct sw_flow_actions __rcu *sf_acts;
-	struct flow_stats __rcu *stats[]; /* One for each NUMA node.  First one
+	struct flow_stats __rcu *stats[]; /* One for each CPU.  First one
 					   * is allocated at flow creation time,
 					   * the rest are allocated on demand
 					   * while holding the 'stats[0].lock'.
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index 957a3c3..60e5ae0 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -32,6 +32,7 @@
 #include <linux/module.h>
 #include <linux/in.h>
 #include <linux/rcupdate.h>
+#include <linux/cpumask.h>
 #include <linux/if_arp.h>
 #include <linux/ip.h>
 #include <linux/ipv6.h>
@@ -79,7 +80,7 @@ struct sw_flow *ovs_flow_alloc(void)
 {
 	struct sw_flow *flow;
 	struct flow_stats *stats;
-	int node;
+	int cpu;
 
 	flow = kmem_cache_alloc(flow_cache, GFP_KERNEL);
 	if (!flow)
@@ -89,7 +90,7 @@ struct sw_flow *ovs_flow_alloc(void)
 	flow->mask = NULL;
 	flow->id.unmasked_key = NULL;
 	flow->id.ufid_len = 0;
-	flow->stats_last_writer = NUMA_NO_NODE;
+	flow->stats_last_writer = -1;
 
 	/* Initialize the default stat node. */
 	stats = kmem_cache_alloc_node(flow_stats_cache,
@@ -102,9 +103,9 @@ struct sw_flow *ovs_flow_alloc(void)
 
 	RCU_INIT_POINTER(flow->stats[0], stats);
 
-	for_each_node(node)
-		if (node != 0)
-			RCU_INIT_POINTER(flow->stats[node], NULL);
+	for_each_possible_cpu(cpu)
+		if (cpu != 0)
+			RCU_INIT_POINTER(flow->stats[cpu], NULL);
 
 	return flow;
 err:
@@ -142,17 +143,17 @@ static struct flex_array *alloc_buckets(unsigned int n_buckets)
 
 static void flow_free(struct sw_flow *flow)
 {
-	int node;
+	int cpu;
 
 	if (ovs_identifier_is_key(&flow->id))
 		kfree(flow->id.unmasked_key);
 	if (flow->sf_acts)
 		ovs_nla_free_flow_actions((struct sw_flow_actions __force *)flow->sf_acts);
-	/* We open code this to make sure node 0 is always considered */
-	for (node = 0; node < MAX_NUMNODES; node = next_node(node, node_possible_map))
-		if (node != 0 && flow->stats[node])
+	/* We open code this to make sure cpu 0 is always considered */
+	for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next(cpu, cpu_possible_mask))
+		if (flow->stats[cpu])
 			kmem_cache_free(flow_stats_cache,
-					(struct flow_stats __force *)flow->stats[node]);
+					(struct flow_stats __force *)flow->stats[cpu]);
 	kmem_cache_free(flow_cache, flow);
 }
 
@@ -757,7 +758,7 @@ int ovs_flow_init(void)
 	BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
 
 	flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
-				       + (nr_node_ids
+				       + (nr_cpu_ids
 					  * sizeof(struct flow_stats *)),
 				       0, 0, NULL);
 	if (flow_cache == NULL)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [ovs-dev] [PATCH 2/2] openvswitch: use percpu flow stats
  2016-09-09 20:41 ` [PATCH 2/2] openvswitch: use percpu flow stats Thadeu Lima de Souza Cascardo
@ 2016-09-12 19:42   ` pravin shelar
  0 siblings, 0 replies; 3+ messages in thread
From: pravin shelar @ 2016-09-12 19:42 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: ovs dev, Linux Kernel Network Developers, David Miller,
	Eric Dumazet

On Fri, Sep 9, 2016 at 1:41 PM, Thadeu Lima de Souza Cascardo
<cascardo@redhat.com> wrote:
> Instead of using flow stats per NUMA node, use it per CPU. When using
> megaflows, the stats lock can be a bottleneck in scalability.
>
> On a E5-2690 12-core system, usual throughput went from ~4Mpps to
> ~15Mpps when forwarding between two 40GbE ports with a single flow
> configured on the datapath.
>
> This has been tested on a system with possible CPUs 0-7,16-23. After
> module removal, there were no corruption on the slab cache.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
> ---
>  net/openvswitch/flow.c       | 43 +++++++++++++++++++++++--------------------
>  net/openvswitch/flow.h       |  4 ++--
>  net/openvswitch/flow_table.c | 23 ++++++++++++-----------
>  3 files changed, 37 insertions(+), 33 deletions(-)
>
> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> index 3609f37..2970a9f 100644
> --- a/net/openvswitch/flow.c
> +++ b/net/openvswitch/flow.c
> @@ -29,6 +29,7 @@
>  #include <linux/module.h>
>  #include <linux/in.h>
>  #include <linux/rcupdate.h>
> +#include <linux/cpumask.h>
>  #include <linux/if_arp.h>
>  #include <linux/ip.h>
>  #include <linux/ipv6.h>
> @@ -72,32 +73,33 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 tcp_flags,
>  {
>         struct flow_stats *stats;
>         int node = numa_node_id();
> +       int cpu = get_cpu();
>         int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);
>
This function is always called from BH context. So calling
smp_processor_id() for cpu id is fine. There is no need to handle
pre-emption here.

> -       stats = rcu_dereference(flow->stats[node]);
> +       stats = rcu_dereference(flow->stats[cpu]);
>
...
> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
> index 957a3c3..60e5ae0 100644
> --- a/net/openvswitch/flow_table.c
> +++ b/net/openvswitch/flow_table.c
...

> @@ -102,9 +103,9 @@ struct sw_flow *ovs_flow_alloc(void)
>
>         RCU_INIT_POINTER(flow->stats[0], stats);
>
> -       for_each_node(node)
> -               if (node != 0)
> -                       RCU_INIT_POINTER(flow->stats[node], NULL);
> +       for_each_possible_cpu(cpu)
> +               if (cpu != 0)
> +                       RCU_INIT_POINTER(flow->stats[cpu], NULL);
>
I think at this point we should just use GFP_ZERO flag for allocating
struct flow.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2016-09-12 19:42 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-09-09 20:41 [PATCH 1/2] openvswitch: fix flow stats accounting when node 0 is not possible Thadeu Lima de Souza Cascardo
2016-09-09 20:41 ` [PATCH 2/2] openvswitch: use percpu flow stats Thadeu Lima de Souza Cascardo
2016-09-12 19:42   ` [ovs-dev] " pravin shelar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).