* [PATCH RFC net-next 1/4] bridge: modify bridge and port to have often accessed fields in one cache line
From: Nikolay Aleksandrov @ 2017-01-31 15:31 UTC (permalink / raw)
To: netdev; +Cc: roopa, stephen, davem, Nikolay Aleksandrov
Rearrange net_bridge so that the vlan fields are at the beginning, since
they're checked on every packet even when vlan filtering is disabled.
Also move topology_change into the same cache line as ageing_time and
forward_delay, as all three are checked in br_fdb_update on each packet
rx. For the port, move flags and the vlan group to the beginning so they
share a cache line with the port's state (both flags and state are
checked on each packet).
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
net/bridge/br_private.h | 43 ++++++++++++++++++++-----------------------
1 file changed, 20 insertions(+), 23 deletions(-)
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 0b82a227fc34..d3be859a6346 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -202,12 +202,16 @@ struct net_bridge_mdb_htable
u32 ver;
};
-struct net_bridge_port
-{
+struct net_bridge_port {
struct net_bridge *br;
struct net_device *dev;
struct list_head list;
+ unsigned long flags;
+#ifdef CONFIG_BRIDGE_VLAN_FILTERING
+ struct net_bridge_vlan_group __rcu *vlgrp;
+#endif
+
/* STP */
u8 priority;
u8 state;
@@ -228,8 +232,6 @@ struct net_bridge_port
struct kobject kobj;
struct rcu_head rcu;
- unsigned long flags;
-
#ifdef CONFIG_BRIDGE_IGMP_SNOOPING
struct bridge_mcast_own_query ip4_own_query;
#if IS_ENABLED(CONFIG_IPV6)
@@ -249,9 +251,6 @@ struct net_bridge_port
#ifdef CONFIG_NET_POLL_CONTROLLER
struct netpoll *np;
#endif
-#ifdef CONFIG_BRIDGE_VLAN_FILTERING
- struct net_bridge_vlan_group __rcu *vlgrp;
-#endif
#ifdef CONFIG_NET_SWITCHDEV
int offload_fwd_mark;
#endif
@@ -273,14 +272,21 @@ static inline struct net_bridge_port *br_port_get_rtnl(const struct net_device *
rtnl_dereference(dev->rx_handler_data) : NULL;
}
-struct net_bridge
-{
+struct net_bridge {
spinlock_t lock;
+ spinlock_t hash_lock;
struct list_head port_list;
struct net_device *dev;
-
struct pcpu_sw_netstats __percpu *stats;
- spinlock_t hash_lock;
+ /* These fields are accessed on each packet */
+#ifdef CONFIG_BRIDGE_VLAN_FILTERING
+ u8 vlan_enabled;
+ u8 vlan_stats_enabled;
+ __be16 vlan_proto;
+ u16 default_pvid;
+ struct net_bridge_vlan_group __rcu *vlgrp;
+#endif
+
struct hlist_head hash[BR_HASH_SIZE];
#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
union {
@@ -298,6 +304,9 @@ struct net_bridge
bridge_id designated_root;
bridge_id bridge_id;
u32 root_path_cost;
+ unsigned char topology_change;
+ unsigned char topology_change_detected;
+ u16 root_port;
unsigned long max_age;
unsigned long hello_time;
unsigned long forward_delay;
@@ -309,7 +318,6 @@ struct net_bridge
u8 group_addr[ETH_ALEN];
bool group_addr_set;
- u16 root_port;
enum {
BR_NO_STP, /* no spanning tree */
@@ -317,9 +325,6 @@ struct net_bridge
BR_USER_STP, /* new RSTP in userspace */
} stp_enabled;
- unsigned char topology_change;
- unsigned char topology_change_detected;
-
#ifdef CONFIG_BRIDGE_IGMP_SNOOPING
unsigned char multicast_router;
@@ -371,14 +376,6 @@ struct net_bridge
#ifdef CONFIG_NET_SWITCHDEV
int offload_fwd_mark;
#endif
-
-#ifdef CONFIG_BRIDGE_VLAN_FILTERING
- struct net_bridge_vlan_group __rcu *vlgrp;
- u8 vlan_enabled;
- u8 vlan_stats_enabled;
- __be16 vlan_proto;
- u16 default_pvid;
-#endif
};
struct br_input_skb_cb {
--
2.1.4
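One way to keep layout decisions like this from silently regressing is a
compile-time assertion; a minimal sketch (illustrative, not part of the
patch - it assumes the structure is allocated cache-line aligned, and
pahole on the built objects is the quicker way to just eyeball the
resulting layout):

        /* sketch: fail the build if a later field addition pushes the
         * per-packet port fields into different cache lines */
        static inline void br_port_layout_check(void)
        {
                BUILD_BUG_ON(offsetof(struct net_bridge_port, flags) / SMP_CACHE_BYTES !=
                             offsetof(struct net_bridge_port, state) / SMP_CACHE_BYTES);
        }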
* [PATCH RFC net-next 2/4] bridge: move to workqueue gc
From: Nikolay Aleksandrov @ 2017-01-31 15:31 UTC (permalink / raw)
To: netdev; +Cc: roopa, stephen, davem, Nikolay Aleksandrov
Move the fdb garbage collector to a workqueue whose runs are at least 10
milliseconds apart and which cleans one hash chain at a time, allowing
other tasks to run in the meantime. With thousands of fdbs the system is
much more responsive. Most importantly, this removes the need to check
in __br_fdb_get whether the matched entry has expired, a check that
causes false-sharing and is completely unnecessary if we clean entries
up in time; at worst we'll pass ~10ms of traffic to an expired entry
before it gets deleted.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
net/bridge/br_device.c | 1 +
net/bridge/br_fdb.c | 31 +++++++++++++++++++------------
net/bridge/br_if.c | 2 +-
net/bridge/br_ioctl.c | 2 +-
net/bridge/br_netlink.c | 2 +-
net/bridge/br_private.h | 4 ++--
net/bridge/br_stp.c | 2 +-
net/bridge/br_stp_if.c | 4 ++--
net/bridge/br_stp_timer.c | 2 --
net/bridge/br_sysfs_br.c | 2 +-
10 files changed, 29 insertions(+), 23 deletions(-)
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 6c46d1b4cdbb..89b414fd1901 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -413,4 +413,5 @@ void br_dev_setup(struct net_device *dev)
br_netfilter_rtable_init(br);
br_stp_timer_init(br);
br_multicast_init(br);
+ INIT_DELAYED_WORK(&br->gc_work, br_fdb_cleanup);
}
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index e4a4176171c9..5cbed5c0db88 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -154,7 +154,7 @@ static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f)
if (f->added_by_external_learn)
fdb_del_external_learn(f);
- hlist_del_rcu(&f->hlist);
+ hlist_del_init_rcu(&f->hlist);
fdb_notify(br, f, RTM_DELNEIGH);
call_rcu(&f->rcu, fdb_rcu_free);
}
@@ -290,34 +290,43 @@ void br_fdb_change_mac_address(struct net_bridge *br, const u8 *newaddr)
spin_unlock_bh(&br->hash_lock);
}
-void br_fdb_cleanup(unsigned long _data)
+void br_fdb_cleanup(struct work_struct *work)
{
- struct net_bridge *br = (struct net_bridge *)_data;
+ struct net_bridge *br = container_of(work, struct net_bridge,
+ gc_work.work);
unsigned long delay = hold_time(br);
- unsigned long next_timer = jiffies + br->ageing_time;
+ unsigned long work_delay = delay;
+ unsigned long now = jiffies;
int i;
- spin_lock(&br->hash_lock);
for (i = 0; i < BR_HASH_SIZE; i++) {
struct net_bridge_fdb_entry *f;
struct hlist_node *n;
+ if (!br->hash[i].first)
+ continue;
+
+ spin_lock_bh(&br->hash_lock);
hlist_for_each_entry_safe(f, n, &br->hash[i], hlist) {
unsigned long this_timer;
+
if (f->is_static)
continue;
if (f->added_by_external_learn)
continue;
this_timer = f->updated + delay;
- if (time_before_eq(this_timer, jiffies))
+ if (time_after(this_timer, now))
+ work_delay = min(work_delay, this_timer - now);
+ else
fdb_delete(br, f);
- else if (time_before(this_timer, next_timer))
- next_timer = this_timer;
}
+ spin_unlock_bh(&br->hash_lock);
+ cond_resched();
}
- spin_unlock(&br->hash_lock);
- mod_timer(&br->gc_timer, round_jiffies_up(next_timer));
+ /* Cleanup minimum 10 milliseconds apart */
+ work_delay = max_t(unsigned long, work_delay, msecs_to_jiffies(10));
+ mod_delayed_work(system_long_wq, &br->gc_work, work_delay);
}
/* Completely flush all dynamic entries in forwarding database.*/
@@ -382,8 +391,6 @@ struct net_bridge_fdb_entry *__br_fdb_get(struct net_bridge *br,
&br->hash[br_mac_hash(addr, vid)], hlist) {
if (ether_addr_equal(fdb->addr.addr, addr) &&
fdb->vlan_id == vid) {
- if (unlikely(has_expired(br, fdb)))
- break;
return fdb;
}
}
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index ed0dd3340084..8ac1770aa222 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -313,7 +313,7 @@ void br_dev_delete(struct net_device *dev, struct list_head *head)
br_vlan_flush(br);
br_multicast_dev_del(br);
- del_timer_sync(&br->gc_timer);
+ cancel_delayed_work_sync(&br->gc_work);
br_sysfs_delbr(br->dev);
unregister_netdevice_queue(br->dev, head);
diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
index da8157c57eb1..7970f8540cbb 100644
--- a/net/bridge/br_ioctl.c
+++ b/net/bridge/br_ioctl.c
@@ -149,7 +149,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
b.hello_timer_value = br_timer_value(&br->hello_timer);
b.tcn_timer_value = br_timer_value(&br->tcn_timer);
b.topology_change_timer_value = br_timer_value(&br->topology_change_timer);
- b.gc_timer_value = br_timer_value(&br->gc_timer);
+ b.gc_timer_value = br_timer_value(&br->gc_work.timer);
rcu_read_unlock();
if (copy_to_user((void __user *)args[1], &b, sizeof(b)))
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 1ca25498fe4d..d63ad8337dcd 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -1200,7 +1200,7 @@ static int br_fill_info(struct sk_buff *skb, const struct net_device *brdev)
if (nla_put_u64_64bit(skb, IFLA_BR_TOPOLOGY_CHANGE_TIMER, clockval,
IFLA_BR_PAD))
return -EMSGSIZE;
- clockval = br_timer_value(&br->gc_timer);
+ clockval = br_timer_value(&br->gc_work.timer);
if (nla_put_u64_64bit(skb, IFLA_BR_GC_TIMER, clockval, IFLA_BR_PAD))
return -EMSGSIZE;
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index d3be859a6346..9658c6484935 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -369,7 +369,7 @@ struct net_bridge {
struct timer_list hello_timer;
struct timer_list tcn_timer;
struct timer_list topology_change_timer;
- struct timer_list gc_timer;
+ struct delayed_work gc_work;
struct kobject *ifobj;
u32 auto_cnt;
@@ -492,7 +492,7 @@ void br_fdb_find_delete_local(struct net_bridge *br,
const unsigned char *addr, u16 vid);
void br_fdb_changeaddr(struct net_bridge_port *p, const unsigned char *newaddr);
void br_fdb_change_mac_address(struct net_bridge *br, const u8 *newaddr);
-void br_fdb_cleanup(unsigned long arg);
+void br_fdb_cleanup(struct work_struct *work);
void br_fdb_delete_by_port(struct net_bridge *br,
const struct net_bridge_port *p, u16 vid, int do_all);
struct net_bridge_fdb_entry *__br_fdb_get(struct net_bridge *br,
diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
index 71fd1a4e63cc..8f56c2d1f1a7 100644
--- a/net/bridge/br_stp.c
+++ b/net/bridge/br_stp.c
@@ -602,7 +602,7 @@ int br_set_ageing_time(struct net_bridge *br, clock_t ageing_time)
br->ageing_time = t;
spin_unlock_bh(&br->lock);
- mod_timer(&br->gc_timer, jiffies);
+ mod_delayed_work(system_long_wq, &br->gc_work, 0);
return 0;
}
diff --git a/net/bridge/br_stp_if.c b/net/bridge/br_stp_if.c
index 6c1e21411125..08341d2aa9c9 100644
--- a/net/bridge/br_stp_if.c
+++ b/net/bridge/br_stp_if.c
@@ -57,7 +57,7 @@ void br_stp_enable_bridge(struct net_bridge *br)
spin_lock_bh(&br->lock);
if (br->stp_enabled == BR_KERNEL_STP)
mod_timer(&br->hello_timer, jiffies + br->hello_time);
- mod_timer(&br->gc_timer, jiffies + HZ/10);
+ mod_delayed_work(system_long_wq, &br->gc_work, HZ / 10);
br_config_bpdu_generation(br);
@@ -88,7 +88,7 @@ void br_stp_disable_bridge(struct net_bridge *br)
del_timer_sync(&br->hello_timer);
del_timer_sync(&br->topology_change_timer);
del_timer_sync(&br->tcn_timer);
- del_timer_sync(&br->gc_timer);
+ cancel_delayed_work_sync(&br->gc_work);
}
/* called under bridge lock */
diff --git a/net/bridge/br_stp_timer.c b/net/bridge/br_stp_timer.c
index 7ddb38e0a06e..c98b3e5c140a 100644
--- a/net/bridge/br_stp_timer.c
+++ b/net/bridge/br_stp_timer.c
@@ -153,8 +153,6 @@ void br_stp_timer_init(struct net_bridge *br)
setup_timer(&br->topology_change_timer,
br_topology_change_timer_expired,
(unsigned long) br);
-
- setup_timer(&br->gc_timer, br_fdb_cleanup, (unsigned long) br);
}
void br_stp_port_timer_init(struct net_bridge_port *p)
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index a18148213b08..0f4034934d56 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -263,7 +263,7 @@ static ssize_t gc_timer_show(struct device *d, struct device_attribute *attr,
char *buf)
{
struct net_bridge *br = to_bridge(d);
- return sprintf(buf, "%ld\n", br_timer_value(&br->gc_timer));
+ return sprintf(buf, "%ld\n", br_timer_value(&br->gc_work.timer));
}
static DEVICE_ATTR_RO(gc_timer);
--
2.1.4
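For reference, the timer-to-delayed-work conversion pattern the patch
applies, reduced to a sketch (names follow the patch; the hash chain
walk is elided):

        /* struct timer_list gc_timer      ->  struct delayed_work gc_work
         * setup_timer(&t, fn, (ulong)br)  ->  INIT_DELAYED_WORK(&w, fn)
         * mod_timer(&t, jiffies + d)      ->  mod_delayed_work(q, &w, d)
         * del_timer_sync(&t)              ->  cancel_delayed_work_sync(&w)
         */
        static void gc_work_fn(struct work_struct *work)
        {
                /* delayed_work embeds a work_struct, hence .work here */
                struct net_bridge *br = container_of(work, struct net_bridge,
                                                     gc_work.work);
                unsigned long work_delay = msecs_to_jiffies(10);

                /* ... walk the hash chains under hash_lock, delete expired
                 * entries, compute the delay to the soonest remaining
                 * expiry (but never below 10ms) ... */

                /* the work re-arms itself instead of using mod_timer() */
                mod_delayed_work(system_long_wq, &br->gc_work, work_delay);
        }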
* [PATCH RFC net-next 3/4] bridge: move write-heavy fdb members in their own cache line
From: Nikolay Aleksandrov @ 2017-01-31 15:31 UTC (permalink / raw)
To: netdev; +Cc: roopa, stephen, davem, Nikolay Aleksandrov
The fdb's "used" and "updated" fields are written to on every packet
forward and packet receive respectively. Thus if we are receiving
packets from a particular fdb, the writes cause false-sharing with
everyone who has looked it up (even if it didn't match, since mac/vid
share the same cache line!). The "used" field is even worse since it is
written on every packet forwarded to that fdb, so the standard config
where X ports use a single gateway results in 100% fdb false-sharing.
Note that this patch does not prevent the last scenario, but it makes
things better for the other bridge participants which are not using that
fdb (and are only doing lookups over it). The point is that with this
move only the communicating parties get the false-sharing; a later patch
will show how to avoid that too.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
net/bridge/br_private.h | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 9658c6484935..1e1b9a07e2da 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -150,19 +150,21 @@ struct net_bridge_vlan_group {
u16 pvid;
};
-struct net_bridge_fdb_entry
-{
+struct net_bridge_fdb_entry {
struct hlist_node hlist;
struct net_bridge_port *dst;
- unsigned long updated;
- unsigned long used;
mac_addr addr;
__u16 vlan_id;
unsigned char is_local:1,
is_static:1,
added_by_user:1,
added_by_external_learn:1;
+
+ /* write-heavy members should not affect lookups */
+ unsigned long updated ____cacheline_aligned_in_smp;
+ unsigned long used;
+
struct rcu_head rcu;
};
--
2.1.4
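For readers unfamiliar with the annotation, the generic form of the
split (a sketch; the comments assume 64-byte cache lines):

        struct hot_cold_example {
                /* read-mostly lookup fields stay in the first cache line */
                u64 key;

                /* ____cacheline_aligned_in_smp pads up to the next cache
                 * line boundary, so stores to the write-heavy fields below
                 * no longer invalidate 'key' in other CPUs' caches */
                unsigned long updated ____cacheline_aligned_in_smp;
                unsigned long used;
        };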
* Re: [PATCH RFC net-next 3/4] bridge: move write-heavy fdb members in their own cache line
From: Stephen Hemminger @ 2017-01-31 16:37 UTC (permalink / raw)
To: Nikolay Aleksandrov; +Cc: netdev, roopa, davem
On Tue, 31 Jan 2017 16:31:57 +0100
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
> Fdb's used and updated fields are written to on every packet forward and
> packet receive respectively. Thus if we are receiving packets from a
> particular fdb, they'll cause false-sharing with everyone who has looked
> it up (even if it didn't match, since mac/vid share cache line!). The
> "used" field is even worse since it is updated on every packet forward
> to that fdb, thus the standard config where X ports use a single gateway
> results in 100% fdb false-sharing. Note that this patch does not prevent
> the last scenario, but it makes it better for other bridge participants
> which are not using that fdb (and are only doing lookups over it).
> The point is with this move we make sure that only communicating parties
> get the false-sharing, in a later patch we'll show how to avoid that too.
>
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> ---
What about making updated a per-cpu value?
* Re: [PATCH RFC net-next 3/4] bridge: move write-heavy fdb members in their own cache line
From: Nikolay Aleksandrov @ 2017-01-31 16:39 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, roopa, davem
On 31/01/17 17:37, Stephen Hemminger wrote:
> On Tue, 31 Jan 2017 16:31:57 +0100
> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>
>> Fdb's used and updated fields are written to on every packet forward and
>> packet receive respectively. Thus if we are receiving packets from a
>> particular fdb, they'll cause false-sharing with everyone who has looked
>> it up (even if it didn't match, since mac/vid share cache line!). The
>> "used" field is even worse since it is updated on every packet forward
>> to that fdb, thus the standard config where X ports use a single gateway
>> results in 100% fdb false-sharing. Note that this patch does not prevent
>> the last scenario, but it makes it better for other bridge participants
>> which are not using that fdb (and are only doing lookups over it).
>> The point is with this move we make sure that only communicating parties
>> get the false-sharing, in a later patch we'll show how to avoid that too.
>>
>> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>> ---
>
> What about making updated a per-cpu value?
>
Quote from patch 0000:
"I'm not a fan of adding new options to the bridge, so I'm open to
suggestions for this one. Things I've tried before that:
- dynamically allocate a pool of percpu memory for used/updated fields
(works but it's more complicated as we need dynamic resizing, too)
- dynamically allocate percpu memory for each fdb separately (I'm not a fan
of this one, but it's much simpler to implement)
Of these two obviously the first approach worked best, but the complexity
it brings is considerable, while with this patchset we can achieve the same
result with proper configuration. Any feedback on this one would be greatly
appreciated."
I'm not sure how acceptable it is to do per-cpu allocations for each
fdb, as a network storm or flood can create a great many of them. That's
the reason I went with this approach.
Would it be okay to allocate per-cpu memory for each fdb?
* [PATCH RFC net-next 4/4] bridge: add ability to turn off fdb used updates
From: Nikolay Aleksandrov @ 2017-01-31 15:31 UTC (permalink / raw)
To: netdev; +Cc: roopa, stephen, davem, Nikolay Aleksandrov
Being able to turn off fdb "used" updates makes it possible to avoid
false-sharing on each packet transmit/receive for that fdb. The best way
to completely avoid false-sharing is to bind ports to CPUs, so receive
will write to the "updated" field only on a single CPU and transmit to
that fdb will not touch the "used" field. The default is used_enabled =
1 to avoid breaking user-space; if the "used" field is not needed, it is
strongly suggested to set it to 0.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
include/uapi/linux/if_link.h | 1 +
net/bridge/br_device.c | 1 +
net/bridge/br_input.c | 3 ++-
net/bridge/br_netlink.c | 10 +++++++++-
net/bridge/br_private.h | 1 +
net/bridge/br_sysfs_br.c | 23 +++++++++++++++++++++++
6 files changed, 37 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index b9aa5641ebe5..8bcd234fed03 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -277,6 +277,7 @@ enum {
IFLA_BR_MCAST_STATS_ENABLED,
IFLA_BR_MCAST_IGMP_VERSION,
IFLA_BR_MCAST_MLD_VERSION,
+ IFLA_BR_USED_ENABLED,
__IFLA_BR_MAX,
};
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 89b414fd1901..b1fa1b031fea 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -408,6 +408,7 @@ void br_dev_setup(struct net_device *dev)
br->bridge_hello_time = br->hello_time = 2 * HZ;
br->bridge_forward_delay = br->forward_delay = 15 * HZ;
br->bridge_ageing_time = br->ageing_time = BR_DEFAULT_AGEING_TIME;
+ br->used_enabled = 1;
dev->max_mtu = ETH_MAX_MTU;
br_netfilter_rtable_init(br);
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 855b72fbe1da..8a320901aaa7 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -197,7 +197,8 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
if (dst->is_local)
return br_pass_frame_up(skb);
- dst->used = jiffies;
+ if (br->used_enabled)
+ dst->used = jiffies;
br_forward(dst->dst, skb, local_rcv, false);
} else {
if (!mcast_hit)
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index d63ad8337dcd..49272efaad00 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -851,6 +851,7 @@ static const struct nla_policy br_policy[IFLA_BR_MAX + 1] = {
[IFLA_BR_MCAST_STATS_ENABLED] = { .type = NLA_U8 },
[IFLA_BR_MCAST_IGMP_VERSION] = { .type = NLA_U8 },
[IFLA_BR_MCAST_MLD_VERSION] = { .type = NLA_U8 },
+ [IFLA_BR_USED_ENABLED] = { .type = NLA_U8 },
};
static int br_changelink(struct net_device *brdev, struct nlattr *tb[],
@@ -1102,6 +1103,11 @@ static int br_changelink(struct net_device *brdev, struct nlattr *tb[],
br->nf_call_arptables = val ? true : false;
}
#endif
+ if (data[IFLA_BR_USED_ENABLED]) {
+ u8 val = nla_get_u8(data[IFLA_BR_USED_ENABLED]);
+
+ br->used_enabled = !!val;
+ }
return 0;
}
@@ -1175,6 +1181,7 @@ static size_t br_get_size(const struct net_device *brdev)
nla_total_size(sizeof(u8)) + /* IFLA_BR_NF_CALL_IP6TABLES */
nla_total_size(sizeof(u8)) + /* IFLA_BR_NF_CALL_ARPTABLES */
#endif
+ nla_total_size(sizeof(u8)) + /* IFLA_BR_USED_ENABLED */
0;
}
@@ -1246,7 +1253,8 @@ static int br_fill_info(struct sk_buff *skb, const struct net_device *brdev)
nla_put_u32(skb, IFLA_BR_MCAST_STARTUP_QUERY_CNT,
br->multicast_startup_query_count) ||
nla_put_u8(skb, IFLA_BR_MCAST_IGMP_VERSION,
- br->multicast_igmp_version))
+ br->multicast_igmp_version) ||
+ nla_put_u8(skb, IFLA_BR_USED_ENABLED, br->used_enabled))
return -EMSGSIZE;
#if IS_ENABLED(CONFIG_IPV6)
if (nla_put_u8(skb, IFLA_BR_MCAST_MLD_VERSION,
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 1e1b9a07e2da..4b6eb2393d7e 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -281,6 +281,7 @@ struct net_bridge {
struct net_device *dev;
struct pcpu_sw_netstats __percpu *stats;
/* These fields are accessed on each packet */
+ u8 used_enabled;
#ifdef CONFIG_BRIDGE_VLAN_FILTERING
u8 vlan_enabled;
u8 vlan_stats_enabled;
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index 0f4034934d56..22adc29e8721 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -334,6 +334,28 @@ static ssize_t flush_store(struct device *d,
}
static DEVICE_ATTR_WO(flush);
+static ssize_t used_enabled_show(struct device *d,
+ struct device_attribute *attr, char *buf)
+{
+ struct net_bridge *br = to_bridge(d);
+
+ return sprintf(buf, "%u\n", br->used_enabled);
+}
+
+static int set_used_enabled(struct net_bridge *br, unsigned long val)
+{
+ br->used_enabled = !!val;
+ return 0;
+}
+
+static ssize_t used_enabled_store(struct device *d,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ return store_bridge_parm(d, buf, len, set_used_enabled);
+}
+static DEVICE_ATTR_RW(used_enabled);
+
#ifdef CONFIG_BRIDGE_IGMP_SNOOPING
static ssize_t multicast_router_show(struct device *d,
struct device_attribute *attr, char *buf)
@@ -829,6 +851,7 @@ static struct attribute *bridge_attrs[] = {
&dev_attr_gc_timer.attr,
&dev_attr_group_addr.attr,
&dev_attr_flush.attr,
+ &dev_attr_used_enabled.attr,
#ifdef CONFIG_BRIDGE_IGMP_SNOOPING
&dev_attr_multicast_router.attr,
&dev_attr_multicast_snooping.attr,
--
2.1.4
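Once applied, the knob is reachable through the new IFLA_BR_USED_ENABLED
netlink attribute or the used_enabled sysfs file; a minimal userspace
sketch of flipping it via sysfs (the bridge name br0 is an assumption):

        #include <stdio.h>

        int main(void)
        {
                /* path follows the bridge sysfs layout; "used_enabled"
                 * is the attribute added by this patch */
                FILE *f = fopen("/sys/class/net/br0/bridge/used_enabled", "w");

                if (!f)
                        return 1;
                fputs("0\n", f); /* 0 = stop updating the fdb "used" field */
                return fclose(f) ? 1 : 0;
        }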
* Re: [PATCH RFC net-next 4/4] bridge: add ability to turn off fdb used updates
From: David Miller @ 2017-02-03 2:47 UTC (permalink / raw)
To: nikolay; +Cc: netdev, roopa, stephen
From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Date: Tue, 31 Jan 2017 16:31:58 +0100
> @@ -197,7 +197,8 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
> if (dst->is_local)
> return br_pass_frame_up(skb);
>
> - dst->used = jiffies;
> + if (br->used_enabled)
> + dst->used = jiffies;
Have you tried:
if (dst->used != jiffies)
dst->used = jiffies;
If that isn't effective, you can tweak the test to decrease the
granularity of the value. Basically, if dst->used is within
1 HZ of jiffies, don't do the write.
I suspect this might help a lot, and not require a new bridging
option.
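A sketch of the reduced-granularity variant described above (the
one-second window is illustrative):

        /* skip the store while 'used' is within one second of now */
        if (!time_in_range(jiffies, dst->used, dst->used + HZ))
                dst->used = jiffies;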
* Re: [PATCH RFC net-next 4/4] bridge: add ability to turn off fdb used updates
From: Nikolay Aleksandrov @ 2017-02-03 8:30 UTC (permalink / raw)
To: David Miller; +Cc: netdev, roopa, stephen
On 03/02/17 03:47, David Miller wrote:
> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> Date: Tue, 31 Jan 2017 16:31:58 +0100
>
>> @@ -197,7 +197,8 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
>> if (dst->is_local)
>> return br_pass_frame_up(skb);
>>
>> - dst->used = jiffies;
>> + if (br->used_enabled)
>> + dst->used = jiffies;
>
> Have you tried:
>
> if (dst->used != jiffies)
> dst->used = jiffies;
>
> If that isn't effective, you can tweak the test to decrease the
> granularity of the value. Basically, if dst->used is within
> 1 HZ of jiffies, don't do the write.
>
> I suspect this might help a lot, and not require a new bridging
> option.
>
Yes, I actually have a patch titled "used granularity". :-) I've tested
with different values and it does help, but it either needs to be paired
with another similar test for the "updated" field (since they share a
write-heavy cache line) or they need to be in separate cache lines, to
keep the dst's source port from causing the load HitM for all who check
the value.
I'll run some more tests and probably go this way for now.
Thanks,
Nik
* Re: [PATCH RFC net-next 4/4] bridge: add ability to turn off fdb used updates
From: Stephen Hemminger @ 2017-02-03 18:28 UTC (permalink / raw)
To: Nikolay Aleksandrov; +Cc: David Miller, netdev, roopa
On Fri, 3 Feb 2017 09:30:37 +0100
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
> On 03/02/17 03:47, David Miller wrote:
> > From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> > Date: Tue, 31 Jan 2017 16:31:58 +0100
> >
> >> @@ -197,7 +197,8 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
> >> if (dst->is_local)
> >> return br_pass_frame_up(skb);
> >>
> >> - dst->used = jiffies;
> >> + if (br->used_enabled)
> >> + dst->used = jiffies;
> >
> > Have you tried:
> >
> > if (dst->used != jiffies)
> > dst->used = jiffies;
> >
> > If that isn't effective, you can tweak the test to decrease the
> > granularity of the value. Basically, if dst->used is within
> > 1 HZ of jiffies, don't do the write.
> >
> > I suspect this might help a lot, and not require a new bridging
> > option.
> >
>
> Yes, I actually have a patch titled "used granularity". :-) I've tested with different
> values and it does help but it either needs to be paired with another similar test for
> the "updated" field (since they share a write-heavy cache line) or they need to be
> in separate cache lines to avoid that dst's source port from causing the load HitM for
> all who check the value.
>
> I'll run some more tests and probably go this way for now.
>
> Thanks,
> Nik
>
Since "used" doesn't need HZ granularity (it is reported to user-space
in clock_t resolution), storing it at that resolution and doing a
compare-and-set would mean it changes at most at 100 HZ.
* Re: [PATCH RFC net-next 4/4] bridge: add ability to turn off fdb used updates
From: Nikolay Aleksandrov @ 2017-02-03 18:34 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: David Miller, netdev, roopa
On 03/02/17 19:28, Stephen Hemminger wrote:
> On Fri, 3 Feb 2017 09:30:37 +0100
> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>
>> On 03/02/17 03:47, David Miller wrote:
>>> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>>> Date: Tue, 31 Jan 2017 16:31:58 +0100
>>>
>>>> @@ -197,7 +197,8 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
>>>> if (dst->is_local)
>>>> return br_pass_frame_up(skb);
>>>>
>>>> - dst->used = jiffies;
>>>> + if (br->used_enabled)
>>>> + dst->used = jiffies;
>>>
>>> Have you tried:
>>>
>>> if (dst->used != jiffies)
>>> dst->used = jiffies;
>>>
>>> If that isn't effective, you can tweak the test to decrease the
>>> granularity of the value. Basically, if dst->used is within
>>> 1 HZ of jiffies, don't do the write.
>>>
>>> I suspect this might help a lot, and not require a new bridging
>>> option.
>>>
>>
>> Yes, I actually have a patch titled "used granularity". :-) I've tested with different
>> values and it does help but it either needs to be paired with another similar test for
>> the "updated" field (since they share a write-heavy cache line) or they need to be
>> in separate cache lines to avoid that dst's source port from causing the load HitM for
>> all who check the value.
>>
>> I'll run some more tests and probably go this way for now.
>>
>> Thanks,
>> Nik
>>
>
> Since used doesn't need HZ granularity, it reports values in clock_t resolution so
> storing (and doing cmp and set would mean that it would only be 100 HZ
>
Yes, exactly what I'm currently testing. Will post the new set soon.
Since HZ can differ between builds, a generic way to obtain the
granularity for both should be clock_t_to_jiffies(1), if I'm not missing
something.
* Re: [PATCH RFC net-next 4/4] bridge: add ability to turn off fdb used updates
From: Stephen Hemminger @ 2017-02-03 22:24 UTC (permalink / raw)
To: Nikolay Aleksandrov; +Cc: David Miller, netdev, roopa
On Fri, 3 Feb 2017 19:34:19 +0100
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
> On 03/02/17 19:28, Stephen Hemminger wrote:
> > On Fri, 3 Feb 2017 09:30:37 +0100
> > Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
> >
> >> On 03/02/17 03:47, David Miller wrote:
> >>> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> >>> Date: Tue, 31 Jan 2017 16:31:58 +0100
> >>>
> >>>> @@ -197,7 +197,8 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
> >>>> if (dst->is_local)
> >>>> return br_pass_frame_up(skb);
> >>>>
> >>>> - dst->used = jiffies;
> >>>> + if (br->used_enabled)
> >>>> + dst->used = jiffies;
> >>>
> >>> Have you tried:
> >>>
> >>> if (dst->used != jiffies)
> >>> dst->used = jiffies;
> >>>
> >>> If that isn't effective, you can tweak the test to decrease the
> >>> granularity of the value. Basically, if dst->used is within
> >>> 1 HZ of jiffies, don't do the write.
> >>>
> >>> I suspect this might help a lot, and not require a new bridging
> >>> option.
> >>>
> >>
> >> Yes, I actually have a patch titled "used granularity". :-) I've tested with different
> >> values and it does help but it either needs to be paired with another similar test for
> >> the "updated" field (since they share a write-heavy cache line) or they need to be
> >> in separate cache lines to avoid that dst's source port from causing the load HitM for
> >> all who check the value.
> >>
> >> I'll run some more tests and probably go this way for now.
> >>
> >> Thanks,
> >> Nik
> >>
> >
> > Since used doesn't need HZ granularity, it reports values in clock_t resolution so
> > storing (and doing cmp and set would mean that it would only be 100 HZ
> >
>
> Yes, exactly what I'm currently testing. Will post the new set soon.
> Since HZ can be different a generic way to obtain the granularity for
> both should be clock_t_to_jiffies(1) if I'm not missing something.
>
>
USER_HZ is fixed by the userspace ABI at 100 Hz. HZ is configurable when
the kernel is built.
* Re: [PATCH RFC net-next 4/4] bridge: add ability to turn off fdb used updates
From: Nikolay Aleksandrov @ 2017-02-03 22:27 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: David Miller, netdev, roopa
On 03/02/17 23:24, Stephen Hemminger wrote:
> On Fri, 3 Feb 2017 19:34:19 +0100
> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>
>> On 03/02/17 19:28, Stephen Hemminger wrote:
>>> On Fri, 3 Feb 2017 09:30:37 +0100
>>> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>>>
>>>> On 03/02/17 03:47, David Miller wrote:
>>>>> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>>>>> Date: Tue, 31 Jan 2017 16:31:58 +0100
>>>>>
>>>>>> @@ -197,7 +197,8 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
>>>>>> if (dst->is_local)
>>>>>> return br_pass_frame_up(skb);
>>>>>>
>>>>>> - dst->used = jiffies;
>>>>>> + if (br->used_enabled)
>>>>>> + dst->used = jiffies;
>>>>>
>>>>> Have you tried:
>>>>>
>>>>> if (dst->used != jiffies)
>>>>> dst->used = jiffies;
>>>>>
>>>>> If that isn't effective, you can tweak the test to decrease the
>>>>> granularity of the value. Basically, if dst->used is within
>>>>> 1 HZ of jiffies, don't do the write.
>>>>>
>>>>> I suspect this might help a lot, and not require a new bridging
>>>>> option.
>>>>>
>>>>
>>>> Yes, I actually have a patch titled "used granularity". :-) I've tested with different
>>>> values and it does help but it either needs to be paired with another similar test for
>>>> the "updated" field (since they share a write-heavy cache line) or they need to be
>>>> in separate cache lines to avoid that dst's source port from causing the load HitM for
>>>> all who check the value.
>>>>
>>>> I'll run some more tests and probably go this way for now.
>>>>
>>>> Thanks,
>>>> Nik
>>>>
>>>
>>> Since used doesn't need HZ granularity, it reports values in clock_t resolution so
>>> storing (and doing cmp and set would mean that it would only be 100 HZ
>>>
>>
>> Yes, exactly what I'm currently testing. Will post the new set soon.
>> Since HZ can be different a generic way to obtain the granularity for
>> both should be clock_t_to_jiffies(1) if I'm not missing something.
>>
>>
>
> USER_HZ is set by userspace ABI to 100 hz. HZ is configurable when kernel is built.
>
Yes, the point I was trying to make is that we want to derive the number
of jiffies we can skip by converting 1 clock_t tick to jiffies: the
user-space granularity is clock_t and HZ can change, so
clock_t_to_jiffies(1) should give us the number of updates we can skip
for "used" and "updated".
By "both" I meant the "used" and "updated" fields, not HZ and USER_HZ.
* Re: [PATCH RFC net-next 4/4] bridge: add ability to turn off fdb used updates
From: Nikolay Aleksandrov @ 2017-02-04 16:45 UTC (permalink / raw)
To: David Miller; +Cc: netdev, roopa, stephen
On 03/02/17 03:47, David Miller wrote:
> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> Date: Tue, 31 Jan 2017 16:31:58 +0100
>
>> @@ -197,7 +197,8 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
>> if (dst->is_local)
>> return br_pass_frame_up(skb);
>>
>> - dst->used = jiffies;
>> + if (br->used_enabled)
>> + dst->used = jiffies;
>
> Have you tried:
>
> if (dst->used != jiffies)
> dst->used = jiffies;
>
> If that isn't effective, you can tweak the test to decrease the
> granularity of the value. Basically, if dst->used is within
> 1 HZ of jiffies, don't do the write.
>
> I suspect this might help a lot, and not require a new bridging
> option.
>
After running a ton of tests with different granularity settings and
improvements similar to:

br->jiffy_update_mask = rounddown_pow_of_two(clock_t_to_jiffies(1)) - 1

fast path:
        if (!(jiffies & br->jiffy_update_mask) && dst->used != jiffies)
                dst->used = jiffies;

in order to bring the number of updates down even lower, it seems that
just going with the comparison is enough - I can't see any measurable
win from reducing it further (e.g. to HZ / USER_HZ granularity).
So I'm going with the simpler approach and just doing the comparison.
Thanks,
Nik
* Re: [PATCH RFC net-next 0/4] bridge: improve cache utilization
From: Stephen Hemminger @ 2017-01-31 16:39 UTC (permalink / raw)
To: Nikolay Aleksandrov; +Cc: netdev, roopa, davem
On Tue, 31 Jan 2017 16:31:54 +0100
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
> Hi all,
> This is the first set which begins to deal with the bad bridge cache
> access patterns. The first patch rearranges the bridge and port structs
> a little so the frequently (and closely) accessed members are in the same
> cache line. The second patch then moves the garbage collection to a
> workqueue trying to improve system responsiveness under load (many fdbs)
> and more importantly removes the need to check if the matched entry is
> expired in __br_fdb_get which was a source of false-sharing.
> The third patch is a preparation for the final one which adds a new option
> that allows to turn off "used" updates on transmit to a fdb which is the
> other source of false-sharing in the bridge. If properly configured, i.e.
> ports bound to CPUs (thus updating "updated" locally) and "used_enabled"
> set to 0 then the bridge's HitM goes from 100% to 0%, but even with
> "used_enabled" = 1 we still get a win because lookups which iterated over
> the hash chain caused false-sharing due to the first cache line being
> used for both mac/vid and used/updated fields.
> I'm not a fan of adding new options to the bridge, so I'm open to
> suggestions for this one. Things I've tried before that:
> - dynamically allocate a pool of percpu memory for used/updated fields
> (works but it's more complicated as we need dynamic resizing, too)
> - dynamically allocate percpu memory for each fdb separately (I'm not a fan
> of this one, but it's much simpler to implement)
>
> Of these two obviously the first approach worked best, but the complexity
> it brings is considerable, while with this patchset we can achieve the same
> result with proper configuration. Any feedback on this one would be greatly
> appreciated.
>
> Some results from tests I've run:
> (note that these were run in good conditions for the baseline, everything
> ran on a single NUMA node and there were only 3 fdbs)
>
> 1. baseline
> 100% Load HitM on the fdbs (between everyone who has done lookups and hit
> one of the 3 hash chains of the communicating
> src/dst fdbs)
> Overall 5.06% Load HitM for the bridge, first place in the list
>
> bridge udp rr between 3 cores/ports/fdbs test:
> 0.35% ksoftirqd/0 [k] br_fdb_update
> 0.30% ksoftirqd/0 [k] __br_fdb_get
> 0.17% ksoftirqd/0 [k] br_handle_frame_finish
>
> 2. used = 1
> 0% Local Load HitM
> bridge udp rr between 3 cores/ports/fdbs test:
> 0.24% ksoftirqd/0 [k] br_fdb_update
> 0.23% ksoftirqd/0 [k] __br_fdb_get
> 0.12% ksoftirqd/0 [k] br_handle_frame_finish
>
> 3. used = 0
> 0% Local Load HitM
> bridge udp rr between 3 cores/ports/fdbs test:
> 0.23% ksoftirqd/0 [k] __br_fdb_get
> 0.22% ksoftirqd/0 [k] br_fdb_update
> 0.10% ksoftirqd/0 [k] br_handle_frame_finish
>
>
> Thanks,
> Nik
>
> Nikolay Aleksandrov (4):
> bridge: modify bridge and port to have often accessed fields in one
> cache line
> bridge: move to workqueue gc
> bridge: move write-heavy fdb members in their own cache line
> bridge: add ability to turn off fdb used updates
>
> include/uapi/linux/if_link.h | 1 +
> net/bridge/br_device.c | 2 ++
> net/bridge/br_fdb.c | 31 ++++++++++++++---------
> net/bridge/br_if.c | 2 +-
> net/bridge/br_input.c | 3 ++-
> net/bridge/br_ioctl.c | 2 +-
> net/bridge/br_netlink.c | 12 +++++++--
> net/bridge/br_private.h | 58 ++++++++++++++++++++++----------------------
> net/bridge/br_stp.c | 2 +-
> net/bridge/br_stp_if.c | 4 +--
> net/bridge/br_stp_timer.c | 2 --
> net/bridge/br_sysfs_br.c | 25 ++++++++++++++++++-
> 12 files changed, 92 insertions(+), 52 deletions(-)
>
I agree with the first 3 patches, but not the last one.
Changing the API just for a performance hack is not necessary. Instead make
the algorithm smarter and use per-cpu values.
* Re: [PATCH RFC net-next 0/4] bridge: improve cache utilization
From: Nikolay Aleksandrov @ 2017-01-31 16:41 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, roopa, davem
>
> I agree with the first 3 patches, but not the last one.
> Changing the API just for a performance hack is not necessary. Instead make
> the algorithm smarter and use per-cpu values.
>
Thanks for the feedback. I would very much prefer either of the other
two approaches I tried (a per-cpu pool, and per-cpu fields for each
fdb); of the two, the second one - per-cpu fields for each fdb - is much
simpler. So would it be acceptable to do a per-cpu allocation for each
fdb?
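For concreteness, a minimal sketch of the per-cpu-per-fdb variant (the
field and helper names are made up for illustration; reporting paths
would fold the per-CPU values back together):

        struct net_bridge_fdb_entry {
                /* ... */
                unsigned long __percpu *updated; /* alloc_percpu(unsigned long) */
        };

        /* fast path: touch only this CPU's copy, no cross-CPU invalidation */
        this_cpu_write(*f->updated, jiffies);

        /* slow path (gc, netlink dumps): take the max across CPUs */
        static unsigned long fdb_updated(const struct net_bridge_fdb_entry *f)
        {
                unsigned long t = 0;
                int cpu;

                for_each_possible_cpu(cpu)
                        t = max(t, *per_cpu_ptr(f->updated, cpu));
                return t;
        }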
* Re: [PATCH RFC net-next 0/4] bridge: improve cache utilization
From: Nikolay Aleksandrov @ 2017-01-31 18:09 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, roopa, davem
On 31/01/17 17:41, Nikolay Aleksandrov wrote:
>>
>> I agree with the first 3 patches, but not the last one.
>> Changing the API just for a performance hack is not necessary. Instead make
>> the algorithm smarter and use per-cpu values.
>>
>
> Thanks for the feedback, I would very much prefer any of the other two approaches
> I tried (per-cpu pool and per-cpu for each fdb), from the two the second one -
> per-cpu for each fdb is much simpler, so would it be acceptable to do per-cpu allocation
> for each fdb ?
>
>
>
Okay, after some more testing of the version with per-cpu per-fdb
allocations: at 300 000 fdb entries I got 120 failed per-cpu allocs,
which seems okay. I'll wait a little longer and will repost the series
with per-cpu allocations and without the RFC tag.
Thanks,
Nik
* Re: [PATCH RFC net-next 0/4] bridge: improve cache utilization
From: Stephen Hemminger @ 2017-01-31 18:21 UTC (permalink / raw)
To: Nikolay Aleksandrov; +Cc: netdev, roopa, davem
On Tue, 31 Jan 2017 19:09:09 +0100
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
> On 31/01/17 17:41, Nikolay Aleksandrov wrote:
> >>
> >> I agree with the first 3 patches, but not the last one.
> >> Changing the API just for a performance hack is not necessary. Instead make
> >> the algorithm smarter and use per-cpu values.
> >>
> >
> > Thanks for the feedback, I would very much prefer any of the other two approaches
> > I tried (per-cpu pool and per-cpu for each fdb), from the two the second one -
> > per-cpu for each fdb is much simpler, so would it be acceptable to do per-cpu allocation
> > for each fdb ?
> >
> >
> >
>
> Okay, after some more testing the version with per-cpu per-fdb allocations, at 300 000 fdb entries
> I got 120 failed per-cpu allocs which seems okay. I'll wait a little more and will repost the series
> with per-cpu allocations and without the RFC tag.
>
> Thanks,
> Nik
>
You could also use a mark/sweep algorithm (rather than recording
updated). It turns out that clearing is fast (it can be done unlocked).
The timer workqueue can mark all fdb entries (during its scan), then the
forward function clears the bit if it is set. This would turn most
writes into reads.
To keep the API for last used, just change the resolution to the scan
interval.
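Roughly, that idea as a sketch (FDB_SEEN is a hypothetical flag bit and
the locking is elided):

        /* periodic scan: an entry still marked wasn't used since the
         * previous pass; otherwise re-arm the mark */
        if (test_bit(FDB_SEEN, &f->flags) && !f->is_static)
                fdb_delete(br, f);
        else
                set_bit(FDB_SEEN, &f->flags);

        /* forward fast path: read first, write only on the first packet
         * after a scan - most writes become reads */
        if (test_bit(FDB_SEEN, &dst->flags))
                clear_bit(FDB_SEEN, &dst->flags);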
* Re: [PATCH RFC net-next 0/4] bridge: improve cache utilization
From: Nikolay Aleksandrov @ 2017-01-31 18:45 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, roopa, davem
On 31/01/17 19:21, Stephen Hemminger wrote:
> On Tue, 31 Jan 2017 19:09:09 +0100
> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>
>> On 31/01/17 17:41, Nikolay Aleksandrov wrote:
>>>>
>>>> I agree with the first 3 patches, but not the last one.
>>>> Changing the API just for a performance hack is not necessary. Instead make
>>>> the algorithm smarter and use per-cpu values.
>>>>
>>>
>>> Thanks for the feedback, I would very much prefer any of the other two approaches
>>> I tried (per-cpu pool and per-cpu for each fdb), from the two the second one -
>>> per-cpu for each fdb is much simpler, so would it be acceptable to do per-cpu allocation
>>> for each fdb ?
>>>
>>>
>>>
>>
>> Okay, after some more testing the version with per-cpu per-fdb allocations, at 300 000 fdb entries
>> I got 120 failed per-cpu allocs which seems okay. I'll wait a little more and will repost the series
>> with per-cpu allocations and without the RFC tag.
>>
>> Thanks,
>> Nik
>>
>
> You could also use a mark/sweep algorithm (rather than recording updated).
> It turns out that clearing is fast (can be unlocked).
> The timer workqueue can mark all fdb entries (during scan), then in forward
> function clear the bit if it is set. This would turn writes into reads.
The wq doesn't have a strict next run time; it floats depending on the
soonest expiry. This can cause issues since we don't know when we last
reset the bit, and using the scan-interval resolution will result in big
offsets when purging entries.
>
> To keep the API for last used, just change the resolution to be scan interval.
>
With the default 300 second resolution? People will be angry. :-)
Also this has to happen for both "updated" and "used"; they're both
causing trouble. In fact "used" is much worse than "updated", because
it's written to by everyone who transmits to that fdb.
Actually, to start we can do something much simpler - just update "used"
at most once per 1/10 of ageing_time, for example. The default case
would give us an update every 30 seconds if the fdb is actually used, or
we can cap the interval at 10 seconds.
The "updated" field we move to its own cache line, and with proper
config (ports bound to CPUs) it will be fine.
What do you think ?
* Re: [PATCH RFC net-next 0/4] bridge: improve cache utilization
From: Nikolay Aleksandrov @ 2017-01-31 18:51 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, roopa, davem
On 31/01/17 19:45, Nikolay Aleksandrov wrote:
> On 31/01/17 19:21, Stephen Hemminger wrote:
>> On Tue, 31 Jan 2017 19:09:09 +0100
>> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>>
>>> On 31/01/17 17:41, Nikolay Aleksandrov wrote:
>>>>>
>>>>> I agree with the first 3 patches, but not the last one.
>>>>> Changing the API just for a performance hack is not necessary. Instead make
>>>>> the algorithm smarter and use per-cpu values.
>>>>>
>>>>
>>>> Thanks for the feedback, I would very much prefer any of the other two approaches
>>>> I tried (per-cpu pool and per-cpu for each fdb), from the two the second one -
>>>> per-cpu for each fdb is much simpler, so would it be acceptable to do per-cpu allocation
>>>> for each fdb ?
>>>>
>>>>
>>>>
>>>
>>> Okay, after some more testing the version with per-cpu per-fdb allocations, at 300 000 fdb entries
>>> I got 120 failed per-cpu allocs which seems okay. I'll wait a little more and will repost the series
>>> with per-cpu allocations and without the RFC tag.
>>>
>>> Thanks,
>>> Nik
>>>
>>
>> You could also use a mark/sweep algorithm (rather than recording updated).
>> It turns out that clearing is fast (can be unlocked).
>> The timer workqueue can mark all fdb entries (during scan), then in forward
>> function clear the bit if it is set. This would turn writes into reads.
>
> The wq doesn't have a strict next call, it is floating depending on the soonest
> expire, this can cause issues as we don't know when last we've reset the bit and
> using the scan interval resolution will result in big offsets when purging entries.
>
>>
>> To keep the API for last used, just change the resolution to be scan interval.
>>
>
> With default 300 second resolution ? People will be angry. :-)
> Also this has to happen for both "updated" and "used", they're both causing trouble.
> In fact "used" is much worse than "updated", because it's written to by all who transmit
> to that fdb.
>
> Actually to start we can do something much simpler - just always update "used" at most
> once per 1/10 of ageing_time for example. The default case would give us an update every
> 30 seconds if the fdb is actually used or we can cap it at 10 seconds.
> The "updated" we move to its own cache line and with proper config (bind ports to CPUs)
> it will be fine.
>
Actually this is a no-go; there are already users out there who depend
on the high resolution of the "used" field, so we cannot break them.
We're back to either an option or per-cpu.
> What do you think ?
>
>
^ permalink raw reply [flat|nested] 20+ messages in thread