Netdev List
 help / color / mirror / Atom feed
* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: David Ahern @ 2019-08-02 15:45 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Jakub Kicinski, netdev, davem, sthemmin, mlxsw
In-Reply-To: <20190802074838.GC2203@nanopsycho>

On 8/2/19 1:48 AM, Jiri Pirko wrote:
> Wed, Jul 31, 2019 at 09:58:10PM CEST, dsahern@gmail.com wrote:
>> On 7/31/19 1:46 PM, David Ahern wrote:
>>> On 7/31/19 1:45 PM, Jiri Pirko wrote:
>>>>> check. e.g., what happens if a resource controller has been configured
>>>>> for the devlink instance and it is moved to a namespace whose existing
>>>>> config exceeds those limits?
>>>>
>>>> It's moved with all the values. The whole instance is moved.
>>>>
>>>
>>> The values are moved, but the FIB in a namespace could already contain
>>> more routes than the devlink instance allows.
>>>
>>
>>From a quick test your recent refactoring to netdevsim broke the
>> resource controller. It was, and is intended to be, per network namespace.
> 
> unifying devlink instances with network namespace in netdevsim was
> really odd. Netdevsim is also a device, like any other. With other
> devices, you do not do this so I don't see why to do this with netdevsim.
> 
> Now you create netdevsim instance in sysfs, there is proper bus probe
> mechanism done, there is a devlink instance created for this device,
> there are netdevices and devlink ports created. Same as for the real
> hardware.
> 
> Honestly, creating a devlink instance per-network namespace
> automagically, no relation to netdevsim devices, that is simply wrong.
> There should be always 1:1 relationshin between a device and devlink
> instance.
> 

Jiri: prior to your recent change netdevsim had a fib resource
controller per network namespace. Please return that behavior or revert
the change.

^ permalink raw reply

* Re: [PATCH net v4] net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
From: Nikolay Aleksandrov @ 2019-08-02 15:41 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, roopa, davem, bridge, michael-dev
In-Reply-To: <20190802083519.71cb4fa2@hermes.lan>

On 02/08/2019 18:35, Stephen Hemminger wrote:
> On Fri,  2 Aug 2019 13:57:36 +0300
> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
> 
>> +int br_vlan_bridge_event(struct net_device *dev, unsigned long event, void *ptr)
>>  {
>>  	struct netdev_notifier_changeupper_info *info;
>> -	struct net_bridge *br;
>> +	struct net_bridge *br = netdev_priv(dev);
>> +	bool changed;
>> +	int ret = 0;
>>  
>>  	switch (event) {
>> +	case NETDEV_REGISTER:
>> +		ret = br_vlan_add(br, br->default_pvid,
>> +				  BRIDGE_VLAN_INFO_PVID |
>> +				  BRIDGE_VLAN_INFO_UNTAGGED |
>> +				  BRIDGE_VLAN_INFO_BRENTRY, &changed, NULL);
>> +		break;
> 
> Looks good.
> 
> As minor optimization br_vlan_add could ignore changed pointer if NULL.
> This would save places where you don't care.
> 
> 
> Something like:
> diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
> index 021cc9f66804..bacd3543b215 100644
> --- a/net/bridge/br_vlan.c
> +++ b/net/bridge/br_vlan.c
> @@ -626,10 +626,11 @@ static int br_vlan_add_existing(struct net_bridge *br,
>  		refcount_inc(&vlan->refcnt);
>  		vlan->flags |= BRIDGE_VLAN_INFO_BRENTRY;
>  		vg->num_vlans++;
> -		*changed = true;
> +		if (changed)
> +			*changed = true;
>  	}
>  
> -	if (__vlan_add_flags(vlan, flags))
> +	if (__vlan_add_flags(vlan, flags) && changed)
>  		*changed = true;
>  
>  	return 0;
> @@ -653,7 +654,8 @@ int br_vlan_add(struct net_bridge *br, u16 vid, u16 flags, bool *changed,
>  
>  	ASSERT_RTNL();
>  
> -	*changed = false;
> +	if (changed)
> +		*changed = false;
>  	vg = br_vlan_group(br);
>  	vlan = br_vlan_find(vg, vid);
>  	if (vlan)
> @@ -679,7 +681,7 @@ int br_vlan_add(struct net_bridge *br, u16 vid, u16 flags, bool *changed,
>  	if (ret) {
>  		free_percpu(vlan->stats);
>  		kfree(vlan);
> -	} else {
> +	} else if (changed) {
>  		*changed = true;
>  	}
>  
> 

sure, this is ok for net-next


^ permalink raw reply

* Re: [PATCH net v4] net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
From: Stephen Hemminger @ 2019-08-02 15:35 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: netdev, roopa, davem, bridge, michael-dev
In-Reply-To: <20190802105736.26767-1-nikolay@cumulusnetworks.com>

On Fri,  2 Aug 2019 13:57:36 +0300
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:

> +int br_vlan_bridge_event(struct net_device *dev, unsigned long event, void *ptr)
>  {
>  	struct netdev_notifier_changeupper_info *info;
> -	struct net_bridge *br;
> +	struct net_bridge *br = netdev_priv(dev);
> +	bool changed;
> +	int ret = 0;
>  
>  	switch (event) {
> +	case NETDEV_REGISTER:
> +		ret = br_vlan_add(br, br->default_pvid,
> +				  BRIDGE_VLAN_INFO_PVID |
> +				  BRIDGE_VLAN_INFO_UNTAGGED |
> +				  BRIDGE_VLAN_INFO_BRENTRY, &changed, NULL);
> +		break;

Looks good.

As minor optimization br_vlan_add could ignore changed pointer if NULL.
This would save places where you don't care.


Something like:
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 021cc9f66804..bacd3543b215 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -626,10 +626,11 @@ static int br_vlan_add_existing(struct net_bridge *br,
 		refcount_inc(&vlan->refcnt);
 		vlan->flags |= BRIDGE_VLAN_INFO_BRENTRY;
 		vg->num_vlans++;
-		*changed = true;
+		if (changed)
+			*changed = true;
 	}
 
-	if (__vlan_add_flags(vlan, flags))
+	if (__vlan_add_flags(vlan, flags) && changed)
 		*changed = true;
 
 	return 0;
@@ -653,7 +654,8 @@ int br_vlan_add(struct net_bridge *br, u16 vid, u16 flags, bool *changed,
 
 	ASSERT_RTNL();
 
-	*changed = false;
+	if (changed)
+		*changed = false;
 	vg = br_vlan_group(br);
 	vlan = br_vlan_find(vg, vid);
 	if (vlan)
@@ -679,7 +681,7 @@ int br_vlan_add(struct net_bridge *br, u16 vid, u16 flags, bool *changed,
 	if (ret) {
 		free_percpu(vlan->stats);
 		kfree(vlan);
-	} else {
+	} else if (changed) {
 		*changed = true;
 	}
 


^ permalink raw reply related

* Re: [PATCH 2/2] net: mvpp2: support multiple comphy lanes
From: Maxime Chevallier @ 2019-08-02 15:22 UTC (permalink / raw)
  To: Matt Pelland; +Cc: netdev, davem, antoine.tenart
In-Reply-To: <20190801204523.26454-3-mpelland@starry.com>

Hello Matt,

On Thu,  1 Aug 2019 16:45:23 -0400
Matt Pelland <mpelland@starry.com> wrote:

>mvpp 2.2 supports RXAUI which requires a pair of serdes lanes instead of
>the usual single lane required by other interface modes. This patch
>expands the number of lanes that can be associated to a port so that
>both lanes are correctly configured at the appropriate times when RXAUI
>is in use.

Thanks for the patch, it's nice to see RXAUI support moving forward.

I'll give your patches a test on my setup, although I never really got
something stable, it might simply be due to the particular hardware I
have access to.

I do have some comments below :

>Signed-off-by: Matt Pelland <mpelland@starry.com>
>---
> drivers/net/ethernet/marvell/mvpp2/mvpp2.h    |  7 +-
> .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 66 +++++++++++++------
> 2 files changed, 53 insertions(+), 20 deletions(-)
>
>diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
>index 256e7c796631..9ae2b3d9d0c7 100644
>--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
>+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
>@@ -655,6 +655,11 @@
> #define MVPP2_F_LOOPBACK		BIT(0)
> #define MVPP2_F_DT_COMPAT		BIT(1)
> 
>+/* MVPP22 supports RXAUI which requires two comphy lanes, all other modes
>+ * require one.
>+ */
>+#define MVPP22_MAX_COMPHYS		2

This driver is "supposed" to support XAUI too, at least it appears in
the list of modes we handle. XAUI uses 4 lanes, maybe we could bump the
max number of lanes to 4.

> /* Marvell tag types */
> enum mvpp2_tag_type {
> 	MVPP2_TAG_TYPE_NONE = 0,
>@@ -935,7 +940,7 @@ struct mvpp2_port {
> 	phy_interface_t phy_interface;
> 	struct phylink *phylink;
> 	struct phylink_config phylink_config;
>-	struct phy *comphy;
>+	struct phy *comphys[MVPP22_MAX_COMPHYS];
> 
> 	struct mvpp2_bm_pool *pool_long;
> 	struct mvpp2_bm_pool *pool_short;
>diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
>index 8b633af4a684..97dfe2e71b03 100644
>--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
>+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
>@@ -1186,17 +1186,40 @@ static void mvpp22_gop_setup_irq(struct mvpp2_port *port)
>  */
> static int mvpp22_comphy_init(struct mvpp2_port *port)
> {
>-	int ret;
>+	int i, ret;
> 
>-	if (!port->comphy)
>-		return 0;
>+	for (i = 0; i < ARRAY_SIZE(port->comphys); i++) {
>+		if (!port->comphys[i])
>+			return 0;
> 
>-	ret = phy_set_mode_ext(port->comphy, PHY_MODE_ETHERNET,
>-			       port->phy_interface);
>-	if (ret)
>-		return ret;
>+		ret = phy_set_mode_ext(port->comphys[i],
>+				       PHY_MODE_ETHERNET,
>+				       port->phy_interface);
>+		if (ret)
>+			return ret;
> 
>-	return phy_power_on(port->comphy);
>+		ret = phy_power_on(port->comphys[i]);
>+		if (ret)
>+			return ret;
>+	}
>+
>+	return 0;
>+}

It could be good to add some sanity check, to make sure we do have 2
comphy lanes available when configuring RXAUI mode (and 4 for XAUI).

>+static int mvpp22_comphy_deinit(struct mvpp2_port *port)
>+{
>+	int i, ret;
>+
>+	for (i = 0; i < ARRAY_SIZE(port->comphys); i++) {
>+		if (!port->comphys[i])
>+			return 0;
>+
>+		ret = phy_power_off(port->comphys[i]);
>+		if (ret)
>+			return ret;
>+	}
>+
>+	return 0;
> }
> 
> static void mvpp2_port_enable(struct mvpp2_port *port)
>@@ -3375,7 +3398,9 @@ static void mvpp2_stop_dev(struct mvpp2_port *port)
> 
> 	if (port->phylink)
> 		phylink_stop(port->phylink);
>-	phy_power_off(port->comphy);
>+
>+	if (port->priv->hw_version == MVPP22)
>+		mvpp22_comphy_deinit(port);
> }
> 
> static int mvpp2_check_ringparam_valid(struct net_device *dev,
>@@ -4947,7 +4972,7 @@ static void mvpp2_mac_config(struct phylink_config *config, unsigned int mode,
> 		port->phy_interface = state->interface;
> 
> 		/* Reconfigure the serdes lanes */
>-		phy_power_off(port->comphy);
>+		mvpp22_comphy_deinit(port);
> 		mvpp22_mode_reconfigure(port);
> 	}
> 
>@@ -5038,7 +5063,6 @@ static int mvpp2_port_probe(struct platform_device *pdev,
> 			    struct fwnode_handle *port_fwnode,
> 			    struct mvpp2 *priv)
> {
>-	struct phy *comphy = NULL;
> 	struct mvpp2_port *port;
> 	struct mvpp2_port_pcpu *port_pcpu;
> 	struct device_node *port_node = to_of_node(port_fwnode);
>@@ -5085,14 +5109,20 @@ static int mvpp2_port_probe(struct platform_device *pdev,
> 		goto err_free_netdev;
> 	}
> 
>+	port = netdev_priv(dev);
>+
> 	if (port_node) {
>-		comphy = devm_of_phy_get(&pdev->dev, port_node, NULL);
>-		if (IS_ERR(comphy)) {
>-			if (PTR_ERR(comphy) == -EPROBE_DEFER) {
>-				err = -EPROBE_DEFER;
>-				goto err_free_netdev;
>+		for (i = 0; i < ARRAY_SIZE(port->comphys); i++) {
>+			port->comphys[i] = devm_of_phy_get_by_index(&pdev->dev,
>+								    port_node,
>+								    i);
>+			if (IS_ERR(port->comphys[i])) {
>+				err = PTR_ERR(port->comphys[i]);
>+				port->comphys[i] = NULL;
>+				if (err == -EPROBE_DEFER)
>+					goto err_free_netdev;
>+				err = 0;
> 			}
>-			comphy = NULL;
> 		}
> 	}
> 
>@@ -5107,7 +5137,6 @@ static int mvpp2_port_probe(struct platform_device *pdev,
> 	dev->netdev_ops = &mvpp2_netdev_ops;
> 	dev->ethtool_ops = &mvpp2_eth_tool_ops;
> 
>-	port = netdev_priv(dev);
> 	port->dev = dev;
> 	port->fwnode = port_fwnode;
> 	port->has_phy = !!of_find_property(port_node, "phy", NULL);
>@@ -5144,7 +5173,6 @@ static int mvpp2_port_probe(struct platform_device *pdev,
> 
> 	port->of_node = port_node;
> 	port->phy_interface = phy_mode;
>-	port->comphy = comphy;
> 
> 	if (priv->hw_version == MVPP21) {
> 		port->base = devm_platform_ioremap_resource(pdev, 2 + id);



-- 
Maxime Chevallier, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

^ permalink raw reply

* Re: [PATCH net-next 00/15] net: Add functional tests for L3 and L4
From: Alexei Starovoitov @ 2019-08-02 15:15 UTC (permalink / raw)
  To: David Ahern; +Cc: David Ahern, David S. Miller, Network Development
In-Reply-To: <7b95042e-e586-2ca2-2a26-f5aea5a8184b@gmail.com>

On Thu, Aug 1, 2019 at 9:11 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 8/1/19 6:19 PM, Alexei Starovoitov wrote:
> > Do you really need 'sleep 1' everywhere?
> > It makes them so slow to run...
> > What happens if you just remove it ? Tests will fail? Why?
>
> yes, the sleep 1 is needed. A server process is getting launched into
> the background. It's status is not relevant; the client's is the key.
> The server needs time to initialize. And, yes I tried forking and it
> gets hung up with bash and capturing the output for verbose mode.

That means that the tests are not suitable for CI servers
where cpus are pegged to 100% and multiple tests run in parallel.
The test will be flaky :(

^ permalink raw reply

* [PATCH net] inet: frags: re-introduce skb coalescing for local delivery
From: Guillaume Nault @ 2019-08-02 15:15 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal, Eric Dumazet, Peter Oskolkov, Alexander Aring

Before commit d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6
defrag"), a netperf UDP_STREAM test[0] using big IPv6 datagrams (thus
generating many fragments) and running over an IPsec tunnel, reported
more than 6Gbps throughput. After that patch, the same test gets only
9Mbps when receiving on a be2net nic (driver can make a big difference
here, for example, ixgbe doesn't seem to be affected).

By reusing the IPv4 defragmentation code, IPv6 lost fragment coalescing
(IPv4 fragment coalescing was dropped by commit 14fe22e33462 ("Revert
"ipv4: use skb coalescing in defragmentation"")).

Without fragment coalescing, be2net runs out of Rx ring entries and
starts to drop frames (ethtool reports rx_drops_no_frags errors). Since
the netperf traffic is only composed of UDP fragments, any lost packet
prevents reassembly of the full datagram. Therefore, fragments which
have no possibility to ever get reassembled pile up in the reassembly
queue, until the memory accounting exeeds the threshold. At that point
no fragment is accepted anymore, which effectively discards all
netperf traffic.

When reassembly timeout expires, some stale fragments are removed from
the reassembly queue, so a few packets can be received, reassembled
and delivered to the netperf receiver. But the nic still drops frames
and soon the reassembly queue gets filled again with stale fragments.
These long time frames where no datagram can be received explain why
the performance drop is so significant.

Re-introducing fragment coalescing is enough to get the initial
performances again (6.6Gbps with be2net): driver doesn't drop frames
anymore (no more rx_drops_no_frags errors) and the reassembly engine
works at full speed.

This patch is quite conservative and only coalesces skbs for local
IPv4 and IPv6 delivery (in order to avoid changing skb geometry when
forwarding). Coalescing could be extended in the future if need be, as
more scenarios would probably benefit from it.

[0]: Test configuration
Sender:
ip xfrm policy flush
ip xfrm state flush
ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir in tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir out tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
netserver -D -L fc00:2::1

Receiver:
ip xfrm policy flush
ip xfrm state flush
ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir in tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir out tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
netperf -H fc00:2::1 -f k -P 0 -L fc00:1::1 -l 60 -t UDP_STREAM -I 99,5 -i 5,5 -T5,5 -6

Signed-off-by: Guillaume Nault <gnault@redhat.com>
---
 include/net/inet_frag.h                 |  2 +-
 net/ieee802154/6lowpan/reassembly.c     |  2 +-
 net/ipv4/inet_fragment.c                | 39 ++++++++++++++++++-------
 net/ipv4/ip_fragment.c                  |  8 ++++-
 net/ipv6/netfilter/nf_conntrack_reasm.c |  2 +-
 net/ipv6/reassembly.c                   |  2 +-
 6 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 010f26b31c89..bac79e817776 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -171,7 +171,7 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
 void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
 			      struct sk_buff *parent);
 void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
-			    void *reasm_data);
+			    void *reasm_data, bool try_coalesce);
 struct sk_buff *inet_frag_pull_head(struct inet_frag_queue *q);
 
 #endif
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index e4aba5d485be..bbe9b3b2d395 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -170,7 +170,7 @@ static int lowpan_frag_reasm(struct lowpan_frag_queue *fq, struct sk_buff *skb,
 	reasm_data = inet_frag_reasm_prepare(&fq->q, skb, prev_tail);
 	if (!reasm_data)
 		goto out_oom;
-	inet_frag_reasm_finish(&fq->q, skb, reasm_data);
+	inet_frag_reasm_finish(&fq->q, skb, reasm_data, false);
 
 	skb->dev = ldev;
 	skb->tstamp = fq->q.stamp;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index a999451345f9..10d31733297d 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -475,11 +475,12 @@ void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
 EXPORT_SYMBOL(inet_frag_reasm_prepare);
 
 void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
-			    void *reasm_data)
+			    void *reasm_data, bool try_coalesce)
 {
 	struct sk_buff **nextp = (struct sk_buff **)reasm_data;
 	struct rb_node *rbn;
 	struct sk_buff *fp;
+	int sum_truesize;
 
 	skb_push(head, head->data - skb_network_header(head));
 
@@ -487,25 +488,41 @@ void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
 	fp = FRAG_CB(head)->next_frag;
 	rbn = rb_next(&head->rbnode);
 	rb_erase(&head->rbnode, &q->rb_fragments);
+
+	sum_truesize = head->truesize;
 	while (rbn || fp) {
 		/* fp points to the next sk_buff in the current run;
 		 * rbn points to the next run.
 		 */
 		/* Go through the current run. */
 		while (fp) {
-			*nextp = fp;
-			nextp = &fp->next;
-			fp->prev = NULL;
-			memset(&fp->rbnode, 0, sizeof(fp->rbnode));
-			fp->sk = NULL;
-			head->data_len += fp->len;
-			head->len += fp->len;
+			struct sk_buff *next_frag = FRAG_CB(fp)->next_frag;
+			bool stolen;
+			int delta;
+
+			sum_truesize += fp->truesize;
 			if (head->ip_summed != fp->ip_summed)
 				head->ip_summed = CHECKSUM_NONE;
 			else if (head->ip_summed == CHECKSUM_COMPLETE)
 				head->csum = csum_add(head->csum, fp->csum);
-			head->truesize += fp->truesize;
-			fp = FRAG_CB(fp)->next_frag;
+
+			if (try_coalesce && skb_try_coalesce(head, fp, &stolen,
+							     &delta)) {
+				kfree_skb_partial(fp, stolen);
+			} else {
+				fp->prev = NULL;
+				memset(&fp->rbnode, 0, sizeof(fp->rbnode));
+				fp->sk = NULL;
+
+				head->data_len += fp->len;
+				head->len += fp->len;
+				head->truesize += fp->truesize;
+
+				*nextp = fp;
+				nextp = &fp->next;
+			}
+
+			fp = next_frag;
 		}
 		/* Move to the next run. */
 		if (rbn) {
@@ -516,7 +533,7 @@ void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
 			rbn = rbnext;
 		}
 	}
-	sub_frag_mem_limit(q->fqdir, head->truesize);
+	sub_frag_mem_limit(q->fqdir, sum_truesize);
 
 	*nextp = NULL;
 	skb_mark_not_on_list(head);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 4385eb9e781f..cfeb8890f94e 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -393,6 +393,11 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	return err;
 }
 
+static bool ip_frag_coalesce_ok(const struct ipq *qp)
+{
+	return qp->q.key.v4.user == IP_DEFRAG_LOCAL_DELIVER;
+}
+
 /* Build a new IP datagram from all its fragments. */
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 			 struct sk_buff *prev_tail, struct net_device *dev)
@@ -421,7 +426,8 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	if (len > 65535)
 		goto out_oversize;
 
-	inet_frag_reasm_finish(&qp->q, skb, reasm_data);
+	inet_frag_reasm_finish(&qp->q, skb, reasm_data,
+			       ip_frag_coalesce_ok(qp));
 
 	skb->dev = dev;
 	IPCB(skb)->frag_max_size = max(qp->max_df_size, qp->q.max_size);
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 0f82c150543b..fed9666a2f7d 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -348,7 +348,7 @@ static int nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *skb,
 
 	skb_reset_transport_header(skb);
 
-	inet_frag_reasm_finish(&fq->q, skb, reasm_data);
+	inet_frag_reasm_finish(&fq->q, skb, reasm_data, false);
 
 	skb->ignore_df = 1;
 	skb->dev = dev;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index ca05b16f1bb9..1f5d4d196dcc 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -282,7 +282,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *skb,
 
 	skb_reset_transport_header(skb);
 
-	inet_frag_reasm_finish(&fq->q, skb, reasm_data);
+	inet_frag_reasm_finish(&fq->q, skb, reasm_data, true);
 
 	skb->dev = dev;
 	ipv6_hdr(skb)->payload_len = htons(payload_len);
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH v2 1/2] cxgb4: sched: Use refcount_t for refcount
From: Willem de Bruijn @ 2019-08-02 15:14 UTC (permalink / raw)
  To: Chuhong Yuan
  Cc: Vishal Kulkarni, David S . Miller, Network Development,
	linux-kernel
In-Reply-To: <CANhBUQ2C3OfkC6qDL9=hhXq=C-OMHUwaL7EaMbagVRTt=rc00A@mail.gmail.com>

On Fri, Aug 2, 2019 at 11:10 AM Chuhong Yuan <hslester96@gmail.com> wrote:
>
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> 于2019年8月2日周五 下午10:53写道:
> >
> > On Fri, Aug 2, 2019 at 10:27 AM Chuhong Yuan <hslester96@gmail.com> wrote:
> > >
> > > Willem de Bruijn <willemdebruijn.kernel@gmail.com> 于2019年8月2日周五 下午9:40写道:
> > > >
> > > > On Fri, Aug 2, 2019 at 4:36 AM Chuhong Yuan <hslester96@gmail.com> wrote:
> > > > >
> > > > > refcount_t is better for reference counters since its
> > > > > implementation can prevent overflows.
> > > > > So convert atomic_t ref counters to refcount_t.
> > > > >
> > > > > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > > > > ---
> > > > > Changes in v2:
> > > > >   - Convert refcount from 0-base to 1-base.
> > > >
> > > > This changes the initial value from 0 to 1, but does not change the
> > > > release condition. So this introduces an accounting bug?
> > >
> > > I have noticed this problem and have checked other files which use refcount_t.
> > > I find although the refcounts are 1-based, they still use
> > > refcount_dec_and_test()
> > > to check whether the resource should be released.
> > > One example is drivers/char/mspec.c.
> > > Therefore I think this is okay and do not change the release condition.
> >
> > Indeed it is fine to use refcount_t with a model where the initial
> > allocation already accounts for the first reference and thus
> > initializes with refcount_set(.., 1).
> >
> > But it is not correct to just change a previously zero initialization
> > to one. As now an extra refcount_dec will be needed to release state.
> > But the rest of the code has not changed, so this extra decrement will
> > not happen.
> >
> > For a correct conversion, see for instance commits
> >
> >   commit db5bce32fbe19f0c7482fb5a40a33178bbe7b11b
> >   net: prepare (struct ubuf_info)->refcnt conversion
> >
> > and
> >
> >   commit c1d1b437816f0afa99202be3cb650c9d174667bc
> >   net: convert (struct ubuf_info)->refcnt to refcount_t
> >
> > The second makes a search-and-replace style API change like your
> > patches (though also notice the additional required #include).
> >
>
> Thanks for your examples!
> I will fix the #include in those no base changed patches.
>
> > But the other patch is needed first to change both the initial
> > atomic_set *and* at least one atomic_inc, to maintain the same
> > reference count over the object's lifetime.
> >
> > That change requires understanding of the object's lifecycle, so I
> > suggest only making those changes when aware of that whole data path.
>
> I think I had better focus on the 1-based cases first.

Yes, sounds good. And please try a single driver first and get that
accepted, before moving on to multiple concurrent submissions.

^ permalink raw reply

* Re: [PATCH net-next 00/15] net: Add functional tests for L3 and L4
From: Alexei Starovoitov @ 2019-08-02 15:14 UTC (permalink / raw)
  To: David Ahern; +Cc: David Ahern, David S. Miller, Network Development
In-Reply-To: <4c89b1cd-4dba-9cd8-0f4e-ae0a5d8bc61c@gmail.com>

On Thu, Aug 1, 2019 at 9:04 PM David Ahern <dsahern@gmail.com> wrote:
> ...
>
> >
> > with -v I see:
> > COMMAND: ip netns exec ns-A ping -c1 -w1 -I 172.16.2.1 172.16.1.2
> > ping: unknown iface 172.16.2.1
> > TEST: ping out, address bind - ns-B IP                                        [FAIL]
>
> With ping from iputils-ping -I can be an address or a device.

the ping, I have installed, supports -I.
The issue is somewhere else. Ideas?

^ permalink raw reply

* [PATCH][net-next][V2] net/mlx5: remove self-assignment on esw->dev
From: Colin King @ 2019-08-02 15:13 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky, David S . Miller, netdev,
	linux-rdma
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

There is a self assignment of esw->dev to itself, clean this up by
removing it. Also make dev a const pointer.

Addresses-Coverity: ("Self assignment")
Fixes: 6cedde451399 ("net/mlx5: E-Switch, Verify support QoS element type")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---

V2: make dev const

---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index f4ace5f8e884..de0894b695e3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1413,7 +1413,7 @@ static int esw_vport_egress_config(struct mlx5_eswitch *esw,
 
 static bool element_type_supported(struct mlx5_eswitch *esw, int type)
 {
-	struct mlx5_core_dev *dev = esw->dev = esw->dev;
+	const struct mlx5_core_dev *dev = esw->dev;
 
 	switch (type) {
 	case SCHEDULING_CONTEXT_ELEMENT_TYPE_TSAR:
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH v2 1/2] cxgb4: sched: Use refcount_t for refcount
From: Chuhong Yuan @ 2019-08-02 15:10 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Vishal Kulkarni, David S . Miller, Network Development,
	linux-kernel
In-Reply-To: <CAF=yD-+3tzufyOnK4suJnovrhX_=4sPqWOsjOcETGG3cA9+MdA@mail.gmail.com>

Willem de Bruijn <willemdebruijn.kernel@gmail.com> 于2019年8月2日周五 下午10:53写道:
>
> On Fri, Aug 2, 2019 at 10:27 AM Chuhong Yuan <hslester96@gmail.com> wrote:
> >
> > Willem de Bruijn <willemdebruijn.kernel@gmail.com> 于2019年8月2日周五 下午9:40写道:
> > >
> > > On Fri, Aug 2, 2019 at 4:36 AM Chuhong Yuan <hslester96@gmail.com> wrote:
> > > >
> > > > refcount_t is better for reference counters since its
> > > > implementation can prevent overflows.
> > > > So convert atomic_t ref counters to refcount_t.
> > > >
> > > > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > > > ---
> > > > Changes in v2:
> > > >   - Convert refcount from 0-base to 1-base.
> > >
> > > This changes the initial value from 0 to 1, but does not change the
> > > release condition. So this introduces an accounting bug?
> >
> > I have noticed this problem and have checked other files which use refcount_t.
> > I find although the refcounts are 1-based, they still use
> > refcount_dec_and_test()
> > to check whether the resource should be released.
> > One example is drivers/char/mspec.c.
> > Therefore I think this is okay and do not change the release condition.
>
> Indeed it is fine to use refcount_t with a model where the initial
> allocation already accounts for the first reference and thus
> initializes with refcount_set(.., 1).
>
> But it is not correct to just change a previously zero initialization
> to one. As now an extra refcount_dec will be needed to release state.
> But the rest of the code has not changed, so this extra decrement will
> not happen.
>
> For a correct conversion, see for instance commits
>
>   commit db5bce32fbe19f0c7482fb5a40a33178bbe7b11b
>   net: prepare (struct ubuf_info)->refcnt conversion
>
> and
>
>   commit c1d1b437816f0afa99202be3cb650c9d174667bc
>   net: convert (struct ubuf_info)->refcnt to refcount_t
>
> The second makes a search-and-replace style API change like your
> patches (though also notice the additional required #include).
>

Thanks for your examples!
I will fix the #include in those no base changed patches.

> But the other patch is needed first to change both the initial
> atomic_set *and* at least one atomic_inc, to maintain the same
> reference count over the object's lifetime.
>
> That change requires understanding of the object's lifecycle, so I
> suggest only making those changes when aware of that whole data path.

I think I had better focus on the 1-based cases first.

^ permalink raw reply

* Re: [PATCH v2 1/2] cxgb4: sched: Use refcount_t for refcount
From: Willem de Bruijn @ 2019-08-02 14:52 UTC (permalink / raw)
  To: Chuhong Yuan
  Cc: Vishal Kulkarni, David S . Miller, Network Development,
	linux-kernel
In-Reply-To: <CANhBUQ2TRr4RuSmjaRYPXHZpVw_-2awXvWNjjdvV_z1yoGdkXA@mail.gmail.com>

On Fri, Aug 2, 2019 at 10:27 AM Chuhong Yuan <hslester96@gmail.com> wrote:
>
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> 于2019年8月2日周五 下午9:40写道:
> >
> > On Fri, Aug 2, 2019 at 4:36 AM Chuhong Yuan <hslester96@gmail.com> wrote:
> > >
> > > refcount_t is better for reference counters since its
> > > implementation can prevent overflows.
> > > So convert atomic_t ref counters to refcount_t.
> > >
> > > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > > ---
> > > Changes in v2:
> > >   - Convert refcount from 0-base to 1-base.
> >
> > This changes the initial value from 0 to 1, but does not change the
> > release condition. So this introduces an accounting bug?
>
> I have noticed this problem and have checked other files which use refcount_t.
> I find although the refcounts are 1-based, they still use
> refcount_dec_and_test()
> to check whether the resource should be released.
> One example is drivers/char/mspec.c.
> Therefore I think this is okay and do not change the release condition.

Indeed it is fine to use refcount_t with a model where the initial
allocation already accounts for the first reference and thus
initializes with refcount_set(.., 1).

But it is not correct to just change a previously zero initialization
to one. As now an extra refcount_dec will be needed to release state.
But the rest of the code has not changed, so this extra decrement will
not happen.

For a correct conversion, see for instance commits

  commit db5bce32fbe19f0c7482fb5a40a33178bbe7b11b
  net: prepare (struct ubuf_info)->refcnt conversion

and

  commit c1d1b437816f0afa99202be3cb650c9d174667bc
  net: convert (struct ubuf_info)->refcnt to refcount_t

The second makes a search-and-replace style API change like your
patches (though also notice the additional required #include).

But the other patch is needed first to change both the initial
atomic_set *and* at least one atomic_inc, to maintain the same
reference count over the object's lifetime.

That change requires understanding of the object's lifecycle, so I
suggest only making those changes when aware of that whole data path.

^ permalink raw reply

* Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites
From: Jan Kara @ 2019-08-02 14:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, Michal Hocko, john.hubbard, Andrew Morton,
	Christoph Hellwig, Dan Williams, Dave Chinner, Dave Hansen,
	Ira Weiny, Jason Gunthorpe, Jérôme Glisse, LKML,
	amd-gfx, ceph-devel, devel, devel, dri-devel, intel-gfx, kvm,
	linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
	linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
	linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
	xen-devel, John Hubbard
In-Reply-To: <20190802142443.GB5597@bombadil.infradead.org>

On Fri 02-08-19 07:24:43, Matthew Wilcox wrote:
> On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote:
> > On Fri 02-08-19 11:12:44, Michal Hocko wrote:
> > > On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote:
> > > [...]
> > > > 2) Convert all of the call sites for get_user_pages*(), to
> > > > invoke put_user_page*(), instead of put_page(). This involves dozens of
> > > > call sites, and will take some time.
> > > 
> > > How do we make sure this is the case and it will remain the case in the
> > > future? There must be some automagic to enforce/check that. It is simply
> > > not manageable to do it every now and then because then 3) will simply
> > > be never safe.
> > > 
> > > Have you considered coccinele or some other scripted way to do the
> > > transition? I have no idea how to deal with future changes that would
> > > break the balance though.
> > 
> > Yeah, that's why I've been suggesting at LSF/MM that we may need to create
> > a gup wrapper - say vaddr_pin_pages() - and track which sites dropping
> > references got converted by using this wrapper instead of gup. The
> > counterpart would then be more logically named as unpin_page() or whatever
> > instead of put_user_page().  Sure this is not completely foolproof (you can
> > create new callsite using vaddr_pin_pages() and then just drop refs using
> > put_page()) but I suppose it would be a high enough barrier for missed
> > conversions... Thoughts?
> 
> I think the API we really need is get_user_bvec() / put_user_bvec(),
> and I know Christoph has been putting some work into that.  That avoids
> doing refcount operations on hundreds of pages if the page in question is
> a huge page.  Once people are switched over to that, they won't be tempted
> to manually call put_page() on the individual constituent pages of a bvec.

Well, get_user_bvec() is certainly a good API for one class of users but
just looking at the above series, you'll see there are *many* places that
just don't work with bvecs at all and you need something for those.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH net-next v2] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Andrew Lunn @ 2019-08-02 14:50 UTC (permalink / raw)
  To: Tao Ren
  Cc: Florian Fainelli, Heiner Kallweit, David S . Miller,
	Arun Parameswaran, Justin Chen, Vladimir Oltean, netdev,
	linux-kernel, openbmc
In-Reply-To: <20190801235839.290689-1-taoren@fb.com>

> +static int bcm54616s_read_status(struct phy_device *phydev)
> +{
> +	int err;
> +
> +	err = genphy_read_status(phydev);
> +
> +	/* 1000Base-X register set doesn't provide speed fields: the
> +	 * link speed is always 1000 Mb/s as long as link is up.
> +	 */
> +	if (phydev->dev_flags & PHY_BCM_FLAGS_MODE_1000BX &&
> +	    phydev->link)
> +		phydev->speed = SPEED_1000;
> +
> +	return err;
> +}

This function is equivalent to bcm5482_read_status(). You should use
it, rather than add a new function.

    Andrew

^ permalink raw reply

* Re: [patch 1/1] drivers/net/ethernet/marvell/mvmdio.c: Fix non OF case
From: Andrew Lunn @ 2019-08-02 14:43 UTC (permalink / raw)
  To: Arnaud Patard; +Cc: netdev
In-Reply-To: <20190802083310.772136040@rtp-net.org>

On Fri, Aug 02, 2019 at 10:32:40AM +0200, Arnaud Patard wrote:
> Orion5.x systems are still using machine files and not device-tree.
> Commit 96cb4342382290c9 ("net: mvmdio: allow up to three clocks to be
> specified for orion-mdio") has replaced devm_clk_get() with of_clk_get(),
> leading to a oops at boot and not working network, as reported in 
> https://lists.debian.org/debian-arm/2019/07/msg00088.html and possibly in
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908712.
> 
> Link: https://lists.debian.org/debian-arm/2019/07/msg00088.html
> Fixes: 96cb4342382290c9 ("net: mvmdio: allow up to three clocks to be specified for orion-mdio")
> Signed-off-by: Arnaud Patard <arnaud.patard@rtp-net.org>
> Index: linux-next/drivers/net/ethernet/marvell/mvmdio.c
> ===================================================================
> --- linux-next.orig/drivers/net/ethernet/marvell/mvmdio.c
> +++ linux-next/drivers/net/ethernet/marvell/mvmdio.c
> @@ -319,20 +319,33 @@ static int orion_mdio_probe(struct platf
>  
>  	init_waitqueue_head(&dev->smi_busy_wait);
>  
> -	for (i = 0; i < ARRAY_SIZE(dev->clk); i++) {
> -		dev->clk[i] = of_clk_get(pdev->dev.of_node, i);
> -		if (PTR_ERR(dev->clk[i]) == -EPROBE_DEFER) {
> +	if (pdev->dev.of_node) {
> +		for (i = 0; i < ARRAY_SIZE(dev->clk); i++) {
> +			dev->clk[i] = of_clk_get(pdev->dev.of_node, i);
> +			if (PTR_ERR(dev->clk[i]) == -EPROBE_DEFER) {
> +				ret = -EPROBE_DEFER;
> +				goto out_clk;
> +			}
> +			if (IS_ERR(dev->clk[i]))
> +				break;
> +			clk_prepare_enable(dev->clk[i]);
> +		}
> +
> +		if (!IS_ERR(of_clk_get(pdev->dev.of_node,
> +				       ARRAY_SIZE(dev->clk))))
> +			dev_warn(&pdev->dev,
> +				 "unsupported number of clocks, limiting to the first "
> +				 __stringify(ARRAY_SIZE(dev->clk)) "\n");
> +	} else {
> +		dev->clk[0] = clk_get(&pdev->dev, NULL);
> +		if (PTR_ERR(dev->clk[0]) == -EPROBE_DEFER) {
>  			ret = -EPROBE_DEFER;
>  			goto out_clk;
>  		}

Hi Arnaud

It is a long time since i looked at Orion5x. Is this else clause even
needed? If my memory is right, i don't think it needs to enable tclk?
It was kirkwood which first added gateable clocks. And all kirkwood
boards are not DT.

Thanks
	Andrew

^ permalink raw reply

* Re: [PATCH net v4] net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
From: Roopa Prabhu @ 2019-08-02 14:29 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: netdev, David Miller, bridge, michael-dev
In-Reply-To: <20190802105736.26767-1-nikolay@cumulusnetworks.com>

On Fri, Aug 2, 2019 at 3:57 AM Nikolay Aleksandrov
<nikolay@cumulusnetworks.com> wrote:
>
> Most of the bridge device's vlan init bugs come from the fact that its
> default pvid is created at the wrong time, way too early in ndo_init()
> before the device is even assigned an ifindex. It introduces a bug when the
> bridge's dev_addr is added as fdb during the initial default pvid creation
> the notification has ifindex/NDA_MASTER both equal to 0 (see example below)
> which really makes no sense for user-space[0] and is wrong.
> Usually user-space software would ignore such entries, but they are
> actually valid and will eventually have all necessary attributes.
> It makes much more sense to send a notification *after* the device has
> registered and has a proper ifindex allocated rather than before when
> there's a chance that the registration might still fail or to receive
> it with ifindex/NDA_MASTER == 0. Note that we can remove the fdb flush
> from br_vlan_flush() since that case can no longer happen. At
> NETDEV_REGISTER br->default_pvid is always == 1 as it's initialized by
> br_vlan_init() before that and at NETDEV_UNREGISTER it can be anything
> depending why it was called (if called due to NETDEV_REGISTER error
> it'll still be == 1, otherwise it could be any value changed during the
> device life time).
>
> For the demonstration below a small change to iproute2 for printing all fdb
> notifications is added, because it contained a workaround not to show
> entries with ifindex == 0.
> Command executed while monitoring: $ ip l add br0 type bridge
> Before (both ifindex and master == 0):
> $ bridge monitor fdb
> 36:7e:8a:b3:56:ba dev * vlan 1 master * permanent
>
> After (proper br0 ifindex):
> $ bridge monitor fdb
> e6:2a:ae:7a:b7:48 dev br0 vlan 1 master br0 permanent
>
> v4: move only the default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
> v3: send the correct v2 patch with all changes (stub should return 0)
> v2: on error in br_vlan_init set br->vlgrp to NULL and return 0 in
>     the br_vlan_bridge_event stub when bridge vlans are disabled
>
> [0] https://bugzilla.kernel.org/show_bug.cgi?id=204389
>
> Reported-by: michael-dev <michael-dev@fami-braun.de>
> Fixes: 5be5a2df40f0 ("bridge: Add filtering support for default_pvid")
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>

^ permalink raw reply

* Re: [PATCH v2 1/2] cxgb4: sched: Use refcount_t for refcount
From: Chuhong Yuan @ 2019-08-02 14:27 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Vishal Kulkarni, David S . Miller, Network Development,
	linux-kernel
In-Reply-To: <CA+FuTSc8WBx2PCUhn-sLtYHQR-OROXm2pUN9SDj7P-Bd8432UQ@mail.gmail.com>

Willem de Bruijn <willemdebruijn.kernel@gmail.com> 于2019年8月2日周五 下午9:40写道:
>
> On Fri, Aug 2, 2019 at 4:36 AM Chuhong Yuan <hslester96@gmail.com> wrote:
> >
> > refcount_t is better for reference counters since its
> > implementation can prevent overflows.
> > So convert atomic_t ref counters to refcount_t.
> >
> > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > ---
> > Changes in v2:
> >   - Convert refcount from 0-base to 1-base.
>
> This changes the initial value from 0 to 1, but does not change the
> release condition. So this introduces an accounting bug?

I have noticed this problem and have checked other files which use refcount_t.
I find although the refcounts are 1-based, they still use
refcount_dec_and_test()
to check whether the resource should be released.
One example is drivers/char/mspec.c.
Therefore I think this is okay and do not change the release condition.

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-08-02 14:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <20190802124613.GA11245@ziepe.ca>

On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > > This must be a proper barrier, like a spinlock, mutex, or
> > > synchronize_rcu.
> > 
> > 
> > I start with synchronize_rcu() but both you and Michael raise some
> > concern.
> 
> I've also idly wondered if calling synchronize_rcu() under the various
> mm locks is a deadlock situation.
> 
> > Then I try spinlock and mutex:
> > 
> > 1) spinlock: add lots of overhead on datapath, this leads 0 performance
> > improvement.
> 
> I think the topic here is correctness not performance improvement

The topic is whether we should revert
commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address")

or keep it in. The only reason to keep it is performance.

Now as long as all this code is disabled anyway, we can experiment a
bit.

I personally feel we would be best served by having two code paths:

- Access to VM memory directly mapped into kernel
- Access to userspace


Having it all cleanly split will allow a bunch of optimizations, for
example for years now we planned to be able to process an incoming short
packet directly on softirq path, or an outgoing on directly within
eventfd.


-- 
MST

^ permalink raw reply

* Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites
From: Matthew Wilcox @ 2019-08-02 14:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: Michal Hocko, john.hubbard, Andrew Morton, Christoph Hellwig,
	Dan Williams, Dave Chinner, Dave Hansen, Ira Weiny,
	Jason Gunthorpe, Jérôme Glisse, LKML, amd-gfx,
	ceph-devel, devel, devel, dri-devel, intel-gfx, kvm,
	linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
	linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
	linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
	xen-devel, John Hubbard
In-Reply-To: <20190802124146.GL25064@quack2.suse.cz>

On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote:
> On Fri 02-08-19 11:12:44, Michal Hocko wrote:
> > On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote:
> > [...]
> > > 2) Convert all of the call sites for get_user_pages*(), to
> > > invoke put_user_page*(), instead of put_page(). This involves dozens of
> > > call sites, and will take some time.
> > 
> > How do we make sure this is the case and it will remain the case in the
> > future? There must be some automagic to enforce/check that. It is simply
> > not manageable to do it every now and then because then 3) will simply
> > be never safe.
> > 
> > Have you considered coccinele or some other scripted way to do the
> > transition? I have no idea how to deal with future changes that would
> > break the balance though.
> 
> Yeah, that's why I've been suggesting at LSF/MM that we may need to create
> a gup wrapper - say vaddr_pin_pages() - and track which sites dropping
> references got converted by using this wrapper instead of gup. The
> counterpart would then be more logically named as unpin_page() or whatever
> instead of put_user_page().  Sure this is not completely foolproof (you can
> create new callsite using vaddr_pin_pages() and then just drop refs using
> put_page()) but I suppose it would be a high enough barrier for missed
> conversions... Thoughts?

I think the API we really need is get_user_bvec() / put_user_bvec(),
and I know Christoph has been putting some work into that.  That avoids
doing refcount operations on hundreds of pages if the page in question is
a huge page.  Once people are switched over to that, they won't be tempted
to manually call put_page() on the individual constituent pages of a bvec.

^ permalink raw reply

* Re: [PATCH 26/34] mm/gup_benchmark.c: convert put_page() to put_user_page*()
From: Keith Busch @ 2019-08-02 14:19 UTC (permalink / raw)
  To: john.hubbard
  Cc: Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Dave Hansen, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Jérôme Glisse, LKML, amd-gfx, ceph-devel, devel, devel,
	dri-devel, intel-gfx, kvm, linux-arm-kernel, linux-block,
	linux-crypto, linux-fbdev, linux-fsdevel, linux-media, linux-mm,
	linux-nfs, linux-rdma, linux-rpi-kernel, linux-xfs, netdev,
	rds-devel, sparclinux, x86, xen-devel, John Hubbard,
	Dan Carpenter, Greg Kroah-Hartman, Kirill A . Shutemov,
	Michael S . Tsirkin, YueHaibing
In-Reply-To: <20190802022005.5117-27-jhubbard@nvidia.com>

On Thu, Aug 01, 2019 at 07:19:57PM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> For pages that were retained via get_user_pages*(), release those pages
> via the new put_user_page*() routines, instead of via put_page() or
> release_pages().
> 
> This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
> ("mm: introduce put_user_page*(), placeholder versions").
> 
> Cc: Dan Carpenter <dan.carpenter@oracle.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Keith Busch <keith.busch@intel.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: YueHaibing <yuehaibing@huawei.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

Looks fine.

Reviewed-by: Keith Busch <keith.busch@intel.com>

>  mm/gup_benchmark.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> index 7dd602d7f8db..515ac8eeb6ee 100644
> --- a/mm/gup_benchmark.c
> +++ b/mm/gup_benchmark.c
> @@ -79,7 +79,7 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
>  	for (i = 0; i < nr_pages; i++) {
>  		if (!pages[i])
>  			break;
> -		put_page(pages[i]);
> +		put_user_page(pages[i]);
>  	}
>  	end_time = ktime_get();
>  	gup->put_delta_usec = ktime_us_delta(end_time, start_time);
> -- 

^ permalink raw reply

* Re: [net-next,rfc] net: bridge: mdb: Extend with multicast LLADDR
From: Allan W. Nielsen @ 2019-08-02 14:21 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Horatiu Vultur, idosch, andrew, davem, roopa, petrm, tglx, fw,
	netdev, linux-kernel, bridge
In-Reply-To: <fcb0e526-778f-5451-9934-e6c2421e4eb3@cumulusnetworks.com>

The 08/02/2019 17:16, Nikolay Aleksandrov wrote:
> On 02/08/2019 17:07, Allan W. Nielsen wrote:
> > The 08/01/2019 17:07, Nikolay Aleksandrov wrote:
> >>> To create a group for two of the front ports the following entries can
> >>> be added:
> >>> bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
> >>> bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
> >>>
> >>> Now the entries will be display as following:
> >>> dev br0 port eth0 grp 01:00:00:00:00:04 permanent offload vid 1
> >>> dev br0 port eth1 grp 01:00:00:00:00:04 permanent offload vid 1
> >>>
> >>> This requires changes to iproute2 as well, see the follogin patch for that.
> >>>
> >>> Now if frame with dmac '01:00:00:00:00:04' will arrive at one of the front
> >>> ports. If we have HW offload support, then the frame will be forwarded by
> >>> the switch, and need not to go to the CPU. In a pure SW world, the frame is
> >>> forwarded by the SW bridge, which will flooded it only the ports which are
> >>> part of the group.
> >>>
> >>> So far so good. This is an important part of the problem we wanted to solve.
> >>>
> >>> But, there is one drawback of this approach. If you want to add two of the
> >>> front ports and br0 to receive the frame then I can't see a way of doing it
> >>> with the bridge mdb command. To do that it requireds many more changes to
> >>> the existing code.
> >>>
> >>> Example:
> >>> bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
> >>> bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
> >>> bridge mdb add dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1 // This looks wrong.
> >>>
> >>> We believe we come a long way by re-using the facilities in MDB (thanks for
> >>> convincing us in doing this), but we are still not completely happy with
> >>> the result.
> >> Just add self argument for the bridge mdb command, no need to specify it twice.
> > Like this:
> > bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid self
> 
> What ?! No, that is not what I meant.
> bridge mdb add dev br0 grp 01:00:00:00:00:04 permanent vid self
> bridge mdb del dev br0 grp 01:00:00:00:00:04 permanent vid self
> 
> Similar to fdb. You don't need no-self..
> I don't mind a different approach, this was just a suggestion. But please
> without "no-self" :)

Good, then we are in sync on that one :-D

> > 
> > Then if I want to remove br0 rom the group, should I then have a no-self, and
> > then it becomes even more strange what to write in the port.
> > 
> > bridge mdb add dev br0 port ?? grp 01:00:00:00:00:04 permanent vid no-self
> >                             ^^
> > And, what if it is a group with only br0 (the traffic should go to br0 and
> > not any of the slave interfaces)?
> > 
> > Also, the 'self' keyword has different meanings in the 'bridge vlan' and the
> > 'bridge fdb' commands where it refres to if the offload rule should be install
> > in HW - or only in the SW bridge.
> 
> No, it shouldn't. Self means act on the device, in this case act on the bridge,
> otherwise master is assumed.
> 
> > 
> > The proposed does not look pretty bad, but at least it will be possible to
> > configured the different scenarios:
> > 
> > bridge mdb add dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb del dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1
> > 
> 
> That works too, but the "port" part is redundant.
> 
> > The more I look at the "bridge mdb { add | del } dev DEV port PORT" command, the
> > less I understand why do we have both 'dev' and 'port'? The implementation will
> > only allow this if 'port' has become enslaved to the switch represented by
> > 'dev'. Anyway, what is done is done, and we need to stay backwards compatible,
> > but we could make it optional, and then it looks a bit less strange to allow the
> > port to specify a br0.
> > 
> > Like this:
> > 
> > bridge mdb { add | del } [dev DEV] port PORT grp GROUP [ permanent | temp ] [ vid VID ]
> > 
> > bridge mdb add port eth0 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb add port eth1 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb add port br0  grp 01:00:00:00:00:04 permanent vid 1 // Add br0 to the gruop
> > bridge mdb del port br0  grp 01:00:00:00:00:04 permanent vid 1 // Delete it again
> > 
> 
> br0 is not a port, thus the "self" or just dev or whatever you choose..
Ahh, now I understand what you meant.

> > Alternative we could also make the port optional:
> > 
> > bridge mdb { add | del } dev DEV [port PORT] grp GROUP [ permanent | temp ] [ vid VID ]
> > 
> > bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb add dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Add br0 to the gruop
> > bridge mdb del dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Delete it again
> > 
> 
> Right. I read this one later. :)
> 
> > Any preferences?
> Not really, up to you. Any of the above seem fine to me.
Perfect, I like this one the most:

bridge mdb { add | del } dev DEV [ port PORT ] grp GROUP [ permanent | temp ] [ vid VID ]

bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
bridge mdb add dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Add br0 to the gruop
bridge mdb del dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Delete it again

/Allan

^ permalink raw reply

* Re: [net-next,rfc] net: bridge: mdb: Extend with multicast LLADDR
From: Nikolay Aleksandrov @ 2019-08-02 14:16 UTC (permalink / raw)
  To: Allan W. Nielsen
  Cc: Horatiu Vultur, idosch, andrew, davem, roopa, petrm, tglx, fw,
	netdev, linux-kernel, bridge
In-Reply-To: <20190802140655.ngbok2ubprhivlhy@lx-anielsen.microsemi.net>

On 02/08/2019 17:07, Allan W. Nielsen wrote:
> The 08/01/2019 17:07, Nikolay Aleksandrov wrote:
>>> To create a group for two of the front ports the following entries can
>>> be added:
>>> bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
>>> bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
>>>
>>> Now the entries will be display as following:
>>> dev br0 port eth0 grp 01:00:00:00:00:04 permanent offload vid 1
>>> dev br0 port eth1 grp 01:00:00:00:00:04 permanent offload vid 1
>>>
>>> This requires changes to iproute2 as well, see the follogin patch for that.
>>>
>>> Now if frame with dmac '01:00:00:00:00:04' will arrive at one of the front
>>> ports. If we have HW offload support, then the frame will be forwarded by
>>> the switch, and need not to go to the CPU. In a pure SW world, the frame is
>>> forwarded by the SW bridge, which will flooded it only the ports which are
>>> part of the group.
>>>
>>> So far so good. This is an important part of the problem we wanted to solve.
>>>
>>> But, there is one drawback of this approach. If you want to add two of the
>>> front ports and br0 to receive the frame then I can't see a way of doing it
>>> with the bridge mdb command. To do that it requireds many more changes to
>>> the existing code.
>>>
>>> Example:
>>> bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
>>> bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
>>> bridge mdb add dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1 // This looks wrong.
>>>
>>> We believe we come a long way by re-using the facilities in MDB (thanks for
>>> convincing us in doing this), but we are still not completely happy with
>>> the result.
>> Just add self argument for the bridge mdb command, no need to specify it twice.
> Like this:
> bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid self

What ?! No, that is not what I meant.
bridge mdb add dev br0 grp 01:00:00:00:00:04 permanent vid self
bridge mdb del dev br0 grp 01:00:00:00:00:04 permanent vid self

Similar to fdb. You don't need no-self..
I don't mind a different approach, this was just a suggestion. But please
without "no-self" :)

> 
> Then if I want to remove br0 rom the group, should I then have a no-self, and
> then it becomes even more strange what to write in the port.
> 
> bridge mdb add dev br0 port ?? grp 01:00:00:00:00:04 permanent vid no-self
>                             ^^
> And, what if it is a group with only br0 (the traffic should go to br0 and
> not any of the slave interfaces)?
> 
> Also, the 'self' keyword has different meanings in the 'bridge vlan' and the
> 'bridge fdb' commands where it refres to if the offload rule should be install
> in HW - or only in the SW bridge.

No, it shouldn't. Self means act on the device, in this case act on the bridge,
otherwise master is assumed.

> 
> The proposed does not look pretty bad, but at least it will be possible to
> configured the different scenarios:
> 
> bridge mdb add dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1
> bridge mdb del dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1
> 

That works too, but the "port" part is redundant.

> The more I look at the "bridge mdb { add | del } dev DEV port PORT" command, the
> less I understand why do we have both 'dev' and 'port'? The implementation will
> only allow this if 'port' has become enslaved to the switch represented by
> 'dev'. Anyway, what is done is done, and we need to stay backwards compatible,
> but we could make it optional, and then it looks a bit less strange to allow the
> port to specify a br0.
> 
> Like this:
> 
> bridge mdb { add | del } [dev DEV] port PORT grp GROUP [ permanent | temp ] [ vid VID ]
> 
> bridge mdb add port eth0 grp 01:00:00:00:00:04 permanent vid 1
> bridge mdb add port eth1 grp 01:00:00:00:00:04 permanent vid 1
> bridge mdb add port br0  grp 01:00:00:00:00:04 permanent vid 1 // Add br0 to the gruop
> bridge mdb del port br0  grp 01:00:00:00:00:04 permanent vid 1 // Delete it again
> 

br0 is not a port, thus the "self" or just dev or whatever you choose..

> Alternative we could also make the port optional:
> 
> bridge mdb { add | del } dev DEV [port PORT] grp GROUP [ permanent | temp ] [ vid VID ]
> 
> bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
> bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
> bridge mdb add dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Add br0 to the gruop
> bridge mdb del dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Delete it again
> 

Right. I read this one later. :)

> Any preferences?
> 

Not really, up to you. Any of the above seem fine to me.

> /Allan
> 
> 


^ permalink raw reply

* Re: [net-next,rfc] net: bridge: mdb: Extend with multicast LLADDR
From: Allan W. Nielsen @ 2019-08-02 14:07 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Horatiu Vultur, idosch, andrew, davem, roopa, petrm, tglx, fw,
	netdev, linux-kernel, bridge
In-Reply-To: <f758fdbf-4e0a-57b3-f13d-23e893ba7458@cumulusnetworks.com>

The 08/01/2019 17:07, Nikolay Aleksandrov wrote:
> > To create a group for two of the front ports the following entries can
> > be added:
> > bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
> > 
> > Now the entries will be display as following:
> > dev br0 port eth0 grp 01:00:00:00:00:04 permanent offload vid 1
> > dev br0 port eth1 grp 01:00:00:00:00:04 permanent offload vid 1
> > 
> > This requires changes to iproute2 as well, see the follogin patch for that.
> > 
> > Now if frame with dmac '01:00:00:00:00:04' will arrive at one of the front
> > ports. If we have HW offload support, then the frame will be forwarded by
> > the switch, and need not to go to the CPU. In a pure SW world, the frame is
> > forwarded by the SW bridge, which will flooded it only the ports which are
> > part of the group.
> > 
> > So far so good. This is an important part of the problem we wanted to solve.
> > 
> > But, there is one drawback of this approach. If you want to add two of the
> > front ports and br0 to receive the frame then I can't see a way of doing it
> > with the bridge mdb command. To do that it requireds many more changes to
> > the existing code.
> > 
> > Example:
> > bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
> > bridge mdb add dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1 // This looks wrong.
> > 
> > We believe we come a long way by re-using the facilities in MDB (thanks for
> > convincing us in doing this), but we are still not completely happy with
> > the result.
> Just add self argument for the bridge mdb command, no need to specify it twice.
Like this:
bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid self

Then if I want to remove br0 rom the group, should I then have a no-self, and
then it becomes even more strange what to write in the port.

bridge mdb add dev br0 port ?? grp 01:00:00:00:00:04 permanent vid no-self
                            ^^
And, what if it is a group with only br0 (the traffic should go to br0 and
not any of the slave interfaces)?

Also, the 'self' keyword has different meanings in the 'bridge vlan' and the
'bridge fdb' commands where it refres to if the offload rule should be install
in HW - or only in the SW bridge.

The proposed does not look pretty bad, but at least it will be possible to
configured the different scenarios:

bridge mdb add dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1
bridge mdb del dev br0 port br0 grp 01:00:00:00:00:04 permanent vid 1

The more I look at the "bridge mdb { add | del } dev DEV port PORT" command, the
less I understand why do we have both 'dev' and 'port'? The implementation will
only allow this if 'port' has become enslaved to the switch represented by
'dev'. Anyway, what is done is done, and we need to stay backwards compatible,
but we could make it optional, and then it looks a bit less strange to allow the
port to specify a br0.

Like this:

bridge mdb { add | del } [dev DEV] port PORT grp GROUP [ permanent | temp ] [ vid VID ]

bridge mdb add port eth0 grp 01:00:00:00:00:04 permanent vid 1
bridge mdb add port eth1 grp 01:00:00:00:00:04 permanent vid 1
bridge mdb add port br0  grp 01:00:00:00:00:04 permanent vid 1 // Add br0 to the gruop
bridge mdb del port br0  grp 01:00:00:00:00:04 permanent vid 1 // Delete it again

Alternative we could also make the port optional:

bridge mdb { add | del } dev DEV [port PORT] grp GROUP [ permanent | temp ] [ vid VID ]

bridge mdb add dev br0 port eth0 grp 01:00:00:00:00:04 permanent vid 1
bridge mdb add dev br0 port eth1 grp 01:00:00:00:00:04 permanent vid 1
bridge mdb add dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Add br0 to the gruop
bridge mdb del dev br0           grp 01:00:00:00:00:04 permanent vid 1 // Delete it again

Any preferences?

/Allan



^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-08-02 14:03 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel,
	linux-mm
In-Reply-To: <42ead87b-1749-4c73-cbe4-29dbeb945041@redhat.com>

On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> Btw, I come up another idea, that is to disable preemption when vhost thread
> need to access the memory. Then register preempt notifier and if vhost
> thread is preempted, we're sure no one will access the memory and can do the
> cleanup.

Great, more notifiers :(

Maybe can live with
1- disable preemption while using the cached pointer
2- teach vhost to recover from memory access failures,
   by switching to regular from/to user path

So if you want to try that, fine since it's a step in
the right direction.

But I think fundamentally it's not what we want to do long term.

It's always been a fundamental problem with this patch series that only
metadata is accessed through a direct pointer.

The difference in ways you handle metadata and data is what is
now coming and messing everything up.

So if continuing the direct map approach,
what is needed is a cache of mapped VM memory, then on a cache miss
we'd queue work along the lines of 1-2 above.

That's one direction to take. Another one is to give up on that and
write our own version of uaccess macros.  Add a "high security" flag to
the vhost module and if not active use these for userspace memory
access.


-- 
MST

^ permalink raw reply

* Re: [PATCH 2/2] ixgbe: Use refcount_t for refcount
From: Willem de Bruijn @ 2019-08-02 13:46 UTC (permalink / raw)
  To: Chuhong Yuan
  Cc: Jeff Kirsher, David S . Miller, intel-wired-lan,
	Network Development, linux-kernel
In-Reply-To: <20190802105507.16650-1-hslester96@gmail.com>

On Fri, Aug 2, 2019 at 6:55 AM Chuhong Yuan <hslester96@gmail.com> wrote:
>
> refcount_t is better for reference counters since its
> implementation can prevent overflows.
> So convert atomic_t ref counters to refcount_t.
>
> Also convert refcount from 0-based to 1-based.
>
> This patch depends on PATCH 1/2.
>
> Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c | 6 +++---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.h | 2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
> index 00710f43cfd2..d313d00065cd 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c
> @@ -773,7 +773,7 @@ int ixgbe_setup_fcoe_ddp_resources(struct ixgbe_adapter *adapter)
>
>         fcoe->extra_ddp_buffer = buffer;
>         fcoe->extra_ddp_buffer_dma = dma;
> -       atomic_set(&fcoe->refcnt, 0);
> +       refcount_set(&fcoe->refcnt, 1);

Same point as in the cxgb4 driver patch: how can you just change the
initial value without modifying the condition for release?

This is not a suggestion to resubmit all these changes again with a
change to the release condition.

^ permalink raw reply

* Re: [PATCH v2 1/2] cxgb4: sched: Use refcount_t for refcount
From: Willem de Bruijn @ 2019-08-02 13:39 UTC (permalink / raw)
  To: Chuhong Yuan
  Cc: Vishal Kulkarni, David S . Miller, Network Development,
	linux-kernel
In-Reply-To: <20190802083541.12602-1-hslester96@gmail.com>

On Fri, Aug 2, 2019 at 4:36 AM Chuhong Yuan <hslester96@gmail.com> wrote:
>
> refcount_t is better for reference counters since its
> implementation can prevent overflows.
> So convert atomic_t ref counters to refcount_t.
>
> Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> ---
> Changes in v2:
>   - Convert refcount from 0-base to 1-base.

This changes the initial value from 0 to 1, but does not change the
release condition. So this introduces an accounting bug?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox