Netdev List

* Re: [PATCH 2/3] vhost: better detection of available buffers
From: Michael S. Tsirkin @ 2016-11-15 14:46 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel
In-Reply-To: <8bf86752-4dca-3ced-4641-efa7a4a1fc6e@redhat.com>

On Tue, Nov 15, 2016 at 04:00:21PM +0800, Jason Wang wrote:
> 
> 
> On 2016-11-15 11:28, Michael S. Tsirkin wrote:
> > On Tue, Nov 15, 2016 at 11:16:59AM +0800, Jason Wang wrote:
> > > 
> > > On 2016-11-12 00:20, Michael S. Tsirkin wrote:
> > > > On Fri, Nov 11, 2016 at 12:18:50PM +0800, Jason Wang wrote:
> > > > > On 2016-11-11 11:41, Michael S. Tsirkin wrote:
> > > > > > On Fri, Nov 11, 2016 at 10:18:37AM +0800, Jason Wang wrote:
> > > > > > > > On 2016-11-10 03:57, Michael S. Tsirkin wrote:
> > > > > > > > > > On Wed, Nov 09, 2016 at 03:38:32PM +0800, Jason Wang wrote:
> > > > > > > > > > > > We should use vq->last_avail_idx instead of vq->avail_idx in the
> > > > > > > > > > > > checking of vhost_vq_avail_empty() since latter is the cached avail
> > > > > > > > > > > > index from guest but we want to know if there's pending available
> > > > > > > > > > > > buffers in the virtqueue.
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Jason Wang<jasowang@redhat.com>
> > > > > > > > > > I'm not sure why this patch is here. Is it related to
> > > > > > > > > > batching somehow?
> > > > > > > > Yes, we need to know whether there are still buffers left in the
> > > > > > > > virtqueue, so we need to check last_avail_idx. Otherwise, we're only
> > > > > > > > checking whether the guest has submitted new buffers.
> > > > > > > > 
> > > > > > > > > > > > ---
> > > > > > > > > > > >     drivers/vhost/vhost.c | 2 +-
> > > > > > > > > > > >     1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > > > > > > index c6f2d89..fdf4cdf 100644
> > > > > > > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > > > > > > @@ -2230,7 +2230,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> > > > > > > > > > > >     	if (r)
> > > > > > > > > > > >     		return false;
> > > > > > > > > > > > -	return vhost16_to_cpu(vq, avail_idx) == vq->avail_idx;
> > > > > > > > > > > > +	return vhost16_to_cpu(vq, avail_idx) == vq->last_avail_idx;
> > > > > > > > > > > >     }
> > > > > > > > > > > >     EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
> > > > > > > > > > That might be OK for TX but it's probably wrong for RX
> > > > > > > > > > where the fact that used != avail does not mean
> > > > > > > > > > we have enough space to store the packet.
> > > > > > > > Right, but it's no harm since it was just a hint, handle_rx() can handle
> > > > > > > > this situation.
> > > > > > Means busy polling will cause useless load on the CPU though.
> > > > > > 
> > > > > Right, but it's not easy to have a 100% correct hint here. Needs more thought.
> > > > What's wrong with what we have? It polls until value changes.
> > > > 
> > > But as you said, this does not mean (in mergeable cases) we have enough
> > > space to store the packet.
> > Absolutely but it checks once and then only re-checks after value
> > changes again.
> > 
> 
> Since get_rx_bufs() does not get enough buffers, we will wait for the kick
> in this case. For busy polling, we probably want to stay in the busy loop
> here.

That's what I'm saying. You don't want to re-poll the queue
if available idx was unchanged.
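
For readers following the thread, the two checks under discussion can be modelled roughly as below. This is a user-space sketch under stated assumptions, not vhost code: `struct vq_model` and both helper names are made up for illustration, and only the three indices mentioned in the thread are represented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the vhost virtqueue state. */
struct vq_model {
	uint16_t guest_avail_idx; /* index the guest publishes in the avail ring */
	uint16_t avail_idx;       /* host's cached copy of that index */
	uint16_t last_avail_idx;  /* next entry the host will consume */
};

/* Existing check: true while the guest has published nothing new
 * since the host last refreshed its cached index. */
static bool vq_guest_idx_unchanged(const struct vq_model *vq)
{
	return vq->guest_avail_idx == vq->avail_idx;
}

/* Check proposed in the patch: true only when every published
 * buffer has already been consumed by the host. */
static bool vq_no_pending_buffers(const struct vq_model *vq)
{
	return vq->guest_avail_idx == vq->last_avail_idx;
}
```

With buffers published but not yet consumed, the first helper reports "unchanged" while the second reports "not empty", which is the case where a busy-polling RX path could keep spinning without making progress.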

-- 
MST


* Re: [PATCH v3] ip6_output: ensure flow saddr actually belongs to device
From: Hannes Frederic Sowa @ 2016-11-15 14:45 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: David Ahern, Netdev, WireGuard mailing list, LKML,
	YOSHIFUJI Hideaki
In-Reply-To: <CAHmME9ppx01YR9Db1oPpm6FJ+BmpqSxvjQ2S+GT0DXO09_M4oQ@mail.gmail.com>

Hey Jason,

On 15.11.2016 01:45, Jason A. Donenfeld wrote:
> I'll have a better look at this. Perhaps this should be modeled on
> what we currently do for userspace, which might amount to something
> more or less like:

Cool, thanks!

> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 6001e78..0721915 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -925,6 +925,7 @@ static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk,
>  #endif
>          int err;
>          int flags = 0;
> +        int addr_type, bind_to_dev;
> 
>          /* The correct way to handle this would be to do
>           * ip6_route_get_saddr, and then ip6_route_output; however,
> @@ -1012,6 +1013,16 @@ static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk,
>          }
>  #endif
> 
> +        addr_type = ipv6_addr_type(&fl6->saddr);
> +        if (addr_type == IPV6_ADDR_ANY)
> +                return 0;
> +
> +        err = -EINVAL;
> +        bind_to_dev = __ipv6_addr_src_scope(addr_type) <= IPV6_ADDR_SCOPE_LINKLOCAL;
> +        if (!ipv6_chk_addr(net, &fl6->saddr, bind_to_dev ? (*dst)->dev : NULL, 0) &&
> +            !ipv6_chk_acast_addr_src(net, (*dst)->dev, &fl6->saddr))
> +                goto out_err_release;
> +
>          return 0;
> 
>  out_err_release:
> 

We should not use (*dst)->dev, as this is the resulting device after the
lookup and does not necessarily correspond to the device the user asked
for. Thus you need to pass in fl6->flowi6_oif. To avoid the otherwise
necessary ifindex->net_device lookup, I would suggest changing
ipv6_chk_addr_and_flags to take an ifindex instead of a net_device (0
corresponds to the net_device == NULL case). It seems to me this would
make the code easier. ipv6_chk_addr can simply pass down dev->ifindex to
ipv6_chk_addr_and_flags.

For checking the anycast address you probably need to look up the
net_device, thus use dev_get_by_index_rcu. But the unicast check will
probably already hit, so the traversal of anycast addresses won't
happen in normal cases. This could be separated into its own function.

In the non-strict case we don't necessarily need bind_to_dev?
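
The ifindex convention suggested above (0 meaning "any device", like a NULL net_device) might look roughly like this in isolation. It is a user-space sketch only: `struct addr_entry` and `chk_addr_by_ifindex()` are hypothetical names, and the flat table stands in for the kernel's per-namespace address lists.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: one configured address and the interface owning it. */
struct addr_entry {
	unsigned char addr[16];
	int ifindex;
};

/* Return true if addr is configured. ifindex == 0 matches any interface,
 * mirroring the dev == NULL case of a pointer-based lookup. */
static bool chk_addr_by_ifindex(const struct addr_entry *tbl, size_t n,
				const unsigned char addr[16], int ifindex)
{
	for (size_t i = 0; i < n; i++) {
		if (memcmp(tbl[i].addr, addr, 16) != 0)
			continue;
		if (ifindex == 0 || tbl[i].ifindex == ifindex)
			return true;
	}
	return false;
}
```

The caller can then pass the requested output interface index directly, without first resolving it to a device pointer.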

Bye,
Hannes


* Re: [PATCH net-next v3 5/7] vxlan: simplify RTF_LOCAL handling.
From: Jiri Benc @ 2016-11-15 14:44 UTC (permalink / raw)
  To: Pravin B Shelar; +Cc: netdev
In-Reply-To: <1479098638-4921-6-git-send-email-pshelar@ovn.org>

On Sun, 13 Nov 2016 20:43:56 -0800, Pravin B Shelar wrote:
> Avoid duplicate code for handling RTF_LOCAL routes.
> 
> Signed-off-by: Pravin B Shelar <pshelar@ovn.org>

Acked-by: Jiri Benc <jbenc@redhat.com>


* Re: [PATCH] vhost/scsi: Remove unused but set variable
From: Stefan Hajnoczi @ 2016-11-15 14:40 UTC (permalink / raw)
  To: Tobias Klauser
  Cc: Michael S. Tsirkin, Jason Wang, kvm, virtualization, netdev
In-Reply-To: <20161111132710.25804-1-tklauser@distanz.ch>

On Fri, Nov 11, 2016 at 02:27:10PM +0100, Tobias Klauser wrote:
> Remove the unused but set variable se_tpg in vhost_scsi_nexus_cb() to
> fix the following GCC warning when building with 'W=1':
> 
>   drivers/vhost/scsi.c:1752:26: warning: variable ‘se_tpg’ set but not used
> 
> Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
> ---
>  drivers/vhost/scsi.c | 2 --
>  1 file changed, 2 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>


* Re: [PATCH net-next v3 4/7] vxlan: improve vxlan route lookup checks.
From: Jiri Benc @ 2016-11-15 14:39 UTC (permalink / raw)
  To: Pravin B Shelar; +Cc: netdev
In-Reply-To: <1479098638-4921-5-git-send-email-pshelar@ovn.org>

On Sun, 13 Nov 2016 20:43:55 -0800, Pravin B Shelar wrote:
> @@ -1929,8 +1951,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>  	union vxlan_addr *src;
>  	struct vxlan_metadata _md;
>  	struct vxlan_metadata *md = &_md;
> -	struct dst_entry *ndst = NULL;
>  	__be16 src_port = 0, dst_port;
> +	struct dst_entry *ndst = NULL;
>  	__be32 vni, label;
>  	__be16 df = 0;
>  	__u8 tos, ttl;

This looks kind of arbitrary. You might want to remove this hunk or
merge it to patch 3.

Other than that,
Acked-by: Jiri Benc <jbenc@redhat.com>


* Re: [RFC PATCH 1/2] net: use cmpxchg instead of spinlock in ptr rings
From: Michael S. Tsirkin @ 2016-11-15 14:30 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev@vger.kernel.org, John Fastabend
In-Reply-To: <20161115143258.2c46fc9a@redhat.com>

On Tue, Nov 15, 2016 at 02:32:58PM +0100, Jesper Dangaard Brouer wrote:
> What I would really like to see is a lock-free (locked cmpxchg) queue
> implementation that, like ptr_ring, uses the array as the empty/full
> check and still (somehow) supports bulking.

I think lock-free is overrated for this use-case - we hold the lock
for such a short amount of time.

I think what we want is just a simpler spinlock - one that's faster than
qlock for use-cases that are unfair anyway, like this one where even if
you get the lock in a fair way, FIFO might be full and you won't be able
to queue.

Or find an API to add to FIFO in a fair way.
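
As an illustration of what a "simpler", deliberately unfair lock could look like, here is a minimal test-and-set spinlock in C11 atomics. This is a swapped-in textbook technique, not a kernel API and not something proposed in this thread; `struct tas_lock` and the `tas_*` helpers are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal lock: one flag, no queueing, no fairness. */
struct tas_lock {
	atomic_flag held;
};

static void tas_lock_init(struct tas_lock *l)
{
	atomic_flag_clear(&l->held);
}

static bool tas_trylock(struct tas_lock *l)
{
	/* test_and_set returns the previous value: false means we took the lock. */
	return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void tas_lock_acquire(struct tas_lock *l)
{
	while (!tas_trylock(l))
		; /* spin until the holder releases the flag */
}

static void tas_unlock(struct tas_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Whichever waiter's test-and-set lands first wins, so there is no ordering guarantee between waiters; that is the trade-off being weighed against a queued lock.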

-- 
MST


* Re: [PATCH net-next v3 3/7] vxlan: simplify exception handling
From: Jiri Benc @ 2016-11-15 14:30 UTC (permalink / raw)
  To: Pravin B Shelar; +Cc: netdev
In-Reply-To: <1479098638-4921-4-git-send-email-pshelar@ovn.org>

On Sun, 13 Nov 2016 20:43:54 -0800, Pravin B Shelar wrote:
> @@ -1927,13 +1923,13 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>  	struct ip_tunnel_info *info;
>  	struct vxlan_dev *vxlan = netdev_priv(dev);
>  	struct sock *sk;
> -	struct rtable *rt = NULL;
>  	const struct iphdr *old_iph;
>  	union vxlan_addr *dst;
>  	union vxlan_addr remote_ip, local_ip;
>  	union vxlan_addr *src;
>  	struct vxlan_metadata _md;
>  	struct vxlan_metadata *md = &_md;
> +	struct dst_entry *ndst = NULL;
>  	__be16 src_port = 0, dst_port;
>  	__be32 vni, label;
>  	__be16 df = 0;
> @@ -2009,6 +2005,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>  
>  	if (dst->sa.sa_family == AF_INET) {
>  		struct vxlan_sock *sock4 = rcu_dereference(vxlan->vn4_sock);
> +		struct rtable *rt;
>  
>  		if (!sock4)
>  			goto drop;
> @@ -2030,7 +2027,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>  			netdev_dbg(dev, "circular route to %pI4\n",
>  				   &dst->sin.sin_addr.s_addr);
>  			dev->stats.collisions++;
> -			goto rt_tx_error;
> +			ip_rt_put(rt);
> +			goto tx_error;
>  		}
>  
>  		/* Bypass encapsulation if the destination is local */
> @@ -2053,12 +2051,13 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>  		else if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT)
>  			df = htons(IP_DF);
>  
> +		ndst = &rt->dst;

It would be a bit cleaner to do this assignment just after rt is
assigned (but after the IS_ERR(rt) condition), get rid of the added
ip_rt_put call above and move the existing ip_rt_put call in the bypass
case just before the vxlan_encap_bypass call...

>  		tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
>  		ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
> -		err = vxlan_build_skb(skb, &rt->dst, sizeof(struct iphdr),
> +		err = vxlan_build_skb(skb, ndst, sizeof(struct iphdr),
>  				      vni, md, flags, udp_sum);
>  		if (err < 0)
> -			goto xmit_tx_error;
> +			goto tx_error;
>  
>  		udp_tunnel_xmit_skb(rt, sk, skb, src->sin.sin_addr.s_addr,
>  				    dst->sin.sin_addr.s_addr, tos, ttl, df,
> @@ -2066,7 +2065,6 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>  #if IS_ENABLED(CONFIG_IPV6)
>  	} else {
>  		struct vxlan_sock *sock6 = rcu_dereference(vxlan->vn6_sock);
> -		struct dst_entry *ndst;
>  		u32 rt6i_flags;
>  
>  		ndst = vxlan6_get_route(vxlan, sock6, skb,
> @@ -2078,13 +2076,13 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>  			netdev_dbg(dev, "no route to %pI6\n",
>  				   &dst->sin6.sin6_addr);
>  			dev->stats.tx_carrier_errors++;
> +			ndst = NULL;
>  			goto tx_error;
>  		}
>  
>  		if (ndst->dev == dev) {
>  			netdev_dbg(dev, "circular route to %pI6\n",
>  				   &dst->sin6.sin6_addr);
> -			dst_release(ndst);
>  			dev->stats.collisions++;
>  			goto tx_error;
>  		}
> @@ -2096,12 +2094,12 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>  		    !(rt6i_flags & (RTCF_BROADCAST | RTCF_MULTICAST))) {
>  			struct vxlan_dev *dst_vxlan;
>  
> -			dst_release(ndst);
>  			dst_vxlan = vxlan_find_vni(vxlan->net, vni,
>  						   dst->sa.sa_family, dst_port,
>  						   vxlan->flags);
>  			if (!dst_vxlan)
>  				goto tx_error;
> +			dst_release(ndst);
>  			vxlan_encap_bypass(skb, vxlan, dst_vxlan);
>  			return;
>  		}

...the same way you have it here, in the IPv6 part. Could you change
the IPv4 part to match it?
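
The shape being asked for (record the route in ndst right after a successful lookup, then let a single tx_error label drop it) can be sketched in user space as follows. `struct route_model`, `route_release()` and `xmit_model()` are hypothetical stand-ins, not vxlan code; the NULL-safe release mirrors how dst_release() tolerates a NULL argument.

```c
#include <stddef.h>

/* Hypothetical refcounted route entry standing in for struct dst_entry. */
struct route_model {
	int refcnt;
	int is_circular;
};

static void route_release(struct route_model *r)
{
	if (r)
		r->refcnt--;
}

/* Returns 0 on success, -1 on error. On success the reference is kept for
 * the transmit step; on error it is dropped exactly once at the shared
 * label. */
static int xmit_model(struct route_model *looked_up)
{
	struct route_model *ndst = NULL;

	if (!looked_up)
		goto tx_error;      /* lookup failed: nothing to release */

	ndst = looked_up;           /* take ownership right after the lookup */
	ndst->refcnt++;

	if (ndst->is_circular)
		goto tx_error;      /* no per-branch release needed */

	return 0;

tx_error:
	route_release(ndst);        /* NULL-safe single release point */
	return -1;
}
```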

Looks good otherwise. Seeing it, I like this version much more than v2.

Thanks!

 Jiri


* [PATCH net 3/3] ARM64: dts: meson: odroidc2: disable 1000t-eee advertisement
From: Jerome Brunet @ 2016-11-15 14:29 UTC (permalink / raw)
  To: netdev, devicetree, Florian Fainelli
  Cc: Jerome Brunet, Carlo Caione, Kevin Hilman, Giuseppe Cavallaro,
	Alexandre TORGUE, Martin Blumenstingl, Andre Roth, Neil Armstrong,
	linux-amlogic, linux-arm-kernel, linux-kernel
In-Reply-To: <1479220154-25851-1-git-send-email-jbrunet@baylibre.com>

Reported-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Tested-by: Andre Roth <neolynx@gmail.com>
---
 arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts b/arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts
index e6e3491d48a5..1f4416ecb183 100644
--- a/arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts
+++ b/arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts
@@ -98,3 +98,18 @@
 	pinctrl-0 = <&i2c_a_pins>;
 	pinctrl-names = "default";
 };
+
+&ethmac {
+	phy-handle = <&eth_phy0>;
+
+	mdio {
+		compatible = "snps,dwmac-mdio";
+		#address-cells = <1>;
+		#size-cells = <0>;
+
+		eth_phy0: ethernet-phy@0 {
+			reg = <0>;
+			realtek,disable-eee-1000t;
+		};
+	};
+};
-- 
2.7.4


* [PATCH net 2/3] dt-bindings: net: add DT bindings for realtek phys
From: Jerome Brunet @ 2016-11-15 14:29 UTC (permalink / raw)
  To: netdev, devicetree, Florian Fainelli
  Cc: Jerome Brunet, Carlo Caione, Kevin Hilman, Giuseppe Cavallaro,
	Alexandre TORGUE, Martin Blumenstingl, Andre Roth, Neil Armstrong,
	linux-amlogic, linux-arm-kernel, linux-kernel
In-Reply-To: <1479220154-25851-1-git-send-email-jbrunet@baylibre.com>

Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
---
 .../devicetree/bindings/net/realtek-phy.txt          | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/realtek-phy.txt

diff --git a/Documentation/devicetree/bindings/net/realtek-phy.txt b/Documentation/devicetree/bindings/net/realtek-phy.txt
new file mode 100644
index 000000000000..dc2845a6b387
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/realtek-phy.txt
@@ -0,0 +1,20 @@
+Realtek Ethernet PHY
+
+Some boards require special tuning values of the phy.
+
+Optional properties:
+
+realtek,disable-eee-1000t:
+realtek,disable-eee-100tx:
+  If set, respectively disable 1000-BaseT and 100-BaseTx energy efficient
+  ethernet capability advertisement
+  default: Leave the phy default settings unchanged (capabilities advertised)
+
+Example:
+
+&mdio0 {
+	ethernetphy0: ethernet-phy@0 {
+		reg = <0>;
+		realtek,disable-eee-1000t;
+	};
+};
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH net 1/3] net: phy: realtek: add eee advertisement disable options
From: Jerome Brunet @ 2016-11-15 14:29 UTC (permalink / raw)
  To: netdev, devicetree, Florian Fainelli
  Cc: Jerome Brunet, Carlo Caione, Kevin Hilman, Giuseppe Cavallaro,
	Alexandre TORGUE, Martin Blumenstingl, Andre Roth, Neil Armstrong,
	linux-amlogic, linux-arm-kernel, linux-kernel
In-Reply-To: <1479220154-25851-1-git-send-email-jbrunet@baylibre.com>

On some platforms, energy efficient ethernet with rtl8211 devices causes
issues such as throughput drops or a broken link.

This was reported on the OdroidC2 (DWMAC + RTL8211F). While the root cause
of the issue is not fully understood yet, disabling EEE advertisement
prevents auto-negotiation from enabling EEE.

This patch provides options to disable 1000T and 100TX EEE advertisement
individually for the realtek phys supporting this feature.

Reported-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Tested-by: Andre Roth <neolynx@gmail.com>
---
 drivers/net/phy/realtek.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 64 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
index aadd6e9f54ad..77235fd5faaf 100644
--- a/drivers/net/phy/realtek.c
+++ b/drivers/net/phy/realtek.c
@@ -15,6 +15,12 @@
  */
 #include <linux/phy.h>
 #include <linux/module.h>
+#include <linux/of.h>
+
+struct rtl8211x_phy_priv {
+	bool eee_1000t_disable;
+	bool eee_100tx_disable;
+};
 
 #define RTL821x_PHYSR		0x11
 #define RTL821x_PHYSR_DUPLEX	0x2000
@@ -93,12 +99,44 @@ static int rtl8211f_config_intr(struct phy_device *phydev)
 	return err;
 }
 
+static void rtl8211x_clear_eee_adv(struct phy_device *phydev)
+{
+	struct rtl8211x_phy_priv *priv = phydev->priv;
+	u16 val;
+
+	if (priv->eee_1000t_disable || priv->eee_100tx_disable) {
+		val = phy_read_mmd_indirect(phydev, MDIO_AN_EEE_ADV,
+					    MDIO_MMD_AN);
+
+		if (priv->eee_1000t_disable)
+			val &= ~MDIO_AN_EEE_ADV_1000T;
+		if (priv->eee_100tx_disable)
+			val &= ~MDIO_AN_EEE_ADV_100TX;
+
+		phy_write_mmd_indirect(phydev, MDIO_AN_EEE_ADV,
+				       MDIO_MMD_AN, val);
+	}
+}
+
+static int rtl8211x_config_init(struct phy_device *phydev)
+{
+	int ret;
+
+	ret = genphy_config_init(phydev);
+	if (ret < 0)
+		return ret;
+
+	rtl8211x_clear_eee_adv(phydev);
+
+	return 0;
+}
+
 static int rtl8211f_config_init(struct phy_device *phydev)
 {
 	int ret;
 	u16 reg;
 
-	ret = genphy_config_init(phydev);
+	ret = rtl8211x_config_init(phydev);
 	if (ret < 0)
 		return ret;
 
@@ -115,6 +153,26 @@ static int rtl8211f_config_init(struct phy_device *phydev)
 	return 0;
 }
 
+static int rtl8211x_phy_probe(struct phy_device *phydev)
+{
+	struct device *dev = &phydev->mdio.dev;
+	struct device_node *of_node = dev->of_node;
+	struct rtl8211x_phy_priv *priv;
+
+	priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->eee_1000t_disable =
+		of_property_read_bool(of_node, "realtek,disable-eee-1000t");
+	priv->eee_100tx_disable =
+		of_property_read_bool(of_node, "realtek,disable-eee-100tx");
+
+	phydev->priv = priv;
+
+	return 0;
+}
+
 static struct phy_driver realtek_drvs[] = {
 	{
 		.phy_id         = 0x00008201,
@@ -140,7 +198,9 @@ static struct phy_driver realtek_drvs[] = {
 		.phy_id_mask	= 0x001fffff,
 		.features	= PHY_GBIT_FEATURES,
 		.flags		= PHY_HAS_INTERRUPT,
+		.probe		= &rtl8211x_phy_probe,
 		.config_aneg	= genphy_config_aneg,
+		.config_init	= &rtl8211x_config_init,
 		.read_status	= genphy_read_status,
 		.ack_interrupt	= rtl821x_ack_interrupt,
 		.config_intr	= rtl8211e_config_intr,
@@ -152,7 +212,9 @@ static struct phy_driver realtek_drvs[] = {
 		.phy_id_mask	= 0x001fffff,
 		.features	= PHY_GBIT_FEATURES,
 		.flags		= PHY_HAS_INTERRUPT,
+		.probe		= &rtl8211x_phy_probe,
 		.config_aneg	= &genphy_config_aneg,
+		.config_init	= &rtl8211x_config_init,
 		.read_status	= &genphy_read_status,
 		.ack_interrupt	= &rtl821x_ack_interrupt,
 		.config_intr	= &rtl8211e_config_intr,
@@ -164,6 +226,7 @@ static struct phy_driver realtek_drvs[] = {
 		.phy_id_mask	= 0x001fffff,
 		.features	= PHY_GBIT_FEATURES,
 		.flags		= PHY_HAS_INTERRUPT,
+		.probe		= &rtl8211x_phy_probe,
 		.config_aneg	= &genphy_config_aneg,
 		.config_init	= &rtl8211f_config_init,
 		.read_status	= &genphy_read_status,
-- 
2.7.4


* [PATCH net 0/3] Fix OdroidC2 Gigabit Tx link issue
From: Jerome Brunet @ 2016-11-15 14:29 UTC (permalink / raw)
  To: netdev, devicetree, Florian Fainelli
  Cc: Jerome Brunet, Carlo Caione, Kevin Hilman, Giuseppe Cavallaro,
	Alexandre TORGUE, Martin Blumenstingl, Andre Roth, Neil Armstrong,
	linux-amlogic, linux-arm-kernel, linux-kernel

This patchset fixes an issue with the OdroidC2 board (DWMAC + RTL8211F).
Initially reported as a low Tx throughput issue at gigabit speed, the
platform enters LPI too often. This eventually breaks the link (both Tx
and Rx), and requires bringing the interface down and up again to get the
Rx path working again.

The root cause of this issue is not fully understood yet, but disabling EEE
advertisement on the PHY prevents this feature from being negotiated.
With this change, the link is stable and reliable, with the expected
throughput performance.

The patchset adds options in the realtek phy driver to disable EEE
advertisement, through device tree, for the phy version supporting EEE.
Then EEE is disabled in the OdroidC2 device tree for Gigabit speed.
100M is not affected by this issue.
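
The masking step at the heart of the series can be shown on its own. The sketch below is plain user-space C, not the driver code: the two bit values are assumed from the IEEE 802.3 EEE advertisement register layout and are defined locally, and `eee_adv_apply()` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions assumed from the IEEE 802.3 EEE advertisement register;
 * named locally so the sketch stands alone. */
#define EEE_ADV_100TX 0x0002u
#define EEE_ADV_1000T 0x0004u

/* Return the advertisement value with the requested EEE modes masked out. */
static uint16_t eee_adv_apply(uint16_t adv, bool disable_1000t,
			      bool disable_100tx)
{
	if (disable_1000t)
		adv &= (uint16_t)~EEE_ADV_1000T;
	if (disable_100tx)
		adv &= (uint16_t)~EEE_ADV_100TX;
	return adv;
}
```

With only the 1000T flag set, as in the OdroidC2 device tree change, 100TX advertisement is left untouched.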

Jerome Brunet (3):
  net: phy: realtek: add eee advertisement disable options
  dt-bindings: net: add DT bindings for realtek phys
  ARM64: dts: meson: odroidc2: disable 1000t-eee advertisement

 .../devicetree/bindings/net/realtek-phy.txt        | 20 +++++++
 .../arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts | 15 +++++
 drivers/net/phy/realtek.c                          | 65 +++++++++++++++++++++-
 3 files changed, 99 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/devicetree/bindings/net/realtek-phy.txt

-- 
2.7.4



* Re: [PATCH net][v2] bpf: fix range arithmetic for bpf map access
From: Josef Bacik @ 2016-11-15 14:20 UTC (permalink / raw)
  To: Jann Horn, Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, David S. Miller, netdev
In-Reply-To: <CAG48ez1z8wJstz84-ekY5Ed8oNgpT73Xc18Or7RboOoBnTE03w@mail.gmail.com>

On 11/15/2016 08:47 AM, Jann Horn wrote:
> On Tue, Nov 15, 2016 at 4:10 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Mon, Nov 14, 2016 at 03:45:36PM -0500, Josef Bacik wrote:
>>> I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
>>> invalid accesses to bpf map entries.  Fix this up by doing a few things
>>>
>>> 1) Kill BPF_MOD support.  This doesn't actually get used by the compiler in real
>>> life and just adds extra complexity.
>>>
>>> 2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the
>>> minimum value to 0 for positive AND's.
>>>
>>> 3) Don't do operations on the ranges if they are set to the limits, as they are
>>> by definition undefined, and allowing arithmetic operations on those values
>>> could make them appear valid when they really aren't.
>>>
>>> This fixes the testcase provided by Jann as well as a few other theoretical
>>> problems.
>>>
>>> Reported-by: Jann Horn <jannh@google.com>
>>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>>
>> lgtm.
>> Acked-by: Alexei Starovoitov <ast@kernel.org>
>>
>> Jann, could you please double check the logic.
>> Thanks!
>
> I found some more potential issues, maybe Josef and you can tell me whether I
> understood these correctly.
>
>
> /* If the source register is a random pointer then the
> * min_value/max_value values represent the range of the known
> * accesses into that value, not the actual min/max value of the
> * register itself.  In this case we have to reset the reg range
> * values so we know it is not safe to look at.
> */
> if (regs[insn->src_reg].type != CONST_IMM &&
>    regs[insn->src_reg].type != UNKNOWN_VALUE) {
> min_val = BPF_REGISTER_MIN_RANGE;
> max_val = BPF_REGISTER_MAX_RANGE;
> }
>
> Why only the source register? Why not the destination register?
>

The destination register is what we are doing arithmetic to, so we don't
actually care about the type of register it is, as we'll interpret the values at
a later point.  If the source register however is a MAP or some other pointer,
then we know that the min/max values only apply to the range on the actual value
of the register, rather than the possible range of values.  Said another way,
if the src register is a MAP then the range is [SRC_REG.imm+SRC_REG.min_value,
SRC_REG.imm+SRC_REG.max_value] instead of [SRC_REG.min_value, SRC_REG.max_value].

So this is the same behavior for the destination register for sure, but we don't
actually care about it at this point.  If the src_reg meets these criteria then
we certainly don't know anything about the dest_reg and reset it below with this

         if (min_val == BPF_REGISTER_MIN_RANGE &&
             max_val == BPF_REGISTER_MAX_RANGE) {
                 reset_reg_range_values(regs, insn->dst_reg);
                 return;
         }

Then in check_alu_op() if we weren't doing our operation on a MAP_PTR then we 
set the register to unknown and carry on.
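
A stripped-down model of that shortcut, for illustration only: the sentinel values, `struct reg_range` and both helpers below are hypothetical and merely mimic the shape of the verifier logic described above, using addition as the example operation.

```c
#include <stdint.h>

/* Hypothetical sentinels for "range unknown"; values chosen for illustration. */
#define RANGE_MIN_UNKNOWN (-1)
#define RANGE_MAX_UNKNOWN (1024ULL * 1024 * 1024)

struct reg_range {
	int64_t  min_value;
	uint64_t max_value;
};

static void reset_range(struct reg_range *r)
{
	r->min_value = RANGE_MIN_UNKNOWN;
	r->max_value = RANGE_MAX_UNKNOWN;
}

/* Model of "dst += src" on tracked ranges. */
static void range_add(struct reg_range *dst, const struct reg_range *src)
{
	/* Nothing is known about src, so nothing can be known about dst:
	 * skip the arithmetic rather than produce a plausible-looking range. */
	if (src->min_value == RANGE_MIN_UNKNOWN &&
	    src->max_value == RANGE_MAX_UNKNOWN) {
		reset_range(dst);
		return;
	}
	dst->min_value += src->min_value;
	dst->max_value += src->max_value;
}
```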

>
> /* We don't know anything about what was done to this register, mark it
> * as unknown.
> */
> if (min_val == BPF_REGISTER_MIN_RANGE &&
>    max_val == BPF_REGISTER_MAX_RANGE) {
> reset_reg_range_values(regs, insn->dst_reg);
> return;
> }
>
> Why have this special case at all? Since min_val and max_val are
> basically independent, this code shouldn't be necessary, right?
>

They are the values of the source register. As I explained above, if we know
nothing about the source register then we don't even bother doing the math; we
also know nothing about the destination register.  This is a shortcut to keep us
from doing potentially dangerous things when we already know it's invalid.

>
> static void check_reg_overflow(struct bpf_reg_state *reg)
> {
> if (reg->max_value > BPF_REGISTER_MAX_RANGE)
> reg->max_value = BPF_REGISTER_MAX_RANGE;
> if (reg->min_value < BPF_REGISTER_MIN_RANGE ||
>    reg->min_value > BPF_REGISTER_MAX_RANGE)
> reg->min_value = BPF_REGISTER_MIN_RANGE;
> }
>
> Why is this asymmetric? Why is `reg->max_value <
> BPF_REGISTER_MIN_RANGE` not important, but `reg->min_value >
> BPF_REGISTER_MAX_RANGE` is?

Because max_value is unsigned, so if we do reg->max_value = 
BPF_REGISTER_MIN_RANGE; and then do this we'll still get max_value reset to 
MAX_RANGE.
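
The signed/unsigned point can be checked with a small stand-alone model. The limits and `struct reg_model` here are illustrative assumptions, not the kernel's definitions; what matters is that max_value is unsigned while min_value is signed.

```c
#include <stdint.h>

#define MODEL_MIN_RANGE (-1)                      /* illustrative */
#define MODEL_MAX_RANGE (1024ULL * 1024 * 1024)   /* illustrative */

struct reg_model {
	int64_t  min_value;  /* signed lower bound */
	uint64_t max_value;  /* unsigned upper bound */
};

static void clamp_range(struct reg_model *reg)
{
	/* A negative value stored into the unsigned field wraps to a huge
	 * number, so this one comparison also catches "max below MIN". */
	if (reg->max_value > MODEL_MAX_RANGE)
		reg->max_value = MODEL_MAX_RANGE;
	if (reg->min_value < MODEL_MIN_RANGE ||
	    reg->min_value > (int64_t)MODEL_MAX_RANGE)
		reg->min_value = MODEL_MIN_RANGE;
}
```

Because of the wraparound, no separate "max_value < MIN" test is needed on the unsigned side, whereas the signed min_value needs both comparisons.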

>
>
> In states_equal():
> if (rold->type == NOT_INIT ||
>    (rold->type == UNKNOWN_VALUE && rcur->type != NOT_INIT))   <------------
> continue;
>
> I think this is broken in code like the following:
>
> int value;
> if (condition) {
>   value = 1; // visited first by verifier
> } else {
>   value = 1000000; // visited second by verifier
> }
> int dummy = 1; // states seem to converge here, but actually don't
> map[value] = 1234;
>
> `value` would be an UNKNOWN_VALUE for both paths, right? So
> states_equal() would decide that the states converge after the
> conditionally executed code?
>

Value would be CONST_IMM for both paths, and wouldn't match so they wouldn't 
converge.  I think I understood your question right, let me know if I'm 
addressing the wrong part of it.

Do my explanations make sense?  I'm doing this first thing in the morning so I'm 
still a little foggy, let me know if things still aren't clear.  Thanks,

Josef


* RE: [PATCH net-next v5] cadence: Add LSO support.
From: Rafal Ozieblo @ 2016-11-15 14:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, nicolas.ferre@atmel.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <1479215490.8455.122.camel@edumazet-glaptop3.roam.corp.google.com>

> From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
> Sent: 15 November 2016 14:12
> To: Rafal Ozieblo
> Cc: David Miller; nicolas.ferre@atmel.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH net-next v5] cadence: Add LSO support.
>
> On Tue, 2016-11-15 at 07:07 +0000, Rafal Ozieblo wrote:
> > > > > If UFO is in use it should not silently disable UDP checksums.
> > > > > 
> > > > > If you cannot support UFO with proper checksumming, then you cannot enable support for that feature.
> > > > 
> > > > According Cadence Gigabit Ethernet MAC documentation:
> > > > 
> > > > "Hardware will not calculate the UDP checksum or modify the UDP 
> > > > checksum field. Therefore software must set a value of zero in the 
> > > > checksum field in the UDP header (in the first payload buffer) to indicate to the receiver that the UDP datagram does not include a checksum."
> > > > 
> > > > It is hardware requirement.
> > >
> > > I do not doubt that it is a hardware restriction.
> > >
> > > But I am saying that you cannot enable this feature under Linux if this is how it operates on your hardware.
> > 
> > Would it be good to enable UFO conditionally with some internal define? Ex.:
> > 
> > +#ifdef MACB_ENABLE_UFO
> > +#define MACB_NETIF_LSO         (NETIF_F_TSO | NETIF_F_UFO)
> > +#else
> > +#define MACB_NETIF_LSO         (NETIF_F_TSO)
> > +#endif
> > 
> > I could add precise comment here that ufo is possible only without checksum.
> > 
> > Or maybe I could enable it from module_params or device-tree (like: drivers/net/ethernet/neterion/s2io.c).
>
> No you can not do that.
>
> 1) That would violate UDP specs.
> 2) Module params are no longer accepted.
> 3) Comments in a driver source code would only help the driver maintainer, not users to make their mind.
>
> Only way would be to propagate the intent of the sender.
>
> Only the sender application can decide to generate UDP checksums or not.
>
> Your driver ndo_features_check() could then force software segmentation fallback if the user did not ask to disable UDP checksums, and the packet is UFO.
>
> (look for UDP_NO_CHECK6_TX, and SO_NO_CHECK )
>
> Problem is complex, because the skb has no marker, only the socket has.
>
> And socket state could change between packets, and packets can stay in an intermediate qdisc before hitting device driver. So looking at
> skb->sk from your ndo_features_check() would be racy.
>
> What use case would you have precisely ?

I have talked with the hardware team who designed and created the gem IP.  The conclusion is that there is no need to zero the checksum for UFO.
I have tested UFO without zeroing the UDP checksum on cadence gem. It works fine.
I'll investigate further and test UFO again. I'll send version 6 of the LSO patch either with UFO without zeroing or without UFO at all.


* Re: amd-xgbe: Add support for MDIO attached PHYs
From: Tom Lendacky @ 2016-11-15 14:17 UTC (permalink / raw)
  To: Colin Ian King; +Cc: netdev@vger.kernel.org, David S. Miller
In-Reply-To: <6e653058-6442-5bcf-7e02-3136780caffb@canonical.com>

On 11/15/2016 7:07 AM, Colin Ian King wrote:
> Hi,
> 
> Commit:
> 
> amd-xgbe: Add support for MDIO attached PHYs
> 
>     Use the phylib support in the kernel to communicate with and control an
>     MDIO attached PHY. Use the hardware's MDIO communication mechanism to
>     communicate with the PHY.
> 
> 
> +static int xgbe_clr_gpio(struct xgbe_prv_data *pdata, unsigned int gpio)
> +{
> +       unsigned int reg;
> +
> +       if (gpio > 16)
> +               return -EINVAL;
> 
> is gpio in the range 0..15?
> 
> 	if (gpio > 15)
> 		return -EINVAL;

Yes, the GPIO range is 0 to 15. I'll submit a patch to change the
constraint check.

Thanks,
Tom

> 
> +
> +       reg = XGMAC_IOREAD(pdata, MAC_GPIOSR);
> +
> +       reg &= ~(1 << (gpio + 16));
> 
> if gpio is 16, we get 1 << 32 which I believe is undefined behaviour.
> 
> +       XGMAC_IOWRITE(pdata, MAC_GPIOSR, reg);
> +
> +       return 0;
> +}
> 
> 
> Same applies for function xgbe_clr_gpio().
> 
> Colin
> 


* Re: [PATCH net-next v3 2/7] vxlan: avoid checking socket multiple times.
From: Jiri Benc @ 2016-11-15 14:15 UTC (permalink / raw)
  To: Pravin B Shelar; +Cc: netdev
In-Reply-To: <1479098638-4921-3-git-send-email-pshelar@ovn.org>

Pravin,

please CC reviewers of the previous version when submitting a new
version. You'll get faster reviews that way.

On Sun, 13 Nov 2016 20:43:53 -0800, Pravin B Shelar wrote:
> @@ -2069,11 +2069,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>  		struct dst_entry *ndst;
>  		u32 rt6i_flags;
>  
> -		if (!sock6)
> -			goto drop;
> -		sk = sock6->sock->sk;
> -
> -		ndst = vxlan6_get_route(vxlan, skb,
> +		ndst = vxlan6_get_route(vxlan, sock6, skb,
>  					rdst ? rdst->remote_ifindex : 0, tos,
>  					label, &dst->sin6.sin6_addr,
>  					&src->sin6.sin6_addr,
> @@ -2093,6 +2089,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>  			goto tx_error;
>  		}
>  
> +		sk = sock6->sock->sk;
>  		/* Bypass encapsulation if the destination is local */
>  		rt6i_flags = ((struct rt6_info *)ndst)->rt6i_flags;
>  		if (!info && rt6i_flags & RTF_LOCAL &&

This moves the sk assignment from one arbitrary place to a different
arbitrary place, while it would be best to just remove it and open code
sock6->sock->sk in the call to udp_tunnel6_xmit_skb. But patch 6 does
that later, so whatever.

Acked-by: Jiri Benc <jbenc@redhat.com>


* Re: [PATCH] amd-xgbe: fix unsigned comparison against less than zero
From: Tom Lendacky @ 2016-11-15 14:09 UTC (permalink / raw)
  To: Colin King, netdev; +Cc: linux-kernel
In-Reply-To: <20161115121842.5774-1-colin.king@canonical.com>

On 11/15/2016 6:18 AM, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> Comparing unsigned int ret to less than zero for an error status
> check is never true.  Fix this by making ret a signed int. Reduce
> scope of ret too.
> 
> Found with static analysis by CoverityScan, CID 1377750

Thanks Colin, this was already identified by someone else and I
submitted the patch yesterday.

Thanks,
Tom

> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>  drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c b/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
> index 4ba4332..168507e 100644
> --- a/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
> +++ b/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
> @@ -2346,7 +2346,7 @@ static bool xgbe_phy_valid_speed(struct xgbe_prv_data *pdata, int speed)
>  static int xgbe_phy_link_status(struct xgbe_prv_data *pdata, int *an_restart)
>  {
>  	struct xgbe_phy_data *phy_data = pdata->phy_data;
> -	unsigned int ret, reg;
> +	unsigned int reg;
>  
>  	*an_restart = 0;
>  
> @@ -2365,7 +2365,8 @@ static int xgbe_phy_link_status(struct xgbe_prv_data *pdata, int *an_restart)
>  
>  	if (phy_data->phydev) {
>  		/* Check external PHY */
> -		ret = phy_read_status(phy_data->phydev);
> +		int ret = phy_read_status(phy_data->phydev);
> +
>  		if (ret < 0)
>  			return 0;
>  
> 
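
To illustrate the point of the commit message, here is a minimal user-space sketch of why an error check on an unsigned variable never fires. `read_status_stub` is a hypothetical stand-in for a call that returns a negative value on failure; it is not the phylib function, and the error value -5 is chosen only for the example.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for a call that returns a negative value on failure. */
static int read_status_stub(int simulate_error)
{
	return simulate_error ? -5 : 0;
}

/* Storing the result in an unsigned int converts -5 to UINT_MAX - 4,
 * a large positive value, so a later `ret < 0` test can never be true.
 */
static unsigned int status_as_unsigned(int simulate_error)
{
	unsigned int ret = (unsigned int)read_status_stub(simulate_error);

	return ret;
}

/* With a signed variable the error check behaves as intended. */
static bool failed_signed(int simulate_error)
{
	int ret = read_status_stub(simulate_error);

	return ret < 0;
}
```

Keeping the variable signed and declaring it in the narrowest scope, as the patch does, avoids the problem.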


* Re: linux-next: BUG: unable to handle kernel NULL pointer dereference in __sk_mem_raise_allocated()
From: Paolo Abeni @ 2016-11-15 14:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrei Vagin, Linux Kernel Network Developers
In-Reply-To: <1479218524.8455.123.camel@edumazet-glaptop3.roam.corp.google.com>

On Tue, 2016-11-15 at 06:02 -0800, Eric Dumazet wrote:
> On Tue, 2016-11-15 at 10:26 +0100, Paolo Abeni wrote:
> > Hi,
> > 
> > On Mon, 2016-11-14 at 15:24 -0800, Andrei Vagin wrote:
> > > Our test system detected a kernel oops. Looks like a problem in the
> > > "udp: refactor memory accounting" series.
> > 
> > My fault: I missed udplite in my tests.
> > 
> > Thank you for reporting.
> > 
> > I'm fine with Eric's patch, setting both .memory_allocated
> > and .sysctl_mem.
> > We could also remove .backlog_rcv, but it's not strictly needed.
> 
> That is a good point, can you cook the official combined patch ?

Sure, I'll send ASAP, after a little testing.


* Re: linux-next: BUG: unable to handle kernel NULL pointer dereference in __sk_mem_raise_allocated()
From: Eric Dumazet @ 2016-11-15 14:02 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Andrei Vagin, Linux Kernel Network Developers
In-Reply-To: <1479202010.4660.11.camel@redhat.com>

On Tue, 2016-11-15 at 10:26 +0100, Paolo Abeni wrote:
> Hi,
> 
> On Mon, 2016-11-14 at 15:24 -0800, Andrei Vagin wrote:
> > Our test system detected a kernel oops. Looks like a problem in the
> > "udp: refactor memory accounting" series.
> 
> My fault: I missed udplite in my tests.
> 
> Thank you for reporting.
> 
> I'm fine with Eric's patch, setting both .memory_allocated
> and .sysctl_mem.
> We could also remove .backlog_rcv, but it's not strictly needed.

That is a good point, can you cook the official combined patch ?

Thanks !


* RE: [PATCH net-next v13 0/8] openvswitch: support for layer 3 encapsulated packets
From: Yang, Yi Y @ 2016-11-15 13:57 UTC (permalink / raw)
  To: Jiri Benc, netdev@vger.kernel.org
  Cc: dev@openvswitch.org, Pravin Shelar, Lorand Jakab, Simon Horman,
	Thadeu Lima de Souza Cascardo
In-Reply-To: <cover.1478791347.git.jbenc@redhat.com>

Hi, Jiri

I'm very glad to see you're continuing this work :-). I asked Simon about this twice, but nobody replied. I also remember that Cascardo has a patch set that works together with this one; I asked Cascardo, but got no response either. Will you continue Cascardo's "create tunnel devices using rtnetlink interface" patch set?

I tested the old v3, and it works with the in-kernel vxlan module. However, if I build ovs with the option "--with-linux=/lib/modules/`uname -r`/build", the ovs vxlan code is built into the vport_vxlan module, and when I create a vxlan-gpe port the kernel automatically loads the in-kernel vxlan module instead of using the APIs in the vport_vxlan module.

Cascardo, are you still working on this?

-----Original Message-----
From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Jiri Benc
Sent: Thursday, November 10, 2016 11:28 PM
To: netdev@vger.kernel.org
Cc: dev@openvswitch.org; Pravin Shelar <pshelar@ovn.org>; Lorand Jakab <lojakab@cisco.com>; Simon Horman <simon.horman@netronome.com>
Subject: [PATCH net-next v13 0/8] openvswitch: support for layer 3 encapsulated packets

At the core of this patch set is removing the assumption in Open vSwitch datapath that all packets have Ethernet header.

The implementation relies on the presence of pop_eth and push_eth actions in datapath flows to facilitate adding and removing Ethernet headers as appropriate. The construction of such flows is left up to user-space.

This series is based on work by Simon Horman, Lorand Jakab, Thomas Morin and others. I kept Lorand's and Simon's s-o-b in the patches that are derived from v11 to record their authorship of parts of the code.

Changes from v12 to v13:

* Addressed Pravin's feedback.
* Removed the GRE vport conversion patch; L3 GRE ports should be created by
  rtnetlink instead.

Main changes from v11 to v12:

* The patches were restructured and split differently for easier review.
* They were rebased and adjusted to the current net-next. Especially MPLS
  handling is different (and easier) thanks to the recent MPLS GSO rework.
* Several bugs were discovered and fixed. The most notable is fragment
  handling: header adjustment for ARPHRD_NONE devices on tx needs to be done
  after refragmentation, not before it. This required significant changes in
  the patchset. Another one is stricter checking of attributes (match on L2
  vs. L3 packet) at the kernel level.
* Instead of is_layer3 bool, a mac_proto field is used.

Jiri Benc (8):
  openvswitch: use hard_header_len instead of hardcoded ETH_HLEN
  openvswitch: add mac_proto field to the flow key
  openvswitch: pass mac_proto to ovs_vport_send
  openvswitch: support MPLS push and pop for L3 packets
  openvswitch: add processing of L3 packets
  openvswitch: netlink: support L3 packets
  openvswitch: add Ethernet push and pop actions
  openvswitch: allow L3 netdev ports

 include/uapi/linux/openvswitch.h |  15 ++++
 net/openvswitch/actions.c        | 111 +++++++++++++++++-------
 net/openvswitch/datapath.c       |  13 +--
 net/openvswitch/flow.c           | 105 +++++++++++++++++------
 net/openvswitch/flow.h           |  22 +++++
 net/openvswitch/flow_netlink.c   | 179 ++++++++++++++++++++++++++-------------
 net/openvswitch/vport-netdev.c   |   9 +-
 net/openvswitch/vport.c          |  31 +++++--
 net/openvswitch/vport.h          |   2 +-
 9 files changed, 353 insertions(+), 134 deletions(-)


* Re: [PATCH net][v2] bpf: fix range arithmetic for bpf map access
From: Jann Horn @ 2016-11-15 13:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Josef Bacik, Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	netdev
In-Reply-To: <20161115031016.GA10323@ast-mbp.thefacebook.com>

On Tue, Nov 15, 2016 at 4:10 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Nov 14, 2016 at 03:45:36PM -0500, Josef Bacik wrote:
>> I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
>> invalid accesses to bpf map entries.  Fix this up by doing a few things
>>
>> 1) Kill BPF_MOD support.  This doesn't actually get used by the compiler in real
>> life and just adds extra complexity.
>>
>> 2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the
>> minimum value to 0 for positive AND's.
>>
>> 3) Don't do operations on the ranges if they are set to the limits, as they are
>> by definition undefined, and allowing arithmetic operations on those values
>> could make them appear valid when they really aren't.
>>
>> This fixes the testcase provided by Jann as well as a few other theoretical
>> problems.
>>
>> Reported-by: Jann Horn <jannh@google.com>
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>
> lgtm.
> Acked-by: Alexei Starovoitov <ast@kernel.org>
>
> Jann, could you please double check the logic.
> Thanks!

I found some more potential issues, maybe Josef and you can tell me whether I
understood these correctly.

/* If the source register is a random pointer then the
 * min_value/max_value values represent the range of the known
 * accesses into that value, not the actual min/max value of the
 * register itself.  In this case we have to reset the reg range
 * values so we know it is not safe to look at.
 */
if (regs[insn->src_reg].type != CONST_IMM &&
    regs[insn->src_reg].type != UNKNOWN_VALUE) {
	min_val = BPF_REGISTER_MIN_RANGE;
	max_val = BPF_REGISTER_MAX_RANGE;
}

Why only the source register? Why not the destination register?

/* We don't know anything about what was done to this register, mark it
 * as unknown.
 */
if (min_val == BPF_REGISTER_MIN_RANGE &&
    max_val == BPF_REGISTER_MAX_RANGE) {
	reset_reg_range_values(regs, insn->dst_reg);
	return;
}

Why have this special case at all? Since min_val and max_val are
basically independent, this code shouldn't be necessary, right?

static void check_reg_overflow(struct bpf_reg_state *reg)
{
	if (reg->max_value > BPF_REGISTER_MAX_RANGE)
		reg->max_value = BPF_REGISTER_MAX_RANGE;
	if (reg->min_value < BPF_REGISTER_MIN_RANGE ||
	    reg->min_value > BPF_REGISTER_MAX_RANGE)
		reg->min_value = BPF_REGISTER_MIN_RANGE;
}

Why is this asymmetric? Why is `reg->max_value <
BPF_REGISTER_MIN_RANGE` not important, but `reg->min_value >
BPF_REGISTER_MAX_RANGE` is?
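
For comparison, a symmetric version of the clamp might look like the sketch below. This is only an illustration under stated assumptions: both bounds are modelled as signed 64-bit values, the limit values are placeholders, and the names `reg_range`, `REG_MIN_RANGE`, `REG_MAX_RANGE` and `clamp_symmetric` are hypothetical rather than taken from the verifier.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder limits, assumed for illustration only. */
#define REG_MIN_RANGE (-(INT64_C(1) << 30))
#define REG_MAX_RANGE (INT64_C(1) << 30)

/* Hypothetical model of a tracked register range, both bounds signed. */
struct reg_range {
	int64_t min_value;
	int64_t max_value;
};

static bool within_limits(int64_t v)
{
	return v >= REG_MIN_RANGE && v <= REG_MAX_RANGE;
}

/* Symmetric clamp: whichever bound leaves [REG_MIN_RANGE, REG_MAX_RANGE],
 * in either direction, is reset to its "unknown" limit.
 */
static void clamp_symmetric(struct reg_range *r)
{
	if (!within_limits(r->max_value))
		r->max_value = REG_MAX_RANGE;
	if (!within_limits(r->min_value))
		r->min_value = REG_MIN_RANGE;
}
```

Whether the real code needs the extra `max_value` case depends on the actual types of the two fields, which this sketch does not model.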

In states_equal():
if (rold->type == NOT_INIT ||
    (rold->type == UNKNOWN_VALUE && rcur->type != NOT_INIT))   <------------
	continue;

I think this is broken in code like the following:

int value;
if (condition) {
  value = 1; // visited first by verifier
} else {
  value = 1000000; // visited second by verifier
}
int dummy = 1; // states seem to converge here, but actually don't
map[value] = 1234;

`value` would be an UNKNOWN_VALUE for both paths, right? So
states_equal() would decide that the states converge after the
conditionally executed code?
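
To make the concern concrete, here is a small user-space model, assuming (as the question above does) that both paths leave `value` typed as UNKNOWN_VALUE but with different ranges. The struct and both comparison functions are hypothetical simplifications, not the verifier's code.

```c
#include <stdbool.h>
#include <stdint.h>

enum reg_type { NOT_INIT, UNKNOWN_VALUE };

/* Hypothetical, simplified register state. */
struct reg_model {
	enum reg_type type;
	int64_t min_value;
	int64_t max_value;
};

/* Type-only comparison, mirroring the marked condition:
 * ranges are not looked at.
 */
static bool equal_by_type_only(const struct reg_model *rold,
			       const struct reg_model *rcur)
{
	return rold->type == NOT_INIT ||
	       (rold->type == UNKNOWN_VALUE && rcur->type != NOT_INIT);
}

/* Range-aware comparison: pruning is only safe when the already
 * verified (old) range covers the current one.
 */
static bool equal_with_ranges(const struct reg_model *rold,
			      const struct reg_model *rcur)
{
	if (!equal_by_type_only(rold, rcur))
		return false;
	if (rold->type == NOT_INIT)
		return true;
	return rold->min_value <= rcur->min_value &&
	       rold->max_value >= rcur->max_value;
}
```

With the first path recorded as the range [1, 1] and the second as [1000000, 1000000], the type-only check treats the states as equal, while the range-aware one does not.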


* Re: [PATCH net-next v3 3/5] ethtool: (uapi) Add ETHTOOL_PHY_DOWNSHIFT to PHY tunables
From: Andrew Lunn @ 2016-11-15 13:45 UTC (permalink / raw)
  To: Allan W. Nielsen; +Cc: netdev, raju.lakkaraju
In-Reply-To: <1479205204-27768-4-git-send-email-allan.nielsen@microsemi.com>

On Tue, Nov 15, 2016 at 11:20:02AM +0100, Allan W. Nielsen wrote:
> From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
> 
> For operation in cabling environments that are incompatible with
> 1000BASE-T, PHY device may provide an automatic link speed downshift
> operation. When enabled, the device automatically changes its 1000BASE-T
> auto-negotiation to the next slower speed after a configured number of
> failed attempts at 1000BASE-T.  This feature is useful in setting up in
> networks using older cable installations that include only pairs A and B,
> and not pairs C and D.
> 
> Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
> Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew


* [RFC PATCH 1/2] net: use cmpxchg instead of spinlock in ptr rings
From: Jesper Dangaard Brouer @ 2016-11-15 13:32 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: brouer, John Fastabend, Michael S. Tsirkin


(It looks like my message didn't reach the netdev list because I sent it
from the wrong email address, so I am forwarding it again.)

On Thu, 10 Nov 2016 20:44:08 -0800 John Fastabend <john.fastabend@gmail.com> wrote:

> ---
>  include/linux/ptr_ring_ll.h |  136 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/skb_array.h   |   25 ++++++++
>  2 files changed, 161 insertions(+)
>  create mode 100644 include/linux/ptr_ring_ll.h
> 
> diff --git a/include/linux/ptr_ring_ll.h b/include/linux/ptr_ring_ll.h
> new file mode 100644
> index 0000000..bcb11f3
> --- /dev/null
> +++ b/include/linux/ptr_ring_ll.h
> @@ -0,0 +1,136 @@
> +/*
> + *	Definitions for the 'struct ptr_ring_ll' datastructure.
> + *
> + *	Author:
> + *		John Fastabend <john.r.fastabend@intel.com>  
[...]
> + *
> + *	This is a limited-size FIFO maintaining pointers in FIFO order, with
> + *	one CPU producing entries and another consuming entries from a FIFO.
> + *	extended from ptr_ring_ll to use cmpxchg over spin lock.  

It sounds like this is a Single Producer Single Consumer (SPSC)
implementation, but your implementation is actually Multi Producer
Multi Consumer (MPMC) capable.

The implementation looks a lot like my alf_queue[1] implementation:
 [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h

If the primary use-case is one CPU producing and another consuming,
then the normal ptr_ring (skb_array) will actually be faster!

The reason is ptr_ring avoids bouncing a cache-line between the CPUs on
every ring access.  This is achieved by having the checks for full
(__ptr_ring_full) and empty (__ptr_ring_empty) use the contents of the
array (NULL value).
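
As a rough illustration of that idea, here is a single-threaded user-space sketch in which the slot contents themselves serve as the full/empty indicator. It is not the kernel's ptr_ring: it omits locking and memory barriers entirely, and `null_ring`, `RING_SIZE` and the function names are made up for the example.

```c
#include <stddef.h>

#define RING_SIZE 4	/* illustrative capacity */

struct null_ring {
	void *queue[RING_SIZE];	/* NULL marks an empty slot */
	unsigned int producer;	/* index touched only by the producer */
	unsigned int consumer;	/* index touched only by the consumer */
};

/* Full check: the next producer slot still holds a pointer.
 * The producer never reads the consumer index.
 */
static int null_ring_produce(struct null_ring *r, void *ptr)
{
	if (ptr == NULL || r->queue[r->producer] != NULL)
		return -1;
	r->queue[r->producer] = ptr;
	r->producer = (r->producer + 1) % RING_SIZE;
	return 0;
}

/* Empty check: the next consumer slot is NULL.
 * The consumer never reads the producer index.
 */
static void *null_ring_consume(struct null_ring *r)
{
	void *ptr = r->queue[r->consumer];

	if (ptr == NULL)
		return NULL;
	r->queue[r->consumer] = NULL;
	r->consumer = (r->consumer + 1) % RING_SIZE;
	return ptr;
}
```

Each side only touches its own index plus the slot it is working on, which is why the index cache-line does not have to bounce between CPUs.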

I actually implemented two micro-benchmarks to measure the difference
between skb_array[2] and alf_queue[3]:
 [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
 [3] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/alf_queue_parallel01.c


> + */
> +
> +#ifndef _LINUX_PTR_RING_LL_H
> +#define _LINUX_PTR_RING_LL_H 1
> +  
[...]
> +
> +struct ptr_ring_ll {
> +	u32 prod_size;
> +	u32 prod_mask;
> +	u32 prod_head;
> +	u32 prod_tail;
> +	u32 cons_size;
> +	u32 cons_mask;
> +	u32 cons_head;
> +	u32 cons_tail;
> +
> +	void **queue;
> +};  

Your implementation doesn't even split the consumer and producer into
different cachelines (which in practice doesn't help much due to how
the empty/full checks are performed).

> +
> +/* Note: callers invoking this in a loop must use a compiler barrier,
> + * for example cpu_relax(). Callers must hold producer_lock.
> + */
> +static inline int __ptr_ring_ll_produce(struct ptr_ring_ll *r, void *ptr)
> +{
> +	u32 ret, head, tail, next, slots, mask;
> +
> +	do {
> +		head = READ_ONCE(r->prod_head);
> +		mask = READ_ONCE(r->prod_mask);
> +		tail = READ_ONCE(r->cons_tail);  

The problem occurs here: the producer needs to read the consumer's
tail to determine whether the queue is already full (i.e. whether any
slots are available).  Thus, the next "consumer-CPU" will see the
cacheline in the wrong state (Modified/Invalid or Shared).

> +
> +		slots = mask + tail - head;
> +		if (slots < 1)
> +			return -ENOMEM;
> +
> +		next = head + 1;
> +		ret = cmpxchg(&r->prod_head, head, next);
> +	} while (ret != head);
> +
> +	r->queue[head & mask] = ptr;
> +	smp_wmb();
> +
> +	while (r->prod_tail != head)
> +		cpu_relax();
> +
> +	r->prod_tail = next;
> +	return 0;
> +}
> +
> +static inline void *__ptr_ring_ll_consume(struct ptr_ring_ll *r)
> +{
> +	u32 ret, head, tail, next, slots, mask;
> +	void *ptr;
> +
> +	do {
> +		head = READ_ONCE(r->cons_head);
> +		mask = READ_ONCE(r->cons_mask);
> +		tail = READ_ONCE(r->prod_tail);  

Likewise, the consumer reads the producer's tail (for the empty check).

> +
> +		slots = tail - head;
> +		if (slots < 1)
> +			return ERR_PTR(-ENOMEM);
> +
> +		next = head + 1;
> +		ret = cmpxchg(&r->cons_head, head, next);
> +	} while (ret != head);
> +
> +	ptr = r->queue[head & mask];
> +	smp_rmb();
> +
> +	while (r->cons_tail != head)
> +		cpu_relax();
> +
> +	r->cons_tail = next;
> +	return ptr;
> +}
> +
> +static inline void **__ptr_ring_ll_init_queue_alloc(int size, gfp_t gfp)
> +{
> +	return kzalloc(ALIGN(size * sizeof(void *), SMP_CACHE_BYTES), gfp);
> +}
> +
> +static inline int ptr_ring_ll_init(struct ptr_ring_ll *r, int size, gfp_t gfp)
> +{
> +	r->queue = __ptr_ring_init_queue_alloc(size, gfp);
> +	if (!r->queue)
> +		return -ENOMEM;
> +
> +	r->prod_size = r->cons_size = size;
> +	r->prod_mask = r->cons_mask = size - 1;  

Shouldn't we have a check such as is_power_of_2(size)?  This code
looks like it depends on it.
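
A small user-space sketch of why the mask only works for power-of-two sizes; `is_pow2` and `slot_index` are stand-in names for this example, not kernel helpers:

```c
#include <stdbool.h>
#include <stdint.h>

/* True when exactly one bit is set in n. */
static bool is_pow2(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Maps a free-running counter to a slot using mask = size - 1.
 * This equals head % size only when size is a power of two.
 */
static uint32_t slot_index(uint32_t head, uint32_t size)
{
	return head & (size - 1);
}
```

For size 8 the mask is 0x7 and every slot 0..7 is reachable; for size 6 the mask is 0x5, so indices 2 and 3 are never produced and the mapping no longer matches head % size.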

> +	r->prod_tail = r->prod_head = 0;
> +	r->cons_tail = r->prod_tail = 0;
> +
> +	return 0;
> +}
> +  
[...]
> +#endif /* _LINUX_PTR_RING_LL_H  */
> diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
> index f4dfade..9b43dfd 100644
> --- a/include/linux/skb_array.h
> +++ b/include/linux/skb_array.h  
[...]
>  
> +static inline int skb_array_ll_produce(struct skb_array_ll *a, struct sk_buff *skb)
> +{
> +	return __ptr_ring_ll_produce(&a->ring, skb);
> +}
> +  
[...]
>  
> +static inline struct sk_buff *skb_array_ll_consume(struct skb_array_ll *a)
> +{
> +	return __ptr_ring_ll_consume(&a->ring);
> +}
> +  

Note that in the Multi Producer Multi Consumer (MPMC) use-case this type
of queue can be faster than the normal ptr_ring.  And in patch2 you
implement bulking, which is where the real benefit shows (in the MPMC
case) for this kind of queue.

What I would really like to see is a lock-free (locked cmpxchg) queue
implementation that, like ptr_ring, uses the array contents for the
empty/full check and still (somehow) supports bulking.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH V2] mac80211: Ignore VHT IE from peer with wrong rx_mcs_map
From: Johannes Berg @ 2016-11-15 13:20 UTC (permalink / raw)
  To: Filip Matusiak, linux-wireless
  Cc: marek.kwaczynski, davem, netdev, linux-kernel
In-Reply-To: <1478077466-4308-1-git-send-email-filip.matusiak@tieto.com>

On Wed, 2016-11-02 at 10:04 +0100, Filip Matusiak wrote:
> This is a workaround for VHT-enabled STAs which break the spec
> and have the VHT-MCS Rx map filled in with value 3 for all eight
> spacial streams, an example is AR9462 in AP mode.
> 
> As per spec, in section 22.1.1 Introduction to the VHT PHY
> A VHT STA shall support at least single spactial stream VHT-MCSs
> 0 to 7 (transmit and receive) in all supported channel widths.
> 
> Some devices in STA mode will get firmware assert when trying to
> associate, examples are QCA9377 & QCA6174.
> 
> Packet example of broken VHT Cap IE of AR9462:
> 
> [...]

Applied, thanks.

johannes
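
For readers unfamiliar with the encoding, here is a user-space sketch of how such a map could be detected. It assumes the 16-bit Rx map is already in host byte order and holds one 2-bit field per spatial stream, with the value 3 meaning "not supported"; the function and macro names are hypothetical and this is not the mac80211 code.

```c
#include <stdbool.h>
#include <stdint.h>

#define VHT_MCS_NOT_SUPPORTED	3u	/* 2-bit field value: stream unsupported */
#define VHT_MAX_STREAMS		8u

/* Returns true when every spatial stream is marked "not supported",
 * i.e. the map is 0xFFFF.
 */
static bool vht_rx_mcs_map_all_unsupported(uint16_t rx_mcs_map)
{
	unsigned int ss;

	for (ss = 0; ss < VHT_MAX_STREAMS; ss++) {
		if (((rx_mcs_map >> (2 * ss)) & 0x3u) != VHT_MCS_NOT_SUPPORTED)
			return false;
	}
	return true;
}
```

A peer advertising such a map contradicts the requirement quoted above that at least one spatial stream be supported, so its VHT capabilities can be ignored.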


* Re: [PATCH net-next v5] cadence: Add LSO support.
From: Eric Dumazet @ 2016-11-15 13:11 UTC (permalink / raw)
  To: Rafal Ozieblo
  Cc: David Miller, nicolas.ferre@atmel.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <BN3PR07MB251641C606D02892E2196960C9BF0@BN3PR07MB2516.namprd07.prod.outlook.com>

On Tue, 2016-11-15 at 07:07 +0000, Rafal Ozieblo wrote:
> > > > If UFO is in use it should not silently disable UDP checksums.
> > > > 
> > > > If you cannot support UFO with proper checksumming, then you cannot enable support for that feature.
> > > 
> > > According Cadence Gigabit Ethernet MAC documentation:
> > > 
> > > "Hardware will not calculate the UDP checksum or modify the UDP 
> > > checksum field. Therefore software must set a value of zero in the 
> > > checksum field in the UDP header (in the first payload buffer) to indicate to the receiver that the UDP datagram does not include a checksum."
> > > 
> > > It is hardware requirement.
> >
> > I do not doubt that it is a hardware restriction.
> >
> > But I am saying that you cannot enable this feature under Linux if this is how it operates on your hardware.
> 
> Would it be good to enable UFO conditionally with some internal define? Ex.:
> 
> +#ifdef MACB_ENABLE_UFO
> +#define MACB_NETIF_LSO         (NETIF_F_TSO | NETIF_F_UFO)
> +#else
> +#define MACB_NETIF_LSO         (NETIF_F_TSO)
> +#endif
> 
> I could add precise comment here that ufo is possible only without checksum.
> 
> Or maybe I could enable it from module_params or device-tree (like: drivers/net/ethernet/neterion/s2io.c).

No, you cannot do that.

1) That would violate UDP specs.
2) Module params are no longer accepted.
3) Comments in driver source code would only help the driver
maintainer, not help users make up their minds.

Only way would be to propagate the intent of the sender.

Only the sender application can decide to generate UDP checksums or not.

Your driver ndo_features_check() could then force software segmentation
fallback if the user did not ask to disable UDP checksums and the packet
is UFO.

(look for UDP_NO_CHECK6_TX, and SO_NO_CHECK )

The problem is complex, because the skb carries no marker; only the socket does.

And socket state could change between packets, and packets can stay in
an intermediate qdisc before hitting device driver. So looking at
skb->sk from your ndo_features_check() would be racy.

What use case would you have precisely ?


* re: amd-xgbe: Add support for MDIO attached PHYs
From: Colin Ian King @ 2016-11-15 13:07 UTC (permalink / raw)
  To: Tom Lendacky; +Cc: netdev@vger.kernel.org, David S. Miller

Hi,

Commit:

amd-xgbe: Add support for MDIO attached PHYs

    Use the phylib support in the kernel to communicate with and control an
    MDIO attached PHY. Use the hardware's MDIO communication mechanism to
    communicate with the PHY.

+static int xgbe_clr_gpio(struct xgbe_prv_data *pdata, unsigned int gpio)
+{
+       unsigned int reg;
+
+       if (gpio > 16)
+               return -EINVAL;

is gpio in the range 0..15?

	if (gpio > 15)
		return -EINVAL;

+
+       reg = XGMAC_IOREAD(pdata, MAC_GPIOSR);
+
+       reg &= ~(1 << (gpio + 16));

if gpio is 16, we get 1 << 32 which I believe is undefined behaviour.

+       XGMAC_IOWRITE(pdata, MAC_GPIOSR, reg);
+
+       return 0;
+}

Same applies for function xgbe_clr_gpio().
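
A minimal user-space sketch of the suggested check, assuming GPIO lines are numbered 0..15. `clear_gpio_bit` and `GPIO_COUNT` are hypothetical names, and the register is modelled as a plain 32-bit variable rather than read through the driver's I/O macros:

```c
#include <errno.h>
#include <stdint.h>

#define GPIO_COUNT 16u	/* assumption: valid GPIO numbers are 0..15 */

/* Clears the bit for `gpio` in the upper half of a 32-bit register image.
 * Returns 0 on success or -EINVAL when gpio is out of range.
 */
static int clear_gpio_bit(uint32_t *reg, unsigned int gpio)
{
	if (gpio >= GPIO_COUNT)	/* rejects 16, unlike `gpio > 16` */
		return -EINVAL;

	/* gpio + 16 is at most 31, so the shift count stays below the
	 * width of the type; UINT32_C(1) keeps the arithmetic unsigned.
	 */
	*reg &= ~(UINT32_C(1) << (gpio + 16));
	return 0;
}
```

With the tighter bound the shift count can never reach 32.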

Colin

