Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: next-20180605 - BUG in ipv6_add_addr
From: Dexuan-Linux Cui @ 2018-06-09  2:54 UTC (permalink / raw)
  To: David Ahern
  Cc: valdis.kletnieks, netdev, Linux Kernel Mailing List, Dexuan Cui
In-Reply-To: <6471e14e-2872-3ba4-7336-7c5840d28c12@gmail.com>

On Thu, Jun 7, 2018 at 5:51 PM, David Ahern <dsahern@gmail.com> wrote:
>
> ...
> I know you don't have a reliable reproducer, but I did find one spot
> where I was too clever and did not initialize a new cfg variable:
>
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 89019bf59f46..59c22a25e654 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -1324,6 +1324,7 @@ static int ipv6_create_tempaddr(struct
> inet6_ifaddr *ifp,
>                 }
>         }
>
> +       memset(&cfg, 0, sizeof(cfg));
>         cfg.valid_lft = min_t(__u32, ifp->valid_lft,
>                               idev->cnf.temp_valid_lft + age);
>         cfg.preferred_lft = cnf_temp_preferred_lft + age -
> idev->desync_factor;

This works for me. Great!

Thanks,
-- Dexuan

^ permalink raw reply

* [PATCH net] tcp: limit sk_rcvlowat by the maximum receive buffer
From: Soheil Hassas Yeganeh @ 2018-06-09  2:47 UTC (permalink / raw)
  To: davem, netdev; +Cc: ycheng, ncardwell, edumazet, willemb, Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

The user-provided value to setsockopt(SO_RCVLOWAT) can be
larger than the maximum possible receive buffer. Such values
mute POLLIN signals on the socket which can stall progress
on the socket.

Limit the user-provided value to half of the maximum receive
buffer, i.e., half of sk_rcvbuf when the receive buffer size
is set by the user, or otherwise half of sysctl_tcp_rmem[2].

Fixes: d1361840f8c5 ("tcp: fix SO_RCVLOWAT and RCVBUF autotuning")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/tcp.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2741953adaba2..141acd92e58ae 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1694,6 +1694,13 @@ EXPORT_SYMBOL(tcp_peek_len);
 /* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
 int tcp_set_rcvlowat(struct sock *sk, int val)
 {
+	int cap;
+
+	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
+		cap = sk->sk_rcvbuf >> 1;
+	else
+		cap = sock_net(sk)->ipv4.sysctl_tcp_rmem[2] >> 1;
+	val = min(val, cap);
 	sk->sk_rcvlowat = val ? : 1;
 
 	/* Check if we need to signal EPOLLIN right now */
@@ -1702,12 +1709,7 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
 	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
 		return 0;
 
-	/* val comes from user space and might be close to INT_MAX */
 	val <<= 1;
-	if (val < 0)
-		val = INT_MAX;
-
-	val = min(val, sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
 	if (val > sk->sk_rcvbuf) {
 		sk->sk_rcvbuf = val;
 		tcp_sk(sk)->window_clamp = tcp_win_from_space(sk, val);
-- 
2.18.0.rc1.242.g61856ae69a-goog

^ permalink raw reply related

* Re: Qualcomm rmnet driver and qmi_wwan
From: Subash Abhinov Kasiviswanathan @ 2018-06-09  2:19 UTC (permalink / raw)
  To: Bjørn Mork, Daniele Palmas; +Cc: Dan Williams, netdev
In-Reply-To: <87k1r914df.fsf@miraculix.mork.no>

[-- Attachment #1: Type: text/plain, Size: 664 bytes --]

> This sounds like a good idea. I probably won't have any time to look at
> this in the near future, though.  Sorry about that. Extremely 
> overloaded
> both at work and private right now...
> 
> But I trust that you and Daniele can work out something. Please keep me
> CCed, but don't expect timely replies.
> 

Hi Daniele

Can you try out the attached patch.
I have added a new sysfs attribute pass_through to be used in raw_ip 
mode
only. Once you attach rmnet devices on it, the rx_handler will be setup
and the packet will be processed by rmnet.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-net-qmi_wwan-Add-pass-through-mode.patch --]
[-- Type: text/x-diff; name=0001-net-qmi_wwan-Add-pass-through-mode.patch, Size: 3444 bytes --]

From bccfae3707af1be671fe55ea63123438f2dc38a8 Mon Sep 17 00:00:00 2001
From: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Date: Fri, 8 Jun 2018 19:53:08 -0600
Subject: [PATCH] net: qmi_wwan: Add pass through mode

Pass through mode is to allow packets in MAP format to be passed
on to the stack. rmnet driver can be used to process and demultiplex
these packets. Note that pass through mode can be enabled when the
device is in raw ip mode only.

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
---
 drivers/net/usb/qmi_wwan.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 8e8b51f..f52a9be 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -59,6 +59,7 @@ struct qmi_wwan_state {
 enum qmi_wwan_flags {
 	QMI_WWAN_FLAG_RAWIP = 1 << 0,
 	QMI_WWAN_FLAG_MUX = 1 << 1,
+	QMI_WWAN_FLAG_PASS_THROUGH = 1 << 2,
 };
 
 enum qmi_wwan_quirks {
@@ -425,14 +426,80 @@ static ssize_t del_mux_store(struct device *d,  struct device_attribute *attr, c
 	return ret;
 }
 
+static ssize_t pass_through_show(struct device *d,
+				 struct device_attribute *attr,
+				 char *buf)
+{
+	struct usbnet *dev = netdev_priv(to_net_dev(d));
+	struct qmi_wwan_state *info;
+
+	info = (void *)&dev->data;
+	return sprintf(buf, "%c\n",
+		       info->flags & QMI_WWAN_FLAG_PASS_THROUGH ? 'Y' : 'N');
+}
+
+static ssize_t pass_through_store(struct device *d,
+				  struct device_attribute *attr,
+				  const char *buf, size_t len)
+{
+	struct usbnet *dev = netdev_priv(to_net_dev(d));
+	struct qmi_wwan_state *info;
+	bool enable;
+	int ret;
+
+	if (strtobool(buf, &enable))
+		return -EINVAL;
+
+	info = (void *)&dev->data;
+
+	/* no change? */
+	if (enable == (info->flags & QMI_WWAN_FLAG_PASS_THROUGH))
+		return len;
+
+	/* pass through mode can be set for raw ip devices only */
+	if (!(info->flags & QMI_WWAN_FLAG_RAWIP))
+		return -EINVAL;
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+
+	/* we don't want to modify a running netdev */
+	if (netif_running(dev->net)) {
+		netdev_err(dev->net, "Cannot change a running device\n");
+		ret = -EBUSY;
+		goto err;
+	}
+
+	/* let other drivers deny the change */
+	ret = call_netdevice_notifiers(NETDEV_PRE_TYPE_CHANGE, dev->net);
+	ret = notifier_to_errno(ret);
+	if (ret) {
+		netdev_err(dev->net, "Type change was refused\n");
+		goto err;
+	}
+
+	if (enable)
+		info->flags |= QMI_WWAN_FLAG_PASS_THROUGH;
+	else
+		info->flags &= ~QMI_WWAN_FLAG_PASS_THROUGH;
+	qmi_wwan_netdev_setup(dev->net);
+	call_netdevice_notifiers(NETDEV_POST_TYPE_CHANGE, dev->net);
+	ret = len;
+err:
+	rtnl_unlock();
+	return ret;
+}
+
 static DEVICE_ATTR_RW(raw_ip);
 static DEVICE_ATTR_RW(add_mux);
 static DEVICE_ATTR_RW(del_mux);
+static DEVICE_ATTR_RW(pass_through);
 
 static struct attribute *qmi_wwan_sysfs_attrs[] = {
 	&dev_attr_raw_ip.attr,
 	&dev_attr_add_mux.attr,
 	&dev_attr_del_mux.attr,
+	&dev_attr_pass_through.attr,
 	NULL,
 };
 
@@ -479,6 +546,11 @@ static int qmi_wwan_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
 	if (info->flags & QMI_WWAN_FLAG_MUX)
 		return qmimux_rx_fixup(dev, skb);
 
+	if (rawip && (info->flags & QMI_WWAN_FLAG_PASS_THROUGH)) {
+		skb->protocol = htons(ETH_P_MAP);
+		return (netif_rx(skb) == NET_RX_SUCCESS);
+	}
+
 	switch (skb->data[0] & 0xf0) {
 	case 0x40:
 		proto = htons(ETH_P_IP);
-- 
1.9.1


^ permalink raw reply related

* Re: [PATCH net] failover: eliminate callback hell
From: Jakub Kicinski @ 2018-06-09  1:29 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Stephen Hemminger, Michael S. Tsirkin, Jiri Pirko, kys, haiyangz,
	David Miller, Samudrala, Sridhar, Netdev, Stephen Hemminger
In-Reply-To: <CADGSJ20sq8AmW34Z7w0JrTg6=YnKaM3Hq5R3TfhAH5BkuR2pqA@mail.gmail.com>

On Fri, 8 Jun 2018 16:44:12 -0700, Siwei Liu wrote:
> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> >> that flag, as well as the 1-netdev model, is to have a means to
> >> inherit the interface name from the VF, and to eliminate playing hacks
> >> around renaming devices, customizing udev rules and et al. Why
> >> inheriting VF's name important? To allow existing config/setup around
> >> VF continues to work across kernel feature upgrade. Most of network
> >> config files in all distros are based on interface names. Few are MAC
> >> address based but making lower slaves hidden would cover the rest. And
> >> most importantly, preserving the same level of user experience as
> >> using raw VF interface once getting all ndo_ops and ethtool_ops
> >> exposed. This is essential to realize transparent live migration that
> >> users dont have to learn and be aware of the undertaken.  
> >
> > Inheriting the VF name will fail in the migration scenario.
> > It is perfectly reasonable to migrate a guest to another machine where
> > the VF PCI address is different. And since current udev/systemd model
> > is to base network device name off of PCI address, the device will change
> > name when guest is migrated.
> >  
> The scenario of having VF on a different PCI address on post migration
> is essentially equal to plugging in a new NIC. Why it has to pair with
> the original PV? A sepearte PV device should be in place to pair the
> new VF.

IMHO it may be a better idea to look at the VF as acceleration for the
PV rather than PV a migration vehicle from the VF.  Hence we should
continue to follow the naming of PV, like the current implementation
does implicitly by linking to PV's struct device.

^ permalink raw reply

* Re: [PATCH net] failover: eliminate callback hell
From: Siwei Liu @ 2018-06-09  0:42 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michael S. Tsirkin, Jiri Pirko, kys, haiyangz, David Miller,
	Samudrala, Sridhar, Netdev, Stephen Hemminger
In-Reply-To: <20180608170235.7f345b58@xeon-e3>

On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Fri, 8 Jun 2018 16:44:12 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
>> <stephen@networkplumber.org> wrote:
>> > On Fri, 8 Jun 2018 15:25:59 -0700
>> > Siwei Liu <loseweigh@gmail.com> wrote:
>> >
>> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>> >> <stephen@networkplumber.org> wrote:
>> >> > On Wed, 6 Jun 2018 15:30:27 +0300
>> >> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> >
>> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>> >> >> > >The net failover should be a simple library, not a virtual
>> >> >> > >object with function callbacks (see callback hell).
>> >> >> >
>> >> >> > Why just a library? It should do a common things. I think it should be a
>> >> >> > virtual object. Looks like your patch again splits the common
>> >> >> > functionality into multiple drivers. That is kind of backwards attitude.
>> >> >> > I don't get it. We should rather focus on fixing the mess the
>> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >> >> > model.
>> >> >>
>> >> >> So it seems that at least one benefit for netvsc would be better
>> >> >> handling of renames.
>> >> >>
>> >> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> >> concerned about risk of breaking some userspace.
>> >> >>
>> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> >> address, and you said then "why not use existing network namespaces
>> >> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> >> compatibility?
>> >> >>
>> >> >
>> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> >> > startups that all demand eth0 always be present. And VF may come and go.
>> >> > After this history, there is a strong motivation not to change how kernel
>> >> > behaves. Switching to 3 device model would be perceived as breaking
>> >> > existing userspace.
>> >> >
>> >> > With virtio you can  work it out with the distro's yourself.
>> >> > There is no pre-existing semantics to deal with.
>> >> >
>> >> > For the virtio, I don't see the need for IFF_HIDDEN.
>> >>
>> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> around renaming devices, customizing udev rules and et al. Why
>> >> inheriting VF's name important? To allow existing config/setup around
>> >> VF continues to work across kernel feature upgrade. Most of network
>> >> config files in all distros are based on interface names. Few are MAC
>> >> address based but making lower slaves hidden would cover the rest. And
>> >> most importantly, preserving the same level of user experience as
>> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> exposed. This is essential to realize transparent live migration that
>> >> users dont have to learn and be aware of the undertaken.
>> >
>> > Inheriting the VF name will fail in the migration scenario.
>> > It is perfectly reasonable to migrate a guest to another machine where
>> > the VF PCI address is different. And since current udev/systemd model
>> > is to base network device name off of PCI address, the device will change
>> > name when guest is migrated.
>> >
>> The scenario of having VF on a different PCI address on post migration
>> is essentially equal to plugging in a new NIC. Why it has to pair with
>> the original PV? A sepearte PV device should be in place to pair the
>> new VF.
>
> The host only guarantees that the PV device will be on the same network.
> It does not make any PCI guarantees. The way Windows works is to find
> the device based on "serial number" which is an Hyper-V specific attribute
> of PCI devices.
>
> I considered naming off of serial number but that won't work for the
> case where PV device is present first and VF arrives later. The serial
> number is attribute of VF, not the PV which is there first.

I assume the PV can get that information ahead of time before VF
arrives? Without it how do you match the device when you see a VF
coming with some serial number? Is it possible for PV to get the
matching SN even earlier during probe time? Or it has to depend on the
presence of vPCI bridge to generate this SN?

>
> Your ideas about having the PCI information of the VF form the name
> of the failover device have the same problem. The PV device may
> be the only one present on boot.

Yeah, this is a chicken-egg problem indeed, and that was the reason
why I supply the BDF info for PV to name the master interface.
However, the ACPI PCI slot needs to depend on the PCI bus enumeration
so that can't be predictable.  Would it make sense to only rename when
the first time a matching VF appears and PV interface isn't brought
up, then the failover master would always stick to the name
afterwards? I think it should cover most scenarios as it's usually
during boot time (dracut) the VF first appears and the PV interface at
the time then shouldn't have been configured yet.

-Siwei

>
>
>> > On Azure, the VF maybe removed (by host) at any time and then later
>> > reattached. There is no guarantee that VF will show back up at
>> > the same synthetic PCI address. It will likely have a different
>> > PCI domain value.
>>
>> This is something QEMU can do and make sure the PCI address is
>> consistent after migration.
>>
>> -Siwei
>

^ permalink raw reply

* Re: [PATCH net-next] net: ipv6: Generate random IID for addresses on RAWIP devices
From: Subash Abhinov Kasiviswanathan @ 2018-06-09  0:34 UTC (permalink / raw)
  To: Lorenzo Colitti
  Cc: YOSHIFUJI Hideaki, David Miller, netdev, YOSHIFUJI Hideaki
In-Reply-To: <CAKD1Yr0uFPS5poYVPJe_LqFh1EVmepWoeEXyhuubSAU_VMBFuQ@mail.gmail.com>

> Actually, I think this is fine. RFC 7136 clarified this, and says:
> 
> ======
>    Thus, we can conclude that the value of the "u" bit in IIDs has no
>    particular meaning.  In the case of an IID created from a MAC 
> address
>    according to RFC 4291, its value is determined by the MAC address,
>    but that is all.
> [...]
>    Specifications of other forms of 64-bit IIDs MUST specify how all 64
>    bits are set, but a generic semantic meaning for the "u" and "g" 
> bits
>    MUST NOT be defined.  However, the method of generating IIDs for
>    specific link types MAY define some local significance for certain
>    bits.
> 
>    In all cases, the bits in an IID have no generic semantics; in other
>    words, they have opaque values.  In fact, the whole IID value MUST 
> be
>    viewed as an opaque bit string by third parties, except possibly in
>    the local context.
> ======
> 
> That said - we already have a way to form all-random IIDs:
> IN6_ADDR_GEN_MODE_RANDOM. Can't you just ensure that links of type
> ARPHRD_RAWIP always use IN6_ADDR_GEN_MODE_RANDOM? That might lead to
> less special-casing.

Hi Lorenzo

In v2 of this patchset, I used addrconf_ifid_ip6tnl() similar to
IP6 Tunnels / VTI6, so I didnt need that way of generating the IID.
addrconf_ifid_ip6tnl() also provides fixed IIDs during the lifetime of 
the
net device while IN6_ADDR_GEN_MODE_RANDOM generates different addresses.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply

* Re: [PATCH v4 9/9] net-next: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)
From: Michael Schmitz @ 2018-06-09  0:28 UTC (permalink / raw)
  To: Michael Karcher
  Cc: Geert Uytterhoeven, netdev, Andrew Lunn, Finn Thain,
	Florian Fainelli, Linux/m68k, Michael Karcher
In-Reply-To: <56235.87.122.27.167.1528450090.webmail@webmail.zedat.fu-berlin.de>

Hi Michael,

Am 08.06.2018 um 21:28 schrieb Michael Karcher:
> Let me add my 2 cents as main author of that code:
...
>
> actually, the the block_input / block_output functions were the reason I
> included the lib8390.c file. Except for xs100_write / xs100_read, they are
> a verbatim copy from ax88796.c I'm not that enthusiastic about that idea
> anymore, but did not get around to improve it. I added a customization
> point to ax88796 for a custom block_input / block_output, because the 8390
> core already provides that customization point. What I really need is a
> customization point just for the 8390-remote-DMA-via-MMIO functions (i.e.
> xs100_write, xs100_read) instead of the whole block transfer core that
> also sets up the remote DMA engine and tries to re-initialize the card in
> case of unexplained problems.

OK, so essentially change

         if (ei_local->word16) {
                 ioread16_rep(nic_base + NE_DATAPORT, buf, count >> 1);
                 if (count & 0x01)
                         buf[count-1] = ei_inb(nic_base + NE_DATAPORT);

         } else {
                 ioread8_rep(nic_base + NE_DATAPORT, buf, count);
         }

to something like

         if (ax->block_read) {
		ax->block_read(dev, buf, count);
	} else if (ei_local->word16) {
                 ioread16_rep(nic_base + NE_DATAPORT, buf, count >> 1);
                 if (count & 0x01)
                         buf[count-1] = ei_inb(nic_base + NE_DATAPORT);

         } else {
                 ioread8_rep(nic_base + NE_DATAPORT, buf, count);
         }

and populate ax->block_read() and ax_block_write() from platform data, 
instead of substituting ax_block_input() / ax_block_output() wholesale?

I must be missing something here.

> This should get rid of
>  - xs100_block_output
>  - xs100_block_input (which has the calls to ax_reset_8390 and
> ax_NS8390_init)
>  - ax_reset_8390
>  - and thus the include of lib8390.c (which should be included only by
> ax88796.c, not by xsurf100.c)

I've got an (untested) patch that just exports ax_NS8390_init() via the 
ax_device struct, and gets rid of the lib8390.c include that way, but 
the above solution would be a lot cleaner. Adds one test for 
ax->block_read on the critical path but we already have the test for 
ei_local->word16 there. May need rearranging the tests so the majority 
of ax88796 users isn't impacted.

Anyway, your code, your call.

Cheers,

	Michael


>
> Regards,
>   Michael Karcher
>

^ permalink raw reply

* Re: [PATCH net] failover: eliminate callback hell
From: Stephen Hemminger @ 2018-06-09  0:02 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Michael S. Tsirkin, Jiri Pirko, kys, haiyangz, David Miller,
	Samudrala, Sridhar, Netdev, Stephen Hemminger
In-Reply-To: <CADGSJ20sq8AmW34Z7w0JrTg6=YnKaM3Hq5R3TfhAH5BkuR2pqA@mail.gmail.com>

On Fri, 8 Jun 2018 16:44:12 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> > On Fri, 8 Jun 2018 15:25:59 -0700
> > Siwei Liu <loseweigh@gmail.com> wrote:
> >  
> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
> >> <stephen@networkplumber.org> wrote:  
> >> > On Wed, 6 Jun 2018 15:30:27 +0300
> >> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >  
> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:  
> >> >> > >The net failover should be a simple library, not a virtual
> >> >> > >object with function callbacks (see callback hell).  
> >> >> >
> >> >> > Why just a library? It should do a common things. I think it should be a
> >> >> > virtual object. Looks like your patch again splits the common
> >> >> > functionality into multiple drivers. That is kind of backwards attitude.
> >> >> > I don't get it. We should rather focus on fixing the mess the
> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> >> > model.  
> >> >>
> >> >> So it seems that at least one benefit for netvsc would be better
> >> >> handling of renames.
> >> >>
> >> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> >> concerned about risk of breaking some userspace.
> >> >>
> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> >> address, and you said then "why not use existing network namespaces
> >> >> rather than inventing a new abstraction". So how about it then? Do you
> >> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> >> compatibility?
> >> >>  
> >> >
> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> >> > startups that all demand eth0 always be present. And VF may come and go.
> >> > After this history, there is a strong motivation not to change how kernel
> >> > behaves. Switching to 3 device model would be perceived as breaking
> >> > existing userspace.
> >> >
> >> > With virtio you can  work it out with the distro's yourself.
> >> > There is no pre-existing semantics to deal with.
> >> >
> >> > For the virtio, I don't see the need for IFF_HIDDEN.  
> >>
> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> >> that flag, as well as the 1-netdev model, is to have a means to
> >> inherit the interface name from the VF, and to eliminate playing hacks
> >> around renaming devices, customizing udev rules and et al. Why
> >> inheriting VF's name important? To allow existing config/setup around
> >> VF continues to work across kernel feature upgrade. Most of network
> >> config files in all distros are based on interface names. Few are MAC
> >> address based but making lower slaves hidden would cover the rest. And
> >> most importantly, preserving the same level of user experience as
> >> using raw VF interface once getting all ndo_ops and ethtool_ops
> >> exposed. This is essential to realize transparent live migration that
> >> users dont have to learn and be aware of the undertaken.  
> >
> > Inheriting the VF name will fail in the migration scenario.
> > It is perfectly reasonable to migrate a guest to another machine where
> > the VF PCI address is different. And since current udev/systemd model
> > is to base network device name off of PCI address, the device will change
> > name when guest is migrated.
> >  
> The scenario of having VF on a different PCI address on post migration
> is essentially equal to plugging in a new NIC. Why it has to pair with
> the original PV? A sepearte PV device should be in place to pair the
> new VF.

The host only guarantees that the PV device will be on the same network.
It does not make any PCI guarantees. The way Windows works is to find
the device based on "serial number" which is an Hyper-V specific attribute
of PCI devices.

I considered naming off of serial number but that won't work for the
case where PV device is present first and VF arrives later. The serial
number is attribute of VF, not the PV which is there first.

Your ideas about having the PCI information of the VF form the name
of the failover device have the same problem. The PV device may
be the only one present on boot.


> > On Azure, the VF maybe removed (by host) at any time and then later
> > reattached. There is no guarantee that VF will show back up at
> > the same synthetic PCI address. It will likely have a different
> > PCI domain value.  
> 
> This is something QEMU can do and make sure the PCI address is
> consistent after migration.
> 
> -Siwei

^ permalink raw reply

* Re: [PATCH net v2] net: bridge: Fix locking in br_fdb_find_port()
From: David Miller @ 2018-06-09  0:00 UTC (permalink / raw)
  To: petrm; +Cc: bridge, netdev, stephen
In-Reply-To: <c41fffa00abb9a123039e5509886a0de85274291.1528450455.git.petrm@mellanox.com>

From: Petr Machata <petrm@mellanox.com>
Date: Fri, 08 Jun 2018 15:11:47 +0200

> Callers of br_fdb_find() need to hold the hash lock, which
> br_fdb_find_port() doesn't do. However, since br_fdb_find_port() is not
> doing any actual FDB manipulation, the hash lock is not really needed at
> all. So convert to br_fdb_find_rcu(), surrounded by rcu_read_lock() /
> _unlock() pair.
> 
> The device pointer copied from inside the FDB entry is then kept alive
> by the RTNL lock, which br_fdb_find_port() asserts.
> 
> Fixes: 4d4fd36126d6 ("net: bridge: Publish bridge accessor functions")
> Signed-off-by: Petr Machata <petrm@mellanox.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net] udp: fix rx queue len reported by diag and proc interface
From: David Miller @ 2018-06-08 23:56 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, edumazet, trevor.francis
In-Reply-To: <6189b60449cb3ab524db8e91acd6be8d01ba887c.1528450183.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Fri,  8 Jun 2018 11:35:40 +0200

> After commit 6b229cf77d68 ("udp: add batching to udp_rmem_release()")
> the sk_rmem_alloc field does not measure exactly anymore the
> receive queue length, because we batch the rmem release. The issue
> is really apparent only after commit 0d4a6608f68c ("udp: do rmem bulk
> free even if the rx sk queue is empty"): the user space can easily
> check for an empty socket with not-0 queue length reported by the 'ss'
> tool or the procfs interface.
> 
> We need to use a custom UDP helper to report the correct queue length,
> taking into account the forward allocation deficit.
> 
> Reported-by: trevor.francis@46labs.com
> Fixes: 6b229cf77d68 ("UDP: add batching to udp_rmem_release()")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net,stable] cdc_ncm: avoid padding beyond end of skb
From: David Miller @ 2018-06-08 23:50 UTC (permalink / raw)
  To: bjorn; +Cc: netdev, linux-usb, bodqhrohro, dennis.wassenberg, mrkiko.rs
In-Reply-To: <20180608071524.11528-1-bjorn@mork.no>

From: Bjørn Mork <bjorn@mork.no>
Date: Fri,  8 Jun 2018 09:15:24 +0200

> Commit 4a0e3e989d66 ("cdc_ncm: Add support for moving NDP to end
> of NCM frame") added logic to reserve space for the NDP at the
> end of the NTB/skb.  This reservation did not take the final
> alignment of the NDP into account, causing us to reserve too
> little space. Additionally the padding prior to NDP addition did
> not ensure there was enough space for the NDP.
> 
> The NTB/skb with the NDP appended would then exceed the configured
> max size. This caused the final padding of the NTB to use a
> negative count, padding to almost INT_MAX, and resulting in:
 ...
> Commit e1069bbfcf3b ("net: cdc_ncm: Reduce memory use when kernel
> memory low") made this bug much more likely to trigger by reducing
> the NTB size under memory pressure.
> 
> Link: https://bugs.debian.org/893393
> Reported-by: Горбешко Богдан <bodqhrohro@gmail.com>
> Reported-and-tested-by: Dennis Wassenberg <dennis.wassenberg@secunet.com>
> Cc: Enrico Mioso <mrkiko.rs@gmail.com>
> Fixes: 4a0e3e989d66 ("cdc_ncm: Add support for moving NDP to end of NCM frame")
> Signed-off-by: Bjørn Mork <bjorn@mork.no>
> ---
> Big thanks to Dennis for the observation that this crash depended on
> FLAG_SEND_ZLP not being set. This made it possible to pinpoint where
> the problem was.

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH net] failover: eliminate callback hell
From: Siwei Liu @ 2018-06-08 23:44 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michael S. Tsirkin, Jiri Pirko, kys, haiyangz, David Miller,
	Samudrala, Sridhar, Netdev, Stephen Hemminger
In-Reply-To: <20180608161801.64afda65@xeon-e3>

On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Fri, 8 Jun 2018 15:25:59 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>> <stephen@networkplumber.org> wrote:
>> > On Wed, 6 Jun 2018 15:30:27 +0300
>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >
>> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>> >> > >The net failover should be a simple library, not a virtual
>> >> > >object with function callbacks (see callback hell).
>> >> >
>> >> > Why just a library? It should do a common things. I think it should be a
>> >> > virtual object. Looks like your patch again splits the common
>> >> > functionality into multiple drivers. That is kind of backwards attitude.
>> >> > I don't get it. We should rather focus on fixing the mess the
>> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >> > model.
>> >>
>> >> So it seems that at least one benefit for netvsc would be better
>> >> handling of renames.
>> >>
>> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> concerned about risk of breaking some userspace.
>> >>
>> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> address, and you said then "why not use existing network namespaces
>> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> compatibility?
>> >>
>> >
>> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> > startups that all demand eth0 always be present. And VF may come and go.
>> > After this history, there is a strong motivation not to change how kernel
>> > behaves. Switching to 3 device model would be perceived as breaking
>> > existing userspace.
>> >
>> > With virtio you can  work it out with the distro's yourself.
>> > There is no pre-existing semantics to deal with.
>> >
>> > For the virtio, I don't see the need for IFF_HIDDEN.
>>
>> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> that flag, as well as the 1-netdev model, is to have a means to
>> inherit the interface name from the VF, and to eliminate playing hacks
>> around renaming devices, customizing udev rules and et al. Why
>> inheriting VF's name important? To allow existing config/setup around
>> VF continues to work across kernel feature upgrade. Most of network
>> config files in all distros are based on interface names. Few are MAC
>> address based but making lower slaves hidden would cover the rest. And
>> most importantly, preserving the same level of user experience as
>> using raw VF interface once getting all ndo_ops and ethtool_ops
>> exposed. This is essential to realize transparent live migration that
>> users dont have to learn and be aware of the undertaken.
>
> Inheriting the VF name will fail in the migration scenario.
> It is perfectly reasonable to migrate a guest to another machine where
> the VF PCI address is different. And since current udev/systemd model
> is to base network device name off of PCI address, the device will change
> name when guest is migrated.
>
The scenario of having VF on a different PCI address on post migration
is essentially equal to plugging in a new NIC. Why it has to pair with
the original PV? A sepearte PV device should be in place to pair the
new VF.


> On Azure, the VF maybe removed (by host) at any time and then later
> reattached. There is no guarantee that VF will show back up at
> the same synthetic PCI address. It will likely have a different
> PCI domain value.

This is something QEMU can do and make sure the PCI address is
consistent after migration.

-Siwei

^ permalink raw reply

* Re: linux-next: Please add mlx5-next tree
From: Stephen Hemminger @ 2018-06-08 23:19 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Stephen Rothwell, Linux-Next Mailing List,
	Linux Kernel Mailing List, Saeed Mahameed, Jason Gunthorpe,
	Doug Ledford, David S. Miller, linux-netdev, RDMA mailing list
In-Reply-To: <20180606192500.GZ2958@mtr-leonro.mtl.com>

[-- Attachment #1: Type: text/plain, Size: 579 bytes --]

On Wed, 6 Jun 2018 22:25:00 +0300
Leon Romanovsky <leon@kernel.org> wrote:

> Hi Stephen,
> 
> Can you please add the branch "mlx5-next" from the following tree to the
> linux-next integration tree?
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/
> 
> The mlx5-next branch is used to send shared pull requests to both netdev
> and RDMA subsystems and it is worth to have linux-next coverage on it
> before.
> 
> Thanks

I would hope all network drivers keep getting review in netdev.
Don't want a path to upstream that bypasses review.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net] failover: eliminate callback hell
From: Stephen Hemminger @ 2018-06-08 23:18 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Michael S. Tsirkin, Jiri Pirko, kys, haiyangz, David Miller,
	Samudrala, Sridhar, Netdev, Stephen Hemminger
In-Reply-To: <CADGSJ201TaacZhRTgQZC+a3c5fu8JkHcGifGTfbFQzyocCJxVg@mail.gmail.com>

On Fri, 8 Jun 2018 15:25:59 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> > On Wed, 6 Jun 2018 15:30:27 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >  
> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:  
> >> > >The net failover should be a simple library, not a virtual
> >> > >object with function callbacks (see callback hell).  
> >> >
> >> > Why just a library? It should do a common things. I think it should be a
> >> > virtual object. Looks like your patch again splits the common
> >> > functionality into multiple drivers. That is kind of backwards attitude.
> >> > I don't get it. We should rather focus on fixing the mess the
> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> > model.  
> >>
> >> So it seems that at least one benefit for netvsc would be better
> >> handling of renames.
> >>
> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> concerned about risk of breaking some userspace.
> >>
> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> address, and you said then "why not use existing network namespaces
> >> rather than inventing a new abstraction". So how about it then? Do you
> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> compatibility?
> >>  
> >
> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > startups that all demand eth0 always be present. And VF may come and go.
> > After this history, there is a strong motivation not to change how kernel
> > behaves. Switching to 3 device model would be perceived as breaking
> > existing userspace.
> >
> > With virtio you can  work it out with the distro's yourself.
> > There is no pre-existing semantics to deal with.
> >
> > For the virtio, I don't see the need for IFF_HIDDEN.  
> 
> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> that flag, as well as the 1-netdev model, is to have a means to
> inherit the interface name from the VF, and to eliminate playing hacks
> around renaming devices, customizing udev rules and et al. Why
> inheriting VF's name important? To allow existing config/setup around
> VF continues to work across kernel feature upgrade. Most of network
> config files in all distros are based on interface names. Few are MAC
> address based but making lower slaves hidden would cover the rest. And
> most importantly, preserving the same level of user experience as
> using raw VF interface once getting all ndo_ops and ethtool_ops
> exposed. This is essential to realize transparent live migration that
> users dont have to learn and be aware of the undertaken.

Inheriting the VF name will fail in the migration scenario.
It is perfectly reasonable to migrate a guest to another machine where
the VF PCI address is different. And since current udev/systemd model
is to base network device name off of PCI address, the device will change
name when guest is migrated.

On Azure, the VF maybe removed (by host) at any time and then later
reattached. There is no guarantee that VF will show back up at
the same synthetic PCI address. It will likely have a different
PCI domain value.

^ permalink raw reply

* Re: [PATCH net] failover: eliminate callback hell
From: Siwei Liu @ 2018-06-08 22:54 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Stephen Hemminger, Michael S. Tsirkin, Jiri Pirko, kys, haiyangz,
	David Miller, Netdev, Stephen Hemminger
In-Reply-To: <e20a6cdf-34b4-cbc9-1dc9-75c436d6c2fe@intel.com>

On Wed, Jun 6, 2018 at 2:54 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
>
> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
>>
>> On Wed, 6 Jun 2018 15:30:27 +0300
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>>>>
>>>> Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>>>>>
>>>>> The net failover should be a simple library, not a virtual
>>>>> object with function callbacks (see callback hell).
>>>>
>>>> Why just a library? It should do a common things. I think it should be a
>>>> virtual object. Looks like your patch again splits the common
>>>> functionality into multiple drivers. That is kind of backwards attitude.
>>>> I don't get it. We should rather focus on fixing the mess the
>>>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>>>> model.
>>>
>>> So it seems that at least one benefit for netvsc would be better
>>> handling of renames.
>>>
>>> Question is how can this change to 3-netdev happen?  Stephen is
>>> concerned about risk of breaking some userspace.
>>>
>>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>>> address, and you said then "why not use existing network namespaces
>>> rather than inventing a new abstraction". So how about it then? Do you
>>> want to find a way to use namespaces to hide the PV device for netvsc
>>> compatibility?
>>>
>> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> startups that all demand eth0 always be present. And VF may come and go.
>> After this history, there is a strong motivation not to change how kernel
>> behaves. Switching to 3 device model would be perceived as breaking
>> existing userspace.
>
>
> I think it should be possible for netvsc to work with 3 dev model if the
> only
> requirement is that eth0 will always be present. With net_failover, you will
> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an
> issue
> if somehow there is userspace requirement that there can be only 2 netdevs,
> not 3
> when VF is plugged.
>
> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc
> device
> and the IP address gets configured on eth0. Will this be an issue?
>
Did you realize that the eth0 name in the current 3-netdev code can't
be consistently persisted across reboot, if you have more than one VFs
to pair with? On one boot it got eth0/eth0nsby, on the next it may get
eth1/eth1nsby on the same interface.

It won't be useable by default until you add some custom udev rules.

-Siwei

>
>
>>
>> With virtio you can  work it out with the distro's yourself.
>> There is no pre-existing semantics to deal with.
>>
>> For the virtio, I don't see the need for IFF_HIDDEN.
>> With 3-dev model as long as you mark the PV and VF devices
>> as slaves, then userspace knows to leave them alone. Assuming userspace
>> is already able to deal with team and bond devices.
>> Any time you introduce new UAPI behavior something breaks.
>>
>> On the rename front, I really don't care if VF can be renamed. And for
>> netvsc want to allow the PV device to be renamed. Udev developers want
>> that
>> but have not found a stable/persistent value to expose to userspace
>> to allow it.
>
>

^ permalink raw reply

* Re: [PATCH net v2] net/sched: act_simple: fix parsing of TCA_DEF_DATA
From: David Miller @ 2018-06-08 22:50 UTC (permalink / raw)
  To: dcaratti; +Cc: jhs, xiyou.wangcong, jiri, netdev
In-Reply-To: <473a0039fa5c28284ebb5928eb894674cc65b925.1528426801.git.dcaratti@redhat.com>

From: Davide Caratti <dcaratti@redhat.com>
Date: Fri,  8 Jun 2018 05:02:31 +0200

> use nla_strlcpy() to avoid copying data beyond the length of TCA_DEF_DATA
> netlink attribute, in case it is less than SIMP_MAX_DATA and it does not
> end with '\0' character.
> 
> v2: fix errors in the commit message, thanks Hangbin Liu
> 
> Fixes: fa1b1cff3d06 ("net_cls_act: Make act_simple use of netlink policy.")
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH v2 net] net: fddi: fix a possible null-ptr-deref
From: David Miller @ 2018-06-08 22:48 UTC (permalink / raw)
  To: yuehaibing; +Cc: netdev, linux-kernel
In-Reply-To: <20180608025825.25716-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Fri, 8 Jun 2018 10:58:25 +0800

> bp->SharedMemAddr is set to NULL while bp->SharedMemSize lesser-or-equal 0,
> then memset will trigger null-ptr-deref.
> 
> fix it by replacing pci_alloc_consistent with dma_zalloc_coherent.
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> ---
> v1->v2: move from pci_dma* to dma_* as Christoph suggested

Applied.

^ permalink raw reply

* Re: [PATCH net] net: aquantia: fix unsigned numvecs comparison with less than zero
From: David Miller @ 2018-06-08 22:46 UTC (permalink / raw)
  To: igor.russkikh; +Cc: netdev, darcari, pavel.belous, colin.king
In-Reply-To: <b3f15fb11d16929c60728800bfaae4fdd36f1406.1528407764.git.igor.russkikh@aquantia.com>

From: Igor Russkikh <igor.russkikh@aquantia.com>
Date: Thu,  7 Jun 2018 17:54:37 -0400

> From: Colin Ian King <colin.king@canonical.com>
> 
> From: Colin Ian King <colin.king@canonical.com>
> 
> This was originally mistakenly submitted to net-next. Resubmitting to net.
> 
> The comparison of numvecs < 0 is always false because numvecs is a u32
> and hence the error return from a failed call to pci_alloc_irq_vectores
> is never detected.  Fix this by using the signed int ret to handle the
> error return and assign numvecs to err.
> 
> Detected by CoverityScan, CID#1468650 ("Unsigned compared against 0")
> 
> Fixes: a09bd81b5413 ("net: aquantia: Limit number of vectors to actually allocated irqs")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net] failover: eliminate callback hell
From: Siwei Liu @ 2018-06-08 22:25 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Michael S. Tsirkin, Jiri Pirko, kys, haiyangz, David Miller,
	Samudrala, Sridhar, Netdev, Stephen Hemminger
In-Reply-To: <20180606142447.3c5072d8@xeon-e3>

On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Wed, 6 Jun 2018 15:30:27 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>> > >The net failover should be a simple library, not a virtual
>> > >object with function callbacks (see callback hell).
>> >
>> > Why just a library? It should do a common things. I think it should be a
>> > virtual object. Looks like your patch again splits the common
>> > functionality into multiple drivers. That is kind of backwards attitude.
>> > I don't get it. We should rather focus on fixing the mess the
>> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> > model.
>>
>> So it seems that at least one benefit for netvsc would be better
>> handling of renames.
>>
>> Question is how can this change to 3-netdev happen?  Stephen is
>> concerned about risk of breaking some userspace.
>>
>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> address, and you said then "why not use existing network namespaces
>> rather than inventing a new abstraction". So how about it then? Do you
>> want to find a way to use namespaces to hide the PV device for netvsc
>> compatibility?
>>
>
> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> startups that all demand eth0 always be present. And VF may come and go.
> After this history, there is a strong motivation not to change how kernel
> behaves. Switching to 3 device model would be perceived as breaking
> existing userspace.
>
> With virtio you can  work it out with the distro's yourself.
> There is no pre-existing semantics to deal with.
>
> For the virtio, I don't see the need for IFF_HIDDEN.

I have a somewhat different view regarding IFF_HIDDEN. The purpose of
that flag, as well as the 1-netdev model, is to have a means to
inherit the interface name from the VF, and to eliminate playing hacks
around renaming devices, customizing udev rules and et al. Why
inheriting VF's name important? To allow existing config/setup around
VF continues to work across kernel feature upgrade. Most of network
config files in all distros are based on interface names. Few are MAC
address based but making lower slaves hidden would cover the rest. And
most importantly, preserving the same level of user experience as
using raw VF interface once getting all ndo_ops and ethtool_ops
exposed. This is essential to realize transparent live migration that
users dont have to learn and be aware of the undertaken.

It's unfair to say all virtio use cases don't need IFF_HIDDEN. A few
use cases don't care about getting slaves exposed, the 3-netdev model
would work for them. For the rest, the pre-existing semantics to them
is the single VF device model they've already dealt with. This is
nothing different than having Azure stick to the 2-netdev model
because of existing user base IMHO.

-Siwei

> With 3-dev model as long as you mark the PV and VF devices
> as slaves, then userspace knows to leave them alone. Assuming userspace
> is already able to deal with team and bond devices.
> Any time you introduce new UAPI behavior something breaks.
>
> On the rename front, I really don't care if VF can be renamed. And for
> netvsc want to allow the PV device to be renamed. Udev developers want that
> but have not found a stable/persistent value to expose to userspace
> to allow it.

^ permalink raw reply

* Re: [RFC PATCH 1/3] ebpf: add next_skb_frag bpf helper for sk filter
From: Tushar Dave @ 2018-06-08 22:24 UTC (permalink / raw)
  To: Daniel Borkmann, netdev, ast, davem, john.fastabend,
	jakub.kicinski, kafai, rdna, quentin.monnet, brakmo, acme
In-Reply-To: <39186936-9af3-f609-7b2a-26c908558a5a@oracle.com>



On 06/08/2018 02:46 PM, Tushar Dave wrote:
> 
> 
> On 06/08/2018 02:27 PM, Daniel Borkmann wrote:
>> On 06/08/2018 11:00 PM, Tushar Dave wrote:
>>> Today socket filter only deals with linear skbs. This change allows
>>> ebpf programs to look into non-linear skb e.g. skb frags. This will be
>>> useful when users need to look into data which is not contained in the
>>> linear part of skb.
>>
>> Hmm, I don't think this statement is correct in its form here ... they
>> can handle non-linear skbs just fine.
> Thanks Daniel for your reply.
>>
>> Straight forward way is to use bpf_skb_load_bytes(). It's simple and uses
>> internally skb_header_pointer(), and that one of course walks everything
>> if it really has to via skb_copy_bits() (page frags _and_ frag list). And
>> if you need to look into mac/net headers that may otherwise not be 
>> accessible
>> anymore from socket layer, there's bpf_skb_load_bytes_relative() helper
>> which is effectively doing the negative offset trick from ld_abs/ind more
>> efficient for multi-byte loads.
> I'm looking into bpf_skb_load_bytes and friends.

Daniel,

While I am trying to see if I can use exiting bpf_skb_load helpers, I am
wondering socket filter based ebpf program are allowed to change packet
data? In other words, can we use them to build firewall?

Thanks.

-Tushar
> 
> Thanks.
> -Tushar
>>
>> Thanks,
>> Daniel
>>
> 

^ permalink raw reply

* Re: Fw: [Bug 199995] New: Ramdomly sent TCP Reset from Kernel with bonding mode "brodcast"
From: Eric Dumazet @ 2018-06-08 21:53 UTC (permalink / raw)
  To: Eric Dumazet, Michal Kubecek, Stephen Hemminger; +Cc: netdev
In-Reply-To: <3cbd2c1f-4e03-1cb1-3731-4ce440778bb8@gmail.com>



On 06/08/2018 02:38 PM, Eric Dumazet wrote:
> 
> 
> On 06/08/2018 02:04 PM, Michal Kubecek wrote:

>>
>> However, the lockless listener was introduced in 4.4 so it's not clear
>> why reporter started encountering this after an upgrade from 4.13 to
>> 4.15.
> 
> Yes, I do not buy this at all.
> 
> If two identical SYN are received by two cpus, we should create one SYN_RECV and send
> two SYNACK.
> 
> But it is a bit hard to test this :/
> 
> I will take a look, thanks.


Oh well, this is not done as I thought, this needs a fix, I will work on this.

reqsk_queue_hash_req() calls inet_ehash_insert() without making sure that the same 4-tuple
is not already there.

Do not worry, we will keep the listener lockless :)

^ permalink raw reply

* Re: [RFC PATCH 1/3] ebpf: add next_skb_frag bpf helper for sk filter
From: Tushar Dave @ 2018-06-08 21:46 UTC (permalink / raw)
  To: Daniel Borkmann, netdev, ast, davem, john.fastabend,
	jakub.kicinski, kafai, rdna, quentin.monnet, brakmo, acme
In-Reply-To: <9588eb72-f1d5-f6ce-b2a3-aefb431e70d5@iogearbox.net>



On 06/08/2018 02:27 PM, Daniel Borkmann wrote:
> On 06/08/2018 11:00 PM, Tushar Dave wrote:
>> Today socket filter only deals with linear skbs. This change allows
>> ebpf programs to look into non-linear skb e.g. skb frags. This will be
>> useful when users need to look into data which is not contained in the
>> linear part of skb.
> 
> Hmm, I don't think this statement is correct in its form here ... they
> can handle non-linear skbs just fine.
Thanks Daniel for your reply.
> 
> Straight forward way is to use bpf_skb_load_bytes(). It's simple and uses
> internally skb_header_pointer(), and that one of course walks everything
> if it really has to via skb_copy_bits() (page frags _and_ frag list). And
> if you need to look into mac/net headers that may otherwise not be accessible
> anymore from socket layer, there's bpf_skb_load_bytes_relative() helper
> which is effectively doing the negative offset trick from ld_abs/ind more
> efficient for multi-byte loads.
I'm looking into bpf_skb_load_bytes and friends.

Thanks.
-Tushar
> 
> Thanks,
> Daniel
> 

^ permalink raw reply

* Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.
From: Arend van Spriel @ 2018-06-08 21:40 UTC (permalink / raw)
  To: Ben Greear, Michał Kazior
  Cc: Cong Wang, Linux Kernel Network Developers,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <1f11144f-7580-03f4-72bd-76b0907d7ed1-my8/4N5VtI7c+919tysfdA@public.gmane.org>

On 6/8/2018 5:17 PM, Ben Greear wrote:

I recalled an email from Michał leaving tieto so adding his alternate 
email he provided back then.

Gr. AvS

> On 06/07/2018 04:59 PM, Cong Wang wrote:
>> On Thu, Jun 7, 2018 at 4:48 PM,  <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org> wrote:
>>> diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
>>> index be7c0fa..cb911f0 100644
>>> --- a/include/net/fq_impl.h
>>> +++ b/include/net/fq_impl.h
>>> @@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>>>                         return NULL;
>>>         }
>>>
>>> -       flow = list_first_entry(head, struct fq_flow, flowchain);
>>> +       flow = list_first_entry_or_null(head, struct fq_flow,
>>> flowchain);
>>> +
>>> +       if (WARN_ON_ONCE(!flow))
>>> +               return NULL;
>>
>> This does not make sense either. list_first_entry_or_null()
>> returns NULL only when the list is empty, but we already check
>> list_empty() right before this code, and it is protected by fq->lock.
>>
>
> Hello Michal,
>
> git blame shows you as the author of the fq_impl.h code.
>
> I saw a crash when debugging funky ath10k firmware in a 4.16 + hacks
> kernel.  There was an apparent
> mostly-null deref in the fq_tin_dequeue method.  According to gdb, it
> was within
> 1 line of the dereference of 'flow'.
>
> My hack above is probably not that useful.  Cong thinks maybe the
> locking is bad.
>
> If you get a chance, please review this thread and see if you have any
> ideas for
> a better fix (or better debugging code).
>
> As always, if you would like me to generate you a buggy firmware that
> will crash
> in the tx path and cause all sorts of mayhem in the ath10k driver and
> wifi stack,
> I will be happy to do so.
>
> https://www.mail-archive.com/netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg239738.html
>
> Thanks,
> Ben
>

^ permalink raw reply

* Re: Fw: [Bug 199995] New: Ramdomly sent TCP Reset from Kernel with bonding mode "brodcast"
From: Eric Dumazet @ 2018-06-08 21:38 UTC (permalink / raw)
  To: Michal Kubecek, Stephen Hemminger; +Cc: Eric Dumazet, netdev
In-Reply-To: <20180608210403.2moomjshtwszvsso@unicorn.suse.cz>



On 06/08/2018 02:04 PM, Michal Kubecek wrote:
> On Fri, Jun 08, 2018 at 09:59:54AM -0700, Stephen Hemminger wrote:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=199995
>>
>>             Bug ID: 199995
>>            Summary: Ramdomly sent TCP Reset from Kernel with bonding mode
>>                     "brodcast"
>>
>> after a dist upgrade from Ubuntu 17.10 (Kernel 4.13.x) to Ubuntu 18.04 (Kernel
>> 4.15.0) I suffer from ramdomly generated TCP RST packets sent (presumably) by
>> the Kernel 
>> on a bonding device that uses bonding mode "brodcast" with 2 physical NICs.
>>
>> With tcpdump/whireshark I can see that the kernel randomly sends TCP-RST
>> packets after the SYN/ACK/ACK packet is received (see attached PCAP).
>> This only happens if the kernel receives the initial SYN packet on both
>> physical NICs (and therefore seeing it twice), before the connection is
>> established by sending SYN/ACK.
>> It's not happening in 100% of all cases and only, if the system can use two or
>> more CPU cores/threads. With only one CPU available to the system, this
>> behaviour is not reproducable.
> 
> I have seen similar report earlier from one of our customers running
> SLE12 SP2 (kernel 4.4). The problem is that if duplicated SYN packet is
> received on both slaves, these two copies can be processed by the
> lockless listener simultaneously on different CPUs and each can reply by
> SYNACK with different sequence number which results in a reset.
> 
> I tried to think of a way to prevent this race without losing the
> performance gain of lockless listener but couldn't come with anything.
> Eventually, I managed to persuade the customer that this setup (where
> each packet is received twice under normal circumstances) is not what
> broadcast mode was designed for (based on the description in
> Documentation/networking/bonding.txt).
> 
> However, the lockless listener was introduced in 4.4 so it's not clear
> why reporter started encountering this after an upgrade from 4.13 to
> 4.15.

Yes, I do not buy this at all.

If two identical SYN are received by two cpus, we should create one SYN_RECV and send
two SYNACK.

But it is a bit hard to test this :/

I will take a look, thanks.

^ permalink raw reply

* Re: [RFC PATCH 1/3] ebpf: add next_skb_frag bpf helper for sk filter
From: Daniel Borkmann @ 2018-06-08 21:27 UTC (permalink / raw)
  To: Tushar Dave, netdev, ast, davem, john.fastabend, jakub.kicinski,
	kafai, rdna, quentin.monnet, brakmo, acme
In-Reply-To: <1528491607-10399-2-git-send-email-tushar.n.dave@oracle.com>

On 06/08/2018 11:00 PM, Tushar Dave wrote:
> Today socket filter only deals with linear skbs. This change allows
> ebpf programs to look into non-linear skb e.g. skb frags. This will be
> useful when users need to look into data which is not contained in the
> linear part of skb.

Hmm, I don't think this statement is correct in its form here ... they
can handle non-linear skbs just fine.

Straight forward way is to use bpf_skb_load_bytes(). It's simple and uses
internally skb_header_pointer(), and that one of course walks everything
if it really has to via skb_copy_bits() (page frags _and_ frag list). And
if you need to look into mac/net headers that may otherwise not be accessible
anymore from socket layer, there's bpf_skb_load_bytes_relative() helper
which is effectively doing the negative offset trick from ld_abs/ind more
efficient for multi-byte loads.

Thanks,
Daniel

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox