Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] 8139too: Use disable_irq_nosync() in rtl8139_poll_controller()
From: David Miller @ 2018-05-02 17:22 UTC (permalink / raw)
  To: bigeasy; +Cc: netdev, mingo, tglx
In-Reply-To: <20180502113057.18209-1-bigeasy@linutronix.de>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Wed,  2 May 2018 13:30:57 +0200

> From: Ingo Molnar <mingo@elte.hu>
> 
> Use disable_irq_nosync() instead of disable_irq() as this might be
> called in atomic context with netpoll.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: phy: broadcom: add support for BCM89610 PHY
From: David Miller @ 2018-05-02 17:21 UTC (permalink / raw)
  To: vbhadram; +Cc: andrew, f.fainelli, netdev
In-Reply-To: <1525256676-24335-1-git-send-email-vbhadram@nvidia.com>

Please remove the email footer from your postings here that talks about
confidential information and whatnot.

That is expressly inappropriate for this mailing list, and any such
postings shall be ignored in their entirety.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next 1/4] ipv6: Calculate hash thresholds for IPv6 nexthops
From: Ido Schimmel @ 2018-05-02 17:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Ahern, Ido Schimmel, netdev, davem, roopa, nikolay, pch,
	jkbs, yoshfuji, mlxsw
In-Reply-To: <e4c89c14-99fc-5ba5-667b-0220520f47f3@gmail.com>

On Wed, May 02, 2018 at 09:43:50AM -0700, Eric Dumazet wrote:
> 
> 
> On 01/09/2018 07:43 PM, David Ahern wrote:
> > On 1/9/18 7:40 AM, Ido Schimmel wrote:
> >> Before we convert IPv6 to use hash-threshold instead of modulo-N, we
> >> first need each nexthop to store its region boundary in the hash
> >> function's output space.
> >>
> >> The boundary is calculated by dividing the output space equally between
> >> the different active nexthops. That is, nexthops that are not dead or
> >> linkdown.
> >>
> >> The boundaries are rebalanced whenever a nexthop is added or removed to
> >> a multipath route and whenever a nexthop becomes active or inactive.
> >>
> >> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> >> ---
> >>  include/net/ip6_fib.h   |  1 +
> >>  include/net/ip6_route.h |  7 ++++
> >>  net/ipv6/ip6_fib.c      |  8 ++---
> >>  net/ipv6/route.c        | 96 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>  4 files changed, 106 insertions(+), 6 deletions(-)
> >>
> > 
> > LGTM.
> > Acked-by: David Ahern <dsahern@gmail.com>
> > 
> 
> For some reason I have a divide by zero error booting my hosts with latest net tree.
> 
> What guarantee do we have that total is not zero when rt6_upper_bound_set() is called ?

Thanks for the report, Eric. I believe I didn't cover all the cases and
'rt6i_nh_weight' might be 0 is some cases. I'll try to reproduce and
work on a fix.

> 
> 
> 
> [    8.498639] divide error: 0000 [#1] SMP PTI
> [    8.503178] gsmi: Log Shutdown Reason 0x03
> [    8.507270] Modules linked in: bnx2x mdio
> [    8.511276] CPU: 17 PID: 116 Comm: kworker/17:0 Not tainted 4.17.0-smp-DEV #110
> [    8.518571] Hardware name: Intel RML,PCH/Ibis_QC_18, BIOS 2.40.0 06/22/2016
> [    8.525526] Workqueue: ipv6_addrconf addrconf_dad_work
> [    8.530662] RIP: 0010:rt6_multipath_rebalance.part.82+0x1cb/0x1f0
> [    8.536752] RSP: 0018:ffffba72867cbbf8 EFLAGS: 00010246
> [    8.541966] RAX: 0000000000000000 RBX: 0000000000000025 RCX: ffff9d555ab73180
> [    8.549090] RDX: 0000000000000000 RSI: ffff9d4d5a34b1c0 RDI: 0000000000000000
> [    8.556212] RBP: ffffba72867cbc00 R08: 0000000000000000 R09: 0000000000000000
> [    8.563336] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9d5559f95680
> [    8.570457] R13: ffff9d4d5a34b1c0 R14: ffff9d555ab73180 R15: 0000000000000000
> [    8.577579] FS:  0000000000000000(0000) GS:ffff9d4d5fc40000(0000) knlGS:0000000000000000
> [    8.585654] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    8.591391] CR2: 00007fffe47ff000 CR3: 0000000c39c0a001 CR4: 00000000000606e0
> [    8.598515] Call Trace:
> [    8.600961]  ? rt6_multipath_rebalance+0x21/0x30
> [    8.605579]  fib6_add+0x75f/0xf70
> [    8.608899]  ? __wake_up+0x13/0x20
> [    8.612303]  ? netlink_broadcast_filtered+0x14c/0x3c0
> [    8.617355]  __ip6_ins_rt+0x4c/0x70
> [    8.620847]  ip6_ins_rt+0x6e/0xa0
> [    8.624157]  __ipv6_ifa_notify+0x226/0x2e0
> [    8.628249]  ipv6_ifa_notify+0x2a/0x40
> [    8.631999]  addrconf_dad_completed+0x59/0x360
> [    8.636438]  addrconf_dad_work+0x11c/0x400
> [    8.640536]  ? addrconf_dad_work+0x11c/0x400
> [    8.644810]  process_one_work+0x184/0x370
> [    8.648820]  ? process_one_work+0x184/0x370
> [    8.652996]  worker_thread+0x35/0x3a0
> [    8.656654]  kthread+0x121/0x140
> [    8.659887]  ? process_one_work+0x370/0x370
> [    8.664073]  ? kthread_create_worker_on_cpu+0x70/0x70
> [    8.669118]  ret_from_fork+0x35/0x40
> [    8.672693] Code: c3 8b b9 38 01 00 00 eb aa 48 63 81 38 01 00 00 89 fa 41 89 f8 c1 ea 1f 01 fa d1 fa 48 63 d2 49 89 c2 48 c1 e0 1f 48 01 d0 31 d2 <49> f7 f0 83 e8 01 48 39 ce 89 81 b4 00 00 00 0f 85 e2 fe ff ff 
> [    8.691533] RIP: rt6_multipath_rebalance.part.82+0x1cb/0x1f0 RSP: ffffba72867cbbf8
> [    8.699135] ---[ end trace 9ae26819121cdc3a ]---
> [    8.703760] Kernel panic - not syncing: Fatal exception in interrupt
> [    8.710169] Kernel Offset: 0x3d200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [    8.721256] gsmi: Log Shutdown Reason 0x02
> [    8.725357] Rebooting in 10 seconds..
> 
> 
> 

^ permalink raw reply

* Re: [PATCH] change the comment of ip6gre_tnl_addr_conflict
From: David Miller @ 2018-05-02 17:19 UTC (permalink / raw)
  To: sunlw.fnst; +Cc: netdev
In-Reply-To: <20180502090608.13612-1-sunlw.fnst@cn.fujitsu.com>

From: Sun Lianwen <sunlw.fnst@cn.fujitsu.com>
Date: Wed, 2 May 2018 17:06:08 +0800

> The comment of ip6gre_tnl_addr_conflict() is wrong. which use
> ip6_tnl_addr_conflict instead of ip6gre_tnl_addr_conflict.
> 
> Signed-off-by: Sun Lianwen <sunlw.fnst@cn.fujitsu.com>

Please format your Subject line properly.

If should start with "[PATCH X] " where "X" is the name of the
networking GIT tree your patch is against.

Then there should be a subsystem prefix, followed by a colon ":" and a
space character.  In this case, "ipv6: " would be an appropriate
subject prefix.

Then, please make your Subject sentence more clear.

Be more precise with what you are doing.  You are correcting the
function name in a comment, so say something that expresses that.

Thank you.

^ permalink raw reply

* Re: [PATCH V2 net-next 4/6] tun: Add support for SCTP checksum offload
From: Vlad Yasevich @ 2018-05-02 17:17 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, Vladislav Yasevich
  Cc: netdev, linux-sctp, virtualization, virtio-dev, mst, jasowang,
	nhorman
In-Reply-To: <20180502145607.GH5105@localhost.localdomain>

On 05/02/2018 10:56 AM, Marcelo Ricardo Leitner wrote:
> On Wed, May 02, 2018 at 11:53:47AM -0300, Marcelo Ricardo Leitner wrote:
>> On Tue, May 01, 2018 at 10:07:37PM -0400, Vladislav Yasevich wrote:
>>> Adds a new tun offload flag to allow for SCTP checksum offload.
>>> The flag has to be set by the user and defaults to "no offload".
>>
>> I'm confused here:
>>
>>> +++ b/drivers/net/tun.c
>>> @@ -216,7 +216,7 @@ struct tun_struct {
>>>  	struct net_device	*dev;
>>>  	netdev_features_t	set_features;
>>>  #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
>>> -			  NETIF_F_TSO6)
>>> +			  NETIF_F_TSO6|NETIF_F_SCTP_CRC)
>>
>> Doesn't adding it here mean it defaults to "offload", instead?
>>
>> later on, it does:
>>                 dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
>>                                    TUN_USER_FEATURES | NETIF_F_HW_VLAN_CTAG_TX |
>>                                    NETIF_F_HW_VLAN_STAG_TX;
> 
> Missed to paste the next line too:
>                 dev->features = dev->hw_features | NETIF_F_LLTX;
> 

Yes, as a software device, we can default it to on.  However, qemu will 0-out
the features and then set them up based on the guest (just like regular checksum).

-vlad

^ permalink raw reply

* Re: [PATCH net-next v2 0/2] mlxsw: Reject unsupported FIB configurations
From: David Miller @ 2018-05-02 17:17 UTC (permalink / raw)
  To: idosch; +Cc: netdev, jiri, dsahern, mlxsw
In-Reply-To: <20180502071735.32352-1-idosch@mellanox.com>

From: Ido Schimmel <idosch@mellanox.com>
Date: Wed,  2 May 2018 10:17:33 +0300

> Recently it became possible for listeners of the FIB notification chain
> to veto operations such as addition of routes and rules.
> 
> Adjust the mlxsw driver to take advantage of it and return an error for
> unsupported FIB rules and for routes configured after the abort
> mechanism was triggered (due to exceeded resources for example).
> 
> v2:
> * Change error code in first patch to -EOPNOTSUPP (David Ahern).

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
From: Martin KaFai Lau @ 2018-05-02 17:10 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
	kernel-team, lvs-devel
In-Reply-To: <alpine.LFD.2.20.1805020932270.1964@ja.home.ssi.bg>

On Wed, May 02, 2018 at 09:38:43AM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 19 Apr 2018, Martin KaFai Lau wrote:
> 
> > This patch is not a proper fix and mainly serves for discussion purpose.
> > It is based on net-next which I have been using to debug the issue.
> > 
> > The change that works around the issue is in ensure_mtu_is_adequate().
> > Other changes are the rippling effect in function arg.
> > 
> > This bug was uncovered by one of our legacy service that
> > are still using ipvs for load balancing.  In that setup,
> > the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> > before tx it out to eth0.
> > 
> > The problem is the kernel stack could pass a skb (which was
> > originated from a sys_write(tcp_fd)) to the driver with skb->len
> > bigger than the device MTU.  In one NIC setup (with gso and tso off)
> > that we are using, it upset the NIC/driver and caused the tx queue
> > stalled for tens of seconds which is how it got uncovered.
> > (On the NIC side, the NIC firmware and driver have been fixed
> > to avoid this tx queue stall after seeing this skb).
> 
> 	In the last week I analyzed the situation and found
> that just changes in route.c are able to solve the problems,
> at 99% :) I'm posting a separate patch for this.
> 
> 	Here is what happens, I'm testing with FTP which
> starts with connection to port 21 and then with data connection,
> from local ftp client to local Virtual IP (forwarded to remote
> real server via tunnel).
> 
> - TCP creates local route which after commit 839da4d98960
> appears to be non-cached, before this commit it is cached.
> Route is saved and reused.
> 
> - initial traffic for port 21 does not use GSO. But after
> every packet IPVS calls maybe_update_pmtu (rt->dst.ops->update_pmtu)
> to report the reduced MTU. These updates are stored in fnhe_pmtu
> but they do not go to any route, even if we try to get fresh
> output route. Why? Because the local routes are not cached, so
> they can not use the fnhe. This is what my patch for route.c
> will fix. With this fix FTP-DATA gets route with reduced PMTU.
For IPv6, the 'if (rt6->rt6i_flags & RTF_LOCAL)' gate in
__ip6_rt_update_pmtu() may need to be lifted also.

> 
> - later, there are large GSO packets in the FTP-DATA connection.
> Currently, IPVS does not send ICMP error for exceeded PMTU
> by GSO packets (this will be fixed, see patch below). Even
> if they reach TCP and it refreshes its route, TCP can not get
> any actual PMTU values from the routing, so continues to
> use the large gso_size without the patch for route.c
> 
> 	For the changes in IPVS that I show below as RFC:
> 
> - I synchronized the PMTU checks in __mtu_check_toobig* with
> IPv4/IPv6 forwarding
> 
> - I modified your changes, see ipvs_gso_hlen() and how I use it
> at start of ensure_mtu_is_adequate(), after skb is made writable,
> before the PMTU checks.
> 
> 	In my tests, all variants work, just with skb_decrease_gso_size
> or even if I add SKB_GSO_DODGY+gso_segs=0. But I'm not sure
> if this is safe for the different GSO configurations.
> 
> 	I'm analyzing what can be changed in route.c, so that
> dst.ops->update_pmtu changes to take effect for the provided
> route. If that is possible, it will allow the PMTU change to
> take immediate effect for the local routes.
> 
> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 4527921..7a2a0a89 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -104,9 +104,32 @@ __ip_vs_dst_check(struct ip_vs_dest *dest)
>  	return dest_dst;
>  }
>  
> -static inline bool
> -__mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
> +/* Check if packet size violates MTU */
> +static bool __mtu_check_toobig(const struct sk_buff *skb, u32 mtu)
>  {
> +	if (skb->len <= mtu)
> +		return false;
> +
> +	if (unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0))
> +		return false;
> +
> +	if (IPCB(skb)->frag_max_size) {
> +		/* frag_max_size tell us that, this packet have been
> +		 * defragmented by netfilter IPv4 conntrack module.
> +		 */
> +		if (IPCB(skb)->frag_max_size > mtu)
> +			return true; /* largest fragment violate MTU */
> +	}
> +	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
> +		return false;
> +	return true;
> +}
> +
> +/* Check if packet size violates MTU */
> +static bool __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
> +{
> +	if (skb->len <= mtu)
> +		return false;
>  	if (IP6CB(skb)->frag_max_size) {
>  		/* frag_max_size tell us that, this packet have been
>  		 * defragmented by netfilter IPv6 conntrack module.
> @@ -114,10 +137,9 @@ __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
>  		if (IP6CB(skb)->frag_max_size > mtu)
>  			return true; /* largest fragment violate MTU */
>  	}
> -	else if (skb->len > mtu && !skb_is_gso(skb)) {
> -		return true; /* Packet size violate MTU size */
> -	}
> -	return false;
> +	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
> +		return false;
> +	return true;
>  }
>  
>  /* Get route to daddr, update *saddr, optionally bind route to saddr */
> @@ -212,11 +234,42 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
>  		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
>  }
>  
> +/* GSO: length of network and transport headers, 0=unsupported */
> +static unsigned short ipvs_gso_hlen(struct sk_buff *skb,
> +				    struct ip_vs_iphdr *ipvsh)
> +{
> +	unsigned short hlen = ipvsh->len - ipvsh->off;
> +
> +	if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)) {
> +		struct tcphdr _tcph, *th;
> +
> +		th = skb_header_pointer(skb, ipvsh->len, sizeof(_tcph), &_tcph);
> +		if (th)
> +			return hlen + (th->doff << 2);
> +	}
> +	return 0;
> +}
> + 
>  static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
>  					  int rt_mode,
>  					  struct ip_vs_iphdr *ipvsh,
>  					  struct sk_buff *skb, int mtu)
>  {
> +	/* Re-segment traffic from local clients */
> +	if (!skb->dev && skb_is_gso(skb)) {
> +		unsigned short hlen = ipvs_gso_hlen(skb, ipvsh);
> +
> +		if (hlen && mtu - hlen < skb_shinfo(skb)->gso_size &&
> +		    mtu > hlen) {
> +			u16 reduce = skb_shinfo(skb)->gso_size - (mtu - hlen);
> +
> +			IP_VS_DBG(10, "Reducing gso_size=%u with %u\n",
> +				skb_shinfo(skb)->gso_size, reduce);
> +			skb_decrease_gso_size(skb_shinfo(skb), reduce);
> +			skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> +			skb_shinfo(skb)->gso_segs = 0;
> +		}
> +	}
>  #ifdef CONFIG_IP_VS_IPV6
>  	if (skb_af == AF_INET6) {
>  		struct net *net = ipvs->net;
> @@ -240,9 +293,7 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
>  		if ((rt_mode & IP_VS_RT_MODE_TUNNEL) && !sysctl_pmtu_disc(ipvs))
>  			return true;
>  
> -		if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) &&
> -			     skb->len > mtu && !skb_is_gso(skb) &&
> -			     !ip_vs_iph_icmp(ipvsh))) {
> +		if (unlikely(__mtu_check_toobig(skb, mtu))) {
>  			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
>  				  htonl(mtu));
>  			IP_VS_DBG(1, "frag needed for %pI4\n",
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH net] ipv4: fix fnhe usage by non-cached routes
From: David Miller @ 2018-05-02 17:04 UTC (permalink / raw)
  To: ja; +Cc: netdev, kafai, kernel-team, dsahern, lucien.xin
In-Reply-To: <20180502064119.4552-1-ja@ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Wed,  2 May 2018 09:41:19 +0300

> Allow some non-cached routes to use non-expired fnhe:
 ...
> Fixes: 839da4d98960 ("net: ipv4: set orig_oif based on fib result for local traffic")
> Fixes: d6d5e999e5df ("route: do not cache fib route info on local routes with oif")
> Fixes: deed49df7390 ("route: check and remove route cache when we get route")
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Xin Long <lucien.xin@gmail.com>
> Signed-off-by: Julian Anastasov <ja@ssi.bg>

David A., please review.

^ permalink raw reply

* Re: [PATCH net-next] cxgb4: add new T5 device id's
From: David Miller @ 2018-05-02 17:04 UTC (permalink / raw)
  To: ganeshgr; +Cc: netdev, nirranjan, indranil
In-Reply-To: <1525241835-14055-1-git-send-email-ganeshgr@chelsio.com>

From: Ganesh Goudar <ganeshgr@chelsio.com>
Date: Wed,  2 May 2018 11:47:15 +0530

> Add device id's 0x5019, 0x501a and 0x501b for T5
> cards.
> 
> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>

Applied, thanks.

^ permalink raw reply

* [PATCH net] net_sched: fq: take care of throttled flows before reuse
From: Eric Dumazet @ 2018-05-02 17:03 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet

Normally, a socket can not be freed/reused unless all its TX packets
left qdisc and were TX-completed. However connect(AF_UNSPEC) allows
this to happen.

With commit fc59d5bdf1e3 ("pkt_sched: fq: clear time_next_packet for
reused flows") we cleared f->time_next_packet but took no special
action if the flow was still in the throttled rb-tree.

Since f->time_next_packet is the key used in the rb-tree searches,
blindly clearing it might break rb-tree integrity. We need to make
sure the flow is no longer in the rb-tree to avoid this problem.

Fixes: fc59d5bdf1e3 ("pkt_sched: fq: clear time_next_packet for reused flows")
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/sched/sch_fq.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index a366e4c9413ab4fe4dfb16f0255cb7632ade7f1c..4808713c73b988cc3e536cff866cf18de05375fa 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -128,6 +128,28 @@ static bool fq_flow_is_detached(const struct fq_flow *f)
 	return f->next == &detached;
 }
 
+static bool fq_flow_is_throttled(const struct fq_flow *f)
+{
+	return f->next == &throttled;
+}
+
+static void fq_flow_add_tail(struct fq_flow_head *head, struct fq_flow *flow)
+{
+	if (head->first)
+		head->last->next = flow;
+	else
+		head->first = flow;
+	head->last = flow;
+	flow->next = NULL;
+}
+
+static void fq_flow_unset_throttled(struct fq_sched_data *q, struct fq_flow *f)
+{
+	rb_erase(&f->rate_node, &q->delayed);
+	q->throttled_flows--;
+	fq_flow_add_tail(&q->old_flows, f);
+}
+
 static void fq_flow_set_throttled(struct fq_sched_data *q, struct fq_flow *f)
 {
 	struct rb_node **p = &q->delayed.rb_node, *parent = NULL;
@@ -155,15 +177,6 @@ static void fq_flow_set_throttled(struct fq_sched_data *q, struct fq_flow *f)
 
 static struct kmem_cache *fq_flow_cachep __read_mostly;
 
-static void fq_flow_add_tail(struct fq_flow_head *head, struct fq_flow *flow)
-{
-	if (head->first)
-		head->last->next = flow;
-	else
-		head->first = flow;
-	head->last = flow;
-	flow->next = NULL;
-}
 
 /* limit number of collected flows per round */
 #define FQ_GC_MAX 8
@@ -267,6 +280,8 @@ static struct fq_flow *fq_classify(struct sk_buff *skb, struct fq_sched_data *q)
 				     f->socket_hash != sk->sk_hash)) {
 				f->credit = q->initial_quantum;
 				f->socket_hash = sk->sk_hash;
+				if (fq_flow_is_throttled(f))
+					fq_flow_unset_throttled(q, f);
 				f->time_next_packet = 0ULL;
 			}
 			return f;
@@ -438,9 +453,7 @@ static void fq_check_throttled(struct fq_sched_data *q, u64 now)
 			q->time_next_delayed_flow = f->time_next_packet;
 			break;
 		}
-		rb_erase(p, &q->delayed);
-		q->throttled_flows--;
-		fq_flow_add_tail(&q->old_flows, f);
+		fq_flow_unset_throttled(q, f);
 	}
 }
 
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related

* Re: [RFC v2 bpf-next 8/9] bpf: Provide helper to do lookups in kernel FIB table
From: David Miller @ 2018-05-02 17:00 UTC (permalink / raw)
  To: dsahern
  Cc: brouer, netdev, borkmann, ast, shm, roopa, toke, john.fastabend,
	bernat
In-Reply-To: <4a92f04a-6865-ae70-1c54-e4b92d886491@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Wed, 2 May 2018 09:37:21 -0600

> To share numbers from recent testing I did using Vincent's modules,
> lookup times in nsec (using local_clock) with MULTIPLE_TABLES config
> disabled for IPv4 and IPv6
> 
>             IPv4    IPv6-dst    IPv6-fib6
> baseline     49        126         52
> 
> I have other cases with combinations of configs and rules, but this
> shows the best possible case.
> 
> IPv6 needs some more work to improve speeds with MULTIPLE_TABLES enabled
> (separate local and main tables unlike IPv4) and IPV6_SUBTREES enabled.

Yes, like for ipv4 sharing local and main tables will help a lot.

^ permalink raw reply

* Re: [PATCH bpf-next 0/3] bpf: cleanups on managing subprog information
From: Jiong Wang @ 2018-05-02 16:59 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: borkmann, ecree, netdev, oss-drivers
In-Reply-To: <20180501222257.cogbkcsncrmg54p5@ast-mbp>

On 01/05/2018 23:22, Alexei Starovoitov wrote:
...
> [   27.784931]  ? bpf_int_jit_compile+0x7ac/0xab0
> [   27.785475]  bpf_int_jit_compile+0x2b6/0xab0
> [   27.786001]  ? do_jit+0x6020/0x6020
> [   27.786428]  ? kasan_kmalloc+0xa0/0xd0
> [   27.786885]  bpf_check+0x2c05/0x4c40
> [   27.787346]  ? fixup_bpf_calls+0x1140/0x1140
> [   27.787865]  ? kasan_unpoison_shadow+0x30/0x40
> [   27.788406]  ? kasan_kmalloc+0xa0/0xd0
> [   27.788865]  ? memset+0x1f/0x40
> [   27.789255]  ? bpf_obj_name_cpy+0x2d/0x200
> [   27.789750]  bpf_prog_load+0xb07/0xeb0
>
> simply running test_verifier with JIT and kasan on.

Ah, sorry, I should add "sysctl net/core/bpf_jit_enable=1" to my test
script, error reproduced.

convert_ctx_accesses and fixup_bpf_calls might insert ebpf insns that
prog->len would change.

The new fake "exit" subprog whose .start offset is prog->len should be
updated as well.

The "for" condition in adjust_subprog_starts:

   for (i = 0; i < env->subprog_cnt; i++) {

need to be changed into:

   for (i = 0; i <= env->subprog_cnt; i++) {

Will respin the patch set.

Thanks.

Regards,
Jiong

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: enable stackmap with build_id in nmi context
From: Song Liu @ 2018-05-02 16:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Networking, Kernel Team, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20180502092109.GI12180@hirez.programming.kicks-ass.net>



> On May 2, 2018, at 2:21 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Tue, May 01, 2018 at 05:02:19PM -0700, Song Liu wrote:
>> @@ -267,17 +285,27 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>> {
>> 	int i;
>> 	struct vm_area_struct *vma;
>> +	bool in_nmi_ctx = in_nmi();
>> +	bool irq_work_busy = false;
>> +	struct stack_map_irq_work *work;
>> +
>> +	if (in_nmi_ctx) {
>> +		work = this_cpu_ptr(&irq_work);
>> +		if (work->sem)
>> +			/* cannot queue more up_read, fallback */
>> +			irq_work_busy = true;
>> +	}
>> 
>> 	/*
>> +	 * We cannot do up_read() in nmi context. To do build_id lookup
>> +	 * in nmi context, we need to run up_read() in irq_work. We use
>> +	 * a percpu variable to do the irq_work. If the irq_work is
>> +	 * already used by another lookup, we fall back to report ips.
>> 	 *
>> 	 * Same fallback is used for kernel stack (!user) on a stackmap
>> 	 * with build_id.
>> 	 */
>> +	if (!user || !current || !current->mm || irq_work_busy ||
>> 	    down_read_trylock(&current->mm->mmap_sem) == 0) {
>> 		/* cannot access current->mm, fall back to ips */
>> 		for (i = 0; i < trace_nr; i++) {
>> @@ -299,7 +327,13 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>> 			- vma->vm_start;
>> 		id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
>> 	}
>> +
>> +	if (!in_nmi_ctx)
>> +		up_read(&current->mm->mmap_sem);
>> +	else {
>> +		work->sem = &current->mm->mmap_sem;
>> +		irq_work_queue(&work->work);
>> +	}
>> }
> 
> This is disguisting.. :-)
> 
> It's broken though, I've bet you've never actually ran this with lockdep
> enabled for example.

I am not following here. I just run the new selftest with CONFIG_LOCKDEP on, 
and got no warning for this. 


> Also, you set work->sem before you do trylock, if the trylock fails you
> return early and keep work->sem set, which will thereafter always result
> in irq_work_busy.

work->sem was set after down_read_trylock(). I guess you misread the patch?

Thanks,
Song

^ permalink raw reply

* Re: [PATCH net-next 1/4] ipv6: Calculate hash thresholds for IPv6 nexthops
From: Eric Dumazet @ 2018-05-02 16:43 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel, netdev
  Cc: davem, roopa, nikolay, pch, jkbs, yoshfuji, mlxsw
In-Reply-To: <5550c628-5014-427b-60c9-71cf80462723@gmail.com>



On 01/09/2018 07:43 PM, David Ahern wrote:
> On 1/9/18 7:40 AM, Ido Schimmel wrote:
>> Before we convert IPv6 to use hash-threshold instead of modulo-N, we
>> first need each nexthop to store its region boundary in the hash
>> function's output space.
>>
>> The boundary is calculated by dividing the output space equally between
>> the different active nexthops. That is, nexthops that are not dead or
>> linkdown.
>>
>> The boundaries are rebalanced whenever a nexthop is added or removed to
>> a multipath route and whenever a nexthop becomes active or inactive.
>>
>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>> ---
>>  include/net/ip6_fib.h   |  1 +
>>  include/net/ip6_route.h |  7 ++++
>>  net/ipv6/ip6_fib.c      |  8 ++---
>>  net/ipv6/route.c        | 96 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 106 insertions(+), 6 deletions(-)
>>
> 
> LGTM.
> Acked-by: David Ahern <dsahern@gmail.com>
> 

For some reason I have a divide by zero error booting my hosts with latest net tree.

What guarantee do we have that total is not zero when rt6_upper_bound_set() is called ?



[    8.498639] divide error: 0000 [#1] SMP PTI
[    8.503178] gsmi: Log Shutdown Reason 0x03
[    8.507270] Modules linked in: bnx2x mdio
[    8.511276] CPU: 17 PID: 116 Comm: kworker/17:0 Not tainted 4.17.0-smp-DEV #110
[    8.518571] Hardware name: Intel RML,PCH/Ibis_QC_18, BIOS 2.40.0 06/22/2016
[    8.525526] Workqueue: ipv6_addrconf addrconf_dad_work
[    8.530662] RIP: 0010:rt6_multipath_rebalance.part.82+0x1cb/0x1f0
[    8.536752] RSP: 0018:ffffba72867cbbf8 EFLAGS: 00010246
[    8.541966] RAX: 0000000000000000 RBX: 0000000000000025 RCX: ffff9d555ab73180
[    8.549090] RDX: 0000000000000000 RSI: ffff9d4d5a34b1c0 RDI: 0000000000000000
[    8.556212] RBP: ffffba72867cbc00 R08: 0000000000000000 R09: 0000000000000000
[    8.563336] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9d5559f95680
[    8.570457] R13: ffff9d4d5a34b1c0 R14: ffff9d555ab73180 R15: 0000000000000000
[    8.577579] FS:  0000000000000000(0000) GS:ffff9d4d5fc40000(0000) knlGS:0000000000000000
[    8.585654] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.591391] CR2: 00007fffe47ff000 CR3: 0000000c39c0a001 CR4: 00000000000606e0
[    8.598515] Call Trace:
[    8.600961]  ? rt6_multipath_rebalance+0x21/0x30
[    8.605579]  fib6_add+0x75f/0xf70
[    8.608899]  ? __wake_up+0x13/0x20
[    8.612303]  ? netlink_broadcast_filtered+0x14c/0x3c0
[    8.617355]  __ip6_ins_rt+0x4c/0x70
[    8.620847]  ip6_ins_rt+0x6e/0xa0
[    8.624157]  __ipv6_ifa_notify+0x226/0x2e0
[    8.628249]  ipv6_ifa_notify+0x2a/0x40
[    8.631999]  addrconf_dad_completed+0x59/0x360
[    8.636438]  addrconf_dad_work+0x11c/0x400
[    8.640536]  ? addrconf_dad_work+0x11c/0x400
[    8.644810]  process_one_work+0x184/0x370
[    8.648820]  ? process_one_work+0x184/0x370
[    8.652996]  worker_thread+0x35/0x3a0
[    8.656654]  kthread+0x121/0x140
[    8.659887]  ? process_one_work+0x370/0x370
[    8.664073]  ? kthread_create_worker_on_cpu+0x70/0x70
[    8.669118]  ret_from_fork+0x35/0x40
[    8.672693] Code: c3 8b b9 38 01 00 00 eb aa 48 63 81 38 01 00 00 89 fa 41 89 f8 c1 ea 1f 01 fa d1 fa 48 63 d2 49 89 c2 48 c1 e0 1f 48 01 d0 31 d2 <49> f7 f0 83 e8 01 48 39 ce 89 81 b4 00 00 00 0f 85 e2 fe ff ff 
[    8.691533] RIP: rt6_multipath_rebalance.part.82+0x1cb/0x1f0 RSP: ffffba72867cbbf8
[    8.699135] ---[ end trace 9ae26819121cdc3a ]---
[    8.703760] Kernel panic - not syncing: Fatal exception in interrupt
[    8.710169] Kernel Offset: 0x3d200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    8.721256] gsmi: Log Shutdown Reason 0x02
[    8.725357] Rebooting in 10 seconds..

^ permalink raw reply

* Re: [PATCH net-next v9 2/4] net: Introduce generic failover module
From: Jiri Pirko @ 2018-05-02 16:15 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, aaron.f.brown
In-Reply-To: <20180428090601.GL5632@nanopsycho.orion>

Sat, Apr 28, 2018 at 11:06:01AM CEST, jiri@resnulli.us wrote:
>Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudrala@intel.com wrote:

[...]


>>+
>>+	err = netdev_rx_handler_register(slave_dev, net_failover_handle_frame,
>>+					 failover_dev);
>>+	if (err) {
>>+		netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
>>+			   err);
>>+		goto err_handler_register;
>>+	}
>>+
>>+	err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
>
>Please use netdev_master_upper_dev_link().

Don't forget to fillup struct netdev_lag_upper_info - NETDEV_LAG_TX_TYPE_ACTIVEBACKUP


Also, please call netdev_lower_state_changed() when the active slave
device changes from primary->backup of backup->primary and whenever link
state of a slave changes

[...]

^ permalink raw reply

* Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-05-02 16:05 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
	virtualization, loseweigh, aaron.f.brown, davem
In-Reply-To: <c94ce6a9-6f36-675c-cf5b-414454fc2244@intel.com>

Wed, May 02, 2018 at 05:34:44PM CEST, sridhar.samudrala@intel.com wrote:
>On 5/2/2018 12:50 AM, Jiri Pirko wrote:
>> Wed, May 02, 2018 at 02:20:26AM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/30/2018 12:20 AM, Jiri Pirko wrote:
>> > > > > Now I try to change mac of the failover master:
>> > > > > [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
>> > > > > RTNETLINK answers: Operation not supported
>> > > > > 
>> > > > > That I did expect to work. I would expect this would change the mac of
>> > > > > the master and both standby and primary slaves.
>> > > > If a VF is untrusted, a VM will not able to change its MAC and moreover
>> > > Note that at this point, I have no VF. So I'm not sure why you mention
>> > > that.
>> > > 
>> > > 
>> > > > in this mode we are assuming that the hypervisor has assigned the MAC and
>> > > > guest is not expected to change the MAC.
>> > > Wait, for ordinary old-fashioned virtio_net, as a VM user, I can change
>> > > mac and all works fine. How is this different? Change mac on "failover
>> > > instance" should work and should propagate the mac down to its slaves.
>> > > 
>> > > 
>> > > > For the initial implementation, i would propose not allowing the guest to
>> > > > change the MAC of failover or standby dev.
>> > > I see no reason for such restriction.
>> > > 
>> > It is true that a VM user can change mac address of a normal virtio-net interface,
>> > however when it is in STANDBY mode i think we should not allow this change specifically
>> > because we are creating a failover instance based on a MAC that is assigned by the
>> > hypervisor.
>> > 
>> > Moreover,  in a cloud environment i would think that PF/hypervisor assigns a MAC to
>> > the VF and it cannot be changed by the guest.
>> So that is easy. You allow the change of the mac and in the "failover"
>> mac change implementation you propagate the change down to slaves. If
>> one slave does not support the change, you bail out. And since VF does
>> not allow it as you say, once it will be enslaved, the mac change could
>> not be done. Seems like a correct behavior to me and is in-sync with how
>> bond/team behaves.
>
>If we allow the mac to be changed when the VF is not yet plugged in
>and the guest changes the mac, then VF cannot join the failover when
>it is hot plugged with the original mac assigned by the hypervisor.
>Specifically there could be timing window during the live migration where
>VF would be unplugged at the source and if we allow the guest to change the
>failover mac, then VF would not be able to register with the failover after
>the VM is migrated to the destination.

Okay. So as suggested by mst, just forbid the mac changes all together.


>
>> 
>> > So for the initial implementation, do you see any issues with having this restriction
>> > in STANDBY mode.
>> > 
>> > 
>

^ permalink raw reply

* Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-05-02 16:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, stephen, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	jasowang, loseweigh, aaron.f.brown
In-Reply-To: <20180502184326-mutt-send-email-mst@kernel.org>

Wed, May 02, 2018 at 05:47:27PM CEST, mst@redhat.com wrote:
>On Wed, May 02, 2018 at 09:50:21AM +0200, Jiri Pirko wrote:
>> Wed, May 02, 2018 at 02:20:26AM CEST, sridhar.samudrala@intel.com wrote:
>> >On 4/30/2018 12:20 AM, Jiri Pirko wrote:
>> >> 
>> >> > > Now I try to change mac of the failover master:
>> >> > > [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
>> >> > > RTNETLINK answers: Operation not supported
>> >> > > 
>> >> > > That I did expect to work. I would expect this would change the mac of
>> >> > > the master and both standby and primary slaves.
>> >> > If a VF is untrusted, a VM will not able to change its MAC and moreover
>> >> Note that at this point, I have no VF. So I'm not sure why you mention
>> >> that.
>> >> 
>> >> 
>> >> > in this mode we are assuming that the hypervisor has assigned the MAC and
>> >> > guest is not expected to change the MAC.
>> >> Wait, for ordinary old-fashioned virtio_net, as a VM user, I can change
>> >> mac and all works fine. How is this different? Change mac on "failover
>> >> instance" should work and should propagate the mac down to its slaves.
>> >> 
>> >> 
>> >> > For the initial implementation, i would propose not allowing the guest to
>> >> > change the MAC of failover or standby dev.
>> >> I see no reason for such restriction.
>> >> 
>> >
>> >It is true that a VM user can change mac address of a normal virtio-net interface,
>> >however when it is in STANDBY mode i think we should not allow this change specifically
>> >because we are creating a failover instance based on a MAC that is assigned by the
>> >hypervisor.
>> >
>> >Moreover,  in a cloud environment i would think that PF/hypervisor assigns a MAC to
>> >the VF and it cannot be changed by the guest.
>> 
>> So that is easy. You allow the change of the mac and in the "failover"
>> mac change implementation you propagate the change down to slaves. If
>> one slave does not support the change, you bail out. And since VF does
>
>I wish people would say primary/standby and not "VF" :)

Sure, sorry.


>
>> not allow it as you say, once it will be enslaved, the mac change could
>> not be done. Seems like a correct behavior to me
>
>
>what if primary does not allow mac changes and is attached after
>mac is changed on standy?

Mac should be changed on failover. In that case, the primary would have
a different mac and therefore it won't get enslaved.

>
>
>> and is in-sync with how
>> bond/team behaves.
>
>I think in the end virtio can just block MAC changes for simplicity.
>
>I personally don't see softmac as a must have feature in v1,
>we can add it later.

Okay.


>
>What's the situation with init scripts and whether it's
>possible to make them work well would be a better question.
>
>> 
>> >
>> >So for the initial implementation, do you see any issues with having this restriction
>> >in STANDBY mode.
>> >
>> >

^ permalink raw reply

* Re: [PATCH bpf-next v2] bpf: relax constraints on formatting for eBPF helper documentation
From: Daniel Borkmann @ 2018-05-02 15:48 UTC (permalink / raw)
  To: Quentin Monnet, ast; +Cc: dsahern, yhs, ecree, netdev, oss-drivers
In-Reply-To: <20180502132024.14550-1-quentin.monnet@netronome.com>

On 05/02/2018 03:20 PM, Quentin Monnet wrote:
> The Python script used to parse and extract eBPF helpers documentation
> from include/uapi/linux/bpf.h expects a very specific formatting for the
> descriptions (single dot represents a space, '>' stands for a tab):
> 
>     /*
>      ...
>      *.int bpf_helper(list of arguments)
>      *.>    Description
>      *.>    >       Start of description
>      *.>    >       Another line of description
>      *.>    >       And yet another line of description
>      *.>    Return
>      *.>    >       0 on success, or a negative error in case of failure
>      ...
>      */
> 
> This is too strict, and painful for developers who wants to add
> documentation for new helpers. Worse, it is extremely difficult to check
> that the formatting is correct during reviews. Change the format
> expected by the script and make it more flexible. The script now works
> whether or not the initial space (right after the star) is present, and
> accepts both tabs and white spaces (or a combination of both) for
> indenting description sections and contents.
> 
> Concretely, something like the following would now be supported:
> 
>     /*
>      ...
>      *int bpf_helper(list of arguments)
>      *......Description
>      *.>    >       Start of description...
>      *>     >       Another line of description
>      *..............And yet another line of description
>      *>     Return
>      *.>    ........0 on success, or a negative error in case of failure
>      ...
>      */
> 
> While at it, remove unnecessary carets from each regex used with match()
> in the script. They are redundant, as match() tries to match from the
> beginning of the string by default.
> 
> v2: Remove unnecessary caret when a regex is used with match().
> 
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>

Applied to bpf-next, thanks Quentin!

^ permalink raw reply

* Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Michael S. Tsirkin @ 2018-05-02 15:47 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Samudrala, Sridhar, stephen, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	jasowang, loseweigh, aaron.f.brown
In-Reply-To: <20180502075021.GC19250@nanopsycho>

On Wed, May 02, 2018 at 09:50:21AM +0200, Jiri Pirko wrote:
> Wed, May 02, 2018 at 02:20:26AM CEST, sridhar.samudrala@intel.com wrote:
> >On 4/30/2018 12:20 AM, Jiri Pirko wrote:
> >> 
> >> > > Now I try to change mac of the failover master:
> >> > > [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
> >> > > RTNETLINK answers: Operation not supported
> >> > > 
> >> > > That I did expect to work. I would expect this would change the mac of
> >> > > the master and both standby and primary slaves.
> >> > If a VF is untrusted, a VM will not able to change its MAC and moreover
> >> Note that at this point, I have no VF. So I'm not sure why you mention
> >> that.
> >> 
> >> 
> >> > in this mode we are assuming that the hypervisor has assigned the MAC and
> >> > guest is not expected to change the MAC.
> >> Wait, for ordinary old-fashioned virtio_net, as a VM user, I can change
> >> mac and all works fine. How is this different? Change mac on "failover
> >> instance" should work and should propagate the mac down to its slaves.
> >> 
> >> 
> >> > For the initial implementation, i would propose not allowing the guest to
> >> > change the MAC of failover or standby dev.
> >> I see no reason for such restriction.
> >> 
> >
> >It is true that a VM user can change mac address of a normal virtio-net interface,
> >however when it is in STANDBY mode i think we should not allow this change specifically
> >because we are creating a failover instance based on a MAC that is assigned by the
> >hypervisor.
> >
> >Moreover,  in a cloud environment i would think that PF/hypervisor assigns a MAC to
> >the VF and it cannot be changed by the guest.
> 
> So that is easy. You allow the change of the mac and in the "failover"
> mac change implementation you propagate the change down to slaves. If
> one slave does not support the change, you bail out. And since VF does

I wish people would say primary/standby and not "VF" :)

> not allow it as you say, once it will be enslaved, the mac change could
> not be done. Seems like a correct behavior to me


what if primary does not allow mac changes and is attached after
mac is changed on standy?


> and is in-sync with how
> bond/team behaves.

I think in the end virtio can just block MAC changes for simplicity.

I personally don't see softmac as a must have feature in v1,
we can add it later.

What's the situation with init scripts and whether it's
possible to make them work well would be a better question.

> 
> >
> >So for the initial implementation, do you see any issues with having this restriction
> >in STANDBY mode.
> >
> >

^ permalink raw reply

* Re: [PATCH v3 2/3] selftests/bpf: Makefile: add include path
From: Daniel Borkmann @ 2018-05-02 15:43 UTC (permalink / raw)
  To: Sirio Balmelli; +Cc: netdev
In-Reply-To: <20180502110505.7n567iqy6duuzjbs@vm4>

On 05/02/2018 01:05 PM, Sirio Balmelli wrote:
> On some systems selftests fail to build, missing the following headers:
> 
> asm/byteorder.h
> asm/socket.h
> asm/swab.h
> 
> In the specific case of Ubuntu, this is because the files are in
> '/usr/include/x86_64-linux-gnu' (see <https://wiki.ubuntu.com/MultiarchSpec>)
> which is both architecture- and distro-specific.
> 
> The solution is to add $(KERNEL)/usr/include to the Makefile,
> so the build references these from the current kernel build
> and not the running system.
> 
> Signed-off-by: Sirio Balmelli <sirio@b-ad.ch>
> ---
>  tools/testing/selftests/bpf/Makefile | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 9d76218..1ec09c6 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -84,7 +84,10 @@ else
>    CPU ?= generic
>  endif
>  
> -CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \
> +# we are in 'tools/testing/selftests/bpf'
> +KERNEL=../../../..
> +TOOLS=../../..
> +CLANG_FLAGS = -I. -I./include/uapi -I$(TOOLS)/include/uapi -I$(KERNEL)/usr/include \
>  	      -Wno-compare-distinct-pointer-types

First patch in the series looks good to me, thanks Sirio! Problem with this
one is still as described earlier in: http://patchwork.ozlabs.org/patch/906505/

This will break people's setup and build bots since they expect headers to be
available from tools infra and not -I<kernel-root>/usr/include/ via headers_install,
so adding the latter is not possible (unless Arnaldo agrees to rework the whole tools
include infra in a better way). This means the only other option we have is to pull
them into tools/ instead, even if this results in a lot of duplication, but whether
we like it or not that is the way the tools/include stuff was designed along with
perf, meaning unless there's a fundamental rework, we will have to stick to it.

Thanks,
Daniel

>  $(OUTPUT)/test_l4lb_noinline.o: CLANG_FLAGS += -fno-inline
> 

^ permalink raw reply

* Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring
From: Michael S. Tsirkin @ 2018-05-02 15:42 UTC (permalink / raw)
  To: Tiwei Bie
  Cc: Jason Wang, virtualization, linux-kernel, netdev, wexu, jfreimann
In-Reply-To: <20180502151255.h3x6rhszxa3euinl@debian>

On Wed, May 02, 2018 at 11:12:55PM +0800, Tiwei Bie wrote:
> On Wed, May 02, 2018 at 04:51:01PM +0300, Michael S. Tsirkin wrote:
> > On Wed, May 02, 2018 at 03:28:19PM +0800, Tiwei Bie wrote:
> > > On Wed, May 02, 2018 at 10:51:06AM +0800, Jason Wang wrote:
> > > > On 2018年04月25日 13:15, Tiwei Bie wrote:
> > > > > This commit introduces the event idx support in packed
> > > > > ring. This feature is temporarily disabled, because the
> > > > > implementation in this patch may not work as expected,
> > > > > and some further discussions on the implementation are
> > > > > needed, e.g. do we have to check the wrap counter when
> > > > > checking whether a kick is needed?
> > > > > 
> > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > ---
> > > > >   drivers/virtio/virtio_ring.c | 53 ++++++++++++++++++++++++++++++++++++++++----
> > > > >   1 file changed, 49 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index 0181e93897be..b1039c2985b9 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -986,7 +986,7 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
> > > > >   static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > >   {
> > > > >   	struct vring_virtqueue *vq = to_vvq(_vq);
> > > > > -	u16 flags;
> > > > > +	u16 new, old, off_wrap, flags;
> > > > >   	bool needs_kick;
> > > > >   	u32 snapshot;
> > > > > @@ -995,7 +995,12 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > >   	 * suppressions. */
> > > > >   	virtio_mb(vq->weak_barriers);
> > > > > +	old = vq->next_avail_idx - vq->num_added;
> > > > > +	new = vq->next_avail_idx;
> > > > > +	vq->num_added = 0;
> > > > > +
> > > > >   	snapshot = *(u32 *)vq->vring_packed.device;
> > > > > +	off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 0xffff);
> > > > >   	flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 0x3;
> > > > >   #ifdef DEBUG
> > > > > @@ -1006,7 +1011,10 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > >   	vq->last_add_time_valid = false;
> > > > >   #endif
> > > > > -	needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > > > > +	if (flags == VRING_EVENT_F_DESC)
> > > > > +		needs_kick = vring_need_event(off_wrap & ~(1<<15), new, old);
> > > > 
> > > > I wonder whether or not the math is correct. Both new and event are in the
> > > > unit of descriptor ring size, but old looks not.
> > > 
> > > What vring_need_event() cares is the distance between
> > > `new` and `old`, i.e. vq->num_added. So I think there
> > > is nothing wrong with `old`. But the calculation of the
> > > distance between `new` and `event_idx` isn't right when
> > > `new` wraps. How do you think about the below code:
> > > 
> > > 	wrap_counter = off_wrap >> 15;
> > > 	event_idx = off_wrap & ~(1<<15);
> > > 	if (wrap_counter != vq->wrap_counter)
> > > 		event_idx -= vq->vring_packed.num;
> > > 	
> > > 	needs_kick = vring_need_event(event_idx, new, old);
> > 
> > I suspect this hack won't work for non power of 2 ring.
> 
> Above code doesn't require the ring size to be a power of 2.
> 
> For (__u16)(new_idx - old), what we want to get is vq->num_added.
> 
> old = vq->next_avail_idx - vq->num_added;
> new = vq->next_avail_idx;
> 
> When vq->next_avail_idx >= vq->num_added, it's obvious that,
> (__u16)(new_idx - old) is vq->num_added.
> 
> And when vq->next_avail_idx < vq->num_added, new will be smaller
> than old (old will be a big unsigned number), but (__u16)(new_idx
> - old) is still vq->num_added.
> 
> For (__u16)(new_idx - event_idx - 1), when new wraps and event_idx
> doesn't wrap, the most straightforward way to calculate it is:
> (new + vq->vring_packed.num) - event_idx - 1.

So how about we use the straightforward way then?

> But we can also calculate it in this way:
> 
> event_idx -= vq->vring_packed.num;
> (event_idx will be a big unsigned number)
> 
> Then (__u16)(new_idx - event_idx - 1) will be the value we want.
> 
> Best regards,
> Tiwei Bie


> > 
> > 
> > > Best regards,
> > > Tiwei Bie
> > > 
> > > 
> > > > 
> > > > Thanks
> > > > 
> > > > > +	else
> > > > > +		needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > > > >   	END_USE(vq);
> > > > >   	return needs_kick;
> > > > >   }
> > > > > @@ -1116,6 +1124,15 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> > > > >   	if (vq->last_used_idx >= vq->vring_packed.num)
> > > > >   		vq->last_used_idx -= vq->vring_packed.num;
> > > > > +	/* If we expect an interrupt for the next entry, tell host
> > > > > +	 * by writing event index and flush out the write before
> > > > > +	 * the read in the next get_buf call. */
> > > > > +	if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
> > > > > +		virtio_store_mb(vq->weak_barriers,
> > > > > +				&vq->vring_packed.driver->off_wrap,
> > > > > +				cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
> > > > > +						(vq->wrap_counter << 15)));
> > > > > +
> > > > >   #ifdef DEBUG
> > > > >   	vq->last_add_time_valid = false;
> > > > >   #endif
> > > > > @@ -1143,10 +1160,17 @@ static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> > > > >   	/* We optimistically turn back on interrupts, then check if there was
> > > > >   	 * more to do. */
> > > > > +	/* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > > > > +	 * either clear the flags bit or point the event index at the next
> > > > > +	 * entry. Always update the event index to keep code simple. */
> > > > > +
> > > > > +	vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > > > > +			vq->last_used_idx | (vq->wrap_counter << 15));
> > > > >   	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > > > >   		virtio_wmb(vq->weak_barriers);
> > > > > -		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > > > > +		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > > > > +						     VRING_EVENT_F_ENABLE;
> > > > >   		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > > > >   							vq->event_flags_shadow);
> > > > >   	}
> > > > > @@ -1172,15 +1196,34 @@ static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
> > > > >   static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> > > > >   {
> > > > >   	struct vring_virtqueue *vq = to_vvq(_vq);
> > > > > +	u16 bufs, used_idx, wrap_counter;
> > > > >   	START_USE(vq);
> > > > >   	/* We optimistically turn back on interrupts, then check if there was
> > > > >   	 * more to do. */
> > > > > +	/* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > > > > +	 * either clear the flags bit or point the event index at the next
> > > > > +	 * entry. Always update the event index to keep code simple. */
> > > > > +
> > > > > +	/* TODO: tune this threshold */
> > > > > +	bufs = (u16)(vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
> > > > > +
> > > > > +	used_idx = vq->last_used_idx + bufs;
> > > > > +	wrap_counter = vq->wrap_counter;
> > > > > +
> > > > > +	if (used_idx >= vq->vring_packed.num) {
> > > > > +		used_idx -= vq->vring_packed.num;
> > > > > +		wrap_counter ^= 1;
> > > > > +	}
> > > > > +
> > > > > +	vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > > > > +			used_idx | (wrap_counter << 15));
> > > > >   	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > > > >   		virtio_wmb(vq->weak_barriers);
> > > > > -		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > > > > +		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > > > > +						     VRING_EVENT_F_ENABLE;
> > > > >   		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > > > >   							vq->event_flags_shadow);
> > > > >   	}
> > > > > @@ -1822,8 +1865,10 @@ void vring_transport_features(struct virtio_device *vdev)
> > > > >   		switch (i) {
> > > > >   		case VIRTIO_RING_F_INDIRECT_DESC:
> > > > >   			break;
> > > > > +#if 0
> > > > >   		case VIRTIO_RING_F_EVENT_IDX:
> > > > >   			break;
> > > > > +#endif
> > > > >   		case VIRTIO_F_VERSION_1:
> > > > >   			break;
> > > > >   		case VIRTIO_F_IOMMU_PLATFORM:
> > > > 

^ permalink raw reply

* Re: [RFC v2 bpf-next 9/9] samples/bpf: Add examples of ipv4 and ipv6 forwarding in XDP
From: David Ahern @ 2018-05-02 15:40 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, borkmann, ast, davem, shm, roopa, toke, john.fastabend,
	Tariq Toukan
In-Reply-To: <20180502131306.07c3acee@redhat.com>

On 5/2/18 5:13 AM, Jesper Dangaard Brouer wrote:
> 
> On Sun, 29 Apr 2018 11:07:52 -0700 David Ahern <dsahern@gmail.com> wrote:
> 
>> +	/* verify egress index has xdp support */
>> +	// TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with
>> +	//       cannot pass map_type 14 into func bpf_map_lookup_elem#1:
> 
> I just want to point out that I/we are aware of this "usability"
> problem with the sample program, but I don't want to block the FIB
> helper going upstream, we can fix this problem later.
> 
> The problem is that if you load this bpf/xdp prog, then all incoming
> traffic (on that interface), will be forward using XDP, regardless
> whether the egress ifindex support XDP or not.  And if not supported,
> then packets are dropped hard (only detectable via tracepoints).
> 
> If the bpf prog could tell/know that the egress ifindex doesn't
> support XDP xmit, then it could simply return XDP_PASS to fallback to
> the normal network stack.
> 

That's what the above lookup is intending but DEVMAP type does not
handle lookups from xdp programs. Any chance of fixing that?

^ permalink raw reply

* Re: [RFC v2 bpf-next 8/9] bpf: Provide helper to do lookups in kernel FIB table
From: David Ahern @ 2018-05-02 15:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, borkmann, ast, davem, shm, roopa, toke, john.fastabend,
	Vincent Bernat
In-Reply-To: <20180502132736.3560fcac@redhat.com>

On 5/2/18 5:27 AM, Jesper Dangaard Brouer wrote:
> On Sun, 29 Apr 2018 11:07:51 -0700
> David Ahern <dsahern@gmail.com> wrote:
> 
>> Initial performance numbers collected by Jesper, forwarded packets/sec:
>>
>>        Full stack    XDP FIB lookup    XDP Direct lookup
>> IPv4   1,947,969       7,074,156          7,415,333
>> IPv6   1,728,000       6,165,504          7,262,720
> 
> Do notice these number is single CPU core forwarding performance!
> On a Broadwell E5-1650 v4 @ 3.60GHz.

I'll add that context to the commit message. Thanks,

> 
> Another interesting data point is that xdp_redirect_map performance is
> 13,365,161 pps, which allow us to calculate/isolate the overhead/cost
> of the FIB lookup.
> 
> (1/13365161-1/7074156)*10^9 = -66.5 ns
> (1/13365161-1/7415333)*10^9 = -60.0 ns
> 
> Which is very close to the measured 50 ns cost of the FIB lookup, done
> by Vincent Bernat.
>   See: https://vincent.bernat.im/en/blog/2017-ipv4-route-lookup-linux
> 
> 
> 
> Another way I calculate this is by (ran a new benchmark):
> 
> Performance: 7641593 (7,641,593) <= tx_unicast /sec
>  * Packet-gap: (1/7641593*10^9)  = 130.86 ns
> 
> Find all FIB related lookup functions in perf-report::
> 
>  Samples: 93K of event 'cycles:ppp', Event count (approx.): 88553104553
>    Overhead   Cost      CPU  Command      Symbol
>      20.63 %  26.99 ns  002  ksoftirqd/2  [k] fib_table_lookup
>      12.92 %  16.90 ns  002  ksoftirqd/2  [k] bpf_fib_lookup
>       2.40 %   3.14 ns  002  ksoftirqd/2  [k] fib_select_path
>       0.83 %   1.09 ns  002  ksoftirqd/2  [k] fib_get_table
>       0.40 %   0.52 ns  002  ksoftirqd/2  [k] l3mdev_fib_table_rcu
>  -----------
>  Tot:37.18 % (20.63+12.92+2.40+0.83+0.40)
>  ===========
> 
> Cost of FIB lookup:
>  - 130.86/100*37.18 = 48.65 ns overhead by FIB lookup.
> 
> Again very close to Vincent's IPv4 measurements of ~50 ns.
> 
> 
> 
> Notice that the IPv6 measurements does not match up:
>  https://vincent.bernat.im/en/blog/2017-ipv6-route-lookup-linux
> This is because, we/I'm just testing the IPv6 route cache here...
> 

Vincent's blog is before recent changes -- 4.15 getting the rcu locking,
net-next getting separate fib entries and now this set adding a FIB
lookup without the dst.

To share numbers from recent testing I did using Vincent's modules,
lookup times in nsec (using local_clock) with MULTIPLE_TABLES config
disabled for IPv4 and IPv6

            IPv4    IPv6-dst    IPv6-fib6
baseline     49        126         52

I have other cases with combinations of configs and rules, but this
shows the best possible case.

IPv6 needs some more work to improve speeds with MULTIPLE_TABLES enabled
(separate local and main tables unlike IPv4) and IPV6_SUBTREES enabled.

^ permalink raw reply

* Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-05-02 15:34 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, aaron.f.brown
In-Reply-To: <20180502075021.GC19250@nanopsycho>

On 5/2/2018 12:50 AM, Jiri Pirko wrote:
> Wed, May 02, 2018 at 02:20:26AM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/30/2018 12:20 AM, Jiri Pirko wrote:
>>>>> Now I try to change mac of the failover master:
>>>>> [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
>>>>> RTNETLINK answers: Operation not supported
>>>>>
>>>>> That I did expect to work. I would expect this would change the mac of
>>>>> the master and both standby and primary slaves.
>>>> If a VF is untrusted, a VM will not able to change its MAC and moreover
>>> Note that at this point, I have no VF. So I'm not sure why you mention
>>> that.
>>>
>>>
>>>> in this mode we are assuming that the hypervisor has assigned the MAC and
>>>> guest is not expected to change the MAC.
>>> Wait, for ordinary old-fashioned virtio_net, as a VM user, I can change
>>> mac and all works fine. How is this different? Change mac on "failover
>>> instance" should work and should propagate the mac down to its slaves.
>>>
>>>
>>>> For the initial implementation, i would propose not allowing the guest to
>>>> change the MAC of failover or standby dev.
>>> I see no reason for such restriction.
>>>
>> It is true that a VM user can change mac address of a normal virtio-net interface,
>> however when it is in STANDBY mode i think we should not allow this change specifically
>> because we are creating a failover instance based on a MAC that is assigned by the
>> hypervisor.
>>
>> Moreover,  in a cloud environment i would think that PF/hypervisor assigns a MAC to
>> the VF and it cannot be changed by the guest.
> So that is easy. You allow the change of the mac and in the "failover"
> mac change implementation you propagate the change down to slaves. If
> one slave does not support the change, you bail out. And since VF does
> not allow it as you say, once it will be enslaved, the mac change could
> not be done. Seems like a correct behavior to me and is in-sync with how
> bond/team behaves.

If we allow the mac to be changed when the VF is not yet plugged in
and the guest changes the mac, then VF cannot join the failover when
it is hot plugged with the original mac assigned by the hypervisor.
Specifically there could be timing window during the live migration where
VF would be unplugged at the source and if we allow the guest to change the
failover mac, then VF would not be able to register with the failover after
the VM is migrated to the destination.

>
>> So for the initial implementation, do you see any issues with having this restriction
>> in STANDBY mode.
>>
>>

^ permalink raw reply

* Re: [PATCH net-next v2 2/2] mlxsw: spectrum_router: Return an error for routes added after abort
From: David Ahern @ 2018-05-02 15:28 UTC (permalink / raw)
  To: Ido Schimmel, netdev; +Cc: davem, jiri, mlxsw
In-Reply-To: <20180502071735.32352-3-idosch@mellanox.com>

On 5/2/18 1:17 AM, Ido Schimmel wrote:
> We currently do not perform accounting in the driver and thus can't
> reject routes before resources are exceeded.
> 
> However, in order to make users aware of the fact that routes are no
> longer offloaded we can return an error for routes configured after the
> abort mechanism was triggered.
> 
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> index added380e344..8028d221aece 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> @@ -5928,6 +5928,13 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
>  						     router->mlxsw_sp);
>  		if (!err || info->extack)
>  			return notifier_from_errno(err);
> +		break;
> +	case FIB_EVENT_ENTRY_ADD:
> +		if (router->aborted) {
> +			NL_SET_ERR_MSG_MOD(info->extack, "FIB offload was aborted. Not configuring route");
> +			return notifier_from_errno(-EINVAL);
> +		}
> +		break;
>  	}
>  
>  	fib_work = kzalloc(sizeof(*fib_work), GFP_ATOMIC);
> 

Reasonable next step.

Acked-by: David Ahern <dsahern@gmail.com>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox