Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net 1/2] l2tp: reject creation of non-PPP sessions on L2TPv2 tunnels
From: Guillaume Nault @ 2018-06-15 13:39 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman
In-Reply-To: <cover.1529065935.git.g.nault@alphalink.fr>

The /proc/net/pppol2tp handlers (pppol2tp_seq_*()) iterate over all
L2TPv2 tunnels, and rightfully expect that only PPP sessions can be
found there. However, l2tp_netlink accepts creating Ethernet sessions
regardless of the underlying tunnel version.

This confuses pppol2tp_seq_session_show(), which expects that
l2tp_session_priv() returns a pppol2tp_session structure. When the
session is an Ethernet pseudo-wire, a struct l2tp_eth_sess is returned
instead. This leads to invalid memory access when
pppol2tp_session_get_sock() later tries to dereference ps->sk.

Fixes: d9e31d17ceba ("l2tp: Add L2TP ethernet pseudowire support")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_netlink.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c
index 6616c9fd292f..5b9900889e31 100644
--- a/net/l2tp/l2tp_netlink.c
+++ b/net/l2tp/l2tp_netlink.c
@@ -553,6 +553,12 @@ static int l2tp_nl_cmd_session_create(struct sk_buff *skb, struct genl_info *inf
 		goto out_tunnel;
 	}

+	/* L2TPv2 only accepts PPP pseudo-wires */
+	if (tunnel->version == 2 && cfg.pw_type != L2TP_PWTYPE_PPP) {
+		ret = -EPROTONOSUPPORT;
+		goto out_tunnel;
+	}
+
 	if (tunnel->version > 2) {
 		if (info->attrs[L2TP_ATTR_DATA_SEQ])
 			cfg.data_seq = nla_get_u8(info->attrs[L2TP_ATTR_DATA_SEQ]);
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH 07/17] net: convert sock.sk_wmem_alloc from atomic_t to refcount_t
From: Eric Dumazet @ 2018-06-15 13:39 UTC (permalink / raw)
  To: Eric Dumazet, David Woodhouse, Elena Reshetova, netdev
  Cc: Krzysztof Mazur, Kevin Darbyshire-Bryant, chas, Mathias Kresin
In-Reply-To: <9805cd6e-1c21-f7b4-cc0e-d4a343b8e3d8@gmail.com>



On 06/15/2018 06:27 AM, Eric Dumazet wrote:
> 
> 
> On 06/15/2018 05:29 AM, David Woodhouse wrote:
>> On Fri, 2017-06-30 at 13:08 +0300, Elena Reshetova wrote:
>>> diff --git a/include/linux/atmdev.h b/include/linux/atmdev.h
>>> index c1da539..4d97a89 100644
>>> --- a/include/linux/atmdev.h
>>> +++ b/include/linux/atmdev.h
>>> @@ -254,7 +254,7 @@ static inline void atm_return(struct atm_vcc *vcc,int truesize)
>>>  
>>>  static inline int atm_may_send(struct atm_vcc *vcc,unsigned int size)
>>>  {
>>> -       return (size + atomic_read(&sk_atm(vcc)->sk_wmem_alloc)) <
>>> +       return (size + refcount_read(&sk_atm(vcc)->sk_wmem_alloc)) <
>>>                sk_atm(vcc)->sk_sndbuf;
>>>  }
>>
>>
>> Hm, this (commit 14afee4b6092fd) may have broken PPPoATM. I did spend a
>> while staring hard at my own commit 9d02daf754238 which introduced
>> pppoatm_may_send(), but it's actually atm_may_send() which is never
>> allowing packets to be pushed.
>>
>> Debugging (which is ongoing) has so far shown that we are accounting
>> for a packet in pppoatm_send() which has skb->truesize==0x1c0, and
>> later end up in pppoatm_pop()→atm_raw_pop() with skb->truesize==0x2c0.
>>
>> This was always harmless before, but now it's a refcount_t it appears
>> to underflow and go into its "screw you" mode and never let any more
>> packets get sent.
>>
>> I'm staring hard at the special case in pskb_expand_head() to *not*
>> change skb->truesize under certain circumstances, and wondering if that
>> (is, should be) covering the case of ATM skbs:
>>
>>         /* It is not generally safe to change skb->truesize.
>>          * For the moment, we really care of rx path, or
>>          * when skb is orphaned (not attached to a socket).
>>          */
>>         if (!skb->sk || skb->destructor == sock_edemux)
>>                 skb->truesize += size - osize;
>>
>> Failing that, maybe we should copy the accounted value of skb->truesize 
>> into the struct skb_atm_data in skb->cb at the time we add it to
>> sk_wmem_alloc — and then subtract *that* value from sk_wmem_alloc in
>> atm_raw_pop() instead of the then current value of skb->truesize.
>>
>> Suggestions?
>>
> 
> Maybe ATM should make sure skb->sk is set ?
> 
> something like the following :
> 

Or simply use a new field in ATM_SKB(skb) to remember a stable truesize used in both sides (add/sub)

^ permalink raw reply

* [PATCH net 0/2] l2tp: l2tp_ppp must ignore non-PPP sessions
From: Guillaume Nault @ 2018-06-15 13:39 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman

The original L2TP code was written for version 2 of the protocol, which
could only carry PPP sessions. Then L2TPv3 generalised the protocol so that
it could transport different kinds of pseudo-wires. But parts of the
l2tp_ppp module still break in presence of non-PPP sessions.

Assuming L2TPv2 tunnels can only transport PPP sessions is right, but
l2tp_netlink failed to ensure that (fixed in patch 1).
When retrieving a session from an arbitrary tunnel, l2tp_ppp needs to
filter out non-PPP sessions (last occurrence fixed in patch 2).

Guillaume Nault (2):
  l2tp: reject creation of non-PPP sessions on L2TPv2 tunnels
  l2tp: filter out non-PPP sessions in pppol2tp_tunnel_ioctl()

 net/l2tp/l2tp_netlink.c | 6 ++++++
 net/l2tp/l2tp_ppp.c     | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

-- 
2.17.1

^ permalink raw reply

* Re: [PATCH] net_sched: blackhole: tell upper qdisc about dropped packets
From: Konstantin Khlebnikov @ 2018-06-15 13:21 UTC (permalink / raw)
  To: Eric Dumazet, netdev, David S. Miller
  Cc: Cong Wang, Jiri Pirko, Jamal Hadi Salim
In-Reply-To: <e654c245-7745-f49a-c995-7071e0e57153@gmail.com>

On 15.06.2018 16:13, Eric Dumazet wrote:
> 
> 
> On 06/15/2018 03:27 AM, Konstantin Khlebnikov wrote:
>> When blackhole is used on top of classful qdisc like hfsc it breaks
>> qlen and backlog counters because packets are disappear without notice.
>>
>> In HFSC non-zero qlen while all classes are inactive triggers warning:
>> WARNING: ... at net/sched/sch_hfsc.c:1393 hfsc_dequeue+0xba4/0xe90 [sch_hfsc]
>> and schedules watchdog work endlessly.
>>
>> This patch return __NET_XMIT_BYPASS in addition to NET_XMIT_SUCCESS,
>> this flag tells upper layer: this packet is gone and isn't queued.
>>
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>> ---
>>   net/sched/sch_blackhole.c |    2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/sched/sch_blackhole.c b/net/sched/sch_blackhole.c
>> index c98a61e980ba..9c4c2bb547d7 100644
>> --- a/net/sched/sch_blackhole.c
>> +++ b/net/sched/sch_blackhole.c
>> @@ -21,7 +21,7 @@ static int blackhole_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>>   			     struct sk_buff **to_free)
>>   {
>>   	qdisc_drop(skb, sch, to_free);
>> -	return NET_XMIT_SUCCESS;
>> +	return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> 
> Why do not we use instead :
> 
> 	return qdisc_drop(skb, sch, to_free);
> 
> Although noop_enqueue() seems to use :
> 
> 	return NET_XMIT_CN;
> 
> Oh well.
> 
> 

I suppose "blackhole" should work like "successful" xmit, but counted as drop.

^ permalink raw reply

* Re: [PATCH 07/17] net: convert sock.sk_wmem_alloc from atomic_t to refcount_t
From: Eric Dumazet @ 2018-06-15 13:27 UTC (permalink / raw)
  To: David Woodhouse, Elena Reshetova, netdev
  Cc: Krzysztof Mazur, Kevin Darbyshire-Bryant, chas, Mathias Kresin
In-Reply-To: <1529065762.27158.40.camel@infradead.org>



On 06/15/2018 05:29 AM, David Woodhouse wrote:
> On Fri, 2017-06-30 at 13:08 +0300, Elena Reshetova wrote:
>> diff --git a/include/linux/atmdev.h b/include/linux/atmdev.h
>> index c1da539..4d97a89 100644
>> --- a/include/linux/atmdev.h
>> +++ b/include/linux/atmdev.h
>> @@ -254,7 +254,7 @@ static inline void atm_return(struct atm_vcc *vcc,int truesize)
>>  
>>  static inline int atm_may_send(struct atm_vcc *vcc,unsigned int size)
>>  {
>> -       return (size + atomic_read(&sk_atm(vcc)->sk_wmem_alloc)) <
>> +       return (size + refcount_read(&sk_atm(vcc)->sk_wmem_alloc)) <
>>                sk_atm(vcc)->sk_sndbuf;
>>  }
> 
> 
> Hm, this (commit 14afee4b6092fd) may have broken PPPoATM. I did spend a
> while staring hard at my own commit 9d02daf754238 which introduced
> pppoatm_may_send(), but it's actually atm_may_send() which is never
> allowing packets to be pushed.
> 
> Debugging (which is ongoing) has so far shown that we are accounting
> for a packet in pppoatm_send() which has skb->truesize==0x1c0, and
> later end up in pppoatm_pop()→atm_raw_pop() with skb->truesize==0x2c0.
> 
> This was always harmless before, but now it's a refcount_t it appears
> to underflow and go into its "screw you" mode and never let any more
> packets get sent.
> 
> I'm staring hard at the special case in pskb_expand_head() to *not*
> change skb->truesize under certain circumstances, and wondering if that
> (is, should be) covering the case of ATM skbs:
> 
>         /* It is not generally safe to change skb->truesize.
>          * For the moment, we really care of rx path, or
>          * when skb is orphaned (not attached to a socket).
>          */
>         if (!skb->sk || skb->destructor == sock_edemux)
>                 skb->truesize += size - osize;
> 
> Failing that, maybe we should copy the accounted value of skb->truesize 
> into the struct skb_atm_data in skb->cb at the time we add it to
> sk_wmem_alloc — and then subtract *that* value from sk_wmem_alloc in
> atm_raw_pop() instead of the then current value of skb->truesize.
> 
> Suggestions?
> 

Maybe ATM should make sure skb->sk is set ?

something like the following :

diff --git a/net/atm/pppoatm.c b/net/atm/pppoatm.c
index 21d9d341a6199255a017437954e4b688f1ba5bfd..4863d02939a26ce0a5816da86779336255d550bd 100644
--- a/net/atm/pppoatm.c
+++ b/net/atm/pppoatm.c
@@ -350,7 +350,7 @@ static int pppoatm_send(struct ppp_channel *chan, struct sk_buff *skb)
                return 1;
        }
 
-       refcount_add(skb->truesize, &sk_atm(ATM_SKB(skb)->vcc)->sk_wmem_alloc);
+       skb_set_owner_w(skb, &sk_atm(ATM_SKB(skb)->vcc));
        ATM_SKB(skb)->atm_options = ATM_SKB(skb)->vcc->atm_options;
        pr_debug("atm_skb(%p)->vcc(%p)->dev(%p)\n",
                 skb, ATM_SKB(skb)->vcc, ATM_SKB(skb)->vcc->dev);

Or something more sophisticated, but this might need more thinking

diff --git a/net/atm/pppoatm.c b/net/atm/pppoatm.c
index 21d9d341a6199255a017437954e4b688f1ba5bfd..80902a0b56a1d15d2851a07f067490d99136d214 100644
--- a/net/atm/pppoatm.c
+++ b/net/atm/pppoatm.c
@@ -350,7 +350,10 @@ static int pppoatm_send(struct ppp_channel *chan, struct sk_buff *skb)
                return 1;
        }
 
-       refcount_add(skb->truesize, &sk_atm(ATM_SKB(skb)->vcc)->sk_wmem_alloc);
+       if (!skb->sk)
+               skb_set_owner_w(skb, &sk_atm(ATM_SKB(skb)->vcc));
+-      else
+               refcount_add(skb->truesize, &sk_atm(ATM_SKB(skb)->vcc)->sk_wmem_alloc);
        ATM_SKB(skb)->atm_options = ATM_SKB(skb)->vcc->atm_options;
        pr_debug("atm_skb(%p)->vcc(%p)->dev(%p)\n",
                 skb, ATM_SKB(skb)->vcc, ATM_SKB(skb)->vcc->dev);

^ permalink raw reply related

* [PATCH net 4/4] mlxsw: spectrum_switchdev: Fix port_vlan refcounting
From: Ido Schimmel @ 2018-06-15 13:23 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, petrm, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20180615132338.14241-1-idosch@mellanox.com>

From: Petr Machata <petrm@mellanox.com>

Switchdev notifications for addition of SWITCHDEV_OBJ_ID_PORT_VLAN are
distributed not only on clean addition, but also when flags on an
existing VLAN are changed. mlxsw_sp_bridge_port_vlan_add() calls
mlxsw_sp_port_vlan_get() to get at the port_vlan in question, which
implicitly references the object. This then leads to discrepancies in
reference counting when the VLAN is removed. spectrum.c warns about the
problem when the module is removed:

[13578.493090] WARNING: CPU: 0 PID: 2454 at drivers/net/ethernet/mellanox/mlxsw/spectrum.c:2973 mlxsw_sp_port_remove+0xfd/0x110 [mlxsw_spectrum]
[...]
[13578.627106] Call Trace:
[13578.629617]  mlxsw_sp_fini+0x2a/0xe0 [mlxsw_spectrum]
[13578.634748]  mlxsw_core_bus_device_unregister+0x3e/0x130 [mlxsw_core]
[13578.641290]  mlxsw_pci_remove+0x13/0x40 [mlxsw_pci]
[13578.646238]  pci_device_remove+0x31/0xb0
[13578.650244]  device_release_driver_internal+0x14f/0x220
[13578.655562]  driver_detach+0x32/0x70
[13578.659183]  bus_remove_driver+0x47/0xa0
[13578.663134]  pci_unregister_driver+0x1e/0x80
[13578.667486]  mlxsw_sp_module_exit+0xc/0x3fa [mlxsw_spectrum]
[13578.673207]  __x64_sys_delete_module+0x13b/0x1e0
[13578.677888]  ? exit_to_usermode_loop+0x78/0x80
[13578.682374]  do_syscall_64+0x39/0xe0
[13578.685976]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix by putting the port_vlan when mlxsw_sp_port_vlan_bridge_join()
determines it's a flag-only change.

Fixes: b3529af6bb0d ("spectrum: Reference count VLAN entries")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index e97652c40d13..eea5666a86b2 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -1018,8 +1018,10 @@ mlxsw_sp_port_vlan_bridge_join(struct mlxsw_sp_port_vlan *mlxsw_sp_port_vlan,
 	int err;
 
 	/* No need to continue if only VLAN flags were changed */
-	if (mlxsw_sp_port_vlan->bridge_port)
+	if (mlxsw_sp_port_vlan->bridge_port) {
+		mlxsw_sp_port_vlan_put(mlxsw_sp_port_vlan);
 		return 0;
+	}
 
 	err = mlxsw_sp_port_vlan_fid_join(mlxsw_sp_port_vlan, bridge_port);
 	if (err)
-- 
2.14.4

^ permalink raw reply related

* [PATCH net 3/4] mlxsw: spectrum_router: Align with new route replace logic
From: Ido Schimmel @ 2018-06-15 13:23 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, petrm, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20180615132338.14241-1-idosch@mellanox.com>

Commit f34436a43092 ("net/ipv6: Simplify route replace and appending
into multipath route") changed the IPv6 route replace logic so that the
first matching route (i.e., same metric) is replaced.

Have mlxsw replace the first matching route as well.

Fixes: f34436a43092 ("net/ipv6: Simplify route replace and appending into multipath route")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c   | 21 +++++----------------
 1 file changed, 5 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index c8956ab224ea..6aaaf3d9ba31 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -4756,12 +4756,6 @@ static void mlxsw_sp_rt6_destroy(struct mlxsw_sp_rt6 *mlxsw_sp_rt6)
 	kfree(mlxsw_sp_rt6);
 }
 
-static bool mlxsw_sp_fib6_rt_can_mp(const struct fib6_info *rt)
-{
-	/* RTF_CACHE routes are ignored */
-	return (rt->fib6_flags & (RTF_GATEWAY | RTF_ADDRCONF)) == RTF_GATEWAY;
-}
-
 static struct fib6_info *
 mlxsw_sp_fib6_entry_rt(const struct mlxsw_sp_fib6_entry *fib6_entry)
 {
@@ -5169,7 +5163,7 @@ static struct mlxsw_sp_fib6_entry *
 mlxsw_sp_fib6_node_entry_find(const struct mlxsw_sp_fib_node *fib_node,
 			      const struct fib6_info *nrt, bool replace)
 {
-	struct mlxsw_sp_fib6_entry *fib6_entry, *fallback = NULL;
+	struct mlxsw_sp_fib6_entry *fib6_entry;
 
 	list_for_each_entry(fib6_entry, &fib_node->entry_list, common.list) {
 		struct fib6_info *rt = mlxsw_sp_fib6_entry_rt(fib6_entry);
@@ -5178,18 +5172,13 @@ mlxsw_sp_fib6_node_entry_find(const struct mlxsw_sp_fib_node *fib_node,
 			continue;
 		if (rt->fib6_table->tb6_id != nrt->fib6_table->tb6_id)
 			break;
-		if (replace && rt->fib6_metric == nrt->fib6_metric) {
-			if (mlxsw_sp_fib6_rt_can_mp(rt) ==
-			    mlxsw_sp_fib6_rt_can_mp(nrt))
-				return fib6_entry;
-			if (mlxsw_sp_fib6_rt_can_mp(nrt))
-				fallback = fallback ?: fib6_entry;
-		}
+		if (replace && rt->fib6_metric == nrt->fib6_metric)
+			return fib6_entry;
 		if (rt->fib6_metric > nrt->fib6_metric)
-			return fallback ?: fib6_entry;
+			return fib6_entry;
 	}
 
-	return fallback;
+	return NULL;
 }
 
 static int
-- 
2.14.4

^ permalink raw reply related

* [PATCH net 2/4] mlxsw: spectrum_router: Allow appending to dev-only routes
From: Ido Schimmel @ 2018-06-15 13:23 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, petrm, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20180615132338.14241-1-idosch@mellanox.com>

Commit f34436a43092 ("net/ipv6: Simplify route replace and appending
into multipath route") changed the IPv6 route append logic so that
dev-only routes can be appended and not only gatewayed routes.

Align mlxsw with the new behaviour.

Fixes: f34436a43092 ("net/ipv6: Simplify route replace and appending into multipath route")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 27 +++++++++++++++-------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 77b2adb29341..c8956ab224ea 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -4771,11 +4771,11 @@ mlxsw_sp_fib6_entry_rt(const struct mlxsw_sp_fib6_entry *fib6_entry)
 
 static struct mlxsw_sp_fib6_entry *
 mlxsw_sp_fib6_node_mp_entry_find(const struct mlxsw_sp_fib_node *fib_node,
-				 const struct fib6_info *nrt, bool replace)
+				 const struct fib6_info *nrt, bool append)
 {
 	struct mlxsw_sp_fib6_entry *fib6_entry;
 
-	if (!mlxsw_sp_fib6_rt_can_mp(nrt) || replace)
+	if (!append)
 		return NULL;
 
 	list_for_each_entry(fib6_entry, &fib_node->entry_list, common.list) {
@@ -4790,8 +4790,7 @@ mlxsw_sp_fib6_node_mp_entry_find(const struct mlxsw_sp_fib_node *fib_node,
 			break;
 		if (rt->fib6_metric < nrt->fib6_metric)
 			continue;
-		if (rt->fib6_metric == nrt->fib6_metric &&
-		    mlxsw_sp_fib6_rt_can_mp(rt))
+		if (rt->fib6_metric == nrt->fib6_metric)
 			return fib6_entry;
 		if (rt->fib6_metric > nrt->fib6_metric)
 			break;
@@ -5316,7 +5315,8 @@ static void mlxsw_sp_fib6_entry_replace(struct mlxsw_sp *mlxsw_sp,
 }
 
 static int mlxsw_sp_router_fib6_add(struct mlxsw_sp *mlxsw_sp,
-				    struct fib6_info *rt, bool replace)
+				    struct fib6_info *rt, bool replace,
+				    bool append)
 {
 	struct mlxsw_sp_fib6_entry *fib6_entry;
 	struct mlxsw_sp_fib_node *fib_node;
@@ -5342,7 +5342,7 @@ static int mlxsw_sp_router_fib6_add(struct mlxsw_sp *mlxsw_sp,
 	/* Before creating a new entry, try to append route to an existing
 	 * multipath entry.
 	 */
-	fib6_entry = mlxsw_sp_fib6_node_mp_entry_find(fib_node, rt, replace);
+	fib6_entry = mlxsw_sp_fib6_node_mp_entry_find(fib_node, rt, append);
 	if (fib6_entry) {
 		err = mlxsw_sp_fib6_entry_nexthop_add(mlxsw_sp, fib6_entry, rt);
 		if (err)
@@ -5350,6 +5350,14 @@ static int mlxsw_sp_router_fib6_add(struct mlxsw_sp *mlxsw_sp,
 		return 0;
 	}
 
+	/* We received an append event, yet did not find any route to
+	 * append to.
+	 */
+	if (WARN_ON(append)) {
+		err = -EINVAL;
+		goto err_fib6_entry_append;
+	}
+
 	fib6_entry = mlxsw_sp_fib6_entry_create(mlxsw_sp, fib_node, rt);
 	if (IS_ERR(fib6_entry)) {
 		err = PTR_ERR(fib6_entry);
@@ -5367,6 +5375,7 @@ static int mlxsw_sp_router_fib6_add(struct mlxsw_sp *mlxsw_sp,
 err_fib6_node_entry_link:
 	mlxsw_sp_fib6_entry_destroy(mlxsw_sp, fib6_entry);
 err_fib6_entry_create:
+err_fib6_entry_append:
 err_fib6_entry_nexthop_add:
 	mlxsw_sp_fib_node_put(mlxsw_sp, fib_node);
 	return err;
@@ -5717,7 +5726,7 @@ static void mlxsw_sp_router_fib6_event_work(struct work_struct *work)
 	struct mlxsw_sp_fib_event_work *fib_work =
 		container_of(work, struct mlxsw_sp_fib_event_work, work);
 	struct mlxsw_sp *mlxsw_sp = fib_work->mlxsw_sp;
-	bool replace;
+	bool replace, append;
 	int err;
 
 	rtnl_lock();
@@ -5728,8 +5737,10 @@ static void mlxsw_sp_router_fib6_event_work(struct work_struct *work)
 	case FIB_EVENT_ENTRY_APPEND: /* fall through */
 	case FIB_EVENT_ENTRY_ADD:
 		replace = fib_work->event == FIB_EVENT_ENTRY_REPLACE;
+		append = fib_work->event == FIB_EVENT_ENTRY_APPEND;
 		err = mlxsw_sp_router_fib6_add(mlxsw_sp,
-					       fib_work->fen6_info.rt, replace);
+					       fib_work->fen6_info.rt, replace,
+					       append);
 		if (err)
 			mlxsw_sp_router_fib_abort(mlxsw_sp);
 		mlxsw_sp_rt6_release(fib_work->fen6_info.rt);
-- 
2.14.4

^ permalink raw reply related

* [PATCH net 1/4] ipv6: Only emit append events for appended routes
From: Ido Schimmel @ 2018-06-15 13:23 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, petrm, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20180615132338.14241-1-idosch@mellanox.com>

Current code will emit an append event in the FIB notification chain for
any route added with NLM_F_APPEND set, even if the route was not
appended to any existing route.

This is inconsistent with IPv4 where such an event is only emitted when
the new route is appended after an existing one.

Align IPv6 behavior with IPv4, thereby allowing listeners to more easily
handle these events.

Fixes: f34436a43092 ("net/ipv6: Simplify route replace and appending into multipath route")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 net/ipv6/ip6_fib.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 7aa4c41a3bd9..39d1d487eca2 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -934,6 +934,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
 {
 	struct fib6_info *leaf = rcu_dereference_protected(fn->leaf,
 				    lockdep_is_held(&rt->fib6_table->tb6_lock));
+	enum fib_event_type event = FIB_EVENT_ENTRY_ADD;
 	struct fib6_info *iter = NULL, *match = NULL;
 	struct fib6_info __rcu **ins;
 	int replace = (info->nlh &&
@@ -1013,6 +1014,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
 				       "Can not append to a REJECT route");
 			return -EINVAL;
 		}
+		event = FIB_EVENT_ENTRY_APPEND;
 		rt->fib6_nsiblings = match->fib6_nsiblings;
 		list_add_tail(&rt->fib6_siblings, &match->fib6_siblings);
 		match->fib6_nsiblings++;
@@ -1034,15 +1036,12 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
 	 *	insert node
 	 */
 	if (!replace) {
-		enum fib_event_type event;
-
 		if (!add)
 			pr_warn("NLM_F_CREATE should be set when creating new route\n");
 
 add:
 		nlflags |= NLM_F_CREATE;
 
-		event = append ? FIB_EVENT_ENTRY_APPEND : FIB_EVENT_ENTRY_ADD;
 		err = call_fib6_entry_notifiers(info->nl_net, event, rt,
 						extack);
 		if (err)
-- 
2.14.4

^ permalink raw reply related

* [PATCH net 0/4] mlxsw: IPv6 and reference counting fixes
From: Ido Schimmel @ 2018-06-15 13:23 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, petrm, dsahern, mlxsw, Ido Schimmel

Hi,

The first three patches fix a mismatch between the new IPv6 behavior
introduced in commit f34436a43092 ("net/ipv6: Simplify route replace and
appending into multipath route") and mlxsw. The patches allow the driver
to support multipathing in IPv6 overlays with GRE tunnel devices. A
selftest will be submitted when net-next opens.

The last patch fixes a reference count problem of the port_vlan struct.
I plan to simplify the code in net-next, so that reference counting is
not necessary anymore.

Ido Schimmel (3):
  ipv6: Only emit append events for appended routes
  mlxsw: spectrum_router: Allow appending to dev-only routes
  mlxsw: spectrum_router: Align with new route replace logic

Petr Machata (1):
  mlxsw: spectrum_switchdev: Fix port_vlan refcounting

 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 48 +++++++++++-----------
 .../ethernet/mellanox/mlxsw/spectrum_switchdev.c   |  4 +-
 net/ipv6/ip6_fib.c                                 |  5 +--
 3 files changed, 29 insertions(+), 28 deletions(-)

-- 
2.14.4

^ permalink raw reply

* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: Daniel Borkmann @ 2018-06-15 13:22 UTC (permalink / raw)
  To: Steffen Klassert, David Miller; +Cc: pablo, netfilter-devel, netdev
In-Reply-To: <20180615061740.eqgr2iv722oydheb@gauss3.secunet.de>

Hi Steffen,

On 06/15/2018 08:17 AM, Steffen Klassert wrote:
> On Thu, Jun 14, 2018 at 10:18:31AM -0700, David Miller wrote:
>> From: Pablo Neira Ayuso <pablo@netfilter.org>
>> Date: Thu, 14 Jun 2018 16:19:34 +0200
>>
>>> This patchset proposes a new fast forwarding path infrastructure
>>> that combines the GRO/GSO and the flowtable infrastructures. The
>>> idea is to add a hook at the GRO layer that is invoked before the
>>> standard GRO protocol offloads. This allows us to build custom
>>> packet chains that we can quickly pass in one go to the neighbour
>>> layer to define fast forwarding path for flows.
>>
>> We have full, complete, customizability of the packet path via XDP
>> and eBPF.
>>
>> XDP and eBPF supports everything necessary to accomplish that,
>> there are implementations of forwarding implementations in
>> the tree and elsewhere.
>>
>> And most importantly, XDP and eBPF are optimized in drivers and
>> offloaded to hardware.
>>
>> There really is no need for something like what you are proposing.
> 
> I started with this last year because I wanted to improve
> the IPsec (and UDP) forwarding path. Batching packets
> at layer2  and send them directly to the output path
> seemed to be a good method to improve this.
> 
> In particular, we need to do only one IPsec lookup
> for the whole packet chain. So it relaxes the pain
> from reomoving the IPsec flowcache a bit. It can be
> only a first step, but we need some improvements here
> as people start to complain about that.

But did you also experiment with XDP on this? Would be curious about
the numbers. You'd get implicit batching for the forwarding via devmap
as well if you're required to flush it out via different device with
XDP_REDIRECT; otherwise XDP_TX of course. Given we have recently
integrated helpers for XDP to do a FIB and neighbor lookup from the
kernel tables, where it's thus shared and integrated with the rest of
the stack and tooling, it would be awesome to get to the same point
with xfrm as well. Eyal recently did a start on that for xfrm for tc
progs; would be nice to have integration on XDP as well, potentially
it might also result in a bigger plus on the forwarding numbers.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH] net_sched: blackhole: tell upper qdisc about dropped packets
From: Eric Dumazet @ 2018-06-15 13:13 UTC (permalink / raw)
  To: Konstantin Khlebnikov, netdev, David S. Miller
  Cc: Cong Wang, Jiri Pirko, Jamal Hadi Salim
In-Reply-To: <152905845136.88583.6197221349040585517.stgit@buzz>



On 06/15/2018 03:27 AM, Konstantin Khlebnikov wrote:
> When blackhole is used on top of classful qdisc like hfsc it breaks
> qlen and backlog counters because packets are disappear without notice.
> 
> In HFSC non-zero qlen while all classes are inactive triggers warning:
> WARNING: ... at net/sched/sch_hfsc.c:1393 hfsc_dequeue+0xba4/0xe90 [sch_hfsc]
> and schedules watchdog work endlessly.
> 
> This patch return __NET_XMIT_BYPASS in addition to NET_XMIT_SUCCESS,
> this flag tells upper layer: this packet is gone and isn't queued.
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>  net/sched/sch_blackhole.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sched/sch_blackhole.c b/net/sched/sch_blackhole.c
> index c98a61e980ba..9c4c2bb547d7 100644
> --- a/net/sched/sch_blackhole.c
> +++ b/net/sched/sch_blackhole.c
> @@ -21,7 +21,7 @@ static int blackhole_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>  			     struct sk_buff **to_free)
>  {
>  	qdisc_drop(skb, sch, to_free);
> -	return NET_XMIT_SUCCESS;
> +	return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;

Why do not we use instead :

	return qdisc_drop(skb, sch, to_free);

Although noop_enqueue() seems to use :

	return NET_XMIT_CN;

Oh well.

^ permalink raw reply

* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: Eric Dumazet @ 2018-06-15 13:01 UTC (permalink / raw)
  To: Steffen Klassert, Eric Dumazet; +Cc: Pablo Neira Ayuso, netfilter-devel, netdev
In-Reply-To: <20180615060307.fwwc2vnb5rpc464p@gauss3.secunet.de>



On 06/14/2018 11:03 PM, Steffen Klassert wrote:
> On Thu, Jun 14, 2018 at 08:57:20AM -0700, Eric Dumazet wrote:
>>

>> Saving cpu cycles on moderate load is not okay if added complexity
>> slows down the DDOS (or stress) by 10 % :/
> 
> Why 10%?
> 

GRO adds a ~6 % cost on UDP receive path at this moment, depending on the state
of GRO engine (number of packets in the napi->gro_list)

Adding yet another conditions and icache pressure might raise the cost to 10%,
but we do not know because the numbers presented in this RFC do not include that.

(Early demux is also adding extra costs for UDP on 'non connected sockets' BTW)

Most linux hosts are not routers, but end hosts, lets not forget this...

^ permalink raw reply

* Re: [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage
From: Sudeep Holla @ 2018-06-15 12:53 UTC (permalink / raw)
  To: Krzysztof Kozlowski, Trond Myklebust, linux-nfs
  Cc: Anna Schumaker, J. Bruce Fields, Jeff Layton, David S. Miller,
	netdev, open list, linux-samsung-soc@vger.kernel.org,
	Sudeep Holla
In-Reply-To: <CAJKOXPd2rntNOpU1quR9Zm_J22+=pEaj4ZTC_tdZ0zcRYUciFg@mail.gmail.com>

Hi,

On Thu, Jun 7, 2018 at 12:19 PM, Krzysztof Kozlowski <krzk@kernel.org> wrote:
> Hi,
>
> When booting my boards under recent linux-next, I see failures of systemd:
>
> [FAILED] Failed to start Flush Journal to Persistent Storage.
> See 'systemctl status systemd-journal-flush.service' for details.
>          Starting Create Volatile Files and Directories...
> [**    ] A start job is running for Create V… [  223.209289] nfs:
> server 192.168.1.10 not responding, still trying
> [  223.209377] nfs: server 192.168.1.10 not responding, still trying
>
> Effectively the boards fails to boot. Example is here:
> https://krzk.eu/#/builders/1/builds/2157
>

I too encountered the same issue.

> This was bisected to:
> commit 37ac86c3a76c113619b7d9afe0251bbfc04cb80a
> Author: Chuck Lever <chuck.lever@oracle.com>
> Date:   Fri May 4 15:34:53 2018 -0400
>
>     SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock
>
>     alloc_slot is a transport-specific op, but initializing an rpc_rqst
>     is common to all transports. In addition, the only part of initial-
>     izing an rpc_rqst that needs serialization is getting a fresh XID.
>
>     Move rpc_rqst initialization to common code in preparation for
>     adding a transport-specific alloc_slot to xprtrdma.
>
>     Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>     Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
>

Unfortunately, spent time to bisect independently without seeing this
report and got the same culprit.

>
> Bisect log attached. Full configuration:
> 1. exynos_defconfig
> 2. ARMv7, octa-core, Exynos5422 and Exynos4412 (Odroid XU3, U3 and others)
> 3. NFSv4 client (from Raspberry Pi)
>

Yes the issue is seen only with NFSv4 client and with latest systemd I think.
My Ubuntu 16.04(32bit FS) is  boots fine while 18.04 has the above issue.
Passing nfsv3 in kernel command line makes it work again.

> Let me know if you need any more information.
>

Also I was observing this issue with Linus master branch from
the time the above patch was merged until today. The issue
is no longer seen since this morning however I just enabled lockdep
and got these messages.

---->8

DEBUG_LOCKS_WARN_ON(sem->owner != ((struct task_struct *)(1UL << 0)))
WARNING: CPU: 2 PID: 74 at kernel/locking/rwsem.c:217
up_read_non_owner+0x78/0x90
Modules linked in:
CPU: 2 PID: 74 Comm: kworker/2:1 Not tainted 4.17.0-10597-g4c5e8fc62d6a #40
Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno
Development Platform, BIOS EDK II Jun 14 2018
Workqueue: nfsiod rpc_async_release
pstate: 40000005 (nZcv daif -PAN -UAO)
pc : up_read_non_owner+0x78/0x90
lr : up_read_non_owner+0x78/0x90
Call trace:
 up_read_non_owner+0x78/0x90
 nfs_async_unlink_release+0x2c/0x88
 rpc_free_task+0x28/0x48
 rpc_async_release+0x10/0x18
 process_one_work+0x29c/0x7a0
 worker_thread+0x40/0x458
 kthread+0x12c/0x130
 ret_from_fork+0x10/0x18
irq event stamp: 64737
hardirqs last  enabled at (64737): _raw_spin_unlock_irq+0x2c/0x60
hardirqs last disabled at (64736): _raw_spin_lock_irq+0x1c/0x60
softirqs last  enabled at (64728): srcu_invoke_callbacks+0xec/0x1a8
softirqs last disabled at (64724): srcu_invoke_callbacks+0xec/0x1a8


DEBUG_LOCKS_WARN_ON(sem->owner != ((struct task_struct *)(1UL << 0)))
WARNING: CPU: 2 PID: 1517 at kernel/locking/rwsem.c:217
up_read_non_owner+0x78/0x90
Modules linked in:
CPU: 2 PID: 1517 Comm: kworker/2:2 Not tainted
4.17.0-10597-g4c5e8fc62d6a-dirty #42
Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno
Development Platform, BIOS EDK II Jun 14 2018
Workqueue: nfsiod rpc_async_release
pstate: 40000005 (nZcv daif -PAN -UAO)
pc : up_read_non_owner+0x78/0x90
lr : up_read_non_owner+0x78/0x90
Call trace:
 up_read_non_owner+0x78/0x90
 nfs_async_unlink_release+0x2c/0x88
 rpc_free_task+0x28/0x48
 rpc_async_release+0x10/0x18
 process_one_work+0x258/0x400
 worker_thread+0x40/0x448
 kthread+0x11c/0x120
 ret_from_fork+0x10/0x18
irq event stamp: 69059
hardirqs last  enabled at (69059): _raw_spin_unlock_irq+0x2c/0x60
hardirqs last disabled at (69058): _raw_spin_lock_irq+0x1c/0x60
softirqs last  enabled at (68194): css_release_work_fn+0x188/0x1c0
softirqs last disabled at (68192): css_release_work_fn+0x170/0x1c0

Regards,
Sudeep

^ permalink raw reply

* Re: [PATCH bpf-net] selftests/bpf: delete xfrm tunnel when test exits.
From: Daniel Borkmann @ 2018-06-15 12:52 UTC (permalink / raw)
  To: William Tu, Eyal Birger; +Cc: Linux Kernel Network Developers, Anders Roxell
In-Reply-To: <CALDO+SaBVSXg8rpatDvan57noBsZgRoRy8wY6EHNqdvpKuxQHA@mail.gmail.com>

On 06/15/2018 02:43 PM, William Tu wrote:
> On Thu, Jun 14, 2018 at 10:24 PM, Eyal Birger <eyal.birger@gmail.com> wrote:
>>
>>
>>> On 14 Jun 2018, at 15:01, William Tu <u9012063@gmail.com> wrote:
>>>
>>> Make the printting of bpf xfrm tunnel better and
>>> cleanup xfrm state and policy when xfrm test finishes.
>>
>> Yeah the ‘tee’ was useful when developing the test - I could see what’s going on :)
>>
>> Now that it’s in ‘selftests’ it’s definitely better without it.
>>
>> Thanks for the cleanup!
>> Eyal.
> 
> Hi Eyal,
> Thanks for double check!
> 
> Hi Daniel and Martin,
> Sorry for the confusing "bpf-net". It should be "net"

Yep, took it into bpf, PR to David will come later today.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH bpf-net] selftests/bpf: delete xfrm tunnel when test exits.
From: William Tu @ 2018-06-15 12:43 UTC (permalink / raw)
  To: Eyal Birger; +Cc: Linux Kernel Network Developers, Anders Roxell
In-Reply-To: <F41E9F5B-D170-4BAE-B797-D915F6C24227@gmail.com>

On Thu, Jun 14, 2018 at 10:24 PM, Eyal Birger <eyal.birger@gmail.com> wrote:
>
>
>> On 14 Jun 2018, at 15:01, William Tu <u9012063@gmail.com> wrote:
>>
>> Make the printting of bpf xfrm tunnel better and
>> cleanup xfrm state and policy when xfrm test finishes.
>
> Yeah the ‘tee’ was useful when developing the test - I could see what’s going on :)
>
> Now that it’s in ‘selftests’ it’s definitely better without it.
>
> Thanks for the cleanup!
> Eyal.

Hi Eyal,
Thanks for double check!

Hi Daniel and Martin,
Sorry for the confusing "bpf-net". It should be "net"

Thanks
William

^ permalink raw reply

* pull-request: mac80211 2018-06-15
From: Johannes Berg @ 2018-06-15 12:38 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-wireless

Hi Dave,

We have just a handful of pretty simple fixes for this round
of updates for the current cycle.

Please pull and let me know if there's any problem.

Thanks,
johannes



The following changes since commit 7f6afc338405628ae32826ce82827c3f19bf3d15:

  Merge branch 'l2tp-fixes' (2018-06-14 17:10:19 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2018-06-15

for you to fetch changes up to bf2b61a6838f19cbc33f6732715012c483fa3795:

  cfg80211: fix rcu in cfg80211_unregister_wdev (2018-06-15 13:05:14 +0200)

----------------------------------------------------------------
A handful of fixes:
 * missing RCU grace period enforcement led to drivers freeing
   data structures before; fix from Dedy Lansky.
 * hwsim module init error paths were messed up; fixed it myself
   after a report from Colin King (who had sent a partial patch)
 * kernel-doc tag errors; fix from Luca Coelho
 * initialize the on-stack sinfo data structure when getting
   station information; fix from Sven Eckelmann
 * TXQ state dumping is now done from init, and when TXQs aren't
   initialized yet at that point, bad things happen, move the
   initialization; fix from Toke Høiland-Jørgensen.

----------------------------------------------------------------
Dedy Lansky (1):
      cfg80211: fix rcu in cfg80211_unregister_wdev

Johannes Berg (1):
      mac80211_hwsim: fix module init error paths

Luca Coelho (1):
      nl80211: fix some kernel doc tag mistakes

Sven Eckelmann (1):
      cfg80211: initialize sinfo in cfg80211_get_station

Toke Høiland-Jørgensen (1):
      mac80211: Move up init of TXQs

 drivers/net/wireless/mac80211_hwsim.c | 11 +++++++++--
 include/uapi/linux/nl80211.h          | 28 ++++++++++++++--------------
 net/mac80211/main.c                   | 12 ++++++------
 net/wireless/core.c                   |  1 +
 net/wireless/util.c                   |  2 ++
 5 files changed, 32 insertions(+), 22 deletions(-)

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-15 12:31 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Siwei Liu, Samudrala, Sridhar, Alexander Duyck, virtio-dev,
	aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
	virtualization
In-Reply-To: <20180615113242.799974f6.cohuck@redhat.com>

On Fri, Jun 15, 2018 at 11:32:42AM +0200, Cornelia Huck wrote:
> On Fri, 15 Jun 2018 05:34:24 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Jun 14, 2018 at 12:02:31PM +0200, Cornelia Huck wrote:
> 
> > > > > I am not all that familiar with how Qemu manages network devices. If we can
> > > > > do all the
> > > > > required management of the primary/standby devices within Qemu, that is
> > > > > definitely a better
> > > > > approach without upper layer involvement.    
> > > > 
> > > > Right. I would imagine in the extreme case the upper layer doesn't
> > > > have to be involved at all if QEMU manages all hot plug/unplug logic.
> > > > The management tool can supply passthrough device and virtio with the
> > > > same group UUID, QEMU auto-manages the presence of the primary, and
> > > > hot plug the device as needed before or after the migration.  
> > > 
> > > I do not really see how you can manage that kind of stuff in QEMU only.  
> > 
> > So right now failover is limited to pci passthrough devices only.
> > The idea is to realize the vfio device but not expose it
> > to guest. Have a separate command to expose it to guest.
> > Hotunplug would also hide it from guest but not unrealize it.
> 
> So, this would not be real hot(un)plug, but 'hide it from the guest',
> right? The concept of "we have it realized in QEMU, but the guest can't
> discover and use it" should be translatable to non-pci as well (at
> least for ccw).
> 
> > 
> > This will help ensure that e.g. on migration failure we can
> > re-expose the device without risk of running out of resources.
> 
> Makes sense.
> 
> Should that 'hidden' state be visible/settable from outside as well
> (e.g. via a property)? I guess yes, so that management software has a
> chance to see whether a device is visible.

Might be handy for debug, but note that since QEMU manages this
state it's transient: can change at any time, so it's kind
of hard for management to rely on it.

> Settable may be useful if we
> find another use case for hiding realized devices.

^ permalink raw reply

* Re: [PATCH 07/17] net: convert sock.sk_wmem_alloc from atomic_t to refcount_t
From: David Woodhouse @ 2018-06-15 12:29 UTC (permalink / raw)
  To: Elena Reshetova, netdev
  Cc: Krzysztof Mazur, Kevin Darbyshire-Bryant, chas, Mathias Kresin
In-Reply-To: <1498817290-3368-8-git-send-email-elena.reshetova@intel.com>

[-- Attachment #1: Type: text/plain, Size: 2083 bytes --]

On Fri, 2017-06-30 at 13:08 +0300, Elena Reshetova wrote:
> diff --git a/include/linux/atmdev.h b/include/linux/atmdev.h
> index c1da539..4d97a89 100644
> --- a/include/linux/atmdev.h
> +++ b/include/linux/atmdev.h
> @@ -254,7 +254,7 @@ static inline void atm_return(struct atm_vcc *vcc,int truesize)
>  
>  static inline int atm_may_send(struct atm_vcc *vcc,unsigned int size)
>  {
> -       return (size + atomic_read(&sk_atm(vcc)->sk_wmem_alloc)) <
> +       return (size + refcount_read(&sk_atm(vcc)->sk_wmem_alloc)) <
>                sk_atm(vcc)->sk_sndbuf;
>  }

Hm, this (commit 14afee4b6092fd) may have broken PPPoATM. I did spend a
while staring hard at my own commit 9d02daf754238 which introduced
pppoatm_may_send(), but it's actually atm_may_send() which is never
allowing packets to be pushed.

Debugging (which is ongoing) has so far shown that we are accounting
for a packet in pppoatm_send() which has skb->truesize==0x1c0, and
later end up in pppoatm_pop()→atm_raw_pop() with skb->truesize==0x2c0.

This was always harmless before, but now it's a refcount_t it appears
to underflow and go into its "screw you" mode and never let any more
packets get sent.

I'm staring hard at the special case in pskb_expand_head() to *not*
change skb->truesize under certain circumstances, and wondering if that
(is, should be) covering the case of ATM skbs:

        /* It is not generally safe to change skb->truesize.
         * For the moment, we really care of rx path, or
         * when skb is orphaned (not attached to a socket).
         */
        if (!skb->sk || skb->destructor == sock_edemux)
                skb->truesize += size - osize;

Failing that, maybe we should copy the accounted value of skb->truesize 
into the struct skb_atm_data in skb->cb at the time we add it to
sk_wmem_alloc — and then subtract *that* value from sk_wmem_alloc in
atm_raw_pop() instead of the then current value of skb->truesize.

Suggestions?

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply

* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: Edward Cree @ 2018-06-15 12:18 UTC (permalink / raw)
  To: Steffen Klassert, David Miller; +Cc: tom, pablo, netfilter-devel, netdev
In-Reply-To: <20180615063433.g4pqrea4f4pymvvs@gauss3.secunet.de>

On 15/06/18 07:34, Steffen Klassert wrote:
> On Thu, Jun 14, 2018 at 04:58:34PM -0700, David Miller wrote:
>> You're probably talking about Edward Cree's SKB list stuff, and as
>> per his presenation at netconf 2 weeks ago he plans to revitalize
>> it given how Spectre et al. gives cause to reevaluate all bulking
>> techniques.
> Are there patches for the proposal Edward did a while ago,
> or was it just a concept?
Old patches are at http://lists.openwall.net/netdev/2016/04/19/89
 (note that the absolute numbers I give in the cover letter are wrong;
 I quoted them as Mpps but they're actually Mbps which is 8x higher).
I hope to have a new series ready shortly after net-next reopens.

-Ed

^ permalink raw reply

* Re: Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-15 11:48 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Samudrala, Sridhar, Alexander Duyck, virtio-dev, aaron.f.brown,
	Jiri Pirko, Michael S. Tsirkin, Jakub Kicinski, Netdev,
	qemu-devel, virtualization
In-Reply-To: <CADGSJ23WnTmVKHezm3t0V6M2sWeHaOUoTjdXkmrvbO0EF83hMg@mail.gmail.com>

On Thu, 14 Jun 2018 18:57:11 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> Thank you for sharing your thoughts, Cornelia. With questions below, I
> think you raised really good points, some of which I don't have answer
> yet and would also like to explore here.
> 
> First off, I don't want to push the discussion to the extreme at this
> point, or sell anything about having QEMU manage everything
> automatically. Don't get me wrong, it's not there yet. Let's don't
> assume we are tied to a specific or concerte solution. I think the key
> for our discussion might be to define or refine the boundary between
> VM and guest,  e.g. what each layer is expected to control and manage
> exactly.
> 
> In my view, there might be possibly 3 different options to represent
> the failover device conceipt to QEMU and libvirt (or any upper layer
> software):
> 
> a. Seperate device: in this model, virtio and passthough remains
> separate devices just as today. QEMU exposes the standby feature bit
> for virtio, and publish status/event around the negotiation process of
> this feature bit for libvirt to react upon. Since Libvirt has the
> pairing relationship itself, maybe through MAC address or something
> else, it can control the presence of primary by hot plugging or
> unplugging the passthrough device, although it has to work tightly
> with virtio's feature negotation process. Not just for migration but
> also various corner scenarios (driver/feature ok, device reset,
> reboot, legacy guest etc) along virtio's feature negotiation.

Yes, that one has obvious tie-ins to virtio's modus operandi.

> 
> b. Coupled device: in this model, virtio and passthough devices are
> weakly coupled using some group ID, i.e. QEMU match the passthough
> device for a standby virtio instance by comparing the group ID value
> present behind each device's bridge. Libvirt provides QEMU the group
> ID for both type of devices, and only deals with hot plug for
> migration, by checking some migration status exposed (e.g. the feature
> negotiation status on the virtio device) by QEMU. QEMU manages the
> visibility of the primary in guest along virtio's feature negotiation
> process.

I'm a bit confused here. What, exactly, ties the two devices together?
If libvirt already has the knowledge that it should manage the two as a
couple, why do we need the group id (or something else for other
architectures)? (Maybe I'm simply missing something because I'm not
that familiar with pci.)

> 
> c. Fully combined device: in this model, virtio and passthough devices
> are viewed as a single VM interface altogther. QEMU not just controls
> the visibility of the primary in guest, but can also manage the
> exposure of the passthrough for migratability. It can be like that
> libvirt supplies the group ID to QEMU. Or libvirt does not even have
> to provide group ID for grouping the two devices, if just one single
> combined device is exposed by QEMU. In either case, QEMU manages all
> aspect of such internal construct, including virtio feature
> negotiation, presence of the primary, and live migration.

Same question as above.

> 
> It looks like to me that, in your opinion, you seem to prefer go with
> (a). While I'm actually okay with either (b) or (c). Do I understand
> your point correctly?

I'm not yet preferring anything, as I'm still trying to understand how
this works :) I hope we can arrive at a model that covers the use case
and that is also flexible enough to be extended to other platforms.

> 
> The reason that I feel that (a) might not be ideal, just as Michael
> alluded to (quoting below), is that as management stack, it really
> doesn't need to care about the detailed process of feature negotiation
> (if we view the guest presence of the primary as part of feature
> negotiation at an extended level not just virtio). All it needs to be
> done is to hand in the required devices to QEMU and that's all. Why do
> we need to addd various hooks, events for whichever happens internally
> within the guest?
> 
> ''
> Primary device is added with a special "primary-failover" flag.
> A virtual machine is then initialized with just a standby virtio
> device. Primary is not yet added.
> 
> Later QEMU detects that guest driver device set DRIVER_OK.
> It then exposes the primary device to the guest, and triggers
> a device addition event (hot-plug event) for it.
> 
> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> to remove the primary driver.  In particular, if QEMU detects guest
> re-initialization (e.g. by detecting guest reset) it immediately removes
> the primary device.
> ''
> 
> and,
> 
> ''
> management just wants to give the primary to guest and later take it back,
> it really does not care about the details of the process,
> so I don't see what does pushing it up the stack buy you.
> 
> So I don't think it *needs* to be done in libvirt. It probably can if you
> add a bunch of hooks so it knows whenever vm reboots, driver binds and
> unbinds from device, and can check that backup flag was set.
> If you are pushing for a setup like that please get a buy-in
> from libvirt maintainers or better write a patch.
> ''

This actually seems to mean the opposite to me: We need to know what
the guest is doing and when, as it directly drives what we need to do
with the devices. If we switch to a visibility vs a hotplug model (see
the other mail), we might be able to handle that part within qemu.
However, I don't see how you get around needing libvirt to actually set
this up in the first place and to handle migration per se.

^ permalink raw reply

* [PATCH][ipsec] xfrm: replace NR_CPU with num_possible_cpus()
From: Li RongQing @ 2018-06-15 11:38 UTC (permalink / raw)
  To: netdev, steffen.klassert

The default NR_CPUS can be very large, but actual possible nr_cpu_ids
usually is very small. For some x86 distribution, the NR_CPUS is 8192
and nr_cpu_ids is 4.

when xfrm_init is running, num_possible_cpus() should work

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Wang Li <wangli39@baidu.com>
---
 net/xfrm/xfrm_policy.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 40b54cc64243..cbb862463cbd 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2988,12 +2988,13 @@ static struct pernet_operations __net_initdata xfrm_net_ops = {
 void __init xfrm_init(void)
 {
 	int i;
+	unsigned int nr_cpus = num_possible_cpus();
 
-	xfrm_pcpu_work = kmalloc_array(NR_CPUS, sizeof(*xfrm_pcpu_work),
+	xfrm_pcpu_work = kmalloc_array(nr_cpus, sizeof(*xfrm_pcpu_work),
 				       GFP_KERNEL);
 	BUG_ON(!xfrm_pcpu_work);
 
-	for (i = 0; i < NR_CPUS; i++)
+	for (i = 0; i < nr_cpus; i++)
 		INIT_WORK(&xfrm_pcpu_work[i], xfrm_pcpu_work_fn);
 
 	register_pernet_subsys(&xfrm_net_ops);
-- 
2.16.2

^ permalink raw reply related

* Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format
From: Bo YU @ 2018-06-15 11:20 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Martin KaFai Lau, netdev, Alexei Starovoitov, Daniel Borkmann,
	kernel-team, Wang Nan, Jiri Olsa, Namhyung Kim, Ingo Molnar
In-Reply-To: <20180607193029.GG17292@kernel.org>

Hi,
On Thu, Jun 07, 2018 at 04:30:29PM -0300, Arnaldo Carvalho de Melo wrote:
>Em Thu, Jun 07, 2018 at 12:05:10PM -0700, Martin KaFai Lau escreveu:
>dump-obj = true
>clang-opt = -g
>[root@jouet bpf]# perf trace -e open*,hello.c touch /tmp/hello.BTF
>LLVM: dumping hello.o
>     0.185 (         ): __bpf_stdout__:Hello, world
>     0.188 ( 0.009 ms): touch/19670 openat(dfd: CWD, filename: /etc/ld.so.cache, flags: CLOEXEC           ) = 3
>     0.219 (         ): __bpf_stdout__:Hello, world
>     0.220 ( 0.011 ms): touch/19670 openat(dfd: CWD, filename: /lib64/libc.so.6, flags: CLOEXEC           ) = 3
>     0.480 ( 0.095 ms): touch/19670 open(filename: /usr/lib/locale/locale-archive, flags: CLOEXEC         ) = 3
>     0.624 (         ): __bpf_stdout__:Hello, world
>     0.626 ( 0.011 ms): touch/19670 openat(dfd: CWD, filename: /tmp/hello.BTF, flags: CREAT|NOCTTY|NONBLOCK|WRONLY, mode: IRUGO|IWUGO) = 3
>[root@jouet bpf]# file hello.o
>hello.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), with debug_info, not stripped
>[root@jouet bpf]#
>
>Much better:
>
>[root@jouet bpf]# readelf -SW hello.o
>There are 25 section headers, starting at offset 0xc70:
>
>Section Headers:
>  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
>  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
>  [ 1] .strtab           STRTAB          0000000000000000 000b50 000119 00      0   0  1
>  [ 2] .text             PROGBITS        0000000000000000 000040 000000 00  AX  0   0  4
>  [ 3] syscalls:sys_enter_openat PROGBITS        0000000000000000 000040 000088 00  AX  0   0  8
>  [ 4] .relsyscalls:sys_enter_openat REL             0000000000000000 000910 000010 10     24   3  8
>  [ 5] license           PROGBITS        0000000000000000 0000c8 000004 00  WA  0   0  1
>  [ 6] version           PROGBITS        0000000000000000 0000cc 000004 00  WA  0   0  4
>  [ 7] maps              PROGBITS        0000000000000000 0000d0 000010 00  WA  0   0  4
>  [ 8] .rodata.str1.1    PROGBITS        0000000000000000 0000e0 00000e 01 AMS  0   0  1
>  [ 9] .debug_str        PROGBITS        0000000000000000 0000ee 00010e 01  MS  0   0  1
>  [10] .debug_loc        PROGBITS        0000000000000000 0001fc 000023 00      0   0  1
>  [11] .debug_abbrev     PROGBITS        0000000000000000 00021f 0000e3 00      0   0  1
>  [12] .debug_info       PROGBITS        0000000000000000 000302 00015e 00      0   0  1
>  [13] .rel.debug_info   REL             0000000000000000 000920 0001e0 10     24  12  8
>  [14] .debug_ranges     PROGBITS        0000000000000000 000460 000030 00      0   0  1
>  [15] .debug_macinfo    PROGBITS        0000000000000000 000490 000001 00      0   0  1
>  [16] .debug_pubnames   PROGBITS        0000000000000000 000491 00006e 00      0   0  1
>  [17] .rel.debug_pubnames REL             0000000000000000 000b00 000010 10     24  16  8
>  [18] .debug_pubtypes   PROGBITS        0000000000000000 0004ff 00005a 00      0   0  1
>  [19] .rel.debug_pubtypes REL             0000000000000000 000b10 000010 10     24  18  8
>  [20] .debug_frame      PROGBITS        0000000000000000 000560 000028 00      0   0  8
>  [21] .rel.debug_frame  REL             0000000000000000 000b20 000020 10     24  20  8
>  [22] .debug_line       PROGBITS        0000000000000000 000588 00006e 00      0   0  1
>  [23] .rel.debug_line   REL             0000000000000000 000b40 000010 10     24  22  8
>  [24] .symtab           SYMTAB          0000000000000000 0005f8 000318 18      1  29  8
>Key to Flags:
>  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
>  L (link order), O (extra OS processing required), G (group), T (TLS),
>  C (compressed), x (unknown), o (OS specific), E (exclude),
>  p (processor specific)
>[root@jouet bpf]#
>
>[root@jouet bpf]# readelf -s hello.o
>
>Symbol table '.symtab' contains 33 entries:
>   Num:    Value          Size Type    Bind   Vis      Ndx Name
>     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
>     1: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT    9
>     2: 000000000000002d     0 NOTYPE  LOCAL  DEFAULT    9
>     3: 0000000000000044     0 NOTYPE  LOCAL  DEFAULT    9
>     4: 0000000000000053     0 NOTYPE  LOCAL  DEFAULT    9
>     5: 000000000000005c     0 NOTYPE  LOCAL  DEFAULT    9
>     6: 0000000000000061     0 NOTYPE  LOCAL  DEFAULT    9
>     7: 000000000000006a     0 NOTYPE  LOCAL  DEFAULT    9
>     8: 0000000000000073     0 NOTYPE  LOCAL  DEFAULT    9
>     9: 0000000000000077     0 NOTYPE  LOCAL  DEFAULT    9
>    10: 0000000000000086     0 NOTYPE  LOCAL  DEFAULT    9
>    11: 000000000000008b     0 NOTYPE  LOCAL  DEFAULT    9
>    12: 0000000000000098     0 NOTYPE  LOCAL  DEFAULT    9
>    13: 00000000000000a1     0 NOTYPE  LOCAL  DEFAULT    9
>    14: 00000000000000ac     0 NOTYPE  LOCAL  DEFAULT    9
>    15: 00000000000000b8     0 NOTYPE  LOCAL  DEFAULT    9
>    16: 00000000000000c4     0 NOTYPE  LOCAL  DEFAULT    9
>    17: 00000000000000d6     0 NOTYPE  LOCAL  DEFAULT    9
>    18: 00000000000000e8     0 NOTYPE  LOCAL  DEFAULT    9
>    19: 00000000000000fd     0 NOTYPE  LOCAL  DEFAULT    9
>    20: 0000000000000101     0 NOTYPE  LOCAL  DEFAULT    9
>    21: 0000000000000107     0 NOTYPE  LOCAL  DEFAULT    9
>    22: 0000000000000000     0 SECTION LOCAL  DEFAULT    3
>    23: 0000000000000000     0 SECTION LOCAL  DEFAULT   10
>    24: 0000000000000000     0 SECTION LOCAL  DEFAULT   11
>    25: 0000000000000000     0 SECTION LOCAL  DEFAULT   12
>    26: 0000000000000000     0 SECTION LOCAL  DEFAULT   14
>    27: 0000000000000000     0 SECTION LOCAL  DEFAULT   20
>    28: 0000000000000000     0 SECTION LOCAL  DEFAULT   22
>    29: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    7 __bpf_stdout__
>    30: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    5 _license
>    31: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    6 _version
>    32: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    3 syscall_enter_openat
>[root@jouet bpf]#
>
>Progress:
>
>[root@jouet bpf]# pahole --btf_encode hello.o
>sh: llvm-objcopy: command not found
>Failed to encode BTF
>[root@jouet bpf]#
>
>[acme@jouet perf]$ llvm-
>llvm-ar          llvm-cov         llvm-dis         llvm-lib         llvm-modextract  llvm-profdata    llvm-split
>llvm-as          llvm-c-test      llvm-dlltool     llvm-link        llvm-mt          llvm-ranlib      llvm-stress
>llvm-bcanalyzer  llvm-cvtres      llvm-dsymutil    llvm-lto         llvm-nm          llvm-readelf     llvm-strings
>llvm-cat         llvm-cxxdump     llvm-dwarfdump   llvm-lto2        llvm-objdump     llvm-readobj     llvm-symbolizer
>llvm-config      llvm-cxxfilt     llvm-dwp         llvm-mc          llvm-opt-report  llvm-rtdyld      llvm-tblgen
>llvm-config-64   llvm-diff        llvm-extract     llvm-mcmarkup    llvm-pdbutil     llvm-size        llvm-xray
>[acme@jouet perf]$ llvm-
Sorry, maybe i make a noise for the thread.
Could you share how you put the output above in mutt's reply mode?
Thank you in advance!
>
>So this must be available in a newer llvm version? Which one?
>
>[root@jouet bpf]# clang -v
>clang version 5.0.1 (tags/RELEASE_501/final)
>Target: x86_64-unknown-linux-gnu
>Thread model: posix
>InstalledDir: /usr/bin
>Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-redhat-linux/7
>Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/7
>Selected GCC installation: /usr/bin/../lib/gcc/x86_64-redhat-linux/7
>Candidate multilib: .;@m64
>Candidate multilib: 32;@m32
>Selected multilib: .;@m64
>[root@jouet bpf]#
>
>[acme@jouet perf]$ rpm -q clang llvm
>clang-5.0.1-5.fc27.x86_64
>llvm-5.0.1-6.fc27.x86_64
>[acme@jouet perf]$
>[root@jouet bpf]# clang -v
>clang version 5.0.1 (tags/RELEASE_501/final)
>Target: x86_64-unknown-linux-gnu
>Thread model: posix
>InstalledDir: /usr/bin
>Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-redhat-linux/7
>Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/7
>Selected GCC installation: /usr/bin/../lib/gcc/x86_64-redhat-linux/7
>Candidate multilib: .;@m64
>Candidate multilib: 32;@m32
>Selected multilib: .;@m64
>[root@jouet bpf]#
>
>[acme@jouet perf]$ rpm -q clang llvm
>clang-5.0.1-5.fc27.x86_64
>llvm-5.0.1-6.fc27.x86_64
>[acme@jouet perf]$
>
>- Arnaldo

^ permalink raw reply

* Re: [RFC PATCH RESEND] tcp: avoid F-RTO if SACK and timestamps are disabled
From: Ilpo Järvinen @ 2018-06-15 10:35 UTC (permalink / raw)
  To: Michal Kubecek; +Cc: Yuchung Cheng, netdev, Eric Dumazet, LKML
In-Reply-To: <20180615092753.lmxqh65moc33rzbq@unicorn.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 6083 bytes --]

On Fri, 15 Jun 2018, Michal Kubecek wrote:

> On Fri, Jun 15, 2018 at 11:05:03AM +0300, Ilpo Järvinen wrote:
> > On Thu, 14 Jun 2018, Michal Kubecek wrote:
> > 
> > My point was that the new data segment bursts that occur if the sender 
> > isn't application limited indicate that there's something going wrong
> > with FRTO. And that wrong is also what is causing that RTO loop because
> > the sender doesn't see the previous FRTO recovery on second RTO. With 
> > my FRTO undo fix, (new_recovery || icsk->icsk_retransmits) will be false
> > and that will prevent the RTO loop.
> 
> Yes, it would prevent the loop in this case (except it would be a bit
> later, after second RTO rather than after first).

Hmm, I'm actually wrong about the new data missing bit I think. After
reading more code I'm quite sure conventional RTO recovery is triggered 
right away (as long as that bogus undo that ends the recovery wouldn't
occur first like it does without my fix). So no second RTO would be 
needed.

> But I'm not convinced
> the logic of the patch is correct. If I understand it correctly, it
> essentially changes "presumption of innocence" (if we get an ack past
> what we retransmitted, we assume it was received earlier - i.e. would
> have been sacked before if SACK was in use) to "presumption of guilt"
> (whenever a retransmitted segment is acked, we assume nothing else acked
> with it was received earlier). Or that you trade false negatives for
> false positives.

FRTO depends on knowing for sure what packet (original pre-RTO one or 
something that was transmitted post-RTO) triggered the ACK. If FRTO
isn't sure that the ACK was generated by a post-RTO packet, it must
not assume innocence! This change in practice affects just the time while 
the segment rexmitted by RTO is still there, that is, processing in step 2 
(if we get a cumulative ACK beyond it because the next loss is not for the 
subsequent segment but for some later segment, FLAG_ORIG_SACK_ACKED is set 
and we'll incorrectly do step 3b while still in FRTO has only reached step 
2 for real; this is fixed by my patch). ...The decision about 
positive/negative only occurs _after_ that in step 3.

> Maybe I understand it wrong but it seems that you de facto prevent
> Step (3b) from ever happening in non-SACK case.

Only if any of skb that was ACKed had been retransmitted. There shouldn't 
be any in step 3 because the RTO rexmit was ACKed (and also because 
of how new_recovery variable works wrt. earlier recoveries). Thus, in
step 3 the fix won't clear FLAG_ORIG_SACK_ACKED flag allowing FRTO to 
detect spurious RTOs using step 3b.

> > > > No! The window should not update window on ACKs the receiver intends to 
> > > > designate as "duplicate ACKs". That is not without some potential cost 
> > > > though as it requires delaying window updates up to the next cumulative 
> > > > ACK. In the non-SACK series one of the changes is fixing this for
> > > > non-SACK Linux TCP flows.
> > > 
> > > That sounds like a reasonable change (at least at the first glance,
> > > I didn't think about it too deeply) but even if we fix Linux stack to
> > > behave like this, we cannot force everyone else to do the same.
> > 
> > Unfortunately I don't know what the other stacks besides Linux do. But 
> > for Linux, the cause for the changing receiver window is the receiver 
> > window auto-tuning and I'm not sure if other stacks have a similar 
> > feature (or if that affects (almost) all ACKs like in Linux).
> 
> The capture from my previous e-mail and some others I have seen indicate
> that at least some implementations do not take care to never change
> window size when responding to an out-of-order segment.  That means that
> even if we change linux TCP this way (we might still need to send
> a separate window update in some cases), we still cannot rely on others
> doing the same.

Those implementations ignore what is a duplicate ACK (RFC5681, which
is also pointed into by RFC5682 for its defination):
  DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a
      "duplicate" ... (e)
      the advertised window in the incoming acknowledgment equals the
      advertised window in the last incoming acknowledgment.

Not sending duplicate ACKs also means that fast recovery will not work
for those flows but that may not show up performance wise as long as you 
have enough capacity for any unnecessary rexmits the forced RTO recovery 
is going to do. RTO recovery may actually improve completion times for 
non-SACK flows as NewReno recovers only one lost pkt/RTT where as RTO 
recovery needs log(outstanding packets) RTTs at worst. For a large number 
of losses in a window, the log is going to win.

> I checked the capture attached to my previous e-mail again and there is
> one thing where our F-RTO implementation (in 4.4, at least) is wrong,
> IMHO. While the first ACK after "new data" (sent in (2b)) was a window
> update (and therefore not dupack by definition) so that we could take
> neither (3a) nor (3b), in some iterations there were further acks which
> did not change window size. The text just before Step 1 says
> 
>   The F-RTO algorithm does not specify actions for receiving
>   a segment that neither acknowledges new data nor is a duplicate
>   acknowledgment.  The TCP sender SHOULD ignore such segments and
>   wait for a segment that either acknowledges new data or is
>   a duplicate acknowledgment.
> 
> My understanding is that this means that while the first ack after new
> data is correctly ignored, the following ack which preserves window size
> should be recognized as a dupack and (3a) should be taken.

Linux FRTO never gets that far (without my fix) if the ACK in step 2
covers beyond the RTO rexmit because 3b is prematurely invoked, that's 
why you never see what would occur if 3a is taken. TCP thinks it's not 
recovering anymore and therefore can send only new data (if there's some 
available).

This is what I tried to tell earlier, with new data there you see there's 
something else wrong too with FRTO besides the RTO loop.

-- 
 i.

^ permalink raw reply

* [PATCH] net_sched: blackhole: tell upper qdisc about dropped packets
From: Konstantin Khlebnikov @ 2018-06-15 10:27 UTC (permalink / raw)
  To: netdev, David S. Miller; +Cc: Cong Wang, Jiri Pirko, Jamal Hadi Salim

When blackhole is used on top of classful qdisc like hfsc it breaks
qlen and backlog counters because packets are disappear without notice.

In HFSC non-zero qlen while all classes are inactive triggers warning:
WARNING: ... at net/sched/sch_hfsc.c:1393 hfsc_dequeue+0xba4/0xe90 [sch_hfsc]
and schedules watchdog work endlessly.

This patch return __NET_XMIT_BYPASS in addition to NET_XMIT_SUCCESS,
this flag tells upper layer: this packet is gone and isn't queued.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 net/sched/sch_blackhole.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_blackhole.c b/net/sched/sch_blackhole.c
index c98a61e980ba..9c4c2bb547d7 100644
--- a/net/sched/sch_blackhole.c
+++ b/net/sched/sch_blackhole.c
@@ -21,7 +21,7 @@ static int blackhole_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 			     struct sk_buff **to_free)
 {
 	qdisc_drop(skb, sch, to_free);
-	return NET_XMIT_SUCCESS;
+	return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 }

 static struct sk_buff *blackhole_dequeue(struct Qdisc *sch)

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox