Netdev List

* [PATCH net-next 0/2] Introduce rtnl_lock_killable()
From: Kirill Tkhai @ 2018-03-14 19:17 UTC
  To: davem, ktkhai, vyasevic, edumazet, nicolas.dichtel, netdev

rtnl_lock() is a widely used mutex in the kernel. Some kernel code
allocates memory while holding it. Under memory pressure this
may invoke the OOM killer, but a killed task can't
exit while it's waiting for the mutex. This may lead to deadlock
and panic.

This patchset adds a new primitive that responds to SIGKILL,
so it can be used in places where we don't want
to sleep forever. It also converts the first caller, register_netdev(), to use it.
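
As an illustration of the intended caller pattern (a user-space model, not
the patch itself): the new helper presumably wraps the kernel's
mutex_lock_killable() and returns nonzero when the waiter is killed, so
callers have to check the result instead of assuming the lock is held.
All model_* names below are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/* User-space model only: these flags stand in for the rtnl mutex state
 * and for a pending SIGKILL. None of these names are kernel APIs. */
static bool model_lock_held;
static bool model_fatal_signal_pending;

/* Returns 0 with the lock held, or -EINTR without the lock if the
 * waiter was killed while the lock was unavailable. */
static int model_rtnl_lock_killable(void)
{
	while (model_lock_held) {
		if (model_fatal_signal_pending)
			return -EINTR;
		/* a real mutex would sleep here until woken */
	}
	model_lock_held = true;
	return 0;
}

static void model_rtnl_unlock(void)
{
	model_lock_held = false;
}

/* Caller pattern: give up instead of sleeping forever. */
static int model_register_netdev(void)
{
	int err = model_rtnl_lock_killable();

	if (err)
		return err; /* lock was NOT taken, so do not unlock */

	/* ... registration work under the lock ... */
	model_rtnl_unlock();
	return 0;
}
```

The point of the model is the error path: on -EINTR the caller returns
without touching state that the lock protects and without unlocking.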

---

Kirill Tkhai (2):
      net: Add rtnl_lock_killable()
      net: Use rtnl_lock_killable() in register_netdev()

 include/linux/rtnetlink.h |    1 +
 net/core/dev.c            |    3 ++-
 net/core/rtnetlink.c      |    6 ++++++
 3 files changed, 9 insertions(+), 1 deletion(-)

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>

* Re: [PATCH] vhost: add vsock compat ioctl
From: Michael S. Tsirkin @ 2018-03-14 19:05 UTC
  To: Sonny Rao
  Cc: kvm, Stefan Hajnoczi, Jason Wang, virtualization, netdev,
	linux-kernel
In-Reply-To: <20180314172605.130483-1-sonnyrao@chromium.org>

On Wed, Mar 14, 2018 at 10:26:05AM -0700, Sonny Rao wrote:
> This will allow usage of vsock from 32-bit binaries on a 64-bit
> kernel.
> 
> Signed-off-by: Sonny Rao <sonnyrao@chromium.org>

I think you need to convert the pointer argument though.
Something along the lines of:

#ifdef CONFIG_COMPAT
static long vhost_vsock_dev_compat_ioctl(struct file *f, unsigned int ioctl,
					 unsigned long arg)
{
        return vhost_vsock_dev_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
} 
#endif  
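
To spell out why the conversion is needed, here is a user-space model of
the wrapper above. compat_uptr_t and compat_ptr() are real kernel names,
but the model_* versions below are stand-ins written for illustration:
a 32-bit task hands the kernel a 32-bit user address, and the compat
handler widens it to a native pointer before calling the native handler.
On most architectures this is a plain zero-extension; s390 also masks
an address bit, which is why the helper should be used rather than a
bare cast.

```c
#include <stdint.h>

/* Stand-ins for the kernel's compat_uptr_t and compat_ptr(); model only. */
typedef uint32_t model_compat_uptr_t;

static void *model_compat_ptr(model_compat_uptr_t uptr)
{
	/* Zero-extend the 32-bit user address to a native pointer. */
	return (void *)(uintptr_t)uptr;
}

/* Native handler: expects arg to already hold a native-width pointer.
 * It echoes the argument back so the conversion can be observed. */
static long model_ioctl(unsigned int cmd, unsigned long arg)
{
	(void)cmd;
	return (long)arg;
}

/* Compat handler: convert the pointer argument, then delegate. */
static long model_compat_ioctl(unsigned int cmd, unsigned long arg)
{
	void *ptr = model_compat_ptr((model_compat_uptr_t)arg);

	return model_ioctl(cmd, (unsigned long)(uintptr_t)ptr);
}
```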



> ---
>  drivers/vhost/vsock.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index 0d14e2ff19f16..d0e65e92110e5 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -705,6 +705,7 @@ static const struct file_operations vhost_vsock_fops = {
>  	.release        = vhost_vsock_dev_release,
>  	.llseek		= noop_llseek,
>  	.unlocked_ioctl = vhost_vsock_dev_ioctl,
> +	.compat_ioctl   = vhost_vsock_dev_ioctl,
>  };
>  
>  static struct miscdevice vhost_vsock_misc = {
> -- 
> 2.13.5

* [PATCH net-next v3 4/4] net: qualcomm: rmnet: Implement fill_info
From: Subash Abhinov Kasiviswanathan @ 2018-03-14 19:01 UTC
  To: davem, joe, netdev; +Cc: Subash Abhinov Kasiviswanathan
In-Reply-To: <1521054064-22775-1-git-send-email-subashab@codeaurora.org>

This is needed to query the mux_id and flags of an rmnet device.

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 30 ++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index 096301a..d0f3e0f 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -331,6 +331,35 @@ static size_t rmnet_get_size(const struct net_device *dev)
 	       nla_total_size(sizeof(struct ifla_vlan_flags)); /* IFLA_VLAN_FLAGS */
 }
 
+static int rmnet_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+	struct rmnet_priv *priv = netdev_priv(dev);
+	struct net_device *real_dev;
+	struct ifla_vlan_flags f;
+	struct rmnet_port *port;
+
+	real_dev = priv->real_dev;
+
+	if (!rmnet_is_real_dev_registered(real_dev))
+		return -ENODEV;
+
+	if (nla_put_u16(skb, IFLA_VLAN_ID, priv->mux_id))
+		goto nla_put_failure;
+
+	port = rmnet_get_port_rtnl(real_dev);
+
+	f.flags = port->data_format;
+	f.mask  = ~0;
+
+	if (nla_put(skb, IFLA_VLAN_FLAGS, sizeof(f), &f))
+		goto nla_put_failure;
+
+	return 0;
+
+nla_put_failure:
+	return -EMSGSIZE;
+}
+
 struct rtnl_link_ops rmnet_link_ops __read_mostly = {
 	.kind		= "rmnet",
 	.maxtype	= __IFLA_VLAN_MAX,
@@ -341,6 +370,7 @@ struct rtnl_link_ops rmnet_link_ops __read_mostly = {
 	.dellink	= rmnet_dellink,
 	.get_size	= rmnet_get_size,
 	.changelink     = rmnet_changelink,
+	.fill_info	= rmnet_fill_info,
 };
 
 /* Needs either rcu_read_lock() or rtnl lock */
-- 
1.9.1

* [PATCH net-next v3 3/4] net: qualcomm: rmnet: Remove unnecessary device assignment
From: Subash Abhinov Kasiviswanathan @ 2018-03-14 19:01 UTC
  To: davem, joe, netdev; +Cc: Subash Abhinov Kasiviswanathan
In-Reply-To: <1521054064-22775-1-git-send-email-subashab@codeaurora.org>

Device of the de-aggregated skb is correctly assigned after inspecting
the mux_id, so remove the assignment in rmnet_map_deaggregate().

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
index 49e420e..e8f6c79 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
@@ -323,7 +323,6 @@ struct sk_buff *rmnet_map_deaggregate(struct sk_buff *skb,
 	if (!skbn)
 		return NULL;
 
-	skbn->dev = skb->dev;
 	skb_reserve(skbn, RMNET_MAP_DEAGGR_HEADROOM);
 	skb_put(skbn, packet_len);
 	memcpy(skbn->data, skb->data, packet_len);
-- 
1.9.1

* [PATCH net-next v3 2/4] net: qualcomm: rmnet: Update copyright year to 2018
From: Subash Abhinov Kasiviswanathan @ 2018-03-14 19:01 UTC
  To: davem, joe, netdev; +Cc: Subash Abhinov Kasiviswanathan
In-Reply-To: <1521054064-22775-1-git-send-email-subashab@codeaurora.org>

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c      | 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h      | 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c    | 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h         | 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c | 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c    | 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h     | 2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c         | 2 +-
 8 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index c494918..096301a 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2017, The Linux Foundation. All rights reserved.
+/* Copyright (c) 2013-2018, The Linux Foundation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
index 00e4634..0b5b5da 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2014, 2016-2017 The Linux Foundation. All rights reserved.
+/* Copyright (c) 2013-2014, 2016-2018 The Linux Foundation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 601edec..c758248 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2017, The Linux Foundation. All rights reserved.
+/* Copyright (c) 2013-2018, The Linux Foundation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index 4f362df..884f1f5 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2017, The Linux Foundation. All rights reserved.
+/* Copyright (c) 2013-2018, The Linux Foundation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
index b0dbca0..afa2b86 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2017, The Linux Foundation. All rights reserved.
+/* Copyright (c) 2013-2018, The Linux Foundation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
index c74a6c5..49e420e 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2017, The Linux Foundation. All rights reserved.
+/* Copyright (c) 2013-2018, The Linux Foundation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
index de0143e..98365ef 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2014, 2016-2017 The Linux Foundation. All rights reserved.
+/* Copyright (c) 2013-2014, 2016-2018 The Linux Foundation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
index 346d310..2ea16a0 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
@@ -1,4 +1,4 @@
-/* Copyright (c) 2013-2017, The Linux Foundation. All rights reserved.
+/* Copyright (c) 2013-2018, The Linux Foundation. All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 and
-- 
1.9.1

* [PATCH net-next v3 1/4] net: qualcomm: rmnet: Fix casting issues
From: Subash Abhinov Kasiviswanathan @ 2018-03-14 19:01 UTC
  To: davem, joe, netdev; +Cc: Subash Abhinov Kasiviswanathan
In-Reply-To: <1521054064-22775-1-git-send-email-subashab@codeaurora.org>

Fix warnings which were reported when running with sparse
(make C=1 CF=-D__CHECK_ENDIAN__)

drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c:81:15:
warning: cast to restricted __be16
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:271:37:
warning: incorrect type in assignment (different base types)
expected unsigned short [unsigned] [usertype] pkt_len
got restricted __be16 [usertype] <noident>
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:287:29:
warning: incorrect type in assignment (different base types)
expected unsigned short [unsigned] [usertype] pkt_len
got restricted __be16 [usertype] <noident>
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:310:22:
warning: cast to restricted __be16
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:319:13:
warning: cast to restricted __be16
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:49:18:
warning: cast to restricted __be16
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:50:18:
warning: cast to restricted __be32
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c:74:21:
warning: cast to restricted __be16

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index 6ce31e2..4f362df 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -23,8 +23,8 @@ struct rmnet_map_control_command {
 		struct {
 			u16 ip_family:2;
 			u16 reserved:14;
-			u16 flow_control_seq_num;
-			u32 qos_id;
+			__be16 flow_control_seq_num;
+			__be32 qos_id;
 		} flow_control;
 		u8 data[0];
 	};
@@ -44,7 +44,7 @@ struct rmnet_map_header {
 	u8  reserved_bit:1;
 	u8  cd_bit:1;
 	u8  mux_id;
-	u16 pkt_len;
+	__be16 pkt_len;
 }  __aligned(1);
 
 struct rmnet_map_dl_csum_trailer {
-- 
1.9.1
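
A short user-space sketch of what the __be16/__be32 annotations ask of
the code: a field stored in network (big-endian) byte order has to go
through a conversion before it is used as a host integer, and sparse
flags any access that skips it. The struct below is a hypothetical
model loosely based on the field names in the patch, not the driver's
actual layout.

```c
#include <stdint.h>
#include <arpa/inet.h> /* ntohs() */

/* Hypothetical 4-byte header modeled on the fields named in the patch.
 * pkt_len_be is carried in big-endian order on the wire; in kernel code
 * it would be typed __be16 so that sparse can check its uses. */
struct model_map_header {
	uint8_t  flags;
	uint8_t  mux_id;
	uint16_t pkt_len_be;
};

/* Convert the on-wire length to host byte order before using it. */
static uint16_t model_payload_len(const struct model_map_header *hdr)
{
	return ntohs(hdr->pkt_len_be);
}
```

With the field typed as a plain u16, nothing distinguishes a converted
value from an unconverted one; the annotation moves that check to
build time.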

* [PATCH net-next v3 0/4] net: qualcomm: rmnet: Updates 2018-03-12
From: Subash Abhinov Kasiviswanathan @ 2018-03-14 19:01 UTC
  To: davem, joe, netdev; +Cc: Subash Abhinov Kasiviswanathan

This series contains some minor updates for the rmnet driver.

Patch 1 contains fixes for sparse warnings.
Patch 2 updates the copyright date to 2018.
Patch 3 is a cleanup in receive path.
Patch 4 has the implementation of the fill_info operation.

v1->v2: Remove the force casts since the data type is changed to __be
types as mentioned by David.
v2->v3: Update copyright only in files which actually had changes, as
mentioned by Joe.

Subash Abhinov Kasiviswanathan (4):
  net: qualcomm: rmnet: Fix casting issues
  net: qualcomm: rmnet: Update copyright year to 2018
  net: qualcomm: rmnet: Remove unnecessary device assignment
  net: qualcomm: rmnet: Implement fill_info

 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 32 +++++++++++++++++++++-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h |  2 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   |  2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h    |  8 +++---
 .../ethernet/qualcomm/rmnet/rmnet_map_command.c    |  2 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   |  3 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_private.h    |  2 +-
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c    |  2 +-
 8 files changed, 41 insertions(+), 12 deletions(-)

-- 
1.9.1

* Re: [PATCH 0/5] DPAA Ethernet fixes
From: Joakim Tjernlund @ 2018-03-14 18:43 UTC
  To: davem@davemloft.net, madalin.bucur@nxp.com
  Cc: linuxppc-dev@lists.ozlabs.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, leoyang.li@nxp.com,
	camelia.groza@nxp.com, linux-arm-kernel@lists.infradead.org
In-Reply-To: <20180314133732.24068-1-madalin.bucur@nxp.com>

On Wed, 2018-03-14 at 08:37 -0500, Madalin Bucur wrote:
> Hi,
> 
> This patch set is addressing several issues in the DPAA Ethernet
> driver suite:
> 
>  - module unload crash caused by wrong reference to device being left
>    in the cleanup code after the DSA related changes
>  - scheduling while atomic bug in QMan code revealed during dpaa_eth
>    module unload
>  - a couple of error counter fixes, a duplicated init in dpaa_eth.

Hmm, some (all?) of these bugs are in stable too; CC: stable perhaps?

> 
> Madalin
> 
> Camelia Groza (3):
>   dpaa_eth: remove duplicate initialization
>   dpaa_eth: increment the RX dropped counter when needed
>   dpaa_eth: remove duplicate increment of the tx_errors counter
> 
> Madalin Bucur (2):
>   soc/fsl/qbman: fix issue in qman_delete_cgr_safe()
>   dpaa_eth: fix error in dpaa_remove()
> 
>  drivers/net/ethernet/freescale/dpaa/dpaa_eth.c |  8 ++++----
>  drivers/soc/fsl/qbman/qman.c                   | 28 +++++---------------------
>  2 files changed, 9 insertions(+), 27 deletions(-)
> 
> --
> 2.1.0
> 

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
From: Alexei Starovoitov @ 2018-03-14 18:41 UTC
  To: Eric Dumazet
  Cc: Alexei Starovoitov, David S. Miller, Daniel Borkmann,
	Network Development, Kernel Team

On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
>> It seems this is exactly the case where a netns would be the correct answer.
>
> Unfortunately that's not the case. That's what I tried to explain
> in the cover letter:
> "The setup involves per-container IPs, policy, etc, so traditional
> network-only solutions that involve VRFs, netns, acls are not applicable."
> To elaborate more on that:
> netns is l2 isolation.
> vrf is l3 isolation.
> whereas to containerize an application we need to punch connectivity holes
> in these layered techniques.
> We also considered resurrecting Hannes's afnetns work
> and even went as far as designing a new namespace for L4 isolation.
> Unfortunately none of the hierarchical namespace abstractions work.
> To run an application inside a cgroup container that was not written
> with containers in mind we have to create the illusion of running
> in a non-containerized environment.
> In some cases we remember the port and container id in the post-bind hook
> in a bpf map and when some other task in a different container is trying
> to connect to a service we need to know where this service is running.
> It can be remote and can be local. Both client and service may or may not
> be written with containers in mind and this sockaddr rewrite provides
> connectivity and load balancing features that you simply cannot get
> with hierarchical networking primitives.

have to explain this a bit further...
We also considered hacking these 'connectivity holes' in
netns and/or vrf, but that would be real layering violation,
since clean l2, l3 abstraction would suddenly support
something that breaks through the layers.
Just like many consider ipvlan a bad hack that punches
through the layers and connects l2 abstraction of netns
at l3 layer, this is not something kernel should ever do.
We really didn't want another ipvlan-like hack in the kernel.
Instead bpf programs at bind/connect time _help_
applications discover and connect to each other.
All containers are running in init_netns and there are no vrfs.
After bind/connect the normal fib/neighbor core networking
logic works as it always should. The whole system is
clean from a networking point of view.

* Re: [PATCH net] net/sched: act_simple: don't leak 'index' in the error path
From: Cong Wang @ 2018-03-14 18:41 UTC
  To: Davide Caratti
  Cc: Jamal Hadi Salim, Jiri Pirko, David S. Miller,
	Linux Kernel Network Developers
In-Reply-To: <a8b86ed8ce3860123feb676d8d26b994a9bc7444.1520993347.git.dcaratti@redhat.com>

On Tue, Mar 13, 2018 at 7:13 PM, Davide Caratti <dcaratti@redhat.com> wrote:
> Similarly to what other TC actions do, we can duplicate 'sdata' before
> calling tcf_idr_create(), and avoid calling tcf_idr_cleanup(), so that
> leaks of 'index' don't occur anymore.

Looks like we just need to replace the tcf_idr_cleanup() with
tcf_idr_release()? Which is also simpler.


>
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>
> ---
>
> Notes:
>     Hello,
>
>     I observed this faulty behavior on act_bpf, in case of negative return
>     value of tcf_bpf_init_from_ops() and tcf_bpf_init_from_efd(). Then I
>     tried on act_simple, that parses its parameter in a similar way, and
>     reproduced the same leakage of 'index'.
>
>     So, unless you think we should fix this issue in a different way (i.e.
>     changing tcf_idr_cleanup() ), I will post a similar fix for act_bpf.
>
>     Any feedback is welcome, thank you in advance!

Looks like all other callers of tcf_idr_cleanup() need to be replaced too,
but I haven't audited all of them...


[...]

>         if (!exists) {
> +               defdata = nla_strdup(tb[TCA_DEF_DATA], GFP_KERNEL);
> +               if (unlikely(!defdata))
> +                       return -ENOMEM;
> +
>                 ret = tcf_idr_create(tn, parm->index, est, a,
>                                      &act_simp_ops, bind, false);
>                 if (ret)
>                         return ret;

You leak memory here on failure, BTW.
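
To make the point concrete, here is a user-space sketch of the pattern
in the quoted hunk with the missing cleanup added. model_create() is a
hypothetical stand-in for tcf_idr_create(), and the string copy stands
in for nla_strdup(); in kernel code the release would be kfree().

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for tcf_idr_create(); fails when told to. */
static int model_create(int should_fail)
{
	return should_fail ? -ENOSPC : 0;
}

/* Duplicates the string first, then creates the action. The copy is
 * released on the failure path so nothing leaks. On success the caller
 * owns *out and must free() it. */
static int model_init(const char *data, int should_fail, char **out)
{
	size_t len = strlen(data) + 1;
	char *copy = malloc(len);
	int ret;

	if (!copy)
		return -ENOMEM;
	memcpy(copy, data, len);

	ret = model_create(should_fail);
	if (ret) {
		free(copy); /* the step missing from the quoted hunk */
		return ret;
	}

	*out = copy;
	return 0;
}
```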

* Re: [PATCH 00/30] Netfilter/IPVS updates for net-next
From: Pablo Neira Ayuso @ 2018-03-14 18:38 UTC
  To: David Miller; +Cc: fw, nbd, netfilter-devel, netdev
In-Reply-To: <20180313.113434.1173466843045633114.davem@davemloft.net>

Hi David,

Just for the record, this is a summary of what we have discussed so
far:

1) The existing flowtable infrastructure provides a software fast path
   that is useful for a good number of use cases; in particular,
   OpenWRT/LEDE developers/users are very enthusiastic about this.
   The reason is that, before this infrastructure was mainlined, they
   had no choice but to load out-of-tree kernel modules to enable
   fast forwarding paths. Fortunately, they now have an upstream
   alternative that can help them get rid of those modules. This fast
   path can be enabled very easily: a single rule to select which
   flows follow the alternative path is sufficient.

2) The software flowtable implementation is not affected by the
   problems that flow/routing cache used to have. An attacker that
   cycles through all key values by sending forged packets to fill up
   the hashtable will get no entries. Ruleset policy specifies when
   to offload entries into the flowtable, users can arbitrarily
   decide when to push the flow into the flowtable, eg.

        add rule filter forward ct status assured flow offload @x

   Worst case scenario is that users need to see two packets, one in
   each direction, to be able to place a flow in the flowtable.

3) There is no hardware offload integration yet. There's a public
   patch - waiting to have a driver - that proposes ndo hooks; this
   patch is not merged upstream. The flowtable design and the hardware
   offload patch have been the result of conversations with many vendors
   that represent a wide range of networking device classes, so it is
   not an individual effort based on looking at one single device.
   Stateful flowtable offload has been another main topic; the pipeline
   is going to stall a bit if we cannot make incremental progress in
   that direction.

Note that this batch was coming with a patch to reduce the cache footprint
of the flowtable entries, so there is already work in progress
targeted at improving performance of this new software fast path.
Also, preparation work to introduce iptables support has been on the
radar while working on this.

We understand that we should have spent more time explaining all this
on the mailing list; we are trying to amend this now. Therefore, we
can probably convince someone here to write design documentation to be
placed in the Documentation/flowtable/ directory in the next pull
request if that makes it easier for the broader audience to understand
our effort and raise concerns, if any.

Thanks.

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
From: Alexei Starovoitov @ 2018-03-14 18:11 UTC
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, davem, netdev, kernel-team
In-Reply-To: <043c6e66-aef1-51ad-177c-6e3925d4408a@iogearbox.net>

On Wed, Mar 14, 2018 at 03:37:01PM +0100, Daniel Borkmann wrote:
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -133,6 +133,8 @@ enum bpf_prog_type {
> >  	BPF_PROG_TYPE_SOCK_OPS,
> >  	BPF_PROG_TYPE_SK_SKB,
> >  	BPF_PROG_TYPE_CGROUP_DEVICE,
> > +	BPF_PROG_TYPE_CGROUP_INET4_BIND,
> > +	BPF_PROG_TYPE_CGROUP_INET6_BIND,
> 
> Could those all be merged into BPF_PROG_TYPE_SOCK_OPS? I'm slowly getting
> confused by the many sock_*/sk_* prog types we have. The attach hook could
> still be something like BPF_CGROUP_BIND/BPF_CGROUP_CONNECT. Potentially
> storing some prog-type specific void *private_data in prog's aux during
> verification could be a way (similarly as you mention) which can later be
> retrieved at attach time to reject with -ENOTSUPP or such.

That's exactly what I mentioned in the cover letter,
but we need to extend the attach cmd with a verifier-like log_buf+log_size,
since a simple ENOTSUPP will be impossible to debug.
That's the main question of the RFC.

> >  };
> >  
> >  enum bpf_attach_type {
> > @@ -143,6 +145,8 @@ enum bpf_attach_type {
> >  	BPF_SK_SKB_STREAM_PARSER,
> >  	BPF_SK_SKB_STREAM_VERDICT,
> >  	BPF_CGROUP_DEVICE,
> > +	BPF_CGROUP_INET4_BIND,
> > +	BPF_CGROUP_INET6_BIND,
> 
> Binding to v4 mapped v6 address would work as well, right? Can't this be
> squashed into one attach type as mentioned?

explained the reasons for this in the cover letter and proposed extension
to attach cmd.

> > +struct bpf_sock_addr {
> > +	__u32 user_family;	/* Allows 4-byte read, but no write. */
> > +	__u32 user_ip4;		/* Allows 1,2,4-byte read and 4-byte write.
> > +				 * Stored in network byte order.
> > +				 */
> > +	__u32 user_ip6[4];	/* Allows 1,2,4-byte read an 4-byte write.
> > +				 * Stored in network byte order.
> > +				 */
> > +	__u32 user_port;	/* Allows 4-byte read and write.
> > +				 * Stored in network byte order
> > +				 */
> > +	__u32 family;		/* Allows 4-byte read, but no write */
> > +	__u32 type;		/* Allows 4-byte read, but no write */
> > +	__u32 protocol;		/* Allows 4-byte read, but no write */
> 
> I recall bind to prefix came up from time to time in BPF context in the sense
> to let the app itself be more flexible to bind to BPF prog. Do you see also app
> to be able to add a BPF prog into the array itself?

I'm not following. In this case the container management framework
will attach bpf progs to cgroups and apps inside the cgroups will
get their bind/connects overwritten when necessary.

> > +int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
> > +				      struct sockaddr *uaddr,
> > +				      enum bpf_attach_type type)
> > +{
> > +	struct bpf_sock_addr_kern ctx = {
> > +		.sk = sk,
> > +		.uaddr = uaddr,
> > +	};
> > +	struct cgroup *cgrp;
> > +	int ret;
> > +
> > +	/* Check socket family since not all sockets represent network
> > +	 * endpoint (e.g. AF_UNIX).
> > +	 */
> > +	if (sk->sk_family != AF_INET && sk->sk_family != AF_INET6)
> > +		return 0;
> > +
> > +	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > +	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN);
> > +
> > +	return ret == 1 ? 0 : -EPERM;
> 
> Semantics may be a little bit strange, though this would perhaps be at the risk
> of the orchestrator(s) (?). Basically when you run through the prog array, then
> the last attached program in that array has the final /real/ say to which address
> to bind/connect to; all the others decisions can freely be overridden, so this
> is dependent on the order the BPF progs how they were attached. I think we don't
> have this case for context in multi-prog tracing, cgroup/inet (only filtering)
> and cgroup/dev. Although in cgroup/sock_ops for the tcp/BPF hooks progs can already
> override the sock_ops.reply (and sk_txhash which may be less relevant) field from
> the ctx, so whatever one prog is assumed to reply back to the caller, another one
> could override it. 

correct. tcp-bpf is in the same boat. When progs override the decision the last
prog in the prog_run_array is effective. Remember that
 * The programs of sub-cgroup are executed first, then programs of
 * this cgroup and then programs of parent cgroup.
so outer cgroup controlled by container management is running last.
If it wants to let children do nested overwrites, it can look at the same
sockaddr memory region, see what the children's progs or tasks
did with the sockaddr, and make an appropriate decision.

> Wouldn't it make more sense to just have a single prog instead
> to avoid this override/ordering issue?

I don't think there is any ordering issue, but yes, if the parent is paranoid
it can install a no-override program on the cgroup, which is the default anyway.
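
The ordering discussed above can be modeled in user space as follows.
This is a sketch under assumptions: programs are plain function
pointers, the context is reduced to a port number, and return values
are assumed to be combined so that any single deny vetoes the call.
All model_* names are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified context: only the port that a program may rewrite. */
struct model_sock_addr {
	uint16_t port;
};

/* A program returns 1 to allow and 0 to deny, and may rewrite ctx. */
typedef int (*model_prog_fn)(struct model_sock_addr *ctx);

/* Runs every program in array order. Any deny vetoes the call, and the
 * last write to ctx is the one the caller ends up seeing. */
static int model_run_array(model_prog_fn const *progs, size_t n,
			   struct model_sock_addr *ctx)
{
	int allow = 1;
	size_t i;

	for (i = 0; i < n; i++)
		allow &= progs[i](ctx);

	return allow == 1 ? 0 : -EPERM;
}

/* Sub-cgroup program: runs first and picks a port. */
static int model_child_prog(struct model_sock_addr *ctx)
{
	ctx->port = 8080;
	return 1;
}

/* Parent program: runs last, sees the child's rewrite, may replace it. */
static int model_parent_prog(struct model_sock_addr *ctx)
{
	if (ctx->port == 8080)
		ctx->port = 9090;
	return 1;
}

/* A program that denies unconditionally. */
static int model_deny_prog(struct model_sock_addr *ctx)
{
	(void)ctx;
	return 0;
}
```

Because the parent runs last in this model, it always has the final
say over the address, which matches the container-management use case
described in the reply.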

* Re: [PATCH net-next 0/6] ibmvnic: Update TX pool and TX routines
From: Thomas Falcon @ 2018-03-14 18:08 UTC
  To: David Miller; +Cc: netdev, jallen, nfont
In-Reply-To: <20180314.140354.467457221253867994.davem@davemloft.net>

On 03/14/2018 01:03 PM, David Miller wrote:
> From: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
> Date: Tue, 13 Mar 2018 19:34:17 -0500
>
>> This patch restructures the TX pool data structure and provides a
>> separate TX pool array for TSO transmissions. This is already used
>> in some way due to our unique DMA situation, namely that we cannot
>> use single DMA mappings for packet data. Previously, both buffer
>> arrays used the same pool entry. This restructuring allows for
>> some additional cleanup in the driver code, especially in some
>> places in the device transmit routine.
>>
>> In addition, it allows us to more easily track the consumer
>> and producer indexes of a particular pool. This has been
>> further improved by better tracking of in-use buffers to
>> prevent possible data corruption in case an invalid buffer
>> entry is used.
>  ...
>>   ibmvnic: Update release RX pool routine
> I think you need to fix up this Subject line to say TX instead
> of RX.
>
> Thanks.
>
Drat, I missed that.  I'll send another version shortly. Thanks.

* Re: [PATCH net-next v2 0/4] net: qualcomm: rmnet: Updates 2018-03-12
From: David Miller @ 2018-03-14 18:06 UTC
  To: subashab; +Cc: netdev
In-Reply-To: <1520981428-29148-1-git-send-email-subashab@codeaurora.org>

From: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Date: Tue, 13 Mar 2018 16:50:24 -0600

> This series contains some minor updates for rmnet driver.
> 
> Patch 1 contains fixes for sparse warnings.
> Patch 2 updates the copyright date to 2018.
> Patch 3 is a cleanup in receive path.
> Patch 4 has the implementation of the fill_info operation.
> 
> v1->v2: Remove the force casts since the data type is changed to __be types
> as mentioned by David.

Please address Joe's feedback and only update the copyright date on
files that actually had changes this year.

Thank you.

* Re: [PATCH net-next 0/6] ibmvnic: Update TX pool and TX routines
From: David Miller @ 2018-03-14 18:03 UTC
  To: tlfalcon; +Cc: netdev, jallen, nfont
In-Reply-To: <1520987663-26056-1-git-send-email-tlfalcon@linux.vnet.ibm.com>

From: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Date: Tue, 13 Mar 2018 19:34:17 -0500

> This patch restructures the TX pool data structure and provides a
> separate TX pool array for TSO transmissions. This is already used
> in some way due to our unique DMA situation, namely that we cannot
> use single DMA mappings for packet data. Previously, both buffer
> arrays used the same pool entry. This restructuring allows for
> some additional cleanup in the driver code, especially in some
> places in the device transmit routine.
> 
> In addition, it allows us to more easily track the consumer
> and producer indexes of a particular pool. This has been
> further improved by better tracking of in-use buffers to
> prevent possible data corruption in case an invalid buffer
> entry is used.
 ...
>   ibmvnic: Update release RX pool routine

I think you need to fix up this Subject line to say TX instead
of RX.

Thanks.

* Re: [PATCH net-next] sunvnet: does not support GSO for sctp
From: David Miller @ 2018-03-14 18:01 UTC
  To: cathy.zhou; +Cc: netdev, sparclinux
In-Reply-To: <1521050167-3946-1-git-send-email-cathy.zhou@oracle.com>

From: cathy.zhou@oracle.com
Date: Wed, 14 Mar 2018 10:56:07 -0700

> From: Cathy Zhou <Cathy.Zhou@Oracle.COM>
> 
> The NETIF_F_GSO_SOFTWARE implies support for GSO on SCTP, but the
> sunvnet driver does not support GSO for sctp.  Here we remove the
> NETIF_F_GSO_SOFTWARE feature flag and only report NETIF_F_ALL_TSO
> instead.
> 
> Signed-off-by: Cathy Zhou <Cathy.Zhou@Oracle.COM>
> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

Since this is a bug fix I'm applying this to the 'net' tree.

Thanks.

^ permalink raw reply

* Re: [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks
From: Alexei Starovoitov @ 2018-03-14 18:01 UTC (permalink / raw)
  To: Mahesh Bandewar (महेश बंडेवार)
  Cc: Alexei Starovoitov, David Miller, daniel, linux-netdev,
	kernel-team
In-Reply-To: <CAF2d9jimTh3LnVGwDQ-MsK7nVY=g5bVSk+=RTp0Qwz4-ZF0-jg@mail.gmail.com>

On Wed, Mar 14, 2018 at 10:22:03AM -0700, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Tue, Mar 13, 2018 at 8:39 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> > For our container management we've been using complicated and fragile setup
> > consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
> > all containerized applications.
> > The setup involves per-container IPs, policy, etc, so traditional
> > network-only solutions that involve VRFs, netns, acls are not applicable.
> You can keep the policies per cgroup but move the ip from cgroup to
> net-ns and then none of these ebpf hacks are required since cgroup and
> namespaces are orthogonal you can use cgroups in conjunction with
> namespaces.

answered in reply to Eric. Pls follow up there if it's still not clear.

^ permalink raw reply

* Re: [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks
From: Alexei Starovoitov @ 2018-03-14 18:00 UTC (permalink / raw)
  To: David Ahern; +Cc: Alexei Starovoitov, davem, daniel, netdev, kernel-team
In-Reply-To: <0edde01a-c9bb-7de2-ede1-dc52996c12c2@gmail.com>

On Wed, Mar 14, 2018 at 10:13:22AM -0700, David Ahern wrote:
> On 3/13/18 8:39 PM, Alexei Starovoitov wrote:
> > For our container management we've been using complicated and fragile setup
> > consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
> > all containerized applications.
> > The setup involves per-container IPs, policy, etc, so traditional
> > network-only solutions that involve VRFs, netns, acls are not applicable.
> 
> Why does VRF and the cgroup option to bind sockets to the VRF not solve
> this problem for you? The VRF limits the source address choices.

answered in reply to Eric. Pls follow up there if it's still not clear.

^ permalink raw reply

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
From: Alexei Starovoitov @ 2018-03-14 18:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Alexei Starovoitov, davem, daniel, netdev, kernel-team
In-Reply-To: <77f77631-f8ad-dc0c-94ce-ec561d4c10f9@gmail.com>

On Tue, Mar 13, 2018 at 11:21:08PM -0700, Eric Dumazet wrote:
> 
> If I understand well,  strace(1) will not show the real (after modification
> by eBPF) IP/port ?

correct. Just like it won't show anything after syscall entry, whether
lsm acted, seccomp, etc

> What about selinux and other LSM ?

clearly lsm is not the place to do ip/port enforcement for containers.
lsm in general is missing a post-bind lsm hook and visibility in cgroups.
This patch set is not about policy, but more about connectivity.
That's why sockaddr rewrite is a must have.

> We have now network namespaces for full isolation. Soon ILA will come.

we're already using a form of ila. That's orthogonal to this feature.

> The argument that it is not convenient (or even possible) to change the
> application or using modern isolation is quite strange, considering the

just like any other datacenter there are thousands of third party
applications that we cannot control. Including open source code
written by google. Would golang switch to use glibc? I very much doubt.
Statically linked apps also don't work with ld_preload.

> added burden/complexity/bloat to the kernel.

bloat? that's very odd to hear. bpf is very much an anti-bloat technique.
If you were serious with that comment, please argue with tracing folks
who add thousand upon thousand lines of code to the kernel to do
hard coded things while bpf already does all that and more
without any extra kernel code.

> The post hook for sys_bind is clearly a failure of the model, since
> releasing the port might already be too late, another thread might fail to
> get it during a non zero time window.

I suspect commit log wasn't clear. In post-bind hook we don't release
the port, we only fail sys_bind and user space will eventually close
the socket and release the port.
I don't think it's safe to call inet_put_port() here. It is also
racy as you pointed out.

> If you want to provide an alternate port allocation strategy, better provide
> a correct eBPF hook.

right. that's another separate work independent from this feature.
port allocation/free from bpf via helper is also necessary, but
for different use case.

> It seems this is exactly the case where a netns would be the correct answer.

Unfortunately that's not the case. That's what I tried to explain
in the cover letter:
"The setup involves per-container IPs, policy, etc, so traditional
network-only solutions that involve VRFs, netns, acls are not applicable."
To elaborate more on that:
netns is l2 isolation.
vrf is l3 isolation.
whereas to containerize an application we need to punch connectivity holes
in these layered techniques.
We also considered resurrecting Hannes's afnetns work
and even went as far as designing a new namespace for L4 isolation.
Unfortunately all hierarchical namespace abstractions don't work.
To run an application inside cgroup container that was not written
with containers in mind we have to make an illusion of running
in non-containerized environment.
In some cases we remember the port and container id in the post-bind hook
in a bpf map and when some other task in a different container is trying
to connect to a service we need to know where this service is running.
It can be remote and can be local. Both client and service may or may not
be written with containers in mind and this sockaddr rewrite is providing
connectivity and load balancing feature that you simply cannot do
with hierarchical networking primitives.
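
The record-then-rewrite flow described above can be modelled in plain
userspace C. This is only a toy sketch and not the BPF API from the
series: the table stands in for a bpf map, and every name in it
(service_entry, model_post_bind, model_connect) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SERVICES 16

/* Hypothetical record: which container address hosts a service port. */
struct service_entry {
	uint16_t port;         /* port the service bound to */
	uint32_t container_ip; /* address of the container hosting it */
	int in_use;
};

/* Stands in for a bpf map shared between the two hooks. */
static struct service_entry services[MAX_SERVICES];

/* Model of the post-bind step: remember port -> container address.
 * Returns 0 on success, -1 if the table is full. */
int model_post_bind(uint16_t port, uint32_t container_ip)
{
	size_t i;

	/* Update an existing entry for this port if there is one. */
	for (i = 0; i < MAX_SERVICES; i++) {
		if (services[i].in_use && services[i].port == port) {
			services[i].container_ip = container_ip;
			return 0;
		}
	}
	/* Otherwise take the first free slot. */
	for (i = 0; i < MAX_SERVICES; i++) {
		if (!services[i].in_use) {
			services[i].port = port;
			services[i].container_ip = container_ip;
			services[i].in_use = 1;
			return 0;
		}
	}
	return -1;
}

/* Model of the connect step: rewrite the destination address when the
 * port is known. Returns 1 if rewritten, 0 if left untouched. */
int model_connect(uint32_t *dst_ip, uint16_t dst_port)
{
	size_t i;

	for (i = 0; i < MAX_SERVICES; i++) {
		if (services[i].in_use && services[i].port == dst_port) {
			*dst_ip = services[i].container_ip;
			return 1;
		}
	}
	return 0;
}
```

A caller would invoke model_post_bind(8080, ip) when a service binds, and
model_connect(&dst, 8080) from the client side; the client keeps using the
address it was written for while the lookup decides where the connection
actually goes.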

btw the per-container policy enforcement of ip+port via these hooks
wasn't our planned feature. It was requested by other folks and
we had to tweak the api a little bit to satisfy both our requirements
and theirs.

^ permalink raw reply

* [PATCH net-next] sunvnet: does not support GSO for sctp
From: cathy.zhou @ 2018-03-14 17:56 UTC (permalink / raw)
  To: davem, netdev; +Cc: sparclinux

From: Cathy Zhou <Cathy.Zhou@Oracle.COM>

The NETIF_F_GSO_SOFTWARE implies support for GSO on SCTP, but the
sunvnet driver does not support GSO for sctp.  Here we remove the
NETIF_F_GSO_SOFTWARE feature flag and only report NETIF_F_ALL_TSO
instead.

Signed-off-by: Cathy Zhou <Cathy.Zhou@Oracle.COM>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 drivers/net/ethernet/sun/sunvnet.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c
index 63d3d6b..a94f504 100644
--- a/drivers/net/ethernet/sun/sunvnet.c
+++ b/drivers/net/ethernet/sun/sunvnet.c
@@ -312,7 +312,7 @@ static void vnet_poll_controller(struct net_device *dev)
 	dev->ethtool_ops = &vnet_ethtool_ops;
 	dev->watchdog_timeo = VNET_TX_TIMEOUT;
 
-	dev->hw_features = NETIF_F_TSO | NETIF_F_GSO | NETIF_F_GSO_SOFTWARE |
+	dev->hw_features = NETIF_F_TSO | NETIF_F_GSO | NETIF_F_ALL_TSO |
 			   NETIF_F_HW_CSUM | NETIF_F_SG;
 	dev->features = dev->hw_features;
 
-- 
1.7.1
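
Why the flag swap matters can be shown with a small userspace sketch. It
assumes, from memory of the 4.16-era include/linux/netdev_features.h, that
NETIF_F_GSO_SOFTWARE is NETIF_F_ALL_TSO plus NETIF_F_GSO_SCTP. The bit
positions below are illustrative only and the F_* names are stand-ins,
not the kernel's definitions.

```c
#include <stdint.h>

/* Illustrative bit positions only; the real values live in
 * include/linux/netdev_features.h and differ from these. */
#define F_TSO          (UINT64_C(1) << 0)
#define F_TSO6         (UINT64_C(1) << 1)
#define F_TSO_ECN      (UINT64_C(1) << 2)
#define F_TSO_MANGLEID (UINT64_C(1) << 3)
#define F_GSO_SCTP     (UINT64_C(1) << 4)

/* Assumed composition, modelled on the 4.16-era header. */
#define F_ALL_TSO      (F_TSO | F_TSO6 | F_TSO_ECN | F_TSO_MANGLEID)
#define F_GSO_SOFTWARE (F_ALL_TSO | F_GSO_SCTP)

/* Returns 1 if a feature mask advertises SCTP GSO, 0 otherwise. */
int advertises_sctp_gso(uint64_t features)
{
	return (features & F_GSO_SCTP) != 0;
}
```

Under that assumption, advertises_sctp_gso(F_GSO_SOFTWARE) is 1 while
advertises_sctp_gso(F_ALL_TSO) is 0, which is the only difference the
one-line change is after.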

^ permalink raw reply related

* Re: pull-request: can 2018-03-14
From: David Miller @ 2018-03-14 17:51 UTC (permalink / raw)
  To: mkl; +Cc: netdev, linux-can, kernel
In-Reply-To: <20180314120523.2295-1-mkl@pengutronix.de>

From: Marc Kleine-Budde <mkl@pengutronix.de>
Date: Wed, 14 Mar 2018 13:05:21 +0100

> this is a pull request of two patches for net/master.
> 
> Both patches are by Andri Yngvason and fix problems in the cc770 driver,
> that show up quite fast on RT systems, but also on non RT setups.

Series applied, thanks Marc.

^ permalink raw reply

* Re: [PATCH net-next 0/5] sctp: add support for some sctp auth APIs from RFC6458
From: David Miller @ 2018-03-14 17:49 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, linux-sctp, marcelo.leitner, nhorman
In-Reply-To: <cover.1521025473.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Wed, 14 Mar 2018 19:05:29 +0800

> This patchset mainly adds support for SCTP AUTH Information for sendmsg,
> described in RFC6458:
> 
>     5.3.8.  SCTP AUTH Information Structure (SCTP_AUTHINFO)
> 
> and also adds a sockopt described in RFC6458:
> 
>     8.3.4.  Deactivate a Shared Key (SCTP_AUTH_DEACTIVATE_KEY)
> 
> and two types of events for AUTHENTICATION_EVENT described in RFC6458:
> 
>     6.1.8.  SCTP_AUTHENTICATION_EVENT:
>              - SCTP_AUTH_NO_AUTH
>              - SCTP_AUTH_FREE_KEY
> 
> After this patchset, we have fully support for sctp_sendv in kernel.
> 
> Note that this patchset won't touch that sctp options merge conflict.

Series applied, thanks Xin.

^ permalink raw reply

* Re: [PATCH net] tg3: prevent scheduling while atomic splat
From: David Miller @ 2018-03-14 17:43 UTC (permalink / raw)
  To: michael.chan
  Cc: jtoppins, netdev, andy, siva.kallam, prashant, mchan,
	linux-kernel
In-Reply-To: <CACKFLi=R2q71JH2Jbqbpy8HrzNpg1UmnoNWKhEs7ooWLWoQOqQ@mail.gmail.com>

From: Michael Chan <michael.chan@broadcom.com>
Date: Wed, 14 Mar 2018 10:22:51 -0700

> On Wed, Mar 14, 2018 at 9:36 AM, Jonathan Toppins <jtoppins@redhat.com> wrote:
>> The problem was introduced in commit
>> 506b0a395f26 ("[netdrv] tg3: APE heartbeat changes"). The bug occurs
>> because tp->lock spinlock is held which is obtained in tg3_start
>> by way of tg3_full_lock(), line 11571. The documentation for usleep_range()
>> specifically states it cannot be used inside a spinlock.
>>
>> Fixes: 506b0a395f26 ("[netdrv] tg3: APE heartbeat changes")
>> Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
>> ---
>>
>> Notes:
>>     The thing I need reviewed from Broadcom is if the udelay should be 20
>>     instead of 10, due to any timing changes introduced by the offending
>>     patch.
> 
> Thanks.  10 us is correct.
> 
> As a future improvement, we might want to see if we can release the
> spinlock and go back to usleep_range().  The wait time is potentially
> up to 20 msec which is quite long.
> 
> Acked-by: Michael Chan <michael.chan@broadcom.com>

Applied, thanks everyone.

^ permalink raw reply

* Re: [PATCH net-next 0/3] net/smc: fixes 2018-03-14
From: David Miller @ 2018-03-14 17:41 UTC (permalink / raw)
  To: ubraun; +Cc: netdev, linux-s390, linux-rdma, schwidefsky, heiko.carstens,
	raspl
In-Reply-To: <20180314100102.40474-1-ubraun@linux.vnet.ibm.com>

From: Ursula Braun <ubraun@linux.vnet.ibm.com>
Date: Wed, 14 Mar 2018 11:00:59 +0100

> here are smc changes for the net-next tree.
> The first patch enables SMC to work with mlx5-RoCE-devices.
> Patches 2 and 3 deal with link group freeing.

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors
From: Soheil Hassas Yeganeh @ 2018-03-14 17:39 UTC (permalink / raw)
  To: vinicius.gomes
  Cc: netdev, randy.e.witt, David Miller, Eric Dumazet,
	Willem de Bruijn
In-Reply-To: <CAF=yD-K2xH6uGxC3P7UQab5v26kP85z3axvY8L9NfigCLrP6Hw@mail.gmail.com>

On Wed, Mar 14, 2018 at 12:32 PM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> On Tue, Mar 13, 2018 at 4:35 PM, Vinicius Costa Gomes
> <vinicius.gomes@intel.com> wrote:
> > Hi,
> >
> > Changes from the RFC:
> >  - tweaked commit messages;
> >
> > Original cover letter:
> >
> > This is actually a "bug report"-RFC instead of the more usual "new
> > feature"-RFC.
> >
> > We are developing an application that uses TX hardware timestamping to
> > make some measurements, and during development Randy Witt initially
> > reported that the application poll() never unblocked when TX hardware
> > timestamping was enabled.
> >
> > After some investigation, it turned out the problem wasn't only
> > exclusive to hardware timestamping, and could be reproduced with
> > software timestamping.
> >
> > Applying patch (1), and running txtimestamp like this, for example:
> >
> > $ ./txtimestamp -u -4 192.168.1.71 -c 1000 -D -l 1000 -F
> >
> > ('-u' to use UDP only, '-4' for ipv4 only, '-c 1000' to send 1000
> > packets for each test, '-D' to remove the delay between packets, '-l
> > 1000' to set the payload to 1000 bytes, '-F' for configuring poll() to
> > wait forever)
> >
> > will cause the application to become stuck in the poll() call in most
> > of the times. (Note: I couldn't reproduce the issue running against an
> > address that is routed through loopback.)
> >
> > Another interesting fact is that if the POLLIN event is added to the
> > poll() .events, poll() no longer becomes stuck,

> The process has registered interest only in POLLIN, which the call to
> sk_data_ready (sock_def_readable) will trigger.

> > and more interestingly
> > the returned event in .revents is only POLLERR.

> datagram_poll will set (E)POLLERR based on non-empty sk_error_queue.

> > After a few debugging sessions, we got to 'sock_queue_err_skb()' and
> > how it notifies applications of the error just enqueued. Changing it
> > to use 'sk->sk_error_report()', fixes the issue for hardware and
> > software timestamping. That is patch (2).
> >
> > The "solution" proposed in patch (2) looks like too big a hammer,

> It looks fine to me. POLLERR is returned regardless of the mask a
> process sets up in pollfd.events. So waking with sk_error_report
> will fix this while still waking callers waiting on POLLIN.

> Note that on sock_dequeue_err_skb, if another notification (of the
> right kind) is waiting, sk_error_report is already called instead of
> sk_data_ready.

Thank you for the fix. It looks fine to me too, because we already only arm
POLLERR for sock_dequeue_err_skb, and POLLERR is always returned when the
error queue is not empty:
   if (...skb_queue_empty(&sk->sk_error_queue))
      mask |= POLLERR ...
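
The point that POLLERR is reported whatever is set in pollfd.events can be
checked from userspace. The sketch below is a stand-in, not the
txtimestamp test: it uses the write end of a pipe whose read end has been
closed (which Linux reports as POLLERR) instead of a socket error queue,
so it needs no network setup. The function name is hypothetical.

```c
#include <poll.h>
#include <unistd.h>

/* Polls the write end of a reader-less pipe with an empty events mask.
 * Returns the reported revents, or -1 on setup failure or timeout. */
int poll_revents_with_empty_mask(void)
{
	int fds[2];
	struct pollfd pfd;
	int n;

	if (pipe(fds) < 0)
		return -1;
	close(fds[0]);           /* no reader left: write end is in error */

	pfd.fd = fds[1];
	pfd.events = 0;          /* register interest in nothing */
	pfd.revents = 0;
	n = poll(&pfd, 1, 1000); /* 1 s timeout so the sketch cannot hang */
	close(fds[1]);

	return n == 1 ? pfd.revents : -1;
}
```

On Linux the returned value should have POLLERR set even though events
was 0. That only shows the reporting side; a task already sleeping in
poll() still has to be woken, which is what the switch to
sk_error_report addresses.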

> This should perhaps go to net, instead of net-next (though not the test).

+1 I think the fix should go to net.

> If resending, a small nit in the test: please keep the alphabetical
> order in getopt. The filepath also looks a bit fishy, but git am applied
> the mbox from patchwork without issue.

^ permalink raw reply
