Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next] net: loopback: clear skb->tstamp before netif_rx()
From: Eric Dumazet @ 2018-10-20  2:11 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Willem de Bruijn,
	Soheil Hassas Yeganeh

At least UDP / TCP stacks can now cook skbs with a tstamp using
MONOTONIC base (or arbitrary values with SCM_TXTIME)

Since loopback driver does not call (directly or indirectly)
skb_scrub_packet(), we need to clear skb->tstamp so that
net_timestamp_check() can eventually resample the time,
using ktime_get_real().

Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time.")
Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
---
 drivers/net/loopback.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index a7207fa7e451311aed13cdeb100e0ea7922931bf..2df7f60fe05220c19896a251b6b15239f4b95112 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -69,6 +69,10 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
 	int len;
 
 	skb_tx_timestamp(skb);
+
+	/* do not fool net_timestamp_check() with various clock bases */
+	skb->tstamp = 0;
+
 	skb_orphan(skb);
 
 	/* Before queueing this packet to netif_rx(),
-- 
2.19.1.568.g152ad8e336-goog

^ permalink raw reply related

* RE: [Intel-wired-lan] [PATCH] igb: shorten maximum PHC timecounter update interval
From: Brown, Aaron F @ 2018-10-20  1:13 UTC (permalink / raw)
  To: Miroslav Lichvar, intel-wired-lan@lists.osuosl.org,
	netdev@vger.kernel.org
  Cc: Thomas Gleixner, Richard Cochran
In-Reply-To: <20181012111339.1361-1-mlichvar@redhat.com>

> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
> Behalf Of Miroslav Lichvar
> Sent: Friday, October 12, 2018 4:14 AM
> To: intel-wired-lan@lists.osuosl.org; netdev@vger.kernel.org
> Cc: Thomas Gleixner <tglx@linutronix.de>; Richard Cochran
> <richardcochran@gmail.com>
> Subject: [Intel-wired-lan] [PATCH] igb: shorten maximum PHC timecounter
> update interval
> 
> The timecounter needs to be updated at least once per ~550 seconds in
> order to avoid a 40-bit SYSTIM timestamp to be misinterpreted as an old
> timestamp.
> 
> Since commit 500462a9d ("timers: Switch to a non-cascading wheel"),
> scheduling of delayed work seems to be less accurate and a requested
> delay of 540 seconds may actually be longer than 550 seconds. Shorten
> the delay to 480 seconds to be sure the timecounter is updated in time.
> 
> This fixes an issue with HW timestamps on 82580/I350/I354 being off by
> ~1100 seconds for few seconds every ~9 minutes.
> 
> Cc: Jacob Keller <jacob.e.keller@intel.com>
> Cc: Richard Cochran <richardcochran@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
> ---
>  drivers/net/ethernet/intel/igb/igb_ptp.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 

Tested-by: Aaron Brown <aaron.f.brown@intel.com>

^ permalink raw reply

* Re: [PATCH net] net: fix pskb_trim_rcsum_slow() with odd trim offset
From: Eric Dumazet @ 2018-10-20  0:13 UTC (permalink / raw)
  To: Dimitris Michailidis, edumazet, davem; +Cc: netdev
In-Reply-To: <20181020000713.228659-1-dmichail@google.com>



On 10/19/2018 05:07 PM, Dimitris Michailidis wrote:
> We've been getting checksum errors involving small UDP packets, usually
> 59B packets with 1 extra non-zero padding byte. netdev_rx_csum_fault()
> has been complaining that HW is providing bad checksums. Turns out the
> problem is in pskb_trim_rcsum_slow(), introduced in commit 88078d98d1bb
> ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends").
> 
> The source of the problem is that when the bytes we are trimming start
> at an odd address, as in the case of the 1 padding byte above,
> skb_checksum() returns a byte-swapped value. We cannot just combine this
> with skb->csum using csum_sub(). We need to use csum_block_sub() here
> that takes into account the parity of the start address and handles the
> swapping.
> 
> Matches existing code in __skb_postpull_rcsum() and esp_remove_trailer().
> 
> Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
> Signed-off-by: Dimitris Michailidis <dmichail@google.com>

Thanks a lot Dimitris for finding this.

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* [PATCH net] net: fix pskb_trim_rcsum_slow() with odd trim offset
From: Dimitris Michailidis @ 2018-10-20  0:07 UTC (permalink / raw)
  To: edumazet, davem; +Cc: netdev, Dimitris Michailidis

We've been getting checksum errors involving small UDP packets, usually
59B packets with 1 extra non-zero padding byte. netdev_rx_csum_fault()
has been complaining that HW is providing bad checksums. Turns out the
problem is in pskb_trim_rcsum_slow(), introduced in commit 88078d98d1bb
("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends").

The source of the problem is that when the bytes we are trimming start
at an odd address, as in the case of the 1 padding byte above,
skb_checksum() returns a byte-swapped value. We cannot just combine this
with skb->csum using csum_sub(). We need to use csum_block_sub() here
that takes into account the parity of the start address and handles the
swapping.

Matches existing code in __skb_postpull_rcsum() and esp_remove_trailer().

Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Signed-off-by: Dimitris Michailidis <dmichail@google.com>
---
 net/core/skbuff.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 428094b577fc..f817f336595d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1846,8 +1846,9 @@ int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len)
 	if (skb->ip_summed == CHECKSUM_COMPLETE) {
 		int delta = skb->len - len;
 
-		skb->csum = csum_sub(skb->csum,
-				     skb_checksum(skb, len, delta, 0));
+		skb->csum = csum_block_sub(skb->csum,
+					   skb_checksum(skb, len, delta, 0),
+					   len);
 	}
 	return __pskb_trim(skb, len);
 }
-- 
2.19.1.568.g152ad8e336-goog

^ permalink raw reply related

* Re: [PATCH] net: ethernet: lpc_eth: add device and device node local variables
From: David Miller @ 2018-10-20  0:05 UTC (permalink / raw)
  To: vz; +Cc: slemieux.tyco, netdev
In-Reply-To: <20181018232511.17325-1-vz@mleia.com>

From: Vladimir Zapolskiy <vz@mleia.com>
Date: Fri, 19 Oct 2018 02:25:11 +0300

> Trivial non-functional change added to simplify getting multiple
> references to device pointer in lpc_eth_drv_probe().
> 
> Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: ethernet: lpc_eth: remove unused local variable
From: David Miller @ 2018-10-20  0:05 UTC (permalink / raw)
  To: vz; +Cc: slemieux.tyco, netdev
In-Reply-To: <20181018230653.10637-1-vz@mleia.com>

From: Vladimir Zapolskiy <vz@mleia.com>
Date: Fri, 19 Oct 2018 02:06:53 +0300

> A trivial change which removes an unused local variable, the issue
> is reported as a compile time warning:
> 
>   drivers/net/ethernet/nxp/lpc_eth.c: In function 'lpc_eth_drv_probe':
>   drivers/net/ethernet/nxp/lpc_eth.c:1250:21: warning: variable 'phydev' set but not used [-Wunused-but-set-variable]
>     struct phy_device *phydev;
>                        ^~~~~~
> 
> Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: ethernet: lpc_eth: remove CONFIG_OF guard from the driver
From: David Miller @ 2018-10-20  0:05 UTC (permalink / raw)
  To: vz; +Cc: slemieux.tyco, netdev
In-Reply-To: <20181018225841.17835-1-vz@mleia.com>

From: Vladimir Zapolskiy <vz@mleia.com>
Date: Fri, 19 Oct 2018 01:58:41 +0300

> The MAC controller device is available on NXP LPC32xx platform only,
> and the LPC32xx platform supports OF builds only, so additional
> checks in the device driver are not needed.
> 
> Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: ethernet: lpc_eth: clean up the list of included headers
From: David Miller @ 2018-10-20  0:05 UTC (permalink / raw)
  To: vz; +Cc: slemieux.tyco, netdev
In-Reply-To: <20181018225325.4959-1-vz@mleia.com>

From: Vladimir Zapolskiy <vz@mleia.com>
Date: Fri, 19 Oct 2018 01:53:25 +0300

> The change removes all unnecessary included headers from the driver
> source code, the remaining list is sorted in alphabetical order.
> 
> Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next v2] netpoll: allow cleanup to be synchronous
From: David Miller @ 2018-10-19 23:58 UTC (permalink / raw)
  To: dbanerje; +Cc: nhorman, netdev
In-Reply-To: <20181018151826.8373-1-dbanerje@akamai.com>

From: Debabrata Banerjee <dbanerje@akamai.com>
Date: Thu, 18 Oct 2018 11:18:26 -0400

> This fixes a problem introduced by:
> commit 2cde6acd49da ("netpoll: Fix __netpoll_rcu_free so that it can hold the rtnl lock")
> 
> When using netconsole on a bond, __netpoll_cleanup can asynchronously
> recurse multiple times, each __netpoll_free_async call can result in
> more __netpoll_free_async's. This means there is now a race between
> cleanup_work queues on multiple netpoll_info's on multiple devices and
> the configuration of a new netpoll. For example if a netconsole is set
> to enable 0, reconfigured, and enable 1 immediately, this netconsole
> will likely not work.
> 
> Given the reason for __netpoll_free_async is it can be called when rtnl
> is not locked, if it is locked, we should be able to execute
> synchronously. It appears to be locked everywhere it's called from.
> 
> Generalize the design pattern from the teaming driver for current
> callers of __netpoll_free_async.
> 
> CC: Neil Horman <nhorman@tuxdriver.com>
> CC: "David S. Miller" <davem@davemloft.net>
> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>

Applied, thank you.

^ permalink raw reply

* Re: [bpf-next v3 0/2] Fix kcm + sockmap by checking psock type
From: John Fastabend @ 2018-10-19 23:38 UTC (permalink / raw)
  To: Daniel Borkmann, ast, eric.dumazet; +Cc: netdev
In-Reply-To: <81286597-9886-84e0-dcb4-00c15c85fed0@iogearbox.net>

On 10/19/2018 03:57 PM, Daniel Borkmann wrote:
> On 10/20/2018 12:51 AM, Daniel Borkmann wrote:
>> On 10/18/2018 10:58 PM, John Fastabend wrote:
>>> We check if the sk_user_data (the psock in skmsg) is in fact a sockmap
>>> type to late, after we read the refcnt which is an error. This
>>> series moves the check up before reading refcnt and also adds a test
>>> to test_maps to test trying to add a KCM socket into a sockmap.
>>>
>>> While reviewig this code I also found an issue with KCM and kTLS
>>> where each uses sk_data_ready hooks and associated stream parser
>>> breaking expectations in kcm, ktls or both. But that fix will need
>>> to go to net.
>>>
>>> Thanks to Eric for reporting.
>>>
>>> v2: Fix up file +/- my scripts lost track of them
>>> v3: return EBUSY if refcnt is zero
>>>
>>> John Fastabend (2):
>>>   bpf: skmsg, fix psock create on existing kcm/tls port
>>>   bpf: test_maps add a test to catch kcm + sockmap
>>>
>>>  include/linux/skmsg.h                     | 25 +++++++++---
>>>  net/core/sock_map.c                       | 11 +++---
>>>  tools/testing/selftests/bpf/Makefile      |  2 +-
>>>  tools/testing/selftests/bpf/sockmap_kcm.c | 14 +++++++
>>>  tools/testing/selftests/bpf/test_maps.c   | 64 ++++++++++++++++++++++++++++++-
>>>  5 files changed, 103 insertions(+), 13 deletions(-)
>>>  create mode 100644 tools/testing/selftests/bpf/sockmap_kcm.c
>>
>> Applied, thanks!
> 
> Fyi, I've only applied patch 1/2 for now to get the bug fixed. The patch 2/2 throws
> a bunch of warnings that look like the below. Also, I think we leak kcm socket in
> error paths and once we're done with testing, so would be good to close it once
> unneeded. Please respin the test as a stand-alone commit, thanks:
> 

Thanks, I didn't see the warnings below locally but will look
into spinning a good version tonight with the closing sock fix
as well.

John

> [...]
> bpf-next/tools/testing/selftests/bpf/libbpf.a -lcap -lelf -lrt -lpthread -o /home/darkstar/trees/bpf-next-ok/tools/testing/selftests/bpf/test_maps
> test_maps.c: In function ‘test_sockmap’:
> test_maps.c:869:0: warning: "AF_KCM" redefined
>  #define AF_KCM 41
> 
> In file included from /usr/include/sys/socket.h:38:0,
>                  from test_maps.c:21:
> /usr/include/bits/socket.h:133:0: note: this is the location of the previous definition
>  #define AF_KCM  PF_KCM
> 

^ permalink raw reply

* Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Martin Lau @ 2018-10-19 23:27 UTC (permalink / raw)
  To: Edward Cree
  Cc: Yonghong Song, Alexei Starovoitov, daniel@iogearbox.net,
	netdev@vger.kernel.org, Kernel Team
In-Reply-To: <7d251102-2073-98b7-94a6-4dfcc21a3071@solarflare.com>

On Fri, Oct 19, 2018 at 10:26:53PM +0100, Edward Cree wrote:
> On 19/10/18 20:36, Martin Lau wrote:
> > On Fri, Oct 19, 2018 at 06:04:11PM +0100, Edward Cree wrote:
> >> But you *do* have such a new section.
> >> The patch comment talks about a 'FuncInfo Table' which appears to
> > Note that the new section, which contains the FuncInfo Table,
> > is in a new ELF section ".BTF.ext" instead of the ".BTF".
> > It is not in the ".BTF" section because it is only useful during
> > bpf_prog_load().
> I thought it was because it needed to be munged by the loader/linker?
> 
> > IIUC, I think what you are suggesting here is to use (type_id, name)
> > to describe DW_TAG_subprogram "int foo1(int) {}", "int foo2(int) {}",
> > "int foo3(int) {}" where type_id here is referring to the same
> > DW_TAG_subroutine_type, and only define that _one_
> > DW_TAG_subroutine_type in the BTF "type" section.
> Yes, something like that.
> 
> > If the concern is having both FUNC and FUNC_PROTO is confusing,
> The concern is that you're conflating different entities (types
>  and instances); FUNC_PROTO is just a symptom/canary of that.
> 
> > we could go back to the CTF way which adds a new function section
> > in ".BTF" and it is only for DW_TAG_subprogram.
> > BTF_KIND_FUNC_PROTO is then no longer necessary.
> > Some of new BTF verifier checkings may actually go away also.
> > The down side is there will be two id spaces.
> Two id spaces... one for types and the other for subprograms.
> These are different things, so why would you _want_ them to share
>  an id space?  I don't, for instance, see any situation in which
>  you'd want some other record to have a field that could reference
>  either.
> And the 'subprogram id' doesn't have to be just for subprograms;
>  it could be for instances generally — like I've been saying, a
>  variable declaration is to an object type what a subprogram is to
>  a function type, just with a few complications like "subprograms
>  can only appear at file scope, not nested in other functions" and
>  "variables of function type are immutable".
> (I'm assuming that at some point we're going to want to be able to
>  have BTF information for e.g. variables stored on a subprogram's
>  stack, if only for stuff like single-stepping in a debugger in
>  userspace with some sort of mock.  At that point, the variable
>  has to have its own record — you can't just have some sort of
>  magic type record because e.g. "struct foo bar;" has two names,
>  one for the type and one for the variable.)
> 
btf_type is not exactly a C type.

btf_type is a debug-info.  Each btf_type carries specific
debug information.  Name is part of the debug-info/btf_type.
If something carries different debug-info, it is another btf_type.
Like struct, the member's names of struct is part of the btf_type.
A struct with the same member's types but different member's names
is a different btf_type.

The same go for function.  The function with different function
names and arg names is a different btf_type.

> > Discussed a bit offline with folks about the two id spaces
> > situation and it is not good for debugging purpose.
> Could you unpack this a bit more?
Having two id spaces for debug-info is confusing.  They are
all debug-info at the end.

^ permalink raw reply

* Re: [PATCH ghak90 (was ghak32) V4 08/10] audit: add support for containerid to network namespaces
From: Paul Moore @ 2018-10-19 23:18 UTC (permalink / raw)
  To: rgb
  Cc: containers, linux-api, linux-audit, linux-fsdevel, linux-kernel,
	netdev, netfilter-devel, ebiederm, luto, carlos, dhowells, viro,
	simo, Eric Paris, Serge Hallyn
In-Reply-To: <5a2b4aadf6994f622bc1ad27a8a6889c7e61edff.1533065887.git.rgb@redhat.com>

On Tue, Jul 31, 2018 at 4:12 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> Audit events could happen in a network namespace outside of a task
> context due to packets received from the net that trigger an auditing
> rule prior to being associated with a running task.  The network
> namespace could in use by multiple containers by association to the
> tasks in that network namespace.  We still want a way to attribute
> these events to any potential containers.  Keep a list per network
> namespace to track these audit container identifiiers.
>
> Add/increment the audit container identifier on:
> - initial setting of the audit container identifier via /proc
> - clone/fork call that inherits an audit container identifier
> - unshare call that inherits an audit container identifier
> - setns call that inherits an audit container identifier
> Delete/decrement the audit container identifier on:
> - an inherited audit container identifier dropped when child set
> - process exit
> - unshare call that drops a net namespace
> - setns call that drops a net namespace
>
> See: https://github.com/linux-audit/audit-kernel/issues/92
> See: https://github.com/linux-audit/audit-testsuite/issues/64
> See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> ---
>  include/linux/audit.h | 17 ++++++++++
>  kernel/audit.c        | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/auditsc.c      |  8 ++++-
>  kernel/nsproxy.c      |  4 +++
>  4 files changed, 114 insertions(+), 1 deletion(-)

...

> @@ -308,6 +312,86 @@ static struct sock *audit_get_sk(const struct net *net)
>         return aunet->sk;
>  }
>
> +/**
> + * audit_get_netns_contid_list - Return the audit container ID list for the given network namespace
> + * @net: the destination network namespace
> + *
> + * Description:
> + * Returns the list pointer if valid, NULL otherwise.  The caller must ensure
> + * that a reference is held for the network namespace while the sock is in use.
> + */
> +struct list_head *audit_get_netns_contid_list(const struct net *net)
> +{
> +       struct audit_net *aunet = net_generic(net, audit_net_id);
> +
> +       return &aunet->contid_list;
> +}
> +
> +spinlock_t *audit_get_netns_contid_list_lock(const struct net *net)
> +{
> +       struct audit_net *aunet = net_generic(net, audit_net_id);
> +
> +       return &aunet->contid_list_lock;
> +}

Instead of returning the spinlock, just do away with the
audit_get_ns_contid_list_lock() function and create two separate lock
and unlock functions that basically do the net_generic() and spinlock
operations together, for example:

static int audit_netns_contid_lock(const struct net *net)
{
  aunet = net_generic(net, audit_net_id);

  if (!aunet)
    return -whatever;

  spin_lock(aunet->lock);
  return 0;
}

> +void audit_netns_contid_add(struct net *net, u64 contid)
> +{
> +       spinlock_t *lock = audit_get_netns_contid_list_lock(net);
> +       struct list_head *contid_list = audit_get_netns_contid_list(net);
> +       struct audit_contid *cont;
> +
> +       if (!audit_contid_valid(contid))
> +               return;
> +       spin_lock(lock);
> +       if (!list_empty(contid_list))
> +               list_for_each_entry(cont, contid_list, list)
> +                       if (cont->id == contid) {
> +                               refcount_inc(&cont->refcount);
> +                               goto out;
> +                       }
> +       cont = kmalloc(sizeof(struct audit_contid), GFP_KERNEL);
> +       if (cont) {
> +               INIT_LIST_HEAD(&cont->list);
> +               cont->id = contid;
> +               refcount_set(&cont->refcount, 1);
> +               list_add(&cont->list, contid_list);
> +       }
> +out:
> +       spin_unlock(lock);
> +}
> +
> +void audit_netns_contid_del(struct net *net, u64 contid)
> +{
> +       spinlock_t *lock = audit_get_netns_contid_list_lock(net);
> +       struct list_head *contid_list = audit_get_netns_contid_list(net);
> +       struct audit_contid *cont = NULL;
> +
> +       if (!audit_contid_valid(contid))
> +               return;
> +       spin_lock(lock);
> +       if (!list_empty(contid_list))
> +               list_for_each_entry(cont, contid_list, list)
> +                       if (cont->id == contid) {
> +                               list_del(&cont->list);
> +                               if (refcount_dec_and_test(&cont->refcount))
> +                                       kfree(cont);
> +                               break;
> +                       }
> +       spin_unlock(lock);
> +}
> +
> +void audit_switch_task_namespaces(struct nsproxy *ns, struct task_struct *p)
> +{
> +       u64 contid = audit_get_contid(p);
> +       struct nsproxy *new = p->nsproxy;
> +
> +       if (!audit_contid_valid(contid))
> +               return;
> +       audit_netns_contid_del(ns->net_ns, contid);
> +       if (new)
> +               audit_netns_contid_add(new->net_ns, contid);
> +}
> +
>  void audit_panic(const char *message)
>  {
>         switch (audit_failure) {
> @@ -1547,6 +1631,8 @@ static int __net_init audit_net_init(struct net *net)
>                 return -ENOMEM;
>         }
>         aunet->sk->sk_sndtimeo = MAX_SCHEDULE_TIMEOUT;
> +       INIT_LIST_HEAD(&aunet->contid_list);
> +       spin_lock_init(&aunet->contid_list_lock);
>
>         return 0;
>  }
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 610c6869..fdf3f68 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -75,6 +75,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/fsnotify_backend.h>
>  #include <uapi/linux/limits.h>
> +#include <net/net_namespace.h>
>
>  #include "audit.h"
>
> @@ -2165,6 +2166,7 @@ int audit_set_contid(struct task_struct *task, u64 contid)
>         uid_t uid;
>         struct tty_struct *tty;
>         char comm[sizeof(current->comm)];
> +       struct net *net = task->nsproxy->net_ns;
>
>         task_lock(task);
>         /* Can't set if audit disabled */
> @@ -2186,8 +2188,12 @@ int audit_set_contid(struct task_struct *task, u64 contid)
>         else if (!(thread_group_leader(task) && thread_group_empty(task)))
>                 rc = -EALREADY;
>         read_unlock(&tasklist_lock);
> -       if (!rc)
> +       if (!rc) {
> +               if (audit_contid_valid(oldcontid))
> +                       audit_netns_contid_del(net, oldcontid);
>                 task->audit->contid = contid;
> +               audit_netns_contid_add(net, contid);
> +       }
>         task_unlock(task);
>
>         if (!audit_enabled)
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index f6c5d33..718b120 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -27,6 +27,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/cgroup.h>
>  #include <linux/perf_event.h>
> +#include <linux/audit.h>
>
>  static struct kmem_cache *nsproxy_cachep;
>
> @@ -140,6 +141,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>         struct nsproxy *old_ns = tsk->nsproxy;
>         struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
>         struct nsproxy *new_ns;
> +       u64 contid = audit_get_contid(tsk);
>
>         if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>                               CLONE_NEWPID | CLONE_NEWNET |
> @@ -167,6 +169,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>                 return  PTR_ERR(new_ns);
>
>         tsk->nsproxy = new_ns;
> +       audit_netns_contid_add(new_ns->net_ns, contid);
>         return 0;
>  }
>
> @@ -224,6 +227,7 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
>         ns = p->nsproxy;
>         p->nsproxy = new;
>         task_unlock(p);
> +       audit_switch_task_namespaces(ns, p);
>
>         if (ns && atomic_dec_and_test(&ns->count))
>                 free_nsproxy(ns);

^ permalink raw reply

* Re: [PATCH ghak90 (was ghak32) V4 01/10] audit: collect audit task parameters
From: Paul Moore @ 2018-10-19 23:15 UTC (permalink / raw)
  To: rgb
  Cc: containers, linux-api, linux-audit, linux-fsdevel, linux-kernel,
	netdev, netfilter-devel, ebiederm, luto, carlos, dhowells, viro,
	simo, Eric Paris, Serge Hallyn
In-Reply-To: <8e617ab568df28a66dfbe3284452de186b42fb0f.1533065887.git.rgb@redhat.com>

On Sun, Aug 5, 2018 at 4:32 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> The audit-related parameters in struct task_struct should ideally be
> collected together and accessed through a standard audit API.
>
> Collect the existing loginuid, sessionid and audit_context together in a
> new struct audit_task_info called "audit" in struct task_struct.
>
> Use kmem_cache to manage this pool of memory.
> Un-inline audit_free() to be able to always recover that memory.
>
> See: https://github.com/linux-audit/audit-kernel/issues/81
>
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> ---
>  include/linux/audit.h | 34 ++++++++++++++++++++++++----------
>  include/linux/sched.h |  5 +----
>  init/init_task.c      |  3 +--
>  init/main.c           |  2 ++
>  kernel/auditsc.c      | 51 ++++++++++++++++++++++++++++++++++++++++++---------
>  kernel/fork.c         |  4 +++-
>  6 files changed, 73 insertions(+), 26 deletions(-)

...

> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index 9334fbe..8964332 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -219,8 +219,15 @@ static inline void audit_log_task_info(struct audit_buffer *ab,
>
>  /* These are defined in auditsc.c */
>                                 /* Public API */
> +struct audit_task_info {
> +       kuid_t                  loginuid;
> +       unsigned int            sessionid;
> +       struct audit_context    *ctx;
> +};

...

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 87bf02d..e117272 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -873,10 +872,8 @@ struct task_struct {
>
>         struct callback_head            *task_works;
>
> -       struct audit_context            *audit_context;
>  #ifdef CONFIG_AUDITSYSCALL
> -       kuid_t                          loginuid;
> -       unsigned int                    sessionid;
> +       struct audit_task_info          *audit;
>  #endif
>         struct seccomp                  seccomp;

Prior to this patch audit_context was available regardless of
CONFIG_AUDITSYSCALL, after this patch the corresponding audit_context
is only available when CONFIG_AUDITSYSCALL is defined.

> diff --git a/init/main.c b/init/main.c
> index 3b4ada1..6aba171 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -92,6 +92,7 @@
>  #include <linux/rodata_test.h>
>  #include <linux/jump_label.h>
>  #include <linux/mem_encrypt.h>
> +#include <linux/audit.h>
>
>  #include <asm/io.h>
>  #include <asm/bugs.h>
> @@ -721,6 +722,7 @@ asmlinkage __visible void __init start_kernel(void)
>         nsfs_init();
>         cpuset_init();
>         cgroup_init();
> +       audit_task_init();
>         taskstats_init_early();
>         delayacct_init();

It seems like we would need either init_struct_audit or
audit_task_init(), but not both, yes?

> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index fb20746..88779a7 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -841,7 +841,7 @@ static inline struct audit_context *audit_take_context(struct task_struct *tsk,
>                                                       int return_valid,
>                                                       long return_code)
>  {
> -       struct audit_context *context = tsk->audit_context;
> +       struct audit_context *context = tsk->audit->ctx;
>
>         if (!context)
>                 return NULL;
> @@ -926,6 +926,15 @@ static inline struct audit_context *audit_alloc_context(enum audit_state state)
>         return context;
>  }
>
> +static struct kmem_cache *audit_task_cache;
> +
> +void __init audit_task_init(void)
> +{
> +       audit_task_cache = kmem_cache_create("audit_task",
> +                                            sizeof(struct audit_task_info),
> +                                            0, SLAB_PANIC, NULL);
> +}

This is somewhat related to the CONFIG_AUDITSYSCALL comment above, but
since the audit_task_info contains generic audit state (not just
syscall related state), it seems like this, and the audit_task_info
accessors/helpers, should live in kernel/audit.c.

There are probably a few other things that should move to
kernel/audit.c too, e.g. audit_alloc().  Have you verified that this
builds/runs correctly on architectures that define CONFIG_AUDIT but
not CONFIG_AUDITSYSCALL?

>  /**
>   * audit_alloc - allocate an audit context block for a task
>   * @tsk: task
> @@ -940,17 +949,28 @@ int audit_alloc(struct task_struct *tsk)
>         struct audit_context *context;
>         enum audit_state     state;
>         char *key = NULL;
> +       struct audit_task_info *info;
> +
> +       info = kmem_cache_zalloc(audit_task_cache, GFP_KERNEL);
> +       if (!info)
> +               return -ENOMEM;
> +       info->loginuid = audit_get_loginuid(current);
> +       info->sessionid = audit_get_sessionid(current);
> +       tsk->audit = info;
>
>         if (likely(!audit_ever_enabled))
>                 return 0; /* Return if not auditing. */

I don't view this as necessary for initial acceptance, and
synchronization/locking might render this undesirable, but it would be
curious to see if we could do something clever with refcnts and
copy-on-write to minimize the number of kmem_cache objects in use in
the !audit_ever_enabled (and possibly the AUDIT_DISABLED) case.

>         state = audit_filter_task(tsk, &key);
>         if (state == AUDIT_DISABLED) {
> +               audit_set_context(tsk, NULL);

It's already NULL, isn't it?

>                 clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);
>                 return 0;
>         }
>
>         if (!(context = audit_alloc_context(state))) {
> +               tsk->audit = NULL;
> +               kmem_cache_free(audit_task_cache, info);
>                 kfree(key);
>                 audit_log_lost("out of memory in audit_alloc");
>                 return -ENOMEM;
> @@ -962,6 +982,12 @@ int audit_alloc(struct task_struct *tsk)
>         return 0;
>  }
>
> +struct audit_task_info init_struct_audit = {
> +       .loginuid = INVALID_UID,
> +       .sessionid = AUDIT_SID_UNSET,
> +       .ctx = NULL,
> +};
> +
>  static inline void audit_free_context(struct audit_context *context)
>  {
>         audit_free_names(context);

^ permalink raw reply

* Re: [PATCH v8 bpf-next 0/2] bpf: add cg_skb_is_valid_access
From: Alexei Starovoitov @ 2018-10-19 23:09 UTC (permalink / raw)
  To: Song Liu; +Cc: netdev, ast, daniel, kernel-team, edumazet
In-Reply-To: <20181019165758.1410213-1-songliubraving@fb.com>

On Fri, Oct 19, 2018 at 09:57:56AM -0700, Song Liu wrote:
> Changes v7 -> v8:
> 1. Dynamically allocate the dummy sk to avoid race conditions.
> 
> Changes v6 -> v7:
> 1. Make dummy sk a global variable (test_run_sk).
> 
> Changes v5 -> v6:
> 1. Fixed dummy sk in bpf_prog_test_run_skb() as suggested by Eric Dumazet.
> 
> Changes v4 -> v5:
> 1. Replaced bpf_compute_and_save_data_pointers() with
>    bpf_compute_and_save_data_end();
>    Replaced bpf_restore_data_pointers() with bpf_restore_data_end().
> 2. Fixed indentation in test_verifier.c
> 
> Changes v3 -> v4:
> 1. Fixed crash issue reported by Alexei.
> 
> Changes v2 -> v3:
> 1. Added helper function bpf_compute_and_save_data_pointers() and
>    bpf_restore_data_pointers().
> 
> Changes v1 -> v2:
> 1. Updated the list of read-only fields, and read-write fields.
> 2. Added dummy sk to bpf_prog_test_run_skb().
> 
> This set enables BPF program of type BPF_PROG_TYPE_CGROUP_SKB to access
> some __skb_buff data directly.

Applied, Thanks

^ permalink raw reply

* Re: [PATCH bpf-next] bpf: Extend the sk_lookup() helper to XDP hookpoint.
From: Nitin Hande @ 2018-10-19 23:04 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Joe Stringer, Martin KaFai Lau, netdev, ast, Jesper Brouer,
	john fastabend
In-Reply-To: <6c530eaa-c4dd-bcf9-fce5-1f9d66b8efe3@iogearbox.net>

On Fri, 19 Oct 2018 22:32:28 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 10/19/2018 06:47 PM, Joe Stringer wrote:
> > On Thu, 18 Oct 2018 at 22:07, Martin Lau <kafai@fb.com> wrote:  
> >> On Thu, Oct 18, 2018 at 04:52:40PM -0700, Joe Stringer wrote:  
> >>> On Thu, 18 Oct 2018 at 14:20, Daniel Borkmann
> >>> <daniel@iogearbox.net> wrote:  
> >>>> On 10/18/2018 11:06 PM, Joe Stringer wrote:  
> >>>>> On Thu, 18 Oct 2018 at 11:54, Nitin Hande
> >>>>> <nitin.hande@gmail.com> wrote:  
> >>>> [...]  
> >>>>>> Open Issue
> >>>>>> * The underlying code relies on presence of an skb to find out
> >>>>>> the right sk for the case of REUSEPORT socket option. Since
> >>>>>> there is no skb available at XDP hookpoint, the helper
> >>>>>> function will return the first available sk based off the 5
> >>>>>> tuple hash. If the desire is to return a particular sk
> >>>>>> matching reuseport_cb function, please suggest way to tackle
> >>>>>> it, which can be addressed in a future commit.  
> >>>>  
> >>>>>> Signed-off-by: Nitin Hande <Nitin.Hande@gmail.com>  
> >>>>>
> >>>>> Thanks Nitin, LGTM overall.
> >>>>>
> >>>>> The REUSEPORT thing suggests that the usage of this helper from
> >>>>> XDP layer may lead to a different socket being selected vs. the
> >>>>> equivalent call at TC hook, or other places where the selection
> >>>>> may occur. This could be a bit counter-intuitive.
> >>>>>
> >>>>> One thought I had to work around this was to introduce a flag,
> >>>>> something like BPF_F_FIND_REUSEPORT_SK_BY_HASH. This flag would
> >>>>> effectively communicate in the API that the bpf_sk_lookup_xxx()
> >>>>> functions will only select a REUSEPORT socket based on the hash
> >>>>> and not by, for example BPF_PROG_TYPE_SK_REUSEPORT programs.
> >>>>> The absence of the flag would support finding REUSEPORT sockets
> >>>>> by other mechanisms (which would be allowed for now from TC
> >>>>> hooks but would be disallowed from XDP, since there's no
> >>>>> specific plan to support this).  
> >>>>
> >>>> Hmm, given skb is NULL here the only way to lookup the socket in
> >>>> such scenario is based on hash, that is, inet_ehashfn() /
> >>>> inet6_ehashfn(), perhaps alternative is to pass this hash in
> >>>> from XDP itself to the helper so it could be custom selector. Do
> >>>> you have a specific use case on this for XDP (just curious)?  
> >>>
> >>> I don't have a use case for SO_REUSEPORT introspection from XDP,
> >>> so I'm primarily thinking from the perspective of making the
> >>> behaviour clear in the API in a way that leaves open the
> >>> possibility for a reasonable implementation in future. From that
> >>> perspective, my main concern is that it may surprise some BPF
> >>> writers that the same "bpf_sk_lookup_tcp()" call (with identical
> >>> parameters) may have different behaviour at TC vs. XDP layers, as
> >>> the BPF selection of sockets is respected at TC but not at XDP.
> >>>
> >>> FWIW we're already out of parameters for the actual call, so if we
> >>> wanted to allow passing a hash in, we'd need to either dedicate
> >>> half the 'flags' field for this configurable hash, or consider
> >>> adding the new hash parameter to 'struct bpf_sock_tuple'.
> >>>
> >>> +Martin for any thoughts on SO_REUSEPORT and XDP here.  
> >> The XDP/TC prog has read access to the sk fields through
> >> 'struct bpf_sock'?
> >>
> >> A quick thought...
> >> Considering all sk in the same reuse->socks[] share
> >> many things (e.g. family,type,protocol,ip,port..etc are the same),
> >> I wonder returning which particular sk from reuse->socks[] will
> >> matter too much since most of the fields from 'struct bpf_sock'
> >> will be the same.  Some of fields in 'struct bpf_sock' could be
> >> different though, like priority?  Hence, another possibility is to
> >> limit the accessible fields for the XDP prog.  Only allow
> >> accessing the fields that must be the same among the sk in the
> >> same reuse->socks[].  
> > 
> > This sounds pretty reasonable to me.  
> 
> Agree, and in any case this difference in returned sk selection should
> probably also be documented in the uapi helper description.

Okay, will do in a v2.

Thanks
Nitin

^ permalink raw reply

* Re: [bpf-next v3 0/2] Fix kcm + sockmap by checking psock type
From: Daniel Borkmann @ 2018-10-19 22:57 UTC (permalink / raw)
  To: John Fastabend, ast, eric.dumazet; +Cc: netdev
In-Reply-To: <76735602-6af6-bc03-ee66-294987e768e5@iogearbox.net>

On 10/20/2018 12:51 AM, Daniel Borkmann wrote:
> On 10/18/2018 10:58 PM, John Fastabend wrote:
>> We check if the sk_user_data (the psock in skmsg) is in fact a sockmap
>> type to late, after we read the refcnt which is an error. This
>> series moves the check up before reading refcnt and also adds a test
>> to test_maps to test trying to add a KCM socket into a sockmap.
>>
>> While reviewig this code I also found an issue with KCM and kTLS
>> where each uses sk_data_ready hooks and associated stream parser
>> breaking expectations in kcm, ktls or both. But that fix will need
>> to go to net.
>>
>> Thanks to Eric for reporting.
>>
>> v2: Fix up file +/- my scripts lost track of them
>> v3: return EBUSY if refcnt is zero
>>
>> John Fastabend (2):
>>   bpf: skmsg, fix psock create on existing kcm/tls port
>>   bpf: test_maps add a test to catch kcm + sockmap
>>
>>  include/linux/skmsg.h                     | 25 +++++++++---
>>  net/core/sock_map.c                       | 11 +++---
>>  tools/testing/selftests/bpf/Makefile      |  2 +-
>>  tools/testing/selftests/bpf/sockmap_kcm.c | 14 +++++++
>>  tools/testing/selftests/bpf/test_maps.c   | 64 ++++++++++++++++++++++++++++++-
>>  5 files changed, 103 insertions(+), 13 deletions(-)
>>  create mode 100644 tools/testing/selftests/bpf/sockmap_kcm.c
> 
> Applied, thanks!

Fyi, I've only applied patch 1/2 for now to get the bug fixed. The patch 2/2 throws
a bunch of warnings that look like the below. Also, I think we leak kcm socket in
error paths and once we're done with testing, so would be good to close it once
unneeded. Please respin the test as a stand-alone commit, thanks:

[...]
bpf-next/tools/testing/selftests/bpf/libbpf.a -lcap -lelf -lrt -lpthread -o /home/darkstar/trees/bpf-next-ok/tools/testing/selftests/bpf/test_maps
test_maps.c: In function ‘test_sockmap’:
test_maps.c:869:0: warning: "AF_KCM" redefined
 #define AF_KCM 41

In file included from /usr/include/sys/socket.h:38:0,
                 from test_maps.c:21:
/usr/include/bits/socket.h:133:0: note: this is the location of the previous definition
 #define AF_KCM  PF_KCM

^ permalink raw reply

* Re: [bpf-next v3 0/2] Fix kcm + sockmap by checking psock type
From: Daniel Borkmann @ 2018-10-19 22:51 UTC (permalink / raw)
  To: John Fastabend, ast, eric.dumazet; +Cc: netdev
In-Reply-To: <1539896316-13403-1-git-send-email-john.fastabend@gmail.com>

On 10/18/2018 10:58 PM, John Fastabend wrote:
> We check if the sk_user_data (the psock in skmsg) is in fact a sockmap
> type to late, after we read the refcnt which is an error. This
> series moves the check up before reading refcnt and also adds a test
> to test_maps to test trying to add a KCM socket into a sockmap.
> 
> While reviewig this code I also found an issue with KCM and kTLS
> where each uses sk_data_ready hooks and associated stream parser
> breaking expectations in kcm, ktls or both. But that fix will need
> to go to net.
> 
> Thanks to Eric for reporting.
> 
> v2: Fix up file +/- my scripts lost track of them
> v3: return EBUSY if refcnt is zero
> 
> John Fastabend (2):
>   bpf: skmsg, fix psock create on existing kcm/tls port
>   bpf: test_maps add a test to catch kcm + sockmap
> 
>  include/linux/skmsg.h                     | 25 +++++++++---
>  net/core/sock_map.c                       | 11 +++---
>  tools/testing/selftests/bpf/Makefile      |  2 +-
>  tools/testing/selftests/bpf/sockmap_kcm.c | 14 +++++++
>  tools/testing/selftests/bpf/test_maps.c   | 64 ++++++++++++++++++++++++++++++-
>  5 files changed, 103 insertions(+), 13 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/sockmap_kcm.c

Applied, thanks!

^ permalink raw reply

* Re: Fw: [Bug 201423] New: eth0: hw csum failure
From: Eric Dumazet @ 2018-10-19 22:25 UTC (permalink / raw)
  To: Eric Dumazet, Eric Dumazet, andre
  Cc: Stephen Hemminger, netdev, rossi.f, Dimitris Michailidis
In-Reply-To: <4693819f-4a76-532f-9b24-d4328183c807@gmail.com>



On 10/19/2018 02:58 PM, Eric Dumazet wrote:
> 
> 
> On 10/16/2018 06:00 AM, Eric Dumazet wrote:
>> On Mon, Oct 15, 2018 at 11:30 PM Andre Tomt <andre@tomt.net> wrote:
>>>
>>> On 15.10.2018 17:41, Eric Dumazet wrote:
>>>> On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger
>>>>> Something is changed between 4.17.12 and 4.18, after bisecting the problem I
>>>>> got the following first bad commit:
>>>>>
>>>>> commit 88078d98d1bb085d72af8437707279e203524fa5
>>>>> Author: Eric Dumazet <edumazet@google.com>
>>>>> Date:   Wed Apr 18 11:43:15 2018 -0700
>>>>>
>>>>>      net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
>>>>>
>>>>>      After working on IP defragmentation lately, I found that some large
>>>>>      packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
>>>>>      zero paddings on the last (small) fragment.
>>>>>
>>>>>      While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
>>>>>      to CHECKSUM_NONE, forcing a full csum validation, even if all prior
>>>>>      fragments had CHECKSUM_COMPLETE set.
>>>>>
>>>>>      We can instead compute the checksum of the part we are trimming,
>>>>>      usually smaller than the part we keep.
>>>>>
>>>>>      Signed-off-by: Eric Dumazet <edumazet@google.com>
>>>>>      Signed-off-by: David S. Miller <davem@davemloft.net>
>>>>>
>>>>
>>>> Thanks for bisecting !
>>>>
>>>> This commit is known to expose some NIC/driver bugs.
>>>>
>>>> Look at commit 12b03558cef6d655d0d394f5e98a6fd07c1f6c0f
>>>> ("net: sungem: fix rx checksum support")  for one driver needing a fix.
>>>>
>>>> I assume SKY2_HW_NEW_LE is not set on your NIC ?
>>>>
>>>
>>> I've seen similar on several systems with mlx4 cards when using 4.18.x -
>>> that is hw csum failure followed by some backtrace.
>>>
>>> Only seems to happen on systems dealing with quite a bit of UDP.
>>>
>>
>> Strange, because mlx4 on IPv6+UDP should not use CHECKSUM_COMPLETE,
>> but CHECKSUM_UNNECESSARY
>>
>> I would be nice to track this a bit further, maybe by providing the
>> full packet content.
>>
>>> Example from 4.18.10:
>>>> [635607.740574] p0xe0: hw csum failure
>>>> [635607.740598] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.18.0-1 #1
>>>> [635607.740599] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
>>>> [635607.740599] Call Trace:
>>>> [635607.740602]  <IRQ>
>>>> [635607.740611]  dump_stack+0x5c/0x7b
>>>> [635607.740617]  __skb_gro_checksum_complete+0x9a/0xa0
>>>> [635607.740621]  udp6_gro_receive+0x211/0x290
>>>> [635607.740624]  ipv6_gro_receive+0x1a8/0x390
>>>> [635607.740627]  dev_gro_receive+0x33e/0x550
>>>> [635607.740628]  napi_gro_frags+0xa2/0x210
>>>> [635607.740635]  mlx4_en_process_rx_cq+0xa01/0xb40 [mlx4_en]
>>>> [635607.740648]  ? mlx4_cq_completion+0x23/0x70 [mlx4_core]
>>>> [635607.740654]  ? mlx4_eq_int+0x373/0xc80 [mlx4_core]
>>>> [635607.740657]  mlx4_en_poll_rx_cq+0x55/0xf0 [mlx4_en]
>>>> [635607.740658]  net_rx_action+0xe0/0x2e0
>>>> [635607.740662]  __do_softirq+0xd8/0x2e5
>>>> [635607.740666]  irq_exit+0xb4/0xc0
>>>> [635607.740667]  do_IRQ+0x85/0xd0
>>>> [635607.740670]  common_interrupt+0xf/0xf
>>>> [635607.740671]  </IRQ>
>>>> [635607.740675] RIP: 0010:cpuidle_enter_state+0xb4/0x2a0
>>>> [635607.740675] Code: 31 ff e8 df a6 ba ff 45 84 f6 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 d8 01 00 00 31 ff e8 13 81 bf ff fb 66 0f 1f 44 00 00 <4c> 29 fb 48 ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7
>>>> [635607.740701] RSP: 0018:ffffa5c206353ea8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd9
>>>> [635607.740703] RAX: ffff8d72ffd20f00 RBX: 00024214f597c5b0 RCX: 000000000000001f
>>>> [635607.740703] RDX: 00024214f597c5b0 RSI: 0000000000020780 RDI: 0000000000000000
>>>> [635607.740704] RBP: 0000000000000004 R08: 002542bfbefa99fa R09: 00000000ffffffff
>>>> [635607.740705] R10: ffffa5c206353e88 R11: 00000000000000c5 R12: ffffffffaf0aaf78
>>>> [635607.740706] R13: ffff8d72ffd297d8 R14: 0000000000000000 R15: 00024214f58c2ed5
>>>> [635607.740709]  ? cpuidle_enter_state+0x91/0x2a0
>>>> [635607.740712]  do_idle+0x1d0/0x240
>>>> [635607.740715]  cpu_startup_entry+0x5f/0x70
>>>> [635607.740719]  start_secondary+0x185/0x1a0
>>>> [635607.740722]  secondary_startup_64+0xa5/0xb0
>>>> [635607.740731] p0xe0: hw csum failure
>>>> [635607.740745] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.18.0-1 #1
>>>> [635607.740746] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
>>>> [635607.740746] Call Trace:
>>>> [635607.740747]  <IRQ>
>>>> [635607.740750]  dump_stack+0x5c/0x7b
>>>> [635607.740755]  __skb_checksum_complete+0xb8/0xd0
>>>> [635607.740760]  __udp6_lib_rcv+0xa6b/0xa70
>>>> [635607.740767]  ? nft_do_chain_inet+0x7a/0xd0 [nf_tables]
>>>> [635607.740770]  ? nft_do_chain_inet+0x7a/0xd0 [nf_tables]
>>>> [635607.740774]  ip6_input_finish+0xc0/0x460
>>>> [635607.740776]  ip6_input+0x2b/0x90
>>>> [635607.740778]  ? ip6_rcv_finish+0x110/0x110
>>>> [635607.740780]  ipv6_rcv+0x2cd/0x4b0
>>>> [635607.740783]  ? udp6_lib_lookup_skb+0x59/0x80
>>>> [635607.740785]  __netif_receive_skb_core+0x455/0xb30
>>>> [635607.740788]  ? ipv6_gro_receive+0x1a8/0x390
>>>> [635607.740790]  ? netif_receive_skb_internal+0x24/0xb0
>>>> [635607.740792]  netif_receive_skb_internal+0x24/0xb0
>>>> [635607.740793]  napi_gro_frags+0x165/0x210
>>>> [635607.740796]  mlx4_en_process_rx_cq+0xa01/0xb40 [mlx4_en]
>>>> [635607.740802]  ? mlx4_cq_completion+0x23/0x70 [mlx4_core]
>>>> [635607.740807]  ? mlx4_eq_int+0x373/0xc80 [mlx4_core]
>>>> [635607.740810]  mlx4_en_poll_rx_cq+0x55/0xf0 [mlx4_en]
>>>> [635607.740811]  net_rx_action+0xe0/0x2e0
>>>> [635607.740813]  __do_softirq+0xd8/0x2e5
>>>> [635607.740816]  irq_exit+0xb4/0xc0
>>>> [635607.740817]  do_IRQ+0x85/0xd0
>>>> [635607.740820]  common_interrupt+0xf/0xf
>>>> [635607.740821]  </IRQ>
>>>> [635607.740823] RIP: 0010:cpuidle_enter_state+0xb4/0x2a0
>>>> [635607.740823] Code: 31 ff e8 df a6 ba ff 45 84 f6 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 d8 01 00 00 31 ff e8 13 81 bf ff fb 66 0f 1f 44 00 00 <4c> 29 fb 48 ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7
>>>> [635607.740848] RSP: 0018:ffffa5c206353ea8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd9
>>>> [635607.740849] RAX: ffff8d72ffd20f00 RBX: 00024214f597c5b0 RCX: 000000000000001f
>>>> [635607.740850] RDX: 00024214f597c5b0 RSI: 0000000000020780 RDI: 0000000000000000
>>>> [635607.740851] RBP: 0000000000000004 R08: 002542bfbefa99fa R09: 00000000ffffffff
>>>> [635607.740852] R10: ffffa5c206353e88 R11: 00000000000000c5 R12: ffffffffaf0aaf78
>>>> [635607.740853] R13: ffff8d72ffd297d8 R14: 0000000000000000 R15: 00024214f58c2ed5
>>>> [635607.740855]  ? cpuidle_enter_state+0x91/0x2a0
>>>> [635607.740857]  do_idle+0x1d0/0x240
>>>> [635607.740859]  cpu_startup_entry+0x5f/0x70
>>>> [635607.740861]  start_secondary+0x185/0x1a0
>>>> [635607.740863]  secondary_startup_64+0xa5/0xb0
> 
> As a matter of fact Dimitris found the issue in the patch and is working on a fix involving csum_block_sub()
> 
> Problems comes from trimming an odd number of bytes.

More exactly, trimming bytes starting at an odd offset.

^ permalink raw reply

* Re: Fw: [Bug 201423] New: eth0: hw csum failure
From: Eric Dumazet @ 2018-10-19 21:58 UTC (permalink / raw)
  To: Eric Dumazet, andre
  Cc: Stephen Hemminger, netdev, rossi.f, Dimitris Michailidis
In-Reply-To: <CANn89i+Q2z6DZx3q4KR7sU2hY6td2B61GMtyroLmTXFqJH4XRw@mail.gmail.com>



On 10/16/2018 06:00 AM, Eric Dumazet wrote:
> On Mon, Oct 15, 2018 at 11:30 PM Andre Tomt <andre@tomt.net> wrote:
>>
>> On 15.10.2018 17:41, Eric Dumazet wrote:
>>> On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger
>>>> Something is changed between 4.17.12 and 4.18, after bisecting the problem I
>>>> got the following first bad commit:
>>>>
>>>> commit 88078d98d1bb085d72af8437707279e203524fa5
>>>> Author: Eric Dumazet <edumazet@google.com>
>>>> Date:   Wed Apr 18 11:43:15 2018 -0700
>>>>
>>>>      net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
>>>>
>>>>      After working on IP defragmentation lately, I found that some large
>>>>      packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
>>>>      zero paddings on the last (small) fragment.
>>>>
>>>>      While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
>>>>      to CHECKSUM_NONE, forcing a full csum validation, even if all prior
>>>>      fragments had CHECKSUM_COMPLETE set.
>>>>
>>>>      We can instead compute the checksum of the part we are trimming,
>>>>      usually smaller than the part we keep.
>>>>
>>>>      Signed-off-by: Eric Dumazet <edumazet@google.com>
>>>>      Signed-off-by: David S. Miller <davem@davemloft.net>
>>>>
>>>
>>> Thanks for bisecting !
>>>
>>> This commit is known to expose some NIC/driver bugs.
>>>
>>> Look at commit 12b03558cef6d655d0d394f5e98a6fd07c1f6c0f
>>> ("net: sungem: fix rx checksum support")  for one driver needing a fix.
>>>
>>> I assume SKY2_HW_NEW_LE is not set on your NIC ?
>>>
>>
>> I've seen similar on several systems with mlx4 cards when using 4.18.x -
>> that is hw csum failure followed by some backtrace.
>>
>> Only seems to happen on systems dealing with quite a bit of UDP.
>>
> 
> Strange, because mlx4 on IPv6+UDP should not use CHECKSUM_COMPLETE,
> but CHECKSUM_UNNECESSARY
> 
> I would be nice to track this a bit further, maybe by providing the
> full packet content.
> 
>> Example from 4.18.10:
>>> [635607.740574] p0xe0: hw csum failure
>>> [635607.740598] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.18.0-1 #1
>>> [635607.740599] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
>>> [635607.740599] Call Trace:
>>> [635607.740602]  <IRQ>
>>> [635607.740611]  dump_stack+0x5c/0x7b
>>> [635607.740617]  __skb_gro_checksum_complete+0x9a/0xa0
>>> [635607.740621]  udp6_gro_receive+0x211/0x290
>>> [635607.740624]  ipv6_gro_receive+0x1a8/0x390
>>> [635607.740627]  dev_gro_receive+0x33e/0x550
>>> [635607.740628]  napi_gro_frags+0xa2/0x210
>>> [635607.740635]  mlx4_en_process_rx_cq+0xa01/0xb40 [mlx4_en]
>>> [635607.740648]  ? mlx4_cq_completion+0x23/0x70 [mlx4_core]
>>> [635607.740654]  ? mlx4_eq_int+0x373/0xc80 [mlx4_core]
>>> [635607.740657]  mlx4_en_poll_rx_cq+0x55/0xf0 [mlx4_en]
>>> [635607.740658]  net_rx_action+0xe0/0x2e0
>>> [635607.740662]  __do_softirq+0xd8/0x2e5
>>> [635607.740666]  irq_exit+0xb4/0xc0
>>> [635607.740667]  do_IRQ+0x85/0xd0
>>> [635607.740670]  common_interrupt+0xf/0xf
>>> [635607.740671]  </IRQ>
>>> [635607.740675] RIP: 0010:cpuidle_enter_state+0xb4/0x2a0
>>> [635607.740675] Code: 31 ff e8 df a6 ba ff 45 84 f6 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 d8 01 00 00 31 ff e8 13 81 bf ff fb 66 0f 1f 44 00 00 <4c> 29 fb 48 ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7
>>> [635607.740701] RSP: 0018:ffffa5c206353ea8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd9
>>> [635607.740703] RAX: ffff8d72ffd20f00 RBX: 00024214f597c5b0 RCX: 000000000000001f
>>> [635607.740703] RDX: 00024214f597c5b0 RSI: 0000000000020780 RDI: 0000000000000000
>>> [635607.740704] RBP: 0000000000000004 R08: 002542bfbefa99fa R09: 00000000ffffffff
>>> [635607.740705] R10: ffffa5c206353e88 R11: 00000000000000c5 R12: ffffffffaf0aaf78
>>> [635607.740706] R13: ffff8d72ffd297d8 R14: 0000000000000000 R15: 00024214f58c2ed5
>>> [635607.740709]  ? cpuidle_enter_state+0x91/0x2a0
>>> [635607.740712]  do_idle+0x1d0/0x240
>>> [635607.740715]  cpu_startup_entry+0x5f/0x70
>>> [635607.740719]  start_secondary+0x185/0x1a0
>>> [635607.740722]  secondary_startup_64+0xa5/0xb0
>>> [635607.740731] p0xe0: hw csum failure
>>> [635607.740745] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.18.0-1 #1
>>> [635607.740746] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
>>> [635607.740746] Call Trace:
>>> [635607.740747]  <IRQ>
>>> [635607.740750]  dump_stack+0x5c/0x7b
>>> [635607.740755]  __skb_checksum_complete+0xb8/0xd0
>>> [635607.740760]  __udp6_lib_rcv+0xa6b/0xa70
>>> [635607.740767]  ? nft_do_chain_inet+0x7a/0xd0 [nf_tables]
>>> [635607.740770]  ? nft_do_chain_inet+0x7a/0xd0 [nf_tables]
>>> [635607.740774]  ip6_input_finish+0xc0/0x460
>>> [635607.740776]  ip6_input+0x2b/0x90
>>> [635607.740778]  ? ip6_rcv_finish+0x110/0x110
>>> [635607.740780]  ipv6_rcv+0x2cd/0x4b0
>>> [635607.740783]  ? udp6_lib_lookup_skb+0x59/0x80
>>> [635607.740785]  __netif_receive_skb_core+0x455/0xb30
>>> [635607.740788]  ? ipv6_gro_receive+0x1a8/0x390
>>> [635607.740790]  ? netif_receive_skb_internal+0x24/0xb0
>>> [635607.740792]  netif_receive_skb_internal+0x24/0xb0
>>> [635607.740793]  napi_gro_frags+0x165/0x210
>>> [635607.740796]  mlx4_en_process_rx_cq+0xa01/0xb40 [mlx4_en]
>>> [635607.740802]  ? mlx4_cq_completion+0x23/0x70 [mlx4_core]
>>> [635607.740807]  ? mlx4_eq_int+0x373/0xc80 [mlx4_core]
>>> [635607.740810]  mlx4_en_poll_rx_cq+0x55/0xf0 [mlx4_en]
>>> [635607.740811]  net_rx_action+0xe0/0x2e0
>>> [635607.740813]  __do_softirq+0xd8/0x2e5
>>> [635607.740816]  irq_exit+0xb4/0xc0
>>> [635607.740817]  do_IRQ+0x85/0xd0
>>> [635607.740820]  common_interrupt+0xf/0xf
>>> [635607.740821]  </IRQ>
>>> [635607.740823] RIP: 0010:cpuidle_enter_state+0xb4/0x2a0
>>> [635607.740823] Code: 31 ff e8 df a6 ba ff 45 84 f6 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 d8 01 00 00 31 ff e8 13 81 bf ff fb 66 0f 1f 44 00 00 <4c> 29 fb 48 ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7
>>> [635607.740848] RSP: 0018:ffffa5c206353ea8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd9
>>> [635607.740849] RAX: ffff8d72ffd20f00 RBX: 00024214f597c5b0 RCX: 000000000000001f
>>> [635607.740850] RDX: 00024214f597c5b0 RSI: 0000000000020780 RDI: 0000000000000000
>>> [635607.740851] RBP: 0000000000000004 R08: 002542bfbefa99fa R09: 00000000ffffffff
>>> [635607.740852] R10: ffffa5c206353e88 R11: 00000000000000c5 R12: ffffffffaf0aaf78
>>> [635607.740853] R13: ffff8d72ffd297d8 R14: 0000000000000000 R15: 00024214f58c2ed5
>>> [635607.740855]  ? cpuidle_enter_state+0x91/0x2a0
>>> [635607.740857]  do_idle+0x1d0/0x240
>>> [635607.740859]  cpu_startup_entry+0x5f/0x70
>>> [635607.740861]  start_secondary+0x185/0x1a0
>>> [635607.740863]  secondary_startup_64+0xa5/0xb0

As a matter of fact Dimitris found the issue in the patch and is working on a fix involving csum_block_sub()

Problems comes from trimming an odd number of bytes.

^ permalink raw reply

* Re: [PATCH v2 net] r8169: fix NAPI handling under high load
From: Francois Romieu @ 2018-10-19 21:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Heiner Kallweit, David Miller, Realtek linux nic maintainers,
	netdev@vger.kernel.org
In-Reply-To: <bcb1b9e9-150d-38d1-d61e-2e0bf440d310@gmail.com>

Eric Dumazet <eric.dumazet@gmail.com> :
> On 10/18/2018 03:59 PM, Francois Romieu wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> :
> > [...]
> >> One has to wonder why rtl8169_poll(), which might be called in a loop under DOS,
> >> has to call rtl_ack_events() ?
> > 
> > So as to cover a wider temporal range before any event can trigger an
> > extra irq. I was more worried about irq cost than about IO cost (and
> > I still am).
> > 
> Normally the IRQ would not be enabled under DOS.

Yes.

My concern was not the DOS situation when NAPI runs at full speed.
As far as I was able to experiment with it, the driver did not seem
too bad here.

The location of the ack targets the interim situation where the IRQ
rate can increase before NAPI kicks in. By increasing the time range
whose events can be acked, the maximum irq rate should be lowered.

-- 
Ueimor

^ permalink raw reply

* Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Edward Cree @ 2018-10-19 21:26 UTC (permalink / raw)
  To: Martin Lau
  Cc: Yonghong Song, Alexei Starovoitov, daniel@iogearbox.net,
	netdev@vger.kernel.org, Kernel Team
In-Reply-To: <20181019193548.g4neaudunoan5gvn@kafai-mbp.dhcp.thefacebook.com>

On 19/10/18 20:36, Martin Lau wrote:
> On Fri, Oct 19, 2018 at 06:04:11PM +0100, Edward Cree wrote:
>> But you *do* have such a new section.
>> The patch comment talks about a 'FuncInfo Table' which appears to
> Note that the new section, which contains the FuncInfo Table,
> is in a new ELF section ".BTF.ext" instead of the ".BTF".
> It is not in the ".BTF" section because it is only useful during
> bpf_prog_load().
I thought it was because it needed to be munged by the loader/linker?

> IIUC, I think what you are suggesting here is to use (type_id, name)
> to describe DW_TAG_subprogram "int foo1(int) {}", "int foo2(int) {}",
> "int foo3(int) {}" where type_id here is referring to the same
> DW_TAG_subroutine_type, and only define that _one_
> DW_TAG_subroutine_type in the BTF "type" section.
Yes, something like that.

> If the concern is having both FUNC and FUNC_PROTO is confusing,
The concern is that you're conflating different entities (types
 and instances); FUNC_PROTO is just a symptom/canary of that.

> we could go back to the CTF way which adds a new function section
> in ".BTF" and it is only for DW_TAG_subprogram.
> BTF_KIND_FUNC_PROTO is then no longer necessary.
> Some of new BTF verifier checkings may actually go away also.
> The down side is there will be two id spaces.
Two id spaces... one for types and the other for subprograms.
These are different things, so why would you _want_ them to share
 an id space?  I don't, for instance, see any situation in which
 you'd want some other record to have a field that could reference
 either.
And the 'subprogram id' doesn't have to be just for subprograms;
 it could be for instances generally — like I've been saying, a
 variable declaration is to an object type what a subprogram is to
 a function type, just with a few complications like "subprograms
 can only appear at file scope, not nested in other functions" and
 "variables of function type are immutable".
(I'm assuming that at some point we're going to want to be able to
 have BTF information for e.g. variables stored on a subprogram's
 stack, if only for stuff like single-stepping in a debugger in
 userspace with some sort of mock.  At that point, the variable
 has to have its own record — you can't just have some sort of
 magic type record because e.g. "struct foo bar;" has two names,
 one for the type and one for the variable.)

> Discussed a bit offline with folks about the two id spaces
> situation and it is not good for debugging purpose.
Could you unpack this a bit more?

-Ed

^ permalink raw reply

* (unknown)
From: David Miller @ 2018-10-19 20:58 UTC (permalink / raw)
  To: dhowells; +Cc: netdev, linux-afs
In-Reply-To: <6068.1539982295@warthog.procyon.org.uk>

From: David Howells <dhowells@redhat.com>
Date: Fri, 19 Oct 2018 21:51:35 +0100

> David Miller <davem@davemloft.net> wrote:
> 
>> > Is there going to be a merge of net into net-next before the merge
>> > window opens?  Or do you have a sample merge that I can rebase my
>> > afs-next branch on?
>> 
>> I'll be doing a net to net-next merge some time today.
> 
> Excellent, thanks!

And this is now complete.

^ permalink raw reply

* (unknown)
From: David Howells @ 2018-10-19 20:51 UTC (permalink / raw)
  To: David Miller; +Cc: dhowells, netdev, linux-afs
In-Reply-To: <20181019.104625.891840017846483385.davem@davemloft.net>

David Miller <davem@davemloft.net> wrote:

> > Is there going to be a merge of net into net-next before the merge
> > window opens?  Or do you have a sample merge that I can rebase my
> > afs-next branch on?
> 
> I'll be doing a net to net-next merge some time today.

Excellent, thanks!

David

^ permalink raw reply

* RE: [PATCH net-next v2] netpoll: allow cleanup to be synchronous
From: Banerjee, Debabrata @ 2018-10-19 20:46 UTC (permalink / raw)
  To: 'Neil Horman'; +Cc: David S . Miller, netdev@vger.kernel.org
In-Reply-To: <20181019203458.GC30254@hmswarspite.think-freely.org>

> From: Neil Horman <nhorman@tuxdriver.com>

> I presume you've tested this with some of the stacked devices?  I think I'm
> ok with this change, but I'd like confirmation that its worked.
> 
> Neil

Yes I've tested this on a bond device with vlan stacked on top.

-Deb

> 
> > CC: Neil Horman <nhorman@tuxdriver.com>
> > CC: "David S. Miller" <davem@davemloft.net>
> > Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
> > ---
> >  drivers/net/bonding/bond_main.c |  3 ++-
> >  drivers/net/macvlan.c           |  2 +-
> >  drivers/net/team/team.c         |  5 +----
> >  include/linux/netpoll.h         |  4 +---
> >  net/8021q/vlan_dev.c            |  3 +--
> >  net/bridge/br_device.c          |  2 +-
> >  net/core/netpoll.c              | 20 +++++---------------
> >  net/dsa/slave.c                 |  2 +-
> >  8 files changed, 13 insertions(+), 28 deletions(-)
> >
> > diff --git a/drivers/net/bonding/bond_main.c
> > b/drivers/net/bonding/bond_main.c index ee28ec9e0aba..ffa37adb7681
> > 100644
> > --- a/drivers/net/bonding/bond_main.c
> > +++ b/drivers/net/bonding/bond_main.c
> > @@ -963,7 +963,8 @@ static inline void slave_disable_netpoll(struct slave
> *slave)
> >  		return;
> >
> >  	slave->np = NULL;
> > -	__netpoll_free_async(np);
> > +
> > +	__netpoll_free(np);
> >  }
> >
> >  static void bond_poll_controller(struct net_device *bond_dev) diff
> > --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index
> > cfda146f3b3b..fc8d5f1ee1ad 100644
> > --- a/drivers/net/macvlan.c
> > +++ b/drivers/net/macvlan.c
> > @@ -1077,7 +1077,7 @@ static void macvlan_dev_netpoll_cleanup(struct
> > net_device *dev)
> >
> >  	vlan->netpoll = NULL;
> >
> > -	__netpoll_free_async(netpoll);
> > +	__netpoll_free(netpoll);
> >  }
> >  #endif	/* CONFIG_NET_POLL_CONTROLLER */
> >
> > diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c index
> > d887016e54b6..db633ae9f784 100644
> > --- a/drivers/net/team/team.c
> > +++ b/drivers/net/team/team.c
> > @@ -1104,10 +1104,7 @@ static void team_port_disable_netpoll(struct
> team_port *port)
> >  		return;
> >  	port->np = NULL;
> >
> > -	/* Wait for transmitting packets to finish before freeing. */
> > -	synchronize_rcu_bh();
> > -	__netpoll_cleanup(np);
> > -	kfree(np);
> > +	__netpoll_free(np);
> >  }
> >  #else
> >  static int team_port_enable_netpoll(struct team_port *port) diff
> > --git a/include/linux/netpoll.h b/include/linux/netpoll.h index
> > 3ef82d3a78db..676f1ff161a9 100644
> > --- a/include/linux/netpoll.h
> > +++ b/include/linux/netpoll.h
> > @@ -31,8 +31,6 @@ struct netpoll {
> >  	bool ipv6;
> >  	u16 local_port, remote_port;
> >  	u8 remote_mac[ETH_ALEN];
> > -
> > -	struct work_struct cleanup_work;
> >  };
> >
> >  struct netpoll_info {
> > @@ -63,7 +61,7 @@ int netpoll_parse_options(struct netpoll *np, char
> > *opt);  int __netpoll_setup(struct netpoll *np, struct net_device
> > *ndev);  int netpoll_setup(struct netpoll *np);  void
> > __netpoll_cleanup(struct netpoll *np); -void
> > __netpoll_free_async(struct netpoll *np);
> > +void __netpoll_free(struct netpoll *np);
> >  void netpoll_cleanup(struct netpoll *np);  void
> > netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
> >  			     struct net_device *dev);
> > diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c index
> > 546af0e73ac3..ff720f1ebf73 100644
> > --- a/net/8021q/vlan_dev.c
> > +++ b/net/8021q/vlan_dev.c
> > @@ -756,8 +756,7 @@ static void vlan_dev_netpoll_cleanup(struct
> net_device *dev)
> >  		return;
> >
> >  	vlan->netpoll = NULL;
> > -
> > -	__netpoll_free_async(netpoll);
> > +	__netpoll_free(netpoll);
> >  }
> >  #endif /* CONFIG_NET_POLL_CONTROLLER */
> >
> > diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c index
> > e053a4e43758..c6abf927f0c9 100644
> > --- a/net/bridge/br_device.c
> > +++ b/net/bridge/br_device.c
> > @@ -344,7 +344,7 @@ void br_netpoll_disable(struct net_bridge_port *p)
> >
> >  	p->np = NULL;
> >
> > -	__netpoll_free_async(np);
> > +	__netpoll_free(np);
> >  }
> >
> >  #endif
> > diff --git a/net/core/netpoll.c b/net/core/netpoll.c index
> > de1d1ba92f2d..6ac71624ead4 100644
> > --- a/net/core/netpoll.c
> > +++ b/net/core/netpoll.c
> > @@ -591,7 +591,6 @@ int __netpoll_setup(struct netpoll *np, struct
> > net_device *ndev)
> >
> >  	np->dev = ndev;
> >  	strlcpy(np->dev_name, ndev->name, IFNAMSIZ);
> > -	INIT_WORK(&np->cleanup_work, netpoll_async_cleanup);
> >
> >  	if (ndev->priv_flags & IFF_DISABLE_NETPOLL) {
> >  		np_err(np, "%s doesn't support polling, aborting\n", @@ -
> 790,10
> > +789,6 @@ void __netpoll_cleanup(struct netpoll *np)  {
> >  	struct netpoll_info *npinfo;
> >
> > -	/* rtnl_dereference would be preferable here but
> > -	 * rcu_cleanup_netpoll path can put us in here safely without
> > -	 * holding the rtnl, so plain rcu_dereference it is
> > -	 */
> >  	npinfo = rtnl_dereference(np->dev->npinfo);
> >  	if (!npinfo)
> >  		return;
> > @@ -814,21 +809,16 @@ void __netpoll_cleanup(struct netpoll *np)  }
> > EXPORT_SYMBOL_GPL(__netpoll_cleanup);
> >
> > -static void netpoll_async_cleanup(struct work_struct *work)
> > +void __netpoll_free(struct netpoll *np)
> >  {
> > -	struct netpoll *np = container_of(work, struct netpoll,
> cleanup_work);
> > +	ASSERT_RTNL();
> >
> > -	rtnl_lock();
> > +	/* Wait for transmitting packets to finish before freeing. */
> > +	synchronize_rcu_bh();
> >  	__netpoll_cleanup(np);
> > -	rtnl_unlock();
> >  	kfree(np);
> >  }
> > -
> > -void __netpoll_free_async(struct netpoll *np) -{
> > -	schedule_work(&np->cleanup_work);
> > -}
> > -EXPORT_SYMBOL_GPL(__netpoll_free_async);
> > +EXPORT_SYMBOL_GPL(__netpoll_free);
> >
> >  void netpoll_cleanup(struct netpoll *np)  { diff --git
> > a/net/dsa/slave.c b/net/dsa/slave.c index 3f840b6eea69..3679e13b2ead
> > 100644
> > --- a/net/dsa/slave.c
> > +++ b/net/dsa/slave.c
> > @@ -722,7 +722,7 @@ static void dsa_slave_netpoll_cleanup(struct
> > net_device *dev)
> >
> >  	p->netpoll = NULL;
> >
> > -	__netpoll_free_async(netpoll);
> > +	__netpoll_free(netpoll);
> >  }
> >
> >  static void dsa_slave_poll_controller(struct net_device *dev)
> > --
> > 2.19.1
> >
> >

^ permalink raw reply

* Re: [PATCH bpf-next v2 0/2] improve and fix barriers for walking perf ring buffer
From: Alexei Starovoitov @ 2018-10-19 20:45 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: peterz, paulmck, will.deacon, acme, yhs, john.fastabend, netdev
In-Reply-To: <20181019135103.3602-1-daniel@iogearbox.net>

On Fri, Oct 19, 2018 at 03:51:01PM +0200, Daniel Borkmann wrote:
> This set first adds smp_* barrier variants to tools infrastructure
> and updates perf and libbpf to make use of them. For details, please
> see individual patches, thanks!
> 
> Arnaldo, if there are no objections, could this be routed via bpf-next
> with Acked-by's due to later dependencies in libbpf? Alternatively,
> I could also get the 2nd patch out during merge window, but perhaps
> it's okay to do in one go as there shouldn't be much conflict in perf
> itself.
> 
> Thanks!
> 
> v1 -> v2:
>   - add common helper and switch to acquire/release variants
>     when possible, thanks Peter!

Applied, Thanks

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox