Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next 0/3] gre: add collect_md mode for ERSPAN tunnel
From: William Tu @ 2017-08-25 16:21 UTC (permalink / raw)
  To: netdev

This patch series provide collect_md mode for ERSPAN tunnel.  The fist patch
refactors the existing gre_fb_xmit function by exacting the route cache
portion into a new function called prepare_fb_xmit.  The second patch
introduces the collect_md mode for ERSPAN tunnel, by calling the
prepare_fb_xmit function and adding ERSPAN specific logic.  The final patch
adds the test case using bpf_skb_{set,get}_tunnel_{key,opt}.

Thank you

William Tu (3):
  gre: refactor the gre_fb_xmit
  gre: add collect_md mode to ERSPAN tunnel
  samples/bpf: extend test_tunnel_bpf.sh with ERSPAN

 include/net/ip_tunnels.h       |   4 +-
 net/ipv4/ip_gre.c              | 157 ++++++++++++++++++++++++++++++++++++-----
 samples/bpf/tcbpf2_kern.c      |  63 ++++++++++++++++-
 samples/bpf/test_tunnel_bpf.sh |  29 ++++++++
 4 files changed, 232 insertions(+), 21 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: [PATCH 3/7] ieee802154: 6lowpan: make header_ops const
From: Marcel Holtmann @ 2017-08-25 16:17 UTC (permalink / raw)
  To: Bhumika Goyal
  Cc: julia.lawall, Gustavo F. Padovan, Johan Hedberg, David S. Miller,
	Pablo Neira Ayuso, kadlec, Florian Westphal, Stephen Hemminger,
	Alexander Aring, Stefan Schmidt, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, santosh.shilimkar, trond.myklebust,
	anna.schumaker, bfields, jlayton, open list:BLUETOOTH DRIVERS,
	Network Development, LKML, netfilter-devel
In-Reply-To: <1503670907-23221-4-git-send-email-bhumirks@gmail.com>

Hi Bhumika,

> Make this const as it is only stored as a reference in a const field of
> a net_device structure.
> 
> Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
> ---
> net/ieee802154/6lowpan/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)

patch has been applied to bluetooth-next tree.

Regards

Marcel

^ permalink raw reply

* Re: [RFC PATCH] net: limit maximum number of packets to mark with xmit_more
From: Stephen Hemminger @ 2017-08-25 15:58 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter; +Cc: Keller, Jacob E, netdev@vger.kernel.org
In-Reply-To: <E0D909EE5BB15A4699798539EA149D7F07793B3E@ORSMSX103.amr.corp.intel.com>

On Fri, 25 Aug 2017 15:36:22 +0000
"Waskiewicz Jr, Peter" <peter.waskiewicz.jr@intel.com> wrote:

> On 8/25/17 11:25 AM, Jacob Keller wrote:
> > Under some circumstances, such as with many stacked devices, it is
> > possible that dev_hard_start_xmit will bundle many packets together, and
> > mark them all with xmit_more.
> > 
> > Most drivers respond to xmit_more by skipping tail bumps on packet
> > rings, or similar behavior as long as xmit_more is set. This is
> > a performance win since it means drivers can avoid notifying hardware of
> > new packets repeat daily, and thus avoid wasting unnecessary PCIe or other
> > bandwidth.
> > 
> > This use of xmit_more comes with a trade off because bundling too many
> > packets can increase latency of the Tx packets. To avoid this, we should
> > limit the maximum number of packets with xmit_more.
> > 
> > Driver authors could modify their drivers to check for some determined
> > limit, but this requires all drivers to be modified in order to gain
> > advantage.
> > 
> > Instead, add a sysctl "xmit_more_max" which can be used to configure the
> > maximum number of xmit_more skbs to send in a sequence. This ensures
> > that all drivers benefit, and allows system administrators the option to
> > tune the value to their environment.
> > 
> > Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> > ---
> > 
> > Stray thoughts and further questions....
> > 
> > Is this the right approach? Did I miss any other places where we should
> > limit? Does the limit make sense? Should it instead be a per-device
> > tuning nob instead of a global? Is 32 a good default?  
> 
> I actually like the idea of a per-device knob.  A xmit_more_max that's 
> global in a system with 1GbE devices along with a 25/50GbE or more just 
> doesn't make much sense to me.  Or having heterogeneous vendor devices 
> in the same system that have different HW behaviors could mask issues 
> with latency.
> 
> This seems like another incarnation of possible buffer-bloat if the max 
> is too high...
> 
> > 
> >   Documentation/sysctl/net.txt |  6 ++++++
> >   include/linux/netdevice.h    |  2 ++
> >   net/core/dev.c               | 10 +++++++++-
> >   net/core/sysctl_net_core.c   |  7 +++++++
> >   4 files changed, 24 insertions(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
> > index b67044a2575f..3d995e8f4448 100644
> > --- a/Documentation/sysctl/net.txt
> > +++ b/Documentation/sysctl/net.txt
> > @@ -230,6 +230,12 @@ netdev_max_backlog
> >   Maximum number  of  packets,  queued  on  the  INPUT  side, when the interface
> >   receives packets faster than kernel can process them.
> >   
> > +xmit_more_max
> > +-------------
> > +
> > +Maximum number of packets in a row to mark with skb->xmit_more. A value of zero
> > +indicates no limit.  
> 
> What defines "packet?"  MTU-sized packets, or payloads coming down from 
> the stack (e.g. TSO's)?

xmit_more is only a hint to the device. The device driver should ignore it unless
there are hardware advantages. The device driver is the place with HW specific
knowledge (like 4 Tx descriptors is equivalent to one PCI transaction on this device).

Anything that pushes that optimization out to the user is only useful for benchmarks
and embedded devices.

^ permalink raw reply

* Re: [PATCH net-next v2 00/14] net: mvpp2: comphy configuration
From: Antoine Tenart @ 2017-08-25 15:56 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Antoine Tenart, davem, kishon, jason, sebastian.hesselbarth,
	gregory.clement, thomas.petazzoni, nadavh, linux, linux-kernel,
	mw, stefanc, miquel.raynal, netdev
In-Reply-To: <20170825155111.GD30922@lunn.ch>

[-- Attachment #1: Type: text/plain, Size: 574 bytes --]

Hi Andrew,

On Fri, Aug 25, 2017 at 05:51:11PM +0200, Andrew Lunn wrote:
> >   - Checked if the carrier_on/off functions were needed. They are.
> 
> Could you explain the situations they are needed in?
> 
> Quite a few drivers do this, so i'm not saying it is wrong.  But it
> would be nice to understand if we are missing something in phylib.

I know it's not working without these calls, but I can't tell why (yet).
I can try to dig into this.

Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net v2 1/4] net: mvpp2: fix the mac address used when using PPv2.2
From: Antoine Tenart @ 2017-08-25 15:54 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Antoine Tenart, davem, thomas.petazzoni, gregory.clement, nadavh,
	linux, linux-kernel, mw, stefanc, netdev
In-Reply-To: <20170825154247.GC30922@lunn.ch>

[-- Attachment #1: Type: text/plain, Size: 598 bytes --]

On Fri, Aug 25, 2017 at 05:42:47PM +0200, Andrew Lunn wrote:
> > So probably the best way to handle this would have been to send 1/4 to
> > net and 2-4/4 to net-next
> 
> Correct. 
> 
> > (but then there's a dependency between the two series).
> 
> Dave merges net into net-next every so often. So you just need to wait
> a bit before submitting the net-next parts.

I see. So Dave can take patch 1/4 and I'll respin the others once it's
merged into net-next.

Thanks!
Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v2 00/14] net: mvpp2: comphy configuration
From: Andrew Lunn @ 2017-08-25 15:51 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kishon, jason, sebastian.hesselbarth, gregory.clement,
	thomas.petazzoni, nadavh, linux, linux-kernel, mw, stefanc,
	miquel.raynal, netdev
In-Reply-To: <20170825144821.31129-1-antoine.tenart@free-electrons.com>

>   - Checked if the carrier_on/off functions were needed. They are.

Hi Antoine

Could you explain the situations they are needed in?

Quite a few drivers do this, so i'm not saying it is wrong.  But it
would be nice to understand if we are missing something in phylib.

      Andrew

^ permalink raw reply

* Re: [PATCH] strparser: initialize all callbacks
From: Tom Herbert @ 2017-08-25 15:50 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Kernel Network Developers, David S . Miller, linux-kernel,
	Eric Biggers, Dmitry Vyukov
In-Reply-To: <20170824213851.57601-1-ebiggers3@gmail.com>

On Thu, Aug 24, 2017 at 2:38 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> From: Eric Biggers <ebiggers@google.com>
>
> commit bbb03029a899 ("strparser: Generalize strparser") added more
> function pointers to 'struct strp_callbacks'; however, kcm_attach() was
> not updated to initialize them.  This could cause the ->lock() and/or
> ->unlock() function pointers to be set to garbage values, causing a
> crash in strp_work().
>
> Fix the bug by moving the callback structs into static memory, so
> unspecified members are zeroed.  Also constify them while we're at it.
>
> This bug was found by syzkaller, which encountered the following splat:
>
>     IP: 0x55
>     PGD 3b1ca067
>     P4D 3b1ca067
>     PUD 3b12f067
>     PMD 0
>
>     Oops: 0010 [#1] SMP KASAN
>     Dumping ftrace buffer:
>        (ftrace buffer empty)
>     Modules linked in:
>     CPU: 2 PID: 1194 Comm: kworker/u8:1 Not tainted 4.13.0-rc4-next-20170811 #2
>     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>     Workqueue: kstrp strp_work
>     task: ffff88006bb0e480 task.stack: ffff88006bb10000
>     RIP: 0010:0x55
>     RSP: 0018:ffff88006bb17540 EFLAGS: 00010246
>     RAX: dffffc0000000000 RBX: ffff88006ce4bd60 RCX: 0000000000000000
>     RDX: 1ffff1000d9c97bd RSI: 0000000000000000 RDI: ffff88006ce4bc48
>     RBP: ffff88006bb17558 R08: ffffffff81467ab2 R09: 0000000000000000
>     R10: ffff88006bb17438 R11: ffff88006bb17940 R12: ffff88006ce4bc48
>     R13: ffff88003c683018 R14: ffff88006bb17980 R15: ffff88003c683000
>     FS:  0000000000000000(0000) GS:ffff88006de00000(0000) knlGS:0000000000000000
>     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     CR2: 0000000000000055 CR3: 000000003c145000 CR4: 00000000000006e0
>     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>     Call Trace:
>      process_one_work+0xbf3/0x1bc0 kernel/workqueue.c:2098
>      worker_thread+0x223/0x1860 kernel/workqueue.c:2233
>      kthread+0x35e/0x430 kernel/kthread.c:231
>      ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
>     Code:  Bad RIP value.
>     RIP: 0x55 RSP: ffff88006bb17540
>     CR2: 0000000000000055
>     ---[ end trace f0e4920047069cee ]---
>
> Here is a C reproducer (requires CONFIG_BPF_SYSCALL=y and
> CONFIG_AF_KCM=y):
>
>     #include <linux/bpf.h>
>     #include <linux/kcm.h>
>     #include <linux/types.h>
>     #include <stdint.h>
>     #include <sys/ioctl.h>
>     #include <sys/socket.h>
>     #include <sys/syscall.h>
>     #include <unistd.h>
>
>     static const struct bpf_insn bpf_insns[3] = {
>         { .code = 0xb7 }, /* BPF_MOV64_IMM(0, 0) */
>         { .code = 0x95 }, /* BPF_EXIT_INSN() */
>     };
>
>     static const union bpf_attr bpf_attr = {
>         .prog_type = 1,
>         .insn_cnt = 2,
>         .insns = (uintptr_t)&bpf_insns,
>         .license = (uintptr_t)"",
>     };
>
>     int main(void)
>     {
>         int bpf_fd = syscall(__NR_bpf, BPF_PROG_LOAD,
>                              &bpf_attr, sizeof(bpf_attr));
>         int inet_fd = socket(AF_INET, SOCK_STREAM, 0);
>         int kcm_fd = socket(AF_KCM, SOCK_DGRAM, 0);
>
>         ioctl(kcm_fd, SIOCKCMATTACH,
>               &(struct kcm_attach) { .fd = inet_fd, .bpf_fd = bpf_fd });
>     }
>
> Fixes: bbb03029a899 ("strparser: Generalize strparser")
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: Tom Herbert <tom@quantonium.net>
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> ---
>  Documentation/networking/strparser.txt |  2 +-
>  include/net/strparser.h                |  2 +-
>  kernel/bpf/sockmap.c                   | 10 +++++-----
>  net/kcm/kcmsock.c                      | 11 +++++------
>  net/strparser/strparser.c              |  2 +-
>  5 files changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/networking/strparser.txt b/Documentation/networking/strparser.txt
> index fe01302471ae..13081b3decef 100644
> --- a/Documentation/networking/strparser.txt
> +++ b/Documentation/networking/strparser.txt
> @@ -35,7 +35,7 @@ Functions
>  =========
>
>  strp_init(struct strparser *strp, struct sock *sk,
> -         struct strp_callbacks *cb)
> +         const struct strp_callbacks *cb)
>
>       Called to initialize a stream parser. strp is a struct of type
>       strparser that is allocated by the upper layer. sk is the TCP
> diff --git a/include/net/strparser.h b/include/net/strparser.h
> index 4fe966a0ad92..7dc131d62ad5 100644
> --- a/include/net/strparser.h
> +++ b/include/net/strparser.h
> @@ -138,7 +138,7 @@ void strp_done(struct strparser *strp);
>  void strp_stop(struct strparser *strp);
>  void strp_check_rcv(struct strparser *strp);
>  int strp_init(struct strparser *strp, struct sock *sk,
> -             struct strp_callbacks *cb);
> +             const struct strp_callbacks *cb);
>  void strp_data_ready(struct strparser *strp);
>  int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
>                  unsigned int orig_offset, size_t orig_len,
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 78b2bb9370ac..617c239590c2 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -368,12 +368,12 @@ static int smap_read_sock_done(struct strparser *strp, int err)
>  static int smap_init_sock(struct smap_psock *psock,
>                           struct sock *sk)
>  {
> -       struct strp_callbacks cb;
> +       static const struct strp_callbacks cb = {
> +               .rcv_msg = smap_read_sock_strparser,
> +               .parse_msg = smap_parse_func_strparser,
> +               .read_sock_done = smap_read_sock_done,
> +       };
>
> -       memset(&cb, 0, sizeof(cb));
> -       cb.rcv_msg = smap_read_sock_strparser;
> -       cb.parse_msg = smap_parse_func_strparser;
> -       cb.read_sock_done = smap_read_sock_done;
>         return strp_init(&psock->strp, sk, &cb);
>  }
>
> diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> index 88ce73288247..48e993b2dbcf 100644
> --- a/net/kcm/kcmsock.c
> +++ b/net/kcm/kcmsock.c
> @@ -1376,7 +1376,11 @@ static int kcm_attach(struct socket *sock, struct socket *csock,
>         struct kcm_psock *psock = NULL, *tpsock;
>         struct list_head *head;
>         int index = 0;
> -       struct strp_callbacks cb;
> +       static const struct strp_callbacks cb = {
> +               .rcv_msg = kcm_rcv_strparser,
> +               .parse_msg = kcm_parse_func_strparser,
> +               .read_sock_done = kcm_read_sock_done,
> +       };
>         int err;
>
>         csk = csock->sk;
> @@ -1391,11 +1395,6 @@ static int kcm_attach(struct socket *sock, struct socket *csock,
>         psock->sk = csk;
>         psock->bpf_prog = prog;
>
> -       cb.rcv_msg = kcm_rcv_strparser;
> -       cb.abort_parser = NULL;
> -       cb.parse_msg = kcm_parse_func_strparser;
> -       cb.read_sock_done = kcm_read_sock_done;
> -
>         err = strp_init(&psock->strp, csk, &cb);
>         if (err) {
>                 kmem_cache_free(kcm_psockp, psock);
> diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
> index 434aa6637a52..d4ea46a5f233 100644
> --- a/net/strparser/strparser.c
> +++ b/net/strparser/strparser.c
> @@ -472,7 +472,7 @@ static void strp_sock_unlock(struct strparser *strp)
>  }
>
>  int strp_init(struct strparser *strp, struct sock *sk,
> -             struct strp_callbacks *cb)
> +             const struct strp_callbacks *cb)
>  {
>
>         if (!cb || !cb->rcv_msg || !cb->parse_msg)
> --
> 2.14.1.342.g6490525c54-goog
>

 Acked-by: Tom Herbert <tom@quantonium.net>

Thanks!

^ permalink raw reply

* Re: [PATCH net v2 1/4] net: mvpp2: fix the mac address used when using PPv2.2
From: Andrew Lunn @ 2017-08-25 15:42 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, thomas.petazzoni, gregory.clement, nadavh, linux,
	linux-kernel, mw, stefanc, netdev
In-Reply-To: <20170825142928.GB31512@kwain>

> So probably the best way to handle this would have been to send 1/4 to
> net and 2-4/4 to net-next

Correct. 

> (but then there's a dependency between the two series).

Dave merges net into net-next every so often. So you just need to wait
a bit before submitting the net-next parts.

  Andrew

^ permalink raw reply

* Re: [RFC PATCH] net: limit maximum number of packets to mark with xmit_more
From: Waskiewicz Jr, Peter @ 2017-08-25 15:36 UTC (permalink / raw)
  To: Keller, Jacob E, netdev@vger.kernel.org
In-Reply-To: <20170825152449.29790-1-jacob.e.keller@intel.com>

On 8/25/17 11:25 AM, Jacob Keller wrote:
> Under some circumstances, such as with many stacked devices, it is
> possible that dev_hard_start_xmit will bundle many packets together, and
> mark them all with xmit_more.
> 
> Most drivers respond to xmit_more by skipping tail bumps on packet
> rings, or similar behavior as long as xmit_more is set. This is
> a performance win since it means drivers can avoid notifying hardware of
> new packets repeat daily, and thus avoid wasting unnecessary PCIe or other
> bandwidth.
> 
> This use of xmit_more comes with a trade off because bundling too many
> packets can increase latency of the Tx packets. To avoid this, we should
> limit the maximum number of packets with xmit_more.
> 
> Driver authors could modify their drivers to check for some determined
> limit, but this requires all drivers to be modified in order to gain
> advantage.
> 
> Instead, add a sysctl "xmit_more_max" which can be used to configure the
> maximum number of xmit_more skbs to send in a sequence. This ensures
> that all drivers benefit, and allows system administrators the option to
> tune the value to their environment.
> 
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---
> 
> Stray thoughts and further questions....
> 
> Is this the right approach? Did I miss any other places where we should
> limit? Does the limit make sense? Should it instead be a per-device
> tuning nob instead of a global? Is 32 a good default?

I actually like the idea of a per-device knob.  A xmit_more_max that's 
global in a system with 1GbE devices along with a 25/50GbE or more just 
doesn't make much sense to me.  Or having heterogeneous vendor devices 
in the same system that have different HW behaviors could mask issues 
with latency.

This seems like another incarnation of possible buffer-bloat if the max 
is too high...

> 
>   Documentation/sysctl/net.txt |  6 ++++++
>   include/linux/netdevice.h    |  2 ++
>   net/core/dev.c               | 10 +++++++++-
>   net/core/sysctl_net_core.c   |  7 +++++++
>   4 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
> index b67044a2575f..3d995e8f4448 100644
> --- a/Documentation/sysctl/net.txt
> +++ b/Documentation/sysctl/net.txt
> @@ -230,6 +230,12 @@ netdev_max_backlog
>   Maximum number  of  packets,  queued  on  the  INPUT  side, when the interface
>   receives packets faster than kernel can process them.
>   
> +xmit_more_max
> +-------------
> +
> +Maximum number of packets in a row to mark with skb->xmit_more. A value of zero
> +indicates no limit.

What defines "packet?"  MTU-sized packets, or payloads coming down from 
the stack (e.g. TSO's)?

-PJ

^ permalink raw reply

* Re: [PATCH 5/7] RDS: make rhashtable_params const
From: Santosh Shilimkar @ 2017-08-25 15:31 UTC (permalink / raw)
  To: Bhumika Goyal, julia.lawall-L2FTfq7BK8M,
	marcel-kz+m5ild9QBg9hUCZPvPmw, gustavo-THi1TnShQwVAfugRpC6u6w,
	johan.hedberg-Re5JQEeQqe8AvxtiuMwx3w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	kadlec-K40Dz/62t/MgiyqX0sVFJYdd74u8MsAO,
	fw-HFFVJYpyMKqzQB+pC5nmwQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	alex.aring-Re5JQEeQqe8AvxtiuMwx3w, stefan-JPH+aEBZ4P+UEJcrhfAQsw,
	kuznet-v/Mj1YrvjDBInbfyfbPRSQ, yoshfuji-VfPWfsRibaP+Ru+s062T9g,
	trond.myklebust-7I+n7zu2hftEKMMhf/gKZA,
	anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA,
	bfields-uC3wQj2KruNg9hUCZPvPmw, jlayton-vpEMnDpepFuMZCB2o+C8xQ,
	linux-bluetooth-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netfilter-devel-u79uwXL29TY76Z2rM5mHXA,
	coreteam-Cap9r6Oaw4JrovVCs/uTlw,
	bridge-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-wpan-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	rds-devel-N0ozoZBvEnrZJqsBc5GL+g,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1503670907-23221-6-git-send-email-bhumirks-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

8/25/2017 7:21 AM, Bhumika Goyal wrote:
> Make this const as it is either used during a copy operation or passed
> to a const argument of the function rhltable_init
> 
> Signed-off-by: Bhumika Goyal <bhumirks-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
Acked-by: Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply

* Re: XDP redirect measurements, gotchas and tracepoints
From: Michael Chan @ 2017-08-25 15:28 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jesper Dangaard Brouer, Alexander Duyck, Duyck, Alexander H,
	pstaszewski@itcare.pl, netdev@vger.kernel.org,
	xdp-newbies@vger.kernel.org, andy@greyhouse.net,
	borkmann@iogearbox.net
In-Reply-To: <59A03DF5.7070806@gmail.com>

On Fri, Aug 25, 2017 at 8:10 AM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 08/25/2017 05:45 AM, Jesper Dangaard Brouer wrote:
>> On Thu, 24 Aug 2017 20:36:28 -0700
>> Michael Chan <michael.chan@broadcom.com> wrote:
>>
>>> On Wed, Aug 23, 2017 at 1:29 AM, Jesper Dangaard Brouer
>>> <brouer@redhat.com> wrote:
>>>> On Tue, 22 Aug 2017 23:59:05 -0700
>>>> Michael Chan <michael.chan@broadcom.com> wrote:
>>>>
>>>>> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
>>>>> <alexander.duyck@gmail.com> wrote:
>>>>>> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:
>>>>>>>
>>>>>>> Right, but it's conceivable to add an API to "return" the buffer to
>>>>>>> the input device, right?
>>>>
>>>> Yes, I would really like to see an API like this.
>>>>
>>>>>>
>>>>>> You could, it is just added complexity. "just free the buffer" in
>>>>>> ixgbe usually just amounts to one atomic operation to decrement the
>>>>>> total page count since page recycling is already implemented in the
>>>>>> driver. You still would have to unmap the buffer regardless of if you
>>>>>> were recycling it or not so all you would save is 1.000015259 atomic
>>>>>> operations per packet. The fraction is because once every 64K uses we
>>>>>> have to bulk update the count on the page.
>>>>>>
>>>>>
>>>>> If the buffer is returned to the input device, the input device can
>>>>> keep the DMA mapping.  All it needs to do is to dma_sync it back to
>>>>> the input device when the buffer is returned.
>>>>
>>>> Yes, exactly, return to the input device. I really think we should
>>>> work on a solution where we can keep the DMA mapping around.  We have
>>>> an opportunity here to make ndo_xdp_xmit TX queues use a specialized
>>>> page return call, to achieve this. (I imagine other arch's have a high
>>>> DMA overhead than Intel)
>>>>
>>>> I'm not sure how the API should look.  The ixgbe recycle mechanism and
>>>> splitting the page (into two packets) actually complicates things, and
>>>> tie us into a page-refcnt based model.  We could get around this by
>>>> each driver implementing a page-return-callback, that allow us to
>>>> return the page to the input device?  Then, drivers implementing the
>>>> 1-packet-per-page can simply check/read the page-refcnt, and if it is
>>>> "1" DMA-sync and reuse it in the RX queue.
>>>>
>>>
>>> Yeah, based on Alex' description, it's not clear to me whether ixgbe
>>> redirecting to a non-intel NIC or vice versa will actually work.  It
>>> sounds like the output device has to make some assumptions about how
>>> the page was allocated by the input device.
>>
>> Yes, exactly. We are tied into a page refcnt based scheme.
>>
>> Besides the ixgbe page recycle scheme (which keeps the DMA RX-mapping)
>> is also tied to the RX queue size, plus how fast the pages are returned.
>> This makes it very hard to tune.  As I demonstrated, default ixgbe
>> settings does not work well with XDP_REDIRECT.  I needed to increase
>> TX-ring size, but it broke page recycling (dropping perf from 13Mpps to
>> 10Mpps) so I also needed it increase RX-ring size.  But perf is best if
>> RX-ring size is smaller, thus two contradicting tuning needed.
>>
>
> The changes to decouple the ixgbe page recycle scheme (1pg per descriptor
> split into two halves being the default) from the number of descriptors
> doesn't look too bad IMO. It seems like it could be done by having some
> extra pages allocated upfront and pulling those in when we need another
> page.
>
> This would be a nice iterative step we could take on the existing API.
>
>>
>>> With buffer return API,
>>> each driver can cleanly recycle or free its own buffers properly.
>>
>> Yes, exactly. And RX-driver can implement a special memory model for
>> this queue.  E.g. RX-driver can know this is a dedicated XDP RX-queue
>> which is never used for SKBs, thus opening for new RX memory models.
>>
>> Another advantage of a return API.  There is also an opportunity for
>> avoiding the DMA map on TX. As we need to know the from-device.  Thus,
>> we can add a DMA API, where we can query if the two devices uses the
>> same DMA engine, and can reuse the same DMA address the RX-side already
>> knows.
>>
>>
>>> Let me discuss this further with Andy to see if we can come up with a
>>> good scheme.
>>
>> Sound good, looking forward to hear what you come-up with :-)
>>
>
> I guess by this thread we will see a broadcom nic with redirect support
> soon ;)

Yes, Andy actually has finished the coding for XDP_REDIRECT, but the
buffer recycling scheme has some problems.  We can make it work for
Broadcom to Broadcom only, but we want a better solution.

^ permalink raw reply

* [RFC PATCH] net: limit maximum number of packets to mark with xmit_more
From: Jacob Keller @ 2017-08-25 15:24 UTC (permalink / raw)
  To: netdev; +Cc: Jacob Keller

Under some circumstances, such as with many stacked devices, it is
possible that dev_hard_start_xmit will bundle many packets together, and
mark them all with xmit_more.

Most drivers respond to xmit_more by skipping tail bumps on packet
rings, or similar behavior as long as xmit_more is set. This is
a performance win since it means drivers can avoid notifying hardware of
new packets repeat daily, and thus avoid wasting unnecessary PCIe or other
bandwidth.

This use of xmit_more comes with a trade off because bundling too many
packets can increase latency of the Tx packets. To avoid this, we should
limit the maximum number of packets with xmit_more.

Driver authors could modify their drivers to check for some determined
limit, but this requires all drivers to be modified in order to gain
advantage.

Instead, add a sysctl "xmit_more_max" which can be used to configure the
maximum number of xmit_more skbs to send in a sequence. This ensures
that all drivers benefit, and allows system administrators the option to
tune the value to their environment.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---

Stray thoughts and further questions....

Is this the right approach? Did I miss any other places where we should
limit? Does the limit make sense? Should it instead be a per-device
tuning nob instead of a global? Is 32 a good default?

 Documentation/sysctl/net.txt |  6 ++++++
 include/linux/netdevice.h    |  2 ++
 net/core/dev.c               | 10 +++++++++-
 net/core/sysctl_net_core.c   |  7 +++++++
 4 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index b67044a2575f..3d995e8f4448 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -230,6 +230,12 @@ netdev_max_backlog
 Maximum number  of  packets,  queued  on  the  INPUT  side, when the interface
 receives packets faster than kernel can process them.
 
+xmit_more_max
+-------------
+
+Maximum number of packets in a row to mark with skb->xmit_more. A value of zero
+indicates no limit.
+
 netdev_rss_key
 --------------
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c5475b37a631..6341452aed09 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3321,6 +3321,8 @@ void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev);
 extern int		netdev_budget;
 extern unsigned int	netdev_budget_usecs;
 
+extern unsigned int sysctl_xmit_more_max;
+
 /* Called by rtnetlink.c:rtnl_unlock() */
 void netdev_run_todo(void);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 270b54754821..d9946d29c3a5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2983,12 +2983,19 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *first, struct net_device *de
 {
 	struct sk_buff *skb = first;
 	int rc = NETDEV_TX_OK;
+	int xmit_count = 0;
+	bool more = true;
 
 	while (skb) {
 		struct sk_buff *next = skb->next;
 
+		if (sysctl_xmit_more_max)
+			more = xmit_count++ < sysctl_xmit_more_max;
+		if (!more)
+			xmit_count = 0;
+
 		skb->next = NULL;
-		rc = xmit_one(skb, dev, txq, next != NULL);
+		rc = xmit_one(skb, dev, txq, more && next != NULL);
 		if (unlikely(!dev_xmit_complete(rc))) {
 			skb->next = next;
 			goto out;
@@ -3523,6 +3530,7 @@ EXPORT_SYMBOL(netdev_max_backlog);
 int netdev_tstamp_prequeue __read_mostly = 1;
 int netdev_budget __read_mostly = 300;
 unsigned int __read_mostly netdev_budget_usecs = 2000;
+unsigned int __read_mostly sysctl_xmit_more_max = 32;
 int weight_p __read_mostly = 64;           /* old backlog weight */
 int dev_weight_rx_bias __read_mostly = 1;  /* bias for backlog weight */
 int dev_weight_tx_bias __read_mostly = 1;  /* bias for output_queue quota */
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index b7cd9aafe99e..6950e702e101 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -460,6 +460,13 @@ static struct ctl_table net_core_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "xmit_more_max",
+		.data		= &sysctl_xmit_more_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.prox_handler	= proc_dointvec
+	},
 	{ }
 };
 
-- 
2.14.1.436.g33e61a4f0239

^ permalink raw reply related

* Re: [PATCH] net: stmmac: dwmac-sun8i: Use reset exclusive
From: Corentin Labbe @ 2017-08-25 15:17 UTC (permalink / raw)
  To: Maxime Ripard
  Cc: peppe.cavallaro, alexandre.torgue, wens, netdev, linux-arm-kernel,
	linux-kernel
In-Reply-To: <20170825144832.3w6izhri5vstbndp@flea.lan>

On Fri, Aug 25, 2017 at 04:48:32PM +0200, Maxime Ripard wrote:
> On Fri, Aug 25, 2017 at 04:38:05PM +0200, Corentin Labbe wrote:
> > The current dwmac_sun8i module cannot be rmmod/modprobe due to that
> > the reset controller was not released when removed.
> > 
> > This patch remove ambiguity, by using of_reset_control_get_exclusive and
> > add the missing reset_control_put().
> > 
> > Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
> > ---
> >  drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
> > index fffd6d5fc907..675a09629d85 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
> > @@ -782,6 +782,7 @@ static int sun8i_dwmac_unpower_internal_phy(struct sunxi_priv_data *gmac)
> >  
> >  	clk_disable_unprepare(gmac->ephy_clk);
> >  	reset_control_assert(gmac->rst_ephy);
> > +	reset_control_put(gmac->rst_ephy);
> >  	return 0;
> >  }
> >  
> > @@ -942,7 +943,7 @@ static int sun8i_dwmac_probe(struct platform_device *pdev)
> >  			return -EINVAL;
> >  		}
> >  
> > -		gmac->rst_ephy = of_reset_control_get(plat_dat->phy_node, NULL);
> > +		gmac->rst_ephy = of_reset_control_get_exclusive(plat_dat->phy_node, NULL);
> 
> Why not just use devm_reset_control_get?
> 

Because there no devm_ functions with of_

Regards

^ permalink raw reply

* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
From: Waskiewicz Jr, Peter @ 2017-08-25 15:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Hu, Robert, robert.hu@linux.intel.com, netdev@vger.kernel.org
In-Reply-To: <20170825165929.0d840ab2@redhat.com>

On 8/25/17 10:59 AM, Jesper Dangaard Brouer wrote:
> On Fri, 25 Aug 2017 14:24:28 +0000
> "Waskiewicz Jr, Peter" <peter.waskiewicz.jr@intel.com> wrote:
> 
>> On 8/25/17 5:19 AM, Jesper Dangaard Brouer wrote:
>>>>
>>>> Tested with Intel XL710 NIC with Cisco 3172 switch.
>>>>
>>>> It would be even slightly better if the irqbalance service is turned
>>>> off outside.
>>>
>>> Yes, if you don't turn-off (kill) irqbalance it will move around the
>>> IRQs behind your back...
>>
>> Or you can use the --banirq option to irqbalance to ignore your device's
>> interrupts as targets for balancing.
> 
> It might be worth mentioning that --banirq=X is specified for each IRQ
> that you want to exclude, and --banirq is simply specified multiple
> times on the command line.
> 
> Is it possible to tell a running irqbalance that I want to excluded an
> extra IRQ? (just before I do my manual adjustment).

It isn't possible today, since we don't have a way to attach a 
foreground/oneshot irqbalance run to a currently-running daemon.  That's 
an interesting feature enhancement...I can add it to our list as a 
feature request so I don't forget about it.  That way I can also get 
Neil's thoughts on this.

-PJ

^ permalink raw reply

* Re: XDP redirect measurements, gotchas and tracepoints
From: John Fastabend @ 2017-08-25 15:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Michael Chan
  Cc: Alexander Duyck, Duyck, Alexander H, pstaszewski@itcare.pl,
	netdev@vger.kernel.org, xdp-newbies@vger.kernel.org,
	andy@greyhouse.net, borkmann@iogearbox.net
In-Reply-To: <20170825144513.1ee9fbb1@redhat.com>

On 08/25/2017 05:45 AM, Jesper Dangaard Brouer wrote:
> On Thu, 24 Aug 2017 20:36:28 -0700
> Michael Chan <michael.chan@broadcom.com> wrote:
> 
>> On Wed, Aug 23, 2017 at 1:29 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>> On Tue, 22 Aug 2017 23:59:05 -0700
>>> Michael Chan <michael.chan@broadcom.com> wrote:
>>>  
>>>> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
>>>> <alexander.duyck@gmail.com> wrote:  
>>>>> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:  
>>>>>>
>>>>>> Right, but it's conceivable to add an API to "return" the buffer to
>>>>>> the input device, right?  
>>>
>>> Yes, I would really like to see an API like this.
>>>  
>>>>>
>>>>> You could, it is just added complexity. "just free the buffer" in
>>>>> ixgbe usually just amounts to one atomic operation to decrement the
>>>>> total page count since page recycling is already implemented in the
>>>>> driver. You still would have to unmap the buffer regardless of if you
>>>>> were recycling it or not so all you would save is 1.000015259 atomic
>>>>> operations per packet. The fraction is because once every 64K uses we
>>>>> have to bulk update the count on the page.
>>>>>  
>>>>
>>>> If the buffer is returned to the input device, the input device can
>>>> keep the DMA mapping.  All it needs to do is to dma_sync it back to
>>>> the input device when the buffer is returned.  
>>>
>>> Yes, exactly, return to the input device. I really think we should
>>> work on a solution where we can keep the DMA mapping around.  We have
>>> an opportunity here to make ndo_xdp_xmit TX queues use a specialized
>>> page return call, to achieve this. (I imagine other arch's have a high
>>> DMA overhead than Intel)
>>>
>>> I'm not sure how the API should look.  The ixgbe recycle mechanism and
>>> splitting the page (into two packets) actually complicates things, and
>>> tie us into a page-refcnt based model.  We could get around this by
>>> each driver implementing a page-return-callback, that allow us to
>>> return the page to the input device?  Then, drivers implementing the
>>> 1-packet-per-page can simply check/read the page-refcnt, and if it is
>>> "1" DMA-sync and reuse it in the RX queue.
>>>  
>>
>> Yeah, based on Alex' description, it's not clear to me whether ixgbe
>> redirecting to a non-intel NIC or vice versa will actually work.  It
>> sounds like the output device has to make some assumptions about how
>> the page was allocated by the input device. 
> 
> Yes, exactly. We are tied into a page refcnt based scheme.
> 
> Besides the ixgbe page recycle scheme (which keeps the DMA RX-mapping)
> is also tied to the RX queue size, plus how fast the pages are returned.
> This makes it very hard to tune.  As I demonstrated, default ixgbe
> settings does not work well with XDP_REDIRECT.  I needed to increase
> TX-ring size, but it broke page recycling (dropping perf from 13Mpps to
> 10Mpps) so I also needed it increase RX-ring size.  But perf is best if
> RX-ring size is smaller, thus two contradicting tuning needed.
> 

The changes to decouple the ixgbe page recycle scheme (1pg per descriptor
split into two halves being the default) from the number of descriptors
doesn't look too bad IMO. It seems like it could be done by having some
extra pages allocated upfront and pulling those in when we need another
page.

This would be a nice iterative step we could take on the existing API.

> 
>> With buffer return API,
>> each driver can cleanly recycle or free its own buffers properly.
> 
> Yes, exactly. And RX-driver can implement a special memory model for
> this queue.  E.g. RX-driver can know this is a dedicated XDP RX-queue
> which is never used for SKBs, thus opening for new RX memory models.
> 
> Another advantage of a return API.  There is also an opportunity for
> avoiding the DMA map on TX. As we need to know the from-device.  Thus,
> we can add a DMA API, where we can query if the two devices uses the
> same DMA engine, and can reuse the same DMA address the RX-side already
> knows.
> 
> 
>> Let me discuss this further with Andy to see if we can come up with a
>> good scheme.
> 
> Sound good, looking forward to hear what you come-up with :-)
> 

I guess by this thread we will see a broadcom nic with redirect support
soon ;)

^ permalink raw reply

* Re: [PATCH net-next v6 1/3] net: add NSH header structures and helpers
From: Jiri Benc @ 2017-08-25 15:07 UTC (permalink / raw)
  To: Yi Yang; +Cc: netdev, dev, e, blp, jan.scheurich
In-Reply-To: <1503670805-31051-2-git-send-email-yi.y.yang@intel.com>

On Fri, 25 Aug 2017 22:20:03 +0800, Yi Yang wrote:
> --- a/include/uapi/linux/if_ether.h
> +++ b/include/uapi/linux/if_ether.h
> @@ -138,6 +138,7 @@
>  #define ETH_P_IEEE802154 0x00F6		/* IEEE802.15.4 frame		*/
>  #define ETH_P_CAIF	0x00F7		/* ST-Ericsson CAIF protocol	*/
>  #define ETH_P_XDSA	0x00F8		/* Multiplexed DSA protocol	*/
> +#define ETH_P_NSH	0x894F		/* Ethertype for NSH. */

This is in a wrong section. It belongs to Ethernet protocols a few lines
higher in the file (after ETH_P_HSR).

 Jiri

^ permalink raw reply

* [PATCH] e1000e: apply burst mode settings only on default
From: Willem de Bruijn @ 2017-08-25 15:06 UTC (permalink / raw)
  To: jeffrey.t.kirsher
  Cc: intel-wired-lan, netdev, jesse.brandeburg, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Devices that support FLAG2_DMA_BURST have different default values
for RDTR and RADV. Apply burst mode default settings only when no
explicit value was passed at module load.

The RDTR default is zero. If the module is loaded for low latency
operation with RxIntDelay=0, do not override this value with a burst
default of 32.

Move the decision to apply burst values earlier, where explicitly
initialized module variables can be distinguished from defaults.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 drivers/net/ethernet/intel/e1000e/e1000.h  |  4 ----
 drivers/net/ethernet/intel/e1000e/netdev.c |  8 --------
 drivers/net/ethernet/intel/e1000e/param.c  | 16 +++++++++++++++-
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h b/drivers/net/ethernet/intel/e1000e/e1000.h
index 98e68888abb1..2311b31bdcac 100644
--- a/drivers/net/ethernet/intel/e1000e/e1000.h
+++ b/drivers/net/ethernet/intel/e1000e/e1000.h
@@ -94,10 +94,6 @@ struct e1000_info;
  */
 #define E1000_CHECK_RESET_COUNT		25
 
-#define DEFAULT_RDTR			0
-#define DEFAULT_RADV			8
-#define BURST_RDTR			0x20
-#define BURST_RADV			0x20
 #define PCICFG_DESC_RING_STATUS		0xe4
 #define FLUSH_DESC_REQUIRED		0x100
 
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 327dfe5bedc0..47b89aac7969 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -3223,14 +3223,6 @@ static void e1000_configure_rx(struct e1000_adapter *adapter)
 		 */
 		ew32(RXDCTL(0), E1000_RXDCTL_DMA_BURST_ENABLE);
 		ew32(RXDCTL(1), E1000_RXDCTL_DMA_BURST_ENABLE);
-
-		/* override the delay timers for enabling bursting, only if
-		 * the value was not set by the user via module options
-		 */
-		if (adapter->rx_int_delay == DEFAULT_RDTR)
-			adapter->rx_int_delay = BURST_RDTR;
-		if (adapter->rx_abs_int_delay == DEFAULT_RADV)
-			adapter->rx_abs_int_delay = BURST_RADV;
 	}
 
 	/* set the Receive Delay Timer Register */
diff --git a/drivers/net/ethernet/intel/e1000e/param.c b/drivers/net/ethernet/intel/e1000e/param.c
index 6d8c39abee16..bb696c98f9b0 100644
--- a/drivers/net/ethernet/intel/e1000e/param.c
+++ b/drivers/net/ethernet/intel/e1000e/param.c
@@ -73,17 +73,25 @@ E1000_PARAM(TxAbsIntDelay, "Transmit Absolute Interrupt Delay");
 /* Receive Interrupt Delay in units of 1.024 microseconds
  * hardware will likely hang if you set this to anything but zero.
  *
+ * Burst variant is used as default if device has FLAG2_DMA_BURST.
+ *
  * Valid Range: 0-65535
  */
 E1000_PARAM(RxIntDelay, "Receive Interrupt Delay");
+#define DEFAULT_RDTR			0
+#define BURST_RDTR			0x20
 #define MAX_RXDELAY 0xFFFF
 #define MIN_RXDELAY 0
 
 /* Receive Absolute Interrupt Delay in units of 1.024 microseconds
+ *
+ * Burst variant is used as default if device has FLAG2_DMA_BURST.
  *
  * Valid Range: 0-65535
  */
 E1000_PARAM(RxAbsIntDelay, "Receive Absolute Interrupt Delay");
+#define DEFAULT_RADV			8
+#define BURST_RADV			0x20
 #define MAX_RXABSDELAY 0xFFFF
 #define MIN_RXABSDELAY 0
 
@@ -297,6 +305,9 @@ void e1000e_check_options(struct e1000_adapter *adapter)
 					 .max = MAX_RXDELAY } }
 		};
 
+		if (adapter->flags2 & FLAG2_DMA_BURST)
+			opt.def = BURST_RDTR;
+
 		if (num_RxIntDelay > bd) {
 			adapter->rx_int_delay = RxIntDelay[bd];
 			e1000_validate_option(&adapter->rx_int_delay, &opt,
@@ -307,7 +318,7 @@ void e1000e_check_options(struct e1000_adapter *adapter)
 	}
 	/* Receive Absolute Interrupt Delay */
 	{
-		static const struct e1000_option opt = {
+		static struct e1000_option opt = {
 			.type = range_option,
 			.name = "Receive Absolute Interrupt Delay",
 			.err  = "using default of "
@@ -317,6 +328,9 @@ void e1000e_check_options(struct e1000_adapter *adapter)
 					 .max = MAX_RXABSDELAY } }
 		};
 
+		if (adapter->flags2 & FLAG2_DMA_BURST)
+			opt.def = BURST_RADV;
+
 		if (num_RxAbsIntDelay > bd) {
 			adapter->rx_abs_int_delay = RxAbsIntDelay[bd];
 			e1000_validate_option(&adapter->rx_abs_int_delay, &opt,
-- 
2.14.1.342.g6490525c54-goog

^ permalink raw reply related

* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
From: Jesper Dangaard Brouer @ 2017-08-25 14:59 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter
  Cc: Hu, Robert, robert.hu@linux.intel.com, netdev@vger.kernel.org,
	brouer
In-Reply-To: <E0D909EE5BB15A4699798539EA149D7F077932D8@ORSMSX103.amr.corp.intel.com>

On Fri, 25 Aug 2017 14:24:28 +0000
"Waskiewicz Jr, Peter" <peter.waskiewicz.jr@intel.com> wrote:

> On 8/25/17 5:19 AM, Jesper Dangaard Brouer wrote:
> >>
> >> Tested with Intel XL710 NIC with Cisco 3172 switch.
> >>
> >> It would be even slightly better if the irqbalance service is turned
> >> off outside.  
> > 
> > Yes, if you don't turn-off (kill) irqbalance it will move around the
> > IRQs behind your back...  
> 
> Or you can use the --banirq option to irqbalance to ignore your device's 
> interrupts as targets for balancing.

It might be worth mentioning that --banirq=X is specified for each IRQ
that you want to exclude, and --banirq is simply specified multiple
times on the command line.

Is it possible to tell a running irqbalance that I want to excluded an
extra IRQ? (just before I do my manual adjustment).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [RFC] about net: Fix inconsistent teardown and release of private netdev state.
From: Eric Dumazet @ 2017-08-25 14:57 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20170821.141315.2143125372476050164.davem@davemloft.net>

On Mon, 2017-08-21 at 14:13 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 18 Aug 2017 20:40:01 -0700
> 
> > Let look at tun->pcpu_stats, for example.
> > 
> > It is allocated at line 1831, before the register_netdevice()
> > 
> > drivers/net/tun.c does not provide ndo_init()
> 
> I see the problem now.
> 
> And it's done this way because several steps need to occur, with
> various kinds of dependencies, before the register_netdevice() call is
> made.
> 
> I'll see if I can untangle this somehow.

I guess we could enforce having an ndo_init() for all drivers providing
ndo_uninit()


==================================================================
BUG: KASAN: use-after-free in hlist_move_list include/linux/list.h:729 [inline]
BUG: KASAN: use-after-free in __collect_expired_timers kernel/time/timer.c:1326 [inline]
BUG: KASAN: use-after-free in collect_expired_timers kernel/time/timer.c:1550 [inline]
BUG: KASAN: use-after-free in __run_timers+0xa2e/0xb90 kernel/time/timer.c:1597
Write of size 8 at addr ffff8801cc8bbb00 by task syz-executor3/5627

CPU: 1 PID: 5627 Comm: syz-executor3 Not tainted 4.13.0-rc6+ #45
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:52
 print_address_description+0x73/0x250 mm/kasan/report.c:252
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x24e/0x340 mm/kasan/report.c:409
 __asan_report_store8_noabort+0x17/0x20 mm/kasan/report.c:435
 hlist_move_list include/linux/list.h:729 [inline]
 __collect_expired_timers kernel/time/timer.c:1326 [inline]
 collect_expired_timers kernel/time/timer.c:1550 [inline]
 __run_timers+0xa2e/0xb90 kernel/time/timer.c:1597
 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
 __do_softirq+0x2f5/0xba3 kernel/softirq.c:284
 invoke_softirq kernel/softirq.c:364 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:638 [inline]
 smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:1044
 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:702
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:814 [inline]
RIP: 0010:rmqueue_pcplist mm/page_alloc.c:2789 [inline]
RIP: 0010:rmqueue mm/page_alloc.c:2806 [inline]
RIP: 0010:get_page_from_freelist+0x8ec/0x3430 mm/page_alloc.c:3144
RSP: 0018:ffff8801c6b8e0c0 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff10
RAX: 1ffffffff0b59431 RBX: 000000000000000f RCX: 0000000000000006
RDX: 0000000000000000 RSI: 1ffff10038ca2944 RDI: 0000000000000286
RBP: ffff8801c6b8e7a8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: ffffea0006a8c880 R12: 0000000000000000
R13: dffffc0000000000 R14: 0000000000000000 R15: ffff8801c6b8e780
 </IRQ>
 __alloc_pages_nodemask+0x516/0xd40 mm/page_alloc.c:4096
 alloc_pages_vma+0xc8/0x4b0 mm/mempolicy.c:1938
 shmem_alloc_page+0x171/0x2c0 mm/shmem.c:1448
 shmem_alloc_and_acct_page+0x14d/0x5f0 mm/shmem.c:1478
 shmem_getpage_gfp+0x7bb/0x3870 mm/shmem.c:1757
 shmem_fault+0x2b9/0x960 mm/shmem.c:1995
 __do_fault+0xeb/0x30e mm/memory.c:3011
 do_read_fault mm/memory.c:3421 [inline]
 do_fault mm/memory.c:3521 [inline]
 handle_pte_fault mm/memory.c:3751 [inline]
 __handle_mm_fault+0x1a2d/0x3860 mm/memory.c:3869
 handle_mm_fault+0x3bb/0x860 mm/memory.c:3906
 faultin_page mm/gup.c:477 [inline]
 __get_user_pages+0x4f8/0x15d0 mm/gup.c:674
 populate_vma_page_range+0x20e/0x2f0 mm/gup.c:1127
 __mm_populate+0x23a/0x450 mm/gup.c:1177
 mm_populate include/linux/mm.h:2116 [inline]
 vm_mmap_pgoff+0x241/0x280 mm/util.c:338
 SYSC_mmap_pgoff mm/mmap.c:1517 [inline]
 SyS_mmap_pgoff+0x23b/0x5f0 mm/mmap.c:1475
 SYSC_mmap arch/x86/kernel/sys_x86_64.c:98 [inline]
 SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:89
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x4512e9
RSP: 002b:00007f5effbabc08 EFLAGS: 00000216 ORIG_RAX: 0000000000000009
RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 00000000004512e9
RDX: 0000000001000001 RSI: 0000000000b36000 RDI: 0000000020000000
RBP: 0000000000000086 R08: ffffffffffffffff R09: 0000000000000000
R10: 0000000000008031 R11: 0000000000000216 R12: 00000000004b8a52
R13: 00000000ffffffff R14: ffffffffffffffff R15: 0000000000005417

Allocated by task 5206:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
 __do_kmalloc_node mm/slab.c:3689 [inline]
 __kmalloc_node+0x47/0x70 mm/slab.c:3696
 kmalloc_node include/linux/slab.h:535 [inline]
 kvmalloc_node+0x64/0xd0 mm/util.c:397
 kvmalloc include/linux/mm.h:524 [inline]
 kvzalloc include/linux/mm.h:532 [inline]
 alloc_netdev_mqs+0x16e/0xed0 net/core/dev.c:7968
 tun_set_iff drivers/net/tun.c:1808 [inline]
 __tun_chr_ioctl+0x12b2/0x3d60 drivers/net/tun.c:2067
 tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2312
 vfs_ioctl fs/ioctl.c:45 [inline]
 do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
 SYSC_ioctl fs/ioctl.c:700 [inline]
 SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
 entry_SYSCALL_64_fastpath+0x1f/0xbe

Freed by task 5206:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
 __cache_free mm/slab.c:3503 [inline]
 kfree+0xca/0x250 mm/slab.c:3820
 kvfree+0x36/0x60 mm/util.c:416
 netdev_freemem net/core/dev.c:7920 [inline]
 free_netdev+0x2cf/0x360 net/core/dev.c:8082
 tun_set_iff drivers/net/tun.c:1891 [inline]
 __tun_chr_ioctl+0x2d24/0x3d60 drivers/net/tun.c:2067
 tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2312
 vfs_ioctl fs/ioctl.c:45 [inline]
 do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
 SYSC_ioctl fs/ioctl.c:700 [inline]
 SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
 entry_SYSCALL_64_fastpath+0x1f/0xbe

The buggy address belongs to the object at ffff8801cc8b87c0
 which belongs to the cache kmalloc-16384 of size 16384
The buggy address is located 13120 bytes inside of
 16384-byte region [ffff8801cc8b87c0, ffff8801cc8bc7c0)
The buggy address belongs to the page:
page:ffffea0007322e00 count:1 mapcount:0 mapping:ffff8801cc8b87c0 index:0x0 compound_mapcount: 0
flags: 0x200000000008100(slab|head)
raw: 0200000000008100 ffff8801cc8b87c0 0000000000000000 0000000100000001
raw: ffffea0007320c20 ffffea0007329220 ffff8801dac02200 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8801cc8bba00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8801cc8bba80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8801cc8bbb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                   ^
 ffff8801cc8bbb80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8801cc8bbc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

^ permalink raw reply

* [PATCH net 2/5] l2tp: hold tunnel while processing genl delete command
From: Guillaume Nault @ 2017-08-25 14:51 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman
In-Reply-To: <cover.1503671776.git.g.nault@alphalink.fr>

l2tp_nl_cmd_tunnel_delete() needs to take a reference on the tunnel, to
prevent it from being concurrently freed by l2tp_tunnel_destruct().

Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_netlink.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c
index 27ee94b5c189..808966550620 100644
--- a/net/l2tp/l2tp_netlink.c
+++ b/net/l2tp/l2tp_netlink.c
@@ -273,8 +273,8 @@ static int l2tp_nl_cmd_tunnel_delete(struct sk_buff *skb, struct genl_info *info
 	}
 	tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
-	if (tunnel == NULL) {
+	tunnel = l2tp_tunnel_get(net, tunnel_id);
+	if (!tunnel) {
 		ret = -ENODEV;
 		goto out;
 	}
@@ -284,6 +284,8 @@ static int l2tp_nl_cmd_tunnel_delete(struct sk_buff *skb, struct genl_info *info
 
 	(void) l2tp_tunnel_delete(tunnel);
 
+	l2tp_tunnel_dec_refcount(tunnel);
+
 out:
 	return ret;
 }
-- 
2.14.1

^ permalink raw reply related

* [PATCH net 5/5] l2tp: hold tunnel used while creating sessions with netlink
From: Guillaume Nault @ 2017-08-25 14:51 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman
In-Reply-To: <cover.1503671776.git.g.nault@alphalink.fr>

Use l2tp_tunnel_get() to retrieve tunnel, so that it can't go away on
us. Otherwise l2tp_tunnel_destruct() might release the last reference
count concurrently, thus freeing the tunnel while we're using it.

Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_netlink.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c
index ae5170e26281..57427d430f10 100644
--- a/net/l2tp/l2tp_netlink.c
+++ b/net/l2tp/l2tp_netlink.c
@@ -518,8 +518,9 @@ static int l2tp_nl_cmd_session_create(struct sk_buff *skb, struct genl_info *inf
 		ret = -EINVAL;
 		goto out;
 	}
+
 	tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
+	tunnel = l2tp_tunnel_get(net, tunnel_id);
 	if (!tunnel) {
 		ret = -ENODEV;
 		goto out;
@@ -527,24 +528,24 @@ static int l2tp_nl_cmd_session_create(struct sk_buff *skb, struct genl_info *inf
 
 	if (!info->attrs[L2TP_ATTR_SESSION_ID]) {
 		ret = -EINVAL;
-		goto out;
+		goto out_tunnel;
 	}
 	session_id = nla_get_u32(info->attrs[L2TP_ATTR_SESSION_ID]);
 
 	if (!info->attrs[L2TP_ATTR_PEER_SESSION_ID]) {
 		ret = -EINVAL;
-		goto out;
+		goto out_tunnel;
 	}
 	peer_session_id = nla_get_u32(info->attrs[L2TP_ATTR_PEER_SESSION_ID]);
 
 	if (!info->attrs[L2TP_ATTR_PW_TYPE]) {
 		ret = -EINVAL;
-		goto out;
+		goto out_tunnel;
 	}
 	cfg.pw_type = nla_get_u16(info->attrs[L2TP_ATTR_PW_TYPE]);
 	if (cfg.pw_type >= __L2TP_PWTYPE_MAX) {
 		ret = -EINVAL;
-		goto out;
+		goto out_tunnel;
 	}
 
 	if (tunnel->version > 2) {
@@ -566,7 +567,7 @@ static int l2tp_nl_cmd_session_create(struct sk_buff *skb, struct genl_info *inf
 			u16 len = nla_len(info->attrs[L2TP_ATTR_COOKIE]);
 			if (len > 8) {
 				ret = -EINVAL;
-				goto out;
+				goto out_tunnel;
 			}
 			cfg.cookie_len = len;
 			memcpy(&cfg.cookie[0], nla_data(info->attrs[L2TP_ATTR_COOKIE]), len);
@@ -575,7 +576,7 @@ static int l2tp_nl_cmd_session_create(struct sk_buff *skb, struct genl_info *inf
 			u16 len = nla_len(info->attrs[L2TP_ATTR_PEER_COOKIE]);
 			if (len > 8) {
 				ret = -EINVAL;
-				goto out;
+				goto out_tunnel;
 			}
 			cfg.peer_cookie_len = len;
 			memcpy(&cfg.peer_cookie[0], nla_data(info->attrs[L2TP_ATTR_PEER_COOKIE]), len);
@@ -618,7 +619,7 @@ static int l2tp_nl_cmd_session_create(struct sk_buff *skb, struct genl_info *inf
 	if ((l2tp_nl_cmd_ops[cfg.pw_type] == NULL) ||
 	    (l2tp_nl_cmd_ops[cfg.pw_type]->session_create == NULL)) {
 		ret = -EPROTONOSUPPORT;
-		goto out;
+		goto out_tunnel;
 	}
 
 	/* Check that pseudowire-specific params are present */
@@ -628,7 +629,7 @@ static int l2tp_nl_cmd_session_create(struct sk_buff *skb, struct genl_info *inf
 	case L2TP_PWTYPE_ETH_VLAN:
 		if (!info->attrs[L2TP_ATTR_VLAN_ID]) {
 			ret = -EINVAL;
-			goto out;
+			goto out_tunnel;
 		}
 		break;
 	case L2TP_PWTYPE_ETH:
@@ -656,6 +657,8 @@ static int l2tp_nl_cmd_session_create(struct sk_buff *skb, struct genl_info *inf
 		}
 	}
 
+out_tunnel:
+	l2tp_tunnel_dec_refcount(tunnel);
 out:
 	return ret;
 }
-- 
2.14.1

^ permalink raw reply related

* [PATCH net 4/5] l2tp: hold tunnel while handling genl TUNNEL_GET commands
From: Guillaume Nault @ 2017-08-25 14:51 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman
In-Reply-To: <cover.1503671776.git.g.nault@alphalink.fr>

Use l2tp_tunnel_get() instead of l2tp_tunnel_find() so that we get
a reference on the tunnel, preventing l2tp_tunnel_destruct() from
freeing it from under us.

Also move l2tp_tunnel_get() below nlmsg_new() so that we only take
the reference when needed.

Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_netlink.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c
index d61e75b4e619..ae5170e26281 100644
--- a/net/l2tp/l2tp_netlink.c
+++ b/net/l2tp/l2tp_netlink.c
@@ -444,34 +444,37 @@ static int l2tp_nl_cmd_tunnel_get(struct sk_buff *skb, struct genl_info *info)
 
 	if (!info->attrs[L2TP_ATTR_CONN_ID]) {
 		ret = -EINVAL;
-		goto out;
+		goto err;
 	}
 
 	tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
-	if (tunnel == NULL) {
-		ret = -ENODEV;
-		goto out;
-	}
-
 	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
 	if (!msg) {
 		ret = -ENOMEM;
-		goto out;
+		goto err;
+	}
+
+	tunnel = l2tp_tunnel_get(net, tunnel_id);
+	if (!tunnel) {
+		ret = -ENODEV;
+		goto err_nlmsg;
 	}
 
 	ret = l2tp_nl_tunnel_send(msg, info->snd_portid, info->snd_seq,
 				  NLM_F_ACK, tunnel, L2TP_CMD_TUNNEL_GET);
 	if (ret < 0)
-		goto err_out;
+		goto err_nlmsg_tunnel;
+
+	l2tp_tunnel_dec_refcount(tunnel);
 
 	return genlmsg_unicast(net, msg, info->snd_portid);
 
-err_out:
+err_nlmsg_tunnel:
+	l2tp_tunnel_dec_refcount(tunnel);
+err_nlmsg:
 	nlmsg_free(msg);
-
-out:
+err:
 	return ret;
 }
 
-- 
2.14.1

^ permalink raw reply related

* [PATCH net 3/5] l2tp: hold tunnel while handling genl tunnel updates
From: Guillaume Nault @ 2017-08-25 14:51 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman
In-Reply-To: <cover.1503671776.git.g.nault@alphalink.fr>

We need to make sure the tunnel is not going to be destroyed by
l2tp_tunnel_destruct() concurrently.

Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_netlink.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c
index 808966550620..d61e75b4e619 100644
--- a/net/l2tp/l2tp_netlink.c
+++ b/net/l2tp/l2tp_netlink.c
@@ -303,8 +303,8 @@ static int l2tp_nl_cmd_tunnel_modify(struct sk_buff *skb, struct genl_info *info
 	}
 	tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
-	if (tunnel == NULL) {
+	tunnel = l2tp_tunnel_get(net, tunnel_id);
+	if (!tunnel) {
 		ret = -ENODEV;
 		goto out;
 	}
@@ -315,6 +315,8 @@ static int l2tp_nl_cmd_tunnel_modify(struct sk_buff *skb, struct genl_info *info
 	ret = l2tp_tunnel_notify(&l2tp_nl_family, info,
 				 tunnel, L2TP_CMD_TUNNEL_MODIFY);
 
+	l2tp_tunnel_dec_refcount(tunnel);
+
 out:
 	return ret;
 }
-- 
2.14.1

^ permalink raw reply related

* [PATCH net 1/5] l2tp: hold tunnel while looking up sessions in l2tp_netlink
From: Guillaume Nault @ 2017-08-25 14:51 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman
In-Reply-To: <cover.1503671776.git.g.nault@alphalink.fr>

l2tp_tunnel_find() doesn't take a reference on the returned tunnel.
Therefore, it's unsafe to use it because the returned tunnel can go
away on us anytime.

Fix this by defining l2tp_tunnel_get(), which works like
l2tp_tunnel_find(), but takes a reference on the returned tunnel.
Caller then has to drop this reference using l2tp_tunnel_dec_refcount().

As l2tp_tunnel_dec_refcount() needs to be moved to l2tp_core.h, let's
simplify the patch and not move the L2TP_REFCNT_DEBUG part. This code
has been broken (not even compiling) in May 2012 by
commit a4ca44fa578c ("net: l2tp: Standardize logging styles")
and fixed more than two years later by
commit 29abe2fda54f ("l2tp: fix missing line continuation"). So it
doesn't appear to be used by anyone.

Same thing for l2tp_tunnel_free(); instead of moving it to l2tp_core.h,
let's just simplify things and call kfree_rcu() directly in
l2tp_tunnel_dec_refcount(). Extra assertions and debugging code
provided by l2tp_tunnel_free() didn't help catching any of the
reference counting and socket handling issues found while working on
this series.

Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_core.c    | 66 ++++++++++++++++---------------------------------
 net/l2tp/l2tp_core.h    | 13 ++++++++++
 net/l2tp/l2tp_netlink.c |  6 +++--
 3 files changed, 38 insertions(+), 47 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index b0c2d4ae781d..1adebfb9bde2 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -113,7 +113,6 @@ struct l2tp_net {
 	spinlock_t l2tp_session_hlist_lock;
 };
 
-static void l2tp_tunnel_free(struct l2tp_tunnel *tunnel);
 
 static inline struct l2tp_tunnel *l2tp_tunnel(struct sock *sk)
 {
@@ -127,39 +126,6 @@ static inline struct l2tp_net *l2tp_pernet(const struct net *net)
 	return net_generic(net, l2tp_net_id);
 }
 
-/* Tunnel reference counts. Incremented per session that is added to
- * the tunnel.
- */
-static inline void l2tp_tunnel_inc_refcount_1(struct l2tp_tunnel *tunnel)
-{
-	refcount_inc(&tunnel->ref_count);
-}
-
-static inline void l2tp_tunnel_dec_refcount_1(struct l2tp_tunnel *tunnel)
-{
-	if (refcount_dec_and_test(&tunnel->ref_count))
-		l2tp_tunnel_free(tunnel);
-}
-#ifdef L2TP_REFCNT_DEBUG
-#define l2tp_tunnel_inc_refcount(_t)					\
-do {									\
-	pr_debug("l2tp_tunnel_inc_refcount: %s:%d %s: cnt=%d\n",	\
-		 __func__, __LINE__, (_t)->name,			\
-		 refcount_read(&_t->ref_count));			\
-	l2tp_tunnel_inc_refcount_1(_t);					\
-} while (0)
-#define l2tp_tunnel_dec_refcount(_t)					\
-do {									\
-	pr_debug("l2tp_tunnel_dec_refcount: %s:%d %s: cnt=%d\n",	\
-		 __func__, __LINE__, (_t)->name,			\
-		 refcount_read(&_t->ref_count));			\
-	l2tp_tunnel_dec_refcount_1(_t);					\
-} while (0)
-#else
-#define l2tp_tunnel_inc_refcount(t) l2tp_tunnel_inc_refcount_1(t)
-#define l2tp_tunnel_dec_refcount(t) l2tp_tunnel_dec_refcount_1(t)
-#endif
-
 /* Session hash global list for L2TPv3.
  * The session_id SHOULD be random according to RFC3931, but several
  * L2TP implementations use incrementing session_ids.  So we do a real
@@ -229,6 +195,27 @@ l2tp_session_id_hash(struct l2tp_tunnel *tunnel, u32 session_id)
 	return &tunnel->session_hlist[hash_32(session_id, L2TP_HASH_BITS)];
 }
 
+/* Lookup a tunnel. A new reference is held on the returned tunnel. */
+struct l2tp_tunnel *l2tp_tunnel_get(const struct net *net, u32 tunnel_id)
+{
+	const struct l2tp_net *pn = l2tp_pernet(net);
+	struct l2tp_tunnel *tunnel;
+
+	rcu_read_lock_bh();
+	list_for_each_entry_rcu(tunnel, &pn->l2tp_tunnel_list, list) {
+		if (tunnel->tunnel_id == tunnel_id) {
+			l2tp_tunnel_inc_refcount(tunnel);
+			rcu_read_unlock_bh();
+
+			return tunnel;
+		}
+	}
+	rcu_read_unlock_bh();
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(l2tp_tunnel_get);
+
 /* Lookup a session. A new reference is held on the returned session.
  * Optionally calls session->ref() too if do_ref is true.
  */
@@ -1348,17 +1335,6 @@ static void l2tp_udp_encap_destroy(struct sock *sk)
 	}
 }
 
-/* Really kill the tunnel.
- * Come here only when all sessions have been cleared from the tunnel.
- */
-static void l2tp_tunnel_free(struct l2tp_tunnel *tunnel)
-{
-	BUG_ON(refcount_read(&tunnel->ref_count) != 0);
-	BUG_ON(tunnel->sock != NULL);
-	l2tp_info(tunnel, L2TP_MSG_CONTROL, "%s: free...\n", tunnel->name);
-	kfree_rcu(tunnel, rcu);
-}
-
 /* Workqueue tunnel deletion function */
 static void l2tp_tunnel_del_work(struct work_struct *work)
 {
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index cdb6e3327f74..9101297f27ad 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -231,6 +231,8 @@ static inline struct l2tp_tunnel *l2tp_sock_to_tunnel(struct sock *sk)
 	return tunnel;
 }
 
+struct l2tp_tunnel *l2tp_tunnel_get(const struct net *net, u32 tunnel_id);
+
 struct l2tp_session *l2tp_session_get(const struct net *net,
 				      struct l2tp_tunnel *tunnel,
 				      u32 session_id, bool do_ref);
@@ -269,6 +271,17 @@ int l2tp_nl_register_ops(enum l2tp_pwtype pw_type,
 void l2tp_nl_unregister_ops(enum l2tp_pwtype pw_type);
 int l2tp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 
+static inline void l2tp_tunnel_inc_refcount(struct l2tp_tunnel *tunnel)
+{
+	refcount_inc(&tunnel->ref_count);
+}
+
+static inline void l2tp_tunnel_dec_refcount(struct l2tp_tunnel *tunnel)
+{
+	if (refcount_dec_and_test(&tunnel->ref_count))
+		kfree_rcu(tunnel, rcu);
+}
+
 /* Session reference counts. Incremented when code obtains a reference
  * to a session.
  */
diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c
index 12cfcd0ca807..27ee94b5c189 100644
--- a/net/l2tp/l2tp_netlink.c
+++ b/net/l2tp/l2tp_netlink.c
@@ -65,10 +65,12 @@ static struct l2tp_session *l2tp_nl_session_get(struct genl_info *info,
 		   (info->attrs[L2TP_ATTR_CONN_ID])) {
 		tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
 		session_id = nla_get_u32(info->attrs[L2TP_ATTR_SESSION_ID]);
-		tunnel = l2tp_tunnel_find(net, tunnel_id);
-		if (tunnel)
+		tunnel = l2tp_tunnel_get(net, tunnel_id);
+		if (tunnel) {
 			session = l2tp_session_get(net, tunnel, session_id,
 						   do_ref);
+			l2tp_tunnel_dec_refcount(tunnel);
+		}
 	}
 
 	return session;
-- 
2.14.1

^ permalink raw reply related

* [PATCH net 0/5] l2tp: fix some l2tp_tunnel_find() issues in l2tp_netlink
From: Guillaume Nault @ 2017-08-25 14:51 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman

Since l2tp_tunnel_find() doesn't take a reference on the tunnel it
returns, its users are almost guaranteed to be racy.

This series defines l2tp_tunnel_get() which can be used as a safe
replacement, and converts some of l2tp_tunnel_find() users in the
l2tp_netlink module.

Other users often combine this issue with other more or less subtle
races. They will be fixed incrementally in followup series.

Guillaume Nault (5):
  l2tp: hold tunnel while looking up sessions in l2tp_netlink
  l2tp: hold tunnel while processing genl delete command
  l2tp: hold tunnel while handling genl tunnel updates
  l2tp: hold tunnel while handling genl TUNNEL_GET commands
  l2tp: hold tunnel used while creating sessions with netlink

 net/l2tp/l2tp_core.c    | 66 ++++++++++++++++---------------------------------
 net/l2tp/l2tp_core.h    | 13 ++++++++++
 net/l2tp/l2tp_netlink.c | 66 +++++++++++++++++++++++++++++--------------------
 3 files changed, 73 insertions(+), 72 deletions(-)

-- 
2.14.1

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox