* [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Jakub Kicinski @ 2019-07-30 21:12 UTC (permalink / raw)
To: davem
Cc: netdev, oss-drivers, edumazet, borisp, aviadye, davejwatson,
john.fastabend, daniel, willemb, xiyou.wangcong, fw,
alexei.starovoitov, Jakub Kicinski
sk_validate_xmit_skb() and drivers depend on the sk member of
struct sk_buff to identify segments requiring encryption.
Any operation which removes or does not preserve the original TLS
socket such as skb_orphan() or skb_clone() will cause clear text
leaks.
Make the TCP socket underlying an offloaded TLS connection
mark all skbs as decrypted, if TLS TX is in offload mode.
Then in sk_validate_xmit_skb() catch skbs which have no socket
(or a socket with no validation) and decrypted flag set.
Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
sk->sk_validate_xmit_skb are slightly interchangeable right now,
they all imply TLS offload. The new checks are guarded by
CONFIG_TLS_DEVICE because that's the option guarding the
sk_buff->decrypted member.
Second, smaller issue with orphaning is that it breaks
the guarantee that packets will be delivered to device
queues in-order. All TLS offload drivers depend on that
scheduling property. This means skb_orphan_partial()'s
trick of preserving partial socket references will cause
issues in the drivers. We need a full orphan, and as a
result netem delay/throttling will cause all TLS offload
skbs to be dropped.
Reusing the sk_buff->decrypted flag also protects from
leaking clear text when incoming, decrypted skb is redirected
(e.g. by TC).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
I'm sending this for net-next because of lack of confidence
in my own abilities. It should apply cleanly to net... :)
Documentation/networking/tls-offload.rst | 9 --------
include/net/sock.h | 28 +++++++++++++++++++++++-
net/core/skbuff.c | 3 +++
net/core/sock.c | 22 ++++++++++++++-----
net/ipv4/tcp.c | 4 +++-
net/ipv4/tcp_output.c | 3 +++
net/tls/tls_device.c | 2 ++
7 files changed, 55 insertions(+), 16 deletions(-)
diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
index 048e5ca44824..2bc3ab5515d8 100644
--- a/Documentation/networking/tls-offload.rst
+++ b/Documentation/networking/tls-offload.rst
@@ -499,15 +499,6 @@ offloads, old connections will remain active after flags are cleared.
Known bugs
==========
-skb_orphan() leaks clear text
------------------------------
-
-Currently drivers depend on the :c:member:`sk` member of
-:c:type:`struct sk_buff <sk_buff>` to identify segments requiring
-encryption. Any operation which removes or does not preserve the socket
-association such as :c:func:`skb_orphan` or :c:func:`skb_clone`
-will cause the driver to miss the packets and lead to clear text leaks.
-
Redirects leak clear text
-------------------------
diff --git a/include/net/sock.h b/include/net/sock.h
index 228db3998e46..90f3f552c789 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -814,6 +814,7 @@ enum sock_flags {
SOCK_TXTIME,
SOCK_XDP, /* XDP is attached */
SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
+ SOCK_CRYPTO_TX_PLAIN_TEXT, /* Generate skbs with decrypted flag set */
};
#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -2480,8 +2481,26 @@ static inline bool sk_fullsock(const struct sock *sk)
return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
}
+static inline void sk_set_tx_crypto(const struct sock *sk, struct sk_buff *skb)
+{
+#ifdef CONFIG_TLS_DEVICE
+ skb->decrypted = sock_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT);
+#endif
+}
+
+static inline bool sk_tx_crypto_match(const struct sock *sk,
+ const struct sk_buff *skb)
+{
+#ifdef CONFIG_TLS_DEVICE
+ return sock_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT) == !!skb->decrypted;
+#else
+ return true;
+#endif
+}
+
/* Checks if this SKB belongs to an HW offloaded socket
* and whether any SW fallbacks are required based on dev.
+ * Check decrypted mark in case skb_orphan() cleared socket.
*/
static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
struct net_device *dev)
@@ -2489,8 +2508,15 @@ static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
#ifdef CONFIG_SOCK_VALIDATE_XMIT
struct sock *sk = skb->sk;
- if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb)
+ if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb) {
skb = sk->sk_validate_xmit_skb(sk, dev, skb);
+#ifdef CONFIG_TLS_DEVICE
+ } else if (unlikely(skb->decrypted)) {
+ pr_warn_ratelimited("unencrypted skb with no associated socket - dropping\n");
+ kfree_skb(skb);
+ skb = NULL;
+#endif
+ }
#endif
return skb;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0b788df5a75b..9e92684479b9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3794,6 +3794,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
skb_reserve(nskb, headroom);
__skb_put(nskb, doffset);
+#ifdef CONFIG_TLS_DEVICE
+ nskb->decrypted = head_skb->decrypted;
+#endif
}
if (segs)
diff --git a/net/core/sock.c b/net/core/sock.c
index d57b0cc995a0..b0c10b518e65 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1992,6 +1992,22 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
}
EXPORT_SYMBOL(skb_set_owner_w);
+static bool can_skb_orphan_partial(const struct sk_buff *skb)
+{
+#ifdef CONFIG_TLS_DEVICE
+ /* Drivers depend on in-order delivery for crypto offload,
+ * partial orphan breaks out-of-order-OK logic.
+ */
+ if (skb->decrypted)
+ return false;
+#endif
+#ifdef CONFIG_INET
+ if (skb->destructor == tcp_wfree)
+ return true;
+#endif
+ return skb->destructor == sock_wfree;
+}
+
/* This helper is used by netem, as it can hold packets in its
* delay queue. We want to allow the owner socket to send more
* packets, as if they were already TX completed by a typical driver.
@@ -2003,11 +2019,7 @@ void skb_orphan_partial(struct sk_buff *skb)
if (skb_is_tcp_pure_ack(skb))
return;
- if (skb->destructor == sock_wfree
-#ifdef CONFIG_INET
- || skb->destructor == tcp_wfree
-#endif
- ) {
+ if (can_skb_orphan_partial(skb)) {
struct sock *sk = skb->sk;
if (refcount_inc_not_zero(&sk->sk_refcnt)) {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f62f0e7e3cdd..dee93eab02fe 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -984,6 +984,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
if (!skb)
goto wait_for_memory;
+ sk_set_tx_crypto(sk, skb);
skb_entail(sk, skb);
copy = size_goal;
}
@@ -993,7 +994,8 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
i = skb_shinfo(skb)->nr_frags;
can_coalesce = skb_can_coalesce(skb, i, page, offset);
- if (!can_coalesce && i >= sysctl_max_skb_frags) {
+ if ((!can_coalesce && i >= sysctl_max_skb_frags) ||
+ !sk_tx_crypto_match(sk, skb)) {
tcp_mark_push(tp, skb);
goto new_segment;
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6e4afc48d7bb..9efd0ca44d49 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1320,6 +1320,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
buff = sk_stream_alloc_skb(sk, nsize, gfp, true);
if (!buff)
return -ENOMEM; /* We'll just try again later. */
+ sk_set_tx_crypto(sk, buff);
sk->sk_wmem_queued += buff->truesize;
sk_mem_charge(sk, buff->truesize);
@@ -1874,6 +1875,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
buff = sk_stream_alloc_skb(sk, 0, gfp, true);
if (unlikely(!buff))
return -ENOMEM;
+ sk_set_tx_crypto(sk, buff);
sk->sk_wmem_queued += buff->truesize;
sk_mem_charge(sk, buff->truesize);
@@ -2139,6 +2141,7 @@ static int tcp_mtu_probe(struct sock *sk)
nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC, false);
if (!nskb)
return -1;
+ sk_set_tx_crypto(sk, nskb);
sk->sk_wmem_queued += nskb->truesize;
sk_mem_charge(sk, nskb->truesize);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 4ec8a06fa5d1..3d78742b3b1b 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -970,6 +970,8 @@ int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
tls_device_attach(ctx, sk, netdev);
+ sock_set_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT);
+
/* following this assignment tls_is_sk_tx_device_offloaded
* will return true and the context might be accessed
* by the netdev's xmit function.
--
2.21.0
^ permalink raw reply related
* Re: [PATCHv2 net-next 0/5] sctp: clean up __sctp_connect function
From: David Miller @ 2019-07-30 21:18 UTC (permalink / raw)
To: marcelo.leitner; +Cc: lucien.xin, netdev, linux-sctp, nhorman
In-Reply-To: <20190730194231.GE4064@localhost.localdomain>
From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Tue, 30 Jul 2019 16:42:31 -0300
> On Tue, Jul 30, 2019 at 08:38:18PM +0800, Xin Long wrote:
>> This patchset is to factor out some common code for
>> sctp_sendmsg_new_asoc() and __sctp_connect() into 2
>> new functioins.
>>
>> v1->v2:
>> - add the patch 1/5 to avoid a slab-out-of-bounds warning.
>> - add some code comment for the check change in patch 2/5.
>> - remove unused 'addrcnt' as Marcelo noticed in patch 3/5.
>>
>> Xin Long (5):
>> sctp: only copy the available addr data in sctp_transport_init
>> sctp: check addr_size with sa_family_t size in
>> __sctp_setsockopt_connectx
>> sctp: clean up __sctp_connect
>> sctp: factor out sctp_connect_new_asoc
>> sctp: factor out sctp_connect_add_peer
>
> Series,
> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Series applied, thanks.
^ permalink raw reply
* Re: [PATCH v2 bpf-next 01/12] libbpf: add .BTF.ext offset relocation section loading
From: Song Liu @ 2019-07-30 21:19 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann,
Yonghong Song, Andrii Nakryiko, Kernel Team
In-Reply-To: <20190730195408.670063-2-andriin@fb.com>
> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
>
> Add support for BPF CO-RE offset relocations. Add section/record
> iteration macros for .BTF.ext. These macro are useful for iterating over
> each .BTF.ext record, either for dumping out contents or later for BPF
> CO-RE relocation handling.
>
> To enable other parts of libbpf to work with .BTF.ext contents, moved
> a bunch of type definitions into libbpf_internal.h.
>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
^ permalink raw reply
* Re: [PATCH v2 0/3 net-next] Finish conversion of skb_frag_t to bio_vec
From: David Miller @ 2019-07-30 21:22 UTC (permalink / raw)
To: jonathan.lemon; +Cc: willy, jakub.kicinski, kernel-team, netdev
In-Reply-To: <20190730144034.444022-1-jonathan.lemon@gmail.com>
From: Jonathan Lemon <jonathan.lemon@gmail.com>
Date: Tue, 30 Jul 2019 07:40:31 -0700
> The recent conversion of skb_frag_t to bio_vec did not include
> skb_frag's page_offset. Add accessor functions for this field,
> utilize them, and remove the union, restoring the original structure.
>
> v2:
> - rename accessors
> - follow kdoc conventions
Series applied, thanks Jonathan.
^ permalink raw reply
* Re: [PATCH v2 0/3 net-next] Finish conversion of skb_frag_t to bio_vec
From: David Miller @ 2019-07-30 21:22 UTC (permalink / raw)
To: axboe; +Cc: jonathan.lemon, willy, jakub.kicinski, Kernel-team, netdev
In-Reply-To: <1d34658b-a807-44ae-756a-d55dead27f94@fb.com>
From: Jens Axboe <axboe@fb.com>
Date: Tue, 30 Jul 2019 20:49:09 +0000
> Pretty appalled to see this abomination:
>
> net: Convert skb_frag_t to bio_vec
>
> There are a lot of users of frag->page_offset, so use a union
> to avoid converting those users today.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: David S. Miller <davem@davemloft.net>
>
> show up in the net tree without even having been posted on a
> block list...
>
> At least this kills this ugly thing.
Sorry about that Jens, but at least as you say it's gone now.
^ permalink raw reply
* Re: [PATCH v2 bpf-next 03/12] selftests/bpf: add BPF_CORE_READ relocatable read macro
From: Song Liu @ 2019-07-30 21:24 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, Networking, Alexei Starovoitov, daniel@iogearbox.net,
Yonghong Song, andrii.nakryiko@gmail.com, Kernel Team
In-Reply-To: <20190730195408.670063-4-andriin@fb.com>
> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
>
> Add BPF_CORE_READ macro used in tests to do bpf_core_read(), which
> automatically captures offset relocation.
>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
> tools/testing/selftests/bpf/bpf_helpers.h | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index f804f210244e..81bc51293d11 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -501,4 +501,23 @@ struct pt_regs;
> (void *)(PT_REGS_FP(ctx) + sizeof(ip))); })
> #endif
>
> +/*
> + * BPF_CORE_READ abstracts away bpf_probe_read() call and captures offset
> + * relocation for source address using __builtin_preserve_access_index()
> + * built-in, provided by Clang.
> + *
> + * __builtin_preserve_access_index() takes as an argument an expression of
> + * taking an address of a field within struct/union. It makes compiler emit
> + * a relocation, which records BTF type ID describing root struct/union and an
> + * accessor string which describes exact embedded field that was used to take
> + * an address. See detailed description of this relocation format and
> + * semantics in comments to struct bpf_offset_reloc in libbpf_internal.h.
> + *
> + * This relocation allows libbpf to adjust BPF instruction to use correct
> + * actual field offset, based on target kernel BTF type that matches original
> + * (local) BTF, used to record relocation.
> + */
> +#define BPF_CORE_READ(dst, src) \
> + bpf_probe_read(dst, sizeof(*src), __builtin_preserve_access_index(src))
We should use "sizeof(*(src))"
^ permalink raw reply
* Re: [PATCH v6 51/57] net: Remove dev_err() usage after platform_get_irq()
From: David Miller @ 2019-07-30 21:25 UTC (permalink / raw)
To: swboyd
Cc: linux-kernel, kvalo, saeedm, jeffrey.t.kirsher, nbd, lorenzo,
netdev, gregkh
In-Reply-To: <20190730181557.90391-52-swboyd@chromium.org>
From: Stephen Boyd <swboyd@chromium.org>
Date: Tue, 30 Jul 2019 11:15:51 -0700
> We don't need dev_err() messages when platform_get_irq() fails now that
> platform_get_irq() prints an error message itself when something goes
> wrong. Let's remove these prints with a simple semantic patch.
...
> While we're here, remove braces on if statements that only have one
> statement (manually).
...
> Signed-off-by: Stephen Boyd <swboyd@chromium.org>
> ---
>
> Please apply directly to subsystem trees
I'll take this into net-next.
^ permalink raw reply
* Re: [PATCH v2 bpf-next 03/12] selftests/bpf: add BPF_CORE_READ relocatable read macro
From: Andrii Nakryiko @ 2019-07-30 21:26 UTC (permalink / raw)
To: Song Liu
Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
daniel@iogearbox.net, Yonghong Song, Kernel Team
In-Reply-To: <87422673-525B-461B-B487-EB16386CAB25@fb.com>
On Tue, Jul 30, 2019 at 2:24 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
> >
> > Add BPF_CORE_READ macro used in tests to do bpf_core_read(), which
> > automatically captures offset relocation.
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> > tools/testing/selftests/bpf/bpf_helpers.h | 19 +++++++++++++++++++
> > 1 file changed, 19 insertions(+)
> >
> > diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> > index f804f210244e..81bc51293d11 100644
> > --- a/tools/testing/selftests/bpf/bpf_helpers.h
> > +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> > @@ -501,4 +501,23 @@ struct pt_regs;
> > (void *)(PT_REGS_FP(ctx) + sizeof(ip))); })
> > #endif
> >
> > +/*
> > + * BPF_CORE_READ abstracts away bpf_probe_read() call and captures offset
> > + * relocation for source address using __builtin_preserve_access_index()
> > + * built-in, provided by Clang.
> > + *
> > + * __builtin_preserve_access_index() takes as an argument an expression of
> > + * taking an address of a field within struct/union. It makes compiler emit
> > + * a relocation, which records BTF type ID describing root struct/union and an
> > + * accessor string which describes exact embedded field that was used to take
> > + * an address. See detailed description of this relocation format and
> > + * semantics in comments to struct bpf_offset_reloc in libbpf_internal.h.
> > + *
> > + * This relocation allows libbpf to adjust BPF instruction to use correct
> > + * actual field offset, based on target kernel BTF type that matches original
> > + * (local) BTF, used to record relocation.
> > + */
> > +#define BPF_CORE_READ(dst, src) \
> > + bpf_probe_read(dst, sizeof(*src), __builtin_preserve_access_index(src))
>
> We should use "sizeof(*(src))"
>
Good point. Also (dst) instead of just (dst). Will update.
^ permalink raw reply
* Re: [PATCH v2 0/3 net-next] Finish conversion of skb_frag_t to bio_vec
From: Jens Axboe @ 2019-07-30 21:30 UTC (permalink / raw)
To: David Miller
Cc: jonathan.lemon@gmail.com, willy@infradead.org,
jakub.kicinski@netronome.com, Kernel Team, netdev@vger.kernel.org
In-Reply-To: <20190730.142238.1475873068715429404.davem@davemloft.net>
On 7/30/19 3:22 PM, David Miller wrote:
> From: Jens Axboe <axboe@fb.com>
> Date: Tue, 30 Jul 2019 20:49:09 +0000
>
>> Pretty appalled to see this abomination:
>>
>> net: Convert skb_frag_t to bio_vec
>>
>> There are a lot of users of frag->page_offset, so use a union
>> to avoid converting those users today.
>>
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>>
>> show up in the net tree without even having been posted on a
>> block list...
>>
>> At least this kills this ugly thing.
>
> Sorry about that Jens, but at least as you say it's gone now.
Yeah all good now, thanks.
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH v2 bpf-next 03/12] selftests/bpf: add BPF_CORE_READ relocatable read macro
From: Song Liu @ 2019-07-30 21:33 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
daniel@iogearbox.net, Yonghong Song, Kernel Team
In-Reply-To: <CAEf4BzakowCqkCkBdGPJCZNU3MpDf1yBhzOXL2pos1tPiUH0mQ@mail.gmail.com>
> On Jul 30, 2019, at 2:26 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Jul 30, 2019 at 2:24 PM Song Liu <songliubraving@fb.com> wrote:
>>
>>
>>
>>> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
>>>
>>> Add BPF_CORE_READ macro used in tests to do bpf_core_read(), which
>>> automatically captures offset relocation.
>>>
>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>>> ---
>>> tools/testing/selftests/bpf/bpf_helpers.h | 19 +++++++++++++++++++
>>> 1 file changed, 19 insertions(+)
>>>
>>> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
>>> index f804f210244e..81bc51293d11 100644
>>> --- a/tools/testing/selftests/bpf/bpf_helpers.h
>>> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
>>> @@ -501,4 +501,23 @@ struct pt_regs;
>>> (void *)(PT_REGS_FP(ctx) + sizeof(ip))); })
>>> #endif
>>>
>>> +/*
>>> + * BPF_CORE_READ abstracts away bpf_probe_read() call and captures offset
>>> + * relocation for source address using __builtin_preserve_access_index()
>>> + * built-in, provided by Clang.
>>> + *
>>> + * __builtin_preserve_access_index() takes as an argument an expression of
>>> + * taking an address of a field within struct/union. It makes compiler emit
>>> + * a relocation, which records BTF type ID describing root struct/union and an
>>> + * accessor string which describes exact embedded field that was used to take
>>> + * an address. See detailed description of this relocation format and
>>> + * semantics in comments to struct bpf_offset_reloc in libbpf_internal.h.
>>> + *
>>> + * This relocation allows libbpf to adjust BPF instruction to use correct
>>> + * actual field offset, based on target kernel BTF type that matches original
>>> + * (local) BTF, used to record relocation.
>>> + */
>>> +#define BPF_CORE_READ(dst, src) \
>>> + bpf_probe_read(dst, sizeof(*src), __builtin_preserve_access_index(src))
>>
>> We should use "sizeof(*(src))"
>>
>
> Good point. Also (dst) instead of just (dst). Will update.
I think dst as-is is fine. "," is the very last in precedence list.
^ permalink raw reply
* Re: [net-next 01/13] net/mlx5e: Print a warning when LRO feature is dropped or not allowed
From: Willem de Bruijn @ 2019-07-30 21:34 UTC (permalink / raw)
To: Saeed Mahameed; +Cc: Huy Nguyen, davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <cb43e9dadb8e48d27df8f08464bf40f7a81eafe9.camel@mellanox.com>
On Tue, Jul 30, 2019 at 4:08 PM Saeed Mahameed <saeedm@mellanox.com> wrote:
>
> On Tue, 2019-07-30 at 11:52 -0400, Willem de Bruijn wrote:
> > On Mon, Jul 29, 2019 at 7:50 PM Saeed Mahameed <saeedm@mellanox.com>
> > wrote:
> > > From: Huy Nguyen <huyn@mellanox.com>
> > >
> > > When user enables LRO via ethtool and if the RQ mode is legacy,
> > > mlx5e_fix_features drops the request without any explanation.
> > > Add netdev_warn to cover this case.
> > >
> > > Fixes: 6c3a823e1e9c ("net/mlx5e: RX, Remove HW LRO support in
> > > legacy RQ")
> > > Signed-off-by: Huy Nguyen <huyn@mellanox.com>
> > > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > > ---
> > > drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 +++--
> > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > index 47eea6b3a1c3..776eb46d263d 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > @@ -3788,9 +3788,10 @@ static netdev_features_t
> > > mlx5e_fix_features(struct net_device *netdev,
> > > netdev_warn(netdev, "Dropping C-tag vlan
> > > stripping offload due to S-tag vlan\n");
> > > }
> > > if (!MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ)) {
> > > - features &= ~NETIF_F_LRO;
> > > - if (params->lro_en)
> > > + if (features & NETIF_F_LRO) {
> > > netdev_warn(netdev, "Disabling LRO, not
> > > supported in legacy RQ\n");
> >
> > This warns about "Disabling LRO" on an enable request?
> >
>
> no, this warning appears only when lro is already enabled and might
> conflict with any other feature requested by user (hence
> mlx5e_fix_features), e.g when moving away from striding rq in this
> example, we will force lro to off.
Ok. The previous commit mentioned "totally remove LRO support in
Legacy RQ". This handles the additional case when moving a queue into
legacy mode that still had LRO enabled. I see.
>
>
> > More fundamentally, it appears that the device does not advertise
> > the feature as configurable in netdev_hw_features as of commit
> > 6c3a823e1e9c ("net/mlx5e: RX, Remove HW LRO support in
> > legacy RQ"), so shouldn't this be caught by the device driver
> > independent ethtool code?
>
> when hw doesn't support MLX5E_PFLAG_RX_STRIDING_RQ then yes, you will
> never hit this code path, but when hw does support
> MLX5E_PFLAG_RX_STRIDING_RQ and you want to turn striding rq off, then
> lro will be forced to off (if it was enabled in first space) and a
> warning msg will be shown.
Got it, thanks.
^ permalink raw reply
* Re: [net 1/1] tipc: fix unitilized skb list crash
From: David Miller @ 2019-07-30 21:40 UTC (permalink / raw)
To: jon.maloy
Cc: netdev, tung.q.nguyen, hoang.h.le, lxin, shuali, ying.xue,
tipc-discussion
In-Reply-To: <1564510750-19531-1-git-send-email-jon.maloy@ericsson.com>
From: Jon Maloy <jon.maloy@ericsson.com>
Date: Tue, 30 Jul 2019 20:19:10 +0200
> Our test suite somtimes provokes the following crash:
>
> Description of problem:
...
> The reason is that the skb list tipc_socket::mc_method.deferredq only
> is initialized for connectionless sockets, while nothing stops arriving
> multicast messages from being filtered by connection oriented sockets,
> with subsequent access to the said list.
>
> We fix this by initializing the list unconditionally at socket creation.
> This eliminates the crash, while the message still is dropped further
> down in tipc_sk_filter_rcv() as it should be.
>
> Reported-by: Li Shuang <shuali@redhat.com>
> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Applied and queued up for -stable, thank you.
^ permalink raw reply
* Re: [PATCH v5 09/29] compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
From: David Miller @ 2019-07-30 21:42 UTC (permalink / raw)
To: arnd
Cc: viro, linux-fsdevel, linux-kernel, g.nault, mostrows, xeb,
jchapman, netdev
In-Reply-To: <20190730192552.4014288-10-arnd@arndb.de>
From: Arnd Bergmann <arnd@arndb.de>
Date: Tue, 30 Jul 2019 21:25:20 +0200
> Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
> linux-2.5.69 along with hundreds of other commands, but was always broken
> sincen only the structure is compatible, but the command number is not,
> due to the size being sizeof(size_t), or at first sizeof(sizeof((struct
> sockaddr_pppox)), which is different on 64-bit architectures.
>
> Guillaume Nault adds:
>
> And the implementation was broken until 2016 (see 29e73269aa4d ("pppoe:
> fix reference counting in PPPoE proxy")), and nobody ever noticed. I
> should probably have removed this ioctl entirely instead of fixing it.
> Clearly, it has never been used.
>
> Fix it by adding a compat_ioctl handler for all pppoe variants that
> translates the command number and then calls the regular ioctl function.
>
> All other ioctl commands handled by pppoe are compatible between 32-bit
> and 64-bit, and require compat_ptr() conversion.
>
> This should apply to all stable kernels.
>
> Acked-by: Guillaume Nault <g.nault@alphalink.fr>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Applied and queued up for -stable, thanks everyone.
^ permalink raw reply
* Re: [PATCH v5 12/29] compat_ioctl: move drivers to compat_ptr_ioctl
From: David Miller @ 2019-07-30 21:43 UTC (permalink / raw)
To: arnd
Cc: viro, linux-fsdevel, linux-kernel, gregkh, mst, jarkko.sakkinen,
jgg, jkosina, stefanha, linux-integrity, linux1394-devel,
linux-usb, linux-input, linux-stm32, linux-arm-kernel, linux-mtd,
netdev, devel, kvm, virtualization, ceph-devel
In-Reply-To: <20190730195227.742215-1-arnd@arndb.de>
From: Arnd Bergmann <arnd@arndb.de>
Date: Tue, 30 Jul 2019 21:50:28 +0200
> Each of these drivers has a copy of the same trivial helper function to
> convert the pointer argument and then call the native ioctl handler.
>
> We now have a generic implementation of that, so use it.
>
> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jiri Kosina <jkosina@suse.cz>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
I assume this has to go via your series, thus:
Acked-by: David S. Miller <davem@davemloft.net>
^ permalink raw reply
* Re: [PATCH v4 net-next 13/19] ionic: Add initial ethtool support
From: Shannon Nelson @ 2019-07-30 21:51 UTC (permalink / raw)
To: Andrew Lunn; +Cc: netdev, davem
In-Reply-To: <20190725133506.GD21952@lunn.ch>
On 7/25/19 6:35 AM, Andrew Lunn wrote:
>> +static int ionic_get_module_eeprom(struct net_device *netdev,
>> + struct ethtool_eeprom *ee,
>> + u8 *data)
>> +{
>> + struct lif *lif = netdev_priv(netdev);
>> + struct ionic_dev *idev = &lif->ionic->idev;
>> + struct xcvr_status *xcvr;
>> + u32 len;
>> +
>> + /* The NIC keeps the module prom up-to-date in the DMA space
>> + * so we can simply copy the module bytes into the data buffer.
>> + */
>> + xcvr = &idev->port_info->status.xcvr;
>> + len = min_t(u32, sizeof(xcvr->sprom), ee->len);
>> + memcpy(data, xcvr->sprom, len);
>> +
>> + return 0;
>> +}
> Is the firmware doing this DMA update atomically? The diagnostic
> values are u16s. Is there any chance we do this memcpy at the same
> time the DMA is active and we get a mix of old and new data?
>
> Often in cases like this you do the copy twice and ensure you get the
> same values each time. If not, keep repeating the copy until you do
> get the same values twice.
Regardless of how the structs are all aligned and our PCI block does
large writes, I can see how an unoptimized memcpy() that is doing
byte-by-byte copy rather than by words might result in a mangled value.
I think this is the only buffer that may be susceptible to this. Sure,
doing a double copy should work here.
sln
^ permalink raw reply
* Re: [PATCH bpf-next v2] tools: bpftool: add support for reporting the effective cgroup progs
From: Takshak Chahande @ 2019-07-30 22:01 UTC (permalink / raw)
To: Jakub Kicinski
Cc: alexei.starovoitov@gmail.com, daniel@iogearbox.net,
netdev@vger.kernel.org, bpf@vger.kernel.org,
oss-drivers@netronome.com, Kernel Team, Quentin Monnet
In-Reply-To: <20190730210300.13113-1-jakub.kicinski@netronome.com>
Jakub Kicinski <jakub.kicinski@netronome.com> wrote on Tue [2019-Jul-30 14:03:00 -0700]:
> Takshak said in the original submission:
>
> With different bpf attach_flags available to attach bpf programs specially
> with BPF_F_ALLOW_OVERRIDE and BPF_F_ALLOW_MULTI, the list of effective
> bpf-programs available to any sub-cgroups really needs to be available for
> easy debugging.
>
> Using BPF_F_QUERY_EFFECTIVE flag, one can get the list of not only attached
> bpf-programs to a cgroup but also the inherited ones from parent cgroup.
>
> So a new option is introduced to use BPF_F_QUERY_EFFECTIVE query flag here
> to list all the effective bpf-programs available for execution at a specified
> cgroup.
>
> Reused modified test program test_cgroup_attach from tools/testing/selftests/bpf:
> # ./test_cgroup_attach
>
> With old bpftool:
>
> # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/
> ID AttachType AttachFlags Name
> 271 egress multi pkt_cntr_1
> 272 egress multi pkt_cntr_2
>
> Attached new program pkt_cntr_4 in cg2 gives following:
>
> # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/cg2
> ID AttachType AttachFlags Name
> 273 egress override pkt_cntr_4
>
> And with new "effective" option it shows all effective programs for cg2:
>
> # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/cg2 effective
> ID AttachType AttachFlags Name
> 273 egress override pkt_cntr_4
> 271 egress override pkt_cntr_1
> 272 egress override pkt_cntr_2
>
> Compared to original submission use a local flag instead of global
> option.
>
> We need to clear query_flags on every command, in case batch mode
> wants to use varying settings.
>
> v2: (Takshak)
> - forbid duplicated flags;
> - fix cgroup path freeing.
>
> Signed-off-by: Takshak Chahande <ctakshak@fb.com>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
> .../bpftool/Documentation/bpftool-cgroup.rst | 16 +++-
> tools/bpf/bpftool/bash-completion/bpftool | 15 ++--
> tools/bpf/bpftool/cgroup.c | 83 ++++++++++++-------
> 3 files changed, 76 insertions(+), 38 deletions(-)
>
> diff --git a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
> index 585f270c2d25..06a28b07787d 100644
> --- a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
> +++ b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
> @@ -20,8 +20,8 @@ SYNOPSIS
> CGROUP COMMANDS
> ===============
>
> -| **bpftool** **cgroup { show | list }** *CGROUP*
> -| **bpftool** **cgroup tree** [*CGROUP_ROOT*]
> +| **bpftool** **cgroup { show | list }** *CGROUP* [**effective**]
> +| **bpftool** **cgroup tree** [*CGROUP_ROOT*] [**effective**]
> | **bpftool** **cgroup attach** *CGROUP* *ATTACH_TYPE* *PROG* [*ATTACH_FLAGS*]
> | **bpftool** **cgroup detach** *CGROUP* *ATTACH_TYPE* *PROG*
> | **bpftool** **cgroup help**
> @@ -35,13 +35,17 @@ CGROUP COMMANDS
>
> DESCRIPTION
> ===========
> - **bpftool cgroup { show | list }** *CGROUP*
> + **bpftool cgroup { show | list }** *CGROUP* [**effective**]
> List all programs attached to the cgroup *CGROUP*.
>
> Output will start with program ID followed by attach type,
> attach flags and program name.
>
> - **bpftool cgroup tree** [*CGROUP_ROOT*]
> + If **effective** is specified retrieve effective programs that
> + will execute for events within a cgroup. This includes
> + inherited along with attached ones.
> +
> + **bpftool cgroup tree** [*CGROUP_ROOT*] [**effective**]
> Iterate over all cgroups in *CGROUP_ROOT* and list all
> attached programs. If *CGROUP_ROOT* is not specified,
> bpftool uses cgroup v2 mountpoint.
> @@ -50,6 +54,10 @@ DESCRIPTION
> commands: it starts with absolute cgroup path, followed by
> program ID, attach type, attach flags and program name.
>
> + If **effective** is specified retrieve effective programs that
> + will execute for events within a cgroup. This includes
> + inherited along with attached ones.
> +
> **bpftool cgroup attach** *CGROUP* *ATTACH_TYPE* *PROG* [*ATTACH_FLAGS*]
> Attach program *PROG* to the cgroup *CGROUP* with attach type
> *ATTACH_TYPE* and optional *ATTACH_FLAGS*.
> diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
> index 6b961a5ed100..df16c5415444 100644
> --- a/tools/bpf/bpftool/bash-completion/bpftool
> +++ b/tools/bpf/bpftool/bash-completion/bpftool
> @@ -710,12 +710,15 @@ _bpftool()
> ;;
> cgroup)
> case $command in
> - show|list)
> - _filedir
> - return 0
> - ;;
> - tree)
> - _filedir
> + show|list|tree)
> + case $cword in
> + 3)
> + _filedir
> + ;;
> + 4)
> + COMPREPLY=( $( compgen -W 'effective' -- "$cur" ) )
> + ;;
> + esac
> return 0
> ;;
> attach|detach)
> diff --git a/tools/bpf/bpftool/cgroup.c b/tools/bpf/bpftool/cgroup.c
> index f3c05b08c68c..44352b5aca85 100644
> --- a/tools/bpf/bpftool/cgroup.c
> +++ b/tools/bpf/bpftool/cgroup.c
> @@ -29,6 +29,8 @@
> " recvmsg4 | recvmsg6 | sysctl |\n" \
> " getsockopt | setsockopt }"
>
> +static unsigned int query_flags;
> +
> static const char * const attach_type_strings[] = {
> [BPF_CGROUP_INET_INGRESS] = "ingress",
> [BPF_CGROUP_INET_EGRESS] = "egress",
> @@ -107,7 +109,8 @@ static int count_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type)
> __u32 prog_cnt = 0;
> int ret;
>
> - ret = bpf_prog_query(cgroup_fd, type, 0, NULL, NULL, &prog_cnt);
> + ret = bpf_prog_query(cgroup_fd, type, query_flags, NULL,
> + NULL, &prog_cnt);
> if (ret)
> return -1;
>
> @@ -125,8 +128,8 @@ static int show_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type,
> int ret;
>
> prog_cnt = ARRAY_SIZE(prog_ids);
> - ret = bpf_prog_query(cgroup_fd, type, 0, &attach_flags, prog_ids,
> - &prog_cnt);
> + ret = bpf_prog_query(cgroup_fd, type, query_flags, &attach_flags,
> + prog_ids, &prog_cnt);
> if (ret)
> return ret;
>
> @@ -158,20 +161,34 @@ static int show_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type,
> static int do_show(int argc, char **argv)
> {
> enum bpf_attach_type type;
> + const char *path;
> int cgroup_fd;
> int ret = -1;
>
> - if (argc < 1) {
> - p_err("too few parameters for cgroup show");
> - goto exit;
> - } else if (argc > 1) {
> - p_err("too many parameters for cgroup show");
> - goto exit;
> + query_flags = 0;
> +
> + if (!REQ_ARGS(1))
> + return -1;
> + path = GET_ARG();
> +
> + while (argc) {
> + if (is_prefix(*argv, "effective")) {
> + if (query_flags & BPF_F_QUERY_EFFECTIVE) {
> + p_err("duplicated argument: %s", *argv);
> + return -1;
> + }
> + query_flags |= BPF_F_QUERY_EFFECTIVE;
> + NEXT_ARG();
> + } else {
> + p_err("expected no more arguments, 'effective', got: '%s'?",
> + *argv);
> + return -1;
> + }
> }
>
> - cgroup_fd = open(argv[0], O_RDONLY);
> + cgroup_fd = open(path, O_RDONLY);
> if (cgroup_fd < 0) {
> - p_err("can't open cgroup %s", argv[0]);
> + p_err("can't open cgroup %s", path);
> goto exit;
> }
>
> @@ -294,26 +311,37 @@ static char *find_cgroup_root(void)
>
> static int do_show_tree(int argc, char **argv)
> {
> - char *cgroup_root;
> + char *cgroup_root, *cgroup_alloced = NULL;
> int ret;
>
> - switch (argc) {
> - case 0:
> - cgroup_root = find_cgroup_root();
> - if (!cgroup_root) {
> + query_flags = 0;
> +
> + if (!argc) {
> + cgroup_alloced = find_cgroup_root();
> + if (!cgroup_alloced) {
> p_err("cgroup v2 isn't mounted");
> return -1;
> }
> - break;
> - case 1:
> - cgroup_root = argv[0];
> - break;
> - default:
> - p_err("too many parameters for cgroup tree");
> - return -1;
> + cgroup_root = cgroup_alloced;
> + } else {
> + cgroup_root = GET_ARG();
> +
> + while (argc) {
> + if (is_prefix(*argv, "effective")) {
> + if (query_flags & BPF_F_QUERY_EFFECTIVE) {
> + p_err("duplicated argument: %s", *argv);
> + return -1;
> + }
> + query_flags |= BPF_F_QUERY_EFFECTIVE;
> + NEXT_ARG();
> + } else {
> + p_err("expected no more arguments, 'effective', got: '%s'?",
> + *argv);
> + return -1;
> + }
> + }
> }
>
> -
> if (json_output)
> jsonw_start_array(json_wtr);
> else
> @@ -338,8 +366,7 @@ static int do_show_tree(int argc, char **argv)
> if (json_output)
> jsonw_end_array(json_wtr);
>
> - if (argc == 0)
> - free(cgroup_root);
> + free(cgroup_alloced);
>
> return ret;
> }
> @@ -459,8 +486,8 @@ static int do_help(int argc, char **argv)
> }
>
> fprintf(stderr,
> - "Usage: %s %s { show | list } CGROUP\n"
> - " %s %s tree [CGROUP_ROOT]\n"
> + "Usage: %s %s { show | list } CGROUP [**effective**]\n"
> + " %s %s tree [CGROUP_ROOT] [**effective**]\n"
> " %s %s attach CGROUP ATTACH_TYPE PROG [ATTACH_FLAGS]\n"
> " %s %s detach CGROUP ATTACH_TYPE PROG\n"
> " %s %s help\n"
> --
> 2.21.0
>
Thanks for v2; looks good.
Reviewed-by: Takshak Chahande <ctakshak@fb.com>
^ permalink raw reply
* Re: [PATCH v3] net: dsa: qca8k: enable port flow control
From: David Miller @ 2019-07-30 22:08 UTC (permalink / raw)
To: xiaofeis
Cc: vkoul, netdev, andrew, linux-arm-msm, bjorn.andersson,
vivien.didelot, f.fainelli, niklas.cassel, xiazha
In-Reply-To: <1564275470-52666-1-git-send-email-xiaofeis@codeaurora.org>
From: xiaofeis <xiaofeis@codeaurora.org>
Date: Sun, 28 Jul 2019 08:57:50 +0800
> Set phy device advertising to enable MAC flow control.
>
> Signed-off-by: Xiaofei Shen <xiaofeis@codeaurora.org>
I've read the discussion over a few times and if internal PHY is the
only thing supported, and the specific setup this driver supports uses
internal signalling to sync the PHY and MAC settings I guess this is
OK although suboptimal.
So applied, thanks.
Thanks.
^ permalink raw reply
* Re: [PATCH V4 0/3] net: dsa: ksz: Add Microchip KSZ87xx support
From: David Miller @ 2019-07-30 22:14 UTC (permalink / raw)
To: marex; +Cc: netdev, andrew, f.fainelli, Tristram.Ha, vivien.didelot,
woojung.huh
In-Reply-To: <20190729174947.10103-1-marex@denx.de>
From: Marek Vasut <marex@denx.de>
Date: Mon, 29 Jul 2019 19:49:44 +0200
> This series adds support for Microchip KSZ87xx switches, which are
> slightly simpler compared to KSZ9xxx .
>
> Signed-off-by: Marek Vasut <marex@denx.de>
Series applied to net-next, thank you.
^ permalink raw reply
* [RFC PATCH v2 net-next 0/3] net: batched receive in GRO path
From: Edward Cree @ 2019-07-30 22:18 UTC (permalink / raw)
To: David Miller; +Cc: linux-net-drivers, netdev, Eric Dumazet
This series listifies part of GRO processing, in a manner which allows those
packets which are not GROed (i.e. for which dev_gro_receive returns
GRO_NORMAL) to be passed on to the listified regular receive path.
dev_gro_receive() itself is not listified, nor the per-protocol GRO
callback, since GRO's need to hold packets on lists under napi->gro_hash
makes keeping the packets on other lists awkward, and since the GRO control
block state of held skbs can refer only to one 'new' skb at a time.
Instead, when napi_frags_finish() handles a GRO_NORMAL result, stash the skb
onto a list in the napi struct, which is received at the end of the napi
poll or when its length exceeds the (new) sysctl net.core.gro_normal_batch.
Performance figures with this series, collected on a back-to-back pair of
Solarflare sfn8522-r2 NICs with 120-second NetPerf tests. In the stats,
sample size n for old and new code is 6 runs each; p is from a Welch t-test.
Tests were run both with GRO enabled and disabled, the latter simulating
uncoalesceable packets (e.g. due to IP or TCP options). The receive side
(which was the device under test) had the NetPerf process pinned to one CPU,
and the device interrupts pinned to a second CPU. CPU utilisation figures
(used in cases of line-rate performance) are summed across all CPUs.
net.core.gro_normal_batch was left at its default value of 8.
TCP 4 streams, GRO on: all results line rate (9.415Gbps)
net-next: 210.3% cpu
after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
after #3: 196.7% cpu (- 8.4%, p=0.136 vs net-next)
TCP 4 streams, GRO off:
net-next: 8.017 Gbps
after #1: 7.785 Gbps (- 2.9%, p=0.385 vs net-next)
after #3: 7.604 Gbps (- 5.1%, p=0.282 vs net-next. But note *)
TCP 1 stream, GRO off:
net-next: 6.553 Gbps
after #1: 6.444 Gbps (- 1.7%, p=0.302 vs net-next)
after #3: 6.790 Gbps (+ 3.6%, p=0.169 vs net-next)
TCP 1 stream, GRO on, busy_read = 50: all results line rate
net-next: 156.0% cpu
after #1: 174.5% cpu (+11.9%, p=0.015 vs net-next)
after #3: 165.0% cpu (+ 5.8%, p=0.147 vs net-next)
TCP 1 stream, GRO off, busy_read = 50:
net-next: 6.488 Gbps
after #1: 6.625 Gbps (+ 2.1%, p=0.059 vs net-next)
after #3: 7.351 Gbps (+13.3%, p=0.026 vs net-next)
TCP_RR 100 streams, GRO off, 8000 byte payload
net-next: 995.083 us
after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)
TCP_RR 100 streams, GRO off, 8000 byte payload, busy_read = 50:
net-next: 2.851 ms
after #1: 2.871 ms (+ 0.7%, p=0.134 vs net-next)
after #3: 2.937 ms (+ 3.0%, p<0.001 vs net-next)
TCP_RR 100 streams, GRO off, 1 byte payload, busy_read = 50:
net-next: 867.317 us
after #1: 865.717 us (- 0.2%, p=0.334 vs net-next)
after #3: 868.517 us (+ 0.1%, p=0.414 vs net-next)
(*) These tests produced a mixture of line-rate and below-line-rate results,
meaning that statistically speaking the results were 'censored' by the
upper bound, and were thus not normally distributed, making a Welch t-test
mathematically invalid. I therefore also calculated estimators according
to [1], which gave the following:
net-next: 8.133 Gbps
after #1: 8.130 Gbps (- 0.0%, p=0.499 vs net-next)
after #3: 7.680 Gbps (- 5.6%, p=0.285 vs net-next)
(though my procedure for determining ν wasn't mathematically well-founded
either, so take that p-value with a grain of salt).
A further check came from dividing the bandwidth figure by the CPU usage for
each test run, giving:
net-next: 3.461
after #1: 3.198 (- 7.6%, p=0.145 vs net-next)
after #3: 3.641 (+ 5.2%, p=0.280 vs net-next)
The above results are fairly mixed, and in most cases not statistically
significant. But I think we can roughly conclude that the series
marginally improves non-GROable throughput, without hurting latency
(except in the large-payload busy-polling case, which in any case yields
horrid performance even on net-next (almost triple the latency without
busy-poll). Also, drivers which, unlike sfc, pass UDP traffic to GRO
would expect to see a benefit from gaining access to batching.
Changed in v2:
* During busy poll, call gro_normal_list() to receive batched packets
after each cycle of the napi busy loop. See comments in Patch #3 for
complications of doing the same in busy_poll_stop().
[1]: Cohen 1959, doi: 10.1080/00401706.1959.10489859
Edward Cree (3):
sfc: don't score irq moderation points for GRO
sfc: falcon: don't score irq moderation points for GRO
net: use listified RX for handling GRO_NORMAL skbs
drivers/net/ethernet/sfc/falcon/rx.c | 5 +---
drivers/net/ethernet/sfc/rx.c | 5 +---
include/linux/netdevice.h | 3 ++
net/core/dev.c | 44 ++++++++++++++++++++++++++--
net/core/sysctl_net_core.c | 8 +++++
5 files changed, 54 insertions(+), 11 deletions(-)
^ permalink raw reply
* [RFC PATCH v2 net-next 1/3] sfc: don't score irq moderation points for GRO
From: Edward Cree @ 2019-07-30 22:20 UTC (permalink / raw)
To: linux-net-drivers, David Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <9bcebf59-a0e7-f461-36ef-8564ecb33282@solarflare.com>
We already scored points when handling the RX event, no-one else does this,
and looking at the history it appears this was originally meant to only
score on merges, not on GRO_NORMAL. Moreover, it gets in the way of
changing GRO to not immediately pass GRO_NORMAL skbs to the stack.
Performance testing with four TCP streams received on a single CPU (where
throughput was line rate of 9.4Gbps in all tests) showed a 13.7% reduction
in RX CPU usage (n=6, p=0.03).
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
drivers/net/ethernet/sfc/rx.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index d5db045535d3..85ec07f5a674 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -412,7 +412,6 @@ efx_rx_packet_gro(struct efx_channel *channel, struct efx_rx_buffer *rx_buf,
unsigned int n_frags, u8 *eh)
{
struct napi_struct *napi = &channel->napi_str;
- gro_result_t gro_result;
struct efx_nic *efx = channel->efx;
struct sk_buff *skb;
@@ -449,9 +448,7 @@ efx_rx_packet_gro(struct efx_channel *channel, struct efx_rx_buffer *rx_buf,
skb_record_rx_queue(skb, channel->rx_queue.core_index);
- gro_result = napi_gro_frags(napi);
- if (gro_result != GRO_DROP)
- channel->irq_mod_score += 2;
+ napi_gro_frags(napi);
}
/* Allocate and construct an SKB around page fragments */
^ permalink raw reply related
* [RFC PATCH v2 net-next 2/3] sfc: falcon: don't score irq moderation points for GRO
From: Edward Cree @ 2019-07-30 22:21 UTC (permalink / raw)
To: linux-net-drivers, David Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <9bcebf59-a0e7-f461-36ef-8564ecb33282@solarflare.com>
Same rationale as for sfc, except that this wasn't performance-tested.
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
drivers/net/ethernet/sfc/falcon/rx.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/sfc/falcon/rx.c b/drivers/net/ethernet/sfc/falcon/rx.c
index fd850d3d8ec0..05ea3523890a 100644
--- a/drivers/net/ethernet/sfc/falcon/rx.c
+++ b/drivers/net/ethernet/sfc/falcon/rx.c
@@ -424,7 +424,6 @@ ef4_rx_packet_gro(struct ef4_channel *channel, struct ef4_rx_buffer *rx_buf,
unsigned int n_frags, u8 *eh)
{
struct napi_struct *napi = &channel->napi_str;
- gro_result_t gro_result;
struct ef4_nic *efx = channel->efx;
struct sk_buff *skb;
@@ -460,9 +459,7 @@ ef4_rx_packet_gro(struct ef4_channel *channel, struct ef4_rx_buffer *rx_buf,
skb_record_rx_queue(skb, channel->rx_queue.core_index);
- gro_result = napi_gro_frags(napi);
- if (gro_result != GRO_DROP)
- channel->irq_mod_score += 2;
+ napi_gro_frags(napi);
}
/* Allocate and construct an SKB around page fragments */
^ permalink raw reply related
* [RFC PATCH v2 net-next 3/3] net: use listified RX for handling GRO_NORMAL skbs
From: Edward Cree @ 2019-07-30 22:21 UTC (permalink / raw)
To: David Miller; +Cc: linux-net-drivers, netdev, Eric Dumazet
In-Reply-To: <9bcebf59-a0e7-f461-36ef-8564ecb33282@solarflare.com>
When GRO decides not to coalesce a packet, in napi_frags_finish(), instead
of passing it to the stack immediately, place it on a list in the napi
struct. Then, at flush time (napi_complete_done(), napi_poll(), or
napi_busy_loop()), call netif_receive_skb_list_internal() on the list.
We'd like to do that in napi_gro_flush(), but it's not called if
!napi->gro_bitmask, so we have to do it in the callers instead. (There are
a handful of drivers that call napi_gro_flush() themselves, but it's not
clear why, or whether this will affect them.)
Because a full 64 packets is an inefficiently large batch, also consume the
list whenever it exceeds gro_normal_batch, a new net/core sysctl that
defaults to 8.
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
include/linux/netdevice.h | 3 +++
net/core/dev.c | 44 +++++++++++++++++++++++++++++++++++---
net/core/sysctl_net_core.c | 8 +++++++
3 files changed, 52 insertions(+), 3 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 88292953aa6f..55ac223553f8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -332,6 +332,8 @@ struct napi_struct {
struct net_device *dev;
struct gro_list gro_hash[GRO_HASH_BUCKETS];
struct sk_buff *skb;
+ struct list_head rx_list; /* Pending GRO_NORMAL skbs */
+ int rx_count; /* length of rx_list */
struct hrtimer timer;
struct list_head dev_list;
struct hlist_node napi_hash_node;
@@ -4239,6 +4241,7 @@ extern int dev_weight_rx_bias;
extern int dev_weight_tx_bias;
extern int dev_rx_weight;
extern int dev_tx_weight;
+extern int gro_normal_batch;
bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
struct net_device *netdev_upper_get_next_dev_rcu(struct net_device *dev,
diff --git a/net/core/dev.c b/net/core/dev.c
index fc676b2610e3..b749eb2bfb0c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3963,6 +3963,8 @@ int dev_weight_rx_bias __read_mostly = 1; /* bias for backlog weight */
int dev_weight_tx_bias __read_mostly = 1; /* bias for output_queue quota */
int dev_rx_weight __read_mostly = 64;
int dev_tx_weight __read_mostly = 64;
+/* Maximum number of GRO_NORMAL skbs to batch up for list-RX */
+int gro_normal_batch __read_mostly = 8;
/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd,
@@ -5742,6 +5744,26 @@ struct sk_buff *napi_get_frags(struct napi_struct *napi)
}
EXPORT_SYMBOL(napi_get_frags);
+/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
+static void gro_normal_list(struct napi_struct *napi)
+{
+ if (!napi->rx_count)
+ return;
+ netif_receive_skb_list_internal(&napi->rx_list);
+ INIT_LIST_HEAD(&napi->rx_list);
+ napi->rx_count = 0;
+}
+
+/* Queue one GRO_NORMAL SKB up for list processing. If batch size exceeded,
+ * pass the whole batch up to the stack.
+ */
+static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
+{
+ list_add_tail(&skb->list, &napi->rx_list);
+ if (++napi->rx_count >= gro_normal_batch)
+ gro_normal_list(napi);
+}
+
static gro_result_t napi_frags_finish(struct napi_struct *napi,
struct sk_buff *skb,
gro_result_t ret)
@@ -5751,8 +5773,8 @@ static gro_result_t napi_frags_finish(struct napi_struct *napi,
case GRO_HELD:
__skb_push(skb, ETH_HLEN);
skb->protocol = eth_type_trans(skb, skb->dev);
- if (ret == GRO_NORMAL && netif_receive_skb_internal(skb))
- ret = GRO_DROP;
+ if (ret == GRO_NORMAL)
+ gro_normal_one(napi, skb);
break;
case GRO_DROP:
@@ -6029,6 +6051,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
NAPIF_STATE_IN_BUSY_POLL)))
return false;
+ gro_normal_list(n);
+
if (n->gro_bitmask) {
unsigned long timeout = 0;
@@ -6114,10 +6138,19 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
* Ideally, a new ndo_busy_poll_stop() could avoid another round.
*/
rc = napi->poll(napi, BUSY_POLL_BUDGET);
+ /* We can't gro_normal_list() here, because napi->poll() might have
+ * rearmed the napi (napi_complete_done()) in which case it could
+ * already be running on another CPU.
+ */
trace_napi_poll(napi, rc, BUSY_POLL_BUDGET);
netpoll_poll_unlock(have_poll_lock);
- if (rc == BUSY_POLL_BUDGET)
+ if (rc == BUSY_POLL_BUDGET) {
+ /* As the whole budget was spent, we still own the napi so can
+ * safely handle the rx_list.
+ */
+ gro_normal_list(napi);
__napi_schedule(napi);
+ }
local_bh_enable();
}
@@ -6162,6 +6195,7 @@ void napi_busy_loop(unsigned int napi_id,
}
work = napi_poll(napi, BUSY_POLL_BUDGET);
trace_napi_poll(napi, work, BUSY_POLL_BUDGET);
+ gro_normal_list(napi);
count:
if (work > 0)
__NET_ADD_STATS(dev_net(napi->dev),
@@ -6267,6 +6301,8 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
napi->timer.function = napi_watchdog;
init_gro_hash(napi);
napi->skb = NULL;
+ INIT_LIST_HEAD(&napi->rx_list);
+ napi->rx_count = 0;
napi->poll = poll;
if (weight > NAPI_POLL_WEIGHT)
netdev_err_once(dev, "%s() called with weight %d\n", __func__,
@@ -6363,6 +6399,8 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
goto out_unlock;
}
+ gro_normal_list(n);
+
if (n->gro_bitmask) {
/* flush too old packets
* If HZ < 1000, flush all packets.
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index f9204719aeee..dba52f53eace 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -569,6 +569,14 @@ static struct ctl_table net_core_table[] = {
.mode = 0644,
.proc_handler = proc_do_static_key,
},
+ {
+ .procname = "gro_normal_batch",
+ .data = &gro_normal_batch,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &one,
+ },
{ }
};
^ permalink raw reply related
* [PATCH bpf-next] libbpf : make libbpf_num_possible_cpus function thread safe
From: Takshak Chahande @ 2019-07-30 22:24 UTC (permalink / raw)
To: netdev; +Cc: ast, daniel, rdna, ctakshak, kernel-team, hechaol
Having static variable `cpus` in libbpf_num_possible_cpus function without
guarding it with mutex makes this function thread-unsafe.
If multiple threads accessing this function, in the current form; it
leads to incrementing the static variable value `cpus` in the multiple
of total available CPUs.
Let caching the number of possile CPUs handled by libbpf's users than
this library itself; and let this function be rock bottom one which reads
and parse the file (/sys/devices/system/cpu/possible) everytime it gets
called to simplify the things.
Fixes: 6446b3155521 (bpf: add a new API libbpf_num_possible_cpus())
Signed-off-by: Takshak Chahande <ctakshak@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
---
tools/lib/bpf/libbpf.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index ead915aec349..e7ac0e02287e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4998,14 +4998,11 @@ int libbpf_num_possible_cpus(void)
static const char *fcpu = "/sys/devices/system/cpu/possible";
int len = 0, n = 0, il = 0, ir = 0;
unsigned int start = 0, end = 0;
- static int cpus;
char buf[128];
int error = 0;
+ int cpus = 0;
int fd = -1;
- if (cpus > 0)
- return cpus;
-
fd = open(fcpu, O_RDONLY);
if (fd < 0) {
error = errno;
--
2.17.1
^ permalink raw reply related
* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: Jakub Kicinski @ 2019-07-30 22:39 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev, davem, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730085734.31504-2-jiri@resnulli.us>
On Tue, 30 Jul 2019 10:57:32 +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> All devlink instances are created in init_net and stay there for a
> lifetime. Allow user to be able to move devlink instances into
> namespaces.
>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
I'm no namespace expert, but seems reasonable, so FWIW:
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
If I read things right we will only send the devlink instance
notification to other namespaces when it moves, but not
notifications for sub-objects like ports. Is the expectation
that the user space dumps the objects it cares about or will
the other notifications be added as needed (or option 3 - I
misread the code).
I was also wondering it moving the devlink instance during a
multi-part transaction could be an issue. But AFAIU it should
be fine - the flashing doesn't release the instance lock, and
health stuff should complete correctly even if devlink is moved
half way through?
^ permalink raw reply
* Re: [patch net-next v2 2/3] net: devlink: export devlink net set/get helpers
From: Jakub Kicinski @ 2019-07-30 22:40 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev, davem, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730085734.31504-3-jiri@resnulli.us>
On Tue, 30 Jul 2019 10:57:33 +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> Allow drivers to set/get net struct for devlink instance. Set is only
> allowed for newly allocated devlink instance.
>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox