* Re: [patch iproute2 1/2] devlink: introduce cmdline option to switch to a different namespace
From: David Ahern @ 2019-07-29 20:21 UTC (permalink / raw)
To: Jiri Pirko, Toke Høiland-Jørgensen
Cc: netdev, davem, jakub.kicinski, sthemmin, mlxsw
In-Reply-To: <20190727102116.GC2843@nanopsycho>
On 7/27/19 4:21 AM, Jiri Pirko wrote:
>>> diff --git a/devlink/devlink.c b/devlink/devlink.c
>>> index d8197ea3a478..9242cc05ad0c 100644
>>> --- a/devlink/devlink.c
>>> +++ b/devlink/devlink.c
>>> @@ -32,6 +32,7 @@
>>> #include "mnlg.h"
>>> #include "json_writer.h"
>>> #include "utils.h"
>>> +#include "namespace.h"
>>>
>>> #define ESWITCH_MODE_LEGACY "legacy"
>>> #define ESWITCH_MODE_SWITCHDEV "switchdev"
>>> @@ -6332,7 +6333,7 @@ static int cmd_health(struct dl *dl)
>>> static void help(void)
>>> {
>>> pr_err("Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help }\n"
>>> - " devlink [ -f[orce] ] -b[atch] filename\n"
>>> + " devlink [ -f[orce] ] -b[atch] filename -N[etns]
>>> netnsname\n"
>>
>> 'ip' uses lower-case n for this; why not be consistent?
>
> Because "n" is taken :/
that's really unfortunate. commands within a package should have similar
syntax and -n/N are backwards between ip/tc and devlink. That's the
stuff that drives users crazy.
^ permalink raw reply
* Re: [PATCH bpf-next 00/10] CO-RE offset relocations
From: Song Liu @ 2019-07-29 20:20 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann,
Yonghong Song, Andrii Nakryiko, Kernel Team
In-Reply-To: <20190724192742.1419254-1-andriin@fb.com>
On Wed, Jul 24, 2019 at 1:34 PM Andrii Nakryiko <andriin@fb.com> wrote:
>
> This patch set implements central part of CO-RE (Compile Once - Run
> Everywhere, see [0] and [1] for slides and video): relocating field offsets.
> Most of the details are written down as comments to corresponding parts of the
> code.
>
> Patch #1 adds loading of .BTF.ext offset relocations section and macros to
> work with its contents.
> Patch #2 implements CO-RE relocations algorithm in libbpf.
> Patches #3-#10 adds selftests validating various parts of relocation handling,
> type compatibility, etc.
>
> For all tests to work, you'll need latest Clang/LLVM supporting
> __builtin_preserve_access_index intrinsic, used for recording offset
> relocations. Kernel on which selftests run should have BTF information built
> in (CONFIG_DEBUG_INFO_BTF=y).
>
> [0] http://vger.kernel.org/bpfconf2019.html#session-2
> [1] http://vger.kernel.org/lpc-bpf2018.html#session-2CO-RE relocations
>
> This patch set implements central part of CO-RE (Compile Once - Run
> Everywhere, see [0] and [1] for slides and video): relocating field offsets.
> Most of the details are written down as comments to corresponding parts of the
> code.
>
> Patch #1 adds loading of .BTF.ext offset relocations section and macros to
> work with its contents.
> Patch #2 implements CO-RE relocations algorithm in libbpf.
> Patches #3-#10 adds selftests validating various parts of relocation handling,
> type compatibility, etc.
>
> For all tests to work, you'll need latest Clang/LLVM supporting
> __builtin_preserve_access_index intrinsic, used for recording offset
> relocations. Kernel on which selftests run should have BTF information built
> in (CONFIG_DEBUG_INFO_BTF=y).
>
> [0] http://vger.kernel.org/bpfconf2019.html#session-2
> [1] http://vger.kernel.org/lpc-bpf2018.html#session-2
>
> Andrii Nakryiko (10):
> libbpf: add .BTF.ext offset relocation section loading
> libbpf: implement BPF CO-RE offset relocation algorithm
> selftests/bpf: add CO-RE relocs testing setup
> selftests/bpf: add CO-RE relocs struct flavors tests
> selftests/bpf: add CO-RE relocs nesting tests
> selftests/bpf: add CO-RE relocs array tests
> selftests/bpf: add CO-RE relocs enum/ptr/func_proto tests
> selftests/bpf: add CO-RE relocs modifiers/typedef tests
> selftest/bpf: add CO-RE relocs ptr-as-array tests
> selftests/bpf: add CO-RE relocs ints tests
>
> tools/lib/bpf/btf.c | 64 +-
> tools/lib/bpf/btf.h | 4 +
> tools/lib/bpf/libbpf.c | 866 +++++++++++++++++-
> tools/lib/bpf/libbpf.h | 1 +
> tools/lib/bpf/libbpf_internal.h | 91 ++
> .../selftests/bpf/prog_tests/core_reloc.c | 363 ++++++++
> .../bpf/progs/btf__core_reloc_arrays.c | 3 +
> .../btf__core_reloc_arrays___diff_arr_dim.c | 3 +
> ...btf__core_reloc_arrays___diff_arr_val_sz.c | 3 +
> .../btf__core_reloc_arrays___err_non_array.c | 3 +
> ...btf__core_reloc_arrays___err_too_shallow.c | 3 +
> .../btf__core_reloc_arrays___err_too_small.c | 3 +
> ..._core_reloc_arrays___err_wrong_val_type1.c | 3 +
> ..._core_reloc_arrays___err_wrong_val_type2.c | 3 +
> .../bpf/progs/btf__core_reloc_flavors.c | 3 +
> .../btf__core_reloc_flavors__err_wrong_name.c | 3 +
> .../bpf/progs/btf__core_reloc_ints.c | 3 +
> .../bpf/progs/btf__core_reloc_ints___bool.c | 3 +
> .../btf__core_reloc_ints___err_bitfield.c | 3 +
> .../btf__core_reloc_ints___err_wrong_sz_16.c | 3 +
> .../btf__core_reloc_ints___err_wrong_sz_32.c | 3 +
> .../btf__core_reloc_ints___err_wrong_sz_64.c | 3 +
> .../btf__core_reloc_ints___err_wrong_sz_8.c | 3 +
> .../btf__core_reloc_ints___reverse_sign.c | 3 +
> .../bpf/progs/btf__core_reloc_mods.c | 3 +
> .../progs/btf__core_reloc_mods___mod_swap.c | 3 +
> .../progs/btf__core_reloc_mods___typedefs.c | 3 +
> .../bpf/progs/btf__core_reloc_nesting.c | 3 +
> .../btf__core_reloc_nesting___anon_embed.c | 3 +
> ...f__core_reloc_nesting___dup_compat_types.c | 5 +
> ...core_reloc_nesting___err_array_container.c | 3 +
> ...tf__core_reloc_nesting___err_array_field.c | 3 +
> ...e_reloc_nesting___err_dup_incompat_types.c | 4 +
> ...re_reloc_nesting___err_missing_container.c | 3 +
> ...__core_reloc_nesting___err_missing_field.c | 3 +
> ..._reloc_nesting___err_nonstruct_container.c | 3 +
> ...e_reloc_nesting___err_partial_match_dups.c | 4 +
> .../btf__core_reloc_nesting___err_too_deep.c | 3 +
> .../btf__core_reloc_nesting___extra_nesting.c | 3 +
> ..._core_reloc_nesting___struct_union_mixup.c | 3 +
> .../bpf/progs/btf__core_reloc_primitives.c | 3 +
> ...f__core_reloc_primitives___diff_enum_def.c | 3 +
> ..._core_reloc_primitives___diff_func_proto.c | 3 +
> ...f__core_reloc_primitives___diff_ptr_type.c | 3 +
> ...tf__core_reloc_primitives___err_non_enum.c | 3 +
> ...btf__core_reloc_primitives___err_non_int.c | 3 +
> ...btf__core_reloc_primitives___err_non_ptr.c | 3 +
> .../bpf/progs/btf__core_reloc_ptr_as_arr.c | 3 +
> .../btf__core_reloc_ptr_as_arr___diff_sz.c | 3 +
> .../selftests/bpf/progs/core_reloc_types.h | 642 +++++++++++++
> .../bpf/progs/test_core_reloc_arrays.c | 58 ++
> .../bpf/progs/test_core_reloc_flavors.c | 65 ++
> .../bpf/progs/test_core_reloc_ints.c | 48 +
> .../bpf/progs/test_core_reloc_kernel.c | 39 +
> .../bpf/progs/test_core_reloc_mods.c | 68 ++
> .../bpf/progs/test_core_reloc_nesting.c | 48 +
> .../bpf/progs/test_core_reloc_primitives.c | 50 +
> .../bpf/progs/test_core_reloc_ptr_as_arr.c | 34 +
> 58 files changed, 2527 insertions(+), 47 deletions(-)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/core_reloc.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays___diff_arr_dim.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays___diff_arr_val_sz.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays___err_non_array.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays___err_too_shallow.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays___err_too_small.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays___err_wrong_val_type1.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_arrays___err_wrong_val_type2.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_flavors.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_flavors__err_wrong_name.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints___bool.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints___err_bitfield.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints___err_wrong_sz_16.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints___err_wrong_sz_32.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints___err_wrong_sz_64.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints___err_wrong_sz_8.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ints___reverse_sign.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_mods.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_mods___mod_swap.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_mods___typedefs.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___anon_embed.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___dup_compat_types.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_array_container.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_array_field.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_dup_incompat_types.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_missing_container.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_missing_field.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_nonstruct_container.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_partial_match_dups.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___err_too_deep.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___extra_nesting.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_nesting___struct_union_mixup.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_primitives.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_primitives___diff_enum_def.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_primitives___diff_func_proto.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_primitives___diff_ptr_type.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_primitives___err_non_enum.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_primitives___err_non_int.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_primitives___err_non_ptr.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ptr_as_arr.c
> create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_ptr_as_arr___diff_sz.c
> create mode 100644 tools/testing/selftests/bpf/progs/core_reloc_types.h
> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_arrays.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_flavors.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_ints.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_kernel.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_mods.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_nesting.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_primitives.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_ptr_as_arr.c
We have created a lot of small files. Would it be cleaner if we can
somehow put these
data in one file (maybe different sections?).
Alternatively, maybe create a folder for these files:
tools/testing/selftests/bpf/progs/core/
Thanks,
Song
^ permalink raw reply
* Re: [patch iproute2 1/2] devlink: introduce cmdline option to switch to a different namespace
From: David Ahern @ 2019-07-29 20:17 UTC (permalink / raw)
To: Jiri Pirko, netdev; +Cc: davem, jakub.kicinski, sthemmin, mlxsw
In-Reply-To: <20190727100544.28649-1-jiri@resnulli.us>
subject line should have iproute2-next
^ permalink raw reply
* Re: [patch net-next 0/3] net: devlink: Finish network namespace support
From: David Ahern @ 2019-07-29 20:17 UTC (permalink / raw)
To: Jiri Pirko, netdev; +Cc: davem, jakub.kicinski, sthemmin, mlxsw
In-Reply-To: <20190727094459.26345-1-jiri@resnulli.us>
On 7/27/19 3:44 AM, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> Devlink from the beginning counts with network namespaces, but the
> instances has been fixed to init_net. The first patch allows user
> to move existing devlink instances into namespaces:
>
so you intend for an asic, for example, to have multiple devlink
instances where each instance governs a set of related ports (e.g.,
ports that share a set of hardware resources) and those instances can be
managed from distinct network namespaces?
^ permalink raw reply
* [PATCH] net: hamradio: baycom_epp: Mark expected switch fall-through
From: Gustavo A. R. Silva @ 2019-07-29 20:12 UTC (permalink / raw)
To: Thomas Sailer, David S. Miller
Cc: linux-hams, netdev, linux-kernel, Gustavo A. R. Silva, Kees Cook
Mark switch cases where we are expecting to fall through.
This patch fixes the following warning (Building: i386):
drivers/net/hamradio/baycom_epp.c: In function ‘transmit’:
drivers/net/hamradio/baycom_epp.c:491:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (i) {
^
drivers/net/hamradio/baycom_epp.c:504:3: note: here
default: /* fall through */
^~~~~~~
Notice that, in this particular case, the code comment is
modified in accordance with what GCC is expecting to find.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
drivers/net/hamradio/baycom_epp.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/hamradio/baycom_epp.c b/drivers/net/hamradio/baycom_epp.c
index daab2c07d891..9303aeb2595f 100644
--- a/drivers/net/hamradio/baycom_epp.c
+++ b/drivers/net/hamradio/baycom_epp.c
@@ -500,8 +500,9 @@ static int transmit(struct baycom_state *bc, int cnt, unsigned char stat)
}
break;
}
+ /* fall through */
- default: /* fall through */
+ default:
if (bc->hdlctx.calibrate <= 0)
return 0;
i = min_t(int, cnt, bc->hdlctx.calibrate);
--
2.22.0
^ permalink raw reply related
* Re: [PATCH v4 1/5] vsock/virtio: limit the memory used per-socket
From: Michael S. Tsirkin @ 2019-07-29 19:33 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
virtualization, Jason Wang, kvm
In-Reply-To: <CAGxU2F5F1KcaFNJ6n7++ApZiYMGnoEWKVRgo3Vc4h5hpxSJEZg@mail.gmail.com>
On Mon, Jul 29, 2019 at 06:41:27PM +0200, Stefano Garzarella wrote:
> On Mon, Jul 29, 2019 at 12:01:37PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 29, 2019 at 05:36:56PM +0200, Stefano Garzarella wrote:
> > > On Mon, Jul 29, 2019 at 10:04:29AM -0400, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 17, 2019 at 01:30:26PM +0200, Stefano Garzarella wrote:
> > > > > Since virtio-vsock was introduced, the buffers filled by the host
> > > > > and pushed to the guest using the vring, are directly queued in
> > > > > a per-socket list. These buffers are preallocated by the guest
> > > > > with a fixed size (4 KB).
> > > > >
> > > > > The maximum amount of memory used by each socket should be
> > > > > controlled by the credit mechanism.
> > > > > The default credit available per-socket is 256 KB, but if we use
> > > > > only 1 byte per packet, the guest can queue up to 262144 of 4 KB
> > > > > buffers, using up to 1 GB of memory per-socket. In addition, the
> > > > > guest will continue to fill the vring with new 4 KB free buffers
> > > > > to avoid starvation of other sockets.
> > > > >
> > > > > This patch mitigates this issue copying the payload of small
> > > > > packets (< 128 bytes) into the buffer of last packet queued, in
> > > > > order to avoid wasting memory.
> > > > >
> > > > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > > >
> > > > This is good enough for net-next, but for net I think we
> > > > should figure out how to address the issue completely.
> > > > Can we make the accounting precise? What happens to
> > > > performance if we do?
> > > >
> > >
> > > In order to do more precise accounting maybe we can use the buffer size,
> > > instead of payload size when we update the credit available.
> > > In this way, the credit available for each socket will reflect the memory
> > > actually used.
> > >
> > > I should check better, because I'm not sure what happen if the peer sees
> > > 1KB of space available, then it sends 1KB of payload (using a 4KB
> > > buffer).
> > > The other option is to copy each packet in a new buffer like I did in
> > > the v2 [2], but this forces us to make a copy for each packet that does
> > > not fill the entire buffer, perhaps too expensive.
> > >
> > > [2] https://patchwork.kernel.org/patch/10938741/
> > >
> >
> > So one thing we can easily do is to under-report the
> > available credit. E.g. if we copy up to 256bytes,
> > then report just 256bytes for every buffer in the queue.
> >
>
> Ehm sorry, I got lost :(
> Can you explain better?
>
>
> Thanks,
> Stefano
I think I suggested a better idea more recently.
But to clarify this option: we are adding a 4K buffer.
Let's say we know we will always copy 128 bytes.
So we just tell remote we have 128.
If we add another 4K buffer we add another 128 credits.
So we are charging local socket 16x more (4k for a 128 byte packet) but
we are paying remote 16x less (128 credits for 4k byte buffer). It evens
out.
Way less credits to go around so I'm not sure it's a good idea,
at least as the only solution. Can be combined with other
optimizations and probably in a less drastic fashion
(e.g. 2x rather than 16x).
--
MST
^ permalink raw reply
* Re: [net-next 1/2] ipvs: batch __ip_vs_cleanup
From: Julian Anastasov @ 2019-07-29 20:03 UTC (permalink / raw)
To: Haishuang Yan
Cc: David S. Miller, Pablo Neira Ayuso, Simon Horman, netdev,
lvs-devel, linux-kernel, netfilter-devel
In-Reply-To: <8441EA26-E197-4F40-A6D7-5B7D59AA7F7F@cmss.chinamobile.com>
Hello,
On Thu, 18 Jul 2019, Haishuang Yan wrote:
> As the following benchmark testing results show, there is a little performance improvement:
OK, can you send v2 after removing the LIST_HEAD(list) from
both patches, I guess, it is not needed. If you prefer, you can
include these benchmark results too.
> $ cat add_del_unshare.sh
> #!/bin/bash
>
> for i in `seq 1 100`
> do
> (for j in `seq 1 40` ; do unshare -n ipvsadm -A -t 172.16.$i.$j:80 >/dev/null ; done) &
> done
> wait; grep net_namespace /proc/slabinfo
>
> Befor patch:
> $ time sh add_del_unshare.sh
> net_namespace 4020 4020 4736 6 8 : tunables 0 0 0 : slabdata 670 670 0
>
> real 0m8.086s
> user 0m2.025s
> sys 0m36.956s
>
> After patch:
> $ time sh add_del_unshare.sh
> net_namespace 4020 4020 4736 6 8 : tunables 0 0 0 : slabdata 670 670 0
>
> real 0m7.623s
> user 0m2.003s
> sys 0m32.935s
>
>
> >
> >> + ipvs = net_ipvs(net);
> >> + ip_vs_conn_net_cleanup(ipvs);
> >> + ip_vs_app_net_cleanup(ipvs);
> >> + ip_vs_protocol_net_cleanup(ipvs);
> >> + ip_vs_control_net_cleanup(ipvs);
> >> + ip_vs_estimator_net_cleanup(ipvs);
> >> + IP_VS_DBG(2, "ipvs netns %d released\n", ipvs->gen);
> >> + net->ipvs = NULL;
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply
* Re: [PATCH bpf-next 01/10] libbpf: add .BTF.ext offset relocation section loading
From: Song Liu @ 2019-07-29 20:00 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
Daniel Borkmann, Yonghong Song, Kernel Team
In-Reply-To: <CAEf4BzYgyAiPt0wVESrWSJ_tLheq0BRWLgrqMfLZsnp11+F77Q@mail.gmail.com>
> On Jul 26, 2019, at 10:11 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 10:20 PM Song Liu <songliubraving@fb.com> wrote:
>>
>>
>>
>>> On Jul 24, 2019, at 5:37 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>>>
>>> On Wed, Jul 24, 2019 at 5:00 PM Song Liu <songliubraving@fb.com> wrote:
>>>>
>>>>
>>>>
>>>>> On Jul 24, 2019, at 12:27 PM, Andrii Nakryiko <andriin@fb.com> wrote:
>>>>>
>>>>> Add support for BPF CO-RE offset relocations. Add section/record
>>>>> iteration macros for .BTF.ext. These macro are useful for iterating over
>>>>> each .BTF.ext record, either for dumping out contents or later for BPF
>>>>> CO-RE relocation handling.
>>>>>
>>>>> To enable other parts of libbpf to work with .BTF.ext contents, moved
>>>>> a bunch of type definitions into libbpf_internal.h.
>>>>>
>>>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>>>>> ---
>>>>> tools/lib/bpf/btf.c | 64 +++++++++--------------
>>>>> tools/lib/bpf/btf.h | 4 ++
>>>>> tools/lib/bpf/libbpf_internal.h | 91 +++++++++++++++++++++++++++++++++
>>>>> 3 files changed, 118 insertions(+), 41 deletions(-)
>>>>>
>>>
>>> [...]
>>>
>>>>> +
>>>>> static int btf_ext_parse_hdr(__u8 *data, __u32 data_size)
>>>>> {
>>>>> const struct btf_ext_header *hdr = (struct btf_ext_header *)data;
>>>>> @@ -1004,6 +979,13 @@ struct btf_ext *btf_ext__new(__u8 *data, __u32 size)
>>>>> if (err)
>>>>> goto done;
>>>>>
>>>>> + /* check if there is offset_reloc_off/offset_reloc_len fields */
>>>>> + if (btf_ext->hdr->hdr_len < sizeof(struct btf_ext_header))
>>>>
>>>> This check will break when we add more optional sections to btf_ext_header.
>>>> Maybe use offsetof() instead?
>>>
>>> I didn't do it, because there are no fields after offset_reloc_len.
>>> But now I though that maybe it would be ok to add zero-sized marker
>>> field, kind of like marking off various versions of btf_ext header?
>>>
>>> Alternatively, I can add offsetofend() macro somewhere in libbpf_internal.h.
>>>
>>> Do you have any preference?
>>
>> We only need a stable number to compare against. offsetofend() works.
>> Or we can simply have something like
>>
>> if (btf_ext->hdr->hdr_len <= offsetof(struct btf_ext_header, offset_reloc_off))
>> goto done;
>> or
>> if (btf_ext->hdr->hdr_len < offsetof(struct btf_ext_header, offset_reloc_len))
>> goto done;
>>
>> Does this make sense?
>
> I think offsetofend() is the cleanest solution, I'll do just that.
Agreed that offsetofend() is the best.
Song
^ permalink raw reply
* Re: [PATCH bpf-next 02/10] libbpf: implement BPF CO-RE offset relocation algorithm
From: Song Liu @ 2019-07-29 19:56 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann,
Yonghong Song, Andrii Nakryiko, Kernel Team
In-Reply-To: <20190724192742.1419254-3-andriin@fb.com>
> On Jul 24, 2019, at 12:27 PM, Andrii Nakryiko <andriin@fb.com> wrote:
>
> This patch implements the core logic for BPF CO-RE offsets relocations.
> All the details are described in code comments.
>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
> tools/lib/bpf/libbpf.c | 866 ++++++++++++++++++++++++++++++++++++++++-
> tools/lib/bpf/libbpf.h | 1 +
> 2 files changed, 861 insertions(+), 6 deletions(-)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 8741c39adb1c..86d87bf10d46 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -38,6 +38,7 @@
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <sys/vfs.h>
> +#include <sys/utsname.h>
> #include <tools/libc_compat.h>
> #include <libelf.h>
> #include <gelf.h>
> @@ -47,6 +48,7 @@
> #include "btf.h"
> #include "str_error.h"
> #include "libbpf_internal.h"
> +#include "hashmap.h"
>
> #ifndef EM_BPF
> #define EM_BPF 247
> @@ -1013,16 +1015,22 @@ static int bpf_object__init_user_maps(struct bpf_object *obj, bool strict)
> }
>
> static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> - __u32 id)
> + __u32 id,
> + __u32 *res_id)
> {
> const struct btf_type *t = btf__type_by_id(btf, id);
>
> + if (res_id)
> + *res_id = id;
> +
> while (true) {
> switch (BTF_INFO_KIND(t->info)) {
> case BTF_KIND_VOLATILE:
> case BTF_KIND_CONST:
> case BTF_KIND_RESTRICT:
> case BTF_KIND_TYPEDEF:
> + if (res_id)
> + *res_id = t->type;
> t = btf__type_by_id(btf, t->type);
> break;
> default:
> @@ -1041,7 +1049,7 @@ static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> static bool get_map_field_int(const char *map_name, const struct btf *btf,
> const struct btf_type *def,
> const struct btf_member *m, __u32 *res) {
> - const struct btf_type *t = skip_mods_and_typedefs(btf, m->type);
> + const struct btf_type *t = skip_mods_and_typedefs(btf, m->type, NULL);
> const char *name = btf__name_by_offset(btf, m->name_off);
> const struct btf_array *arr_info;
> const struct btf_type *arr_t;
> @@ -1107,7 +1115,7 @@ static int bpf_object__init_user_btf_map(struct bpf_object *obj,
> return -EOPNOTSUPP;
> }
>
> - def = skip_mods_and_typedefs(obj->btf, var->type);
> + def = skip_mods_and_typedefs(obj->btf, var->type, NULL);
> if (BTF_INFO_KIND(def->info) != BTF_KIND_STRUCT) {
> pr_warning("map '%s': unexpected def kind %u.\n",
> map_name, BTF_INFO_KIND(var->info));
> @@ -2289,6 +2297,845 @@ bpf_program_reloc_btf_ext(struct bpf_program *prog, struct bpf_object *obj,
> return 0;
> }
>
> +#define BPF_CORE_SPEC_MAX_LEN 64
> +
> +/* represents BPF CO-RE field or array element accessor */
> +struct bpf_core_accessor {
> + __u32 type_id; /* struct/union type or array element type */
> + __u32 idx; /* field index or array index */
> + const char *name; /* field name or NULL for array accessor */
> +};
> +
> +struct bpf_core_spec {
> + const struct btf *btf;
> + /* high-level spec: named fields and array indicies only */
> + struct bpf_core_accessor spec[BPF_CORE_SPEC_MAX_LEN];
> + /* high-level spec length */
> + int len;
> + /* raw, low-level spec: 1-to-1 with accessor spec string */
> + int raw_spec[BPF_CORE_SPEC_MAX_LEN];
> + /* raw spec length */
> + int raw_len;
> + /* field byte offset represented by spec */
> + __u32 offset;
> +};
> +
> +static bool str_is_empty(const char *s)
> +{
> + return !s || !s[0];
> +}
> +
> +static int btf_kind(const struct btf_type *t)
> +{
> + return BTF_INFO_KIND(t->info);
> +}
> +
> +static bool btf_is_composite(const struct btf_type *t)
> +{
> + int kind = btf_kind(t);
> +
> + return kind == BTF_KIND_STRUCT || kind == BTF_KIND_UNION;
> +}
> +
> +static bool btf_is_array(const struct btf_type *t)
> +{
> + return btf_kind(t) == BTF_KIND_ARRAY;
> +}
> +
> +/*
> + * Turn bpf_offset_reloc into a low- and high-level spec representation,
> + * validating correctness along the way, as well as calculating resulting
> + * field offset (in bytes), specified by accessor string. Low-level spec
> + * captures every single level of nestedness, including traversing anonymous
> + * struct/union members. High-level one only captures semantically meaningful
> + * "turning points": named fields and array indicies.
> + * E.g., for this case:
> + *
> + * struct sample {
> + * int __unimportant;
> + * struct {
> + * int __1;
> + * int __2;
> + * int a[7];
> + * };
> + * };
> + *
> + * struct sample *s = ...;
> + *
> + * int x = &s->a[3]; // access string = '0:1:2:3'
> + *
> + * Low-level spec has 1:1 mapping with each element of access string (it's
> + * just a parsed access string representation): [0, 1, 2, 3].
> + *
> + * High-level spec will capture only 3 points:
> + * - intial zero-index access by pointer (&s->... is the same as &s[0]...);
> + * - field 'a' access (corresponds to '2' in low-level spec);
> + * - array element #3 access (corresponds to '3' in low-level spec).
> + *
> + */
> +static int bpf_core_spec_parse(const struct btf *btf,
> + __u32 type_id,
> + const char *spec_str,
> + struct bpf_core_spec *spec)
> +{
> + int access_idx, parsed_len, i;
> + const struct btf_type *t;
> + __u32 id = type_id;
> + const char *name;
> + __s64 sz;
> +
> + if (str_is_empty(spec_str) || *spec_str == ':')
> + return -EINVAL;
> +
> + memset(spec, 0, sizeof(*spec));
> + spec->btf = btf;
> +
> + /* parse spec_str="0:1:2:3:4" into array raw_spec=[0, 1, 2, 3, 4] */
> + while (*spec_str) {
> + if (*spec_str == ':')
> + ++spec_str;
> + if (sscanf(spec_str, "%d%n", &access_idx, &parsed_len) != 1)
> + return -EINVAL;
> + if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> + return -E2BIG;
> + spec_str += parsed_len;
> + spec->raw_spec[spec->raw_len++] = access_idx;
> + }
> +
> + if (spec->raw_len == 0)
> + return -EINVAL;
> +
> + for (i = 0; i < spec->raw_len; i++) {
> + t = skip_mods_and_typedefs(btf, id, &id);
> + if (!t)
> + return -EINVAL;
> +
> + access_idx = spec->raw_spec[i];
> +
> + if (i == 0) {
> + /* first spec value is always reloc type array index */
> + spec->spec[spec->len].type_id = id;
> + spec->spec[spec->len].idx = access_idx;
> + spec->len++;
> +
> + sz = btf__resolve_size(btf, id);
> + if (sz < 0)
> + return sz;
> + spec->offset += access_idx * sz;
> + continue;
> + }
> +
> + if (btf_is_composite(t)) {
> + const struct btf_member *m = (void *)(t + 1);
> + __u32 offset;
> +
> + if (access_idx >= BTF_INFO_VLEN(t->info))
> + return -EINVAL;
> +
> + m = &m[access_idx];
> +
> + if (BTF_INFO_KFLAG(t->info)) {
> + if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> + return -EINVAL;
> + offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> + } else {
> + offset = m->offset;
> + }
> + if (m->offset % 8)
> + return -EINVAL;
> + spec->offset += offset / 8;
> +
> + if (m->name_off) {
> + name = btf__name_by_offset(btf, m->name_off);
> + if (str_is_empty(name))
> + return -EINVAL;
> +
> + spec->spec[spec->len].type_id = id;
> + spec->spec[spec->len].idx = access_idx;
> + spec->spec[spec->len].name = name;
> + spec->len++;
> + }
> +
> + id = m->type;
> + } else if (btf_is_array(t)) {
> + const struct btf_array *a = (void *)(t + 1);
> +
> + t = skip_mods_and_typedefs(btf, a->type, &id);
> + if (!t || access_idx >= a->nelems)
> + return -EINVAL;
> +
> + spec->spec[spec->len].type_id = id;
> + spec->spec[spec->len].idx = access_idx;
> + spec->len++;
> +
> + sz = btf__resolve_size(btf, id);
> + if (sz < 0)
> + return sz;
> + spec->offset += access_idx * sz;
> + } else {
> + pr_warning("relo for [%u] %s (at idx %d) captures type [%d] of unexpected kind %d\n",
> + type_id, spec_str, i, id, btf_kind(t));
> + return -EINVAL;
> + }
> + }
> +
> + if (spec->len == 0)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +/* Given 'some_struct_name___with_flavor' return the length of a name prefix
> + * before last triple underscore. Struct name part after last triple
> + * underscore is ignored by BPF CO-RE relocation during relocation matching.
> + */
> +static size_t bpf_core_essential_name_len(const char *name)
> +{
> + size_t n = strlen(name);
> + int i = n - 3;
> +
> + while (i > 0) {
> + if (name[i] == '_' && name[i + 1] == '_' && name[i + 2] == '_')
> + return i;
> + i--;
> + }
> + return n;
> +}
> +
> +/* dynamically sized list of type IDs */
> +struct ids_vec {
> + __u32 *data;
> + int len;
> +};
> +
> +static void bpf_core_free_cands(struct ids_vec *cand_ids)
> +{
> + free(cand_ids->data);
> + free(cand_ids);
> +}
> +
> +static struct ids_vec *bpf_core_find_cands(const struct btf *local_btf,
> + __u32 local_type_id,
> + const struct btf *targ_btf)
> +{
> + size_t local_essent_len, targ_essent_len;
> + const char *local_name, *targ_name;
> + const struct btf_type *t;
> + struct ids_vec *cand_ids;
> + __u32 *new_ids;
> + int i, err, n;
> +
> + t = btf__type_by_id(local_btf, local_type_id);
> + if (!t)
> + return ERR_PTR(-EINVAL);
> +
> + local_name = btf__name_by_offset(local_btf, t->name_off);
> + if (str_is_empty(local_name))
> + return ERR_PTR(-EINVAL);
> + local_essent_len = bpf_core_essential_name_len(local_name);
> +
> + cand_ids = calloc(1, sizeof(*cand_ids));
> + if (!cand_ids)
> + return ERR_PTR(-ENOMEM);
> +
> + n = btf__get_nr_types(targ_btf);
> + for (i = 1; i <= n; i++) {
> + t = btf__type_by_id(targ_btf, i);
> + targ_name = btf__name_by_offset(targ_btf, t->name_off);
> + if (str_is_empty(targ_name))
> + continue;
> +
> + targ_essent_len = bpf_core_essential_name_len(targ_name);
> + if (targ_essent_len != local_essent_len)
> + continue;
> +
> + if (strncmp(local_name, targ_name, local_essent_len) == 0) {
> + pr_debug("[%d] (%s): found candidate [%d] (%s)\n",
> + local_type_id, local_name, i, targ_name);
> + new_ids = realloc(cand_ids->data, cand_ids->len + 1);
> + if (!new_ids) {
> + err = -ENOMEM;
> + goto err_out;
> + }
> + cand_ids->data = new_ids;
> + cand_ids->data[cand_ids->len++] = i;
> + }
> + }
> + return cand_ids;
> +err_out:
> + bpf_core_free_cands(cand_ids);
> + return ERR_PTR(err);
> +}
> +
> +/* Check two types for compatibility, skipping const/volatile/restrict and
> + * typedefs, to ensure we are relocating offset to the compatible entities:
> + * - any two STRUCTs/UNIONs are compatible and can be mixed;
> + * - any two FWDs are compatible;
> + * - any two PTRs are always compatible;
> + * - for ENUMs, check sizes, names are ignored;
> + * - for INT, size and bitness should match, signedness is ignored;
> + * - for ARRAY, dimensionality is ignored, element types are checked for
> + * compatibility recursively;
> + * - everything else shouldn't be ever a target of relocation.
> + * These rules are not set in stone and probably will be adjusted as we get
> + * more experience with using BPF CO-RE relocations.
> + */
> +static int bpf_core_fields_are_compat(const struct btf *local_btf,
> + __u32 local_id,
> + const struct btf *targ_btf,
> + __u32 targ_id)
> +{
> + const struct btf_type *local_type, *targ_type;
> + __u16 kind;
> +
> +recur:
> + local_type = skip_mods_and_typedefs(local_btf, local_id, &local_id);
> + targ_type = skip_mods_and_typedefs(targ_btf, targ_id, &targ_id);
> + if (!local_type || !targ_type)
> + return -EINVAL;
> +
> + if (btf_is_composite(local_type) && btf_is_composite(targ_type))
> + return 1;
> + if (BTF_INFO_KIND(local_type->info) != BTF_INFO_KIND(targ_type->info))
> + return 0;
> +
> + kind = BTF_INFO_KIND(local_type->info);
> + switch (kind) {
> + case BTF_KIND_FWD:
> + case BTF_KIND_PTR:
> + return 1;
> + case BTF_KIND_ENUM:
> + return local_type->size == targ_type->size;
> + case BTF_KIND_INT: {
> + __u32 loc_int = *(__u32 *)(local_type + 1);
> + __u32 targ_int = *(__u32 *)(targ_type + 1);
> +
> + return BTF_INT_OFFSET(loc_int) == 0 &&
> + BTF_INT_OFFSET(targ_int) == 0 &&
> + local_type->size == targ_type->size &&
> + BTF_INT_BITS(loc_int) == BTF_INT_BITS(targ_int);
> + }
> + case BTF_KIND_ARRAY: {
> + const struct btf_array *loc_a, *targ_a;
> +
> + loc_a = (void *)(local_type + 1);
> + targ_a = (void *)(targ_type + 1);
> + local_id = loc_a->type;
> + targ_id = targ_a->type;
> + goto recur;
> + }
> + default:
> + pr_warning("unexpected kind %d relocated, local [%d], target [%d]\n",
> + kind, local_id, targ_id);
> + return 0;
> + }
> +}
> +
> +/*
> + * Given single high-level accessor (either named field or array index) in
> + * local type, find corresponding high-level accessor for a target type. Along
> + * the way, maintain low-level spec for target as well. Also keep updating
> + * target offset.
> + */
> +static int bpf_core_match_member(const struct btf *local_btf,
> + const struct bpf_core_accessor *local_acc,
> + const struct btf *targ_btf,
> + __u32 targ_id,
> + struct bpf_core_spec *spec,
> + __u32 *next_targ_id)
> +{
> + const struct btf_type *local_type, *targ_type;
> + const struct btf_member *local_member, *m;
> + const char *local_name, *targ_name;
> + __u32 local_id;
> + int i, n, found;
> +
> + targ_type = skip_mods_and_typedefs(targ_btf, targ_id, &targ_id);
> + if (!targ_type)
> + return -EINVAL;
> + if (!btf_is_composite(targ_type))
> + return 0;
> +
> + local_id = local_acc->type_id;
> + local_type = btf__type_by_id(local_btf, local_id);
> + local_member = (void *)(local_type + 1);
> + local_member += local_acc->idx;
> + local_name = btf__name_by_offset(local_btf, local_member->name_off);
> +
> + n = BTF_INFO_VLEN(targ_type->info);
> + m = (void *)(targ_type + 1);
> + for (i = 0; i < n; i++, m++) {
> + __u32 offset;
> +
> + /* bitfield relocations not supported */
> + if (BTF_INFO_KFLAG(targ_type->info)) {
> + if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> + continue;
> + offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> + } else {
> + offset = m->offset;
> + }
> + if (offset % 8)
> + continue;
> +
> + /* too deep struct/union/array nesting */
> + if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> + return -E2BIG;
> +
> + /* speculate this member will be the good one */
> + spec->offset += offset / 8;
> + spec->raw_spec[spec->raw_len++] = i;
> +
> + targ_name = btf__name_by_offset(targ_btf, m->name_off);
> + if (str_is_empty(targ_name)) {
> + /* embedded struct/union, we need to go deeper */
> + found = bpf_core_match_member(local_btf, local_acc,
> + targ_btf, m->type,
> + spec, next_targ_id);
> + if (found) /* either found or error */
> + return found;
> + } else if (strcmp(local_name, targ_name) == 0) {
> + /* matching named field */
> + struct bpf_core_accessor *targ_acc;
> +
> + targ_acc = &spec->spec[spec->len++];
> + targ_acc->type_id = targ_id;
> + targ_acc->idx = i;
> + targ_acc->name = targ_name;
> +
> + *next_targ_id = m->type;
> + found = bpf_core_fields_are_compat(local_btf,
> + local_member->type,
> + targ_btf, m->type);
> + if (!found)
> + spec->len--; /* pop accessor */
> + return found;
> + }
> + /* member turned out to be not we looked for */
/* member turned out not to be what we looked for */
Or something similar.
The rest of this patch looks good to me.
Thanks,
Song
^ permalink raw reply
* Re: [PATCH v4 1/5] vsock/virtio: limit the memory used per-socket
From: Michael S. Tsirkin @ 2019-07-29 19:10 UTC (permalink / raw)
To: Stefano Garzarella
Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
virtualization, Jason Wang, kvm
In-Reply-To: <20190729165056.r32uzj6om3o6vfvp@steredhat>
On Mon, Jul 29, 2019 at 06:50:56PM +0200, Stefano Garzarella wrote:
> On Mon, Jul 29, 2019 at 06:19:03PM +0200, Stefano Garzarella wrote:
> > On Mon, Jul 29, 2019 at 11:49:02AM -0400, Michael S. Tsirkin wrote:
> > > On Mon, Jul 29, 2019 at 05:36:56PM +0200, Stefano Garzarella wrote:
> > > > On Mon, Jul 29, 2019 at 10:04:29AM -0400, Michael S. Tsirkin wrote:
> > > > > On Wed, Jul 17, 2019 at 01:30:26PM +0200, Stefano Garzarella wrote:
> > > > > > Since virtio-vsock was introduced, the buffers filled by the host
> > > > > > and pushed to the guest using the vring, are directly queued in
> > > > > > a per-socket list. These buffers are preallocated by the guest
> > > > > > with a fixed size (4 KB).
> > > > > >
> > > > > > The maximum amount of memory used by each socket should be
> > > > > > controlled by the credit mechanism.
> > > > > > The default credit available per-socket is 256 KB, but if we use
> > > > > > only 1 byte per packet, the guest can queue up to 262144 of 4 KB
> > > > > > buffers, using up to 1 GB of memory per-socket. In addition, the
> > > > > > guest will continue to fill the vring with new 4 KB free buffers
> > > > > > to avoid starvation of other sockets.
> > > > > >
> > > > > > This patch mitigates this issue copying the payload of small
> > > > > > packets (< 128 bytes) into the buffer of last packet queued, in
> > > > > > order to avoid wasting memory.
> > > > > >
> > > > > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > > > >
> > > > > This is good enough for net-next, but for net I think we
> > > > > should figure out how to address the issue completely.
> > > > > Can we make the accounting precise? What happens to
> > > > > performance if we do?
> > > > >
> > > >
> > > > In order to do more precise accounting maybe we can use the buffer size,
> > > > instead of payload size when we update the credit available.
> > > > In this way, the credit available for each socket will reflect the memory
> > > > actually used.
> > > >
> > > > I should check better, because I'm not sure what happen if the peer sees
> > > > 1KB of space available, then it sends 1KB of payload (using a 4KB
> > > > buffer).
> > > >
> > > > The other option is to copy each packet in a new buffer like I did in
> > > > the v2 [2], but this forces us to make a copy for each packet that does
> > > > not fill the entire buffer, perhaps too expensive.
> > > >
> > > > [2] https://patchwork.kernel.org/patch/10938741/
> > > >
> > > >
> > > > Thanks,
> > > > Stefano
> > >
> > > Interesting. You are right, and at some level the protocol forces copies.
> > >
> > > We could try to detect that the actual memory is getting close to
> > > admin limits and force copies on queued packets after the fact.
> > > Is that practical?
> >
> > Yes, I think it is doable!
> > We can decrease the credit available with the buffer size queued, and
> > when the buffer size of packet to queue is bigger than the credit
> > available, we can copy it.
> >
> > >
> > > And yes we can extend the credit accounting to include buffer size.
> > > That's a protocol change but maybe it makes sense.
> >
> > Since we send to the other peer the credit available, maybe this
> > change can be backwards compatible (I'll check better this).
>
> What I said was wrong.
>
> We send a counter (increased when the user consumes the packets) and the
> "buf_alloc" (the max memory allowed) to the other peer.
> It makes a difference between a local counter (increased when the
> packets are sent) and the remote counter to calculate the credit available:
>
> u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 credit)
> {
> u32 ret;
>
> spin_lock_bh(&vvs->tx_lock);
> ret = vvs->peer_buf_alloc - (vvs->tx_cnt - vvs->peer_fwd_cnt);
> if (ret > credit)
> ret = credit;
> vvs->tx_cnt += ret;
> spin_unlock_bh(&vvs->tx_lock);
>
> return ret;
> }
>
> Maybe I can play with "buf_alloc" to take care of bytes queued but not
> used.
>
> Thanks,
> Stefano
Right. And the idea behind it all was that if we send a credit
to remote then we have space for it.
I think the basic idea was that if we have actual allocated
memory and can copy data there, then we send the credit to
remote.
Of course that means an extra copy every packet.
So as an optimization, it seems that we just assume
that we will be able to allocate a new buffer.
First this is not the best we can do. We can actually do
allocate memory in the socket before sending credit.
If packet is small then we copy it there.
If packet is big then we queue the packet,
take the buffer out of socket and add it to the virtqueue.
Second question is what to do about medium sized packets.
Packet is 1K but buffer is 4K, what do we do?
And here I wonder - why don't we add the 3K buffer
to the vq?
--
MST
^ permalink raw reply
* Re: [PATCH] net: sctp: drop unneeded likely() call around IS_ERR()
From: Marcelo Ricardo Leitner @ 2019-07-29 19:03 UTC (permalink / raw)
To: Enrico Weigelt, metux IT consult
Cc: linux-kernel, vyasevich, nhorman, davem, linux-sctp, netdev
In-Reply-To: <1564426521-22525-1-git-send-email-info@metux.net>
On Mon, Jul 29, 2019 at 08:55:21PM +0200, Enrico Weigelt, metux IT consult wrote:
> From: Enrico Weigelt <info@metux.net>
>
> IS_ERR() already calls unlikely(), so this extra unlikely() call
> around IS_ERR() is not needed.
>
> Signed-off-by: Enrico Weigelt <info@metux.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
> net/sctp/socket.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index aa80cda..9d1f83b 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -985,7 +985,7 @@ static int sctp_setsockopt_bindx(struct sock *sk,
> return -EINVAL;
>
> kaddrs = memdup_user(addrs, addrs_size);
> - if (unlikely(IS_ERR(kaddrs)))
> + if (IS_ERR(kaddrs))
> return PTR_ERR(kaddrs);
>
> /* Walk through the addrs buffer and count the number of addresses. */
> @@ -1315,7 +1315,7 @@ static int __sctp_setsockopt_connectx(struct sock *sk,
> return -EINVAL;
>
> kaddrs = memdup_user(addrs, addrs_size);
> - if (unlikely(IS_ERR(kaddrs)))
> + if (IS_ERR(kaddrs))
> return PTR_ERR(kaddrs);
>
> /* Allow security module to validate connectx addresses. */
> --
> 1.9.1
>
^ permalink raw reply
* Re: [PATCH net v2] mlxsw: spectrum_ptp: Increase parsing depth when PTP is enabled
From: Ido Schimmel @ 2019-07-29 19:00 UTC (permalink / raw)
To: Petr Machata
Cc: netdev@vger.kernel.org, Jiri Pirko, Ido Schimmel,
davem@davemloft.net
In-Reply-To: <b1584bdec4a0a36a2567a43dc0973dd8f3a05dec.1564424420.git.petrm@mellanox.com>
On Mon, Jul 29, 2019 at 06:26:14PM +0000, Petr Machata wrote:
> Spectrum systems have a configurable limit on how far into the packet they
> parse. By default, the limit is 96 bytes.
>
> An IPv6 PTP packet is layered as Ethernet/IPv6/UDP (14+40+8 bytes), and
> sequence ID of a PTP event is only available 32 bytes into payload, for a
> total of 94 bytes. When an additional 802.1q header is present as
> well (such as when ptp4l is running on a VLAN port), the parsing limit is
> exceeded. Such packets are not recognized as PTP, and are not timestamped.
>
> Therefore generalize the current VXLAN-specific parsing depth setting to
> allow reference-counted requests from other modules as well. Keep it in the
> VXLAN module, because the MPRS register also configures UDP destination
> port number used for VXLAN, and is thus closely tied to the VXLAN code
> anyway.
>
> Then invoke the new interfaces from both VXLAN (in obvious places), as well
> as from PTP code, when the (global) timestamping configuration changes from
> disabled to enabled or vice versa.
>
> Fixes: 8748642751ed ("mlxsw: spectrum: PTP: Support SIOCGHWTSTAMP, SIOCSHWTSTAMP ioctls")
> Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
^ permalink raw reply
* Re: [patch net-next 3/3] netdevsim: create devlink and netdev instances in namespace
From: Jakub Kicinski @ 2019-07-29 18:59 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev, davem, sthemmin, dsahern, mlxsw
In-Reply-To: <20190727094459.26345-4-jiri@resnulli.us>
On Sat, 27 Jul 2019 11:44:59 +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> When user does create new netdevsim instance using sysfs bus file,
> create the devlink instance and related netdev instance in the namespace
> of the caller.
>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> diff --git a/drivers/net/netdevsim/bus.c b/drivers/net/netdevsim/bus.c
> index 1a0ff3d7747b..6aeed0c600f8 100644
> --- a/drivers/net/netdevsim/bus.c
> +++ b/drivers/net/netdevsim/bus.c
> @@ -283,6 +283,7 @@ nsim_bus_dev_new(unsigned int id, unsigned int port_count)
> nsim_bus_dev->dev.bus = &nsim_bus;
> nsim_bus_dev->dev.type = &nsim_bus_dev_type;
> nsim_bus_dev->port_count = port_count;
> + nsim_bus_dev->initial_net = current->nsproxy->net_ns;
>
> err = device_register(&nsim_bus_dev->dev);
> if (err)
The saved initial_net is only to carry the net info from the adding
process to the probe callback, because probe can be asynchronous?
I'm not entirely sure netdevsim can probe asynchronously in practice
given we never return EPROBE_DEFER, but would you mind at least adding
a comment stating that next to the definition of the field, otherwise
I worry someone may mistakenly use this field instead of the up-to-date
devlink's netns.
> diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
> index 79c05af2a7c0..cdf53d0e0c49 100644
> --- a/drivers/net/netdevsim/netdevsim.h
> +++ b/drivers/net/netdevsim/netdevsim.h
> @@ -19,6 +19,7 @@
> #include <linux/netdevice.h>
> #include <linux/u64_stats_sync.h>
> #include <net/devlink.h>
> +#include <net/net_namespace.h>
You can just do a forward declaration, no need to pull in the header.
> #include <net/xdp.h>
>
> #define DRV_NAME "netdevsim"
> @@ -213,6 +215,7 @@ struct nsim_bus_dev {
> struct device dev;
> struct list_head list;
> unsigned int port_count;
> + struct net *initial_net;
> unsigned int num_vfs;
> struct nsim_vf_config *vfconfigs;
> };
Otherwise makes perfect sense, with the above nits addressed feel free
to add:
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
^ permalink raw reply
* [PATCH] net: sctp: drop unneeded likely() call around IS_ERR()
From: Enrico Weigelt, metux IT consult @ 2019-07-29 18:55 UTC (permalink / raw)
To: linux-kernel
Cc: vyasevich, nhorman, marcelo.leitner, davem, linux-sctp, netdev
From: Enrico Weigelt <info@metux.net>
IS_ERR() already calls unlikely(), so this extra unlikely() call
around IS_ERR() is not needed.
Signed-off-by: Enrico Weigelt <info@metux.net>
---
net/sctp/socket.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index aa80cda..9d1f83b 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -985,7 +985,7 @@ static int sctp_setsockopt_bindx(struct sock *sk,
return -EINVAL;
kaddrs = memdup_user(addrs, addrs_size);
- if (unlikely(IS_ERR(kaddrs)))
+ if (IS_ERR(kaddrs))
return PTR_ERR(kaddrs);
/* Walk through the addrs buffer and count the number of addresses. */
@@ -1315,7 +1315,7 @@ static int __sctp_setsockopt_connectx(struct sock *sk,
return -EINVAL;
kaddrs = memdup_user(addrs, addrs_size);
- if (unlikely(IS_ERR(kaddrs)))
+ if (IS_ERR(kaddrs))
return PTR_ERR(kaddrs);
/* Allow security module to validate connectx addresses. */
--
1.9.1
^ permalink raw reply related
* Re: [PATCH] net/mlx5: fix -Wtype-limits compilation warnings
From: Saeed Mahameed @ 2019-07-29 18:51 UTC (permalink / raw)
To: cai@lca.pw, Leon Romanovsky
Cc: Yishai Hadas, davem@davemloft.net, netdev@vger.kernel.org,
linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <1563820482-10302-1-git-send-email-cai@lca.pw>
On Mon, 2019-07-22 at 14:34 -0400, Qian Cai wrote:
> The commit b9a7ba556207 ("net/mlx5: Use event mask based on device
> capabilities") introduced a few compilation warnings due to it bumps
> MLX5_EVENT_TYPE_MAX from 0x27 to 0x100 which is always greater than
> an "struct {mlx5_eqe|mlx5_nb}.type" that is an "u8".
>
> drivers/net/ethernet/mellanox/mlx5/core/eq.c: In function
> 'mlx5_eq_notifier_register':
> drivers/net/ethernet/mellanox/mlx5/core/eq.c:948:21: warning:
> comparison
> is always false due to limited range of data type [-Wtype-limits]
> if (nb->event_type >= MLX5_EVENT_TYPE_MAX)
> ^~
> drivers/net/ethernet/mellanox/mlx5/core/eq.c: In function
> 'mlx5_eq_notifier_unregister':
> drivers/net/ethernet/mellanox/mlx5/core/eq.c:959:21: warning:
> comparison
> is always false due to limited range of data type [-Wtype-limits]
> if (nb->event_type >= MLX5_EVENT_TYPE_MAX)
>
> Fix them by removing unnecessary checkings.
>
> Fixes: b9a7ba556207 ("net/mlx5: Use event mask based on device
> capabilities")
> Signed-off-by: Qian Cai <cai@lca.pw>
>
Applied to mlx5-next.
Thanks,
Saeed.
^ permalink raw reply
* Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
From: Josh Hunt @ 2019-07-29 18:50 UTC (permalink / raw)
To: Neal Cardwell; +Cc: Eric Dumazet, netdev, David Miller
In-Reply-To: <CADVnQykOFxm5ms-z0KwxrjFLcOQvixSPosXXn7+4MC3Qee9gKQ@mail.gmail.com>
On 7/29/19 6:12 AM, Neal Cardwell wrote:
> On Sun, Jul 28, 2019 at 5:14 PM Josh Hunt <johunt@akamai.com> wrote:
>>
>> On 7/28/19 6:54 AM, Eric Dumazet wrote:
>>> On Sun, Jul 28, 2019 at 1:21 AM Josh Hunt <johunt@akamai.com> wrote:
>>>>
>>>> On 7/27/19 12:05 AM, Eric Dumazet wrote:
>>>>> On Sat, Jul 27, 2019 at 4:23 AM Josh Hunt <johunt@akamai.com> wrote:
>>>>>>
>>>>>> The current implementation of TCP MTU probing can considerably
>>>>>> underestimate the MTU on lossy connections allowing the MSS to get down to
>>>>>> 48. We have found that in almost all of these cases on our networks these
>>>>>> paths can handle much larger MTUs meaning the connections are being
>>>>>> artificially limited. Even though TCP MTU probing can raise the MSS back up
>>>>>> we have seen this not to be the case causing connections to be "stuck" with
>>>>>> an MSS of 48 when heavy loss is present.
>>>>>>
>>>>>> Prior to pushing out this change we could not keep TCP MTU probing enabled
>>>>>> b/c of the above reasons. Now with a reasonble floor set we've had it
>>>>>> enabled for the past 6 months.
>>>>>
>>>>> And what reasonable value have you used ???
>>>>
>>>> Reasonable for some may not be reasonable for others hence the new
>>>> sysctl :) We're currently running with a fairly high value based off of
>>>> the v6 min MTU minus headers and options, etc. We went conservative with
>>>> our setting initially as it seemed a reasonable first step when
>>>> re-enabling TCP MTU probing since with no configurable floor we saw a #
>>>> of cases where connections were using severely reduced mss b/c of loss
>>>> and not b/c of actual path restriction. I plan to reevaluate the setting
>>>> at some point, but since the probing method is still the same it means
>>>> the same clients who got stuck with mss of 48 before will land at
>>>> whatever floor we set. Looking forward we are interested in trying to
>>>> improve TCP MTU probing so it does not penalize clients like this.
>>>>
>>>> A suggestion for a more reasonable floor default would be 512, which is
>>>> the same as the min_pmtu. Given both mechanisms are trying to achieve
>>>> the same goal it seems like they should have a similar min/floor.
>>>>
>>>>>
>>>>>>
>>>>>> The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
>>>>>> administrators the ability to control the floor of MSS probing.
>>>>>>
>>>>>> Signed-off-by: Josh Hunt <johunt@akamai.com>
>>>>>> ---
>>>>>> Documentation/networking/ip-sysctl.txt | 6 ++++++
>>>>>> include/net/netns/ipv4.h | 1 +
>>>>>> net/ipv4/sysctl_net_ipv4.c | 9 +++++++++
>>>>>> net/ipv4/tcp_ipv4.c | 1 +
>>>>>> net/ipv4/tcp_timer.c | 2 +-
>>>>>> 5 files changed, 18 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>>>>>> index df33674799b5..49e95f438ed7 100644
>>>>>> --- a/Documentation/networking/ip-sysctl.txt
>>>>>> +++ b/Documentation/networking/ip-sysctl.txt
>>>>>> @@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
>>>>>> Path MTU discovery (MTU probing). If MTU probing is enabled,
>>>>>> this is the initial MSS used by the connection.
>>>>>>
>>>>>> +tcp_mtu_probe_floor - INTEGER
>>>>>> + If MTU probing is enabled this caps the minimum MSS used for search_low
>>>>>> + for the connection.
>>>>>> +
>>>>>> + Default : 48
>>>>>> +
>>>>>> tcp_min_snd_mss - INTEGER
>>>>>> TCP SYN and SYNACK messages usually advertise an ADVMSS option,
>>>>>> as described in RFC 1122 and RFC 6691.
>>>>>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>>>>>> index bc24a8ec1ce5..c0c0791b1912 100644
>>>>>> --- a/include/net/netns/ipv4.h
>>>>>> +++ b/include/net/netns/ipv4.h
>>>>>> @@ -116,6 +116,7 @@ struct netns_ipv4 {
>>>>>> int sysctl_tcp_l3mdev_accept;
>>>>>> #endif
>>>>>> int sysctl_tcp_mtu_probing;
>>>>>> + int sysctl_tcp_mtu_probe_floor;
>>>>>> int sysctl_tcp_base_mss;
>>>>>> int sysctl_tcp_min_snd_mss;
>>>>>> int sysctl_tcp_probe_threshold;
>>>>>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>>>>>> index 0b980e841927..59ded25acd04 100644
>>>>>> --- a/net/ipv4/sysctl_net_ipv4.c
>>>>>> +++ b/net/ipv4/sysctl_net_ipv4.c
>>>>>> @@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = {
>>>>>> .extra2 = &tcp_min_snd_mss_max,
>>>>>> },
>>>>>> {
>>>>>> + .procname = "tcp_mtu_probe_floor",
>>>>>> + .data = &init_net.ipv4.sysctl_tcp_mtu_probe_floor,
>>>>>> + .maxlen = sizeof(int),
>>>>>> + .mode = 0644,
>>>>>> + .proc_handler = proc_dointvec_minmax,
>>>>>> + .extra1 = &tcp_min_snd_mss_min,
>>>>>> + .extra2 = &tcp_min_snd_mss_max,
>>>>>> + },
>>>>>> + {
>>>>>> .procname = "tcp_probe_threshold",
>>>>>> .data = &init_net.ipv4.sysctl_tcp_probe_threshold,
>>>>>> .maxlen = sizeof(int),
>>>>>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>>>>>> index d57641cb3477..e0a372676329 100644
>>>>>> --- a/net/ipv4/tcp_ipv4.c
>>>>>> +++ b/net/ipv4/tcp_ipv4.c
>>>>>> @@ -2637,6 +2637,7 @@ static int __net_init tcp_sk_init(struct net *net)
>>>>>> net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
>>>>>> net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
>>>>>> net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
>>>>>> + net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
>>>>>>
>>>>>> net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
>>>>>> net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
>>>>>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>>>>> index c801cd37cc2a..dbd9d2d0ee63 100644
>>>>>> --- a/net/ipv4/tcp_timer.c
>>>>>> +++ b/net/ipv4/tcp_timer.c
>>>>>> @@ -154,7 +154,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
>>>>>> } else {
>>>>>> mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
>>>>>> mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
>>>>>> - mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
>>>>>> + mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor);
>>>>>> mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
>>>>>> icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
>>>>>> }
>>>>>
>>>>>
>>>>> Existing sysctl should be enough ?
>>>>
>>>> I don't think so. Changing tcp_min_snd_mss could impact clients that
>>>> really want/need a small mss. When you added the new sysctl I tried to
>>>> analyze the mss values we're seeing to understand what we could possibly
>>>> raise it to. While not a huge amount, we see more clients than I
>>>> expected announcing mss values in the 180-512 range. Given that I would
>>>> not feel comfortable setting tcp_min_snd_mss to say 512 as I suggested
>>>> above.
>>>
>>> If these clients need mss values in 180-512 ranges, how MTU probing
>>> would work for them,
>>> if you set a floor to 512 ?
>>
>> First, we already seem to be fine with ignoring these paths with ICMP
>> based PMTU discovery b/c of our min_pmtu default of 512 and that is
>> configurable. Second by adding this sysctl we're giving administrators
>> the choice to decide if they'd like to attempt to support these very
>> very small # of paths which may be below 512 (MSS <= 512 does not mean
>> MTU <= 512) or cover themselves by being able to raise the floor to not
>> penalize clients who may be on very lossy networks.
>>
>>>
>>> Are we sure the intent of tcp_base_mss was not to act as a floor ?
>>
>> My understanding is that tcp_base_mss is meant to be the initial value
>> of search_low (as per Docs). Then in RFC 4821 [1] Sections 7.2, shows
>> search_low should be configurable, and 7.7 we see that in response to
>> successive black hole detection search_low should be halved. So I don't
>> think it was meant to be a floor, but just the initial search_low param.
>
> That matches my reading of the RFC and code as well. But in that case
> IMHO an additional commit should fix this comment to reflect the fact
> thatTCP_BASE_MSS is the initial value, rather than a floor:
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 42728239cdbe..05575ac70333 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -75,7 +75,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
> /* Minimal accepted MSS. It is (60+60+8) - (20+20). */
> #define TCP_MIN_MSS 88U
>
> -/* The least MTU to use for probing */
> +/* The initial MTU to use for probing */
> #define TCP_BASE_MSS 1024
>
> /* probing interval, default to 10 minutes as per RFC4821 */
Good catch. Agreed. That comment is misleading.
Josh
^ permalink raw reply
* Re: [PATCH v3 5/7] mfd: ioc3: Add driver for SGI IOC3 chip
From: Thomas Bogendoerfer @ 2019-07-29 18:45 UTC (permalink / raw)
To: Lee Jones
Cc: Ralf Baechle, Paul Burton, James Hogan, Dmitry Torokhov,
David S. Miller, Srinivas Kandagatla, Alessandro Zummo,
Alexandre Belloni, Greg Kroah-Hartman, Jiri Slaby, linux-mips,
linux-kernel, linux-input, netdev, linux-rtc, linux-serial
In-Reply-To: <20190725114716.GB23883@dell>
On Thu, 25 Jul 2019 12:47:16 +0100
Lee Jones <lee.jones@linaro.org> wrote:
> On Thu, 13 Jun 2019, Thomas Bogendoerfer wrote:
> > +/*
> > + * On IP30 the RTC (a DS1687) is behind the IOC3 on the generic
> > + * ByteBus regions. We have to write the RTC address of interest to
> > + * IOC3_BYTEBUS_DEV1, then read the data from IOC3_BYTEBUS_DEV2.
> > + * rtc->regs already points to IOC3_BYTEBUS_DEV1.
> > + */
> > +#define IP30_RTC_ADDR(rtc) (rtc->regs)
> > +#define IP30_RTC_DATA(rtc) ((rtc->regs) + IOC3_BYTEBUS_DEV2 - IOC3_BYTEBUS_DEV1)
> > +
> > +static u8 ip30_rtc_read(struct ds1685_priv *rtc, int reg)
> > +{
> > + writeb((reg & 0x7f), IP30_RTC_ADDR(rtc));
> > + return readb(IP30_RTC_DATA(rtc));
> > +}
> > +
> > +static void ip30_rtc_write(struct ds1685_priv *rtc, int reg, u8 value)
> > +{
> > + writeb((reg & 0x7f), IP30_RTC_ADDR(rtc));
> > + writeb(value, IP30_RTC_DATA(rtc));
> > +}
>
> Why is this not in the RTC driver?
because rtc1685 is used in different systems and accessing the chip
differs between those systems.
> > +static struct ds1685_rtc_platform_data ip30_rtc_platform_data = {
> > + .bcd_mode = false,
> > + .no_irq = false,
> > + .uie_unsupported = true,
> > + .alloc_io_resources = true,
>
> > + .plat_read = ip30_rtc_read,
> > + .plat_write = ip30_rtc_write,
>
> Call-backs in a non-subsystem API is pretty ugly IMHO.
I agree
> Where are these called from?
drivers/rtc/rtc-ds1685.c
I could do the same as done for serial8250 and add an additional .c file
in drivers/rtc which handles this for SGI-IP30. Alexandre would this work
for you as well ?
> > +#define IOC3_SID(_name, _sid, _setup) \
> > + { \
> > + .name = _name, \
> > + .sid = (PCI_VENDOR_ID_SGI << 16) | IOC3_SUBSYS_ ## _sid, \
> > + .setup = _setup, \
> > + }
> > +
> > +static struct {
> > + const char *name;
> > + u32 sid;
> > + int (*setup)(struct ioc3_priv_data *ipd);
> > +} ioc3_infos[] = {
>
> IMHO it's neater if you separate the definition and static data part.
I don't quite understand what you mean here. Should I move the #define at
the beginning of the file ? Why is it neater ?
Thomas.
--
SUSE Linux GmbH
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
^ permalink raw reply
* Re: [PULL] vhost,virtio: cleanups and fixes
From: pr-tracker-bot @ 2019-07-29 18:40 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Linus Torvalds, kvm, virtualization, netdev, linux-kernel,
eric.auger, jean-philippe, jroedel, mst, namit, wei.w.wang
In-Reply-To: <20190729121605-mutt-send-email-mst@kernel.org>
The pull request you sent on Mon, 29 Jul 2019 12:16:05 -0400:
> git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/2a11c76e5301dddefcb618dac04f74e6314df6bc
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker
^ permalink raw reply
* [PATCH net v2] mlxsw: spectrum_ptp: Increase parsing depth when PTP is enabled
From: Petr Machata @ 2019-07-29 18:26 UTC (permalink / raw)
To: netdev@vger.kernel.org
Cc: Petr Machata, Jiri Pirko, Ido Schimmel, davem@davemloft.net
Spectrum systems have a configurable limit on how far into the packet they
parse. By default, the limit is 96 bytes.
An IPv6 PTP packet is layered as Ethernet/IPv6/UDP (14+40+8 bytes), and
sequence ID of a PTP event is only available 32 bytes into payload, for a
total of 94 bytes. When an additional 802.1q header is present as
well (such as when ptp4l is running on a VLAN port), the parsing limit is
exceeded. Such packets are not recognized as PTP, and are not timestamped.
Therefore generalize the current VXLAN-specific parsing depth setting to
allow reference-counted requests from other modules as well. Keep it in the
VXLAN module, because the MPRS register also configures UDP destination
port number used for VXLAN, and is thus closely tied to the VXLAN code
anyway.
Then invoke the new interfaces from both VXLAN (in obvious places), as well
as from PTP code, when the (global) timestamping configuration changes from
disabled to enabled or vice versa.
Fixes: 8748642751ed ("mlxsw: spectrum: PTP: Support SIOCGHWTSTAMP, SIOCSHWTSTAMP ioctls")
Signed-off-by: Petr Machata <petrm@mellanox.com>
---
Notes:
v2:
- Preserve RXT in mlxsw_sp1_ptp_mtpppc_update()
.../net/ethernet/mellanox/mlxsw/spectrum.h | 4 +
.../ethernet/mellanox/mlxsw/spectrum_nve.c | 1 +
.../ethernet/mellanox/mlxsw/spectrum_nve.h | 1 +
.../mellanox/mlxsw/spectrum_nve_vxlan.c | 76 ++++++++++++++-----
.../ethernet/mellanox/mlxsw/spectrum_ptp.c | 17 +++++
5 files changed, 82 insertions(+), 17 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index 131f62ce9297..6664119fb0c8 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -951,4 +951,8 @@ void mlxsw_sp_port_nve_fini(struct mlxsw_sp_port *mlxsw_sp_port);
int mlxsw_sp_nve_init(struct mlxsw_sp *mlxsw_sp);
void mlxsw_sp_nve_fini(struct mlxsw_sp *mlxsw_sp);
+/* spectrum_nve_vxlan.c */
+int mlxsw_sp_nve_inc_parsing_depth_get(struct mlxsw_sp *mlxsw_sp);
+void mlxsw_sp_nve_inc_parsing_depth_put(struct mlxsw_sp *mlxsw_sp);
+
#endif
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve.c
index 1df164a4b06d..17f334b46c40 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve.c
@@ -775,6 +775,7 @@ static void mlxsw_sp_nve_tunnel_fini(struct mlxsw_sp *mlxsw_sp)
ops->fini(nve);
mlxsw_sp_kvdl_free(mlxsw_sp, MLXSW_SP_KVDL_ENTRY_TYPE_ADJ, 1,
nve->tunnel_index);
+ memset(&nve->config, 0, sizeof(nve->config));
}
nve->num_nve_tunnels--;
}
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve.h
index 0035640156a1..12f664f42f21 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve.h
@@ -29,6 +29,7 @@ struct mlxsw_sp_nve {
unsigned int num_max_mc_entries[MLXSW_SP_L3_PROTO_MAX];
u32 tunnel_index;
u16 ul_rif_index; /* Reserved for Spectrum */
+ unsigned int inc_parsing_depth_refs;
};
struct mlxsw_sp_nve_ops {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve_vxlan.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve_vxlan.c
index 93ccd9fc2266..05517c7feaa5 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve_vxlan.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve_vxlan.c
@@ -103,9 +103,9 @@ static void mlxsw_sp_nve_vxlan_config(const struct mlxsw_sp_nve *nve,
config->udp_dport = cfg->dst_port;
}
-static int mlxsw_sp_nve_parsing_set(struct mlxsw_sp *mlxsw_sp,
- unsigned int parsing_depth,
- __be16 udp_dport)
+static int __mlxsw_sp_nve_parsing_set(struct mlxsw_sp *mlxsw_sp,
+ unsigned int parsing_depth,
+ __be16 udp_dport)
{
char mprs_pl[MLXSW_REG_MPRS_LEN];
@@ -113,6 +113,56 @@ static int mlxsw_sp_nve_parsing_set(struct mlxsw_sp *mlxsw_sp,
return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(mprs), mprs_pl);
}
+static int mlxsw_sp_nve_parsing_set(struct mlxsw_sp *mlxsw_sp,
+ __be16 udp_dport)
+{
+ int parsing_depth = mlxsw_sp->nve->inc_parsing_depth_refs ?
+ MLXSW_SP_NVE_VXLAN_PARSING_DEPTH :
+ MLXSW_SP_NVE_DEFAULT_PARSING_DEPTH;
+
+ return __mlxsw_sp_nve_parsing_set(mlxsw_sp, parsing_depth, udp_dport);
+}
+
+static int
+__mlxsw_sp_nve_inc_parsing_depth_get(struct mlxsw_sp *mlxsw_sp,
+ __be16 udp_dport)
+{
+ int err;
+
+ mlxsw_sp->nve->inc_parsing_depth_refs++;
+
+ err = mlxsw_sp_nve_parsing_set(mlxsw_sp, udp_dport);
+ if (err)
+ goto err_nve_parsing_set;
+ return 0;
+
+err_nve_parsing_set:
+ mlxsw_sp->nve->inc_parsing_depth_refs--;
+ return err;
+}
+
+static void
+__mlxsw_sp_nve_inc_parsing_depth_put(struct mlxsw_sp *mlxsw_sp,
+ __be16 udp_dport)
+{
+ mlxsw_sp->nve->inc_parsing_depth_refs--;
+ mlxsw_sp_nve_parsing_set(mlxsw_sp, udp_dport);
+}
+
+int mlxsw_sp_nve_inc_parsing_depth_get(struct mlxsw_sp *mlxsw_sp)
+{
+ __be16 udp_dport = mlxsw_sp->nve->config.udp_dport;
+
+ return __mlxsw_sp_nve_inc_parsing_depth_get(mlxsw_sp, udp_dport);
+}
+
+void mlxsw_sp_nve_inc_parsing_depth_put(struct mlxsw_sp *mlxsw_sp)
+{
+ __be16 udp_dport = mlxsw_sp->nve->config.udp_dport;
+
+ __mlxsw_sp_nve_inc_parsing_depth_put(mlxsw_sp, udp_dport);
+}
+
static void
mlxsw_sp_nve_vxlan_config_prepare(char *tngcr_pl,
const struct mlxsw_sp_nve_config *config)
@@ -176,9 +226,7 @@ static int mlxsw_sp1_nve_vxlan_init(struct mlxsw_sp_nve *nve,
struct mlxsw_sp *mlxsw_sp = nve->mlxsw_sp;
int err;
- err = mlxsw_sp_nve_parsing_set(mlxsw_sp,
- MLXSW_SP_NVE_VXLAN_PARSING_DEPTH,
- config->udp_dport);
+ err = __mlxsw_sp_nve_inc_parsing_depth_get(mlxsw_sp, config->udp_dport);
if (err)
return err;
@@ -203,8 +251,7 @@ static int mlxsw_sp1_nve_vxlan_init(struct mlxsw_sp_nve *nve,
err_rtdp_set:
mlxsw_sp1_nve_vxlan_config_clear(mlxsw_sp);
err_config_set:
- mlxsw_sp_nve_parsing_set(mlxsw_sp, MLXSW_SP_NVE_DEFAULT_PARSING_DEPTH,
- config->udp_dport);
+ __mlxsw_sp_nve_inc_parsing_depth_put(mlxsw_sp, 0);
return err;
}
@@ -216,8 +263,7 @@ static void mlxsw_sp1_nve_vxlan_fini(struct mlxsw_sp_nve *nve)
mlxsw_sp_router_nve_demote_decap(mlxsw_sp, config->ul_tb_id,
config->ul_proto, &config->ul_sip);
mlxsw_sp1_nve_vxlan_config_clear(mlxsw_sp);
- mlxsw_sp_nve_parsing_set(mlxsw_sp, MLXSW_SP_NVE_DEFAULT_PARSING_DEPTH,
- config->udp_dport);
+ __mlxsw_sp_nve_inc_parsing_depth_put(mlxsw_sp, 0);
}
static int
@@ -320,9 +366,7 @@ static int mlxsw_sp2_nve_vxlan_init(struct mlxsw_sp_nve *nve,
struct mlxsw_sp *mlxsw_sp = nve->mlxsw_sp;
int err;
- err = mlxsw_sp_nve_parsing_set(mlxsw_sp,
- MLXSW_SP_NVE_VXLAN_PARSING_DEPTH,
- config->udp_dport);
+ err = __mlxsw_sp_nve_inc_parsing_depth_get(mlxsw_sp, config->udp_dport);
if (err)
return err;
@@ -348,8 +392,7 @@ static int mlxsw_sp2_nve_vxlan_init(struct mlxsw_sp_nve *nve,
err_rtdp_set:
mlxsw_sp2_nve_vxlan_config_clear(mlxsw_sp);
err_config_set:
- mlxsw_sp_nve_parsing_set(mlxsw_sp, MLXSW_SP_NVE_DEFAULT_PARSING_DEPTH,
- config->udp_dport);
+ __mlxsw_sp_nve_inc_parsing_depth_put(mlxsw_sp, 0);
return err;
}
@@ -361,8 +404,7 @@ static void mlxsw_sp2_nve_vxlan_fini(struct mlxsw_sp_nve *nve)
mlxsw_sp_router_nve_demote_decap(mlxsw_sp, config->ul_tb_id,
config->ul_proto, &config->ul_sip);
mlxsw_sp2_nve_vxlan_config_clear(mlxsw_sp);
- mlxsw_sp_nve_parsing_set(mlxsw_sp, MLXSW_SP_NVE_DEFAULT_PARSING_DEPTH,
- config->udp_dport);
+ __mlxsw_sp_nve_inc_parsing_depth_put(mlxsw_sp, 0);
}
const struct mlxsw_sp_nve_ops mlxsw_sp2_nve_vxlan_ops = {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
index bd9c2bc2d5d6..98c5ba3200bc 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
@@ -979,6 +979,9 @@ static int mlxsw_sp1_ptp_mtpppc_update(struct mlxsw_sp_port *mlxsw_sp_port,
{
struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
struct mlxsw_sp_port *tmp;
+ u16 orig_ing_types = 0;
+ u16 orig_egr_types = 0;
+ int err;
int i;
/* MTPPPC configures timestamping globally, not per port. Find the
@@ -986,12 +989,26 @@ static int mlxsw_sp1_ptp_mtpppc_update(struct mlxsw_sp_port *mlxsw_sp_port,
*/
for (i = 1; i < mlxsw_core_max_ports(mlxsw_sp->core); i++) {
tmp = mlxsw_sp->ports[i];
+ if (tmp) {
+ orig_ing_types |= tmp->ptp.ing_types;
+ orig_egr_types |= tmp->ptp.egr_types;
+ }
if (tmp && tmp != mlxsw_sp_port) {
ing_types |= tmp->ptp.ing_types;
egr_types |= tmp->ptp.egr_types;
}
}
+ if ((ing_types || egr_types) && !(orig_egr_types || orig_egr_types)) {
+ err = mlxsw_sp_nve_inc_parsing_depth_get(mlxsw_sp);
+ if (err) {
+ netdev_err(mlxsw_sp_port->dev, "Failed to increase parsing depth");
+ return err;
+ }
+ }
+ if (!(ing_types || egr_types) && (orig_egr_types || orig_egr_types))
+ mlxsw_sp_nve_inc_parsing_depth_put(mlxsw_sp);
+
return mlxsw_sp1_ptp_mtpppc_set(mlxsw_sp_port->mlxsw_sp,
ing_types, egr_types);
}
--
2.20.1
^ permalink raw reply related
* Re: [PATCH net] net/mlx5e: Fix unnecessary flow_block_cb_is_busy call
From: Saeed Mahameed @ 2019-07-29 18:25 UTC (permalink / raw)
To: wenxu@ucloud.cn, netdev@vger.kernel.org, davem@davemloft.net
In-Reply-To: <1564239595-23786-1-git-send-email-wenxu@ucloud.cn>
On Sat, 2019-07-27 at 22:59 +0800, wenxu@ucloud.cn wrote:
> From: wenxu <wenxu@ucloud.cn>
>
> When call flow_block_cb_is_busy. The indr_priv is guaranteed to
> NULL ptr. So there is no need to call flow_bock_cb_is_busy.
>
> Fixes: 0d4fd02e7199 ("net: flow_offload: add flow_block_cb_is_busy()
> and use it")
> Signed-off-by: wenxu <wenxu@ucloud.cn>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 4 ----
> 1 file changed, 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> index 7f747cb..496d303 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> @@ -722,10 +722,6 @@ static void mlx5e_rep_indr_tc_block_unbind(void
> *cb_priv)
> if (indr_priv)
> return -EEXIST;
>
> - if
> (flow_block_cb_is_busy(mlx5e_rep_indr_setup_block_cb,
> - indr_priv,
> &mlx5e_block_cb_list))
> - return -EBUSY;
> -
> indr_priv = kmalloc(sizeof(*indr_priv), GFP_KERNEL);
> if (!indr_priv)
> return -ENOMEM;
Indeed flow_block_cb_is_busy is redundant and will always return false
in this path.
This is net-next material.
Dave let me know if you want me to take it to my branch.
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
^ permalink raw reply
* Re: [PATCH net] mlxsw: spectrum_ptp: Increase parsing depth when PTP is enabled
From: Petr Machata @ 2019-07-29 18:16 UTC (permalink / raw)
To: David Miller
Cc: idosch@idosch.org, netdev@vger.kernel.org, Jiri Pirko, mlxsw,
Ido Schimmel
In-Reply-To: <20190729.105623.1884187062066808743.davem@davemloft.net>
David Miller <davem@davemloft.net> writes:
> From: Ido Schimmel <idosch@idosch.org>
> Date: Sat, 27 Jul 2019 20:35:32 +0300
>
>> @@ -979,19 +979,36 @@ static int mlxsw_sp1_ptp_mtpppc_update(struct mlxsw_sp_port *mlxsw_sp_port,
>> {
>> struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
>> struct mlxsw_sp_port *tmp;
>> + u16 orig_ing_types = 0;
>> + u16 orig_egr_types = 0;
>> int i;
>> + int err;
>
> Please preserve the reverse christmas tree here, thank you.
All right, will respin momentarily.
^ permalink raw reply
* Re: [PATCH] net: spider_net: Mark expected switch fall-through
From: David Miller @ 2019-07-29 18:12 UTC (permalink / raw)
To: gustavo; +Cc: kou.ishizaki, netdev, linux-kernel, sfr, keescook
In-Reply-To: <20190729003251.GA25556@embeddedor>
From: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Date: Sun, 28 Jul 2019 19:32:51 -0500
> Mark switch cases where we are expecting to fall through.
>
> This patch fixes the following warning:
>
> drivers/net/ethernet/toshiba/spider_net.c: In function 'spider_net_release_tx_chain':
> drivers/net/ethernet/toshiba/spider_net.c:783:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
> if (!brutal) {
> ^
> drivers/net/ethernet/toshiba/spider_net.c:792:3: note: here
> case SPIDER_NET_DESCR_RESPONSE_ERROR:
> ^~~~
>
> Notice that, in this particular case, the code comment is
> modified in accordance with what GCC is expecting to find.
>
> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Applied.
^ permalink raw reply
* Re: [PATCH] net: ehea: Mark expected switch fall-through
From: David Miller @ 2019-07-29 18:12 UTC (permalink / raw)
To: gustavo; +Cc: dougmill, netdev, linux-kernel, sfr, keescook
In-Reply-To: <20190729003009.GA25425@embeddedor>
From: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Date: Sun, 28 Jul 2019 19:30:09 -0500
> Mark switch cases where we are expecting to fall through.
>
> This patch fixes the following warning:
>
> drivers/net/ethernet/ibm/ehea/ehea_main.c: In function 'ehea_mem_notifier':
> include/linux/printk.h:311:2: warning: this statement may fall through [-Wimplicit-fallthrough=]
> printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/net/ethernet/ibm/ehea/ehea_main.c:3253:3: note: in expansion of macro 'pr_info'
> pr_info("memory offlining canceled");
> ^~~~~~~
> drivers/net/ethernet/ibm/ehea/ehea_main.c:3256:2: note: here
> case MEM_ONLINE:
> ^~~~
>
> Notice that, in this particular case, the code comment is
> modified in accordance with what GCC is expecting to find.
>
> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Applied.
^ permalink raw reply
* Re: [PATCH net v2] mvpp2: refactor the HW checksum setup
From: David Miller @ 2019-07-29 18:09 UTC (permalink / raw)
To: mcroce; +Cc: netdev, antoine.tenart, maxime.chevallier, mw, stefanc,
linux-kernel
In-Reply-To: <20190728173549.32034-1-mcroce@redhat.com>
From: Matteo Croce <mcroce@redhat.com>
Date: Sun, 28 Jul 2019 19:35:49 +0200
> The hardware can only offload checksum calculation on first port due to
> the Tx FIFO size limitation, and has a maximum L3 offset of 128 bytes.
> Document this in a comment and move duplicated code in a function.
>
> Fixes: 576193f2d579 ("net: mvpp2: jumbo frames support")
> Signed-off-by: Matteo Croce <mcroce@redhat.com>
Applied.
^ permalink raw reply
* Re: [patch net] net: fix ifindex collision during namespace removal
From: David Miller @ 2019-07-29 18:08 UTC (permalink / raw)
To: jiri
Cc: netdev, xemul, edumazet, pabeni, idosch, petrm, sd, f.fainelli,
stephen, mlxsw, jiri
In-Reply-To: <20190728125636.13895-1-jiri@resnulli.us>
From: Jiri Pirko <jiri@resnulli.us>
Date: Sun, 28 Jul 2019 14:56:36 +0200
> From: Jiri Pirko <jiri@mellanox.com>
>
> Commit aca51397d014 ("netns: Fix arbitrary net_device-s corruptions
> on net_ns stop.") introduced a possibility to hit a BUG in case device
> is returning back to init_net and two following conditions are met:
> 1) dev->ifindex value is used in a name of another "dev%d"
> device in init_net.
> 2) dev->name is used by another device in init_net.
>
> Under real life circumstances this is hard to get. Therefore this has
> been present happily for over 10 years. To reproduce:
...
> Fix this by checking if a device with the same name exists in init_net
> and fallback to original code - dev%d to allocate name - in case it does.
>
> This was found using syzkaller.
>
> Fixes: aca51397d014 ("netns: Fix arbitrary net_device-s corruptions on net_ns stop.")
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Very interesting :)
Applied and queued up for -stable, thanks.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox