Netdev List
 help / color / mirror / Atom feed
* libbpf distro packaging
From: Julia Kartseva @ 2019-08-12 19:04 UTC (permalink / raw)
  To: labbott@redhat.com, acme@kernel.org,
	debian-kernel@lists.debian.org, netdev@vger.kernel.org
  Cc: Andrii Nakryiko, Andrey Ignatov, Alexei Starovoitov,
	Yonghong Song, jolsa@kernel.org

I would like to bring up libbpf publishing discussion started at [1].
The present state of things is that libbpf is built from kernel tree, e.g. [2]
For Debian and [3] for Fedora whereas the better way would be having a
package built from github mirror. The advantages of the latter:
- Consistent, ABI matching versioning across distros
- The mirror has integration tests
- No need in kernel tree to build a package
- Changes can be merged directly to github w/o waiting them to be merged
through bpf-next -> net-next -> main
There is a PR introducing a libbpf.spec which can be used as a starting point: [4]
Any comments regarding the spec itself can be posted there.
In the future it may be used as a source of truth.
Please consider switching libbpf packaging to the github mirror instead
of the kernel tree.
Thanks

[1] https://lists.iovisor.org/g/iovisor-dev/message/1521
[2] https://packages.debian.org/sid/libbpf4.19
[3] http://rpmfind.net/linux/RPM/fedora/devel/rawhide/x86_64/l/libbpf-5.3.0-0.rc2.git0.1.fc31.x86_64.html
[4] https://github.com/libbpf/libbpf/pull/64



^ permalink raw reply

* Re: [PATCH v4 14/14] dt-bindings: net: add bindings for ADIN PHY driver
From: Rob Herring @ 2019-08-12 19:02 UTC (permalink / raw)
  To: Alexandru Ardelean
  Cc: netdev, devicetree, linux-kernel@vger.kernel.org, David Miller,
	Mark Rutland, Florian Fainelli, Heiner Kallweit, Andrew Lunn
In-Reply-To: <20190812112350.15242-15-alexandru.ardelean@analog.com>

On Mon, Aug 12, 2019 at 5:24 AM Alexandru Ardelean
<alexandru.ardelean@analog.com> wrote:
>
> This change adds bindings for the Analog Devices ADIN PHY driver, detailing
> all the properties implemented by the driver.
>
> Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
> ---
>  .../devicetree/bindings/net/adi,adin.yaml     | 73 +++++++++++++++++++
>  MAINTAINERS                                   |  1 +
>  2 files changed, 74 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/adi,adin.yaml

Reviewed-by: Rob Herring <robh@kernel.org>

^ permalink raw reply

* Re: [PATCH v3 13/17] mvpp2: no need to check return value of debugfs_create functions
From: Greg Kroah-Hartman @ 2019-08-12 19:01 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: netdev, David S. Miller, Maxime Chevallier, Nathan Huckleberry
In-Reply-To: <CAKwvOdnP4OU9g_ebjnT=r1WcGRvsFsgv3NbguhFKOtt8RWNHwA@mail.gmail.com>

On Mon, Aug 12, 2019 at 10:55:51AM -0700, Nick Desaulniers wrote:
> On Sat, Aug 10, 2019 at 3:17 AM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > When calling debugfs functions, there is no need to ever check the
> > return value.  The function can work or not, but the code logic should
> > never do something different based on this.
> 
> Maybe adding this recommendation to the comment block above the
> definition of debugfs_create_dir() in fs/debugfs/inode.c would help
> prevent this issue in the future?  What failure means, and how to
> proceed can be tricky; more documentation can only help in this
> regard.

If it was there, would you have read it?  :)

I'll add it to the list for when I revamp the debugfs documentation that
is already in the kernel, that very few people actually read...

thanks,

greg k-h

^ permalink raw reply

* [PATCH net-next] r8169: fix sporadic transmit timeout issue
From: Heiner Kallweit @ 2019-08-12 18:47 UTC (permalink / raw)
  To: Realtek linux nic maintainers, David Miller
  Cc: netdev@vger.kernel.org, Eric Dumazet, Holger Hoffstätte

Holger reported sporadic transmit timeouts and it turned out that one
path misses ringing the doorbell. Fix was suggested by Eric.

Fixes: ef14358546b1 ("r8169: make use of xmit_more")
Suggested-by: Eric Dumazet <edumazet@google.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 641a34942..448047a32 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -5681,6 +5681,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 		 */
 		smp_wmb();
 		netif_stop_queue(dev);
+		door_bell = true;
 	}
 
 	if (door_bell)
-- 
2.22.0


^ permalink raw reply related

* [PATCH net] ibmveth: Convert multicast list size for little-endian systems
From: Thomas Falcon @ 2019-08-12 17:43 UTC (permalink / raw)
  To: netdev; +Cc: liuhangbin, davem, joe, Thomas Falcon

The ibm,mac-address-filters property defines the maximum number of
addresses the hypervisor's multicast filter list can support. It is
encoded as a big-endian integer in the OF device tree, but the virtual
ethernet driver does not convert it for use by little-endian systems.
As a result, the driver is not behaving as it should on affected systems
when a large number of multicast addresses are assigned to the device.

Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
---
 drivers/net/ethernet/ibm/ibmveth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index d654c23..b50a6cf 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1645,7 +1645,7 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
 
 	adapter->vdev = dev;
 	adapter->netdev = netdev;
-	adapter->mcastFilterSize = *mcastFilterSize_p;
+	adapter->mcastFilterSize = be32_to_cpu(*mcastFilterSize_p);
 	adapter->pool_config = 0;
 
 	netif_napi_add(netdev, &adapter->napi, ibmveth_poll, 16);
-- 
1.8.3.1


^ permalink raw reply related

* [RESEND][PATCH v3 bpf-next] btf: expose BTF info through sysfs
From: Andrii Nakryiko @ 2019-08-12 18:39 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Make .BTF section allocated and expose its contents through sysfs.

/sys/kernel/btf directory is created to contain all the BTFs present
inside kernel. Currently there is only kernel's main BTF, represented as
/sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported,
each module will expose its BTF as /sys/kernel/btf/<module-name> file.

Current approach relies on a few pieces coming together:
1. pahole is used to take almost final vmlinux image (modulo .BTF and
   kallsyms) and generate .BTF section by converting DWARF info into
   BTF. This section is not allocated and not mapped to any segment,
   though, so is not yet accessible from inside kernel at runtime.
2. objcopy dumps .BTF contents into binary file and subsequently
   convert binary file into linkable object file with automatically
   generated symbols _binary__btf_kernel_bin_start and
   _binary__btf_kernel_bin_end, pointing to start and end, respectively,
   of BTF raw data.
3. final vmlinux image is generated by linking this object file (and
   kallsyms, if necessary). sysfs_btf.c then creates
   /sys/kernel/btf/kernel file and exposes embedded BTF contents through
   it. This allows, e.g., libbpf and bpftool access BTF info at
   well-known location, without resorting to searching for vmlinux image
   on disk (location of which is not standardized and vmlinux image
   might not be even available in some scenarios, e.g., inside qemu
   during testing).

Alternative approach using .incbin assembler directive to embed BTF
contents directly was attempted but didn't work, because sysfs_proc.o is
not re-compiled during link-vmlinux.sh stage. This is required, though,
to update embedded BTF data (initially empty data is embedded, then
pahole generates BTF info and we need to regenerate sysfs_btf.o with
updated contents, but it's too late at that point).

If BTF couldn't be generated due to missing or too old pahole,
sysfs_btf.c handles that gracefully by detecting that
_binary__btf_kernel_bin_start (weak symbol) is 0 and not creating
/sys/kernel/btf at all.

v2->v3:
- added Documentation/ABI/testing/sysfs-kernel-btf (Greg K-H);
- created proper kobject (btf_kobj) for btf directory (Greg K-H);
- undo v2 change of reusing vmlinux, as it causes extra kallsyms pass
  due to initially missing  __binary__btf_kernel_bin_{start/end} symbols;

v1->v2:
- allow kallsyms stage to re-use vmlinux generated by gen_btf();

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---

Resending with shorter CC list as it seems vger blocked my patch.
Added Greg's Reviewd-by, though.

 Documentation/ABI/testing/sysfs-kernel-btf | 17 +++++++
 kernel/bpf/Makefile                        |  3 ++
 kernel/bpf/sysfs_btf.c                     | 51 +++++++++++++++++++++
 scripts/link-vmlinux.sh                    | 52 ++++++++++++++--------
 4 files changed, 104 insertions(+), 19 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-btf
 create mode 100644 kernel/bpf/sysfs_btf.c

diff --git a/Documentation/ABI/testing/sysfs-kernel-btf b/Documentation/ABI/testing/sysfs-kernel-btf
new file mode 100644
index 000000000000..5390f8001f96
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-btf
@@ -0,0 +1,17 @@
+What:		/sys/kernel/btf
+Date:		Aug 2019
+KernelVersion:	5.5
+Contact:	bpf@vger.kernel.org
+Description:
+		Contains BTF type information and related data for kernel and
+		kernel modules.
+
+What:		/sys/kernel/btf/kernel
+Date:		Aug 2019
+KernelVersion:	5.5
+Contact:	bpf@vger.kernel.org
+Description:
+		Read-only binary attribute exposing kernel's own BTF type
+		information with description of all internal kernel types. See
+		Documentation/bpf/btf.rst for detailed description of format
+		itself.
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 29d781061cd5..e1d9adb212f9 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -22,3 +22,6 @@ obj-$(CONFIG_CGROUP_BPF) += cgroup.o
 ifeq ($(CONFIG_INET),y)
 obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
 endif
+ifeq ($(CONFIG_SYSFS),y)
+obj-$(CONFIG_DEBUG_INFO_BTF) += sysfs_btf.o
+endif
diff --git a/kernel/bpf/sysfs_btf.c b/kernel/bpf/sysfs_btf.c
new file mode 100644
index 000000000000..092e63b9758b
--- /dev/null
+++ b/kernel/bpf/sysfs_btf.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Provide kernel BTF information for introspection and use by eBPF tools.
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/kobject.h>
+#include <linux/init.h>
+#include <linux/sysfs.h>
+
+/* See scripts/link-vmlinux.sh, gen_btf() func for details */
+extern char __weak _binary__btf_kernel_bin_start[];
+extern char __weak _binary__btf_kernel_bin_end[];
+
+static ssize_t
+btf_kernel_read(struct file *file, struct kobject *kobj,
+		struct bin_attribute *bin_attr,
+		char *buf, loff_t off, size_t len)
+{
+	memcpy(buf, _binary__btf_kernel_bin_start + off, len);
+	return len;
+}
+
+static struct bin_attribute bin_attr_btf_kernel __ro_after_init = {
+	.attr = { .name = "kernel", .mode = 0444, },
+	.read = btf_kernel_read,
+};
+
+static struct kobject *btf_kobj;
+
+static int __init btf_kernel_init(void)
+{
+	int err;
+
+	if (!_binary__btf_kernel_bin_start)
+		return 0;
+
+	btf_kobj = kobject_create_and_add("btf", kernel_kobj);
+	if (IS_ERR(btf_kobj)) {
+		err = PTR_ERR(btf_kobj);
+		btf_kobj = NULL;
+		return err;
+	}
+
+	bin_attr_btf_kernel.size = _binary__btf_kernel_bin_end -
+				   _binary__btf_kernel_bin_start;
+
+	return sysfs_create_bin_file(btf_kobj, &bin_attr_btf_kernel);
+}
+
+subsys_initcall(btf_kernel_init);
diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
index a7124f895b24..cb93832c6ad7 100755
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -56,8 +56,8 @@ modpost_link()
 }
 
 # Link of vmlinux
-# ${1} - optional extra .o files
-# ${2} - output file
+# ${1} - output file
+# ${@:2} - optional extra .o files
 vmlinux_link()
 {
 	local lds="${objtree}/${KBUILD_LDS}"
@@ -70,9 +70,9 @@ vmlinux_link()
 			--start-group				\
 			${KBUILD_VMLINUX_LIBS}			\
 			--end-group				\
-			${1}"
+			${@:2}"
 
-		${LD} ${KBUILD_LDFLAGS} ${LDFLAGS_vmlinux} -o ${2}	\
+		${LD} ${KBUILD_LDFLAGS} ${LDFLAGS_vmlinux} -o ${1}	\
 			-T ${lds} ${objects}
 	else
 		objects="-Wl,--whole-archive			\
@@ -81,9 +81,9 @@ vmlinux_link()
 			-Wl,--start-group			\
 			${KBUILD_VMLINUX_LIBS}			\
 			-Wl,--end-group				\
-			${1}"
+			${@:2}"
 
-		${CC} ${CFLAGS_vmlinux} -o ${2}			\
+		${CC} ${CFLAGS_vmlinux} -o ${1}			\
 			-Wl,-T,${lds}				\
 			${objects}				\
 			-lutil -lrt -lpthread
@@ -92,23 +92,34 @@ vmlinux_link()
 }
 
 # generate .BTF typeinfo from DWARF debuginfo
+# ${1} - vmlinux image
+# ${2} - file to dump raw BTF data into
 gen_btf()
 {
-	local pahole_ver;
+	local pahole_ver
+	local bin_arch
 
 	if ! [ -x "$(command -v ${PAHOLE})" ]; then
 		info "BTF" "${1}: pahole (${PAHOLE}) is not available"
-		return 0
+		return 1
 	fi
 
 	pahole_ver=$(${PAHOLE} --version | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/')
 	if [ "${pahole_ver}" -lt "113" ]; then
 		info "BTF" "${1}: pahole version $(${PAHOLE} --version) is too old, need at least v1.13"
-		return 0
+		return 1
 	fi
 
-	info "BTF" ${1}
+	info "BTF" ${2}
+	vmlinux_link ${1}
 	LLVM_OBJCOPY=${OBJCOPY} ${PAHOLE} -J ${1}
+
+	# dump .BTF section into raw binary file to link with final vmlinux
+	bin_arch=$(${OBJDUMP} -f ${1} | grep architecture | \
+		cut -d, -f1 | cut -d' ' -f2)
+	${OBJCOPY} --dump-section .BTF=.btf.kernel.bin ${1} 2>/dev/null
+	${OBJCOPY} -I binary -O ${CONFIG_OUTPUT_FORMAT} -B ${bin_arch} \
+		--rename-section .data=.BTF .btf.kernel.bin ${2}
 }
 
 # Create ${2} .o file with all symbols from the ${1} object file
@@ -153,6 +164,7 @@ sortextable()
 # Delete output files in case of error
 cleanup()
 {
+	rm -f .btf.*
 	rm -f .tmp_System.map
 	rm -f .tmp_kallsyms*
 	rm -f .tmp_vmlinux*
@@ -215,6 +227,13 @@ ${MAKE} -f "${srctree}/scripts/Makefile.modpost" vmlinux.o
 info MODINFO modules.builtin.modinfo
 ${OBJCOPY} -j .modinfo -O binary vmlinux.o modules.builtin.modinfo
 
+btf_kernel_bin_o=""
+if [ -n "${CONFIG_DEBUG_INFO_BTF}" ]; then
+	if gen_btf .tmp_vmlinux.btf .btf.kernel.bin.o ; then
+		btf_kernel_bin_o=.btf.kernel.bin.o
+	fi
+fi
+
 kallsymso=""
 kallsyms_vmlinux=""
 if [ -n "${CONFIG_KALLSYMS}" ]; then
@@ -246,11 +265,11 @@ if [ -n "${CONFIG_KALLSYMS}" ]; then
 	kallsyms_vmlinux=.tmp_vmlinux2
 
 	# step 1
-	vmlinux_link "" .tmp_vmlinux1
+	vmlinux_link .tmp_vmlinux1 ${btf_kernel_bin_o}
 	kallsyms .tmp_vmlinux1 .tmp_kallsyms1.o
 
 	# step 2
-	vmlinux_link .tmp_kallsyms1.o .tmp_vmlinux2
+	vmlinux_link .tmp_vmlinux2 .tmp_kallsyms1.o ${btf_kernel_bin_o}
 	kallsyms .tmp_vmlinux2 .tmp_kallsyms2.o
 
 	# step 3
@@ -261,18 +280,13 @@ if [ -n "${CONFIG_KALLSYMS}" ]; then
 		kallsymso=.tmp_kallsyms3.o
 		kallsyms_vmlinux=.tmp_vmlinux3
 
-		vmlinux_link .tmp_kallsyms2.o .tmp_vmlinux3
-
+		vmlinux_link .tmp_vmlinux3 .tmp_kallsyms2.o ${btf_kernel_bin_o}
 		kallsyms .tmp_vmlinux3 .tmp_kallsyms3.o
 	fi
 fi
 
 info LD vmlinux
-vmlinux_link "${kallsymso}" vmlinux
-
-if [ -n "${CONFIG_DEBUG_INFO_BTF}" ]; then
-	gen_btf vmlinux
-fi
+vmlinux_link vmlinux "${kallsymso}" "${btf_kernel_bin_o}"
 
 if [ -n "${CONFIG_BUILDTIME_EXTABLE_SORT}" ]; then
 	info SORTEX vmlinux
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH net-next] r8169: make use of xmit_more
From: Heiner Kallweit @ 2019-08-12 18:38 UTC (permalink / raw)
  To: Holger Hoffstätte, Eric Dumazet
  Cc: Realtek linux nic maintainers, David Miller,
	netdev@vger.kernel.org, Sander Eikelenboom
In-Reply-To: <06438520-1902-bc7c-7bb2-015dfcdf5457@applied-asynchrony.com>

On 12.08.2019 11:59, Holger Hoffstätte wrote:
> On 8/9/19 10:52 AM, Holger Hoffstätte wrote:
>> On 8/9/19 10:25 AM, Eric Dumazet wrote:
>> (snip)
>>>>
>>>> So that didn't take long - got another timeout this morning during some
>>>> random light usage, despite sg/tso being disabled this time.
>>>> Again the only common element is the xmit_more patch. :(
>>>> Not sure whether you want to revert this right away or wait for 5.4-rc1
>>>> feedback. Maybe this too is chipset-specific?
>>>>
>>>>> Thanks a lot for the analysis and testing. Then I'll submit the disabling
>>>>> of SG on RTL8168evl (on your behalf), independent of whether it fixes
>>>>> the timeout issue.
>>>>
>>>> Got it, thanks!
>>>>
>>>> Holger
>>>
>>> I would try this fix maybe ?
>>>
>>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c
>>> b/drivers/net/ethernet/realtek/r8169_main.c
>>> index b2a275d8504cf099cff738f2f7554efa9658fe32..e77628813daba493ad50dab9ac1e3703e38b560c
>>> 100644
>>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>>> @@ -5691,6 +5691,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
>>>                   */
>>>                  smp_wmb();
>>>                  netif_stop_queue(dev);
>>> +               door_bell = true;
>>>          }
>>>
>>>          if (door_bell)
>>>
>>
>> Thanks Eric, I'll give that a try and see how it fares over the next few days.
>> It suspiciously looks like it could help..
> 
> Good news everyone!
> 
> After three days non-stop action between two machines and hundreds of GBs
> pushed back and forth: not a single timeout or hiccup. Nice! \o/
> Eric, please send this as a proper patch for -next. Feel free to add my
> Tested-by.
> 
Thanks for the feedback! I can submit the fix with Eric's "Suggested-by".

> cheers
> Holger
> 
Heiner

^ permalink raw reply

* Re: Error when loading BPF_CGROUP_INET_EGRESS program with bpftool
From: Andrii Nakryiko @ 2019-08-12 18:27 UTC (permalink / raw)
  To: Fejes Ferenc; +Cc: netdev@vger.kernel.org
In-Reply-To: <CAAej5NbkQDpDXEtsROmLmNidSP8qN3VRE56s3z91zHw9XjtNZA@mail.gmail.com>

On Mon, Aug 12, 2019 at 1:59 AM Fejes Ferenc <fejes@inf.elte.hu> wrote:
>
> Greetings!
>
> I found a strange error when I tried to load a BPF_CGROUP_INET_EGRESS
> prog with bpftool. Loading the same program from C code with
> bpf_prog_load_xattr works without problem.
>
> The error message I got:
> bpftool prog loadall hbm_kern.o /sys/fs/bpf/hbm type cgroup/skb

You need "cgroup_skb/egress" instead of "cgroup/skb" (or try just
dropping it, bpftool will try to guess the type from program's section
name, which would be correct in this case).

> libbpf: load bpf program failed: Invalid argument
> libbpf: -- BEGIN DUMP LOG ---
> libbpf:
> ; return ALLOW_PKT | REDUCE_CW;
> 0: (b7) r0 = 3
> 1: (95) exit
> At program exit the register R0 has value (0x3; 0x0) should have been
> in (0x0; 0x1)
> processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0
> peak_states 0 mark_read 0
>
> libbpf: -- END LOG --
> libbpf: failed to load program 'cgroup_skb/egress'
> libbpf: failed to load object 'hbm_kern.o'
> Error: failed to load object file
>
>
> My environment: 5.3-rc3 / net-next master (both producing the error).
> Libbpf and bpftool installed from the kernel source (cleaned and
> reinstalled when I tried a new kernel). I compiled the program with
> Clang 8, on Ubuntu 19.10 server image, the source:
>
> #include <linux/bpf.h>
> #include "bpf_helpers.h"
>
> #define DROP_PKT        0
> #define ALLOW_PKT       1
> #define REDUCE_CW       2
>
> SEC("cgroup_skb/egress")
> int hbm(struct __sk_buff *skb)
> {
>         return ALLOW_PKT | REDUCE_CW;
> }
> char _license[] SEC("license") = "GPL";
>
>
> I also tried to trace down the bug with gdb. It seems like the
> section_names array in libbpf.c filled with garbage, especially the

I did the same, section_names appears to be correct, not sure what was
going on in your case. The problem is that "cgroup/skb", which you
provided on command line, overrides this section name and forces
bpftool to guess program type as BPF_CGROUP_INET_INGRESS, which
restricts return codes to just 0 or 1, while for
BPF_CGROUP_INET_EGRESS i is [0, 3].

> expected_attach_type fields (in my case, this contains
> BPF_CGROUP_INET_INGRESS instead of BPF_CGROUP_INET_EGRESS).
>
> Thanks!

^ permalink raw reply

* Re: [PATCHv2 net 0/2] Add netdev_level_ratelimited to avoid netdev msg flush
From: David Miller @ 2019-08-12 18:27 UTC (permalink / raw)
  To: tlfalcon; +Cc: liuhangbin, netdev, joe
In-Reply-To: <9bb8e9af-4d9b-7c16-f58d-e299b1f30007@linux.ibm.com>

From: Thomas Falcon <tlfalcon@linux.ibm.com>
Date: Mon, 12 Aug 2019 10:56:39 -0500

> Hi, thanks for reporting this. I was able to recreate this on my own
> system. The virtual ethernet's multicast filter list size is limited,
> and the driver will check that there is available space before adding
> entries.  The problem is that the size is encoded as big endian, but
> the driver does not convert it for little endian systems after
> retrieving it from the device tree.  As a result the driver is
> requesting more than the hypervisor can allow and getting this error
> in reply. I will submit a patch to correct this soon.

This is 1,000 times better than just trying to make the warning message
go away, thanks Thomas!

^ permalink raw reply

* Re: [PATCH v3 net-next 0/3] net: batched receive in GRO path
From: Eric Dumazet @ 2019-08-12 18:20 UTC (permalink / raw)
  To: Ioana Ciocoi Radulescu, Edward Cree
  Cc: David Miller, netdev, Eric Dumazet,
	linux-net-drivers@solarflare.com
In-Reply-To: <AM0PR04MB4994A035C6121DC13C0EFBB194D30@AM0PR04MB4994.eurprd04.prod.outlook.com>



On 8/12/19 7:51 PM, Ioana Ciocoi Radulescu wrote:
>> -----Original Message-----
>> From: Edward Cree <ecree@solarflare.com>
>> Sent: Friday, August 9, 2019 8:32 PM
>> To: Ioana Ciocoi Radulescu <ruxandra.radulescu@nxp.com>
>> Cc: David Miller <davem@davemloft.net>; netdev <netdev@vger.kernel.org>;
>> Eric Dumazet <eric.dumazet@gmail.com>; linux-net-drivers@solarflare.com
>> Subject: Re: [PATCH v3 net-next 0/3] net: batched receive in GRO path
>>
>> On 09/08/2019 18:14, Ioana Ciocoi Radulescu wrote:
>>> Hi Edward,
>>>
>>> I'm probably missing a lot of context here, but is there a reason
>>> this change targets only the napi_gro_frags() path and not the
>>> napi_gro_receive() one?
>>> I'm trying to understand what drivers that don't call napi_gro_frags()
>>> should do in order to benefit from this batching feature.
>> The sfc driver (which is what I have lots of hardware for, so I can
>>  test it) uses napi_gro_frags().
>> It should be possible to do a similar patch to napi_gro_receive(),
>>  if someone wants to put in the effort of writing and testing it.
> 
> Rather tricky, since I'm not really familiar with GRO internals and
> probably don't understand all the implications of such a change :-/
> Any pointers to what I should pay attention to/sensitive areas that
> need extra care?
> 
>> However, there are many more callers, so more effort required to
>>  make sure none of them care whether the return value is GRO_DROP
>>  or GRO_NORMAL (since the listified version cannot give that
>>  indication).
> 
> At a quick glance, there's only one driver that looks at the return
> value of napi_gro_receive (drivers/net/ethernet/socionext/netsec.c),
> and it only updates interface stats based on it.
> 
>> Also, the guidance from Eric is that drivers seeking high performance
>>  should use napi_gro_frags(), as this allows GRO to recycle the SKB.
> 
> But this guidance is for GRO-able frames only, right? If I try to use
> napi_gro_frags() indiscriminately on the Rx path, I get a big
> performance penalty in some cases - e.g. forwarding of non-TCP
> single buffer frames.

How big is big ?

You can not win all the time.

Some design (or optimizations) are for the most common case,
they might hurt some other use cases.

> 
> On the other hand, Eric shot down my attempt to differentiate between
> TCP and non-TCP frames inside the driver (see 
> https://patchwork.ozlabs.org/patch/1135817/#2222236), so I'm not
> really sure what's the recommended approach here?

If GRO is not good enough for non-TCP buffer frames, please make the change in GRO,
or document that disabling GRO might help some setups.

We do not want each driver to implement their own logic that are a
maintenance nightmare.

GRO can aggregate non-TCP frames (say if you add any encapsulation over TCP),
with a very significant gain, so detecting if an incoming frame is a 'TCP packet'
in the driver would be a serious problem if the traffic is 100% SIT for example.


^ permalink raw reply

* Re: [PATCH 1/2] ip nexthop: Add space to display properly when showing a group
From: David Ahern @ 2019-08-12 18:14 UTC (permalink / raw)
  To: Donald Sharp, netdev
In-Reply-To: <20190810001843.32068-2-sharpd@cumulusnetworks.com>

On 8/9/19 6:18 PM, Donald Sharp wrote:
> When displaying a nexthop group made up of other nexthops, the display
> line shows this when you have additional data at the end:
> 
> id 42 group 43/44/45/46/47/48/49/50/51/52/53/54/55/56/57/58/59/60/61/62/63/64/65/66/67/68/69/70/71/72/73/74proto zebra
> 
> Modify code so that it shows:
> 
> id 42 group 43/44/45/46/47/48/49/50/51/52/53/54/55/56/57/58/59/60/61/62/63/64/65/66/67/68/69/70/71/72/73/74 proto zebra
> 
> Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
> ---
>  ip/ipnexthop.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
> index 97f09e74..f35aab52 100644
> --- a/ip/ipnexthop.c
> +++ b/ip/ipnexthop.c
> @@ -186,6 +186,7 @@ static void print_nh_group(FILE *fp, const struct rtattr *grps_attr)
>  
>  		close_json_object();
>  	}
> +	print_string(PRINT_FP, NULL, "%s", " ");
>  	close_json_array(PRINT_JSON, NULL);
>  }
>  
> 

Looks right to me:
Reviewed-by: David Ahern <dsahern@gmail.com>

Stephen: this should go through your tree.

^ permalink raw reply

* Re: KASAN: use-after-free Read in rxrpc_queue_local
From: syzbot @ 2019-08-12 18:06 UTC (permalink / raw)
  To: arvid.brodin, davem, dhowells, linux-afs, linux-kernel, netdev,
	syzkaller-bugs, xiyou.wangcong
In-Reply-To: <0000000000007593f4058fea60d8@google.com>

syzbot has bisected this bug to:

commit b9a1e627405d68d475a3c1f35e685ccfb5bbe668
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Thu Jul 4 00:21:13 2019 +0000

     hsr: implement dellink to clean up resources

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=10b4ebce600000
start commit:   125b7e09 net: tc35815: Explicitly check NET_IP_ALIGN is no..
git tree:       net
final crash:    https://syzkaller.appspot.com/x/report.txt?x=12b4ebce600000
console output: https://syzkaller.appspot.com/x/log.txt?x=14b4ebce600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=a4c9e9f08e9e8960
dashboard link: https://syzkaller.appspot.com/bug?extid=78e71c5bab4f76a6a719
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=165ec172600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=119d4eba600000

Reported-by: syzbot+78e71c5bab4f76a6a719@syzkaller.appspotmail.com
Fixes: b9a1e627405d ("hsr: implement dellink to clean up resources")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

^ permalink raw reply

* Re: [PATCH net] ibmveth: Convert multicast list size for little-endian systems
From: Thomas Falcon @ 2019-08-12 18:02 UTC (permalink / raw)
  To: Joe Perches, netdev; +Cc: liuhangbin, davem
In-Reply-To: <93581c35c0a8c06434221e628153028105849064.camel@perches.com>


On 8/12/19 12:49 PM, Joe Perches wrote:
> On Mon, 2019-08-12 at 12:43 -0500, Thomas Falcon wrote:
>> The ibm,mac-address-filters property defines the maximum number of
>> addresses the hypervisor's multicast filter list can support. It is
>> encoded as a big-endian integer in the OF device tree, but the virtual
>> ethernet driver does not convert it for use by little-endian systems.
>> As a result, the driver is not behaving as it should on affected systems
>> when a large number of multicast addresses are assigned to the device.
>>
>> Reported-by: Hangbin Liu <liuhangbin@gmail.com>
>> Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
>> ---
>>   drivers/net/ethernet/ibm/ibmveth.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
>> index d654c23..b50a6cf 100644
>> --- a/drivers/net/ethernet/ibm/ibmveth.c
>> +++ b/drivers/net/ethernet/ibm/ibmveth.c
>> @@ -1645,7 +1645,7 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
>>   
>>   	adapter->vdev = dev;
>>   	adapter->netdev = netdev;
>> -	adapter->mcastFilterSize = *mcastFilterSize_p;
>> +	adapter->mcastFilterSize = be32_to_cpu(*mcastFilterSize_p);
>>   	adapter->pool_config = 0;
>>   
>>   	netif_napi_add(netdev, &adapter->napi, ibmveth_poll, 16);
> Perhaps to keep sparse happy too: (untested)


Thanks! I will test this and submit a v2 soon.


> ---
>   drivers/net/ethernet/ibm/ibmveth.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
> index d654c234aaf7..90539d7ce565 100644
> --- a/drivers/net/ethernet/ibm/ibmveth.c
> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> @@ -1605,7 +1605,7 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
>   	struct net_device *netdev;
>   	struct ibmveth_adapter *adapter;
>   	unsigned char *mac_addr_p;
> -	unsigned int *mcastFilterSize_p;
> +	__be32 *mcastFilterSize_p;
>   	long ret;
>   	unsigned long ret_attr;
>   
> @@ -1627,7 +1627,7 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
>   		return -EINVAL;
>   	}
>   
> -	mcastFilterSize_p = (unsigned int *)vio_get_attribute(dev,
> +	mcastFilterSize_p = (__be32 *)vio_get_attribute(dev,
>   						VETH_MCAST_FILTER_SIZE, NULL);
>   	if (!mcastFilterSize_p) {
>   		dev_err(&dev->dev, "Can't find VETH_MCAST_FILTER_SIZE "
> @@ -1645,7 +1645,7 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
>   
>   	adapter->vdev = dev;
>   	adapter->netdev = netdev;
> -	adapter->mcastFilterSize = *mcastFilterSize_p;
> +	adapter->mcastFilterSize = be32_to_cpu(*mcastFilterSize_p);
>   	adapter->pool_config = 0;
>   
>   	netif_napi_add(netdev, &adapter->napi, ibmveth_poll, 16);
>
>


^ permalink raw reply

* Re: 5.3-rc3-ish VM crash: RIP: 0010:tcp_trim_head+0x20/0xe0
From: Eric Dumazet @ 2019-08-12 17:56 UTC (permalink / raw)
  To: Sander Eikelenboom, netdev, linux-kernel
In-Reply-To: <27aebb57-0ca9-fba3-092f-39131ad2b648@eikelenboom.it>



On 8/12/19 2:50 PM, Sander Eikelenboom wrote:
> L.S.,
> 
> While testing a somewhere-after-5.3-rc3 kernel (which included the latest net merge (33920f1ec5bf47c5c0a1d2113989bdd9dfb3fae9),
> one of my Xen VM's (which gets quite some network load) crashed.
> See below for the stacktrace.
> 
> Unfortunately I haven't got a clear trigger, so bisection doesn't seem to be an option at the moment. 
> I haven't encountered this on 5.2, so it seems to be an regression against 5.2.
> 
> Any ideas ?
> 
> --
> Sander
> 
> 
> [16930.653595] general protection fault: 0000 [#1] SMP NOPTI
> [16930.653624] CPU: 0 PID: 3275 Comm: rsync Not tainted 5.3.0-rc3-20190809-doflr+ #1
> [16930.653657] RIP: 0010:tcp_trim_head+0x20/0xe0
> [16930.653677] Code: 2e 0f 1f 84 00 00 00 00 00 90 41 54 41 89 d4 55 48 89 fd 53 48 89 f3 f6 46 7e 01 74 2f 8b 86 bc 00 00 00 48 03 86 c0 00 00 00 <8b> 40 20 66 83 f8 01 74 19 31 d2 31 f6 b9 20 0a 00 00 48 89 df e8
> [16930.653741] RSP: 0000:ffffc90000003ad8 EFLAGS: 00010286
> [16930.653762] RAX: fffe888005bf62c0 RBX: ffff8880115fb800 RCX: 000000008010000b

crash in " mov    0x20(%rax),%eax"   and RAX=fffe888005bf62c0 (not a valid kernel address)

Look like one bit corruption maybe.

Nothing comes to mind really between 5.2 and 53 that could explain this.

> [16930.653791] RDX: 00000000000005a0 RSI: ffff8880115fb800 RDI: ffff888016b00880
> [16930.653819] RBP: ffff888016b00880 R08: 0000000000000001 R09: 0000000000000000
> [16930.653848] R10: ffff88800ae00800 R11: 00000000bfe632e6 R12: 00000000000005a0
> [16930.653875] R13: 0000000000000001 R14: 00000000bfe62d46 R15: 0000000000000004
> [16930.653913] FS:  00007fe71fe2cb80(0000) GS:ffff88801f200000(0000) knlGS:0000000000000000
> [16930.653943] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [16930.653965] CR2: 000055de0f3e7000 CR3: 0000000011f32000 CR4: 00000000000006f0
> [16930.653993] Call Trace:
> [16930.654005]  <IRQ>
> [16930.654018]  tcp_ack+0xbb0/0x1230
> [16930.654033]  tcp_rcv_established+0x2e8/0x630
> [16930.654053]  tcp_v4_do_rcv+0x129/0x1d0
> [16930.654070]  tcp_v4_rcv+0xac9/0xcb0
> [16930.654088]  ip_protocol_deliver_rcu+0x27/0x1b0
> [16930.654109]  ip_local_deliver_finish+0x3f/0x50
> [16930.654128]  ip_local_deliver+0x4d/0xe0
> [16930.654145]  ? ip_protocol_deliver_rcu+0x1b0/0x1b0
> [16930.654163]  ip_rcv+0x4c/0xd0
> [16930.654179]  __netif_receive_skb_one_core+0x79/0x90
> [16930.654200]  netif_receive_skb_internal+0x2a/0xa0
> [16930.654219]  napi_gro_receive+0xe7/0x140
> [16930.654237]  xennet_poll+0x9be/0xae0
> [16930.654254]  net_rx_action+0x136/0x340
> [16930.654271]  __do_softirq+0xdd/0x2cf
> [16930.654287]  irq_exit+0x7a/0xa0
> [16930.654304]  xen_evtchn_do_upcall+0x27/0x40
> [16930.654320]  xen_hvm_callback_vector+0xf/0x20
> [16930.654339]  </IRQ>
> [16930.654349] RIP: 0033:0x55de0d87db99
> [16930.654364] Code: 00 00 48 89 7c 24 f8 45 39 fe 45 0f 42 fe 44 89 7c 24 f4 eb 09 0f 1f 40 00 83 e9 01 74 3e 89 f2 48 63 f8 4c 01 d2 44 38 1c 3a <75> 25 44 38 6c 3a ff 75 1e 41 0f b6 3c 24 40 38 3a 75 14 41 0f b6
> [16930.654432] RSP: 002b:00007ffd5531eec8 EFLAGS: 00000a87 ORIG_RAX: ffffffffffffff0c
> [16930.655004] RAX: 0000000000000002 RBX: 000055de0f3e8e50 RCX: 000000000000007f
> [16930.655034] RDX: 000055de0f3dc2d2 RSI: 0000000000003492 RDI: 0000000000000002
> [16930.655062] RBP: 0000000000007fff R08: 00000000000080ea R09: 00000000000001f0
> [16930.655089] R10: 000055de0f3d8e40 R11: 0000000000000094 R12: 000055de0f3e0f2a
> [16930.655116] R13: 0000000000000010 R14: 0000000000007f16 R15: 0000000000000080
> [16930.655144] Modules linked in:
> [16930.655200] ---[ end trace 533367c95501b645 ]---
> [16930.655223] RIP: 0010:tcp_trim_head+0x20/0xe0
> [16930.655243] Code: 2e 0f 1f 84 00 00 00 00 00 90 41 54 41 89 d4 55 48 89 fd 53 48 89 f3 f6 46 7e 01 74 2f 8b 86 bc 00 00 00 48 03 86 c0 00 00 00 <8b> 40 20 66 83 f8 01 74 19 31 d2 31 f6 b9 20 0a 00 00 48 89 df e8
> [16930.655312] RSP: 0000:ffffc90000003ad8 EFLAGS: 00010286
> [16930.655331] RAX: fffe888005bf62c0 RBX: ffff8880115fb800 RCX: 000000008010000b
> [16930.655360] RDX: 00000000000005a0 RSI: ffff8880115fb800 RDI: ffff888016b00880
> [16930.655387] RBP: ffff888016b00880 R08: 0000000000000001 R09: 0000000000000000
> [16930.655414] R10: ffff88800ae00800 R11: 00000000bfe632e6 R12: 00000000000005a0
> [16930.655441] R13: 0000000000000001 R14: 00000000bfe62d46 R15: 0000000000000004
> [16930.655475] FS:  00007fe71fe2cb80(0000) GS:ffff88801f200000(0000) knlGS:0000000000000000
> [16930.655502] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [16930.655525] CR2: 000055de0f3e7000 CR3: 0000000011f32000 CR4: 00000000000006f0
> [16930.655553] Kernel panic - not syncing: Fatal exception in interrupt
> [16930.655789] Kernel Offset: disabled
> 

^ permalink raw reply

* Re: [PATCH v3 13/17] mvpp2: no need to check return value of debugfs_create functions
From: Nick Desaulniers @ 2019-08-12 17:55 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: netdev, David S. Miller, Maxime Chevallier, Nathan Huckleberry
In-Reply-To: <20190810101732.26612-14-gregkh@linuxfoundation.org>

On Sat, Aug 10, 2019 at 3:17 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> When calling debugfs functions, there is no need to ever check the
> return value.  The function can work or not, but the code logic should
> never do something different based on this.

Maybe adding this recommendation to the comment block above the
definition of debugfs_create_dir() in fs/debugfs/inode.c would help
prevent this issue in the future?  What failure means, and how to
proceed can be tricky; more documentation can only help in this
regard.

>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Maxime Chevallier <maxime.chevallier@bootlin.com>
> Cc: Nick Desaulniers <ndesaulniers@google.com>
> Cc: Nathan Huckleberry <nhuck@google.com>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
>  .../ethernet/marvell/mvpp2/mvpp2_debugfs.c    | 19 +------------------
>  1 file changed, 1 insertion(+), 18 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_debugfs.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_debugfs.c
> index 274fb07362cb..4a3baa7e0142 100644
> --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_debugfs.c
> +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_debugfs.c
> @@ -452,8 +452,6 @@ static int mvpp2_dbgfs_flow_port_init(struct dentry *parent,
>         struct dentry *port_dir;
>
>         port_dir = debugfs_create_dir(port->dev->name, parent);
> -       if (IS_ERR(port_dir))
> -               return PTR_ERR(port_dir);
>
>         port_entry = &port->priv->dbgfs_entries->port_flow_entries[port->id];
>
> @@ -480,8 +478,6 @@ static int mvpp2_dbgfs_flow_entry_init(struct dentry *parent,
>         sprintf(flow_entry_name, "%02d", flow);
>
>         flow_entry_dir = debugfs_create_dir(flow_entry_name, parent);
> -       if (!flow_entry_dir)
> -               return -ENOMEM;
>
>         entry = &priv->dbgfs_entries->flow_entries[flow];
>
> @@ -514,8 +510,6 @@ static int mvpp2_dbgfs_flow_init(struct dentry *parent, struct mvpp2 *priv)
>         int i, ret;
>
>         flow_dir = debugfs_create_dir("flows", parent);
> -       if (!flow_dir)
> -               return -ENOMEM;
>
>         for (i = 0; i < MVPP2_N_PRS_FLOWS; i++) {
>                 ret = mvpp2_dbgfs_flow_entry_init(flow_dir, priv, i);
> @@ -539,8 +533,6 @@ static int mvpp2_dbgfs_prs_entry_init(struct dentry *parent,
>         sprintf(prs_entry_name, "%03d", tid);
>
>         prs_entry_dir = debugfs_create_dir(prs_entry_name, parent);
> -       if (!prs_entry_dir)
> -               return -ENOMEM;
>
>         entry = &priv->dbgfs_entries->prs_entries[tid];
>
> @@ -578,8 +570,6 @@ static int mvpp2_dbgfs_prs_init(struct dentry *parent, struct mvpp2 *priv)
>         int i, ret;
>
>         prs_dir = debugfs_create_dir("parser", parent);
> -       if (!prs_dir)
> -               return -ENOMEM;
>
>         for (i = 0; i < MVPP2_PRS_TCAM_SRAM_SIZE; i++) {
>                 ret = mvpp2_dbgfs_prs_entry_init(prs_dir, priv, i);
> @@ -688,8 +678,6 @@ static int mvpp2_dbgfs_port_init(struct dentry *parent,
>         struct dentry *port_dir;
>
>         port_dir = debugfs_create_dir(port->dev->name, parent);
> -       if (IS_ERR(port_dir))
> -               return PTR_ERR(port_dir);
>
>         debugfs_create_file("parser_entries", 0444, port_dir, port,
>                             &mvpp2_dbgfs_port_parser_fops);
> @@ -716,15 +704,10 @@ void mvpp2_dbgfs_init(struct mvpp2 *priv, const char *name)
>         int ret, i;
>
>         mvpp2_root = debugfs_lookup(MVPP2_DRIVER_NAME, NULL);
> -       if (!mvpp2_root) {
> +       if (!mvpp2_root)
>                 mvpp2_root = debugfs_create_dir(MVPP2_DRIVER_NAME, NULL);
> -               if (IS_ERR(mvpp2_root))
> -                       return;
> -       }
>
>         mvpp2_dir = debugfs_create_dir(name, mvpp2_root);
> -       if (IS_ERR(mvpp2_dir))
> -               return;
>
>         priv->dbgfs_dir = mvpp2_dir;
>         priv->dbgfs_entries = kzalloc(sizeof(*priv->dbgfs_entries), GFP_KERNEL);
> --
> 2.22.0
>


-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply

* Re: [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept()
From: Stanislav Fomichev @ 2019-08-12 17:52 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Stanislav Fomichev, netdev, bpf, davem, ast, Martin KaFai Lau,
	Yonghong Song
In-Reply-To: <db5ec323-1126-d461-bc65-27ccc1414589@iogearbox.net>

On 08/12, Daniel Borkmann wrote:
> On 8/9/19 6:10 PM, Stanislav Fomichev wrote:
> > Add new helper bpf_sk_storage_clone which optionally clones sk storage
> > and call it from sk_clone_lock.
> > 
> > Cc: Martin KaFai Lau <kafai@fb.com>
> > Cc: Yonghong Song <yhs@fb.com>
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> [...]
> > +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
> > +{
> > +	struct bpf_sk_storage *new_sk_storage = NULL;
> > +	struct bpf_sk_storage *sk_storage;
> > +	struct bpf_sk_storage_elem *selem;
> > +	int ret;
> > +
> > +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> > +
> > +	rcu_read_lock();
> > +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> > +
> > +	if (!sk_storage || hlist_empty(&sk_storage->list))
> > +		goto out;
> > +
> > +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
> > +		struct bpf_sk_storage_elem *copy_selem;
> > +		struct bpf_sk_storage_map *smap;
> > +		struct bpf_map *map;
> > +		int refold;
> > +
> > +		smap = rcu_dereference(SDATA(selem)->smap);
> > +		if (!(smap->map.map_flags & BPF_F_CLONE))
> > +			continue;
> > +
> > +		map = bpf_map_inc_not_zero(&smap->map, false);
> > +		if (IS_ERR(map))
> > +			continue;
> > +
> > +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
> > +		if (!copy_selem) {
> > +			ret = -ENOMEM;
> > +			bpf_map_put(map);
> > +			goto err;
> > +		}
> > +
> > +		if (new_sk_storage) {
> > +			selem_link_map(smap, copy_selem);
> > +			__selem_link_sk(new_sk_storage, copy_selem);
> > +		} else {
> > +			ret = sk_storage_alloc(newsk, smap, copy_selem);
> > +			if (ret) {
> > +				kfree(copy_selem);
> > +				atomic_sub(smap->elem_size,
> > +					   &newsk->sk_omem_alloc);
> > +				bpf_map_put(map);
> > +				goto err;
> > +			}
> > +
> > +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
> > +		}
> > +		bpf_map_put(map);
> 
> The map get/put combination /under/ RCU read lock seems a bit odd to me, could
> you exactly describe the race that this would be preventing?
There is a race between sk storage release and sk storage clone.
bpf_sk_storage_map_free uses synchronize_rcu to wait for all existing
users to finish and the new ones are prevented via map's refcnt being
zero; we need to do something like that for the clone.
Martin suggested to use bpf_map_inc_not_zero/bpf_map_put.
If I read everythin correctly, I think without map_inc/map_put we
get the following race:

CPU0                                   CPU1

bpf_map_put
  bpf_sk_storage_map_free(smap)
    synchronize_rcu

    // no more users via bpf or
    // syscall, but clone
    // can still happen

    for each (bucket)
      selem_unlink
        selem_unlink_map(smap)

        // adding anything at
        // this point to the
        // bucket will leak

                                       rcu_read_lock
                                       tcp_v4_rcv
                                         tcp_v4_do_rcv
                                           // sk is lockless TCP_LISTEN
                                           tcp_v4_cookie_check
                                             tcp_v4_syn_recv_sock
                                               bpf_sk_storage_clone
                                                 rcu_dereference(sk->sk_bpf_storage)
                                                 selem_link_map(smap, copy)
                                                 // adding new element to the
                                                 // map -> leak
                                       rcu_read_unlock

      selem_unlink_sk
       sk->sk_bpf_storage = NULL

    synchronize_rcu

^ permalink raw reply

* RE: [PATCH v3 net-next 0/3] net: batched receive in GRO path
From: Ioana Ciocoi Radulescu @ 2019-08-12 17:51 UTC (permalink / raw)
  To: Edward Cree
  Cc: David Miller, netdev, Eric Dumazet,
	linux-net-drivers@solarflare.com
In-Reply-To: <a6faf533-6dd3-d7d7-9464-1fe87d0ac7fc@solarflare.com>

> -----Original Message-----
> From: Edward Cree <ecree@solarflare.com>
> Sent: Friday, August 9, 2019 8:32 PM
> To: Ioana Ciocoi Radulescu <ruxandra.radulescu@nxp.com>
> Cc: David Miller <davem@davemloft.net>; netdev <netdev@vger.kernel.org>;
> Eric Dumazet <eric.dumazet@gmail.com>; linux-net-drivers@solarflare.com
> Subject: Re: [PATCH v3 net-next 0/3] net: batched receive in GRO path
> 
> On 09/08/2019 18:14, Ioana Ciocoi Radulescu wrote:
> > Hi Edward,
> >
> > I'm probably missing a lot of context here, but is there a reason
> > this change targets only the napi_gro_frags() path and not the
> > napi_gro_receive() one?
> > I'm trying to understand what drivers that don't call napi_gro_frags()
> > should do in order to benefit from this batching feature.
> The sfc driver (which is what I have lots of hardware for, so I can
>  test it) uses napi_gro_frags().
> It should be possible to do a similar patch to napi_gro_receive(),
>  if someone wants to put in the effort of writing and testing it.

Rather tricky, since I'm not really familiar with GRO internals and
probably don't understand all the implications of such a change :-/
Any pointers to what I should pay attention to/sensitive areas that
need extra care?

> However, there are many more callers, so more effort required to
>  make sure none of them care whether the return value is GRO_DROP
>  or GRO_NORMAL (since the listified version cannot give that
>  indication).

At a quick glance, there's only one driver that looks at the return
value of napi_gro_receive (drivers/net/ethernet/socionext/netsec.c),
and it only updates interface stats based on it.

> Also, the guidance from Eric is that drivers seeking high performance
>  should use napi_gro_frags(), as this allows GRO to recycle the SKB.

But this guidance is for GRO-able frames only, right? If I try to use
napi_gro_frags() indiscriminately on the Rx path, I get a big
performance penalty in some cases - e.g. forwarding of non-TCP
single buffer frames.

On the other hand, Eric shot down my attempt to differentiate between
TCP and non-TCP frames inside the driver (see 
https://patchwork.ozlabs.org/patch/1135817/#2222236), so I'm not
really sure what's the recommended approach here?

> 
> All of this together means I don't plan to submit such a patch; I
>  would however be happy to review a patch if someone else writes one.

Thanks a lot for the explanations!
Ioana

^ permalink raw reply

* Re: [PATCH net-next,v4 08/12] drivers: net: use flow block API
From: Edward Cree @ 2019-08-12 17:50 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netdev, netfilter-devel
In-Reply-To: <20190709205550.3160-9-pablo@netfilter.org>

On 09/07/2019 21:55, Pablo Neira Ayuso wrote:
> This patch updates flow_block_cb_setup_simple() to use the flow block API.
> Several drivers are also adjusted to use it.
>
> This patch introduces the per-driver list of flow blocks to account for
> blocks that are already in use.
>
> Remove tc_block_offload alias.
>
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
> v4: fix typo in list in nfp driver - Jakub Kicinski.
>     Move driver_list handling to the driver code, this list is transitional,
>     until drivers are updated to support multiple subsystems. No more
>     driver_list handling from core.

Pablo, can you explain (because this commit message doesn't) why these per-
 driver lists are needed, and what the information/state is that has module
 (rather than, say, netdevice) scope?

AFAICT none of these drivers otherwise interacts with TC at a module level
 (every block binding, callback etc. is associated with a single netdevice,
 usually through a cb_priv = netdev_priv(net_dev)), so this looks very out
 of place.

(More questions likely to follow as I work my way through trying to re-
 target my in-development driver to this new API.  Also if you have any
 kind of design document explaining how it all fits together, that'd be
 nice, because currently trying to figure it out is giving me a headache.)

-Ed

^ permalink raw reply

* Re: [PATCH net] ibmveth: Convert multicast list size for little-endian systems
From: Joe Perches @ 2019-08-12 17:49 UTC (permalink / raw)
  To: Thomas Falcon, netdev; +Cc: liuhangbin, davem
In-Reply-To: <1565631786-18860-1-git-send-email-tlfalcon@linux.ibm.com>

On Mon, 2019-08-12 at 12:43 -0500, Thomas Falcon wrote:
> The ibm,mac-address-filters property defines the maximum number of
> addresses the hypervisor's multicast filter list can support. It is
> encoded as a big-endian integer in the OF device tree, but the virtual
> ethernet driver does not convert it for use by little-endian systems.
> As a result, the driver is not behaving as it should on affected systems
> when a large number of multicast addresses are assigned to the device.
> 
> Reported-by: Hangbin Liu <liuhangbin@gmail.com>
> Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
> ---
>  drivers/net/ethernet/ibm/ibmveth.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
> index d654c23..b50a6cf 100644
> --- a/drivers/net/ethernet/ibm/ibmveth.c
> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> @@ -1645,7 +1645,7 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
>  
>  	adapter->vdev = dev;
>  	adapter->netdev = netdev;
> -	adapter->mcastFilterSize = *mcastFilterSize_p;
> +	adapter->mcastFilterSize = be32_to_cpu(*mcastFilterSize_p);
>  	adapter->pool_config = 0;
>  
>  	netif_napi_add(netdev, &adapter->napi, ibmveth_poll, 16);

Perhaps to keep sparse happy too: (untested)
---
 drivers/net/ethernet/ibm/ibmveth.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index d654c234aaf7..90539d7ce565 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1605,7 +1605,7 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
 	struct net_device *netdev;
 	struct ibmveth_adapter *adapter;
 	unsigned char *mac_addr_p;
-	unsigned int *mcastFilterSize_p;
+	__be32 *mcastFilterSize_p;
 	long ret;
 	unsigned long ret_attr;
 
@@ -1627,7 +1627,7 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
 		return -EINVAL;
 	}
 
-	mcastFilterSize_p = (unsigned int *)vio_get_attribute(dev,
+	mcastFilterSize_p = (__be32 *)vio_get_attribute(dev,
 						VETH_MCAST_FILTER_SIZE, NULL);
 	if (!mcastFilterSize_p) {
 		dev_err(&dev->dev, "Can't find VETH_MCAST_FILTER_SIZE "
@@ -1645,7 +1645,7 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
 
 	adapter->vdev = dev;
 	adapter->netdev = netdev;
-	adapter->mcastFilterSize = *mcastFilterSize_p;
+	adapter->mcastFilterSize = be32_to_cpu(*mcastFilterSize_p);
 	adapter->pool_config = 0;
 
 	netif_napi_add(netdev, &adapter->napi, ibmveth_poll, 16);



^ permalink raw reply related

* Re: [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released
From: Björn Töpel @ 2019-08-12 17:25 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Netdev, Björn Töpel,
	Karlsson, Magnus, Bruce Richardson, Song Liu, bpf
In-Reply-To: <5ad56a5e-a189-3f56-c85c-24b6c300efd9@iogearbox.net>

On Mon, 12 Aug 2019 at 14:28, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
[...]
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 59b57d708697..c3447bad608a 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -362,6 +362,50 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
> >       dev_put(dev);
> >   }
> >
> > +static struct xsk_map *xsk_get_map_list_entry(struct xdp_sock *xs,
> > +                                           struct xdp_sock ***map_entry)
> > +{
> > +     struct xsk_map *map = NULL;
> > +     struct xsk_map_node *node;
> > +
> > +     *map_entry = NULL;
> > +
> > +     spin_lock_bh(&xs->map_list_lock);
> > +     node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
> > +                                     node);
> > +     if (node) {
> > +             WARN_ON(xsk_map_inc(node->map));
>
> Can you elaborate on the refcount usage here and against what scenario it is protecting?
>

Thanks for having a look!

First we access the map_list (under the lock) and pull out the map
which we intend to clean. In order to clear the map entry, we need to
a reference to the map. However, when the map_list_lock is released,
there's a window where the map entry can be cleared and the map can be
destroyed, and making the "map", which is used in
xsk_delete_from_maps, stale. To guarantee existence the additional
refinc is required. Makes sense?

> Do we pretend it never fails on the bpf_map_inc() wrt the WARN_ON(),
> why that (what makes it different from the xsk_map_node_alloc() inc
> above where we do error out)?

Hmm, given that we're in a cleanup (socket release), we can't really
return any error. What would be a more robust way? Retrying? AFAIK the
release ops return an int, but it's not checked/used.

> > +             map = node->map;
> > +             *map_entry = node->map_entry;
> > +     }
> > +     spin_unlock_bh(&xs->map_list_lock);
> > +     return map;
> > +}
> > +
> > +static void xsk_delete_from_maps(struct xdp_sock *xs)
> > +{
> > +     /* This function removes the current XDP socket from all the
> > +      * maps it resides in. We need to take extra care here, due to
> > +      * the two locks involved. Each map has a lock synchronizing
> > +      * updates to the entries, and each socket has a lock that
> > +      * synchronizes access to the list of maps (map_list). For
> > +      * deadlock avoidance the locks need to be taken in the order
> > +      * "map lock"->"socket map list lock". We start off by
> > +      * accessing the socket map list, and take a reference to the
> > +      * map to guarantee existence. Then we ask the map to remove
> > +      * the socket, which tries to remove the socket from the
> > +      * map. Note that there might be updates to the map between
> > +      * xsk_get_map_list_entry() and xsk_map_try_sock_delete().
> > +      */

I tried to clarify here, but I obviously need to do a better job. :-)


Björn

^ permalink raw reply

* Re: [PATCH net] nexthop: use nlmsg_parse_deprecated()
From: David Ahern @ 2019-08-12 17:23 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller
  Cc: netdev, Eric Dumazet, syzbot, Johannes Berg
In-Reply-To: <20190812113616.51725-1-edumazet@google.com>

On 8/12/19 5:36 AM, Eric Dumazet wrote:
> David missed that commit 8cb081746c03 ("netlink: make validation
> more configurable for future strictness") has renamed nlmsg_parse()

I think the root cause is nlmsg_parse() calling __nla_parse and not
__nlmsg_parse. Users of nlmsg_parse are missing the header validation.

^ permalink raw reply

* Re: [PATCH net-next] net: can: Fix compiling warning
From: Kees Cook @ 2019-08-12 17:19 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Oliver Hartkopp, Patrick Bellasi, linux-sparse, Mao Wenan, davem,
	netdev, linux-kernel, kernel-janitors, Ingo Molnar
In-Reply-To: <20190807105042.GK1974@kadam>

On Wed, Aug 07, 2019 at 01:50:42PM +0300, Dan Carpenter wrote:
> On Tue, Aug 06, 2019 at 06:41:44PM +0200, Oliver Hartkopp wrote:
> > I compiled the code (the original version), but I do not get that "Should it
> > be static?" warning:
> > 
> > user@box:~/net-next$ make C=1
> >   CALL    scripts/checksyscalls.sh
> >   CALL    scripts/atomic/check-atomics.sh
> >   DESCEND  objtool
> >   CHK     include/generated/compile.h
> >   CHECK   net/can/af_can.c
> > ./include/linux/sched.h:609:43: error: bad integer constant expression
> > ./include/linux/sched.h:609:73: error: invalid named zero-width bitfield
> > `value'
> > ./include/linux/sched.h:610:43: error: bad integer constant expression
> > ./include/linux/sched.h:610:67: error: invalid named zero-width bitfield
> > `bucket_id'
> >   CC [M]  net/can/af_can.o
> 
> The sched.h errors suppress Sparse warnings so it's broken/useless now.
> The code looks like this:
> 
> include/linux/sched.h
>    613  struct uclamp_se {
>    614          unsigned int value              : bits_per(SCHED_CAPACITY_SCALE);
>    615          unsigned int bucket_id          : bits_per(UCLAMP_BUCKETS);
>    616          unsigned int active             : 1;
>    617          unsigned int user_defined       : 1;
>    618  };
> 
> bits_per() is zero and Sparse doesn't like zero sized bitfields.

I just noticed these sparse warnings too -- what's happening here? Are
they _supposed_ to be 0-width fields? It doesn't look like it to me:

CONFIG_UCLAMP_BUCKETS_COUNT=5
...
#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
...
        unsigned int bucket_id          : bits_per(UCLAMP_BUCKETS);

I would expect this to be 3 bits wide. ... Looks like gcc agrees:

struct uclamp_se {
    unsigned int               value:11;             /*     0: 0  4 */
    unsigned int               bucket_id:3;          /*     0:11  4 */
...

So this is a sparse issue?

-- 
Kees Cook

^ permalink raw reply

* [PATCH net-next] net: devlink: remove redundant rtnl lock assert
From: Vlad Buslov @ 2019-08-12 17:02 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, Vlad Buslov, Jiri Pirko

It is enough for caller of devlink_compat_switch_id_get() to hold the net
device to guarantee that devlink port is not destroyed concurrently. Remove
rtnl lock assertion and modify comment to warn user that they must hold
either rtnl lock or reference to net device. This is necessary to
accommodate future implementation of rtnl-unlocked TC offloads driver
callbacks.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 net/core/devlink.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/core/devlink.c b/net/core/devlink.c
index e8f0b891f000..e98e7bd9740b 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -6938,11 +6938,10 @@ int devlink_compat_switch_id_get(struct net_device *dev,
 {
 	struct devlink_port *devlink_port;
 
-	/* RTNL mutex is held here which ensures that devlink_port
-	 * instance cannot disappear in the middle. No need to take
+	/* Caller must hold RTNL mutex or reference to dev, which ensures that
+	 * devlink_port instance cannot disappear in the middle. No need to take
 	 * any devlink lock as only permanent values are accessed.
 	 */
-	ASSERT_RTNL();
 	devlink_port = netdev_to_devlink_port(dev);
 	if (!devlink_port || !devlink_port->attrs.switch_port)
 		return -EOPNOTSUPP;
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH v5 19/29] compat_ioctl: move hci_sock handlers into driver
From: Marcel Holtmann @ 2019-08-12 16:29 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, Johan Hedberg,
	David S. Miller, Deepa Dinamani, linux-bluetooth, netdev
In-Reply-To: <20190730195819.901457-7-arnd@arndb.de>

Hi Arnd,

> All these ioctl commands are compatible, so we can handle
> them with a trivial wrapper in hci_sock.c and remove
> the listing in fs/compat_ioctl.c.
> 
> A few of the commands pass integer arguments instead of
> pointers, so for correctness skip the compat_ptr() conversion
> here.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> fs/compat_ioctl.c        | 24 ------------------------
> net/bluetooth/hci_sock.c | 21 ++++++++++++++++++++-
> 2 files changed, 20 insertions(+), 25 deletions(-)

I think it is best if this series is applied as a whole. So whoever takes it

Acked-by: Marcel Holtmann <marcel@holtmann.org>

Regards

Marcel


^ permalink raw reply

* Re: [PATCH v5 18/29] compat_ioctl: move rfcomm handlers into driver
From: Marcel Holtmann @ 2019-08-12 16:29 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, Johan Hedberg,
	David S. Miller, Mauro Carvalho Chehab, linux-bluetooth, netdev
In-Reply-To: <20190730195819.901457-6-arnd@arndb.de>

Hi Arnd,

> All these ioctl commands are compatible, so we can handle
> them with a trivial wrapper in rfcomm/sock.c and remove
> the listing in fs/compat_ioctl.c.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> fs/compat_ioctl.c           |  6 ------
> net/bluetooth/rfcomm/sock.c | 14 ++++++++++++--
> 2 files changed, 12 insertions(+), 8 deletions(-)

I think it is best if this series is applied as a whole. So whoever takes it

Acked-by: Marcel Holtmann <marcel@holtmann.org>

Regards

Marcel


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox