Netdev List
 help / color / mirror / Atom feed
* Re: [patch net-next rfc 2/7] net: introduce name_node struct to be used in hashlist
From: kbuild test robot @ 2019-08-08  4:34 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: kbuild-all, netdev, davem, jakub.kicinski, sthemmin, dsahern,
	dcbw, mkubecek, andrew, parav, saeedm, mlxsw
In-Reply-To: <20190719110029.29466-3-jiri@resnulli.us>

[-- Attachment #1: Type: text/plain, Size: 1138 bytes --]

Hi Jiri,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Jiri-Pirko/net-introduce-alternative-names-for-network-interfaces/20190720-094639
config: arc-defconfig (attached as .config)
compiler: arc-elf-gcc (GCC) 7.4.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.4.0 make.cross ARCH=arc 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   {standard input}: Assembler messages:
   {standard input}:147: Error: inappropriate arguments for opcode 'st'
>> {standard input}:150: Error: Insn j_s has a jump/branch instruction j_s in its delay slot.
   {standard input}:175: Error: inappropriate arguments for opcode 'st'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 8655 bytes --]

^ permalink raw reply

* Re: [PATCH v2 bpf-next] btf: expose BTF info through sysfs
From: Yonghong Song @ 2019-08-08  4:24 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf@vger.kernel.org, netdev@vger.kernel.org,
	Alexei Starovoitov, daniel@iogearbox.net
  Cc: andrii.nakryiko@gmail.com, Kernel Team, Masahiro Yamada,
	Arnaldo Carvalho de Melo, Jiri Olsa, Sam Ravnborg
In-Reply-To: <20190808003215.1462821-1-andriin@fb.com>



On 8/7/19 5:32 PM, Andrii Nakryiko wrote:
> Make .BTF section allocated and expose its contents through sysfs.
> 
> /sys/kernel/btf directory is created to contain all the BTFs present
> inside kernel. Currently there is only kernel's main BTF, represented as
> /sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported,
> each module will expose its BTF as /sys/kernel/btf/<module-name> file.
> 
> Current approach relies on a few pieces coming together:
> 1. pahole is used to take almost final vmlinux image (modulo .BTF and
>     kallsyms) and generate .BTF section by converting DWARF info into
>     BTF. This section is not allocated and not mapped to any segment,
>     though, so is not yet accessible from inside kernel at runtime.
> 2. objcopy dumps .BTF contents into binary file and subsequently
>     convert binary file into linkable object file with automatically
>     generated symbols _binary__btf_kernel_bin_start and
>     _binary__btf_kernel_bin_end, pointing to start and end, respectively,
>     of BTF raw data.
> 3. final vmlinux image is generated by linking this object file (and
>     kallsyms, if necessary). sysfs_btf.c then creates
>     /sys/kernel/btf/kernel file and exposes embedded BTF contents through
>     it. This allows, e.g., libbpf and bpftool access BTF info at
>     well-known location, without resorting to searching for vmlinux image
>     on disk (location of which is not standardized and vmlinux image
>     might not be even available in some scenarios, e.g., inside qemu
>     during testing).
> 
> Alternative approach using .incbin assembler directive to embed BTF
> contents directly was attempted but didn't work, because sysfs_proc.o is
> not re-compiled during link-vmlinux.sh stage. This is required, though,
> to update embedded BTF data (initially empty data is embedded, then
> pahole generates BTF info and we need to regenerate sysfs_btf.o with
> updated contents, but it's too late at that point).
> 
> If BTF couldn't be generated due to missing or too old pahole,
> sysfs_btf.c handles that gracefully by detecting that
> _binary__btf_kernel_bin_start (weak symbol) is 0 and not creating
> /sys/kernel/btf at all.
> 
> v1->v2:
> - allow kallsyms stage to re-use vmlinux generated by gen_btf();
> 
> Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: Sam Ravnborg <sam@ravnborg.org>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
>   kernel/bpf/Makefile     |  3 +++
>   kernel/bpf/sysfs_btf.c  | 52 ++++++++++++++++++++++++++++++++++++++
>   scripts/link-vmlinux.sh | 55 +++++++++++++++++++++++++++--------------
>   3 files changed, 91 insertions(+), 19 deletions(-)
>   create mode 100644 kernel/bpf/sysfs_btf.c
> 
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 29d781061cd5..e1d9adb212f9 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -22,3 +22,6 @@ obj-$(CONFIG_CGROUP_BPF) += cgroup.o
>   ifeq ($(CONFIG_INET),y)
>   obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
>   endif
> +ifeq ($(CONFIG_SYSFS),y)
> +obj-$(CONFIG_DEBUG_INFO_BTF) += sysfs_btf.o
> +endif
> diff --git a/kernel/bpf/sysfs_btf.c b/kernel/bpf/sysfs_btf.c
> new file mode 100644
> index 000000000000..ac06ce1d62e8
> --- /dev/null
> +++ b/kernel/bpf/sysfs_btf.c
> @@ -0,0 +1,52 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Provide kernel BTF information for introspection and use by eBPF tools.
> + */
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/kobject.h>
> +#include <linux/init.h>
> +
> +/* See scripts/link-vmlinux.sh, gen_btf() func for details */
> +extern char __weak _binary__btf_kernel_bin_start[];
> +extern char __weak _binary__btf_kernel_bin_end[];
> +
> +static ssize_t
> +btf_kernel_read(struct file *file, struct kobject *kobj,
> +		struct bin_attribute *bin_attr,
> +		char *buf, loff_t off, size_t len)
> +{
> +	memcpy(buf, _binary__btf_kernel_bin_start + off, len);
> +	return len;
> +}
> +
> +static struct bin_attribute btf_kernel_attr __ro_after_init = {
> +	.attr = {
> +		.name = "kernel",
> +		.mode = 0444,
> +	},
> +	.read = btf_kernel_read,
> +};
> +
> +static struct bin_attribute *btf_attrs[] __ro_after_init = {
> +	&btf_kernel_attr,
> +	NULL,
> +};
> +
> +static struct attribute_group btf_group_attr __ro_after_init = {
> +	.name = "btf",
> +	.bin_attrs = btf_attrs,
> +};
> +
> +static int __init btf_kernel_init(void)
> +{
> +	if (!_binary__btf_kernel_bin_start)
> +		return 0;
> +
> +	btf_kernel_attr.size = _binary__btf_kernel_bin_end -
> +			       _binary__btf_kernel_bin_start;
> +
> +	return sysfs_create_group(kernel_kobj, &btf_group_attr);
> +}
> +
> +subsys_initcall(btf_kernel_init);
> diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
> index a7124f895b24..e05abe19b11f 100755
> --- a/scripts/link-vmlinux.sh
> +++ b/scripts/link-vmlinux.sh
> @@ -56,8 +56,8 @@ modpost_link()
>   }
>   
>   # Link of vmlinux
> -# ${1} - optional extra .o files
> -# ${2} - output file
> +# ${1} - output file
> +# ${@:2} - optional extra .o files
>   vmlinux_link()
>   {
>   	local lds="${objtree}/${KBUILD_LDS}"
> @@ -70,9 +70,9 @@ vmlinux_link()
>   			--start-group				\
>   			${KBUILD_VMLINUX_LIBS}			\
>   			--end-group				\
> -			${1}"
> +			${@:2}"
>   
> -		${LD} ${KBUILD_LDFLAGS} ${LDFLAGS_vmlinux} -o ${2}	\
> +		${LD} ${KBUILD_LDFLAGS} ${LDFLAGS_vmlinux} -o ${1}	\
>   			-T ${lds} ${objects}
>   	else
>   		objects="-Wl,--whole-archive			\
> @@ -81,9 +81,9 @@ vmlinux_link()
>   			-Wl,--start-group			\
>   			${KBUILD_VMLINUX_LIBS}			\
>   			-Wl,--end-group				\
> -			${1}"
> +			${@:2}"
>   
> -		${CC} ${CFLAGS_vmlinux} -o ${2}			\
> +		${CC} ${CFLAGS_vmlinux} -o ${1}			\
>   			-Wl,-T,${lds}				\
>   			${objects}				\
>   			-lutil -lrt -lpthread
> @@ -92,23 +92,34 @@ vmlinux_link()
>   }
>   
>   # generate .BTF typeinfo from DWARF debuginfo
> +# ${1} - vmlinux image
> +# ${2} - file to dump raw BTF data into
>   gen_btf()
>   {
> -	local pahole_ver;
> +	local pahole_ver
> +	local bin_arch
>   
>   	if ! [ -x "$(command -v ${PAHOLE})" ]; then
>   		info "BTF" "${1}: pahole (${PAHOLE}) is not available"
> -		return 0
> +		return 1
>   	fi
>   
>   	pahole_ver=$(${PAHOLE} --version | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/')
>   	if [ "${pahole_ver}" -lt "113" ]; then
>   		info "BTF" "${1}: pahole version $(${PAHOLE} --version) is too old, need at least v1.13"
> -		return 0
> +		return 1
>   	fi
>   
> -	info "BTF" ${1}
> +	info "BTF" ${2}
> +	vmlinux_link ${1}
>   	LLVM_OBJCOPY=${OBJCOPY} ${PAHOLE} -J ${1}
> +
> +	# dump .BTF section into raw binary file to link with final vmlinux
> +	bin_arch=$(${OBJDUMP} -f ${1} | grep architecture | \
> +		cut -d, -f1 | cut -d' ' -f2)
> +	${OBJCOPY} --dump-section .BTF=.btf.kernel.bin ${1} 2>/dev/null
> +	${OBJCOPY} -I binary -O ${CONFIG_OUTPUT_FORMAT} -B ${bin_arch} \
> +		--rename-section .data=.BTF .btf.kernel.bin ${2}

Currently, the binary size on my config is about 2.6MB. Do you think
we could or need to compress it to make it smaller? I tried gzip
and the compressed size is 0.9MB.

>   }
>   
>   # Create ${2} .o file with all symbols from the ${1} object file
> @@ -153,6 +164,7 @@ sortextable()
>   # Delete output files in case of error
>   cleanup()
>   {
> +	rm -f .btf.*
>   	rm -f .tmp_System.map
>   	rm -f .tmp_kallsyms*
>   	rm -f .tmp_vmlinux*
> @@ -215,6 +227,13 @@ ${MAKE} -f "${srctree}/scripts/Makefile.modpost" vmlinux.o
>   info MODINFO modules.builtin.modinfo
>   ${OBJCOPY} -j .modinfo -O binary vmlinux.o modules.builtin.modinfo
>   
> +btf_kernel_bin_o=""
> +if [ -n "${CONFIG_DEBUG_INFO_BTF}" ]; then
> +	if gen_btf .tmp_vmlinux1 .btf.kernel.bin.o ; then
> +		btf_kernel_bin_o=.btf.kernel.bin.o
> +	fi
> +fi
> +
>   kallsymso=""
>   kallsyms_vmlinux=""
>   if [ -n "${CONFIG_KALLSYMS}" ]; then
> @@ -246,11 +265,14 @@ if [ -n "${CONFIG_KALLSYMS}" ]; then
>   	kallsyms_vmlinux=.tmp_vmlinux2
>   
>   	# step 1
> -	vmlinux_link "" .tmp_vmlinux1
> +	# skip building .tmp_vmlinux1 if gen_btf() already did that
> +	if [ -z "${btf_kernel_bin_o}" ]; then
> +		vmlinux_link .tmp_vmlinux1
> +	fi
>   	kallsyms .tmp_vmlinux1 .tmp_kallsyms1.o
>   
>   	# step 2
> -	vmlinux_link .tmp_kallsyms1.o .tmp_vmlinux2
> +	vmlinux_link .tmp_vmlinux2 .tmp_kallsyms1.o ${btf_kernel_bin_o}
>   	kallsyms .tmp_vmlinux2 .tmp_kallsyms2.o
>   
>   	# step 3
> @@ -261,18 +283,13 @@ if [ -n "${CONFIG_KALLSYMS}" ]; then
>   		kallsymso=.tmp_kallsyms3.o
>   		kallsyms_vmlinux=.tmp_vmlinux3
>   
> -		vmlinux_link .tmp_kallsyms2.o .tmp_vmlinux3
> -
> +		vmlinux_link .tmp_vmlinux3 .tmp_kallsyms2.o ${btf_kernel_bin_o}
>   		kallsyms .tmp_vmlinux3 .tmp_kallsyms3.o
>   	fi
>   fi
>   
>   info LD vmlinux
> -vmlinux_link "${kallsymso}" vmlinux
> -
> -if [ -n "${CONFIG_DEBUG_INFO_BTF}" ]; then
> -	gen_btf vmlinux
> -fi
> +vmlinux_link vmlinux "${kallsymso}" "${btf_kernel_bin_o}"
>   
>   if [ -n "${CONFIG_BUILDTIME_EXTABLE_SORT}" ]; then
>   	info SORTEX vmlinux
> 

^ permalink raw reply

* Re: [PATCH net-next v4 2/2] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Tao Ren @ 2019-08-08  4:24 UTC (permalink / raw)
  To: Heiner Kallweit, Andrew Lunn, Florian Fainelli, David S . Miller,
	Arun Parameswaran, Justin Chen, Vladimir Oltean,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	openbmc@lists.ozlabs.org
In-Reply-To: <cfd6e14e-a447-aedb-5bd6-bf65b4b6f98a@gmail.com>

Hi Heiner,

On 8/7/19 12:18 PM, Heiner Kallweit wrote:
> On 06.08.2019 23:42, Tao Ren wrote:
>> Hi Andrew / Heiner / Vladimir,
>>
>> On 8/6/19 2:09 PM, Tao Ren wrote:
>>> The BCM54616S PHY cannot work properly in RGMII->1000Base-KX mode (for
>>> example, on Facebook CMM BMC platform), mainly because genphy functions
>>> are designed for copper links, and 1000Base-X (clause 37) auto negotiation
>>> needs to be handled differently.
>>>
>>> This patch enables 1000Base-X support for BCM54616S by customizing 3
>>> driver callbacks:
>>>
>>>   - probe: probe callback detects PHY's operation mode based on
>>>     INTERF_SEL[1:0] pins and 1000X/100FX selection bit in SerDES 100-FX
>>>     Control register.
>>>
>>>   - config_aneg: bcm54616s_config_aneg_1000bx function is added for auto
>>>     negotiation in 1000Base-X mode.
>>>
>>>   - read_status: BCM54616S and BCM5482 PHY share the same read_status
>>>     callback which manually set link speed and duplex mode in 1000Base-X
>>>     mode.
>>>
>>> Signed-off-by: Tao Ren <taoren@fb.com>
>>
>> I customized config_aneg function for BCM54616S 1000Base-X mode and link-down issue is also fixed: the patch is tested on Facebook CMM and Minipack BMC and everything looks normal. Please kindly review when you have bandwidth and let me know if you have further suggestions.
>>
>> BTW, I would be happy to help if we decide to add a set of genphy functions for clause 37, although that may mean I need more help/guidance from you :-)
> 
> You want to have standard clause 37 aneg and this should be generic in phylib.
> I hacked together a first version that is compile-tested only:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.ozlabs.org_patch_1143631_&d=DwICaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=iYElT7HC77pRZ3byVvW8ng&m=ZJArOJvHqNkqvs1x8l9HjfxjCN8e5xJpPz2YViBuKRA&s=EskpfBQtu9IBVeb96dv-sz76xIz4tJK5-lD4-qdIyWI&e= 
> It supports fixed mode too.
> 
> It doesn't support half duplex mode because phylib doesn't know 1000BaseX HD yet.
> Not sure whether half duplex mode is used at all in reality.
> 
> You could test the new core functions in your own config_aneg and read_status
> callback implementations.

Thank you very much for the help! I'm planning to add these functions but I haven't started yet because I'm still going through clause 37 :-)

Let me apply your patch and run some test on my platform. Will share you results tomorrow.


Cheers,

Tao

^ permalink raw reply

* Slow internet? Don't miss out on this WiFi booster
From:  Lester Hardy @ 2019-08-08  4:08 UTC (permalink / raw)
  To: netdev

Internet providers make big money by overcharging you for faster internet lines. Could this little device really be the one solution that we've all been waiting for?

Frequency: 2.4Ghz
Wireless Rate: 300Mbps
Interface: 1 10/100Mbps WAN/LAN RJ45 Ports

Amazing new technology find out here! http://bit.ly/asdf411gt

------------------------
If you'd prefer not to receive future emails, Unsubscribe Here http://bit.ly/itr4fgy
11041 Santa Monica Blvd. #301 Los Angeles, CA 90025


^ permalink raw reply

* [PATCH v3 1/1] ixgbe: sync the first fragment unconditionally
From: Firo Yang @ 2019-08-08  4:03 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: davem@davemloft.net, alexander.duyck@gmail.com,
	alexander.h.duyck@linux.intel.com, jian.w.wen@oracle.com,
	jeffrey.t.kirsher@intel.com, linux-kernel@vger.kernel.org,
	maciejromanfijalkowski@gmail.com,
	intel-wired-lan@lists.osuosl.org, Firo Yang

In Xen environment, if Xen-swiotlb is enabled, ixgbe driver
could possibly allocate a page, DMA memory buffer, for the first
fragment which is not suitable for Xen-swiotlb to do DMA operations.
Xen-swiotlb have to internally allocate another page for doing DMA
operations. This mechanism requires syncing the data from the internal
page to the page which ixgbe sends to upper network stack. However,
since commit f3213d932173 ("ixgbe: Update driver to make use of DMA
attributes in Rx path"), the unmap operation is performed with
DMA_ATTR_SKIP_CPU_SYNC. As a result, the sync is not performed.
Since the sync isn't performed, the upper network stack could receive
a incomplete network packet. By incomplete, it means the linear data
on the first fragment(between skb->head and skb->end) is invalid. So 
we have to copy the data from the internal xen-swiotlb page to the page 
which ixgbe sends to upper network stack through the sync operation.

More details from Alexander Duyck:
Specifically since we are mapping the frame with
DMA_ATTR_SKIP_CPU_SYNC we have to unmap with that as well. As a result
a sync is not performed on an unmap and must be done manually as we
skipped it for the first frag. As such we need to always sync before
possibly performing a page unmap operation.

Fixes: f3213d932173 ("ixgbe: Update driver to make use of DMA
attributes in Rx path")
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Firo Yang <firo.yang@suse.com>
---
Changes from v2:
 * Added details on the problem caused by skipping the sync.
 * Added more explanation from Alexander Duyck.

Changes from v1:
 * Imporved the patch description.
 * Added Reviewed-by: and Fixes: as suggested by Alexander Duyck.

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index cbaf712d6529..200de9838096 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1825,13 +1825,7 @@ static void ixgbe_pull_tail(struct ixgbe_ring *rx_ring,
 static void ixgbe_dma_sync_frag(struct ixgbe_ring *rx_ring,
 				struct sk_buff *skb)
 {
-	/* if the page was released unmap it, else just sync our portion */
-	if (unlikely(IXGBE_CB(skb)->page_released)) {
-		dma_unmap_page_attrs(rx_ring->dev, IXGBE_CB(skb)->dma,
-				     ixgbe_rx_pg_size(rx_ring),
-				     DMA_FROM_DEVICE,
-				     IXGBE_RX_DMA_ATTR);
-	} else if (ring_uses_build_skb(rx_ring)) {
+	if (ring_uses_build_skb(rx_ring)) {
 		unsigned long offset = (unsigned long)(skb->data) & ~PAGE_MASK;
 
 		dma_sync_single_range_for_cpu(rx_ring->dev,
@@ -1848,6 +1842,14 @@ static void ixgbe_dma_sync_frag(struct ixgbe_ring *rx_ring,
 					      skb_frag_size(frag),
 					      DMA_FROM_DEVICE);
 	}
+
+	/* If the page was released, just unmap it. */
+	if (unlikely(IXGBE_CB(skb)->page_released)) {
+		dma_unmap_page_attrs(rx_ring->dev, IXGBE_CB(skb)->dma,
+				     ixgbe_rx_pg_size(rx_ring),
+				     DMA_FROM_DEVICE,
+				     IXGBE_RX_DMA_ATTR);
+	}
 }
 
 /**
-- 
2.16.4


^ permalink raw reply related

* Re: [PATCH v2 1/1] ixgbe: sync the first fragment unconditionally
From: Alexander Duyck @ 2019-08-08  3:48 UTC (permalink / raw)
  To: Jacob Wen
  Cc: Firo Yang, davem@davemloft.net, jeffrey.t.kirsher@intel.com,
	alexander.h.duyck@linux.intel.com,
	intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <901704f1-163d-9dd8-4d20-93fa19f4435d@oracle.com>

On Wed, Aug 7, 2019 at 6:58 PM Jacob Wen <jian.w.wen@oracle.com> wrote:
>
>
> On 8/7/19 4:38 PM, Firo Yang wrote:
> > The 08/07/2019 15:56, Jacob Wen wrote:
> >> I think the description is not correct. Consider using something like below.
> > Thank you for comments.
> >
> >> In Xen environment, due to memory fragmentation ixgbe may allocate a 'DMA'
> >> buffer with pages that are not physically contiguous.
> > Actually, I didn't look into the reason why ixgbe got a DMA buffer which
> > was mapped to Xen-swiotlb area.
> Yes. I was wrong. You don't need to tell the exact reason.
> >
> > But I don't think this issue relates to phsical memory contiguity because, in
> > our case, one ixgbe_rx_buffer only associates at most one page.
>
> This is interesting.
>
> I guess the performance of the NIC in your environment is not good due
> to the extra cost of bounce buffer.

If I recall correctly the Rx performance for ixgbe shouldn't be too
bad even with a bounce buffer. The cost for map/unmap are expensive
for a bounce buffer setup but the syncs are just copies so they are
pretty cheap in comparison. The driver can take advantage of that on
Rx since it leaves the pages mapped and just syncs the portion of the
pages that are used.

Now if you are hitting the bounce buffer on the Tx side that is
another matter as you have to allocate the buffer on demand and that
is quite expensive.

^ permalink raw reply

* Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites
From: John Hubbard @ 2019-08-08  3:46 UTC (permalink / raw)
  To: Ira Weiny, Michal Hocko
  Cc: Jan Kara, Matthew Wilcox, Andrew Morton, Christoph Hellwig,
	Dan Williams, Dave Chinner, Dave Hansen, Jason Gunthorpe,
	Jérôme Glisse, LKML, amd-gfx, ceph-devel, devel, devel,
	dri-devel, intel-gfx, kvm, linux-arm-kernel, linux-block,
	linux-crypto, linux-fbdev, linux-fsdevel, linux-media, linux-mm,
	linux-nfs, linux-rdma, linux-rpi-kernel, linux-xfs, netdev,
	rds-devel, sparclinux, x86, xen-devel
In-Reply-To: <20190808023637.GA1508@iweiny-DESK2.sc.intel.com>

On 8/7/19 7:36 PM, Ira Weiny wrote:
> On Wed, Aug 07, 2019 at 10:46:49AM +0200, Michal Hocko wrote:
>> On Wed 07-08-19 10:37:26, Jan Kara wrote:
>>> On Fri 02-08-19 12:14:09, John Hubbard wrote:
>>>> On 8/2/19 7:52 AM, Jan Kara wrote:
>>>>> On Fri 02-08-19 07:24:43, Matthew Wilcox wrote:
>>>>>> On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote:
>>>>>>> On Fri 02-08-19 11:12:44, Michal Hocko wrote:
>>>>>>>> On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote:
  [...]
> Before I go on, I would like to say that the "imbalance" of get_user_pages()
> and put_page() bothers me from a purist standpoint...  However, since this
> discussion cropped up I went ahead and ported my work to Linus' current master
> (5.3-rc3+) and in doing so I only had to steal a bit of Johns code...  Sorry
> John...  :-(
> 
> I don't have the commit messages all cleaned up and I know there may be some
> discussion on these new interfaces but I wanted to throw this series out there
> because I think it may be what Jan and Michal are driving at (or at least in
> that direction.
> 
> Right now only RDMA and DAX FS's are supported.  Other users of GUP will still
> fail on a DAX file and regular files will still be at risk.[2]
> 
> I've pushed this work (based 5.3-rc3+ (33920f1ec5bf)) here[3]:
> 
> https://github.com/weiny2/linux-kernel/tree/linus-rdmafsdax-b0-v3
> 
> I think the most relevant patch to this conversation is:
> 
> https://github.com/weiny2/linux-kernel/commit/5d377653ba5cf11c3b716f904b057bee6641aaf6
> 

ohhh...can you please avoid using the old __put_user_pages_dirty()
function? I thought I'd caught things early enough to get away with
the rename and deletion of that. You could either:

a) open code an implementation of vaddr_put_pages_dirty_lock() that
doesn't call any of the *put_user_pages_dirty*() variants, or

b) include my first patch ("") are part of your series, or

c) base this on Andrews's tree, which already has merged in my first patch.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply

* linux-next: manual merge of the bpf-next tree with Linus' tree
From: Stephen Rothwell @ 2019-08-08  2:53 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov, Networking
  Cc: Linux Next Mailing List, Linux Kernel Mailing List,
	Andrii Nakryiko

[-- Attachment #1: Type: text/plain, Size: 2054 bytes --]

Hi all,

Today's linux-next merge of the bpf-next tree got a conflict in:

  tools/lib/bpf/libbpf.c

between commit:

  1d4126c4e119 ("libbpf: sanitize VAR to conservative 1-byte INT")

from Linus' tree and commit:

  b03bc6853c0e ("libbpf: convert libbpf code to use new btf helpers")

from the bpf-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc tools/lib/bpf/libbpf.c
index 2b57d7ea7836,3abf2dd1b3b5..000000000000
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@@ -1370,22 -1374,16 +1372,21 @@@ static void bpf_object__sanitize_btf(st
  
  	for (i = 1; i <= btf__get_nr_types(btf); i++) {
  		t = (struct btf_type *)btf__type_by_id(btf, i);
- 		kind = BTF_INFO_KIND(t->info);
  
- 		if (!has_datasec && kind == BTF_KIND_VAR) {
+ 		if (!has_datasec && btf_is_var(t)) {
  			/* replace VAR with INT */
  			t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
 -			t->size = sizeof(int);
 -			*(int *)(t + 1) = BTF_INT_ENC(0, 0, 32);
 +			/*
 +			 * using size = 1 is the safest choice, 4 will be too
 +			 * big and cause kernel BTF validation failure if
 +			 * original variable took less than 4 bytes
 +			 */
 +			t->size = 1;
- 			*(int *)(t+1) = BTF_INT_ENC(0, 0, 8);
- 		} else if (!has_datasec && kind == BTF_KIND_DATASEC) {
++			*(int *)(t + 1) = BTF_INT_ENC(0, 0, 8);
+ 		} else if (!has_datasec && btf_is_datasec(t)) {
  			/* replace DATASEC with STRUCT */
- 			struct btf_var_secinfo *v = (void *)(t + 1);
- 			struct btf_member *m = (void *)(t + 1);
+ 			const struct btf_var_secinfo *v = btf_var_secinfos(t);
+ 			struct btf_member *m = btf_members(t);
  			struct btf_type *vt;
  			char *name;
  

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites
From: Ira Weiny @ 2019-08-08  2:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jan Kara, John Hubbard, Matthew Wilcox, john.hubbard,
	Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Dave Hansen, Jason Gunthorpe, Jérôme Glisse, LKML,
	amd-gfx, ceph-devel, devel, devel, dri-devel, intel-gfx, kvm,
	linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
	linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
	linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
	xen-devel
In-Reply-To: <20190807084649.GQ11812@dhcp22.suse.cz>

On Wed, Aug 07, 2019 at 10:46:49AM +0200, Michal Hocko wrote:
> On Wed 07-08-19 10:37:26, Jan Kara wrote:
> > On Fri 02-08-19 12:14:09, John Hubbard wrote:
> > > On 8/2/19 7:52 AM, Jan Kara wrote:
> > > > On Fri 02-08-19 07:24:43, Matthew Wilcox wrote:
> > > > > On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote:
> > > > > > On Fri 02-08-19 11:12:44, Michal Hocko wrote:
> > > > > > > On Thu 01-08-19 19:19:31, john.hubbard@gmail.com wrote:
> > > > > > > [...]
> > > > > > > > 2) Convert all of the call sites for get_user_pages*(), to
> > > > > > > > invoke put_user_page*(), instead of put_page(). This involves dozens of
> > > > > > > > call sites, and will take some time.
> > > > > > > 
> > > > > > > How do we make sure this is the case and it will remain the case in the
> > > > > > > future? There must be some automagic to enforce/check that. It is simply
> > > > > > > not manageable to do it every now and then because then 3) will simply
> > > > > > > be never safe.
> > > > > > > 
> > > > > > > Have you considered coccinele or some other scripted way to do the
> > > > > > > transition? I have no idea how to deal with future changes that would
> > > > > > > break the balance though.
> > > 
> > > Hi Michal,
> > > 
> > > Yes, I've thought about it, and coccinelle falls a bit short (it's not smart
> > > enough to know which put_page()'s to convert). However, there is a debug
> > > option planned: a yet-to-be-posted commit [1] uses struct page extensions
> > > (obviously protected by CONFIG_DEBUG_GET_USER_PAGES_REFERENCES) to add
> > > a redundant counter. That allows:
> > > 
> > > void __put_page(struct page *page)
> > > {
> > > 	...
> > > 	/* Someone called put_page() instead of put_user_page() */
> > > 	WARN_ON_ONCE(atomic_read(&page_ext->pin_count) > 0);
> > > 
> > > > > > 
> > > > > > Yeah, that's why I've been suggesting at LSF/MM that we may need to create
> > > > > > a gup wrapper - say vaddr_pin_pages() - and track which sites dropping
> > > > > > references got converted by using this wrapper instead of gup. The
> > > > > > counterpart would then be more logically named as unpin_page() or whatever
> > > > > > instead of put_user_page().  Sure this is not completely foolproof (you can
> > > > > > create new callsite using vaddr_pin_pages() and then just drop refs using
> > > > > > put_page()) but I suppose it would be a high enough barrier for missed
> > > > > > conversions... Thoughts?
> > > 
> > > The debug option above is still a bit simplistic in its implementation
> > > (and maybe not taking full advantage of the data it has), but I think
> > > it's preferable, because it monitors the "core" and WARNs.
> > > 
> > > Instead of the wrapper, I'm thinking: documentation and the passage of
> > > time, plus the debug option (perhaps enhanced--probably once I post it
> > > someone will notice opportunities), yes?
> > 
> > So I think your debug option and my suggested renaming serve a bit
> > different purposes (and thus both make sense). If you do the renaming, you
> > can just grep to see unconverted sites. Also when someone merges new GUP
> > user (unaware of the new rules) while you switch GUP to use pins instead of
> > ordinary references, you'll get compilation error in case of renaming
> > instead of hard to debug refcount leak without the renaming. And such
> > conflict is almost bound to happen given the size of GUP patch set... Also
> > the renaming serves against the "coding inertia" - i.e., GUP is around for
> > ages so people just use it without checking any documentation or comments.
> > After switching how GUP works, what used to be correct isn't anymore so
> > renaming the function serves as a warning that something has really
> > changed.
> 
> Fully agreed!

Ok Prior to this I've been basing all my work for the RDMA/FS DAX stuff in
Johns put_user_pages()...  (Including when I proposed failing truncate with a
lease in June [1])

However, based on the suggestions in that thread it became clear that a new
interface was going to need to be added to pass in the "RDMA file" information
to GUP to associate file pins with the correct processes...

I have many drawings on my white board with "a whole lot of lines" on them to
make sure that if a process opens a file, mmaps it, pins it with RDMA, _closes_
it, and ummaps it; that the resulting file pin can still be traced back to the
RDMA context and all the processes which may have access to it....  No matter
where the original context may have come from.  I believe I have accomplished
that.

Before I go on, I would like to say that the "imbalance" of get_user_pages()
and put_page() bothers me from a purist standpoint...  However, since this
discussion cropped up I went ahead and ported my work to Linus' current master
(5.3-rc3+) and in doing so I only had to steal a bit of Johns code...  Sorry
John...  :-(

I don't have the commit messages all cleaned up and I know there may be some
discussion on these new interfaces but I wanted to throw this series out there
because I think it may be what Jan and Michal are driving at (or at least in
that direction.

Right now only RDMA and DAX FS's are supported.  Other users of GUP will still
fail on a DAX file and regular files will still be at risk.[2]

I've pushed this work (based 5.3-rc3+ (33920f1ec5bf)) here[3]:

https://github.com/weiny2/linux-kernel/tree/linus-rdmafsdax-b0-v3

I think the most relevant patch to this conversation is:

https://github.com/weiny2/linux-kernel/commit/5d377653ba5cf11c3b716f904b057bee6641aaf6

I stole Jans suggestion for a name as the name I used while prototyping was
pretty bad...  So Thanks Jan...  ;-)

Also thanks to John for his contribution on some of this.  I'm still tweaking
put_user_pages under the hood on the DAX path.

Ira

[1] https://lwn.net/Articles/790544/

[2] I've been looking into how to support io_uring next but I've had some issue
getting a test program to actually call GUP in that code path...  :-(

[3] If it would be easier I can just throw an RFC on the list but right now the
cover letter and some of the commit messages are full of the old stuff and
various ideas I have had...

> 
> > Your refcount debug patches are good to catch bugs in the conversions done
> > but that requires you to be able to excercise the code path in the first
> > place which may require particular HW or so, and you also have to enable
> > the debug option which means you already aim at verifying the GUP
> > references are treated properly.
> > 
> > 								Honza
> > 
> > -- 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply

* Re: [PATCH v2] Fix non-kerneldoc comment in realtek/rtlwifi/usb.c
From: Larry Finger @ 2019-08-08  2:07 UTC (permalink / raw)
  To: Valdis Klētnieks, Ping-Ke Shih, Kalle Valo, David S. Miller
  Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <34195.1565229118@turing-police>

On 8/7/19 8:51 PM, Valdis Klētnieks wrote:
> Fix spurious warning message when building with W=1:
> 
>    CC [M]  drivers/net/wireless/realtek/rtlwifi/usb.o
> drivers/net/wireless/realtek/rtlwifi/usb.c:243: warning: Cannot understand  * on line 243 - I thought it was a doc line
> drivers/net/wireless/realtek/rtlwifi/usb.c:760: warning: Cannot understand  * on line 760 - I thought it was a doc line
> drivers/net/wireless/realtek/rtlwifi/usb.c:790: warning: Cannot understand  * on line 790 - I thought it was a doc line
> 
> Clean up the comment format.
> 
> Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
> 
> ---
> Changes since v1:  Larry Finger pointed out the patch wasn't checkpatch-clean.
> 
> diff --git a/drivers/net/wireless/realtek/rtlwifi/usb.c b/drivers/net/wireless/realtek/rtlwifi/usb.c
> index 34d68dbf4b4c..4b59f3b46b28 100644
> --- a/drivers/net/wireless/realtek/rtlwifi/usb.c
> +++ b/drivers/net/wireless/realtek/rtlwifi/usb.c
> @@ -239,10 +239,7 @@ static void _rtl_usb_io_handler_release(struct ieee80211_hw *hw)
>   	mutex_destroy(&rtlpriv->io.bb_mutex);
>   }
>   
> -/**
> - *
> - *	Default aggregation handler. Do nothing and just return the oldest skb.
> - */
> +/*	Default aggregation handler. Do nothing and just return the oldest skb.  */
>   static struct sk_buff *_none_usb_tx_aggregate_hdl(struct ieee80211_hw *hw,
>   						  struct sk_buff_head *list)
>   {
> @@ -756,11 +753,6 @@ static int rtl_usb_start(struct ieee80211_hw *hw)
>   	return err;
>   }
>   
> -/**
> - *
> - *
> - */
> -
>   /*=======================  tx =========================================*/
>   static void rtl_usb_cleanup(struct ieee80211_hw *hw)
>   {
> @@ -786,11 +778,7 @@ static void rtl_usb_cleanup(struct ieee80211_hw *hw)
>   	usb_kill_anchored_urbs(&rtlusb->tx_submitted);
>   }
>   
> -/**
> - *
> - * We may add some struct into struct rtl_usb later. Do deinit here.
> - *
> - */
> +/* We may add some struct into struct rtl_usb later. Do deinit here.  */
>   static void rtl_usb_deinit(struct ieee80211_hw *hw)
>   {
>   	rtl_usb_cleanup(hw);

I missed that the subject line should be "rtwifi: Fix ....". Otherwise it is OK.

Larry



^ permalink raw reply

* RE: [PATCH net-next] r8169: allocate rx buffers using alloc_pages_node
From: Hayes Wang @ 2019-08-08  2:00 UTC (permalink / raw)
  To: Heiner Kallweit, nic_swsd, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <7d79d794-b41c-101f-0720-59eea88bf9ab@gmail.com>

> From: Heiner Kallweit [mailto:hkallweit1@gmail.com]
> Sent: Thursday, August 08, 2019 3:38 AM
> To: nic_swsd; David Miller
> Cc: netdev@vger.kernel.org
> Subject: [PATCH net-next] r8169: allocate rx buffers using alloc_pages_node
> 
> We allocate 16kb per rx buffer, so we can avoid some overhead by using
> alloc_pages_node directly instead of bothering kmalloc_node. Due to
> this change buffers are page-aligned now, therefore the alignment check
> can be removed.
> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>

Acked-by: Hayes Wang <hayeswang@realtek.com>


^ permalink raw reply

* Re: [PATCH v2 1/1] ixgbe: sync the first fragment unconditionally
From: Jacob Wen @ 2019-08-08  1:56 UTC (permalink / raw)
  To: Firo Yang
  Cc: davem@davemloft.net, jeffrey.t.kirsher@intel.com,
	alexander.h.duyck@linux.intel.com,
	intel-wired-lan@lists.osuosl.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20190807083831.GA6811@linux-6qg8>


On 8/7/19 4:38 PM, Firo Yang wrote:
> The 08/07/2019 15:56, Jacob Wen wrote:
>> I think the description is not correct. Consider using something like below.
> Thank you for comments.
>
>> In Xen environment, due to memory fragmentation ixgbe may allocate a 'DMA'
>> buffer with pages that are not physically contiguous.
> Actually, I didn't look into the reason why ixgbe got a DMA buffer which
> was mapped to Xen-swiotlb area.
Yes. I was wrong. You don't need to tell the exact reason.
>
> But I don't think this issue relates to phsical memory contiguity because, in
> our case, one ixgbe_rx_buffer only associates at most one page.

This is interesting.

I guess the performance of the NIC in your environment is not good due 
to the extra cost of bounce buffer.

>
> If you take a look at the related code, you will find there are several reasons
> for mapping a DMA buffer to Xen-swiotlb area:
> static dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
>           */
>          if (dma_capable(dev, dev_addr, size) &&
>              !range_straddles_page_boundary(phys, size) &&
>                  !xen_arch_need_swiotlb(dev, phys, dev_addr) &&
>                  swiotlb_force != SWIOTLB_FORCE)
>                  goto done;
>
> // Firo
>> A NIC doesn't support directly write such buffer. So xen-swiotlb would use
>> the pages, which are physically contiguous, from the swiotlb buffer for the
>> NIC.
>>
>> The unmap operation is used to copy the swiotlb buffer to the pages that are
>> allocated by ixgbe.
>>
>> On 8/7/19 10:49 AM, Firo Yang wrote:
>>> In Xen environment, if Xen-swiotlb is enabled, ixgbe driver
>>> could possibly allocate a page, DMA memory buffer, for the first
>>> fragment which is not suitable for Xen-swiotlb to do DMA operations.
>>> Xen-swiotlb have to internally allocate another page for doing DMA
>>> operations. It requires syncing between those two pages. However,
>>> since commit f3213d932173 ("ixgbe: Update driver to make use of DMA
>>> attributes in Rx path"), the unmap operation is performed with
>>> DMA_ATTR_SKIP_CPU_SYNC. As a result, the sync is not performed.
>>>
>>> To fix this problem, always sync before possibly performing a page
>>> unmap operation.
>>>
>>> Fixes: f3213d932173 ("ixgbe: Update driver to make use of DMA
>>> attributes in Rx path")
>>> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>> Signed-off-by: Firo Yang <firo.yang@suse.com>
>>> ---
>>>
>>> Changes from v1:
>>>    * Imporved the patch description.
>>>    * Added Reviewed-by: and Fixes: as suggested by Alexander Duyck
>>>
>>>    drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 16 +++++++++-------
>>>    1 file changed, 9 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> index cbaf712d6529..200de9838096 100644
>>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> @@ -1825,13 +1825,7 @@ static void ixgbe_pull_tail(struct ixgbe_ring *rx_ring,
>>>    static void ixgbe_dma_sync_frag(struct ixgbe_ring *rx_ring,
>>>    				struct sk_buff *skb)
>>>    {
>>> -	/* if the page was released unmap it, else just sync our portion */
>>> -	if (unlikely(IXGBE_CB(skb)->page_released)) {
>>> -		dma_unmap_page_attrs(rx_ring->dev, IXGBE_CB(skb)->dma,
>>> -				     ixgbe_rx_pg_size(rx_ring),
>>> -				     DMA_FROM_DEVICE,
>>> -				     IXGBE_RX_DMA_ATTR);
>>> -	} else if (ring_uses_build_skb(rx_ring)) {
>>> +	if (ring_uses_build_skb(rx_ring)) {
>>>    		unsigned long offset = (unsigned long)(skb->data) & ~PAGE_MASK;
>>>    		dma_sync_single_range_for_cpu(rx_ring->dev,
>>> @@ -1848,6 +1842,14 @@ static void ixgbe_dma_sync_frag(struct ixgbe_ring *rx_ring,
>>>    					      skb_frag_size(frag),
>>>    					      DMA_FROM_DEVICE);
>>>    	}
>>> +
>>> +	/* If the page was released, just unmap it. */
>>> +	if (unlikely(IXGBE_CB(skb)->page_released)) {
>>> +		dma_unmap_page_attrs(rx_ring->dev, IXGBE_CB(skb)->dma,
>>> +				     ixgbe_rx_pg_size(rx_ring),
>>> +				     DMA_FROM_DEVICE,
>>> +				     IXGBE_RX_DMA_ATTR);
>>> +	}
>>>    }
>>>    /**

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH v2 1/1] ixgbe: sync the first fragment unconditionally
From: Firo Yang @ 2019-08-08  1:55 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alexander Duyck, davem@davemloft.net, Alexander Duyck,
	intel-wired-lan, Jacob Wen, LKML, Netdev
In-Reply-To: <CAKgT0UfEh8cvTht3yceyXqwReJOQkcpJV8j0vHSJwookTWhn_Q@mail.gmail.com>

The 08/07/2019 09:06, Alexander Duyck wrote:
> On Wed, Aug 7, 2019 at 7:09 AM Maciej Fijalkowski
> <maciejromanfijalkowski@gmail.com> wrote:
> >
> > On Wed, 7 Aug 2019 08:38:43 +0000
> > Firo Yang <firo.yang@suse.com> wrote:
> >
> > > The 08/07/2019 15:56, Jacob Wen wrote:
> > > > I think the description is not correct. Consider using something like below.
> > > Thank you for comments.
> > >
> > > >
> > > > In Xen environment, due to memory fragmentation ixgbe may allocate a 'DMA'
> > > > buffer with pages that are not physically contiguous.
> > > Actually, I didn't look into the reason why ixgbe got a DMA buffer which
> > > was mapped to Xen-swiotlb area.
> >
> > I think that neither of these descriptions are telling us what was truly
> > broken. They lack what Alexander wrote on v1 thread of this patch.
> >
> > ixgbe_dma_sync_frag is called only on case when the current descriptor has EOP
> > bit set, skb was already allocated and you'll be adding a current buffer as a
> > frag. DMA unmapping for the first frag was intentionally skipped and we will be
> > unmapping it here, in ixgbe_dma_sync_frag. As Alex said, we're using the
> > DMA_ATTR_SKIP_CPU_SYNC attribute which obliges us to perform a sync manually
> > and it was missing.
> >
> > So IMHO the commit description should include descriptions from both xen and
> > ixgbe worlds, the v2 lacks info about ixgbe case.
Thank you. Will submit v3 with what Alexander worte on v1.

Thanks,
Firo
> >
> > BTW Alex, what was the initial reason for holding off with unmapping the first
> > buffer? Asking because IIRC the i40e and ice behave a bit different here. We
> > don't look there for EOP at all when building/constructing skb and not delaying
> > the unmap of non-eop buffers.
> >
> > Thanks,
> > Maciej
> 
> The reason why we have to hold off on unmapping the first buffer is
> because in the case of Receive Side Coalescing (RSC), also known as Large
> Receive Offload (LRO), the header of the packet is updated for each
> additional frame that is added. As such you can end up with the device
> writing data, header, data, header, data, header where each data write
> would update a new descriptor, but the header will only ever update the
> first descriptor in the chain. As such if we unmapped it earlier it would
> result in an IOMMU fault because the device would be rewriting the header
> after it had been unmapped.
> 
> The devices supported by the ixgbe driver are the only ones that have
> RSC/LRO support. As such this behavior is present for ixgbe, but not for
> other Intel NIC drivers including igb, igbvf, ixgbevf, i40e, etc.
> 
> - Alex
> 

^ permalink raw reply

* Re: [PATCH] Fix non-kerneldoc comment in realtek/rtlwifi/usb.c
From: Randy Dunlap @ 2019-08-08  1:54 UTC (permalink / raw)
  To: Valdis Klētnieks, Ping-Ke Shih, Kalle Valo, David S. Miller
  Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <5924.1565217560@turing-police>

On 8/7/19 3:39 PM, Valdis Klētnieks wrote:
> Fix spurious warning message when building with W=1:
> 
>   CC [M]  drivers/net/wireless/realtek/rtlwifi/usb.o
> drivers/net/wireless/realtek/rtlwifi/usb.c:243: warning: Cannot understand  * on line 243 - I thought it was a doc line
> drivers/net/wireless/realtek/rtlwifi/usb.c:760: warning: Cannot understand  * on line 760 - I thought it was a doc line
> drivers/net/wireless/realtek/rtlwifi/usb.c:790: warning: Cannot understand  * on line 790 - I thought it was a doc line
> 
> Change the comment so gcc doesn't think it's a kerneldoc comment block

s:gcc:scripts/kernel-doc/

and as Larry pointed out, networking comment style is "different" from the rest
of the kernel.  :(

> 
> Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
> 
> diff --git a/drivers/net/wireless/realtek/rtlwifi/usb.c b/drivers/net/wireless/realtek/rtlwifi/usb.c
> index e24fda5e9087..9478cc0d4f8b 100644
> --- a/drivers/net/wireless/realtek/rtlwifi/usb.c
> +++ b/drivers/net/wireless/realtek/rtlwifi/usb.c
> @@ -239,7 +239,7 @@ static void _rtl_usb_io_handler_release(struct ieee80211_hw *hw)
>  	mutex_destroy(&rtlpriv->io.bb_mutex);
>  }
>  
> -/**
> +/*
>   *
>   *	Default aggregation handler. Do nothing and just return the oldest skb.
>   */
> @@ -756,7 +756,7 @@ static int rtl_usb_start(struct ieee80211_hw *hw)
>  	return err;
>  }
>  
> -/**
> +/*
>   *
>   *
>   */
> @@ -786,7 +786,7 @@ static void rtl_usb_cleanup(struct ieee80211_hw *hw)
>  	usb_kill_anchored_urbs(&rtlusb->tx_submitted);
>  }
>  
> -/**
> +/*
>   *
>   * We may add some struct into struct rtl_usb later. Do deinit here.
>   *
> 


-- 
~Randy

^ permalink raw reply

* [PATCH v2] Fix non-kerneldoc comment in realtek/rtlwifi/usb.c
From: Valdis Klētnieks @ 2019-08-08  1:51 UTC (permalink / raw)
  To: Ping-Ke Shih, Kalle Valo, David S. Miller
  Cc: linux-wireless, netdev, linux-kernel

Fix spurious warning message when building with W=1:

  CC [M]  drivers/net/wireless/realtek/rtlwifi/usb.o
drivers/net/wireless/realtek/rtlwifi/usb.c:243: warning: Cannot understand  * on line 243 - I thought it was a doc line
drivers/net/wireless/realtek/rtlwifi/usb.c:760: warning: Cannot understand  * on line 760 - I thought it was a doc line
drivers/net/wireless/realtek/rtlwifi/usb.c:790: warning: Cannot understand  * on line 790 - I thought it was a doc line

Clean up the comment format.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>

---
Changes since v1:  Larry Finger pointed out the patch wasn't checkpatch-clean.

diff --git a/drivers/net/wireless/realtek/rtlwifi/usb.c b/drivers/net/wireless/realtek/rtlwifi/usb.c
index 34d68dbf4b4c..4b59f3b46b28 100644
--- a/drivers/net/wireless/realtek/rtlwifi/usb.c
+++ b/drivers/net/wireless/realtek/rtlwifi/usb.c
@@ -239,10 +239,7 @@ static void _rtl_usb_io_handler_release(struct ieee80211_hw *hw)
 	mutex_destroy(&rtlpriv->io.bb_mutex);
 }
 
-/**
- *
- *	Default aggregation handler. Do nothing and just return the oldest skb.
- */
+/*	Default aggregation handler. Do nothing and just return the oldest skb.  */
 static struct sk_buff *_none_usb_tx_aggregate_hdl(struct ieee80211_hw *hw,
 						  struct sk_buff_head *list)
 {
@@ -756,11 +753,6 @@ static int rtl_usb_start(struct ieee80211_hw *hw)
 	return err;
 }
 
-/**
- *
- *
- */
-
 /*=======================  tx =========================================*/
 static void rtl_usb_cleanup(struct ieee80211_hw *hw)
 {
@@ -786,11 +778,7 @@ static void rtl_usb_cleanup(struct ieee80211_hw *hw)
 	usb_kill_anchored_urbs(&rtlusb->tx_submitted);
 }
 
-/**
- *
- * We may add some struct into struct rtl_usb later. Do deinit here.
- *
- */
+/* We may add some struct into struct rtl_usb later. Do deinit here.  */
 static void rtl_usb_deinit(struct ieee80211_hw *hw)
 {
 	rtl_usb_cleanup(hw);


^ permalink raw reply related

* RE: [PATCH net-next 5/5] r8152: change rx_frag_head_sz and rx_max_agg_num dynamically
From: Hayes Wang @ 2019-08-08  1:40 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Jakub Kicinski, netdev@vger.kernel.org, nic_swsd,
	linux-kernel@vger.kernel.org, linux-usb@vger.kernel.org
In-Reply-To: <20190807144346.00005d2b@gmail.com>

Maciej Fijalkowski [mailto:maciejromanfijalkowski@gmail.com]
> Sent: Wednesday, August 07, 2019 8:44 PM
[...]
> > Excuse me.
> > I find struct ethtool_tunable for ETHTOOL_RX_COPYBREAK.
> > How about the descriptor count?
> 
> Look how drivers implement ethtool's set_ringparam ops.

I would check it. Thanks.


Best Regards,
Hayes



^ permalink raw reply

* Re: [PATCH bpf-next] tools/bpf: fix core_reloc.c compilation error
From: Alexei Starovoitov @ 2019-08-08  1:25 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Network Development, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Andrii Nakryiko
In-Reply-To: <20190808003856.555097-1-yhs@fb.com>

On Wed, Aug 7, 2019 at 5:42 PM Yonghong Song <yhs@fb.com> wrote:
>
> On my local machine, I have the following compilation errors:
> =====
>   In file included from prog_tests/core_reloc.c:3:0:
>   ./progs/core_reloc_types.h:517:46: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘fancy_char_ptr_t’
>  typedef const char * const volatile restrict fancy_char_ptr_t;
>                                               ^
>   ./progs/core_reloc_types.h:527:2: error: unknown type name ‘fancy_char_ptr_t’
>     fancy_char_ptr_t d;
>     ^
> =====
>
> I am using gcc 4.8.5. Later compilers may change their behavior not emitting the
> error. Nevertheless, let us fix the issue. "restrict" can be tested
> without typedef.
>
> Fixes: 9654e2ae908e ("selftests/bpf: add CO-RE relocs modifiers/typedef tests")
> Cc: Andrii Nakryiko <andriin@fb.com>
> Signed-off-by: Yonghong Song <yhs@fb.com>

Applied. Thanks.

^ permalink raw reply

* [PATCH v5 bpf-next] BPF: helpers: New helper to obtain namespace data from current task
From: Carlos Antonio Neira Bustos @ 2019-08-08  1:22 UTC (permalink / raw)
  To: netdev; +Cc: yhs, ebiederm, brouer, quentin.monnet, cneirabustos

The code has been modified to avoid syscalls that could sleep.
Please let me know if any other modification is needed.

From be0384c0fa209a78c1567936e8db4e35b9a7c0f8 Mon Sep 17 00:00:00 2001
From: Carlos <cneirabustos@gmail.com>
Date: Wed, 7 Aug 2019 20:04:30 -0400
Subject: [PATCH] [PATCH v5 bpf-next] BPF: New helper to obtain namespace data 
 from current task

This helper obtains the active namespace from current and returns pid, tgid,
device and namespace id as seen from that namespace, allowing to instrument
a process inside a container.
Device is read from /proc/self/ns/pid, as in the future it's possible that
different pid_ns files may belong to different devices, according
to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
conference.
Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
scripts but this helper returns the pid as seen by the root namespace which is
fine when a bcc script is not executed inside a container.
When the process of interest is inside a container, pid filtering will not work
if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
returning the pid as it's seen by the current namespace where the script is
executing.

This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
used to do pid filtering even inside a container.

For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):

        u32 pid = bpf_get_current_pid_tgid() >> 32;
        if (pid != <pid_arg_passed_in>)
                return 0;
Could be modified to use bpf_get_current_pidns_info() as follows:

        struct bpf_pidns pidns;
        bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
        u32 pid = pidns.tgid;
        u32 nsid = pidns.nsid;
        if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
                return 0;

To find out the name PID namespace id of a process, you could use this command:

$ ps -h -o pidns -p <pid_of_interest>

Or this other command:

$ ls -Li /proc/<pid_of_interest>/ns/pid

Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
---
 fs/namei.c                                         |   2 +-
 include/linux/bpf.h                                |   1 +
 include/linux/namei.h                              |   4 +
 include/uapi/linux/bpf.h                           |  29 ++++-
 kernel/bpf/core.c                                  |   1 +
 kernel/bpf/helpers.c                               |  78 ++++++++++++
 kernel/trace/bpf_trace.c                           |   2 +
 samples/bpf/Makefile                               |   3 +
 samples/bpf/trace_ns_info_user.c                   |  35 ++++++
 samples/bpf/trace_ns_info_user_kern.c              |  44 +++++++
 tools/include/uapi/linux/bpf.h                     |  29 ++++-
 tools/testing/selftests/bpf/Makefile               |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h          |   3 +
 .../testing/selftests/bpf/progs/test_pidns_kern.c  |  51 ++++++++
 tools/testing/selftests/bpf/test_pidns.c           | 138 +++++++++++++++++++++
 15 files changed, 418 insertions(+), 4 deletions(-)
 create mode 100644 samples/bpf/trace_ns_info_user.c
 create mode 100644 samples/bpf/trace_ns_info_user_kern.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_pidns_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_pidns.c

diff --git a/fs/namei.c b/fs/namei.c
index 209c51a5226c..d1eca36972d2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -19,7 +19,6 @@
 #include <linux/export.h>
 #include <linux/kernel.h>
 #include <linux/slab.h>
-#include <linux/fs.h>
 #include <linux/namei.h>
 #include <linux/pagemap.h>
 #include <linux/fsnotify.h>
@@ -2355,6 +2354,7 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
 	putname(name);
 	return retval;
 }
+EXPORT_SYMBOL(filename_lookup);
 
 /* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
 static int path_parentat(struct nameidata *nd, unsigned flags,
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f9a506147c8a..e4adf5e05afd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1050,6 +1050,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
 extern const struct bpf_func_proto bpf_strtol_proto;
 extern const struct bpf_func_proto bpf_strtoul_proto;
 extern const struct bpf_func_proto bpf_tcp_sock_proto;
+extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 9138b4471dbf..2c24e8c71d46 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -6,6 +6,7 @@
 #include <linux/path.h>
 #include <linux/fcntl.h>
 #include <linux/errno.h>
+#include <linux/fs.h>
 
 enum { MAX_NESTED_LINKS = 8 };
 
@@ -97,6 +98,9 @@ extern void unlock_rename(struct dentry *, struct dentry *);
 
 extern void nd_jump_link(struct path *path);
 
+extern int filename_lookup(int dfd, struct filename *name, unsigned int flags,
+		    struct path *path, struct path *root);
+
 static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
 {
 	((char *) name)[min(len, maxlen)] = '\0';
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4393bd4b2419..6f601f7106e2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2741,6 +2741,26 @@ union bpf_attr {
  *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
  *
  *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
+ *
+ * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
+ *	Description
+ *		Copies into *pidns* pid, namespace id and tgid as seen by the
+ *		current namespace and also device from /proc/self/ns/pid.
+ *		*size_of_pidns* must be the size of *pidns*
+ *
+ *		This helper is used when pid filtering is needed inside a
+ *		container as bpf_get_current_tgid() helper returns always the
+ *		pid id as seen by the root namespace.
+ *	Return
+ *		0 on success
+ *
+ *		**-EINVAL**  if unable to get ns, pid or tgid of current task.
+ *		Or if size_of_pidns is not valid.
+ *
+ *		**-ENOMEM**  if allocation fails.
+ *
+ *		If unable to get the inode from /proc/self/ns/pid an error code
+ *		will be returned.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2853,7 +2873,8 @@ union bpf_attr {
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
 	FN(send_signal),		\
-	FN(tcp_gen_syncookie),
+	FN(tcp_gen_syncookie),		\
+	FN(get_current_pidns_info),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3604,4 +3625,10 @@ struct bpf_sockopt {
 	__s32	retval;
 };
 
+struct bpf_pidns_info {
+	__u32 dev;
+	__u32 nsid;
+	__u32 tgid;
+	__u32 pid;
+};
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 8191a7db2777..3159f2a0188c 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
 const struct bpf_func_proto bpf_get_current_comm_proto __weak;
 const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
 const struct bpf_func_proto bpf_get_local_storage_proto __weak;
+const struct bpf_func_proto bpf_get_current_pidns_info __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5e28718928ca..571f24077db2 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -11,6 +11,12 @@
 #include <linux/uidgid.h>
 #include <linux/filter.h>
 #include <linux/ctype.h>
+#include <linux/pid_namespace.h>
+#include <linux/major.h>
+#include <linux/stat.h>
+#include <linux/namei.h>
+#include <linux/version.h>
+
 
 #include "../../lib/kstrtox.h"
 
@@ -312,6 +318,78 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 	preempt_enable();
 }
 
+BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
+	 size)
+{
+	const char *name = "/proc/self/ns/pid";
+	struct pid_namespace *pidns = NULL;
+	struct filename *tmp = NULL;
+	int len = strlen(name) + 1;
+	struct inode *inode;
+	struct path kp;
+	pid_t tgid = 0;
+	pid_t pid = 0;
+	int ret;
+
+	if (unlikely(size != sizeof(struct bpf_pidns_info)))
+		return -EINVAL;
+
+	pidns = task_active_pid_ns(current);
+
+	if (unlikely(!pidns))
+		goto clear;
+
+	pidns_info->nsid =  pidns->ns.inum;
+	pid = task_pid_nr_ns(current, pidns);
+
+	if (unlikely(!pid))
+		goto clear;
+
+	tgid = task_tgid_nr_ns(current, pidns);
+
+	if (unlikely(!tgid))
+		goto clear;
+
+	pidns_info->tgid = (u32) tgid;
+	pidns_info->pid = (u32) pid;
+
+	tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
+	if (unlikely(!tmp)) {
+		memset((void *)pidns_info, 0, (size_t) size);
+		return -ENOMEM;
+	}
+
+	memcpy((char *)tmp->name, name, len);
+	tmp->uptr = NULL;
+	tmp->aname = NULL;
+	tmp->refcnt = 1;
+
+	ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
+
+	if (ret) {
+		memset((void *)pidns_info, 0, (size_t) size);
+		return ret;
+	}
+
+	inode = d_backing_inode(kp.dentry);
+	pidns_info->dev = inode->i_sb->s_dev;
+
+	return 0;
+
+clear:
+	memset((void *)pidns_info, 0, (size_t) size);
+
+	return -EINVAL;
+}
+
+const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
+	.func	= bpf_get_current_pidns_info,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE,
+};
+
 #ifdef CONFIG_CGROUPS
 BPF_CALL_0(bpf_get_current_cgroup_id)
 {
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ca1255d14576..5e1dc22765a5 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 #endif
 	case BPF_FUNC_send_signal:
 		return &bpf_send_signal_proto;
+	case BPF_FUNC_get_current_pidns_info:
+		return &bpf_get_current_pidns_info_proto;
 	default:
 		return NULL;
 	}
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 1d9be26b4edd..238453ff27d2 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ hostprogs-y += task_fd_query
 hostprogs-y += xdp_sample_pkts
 hostprogs-y += ibumad
 hostprogs-y += hbm
+hostprogs-y += trace_ns_info
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -109,6 +110,7 @@ task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
 ibumad-objs := bpf_load.o ibumad_user.o $(TRACE_HELPERS)
 hbm-objs := bpf_load.o hbm.o $(CGROUP_HELPERS)
+trace_ns_info-objs := bpf_load.o trace_ns_info_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -170,6 +172,7 @@ always += xdp_sample_pkts_kern.o
 always += ibumad_kern.o
 always += hbm_out_kern.o
 always += hbm_edt_kern.o
+always += trace_ns_info_user_kern.o
 
 KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
 KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/bpf/
diff --git a/samples/bpf/trace_ns_info_user.c b/samples/bpf/trace_ns_info_user.c
new file mode 100644
index 000000000000..e06d08db6f30
--- /dev/null
+++ b/samples/bpf/trace_ns_info_user.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <stdio.h>
+#include <linux/bpf.h>
+#include <unistd.h>
+#include "bpf/libbpf.h"
+#include "bpf_load.h"
+
+/* This code was taken verbatim from tracex1_user.c, it's used
+ * to exercize bpf_get_current_pidns_info() helper call.
+ */
+int main(int ac, char **argv)
+{
+	FILE *f;
+	char filename[256];
+
+	snprintf(filename, sizeof(filename), "%s_user_kern.o", argv[0]);
+	printf("loading %s\n", filename);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	f = popen("taskset 1 ping  localhost", "r");
+	(void) f;
+	read_trace_pipe();
+	return 0;
+}
diff --git a/samples/bpf/trace_ns_info_user_kern.c b/samples/bpf/trace_ns_info_user_kern.c
new file mode 100644
index 000000000000..96675e02b707
--- /dev/null
+++ b/samples/bpf/trace_ns_info_user_kern.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+typedef __u64 u64;
+typedef __u32 u32;
+
+
+/* kprobe is NOT a stable ABI
+ * kernel functions can be removed, renamed or completely change semantics.
+ * Number of arguments and their positions can change, etc.
+ * In such case this bpf+kprobe example will no longer be meaningful
+ */
+
+/* This will call bpf_get_current_pidns_info() to display pid and ns values
+ * as seen by the current namespace, on the far left you will see the pid as
+ * seen as by the root namespace.
+ */
+
+SEC("kprobe/__netif_receive_skb_core")
+int bpf_prog1(struct pt_regs *ctx)
+{
+	char fmt[] = "nsid:%u, dev: %u,  pid:%u\n";
+	struct bpf_pidns_info nsinfo;
+	int ok = 0;
+
+	ok = bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo));
+	if (ok == 0)
+		bpf_trace_printk(fmt, sizeof(fmt), (u32)nsinfo.nsid,
+				 (u32) nsinfo.dev, (u32)nsinfo.pid);
+
+	return 0;
+}
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4393bd4b2419..6f601f7106e2 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2741,6 +2741,26 @@ union bpf_attr {
  *		**-EOPNOTSUPP** kernel configuration does not enable SYN cookies
  *
  *		**-EPROTONOSUPPORT** IP packet version is not 4 or 6
+ *
+ * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
+ *	Description
+ *		Copies into *pidns* pid, namespace id and tgid as seen by the
+ *		current namespace and also device from /proc/self/ns/pid.
+ *		*size_of_pidns* must be the size of *pidns*
+ *
+ *		This helper is used when pid filtering is needed inside a
+ *		container as bpf_get_current_tgid() helper returns always the
+ *		pid id as seen by the root namespace.
+ *	Return
+ *		0 on success
+ *
+ *		**-EINVAL**  if unable to get ns, pid or tgid of current task.
+ *		Or if size_of_pidns is not valid.
+ *
+ *		**-ENOMEM**  if allocation fails.
+ *
+ *		If unable to get the inode from /proc/self/ns/pid an error code
+ *		will be returned.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2853,7 +2873,8 @@ union bpf_attr {
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
 	FN(send_signal),		\
-	FN(tcp_gen_syncookie),
+	FN(tcp_gen_syncookie),		\
+	FN(get_current_pidns_info),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3604,4 +3625,10 @@ struct bpf_sockopt {
 	__s32	retval;
 };
 
+struct bpf_pidns_info {
+	__u32 dev;
+	__u32 nsid;
+	__u32 tgid;
+	__u32 pid;
+};
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 3bd0f4a0336a..1f97b571b581 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_cgroup_storage test_select_reuseport test_section_names \
 	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
 	test_btf_dump test_cgroup_attach xdping test_sockopt test_sockopt_sk \
-	test_sockopt_multi test_tcp_rtt
+	test_sockopt_multi test_tcp_rtt test_pidns
 
 BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
 TEST_GEN_FILES = $(BPF_OBJ_FILES)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 120aa86c58d3..c96795a9d983 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -231,6 +231,9 @@ static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
 static long long (*bpf_tcp_gen_syncookie)(struct bpf_sock *sk, void *ip,
 					  int ip_len, void *tcp, int tcp_len) =
 	(void *) BPF_FUNC_tcp_gen_syncookie;
+static int (*bpf_get_current_pidns_info)(struct bpf_pidns_info *buf,
+					 unsigned int buf_size) =
+	(void *) BPF_FUNC_get_current_pidns_info;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/progs/test_pidns_kern.c b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
new file mode 100644
index 000000000000..e1d2facfa762
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_pidns_kern.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <linux/bpf.h>
+#include <errno.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") nsidmap = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") pidmap = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
+SEC("tracepoint/syscalls/sys_enter_nanosleep")
+int trace(void *ctx)
+{
+	struct bpf_pidns_info nsinfo;
+	__u32 key = 0, *expected_pid, *val;
+	char fmt[] = "ERROR nspid:%d\n";
+
+	if (bpf_get_current_pidns_info(&nsinfo, sizeof(nsinfo)))
+		return -EINVAL;
+
+	expected_pid = bpf_map_lookup_elem(&pidmap, &key);
+
+
+	if (!expected_pid || *expected_pid != nsinfo.pid)
+		return 0;
+
+	val = bpf_map_lookup_elem(&nsidmap, &key);
+	if (val)
+		*val = nsinfo.nsid;
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+__u32 _version SEC("version") = 1;
diff --git a/tools/testing/selftests/bpf/test_pidns.c b/tools/testing/selftests/bpf/test_pidns.c
new file mode 100644
index 000000000000..a7254055f294
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_pidns.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Carlos Neira cneirabustos@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <syscall.h>
+#include <unistd.h>
+#include <linux/perf_event.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <linux/bpf.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "cgroup_helpers.h"
+#include "bpf_rlimit.h"
+
+#define CHECK(condition, tag, format...) ({		\
+	int __ret = !!(condition);			\
+	if (__ret) {					\
+		printf("%s:FAIL:%s ", __func__, tag);	\
+		printf(format);				\
+	} else {					\
+		printf("%s:PASS:%s\n", __func__, tag);	\
+	}						\
+	__ret;						\
+})
+
+static int bpf_find_map(const char *test, struct bpf_object *obj,
+			const char *name)
+{
+	struct bpf_map *map;
+
+	map = bpf_object__find_map_by_name(obj, name);
+	if (!map)
+		return -1;
+	return bpf_map__fd(map);
+}
+
+
+int main(int argc, char **argv)
+{
+	const char *probe_name = "syscalls/sys_enter_nanosleep";
+	const char *file = "test_pidns_kern.o";
+	int err, bytes, efd, prog_fd, pmu_fd;
+	int pidmap_fd, nsidmap_fd;
+	struct perf_event_attr attr = {};
+	struct bpf_object *obj;
+	__u32 knsid = 0;
+	__u32 key = 0, pid;
+	int exit_code = 1;
+	struct stat st;
+	char buf[256];
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
+	if (CHECK(err, "bpf_prog_load", "err %d errno %d\n", err, errno))
+		goto cleanup_cgroup_env;
+
+	nsidmap_fd = bpf_find_map(__func__, obj, "nsidmap");
+	if (CHECK(nsidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
+		  nsidmap_fd, errno))
+		goto close_prog;
+
+	pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
+	if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
+		  pidmap_fd, errno))
+		goto close_prog;
+
+	pid = getpid();
+	bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
+
+	snprintf(buf, sizeof(buf),
+		 "/sys/kernel/debug/tracing/events/%s/id", probe_name);
+	efd = open(buf, O_RDONLY, 0);
+	if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
+		goto close_prog;
+	bytes = read(efd, buf, sizeof(buf));
+	close(efd);
+	if (CHECK(bytes <= 0 || bytes >= sizeof(buf), "read",
+		  "bytes %d errno %d\n", bytes, errno))
+		goto close_prog;
+
+	attr.config = strtol(buf, NULL, 0);
+	attr.type = PERF_TYPE_TRACEPOINT;
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.sample_period = 1;
+	attr.wakeup_events = 1;
+
+	pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
+	if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd,
+		  errno))
+		goto close_prog;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+	if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n", err,
+		  errno))
+		goto close_pmu;
+
+	/* trigger some syscalls */
+	sleep(1);
+
+	err = bpf_map_lookup_elem(nsidmap_fd, &key, &knsid);
+	if (CHECK(err, "bpf_map_lookup_elem", "err %d errno %d\n", err, errno))
+		goto close_pmu;
+
+	if (stat("/proc/self/ns/pid", &st))
+		goto close_pmu;
+
+	if (CHECK(knsid != (__u32) st.st_ino, "compare_namespace_id",
+		  "kern knsid %u user unsid %u\n", knsid, (__u32) st.st_ino))
+		goto close_pmu;
+
+	exit_code = 0;
+	printf("%s:PASS\n", argv[0]);
+
+close_pmu:
+	close(pmu_fd);
+close_prog:
+	bpf_object__close(obj);
+cleanup_cgroup_env:
+	return exit_code;
+}
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH net] net: phy: rtl8211f: do a double read to get real time link status
From: Yonglong Liu @ 2019-08-08  1:15 UTC (permalink / raw)
  To: Heiner Kallweit, davem, andrew
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <d67831ab-8902-a653-3db9-b2f55adacabd@gmail.com>



On 2019/8/8 0:47, Heiner Kallweit wrote:
> On 07.08.2019 15:16, Yonglong Liu wrote:
>> [   27.232781] hns3 0000:bd:00.3 eth7: net open
>> [   27.237303] 8021q: adding VLAN 0 to HW filter on device eth7
>> [   27.242972] IPv6: ADDRCONF(NETDEV_CHANGE): eth7: link becomes ready
>> [   27.244449] hns3 0000:bd:00.3: invalid speed (-1)
>> [   27.253904] hns3 0000:bd:00.3 eth7: failed to adjust link.
>> [   27.259379] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change UP -> RUNNING
>> [   27.924903] hns3 0000:bd:00.3 eth7: link up
>> [   28.280479] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change RUNNING -> NOLINK
>> [   29.208452] hns3 0000:bd:00.3 eth7: link down
>> [   32.376745] RTL8211F Gigabit Ethernet mii-0000:bd:00.3:07: PHY state change NOLINK -> RUNNING
>> [   33.208448] hns3 0000:bd:00.3 eth7: link up
>> [   35.253821] hns3 0000:bd:00.3 eth7: net stop
>> [   35.258270] hns3 0000:bd:00.3 eth7: link down
>>
>> When using rtl8211f in polling mode, may get a invalid speed,
>> because of reading a fake link up and autoneg complete status
>> immediately after starting autoneg:
>>
>>         ifconfig-1176  [007] ....    27.232763: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>   kworker/u257:1-670   [015] ....    27.232805: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x04 val:0x01e1
>>   kworker/u257:1-670   [015] ....    27.232815: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x04 val:0x05e1
>>   kworker/u257:1-670   [015] ....    27.232869: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>   kworker/u257:1-670   [015] ....    27.232904: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>   kworker/u257:1-670   [015] ....    27.232940: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>   kworker/u257:1-670   [015] ....    27.232949: mdio_access: mii-0000:bd:00.3 write phy:0x07 reg:0x00 val:0x1240
>>   kworker/u257:1-670   [015] ....    27.233003: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>   kworker/u257:1-670   [015] ....    27.233039: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x0a val:0x3002
>>   kworker/u257:1-670   [015] ....    27.233074: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x09 val:0x0200
>>   kworker/u257:1-670   [015] ....    27.233110: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x05 val:0x0000
>>   kworker/u257:1-670   [000] ....    28.280475: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>   kworker/u257:1-670   [000] ....    29.304471: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x7989
>>
>> According to the datasheet of rtl8211f, to get the real time
>> link status, need to read MII_BMSR twice.
>>
>> This patch add a read_status hook for rtl8211f, and do a fake
>> phy_read before genphy_read_status(), so that can get real link
>> status in genphy_read_status().
>>
>> Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
>> ---
>>  drivers/net/phy/realtek.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
> Is this an accidental resubmit? Because we discussed this in
> https://marc.info/?t=156413509900003&r=1&w=2 and a fix has
> been applied already.
> 
> Heiner
> 
> .
> 

In https://marc.info/?t=156413509900003&r=1&w=2 , the invalid speed
recurrence rate is almost 100%, and I had test the solution about
5 times and it works. But yesterday it happen again suddenly, and than
I fount that the recurrence rate reduce to 10%. This time we get 0x79ad
after autoneg started which is not 0x798d from last discussion.



^ permalink raw reply

* Re: [PATCH] Fix non-kerneldoc comment in realtek/rtlwifi/usb.c
From: Larry Finger @ 2019-08-08  0:57 UTC (permalink / raw)
  To: Valdis Klētnieks, Ping-Ke Shih, Kalle Valo, David S. Miller
  Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <5924.1565217560@turing-police>

On 8/7/19 5:39 PM, Valdis Klētnieks wrote:

When this driver was originally entered, a line with "/*" was flagged by 
checkpatch.pl. In fact, when I make your change, I get

WARNING: networking block comments don't use an empty /* line, use /* Comment...
#243: FILE: drivers/net/wireless/realtek/rtlwifi/usb.c:243:
+/*
+ *

To avoid a loop of "fixing" compiler/checkpatch warnings, you need to put the 
first real line of the comment on the line of the "/*". For the first of your 
patches, that results in

diff --git a/drivers/net/wireless/realtek/rtlwifi/usb.c 
b/drivers/net/wireless/realtek/rtlwifi/usb.c
index 34d68dbf4b4c..f89ceac25eff 100644
--- a/drivers/net/wireless/realtek/rtlwifi/usb.c
+++ b/drivers/net/wireless/realtek/rtlwifi/usb.c
@@ -239,10 +239,7 @@ static void _rtl_usb_io_handler_release(struct ieee80211_hw 
*hw)
         mutex_destroy(&rtlpriv->io.bb_mutex);
  }
-/**
- *
- *     Default aggregation handler. Do nothing and just return the oldest skb.
- */
+/* Default aggregation handler. Do nothing and just return the oldest skb.  */
  static struct sk_buff *_none_usb_tx_aggregate_hdl(struct ieee80211_hw *hw,
                                                   struct sk_buff_head *list)

Because you merely shift the warning to a different tool,

NACK.

Larry

^ permalink raw reply related

* [PATCH] vhost: do not reference a file that does not exist
From: egranata @ 2019-08-08  0:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: mst, jasowang, kvm, virtualization, netdev, trivial, egranata

From: Enrico Granata <egranata@google.com>

lguest was removed from the mainline kernel in late 2017.

Signed-off-by: Enrico Granata <egranata@google.com>
---
 drivers/vhost/vhost.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 0536f8526359..2c376cb66971 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -4,8 +4,8 @@
  *
  * Author: Michael S. Tsirkin <mst@redhat.com>
  *
- * Inspiration, some code, and most witty comments come from
- * Documentation/virtual/lguest/lguest.c, by Rusty Russell
+ * Inspiration, some code, and most witty comments come from lguest.c
+ * by Rusty Russell
  *
  * Generic code for virtio server in host kernel.
  */
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH bpf-next] tools/bpf: fix core_reloc.c compilation error
From: Yonghong Song @ 2019-08-08  0:38 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Yonghong Song,
	Andrii Nakryiko

On my local machine, I have the following compilation errors:
=====
  In file included from prog_tests/core_reloc.c:3:0:
  ./progs/core_reloc_types.h:517:46: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘fancy_char_ptr_t’
 typedef const char * const volatile restrict fancy_char_ptr_t;
                                              ^
  ./progs/core_reloc_types.h:527:2: error: unknown type name ‘fancy_char_ptr_t’
    fancy_char_ptr_t d;
    ^
=====

I am using gcc 4.8.5. Later compilers may change their behavior not emitting the
error. Nevertheless, let us fix the issue. "restrict" can be tested
without typedef.

Fixes: 9654e2ae908e ("selftests/bpf: add CO-RE relocs modifiers/typedef tests")
Cc: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/progs/core_reloc_types.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/progs/core_reloc_types.h b/tools/testing/selftests/bpf/progs/core_reloc_types.h
index 10a252b6da55..f686a8138d90 100644
--- a/tools/testing/selftests/bpf/progs/core_reloc_types.h
+++ b/tools/testing/selftests/bpf/progs/core_reloc_types.h
@@ -514,7 +514,7 @@ typedef arr1_t arr2_t;
 typedef arr2_t arr3_t;
 typedef arr3_t arr4_t;
 
-typedef const char * const volatile restrict fancy_char_ptr_t;
+typedef const char * const volatile fancy_char_ptr_t;
 
 typedef core_reloc_mods_substruct_t core_reloc_mods_substruct_tt;
 
-- 
2.17.1


^ permalink raw reply related

* Re: WARNING: ODEBUG bug in netdev_freemem (2)
From: syzbot @ 2019-08-08  0:25 UTC (permalink / raw)
  To: alexander.h.duyck, amritha.nambiar, andriy.shevchenko, avagin,
	davem, dmitry.torokhov, dvyukov, eric.dumazet, f.fainelli, gregkh,
	idosch, jiri, kimbrownkd, linux-kernel, netdev, syzkaller-bugs,
	tglx, tyhicks, wanghai26, yuehaibing
In-Reply-To: <000000000000d6a8ba058c0df076@google.com>

syzbot has found a reproducer for the following crash on:

HEAD commit:    13dfb3fa Merge git://git.kernel.org/pub/scm/linux/kernel/g..
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1671e69a600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=d4cf1ffb87d590d7
dashboard link: https://syzkaller.appspot.com/bug?extid=c4521ac872a4ccc3afec
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=170542c2600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+c4521ac872a4ccc3afec@syzkaller.appspotmail.com

bond0 (unregistering): (slave bond_slave_1): Releasing backup interface
bond0 (unregistering): (slave bond_slave_0): Releasing backup interface
bond0 (unregistering): Released all slaves
------------[ cut here ]------------
ODEBUG: free active (active state 0) object type: timer_list hint:  
delayed_work_timer_fn+0x0/0x90 arch/x86/include/asm/paravirt.h:768
WARNING: CPU: 0 PID: 9919 at lib/debugobjects.c:481  
debug_print_object+0x168/0x250 lib/debugobjects.c:481
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 9919 Comm: kworker/u4:6 Not tainted 5.3.0-rc3+ #122
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: netns cleanup_net
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  panic+0x2dc/0x755 kernel/panic.c:219
  __warn.cold+0x20/0x4c kernel/panic.c:576
  report_bug+0x263/0x2b0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  fixup_bug arch/x86/kernel/traps.c:174 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
  invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028
RIP: 0010:debug_print_object+0x168/0x250 lib/debugobjects.c:481
Code: dd e0 32 c6 87 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 b5 00 00 00 48  
8b 14 dd e0 32 c6 87 48 c7 c7 e0 27 c6 87 e8 70 cd 05 fe <0f> 0b 83 05 a3  
7c 67 06 01 48 83 c4 20 5b 41 5c 41 5d 41 5e 5d c3
RSP: 0018:ffff8880898ff838 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815c3ba6 RDI: ffffed101131fef9
RBP: ffff8880898ff878 R08: ffff88808ece42c0 R09: ffffed1015d04101
R10: ffffed1015d04100 R11: ffff8880ae820807 R12: 0000000000000001
R13: ffffffff88db6660 R14: ffffffff8161da40 R15: ffff88808f639af0
  __debug_check_no_obj_freed lib/debugobjects.c:963 [inline]
  debug_check_no_obj_freed+0x2d4/0x43f lib/debugobjects.c:994
  kfree+0xf8/0x2c0 mm/slab.c:3755
  kvfree+0x61/0x70 mm/util.c:488
  netdev_freemem+0x4c/0x60 net/core/dev.c:9093
  netdev_release+0x86/0xb0 net/core/net-sysfs.c:1635
  device_release+0x7a/0x210 drivers/base/core.c:1064
  kobject_cleanup lib/kobject.c:693 [inline]
  kobject_release lib/kobject.c:722 [inline]
  kref_put include/linux/kref.h:65 [inline]
  kobject_put.cold+0x289/0x2e6 lib/kobject.c:739
  netdev_run_todo+0x53b/0x7b0 net/core/dev.c:8998
  rtnl_unlock+0xe/0x10 net/core/rtnetlink.c:112
  default_device_exit_batch+0x358/0x410 net/core/dev.c:9781
  ops_exit_list.isra.0+0xfc/0x150 net/core/net_namespace.c:175
  cleanup_net+0x4e2/0xa70 net/core/net_namespace.c:594
  process_one_work+0x9af/0x1740 kernel/workqueue.c:2269
  worker_thread+0x98/0xe40 kernel/workqueue.c:2415
  kthread+0x361/0x430 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Kernel Offset: disabled
Rebooting in 86400 seconds..


^ permalink raw reply

* High-Speed WiFi Booster Takes the World By Storm
From:  Julie Rogers @ 2019-08-08  0:06 UTC (permalink / raw)
  To: netdev

Internet providers make big money by overcharging you for faster internet lines. Could this little device really be the one solution that we've all been waiting for?

Frequency: 2.4Ghz
Wireless Rate: 300Mbps
Interface: 1 10/100Mbps WAN/LAN RJ45 Ports

Amazing new technology find out here! http://bit.ly/asdf411gt

------------------------
If you'd prefer not to receive future emails, Unsubscribe Here http://bit.ly/itr4fgy
11041 Santa Monica Blvd. #301 Los Angeles, CA 90025


^ permalink raw reply

* Re: [PATCH bpf-next 1/3] bpf: support cloning sk storage on accept()
From: Yonghong Song @ 2019-08-08  0:18 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Stanislav Fomichev, netdev@vger.kernel.org, bpf@vger.kernel.org,
	davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
	Martin Lau
In-Reply-To: <20190808000533.GA2820@mini-arch>



On 8/7/19 5:05 PM, Stanislav Fomichev wrote:
> On 08/07, Yonghong Song wrote:
>>
>>
>> On 8/7/19 8:47 AM, Stanislav Fomichev wrote:
>>> Add new helper bpf_sk_storage_clone which optionally clones sk storage
>>> and call it from bpf_sk_storage_clone. Reuse the gap in
>>> bpf_sk_storage_elem to store clone/non-clone flag.
>>>
>>> Cc: Martin KaFai Lau <kafai@fb.com>
>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>
>> I tried to see whether I can find any missing race conditions in
>> the code but I failed. So except a minor comments below,
> Thanks for a review!
> 
>> Acked-by: Yonghong Song <yhs@fb.com>
>>
>>> ---
>>>    include/net/bpf_sk_storage.h |  10 ++++
>>>    include/uapi/linux/bpf.h     |   1 +
>>>    net/core/bpf_sk_storage.c    | 102 +++++++++++++++++++++++++++++++++--
>>>    net/core/sock.c              |   9 ++--
>>>    4 files changed, 115 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h
>>> index b9dcb02e756b..8e4f831d2e52 100644
>>> --- a/include/net/bpf_sk_storage.h
>>> +++ b/include/net/bpf_sk_storage.h
>>> @@ -10,4 +10,14 @@ void bpf_sk_storage_free(struct sock *sk);
>>>    extern const struct bpf_func_proto bpf_sk_storage_get_proto;
>>>    extern const struct bpf_func_proto bpf_sk_storage_delete_proto;
>>>    
>>> +#ifdef CONFIG_BPF_SYSCALL
>>> +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk);
>>> +#else
>>> +static inline int bpf_sk_storage_clone(const struct sock *sk,
>>> +				       struct sock *newsk)
>>> +{
>>> +	return 0;
>>> +}
>>> +#endif
>>> +
>>>    #endif /* _BPF_SK_STORAGE_H */
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 4393bd4b2419..00459ca4c8cf 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -2931,6 +2931,7 @@ enum bpf_func_id {
>>>    
>>>    /* BPF_FUNC_sk_storage_get flags */
>>>    #define BPF_SK_STORAGE_GET_F_CREATE	(1ULL << 0)
>>> +#define BPF_SK_STORAGE_GET_F_CLONE	(1ULL << 1)
>>>    
>>>    /* Mode for BPF_FUNC_skb_adjust_room helper. */
>>>    enum bpf_adj_room_mode {
>>> diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
>>> index 94c7f77ecb6b..b6dea67965bc 100644
>>> --- a/net/core/bpf_sk_storage.c
>>> +++ b/net/core/bpf_sk_storage.c
>>> @@ -12,6 +12,9 @@
>>>    
>>>    static atomic_t cache_idx;
>>>    
>>> +#define BPF_SK_STORAGE_GET_F_MASK	(BPF_SK_STORAGE_GET_F_CREATE | \
>>> +					 BPF_SK_STORAGE_GET_F_CLONE)
>>> +
>>>    struct bucket {
>>>    	struct hlist_head list;
>>>    	raw_spinlock_t lock;
>>> @@ -66,7 +69,8 @@ struct bpf_sk_storage_elem {
>>>    	struct hlist_node snode;	/* Linked to bpf_sk_storage */
>>>    	struct bpf_sk_storage __rcu *sk_storage;
>>>    	struct rcu_head rcu;
>>> -	/* 8 bytes hole */
>>> +	u8 clone:1;
>>> +	/* 7 bytes hole */
>>>    	/* The data is stored in aother cacheline to minimize
>>>    	 * the number of cachelines access during a cache hit.
>>>    	 */
>>> @@ -509,7 +513,7 @@ static int sk_storage_delete(struct sock *sk, struct bpf_map *map)
>>>    	return 0;
>>>    }
>>>    
>>> -/* Called by __sk_destruct() */
>>> +/* Called by __sk_destruct() & bpf_sk_storage_clone() */
>>>    void bpf_sk_storage_free(struct sock *sk)
>>>    {
>>>    	struct bpf_sk_storage_elem *selem;
>>> @@ -739,19 +743,106 @@ static int bpf_fd_sk_storage_delete_elem(struct bpf_map *map, void *key)
>>>    	return err;
>>>    }
>>>    
>>> +static struct bpf_sk_storage_elem *
>>> +bpf_sk_storage_clone_elem(struct sock *newsk,
>>> +			  struct bpf_sk_storage_map *smap,
>>> +			  struct bpf_sk_storage_elem *selem)
>>> +{
>>> +	struct bpf_sk_storage_elem *copy_selem;
>>> +
>>> +	copy_selem = selem_alloc(smap, newsk, NULL, true);
>>> +	if (!copy_selem)
>>> +		return ERR_PTR(-ENOMEM);
>>> +
>>> +	if (map_value_has_spin_lock(&smap->map))
>>> +		copy_map_value_locked(&smap->map, SDATA(copy_selem)->data,
>>> +				      SDATA(selem)->data, true);
>>> +	else
>>> +		copy_map_value(&smap->map, SDATA(copy_selem)->data,
>>> +			       SDATA(selem)->data);
>>> +
>>> +	return copy_selem;
>>> +}
>>> +
>>> +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
>>> +{
>>> +	struct bpf_sk_storage *new_sk_storage = NULL;
>>> +	struct bpf_sk_storage *sk_storage;
>>> +	struct bpf_sk_storage_elem *selem;
>>> +	int ret;
>>> +
>>> +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
>>> +
>>> +	rcu_read_lock();
>>> +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
>>> +
>>> +	if (!sk_storage || hlist_empty(&sk_storage->list))
>>> +		goto out;
>>> +
>>> +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
>>> +		struct bpf_sk_storage_map *smap;
>>> +		struct bpf_sk_storage_elem *copy_selem;
>>> +
>>> +		if (!selem->clone)
>>> +			continue;
>>> +
>>> +		smap = rcu_dereference(SDATA(selem)->smap);
>>> +		if (!smap)
>>> +			continue;
>>> +
>>> +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
>>> +		if (IS_ERR(copy_selem)) {
>>> +			ret = PTR_ERR(copy_selem);
>>> +			goto err;
>>> +		}
>>> +
>>> +		if (!new_sk_storage) {
>>> +			ret = sk_storage_alloc(newsk, smap, copy_selem);
>>> +			if (ret) {
>>> +				kfree(copy_selem);
>>> +				atomic_sub(smap->elem_size,
>>> +					   &newsk->sk_omem_alloc);
>>> +				goto err;
>>> +			}
>>> +
>>> +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
>>> +			continue;
>>> +		}
>>> +
>>> +		raw_spin_lock_bh(&new_sk_storage->lock);
>>> +		selem_link_map(smap, copy_selem);
>>> +		__selem_link_sk(new_sk_storage, copy_selem);
>>> +		raw_spin_unlock_bh(&new_sk_storage->lock);
>>
>> Considering in this particular case, new socket is not visible to
>> outside world yet (both kernel and user space), map_delete/map_update
>> operations are not applicable in this situation, so
>> the above raw_spin_lock_bh() probably not needed.
> I agree, it's doing nothing, but __selem_link_sk has the following comment:
> /* sk_storage->lock must be held and sk_storage->list cannot be empty */
> 
> Just wanted to keep that invariant for this call site as well (in case
> we add some lockdep enforcement or smth else). WDYT?

Agree. Let us keep the locking to be consistent with other uses in
the same file. This is not the critical path.


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox