Netdev List
 help / color / mirror / Atom feed
* Re: [pci PATCH v8 1/4] pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources
From: Gregory Rose @ 2018-04-20 22:06 UTC (permalink / raw)
  To: Alexander Duyck, bhelgaas, linux-pci
  Cc: virtio-dev, kvm, netdev, dan.daly, linux-kernel, linux-nvme,
	keith.busch, netanel, ddutile, mheyne, liang-min.wang,
	mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180420162814.46077.11939.stgit@ahduyck-green-test.jf.intel.com>


On 4/20/2018 9:28 AM, Alexander Duyck wrote:
> This patch adds a common configuration function called
> pci_sriov_configure_simple that will allow for managing VFs on devices
> where the PF is not capable of managing VF resources.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Tested-by: Mark Rustad <mark.d.rustad@intel.com>
> ---
>
> v5: New patch replacing pci_sriov_configure_unmanaged with
>        pci_sriov_configure_simple
>      Dropped bits related to autoprobe changes
> v6: Defined pci_sriov_configure_simple as NULL if IOV is disabled
> v7: Updated pci_sriov_configure_simple to drop need for err value
>      Fixed comment explaining why pci_sriov_configure_simple is NULL
>
>   drivers/pci/iov.c   |   31 +++++++++++++++++++++++++++++++
>   include/linux/pci.h |    3 +++
>   2 files changed, 34 insertions(+)
>
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 677924a..3e0a7fd 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -807,3 +807,34 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
>   	return dev->sriov->total_VFs;
>   }
>   EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> +
> +/**
> + * pci_sriov_configure_simple - helper to configure unmanaged SR-IOV
> + * @dev: the PCI device
> + * @nr_virtfn: number of virtual functions to enable, 0 to disable
> + *
> + * Used to provide generic enable/disable SR-IOV option for devices
> + * that do not manage the VFs generated by their driver. Return value
> + * is negative on error, or number of VFs allocated on success.
> + */
> +int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn)
> +{
> +	might_sleep();
> +
> +	if (!dev->is_physfn)
> +		return -ENODEV;
> +
> +	if (pci_vfs_assigned(dev)) {
> +		pci_warn(dev,
> +			 "Cannot modify SR-IOV while VFs are assigned\n");
> +		return -EPERM;
> +	}
> +
> +	if (!nr_virtfn) {
> +		sriov_disable(dev);
> +		return 0;
> +	}
> +
> +	return sriov_enable(dev, nr_virtfn) ? : nr_virtfn;
> +}
> +EXPORT_SYMBOL_GPL(pci_sriov_configure_simple);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index ae42289..7d36e39 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1955,6 +1955,7 @@ static inline void pci_mmcfg_late_init(void) { }
>   int pci_vfs_assigned(struct pci_dev *dev);
>   int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
>   int pci_sriov_get_totalvfs(struct pci_dev *dev);
> +int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
>   resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
>   void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
>   #else
> @@ -1982,6 +1983,8 @@ static inline int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
>   { return 0; }
>   static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
>   { return 0; }
> +/* this is expected to be used as a function pointer, just define as NULL */
> +#define pci_sriov_configure_simple NULL
>   static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
>   { return 0; }
>   static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { }
>

Thanks Alex!

Reviewed-by: Greg Rose <gvrose8192@gmail.com>

^ permalink raw reply

* Re: [pci PATCH v8 4/4] pci-pf-stub: Add PF driver stub for PFs that function only to enable VFs
From: Gregory Rose @ 2018-04-20 22:08 UTC (permalink / raw)
  To: Alexander Duyck, bhelgaas, linux-pci
  Cc: virtio-dev, kvm, netdev, dan.daly, linux-kernel, linux-nvme,
	keith.busch, netanel, ddutile, mheyne, liang-min.wang,
	mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180420163109.46077.60334.stgit@ahduyck-green-test.jf.intel.com>

On 4/20/2018 9:31 AM, Alexander Duyck wrote:
> Add a new driver called "pci-pf-stub" to act as a "white-list" for PF
> devices that provide no other functionality other then acting as a means of
> allocating a set of VFs. For now I only have one example ID provided by
> Amazon in terms of devices that require this functionality. The general
> idea is that in the future we will see other devices added as vendors come
> up with devices where the PF is more or less just a lightweight shim used
> to allocate VFs.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>
> v6: New driver to address concerns about Amazon devices left unsupported
> v7: Dropped pci_id table explanation from pci-pf-stub driver
>
>   drivers/pci/Kconfig       |   12 ++++++++++
>   drivers/pci/Makefile      |    2 ++
>   drivers/pci/pci-pf-stub.c |   54 +++++++++++++++++++++++++++++++++++++++++++++
>   include/linux/pci_ids.h   |    2 ++
>   4 files changed, 70 insertions(+)
>   create mode 100644 drivers/pci/pci-pf-stub.c
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 34b56a8..cdef2a2 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -71,6 +71,18 @@ config PCI_STUB
>   
>   	  When in doubt, say N.
>   
> +config PCI_PF_STUB
> +	tristate "PCI PF Stub driver"
> +	depends on PCI
> +	depends on PCI_IOV
> +	help
> +	  Say Y or M here if you want to enable support for devices that
> +	  require SR-IOV support, while at the same time the PF itself is
> +	  not providing any actual services on the host itself such as
> +	  storage or networking.
> +
> +	  When in doubt, say N.
> +
>   config XEN_PCIDEV_FRONTEND
>           tristate "Xen PCI Frontend"
>           depends on PCI && X86 && XEN
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 9419709..4e133d3 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -43,6 +43,8 @@ obj-$(CONFIG_PCI_SYSCALL) += syscall.o
>   
>   obj-$(CONFIG_PCI_STUB) += pci-stub.o
>   
> +obj-$(CONFIG_PCI_PF_STUB) += pci-pf-stub.o
> +
>   obj-$(CONFIG_PCI_ECAM) += ecam.o
>   
>   obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
> diff --git a/drivers/pci/pci-pf-stub.c b/drivers/pci/pci-pf-stub.c
> new file mode 100644
> index 0000000..9d5fdf2
> --- /dev/null
> +++ b/drivers/pci/pci-pf-stub.c
> @@ -0,0 +1,54 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* pci-pf-stub - simple stub driver for PCI SR-IOV PF device
> + *
> + * This driver is meant to act as a "white-list" for devices that provde
> + * SR-IOV functionality while at the same time not actually needing a
> + * driver of their own.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +
> +/**
> + * pci_pf_stub_white_list - White list of devices to bind pci-pf-stub onto
> + *
> + * This table provides the list of IDs this driver is supposed to bind
> + * onto. You could think of this as a list of "quirked" devices where we
> + * are adding support for SR-IOV here since there are no other drivers
> + * that they would be running under.
> + */
> +static const struct pci_device_id pci_pf_stub_white_list[] = {
> +	{ PCI_VDEVICE(AMAZON, 0x0053) },
> +	/* required last entry */
> +	{ 0 }
> +};
> +MODULE_DEVICE_TABLE(pci, pci_pf_stub_white_list);
> +
> +static int pci_pf_stub_probe(struct pci_dev *dev,
> +			     const struct pci_device_id *id)
> +{
> +	pci_info(dev, "claimed by pci-pf-stub\n");
> +	return 0;
> +}
> +
> +static struct pci_driver pf_stub_driver = {
> +	.name			= "pci-pf-stub",
> +	.id_table		= pci_pf_stub_white_list,
> +	.probe			= pci_pf_stub_probe,
> +	.sriov_configure	= pci_sriov_configure_simple,
> +};
> +
> +static int __init pci_pf_stub_init(void)
> +{
> +	return pci_register_driver(&pf_stub_driver);
> +}
> +
> +static void __exit pci_pf_stub_exit(void)
> +{
> +	pci_unregister_driver(&pf_stub_driver);
> +}
> +
> +module_init(pci_pf_stub_init);
> +module_exit(pci_pf_stub_exit);
> +
> +MODULE_LICENSE("GPL");
> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index a637a7d..62dab14 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -2549,6 +2549,8 @@
>   #define PCI_VENDOR_ID_CIRCUITCO		0x1cc8
>   #define PCI_SUBSYSTEM_ID_CIRCUITCO_MINNOWBOARD	0x0001
>   
> +#define PCI_VENDOR_ID_AMAZON		0x1d0f
> +
>   #define PCI_VENDOR_ID_TEKRAM		0x1de1
>   #define PCI_DEVICE_ID_TEKRAM_DC290	0xdc29
>   
>

LGTM

Reviewed-by: Greg Rose <gvrose8192@gmail.com>

^ permalink raw reply

* Re: [pci PATCH v8 2/4] ena: Migrate over to unmanaged SR-IOV support
From: Gregory Rose @ 2018-04-20 22:10 UTC (permalink / raw)
  To: Alexander Duyck, bhelgaas, linux-pci
  Cc: virtio-dev, kvm, netdev, dan.daly, linux-kernel, linux-nvme,
	keith.busch, netanel, ddutile, mheyne, liang-min.wang,
	mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180420162841.46077.17967.stgit@ahduyck-green-test.jf.intel.com>

On 4/20/2018 9:30 AM, Alexander Duyck wrote:
> Instead of implementing our own version of a SR-IOV configuration stub in
> the ena driver we can just reuse the existing
> pci_sriov_configure_simple function.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>
> v5: Replaced call to pci_sriov_configure_unmanaged with
>          pci_sriov_configure_simple
> v6: Dropped "#ifdef" checks for IOV wrapping sriov_configure definition
> v7: No change
>
>   drivers/net/ethernet/amazon/ena/ena_netdev.c |   28 +-------------------------
>   1 file changed, 1 insertion(+), 27 deletions(-)
>
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index a822e70..f2af87d 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -3386,32 +3386,6 @@ static int ena_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>   }
>   
>   /*****************************************************************************/
> -static int ena_sriov_configure(struct pci_dev *dev, int numvfs)
> -{
> -	int rc;
> -
> -	if (numvfs > 0) {
> -		rc = pci_enable_sriov(dev, numvfs);
> -		if (rc != 0) {
> -			dev_err(&dev->dev,
> -				"pci_enable_sriov failed to enable: %d vfs with the error: %d\n",
> -				numvfs, rc);
> -			return rc;
> -		}
> -
> -		return numvfs;
> -	}
> -
> -	if (numvfs == 0) {
> -		pci_disable_sriov(dev);
> -		return 0;
> -	}
> -
> -	return -EINVAL;
> -}
> -
> -/*****************************************************************************/
> -/*****************************************************************************/
>   
>   /* ena_remove - Device Removal Routine
>    * @pdev: PCI device information struct
> @@ -3526,7 +3500,7 @@ static int ena_resume(struct pci_dev *pdev)
>   	.suspend    = ena_suspend,
>   	.resume     = ena_resume,
>   #endif
> -	.sriov_configure = ena_sriov_configure,
> +	.sriov_configure = pci_sriov_configure_simple,
>   };
>   
>   static int __init ena_init(void)
>

Reviewed-by: Greg Rose <gvrose8192@gmail.com>

^ permalink raw reply

* [PATCH bpf-next v3 1/9] bpf: change prototype for stack_map_get_build_id_offset
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-1-yhs@fb.com>

This patch didn't incur functionality change. The function prototype
got changed so that the same function can be reused later.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/stackmap.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 57eeb12..04f6ec1 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -262,16 +262,11 @@ static int stack_map_get_build_id(struct vm_area_struct *vma,
 	return ret;
 }
 
-static void stack_map_get_build_id_offset(struct bpf_map *map,
-					  struct stack_map_bucket *bucket,
+static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 					  u64 *ips, u32 trace_nr, bool user)
 {
 	int i;
 	struct vm_area_struct *vma;
-	struct bpf_stack_build_id *id_offs;
-
-	bucket->nr = trace_nr;
-	id_offs = (struct bpf_stack_build_id *)bucket->data;
 
 	/*
 	 * We cannot do up_read() in nmi context, so build_id lookup is
@@ -361,8 +356,10 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 			pcpu_freelist_pop(&smap->freelist);
 		if (unlikely(!new_bucket))
 			return -ENOMEM;
-		stack_map_get_build_id_offset(map, new_bucket, ips,
-					      trace_nr, user);
+		new_bucket->nr = trace_nr;
+		stack_map_get_build_id_offset(
+			(struct bpf_stack_build_id *)new_bucket->data,
+			ips, trace_nr, user);
 		trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
 		if (hash_matches && bucket->nr == trace_nr &&
 		    memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 7/9] tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-1-yhs@fb.com>

The test_verifier already has a few ARSH test cases.
This patch adds a new test case which takes advantage of newly
improved verifier behavior for bpf_get_stack and ARSH.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/test_verifier.c | 45 +++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 3e7718b..cd595ba 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -11423,6 +11423,51 @@ static struct bpf_test tests[] = {
 		.errstr = "BPF_XADD stores into R2 packet",
 		.prog_type = BPF_PROG_TYPE_XDP,
 	},
+	{
+		"bpf_get_stack return R0 within range",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 28),
+			BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+			BPF_MOV64_IMM(BPF_REG_9, sizeof(struct test_val)),
+			BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_7),
+			BPF_MOV64_IMM(BPF_REG_3, sizeof(struct test_val)),
+			BPF_MOV64_IMM(BPF_REG_4, 256),
+			BPF_EMIT_CALL(BPF_FUNC_get_stack),
+			BPF_MOV64_IMM(BPF_REG_1, 0),
+			BPF_MOV64_REG(BPF_REG_8, BPF_REG_0),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_8, 32),
+			BPF_ALU64_IMM(BPF_ARSH, BPF_REG_8, 32),
+			BPF_JMP_REG(BPF_JSLT, BPF_REG_1, BPF_REG_8, 16),
+			BPF_ALU64_REG(BPF_SUB, BPF_REG_9, BPF_REG_8),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_7),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_8),
+			BPF_MOV64_REG(BPF_REG_1, BPF_REG_9),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 32),
+			BPF_ALU64_IMM(BPF_ARSH, BPF_REG_1, 32),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_2),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_3, BPF_REG_1),
+			BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
+			BPF_MOV64_IMM(BPF_REG_5, sizeof(struct test_val)),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_5),
+			BPF_JMP_REG(BPF_JGE, BPF_REG_3, BPF_REG_1, 4),
+			BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_9),
+			BPF_MOV64_IMM(BPF_REG_4, 0),
+			BPF_EMIT_CALL(BPF_FUNC_get_stack),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map2 = { 4 },
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+	},
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 3/9] bpf/verifier: refine retval R0 state for bpf_get_stack helper
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-1-yhs@fb.com>

The special property of return values for helpers bpf_get_stack
and bpf_probe_read_str are captured in verifier.
Both helpers return a negative error code or
a length, which is equal to or smaller than the buffer
size argument. This additional information in the
verifier can avoid the condition such as "retval > bufsize"
in the bpf program. For example, for the code blow,
    usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
    if (usize < 0 || usize > max_len)
        return 0;
The verifier may have the following errors:
    52: (85) call bpf_get_stack#65
     R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
     R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
     R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R9_w=inv800 R10=fp0,call_-1
    53: (bf) r8 = r0
    54: (bf) r1 = r8
    55: (67) r1 <<= 32
    56: (bf) r2 = r1
    57: (77) r2 >>= 32
    58: (25) if r2 > 0x31f goto pc+33
     R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
                         umax_value=18446744069414584320,
                         var_off=(0x0; 0xffffffff00000000))
     R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
     R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R8=inv(id=0) R9=inv800 R10=fp0,call_-1
    59: (1f) r9 -= r8
    60: (c7) r1 s>>= 32
    61: (bf) r2 = r7
    62: (0f) r2 += r1
    math between map_value pointer and register with unbounded
    min value is not allowed
The failure is due to llvm compiler optimization where register "r2",
which is a copy of "r1", is tested for condition while later on "r1"
is used for map_ptr operation. The verifier is not able to track such
inst sequence effectively.

Without the "usize > max_len" condition, there is no llvm optimization
and the below generated code passed verifier:
    52: (85) call bpf_get_stack#65
     R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
     R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
     R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R9_w=inv800 R10=fp0,call_-1
    53: (b7) r1 = 0
    54: (bf) r8 = r0
    55: (67) r8 <<= 32
    56: (c7) r8 s>>= 32
    57: (6d) if r1 s> r8 goto pc+24
     R0=inv(id=0,umax_value=800) R1=inv0 R6=ctx(id=0,off=0,imm=0)
     R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
     R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
     R10=fp0,call_-1
    58: (bf) r2 = r7
    59: (0f) r2 += r8
    60: (1f) r9 -= r8
    61: (bf) r1 = r6

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/verifier.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index aba9425..3c8bb92 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -164,6 +164,8 @@ struct bpf_call_arg_meta {
 	bool pkt_access;
 	int regno;
 	int access_size;
+	s64 msize_smax_value;
+	u64 msize_umax_value;
 };
 
 static DEFINE_MUTEX(bpf_verifier_lock);
@@ -2027,6 +2029,14 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 		err = check_helper_mem_access(env, regno - 1,
 					      reg->umax_value,
 					      zero_size_allowed, meta);
+
+		if (!err && !!meta) {
+			/* remember the mem_size which may be used later
+			 * to refine return values.
+			 */
+			meta->msize_smax_value = reg->smax_value;
+			meta->msize_umax_value = reg->umax_value;
+		}
 	}
 
 	return err;
@@ -2333,6 +2343,21 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
 	return 0;
 }
 
+static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type,
+				   int func_id,
+				   struct bpf_call_arg_meta *meta)
+{
+	struct bpf_reg_state *ret_reg = &regs[BPF_REG_0];
+
+	if (ret_type != RET_INTEGER ||
+	    (func_id != BPF_FUNC_get_stack &&
+	     func_id != BPF_FUNC_probe_read_str))
+		return;
+
+	ret_reg->smax_value = meta->msize_smax_value;
+	ret_reg->umax_value = meta->msize_umax_value;
+}
+
 static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
 {
 	const struct bpf_func_proto *fn = NULL;
@@ -2456,6 +2481,8 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		return -EINVAL;
 	}
 
+	do_refine_retval_range(regs, fn->ret_type, func_id, &meta);
+
 	err = check_map_func_compatibility(env, meta.map_ptr, func_id);
 	if (err)
 		return err;
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 9/9] tools/bpf: add a test for bpf_get_stack with tracepoint prog
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-1-yhs@fb.com>

The test_stacktrace_map and test_stacktrace_build_id are
enhanced to call bpf_get_stack in the helper to get the
stack trace as well.  The stack traces from bpf_get_stack
and bpf_get_stackid are compared to ensure that for the
same stack as represented as the same hash, their ip addresses
or build id's must be the same.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/test_progs.c           | 63 +++++++++++++++++++---
 .../selftests/bpf/test_stacktrace_build_id.c       | 20 ++++++-
 tools/testing/selftests/bpf/test_stacktrace_map.c  | 20 +++++--
 3 files changed, 92 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index dad4c3f..06b922a 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -897,11 +897,40 @@ static int compare_map_keys(int map1_fd, int map2_fd)
 	return 0;
 }
 
+static int compare_stack_ips(int smap_fd, int amap_fd, int stack_trace_len)
+{
+	__u32 key, next_key, *cur_key_p, *next_key_p;
+	char val_buf1[stack_trace_len], val_buf2[stack_trace_len];
+	int i, err;
+
+	cur_key_p = NULL;
+	next_key_p = &key;
+	while (bpf_map_get_next_key(smap_fd, cur_key_p, next_key_p) == 0) {
+		err = bpf_map_lookup_elem(smap_fd, next_key_p, val_buf1);
+		if (err)
+			return err;
+		err = bpf_map_lookup_elem(amap_fd, next_key_p, val_buf2);
+		if (err)
+			return err;
+		for (i = 0; i < stack_trace_len; i++) {
+			if (val_buf1[i] != val_buf2[i])
+				return -1;
+		}
+		key = *next_key_p;
+		cur_key_p = &key;
+		next_key_p = &next_key;
+	}
+	if (errno != ENOENT)
+		return -1;
+
+	return 0;
+}
+
 static void test_stacktrace_map()
 {
-	int control_map_fd, stackid_hmap_fd, stackmap_fd;
+	int control_map_fd, stackid_hmap_fd, stackmap_fd, stack_amap_fd;
 	const char *file = "./test_stacktrace_map.o";
-	int bytes, efd, err, pmu_fd, prog_fd;
+	int bytes, efd, err, pmu_fd, prog_fd, stack_trace_len;
 	struct perf_event_attr attr = {};
 	__u32 key, val, duration = 0;
 	struct bpf_object *obj;
@@ -957,6 +986,10 @@ static void test_stacktrace_map()
 	if (stackmap_fd < 0)
 		goto disable_pmu;
 
+	stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap");
+	if (stack_amap_fd < 0)
+		goto disable_pmu;
+
 	/* give some time for bpf program run */
 	sleep(1);
 
@@ -978,6 +1011,12 @@ static void test_stacktrace_map()
 		  "err %d errno %d\n", err, errno))
 		goto disable_pmu_noerr;
 
+	stack_trace_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
+	err = compare_stack_ips(stackmap_fd, stack_amap_fd, stack_trace_len);
+	if (CHECK(err, "compare_stack_ips stackmap vs. stack_amap",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu_noerr;
+
 	goto disable_pmu_noerr;
 disable_pmu:
 	error_cnt++;
@@ -1071,9 +1110,9 @@ static int extract_build_id(char *build_id, size_t size)
 
 static void test_stacktrace_build_id(void)
 {
-	int control_map_fd, stackid_hmap_fd, stackmap_fd;
+	int control_map_fd, stackid_hmap_fd, stackmap_fd, stack_amap_fd;
 	const char *file = "./test_stacktrace_build_id.o";
-	int bytes, efd, err, pmu_fd, prog_fd;
+	int bytes, efd, err, pmu_fd, prog_fd, stack_trace_len;
 	struct perf_event_attr attr = {};
 	__u32 key, previous_key, val, duration = 0;
 	struct bpf_object *obj;
@@ -1138,6 +1177,11 @@ static void test_stacktrace_build_id(void)
 		  err, errno))
 		goto disable_pmu;
 
+	stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap");
+	if (CHECK(stack_amap_fd < 0, "bpf_find_map stack_amap",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
 	assert(system("dd if=/dev/urandom of=/dev/zero count=4 2> /dev/null")
 	       == 0);
 	assert(system("./urandom_read if=/dev/urandom of=/dev/zero count=4 2> /dev/null") == 0);
@@ -1189,8 +1233,15 @@ static void test_stacktrace_build_id(void)
 		previous_key = key;
 	} while (bpf_map_get_next_key(stackmap_fd, &previous_key, &key) == 0);
 
-	CHECK(build_id_matches < 1, "build id match",
-	      "Didn't find expected build ID from the map");
+	if (CHECK(build_id_matches < 1, "build id match",
+		  "Didn't find expected build ID from the map"))
+		goto disable_pmu;
+
+	stack_trace_len = PERF_MAX_STACK_DEPTH
+		* sizeof(struct bpf_stack_build_id);
+	err = compare_stack_ips(stackmap_fd, stack_amap_fd, stack_trace_len);
+	CHECK(err, "compare_stack_ips stackmap vs. stack_amap",
+	      "err %d errno %d\n", err, errno);
 
 disable_pmu:
 	ioctl(pmu_fd, PERF_EVENT_IOC_DISABLE);
diff --git a/tools/testing/selftests/bpf/test_stacktrace_build_id.c b/tools/testing/selftests/bpf/test_stacktrace_build_id.c
index b755bd7..d86c281 100644
--- a/tools/testing/selftests/bpf/test_stacktrace_build_id.c
+++ b/tools/testing/selftests/bpf/test_stacktrace_build_id.c
@@ -19,7 +19,7 @@ struct bpf_map_def SEC("maps") stackid_hmap = {
 	.type = BPF_MAP_TYPE_HASH,
 	.key_size = sizeof(__u32),
 	.value_size = sizeof(__u32),
-	.max_entries = 10000,
+	.max_entries = 16384,
 };
 
 struct bpf_map_def SEC("maps") stackmap = {
@@ -31,6 +31,14 @@ struct bpf_map_def SEC("maps") stackmap = {
 	.map_flags = BPF_F_STACK_BUILD_ID,
 };
 
+struct bpf_map_def SEC("maps") stack_amap = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct bpf_stack_build_id)
+		* PERF_MAX_STACK_DEPTH,
+	.max_entries = 128,
+};
+
 /* taken from /sys/kernel/debug/tracing/events/random/urandom_read/format */
 struct random_urandom_args {
 	unsigned long long pad;
@@ -42,7 +50,10 @@ struct random_urandom_args {
 SEC("tracepoint/random/urandom_read")
 int oncpu(struct random_urandom_args *args)
 {
+	__u32 max_len = sizeof(struct bpf_stack_build_id)
+			* PERF_MAX_STACK_DEPTH;
 	__u32 key = 0, val = 0, *value_p;
+	void *stack_p;
 
 	value_p = bpf_map_lookup_elem(&control_map, &key);
 	if (value_p && *value_p)
@@ -50,8 +61,13 @@ int oncpu(struct random_urandom_args *args)
 
 	/* The size of stackmap and stackid_hmap should be the same */
 	key = bpf_get_stackid(args, &stackmap, BPF_F_USER_STACK);
-	if ((int)key >= 0)
+	if ((int)key >= 0) {
 		bpf_map_update_elem(&stackid_hmap, &key, &val, 0);
+		stack_p = bpf_map_lookup_elem(&stack_amap, &key);
+		if (stack_p)
+			bpf_get_stack(args, stack_p, max_len,
+				      BPF_F_USER_STACK | BPF_F_USER_BUILD_ID);
+	}
 
 	return 0;
 }
diff --git a/tools/testing/selftests/bpf/test_stacktrace_map.c b/tools/testing/selftests/bpf/test_stacktrace_map.c
index 76d85c5d..f83c7b6 100644
--- a/tools/testing/selftests/bpf/test_stacktrace_map.c
+++ b/tools/testing/selftests/bpf/test_stacktrace_map.c
@@ -19,14 +19,21 @@ struct bpf_map_def SEC("maps") stackid_hmap = {
 	.type = BPF_MAP_TYPE_HASH,
 	.key_size = sizeof(__u32),
 	.value_size = sizeof(__u32),
-	.max_entries = 10000,
+	.max_entries = 16384,
 };
 
 struct bpf_map_def SEC("maps") stackmap = {
 	.type = BPF_MAP_TYPE_STACK_TRACE,
 	.key_size = sizeof(__u32),
 	.value_size = sizeof(__u64) * PERF_MAX_STACK_DEPTH,
-	.max_entries = 10000,
+	.max_entries = 16384,
+};
+
+struct bpf_map_def SEC("maps") stack_amap = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u64) * PERF_MAX_STACK_DEPTH,
+	.max_entries = 16384,
 };
 
 /* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
@@ -44,7 +51,10 @@ struct sched_switch_args {
 SEC("tracepoint/sched/sched_switch")
 int oncpu(struct sched_switch_args *ctx)
 {
+	__u32 max_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
 	__u32 key = 0, val = 0, *value_p;
+	void *stack_p;
+
 
 	value_p = bpf_map_lookup_elem(&control_map, &key);
 	if (value_p && *value_p)
@@ -52,8 +62,12 @@ int oncpu(struct sched_switch_args *ctx)
 
 	/* The size of stackmap and stackid_hmap should be the same */
 	key = bpf_get_stackid(ctx, &stackmap, 0);
-	if ((int)key >= 0)
+	if ((int)key >= 0) {
 		bpf_map_update_elem(&stackid_hmap, &key, &val, 0);
+		stack_p = bpf_map_lookup_elem(&stack_amap, &key);
+		if (stack_p)
+			bpf_get_stack(ctx, stack_p, max_len, 0);
+	}
 
 	return 0;
 }
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 0/9] bpf: add bpf_get_stack helper
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team

Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table regardless of
whether BPF_F_REUSE_STACKID is specified or not,
so some stack traces may be missing from user perspective.

This patch implements a new helper, bpf_get_stack, will
send stack traces directly to bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.

Patches #1 and #2 implemented the core kernel support.
Patches #3 and #4 are two verifier improves to make
bpf programming easier. Patch #5 synced the new helper
to tools headers. Patch #6 moved perf_event polling code
and ksym lookup code from samples/bpf to
tools/testing/selftests/bpf. Patch #7 added a verifier
test in tools/bpf for new verifier change.
Patches #8 and #9 added tests for raw tracepoint prog
and tracepoint prog respectively.

Changelogs:
  v2 -> v3:
    . use meta to track helper memory size argument
    . implement range checking for ARSH in verifier
    . move perf event polling and ksym related functions
      from samples/bpf to tools/bpf
    . added test to compare build id's between bpf_get_stackid
      and bpf_get_stack
  v1 -> v2:
    . fix compilation error when CONFIG_PERF_EVENTS is not enabled

Yonghong Song (9):
  bpf: change prototype for stack_map_get_build_id_offset
  bpf: add bpf_get_stack helper
  bpf/verifier: refine retval R0 state for bpf_get_stack helper
  bpf/verifier: improve register value range tracking with ARSH
  tools/bpf: add bpf_get_stack helper to tools headers
  samples/bpf: move common-purpose trace functions to selftests
  tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH
  tools/bpf: add a test for bpf_get_stack with raw tracepoint prog
  tools/bpf: add a test for bpf_get_stack with tracepoint prog

 include/linux/bpf.h                                |   1 +
 include/linux/filter.h                             |   3 +-
 include/uapi/linux/bpf.h                           |  19 ++-
 kernel/bpf/core.c                                  |   5 +
 kernel/bpf/stackmap.c                              |  80 ++++++++-
 kernel/bpf/syscall.c                               |  10 ++
 kernel/bpf/verifier.c                              |  56 +++++++
 kernel/trace/bpf_trace.c                           |  50 +++++-
 samples/bpf/Makefile                               |  11 +-
 samples/bpf/bpf_load.c                             |  63 -------
 samples/bpf/bpf_load.h                             |   7 -
 samples/bpf/offwaketime_user.c                     |   1 +
 samples/bpf/sampleip_user.c                        |   1 +
 samples/bpf/spintest_user.c                        |   1 +
 samples/bpf/trace_event_user.c                     |   1 +
 samples/bpf/trace_output_user.c                    | 125 ++------------
 tools/include/uapi/linux/bpf.h                     |  19 ++-
 tools/testing/selftests/bpf/Makefile               |   3 +-
 tools/testing/selftests/bpf/bpf_helpers.h          |   3 +-
 tools/testing/selftests/bpf/test_get_stack_rawtp.c | 102 +++++++++++
 tools/testing/selftests/bpf/test_progs.c           | 178 +++++++++++++++++++-
 .../selftests/bpf/test_stacktrace_build_id.c       |  20 ++-
 tools/testing/selftests/bpf/test_stacktrace_map.c  |  20 ++-
 tools/testing/selftests/bpf/test_verifier.c        |  45 +++++
 tools/testing/selftests/bpf/trace_helpers.c        | 186 +++++++++++++++++++++
 tools/testing/selftests/bpf/trace_helpers.h        |  24 +++
 26 files changed, 825 insertions(+), 209 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_get_stack_rawtp.c
 create mode 100644 tools/testing/selftests/bpf/trace_helpers.c
 create mode 100644 tools/testing/selftests/bpf/trace_helpers.h

-- 
2.9.5

^ permalink raw reply

* [PATCH bpf-next v3 2/9] bpf: add bpf_get_stack helper
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-1-yhs@fb.com>

Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from user perspective.

This patch implements a new helper, bpf_get_stack, will
send stack traces directly to bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.

Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h      |  1 +
 include/linux/filter.h   |  3 ++-
 include/uapi/linux/bpf.h | 19 ++++++++++++--
 kernel/bpf/core.c        |  5 ++++
 kernel/bpf/stackmap.c    | 67 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c     | 10 ++++++++
 kernel/bpf/verifier.c    |  3 +++
 kernel/trace/bpf_trace.c | 50 +++++++++++++++++++++++++++++++++++-
 8 files changed, 154 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ee5275e..2c520b4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -690,6 +690,7 @@ extern const struct bpf_func_proto bpf_get_current_comm_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_push_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
+extern const struct bpf_func_proto bpf_get_stack_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
 
 /* Shared helpers among cBPF and eBPF. */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4da8b23..044d30e 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -468,7 +468,8 @@ struct bpf_prog {
 				dst_needed:1,	/* Do we need dst entry? */
 				blinded:1,	/* Was blinded */
 				is_func:1,	/* program is a bpf function */
-				kprobe_override:1; /* Do we override a kprobe? */
+				kprobe_override:1, /* Do we override a kprobe? */
+				need_callchain_buf:1; /* Needs callchain buffer? */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c8383a2..470f3a2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -529,6 +529,17 @@ union bpf_attr {
  *             other bits - reserved
  *     Return: >= 0 stackid on success or negative error
  *
+ * int bpf_get_stack(ctx, buf, size, flags)
+ *     walk user or kernel stack and store the ips in buf
+ *     @ctx: struct pt_regs*
+ *     @buf: user buffer to fill stack
+ *     @size: the buf size
+ *     @flags: bits 0-7 - numer of stack frames to skip
+ *             bit 8 - collect user stack instead of kernel
+ *             bit 11 - get build-id as well if user stack
+ *             other bits - reserved
+ *     Return: >= 0 size copied on success or negative error
+ *
  * s64 bpf_csum_diff(from, from_size, to, to_size, seed)
  *     calculate csum diff
  *     @from: raw from buffer
@@ -841,7 +852,8 @@ union bpf_attr {
 	FN(msg_cork_bytes),		\
 	FN(msg_pull_data),		\
 	FN(bind),			\
-	FN(xdp_adjust_tail),
+	FN(xdp_adjust_tail),		\
+	FN(get_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -875,11 +887,14 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
 #define BPF_F_TUNINFO_IPV6		(1ULL << 0)
 
-/* BPF_FUNC_get_stackid flags. */
+/* flags for both BPF_FUNC_get_stackid and BPF_FUNC_get_stack. */
 #define BPF_F_SKIP_FIELD_MASK		0xffULL
 #define BPF_F_USER_STACK		(1ULL << 8)
+/* flags used by BPF_FUNC_get_stackid only. */
 #define BPF_F_FAST_STACK_CMP		(1ULL << 9)
 #define BPF_F_REUSE_STACKID		(1ULL << 10)
+/* flags used by BPF_FUNC_get_stack only. */
+#define BPF_F_USER_BUILD_ID		(1ULL << 11)
 
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX		(1ULL << 1)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d315b39..bf22eca 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -31,6 +31,7 @@
 #include <linux/rbtree_latch.h>
 #include <linux/kallsyms.h>
 #include <linux/rcupdate.h>
+#include <linux/perf_event.h>
 
 #include <asm/unaligned.h>
 
@@ -1709,6 +1710,10 @@ static void bpf_prog_free_deferred(struct work_struct *work)
 	aux = container_of(work, struct bpf_prog_aux, work);
 	if (bpf_prog_is_dev_bound(aux))
 		bpf_prog_offload_destroy(aux->prog);
+#ifdef CONFIG_PERF_EVENTS
+	if (aux->prog->need_callchain_buf)
+		put_callchain_buffers();
+#endif
 	for (i = 0; i < aux->func_cnt; i++)
 		bpf_jit_free(aux->func[i]);
 	if (aux->func_cnt) {
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 04f6ec1..4477cf6 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -402,6 +402,73 @@ const struct bpf_func_proto bpf_get_stackid_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
+	   u64, flags)
+{
+	u32 init_nr, trace_nr, copy_len, elem_size, num_elem;
+	bool user_build_id = flags & BPF_F_USER_BUILD_ID;
+	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+	bool user = flags & BPF_F_USER_STACK;
+	struct perf_callchain_entry *trace;
+	bool kernel = !user;
+	int err = -EINVAL;
+	u64 *ips;
+
+	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
+			       BPF_F_USER_BUILD_ID)))
+		goto clear;
+	if (kernel && user_build_id)
+		goto clear;
+
+	elem_size = (user && user_build_id) ? sizeof(struct bpf_stack_build_id)
+					    : sizeof(u64);
+	if (unlikely(size % elem_size))
+		goto clear;
+
+	num_elem = size / elem_size;
+	if (sysctl_perf_event_max_stack < num_elem)
+		init_nr = 0;
+	else
+		init_nr = sysctl_perf_event_max_stack - num_elem;
+	trace = get_perf_callchain(regs, init_nr, kernel, user,
+				   sysctl_perf_event_max_stack, false, false);
+	if (unlikely(!trace))
+		goto err_fault;
+
+	trace_nr = trace->nr - init_nr;
+	if (trace_nr <= skip)
+		goto err_fault;
+
+	trace_nr -= skip;
+	trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
+	copy_len = trace_nr * elem_size;
+	ips = trace->ip + skip + init_nr;
+	if (user && user_build_id)
+		stack_map_get_build_id_offset(buf, ips, trace_nr, user);
+	else
+		memcpy(buf, ips, copy_len);
+
+	if (size > copy_len)
+		memset(buf + copy_len, 0, size - copy_len);
+	return copy_len;
+
+err_fault:
+	err = -EFAULT;
+clear:
+	memset(buf, 0, size);
+	return err;
+}
+
+const struct bpf_func_proto bpf_get_stack_proto = {
+	.func		= bpf_get_stack,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 /* Called from eBPF program */
 static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index fe23dc5a..1ee71f6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1360,6 +1360,16 @@ static int bpf_prog_load(union bpf_attr *attr)
 	if (err)
 		goto free_used_maps;
 
+	if (prog->need_callchain_buf) {
+#ifdef CONFIG_PERF_EVENTS
+		err = get_callchain_buffers(sysctl_perf_event_max_stack);
+#else
+		err = -ENOTSUPP;
+#endif
+		if (err)
+			goto free_used_maps;
+	}
+
 	err = bpf_prog_new_fd(prog);
 	if (err < 0) {
 		/* failed to allocate fd.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5dd1dcb..aba9425 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2460,6 +2460,9 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 	if (err)
 		return err;
 
+	if (func_id == BPF_FUNC_get_stack)
+		env->prog->need_callchain_buf = true;
+
 	if (changes_data)
 		clear_all_pkt_pointers(env);
 	return 0;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d88e96d..fe8476f 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -20,6 +20,7 @@
 #include "trace.h"
 
 u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+u64 bpf_get_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 /**
  * trace_call_bpf - invoke BPF program
@@ -577,6 +578,8 @@ kprobe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_perf_event_output_proto;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto;
+	case BPF_FUNC_get_stack:
+		return &bpf_get_stack_proto;
 	case BPF_FUNC_perf_event_read_value:
 		return &bpf_perf_event_read_value_proto;
 #ifdef CONFIG_BPF_KPROBE_OVERRIDE
@@ -664,6 +667,25 @@ static const struct bpf_func_proto bpf_get_stackid_proto_tp = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_get_stack_tp, void *, tp_buff, void *, buf, u32, size,
+	   u64, flags)
+{
+	struct pt_regs *regs = *(struct pt_regs **)tp_buff;
+
+	return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
+			     (unsigned long) size, flags, 0);
+}
+
+static const struct bpf_func_proto bpf_get_stack_proto_tp = {
+	.func		= bpf_get_stack_tp,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -672,6 +694,8 @@ tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_perf_event_output_proto_tp;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto_tp;
+	case BPF_FUNC_get_stack:
+		return &bpf_get_stack_proto_tp;
 	default:
 		return tracing_func_proto(func_id, prog);
 	}
@@ -734,6 +758,8 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_perf_event_output_proto_tp;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto_tp;
+	case BPF_FUNC_get_stack:
+		return &bpf_get_stack_proto_tp;
 	case BPF_FUNC_perf_prog_read_value:
 		return &bpf_perf_prog_read_value_proto;
 	default:
@@ -744,7 +770,7 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 /*
  * bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp
  * to avoid potential recursive reuse issue when/if tracepoints are added
- * inside bpf_*_event_output and/or bpf_get_stack_id
+ * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack
  */
 static DEFINE_PER_CPU(struct pt_regs, bpf_raw_tp_regs);
 BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args,
@@ -787,6 +813,26 @@ static const struct bpf_func_proto bpf_get_stackid_proto_raw_tp = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_get_stack_raw_tp, struct bpf_raw_tracepoint_args *, args,
+	   void *, buf, u32, size, u64, flags)
+{
+	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
+
+	perf_fetch_caller_regs(regs);
+	return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
+			     (unsigned long) size, flags, 0);
+}
+
+static const struct bpf_func_proto bpf_get_stack_proto_raw_tp = {
+	.func		= bpf_get_stack_raw_tp,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -795,6 +841,8 @@ raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_perf_event_output_proto_raw_tp;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto_raw_tp;
+	case BPF_FUNC_get_stack:
+		return &bpf_get_stack_proto_raw_tp;
 	default:
 		return tracing_func_proto(func_id, prog);
 	}
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 6/9] samples/bpf: move common-purpose trace functions to selftests
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-1-yhs@fb.com>

There is no functionality change in this patch. The common-purpose
trace functions, including perf_event polling and ksym lookup,
are moved from trace_output_user.c and bpf_load.c to
selftests/bpf/trace_helpers.c so that these function can
be reused later in selftests.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 samples/bpf/Makefile                        |  11 +-
 samples/bpf/bpf_load.c                      |  63 ----------
 samples/bpf/bpf_load.h                      |   7 --
 samples/bpf/offwaketime_user.c              |   1 +
 samples/bpf/sampleip_user.c                 |   1 +
 samples/bpf/spintest_user.c                 |   1 +
 samples/bpf/trace_event_user.c              |   1 +
 samples/bpf/trace_output_user.c             | 125 +++----------------
 tools/testing/selftests/bpf/trace_helpers.c | 186 ++++++++++++++++++++++++++++
 tools/testing/selftests/bpf/trace_helpers.h |  24 ++++
 10 files changed, 238 insertions(+), 182 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/trace_helpers.c
 create mode 100644 tools/testing/selftests/bpf/trace_helpers.h

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index aa8c392..d36444c 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -49,6 +49,7 @@ hostprogs-y += xdp_adjust_tail
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
 CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o
+TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o
 
 test_lru_dist-objs := test_lru_dist.o $(LIBBPF)
 sock_example-objs := sock_example.o $(LIBBPF)
@@ -65,10 +66,10 @@ tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
 tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
 load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
-trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
+trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o $(TRACE_HELPERS)
 lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
-offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o
-spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o
+offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o $(TRACE_HELPERS)
+spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o $(TRACE_HELPERS)
 map_perf_test-objs := bpf_load.o $(LIBBPF) map_perf_test_user.o
 test_overhead-objs := bpf_load.o $(LIBBPF) test_overhead_user.o
 test_cgrp2_array_pin-objs := $(LIBBPF) test_cgrp2_array_pin.o
@@ -82,8 +83,8 @@ xdp2-objs := bpf_load.o $(LIBBPF) xdp1_user.o
 xdp_router_ipv4-objs := bpf_load.o $(LIBBPF) xdp_router_ipv4_user.o
 test_current_task_under_cgroup-objs := bpf_load.o $(LIBBPF) $(CGROUP_HELPERS) \
 				       test_current_task_under_cgroup_user.o
-trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o
-sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o
+trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o $(TRACE_HELPERS)
+sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o $(TRACE_HELPERS)
 tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o
 lwt_len_hist-objs := bpf_load.o $(LIBBPF) lwt_len_hist_user.o
 xdp_tx_iptunnel-objs := bpf_load.o $(LIBBPF) xdp_tx_iptunnel_user.o
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index bebe418..529972e 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -650,66 +650,3 @@ void read_trace_pipe(void)
 		}
 	}
 }
-
-#define MAX_SYMS 300000
-static struct ksym syms[MAX_SYMS];
-static int sym_cnt;
-
-static int ksym_cmp(const void *p1, const void *p2)
-{
-	return ((struct ksym *)p1)->addr - ((struct ksym *)p2)->addr;
-}
-
-int load_kallsyms(void)
-{
-	FILE *f = fopen("/proc/kallsyms", "r");
-	char func[256], buf[256];
-	char symbol;
-	void *addr;
-	int i = 0;
-
-	if (!f)
-		return -ENOENT;
-
-	while (!feof(f)) {
-		if (!fgets(buf, sizeof(buf), f))
-			break;
-		if (sscanf(buf, "%p %c %s", &addr, &symbol, func) != 3)
-			break;
-		if (!addr)
-			continue;
-		syms[i].addr = (long) addr;
-		syms[i].name = strdup(func);
-		i++;
-	}
-	sym_cnt = i;
-	qsort(syms, sym_cnt, sizeof(struct ksym), ksym_cmp);
-	return 0;
-}
-
-struct ksym *ksym_search(long key)
-{
-	int start = 0, end = sym_cnt;
-	int result;
-
-	while (start < end) {
-		size_t mid = start + (end - start) / 2;
-
-		result = key - syms[mid].addr;
-		if (result < 0)
-			end = mid;
-		else if (result > 0)
-			start = mid + 1;
-		else
-			return &syms[mid];
-	}
-
-	if (start >= 1 && syms[start - 1].addr < key &&
-	    key < syms[start].addr)
-		/* valid ksym */
-		return &syms[start - 1];
-
-	/* out of range. return _stext */
-	return &syms[0];
-}
-
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 453c200..2c3d0b4 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -54,12 +54,5 @@ int load_bpf_file(char *path);
 int load_bpf_file_fixup_map(const char *path, fixup_map_cb fixup_map);
 
 void read_trace_pipe(void);
-struct ksym {
-	long addr;
-	char *name;
-};
-
-int load_kallsyms(void);
-struct ksym *ksym_search(long key);
 int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags);
 #endif
diff --git a/samples/bpf/offwaketime_user.c b/samples/bpf/offwaketime_user.c
index 512f87a..f06063a 100644
--- a/samples/bpf/offwaketime_user.c
+++ b/samples/bpf/offwaketime_user.c
@@ -17,6 +17,7 @@
 #include <sys/resource.h>
 #include "libbpf.h"
 #include "bpf_load.h"
+#include "trace_helpers.h"
 
 #define PRINT_RAW_ADDR 0
 
diff --git a/samples/bpf/sampleip_user.c b/samples/bpf/sampleip_user.c
index 4ed690b..60c2b73 100644
--- a/samples/bpf/sampleip_user.c
+++ b/samples/bpf/sampleip_user.c
@@ -22,6 +22,7 @@
 #include "libbpf.h"
 #include "bpf_load.h"
 #include "perf-sys.h"
+#include "trace_helpers.h"
 
 #define DEFAULT_FREQ	99
 #define DEFAULT_SECS	5
diff --git a/samples/bpf/spintest_user.c b/samples/bpf/spintest_user.c
index 3d73621..8d3e9cf 100644
--- a/samples/bpf/spintest_user.c
+++ b/samples/bpf/spintest_user.c
@@ -7,6 +7,7 @@
 #include <sys/resource.h>
 #include "libbpf.h"
 #include "bpf_load.h"
+#include "trace_helpers.h"
 
 int main(int ac, char **argv)
 {
diff --git a/samples/bpf/trace_event_user.c b/samples/bpf/trace_event_user.c
index 56f7a25..1fa1bec 100644
--- a/samples/bpf/trace_event_user.c
+++ b/samples/bpf/trace_event_user.c
@@ -21,6 +21,7 @@
 #include "libbpf.h"
 #include "bpf_load.h"
 #include "perf-sys.h"
+#include "trace_helpers.h"
 
 #define SAMPLE_FREQ 50
 
diff --git a/samples/bpf/trace_output_user.c b/samples/bpf/trace_output_user.c
index ccca1e3..cc4b383 100644
--- a/samples/bpf/trace_output_user.c
+++ b/samples/bpf/trace_output_user.c
@@ -21,100 +21,10 @@
 #include "libbpf.h"
 #include "bpf_load.h"
 #include "perf-sys.h"
+#include "trace_helpers.h"
 
 static int pmu_fd;
 
-int page_size;
-int page_cnt = 8;
-volatile struct perf_event_mmap_page *header;
-
-typedef void (*print_fn)(void *data, int size);
-
-static int perf_event_mmap(int fd)
-{
-	void *base;
-	int mmap_size;
-
-	page_size = getpagesize();
-	mmap_size = page_size * (page_cnt + 1);
-
-	base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
-	if (base == MAP_FAILED) {
-		printf("mmap err\n");
-		return -1;
-	}
-
-	header = base;
-	return 0;
-}
-
-static int perf_event_poll(int fd)
-{
-	struct pollfd pfd = { .fd = fd, .events = POLLIN };
-
-	return poll(&pfd, 1, 1000);
-}
-
-struct perf_event_sample {
-	struct perf_event_header header;
-	__u32 size;
-	char data[];
-};
-
-static void perf_event_read(print_fn fn)
-{
-	__u64 data_tail = header->data_tail;
-	__u64 data_head = header->data_head;
-	__u64 buffer_size = page_cnt * page_size;
-	void *base, *begin, *end;
-	char buf[256];
-
-	asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
-	if (data_head == data_tail)
-		return;
-
-	base = ((char *)header) + page_size;
-
-	begin = base + data_tail % buffer_size;
-	end = base + data_head % buffer_size;
-
-	while (begin != end) {
-		struct perf_event_sample *e;
-
-		e = begin;
-		if (begin + e->header.size > base + buffer_size) {
-			long len = base + buffer_size - begin;
-
-			assert(len < e->header.size);
-			memcpy(buf, begin, len);
-			memcpy(buf + len, base, e->header.size - len);
-			e = (void *) buf;
-			begin = base + e->header.size - len;
-		} else if (begin + e->header.size == base + buffer_size) {
-			begin = base;
-		} else {
-			begin += e->header.size;
-		}
-
-		if (e->header.type == PERF_RECORD_SAMPLE) {
-			fn(e->data, e->size);
-		} else if (e->header.type == PERF_RECORD_LOST) {
-			struct {
-				struct perf_event_header header;
-				__u64 id;
-				__u64 lost;
-			} *lost = (void *) e;
-			printf("lost %lld events\n", lost->lost);
-		} else {
-			printf("unknown event type=%d size=%d\n",
-			       e->header.type, e->header.size);
-		}
-	}
-
-	__sync_synchronize(); /* smp_mb() */
-	header->data_tail = data_head;
-}
-
 static __u64 time_get_ns(void)
 {
 	struct timespec ts;
@@ -127,7 +37,7 @@ static __u64 start_time;
 
 #define MAX_CNT 100000ll
 
-static void print_bpf_output(void *data, int size)
+static int print_bpf_output(void *data, int size)
 {
 	static __u64 cnt;
 	struct {
@@ -138,7 +48,7 @@ static void print_bpf_output(void *data, int size)
 	if (e->cookie != 0x12345678) {
 		printf("BUG pid %llx cookie %llx sized %d\n",
 		       e->pid, e->cookie, size);
-		kill(0, SIGINT);
+		return PERF_EVENT_ERROR;
 	}
 
 	cnt++;
@@ -146,8 +56,10 @@ static void print_bpf_output(void *data, int size)
 	if (cnt == MAX_CNT) {
 		printf("recv %lld events per sec\n",
 		       MAX_CNT * 1000000000ll / (time_get_ns() - start_time));
-		kill(0, SIGINT);
+		return PERF_EVENT_DONE;
 	}
+
+	return PERF_EVENT_CONT;
 }
 
 static void test_bpf_perf_event(void)
@@ -166,10 +78,18 @@ static void test_bpf_perf_event(void)
 	ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
 }
 
+static void exec_action(void)
+{
+	FILE *f;
+
+	f = popen("taskset 1 dd if=/dev/zero of=/dev/null", "r");
+	(void) f;
+}
+
 int main(int argc, char **argv)
 {
 	char filename[256];
-	FILE *f;
+	int ret;
 
 	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
 
@@ -180,17 +100,8 @@ int main(int argc, char **argv)
 
 	test_bpf_perf_event();
 
-	if (perf_event_mmap(pmu_fd) < 0)
-		return 1;
-
-	f = popen("taskset 1 dd if=/dev/zero of=/dev/null", "r");
-	(void) f;
-
 	start_time = time_get_ns();
-	for (;;) {
-		perf_event_poll(pmu_fd);
-		perf_event_read(print_bpf_output);
-	}
-
-	return 0;
+	ret = perf_event_poller(pmu_fd, exec_action, print_bpf_output);
+	kill(0, SIGINT);
+	return ret;
 }
diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
new file mode 100644
index 0000000..00954e3
--- /dev/null
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <assert.h>
+#include <errno.h>
+#include <poll.h>
+#include <unistd.h>
+#include <linux/perf_event.h>
+#include <sys/mman.h>
+#include "trace_helpers.h"
+
+#define MAX_SYMS 300000
+static struct ksym syms[MAX_SYMS];
+static int sym_cnt;
+
+static int ksym_cmp(const void *p1, const void *p2)
+{
+	return ((struct ksym *)p1)->addr - ((struct ksym *)p2)->addr;
+}
+
+int load_kallsyms(void)
+{
+	FILE *f = fopen("/proc/kallsyms", "r");
+	char func[256], buf[256];
+	char symbol;
+	void *addr;
+	int i = 0;
+
+	if (!f)
+		return -ENOENT;
+
+	while (!feof(f)) {
+		if (!fgets(buf, sizeof(buf), f))
+			break;
+		if (sscanf(buf, "%p %c %s", &addr, &symbol, func) != 3)
+			break;
+		if (!addr)
+			continue;
+		syms[i].addr = (long) addr;
+		syms[i].name = strdup(func);
+		i++;
+	}
+	sym_cnt = i;
+	qsort(syms, sym_cnt, sizeof(struct ksym), ksym_cmp);
+	return 0;
+}
+
+struct ksym *ksym_search(long key)
+{
+	int start = 0, end = sym_cnt;
+	int result;
+
+	while (start < end) {
+		size_t mid = start + (end - start) / 2;
+
+		result = key - syms[mid].addr;
+		if (result < 0)
+			end = mid;
+		else if (result > 0)
+			start = mid + 1;
+		else
+			return &syms[mid];
+	}
+
+	if (start >= 1 && syms[start - 1].addr < key &&
+	    key < syms[start].addr)
+		/* valid ksym */
+		return &syms[start - 1];
+
+	/* out of range. return _stext */
+	return &syms[0];
+}
+
+static int page_size;
+static int page_cnt = 8;
+static volatile struct perf_event_mmap_page *header;
+
+static int perf_event_mmap(int fd)
+{
+	void *base;
+	int mmap_size;
+
+	page_size = getpagesize();
+	mmap_size = page_size * (page_cnt + 1);
+
+	base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (base == MAP_FAILED) {
+		printf("mmap err\n");
+		return -1;
+	}
+
+	header = base;
+	return 0;
+}
+
+static int perf_event_poll(int fd)
+{
+	struct pollfd pfd = { .fd = fd, .events = POLLIN };
+
+	return poll(&pfd, 1, 1000);
+}
+
+struct perf_event_sample {
+	struct perf_event_header header;
+	__u32 size;
+	char data[];
+};
+
+static int perf_event_read(perf_event_print_fn fn)
+{
+	__u64 data_tail = header->data_tail;
+	__u64 data_head = header->data_head;
+	__u64 buffer_size = page_cnt * page_size;
+	void *base, *begin, *end;
+	char buf[256];
+	int ret;
+
+	asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
+	if (data_head == data_tail)
+		return PERF_EVENT_CONT;
+
+	base = ((char *)header) + page_size;
+
+	begin = base + data_tail % buffer_size;
+	end = base + data_head % buffer_size;
+
+	while (begin != end) {
+		struct perf_event_sample *e;
+
+		e = begin;
+		if (begin + e->header.size > base + buffer_size) {
+			long len = base + buffer_size - begin;
+
+			assert(len < e->header.size);
+			memcpy(buf, begin, len);
+			memcpy(buf + len, base, e->header.size - len);
+			e = (void *) buf;
+			begin = base + e->header.size - len;
+		} else if (begin + e->header.size == base + buffer_size) {
+			begin = base;
+		} else {
+			begin += e->header.size;
+		}
+
+		if (e->header.type == PERF_RECORD_SAMPLE) {
+			ret = fn(e->data, e->size);
+			if (ret != PERF_EVENT_CONT)
+				return ret;
+		} else if (e->header.type == PERF_RECORD_LOST) {
+			struct {
+				struct perf_event_header header;
+				__u64 id;
+				__u64 lost;
+			} *lost = (void *) e;
+			printf("lost %lld events\n", lost->lost);
+		} else {
+			printf("unknown event type=%d size=%d\n",
+			       e->header.type, e->header.size);
+		}
+	}
+
+	__sync_synchronize(); /* smp_mb() */
+	header->data_tail = data_head;
+	return PERF_EVENT_CONT;
+}
+
+int perf_event_poller(int fd, perf_event_exec_fn exec_fn,
+		      perf_event_print_fn output_fn)
+{
+	int ret;
+
+	if (perf_event_mmap(fd) < 0)
+		return PERF_EVENT_ERROR;
+
+	exec_fn();
+
+	for (;;) {
+		perf_event_poll(fd);
+		ret = perf_event_read(output_fn);
+		if (ret != PERF_EVENT_CONT)
+			return ret;
+	}
+
+	return PERF_EVENT_DONE;
+}
diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
new file mode 100644
index 0000000..8750778
--- /dev/null
+++ b/tools/testing/selftests/bpf/trace_helpers.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __TRACE_HELPER_H
+#define __TRACE_HELPER_H
+
+struct ksym {
+	long addr;
+	char *name;
+};
+
+int load_kallsyms(void);
+struct ksym *ksym_search(long key);
+
+typedef void (*perf_event_exec_fn)(void);
+typedef int (*perf_event_print_fn)(void *data, int size);
+
+/* return code for perf_event_print_fn */
+#define PERF_EVENT_DONE		0
+#define PERF_EVENT_ERROR	1
+#define PERF_EVENT_CONT		2
+
+/* return PERF_EVENT_DONE or PERF_EVENT_ERROR */
+int perf_event_poller(int fd, perf_event_exec_fn exec_fn,
+		      perf_event_print_fn output_fn);
+#endif
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 8/9] tools/bpf: add a test for bpf_get_stack with raw tracepoint prog
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-1-yhs@fb.com>

The test attached a raw_tracepoint program to sched/sched_switch.
It tested to get stack for user space, kernel space and user
space with build_id request. It also tested to get user
and kernel stack into the same buffer with back-to-back
bpf_get_stack helper calls.

Whenever the kernel stack is available, the user space
application will check to ensure that the kernel function
for raw_tracepoint ___bpf_prog_run is part of the stack.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/testing/selftests/bpf/Makefile               |   3 +-
 tools/testing/selftests/bpf/test_get_stack_rawtp.c | 102 ++++++++++++++++++
 tools/testing/selftests/bpf/test_progs.c           | 115 +++++++++++++++++++++
 3 files changed, 219 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_get_stack_rawtp.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 0b72cc7..54e9e74 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -32,7 +32,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
 	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
 	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
-	test_btf_haskv.o test_btf_nokv.o
+	test_btf_haskv.o test_btf_nokv.o test_get_stack_rawtp.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -56,6 +56,7 @@ $(TEST_GEN_PROGS_EXTENDED): $(OUTPUT)/libbpf.a
 $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c
 $(OUTPUT)/test_sock: cgroup_helpers.c
 $(OUTPUT)/test_sock_addr: cgroup_helpers.c
+$(OUTPUT)/test_progs: trace_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/test_get_stack_rawtp.c b/tools/testing/selftests/bpf/test_get_stack_rawtp.c
new file mode 100644
index 0000000..ba1dcf9
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_get_stack_rawtp.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+
+/* Permit pretty deep stack traces */
+#define MAX_STACK_RAWTP 100
+struct stack_trace_t {
+	int pid;
+	int kern_stack_size;
+	int user_stack_size;
+	int user_stack_buildid_size;
+	__u64 kern_stack[MAX_STACK_RAWTP];
+	__u64 user_stack[MAX_STACK_RAWTP];
+	struct bpf_stack_build_id user_stack_buildid[MAX_STACK_RAWTP];
+};
+
+struct bpf_map_def SEC("maps") perfmap = {
+	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(__u32),
+	.max_entries = 2,
+};
+
+struct bpf_map_def SEC("maps") stackdata_map = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct stack_trace_t),
+	.max_entries = 1,
+};
+
+/* Allocate per-cpu space twice the needed. For the code below
+ *   usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
+ *   if (usize < 0)
+ *     return 0;
+ *   ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
+ *
+ * If we have value_size = MAX_STACK_RAWTP * sizeof(__u64),
+ * verifier will complain that access "raw_data + usize"
+ * with size "max_len - usize" may be out of bound.
+ * The maximum "raw_data + usize" is "raw_data + max_len"
+ * and the maximum "max_len - usize" is "max_len", verifier
+ * concludes that the maximum buffer access range is
+ * "raw_data[0...max_len * 2 - 1]" and hence reject the program.
+ *
+ * Doubling the to-be-used max buffer size can fix this verifier
+ * issue and avoid complicated C programming massaging.
+ * This is an acceptable workaround since there is one entry here.
+ */
+struct bpf_map_def SEC("maps") rawdata_map = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = MAX_STACK_RAWTP * sizeof(__u64) * 2,
+	.max_entries = 1,
+};
+
+SEC("tracepoint/sched/sched_switch")
+int bpf_prog1(void *ctx)
+{
+	int max_len, max_buildid_len, usize, ksize, total_size;
+	struct stack_trace_t *data;
+	void *raw_data;
+	__u32 key = 0;
+
+	data = bpf_map_lookup_elem(&stackdata_map, &key);
+	if (!data)
+		return 0;
+
+	max_len = MAX_STACK_RAWTP * sizeof(__u64);
+	max_buildid_len = MAX_STACK_RAWTP * sizeof(struct bpf_stack_build_id);
+	data->pid = bpf_get_current_pid_tgid();
+	data->kern_stack_size = bpf_get_stack(ctx, data->kern_stack,
+					      max_len, 0);
+	data->user_stack_size = bpf_get_stack(ctx, data->user_stack, max_len,
+					    BPF_F_USER_STACK);
+	data->user_stack_buildid_size = bpf_get_stack(
+		ctx, data->user_stack_buildid, max_buildid_len,
+		BPF_F_USER_STACK | BPF_F_USER_BUILD_ID);
+	bpf_perf_event_output(ctx, &perfmap, 0, data, sizeof(*data));
+
+	/* write both kernel and user stacks to the same buffer */
+	raw_data = bpf_map_lookup_elem(&rawdata_map, &key);
+	if (!raw_data)
+		return 0;
+
+	usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
+	if (usize < 0)
+		return 0;
+
+	ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
+	if (ksize < 0)
+		return 0;
+
+	total_size = usize + ksize;
+	if (total_size > 0 && total_size <= max_len)
+		bpf_perf_event_output(ctx, &perfmap, 0, raw_data, total_size);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+__u32 _version SEC("version") = 1; /* ignored by tracepoints, required by libbpf.a */
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index eedda98..dad4c3f 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -38,6 +38,7 @@ typedef __u16 __sum16;
 #include "bpf_util.h"
 #include "bpf_endian.h"
 #include "bpf_rlimit.h"
+#include "trace_helpers.h"
 
 static int error_cnt, pass_cnt;
 
@@ -1204,6 +1205,119 @@ static void test_stacktrace_build_id(void)
 	return;
 }
 
+#define MAX_CNT_RAWTP	10ull
+#define MAX_STACK_RAWTP	100
+struct get_stack_trace_t {
+	int pid;
+	int kern_stack_size;
+	int user_stack_size;
+	int user_stack_buildid_size;
+	__u64 kern_stack[MAX_STACK_RAWTP];
+	__u64 user_stack[MAX_STACK_RAWTP];
+	struct bpf_stack_build_id user_stack_buildid[MAX_STACK_RAWTP];
+};
+
+static void get_stack_raw_tp_action(void)
+{
+	FILE *f;
+
+	f = popen("taskset 1 dd if=/dev/zero of=/dev/null", "r");
+	(void) f;
+}
+
+static int get_stack_print_output(void *data, int size)
+{
+	bool good_kern_stack = false, good_user_stack = false;
+	const char *expected_func = "___bpf_prog_run";
+	struct get_stack_trace_t *e = data;
+	int i, num_stack;
+	static __u64 cnt;
+	struct ksym *ks;
+
+	cnt++;
+
+	if (size < sizeof(struct get_stack_trace_t)) {
+		__u64 *raw_data = data;
+
+		num_stack = size / sizeof(__u64);
+		for (i = 0; i < num_stack; i++) {
+			ks = ksym_search(raw_data[i]);
+			if (ks && (strcmp(ks->name, expected_func) == 0)) {
+				good_kern_stack = true;
+				good_user_stack = (i > 0);
+			}
+		}
+	} else {
+		if (e->kern_stack_size > 0) {
+			num_stack = e->kern_stack_size / sizeof(__u64);
+			for (i = 0; i < num_stack; i++) {
+				ks = ksym_search(e->kern_stack[i]);
+				if (ks && (strcmp(ks->name, expected_func) == 0))
+					good_kern_stack = true;
+			}
+		}
+		if (e->user_stack_size > 0 && e->user_stack_buildid_size > 0)
+			good_user_stack = true;
+	}
+	if (!good_kern_stack || !good_user_stack)
+		return PERF_EVENT_ERROR;
+
+	if (cnt == MAX_CNT_RAWTP)
+		return PERF_EVENT_DONE;
+
+	return PERF_EVENT_CONT;
+}
+
+static void test_get_stack_raw_tp(void)
+{
+	const char *file = "./test_get_stack_rawtp.o";
+	int efd, err, prog_fd, pmu_fd, perfmap_fd;
+	struct perf_event_attr attr = {};
+	__u32 key = 0, duration = 0;
+	struct bpf_object *obj;
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_RAW_TRACEPOINT, &obj, &prog_fd);
+	if (CHECK(err, "prog_load raw tp", "err %d errno %d\n", err, errno))
+		return;
+
+	efd = bpf_raw_tracepoint_open("sched_switch", prog_fd);
+	if (CHECK(efd < 0, "raw_tp_open", "err %d errno %d\n", efd, errno))
+		goto close_prog;
+
+	perfmap_fd = bpf_find_map(__func__, obj, "perfmap");
+	if (CHECK(perfmap_fd < 0, "bpf_find_map", "err %d errno %d\n", perfmap_fd, errno))
+		goto close_prog;
+
+	err = load_kallsyms();
+	if (CHECK(err < 0, "load_kallsyms", "err %d errno %d\n", err, errno))
+		goto close_prog;
+
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.type = PERF_TYPE_SOFTWARE;
+	attr.config = PERF_COUNT_SW_BPF_OUTPUT;
+	pmu_fd = syscall(__NR_perf_event_open, &attr, -1/*pid*/, 0/*cpu*/,
+			 -1/*group_fd*/, 0);
+	if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n", pmu_fd, errno))
+		goto close_prog;
+
+	err = bpf_map_update_elem(perfmap_fd, &key, &pmu_fd, BPF_ANY);
+	if (CHECK(err < 0, "bpf_map_update_elem", "err %d errno %d\n", err, errno))
+		goto close_prog;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	if (CHECK(err < 0, "ioctl PERF_EVENT_IOC_ENABLE", "err %d errno %d\n", err, errno))
+		goto close_prog;
+
+	if (perf_event_poller(pmu_fd, get_stack_raw_tp_action, get_stack_print_output))
+		goto close_prog;
+
+	goto close_prog_noerr;
+close_prog:
+	error_cnt++;
+close_prog_noerr:
+	bpf_object__close(obj);
+}
+
 int main(void)
 {
 	test_pkt_access();
@@ -1219,6 +1333,7 @@ int main(void)
 	test_stacktrace_map();
 	test_stacktrace_build_id();
 	test_stacktrace_map_raw_tp();
+	test_get_stack_raw_tp();
 
 	printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
 	return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 5/9] tools/bpf: add bpf_get_stack helper to tools headers
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-1-yhs@fb.com>

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/include/uapi/linux/bpf.h            | 19 +++++++++++++++++--
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++-
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7f7fbb9..116eb5f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -529,6 +529,17 @@ union bpf_attr {
  *             other bits - reserved
  *     Return: >= 0 stackid on success or negative error
  *
+ * int bpf_get_stack(ctx, buf, size, flags)
+ *     walk user or kernel stack and store the ips in buf
+ *     @ctx: struct pt_regs*
+ *     @buf: user buffer to fill stack
+ *     @size: the buf size
+ *     @flags: bits 0-7 - numer of stack frames to skip
+ *             bit 8 - collect user stack instead of kernel
+ *             bit 11 - get build-id as well if user stack
+ *             other bits - reserved
+ *     Return: >= 0 size copied on success or negative error
+ *
  * s64 bpf_csum_diff(from, from_size, to, to_size, seed)
  *     calculate csum diff
  *     @from: raw from buffer
@@ -841,7 +852,8 @@ union bpf_attr {
 	FN(msg_cork_bytes),		\
 	FN(msg_pull_data),		\
 	FN(bind),			\
-	FN(xdp_adjust_tail),
+	FN(xdp_adjust_tail),		\
+	FN(get_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -875,11 +887,14 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
 #define BPF_F_TUNINFO_IPV6		(1ULL << 0)
 
-/* BPF_FUNC_get_stackid flags. */
+/* flags for both BPF_FUNC_get_stackid and BPF_FUNC_get_stack. */
 #define BPF_F_SKIP_FIELD_MASK		0xffULL
 #define BPF_F_USER_STACK		(1ULL << 8)
+/* flags used by BPF_FUNC_get_stackid only. */
 #define BPF_F_FAST_STACK_CMP		(1ULL << 9)
 #define BPF_F_REUSE_STACKID		(1ULL << 10)
+/* flags used by BPF_FUNC_get_stack only. */
+#define BPF_F_USER_BUILD_ID		(1ULL << 11)
 
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX		(1ULL << 1)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 9271576..2d9d650 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -98,7 +98,8 @@ static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
 	(void *) BPF_FUNC_bind;
 static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =
 	(void *) BPF_FUNC_xdp_adjust_tail;
-
+static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
+	(void *) BPF_FUNC_get_stack;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 4/9] bpf/verifier: improve register value range tracking with ARSH
From: Yonghong Song @ 2018-04-20 22:18 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180420221842.742330-1-yhs@fb.com>

When helpers like bpf_get_stack returns an int value
and later on used for arithmetic computation, the LSH and ARSH
operations are often required to get proper sign extension into
64-bit. For example, without this patch:
    54: R0=inv(id=0,umax_value=800)
    54: (bf) r8 = r0
    55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
    55: (67) r8 <<= 32
    56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
    56: (c7) r8 s>>= 32
    57: R8=inv(id=0)
With this patch:
    54: R0=inv(id=0,umax_value=800)
    54: (bf) r8 = r0
    55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
    55: (67) r8 <<= 32
    56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
    56: (c7) r8 s>>= 32
    57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
With better range of "R8", later on when "R8" is added to other register,
e.g., a map pointer or scalar-value register, the better register
range can be derived and verifier failure may be avoided.

In our later example,
    ......
    usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
    if (usize < 0)
        return 0;
    ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
    ......
Without improving ARSH value range tracking, the register representing
"max_len - usize" will have smin_value equal to S64_MIN and will be
rejected by verifier.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/verifier.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3c8bb92..01c215d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2975,6 +2975,32 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
 		/* We may learn something more from the var_off */
 		__update_reg_bounds(dst_reg);
 		break;
+	case BPF_ARSH:
+		if (umax_val >= insn_bitness) {
+			/* Shifts greater than 31 or 63 are undefined.
+			 * This includes shifts by a negative number.
+			 */
+			mark_reg_unknown(env, regs, insn->dst_reg);
+			break;
+		}
+		if (dst_reg->smin_value < 0)
+			dst_reg->smin_value >>= umin_val;
+		else
+			dst_reg->smin_value >>= umax_val;
+		if (dst_reg->smax_value < 0)
+			dst_reg->smax_value >>= umax_val;
+		else
+			dst_reg->smax_value >>= umin_val;
+		if (src_known)
+			dst_reg->var_off = tnum_rshift(dst_reg->var_off,
+						       umin_val);
+		else
+			dst_reg->var_off = tnum_rshift(tnum_unknown, umin_val);
+		dst_reg->umin_value >>= umax_val;
+		dst_reg->umax_value >>= umin_val;
+		/* We may learn something more from the var_off */
+		__update_reg_bounds(dst_reg);
+		break;
 	default:
 		mark_reg_unknown(env, regs, insn->dst_reg);
 		break;
-- 
2.9.5

^ permalink raw reply related

* [PATCH net-next 0/7] net/ipv6: Another followup to the fib6_info change
From: David Ahern @ 2018-04-20 22:37 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern

Last one - for this week.

Patches 1, 2 and 7 are more cleanup patches - removing dead code,
moving code from a header to near its single caller, and updating
function name.

Patches 3-5 do some refactoring leading up to patch 6 which fixes
a NULL dereference. I have only managed to trigger a panic once, so
I can not definitively confirm it addresses the problem but it seems
pretty clear that it is a race on removing a 'from' reference on
an rt6_info and another path using that 'from' value to do
cookie checking.

David Ahern (7):
  net/ipv6: Clean up rt expires helpers
  net/ipv6: Rename rt6_get_cookie_safe
  net/ipv6: Move rcu_read_lock to callers of ip6_rt_cache_alloc
  net/ipv6: Move rcu locking to callers of fib6_get_cookie_safe
  net/ipv6: Move release of fib6_info from pcpu routes to helper
  net/ipv6: Make from in rt6_info rcu protected
  net/ipv6: Remove unncessary check on f6i in fib6_check

 include/net/ip6_fib.h |  41 +++++------------
 net/ipv6/ip6_fib.c    |  45 ++++++++++--------
 net/ipv6/ip6_output.c |   9 +++-
 net/ipv6/route.c      | 124 ++++++++++++++++++++++++++++++++++++--------------
 4 files changed, 136 insertions(+), 83 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH net-next 1/7] net/ipv6: Clean up rt expires helpers
From: David Ahern @ 2018-04-20 22:37 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

rt6_clean_expires and rt6_set_expires are no longer used. Removed them.
rt6_update_expires has 1 caller in route.c, so move it from the header.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip6_fib.h | 21 ---------------------
 net/ipv6/route.c      |  9 +++++++++
 2 files changed, 9 insertions(+), 21 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 642211174692..327a74cd6a0d 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -223,27 +223,6 @@ static inline bool fib6_check_expired(const struct fib6_info *f6i)
 	return false;
 }
 
-static inline void rt6_clean_expires(struct rt6_info *rt)
-{
-	rt->rt6i_flags &= ~RTF_EXPIRES;
-	rt->dst.expires = 0;
-}
-
-static inline void rt6_set_expires(struct rt6_info *rt, unsigned long expires)
-{
-	rt->dst.expires = expires;
-	rt->rt6i_flags |= RTF_EXPIRES;
-}
-
-static inline void rt6_update_expires(struct rt6_info *rt0, int timeout)
-{
-	if (!(rt0->rt6i_flags & RTF_EXPIRES) && rt0->from)
-		rt0->dst.expires = rt0->from->expires;
-
-	dst_set_expires(&rt0->dst, timeout);
-	rt0->rt6i_flags |= RTF_EXPIRES;
-}
-
 /* Function to safely get fn->sernum for passed in rt
  * and store result in passed in cookie.
  * Return true if we can get cookie safely
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 884d9cee790f..062dd4d8232c 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2221,6 +2221,15 @@ static void ip6_link_failure(struct sk_buff *skb)
 	}
 }
 
+static void rt6_update_expires(struct rt6_info *rt0, int timeout)
+{
+	if (!(rt0->rt6i_flags & RTF_EXPIRES) && rt0->from)
+		rt0->dst.expires = rt0->from->expires;
+
+	dst_set_expires(&rt0->dst, timeout);
+	rt0->rt6i_flags |= RTF_EXPIRES;
+}
+
 static void rt6_do_update_pmtu(struct rt6_info *rt, u32 mtu)
 {
 	struct net *net = dev_net(rt->dst.dev);
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 2/7] net/ipv6: Rename rt6_get_cookie_safe
From: David Ahern @ 2018-04-20 22:37 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

rt6_get_cookie_safe takes a fib6_info and checks the sernum of
the node. Update the name to reflect its purpose.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip6_fib.h | 6 +++---
 net/ipv6/route.c      | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 327a74cd6a0d..dd1481ed8bdb 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -228,8 +228,8 @@ static inline bool fib6_check_expired(const struct fib6_info *f6i)
  * Return true if we can get cookie safely
  * Return false if not
  */
-static inline bool rt6_get_cookie_safe(const struct fib6_info *f6i,
-				       u32 *cookie)
+static inline bool fib6_get_cookie_safe(const struct fib6_info *f6i,
+					u32 *cookie)
 {
 	struct fib6_node *fn;
 	bool status = false;
@@ -254,7 +254,7 @@ static inline u32 rt6_get_cookie(const struct rt6_info *rt)
 
 	if (rt->rt6i_flags & RTF_PCPU ||
 	    (unlikely(!list_empty(&rt->rt6i_uncached)) && rt->from))
-		rt6_get_cookie_safe(rt->from, &cookie);
+		fib6_get_cookie_safe(rt->from, &cookie);
 
 	return cookie;
 }
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 062dd4d8232c..2d6fcfe11c82 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2128,7 +2128,7 @@ static bool fib6_check(struct fib6_info *f6i, u32 cookie)
 {
 	u32 rt_cookie = 0;
 
-	if ((f6i && !rt6_get_cookie_safe(f6i, &rt_cookie)) ||
+	if ((f6i && !fib6_get_cookie_safe(f6i, &rt_cookie)) ||
 	     rt_cookie != cookie)
 		return false;
 
@@ -2142,7 +2142,7 @@ static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie)
 {
 	u32 rt_cookie = 0;
 
-	if ((rt->from && !rt6_get_cookie_safe(rt->from, &rt_cookie)) ||
+	if ((rt->from && !fib6_get_cookie_safe(rt->from, &rt_cookie)) ||
 	    rt_cookie != cookie)
 		return NULL;
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 3/7] net/ipv6: Move rcu_read_lock to callers of ip6_rt_cache_alloc
From: David Ahern @ 2018-04-20 22:37 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

A later patch protects 'from' in rt6_info and this simplifies the
locking needed by it.

With the move, the fib6_info_hold for the uncached_rt is no longer
needed since the rcu_lock is still held.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/route.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 2d6fcfe11c82..f8b22183b7fd 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1164,10 +1164,8 @@ static struct rt6_info *ip6_rt_cache_alloc(struct fib6_info *ort,
 	 *	Clone the route.
 	 */
 
-	rcu_read_lock();
 	dev = ip6_rt_get_dev_rcu(ort);
 	rt = ip6_dst_alloc(dev_net(dev), dev, 0);
-	rcu_read_unlock();
 	if (!rt)
 		return NULL;
 
@@ -1855,14 +1853,11 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
 		 * the daddr in the skb during the neighbor look-up is different
 		 * from the fl6->daddr used to look-up route here.
 		 */
-
 		struct rt6_info *uncached_rt;
 
-		fib6_info_hold(f6i);
-		rcu_read_unlock();
-
 		uncached_rt = ip6_rt_cache_alloc(f6i, &fl6->daddr, NULL);
-		fib6_info_release(f6i);
+
+		rcu_read_unlock();
 
 		if (uncached_rt) {
 			/* Uncached_rt's refcnt is taken during ip6_rt_cache_alloc()
@@ -2280,7 +2275,9 @@ static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
 	} else if (daddr) {
 		struct rt6_info *nrt6;
 
+		rcu_read_lock();
 		nrt6 = ip6_rt_cache_alloc(rt6->from, daddr, saddr);
+		rcu_read_unlock();
 		if (nrt6) {
 			rt6_do_update_pmtu(nrt6, mtu);
 			if (rt6_insert_exception(nrt6, rt6->from))
@@ -3299,7 +3296,9 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
 				     NEIGH_UPDATE_F_ISROUTER)),
 		     NDISC_REDIRECT, &ndopts);
 
+	rcu_read_lock();
 	nrt = ip6_rt_cache_alloc(rt->from, &msg->dest, NULL);
+	rcu_read_unlock();
 	if (!nrt)
 		goto out;
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 4/7] net/ipv6: Move rcu locking to callers of fib6_get_cookie_safe
From: David Ahern @ 2018-04-20 22:38 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

A later patch protects 'from' in rt6_info and this simplifies the
locking needed by it.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip6_fib.h |  6 ++++--
 net/ipv6/route.c      | 13 ++++++++++---
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index dd1481ed8bdb..dc3505fb62b3 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -234,7 +234,6 @@ static inline bool fib6_get_cookie_safe(const struct fib6_info *f6i,
 	struct fib6_node *fn;
 	bool status = false;
 
-	rcu_read_lock();
 	fn = rcu_dereference(f6i->fib6_node);
 
 	if (fn) {
@@ -244,7 +243,6 @@ static inline bool fib6_get_cookie_safe(const struct fib6_info *f6i,
 		status = true;
 	}
 
-	rcu_read_unlock();
 	return status;
 }
 
@@ -252,10 +250,14 @@ static inline u32 rt6_get_cookie(const struct rt6_info *rt)
 {
 	u32 cookie = 0;
 
+	rcu_read_lock();
+
 	if (rt->rt6i_flags & RTF_PCPU ||
 	    (unlikely(!list_empty(&rt->rt6i_uncached)) && rt->from))
 		fib6_get_cookie_safe(rt->from, &cookie);
 
+	rcu_read_unlock();
+
 	return cookie;
 }
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f8b22183b7fd..810a37ac043b 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2159,9 +2159,12 @@ static struct dst_entry *rt6_dst_from_check(struct rt6_info *rt, u32 cookie)
 
 static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 {
+	struct dst_entry *dst_ret;
 	struct rt6_info *rt;
 
-	rt = (struct rt6_info *) dst;
+	rt = container_of(dst, struct rt6_info, dst);
+
+	rcu_read_lock();
 
 	/* All IPV6 dsts are created with ->obsolete set to the value
 	 * DST_OBSOLETE_FORCE_CHK which forces validation calls down
@@ -2170,9 +2173,13 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 
 	if (rt->rt6i_flags & RTF_PCPU ||
 	    (unlikely(!list_empty(&rt->rt6i_uncached)) && rt->from))
-		return rt6_dst_from_check(rt, cookie);
+		dst_ret = rt6_dst_from_check(rt, cookie);
 	else
-		return rt6_check(rt, cookie);
+		dst_ret = rt6_check(rt, cookie);
+
+	rcu_read_unlock();
+
+	return dst_ret;
 }
 
 static struct dst_entry *ip6_negative_advice(struct dst_entry *dst)
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 5/7] net/ipv6: Move release of fib6_info from pcpu routes to helper
From: David Ahern @ 2018-04-20 22:38 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

Code move only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/ip6_fib.c | 41 +++++++++++++++++++++++------------------
 1 file changed, 23 insertions(+), 18 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index f936e91a8c65..64ca02b7745f 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -860,6 +860,27 @@ static struct fib6_node *fib6_add_1(struct net *net,
 	return ln;
 }
 
+static void fib6_drop_pcpu_from(struct fib6_info *f6i,
+				const struct fib6_table *table)
+{
+	int cpu;
+
+	/* release the reference to this fib entry from
+	 * all of its cached pcpu routes
+	 */
+	for_each_possible_cpu(cpu) {
+		struct rt6_info **ppcpu_rt;
+		struct rt6_info *pcpu_rt;
+
+		ppcpu_rt = per_cpu_ptr(f6i->rt6i_pcpu, cpu);
+		pcpu_rt = *ppcpu_rt;
+		if (pcpu_rt) {
+			fib6_info_release(pcpu_rt->from);
+			pcpu_rt->from = NULL;
+		}
+	}
+}
+
 static void fib6_purge_rt(struct fib6_info *rt, struct fib6_node *fn,
 			  struct net *net)
 {
@@ -887,24 +908,8 @@ static void fib6_purge_rt(struct fib6_info *rt, struct fib6_node *fn,
 				    lockdep_is_held(&table->tb6_lock));
 		}
 
-		if (rt->rt6i_pcpu) {
-			int cpu;
-
-			/* release the reference to this fib entry from
-			 * all of its cached pcpu routes
-			 */
-			for_each_possible_cpu(cpu) {
-				struct rt6_info **ppcpu_rt;
-				struct rt6_info *pcpu_rt;
-
-				ppcpu_rt = per_cpu_ptr(rt->rt6i_pcpu, cpu);
-				pcpu_rt = *ppcpu_rt;
-				if (pcpu_rt) {
-					fib6_info_release(pcpu_rt->from);
-					pcpu_rt->from = NULL;
-				}
-			}
-		}
+		if (rt->rt6i_pcpu)
+			fib6_drop_pcpu_from(rt, table);
 	}
 }
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 6/7] net/ipv6: Make from in rt6_info rcu protected
From: David Ahern @ 2018-04-20 22:38 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

When a dst entry is created from a fib entry, the 'from' in rt6_info
is set to the fib entry. The 'from' reference is used most notably for
cookie checking - making sure stale dst entries are updated if the
fib entry is changed.

When a fib entry is deleted, the pcpu routes on it are walked releasing
the fib6_info reference. This is needed for the fib6_info cleanup to
happen and to make sure all device references are released in a timely
manner.

There is a race window when a FIB entry is deleted and the 'from' on the
pcpu route is dropped and the pcpu route hits a cookie check. Handle
this race using rcu on from.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip6_fib.h | 10 +++---
 net/ipv6/ip6_fib.c    |  8 +++--
 net/ipv6/ip6_output.c |  9 +++--
 net/ipv6/route.c      | 96 ++++++++++++++++++++++++++++++++++++---------------
 4 files changed, 88 insertions(+), 35 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index dc3505fb62b3..1af450d4e923 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -174,7 +174,7 @@ struct fib6_info {
 
 struct rt6_info {
 	struct dst_entry		dst;
-	struct fib6_info		*from;
+	struct fib6_info __rcu		*from;
 
 	struct rt6key			rt6i_dst;
 	struct rt6key			rt6i_src;
@@ -248,13 +248,15 @@ static inline bool fib6_get_cookie_safe(const struct fib6_info *f6i,
 
 static inline u32 rt6_get_cookie(const struct rt6_info *rt)
 {
+	struct fib6_info *from;
 	u32 cookie = 0;
 
 	rcu_read_lock();
 
-	if (rt->rt6i_flags & RTF_PCPU ||
-	    (unlikely(!list_empty(&rt->rt6i_uncached)) && rt->from))
-		fib6_get_cookie_safe(rt->from, &cookie);
+	from = rcu_dereference(rt->from);
+	if (from && (rt->rt6i_flags & RTF_PCPU ||
+	    unlikely(!list_empty(&rt->rt6i_uncached))))
+		fib6_get_cookie_safe(from, &cookie);
 
 	rcu_read_unlock();
 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 64ca02b7745f..6421c893466e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -875,8 +875,12 @@ static void fib6_drop_pcpu_from(struct fib6_info *f6i,
 		ppcpu_rt = per_cpu_ptr(f6i->rt6i_pcpu, cpu);
 		pcpu_rt = *ppcpu_rt;
 		if (pcpu_rt) {
-			fib6_info_release(pcpu_rt->from);
-			pcpu_rt->from = NULL;
+			struct fib6_info *from;
+
+			from = rcu_dereference_protected(pcpu_rt->from,
+					     lockdep_is_held(&table->tb6_lock));
+			rcu_assign_pointer(pcpu_rt->from, NULL);
+			fib6_info_release(from);
 		}
 	}
 }
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 3db47986ef38..6a477d54f8c7 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -962,16 +962,21 @@ static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk,
 	 * that's why we try it again later.
 	 */
 	if (ipv6_addr_any(&fl6->saddr) && (!*dst || !(*dst)->error)) {
+		struct fib6_info *from;
 		struct rt6_info *rt;
 		bool had_dst = *dst != NULL;
 
 		if (!had_dst)
 			*dst = ip6_route_output(net, sk, fl6);
 		rt = (*dst)->error ? NULL : (struct rt6_info *)*dst;
-		err = ip6_route_get_saddr(net, rt ? rt->from : NULL,
-					  &fl6->daddr,
+
+		rcu_read_lock();
+		from = rt ? rcu_dereference(rt->from) : NULL;
+		err = ip6_route_get_saddr(net, from, &fl6->daddr,
 					  sk ? inet6_sk(sk)->srcprefs : 0,
 					  &fl6->saddr);
+		rcu_read_unlock();
+
 		if (err)
 			goto out_err_release;
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 810a37ac043b..004d00fe2fe5 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -359,7 +359,7 @@ EXPORT_SYMBOL(ip6_dst_alloc);
 static void ip6_dst_destroy(struct dst_entry *dst)
 {
 	struct rt6_info *rt = (struct rt6_info *)dst;
-	struct fib6_info *from = rt->from;
+	struct fib6_info *from;
 	struct inet6_dev *idev;
 
 	dst_destroy_metrics_generic(dst);
@@ -371,8 +371,11 @@ static void ip6_dst_destroy(struct dst_entry *dst)
 		in6_dev_put(idev);
 	}
 
-	rt->from = NULL;
+	rcu_read_lock();
+	from = rcu_dereference(rt->from);
+	rcu_assign_pointer(rt->from, NULL);
 	fib6_info_release(from);
+	rcu_read_unlock();
 }
 
 static void ip6_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
@@ -402,12 +405,16 @@ static bool __rt6_check_expired(const struct rt6_info *rt)
 
 static bool rt6_check_expired(const struct rt6_info *rt)
 {
+	struct fib6_info *from;
+
+	from = rcu_dereference(rt->from);
+
 	if (rt->rt6i_flags & RTF_EXPIRES) {
 		if (time_after(jiffies, rt->dst.expires))
 			return true;
-	} else if (rt->from) {
+	} else if (from) {
 		return rt->dst.obsolete != DST_OBSOLETE_FORCE_CHK ||
-			fib6_check_expired(rt->from);
+			fib6_check_expired(from);
 	}
 	return false;
 }
@@ -963,7 +970,7 @@ static void rt6_set_from(struct rt6_info *rt, struct fib6_info *from)
 {
 	rt->rt6i_flags &= ~RTF_EXPIRES;
 	fib6_info_hold(from);
-	rt->from = from;
+	rcu_assign_pointer(rt->from, from);
 	dst_init_metrics(&rt->dst, from->fib6_metrics->metrics, true);
 	if (from->fib6_metrics != &dst_default_metrics) {
 		rt->dst._metrics |= DST_METRICS_REFCOUNTED;
@@ -2133,11 +2140,13 @@ static bool fib6_check(struct fib6_info *f6i, u32 cookie)
 	return true;
 }
 
-static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie)
+static struct dst_entry *rt6_check(struct rt6_info *rt,
+				   struct fib6_info *from,
+				   u32 cookie)
 {
 	u32 rt_cookie = 0;
 
-	if ((rt->from && !fib6_get_cookie_safe(rt->from, &rt_cookie)) ||
+	if ((from && !fib6_get_cookie_safe(from, &rt_cookie)) ||
 	    rt_cookie != cookie)
 		return NULL;
 
@@ -2147,11 +2156,13 @@ static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie)
 	return &rt->dst;
 }
 
-static struct dst_entry *rt6_dst_from_check(struct rt6_info *rt, u32 cookie)
+static struct dst_entry *rt6_dst_from_check(struct rt6_info *rt,
+					    struct fib6_info *from,
+					    u32 cookie)
 {
 	if (!__rt6_check_expired(rt) &&
 	    rt->dst.obsolete == DST_OBSOLETE_FORCE_CHK &&
-	    fib6_check(rt->from, cookie))
+	    fib6_check(from, cookie))
 		return &rt->dst;
 	else
 		return NULL;
@@ -2160,6 +2171,7 @@ static struct dst_entry *rt6_dst_from_check(struct rt6_info *rt, u32 cookie)
 static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 {
 	struct dst_entry *dst_ret;
+	struct fib6_info *from;
 	struct rt6_info *rt;
 
 	rt = container_of(dst, struct rt6_info, dst);
@@ -2171,11 +2183,13 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 	 * into this function always.
 	 */
 
-	if (rt->rt6i_flags & RTF_PCPU ||
-	    (unlikely(!list_empty(&rt->rt6i_uncached)) && rt->from))
-		dst_ret = rt6_dst_from_check(rt, cookie);
+	from = rcu_dereference(rt->from);
+
+	if (from && (rt->rt6i_flags & RTF_PCPU ||
+	    unlikely(!list_empty(&rt->rt6i_uncached))))
+		dst_ret = rt6_dst_from_check(rt, from, cookie);
 	else
-		dst_ret = rt6_check(rt, cookie);
+		dst_ret = rt6_check(rt, from, cookie);
 
 	rcu_read_unlock();
 
@@ -2211,13 +2225,17 @@ static void ip6_link_failure(struct sk_buff *skb)
 		if (rt->rt6i_flags & RTF_CACHE) {
 			if (dst_hold_safe(&rt->dst))
 				rt6_remove_exception_rt(rt);
-		} else if (rt->from) {
+		} else {
+			struct fib6_info *from;
 			struct fib6_node *fn;
 
 			rcu_read_lock();
-			fn = rcu_dereference(rt->from->fib6_node);
-			if (fn && (rt->rt6i_flags & RTF_DEFAULT))
-				fn->fn_sernum = -1;
+			from = rcu_dereference(rt->from);
+			if (from) {
+				fn = rcu_dereference(from->fib6_node);
+				if (fn && (rt->rt6i_flags & RTF_DEFAULT))
+					fn->fn_sernum = -1;
+			}
 			rcu_read_unlock();
 		}
 	}
@@ -2225,8 +2243,15 @@ static void ip6_link_failure(struct sk_buff *skb)
 
 static void rt6_update_expires(struct rt6_info *rt0, int timeout)
 {
-	if (!(rt0->rt6i_flags & RTF_EXPIRES) && rt0->from)
-		rt0->dst.expires = rt0->from->expires;
+	if (!(rt0->rt6i_flags & RTF_EXPIRES)) {
+		struct fib6_info *from;
+
+		rcu_read_lock();
+		from = rcu_dereference(rt0->from);
+		if (from)
+			rt0->dst.expires = from->expires;
+		rcu_read_unlock();
+	}
 
 	dst_set_expires(&rt0->dst, timeout);
 	rt0->rt6i_flags |= RTF_EXPIRES;
@@ -2243,8 +2268,14 @@ static void rt6_do_update_pmtu(struct rt6_info *rt, u32 mtu)
 
 static bool rt6_cache_allowed_for_pmtu(const struct rt6_info *rt)
 {
+	bool from_set;
+
+	rcu_read_lock();
+	from_set = !!rcu_dereference(rt->from);
+	rcu_read_unlock();
+
 	return !(rt->rt6i_flags & RTF_CACHE) &&
-		(rt->rt6i_flags & RTF_PCPU || rt->from);
+		(rt->rt6i_flags & RTF_PCPU || from_set);
 }
 
 static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
@@ -2280,16 +2311,18 @@ static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
 		if (rt6->rt6i_flags & RTF_CACHE)
 			rt6_update_exception_stamp_rt(rt6);
 	} else if (daddr) {
+		struct fib6_info *from;
 		struct rt6_info *nrt6;
 
 		rcu_read_lock();
-		nrt6 = ip6_rt_cache_alloc(rt6->from, daddr, saddr);
-		rcu_read_unlock();
+		from = rcu_dereference(rt6->from);
+		nrt6 = ip6_rt_cache_alloc(from, daddr, saddr);
 		if (nrt6) {
 			rt6_do_update_pmtu(nrt6, mtu);
-			if (rt6_insert_exception(nrt6, rt6->from))
+			if (rt6_insert_exception(nrt6, from))
 				dst_release_immediate(&nrt6->dst);
 		}
+		rcu_read_unlock();
 	}
 }
 
@@ -3222,6 +3255,7 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
 	struct ndisc_options ndopts;
 	struct inet6_dev *in6_dev;
 	struct neighbour *neigh;
+	struct fib6_info *from;
 	struct rd_msg *msg;
 	int optlen, on_link;
 	u8 *lladdr;
@@ -3304,7 +3338,8 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
 		     NDISC_REDIRECT, &ndopts);
 
 	rcu_read_lock();
-	nrt = ip6_rt_cache_alloc(rt->from, &msg->dest, NULL);
+	from = rcu_dereference(rt->from);
+	nrt = ip6_rt_cache_alloc(from, &msg->dest, NULL);
 	rcu_read_unlock();
 	if (!nrt)
 		goto out;
@@ -4687,6 +4722,7 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 	struct net *net = sock_net(in_skb->sk);
 	struct nlattr *tb[RTA_MAX+1];
 	int err, iif = 0, oif = 0;
+	struct fib6_info *from;
 	struct dst_entry *dst;
 	struct rt6_info *rt;
 	struct sk_buff *skb;
@@ -4783,15 +4819,21 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 	}
 
 	skb_dst_set(skb, &rt->dst);
+
+	rcu_read_lock();
+	from = rcu_dereference(rt->from);
+
 	if (fibmatch)
-		err = rt6_fill_node(net, skb, rt->from, NULL, NULL, NULL, iif,
+		err = rt6_fill_node(net, skb, from, NULL, NULL, NULL, iif,
 				    RTM_NEWROUTE, NETLINK_CB(in_skb).portid,
 				    nlh->nlmsg_seq, 0);
 	else
-		err = rt6_fill_node(net, skb, rt->from, dst,
-				    &fl6.daddr, &fl6.saddr, iif, RTM_NEWROUTE,
+		err = rt6_fill_node(net, skb, from, dst, &fl6.daddr,
+				    &fl6.saddr, iif, RTM_NEWROUTE,
 				    NETLINK_CB(in_skb).portid, nlh->nlmsg_seq,
 				    0);
+	rcu_read_unlock();
+
 	if (err < 0) {
 		kfree_skb(skb);
 		goto errout;
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next 7/7] net/ipv6: Remove unncessary check on f6i in fib6_check
From: David Ahern @ 2018-04-20 22:38 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180420223803.15743-1-dsahern@gmail.com>

Dan reported an imbalance in fib6_check on use of f6i and checking
whether it is null. Since fib6_check is only called if f6i is non-null,
remove the unnecessary check.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/route.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 004d00fe2fe5..0407bbc5a028 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2130,8 +2130,7 @@ static bool fib6_check(struct fib6_info *f6i, u32 cookie)
 {
 	u32 rt_cookie = 0;
 
-	if ((f6i && !fib6_get_cookie_safe(f6i, &rt_cookie)) ||
-	     rt_cookie != cookie)
+	if (!fib6_get_cookie_safe(f6i, &rt_cookie) || rt_cookie != cookie)
 		return false;
 
 	if (fib6_check_expired(f6i))
-- 
2.11.0

^ permalink raw reply related

* Re: Representative Needed.
From: PPMC OFFSHORE @ 2018-04-20 22:23 UTC (permalink / raw)
  To: Recipients

Good day,

  I am seeking your concept with great gratitude to present you as a representative to carry out business transactions with a reasonable share upon your interest and cooperation to work with us in trust. If interested please get back.

Regards
Kingsley

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

^ permalink raw reply

* Re: Representative Needed.
From: PPMC OFFSHORE @ 2018-04-20 22:25 UTC (permalink / raw)
  To: Recipients

Good day,

  I am seeking your concept with great gratitude to present you as a representative to carry out business transactions with a reasonable share upon your interest and cooperation to work with us in trust. If interested please get back.

Regards
Kingsley

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

^ permalink raw reply

* Re: [PATCH] iptables: Per-net ns lock
From: Andrei Vagin @ 2018-04-20 23:06 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: fw, netdev, pablo, rstoyanov1, ptikhomirov
In-Reply-To: <152423174378.4473.8708420767261754117.stgit@localhost.localdomain>

On Fri, Apr 20, 2018 at 04:42:47PM +0300, Kirill Tkhai wrote:
> Containers want to restore their own net ns,
> while they may have no their own mnt ns.
> This case they share host's /run/xtables.lock
> file, but they may not have permission to open
> it.
> 
> Patch makes /run/xtables.lock to be per-namespace,
> i.e., to refer to the caller task's net ns.
>
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> ---
>  iptables/xshared.c |    7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/iptables/xshared.c b/iptables/xshared.c
> index 06db72d4..b6dbe4e7 100644
> --- a/iptables/xshared.c
> +++ b/iptables/xshared.c
> @@ -254,7 +254,12 @@ static int xtables_lock(int wait, struct timeval *wait_interval)
>  	time_left.tv_sec = wait;
>  	time_left.tv_usec = 0;
>  
> -	fd = open(XT_LOCK_NAME, O_CREAT, 0600);
> +	if (symlink("/proc/self/ns/net", XT_LOCK_NAME) != 0 &&

Any user can open this file and take the lock. Before this patch, the
lock file could be opened only by the root user. It means that any user
will be able to block all iptables operations. Do I miss something?

[root@fc24 ~]# ln -s /proc/self/ns/net /run/xtables.lock2
[root@fc24 ~]# ls -l /run/xtables.lock2
lrwxrwxrwx 1 root root 17 Apr 21 01:52 /run/xtables.lock2 ->
/proc/self/ns/net
[root@fc24 ~]# ls -l /proc/self/ns/net 
lrwxrwxrwx 1 root root 0 Apr 21 01:52 /proc/self/ns/net ->
net:[4026531993]

Thanks,
Andrei

> +	    errno != EEXIST) {
> +		fprintf(stderr, "Fatal: can't create lock file\n");

		fprintf(stderr, "Fatal: can't create lock file %s: %s\n",
			XT_LOCK_NAME, strerror(errno));

> +		return XT_LOCK_FAILED;
> +	}
> +	fd = open(XT_LOCK_NAME, O_RDONLY);
>  	if (fd < 0) {
>  		fprintf(stderr, "Fatal: can't open lock file %s: %s\n",
>  			XT_LOCK_NAME, strerror(errno));
> 

^ permalink raw reply

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
From: Martin KaFai Lau @ 2018-04-20 23:13 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, Tom Herbert, Eric Dumazet, Nikita Shirokov, kernel-team
In-Reply-To: <alpine.LFD.2.20.1804202205290.2906@ja.home.ssi.bg>

On Fri, Apr 20, 2018 at 11:43:59PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 19 Apr 2018, Martin KaFai Lau wrote:
> 
> > This patch is not a proper fix and mainly serves for discussion purpose.
> > It is based on net-next which I have been using to debug the issue.
> > 
> > The change that works around the issue is in ensure_mtu_is_adequate().
> > Other changes are the rippling effect in function arg.
> > 
> > This bug was uncovered by one of our legacy service that
> > are still using ipvs for load balancing.  In that setup,
> > the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> > before tx it out to eth0.
> > 
> > The problem is the kernel stack could pass a skb (which was
> > originated from a sys_write(tcp_fd)) to the driver with skb->len
> > bigger than the device MTU.  In one NIC setup (with gso and tso off)
> > that we are using, it upset the NIC/driver and caused the tx queue
> > stalled for tens of seconds which is how it got uncovered.
> > (On the NIC side, the NIC firmware and driver have been fixed
> > to avoid this tx queue stall after seeing this skb).
> > 
> > On the kernel side, based on the commit log, this bug should have
> > been exposed after commit 815d22e55b0e ("ip6ip6: Support for GSO/GRO").
> 
> 	Before I go to deeply analyze the GSO code, here
> are some preliminary observations:
> 
> - IPVS started to use iptunnel_handle_offloads() around 2014,
> commit ea1d5d7755a3
> 
> - later (2016) the "ip6ip6: Support for GSO/GRO" commits started
> to use skb_set_inner_ipproto(). But it is missing in the IPVS code.
> I'm not sure if such call can help. At least, I don't see what
> different happens in IPVS compared to ip6ip6_tnl_xmit(), for example.
I don't see skb->inner_ipproto used in
ipv6_gso_segment() -> ipv6_gso_segment() -> tcp6_gso_segment()

One thing may worth to highlight is that this problem sort of exist
even before commit 815d22e55b0e, it just happened that the ipv6_gso_segment()
error out (-EPROTONOSUPPORT) which then stops sending it down to the driver.

> 
> > Before commit 815d22e55b0e, ipv6_gso_segment() would just error
> > out (-EPROTONOSUPPORT) because the tx-ing packet is an ip6ip6.
> > Due to this error out, it avoid passing it to the driver.  The TCP
> > stack then timeout and the TCP mtu probing eventually kicked in to
> > lower the skb->len enough to avoid gso_segment.
> > 
> > After commit 815d22e55b0e, ipv6_gso_segment() -> ipv6_gso_segment()
> > -> tcp6_gso_segment() which segment the packet based on a mss
> > that does not account for the extra IPv6 hdr.
> > 
> > Here is a stack from the WARN_ON() that we added to the driver to
> > capture the issue:
> > [ 1128.611875] WARNING: CPU: 40 PID: 31495 at drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:424 mlx5e_xmit+0x814
> > ...
> > [ 1129.016536] Call Trace:
> > [ 1129.021412]  ? skb_release_data+0xfc/0x120
> > [ 1129.029587]  ? kfree_skbmem+0x64/0x70
> > [ 1129.036905]  dev_hard_start_xmit+0xa4/0x200
> > [ 1129.045262]  sch_direct_xmit+0x10f/0x280
> > [ 1129.053111]  __qdisc_run+0x223/0x5a0
> > [ 1129.060251]  __dev_queue_xmit+0x245/0x7d0
> > [ 1129.068268]  dev_queue_xmit+0x10/0x20
> > [ 1129.075573]  ? dev_queue_xmit+0x10/0x20
> > [ 1129.083218]  ip6_finish_output2+0x2db/0x490
> > [ 1129.091573]  ip6_finish_output+0x125/0x190
> > [ 1129.099754]  ip6_output+0x5f/0x100
> > [ 1129.106548]  ? ip6_fragment+0x9f0/0x9f0
> > [ 1129.114212]  ip6_local_out+0x35/0x40
> > [ 1129.121356]  ip_vs_tunnel_xmit_v6+0x267/0x290 [ip_vs]
> > [ 1129.131443]  ip_vs_in.part.24+0x302/0x710 [ip_vs]
> > [ 1129.140837]  ? ip_vs_in.part.24+0x302/0x710 [ip_vs]
> > [ 1129.150578]  ? ip_vs_conn_out_get+0x17/0x140 [ip_vs]
> > [ 1129.160493]  ? ip_vs_conn_out_get_proto+0x25/0x30 [ip_vs]
> > [ 1129.171273]  ip_vs_in+0x43/0x130 [ip_vs]
> > [ 1129.179109]  ip_vs_local_request6+0x26/0x30 [ip_vs]
> > [ 1129.188849]  nf_hook_slow+0x3e/0xc0
> > [ 1129.195800]  ip6_xmit+0x30b/0x540
> > [ 1129.202421]  ? ac6_proc_exit+0x20/0x20
> > [ 1129.209909]  inet6_csk_xmit+0x82/0xd0
> > [ 1129.217207]  ? lock_timer_base+0x76/0xa0
> > [ 1129.225043]  tcp_transmit_skb+0x56f/0xa40
> > [ 1129.233051]  tcp_write_xmit+0x2b2/0x11b0
> > [ 1129.240885]  __tcp_push_pending_frames+0x33/0xa0
> > [ 1129.250106]  tcp_push+0xde/0x100
> > [ 1129.256554]  tcp_sendmsg_locked+0x9ca/0xca0
> > [ 1129.264910]  tcp_sendmsg+0x2c/0x50
> > [ 1129.271703]  inet_sendmsg+0x31/0xb0
> > [ 1129.278672]  sock_write_iter+0xf8/0x110
> > [ 1129.286335]  new_sync_write+0xd9/0x120
> > [ 1129.293823]  vfs_write+0x18d/0x1e0
> > [ 1129.300614]  SyS_write+0x48/0xa0
> > [ 1129.307045]  do_syscall_64+0x69/0x1e0
> > [ 1129.314361]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > ...
> > [ 1129.648183] ---[ end trace 635061c9c300799e ]---
> > [ 1129.657407] skb->len:1554 MTU:1522
> > 
> > The tcp flow is connecting from the address ending ':27:0' to the ':85'.
> > 
> > [host-a] > ip -6 r show table local
> > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 advmss 1440 pref medium
> > 
> > [host-a] > ip -6 a
> > 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
> >     inet6 2401:db00:1011:1f01:face:b00c:0:85/128 scope global
> >        valid_lft forever preferred_lft forever
> > 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
> >     inet6 2401:db00:1011:10af:face:0:27:0/64 scope global
> >        valid_lft forever preferred_lft forever
> > 
> > [host-a] > cat /proc/net/ip_vs
> > TCP  [2401:db00:1011:1f01:face:b00c:0000:0085]:01BB rr
> >   -> [2401:db00:1011:10cc:face:0000:0091:0000]:01BB      Tunnel  6772   9          6
> >   -> [2401:db00:1011:10d8:face:0000:0091:0000]:01BB      Tunnel  6772   8          6
> >   -> [2401:db00:1011:10d2:face:0000:0091:0000]:01BB      Tunnel  6772   19         7
> > 
> > [host-a] > openssl s_client -connect [2401:db00:1011:1f01:face:b00c:0:85]:443
> > send-something-long-here-to-trigger-the-bug
> > 
> > Changing the local route mtu to 1460 to account for the extra ipv6 tunnel header
> > can also side step the issue.  Like this:
> 
> 	Yes, reducing the MTU was a solution recommended from
> long time ago in the IPVS HOWTO.
> 
> > 
> > > ip -6 r show table local
> > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 mtu 1460 advmss 1440 pref medium
> > 
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  net/netfilter/ipvs/ip_vs_xmit.c | 49 +++++++++++++++++++++++++++--------------
> >  1 file changed, 33 insertions(+), 16 deletions(-)
> > 
> > diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> > index 11c416f3d6e3..88cc0d53ebce 100644
> > --- a/net/netfilter/ipvs/ip_vs_xmit.c
> > +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> > @@ -212,13 +212,15 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
> >  		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
> >  }
> >  
> > -static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> > +static inline bool ensure_mtu_is_adequate(struct ip_vs_conn *cp,
> >  					  int rt_mode,
> >  					  struct ip_vs_iphdr *ipvsh,
> >  					  struct sk_buff *skb, int mtu)
> >  {
> > +	struct netns_ipvs *ipvs = cp->ipvs;
> > +
> >  #ifdef CONFIG_IP_VS_IPV6
> > -	if (skb_af == AF_INET6) {
> > +	if (cp->af == AF_INET6) {
> >  		struct net *net = ipvs->net;
> >  
> >  		if (unlikely(__mtu_check_toobig_v6(skb, mtu))) {
> > @@ -251,6 +253,17 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> >  		}
> >  	}
> >  
> > +	if (skb_shinfo(skb)->gso_size && cp->protocol == IPPROTO_TCP) {
> > +		const struct tcphdr *th = (struct tcphdr *)skb_transport_header(skb);
> > +		unsigned short hdr_len = (th->doff << 2) +
> > +			skb_network_header_len(skb);
> > +
> > +		if (mtu > hdr_len && mtu - hdr_len < skb_shinfo(skb)->gso_size)
> > +			skb_decrease_gso_size(skb_shinfo(skb),
> > +					      skb_shinfo(skb)->gso_size -
> > +					      (mtu - hdr_len));
> 
> 	So, software segmentation happens and we want the
> tunnel header to be accounted immediately and not after PMTU
> probing period?
I think accounting tunnel header correctly early on is the expected
behavior.  Otherwise, if we choose to drop the big skb instead (either
gso_segment error out, driver error out or NIC drops it from tx queue),
the first few packets of a TCP connection will guarantee to
experience timeout.

> Is this a problem only for IPVS tunnels?
> Do we see such delays with other tunnels?
Based on what I understand, a tun dev (setup by iproute2) should
not experience this issue because the MTU of that tun dev has
already accounted for the extra header.

Our current work around using a smaller MTU in the local route has
similar effect (in the sense that introducing a smaller MTU betewen
the phyical eth0 and the TCP stack).

Thanks for looking into it!

Martin

> May be this should
> be solved for all protocols (not just TCP) and for all tunnels?
> Looking at ip6_xmit, on GSO we do not return -EMSGSIZE to
> local sender, so we should really alter the gso_size for proper
> segmentation?
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox