Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next v2 1/3] net: phy: xgmiitorgmii: Check phy_driver ready before accessing
From: Florian Fainelli @ 2018-06-26 21:59 UTC (permalink / raw)
  To: Brandon Maier, netdev
  Cc: andrew, davem, michal.simek, clayton.shotwell, kristopher.cory,
	linux-kernel
In-Reply-To: <20180626175050.71165-1-brandon.maier@rockwellcollins.com>

On 06/26/2018 10:50 AM, Brandon Maier wrote:
> Since a phy_device is added to the global mdio_bus list during
> phy_device_register(), but a phy_device's phy_driver doesn't get
> attached until phy_probe(). It's possible of_phy_find_device() in
> xgmiitorgmii will return a valid phy with a NULL phy_driver. Leading to
> a NULL pointer access during the memcpy().
> 
> Fixes this Oops:
> 
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> pgd = c0004000
> [00000000] *pgd=00000000
> Internal error: Oops: 5 [#1] PREEMPT SMP ARM
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.40 #1
> Hardware name: Xilinx Zynq Platform
> task: ce4c8d00 task.stack: ce4ca000
> PC is at memcpy+0x48/0x330
> LR is at xgmiitorgmii_probe+0x90/0xe8
> pc : [<c074bc68>]    lr : [<c0529548>]    psr: 20000013
> sp : ce4cbb54  ip : 00000000  fp : ce4cbb8c
> r10: 00000000  r9 : 00000000  r8 : c0c49178
> r7 : 00000000  r6 : cdc14718  r5 : ce762800  r4 : cdc14710
> r3 : 00000000  r2 : 00000054  r1 : 00000000  r0 : cdc14718
> Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> Control: 18c5387d  Table: 0000404a  DAC: 00000051
> Process swapper/0 (pid: 1, stack limit = 0xce4ca210)
> ...
> [<c074bc68>] (memcpy) from [<c0529548>] (xgmiitorgmii_probe+0x90/0xe8)
> [<c0529548>] (xgmiitorgmii_probe) from [<c0526a94>] (mdio_probe+0x28/0x34)
> [<c0526a94>] (mdio_probe) from [<c04db98c>] (driver_probe_device+0x254/0x414)
> [<c04db98c>] (driver_probe_device) from [<c04dbd58>] (__device_attach_driver+0xac/0x10c)
> [<c04dbd58>] (__device_attach_driver) from [<c04d96f4>] (bus_for_each_drv+0x84/0xc8)
> [<c04d96f4>] (bus_for_each_drv) from [<c04db5bc>] (__device_attach+0xd0/0x134)
> [<c04db5bc>] (__device_attach) from [<c04dbdd4>] (device_initial_probe+0x1c/0x20)
> [<c04dbdd4>] (device_initial_probe) from [<c04da8fc>] (bus_probe_device+0x98/0xa0)
> [<c04da8fc>] (bus_probe_device) from [<c04d8660>] (device_add+0x43c/0x5d0)
> [<c04d8660>] (device_add) from [<c0526cb8>] (mdio_device_register+0x34/0x80)
> [<c0526cb8>] (mdio_device_register) from [<c0580b48>] (of_mdiobus_register+0x170/0x30c)
> [<c0580b48>] (of_mdiobus_register) from [<c05349c4>] (macb_probe+0x710/0xc00)
> [<c05349c4>] (macb_probe) from [<c04dd700>] (platform_drv_probe+0x44/0x80)
> [<c04dd700>] (platform_drv_probe) from [<c04db98c>] (driver_probe_device+0x254/0x414)
> [<c04db98c>] (driver_probe_device) from [<c04dbc58>] (__driver_attach+0x10c/0x118)
> [<c04dbc58>] (__driver_attach) from [<c04d9600>] (bus_for_each_dev+0x8c/0xd0)
> [<c04d9600>] (bus_for_each_dev) from [<c04db1fc>] (driver_attach+0x2c/0x30)
> [<c04db1fc>] (driver_attach) from [<c04daa98>] (bus_add_driver+0x50/0x260)
> [<c04daa98>] (bus_add_driver) from [<c04dc440>] (driver_register+0x88/0x108)
> [<c04dc440>] (driver_register) from [<c04dd6b4>] (__platform_driver_register+0x50/0x58)
> [<c04dd6b4>] (__platform_driver_register) from [<c0b31248>] (macb_driver_init+0x24/0x28)
> [<c0b31248>] (macb_driver_init) from [<c010203c>] (do_one_initcall+0x60/0x1a4)
> [<c010203c>] (do_one_initcall) from [<c0b00f78>] (kernel_init_freeable+0x15c/0x1f8)
> [<c0b00f78>] (kernel_init_freeable) from [<c0763d10>] (kernel_init+0x18/0x124)
> [<c0763d10>] (kernel_init) from [<c0112d74>] (ret_from_fork+0x14/0x20)
> Code: ba000002 f5d1f03c f5d1f05c f5d1f07c (e8b151f8)
> ---[ end trace 3e4ec21905820a1f ]---
> 
> Signed-off-by: Brandon Maier <brandon.maier@rockwellcollins.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Flavio Leitner @ 2018-06-26 22:03 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, Linux Kernel Network Developers, Paolo Abeni,
	David Miller, Florian Westphal, NetFilter
In-Reply-To: <CAM_iQpU0jp=wvutGbSLzVYX5qQeW0W8ARvR=-gp4MYJqWeee_A@mail.gmail.com>

On Tue, Jun 26, 2018 at 02:48:47PM -0700, Cong Wang wrote:
> On Mon, Jun 25, 2018 at 11:41 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > When a packet is attached to a socket, we should keep the association as much as possible.
> 
> As much as possible within one stack, I agree. I still don't understand
> why we should keep it across the stack boundary.
> 
> > Only when a new association needs to be done, skb_orphan() needs to be called.
> >
> > Doing this skb_orphan() too soon breaks back pressure in general, this is bad, since a socket
> > can evades SO_SNDBUF limits.
> 
> Right before leaving the stack is not too soon, it is the latest
> actually, for veth case.

Depends on how you view things - it's the same host/stack sharing the
same resources, so why should we not keep it?

-- 
Flavio

^ permalink raw reply

* Re: [PATCH bpf-next 5/7] nfp: bpf: support u16 and u32 multiplications
From: Song Liu @ 2018-06-26 22:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking,
	Jiong Wang
In-Reply-To: <20180625035421.2991-6-jakub.kicinski@netronome.com>

On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> From: Jiong Wang <jiong.wang@netronome.com>
>
> NFP supports u16 and u32 multiplication. Multiplication is done 8-bits per
> step, therefore we need 2 steps for u16 and 4 steps for u32.
>
> We also need one start instruction to initialize the sequence and one or
> two instructions to fetch the result depending on either you need the high
> halve of u32 multiplication.
>
> For ALU64, if either operand is beyond u32's value range, we reject it. One
> thing to note, if the source operand is BPF_K, then we need to check "imm"
> field directly, and we'd reject it if it is negative.  Because for ALU64,
> "imm" (with s32 type) is expected to be sign extended to s64 which NFP mul
> doesn't support. For ALU32, it is fine for "imm" be negative though,
> because the result is 32-bits and here is no difference on the low halve
> of result for signed/unsigned mul, so we will get correct result.
>
> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  drivers/net/ethernet/netronome/nfp/bpf/jit.c  | 137 ++++++++++++++++++
>  drivers/net/ethernet/netronome/nfp/bpf/main.h |   5 +
>  .../net/ethernet/netronome/nfp/bpf/verifier.c |  58 ++++++--
>  drivers/net/ethernet/netronome/nfp/nfp_asm.h  |  28 ++++
>  4 files changed, 217 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> index 4a629e9b5c0f..7d7061d93358 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> @@ -415,6 +415,60 @@ emit_alu(struct nfp_prog *nfp_prog, swreg dst,
>                    reg.dst_lmextn, reg.src_lmextn);
>  }
>
> +static void
> +__emit_mul(struct nfp_prog *nfp_prog, enum alu_dst_ab dst_ab, u16 areg,
> +          enum mul_type type, enum mul_step step, u16 breg, bool swap,
> +          bool wr_both, bool dst_lmextn, bool src_lmextn)
> +{
> +       u64 insn;
> +
> +       insn = OP_MUL_BASE |
> +               FIELD_PREP(OP_MUL_A_SRC, areg) |
> +               FIELD_PREP(OP_MUL_B_SRC, breg) |
> +               FIELD_PREP(OP_MUL_STEP, step) |
> +               FIELD_PREP(OP_MUL_DST_AB, dst_ab) |
> +               FIELD_PREP(OP_MUL_SW, swap) |
> +               FIELD_PREP(OP_MUL_TYPE, type) |
> +               FIELD_PREP(OP_MUL_WR_AB, wr_both) |
> +               FIELD_PREP(OP_MUL_SRC_LMEXTN, src_lmextn) |
> +               FIELD_PREP(OP_MUL_DST_LMEXTN, dst_lmextn);
> +
> +       nfp_prog_push(nfp_prog, insn);
> +}
> +
> +static void
> +emit_mul(struct nfp_prog *nfp_prog, swreg lreg, enum mul_type type,
> +        enum mul_step step, swreg rreg)
> +{
> +       struct nfp_insn_ur_regs reg;
> +       u16 areg;
> +       int err;
> +
> +       if (type == MUL_TYPE_START && step != MUL_STEP_NONE) {
> +               nfp_prog->error = -EINVAL;
> +               return;
> +       }
> +
> +       if (step == MUL_LAST || step == MUL_LAST_2) {
> +               /* When type is step and step Number is LAST or LAST2, left
> +                * source is used as destination.
> +                */
> +               err = swreg_to_unrestricted(lreg, reg_none(), rreg, &reg);
> +               areg = reg.dst;
> +       } else {
> +               err = swreg_to_unrestricted(reg_none(), lreg, rreg, &reg);
> +               areg = reg.areg;
> +       }
> +
> +       if (err) {
> +               nfp_prog->error = err;
> +               return;
> +       }
> +
> +       __emit_mul(nfp_prog, reg.dst_ab, areg, type, step, reg.breg, reg.swap,
> +                  reg.wr_both, reg.dst_lmextn, reg.src_lmextn);
> +}
> +
>  static void
>  __emit_ld_field(struct nfp_prog *nfp_prog, enum shf_sc sc,
>                 u8 areg, u8 bmask, u8 breg, u8 shift, bool imm8,
> @@ -1380,6 +1434,65 @@ static void wrp_end32(struct nfp_prog *nfp_prog, swreg reg_in, u8 gpr_out)
>                       SHF_SC_R_ROT, 16);
>  }
>
> +static void
> +wrp_mul_u32(struct nfp_prog *nfp_prog, swreg dst_hi, swreg dst_lo, swreg lreg,
> +           swreg rreg, bool gen_high_half)
> +{
> +       emit_mul(nfp_prog, lreg, MUL_TYPE_START, MUL_STEP_NONE, rreg);
> +       emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_1, rreg);
> +       emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_2, rreg);
> +       emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_3, rreg);
> +       emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_32x32, MUL_STEP_4, rreg);
> +       emit_mul(nfp_prog, dst_lo, MUL_TYPE_STEP_32x32, MUL_LAST, reg_none());
> +       if (gen_high_half)
> +               emit_mul(nfp_prog, dst_hi, MUL_TYPE_STEP_32x32, MUL_LAST_2,
> +                        reg_none());
> +       else
> +               wrp_immed(nfp_prog, dst_hi, 0);
> +}
> +
> +static void
> +wrp_mul_u16(struct nfp_prog *nfp_prog, swreg dst_hi, swreg dst_lo, swreg lreg,
> +           swreg rreg)
> +{
> +       emit_mul(nfp_prog, lreg, MUL_TYPE_START, MUL_STEP_NONE, rreg);
> +       emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_16x16, MUL_STEP_1, rreg);
> +       emit_mul(nfp_prog, lreg, MUL_TYPE_STEP_16x16, MUL_STEP_2, rreg);
> +       emit_mul(nfp_prog, dst_lo, MUL_TYPE_STEP_16x16, MUL_LAST, reg_none());
> +}
> +
> +static int
> +wrp_mul(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
> +       bool gen_high_half, bool ropnd_from_reg)
> +{
> +       swreg multiplier, multiplicand, dst_hi, dst_lo;
> +       const struct bpf_insn *insn = &meta->insn;
> +       u32 lopnd_max, ropnd_max;
> +       u8 dst_reg;
> +
> +       dst_reg = insn->dst_reg;
> +       multiplicand = reg_a(dst_reg * 2);
> +       dst_hi = reg_both(dst_reg * 2 + 1);
> +       dst_lo = reg_both(dst_reg * 2);
> +       lopnd_max = meta->umax_dst;
> +       if (ropnd_from_reg) {
> +               multiplier = reg_b(insn->src_reg * 2);
> +               ropnd_max = meta->umax_src;
> +       } else {
> +               u32 imm = insn->imm;
> +
> +               multiplier = re_load_imm_any(nfp_prog, imm, imm_b(nfp_prog));
> +               ropnd_max = imm;
> +       }
> +       if (lopnd_max > U16_MAX || ropnd_max > U16_MAX)
> +               wrp_mul_u32(nfp_prog, dst_hi, dst_lo, multiplicand, multiplier,
> +                           gen_high_half);
> +       else
> +               wrp_mul_u16(nfp_prog, dst_hi, dst_lo, multiplicand, multiplier);
> +
> +       return 0;
> +}
> +
>  static int adjust_head(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
>  {
>         swreg tmp = imm_a(nfp_prog), tmp_len = imm_b(nfp_prog);
> @@ -1684,6 +1797,16 @@ static int sub_imm64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
>         return 0;
>  }
>
> +static int mul_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> +       return wrp_mul(nfp_prog, meta, true, true);
> +}
> +
> +static int mul_imm64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> +       return wrp_mul(nfp_prog, meta, true, false);
> +}
> +
>  static int neg_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
>  {
>         const struct bpf_insn *insn = &meta->insn;
> @@ -2097,6 +2220,16 @@ static int sub_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
>         return wrp_alu32_imm(nfp_prog, meta, ALU_OP_SUB, !meta->insn.imm);
>  }
>
> +static int mul_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> +       return wrp_mul(nfp_prog, meta, false, true);
> +}
> +
> +static int mul_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> +       return wrp_mul(nfp_prog, meta, false, false);
> +}
> +
>  static int neg_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
>  {
>         u8 dst = meta->insn.dst_reg * 2;
> @@ -2848,6 +2981,8 @@ static const instr_cb_t instr_cb[256] = {
>         [BPF_ALU64 | BPF_ADD | BPF_K] = add_imm64,
>         [BPF_ALU64 | BPF_SUB | BPF_X] = sub_reg64,
>         [BPF_ALU64 | BPF_SUB | BPF_K] = sub_imm64,
> +       [BPF_ALU64 | BPF_MUL | BPF_X] = mul_reg64,
> +       [BPF_ALU64 | BPF_MUL | BPF_K] = mul_imm64,
>         [BPF_ALU64 | BPF_NEG] =         neg_reg64,
>         [BPF_ALU64 | BPF_LSH | BPF_X] = shl_reg64,
>         [BPF_ALU64 | BPF_LSH | BPF_K] = shl_imm64,
> @@ -2867,6 +3002,8 @@ static const instr_cb_t instr_cb[256] = {
>         [BPF_ALU | BPF_ADD | BPF_K] =   add_imm,
>         [BPF_ALU | BPF_SUB | BPF_X] =   sub_reg,
>         [BPF_ALU | BPF_SUB | BPF_K] =   sub_imm,
> +       [BPF_ALU | BPF_MUL | BPF_X] =   mul_reg,
> +       [BPF_ALU | BPF_MUL | BPF_K] =   mul_imm,
>         [BPF_ALU | BPF_NEG] =           neg_reg,
>         [BPF_ALU | BPF_LSH | BPF_K] =   shl_imm,
>         [BPF_ALU | BPF_END | BPF_X] =   end_reg32,
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
> index c985d0ac61a3..c10079b1a312 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
> @@ -394,6 +394,11 @@ static inline bool is_mbpf_xadd(const struct nfp_insn_meta *meta)
>         return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_STX | BPF_XADD);
>  }
>
> +static inline bool is_mbpf_mul(const struct nfp_insn_meta *meta)
> +{
> +       return is_mbpf_alu(meta) && mbpf_op(meta) == BPF_MUL;
> +}
> +
>  /**
>   * struct nfp_prog - nfp BPF program
>   * @bpf: backpointer to the bpf app priv structure
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> index 7bd9666bd8ff..30d4f1580693 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> @@ -516,6 +516,51 @@ nfp_bpf_check_xadd(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
>         return nfp_bpf_check_ptr(nfp_prog, meta, env, meta->insn.dst_reg);
>  }
>
> +static int
> +nfp_bpf_check_alu(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
> +                 struct bpf_verifier_env *env)
> +{
> +       const struct bpf_reg_state *sreg =
> +               cur_regs(env) + meta->insn.src_reg;
> +       const struct bpf_reg_state *dreg =
> +               cur_regs(env) + meta->insn.dst_reg;
> +
> +       meta->umin_src = min(meta->umin_src, sreg->umin_value);
> +       meta->umax_src = max(meta->umax_src, sreg->umax_value);
> +       meta->umin_dst = min(meta->umin_dst, dreg->umin_value);
> +       meta->umax_dst = max(meta->umax_dst, dreg->umax_value);
> +
> +       /* NFP supports u16 and u32 multiplication.
> +        *
> +        * For ALU64, if either operand is beyond u32's value range, we reject
> +        * it. One thing to note, if the source operand is BPF_K, then we need
> +        * to check "imm" field directly, and we'd reject it if it is negative.
> +        * Because for ALU64, "imm" (with s32 type) is expected to be sign
> +        * extended to s64 which NFP mul doesn't support.
> +        *
> +        * For ALU32, it is fine for "imm" be negative though, because the
> +        * result is 32-bits and there is no difference on the low halve of
> +        * the result for signed/unsigned mul, so we will get correct result.
> +        */
> +       if (is_mbpf_mul(meta)) {
> +               if (meta->umax_dst > U32_MAX) {
> +                       pr_vlog(env, "multiplier is not within u32 value range\n");
> +                       return -EINVAL;
> +               }
> +               if (mbpf_src(meta) == BPF_X && meta->umax_src > U32_MAX) {
> +                       pr_vlog(env, "multiplicand is not within u32 value range\n");
> +                       return -EINVAL;
> +               }
> +               if (mbpf_class(meta) == BPF_ALU64 &&
> +                   mbpf_src(meta) == BPF_K && meta->insn.imm < 0) {
> +                       pr_vlog(env, "sign extended multiplicand won't be within u32 value range\n");
> +                       return -EINVAL;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
>  static int
>  nfp_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx)
>  {
> @@ -551,17 +596,8 @@ nfp_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx)
>         if (is_mbpf_xadd(meta))
>                 return nfp_bpf_check_xadd(nfp_prog, meta, env);
>
> -       if (is_mbpf_alu(meta)) {
> -               const struct bpf_reg_state *sreg =
> -                       cur_regs(env) + meta->insn.src_reg;
> -               const struct bpf_reg_state *dreg =
> -                       cur_regs(env) + meta->insn.dst_reg;
> -
> -               meta->umin_src = min(meta->umin_src, sreg->umin_value);
> -               meta->umax_src = max(meta->umax_src, sreg->umax_value);
> -               meta->umin_dst = min(meta->umin_dst, dreg->umin_value);
> -               meta->umax_dst = max(meta->umax_dst, dreg->umax_value);
> -       }
> +       if (is_mbpf_alu(meta))
> +               return nfp_bpf_check_alu(nfp_prog, meta, env);
>
>         return 0;
>  }
> diff --git a/drivers/net/ethernet/netronome/nfp/nfp_asm.h b/drivers/net/ethernet/netronome/nfp/nfp_asm.h
> index f6677bc9875a..cdc4e065f6f5 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfp_asm.h
> +++ b/drivers/net/ethernet/netronome/nfp/nfp_asm.h
> @@ -426,4 +426,32 @@ static inline u32 nfp_get_ind_csr_ctx_ptr_offs(u32 read_offset)
>         return (read_offset & ~NFP_IND_ME_CTX_PTR_BASE_MASK) | NFP_CSR_CTX_PTR;
>  }
>
> +enum mul_type {
> +       MUL_TYPE_START          = 0x00,
> +       MUL_TYPE_STEP_24x8      = 0x01,
> +       MUL_TYPE_STEP_16x16     = 0x02,
> +       MUL_TYPE_STEP_32x32     = 0x03,
> +};
> +
> +enum mul_step {
> +       MUL_STEP_1              = 0x00,
> +       MUL_STEP_NONE           = MUL_STEP_1,
> +       MUL_STEP_2              = 0x01,
> +       MUL_STEP_3              = 0x02,
> +       MUL_STEP_4              = 0x03,
> +       MUL_LAST                = 0x04,
> +       MUL_LAST_2              = 0x05,
> +};
> +
> +#define OP_MUL_BASE            0x0f800000000ULL
> +#define OP_MUL_A_SRC           0x000000003ffULL
> +#define OP_MUL_B_SRC           0x000000ffc00ULL
> +#define OP_MUL_STEP            0x00000700000ULL
> +#define OP_MUL_DST_AB          0x00000800000ULL
> +#define OP_MUL_SW              0x00040000000ULL
> +#define OP_MUL_TYPE            0x00180000000ULL
> +#define OP_MUL_WR_AB           0x20000000000ULL
> +#define OP_MUL_SRC_LMEXTN      0x40000000000ULL
> +#define OP_MUL_DST_LMEXTN      0x80000000000ULL
> +
>  #endif
> --
> 2.17.1
>

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Martin KaFai Lau @ 2018-06-26 22:27 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Okash Khawaja, Daniel Borkmann, Alexei Starovoitov, Yonghong Song,
	Quentin Monnet, David S. Miller, netdev, kernel-team,
	linux-kernel
In-Reply-To: <20180626133133.618af1d3@cakuba.netronome.com>

On Tue, Jun 26, 2018 at 01:31:33PM -0700, Jakub Kicinski wrote:
> On Tue, 26 Jun 2018 17:48:22 +0100, Okash Khawaja wrote:
> > On Fri, Jun 22, 2018 at 05:26:39PM -0700, Martin KaFai Lau wrote:
> > > On Fri, Jun 22, 2018 at 04:32:00PM -0700, Jakub Kicinski wrote:  
> > > > On Fri, 22 Jun 2018 15:54:08 -0700, Martin KaFai Lau wrote:  
> > > > > > > > > > > > > >         "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
> > > > > > > > > > > > > >         ],
> > > > > > > > > > > > > > 	"value_struct":{
> > > > > > > > > > > > > > 		"src_ip":2,        
> > > > > > > > > If for the same map the user changes the "src_ip" to an array of int[4]
> > > > > > > > > later (e.g. to support ipv6), it will become "src_ip": [1, 2, 3, 4].
> > > > > > > > > Is it breaking backward compat?
> > > > > > > > > i.e.
> > > > > > > > > struct five_tuples {
> > > > > > > > > -	int src_ip;
> > > > > > > > > +	int src_ip[4];
> > > > > > > > > /* ... */
> > > > > > > > > };      
> > > > > > > > 
> > > > > > > > Well, it is breaking backward compat, but it's the program doing it,
> > > > > > > > not bpftool :)  BTF changes so does the output.      
> > > > > > > As we see, the key/value's btf-output is inherently not backward compat.
> > > > > > > Hence, "-j" and "-p" will stay as is.  The whole existing json will
> > > > > > > be backward compat instead of only partly backward compat.    
> > > > > > 
> > > > > > No.  There is a difference between user of a facility changing their
> > > > > > input and kernel/libraries providing different output in response to
> > > > > > that, and the libraries suddenly changing the output on their own.
> > > > > > 
> > > > > > Your example is like saying if user started using IPv6 addresses
> > > > > > instead of IPv4 the netlink attributes in dumps will be different so
> > > > > > kernel didn't keep backwards compat.  While what you're doing is more
> > > > > > equivalent to dropping support for old ioctl interfaces because there
> > > > > > is a better mechanism now.    
> > > > > Sorry, I don't follow this.  I don't see netlink suffer json issue like
> > > > > the one on "key" and "value".
> > > > > 
> > > > > All I can grasp is, the json should normally be backward compat but now
> > > > > we are saying anything added by btf-output is an exception because
> > > > > the script parsing it will treat it differently than "key" and "value"  
> > > > 
> > > > Backward compatibility means that if I run *the same* program against
> > > > different kernels/libraries it continues to work.  If someone decides
> > > > to upgrade their program to work with IPv6 (which was your example)
> > > > obviously there is no way system as a whole will look 1:1 the same.
> > > >   
> > > > > > BTF in JSON is very useful, and will help people who writes simple
> > > > > > orchestration/scripts based on bpftool *a* *lot*.  I really appreciate    
> > > > > Can you share what the script will do?  I want to understand why
> > > > > it cannot directly use the BTF format and the map data.  
> > > > 
> > > > Think about a python script which wants to read a counter in a map.
> > > > Right now it would have to get the BTF, find out which bytes are the
> > > > counter, then convert the bytes into a larger int.  With JSON BTF if
> > > > just does entry["formatted"]["value"]["counter"].
> > > > 
> > > > Real life example from my test code (conversion of 3 element counter
> > > > array):
> > > > 
> > > > def str2int(strtab):
> > > >     inttab = []
> > > >     for i in strtab:
> > > >         inttab.append(int(i, 16))
> > > >     ba = bytearray(inttab)
> > > >     if len(strtab) == 4:
> > > >         fmt = "I"
> > > >     elif len(strtab) == 8:
> > > >         fmt = "Q"
> > > >     else:
> > > >         raise Exception("String array of len %d can't be unpacked to an int" %
> > > >                         (len(strtab)))
> > > >     return struct.unpack(fmt, ba)[0]
> > > > 
> > > > def convert(elems, idx):
> > > >     val = []
> > > >     for i in range(3):
> > > >         part = elems[idx]["value"][i * length:(i + 1) * length]
> > > >         val.append(str2int(part))
> > > >     return val
> > > > 
> > > > With BTF it would be:
> > > > 
> > > > 	elems[idx]["formatted"]["value"]
> > > > 
> > > > Which is fairly awesome.  
> > > Thanks for the example.  Agree that with BTF, things are easier in general.
> > > 
> > > btw, what more awesome is,  
> > > #> bpftool map find id 100 key 1  
> > > {
> > > 	"counter_x": 1,
> > > 	"counter_y": 10
> > > }
> > >   
> > > >   
> > > > > > this addition to bpftool and will start using it myself as soon as it
> > > > > > lands.  I'm not sure why the reluctance to slightly change the output
> > > > > > format?    
> > > > > The initial change argument is because the json has to be backward compat.
> > > > > 
> > > > > Then we show that btf-output is inherently not backward compat, so
> > > > > printing it in json does not make sense at all.
> > > > > 
> > > > > However, now it is saying part of it does not have to be backward compat.  
> > > > 
> > > > Compatibility of "formatted" member is defined as -> fields broken out
> > > > according to BTF.  So it is backward compatible.  The definition of
> > > > "value" member is -> an array of unfortunately formatted array of
> > > > ugly hex strings :(
> > > >   
> > > > > I am fine putting it under "formatted" for "-j" or "-p" if that is the
> > > > > case, other than the double output is still confusing.  Lets wait for
> > > > > Okash's input.
> > > > >
> > > > > At the same time, the same output will be used as the default plaintext
> > > > > output when BTF is available.  Then the plaintext BTF output
> > > > > will not be limited by the json restrictions when we want
> > > > > to improve human readability later.  Apparently, the
> > > > > improvements on plaintext will not be always applicable
> > > > > to json output.  
> > > >   
> > 
> > hi,
> > 
> > so i guess following is what we want:
> > 
> > 1. a "formatted" object nested inside -p and -j switches for bpf map
> >   dump. this will be JSON and backward compatible
> > 2. an output for humans - which is like the current output. this will
> > not be JSON. this won't have to be backward compatible. this output will
> > be shown when neither of -j and -p are supplied and btf info is
> > available.
> > 
> > i can update the patches to v2 which covers 2 above + all other comments
> > on the patchset. later we can follow up with a patch for 1.
> 
> Please do both at the same time.  I've learnt not to trust people when
> they say things like "we can follow up with xyz" :(  In my experience it
> _always_ backfires.
> 
> Implementing both outputs in one series will help you structure your
> code to best suit both of the formats up front.
hex and "formatted" are the only things missing?  As always, things
can be refactored when new use case comes up.  Lets wait for
Okash input.

Regardless, plaintext is our current use case.  Having the current
patchset in does not stop us or others from contributing other use
cases (json, "bpftool map find"...etc),  and IMO it is actually
the opposite.  Others may help us get there faster than us alone.
We should not stop making forward progress and take this patch
as hostage because "abc" and "xyz" are not done together.

^ permalink raw reply

* Re: [PATCH bpf-next 6/7] nfp: bpf: support u32 divide using reciprocal_div.h
From: Song Liu @ 2018-06-26 22:28 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking,
	Jiong Wang
In-Reply-To: <20180625035421.2991-7-jakub.kicinski@netronome.com>

On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> From: Jiong Wang <jiong.wang@netronome.com>
>
> NFP doesn't have integer divide instruction, this patch use reciprocal
> algorithm (the basic one, reciprocal_div) to emulate it.
>
> For each u32 divide, we would need 11 instructions to finish the operation.
>
>   7 (for multiplication) + 4 (various ALUs) = 11
>
> Given NFP only supports multiplication no bigger than u32, we'd require
> divisor and dividend no bigger than that as well.
>
> Also eBPF doesn't support signed divide and has enforced this on C language
> level by failing compilation. However LLVM assembler hasn't enforced this,
> so it is possible for negative constant to leak in as a BPF_K operand
> through assembly code, we reject such cases as well.
>
> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  drivers/net/ethernet/netronome/nfp/bpf/jit.c  | 58 ++++++++++++++++++-
>  drivers/net/ethernet/netronome/nfp/bpf/main.h |  5 ++
>  .../net/ethernet/netronome/nfp/bpf/verifier.c | 31 ++++++++++
>  3 files changed, 93 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> index 7d7061d93358..d732b6cfc356 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> @@ -34,10 +34,11 @@
>  #define pr_fmt(fmt)    "NFP net bpf: " fmt
>
>  #include <linux/bug.h>
> -#include <linux/kernel.h>
>  #include <linux/bpf.h>
>  #include <linux/filter.h>
> +#include <linux/kernel.h>
>  #include <linux/pkt_cls.h>
> +#include <linux/reciprocal_div.h>
>  #include <linux/unistd.h>
>
>  #include "main.h"
> @@ -1493,6 +1494,32 @@ wrp_mul(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
>         return 0;
>  }
>
> +static int wrp_div_imm(struct nfp_prog *nfp_prog, u8 dst, u64 imm)
> +{
> +       swreg tmp_both = imm_both(nfp_prog), dst_both = reg_both(dst);
> +       swreg dst_a = reg_a(dst), dst_b = reg_a(dst);
> +       struct reciprocal_value rvalue;
> +       swreg tmp_b = imm_b(nfp_prog);
> +       swreg magic;
> +
> +       if (imm > U32_MAX) {
> +               wrp_immed(nfp_prog, dst_both, 0);
> +               return 0;
> +       }
> +
> +       rvalue = reciprocal_value(imm);
> +       magic = re_load_imm_any(nfp_prog, rvalue.m, imm_b(nfp_prog));
> +       wrp_mul_u32(nfp_prog, tmp_both, tmp_both, dst_a, magic, true);
> +       emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_SUB, tmp_b);
> +       emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b,
> +                SHF_SC_R_SHF, rvalue.sh1);
> +       emit_alu(nfp_prog, dst_both, dst_a, ALU_OP_ADD, tmp_b);
> +       emit_shf(nfp_prog, dst_both, reg_none(), SHF_OP_NONE, dst_b,
> +                SHF_SC_R_SHF, rvalue.sh2);
> +
> +       return 0;
> +}
> +
>  static int adjust_head(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
>  {
>         swreg tmp = imm_a(nfp_prog), tmp_len = imm_b(nfp_prog);
> @@ -1807,6 +1834,21 @@ static int mul_imm64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
>         return wrp_mul(nfp_prog, meta, true, false);
>  }
>
> +static int div_imm64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> +       const struct bpf_insn *insn = &meta->insn;
> +
> +       return wrp_div_imm(nfp_prog, insn->dst_reg * 2, insn->imm);
> +}
> +
> +static int div_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> +       /* NOTE: verifier hook has rejected cases for which verifier doesn't
> +        * know whether the source operand is constant or not.
> +        */
> +       return wrp_div_imm(nfp_prog, meta->insn.dst_reg * 2, meta->umin_src);
> +}
> +
>  static int neg_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
>  {
>         const struct bpf_insn *insn = &meta->insn;
> @@ -2230,6 +2272,16 @@ static int mul_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
>         return wrp_mul(nfp_prog, meta, false, false);
>  }
>
> +static int div_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> +       return div_reg64(nfp_prog, meta);
> +}
> +
> +static int div_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
> +{
> +       return div_imm64(nfp_prog, meta);
> +}
> +
>  static int neg_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
>  {
>         u8 dst = meta->insn.dst_reg * 2;
> @@ -2983,6 +3035,8 @@ static const instr_cb_t instr_cb[256] = {
>         [BPF_ALU64 | BPF_SUB | BPF_K] = sub_imm64,
>         [BPF_ALU64 | BPF_MUL | BPF_X] = mul_reg64,
>         [BPF_ALU64 | BPF_MUL | BPF_K] = mul_imm64,
> +       [BPF_ALU64 | BPF_DIV | BPF_X] = div_reg64,
> +       [BPF_ALU64 | BPF_DIV | BPF_K] = div_imm64,
>         [BPF_ALU64 | BPF_NEG] =         neg_reg64,
>         [BPF_ALU64 | BPF_LSH | BPF_X] = shl_reg64,
>         [BPF_ALU64 | BPF_LSH | BPF_K] = shl_imm64,
> @@ -3004,6 +3058,8 @@ static const instr_cb_t instr_cb[256] = {
>         [BPF_ALU | BPF_SUB | BPF_K] =   sub_imm,
>         [BPF_ALU | BPF_MUL | BPF_X] =   mul_reg,
>         [BPF_ALU | BPF_MUL | BPF_K] =   mul_imm,
> +       [BPF_ALU | BPF_DIV | BPF_X] =   div_reg,
> +       [BPF_ALU | BPF_DIV | BPF_K] =   div_imm,
>         [BPF_ALU | BPF_NEG] =           neg_reg,
>         [BPF_ALU | BPF_LSH | BPF_K] =   shl_imm,
>         [BPF_ALU | BPF_END | BPF_X] =   end_reg32,
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
> index c10079b1a312..9845c1a2d4c2 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
> @@ -399,6 +399,11 @@ static inline bool is_mbpf_mul(const struct nfp_insn_meta *meta)
>         return is_mbpf_alu(meta) && mbpf_op(meta) == BPF_MUL;
>  }
>
> +static inline bool is_mbpf_div(const struct nfp_insn_meta *meta)
> +{
> +       return is_mbpf_alu(meta) && mbpf_op(meta) == BPF_DIV;
> +}
> +
>  /**
>   * struct nfp_prog - nfp BPF program
>   * @bpf: backpointer to the bpf app priv structure
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> index 30d4f1580693..f0f07e988c46 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
> @@ -558,6 +558,37 @@ nfp_bpf_check_alu(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
>                 }
>         }
>
> +       /* NFP doesn't have divide instructions, we support divide by constant
> +        * through reciprocal multiplication. Given NFP support multiplication
> +        * no bigger than u32, we'd require divisor and dividend no bigger than
> +        * that as well.
> +        *
> +        * Also eBPF doesn't support signed divide and has enforced this on C
> +        * language level by failing compilation. However LLVM assembler hasn't
> +        * enforced this, so it is possible for negative constant to leak in as
> +        * a BPF_K operand through assembly code, we reject such cases as well.
> +        */
> +       if (is_mbpf_div(meta)) {
> +               if (meta->umax_dst > U32_MAX) {
> +                       pr_vlog(env, "divisor is not within u32 value range\n");
> +                       return -EINVAL;
> +               }
> +               if (mbpf_src(meta) == BPF_X) {
> +                       if (meta->umin_src != meta->umax_src) {
> +                               pr_vlog(env, "dividend is not constant\n");
> +                               return -EINVAL;
> +                       }
> +                       if (meta->umax_src > U32_MAX) {
> +                               pr_vlog(env, "dividend is not within u32 value range\n");
> +                               return -EINVAL;
> +                       }
> +               }
> +               if (mbpf_src(meta) == BPF_K && meta->insn.imm < 0) {
> +                       pr_vlog(env, "divide by negative constant is not supported\n");
> +                       return -EINVAL;
> +               }
> +       }
> +
>         return 0;
>  }
>
> --
> 2.17.1
>

^ permalink raw reply

* Re: [PATCH 0/6] offload Linux LAG devices to the TC datapath
From: Jakub Kicinski @ 2018-06-26 22:31 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: John Hurley, Jiri Pirko, netdev, ASAP_Direct_Dev, simon.horman,
	Andy Gospodarek
In-Reply-To: <c3e3790f-fd49-3106-718f-d87993a0c195@mellanox.com>

On Tue, 26 Jun 2018 17:57:08 +0300, Or Gerlitz wrote:
> > -------- Forwarded Message --------
> > Subject: [PATCH 0/6] offload Linux LAG devices to the TC datapath
> > Date: Thu, 21 Jun 2018 14:35:55 +0100
> > From: John Hurley <john.hurley@netronome.com>
> > To: dev@openvswitch.org, roid@mellanox.com, gavi@mellanox.com, paulb@mellanox.com, fbl@sysclose.org, simon.horman@netronome.com
> > CC: John Hurley <john.hurley@netronome.com>
> > 
> > This patchset extends OvS TC and the linux-netdev implementation to
> > support the offloading of Linux Link Aggregation devices (LAG) and their
> > slaves. TC blocks are used to provide this offload. Blocks, in TC, group
> > together a series of qdiscs. If a filter is added to one of these qdiscs
> > then it applied to all. Similarly, if a packet is matched on one of the
> > grouped qdiscs then the stats for the entire block are increased. The
> > basis of the LAG offload is that the LAG master (attached to the OvS
> > bridge) and slaves that may exist outside of OvS are all added to the same
> > TC block. OvS can then control the filters and collect the stats on the
> > slaves via its interaction with the LAG master.
> > 
> > The TC API is extended within OvS to allow the addition of a block id to
> > ingress qdisc adds. Block ids are then assigned to each LAG master that is
> > attached to the OvS bridge. The linux netdev netlink socket is used to
> > monitor slave devices. If a LAG slave is found whose master is on the bridge
> > then it is added to the same block as its master. If the underlying slaves
> > belong to an offloadable device then the Linux LAG device can be offloaded
> > to hardware.  
> 
> Guys (J/J/J), 
> 
> Doing this here b/c
> 
> a. this has impact on the kernel side of things
> 
> b. I am more of a netdev and not openvswitch citizen..
> 
> some comments, 
> 
> 1. this + Jakub's patch for the reply are really a great design
> 
> 2. re the egress side of things. Some NIC HWs can't just use LAG
> as the egress port destination of an ACL (tc rule) and the HW rule
> needs to be duplicated to both HW ports. So... in that case, you 
> see the HW driver doing the duplication (:() or we can somehow
> make it happen from user-space?

It's the TC core that does the duplication.  Drivers which don't need
the duplication (e.g. mlxsw) will not register a new callback for each
port on which shared block is bound.  They will keep one list of rules,
and a list of ports that those rules apply to.

Drivers which need duplication (multiplication) (all NICs?) have to
register a new callback for each port bound to a shared block.  And TC
will call those drivers as many times as they have callbacks registered
== as many times as they have ports bound to the block.  Each time
callback is invoked the driver will figure out the ingress port based
on the cb_priv and use <ingress, cookie> as the key in its rule table
(or have a separate rule table per ingress port).

So again you just register a callback every time shared block is bound,
and then TC core will send add/remove rule commands down to the driver,
relaying existing rules as well if needed.  I may be wrong, but I think
you split the rules tables per port for mlx5, so this OvS bond offload
scheme "should just work" for you?

Does that clarify things or were you asking more about the active
passive thing John mentioned or some way to save rule space?

> 3. for the case of overlay networks, e.g OVS based vxlan tunnel, the
> ingress (decap) rule is set on the vxlan device. Jakub, you mentioned 
> a possible kernel patch to the HW (nfp, mlx5) drivers to have them bind 
> to the tunnel device for ingress rules. If we have agreed way to identify
> uplink representors, can we do that from ovs too?

I'm not sure, there can be multiple tunnel devices.  Plus we really
want to know the tunnel type, e.g. vxlan vs geneve, so simple shared
block propagation will probably not cut it.  If that's what you're
referring to.

> does it matter if we are bonding + encapsulating or just
> encapsulating? note that under encap scheme the bond is typically not
> part of the OVS bridge. 

I don't think that matters in general, driver doing bonding offload
should just start recognizing the bond master as "their port" and
register an egdev callback for redirects to master today (or equivalent
in the new scheme once that materializes...)

^ permalink raw reply

* Re: [PATCH RESEND bpf-next v6 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events
From: Song Liu @ 2018-06-26 22:35 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Networking
In-Reply-To: <152992950237.15897.9894201421576854943.stgit@alrua-kau>

On Mon, Jun 25, 2018 at 5:25 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> Add two new helper functions to trace_helpers that supports polling
> multiple perf file descriptors for events. These are used to the XDP
> perf_event_output example, which needs to work with one perf fd per CPU.
>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  tools/testing/selftests/bpf/trace_helpers.c |   48 ++++++++++++++++++++++++++-
>  tools/testing/selftests/bpf/trace_helpers.h |    4 ++
>  2 files changed, 50 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
> index 3868dcb63420..cabe2a3a3b30 100644
> --- a/tools/testing/selftests/bpf/trace_helpers.c
> +++ b/tools/testing/selftests/bpf/trace_helpers.c
> @@ -88,7 +88,7 @@ static int page_size;
>  static int page_cnt = 8;
>  static struct perf_event_mmap_page *header;
>
> -int perf_event_mmap(int fd)
> +int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header)
>  {
>         void *base;
>         int mmap_size;
> @@ -102,10 +102,15 @@ int perf_event_mmap(int fd)
>                 return -1;
>         }
>
> -       header = base;
> +       *header = base;
>         return 0;
>  }
>
> +int perf_event_mmap(int fd)
> +{
> +       return perf_event_mmap_header(fd, &header);
> +}
> +
>  static int perf_event_poll(int fd)
>  {
>         struct pollfd pfd = { .fd = fd, .events = POLLIN };
> @@ -163,3 +168,42 @@ int perf_event_poller(int fd, perf_event_print_fn output_fn)
>
>         return ret;
>  }
> +
> +int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
> +                           int num_fds, perf_event_print_fn output_fn)
> +{
> +       enum bpf_perf_event_ret ret;
> +       struct pollfd *pfds;
> +       void *buf = NULL;
> +       size_t len = 0;
> +       int i;
> +
> +       pfds = calloc(num_fds, sizeof(*pfds));
> +       if (!pfds)
> +               return LIBBPF_PERF_EVENT_ERROR;
> +
> +       for (i = 0; i < num_fds; i++) {
> +               pfds[i].fd = fds[i];
> +               pfds[i].events = POLLIN;
> +       }
> +
> +       for (;;) {
> +               poll(pfds, num_fds, 1000);
> +               for (i = 0; i < num_fds; i++) {
> +                       if (!pfds[i].revents)
> +                               continue;
> +
> +                       ret = bpf_perf_event_read_simple(headers[i],
> +                                                        page_cnt * page_size,
> +                                                        page_size, &buf, &len,
> +                                                        bpf_perf_event_print,
> +                                                        output_fn);
> +                       if (ret != LIBBPF_PERF_EVENT_CONT)
> +                               break;
> +               }
> +       }
> +       free(buf);
> +       free(pfds);
> +
> +       return ret;
> +}
> diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
> index 3b4bcf7f5084..18924f23db1b 100644
> --- a/tools/testing/selftests/bpf/trace_helpers.h
> +++ b/tools/testing/selftests/bpf/trace_helpers.h
> @@ -3,6 +3,7 @@
>  #define __TRACE_HELPER_H
>
>  #include <libbpf.h>
> +#include <linux/perf_event.h>
>
>  struct ksym {
>         long addr;
> @@ -16,6 +17,9 @@ long ksym_get_addr(const char *name);
>  typedef enum bpf_perf_event_ret (*perf_event_print_fn)(void *data, int size);
>
>  int perf_event_mmap(int fd);
> +int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header);
>  /* return LIBBPF_PERF_EVENT_DONE or LIBBPF_PERF_EVENT_ERROR */
>  int perf_event_poller(int fd, perf_event_print_fn output_fn);
> +int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
> +                           int num_fds, perf_event_print_fn output_fn);
>  #endif
>

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Jakub Kicinski @ 2018-06-26 22:35 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Okash Khawaja, Daniel Borkmann, Alexei Starovoitov, Yonghong Song,
	Quentin Monnet, David S. Miller, netdev, kernel-team,
	linux-kernel
In-Reply-To: <20180626222709.fsznwqauxojhhx7v@kafai-mbp.dhcp.thefacebook.com>

On Tue, 26 Jun 2018 15:27:09 -0700, Martin KaFai Lau wrote:
> On Tue, Jun 26, 2018 at 01:31:33PM -0700, Jakub Kicinski wrote:
> > On Tue, 26 Jun 2018 17:48:22 +0100, Okash Khawaja wrote:  
> > > On Fri, Jun 22, 2018 at 05:26:39PM -0700, Martin KaFai Lau wrote:  
> > > > On Fri, Jun 22, 2018 at 04:32:00PM -0700, Jakub Kicinski wrote:    
> > > > > On Fri, 22 Jun 2018 15:54:08 -0700, Martin KaFai Lau wrote:    
> > > > > > > > > > > > > > >         "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
> > > > > > > > > > > > > > >         ],
> > > > > > > > > > > > > > > 	"value_struct":{
> > > > > > > > > > > > > > > 		"src_ip":2,          
> > > > > > > > > > If for the same map the user changes the "src_ip" to an array of int[4]
> > > > > > > > > > later (e.g. to support ipv6), it will become "src_ip": [1, 2, 3, 4].
> > > > > > > > > > Is it breaking backward compat?
> > > > > > > > > > i.e.
> > > > > > > > > > struct five_tuples {
> > > > > > > > > > -	int src_ip;
> > > > > > > > > > +	int src_ip[4];
> > > > > > > > > > /* ... */
> > > > > > > > > > };        
> > > > > > > > > 
> > > > > > > > > Well, it is breaking backward compat, but it's the program doing it,
> > > > > > > > > not bpftool :)  BTF changes so does the output.        
> > > > > > > > As we see, the key/value's btf-output is inherently not backward compat.
> > > > > > > > Hence, "-j" and "-p" will stay as is.  The whole existing json will
> > > > > > > > be backward compat instead of only partly backward compat.      
> > > > > > > 
> > > > > > > No.  There is a difference between user of a facility changing their
> > > > > > > input and kernel/libraries providing different output in response to
> > > > > > > that, and the libraries suddenly changing the output on their own.
> > > > > > > 
> > > > > > > Your example is like saying if user started using IPv6 addresses
> > > > > > > instead of IPv4 the netlink attributes in dumps will be different so
> > > > > > > kernel didn't keep backwards compat.  While what you're doing is more
> > > > > > > equivalent to dropping support for old ioctl interfaces because there
> > > > > > > is a better mechanism now.      
> > > > > > Sorry, I don't follow this.  I don't see netlink suffer json issue like
> > > > > > the one on "key" and "value".
> > > > > > 
> > > > > > All I can grasp is, the json should normally be backward compat but now
> > > > > > we are saying anything added by btf-output is an exception because
> > > > > > the script parsing it will treat it differently than "key" and "value"    
> > > > > 
> > > > > Backward compatibility means that if I run *the same* program against
> > > > > different kernels/libraries it continues to work.  If someone decides
> > > > > to upgrade their program to work with IPv6 (which was your example)
> > > > > obviously there is no way system as a whole will look 1:1 the same.
> > > > >     
> > > > > > > BTF in JSON is very useful, and will help people who writes simple
> > > > > > > orchestration/scripts based on bpftool *a* *lot*.  I really appreciate      
> > > > > > Can you share what the script will do?  I want to understand why
> > > > > > it cannot directly use the BTF format and the map data.    
> > > > > 
> > > > > Think about a python script which wants to read a counter in a map.
> > > > > Right now it would have to get the BTF, find out which bytes are the
> > > > > counter, then convert the bytes into a larger int.  With JSON BTF if
> > > > > just does entry["formatted"]["value"]["counter"].
> > > > > 
> > > > > Real life example from my test code (conversion of 3 element counter
> > > > > array):
> > > > > 
> > > > > def str2int(strtab):
> > > > >     inttab = []
> > > > >     for i in strtab:
> > > > >         inttab.append(int(i, 16))
> > > > >     ba = bytearray(inttab)
> > > > >     if len(strtab) == 4:
> > > > >         fmt = "I"
> > > > >     elif len(strtab) == 8:
> > > > >         fmt = "Q"
> > > > >     else:
> > > > >         raise Exception("String array of len %d can't be unpacked to an int" %
> > > > >                         (len(strtab)))
> > > > >     return struct.unpack(fmt, ba)[0]
> > > > > 
> > > > > def convert(elems, idx):
> > > > >     val = []
> > > > >     for i in range(3):
> > > > >         part = elems[idx]["value"][i * length:(i + 1) * length]
> > > > >         val.append(str2int(part))
> > > > >     return val
> > > > > 
> > > > > With BTF it would be:
> > > > > 
> > > > > 	elems[idx]["formatted"]["value"]
> > > > > 
> > > > > Which is fairly awesome.    
> > > > Thanks for the example.  Agree that with BTF, things are easier in general.
> > > > 
> > > > btw, what more awesome is,    
> > > > #> bpftool map find id 100 key 1    
> > > > {
> > > > 	"counter_x": 1,
> > > > 	"counter_y": 10
> > > > }
> > > >     
> > > > >     
> > > > > > > this addition to bpftool and will start using it myself as soon as it
> > > > > > > lands.  I'm not sure why the reluctance to slightly change the output
> > > > > > > format?      
> > > > > > The initial change argument is because the json has to be backward compat.
> > > > > > 
> > > > > > Then we show that btf-output is inherently not backward compat, so
> > > > > > printing it in json does not make sense at all.
> > > > > > 
> > > > > > However, now it is saying part of it does not have to be backward compat.    
> > > > > 
> > > > > Compatibility of "formatted" member is defined as -> fields broken out
> > > > > according to BTF.  So it is backward compatible.  The definition of
> > > > > "value" member is -> an array of unfortunately formatted array of
> > > > > ugly hex strings :(
> > > > >     
> > > > > > I am fine putting it under "formatted" for "-j" or "-p" if that is the
> > > > > > case, other than the double output is still confusing.  Lets wait for
> > > > > > Okash's input.
> > > > > >
> > > > > > At the same time, the same output will be used as the default plaintext
> > > > > > output when BTF is available.  Then the plaintext BTF output
> > > > > > will not be limited by the json restrictions when we want
> > > > > > to improve human readability later.  Apparently, the
> > > > > > improvements on plaintext will not be always applicable
> > > > > > to json output.    
> > > > >     
> > > 
> > > hi,
> > > 
> > > so i guess following is what we want:
> > > 
> > > 1. a "formatted" object nested inside -p and -j switches for bpf map
> > >   dump. this will be JSON and backward compatible
> > > 2. an output for humans - which is like the current output. this will
> > > not be JSON. this won't have to be backward compatible. this output will
> > > be shown when neither of -j and -p are supplied and btf info is
> > > available.
> > > 
> > > i can update the patches to v2 which covers 2 above + all other comments
> > > on the patchset. later we can follow up with a patch for 1.  
> > 
> > Please do both at the same time.  I've learnt not to trust people when
> > they say things like "we can follow up with xyz" :(  In my experience it
> > _always_ backfires.
> > 
> > Implementing both outputs in one series will help you structure your
> > code to best suit both of the formats up front.  
> hex and "formatted" are the only things missing?  As always, things
> can be refactored when new use case comes up.  Lets wait for
> Okash input.
> 
> Regardless, plaintext is our current use case.  Having the current
> patchset in does not stop us or others from contributing other use
> cases (json, "bpftool map find"...etc),  and IMO it is actually
> the opposite.  Others may help us get there faster than us alone.
> We should not stop making forward progress and take this patch
> as hostage because "abc" and "xyz" are not done together.

Parity between JSON and plain text output is non negotiable.

^ permalink raw reply

* Re: [PATCH RESEND bpf-next v6 2/2] samples/bpf: Add xdp_sample_pkts example
From: Song Liu @ 2018-06-26 22:37 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Networking
In-Reply-To: <152992950246.15897.18198690562616367907.stgit@alrua-kau>

On Mon, Jun 25, 2018 at 5:25 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> Add an example program showing how to sample packets from XDP using the
> perf event buffer. The example userspace program just prints the ethernet
> header for every packet sampled.
>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>

Acked-by: Song Liu <songliubraving@fb.com>


> ---
>  samples/bpf/Makefile               |    4 +
>  samples/bpf/xdp_sample_pkts_kern.c |   66 ++++++++++++++
>  samples/bpf/xdp_sample_pkts_user.c |  169 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 239 insertions(+)
>  create mode 100644 samples/bpf/xdp_sample_pkts_kern.c
>  create mode 100644 samples/bpf/xdp_sample_pkts_user.c
>
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 1303af10e54d..9ea2f7b64869 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -52,6 +52,7 @@ hostprogs-y += xdp_adjust_tail
>  hostprogs-y += xdpsock
>  hostprogs-y += xdp_fwd
>  hostprogs-y += task_fd_query
> +hostprogs-y += xdp_sample_pkts
>
>  # Libbpf dependencies
>  LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
> @@ -107,6 +108,7 @@ xdp_adjust_tail-objs := xdp_adjust_tail_user.o
>  xdpsock-objs := bpf_load.o xdpsock_user.o
>  xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
>  task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
> +xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
>
>  # Tell kbuild to always build the programs
>  always := $(hostprogs-y)
> @@ -163,6 +165,7 @@ always += xdp_adjust_tail_kern.o
>  always += xdpsock_kern.o
>  always += xdp_fwd_kern.o
>  always += task_fd_query_kern.o
> +always += xdp_sample_pkts_kern.o
>
>  HOSTCFLAGS += -I$(objtree)/usr/include
>  HOSTCFLAGS += -I$(srctree)/tools/lib/
> @@ -179,6 +182,7 @@ HOSTCFLAGS_spintest_user.o += -I$(srctree)/tools/lib/bpf/
>  HOSTCFLAGS_trace_event_user.o += -I$(srctree)/tools/lib/bpf/
>  HOSTCFLAGS_sampleip_user.o += -I$(srctree)/tools/lib/bpf/
>  HOSTCFLAGS_task_fd_query_user.o += -I$(srctree)/tools/lib/bpf/
> +HOSTCFLAGS_xdp_sample_pkts_user.o += -I$(srctree)/tools/lib/bpf/
>
>  HOST_LOADLIBES         += $(LIBBPF) -lelf
>  HOSTLOADLIBES_tracex4          += -lrt
> diff --git a/samples/bpf/xdp_sample_pkts_kern.c b/samples/bpf/xdp_sample_pkts_kern.c
> new file mode 100644
> index 000000000000..f7ca8b850978
> --- /dev/null
> +++ b/samples/bpf/xdp_sample_pkts_kern.c
> @@ -0,0 +1,66 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/ptrace.h>
> +#include <linux/version.h>
> +#include <uapi/linux/bpf.h>
> +#include "bpf_helpers.h"
> +
> +#define SAMPLE_SIZE 64ul
> +#define MAX_CPUS 128
> +
> +#define bpf_printk(fmt, ...)                                   \
> +({                                                             \
> +              char ____fmt[] = fmt;                            \
> +              bpf_trace_printk(____fmt, sizeof(____fmt),       \
> +                               ##__VA_ARGS__);                 \
> +})
> +
> +struct bpf_map_def SEC("maps") my_map = {
> +       .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> +       .key_size = sizeof(int),
> +       .value_size = sizeof(u32),
> +       .max_entries = MAX_CPUS,
> +};
> +
> +SEC("xdp_sample")
> +int xdp_sample_prog(struct xdp_md *ctx)
> +{
> +       void *data_end = (void *)(long)ctx->data_end;
> +       void *data = (void *)(long)ctx->data;
> +
> +       /* Metadata will be in the perf event before the packet data. */
> +       struct S {
> +               u16 cookie;
> +               u16 pkt_len;
> +       } __packed metadata;
> +
> +       if (data < data_end) {
> +               /* The XDP perf_event_output handler will use the upper 32 bits
> +                * of the flags argument as a number of bytes to include of the
> +                * packet payload in the event data. If the size is too big, the
> +                * call to bpf_perf_event_output will fail and return -EFAULT.
> +                *
> +                * See bpf_xdp_event_output in net/core/filter.c.
> +                *
> +                * The BPF_F_CURRENT_CPU flag means that the event output fd
> +                * will be indexed by the CPU number in the event map.
> +                */
> +               u64 flags = BPF_F_CURRENT_CPU;
> +               u16 sample_size;
> +               int ret;
> +
> +               metadata.cookie = 0xdead;
> +               metadata.pkt_len = (u16)(data_end - data);
> +               sample_size = min(metadata.pkt_len, SAMPLE_SIZE);
> +               flags |= (u64)sample_size << 32;
> +
> +               ret = bpf_perf_event_output(ctx, &my_map, flags,
> +                                           &metadata, sizeof(metadata));
> +               if (ret)
> +                       bpf_printk("perf_event_output failed: %d\n", ret);
> +       }
> +
> +       return XDP_PASS;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> +u32 _version SEC("version") = LINUX_VERSION_CODE;
> diff --git a/samples/bpf/xdp_sample_pkts_user.c b/samples/bpf/xdp_sample_pkts_user.c
> new file mode 100644
> index 000000000000..8dd87c1eb560
> --- /dev/null
> +++ b/samples/bpf/xdp_sample_pkts_user.c
> @@ -0,0 +1,169 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <linux/perf_event.h>
> +#include <linux/bpf.h>
> +#include <net/if.h>
> +#include <errno.h>
> +#include <assert.h>
> +#include <sys/sysinfo.h>
> +#include <sys/ioctl.h>
> +#include <signal.h>
> +#include <libbpf.h>
> +#include <bpf/bpf.h>
> +
> +#include "perf-sys.h"
> +#include "trace_helpers.h"
> +
> +#define MAX_CPUS 128
> +static int pmu_fds[MAX_CPUS], if_idx;
> +static struct perf_event_mmap_page *headers[MAX_CPUS];
> +static char *if_name;
> +
> +static int do_attach(int idx, int fd, const char *name)
> +{
> +       int err;
> +
> +       err = bpf_set_link_xdp_fd(idx, fd, 0);
> +       if (err < 0)
> +               printf("ERROR: failed to attach program to %s\n", name);
> +
> +       return err;
> +}
> +
> +static int do_detach(int idx, const char *name)
> +{
> +       int err;
> +
> +       err = bpf_set_link_xdp_fd(idx, -1, 0);
> +       if (err < 0)
> +               printf("ERROR: failed to detach program from %s\n", name);
> +
> +       return err;
> +}
> +
> +#define SAMPLE_SIZE 64
> +
> +static int print_bpf_output(void *data, int size)
> +{
> +       struct {
> +               __u16 cookie;
> +               __u16 pkt_len;
> +               __u8  pkt_data[SAMPLE_SIZE];
> +       } __packed *e = data;
> +       int i;
> +
> +       if (e->cookie != 0xdead) {
> +               printf("BUG cookie %x sized %d\n",
> +                      e->cookie, size);
> +               return LIBBPF_PERF_EVENT_ERROR;
> +       }
> +
> +       printf("Pkt len: %-5d bytes. Ethernet hdr: ", e->pkt_len);
> +       for (i = 0; i < 14 && i < e->pkt_len; i++)
> +               printf("%02x ", e->pkt_data[i]);
> +       printf("\n");
> +
> +       return LIBBPF_PERF_EVENT_CONT;
> +}
> +
> +static void test_bpf_perf_event(int map_fd, int num)
> +{
> +       struct perf_event_attr attr = {
> +               .sample_type = PERF_SAMPLE_RAW,
> +               .type = PERF_TYPE_SOFTWARE,
> +               .config = PERF_COUNT_SW_BPF_OUTPUT,
> +               .wakeup_events = 1, /* get an fd notification for every event */
> +       };
> +       int i;
> +
> +       for (i = 0; i < num; i++) {
> +               int key = i;
> +
> +               pmu_fds[i] = sys_perf_event_open(&attr, -1/*pid*/, i/*cpu*/,
> +                                                -1/*group_fd*/, 0);
> +
> +               assert(pmu_fds[i] >= 0);
> +               assert(bpf_map_update_elem(map_fd, &key,
> +                                          &pmu_fds[i], BPF_ANY) == 0);
> +               ioctl(pmu_fds[i], PERF_EVENT_IOC_ENABLE, 0);
> +       }
> +}
> +
> +static void sig_handler(int signo)
> +{
> +       do_detach(if_idx, if_name);
> +       exit(0);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +       struct bpf_prog_load_attr prog_load_attr = {
> +               .prog_type      = BPF_PROG_TYPE_XDP,
> +       };
> +       struct bpf_object *obj;
> +       struct bpf_map *map;
> +       int prog_fd, map_fd;
> +       char filename[256];
> +       int ret, err, i;
> +       int numcpus;
> +
> +       if (argc < 2) {
> +               printf("Usage: %s <ifname>\n", argv[0]);
> +               return 1;
> +       }
> +
> +       numcpus = get_nprocs();
> +       if (numcpus > MAX_CPUS)
> +               numcpus = MAX_CPUS;
> +
> +       snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
> +       prog_load_attr.file = filename;
> +
> +       if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
> +               return 1;
> +
> +       if (!prog_fd) {
> +               printf("load_bpf_file: %s\n", strerror(errno));
> +               return 1;
> +       }
> +
> +       map = bpf_map__next(NULL, obj);
> +       if (!map) {
> +               printf("finding a map in obj file failed\n");
> +               return 1;
> +       }
> +       map_fd = bpf_map__fd(map);
> +
> +       if_idx = if_nametoindex(argv[1]);
> +       if (!if_idx)
> +               if_idx = strtoul(argv[1], NULL, 0);
> +
> +       if (!if_idx) {
> +               fprintf(stderr, "Invalid ifname\n");
> +               return 1;
> +       }
> +       if_name = argv[1];
> +       err = do_attach(if_idx, prog_fd, argv[1]);
> +       if (err)
> +               return err;
> +
> +       if (signal(SIGINT, sig_handler) ||
> +           signal(SIGHUP, sig_handler) ||
> +           signal(SIGTERM, sig_handler)) {
> +               perror("signal");
> +               return 1;
> +       }
> +
> +       test_bpf_perf_event(map_fd, numcpus);
> +
> +       for (i = 0; i < numcpus; i++)
> +               if (perf_event_mmap_header(pmu_fds[i], &headers[i]) < 0)
> +                       return 1;
> +
> +       ret = perf_event_poller_multi(pmu_fds, headers, numcpus,
> +                                     print_bpf_output);
> +       kill(0, SIGINT);
> +       return ret;
> +}
>

^ permalink raw reply

* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Cong Wang @ 2018-06-26 22:47 UTC (permalink / raw)
  To: Flavio Leitner
  Cc: Eric Dumazet, Linux Kernel Network Developers, Paolo Abeni,
	David Miller, Florian Westphal, NetFilter
In-Reply-To: <20180626220300.GT19565@plex.lan>

On Tue, Jun 26, 2018 at 3:03 PM Flavio Leitner <fbl@redhat.com> wrote:
>
> On Tue, Jun 26, 2018 at 02:48:47PM -0700, Cong Wang wrote:
> > On Mon, Jun 25, 2018 at 11:41 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > When a packet is attached to a socket, we should keep the association as much as possible.
> >
> > As much as possible within one stack, I agree. I still don't understand
> > why we should keep it across the stack boundary.
> >
> > > Only when a new association needs to be done, skb_orphan() needs to be called.
> > >
> > > Doing this skb_orphan() too soon breaks back pressure in general, this is bad, since a socket
> > > can evades SO_SNDBUF limits.
> >
> > Right before leaving the stack is not too soon, it is the latest
> > actually, for veth case.
>
> Depends on how you view things - it's the same host/stack sharing the
> same resources, so why should we not keep it?

Because stacks are supposed to be independent, netdevices are
isolated, iptables and route tables too. This is how netns is designed
from the beginning. The trend today is actually more isolation instead
of more sharing, given the popularity of containers.

You need to justify why you want to break the TSQ's scope here,
which is obviously not compatible with netns design.

^ permalink raw reply

* Re: [net-next PATCH v4 1/7] net: Refactor XPS for CPUs and Rx queues
From: Tom Herbert @ 2018-06-26 22:53 UTC (permalink / raw)
  To: Amritha Nambiar
  Cc: Linux Kernel Network Developers, David S. Miller, Alexander Duyck,
	Willem de Bruijn, Sridhar Samudrala, Alexander Duyck,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <152994985870.9733.538443146526394443.stgit@anamhost.jf.intel.com>

On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
> Refactor XPS code to support Tx queue selection based on
> CPU(s) map or Rx queue(s) map.
>
> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
> ---
>  include/linux/cpumask.h   |   11 ++
>  include/linux/netdevice.h |  100 +++++++++++++++++++++
>  net/core/dev.c            |  211 ++++++++++++++++++++++++++++++---------------
>  net/core/net-sysfs.c      |    4 -
>  4 files changed, 246 insertions(+), 80 deletions(-)
>
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index bf53d89..57f20a0 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask;
>  #define cpu_active(cpu)                ((cpu) == 0)
>  #endif
>
> -/* verify cpu argument to cpumask_* operators */
> -static inline unsigned int cpumask_check(unsigned int cpu)
> +static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
>  {
>  #ifdef CONFIG_DEBUG_PER_CPU_MAPS
> -       WARN_ON_ONCE(cpu >= nr_cpumask_bits);
> +       WARN_ON_ONCE(cpu >= bits);
>  #endif /* CONFIG_DEBUG_PER_CPU_MAPS */
> +}
> +
> +/* verify cpu argument to cpumask_* operators */
> +static inline unsigned int cpumask_check(unsigned int cpu)
> +{
> +       cpu_max_bits_warn(cpu, nr_cpumask_bits);
>         return cpu;
>  }
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 3ec9850..c534f03 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -730,10 +730,15 @@ struct xps_map {
>   */
>  struct xps_dev_maps {
>         struct rcu_head rcu;
> -       struct xps_map __rcu *cpu_map[0];
> +       struct xps_map __rcu *attr_map[0]; /* Either CPUs map or RXQs map */
>  };
> -#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +         \
> +
> +#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +     \
>         (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
> +
> +#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
> +       (_rxqs * (_tcs) * sizeof(struct xps_map *)))
> +
>  #endif /* CONFIG_XPS */
>
>  #define TC_MAX_QUEUE   16
> @@ -1909,7 +1914,8 @@ struct net_device {
>         int                     watchdog_timeo;
>
>  #ifdef CONFIG_XPS
> -       struct xps_dev_maps __rcu *xps_maps;
> +       struct xps_dev_maps __rcu *xps_cpus_map;
> +       struct xps_dev_maps __rcu *xps_rxqs_map;
>  #endif
>  #ifdef CONFIG_NET_CLS_ACT
>         struct mini_Qdisc __rcu *miniq_egress;
> @@ -3258,6 +3264,94 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
>  #ifdef CONFIG_XPS
>  int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>                         u16 index);
> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
> +                         u16 index, bool is_rxqs_map);
> +
> +/**
> + *     attr_test_mask - Test a CPU or Rx queue set in a cpumask/rx queues mask
> + *     @j: CPU/Rx queue index
> + *     @mask: bitmask of all cpus/rx queues
> + *     @nr_bits: number of bits in the bitmask
> + *
> + * Test if a CPU or Rx queue index is set in a mask of all CPU/Rx queues.
> + */
> +static inline bool attr_test_mask(unsigned long j, const unsigned long *mask,
> +                                 unsigned int nr_bits)
> +{
> +       cpu_max_bits_warn(j, nr_bits);
> +       return test_bit(j, mask);
> +}
> +
> +/**
> + *     attr_test_online - Test for online CPU/Rx queue
> + *     @j: CPU/Rx queue index
> + *     @online_mask: bitmask for CPUs/Rx queues that are online
> + *     @nr_bits: number of bits in the bitmask
> + *
> + * Returns true if a CPU/Rx queue is online.
> + */
> +static inline bool attr_test_online(unsigned long j,
> +                                   const unsigned long *online_mask,
> +                                   unsigned int nr_bits)
> +{
> +       cpu_max_bits_warn(j, nr_bits);
> +
> +       if (online_mask)
> +               return test_bit(j, online_mask);
> +
> +       if (j >= 0 && j < nr_bits)

j is unsigned so j >= 0 is superfluous.

> +               return true;
> +
> +       return false;

Could just do:

return (j < nr_bits);

> +}
> +
> +/**
> + *     attrmask_next - get the next CPU/Rx queue in a cpumask/Rx queues mask
> + *     @n: CPU/Rx queue index
> + *     @srcp: the cpumask/Rx queue mask pointer
> + *     @nr_bits: number of bits in the bitmask
> + *
> + * Returns >= nr_bits if no further CPUs/Rx queues set.
> + */
> +static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
> +                                        unsigned int nr_bits)
> +{
> +       /* -1 is a legal arg here. */
> +       if (n != -1)
> +               cpu_max_bits_warn(n, nr_bits);
> +
> +       if (srcp)
> +               return find_next_bit(srcp, nr_bits, n + 1);
> +
> +       return n + 1;
> +}
> +
> +/**
> + *     attrmask_next_and - get the next CPU/Rx queue in *src1p & *src2p
> + *     @n: CPU/Rx queue index
> + *     @src1p: the first CPUs/Rx queues mask pointer
> + *     @src2p: the second CPUs/Rx queues mask pointer
> + *     @nr_bits: number of bits in the bitmask
> + *
> + * Returns >= nr_bits if no further CPUs/Rx queues set in both.
> + */
> +static inline int attrmask_next_and(int n, const unsigned long *src1p,
> +                                   const unsigned long *src2p,
> +                                   unsigned int nr_bits)
> +{
> +       /* -1 is a legal arg here. */
> +       if (n != -1)
> +               cpu_max_bits_warn(n, nr_bits);
> +
> +       if (src1p && src2p)
> +               return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
> +       else if (src1p)
> +               return find_next_bit(src1p, nr_bits, n + 1);
> +       else if (src2p)
> +               return find_next_bit(src2p, nr_bits, n + 1);
> +
> +       return n + 1;
> +}
>  #else
>  static inline int netif_set_xps_queue(struct net_device *dev,
>                                       const struct cpumask *mask,
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a5aa1c7..2552556 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2092,7 +2092,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
>         int pos;
>
>         if (dev_maps)
> -               map = xmap_dereference(dev_maps->cpu_map[tci]);
> +               map = xmap_dereference(dev_maps->attr_map[tci]);
>         if (!map)
>                 return false;
>
> @@ -2105,7 +2105,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
>                         break;
>                 }
>
> -               RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL);
> +               RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
>                 kfree_rcu(map, rcu);
>                 return false;
>         }
> @@ -2135,31 +2135,58 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
>         return active;
>  }
>
> +static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
> +                          struct xps_dev_maps *dev_maps, unsigned int nr_ids,
> +                          u16 offset, u16 count, bool is_rxqs_map)
> +{
> +       bool active = false;
> +       int i, j;
> +
> +       for (j = -1; j = attrmask_next(j, mask, nr_ids),
> +            j < nr_ids;)
> +               active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
> +                                              count);
> +       if (!active) {
> +               if (is_rxqs_map) {
> +                       RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
> +               } else {
> +                       RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
> +
> +                       for (i = offset + (count - 1); count--; i--)
> +                               netdev_queue_numa_node_write(
> +                                       netdev_get_tx_queue(dev, i),
> +                                                       NUMA_NO_NODE);
> +               }
> +               kfree_rcu(dev_maps, rcu);
> +       }
> +}
> +
>  static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
>                                    u16 count)
>  {
> +       const unsigned long *possible_mask = NULL;
>         struct xps_dev_maps *dev_maps;
> -       int cpu, i;
> -       bool active = false;
> +       unsigned int nr_ids;
>
>         mutex_lock(&xps_map_mutex);
> -       dev_maps = xmap_dereference(dev->xps_maps);
>
> -       if (!dev_maps)
> -               goto out_no_maps;
> -
> -       for_each_possible_cpu(cpu)
> -               active |= remove_xps_queue_cpu(dev, dev_maps, cpu,
> -                                              offset, count);
> +       dev_maps = xmap_dereference(dev->xps_rxqs_map);
> +       if (dev_maps) {
> +               nr_ids = dev->num_rx_queues;
> +               clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
> +                              count, true);
>
> -       if (!active) {
> -               RCU_INIT_POINTER(dev->xps_maps, NULL);
> -               kfree_rcu(dev_maps, rcu);
>         }
>
> -       for (i = offset + (count - 1); count--; i--)
> -               netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
> -                                            NUMA_NO_NODE);
> +       dev_maps = xmap_dereference(dev->xps_cpus_map);
> +       if (!dev_maps)
> +               goto out_no_maps;
> +
> +       if (num_possible_cpus() > 1)
> +               possible_mask = cpumask_bits(cpu_possible_mask);
> +       nr_ids = nr_cpu_ids;
> +       clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, count,
> +                      false);
>
>  out_no_maps:
>         mutex_unlock(&xps_map_mutex);
> @@ -2170,8 +2197,8 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
>         netif_reset_xps_queues(dev, index, dev->num_tx_queues - index);
>  }
>
> -static struct xps_map *expand_xps_map(struct xps_map *map,
> -                                     int cpu, u16 index)
> +static struct xps_map *expand_xps_map(struct xps_map *map, int attr_index,
> +                                     u16 index, bool is_rxqs_map)
>  {
>         struct xps_map *new_map;
>         int alloc_len = XPS_MIN_MAP_ALLOC;
> @@ -2183,7 +2210,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
>                 return map;
>         }
>
> -       /* Need to add queue to this CPU's existing map */
> +       /* Need to add tx-queue to this CPU's/rx-queue's existing map */
>         if (map) {
>                 if (pos < map->alloc_len)
>                         return map;
> @@ -2191,9 +2218,14 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
>                 alloc_len = map->alloc_len * 2;
>         }
>
> -       /* Need to allocate new map to store queue on this CPU's map */
> -       new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
> -                              cpu_to_node(cpu));
> +       /* Need to allocate new map to store tx-queue on this CPU's/rx-queue's
> +        *  map
> +        */
> +       if (is_rxqs_map)
> +               new_map = kzalloc(XPS_MAP_SIZE(alloc_len), GFP_KERNEL);
> +       else
> +               new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
> +                                      cpu_to_node(attr_index));
>         if (!new_map)
>                 return NULL;
>
> @@ -2205,14 +2237,16 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
>         return new_map;
>  }
>
> -int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> -                       u16 index)
> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
> +                         u16 index, bool is_rxqs_map)
>  {
> +       const unsigned long *online_mask = NULL, *possible_mask = NULL;
>         struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
> -       int i, cpu, tci, numa_node_id = -2;
> +       int i, j, tci, numa_node_id = -2;
>         int maps_sz, num_tc = 1, tc = 0;
>         struct xps_map *map, *new_map;
>         bool active = false;
> +       unsigned int nr_ids;
>
>         if (dev->num_tc) {
>                 num_tc = dev->num_tc;
> @@ -2221,16 +2255,27 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>                         return -EINVAL;
>         }
>
> -       maps_sz = XPS_DEV_MAPS_SIZE(num_tc);
> -       if (maps_sz < L1_CACHE_BYTES)
> -               maps_sz = L1_CACHE_BYTES;
> -
>         mutex_lock(&xps_map_mutex);
> +       if (is_rxqs_map) {
> +               maps_sz = XPS_RXQ_DEV_MAPS_SIZE(num_tc, dev->num_rx_queues);
> +               dev_maps = xmap_dereference(dev->xps_rxqs_map);
> +               nr_ids = dev->num_rx_queues;
> +       } else {
> +               maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
> +               if (num_possible_cpus() > 1) {
> +                       online_mask = cpumask_bits(cpu_online_mask);
> +                       possible_mask = cpumask_bits(cpu_possible_mask);
> +               }
> +               dev_maps = xmap_dereference(dev->xps_cpus_map);
> +               nr_ids = nr_cpu_ids;
> +       }
>
> -       dev_maps = xmap_dereference(dev->xps_maps);
> +       if (maps_sz < L1_CACHE_BYTES)
> +               maps_sz = L1_CACHE_BYTES;
>
>         /* allocate memory for queue storage */
> -       for_each_cpu_and(cpu, cpu_online_mask, mask) {
> +       for (j = -1; j = attrmask_next_and(j, online_mask, mask, nr_ids),
> +            j < nr_ids;) {
>                 if (!new_dev_maps)
>                         new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
>                 if (!new_dev_maps) {
> @@ -2238,73 +2283,81 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>                         return -ENOMEM;
>                 }
>
> -               tci = cpu * num_tc + tc;
> -               map = dev_maps ? xmap_dereference(dev_maps->cpu_map[tci]) :
> +               tci = j * num_tc + tc;
> +               map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
>                                  NULL;
>
> -               map = expand_xps_map(map, cpu, index);
> +               map = expand_xps_map(map, j, index, is_rxqs_map);
>                 if (!map)
>                         goto error;
>
> -               RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> +               RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>         }
>
>         if (!new_dev_maps)
>                 goto out_no_new_maps;
>
> -       for_each_possible_cpu(cpu) {
> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> +            j < nr_ids;) {
>                 /* copy maps belonging to foreign traffic classes */
> -               for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) {
> +               for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
>                         /* fill in the new device map from the old device map */
> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
> -                       RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
> +                       RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>                 }
>
>                 /* We need to explicitly update tci as prevous loop
>                  * could break out early if dev_maps is NULL.
>                  */
> -               tci = cpu * num_tc + tc;
> +               tci = j * num_tc + tc;
>
> -               if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu)) {
> -                       /* add queue to CPU maps */
> +               if (attr_test_mask(j, mask, nr_ids) &&
> +                   attr_test_online(j, online_mask, nr_ids)) {
> +                       /* add tx-queue to CPU/rx-queue maps */
>                         int pos = 0;
>
> -                       map = xmap_dereference(new_dev_maps->cpu_map[tci]);
> +                       map = xmap_dereference(new_dev_maps->attr_map[tci]);
>                         while ((pos < map->len) && (map->queues[pos] != index))
>                                 pos++;
>
>                         if (pos == map->len)
>                                 map->queues[map->len++] = index;
>  #ifdef CONFIG_NUMA
> -                       if (numa_node_id == -2)
> -                               numa_node_id = cpu_to_node(cpu);
> -                       else if (numa_node_id != cpu_to_node(cpu))
> -                               numa_node_id = -1;

Seems like there should be a comment here about meaning of -2 and -1
in NUMA node. Better yet, seems like there should be constants defined
for these special values. Maybe something to clean up in the future.

> +                       if (!is_rxqs_map) {
> +                               if (numa_node_id == -2)
> +                                       numa_node_id = cpu_to_node(j);
> +                               else if (numa_node_id != cpu_to_node(j))
> +                                       numa_node_id = -1;
> +                       }
>  #endif
>                 } else if (dev_maps) {
>                         /* fill in the new device map from the old device map */
> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
> -                       RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
> +                       RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>                 }
>
>                 /* copy maps belonging to foreign traffic classes */
>                 for (i = num_tc - tc, tci++; dev_maps && --i; tci++) {
>                         /* fill in the new device map from the old device map */
> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
> -                       RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
> +                       RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>                 }
>         }
>
> -       rcu_assign_pointer(dev->xps_maps, new_dev_maps);
> +       if (is_rxqs_map)
> +               rcu_assign_pointer(dev->xps_rxqs_map, new_dev_maps);
> +       else
> +               rcu_assign_pointer(dev->xps_cpus_map, new_dev_maps);
>
>         /* Cleanup old maps */
>         if (!dev_maps)
>                 goto out_no_old_maps;
>
> -       for_each_possible_cpu(cpu) {
> -               for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
> -                       new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> +            j < nr_ids;) {
> +               for (i = num_tc, tci = j * num_tc; i--; tci++) {
> +                       new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
>                         if (map && map != new_map)
>                                 kfree_rcu(map, rcu);
>                 }
> @@ -2317,19 +2370,23 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>         active = true;
>
>  out_no_new_maps:
> -       /* update Tx queue numa node */
> -       netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
> -                                    (numa_node_id >= 0) ? numa_node_id :
> -                                    NUMA_NO_NODE);
> +       if (!is_rxqs_map) {
> +               /* update Tx queue numa node */
> +               netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
> +                                            (numa_node_id >= 0) ?
> +                                            numa_node_id : NUMA_NO_NODE);
> +       }
>
>         if (!dev_maps)
>                 goto out_no_maps;
>
> -       /* removes queue from unused CPUs */
> -       for_each_possible_cpu(cpu) {
> -               for (i = tc, tci = cpu * num_tc; i--; tci++)
> +       /* removes tx-queue from unused CPUs/rx-queues */
> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> +            j < nr_ids;) {
> +               for (i = tc, tci = j * num_tc; i--; tci++)
>                         active |= remove_xps_queue(dev_maps, tci, index);
> -               if (!cpumask_test_cpu(cpu, mask) || !cpu_online(cpu))
> +               if (!attr_test_mask(j, mask, nr_ids) ||
> +                   !attr_test_online(j, online_mask, nr_ids))
>                         active |= remove_xps_queue(dev_maps, tci, index);
>                 for (i = num_tc - tc, tci++; --i; tci++)
>                         active |= remove_xps_queue(dev_maps, tci, index);
> @@ -2337,7 +2394,10 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>
>         /* free map if not active */
>         if (!active) {
> -               RCU_INIT_POINTER(dev->xps_maps, NULL);
> +               if (is_rxqs_map)
> +                       RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
> +               else
> +                       RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
>                 kfree_rcu(dev_maps, rcu);
>         }
>
> @@ -2347,11 +2407,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>         return 0;
>  error:
>         /* remove any maps that we added */
> -       for_each_possible_cpu(cpu) {
> -               for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
> -                       new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> +            j < nr_ids;) {
> +               for (i = num_tc, tci = j * num_tc; i--; tci++) {
> +                       new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
>                         map = dev_maps ?
> -                             xmap_dereference(dev_maps->cpu_map[tci]) :
> +                             xmap_dereference(dev_maps->attr_map[tci]) :
>                               NULL;
>                         if (new_map && new_map != map)
>                                 kfree(new_map);
> @@ -2363,6 +2424,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>         kfree(new_dev_maps);
>         return -ENOMEM;
>  }
> +
> +int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> +                       u16 index)
> +{
> +       return __netif_set_xps_queue(dev, cpumask_bits(mask), index, false);
> +}
>  EXPORT_SYMBOL(netif_set_xps_queue);
>
>  #endif
> @@ -3384,7 +3451,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>         int queue_index = -1;
>
>         rcu_read_lock();
> -       dev_maps = rcu_dereference(dev->xps_maps);
> +       dev_maps = rcu_dereference(dev->xps_cpus_map);
>         if (dev_maps) {
>                 unsigned int tci = skb->sender_cpu - 1;
>
> @@ -3393,7 +3460,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>                         tci += netdev_get_prio_tc_map(dev, skb->priority);
>                 }
>
> -               map = rcu_dereference(dev_maps->cpu_map[tci]);
> +               map = rcu_dereference(dev_maps->attr_map[tci]);
>                 if (map) {
>                         if (map->len == 1)
>                                 queue_index = map->queues[0];
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index bb7e80f..b39987c 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -1227,13 +1227,13 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
>                 return -ENOMEM;
>
>         rcu_read_lock();
> -       dev_maps = rcu_dereference(dev->xps_maps);
> +       dev_maps = rcu_dereference(dev->xps_cpus_map);
>         if (dev_maps) {
>                 for_each_possible_cpu(cpu) {
>                         int i, tci = cpu * num_tc + tc;
>                         struct xps_map *map;
>
> -                       map = rcu_dereference(dev_maps->cpu_map[tci]);
> +                       map = rcu_dereference(dev_maps->attr_map[tci]);
>                         if (!map)
>                                 continue;
>
>

Acked-by: Tom Herbert <tom@quantonium.net>

^ permalink raw reply

* set_memory_* (was: Re: BUG: unable to handle kernel paging request in bpf_int_jit_compile)
From: Daniel Borkmann @ 2018-06-26 22:53 UTC (permalink / raw)
  To: Ingo Molnar, David Miller
  Cc: tglx, syzbot+a4eb8c7766952a1ca872, ast, hpa, kuznet, linux-kernel,
	mingo, netdev, syzkaller-bugs, x86, yoshfuji, peterz, labbott,
	keescook, torvalds, edumazet
In-Reply-To: <20180624100249.GA9493@gmail.com>

On 06/24/2018 12:02 PM, Ingo Molnar wrote:
> * David Miller <davem@davemloft.net> wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> Date: Sun, 24 Jun 2018 09:09:09 +0200 (CEST)
>>
>>> I'm really tempted to make the BPF config switch depend on BROKEN. 
>>
>> This really isn't necessary Thomas.
>>
>> Whoever wrote the code didn't understand that set ro can legitimately
>> fail.
> 
> No, that's *NOT* the only thing that happened, according to the Git history.
> 
> The first use of set_memory_ro() in include/linux/filter.h was added by
> this commit almost four years ago:
> 
>   # 2014/09
>   60a3b2253c41 ("net: bpf: make eBPF interpreter images read-only")
> 
> ... and yes, that commit didn't anticipate the (in hindsight) obvious property of 
> a function that changes global kernel mappings that if it is used after bootup 
> without locking it 'may fail'. So that commit slipping through is 'shit happens' 
> and I don't think we ever complained about such things slipping through.

Hmm, back then I adapted the code similar from 314beb9bcabf ("x86: bpf_jit_comp:
secure bpf jit against spraying attacks") for interpreter images as well, and
from grepping through the kernel code none of the callers of set_memory_{ro,rw}()
at that time (& now except bpf) did check for the return code (e.g. module_enable_ro()
and module_disable_ro() as one example which could happen late after bootup has
finished when pulling in modules on the fly).

I did made the mistake in 9facc336876f ("bpf: reject any prog that failed read-only
lock") assuming that after the set_memory_ro() call it would either succeed or
it would not, but not leaving us in a state in the middle. That was silly assumption
and I'll fix this up in bpf, very sorry about that! I've been debugging the syzkaller
BUG at [1] and noticed that even though set_memory_ro() failed with an error, doing
a probe_kernel_write() on it afterwards failed with EFAULT, meaning the module_alloc()
memory was however set to read-only at that point triggering later the BUG when
attempting to change its memory (at least on the virtual mem). From debugging output,
it was a single 4k page and on x86_64 in the __change_page_attr_set_clr() we failed
in the cpa_process_alias() where the syzkaller fault injection happened. So latter
failure from cpa_process_alias() came from call to __change_page_attr_set_clr() with
primary to 0, where it tried to split a large page in __change_page_attr() but failed
in alloc_pages() thus returning the -ENOMEM from there. Testing subsequent undoing
via set_memory_rw() made it writable again, though.

In any case, for pairs like set_memory_ro() + set_memory_rw() that are also used
outside of bpf e.g. STRICT_MODULE_RWX and friends which are mostly default these
days for some archs, is the choice to not check errors from there by design or from
historical context that it originated from 'debugging code' in that sense (DEBUG_RODATA /
DEBUG_SET_MODULE_RONX) earlier? Also if no-one checks for errors (and if that would
infact be the recommendation it is agreed upon) should the API be changed to void,
or generally should actual error checking occur on these + potential rollback; but
then question is what about restoring part from prior set_memory_ro() via set_memory_rw()?
Kees/others, do you happen to have some more context on recommended use around this
by any chance? (Would probably also help if we add some doc around assumptions into
include/linux/set_memory.h for future users.)

Thanks a lot,
Daniel

  [1] https://syzkaller.appspot.com/bug?extid=a4eb8c7766952a1ca872

^ permalink raw reply

* Re: [net-next PATCH v4 2/7] net: Use static_key for XPS maps
From: Tom Herbert @ 2018-06-26 22:54 UTC (permalink / raw)
  To: Amritha Nambiar
  Cc: Linux Kernel Network Developers, David S. Miller, Alexander Duyck,
	Willem de Bruijn, Sridhar Samudrala, Alexander Duyck,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <152994986435.9733.17031817483990964500.stgit@anamhost.jf.intel.com>

On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
> Use static_key for XPS maps to reduce the cost of extra map checks,
> similar to how it is used for RPS and RFS. This includes static_key
> 'xps_needed' for XPS and another for 'xps_rxqs_needed' for XPS using
> Rx queues map.
>

Acked-by: Tom Herbert <tom@quantonium.net>

> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
> ---
>  net/core/dev.c |   26 ++++++++++++++++++++------
>  1 file changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 2552556..df2a78d 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2081,6 +2081,10 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
>  EXPORT_SYMBOL(netdev_txq_to_tc);
>
>  #ifdef CONFIG_XPS
> +struct static_key xps_needed __read_mostly;
> +EXPORT_SYMBOL(xps_needed);
> +struct static_key xps_rxqs_needed __read_mostly;
> +EXPORT_SYMBOL(xps_rxqs_needed);
>  static DEFINE_MUTEX(xps_map_mutex);
>  #define xmap_dereference(P)            \
>         rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
> @@ -2170,12 +2174,14 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
>
>         mutex_lock(&xps_map_mutex);
>
> -       dev_maps = xmap_dereference(dev->xps_rxqs_map);
> -       if (dev_maps) {
> -               nr_ids = dev->num_rx_queues;
> -               clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
> -                              count, true);
> -
> +       if (static_key_false(&xps_rxqs_needed)) {
> +               dev_maps = xmap_dereference(dev->xps_rxqs_map);
> +               if (dev_maps) {
> +                       nr_ids = dev->num_rx_queues;
> +                       clean_xps_maps(dev, possible_mask, dev_maps, nr_ids,
> +                                      offset, count, true);
> +               }
> +               static_key_slow_dec(&xps_rxqs_needed);
>         }
>
>         dev_maps = xmap_dereference(dev->xps_cpus_map);
> @@ -2189,6 +2195,7 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
>                        false);
>
>  out_no_maps:
> +       static_key_slow_dec(&xps_needed);
>         mutex_unlock(&xps_map_mutex);
>  }
>
> @@ -2297,6 +2304,10 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
>         if (!new_dev_maps)
>                 goto out_no_new_maps;
>
> +       static_key_slow_inc(&xps_needed);
> +       if (is_rxqs_map)
> +               static_key_slow_inc(&xps_rxqs_needed);
> +
>         for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
>              j < nr_ids;) {
>                 /* copy maps belonging to foreign traffic classes */
> @@ -3450,6 +3461,9 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>         struct xps_map *map;
>         int queue_index = -1;
>
> +       if (!static_key_false(&xps_needed))
> +               return -1;
> +
>         rcu_read_lock();
>         dev_maps = rcu_dereference(dev->xps_cpus_map);
>         if (dev_maps) {
>

^ permalink raw reply

* Re: [net-next PATCH v4 7/7] Documentation: Add explanation for XPS using Rx-queue(s) map
From: Tom Herbert @ 2018-06-26 22:59 UTC (permalink / raw)
  To: Amritha Nambiar
  Cc: Linux Kernel Network Developers, David S. Miller, Alexander Duyck,
	Willem de Bruijn, Sridhar Samudrala, Alexander Duyck,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <152994989181.9733.4112250516573561282.stgit@anamhost.jf.intel.com>

On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>

Acked-by: Tom Herbert <tom@quantonium.net>

> ---
>  Documentation/ABI/testing/sysfs-class-net-queues |   11 ++++
>  Documentation/networking/scaling.txt             |   57 ++++++++++++++++++----
>  2 files changed, 58 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-class-net-queues b/Documentation/ABI/testing/sysfs-class-net-queues
> index 0c0df91..978b763 100644
> --- a/Documentation/ABI/testing/sysfs-class-net-queues
> +++ b/Documentation/ABI/testing/sysfs-class-net-queues
> @@ -42,6 +42,17 @@ Description:
>                 network device transmit queue. Possible vaules depend on the
>                 number of available CPU(s) in the system.
>
> +What:          /sys/class/<iface>/queues/tx-<queue>/xps_rxqs
> +Date:          June 2018
> +KernelVersion: 4.18.0
> +Contact:       netdev@vger.kernel.org
> +Description:
> +               Mask of the receive queue(s) currently enabled to participate
> +               into the Transmit Packet Steering packet processing flow for this
> +               network device transmit queue. Possible values depend on the
> +               number of available receive queue(s) in the network device.
> +               Default is disabled.
> +
>  What:          /sys/class/<iface>/queues/tx-<queue>/byte_queue_limits/hold_time
>  Date:          November 2011
>  KernelVersion: 3.3
> diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
> index f55639d..8336116 100644
> --- a/Documentation/networking/scaling.txt
> +++ b/Documentation/networking/scaling.txt
> @@ -366,8 +366,13 @@ XPS: Transmit Packet Steering
>
>  Transmit Packet Steering is a mechanism for intelligently selecting
>  which transmit queue to use when transmitting a packet on a multi-queue
> -device. To accomplish this, a mapping from CPU to hardware queue(s) is
> -recorded. The goal of this mapping is usually to assign queues
> +device. This can be accomplished by recording two kinds of maps, either
> +a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
> +to hardware transmit queue(s).
> +
> +1. XPS using CPUs map
> +
> +The goal of this mapping is usually to assign queues
>  exclusively to a subset of CPUs, where the transmit completions for
>  these queues are processed on a CPU within this set. This choice
>  provides two benefits. First, contention on the device queue lock is
> @@ -377,12 +382,35 @@ transmit queue). Secondly, cache miss rate on transmit completion is
>  reduced, in particular for data cache lines that hold the sk_buff
>  structures.
>
> -XPS is configured per transmit queue by setting a bitmap of CPUs that
> -may use that queue to transmit. The reverse mapping, from CPUs to
> -transmit queues, is computed and maintained for each network device.
> -When transmitting the first packet in a flow, the function
> -get_xps_queue() is called to select a queue. This function uses the ID
> -of the running CPU as a key into the CPU-to-queue lookup table. If the
> +2. XPS using receive queues map
> +
> +This mapping is used to pick transmit queue based on the receive
> +queue(s) map configuration set by the administrator. A set of receive
> +queues can be mapped to a set of transmit queues (many:many), although
> +the common use case is a 1:1 mapping. This will enable sending packets
> +on the same queue associations for transmit and receive. This is useful for
> +busy polling multi-threaded workloads where there are challenges in
> +associating a given CPU to a given application thread. The application
> +threads are not pinned to CPUs and each thread handles packets
> +received on a single queue. The receive queue number is cached in the
> +socket for the connection. In this model, sending the packets on the same
> +transmit queue corresponding to the associated receive queue has benefits
> +in keeping the CPU overhead low. Transmit completion work is locked into
> +the same queue-association that a given application is polling on. This
> +avoids the overhead of triggering an interrupt on another CPU. When the
> +application cleans up the packets during the busy poll, transmit completion
> +may be processed along with it in the same thread context and so result in
> +reduced latency.
> +
> +XPS is configured per transmit queue by setting a bitmap of
> +CPUs/receive-queues that may use that queue to transmit. The reverse
> +mapping, from CPUs to transmit queues or from receive-queues to transmit
> +queues, is computed and maintained for each network device. When
> +transmitting the first packet in a flow, the function get_xps_queue() is
> +called to select a queue. This function uses the ID of the receive queue
> +for the socket connection for a match in the receive queue-to-transmit queue
> +lookup table. Alternatively, this function can also use the ID of the
> +running CPU as a key into the CPU-to-queue lookup table. If the
>  ID matches a single queue, that is used for transmission. If multiple
>  queues match, one is selected by using the flow hash to compute an index
>  into the set.
> @@ -404,11 +432,15 @@ acknowledged.
>
>  XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
>  default for SMP). The functionality remains disabled until explicitly
> -configured. To enable XPS, the bitmap of CPUs that may use a transmit
> -queue is configured using the sysfs file entry:
> +configured. To enable XPS, the bitmap of CPUs/receive-queues that may
> +use a transmit queue is configured using the sysfs file entry:
>
> +For selection based on CPUs map:
>  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
>
> +For selection based on receive-queues map:
> +/sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
> +
>  == Suggested Configuration
>
>  For a network device with a single transmission queue, XPS configuration
> @@ -421,6 +453,11 @@ best CPUs to share a given queue are probably those that share the cache
>  with the CPU that processes transmit completions for that queue
>  (transmit interrupts).
>
> +For transmit queue selection based on receive queue(s), XPS has to be
> +explicitly configured mapping receive-queue(s) to transmit queue(s). If the
> +user configuration for receive-queue map does not apply, then the transmit
> +queue is selected based on the CPUs map.
> +
>  Per TX Queue rate limitation:
>  =============================
>
>

^ permalink raw reply

* [PATCH v3 bpf-net] bpf: Change bpf_fib_lookup to return lookup status
From: dsahern @ 2018-06-26 23:21 UTC (permalink / raw)
  To: netdev, borkmann, ast; +Cc: kafai, brouer, David Ahern

From: David Ahern <dsahern@gmail.com>

For ACLs implemented using either FIB rules or FIB entries, the BPF
program needs the FIB lookup status to be able to drop the packet.
Since the bpf_fib_lookup API has not reached a released kernel yet,
change the return code to contain an encoding of the FIB lookup
result and return the nexthop device index in the params struct.

In addition, inform the BPF program of any post FIB lookup reason as
to why the packet needs to go up the stack.

The fib result for unicast routes must have an egress device, so remove
the check that it is non-NULL.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
v3
- fixup bpf_skb_fib_lookup for the change in API

v2
- drop BPF_FIB_LKUP_RET_NO_NHDEV; check in dev in fib result not needed
- enhance documentation of BPF_FIB_LKUP_RET_ codes
---
 include/uapi/linux/bpf.h   | 28 ++++++++++++---
 net/core/filter.c          | 86 +++++++++++++++++++++++++++++-----------------
 samples/bpf/xdp_fwd_kern.c |  8 ++---
 3 files changed, 81 insertions(+), 41 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59b19b6a40d7..b7db3261c62d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1857,7 +1857,8 @@ union bpf_attr {
  *		is resolved), the nexthop address is returned in ipv4_dst
  *		or ipv6_dst based on family, smac is set to mac address of
  *		egress device, dmac is set to nexthop mac address, rt_metric
- *		is set to metric from route (IPv4/IPv6 only).
+ *		is set to metric from route (IPv4/IPv6 only), and ifindex
+ *		is set to the device index of the nexthop from the FIB lookup.
  *
  *             *plen* argument is the size of the passed in struct.
  *             *flags* argument can be a combination of one or more of the
@@ -1873,9 +1874,10 @@ union bpf_attr {
  *             *ctx* is either **struct xdp_md** for XDP programs or
  *             **struct sk_buff** tc cls_act programs.
  *     Return
- *             Egress device index on success, 0 if packet needs to continue
- *             up the stack for further processing or a negative error in case
- *             of failure.
+ *		* < 0 if any input argument is invalid
+ *		*   0 on success (packet is forwarded, nexthop neighbor exists)
+ *		* > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
+ *		*     packet is not forwarded or needs assist from full stack
  *
  * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags)
  *	Description
@@ -2612,6 +2614,18 @@ struct bpf_raw_tracepoint_args {
 #define BPF_FIB_LOOKUP_DIRECT  BIT(0)
 #define BPF_FIB_LOOKUP_OUTPUT  BIT(1)
 
+enum {
+	BPF_FIB_LKUP_RET_SUCCESS,      /* lookup successful */
+	BPF_FIB_LKUP_RET_BLACKHOLE,    /* dest is blackholed; can be dropped */
+	BPF_FIB_LKUP_RET_UNREACHABLE,  /* dest is unreachable; can be dropped */
+	BPF_FIB_LKUP_RET_PROHIBIT,     /* dest not allowed; can be dropped */
+	BPF_FIB_LKUP_RET_NOT_FWDED,    /* packet is not forwarded */
+	BPF_FIB_LKUP_RET_FWD_DISABLED, /* fwding is not enabled on ingress */
+	BPF_FIB_LKUP_RET_UNSUPP_LWT,   /* fwd requires encapsulation */
+	BPF_FIB_LKUP_RET_NO_NEIGH,     /* no neighbor entry for nh */
+	BPF_FIB_LKUP_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+};
+
 struct bpf_fib_lookup {
 	/* input:  network family for lookup (AF_INET, AF_INET6)
 	 * output: network family of egress nexthop
@@ -2625,7 +2639,11 @@ struct bpf_fib_lookup {
 
 	/* total length of packet from network header - used for MTU check */
 	__u16	tot_len;
-	__u32	ifindex;  /* L3 device index for lookup */
+
+	/* input: L3 device index for lookup
+	 * output: device index from FIB lookup
+	 */
+	__u32	ifindex;
 
 	union {
 		/* inputs to lookup */
diff --git a/net/core/filter.c b/net/core/filter.c
index e7f12e9f598c..0ca6907d7efe 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4073,8 +4073,9 @@ static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
 	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
 	params->h_vlan_TCI = 0;
 	params->h_vlan_proto = 0;
+	params->ifindex = dev->ifindex;
 
-	return dev->ifindex;
+	return 0;
 }
 #endif
 
@@ -4098,7 +4099,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	/* verify forwarding is enabled on this interface */
 	in_dev = __in_dev_get_rcu(dev);
 	if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
-		return 0;
+		return BPF_FIB_LKUP_RET_FWD_DISABLED;
 
 	if (flags & BPF_FIB_LOOKUP_OUTPUT) {
 		fl4.flowi4_iif = 1;
@@ -4123,7 +4124,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 		tb = fib_get_table(net, tbid);
 		if (unlikely(!tb))
-			return 0;
+			return BPF_FIB_LKUP_RET_NOT_FWDED;
 
 		err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF);
 	} else {
@@ -4135,8 +4136,20 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 		err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_NOREF);
 	}
 
-	if (err || res.type != RTN_UNICAST)
-		return 0;
+	if (err) {
+		/* map fib lookup errors to RTN_ type */
+		if (err == -EINVAL)
+			return BPF_FIB_LKUP_RET_BLACKHOLE;
+		if (err == -EHOSTUNREACH)
+			return BPF_FIB_LKUP_RET_UNREACHABLE;
+		if (err == -EACCES)
+			return BPF_FIB_LKUP_RET_PROHIBIT;
+
+		return BPF_FIB_LKUP_RET_NOT_FWDED;
+	}
+
+	if (res.type != RTN_UNICAST)
+		return BPF_FIB_LKUP_RET_NOT_FWDED;
 
 	if (res.fi->fib_nhs > 1)
 		fib_select_path(net, &res, &fl4, NULL);
@@ -4144,19 +4157,16 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (check_mtu) {
 		mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
 		if (params->tot_len > mtu)
-			return 0;
+			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
 	}
 
 	nh = &res.fi->fib_nh[res.nh_sel];
 
 	/* do not handle lwt encaps right now */
 	if (nh->nh_lwtstate)
-		return 0;
+		return BPF_FIB_LKUP_RET_UNSUPP_LWT;
 
 	dev = nh->nh_dev;
-	if (unlikely(!dev))
-		return 0;
-
 	if (nh->nh_gw)
 		params->ipv4_dst = nh->nh_gw;
 
@@ -4166,10 +4176,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	 * rcu_read_lock_bh is not needed here
 	 */
 	neigh = __ipv4_neigh_lookup_noref(dev, (__force u32)params->ipv4_dst);
-	if (neigh)
-		return bpf_fib_set_fwd_params(params, neigh, dev);
+	if (!neigh)
+		return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-	return 0;
+	return bpf_fib_set_fwd_params(params, neigh, dev);
 }
 #endif
 
@@ -4190,7 +4200,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	/* link local addresses are never forwarded */
 	if (rt6_need_strict(dst) || rt6_need_strict(src))
-		return 0;
+		return BPF_FIB_LKUP_RET_NOT_FWDED;
 
 	dev = dev_get_by_index_rcu(net, params->ifindex);
 	if (unlikely(!dev))
@@ -4198,7 +4208,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 	idev = __in6_dev_get_safely(dev);
 	if (unlikely(!idev || !net->ipv6.devconf_all->forwarding))
-		return 0;
+		return BPF_FIB_LKUP_RET_FWD_DISABLED;
 
 	if (flags & BPF_FIB_LOOKUP_OUTPUT) {
 		fl6.flowi6_iif = 1;
@@ -4225,7 +4235,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 		tb = ipv6_stub->fib6_get_table(net, tbid);
 		if (unlikely(!tb))
-			return 0;
+			return BPF_FIB_LKUP_RET_NOT_FWDED;
 
 		f6i = ipv6_stub->fib6_table_lookup(net, tb, oif, &fl6, strict);
 	} else {
@@ -4238,11 +4248,23 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	}
 
 	if (unlikely(IS_ERR_OR_NULL(f6i) || f6i == net->ipv6.fib6_null_entry))
-		return 0;
+		return BPF_FIB_LKUP_RET_NOT_FWDED;
+
+	if (unlikely(f6i->fib6_flags & RTF_REJECT)) {
+		switch (f6i->fib6_type) {
+		case RTN_BLACKHOLE:
+			return BPF_FIB_LKUP_RET_BLACKHOLE;
+		case RTN_UNREACHABLE:
+			return BPF_FIB_LKUP_RET_UNREACHABLE;
+		case RTN_PROHIBIT:
+			return BPF_FIB_LKUP_RET_PROHIBIT;
+		default:
+			return BPF_FIB_LKUP_RET_NOT_FWDED;
+		}
+	}
 
-	if (unlikely(f6i->fib6_flags & RTF_REJECT ||
-	    f6i->fib6_type != RTN_UNICAST))
-		return 0;
+	if (f6i->fib6_type != RTN_UNICAST)
+		return BPF_FIB_LKUP_RET_NOT_FWDED;
 
 	if (f6i->fib6_nsiblings && fl6.flowi6_oif == 0)
 		f6i = ipv6_stub->fib6_multipath_select(net, f6i, &fl6,
@@ -4252,11 +4274,11 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (check_mtu) {
 		mtu = ipv6_stub->ip6_mtu_from_fib6(f6i, dst, src);
 		if (params->tot_len > mtu)
-			return 0;
+			return BPF_FIB_LKUP_RET_FRAG_NEEDED;
 	}
 
 	if (f6i->fib6_nh.nh_lwtstate)
-		return 0;
+		return BPF_FIB_LKUP_RET_UNSUPP_LWT;
 
 	if (f6i->fib6_flags & RTF_GATEWAY)
 		*dst = f6i->fib6_nh.nh_gw;
@@ -4270,10 +4292,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	 */
 	neigh = ___neigh_lookup_noref(ipv6_stub->nd_tbl, neigh_key_eq128,
 				      ndisc_hashfn, dst, dev);
-	if (neigh)
-		return bpf_fib_set_fwd_params(params, neigh, dev);
+	if (!neigh)
+		return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-	return 0;
+	return bpf_fib_set_fwd_params(params, neigh, dev);
 }
 #endif
 
@@ -4315,7 +4337,7 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
 {
 	struct net *net = dev_net(skb->dev);
-	int index = -EAFNOSUPPORT;
+	int rc = -EAFNOSUPPORT;
 
 	if (plen < sizeof(*params))
 		return -EINVAL;
@@ -4326,25 +4348,25 @@ BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	switch (params->family) {
 #if IS_ENABLED(CONFIG_INET)
 	case AF_INET:
-		index = bpf_ipv4_fib_lookup(net, params, flags, false);
+		rc = bpf_ipv4_fib_lookup(net, params, flags, false);
 		break;
 #endif
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
-		index = bpf_ipv6_fib_lookup(net, params, flags, false);
+		rc = bpf_ipv6_fib_lookup(net, params, flags, false);
 		break;
 #endif
 	}
 
-	if (index > 0) {
+	if (!rc) {
 		struct net_device *dev;
 
-		dev = dev_get_by_index_rcu(net, index);
+		dev = dev_get_by_index_rcu(net, params->ifindex);
 		if (!is_skb_forwardable(dev, skb))
-			index = 0;
+			rc = BPF_FIB_LKUP_RET_FRAG_NEEDED;
 	}
 
-	return index;
+	return rc;
 }
 
 static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c
index 6673cdb9f55c..a7e94e7ff87d 100644
--- a/samples/bpf/xdp_fwd_kern.c
+++ b/samples/bpf/xdp_fwd_kern.c
@@ -48,9 +48,9 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
 	struct ethhdr *eth = data;
 	struct ipv6hdr *ip6h;
 	struct iphdr *iph;
-	int out_index;
 	u16 h_proto;
 	u64 nh_off;
+	int rc;
 
 	nh_off = sizeof(*eth);
 	if (data + nh_off > data_end)
@@ -101,7 +101,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
 
 	fib_params.ifindex = ctx->ingress_ifindex;
 
-	out_index = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
+	rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
 
 	/* verify egress index has xdp support
 	 * TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with
@@ -109,7 +109,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
 	 * NOTE: without verification that egress index supports XDP
 	 *       forwarding packets are dropped.
 	 */
-	if (out_index > 0) {
+	if (rc == 0) {
 		if (h_proto == htons(ETH_P_IP))
 			ip_decrease_ttl(iph);
 		else if (h_proto == htons(ETH_P_IPV6))
@@ -117,7 +117,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
 
 		memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN);
 		memcpy(eth->h_source, fib_params.smac, ETH_ALEN);
-		return bpf_redirect_map(&tx_port, out_index, 0);
+		return bpf_redirect_map(&tx_port, fib_params.ifindex, 0);
 	}
 
 	return XDP_PASS;
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Flavio Leitner @ 2018-06-26 23:33 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, Linux Kernel Network Developers, Paolo Abeni,
	David Miller, Florian Westphal, NetFilter
In-Reply-To: <CAM_iQpU8E5OuXx87Dm+jbqwbkkwETNF_RZh-VnUkF5seFPvv_A@mail.gmail.com>

On Tue, Jun 26, 2018 at 03:47:31PM -0700, Cong Wang wrote:
> On Tue, Jun 26, 2018 at 3:03 PM Flavio Leitner <fbl@redhat.com> wrote:
> >
> > On Tue, Jun 26, 2018 at 02:48:47PM -0700, Cong Wang wrote:
> > > On Mon, Jun 25, 2018 at 11:41 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > When a packet is attached to a socket, we should keep the association as much as possible.
> > >
> > > As much as possible within one stack, I agree. I still don't understand
> > > why we should keep it across the stack boundary.
> > >
> > > > Only when a new association needs to be done, skb_orphan() needs to be called.
> > > >
> > > > Doing this skb_orphan() too soon breaks back pressure in general, this is bad, since a socket
> > > > can evades SO_SNDBUF limits.
> > >
> > > Right before leaving the stack is not too soon, it is the latest
> > > actually, for veth case.
> >
> > Depends on how you view things - it's the same host/stack sharing the
> > same resources, so why should we not keep it?
> 
> Because stacks are supposed to be independent, netdevices are
> isolated, iptables and route tables too. This is how netns is designed
> from the beginning. The trend today is actually more isolation instead
> of more sharing, given the popularity of containers.

It is still isolated, the sk carries the netns info and it is
orphaned when it re-enters the stack.

-- 
Flavio

^ permalink raw reply

* Re: Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-26 23:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, Cornelia Huck, Alexander Duyck, virtio-dev,
	aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
	virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
	Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180626044650-mutt-send-email-mst@kernel.org>

On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
>> > > > > Might not neccessarily be something wrong, but it's very limited to
>> > > > > prohibit the MAC of VF from changing when enslaved by failover.
>> > > > You mean guest changing MAC? I'm not sure why we prohibit that.
>> > > I think Sridhar and Jiri might be better person to answer it. My
>> > > impression was that sync'ing the MAC address change between all 3
>> > > devices is challenging, as the failover driver uses MAC address to
>> > > match net_device internally.
>>
>> Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
>> of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
>> synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
>> don't allow changing guest MAC unless it is a trusted VF.
>
> OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> For example I can see host just
> failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
> I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.

That's why I think pairing using MAC is fragile IMHO. When VF's MAC
got changed before virtio attempts to match and pair the device, it
ends up with no pairing found out at all. UUID is better.

-Siwei

>
> --
> MST

^ permalink raw reply

* Re: [bpf-next PATCH 1/2] samples/bpf: extend xdp_rxq_info to read packet payload
From: Song Liu @ 2018-06-26 23:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Networking, Daniel Borkmann, Toke Høiland-Jørgensen,
	Alexei Starovoitov
In-Reply-To: <152993686364.8835.3914229981165096265.stgit@firesoul>

On Mon, Jun 25, 2018 at 7:27 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> There is a cost associated with reading the packet data payload
> that this test ignored.  Add option --read to allow enabling
> reading part of the payload.
>
> This sample/tool helps us analyse an issue observed with a NIC
> mlx5 (ConnectX-5 Ex) and an Intel(R) Xeon(R) CPU E5-1650 v4.
>
> With no_touch of data:
>
> Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP options:no_touch
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       14,465,157  0
> XDP-RX CPU      1       14,464,728  0
> XDP-RX CPU      2       14,465,283  0
> XDP-RX CPU      3       14,465,282  0
> XDP-RX CPU      4       14,464,159  0
> XDP-RX CPU      5       14,465,379  0
> XDP-RX CPU      total   86,789,992
>
> When not touching data, we observe that the CPUs have idle cycles.
> When reading data the CPUs are 100% busy in softirq.
>
> With reading data:
>
> Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP options:read
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       9,620,639   0
> XDP-RX CPU      1       9,489,843   0
> XDP-RX CPU      2       9,407,854   0
> XDP-RX CPU      3       9,422,289   0
> XDP-RX CPU      4       9,321,959   0
> XDP-RX CPU      5       9,395,242   0
> XDP-RX CPU      total   56,657,828
>
> The effect seen above is a result of cache-misses occuring when
> more RXQs are being used.  Based on perf-event observations, our
> conclusion is that the CPUs DDIO (Direct Data I/O) choose to
> deliver packet into main memory, instead of L3-cache.  We also
> found, that this can be mitigated by either using less RXQs or by
> reducing NICs the RX-ring size.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
> ---
>  samples/bpf/xdp_rxq_info_kern.c |   19 +++++++++++++++++++
>  samples/bpf/xdp_rxq_info_user.c |   34 ++++++++++++++++++++++++++++------
>  2 files changed, 47 insertions(+), 6 deletions(-)
>
> diff --git a/samples/bpf/xdp_rxq_info_kern.c b/samples/bpf/xdp_rxq_info_kern.c
> index 3fd209291653..61af6210df2f 100644
> --- a/samples/bpf/xdp_rxq_info_kern.c
> +++ b/samples/bpf/xdp_rxq_info_kern.c
> @@ -4,6 +4,8 @@
>   *  Example howto extract XDP RX-queue info
>   */
>  #include <uapi/linux/bpf.h>
> +#include <uapi/linux/if_ether.h>
> +#include <uapi/linux/in.h>
>  #include "bpf_helpers.h"
>
>  /* Config setup from with userspace
> @@ -14,6 +16,11 @@
>  struct config {
>         __u32 action;
>         int ifindex;
> +       __u32 options;
> +};
> +enum cfg_options_flags {
> +       NO_TOUCH = 0x0U,
> +       READ_MEM = 0x1U,
>  };
>  struct bpf_map_def SEC("maps") config_map = {
>         .type           = BPF_MAP_TYPE_ARRAY,
> @@ -90,6 +97,18 @@ int  xdp_prognum0(struct xdp_md *ctx)
>         if (key == MAX_RXQs)
>                 rxq_rec->issue++;
>
> +       /* Default: Don't touch packet data, only count packets */
> +       if (unlikely(config->options & READ_MEM)) {
> +               struct ethhdr *eth = data;
> +
> +               if (eth + 1 > data_end)
> +                       return XDP_ABORTED;
> +
> +               /* Avoid compiler removing this: Drop non 802.3 Ethertypes */
> +               if (ntohs(eth->h_proto) < ETH_P_802_3_MIN)
> +                       return XDP_ABORTED;
> +       }
> +
>         return config->action;
>  }
>
> diff --git a/samples/bpf/xdp_rxq_info_user.c b/samples/bpf/xdp_rxq_info_user.c
> index e4e9ba52bff0..435485d4f49e 100644
> --- a/samples/bpf/xdp_rxq_info_user.c
> +++ b/samples/bpf/xdp_rxq_info_user.c
> @@ -50,6 +50,7 @@ static const struct option long_options[] = {
>         {"sec",         required_argument,      NULL, 's' },
>         {"no-separators", no_argument,          NULL, 'z' },
>         {"action",      required_argument,      NULL, 'a' },
> +       {"readmem",     no_argument,            NULL, 'r' },
>         {0, 0, NULL,  0 }
>  };
>
> @@ -66,6 +67,11 @@ static void int_exit(int sig)
>  struct config {
>         __u32 action;
>         int ifindex;
> +       __u32 options;
> +};
> +enum cfg_options_flags {
> +       NO_TOUCH = 0x0U,
> +       READ_MEM = 0x1U,
>  };
>  #define XDP_ACTION_MAX (XDP_TX + 1)
>  #define XDP_ACTION_MAX_STRLEN 11
> @@ -109,6 +115,16 @@ static void list_xdp_actions(void)
>         printf("\n");
>  }
>
> +static char* options2str(enum cfg_options_flags flag)
> +{
> +       if (flag == NO_TOUCH)
> +               return "no_touch";
> +       if (flag & READ_MEM)
> +               return "read";
> +       fprintf(stderr, "ERR: Unknown config option flags");
> +       exit(EXIT_FAIL);
> +}
> +

enum cfg_options_flags is used as a bitmap in other parts of the sample.
So this function is a little weird (with more flags added).

Thanks,
Song

>  static void usage(char *argv[])
>  {
>         int i;
> @@ -305,7 +321,7 @@ static __u64 calc_errs_pps(struct datarec *r,
>
>  static void stats_print(struct stats_record *stats_rec,
>                         struct stats_record *stats_prev,
> -                       int action)
> +                       int action, __u32 cfg_opt)
>  {
>         unsigned int nr_rxqs = bpf_map__def(rx_queue_index_map)->max_entries;
>         unsigned int nr_cpus = bpf_num_possible_cpus();
> @@ -316,8 +332,8 @@ static void stats_print(struct stats_record *stats_rec,
>         int i;
>
>         /* Header */
> -       printf("\nRunning XDP on dev:%s (ifindex:%d) action:%s\n",
> -              ifname, ifindex, action2str(action));
> +       printf("\nRunning XDP on dev:%s (ifindex:%d) action:%s options:%s\n",
> +              ifname, ifindex, action2str(action), options2str(cfg_opt));
>
>         /* stats_global_map */
>         {
> @@ -399,7 +415,7 @@ static inline void swap(struct stats_record **a, struct stats_record **b)
>         *b = tmp;
>  }
>
> -static void stats_poll(int interval, int action)
> +static void stats_poll(int interval, int action, __u32 cfg_opt)
>  {
>         struct stats_record *record, *prev;
>
> @@ -410,7 +426,7 @@ static void stats_poll(int interval, int action)
>         while (1) {
>                 swap(&prev, &record);
>                 stats_collect(record);
> -               stats_print(record, prev, action);
> +               stats_print(record, prev, action, cfg_opt);
>                 sleep(interval);
>         }
>
> @@ -421,6 +437,7 @@ static void stats_poll(int interval, int action)
>
>  int main(int argc, char **argv)
>  {
> +       __u32 cfg_options= NO_TOUCH ; /* Default: Don't touch packet memory */
>         struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
>         struct bpf_prog_load_attr prog_load_attr = {
>                 .prog_type      = BPF_PROG_TYPE_XDP,
> @@ -435,6 +452,7 @@ int main(int argc, char **argv)
>         int interval = 2;
>         __u32 key = 0;
>
> +
>         char action_str_buf[XDP_ACTION_MAX_STRLEN + 1 /* for \0 */] = { 0 };
>         int action = XDP_PASS; /* Default action */
>         char *action_str = NULL;
> @@ -496,6 +514,9 @@ int main(int argc, char **argv)
>                         action_str = (char *)&action_str_buf;
>                         strncpy(action_str, optarg, XDP_ACTION_MAX_STRLEN);
>                         break;
> +               case 'r':
> +                       cfg_options |= READ_MEM;
> +                       break;
>                 case 'h':
>                 error:
>                 default:
> @@ -522,6 +543,7 @@ int main(int argc, char **argv)
>                 }
>         }
>         cfg.action = action;
> +       cfg.options = cfg_options;
>
>         /* Trick to pretty printf with thousands separators use %' */
>         if (use_separators)
> @@ -542,6 +564,6 @@ int main(int argc, char **argv)
>                 return EXIT_FAIL_XDP;
>         }
>
> -       stats_poll(interval, action);
> +       stats_poll(interval, action, cfg_options);
>         return EXIT_OK;
>  }
>

^ permalink raw reply

* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Eric Dumazet @ 2018-06-26 23:53 UTC (permalink / raw)
  To: Cong Wang, Flavio Leitner
  Cc: Linux Kernel Network Developers, Paolo Abeni, David Miller,
	Florian Westphal, NetFilter
In-Reply-To: <CAM_iQpU8E5OuXx87Dm+jbqwbkkwETNF_RZh-VnUkF5seFPvv_A@mail.gmail.com>



On 06/26/2018 03:47 PM, Cong Wang wrote:
> 
> You need to justify why you want to break the TSQ's scope here,
> which is obviously not compatible with netns design.

You have to explain why you do not want us to fix this buggy behavior.

Right now TSQ (and more generally back pressure) is broken by this skb_orphan()

So we want to restore TSQ (and back pressure)

TSQ scope never mentioned netns.
We (TCP stack TSQ handler) want to be notified when this packet leaves the host,
even if it had to traverse multiple netns (for whatever reasons).

_If_ a packet is locally 'consumed' (like on loopback device, or veth pair),
then the skb_orphan() will automatically be done.

If you have a case where this skb_orphan() is needed, please add it at the needed place.

^ permalink raw reply

* Re: [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short
From: Nambiar, Amritha @ 2018-06-26 23:54 UTC (permalink / raw)
  To: Alexander Duyck, Tom Herbert
  Cc: Linux Kernel Network Developers, David S. Miller, Alexander Duyck,
	Willem de Bruijn, Sridhar Samudrala, Eric Dumazet,
	Hannes Frederic Sowa
In-Reply-To: <CAKgT0UdPWPtmatsWtAGTpEMA_SxEqm4OehD3Obg-mA2noBZtOw@mail.gmail.com>

On 6/25/2018 8:25 PM, Alexander Duyck wrote:
> On Mon, Jun 25, 2018 at 6:34 PM, Tom Herbert <tom@herbertland.com> wrote:
>>
>>
>> On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
>> <amritha.nambiar@intel.com> wrote:
>>>
>>> Change 'skc_tx_queue_mapping' field in sock_common structure from
>>> 'int' to 'unsigned short' type with 0 indicating unset and
>>> a positive queue value being set. This way it is consistent with
>>> the queue_mapping field in the sk_buff. This will also accommodate
>>> adding a new 'unsigned short' field in sock_common in the next
>>> patch for rx_queue_mapping.
>>>
>>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>>> ---
>>>  include/net/sock.h |   10 ++++++----
>>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>> index b3b7541..009fd30 100644
>>> --- a/include/net/sock.h
>>> +++ b/include/net/sock.h
>>> @@ -214,7 +214,7 @@ struct sock_common {
>>>                 struct hlist_node       skc_node;
>>>                 struct hlist_nulls_node skc_nulls_node;
>>>         };
>>> -       int                     skc_tx_queue_mapping;
>>> +       unsigned short          skc_tx_queue_mapping;
>>>         union {
>>>                 int             skc_incoming_cpu;
>>>                 u32             skc_rcv_wnd;
>>> @@ -1681,17 +1681,19 @@ static inline int sk_receive_skb(struct sock *sk,
>>> struct sk_buff *skb,
>>>
>>>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>>>  {
>>> -       sk->sk_tx_queue_mapping = tx_queue;
>>> +       /* sk_tx_queue_mapping accept only upto a 16-bit value */
>>> +       WARN_ON((unsigned short)tx_queue > USHRT_MAX);
>>
>>
>> Shouldn't this be USHRT_MAX - 1 ?
> 
> Actually just a ">=" would probably do as well.

Ugh! Will definitely fix this.

> 
>>
>>> +       sk->sk_tx_queue_mapping = tx_queue + 1;
>>>  }
>>>
>>>  static inline void sk_tx_queue_clear(struct sock *sk)
>>>  {
>>> -       sk->sk_tx_queue_mapping = -1;
>>>
>>> +       sk->sk_tx_queue_mapping = 0;
>>
>>
>> I think it's slightly better to define a new constant like NO_QUEUE_MAPPING
>> to be USHRT_MAX. That avoids needing to do the arithmetic every time the
>> value is accessed.

The idea was to have avoid having to make any changes to the callers of
these functions and make this similar to queue_mapping in skbuff with 0
indicating unset and +ve value for set. sk_tx_queue_get returns -1 on
invalid and the callers were validating -ve values.  With
sk_tx_queue_mapping initialized to USHRT_MAX, and having an additional
check in sk_tx_queue_get to return -1 if sk_tx_queue_mapping has
USHRT_MAX, I think I can keep changes minimal and avoid the arithmetic
if that's a more acceptable solution.

>>>
>>>  }
>>>
>>>  static inline int sk_tx_queue_get(const struct sock *sk)
>>>  {
>>> -       return sk ? sk->sk_tx_queue_mapping : -1;
>>> +       return sk ? sk->sk_tx_queue_mapping - 1 : -1;
>>
>>
>> Doesn't the comparison in __netdev_pick_tx need to be simultaneously changed
>> for this?
> 
> This doesn't change the result. It was still -1 if the queue mapping
> is not set. It was just initialized to 0 instead of to -1 so we have
> to perform the operation to get there.
> 
> Also in regards to the comment above about needing an extra operation
> I am not sure it makes much difference.
> 
> In the case of us starting with 0 as a reserved value I think the
> instruction count should be about the same. We move the unsigned short
> into an unsigned in, then decrement, and if the value is non-negative
> we can assume it is valid. Although maybe I should double check the
> code to make certain it is doing what I thought it was supposed to be
> doing.
> 
>>
>>>
>>>
>>>
>>>  }
>>>
>>>  static inline void sk_set_socket(struct sock *sk, struct socket *sock)
>>>
>>

^ permalink raw reply

* Re: [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short
From: Nambiar, Amritha @ 2018-06-27  0:00 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, Alexander Duyck,
	Samudrala, Sridhar, Alexander Duyck, Eric Dumazet,
	Hannes Frederic Sowa, Tom Herbert
In-Reply-To: <CAF=yD-+gAp8qMYGgpFA8cwPnpt50XjMS0DdJGnmNKzQDpUotFQ@mail.gmail.com>

On 6/26/2018 3:58 AM, Willem de Bruijn wrote:
> On Mon, Jun 25, 2018 at 7:06 PM Amritha Nambiar
> <amritha.nambiar@intel.com> wrote:
>>
>> Change 'skc_tx_queue_mapping' field in sock_common structure from
>> 'int' to 'unsigned short' type with 0 indicating unset and
>> a positive queue value being set. This way it is consistent with
>> the queue_mapping field in the sk_buff. This will also accommodate
>> adding a new 'unsigned short' field in sock_common in the next
>> patch for rx_queue_mapping.
>>
>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>> ---
> 
>>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>>  {
>> -       sk->sk_tx_queue_mapping = tx_queue;
>> +       /* sk_tx_queue_mapping accept only upto a 16-bit value */
>> +       WARN_ON((unsigned short)tx_queue > USHRT_MAX);
>> +       sk->sk_tx_queue_mapping = tx_queue + 1;
>>  }
> 
> WARN_ON_ONCE to avoid flooding the kernel buffer.
> 
Will fix.

^ permalink raw reply

* Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw
From: Cong Wang @ 2018-06-27  0:04 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Linux Kernel Network Developers, David Miller, Jamal Hadi Salim,
	Jakub Kicinski, Simon Horman, john.hurley, David Ahern, mlxsw
In-Reply-To: <20180626080000.12964-1-jiri@resnulli.us>

On Tue, Jun 26, 2018 at 1:01 AM Jiri Pirko <jiri@resnulli.us> wrote:
> Create dummy device with clsact first:
> # ip link add type dummy
> # tc qdisc add dev dummy0 clsact
>
> There is no template assigned by default:
> # tc filter template show dev dummy0 ingress
>
> Add a template of type flower allowing to insert rules matching on last
> 2 bytes of destination mac address:
> # tc filter template add dev dummy0 ingress proto ip flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF

Now you are extending 'tc filter' command with a new
subcommand 'template', which looks weird.

Why not make it a new property of filter like you did for chain?
Like:

tc filter add dev dummy0 ingress proto ip template flower

which is much better IMHO.

^ permalink raw reply

* [PATCH v7 00/11] net: pch_gbe: Fixes, conversion to phylib, enable for MIPS
From: Paul Burton @ 2018-06-27  0:06 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, Andrew Lunn, paul.burton

This series cleans up & reworks the pch_gbe driver such that it no
longer contains PHY-specific code, converts it to phylib & enables it to
be built on MIPS systems for use with the MIPS Boston development board.

Unfortunately I don't have access to a Minnowboard, which the driver
contains some platform-specific code for, so I haven't been able to test
the end result there.

Applies cleanly atop net-next as of commit 27a2628b3c24 ("selftests:
forwarding: mirror_gre_vlan_bridge_1q: Unset rp_filter").

Thanks,
    Paul

Andrew Lunn (1):
  net: pch_gbe: Convert to mdiobus and phylib

Paul Burton (10):
  net: pch_gbe: Remove unused struct pch_gbe_adapter fields
  net: pch_gbe: Mask spare MAC addresses all at once
  net: pch_gbe: Probe PHY ID & initialize only once
  net: pch_gbe: Remove irq_sem
  net: pch_gbe: Move pch_gbe_watchdog lower in pch_gbe_main.c
  net: pch_gbe: Only enable MAC when PHY link is active
  net: pch_gbe: Remove AR8031 PHY hibernation disable
  net: pch_gbe: Clean up resets
  ptp: pch: Allow build on MIPS platforms
  net: pch_gbe: Allow build on MIPS platforms

 drivers/net/ethernet/oki-semi/pch_gbe/Kconfig |   5 +-
 .../net/ethernet/oki-semi/pch_gbe/Makefile    |   2 +-
 .../net/ethernet/oki-semi/pch_gbe/pch_gbe.h   |  20 +-
 .../oki-semi/pch_gbe/pch_gbe_ethtool.c        |  88 +---
 .../ethernet/oki-semi/pch_gbe/pch_gbe_main.c  | 407 ++++++++----------
 .../ethernet/oki-semi/pch_gbe/pch_gbe_param.c | 265 ------------
 .../ethernet/oki-semi/pch_gbe/pch_gbe_phy.c   | 377 ----------------
 .../ethernet/oki-semi/pch_gbe/pch_gbe_phy.h   |  35 --
 drivers/ptp/Kconfig                           |   2 +-
 9 files changed, 191 insertions(+), 1010 deletions(-)
 delete mode 100644 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_phy.c
 delete mode 100644 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_phy.h

-- 
2.18.0

^ permalink raw reply

* [PATCH v7 01/11] net: pch_gbe: Remove unused struct pch_gbe_adapter fields
From: Paul Burton @ 2018-06-27  0:06 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, Andrew Lunn, paul.burton
In-Reply-To: <20180627000612.27263-1-paul.burton@mips.com>

Remove a bunch of unused fields from struct pch_gbe_adapter. Among these
polling_netdev, config_space & led_status are entirely unused.
ethtool_lock is initialized but we never attempt to acquire the lock, so
that is effectively unused too. A msg_enable field was documented but
missing, so drop that from the kerneldoc comment.

Signed-off-by: Paul Burton <paul.burton@mips.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
---

Changes in v7: New patch

 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h      | 9 ---------
 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c | 1 -
 2 files changed, 10 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
index 44c2f291e766..be218ac81f21 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
@@ -555,11 +555,9 @@ struct pch_gbe_privdata {
 /**
  * struct pch_gbe_adapter - board specific private data structure
  * @stats_lock:	Spinlock structure for status
- * @ethtool_lock:	Spinlock structure for ethtool
  * @irq_sem:		Semaphore for interrupt
  * @netdev:		Pointer of network device structure
  * @pdev:		Pointer of pci device structure
- * @polling_netdev:	Pointer of polling network device structure
  * @napi:		NAPI structure
  * @hw:			Pointer of hardware structure
  * @stats:		Hardware status
@@ -567,9 +565,6 @@ struct pch_gbe_privdata {
  * @mii:		MII information structure
  * @watchdog_timer:	Watchdog timer list
  * @wake_up_evt:	Wake up event
- * @config_space:	Configuration space
- * @msg_enable:		Driver message level
- * @led_status:		LED status
  * @tx_ring:		Pointer of Tx descriptor ring structure
  * @rx_ring:		Pointer of Rx descriptor ring structure
  * @rx_buffer_len:	Receive buffer length
@@ -579,12 +574,10 @@ struct pch_gbe_privdata {
 
 struct pch_gbe_adapter {
 	spinlock_t stats_lock;
-	spinlock_t ethtool_lock;
 	atomic_t irq_sem;
 	struct net_device *netdev;
 	struct pci_dev *pdev;
 	int irq;
-	struct net_device *polling_netdev;
 	struct napi_struct napi;
 	struct pch_gbe_hw hw;
 	struct pch_gbe_hw_stats stats;
@@ -592,8 +585,6 @@ struct pch_gbe_adapter {
 	struct mii_if_info mii;
 	struct timer_list watchdog_timer;
 	u32 wake_up_evt;
-	u32 *config_space;
-	unsigned long led_status;
 	struct pch_gbe_tx_ring *tx_ring;
 	struct pch_gbe_rx_ring *rx_ring;
 	unsigned long rx_buffer_len;
diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 43c0c10dfeb7..8908ef654d94 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -1998,7 +1998,6 @@ static int pch_gbe_sw_init(struct pch_gbe_adapter *adapter)
 	}
 	spin_lock_init(&adapter->hw.miim_lock);
 	spin_lock_init(&adapter->stats_lock);
-	spin_lock_init(&adapter->ethtool_lock);
 	atomic_set(&adapter->irq_sem, 0);
 	pch_gbe_irq_disable(adapter);
 
-- 
2.18.0

^ permalink raw reply related

* [PATCH v7 02/11] net: pch_gbe: Mask spare MAC addresses all at once
From: Paul Burton @ 2018-06-27  0:06 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, Andrew Lunn, paul.burton
In-Reply-To: <20180627000612.27263-1-paul.burton@mips.com>

pch_gbe_set_multi() loops through each unused MAC address register,
masking them one by one & waiting for a bit to clear indicating that the
change has taken effect before zeroing out the MAC register.

This is needlessly inefficient. We can instead set all the desired mask
bits with a single write to the ADDR_MASK register & wait only once for
the busy bit to clear indicating that the addresses are masked (ie.
ignored) as required.

It's pointless zeroing the MAC registers since they're masked anyway so
their contents are irrelevant, so we can avoid looping over them here
entirely.

Signed-off-by: Paul Burton <paul.burton@mips.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
---

Changes in v7: New patch

 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 8908ef654d94..9651fa02d4bb 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -2140,15 +2140,13 @@ static void pch_gbe_set_multi(struct net_device *netdev)
 		pch_gbe_mac_mar_set(hw, ha->addr, i++);
 
 	/* If there are spare MAC registers, mask & clear them */
-	for (; i < PCH_GBE_MAR_ENTRIES; i++) {
-		/* Clear MAC address mask */
+	if (i < PCH_GBE_MAR_ENTRIES) {
 		adrmask = ioread32(&hw->reg->ADDR_MASK);
-		iowrite32(adrmask | BIT(i), &hw->reg->ADDR_MASK);
+		adrmask |= GENMASK(PCH_GBE_MAR_ENTRIES - 1, i);
+		iowrite32(adrmask, &hw->reg->ADDR_MASK);
+
 		/* wait busy */
 		pch_gbe_wait_clr_bit(&hw->reg->ADDR_MASK, PCH_GBE_BUSY);
-		/* Clear MAC address */
-		iowrite32(0, &hw->reg->mac_adr[i].high);
-		iowrite32(0, &hw->reg->mac_adr[i].low);
 	}
 
 	netdev_dbg(netdev,
-- 
2.18.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox