Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2 2/4] bpf: fix 'struct pt_reg' typo in documentation
From: Quentin Monnet @ 2019-08-21 10:28 UTC (permalink / raw)
  To: Peter Wu, Alexei Starovoitov, Daniel Borkmann; +Cc: netdev, bpf
In-Reply-To: <20190820230900.23445-3-peter@lekensteyn.nl>

2019-08-21 00:08 UTC+0100 ~ Peter Wu <peter@lekensteyn.nl>
> There is no 'struct pt_reg'.
> 
> Signed-off-by: Peter Wu <peter@lekensteyn.nl>

Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>

Thanks for the fix!
Quentin

^ permalink raw reply

* Re: [PATCH net-next v3 4/4] net: fec: add support for PTP system timestamping for MDIO devices
From: Vladimir Oltean @ 2019-08-21 10:28 UTC (permalink / raw)
  To: Hubert Feurstein
  Cc: netdev, lkml, Andrew Lunn, Richard Cochran, Miroslav Lichvar,
	Fugang Duan, David S. Miller
In-Reply-To: <20190820084833.6019-5-hubert.feurstein@vahle.at>

On Tue, 20 Aug 2019 at 11:49, Hubert Feurstein <h.feurstein@gmail.com> wrote:
>
> From: Hubert Feurstein <h.feurstein@gmail.com>
>
> In order to improve the synchronisation precision of phc2sys (from
> the linuxptp project) for devices like switches which are attached
> to the MDIO bus, it is necessary the get the system timestamps as
> close as possible to the access which causes the PTP timestamp
> register to be snapshotted in the switch hardware. Usually this is
> triggered by an MDIO write access, the snapshotted timestamp is then
> transferred by several MDIO reads.
>
> The ptp_read_system_*ts functions already check the ptp_sts pointer.
>
> Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
> ---
>  drivers/net/ethernet/freescale/fec_main.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index c01d3ec3e9af..dd1253683ac0 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -1815,10 +1815,12 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
>         reinit_completion(&fep->mdio_done);
>
>         /* start a write op */
> +       ptp_read_system_prets(bus->ptp_sts);
>         writel(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
>                 FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
>                 FEC_MMFR_TA | FEC_MMFR_DATA(value),
>                 fep->hwp + FEC_MII_DATA);
> +       ptp_read_system_postts(bus->ptp_sts);
>

How do you know the core will not service an interrupt here?
Why are you not disabling (postponing) local interrupts after this
critical section? (which you were in the RFC)
If the argument is that you didn't notice any issue with phc2sys -N 5,
that's not a good argument. "Unlikely for a condition to happen" does
not mean deterministic.
Here is an example of the system servicing an interrupt during the
transmission of a 12-byte SPI transfer (proof that they can occur
anywhere where they aren't disabled):
https://drive.google.com/file/d/1rUZpfkBKHJGwQN4orFUWks_5i70wn-mj/view?usp=sharing

>         /* wait for end of transfer */
>         time_left = wait_for_completion_timeout(&fep->mdio_done,
> @@ -1956,7 +1958,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>         struct fec_enet_private *fep = netdev_priv(ndev);
>         struct device_node *node;
>         int err = -ENXIO;
> -       u32 mii_speed, holdtime;
> +       u32 mii_speed, mii_period, holdtime;
>
>         /*
>          * The i.MX28 dual fec interfaces are not equal.
> @@ -1993,6 +1995,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>          * document.
>          */
>         mii_speed = DIV_ROUND_UP(clk_get_rate(fep->clk_ipg), 5000000);
> +       mii_period = div_u64((u64)mii_speed * 2 * NSEC_PER_SEC, clk_get_rate(fep->clk_ipg));
>         if (fep->quirks & FEC_QUIRK_ENET_MAC)
>                 mii_speed--;
>         if (mii_speed > 63) {
> @@ -2034,6 +2037,8 @@ static int fec_enet_mii_init(struct platform_device *pdev)
>                 pdev->name, fep->dev_id + 1);
>         fep->mii_bus->priv = fep;
>         fep->mii_bus->parent = &pdev->dev;
> +       fep->mii_bus->flags = MII_BUS_F_PTP_STS_SUPPORTED;
> +       fep->mii_bus->ptp_sts_offset = 32 * mii_period;
>
>         node = of_get_child_by_name(pdev->dev.of_node, "mdio");
>         err = of_mdiobus_register(fep->mii_bus, node);
> --
> 2.22.1
>

Regards,
-Vladimir

^ permalink raw reply

* Regression fix for bpf in v5.3 (was Re: [RFC PATCH] bpf: handle 32-bit zext during constant blinding)
From: Michael Ellerman @ 2019-08-21 10:25 UTC (permalink / raw)
  To: Naveen N. Rao, Alexei Starovoitov, Daniel Borkmann, Jiong Wang
  Cc: bpf, linuxppc-dev, netdev, linux-kernel
In-Reply-To: <20190813171018.28221-1-naveen.n.rao@linux.vnet.ibm.com>

"Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
> Since BPF constant blinding is performed after the verifier pass, there
> are certain ALU32 instructions inserted which don't have a corresponding
> zext instruction inserted after. This is causing a kernel oops on
> powerpc and can be reproduced by running 'test_cgroup_storage' with
> bpf_jit_harden=2.
>
> Fix this by emitting BPF_ZEXT during constant blinding if
> prog->aux->verifier_zext is set.
>
> Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to analysis result")
> Reported-by: Michael Ellerman <mpe@ellerman.id.au>
> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
> ---
> This approach (the location where zext is being introduced below, in 
> particular) works for powerpc, but I am not entirely sure if this is 
> sufficient for other architectures as well. This is broken on v5.3-rc4.

Any comment on this?

This is a regression in v5.3, which results in a kernel crash, it would
be nice to get it fixed before the release please?

cheers

> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 8191a7db2777..d84146e6fd9e 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -890,7 +890,8 @@ int bpf_jit_get_func_addr(const struct bpf_prog *prog,
>  
>  static int bpf_jit_blind_insn(const struct bpf_insn *from,
>  			      const struct bpf_insn *aux,
> -			      struct bpf_insn *to_buff)
> +			      struct bpf_insn *to_buff,
> +			      bool emit_zext)
>  {
>  	struct bpf_insn *to = to_buff;
>  	u32 imm_rnd = get_random_int();
> @@ -939,6 +940,8 @@ static int bpf_jit_blind_insn(const struct bpf_insn *from,
>  		*to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
>  		*to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
>  		*to++ = BPF_ALU32_REG(from->code, from->dst_reg, BPF_REG_AX);
> +		if (emit_zext)
> +			*to++ = BPF_ZEXT_REG(from->dst_reg);
>  		break;
>  
>  	case BPF_ALU64 | BPF_ADD | BPF_K:
> @@ -992,6 +995,10 @@ static int bpf_jit_blind_insn(const struct bpf_insn *from,
>  			off -= 2;
>  		*to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
>  		*to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
> +		if (emit_zext) {
> +			*to++ = BPF_ZEXT_REG(BPF_REG_AX);
> +			off--;
> +		}
>  		*to++ = BPF_JMP32_REG(from->code, from->dst_reg, BPF_REG_AX,
>  				      off);
>  		break;
> @@ -1005,6 +1012,8 @@ static int bpf_jit_blind_insn(const struct bpf_insn *from,
>  	case 0: /* Part 2 of BPF_LD | BPF_IMM | BPF_DW. */
>  		*to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ aux[0].imm);
>  		*to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
> +		if (emit_zext)
> +			*to++ = BPF_ZEXT_REG(BPF_REG_AX);
>  		*to++ = BPF_ALU64_REG(BPF_OR,  aux[0].dst_reg, BPF_REG_AX);
>  		break;
>  
> @@ -1088,7 +1097,8 @@ struct bpf_prog *bpf_jit_blind_constants(struct bpf_prog *prog)
>  		    insn[1].code == 0)
>  			memcpy(aux, insn, sizeof(aux));
>  
> -		rewritten = bpf_jit_blind_insn(insn, aux, insn_buff);
> +		rewritten = bpf_jit_blind_insn(insn, aux, insn_buff,
> +						clone->aux->verifier_zext);
>  		if (!rewritten)
>  			continue;
>  
> -- 
> 2.22.0

^ permalink raw reply

* Re: [PATCH net-next v3 2/4] net: mdio: add PTP offset compensation to mdiobus_write_sts
From: Miroslav Lichvar @ 2019-08-21 10:19 UTC (permalink / raw)
  To: Hubert Feurstein
  Cc: Andrew Lunn, netdev, lkml, Richard Cochran, Florian Fainelli,
	Heiner Kallweit, Vladimir Oltean, David S. Miller
In-Reply-To: <CAFfN3gXtkv=YjoQixN+MdZ9vLZRPBMwg1mefuBTHFf1_QENPsg@mail.gmail.com>

On Wed, Aug 21, 2019 at 11:53:12AM +0200, Hubert Feurstein wrote:
> Am Mi., 21. Aug. 2019 um 10:07 Uhr schrieb Miroslav Lichvar
> > Because those reports/statistics are important in calculation of
> > maximum error. If someone had a requirement for a clock to be accurate
> > to 1.5 microseconds and the ioctl returned a delay indicating a
> > sufficient accuracy when in reality it could be worse, that would be a
> > problem.
> >
> Ok, I understand your point. But including the MDIO completion into
> delay calculation
> will indicate a much wore precision as it actually is.

That's ok. It's the same with PCIe devices. It takes about 500 ns to
read a PCI register, so we know in the worst case the offset is
accurate to 250 ns. It's probably much better, maybe to 50 ns, but we
don't really know. We don't know how much asymmetry there is in the
PCIe delay (it certainly is not zero), or how much time the NIC
actually needs to respond to the command and when exactly it reads the
clock.

> When the MDIO
> driver implements
> the PTP system timestamping as follows ...
> 
>   ptp_read_system_prets(bus->ptp_sts);
>   writel(value, mdio-reg)
>   ptp_read_system_postts(bus->ptp_sts);
> 
> ... then we catch already the error caused by interrupts which hit the
> pre/post_ts section.
> Now we only have the additional error of one MDIO clock cycle
> (~400ns). Because I expect
> the MDIO controller to shift out the MDIO frame on the next MDIO clock
> cycle.

Is this always the case?

> So if I subtract
> one MDIO clock cycle from pre_ts and add one MDIO clock cycle to
> post_ts the error indication
> would be sufficiently corrected IMHO.

If I understand it correctly, this ignores the time needed for the
frame to be received, decoded and the clock to be read.

-- 
Miroslav Lichvar

^ permalink raw reply

* Re: [PATCH net-next v3 2/4] net: mdio: add PTP offset compensation to mdiobus_write_sts
From: Vladimir Oltean @ 2019-08-21 10:19 UTC (permalink / raw)
  To: Hubert Feurstein
  Cc: Miroslav Lichvar, Andrew Lunn, netdev, lkml, Richard Cochran,
	Florian Fainelli, Heiner Kallweit, David S. Miller
In-Reply-To: <CAFfN3gXtkv=YjoQixN+MdZ9vLZRPBMwg1mefuBTHFf1_QENPsg@mail.gmail.com>

On Wed, 21 Aug 2019 at 12:53, Hubert Feurstein <h.feurstein@gmail.com> wrote:
>
> Am Mi., 21. Aug. 2019 um 10:07 Uhr schrieb Miroslav Lichvar
> <mlichvar@redhat.com>:
> > > Currently I do not see the benefit from this. The original intention was to
> > > compensate for the remaining offset as good as possible.
> >
> > That's ok, but IMHO the change should not break the assumptions of
> > existing application and users.
> >
> > > The current code
> > > of phc2sys uses the delay only for the filtering of the measurement record
> > > with the shortest delay and for reporting and statistics. Why not simple shift
> > > the timestamps with the offset to the point where we expect the PHC timestamp
> > > to be captured, and we have a very good result compared to where we came
> > > from.
> >
> > Because those reports/statistics are important in calculation of
> > maximum error. If someone had a requirement for a clock to be accurate
> > to 1.5 microseconds and the ioctl returned a delay indicating a
> > sufficient accuracy when in reality it could be worse, that would be a
> > problem.
> >
> Ok, I understand your point. But including the MDIO completion into
> delay calculation
> will indicate a much wore precision as it actually is. When the MDIO
> driver implements
> the PTP system timestamping as follows ...
>
>   ptp_read_system_prets(bus->ptp_sts);
>   writel(value, mdio-reg)
>   ptp_read_system_postts(bus->ptp_sts);
>
> ... then we catch already the error caused by interrupts which hit the
> pre/post_ts section.
> Now we only have the additional error of one MDIO clock cycle
> (~400ns). Because I expect
> the MDIO controller to shift out the MDIO frame on the next MDIO clock
> cycle. So if I subtract

How do you know?
The MDIO controller is a memory-mapped peripheral so there will be a
transaction on the core <-> peripheral interconnect in the SoC.
Depending on the system load, the transaction might not be
instantaneous as you think. Additionally there will be clock domain
crossings in the MDIO controller when transferring the command from
the platform clock to the peripheral clock, which might also add some
jitter.
MDIO frames may also begin with 32 bits of preamble, with some
controllers being able to suppress it. Have you checked/accounted for
this?
The only reliable moment when you know the MDIO command has completed
is when the controller says it has. If there is additional jitter in
waiting for the completion event caused by the GIC and the scheduling
of the ISR, then just switch the driver to poll mode, like everybody
else.

> one MDIO clock cycle from pre_ts and add one MDIO clock cycle to
> post_ts the error indication
> would be sufficiently corrected IMHO. And then we can shift both
> timestamps in the switch driver
> (in the gettimex handler) to compensate the switch depending offset.
> What do you think?
>
> Hubert

Regards,
-Vladimir

^ permalink raw reply

* Re: [PATCH RFC ipsec-next 0/7] ipsec: add TCP encapsulation support (RFC 8229)
From: Sabrina Dubroca @ 2019-08-21 10:17 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: netdev, Herbert Xu
In-Reply-To: <20190821065911.GO2879@gauss3.secunet.de>

2019-08-21, 08:59:11 +0200, Steffen Klassert wrote:
> On Fri, Aug 16, 2019 at 04:18:14PM +0200, Sabrina Dubroca wrote:
> > Hi Steffen,
> > 
> > 2019-06-25, 12:11:33 +0200, Sabrina Dubroca wrote:
> > > This patchset introduces support for TCP encapsulation of IKE and ESP
> > > messages, as defined by RFC 8229 [0]. It is an evolution of what
> > > Herbert Xu proposed in January 2018 [1] that addresses the main
> > > criticism against it, by not interfering with the TCP implementation
> > > at all. The networking stack now has infrastructure for this: TCP ULPs
> > > and Stream Parsers.
> > 
> > Have you had a chance to look at this?  I was going to rebase and
> > resend, but the patches still apply to ipsec-next and net-next (patch
> > 2 is already in net-next as commit bd95e678e0f6).
> 
> I had a look and I have no general objection against this. If you
> think the patchset is ready for inclusion, just remove the RFC and
> resend it. I'll have a closer on it look then.

Ok, thanks, I'll repost.

-- 
Sabrina

^ permalink raw reply

* Re: [PATCH net] ixgbe: fix double clean of tx descriptors with xdp
From: Eelco Chaudron @ 2019-08-21 10:09 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: netdev, linux-kernel, bpf, David S. Miller, Björn Töpel,
	Magnus Karlsson, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jeff Kirsher, intel-wired-lan, William Tu
In-Reply-To: <20190820151611.10727-1-i.maximets@samsung.com>



On 20 Aug 2019, at 17:16, Ilya Maximets wrote:

> Tx code doesn't clear the descriptor status after cleaning.
> So, if the budget is larger than number of used elems in a ring, some
> descriptors will be accounted twice and xsk_umem_complete_tx will move
> prod_tail far beyond the prod_head breaking the comletion queue ring.
>
> Fix that by limiting the number of descriptors to clean by the number
> of used descriptors in the tx ring.
>
> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> ---
>
> Not tested yet because of lack of available hardware.
> So, testing is very welcome.
>
Hi Ilya, this patch fixes the issue I reported earlier on the Open 
vSwitch mailing list regarding complete queue overrun.

Tested-by: Eelco Chaudron <echaudro@redhat.com>

<SNIP>

^ permalink raw reply

* pull-request: mac80211-next 2019-08-21
From: Johannes Berg @ 2019-08-21 10:04 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-wireless

Hi Dave,

For -next, we have more changes, but as described in the tag
they really just fall into a few groups of changes :-)

Please pull and let me know if there's any problem.

Thanks,
johannes



The following changes since commit 8c40f3b212a373be843a29db608b462af5c3ed5d:

  Merge tag 'mlx5-updates-2019-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux (2019-08-20 22:59:45 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git tags/mac80211-next-for-davem-2019-08-21

for you to fetch changes up to 48cb39522a9d4d4680865e40a88f975a1cee6abc:

  mac80211: minstrel_ht: improve rate probing for devices with static fallback (2019-08-21 11:10:13 +0200)

----------------------------------------------------------------
Here are a few groups of changes:
 * EDMG channel support (60 GHz, just a single patch)
 * initial 6/7 GHz band support (Arend)
 * association timestamp recording (Ben)
 * rate control improvements for better performance with
   the mt76 driver (Felix)
 * various fixes for previous HE support changes (John)

----------------------------------------------------------------
Alexei Avshalom Lazar (1):
      nl80211: Add support for EDMG channels

Arend van Spriel (8):
      nl80211: add 6GHz band definition to enum nl80211_band
      cfg80211: add 6GHz UNII band definitions
      cfg80211: util: add 6GHz channel to freq conversion and vice versa
      cfg80211: extend ieee80211_operating_class_to_band() for 6GHz
      cfg80211: add 6GHz in code handling array with NUM_NL80211_BANDS entries
      cfg80211: use same IR permissive rules for 6GHz band
      cfg80211: ibss: use 11a mandatory rates for 6GHz band operation
      cfg80211: apply same mandatory rate flags for 5GHz and 6GHz

Ben Greear (2):
      cfg80211: Support assoc-at timer in sta-info
      mac80211: add assoc-at support

Felix Fietkau (4):
      mac80211: minstrel_ht: fix per-group max throughput rate initialization
      mac80211: minstrel_ht: reduce unnecessary rate probing attempts
      mac80211: minstrel_ht: fix default max throughput rate indexes
      mac80211: minstrel_ht: improve rate probing for devices with static fallback

John Crispin (5):
      mac80211: fix TX legacy rate reporting when tx_status_ext is used
      mac80211: fix bad guard when reporting legacy rates
      mac80211: 80Mhz was not reported properly when using tx_status_ext
      mac80211: add missing length field increment when generating Radiotap header
      mac80211: fix possible NULL pointerderef in obss pd code

 drivers/net/wireless/ath/wil6210/cfg80211.c |   2 +-
 include/net/cfg80211.h                      |  88 ++++++++-
 include/uapi/linux/nl80211.h                |  29 +++
 net/mac80211/he.c                           |   3 +-
 net/mac80211/mlme.c                         |   2 +-
 net/mac80211/rc80211_minstrel.h             |   1 +
 net/mac80211/rc80211_minstrel_ht.c          | 277 ++++++++++++++++++++++++----
 net/mac80211/rc80211_minstrel_ht.h          |  12 ++
 net/mac80211/sta_info.c                     |   3 +
 net/mac80211/sta_info.h                     |   2 +
 net/mac80211/status.c                       |  31 ++--
 net/mac80211/tx.c                           |   1 +
 net/wireless/chan.c                         | 162 +++++++++++++++-
 net/wireless/ibss.c                         |  16 +-
 net/wireless/nl80211.c                      |  39 ++++
 net/wireless/reg.c                          |  21 ++-
 net/wireless/trace.h                        |   3 +-
 net/wireless/util.c                         |  56 +++++-
 18 files changed, 684 insertions(+), 64 deletions(-)


^ permalink raw reply

* Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure
From: Antoine Tenart @ 2019-08-21 10:01 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: Antoine Tenart, Igor Russkikh, Andrew Lunn, davem@davemloft.net,
	f.fainelli@gmail.com, hkallweit1@gmail.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	thomas.petazzoni@bootlin.com, alexandre.belloni@bootlin.com,
	allan.nielsen@microchip.com, camelia.groza@nxp.com,
	Simon Edelhaus, Pavel Belous
In-Reply-To: <20190820144119.GA28714@bistromath.localdomain>

Hi Sabrina,

On Tue, Aug 20, 2019 at 04:41:19PM +0200, Sabrina Dubroca wrote:
> 2019-08-20, 12:01:40 +0200, Antoine Tenart wrote:
> > So it seems the ability to enable or disable the offloading on a given
> > interface is the main missing feature. I'll add that, however I'll
> > probably (at least at first):
> > 
> > - Have the interface to be fully offloaded or fully handled in s/w (with
> >   errors being thrown if a given configuration isn't supported). Having
> >   both at the same time on a given interface would be tricky because of
> >   the MACsec validation parameter.
> > 
> > - Won't allow to enable/disable the offloading of there are rules in
> >   place, as we're not sure the same rules would be accepted by the other
> >   implementation.
> 
> That's probably quite problematic actually, because to do that you
> need to be able to resync the state between software and hardware,
> particularly packet numbers. So, yeah, we're better off having it
> completely blocked until we have a working implementation (if that
> ever happens).
> 
> However, that means in case the user wants to set up something that's
> not offloadable, they'll have to:
>  - configure the offloaded version until EOPNOTSUPP
>  - tear everything down
>  - restart from scratch without offloading
> 
> That's inconvenient.

That's right, the user might have to replay the whole configuration if
on rule failed to match the h/w requirements. It's inconvenient, but I
think it's better to be safe until we have (if that happens) a working
implementation of synchronizing the rules' state.

> Talking about packet numbers, can you describe how PN exhaustion is
> handled?  I couldn't find much about packet numbers at all in the
> driver patches (I hope the hw doesn't wrap around from 2^32-1 to 0 on
> the same SA).  At some point userspace needs to know that we're
> getting close to 2^32 and that it's time to re-key.  Since the whole
> TX path of the software implementation is bypassed, it looks like the
> PN (as far as drivers/net/macsec.c is concerned) never increases, so
> userspace can't know when to negotiate a new SA.

That's a very good point. It actually was on my todo list for the next
version (I wanted to discuss the other points first). We would also
need to sync the stats at some point.

> > I'm not sure if we should allow to mix the implementations on a given
> > physical interface (by having two MACsec interfaces attached) as the
> > validation would be impossible to do (we would have no idea if a
> > packet was correctly handled by the offloading part or just not being
> > a MACsec packet in the first place, in Rx).
> 
> That's something that really bothers me with this proposal. Quoting
> from the commit message:
> 
> > the packets seen by the networking stack on both the physical and
> > MACsec virtual interface are exactly the same

That bothers me as well.

> If the HW/driver is expected to strip the sectag, I don't see how we
> could ever have multiple secy's at the same time and demultiplex
> properly between them. That's part of the reason why I chose to have
> virtual interfaces: without them, picking the right secy on TX gets
> weird.
> 
> AFAICT, it means that any filters (tc, tcpdump) need to be different
> between offloaded and non-offloaded cases.
> 
> And it's going to be confusing to the administrator when they look at
> tcpdumps expecting to see MACsec frames.

Right. I did not see how *not* to strip the sectag in the h/w back then,
I'll have another look because that would improve things a lot.

@all: do other MACsec offloading hardware allow not to stip the sectag?

> I don't see how this implementation handles non-macsec traffic (on TX,
> that would be packets sent directly through the real interface, for
> example by wpa_supplicant - on RX, incoming MKA traffic for
> wpa_supplicant). Unless I missed something, incoming MKA traffic will
> end up on the macsec interface as well as the lower interface (not
> entirely critical, as long as wpa_supplicant can grab it on the lower
> device, but not consistent with the software implementation).

That's right, as we have no way to tell if an Rx packet was MACsec or
non-MACsec traffic, both will end up on both interfaces. Some h/w may be
able to insert a custom header (or may allow not to strip the sectag),
but I did not find anything related to this in mine (I'll double check).

> How does the driver distinguish traffic that should pass through
> unmodified from traffic that the HW needs to encapsulate and encrypt?

At least in PHYs, packets go in a classification unit (that can match on
multiple parts of the packet, given the hardware capabilities, eg. the
MAC addresses). The result of the match is an action, which can be
"bypass the MACsec block" or "go through it (which links the packet to a
given configuration)". This is done in Rx and in Tx, and this is how the
h/w block will know what to encapsulate and encrypt.

Thanks,
Antoine

-- 
Antoine Ténart, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply

* pull-request: mac80211 2019-08-21
From: Johannes Berg @ 2019-08-21 10:00 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-wireless

Hi Dave,

I have here for you a few fixes; three, to be specific. Nothing that
warrants real discussion or urgency though.

Please pull and let me know if there's any problem.

Thanks,
johannes



The following changes since commit a1c4cd67840ef80f6ca5f73326fa9a6719303a95:

  net: fix __ip_mc_inc_group usage (2019-08-20 12:48:06 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2019-08-21

for you to fetch changes up to 0d31d4dbf38412f5b8b11b4511d07b840eebe8cb:

  Revert "cfg80211: fix processing world regdomain when non modular" (2019-08-21 10:43:03 +0200)

----------------------------------------------------------------
Just three fixes:
 * extended key ID key installation
 * regulatory processing
 * possible memory leak in an error path

----------------------------------------------------------------
Alexander Wetzel (1):
      cfg80211: Fix Extended Key ID key install checks

Hodaszi, Robert (1):
      Revert "cfg80211: fix processing world regdomain when non modular"

Johannes Berg (1):
      mac80211: fix possible sta leak

 net/mac80211/cfg.c  |  9 +++++----
 net/wireless/reg.c  |  2 +-
 net/wireless/util.c | 23 ++++++++++++++---------
 3 files changed, 20 insertions(+), 14 deletions(-)


^ permalink raw reply

* Re: [PATCH 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot
From: Pablo Neira Ayuso @ 2019-08-21  9:58 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Florian Westphal, netfilter-devel, coreteam, netdev, linux-kernel,
	Jozsef Kadlecsik, David S. Miller
In-Reply-To: <793ce2e9b6200a033d44716749acc837aaf5e4e7.camel@linux.ibm.com>

On Tue, Aug 20, 2019 at 01:15:58PM -0300, Leonardo Bras wrote:
> On Tue, 2019-08-20 at 07:36 +0200, Florian Westphal wrote:
> > Wouldn't fib_netdev.c have the same problem?
> Probably, but I haven't hit this issue yet.
> 
> > If so, might be better to place this test in both
> > nft_fib6_eval_type and nft_fib6_eval.
>
> I think that is possible, and not very hard to do.
> 
> But in my humble viewpoint, it looks like it's nft_fib_inet_eval() and
> nft_fib_netdev_eval() have the responsibility to choose a valid
> protocol or drop the package. 
> I am not sure if it would be a good move to transfer this
> responsibility to nft_fib6_eval_type() and nft_fib6_eval(), so I would
> rather add the same test to nft_fib_netdev_eval().
> 
> Does it make sense?

Please, update common code to netdev and ip6 extensions as Florian
suggests.

Thanks.

^ permalink raw reply

* Re: [RFC 0/4] Add support into cdc_ncm for MAC address pass through
From: Oliver Neukum @ 2019-08-21  9:41 UTC (permalink / raw)
  To: Charles.Hyde, linux-acpi, linux-usb
  Cc: Mario.Limonciello, gregkh, nic_swsd, netdev
In-Reply-To: <89a5f8ea30b240babd8750d236ca9ef4@AUSX13MPS303.AMER.DELL.COM>

Am Dienstag, den 20.08.2019, 22:15 +0000 schrieb
Charles.Hyde@dellteam.com:
> In recent testing of a Dell Universal Dock D6000, I found that MAC address pass through is not supported in the Linux drivers.

Hi,

thank you for writing this code. It adds cool functionality.
There are, however, a few design issues and minor flaws with it.
I will comment on the patches in the series to point them out.
Could you correct them?

	Regards
		Oliver


^ permalink raw reply

* Re: [RFC 2/4] Allow cdc_ncm to set MAC address in hardware
From: Oliver Neukum @ 2019-08-21  9:39 UTC (permalink / raw)
  To: Charles.Hyde, linux-acpi, linux-usb
  Cc: Mario.Limonciello, gregkh, nic_swsd, netdev
In-Reply-To: <1566339663476.54366@Dellteam.com>

Am Dienstag, den 20.08.2019, 22:21 +0000 schrieb
Charles.Hyde@dellteam.com:
> This patch adds support for pushing a MAC address out to USB based
> ethernet controllers driven by cdc_ncm.  With this change, ifconfig can
> now set the device's MAC address.  For example, the Dell Universal Dock
> D6000 is driven by cdc_ncm.  The D6000 can now have its MAC address set
> by ifconfig, as it can be done in Windows.  This was tested with a D6000
> using ifconfig.

On a design note, it looks like you broke S4 with the driver.
Suspend To Disk will cut power and hence reset the MAC to default.
You need to reset it to the user's setting in reset_resume().
Please add that to usbnet.

> 
> Signed-off-by: Charles Hyde <charles.hyde@dellteam.com>
> Cc: Mario Limonciello <mario.limonciello@dell.com>
> Cc: Oliver Neukum <oliver@neukum.org>
> Cc: netdev@vger.kernel.org
> Cc: linux-usb@vger.kernel.org
> ---
>  drivers/net/usb/cdc_ncm.c  | 20 +++++++++++++++++++-
>  drivers/net/usb/usbnet.c   | 37 ++++++++++++++++++++++++++++---------
>  include/linux/usb/usbnet.h |  1 +
>  3 files changed, 48 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/usb/cdc_ncm.c b/drivers/net/usb/cdc_ncm.c
> index 50c05d0f44cb..f77c8672f972 100644
> --- a/drivers/net/usb/cdc_ncm.c
> +++ b/drivers/net/usb/cdc_ncm.c
> @@ -750,6 +750,24 @@ int cdc_ncm_change_mtu(struct net_device *net, int new_mtu)
>  }
>  EXPORT_SYMBOL_GPL(cdc_ncm_change_mtu);
>  
> +/* Provide method to push MAC address to the USB device's ethernet controller.
> + */
> +int cdc_ncm_set_mac_addr(struct net_device *net, void *p)
> +{
> +	struct usbnet *dev = netdev_priv(net);
> +	struct sockaddr *addr = p;
> +
> +	memcpy(dev->net->dev_addr, addr->sa_data, ETH_ALEN);
> +	/*
> +	 * Try to push the MAC address out to the device.  Ignore any errors,
> +	 * to be compatible with prior versions of this source.
> +	 */
> +	usbnet_set_ethernet_addr(dev);
> +
> +	return eth_mac_addr(net, p);
> +}
> +EXPORT_SYMBOL_GPL(cdc_ncm_set_mac_addr);
> +
>  static const struct net_device_ops cdc_ncm_netdev_ops = {
>  	.ndo_open	     = usbnet_open,
>  	.ndo_stop	     = usbnet_stop,
> @@ -757,7 +775,7 @@ static const struct net_device_ops cdc_ncm_netdev_ops = {
>  	.ndo_tx_timeout	     = usbnet_tx_timeout,
>  	.ndo_get_stats64     = usbnet_get_stats64,
>  	.ndo_change_mtu	     = cdc_ncm_change_mtu,
> -	.ndo_set_mac_address = eth_mac_addr,
> +	.ndo_set_mac_address = cdc_ncm_set_mac_addr,

Why can't this fully go into usbnet?

>  	.ndo_validate_addr   = eth_validate_addr,
>  };
>  
> diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
> index 72514c46b478..72bdac34b0ee 100644
> --- a/drivers/net/usb/usbnet.c
> +++ b/drivers/net/usb/usbnet.c
> @@ -149,20 +149,39 @@ int usbnet_get_ethernet_addr(struct usbnet *dev, int iMACAddress)
>  	int 		tmp = -1, ret;
>  	unsigned char	buf [13];
>  
> -	ret = usb_string(dev->udev, iMACAddress, buf, sizeof buf);
> -	if (ret == 12)
> -		tmp = hex2bin(dev->net->dev_addr, buf, 6);
> -	if (tmp < 0) {
> -		dev_dbg(&dev->udev->dev,
> -			"bad MAC string %d fetch, %d\n", iMACAddress, tmp);
> -		if (ret >= 0)
> -			ret = -EINVAL;
> -		return ret;
> +	ret = usb_get_address(dev->udev, buf);
> +	if (ret == 6)

If you mean ETH_ALEN, you should use it.

> +		memcpy(dev->net->dev_addr, buf, 6);
> +	else if (ret < 0) {
> +		ret = usb_string(dev->udev, iMACAddress, buf, sizeof buf);
> +		if (ret == 12)
> +			tmp = hex2bin(dev->net->dev_addr, buf, 6);
> +		if (tmp < 0) {
> +			dev_dbg(&dev->udev->dev,
> +				"bad MAC string %d fetch, %d\n", iMACAddress,
> +				tmp);
> +			if (ret >= 0)
> +				ret = -EINVAL;
> +			return ret;

Again, you cannot ignore the possibility of getting fewer or more than
6 bytes.

> +		}
>  	}
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(usbnet_get_ethernet_addr);
>  
> +int usbnet_set_ethernet_addr(struct usbnet *dev)
> +{
> +	int ret;
> +
> +	ret = usb_set_address(dev->udev, dev->net->dev_addr);
> +	if (ret < 0) {
> +		dev_dbg(&dev->udev->dev,
> +			"bad MAC address put, %d\n", ret);
> +	}
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(usbnet_set_ethernet_addr);

What is the purpose of this wrapper?

	Regards
		Oliver


^ permalink raw reply

* Re: [PATCH net-next v3 2/4] net: mdio: add PTP offset compensation to mdiobus_write_sts
From: Hubert Feurstein @ 2019-08-21  9:53 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Andrew Lunn, netdev, lkml, Richard Cochran, Florian Fainelli,
	Heiner Kallweit, Vladimir Oltean, David S. Miller
In-Reply-To: <20190821080709.GO891@localhost>

Am Mi., 21. Aug. 2019 um 10:07 Uhr schrieb Miroslav Lichvar
<mlichvar@redhat.com>:
> > Currently I do not see the benefit from this. The original intention was to
> > compensate for the remaining offset as good as possible.
>
> That's ok, but IMHO the change should not break the assumptions of
> existing application and users.
>
> > The current code
> > of phc2sys uses the delay only for the filtering of the measurement record
> > with the shortest delay and for reporting and statistics. Why not simple shift
> > the timestamps with the offset to the point where we expect the PHC timestamp
> > to be captured, and we have a very good result compared to where we came
> > from.
>
> Because those reports/statistics are important in calculation of
> maximum error. If someone had a requirement for a clock to be accurate
> to 1.5 microseconds and the ioctl returned a delay indicating a
> sufficient accuracy when in reality it could be worse, that would be a
> problem.
>
Ok, I understand your point. But including the MDIO completion into
delay calculation
will indicate a much wore precision as it actually is. When the MDIO
driver implements
the PTP system timestamping as follows ...

  ptp_read_system_prets(bus->ptp_sts);
  writel(value, mdio-reg)
  ptp_read_system_postts(bus->ptp_sts);

... then we catch already the error caused by interrupts which hit the
pre/post_ts section.
Now we only have the additional error of one MDIO clock cycle
(~400ns). Because I expect
the MDIO controller to shift out the MDIO frame on the next MDIO clock
cycle. So if I subtract
one MDIO clock cycle from pre_ts and add one MDIO clock cycle to
post_ts the error indication
would be sufficiently corrected IMHO. And then we can shift both
timestamps in the switch driver
(in the gettimex handler) to compensate the switch depending offset.
What do you think?

Hubert

^ permalink raw reply

* Re: [PATCH net-next 3/6] net: dsa: Delete the VID from the upstream port as well
From: Vladimir Oltean @ 2019-08-21  9:51 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: Florian Fainelli, Andrew Lunn, Ido Schimmel, Roopa Prabhu,
	nikolay, David S. Miller, netdev
In-Reply-To: <20190820233026.GC21067@t480s.localdomain>

On Wed, 21 Aug 2019 at 06:30, Vivien Didelot <vivien.didelot@gmail.com> wrote:
>
> On Wed, 21 Aug 2019 01:09:39 +0300, Vladimir Oltean <olteanv@gmail.com> wrote:
> > I mean I made an argument already for the hack in 4/6 ("Don't program
> > the VLAN as pvid on the upstream port"). If the hack gets accepted
> > like that, I have no further need of any change in the implicit VLAN
> > configuration. But it's still a hack, so in that sense it would be
> > nicer to not need it and have a better amount of control.
>
> How come you simply cannot ignore the PVID flag for the CPU port in the
> driver directly, as mv88e6xxx does in preference of the Marvell specific
> "unmodified" mode? What PVID are you programming on the CPU port already?
>
>
> Thanks,
>
>         Vivien

sja1105 has no such thing as an "unmodified" egress policy.
And ignoring the flags in the switch driver for the CPU port is just
as hacky as fixing it up in the DSA core.
I fail to see any reason to change the pvid for the CPU/DSA ports,
maybe you can clarify.

Actually I gave a second thought to the patchset and in a weird,
convoluted way, I can get away with just:
- 2/6: net: bridge: Populate the pvid flag in br_vlan_get_info
- 5/6 net: dsa: Allow proper internal use of VLANs
- 6/6 net: dsa: tag_8021q: Restore bridge pvid when enabling vlan_filtering
A side effect of running dsa_port_restore_pvid on a user port will be
that it is going to also restore the pvid on the CPU port, via the
bitmap operations. I had not thought of this initially when I first
jumped to remove the BRIDGE_VLAN_INFO_PVID flag in 4/6. And the fact
that it would work would just be "programming by coincidence".

I guess my aversion against the VLAN bitmap operations (and, well,
"match" in your new change) stems from the fact that the DSA core sits
in the way of doing custom configuration of the CPU port VLAN
settings. Yes, you can work around it (dsa_8021q first configures the
user ports as pvid and egress untagged, then configures the CPU port
as egress tagged, so that the pvid setting from user ports doesn't
"stick" to the CPU via the bitmap), but it's like pouring water that's
half hot and half cold from a water cooler, when all you want is water
that's at room temperature.

-Vladimir

^ permalink raw reply

* Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure
From: allan.nielsen @ 2019-08-21  9:27 UTC (permalink / raw)
  To: Igor Russkikh
  Cc: Sabrina Dubroca, Antoine Tenart, Andrew Lunn, davem@davemloft.net,
	f.fainelli@gmail.com, hkallweit1@gmail.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	thomas.petazzoni@bootlin.com, alexandre.belloni@bootlin.com,
	camelia.groza@nxp.com, Simon Edelhaus, Pavel Belous
In-Reply-To: <81ec0497-58cd-1f4c-faa3-c057693cd50e@aquantia.com>

The 08/21/2019 09:20, Igor Russkikh wrote:
> > Talking about packet numbers, can you describe how PN exhaustion is
> > handled?  I couldn't find much about packet numbers at all in the
> > driver patches (I hope the hw doesn't wrap around from 2^32-1 to 0 on
> > the same SA).  At some point userspace needs to know that we're
> > getting close to 2^32 and that it's time to re-key.  Since the whole
> > TX path of the software implementation is bypassed, it looks like the
> > PN (as far as drivers/net/macsec.c is concerned) never increases, so
> > userspace can't know when to negotiate a new SA.
> 
> I think there should be driver specific implementation of this functionality.
> As an example, our macsec HW issues an interrupt towards the host to indicate
> PN threshold has reached and it's time for userspace to change the keys.
> 
> In contrast, current SW macsec implementation just stops this SA/secy.
> 
> > I don't see how this implementation handles non-macsec traffic (on TX,
> > that would be packets sent directly through the real interface, for
> > example by wpa_supplicant - on RX, incoming MKA traffic for
> > wpa_supplicant). Unless I missed something, incoming MKA traffic will
> > end up on the macsec interface as well as the lower interface (not
> > entirely critical, as long as wpa_supplicant can grab it on the lower
> > device, but not consistent with the software implementation). How does
> > the driver distinguish traffic that should pass through unmodified
> > from traffic that the HW needs to encapsulate and encrypt?
> 
> I can comment on our HW engine - where it has special bypass rules
> for configured ethertypes. This way macsec engine skips encryption on TX and
> passes in RX unencrypted for the selected control packets.
In our case it is a TCAM which can look at various fields (including ethertype),
but is sounds like we have a vary similar design.

> But thats true, realdev driver is hard to distinguish encrypted/unencrypted
> packets. In case realdev should make a decision where to put RX packet,
> it only may do some heuristic (since after macsec decription all the
> macsec related info is dropped. Thats true at least for our HW implementation).
Same for ours.

> > If you look at IPsec offloading, the networking stack builds up the
> > ESP header, and passes the unencrypted data down to the driver. I'm
> > wondering if the same would be possible with MACsec offloading: the
> > macsec virtual interface adds the header (and maybe a dummy ICV), and
> > then the HW does the encryption. In case of HW that needs to add the
> > sectag itself, the driver would first strip the headers that the stack
> > created. On receive, the driver would recreate a sectag and the macsec
> > interface would just skip all verification (decrypt, PN).
> 
> I don't think this way is good, as driver have to do per packet header mangling.
> That'll harm linerate performance heavily.
Agree, and it will also prevent MACsec offload in offloaded bridge devices.

/Allan

^ permalink raw reply

* Re: [PATCH v1] ocfs2/dlm: Move BITS_TO_BYTES() to bitops.h for wider use
From: Andy Shevchenko @ 2019-08-21  9:25 UTC (permalink / raw)
  To: Joseph Qi
  Cc: Mark Fasheh, Joel Becker, ocfs2-devel, Ariel Elior,
	Sudarsana Kalluru, GR-everest-linux-l2, David S. Miller, netdev,
	Colin Ian King
In-Reply-To: <1a3e6660-10d2-e66c-2880-24af64c7f120@linux.alibaba.com>

On Wed, Aug 21, 2019 at 09:29:04AM +0800, Joseph Qi wrote:
> On 19/8/21 00:31, Andy Shevchenko wrote:
> > There are users already and will be more of BITS_TO_BYTES() macro.
> > Move it to bitops.h for wider use.
> > 
> > Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> > ---
> >  drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h | 1 -
> >  fs/ocfs2/dlm/dlmcommon.h                         | 4 ----
> >  include/linux/bitops.h                           | 1 +
> >  3 files changed, 1 insertion(+), 5 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h
> > index 066765fbef06..0a59a09ef82f 100644
> > --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h
> > +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h
> > @@ -296,7 +296,6 @@ static inline void bnx2x_dcb_config_qm(struct bnx2x *bp, enum cos_mode mode,
> >   *    possible, the driver should only write the valid vnics into the internal
> >   *    ram according to the appropriate port mode.
> >   */
> > -#define BITS_TO_BYTES(x) ((x)/8)>
> I don't think this is a equivalent replace, or it is in fact
> wrong before?

I was thinking about this one and there are two applications:
- calculus of the amount of structures of certain type per PAGE
  (obviously off-by-one error in the original code IIUC purpose of STRUCT_SIZE)
- calculus of some threshold based on line speed in bytes per second
  (I dunno it will have any difference on the Gbs / 100 MBs speeds)

> >  /* CMNG constants, as derived from system spec calculations */
> >  
> > diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
> > index aaf24548b02a..0463dce65bb2 100644
> > --- a/fs/ocfs2/dlm/dlmcommon.h
> > +++ b/fs/ocfs2/dlm/dlmcommon.h
> > @@ -688,10 +688,6 @@ struct dlm_begin_reco
> >  	__be32 pad2;
> >  };
> >  
> > -
> > -#define BITS_PER_BYTE 8
> > -#define BITS_TO_BYTES(bits) (((bits)+BITS_PER_BYTE-1)/BITS_PER_BYTE)
> > -
> For ocfs2 part, it looks good to me.
> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

Thanks!

> 
> >  struct dlm_query_join_request
> >  {
> >  	u8 node_idx;
> > diff --git a/include/linux/bitops.h b/include/linux/bitops.h
> > index cf074bce3eb3..79d80f5ddf7b 100644
> > --- a/include/linux/bitops.h
> > +++ b/include/linux/bitops.h
> > @@ -5,6 +5,7 @@
> >  #include <linux/bits.h>
> >  
> >  #define BITS_PER_TYPE(type) (sizeof(type) * BITS_PER_BYTE)
> > +#define BITS_TO_BYTES(nr)	DIV_ROUND_UP(nr, BITS_PER_BYTE)
> >  #define BITS_TO_LONGS(nr)	DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
> >  
> >  extern unsigned int __sw_hweight8(unsigned int w);
> > 

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure
From: allan.nielsen @ 2019-08-21  9:24 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: Antoine Tenart, Igor Russkikh, Andrew Lunn, davem@davemloft.net,
	f.fainelli@gmail.com, hkallweit1@gmail.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	thomas.petazzoni@bootlin.com, alexandre.belloni@bootlin.com,
	camelia.groza@nxp.com, Simon Edelhaus, Pavel Belous
In-Reply-To: <20190820144119.GA28714@bistromath.localdomain>

Hi,

I can add some information to the HW Antoine is working on, general design of it
and the thoughts behind it. See below.

The 08/20/2019 16:41, Sabrina Dubroca wrote:
> 2019-08-20, 12:01:40 +0200, Antoine Tenart wrote:
> > So it seems the ability to enable or disable the offloading on a given
> > interface is the main missing feature. I'll add that, however I'll
> > probably (at least at first):
> > 
> > - Have the interface to be fully offloaded or fully handled in s/w (with
> >   errors being thrown if a given configuration isn't supported). Having
> >   both at the same time on a given interface would be tricky because of
> >   the MACsec validation parameter.
> > 
> > - Won't allow to enable/disable the offloading of there are rules in
> >   place, as we're not sure the same rules would be accepted by the other
> >   implementation.
> 
> That's probably quite problematic actually, because to do that you
> need to be able to resync the state between software and hardware,
> particularly packet numbers. So, yeah, we're better off having it
> completely blocked until we have a working implementation (if that
> ever happens).
> 
> However, that means in case the user wants to set up something that's
> not offloadable, they'll have to:
>  - configure the offloaded version until EOPNOTSUPP
>  - tear everything down
>  - restart from scratch without offloading
> 
> That's inconvenient.
> 
> Talking about packet numbers, can you describe how PN exhaustion is
> handled?  I couldn't find much about packet numbers at all in the
> driver patches (I hope the hw doesn't wrap around from 2^32-1 to 0 on
> the same SA).
New SA's are suppose to be installed ahead of time. The HW will automatic move
to the next SA and reset the PN.

> At some point userspace needs to know that we're
> getting close to 2^32 and that it's time to re-key.  Since the whole
> TX path of the software implementation is bypassed, it looks like the
> PN (as far as drivers/net/macsec.c is concerned) never increases, so
> userspace can't know when to negotiate a new SA.
> 
> > I'm not sure if we should allow to mix the implementations on a given
> > physical interface (by having two MACsec interfaces attached) as the
> > validation would be impossible to do (we would have no idea if a
> > packet was correctly handled by the offloading part or just not being
> > a MACsec packet in the first place, in Rx).
> 
> That's something that really bothers me with this proposal. Quoting
> from the commit message:
> 
> > the packets seen by the networking stack on both the physical and
> > MACsec virtual interface are exactly the same
> 
> If the HW/driver is expected to strip the sectag, I don't see how we
> could ever have multiple secy's at the same time and demultiplex
> properly between them. That's part of the reason why I chose to have
> virtual interfaces: without them, picking the right secy on TX gets
> weird.

The HW does frame clasification, and use the claisfication to associate frames
to a given secy.

We we in SW have eth0, with 2 vlan-sub interfaces, and enable macsec on those,
then we have:

eth0
eth0.10
eth0.10.macsec
eth0.20
eth0.20.macsec

In this case the HW needs to be configured to match vlan 10 to one secy, and
vlan 20 to an other one.

This is nor supported in the current patch, but is something we can add later.
We just wanted to get the basic functionallity done right before moving on to
this.

But in the current design, there is nothing that prevent us from adding this.

If anyone is interested in the details of this then it is described in section
3.6.3 in http://ww1.microchip.com/downloads/en/DeviceDoc/VMDS-10455.pdf

It is possible to construct an encapsulation that the HW can not classify
correctly. If that is the case, then we should reject the HW offload, and use
the SW.

But it is a good point, which I think is missing. We should properly reject
MACsec HW offload (with this driver, in this state) on virtual interfaces, if
the encapsulation can not be handled (could start by reject all virtual
interfaces)

> AFAICT, it means that any filters (tc, tcpdump) need to be different
> between offloaded and non-offloaded cases.
It will see the result of the offloaded operation. But I guess that this is no
different from when 'tc' operations are offloaded to HW.

> How does the driver distinguish traffic that should pass through unmodified
> from traffic that the HW needs to encapsulate and encrypt?
It relay on frame classification (I think Antoine is missing a flow
configuration to bypass all MKA traffic).

> If you look at IPsec offloading, the networking stack builds up the
> ESP header, and passes the unencrypted data down to the driver. I'm
> wondering if the same would be possible with MACsec offloading: the
> macsec virtual interface adds the header (and maybe a dummy ICV), and
> then the HW does the encryption. In case of HW that needs to add the
> sectag itself, the driver would first strip the headers that the stack
> created. On receive, the driver would recreate a sectag and the macsec
> interface would just skip all verification (decrypt, PN).
I do not think this is possible with this HW, nor do I think this is desirable.

One of the big differences between MACsec and IPsec, is the fact that it is a L2
protocol, designed to be running on switches (this is how MACsec claims to limit
key-distributions issues, as all frames are decrypted when entering a switch,
and encrypted with a new key when leaving).

If a SW stack needs to add a tag to the frame, then we cannot offload traffic
being bridged in HW, and never goes to the CPU.

/Allan

^ permalink raw reply

* Re: [RFC 1/4] Add usb_get_address and usb_set_address support
From: Oliver Neukum @ 2019-08-21  9:08 UTC (permalink / raw)
  To: Charles.Hyde, linux-acpi, linux-usb
  Cc: Mario.Limonciello, gregkh, nic_swsd, netdev
In-Reply-To: <1566339522507.45056@Dellteam.com>

Am Dienstag, den 20.08.2019, 22:18 +0000 schrieb
Charles.Hyde@dellteam.com:
> The core USB driver message.c is missing get/set address functionality

This should go into usbnet. The CDC parser is where it is because
it is needed for serial and network devices. As serial devices
do not have a MAC, this can go into usbnet.

> that stops ifconfig from being able to push MAC addresses out to USB
> based ethernet devices.  Without this functionality, some USB devices
> stop responding to ethernet packets when using ifconfig to change MAC
> addresses.  This has been tested with a Dell Universal Dock D6000.
> 
> Signed-off-by: Charles Hyde <charles.hyde@dellteam.com>
> Cc: Mario Limonciello <mario.limonciello@dell.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: linux-usb@vger.kernel.org
> ---
>  drivers/usb/core/message.c | 59 ++++++++++++++++++++++++++++++++++++++
>  include/linux/usb.h        |  3 ++
>  2 files changed, 62 insertions(+)
> 
> diff --git a/drivers/usb/core/message.c b/drivers/usb/core/message.c
> index 5adf489428aa..eea775234b09 100644
> --- a/drivers/usb/core/message.c
> +++ b/drivers/usb/core/message.c
> @@ -1085,6 +1085,65 @@ int usb_clear_halt(struct usb_device *dev, int pipe)
>  }
>  EXPORT_SYMBOL_GPL(usb_clear_halt);
>  
> +/**
> + * usb_get_address - 
> + * @dev: device whose endpoint is halted

Which endpoint?

> + * @mac: buffer for containing 
> + * Context: !in_interrupt ()
> + *
> + * This will attempt to get the six byte MAC address from a USB device's
> + * ethernet controller using GET_NET_ADDRESS command.
> + *
> + * This call is synchronous, and may not be used in an interrupt context.
> + *
> + * Return: Zero on success, or else the status code returned by the

Well, I am afraid it will return 6 on success.

> + * underlying usb_control_msg() call.
> + */
> +int usb_get_address(struct usb_device *dev, unsigned char * mac)
> +{
> +	int ret = -ENOMEM;

Initialization is unnecessary here.

> +	unsigned char *tbuf = kmalloc(256, GFP_NOIO);

If you intentionally picked a safety margin of 42 times, this
is cool. Otherwise it is a litttle much.

> +
> +	if (!tbuf)
> +		return -ENOMEM;
> +
> +	ret = usb_control_msg(dev, usb_sndctrlpipe(dev, 0),
> +			USB_CDC_GET_NET_ADDRESS,
> +			USB_DIR_IN | USB_TYPE_CLASS | USB_RECIP_INTERFACE,
> +			0, USB_REQ_SET_ADDRESS, tbuf, 256,
> +			USB_CTRL_GET_TIMEOUT);
> +	if (ret == 6)
> +		memcpy(mac, tbuf, 6);

You cannot ignore the case of devices sending more or less than 6
bytes.

	Regards
		Oliver


^ permalink raw reply

* Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure
From: Igor Russkikh @ 2019-08-21  9:20 UTC (permalink / raw)
  To: Sabrina Dubroca, Antoine Tenart
  Cc: Andrew Lunn, davem@davemloft.net, f.fainelli@gmail.com,
	hkallweit1@gmail.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, thomas.petazzoni@bootlin.com,
	alexandre.belloni@bootlin.com, allan.nielsen@microchip.com,
	camelia.groza@nxp.com, Simon Edelhaus, Pavel Belous
In-Reply-To: <20190820144119.GA28714@bistromath.localdomain>


> 
> Talking about packet numbers, can you describe how PN exhaustion is
> handled?  I couldn't find much about packet numbers at all in the
> driver patches (I hope the hw doesn't wrap around from 2^32-1 to 0 on
> the same SA).  At some point userspace needs to know that we're
> getting close to 2^32 and that it's time to re-key.  Since the whole
> TX path of the software implementation is bypassed, it looks like the
> PN (as far as drivers/net/macsec.c is concerned) never increases, so
> userspace can't know when to negotiate a new SA.

I think there should be driver specific implementation of this functionality.
As an example, our macsec HW issues an interrupt towards the host to indicate
PN threshold has reached and it's time for userspace to change the keys.

In contrast, current SW macsec implementation just stops this SA/secy.

> I don't see how this implementation handles non-macsec traffic (on TX,
> that would be packets sent directly through the real interface, for
> example by wpa_supplicant - on RX, incoming MKA traffic for
> wpa_supplicant). Unless I missed something, incoming MKA traffic will
> end up on the macsec interface as well as the lower interface (not
> entirely critical, as long as wpa_supplicant can grab it on the lower
> device, but not consistent with the software implementation). How does
> the driver distinguish traffic that should pass through unmodified
> from traffic that the HW needs to encapsulate and encrypt?

I can comment on our HW engine - where it has special bypass rules
for configured ethertypes. This way macsec engine skips encryption on TX and
passes in RX unencrypted for the selected control packets.

But thats true, realdev driver is hard to distinguish encrypted/unencrypted
packets. In case realdev should make a decision where to put RX packet,
it only may do some heuristic (since after macsec decription all the
macsec related info is dropped. Thats true at least for our HW implementation).

> If you look at IPsec offloading, the networking stack builds up the
> ESP header, and passes the unencrypted data down to the driver. I'm
> wondering if the same would be possible with MACsec offloading: the
> macsec virtual interface adds the header (and maybe a dummy ICV), and
> then the HW does the encryption. In case of HW that needs to add the
> sectag itself, the driver would first strip the headers that the stack
> created. On receive, the driver would recreate a sectag and the macsec
> interface would just skip all verification (decrypt, PN).

I don't think this way is good, as driver have to do per packet header mangling.
That'll harm linerate performance heavily.

Regards,
   Igor

^ permalink raw reply

* Re: [RFC PATCH] bpf: handle 32-bit zext during constant blinding
From: Jiong Wang @ 2019-08-21  9:05 UTC (permalink / raw)
  To: Naveen N. Rao
  Cc: Alexei Starovoitov, Daniel Borkmann, Jiong Wang, bpf,
	linux-kernel, linuxppc-dev, Michael Ellerman, netdev
In-Reply-To: <1566376025.68ldwx3wc7.naveen@linux.ibm.com>


Naveen N. Rao writes:

> Naveen N. Rao wrote:
>> Since BPF constant blinding is performed after the verifier pass, there
>> are certain ALU32 instructions inserted which don't have a corresponding
>> zext instruction inserted after. This is causing a kernel oops on
>> powerpc and can be reproduced by running 'test_cgroup_storage' with
>> bpf_jit_harden=2.
>> 
>> Fix this by emitting BPF_ZEXT during constant blinding if
>> prog->aux->verifier_zext is set.
>> 
>> Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to analysis result")
>> Reported-by: Michael Ellerman <mpe@ellerman.id.au>
>> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
>> ---
>> This approach (the location where zext is being introduced below, in 
>> particular) works for powerpc, but I am not entirely sure if this is 
>> sufficient for other architectures as well. This is broken on v5.3-rc4.
>
> Alexie, Daniel, Jiong,
> Any feedback on this?

The fix on BPF_LD | BPF_IMM | BPF_DW looks correct to me, but the two other
places looks to me is unnecessary, as those destinations are exposed to
external and if they are used as 64-bit then there will be zext inserted
for them.

Have you verified removing those two fixes will still cause the bug?

Regards,
Jiong

>
> - Naveen


^ permalink raw reply

* [PATCH bpf-next 2/2] tools: bpftool: add "bpftool map freeze" subcommand
From: Quentin Monnet @ 2019-08-21  8:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190821085219.30387-1-quentin.monnet@netronome.com>

Add a new subcommand to freeze maps from user space.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 .../bpf/bpftool/Documentation/bpftool-map.rst |  9 +++++
 tools/bpf/bpftool/bash-completion/bpftool     |  4 +--
 tools/bpf/bpftool/map.c                       | 34 ++++++++++++++++++-
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst
index 61d1d270eb5e..1c0f7146aab0 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
@@ -36,6 +36,7 @@ MAP COMMANDS
 |	**bpftool** **map pop**        *MAP*
 |	**bpftool** **map enqueue**    *MAP* **value** *VALUE*
 |	**bpftool** **map dequeue**    *MAP*
+|	**bpftool** **map freeze**     *MAP*
 |	**bpftool** **map help**
 |
 |	*MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
@@ -127,6 +128,14 @@ DESCRIPTION
 	**bpftool map dequeue**  *MAP*
 		  Dequeue and print **value** from the queue.
 
+	**bpftool map freeze**  *MAP*
+		  Freeze the map as read-only from user space. Entries from a
+		  frozen map can not longer be updated or deleted with the
+		  **bpf\ ()** system call. This operation is not reversible,
+		  and the map remains immutable from user space until its
+		  destruction. However, read and write permissions for BPF
+		  programs to the map remain unchanged.
+
 	**bpftool map help**
 		  Print short help message.
 
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index 2ffd351f9dbf..70493a6da206 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -449,7 +449,7 @@ _bpftool()
         map)
             local MAP_TYPE='id pinned'
             case $command in
-                show|list|dump|peek|pop|dequeue)
+                show|list|dump|peek|pop|dequeue|freeze)
                     case $prev in
                         $command)
                             COMPREPLY=( $( compgen -W "$MAP_TYPE" -- "$cur" ) )
@@ -638,7 +638,7 @@ _bpftool()
                     [[ $prev == $object ]] && \
                         COMPREPLY=( $( compgen -W 'delete dump getnext help \
                             lookup pin event_pipe show list update create \
-                            peek push enqueue pop dequeue' -- \
+                            peek push enqueue pop dequeue freeze' -- \
                             "$cur" ) )
                     ;;
             esac
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index af2e9eb9747b..de61d73b9030 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -1262,6 +1262,35 @@ static int do_pop_dequeue(int argc, char **argv)
 	return err;
 }
 
+static int do_freeze(int argc, char **argv)
+{
+	int err, fd;
+
+	if (!REQ_ARGS(2))
+		return -1;
+
+	fd = map_parse_fd(&argc, &argv);
+	if (fd < 0)
+		return -1;
+
+	if (argc) {
+		close(fd);
+		return BAD_ARG();
+	}
+
+	err = bpf_map_freeze(fd);
+	close(fd);
+	if (err) {
+		p_err("failed to freeze map: %s", strerror(errno));
+		return err;
+	}
+
+	if (json_output)
+		jsonw_null(json_wtr);
+
+	return 0;
+}
+
 static int do_help(int argc, char **argv)
 {
 	if (json_output) {
@@ -1286,6 +1315,7 @@ static int do_help(int argc, char **argv)
 		"       %s %s pop        MAP\n"
 		"       %s %s enqueue    MAP value VALUE\n"
 		"       %s %s dequeue    MAP\n"
+		"       %s %s freeze     MAP\n"
 		"       %s %s help\n"
 		"\n"
 		"       " HELP_SPEC_MAP "\n"
@@ -1304,7 +1334,8 @@ static int do_help(int argc, char **argv)
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
-		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
+		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
+		bin_name, argv[-2]);
 
 	return 0;
 }
@@ -1326,6 +1357,7 @@ static const struct cmd cmds[] = {
 	{ "enqueue",	do_update },
 	{ "pop",	do_pop_dequeue },
 	{ "dequeue",	do_pop_dequeue },
+	{ "freeze",	do_freeze },
 	{ 0 }
 };
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 1/2] tools: bpftool: show frozen status for maps
From: Quentin Monnet @ 2019-08-21  8:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190821085219.30387-1-quentin.monnet@netronome.com>

When listing maps, read their "frozen" status from procfs, and tell if
maps are frozen.

As commit log for map freezing command mentions that the feature might
be extended with flags (e.g. for write-only instead of read-only) in the
future, use an integer and not a boolean for JSON output.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 tools/bpf/bpftool/map.c | 30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index bfbbc6b4cb83..af2e9eb9747b 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -481,9 +481,11 @@ static int parse_elem(char **argv, struct bpf_map_info *info,
 
 static int show_map_close_json(int fd, struct bpf_map_info *info)
 {
-	char *memlock;
+	char *memlock, *frozen_str;
+	int frozen = 0;
 
 	memlock = get_fdinfo(fd, "memlock");
+	frozen_str = get_fdinfo(fd, "frozen");
 
 	jsonw_start_object(json_wtr);
 
@@ -533,6 +535,12 @@ static int show_map_close_json(int fd, struct bpf_map_info *info)
 	}
 	close(fd);
 
+	if (frozen_str) {
+		frozen = atoi(frozen_str);
+		free(frozen_str);
+	}
+	jsonw_int_field(json_wtr, "frozen", frozen);
+
 	if (info->btf_id)
 		jsonw_int_field(json_wtr, "btf_id", info->btf_id);
 
@@ -555,9 +563,11 @@ static int show_map_close_json(int fd, struct bpf_map_info *info)
 
 static int show_map_close_plain(int fd, struct bpf_map_info *info)
 {
-	char *memlock;
+	char *memlock, *frozen_str;
+	int frozen = 0;
 
 	memlock = get_fdinfo(fd, "memlock");
+	frozen_str = get_fdinfo(fd, "frozen");
 
 	printf("%u: ", info->id);
 	if (info->type < ARRAY_SIZE(map_type_name))
@@ -610,9 +620,23 @@ static int show_map_close_plain(int fd, struct bpf_map_info *info)
 				printf("\n\tpinned %s", obj->path);
 		}
 	}
+	printf("\n");
+
+	if (frozen_str) {
+		frozen = atoi(frozen_str);
+		free(frozen_str);
+	}
+
+	if (!info->btf_id && !frozen)
+		return 0;
+
+	printf("\t");
 
 	if (info->btf_id)
-		printf("\n\tbtf_id %d", info->btf_id);
+		printf("btf_id %d", info->btf_id);
+
+	if (frozen)
+		printf("%sfrozen", info->btf_id ? "  " : "");
 
 	printf("\n");
 	return 0;
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 0/2] tools: bpftool: work with frozen maps
From: Quentin Monnet @ 2019-08-21  8:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet

Hi,
This is a simple set to add support for BPF map freezing to bpftool. First
patch makes bpftool indicate if a map is frozen when listing BPF maps.
Second patch adds a command to freeze a map loaded on the system.

Quentin Monnet (2):
  tools: bpftool: show frozen status for maps
  tools: bpftool: add "bpftool map freeze" subcommand

 .../bpf/bpftool/Documentation/bpftool-map.rst |  9 +++
 tools/bpf/bpftool/bash-completion/bpftool     |  4 +-
 tools/bpf/bpftool/map.c                       | 64 +++++++++++++++++--
 3 files changed, 71 insertions(+), 6 deletions(-)

-- 
2.17.1


^ permalink raw reply

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
From: Toshiaki Makita @ 2019-08-21  8:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu
In-Reply-To: <20190819111546.35a8ed76@cakuba.netronome.com>

On 19/08/20 (火) 3:15:46, Jakub Kicinski wrote:

I'm on vacation and replying slowly. Sorry for any inconvenience.

> On Sat, 17 Aug 2019 23:01:59 +0900, Toshiaki Makita wrote:
>> On 19/08/17 (土) 3:52:24, Jakub Kicinski wrote:
>>> On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote:
>>>> On 2019/08/16 4:22, Jakub Kicinski wrote:
>>>>> There's a certain allure in bringing the in-kernel BPF translation
>>>>> infrastructure forward. OTOH from system architecture perspective IMHO
>>>>> it does seem like a task best handed in user space. bpfilter can replace
>>>>> iptables completely, here we're looking at an acceleration relatively
>>>>> loosely coupled with flower.
>>>>
>>>> I don't think it's loosely coupled. Emulating TC behavior in userspace
>>>> is not so easy.
>>>>
>>>> Think about recent multi-mask support in flower. Previously userspace could
>>>> assume there is one mask and hash table for each preference in TC. After the
>>>> change TC accepts different masks with the same pref. Such a change tends to
>>>> break userspace emulation. It may ignore masks passed from flow insertion
>>>> and use the mask remembered when the first flow of the pref is inserted. It
>>>> may override the mask of all existing flows with the pref. It may fail to
>>>> insert such flows. Any of them would result in unexpected wrong datapath
>>>> handling which is critical.
>>>> I think such an emulation layer needs to be updated in sync with TC.
>>>
>>> Oh, so you're saying that if xdp_flow is merged all patches to
>>> cls_flower and netfilter which affect flow offload will be required
>>> to update xdp_flow as well?
>>
>> Hmm... you are saying that we are allowed to break other in-kernel
>> subsystem by some change? Sounds strange...
> 
> No I'm not saying that, please don't put words in my mouth.

If we ignore xdp_flow when modifying something which affects flow 
offload, that may cause breakage. I showed such an example using 
multi-mask support. So I just wondered what you mean and guessed you 
think we can break other subsystem in some situation.

I admit I should not have used the wording "you are saying...?". If it 
was not unpleasant to you I'm sorry about that. But I think you should 
not use it as well. I did not say "cls_flower and netfilter which affect 
flow offload will be required to update xdp_flow". I guess most patches 
which affect flow offload core will not break xdp_flow. In some cases 
breakage may happen. In that case we need to fix xdp_flow as well.

> I'm asking you if that's your intention.
> 
> Having an implementation nor support a feature of another implementation
> and degrade gracefully to the slower one is not necessarily breakage.
> We need to make a concious decision here, hence the clarifying question.

As I described above, breakage can happen in some case, and if the patch 
breaks xdp_flow I think we need to fix xdp_flow at the same time. If 
xdp_flow does not support newly added features but it works for existing 
ones, it is OK. In the first place not all features can be offloaded to 
xdp_flow. I think this is the same as HW-offload.

>>> That's a question of policy. Technically the implementation in user
>>> space is equivalent.
>>>
>>> The advantage of user space implementation is that you can add more
>>> to it and explore use cases which do not fit in the flow offload API,
>>> but are trivial for BPF. Not to mention the obvious advantage of
>>> decoupling the upgrade path.
>>
>> I understand the advantage, but I can't trust such a third-party kernel
>> emulation solution for this kind of thing which handles critical data path.
> 
> That's a strange argument to make. All production data path BPF today
> comes from user space.

Probably my explanation was not sufficient. What I'm concerned about is 
that this needs to emulate kernel behavior, and it is difficult.
I don't think userspace-defined datapath itself is not reliable, nor 
eBPF ecosystem.

Toshiaki Makita

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox