Netdev List
 help / color / mirror / Atom feed
* Re: [RFC bpf-next 0/5] Convert iproute2 to use libbpf (WIP)
From: Andrii Nakryiko @ 2019-08-22  7:49 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stephen Hemminger, Daniel Borkmann, Alexei Starovoitov,
	Martin KaFai Lau, Song Liu, Yonghong Song, David Miller,
	Jesper Dangaard Brouer, Networking, bpf
In-Reply-To: <87blwiqlc8.fsf@toke.dk>

On Wed, Aug 21, 2019 at 2:07 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Tue, Aug 20, 2019 at 4:47 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> iproute2 uses its own bpf loader to load eBPF programs, which has
> >> evolved separately from libbpf. Since we are now standardising on
> >> libbpf, this becomes a problem as iproute2 is slowly accumulating
> >> feature incompatibilities with libbpf-based loaders. In particular,
> >> iproute2 has its own (expanded) version of the map definition struct,
> >> which makes it difficult to write programs that can be loaded with both
> >> custom loaders and iproute2.
> >>
> >> This series seeks to address this by converting iproute2 to using libbpf
> >> for all its bpf needs. This version is an early proof-of-concept RFC, to
> >> get some feedback on whether people think this is the right direction.
> >>
> >> What this series does is the following:
> >>
> >> - Updates the libbpf map definition struct to match that of iproute2
> >>   (patch 1).
> >
> >
> > Hi Toke,
> >
> > Thanks for taking a stab at unifying libbpf and iproute2 loaders. I'm
> > totally in support of making iproute2 use libbpf to load/initialize
> > BPF programs. But I'm against adding iproute2-specific fields to
> > libbpf's bpf_map_def definitions to support this.
> >
> > I've proposed the plan of extending libbpf's supported features so
> > that it can be used to load iproute2-style BPF programs earlier,
> > please see discussions in [0] and [1].
>
> Yeah, I've seen that discussion, and agree that longer term this is
> probably a better way to do map-in-map definitions.
>
> However, I view your proposal as complementary to this series: we'll
> probably also want the BTF-based definition to work with iproute2, and
> that means iproute2 needs to be ported to libbpf. But iproute2 needs to
> be backwards compatible with the format it supports now, and, well, this
> series is the simplest way to achieve that IMO :)

Ok, I understand that. But I'd still want to avoid adding extra cruft
to libbpf just for backwards-compatibility with *exact* iproute2
format. Libbpf as a whole is trying to move away from relying on
binary bpf_map_def and into using BTF-defined map definitions, and
this patch series is a step backwards in that regard, that adds,
essentially, already outdated stuff that we'll need to support forever
(I mean those extra fields in bpf_map_def, that will stay there
forever).

We've discussed one way to deal with it, IMO, in a cleaner way. It can
be done in few steps:

1. I originally wanted BTF-defined map definitions to ignore unknown
fields. It shouldn't be a default mode, but it should be supported
(and of course is very easy to add). So let's add that and let libbpf
ignore unknown stuff.

2. Then to let iproute2 loader deal with backwards-compatibility for
libbpf-incompatible bpf_elf_map, we need to "pass-through" all those
fields so that users of libbpf (iproute2 loader, in this case) can
make use of it. The easiest and cleanest way to do this is to expose
BTF ID of a type describing each map entry and let iproute2 process
that in whichever way it sees fit.

Luckily, bpf_elf_map is compatible in `type` field, which will let
libbpf recognize bpf_elf_map as map definition. All the rest setup
will be done by iproute2, by processing BTF of bpf_elf_map, which will
let it set up map sizes, flags and do all of its map-in-map magic.

The only additions to libbpf in this case would be a new `__u32
bpf_map__btf_id(struct bpf_map* map);` API.

I haven't written any code and haven't 100% checked that this will
cover everything, but I think we should try. This will allow to let
users of libbpf do custom stuff with map definitions without having to
put all this extra logic into libbpf itself, which I think is
desirable outcome.


>
> > I think instead of emulating iproute2 way of matching everything based
> > on user-specified internal IDs, which doesn't provide good user
> > experience and is quite easy to get wrong, we should support same
> > scenarios with better declarative syntax and in a less error-prone
> > way. I believe we can do that by relying on BTF more heavily (again,
> > please check some of my proposals in [0], [1], and discussion with
> > Daniel in those threads). It will feel more natural and be more
> > straightforward to follow. It would be great if you can lend a hand in
> > implementing pieces of that plan!
> >
> > I'm currently on vacation, so my availability is very sparse, but I'd
> > be happy to discuss this further, if need be.
>
> Happy to collaborate on your proposal when you're back from vacation;
> but as I said above, I believe this is a complementary longer-term
> thing...
>
> -Toke

^ permalink raw reply

* Re: [RFC bpf-next 0/5] Convert iproute2 to use libbpf (WIP)
From: Andrii Nakryiko @ 2019-08-22  7:52 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Stephen Hemminger, Daniel Borkmann,
	Alexei Starovoitov, Martin KaFai Lau, Song Liu, Yonghong Song,
	David Miller, Jesper Dangaard Brouer, Networking, bpf
In-Reply-To: <87ef1eqlnb.fsf@toke.dk>

On Wed, Aug 21, 2019 at 4:29 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Tue, Aug 20, 2019 at 01:47:01PM +0200, Toke Høiland-Jørgensen wrote:
> >> iproute2 uses its own bpf loader to load eBPF programs, which has
> >> evolved separately from libbpf. Since we are now standardising on
> >> libbpf, this becomes a problem as iproute2 is slowly accumulating
> >> feature incompatibilities with libbpf-based loaders. In particular,
> >> iproute2 has its own (expanded) version of the map definition struct,
> >> which makes it difficult to write programs that can be loaded with both
> >> custom loaders and iproute2.
> >>
> >> This series seeks to address this by converting iproute2 to using libbpf
> >> for all its bpf needs. This version is an early proof-of-concept RFC, to
> >> get some feedback on whether people think this is the right direction.
> >>
> >> What this series does is the following:
> >>
> >> - Updates the libbpf map definition struct to match that of iproute2
> >>   (patch 1).
> >> - Adds functionality to libbpf to support automatic pinning of maps when
> >>   loading an eBPF program, while re-using pinned maps if they already
> >>   exist (patches 2-3).
> >> - Modifies iproute2 to make it possible to compile it against libbpf
> >>   without affecting any existing functionality (patch 4).
> >> - Changes the iproute2 eBPF loader to use libbpf for loading XDP
> >>   programs (patch 5).
> >>
> >>
> >> As this is an early PoC, there are still a few missing pieces before
> >> this can be merged. Including (but probably not limited to):
> >>
> >> - Consolidate the map definition struct in the bpf_helpers.h file in the
> >>   kernel tree. This contains a different, and incompatible, update to
> >>   the struct. Since the iproute2 version has actually been released for
> >>   use outside the kernel tree (and thus is subject to API stability
> >>   constraints), I think it makes the most sense to keep that, and port
> >>   the selftests to use it.
> >
> > It sounds like you're implying that existing libbpf format is not
> > uapi.
>
> No, that's not what I meant... See below.
>
> > It is and we cannot break it.
> > If patch 1 means breakage for existing pre-compiled .o that won't load
> > with new libbpf then we cannot use this method.
> > Recompiling .o with new libbpf definition of bpf_map_def isn't an option.
> > libbpf has to be smart before/after and recognize both old and iproute2 format.
>
> The libbpf.h definition of struct bpf_map_def is compatible with the one
> used in iproute2. In libbpf.h, the struct only contains five fields
> (type, key_size, value_size, max_entries and flags), and iproute2 adds
> another 4 (id, pinning, inner_id and inner_idx; these are the ones in
> patch 1 in this series).
>
> The issue I was alluding to above is that the bpf_helpers.h file in the
> kernel selftests directory *also* extends the bpf_map_def struct, and
> adds two *different* fields (inner_map_idx and numa_mode). The former is
> used to implement the same map-in-map definition functionality that
> iproute2 has, but with different semantics. The latter is additional to
> that, and I'm planning to add that to this series.
>
> Since bpf_helpers.h is *not* part of libbpf (yet), this will make it

We should start considering it as if it was, so if we can avoid adding
stuff that I'd need to untangle to move it into libbpf, I'd rather
avoid it.
We've already prepared this move by relicensing bpf_helpers.h. Moving
it into libbpf itself is immediate next thing I'll do when I'm back.

> possible to keep API (and ABI) compatibility with both iproute2 and
> libbpf. As in, old .o files will still load with libbpf after this
> series, they just won't be able to use the new automatic pinning
> feature.
>
> -Toke

^ permalink raw reply

* Re: [PATCH net-next v3 4/8] MIPS: dts: mscc: describe the PTP ready interrupt
From: Antoine Tenart @ 2019-08-22  7:56 UTC (permalink / raw)
  To: davem, richardcochran, alexandre.belloni, UNGLinuxDriver, ralf,
	paul.burton, jhogan
  Cc: Antoine Tenart, netdev, linux-mips, thomas.petazzoni,
	allan.nielsen
In-Reply-To: <20190724081715.29159-5-antoine.tenart@bootlin.com>

Hello,

On Wed, Jul 24, 2019 at 10:17:11AM +0200, Antoine Tenart wrote:
> This patch adds a description of the PTP ready interrupt, which can be
> triggered when a PTP timestamp is available on an hardware FIFO.
> 
> Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
> Acked-by: Paul Burton <paul.burton@mips.com>

The net patches of this series were applied into the net-next tree.
However the two dts patches were not and should go through the MIPS
tree. Gentle ping about this :)

Thanks,
Antoine

> ---
>  arch/mips/boot/dts/mscc/ocelot.dtsi | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/mips/boot/dts/mscc/ocelot.dtsi b/arch/mips/boot/dts/mscc/ocelot.dtsi
> index 1e55a778def5..797d336db54d 100644
> --- a/arch/mips/boot/dts/mscc/ocelot.dtsi
> +++ b/arch/mips/boot/dts/mscc/ocelot.dtsi
> @@ -139,8 +139,8 @@
>  				    "port2", "port3", "port4", "port5", "port6",
>  				    "port7", "port8", "port9", "port10", "qsys",
>  				    "ana", "s2";
> -			interrupts = <21 22>;
> -			interrupt-names = "xtr", "inj";
> +			interrupts = <18 21 22>;
> +			interrupt-names = "ptp_rdy", "xtr", "inj";
>  
>  			ethernet-ports {
>  				#address-cells = <1>;
> -- 
> 2.21.0
> 

-- 
Antoine Ténart, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH net] ixgbe: fix double clean of tx descriptors with xdp
From: Ilya Maximets @ 2019-08-22  8:05 UTC (permalink / raw)
  To: Björn Töpel, Alexander Duyck
  Cc: Jakub Kicinski, Daniel Borkmann, Netdev, William Tu, LKML,
	Alexei Starovoitov, intel-wired-lan, bpf, Björn Töpel,
	David S. Miller, Magnus Karlsson, Eelco Chaudron
In-Reply-To: <CAJ+HfNjo0tpk2v_+85SuX7Jw797QwRA7uJBggPHtY=JznLC9Zg@mail.gmail.com>

On 22.08.2019 10:12, Björn Töpel wrote:
> On Wed, 21 Aug 2019 at 18:57, Alexander Duyck <alexander.duyck@gmail.com> wrote:
>>
>> On Wed, Aug 21, 2019 at 9:22 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>>
>>> On 21.08.2019 4:17, Alexander Duyck wrote:
>>>> On Tue, Aug 20, 2019 at 8:58 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>>>>
>>>>> On 20.08.2019 18:35, Alexander Duyck wrote:
>>>>>> On Tue, Aug 20, 2019 at 8:18 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>>>>>>
>>>>>>> Tx code doesn't clear the descriptor status after cleaning.
>>>>>>> So, if the budget is larger than number of used elems in a ring, some
>>>>>>> descriptors will be accounted twice and xsk_umem_complete_tx will move
>>>>>>> prod_tail far beyond the prod_head breaking the comletion queue ring.
>>>>>>>
>>>>>>> Fix that by limiting the number of descriptors to clean by the number
>>>>>>> of used descriptors in the tx ring.
>>>>>>>
>>>>>>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>>>>>>
>>>>>> I'm not sure this is the best way to go. My preference would be to
>>>>>> have something in the ring that would prevent us from racing which I
>>>>>> don't think this really addresses. I am pretty sure this code is safe
>>>>>> on x86 but I would be worried about weak ordered systems such as
>>>>>> PowerPC.
>>>>>>
>>>>>> It might make sense to look at adding the eop_desc logic like we have
>>>>>> in the regular path with a proper barrier before we write it and after
>>>>>> we read it. So for example we could hold of on writing the bytecount
>>>>>> value until the end of an iteration and call smp_wmb before we write
>>>>>> it. Then on the cleanup we could read it and if it is non-zero we take
>>>>>> an smp_rmb before proceeding further to process the Tx descriptor and
>>>>>> clearing the value. Otherwise this code is going to just keep popping
>>>>>> up with issues.
>>>>>
>>>>> But, unlike regular case, xdp zero-copy xmit and clean for particular
>>>>> tx ring always happens in the same NAPI context and even on the same
>>>>> CPU core.
>>>>>
>>>>> I saw the 'eop_desc' manipulations in regular case and yes, we could
>>>>> use 'next_to_watch' field just as a flag of descriptor existence,
>>>>> but it seems unnecessarily complicated. Am I missing something?
>>>>>
>>>>
>>>> So is it always in the same NAPI context?. I forgot, I was thinking
>>>> that somehow the socket could possibly make use of XDP for transmit.
>>>
>>> AF_XDP socket only triggers tx interrupt on ndo_xsk_async_xmit() which
>>> is used in zero-copy mode. Real xmit happens inside
>>> ixgbe_poll()
>>>  -> ixgbe_clean_xdp_tx_irq()
>>>     -> ixgbe_xmit_zc()
>>>
>>> This should be not possible to bound another XDP socket to the same netdev
>>> queue.
>>>
>>> It also possible to xmit frames in xdp_ring while performing XDP_TX/REDIRECT
>>> actions. REDIRECT could happen from different netdev with different NAPI
>>> context, but this operation is bound to specific CPU core and each core has
>>> its own xdp_ring.
>>>
>>> However, I'm not an expert here.
>>> Björn, maybe you could comment on this?
>>>
>>>>
>>>> As far as the logic to use I would be good with just using a value you
>>>> are already setting such as the bytecount value. All that would need
>>>> to happen is to guarantee that the value is cleared in the Tx path. So
>>>> if you clear the bytecount in ixgbe_clean_xdp_tx_irq you could
>>>> theoretically just use that as well to flag that a descriptor has been
>>>> populated and is ready to be cleaned. Assuming the logic about this
>>>> all being in the same NAPI context anyway you wouldn't need to mess
>>>> with the barrier stuff I mentioned before.
>>>
>>> Checking the number of used descs, i.e. next_to_use - next_to_clean,
>>> makes iteration in this function logically equal to the iteration inside
>>> 'ixgbe_xsk_clean_tx_ring()'. Do you think we need to change the later
>>> function too to follow same 'bytecount' approach? I don't like having
>>> two different ways to determine number of used descriptors in the same file.
>>>
>>> Best regards, Ilya Maximets.
>>
>> As far as ixgbe_clean_xdp_tx_irq() vs ixgbe_xsk_clean_tx_ring(), I
>> would say that if you got rid of budget and framed things more like
>> how ixgbe_xsk_clean_tx_ring was framed with the ntc != ntu being
>> obvious I would prefer to see us go that route.
>>
>> Really there is no need for budget in ixgbe_clean_xdp_tx_irq() if you
>> are going to be working with a static ntu value since you will only
>> ever process one iteration through the ring anyway. It might make more
>> sense if you just went through and got rid of budget and i, and
>> instead used ntc and ntu like what was done in
>> ixgbe_xsk_clean_tx_ring().
>>
> 
> +1. I'd prefer this as well!

Sounds good. I'll look in this direction.
But I'm not completely sure about 'budget' elimination. Wouldn't it cause
issues if we'll clean the whole ring at once? I mean maybe it'll be too long
to hold the CPU core for this amount of work.
We could re-work the code keeping the loop break on budget exhaustion.
What do you think?

> 
> 
> Cheers,
> Björn
> 
>> Thanks.
>>
>> - Alex
>> _______________________________________________
>> Intel-wired-lan mailing list
>> Intel-wired-lan@osuosl.org
>> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
> 
> 

^ permalink raw reply

* [PATCH net-next] r8152: saving the settings of EEE
From: Hayes Wang @ 2019-08-22  8:07 UTC (permalink / raw)
  To: netdev; +Cc: nic_swsd, linux-kernel, Hayes Wang

Saving the settings of EEE to avoid they become the default settings
after reset_resume().

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
 drivers/net/usb/r8152.c | 86 +++++++++++++++++++++++++++--------------
 1 file changed, 56 insertions(+), 30 deletions(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 1aa61610f0bb..bbc65a94a83f 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -751,6 +751,7 @@ struct r8152 {
 
 	atomic_t rx_count;
 
+	bool eee_en;
 	int intr_interval;
 	u32 saved_wolopts;
 	u32 msg_enable;
@@ -762,6 +763,7 @@ struct r8152 {
 
 	u16 ocp_base;
 	u16 speed;
+	u16 eee_adv;
 	u8 *intr_buff;
 	u8 version;
 	u8 duplex;
@@ -3202,8 +3204,13 @@ static void r8152_eee_en(struct r8152 *tp, bool enable)
 
 static void r8152b_enable_eee(struct r8152 *tp)
 {
-	r8152_eee_en(tp, true);
-	r8152_mmd_write(tp, MDIO_MMD_AN, MDIO_AN_EEE_ADV, MDIO_EEE_100TX);
+	if (tp->eee_en) {
+		r8152_eee_en(tp, true);
+		r8152_mmd_write(tp, MDIO_MMD_AN, MDIO_AN_EEE_ADV, tp->eee_adv);
+	} else {
+		r8152_eee_en(tp, false);
+		r8152_mmd_write(tp, MDIO_MMD_AN, MDIO_AN_EEE_ADV, 0);
+	}
 }
 
 static void r8152b_enable_fc(struct r8152 *tp)
@@ -3495,8 +3502,13 @@ static void r8153_hw_phy_cfg(struct r8152 *tp)
 	sram_write(tp, SRAM_10M_AMP1, 0x00af);
 	sram_write(tp, SRAM_10M_AMP2, 0x0208);
 
-	r8153_eee_en(tp, true);
-	ocp_reg_write(tp, OCP_EEE_ADV, MDIO_EEE_1000T | MDIO_EEE_100TX);
+	if (tp->eee_en) {
+		r8153_eee_en(tp, true);
+		ocp_reg_write(tp, OCP_EEE_ADV, tp->eee_adv);
+	} else {
+		r8153_eee_en(tp, false);
+		ocp_reg_write(tp, OCP_EEE_ADV, 0);
+	}
 
 	r8153_aldps_en(tp, true);
 	r8152b_enable_fc(tp);
@@ -3599,8 +3611,13 @@ static void r8153b_hw_phy_cfg(struct r8152 *tp)
 
 	r8153b_ups_flags_w1w0(tp, ups_flags, 0);
 
-	r8153b_eee_en(tp, true);
-	ocp_reg_write(tp, OCP_EEE_ADV, MDIO_EEE_1000T | MDIO_EEE_100TX);
+	if (tp->eee_en) {
+		r8153b_eee_en(tp, true);
+		ocp_reg_write(tp, OCP_EEE_ADV, tp->eee_adv);
+	} else {
+		r8153b_eee_en(tp, false);
+		ocp_reg_write(tp, OCP_EEE_ADV, 0);
+	}
 
 	r8153b_aldps_en(tp, true);
 	r8153b_enable_fc(tp);
@@ -4891,7 +4908,7 @@ static void rtl8152_get_strings(struct net_device *dev, u32 stringset, u8 *data)
 
 static int r8152_get_eee(struct r8152 *tp, struct ethtool_eee *eee)
 {
-	u32 ocp_data, lp, adv, supported = 0;
+	u32 lp, adv, supported = 0;
 	u16 val;
 
 	val = r8152_mmd_read(tp, MDIO_MMD_PCS, MDIO_PCS_EEE_ABLE);
@@ -4903,13 +4920,10 @@ static int r8152_get_eee(struct r8152 *tp, struct ethtool_eee *eee)
 	val = r8152_mmd_read(tp, MDIO_MMD_AN, MDIO_AN_EEE_LPABLE);
 	lp = mmd_eee_adv_to_ethtool_adv_t(val);
 
-	ocp_data = ocp_read_word(tp, MCU_TYPE_PLA, PLA_EEE_CR);
-	ocp_data &= EEE_RX_EN | EEE_TX_EN;
-
-	eee->eee_enabled = !!ocp_data;
+	eee->eee_enabled = tp->eee_en;
 	eee->eee_active = !!(supported & adv & lp);
 	eee->supported = supported;
-	eee->advertised = adv;
+	eee->advertised = tp->eee_adv;
 	eee->lp_advertised = lp;
 
 	return 0;
@@ -4919,19 +4933,22 @@ static int r8152_set_eee(struct r8152 *tp, struct ethtool_eee *eee)
 {
 	u16 val = ethtool_adv_to_mmd_eee_adv_t(eee->advertised);
 
-	r8152_eee_en(tp, eee->eee_enabled);
+	tp->eee_en = eee->eee_enabled;
+	tp->eee_adv = val;
 
-	if (!eee->eee_enabled)
-		val = 0;
+	r8152_eee_en(tp, eee->eee_enabled);
 
-	r8152_mmd_write(tp, MDIO_MMD_AN, MDIO_AN_EEE_ADV, val);
+	if (eee->eee_enabled)
+		r8152_mmd_write(tp, MDIO_MMD_AN, MDIO_AN_EEE_ADV, val);
+	else
+		r8152_mmd_write(tp, MDIO_MMD_AN, MDIO_AN_EEE_ADV, 0);
 
 	return 0;
 }
 
 static int r8153_get_eee(struct r8152 *tp, struct ethtool_eee *eee)
 {
-	u32 ocp_data, lp, adv, supported = 0;
+	u32 lp, adv, supported = 0;
 	u16 val;
 
 	val = ocp_reg_read(tp, OCP_EEE_ABLE);
@@ -4943,13 +4960,10 @@ static int r8153_get_eee(struct r8152 *tp, struct ethtool_eee *eee)
 	val = ocp_reg_read(tp, OCP_EEE_LPABLE);
 	lp = mmd_eee_adv_to_ethtool_adv_t(val);
 
-	ocp_data = ocp_read_word(tp, MCU_TYPE_PLA, PLA_EEE_CR);
-	ocp_data &= EEE_RX_EN | EEE_TX_EN;
-
-	eee->eee_enabled = !!ocp_data;
+	eee->eee_enabled = tp->eee_en;
 	eee->eee_active = !!(supported & adv & lp);
 	eee->supported = supported;
-	eee->advertised = adv;
+	eee->advertised = tp->eee_adv;
 	eee->lp_advertised = lp;
 
 	return 0;
@@ -4959,12 +4973,15 @@ static int r8153_set_eee(struct r8152 *tp, struct ethtool_eee *eee)
 {
 	u16 val = ethtool_adv_to_mmd_eee_adv_t(eee->advertised);
 
-	r8153_eee_en(tp, eee->eee_enabled);
+	tp->eee_en = eee->eee_enabled;
+	tp->eee_adv = val;
 
-	if (!eee->eee_enabled)
-		val = 0;
+	r8153_eee_en(tp, eee->eee_enabled);
 
-	ocp_reg_write(tp, OCP_EEE_ADV, val);
+	if (eee->eee_enabled)
+		ocp_reg_write(tp, OCP_EEE_ADV, val);
+	else
+		ocp_reg_write(tp, OCP_EEE_ADV, 0);
 
 	return 0;
 }
@@ -4973,12 +4990,15 @@ static int r8153b_set_eee(struct r8152 *tp, struct ethtool_eee *eee)
 {
 	u16 val = ethtool_adv_to_mmd_eee_adv_t(eee->advertised);
 
-	r8153b_eee_en(tp, eee->eee_enabled);
+	tp->eee_en = eee->eee_enabled;
+	tp->eee_adv = val;
 
-	if (!eee->eee_enabled)
-		val = 0;
+	r8153b_eee_en(tp, eee->eee_enabled);
 
-	ocp_reg_write(tp, OCP_EEE_ADV, val);
+	if (eee->eee_enabled)
+		ocp_reg_write(tp, OCP_EEE_ADV, val);
+	else
+		ocp_reg_write(tp, OCP_EEE_ADV, 0);
 
 	return 0;
 }
@@ -5353,6 +5373,8 @@ static int rtl_ops_init(struct r8152 *tp)
 		ops->hw_phy_cfg		= r8152b_hw_phy_cfg;
 		ops->autosuspend_en	= rtl_runtime_suspend_enable;
 		tp->rx_buf_sz		= 16 * 1024;
+		tp->eee_en		= true;
+		tp->eee_adv		= MDIO_EEE_100TX;
 		break;
 
 	case RTL_VER_03:
@@ -5371,6 +5393,8 @@ static int rtl_ops_init(struct r8152 *tp)
 		ops->hw_phy_cfg		= r8153_hw_phy_cfg;
 		ops->autosuspend_en	= rtl8153_runtime_enable;
 		tp->rx_buf_sz		= 32 * 1024;
+		tp->eee_en		= true;
+		tp->eee_adv		= MDIO_EEE_1000T | MDIO_EEE_100TX;
 		break;
 
 	case RTL_VER_08:
@@ -5387,6 +5411,8 @@ static int rtl_ops_init(struct r8152 *tp)
 		ops->hw_phy_cfg		= r8153b_hw_phy_cfg;
 		ops->autosuspend_en	= rtl8153b_runtime_enable;
 		tp->rx_buf_sz		= 32 * 1024;
+		tp->eee_en		= true;
+		tp->eee_adv		= MDIO_EEE_1000T | MDIO_EEE_100TX;
 		break;
 
 	default:
-- 
2.21.0


^ permalink raw reply related

* Re: [RFC 1/4] Add usb_get_address and usb_set_address support
From: Oliver Neukum @ 2019-08-22  8:08 UTC (permalink / raw)
  To: Charles.Hyde, gregkh
  Cc: Mario.Limonciello, nic_swsd, linux-acpi, linux-usb, netdev
In-Reply-To: <1566430506442.20925@Dellteam.com>

Am Mittwoch, den 21.08.2019, 23:35 +0000 schrieb
Charles.Hyde@dellteam.com:
> <snipped>
> > 
> > This is a VERY cdc-net-specific function.  It is not a "generic" USB
> > function at all.  Why does it belong in the USB core?  Shouldn't it live
> > in the code that handles the other cdc-net-specific logic?
> > 
> > thanks,
> > 
> > greg k-h
> 
> 
> Thank you for this feedback, Greg.  I was not sure about adding this to message.c, because of the USB_CDC_GET_NET_ADDRESS.  I had found references to SET_ADDRESS in the USB protocol at https://wiki.osdev.org/Universal_Serial_Bus#USB_Protocol.  If one wanted a generic USB function for SET_ADDRESS, to be used for both sending a MAC address and receiving one, how would you suggest this be implemented?  This is a legit question because I am curious.

Your implementation was, except for missing error handling, usable.
The problem is where you put it. CDC messages exist only for CDC
devices. Now it is true that there is no generic CDC driver.
Creating a module just for that would cost more memory than it saves
in most cases.
But MACs are confined to network devices. Hence the functionality
can be put into usbnet. It should not be put into any individual
driver, so that every network driver can use it without duplication.

> Your feedback led to moving the functionality into cdc_ncm.c for today's testing, and removing all changes from messages.c, usb.h, usbnet.c, and usbnet.h.  This may be where I end up long term, but I would like to learn if there is a possible solution that could live in message.c and be callable from other USB-to-Ethernet aware drivers.

All those drivers use usbnet. Hence there it should be.

	Regards
		Oliver


^ permalink raw reply

* [PATCH net-next v5] sched: Add dualpi2 qdisc
From: Tilmans, Olivier (Nokia - BE/Antwerp) @ 2019-08-22  8:10 UTC (permalink / raw)
  Cc: Eric Dumazet, Stephen Hemminger, Olga Albisser,
	De Schepper, Koen (Nokia - BE/Antwerp),
	Tilmans, Olivier (Nokia - BE/Antwerp), Bob Briscoe, Henrik Steen,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, David S. Miller,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org

From: Olga Albisser <olga@albisser.org>

DualPI2 provides L4S-type low latency & loss to traffic that uses a
scalable congestion controller (e.g. TCP-Prague, DCTCP) without
degrading the performance of 'classic' traffic (e.g. Reno,
Cubic etc.). It is intended to be the reference implementation of the
IETF's DualQ Coupled AQM.

The qdisc provides two queues called low latency and classic. It
classifies packets based on the ECN field in the IP headers. By
default it directs non-ECN and ECT(0) into the classic queue and
ECT(1) and CE into the low latency queue, as per the IETF spec.

Each queue runs its own AQM:
* The classic AQM is called PI2, which is similar to the PIE AQM but
   more responsive and simpler. Classic traffic requires a decent
   target queue (default 15ms for Internet deployment) to fully
   utilize the link and to avoid high drop rates.
* The low latency AQM is, by default, a very shallow ECN marking
   threshold (1ms) similar to that used for DCTCP.

The DualQ isolates the low queuing delay of the Low Latency queue
from the larger delay of the 'Classic' queue. However, from a
bandwidth perspective, flows in either queue will share out the link
capacity as if there was just a single queue. This bandwidth pooling
effect is achieved by coupling together the drop and ECN-marking
probabilities of the two AQMs.

The PI2 AQM has two main parameters in addition to its target delay.
All the defaults are suitable for any Internet setting, but it can
be reconfigured for a Data Centre setting. The integral gain factor
alpha is used to slowly correct any persistent standing queue error
from the target delay, while the proportional gain factor beta is
used to quickly compensate for queue changes (growth or shrinkage).
Either alpha and beta are given as a parameter, or they can be
calculated by tc from alternative typical and maximum RTT parameters.

Internally, the output of a linear Proportional Integral (PI)
controller is used for both queues. This output is squared to
calculate the drop or ECN-marking probability of the classic queue.
This counterbalances the square-root rate equation of Reno/Cubic,
which is the trick that balances flow rates across the queues. For
the ECN-marking probability of the low latency queue, the output of
the base AQM is multiplied by a coupling factor. This determines the
balance between the flow rates in each queue. The default setting
makes the flow rates roughly equal, which should be generally
applicable.

If DUALPI2 AQM has detected overload (due to excessive non-responsive
traffic in either queue), it will switch to signaling congestion
solely using drop, irrespective of the ECN field. Alternatively, it
can be configured to limit the drop probability and let the queue
grow and eventually overflow (like tail-drop).

Additional details can be found in the draft:
https://www.ietf.org/id/draft-ietf-tsvwg-aqm-dualq-coupled

Signed-off-by: Olga Albisser <olga@albisser.org>
Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
Signed-off-by: Henrik Steen <henrist@henrist.net>
---

Notes:
    Changelog:
    * v4 -> v5
      - Fix do_div() usage in calculate_probability() to preserve sign
    * v3 -> v4
      - Replaced license boiletplate with SPDX identifier
      - Fix missing pskb_may_pull() calls when accessing ECN bits
      - Move timestamp computation at enqueue to happen after drop check
      - Use NMI-safe time keeping function, i.e., ktime_get_ns()
      - Switched from deprecated PSCHED_NS2TICKS/... to raw nanoseconds clocks
      - Validate netlink parameters properly (ranges, error reporting)
      - Expanded the statistics tracked/reported to better reflect the behavior of
        both queues
      - Simplified the qdisc structure:
        o Reworked classification logic to only depend on an ECN mask
        o Renamed most parameters to better reflect their usage
        o Removed unused/experimental features (e.g., TS-FIFO)
        o Restructured the skb->cb
        o Extracted helper functions
      - Fix compilation issues for ARM
      - Updated defaults parameter values to latest IETF ID
      - Fix the step AQM being applied on empty queues, causing excess marking on
        slower links
    * v2 -> v3
      - Fix compilation issues
      - Replaced the classic queue starvation protection from time-shifted FIFO
        to WRR, as it gives better results (e.g., prevents leaking burst in the C
        queue to the L queue)
    * v1 -> v2
      - Store enqueue timestamp in skb->cb to avoid conflict with EDT

 include/uapi/linux/pkt_sched.h |  33 ++
 net/sched/Kconfig              |  22 +-
 net/sched/Makefile             |   1 +
 net/sched/sch_dualpi2.c        | 746 +++++++++++++++++++++++++++++++++
 4 files changed, 801 insertions(+), 1 deletion(-)
 create mode 100644 net/sched/sch_dualpi2.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 18f185299f47..e2ad4a8d2059 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1180,4 +1180,37 @@ enum {
 
 #define TCA_TAPRIO_ATTR_MAX (__TCA_TAPRIO_ATTR_MAX - 1)
 
+/* DUALPI2 */
+enum {
+	TCA_DUALPI2_UNSPEC,
+	TCA_DUALPI2_LIMIT,		/* Packets */
+	TCA_DUALPI2_TARGET,		/* us */
+	TCA_DUALPI2_TUPDATE,		/* us */
+	TCA_DUALPI2_ALPHA,		/* Hz scaled up by 256 */
+	TCA_DUALPI2_BETA,		/* HZ scaled up by 256 */
+	TCA_DUALPI2_STEP_THRESH,	/* Packets or us */
+	TCA_DUALPI2_STEP_PACKETS,	/* Whether STEP_THRESH is in packets */
+	TCA_DUALPI2_COUPLING,		/* Coupling factor between queues */
+	TCA_DUALPI2_DROP_OVERLOAD,	/* Whether to drop on overload */
+	TCA_DUALPI2_DROP_EARLY,		/* Whether to drop on enqueue */
+	TCA_DUALPI2_C_PROTECTION,	/* Percentage */
+	TCA_DUALPI2_ECN_MASK,		/* L4S queue classification mask */
+	TCA_DUALPI2_PAD,
+	__TCA_DUALPI2_MAX
+};
+
+#define TCA_DUALPI2_MAX   (__TCA_DUALPI2_MAX - 1)
+
+struct tc_dualpi2_xstats {
+	__u32 prob;		/* current probability */
+	__u32 delay_c;		/* current delay in C queue */
+	__u32 delay_l;		/* current delay in L queue */
+	__s32 credit;		/* current c_protection credit */
+	__u32 packets_in_c;	/* number of packets enqueued in C queue */
+	__u32 packets_in_l;	/* number of packets enqueued in L queue */
+	__u32 maxq;		/* maximum queue size */
+	__u32 ecn_mark;		/* packets marked with ecn*/
+	__u32 step_marks;	/* ECN marks due to the step AQM */
+};
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index afd2ba157a13..f9340c18c3a2 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -409,6 +409,26 @@ config NET_SCH_PLUG
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_plug.
 
+config NET_SCH_DUALPI2
+	tristate "Dual Queue Proportional Integral Controller Improved with a Square (DUALPI2) scheduler"
+	help
+	  Say Y here if you want to use the DualPI2 AQM.
+	  This is a combination of the DUALQ Coupled-AQM with a PI2 base-AQM.
+	  The PI2 AQM is in turn both an extension and a simplification of the
+	  PIE AQM. PI2 makes quite some PIE heuristics unnecessary, while being
+	  able to control scalable congestion controls like DCTCP and
+	  TCP-Prague. With PI2, both Reno/Cubic can be used in parallel with
+	  DCTCP, maintaining window fairness. DUALQ provides latency separation
+	  between low latency DCTCP flows and Reno/Cubic flows that need a
+	  bigger queue.
+	  For more information, please see
+	  https://www.ietf.org/id/draft-ietf-tsvwg-aqm-dualq-coupled
+
+	  To compile this code as a module, choose M here: the module
+	  will be called sch_dualpi2.
+
+	  If unsure, say N.
+
 menuconfig NET_SCH_DEFAULT
 	bool "Allow override default queue discipline"
 	---help---
@@ -418,7 +438,7 @@ menuconfig NET_SCH_DEFAULT
 	  of pfifo_fast will be used. Many distributions already set
 	  the default value via /proc/sys/net/core/default_qdisc.
 
-	  If unsure, say N.
+
 
 if NET_SCH_DEFAULT
 
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 415d1e1f237e..8e3bd4459eb4 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_NET_SCH_PIE)	+= sch_pie.o
 obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
 obj-$(CONFIG_NET_SCH_ETF)	+= sch_etf.o
 obj-$(CONFIG_NET_SCH_TAPRIO)	+= sch_taprio.o
+obj-$(CONFIG_NET_SCH_DUALPI2)   += sch_dualpi2.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
new file mode 100644
index 000000000000..c6c851499d35
--- /dev/null
+++ b/net/sched/sch_dualpi2.c
@@ -0,0 +1,746 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2019 Nokia.
+ *
+ * Author: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
+ * Author: Olga Albisser <olga@albisser.org>
+ * Author: Henrik Steen <henrist@henrist.net>
+ * Author: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
+ *
+ * DualPI Improved with a Square (dualpi2):
+ *   Supports scalable congestion controls (e.g., DCTCP)
+ *   Supports coupled dual-queue with PI2
+ *   Supports L4S ECN identifier
+ *
+ * References:
+ *   draft-ietf-tsvwg-aqm-dualq-coupled:
+ *     http://tools.ietf.org/html/draft-ietf-tsvwg-aqm-dualq-coupled-08
+ *   De Schepper, Koen, et al. "PI 2: A linearized AQM for both classic and
+ *   scalable TCP."  in proc. ACM CoNEXT'16, 2016.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+#include <net/inet_ecn.h>
+#include <linux/string.h>
+
+/* 32b enable to support flows with windows up to ~8.6 * 1e9 packets
+ * i.e., twice the maximal snd_cwnd.
+ * MAX_PROB must be consistent with the RNG in dualpi2_roll().
+ */
+#define MAX_PROB ((u32)(~((u32)0)))
+/* alpha/beta values exchanged over netlink are in units of 256ns */
+#define ALPHA_BETA_SHIFT 8
+/* Scaled values of alpha/beta must fit in 32b to avoid overflow in later
+ * computations. Consequently (see and dualpi2_scale_alpha_beta()), their
+ * netlink-provided values can use at most 31b, i.e. be at most most (2^23)-1
+ * (~4MHz) as those are given in 1/256th. This enable to tune alpha/beta to
+ * control flows whose maximal RTTs can be in usec up to few secs.
+ */
+#define ALPHA_BETA_MAX ((2 << 31) - 1)
+/* Internal alpha/beta are in units of 64ns.
+ * This enables to use all alpha/beta values in the allowed range without loss
+ * of precision due to rounding when scaling them internally, e.g.,
+ * scale_alpha_beta(1) will not round down to 0.
+ */
+#define ALPHA_BETA_GRANULARITY 6
+#define ALPHA_BETA_SCALING (ALPHA_BETA_SHIFT - ALPHA_BETA_GRANULARITY)
+/* We express the weights (wc, wl) in %, i.e., wc + wl = 100 */
+#define MAX_WC 100
+
+struct dualpi2_sched_data {
+	struct Qdisc *l_queue;	/* The L4S LL queue */
+	struct Qdisc *sch;	/* The classic queue (owner of this struct) */
+
+	struct { /* PI2 parameters */
+		u64	target;	/* Target delay in nanoseconds */
+		u32	tupdate;/* timer frequency (in jiffies) */
+		u32	prob;	/* Base PI2 probability */
+		u32	alpha;	/* Gain factor for the integral rate response */
+		u32	beta;	/* Gain factor for the proportional response */
+		struct timer_list timer; /* prob update timer */
+	} pi2;
+
+	struct { /* Step AQM (L4S queue only) parameters */
+		u32 thresh;	/* Step threshold */
+		bool in_packets;/* Whether the step is in packets or time */
+	} step;
+
+	struct { /* Classic queue starvation protection */
+		s32	credit; /* Credit (sign indicates which queue) */
+		s32	init;	/* Reset value of the credit */
+		u8	wc;	/* C queue weight (between 0 and MAX_WC) */
+		u8	wl;	/* L queue weight (MAX_WC - wc) */
+	} c_protection;
+
+	/* General dualQ parameters */
+	u8	coupling_factor;/* Coupling factor (k) between both queues */
+	u8	ecn_mask;	/* Mask to match L4S packets */
+	bool	drop_early;	/* Drop at enqueue instead of dequeue if true */
+	bool	drop_overload;	/* Drop (1) on overload, or overflow (0) */
+
+	/* Statistics */
+	u64	qdelay_c;	/* Classic Q delay */
+	u64	qdelay_l;	/* L4S Q delay */
+	u32	packets_in_c;	/* Number of packets enqueued in C queue */
+	u32	packets_in_l;	/* Number of packets enqueued in L queue */
+	u32	maxq;		/* maximum queue size */
+	u32	ecn_mark;	/* packets marked with ECN */
+	u32	step_marks;	/* ECN marks due to the step AQM */
+
+	struct { /* Deferred drop statistics */
+		u32 cnt;	/* Packets dropped */
+		u32 len;	/* Bytes dropped */
+	} deferred_drops;
+};
+
+struct dualpi2_skb_cb {
+	u64 ts;		/* Timestamp at enqueue */
+	u8 apply_step:1,/* Can we apply the step threshold */
+	   l4s:1,	/* Packet has been classified as L4S */
+	   ect:2;	/* Packet ECT codepoint */
+};
+
+static inline struct dualpi2_skb_cb *dualpi2_skb_cb(struct sk_buff *skb)
+{
+	qdisc_cb_private_validate(skb, sizeof(struct dualpi2_skb_cb));
+	return (struct dualpi2_skb_cb *)qdisc_skb_cb(skb)->data;
+}
+
+static inline u64 skb_sojourn_time(struct sk_buff *skb, u64 reference)
+{
+	return reference - dualpi2_skb_cb(skb)->ts;
+}
+
+static inline u64 qdelay_in_ns(struct Qdisc *q, u64 now)
+{
+	struct sk_buff *skb = qdisc_peek_head(q);
+
+	return skb ? skb_sojourn_time(skb, now) : 0;
+}
+
+static inline u32 dualpi2_scale_alpha_beta(u32 param)
+{
+	u64 tmp  = ((u64)param * MAX_PROB >> ALPHA_BETA_SCALING);
+	do_div(tmp, NSEC_PER_SEC);
+	return tmp;
+}
+
+static inline u32 dualpi2_unscale_alpha_beta(u32 param)
+{
+	u64 tmp = ((u64)param * NSEC_PER_SEC << ALPHA_BETA_SCALING);
+	do_div(tmp, MAX_PROB);
+	return tmp;
+}
+
+static inline bool skb_is_l4s(struct sk_buff *skb)
+{
+	return dualpi2_skb_cb(skb)->l4s != 0;
+}
+
+static inline void dualpi2_mark(struct dualpi2_sched_data *q,
+				struct sk_buff *skb)
+{
+	if (INET_ECN_set_ce(skb))
+		q->ecn_mark++;
+}
+
+static inline void dualpi2_reset_c_protection(struct dualpi2_sched_data *q)
+{
+	q->c_protection.credit = q->c_protection.init;
+}
+
+static inline void dualpi2_calculate_c_protection(struct Qdisc *sch,
+						  struct dualpi2_sched_data *q,
+						  u32 wc)
+{
+	q->c_protection.wc = wc;
+	q->c_protection.wl = MAX_WC - wc;
+	/* Start with L queue if wl > wc */
+	q->c_protection.init = (s32)psched_mtu(qdisc_dev(sch)) *
+		((int)q->c_protection.wc - (int)q->c_protection.wl);
+	dualpi2_reset_c_protection(q);
+}
+
+static inline bool dualpi2_roll(u32 prob)
+{
+	return prandom_u32() <= prob;
+}
+
+static inline bool dualpi2_squared_roll(struct dualpi2_sched_data *q)
+{
+	return dualpi2_roll(q->pi2.prob) && dualpi2_roll(q->pi2.prob);
+}
+
+static inline bool dualpi2_is_overloaded(u64 prob)
+{
+	return prob > MAX_PROB;
+}
+
+static bool must_drop(struct Qdisc *sch, struct dualpi2_sched_data *q,
+		      struct sk_buff *skb)
+{
+	u64 local_l_prob;
+
+	/* Never drop if we have fewer than 2 mtu-sized packets;
+	 * similar to min_th in RED.
+	 */
+	if (sch->qstats.backlog < 2 * psched_mtu(qdisc_dev(sch)))
+		return false;
+
+	local_l_prob = (u64)q->pi2.prob * q->coupling_factor;
+
+	if (skb_is_l4s(skb)) {
+		if (dualpi2_is_overloaded(local_l_prob)) {
+			/* On overload, preserve delay by doing a classic drop
+			 * in the L queue. Otherwise, let both queues grow until
+			 * we reach the limit and cannot enqueue anymore
+			 * (sacrifice delay to avoid drops).
+			 */
+			if (q->drop_overload && dualpi2_squared_roll(q))
+				goto drop;
+			else
+				goto mark;
+			/* Scalable marking has a  (prob * k) probability */
+		} else if (dualpi2_roll(local_l_prob)) {
+			goto mark;
+		}
+		/* Apply classic marking with a (prob * prob) probability.
+		 * Force drops for ECN-capable traffic on overload.
+		 */
+	} else if (dualpi2_squared_roll(q)) {
+		if (dualpi2_skb_cb(skb)->ect &&
+		    !dualpi2_is_overloaded(local_l_prob))
+			goto mark;
+		else
+			goto drop;
+	}
+	return false;
+
+mark:
+	dualpi2_mark(q, skb);
+	return false;
+
+drop:
+	return true;
+}
+
+static void dualpi2_skb_classify(struct dualpi2_sched_data *q,
+				 struct sk_buff *skb)
+{
+	struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
+	int wlen = skb_network_offset(skb);
+
+	switch (tc_skb_protocol(skb)) {
+	case htons(ETH_P_IP):
+		wlen += sizeof(struct iphdr);
+		if (!pskb_may_pull(skb, wlen) ||
+		    skb_try_make_writable(skb, wlen))
+			goto not_ecn;
+
+		cb->ect = ipv4_get_dsfield(ip_hdr(skb)) & INET_ECN_MASK;
+		break;
+	case htons(ETH_P_IPV6):
+		wlen += sizeof(struct ipv6hdr);
+		if (!pskb_may_pull(skb, wlen) ||
+		    skb_try_make_writable(skb, wlen))
+			goto not_ecn;
+
+		cb->ect = ipv6_get_dsfield(ipv6_hdr(skb)) & INET_ECN_MASK;
+		break;
+	default:
+		goto not_ecn;
+	}
+	cb->l4s = (cb->ect & q->ecn_mask) != 0;
+	return;
+
+not_ecn:
+	/* Not ECN capable or not non pullable/writable packets can only be
+	 * dropped hence go the the classic queue.
+	 */
+	cb->ect = INET_ECN_NOT_ECT;
+	cb->l4s = 0;
+}
+
+static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+				 struct sk_buff **to_free)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	int err;
+
+	if (unlikely(qdisc_qlen(sch) >= sch->limit)) {
+		qdisc_qstats_overlimit(sch);
+		err = NET_XMIT_DROP;
+		goto drop;
+	}
+
+	dualpi2_skb_classify(q, skb);
+
+	/* drop early if configured */
+	if (q->drop_early && must_drop(sch, q, skb)) {
+		err = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+		goto drop;
+	}
+
+	dualpi2_skb_cb(skb)->ts = ktime_get_ns();
+
+	if (qdisc_qlen(sch) > q->maxq)
+		q->maxq = qdisc_qlen(sch);
+
+	if (skb_is_l4s(skb)) {
+		/* Only apply the step if a queue is building up */
+		dualpi2_skb_cb(skb)->apply_step = qdisc_qlen(q->l_queue) > 1;
+		/* Keep the overall qdisc stats consistent */
+		++sch->q.qlen;
+		qdisc_qstats_backlog_inc(sch, skb);
+		++q->packets_in_l;
+		return qdisc_enqueue_tail(skb, q->l_queue);
+	}
+	++q->packets_in_c;
+	return qdisc_enqueue_tail(skb, sch);
+
+drop:
+	qdisc_drop(skb, sch, to_free);
+	return err;
+}
+
+static struct sk_buff *dualpi2_qdisc_dequeue(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+	int qlen_c, credit_change;
+
+pick_packet:
+	/* L queue packets are also accounted for in qdisc_qlen(sch)! */
+	qlen_c = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
+	skb = NULL;
+	/* We can drop after qdisc_dequeue_head() calls.
+	 * Manage statistics by hand to keep them consistent if that happens.
+	 */
+	if (qdisc_qlen(q->l_queue) > 0 &&
+	    (qlen_c <= 0 || q->c_protection.credit <= 0)) {
+		/* Dequeue and increase the credit by wc if qlen_c != 0 */
+		skb = __qdisc_dequeue_head(&q->l_queue->q);
+		credit_change = qlen_c ?
+			q->c_protection.wc * qdisc_pkt_len(skb) : 0;
+		/* The global backlog will be updated later. */
+		qdisc_qstats_backlog_dec(q->l_queue, skb);
+		/* Propagate the dequeue to the global stats. */
+		--sch->q.qlen;
+	} else if (qlen_c > 0) {
+		/* Dequeue and decrease the credit by wl if qlen_l != 0 */
+		skb = __qdisc_dequeue_head(&sch->q);
+		credit_change = qdisc_qlen(q->l_queue) ?
+			(s32)(-1) * q->c_protection.wl * qdisc_pkt_len(skb) : 0;
+	} else {
+		dualpi2_reset_c_protection(q);
+		goto exit;
+	}
+	qdisc_qstats_backlog_dec(sch, skb);
+
+	/* Drop on dequeue? */
+	if (!q->drop_early && must_drop(sch, q, skb)) {
+		++q->deferred_drops.cnt;
+		q->deferred_drops.len += qdisc_pkt_len(skb);
+		consume_skb(skb);
+		qdisc_qstats_drop(sch);
+		/* try next packet */
+		goto pick_packet;
+	}
+
+	/* Apply the Step AQM to packets coming out of the L queue. */
+	if (skb_is_l4s(skb)) {
+		u64 qdelay = 0;
+
+		if (q->step.in_packets)
+			qdelay = qdisc_qlen(q->l_queue);
+		else
+			qdelay = skb_sojourn_time(skb, ktime_get_ns());
+		/* Apply the step */
+		if (likely(dualpi2_skb_cb(skb)->apply_step) &&
+		    qdelay > q->step.thresh) {
+			dualpi2_mark(q, skb);
+			++q->step_marks;
+		}
+		qdisc_bstats_update(q->l_queue, skb);
+	}
+
+	q->c_protection.credit += credit_change;
+	qdisc_bstats_update(sch, skb);
+
+exit:
+	/* We cannot call qdisc_tree_reduce_backlog() if our qlen is 0,
+	 * or HTB crashes.
+	 */
+	if (q->deferred_drops.cnt && qdisc_qlen(sch)) {
+		qdisc_tree_reduce_backlog(sch, q->deferred_drops.cnt,
+					  q->deferred_drops.len);
+		q->deferred_drops.cnt = 0;
+		q->deferred_drops.len = 0;
+	}
+	return skb;
+}
+
+static s64 __scale_delta(u64 diff)
+{
+	do_div(diff, (1 << (ALPHA_BETA_GRANULARITY + 1)) - 1);
+	return diff;
+}
+
+static u32 calculate_probability(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	u64 qdelay, qdelay_old, now;
+	u32 new_prob;
+	s64 delta;
+
+	qdelay_old = max_t(u64, q->qdelay_c, q->qdelay_l);
+	now = ktime_get_ns();
+	q->qdelay_l = qdelay_in_ns(q->l_queue, now);
+	q->qdelay_c = qdelay_in_ns(sch, now);
+	qdelay = max_t(u64, q->qdelay_c, q->qdelay_l);
+	/* Alpha and beta take at most 32b, i.e, the delay difference would
+	 * overflow for queueing delay differences > ~4.2sec.
+	 */
+	delta = ((s64)qdelay - q->pi2.target) * q->pi2.alpha;
+	delta += ((s64)qdelay - qdelay_old) * q->pi2.beta;
+	/* Prevent overflow */
+	if (delta > 0) {
+		new_prob = __scale_delta(delta) + q->pi2.prob;
+		if (new_prob < q->pi2.prob)
+			new_prob = MAX_PROB;
+	} else {
+		new_prob = q->pi2.prob - __scale_delta(delta * -1);
+		/* Prevent underflow */
+		if (new_prob > q->pi2.prob)
+			new_prob = 0;
+	}
+	/* If we do not drop on overload, ensure we cap the L4S probability to
+	 * 100% to keep window fairness when overflowing.
+	 */
+	if (!q->drop_overload)
+		return min_t(u32, new_prob, MAX_PROB / q->coupling_factor);
+	return new_prob;
+}
+
+static void dualpi2_timer(struct timer_list *timer)
+{
+	struct dualpi2_sched_data *q = from_timer(q, timer, pi2.timer);
+	struct Qdisc *sch = q->sch;
+	spinlock_t *root_lock; /* Lock to access the head of both queues. */
+
+	root_lock = qdisc_lock(qdisc_root_sleeping(sch));
+	spin_lock(root_lock);
+
+	q->pi2.prob = calculate_probability(sch);
+	mod_timer(&q->pi2.timer, jiffies + q->pi2.tupdate);
+
+	spin_unlock(root_lock);
+}
+
+static const struct nla_policy dualpi2_policy[TCA_DUALPI2_MAX + 1] = {
+	[TCA_DUALPI2_LIMIT] = {.type = NLA_U32},
+	[TCA_DUALPI2_TARGET] = {.type = NLA_U32},
+	[TCA_DUALPI2_TUPDATE] = {.type = NLA_U32},
+	[TCA_DUALPI2_ALPHA] = {.type = NLA_U32},
+	[TCA_DUALPI2_BETA] = {.type = NLA_U32},
+	[TCA_DUALPI2_STEP_THRESH] = {.type = NLA_U32},
+	[TCA_DUALPI2_STEP_PACKETS] = {.type = NLA_U8},
+	[TCA_DUALPI2_COUPLING] = {.type = NLA_U8},
+	[TCA_DUALPI2_DROP_OVERLOAD] = {.type = NLA_U8},
+	[TCA_DUALPI2_DROP_EARLY] = {.type = NLA_U8},
+	[TCA_DUALPI2_C_PROTECTION] = {.type = NLA_U8},
+	[TCA_DUALPI2_ECN_MASK] = {.type = NLA_U8},
+};
+
+static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt,
+			  struct netlink_ext_ack *extack)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_DUALPI2_MAX + 1];
+	unsigned int old_qlen, dropped = 0;
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+	err = nla_parse_nested_deprecated(tb, TCA_DUALPI2_MAX, opt,
+					  dualpi2_policy, extack);
+	if (err < 0)
+		return err;
+
+	sch_tree_lock(sch);
+
+	if (tb[TCA_DUALPI2_LIMIT]) {
+		u32 limit = nla_get_u32(tb[TCA_DUALPI2_LIMIT]);
+
+		if (!limit) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_LIMIT],
+					    "limit must be greater than 0 !");
+			return -EINVAL;
+		}
+		sch->limit = limit;
+	}
+
+	if (tb[TCA_DUALPI2_TARGET])
+		q->pi2.target = (u64)nla_get_u32(tb[TCA_DUALPI2_TARGET]) *
+			NSEC_PER_USEC;
+
+	if (tb[TCA_DUALPI2_TUPDATE]) {
+		u64 tupdate =
+			usecs_to_jiffies(nla_get_u32(tb[TCA_DUALPI2_TUPDATE]));
+
+		if (!tupdate) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_TUPDATE],
+					    "tupdate cannot be 0 jiffies!");
+			return -EINVAL;
+		}
+		q->pi2.tupdate = tupdate;
+	}
+
+	if (tb[TCA_DUALPI2_ALPHA]) {
+		u32 alpha = nla_get_u32(tb[TCA_DUALPI2_ALPHA]);
+
+		if (alpha > ALPHA_BETA_MAX) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_ALPHA],
+					    "alpha is too large!");
+			return -EINVAL;
+		}
+		q->pi2.alpha = dualpi2_scale_alpha_beta(alpha);
+	}
+
+	if (tb[TCA_DUALPI2_BETA]) {
+		u32 beta = nla_get_u32(tb[TCA_DUALPI2_BETA]);
+
+		if (beta > ALPHA_BETA_MAX) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_BETA],
+					    "beta is too large!");
+			return -EINVAL;
+		}
+		q->pi2.beta = dualpi2_scale_alpha_beta(beta);
+	}
+
+	if (tb[TCA_DUALPI2_STEP_THRESH])
+		q->step.thresh = nla_get_u32(tb[TCA_DUALPI2_STEP_THRESH]) *
+			NSEC_PER_USEC;
+
+	if (tb[TCA_DUALPI2_COUPLING]) {
+		u8 coupling = nla_get_u8(tb[TCA_DUALPI2_COUPLING]);
+
+		if (!coupling) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_COUPLING],
+					    "Must use a non-zero coupling!");
+			return -EINVAL;
+		}
+		q->coupling_factor = coupling;
+	}
+
+	if (tb[TCA_DUALPI2_STEP_PACKETS])
+		q->step.in_packets = nla_get_u8(tb[TCA_DUALPI2_STEP_PACKETS]);
+
+	if (tb[TCA_DUALPI2_DROP_OVERLOAD])
+		q->drop_overload = nla_get_u8(tb[TCA_DUALPI2_DROP_OVERLOAD]);
+
+	if (tb[TCA_DUALPI2_DROP_EARLY])
+		q->drop_early = nla_get_u8(tb[TCA_DUALPI2_DROP_EARLY]);
+
+	if (tb[TCA_DUALPI2_C_PROTECTION]) {
+		u8 wc = nla_get_u8(tb[TCA_DUALPI2_C_PROTECTION]);
+
+		if (wc > MAX_WC) {
+			NL_SET_ERR_MSG_ATTR(extack,
+					    tb[TCA_DUALPI2_C_PROTECTION],
+					    "c_protection must be <= 100!");
+			return -EINVAL;
+		}
+		dualpi2_calculate_c_protection(sch, q, wc);
+	}
+
+	if (tb[TCA_DUALPI2_ECN_MASK])
+		q->ecn_mask = nla_get_u8(tb[TCA_DUALPI2_ECN_MASK]);
+
+	/* Drop excess packets if new limit is lower */
+	old_qlen = qdisc_qlen(sch);
+	while (qdisc_qlen(sch) > sch->limit) {
+		struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
+
+		dropped += qdisc_pkt_len(skb);
+		qdisc_qstats_backlog_dec(sch, skb);
+		rtnl_qdisc_drop(skb, sch);
+	}
+	qdisc_tree_reduce_backlog(sch, old_qlen - qdisc_qlen(sch), dropped);
+
+	sch_tree_unlock(sch);
+	return 0;
+}
+
+static void dualpi2_reset_default(struct dualpi2_sched_data *q)
+{
+	q->sch->limit = 10000; /* Holds 125ms at 1G */
+
+	q->pi2.target = 15 * NSEC_PER_MSEC;
+	q->pi2.tupdate = usecs_to_jiffies(16 * USEC_PER_MSEC);
+	q->pi2.alpha = dualpi2_scale_alpha_beta(41); /* ~0.16 Hz in 1/256th */
+	q->pi2.beta = dualpi2_scale_alpha_beta(819); /* ~3.2 Hz in 1/256th */
+	/* These values give a 10dB stability margin with max_rtt=100ms */
+
+	q->step.thresh = 1 * NSEC_PER_MSEC; /* 1ms */
+	q->step.in_packets = false; /* Step in time not packets */
+
+	dualpi2_calculate_c_protection(q->sch, q, 10); /* Defaults to wc = 10 */
+
+	q->ecn_mask = INET_ECN_ECT_1; /* l4s-id */
+	q->coupling_factor = 2; /* window fairness for equal RTTs */
+	q->drop_overload = true; /* Preserve latency by dropping on overload */
+	q->drop_early = false; /* PI2 drop on dequeue */
+}
+
+static int dualpi2_init(struct Qdisc *sch, struct nlattr *opt,
+			struct netlink_ext_ack *extack)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	q->l_queue = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+				       TC_H_MAKE(sch->handle, 1), extack);
+	if (!q->l_queue)
+		return -ENOMEM;
+
+	q->sch = sch;
+	dualpi2_reset_default(q);
+	timer_setup(&q->pi2.timer, dualpi2_timer, 0);
+
+	if (opt) {
+		int err = dualpi2_change(sch, opt, extack);
+
+		if (err)
+			return err;
+	}
+
+	mod_timer(&q->pi2.timer, (jiffies + HZ) >> 1);
+	return 0;
+}
+
+static int dualpi2_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct nlattr *opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	u64 step_thresh = q->step.thresh;
+	u64 target_usec = q->pi2.target;
+
+	if (!opts)
+		goto nla_put_failure;
+
+	do_div(target_usec, NSEC_PER_USEC);
+	if (!q->step.in_packets)
+		do_div(step_thresh, NSEC_PER_USEC);
+
+	if (nla_put_u32(skb, TCA_DUALPI2_LIMIT, sch->limit) ||
+	    nla_put_u32(skb, TCA_DUALPI2_TARGET, target_usec) ||
+	    nla_put_u32(skb, TCA_DUALPI2_TUPDATE,
+			jiffies_to_usecs(q->pi2.tupdate)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_ALPHA,
+			dualpi2_unscale_alpha_beta(q->pi2.alpha)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_BETA,
+			dualpi2_unscale_alpha_beta(q->pi2.beta)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_STEP_THRESH, step_thresh) ||
+	    nla_put_u8(skb, TCA_DUALPI2_COUPLING, q->coupling_factor) ||
+	    nla_put_u8(skb, TCA_DUALPI2_DROP_OVERLOAD, q->drop_overload) ||
+	    nla_put_u8(skb, TCA_DUALPI2_STEP_PACKETS, q->step.in_packets) ||
+	    nla_put_u8(skb, TCA_DUALPI2_DROP_EARLY, q->drop_early) ||
+	    nla_put_u8(skb, TCA_DUALPI2_C_PROTECTION, q->c_protection.wc) ||
+	    nla_put_u8(skb, TCA_DUALPI2_ECN_MASK, q->ecn_mask))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	nla_nest_cancel(skb, opts);
+	return -1;
+}
+
+static int dualpi2_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	u64 qdelay_c_usec = q->qdelay_c;
+	u64 qdelay_l_usec = q->qdelay_l;
+	struct tc_dualpi2_xstats st = {
+		.prob		= q->pi2.prob,
+		.packets_in_c	= q->packets_in_c,
+		.packets_in_l	= q->packets_in_l,
+		.maxq		= q->maxq,
+		.ecn_mark	= q->ecn_mark,
+		.credit		= q->c_protection.credit,
+		.step_marks	= q->step_marks,
+	};
+
+	do_div(qdelay_c_usec, NSEC_PER_USEC);
+	do_div(qdelay_l_usec, NSEC_PER_USEC);
+	st.delay_c = qdelay_c_usec;
+	st.delay_l = qdelay_l_usec;
+	return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static void dualpi2_reset(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	qdisc_reset_queue(sch);
+	qdisc_reset_queue(q->l_queue);
+	q->qdelay_c = 0;
+	q->qdelay_l = 0;
+	q->pi2.prob = 0;
+	q->packets_in_c = 0;
+	q->packets_in_l = 0;
+	q->maxq = 0;
+	q->ecn_mark = 0;
+	q->step_marks = 0;
+	dualpi2_reset_c_protection(q);
+}
+
+static void dualpi2_destroy(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	q->pi2.tupdate = 0;
+	del_timer_sync(&q->pi2.timer);
+	if (q->l_queue)
+		qdisc_put(q->l_queue);
+}
+
+static struct Qdisc_ops dualpi2_qdisc_ops __read_mostly = {
+	.id = "dualpi2",
+	.priv_size	= sizeof(struct dualpi2_sched_data),
+	.enqueue	= dualpi2_qdisc_enqueue,
+	.dequeue	= dualpi2_qdisc_dequeue,
+	.peek		= qdisc_peek_dequeued,
+	.init		= dualpi2_init,
+	.destroy	= dualpi2_destroy,
+	.reset		= dualpi2_reset,
+	.change		= dualpi2_change,
+	.dump		= dualpi2_dump,
+	.dump_stats	= dualpi2_dump_stats,
+	.owner		= THIS_MODULE,
+};
+
+static int __init dualpi2_module_init(void)
+{
+	return register_qdisc(&dualpi2_qdisc_ops);
+}
+
+static void __exit dualpi2_module_exit(void)
+{
+	unregister_qdisc(&dualpi2_qdisc_ops);
+}
+
+module_init(dualpi2_module_init);
+module_exit(dualpi2_module_exit);
+
+MODULE_DESCRIPTION("Dual Queue with Proportional Integral controller Improved with a Square (dualpi2) scheduler");
+MODULE_AUTHOR("Koen De Schepper");
+MODULE_AUTHOR("Olga Albisser");
+MODULE_AUTHOR("Henrik Steen");
+MODULE_AUTHOR("Olivier Tilmans");
+MODULE_LICENSE("GPL");
-- 
2.23.0


^ permalink raw reply related

* Re: [PATCH net] ixgbe: fix double clean of tx descriptors with xdp
From: Ilya Maximets @ 2019-08-22  8:17 UTC (permalink / raw)
  To: William Tu, Alexander Duyck
  Cc: Björn Töpel, Netdev, LKML, bpf, David S. Miller,
	Magnus Karlsson, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jeff Kirsher, intel-wired-lan, Eelco Chaudron
In-Reply-To: <CALDO+SZCbxEEwCS6MyHk-Cp_LJ33N=QFqwZ8uRm0e-PBRgxRYw@mail.gmail.com>

On 22.08.2019 0:38, William Tu wrote:
> On Wed, Aug 21, 2019 at 9:57 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>>
>> On Wed, Aug 21, 2019 at 9:22 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>>
>>> On 21.08.2019 4:17, Alexander Duyck wrote:
>>>> On Tue, Aug 20, 2019 at 8:58 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>>>>
>>>>> On 20.08.2019 18:35, Alexander Duyck wrote:
>>>>>> On Tue, Aug 20, 2019 at 8:18 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>>>>>>
>>>>>>> Tx code doesn't clear the descriptor status after cleaning.
>>>>>>> So, if the budget is larger than number of used elems in a ring, some
>>>>>>> descriptors will be accounted twice and xsk_umem_complete_tx will move
>>>>>>> prod_tail far beyond the prod_head breaking the comletion queue ring.
>>>>>>>
>>>>>>> Fix that by limiting the number of descriptors to clean by the number
>>>>>>> of used descriptors in the tx ring.
>>>>>>>
>>>>>>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>>>>>>
>>>>>> I'm not sure this is the best way to go. My preference would be to
>>>>>> have something in the ring that would prevent us from racing which I
>>>>>> don't think this really addresses. I am pretty sure this code is safe
>>>>>> on x86 but I would be worried about weak ordered systems such as
>>>>>> PowerPC.
>>>>>>
>>>>>> It might make sense to look at adding the eop_desc logic like we have
>>>>>> in the regular path with a proper barrier before we write it and after
>>>>>> we read it. So for example we could hold of on writing the bytecount
>>>>>> value until the end of an iteration and call smp_wmb before we write
>>>>>> it. Then on the cleanup we could read it and if it is non-zero we take
>>>>>> an smp_rmb before proceeding further to process the Tx descriptor and
>>>>>> clearing the value. Otherwise this code is going to just keep popping
>>>>>> up with issues.
>>>>>
>>>>> But, unlike regular case, xdp zero-copy xmit and clean for particular
>>>>> tx ring always happens in the same NAPI context and even on the same
>>>>> CPU core.
>>>>>
>>>>> I saw the 'eop_desc' manipulations in regular case and yes, we could
>>>>> use 'next_to_watch' field just as a flag of descriptor existence,
>>>>> but it seems unnecessarily complicated. Am I missing something?
>>>>>
>>>>
>>>> So is it always in the same NAPI context?. I forgot, I was thinking
>>>> that somehow the socket could possibly make use of XDP for transmit.
>>>
>>> AF_XDP socket only triggers tx interrupt on ndo_xsk_async_xmit() which
>>> is used in zero-copy mode. Real xmit happens inside
>>> ixgbe_poll()
>>>  -> ixgbe_clean_xdp_tx_irq()
>>>     -> ixgbe_xmit_zc()
>>>
>>> This should be not possible to bound another XDP socket to the same netdev
>>> queue.
>>>
>>> It also possible to xmit frames in xdp_ring while performing XDP_TX/REDIRECT
>>> actions. REDIRECT could happen from different netdev with different NAPI
>>> context, but this operation is bound to specific CPU core and each core has
>>> its own xdp_ring.
>>>
>>> However, I'm not an expert here.
>>> Björn, maybe you could comment on this?
>>>
>>>>
>>>> As far as the logic to use I would be good with just using a value you
>>>> are already setting such as the bytecount value. All that would need
>>>> to happen is to guarantee that the value is cleared in the Tx path. So
>>>> if you clear the bytecount in ixgbe_clean_xdp_tx_irq you could
>>>> theoretically just use that as well to flag that a descriptor has been
>>>> populated and is ready to be cleaned. Assuming the logic about this
>>>> all being in the same NAPI context anyway you wouldn't need to mess
>>>> with the barrier stuff I mentioned before.
>>>
>>> Checking the number of used descs, i.e. next_to_use - next_to_clean,
>>> makes iteration in this function logically equal to the iteration inside
>>> 'ixgbe_xsk_clean_tx_ring()'. Do you think we need to change the later
>>> function too to follow same 'bytecount' approach? I don't like having
>>> two different ways to determine number of used descriptors in the same file.
>>>
>>> Best regards, Ilya Maximets.
>>
>> As far as ixgbe_clean_xdp_tx_irq() vs ixgbe_xsk_clean_tx_ring(), I
>> would say that if you got rid of budget and framed things more like
>> how ixgbe_xsk_clean_tx_ring was framed with the ntc != ntu being
>> obvious I would prefer to see us go that route.
>>
>> Really there is no need for budget in ixgbe_clean_xdp_tx_irq() if you
>> are going to be working with a static ntu value since you will only
>> ever process one iteration through the ring anyway. It might make more
>> sense if you just went through and got rid of budget and i, and
>> instead used ntc and ntu like what was done in
>> ixgbe_xsk_clean_tx_ring().
>>
>> Thanks.
>>
>> - Alex
> 
> Not familiar with the driver details.
> I tested this patch and the issue mentioned in OVS mailing list.
> https://www.mail-archive.com/ovs-dev@openvswitch.org/msg35362.html
> and indeed the problem goes away.

Good. Thanks for testing!

> But I saw a huge performance drop,
> my AF_XDP tx performance drops from >9Mpps to <5Mpps.

I didn't expect so big performance difference with this change.
What is your test scenario? Is it possible that you're accounting same
packet several times due to broken completion queue?

Looking at samples/bpf/xdpsock_user.c:complete_tx_only(), it accounts
sent packets (tx_npkts) by accumulating results of xsk_ring_cons__peek()
for completion queue, so it's not a trusted source of pps information.

Best regards, Ilya Maximets.

> 
> Tested using kernel 5.3.0-rc3+
> 03:00.0 Ethernet controller: Intel Corporation Ethernet Controller
> 10-Gigabit X540-AT2 (rev 01)
> Subsystem: Intel Corporation Ethernet 10G 2P X540-t Adapter
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
> 
> Regards,
> William

^ permalink raw reply

* [net] devlink: Add method for time-stamp on reporter's dump
From: Aya Levin @ 2019-08-22  8:17 UTC (permalink / raw)
  To: David S. Miller; +Cc: Jiri Pirko, netdev, linux-kernel, Aya Levin

When setting the dump's time-stamp, use ktime_get_real in addition to
jiffies. This simplifies the user space implementation and bypasses
some inconsistent behavior with translating jiffies to current time.

Fixes: c8e1da0bf923 ("devlink: Add health report functionality")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/uapi/linux/devlink.h | 2 ++
 net/core/devlink.c           | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index ffc993256527..4dd4e4e7b19b 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -348,6 +348,8 @@ enum devlink_attr {
 	DEVLINK_ATTR_PORT_PCI_PF_NUMBER,	/* u16 */
 	DEVLINK_ATTR_PORT_PCI_VF_NUMBER,	/* u16 */
 
+	DEVLINK_ATTR_HEALTH_REPORTER_DUMP_TSPEC,
+
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index d3dbb904bf3b..b26875c4329b 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -4577,6 +4577,7 @@ struct devlink_health_reporter {
 	bool auto_recover;
 	u8 health_state;
 	u64 dump_ts;
+	struct timespec dump_real_ts;
 	u64 error_count;
 	u64 recovery_count;
 	u64 last_recovery_ts;
@@ -4749,6 +4750,7 @@ static int devlink_health_do_dump(struct devlink_health_reporter *reporter,
 		goto dump_err;
 
 	reporter->dump_ts = jiffies;
+	reporter->dump_real_ts = ktime_to_timespec(ktime_get_real());
 
 	return 0;
 
@@ -4911,6 +4913,10 @@ devlink_nl_health_reporter_fill(struct sk_buff *msg,
 			      jiffies_to_msecs(reporter->dump_ts),
 			      DEVLINK_ATTR_PAD))
 		goto reporter_nest_cancel;
+	if (reporter->dump_fmsg &&
+	    nla_put(msg, DEVLINK_ATTR_HEALTH_REPORTER_DUMP_TSPEC,
+		    sizeof(reporter->dump_real_ts), &reporter->dump_real_ts))
+		goto reporter_nest_cancel;
 
 	nla_nest_end(msg, reporter_attr);
 	genlmsg_end(msg, hdr);
-- 
2.14.1


^ permalink raw reply related

* Re: [PATCHv2 net] ipv6/addrconf: allow adding multicast addr if IFA_F_MCAUTOJOIN is set
From: Hangbin Liu @ 2019-08-22  8:20 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, Madhu Challa, David S . Miller, Jianlin Shi
In-Reply-To: <4306235d-db31-bf06-9d26-ce19319feae3@gmail.com>

On Mon, Aug 19, 2019 at 10:33:58PM -0400, David Ahern wrote:
> On 8/19/19 10:19 PM, Hangbin Liu wrote:
> > But in ipv6_add_addr() it will check the address type and reject multicast
> > address directly. So this feature is never worked for IPv6.
> 
> If true, that is really disappointing.
> 
> We need to get a functional test script started for various address cases.

Do you mean an `ip addr add` testing for all kinds of address types?

Thanks
Hangbin

^ permalink raw reply

* Re: [RFC bpf-next 0/5] Convert iproute2 to use libbpf (WIP)
From: Daniel Borkmann @ 2019-08-22  8:33 UTC (permalink / raw)
  To: Andrii Nakryiko, Toke Høiland-Jørgensen
  Cc: Stephen Hemminger, Alexei Starovoitov, Martin KaFai Lau, Song Liu,
	Yonghong Song, David Miller, Jesper Dangaard Brouer, Networking,
	bpf
In-Reply-To: <CAEf4BzYMKPbfOu4a4UDEfJVcNW1-KvRwJ7PVo+Mf_1YUJgE4Qw@mail.gmail.com>

On 8/22/19 9:49 AM, Andrii Nakryiko wrote:
> On Wed, Aug 21, 2019 at 2:07 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>> On Tue, Aug 20, 2019 at 4:47 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>>
>>>> iproute2 uses its own bpf loader to load eBPF programs, which has
>>>> evolved separately from libbpf. Since we are now standardising on
>>>> libbpf, this becomes a problem as iproute2 is slowly accumulating
>>>> feature incompatibilities with libbpf-based loaders. In particular,
>>>> iproute2 has its own (expanded) version of the map definition struct,
>>>> which makes it difficult to write programs that can be loaded with both
>>>> custom loaders and iproute2.
>>>>
>>>> This series seeks to address this by converting iproute2 to using libbpf
>>>> for all its bpf needs. This version is an early proof-of-concept RFC, to
>>>> get some feedback on whether people think this is the right direction.
>>>>
>>>> What this series does is the following:
>>>>
>>>> - Updates the libbpf map definition struct to match that of iproute2
>>>>    (patch 1).
>>>
>>> Thanks for taking a stab at unifying libbpf and iproute2 loaders. I'm
>>> totally in support of making iproute2 use libbpf to load/initialize
>>> BPF programs. But I'm against adding iproute2-specific fields to
>>> libbpf's bpf_map_def definitions to support this.
>>>
>>> I've proposed the plan of extending libbpf's supported features so
>>> that it can be used to load iproute2-style BPF programs earlier,
>>> please see discussions in [0] and [1].
>>
>> Yeah, I've seen that discussion, and agree that longer term this is
>> probably a better way to do map-in-map definitions.
>>
>> However, I view your proposal as complementary to this series: we'll
>> probably also want the BTF-based definition to work with iproute2, and
>> that means iproute2 needs to be ported to libbpf. But iproute2 needs to
>> be backwards compatible with the format it supports now, and, well, this
>> series is the simplest way to achieve that IMO :)
> 
> Ok, I understand that. But I'd still want to avoid adding extra cruft
> to libbpf just for backwards-compatibility with *exact* iproute2
> format. Libbpf as a whole is trying to move away from relying on
> binary bpf_map_def and into using BTF-defined map definitions, and
> this patch series is a step backwards in that regard, that adds,
> essentially, already outdated stuff that we'll need to support forever
> (I mean those extra fields in bpf_map_def, that will stay there
> forever).

Agree, adding these extensions for libbpf would be a step backwards
compared to using BTF defined map defs.

> We've discussed one way to deal with it, IMO, in a cleaner way. It can
> be done in few steps:
> 
> 1. I originally wanted BTF-defined map definitions to ignore unknown
> fields. It shouldn't be a default mode, but it should be supported
> (and of course is very easy to add). So let's add that and let libbpf
> ignore unknown stuff.
> 
> 2. Then to let iproute2 loader deal with backwards-compatibility for
> libbpf-incompatible bpf_elf_map, we need to "pass-through" all those
> fields so that users of libbpf (iproute2 loader, in this case) can
> make use of it. The easiest and cleanest way to do this is to expose
> BTF ID of a type describing each map entry and let iproute2 process
> that in whichever way it sees fit.
> 
> Luckily, bpf_elf_map is compatible in `type` field, which will let
> libbpf recognize bpf_elf_map as map definition. All the rest setup
> will be done by iproute2, by processing BTF of bpf_elf_map, which will
> let it set up map sizes, flags and do all of its map-in-map magic.
> 
> The only additions to libbpf in this case would be a new `__u32
> bpf_map__btf_id(struct bpf_map* map);` API.
> 
> I haven't written any code and haven't 100% checked that this will
> cover everything, but I think we should try. This will allow to let
> users of libbpf do custom stuff with map definitions without having to
> put all this extra logic into libbpf itself, which I think is
> desirable outcome.

Sounds reasonable in general, but all this still has the issue that we're
assuming that BTF is /always/ present. Existing object files that would load
just fine /today/ but do not have BTF attached won't be handled here. Wouldn't
it be more straight forward to allow passing callbacks to the libbpf loader
such that if the map section is not found to be bpf_map_def compatible, we
rely on external user aka callback to parse the ELF section, handle any
non-default libbpf behavior like pinning/retrieving from BPF fs, populate
related internal libbpf map data structures and pass control back to libbpf
loader afterwards. (Similar callback with prog section name handling for the
case where tail call maps get automatically populated.)

Thanks,
Daniel

^ permalink raw reply

* Re: [RFC v2] vsock: proposal to support multiple transports at runtime
From: Stefano Garzarella @ 2019-08-22  8:54 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: netdev, Dexuan Cui, Jorgen Hansen, David S. Miller, Vishnu Dasa,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin
In-Reply-To: <20190819130911.GE28081@stefanha-x1.localdomain>

On Mon, Aug 19, 2019 at 02:09:11PM +0100, Stefan Hajnoczi wrote:
> On Thu, Jun 06, 2019 at 12:09:12PM +0200, Stefano Garzarella wrote:
> > 
> > Hi all,
> > this is a v2 of a proposal addressing the comments made by Dexuan, Stefan,
> > and Jorgen.
> > 
> > v1: https://www.spinics.net/lists/netdev/msg570274.html
> > 
> > 
> > 
> > We can define two types of transport that we have to handle at the same time
> > (e.g. in a nested VM we would have both types of transport running together):
> > 
> > - 'host->guest' transport, it runs in the host and it is used to communicate
> >   with the guests of a specific hypervisor (KVM, VMWare or Hyper-V). It also
> >   runs in the guest who has nested guests, to communicate with them.
> > 
> >   [Phase 2]
> >   We can support multiple 'host->guest' transport running at the same time,
> >   but on x86 only one hypervisor uses VMX at any given time.
> > 
> > - 'guest->host' transport, it runs in the guest and it is used to communicate
> >   with the host.
> > 
> > 
> > The main goal is to find a way to decide what transport use in these cases:
> > 1. connect() / sendto()
> > 
> >    a. use the 'host->guest' transport, if the destination is the guest
> >       (dest_cid > VMADDR_CID_HOST).
> > 
> >       [Phase 2]
> >       In order to support multiple 'host->guest' transports running at the same
> >       time, we should assign CIDs uniquely across all transports. In this way,
> >       a packet generated by the host side will get directed to the appropriate
> >       transport based on the CID.
> > 
> >    b. use the 'guest->host' transport, if the destination is the host or the
> >       hypervisor.
> >       (dest_cid == VMADDR_CID_HOST || dest_cid == VMADDR_CID_HYPERVISOR)
> > 
> > 
> > 2. listen() / recvfrom()
> > 
> >    a. use the 'host->guest' transport, if the socket is bound to
> >       VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is no
> >       'guest->host' transport.
> >       We could also define a new VMADDR_CID_LISTEN_FROM_GUEST in order to
> >       address this case.
> > 
> >       [Phase 2]
> >       We can support network namespaces to create independent AF_VSOCK
> >       addressing domains:
> >       - could be used to partition VMs between hypervisors or at a finer
> >    	 granularity;
> >       - could be used to isolate host applications from guest applications
> >    	 using the same ports with CID_ANY;
> > 
> >    b. use the 'guest->host' transport, if the socket is bound to local CID
> >       different from the VMADDR_CID_HOST (guest CID get with
> >       IOCTL_VM_SOCKETS_GET_LOCAL_CID), or it is bound to VMADDR_CID_ANY (to be
> >       backward compatible).
> >       Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST.
> > 
> >    c. shared port space between transports
> >       For incoming requests or packets, we should be able to choose which
> >       transport use, looking at the 'port' requested.
> > 
> >       - stream sockets already support shared port space between transports
> >         (one port can be assigned to only one transport)
> > 
> >       [Phase 2]
> >       - datagram sockets will support it, but for now VMCI transport is the
> >         default transport for any host side datagram socket (KVM and Hyper-V
> >         do not yet support datagrams sockets)
> > 
> > We will make the loading of af_vsock.ko independent of the transports to
> > allow to:
> >    - create a AF_VSOCK socket without any loaded transports;
> >    - listen on a socket (e.g. bound to VMADDR_CID_ANY) without any loaded
> >      transports;
> > 
> > Hopefully, we could move MODULE_ALIAS_NETPROTO(PF_VSOCK) from the
> > vmci_transport.ko to the af_vsock.ko.
> > [Jorgen will check if this will impact the existing VMware products]
> > 
> > Notes:
> >    - For Hyper-V sockets, the host can only be Windows. No changes should
> >      be required on the Windows host to support the changes on this proposal.
> > 
> >    - Communication between guests are not allowed on any transports, so we can
> >      drop packets sent from a guest to another guest (dest_cid >
> >      VMADDR_CID_HOST) if the 'host->guest' transport is not available.
> > 
> >    - [Phase 2] tag used to identify things that can be done at a later stage,
> >      but that should be taken into account during this design.
> > 
> >    - Namespace support will be developed in [Phase 2] or in a separate project.
> > 
> > 
> > 
> > Comments and suggestions are welcome.
> > I'll be on PTO for next two weeks, so sorry in advance if I'll answer later.
> > 
> > If we agree on this proposal, when I get back, I'll start working on the code
> > to get a first PATCH RFC.
> 
> Stefano,
> I've reviewed your proposal and it looks good for solving nested
> virtualization.

Hi Stefan,
Thank you very much for the review!

> 
> The tricky implementation details will be supporting listen sockets,
> especially with VMADDR_CID_ANY so they can be accessed from both
> transports.

Yes, it will be tricky because the current implementation has 1 to 1
mapping with the transport callbacks.

Maybe I could move some logic in the core (e.g. for listening sockets)
to have a single point of control. (e.g. using vsk->pending_links in all
transports)

I'll work on it in the next weeks, I'll keep you updated.

Thanks,
Stefano

^ permalink raw reply

* Re: [RFC bpf-next 4/5] iproute2: Allow compiling against libbpf
From: Daniel Borkmann @ 2019-08-22  8:58 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Stephen Hemminger,
	Alexei Starovoitov
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, David Miller,
	Jesper Dangaard Brouer, netdev, bpf, andrii.nakryiko
In-Reply-To: <20190820114706.18546-5-toke@redhat.com>

On 8/20/19 1:47 PM, Toke Høiland-Jørgensen wrote:
> This adds a configure check for libbpf and renames functions to allow
> lib/bpf.c to be compiled with it present. This makes it possible to
> port functionality piecemeal to use libbpf.
> 
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>   configure          | 16 ++++++++++++++++
>   include/bpf_util.h |  6 +++---
>   ip/ipvrf.c         |  4 ++--
>   lib/bpf.c          | 33 +++++++++++++++++++--------------
>   4 files changed, 40 insertions(+), 19 deletions(-)
> 
> diff --git a/configure b/configure
> index 45fcffb6..5a89ee9f 100755
> --- a/configure
> +++ b/configure
> @@ -238,6 +238,19 @@ check_elf()
>       fi
>   }
>   
> +check_libbpf()
> +{
> +    if ${PKG_CONFIG} libbpf --exists; then
> +	echo "HAVE_LIBBPF:=y" >>$CONFIG
> +	echo "yes"
> +
> +	echo 'CFLAGS += -DHAVE_LIBBPF' `${PKG_CONFIG} libbpf --cflags` >> $CONFIG
> +	echo 'LDLIBS += ' `${PKG_CONFIG} libbpf --libs` >>$CONFIG
> +    else
> +	echo "no"
> +    fi
> +}
> +
>   check_selinux()

More of an implementation detail at this point in time, but want to make sure this
doesn't get missed along the way: as discussed at bpfconf [0] best for iproute2 to
handle libbpf support would be the same way of integration as pahole does, that is,
to integrate it via submodule [1] to allow kernel and libbpf features to be in sync
with iproute2 releases and therefore easily consume extensions we're adding to libbpf
to aide iproute2 integration.

Thanks,
Daniel

   [0] http://vger.kernel.org/bpfconf2019.html#session-4
   [1] https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?id=21507cd3e97bc5692d97201ee68df044c6767e9a

^ permalink raw reply

* Re: [PATCH v1] ocfs2/dlm: Move BITS_TO_BYTES() to bitops.h for wider use
From: Andy Shevchenko @ 2019-08-22  9:08 UTC (permalink / raw)
  To: Sudarsana Reddy Kalluru
  Cc: Joseph Qi, Mark Fasheh, Joel Becker, ocfs2-devel@oss.oracle.com,
	Ariel Elior, GR-everest-linux-l2, David S. Miller,
	netdev@vger.kernel.org, Colin Ian King
In-Reply-To: <MN2PR18MB2528511CEFCBC2BE07947BAAD3A50@MN2PR18MB2528.namprd18.prod.outlook.com>

On Thu, Aug 22, 2019 at 05:46:07AM +0000, Sudarsana Reddy Kalluru wrote:
> 
> > -----Original Message-----
> > From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org> On
> > Behalf Of Andy Shevchenko
> > Sent: Wednesday, August 21, 2019 2:56 PM
> > To: Joseph Qi <joseph.qi@linux.alibaba.com>
> > Cc: Mark Fasheh <mark@fasheh.com>; Joel Becker <jlbec@evilplan.org>;
> > ocfs2-devel@oss.oracle.com; Ariel Elior <aelior@marvell.com>; Sudarsana
> > Reddy Kalluru <skalluru@marvell.com>; GR-everest-linux-l2 <GR-everest-
> > linux-l2@marvell.com>; David S. Miller <davem@davemloft.net>;
> > netdev@vger.kernel.org; Colin Ian King <colin.king@canonical.com>
> > Subject: Re: [PATCH v1] ocfs2/dlm: Move BITS_TO_BYTES() to bitops.h for
> > wider use
> > 
> > On Wed, Aug 21, 2019 at 09:29:04AM +0800, Joseph Qi wrote:
> > > On 19/8/21 00:31, Andy Shevchenko wrote:
> > > > There are users already and will be more of BITS_TO_BYTES() macro.
> > > > Move it to bitops.h for wider use.

> > > > -#define BITS_TO_BYTES(x) ((x)/8)>
> > > I don't think this is a equivalent replace, or it is in fact wrong
> > > before?
> > 
> > I was thinking about this one and there are two applications:
> > - calculus of the amount of structures of certain type per PAGE
> >   (obviously off-by-one error in the original code IIUC purpose of
> > STRUCT_SIZE)
> > - calculus of some threshold based on line speed in bytes per second
> >   (I dunno it will have any difference on the Gbs / 100 MBs speeds)
> > 
> I see that both the implementations (existing vs new) yield same value for standard speeds 10G (i.e.,10000), 1G (1000) that device supports. Hence the change look to be ok.

Thank you for testing, may I use your Tested-by tag?

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* [PATCH bpf-next 0/4] xsk: various CPU barrier and {READ, WRITE}_ONCE fixes
From: Björn Töpel @ 2019-08-22  9:13 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson, bpf,
	bjorn.topel, jonathan.lemon, syzbot+c82697e3043781e08802, hdanton,
	i.maximets

This is a four patch series of various barrier, {READ, WRITE}_ONCE
cleanups in the AF_XDP socket code. More details can be found in the
corresponding commit message.

For an AF_XDP socket, most control plane operations are done under the
control mutex (struct xdp_sock, mutex), but there are some places
where members of the struct is read outside the control mutex. This,
as pointed out by Daniel in [1], requires proper {READ,
WRITE}_ONCE-correctness [2] [3]. To address this, and to simplify the
code, the state variable (introduced by Ilya), is now used a point of
synchronization ("is the socket in a valid state, or not").


Thanks,
Björn

[1] https://lore.kernel.org/bpf/beef16bb-a09b-40f1-7dd0-c323b4b89b17@iogearbox.net/
[2] https://lwn.net/Articles/793253/
[3] https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE

Björn Töpel (4):
  xsk: avoid store-tearing when assigning queues
  xsk: add proper barriers and {READ, WRITE}_ONCE-correctness for state
  xsk: avoid store-tearing when assigning umem
  xsk: lock the control mutex in sock_diag interface

 net/xdp/xsk.c      | 61 ++++++++++++++++++++++++++++++++--------------
 net/xdp/xsk_diag.c |  3 +++
 2 files changed, 46 insertions(+), 18 deletions(-)

-- 
2.20.1


^ permalink raw reply

* [PATCH bpf-next 1/4] xsk: avoid store-tearing when assigning queues
From: Björn Töpel @ 2019-08-22  9:13 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson, bpf,
	jonathan.lemon, syzbot+c82697e3043781e08802, hdanton, i.maximets
In-Reply-To: <20190822091306.20581-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

Use WRITE_ONCE when doing the store of tx, rx, fq, and cq, to avoid
potential store-tearing. These members are read outside of the control
mutex in the mmap implementation.

Fixes: 37b076933a8e ("xsk: add missing write- and data-dependency barrier")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/xsk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ee4428a892fa..f3351013c2a5 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -409,7 +409,7 @@ static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
 
 	/* Make sure queue is ready before it can be seen by others */
 	smp_wmb();
-	*queue = q;
+	WRITE_ONCE(*queue, q);
 	return 0;
 }
 
-- 
2.20.1


^ permalink raw reply related

* [PATCH bpf-next 2/4] xsk: add proper barriers and {READ, WRITE}_ONCE-correctness for state
From: Björn Töpel @ 2019-08-22  9:13 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson, bpf,
	jonathan.lemon, syzbot+c82697e3043781e08802, hdanton, i.maximets
In-Reply-To: <20190822091306.20581-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

The state variable was read, and written outside the control mutex
(struct xdp_sock, mutex), without proper barriers and {READ,
WRITE}_ONCE correctness.

In this commit this issue is addressed, and the state member is now
used a point of synchronization whether the socket is setup correctly
or not.

This also fixes a race, found by syzcaller, in xsk_poll() where umem
could be accessed when stale.

Suggested-by: Hillf Danton <hdanton@sina.com>
Reported-by: syzbot+c82697e3043781e08802@syzkaller.appspotmail.com
Fixes: 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP rings")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/xsk.c | 57 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 41 insertions(+), 16 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f3351013c2a5..31236e61069b 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -162,10 +162,23 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 	return err;
 }
 
+static bool xsk_is_bound(struct xdp_sock *xs)
+{
+	if (READ_ONCE(xs->state) == XSK_BOUND) {
+		/* Matches smp_wmb() in bind(). */
+		smp_rmb();
+		return true;
+	}
+	return false;
+}
+
 int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
 	u32 len;
 
+	if (!xsk_is_bound(xs))
+		return -EINVAL;
+
 	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
 		return -EINVAL;
 
@@ -362,6 +375,8 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 	struct sock *sk = sock->sk;
 	struct xdp_sock *xs = xdp_sk(sk);
 
+	if (unlikely(!xsk_is_bound(xs)))
+		return -ENXIO;
 	if (unlikely(!xs->dev))
 		return -ENXIO;
 	if (unlikely(!(xs->dev->flags & IFF_UP)))
@@ -378,10 +393,15 @@ static unsigned int xsk_poll(struct file *file, struct socket *sock,
 			     struct poll_table_struct *wait)
 {
 	unsigned int mask = datagram_poll(file, sock, wait);
-	struct sock *sk = sock->sk;
-	struct xdp_sock *xs = xdp_sk(sk);
-	struct net_device *dev = xs->dev;
-	struct xdp_umem *umem = xs->umem;
+	struct xdp_sock *xs = xdp_sk(sock->sk);
+	struct net_device *dev;
+	struct xdp_umem *umem;
+
+	if (unlikely(!xsk_is_bound(xs)))
+		return mask;
+
+	dev = xs->dev;
+	umem = xs->umem;
 
 	if (umem->need_wakeup)
 		dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id,
@@ -417,10 +437,9 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
 {
 	struct net_device *dev = xs->dev;
 
-	if (!dev || xs->state != XSK_BOUND)
+	if (xs->state != XSK_BOUND)
 		return;
-
-	xs->state = XSK_UNBOUND;
+	WRITE_ONCE(xs->state, XSK_UNBOUND);
 
 	/* Wait for driver to stop using the xdp socket. */
 	xdp_del_sk_umem(xs->umem, xs);
@@ -495,7 +514,9 @@ static int xsk_release(struct socket *sock)
 	local_bh_enable();
 
 	xsk_delete_from_maps(xs);
+	mutex_lock(&xs->mutex);
 	xsk_unbind_dev(xs);
+	mutex_unlock(&xs->mutex);
 
 	xskq_destroy(xs->rx);
 	xskq_destroy(xs->tx);
@@ -589,19 +610,18 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		}
 
 		umem_xs = xdp_sk(sock->sk);
-		if (!umem_xs->umem) {
-			/* No umem to inherit. */
+		if (!xsk_is_bound(umem_xs)) {
 			err = -EBADF;
 			sockfd_put(sock);
 			goto out_unlock;
-		} else if (umem_xs->dev != dev || umem_xs->queue_id != qid) {
+		}
+		if (umem_xs->dev != dev || umem_xs->queue_id != qid) {
 			err = -EINVAL;
 			sockfd_put(sock);
 			goto out_unlock;
 		}
-
 		xdp_get_umem(umem_xs->umem);
-		xs->umem = umem_xs->umem;
+		WRITE_ONCE(xs->umem, umem_xs->umem);
 		sockfd_put(sock);
 	} else if (!xs->umem || !xdp_umem_validate_queues(xs->umem)) {
 		err = -EINVAL;
@@ -626,10 +646,15 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	xdp_add_sk_umem(xs->umem, xs);
 
 out_unlock:
-	if (err)
+	if (err) {
 		dev_put(dev);
-	else
-		xs->state = XSK_BOUND;
+	} else {
+		/* Matches smp_rmb() in bind() for shared umem
+		 * sockets, and xsk_is_bound().
+		 */
+		smp_wmb();
+		WRITE_ONCE(xs->state, XSK_BOUND);
+	}
 out_release:
 	mutex_unlock(&xs->mutex);
 	rtnl_unlock();
@@ -869,7 +894,7 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 	unsigned long pfn;
 	struct page *qpg;
 
-	if (xs->state != XSK_READY)
+	if (READ_ONCE(xs->state) != XSK_READY)
 		return -EBUSY;
 
 	if (offset == XDP_PGOFF_RX_RING) {
-- 
2.20.1


^ permalink raw reply related

* [PATCH bpf-next 3/4] xsk: avoid store-tearing when assigning umem
From: Björn Töpel @ 2019-08-22  9:13 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson, bpf,
	jonathan.lemon, syzbot+c82697e3043781e08802, hdanton, i.maximets
In-Reply-To: <20190822091306.20581-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

The umem member of struct xdp_sock is read outside of the control
mutex, in the mmap implementation, and needs a WRITE_ONCE to avoid
potentional store-tearing.

Fixes: 423f38329d26 ("xsk: add umem fill queue support and mmap")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/xsk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 31236e61069b..6dde1857ed52 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -718,7 +718,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 
 		/* Make sure umem is ready before it can be seen by others */
 		smp_wmb();
-		xs->umem = umem;
+		WRITE_ONCE(xs->umem, umem);
 		mutex_unlock(&xs->mutex);
 		return 0;
 	}
-- 
2.20.1


^ permalink raw reply related

* [PATCH bpf-next 4/4] xsk: lock the control mutex in sock_diag interface
From: Björn Töpel @ 2019-08-22  9:13 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson, bpf,
	jonathan.lemon, syzbot+c82697e3043781e08802, hdanton, i.maximets
In-Reply-To: <20190822091306.20581-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

When accessing the members of an XDP socket, the control mutex should
be held. This commit fixes that.

Fixes: a36b38aa2af6 ("xsk: add sock_diag interface for AF_XDP")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/xsk_diag.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/xdp/xsk_diag.c b/net/xdp/xsk_diag.c
index d5e06c8e0cbf..c8f4f11edbbc 100644
--- a/net/xdp/xsk_diag.c
+++ b/net/xdp/xsk_diag.c
@@ -97,6 +97,7 @@ static int xsk_diag_fill(struct sock *sk, struct sk_buff *nlskb,
 	msg->xdiag_ino = sk_ino;
 	sock_diag_save_cookie(sk, msg->xdiag_cookie);
 
+	mutex_lock(&xs->mutex);
 	if ((req->xdiag_show & XDP_SHOW_INFO) && xsk_diag_put_info(xs, nlskb))
 		goto out_nlmsg_trim;
 
@@ -117,10 +118,12 @@ static int xsk_diag_fill(struct sock *sk, struct sk_buff *nlskb,
 	    sock_diag_put_meminfo(sk, nlskb, XDP_DIAG_MEMINFO))
 		goto out_nlmsg_trim;
 
+	mutex_unlock(&xs->mutex);
 	nlmsg_end(nlskb, nlh);
 	return 0;
 
 out_nlmsg_trim:
+	mutex_unlock(&xs->mutex);
 	nlmsg_cancel(nlskb, nlh);
 	return -EMSGSIZE;
 }
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH] can: Delete unnecessary checks before the macro call “dev_kfree_skb”
From: Sean Nyekjaer @ 2019-08-22  9:13 UTC (permalink / raw)
  To: Markus Elfring, linux-can, netdev, Allison Randal,
	David S. Miller, Enrico Weigelt, Greg Kroah-Hartman,
	Gustavo A. R. Silva, Lukas Wunner, Marc Kleine-Budde,
	Thomas Gleixner, Weitao Hou, Wolfgang Grandegger
  Cc: LKML, kernel-janitors
In-Reply-To: <27674907-fd2a-7f0c-84fd-d8b5124739a9@web.de>



On 21/08/2019 21.30, Markus Elfring wrote:
> From: Markus Elfring <elfring@users.sourceforge.net>
> Date: Wed, 21 Aug 2019 21:16:15 +0200
> 
> The dev_kfree_skb() function performs also input parameter validation.
> Thus the test around the shown calls is not needed.
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Sean Nyekjaer <sean@geanix.com>
> ---
>   drivers/net/can/spi/hi311x.c  | 3 +--
>   drivers/net/can/spi/mcp251x.c | 3 +--
>   2 files changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/can/spi/hi311x.c b/drivers/net/can/spi/hi311x.c
> index 03a711c3221b..7c7c7e78214c 100644
> --- a/drivers/net/can/spi/hi311x.c
> +++ b/drivers/net/can/spi/hi311x.c
> @@ -184,8 +184,7 @@ static void hi3110_clean(struct net_device *net)
> 
>   	if (priv->tx_skb || priv->tx_len)
>   		net->stats.tx_errors++;
> -	if (priv->tx_skb)
> -		dev_kfree_skb(priv->tx_skb);
> +	dev_kfree_skb(priv->tx_skb);
>   	if (priv->tx_len)
>   		can_free_echo_skb(priv->net, 0);
>   	priv->tx_skb = NULL;
> diff --git a/drivers/net/can/spi/mcp251x.c b/drivers/net/can/spi/mcp251x.c
> index 12358f06d194..1c496d2adb45 100644
> --- a/drivers/net/can/spi/mcp251x.c
> +++ b/drivers/net/can/spi/mcp251x.c
> @@ -274,8 +274,7 @@ static void mcp251x_clean(struct net_device *net)
> 
>   	if (priv->tx_skb || priv->tx_len)
>   		net->stats.tx_errors++;
> -	if (priv->tx_skb)
> -		dev_kfree_skb(priv->tx_skb);
> +	dev_kfree_skb(priv->tx_skb);
>   	if (priv->tx_len)
>   		can_free_echo_skb(priv->net, 0);
>   	priv->tx_skb = NULL;
> --
> 2.23.0
> 

Good catch Markus :-)

/Sean

^ permalink raw reply

* Re: [PATCH v2 11/11] vsock_test: wait for the remote to close the connection
From: Stefano Garzarella @ 2019-08-22  9:15 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: netdev, kvm, Dexuan Cui, virtualization, David S. Miller,
	Jorgen Hansen, linux-kernel
In-Reply-To: <20190820082828.GA9855@stefanha-x1.localdomain>

On Tue, Aug 20, 2019 at 09:28:28AM +0100, Stefan Hajnoczi wrote:
> On Thu, Aug 01, 2019 at 05:25:41PM +0200, Stefano Garzarella wrote:
> > +/* Wait for the remote to close the connection */
> > +void vsock_wait_remote_close(int fd)
> > +{
> > +	struct epoll_event ev;
> > +	int epollfd, nfds;
> > +
> > +	epollfd = epoll_create1(0);
> > +	if (epollfd == -1) {
> > +		perror("epoll_create1");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +
> > +	ev.events = EPOLLRDHUP | EPOLLHUP;
> > +	ev.data.fd = fd;
> > +	if (epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
> > +		perror("epoll_ctl");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +
> > +	nfds = epoll_wait(epollfd, &ev, 1, TIMEOUT * 1000);
> > +	if (nfds == -1) {
> > +		perror("epoll_wait");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +
> > +	if (nfds == 0) {
> > +		fprintf(stderr, "epoll_wait timed out\n");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +
> > +	assert(nfds == 1);
> > +	assert(ev.events & (EPOLLRDHUP | EPOLLHUP));
> > +	assert(ev.data.fd == fd);
> > +
> > +	close(epollfd);
> > +}
> 
> Please use timeout_begin()/timeout_end() so that the test cannot hang.
> 

I used the TIMEOUT macro in the epoll_wait() to avoid the hang.
Do you think is better to use the timeout_begin()/timeout_end()?
In this case, should I remove the timeout in the epoll_wait()?

> > diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
> > index 64adf45501ca..a664675bec5a 100644
> > --- a/tools/testing/vsock/vsock_test.c
> > +++ b/tools/testing/vsock/vsock_test.c
> > @@ -84,6 +84,11 @@ static void test_stream_client_close_server(const struct test_opts *opts)
> >  
> >  	control_expectln("CLOSED");
> >  
> > +	/* Wait for the remote to close the connection, before check
> > +	 * -EPIPE error on send.
> > +	 */
> > +	vsock_wait_remote_close(fd);
> 
> Is control_expectln("CLOSED") still necessary now that we're waiting for
> the poll event?  The control message was an attempt to wait until the
> other side closed the socket.

Right, I'll remove it in the v3

Thanks,
Stefano

^ permalink raw reply

* RE: [PATCH v1] ocfs2/dlm: Move BITS_TO_BYTES() to bitops.h for wider use
From: Sudarsana Reddy Kalluru @ 2019-08-22  9:16 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Joseph Qi, Mark Fasheh, Joel Becker, ocfs2-devel@oss.oracle.com,
	Ariel Elior, GR-everest-linux-l2, David S. Miller,
	netdev@vger.kernel.org, Colin Ian King
In-Reply-To: <20190822090855.GL30120@smile.fi.intel.com>


> -----Original Message-----
> From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Sent: Thursday, August 22, 2019 2:39 PM
> To: Sudarsana Reddy Kalluru <skalluru@marvell.com>
> Cc: Joseph Qi <joseph.qi@linux.alibaba.com>; Mark Fasheh
> <mark@fasheh.com>; Joel Becker <jlbec@evilplan.org>; ocfs2-
> devel@oss.oracle.com; Ariel Elior <aelior@marvell.com>; GR-everest-linux-l2
> <GR-everest-linux-l2@marvell.com>; David S. Miller
> <davem@davemloft.net>; netdev@vger.kernel.org; Colin Ian King
> <colin.king@canonical.com>
> Subject: Re: [PATCH v1] ocfs2/dlm: Move BITS_TO_BYTES() to bitops.h for
> wider use
> 
> On Thu, Aug 22, 2019 at 05:46:07AM +0000, Sudarsana Reddy Kalluru wrote:
> >
> > > -----Original Message-----
> > > From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org>
> On
> > > Behalf Of Andy Shevchenko
> > > Sent: Wednesday, August 21, 2019 2:56 PM
> > > To: Joseph Qi <joseph.qi@linux.alibaba.com>
> > > Cc: Mark Fasheh <mark@fasheh.com>; Joel Becker <jlbec@evilplan.org>;
> > > ocfs2-devel@oss.oracle.com; Ariel Elior <aelior@marvell.com>;
> > > Sudarsana Reddy Kalluru <skalluru@marvell.com>; GR-everest-linux-l2
> > > <GR-everest- linux-l2@marvell.com>; David S. Miller
> > > <davem@davemloft.net>; netdev@vger.kernel.org; Colin Ian King
> > > <colin.king@canonical.com>
> > > Subject: Re: [PATCH v1] ocfs2/dlm: Move BITS_TO_BYTES() to bitops.h
> > > for wider use
> > >
> > > On Wed, Aug 21, 2019 at 09:29:04AM +0800, Joseph Qi wrote:
> > > > On 19/8/21 00:31, Andy Shevchenko wrote:
> > > > > There are users already and will be more of BITS_TO_BYTES() macro.
> > > > > Move it to bitops.h for wider use.
> 
> > > > > -#define BITS_TO_BYTES(x) ((x)/8)>
> > > > I don't think this is a equivalent replace, or it is in fact wrong
> > > > before?
> > >
> > > I was thinking about this one and there are two applications:
> > > - calculus of the amount of structures of certain type per PAGE
> > >   (obviously off-by-one error in the original code IIUC purpose of
> > > STRUCT_SIZE)
> > > - calculus of some threshold based on line speed in bytes per second
> > >   (I dunno it will have any difference on the Gbs / 100 MBs speeds)
> > >
> > I see that both the implementations (existing vs new) yield same value for
> standard speeds 10G (i.e.,10000), 1G (1000) that device supports. Hence the
> change look to be ok.
> 
> Thank you for testing, may I use your Tested-by tag?
Sorry, I didn't test the actual driver flows. Was only referring to the output values for 1000/10000 speed values.

> 
> --
> With Best Regards,
> Andy Shevchenko
> 


^ permalink raw reply

* Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Jiri Pirko @ 2019-08-22  9:29 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Alex Williamson, Jiri Pirko, David S . Miller, Kirti Wankhede,
	Cornelia Huck, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cjia, netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB4866437FAA63C447CACCD7E5D1AA0@AM0PR05MB4866.eurprd05.prod.outlook.com>

Wed, Aug 21, 2019 at 08:23:17AM CEST, parav@mellanox.com wrote:
>
>
>> -----Original Message-----
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Wednesday, August 21, 2019 10:56 AM
>> To: Parav Pandit <parav@mellanox.com>
>> Cc: Jiri Pirko <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>;
>> Kirti Wankhede <kwankhede@nvidia.com>; Cornelia Huck
>> <cohuck@redhat.com>; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>> cjia <cjia@nvidia.com>; netdev@vger.kernel.org
>> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
>> 
>> > > > > Just an example of the alias, not proposing how it's set.  In
>> > > > > fact, proposing that the user does not set it, mdev-core
>> > > > > provides one
>> > > automatically.
>> > > > >
>> > > > > > > Since there seems to be some prefix overhead, as I ask about
>> > > > > > > above in how many characters we actually have to work with
>> > > > > > > in IFNAMESZ, maybe we start with 8 characters (matching your
>> > > > > > > "index" namespace) and expand as necessary for disambiguation.
>> > > > > > > If we can eliminate overhead in IFNAMESZ, let's start with 12.
>> > > > > > > Thanks,
>> > > > > > >
>> > > > > > If user is going to choose the alias, why does it have to be limited to
>> sha1?
>> > > > > > Or you just told it as an example?
>> > > > > >
>> > > > > > It can be an alpha-numeric string.
>> > > > >
>> > > > > No, I'm proposing a different solution where mdev-core creates
>> > > > > an alias based on an abbreviated sha1.  The user does not provide the
>> alias.
>> > > > >
>> > > > > > Instead of mdev imposing number of characters on the alias, it
>> > > > > > should be best
>> > > > > left to the user.
>> > > > > > Because in future if netdev improves on the naming scheme,
>> > > > > > mdev will be
>> > > > > limiting it, which is not right.
>> > > > > > So not restricting alias size seems right to me.
>> > > > > > User configuring mdev for networking devices in a given kernel
>> > > > > > knows what
>> > > > > user is doing.
>> > > > > > So user can choose alias name size as it finds suitable.
>> > > > >
>> > > > > That's not what I'm proposing, please read again.  Thanks,
>> > > >
>> > > > I understood your point. But mdev doesn't know how user is going
>> > > > to use
>> > > udev/systemd to name the netdev.
>> > > > So even if mdev chose to pick 12 characters, it could result in collision.
>> > > > Hence the proposal to provide the alias by the user, as user know
>> > > > the best
>> > > policy for its use case in the environment its using.
>> > > > So 12 character sha1 method will still work by user.
>> > >
>> > > Haven't you already provided examples where certain drivers or
>> > > subsystems have unique netdev prefixes?  If mdev provides a unique
>> > > alias within the subsystem, couldn't we simply define a netdev
>> > > prefix for the mdev subsystem and avoid all other collisions?  I'm
>> > > not in favor of the user providing both a uuid and an
>> > > alias/instance.  Thanks,
>> > >
>> > For a given prefix, say ens2f0, can two UUID->sha1 first 9 characters have
>> collision?
>> 
>> I think it would be a mistake to waste so many chars on a prefix, but 9
>> characters of sha1 likely wouldn't have a collision before we have 10s of
>> thousands of devices.  Thanks,
>> 
>> Alex
>
>Jiri, Dave,
>Are you ok with it for devlink/netdev part?
>Mdev core will create an alias from a UUID.
>
>This will be supplied during devlink port attr set such as,
>
>devlink_port_attrs_mdev_set(struct devlink_port *port, const char *mdev_alias);
>
>This alias is used to generate representor netdev's phys_port_name.
>This alias from the mdev device's sysfs will be used by the udev/systemd to generate predicable netdev's name.
>Example: enm<mdev_alias_first_12_chars>

What happens in unlikely case of 2 UUIDs collide?


>I took Ethernet mdev as an example.
>New prefix 'm' stands for mediated device.
>Remaining 12 characters are first 12 chars of the mdev alias.

Does this resolve the identification of devlink port representor? I
assume you want to use the same 12(or so) chars, don't you?


^ permalink raw reply

* RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Parav Pandit @ 2019-08-22  9:42 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Alex Williamson, Jiri Pirko, David S . Miller, Kirti Wankhede,
	Cornelia Huck, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cjia, netdev@vger.kernel.org
In-Reply-To: <20190822092903.GA2276@nanopsycho.orion>



> -----Original Message-----
> From: Jiri Pirko <jiri@resnulli.us>
> Sent: Thursday, August 22, 2019 2:59 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>; Jiri Pirko
> <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>; Kirti
> Wankhede <kwankhede@nvidia.com>; Cornelia Huck <cohuck@redhat.com>;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; cjia <cjia@nvidia.com>;
> netdev@vger.kernel.org
> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> 
> Wed, Aug 21, 2019 at 08:23:17AM CEST, parav@mellanox.com wrote:
> >
> >
> >> -----Original Message-----
> >> From: Alex Williamson <alex.williamson@redhat.com>
> >> Sent: Wednesday, August 21, 2019 10:56 AM
> >> To: Parav Pandit <parav@mellanox.com>
> >> Cc: Jiri Pirko <jiri@mellanox.com>; David S . Miller
> >> <davem@davemloft.net>; Kirti Wankhede <kwankhede@nvidia.com>;
> >> Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org;
> >> linux-kernel@vger.kernel.org; cjia <cjia@nvidia.com>;
> >> netdev@vger.kernel.org
> >> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> >>
> >> > > > > Just an example of the alias, not proposing how it's set.  In
> >> > > > > fact, proposing that the user does not set it, mdev-core
> >> > > > > provides one
> >> > > automatically.
> >> > > > >
> >> > > > > > > Since there seems to be some prefix overhead, as I ask
> >> > > > > > > about above in how many characters we actually have to
> >> > > > > > > work with in IFNAMESZ, maybe we start with 8 characters
> >> > > > > > > (matching your "index" namespace) and expand as necessary for
> disambiguation.
> >> > > > > > > If we can eliminate overhead in IFNAMESZ, let's start with 12.
> >> > > > > > > Thanks,
> >> > > > > > >
> >> > > > > > If user is going to choose the alias, why does it have to
> >> > > > > > be limited to
> >> sha1?
> >> > > > > > Or you just told it as an example?
> >> > > > > >
> >> > > > > > It can be an alpha-numeric string.
> >> > > > >
> >> > > > > No, I'm proposing a different solution where mdev-core
> >> > > > > creates an alias based on an abbreviated sha1.  The user does
> >> > > > > not provide the
> >> alias.
> >> > > > >
> >> > > > > > Instead of mdev imposing number of characters on the alias,
> >> > > > > > it should be best
> >> > > > > left to the user.
> >> > > > > > Because in future if netdev improves on the naming scheme,
> >> > > > > > mdev will be
> >> > > > > limiting it, which is not right.
> >> > > > > > So not restricting alias size seems right to me.
> >> > > > > > User configuring mdev for networking devices in a given
> >> > > > > > kernel knows what
> >> > > > > user is doing.
> >> > > > > > So user can choose alias name size as it finds suitable.
> >> > > > >
> >> > > > > That's not what I'm proposing, please read again.  Thanks,
> >> > > >
> >> > > > I understood your point. But mdev doesn't know how user is
> >> > > > going to use
> >> > > udev/systemd to name the netdev.
> >> > > > So even if mdev chose to pick 12 characters, it could result in collision.
> >> > > > Hence the proposal to provide the alias by the user, as user
> >> > > > know the best
> >> > > policy for its use case in the environment its using.
> >> > > > So 12 character sha1 method will still work by user.
> >> > >
> >> > > Haven't you already provided examples where certain drivers or
> >> > > subsystems have unique netdev prefixes?  If mdev provides a
> >> > > unique alias within the subsystem, couldn't we simply define a
> >> > > netdev prefix for the mdev subsystem and avoid all other
> >> > > collisions?  I'm not in favor of the user providing both a uuid
> >> > > and an alias/instance.  Thanks,
> >> > >
> >> > For a given prefix, say ens2f0, can two UUID->sha1 first 9
> >> > characters have
> >> collision?
> >>
> >> I think it would be a mistake to waste so many chars on a prefix, but
> >> 9 characters of sha1 likely wouldn't have a collision before we have
> >> 10s of thousands of devices.  Thanks,
> >>
> >> Alex
> >
> >Jiri, Dave,
> >Are you ok with it for devlink/netdev part?
> >Mdev core will create an alias from a UUID.
> >
> >This will be supplied during devlink port attr set such as,
> >
> >devlink_port_attrs_mdev_set(struct devlink_port *port, const char
> >*mdev_alias);
> >
> >This alias is used to generate representor netdev's phys_port_name.
> >This alias from the mdev device's sysfs will be used by the udev/systemd to
> generate predicable netdev's name.
> >Example: enm<mdev_alias_first_12_chars>
> 
> What happens in unlikely case of 2 UUIDs collide?
> 
Since users sees two devices with same phys_port_name, user should destroy recently created mdev and recreate mdev with different UUID?
> 
> >I took Ethernet mdev as an example.
> >New prefix 'm' stands for mediated device.
> >Remaining 12 characters are first 12 chars of the mdev alias.
> 
> Does this resolve the identification of devlink port representor? 
Not sure if I understood your question correctly, attemping to answer below.
phys_port_name of devlink port is defined by the first 12 characters of mdev alias.
> I assume you want to use the same 12(or so) chars, don't you?
Mdev's netdev will also use the same mdev alias from the sysfs to rename netdev name from ethX to enm<mdev_alias>, where en=Etherenet, m=mdev.

So yes, same 12 characters are use for mdev's netdev and mdev devlink port's phys_port_name.

Is that what are you asking?

^ permalink raw reply

* [PATCH bpf-next v5] libbpf: add xsk_ring_prod__nb_free() function
From: Eelco Chaudron @ 2019-08-22  9:54 UTC (permalink / raw)
  To: netdev
  Cc: ast, daniel, kafai, songliubraving, yhs, andrii.nakryiko,
	magnus.karlsson

When an AF_XDP application received X packets, it does not mean X
frames can be stuffed into the producer ring. To make it easier for
AF_XDP applications this API allows them to check how many frames can
be added into the ring.

The patch below looks like a name change only, but the xsk_prod__
prefix denotes that this API is exposed to be used by applications.

Besides, if you set the nb value to the size of the ring, you will
get the exact amount of slots available, at the cost of performance
(you touch shared state for sure). nb is there to limit the
touching of the shared state.

Also the example xdpsock application has been modified to use this
new API, so it's also able to process flows at a 1pps rate on veth
interfaces.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>

---
v4 -> v5
  - Rebase on latest bpf-next

v3 -> v4
  - Cleanedup commit message
  - Updated AF_XDP sample application to use this new API

v2 -> v3
  - Removed cache by pass option

v1 -> v2
  - Renamed xsk_ring_prod__free() to xsk_ring_prod__nb_free()
  - Add caching so it will only touch global state when needed

 samples/bpf/xdpsock_user.c | 119 ++++++++++++++++++++++++++++---------
 tools/lib/bpf/xsk.h        |   4 +-
 2 files changed, 93 insertions(+), 30 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index da84c760c094..bec0ee463184 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -470,9 +470,13 @@ static void kick_tx(struct xsk_socket_info *xsk)
 static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk,
 				     struct pollfd *fds)
 {
-	u32 idx_cq = 0, idx_fq = 0;
-	unsigned int rcvd;
+	static u64 free_frames[NUM_FRAMES];
+	static size_t nr_free_frames;
+
+	u32 idx_cq = 0, idx_fq = 0, free_slots;
+	unsigned int rcvd, i;
 	size_t ndescs;
+	int ret;
 
 	if (!xsk->outstanding_tx)
 		return;
@@ -485,29 +489,56 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk,
 
 	/* re-add completed Tx buffers */
 	rcvd = xsk_ring_cons__peek(&xsk->umem->cq, ndescs, &idx_cq);
-	if (rcvd > 0) {
-		unsigned int i;
-		int ret;
-
-		ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
-		while (ret != rcvd) {
-			if (ret < 0)
-				exit_with_error(-ret);
-			if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq))
-				ret = poll(fds, num_socks, opt_timeout);
-			ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd,
-						     &idx_fq);
-		}
-		for (i = 0; i < rcvd; i++)
+	if (!rcvd)
+		return;
+
+	/* When xsk_ring_cons__peek() for example returns that 5 packets
+	 * have been received, it does not automatically mean that
+	 * xsk_ring_prod__reserve() will have 5 slots available. You will
+	 * see this, for example, when using a veth interface due to the
+	 * RX_BATCH_SIZE used by the generic driver.
+	 *
+	 * In this example we store unused buffers and try to re-stock
+	 * them the next iteration.
+	 */
+
+	free_slots = xsk_prod__nb_free(&xsk->umem->fq, rcvd + nr_free_frames);
+	if (free_slots > rcvd + nr_free_frames)
+		free_slots = rcvd + nr_free_frames;
+
+	ret = xsk_ring_prod__reserve(&xsk->umem->fq, free_slots, &idx_fq);
+	while (ret != free_slots) {
+		if (ret < 0)
+			exit_with_error(-ret);
+
+		if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq))
+			ret = poll(fds, num_socks, opt_timeout);
+
+		ret = xsk_ring_prod__reserve(&xsk->umem->fq, free_slots,
+					     &idx_fq);
+	}
+	for (i = 0; i < rcvd; i++) {
+		u64 addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
+
+		if (i < free_slots)
 			*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) =
-				*xsk_ring_cons__comp_addr(&xsk->umem->cq,
-							  idx_cq++);
+				addr;
+		else
+			free_frames[nr_free_frames++] = addr;
+	}
 
-		xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
-		xsk_ring_cons__release(&xsk->umem->cq, rcvd);
-		xsk->outstanding_tx -= rcvd;
-		xsk->tx_npkts += rcvd;
+	if (free_slots > rcvd) {
+		for (i = 0; i < (free_slots - rcvd); i++) {
+			u64 addr = free_frames[--nr_free_frames];
+			*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) =
+				addr;
+		}
 	}
+
+	xsk_ring_prod__submit(&xsk->umem->fq, free_slots);
+	xsk_ring_cons__release(&xsk->umem->cq, rcvd);
+	xsk->outstanding_tx -= rcvd;
+	xsk->tx_npkts += rcvd;
 }
 
 static inline void complete_tx_only(struct xsk_socket_info *xsk)
@@ -531,8 +562,11 @@ static inline void complete_tx_only(struct xsk_socket_info *xsk)
 
 static void rx_drop(struct xsk_socket_info *xsk, struct pollfd *fds)
 {
+	static u64 free_frames[NUM_FRAMES];
+	static size_t nr_free_frames;
+
 	unsigned int rcvd, i;
-	u32 idx_rx = 0, idx_fq = 0;
+	u32 idx_rx = 0, idx_fq = 0, free_slots;
 	int ret;
 
 	rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
@@ -542,13 +576,30 @@ static void rx_drop(struct xsk_socket_info *xsk, struct pollfd *fds)
 		return;
 	}
 
-	ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
-	while (ret != rcvd) {
+	/* When xsk_ring_cons__peek() for example returns that 5 packets
+	 * have been received, it does not automatically mean that
+	 * xsk_ring_prod__reserve() will have 5 slots available. You will
+	 * see this, for example, when using a veth interface due to the
+	 * RX_BATCH_SIZE used by the generic driver.
+	 *
+	 * In this example we store unused buffers and try to re-stock
+	 * them the next iteration.
+	 */
+
+	free_slots = xsk_prod__nb_free(&xsk->umem->fq, rcvd + nr_free_frames);
+	if (free_slots > rcvd + nr_free_frames)
+		free_slots = rcvd + nr_free_frames;
+
+	ret = xsk_ring_prod__reserve(&xsk->umem->fq, free_slots, &idx_fq);
+	while (ret != free_slots) {
 		if (ret < 0)
 			exit_with_error(-ret);
+
 		if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq))
 			ret = poll(fds, num_socks, opt_timeout);
-		ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+
+		ret = xsk_ring_prod__reserve(&xsk->umem->fq, free_slots,
+					     &idx_fq);
 	}
 
 	for (i = 0; i < rcvd; i++) {
@@ -557,10 +608,22 @@ static void rx_drop(struct xsk_socket_info *xsk, struct pollfd *fds)
 		char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
 
 		hex_dump(pkt, len, addr);
-		*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = addr;
+		if (i < free_slots)
+			*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) =
+				addr;
+		else
+			free_frames[nr_free_frames++] = addr;
+	}
+
+	if (free_slots > rcvd) {
+		for (i = 0; i < (free_slots - rcvd); i++) {
+			u64 addr = free_frames[--nr_free_frames];
+			*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) =
+				addr;
+		}
 	}
 
-	xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+	xsk_ring_prod__submit(&xsk->umem->fq, free_slots);
 	xsk_ring_cons__release(&xsk->rx, rcvd);
 	xsk->rx_npkts += rcvd;
 }
diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
index aa1d6122b7db..520a772c882c 100644
--- a/tools/lib/bpf/xsk.h
+++ b/tools/lib/bpf/xsk.h
@@ -82,7 +82,7 @@ static inline int xsk_ring_prod__needs_wakeup(const struct xsk_ring_prod *r)
 	return *r->flags & XDP_RING_NEED_WAKEUP;
 }
 
-static inline __u32 xsk_prod_nb_free(struct xsk_ring_prod *r, __u32 nb)
+static inline __u32 xsk_prod__nb_free(struct xsk_ring_prod *r, __u32 nb)
 {
 	__u32 free_entries = r->cached_cons - r->cached_prod;
 
@@ -116,7 +116,7 @@ static inline __u32 xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb)
 static inline size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod,
 					    size_t nb, __u32 *idx)
 {
-	if (xsk_prod_nb_free(prod, nb) < nb)
+	if (xsk_prod__nb_free(prod, nb) < nb)
 		return 0;
 
 	*idx = prod->cached_prod;
-- 
2.18.1


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox