Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 1/2] net: stmmac: fix tc flower deletion for VLAN priority Rx steering
From: Yannick Vignon @ 2021-12-10 16:30 UTC (permalink / raw)
  To: Ong Boon Leong, Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	David S . Miller, Jakub Kicinski, Maxime Coquelin,
	alexandre.torgue, Kurt Kanzenbach
  Cc: netdev, linux-stm32, linux-arm-kernel
In-Reply-To: <20211209151631.138326-2-boon.leong.ong@intel.com>

Hi,

On 12/9/2021 4:16 PM, Ong Boon Leong wrote:
> To replicate the issue:-
> 
> 1) Add 2 flower filters for VLAN Priority based frame steering:-
> $ IFDEVNAME=eth0
> $ tc qdisc add dev $IFDEVNAME ingress
> $ tc qdisc add dev $IFDEVNAME root mqprio num_tc 8 \
>     map 0 1 2 3 4 5 6 7 0 0 0 0 0 0 0 0 \
>     queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
> $ tc filter add dev $IFDEVNAME parent ffff: protocol 802.1Q \
>     flower vlan_prio 0 hw_tc 0
> $ tc filter add dev $IFDEVNAME parent ffff: protocol 802.1Q \
>     flower vlan_prio 1 hw_tc 1
> 
> 2) Get the 'pref' id
> $ tc filter show dev $IFDEVNAME ingress
> 
> 3) Delete a specific tc flower record
> $ tc filter del dev $IFDEVNAME parent ffff: pref 49151
> 
>  From dmesg, we will observe kernel NULL pointer ooops
> 
> [  197.170464] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [  197.171367] #PF: supervisor read access in kernel mode
> [  197.171367] #PF: error_code(0x0000) - not-present page
> [  197.171367] PGD 0 P4D 0
> [  197.171367] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [  197.171367] CPU: 0 PID: 3216 Comm: tc Tainted: G     U      E     5.16.0-rc2+ #12
> [  197.171367] Hardware name: Intel Corporation Elkhart Lake Embedded Platform/ElkhartLake LPDDR4x T3 CRB, BIOS EHLSFWI1.R00.3273.A04.2107240322 07/24/2021
> [  197.171367] RIP: 0010:tc_setup_cls+0x20b/0x4a0 [stmmac]
> [  197.171367] Code: fe ff ff c7 43 14 00 00 00 00 48 c7 03 00 00 00 00 c7 43 1c 00 00 00 00 49 8b 44 24 28 48 8b bd b0 00 00 00 41 0f b7 54 24 58 <48> 8b 00 0f bf 8f 38 08 00 00 81 ea e0 ff 00 00 8b 00 25 00 04 00
> [  197.171367] RSP: 0018:ffff940940a037c0 EFLAGS: 00010246
> [  197.171367] RAX: 0000000000000000 RBX: ffff92e826cae2c8 RCX: ffff92e825f39000
> [  197.171367] RDX: 0000000000000000 RSI: ffff92e826cae2a8 RDI: ffff92e82f0c0000
> [  197.171367] RBP: ffff92e82f0c0940 R08: 0000000000000000 R09: ffff92e825f39434
> [  197.171367] R10: ffff92e826c5af00 R11: ffff940940a038a8 R12: ffff940940a038a8
> [  197.171367] R13: 0000000000000000 R14: 0000000000000000 R15: ffff92e830a5b600
> [  197.171367] FS:  00007fa7b0c47740(0000) GS:ffff92e964200000(0000) knlGS:0000000000000000
> [  197.171367] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  197.171367] CR2: 0000000000000000 CR3: 0000000124c50000 CR4: 0000000000350ef0
> [  197.171367] Call Trace:
> [  197.171367]  <TASK>
> [  197.171367]  ? __stmmac_disable_all_queues+0xa8/0xe0 [stmmac]
> [  197.171367]  stmmac_setup_tc_block_cb+0x70/0x110 [stmmac]
> [  197.171367]  tc_setup_cb_destroy+0xb3/0x180
> [  197.171367]  fl_hw_destroy_filter+0x94/0xc0 [cls_flower]
> [  197.171367]  __fl_delete+0x16a/0x180 [cls_flower]
> [  197.171367]  fl_destroy+0xb9/0x140 [cls_flower]
> [  197.171367]  tcf_proto_destroy+0x1d/0xa0
> [  197.171367]  tc_del_tfilter+0x3c9/0x7b0
> [  197.171367]  ? tc_dump_tfilter+0x310/0x310
> [  197.171367]  rtnetlink_rcv_msg+0x2bf/0x370
> [  197.171367]  ? preempt_count_add+0x68/0xa0
> [  197.171367]  ? _raw_spin_lock_irqsave+0x19/0x40
> [  197.171367]  ? _raw_spin_unlock_irqrestore+0x1f/0x31
> [  197.171367]  ? rtnl_calcit.isra.0+0x130/0x130
> [  197.171367]  netlink_rcv_skb+0x4e/0x100
> [  197.171367]  netlink_unicast+0x18e/0x230
> [  197.171367]  netlink_sendmsg+0x245/0x480
> [  197.171367]  sock_sendmsg+0x5b/0x60
> [  197.171367]  ____sys_sendmsg+0x20b/0x280
> [  197.171367]  ? copy_msghdr_from_user+0x5c/0x90
> [  197.171367]  ___sys_sendmsg+0x7c/0xc0
> [  197.171367]  ? folio_add_lru+0x52/0x80
> [  197.171367]  ? __sys_sendto+0xee/0x160
> [  197.171367]  __sys_sendmsg+0x59/0xa0
> [  197.171367]  do_syscall_64+0x40/0x90
> [  197.171367]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  197.171367] RIP: 0033:0x7fa7b0d64397
> [  197.171367] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
> [  197.171367] RSP: 002b:00007ffdd88b58e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> [  197.171367] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa7b0d64397
> [  197.171367] RDX: 0000000000000000 RSI: 00007ffdd88b5960 RDI: 0000000000000003
> [  197.171367] RBP: 0000000061b05c21 R08: 0000000000000001 R09: 0000564584e47890
> [  197.171367] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> [  197.171367] R13: 00007ffdd88b9a80 R14: 00000000bfff0000 R15: 0000564584e3e420
> [  197.171367]  </TASK>
> [  197.171367] Modules linked in: cls_flower sch_mqprio sch_ingress dwmac_intel(E) stmmac(E) pcs_xpcs phylink marvell marvell10g libphy 8021q bnep bluetooth ecryptfs nfsd sch_fq_codel uio uhid snd_soc_dmic snd_sof_pci_intel_tgl x86_pkg_temp_thermal snd_sof_intel_hda_common kvm_intel iTCO_wdt iTCO_vendor_support soundwire_intel mei_hdcp kvm soundwire_generic_allocation soundwire_cadence soundwire_bus irqbypass snd_sof_xtensa_dsp ax88179_178a snd_soc_acpi_intel_match intel_rapl_msr pcspkr usbnet snd_soc_acpi mii snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec i2c_i801 snd_hda_core intel_ish_ipc tpm_crb 8250_lpss intel_ishtp tpm_tis i915 mei_me i2c_smbus mei tpm_tis_core dw_dmac_core tpm spi_dw_pci parport_pc intel_pmc_core spi_dw thermal parport ttm fuse configfs snd_sof_pci snd_sof snd_soc_core snd_compress ac97_bus ledtrig_audio snd_pcm snd_timer snd soundcore [last unloaded: libphy]
> [  197.171367] CR2: 0000000000000000
> [  197.171367] ---[ end trace 8b8d1c617c39093d ]---
> 
> This patch reimplements the tc flower rx frame steering for VLAN priority
> by keeping a record of flow_cls_offload added. The implementation also
> makes way to support EtherType based RX frame steering later.
> 
> Fixes: 0e039f5cf86c ("net: stmmac: add RX frame steering based on VLAN priority in tc flower")
> Tested-by: Kurt Kanzenbach <kurt@linutronix.de>
> Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
> ---
>   drivers/net/ethernet/stmicro/stmmac/stmmac.h  | 17 ++++
>   .../net/ethernet/stmicro/stmmac/stmmac_tc.c   | 86 ++++++++++++++++---
>   2 files changed, 90 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> index 4f5292cadf5..18a262ef17f 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
> @@ -171,6 +171,19 @@ struct stmmac_flow_entry {
>   	int is_l4;
>   };
>   
> +/* Rx Frame Steering */
> +enum stmmac_rfs_type {
> +	STMMAC_RFS_T_VLAN,
> +	STMMAC_RFS_T_MAX,
> +};
> +
> +struct stmmac_rfs_entry {
> +	unsigned long cookie;
> +	int in_use;
> +	int type;
> +	int tc;
> +};
> +
>   struct stmmac_priv {
>   	/* Frequently used values are kept adjacent for cache effect */
>   	u32 tx_coal_frames[MTL_MAX_TX_QUEUES];
> @@ -288,6 +301,10 @@ struct stmmac_priv {
>   	struct stmmac_tc_entry *tc_entries;
>   	unsigned int flow_entries_max;
>   	struct stmmac_flow_entry *flow_entries;
> +	unsigned int rfs_entries_max[STMMAC_RFS_T_MAX];
> +	unsigned int rfs_entries_cnt[STMMAC_RFS_T_MAX];
> +	unsigned int rfs_entries_total;
> +	struct stmmac_rfs_entry *rfs_entries;
>   
>   	/* Pulse Per Second output */
>   	struct stmmac_pps_cfg pps[STMMAC_PPS_MAX];
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
> index 1c4ea0b1b84..d0a2b289f46 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
> @@ -232,11 +232,33 @@ static int tc_setup_cls_u32(struct stmmac_priv *priv,
>   	}
>   }
>   
> +static int tc_rfs_init(struct stmmac_priv *priv)
> +{
> +	int i;
> +
> +	priv->rfs_entries_max[STMMAC_RFS_T_VLAN] = 8;
> +
> +	for (i = 0; i < STMMAC_RFS_T_MAX; i++)
> +		priv->rfs_entries_total += priv->rfs_entries_max[i];
> +
> +	priv->rfs_entries = devm_kcalloc(priv->device,
> +					 priv->rfs_entries_total,
> +					 sizeof(*priv->rfs_entries),
> +					 GFP_KERNEL);
> +	if (!priv->rfs_entries)
> +		return -ENOMEM;
> +
> +	dev_info(priv->device, "Enabled RFS Flow TC (entries=%d)\n",
> +		 priv->rfs_entries_total);
> +
> +	return 0;
> +}
> +
>   static int tc_init(struct stmmac_priv *priv)
>   {
>   	struct dma_features *dma_cap = &priv->dma_cap;
>   	unsigned int count;
> -	int i;
> +	int ret, i;
>   
>   	if (dma_cap->l3l4fnum) {
>   		priv->flow_entries_max = dma_cap->l3l4fnum;
> @@ -250,10 +272,14 @@ static int tc_init(struct stmmac_priv *priv)
>   		for (i = 0; i < priv->flow_entries_max; i++)
>   			priv->flow_entries[i].idx = i;
>   
> -		dev_info(priv->device, "Enabled Flow TC (entries=%d)\n",
> +		dev_info(priv->device, "Enabled L3L4 Flow TC (entries=%d)\n",
>   			 priv->flow_entries_max);
>   	}
>   
> +	ret = tc_rfs_init(priv);
> +	if (ret)
> +		return -ENOMEM;
> +
>   	if (!priv->plat->fpe_cfg) {
>   		priv->plat->fpe_cfg = devm_kzalloc(priv->device,
>   						   sizeof(*priv->plat->fpe_cfg),
> @@ -607,16 +633,45 @@ static int tc_del_flow(struct stmmac_priv *priv,
>   	return ret;
>   }
>   
> +static struct stmmac_rfs_entry *tc_find_rfs(struct stmmac_priv *priv,
> +					    struct flow_cls_offload *cls,
> +					    bool get_free)
> +{
> +	int i;
> +
> +	for (i = 0; i < priv->rfs_entries_total; i++) {
> +		struct stmmac_rfs_entry *entry = &priv->rfs_entries[i];
> +
> +		if (entry->cookie == cls->cookie)
> +			return entry;
> +		if (get_free && entry->in_use == false)
> +			return entry;
> +	}
> +
> +	return NULL;
> +}
> +
>   #define VLAN_PRIO_FULL_MASK (0x07)
>   
>   static int tc_add_vlan_flow(struct stmmac_priv *priv,
>   			    struct flow_cls_offload *cls)
>   {
> +	struct stmmac_rfs_entry *entry = tc_find_rfs(priv, cls, false);
>   	struct flow_rule *rule = flow_cls_offload_flow_rule(cls);
>   	struct flow_dissector *dissector = rule->match.dissector;
>   	int tc = tc_classid_to_hwtc(priv->dev, cls->classid);
>   	struct flow_match_vlan match;
> 

While we're at it, shouldn't we also check that no actions are being 
requested and fail if there are, instead of silently ignoring them?

> +	if (!entry) {
> +		entry = tc_find_rfs(priv, cls, true);
> +		if (!entry)
> +			return -ENOENT;
> +	}
> +
> +	if (priv->rfs_entries_cnt[STMMAC_RFS_T_VLAN] >=
> +	    priv->rfs_entries_max[STMMAC_RFS_T_VLAN])
> +		return -ENOENT;
> +
>   	/* Nothing to do here */
>   	if (!dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_VLAN))
>   		return -EINVAL;
> @@ -638,6 +693,12 @@ static int tc_add_vlan_flow(struct stmmac_priv *priv,
>   
>   		prio = BIT(match.key->vlan_priority);
>   		stmmac_rx_queue_prio(priv, priv->hw, prio, tc);
> +
> +		entry->in_use = true;
> +		entry->cookie = cls->cookie;
> +		entry->tc = tc;
> +		entry->type = STMMAC_RFS_T_VLAN;
> +		priv->rfs_entries_cnt[STMMAC_RFS_T_VLAN]++;
>   	}
>   
>   	return 0;
> @@ -646,20 +707,19 @@ static int tc_add_vlan_flow(struct stmmac_priv *priv,
>   static int tc_del_vlan_flow(struct stmmac_priv *priv,
>   			    struct flow_cls_offload *cls)
>   {
> -	struct flow_rule *rule = flow_cls_offload_flow_rule(cls);
> -	struct flow_dissector *dissector = rule->match.dissector;
> -	int tc = tc_classid_to_hwtc(priv->dev, cls->classid);
> +	struct stmmac_rfs_entry *entry = tc_find_rfs(priv, cls, false);
>   
> -	/* Nothing to do here */
> -	if (!dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_VLAN))
> -		return -EINVAL;
> +	if (!entry || !entry->in_use || entry->type != STMMAC_RFS_T_VLAN)
> +		return -ENOENT;
>   
> -	if (tc < 0) {
> -		netdev_err(priv->dev, "Invalid traffic class\n");
> -		return -EINVAL;
> -	}
> +	stmmac_rx_queue_prio(priv, priv->hw, 0, entry->tc);
> +
> +	entry->in_use = false;
> +	entry->cookie = 0;
> +	entry->tc = 0;
> +	entry->type = 0;
>   
> -	stmmac_rx_queue_prio(priv, priv->hw, 0, tc);
> +	priv->rfs_entries_cnt[STMMAC_RFS_T_VLAN]--;
>   
>   	return 0;
>   }
> 

I was about to post a very similar fix for that same problem (except I 
was adding support for other packet steering types)...
I can confirm your patch works. Note that a simpler way to reproduce is 
simply to add a filter, then remove all the filters, e.g.:
$ IFDEVNAME=eth0
$ tc qdisc add dev $IFDEVNAME ingress
$ tc qdisc add dev $IFDEVNAME root mqprio num_tc 8 \
    map 0 1 2 3 4 5 6 7 0 0 0 0 0 0 0 0 \
    queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
$ tc filter add dev $IFDEVNAME parent ffff: protocol 802.1Q \
    flower vlan_prio 0 hw_tc 0
$ tc filter del dev $IFDEVNAME ingress


Yannick

^ permalink raw reply

* Re: [PATCH net-next] net: dev: Always serialize on Qdisc::busylock in __dev_xmit_skb() on PREEMPT_RT.
From: Paolo Abeni @ 2021-12-10 16:35 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, netdev
  Cc: David S. Miller, Jakub Kicinski, Eric Dumazet, Thomas Gleixner
In-Reply-To: <YbN1OL0I1ja4Fwkb@linutronix.de>

On Fri, 2021-12-10 at 16:41 +0100, Sebastian Andrzej Siewior wrote:
> The root-lock is dropped before dev_hard_start_xmit() is invoked and after
> setting the __QDISC___STATE_RUNNING bit. If the Qdisc owner is preempted
> by another sender/task with a higher priority then this new sender won't
> be able to submit packets to the NIC directly instead they will be
> enqueued into the Qdisc. The NIC will remain idle until the Qdisc owner
> is scheduled again and finishes the job.
> 
> By serializing every task on the ->busylock then the task will be
> preempted by a sender only after the Qdisc has no owner.
> 
> Always serialize on the busylock on PREEMPT_RT.

Not sure how much is relevant in the RT context, but this should impact
the xmit tput in a relevant, negative way.

If I read correctly, you use the busylock to trigger priority ceiling
on each sender. I'm wondering if there are other alternative ways (no
contended lock, just some RT specific annotation) to mark a whole
section of code for priority ceiling ?!?

Thanks!

Paolo
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  net/core/dev.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 2a352e668d103..4a701cf2e2c10 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3836,8 +3836,16 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  	 * separate lock before trying to get qdisc main lock.
>  	 * This permits qdisc->running owner to get the lock more
>  	 * often and dequeue packets faster.
> +	 * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
> +	 * and then other tasks will only enqueue packets. The packets will be
> +	 * sent after the qdisc owner is scheduled again. To prevent this
> +	 * scenario the task always serialize on the lock.
>  	 */
> -	contended = qdisc_is_running(q);
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +		contended = qdisc_is_running(q);
> +	else
> +		contended = true;
> +
>  	if (unlikely(contended))
>  		spin_lock(&q->busylock);
>  


^ permalink raw reply

* [PATCH net-next 0/2] net: ipa: fix SDX55 interconnects
From: Alex Elder @ 2021-12-10 16:37 UTC (permalink / raw)
  To: davem, kuba
  Cc: david, manivannan.sadhasivam, jponduru, avuyyuru, bjorn.andersson,
	cpratapa, subashab, evgreen, elder, netdev, linux-arm-msm,
	linux-kernel

The SDX55 SoC has IPA v4.5.  It currently represents the path
between IPA and main memory using two consecutive interconnects.
This was an optimization--not required for correct operation--and
complicates things unnecessarily.  It also does not conform to the
IPA binding (as pointed out by David Heidelberg).

This series fixes this by combining the two interconnects into one.

					-Alex

Alex Elder (2):
  ARM: dts: qcom: sdx55: fix IPA interconnect definitions
  net: ipa: fix IPA v4.5 interconnect data

 arch/arm/boot/dts/qcom-sdx55.dtsi | 6 ++----
 drivers/net/ipa/ipa_data-v4.5.c   | 7 +------
 2 files changed, 3 insertions(+), 10 deletions(-)

-- 
2.32.0

^ permalink raw reply

* [PATCH net-next 1/2] ARM: dts: qcom: sdx55: fix IPA interconnect definitions
From: Alex Elder @ 2021-12-10 16:37 UTC (permalink / raw)
  To: davem, kuba
  Cc: david, manivannan.sadhasivam, jponduru, avuyyuru, bjorn.andersson,
	cpratapa, subashab, evgreen, elder, netdev, linux-arm-msm,
	linux-kernel
In-Reply-To: <20211210163745.34748-1-elder@linaro.org>

The first two interconnects defined for IPA on the SDX55 SoC are
really two parts of what should be represented as a single path
between IPA and system memory.

Fix this by combining the "memory-a" and "memory-b" interconnects
into a single "memory" interconnect.

Reported-by: David Heidelberg <david@ixit.cz>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Alex Elder <elder@linaro.org>
---
 arch/arm/boot/dts/qcom-sdx55.dtsi | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/arm/boot/dts/qcom-sdx55.dtsi b/arch/arm/boot/dts/qcom-sdx55.dtsi
index 44526ad9d210b..eee2f63b9bbab 100644
--- a/arch/arm/boot/dts/qcom-sdx55.dtsi
+++ b/arch/arm/boot/dts/qcom-sdx55.dtsi
@@ -333,12 +333,10 @@ ipa: ipa@1e40000 {
 			clocks = <&rpmhcc RPMH_IPA_CLK>;
 			clock-names = "core";
 
-			interconnects = <&system_noc MASTER_IPA &system_noc SLAVE_SNOC_MEM_NOC_GC>,
-					<&mem_noc MASTER_SNOC_GC_MEM_NOC &mc_virt SLAVE_EBI_CH0>,
+			interconnects = <&system_noc MASTER_IPA &mc_virt SLAVE_EBI_CH0>,
 					<&system_noc MASTER_IPA &system_noc SLAVE_OCIMEM>,
 					<&mem_noc MASTER_AMPSS_M0 &system_noc SLAVE_IPA_CFG>;
-			interconnect-names = "memory-a",
-					     "memory-b",
+			interconnect-names = "memory",
 					     "imem",
 					     "config";
 
-- 
2.32.0


^ permalink raw reply related

* [PATCH net-next 2/2] net: ipa: fix IPA v4.5 interconnect data
From: Alex Elder @ 2021-12-10 16:37 UTC (permalink / raw)
  To: davem, kuba
  Cc: david, manivannan.sadhasivam, jponduru, avuyyuru, bjorn.andersson,
	cpratapa, subashab, evgreen, elder, netdev, linux-arm-msm,
	linux-kernel
In-Reply-To: <20211210163745.34748-1-elder@linaro.org>

Update the definition of the IPA interconnects for IPA v4.5 so
the path between IPA and system memory is represented by a single
"memory" interconnect.

Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Alex Elder <elder@linaro.org>
---
 drivers/net/ipa/ipa_data-v4.5.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/net/ipa/ipa_data-v4.5.c b/drivers/net/ipa/ipa_data-v4.5.c
index e62ab9c3ac672..2da2c4194f2e6 100644
--- a/drivers/net/ipa/ipa_data-v4.5.c
+++ b/drivers/net/ipa/ipa_data-v4.5.c
@@ -420,15 +420,10 @@ static const struct ipa_mem_data ipa_mem_data = {
 /* Interconnect rates are in 1000 byte/second units */
 static const struct ipa_interconnect_data ipa_interconnect_data[] = {
 	{
-		.name			= "memory-a",
+		.name			= "memory",
 		.peak_bandwidth		= 600000,	/* 600 MBps */
 		.average_bandwidth	= 150000,	/* 150 MBps */
 	},
-	{
-		.name			= "memory-b",
-		.peak_bandwidth		= 1804000,	/* 1.804 GBps */
-		.average_bandwidth	= 150000,	/* 150 MBps */
-	},
 	/* Average rate is unused for the next two interconnects */
 	{
 		.name			= "imem",
-- 
2.32.0


^ permalink raw reply related

* [PATCH net] net/sched: sch_ets: don't remove idle classes from the round-robin list
From: Davide Caratti @ 2021-12-10 16:42 UTC (permalink / raw)
  To: Jamal Hadi Salim, Cong Wang, Jiri Pirko, David S. Miller,
	Jakub Kicinski, Shuang Li
  Cc: netdev

Shuang reported that the following script:

 1) tc qdisc add dev ddd0 handle 10: parent 1: ets bands 8 strict 4 priomap 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
 2) mausezahn ddd0  -A 10.10.10.1 -B 10.10.10.2 -c 0 -a own -b 00:c1:a0:c1:a0:00 -t udp &
 3) tc qdisc change dev ddd0 handle 10: ets bands 4 strict 2 quanta 2500 2500 priomap 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

crashes systematically when line 2) is commented:

 list_del corruption, ffff8e028404bd30->next is LIST_POISON1 (dead000000000100)
 ------------[ cut here ]------------
 kernel BUG at lib/list_debug.c:47!
 invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
 CPU: 0 PID: 954 Comm: tc Not tainted 5.16.0-rc4+ #478
 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
 RIP: 0010:__list_del_entry_valid.cold.1+0x12/0x47
 Code: fe ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 08 42 1b 87 e8 1d c5 fe ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 98 42 1b 87 e8 09 c5 fe ff <0f> 0b 48 c7 c7 48 43 1b 87 e8 fb c4 fe ff 0f 0b 48 89 f2 48 89 fe
 RSP: 0018:ffffae46807a3888 EFLAGS: 00010246
 RAX: 000000000000004e RBX: 0000000000000007 RCX: 0000000000000202
 RDX: 0000000000000000 RSI: ffffffff871ac536 RDI: 00000000ffffffff
 RBP: ffffae46807a3a10 R08: 0000000000000000 R09: c0000000ffff7fff
 R10: 0000000000000001 R11: ffffae46807a36a8 R12: ffff8e028404b800
 R13: ffff8e028404bd30 R14: dead000000000100 R15: ffff8e02fafa2400
 FS:  00007efdc92e4480(0000) GS:ffff8e02fb600000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000682f48 CR3: 00000001058be000 CR4: 0000000000350ef0
 Call Trace:
  <TASK>
  ets_qdisc_change+0x58b/0xa70 [sch_ets]
  tc_modify_qdisc+0x323/0x880
  rtnetlink_rcv_msg+0x169/0x4a0
  netlink_rcv_skb+0x50/0x100
  netlink_unicast+0x1a5/0x280
  netlink_sendmsg+0x257/0x4d0
  sock_sendmsg+0x5b/0x60
  ____sys_sendmsg+0x1f2/0x260
  ___sys_sendmsg+0x7c/0xc0
  __sys_sendmsg+0x57/0xa0
  do_syscall_64+0x3a/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7efdc8031338
 Code: 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 43 2c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 41 89 d4 55
 RSP: 002b:00007ffdf1ce9828 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 0000000061b37a97 RCX: 00007efdc8031338
 RDX: 0000000000000000 RSI: 00007ffdf1ce9890 RDI: 0000000000000003
 RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000078a940
 R10: 000000000000000c R11: 0000000000000246 R12: 0000000000000001
 R13: 0000000000688880 R14: 0000000000000000 R15: 0000000000000000
  </TASK>
 Modules linked in: sch_ets sch_tbf dummy rfkill iTCO_wdt iTCO_vendor_support intel_rapl_msr intel_rapl_common joydev pcspkr i2c_i801 virtio_balloon i2c_smbus lpc_ich ip_tables xfs libcrc32c crct10dif_pclmul crc32_pclmul crc32c_intel serio_raw ghash_clmulni_intel ahci libahci libata virtio_blk virtio_console virtio_net net_failover failover sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: sch_ets]
 ---[ end trace f35878d1912655c2 ]---
 RIP: 0010:__list_del_entry_valid.cold.1+0x12/0x47
 Code: fe ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 08 42 1b 87 e8 1d c5 fe ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 98 42 1b 87 e8 09 c5 fe ff <0f> 0b 48 c7 c7 48 43 1b 87 e8 fb c4 fe ff 0f 0b 48 89 f2 48 89 fe
 RSP: 0018:ffffae46807a3888 EFLAGS: 00010246
 RAX: 000000000000004e RBX: 0000000000000007 RCX: 0000000000000202
 RDX: 0000000000000000 RSI: ffffffff871ac536 RDI: 00000000ffffffff
 RBP: ffffae46807a3a10 R08: 0000000000000000 R09: c0000000ffff7fff
 R10: 0000000000000001 R11: ffffae46807a36a8 R12: ffff8e028404b800
 R13: ffff8e028404bd30 R14: dead000000000100 R15: ffff8e02fafa2400
 FS:  00007efdc92e4480(0000) GS:ffff8e02fb600000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000682f48 CR3: 00000001058be000 CR4: 0000000000350ef0
 Kernel panic - not syncing: Fatal exception in interrupt
 Kernel Offset: 0x4e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

we can remove 'q->classes[i].alist' only if DRR class 'i' was part of the
active list. In the ETS scheduler DRR classes belong to that list only if
the queue length is greater than zero: we need to test for non-zero value
of 'q->classes[i].qdisc->q.qlen' before removing from the list, similarly
to what has been done elsewhere in the ETS code.

Fixes: de6d25924c2a ("net/sched: sch_ets: don't peek at classes beyond 'nbands'")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/sch_ets.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_ets.c b/net/sched/sch_ets.c
index e007fc75ef2f..d73393493553 100644
--- a/net/sched/sch_ets.c
+++ b/net/sched/sch_ets.c
@@ -666,9 +666,9 @@ static int ets_qdisc_change(struct Qdisc *sch, struct nlattr *opt,
 		}
 	}
 	for (i = q->nbands; i < oldbands; i++) {
-		qdisc_tree_flush_backlog(q->classes[i].qdisc);
-		if (i >= q->nstrict)
+		if (i >= q->nstrict && q->classes[i].qdisc->q.qlen)
 			list_del(&q->classes[i].alist);
+		qdisc_tree_flush_backlog(q->classes[i].qdisc);
 	}
 	q->nstrict = nstrict;
 	memcpy(q->prio2band, priomap, sizeof(priomap));
-- 
2.31.1


^ permalink raw reply related

* RE: [net v5 2/3] net: sched: add check tc_skip_classify in sch egress
From: John Fastabend @ 2021-12-10 16:43 UTC (permalink / raw)
  To: xiangxia.m.yue, netdev
  Cc: Tonghao Zhang, David S. Miller, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, Antoine Tenart, Alexander Lobakin,
	Wei Wang, Arnd Bergmann
In-Reply-To: <20211208145459.9590-3-xiangxia.m.yue@gmail.com>

xiangxia.m.yue@ wrote:
> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> 
> Try to resolve the issues as below:
> * We look up and then check tc_skip_classify flag in net
>   sched layer, even though skb don't want to be classified.
>   That case may consume a lot of cpu cycles. This patch
>   is useful when there are a lot of filters with different
>   prio. There is ~5 prio in in production, ~1% improvement.
> 
>   Rules as below:
>   $ for id in $(seq 1 5); do
>   $       tc filter add ... egress prio $id ... action mirred egress redirect dev ifb0
>   $ done
> 
> * bpf_redirect may be invoked in egress path. If we don't
>   check the flags and then return immediately, the packets
>   will loopback.

This would be the naive case right? Meaning the BPF program is
doing a redirect without any logic or is buggy?

Can you map out how this happens for me, I'm not fully sure I
understand the exact concern. Is it possible for BPF programs
that used to see packets no longer see the packet as expected?

Is this the path you are talking about?

 rx ethx  ->
   execute BPF program on ethx with bpf_redirect(ifb0) ->
     __skb_dequeue @ifb tc_skip_classify = 1 ->
       dev_queue_xmit() -> 
          sch_handle_egress() ->
            execute BPF program again

I can't see why you want to skip that second tc BPF program,
or for that matter any tc filter there. In general how do you
know that is the correct/expected behavior? Before the above
change it would have been called, what if its doing useful
work.

Also its not clear how your ifb setup is built or used. That
might help understand your use case. I would just remove the
IFB altogether and the above discussion is mute.

Thanks,
John

^ permalink raw reply

* Re: [PATCH net-next] net: dev: Always serialize on Qdisc::busylock in __dev_xmit_skb() on PREEMPT_RT.
From: Sebastian Andrzej Siewior @ 2021-12-10 16:46 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, Jakub Kicinski, Eric Dumazet,
	Thomas Gleixner
In-Reply-To: <99af5c3079470432b97a74ab6aa3a43a1f7b178d.camel@redhat.com>

On 2021-12-10 17:35:21 [+0100], Paolo Abeni wrote:
> On Fri, 2021-12-10 at 16:41 +0100, Sebastian Andrzej Siewior wrote:
> > The root-lock is dropped before dev_hard_start_xmit() is invoked and after
> > setting the __QDISC___STATE_RUNNING bit. If the Qdisc owner is preempted
> > by another sender/task with a higher priority then this new sender won't
> > be able to submit packets to the NIC directly instead they will be
> > enqueued into the Qdisc. The NIC will remain idle until the Qdisc owner
> > is scheduled again and finishes the job.
> > 
> > By serializing every task on the ->busylock then the task will be
> > preempted by a sender only after the Qdisc has no owner.
> > 
> > Always serialize on the busylock on PREEMPT_RT.
> 
> Not sure how much is relevant in the RT context, but this should impact
> the xmit tput in a relevant, negative way.

Negative because everyone blocks on lock and transmits packets directly
instead of adding it to the queue and leaving for more?

> If I read correctly, you use the busylock to trigger priority ceiling
> on each sender. I'm wondering if there are other alternative ways (no
> contended lock, just some RT specific annotation) to mark a whole
> section of code for priority ceiling ?!?

priority ceiling as you call it, happens always with the help of a lock.
The root_lock is dropped in sch_direct_xmit().
qdisc_run_begin() sets only a bit with no owner association.
If I know why the busy-lock bad than I could add another one. The
important part is force the sende out of the section so the task with
the higher priority can send packets instead of queueing them only.

> Thanks!
> 
> Paolo

Sebastian

^ permalink raw reply

* Re: [PATCH] bridge: extend BR_ISOLATE to full split-horizon
From: David Lamparter @ 2021-12-10 16:49 UTC (permalink / raw)
  To: Nikolay Aleksandrov, Alexandra Winter
  Cc: Jakub Kicinski, David Lamparter, netdev
In-Reply-To: <898155b2-08d7-1977-d0c9-bbb1ae99c0bb@nvidia.com>

Thanks all for the feedback!
(rolled up several mails below)

On Thu, Dec 09, 2021 at 07:42:04AM -0800, Jakub Kicinski wrote:
> Does not apply to net-next, you'll need to repost even if the code is
> good. Please put [PATCH net-next] in the subject.

Apologies, I think I skipped a step in my rebase somewhere.

On Thu, Dec 09, 2021 at 06:08:03PM +0200, Nikolay Aleksandrov wrote:
> - please drop the sysfs part, we're not extending sysfs anymore

ACK (will remove the "horizon_group" file, change to "isolated" remains)

> - split the bridge change from the driver

ACK - I wasn't sure if it's OK if the intermediate doesn't compile due
to deleted BR_ISOLATED?  Or should I make 3 patches with only the 3rd
removing the flag definition?

> - drop the /* BR_ISOLATED - previously BIT(16) */ comment

Should I move up the other bits below it or just leave a hole at 16?

> - [IFLA_BRPORT_HORIZON_GROUP] = NLA_POLICY_MIN(NLA_S32, 0), why not just { .type = NLA_U32 } ?
>
> Why the limitation (UAPI limited to positive signed int. (recommended ifindex namespace)) ?
> You have the full unsigned space available, user-space can use it as it sees fit.
> You can just remove the comment about recommended ifindex.

No problem, I guess I'm overthinking this a bit.  Using the ifindex is
just a "good way" of avoiding confusion if multiple things in userspace
try to configure this.  Kernel really doesn't care.

> Also please extend the port isolation self-test with a test for a different horizon group.

ACK

> - just forbid having both set (tb[IFLA_BRPORT_ISOLATED] && tb[IFLA_BRPORT_HORIZON_GROUP])
>   user-space should use just one of the two, if isolated is set then it overwrites any older
>   IFLA_BRPORT_HORIZON_GROUP settings, that should simplify things considerably

So a bit of a hidden question here is which way "BR_ISOLATED" maps -
both "ISOLATED :=: group >= 1" and "ISOLATED :=: group == 1" are
possible.  But regardless of which way it's defined, there's to some
degree a risk of "old" tools accidentally wrecking the horizon group
setting.  That's why I ended up making BR_ISOLATE=1 not change the
horizon group if it's nonzero already...

On Thu, Dec 09, 2021 at 05:23:35PM +0100, Alexandra Winter wrote:
> Yes, please keep it compatible with userspace setting IFLA_BRPORT_ISOLATED only.

On Thu, Dec 09, 2021 at 06:23:25PM +0200, Nikolay Aleksandrov wrote:
> Actually they'll have to be exported together always... that's unfortunate. I get
> why you need the extra netlink logic, I think we'll have to keep it.

Yeah - we can't remove BR_ISOLATED from netlink GETs to keep compat, so
we need to accept it in SETs too in order to not break userspace handing
it right back in 1:1 for a get(-modify)-set...

On Thu, Dec 09, 2021 at 06:57:23PM +0200, Nikolay Aleksandrov wrote:
> So one relatively simple way to drop most of the logic is to have BRPORT_HORIZON_GROUP's
> attribute get set after ISOLATED so it always overwrites it, which is similar to the current
> situation but if you set only ISOLATED later it will function as intended (i.e. drop the logic
> for horizon_group when using the old attr). Still will have to disallow ISOLATED = 0/HORIZON_GROUP >0
> and ISOLATED=1/HORIZON_GROUP=0 though as these don't make sense.
>
> e.g.:
> if (tb[IFLA_BRPORT_ISOLATED])
> 	p->horizon_group = !!nla_get_u8(tb[IFLA_BRPORT_ISOLATED]);
> if (tb[IFLA_BRPORT_HORIZON_GROUP])
> 	p->horizon_group = nla_get_u32(tb[IFLA_BRPORT_HORIZON_GROUP]);
>
> This will be also in line with how they're exported (ISOLATED = 1 and HORIZON_GROUP >0).

This boils down to the same logic as is currently in the patch, except
it's written the other way around and with an else, i.e.

if (tb[IFLA_BRPORT_HORIZON_GROUP])
	p->horizon_group = ...
else if (tb[IFLA_BRPORT_ISOLATED])
	p->horizon_group = ...

With the else it seems more obvious to me, to avoid someone in the
future missing the fact that the 2 settings interact - so I would
preferably keep it like this.

Cheers,

-David

^ permalink raw reply

* [PATCH net] mptcp: fix NULL ptr dereference in inet_csk_accept()
From: Paolo Abeni @ 2021-12-10 16:51 UTC (permalink / raw)
  To: netdev; +Cc: mptcp, Geliang Tang

Since commit 740d798e8767 ("mptcp: remove id 0 address"), the PM
can remove the MPTCP first subflow in response to the netlink DEL_ADDR
command. At subflow removal time, the TCP subflow socket is orphaned.

If the relevant MPTCP socket is in listening status and such
operation races with an accept(), the kernel will access a NULL wait
queue, as reported by syzbot:

general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
CPU: 1 PID: 6550 Comm: syz-executor122 Not tainted 5.16.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__lock_acquire+0xd7d/0x54a0 kernel/locking/lockdep.c:4897
Code: 0f 0e 41 be 01 00 00 00 0f 86 c8 00 00 00 89 05 69 cc 0f 0e e9 bd 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1 ea 03 <80> 3c 02 00 0f 85 f3 2f 00 00 48 81 3b 20 75 17 8f 0f 84 52 f3 ff
RSP: 0018:ffffc90001f2f818 EFLAGS: 00010016
RAX: dffffc0000000000 RBX: 0000000000000018 RCX: 0000000000000000
RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 000000000000000a R12: 0000000000000000
R13: ffff88801b98d700 R14: 0000000000000000 R15: 0000000000000001
FS:  00007f177cd3d700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f177cd1b268 CR3: 000000001dd55000 CR4: 0000000000350ee0
Call Trace:
 <TASK>
 lock_acquire kernel/locking/lockdep.c:5637 [inline]
 lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162
 finish_wait+0xc0/0x270 kernel/sched/wait.c:400
 inet_csk_wait_for_connect net/ipv4/inet_connection_sock.c:464 [inline]
 inet_csk_accept+0x7de/0x9d0 net/ipv4/inet_connection_sock.c:497
 mptcp_accept+0xe5/0x500 net/mptcp/protocol.c:2865
 inet_accept+0xe4/0x7b0 net/ipv4/af_inet.c:739
 mptcp_stream_accept+0x2e7/0x10e0 net/mptcp/protocol.c:3345
 do_accept+0x382/0x510 net/socket.c:1773
 __sys_accept4_file+0x7e/0xe0 net/socket.c:1816
 __sys_accept4+0xb0/0x100 net/socket.c:1846
 __do_sys_accept net/socket.c:1864 [inline]
 __se_sys_accept net/socket.c:1861 [inline]
 __x64_sys_accept+0x71/0xb0 net/socket.c:1861
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f177cd8b8e9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f177cd3d308 EFLAGS: 00000246 ORIG_RAX: 000000000000002b
RAX: ffffffffffffffda RBX: 00007f177ce13408 RCX: 00007f177cd8b8e9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007f177ce13400 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f177ce1340c
R13: 00007f177cde1004 R14: 6d705f706374706d R15: 0000000000022000
 </TASK>

Fix the issue explicitly preventing the PM from closing subflows
of MPTCP socket in listener status.

Reported-and-tested-by: syzbot+e4d843bb96a9431e6331@syzkaller.appspotmail.com
Fixes: 740d798e8767 ("mptcp: remove id 0 address")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/pm_netlink.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 7b96be1e9f14..afd4c6ddad0c 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -1275,10 +1275,16 @@ static int mptcp_nl_remove_subflow_and_signal_addr(struct net *net,
 		}
 
 		lock_sock(sk);
+		/* don't touch subflows for listener sockets */
+		if (sk->sk_state == TCP_LISTEN)
+			goto unlock_next;
+
 		remove_subflow = lookup_subflow_by_saddr(&msk->conn_list, addr);
 		mptcp_pm_remove_anno_addr(msk, addr, remove_subflow);
 		if (remove_subflow)
 			mptcp_pm_remove_subflow(msk, &list);
+
+unlock_next:
 		release_sock(sk);
 
 next:
@@ -1318,10 +1324,16 @@ static int mptcp_nl_remove_id_zero_address(struct net *net,
 			goto next;
 
 		lock_sock(sk);
+		/* don't touch subflows for listener sockets */
+		if (sk->sk_state == TCP_LISTEN)
+			goto unlock_next;
+
 		spin_lock_bh(&msk->pm.lock);
 		mptcp_pm_remove_addr(msk, &list);
 		mptcp_pm_nl_rm_subflow_received(msk, &list);
 		spin_unlock_bh(&msk->pm.lock);
+
+unlock_next:
 		release_sock(sk);
 
 next:
-- 
2.33.1


^ permalink raw reply related

* RE: [net v5 2/3] net: sched: add check tc_skip_classify in sch egress
From: John Fastabend @ 2021-12-10 16:52 UTC (permalink / raw)
  To: John Fastabend, xiangxia.m.yue, netdev
  Cc: Tonghao Zhang, David S. Miller, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, Antoine Tenart, Alexander Lobakin,
	Wei Wang, Arnd Bergmann
In-Reply-To: <61b383c6373ca_1f50e20816@john.notmuch>

John Fastabend wrote:
> xiangxia.m.yue@ wrote:
> > From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> > 
> > Try to resolve the issues as below:
> > * We look up and then check tc_skip_classify flag in net
> >   sched layer, even though skb don't want to be classified.
> >   That case may consume a lot of cpu cycles. This patch
> >   is useful when there are a lot of filters with different
> >   prio. There is ~5 prio in in production, ~1% improvement.
> > 
> >   Rules as below:
> >   $ for id in $(seq 1 5); do
> >   $       tc filter add ... egress prio $id ... action mirred egress redirect dev ifb0
> >   $ done
> > 
> > * bpf_redirect may be invoked in egress path. If we don't
> >   check the flags and then return immediately, the packets
> >   will loopback.
> 
> This would be the naive case right? Meaning the BPF program is
> doing a redirect without any logic or is buggy?
> 
> Can you map out how this happens for me, I'm not fully sure I
> understand the exact concern. Is it possible for BPF programs
> that used to see packets no longer see the packet as expected?
> 
> Is this the path you are talking about?
> 
>  rx ethx  ->
>    execute BPF program on ethx with bpf_redirect(ifb0) ->
>      __skb_dequeue @ifb tc_skip_classify = 1 ->
>        dev_queue_xmit() -> 
>           sch_handle_egress() ->
>             execute BPF program again
> 
> I can't see why you want to skip that second tc BPF program,
> or for that matter any tc filter there. In general how do you
> know that is the correct/expected behavior? Before the above
> change it would have been called, what if its doing useful
> work.
> 
> Also its not clear how your ifb setup is built or used. That
> might help understand your use case. I would just remove the
> IFB altogether and the above discussion is mute.
> 
> Thanks,
> John

After a bit further thought (and coffee) I think this will
break some programs that exist today. Consider the case
where I pop a header off and resubmit to the same device
intentionally to reprocess the pkt without the header. I've
used this pattern in BPF a few times.

^ permalink raw reply

* Re: [syzbot] KMSAN: uninit-value in fib_get_nhs
From: David Ahern @ 2021-12-10 16:52 UTC (permalink / raw)
  To: syzbot, davem, dsahern, glider, kuba, linux-kernel, netdev,
	syzkaller-bugs, yoshfuji
In-Reply-To: <0000000000005a735005d2bb2dff@google.com>

On 12/9/21 11:57 AM, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    8b936c96768e kmsan: core: remove the accidentally committe..
> git tree:       https://github.com/google/kmsan.git master
> console output: https://syzkaller.appspot.com/x/log.txt?x=1724ebc5b00000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=e00a8959fdd3f3e8
> dashboard link: https://syzkaller.appspot.com/bug?extid=d4b9a2851cc3ce998741
> compiler:       clang version 14.0.0 (git@github.com:llvm/llvm-project.git 0996585c8e3b3d409494eb5f1cad714b9e1f7fb5), GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1225f875b00000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=139513c5b00000
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+d4b9a2851cc3ce998741@syzkaller.appspotmail.com
> 
> =====================================================
> BUG: KMSAN: uninit-value in fib_get_nhs+0xac4/0x1f80 net/ipv4/fib_semantics.c:708


Most likely lack of length validation on nla like we have in
fib_gw_from_via for the other attribute (nlav). Will send a patch.


^ permalink raw reply

* Re: [PATCH net-next] net: dev: Always serialize on Qdisc::busylock in __dev_xmit_skb() on PREEMPT_RT.
From: Eric Dumazet @ 2021-12-10 16:52 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Paolo Abeni, netdev, David S. Miller, Jakub Kicinski,
	Thomas Gleixner
In-Reply-To: <YbOEaSQW+LtWjuzI@linutronix.de>

On Fri, Dec 10, 2021 at 8:46 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2021-12-10 17:35:21 [+0100], Paolo Abeni wrote:
> > On Fri, 2021-12-10 at 16:41 +0100, Sebastian Andrzej Siewior wrote:
> > > The root-lock is dropped before dev_hard_start_xmit() is invoked and after
> > > setting the __QDISC___STATE_RUNNING bit. If the Qdisc owner is preempted
> > > by another sender/task with a higher priority then this new sender won't
> > > be able to submit packets to the NIC directly instead they will be
> > > enqueued into the Qdisc. The NIC will remain idle until the Qdisc owner
> > > is scheduled again and finishes the job.
> > >
> > > By serializing every task on the ->busylock then the task will be
> > > preempted by a sender only after the Qdisc has no owner.
> > >
> > > Always serialize on the busylock on PREEMPT_RT.
> >
> > Not sure how much is relevant in the RT context, but this should impact
> > the xmit tput in a relevant, negative way.
>
> Negative because everyone blocks on lock and transmits packets directly
> instead of adding it to the queue and leaving for more?
>
> > If I read correctly, you use the busylock to trigger priority ceiling
> > on each sender. I'm wondering if there are other alternative ways (no
> > contended lock, just some RT specific annotation) to mark a whole
> > section of code for priority ceiling ?!?
>
> priority ceiling as you call it, happens always with the help of a lock.
> The root_lock is dropped in sch_direct_xmit().
> qdisc_run_begin() sets only a bit with no owner association.
> If I know why the busy-lock bad than I could add another one. The
> important part is force the sende out of the section so the task with
> the higher priority can send packets instead of queueing them only.

Problem is that if you have a qdisc, qdisc_dequeue() will not know that the
packet queued by high prio thread needs to shortcut all prior packets and
be sent right away.

Because of that, it seems just better that a high prio thread returns
immediately and let the dirty work (dequeue packets and send them to devices)
be done by other threads ?

>
> > Thanks!
> >
> > Paolo
>
> Sebastian

^ permalink raw reply

* Re: [PATCH] bridge: extend BR_ISOLATE to full split-horizon
From: Alexandra Winter @ 2021-12-10 16:53 UTC (permalink / raw)
  To: David Lamparter, netdev; +Cc: Nikolay Aleksandrov
In-Reply-To: <20211209121432.473979-1-equinox@diac24.net>



On 09.12.21 13:14, David Lamparter wrote:
> Split-horizon essentially just means being able to create multiple
> groups of isolated ports that are isolated within the group, but not
> with respect to each other.
> 
> The intent is very different, while isolation is a policy feature,
> split-horizon is intended to provide functional "multiple member ports
> are treated as one for loop avoidance."  But it boils down to the same
> thing in the end.
> 
> Signed-off-by: David Lamparter <equinox@diac24.net>
> Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
> Cc: Alexandra Winter <wintera@linux.ibm.com>
> ---
> 
> Alexandra, could you double check my change to the qeth_l2 driver?  I
> can't really test it...
> 
> Cheers,
> 
> David
> ---
Reviewed and tested for s390/qeth. Looks good to me, see 2 comments below.
Kind regards 
Alexandra


>  drivers/s390/net/qeth_l2_main.c | 10 ++++++----
>  include/linux/if_bridge.h       |  9 ++++++++-
>  include/uapi/linux/if_link.h    |  1 +
>  net/bridge/br_if.c              | 12 ++++++++++++
>  net/bridge/br_input.c           |  2 +-
>  net/bridge/br_netlink.c         | 33 +++++++++++++++++++++++++++++++--
>  net/bridge/br_private.h         | 13 ++++++++++---
>  net/bridge/br_sysfs_if.c        | 33 ++++++++++++++++++++++++++++++++-
>  8 files changed, 101 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
> index 303461d70af3..405d36757c22 100644
> --- a/drivers/s390/net/qeth_l2_main.c
> +++ b/drivers/s390/net/qeth_l2_main.c
> @@ -729,8 +729,8 @@ static bool qeth_l2_must_learn(struct net_device *netdev,
>  	priv = netdev_priv(netdev);
>  	return (netdev != dstdev &&
>  		(priv->brport_features & BR_LEARNING_SYNC) &&
> -		!(br_port_flag_is_set(netdev, BR_ISOLATED) &&
> -		  br_port_flag_is_set(dstdev, BR_ISOLATED)) &&
> +		!(br_port_horizon_group(netdev) != 0 &&
> +		  br_port_horizon_group(netdev) == br_port_horizon_group(dstdev)) &&
WARNING: line length of 84 exceeds 80 columns

>  		(netdev->netdev_ops == &qeth_l2_iqd_netdev_ops ||
>  		 netdev->netdev_ops == &qeth_l2_osa_netdev_ops));
>  }
> @@ -757,6 +757,7 @@ static void qeth_l2_br2dev_worker(struct work_struct *work)
>  	struct net_device *lowerdev;
>  	struct list_head *iter;
>  	int err = 0;
> +	u32 horizon_group;
reverse Christmas tree!
>  
>  	kfree(br2dev_event_work);
>  	QETH_CARD_TEXT_(card, 4, "b2dw%04lx", event);
> @@ -770,12 +771,13 @@ static void qeth_l2_br2dev_worker(struct work_struct *work)
>  	if (!qeth_l2_must_learn(lsyncdev, dstdev))
>  		goto unlock;
>  
> -	if (br_port_flag_is_set(lsyncdev, BR_ISOLATED)) {
> +	horizon_group = br_port_horizon_group(lsyncdev);
> +	if (horizon_group) {
>  		/* Update lsyncdev and its isolated sibling(s): */
>  		iter = &brdev->adj_list.lower;
>  		lowerdev = netdev_next_lower_dev_rcu(brdev, &iter);
>  		while (lowerdev) {
> -			if (br_port_flag_is_set(lowerdev, BR_ISOLATED)) {
> +			if (br_port_horizon_group(lowerdev) == horizon_group) {
>  				switch (event) {
>  				case SWITCHDEV_FDB_ADD_TO_DEVICE:
>  					err = dev_uc_add(lowerdev, addr);

---snip---

^ permalink raw reply

* Re: [PATCH] bridge: extend BR_ISOLATE to full split-horizon
From: David Lamparter @ 2021-12-10 16:57 UTC (permalink / raw)
  To: Alexandra Winter; +Cc: David Lamparter, netdev, Nikolay Aleksandrov
In-Reply-To: <2556a385-425c-0f25-2be5-efcfdc77aeaa@linux.ibm.com>

On Fri, Dec 10, 2021 at 05:53:17PM +0100, Alexandra Winter wrote:
> On 09.12.21 13:14, David Lamparter wrote:
> > Alexandra, could you double check my change to the qeth_l2 driver?  I
> > can't really test it...
>
> Reviewed and tested for s390/qeth. Looks good to me, see 2 comments below.

Awesome, Thanks a lot!

I'll fix your comments together with the other bits Nikolay pointed out
& send a v2 in a few days or so.

Cheers,


-David

^ permalink raw reply

* Re: [RFC PATCH v2 net-next 0/4] DSA master state tracking
From: Vladimir Oltean @ 2021-12-10 17:02 UTC (permalink / raw)
  To: Ansuel Smith
  Cc: netdev@vger.kernel.org, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli
In-Reply-To: <61b2cb93.1c69fb81.2192a.3ef3@mx.google.com>

On Fri, Dec 10, 2021 at 04:37:52AM +0100, Ansuel Smith wrote:
> On Thu, Dec 09, 2021 at 07:39:23PM +0200, Vladimir Oltean wrote:
> > This patch set is provided solely for review purposes (therefore not to
> > be applied anywhere) and for Ansuel to test whether they resolve the
> > slowdown reported here:
> > https://patchwork.kernel.org/project/netdevbpf/cover/20211207145942.7444-1-ansuelsmth@gmail.com/
> > 
> > The patches posted here are mainly to offer a consistent
> > "master_state_change" chain of events to switches, without duplicates,
> > and always starting with operational=true and ending with
> > operational=false. This way, drivers should know when they can perform
> > Ethernet-based register access, and need not care about more than that.
> > 
> > Changes in v2:
> > - dropped some useless patches
> > - also check master operstate.
> > 
> > Vladimir Oltean (4):
> >   net: dsa: provide switch operations for tracking the master state
> >   net: dsa: stop updating master MTU from master.c
> >   net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}
> >   net: dsa: replay master state events in
> >     dsa_tree_{setup,teardown}_master
> > 
> >  include/net/dsa.h  | 11 +++++++
> >  net/dsa/dsa2.c     | 80 +++++++++++++++++++++++++++++++++++++++++++---
> >  net/dsa/dsa_priv.h | 13 ++++++++
> >  net/dsa/master.c   | 29 ++---------------
> >  net/dsa/slave.c    | 27 ++++++++++++++++
> >  net/dsa/switch.c   | 15 +++++++++
> >  6 files changed, 145 insertions(+), 30 deletions(-)
> > 
> > -- 
> > 2.25.1
> > 
> 
> Hi, I tested this v2 and I still have 2 ethernet mdio failing on init.
> I don't think we have other way to track this. Am I wrong?
> 
> All works correctly with this and promisc_on_master.
> If you have other test, feel free to send me other stuff to test.
> 
> (I'm starting to think the fail is caused by some delay that the switch
> require to actually start accepting packet or from the reinit? But I'm
> not sure... don't know if you notice something from the pcap)

I've opened the pcap just now. The Ethernet MDIO packets are
non-standard. When the DSA master receives them, it expects the first 6
octets to be the MAC DA, because that's the format of an Ethernet frame.
But the packets have this other format, according to your own writing:

/* Specific define for in-band MDIO read/write with Ethernet packet */
#define QCA_HDR_MDIO_SEQ_LEN           4 /* 4 byte for the seq */
#define QCA_HDR_MDIO_COMMAND_LEN       4 /* 4 byte for the command */
#define QCA_HDR_MDIO_DATA1_LEN         4 /* First 4 byte for the mdio data */
#define QCA_HDR_MDIO_HEADER_LEN        (QCA_HDR_MDIO_SEQ_LEN + \
                                       QCA_HDR_MDIO_COMMAND_LEN + \
                                       QCA_HDR_MDIO_DATA1_LEN)

#define QCA_HDR_MDIO_DATA2_LEN         12 /* Other 12 byte for the mdio data */
#define QCA_HDR_MDIO_PADDING_LEN       34 /* Padding to reach the min Ethernet packet */

The first 6 octets change like crazy in your pcap. Definitely can't add
that to the RX filter of the DSA master.

So yes, promisc_on_master is precisely what you need, it exists for
situations like this.

Considering this, I guess it works?

^ permalink raw reply

* Re: [PATCH net-next] xfrm: add net device refcount tracker to struct xfrm_state_offload
From: patchwork-bot+netdevbpf @ 2021-12-10 17:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, kuba, netdev, edumazet, steffen.klassert
In-Reply-To: <20211209154451.4184050-1-eric.dumazet@gmail.com>

Hello:

This patch was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Thu,  9 Dec 2021 07:44:51 -0800 you wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Steffen Klassert <steffen.klassert@secunet.com>
> ---
>  include/net/xfrm.h     | 3 ++-
>  net/xfrm/xfrm_device.c | 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)

Here is the summary with links:
  - [net-next] xfrm: add net device refcount tracker to struct xfrm_state_offload
    https://git.kernel.org/netdev/net-next/c/e1b539bd73a7

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [RFC PATCH v2 net-next 0/4] DSA master state tracking
From: Ansuel Smith @ 2021-12-10 17:10 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev@vger.kernel.org, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli
In-Reply-To: <20211210170242.bckpdm2qa6lchbde@skbuf>

On Fri, Dec 10, 2021 at 05:02:42PM +0000, Vladimir Oltean wrote:
> On Fri, Dec 10, 2021 at 04:37:52AM +0100, Ansuel Smith wrote:
> > On Thu, Dec 09, 2021 at 07:39:23PM +0200, Vladimir Oltean wrote:
> > > This patch set is provided solely for review purposes (therefore not to
> > > be applied anywhere) and for Ansuel to test whether they resolve the
> > > slowdown reported here:
> > > https://patchwork.kernel.org/project/netdevbpf/cover/20211207145942.7444-1-ansuelsmth@gmail.com/
> > > 
> > > The patches posted here are mainly to offer a consistent
> > > "master_state_change" chain of events to switches, without duplicates,
> > > and always starting with operational=true and ending with
> > > operational=false. This way, drivers should know when they can perform
> > > Ethernet-based register access, and need not care about more than that.
> > > 
> > > Changes in v2:
> > > - dropped some useless patches
> > > - also check master operstate.
> > > 
> > > Vladimir Oltean (4):
> > >   net: dsa: provide switch operations for tracking the master state
> > >   net: dsa: stop updating master MTU from master.c
> > >   net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}
> > >   net: dsa: replay master state events in
> > >     dsa_tree_{setup,teardown}_master
> > > 
> > >  include/net/dsa.h  | 11 +++++++
> > >  net/dsa/dsa2.c     | 80 +++++++++++++++++++++++++++++++++++++++++++---
> > >  net/dsa/dsa_priv.h | 13 ++++++++
> > >  net/dsa/master.c   | 29 ++---------------
> > >  net/dsa/slave.c    | 27 ++++++++++++++++
> > >  net/dsa/switch.c   | 15 +++++++++
> > >  6 files changed, 145 insertions(+), 30 deletions(-)
> > > 
> > > -- 
> > > 2.25.1
> > > 
> > 
> > Hi, I tested this v2 and I still have 2 ethernet mdio failing on init.
> > I don't think we have other way to track this. Am I wrong?
> > 
> > All works correctly with this and promisc_on_master.
> > If you have other test, feel free to send me other stuff to test.
> > 
> > (I'm starting to think the fail is caused by some delay that the switch
> > require to actually start accepting packet or from the reinit? But I'm
> > not sure... don't know if you notice something from the pcap)
> 
> I've opened the pcap just now. The Ethernet MDIO packets are
> non-standard. When the DSA master receives them, it expects the first 6
> octets to be the MAC DA, because that's the format of an Ethernet frame.
> But the packets have this other format, according to your own writing:
> 
> /* Specific define for in-band MDIO read/write with Ethernet packet */
> #define QCA_HDR_MDIO_SEQ_LEN           4 /* 4 byte for the seq */
> #define QCA_HDR_MDIO_COMMAND_LEN       4 /* 4 byte for the command */
> #define QCA_HDR_MDIO_DATA1_LEN         4 /* First 4 byte for the mdio data */
> #define QCA_HDR_MDIO_HEADER_LEN        (QCA_HDR_MDIO_SEQ_LEN + \
>                                        QCA_HDR_MDIO_COMMAND_LEN + \
>                                        QCA_HDR_MDIO_DATA1_LEN)
> 
> #define QCA_HDR_MDIO_DATA2_LEN         12 /* Other 12 byte for the mdio data */
> #define QCA_HDR_MDIO_PADDING_LEN       34 /* Padding to reach the min Ethernet packet */
> 
> The first 6 octets change like crazy in your pcap. Definitely can't add
> that to the RX filter of the DSA master.
> 
> So yes, promisc_on_master is precisely what you need, it exists for
> situations like this.
> 
> Considering this, I guess it works?

Yes it works! We can totally accept 2 mdio timeout out of a good way to
track the master port. It's probably related to other stuff like switch
delay or other.

Wonder the next step is wait for this to be accepted and then I can
propose a v3 of my patch? Or net-next is closed now and I should just
send v3 RFC saying it does depend on this?

-- 
	Ansuel

^ permalink raw reply

* [PATCH bpf-next 0/3] xsk: Tx improvements
From: Maciej Fijalkowski @ 2021-12-10 17:14 UTC (permalink / raw)
  To: bpf, ast, daniel; +Cc: netdev, magnus.karlsson, Maciej Fijalkowski

Hi,
this time we're on Tx side of AF_XDP and we touch i40e and ice drivers.
Unfortunately, similar scalability issues that were addressed for XDP
processing in ice, exist on AF_XDP side. Let's resolve them in mostly
the same way as we did on [0].

Magnus moves the array of Tx descriptors that is used with batching
approach to the xsk buffer pool. This means that future users of this
API will not have carry the array on their own side, they can simple
refer to pool's tx_desc array, which can be seen on patch from Magnus.
Described patch is based on i40e as it is the only user of this API.
Tx batching is still left to be tried out for ice, though.

Thanks,
Magnus & Maciej

[0]: https://lore.kernel.org/bpf/20211015162908.145341-8-anthony.l.nguyen@intel.com/

Maciej Fijalkowski (2):
  ice: xsk: improve AF_XDP ZC Tx side
  ice: xsk: borrow xdp_tx_active logic from i40e

Magnus Karlsson (1):
  i40e: xsk: move tmp desc array from driver to pool

 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  11 --
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |   1 -
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    |   4 +-
 drivers/net/ethernet/intel/ice/ice_txrx.c     |   2 +-
 drivers/net/ethernet/intel/ice/ice_txrx.h     |   3 +-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c |   1 +
 drivers/net/ethernet/intel/ice/ice_xsk.c      | 138 ++++++++++--------
 drivers/net/ethernet/intel/ice/ice_xsk.h      |   5 +-
 include/net/xdp_sock_drv.h                    |   5 +-
 include/net/xsk_buff_pool.h                   |   1 +
 net/xdp/xsk.c                                 |  13 +-
 net/xdp/xsk_buff_pool.c                       |   7 +
 net/xdp/xsk_queue.h                           |  12 +-
 13 files changed, 106 insertions(+), 97 deletions(-)

-- 
2.33.1

^ permalink raw reply

* [PATCH bpf-next 1/3] i40e: xsk: move tmp desc array from driver to pool
From: Maciej Fijalkowski @ 2021-12-10 17:14 UTC (permalink / raw)
  To: bpf, ast, daniel; +Cc: netdev, magnus.karlsson
In-Reply-To: <20211210171425.11475-1-maciej.fijalkowski@intel.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

Move desc_array from the driver to the pool. The reason behind this is
that we can then reuse this array as a temporary storage for
descriptors in both the zero-copy case and the SKB case. No need to
have two, but it needs to be in the pool as the SKB case assumes
unmodified drivers.

i40e is the only driver that has a batched Tx zero-copy
implementation, so no need to touch any other driver.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 11 -----------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  1 -
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  |  4 ++--
 include/net/xdp_sock_drv.h                  |  5 ++---
 include/net/xsk_buff_pool.h                 |  1 +
 net/xdp/xsk.c                               | 13 ++++++-------
 net/xdp/xsk_buff_pool.c                     |  7 +++++++
 net/xdp/xsk_queue.h                         | 12 ++++++------
 8 files changed, 24 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 10a83e5385c7..d3a4a33977ee 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -830,8 +830,6 @@ void i40e_free_tx_resources(struct i40e_ring *tx_ring)
 	i40e_clean_tx_ring(tx_ring);
 	kfree(tx_ring->tx_bi);
 	tx_ring->tx_bi = NULL;
-	kfree(tx_ring->xsk_descs);
-	tx_ring->xsk_descs = NULL;
 
 	if (tx_ring->desc) {
 		dma_free_coherent(tx_ring->dev, tx_ring->size,
@@ -1433,13 +1431,6 @@ int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
 	if (!tx_ring->tx_bi)
 		goto err;
 
-	if (ring_is_xdp(tx_ring)) {
-		tx_ring->xsk_descs = kcalloc(I40E_MAX_NUM_DESCRIPTORS, sizeof(*tx_ring->xsk_descs),
-					     GFP_KERNEL);
-		if (!tx_ring->xsk_descs)
-			goto err;
-	}
-
 	u64_stats_init(&tx_ring->syncp);
 
 	/* round up to nearest 4K */
@@ -1463,8 +1454,6 @@ int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
 	return 0;
 
 err:
-	kfree(tx_ring->xsk_descs);
-	tx_ring->xsk_descs = NULL;
 	kfree(tx_ring->tx_bi);
 	tx_ring->tx_bi = NULL;
 	return -ENOMEM;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index bfc2845c99d1..f6d91fa1562e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -390,7 +390,6 @@ struct i40e_ring {
 	u16 rx_offset;
 	struct xdp_rxq_info xdp_rxq;
 	struct xsk_buff_pool *xsk_pool;
-	struct xdp_desc *xsk_descs;      /* For storing descriptors in the AF_XDP ZC path */
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index ea06e957393e..1a505cf8a6ad 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -467,11 +467,11 @@ static void i40e_set_rs_bit(struct i40e_ring *xdp_ring)
  **/
 static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 {
-	struct xdp_desc *descs = xdp_ring->xsk_descs;
+	struct xdp_desc *descs = xdp_ring->xsk_pool->tx_descs;
 	u32 nb_pkts, nb_processed = 0;
 	unsigned int total_bytes = 0;
 
-	nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, descs, budget);
+	nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, budget);
 	if (!nb_pkts)
 		return true;
 
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 443d45951564..4aa031849668 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -13,7 +13,7 @@
 
 void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
 bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
-u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *desc, u32 max);
+u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max);
 void xsk_tx_release(struct xsk_buff_pool *pool);
 struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev,
 					    u16 queue_id);
@@ -142,8 +142,7 @@ static inline bool xsk_tx_peek_desc(struct xsk_buff_pool *pool,
 	return false;
 }
 
-static inline u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *desc,
-						 u32 max)
+static inline u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max)
 {
 	return 0;
 }
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index ddeefc4a1040..5554ee75e7da 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -60,6 +60,7 @@ struct xsk_buff_pool {
 	 */
 	dma_addr_t *dma_pages;
 	struct xdp_buff_xsk *heads;
+	struct xdp_desc *tx_descs;
 	u64 chunk_mask;
 	u64 addrs_cnt;
 	u32 free_list_cnt;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f16074eb53c7..e86a8fd0c7b3 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -343,9 +343,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
 }
 EXPORT_SYMBOL(xsk_tx_peek_desc);
 
-static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, struct xdp_desc *descs,
-					u32 max_entries)
+static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, u32 max_entries)
 {
+	struct xdp_desc *descs = pool->tx_descs;
 	u32 nb_pkts = 0;
 
 	while (nb_pkts < max_entries && xsk_tx_peek_desc(pool, &descs[nb_pkts]))
@@ -355,8 +355,7 @@ static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, struct xdp_d
 	return nb_pkts;
 }
 
-u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *descs,
-				   u32 max_entries)
+u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max_entries)
 {
 	struct xdp_sock *xs;
 	u32 nb_pkts;
@@ -365,7 +364,7 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *
 	if (!list_is_singular(&pool->xsk_tx_list)) {
 		/* Fallback to the non-batched version */
 		rcu_read_unlock();
-		return xsk_tx_peek_release_fallback(pool, descs, max_entries);
+		return xsk_tx_peek_release_fallback(pool, max_entries);
 	}
 
 	xs = list_first_or_null_rcu(&pool->xsk_tx_list, struct xdp_sock, tx_list);
@@ -374,7 +373,7 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *
 		goto out;
 	}
 
-	nb_pkts = xskq_cons_peek_desc_batch(xs->tx, descs, pool, max_entries);
+	nb_pkts = xskq_cons_peek_desc_batch(xs->tx, pool, max_entries);
 	if (!nb_pkts) {
 		xs->tx->queue_empty_descs++;
 		goto out;
@@ -386,7 +385,7 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *
 	 * packets. This avoids having to implement any buffering in
 	 * the Tx path.
 	 */
-	nb_pkts = xskq_prod_reserve_addr_batch(pool->cq, descs, nb_pkts);
+	nb_pkts = xskq_prod_reserve_addr_batch(pool->cq, pool->tx_descs, nb_pkts);
 	if (!nb_pkts)
 		goto out;
 
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 90c4e1e819d3..15f02f02a743 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -37,6 +37,7 @@ void xp_destroy(struct xsk_buff_pool *pool)
 	if (!pool)
 		return;
 
+	kvfree(pool->tx_descs);
 	kvfree(pool->heads);
 	kvfree(pool);
 }
@@ -58,6 +59,12 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
 	if (!pool->heads)
 		goto out;
 
+	if (xs->tx) {
+		pool->tx_descs = kcalloc(xs->tx->nentries, sizeof(*pool->tx_descs), GFP_KERNEL);
+		if (!pool->tx_descs)
+			goto out;
+	}
+
 	pool->chunk_mask = ~((u64)umem->chunk_size - 1);
 	pool->addrs_cnt = umem->size;
 	pool->heads_cnt = umem->chunks;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index e9aa2c236356..638138fbe475 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -205,11 +205,11 @@ static inline bool xskq_cons_read_desc(struct xsk_queue *q,
 	return false;
 }
 
-static inline u32 xskq_cons_read_desc_batch(struct xsk_queue *q,
-					    struct xdp_desc *descs,
-					    struct xsk_buff_pool *pool, u32 max)
+static inline u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
+					    u32 max)
 {
 	u32 cached_cons = q->cached_cons, nb_entries = 0;
+	struct xdp_desc *descs = pool->tx_descs;
 
 	while (cached_cons != q->cached_prod && nb_entries < max) {
 		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
@@ -282,12 +282,12 @@ static inline bool xskq_cons_peek_desc(struct xsk_queue *q,
 	return xskq_cons_read_desc(q, desc, pool);
 }
 
-static inline u32 xskq_cons_peek_desc_batch(struct xsk_queue *q, struct xdp_desc *descs,
-					    struct xsk_buff_pool *pool, u32 max)
+static inline u32 xskq_cons_peek_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
+					    u32 max)
 {
 	u32 entries = xskq_cons_nb_entries(q, max);
 
-	return xskq_cons_read_desc_batch(q, descs, pool, entries);
+	return xskq_cons_read_desc_batch(q, pool, entries);
 }
 
 /* To improve performance in the xskq_cons_release functions, only update local state here.
-- 
2.33.1


^ permalink raw reply related

* [PATCH bpf-next 2/3] ice: xsk: improve AF_XDP ZC Tx side
From: Maciej Fijalkowski @ 2021-12-10 17:14 UTC (permalink / raw)
  To: bpf, ast, daniel; +Cc: netdev, magnus.karlsson, Maciej Fijalkowski
In-Reply-To: <20211210171425.11475-1-maciej.fijalkowski@intel.com>

Follow mostly the logic from commit 9610bd988df9 ("ice: optimize XDP_TX
workloads") that has been done in order to address the massive tx_busy
statistic bump and improve the performance as well.

One difference from 'xdpdrv' XDP_TX is when ring has less than
ICE_TX_THRESH free entries, the cleaning routine will not stop after
cleaning a single ICE_TX_THRESH amount of descs but rather will forward
the next_dd pointer and check the DD bit and for this bit being set the
cleaning will be repeated. IOW clean until there are descs that can be
cleaned.

Single instance of txonly scenario from xdpsock is improved by 20%. It
takes four instances to achieve the line rate, which was not possible to
achieve previously.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c |   2 +-
 drivers/net/ethernet/intel/ice/ice_txrx.h |   2 +-
 drivers/net/ethernet/intel/ice/ice_xsk.c  | 131 ++++++++++++----------
 drivers/net/ethernet/intel/ice/ice_xsk.h  |   5 +-
 4 files changed, 73 insertions(+), 67 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 227513b687b9..4b0ddb6df0c7 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1452,7 +1452,7 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
 		bool wd;
 
 		if (tx_ring->xsk_pool)
-			wd = ice_clean_tx_irq_zc(tx_ring, budget);
+			wd = ice_clean_tx_irq_zc(tx_ring);
 		else if (ice_ring_is_xdp(tx_ring))
 			wd = true;
 		else
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index b7b3bd4816f0..f2ebbe2158e7 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -321,9 +321,9 @@ struct ice_tx_ring {
 	u16 count;			/* Number of descriptors */
 	u16 q_index;			/* Queue number of ring */
 	/* stats structs */
+	struct ice_txq_stats tx_stats;
 	struct ice_q_stats	stats;
 	struct u64_stats_sync syncp;
-	struct ice_txq_stats tx_stats;
 
 	/* CL3 - 3rd cacheline starts here */
 	struct rcu_head rcu;		/* to avoid race on free */
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 925326c70701..a7f866b3fcd7 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -613,55 +613,68 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
 /**
  * ice_xmit_zc - Completes AF_XDP entries, and cleans XDP entries
  * @xdp_ring: XDP Tx ring
- * @budget: max number of frames to xmit
  *
  * Returns true if cleanup/transmission is done.
  */
-static bool ice_xmit_zc(struct ice_tx_ring *xdp_ring, int budget)
+static bool ice_xmit_zc(struct ice_tx_ring *xdp_ring)
 {
+	int total_packets = 0, total_bytes = 0;
 	struct ice_tx_desc *tx_desc = NULL;
-	bool work_done = true;
+	u16 ntu = xdp_ring->next_to_use;
 	struct xdp_desc desc;
 	dma_addr_t dma;
+	u16 budget = 0;
 
-	while (likely(budget-- > 0)) {
-		struct ice_tx_buf *tx_buf;
-
-		if (unlikely(!ICE_DESC_UNUSED(xdp_ring))) {
-			xdp_ring->tx_stats.tx_busy++;
-			work_done = false;
-			break;
-		}
-
-		tx_buf = &xdp_ring->tx_buf[xdp_ring->next_to_use];
+	budget = ICE_DESC_UNUSED(xdp_ring);
+	if (unlikely(!budget)) {
+		xdp_ring->tx_stats.tx_busy++;
+		return false;
+	}
 
+	while (budget-- > 0) {
 		if (!xsk_tx_peek_desc(xdp_ring->xsk_pool, &desc))
 			break;
 
+		total_packets++;
 		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc.addr);
 		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma,
 						 desc.len);
 
-		tx_buf->bytecount = desc.len;
+		total_bytes += desc.len;
 
-		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_to_use);
+		tx_desc = ICE_TX_DESC(xdp_ring, ntu);
 		tx_desc->buf_addr = cpu_to_le64(dma);
 		tx_desc->cmd_type_offset_bsz =
-			ice_build_ctob(ICE_TXD_LAST_DESC_CMD, 0, desc.len, 0);
-
-		xdp_ring->next_to_use++;
-		if (xdp_ring->next_to_use == xdp_ring->count)
-			xdp_ring->next_to_use = 0;
+			ice_build_ctob(ICE_TX_DESC_CMD_EOP, 0, desc.len, 0);
+
+		ntu++;
+		if (ntu == xdp_ring->count) {
+			ntu = 0;
+			tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
+			tx_desc->cmd_type_offset_bsz |=
+				cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
+			xdp_ring->next_rs = ICE_TX_THRESH - 1;
+		}
+		if (ntu > xdp_ring->next_rs) {
+			tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
+			tx_desc->cmd_type_offset_bsz |=
+				cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
+			xdp_ring->next_rs += ICE_TX_THRESH;
+		}
 	}
 
+	xdp_ring->next_to_use = ntu;
 	if (tx_desc) {
 		ice_xdp_ring_update_tail(xdp_ring);
 		xsk_tx_release(xdp_ring->xsk_pool);
 	}
 
-	return budget > 0 && work_done;
-}
+	if (xsk_uses_need_wakeup(xdp_ring->xsk_pool))
+		xsk_set_tx_need_wakeup(xdp_ring->xsk_pool);
+	ice_update_tx_ring_stats(xdp_ring, total_packets, total_bytes);
 
+	return budget > 0;
+}
 /**
  * ice_clean_xdp_tx_buf - Free and unmap XDP Tx buffer
  * @xdp_ring: XDP Tx ring
@@ -679,30 +692,31 @@ ice_clean_xdp_tx_buf(struct ice_tx_ring *xdp_ring, struct ice_tx_buf *tx_buf)
 /**
  * ice_clean_tx_irq_zc - Completes AF_XDP entries, and cleans XDP entries
  * @xdp_ring: XDP Tx ring
- * @budget: NAPI budget
  *
  * Returns true if cleanup/tranmission is done.
  */
-bool ice_clean_tx_irq_zc(struct ice_tx_ring *xdp_ring, int budget)
+bool ice_clean_tx_irq_zc(struct ice_tx_ring *xdp_ring)
 {
-	int total_packets = 0, total_bytes = 0;
-	s16 ntc = xdp_ring->next_to_clean;
+	u16 next_dd = xdp_ring->next_dd;
+	u16 desc_cnt = xdp_ring->count;
 	struct ice_tx_desc *tx_desc;
 	struct ice_tx_buf *tx_buf;
-	u32 xsk_frames = 0;
-	bool xmit_done;
+	u32 xsk_frames;
+	u16 ntc;
+	int i;
 
-	tx_desc = ICE_TX_DESC(xdp_ring, ntc);
-	tx_buf = &xdp_ring->tx_buf[ntc];
-	ntc -= xdp_ring->count;
+	tx_desc = ICE_TX_DESC(xdp_ring, next_dd);
 
-	do {
-		if (!(tx_desc->cmd_type_offset_bsz &
-		      cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)))
-			break;
+	if (!(tx_desc->cmd_type_offset_bsz &
+	      cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)))
+		return ice_xmit_zc(xdp_ring);
 
-		total_bytes += tx_buf->bytecount;
-		total_packets++;
+again:
+	xsk_frames = 0;
+	ntc = xdp_ring->next_to_clean;
+
+	for (i = 0; i < ICE_TX_THRESH; i++) {
+		tx_buf = &xdp_ring->tx_buf[ntc];
 
 		if (tx_buf->raw_buf) {
 			ice_clean_xdp_tx_buf(xdp_ring, tx_buf);
@@ -711,34 +725,27 @@ bool ice_clean_tx_irq_zc(struct ice_tx_ring *xdp_ring, int budget)
 			xsk_frames++;
 		}
 
-		tx_desc->cmd_type_offset_bsz = 0;
-		tx_buf++;
-		tx_desc++;
 		ntc++;
-
-		if (unlikely(!ntc)) {
-			ntc -= xdp_ring->count;
-			tx_buf = xdp_ring->tx_buf;
-			tx_desc = ICE_TX_DESC(xdp_ring, 0);
-		}
-
-		prefetch(tx_desc);
-
-	} while (likely(--budget));
-
-	ntc += xdp_ring->count;
-	xdp_ring->next_to_clean = ntc;
-
+		if (ntc >= xdp_ring->count)
+			ntc = 0;
+	}
+	xdp_ring->next_to_clean += ICE_TX_THRESH;
+	if (xdp_ring->next_to_clean >= desc_cnt)
+		xdp_ring->next_to_clean -= desc_cnt;
 	if (xsk_frames)
 		xsk_tx_completed(xdp_ring->xsk_pool, xsk_frames);
-
-	if (xsk_uses_need_wakeup(xdp_ring->xsk_pool))
-		xsk_set_tx_need_wakeup(xdp_ring->xsk_pool);
-
-	ice_update_tx_ring_stats(xdp_ring, total_packets, total_bytes);
-	xmit_done = ice_xmit_zc(xdp_ring, ICE_DFLT_IRQ_WORK);
-
-	return budget > 0 && xmit_done;
+	tx_desc->cmd_type_offset_bsz = 0;
+	next_dd += ICE_TX_THRESH;
+	if (next_dd > desc_cnt)
+		next_dd = ICE_TX_THRESH - 1;
+
+	tx_desc = ICE_TX_DESC(xdp_ring, next_dd);
+	if ((tx_desc->cmd_type_offset_bsz &
+	    cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)))
+		goto again;
+	xdp_ring->next_dd = next_dd;
+
+	return ice_xmit_zc(xdp_ring);
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.h b/drivers/net/ethernet/intel/ice/ice_xsk.h
index 4c7bd8e9dfc4..1f98ad090f89 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.h
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.h
@@ -12,7 +12,7 @@ struct ice_vsi;
 int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool,
 		       u16 qid);
 int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget);
-bool ice_clean_tx_irq_zc(struct ice_tx_ring *xdp_ring, int budget);
+bool ice_clean_tx_irq_zc(struct ice_tx_ring *xdp_ring);
 int ice_xsk_wakeup(struct net_device *netdev, u32 queue_id, u32 flags);
 bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count);
 bool ice_xsk_any_rx_ring_ena(struct ice_vsi *vsi);
@@ -35,8 +35,7 @@ ice_clean_rx_irq_zc(struct ice_rx_ring __always_unused *rx_ring,
 }
 
 static inline bool
-ice_clean_tx_irq_zc(struct ice_tx_ring __always_unused *xdp_ring,
-		    int __always_unused budget)
+ice_clean_tx_irq_zc(struct ice_tx_ring __always_unused *xdp_ring)
 {
 	return false;
 }
-- 
2.33.1


^ permalink raw reply related

* [PATCH bpf-next 3/3] ice: xsk: borrow xdp_tx_active logic from i40e
From: Maciej Fijalkowski @ 2021-12-10 17:14 UTC (permalink / raw)
  To: bpf, ast, daniel; +Cc: netdev, magnus.karlsson, Maciej Fijalkowski
In-Reply-To: <20211210171425.11475-1-maciej.fijalkowski@intel.com>

One of the things that commit 5574ff7b7b3d ("i40e: optimize AF_XDP Tx
completion path") introduced was the @xdp_tx_active field. Its usage
from i40e can be adjusted to ice driver and give us positive performance
results.

If the descriptor that @next_dd to points has been sent by HW (its DD
bit is set), then we are sure that there are ICE_TX_THRESH count of
descriptors ready to be cleaned. If @xdp_tx_active is 0 which means that
related xdp_ring is not used for XDP_{TX, REDIRECT} workloads, then we
know how many XSK entries should placed to completion queue, IOW walking
through the ring can be skipped.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.h     | 1 +
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 1 +
 drivers/net/ethernet/intel/ice/ice_xsk.c      | 7 +++++++
 3 files changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index f2ebbe2158e7..8dd9c92662ad 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -332,6 +332,7 @@ struct ice_tx_ring {
 	struct ice_ptp_tx *tx_tstamps;
 	spinlock_t tx_lock;
 	u32 txq_teid;			/* Added Tx queue TEID */
+	u16 xdp_tx_active;
 #define ICE_TX_FLAGS_RING_XDP		BIT(0)
 	u8 flags;
 	u8 dcb_tc;			/* Traffic class of ring */
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 1dd7e84f41f8..f15c215c973c 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -299,6 +299,7 @@ int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring)
 	tx_desc->cmd_type_offset_bsz = ice_build_ctob(ICE_TX_DESC_CMD_EOP, 0,
 						      size, 0);
 
+	xdp_ring->xdp_tx_active++;
 	i++;
 	if (i == xdp_ring->count) {
 		i = 0;
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index a7f866b3fcd7..8949a7be45c6 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -684,6 +684,7 @@ static void
 ice_clean_xdp_tx_buf(struct ice_tx_ring *xdp_ring, struct ice_tx_buf *tx_buf)
 {
 	xdp_return_frame((struct xdp_frame *)tx_buf->raw_buf);
+	xdp_ring->xdp_tx_active--;
 	dma_unmap_single(xdp_ring->dev, dma_unmap_addr(tx_buf, dma),
 			 dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
 	dma_unmap_len_set(tx_buf, len, 0);
@@ -713,6 +714,11 @@ bool ice_clean_tx_irq_zc(struct ice_tx_ring *xdp_ring)
 
 again:
 	xsk_frames = 0;
+	if (likely(!xdp_ring->xdp_tx_active)) {
+		xsk_frames = ICE_TX_THRESH;
+		goto skip;
+	}
+
 	ntc = xdp_ring->next_to_clean;
 
 	for (i = 0; i < ICE_TX_THRESH; i++) {
@@ -729,6 +735,7 @@ bool ice_clean_tx_irq_zc(struct ice_tx_ring *xdp_ring)
 		if (ntc >= xdp_ring->count)
 			ntc = 0;
 	}
+skip:
 	xdp_ring->next_to_clean += ICE_TX_THRESH;
 	if (xdp_ring->next_to_clean >= desc_cnt)
 		xdp_ring->next_to_clean -= desc_cnt;
-- 
2.33.1


^ permalink raw reply related

* [PATCH bpf-next] xsk: wipe out dead zero_copy_allocator declarations
From: Maciej Fijalkowski @ 2021-12-10 17:15 UTC (permalink / raw)
  To: bpf, ast, daniel; +Cc: netdev, magnus.karlsson, Maciej Fijalkowski

zero_copy_allocator has been removed back when Bjorn Topel introduced
xsk_buff_pool. Remove references to it that were dangling in the tree.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.h           | 1 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h | 2 --
 include/net/xdp_priv.h                               | 1 -
 3 files changed, 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
index ea88f4597a07..bb962987f300 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h
@@ -22,7 +22,6 @@
 
 struct i40e_vsi;
 struct xsk_buff_pool;
-struct zero_copy_allocator;
 
 int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair);
 int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
index a82533f21d36..bba3feaf3318 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
@@ -35,8 +35,6 @@ int ixgbe_xsk_pool_setup(struct ixgbe_adapter *adapter,
 			 struct xsk_buff_pool *pool,
 			 u16 qid);
 
-void ixgbe_zca_free(struct zero_copy_allocator *alloc, unsigned long handle);
-
 bool ixgbe_alloc_rx_buffers_zc(struct ixgbe_ring *rx_ring, u16 cleaned_count);
 int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
 			  struct ixgbe_ring *rx_ring,
diff --git a/include/net/xdp_priv.h b/include/net/xdp_priv.h
index a9d5b7603b89..a2d58b1a12e1 100644
--- a/include/net/xdp_priv.h
+++ b/include/net/xdp_priv.h
@@ -10,7 +10,6 @@ struct xdp_mem_allocator {
 	union {
 		void *allocator;
 		struct page_pool *page_pool;
-		struct zero_copy_allocator *zc_alloc;
 	};
 	struct rhash_head node;
 	struct rcu_head rcu;
-- 
2.33.1


^ permalink raw reply related

* Re: [RFC PATCH v2 net-next 0/4] DSA master state tracking
From: Vladimir Oltean @ 2021-12-10 17:15 UTC (permalink / raw)
  To: Ansuel Smith
  Cc: netdev@vger.kernel.org, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli
In-Reply-To: <61b38a18.1c69fb81.95975.8545@mx.google.com>

On Fri, Dec 10, 2021 at 06:10:45PM +0100, Ansuel Smith wrote:
> On Fri, Dec 10, 2021 at 05:02:42PM +0000, Vladimir Oltean wrote:
> > On Fri, Dec 10, 2021 at 04:37:52AM +0100, Ansuel Smith wrote:
> > > On Thu, Dec 09, 2021 at 07:39:23PM +0200, Vladimir Oltean wrote:
> > > > This patch set is provided solely for review purposes (therefore not to
> > > > be applied anywhere) and for Ansuel to test whether they resolve the
> > > > slowdown reported here:
> > > > https://patchwork.kernel.org/project/netdevbpf/cover/20211207145942.7444-1-ansuelsmth@gmail.com/
> > > > 
> > > > The patches posted here are mainly to offer a consistent
> > > > "master_state_change" chain of events to switches, without duplicates,
> > > > and always starting with operational=true and ending with
> > > > operational=false. This way, drivers should know when they can perform
> > > > Ethernet-based register access, and need not care about more than that.
> > > > 
> > > > Changes in v2:
> > > > - dropped some useless patches
> > > > - also check master operstate.
> > > > 
> > > > Vladimir Oltean (4):
> > > >   net: dsa: provide switch operations for tracking the master state
> > > >   net: dsa: stop updating master MTU from master.c
> > > >   net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}
> > > >   net: dsa: replay master state events in
> > > >     dsa_tree_{setup,teardown}_master
> > > > 
> > > >  include/net/dsa.h  | 11 +++++++
> > > >  net/dsa/dsa2.c     | 80 +++++++++++++++++++++++++++++++++++++++++++---
> > > >  net/dsa/dsa_priv.h | 13 ++++++++
> > > >  net/dsa/master.c   | 29 ++---------------
> > > >  net/dsa/slave.c    | 27 ++++++++++++++++
> > > >  net/dsa/switch.c   | 15 +++++++++
> > > >  6 files changed, 145 insertions(+), 30 deletions(-)
> > > > 
> > > > -- 
> > > > 2.25.1
> > > > 
> > > 
> > > Hi, I tested this v2 and I still have 2 ethernet mdio failing on init.
> > > I don't think we have other way to track this. Am I wrong?
> > > 
> > > All works correctly with this and promisc_on_master.
> > > If you have other test, feel free to send me other stuff to test.
> > > 
> > > (I'm starting to think the fail is caused by some delay that the switch
> > > require to actually start accepting packet or from the reinit? But I'm
> > > not sure... don't know if you notice something from the pcap)
> > 
> > I've opened the pcap just now. The Ethernet MDIO packets are
> > non-standard. When the DSA master receives them, it expects the first 6
> > octets to be the MAC DA, because that's the format of an Ethernet frame.
> > But the packets have this other format, according to your own writing:
> > 
> > /* Specific define for in-band MDIO read/write with Ethernet packet */
> > #define QCA_HDR_MDIO_SEQ_LEN           4 /* 4 byte for the seq */
> > #define QCA_HDR_MDIO_COMMAND_LEN       4 /* 4 byte for the command */
> > #define QCA_HDR_MDIO_DATA1_LEN         4 /* First 4 byte for the mdio data */
> > #define QCA_HDR_MDIO_HEADER_LEN        (QCA_HDR_MDIO_SEQ_LEN + \
> >                                        QCA_HDR_MDIO_COMMAND_LEN + \
> >                                        QCA_HDR_MDIO_DATA1_LEN)
> > 
> > #define QCA_HDR_MDIO_DATA2_LEN         12 /* Other 12 byte for the mdio data */
> > #define QCA_HDR_MDIO_PADDING_LEN       34 /* Padding to reach the min Ethernet packet */
> > 
> > The first 6 octets change like crazy in your pcap. Definitely can't add
> > that to the RX filter of the DSA master.
> > 
> > So yes, promisc_on_master is precisely what you need, it exists for
> > situations like this.
> > 
> > Considering this, I guess it works?
> 
> Yes it works! We can totally accept 2 mdio timeout out of a good way to
> track the master port. It's probably related to other stuff like switch
> delay or other.
> 
> Wonder the next step is wait for this to be accepted and then I can
> propose a v3 of my patch? Or net-next is closed now and I should just
> send v3 RFC saying it does depend on this?

Wait a minute, I don't think I understood your previous reply.
With promisc_on_master, is there or is there not any timeout?
My understanding was this: DSA tells you when the master is up and
operational. That information is correct, except it isn't sufficient and
you don't see the replies back. Later during boot, you have some init
scripts triggered by user space that create a bridge interface and put
the switch ports under the bridge. The bridge puts the switch interfaces
in promiscuous mode, because that's what bridges do. Then DSA propagates
the promiscuous mode from the switch ports to the DSA master, and once
the master is promiscuous, the Ethernet MDIO starts working too.
Now, with promisc_on_master set, the DSA master is already promiscuous
by the time DSA tells you that it's up and running. Hence your message
that "All works correctly with this and promisc_on_master."
What did I misunderstand?

^ permalink raw reply

* Re: [PATCH 1/2] xfrm: interface with if_id 0 should return error
From: Eyal Birger @ 2021-12-10 17:22 UTC (permalink / raw)
  To: antony.antony
  Cc: Steffen Klassert, Herbert Xu, David S. Miller, Jakub Kicinski,
	Linux Kernel Network Developers
In-Reply-To: <0bfebd4e5f317cbf301750d5dd5cc706d4385d7f.1639064087.git.antony.antony@secunet.com>

On Thu, Dec 9, 2021 at 5:36 PM Antony Antony <antony.antony@secunet.com> wrote:
>
> xfrm interface if_id = 0 would cause xfrm policy lookup errors since
> commit 9f8550e4bd9d ("xfrm: fix disable_xfrm sysctl when used on xfrm interfaces")
>
> Now fail to create an xfrm interface when if_id = 0
>
> With this commit:
>  ip link add ipsec0  type xfrm dev lo  if_id 0
>  Error: if_id must be non zero.
>
> Signed-off-by: Antony Antony <antony.antony@secunet.com>
> ---
>  net/xfrm/xfrm_interface.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
> index 41de46b5ffa9..57448fc519fc 100644
> --- a/net/xfrm/xfrm_interface.c
> +++ b/net/xfrm/xfrm_interface.c
> @@ -637,11 +637,16 @@ static int xfrmi_newlink(struct net *src_net, struct net_device *dev,
>                         struct netlink_ext_ack *extack)
>  {
>         struct net *net = dev_net(dev);
> -       struct xfrm_if_parms p;
> +       struct xfrm_if_parms p = {};
>         struct xfrm_if *xi;
>         int err;
>
>         xfrmi_netlink_parms(data, &p);
> +       if (!p.if_id) {
> +               NL_SET_ERR_MSG(extack, "if_id must be non zero");
> +               return -EINVAL;
> +       }
> +
>         xi = xfrmi_locate(net, &p);
>         if (xi)
>                 return -EEXIST;
> @@ -666,7 +671,12 @@ static int xfrmi_changelink(struct net_device *dev, struct nlattr *tb[],
>  {
>         struct xfrm_if *xi = netdev_priv(dev);
>         struct net *net = xi->net;
> -       struct xfrm_if_parms p;
> +       struct xfrm_if_parms p = {};
> +
> +       if (!p.if_id) {
> +               NL_SET_ERR_MSG(extack, "if_id must be non zero");
> +               return -EINVAL;
> +       }
>
>         xfrmi_netlink_parms(data, &p);
>         xi = xfrmi_locate(net, &p);

Looks good. Maybe this needs a "Fixes:" tag?

Reviewed-by: Eyal Birger <eyal.birger@gmail.com>

Thanks,
Eyal.

> --
> 2.30.2
>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox