DPDK-dev Archive on lore.kernel.org

* Re: [PATCH v1 1/1] net/i40e: allow discontiguous queue lists in hash
From: Bruce Richardson @ 2026-06-16  8:42 UTC
  To: Anatoly Burakov; +Cc: dev
In-Reply-To: <9999fab5d9491d15ff98ac5aafa248e11df558de.1781521311.git.anatoly.burakov@intel.com>

On Mon, Jun 15, 2026 at 12:01:58PM +0100, Anatoly Burakov wrote:
> Due to recent refactors and code unification, the following properties
> of the RSS queue list can now be checked by common infrastructure:
> 
> - Monotonicity (i.e. queue indices always increase, never decrease)
> - No duplication (i.e. the same index can't be specified twice)
> - Contiguity (i.e. there can't be holes in the queue list)
> 
> The last of these is optional and can be enabled with a flag. However,
> the previous hash code only enforced contiguity for queue *regions*, not
> queue *lists*, whereas after the refactor all queue lists were required
> to be contiguous. This is an unnecessary restriction, and it breaks
> backwards compatibility.
> 
> Fix it by specifying the contiguity requirement only for the VLAN branch,
> where we are actually looking for a queue *region*, not a queue *list*.
> 
> Fixes: 0185303c2e24 ("net/i40e: refactor RSS flow parameter checks")
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---

Acked-by: Bruce Richardson <bruce.richardson@intel.com>

Applied to dpdk-next-net-intel (with corrected fixline commit id).
Thanks,
/Bruce


>  drivers/net/intel/i40e/i40e_hash.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/intel/i40e/i40e_hash.c b/drivers/net/intel/i40e/i40e_hash.c
> index 3c1302469c..8b80d0a91c 100644
> --- a/drivers/net/intel/i40e/i40e_hash.c
> +++ b/drivers/net/intel/i40e/i40e_hash.c
> @@ -1238,7 +1238,6 @@ i40e_hash_parse(struct rte_eth_dev *dev,
>  		},
>  		.max_actions = 1,
>  		.driver_ctx = dev->data->dev_private,
> -		.rss_queues_contig = true,
>  		/* each pattern type will add specific check function */
>  	};
>  	const struct rte_flow_action_rss *rss_act;
> @@ -1265,6 +1264,8 @@ i40e_hash_parse(struct rte_eth_dev *dev,
>  	/* VLAN path */
>  	if (is_vlan) {
>  		ac_param.check = i40e_hash_validate_queue_region;
> +		/* queue regions must be contiguous */
> +		ac_param.rss_queues_contig = true;
>  		ret = ci_flow_check_actions(actions, &ac_param, &parsed_actions, error);
>  		if (ret)
>  			return ret;
> -- 
> 2.47.3
> 
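To make the queue *list* versus queue *region* distinction concrete, the three checks from the commit message can be written as a small standalone validator. This is a sketch, not the common `ci_flow_check_actions()` code: `check_rss_queue_list()` is a hypothetical name, and its `require_contig` argument plays the role of the `rss_queues_contig` flag that the patch now sets only on the VLAN (queue region) path. Rejecting an empty list is also an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical RSS queue list checker (not the DPDK implementation).
 * - indices must be strictly increasing (monotonic with no duplicates)
 * - if require_contig is set, each index must be exactly previous + 1
 * Returns 0 on success, -EINVAL otherwise. Empty lists are rejected
 * (an assumption made for this sketch).
 */
static int
check_rss_queue_list(const uint16_t *queues, size_t n, bool require_contig)
{
	size_t i;

	if (n == 0)
		return -EINVAL;

	for (i = 1; i < n; i++) {
		/* monotonic + no duplication: must exceed the previous index */
		if (queues[i] <= queues[i - 1])
			return -EINVAL;
		/* contiguity, wanted for queue regions only: no holes */
		if (require_contig && queues[i] != queues[i - 1] + 1)
			return -EINVAL;
	}
	return 0;
}
```

With this shape, a list such as `{0, 2, 5}` passes when `require_contig` is false (a valid queue list) and fails when it is true (not a valid queue region), which is the behaviour the patch restores.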


* RE: [EXTERNAL] Re: [PATCH v9 0/1] net/mana: add device reset support
From: Wei Hu @ 2026-06-16  8:11 UTC
  To: Stephen Hemminger, Wei Hu; +Cc: dev@dpdk.org, Long Li
In-Reply-To: <20260615115028.5fa705c3@phoenix.local>

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, June 16, 2026 2:50 AM
> To: Wei Hu <weh@linux.microsoft.com>
> Cc: dev@dpdk.org; Long Li <longli@microsoft.com>; Wei Hu
> <weh@microsoft.com>
> Subject: [EXTERNAL] Re: [PATCH v9 0/1] net/mana: add device reset support
> 
> 
> One small thing in mp.c: in the RESET_EXIT secondary handler the received fd is
> only closed on the branch that maps it. If proc_priv->db_page is already non-
> NULL the fd from the message is leaked. Close it whenever num_fds >= 1,
> outside the if/else.
> ---
> I can merge it as is, or you can send a revision to close that minor leak.

I will send a revision to close this leak. Thanks Stephen!

Wei
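Stephen's suggestion amounts to making the close unconditional once a descriptor has been received. A minimal sketch of that shape, assuming a simplified handler: `struct proc_priv_sketch`, `handle_reset_exit()` and `map_db_page()` are hypothetical stand-ins, not the actual net/mana code in mp.c, and the mapping step is stubbed out instead of calling `mmap()`.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for the per-process private data. */
struct proc_priv_sketch {
	void *db_page; /* doorbell page, NULL until mapped */
};

/* Hypothetical mapping stub; a real driver would mmap() the fd here. */
static char fake_page[4096];

static void *
map_db_page(int fd)
{
	(void)fd;
	return fake_page;
}

/*
 * Handle a RESET_EXIT-style message carrying num_fds descriptors.
 * The received fd is closed on every path, including when the page
 * is already mapped (the leaking branch in the review comment).
 */
static int
handle_reset_exit(struct proc_priv_sketch *priv, const int *fds, int num_fds)
{
	int ret = 0;

	if (num_fds < 1)
		return -1;

	if (priv->db_page == NULL) {
		priv->db_page = map_db_page(fds[0]);
		if (priv->db_page == NULL)
			ret = -1;
	}
	/* otherwise the page is already mapped and the fd is not needed */

	/* Close outside the if/else so no branch leaks the descriptor. */
	close(fds[0]);
	return ret;
}
```

Closing the fd right after mapping is safe because an established `mmap()` mapping stays valid after its file descriptor is closed.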


* Re: [PATCH] net/iavf: fix scalar Rx path zero-length segment
From: Bruce Richardson @ 2026-06-16  8:06 UTC
  To: Loftus, Ciara; +Cc: dev@dpdk.org, stable@dpdk.org, Doherty, Declan
In-Reply-To: <ai_G2-sLspd2PdK8@bricha3-mobl1.ger.corp.intel.com>

On Mon, Jun 15, 2026 at 10:33:15AM +0100, Bruce Richardson wrote:
> On Mon, Jun 15, 2026 at 10:17:41AM +0100, Loftus, Ciara wrote:
> > > Subject: Re: [PATCH] net/iavf: fix scalar Rx path zero-length segment
> > > 
> > > On Fri, Jun 12, 2026 at 02:35:31PM +0000, Ciara Loftus wrote:
> > > > When hardware CRC stripping is active, a frame whose on-wire size is an
> > > > exact multiple of the Rx buffer size can cause the NIC to fill the final
> > > > data descriptor and place the four CRC bytes into a separate trailing
> > > > descriptor. After hardware stripping, that descriptor carries zero bytes
> > > > of payload.
> > > >
> > > > The existing CRC cleanup code only handles a zero-length trailing segment
> > > > when software CRC stripping is enabled. When hardware stripping is
> > > > active, the zero-length mbuf is silently chained to the reassembled
> > > > packet. Forwarding such a packet causes a zero-length Tx descriptor,
> > > > triggering a Malicious Driver Detection event on the PF and resetting
> > > > the VF.
> > > >
> > > > Fix by adding logic to detect a zero-length final segment when hardware
> > > > CRC stripping is active, and freeing it.
> > > >
> > > > Fixes: a2b29a7733ef ("net/avf: enable basic Rx Tx")
> > > > Fixes: b8b4c54ef9b0 ("net/iavf: support flexible Rx descriptor in normal path")
> > > > Cc: stable@dpdk.org
> > > >
> > > > Signed-off-by: Declan Doherty <declan.doherty@intel.com>
> > > > Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> > > > ---
> > > >  drivers/net/intel/iavf/iavf_rxtx.c | 16 ++++++++++++++++
> > > >  1 file changed, 16 insertions(+)
> > > >
> > > > diff --git a/drivers/net/intel/iavf/iavf_rxtx.c b/drivers/net/intel/iavf/iavf_rxtx.c
> > > > index a57af7faed..86ebb2618d 100644
> > > > --- a/drivers/net/intel/iavf/iavf_rxtx.c
> > > > +++ b/drivers/net/intel/iavf/iavf_rxtx.c
> > > > @@ -1716,6 +1716,14 @@ iavf_recv_scattered_pkts_flex_rxd(void *rx_queue, struct rte_mbuf **rx_pkts,
> > > >  				rxm->data_len = (uint16_t)(rx_packet_len -
> > > >  					RTE_ETHER_CRC_LEN);
> > > >  			}
> > > > +		} else if (unlikely(rx_packet_len == 0)) {
> > > > +			/*
> > > > +			 * NIC split CRC bytes into a trailing segment which is
> > > > +			 * now empty after hardware CRC stripping. Free it.
> > > > +			 */
> > > > +			rte_pktmbuf_free_seg(rxm);
> > > > +			first_seg->nb_segs--;
> > > > +			last_seg->next = NULL;
> > > >  		}
> > > >
> > > 
> > > The vector paths also handle scattered packets (via reassembly). Do they
> > > need a fix for this? What about the other drivers that work on the PF, such
> > > as ice/i40e?
> > 
> > The vector paths use the common ci_rx_reassemble_packets which already
> > handles the zero-length trailing segment case correctly. When
> > crc_len == 0 and the last segment has data_len == 0, the empty segment
> > is freed.
> > 
> > The ice scalar path had the same issue but it was patched in 2022:
> > https://git.dpdk.org/dpdk/commit/?id=90ba4442058a14763e57ca96d03ab1e6044e3e5c
> > I cannot reproduce the behaviour on i40e hardware (either PF or VF) so I
> > don't think it needs to be patched as the HW seems to behave
> > differently.
> > 
> 
> Thanks for clarifying.
> 
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> 
Applied to dpdk-next-net-intel.

thanks,
/Bruce
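The fix boils down to unlinking and freeing an empty tail segment from the chain being reassembled. A simplified sketch using a hypothetical `struct seg` in place of `struct rte_mbuf`, and `free()` in place of `rte_pktmbuf_free_seg()`; it only mirrors the bookkeeping in the patch (decrement `nb_segs` on the first segment, terminate the chain at the previous segment).

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for struct rte_mbuf. */
struct seg {
	struct seg *next;
	uint16_t data_len; /* payload bytes in this segment */
	uint16_t nb_segs;  /* meaningful on the first segment only */
	uint32_t pkt_len;  /* meaningful on the first segment only */
};

/*
 * Called after the final segment `tail` has been chained behind `prev`.
 * With hardware CRC stripping, a tail descriptor that only carried the
 * 4 CRC bytes ends up with data_len == 0; drop it so that no
 * zero-length segment is ever handed to a Tx queue.
 */
static void
drop_empty_tail(struct seg *first, struct seg *prev, struct seg *tail)
{
	if (tail->data_len != 0)
		return;
	prev->next = NULL;  /* terminate the chain at the previous segment */
	first->nb_segs--;   /* one segment fewer; pkt_len is unaffected */
	free(tail);
}
```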


* Re: [PATCH 5/9] net/dpaa2: support Rx queue interrupts
From: David Marchand @ 2026-06-16  8:05 UTC
  To: Maxime Leroy; +Cc: hemant.agrawal, sachin.saxena, dev
In-Reply-To: <20260611154926.392670-6-maxime@leroys.fr>

On Thu, 11 Jun 2026 at 17:51, Maxime Leroy <maxime@leroys.fr> wrote:
> diff --git a/drivers/bus/fslmc/qbman/qbman_portal.c b/drivers/bus/fslmc/qbman/qbman_portal.c
> index 84853924e7..947415363a 100644
> --- a/drivers/bus/fslmc/qbman/qbman_portal.c
> +++ b/drivers/bus/fslmc/qbman/qbman_portal.c
> @@ -448,6 +448,7 @@ int qbman_swp_interrupt_get_inhibit(struct qbman_swp *p)
>         return qbman_cinh_read(&p->sys, QBMAN_CINH_SWP_IIR);
>  }
>
> +RTE_EXPORT_INTERNAL_SYMBOL(qbman_swp_interrupt_set_inhibit)
>  void qbman_swp_interrupt_set_inhibit(struct qbman_swp *p, int inhibit)

qbman_swp_interrupt_set_inhibit is not declared as __rte_internal.

>  {
>         qbman_cinh_write(&p->sys, QBMAN_CINH_SWP_IIR,


-- 
David Marchand



* Re: [PATCH dpdk v2 2/2] graph: replace circular buffer with priority-based bitmap
From: kirankumark @ 2026-06-16  8:03 UTC
  To: rjarry
  Cc: cfontain, david.marchand, dev, jerinj, kirankumark,
	konstantin.ananyev, maxime, ndabilpuram, vladimir.medvedkin,
	yanzhirun_163
In-Reply-To: <DJAAFGOSALXB.3BJ91NA2ES267@redhat.com>

> Hi,
>
> , Jun 16, 2026 at 08:30:
> > This will break the ABI. Please check and fix.
>
> Yes, it will break the ABI. There is no way around it. What did you mean
> by "check and fix"?
>

Looks like we cannot avoid the ABI break. Since graph is not an experimental library,
please send a deprecation notice; we need to wait for the next ABI-breaking release, i.e. 27.11.


* Re: [PATCH 2/9] eal/interrupts: keep real errno on epoll error
From: David Marchand @ 2026-06-16  8:02 UTC
  To: Maxime Leroy
  Cc: hemant.agrawal, sachin.saxena, dev, stable, Harman Kalra,
	Cunming Liang
In-Reply-To: <20260611154926.392670-3-maxime@leroys.fr>

On Thu, 11 Jun 2026 at 17:50, Maxime Leroy <maxime@leroys.fr> wrote:
>
> Some interrupt users have several vectors backed by the same eventfd
> (e.g. several Rx queues behind one DPAA2 portal eventfd). Adding the
> second vector to the same epoll instance then fails with EEXIST.
>
> Upper layers such as ethdev and bbdev already treat -EEXIST as a
> non-fatal duplicate registration (if (ret && ret != -EEXIST)), but
> rte_intr_rx_ctl() lost that information: rte_epoll_ctl() returned -1 and
> rte_intr_rx_ctl() flattened every failure to -EPERM.
>
> Return the negative errno from rte_epoll_ctl() (its documented contract
> is already "a negative value") and stop rte_intr_rx_ctl() from
> flattening errors to -EPERM, so EEXIST reaches the upper layers that
> already handle it; other failures carry their real errno.
>
> Fixes: 9efe9c6cdcac ("eal/linux: add epoll wrappers")
> Fixes: c9f3ec1a0f3f ("eal/linux: add Rx interrupt control function")
> Cc: stable@dpdk.org
> Signed-off-by: Maxime Leroy <maxime@leroys.fr>

Reviewed-by: David Marchand <david.marchand@redhat.com>

Nit: the eal/ prefix is only for OS-specific / arch-specific changes.
The title prefix should be interrupt:


-- 
David Marchand
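The behaviour this patch relies on can be reproduced with plain Linux epoll: registering the same eventfd twice in one epoll instance makes the second `EPOLL_CTL_ADD` fail with `EEXIST`. The sketch below uses a hypothetical wrapper, `epoll_ctl_neg_errno()` (not `rte_epoll_ctl()` itself), that returns `-errno`, plus a hypothetical caller that tolerates `-EEXIST` the same way the ethdev/bbdev check `if (ret && ret != -EEXIST)` does. Linux-only.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical wrapper: 0 on success, negative errno on failure. */
static int
epoll_ctl_neg_errno(int epfd, int op, int fd, struct epoll_event *ev)
{
	if (epoll_ctl(epfd, op, fd, ev) < 0)
		return -errno; /* keep the real cause instead of a flat -EPERM */
	return 0;
}

/*
 * Register one vector backed by fd. Several vectors may share one
 * eventfd, so a duplicate registration is not treated as an error.
 */
static int
register_vector(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
	int ret = epoll_ctl_neg_errno(epfd, EPOLL_CTL_ADD, fd, &ev);

	if (ret && ret != -EEXIST)
		return ret; /* genuine failure, real errno preserved */
	return 0;
}
```

If the wrapper flattened every failure to one value, the caller could no longer tell a harmless duplicate from a genuine error such as `-EBADF`.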



* Re: [PATCH dpdk v2 2/2] graph: replace circular buffer with priority-based bitmap
From: kirankumark @ 2026-06-16  7:57 UTC
  To: rjarry
  Cc: cfontain, david.marchand, dev, jerinj, kirankumark,
	konstantin.ananyev, maxime, ndabilpuram, vladimir.medvedkin,
	yanzhirun_163
In-Reply-To: <DJAAFGOSALXB.3BJ91NA2ES267@redhat.com>




* Re: [EXTERNAL] [PATCH 00/13] Bus cleanup infrastructure and fixes
From: David Marchand @ 2026-06-16  7:47 UTC
  To: Hemant Agrawal
  Cc: dev@dpdk.org, thomas@monjalon.net, stephen@networkplumber.org,
	bruce.richardson@intel.com, fengchengwen@huawei.com, Long Li
In-Reply-To: <CAJFAV8wL8K=gD1f7CAAuY8_7tz4CdR_hPn4j4xP3Nb2sfnrjqA@mail.gmail.com>

On Tue, 16 Jun 2026 at 08:55, David Marchand <david.marchand@redhat.com> wrote:
>
> On Tue, 16 Jun 2026 at 01:55, Long Li <longli@microsoft.com> wrote:
> >
> > >
> > > > This series refactors the bus cleanup infrastructure to reduce code
> > > > duplication and fix resource leaks in several bus drivers.
> > > > It should address the leak Thomas pointed at.
> > > >
> > > > The first part of the series (patches 1-8) addresses several bugs and
> > > > inconsistencies:
> > > > - Documentation and log message inconsistencies from earlier bus
> > > >   refactoring
> > > > - Device list management issues in dma/idxd and bus/vdev
> > > > - Resource leaks in PCI and VMBUS bus cleanup (mappings and
> > > > interrupts)
> > > > - Simplified device freeing in NXP buses (DPAA and FSLMC)
> > > > - Deferred interrupt allocation to probe time (NXP buses, VMBUS)
> > > >
> > > > The core infrastructure changes (patches 9-10) introduce the generic
> > > > cleanup framework:
> > > > - Refactors unplug operations to be the counterpart of probe_device
> > > > - Implements rte_bus_generic_cleanup() to centralize cleanup logic
> > > > - Adds .free_device operation to struct rte_bus
> > > > - Adds compile-time verification that rte_device is at offset 0
> > > >
> > > > The final patches (11-13) convert remaining buses to use the generic
> > > > cleanup helper:
> > > > - DPAA bus: add unplug support
> > > > - VMBUS bus: switch to embedded device name and add unplug support
> > >
> > > There is a hang on vmbus during device shutdown after applying the series; I'm
> > > looking into it.
> >
> > Turned out to be a test issue. Please see my comments on patch 08, the patch set tested well after that fix.
>
> Thanks a lot for testing!
>
> I'll fix this regression in the next revision.

FYI Hemant, this series has a similar regression for the dpaa/fslmc buses
(the interrupt handle is allocated too late in the device probing flow).
The implications seem greater than for vmbus, though, as I am now
finding bugs on the cleanup side (interrupt eventfds are never closed,
for example).

I'll think about how to fix it in the next revision; one option may be
to leave dpaa/fslmc alone?
But in the long run, all bus drivers should behave consistently.

I'll get back in this thread once I have a better view of the situation.


-- 
David Marchand
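The outline of the series (an unplug that undoes probe, a generic cleanup helper, a per-bus `.free_device` op) can be pictured as a single loop that every bus shares, which is also where probe-time resources such as interrupt eventfds would get released. All names below (`struct bus_sketch`, `bus_generic_cleanup()`, the example callbacks) are hypothetical; this is not the `rte_bus_generic_cleanup()` from the series, whose code is not shown in this thread.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical device; real buses embed struct rte_device instead. */
struct dev_sketch {
	struct dev_sketch *next;
	bool probed;
	int intr_fd; /* e.g. an interrupt eventfd, -1 if none */
};

/* Hypothetical bus ops, loosely modeled on the unplug/free_device split. */
struct bus_sketch {
	struct dev_sketch *devices;
	int (*unplug)(struct dev_sketch *dev);       /* undo probe (optional) */
	void (*free_device)(struct dev_sketch *dev); /* release bus memory */
};

/* One cleanup loop shared by every bus instead of per-bus copies. */
static int
bus_generic_cleanup(struct bus_sketch *bus)
{
	struct dev_sketch *dev = bus->devices;
	int ret = 0;

	while (dev != NULL) {
		struct dev_sketch *next = dev->next;

		if (dev->probed && bus->unplug != NULL && bus->unplug(dev) != 0)
			ret = -1; /* keep cleaning up, but report the failure */
		bus->free_device(dev);
		dev = next;
	}
	bus->devices = NULL;
	return ret;
}

/* Example callbacks; the counters only make the flow observable. */
static int nb_unplugged, nb_freed;

static int
example_unplug(struct dev_sketch *dev)
{
	if (dev->intr_fd >= 0) { /* release probe-time resources */
		close(dev->intr_fd);
		dev->intr_fd = -1;
	}
	dev->probed = false;
	nb_unplugged++;
	return 0;
}

static void
example_free_device(struct dev_sketch *dev)
{
	nb_freed++;
	free(dev);
}
```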



* Re: [PATCH dpdk v2 2/2] graph: replace circular buffer with priority-based bitmap
From: Robin Jarry @ 2026-06-16  7:16 UTC
  To: kirankumark
  Cc: cfontain, david.marchand, dev, jerinj, konstantin.ananyev, maxime,
	ndabilpuram, vladimir.medvedkin, yanzhirun_163
In-Reply-To: <20260616063026.2007191-1-kirankumark@marvell.com>

Hi,

, Jun 16, 2026 at 08:30:
> This will break the ABI. Please check and fix.

Yes, it will break the ABI. There is no way around it. What did you mean
by "check and fix"?



* Re: [EXTERNAL] [PATCH 00/13] Bus cleanup infrastructure and fixes
From: David Marchand @ 2026-06-16  6:55 UTC
  To: Long Li
  Cc: dev@dpdk.org, thomas@monjalon.net, stephen@networkplumber.org,
	bruce.richardson@intel.com, fengchengwen@huawei.com
In-Reply-To: <SA1PR21MB6683AEB004DF96D48ACDAD51CEE62@SA1PR21MB6683.namprd21.prod.outlook.com>

On Tue, 16 Jun 2026 at 01:55, Long Li <longli@microsoft.com> wrote:
>
> >
> > > This series refactors the bus cleanup infrastructure to reduce code
> > > duplication and fix resource leaks in several bus drivers.
> > > It should address the leak Thomas pointed at.
> > >
> > > The first part of the series (patches 1-8) addresses several bugs and
> > > inconsistencies:
> > > - Documentation and log message inconsistencies from earlier bus
> > >   refactoring
> > > - Device list management issues in dma/idxd and bus/vdev
> > > - Resource leaks in PCI and VMBUS bus cleanup (mappings and
> > > interrupts)
> > > - Simplified device freeing in NXP buses (DPAA and FSLMC)
> > > - Deferred interrupt allocation to probe time (NXP buses, VMBUS)
> > >
> > > The core infrastructure changes (patches 9-10) introduce the generic
> > > cleanup framework:
> > > - Refactors unplug operations to be the counterpart of probe_device
> > > - Implements rte_bus_generic_cleanup() to centralize cleanup logic
> > > - Adds .free_device operation to struct rte_bus
> > > - Adds compile-time verification that rte_device is at offset 0
> > >
> > > The final patches (11-13) convert remaining buses to use the generic
> > > cleanup helper:
> > > - DPAA bus: add unplug support
> > > - VMBUS bus: switch to embedded device name and add unplug support
> >
> > There is a hang on vmbus during device shutdown after applying the series; I'm
> > looking into it.
>
> Turned out to be a test issue. Please see my comments on patch 08, the patch set tested well after that fix.

Thanks a lot for testing!

I'll fix this regression in the next revision.


-- 
David Marchand
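The "compile-time verification that rte_device is at offset 0" item matters because generic code can then convert a generic device pointer back into the bus-specific device with a plain cast. A minimal C11 sketch of such a check, with hypothetical `struct generic_device` / `struct my_bus_device` types standing in for `rte_device` and a bus device; the series may express it differently (for example with DPDK's `RTE_BUILD_BUG_ON()`).

```c
#include <assert.h> /* static_assert (C11) */
#include <stddef.h> /* offsetof */

/* Hypothetical stand-ins for struct rte_device and a bus device. */
struct generic_device {
	const char *name;
};

struct my_bus_device {
	struct generic_device device; /* must remain the first member */
	int bus_specific_id;
};

/* Breaks the build if a field is ever added before `device`. */
static_assert(offsetof(struct my_bus_device, device) == 0,
	      "generic device must be at offset 0");

/* A plain cast is only valid because of the assertion above. */
static inline struct my_bus_device *
to_bus_device(struct generic_device *dev)
{
	return (struct my_bus_device *)dev;
}
```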



* Deep Dive into DPDK NIC RX/TX Flow
From: 胡兴菊 @ 2026-06-16  6:44 UTC
  To: dev

Hello DPDK team,

I'd like to submit the following technical content for consideration in the DPDK newsletter.

Title: DPDK网卡收发报文流程深度解析 (Deep Dive into DPDK NIC RX/TX Flow)
Link: https://github.com/huxingju/dpdk-source-analyzer/blob/main/NIC_RX_TX_FLOW.md

Type: Technical deep dive / Developer blog

Description:
This article provides a comprehensive analysis of DPDK's RX/TX data path, covering:
- Hardware architecture: RX/TX FIFO, DMA engine, descriptor ring
- Software layer: rte_mbuf structure, driver implementation (ixgbe as an example)
- Relationship between hardware descriptors, the SW ring, and mbufs
- Performance optimization techniques: batching, prefetching, NUMA awareness

**About the creation process:**
This article was human-directed and AI-assisted. As a developer with 8+ years of network development experience, I:
- Defined the scope, structure, and key technical points
- Guided the AI to generate specific sections
- Reviewed, corrected, and validated all technical content

The final result reflects my expertise and has been verified for technical accuracy.

Best regards,
Huxingju
GitHub: github.com/huxingju


* [PATCH dpdk v2 2/2] graph: replace circular buffer with priority-based bitmap
From: kirankumark @ 2026-06-16  6:30 UTC
  To: rjarry
  Cc: cfontain, david.marchand, dev, jerinj, kirankumark,
	konstantin.ananyev, maxime, ndabilpuram, vladimir.medvedkin,
	yanzhirun_163
In-Reply-To: <20260519213822.735891-3-rjarry@redhat.com>

This will break the ABI. Please check and fix.


>  	rte_node_process_t process; /**< Node process function. */
>  	rte_node_init_t init;       /**< Node init function. */
>  	rte_node_fini_t fini;       /**< Node fini function. */
> diff --git a/lib/graph/rte_graph_model_mcore_dispatch.h b/lib/graph/rte_graph_model_mcore_dispatch.h
> index f9ff3daa88ec..50a473564b56 100644
> --- a/lib/graph/rte_graph_model_mcore_dispatch.h
> +++ b/lib/graph/rte_graph_model_mcore_dispatch.h
> @@ -77,9 +77,13 @@ int rte_graph_model_mcore_dispatch_node_lcore_affinity_set(const char *name,
>  							   unsigned int lcore_id);
>
>  /**
> - * Perform graph walk on the circular buffer and invoke the process function
> + * Perform graph walk on the pending bitmap and invoke the process function
>   * of the nodes and collect the stats.
>   *
> + * Nodes are visited in scheduling order (lowest priority value first).
> + * Source nodes are seeded into the pending bitmap at the start of each walk.
> + * Nodes with different lcore affinity are dispatched to their target lcore.
> + *
>   * @param graph
>   *   Graph pointer returned from rte_graph_lookup function.
>   *
> @@ -88,20 +92,28 @@ int rte_graph_model_mcore_dispatch_node_lcore_affinity_set(const char *name,
>  static inline void
>  rte_graph_walk_mcore_dispatch(struct rte_graph *graph)
>  {
> -	const rte_graph_off_t *cir_start = graph->cir_start;
> -	const rte_node_t mask = graph->cir_mask;
> -	uint32_t head = graph->head;
> +	const uint16_t nwords = graph->nb_sched_words;
>  	struct rte_node *node;
> +	uint16_t word, bit;
>
>  	if (graph->dispatch.wq != NULL)
>  		__rte_graph_mcore_dispatch_sched_wq_process(graph);
>
> -	while (likely(head != graph->tail)) {
> -		node = (struct rte_node *)RTE_PTR_ADD(graph, cir_start[(int32_t)head++]);
> +	/* Seed pending bitmap with source nodes bound to this lcore */
> +	for (word = 0; word < nwords; word++)
> +		graph->pending[word] |= graph->src_pending[word];
>
> -		/* skip the src nodes which not bind with current worker */
> -		if ((int32_t)head < 1 && node->dispatch.lcore_id != graph->dispatch.lcore_id)
> -			continue;
> +	for (;;) {
> +		/* find first word with any pending bit */
> +		for (word = 0; word < nwords; word++)
> +			if (graph->pending[word])
> +				break;
> +		if (word == nwords)
> +			break; /* no more pending nodes */
> +
> +		bit = rte_ctz64(graph->pending[word]);
> +		graph->pending[word] &= ~(1ULL << bit);
> +		node = __rte_graph_pending_node(graph, word, bit);
>
>  		/* Schedule the node until all task/objs are done */
>  		if (node->dispatch.lcore_id != RTE_MAX_LCORE &&
> @@ -111,11 +123,7 @@ rte_graph_walk_mcore_dispatch(struct rte_graph *graph)
>  			continue;
>
>  		__rte_node_process(graph, node);
> -
> -		head = likely((int32_t)head > 0) ? head & mask : head;
>  	}
> -
> -	graph->tail = 0;
>  }
>
>  #ifdef __cplusplus
> diff --git a/lib/graph/rte_graph_model_rtc.h b/lib/graph/rte_graph_model_rtc.h
> index 4b6236e301e3..38feb3e1ca88 100644
> --- a/lib/graph/rte_graph_model_rtc.h
> +++ b/lib/graph/rte_graph_model_rtc.h
> @@ -6,9 +6,12 @@
>  #include "rte_graph_worker_common.h"
>
>  /**
> - * Perform graph walk on the circular buffer and invoke the process function
> + * Perform graph walk on the pending bitmap and invoke the process function
>   * of the nodes and collect the stats.
>   *
> + * Nodes are visited in scheduling order (lowest priority value first).
> + * Source nodes are seeded into the pending bitmap at the start of each walk.
> + *
>   * @param graph
>   *   Graph pointer returned from rte_graph_lookup function.
>   *
> @@ -17,30 +20,52 @@
>  static inline void
>  rte_graph_walk_rtc(struct rte_graph *graph)
>  {
> -	const rte_graph_off_t *cir_start = graph->cir_start;
> -	const rte_node_t mask = graph->cir_mask;
> -	uint32_t head = graph->head;
> +	const uint16_t nwords = graph->nb_sched_words;
>  	struct rte_node *node;
> +	uint16_t word, bit;
>
>  	/*
> -	 * Walk on the source node(s) ((cir_start - head) -> cir_start) and then
> -	 * on the pending streams (cir_start -> (cir_start + mask) -> cir_start)
> -	 * in a circular buffer fashion.
> +	 * Nodes are assigned a bit position (sched_idx) sorted by (priority,
> +	 * node_id) at graph creation time. Source nodes are forced to INT16_MIN
> +	 * priority so they always come first.
>  	 *
> -	 *	+-----+ <= cir_start - head [number of source nodes]
> -	 *	|     |
> -	 *	| ... | <= source nodes
> -	 *	|     |
> -	 *	+-----+ <= cir_start [head = 0] [tail = 0]
> -	 *	|     |
> -	 *	| ... | <= pending streams
> -	 *	|     |
> -	 *	+-----+ <= cir_start + mask
> +	 * sched_table[] maps bit positions to node offsets:
> +	 *
> +	 *   pending[]         sched_table[]
> +	 *   +----------+      +------------------+
> +	 *   | word 0   | ---> | src_node_0       | bit 0 (prio=INT16_MIN)
> +	 *   | 1100...1 |      | src_node_1       | bit 1 (prio=INT16_MIN)
> +	 *   |          |      | mpls_input       | bit 2 (prio=-10)
> +	 *   |          |      | ipv4_input       | bit 3 (prio=0)
> +	 *   |          |      | ...              |
> +	 *   +----------+      +------------------+
> +	 *   | word 1   | ---> | ip4_rewrite      | bit 64 (prio=10)
> +	 *   | ...      |      | ...              |
> +	 *   +----------+      +------------------+
> +	 *
> +	 * Walk: for each word, find lowest set bit (rte_ctz64), process that
> +	 * node, clear the bit, re-read the word (processing may have set new
> +	 * bits), repeat.
> +	 *
> +	 * After each node is processed, restart scanning from word 0 since
> +	 * processing may set bits in any word, including earlier ones.
>  	 */
> -	while (likely(head != graph->tail)) {
> -		node = (struct rte_node *)RTE_PTR_ADD(graph, cir_start[(int32_t)head++]);
> +
> +	/* Seed pending bitmap with source nodes */
> +	for (word = 0; word < nwords; word++)
> +		graph->pending[word] |= graph->src_pending[word];
> +
> +	for (;;) {
> +		/* find first word with any pending bit */
> +		for (word = 0; word < nwords; word++)
> +			if (graph->pending[word])
> +				break;
> +		if (word == nwords)
> +			break; /* no more pending nodes */
> +
> +		bit = rte_ctz64(graph->pending[word]);
> +		graph->pending[word] &= ~(1ULL << bit);
> +		node = __rte_graph_pending_node(graph, word, bit);
>  		__rte_node_process(graph, node);
> -		head = likely((int32_t)head > 0) ? head & mask : head;
>  	}
> -	graph->tail = 0;
>  }
> diff --git a/lib/graph/rte_graph_worker.h b/lib/graph/rte_graph_worker.h
> index b0f952a82cc9..e513d7a655d9 100644
> --- a/lib/graph/rte_graph_worker.h
> +++ b/lib/graph/rte_graph_worker.h
> @@ -14,7 +14,7 @@ extern "C" {
>  #endif
>
>  /**
> - * Perform graph walk on the circular buffer and invoke the process function
> + * Perform graph walk on the pending bitmap and invoke the process function
>   * of the nodes and collect the stats.
>   *
>   * @param graph
> diff --git a/lib/graph/rte_graph_worker_common.h b/lib/graph/rte_graph_worker_common.h
> index 4ab53a533e4c..0e60486043d8 100644
> --- a/lib/graph/rte_graph_worker_common.h
> +++ b/lib/graph/rte_graph_worker_common.h
> @@ -49,15 +49,14 @@ SLIST_HEAD(rte_graph_rq_head, rte_graph);
>   */
>  struct __rte_cache_aligned rte_graph {
>  	/* Fast path area. */
> -	uint32_t tail;		     /**< Tail of circular buffer. */
> -	uint32_t head;		     /**< Head of circular buffer. */
> -	uint32_t cir_mask;	     /**< Circular buffer wrap around mask. */
>  	rte_node_t nb_nodes;	     /**< Number of nodes in the graph. */
> -	rte_graph_off_t *cir_start;  /**< Pointer to circular buffer. */
>  	rte_graph_off_t nodes_start; /**< Offset at which node memory starts. */
> +	rte_graph_off_t *sched_table; /**< Node offset indexed by sched_idx. */
> +	uint64_t *pending;	     /**< Bitmap of pending nodes. */
> +	uint64_t *src_pending;	     /**< Bitmap of source nodes (constant). */
> +	uint16_t nb_sched_words;     /**< Number of uint64_t words in pending bitmaps. */
>  	uint8_t model;		     /**< graph model */
> -	uint8_t reserved1;	     /**< Reserved for future use. */
> -	uint16_t reserved2;	     /**< Reserved for future use. */
> +	/* 26 bytes padding */
>  	union {
>  		/* Fast schedule area for mcore dispatch model */
>  		struct {
> @@ -98,6 +97,7 @@ struct __rte_cache_aligned rte_node {
>  	rte_node_t id;		/**< Node identifier. */
>  	rte_node_t parent_id;	/**< Parent Node identifier. */
>  	rte_edge_t nb_edges;	/**< Number of edges from this node. */
> +	uint16_t sched_idx;	/**< Bit position in pending bitmap. */
>  	uint32_t realloc_count;	/**< Number of times realloced. */
>
>  	char parent[RTE_NODE_NAMESIZE];	/**< Parent node name. */
> @@ -132,7 +132,7 @@ struct __rte_cache_aligned rte_node {
>  		}; /**< Node Context. */
>  		uint16_t size;		/**< Total number of objects available. */
>  		uint16_t idx;		/**< Number of objects used. */
> -		rte_graph_off_t off;	/**< Offset of node in the graph reel. */
> +		rte_graph_off_t off;	/**< Offset of node in the graph memory. */
>  		uint64_t total_cycles;	/**< Cycles spent in this node. */
>  		uint64_t total_calls;	/**< Calls done to this node. */
>  		uint64_t total_objs;	/**< Objects processed by this node. */
> @@ -187,12 +187,12 @@ void __rte_node_stream_alloc_size(struct rte_graph *graph,
>  /**
>   * @internal
>   *
> - * Enqueue a given node to the tail of the graph reel.
> + * Process a node's pending objects and collect stats.
>   *
>   * @param graph
>   *   Pointer Graph object.
>   * @param node
> - *   Pointer to node object to be enqueued.
> + *   Pointer to node object to be processed.
>   */
>  static __rte_always_inline void
>  __rte_node_process(struct rte_graph *graph, struct rte_node *node)
> @@ -220,21 +220,42 @@ __rte_node_process(struct rte_graph *graph, struct rte_node *node)
>  /**
>   * @internal
>   *
> - * Enqueue a given node to the tail of the graph reel.
> + * Get a pointer to a node from the scheduling table.
>   *
>   * @param graph
>   *   Pointer Graph object.
> + * @param word
> + *   Offset in the pending bitmap.
> + * @param bit
> + *   Bit number.
> + *
> + * @return
> + *   Pointer to the node.
> + */
> +static __rte_always_inline struct rte_node *
> +__rte_graph_pending_node(struct rte_graph *graph, uint16_t word, uint16_t bit)
> +{
> +	const uint16_t index = (word * sizeof(*graph->pending) * CHAR_BIT) + bit;
> +	const rte_graph_off_t node_offset = graph->sched_table[index];
> +	return RTE_PTR_ADD(graph, node_offset);
> +}
> +
> +/**
> + * @internal
> + *
> + * Mark a node as pending in the graph scheduling bitmap.
> + *
> + * @param bitmap
> + *   Either graph->pending or graph->src_pending.
>   * @param node
> - *   Pointer to node object to be enqueued.
> + *   Pointer to node object to be marked pending.
>   */
>  static __rte_always_inline void
> -__rte_node_enqueue_tail_update(struct rte_graph *graph, struct rte_node *node)
> +__rte_node_pending_set(uint64_t *bitmap, struct rte_node *node)
>  {
> -	uint32_t tail;
> -
> -	tail = graph->tail;
> -	graph->cir_start[tail++] = node->off;
> -	graph->tail = tail & graph->cir_mask;
> +	const uint16_t word = node->sched_idx / (sizeof(*bitmap) * CHAR_BIT);
> +	const uint16_t bit = node->sched_idx % (sizeof(*bitmap) * CHAR_BIT);
> +	bitmap[word] |= 1ULL << bit;
>  }
>
>  /**
> @@ -242,8 +263,8 @@ __rte_node_enqueue_tail_update(struct rte_graph *graph, struct rte_node *node)
>   *
>   * Enqueue sequence prologue function.
>   *
> - * Updates the node to tail of graph reel and resizes the number of objects
> - * available in the stream as needed.
> + * Marks the node as pending in the scheduling bitmap and resizes the number
> + * of objects available in the stream as needed.
>   *
>   * @param graph
>   *   Pointer to the graph object.
> @@ -259,9 +280,8 @@ __rte_node_enqueue_prologue(struct rte_graph *graph, struct rte_node *node,
>  			    const uint16_t idx, const uint16_t space)
>  {
>
> -	/* Add to the pending stream list if the node is new */
>  	if (idx == 0)
> -		__rte_node_enqueue_tail_update(graph, node);
> +		__rte_node_pending_set(graph->pending, node);
>
>  	if (unlikely(node->size < (idx + space)))
>  		__rte_node_stream_alloc_size(graph, node, node->size + space);
> @@ -293,7 +313,7 @@ __rte_node_next_node_get(struct rte_node *node, rte_edge_t next)
>
>  /**
>   * Enqueue the objs to next node for further processing and set
> - * the next node to pending state in the circular buffer.
> + * the next node to pending state in the scheduling bitmap.
>   *
>   * @param graph
>   *   Graph pointer returned from rte_graph_lookup().
> @@ -321,7 +341,7 @@ rte_node_enqueue(struct rte_graph *graph, struct rte_node *node,
>
>  /**
>   * Enqueue only one obj to next node for further processing and
> - * set the next node to pending state in the circular buffer.
> + * set the next node to pending state in the scheduling bitmap.
>   *
>   * @param graph
>   *   Graph pointer returned from rte_graph_lookup().
> @@ -347,7 +367,7 @@ rte_node_enqueue_x1(struct rte_graph *graph, struct rte_node *node,
>
>  /**
>   * Enqueue only two objs to next node for further processing and
> - * set the next node to pending state in the circular buffer.
> + * set the next node to pending state in the scheduling bitmap.
>   * Same as rte_node_enqueue_x1 but enqueue two objs.
>   *
>   * @param graph
> @@ -377,7 +397,7 @@ rte_node_enqueue_x2(struct rte_graph *graph, struct rte_node *node,
>
>  /**
>   * Enqueue only four objs to next node for further processing and
> - * set the next node to pending state in the circular buffer.
> + * set the next node to pending state in the scheduling bitmap.
>   * Same as rte_node_enqueue_x1 but enqueue four objs.
>   *
>   * @param graph
> @@ -414,7 +434,7 @@ rte_node_enqueue_x4(struct rte_graph *graph, struct rte_node *node,
>
>  /**
>   * Enqueue objs to multiple next nodes for further processing and
> - * set the next nodes to pending state in the circular buffer.
> + * set the next nodes to pending state in the scheduling bitmap.
>   * objs[i] will be enqueued to nexts[i].
>   *
>   * @param graph
> @@ -472,7 +492,7 @@ rte_node_next_stream_get(struct rte_graph *graph, struct rte_node *node,
>  }
>
>  /**
> - * Put the next stream to pending state in the circular buffer
> + * Put the next stream to pending state in the scheduling bitmap
>   * for further processing. Should be invoked after rte_node_next_stream_get().
>   *
>   * @param graph
> @@ -496,8 +516,7 @@ rte_node_next_stream_put(struct rte_graph *graph, struct rte_node *node,
>
>  	node = __rte_node_next_node_get(node, next);
>  	if (node->idx == 0)
> -		__rte_node_enqueue_tail_update(graph, node);
> -
> +		__rte_node_pending_set(graph->pending, node);
>  	node->idx += idx;
>  }
>
> @@ -530,7 +549,7 @@ rte_node_next_stream_move(struct rte_graph *graph, struct rte_node *src,
>  		src->objs = dobjs;
>  		src->size = dsz;
>  		dst->idx = src->idx;
> -		__rte_node_enqueue_tail_update(graph, dst);
> +		__rte_node_pending_set(graph->pending, dst);
>  	} else { /* Move the objects from src node to dst node */
>  		rte_node_enqueue(graph, src, next, src->objs, src->idx);
>  	}
> --
> 2.54.0
>
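The hunks above replace the circular-buffer tail update with `__rte_node_pending_set(graph->pending, node)`. This excerpt does not include the code that defines the bitmap, so what follows is only a hypothetical sketch of how a pending bitmap of node IDs can behave. Setting a bit is idempotent, and a walker visits each pending node once by repeatedly taking and clearing the lowest set bit. The names `pending_bitmap`, `pending_set()` and `pending_drain()` and the 128-node limit are assumptions. `__builtin_ctzll()` is a GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size pending set; 128 nodes is an arbitrary limit. */
#define SKETCH_MAX_NODES 128
#define SKETCH_WORDS (SKETCH_MAX_NODES / 64)

struct pending_bitmap {
	uint64_t words[SKETCH_WORDS];
};

/* Mark a node as pending. Setting an already-set bit changes nothing. */
static void pending_set(struct pending_bitmap *bm, unsigned int node_id)
{
	bm->words[node_id / 64] |= UINT64_C(1) << (node_id % 64);
}

/*
 * Collect up to 'max' pending node IDs in ascending order into 'out',
 * clearing each bit as it is taken. Returns the number collected.
 */
static size_t pending_drain(struct pending_bitmap *bm, unsigned int *out, size_t max)
{
	size_t n = 0;

	for (size_t w = 0; w < SKETCH_WORDS && n < max; w++) {
		while (bm->words[w] != 0 && n < max) {
			unsigned int bit = (unsigned int)__builtin_ctzll(bm->words[w]);

			bm->words[w] &= bm->words[w] - 1; /* clear lowest set bit */
			out[n++] = (unsigned int)(w * 64 + bit);
		}
	}
	return n;
}
```

Unlike a ring of node pointers, the same node cannot be queued twice in this model. The `node->idx == 0` guard kept by the patch would therefore only skip a redundant store here.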

^ permalink raw reply

* Re: Question regarding duplicate fragment handling in DPDK IP reassembly library
From: Samyak Jain @ 2026-06-16  5:59 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev@dpdk.org, Vikash kumar, Ankur Bharadwaj, Rajan Goel
In-Reply-To: <20260615103752.6155db47@phoenix.local>


[-- Attachment #1.1: Type: text/plain, Size: 1626 bytes --]

Can we fix it?

Thanks & Regards,
Samyak Jain
Software Engineer

________________________________
From: Stephen Hemminger <stephen@networkplumber.org>
Sent: Monday, June 15, 2026 11:07 PM
To: Samyak Jain <samyak.jain@amantyatech.com>
Cc: dev@dpdk.org <dev@dpdk.org>; Vikash kumar <vikash.kumar@amantyatech.com>; Ankur Bharadwaj <ankur.bharadwaj@amantyatech.com>
Subject: Re: Question regarding duplicate fragment handling in DPDK IP reassembly library



On Mon, 15 Jun 2026 12:39:48 +0000
Samyak Jain <samyak.jain@amantyatech.com> wrote:

> Hi DPDK Community,
>
> I am using DPDK 25.11 and evaluating the IP reassembly library
> (librte_ip_frag).
>
> During testing, I observed that duplicate fragments appear to cause reassembly failure and the fragment context gets invalidated.
>
> I would like to know:
>
> 1. Is duplicate fragment handling intentionally unsupported in
>    rte_ipv4_frag_reassemble_packet() / rte_ipv6_frag_reassemble_packet()?
>
> 2. Has there been any upstream discussion or patch to support
>    duplicate fragments while still rejecting conflicting
>    fragments?
>
> 3. Are there any recommended approaches for applications that need
>    Linux-like duplicate fragment tolerance?
>
> Any guidance would be appreciated.
>
> Thanks & Regards,
> Samyak Jain
>

Short answer: yes it is buggy, no it shouldn't be.
Looking into it but not a simple answer
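Stephen's answer leaves the fix open. The following is offered only as an illustration of question 2. It is a hypothetical classifier that tolerates exact duplicates while still rejecting overlaps, and it assumes the reassembly context keeps the half-open byte ranges it already holds. It is not the librte_ip_frag API. `frag_range`, `classify_fragment()` and the verdict names are invented. The policy is also an assumption: same offset and length counts as a duplicate, and any other overlap is a conflict. The sketch compares only offsets and lengths, not payload bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of a fragment already held: bytes [start, end). */
struct frag_range {
	uint32_t start;
	uint32_t end;
};

enum frag_verdict {
	FRAG_NEW,       /* no overlap: insert it */
	FRAG_DUP,       /* exact repeat: drop the mbuf, keep the context */
	FRAG_CONFLICT,  /* partial overlap: discard the whole datagram */
};

/*
 * Classify fragment [start, end) against the n non-overlapping ranges
 * already held for the same datagram.
 */
static enum frag_verdict classify_fragment(const struct frag_range *held, size_t n,
					   uint32_t start, uint32_t end)
{
	for (size_t i = 0; i < n; i++) {
		if (start == held[i].start && end == held[i].end)
			return FRAG_DUP;
		if (start < held[i].end && held[i].start < end)
			return FRAG_CONFLICT;
	}
	return FRAG_NEW;
}
```

A caller would free the mbuf and keep the context on `FRAG_DUP`, and drop the whole datagram on `FRAG_CONFLICT`.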

[-- Attachment #1.2: Type: text/html, Size: 3464 bytes --]

[-- Attachment #2: image.png --]
[-- Type: image/png, Size: 3798 bytes --]

^ permalink raw reply

* [PATCH v3 2/2] net/cnxk: add FEC get set and capability ops
From: Rakesh Kudurumalla @ 2026-06-16  4:35 UTC (permalink / raw)
  To: Nithin Dabilpuram, Kiran Kumar K, Sunil Kumar Kori, Satha Rao,
	Harman Kalra
  Cc: dev, jerinj, Rakesh Kudurumalla
In-Reply-To: <20260616043536.4034946-1-rkudurumalla@marvell.com>

Add ethdev FEC operations for cnxk NIX driver:
- fec_get_capability: Report supported FEC modes per speed.
  If firmware provides supported FEC info, return actual
  capabilities for current link speed. Otherwise, fall back
  to a default capability table for common speeds.
- fec_get: Query current FEC mode from link info
- fec_set: Configure FEC mode on the link. AUTO mode
  defaults to Reed-Solomon FEC.

Signed-off-by: Rakesh Kudurumalla <rkudurumalla@marvell.com>
---
 doc/guides/nics/cnxk.rst           | 45 ++++++++++++++
 doc/guides/nics/features/cnxk.ini  |  1 +
 drivers/net/cnxk/cnxk_ethdev.c     |  3 +
 drivers/net/cnxk/cnxk_ethdev.h     |  6 ++
 drivers/net/cnxk/cnxk_ethdev_ops.c | 94 ++++++++++++++++++++++++++++++
 5 files changed, 149 insertions(+)

diff --git a/doc/guides/nics/cnxk.rst b/doc/guides/nics/cnxk.rst
index b5bd50ceea..0891767f83 100644
--- a/doc/guides/nics/cnxk.rst
+++ b/doc/guides/nics/cnxk.rst
@@ -29,6 +29,7 @@ Features of the CNXK Ethdev PMD are:
 - Port hardware statistics
 - Link state information
 - Link flow control
+- Forward Error Correction (FEC)
 - MTU update
 - Scatter-Gather IO support
 - Vector Poll mode driver
@@ -513,6 +514,50 @@ Runtime Config Options
    parameters to all the PCIe devices if application requires to configure on
    all the ethdev ports.
 
+Forward Error Correction (FEC)
+------------------------------
+
+The CNXK PMD supports the DPDK FEC ethdev APIs on physical function (PF) ports
+for links where firmware reports FEC support (typically high-speed Ethernet
+interfaces such as 25G, 50G and 100G).
+
+Supported FEC modes exposed through the ethdev API are:
+
+- ``RTE_ETH_FEC_NOFEC``: FEC disabled
+- ``RTE_ETH_FEC_AUTO``: maps to Reed-Solomon (RS) FEC on set
+- ``RTE_ETH_FEC_BASER``: Base-R FEC
+- ``RTE_ETH_FEC_RS``: Reed-Solomon FEC
+
+``rte_eth_fec_get_capability()`` reports the FEC modes supported by firmware for
+the current link speed. ``rte_eth_fec_get()`` returns the active FEC mode from
+link information. ``rte_eth_fec_set()`` configures the FEC mode on the link.
+
+.. note::
+
+   ``rte_eth_fec_get_capability()`` and ``rte_eth_fec_set()`` are supported on
+   PF ports only. SR-IOV virtual function (VF) ports can use
+   ``rte_eth_fec_get()`` to read the current FEC mode from link status.
+
+Example usage:
+
+.. code-block:: c
+
+   struct rte_eth_fec_capa capa[1];
+   uint32_t fec_capa;
+   int num, ret;
+
+   num = rte_eth_fec_get_capability(port_id, capa, RTE_DIM(capa));
+   if (num > 0)
+       printf("FEC capa 0x%x at speed %u\n", capa[0].capa, capa[0].speed);
+
+   ret = rte_eth_fec_get(port_id, &fec_capa);
+   if (ret == 0)
+       printf("Current FEC capa 0x%x\n", fec_capa);
+
+   ret = rte_eth_fec_set(port_id, RTE_ETH_FEC_MODE_CAPA_MASK(RS));
+   if (ret)
+       printf("FEC set failed: %s\n", rte_strerror(-ret));
+
 Limitations
 -----------
 
diff --git a/doc/guides/nics/features/cnxk.ini b/doc/guides/nics/features/cnxk.ini
index 2de156c695..dc75947d86 100644
--- a/doc/guides/nics/features/cnxk.ini
+++ b/doc/guides/nics/features/cnxk.ini
@@ -31,6 +31,7 @@ Congestion management = Y
 Traffic manager      = Y
 Inline protocol      = Y
 Flow control         = Y
+FEC                  = Y
 Scattered Rx         = Y
 L3 checksum offload  = Y
 L4 checksum offload  = Y
diff --git a/drivers/net/cnxk/cnxk_ethdev.c b/drivers/net/cnxk/cnxk_ethdev.c
index 7ae16186c6..4c3d906e16 100644
--- a/drivers/net/cnxk/cnxk_ethdev.c
+++ b/drivers/net/cnxk/cnxk_ethdev.c
@@ -2138,6 +2138,9 @@ struct eth_dev_ops cnxk_eth_dev_ops = {
 	.cman_config_set = cnxk_nix_cman_config_set,
 	.cman_config_get = cnxk_nix_cman_config_get,
 	.eth_tx_descriptor_dump = cnxk_nix_tx_descriptor_dump,
+	.fec_get_capability = cnxk_nix_fec_get_capability,
+	.fec_get = cnxk_nix_fec_get,
+	.fec_set = cnxk_nix_fec_set,
 };
 
 void
diff --git a/drivers/net/cnxk/cnxk_ethdev.h b/drivers/net/cnxk/cnxk_ethdev.h
index ea6a2be30e..4a8fb1b974 100644
--- a/drivers/net/cnxk/cnxk_ethdev.h
+++ b/drivers/net/cnxk/cnxk_ethdev.h
@@ -664,6 +664,12 @@ int cnxk_nix_tm_mark_ip_dscp(struct rte_eth_dev *eth_dev, int mark_green,
 int cnxk_nix_tx_descriptor_dump(const struct rte_eth_dev *eth_dev, uint16_t qid, uint16_t offset,
 				uint16_t num, FILE *file);
 
+/* FEC */
+int cnxk_nix_fec_get_capability(struct rte_eth_dev *eth_dev,
+				struct rte_eth_fec_capa *speed_fec_capa, unsigned int num);
+int cnxk_nix_fec_get(struct rte_eth_dev *eth_dev, uint32_t *fec_capa);
+int cnxk_nix_fec_set(struct rte_eth_dev *eth_dev, uint32_t fec_capa);
+
 /* MTR */
 int cnxk_nix_mtr_ops_get(struct rte_eth_dev *dev, void *ops);
 
diff --git a/drivers/net/cnxk/cnxk_ethdev_ops.c b/drivers/net/cnxk/cnxk_ethdev_ops.c
index 460ffa32b6..0ea3d7e89f 100644
--- a/drivers/net/cnxk/cnxk_ethdev_ops.c
+++ b/drivers/net/cnxk/cnxk_ethdev_ops.c
@@ -1414,3 +1414,97 @@ cnxk_nix_tx_descriptor_dump(const struct rte_eth_dev *eth_dev, uint16_t qid, uin
 
 	return roc_nix_sq_desc_dump(nix, qid, offset, num, file);
 }
+
+static uint32_t
+cnxk_roc_fec_to_ethdev_capa(int roc_fec)
+{
+	switch (roc_fec) {
+	case ROC_FEC_BASER:
+		return RTE_ETH_FEC_MODE_CAPA_MASK(BASER);
+	case ROC_FEC_RS:
+		return RTE_ETH_FEC_MODE_CAPA_MASK(RS);
+	default:
+		return RTE_ETH_FEC_MODE_CAPA_MASK(NOFEC);
+	}
+}
+
+static int
+cnxk_ethdev_fec_to_roc(uint32_t fec_capa)
+{
+	if (fec_capa & RTE_ETH_FEC_MODE_CAPA_MASK(RS))
+		return ROC_FEC_RS;
+	if (fec_capa & RTE_ETH_FEC_MODE_CAPA_MASK(BASER))
+		return ROC_FEC_BASER;
+	return ROC_FEC_NONE;
+}
+
+static uint32_t
+cnxk_fec_capa_from_supported(uint64_t supported_fec)
+{
+	uint32_t capa = RTE_ETH_FEC_MODE_CAPA_MASK(NOFEC) | RTE_ETH_FEC_MODE_CAPA_MASK(AUTO);
+
+	if (supported_fec & (1ULL << ROC_FEC_BASER))
+		capa |= RTE_ETH_FEC_MODE_CAPA_MASK(BASER);
+	if (supported_fec & (1ULL << ROC_FEC_RS))
+		capa |= RTE_ETH_FEC_MODE_CAPA_MASK(RS);
+
+	return capa;
+}
+
+int
+cnxk_nix_fec_get_capability(struct rte_eth_dev *eth_dev, struct rte_eth_fec_capa *speed_fec_capa,
+			    unsigned int num)
+{
+	struct cnxk_eth_dev *dev = cnxk_eth_pmd_priv(eth_dev);
+	struct roc_nix *nix = &dev->nix;
+	struct roc_nix_link_info link_info;
+	uint64_t supported_fec = 0;
+	int rc;
+
+	rc = roc_nix_mac_fec_supported_get(nix, &supported_fec);
+	if (rc == 0 && supported_fec != 0) {
+		rc = roc_nix_mac_link_info_get(nix, &link_info);
+		if (rc)
+			return rc;
+
+		if (speed_fec_capa == NULL || num == 0)
+			return 1;
+
+		speed_fec_capa[0].speed = link_info.speed;
+		speed_fec_capa[0].capa = cnxk_fec_capa_from_supported(supported_fec);
+		return 1;
+	}
+
+	return rc;
+}
+
+int
+cnxk_nix_fec_get(struct rte_eth_dev *eth_dev, uint32_t *fec_capa)
+{
+	struct cnxk_eth_dev *dev = cnxk_eth_pmd_priv(eth_dev);
+	struct roc_nix *nix = &dev->nix;
+	struct roc_nix_link_info link_info;
+	int rc;
+
+	rc = roc_nix_mac_link_info_get(nix, &link_info);
+	if (rc)
+		return rc;
+
+	*fec_capa = cnxk_roc_fec_to_ethdev_capa(link_info.fec);
+	return 0;
+}
+
+int
+cnxk_nix_fec_set(struct rte_eth_dev *eth_dev, uint32_t fec_capa)
+{
+	struct cnxk_eth_dev *dev = cnxk_eth_pmd_priv(eth_dev);
+	struct roc_nix *nix = &dev->nix;
+	int roc_fec;
+
+	if (fec_capa & RTE_ETH_FEC_MODE_CAPA_MASK(AUTO))
+		roc_fec = ROC_FEC_RS;
+	else
+		roc_fec = cnxk_ethdev_fec_to_roc(fec_capa);
+
+	return roc_nix_mac_fec_set(nix, roc_fec);
+}
-- 
2.25.1
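The driver helpers above translate between the firmware's supported-FEC bitmap and the ethdev capability mask. The standalone sketch below mirrors that translation so it can be exercised without hardware. The `SKETCH_*` constants are local stand-ins. The real code uses `RTE_ETH_FEC_MODE_CAPA_MASK()` and the `ROC_FEC_*` values, and the numeric values used here are assumptions.

```c
#include <stdint.h>

/* Stand-ins for the ethdev FEC mode bit positions (values assumed). */
enum { SKETCH_FEC_NOFEC, SKETCH_FEC_AUTO, SKETCH_FEC_BASER, SKETCH_FEC_RS };
#define SKETCH_CAPA(mode) (UINT32_C(1) << (mode))

/* Stand-ins for the firmware (ROC) FEC codes (values assumed). */
enum { SKETCH_ROC_NONE, SKETCH_ROC_BASER, SKETCH_ROC_RS };

/* Like cnxk_fec_capa_from_supported(): NOFEC and AUTO are always advertised. */
static uint32_t sketch_capa_from_supported(uint64_t supported)
{
	uint32_t capa = SKETCH_CAPA(SKETCH_FEC_NOFEC) | SKETCH_CAPA(SKETCH_FEC_AUTO);

	if (supported & (UINT64_C(1) << SKETCH_ROC_BASER))
		capa |= SKETCH_CAPA(SKETCH_FEC_BASER);
	if (supported & (UINT64_C(1) << SKETCH_ROC_RS))
		capa |= SKETCH_CAPA(SKETCH_FEC_RS);
	return capa;
}

/* Like cnxk_nix_fec_set(): AUTO maps to RS, otherwise RS wins over BASER. */
static int sketch_request_to_roc(uint32_t fec_capa)
{
	if (fec_capa & SKETCH_CAPA(SKETCH_FEC_AUTO))
		return SKETCH_ROC_RS;
	if (fec_capa & SKETCH_CAPA(SKETCH_FEC_RS))
		return SKETCH_ROC_RS;
	if (fec_capa & SKETCH_CAPA(SKETCH_FEC_BASER))
		return SKETCH_ROC_BASER;
	return SKETCH_ROC_NONE;
}
```

The sketch makes one property of the patch easy to see: AUTO is advertised for every reported speed and always becomes RS, even when firmware reported only Base-R support.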


^ permalink raw reply related

* [PATCH v3 1/2] common/cnxk: add FEC configuration support
From: Rakesh Kudurumalla @ 2026-06-16  4:35 UTC (permalink / raw)
  To: Nithin Dabilpuram, Kiran Kumar K, Sunil Kumar Kori, Satha Rao,
	Harman Kalra
  Cc: dev, jerinj, Rakesh Kudurumalla
In-Reply-To: <20260416120031.3553798-2-rkudurumalla@marvell.com>

Add ROC APIs for Forward Error Correction (FEC) configuration:
- roc_nix_mac_fec_set: Set FEC mode on the link
- roc_nix_mac_fec_supported_get: Query supported FEC modes
  from firmware

These APIs use CGX mailbox messages to configure and query
FEC parameters on PF interfaces.

Signed-off-by: Rakesh Kudurumalla <rkudurumalla@marvell.com>
---
V3: Updated doc/guides/nics/cnxk.rst

 drivers/common/cnxk/roc_nix.h                 |  2 +
 drivers/common/cnxk/roc_nix_mac.c             | 51 +++++++++++++++++++
 .../common/cnxk/roc_platform_base_symbols.c   |  2 +
 3 files changed, 55 insertions(+)

diff --git a/drivers/common/cnxk/roc_nix.h b/drivers/common/cnxk/roc_nix.h
index 8ba8b3e0b6..6130e4c42b 100644
--- a/drivers/common/cnxk/roc_nix.h
+++ b/drivers/common/cnxk/roc_nix.h
@@ -975,6 +975,8 @@ int __roc_api roc_nix_mac_link_info_set(struct roc_nix *roc_nix,
 					struct roc_nix_link_info *link_info);
 int __roc_api roc_nix_mac_link_info_get(struct roc_nix *roc_nix,
 					struct roc_nix_link_info *link_info);
+int __roc_api roc_nix_mac_fec_set(struct roc_nix *roc_nix, int fec);
+int __roc_api roc_nix_mac_fec_supported_get(struct roc_nix *roc_nix, uint64_t *supported_fec);
 int __roc_api roc_nix_mac_mtu_set(struct roc_nix *roc_nix, uint16_t mtu);
 int __roc_api roc_nix_mac_max_rx_len_set(struct roc_nix *roc_nix,
 					 uint16_t maxlen);
diff --git a/drivers/common/cnxk/roc_nix_mac.c b/drivers/common/cnxk/roc_nix_mac.c
index 376ff48522..4f856677e0 100644
--- a/drivers/common/cnxk/roc_nix_mac.c
+++ b/drivers/common/cnxk/roc_nix_mac.c
@@ -257,6 +257,57 @@ roc_nix_mac_link_state_set(struct roc_nix *roc_nix, uint8_t up)
 	return rc;
 }
 
+int
+roc_nix_mac_fec_set(struct roc_nix *roc_nix, int fec)
+{
+	struct nix *nix = roc_nix_to_nix_priv(roc_nix);
+	struct dev *dev = &nix->dev;
+	struct mbox *mbox = mbox_get(dev->mbox);
+	struct fec_mode *req;
+	int rc = -ENOSPC;
+
+	if (roc_nix_is_vf_or_sdp(roc_nix)) {
+		rc = NIX_ERR_OP_NOTSUP;
+		goto exit;
+	}
+
+	req = mbox_alloc_msg_cgx_set_fec_param(mbox);
+	if (req == NULL)
+		goto exit;
+	req->fec = fec;
+
+	rc = mbox_process(mbox);
+exit:
+	mbox_put(mbox);
+	return rc;
+}
+
+int
+roc_nix_mac_fec_supported_get(struct roc_nix *roc_nix, uint64_t *supported_fec)
+{
+	struct nix *nix = roc_nix_to_nix_priv(roc_nix);
+	struct dev *dev = &nix->dev;
+	struct mbox *mbox = mbox_get(dev->mbox);
+	struct cgx_fw_data *rsp = NULL;
+	int rc;
+
+	if (roc_nix_is_vf_or_sdp(roc_nix)) {
+		rc = NIX_ERR_OP_NOTSUP;
+		goto exit;
+	}
+
+	mbox_alloc_msg_cgx_get_aux_link_info(mbox);
+	rc = mbox_process_msg(mbox, (void *)&rsp);
+	if (rc)
+		goto exit;
+
+	*supported_fec = rsp->fwdata.supported_fec;
+	rc = 0;
+exit:
+	mbox_put(mbox);
+	return rc;
+}
+
 int
 roc_nix_mac_link_info_set(struct roc_nix *roc_nix,
 			  struct roc_nix_link_info *link_info)
diff --git a/drivers/common/cnxk/roc_platform_base_symbols.c b/drivers/common/cnxk/roc_platform_base_symbols.c
index ed34d4b05b..f116f32cf4 100644
--- a/drivers/common/cnxk/roc_platform_base_symbols.c
+++ b/drivers/common/cnxk/roc_platform_base_symbols.c
@@ -307,6 +307,8 @@ RTE_EXPORT_INTERNAL_SYMBOL(roc_nix_mac_rxtx_start_stop)
 RTE_EXPORT_INTERNAL_SYMBOL(roc_nix_mac_link_event_start_stop)
 RTE_EXPORT_INTERNAL_SYMBOL(roc_nix_mac_loopback_enable)
 RTE_EXPORT_INTERNAL_SYMBOL(roc_nix_mac_addr_set)
+RTE_EXPORT_INTERNAL_SYMBOL(roc_nix_mac_fec_set)
+RTE_EXPORT_INTERNAL_SYMBOL(roc_nix_mac_fec_supported_get)
 RTE_EXPORT_INTERNAL_SYMBOL(roc_nix_mac_max_entries_get)
 RTE_EXPORT_INTERNAL_SYMBOL(roc_nix_mac_addr_add)
 RTE_EXPORT_INTERNAL_SYMBOL(roc_nix_mac_addr_del)
-- 
2.25.1


^ permalink raw reply related

* [DPDK/testpmd Bug 1957] [dpdk-26.07] ABI testing dpdk26.07rc1+dpdk25.11 shows error: "undefined symbol: rte_flow_dynf_metadata_offs, version EXPERIMENTAL"
From: bugzilla @ 2026-06-16  2:18 UTC (permalink / raw)
  To: dev

http://bugs.dpdk.org/show_bug.cgi?id=1957

            Bug ID: 1957
           Summary: [dpdk-26.07] ABI testing dpdk26.07rc1+dpdk25.11 shows
                    error: "undefined symbol: rte_flow_dynf_metadata_offs,
                    version EXPERIMENTAL"
           Product: DPDK
           Version: 26.07
          Hardware: All
                OS: All
            Status: UNCONFIRMED
          Severity: normal
          Priority: Normal
         Component: testpmd
          Assignee: dev@dpdk.org
          Reporter: yux.jiang@intel.com
  Target Milestone: ---

Environment
-----------
DPDK version:  
[DPDK 26.07rc1] 
commit c429b06df56788795f886eca748420e2248da784 (HEAD -> main, origin/main,
origin/HEAD)
Author: Thomas Monjalon <thomas@monjalon.net>
Date:   Thu Jun 11 04:27:32 2026 +0200    
version: 26.07-rc1    
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>

Steps to reproduce
------------------

1. Build the latest DPDK (26.07-rc1)
cd dpdk
rm -rf x86_64-native-linuxapp-gcc
CC=gcc meson -Dlibdir=lib -Dc_args=-DRTE_BUILD_SHARED_LIB --default-library=shared x86_64-native-linuxapp-gcc
ninja -C x86_64-native-linuxapp-gcc
rm -rf /root/tmp/dpdk_share_lib /root/shared_lib_dpdk
DESTDIR=/root/tmp/dpdk_share_lib ninja -C x86_64-native-linuxapp-gcc -j 110 install
mv /root/tmp/dpdk_share_lib/usr/local/lib /root/shared_lib_dpdk
cat /root/.bashrc | grep LD_LIBRARY_PATH
sed -i 's#export LD_LIBRARY_PATH=.*#export LD_LIBRARY_PATH=/root/shared_lib_dpdk#g' /root/.bashrc

2. Copy the LTS (25.11) dpdk_abi.tar.gz and build the LTS DPDK
tar zxf /tmp/dpdk_abi.tar.gz -C ~
cd ~/dpdk/
rm -rf x86_64-native-linuxapp-gcc
CC=gcc meson -Dlibdir=lib -Dc_args=-DRTE_BUILD_SHARED_LIB --default-library=shared x86_64-native-linuxapp-gcc
ninja -C x86_64-native-linuxapp-gcc
rm -rf x86_64-native-linuxapp-gcc/lib
rm -rf x86_64-native-linuxapp-gcc/drivers

3. Start testpmd
root@icx-jy-abi-d81:~/dpdk# x86_64-native-linuxapp-gcc/app/dpdk-testpmd -l 1-4 -n 4 -a 0000:31:00.0 --file-prefix=dpdk_14651_20260325102153 -d /root/shared_lib_dpdk -- -i

Show the output from the previous commands.
-------------------------------------------
root@icx-jy-abi-d81:~/jaccy/dpdk_25.11# x86_64-native-linuxapp-gcc/app/dpdk-testpmd -l 1-4 -n 4 -a 0000:31:00.0 --file-prefix=dpdk_14651_20260325102153 -d /root/shared_lib_dpdk -- -i
x86_64-native-linuxapp-gcc/app/dpdk-testpmd: symbol lookup error: x86_64-native-linuxapp-gcc/app/dpdk-testpmd: undefined symbol: rte_flow_dynf_metadata_offs, version EXPERIMENTAL

Expected Result
---------------
launch ok


Is this issue a regression: Y
-----------------------------

Version the regression was introduced:  commit 4ee2f5c1ced

commit 4ee2f5c1cedf9ee7f39afa667f71b07f4004ba5c (HEAD ->
4ee2f5c1ce-flowmetadata)
Author: Dariusz Sosnowski <dsosnowski@nvidia.com>
Date:   Fri May 29 09:28:53 2026 +0200

    ethdev: promote flow metadata API to stable

    Following experimental symbols related to flow metadata
    were added in v19.11:

    - rte_flow_dynf_metadata_register
    - rte_flow_dynf_metadata_offs
    - rte_flow_dynf_metadata_mask

    Type of rte_flow_dynf_metadata_offs was changed from int to int32_t
    in v20.05 release.
    There were no changes to these symbols since then.

    This patch promotes these symbols and removes __rte_experimental
    from the following inline functions:

    - rte_flow_dynf_metadata_avail
    - rte_flow_dynf_metadata_get
    - rte_flow_dynf_metadata_set

    All these symbols and functions will be used by netdev-doca
    backend in Open vSwitch [1].
    Stabilizing these symbols is required by current OVS policy
    to remove the need for ALLOW_EXPERIMENTAL_API [2].

    [1]:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=504726&state=%2A&archive=both
    [2]: https://mail.openvswitch.org/pipermail/ovs-dev/2026-May/432066.html

    Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply

* [DPDK/testpmd Bug 1956] [dpdk-26.07-rc1] ice_buffer_split: testpmd start failed when set multiple mbuf-size
From: bugzilla @ 2026-06-16  1:33 UTC (permalink / raw)
  To: dev

http://bugs.dpdk.org/show_bug.cgi?id=1956

            Bug ID: 1956
           Summary: [dpdk-26.07-rc1] ice_buffer_split: testpmd start
                    failed when set multiple mbuf-size
           Product: DPDK
           Version: 26.07
          Hardware: x86
                OS: Linux
            Status: UNCONFIRMED
          Severity: normal
          Priority: Normal
         Component: testpmd
          Assignee: dev@dpdk.org
          Reporter: songx.jiale@intel.com
  Target Milestone: ---

Environment
===========
dpdk-25.11.0-rc1: 39b54f2dcf44ad1f91eabc7080cd5dea763607fd
OS: openEuler24.03/6.6.0-145.0.4.135.oe2403sp3.x86_64
Compiler: gcc version 12.3.1 (openEuler 12.3.1-105.oe2403sp3) (GCC)
NIC hardware: CVL,Intel Corporation Ethernet Controller E810-C for SFP
[8086:1593] (rev 02)

NIC firmware:
driver: vfio-pci
kdriver: ice-2.6.4
fw: 5.00 0x80021c11 1.4002.0
ddp: ice os default 1.3.59.0

Test Setup
Steps to reproduce
==================
1. Compile DPDK
rm -rf x86_64-native-linuxapp-gcc
CC=gcc meson -Dlibdir=lib -Dc_args='-DRTE_ETHDEV_DEBUG_RX=1' --default-library=static x86_64-native-linuxapp-gcc
ninja -C x86_64-native-linuxapp-gcc -j 72

2. bind port to vfio-pci
./usertools/dpdk-devbind.py -b vfio-pci 0000:18:00.0 0000:18:00.1

3. start testpmd
./x86_64-native-linuxapp-gcc/app/dpdk-testpmd -l 1-2 -n 4 -a 0000:18:00.0 -a 0000:18:00.1 --force-max-simd-bitwidth=64 -- -i --mbuf-size=2048,2048


Results: 
========
Configuring Port 0 (socket 0)
ETHDEV: No Rx segmentation offload configured
Fail to configure port 0 rx queues
Start ports failed 

Expected Result:
================
testpmd starts successfully and buffer split works as expected.

bad commit:
===========
commit 0be0ad196b52ab4fab88a51de35a2a4c83b21362
Author: Gregory Etelson <getelson@nvidia.com>
Date:   Sat Jun 6 01:33:43 2026 +0200

    app/testpmd: support selective Rx

    Add support for selective Rx using existing rxpkts and mbuf-size
    command line parameters.

    When a segment is specified with rxpkts and a matching 0 mbuf-size
    on PMDs supporting selective Rx,
    testpmd set the mempool of the segment to NULL,
    meaning the segment won't be received.

    Example usage to receive only Ethernet header and 64 bytes at offset 128:

      --rxpkts=14,114,64,0 --mbuf-size=256,0,256,0

    This creates segments:
    - [0-13]: 14 bytes with mempool (received)
    - [14-127]: 114 bytes with NULL mempool (discarded)
    - [128-191]: 64 bytes with mempool (received)
    - [192-max]: remaining bytes with NULL mempool (discarded)

    If the first segment has no mempool,
    there will be no mempool created with the index 0.
    That's why the lookup of the first mempool is now achieved
    in the new function mbuf_pool_find_first(socket)
    instead of mbuf_pool_find(socket, index 0)

    Note: RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is required for this feature
    and is checked at ethdev API level.
    This check is removed from testpmd to allow negative testing of the API.

    Signed-off-by: Gregory Etelson <getelson@nvidia.com>
    Signed-off-by: Thomas Monjalon <thomas@monjalon.net>

 app/test-pmd/cmdline.c                      |  2 +-
 app/test-pmd/parameters.c                   |  5 ++-
 app/test-pmd/testpmd.c                      | 48 +++++++++++++++++------------
 app/test-pmd/testpmd.h                      | 16 ++++++++++
 doc/guides/testpmd_app_ug/run_app.rst       | 16 ++++++++++
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |  3 +-
 6 files changed, 66 insertions, 24 deletions
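The commit message's example maps `--rxpkts` lengths and `--mbuf-size` values to byte ranges. The sketch below reproduces only that arithmetic and is not testpmd code. Based on the example, it assumes that a trailing length of 0 means "rest of the packet" and that an mbuf size of 0 means the segment has no mempool and is discarded. `seg_plan` and `plan_segments()` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct seg_plan {
	uint32_t first;      /* first byte offset of the segment */
	uint32_t last;       /* last byte offset, or UINT32_MAX if open-ended */
	bool received;       /* false when the mbuf size (mempool) is 0 */
};

/* Lay out n segments back to back starting at offset 0. */
static void plan_segments(const uint32_t *lens, const uint32_t *mbuf_sizes,
			  size_t n, struct seg_plan *out)
{
	uint32_t off = 0;

	for (size_t i = 0; i < n; i++) {
		out[i].first = off;
		out[i].last = lens[i] == 0 ? UINT32_MAX : off + lens[i] - 1;
		out[i].received = mbuf_sizes[i] != 0;
		off += lens[i];
	}
}
```

With `--rxpkts=14,114,64,0 --mbuf-size=256,0,256,0`, this yields the four ranges listed in the commit message: [0-13] received, [14-127] discarded, [128-191] received, and [192-max] discarded.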

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply

* RE: [EXTERNAL] [PATCH 00/13] Bus cleanup infrastructure and fixes
From: Long Li @ 2026-06-15 23:55 UTC (permalink / raw)
  To: David Marchand, dev@dpdk.org
  Cc: thomas@monjalon.net, stephen@networkplumber.org,
	bruce.richardson@intel.com, fengchengwen@huawei.com
In-Reply-To: <SA1PR21MB6683D20ADEE7D476EBE59CC8CEE62@SA1PR21MB6683.namprd21.prod.outlook.com>

>
> > This series refactors the bus cleanup infrastructure to reduce code
> > duplication and fix resource leaks in several bus drivers.
> > It should address the leak Thomas pointed at.
> >
> > The first part of the series (patches 1-8) addresses several bugs and
> > inconsistencies:
> > - Documentation and log message inconsistencies from earlier bus
> >   refactoring
> > - Device list management issues in dma/idxd and bus/vdev
> > - Resource leaks in PCI and VMBUS bus cleanup (mappings and interrupts)
> > - Simplified device freeing in NXP buses (DPAA and FSLMC)
> > - Deferred interrupt allocation to probe time (NXP buses, VMBUS)
> >
> > The core infrastructure changes (patches 9-10) introduce the generic cleanup
> > framework:
> > - Refactors unplug operations to be the counterpart of probe_device
> > - Implements rte_bus_generic_cleanup() to centralize cleanup logic
> > - Adds .free_device operation to struct rte_bus
> > - Adds compile-time verification that rte_device is at offset 0
> >
> > The final patches (11-13) convert remaining buses to use the generic cleanup
> > helper:
> > - DPAA bus: add unplug support
> > - VMBUS bus: switch to embedded device name and add unplug support
> 
> There is a hung on vmbus during device shutdown after applying the series, I'm
> looking into it.

It turned out to be a test issue. Please see my comments on patch 08; the patch set tested well after that fix.

Long
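The series describes two pieces: a generic cleanup helper that frees devices through a bus `.free_device` callback, and a compile-time check that `rte_device` sits at offset 0. The following minimal sketch shows why that offset matters. It uses invented names (`generic_device`, `my_bus_device`, `generic_bus`, `generic_cleanup()`) in place of the real structures, and it is not the code from the series. When the generic device is the first member, a pointer to it is also the pointer originally allocated for the enclosing bus device. One generic loop can then unlink every device and let the bus callback free it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical generic device, standing in for struct rte_device. */
struct generic_device {
	struct generic_device *next;  /* bus device list linkage */
	const char *name;
};

/* Hypothetical bus-specific device embedding the generic one. */
struct my_bus_device {
	struct generic_device device;  /* must remain the first member */
	int irq_fd;                    /* bus-private state */
};

/* Fails to compile if 'device' is ever moved away from offset 0. */
static_assert(offsetof(struct my_bus_device, device) == 0,
	      "generic_device must be the first member");

struct generic_bus {
	struct generic_device *devices;
	void (*free_device)(struct generic_device *dev);
};

/* Bus callback: valid only because dev is the pointer calloc() returned. */
static void my_bus_free_device(struct generic_device *dev)
{
	free((struct my_bus_device *)dev);
}

/* Generic cleanup: unlink each device, then let the bus free it. */
static int generic_cleanup(struct generic_bus *bus)
{
	int freed = 0;

	while (bus->devices != NULL) {
		struct generic_device *dev = bus->devices;

		bus->devices = dev->next;
		bus->free_device(dev);
		freed++;
	}
	return freed;
}
```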

^ permalink raw reply

* Re: [PATCH] test: add larger input len test for CRC16-CCITT
From: Stephen Hemminger @ 2026-06-15 21:31 UTC (permalink / raw)
  To: Shreesh Adiga; +Cc: Jasvinder Singh, dev
In-Reply-To: <20260612023745.275608-1-16567adigashreesh@gmail.com>

On Fri, 12 Jun 2026 08:07:45 +0530
Shreesh Adiga <16567adigashreesh@gmail.com> wrote:

> CRC16-CCITT test only covered len 32, 12, and 2 which meant that
> code paths like 4x SSE4.2 loop and AVX512 code paths which operated
> on larger lens like >255 never got covered.
> 
> This patch adds a 348 len input test for CRC16-CCITT similar to
> CRC32 test which covers the additional paths in SSE4.2 and AVX512
> implementations, therefore improving the test coverage.
> 
> Signed-off-by: Shreesh Adiga <16567adigashreesh@gmail.com>
> ---

Looks good, applied to net-next.
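Longer inputs such as the new 348-byte vector need an independent expected value. A slow bitwise reference is one way to obtain it. The sketch below implements the reflected CCITT polynomial (0x8408) with init 0xFFFF and a final inversion, which are the CRC-16/IBM-SDLC (X-25) parameters. Whether those parameters match DPDK's `RTE_NET_CRC16_CCITT` is an assumption to check against the existing test vectors before reusing this code. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise reflected CRC-16 update, polynomial 0x1021 reversed (0x8408). */
static uint16_t crc16_reflected_update(uint16_t crc, const uint8_t *data, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0x8408) : (uint16_t)(crc >> 1);
	}
	return crc;
}

/* CRC-16/IBM-SDLC (X-25): init 0xFFFF, reflected, final XOR 0xFFFF. */
static uint16_t crc16_x25(const uint8_t *data, size_t len)
{
	return (uint16_t)~crc16_reflected_update(0xFFFF, data, len);
}
```

Because the update function carries state, the same reference can also check that a split computation over arbitrary lengths matches a one-shot computation.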

^ permalink raw reply

* Re: [PATCH] net/crc: cleanup code in net_crc_sse.c implementation
From: Stephen Hemminger @ 2026-06-15 21:31 UTC (permalink / raw)
  To: Shreesh Adiga; +Cc: Bruce Richardson, Konstantin Ananyev, Jasvinder Singh, dev
In-Reply-To: <20260612025135.298226-1-16567adigashreesh@gmail.com>

On Fri, 12 Jun 2026 08:21:35 +0530
Shreesh Adiga <16567adigashreesh@gmail.com> wrote:

> Special handling for len between 16 and 31 is not required as the
> implementation correctly handles them in the main path. Given that these
> cases were annotated with unlikely branch hint, it should be simpler to
> have these handled in the main path itself.
> 
> We can remove the partial_bytes label as there is no jump target to it,
> and replace folding code in that block with already existing inline
> function to simplify and have better code reuse.
> 
> Signed-off-by: Shreesh Adiga <16567adigashreesh@gmail.com>
> ---

Looks good, applied to net-next

^ permalink raw reply

* Re: [PATCH v4 0/6] net/gve: add hardware timestamping support
From: Stephen Hemminger @ 2026-06-15 21:29 UTC (permalink / raw)
  To: Mark Blasko; +Cc: dev, joshwash, jtranoleary
In-Reply-To: <CANULgnLQ6E-JFEJoB6N5DEcKcFsJ3Vj3qJMmAB=2OvBfbPWKFw@mail.gmail.com>

On Mon, 15 Jun 2026 14:01:42 -0700
Mark Blasko <blasko@google.com> wrote:

> That appears to be the AI feedback from V3 (which was addressed in V4).
> I also looked at the AI feedback in the mail archive for V4 and it does not
> look like there's anything actionable.
> 
> Do you have any other feedback for the V4 patches?

Let me re-run it.

Sorry, I got the wrong part of a long AI thread.

Re-reviewed v4 against current main. The two correctness errors and the
interrupt-thread warning from v3 are all resolved:

  - 4/6: the periodic read now runs on a dedicated control thread via
    rte_thread_create_internal_control(), sleeping in 10ms increments
    with a relaxed stop flag, instead of blocking the shared EAL
    interrupt thread. Teardown sets the stop flag, joins the thread,
    then frees the memzone, so the join keeps the thread off the freed
    memzone. Good.

  - 5/6: nic_ts_lock is now initialized before gve_init_priv() (so the
    setup-time gve_read_nic_clock() locks an initialized mutex) and
    destroyed only after gve_teardown_device_resources() has joined the
    sync thread. Both ordering bugs are fixed, and the gve_init_priv()
    failure path destroys both mutexes correctly.
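Both v4 fixes follow one lifetime rule. Shared resources are created before anything can use them, and they are torn down only after the periodic reader has been joined. The following minimal pthreads sketch shows that ordering. It assumes a plain `calloc()` buffer in place of the memzone, a `pthread_mutex_t` in place of `nic_ts_lock`, and `nanosleep()` in place of `rte_delay_us_sleep()`. `clock_sync`, `clock_sync_start()` and `clock_sync_stop()` are hypothetical names, and this is not the gve code.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for the driver's clock-sync state. */
struct clock_sync {
	pthread_t thread;
	pthread_mutex_t lock;       /* plays the role of nic_ts_lock */
	atomic_bool stop;           /* polled with relaxed loads */
	unsigned long *shared_ts;   /* plays the role of the memzone */
	unsigned long reads;        /* periodic reads performed */
};

static void *clock_sync_loop(void *arg)
{
	struct clock_sync *cs = arg;
	const struct timespec tick = { .tv_sec = 0, .tv_nsec = 10L * 1000 * 1000 };

	while (!atomic_load_explicit(&cs->stop, memory_order_relaxed)) {
		pthread_mutex_lock(&cs->lock);
		(*cs->shared_ts)++;         /* the "periodic clock read" */
		cs->reads++;
		pthread_mutex_unlock(&cs->lock);
		nanosleep(&tick, NULL);     /* sleep in 10 ms steps */
	}
	return NULL;
}

/* Setup: allocate and initialize everything before the thread can run. */
static int clock_sync_start(struct clock_sync *cs)
{
	cs->shared_ts = calloc(1, sizeof(*cs->shared_ts));
	if (cs->shared_ts == NULL)
		return -1;
	cs->reads = 0;
	pthread_mutex_init(&cs->lock, NULL);
	atomic_init(&cs->stop, false);
	if (pthread_create(&cs->thread, NULL, clock_sync_loop, cs) != 0) {
		pthread_mutex_destroy(&cs->lock);
		free(cs->shared_ts);
		cs->shared_ts = NULL;
		return -1;
	}
	return 0;
}

/* Teardown: stop and join first, only then free and destroy. */
static void clock_sync_stop(struct clock_sync *cs)
{
	atomic_store_explicit(&cs->stop, true, memory_order_relaxed);
	pthread_join(cs->thread, NULL);
	free(cs->shared_ts);
	cs->shared_ts = NULL;
	pthread_mutex_destroy(&cs->lock);
}
```

Moving the `free()` or `pthread_mutex_destroy()` in `clock_sync_stop()` ahead of `pthread_join()` would reopen the use-after-free and destroyed-mutex windows described in the v3 review.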

One item remains:

[PATCH v4 4/6] net/gve: add periodic NIC clock synchronization

Warning: the commit message body still describes the old implementation.
It says the sync "runs every 250ms using rte_alarm" and "the alarm is
still rescheduled", but the code no longer uses rte_alarm - it is a
dedicated control thread (gve_nic_ts_thread) that polls with
rte_delay_us_sleep() and a stop flag. The v4 changelog notes the switch,
but the commit message body itself was not updated. Please reword it to
match the thread-based implementation.

Patches 1, 2, 3, 5, and 6 look good. With the 4/6 commit message
corrected, the series is ready.

^ permalink raw reply

* Re: [PATCH v4 0/6] net/gve: add hardware timestamping support
From: Mark Blasko @ 2026-06-15 21:01 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, joshwash, jtranoleary
In-Reply-To: <20260615115314.5e2ff054@phoenix.local>

That appears to be the AI feedback from V3 (which was addressed in V4).
I also looked at the AI feedback in the mail archive for V4 and it does not
look like there's anything actionable.

Do you have any other feedback for the V4 patches?

On Mon, Jun 15, 2026 at 11:53 AM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Sat, 13 Jun 2026 04:22:33 +0000
> Mark Blasko <blasko@google.com> wrote:
>
> > This patch series introduces support for GVE hardware timestamping
> > on DQO queues. To support concurrent access, a mutex lock is introduced
> > to protect admin queue operations. A mechanism is then added to
> > periodically synchronize the NIC clock via a dedicated control thread,
> > and support is introduced for the read_clock ethdev operation.
> > Finally, the RX datapath is updated to reconstruct full 64-bit
> > timestamps from the 32-bit values in DQO descriptors.
> >
> > ---
>
> AI spotted several issues...
>
> Reviewed the v3 series against current main. Findings on 4/6 and 5/6
> below; patches 1, 2, 3, and 6 look good.
>
> [PATCH v3 4/6] net/gve: add periodic NIC clock synchronization
>
> Warning: gve_read_nic_clock() runs as an rte_alarm callback, i.e. on the
> shared EAL interrupt thread, and calls gve_adminq_report_nic_timestamp()
> -> gve_adminq_kick_and_wait(), which busy-waits via rte_delay_ms() up to
> GVE_MAX_ADMINQ_EVENT_COUNTER_CHECK * GVE_ADMINQ_SLEEP_LEN = 100 * 20ms =
> 2s on an AdminQ timeout (tens of ms in the normal case). Blocking the
> interrupt thread that long stalls link-status/reset detection and every
> other device's interrupt and alarm handling for the whole process. The
> existing gve_check_device_status() alarm only does a single ioread32be(),
> so this is new behavior for the driver. Consider running the periodic
> read off a dedicated control thread, or otherwise bounding the time spent
> on the interrupt thread.
>
> The teardown ordering itself is fine: rte_eal_alarm_cancel() is called
> before gve_free_nic_ts_report(), and its spin-until-not-executing
> semantics catch the self-rescheduled alarm, since the new entry is queued
> during the callback before the dispatcher removes the executing one. No
> use-after-free on the memzone there.
>
> [PATCH v3 5/6] net/gve: support read clock ethdev op
>
> Error: priv->nic_ts_lock is locked before it is initialized. In
> gve_dev_init() the order is:
>
>         err = gve_init_priv(priv, false);
>         ...
>         pthread_mutex_init(&priv->nic_ts_lock, &mutexattr);
>
> but gve_init_priv() -> gve_setup_device_resources() ->
> gve_setup_nic_timestamp() calls gve_read_nic_clock() synchronously when
> the device reports NIC-timestamp support, and after this patch
> gve_read_nic_clock() takes priv->nic_ts_lock. So on timestamp-capable
> hardware the first lock runs on an uninitialized mutex, and the later
> pthread_mutex_init() then re-initializes an already-used mutex - both
> undefined behavior. It only appears to work because dev_private is zeroed
> (a zeroed pthread_mutex_t happens to be a valid default mutex on glibc).
> Initialize nic_ts_lock (and the mutexattr) before the gve_init_priv()
> call.
>
> Error: priv->nic_ts_lock is destroyed before the alarm that uses it is
> cancelled. In gve_dev_close():
>
>         pthread_mutex_destroy(&priv->nic_ts_lock);
>         gve_free_queues(dev);
>         gve_teardown_device_resources(priv);    /* cancels gve_read_nic_clock alarm */
>
> The periodic gve_read_nic_clock() alarm is still armed when the mutex is
> destroyed, and that callback locks nic_ts_lock; if it fires in the window
> before gve_teardown_device_resources() cancels it, it locks a destroyed
> mutex. Move the pthread_mutex_destroy(&priv->nic_ts_lock) to after
> gve_teardown_device_resources() returns.
>
> The two 5/6 errors are the same root cause from opposite ends: the
> nic_ts_lock lifetime needs to bracket all its users - initialized before
> the synchronous setup-time read, and destroyed only after the alarm is
> cancelled.
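The lifetime rule in the last paragraph can be shown outside DPDK with plain pthreads. This is a minimal sketch, not gve code. It assumes a dedicated reader thread stands in for the rte_eal alarm, which is also one of the alternatives suggested above for the periodic read. `struct demo_priv`, `demo_read_clock()` and the other `demo_*` names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the driver's private data. */
struct demo_priv {
	pthread_mutex_t ts_lock;	/* plays the role of priv->nic_ts_lock */
	uint64_t reads;			/* number of clock reads performed */
	atomic_bool stop;		/* set to cancel the periodic reader */
	pthread_t reader;		/* dedicated thread in place of an alarm */
};

/* Hypothetical equivalent of gve_read_nic_clock(): every caller locks. */
static void demo_read_clock(struct demo_priv *p)
{
	pthread_mutex_lock(&p->ts_lock);
	p->reads++;			/* a real driver would read the NIC clock here */
	pthread_mutex_unlock(&p->ts_lock);
}

/* Periodic user of the lock, running off the interrupt thread. */
static void *demo_reader(void *arg)
{
	struct demo_priv *p = arg;
	const struct timespec period = { .tv_sec = 0, .tv_nsec = 1000000 };

	while (!atomic_load(&p->stop)) {
		demo_read_clock(p);
		nanosleep(&period, NULL);
	}
	return NULL;
}

/* Init: create the lock BEFORE its first (synchronous) user runs. */
static int demo_init(struct demo_priv *p)
{
	if (pthread_mutex_init(&p->ts_lock, NULL) != 0)
		return -1;
	p->reads = 0;
	atomic_init(&p->stop, false);

	demo_read_clock(p);		/* setup-time read, like gve_setup_nic_timestamp() */

	if (pthread_create(&p->reader, NULL, demo_reader, p) != 0) {
		pthread_mutex_destroy(&p->ts_lock);
		return -1;
	}
	return 0;
}

/* Close: cancel and wait for the periodic user, THEN destroy the lock. */
static uint64_t demo_close(struct demo_priv *p)
{
	uint64_t reads;

	atomic_store(&p->stop, true);
	pthread_join(p->reader, NULL);	/* after this, nobody can take ts_lock */

	reads = p->reads;
	pthread_mutex_destroy(&p->ts_lock);
	return reads;
}
```

The design choice is that the lock's init and destroy calls bracket every user: the synchronous setup-time read on one side, and the join (the equivalent of a completed alarm cancel) on the other.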

* RE: [EXTERNAL] [PATCH 00/13] Bus cleanup infrastructure and fixes
From: Long Li @ 2026-06-15 19:14 UTC (permalink / raw)
  To: David Marchand, dev@dpdk.org
  Cc: thomas@monjalon.net, stephen@networkplumber.org,
	bruce.richardson@intel.com, fengchengwen@huawei.com
In-Reply-To: <20260611094551.1514962-1-david.marchand@redhat.com>

> This is a followup of the previous bus refactoring.
> See
> https://inbox.dpdk.org/dev/CAJFAV8zvFpLwz8SY8DUUezyJyM43eRZ17Yj30ex808eHC4ZE=g@mail.gmail.com/.
>
> This series refactors the bus cleanup infrastructure to reduce code duplication
> and fix resource leaks in several bus drivers.
> It should address the leak Thomas pointed at.
>
> The first part of the series (patches 1-8) addresses several bugs and
> inconsistencies:
> - Documentation and log message inconsistencies from earlier bus
>   refactoring
> - Device list management issues in dma/idxd and bus/vdev
> - Resource leaks in PCI and VMBUS bus cleanup (mappings and interrupts)
> - Simplified device freeing in NXP buses (DPAA and FSLMC)
> - Deferred interrupt allocation to probe time (NXP buses, VMBUS)
>
> The core infrastructure changes (patches 9-10) introduce the generic cleanup
> framework:
> - Refactors unplug operations to be the counterpart of probe_device
> - Implements rte_bus_generic_cleanup() to centralize cleanup logic
> - Adds .free_device operation to struct rte_bus
> - Adds compile-time verification that rte_device is at offset 0
>
> The final patches (11-13) convert remaining buses to use the generic cleanup
> helper:
> - DPAA bus: add unplug support
> - VMBUS bus: switch to embedded device name and add unplug support

There is a hang on vmbus during device shutdown after applying the series; I'm looking into it.
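The idea behind patches 9-10 can be sketched in plain C: a bus-agnostic cleanup loop, plus an embedded generic device checked at compile time to sit at offset 0. This is a hedged illustration, not the code from the series. `struct generic_bus`, `struct generic_device`, `generic_bus_cleanup()` and the `demo_*` names are hypothetical simplifications of rte_bus/rte_device. The real rte_bus_generic_cleanup() may order steps or handle errors differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for rte_device / rte_bus. */
struct generic_device {
	struct generic_device *next;
	const char *name;
};

struct generic_bus {
	struct generic_device *devices;			/* singly linked list */
	int (*unplug)(struct generic_device *dev);	/* counterpart of probe */
	void (*free_device)(struct generic_device *dev);
};

/* A bus-specific device embeds the generic one as its FIRST member. */
struct demo_bus_device {
	struct generic_device device;
	int intr_fd;					/* bus-specific state */
};

/* Compile-time check: a pointer to the generic part is also a pointer to
 * the whole bus device, so generic code can hand it back to the bus. */
static_assert(offsetof(struct demo_bus_device, device) == 0,
	      "generic device must be the first member");

static int demo_unplug(struct generic_device *dev)
{
	(void)dev;		/* a real bus would release driver resources here */
	return 0;
}

static void demo_free_device(struct generic_device *dev)
{
	/* Same address as the containing struct, thanks to the static_assert. */
	free((struct demo_bus_device *)dev);
}

/* Generic cleanup: unplug every device, then let the bus free it. */
static int generic_bus_cleanup(struct generic_bus *bus)
{
	int error = 0;

	while (bus->devices != NULL) {
		struct generic_device *dev = bus->devices;

		bus->devices = dev->next;
		if (bus->unplug != NULL && bus->unplug(dev) != 0)
			error = -1;
		bus->free_device(dev);
	}
	return error;
}
```

With the offset-0 guarantee, each bus only has to provide the unplug and free callbacks. The list walk is written once.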

* RE: [EXTERNAL] [PATCH 08/13] bus/vmbus: allocate interrupt during probing
From: Long Li @ 2026-06-15 19:13 UTC (permalink / raw)
  To: David Marchand, dev@dpdk.org
  Cc: thomas@monjalon.net, stephen@networkplumber.org,
	bruce.richardson@intel.com, fengchengwen@huawei.com, Wei Hu
In-Reply-To: <20260611094551.1514962-9-david.marchand@redhat.com>


> Allocating the interrupt handle is a waste of memory if no device is probed
> later (for example, if an allowlist is passed).
> Instead, allocate this handle at the time probe_device is called.
> 
> Signed-off-by: David Marchand <david.marchand@redhat.com>
> ---
>  drivers/bus/vmbus/linux/vmbus_bus.c |  6 ------
>  drivers/bus/vmbus/vmbus_common.c    | 18 +++++++++++++++++-
>  2 files changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/bus/vmbus/linux/vmbus_bus.c b/drivers/bus/vmbus/linux/vmbus_bus.c
> index 0af10f6a69..77d904ad6d 100644
> --- a/drivers/bus/vmbus/linux/vmbus_bus.c
> +++ b/drivers/bus/vmbus/linux/vmbus_bus.c
> @@ -345,12 +345,6 @@ vmbus_scan_one(const char *name)
>  		}
>  	}
> 
> -	/* Allocate interrupt handle instance */
> -	dev->intr_handle =
> -		rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_PRIVATE);
> -	if (dev->intr_handle == NULL)
> -		goto error;
> -
>  	/* device is valid, add in list (sorted) */
>  	VMBUS_LOG(DEBUG, "Adding vmbus device %s", name);
> 
> diff --git a/drivers/bus/vmbus/vmbus_common.c b/drivers/bus/vmbus/vmbus_common.c
> index 74c1ddff69..b6ae82915f 100644
> --- a/drivers/bus/vmbus/vmbus_common.c
> +++ b/drivers/bus/vmbus/vmbus_common.c
> @@ -108,11 +108,27 @@ vmbus_probe_device(struct rte_driver *drv, struct rte_device *dev)
>  	if (vmbus_dev->device.numa_node < 0 && rte_socket_count() > 1)
>  		VMBUS_LOG(INFO, "Device %s is not NUMA-aware", guid);
> 
> +	/* Allocate interrupt handle instance */
> +	vmbus_dev->intr_handle =
> +		rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_PRIVATE);
> +	if (vmbus_dev->intr_handle == NULL) {
> +		ret = -ENOMEM;
> +		goto unmap;
> +	}
> +
>  	/* call the driver probe() function */
>  	VMBUS_LOG(INFO, "  probe driver: %s", vmbus_drv->driver.name);
>  	ret = vmbus_drv->probe(vmbus_drv, vmbus_dev);
>  	if (ret != 0)
> -		rte_vmbus_unmap_device(vmbus_dev);
> +		goto free_intr;
> +
> +	return 0;
> +
> +free_intr:
> +	rte_intr_instance_free(vmbus_dev->intr_handle);
> +	vmbus_dev->intr_handle = NULL;
> +unmap:
> +	rte_vmbus_unmap_device(vmbus_dev);
> 
>  	return ret;
>  }
> --
> 2.53.0

rte_vmbus_map_device() needs intr_handle to already exist, so rte_intr_instance_alloc() has to be moved before the call to rte_vmbus_map_device(), something like this:

diff --git a/drivers/bus/vmbus/vmbus_common.c b/drivers/bus/vmbus/vmbus_common.c
index 419eb9b895..f3bcb90e46 100644
--- a/drivers/bus/vmbus/vmbus_common.c
+++ b/drivers/bus/vmbus/vmbus_common.c
@@ -100,35 +100,33 @@ vmbus_probe_device(struct rte_driver *drv, struct rte_device *dev)
                return 1;
        }

+       /* Allocate interrupt handle instance */
+       vmbus_dev->intr_handle =
+               rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_PRIVATE);
+       if (vmbus_dev->intr_handle == NULL)
+               return -ENOMEM;
+
        /* map resources for device */
        ret = rte_vmbus_map_device(vmbus_dev);
        if (ret != 0)
-               return ret;
+               goto free_intr;

        if (vmbus_dev->device.numa_node < 0 && rte_socket_count() > 1)
                VMBUS_LOG(INFO, "Device %s is not NUMA-aware", guid);

-       /* Allocate interrupt handle instance */
-       vmbus_dev->intr_handle =
-               rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_PRIVATE);
-       if (vmbus_dev->intr_handle == NULL) {
-               ret = -ENOMEM;
-               goto unmap;
-       }
-
        /* call the driver probe() function */
        VMBUS_LOG(INFO, "  probe driver: %s", vmbus_drv->driver.name);
        ret = vmbus_drv->probe(vmbus_drv, vmbus_dev);
        if (ret != 0)
-               goto free_intr;
+               goto unmap;

        return 0;

+unmap:
+       rte_vmbus_unmap_device(vmbus_dev);
 free_intr:
        rte_intr_instance_free(vmbus_dev->intr_handle);
        vmbus_dev->intr_handle = NULL;
-unmap:
-       rte_vmbus_unmap_device(vmbus_dev);

        return ret;
 }
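The fix follows the usual goto-unwind pattern. Resources are acquired in dependency order and released in exactly the reverse order, with each label undoing one step. Below is a self-contained sketch, assuming hypothetical `demo_*` helpers in place of rte_intr_instance_alloc(), rte_vmbus_map_device() and the driver's probe().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical device with the two resources the probe path acquires. */
struct demo_dev {
	int *intr_handle;	/* stands in for the rte_intr_handle */
	bool mapped;		/* stands in for the vmbus resource mapping */
};

/* Like rte_vmbus_map_device() per the review: needs intr_handle to exist. */
static int demo_map(struct demo_dev *dev)
{
	if (dev->intr_handle == NULL)
		return -EINVAL;
	dev->mapped = true;
	return 0;
}

static void demo_unmap(struct demo_dev *dev)
{
	dev->mapped = false;
}

/* Probe: acquire in dependency order, unwind in reverse order. */
static int demo_probe(struct demo_dev *dev, int (*drv_probe)(struct demo_dev *))
{
	int ret;

	dev->intr_handle = calloc(1, sizeof(*dev->intr_handle));
	if (dev->intr_handle == NULL)
		return -ENOMEM;

	ret = demo_map(dev);
	if (ret != 0)
		goto free_intr;

	ret = drv_probe(dev);
	if (ret != 0)
		goto unmap;

	return 0;

unmap:
	demo_unmap(dev);		/* undo the last successful step first... */
free_intr:
	free(dev->intr_handle);		/* ...then the one it depended on */
	dev->intr_handle = NULL;
	return ret;
}

static int demo_drv_ok(struct demo_dev *dev) { (void)dev; return 0; }
static int demo_drv_fail(struct demo_dev *dev) { (void)dev; return -EIO; }
```

A failing driver probe leaves the device with no mapping and no interrupt handle. A successful one leaves both in place for the later unplug.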

* Re: [PATCH] app/testpmd: add VLAN priority insert support
From: Stephen Hemminger @ 2026-06-15 19:12 UTC (permalink / raw)
  To: Xingui Yang
  Cc: dev, david.marchand, aman.deep.singh, fengchengwen, yangshuaisong,
	lihuisong, liuyonglong, kangfenglong
In-Reply-To: <20260612081411.2798403-1-yangxingui@huawei.com>

On Fri, 12 Jun 2026 16:14:11 +0800
Xingui Yang <yangxingui@huawei.com> wrote:

> The tx_vlan set command currently only accepts a VLAN ID in range
> [0, 4095].  This patch adds support for an extended format that includes
> 802.1p priority and CFI bits, allowing users to set the VLAN priority
> tag when inserting VLAN headers in TX packets.
> 
> The extended format is:
>   bit 0-11:  VLAN ID (0-4095)
>   bit 12:    CFI (Canonical Format Indicator)
>   bit 13-15: Priority (0-7, 802.1p CoS)
> 
> This is consistent with the VLAN tag structure used by
> rte_eth_dev_set_vlan_pvid() where the PVID field encodes VLAN ID, CFI
> and priority in the same format.
> 
> A new command line option --enable-vlan-priority is added to enable this
> feature. By default, the feature is disabled to maintain backward
> compatibility with existing users. When enabled, the
> vlan_id_is_invalid() function allows any 16-bit value to pass, while the
> full 16-bit value (including CFI and priority bits) is passed to the
> driver for hardware VLAN insertion.
> 
> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
> ---

Having ability to set priority bits is good, and testpmd should allow it.
The mbuf vlan_tci is already a full 16-bit TCI (priority/CFI/VID), and
the TX insert path copies tx_vlan_id straight into it.  So priority
insert already works; the only thing in the way is the < 4096 check.
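Here is the bit layout as a worked value. A TCI carrying priority 5, CFI/DEI 0 and VLAN ID 100 is (5 << 13) | 100 = 0xA064 (41060). That is above 4095, so it is exactly what the current < 4096 check rejects. Below is a small sketch of packing and unpacking the TCI. The `tci_*` helpers and `TCI_*` macros are hypothetical, not DPDK or testpmd APIs.

```c
#include <stdint.h>

/* 802.1Q TCI layout: PCP in bits 13-15, CFI/DEI in bit 12, VID in bits 0-11. */
#define TCI_VID_MASK	0x0FFFu
#define TCI_DEI_SHIFT	12
#define TCI_PCP_SHIFT	13

static uint16_t tci_pack(uint8_t pcp, uint8_t dei, uint16_t vid)
{
	return (uint16_t)(((pcp & 0x7u) << TCI_PCP_SHIFT) |
			  ((dei & 0x1u) << TCI_DEI_SHIFT) |
			  (vid & TCI_VID_MASK));
}

static uint16_t tci_vid(uint16_t tci) { return tci & TCI_VID_MASK; }
static uint8_t tci_dei(uint16_t tci) { return (tci >> TCI_DEI_SHIFT) & 0x1u; }
static uint8_t tci_pcp(uint16_t tci) { return (tci >> TCI_PCP_SHIFT) & 0x7u; }
```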

Do you actually need a new option for this?  Both of_push_vlan +
of_set_vlan_pcp (rte_flow) and "tx_vlan set pvid" already let you set
the priority bits today, with no new code.

If you still want "tx_vlan set" itself to carry priority, I'd suggest
a smaller change: relax only the TX insert validators and drop the
option and the global.  Don't touch rx_vft_set -- it feeds the VLAN
filter, which only takes a VLAN ID and rejects > 4095 anyway, so the
flag just turns a clear error into a confusing one.
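Below is a minimal sketch of the smaller change suggested above, using hypothetical validator names; testpmd's real vlan_id_is_invalid() is not reproduced here. The TX insert check accepts any 16-bit TCI. The VLAN-filter check keeps the 0-4095 VLAN ID range.

```c
#include <stdint.h>

/* TX insert path (hypothetical): the value is copied straight into the
 * mbuf vlan_tci, so any 16-bit TCI (priority, CFI/DEI, VLAN ID) is fine. */
static int tx_vlan_tci_is_invalid(uint32_t tci)
{
	return tci > UINT16_MAX;
}

/* VLAN filter path (hypothetical, rx_vft_set side): only a VLAN ID is
 * meaningful, so values above 4095 stay a clear error. */
static int rx_filter_vlan_id_is_invalid(uint32_t vlan_id)
{
	return vlan_id > 4095;
}
```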

Either way, if the option stays, please document it and add a release note.
The commit message should also explain why the existing paths aren't enough.
