* mbuf fast-free requirements analysis
@ 2025-12-15 11:06 Morten Brørup
2025-12-15 11:46 ` Bruce Richardson
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Morten Brørup @ 2025-12-15 11:06 UTC (permalink / raw)
To: dev, techboard
Executive Summary:
My analysis shows that the mbuf library is not a barrier for fast-freeing
segmented packet mbufs, and thus fast-free of jumbo frames is possible.
Detailed Analysis:
The purpose of the mbuf fast-free Tx optimization is to reduce
rte_pktmbuf_free_seg() to something much simpler in the ethdev drivers, by
eliminating the code path related to indirect mbufs.
Optimally, we want to simplify the ethdev driver's function that frees the
transmitted mbufs, so it can free them directly to their mempool without
accessing the mbufs themselves.
If the driver cannot access the mbuf itself, it cannot determine which
mempool it belongs to.
We don't want the driver to access every mbuf being freed; but if all
mbufs of a Tx queue belong to the same mempool, the driver can determine
which mempool by looking into just one of the mbufs.
REQUIREMENT 1: The mbufs of a Tx queue must come from the same mempool.
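To illustrate Requirement 1, a driver's Tx completion path can then free a whole batch with one bulk operation, reading the pool pointer from the first mbuf only. The sketch below uses simplified stand-in types (mock_mempool, mock_mbuf) rather than the real DPDK structures; in an actual driver this would be rte_mempool_put_bulk() on struct rte_mbuf pointers:

```c
#include <stddef.h>

/* Simplified stand-ins for struct rte_mempool and struct rte_mbuf,
 * used only to illustrate the control flow; not the real DPDK types. */
struct mock_mempool { int put_count; };
struct mock_mbuf { struct mock_mempool *pool; };

/* Stand-in for rte_mempool_put_bulk(): return all objects in one call. */
static void mock_mempool_put_bulk(struct mock_mempool *mp,
                                  struct mock_mbuf **mbufs, unsigned int n)
{
    (void)mbufs;
    mp->put_count += (int)n;
}

/* Fast-free completion: because all mbufs of the Tx queue come from the
 * same mempool (Requirement 1), the pool is read from the first mbuf
 * only, and the whole batch is freed with a single bulk operation. */
static void tx_fast_free(struct mock_mbuf **done, unsigned int n)
{
    if (n == 0)
        return;
    mock_mempool_put_bulk(done[0]->pool, done, n);
}
```

The point of the sketch is that no mbuf other than done[0] is ever dereferenced, which is exactly what the fast-free optimization buys.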
When an mbuf is freed to its mempool, some of the fields in the mbuf must
be initialized.
So, for fast-free, this must be done by the driver's function that
prepares the Tx descriptor.
This is a requirement on the driver, not on the application.
Now, let's dig into the code for freeing an mbuf.
Note: For readability purposes, I'll cut out some code and comments
unrelated to this topic.
static __rte_always_inline void
rte_pktmbuf_free_seg(struct rte_mbuf *m)
{
m = rte_pktmbuf_prefree_seg(m);
if (likely(m != NULL))
rte_mbuf_raw_free(m);
}
rte_mbuf_raw_free(m) is simple, so nothing to gain there:
/**
* Put mbuf back into its original mempool.
*
* The caller must ensure that the mbuf is direct and properly
* reinitialized (refcnt=1, next=NULL, nb_segs=1), as done by
* rte_pktmbuf_prefree_seg().
*/
static __rte_always_inline void
rte_mbuf_raw_free(struct rte_mbuf *m)
{
rte_mbuf_history_mark(m, RTE_MBUF_HISTORY_OP_LIB_FREE);
rte_mempool_put(m->pool, m);
}
Note that the description says that the mbuf must be direct.
This is not entirely accurate; the mbuf is allowed to use a pinned
external buffer, if the mbuf holds the only reference to it.
(Most of the mbuf library functions have this documentation inaccuracy,
which should be fixed some day.)
So, the fast-free optimization really comes down to
rte_pktmbuf_prefree_seg(m), which must not return NULL.
Let's dig into that.
/**
* Decrease reference counter and unlink a mbuf segment
*
* This function does the same than a free, except that it does not
* return the segment to its pool.
* It decreases the reference counter, and if it reaches 0, it is
* detached from its parent for an indirect mbuf.
*
* @return
* - (m) if it is the last reference. It can be recycled or freed.
* - (NULL) if the mbuf still has remaining references on it.
*/
static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
bool refcnt_not_one;
refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
return NULL;
if (unlikely(!RTE_MBUF_DIRECT(m))) {
rte_pktmbuf_detach(m);
if (RTE_MBUF_HAS_EXTBUF(m) &&
RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
__rte_pktmbuf_pinned_extbuf_decref(m))
return NULL;
}
if (refcnt_not_one)
rte_mbuf_refcnt_set(m, 1);
if (m->nb_segs != 1)
m->nb_segs = 1;
if (m->next != NULL)
m->next = NULL;
return m;
}
This function can only succeed (i.e. return non-NULL) when 'refcnt' is 1
(or reaches 0).
REQUIREMENT 2: The driver must hold the only reference to the mbuf,
i.e. 'm->refcnt' must be 1.
When the function succeeds, it initializes the mbuf fields as required by
rte_mbuf_raw_free() before returning.
Now, since the driver has exclusive access to the mbuf, it is free to
initialize 'm->next' and 'm->nb_segs' at any time.
It could do that when preparing the Tx descriptor.
This is very interesting, because it means that fast-free does not
prohibit segmented packets!
(But the driver must have sufficient Tx descriptors for all segments in
the mbuf.)
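For illustration, a hypothetical Tx-prepare loop could reset these fields while walking the segment chain anyway, one descriptor per segment. The types below are simplified stand-ins, not the real struct rte_mbuf:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct rte_mbuf with only the fields relevant
 * to this sketch; not the real DPDK layout. */
struct mock_mbuf {
    struct mock_mbuf *next;
    uint16_t nb_segs;
};

/* Hypothetical Tx-prepare loop: while walking the segment chain to fill
 * one descriptor per segment, reset 'next' and 'nb_segs' so the later
 * completion path can bulk-free the segments without touching them.
 * Returns the number of descriptors (segments) consumed. */
static unsigned int tx_prep_segments(struct mock_mbuf *m)
{
    unsigned int nb_desc = 0;

    while (m != NULL) {
        struct mock_mbuf *next = m->next;

        /* ... fill one Tx descriptor for this segment here ... */
        m->next = NULL;    /* pre-initialize for the fast free */
        m->nb_segs = 1;
        nb_desc++;
        m = next;
    }
    return nb_desc;
}
```

Note that the loop caches 'm->next' before clearing it, since the pointer is still needed to advance the walk.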
Now, let's dig into rte_pktmbuf_prefree_seg()'s block handling non-direct
mbufs, i.e. cloned mbufs and mbufs with an external buffer:
if (unlikely(!RTE_MBUF_DIRECT(m))) {
rte_pktmbuf_detach(m);
if (RTE_MBUF_HAS_EXTBUF(m) &&
RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
__rte_pktmbuf_pinned_extbuf_decref(m))
return NULL;
}
Starting with rte_pktmbuf_detach():
static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
{
struct rte_mempool *mp = m->pool;
uint32_t mbuf_size, buf_len;
uint16_t priv_size;
if (RTE_MBUF_HAS_EXTBUF(m)) {
/*
* The mbuf has the external attached buffer,
* we should check the type of the memory pool where
* the mbuf was allocated from to detect the pinned
* external buffer.
*/
uint32_t flags = rte_pktmbuf_priv_flags(mp);
if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
/*
* The pinned external buffer should not be
* detached from its backing mbuf, just exit.
*/
return;
}
__rte_pktmbuf_free_extbuf(m);
} else {
__rte_pktmbuf_free_direct(m);
}
priv_size = rte_pktmbuf_priv_size(mp);
mbuf_size = (uint32_t)(sizeof(struct rte_mbuf) + priv_size);
buf_len = rte_pktmbuf_data_room_size(mp);
m->priv_size = priv_size;
m->buf_addr = (char *)m + mbuf_size;
rte_mbuf_iova_set(m, rte_mempool_virt2iova(m) + mbuf_size);
m->buf_len = (uint16_t)buf_len;
rte_pktmbuf_reset_headroom(m);
m->data_len = 0;
m->ol_flags = 0;
}
The only quick and simple code path through this function is when the mbuf
uses a pinned external buffer:
if (RTE_MBUF_HAS_EXTBUF(m)) {
uint32_t flags = rte_pktmbuf_priv_flags(mp);
if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
return;
REQUIREMENT 3: The mbuf must not be cloned or use a non-pinned external buffer.
Continuing with the next part of rte_pktmbuf_prefree_seg()'s block:
if (RTE_MBUF_HAS_EXTBUF(m) &&
RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
__rte_pktmbuf_pinned_extbuf_decref(m))
return NULL;
The last term of this condition is __rte_pktmbuf_pinned_extbuf_decref():
/**
* @internal Handle the packet mbufs with attached pinned external buffer
* on the mbuf freeing:
*
* - return zero if reference counter in shinfo is one. It means there is
* no more reference to this pinned buffer and mbuf can be returned to
* the pool
*
* - otherwise (if reference counter is not one), decrement reference
* counter and return non-zero value to prevent freeing the backing mbuf.
*
* Returns non zero if mbuf should not be freed.
*/
static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
{
struct rte_mbuf_ext_shared_info *shinfo;
/* Clear flags, mbuf is being freed. */
m->ol_flags = RTE_MBUF_F_EXTERNAL;
shinfo = m->shinfo;
/* Optimize for performance - do not dec/reinit */
if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1))
return 0;
/*
* Direct usage of add primitive to avoid
* duplication of comparing with one.
*/
if (likely(rte_atomic_fetch_add_explicit(&shinfo->refcnt, -1,
rte_memory_order_acq_rel) - 1))
return 1;
/* Reinitialize counter before mbuf freeing. */
rte_mbuf_ext_refcnt_set(shinfo, 1);
return 0;
}
Essentially, if the mbuf does use a pinned external buffer,
rte_pktmbuf_prefree_seg() only succeeds if the mbuf holds the sole
reference to that buffer.
REQUIREMENT 4: If the mbuf uses a pinned external buffer, the mbuf must
hold the only reference to that pinned external buffer, i.e. in that case,
'm->shinfo->refcnt' must be 1.
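Requirements 2 through 4 (the per-mbuf conditions, leaving aside the per-queue mempool requirement) can be summarized as a single qualification predicate. The sketch below models the relevant mbuf state with stand-in fields; the real checks would use rte_mbuf_refcnt_read(), RTE_MBUF_DIRECT(), RTE_MBUF_HAS_PINNED_EXTBUF() and rte_mbuf_ext_refcnt_read():

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the mbuf state relevant to Requirements 2-4;
 * the fields are stand-ins for the real DPDK flags and counters. */
struct mock_mbuf {
    uint16_t refcnt;            /* Requirement 2 */
    bool     is_direct;         /* neither cloned nor external buffer */
    bool     has_pinned_extbuf; /* Requirement 3, non-direct case */
    uint16_t ext_shinfo_refcnt; /* Requirement 4, pinned-extbuf case */
};

/* Returns true if this mbuf can be fast-freed, i.e.
 * rte_pktmbuf_prefree_seg() would succeed without the slow path. */
static bool mbuf_fast_free_ok(const struct mock_mbuf *m)
{
    if (m->refcnt != 1)                 /* Requirement 2 */
        return false;
    if (!m->is_direct) {
        if (!m->has_pinned_extbuf)      /* Requirement 3 */
            return false;
        if (m->ext_shinfo_refcnt != 1)  /* Requirement 4 */
            return false;
    }
    return true;
}
```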
Please review.
If I'm not mistaken, the mbuf library is not a barrier for fast-freeing
segmented packet mbufs, and thus fast-free of jumbo frames is possible.
We need a driver developer to confirm that my suggested approach -
resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
preparing the Tx descriptor - is viable.
* Re: mbuf fast-free requirements analysis
2025-12-15 11:06 mbuf fast-free requirements analysis Morten Brørup
@ 2025-12-15 11:46 ` Bruce Richardson
2026-01-14 15:31 ` Morten Brørup
2025-12-15 14:41 ` Konstantin Ananyev
2026-01-14 17:01 ` Bruce Richardson
2 siblings, 1 reply; 28+ messages in thread
From: Bruce Richardson @ 2025-12-15 11:46 UTC (permalink / raw)
To: Morten Brørup; +Cc: dev, techboard
On Mon, Dec 15, 2025 at 12:06:38PM +0100, Morten Brørup wrote:
Excellent analysis, Morten. If I get a chance some time this release cycle,
I will try implementing this change in our drivers and see if it makes any
difference.
/Bruce
* RE: mbuf fast-free requirements analysis
2025-12-15 11:06 mbuf fast-free requirements analysis Morten Brørup
2025-12-15 11:46 ` Bruce Richardson
@ 2025-12-15 14:41 ` Konstantin Ananyev
2025-12-15 16:14 ` Morten Brørup
2026-01-14 17:01 ` Bruce Richardson
2 siblings, 1 reply; 28+ messages in thread
From: Konstantin Ananyev @ 2025-12-15 14:41 UTC (permalink / raw)
To: Morten Brørup, dev@dpdk.org, techboard@dpdk.org
Great analysis, makes a lot of sense to me.
Shall we then add a special API to make PMD maintainers' lives a bit easier?
Something like rte_mbuf_fast_free_prep(mp, mb), which would optionally check
that the requirements outlined above are satisfied for a given mbuf, and
also reset the mbuf fields to their expected values?
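For illustration, such a helper might look roughly like the sketch below. The name rte_mbuf_fast_free_prep and its semantics are only a suggestion here, not an existing DPDK API, the types are simplified stand-ins, and the pinned-external-buffer case is omitted for brevity:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types; the real helper would operate on struct rte_mempool
 * and struct rte_mbuf. */
struct mock_mempool { int dummy; };
struct mock_mbuf {
    struct mock_mempool *pool;
    struct mock_mbuf *next;
    uint16_t nb_segs;
    uint16_t refcnt;
    bool is_direct;
};

/* Hypothetical rte_mbuf_fast_free_prep(): verify the fast-free
 * requirements for 'm' against the Tx queue's mempool 'mp', and reset
 * the fields that rte_mbuf_raw_free() expects.
 * Returns 0 on success, -1 if the mbuf does not qualify. */
static int mock_fast_free_prep(struct mock_mempool *mp, struct mock_mbuf *m)
{
    if (m->pool != mp)      /* Requirement 1: queue's shared mempool */
        return -1;
    if (m->refcnt != 1)     /* Requirement 2: sole reference */
        return -1;
    if (!m->is_direct)      /* Requirement 3 (pinned-extbuf case omitted) */
        return -1;
    m->next = NULL;         /* reset to the freed state */
    m->nb_segs = 1;
    return 0;
}
```

The check part could be compiled out in release builds (e.g. behind a debug macro), leaving only the two field resets on the hot path, which would match the "optionally check" part of the proposal.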
Konstantin
* RE: mbuf fast-free requirements analysis
2025-12-15 14:41 ` Konstantin Ananyev
@ 2025-12-15 16:14 ` Morten Brørup
2025-12-19 17:08 ` Konstantin Ananyev
0 siblings, 1 reply; 28+ messages in thread
From: Morten Brørup @ 2025-12-15 16:14 UTC (permalink / raw)
To: Konstantin Ananyev, dev, techboard, bruce.richardson
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Monday, 15 December 2025 15.41
> must
> > hold the only reference to that pinned external buffer, i.e. in that
> case,
> > 'm->shinfo->refcnt' must be 1.
> >
> >
> > Please review.
> >
> > If I'm not mistaken, the mbuf library is not a barrier for fast-
> freeing
> > segmented packet mbufs, and thus fast-free of jumbo frames is
> possible.
> >
> > We need a driver developer to confirm that my suggested approach -
> > resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> > preparing the Tx descriptor - is viable.
>
> Great analysis, makes a lot of sense to me.
> Shall we then add a special API to make PMD maintainers' lives a bit
> easier:
> Something like rte_mbuf_fast_free_prep(mp, mb), that will optionally
> check
> that requirements outlined above are satisfied for given mbuf and
> also reset mbuf fields to expected values?
Good idea, Konstantin.
Detailed suggestion below.
Note that __rte_mbuf_raw_sanity_check_mp() is used to check the requirements after 'nb_segs' and 'next' have been initialized.
/**
* Reinitialize an mbuf for freeing back into the mempool.
*
* The caller must ensure that the mbuf comes from the specified mempool,
* is direct and only referred to by the caller (refcnt=1).
*
* This function is used by drivers implementing mbuf fast release in their transmit function.
* It is called when the transmit descriptor is initialized,
* so the driver can simply call rte_mbuf_raw_free()
* when the packet segment has been transmitted.
*
* @see RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
*
* @param mp
* The mempool to which the mbuf belongs.
* @param m
* The mbuf being reinitialized.
*/
static __rte_always_inline void
rte_mbuf_raw_prefree_seg(const struct rte_mempool *mp, struct rte_mbuf *m)
{
	if (m->nb_segs != 1)
		m->nb_segs = 1;
	if (m->next != NULL)
		m->next = NULL;

	__rte_mbuf_raw_sanity_check_mp(m, mp);
	rte_mbuf_history_mark(m, RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
}
/**
* Reinitialize a bulk of mbufs for freeing back into the mempool.
*
* The caller must ensure that the mbufs come from the specified mempool,
* are direct and only referred to by the caller (refcnt=1).
*
* This function is used by drivers implementing mbuf fast release in their transmit function.
* It is called when the transmit descriptors are initialized,
* so the driver can simply call rte_mbuf_raw_free_bulk()
* when the packet segments have been transmitted.
*
* @see RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
*
* @param mp
* The mempool to which the mbufs belong.
* @param mbufs
* Array of pointers to mbufs being reinitialized.
* The array must not contain NULL pointers.
* @param count
* Array size.
*/
static __rte_always_inline void
rte_mbuf_raw_prefree_seg_bulk(const struct rte_mempool *mp, struct rte_mbuf **mbufs, unsigned int count)
{
	for (unsigned int idx = 0; idx < count; idx++) {
		struct rte_mbuf *m = mbufs[idx];

		if (m->nb_segs != 1)
			m->nb_segs = 1;
		if (m->next != NULL)
			m->next = NULL;

		__rte_mbuf_raw_sanity_check_mp(m, mp);
	}
	rte_mbuf_history_mark_bulk(mbufs, count, RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
}
> Konstantin
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: mbuf fast-free requirements analysis
2025-12-15 16:14 ` Morten Brørup
@ 2025-12-19 17:08 ` Konstantin Ananyev
2025-12-20 7:33 ` Morten Brørup
0 siblings, 1 reply; 28+ messages in thread
From: Konstantin Ananyev @ 2025-12-19 17:08 UTC (permalink / raw)
To: Morten Brørup, dev@dpdk.org, techboard@dpdk.org,
bruce.richardson@intel.com
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Monday, 15 December 2025 15.41
> >
> > >
> > > Executive Summary:
> > >
> > > My analysis shows that the mbuf library is not a barrier for fast-
> > freeing
> > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > possible.
> > >
> > >
> > > Detailed Analysis:
> > >
> > > The purpose of the mbuf fast-free Tx optimization is to reduce
> > > rte_pktmbuf_free_seg() to something much simpler in the ethdev
> > drivers, by
> > > eliminating the code path related to indirect mbufs.
> > > Optimally, we want to simplify the ethdev driver's function that
> > frees the
> > > transmitted mbufs, so it can free them directly to their mempool
> > without
> > > accessing the mbufs themselves.
> > >
> > > If the driver cannot access the mbuf itself, it cannot determine
> > which
> > > mempool it belongs to.
> > > We don't want the driver to access every mbuf being freed; but if all
> > > mbufs of a Tx queue belong to the same mempool, the driver can
> > determine
> > > which mempool by looking into just one of the mbufs.
> > >
> > > REQUIREMENT 1: The mbufs of a Tx queue must come from the same
> > mempool.
> > >
> > >
> > > When an mbuf is freed to its mempool, some of the fields in the mbuf
> > must
> > > be initialized.
> > > So, for fast-free, this must be done by the driver's function that
> > > prepares the Tx descriptor.
> > > This is a requirement to the driver, not a requirement to the
> > application.
> > >
> > > Now, let's dig into the code for freeing an mbuf.
> > > Note: For readability purposes, I'll cut out some code and comments
> > > unrelated to this topic.
> > >
> > > static __rte_always_inline void
> > > rte_pktmbuf_free_seg(struct rte_mbuf *m)
> > > {
> > > m = rte_pktmbuf_prefree_seg(m);
> > > if (likely(m != NULL))
> > > rte_mbuf_raw_free(m);
> > > }
> > >
> > >
> > > rte_mbuf_raw_free(m) is simple, so nothing to gain there:
> > >
> > > /**
> > > * Put mbuf back into its original mempool.
> > > *
> > > * The caller must ensure that the mbuf is direct and properly
> > > * reinitialized (refcnt=1, next=NULL, nb_segs=1), as done by
> > > * rte_pktmbuf_prefree_seg().
> > > */
> > > static __rte_always_inline void
> > > rte_mbuf_raw_free(struct rte_mbuf *m)
> > > {
> > > rte_mbuf_history_mark(m, RTE_MBUF_HISTORY_OP_LIB_FREE);
> > > rte_mempool_put(m->pool, m);
> > > }
> > >
> > > Note that the description says that the mbuf must be direct.
> > > This is not entirely accurate; the mbuf is allowed to use a pinned
> > > external buffer, if the mbuf holds the only reference to it.
> > > (Most of the mbuf library functions have this documentation
> > inaccuracy,
> > > which should be fixed some day.)
> > >
> > > So, the fast-free optimization really comes down to
> > > rte_pktmbuf_prefree_seg(m), which must not return NULL.
> > >
> > > Let's dig into that.
> > >
> > > /**
> > > * Decrease reference counter and unlink a mbuf segment
> > > *
> > > * This function does the same than a free, except that it does not
> > > * return the segment to its pool.
> > > * It decreases the reference counter, and if it reaches 0, it is
> > > * detached from its parent for an indirect mbuf.
> > > *
> > > * @return
> > > * - (m) if it is the last reference. It can be recycled or freed.
> > > * - (NULL) if the mbuf still has remaining references on it.
> > > */
> > > static __rte_always_inline struct rte_mbuf *
> > > rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > > {
> > > bool refcnt_not_one;
> > >
> > > refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> > > if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
> > > return NULL;
> > >
> > > if (unlikely(!RTE_MBUF_DIRECT(m))) {
> > > rte_pktmbuf_detach(m);
> > > if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > __rte_pktmbuf_pinned_extbuf_decref(m))
> > > return NULL;
> > > }
> > >
> > > if (refcnt_not_one)
> > > rte_mbuf_refcnt_set(m, 1);
> > > if (m->nb_segs != 1)
> > > m->nb_segs = 1;
> > > if (m->next != NULL)
> > > m->next = NULL;
> > >
> > > return m;
> > > }
> > >
> > > This function can only succeed (i.e. return non-NULL) when 'refcnt'
> > is 1
> > > (or reaches 0).
> > >
> > > REQUIREMENT 2: The driver must hold the only reference to the mbuf,
> > > i.e. 'm->refcnt' must be 1.
> > >
> > >
> > > When the function succeeds, it initializes the mbuf fields as
> > required by
> > > rte_mbuf_raw_free() before returning.
> > >
> > > Now, since the driver has exclusive access to the mbuf, it is free to
> > > initialize the 'm->next' and 'm->nb_segs' at any time.
> > > It could do that when preparing the Tx descriptor.
> > >
> > > This is very interesting, because it means that fast-free does not
> > > prohibit segmented packets!
> > > (But the driver must have sufficient Tx descriptors for all segments
> > in
> > > the mbuf.)
> > >
> > >
> > > Now, let's dig into rte_pktmbuf_prefree_seg()'s block handling non-
> > direct
> > > mbufs, i.e. cloned mbufs and mbufs with external buffer:
> > >
> > > if (unlikely(!RTE_MBUF_DIRECT(m))) {
> > > rte_pktmbuf_detach(m);
> > > if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > __rte_pktmbuf_pinned_extbuf_decref(m))
> > > return NULL;
> > > }
> > >
> > > Starting with rte_pktmbuf_detach():
> > >
> > > static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
> > > {
> > > struct rte_mempool *mp = m->pool;
> > > uint32_t mbuf_size, buf_len;
> > > uint16_t priv_size;
> > >
> > > if (RTE_MBUF_HAS_EXTBUF(m)) {
> > > /*
> > > * The mbuf has the external attached buffer,
> > > * we should check the type of the memory pool where
> > > * the mbuf was allocated from to detect the pinned
> > > * external buffer.
> > > */
> > > uint32_t flags = rte_pktmbuf_priv_flags(mp);
> > >
> > > if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
> > > /*
> > > * The pinned external buffer should not be
> > > * detached from its backing mbuf, just exit.
> > > */
> > > return;
> > > }
> > > __rte_pktmbuf_free_extbuf(m);
> > > } else {
> > > __rte_pktmbuf_free_direct(m);
> > > }
> > > priv_size = rte_pktmbuf_priv_size(mp);
> > > mbuf_size = (uint32_t)(sizeof(struct rte_mbuf) + priv_size);
> > > buf_len = rte_pktmbuf_data_room_size(mp);
> > >
> > > m->priv_size = priv_size;
> > > m->buf_addr = (char *)m + mbuf_size;
> > > rte_mbuf_iova_set(m, rte_mempool_virt2iova(m) + mbuf_size);
> > > m->buf_len = (uint16_t)buf_len;
> > > rte_pktmbuf_reset_headroom(m);
> > > m->data_len = 0;
> > > m->ol_flags = 0;
> > > }
> > >
> > > The only quick and simple code path through this function is when the
> > mbuf
> > > uses a pinned external buffer:
> > > if (RTE_MBUF_HAS_EXTBUF(m)) {
> > > uint32_t flags = rte_pktmbuf_priv_flags(mp);
> > > if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
> > > return;
> > >
> > > REQUIREMENT 3: The mbuf must not be cloned or use a non-pinned
> > external
> > > buffer.
> > >
> > >
> > > Continuing with the next part of rte_pktmbuf_prefree_seg()'s block:
> > > if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > __rte_pktmbuf_pinned_extbuf_decref(m))
> > > return NULL;
> > >
> > > Continuing with the next part of the block in
> > rte_pktmbuf_prefree_seg():
> > >
> > > /**
> > > * @internal Handle the packet mbufs with attached pinned external
> > buffer
> > > * on the mbuf freeing:
> > > *
> > > * - return zero if reference counter in shinfo is one. It means
> > there is
> > > * no more reference to this pinned buffer and mbuf can be returned
> > to
> > > * the pool
> > > *
> > > * - otherwise (if reference counter is not one), decrement
> > reference
> > > * counter and return non-zero value to prevent freeing the backing
> > mbuf.
> > > *
> > > * Returns non zero if mbuf should not be freed.
> > > */
> > > static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf
> > *m)
> > > {
> > > struct rte_mbuf_ext_shared_info *shinfo;
> > >
> > > /* Clear flags, mbuf is being freed. */
> > > m->ol_flags = RTE_MBUF_F_EXTERNAL;
> > > shinfo = m->shinfo;
> > >
> > > /* Optimize for performance - do not dec/reinit */
> > > if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1))
> > > return 0;
> > >
> > > /*
> > > * Direct usage of add primitive to avoid
> > > * duplication of comparing with one.
> > > */
> > > if (likely(rte_atomic_fetch_add_explicit(&shinfo->refcnt, -1,
> > > rte_memory_order_acq_rel) - 1))
> > > return 1;
> > >
> > > /* Reinitialize counter before mbuf freeing. */
> > > rte_mbuf_ext_refcnt_set(shinfo, 1);
> > > return 0;
> > > }
> > >
> > > Essentially, if the mbuf does use a pinned external buffer,
> > > rte_pktmbuf_prefree_seg() only succeeds if that pinned external
> > buffer is
> > > only referred to by the mbuf.
> > >
> > > REQUIREMENT 4: If the mbuf uses a pinned external buffer, the mbuf
> > must
> > > hold the only reference to that pinned external buffer, i.e. in that
> > case,
> > > 'm->shinfo->refcnt' must be 1.
> > >
> > >
> > > Please review.
> > >
> > > If I'm not mistaken, the mbuf library is not a barrier for fast-
> > freeing
> > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > possible.
> > >
> > > We need a driver developer to confirm that my suggested approach -
> > > resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> > > preparing the Tx descriptor - is viable.
> >
> > Great analysis, makes a lot of sense to me.
> > Shall we add then a special API to make PMD maintainers life a bit
> > easier:
> > Something like rte_mbuf_fast_free_prep(mp, mb), that will optionally
> > check
> > that requirements outlined above are satisfied for given mbuf and
> > also reset mbuf fields to expected values?
>
> Good idea, Konstantin.
>
> Detailed suggestion below.
> Note that __rte_mbuf_raw_sanity_check_mp() is used to check the requirements
> after 'nb_segs' and 'next' have been initialized.
>
> /**
> * Reinitialize an mbuf for freeing back into the mempool.
> *
> * The caller must ensure that the mbuf comes from the specified mempool,
> * is direct and only referred to by the caller (refcnt=1).
> *
> * This function is used by drivers in their transmit function for mbuf fast release
> * when the transmit descriptor is initialized,
> * so the driver can call rte_mbuf_raw_free()
> * when the packet segment has been transmitted.
> *
> * @see RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
> *
> * @param mp
> * The mempool to which the mbuf belongs.
> * @param m
> * The mbuf being reinitialized.
> */
> static __rte_always_inline void
> rte_mbuf_raw_prefree_seg(const struct rte_mempool *mp, struct rte_mbuf *m)
> {
> if (m->nb_segs != 1)
> m->nb_segs = 1;
> if (m->next != NULL)
> m->next = NULL;
>
> __rte_mbuf_raw_sanity_check_mp(m, mp);
> rte_mbuf_history_mark(m,
> RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
> }
Thanks Morten, though should we really panic if the condition is not met?
Might be better to just do the check first and return an error.
>
> /**
> * Reinitialize a bulk of mbufs for freeing back into the mempool.
> *
> * The caller must ensure that the mbufs come from the specified mempool,
> * are direct and only referred to by the caller (refcnt=1).
> *
> * This function is used by drivers in their transmit function for mbuf fast release
> * when the transmit descriptors are initialized,
> * so the driver can call rte_mbuf_raw_free_bulk()
> * when the packet segments have been transmitted.
> *
> * @see RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
> *
> * @param mp
> * The mempool to which the mbufs belong.
> * @param mbufs
> * Array of pointers to mbufs being reinitialized.
> * The array must not contain NULL pointers.
> * @param count
> * Array size.
> */
> static __rte_always_inline void
> rte_mbuf_raw_prefree_seg_bulk(const struct rte_mempool *mp, struct rte_mbuf
> **mbufs, unsigned int count)
> {
> for (unsigned int idx = 0; idx < count; idx++) {
> struct rte_mbuf *m = mbufs[idx];
>
> if (m->nb_segs != 1)
> m->nb_segs = 1;
> if (m->next != NULL)
> m->next = NULL;
>
> __rte_mbuf_raw_sanity_check_mp(m, mp);
> }
> rte_mbuf_history_mark_bulk(mbufs, count,
> RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
> }
>
> > Konstantin
> >
> >
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: mbuf fast-free requirements analysis
2025-12-19 17:08 ` Konstantin Ananyev
@ 2025-12-20 7:33 ` Morten Brørup
2025-12-22 15:22 ` Konstantin Ananyev
0 siblings, 1 reply; 28+ messages in thread
From: Morten Brørup @ 2025-12-20 7:33 UTC (permalink / raw)
To: Konstantin Ananyev, dev, techboard, bruce.richardson
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Friday, 19 December 2025 18.08
> > > >
> > > > Executive Summary:
> > > >
> > > > My analysis shows that the mbuf library is not a barrier for
> fast-
> > > freeing
> > > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > > possible.
> > > >
> > > >
> > > > Detailed Analysis:
> > > >
> > > > The purpose of the mbuf fast-free Tx optimization is to reduce
> > > > rte_pktmbuf_free_seg() to something much simpler in the ethdev
> > > drivers, by
> > > > eliminating the code path related to indirect mbufs.
> > > > Optimally, we want to simplify the ethdev driver's function that
> > > frees the
> > > > transmitted mbufs, so it can free them directly to their mempool
> > > without
> > > > accessing the mbufs themselves.
> > > >
> > > > If the driver cannot access the mbuf itself, it cannot determine
> > > which
> > > > mempool it belongs to.
> > > > We don't want the driver to access every mbuf being freed; but if
> all
> > > > mbufs of a Tx queue belong to the same mempool, the driver can
> > > determine
> > > > which mempool by looking into just one of the mbufs.
> > > >
> > > > REQUIREMENT 1: The mbufs of a Tx queue must come from the same
> > > mempool.
> > > >
> > > >
> > > > When an mbuf is freed to its mempool, some of the fields in the
> mbuf
> > > must
> > > > be initialized.
> > > > So, for fast-free, this must be done by the driver's function
> that
> > > > prepares the Tx descriptor.
> > > > This is a requirement to the driver, not a requirement to the
> > > application.
> > > >
> > > > Now, let's dig into the code for freeing an mbuf.
> > > > Note: For readability purposes, I'll cut out some code and
> comments
> > > > unrelated to this topic.
> > > >
> > > > static __rte_always_inline void
> > > > rte_pktmbuf_free_seg(struct rte_mbuf *m)
> > > > {
> > > > m = rte_pktmbuf_prefree_seg(m);
> > > > if (likely(m != NULL))
> > > > rte_mbuf_raw_free(m);
> > > > }
> > > >
> > > >
> > > > rte_mbuf_raw_free(m) is simple, so nothing to gain there:
> > > >
> > > > /**
> > > > * Put mbuf back into its original mempool.
> > > > *
> > > > * The caller must ensure that the mbuf is direct and properly
> > > > * reinitialized (refcnt=1, next=NULL, nb_segs=1), as done by
> > > > * rte_pktmbuf_prefree_seg().
> > > > */
> > > > static __rte_always_inline void
> > > > rte_mbuf_raw_free(struct rte_mbuf *m)
> > > > {
> > > > rte_mbuf_history_mark(m, RTE_MBUF_HISTORY_OP_LIB_FREE);
> > > > rte_mempool_put(m->pool, m);
> > > > }
> > > >
> > > > Note that the description says that the mbuf must be direct.
> > > > This is not entirely accurate; the mbuf is allowed to use a
> pinned
> > > > external buffer, if the mbuf holds the only reference to it.
> > > > (Most of the mbuf library functions have this documentation
> > > inaccuracy,
> > > > which should be fixed some day.)
> > > >
> > > > So, the fast-free optimization really comes down to
> > > > rte_pktmbuf_prefree_seg(m), which must not return NULL.
> > > >
> > > > Let's dig into that.
> > > >
> > > > /**
> > > > * Decrease reference counter and unlink a mbuf segment
> > > > *
> > > > * This function does the same than a free, except that it does
> not
> > > > * return the segment to its pool.
> > > > * It decreases the reference counter, and if it reaches 0, it is
> > > > * detached from its parent for an indirect mbuf.
> > > > *
> > > > * @return
> > > > * - (m) if it is the last reference. It can be recycled or
> freed.
> > > > * - (NULL) if the mbuf still has remaining references on it.
> > > > */
> > > > static __rte_always_inline struct rte_mbuf *
> > > > rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > > > {
> > > > bool refcnt_not_one;
> > > >
> > > > refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> > > > if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
> > > > return NULL;
> > > >
> > > > if (unlikely(!RTE_MBUF_DIRECT(m))) {
> > > > rte_pktmbuf_detach(m);
> > > > if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > > RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > > __rte_pktmbuf_pinned_extbuf_decref(m))
> > > > return NULL;
> > > > }
> > > >
> > > > if (refcnt_not_one)
> > > > rte_mbuf_refcnt_set(m, 1);
> > > > if (m->nb_segs != 1)
> > > > m->nb_segs = 1;
> > > > if (m->next != NULL)
> > > > m->next = NULL;
> > > >
> > > > return m;
> > > > }
> > > >
> > > > This function can only succeed (i.e. return non-NULL) when
> 'refcnt'
> > > is 1
> > > > (or reaches 0).
> > > >
> > > > REQUIREMENT 2: The driver must hold the only reference to the
> mbuf,
> > > > i.e. 'm->refcnt' must be 1.
> > > >
> > > >
> > > > When the function succeeds, it initializes the mbuf fields as
> > > required by
> > > > rte_mbuf_raw_free() before returning.
> > > >
> > > > Now, since the driver has exclusive access to the mbuf, it is
> free to
> > > > initialize the 'm->next' and 'm->nb_segs' at any time.
> > > > It could do that when preparing the Tx descriptor.
> > > >
> > > > This is very interesting, because it means that fast-free does
> not
> > > > prohibit segmented packets!
> > > > (But the driver must have sufficient Tx descriptors for all
> segments
> > > in
> > > > the mbuf.)
> > > >
> > > >
> > > > Now, let's dig into rte_pktmbuf_prefree_seg()'s block handling
> non-
> > > direct
> > > > mbufs, i.e. cloned mbufs and mbufs with external buffer:
> > > >
> > > > if (unlikely(!RTE_MBUF_DIRECT(m))) {
> > > > rte_pktmbuf_detach(m);
> > > > if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > > RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > > __rte_pktmbuf_pinned_extbuf_decref(m))
> > > > return NULL;
> > > > }
> > > >
> > > > Starting with rte_pktmbuf_detach():
> > > >
> > > > static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
> > > > {
> > > > struct rte_mempool *mp = m->pool;
> > > > uint32_t mbuf_size, buf_len;
> > > > uint16_t priv_size;
> > > >
> > > > if (RTE_MBUF_HAS_EXTBUF(m)) {
> > > > /*
> > > > * The mbuf has the external attached buffer,
> > > > * we should check the type of the memory pool where
> > > > * the mbuf was allocated from to detect the pinned
> > > > * external buffer.
> > > > */
> > > > uint32_t flags = rte_pktmbuf_priv_flags(mp);
> > > >
> > > > if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
> > > > /*
> > > > * The pinned external buffer should not be
> > > > * detached from its backing mbuf, just exit.
> > > > */
> > > > return;
> > > > }
> > > > __rte_pktmbuf_free_extbuf(m);
> > > > } else {
> > > > __rte_pktmbuf_free_direct(m);
> > > > }
> > > > priv_size = rte_pktmbuf_priv_size(mp);
> > > > mbuf_size = (uint32_t)(sizeof(struct rte_mbuf) +
> priv_size);
> > > > buf_len = rte_pktmbuf_data_room_size(mp);
> > > >
> > > > m->priv_size = priv_size;
> > > > m->buf_addr = (char *)m + mbuf_size;
> > > > rte_mbuf_iova_set(m, rte_mempool_virt2iova(m) + mbuf_size);
> > > > m->buf_len = (uint16_t)buf_len;
> > > > rte_pktmbuf_reset_headroom(m);
> > > > m->data_len = 0;
> > > > m->ol_flags = 0;
> > > > }
> > > >
> > > > The only quick and simple code path through this function is when
> the
> > > mbuf
> > > > uses a pinned external buffer:
> > > > if (RTE_MBUF_HAS_EXTBUF(m)) {
> > > > uint32_t flags = rte_pktmbuf_priv_flags(mp);
> > > > if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
> > > > return;
> > > >
> > > > REQUIREMENT 3: The mbuf must not be cloned or use a non-pinned
> > > external
> > > > buffer.
> > > >
> > > >
> > > > Continuing with the next part of rte_pktmbuf_prefree_seg()'s
> block:
> > > > if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > > RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > > __rte_pktmbuf_pinned_extbuf_decref(m))
> > > > return NULL;
> > > >
> > > > Continuing with the next part of the block in
> > > rte_pktmbuf_prefree_seg():
> > > >
> > > > /**
> > > > * @internal Handle the packet mbufs with attached pinned
> external
> > > buffer
> > > > * on the mbuf freeing:
> > > > *
> > > > * - return zero if reference counter in shinfo is one. It means
> > > there is
> > > > * no more reference to this pinned buffer and mbuf can be
> returned
> > > to
> > > > * the pool
> > > > *
> > > > * - otherwise (if reference counter is not one), decrement
> > > reference
> > > > * counter and return non-zero value to prevent freeing the
> backing
> > > mbuf.
> > > > *
> > > > * Returns non zero if mbuf should not be freed.
> > > > */
> > > > static inline int __rte_pktmbuf_pinned_extbuf_decref(struct
> rte_mbuf
> > > *m)
> > > > {
> > > > struct rte_mbuf_ext_shared_info *shinfo;
> > > >
> > > > /* Clear flags, mbuf is being freed. */
> > > > m->ol_flags = RTE_MBUF_F_EXTERNAL;
> > > > shinfo = m->shinfo;
> > > >
> > > > /* Optimize for performance - do not dec/reinit */
> > > > if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1))
> > > > return 0;
> > > >
> > > > /*
> > > > * Direct usage of add primitive to avoid
> > > > * duplication of comparing with one.
> > > > */
> > > > if (likely(rte_atomic_fetch_add_explicit(&shinfo->refcnt, -
> 1,
> > > > rte_memory_order_acq_rel) - 1))
> > > > return 1;
> > > >
> > > > /* Reinitialize counter before mbuf freeing. */
> > > > rte_mbuf_ext_refcnt_set(shinfo, 1);
> > > > return 0;
> > > > }
> > > >
> > > > Essentially, if the mbuf does use a pinned external buffer,
> > > > rte_pktmbuf_prefree_seg() only succeeds if that pinned external
> > > buffer is
> > > > only referred to by the mbuf.
> > > >
> > > > REQUIREMENT 4: If the mbuf uses a pinned external buffer, the
> mbuf
> > > must
> > > > hold the only reference to that pinned external buffer, i.e. in
> that
> > > case,
> > > > 'm->shinfo->refcnt' must be 1.
> > > >
> > > >
> > > > Please review.
> > > >
> > > > If I'm not mistaken, the mbuf library is not a barrier for fast-
> > > freeing
> > > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > > possible.
> > > >
> > > > We need a driver developer to confirm that my suggested approach
> -
> > > > resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> > > > preparing the Tx descriptor - is viable.
> > >
> > > Great analysis, makes a lot of sense to me.
> > > Shall we add then a special API to make PMD maintainers life a bit
> > > easier:
> > > Something like rte_mbuf_fast_free_prep(mp, mb), that will
> optionally
> > > check
> > > that requirements outlined above are satisfied for given mbuf and
> > > also reset mbuf fields to expected values?
> >
> > Good idea, Konstantin.
> >
> > Detailed suggestion below.
> > Note that __rte_mbuf_raw_sanity_check_mp() is used to check the
> requirements
> > after 'nb_segs' and 'next' have been initialized.
> >
> > /**
> > * Reinitialize an mbuf for freeing back into the mempool.
> > *
> > * The caller must ensure that the mbuf comes from the specified
> mempool,
> > * is direct and only referred to by the caller (refcnt=1).
> > *
> > * This function is used by drivers in their transmit function for
> mbuf fast release
> > * when the transmit descriptor is initialized,
> > * so the driver can call rte_mbuf_raw_free()
> > * when the packet segment has been transmitted.
> > *
> > * @see RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
> > *
> > * @param mp
> > * The mempool to which the mbuf belongs.
> > * @param m
> > * The mbuf being reinitialized.
> > */
> > static __rte_always_inline void
> > rte_mbuf_raw_prefree_seg(const struct rte_mempool *mp, struct
> rte_mbuf *m)
> > {
> > if (m->nb_segs != 1)
> > m->nb_segs = 1;
> > if (m->next != NULL)
> > m->next = NULL;
> >
> > __rte_mbuf_raw_sanity_check_mp(m, mp);
> > rte_mbuf_history_mark(m,
> > RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
> > }
>
> Thanks Morten, though should we really panic if the condition is not met?
> Might be better to just do the check first and return an error.
__rte_mbuf_raw_sanity_check_mp() is a no-op unless RTE_LIBRTE_MBUF_DEBUG is enabled.
Using it everywhere in alloc/free in the mbuf library is the convention.
And if we don't do it here, the __rte_mbuf_raw_sanity_check_mp() in rte_mbuf_raw_free() will panic later instead.
Better to fail early.
>
> >
> > /**
> > * Reinitialize a bulk of mbufs for freeing back into the mempool.
> > *
> > * The caller must ensure that the mbufs come from the specified
> mempool,
> > * are direct and only referred to by the caller (refcnt=1).
> > *
> > * This function is used by drivers in their transmit function for
> mbuf fast release
> > * when the transmit descriptors are initialized,
> > * so the driver can call rte_mbuf_raw_free_bulk()
> > * when the packet segments have been transmitted.
> > *
> > * @see RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
> > *
> > * @param mp
> > * The mempool to which the mbufs belong.
> > * @param mbufs
> > * Array of pointers to mbufs being reinitialized.
> > * The array must not contain NULL pointers.
> > * @param count
> > * Array size.
> > */
> > static __rte_always_inline void
> > rte_mbuf_raw_prefree_seg_bulk(const struct rte_mempool *mp, struct
> rte_mbuf
> > **mbufs, unsigned int count)
> > {
> > for (unsigned int idx = 0; idx < count; idx++) {
> > struct rte_mbuf *m = mbufs[idx];
> >
> > if (m->nb_segs != 1)
> > m->nb_segs = 1;
> > if (m->next != NULL)
> > m->next = NULL;
> >
> > __rte_mbuf_raw_sanity_check_mp(m, mp);
> > }
> > rte_mbuf_history_mark_bulk(mbufs, count,
> > RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
> > }
> >
> > > Konstantin
> > >
> > >
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: mbuf fast-free requirements analysis
2025-12-20 7:33 ` Morten Brørup
@ 2025-12-22 15:22 ` Konstantin Ananyev
2025-12-22 17:11 ` Morten Brørup
0 siblings, 1 reply; 28+ messages in thread
From: Konstantin Ananyev @ 2025-12-22 15:22 UTC (permalink / raw)
To: Morten Brørup, dev@dpdk.org, techboard@dpdk.org,
bruce.richardson@intel.com
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Friday, 19 December 2025 18.08
> > > > >
> > > > > Executive Summary:
> > > > >
> > > > > My analysis shows that the mbuf library is not a barrier for
> > fast-
> > > > freeing
> > > > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > > > possible.
> > > > >
> > > > >
> > > > > Detailed Analysis:
> > > > >
> > > > > The purpose of the mbuf fast-free Tx optimization is to reduce
> > > > > rte_pktmbuf_free_seg() to something much simpler in the ethdev
> > > > drivers, by
> > > > > eliminating the code path related to indirect mbufs.
> > > > > Optimally, we want to simplify the ethdev driver's function that
> > > > frees the
> > > > > transmitted mbufs, so it can free them directly to their mempool
> > > > without
> > > > > accessing the mbufs themselves.
> > > > >
> > > > > If the driver cannot access the mbuf itself, it cannot determine
> > > > which
> > > > > mempool it belongs to.
> > > > > We don't want the driver to access every mbuf being freed; but if
> > all
> > > > > mbufs of a Tx queue belong to the same mempool, the driver can
> > > > determine
> > > > > which mempool by looking into just one of the mbufs.
> > > > >
> > > > > REQUIREMENT 1: The mbufs of a Tx queue must come from the same
> > > > mempool.
> > > > >
> > > > >
> > > > > When an mbuf is freed to its mempool, some of the fields in the
> > mbuf
> > > > must
> > > > > be initialized.
> > > > > So, for fast-free, this must be done by the driver's function
> > that
> > > > > prepares the Tx descriptor.
> > > > > This is a requirement to the driver, not a requirement to the
> > > > application.
> > > > >
> > > > > Now, let's dig into the code for freeing an mbuf.
> > > > > Note: For readability purposes, I'll cut out some code and
> > comments
> > > > > unrelated to this topic.
> > > > >
> > > > > static __rte_always_inline void
> > > > > rte_pktmbuf_free_seg(struct rte_mbuf *m)
> > > > > {
> > > > > m = rte_pktmbuf_prefree_seg(m);
> > > > > if (likely(m != NULL))
> > > > > rte_mbuf_raw_free(m);
> > > > > }
> > > > >
> > > > >
> > > > > rte_mbuf_raw_free(m) is simple, so nothing to gain there:
> > > > >
> > > > > /**
> > > > > * Put mbuf back into its original mempool.
> > > > > *
> > > > > * The caller must ensure that the mbuf is direct and properly
> > > > > * reinitialized (refcnt=1, next=NULL, nb_segs=1), as done by
> > > > > * rte_pktmbuf_prefree_seg().
> > > > > */
> > > > > static __rte_always_inline void
> > > > > rte_mbuf_raw_free(struct rte_mbuf *m)
> > > > > {
> > > > > rte_mbuf_history_mark(m, RTE_MBUF_HISTORY_OP_LIB_FREE);
> > > > > rte_mempool_put(m->pool, m);
> > > > > }
> > > > >
> > > > > Note that the description says that the mbuf must be direct.
> > > > > This is not entirely accurate; the mbuf is allowed to use a
> > pinned
> > > > > external buffer, if the mbuf holds the only reference to it.
> > > > > (Most of the mbuf library functions have this documentation
> > > > inaccuracy,
> > > > > which should be fixed some day.)
> > > > >
> > > > > So, the fast-free optimization really comes down to
> > > > > rte_pktmbuf_prefree_seg(m), which must not return NULL.
> > > > >
> > > > > Let's dig into that.
> > > > >
> > > > > /**
> > > > > * Decrease reference counter and unlink a mbuf segment
> > > > > *
> > > > > * This function does the same as a free, except that it does
> > not
> > > > > * return the segment to its pool.
> > > > > * It decreases the reference counter, and if it reaches 0, it is
> > > > > * detached from its parent for an indirect mbuf.
> > > > > *
> > > > > * @return
> > > > > * - (m) if it is the last reference. It can be recycled or
> > freed.
> > > > > * - (NULL) if the mbuf still has remaining references on it.
> > > > > */
> > > > > static __rte_always_inline struct rte_mbuf *
> > > > > rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > > > > {
> > > > > bool refcnt_not_one;
> > > > >
> > > > > refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> > > > > if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
> > > > > return NULL;
> > > > >
> > > > > if (unlikely(!RTE_MBUF_DIRECT(m))) {
> > > > > rte_pktmbuf_detach(m);
> > > > > if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > > > RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > > > __rte_pktmbuf_pinned_extbuf_decref(m))
> > > > > return NULL;
> > > > > }
> > > > >
> > > > > if (refcnt_not_one)
> > > > > rte_mbuf_refcnt_set(m, 1);
> > > > > if (m->nb_segs != 1)
> > > > > m->nb_segs = 1;
> > > > > if (m->next != NULL)
> > > > > m->next = NULL;
> > > > >
> > > > > return m;
> > > > > }
> > > > >
> > > > > This function can only succeed (i.e. return non-NULL) when
> > 'refcnt'
> > > > is 1
> > > > > (or reaches 0).
> > > > >
> > > > > REQUIREMENT 2: The driver must hold the only reference to the
> > mbuf,
> > > > > i.e. 'm->refcnt' must be 1.
> > > > >
> > > > >
> > > > > When the function succeeds, it initializes the mbuf fields as
> > > > required by
> > > > > rte_mbuf_raw_free() before returning.
> > > > >
> > > > > Now, since the driver has exclusive access to the mbuf, it is
> > free to
> > > > > initialize the 'm->next' and 'm->nb_segs' at any time.
> > > > > It could do that when preparing the Tx descriptor.
> > > > >
> > > > > This is very interesting, because it means that fast-free does
> > not
> > > > > prohibit segmented packets!
> > > > > (But the driver must have sufficient Tx descriptors for all
> > segments
> > > > in
> > > > > the mbuf.)
> > > > >
> > > > >
> > > > > Now, let's dig into rte_pktmbuf_prefree_seg()'s block handling
> > non-
> > > > direct
> > > > > mbufs, i.e. cloned mbufs and mbufs with external buffer:
> > > > >
> > > > > if (unlikely(!RTE_MBUF_DIRECT(m))) {
> > > > > rte_pktmbuf_detach(m);
> > > > > if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > > > RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > > > __rte_pktmbuf_pinned_extbuf_decref(m))
> > > > > return NULL;
> > > > > }
> > > > >
> > > > > Starting with rte_pktmbuf_detach():
> > > > >
> > > > > static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
> > > > > {
> > > > > struct rte_mempool *mp = m->pool;
> > > > > uint32_t mbuf_size, buf_len;
> > > > > uint16_t priv_size;
> > > > >
> > > > > if (RTE_MBUF_HAS_EXTBUF(m)) {
> > > > > /*
> > > > > * The mbuf has the external attached buffer,
> > > > > * we should check the type of the memory pool where
> > > > > * the mbuf was allocated from to detect the pinned
> > > > > * external buffer.
> > > > > */
> > > > > uint32_t flags = rte_pktmbuf_priv_flags(mp);
> > > > >
> > > > > if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
> > > > > /*
> > > > > * The pinned external buffer should not be
> > > > > * detached from its backing mbuf, just exit.
> > > > > */
> > > > > return;
> > > > > }
> > > > > __rte_pktmbuf_free_extbuf(m);
> > > > > } else {
> > > > > __rte_pktmbuf_free_direct(m);
> > > > > }
> > > > > priv_size = rte_pktmbuf_priv_size(mp);
> > > > > mbuf_size = (uint32_t)(sizeof(struct rte_mbuf) +
> > priv_size);
> > > > > buf_len = rte_pktmbuf_data_room_size(mp);
> > > > >
> > > > > m->priv_size = priv_size;
> > > > > m->buf_addr = (char *)m + mbuf_size;
> > > > > rte_mbuf_iova_set(m, rte_mempool_virt2iova(m) + mbuf_size);
> > > > > m->buf_len = (uint16_t)buf_len;
> > > > > rte_pktmbuf_reset_headroom(m);
> > > > > m->data_len = 0;
> > > > > m->ol_flags = 0;
> > > > > }
> > > > >
> > > > > The only quick and simple code path through this function is when
> > the
> > > > mbuf
> > > > > uses a pinned external buffer:
> > > > > if (RTE_MBUF_HAS_EXTBUF(m)) {
> > > > > uint32_t flags = rte_pktmbuf_priv_flags(mp);
> > > > > if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
> > > > > return;
> > > > >
> > > > > REQUIREMENT 3: The mbuf must not be cloned or use a non-pinned
> > > > external
> > > > > buffer.
> > > > >
> > > > >
> > > > > Continuing with the next part of rte_pktmbuf_prefree_seg()'s
> > block:
> > > > > if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > > > RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > > > __rte_pktmbuf_pinned_extbuf_decref(m))
> > > > > return NULL;
> > > > >
> > > > > Continuing with the next part of the block in
> > > > rte_pktmbuf_prefree_seg():
> > > > >
> > > > > /**
> > > > > * @internal Handle the packet mbufs with attached pinned
> > external
> > > > buffer
> > > > > * on the mbuf freeing:
> > > > > *
> > > > > * - return zero if reference counter in shinfo is one. It means
> > > > there is
> > > > > * no more reference to this pinned buffer and mbuf can be
> > returned
> > > > to
> > > > > * the pool
> > > > > *
> > > > > * - otherwise (if reference counter is not one), decrement
> > > > reference
> > > > > * counter and return non-zero value to prevent freeing the
> > backing
> > > > mbuf.
> > > > > *
> > > > > * Returns non zero if mbuf should not be freed.
> > > > > */
> > > > > static inline int __rte_pktmbuf_pinned_extbuf_decref(struct
> > rte_mbuf
> > > > *m)
> > > > > {
> > > > > struct rte_mbuf_ext_shared_info *shinfo;
> > > > >
> > > > > /* Clear flags, mbuf is being freed. */
> > > > > m->ol_flags = RTE_MBUF_F_EXTERNAL;
> > > > > shinfo = m->shinfo;
> > > > >
> > > > > /* Optimize for performance - do not dec/reinit */
> > > > > if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1))
> > > > > return 0;
> > > > >
> > > > > /*
> > > > > * Direct usage of add primitive to avoid
> > > > > * duplication of comparing with one.
> > > > > */
> > > > > if (likely(rte_atomic_fetch_add_explicit(&shinfo->refcnt, -
> > 1,
> > > > > rte_memory_order_acq_rel) - 1))
> > > > > return 1;
> > > > >
> > > > > /* Reinitialize counter before mbuf freeing. */
> > > > > rte_mbuf_ext_refcnt_set(shinfo, 1);
> > > > > return 0;
> > > > > }
> > > > >
> > > > > Essentially, if the mbuf does use a pinned external buffer,
> > > > > rte_pktmbuf_prefree_seg() only succeeds if that pinned external
> > > > buffer is
> > > > > only referred to by the mbuf.
> > > > >
> > > > > REQUIREMENT 4: If the mbuf uses a pinned external buffer, the
> > mbuf
> > > > must
> > > > > hold the only reference to that pinned external buffer, i.e. in
> > that
> > > > case,
> > > > > 'm->shinfo->refcnt' must be 1.
> > > > >
> > > > >
> > > > > Please review.
> > > > >
> > > > > If I'm not mistaken, the mbuf library is not a barrier for fast-
> > > > freeing
> > > > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > > > possible.
> > > > >
> > > > > We need a driver developer to confirm that my suggested approach
> > -
> > > > > resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> > > > > preparing the Tx descriptor - is viable.
> > > >
> > > > Great analysis, makes a lot of sense to me.
> > > > Shall we then add a special API to make PMD maintainers' life a bit
> > > > easier:
> > > > Something like rte_mbuf_fast_free_prep(mp, mb), that will
> > optionally
> > > > check
> > > > that requirements outlined above are satisfied for given mbuf and
> > > > also reset mbuf fields to expected values?
> > >
> > > Good idea, Konstantin.
> > >
> > > Detailed suggestion below.
> > > Note that __rte_mbuf_raw_sanity_check_mp() is used to check the
> > requirements
> > > after 'nb_segs' and 'next' have been initialized.
> > >
> > > /**
> > > * Reinitialize an mbuf for freeing back into the mempool.
> > > *
> > > * The caller must ensure that the mbuf comes from the specified
> > mempool,
> > > * is direct and only referred to by the caller (refcnt=1).
> > > *
> > > * This function is used by drivers in their transmit function for
> > mbuf fast release
> > > * when the transmit descriptor is initialized,
> > > * so the driver can call rte_mbuf_raw_free()
> > > * when the packet segment has been transmitted.
> > > *
> > > * @see RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
> > > *
> > > * @param mp
> > > * The mempool to which the mbuf belongs.
> > > * @param m
> > > * The mbuf being reinitialized.
> > > */
> > > static __rte_always_inline void
> > > rte_mbuf_raw_prefree_seg(const struct rte_mempool *mp, struct
> > rte_mbuf *m)
> > > {
> > > if (m->nb_segs != 1)
> > > m->nb_segs = 1;
> > > if (m->next != NULL)
> > > m->next = NULL;
> > >
> > > __rte_mbuf_raw_sanity_check_mp(m, mp);
> > > rte_mbuf_history_mark(m,
> > > RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
> > > }
> >
> > Thanks Morten, though should we really panic if the condition is not met?
> > Maybe just do the check first and return an error.
>
> __rte_mbuf_raw_sanity_check_mp() is a no-op unless
> RTE_LIBRTE_MBUF_DEBUG is enabled.
Yep, I noticed.
>
> Using it everywhere in alloc/free in the mbuf library is the convention.
>
> And if we don't do it here, the __rte_mbuf_raw_sanity_check_mp() in
> rte_mbuf_raw_free() will panic later instead.
Why?
This new routine (rte_mbuf_raw_prefree_seg) can check that the mbuf satisfies the fast-free criteria
before updating it, and if the conditions are not met, simply return an error.
Then the driver has several options:
1) Drop the packet silently
2) Refuse to send it
3) Switch to some slower but always working code-path (without fast-free)
4) Panic
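For concreteness, a check-first variant along the lines of option 3 could look roughly like the sketch below. This is only a simplified, self-contained model: the struct and the function name mbuf_checked_prefree_seg are made up for illustration and are not real DPDK API. It verifies the fast-free criteria before touching the mbuf and returns an error instead of panicking:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct rte_mbuf; only the fields relevant
 * to the fast-free checks are modeled here. */
struct mbuf {
	void *pool;          /* mempool the mbuf was allocated from */
	struct mbuf *next;   /* next segment, NULL for the last one */
	uint16_t nb_segs;    /* number of segments (valid in first segment) */
	uint16_t refcnt;     /* reference counter */
};

/* Hypothetical error-returning variant of the proposed
 * rte_mbuf_raw_prefree_seg(): check the fast-free criteria first,
 * and return -1 instead of panicking when they are not met. */
static int
mbuf_checked_prefree_seg(const void *mp, struct mbuf *m)
{
	if (m->pool != mp || m->refcnt != 1)
		return -1;  /* caller falls back to the slow path */

	/* Criteria met; reinitialize the fields required for raw free. */
	if (m->nb_segs != 1)
		m->nb_segs = 1;
	if (m->next != NULL)
		m->next = NULL;
	return 0;
}
```

A driver using such a variant would branch to its ordinary (non-fast-free) completion path whenever the function returns -1, implementing option 3 above.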
> Better to fail early.
> >
> > >
> > > /**
> > > * Reinitialize a bulk of mbufs for freeing back into the mempool.
> > > *
> > > * The caller must ensure that the mbufs come from the specified
> > mempool,
> > > * are direct and only referred to by the caller (refcnt=1).
> > > *
> > > * This function is used by drivers in their transmit function for
> > mbuf fast release
> > > * when the transmit descriptors are initialized,
> > > * so the driver can call rte_mbuf_raw_free_bulk()
> > > * when the packet segments have been transmitted.
> > > *
> > > * @see RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
> > > *
> > > * @param mp
> > > * The mempool to which the mbufs belong.
> > > * @param mbufs
> > > * Array of pointers to mbufs being reinitialized.
> > > * The array must not contain NULL pointers.
> > > * @param count
> > > * Array size.
> > > */
> > > static __rte_always_inline void
> > > rte_mbuf_raw_prefree_seg_bulk(const struct rte_mempool *mp, struct
> > rte_mbuf
> > > **mbufs, unsigned int count)
> > > {
> > > for (unsigned int idx = 0; idx < count; idx++) {
> > > struct rte_mbuf *m = mbufs[idx];
> > >
> > > if (m->nb_segs != 1)
> > > m->nb_segs = 1;
> > > if (m->next != NULL)
> > > m->next = NULL;
> > >
> > > __rte_mbuf_raw_sanity_check_mp(m, mp);
> > > }
> > > rte_mbuf_history_mark_bulk(mbufs, count,
> > > RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
> > > }
> > >
> > > > Konstantin
> > > >
> > > >
>
* RE: mbuf fast-free requirements analysis
2025-12-22 15:22 ` Konstantin Ananyev
@ 2025-12-22 17:11 ` Morten Brørup
2025-12-22 17:43 ` Bruce Richardson
0 siblings, 1 reply; 28+ messages in thread
From: Morten Brørup @ 2025-12-22 17:11 UTC (permalink / raw)
To: Konstantin Ananyev, dev, techboard, bruce.richardson
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Monday, 22 December 2025 16.22
> > > > > >
> > > > > > Executive Summary:
> > > > > >
> > > > > > My analysis shows that the mbuf library is not a barrier for
> > > fast-
> > > > > freeing
> > > > > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > > > > possible.
> > > > > >
> > > > > >
> > > > > > Detailed Analysis:
> > > > > >
> > > > > > The purpose of the mbuf fast-free Tx optimization is to
> reduce
> > > > > > rte_pktmbuf_free_seg() to something much simpler in the
> ethdev
> > > > > drivers, by
> > > > > > eliminating the code path related to indirect mbufs.
> > > > > > Optimally, we want to simplify the ethdev driver's function
> that
> > > > > frees the
> > > > > > transmitted mbufs, so it can free them directly to their
> mempool
> > > > > without
> > > > > > accessing the mbufs themselves.
> > > > > >
> > > > > > If the driver cannot access the mbuf itself, it cannot
> determine
> > > > > which
> > > > > > mempool it belongs to.
> > > > > > We don't want the driver to access every mbuf being freed;
> but if
> > > all
> > > > > > mbufs of a Tx queue belong to the same mempool, the driver
> can
> > > > > determine
> > > > > > which mempool by looking into just one of the mbufs.
> > > > > >
> > > > > > REQUIREMENT 1: The mbufs of a Tx queue must come from the
> same
> > > > > mempool.
> > > > > >
> > > > > >
> > > > > > When an mbuf is freed to its mempool, some of the fields in
> the
> > > mbuf
> > > > > must
> > > > > > be initialized.
> > > > > > So, for fast-free, this must be done by the driver's function
> > > that
> > > > > > prepares the Tx descriptor.
> > > > > > This is a requirement to the driver, not a requirement to the
> > > > > application.
> > > > > >
> > > > > > Now, let's dig into the code for freeing an mbuf.
> > > > > > Note: For readability purposes, I'll cut out some code and
> > > comments
> > > > > > unrelated to this topic.
> > > > > >
> > > > > > static __rte_always_inline void
> > > > > > rte_pktmbuf_free_seg(struct rte_mbuf *m)
> > > > > > {
> > > > > > m = rte_pktmbuf_prefree_seg(m);
> > > > > > if (likely(m != NULL))
> > > > > > rte_mbuf_raw_free(m);
> > > > > > }
> > > > > >
> > > > > >
> > > > > > rte_mbuf_raw_free(m) is simple, so nothing to gain there:
> > > > > >
> > > > > > /**
> > > > > > * Put mbuf back into its original mempool.
> > > > > > *
> > > > > > * The caller must ensure that the mbuf is direct and
> properly
> > > > > > * reinitialized (refcnt=1, next=NULL, nb_segs=1), as done by
> > > > > > * rte_pktmbuf_prefree_seg().
> > > > > > */
> > > > > > static __rte_always_inline void
> > > > > > rte_mbuf_raw_free(struct rte_mbuf *m)
> > > > > > {
> > > > > > rte_mbuf_history_mark(m, RTE_MBUF_HISTORY_OP_LIB_FREE);
> > > > > > rte_mempool_put(m->pool, m);
> > > > > > }
> > > > > >
> > > > > > Note that the description says that the mbuf must be direct.
> > > > > > This is not entirely accurate; the mbuf is allowed to use a
> > > pinned
> > > > > > external buffer, if the mbuf holds the only reference to it.
> > > > > > (Most of the mbuf library functions have this documentation
> > > > > inaccuracy,
> > > > > > which should be fixed some day.)
> > > > > >
> > > > > > So, the fast-free optimization really comes down to
> > > > > > rte_pktmbuf_prefree_seg(m), which must not return NULL.
> > > > > >
> > > > > > Let's dig into that.
> > > > > >
> > > > > > /**
> > > > > > * Decrease reference counter and unlink a mbuf segment
> > > > > > *
> > > > > > * This function does the same as a free, except that it
> does
> > > not
> > > > > > * return the segment to its pool.
> > > > > > * It decreases the reference counter, and if it reaches 0,
> it is
> > > > > > * detached from its parent for an indirect mbuf.
> > > > > > *
> > > > > > * @return
> > > > > > * - (m) if it is the last reference. It can be recycled or
> > > freed.
> > > > > > * - (NULL) if the mbuf still has remaining references on
> it.
> > > > > > */
> > > > > > static __rte_always_inline struct rte_mbuf *
> > > > > > rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > > > > > {
> > > > > > bool refcnt_not_one;
> > > > > >
> > > > > > refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> > > > > > if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
> > > > > > return NULL;
> > > > > >
> > > > > > if (unlikely(!RTE_MBUF_DIRECT(m))) {
> > > > > > rte_pktmbuf_detach(m);
> > > > > > if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > > > > RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > > > > __rte_pktmbuf_pinned_extbuf_decref(m))
> > > > > > return NULL;
> > > > > > }
> > > > > >
> > > > > > if (refcnt_not_one)
> > > > > > rte_mbuf_refcnt_set(m, 1);
> > > > > > if (m->nb_segs != 1)
> > > > > > m->nb_segs = 1;
> > > > > > if (m->next != NULL)
> > > > > > m->next = NULL;
> > > > > >
> > > > > > return m;
> > > > > > }
> > > > > >
> > > > > > This function can only succeed (i.e. return non-NULL) when
> > > 'refcnt'
> > > > > is 1
> > > > > > (or reaches 0).
> > > > > >
> > > > > > REQUIREMENT 2: The driver must hold the only reference to the
> > > mbuf,
> > > > > > i.e. 'm->refcnt' must be 1.
> > > > > >
> > > > > >
> > > > > > When the function succeeds, it initializes the mbuf fields as
> > > > > required by
> > > > > > rte_mbuf_raw_free() before returning.
> > > > > >
> > > > > > Now, since the driver has exclusive access to the mbuf, it is
> > > free to
> > > > > > initialize the 'm->next' and 'm->nb_segs' at any time.
> > > > > > It could do that when preparing the Tx descriptor.
> > > > > >
> > > > > > This is very interesting, because it means that fast-free
> does
> > > not
> > > > > > prohibit segmented packets!
> > > > > > (But the driver must have sufficient Tx descriptors for all
> > > segments
> > > > > in
> > > > > > the mbuf.)
> > > > > >
> > > > > >
> > > > > > Now, let's dig into rte_pktmbuf_prefree_seg()'s block handling
> > > non-
> > > > > direct
> > > > > > mbufs, i.e. cloned mbufs and mbufs with external buffer:
> > > > > >
> > > > > > if (unlikely(!RTE_MBUF_DIRECT(m))) {
> > > > > > rte_pktmbuf_detach(m);
> > > > > > if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > > > > RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > > > > __rte_pktmbuf_pinned_extbuf_decref(m))
> > > > > > return NULL;
> > > > > > }
> > > > > >
> > > > > > Starting with rte_pktmbuf_detach():
> > > > > >
> > > > > > static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
> > > > > > {
> > > > > > struct rte_mempool *mp = m->pool;
> > > > > > uint32_t mbuf_size, buf_len;
> > > > > > uint16_t priv_size;
> > > > > >
> > > > > > if (RTE_MBUF_HAS_EXTBUF(m)) {
> > > > > > /*
> > > > > > * The mbuf has the external attached buffer,
> > > > > > * we should check the type of the memory pool where
> > > > > > * the mbuf was allocated from to detect the pinned
> > > > > > * external buffer.
> > > > > > */
> > > > > > uint32_t flags = rte_pktmbuf_priv_flags(mp);
> > > > > >
> > > > > > if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
> > > > > > /*
> > > > > > * The pinned external buffer should not be
> > > > > > * detached from its backing mbuf, just exit.
> > > > > > */
> > > > > > return;
> > > > > > }
> > > > > > __rte_pktmbuf_free_extbuf(m);
> > > > > > } else {
> > > > > > __rte_pktmbuf_free_direct(m);
> > > > > > }
> > > > > > priv_size = rte_pktmbuf_priv_size(mp);
> > > > > > mbuf_size = (uint32_t)(sizeof(struct rte_mbuf) +
> > > priv_size);
> > > > > > buf_len = rte_pktmbuf_data_room_size(mp);
> > > > > >
> > > > > > m->priv_size = priv_size;
> > > > > > m->buf_addr = (char *)m + mbuf_size;
> > > > > > rte_mbuf_iova_set(m, rte_mempool_virt2iova(m) + mbuf_size);
> > > > > > m->buf_len = (uint16_t)buf_len;
> > > > > > rte_pktmbuf_reset_headroom(m);
> > > > > > m->data_len = 0;
> > > > > > m->ol_flags = 0;
> > > > > > }
> > > > > >
> > > > > > The only quick and simple code path through this function is
> when
> > > the
> > > > > mbuf
> > > > > > uses a pinned external buffer:
> > > > > > if (RTE_MBUF_HAS_EXTBUF(m)) {
> > > > > > uint32_t flags = rte_pktmbuf_priv_flags(mp);
> > > > > > if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
> > > > > > return;
> > > > > >
> > > > > > REQUIREMENT 3: The mbuf must not be cloned or use a non-
> pinned
> > > > > external
> > > > > > buffer.
> > > > > >
> > > > > >
> > > > > > Continuing with the next part of rte_pktmbuf_prefree_seg()'s
> > > block:
> > > > > > if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > > > > RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > > > > __rte_pktmbuf_pinned_extbuf_decref(m))
> > > > > > return NULL;
> > > > > >
> > > > > > Continuing with the next part of the block in
> > > > > rte_pktmbuf_prefree_seg():
> > > > > >
> > > > > > /**
> > > > > > * @internal Handle the packet mbufs with attached pinned
> > > external
> > > > > buffer
> > > > > > * on the mbuf freeing:
> > > > > > *
> > > > > > * - return zero if reference counter in shinfo is one. It
> means
> > > > > there is
> > > > > > * no more reference to this pinned buffer and mbuf can be
> > > returned
> > > > > to
> > > > > > * the pool
> > > > > > *
> > > > > > * - otherwise (if reference counter is not one), decrement
> > > > > reference
> > > > > > * counter and return non-zero value to prevent freeing the
> > > backing
> > > > > mbuf.
> > > > > > *
> > > > > > * Returns non zero if mbuf should not be freed.
> > > > > > */
> > > > > > static inline int __rte_pktmbuf_pinned_extbuf_decref(struct
> > > rte_mbuf
> > > > > *m)
> > > > > > {
> > > > > > struct rte_mbuf_ext_shared_info *shinfo;
> > > > > >
> > > > > > /* Clear flags, mbuf is being freed. */
> > > > > > m->ol_flags = RTE_MBUF_F_EXTERNAL;
> > > > > > shinfo = m->shinfo;
> > > > > >
> > > > > > /* Optimize for performance - do not dec/reinit */
> > > > > > if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1))
> > > > > > return 0;
> > > > > >
> > > > > > /*
> > > > > > * Direct usage of add primitive to avoid
> > > > > > * duplication of comparing with one.
> > > > > > */
> > > > > > if (likely(rte_atomic_fetch_add_explicit(&shinfo->refcnt, -
> > > 1,
> > > > > > rte_memory_order_acq_rel) - 1))
> > > > > > return 1;
> > > > > >
> > > > > > /* Reinitialize counter before mbuf freeing. */
> > > > > > rte_mbuf_ext_refcnt_set(shinfo, 1);
> > > > > > return 0;
> > > > > > }
> > > > > >
> > > > > > Essentially, if the mbuf does use a pinned external buffer,
> > > > > > rte_pktmbuf_prefree_seg() only succeeds if that pinned
> external
> > > > > buffer is
> > > > > > only referred to by the mbuf.
> > > > > >
> > > > > > REQUIREMENT 4: If the mbuf uses a pinned external buffer, the
> > > mbuf
> > > > > must
> > > > > > hold the only reference to that pinned external buffer, i.e.
> in
> > > that
> > > > > case,
> > > > > > 'm->shinfo->refcnt' must be 1.
> > > > > >
> > > > > >
> > > > > > Please review.
> > > > > >
> > > > > > If I'm not mistaken, the mbuf library is not a barrier for
> fast-
> > > > > freeing
> > > > > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > > > > possible.
> > > > > >
> > > > > > We need a driver developer to confirm that my suggested
> approach
> > > -
> > > > > > resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next',
> when
> > > > > > preparing the Tx descriptor - is viable.
> > > > >
> > > > > Great analysis, makes a lot of sense to me.
> > > > > Shall we then add a special API to make PMD maintainers' life a
> bit
> > > > > easier:
> > > > > Something like rte_mbuf_fast_free_prep(mp, mb), that will
> > > optionally
> > > > > check
> > > > > that requirements outlined above are satisfied for given mbuf
> and
> > > > > also reset mbuf fields to expected values?
> > > >
> > > > Good idea, Konstantin.
> > > >
> > > > Detailed suggestion below.
> > > > Note that __rte_mbuf_raw_sanity_check_mp() is used to check the
> > > requirements
> > > > after 'nb_segs' and 'next' have been initialized.
> > > >
> > > > /**
> > > > * Reinitialize an mbuf for freeing back into the mempool.
> > > > *
> > > > * The caller must ensure that the mbuf comes from the specified
> > > mempool,
> > > > * is direct and only referred to by the caller (refcnt=1).
> > > > *
> > > > * This function is used by drivers in their transmit function
> for
> > > mbuf fast release
> > > > * when the transmit descriptor is initialized,
> > > > * so the driver can call rte_mbuf_raw_free()
> > > > * when the packet segment has been transmitted.
> > > > *
> > > > * @see RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
> > > > *
> > > > * @param mp
> > > > * The mempool to which the mbuf belongs.
> > > > * @param m
> > > > * The mbuf being reinitialized.
> > > > */
> > > > static __rte_always_inline void
> > > > rte_mbuf_raw_prefree_seg(const struct rte_mempool *mp, struct
> > > rte_mbuf *m)
> > > > {
> > > > if (m->nb_segs != 1)
> > > > m->nb_segs = 1;
> > > > if (m->next != NULL)
> > > > m->next = NULL;
A comment could mention that when the driver calls this function, it has just loaded the mbuf to initialize the transmit descriptor, so m->nb_segs and m->next are already hot in the cache.
And no store operations on the mbuf itself will occur for non-segmented packets.
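To make the store-avoidance point concrete, here is a tiny self-contained model (not DPDK code; the struct name and the store counter are purely for illustration) showing that the conditional-store pattern leaves a non-segmented mbuf's cache line untouched:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for the two mbuf fields touched by the prefree step. */
struct seg_hdr {
	struct seg_hdr *next;
	uint16_t nb_segs;
};

static int store_count;  /* counts stores, to illustrate the point */

static void
prefree_model(struct seg_hdr *m)
{
	/* Store only when the field actually differs; for a non-segmented
	 * packet both tests fail, so no store is issued and the cache
	 * line is never dirtied. */
	if (m->nb_segs != 1) { m->nb_segs = 1; store_count++; }
	if (m->next != NULL) { m->next = NULL; store_count++; }
}
```

Since the driver has just loaded these fields while building the Tx descriptor, the loads in the tests are effectively free; only segmented packets pay for the stores.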
> > > >
> > > > __rte_mbuf_raw_sanity_check_mp(m, mp);
> > > > rte_mbuf_history_mark(m,
> > > > RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
> > > > }
> > >
> > > Thanks Morten, though should we really panic if the condition is not
> met?
> > > Maybe just do the check first and return an error.
> >
> > __rte_mbuf_raw_sanity_check_mp() is a no-op unless
> > RTE_LIBRTE_MBUF_DEBUG is enabled.
>
> Yep, I noticed.
>
> >
> > Using it everywhere in alloc/free in the mbuf library is the
> convention.
> >
> > And if we don't do it here, the __rte_mbuf_raw_sanity_check_mp() in
> > rte_mbuf_raw_free() will panic later instead.
>
> Why?
> This new routine (rte_mbuf_raw_prefree_seg) can check that the mbuf
> satisfies the fast-free criteria
> before updating it, and if the conditions are not met, simply return an
> error.
> Then the driver has several options:
> 1) Drop the packet silently
> 2) Refuse to send it
> 3) Switch to some slower but always working code-path (without fast-
> free)
> 4) Panic
It boils down to purpose.
The function is designed for use in the transmit code path designated for fast-free, where the application has promised/hinted that packets are fast-free compliant.
Violating preconditions in the fast path (by passing packets that do not comply with the fast-free requirements to this function) is a serious type of bug, for which DPDK usually doesn't provide a graceful fallback.
I don't want to make an exception (and introduce graceful fallback) for the designated fast-free code path.
My answer would be completely different if we were designing an API for standard packet transmission, whereby some packets living up to certain criteria could take a faster code path for being freed.
If that was the case, I would agree with you about returning a value to indicate how to proceed, like rte_pktmbuf_prefree_seg() does.
>
> > Better to fail early.
> > >
> > > >
> > > > /**
> > > > * Reinitialize a bulk of mbufs for freeing back into the
> mempool.
> > > > *
> > > > * The caller must ensure that the mbufs come from the specified
> > > mempool,
> > > > * are direct and only referred to by the caller (refcnt=1).
> > > > *
> > > > * This function is used by drivers in their transmit function
> for
> > > mbuf fast release
> > > > * when the transmit descriptors are initialized,
> > > > * so the driver can call rte_mbuf_raw_free_bulk()
> > > > * when the packet segments have been transmitted.
> > > > *
> > > > * @see RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
> > > > *
> > > > * @param mp
> > > > * The mempool to which the mbufs belong.
> > > > * @param mbufs
> > > > * Array of pointers to mbufs being reinitialized.
> > > > * The array must not contain NULL pointers.
> > > > * @param count
> > > > * Array size.
> > > > */
> > > > static __rte_always_inline void
> > > > rte_mbuf_raw_prefree_seg_bulk(const struct rte_mempool *mp,
> struct
> > > rte_mbuf
> > > > **mbufs, unsigned int count)
> > > > {
> > > > for (unsigned int idx = 0; idx < count; idx++) {
> > > > struct rte_mbuf *m = mbufs[idx];
> > > >
> > > > if (m->nb_segs != 1)
> > > > m->nb_segs = 1;
> > > > if (m->next != NULL)
> > > > m->next = NULL;
> > > >
> > > > __rte_mbuf_raw_sanity_check_mp(m, mp);
> > > > }
> > > > rte_mbuf_history_mark_bulk(mbufs, count,
> > > > RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
> > > > }
> > > >
> > > > > Konstantin
> > > > >
> > > > >
> >
* Re: mbuf fast-free requirements analysis
2025-12-22 17:11 ` Morten Brørup
@ 2025-12-22 17:43 ` Bruce Richardson
2026-01-13 14:48 ` Konstantin Ananyev
0 siblings, 1 reply; 28+ messages in thread
From: Bruce Richardson @ 2025-12-22 17:43 UTC (permalink / raw)
To: Morten Brørup; +Cc: Konstantin Ananyev, dev, techboard
On Mon, Dec 22, 2025 at 06:11:35PM +0100, Morten Brørup wrote:
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Monday, 22 December 2025 16.22
> > > > > > >
<snip>
> > > > >
> > > > > __rte_mbuf_raw_sanity_check_mp(m, mp);
> > > > > rte_mbuf_history_mark(m,
> > > > > RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
> > > > > }
> > > >
> > > > Thanks Morten, though should we really panic if the condition is not
> > met?
> > > > Maybe just do the check first and return an error.
> > >
> > > __rte_mbuf_raw_sanity_check_mp() is a no-op unless
> > > RTE_LIBRTE_MBUF_DEBUG is enabled.
> >
> > Yep, I noticed.
> >
> > >
> > > Using it everywhere in alloc/free in the mbuf library is the
> > convention.
> > >
> > > And if we don't do it here, the __rte_mbuf_raw_sanity_check_mp() in
> > > rte_mbuf_raw_free() will panic later instead.
> >
> > Why?
> > This new routine (rte_mbuf_raw_prefree_seg) can check that the mbuf
> > satisfies the fast-free criteria
> > before updating it, and if the conditions are not met, simply return an
> > error.
> > Then the driver has several options:
> > 1) Drop the packet silently
> > 2) Refuse to send it
> > 3) Switch to some slower but always working code-path (without fast-
> > free)
> > 4) Panic
>
> It boils down to purpose.
>
> The function is designed for use in the transmit code path designated for fast-free, where the application has promised/hinted that packets are fast-free compliant.
> Violating preconditions in the fast path (by passing packets not compliant to fast-free requirements to this function) is a serious type of bug, for which DPDK usually doesn't provide graceful fallback.
> I don't want to make an exception (and introduce graceful fallback) for the designated fast-free code path.
>
> My answer would be completely different if we were designing an API for standard packet transmission, whereby some packets living up to certain criteria could take a faster code path for being freed.
> If that was the case, I would agree with you about returning a value to indicate how to proceed, like rte_pktmbuf_prefree_seg() does.
>
I would tend to agree. The extra handling for those cases just expands the
code and adds more potential branches to the resulting binary. Lots of the
fastpath code just assumes you know what you are doing, and violating
constraints will lead to panics and segfaults generally. Therefore panicking
in this case doesn't seem a big deal to me.
/Bruce
* RE: mbuf fast-free requirements analysis
2025-12-22 17:43 ` Bruce Richardson
@ 2026-01-13 14:48 ` Konstantin Ananyev
2026-01-13 16:07 ` Stephen Hemminger
0 siblings, 1 reply; 28+ messages in thread
From: Konstantin Ananyev @ 2026-01-13 14:48 UTC (permalink / raw)
To: Bruce Richardson, Morten Brørup; +Cc: dev@dpdk.org, techboard@dpdk.org
> > > > > >
> > > > > > __rte_mbuf_raw_sanity_check_mp(m, mp);
> > > > > > rte_mbuf_history_mark(mbuf,
> > > > > > RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
> > > > > > }
> > > > >
> > > > > Thanks Morten, though should we really panic if the condition is not
> > > met?
> > > > > Maybe just do the check first and return an error.
> > > >
> > > > __rte_mbuf_raw_sanity_check_mp() is a no-op unless
> > > > RTE_LIBRTE_MBUF_DEBUG is enabled.
> > >
> > > Yep, I noticed.
> > >
> > > >
> > > > Using it everywhere in alloc/free in the mbuf library is the
> > > convention.
> > > >
> > > > And if we don't do it here, the __rte_mbuf_raw_sanity_check_mp() in
> > > > rte_mbuf_raw_free() will panic later instead.
> > >
> > > Why?
> > > This new routine (rte_mbuf_raw_prefree_seg) can check that mbuf
> > > satisfies fast-free criteria
> > > before updating it, and if conditions are not met simply return an
> > > error.
> > > Then the driver has several options:
> > > 1) Drop the packet silently
> > > 2) Refuse to send it
> > > 3) Switch to some slower but always working code-path (without fast-
> > > free)
> > > 4) Panic
> >
> > It boils down to purpose.
> >
> > The function is designed for use in the transmit code path designated for fast-free,
> where the application has promised/hinted that packets are fast-free compliant.
> > Violating preconditions in the fast path (by passing packets not compliant to fast-
> free requirements to this function) is a serious type of bug, for which DPDK usually
> doesn't provide graceful fallback.
> > I don't want to make an exception (and introduce graceful fallback) for the
> designated fast-free code path.
> >
> > My answer would be completely different if we were designing an API for standard
> packet transmission, whereby some packets living up to certain criteria could take a
> faster code path for being freed.
> > If that was the case, I would agree with you about returning a value to indicate
> how to proceed, like rte_pktmbuf_prefree_seg() does.
> >
> I would tend to agree. The extra handling for those cases just expands the
> code and adds more potential branches to the resulting binary. Lots of the
> fastpath code just assumes you know what you are doing, and violating
> constraints will lead to panics and segfaults generally. Therefore panicking
> in this case doesn't seem a big deal to me.
>
I understand your point lads, and somewhat agree: if we plan to keep the FAST_FREE flag forever,
then it is probably the simplest and safest approach.
My hope was that such a function would open the possibility for PMDs to implement a similar
perf improvement completely transparently to the user (without the need for special flags and/or prerequisites).
But maybe I am looking too far ahead with it.
Even what Morten proposed above is a big step, so OK - let's deal with it first.
Konstantin
* Re: mbuf fast-free requirements analysis
2026-01-13 14:48 ` Konstantin Ananyev
@ 2026-01-13 16:07 ` Stephen Hemminger
0 siblings, 0 replies; 28+ messages in thread
From: Stephen Hemminger @ 2026-01-13 16:07 UTC (permalink / raw)
To: Konstantin Ananyev
Cc: Bruce Richardson, Morten Brørup, dev@dpdk.org,
techboard@dpdk.org
On Tue, 13 Jan 2026 14:48:19 +0000
Konstantin Ananyev <konstantin.ananyev@huawei.com> wrote:
> > > > > > >
> > > > > > > __rte_mbuf_raw_sanity_check_mp(m, mp);
> > > > > > > rte_mbuf_history_mark(mbuf,
> > > > > > > RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
> > > > > > > }
> > > > > >
> > > > > > Thanks Morten, though should we really panic if the condition is not
> > > > met?
> > > > > > Maybe just do the check first and return an error.
> > > > >
> > > > > __rte_mbuf_raw_sanity_check_mp() is a no-op unless
> > > > > RTE_LIBRTE_MBUF_DEBUG is enabled.
> > > >
> > > > Yep, I noticed.
> > > >
> > > > >
> > > > > Using it everywhere in alloc/free in the mbuf library is the
> > > > convention.
> > > > >
> > > > > And if we don't do it here, the __rte_mbuf_raw_sanity_check_mp() in
> > > > > rte_mbuf_raw_free() will panic later instead.
> > > >
> > > > Why?
> > > > This new routine (rte_mbuf_raw_prefree_seg) can check that mbuf
> > > > satisfies fast-free criteria
> > > > before updating it, and if conditions are not met simply return an
> > > > error.
> > > > Then the driver has several options:
> > > > 1) Drop the packet silently
> > > > 2) Refuse to send it
> > > > 3) Switch to some slower but always working code-path (without fast-
> > > > free)
> > > > 4) Panic
> > >
> > > It boils down to purpose.
> > >
> > > The function is designed for use in the transmit code path designated for fast-free,
> > where the application has promised/hinted that packets are fast-free compliant.
> > > Violating preconditions in the fast path (by passing packets not compliant to fast-
> > free requirements to this function) is a serious type of bug, for which DPDK usually
> > doesn't provide graceful fallback.
> > > I don't want to make an exception (and introduce graceful fallback) for the
> > designated fast-free code path.
> > >
> > > My answer would be completely different if we were designing an API for standard
> > packet transmission, whereby some packets living up to certain criteria could take a
> > faster code path for being freed.
> > > If that was the case, I would agree with you about returning a value to indicate
> > how to proceed, like rte_pktmbuf_prefree_seg() does.
> > >
> > I would tend to agree. The extra handling for those cases just expands the
> > code and adds more potential branches to the resulting binary. Lots of the
> > fastpath code just assumes you know what you are doing, and violating
> > constraints will lead to panics and segfaults generally. Therefore panicking
> > in this case doesn't seem a big deal to me.
> >
> I understand your point lads, and somewhat agree: if we plan to keep the FAST_FREE flag forever,
> then it is probably the simplest and safest approach.
> My hope was that such a function would open the possibility for PMDs to implement a similar
> perf improvement completely transparently to the user (without the need for special flags and/or prerequisites).
> But maybe I am looking too far ahead with it.
> Even what Morten proposed above is a big step, so OK - let's deal with it first.
> Konstantin
>
We should document any restrictions on this and other semantic assumptions.
Ideally, AI review tools could then help new and old developers see places
where the assumptions are violated.
* RE: mbuf fast-free requirements analysis
2025-12-15 11:46 ` Bruce Richardson
@ 2026-01-14 15:31 ` Morten Brørup
2026-01-14 16:36 ` Bruce Richardson
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Morten Brørup @ 2026-01-14 15:31 UTC (permalink / raw)
To: Bruce Richardson, Konstantin Ananyev; +Cc: techboard, dev
> > If I'm not mistaken, the mbuf library is not a barrier for fast-
> freeing
> > segmented packet mbufs, and thus fast-free of jumbo frames is
> possible.
> >
> > We need a driver developer to confirm that my suggested approach -
> > resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> > preparing the Tx descriptor - is viable.
> >
> Excellent analysis, Morten. If I get a chance some time this release
> cycle,
> I will try implementing this change in our drivers, see if any
> difference
> is made.
Bruce,
Have you had a chance to look into the driver change requirements?
If not, could you please try scratching the surface, to build a gut feeling.
I wonder if the vector implementations have strong requirements that packets are not segmented...
The i40e driver only sets the "tx_simple_allowed" and "tx_vec_allowed" flags when MBUF_FAST_FREE is set:
https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i40e_rxtx.c#L3502
And only when these two flags are set, it uses a vector Tx function:
https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i40e_rxtx.c#L3550
And a special Tx Prep function:
https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i40e_rxtx.c#L3584
Which fails if nb_segs != 1:
https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i40e_rxtx.c#L1675
So currently it does.
But does it need to?... That is the question.
Paraphrasing:
Can the Tx function only be vectorized when the code path doesn't have branches depending on the number of segments?
If so, then this may be the main reason for not supporting segmented packets with FAST_FREE.
In that case, we cannot remove the single-segment requirement from FAST_FREE without sacrificing the performance boost from vectorizing.
But then we can proceed pursuing alternative optimizations, as suggested by Konstantin.
Here's another idea:
The Tx function could pre-scan each Tx burst for multi-segment packets, to decide if the burst should be processed by the vector code path or a fallback code path (which can also handle multi-segment packets).
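That pre-scan could look like the following minimal sketch. The struct and function names are illustrative stand-ins, not DPDK API; a real driver would inspect struct rte_mbuf directly:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for struct rte_mbuf, reduced to the one field the
 * pre-scan needs; illustrative only. */
struct mbuf_stub {
	uint16_t nb_segs;
};

/* Return 1 if every packet in the burst is single-segment, so the
 * vector code path can be used; 0 if any packet is multi-segment and
 * the burst must take the fallback path. */
static int
burst_is_single_seg(struct mbuf_stub **pkts, uint16_t nb_pkts)
{
	for (uint16_t i = 0; i < nb_pkts; i++)
		if (pkts[i]->nb_segs != 1)
			return 0;
	return 1;
}
```

The scan costs one extra pass over the burst array, but it touches only the mbuf headers, which the Tx function is about to read anyway when writing descriptors.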
-Morten
* Re: mbuf fast-free requirements analysis
2026-01-14 15:31 ` Morten Brørup
@ 2026-01-14 16:36 ` Bruce Richardson
2026-01-14 18:05 ` Morten Brørup
2026-01-23 11:20 ` [PATCH] net/intel: optimize for fast-free hint Bruce Richardson
2026-01-23 11:33 ` mbuf fast-free requirements analysis Bruce Richardson
2 siblings, 1 reply; 28+ messages in thread
From: Bruce Richardson @ 2026-01-14 16:36 UTC (permalink / raw)
To: Morten Brørup; +Cc: Konstantin Ananyev, techboard, dev
On Wed, Jan 14, 2026 at 04:31:31PM +0100, Morten Brørup wrote:
> > > If I'm not mistaken, the mbuf library is not a barrier for fast-
> > freeing
> > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > possible.
> > >
> > > We need a driver developer to confirm that my suggested approach -
> > > resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> > > preparing the Tx descriptor - is viable.
> > >
> > Excellent analysis, Morten. If I get a chance some time this release
> > cycle,
> > I will try implementing this change in our drivers, see if any
> > difference
> > is made.
>
> Bruce,
>
> Have you had a chance to look into the driver change requirements?
> If not, could you please try scratching the surface, to build a gut feeling.
I'll try and take a look this week. Juggling a few things at the moment, so
I had forgotten about this. Sorry.
More comments inline below.
/Bruce
>
> I wonder if the vector implementations have strong requirements that packets are not segmented...
>
> > The i40e driver only sets the "tx_simple_allowed" and "tx_vec_allowed" flags when MBUF_FAST_FREE is set:
> https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i40e_rxtx.c#L3502
>
Actually, it allows but does not require FAST_FREE. The check is just
verifying that the flags with everything *but* FAST_FREE masked out is the
same as the original flags, i.e. FAST_FREE is just ignored.
> And only when these two flags are set, it uses a vector Tx function:
> https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i40e_rxtx.c#L3550
> And a special Tx Prep function:
> https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i40e_rxtx.c#L3584
> Which fails if nb_segs != 1:
> https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i40e_rxtx.c#L1675
>
> So currently it does.
> But does it need to?... That is the question.
> Paraphrasing:
> Can the Tx function only be vectorized when the code path doesn't have branches depending on the number of segments?
> If so, then this may be the main reason for not supporting segmented packets with FAST_FREE.
>
> In that case, we cannot remove the single-segment requirement from FAST_FREE without sacrificing the performance boost from vectorizing.
No, based on what I state above, this should not be a blocker. The vector
paths do require us to guarantee only one segment per packet - without
additional context descriptors - so only one descriptor per packet
(generally, or always one + ctx, in one code-path case). FAST_FREE can be
used in conjunction with that but should not be a requirement. See [1]
where in vector cleanup we explicitly check for FAST_FREE.
Similarly for scalar code path, in my latest rework, I am attempting to
standardize the use of FAST_FREE optimizations even when we have a slightly
slower Tx path [2].
[1] https://github.com/DPDK/dpdk/blob/main/drivers/net/intel/common/tx.h
[2] https://patches.dpdk.org/project/dpdk/patch/20260113151505.1871271-31-bruce.richardson@intel.com/
>
> But then we can proceed pursuing alternative optimizations, as suggested by Konstantin.
>
> Here's another idea:
> The Tx function could pre-scan each Tx burst for multi-segment packets, to decide if the burst should be processed by the vector code path or a fallback code path (which can also handle multi-segment packets).
>
>
> -Morten
>
* Re: mbuf fast-free requirements analysis
2025-12-15 11:06 mbuf fast-free requirements analysis Morten Brørup
2025-12-15 11:46 ` Bruce Richardson
2025-12-15 14:41 ` Konstantin Ananyev
@ 2026-01-14 17:01 ` Bruce Richardson
2026-01-14 17:31 ` Morten Brørup
2 siblings, 1 reply; 28+ messages in thread
From: Bruce Richardson @ 2026-01-14 17:01 UTC (permalink / raw)
To: Morten Brørup; +Cc: dev, techboard
On Mon, Dec 15, 2025 at 12:06:38PM +0100, Morten Brørup wrote:
> Executive Summary:
>
> My analysis shows that the mbuf library is not a barrier for fast-freeing
> segmented packet mbufs, and thus fast-free of jumbo frames is possible.
>
>
> Detailed Analysis:
>
> The purpose of the mbuf fast-free Tx optimization is to reduce
> rte_pktmbuf_free_seg() to something much simpler in the ethdev drivers, by
> eliminating the code path related to indirect mbufs.
> Optimally, we want to simplify the ethdev driver's function that frees the
> transmitted mbufs, so it can free them directly to their mempool without
> accessing the mbufs themselves.
>
> If the driver cannot access the mbuf itself, it cannot determine which
> mempool it belongs to.
> We don't want the driver to access every mbuf being freed; but if all
> mbufs of a Tx queue belong to the same mempool, the driver can determine
> which mempool by looking into just one of the mbufs.
>
<snip>
>
> If I'm not mistaken, the mbuf library is not a barrier for fast-freeing
> segmented packet mbufs, and thus fast-free of jumbo frames is possible.
>
> We need a driver developer to confirm that my suggested approach -
> resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> preparing the Tx descriptor - is viable.
>
Just to make sure I understand things correctly here, the suggestion to
prototype is:
- When FAST_FREE flag is set:
- reset the m->nb_segs and m->next fields (if necessary) when accessing
the mbuf to write the descriptor
- skip calling pre-free seg on cleanup and instead
- just free all buffers directly to the mempool
Is that correct?
/Bruce
* RE: mbuf fast-free requirements analysis
2026-01-14 17:01 ` Bruce Richardson
@ 2026-01-14 17:31 ` Morten Brørup
2026-01-14 17:45 ` Bruce Richardson
0 siblings, 1 reply; 28+ messages in thread
From: Morten Brørup @ 2026-01-14 17:31 UTC (permalink / raw)
To: Bruce Richardson; +Cc: dev, techboard
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Wednesday, 14 January 2026 18.01
>
> On Mon, Dec 15, 2025 at 12:06:38PM +0100, Morten Brørup wrote:
> > Executive Summary:
> >
> > My analysis shows that the mbuf library is not a barrier for fast-
> freeing
> > segmented packet mbufs, and thus fast-free of jumbo frames is
> possible.
> >
> >
> > Detailed Analysis:
> >
> > The purpose of the mbuf fast-free Tx optimization is to reduce
> > rte_pktmbuf_free_seg() to something much simpler in the ethdev
> drivers, by
> > eliminating the code path related to indirect mbufs.
> > Optimally, we want to simplify the ethdev driver's function that
> frees the
> > transmitted mbufs, so it can free them directly to their mempool
> without
> > accessing the mbufs themselves.
> >
> > If the driver cannot access the mbuf itself, it cannot determine
> which
> > mempool it belongs to.
> > We don't want the driver to access every mbuf being freed; but if all
> > mbufs of a Tx queue belong to the same mempool, the driver can
> determine
> > which mempool by looking into just one of the mbufs.
> >
>
> <snip>
>
> >
> > If I'm not mistaken, the mbuf library is not a barrier for fast-
> freeing
> > segmented packet mbufs, and thus fast-free of jumbo frames is
> possible.
> >
> > We need a driver developer to confirm that my suggested approach -
> > resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> > preparing the Tx descriptor - is viable.
> >
>
> Just to make sure I understand things correctly here, the suggestion to
> prototype is:
>
> - When FAST_FREE flag is set:
> - reset the m->nb_segs and m->next fields (if necessary) when
> accessing
> the mbuf to write the descriptor
> - skip calling pre-free seg on cleanup and instead
> - just free all buffers directly to the mempool
>
> Is that correct?
Yes.
If this can be done with multi-segment packets, we should be able to eliminate the single-segment requirement for FAST_FREE.
(Unless something in the code that writes the descriptor requires single-segment packets in order to be super performant, as I suspected of vectorization.)
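A stub sketch of the descriptor-write-time reset in that prototype. The struct here is an illustrative stand-in for the two rte_mbuf fields involved, not the real type:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the two rte_mbuf fields the reset touches. */
struct mbuf_stub {
	uint16_t nb_segs;
	struct mbuf_stub *next;
};

/* Descriptor-write-time reset: put the segment into the state that
 * rte_pktmbuf_prefree_seg() would otherwise have to establish on free,
 * so cleanup can later bulk-return the buffers to the mempool without
 * reading them. Conditional stores avoid dirtying cache lines that are
 * already in the reset state. */
static void
fast_free_prepare_seg(struct mbuf_stub *m)
{
	if (m->nb_segs != 1)
		m->nb_segs = 1;
	if (m->next != NULL)
		m->next = NULL;
}
```

With every segment prepared like this at descriptor-write time, the post-transmission cleanup step reduces to a bulk put of the sw_ring slice into the queue's single mempool.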
* Re: mbuf fast-free requirements analysis
2026-01-14 17:31 ` Morten Brørup
@ 2026-01-14 17:45 ` Bruce Richardson
0 siblings, 0 replies; 28+ messages in thread
From: Bruce Richardson @ 2026-01-14 17:45 UTC (permalink / raw)
To: Morten Brørup; +Cc: dev, techboard
On Wed, Jan 14, 2026 at 06:31:24PM +0100, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Wednesday, 14 January 2026 18.01
> >
> > On Mon, Dec 15, 2025 at 12:06:38PM +0100, Morten Brørup wrote:
> > > Executive Summary:
> > >
> > > My analysis shows that the mbuf library is not a barrier for fast-
> > freeing
> > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > possible.
> > >
> > >
> > > Detailed Analysis:
> > >
> > > The purpose of the mbuf fast-free Tx optimization is to reduce
> > > rte_pktmbuf_free_seg() to something much simpler in the ethdev
> > drivers, by
> > > eliminating the code path related to indirect mbufs.
> > > Optimally, we want to simplify the ethdev driver's function that
> > frees the
> > > transmitted mbufs, so it can free them directly to their mempool
> > without
> > > accessing the mbufs themselves.
> > >
> > > If the driver cannot access the mbuf itself, it cannot determine
> > which
> > > mempool it belongs to.
> > > We don't want the driver to access every mbuf being freed; but if all
> > > mbufs of a Tx queue belong to the same mempool, the driver can
> > determine
> > > which mempool by looking into just one of the mbufs.
> > >
> >
> > <snip>
> >
> > >
> > > If I'm not mistaken, the mbuf library is not a barrier for fast-
> > freeing
> > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > possible.
> > >
> > > We need a driver developer to confirm that my suggested approach -
> > > resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> > > preparing the Tx descriptor - is viable.
> > >
> >
> > Just to make sure I understand things correctly here, the suggestion to
> > prototype is:
> >
> > - When FAST_FREE flag is set:
> > - reset the m->nb_segs and m->next fields (if necessary) when
> > accessing
> > the mbuf to write the descriptor
> > - skip calling pre-free seg on cleanup and instead
> > - just free all buffers directly to the mempool
> >
> > Is that correct?
>
> Yes.
> If this can be done with multi-segment packets, we should be able to eliminate the single-segment requirement for FAST_FREE.
> (Unless something in the code that writes the descriptor requires single-segment packets in order to be super performant, as I suspected of vectorization.)
>
Yes, the vector paths do require single-segment, but there is a separate
offload flag for that [1]. That flag is part of the overall
vector/non-vector Tx flags check.
/Bruce
[1] https://github.com/DPDK/dpdk/blob/main/lib/ethdev/rte_ethdev.h#L1643
* RE: mbuf fast-free requirements analysis
2026-01-14 16:36 ` Bruce Richardson
@ 2026-01-14 18:05 ` Morten Brørup
2026-01-15 8:46 ` Bruce Richardson
0 siblings, 1 reply; 28+ messages in thread
From: Morten Brørup @ 2026-01-14 18:05 UTC (permalink / raw)
To: Bruce Richardson; +Cc: Konstantin Ananyev, techboard, dev
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Wednesday, 14 January 2026 17.36
>
> On Wed, Jan 14, 2026 at 04:31:31PM +0100, Morten Brørup wrote:
> > > > If I'm not mistaken, the mbuf library is not a barrier for fast-
> > > freeing
> > > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > > possible.
> > > >
> > > > We need a driver developer to confirm that my suggested approach
> -
> > > > resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> > > > preparing the Tx descriptor - is viable.
> > > >
> > > Excellent analysis, Morten. If I get a chance some time this
> release
> > > cycle,
> > > I will try implementing this change in our drivers, see if any
> > > difference
> > > is made.
> >
> > Bruce,
> >
> > Have you had a chance to look into the driver change requirements?
> > If not, could you please try scratching the surface, to build a gut
> feeling.
>
> I'll try and take a look this week. Juggling a few things at the
> moment, so
> I had forgotten about this. Sorry.
>
> More comments inline below.
>
> /Bruce
>
> >
> > I wonder if the vector implementations have strong requirements that
> packets are not segmented...
> >
> > The i40e driver only sets the "tx_simple_allowed" and "tx_vec_allowed"
> flags when MBUF_FAST_FREE is set:
> >
> https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i4
> 0e_rxtx.c#L3502
> >
>
> Actually, it allows but does not require FAST_FREE. The check is just
> verifying that the flags with everything *but* FAST_FREE masked out is
> the
> same as the original flags, i.e. FAST_FREE is just ignored.
That's not how I read the code:
ad->tx_simple_allowed =
(txq->offloads ==
(txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) &&
txq->tx_rs_thresh >= I40E_TX_MAX_BURST);
Look at it with offloads=(MULTI_SEGS|FAST_FREE):
simple_allowed = (MULTI_SEGS|FAST_FREE) == (MULTI_SEGS|FAST_FREE) & FAST_FREE
i.e.:
simple_allowed = (MULTI_SEGS|FAST_FREE) == FAST_FREE
i.e.: false
>
> > And only when these two flags are set, it uses a vector Tx function:
> >
> https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i4
> 0e_rxtx.c#L3550
> > And a special Tx Prep function:
> >
> https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i4
> 0e_rxtx.c#L3584
> > Which fails if nb_segs != 1:
> >
> https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i4
> 0e_rxtx.c#L1675
> >
> > So currently it does.
> > But does it need to?... That is the question.
> > Paraphrasing:
> > Can the Tx function only be vectorized when the code path doesn't
> have branches depending on the number of segments?
> > If so, then this may be the main reason for not supporting segmented
> packets with FAST_FREE.
> >
> > In that case, we cannot remove the single-segment requirement from
> FAST_FREE without sacrificing the performance boost from vectorizing.
>
> No, based on what I state above, this should not be a blocker. The
> vector
> paths do require us to guarantee only one segment per packet - without
> additional context descriptors - so only one descriptor per packet
> (generally, or always one + ctx, in one code-path case). FAST_FREE can
> be
> used in conjunction with that but should not be a requirement. See [1]
> where in vector cleanup we explicitly check for FAST_FREE.
>
> Similarly for scalar code path, in my latest rework, I am attempting to
> standardize the use of FAST_FREE optimizations even when we have a
> slightly
> slower Tx path [2].
Good point:
The Tx path has two steps:
1) Pre-transmission Tx descriptor setup.
2) Post-transmission mbuf free.
FAST_FREE requirements for optimizing each of these two steps might differ.
As suggested in my other email, hopefully the post-transmission step can be vectorized (also for multi-segment packets) by assisting it in the pre-transmission step - i.e. by preparing the FAST_FREE segments for direct release to the mempool.
Then we can consider single-segment requirements for the pre-transmission step.
>
> [1]
> https://github.com/DPDK/dpdk/blob/main/drivers/net/intel/common/tx.h
> [2] https://patches.dpdk.org/project/dpdk/patch/20260113151505.1871271-
> 31-bruce.richardson@intel.com/
>
> >
> > But then we can proceed pursuing alternative optimizations, as
> suggested by Konstantin.
> >
> > Here's another idea:
> > The Tx function could pre-scan each Tx burst for multi-segment
> packets, to decide if the burst should be processed by the vector code
> path or a fallback code path (which can also handle multi-segment
> packets).
> >
> >
> > -Morten
> >
* Re: mbuf fast-free requirements analysis
2026-01-14 18:05 ` Morten Brørup
@ 2026-01-15 8:46 ` Bruce Richardson
2026-01-15 9:04 ` Morten Brørup
0 siblings, 1 reply; 28+ messages in thread
From: Bruce Richardson @ 2026-01-15 8:46 UTC (permalink / raw)
To: Morten Brørup; +Cc: Konstantin Ananyev, techboard, dev
On Wed, Jan 14, 2026 at 07:05:44PM +0100, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Wednesday, 14 January 2026 17.36
> >
> > On Wed, Jan 14, 2026 at 04:31:31PM +0100, Morten Brørup wrote:
> > > > > If I'm not mistaken, the mbuf library is not a barrier for fast-
> > > > freeing
> > > > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > > > possible.
> > > > >
> > > > > We need a driver developer to confirm that my suggested approach
> > -
> > > > > resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> > > > > preparing the Tx descriptor - is viable.
> > > > >
> > > > Excellent analysis, Morten. If I get a chance some time this
> > release
> > > > cycle,
> > > > I will try implementing this change in our drivers, see if any
> > > > difference
> > > > is made.
> > >
> > > Bruce,
> > >
> > > Have you had a chance to look into the driver change requirements?
> > > If not, could you please try scratching the surface, to build a gut
> > feeling.
> >
> > I'll try and take a look this week. Juggling a few things at the
> > moment, so
> > I had forgotten about this. Sorry.
> >
> > More comments inline below.
> >
> > /Bruce
> >
> > >
> > > I wonder if the vector implementations have strong requirements that
> > packets are not segmented...
> > >
> > > The i40e driver only sets the "tx_simple_allowed" and "tx_vec_allowed"
> > flags when MBUF_FAST_FREE is set:
> > >
> > https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i4
> > 0e_rxtx.c#L3502
> > >
> >
> > Actually, it allows but does not require FAST_FREE. The check is just
> > verifying that the flags with everything *but* FAST_FREE masked out is
> > the
> > same as the original flags, i.e. FAST_FREE is just ignored.
>
> That's not how I read the code:
> ad->tx_simple_allowed =
> (txq->offloads ==
> (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) &&
> txq->tx_rs_thresh >= I40E_TX_MAX_BURST);
>
> Look at it with offloads=(MULTI_SEGS|FAST_FREE):
> simple_allowed = (MULTI_SEGS|FAST_FREE) == (MULTI_SEGS|FAST_FREE) & FAST_FREE
> i.e.:
> simple_allowed = (MULTI_SEGS|FAST_FREE) == FAST_FREE
> i.e.: false
>
Which is correct. The only flag allowed is FAST_FREE, but it's not required.
If the input flags were just MULTI_SEGS, it would end up as:
simple_allowed = (MULTI_SEGS) == 0
i.e. also false
So the FAST_FREE flag does not affect the result.
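For illustration, the truth table of that check can be captured in a tiny stand-alone snippet. The flag values below are hypothetical stand-ins for the RTE_ETH_TX_OFFLOAD_* bits, and the tx_rs_thresh condition is omitted:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values for illustration; the real bits are the
 * RTE_ETH_TX_OFFLOAD_* definitions in rte_ethdev.h. */
#define FAST_FREE  (UINT64_C(1) << 0)
#define MULTI_SEGS (UINT64_C(1) << 1)

/* Same shape as the i40e check: true iff no offload other than
 * FAST_FREE is set, i.e. FAST_FREE itself is allowed but ignored. */
static int
simple_allowed(uint64_t offloads)
{
	return offloads == (offloads & FAST_FREE);
}
```

Running through the four combinations shows the result depends only on whether any flag other than FAST_FREE is present, which is exactly the "allowed but not required" behavior described above.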
/Bruce
* RE: mbuf fast-free requirements analysis
2026-01-15 8:46 ` Bruce Richardson
@ 2026-01-15 9:04 ` Morten Brørup
0 siblings, 0 replies; 28+ messages in thread
From: Morten Brørup @ 2026-01-15 9:04 UTC (permalink / raw)
To: Bruce Richardson; +Cc: Konstantin Ananyev, techboard, dev
> > > > I wonder if the vector implementations have strong requirements
> that
> > > packets are not segmented...
> > > >
> > > > The i40e driver only sets the "tx_simple_allowed" and "tx_vec_allowed"
> > > flags when MBUF_FAST_FREE is set:
> > > >
> > >
> https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i4
> > > 0e_rxtx.c#L3502
> > > >
> > >
> > > Actually, it allows but does not require FAST_FREE. The check is
> just
> > > verifying that the flags with everything *but* FAST_FREE masked out
> is
> > > the
> > > same as the original flags, i.e. FAST_FREE is just ignored.
> >
> > That's not how I read the code:
> > ad->tx_simple_allowed =
> > (txq->offloads ==
> > (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) &&
> > txq->tx_rs_thresh >= I40E_TX_MAX_BURST);
> >
> > Look at it with offloads=(MULTI_SEGS|FAST_FREE):
> > simple_allowed = (MULTI_SEGS|FAST_FREE) == (MULTI_SEGS|FAST_FREE) &
> FAST_FREE
> > i.e.:
> > simple_allowed = (MULTI_SEGS|FAST_FREE) == FAST_FREE
> > i.e.: false
> >
>
> Which is correct. The only flag allowed is FAST_FREE, but it's not
> required.
> If the input flags were just MULTI_SEGS, it would end up as:
>
> simple_allowed = (MULTI_SEGS) == 0
> i.e. also false
>
> So the FAST_FREE flag does not affect the result.
OK, now I get it.
It's an obfuscated way of writing:
simple_allowed = (offloads & ~RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) == 0
Suggest updating for readability next time that code is touched.
The comment "only fast free is allowed" [1] didn't help me either;
I interpreted it as "it's only allowed when fast free is set".
Now that I understand the code, I can also interpret the comment as intended.
[1]: https://elixir.bootlin.com/dpdk/v25.11/source/drivers/net/intel/i40e/i40e_rxtx.c#L3502
Thanks for clarifying, Bruce!
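For completeness, a tiny standalone check that the two formulations of the condition agree. The flag values below are hypothetical, chosen only for illustration; the real RTE_ETH_TX_OFFLOAD_* macros have different values:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values for illustration only */
#define FAST_FREE  (1ULL << 0)
#define MULTI_SEGS (1ULL << 1)

/* The form used in i40e: true iff no bits other than FAST_FREE are set */
static int simple_allowed_orig(uint64_t offloads)
{
	return offloads == (offloads & FAST_FREE);
}

/* The more readable equivalent suggested above */
static int simple_allowed_clear(uint64_t offloads)
{
	return (offloads & ~FAST_FREE) == 0;
}
```

Both return true for 0 and FAST_FREE alone, and false for any combination containing MULTI_SEGS, so FAST_FREE is allowed but never required.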
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH] net/intel: optimize for fast-free hint
2026-01-14 15:31 ` Morten Brørup
2026-01-14 16:36 ` Bruce Richardson
@ 2026-01-23 11:20 ` Bruce Richardson
2026-01-23 12:05 ` Morten Brørup
2026-04-08 13:25 ` [PATCH v2] " Bruce Richardson
2026-01-23 11:33 ` mbuf fast-free requirements analysis Bruce Richardson
2 siblings, 2 replies; 28+ messages in thread
From: Bruce Richardson @ 2026-01-23 11:20 UTC (permalink / raw)
To: dev; +Cc: mb, Bruce Richardson
When the fast-free hint is provided to the driver we know that the mbufs
have refcnt of 1 and are from the same mempool. Therefore, we can
optimize a bit for this case by:
* resetting the necessary mbuf fields, i.e. nb_segs and the next pointer,
while we are already accessing the mbuf to write the descriptor.
* on cleanup of buffers after transmit, we can just write those buffers
straight to the mempool without accessing them.
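In isolation, the field reset the first bullet describes looks roughly like this. This is a simplified sketch with a stand-in struct, not the actual rte_mbuf or driver code:

```c
#include <stddef.h>

/* Simplified stand-in for struct rte_mbuf (illustrative only) */
struct mbuf {
	struct mbuf *next;
	unsigned short nb_segs;
};

/* While writing the Tx descriptor we already touch the mbuf, so for
 * fast-free we can reset here the fields that rte_pktmbuf_prefree_seg()
 * would otherwise have to reset at free time. Returns the saved next
 * pointer, needed to keep walking the segment chain after clearing it. */
static struct mbuf *reset_for_fast_free(struct mbuf *m)
{
	struct mbuf *next = m->next;	/* save before clearing */
	if (m->nb_segs != 1)
		m->nb_segs = 1;
	if (next != NULL)
		m->next = NULL;
	return next;
}
```

The conditional stores avoid dirtying cache lines that are already in the right state, mirroring the checks in the patch below.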
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
drivers/net/intel/common/tx.h | 16 +++--
drivers/net/intel/common/tx_scalar_fns.h | 81 +++++++++++++++++++-----
2 files changed, 76 insertions(+), 21 deletions(-)
diff --git a/drivers/net/intel/common/tx.h b/drivers/net/intel/common/tx.h
index 7fe4022f35..9bda2c8f59 100644
--- a/drivers/net/intel/common/tx.h
+++ b/drivers/net/intel/common/tx.h
@@ -345,12 +345,20 @@ ci_txq_release_all_mbufs(struct ci_tx_queue *txq, bool use_ctx)
return;
if (!txq->vector_tx) {
- for (uint16_t i = 0; i < txq->nb_tx_desc; i++) {
- if (txq->sw_ring[i].mbuf != NULL) {
+ /* Free mbufs from (last_desc_cleaned + 1) to (tx_tail - 1). */
+ const uint16_t start = (txq->last_desc_cleaned + 1) % txq->nb_tx_desc;
+ const uint16_t nb_desc = txq->nb_tx_desc;
+ const uint16_t end = txq->tx_tail;
+
+ uint16_t i = start;
+ if (end < i) {
+ for (; i < nb_desc; i++)
rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
- txq->sw_ring[i].mbuf = NULL;
- }
+ i = 0;
}
+ for (; i < end; i++)
+ rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+ memset(txq->sw_ring, 0, sizeof(txq->sw_ring[0]) * nb_desc);
return;
}
diff --git a/drivers/net/intel/common/tx_scalar_fns.h b/drivers/net/intel/common/tx_scalar_fns.h
index 82dc54438f..47ddcf411b 100644
--- a/drivers/net/intel/common/tx_scalar_fns.h
+++ b/drivers/net/intel/common/tx_scalar_fns.h
@@ -30,16 +30,60 @@ ci_tx_xmit_cleanup(struct ci_tx_queue *txq)
const uint16_t rs_idx = (last_desc_cleaned == nb_tx_desc - 1) ?
0 :
(last_desc_cleaned + 1) >> txq->log2_rs_thresh;
- uint16_t desc_to_clean_to = (rs_idx << txq->log2_rs_thresh) + (txq->tx_rs_thresh - 1);
+ const uint16_t dd_idx = txq->rs_last_id[rs_idx];
+ const uint16_t first_to_clean = rs_idx << txq->log2_rs_thresh;
/* Check if descriptor is done - all drivers use 0xF as done value in bits 3:0 */
- if ((txd[txq->rs_last_id[rs_idx]].cmd_type_offset_bsz & rte_cpu_to_le_64(CI_TXD_QW1_DTYPE_M)) !=
+ if ((txd[dd_idx].cmd_type_offset_bsz & rte_cpu_to_le_64(CI_TXD_QW1_DTYPE_M)) !=
rte_cpu_to_le_64(CI_TX_DESC_DTYPE_DESC_DONE))
/* Descriptor not yet processed by hardware */
return -1;
+ /* DD bit is set, descriptors are done. Now free the mbufs. */
+ /* Note: nb_tx_desc is guaranteed to be a multiple of tx_rs_thresh,
+ * validated during queue setup. This means cleanup never wraps around
+ * the ring within a single burst (e.g., ring=256, rs_thresh=32 gives
+ * bursts of 0-31, 32-63, ..., 224-255).
+ */
+ const uint16_t nb_to_clean = txq->tx_rs_thresh;
+ struct ci_tx_entry *sw_ring = txq->sw_ring;
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* FAST_FREE path: mbufs are already reset, just return to pool */
+ uint16_t nb_free = 0;
+
+ /* Get cached mempool pointer, or cache it on first use */
+ struct rte_mempool *mp =
+ likely(txq->fast_free_mp != (void *)UINTPTR_MAX) ?
+ txq->fast_free_mp :
+ (txq->fast_free_mp = sw_ring[dd_idx].mbuf->pool);
+
+ /* Pack non-NULL mbufs in-place at start of sw_ring range.
+ * No modulo needed in loop since we're guaranteed not to wrap.
+ */
+ for (uint16_t i = 0; i < nb_to_clean; i++) {
+ struct rte_mbuf *m = sw_ring[first_to_clean + i].mbuf;
+ if (m != NULL) {
+ /* Pack into sw_ring at packed position */
+ sw_ring[first_to_clean + nb_free].mbuf = m;
+ nb_free++;
+ }
+ }
+
+ /* Bulk return to mempool using packed sw_ring entries directly */
+ if (nb_free > 0)
+ rte_mempool_put_bulk(mp, (void **)&sw_ring[first_to_clean].mbuf, nb_free);
+ } else {
+ /* Non-FAST_FREE path: use prefree_seg for refcount checks */
+ for (uint16_t i = 0; i < nb_to_clean; i++) {
+ struct rte_mbuf *m = sw_ring[first_to_clean + i].mbuf;
+ if (m != NULL)
+ rte_pktmbuf_free_seg(m);
+ }
+ }
+
/* Update the txq to reflect the last descriptor that was cleaned */
- txq->last_desc_cleaned = desc_to_clean_to;
+ txq->last_desc_cleaned = first_to_clean + txq->tx_rs_thresh - 1;
txq->nb_tx_free += txq->tx_rs_thresh;
return 0;
@@ -300,8 +344,6 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
txd = &ci_tx_ring[tx_id];
tx_id = txe->next_id;
- if (txe->mbuf)
- rte_pktmbuf_free_seg(txe->mbuf);
txe->mbuf = tx_pkt;
/* Setup TX Descriptor */
td_cmd |= CI_TX_DESC_CMD_EOP;
@@ -322,10 +364,7 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
txn = &sw_ring[txe->next_id];
RTE_MBUF_PREFETCH_TO_FREE(txn->mbuf);
- if (txe->mbuf) {
- rte_pktmbuf_free_seg(txe->mbuf);
- txe->mbuf = NULL;
- }
+ txe->mbuf = NULL;
write_txd(ctx_txd, cd_qw0, cd_qw1);
@@ -339,10 +378,7 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
txn = &sw_ring[txe->next_id];
RTE_MBUF_PREFETCH_TO_FREE(txn->mbuf);
- if (txe->mbuf) {
- rte_pktmbuf_free_seg(txe->mbuf);
- txe->mbuf = NULL;
- }
+ txe->mbuf = NULL;
ipsec_txd[0] = ipsec_qw0;
ipsec_txd[1] = ipsec_qw1;
@@ -357,10 +393,21 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
txd = &ci_tx_ring[tx_id];
txn = &sw_ring[txe->next_id];
- if (txe->mbuf)
- rte_pktmbuf_free_seg(txe->mbuf);
txe->mbuf = m_seg;
+ /* For FAST_FREE: reset mbuf fields while we have it in cache.
+ * FAST_FREE guarantees refcnt=1 and direct mbufs, so we only
+ * need to reset nb_segs and next pointer as per rte_pktmbuf_prefree_seg.
+ * Save next pointer before resetting since we need it for loop iteration.
+ */
+ struct rte_mbuf *next_seg = m_seg->next;
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ if (m_seg->nb_segs != 1)
+ m_seg->nb_segs = 1;
+ if (next_seg != NULL)
+ m_seg->next = NULL;
+ }
+
/* Setup TX Descriptor */
/* Calculate segment length, using IPsec callback if provided */
if (ipsec_ops != NULL)
@@ -389,7 +436,7 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
}
/* fill the last descriptor with End of Packet (EOP) bit */
- if (m_seg->next == NULL)
+ if (next_seg == NULL)
td_cmd |= CI_TX_DESC_CMD_EOP;
const uint64_t cmd_type_offset_bsz = CI_TX_DESC_DTYPE_DATA |
@@ -401,7 +448,7 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
tx_id = txe->next_id;
txe = txn;
- m_seg = m_seg->next;
+ m_seg = next_seg;
} while (m_seg);
end_pkt:
txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_used);
--
2.51.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: mbuf fast-free requirements analysis
2026-01-14 15:31 ` Morten Brørup
2026-01-14 16:36 ` Bruce Richardson
2026-01-23 11:20 ` [PATCH] net/intel: optimize for fast-free hint Bruce Richardson
@ 2026-01-23 11:33 ` Bruce Richardson
2 siblings, 0 replies; 28+ messages in thread
From: Bruce Richardson @ 2026-01-23 11:33 UTC (permalink / raw)
To: Morten Brørup; +Cc: Konstantin Ananyev, techboard, dev
On Wed, Jan 14, 2026 at 04:31:31PM +0100, Morten Brørup wrote:
> > > If I'm not mistaken, the mbuf library is not a barrier for fast-
> > freeing
> > > segmented packet mbufs, and thus fast-free of jumbo frames is
> > possible.
> > >
> > > We need a driver developer to confirm that my suggested approach -
> > > resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> > > preparing the Tx descriptor - is viable.
> > >
> > Excellent analysis, Morten. If I get a chance some time this release
> > cycle,
> > I will try implementing this change in our drivers, see if any
> > difference
> > is made.
>
> Bruce,
>
> Have you had a chance to look into the driver change requirements?
If not, could you please try scratching the surface, to get a gut feeling for the effort involved?
>
Prototype implementation done and posted as a patch to this thread[1]. (It
applies on top of my "combine multiple Intel scalar Tx paths" patchset[2]).
The logic is not terribly complicated, and running a quick perf test with
scattered Tx enabled (to force use of scalar path, this optimization
doesn't apply to vector path*) shows a small perf increase from it using
testpmd iofwd.
/Bruce
[1] https://patches.dpdk.org/project/dpdk/patch/20260123112032.2174361-1-bruce.richardson@intel.com/
[2] https://patches.dpdk.org/project/dpdk/list/?series=37033
* The vector path already supports fast-free, and disallows scattered
packets, so there is nothing to optimize there.
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [PATCH] net/intel: optimize for fast-free hint
2026-01-23 11:20 ` [PATCH] net/intel: optimize for fast-free hint Bruce Richardson
@ 2026-01-23 12:05 ` Morten Brørup
2026-01-23 12:09 ` Bruce Richardson
2026-04-08 13:25 ` [PATCH v2] " Bruce Richardson
1 sibling, 1 reply; 28+ messages in thread
From: Morten Brørup @ 2026-01-23 12:05 UTC (permalink / raw)
To: Bruce Richardson, dev
I haven't looked into the details yet, but have a quick question inline below.
> @@ -345,12 +345,20 @@ ci_txq_release_all_mbufs(struct ci_tx_queue *txq,
> bool use_ctx)
> return;
>
> if (!txq->vector_tx) {
> - for (uint16_t i = 0; i < txq->nb_tx_desc; i++) {
> - if (txq->sw_ring[i].mbuf != NULL) {
You changed this loop to only operate on not-yet-cleaned descriptors.
Here comes the first part of my question:
You removed the NULL check for txq->sw_ring[i].mbuf, thereby assuming that it is never NULL for not-yet-cleaned descriptors.
> + /* Free mbufs from (last_desc_cleaned + 1) to (tx_tail -
> 1). */
> + const uint16_t start = (txq->last_desc_cleaned + 1) % txq-
> >nb_tx_desc;
> + const uint16_t nb_desc = txq->nb_tx_desc;
> + const uint16_t end = txq->tx_tail;
> +
> + uint16_t i = start;
> + if (end < i) {
> + for (; i < nb_desc; i++)
> rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
> - txq->sw_ring[i].mbuf = NULL;
> - }
> + i = 0;
> }
> + for (; i < end; i++)
> + rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
> + memset(txq->sw_ring, 0, sizeof(txq->sw_ring[0]) * nb_desc);
> return;
> }
>
> diff --git a/drivers/net/intel/common/tx_scalar_fns.h
> b/drivers/net/intel/common/tx_scalar_fns.h
> index 82dc54438f..47ddcf411b 100644
> --- a/drivers/net/intel/common/tx_scalar_fns.h
> +++ b/drivers/net/intel/common/tx_scalar_fns.h
> @@ -30,16 +30,60 @@ ci_tx_xmit_cleanup(struct ci_tx_queue *txq)
> const uint16_t rs_idx = (last_desc_cleaned == nb_tx_desc - 1) ?
> 0 :
> (last_desc_cleaned + 1) >> txq->log2_rs_thresh;
> - uint16_t desc_to_clean_to = (rs_idx << txq->log2_rs_thresh) +
> (txq->tx_rs_thresh - 1);
> + const uint16_t dd_idx = txq->rs_last_id[rs_idx];
> + const uint16_t first_to_clean = rs_idx << txq->log2_rs_thresh;
>
> /* Check if descriptor is done - all drivers use 0xF as done
> value in bits 3:0 */
> - if ((txd[txq->rs_last_id[rs_idx]].cmd_type_offset_bsz &
> rte_cpu_to_le_64(CI_TXD_QW1_DTYPE_M)) !=
> + if ((txd[dd_idx].cmd_type_offset_bsz &
> rte_cpu_to_le_64(CI_TXD_QW1_DTYPE_M)) !=
> rte_cpu_to_le_64(CI_TX_DESC_DTYPE_DESC_DONE))
> /* Descriptor not yet processed by hardware */
> return -1;
>
> + /* DD bit is set, descriptors are done. Now free the mbufs. */
> + /* Note: nb_tx_desc is guaranteed to be a multiple of
> tx_rs_thresh,
> + * validated during queue setup. This means cleanup never wraps
> around
> + * the ring within a single burst (e.g., ring=256, rs_thresh=32
> gives
> + * bursts of 0-31, 32-63, ..., 224-255).
> + */
> + const uint16_t nb_to_clean = txq->tx_rs_thresh;
> + struct ci_tx_entry *sw_ring = txq->sw_ring;
> +
> + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> + /* FAST_FREE path: mbufs are already reset, just return to
> pool */
> + uint16_t nb_free = 0;
> +
> + /* Get cached mempool pointer, or cache it on first use */
> + struct rte_mempool *mp =
> + likely(txq->fast_free_mp != (void *)UINTPTR_MAX) ?
> + txq->fast_free_mp :
> + (txq->fast_free_mp = sw_ring[dd_idx].mbuf->pool);
> +
> + /* Pack non-NULL mbufs in-place at start of sw_ring range.
Here is the second part of my question:
How can they (sw_ring[X].mbuf) be NULL here, when they cannot be NULL in ci_txq_release_all_mbufs()?
> + * No modulo needed in loop since we're guaranteed not to
> wrap.
> + */
> + for (uint16_t i = 0; i < nb_to_clean; i++) {
> + struct rte_mbuf *m = sw_ring[first_to_clean +
> i].mbuf;
> + if (m != NULL) {
> + /* Pack into sw_ring at packed position */
> + sw_ring[first_to_clean + nb_free].mbuf = m;
> + nb_free++;
> + }
> + }
> +
> + /* Bulk return to mempool using packed sw_ring entries
> directly */
> + if (nb_free > 0)
> + rte_mempool_put_bulk(mp, (void
> **)&sw_ring[first_to_clean].mbuf, nb_free);
> + } else {
> + /* Non-FAST_FREE path: use prefree_seg for refcount checks
> */
> + for (uint16_t i = 0; i < nb_to_clean; i++) {
> + struct rte_mbuf *m = sw_ring[first_to_clean +
> i].mbuf;
> + if (m != NULL)
> + rte_pktmbuf_free_seg(m);
> + }
> + }
> +
> /* Update the txq to reflect the last descriptor that was cleaned
> */
> - txq->last_desc_cleaned = desc_to_clean_to;
> + txq->last_desc_cleaned = first_to_clean + txq->tx_rs_thresh - 1;
> txq->nb_tx_free += txq->tx_rs_thresh;
>
> return 0;
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] net/intel: optimize for fast-free hint
2026-01-23 12:05 ` Morten Brørup
@ 2026-01-23 12:09 ` Bruce Richardson
2026-01-23 12:27 ` Morten Brørup
0 siblings, 1 reply; 28+ messages in thread
From: Bruce Richardson @ 2026-01-23 12:09 UTC (permalink / raw)
To: Morten Brørup; +Cc: dev
On Fri, Jan 23, 2026 at 01:05:10PM +0100, Morten Brørup wrote:
> I haven't looked into the details yet, but have a quick question inline below.
>
> > @@ -345,12 +345,20 @@ ci_txq_release_all_mbufs(struct ci_tx_queue *txq,
> > bool use_ctx)
> > return;
> >
> > if (!txq->vector_tx) {
> > - for (uint16_t i = 0; i < txq->nb_tx_desc; i++) {
> > - if (txq->sw_ring[i].mbuf != NULL) {
>
> You changed this loop to only operate on not-yet-cleaned descriptors.
>
> Here comes the first part of my question:
> You removed the NULL check for txq->sw_ring[i].mbuf, thereby assuming that it is never NULL for not-yet-cleaned descriptors.
>
Good point, I was quite focused on making this block and the vector block
the same; I forgot that we can have NULL pointers for context descriptors.
That was a silly mistake (and AI never caught it for me either.)
> > + /* Free mbufs from (last_desc_cleaned + 1) to (tx_tail -
> > 1). */
> > + const uint16_t start = (txq->last_desc_cleaned + 1) % txq-
> > >nb_tx_desc;
> > + const uint16_t nb_desc = txq->nb_tx_desc;
> > + const uint16_t end = txq->tx_tail;
> > +
> > + uint16_t i = start;
> > + if (end < i) {
> > + for (; i < nb_desc; i++)
> > rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
> > - txq->sw_ring[i].mbuf = NULL;
> > - }
> > + i = 0;
> > }
> > + for (; i < end; i++)
> > + rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
> > + memset(txq->sw_ring, 0, sizeof(txq->sw_ring[0]) * nb_desc);
> > return;
> > }
> >
> > diff --git a/drivers/net/intel/common/tx_scalar_fns.h
> > b/drivers/net/intel/common/tx_scalar_fns.h
> > index 82dc54438f..47ddcf411b 100644
> > --- a/drivers/net/intel/common/tx_scalar_fns.h
> > +++ b/drivers/net/intel/common/tx_scalar_fns.h
> > @@ -30,16 +30,60 @@ ci_tx_xmit_cleanup(struct ci_tx_queue *txq)
> > const uint16_t rs_idx = (last_desc_cleaned == nb_tx_desc - 1) ?
> > 0 :
> > (last_desc_cleaned + 1) >> txq->log2_rs_thresh;
> > - uint16_t desc_to_clean_to = (rs_idx << txq->log2_rs_thresh) +
> > (txq->tx_rs_thresh - 1);
> > + const uint16_t dd_idx = txq->rs_last_id[rs_idx];
> > + const uint16_t first_to_clean = rs_idx << txq->log2_rs_thresh;
> >
> > /* Check if descriptor is done - all drivers use 0xF as done
> > value in bits 3:0 */
> > - if ((txd[txq->rs_last_id[rs_idx]].cmd_type_offset_bsz &
> > rte_cpu_to_le_64(CI_TXD_QW1_DTYPE_M)) !=
> > + if ((txd[dd_idx].cmd_type_offset_bsz &
> > rte_cpu_to_le_64(CI_TXD_QW1_DTYPE_M)) !=
> > rte_cpu_to_le_64(CI_TX_DESC_DTYPE_DESC_DONE))
> > /* Descriptor not yet processed by hardware */
> > return -1;
> >
> > + /* DD bit is set, descriptors are done. Now free the mbufs. */
> > + /* Note: nb_tx_desc is guaranteed to be a multiple of
> > tx_rs_thresh,
> > + * validated during queue setup. This means cleanup never wraps
> > around
> > + * the ring within a single burst (e.g., ring=256, rs_thresh=32
> > gives
> > + * bursts of 0-31, 32-63, ..., 224-255).
> > + */
> > + const uint16_t nb_to_clean = txq->tx_rs_thresh;
> > + struct ci_tx_entry *sw_ring = txq->sw_ring;
> > +
> > + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> > + /* FAST_FREE path: mbufs are already reset, just return to
> > pool */
> > + uint16_t nb_free = 0;
> > +
> > + /* Get cached mempool pointer, or cache it on first use */
> > + struct rte_mempool *mp =
> > + likely(txq->fast_free_mp != (void *)UINTPTR_MAX) ?
> > + txq->fast_free_mp :
> > + (txq->fast_free_mp = sw_ring[dd_idx].mbuf->pool);
> > +
> > + /* Pack non-NULL mbufs in-place at start of sw_ring range.
>
> Here is the second part of my question:
> How can they (sw_ring[X].mbuf) be NULL here, when they cannot be NULL in ci_txq_release_all_mbufs()?
Because the latter function is wrong! :-)
>
> > + * No modulo needed in loop since we're guaranteed not to
> > wrap.
> > + */
> > + for (uint16_t i = 0; i < nb_to_clean; i++) {
> > + struct rte_mbuf *m = sw_ring[first_to_clean +
> > i].mbuf;
> > + if (m != NULL) {
> > + /* Pack into sw_ring at packed position */
> > + sw_ring[first_to_clean + nb_free].mbuf = m;
> > + nb_free++;
> > + }
> > + }
> > +
> > + /* Bulk return to mempool using packed sw_ring entries
> > directly */
> > + if (nb_free > 0)
> > + rte_mempool_put_bulk(mp, (void
> > **)&sw_ring[first_to_clean].mbuf, nb_free);
> > + } else {
> > + /* Non-FAST_FREE path: use prefree_seg for refcount checks
> > */
> > + for (uint16_t i = 0; i < nb_to_clean; i++) {
> > + struct rte_mbuf *m = sw_ring[first_to_clean +
> > i].mbuf;
> > + if (m != NULL)
> > + rte_pktmbuf_free_seg(m);
> > + }
> > + }
> > +
> > /* Update the txq to reflect the last descriptor that was cleaned
> > */
> > - txq->last_desc_cleaned = desc_to_clean_to;
> > + txq->last_desc_cleaned = first_to_clean + txq->tx_rs_thresh - 1;
> > txq->nb_tx_free += txq->tx_rs_thresh;
> >
> > return 0;
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [PATCH] net/intel: optimize for fast-free hint
2026-01-23 12:09 ` Bruce Richardson
@ 2026-01-23 12:27 ` Morten Brørup
2026-01-23 12:53 ` Bruce Richardson
0 siblings, 1 reply; 28+ messages in thread
From: Morten Brørup @ 2026-01-23 12:27 UTC (permalink / raw)
To: Bruce Richardson; +Cc: dev
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 23 January 2026 13.09
>
> On Fri, Jan 23, 2026 at 01:05:10PM +0100, Morten Brørup wrote:
> > I haven't looked into the details yet, but have a quick question
> inline below.
> >
> > > @@ -345,12 +345,20 @@ ci_txq_release_all_mbufs(struct ci_tx_queue
> *txq,
> > > bool use_ctx)
> > > return;
> > >
> > > if (!txq->vector_tx) {
> > > - for (uint16_t i = 0; i < txq->nb_tx_desc; i++) {
> > > - if (txq->sw_ring[i].mbuf != NULL) {
> >
> > You changed this loop to only operate on not-yet-cleaned descriptors.
> >
> > Here comes the first part of my question:
> > You removed the NULL check for txq->sw_ring[i].mbuf, thereby assuming
> > that it is never NULL for not-yet-cleaned descriptors.
> >
>
> Good point, I was quite focused on making this block and the vector block
> the same, I forgot that we can have NULL pointers for context descriptors.
> That was a silly mistake (and AI never caught it for me either.)
>
> > > + /* Free mbufs from (last_desc_cleaned + 1) to (tx_tail -
> > > 1). */
> > > + const uint16_t start = (txq->last_desc_cleaned + 1) % txq-
> > > >nb_tx_desc;
> > > + const uint16_t nb_desc = txq->nb_tx_desc;
> > > + const uint16_t end = txq->tx_tail;
> > > +
> > > + uint16_t i = start;
Suggest getting rid of "start"; it is only used for initializing "i".
> > > + if (end < i) {
> > > + for (; i < nb_desc; i++)
> > > rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
> > > - txq->sw_ring[i].mbuf = NULL;
> > > - }
> > > + i = 0;
> > > }
> > > + for (; i < end; i++)
> > > + rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
> > > + memset(txq->sw_ring, 0, sizeof(txq->sw_ring[0]) * nb_desc);
Consider also splitting this memset() into two, one for each of the two for loops.
Then you might need to keep "start" and make it non-const. :-)
> > > return;
> > > }
Or just keep the original version, looping over all descriptors.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] net/intel: optimize for fast-free hint
2026-01-23 12:27 ` Morten Brørup
@ 2026-01-23 12:53 ` Bruce Richardson
2026-01-23 13:06 ` Morten Brørup
0 siblings, 1 reply; 28+ messages in thread
From: Bruce Richardson @ 2026-01-23 12:53 UTC (permalink / raw)
To: Morten Brørup; +Cc: dev
On Fri, Jan 23, 2026 at 01:27:54PM +0100, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Friday, 23 January 2026 13.09
> >
> > On Fri, Jan 23, 2026 at 01:05:10PM +0100, Morten Brørup wrote:
> > > I haven't looked into the details yet, but have a quick question
> > inline below.
> > >
> > > > @@ -345,12 +345,20 @@ ci_txq_release_all_mbufs(struct ci_tx_queue
> > *txq,
> > > > bool use_ctx)
> > > > return;
> > > >
> > > > if (!txq->vector_tx) {
> > > > - for (uint16_t i = 0; i < txq->nb_tx_desc; i++) {
> > > > - if (txq->sw_ring[i].mbuf != NULL) {
> > >
> > > You changed this loop to only operate on not-yet-cleaned descriptors.
> > >
> > > Here comes the first part of my question:
> > > You removed the NULL check for txq->sw_ring[i].mbuf, thereby assuming
> > > that it is never NULL for not-yet-cleaned descriptors.
> > >
> >
> > Good point, I was quite focused on making this block and the vector
> > block
> > the same, I forgot that we can have NULL pointers for context
> > descriptors.
> > That was a silly mistake (and AI never caught it for me either.)
> >
> > > > + /* Free mbufs from (last_desc_cleaned + 1) to (tx_tail -
> > > > 1). */
> > > > + const uint16_t start = (txq->last_desc_cleaned + 1) % txq-
> > > > >nb_tx_desc;
> > > > + const uint16_t nb_desc = txq->nb_tx_desc;
> > > > + const uint16_t end = txq->tx_tail;
> > > > +
> > > > + uint16_t i = start;
>
> Suggest getting rid of "start"; it is only used for initializing "i".
>
Not sure it's worth doing. I quite like having an explicit start and end
values for clarity.
> > > > + if (end < i) {
> > > > + for (; i < nb_desc; i++)
> > > > rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
> > > > - txq->sw_ring[i].mbuf = NULL;
> > > > - }
> > > > + i = 0;
> > > > }
> > > > + for (; i < end; i++)
> > > > + rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
> > > > + memset(txq->sw_ring, 0, sizeof(txq->sw_ring[0]) * nb_desc);
>
> Consider also splitting this memset() into two, one for each of the two for loops.
> Then you might need to keep "start" and make it non-const. :-)
>
Don't see the point of that. The memset just zeros the whole array,
ignoring wraparound so no point in doing two memsets when one will do.
> > > > return; }
>
> Or just keep the original version, looping over all descriptors.
>
The reason for this whole change is that after the refactor the old code
was wrong.
The original code used the fact that all mbuf pointers were zeroed or
overwritten after being freed, but that no longer applies, because we free
the mbufs in bulk after we check the DD bits, rather than doing so
individually immediately before reuse. Instead, in both the datapath
and this release path, we must use the index values to track which mbuf
entries are valid or invalid. (We go from having two states, NULL or
non-NULL, to three: invalid, i.e. freed or NULL; valid-NULL, i.e. a slot
used by a context descriptor; and valid-non-NULL, i.e. a pointer to a
not-yet-cleaned-up mbuf.)
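The range-based release this describes can be sketched in isolation. Simplified types below; the real code operates on struct ci_tx_entry and calls rte_pktmbuf_free_seg():

```c
#include <stdint.h>
#include <string.h>

struct mbuf { int id; };			/* simplified stand-ins, */
struct tx_entry { struct mbuf *mbuf; };		/* not the driver types */

static int freed;	/* counts frees in place of rte_pktmbuf_free_seg() */
static void free_seg(struct mbuf *m) { (void)m; freed++; }

/* Release only the not-yet-cleaned range, last_desc_cleaned + 1 through
 * tx_tail - 1, handling ring wraparound and skipping the valid-NULL slots
 * that context descriptors leave behind. Entries outside this range are
 * "invalid": already freed in bulk, so they must not be touched again. */
static void release_range(struct tx_entry *ring, uint16_t nb_desc,
			  uint16_t last_desc_cleaned, uint16_t tx_tail)
{
	uint16_t i = (uint16_t)((last_desc_cleaned + 1) % nb_desc);
	const uint16_t end = tx_tail;

	if (end < i) {			/* range wraps past the ring end */
		for (; i < nb_desc; i++)
			if (ring[i].mbuf != NULL)
				free_seg(ring[i].mbuf);
		i = 0;
	}
	for (; i < end; i++)
		if (ring[i].mbuf != NULL)
			free_seg(ring[i].mbuf);

	memset(ring, 0, sizeof(ring[0]) * nb_desc);	/* reset all entries */
}
```

With an 8-entry ring, last_desc_cleaned = 5 and tx_tail = 2, the loops visit slots 6, 7, 0, 1, freeing only the non-NULL ones.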
/Bruce
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [PATCH] net/intel: optimize for fast-free hint
2026-01-23 12:53 ` Bruce Richardson
@ 2026-01-23 13:06 ` Morten Brørup
0 siblings, 0 replies; 28+ messages in thread
From: Morten Brørup @ 2026-01-23 13:06 UTC (permalink / raw)
To: Bruce Richardson; +Cc: dev
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 23 January 2026 13.54
>
> On Fri, Jan 23, 2026 at 01:27:54PM +0100, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Friday, 23 January 2026 13.09
> > >
> > > On Fri, Jan 23, 2026 at 01:05:10PM +0100, Morten Brørup wrote:
> > > > I haven't looked into the details yet, but have a quick question
> > > inline below.
> > > >
> > > > > @@ -345,12 +345,20 @@ ci_txq_release_all_mbufs(struct
> ci_tx_queue
> > > *txq,
> > > > > bool use_ctx)
> > > > > return;
> > > > >
> > > > > if (!txq->vector_tx) {
> > > > > - for (uint16_t i = 0; i < txq->nb_tx_desc; i++) {
> > > > > - if (txq->sw_ring[i].mbuf != NULL) {
> > > >
> > > > You changed this loop to only operate on not-yet-cleaned descriptors.
> > > >
> > > > Here comes the first part of my question:
> > > > You removed the NULL check for txq->sw_ring[i].mbuf, thereby assuming
> > > > that it is never NULL for not-yet-cleaned descriptors.
> > > >
> > >
> > > Good point, I was quite focused on making this block and the vector
> > > block the same, I forgot that we can have NULL pointers for context
> > > descriptors.
> > > That was a silly mistake (and AI never caught it for me either.)
> > >
> > > > > + /* Free mbufs from (last_desc_cleaned + 1) to
> (tx_tail -
> > > > > 1). */
> > > > > + const uint16_t start = (txq->last_desc_cleaned + 1) %
> txq-
> > > > > >nb_tx_desc;
> > > > > + const uint16_t nb_desc = txq->nb_tx_desc;
> > > > > + const uint16_t end = txq->tx_tail;
> > > > > +
> > > > > + uint16_t i = start;
> >
> > Suggest getting rid of "start"; it is only used for initializing "i".
> >
>
> Not sure it's worth doing. I quite like having explicit start and end
> values for clarity.
I have no preference, and it's a matter of taste, so the choice is yours. :-)
>
> > > > > + if (end < i) {
> > > > > + for (; i < nb_desc; i++)
> > > > > rte_pktmbuf_free_seg(txq-
> >sw_ring[i].mbuf);
> > > > > - txq->sw_ring[i].mbuf = NULL;
> > > > > - }
> > > > > + i = 0;
> > > > > }
> > > > > + for (; i < end; i++)
> > > > > + rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
> > > > > + memset(txq->sw_ring, 0, sizeof(txq->sw_ring[0]) *
> nb_desc);
> >
> > Consider also splitting this memset() into two, one for each of the two
> > for loops.
> > Then you might need to keep "start" and make it non-const. :-)
> >
>
> Don't see the point of that. The memset just zeros the whole array,
> ignoring wraparound so no point in doing two memsets when one will do.
>
> > > > > return; }
> >
> > Or just keep the original version, looping over all descriptors.
> >
>
> The reason for this whole change is that after the refactor the old code
> was wrong.
>
> The original code used the fact that all mbuf pointers were zeroed or
> overwritten after being freed, but that no longer applies, because we
> free the mbufs in bulk after we check the dd bits, rather than doing so
> individually later immediately before reuse. Instead, in both the
> datapath and this release path, we must use the index values to track
> what mbuf entries are valid or invalid. (We go from having two states,
> NULL or non-NULL, to 3; invalid i.e. freed or NULL, valid-NULL i.e. in
> slot used by context descriptor, valid-non-NULL i.e. a pointer to a
> not-yet-cleaned-up mbuf).
Thank you for clarifying.
I thought of the two for loops as a kind of performance optimization, skipping the sub-array of already freed descriptors. In that case, memsetting only the two remaining sub-arrays might have been a good idea.
That's not the case, so having one memset for the whole array is perfectly fine.
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v2] net/intel: optimize for fast-free hint
2026-01-23 11:20 ` [PATCH] net/intel: optimize for fast-free hint Bruce Richardson
2026-01-23 12:05 ` Morten Brørup
@ 2026-04-08 13:25 ` Bruce Richardson
2026-04-08 19:27 ` Morten Brørup
1 sibling, 1 reply; 28+ messages in thread
From: Bruce Richardson @ 2026-04-08 13:25 UTC (permalink / raw)
To: dev; +Cc: mb, Bruce Richardson
When the fast-free hint is provided to the driver we know that the mbufs
have refcnt of 1 and are from the same mempool. Therefore, we can
optimize a bit for this case by:
* resetting the necessary mbuf fields, i.e. nb_segs and the next pointer,
while we are already accessing the mbuf to write the descriptor.
* on cleanup of buffers after transmit, we can just write those buffers
straight to the mempool without accessing them.
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
V2: Fix issues with original submission:
* missed check for NULL mbufs
* fixed issue with freeing directly from sw_ring in scalar path which
doesn't work as thats not a flag array of pointers
* fixed missing null assignment in case of large segments for TSO
---
drivers/net/intel/common/tx.h | 21 ++++--
drivers/net/intel/common/tx_scalar.h | 95 ++++++++++++++++++++++------
2 files changed, 90 insertions(+), 26 deletions(-)
diff --git a/drivers/net/intel/common/tx.h b/drivers/net/intel/common/tx.h
index 283bd58d5d..f2123f069c 100644
--- a/drivers/net/intel/common/tx.h
+++ b/drivers/net/intel/common/tx.h
@@ -363,13 +363,22 @@ ci_txq_release_all_mbufs(struct ci_tx_queue *txq, bool use_ctx)
return;
if (!txq->use_vec_entry) {
- /* Regular scalar path uses sw_ring with ci_tx_entry */
- for (uint16_t i = 0; i < txq->nb_tx_desc; i++) {
- if (txq->sw_ring[i].mbuf != NULL) {
- rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
- txq->sw_ring[i].mbuf = NULL;
- }
+ /* Free mbufs from (last_desc_cleaned + 1) to (tx_tail - 1). */
+ const uint16_t start = (txq->last_desc_cleaned + 1) % txq->nb_tx_desc;
+ const uint16_t nb_desc = txq->nb_tx_desc;
+ const uint16_t end = txq->tx_tail;
+
+ uint16_t i = start;
+ if (end < i) {
+ for (; i < nb_desc; i++)
+ if (txq->sw_ring[i].mbuf != NULL)
+ rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+ i = 0;
}
+ for (; i < end; i++)
+ if (txq->sw_ring[i].mbuf != NULL)
+ rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
+ memset(txq->sw_ring, 0, sizeof(txq->sw_ring[0]) * nb_desc);
return;
}
diff --git a/drivers/net/intel/common/tx_scalar.h b/drivers/net/intel/common/tx_scalar.h
index 9fcd2e4733..adbc4bafee 100644
--- a/drivers/net/intel/common/tx_scalar.h
+++ b/drivers/net/intel/common/tx_scalar.h
@@ -197,16 +197,63 @@ ci_tx_xmit_cleanup(struct ci_tx_queue *txq)
const uint16_t rs_idx = (last_desc_cleaned == nb_tx_desc - 1) ?
0 :
(last_desc_cleaned + 1) >> txq->log2_rs_thresh;
- uint16_t desc_to_clean_to = (rs_idx << txq->log2_rs_thresh) + (txq->tx_rs_thresh - 1);
+ const uint16_t dd_idx = txq->rs_last_id[rs_idx];
+ const uint16_t first_to_clean = rs_idx << txq->log2_rs_thresh;
- /* Check if descriptor is done */
- if ((txd[txq->rs_last_id[rs_idx]].cmd_type_offset_bsz &
- rte_cpu_to_le_64(CI_TXD_QW1_DTYPE_M)) !=
- rte_cpu_to_le_64(CI_TX_DESC_DTYPE_DESC_DONE))
+ /* Check if descriptor is done - all drivers use 0xF as done value in bits 3:0 */
+ if ((txd[dd_idx].cmd_type_offset_bsz & rte_cpu_to_le_64(CI_TXD_QW1_DTYPE_M)) !=
+ rte_cpu_to_le_64(CI_TX_DESC_DTYPE_DESC_DONE))
+ /* Descriptor not yet processed by hardware */
return -1;
+ /* DD bit is set, descriptors are done. Now free the mbufs. */
+ /* Note: nb_tx_desc is guaranteed to be a multiple of tx_rs_thresh,
+ * validated during queue setup. This means cleanup never wraps around
+ * the ring within a single burst (e.g., ring=256, rs_thresh=32 gives
+ * bursts of 0-31, 32-63, ..., 224-255).
+ */
+ const uint16_t nb_to_clean = txq->tx_rs_thresh;
+ struct ci_tx_entry *sw_ring = txq->sw_ring;
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* FAST_FREE path: mbufs are already reset, just return to pool */
+ void *free[CI_TX_MAX_FREE_BUF_SZ];
+ uint16_t nb_free = 0;
+
+ /* Get cached mempool pointer, or cache it on first use */
+ struct rte_mempool *mp =
+ likely(txq->fast_free_mp != (void *)UINTPTR_MAX) ?
+ txq->fast_free_mp :
+ (txq->fast_free_mp = sw_ring[dd_idx].mbuf->pool);
+
+ /* Pack non-NULL mbufs in-place at start of sw_ring range.
+ * No modulo needed in loop since we're guaranteed not to wrap.
+ */
+ for (uint16_t i = 0; i < nb_to_clean; i++) {
+ struct rte_mbuf *m = sw_ring[first_to_clean + i].mbuf;
+ if (m == NULL)
+ continue;
+ free[nb_free++] = m;
+ if (unlikely(nb_free == CI_TX_MAX_FREE_BUF_SZ)) {
+ rte_mempool_put_bulk(mp, free, nb_free);
+ nb_free = 0;
+ }
+ }
+
+ /* Bulk return to mempool using packed sw_ring entries directly */
+ if (nb_free > 0)
+ rte_mempool_put_bulk(mp, free, nb_free);
+ } else {
+ /* Non-FAST_FREE path: use prefree_seg for refcount checks */
+ for (uint16_t i = 0; i < nb_to_clean; i++) {
+ struct rte_mbuf *m = sw_ring[first_to_clean + i].mbuf;
+ if (m != NULL)
+ rte_pktmbuf_free_seg(m);
+ }
+ }
+
/* Update the txq to reflect the last descriptor that was cleaned */
- txq->last_desc_cleaned = desc_to_clean_to;
+ txq->last_desc_cleaned = first_to_clean + txq->tx_rs_thresh - 1;
txq->nb_tx_free += txq->tx_rs_thresh;
return 0;
@@ -450,8 +497,6 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
txd = &ci_tx_ring[tx_id];
tx_id = txe->next_id;
- if (txe->mbuf)
- rte_pktmbuf_free_seg(txe->mbuf);
txe->mbuf = tx_pkt;
/* Setup TX Descriptor */
td_cmd |= CI_TX_DESC_CMD_EOP;
@@ -472,10 +517,7 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
txn = &sw_ring[txe->next_id];
RTE_MBUF_PREFETCH_TO_FREE(txn->mbuf);
- if (txe->mbuf) {
- rte_pktmbuf_free_seg(txe->mbuf);
- txe->mbuf = NULL;
- }
+ txe->mbuf = NULL;
write_txd(ctx_txd, cd_qw0, cd_qw1);
@@ -489,10 +531,7 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
txn = &sw_ring[txe->next_id];
RTE_MBUF_PREFETCH_TO_FREE(txn->mbuf);
- if (txe->mbuf) {
- rte_pktmbuf_free_seg(txe->mbuf);
- txe->mbuf = NULL;
- }
+ txe->mbuf = NULL;
ipsec_txd[0] = ipsec_qw0;
ipsec_txd[1] = ipsec_qw1;
@@ -507,10 +546,21 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
txd = &ci_tx_ring[tx_id];
txn = &sw_ring[txe->next_id];
- if (txe->mbuf)
- rte_pktmbuf_free_seg(txe->mbuf);
txe->mbuf = m_seg;
+ /* For FAST_FREE: reset mbuf fields while we have it in cache.
+ * FAST_FREE guarantees refcnt=1 and direct mbufs, so we only
+ * need to reset nb_segs and next pointer as per rte_pktmbuf_prefree_seg.
+ * Save next pointer before resetting since we need it for loop iteration.
+ */
+ struct rte_mbuf *next_seg = m_seg->next;
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ if (m_seg->nb_segs != 1)
+ m_seg->nb_segs = 1;
+ if (next_seg != NULL)
+ m_seg->next = NULL;
+ }
+
/* Setup TX Descriptor */
/* Calculate segment length, using IPsec callback if provided */
if (ipsec_ops != NULL)
@@ -528,18 +578,23 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
((uint64_t)CI_MAX_DATA_PER_TXD << CI_TXD_QW1_TX_BUF_SZ_S) |
((uint64_t)td_tag << CI_TXD_QW1_L2TAG1_S);
write_txd(txd, buf_dma_addr, cmd_type_offset_bsz);
+ /* txe for this slot has already been written (e.g. above outside
+ * loop), so we write the extra NULL mbuf pointer for this
+ * descriptor after we increment txe below.
+ */
buf_dma_addr += CI_MAX_DATA_PER_TXD;
slen -= CI_MAX_DATA_PER_TXD;
tx_id = txe->next_id;
txe = txn;
+ txe->mbuf = NULL;
txd = &ci_tx_ring[tx_id];
txn = &sw_ring[txe->next_id];
}
/* fill the last descriptor with End of Packet (EOP) bit */
- if (m_seg->next == NULL)
+ if (next_seg == NULL)
td_cmd |= CI_TX_DESC_CMD_EOP;
const uint64_t cmd_type_offset_bsz = CI_TX_DESC_DTYPE_DATA |
@@ -551,7 +606,7 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
tx_id = txe->next_id;
txe = txn;
- m_seg = m_seg->next;
+ m_seg = next_seg;
} while (m_seg);
end_pkt:
txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_used);
--
2.51.0
* RE: [PATCH v2] net/intel: optimize for fast-free hint
2026-04-08 13:25 ` [PATCH v2] " Bruce Richardson
@ 2026-04-08 19:27 ` Morten Brørup
0 siblings, 0 replies; 28+ messages in thread
From: Morten Brørup @ 2026-04-08 19:27 UTC (permalink / raw)
To: Bruce Richardson, dev
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Wednesday, 8 April 2026 15.25
>
> When the fast-free hint is provided to the driver we know that the
> mbufs
> have refcnt of 1 and are from the same mempool. Therefore, we can
> optimize a bit for this case by:
>
> * resetting the necessary mbuf fields, i.e. nb_segs and next pointer when
> we are accessing the mbuf on writing the descriptor.
> * on cleanup of buffers after transmit, we can just write those buffers
> straight to the mempool without accessing them.
>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
A bunch of review thoughts inline below.
The ones regarding instrumentation should be fixed.
The rest might be irrelevant and/or nonsense.
> ---
> V2: Fix issues with original submission:
> * missed check for NULL mbufs
> * fixed issue with freeing directly from sw_ring in scalar path which
> doesn't work as that's not a flat array of pointers
> * fixed missing null assignment in case of large segments for TSO
> ---
> drivers/net/intel/common/tx.h | 21 ++++--
> drivers/net/intel/common/tx_scalar.h | 95 ++++++++++++++++++++++------
> 2 files changed, 90 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/net/intel/common/tx.h
> b/drivers/net/intel/common/tx.h
> index 283bd58d5d..f2123f069c 100644
> --- a/drivers/net/intel/common/tx.h
> +++ b/drivers/net/intel/common/tx.h
> @@ -363,13 +363,22 @@ ci_txq_release_all_mbufs(struct ci_tx_queue *txq, bool use_ctx)
> return;
>
> if (!txq->use_vec_entry) {
> - /* Regular scalar path uses sw_ring with ci_tx_entry */
> - for (uint16_t i = 0; i < txq->nb_tx_desc; i++) {
> - if (txq->sw_ring[i].mbuf != NULL) {
> - rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
> - txq->sw_ring[i].mbuf = NULL;
> - }
> + /* Free mbufs from (last_desc_cleaned + 1) to (tx_tail - 1). */
> + const uint16_t start = (txq->last_desc_cleaned + 1) % txq->nb_tx_desc;
> + const uint16_t nb_desc = txq->nb_tx_desc;
> + const uint16_t end = txq->tx_tail;
> +
> + uint16_t i = start;
> + if (end < i) {
> + for (; i < nb_desc; i++)
> + if (txq->sw_ring[i].mbuf != NULL)
> + rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
> + i = 0;
> }
> + for (; i < end; i++)
> + if (txq->sw_ring[i].mbuf != NULL)
> + rte_pktmbuf_free_seg(txq->sw_ring[i].mbuf);
> + memset(txq->sw_ring, 0, sizeof(txq->sw_ring[0]) * nb_desc);
> return;
> }
The above LGTM.
IIRC, we already discussed it - or something very similar.
>
> diff --git a/drivers/net/intel/common/tx_scalar.h
> b/drivers/net/intel/common/tx_scalar.h
> index 9fcd2e4733..adbc4bafee 100644
> --- a/drivers/net/intel/common/tx_scalar.h
> +++ b/drivers/net/intel/common/tx_scalar.h
> @@ -197,16 +197,63 @@ ci_tx_xmit_cleanup(struct ci_tx_queue *txq)
> const uint16_t rs_idx = (last_desc_cleaned == nb_tx_desc - 1) ?
> 0 :
> (last_desc_cleaned + 1) >> txq->log2_rs_thresh;
> - uint16_t desc_to_clean_to = (rs_idx << txq->log2_rs_thresh) + (txq->tx_rs_thresh - 1);
> + const uint16_t dd_idx = txq->rs_last_id[rs_idx];
> + const uint16_t first_to_clean = rs_idx << txq->log2_rs_thresh;
>
> - /* Check if descriptor is done */
> - if ((txd[txq->rs_last_id[rs_idx]].cmd_type_offset_bsz &
> - rte_cpu_to_le_64(CI_TXD_QW1_DTYPE_M)) !=
> - rte_cpu_to_le_64(CI_TX_DESC_DTYPE_DESC_DONE))
> + /* Check if descriptor is done - all drivers use 0xF as done value in bits 3:0 */
> + if ((txd[dd_idx].cmd_type_offset_bsz & rte_cpu_to_le_64(CI_TXD_QW1_DTYPE_M)) !=
> + rte_cpu_to_le_64(CI_TX_DESC_DTYPE_DESC_DONE))
> + /* Descriptor not yet processed by hardware */
> return -1;
>
> + /* DD bit is set, descriptors are done. Now free the mbufs. */
> + /* Note: nb_tx_desc is guaranteed to be a multiple of tx_rs_thresh,
> + * validated during queue setup. This means cleanup never wraps around
> + * the ring within a single burst (e.g., ring=256, rs_thresh=32 gives
> + * bursts of 0-31, 32-63, ..., 224-255).
> + */
> + const uint16_t nb_to_clean = txq->tx_rs_thresh;
> + struct ci_tx_entry *sw_ring = txq->sw_ring;
> +
> + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> + /* FAST_FREE path: mbufs are already reset, just return to pool */
Depending on which txq cache lines have already been loaded here (i.e. unless txq->offloads is hot in the CPU cache while txq->fast_free_mp is not), consider testing (txq->fast_free_mp != NULL) instead of (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE).
Like here:
https://elixir.bootlin.com/dpdk/v26.03/source/drivers/net/intel/common/tx.h#L281
> + void *free[CI_TX_MAX_FREE_BUF_SZ];
> + uint16_t nb_free = 0;
> +
> + /* Get cached mempool pointer, or cache it on first use */
> + struct rte_mempool *mp =
> + likely(txq->fast_free_mp != (void *)UINTPTR_MAX) ?
> + txq->fast_free_mp :
> + (txq->fast_free_mp = sw_ring[dd_idx].mbuf->pool);
> +
> + /* Pack non-NULL mbufs in-place at start of sw_ring range.
> + * No modulo needed in loop since we're guaranteed not to wrap.
> + */
> + for (uint16_t i = 0; i < nb_to_clean; i++) {
> + struct rte_mbuf *m = sw_ring[first_to_clean + i].mbuf;
> + if (m == NULL)
> + continue;
> + free[nb_free++] = m;
Should sw_ring[first_to_clean + i].mbuf be set to NULL here, instead of in ci_xmit_pkts()?
I don't know, just want you to consider it.
> + if (unlikely(nb_free == CI_TX_MAX_FREE_BUF_SZ)) {
> + rte_mempool_put_bulk(mp, free, nb_free);
rte_mempool_put_bulk() -> rte_mbuf_raw_free_bulk(),
for instrumentation.
> + nb_free = 0;
> + }
> + }
> +
> + /* Bulk return to mempool using packed sw_ring entries directly */
> + if (nb_free > 0)
> + rte_mempool_put_bulk(mp, free, nb_free);
rte_mempool_put_bulk() -> rte_mbuf_raw_free_bulk(),
for instrumentation.
> + } else {
> + /* Non-FAST_FREE path: use prefree_seg for refcount checks */
> + for (uint16_t i = 0; i < nb_to_clean; i++) {
> + struct rte_mbuf *m = sw_ring[first_to_clean + i].mbuf;
> + if (m != NULL)
> + rte_pktmbuf_free_seg(m);
Should sw_ring[first_to_clean + i].mbuf be set to NULL here, instead of in ci_xmit_pkts()?
I don't know, just want you to consider it.
> + }
> + }
> +
> /* Update the txq to reflect the last descriptor that was cleaned */
> - txq->last_desc_cleaned = desc_to_clean_to;
> + txq->last_desc_cleaned = first_to_clean + txq->tx_rs_thresh - 1;
> txq->nb_tx_free += txq->tx_rs_thresh;
>
> return 0;
> @@ -450,8 +497,6 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
> txd = &ci_tx_ring[tx_id];
> tx_id = txe->next_id;
>
> - if (txe->mbuf)
> - rte_pktmbuf_free_seg(txe->mbuf);
> txe->mbuf = tx_pkt;
> /* Setup TX Descriptor */
> td_cmd |= CI_TX_DESC_CMD_EOP;
> @@ -472,10 +517,7 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
>
> txn = &sw_ring[txe->next_id];
> RTE_MBUF_PREFETCH_TO_FREE(txn->mbuf);
RTE_MBUF_PREFETCH_TO_FREE() doesn't seem relevant here anymore.
I don't know if it fits into ci_tx_xmit_cleanup() instead.
> - if (txe->mbuf) {
> - rte_pktmbuf_free_seg(txe->mbuf);
> - txe->mbuf = NULL;
> - }
> + txe->mbuf = NULL;
Already mentioned: Should txe->mbuf be set to NULL in ci_tx_xmit_cleanup() instead of in ci_xmit_pkts()?
>
> write_txd(ctx_txd, cd_qw0, cd_qw1);
>
> @@ -489,10 +531,7 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
>
> txn = &sw_ring[txe->next_id];
> RTE_MBUF_PREFETCH_TO_FREE(txn->mbuf);
RTE_MBUF_PREFETCH_TO_FREE() doesn't seem relevant here anymore.
I don't know if it fits into ci_tx_xmit_cleanup() instead.
> - if (txe->mbuf) {
> - rte_pktmbuf_free_seg(txe->mbuf);
> - txe->mbuf = NULL;
> - }
> + txe->mbuf = NULL;
Already mentioned: Should txe->mbuf be set to NULL in ci_tx_xmit_cleanup() instead of in ci_xmit_pkts()?
>
> ipsec_txd[0] = ipsec_qw0;
> ipsec_txd[1] = ipsec_qw1;
> @@ -507,10 +546,21 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
> txd = &ci_tx_ring[tx_id];
> txn = &sw_ring[txe->next_id];
>
> - if (txe->mbuf)
> - rte_pktmbuf_free_seg(txe->mbuf);
> txe->mbuf = m_seg;
>
> + /* For FAST_FREE: reset mbuf fields while we have it in cache.
> + * FAST_FREE guarantees refcnt=1 and direct mbufs, so we only
> + * need to reset nb_segs and next pointer as per rte_pktmbuf_prefree_seg.
> + * Save next pointer before resetting since we need it for loop iteration.
> + */
> + struct rte_mbuf *next_seg = m_seg->next;
> + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
Similar to comment further above: Is txq->offloads or txq->fast_free_mp hotter in the CPU cache here?
> + if (m_seg->nb_segs != 1)
> + m_seg->nb_segs = 1;
> + if (next_seg != NULL)
> + m_seg->next = NULL;
> + }
> +
> /* Setup TX Descriptor */
> /* Calculate segment length, using IPsec callback if provided */
> if (ipsec_ops != NULL)
> @@ -528,18 +578,23 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
> ((uint64_t)CI_MAX_DATA_PER_TXD << CI_TXD_QW1_TX_BUF_SZ_S) |
> ((uint64_t)td_tag << CI_TXD_QW1_L2TAG1_S);
> write_txd(txd, buf_dma_addr, cmd_type_offset_bsz);
> + /* txe for this slot has already been written (e.g. above outside
> + * loop), so we write the extra NULL mbuf pointer for this
> + * descriptor after we increment txe below.
> + */
>
> buf_dma_addr += CI_MAX_DATA_PER_TXD;
> slen -= CI_MAX_DATA_PER_TXD;
>
> tx_id = txe->next_id;
> txe = txn;
> + txe->mbuf = NULL;
> txd = &ci_tx_ring[tx_id];
> txn = &sw_ring[txe->next_id];
> }
>
> /* fill the last descriptor with End of Packet (EOP) bit */
> - if (m_seg->next == NULL)
> + if (next_seg == NULL)
> td_cmd |= CI_TX_DESC_CMD_EOP;
>
> const uint64_t cmd_type_offset_bsz = CI_TX_DESC_DTYPE_DATA |
> @@ -551,7 +606,7 @@ ci_xmit_pkts(struct ci_tx_queue *txq,
>
> tx_id = txe->next_id;
> txe = txn;
> - m_seg = m_seg->next;
> + m_seg = next_seg;
> } while (m_seg);
> end_pkt:
> txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_used);
> --
> 2.51.0
end of thread, other threads:[~2026-04-08 19:27 UTC | newest]
Thread overview: 28+ messages
2025-12-15 11:06 mbuf fast-free requirements analysis Morten Brørup
2025-12-15 11:46 ` Bruce Richardson
2026-01-14 15:31 ` Morten Brørup
2026-01-14 16:36 ` Bruce Richardson
2026-01-14 18:05 ` Morten Brørup
2026-01-15 8:46 ` Bruce Richardson
2026-01-15 9:04 ` Morten Brørup
2026-01-23 11:20 ` [PATCH] net/intel: optimize for fast-free hint Bruce Richardson
2026-01-23 12:05 ` Morten Brørup
2026-01-23 12:09 ` Bruce Richardson
2026-01-23 12:27 ` Morten Brørup
2026-01-23 12:53 ` Bruce Richardson
2026-01-23 13:06 ` Morten Brørup
2026-04-08 13:25 ` [PATCH v2] " Bruce Richardson
2026-04-08 19:27 ` Morten Brørup
2026-01-23 11:33 ` mbuf fast-free requirements analysis Bruce Richardson
2025-12-15 14:41 ` Konstantin Ananyev
2025-12-15 16:14 ` Morten Brørup
2025-12-19 17:08 ` Konstantin Ananyev
2025-12-20 7:33 ` Morten Brørup
2025-12-22 15:22 ` Konstantin Ananyev
2025-12-22 17:11 ` Morten Brørup
2025-12-22 17:43 ` Bruce Richardson
2026-01-13 14:48 ` Konstantin Ananyev
2026-01-13 16:07 ` Stephen Hemminger
2026-01-14 17:01 ` Bruce Richardson
2026-01-14 17:31 ` Morten Brørup
2026-01-14 17:45 ` Bruce Richardson