From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
To: "Morten Brørup" <mb@smartsharesystems.com>,
"dev@dpdk.org" <dev@dpdk.org>,
"techboard@dpdk.org" <techboard@dpdk.org>,
"bruce.richardson@intel.com" <bruce.richardson@intel.com>
Subject: RE: mbuf fast-free requirements analysis
Date: Fri, 19 Dec 2025 17:08:00 +0000
Message-ID: <d290adf103244ff7be53844ee32bb6d0@huawei.com>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35F655E5@smartserver.smartshare.dk>
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Monday, 15 December 2025 15.41
> >
> > >
> > > Executive Summary:
> > >
> > > My analysis shows that the mbuf library is not a barrier for
> > > fast-freeing segmented packet mbufs, and thus fast-free of jumbo
> > > frames is possible.
> > >
> > >
> > > Detailed Analysis:
> > >
> > > The purpose of the mbuf fast-free Tx optimization is to reduce
> > > rte_pktmbuf_free_seg() to something much simpler in the ethdev
> > > drivers, by eliminating the code path related to indirect mbufs.
> > > Optimally, we want to simplify the ethdev driver's function that
> > > frees the transmitted mbufs, so it can free them directly to their
> > > mempool without accessing the mbufs themselves.
> > >
> > > If the driver cannot access the mbuf itself, it cannot determine
> > > which mempool it belongs to.
> > > We don't want the driver to access every mbuf being freed; but if all
> > > mbufs of a Tx queue belong to the same mempool, the driver can
> > > determine which mempool by looking into just one of the mbufs.
> > >
> > > REQUIREMENT 1: The mbufs of a Tx queue must come from the same
> > > mempool.
> > >
> > >
> > > When an mbuf is freed to its mempool, some of the fields in the mbuf
> > > must be initialized.
> > > So, for fast-free, this must be done by the driver's function that
> > > prepares the Tx descriptor.
> > > This is a requirement to the driver, not a requirement to the
> > > application.
> > >
> > > Now, let's dig into the code for freeing an mbuf.
> > > Note: For readability purposes, I'll cut out some code and comments
> > > unrelated to this topic.
> > >
> > > static __rte_always_inline void
> > > rte_pktmbuf_free_seg(struct rte_mbuf *m)
> > > {
> > > 	m = rte_pktmbuf_prefree_seg(m);
> > > 	if (likely(m != NULL))
> > > 		rte_mbuf_raw_free(m);
> > > }
> > >
> > >
> > > rte_mbuf_raw_free(m) is simple, so nothing to gain there:
> > >
> > > /**
> > >  * Put mbuf back into its original mempool.
> > >  *
> > >  * The caller must ensure that the mbuf is direct and properly
> > >  * reinitialized (refcnt=1, next=NULL, nb_segs=1), as done by
> > >  * rte_pktmbuf_prefree_seg().
> > >  */
> > > static __rte_always_inline void
> > > rte_mbuf_raw_free(struct rte_mbuf *m)
> > > {
> > > 	rte_mbuf_history_mark(m, RTE_MBUF_HISTORY_OP_LIB_FREE);
> > > 	rte_mempool_put(m->pool, m);
> > > }
> > >
> > > Note that the description says that the mbuf must be direct.
> > > This is not entirely accurate; the mbuf is allowed to use a pinned
> > > external buffer, if the mbuf holds the only reference to it.
> > > (Most of the mbuf library functions have this documentation
> > > inaccuracy, which should be fixed some day.)
> > >
> > > So, the fast-free optimization really comes down to
> > > rte_pktmbuf_prefree_seg(m), which must not return NULL.
> > >
> > > Let's dig into that.
> > >
> > > /**
> > >  * Decrease reference counter and unlink a mbuf segment
> > >  *
> > >  * This function does the same than a free, except that it does not
> > >  * return the segment to its pool.
> > >  * It decreases the reference counter, and if it reaches 0, it is
> > >  * detached from its parent for an indirect mbuf.
> > >  *
> > >  * @return
> > >  *   - (m) if it is the last reference. It can be recycled or freed.
> > >  *   - (NULL) if the mbuf still has remaining references on it.
> > >  */
> > > static __rte_always_inline struct rte_mbuf *
> > > rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > > {
> > > 	bool refcnt_not_one;
> > >
> > > 	refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> > > 	if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
> > > 		return NULL;
> > >
> > > 	if (unlikely(!RTE_MBUF_DIRECT(m))) {
> > > 		rte_pktmbuf_detach(m);
> > > 		if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > 				RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > 				__rte_pktmbuf_pinned_extbuf_decref(m))
> > > 			return NULL;
> > > 	}
> > >
> > > 	if (refcnt_not_one)
> > > 		rte_mbuf_refcnt_set(m, 1);
> > > 	if (m->nb_segs != 1)
> > > 		m->nb_segs = 1;
> > > 	if (m->next != NULL)
> > > 		m->next = NULL;
> > >
> > > 	return m;
> > > }
> > >
> > > This function can only succeed (i.e. return non-NULL) when 'refcnt'
> > > is 1 (or reaches 0).
> > >
> > > REQUIREMENT 2: The driver must hold the only reference to the mbuf,
> > > i.e. 'm->refcnt' must be 1.
> > >
> > >
> > > When the function succeeds, it initializes the mbuf fields as
> > > required by rte_mbuf_raw_free() before returning.
> > >
> > > Now, since the driver has exclusive access to the mbuf, it is free to
> > > initialize the 'm->next' and 'm->nb_segs' at any time.
> > > It could do that when preparing the Tx descriptor.
> > >
> > > This is very interesting, because it means that fast-free does not
> > > prohibit segmented packets!
> > > (But the driver must have sufficient Tx descriptors for all segments
> > > in the mbuf.)
> > >
> > >
> > > Now, let's dig into rte_pktmbuf_prefree_seg()'s block handling
> > > non-direct mbufs, i.e. cloned mbufs and mbufs with external buffer:
> > >
> > > 	if (unlikely(!RTE_MBUF_DIRECT(m))) {
> > > 		rte_pktmbuf_detach(m);
> > > 		if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > 				RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > 				__rte_pktmbuf_pinned_extbuf_decref(m))
> > > 			return NULL;
> > > 	}
> > >
> > > Starting with rte_pktmbuf_detach():
> > >
> > > static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
> > > {
> > > 	struct rte_mempool *mp = m->pool;
> > > 	uint32_t mbuf_size, buf_len;
> > > 	uint16_t priv_size;
> > >
> > > 	if (RTE_MBUF_HAS_EXTBUF(m)) {
> > > 		/*
> > > 		 * The mbuf has the external attached buffer,
> > > 		 * we should check the type of the memory pool where
> > > 		 * the mbuf was allocated from to detect the pinned
> > > 		 * external buffer.
> > > 		 */
> > > 		uint32_t flags = rte_pktmbuf_priv_flags(mp);
> > >
> > > 		if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
> > > 			/*
> > > 			 * The pinned external buffer should not be
> > > 			 * detached from its backing mbuf, just exit.
> > > 			 */
> > > 			return;
> > > 		}
> > > 		__rte_pktmbuf_free_extbuf(m);
> > > 	} else {
> > > 		__rte_pktmbuf_free_direct(m);
> > > 	}
> > > 	priv_size = rte_pktmbuf_priv_size(mp);
> > > 	mbuf_size = (uint32_t)(sizeof(struct rte_mbuf) + priv_size);
> > > 	buf_len = rte_pktmbuf_data_room_size(mp);
> > >
> > > 	m->priv_size = priv_size;
> > > 	m->buf_addr = (char *)m + mbuf_size;
> > > 	rte_mbuf_iova_set(m, rte_mempool_virt2iova(m) + mbuf_size);
> > > 	m->buf_len = (uint16_t)buf_len;
> > > 	rte_pktmbuf_reset_headroom(m);
> > > 	m->data_len = 0;
> > > 	m->ol_flags = 0;
> > > }
> > >
> > > The only quick and simple code path through this function is when the
> > > mbuf uses a pinned external buffer:
> > > 	if (RTE_MBUF_HAS_EXTBUF(m)) {
> > > 		uint32_t flags = rte_pktmbuf_priv_flags(mp);
> > > 		if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
> > > 			return;
> > >
> > > REQUIREMENT 3: The mbuf must not be cloned or use a non-pinned
> > > external buffer.
> > >
> > >
> > > Continuing with the next part of rte_pktmbuf_prefree_seg()'s block:
> > > 	if (RTE_MBUF_HAS_EXTBUF(m) &&
> > > 			RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > > 			__rte_pktmbuf_pinned_extbuf_decref(m))
> > > 		return NULL;
> > >
> > > This calls __rte_pktmbuf_pinned_extbuf_decref():
> > >
> > > /**
> > >  * @internal Handle the packet mbufs with attached pinned external
> > >  * buffer on the mbuf freeing:
> > >  *
> > >  *  - return zero if reference counter in shinfo is one. It means
> > >  *  there is no more reference to this pinned buffer and mbuf can be
> > >  *  returned to the pool
> > >  *
> > >  *  - otherwise (if reference counter is not one), decrement
> > >  *  reference counter and return non-zero value to prevent freeing
> > >  *  the backing mbuf.
> > >  *
> > >  * Returns non zero if mbuf should not be freed.
> > >  */
> > > static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
> > > {
> > > 	struct rte_mbuf_ext_shared_info *shinfo;
> > >
> > > 	/* Clear flags, mbuf is being freed. */
> > > 	m->ol_flags = RTE_MBUF_F_EXTERNAL;
> > > 	shinfo = m->shinfo;
> > >
> > > 	/* Optimize for performance - do not dec/reinit */
> > > 	if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1))
> > > 		return 0;
> > >
> > > 	/*
> > > 	 * Direct usage of add primitive to avoid
> > > 	 * duplication of comparing with one.
> > > 	 */
> > > 	if (likely(rte_atomic_fetch_add_explicit(&shinfo->refcnt, -1,
> > > 			rte_memory_order_acq_rel) - 1))
> > > 		return 1;
> > >
> > > 	/* Reinitialize counter before mbuf freeing. */
> > > 	rte_mbuf_ext_refcnt_set(shinfo, 1);
> > > 	return 0;
> > > }
> > >
> > > Essentially, if the mbuf does use a pinned external buffer,
> > > rte_pktmbuf_prefree_seg() only succeeds if that pinned external
> > > buffer is only referred to by the mbuf.
> > >
> > > REQUIREMENT 4: If the mbuf uses a pinned external buffer, the mbuf
> > > must hold the only reference to that pinned external buffer, i.e. in
> > > that case, 'm->shinfo->refcnt' must be 1.
> > >
> > >
> > > Please review.
> > >
> > > If I'm not mistaken, the mbuf library is not a barrier for
> > > fast-freeing segmented packet mbufs, and thus fast-free of jumbo
> > > frames is possible.
> > >
> > > We need a driver developer to confirm that my suggested approach -
> > > resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> > > preparing the Tx descriptor - is viable.
> >
> > Great analysis, makes a lot of sense to me.
> > Shall we then add a special API to make PMD maintainers' lives a bit
> > easier? Something like rte_mbuf_fast_free_prep(mp, mb), that will
> > optionally check that the requirements outlined above are satisfied
> > for the given mbuf and also reset the mbuf fields to expected values?
>
> Good idea, Konstantin.
>
> Detailed suggestion below.
> Note that __rte_mbuf_raw_sanity_check_mp() is used to check the requirements
> after 'nb_segs' and 'next' have been initialized.
>
> /**
>  * Reinitialize an mbuf for freeing back into the mempool.
>  *
>  * The caller must ensure that the mbuf comes from the specified mempool,
>  * is direct and only referred to by the caller (refcnt=1).
>  *
>  * This function is used by drivers in their transmit function for mbuf fast release
>  * when the transmit descriptor is initialized,
>  * so the driver can call rte_mbuf_raw_free()
>  * when the packet segment has been transmitted.
>  *
>  * @see RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
>  *
>  * @param mp
>  *   The mempool to which the mbuf belongs.
>  * @param m
>  *   The mbuf being reinitialized.
>  */
> static __rte_always_inline void
> rte_mbuf_raw_prefree_seg(const struct rte_mempool *mp, struct rte_mbuf *m)
> {
> 	if (m->nb_segs != 1)
> 		m->nb_segs = 1;
> 	if (m->next != NULL)
> 		m->next = NULL;
>
> 	__rte_mbuf_raw_sanity_check_mp(m, mp);
> 	rte_mbuf_history_mark(m,
> 			RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
> }
Thanks Morten, though should we really panic if a condition is not met?
Maybe just do the check first and return an error.
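
To illustrate the check-first idea, here is a minimal stand-alone sketch.
It uses mock types rather than the real struct rte_mbuf and
struct rte_mempool, and every name in it (mock_mbuf, mock_mempool,
MOCK_MBUF_F_INDIRECT, mock_mbuf_raw_prefree_seg_checked) is hypothetical.
The "not direct" test is reduced to a single flag, and the pinned external
buffer case (requirement 4) is left out, so treat it as an outline of the
control flow only, not as a proposed implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Mock stand-ins for illustration only; NOT the real DPDK structures. */
struct mock_mempool { int id; };

/* Hypothetical flag standing in for "mbuf is not direct". */
#define MOCK_MBUF_F_INDIRECT (1ULL << 0)

struct mock_mbuf {
	struct mock_mempool *pool;
	struct mock_mbuf *next;
	uint64_t ol_flags;
	uint16_t refcnt;
	uint16_t nb_segs;
};

/*
 * Hypothetical check-first variant: validate the fast-free requirements,
 * and only then reset 'nb_segs' and 'next'.
 * Returns 0 on success, or -EINVAL if a requirement is not met, in which
 * case the mbuf is left untouched.
 */
static inline int
mock_mbuf_raw_prefree_seg_checked(const struct mock_mempool *mp,
		struct mock_mbuf *m)
{
	if (m->pool != mp)                      /* requirement 1 */
		return -EINVAL;
	if (m->refcnt != 1)                     /* requirement 2 */
		return -EINVAL;
	if (m->ol_flags & MOCK_MBUF_F_INDIRECT) /* requirement 3, simplified */
		return -EINVAL;

	if (m->nb_segs != 1)
		m->nb_segs = 1;
	if (m->next != NULL)
		m->next = NULL;
	return 0;
}
```

With this shape, the driver decides what to do on failure (for example,
fall back to the normal free path for that segment) instead of the library
panicking.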
>
> /**
>  * Reinitialize a bulk of mbufs for freeing back into the mempool.
>  *
>  * The caller must ensure that the mbufs come from the specified mempool,
>  * are direct and only referred to by the caller (refcnt=1).
>  *
>  * This function is used by drivers in their transmit function for mbuf fast release
>  * when the transmit descriptors are initialized,
>  * so the driver can call rte_mbuf_raw_free_bulk()
>  * when the packet segments have been transmitted.
>  *
>  * @see RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
>  *
>  * @param mp
>  *   The mempool to which the mbufs belong.
>  * @param mbufs
>  *   Array of pointers to mbufs being reinitialized.
>  *   The array must not contain NULL pointers.
>  * @param count
>  *   Array size.
>  */
> static __rte_always_inline void
> rte_mbuf_raw_prefree_seg_bulk(const struct rte_mempool *mp,
> 		struct rte_mbuf **mbufs, unsigned int count)
> {
> 	for (unsigned int idx = 0; idx < count; idx++) {
> 		struct rte_mbuf *m = mbufs[idx];
>
> 		if (m->nb_segs != 1)
> 			m->nb_segs = 1;
> 		if (m->next != NULL)
> 			m->next = NULL;
>
> 		__rte_mbuf_raw_sanity_check_mp(m, mp);
> 	}
> 	rte_mbuf_history_mark_bulk(mbufs, count,
> 			RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
> }
>
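
The same error-returning approach could be applied to the bulk function.
The sketch below is one possible shape, again with mock types and
hypothetical names (mock_mbuf, mock_mempool,
mock_mbuf_raw_prefree_seg_bulk_checked): it stops at the first mbuf that
fails the checks and returns how many were prepared. Only the mempool and
refcnt checks are modelled here; the direct/external-buffer checks are
omitted for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Mock stand-ins for illustration only; NOT the real DPDK structures. */
struct mock_mempool { int id; };

struct mock_mbuf {
	struct mock_mempool *pool;
	struct mock_mbuf *next;
	uint16_t refcnt;
	uint16_t nb_segs;
};

/*
 * Hypothetical bulk variant: prepares mbufs in array order and stops at
 * the first one that fails the checks. Returns the number of mbufs
 * prepared, so a value smaller than 'count' is the index of the first
 * offending mbuf, which is left untouched.
 */
static inline unsigned int
mock_mbuf_raw_prefree_seg_bulk_checked(const struct mock_mempool *mp,
		struct mock_mbuf **mbufs, unsigned int count)
{
	unsigned int idx;

	for (idx = 0; idx < count; idx++) {
		struct mock_mbuf *m = mbufs[idx];

		if (m->pool != mp || m->refcnt != 1)
			break;
		m->nb_segs = 1;
		m->next = NULL;
	}
	return idx;
}
```

Returning a count rather than an error code lets the caller fast-free the
leading mbufs and handle only the remainder on the slow path.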
> > Konstantin
> >
> >