RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
To: "Konstantin Ananyev" <konstantin.ananyev@huawei.com>,
	"Morten Brørup" <mb@smartsharesystems.com>,
	"David Marchand" <david.marchand@redhat.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Cc: "thomas@monjalon.net" <thomas@monjalon.net>,
	"ferruh.yigit@amd.com" <ferruh.yigit@amd.com>,
	"stable@dpdk.org" <stable@dpdk.org>,
	Olivier Matz <olivier.matz@6wind.com>,
	Jijiang Liu <jijiang.liu@intel.com>,
	"Andrew Rybchenko" <andrew.rybchenko@oktetlabs.ru>,
	Ferruh Yigit <ferruh.yigit@amd.com>,
	Kaiwen Deng <kaiwenx.deng@intel.com>,
	"qiming.yang@intel.com" <qiming.yang@intel.com>,
	"yidingx.zhou@intel.com" <yidingx.zhou@intel.com>,
	Aman Singh <aman.deep.singh@intel.com>,
	"Yuying Zhang" <yuying.zhang@intel.com>,
	Thomas Monjalon <thomas@monjalon.net>,
	"Jerin Jacob" <jerinj@marvell.com>
Subject: RE: [PATCH v2 3/8] mbuf: fix Tx checksum offload examples
Date: Tue, 16 Apr 2024 11:36:27 +0000	[thread overview]
Message-ID: <b4b19af1b13c436ebb13cda5ced25470@huawei.com> (raw)
In-Reply-To: <3f8214e0bcb448338cf2679f753a983d@huawei.com>


> > > > > > > > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx
> > > checksum
> > > > > offload
> > > > > > > > > > > examples.
> > > > > > > > > >
> > > > > > > > > > I strongly disagree with this change!
> > > > > > > > > >
> > > > > > > > > > It will cause a huge performance degradation for shaping
> > > > > applications:
> > > > > > > > > >
> > > > > > > > > > A packet will be processed and finalized at an output or
> > > > > forwarding
> > > > > > > > > pipeline stage, where some other fields might also be
> > > written,
> > > > > so
> > > > > > > > > > zeroing e.g. the out_ip checksum at this stage has low
> > > cost
> > > > > (no new
> > > > > > > > > cache misses).
> > > > > > > > > >
> > > > > > > > > > Then, the packet might be queued for QoS or similar.
> > > > > > > > > >
> > > > > > > > > > If rte_eth_tx_prepare() must be called at the egress
> > > pipeline
> > > > > stage,
> > > > > > > > > it has to write to the packet and cause a cache miss per
> > > packet,
> > > > > > > > > > instead of simply passing on the packet to the NIC
> > > hardware.
> > > > > > > > > >
> > > > > > > > > > It must be possible to finalize the packet at the
> > > > > output/forwarding
> > > > > > > > > pipeline stage!
> > > > > > > > >
> > > > > > > > > If you can finalize your packet on  output/forwarding, then
> > > why
> > > > > you
> > > > > > > > > can't invoke tx_prepare() on the same stage?
> > > > > > > > > There seems to be some misunderstanding about what
> > > tx_prepare()
> > > > > does -
> > > > > > > > > in fact it doesn't communicate with HW queue (doesn't update
> > > TXD
> > > > > ring,
> > > > > > > > > etc.), what it does - just make changes in mbuf itself.
> > > > > > > > > Yes, it reads some fields in SW TX queue struct (max number
> > > of
> > > > > TXDs per
> > > > > > > > > packet, etc.), but AFAIK it is safe
> > > > > > > > > to call tx_prepare() and tx_burst() from different threads.
> > > > > > > > > At least on implementations I am aware about.
> > > > > > > > > Just checked the docs - it seems not stated explicitly
> > > anywhere,
> > > > > might
> > > > > > > > > be that's why it causing such misunderstanding.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Also, how is rte_eth_tx_prepare() supposed to work for
> > > cloned
> > > > > packets
> > > > > > > > > egressing on different NIC hardware?
> > > > > > > > >
> > > > > > > > > If you create a clone of full packet (including L2/L3)
> > > headers
> > > > > then
> > > > > > > > > obviously such construction might not
> > > > > > > > > work properly with tx_prepare() over two different NICs.
> > > > > > > > > Though In majority of cases you do clone segments with data,
> > > > > while at
> > > > > > > > > least L2 headers are put into different segments.
> > > > > > > > > One simple approach would be to keep L3 header in that
> > > separate
> > > > > segment.
> > > > > > > > > But yes, there is a problem when you'll need to send exactly
> > > the
> > > > > same
> > > > > > > > > packet over different NICs.
> > > > > > > > > As I remember, for bonding PMD things don't work quite well
> > > here
> > > > > - you
> > > > > > > > > might have a bond over 2 NICs with
> > > > > > > > > different tx_prepare() and which one to call might be not
> > > clear
> > > > > till
> > > > > > > > > actual PMD tx_burst() is invoked.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > In theory, it might get even worse if we make this opaque
> > > > > instead of
> > > > > > > > > transparent and standardized:
> > > > > > > > > > One PMD might reset out_ip checksum to 0x0000, and another
> > > PMD
> > > > > might
> > > > > > > > > reset it to 0xFFFF.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I can only see one solution:
> > > > > > > > > > We need to standardize on common minimum requirements for
> > > how
> > > > > to
> > > > > > > > > prepare packets for each TX offload.
> > > > > > > > >
> > > > > > > > > If we can make each and every vendor to agree here - that
> > > > > definitely
> > > > > > > > > will help to simplify things quite a bit.
> > > > > > > >
> > > > > > > > An API is more than a function name and parameters.
> > > > > > > > It also has preconditions and postconditions.
> > > > > > > >
> > > > > > > > All major NIC vendors are contributing to DPDK.
> > > > > > > > It should be possible to reach consensus for reasonable
> > > minimum
> > > > > requirements
> > > > > > > for offloads.
> > > > > > > > Hardware- and driver-specific exceptions can be documented
> > > with
> > > > > the offload
> > > > > > > flag, or with rte_eth_rx/tx_burst(), like the note to
> > > > > > > > rte_eth_rx_burst():
> > > > > > > > "Some drivers using vector instructions require that nb_pkts
> > > is
> > > > > divisible by
> > > > > > > 4 or 8, depending on the driver implementation."
> > > > > > >
> > > > > > > If we introduce a rule that everyone supposed to follow and then
> > > > > straightway
> > > > > > > allow people to have a 'documented exceptions',
> > > > > > > for me it means like 'no rule' in practice.
> > > > > > > A 'documented exceptions' approach might work if you have 5
> > > > > different PMDs to
> > > > > > > support, but not when you have 50+.
> > > > > > > No-one would write an app with possible 10 different exception
> > > cases
> > > > > in his
> > > > > > > head.
> > > > > > > Again, with such approach we can forget about backward
> > > > > compatibility.
> > > > > > > I think we already had this discussion before, my opinion
> > > remains
> > > > > the same
> > > > > > > here -
> > > > > > > 'documented exceptions' approach is a way to trouble.
> > > > > >
> > > > > > The "minimum requirements" should be the lowest common denominator
> > > of
> > > > > all NICs.
> > > > > > Exceptions should be extremely few, for outlier NICs that still
> > > want
> > > > > to provide an offload and its driver is unable to live up to the
> > > > > > minimum requirements.
> > > > > > Any exception should require techboard approval. If a NIC/driver
> > > does
> > > > > not support the "minimum requirements" for an offload
> > > > > > feature, it is not allowed to claim support for that offload
> > > feature,
> > > > > or needs to seek approval for an exception.
> > > > > >
> > > > > > As another option for NICs not supporting the minimum requirements
> > > of
> > > > > an offload feature, we could introduce offload flags with
> > > > > > finer granularity. E.g. one offload flag for "gold standard" TX
> > > > > checksum update (where the packet's checksum field can have any
> > > > > > value), and another offload flag for "silver standard" TX checksum
> > > > > update (where the packet's checksum field must have a
> > > > > > precomputed value).
> > > > >
> > > > > Actually yes, I was thinking in the same direction - we need some
> > > extra
> > > > > API to allow user to distinguish.
> > > > > Probably we can do something like that: a new API for the ethdev
> > > call
> > > > > that would take as a parameter
> > > > > TX offloads bitmap and in return specify would it need to modify
> > > > > contents of packet to support these
> > > > > offloads or not.
> > > > > Something like:
> > > > > int rte_ethdev_tx_offload_pkt_mod_required(unt64_t tx_offloads)
> > > > >
> > > > > For the majority of the drivers that satisfy these "minimum
> > > > > requirements" corresponding devops
> > > > > entry will be empty and we'll always return 0, otherwise PMD has to
> > > > > provide a proper devop.
> > > > > Then again, it would be up to the user, to determine can he pass
> > > same
> > > > > packet to 2 different NICs or not.
> > > > >
> > > > > I suppose it is similar to what you were talking about?
> > > >
> > > > I was thinking something more simple:
> > > >
> > > > The NIC exposes its RX and TX offload capabilities to the application
> > > through the rx/tx_offload_capa and other fields in the
> > > > rte_eth_dev_info structure returned by rte_eth_dev_info_get().
> > > >
> > > > E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM flag
> > > set.
> > > > These capability flags (or enums) are mostly undocumented in the code,
> > > but I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM
> > > > capability means that the NIC is able to update the IPv4 header
> > > checksum at egress (on the wire, i.e. without modifying the mbuf or
> > > > packet data), and that the application must set RTE_MBUF_F_TX_IP_CKSUM
> > > in the mbufs to utilize this offload.
> > > > I would define and document what each capability flag/enum exactly
> > > means, the minimum requirements (as defined by the DPDK
> > > > community) for the driver to claim support for it, and the
> > > requirements for an application to use it.
> > > > For the sake of discussion, let's say that
> > > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update
> > > capability
> > > > (i.e. no requirements to the checksum field in the packet contents).
> > > > If some NIC requires the checksum field in the packet contents to have
> > > a precomputed value, the NIC would not be allowed to claim
> > > > the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability.
> > > > Such a NIC would need to define and document a new capability, e.g.
> > > RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver
> > > > standard" TX checksum update capability.
> > > > In other words: I would encode variations of offload capabilities
> > > directly in the capabilities flags.
> > > > Then we don't need additional APIs to help interpret those
> > > capabilities.
> > >
> > > I understood your intention with different flags, yes it should work too
> > > I think.
> > > The reason I am not very fond of it - it will require to double
> > > TX_OFFLOAD flags.
> >
> > An additional feature flag is only required if a NIC is not conforming to the "minimum requirements" of an offload feature, and the
> > techboard permits introducing a variant of an existing feature.
> > There should be very few additional feature flags for variants - exceptions only - or the "minimum requirements" are not broad
> > enough to support the majority of NICs.
> 
> Ok, so you suggest to group all existing reqs plus what all current tx_prepare() do into "minimum requirements"?
> So with current drivers in place we wouldn't need these new flags, but we'll reserve such opportunity.
> That might work, if there are no contradictory requirements in current PMDs, and PMDs maintainers with
> less reqs will agree with these 'extra' stuff.

Just to check how easy/hard would be to get a consensus, compiled a list of mbuf changes
done by different PMDs in tx_prepare(). See below.
Could be not fully correct or complete. PMD maintainers, feel free to update it, if I missed something.
From how it looks to me:
if we'll go the way you suggest, then hns3 and virtio will most likely become
a 'second class citizens' - will need a special offload flags for them.
Plus, either all PMDs that now set tx_prepare()=NULL will have to agree to require
rte_net_intel_cksum_prepare() to be done, or all Intel PMDs and few others will also be downgraded
to 'second class'.

PMD: atlantic
MOD: rte_net_intel_cksum_prepare()
/*for  ipv4_hdr->hdr_checksum = 0; (tcp|udp)_hdr->cksum=rte_ipv(4|6)_phdr_cksum(...);*/

PMD: cpfl/idpf
MOD: none

PMD: em/igb/igc/fm10k/i40e/iavf/ice/ixgbe
MOD: rte_net_intel_cksum_prepare()

PMD: enic
MOD: rte_net_intel_cksum_prepare()

PMD: hns3
MOD: rte_net_intel_cksum_prepare() plus some extra:
            /*
         * A UDP packet with the same dst_port as VXLAN\VXLAN_GPE\GENEVE will
         * be recognized as a tunnel packet in HW. In this case, if UDP CKSUM
         * offload is set and the tunnel mask has not been set, the CKSUM will
         * be wrong since the header length is wrong and driver should complete
         * the CKSUM to avoid CKSUM error.
         */

PMD: ionic
MOD: none

PMD: ngbe
MOD: rte_net_intel_cksum_prepare()

PMD: qede
MOD: none

PMD: txgbe
MOD: rte_net_intel_cksum_prepare()

PMD: virtio:
MOD: rte_net_intel_cksum_prepare() plus some extra:
           - for RTE_MBUF_F_TX_TCP_SEG: virtio_tso_fix_cksum()
           - for RTE_MBUF_F_TX_VLAN: rte_vlan_insert()

PMD: vmxnet3
MOD: rte_net_intel_cksum_prepare()

For all other PMDs in our main tree set tx_prepare = NULL.

next prev parent reply	other threads:[~2024-04-16 11:36 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-05 12:49 [PATCH 0/8] Fix outer UDP checksum for Intel nics David Marchand
2024-04-05 12:49 ` [PATCH 1/8] net/ice: fix check for outer UDP checksum offload David Marchand
2024-04-05 14:19   ` David Marchand
2024-04-05 12:49 ` [PATCH 2/8] net/ice: enhance debug when HW fails to transmit David Marchand
2024-04-05 12:49 ` [PATCH 3/8] mbuf: fix Tx checksum offload examples David Marchand
2024-04-05 12:49 ` [PATCH 4/8] app/testpmd: fix outer IP checksum offload David Marchand
2024-04-05 12:49 ` [PATCH 5/8] net: fix outer UDP checksum in Intel prepare helper David Marchand
2024-04-05 12:49 ` [PATCH 6/8] net/i40e: fix outer UDP checksum offload for X710 David Marchand
2024-04-05 12:49 ` [PATCH 7/8] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
2024-04-05 12:49 ` [PATCH 8/8] net: clear outer UDP checksum in Intel prepare helper David Marchand
2024-04-05 14:45 ` [PATCH v2 0/8] Fix outer UDP checksum for Intel nics David Marchand
2024-04-05 14:45   ` [PATCH v2 1/8] net/ice: fix check for outer UDP checksum offload David Marchand
2024-04-08 15:05     ` Bruce Richardson
2024-04-05 14:45   ` [PATCH v2 2/8] net/ice: enhance debug when HW fails to transmit David Marchand
2024-04-08 15:23     ` Bruce Richardson
2024-04-11  8:30       ` David Marchand
2024-04-11  8:42         ` Bruce Richardson
2024-04-15  8:32           ` David Marchand
2024-04-05 14:45   ` [PATCH v2 3/8] mbuf: fix Tx checksum offload examples David Marchand
2024-04-05 16:20     ` Morten Brørup
2024-04-08 10:12       ` David Marchand
2024-04-09 13:38       ` Konstantin Ananyev
2024-04-09 14:44         ` Morten Brørup
2024-04-10 10:35           ` Konstantin Ananyev
2024-04-10 12:20             ` Morten Brørup
2024-04-12 12:46               ` Konstantin Ananyev
2024-04-12 14:44                 ` Morten Brørup
2024-04-12 15:17                   ` Konstantin Ananyev
2024-04-12 15:54                     ` Morten Brørup
2024-04-16  9:16                       ` Konstantin Ananyev
2024-04-16 11:36                         ` Konstantin Ananyev [this message]
2024-04-15 15:07                   ` Ferruh Yigit
2024-04-16  7:14                     ` Morten Brørup
2024-04-16  9:26                       ` Konstantin Ananyev
2024-04-05 14:45   ` [PATCH v2 4/8] app/testpmd: fix outer IP checksum offload David Marchand
2024-04-05 14:45   ` [PATCH v2 5/8] net: fix outer UDP checksum in Intel prepare helper David Marchand
2024-04-05 14:46   ` [PATCH v2 6/8] net/i40e: fix outer UDP checksum offload for X710 David Marchand
2024-04-05 14:46   ` [PATCH v2 7/8] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
2024-04-05 14:46   ` [PATCH v2 8/8] net: clear outer UDP checksum in Intel prepare helper David Marchand
2024-04-18  8:20 ` [PATCH v3 0/7] Fix outer UDP checksum for Intel nics David Marchand
2024-04-18  8:20   ` [PATCH v3 1/7] net/ice: fix check for outer UDP checksum offload David Marchand
2024-04-19 11:46     ` Morten Brørup
2024-04-18  8:20   ` [PATCH v3 2/7] net/ice: enhance debug when HW fails to transmit David Marchand
2024-04-18  8:20   ` [PATCH v3 3/7] app/testpmd: fix outer IP checksum offload David Marchand
2024-06-11 18:25     ` Ferruh Yigit
2024-04-18  8:20   ` [PATCH v3 4/7] net: fix outer UDP checksum in Intel prepare helper David Marchand
2024-04-18  8:20   ` [PATCH v3 5/7] net/i40e: fix outer UDP checksum offload for X710 David Marchand
2024-04-18  8:20   ` [PATCH v3 6/7] net/iavf: remove outer UDP checksum offload for X710 VF David Marchand
2024-04-18  8:20   ` [PATCH v3 7/7] net: clear outer UDP checksum in Intel prepare helper David Marchand
2024-05-03 13:10   ` [PATCH v3 0/7] Fix outer UDP checksum for Intel nics David Marchand
2024-05-27 18:06   ` Ali Alnubani
2024-06-11 21:10     ` Ferruh Yigit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b4b19af1b13c436ebb13cda5ced25470@huawei.com \
    --to=konstantin.ananyev@huawei.com \
    --cc=aman.deep.singh@intel.com \
    --cc=andrew.rybchenko@oktetlabs.ru \
    --cc=david.marchand@redhat.com \
    --cc=dev@dpdk.org \
    --cc=ferruh.yigit@amd.com \
    --cc=jerinj@marvell.com \
    --cc=jijiang.liu@intel.com \
    --cc=kaiwenx.deng@intel.com \
    --cc=mb@smartsharesystems.com \
    --cc=olivier.matz@6wind.com \
    --cc=qiming.yang@intel.com \
    --cc=stable@dpdk.org \
    --cc=thomas@monjalon.net \
    --cc=yidingx.zhou@intel.com \
    --cc=yuying.zhang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.