* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp [not found] ` <20241009170102.1980ed1d@kernel.org> @ 2024-10-10 17:44 ` Mina Almasry 2024-10-11 1:34 ` Jakub Kicinski 0 siblings, 1 reply; 17+ messages in thread From: Mina Almasry @ 2024-10-10 17:44 UTC (permalink / raw) To: Jakub Kicinski Cc: Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc, donald.hunter, corbet, michael.chan, kory.maincent, andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley On Wed, Oct 9, 2024 at 5:01 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Thu, 10 Oct 2024 00:37:49 +0900 Taehee Yoo wrote: > > > Yes, but netmem_ref can be either a net_iov or a normal page, > > > and skb_add_rx_frag_netmem() and similar helpers should automatically > > > set skb->unreadable or not. > > > > > > IOW you should be able to always use netmem-aware APIs, no? > > > > I'm not sure the update skb->unreadable flag is possible because > > frag API like skb_add_rx_frag_netmem(), receives only frag, not skb. > > How about an additional API to update skb->unreadable flag? > > skb_update_unreadable() or skb_update_netmem()? > > Ah, the case where we don't get skb is because we're just building XDP > frame at that stage. And XDP can't be netmem. > > In that case switching to skb_frag_fill_netmem_desc() should be enough. > > > > > The reason why the branch exists here is the PP_FLAG_ALLOW_UNREADABLE_NETMEM > > > > flag can't be used with PP_FLAG_DMA_SYNC_DEV. > > > > > > Hm. Isn't the existing check the wrong way around? Is the driver > > > supposed to sync the buffers for device before passing them down? > > > > I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV > > for dmabuf may be wrong. > > I think device memory TCP is not related to this flag. 
> > So device memory TCP core API should not return failure when > > PP_FLAG_DMA_SYNC_DEV flag is set. > > How about removing this condition check code in device memory TCP core? > > I think we need to invert the check.. > Mina, WDYT? > On a closer look, my feeling is similar to Taehee's: PP_FLAG_DMA_SYNC_DEV should be orthogonal to memory providers. The memory providers allocate the memory and provide the dma-addr, but need not dma-sync the dma-addr, right? The driver can sync the dma-addr if it wants and the driver can delegate the syncing to the pp via PP_FLAG_DMA_SYNC_DEV if it wants. AFAICT the check should be removed, not inverted, but I could be missing something. -- Thanks, Mina ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp 2024-10-10 17:44 ` [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp Mina Almasry @ 2024-10-11 1:34 ` Jakub Kicinski 2024-10-11 17:33 ` Mina Almasry 0 siblings, 1 reply; 17+ messages in thread From: Jakub Kicinski @ 2024-10-11 1:34 UTC (permalink / raw) To: Mina Almasry Cc: Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc, donald.hunter, corbet, michael.chan, kory.maincent, andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley On Thu, 10 Oct 2024 10:44:38 -0700 Mina Almasry wrote: > > > I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV > > > for dmabuf may be wrong. > > > I think device memory TCP is not related to this flag. > > > So device memory TCP core API should not return failure when > > > PP_FLAG_DMA_SYNC_DEV flag is set. > > > How about removing this condition check code in device memory TCP core? > > > > I think we need to invert the check.. > > Mina, WDYT? > > On a closer look, my feeling is similar to Taehee, > PP_FLAG_DMA_SYNC_DEV should be orthogonal to memory providers. The > memory providers allocate the memory and provide the dma-addr, but > need not dma-sync the dma-addr, right? The driver can sync the > dma-addr if it wants and the driver can delegate the syncing to the pp > via PP_FLAG_DMA_SYNC_DEV if it wants. AFAICT I think the check should > be removed, not inverted, but I could be missing something. I don't know much about dmabuf but it hinges on the question whether doing DMA sync for device on a dmabuf address is : - a good thing - a noop - a bad thing If it's a good thing or a noop - agreed. Similar question for the sync for CPU. I agree that intuitively it should be all fine. 
But the fact that dmabuf has a bespoke API for accessing the memory by the CPU makes me worried that there may be assumptions about these addresses not getting randomly fed into the normal DMA API.. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp 2024-10-11 1:34 ` Jakub Kicinski @ 2024-10-11 17:33 ` Mina Almasry 2024-10-11 23:42 ` Jason Gunthorpe 0 siblings, 1 reply; 17+ messages in thread From: Mina Almasry @ 2024-10-11 17:33 UTC (permalink / raw) To: Jakub Kicinski, Christian König, Jason Gunthorpe, Samiullah Khawaja Cc: Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc, donald.hunter, corbet, michael.chan, kory.maincent, andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley On Thu, Oct 10, 2024 at 6:34 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Thu, 10 Oct 2024 10:44:38 -0700 Mina Almasry wrote: > > > > I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV > > > > for dmabuf may be wrong. > > > > I think device memory TCP is not related to this flag. > > > > So device memory TCP core API should not return failure when > > > > PP_FLAG_DMA_SYNC_DEV flag is set. > > > > How about removing this condition check code in device memory TCP core? > > > > > > I think we need to invert the check.. > > > Mina, WDYT? > > > > On a closer look, my feeling is similar to Taehee, > > PP_FLAG_DMA_SYNC_DEV should be orthogonal to memory providers. The > > memory providers allocate the memory and provide the dma-addr, but > > need not dma-sync the dma-addr, right? The driver can sync the > > dma-addr if it wants and the driver can delegate the syncing to the pp > > via PP_FLAG_DMA_SYNC_DEV if it wants. AFAICT I think the check should > > be removed, not inverted, but I could be missing something. > > I don't know much about dmabuf but it hinges on the question whether > doing DMA sync for device on a dmabuf address is : > - a good thing > - a noop > - a bad thing > > If it's a good thing or a noop - agreed. > > Similar question for the sync for CPU. 
> > I agree that intuitively it should be all fine. But the fact that dmabuf > has a bespoke API for accessing the memory by the CPU makes me worried > that there may be assumptions about these addresses not getting > randomly fed into the normal DMA API.. Sorry I'm also a bit unsure what is the right thing to do here. The code that we've been running in GVE does a dma-sync for cpu unconditionally on RX for dma-buf and non-dmabuf dma-addrs and we haven't been seeing issues. It never does dma-sync for device. My first question is why is dma-sync for device needed on RX path at all for some drivers in the first place? For incoming (non-dmabuf) data, the data is written by the device and read by the cpu, so sync for cpu is really what's needed. Is the sync for device for XDP? Or is it that buffers should be dma-syncd for device before they are re-posted to the NIC? Christian/Jason, sorry quick question: are dma_sync_single_for_{device|cpu} needed or wanted when the dma-addrs come from a dma-buf? Or are these dma-addrs to be treated like any other with the normal dma_sync_for_{device|cpu} rules? -- Thanks, Mina ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp 2024-10-11 17:33 ` Mina Almasry @ 2024-10-11 23:42 ` Jason Gunthorpe 2024-10-14 22:38 ` Mina Almasry 0 siblings, 1 reply; 17+ messages in thread From: Jason Gunthorpe @ 2024-10-11 23:42 UTC (permalink / raw) To: Mina Almasry, Leon Romanovsky Cc: Jakub Kicinski, Christian König, Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc, donald.hunter, corbet, michael.chan, kory.maincent, andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley On Fri, Oct 11, 2024 at 10:33:43AM -0700, Mina Almasry wrote: > On Thu, Oct 10, 2024 at 6:34 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Thu, 10 Oct 2024 10:44:38 -0700 Mina Almasry wrote: > > > > > I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV > > > > > for dmabuf may be wrong. > > > > > I think device memory TCP is not related to this flag. > > > > > So device memory TCP core API should not return failure when > > > > > PP_FLAG_DMA_SYNC_DEV flag is set. > > > > > How about removing this condition check code in device memory TCP core? > > > > > > > > I think we need to invert the check.. > > > > Mina, WDYT? > > > > > > On a closer look, my feeling is similar to Taehee, > > > PP_FLAG_DMA_SYNC_DEV should be orthogonal to memory providers. The > > > memory providers allocate the memory and provide the dma-addr, but > > > need not dma-sync the dma-addr, right? The driver can sync the > > > dma-addr if it wants and the driver can delegate the syncing to the pp > > > via PP_FLAG_DMA_SYNC_DEV if it wants. AFAICT I think the check should > > > be removed, not inverted, but I could be missing something. 
> > > > I don't know much about dmabuf but it hinges on the question whether > > doing DMA sync for device on a dmabuf address is : > > - a good thing > > - a noop > > - a bad thing > > > > If it's a good thing or a noop - agreed. > > > > Similar question for the sync for CPU. > > > > I agree that intuitively it should be all fine. But the fact that dmabuf > > has a bespoke API for accessing the memory by the CPU makes me worried > > that there may be assumptions about these addresses not getting > > randomly fed into the normal DMA API.. > > Sorry I'm also a bit unsure what is the right thing to do here. The > code that we've been running in GVE does a dma-sync for cpu > unconditionally on RX for dma-buf and non-dmabuf dma-addrs and we > haven't been seeing issues. It never does dma-sync for device. > > My first question is why is dma-sync for device needed on RX path at > all for some drivers in the first place? For incoming (non-dmabuf) > data, the data is written by the device and read by the cpu, so sync > for cpu is really what's needed. Is the sync for device for XDP? Or is > it that buffers should be dma-syncd for device before they are > re-posted to the NIC? > > Christian/Jason, sorry quick question: are > dma_sync_single_for_{device|cpu} needed or wanted when the dma-addrs > come from a dma-buf? Or these dma-addrs to be treated like any other > with the normal dma_sync_for_{device|cpu} rules? Um, I think because dma-buf hacks things up and generates illegal scatterlist entries with weird dma_map_resource() addresses for the typical P2P case the dma sync API should not be used on those things. However, there is no way to know if the dma-buf has done this, and there are valid cases where the scatterlist is not ill formed and the sync is necessary. We are getting so close to being able to start fixing these API issues in dmabuf, I hope next cycle we can begin.. 
From a CPU architecture perspective you do not need to cache flush PCI MMIO BAR memory, and perhaps doing so might be problematic on some arches (???). But you do need to flush normal cacheable CPU memory if that is in the DMA buf. Jason ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp 2024-10-11 23:42 ` Jason Gunthorpe @ 2024-10-14 22:38 ` Mina Almasry 2024-10-15 0:16 ` Jakub Kicinski 2024-10-15 14:29 ` Pavel Begunkov 0 siblings, 2 replies; 17+ messages in thread From: Mina Almasry @ 2024-10-14 22:38 UTC (permalink / raw) To: Jason Gunthorpe Cc: Leon Romanovsky, Jakub Kicinski, Christian König, Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc, donald.hunter, corbet, michael.chan, kory.maincent, andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley On Sat, Oct 12, 2024 at 2:42 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Oct 11, 2024 at 10:33:43AM -0700, Mina Almasry wrote: > > On Thu, Oct 10, 2024 at 6:34 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > On Thu, 10 Oct 2024 10:44:38 -0700 Mina Almasry wrote: > > > > > > I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV > > > > > > for dmabuf may be wrong. > > > > > > I think device memory TCP is not related to this flag. > > > > > > So device memory TCP core API should not return failure when > > > > > > PP_FLAG_DMA_SYNC_DEV flag is set. > > > > > > How about removing this condition check code in device memory TCP core? > > > > > > > > > > I think we need to invert the check.. > > > > > Mina, WDYT? > > > > > > > > On a closer look, my feeling is similar to Taehee, > > > > PP_FLAG_DMA_SYNC_DEV should be orthogonal to memory providers. The > > > > memory providers allocate the memory and provide the dma-addr, but > > > > need not dma-sync the dma-addr, right? The driver can sync the > > > > dma-addr if it wants and the driver can delegate the syncing to the pp > > > > via PP_FLAG_DMA_SYNC_DEV if it wants. AFAICT I think the check should > > > > be removed, not inverted, but I could be missing something. 
> > > > > > I don't know much about dmabuf but it hinges on the question whether > > > doing DMA sync for device on a dmabuf address is : > > > - a good thing > > > - a noop > > > - a bad thing > > > > > > If it's a good thing or a noop - agreed. > > > > > > Similar question for the sync for CPU. > > > > > > I agree that intuitively it should be all fine. But the fact that dmabuf > > > has a bespoke API for accessing the memory by the CPU makes me worried > > > that there may be assumptions about these addresses not getting > > > randomly fed into the normal DMA API.. > > > > Sorry I'm also a bit unsure what is the right thing to do here. The > > code that we've been running in GVE does a dma-sync for cpu > > unconditionally on RX for dma-buf and non-dmabuf dma-addrs and we > > haven't been seeing issues. It never does dma-sync for device. > > > > My first question is why is dma-sync for device needed on RX path at > > all for some drivers in the first place? For incoming (non-dmabuf) > > data, the data is written by the device and read by the cpu, so sync > > for cpu is really what's needed. Is the sync for device for XDP? Or is > > it that buffers should be dma-syncd for device before they are > > re-posted to the NIC? > > > > Christian/Jason, sorry quick question: are > > dma_sync_single_for_{device|cpu} needed or wanted when the dma-addrs > > come from a dma-buf? Or these dma-addrs to be treated like any other > > with the normal dma_sync_for_{device|cpu} rules? > > Um, I think because dma-buf hacks things up and generates illegal > scatterlist entries with weird dma_map_resource() addresses for the > typical P2P case the dma sync API should not be used on those things. > > However, there is no way to know if the dma-buf has does this, and > there are valid case where the scatterlist is not ill formed and the > sync is necessary. > > We are getting soo close to being able to start fixing these API > issues in dmabuf, I hope next cylce we can begin.. 
Fingers crossed. > > From a CPU architecture perspective you do not need to cache flush PCI > MMIO BAR memory, and perhaps doing so be might be problematic on some > arches (???). But you do need to flush normal cachable CPU memory if > that is in the DMA buf. > Thanks Jason. In that case I agree with Jakub we should take in his change here: https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/ With this change the driver would delegate dma_sync_for_device to the page_pool, and the page_pool will skip it altogether for the dma-buf memory provider. -- Thanks, Mina ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp 2024-10-14 22:38 ` Mina Almasry @ 2024-10-15 0:16 ` Jakub Kicinski 2024-10-15 1:10 ` Mina Almasry 2024-10-15 14:29 ` Pavel Begunkov 1 sibling, 1 reply; 17+ messages in thread From: Jakub Kicinski @ 2024-10-15 0:16 UTC (permalink / raw) To: Mina Almasry Cc: Jason Gunthorpe, Leon Romanovsky, Christian König, Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc, donald.hunter, corbet, michael.chan, kory.maincent, andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley On Tue, 15 Oct 2024 01:38:20 +0300 Mina Almasry wrote: > Thanks Jason. In that case I agree with Jakub we should take in his change here: > > https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/ > > With this change the driver would delegate dma_sync_for_device to the > page_pool, and the page_pool will skip it altogether for the dma-buf > memory provider. And we need a wrapper for a sync for CPU which will skip if the page comes from an unreadable pool? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp 2024-10-15 0:16 ` Jakub Kicinski @ 2024-10-15 1:10 ` Mina Almasry 2024-10-15 12:44 ` Jason Gunthorpe 0 siblings, 1 reply; 17+ messages in thread From: Mina Almasry @ 2024-10-15 1:10 UTC (permalink / raw) To: Jakub Kicinski Cc: Jason Gunthorpe, Leon Romanovsky, Christian König, Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc, donald.hunter, corbet, michael.chan, kory.maincent, andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley On Tue, Oct 15, 2024 at 3:16 AM Jakub Kicinski <kuba@kernel.org> wrote: > > On Tue, 15 Oct 2024 01:38:20 +0300 Mina Almasry wrote: > > Thanks Jason. In that case I agree with Jakub we should take in his change here: > > > > https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/ > > > > With this change the driver would delegate dma_sync_for_device to the > > page_pool, and the page_pool will skip it altogether for the dma-buf > > memory provider. > > And we need a wrapper for a sync for CPU which will skip if the page > comes from an unreadable pool? This is where it gets a bit tricky, no? Our production code does a dma_sync_for_cpu but no dma_sync_for_device. That has been working reliably for us with GPU dmabufs and udmabuf, but we haven't of course tested every dma-buf. I'm comfortable enforcing the 'no dma_sync_for_device' now since it brings upstream in line with our well tested setup. I'm not sure I'm 100% comfortable enforcing the 'no dma_sync_for_cpu' now since it's a deviation. The dma_sync_for_cpu is very very likely a no-op since we don't really access the data from cpu ever with devmem TCP, but who knows. Is it possible to give me a couple of weeks to make this change locally and run it through some testing to see if it breaks anything? 
But if you or Jason think that enforcing the 'no dma_buf_sync_for_cpu' now is critical, no problem. We can also provide this patch, and seek to revert it or fix it up properly later in the event it turns out it causes issues. Note that io_uring provider, or other non-dmabuf providers may need to do a dma-sync, but that bridge can be crossed in David's patchset. -- Thanks, Mina ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp 2024-10-15 1:10 ` Mina Almasry @ 2024-10-15 12:44 ` Jason Gunthorpe 2024-10-18 8:25 ` Mina Almasry 0 siblings, 1 reply; 17+ messages in thread From: Jason Gunthorpe @ 2024-10-15 12:44 UTC (permalink / raw) To: Mina Almasry Cc: Jakub Kicinski, Leon Romanovsky, Christian König, Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc, donald.hunter, corbet, michael.chan, kory.maincent, andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley On Tue, Oct 15, 2024 at 04:10:44AM +0300, Mina Almasry wrote: > On Tue, Oct 15, 2024 at 3:16 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Tue, 15 Oct 2024 01:38:20 +0300 Mina Almasry wrote: > > > Thanks Jason. In that case I agree with Jakub we should take in his change here: > > > > > > https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/ > > > > > > With this change the driver would delegate dma_sync_for_device to the > > > page_pool, and the page_pool will skip it altogether for the dma-buf > > > memory provider. > > > > And we need a wrapper for a sync for CPU which will skip if the page > > comes from an unreadable pool? > > This is where it gets a bit tricky, no? > > Our production code does a dma_sync_for_cpu but no > dma_sync_for_device. That has been working reliably for us with GPU Those functions are all NOP on systems you are testing on. The question is what is correct to do on systems where it is not a NOP, and none of this is really right, as I explained.. > But if you or Jason think that enforcing the 'no dma_buf_sync_for_cpu' > now is critical, no problem. We can also provide this patch, and seek > to revert it or fix it up properly later in the event it turns out it > causes issues. 
What is important is that you organize things going forward to be able to do this properly, which means the required sync type is dependent on the actual page being synced and you will eventually somehow learn which is required from the dmabuf. Most likely nobody will ever run this code on a system where dma_sync is not a NOP, but we should still use the DMA API properly and things should make architectural sense. Jason ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp 2024-10-15 12:44 ` Jason Gunthorpe @ 2024-10-18 8:25 ` Mina Almasry 2024-10-19 13:55 ` Taehee Yoo 0 siblings, 1 reply; 17+ messages in thread From: Mina Almasry @ 2024-10-18 8:25 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jakub Kicinski, Leon Romanovsky, Christian König, Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc, donald.hunter, corbet, michael.chan, kory.maincent, andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley On Tue, Oct 15, 2024 at 3:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Tue, Oct 15, 2024 at 04:10:44AM +0300, Mina Almasry wrote: > > On Tue, Oct 15, 2024 at 3:16 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > On Tue, 15 Oct 2024 01:38:20 +0300 Mina Almasry wrote: > > > > Thanks Jason. In that case I agree with Jakub we should take in his change here: > > > > > > > > https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/ > > > > > > > > With this change the driver would delegate dma_sync_for_device to the > > > > page_pool, and the page_pool will skip it altogether for the dma-buf > > > > memory provider. > > > > > > And we need a wrapper for a sync for CPU which will skip if the page > > > comes from an unreadable pool? > > > > This is where it gets a bit tricky, no? > > > > Our production code does a dma_sync_for_cpu but no > > dma_sync_for_device. That has been working reliably for us with GPU > > Those functions are all NOP on systems you are testing on. > OK, thanks. This is what I wanted to confirm. If you already know this here then there is no need to wait for me to confirm. > The question is what is correct to do on systems where it is not a > NOP, and none of this is really right, as I explained.. 
> > > But if you or Jason think that enforcing the 'no dma_buf_sync_for_cpu' > > now is critical, no problem. We can also provide this patch, and seek > > to revert it or fix it up properly later in the event it turns out it > > causes issues. > > What is important is you organize things going forward to be able to > do this properly, which means the required sync type is dependent on > the actual page being synced and you will eventually somehow learn > which is required from the dmabuf. > > Most likely nobody will ever run this code on system where dma_sync is > not a NOP, but we should still use the DMA API properly and things > should make architectural sense. > Makes sense. OK, we can do what Jakub suggested in the thread earlier. I.e. likely some wrapper which skips the dma_sync_for_cpu if the netmem is unreadable. -- Thanks, Mina ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp 2024-10-18 8:25 ` Mina Almasry @ 2024-10-19 13:55 ` Taehee Yoo 0 siblings, 0 replies; 17+ messages in thread From: Taehee Yoo @ 2024-10-19 13:55 UTC (permalink / raw) To: Mina Almasry Cc: Jason Gunthorpe, Jakub Kicinski, Leon Romanovsky, Christian König, Samiullah Khawaja, davem, pabeni, edumazet, netdev, linux-doc, donald.hunter, corbet, michael.chan, kory.maincent, andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley On Fri, Oct 18, 2024 at 5:25 PM Mina Almasry <almasrymina@google.com> wrote: > > On Tue, Oct 15, 2024 at 3:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Tue, Oct 15, 2024 at 04:10:44AM +0300, Mina Almasry wrote: > > > On Tue, Oct 15, 2024 at 3:16 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > > > On Tue, 15 Oct 2024 01:38:20 +0300 Mina Almasry wrote: > > > > > Thanks Jason. In that case I agree with Jakub we should take in his change here: > > > > > > > > > > https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/ > > > > > > > > > > With this change the driver would delegate dma_sync_for_device to the > > > > > page_pool, and the page_pool will skip it altogether for the dma-buf > > > > > memory provider. > > > > > > > > And we need a wrapper for a sync for CPU which will skip if the page > > > > comes from an unreadable pool? > > > > > > This is where it gets a bit tricky, no? > > > > > > Our production code does a dma_sync_for_cpu but no > > > dma_sync_for_device. That has been working reliably for us with GPU > > > > Those functions are all NOP on systems you are testing on. > > > > OK, thanks. This is what I wanted to confirm. If you already know this > here then there is no need to wait for me to confirm. 
> > The question is what is correct to do on systems where it is not a > > NOP, and none of this is really right, as I explained.. > > > > > But if you or Jason think that enforcing the 'no dma_buf_sync_for_cpu' > > > now is critical, no problem. We can also provide this patch, and seek > > > to revert it or fix it up properly later in the event it turns out it > > > causes issues. > > > > What is important is you organize things going forward to be able to > > do this properly, which means the required sync type is dependent on > > the actual page being synced and you will eventually somehow learn > > which is required from the dmabuf. > > > > Most likely nobody will ever run this code on system where dma_sync is > > not a NOP, but we should still use the DMA API properly and things > > should make architectural sense. > > > > Makes sense. OK, we can do what Jakub suggested in the thread earlier. > I.e. likely some wrapper which skips the dma_sync_for_cpu if the > netmem is unreadable. > Thanks a lot for the confirmation. In the v4 patch I will pass the PP_FLAG_ALLOW_UNREADABLE_NETMEM flag regardless of whether devmem TCP is enabled; the page_pool core logic will handle the flags properly. I think the page_pool changes are being worked on by Mina, so I will not include page_pool changes in the v4 patch. If you think I missed something, please let me know :) Thanks a lot! Taehee Yoo ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp 2024-10-14 22:38 ` Mina Almasry 2024-10-15 0:16 ` Jakub Kicinski @ 2024-10-15 14:29 ` Pavel Begunkov 2024-10-15 17:38 ` David Wei 1 sibling, 1 reply; 17+ messages in thread From: Pavel Begunkov @ 2024-10-15 14:29 UTC (permalink / raw) To: Mina Almasry, Jason Gunthorpe Cc: Leon Romanovsky, Jakub Kicinski, Christian König, Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc, donald.hunter, corbet, michael.chan, kory.maincent, andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley On 10/14/24 23:38, Mina Almasry wrote: > On Sat, Oct 12, 2024 at 2:42 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: >> >> On Fri, Oct 11, 2024 at 10:33:43AM -0700, Mina Almasry wrote: >>> On Thu, Oct 10, 2024 at 6:34 PM Jakub Kicinski <kuba@kernel.org> wrote: >>>> >>>> On Thu, 10 Oct 2024 10:44:38 -0700 Mina Almasry wrote: >>>>>>> I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV >>>>>>> for dmabuf may be wrong. >>>>>>> I think device memory TCP is not related to this flag. >>>>>>> So device memory TCP core API should not return failure when >>>>>>> PP_FLAG_DMA_SYNC_DEV flag is set. >>>>>>> How about removing this condition check code in device memory TCP core? >>>>>> >>>>>> I think we need to invert the check.. >>>>>> Mina, WDYT? >>>>> >>>>> On a closer look, my feeling is similar to Taehee, >>>>> PP_FLAG_DMA_SYNC_DEV should be orthogonal to memory providers. The >>>>> memory providers allocate the memory and provide the dma-addr, but >>>>> need not dma-sync the dma-addr, right? The driver can sync the >>>>> dma-addr if it wants and the driver can delegate the syncing to the pp >>>>> via PP_FLAG_DMA_SYNC_DEV if it wants. AFAICT I think the check should >>>>> be removed, not inverted, but I could be missing something. 
>>>> >>>> I don't know much about dmabuf but it hinges on the question whether >>>> doing DMA sync for device on a dmabuf address is : >>>> - a good thing >>>> - a noop >>>> - a bad thing >>>> >>>> If it's a good thing or a noop - agreed. >>>> >>>> Similar question for the sync for CPU. >>>> >>>> I agree that intuitively it should be all fine. But the fact that dmabuf >>>> has a bespoke API for accessing the memory by the CPU makes me worried >>>> that there may be assumptions about these addresses not getting >>>> randomly fed into the normal DMA API.. >>> >>> Sorry I'm also a bit unsure what is the right thing to do here. The >>> code that we've been running in GVE does a dma-sync for cpu >>> unconditionally on RX for dma-buf and non-dmabuf dma-addrs and we >>> haven't been seeing issues. It never does dma-sync for device. >>> >>> My first question is why is dma-sync for device needed on RX path at >>> all for some drivers in the first place? For incoming (non-dmabuf) >>> data, the data is written by the device and read by the cpu, so sync >>> for cpu is really what's needed. Is the sync for device for XDP? Or is >>> it that buffers should be dma-syncd for device before they are >>> re-posted to the NIC? >>> >>> Christian/Jason, sorry quick question: are >>> dma_sync_single_for_{device|cpu} needed or wanted when the dma-addrs >>> come from a dma-buf? Or these dma-addrs to be treated like any other >>> with the normal dma_sync_for_{device|cpu} rules? >> >> Um, I think because dma-buf hacks things up and generates illegal >> scatterlist entries with weird dma_map_resource() addresses for the >> typical P2P case the dma sync API should not be used on those things. >> >> However, there is no way to know if the dma-buf has does this, and >> there are valid case where the scatterlist is not ill formed and the >> sync is necessary. >> >> We are getting soo close to being able to start fixing these API >> issues in dmabuf, I hope next cylce we can begin.. 
Fingers crossed. >> >> From a CPU architecture perspective you do not need to cache flush PCI >> MMIO BAR memory, and perhaps doing so be might be problematic on some >> arches (???). But you do need to flush normal cachable CPU memory if >> that is in the DMA buf. >> > > Thanks Jason. In that case I agree with Jakub we should take in his change here: > > https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/ > > With this change the driver would delegate dma_sync_for_device to the > page_pool, and the page_pool will skip it altogether for the dma-buf > memory provider. Requiring ->dma_map should be common to all providers, as the page pool shouldn't be dipping into net_iovs to figure out how to map them. However, looking at this discussion, it seems that the ->dma_sync concern is devmem specific and should be discarded by pp providers using dmabufs, i.e. in devmem.c:mp_dmabuf_devmem_init(). -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
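[Editor's note] For concreteness, the two options under discussion can be modeled in a few lines of standalone C. The PP_FLAG_* names mirror the kernel's page_pool flags, but this is a hedged sketch of the behavior being debated, not kernel code: today the devmem core rejects PP_FLAG_DMA_SYNC_DEV outright, while the suggestion above is for the dmabuf provider to accept the flag and simply drop the sync obligation in mp_dmabuf_devmem_init().

```c
/* Illustrative model of the flag check being debated; the PP_FLAG_*
 * names mirror the kernel's page_pool flags, but nothing here is
 * actual kernel code. */
#include <assert.h>
#include <stdbool.h>

#define PP_FLAG_DMA_MAP      (1u << 0)
#define PP_FLAG_DMA_SYNC_DEV (1u << 1)

/* Current behaviour per the thread: devmem setup fails outright when
 * the driver also delegates dma-sync-for-device to the page pool. */
static bool devmem_init_current(unsigned int flags)
{
	if (!(flags & PP_FLAG_DMA_MAP))
		return false;	/* mapping must be delegated to the provider */
	if (flags & PP_FLAG_DMA_SYNC_DEV)
		return false;	/* hard failure today */
	return true;
}

/* The suggestion above, as I read it: the dmabuf provider accepts the
 * flag but drops the sync obligation in its init (a no-op for dmabuf). */
static bool devmem_init_proposed(unsigned int flags, unsigned int *effective)
{
	if (!(flags & PP_FLAG_DMA_MAP))
		return false;
	*effective = flags & ~PP_FLAG_DMA_SYNC_DEV;
	return true;
}
```

With the proposed variant a driver can keep setting PP_FLAG_DMA_SYNC_DEV unconditionally, and the dmabuf provider simply never performs the sync.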
* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp 2024-10-15 14:29 ` Pavel Begunkov @ 2024-10-15 17:38 ` David Wei 0 siblings, 0 replies; 17+ messages in thread From: David Wei @ 2024-10-15 17:38 UTC (permalink / raw) To: Pavel Begunkov, Mina Almasry, Jason Gunthorpe Cc: Leon Romanovsky, Jakub Kicinski, Christian König, Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc, donald.hunter, corbet, michael.chan, kory.maincent, andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, kaiyuanz, willemb, aleksander.lobakin, sridhar.samudrala, bcreeley, David Wei On 2024-10-15 07:29, Pavel Begunkov wrote: > On 10/14/24 23:38, Mina Almasry wrote: >> On Sat, Oct 12, 2024 at 2:42 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: >>> >>> On Fri, Oct 11, 2024 at 10:33:43AM -0700, Mina Almasry wrote: >>>> On Thu, Oct 10, 2024 at 6:34 PM Jakub Kicinski <kuba@kernel.org> wrote: >>>>> >>>>> On Thu, 10 Oct 2024 10:44:38 -0700 Mina Almasry wrote: >>>>>>>> I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV >>>>>>>> for dmabuf may be wrong. >>>>>>>> I think device memory TCP is not related to this flag. >>>>>>>> So device memory TCP core API should not return failure when >>>>>>>> PP_FLAG_DMA_SYNC_DEV flag is set. >>>>>>>> How about removing this condition check code in device memory TCP core? >>>>>>> >>>>>>> I think we need to invert the check.. >>>>>>> Mina, WDYT? >>>>>> >>>>>> On a closer look, my feeling is similar to Taehee, >>>>>> PP_FLAG_DMA_SYNC_DEV should be orthogonal to memory providers. The >>>>>> memory providers allocate the memory and provide the dma-addr, but >>>>>> need not dma-sync the dma-addr, right? The driver can sync the >>>>>> dma-addr if it wants and the driver can delegate the syncing to the pp >>>>>> via PP_FLAG_DMA_SYNC_DEV if it wants. AFAICT I think the check should >>>>>> be removed, not inverted, but I could be missing something. 
>>>>> >>>>> I don't know much about dmabuf but it hinges on the question whether >>>>> doing DMA sync for device on a dmabuf address is : >>>>> - a good thing >>>>> - a noop >>>>> - a bad thing >>>>> >>>>> If it's a good thing or a noop - agreed. >>>>> >>>>> Similar question for the sync for CPU. >>>>> >>>>> I agree that intuitively it should be all fine. But the fact that dmabuf >>>>> has a bespoke API for accessing the memory by the CPU makes me worried >>>>> that there may be assumptions about these addresses not getting >>>>> randomly fed into the normal DMA API.. >>>> >>>> Sorry I'm also a bit unsure what is the right thing to do here. The >>>> code that we've been running in GVE does a dma-sync for cpu >>>> unconditionally on RX for dma-buf and non-dmabuf dma-addrs and we >>>> haven't been seeing issues. It never does dma-sync for device. >>>> >>>> My first question is why is dma-sync for device needed on RX path at >>>> all for some drivers in the first place? For incoming (non-dmabuf) >>>> data, the data is written by the device and read by the cpu, so sync >>>> for cpu is really what's needed. Is the sync for device for XDP? Or is >>>> it that buffers should be dma-syncd for device before they are >>>> re-posted to the NIC? >>>> >>>> Christian/Jason, sorry quick question: are >>>> dma_sync_single_for_{device|cpu} needed or wanted when the dma-addrs >>>> come from a dma-buf? Or these dma-addrs to be treated like any other >>>> with the normal dma_sync_for_{device|cpu} rules? >>> >>> Um, I think because dma-buf hacks things up and generates illegal >>> scatterlist entries with weird dma_map_resource() addresses for the >>> typical P2P case the dma sync API should not be used on those things. >>> >>> However, there is no way to know if the dma-buf has does this, and >>> there are valid case where the scatterlist is not ill formed and the >>> sync is necessary. 
>>> >>> We are getting soo close to being able to start fixing these API >>> issues in dmabuf, I hope next cylce we can begin.. Fingers crossed. >>> >>> From a CPU architecture perspective you do not need to cache flush PCI >>> MMIO BAR memory, and perhaps doing so be might be problematic on some >>> arches (???). But you do need to flush normal cachable CPU memory if >>> that is in the DMA buf. >>> >> >> Thanks Jason. In that case I agree with Jakub we should take in his change here: >> >> https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/ >> >> With this change the driver would delegate dma_sync_for_device to the >> page_pool, and the page_pool will skip it altogether for the dma-buf >> memory provider. > > Requiring ->dma_map should be common to all providers as page pool > shouldn't be dipping to net_iovs figuring out how to map them. However, > looking at this discussion seems that the ->dma_sync concern is devmem > specific and should be discarded by pp providers using dmabufs, i.e. in > devmem.c:mp_dmabuf_devmem_init(). Yes, that's my preference as well, see my earlier reply. > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt [not found] <20241003160620.1521626-1-ap420073@gmail.com> [not found] ` <20241003160620.1521626-8-ap420073@gmail.com> @ 2024-10-16 20:17 ` Stanislav Fomichev 2024-10-17 8:58 ` Taehee Yoo [not found] ` <20241003160620.1521626-3-ap420073@gmail.com> 2 siblings, 1 reply; 17+ messages in thread From: Stanislav Fomichev @ 2024-10-16 20:17 UTC (permalink / raw) To: Taehee Yoo Cc: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc, donald.hunter, corbet, michael.chan, kory.maincent, andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley On 10/03, Taehee Yoo wrote: > This series implements device memory TCP for bnxt_en driver and > necessary ethtool command implementations. > > NICs that use the bnxt_en driver support tcp-data-split feature named > HDS(header-data-split). > But there is no implementation for the HDS to enable/disable by ethtool. > Only getting the current HDS status is implemented and the HDS is just > automatically enabled only when either LRO, HW-GRO, or JUMBO is enabled. > The hds_threshold follows the rx-copybreak value but it wasn't > changeable. > > Currently, bnxt_en driver enables tcp-data-split by default but not > always work. > There is hds_threshold value, which indicates that a packet size is > larger than this value, a packet will be split into header and data. > hds_threshold value has been 256, which is a default value of > rx-copybreak value too. > The rx-copybreak value hasn't been allowed to change so the > hds_threshold too. > > This patchset decouples hds_threshold and rx-copybreak first. > and make tcp-data-split, rx-copybreak, and > tcp-data-split-thresh(hds_threshold) configurable independently. > > But the default configuration is the same. 
> The default value of rx-copybreak is 256 and default > tcp-data-split-thresh is also 256. > > There are several related options. > TPA(HW-GRO, LRO), JUMBO, jumbo_thresh(firmware command), and Aggregation > Ring. > > The aggregation ring is fundamental to these all features. > When gro/lro/jumbo packets are received, NIC receives the first packet > from the normal ring. > follow packets come from the aggregation ring. > > These features are working regardless of HDS. > When TPA is enabled and HDS is disabled, the first packet contains > header and payload too. > and the following packets contain payload only. > If HDS is enabled, the first packet contains the header only, and the > following packets contain only payload. > So, HW-GRO/LRO is working regardless of HDS. > > There is another threshold value, which is jumbo_thresh. > This is very similar to hds_thresh, but jumbo thresh doesn't split > header and data. > It just split the first and following data based on length. > When NIC receives 1500 sized packet, and jumbo_thresh is 256(default, but > follows rx-copybreak), > the first data is 256 and the following packet size is 1500-256. > > Before this patch, at least if one of GRO, LRO, and JUMBO flags is > enabled, the Aggregation ring will be enabled. > If the Aggregation ring is enabled, both hds_threshold and > jumbo_thresh are set to the default value of rx-copybreak. > > So, GRO, LRO, JUMBO frames, they larger than 256 bytes, they will > be split into header and data if the protocol is TCP or UDP. > for the other protocol, jumbo_thresh works instead of hds_thresh. > > This means that tcp-data-split relies on the GRO, LRO, and JUMBO flags. > But by this patch, tcp-data-split no longer relies on these flags. > If the tcp-data-split is enabled, the Aggregation ring will be > enabled. > Also, hds_threshold no longer follows rx-copybreak value, it will > be set to the tcp-data-split-thresh value by user-space, but the > default value is still 256. 
> > If the protocol is TCP or UDP and the HDS is disabled and Aggregation > ring is enabled, a packet will be split into several pieces due to > jumbo_thresh. > > When XDP is attached, tcp-data-split is automatically disabled. > > LRO, GRO, and JUMBO are tested with BCM57414, BCM57504 and the firmware > version is 230.0.157.0. > I couldn't find any specification about minimum and maximum value > of hds_threshold, but from my test result, it was about 0 ~ 1023. > It means, over 1023 sized packets will be split into header and data if > tcp-data-split is enabled regardless of hds_treshold value. > When hds_threshold is 1500 and received packet size is 1400, HDS should > not be activated, but it is activated. > The maximum value of hds_threshold(tcp-data-split-thresh) > value is 256 because it has been working. > It was decided very conservatively. > > I checked out the tcp-data-split(HDS) works independently of GRO, LRO, > JUMBO. Tested GRO/LRO, JUMBO with enabled HDS and disabled HDS. > Also, I checked out tcp-data-split should be disabled automatically > when XDP is attached and disallowed to enable it again while XDP is > attached. I tested ranged values from min to max for > tcp-data-split-thresh and rx-copybreak, and it works. > tcp-data-split-thresh from 0 to 256, and rx-copybreak 65 to 256. > When testing this patchset, I checked skb->data, skb->data_len, and > nr_frags values. > > The first patch implements .{set, get}_tunable() in the bnxt_en. > The bnxt_en driver has been supporting the rx-copybreak feature but is > not configurable, Only the default rx-copybreak value has been working. > So, it changes the bnxt_en driver to be able to configure > the rx-copybreak value. > > The second patch adds an implementation of tcp-data-split ethtool > command. > The HDS relies on the Aggregation ring, which is automatically enabled > when either LRO, GRO, or large mtu is configured. > So, if the Aggregation ring is enabled, HDS is automatically enabled by > it. 
> > The third patch adds tcp-data-split-thresh command in the ethtool. > This threshold value indicates if a received packet size is larger > than this threshold, the packet's header and payload will be split. > Example: > # ethtool -G <interface name> tcp-data-split-thresh <value> > This option can not be used when tcp-data-split is disabled or not > supported. > # ethtool -G enp14s0f0np0 tcp-data-split on tcp-data-split-thresh 256 > # ethtool -g enp14s0f0np0 > Ring parameters for enp14s0f0np0: > Pre-set maximums: > ... > Current hardware settings: > ... > TCP data split: on > TCP data split thresh: 256 > > # ethtool -G enp14s0f0np0 tcp-data-split off > # ethtool -g enp14s0f0np0 > Ring parameters for enp14s0f0np0: > Pre-set maximums: > ... > Current hardware settings: > ... > TCP data split: off > TCP data split thresh: n/a > > The fourth patch adds the implementation of tcp-data-split-thresh logic > in the bnxt_en driver. > The default value is 256, which used to be the default rx-copybreak > value. > > The fifth and sixth adds condition check for devmem and ethtool. > If tcp-data-split is disabled or threshold value is not zero, setup of > devmem will be failed. > Also, tcp-data-split and tcp-data-split-thresh will not be changed > while devmem is running. > > The last patch implements device memory TCP for bnxt_en driver. > It usually converts generic page_pool api to netmem page_pool api. > > No dependencies exist between device memory TCP and GRO/LRO/MTU. > Only tcp-data-split and tcp-data-split-thresh should be enabled when the > device memory TCP. > While devmem TCP is set, tcp-data-split and tcp-data-split-thresh can't > be updated because core API disallows change. > > I tested the interface up/down while devmem TCP running. It works well. > Also, channel count change, and rx/tx ringsize change tests work well too. > > The devmem TCP test NIC is BCM57504 [..] > All necessary configuration validations exist at the core API level. 
> > Note that by this patch, the setup of device memory TCP would fail. > Because tcp-data-split-thresh command is not supported by ethtool yet. > The tcp-data-split-thresh should be 0 for setup device memory TCP and > the default of bnxt is 256. > So, for the bnxt, it always fails until ethtool supports > tcp-data-split-thresh command. > > The ncdevmem.c will be updated after ethtool supports > tcp-data-split-thresh option. FYI, I've tested your series with BCM57504 on top of [1] and [2] with a couple of patches to make ncdevmem.c and TX work (see below). [1] decouples ncdevmem from ethtool so we can flip header split settings without requiring recent ethtool. Both RX and TX work perfectly. Feel free to carry: Tested-by: Stanislav Fomichev <sdf@fomichev.me> Also feel free to take over the ncdevmem patch if my ncdevmem changes get pulled before your series. 1: https://lore.kernel.org/netdev/20241009171252.2328284-1-sdf@fomichev.me/ 2: https://lore.kernel.org/netdev/20240913150913.1280238-1-sdf@fomichev.me/ commit 69bc0e247eb4132ef5fd0b118719427d35d462fc Author: Stanislav Fomichev <sdf@fomichev.me> AuthorDate: Tue Oct 15 15:56:43 2024 -0700 Commit: Stanislav Fomichev <sdf@fomichev.me> CommitDate: Wed Oct 16 13:13:42 2024 -0700 selftests: ncdevmem: Set header split threshold to 0 Needs to happen on BRCM to allow devmem to be attached. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c index 903dac3e61d5..6a94d52a6c43 100644 --- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c +++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c @@ -322,6 +322,8 @@ static int configure_headersplit(bool on) ethtool_rings_set_req_set_header_dev_index(req, ifindex); /* 0 - off, 1 - auto, 2 - on */ ethtool_rings_set_req_set_tcp_data_split(req, on ? 
2 : 0); + if (on) + ethtool_rings_set_req_set_tcp_data_split_thresh(req, 0); ret = ethtool_rings_set(ys, req); if (ret < 0) fprintf(stderr, "YNL failed: %s\n", ys->err.msg); commit ef5ba647bc94a19153c2c5cfc64ebe4cb86ac58d Author: Stanislav Fomichev <sdf@fomichev.me> AuthorDate: Fri Oct 11 13:52:03 2024 -0700 Commit: Stanislav Fomichev <sdf@fomichev.me> CommitDate: Wed Oct 16 13:13:42 2024 -0700 bnxt_en: support tx device memory The only change is to not unmap the frags on completions. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 6e422e24750a..cb22707a35aa 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -692,7 +692,10 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev) goto tx_dma_error; tx_buf = &txr->tx_buf_ring[RING_TX(bp, prod)]; - dma_unmap_addr_set(tx_buf, mapping, mapping); + if (netmem_is_net_iov(frag->netmem)) + dma_unmap_addr_set(tx_buf, mapping, 0); + else + dma_unmap_addr_set(tx_buf, mapping, mapping); txbd->tx_bd_haddr = cpu_to_le64(mapping); @@ -749,9 +752,10 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev) for (i = 0; i < last_frag; i++) { prod = NEXT_TX(prod); tx_buf = &txr->tx_buf_ring[RING_TX(bp, prod)]; - dma_unmap_page(&pdev->dev, dma_unmap_addr(tx_buf, mapping), - skb_frag_size(&skb_shinfo(skb)->frags[i]), - DMA_TO_DEVICE); + if (dma_unmap_addr(tx_buf, mapping)) + dma_unmap_page(&pdev->dev, dma_unmap_addr(tx_buf, mapping), + skb_frag_size(&skb_shinfo(skb)->frags[i]), + DMA_TO_DEVICE); + } tx_free: @@ -821,11 +825,12 @@ static bool __bnxt_tx_int(struct bnxt *bp, struct bnxt_tx_ring_info *txr, for (j = 0; j < last; j++) { cons = NEXT_TX(cons); tx_buf = &txr->tx_buf_ring[RING_TX(bp, cons)]; - dma_unmap_page( - &pdev->dev, - dma_unmap_addr(tx_buf, mapping), - skb_frag_size(&skb_shinfo(skb)->frags[j]), -
DMA_TO_DEVICE); + if (dma_unmap_addr(tx_buf, mapping)) + dma_unmap_page( + &pdev->dev, + dma_unmap_addr(tx_buf, mapping), + skb_frag_size(&skb_shinfo(skb)->frags[j]), + DMA_TO_DEVICE); } if (unlikely(is_ts_pkt)) { if (BNXT_CHIP_P5(bp)) { @@ -3296,10 +3301,11 @@ static void bnxt_free_tx_skbs(struct bnxt *bp) skb_frag_t *frag = &skb_shinfo(skb)->frags[k]; tx_buf = &txr->tx_buf_ring[ring_idx]; - dma_unmap_page( - &pdev->dev, - dma_unmap_addr(tx_buf, mapping), - skb_frag_size(frag), DMA_TO_DEVICE); + if (dma_unmap_addr(tx_buf, mapping)) + dma_unmap_page( + &pdev->dev, + dma_unmap_addr(tx_buf, mapping), + skb_frag_size(frag), DMA_TO_DEVICE); } dev_kfree_skb(skb); } ^ permalink raw reply related [flat|nested] 17+ messages in thread
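[Editor's note] The invariant behind the TX patch above can be captured in a small standalone model (the names are illustrative; the real driver stores the DMA address with dma_unmap_addr_set()): a frag backed by a net_iov records a zero unmap address, and every completion/cleanup path unmaps only when the recorded address is non-zero, because devmem frags are pre-mapped and owned by the provider.

```c
/* Standalone sketch of the completion-path rule in the TX patch above:
 * devmem (net_iov) frags are pre-mapped by the provider, so the driver
 * records 0 and skips dma_unmap on completion. Not driver code. */
#include <assert.h>
#include <stdbool.h>

struct tx_buf_model {
	unsigned long long mapping;	/* 0 means "do not unmap" */
};

static void record_frag_mapping(struct tx_buf_model *tb, bool is_net_iov,
				unsigned long long dma_addr)
{
	/* devmem frags: the provider owns the mapping, so record 0 */
	tb->mapping = is_net_iov ? 0 : dma_addr;
}

static bool completion_should_unmap(const struct tx_buf_model *tb)
{
	return tb->mapping != 0;
}
```

This keeps the fast path branch-light: the completion side never needs to re-derive whether a frag was devmem, it only checks the stored address.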
* Re: [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt 2024-10-16 20:17 ` [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Stanislav Fomichev @ 2024-10-17 8:58 ` Taehee Yoo 0 siblings, 0 replies; 17+ messages in thread From: Taehee Yoo @ 2024-10-17 8:58 UTC (permalink / raw) To: Stanislav Fomichev Cc: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc, donald.hunter, corbet, michael.chan, kory.maincent, andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley On Thu, Oct 17, 2024 at 5:17 AM Stanislav Fomichev <stfomichev@gmail.com> wrote: Hi Stanislav, Thank you so much for testing and improvement! > > On 10/03, Taehee Yoo wrote: > > This series implements device memory TCP for bnxt_en driver and > > necessary ethtool command implementations. > > > > NICs that use the bnxt_en driver support tcp-data-split feature named > > HDS(header-data-split). > > But there is no implementation for the HDS to enable/disable by ethtool. > > Only getting the current HDS status is implemented and the HDS is just > > automatically enabled only when either LRO, HW-GRO, or JUMBO is enabled. > > The hds_threshold follows the rx-copybreak value but it wasn't > > changeable. > > > > Currently, bnxt_en driver enables tcp-data-split by default but not > > always work. > > There is hds_threshold value, which indicates that a packet size is > > larger than this value, a packet will be split into header and data. > > hds_threshold value has been 256, which is a default value of > > rx-copybreak value too. > > The rx-copybreak value hasn't been allowed to change so the > > hds_threshold too. > > > > This patchset decouples hds_threshold and rx-copybreak first. > > and make tcp-data-split, rx-copybreak, and > > tcp-data-split-thresh(hds_threshold) configurable independently. 
> > > > But the default configuration is the same. > > The default value of rx-copybreak is 256 and default > > tcp-data-split-thresh is also 256. > > > > There are several related options. > > TPA(HW-GRO, LRO), JUMBO, jumbo_thresh(firmware command), and Aggregation > > Ring. > > > > The aggregation ring is fundamental to these all features. > > When gro/lro/jumbo packets are received, NIC receives the first packet > > from the normal ring. > > follow packets come from the aggregation ring. > > > > These features are working regardless of HDS. > > When TPA is enabled and HDS is disabled, the first packet contains > > header and payload too. > > and the following packets contain payload only. > > If HDS is enabled, the first packet contains the header only, and the > > following packets contain only payload. > > So, HW-GRO/LRO is working regardless of HDS. > > > > There is another threshold value, which is jumbo_thresh. > > This is very similar to hds_thresh, but jumbo thresh doesn't split > > header and data. > > It just split the first and following data based on length. > > When NIC receives 1500 sized packet, and jumbo_thresh is 256(default, but > > follows rx-copybreak), > > the first data is 256 and the following packet size is 1500-256. > > > > Before this patch, at least if one of GRO, LRO, and JUMBO flags is > > enabled, the Aggregation ring will be enabled. > > If the Aggregation ring is enabled, both hds_threshold and > > jumbo_thresh are set to the default value of rx-copybreak. > > > > So, GRO, LRO, JUMBO frames, they larger than 256 bytes, they will > > be split into header and data if the protocol is TCP or UDP. > > for the other protocol, jumbo_thresh works instead of hds_thresh. > > > > This means that tcp-data-split relies on the GRO, LRO, and JUMBO flags. > > But by this patch, tcp-data-split no longer relies on these flags. > > If the tcp-data-split is enabled, the Aggregation ring will be > > enabled. 
> > Also, hds_threshold no longer follows rx-copybreak value, it will > > be set to the tcp-data-split-thresh value by user-space, but the > > default value is still 256. > > > > If the protocol is TCP or UDP and the HDS is disabled and Aggregation > > ring is enabled, a packet will be split into several pieces due to > > jumbo_thresh. > > > > When XDP is attached, tcp-data-split is automatically disabled. > > > > LRO, GRO, and JUMBO are tested with BCM57414, BCM57504 and the firmware > > version is 230.0.157.0. > > I couldn't find any specification about minimum and maximum value > > of hds_threshold, but from my test result, it was about 0 ~ 1023. > > It means, over 1023 sized packets will be split into header and data if > > tcp-data-split is enabled regardless of hds_treshold value. > > When hds_threshold is 1500 and received packet size is 1400, HDS should > > not be activated, but it is activated. > > The maximum value of hds_threshold(tcp-data-split-thresh) > > value is 256 because it has been working. > > It was decided very conservatively. > > > > I checked out the tcp-data-split(HDS) works independently of GRO, LRO, > > JUMBO. Tested GRO/LRO, JUMBO with enabled HDS and disabled HDS. > > Also, I checked out tcp-data-split should be disabled automatically > > when XDP is attached and disallowed to enable it again while XDP is > > attached. I tested ranged values from min to max for > > tcp-data-split-thresh and rx-copybreak, and it works. > > tcp-data-split-thresh from 0 to 256, and rx-copybreak 65 to 256. > > When testing this patchset, I checked skb->data, skb->data_len, and > > nr_frags values. > > > > The first patch implements .{set, get}_tunable() in the bnxt_en. > > The bnxt_en driver has been supporting the rx-copybreak feature but is > > not configurable, Only the default rx-copybreak value has been working. > > So, it changes the bnxt_en driver to be able to configure > > the rx-copybreak value. 
> > > > The second patch adds an implementation of tcp-data-split ethtool > > command. > > The HDS relies on the Aggregation ring, which is automatically enabled > > when either LRO, GRO, or large mtu is configured. > > So, if the Aggregation ring is enabled, HDS is automatically enabled by > > it. > > > > The third patch adds tcp-data-split-thresh command in the ethtool. > > This threshold value indicates if a received packet size is larger > > than this threshold, the packet's header and payload will be split. > > Example: > > # ethtool -G <interface name> tcp-data-split-thresh <value> > > This option can not be used when tcp-data-split is disabled or not > > supported. > > # ethtool -G enp14s0f0np0 tcp-data-split on tcp-data-split-thresh 256 > > # ethtool -g enp14s0f0np0 > > Ring parameters for enp14s0f0np0: > > Pre-set maximums: > > ... > > Current hardware settings: > > ... > > TCP data split: on > > TCP data split thresh: 256 > > > > # ethtool -G enp14s0f0np0 tcp-data-split off > > # ethtool -g enp14s0f0np0 > > Ring parameters for enp14s0f0np0: > > Pre-set maximums: > > ... > > Current hardware settings: > > ... > > TCP data split: off > > TCP data split thresh: n/a > > > > The fourth patch adds the implementation of tcp-data-split-thresh logic > > in the bnxt_en driver. > > The default value is 256, which used to be the default rx-copybreak > > value. > > > > The fifth and sixth adds condition check for devmem and ethtool. > > If tcp-data-split is disabled or threshold value is not zero, setup of > > devmem will be failed. > > Also, tcp-data-split and tcp-data-split-thresh will not be changed > > while devmem is running. > > > > The last patch implements device memory TCP for bnxt_en driver. > > It usually converts generic page_pool api to netmem page_pool api. > > > > No dependencies exist between device memory TCP and GRO/LRO/MTU. > > Only tcp-data-split and tcp-data-split-thresh should be enabled when the > > device memory TCP. 
> > While devmem TCP is set, tcp-data-split and tcp-data-split-thresh can't > > be updated because core API disallows change. > > > > I tested the interface up/down while devmem TCP running. It works well. > > Also, channel count change, and rx/tx ringsize change tests work well too. > > > > The devmem TCP test NIC is BCM57504 > > [..] > > > All necessary configuration validations exist at the core API level. > > > > Note that by this patch, the setup of device memory TCP would fail. > > Because tcp-data-split-thresh command is not supported by ethtool yet. > > The tcp-data-split-thresh should be 0 for setup device memory TCP and > > the default of bnxt is 256. > > So, for the bnxt, it always fails until ethtool supports > > tcp-data-split-thresh command. > > > > The ncdevmem.c will be updated after ethtool supports > > tcp-data-split-thresh option. > > FYI, I've tested your series with BCM57504 on top of [1] and [2] with > a couple of patches to make ncdevmem.c and TX work (see below). [1] > decouples ncdevmem from ethtool so we can flip header split settings > without requiring recent ethtool. Both RX and TX work perfectly. > Feel free to carry: > > Tested-by: Stanislav Fomichev <sdf@fomichev.me> Thank you so much for your work! I will try to test your TX side patch before sending v4 patch. > > Also feel free to take over the ncdevmem patch if my ncdevmem changes > get pulled before your series. Good, Thanks! > > 1: https://lore.kernel.org/netdev/20241009171252.2328284-1-sdf@fomichev.me/ > 2: https://lore.kernel.org/netdev/20240913150913.1280238-1-sdf@fomichev.me/ > > commit 69bc0e247eb4132ef5fd0b118719427d35d462fc > Author: Stanislav Fomichev <sdf@fomichev.me> > AuthorDate: Tue Oct 15 15:56:43 2024 -0700 > Commit: Stanislav Fomichev <sdf@fomichev.me> > CommitDate: Wed Oct 16 13:13:42 2024 -0700 > > selftests: ncdevmem: Set header split threshold to 0 > > Needs to happen on BRCM to allow devmem to be attached. 
> > Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> > > diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c > index 903dac3e61d5..6a94d52a6c43 100644 > --- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c > +++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c > @@ -322,6 +322,8 @@ static int configure_headersplit(bool on) > ethtool_rings_set_req_set_header_dev_index(req, ifindex); > /* 0 - off, 1 - auto, 2 - on */ > ethtool_rings_set_req_set_tcp_data_split(req, on ? 2 : 0); > + if (enable) > + ethtool_rings_set_req_set_tcp_data_split_thresh(req, 0); > ret = ethtool_rings_set(ys, req); > if (ret < 0) > fprintf(stderr, "YNL failed: %s\n", ys->err.msg); > > > commit ef5ba647bc94a19153c2c5cfc64ebe4cb86ac58d > Author: Stanislav Fomichev <sdf@fomichev.me> > AuthorDate: Fri Oct 11 13:52:03 2024 -0700 > Commit: Stanislav Fomichev <sdf@fomichev.me> > CommitDate: Wed Oct 16 13:13:42 2024 -0700 > > bnxt_en: support tx device memory > > The only change is to not unmap the frags on completions. 
> > Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> > > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c > index 6e422e24750a..cb22707a35aa 100644 > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c > @@ -692,7 +692,10 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev) > goto tx_dma_error; > > tx_buf = &txr->tx_buf_ring[RING_TX(bp, prod)]; > - dma_unmap_addr_set(tx_buf, mapping, mapping); > + if (netmem_is_net_iov(frag->netmem)) > + dma_unmap_addr_set(tx_buf, mapping, 0); > + else > + dma_unmap_addr_set(tx_buf, mapping, mapping); > > txbd->tx_bd_haddr = cpu_to_le64(mapping); > > @@ -749,9 +752,10 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev) > for (i = 0; i < last_frag; i++) { > prod = NEXT_TX(prod); > tx_buf = &txr->tx_buf_ring[RING_TX(bp, prod)]; > - dma_unmap_page(&pdev->dev, dma_unmap_addr(tx_buf, mapping), > - skb_frag_size(&skb_shinfo(skb)->frags[i]), > - DMA_TO_DEVICE); > + if (dma_unmap_addr(tx_buf, mapping)) > + dma_unmap_page(&pdev->dev, dma_unmap_addr(tx_buf, mapping), > + skb_frag_size(&skb_shinfo(skb)->frags[i]), > + DMA_TO_DEVICE); > } > > tx_free: > @@ -821,11 +825,12 @@ static bool __bnxt_tx_int(struct bnxt *bp, struct bnxt_tx_ring_info *txr, > for (j = 0; j < last; j++) { > cons = NEXT_TX(cons); > tx_buf = &txr->tx_buf_ring[RING_TX(bp, cons)]; > - dma_unmap_page( > - &pdev->dev, > - dma_unmap_addr(tx_buf, mapping), > - skb_frag_size(&skb_shinfo(skb)->frags[j]), > - DMA_TO_DEVICE); > + if (dma_unmap_addr(tx_buf, mapping)) > + dma_unmap_page( > + &pdev->dev, > + dma_unmap_addr(tx_buf, mapping), > + skb_frag_size(&skb_shinfo(skb)->frags[j]), > + DMA_TO_DEVICE); > } > if (unlikely(is_ts_pkt)) { > if (BNXT_CHIP_P5(bp)) { > @@ -3296,10 +3301,11 @@ static void bnxt_free_tx_skbs(struct bnxt *bp) > skb_frag_t *frag = &skb_shinfo(skb)->frags[k]; > > tx_buf = &txr->tx_buf_ring[ring_idx]; > - 
dma_unmap_page( > - &pdev->dev, > - dma_unmap_addr(tx_buf, mapping), > - skb_frag_size(frag), DMA_TO_DEVICE); > + if (dma_unmap_addr(tx_buf, mapping)) > + dma_unmap_page( > + &pdev->dev, > + dma_unmap_addr(tx_buf, mapping), > + skb_frag_size(frag), DMA_TO_DEVICE); > } > dev_kfree_skb(skb); > } Thanks a lot! Taehee Yoo ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split ethtool command
       [not found] ` <20241009082837.2735cd97@kernel.org>
@ 2024-10-31 17:34   ` Taehee Yoo
  2024-10-31 23:56     ` Jakub Kicinski
  0 siblings, 1 reply; 17+ messages in thread
From: Taehee Yoo @ 2024-10-31 17:34 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
      donald.hunter, corbet, michael.chan, kory.maincent, andrew,
      maxime.chevallier, danieller, hengqi, ecree.xilinx,
      przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
      rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
      aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Thu, Oct 10, 2024 at 12:28 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 9 Oct 2024 22:54:17 +0900 Taehee Yoo wrote:
> > > This breaks previous behavior. The HDS reporting from get was
> > > introduced to signal to user space whether the page flip based
> > > TCP zero-copy (the one added some years ago, not the recent one)
> > > will be usable with this NIC.
> > >
> > > When HW-GRO is enabled HDS will be working.
> > >
> > > I think that the driver should only track if the user has set the value
> > > to ENABLED (forced HDS), or to UNKNOWN (driver default). Setting the HDS
> > > to disabled is not useful, don't support it.
> >
> > Okay, I will remove the disable feature in a v4 patch.
> > Before this patch, hds_threshold was the rx-copybreak value.
> > How do you think hds_threshold should still follow the rx-copybreak value
> > if it is in UNKNOWN mode?
>
> IIUC the rx_copybreak only applies to the header? Or does it apply
> to the entire frame?
>
> If rx_copybreak applies to the entire frame and not just the first
> buffer (headers or headers+payload if not split) - no preference.
> If rx_copybreak only applies to the headers / first buffer then
> I'd keep them separate as they operate on a different length.
>
> > I think hds_threshold needs to follow the new tcp-data-split-thresh value in
> > ENABLE/UNKNOWN and make rx-copybreak a pure software feature.
>
> Sounds good to me, but just to be clear:
>
> If user sets the HDS enable to UNKNOWN (or doesn't set it):
>  - GET returns (current behavior, AFAIU):
>    - DISABLED (if HW-GRO is disabled and MTU is not Jumbo)
>    - ENABLED (if HW-GRO is enabled or MTU is Jumbo)
> If user sets the HDS enable to ENABLED (force HDS on):
>  - GET returns ENABLED

While I'm writing a patch I face an ambiguous problem here.
ethnl_set_rings() first calls .get_ringparam() to get the current config.
Then it calls .set_ringparam() after it sets the current config + new
config in the param structures.
The bnxt_set_ringparam() may receive ETHTOOL_TCP_DATA_SPLIT_ENABLED
in two cases:
1. from the user
2. from bnxt_get_ringparam() because of UNKNOWN.
The problem is that bnxt_set_ringparam() can't distinguish them.
The problem scenario is here:
1. tcp-data-split is in UNKNOWN mode.
2. HDS is automatically enabled because one of LRO or GRO is enabled.
3. user changes a ring parameter with the following command:
   `ethtool -G eth0 rx 1024`
4. ethnl_set_rings() calls .get_ringparam() to get the current config.
5. bnxt_get_ringparam() returns ENABLED for HDS because of UNKNOWN mode.
6. ethnl_set_rings() calls .set_ringparam() after setting the params with
   configs that come from .get_ringparam().
7. bnxt_set_ringparam() is passed ETHTOOL_TCP_DATA_SPLIT_ENABLED but
   the user didn't set it explicitly.
8. bnxt_set_ringparam() eventually force-enables tcp-data-split.

I couldn't find a way to distinguish them so far.
I'm not sure if this is acceptable or not.
Maybe we need to modify the scenario?

> hds_threshold returns: some value, but it's only actually used if GET
> returns ENABLED.
>
> > But if so, it changes the default behavior.
>
> How so? The configuration of neither of those two is exposed to
> the user. We can keep the same defaults, until user overrides them.
>
> > How do you think about it?
>
> > > > >  	ering->tx_max_pending = BNXT_MAX_TX_DESC_CNT;
> > > > >
> > > > >  	ering->rx_pending = bp->rx_ring_size;
> > > > > @@ -854,9 +858,25 @@ static int bnxt_set_ringparam(struct net_device *dev,
> > > > >  	    (ering->tx_pending < BNXT_MIN_TX_DESC_CNT))
> > > > >  		return -EINVAL;
> > > > >
> > > > > +	if (kernel_ering->tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_DISABLED &&
> > > > > +	    BNXT_RX_PAGE_MODE(bp)) {
> > > > > +		NL_SET_ERR_MSG_MOD(extack, "tcp-data-split can not be enabled with XDP");
> > > > > +		return -EINVAL;
> > > > > +	}
> > > >
> > > > Technically just if the XDP does not support multi-buffer.
> > > > Any chance we could do this check in the core?
> > >
> > > I think we can access xdp_rxq_info with the netdev_rx_queue structure.
> > > However, xdp_rxq_info is not sufficient to distinguish whether mb is
> > > supported by the driver or not. I think prog->aux->xdp_has_frags is
> > > required to distinguish it correctly.
> > > So, I think we need something more.
> > > Do you have any idea?
> >
> > Take a look at dev_xdp_prog_count(), something like that but only
> > counting non-mb progs?

^ permalink raw reply	[flat|nested] 17+ messages in thread
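The get→set round trip Taehee describes can be condensed into a toy model — not kernel code, every name below is invented — showing why the driver cannot tell an ENABLED echoed back by its own GET from an ENABLED the user forced:

```c
#include <assert.h>
#include <stdbool.h>

enum hds_state { HDS_UNKNOWN, HDS_ENABLED, HDS_DISABLED };

struct dev_sketch {
	enum hds_state user_cfg;	/* what the user explicitly configured */
	bool hw_gro;			/* HDS auto-enables with HW-GRO */
};

/* GET reports the *effective* state, masking UNKNOWN (auto) mode. */
enum hds_state get_ringparam(const struct dev_sketch *dev)
{
	if (dev->user_cfg == HDS_UNKNOWN)
		return dev->hw_gro ? HDS_ENABLED : HDS_DISABLED;
	return dev->user_cfg;
}

/* The core pre-fills the SET request from GET, then applies any
 * explicit user attributes on top before calling the driver. */
enum hds_state build_set_request(const struct dev_sketch *dev,
				 bool user_touched_hds,
				 enum hds_state user_value)
{
	enum hds_state req = get_ringparam(dev);

	if (user_touched_hds)
		req = user_value;
	return req;
}
```

With `user_cfg == HDS_UNKNOWN` and HW-GRO on, the driver receives `HDS_ENABLED` whether or not the user touched the setting — exactly the ambiguity in steps 4–8 above.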
* Re: [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split ethtool command
  2024-10-31 17:34 ` [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split ethtool command Taehee Yoo
@ 2024-10-31 23:56   ` Jakub Kicinski
  2024-11-01 17:11     ` Taehee Yoo
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Kicinski @ 2024-10-31 23:56 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
      donald.hunter, corbet, michael.chan, kory.maincent, andrew,
      maxime.chevallier, danieller, hengqi, ecree.xilinx,
      przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
      rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
      aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Fri, 1 Nov 2024 02:34:59 +0900 Taehee Yoo wrote:
> While I'm writing a patch I face an ambiguous problem here.
> ethnl_set_rings() first calls .get_ringparam() to get the current config.
> Then it calls .set_ringparam() after it sets the current config + new
> config in the param structures.
> The bnxt_set_ringparam() may receive ETHTOOL_TCP_DATA_SPLIT_ENABLED
> in two cases:
> 1. from the user
> 2. from bnxt_get_ringparam() because of UNKNOWN.
> The problem is that bnxt_set_ringparam() can't distinguish them.
> The problem scenario is here:
> 1. tcp-data-split is in UNKNOWN mode.
> 2. HDS is automatically enabled because one of LRO or GRO is enabled.
> 3. user changes a ring parameter with the following command:
>    `ethtool -G eth0 rx 1024`
> 4. ethnl_set_rings() calls .get_ringparam() to get the current config.
> 5. bnxt_get_ringparam() returns ENABLED for HDS because of UNKNOWN mode.
> 6. ethnl_set_rings() calls .set_ringparam() after setting the params with
>    configs that come from .get_ringparam().
> 7. bnxt_set_ringparam() is passed ETHTOOL_TCP_DATA_SPLIT_ENABLED but
>    the user didn't set it explicitly.
> 8. bnxt_set_ringparam() eventually force-enables tcp-data-split.
>
> I couldn't find a way to distinguish them so far.
> I'm not sure if this is acceptable or not.
> Maybe we need to modify the scenario?

I thought we discussed this, but I may be misremembering.
You may need to record in the core whether the setting came
from the user or not (similarly to IFF_RXFH_CONFIGURED).
User setting UNKNOWN would mean "reset".
Maybe I'm misunderstanding..

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split ethtool command
  2024-10-31 23:56 ` Jakub Kicinski
@ 2024-11-01 17:11   ` Taehee Yoo
  0 siblings, 0 replies; 17+ messages in thread
From: Taehee Yoo @ 2024-11-01 17:11 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
      donald.hunter, corbet, michael.chan, kory.maincent, andrew,
      maxime.chevallier, danieller, hengqi, ecree.xilinx,
      przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
      rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
      aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Fri, Nov 1, 2024 at 8:56 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 1 Nov 2024 02:34:59 +0900 Taehee Yoo wrote:
> > While I'm writing a patch I face an ambiguous problem here.
> > ethnl_set_rings() first calls .get_ringparam() to get the current config.
> > Then it calls .set_ringparam() after it sets the current config + new
> > config in the param structures.
> > The bnxt_set_ringparam() may receive ETHTOOL_TCP_DATA_SPLIT_ENABLED
> > in two cases:
> > 1. from the user
> > 2. from bnxt_get_ringparam() because of UNKNOWN.
> > The problem is that bnxt_set_ringparam() can't distinguish them.
> > The problem scenario is here:
> > 1. tcp-data-split is in UNKNOWN mode.
> > 2. HDS is automatically enabled because one of LRO or GRO is enabled.
> > 3. user changes a ring parameter with the following command:
> >    `ethtool -G eth0 rx 1024`
> > 4. ethnl_set_rings() calls .get_ringparam() to get the current config.
> > 5. bnxt_get_ringparam() returns ENABLED for HDS because of UNKNOWN mode.
> > 6. ethnl_set_rings() calls .set_ringparam() after setting the params with
> >    configs that come from .get_ringparam().
> > 7. bnxt_set_ringparam() is passed ETHTOOL_TCP_DATA_SPLIT_ENABLED but
> >    the user didn't set it explicitly.
> > 8. bnxt_set_ringparam() eventually force-enables tcp-data-split.
> >
> > I couldn't find a way to distinguish them so far.
> > I'm not sure if this is acceptable or not.
> > Maybe we need to modify the scenario?
>
> I thought we discussed this, but I may be misremembering.
> You may need to record in the core whether the setting came
> from the user or not (similarly to IFF_RXFH_CONFIGURED).
> User setting UNKNOWN would mean "reset".
> Maybe I'm misunderstanding..

Thanks a lot for that!
I will try to add a new variable that indicates tcp-data-split was set
by the user.
It would be tcp_data_split_mod in the kernel_ethtool_ringparam
structure.

Thanks a lot!
Taehee Yoo

^ permalink raw reply	[flat|nested] 17+ messages in thread
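The agreed-upon fix — carrying a "modified by user" bit alongside the value, in the spirit of IFF_RXFH_CONFIGURED — can be sketched as follows. This is a hypothetical model, not the eventual kernel code: `tcp_data_split_mod` is the name proposed in the email above, and the other names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum split_state { SPLIT_UNKNOWN, SPLIT_ENABLED };

struct ringparam_sketch {
	enum split_state tcp_data_split;
	bool tcp_data_split_mod;	/* set only when userspace sent the attribute */
};

/* Driver side: treat ENABLED as a user request only when the
 * "modified" bit says the value actually came from userspace,
 * not from the core echoing back the driver's own GET result. */
bool driver_should_force_hds(const struct ringparam_sketch *kp)
{
	return kp->tcp_data_split_mod && kp->tcp_data_split == SPLIT_ENABLED;
}
```

This resolves the ambiguity from the earlier scenario: an ENABLED pre-filled by the core carries `tcp_data_split_mod == false` and leaves the driver in auto mode.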
end of thread, other threads:[~2024-11-01 17:11 UTC | newest]
Thread overview: 17+ messages
[not found] <20241003160620.1521626-1-ap420073@gmail.com>
[not found] ` <20241003160620.1521626-8-ap420073@gmail.com>
[not found] ` <CAHS8izO-7pPk7xyY4JdyaY4hZpd7zerbjhGanRvaTk+OOsvY0A@mail.gmail.com>
[not found] ` <CAMArcTU61G=fexf-RJDSW_sGp9dZCkJsJKC=yjg79RS9Ugjuxw@mail.gmail.com>
[not found] ` <20241008125023.7fbc1f64@kernel.org>
[not found] ` <CAMArcTWVrQ7KWPt+c0u7X=jvBd2VZGVLwjWYCjMYhWZTymMRTg@mail.gmail.com>
[not found] ` <20241009170102.1980ed1d@kernel.org>
2024-10-10 17:44 ` [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp Mina Almasry
2024-10-11 1:34 ` Jakub Kicinski
2024-10-11 17:33 ` Mina Almasry
2024-10-11 23:42 ` Jason Gunthorpe
2024-10-14 22:38 ` Mina Almasry
2024-10-15 0:16 ` Jakub Kicinski
2024-10-15 1:10 ` Mina Almasry
2024-10-15 12:44 ` Jason Gunthorpe
2024-10-18 8:25 ` Mina Almasry
2024-10-19 13:55 ` Taehee Yoo
2024-10-15 14:29 ` Pavel Begunkov
2024-10-15 17:38 ` David Wei
2024-10-16 20:17 ` [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Stanislav Fomichev
2024-10-17 8:58 ` Taehee Yoo
[not found] ` <20241003160620.1521626-3-ap420073@gmail.com>
[not found] ` <20241008111926.7056cc93@kernel.org>
[not found] ` <CAMArcTU+r+Pj_y7rUvRwTrDWqg57xy4e-OacjWCfKRCUa8A-aw@mail.gmail.com>
[not found] ` <20241009082837.2735cd97@kernel.org>
2024-10-31 17:34 ` [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split ethtool command Taehee Yoo
2024-10-31 23:56 ` Jakub Kicinski
2024-11-01 17:11 ` Taehee Yoo