* [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA
@ 2025-11-29 16:03 Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 01/27] PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on mws_size[] access Koichiro Den
` (27 more replies)
0 siblings, 28 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Hi,
This is RFC v2 of the NTB/PCI series for Renesas R-Car S4. The ultimate
goal is unchanged, i.e. to improve performance between RC and EP
(with vNTB) over ntb_transport, but the approach has changed drastically.
Based on the feedback from Frank Li in the v1 thread, in particular:
https://lore.kernel.org/all/aQEsip3TsPn4LJY9@lizhi-Precision-Tower-5810/
this RFC v2 instead builds an NTB transport backed by a remote eDMA
architecture and reshapes the series around it. RC->EP interrupt
delivery is now achieved using a dedicated eDMA read channel, so the
somewhat hackish approach in RFC v1 is no longer needed.
Compared to RFC v1, this v2 series enables an NTB transport backed by
remote DW eDMA, so the current ntb_transport handling of memory windows
is no longer needed, and direct DMA transfers between EP and RC are
used instead.
I realize this is quite a large series. Sorry for the volume, but for
the RFC stage I believe presenting the full picture in a single set
helps with reviewing the overall architecture. Once the direction is
agreed, I will respin it split by subsystem and topic.
The new architecture
====================
In the new architecture the endpoint exposes a small memory window that
contains the unrolled DesignWare eDMA register block plus a per-channel
control structure and linked-list rings. The endpoint allocates these in
its own memory, then maps them into a peer MW via an inbound iATU
region. The host maps the peer MW and configures a dw-edma engine to
use the remote rings. The data-plane flow is depicted below (Figure 1
and Figure 2).
With this design, per-queue PCI memory usage is reduced to control-plane
metadata (ring descriptors and indices). Data buffers live in system
memory and are transferred by the remote eDMA, so even relatively small
BAR windows can theoretically scale to multiple ntb_transport queue
pairs, and the DMA_MEMCPY operation is no longer required. This series
also adds multi-queue support to ntb_netdev to demonstrate the
performance improvement.
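To make the control-plane layout above a bit more concrete, here is a
rough sketch of the kind of per-queue metadata the EP could expose in
its memory window; the structure and field names below are illustrative
only and are not taken from the patches:

    /* One entry of the "produce"/"consume" rings ([B]/[C] in the
     * figures). Illustrative layout, not the actual on-wire format. */
    struct ntb_edma_ring_entry {
            __le64 addr;    /* DMA-mapped SAR or DAR in peer memory */
            __le32 len;     /* transfer length in bytes */
            __le32 flags;   /* e.g. valid/commit bits */
    };

    /* Per-queue control block, followed in the MW by the ring entries
     * and the DW eDMA linked-list (LL) descriptors. */
    struct ntb_edma_queue_ctrl {
            __le32 head;            /* producer index */
            __le32 tail;            /* consumer index */
            __le32 num_entries;
            __le32 reserved;
    };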
The shared-memory ntb_transport backend remains the default. The remote
eDMA mode is compile-time and run-time selectable via
CONFIG_NTB_TRANSPORT_EDMA and the new 'use_remote_edma' module
parameter, and existing users that do not enable it should see no
behavioural change apart from the BAR subrange support described below.
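A minimal sketch of how the run-time switch could be wired up inside
ntb_transport (the parameter name comes from this cover letter; the
backend-ops pointers are hypothetical and merely stand in for the
ntb_transport_backend_ops infrastructure added in patch 18):

    static bool use_remote_edma;
    module_param(use_remote_edma, bool, 0644);
    MODULE_PARM_DESC(use_remote_edma,
                     "Back ntb_transport by the peer's DW eDMA instead of shared-memory MWs");

    /* ... at transport init time ... */
    if (IS_ENABLED(CONFIG_NTB_TRANSPORT_EDMA) && use_remote_edma)
            nt->backend_ops = &ntb_edma_backend_ops;        /* hypothetical */
    else
            nt->backend_ops = &ntb_shmem_backend_ops;       /* hypothetical */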
Figure 1. RC->EP traffic via ntb_netdev+ntb_transport
backed by Remote eDMA
EP RC
phys addr phys addr
space space
+-+ +-+
| | | |
| | || | |
+-+-----. || | |
EDMA REG | | \ [A] || | |
+-+----. '---+-+ || | |
| | \ | |<---------[0-a]----------
+-+-----------| |<----------[2]----------.
EDMA LL | | | | || | | :
| | | | || | | :
+-+-----------+-+ || [B] | | :
| | || ++ | | :
---------[0-b]----------->||----------------'
| | ++ || || | |
| | || || ++ | |
| | ||<----------[4]-----------
| | ++ || | |
| | [C] || | |
.--|#|<------------------------[3]------|#|<-.
: |#| || |#| :
[5] | | || | | [1]
: | | || | | :
'->|#| |#|--'
|#| |#|
| | | |
0-a. configure Remote eDMA
0-b. DMA-map and produce DAR
1. memcpy while building skb in ntb_netdev case
2. consume DAR, DMA-map SAR and kick DMA read transfer
3. DMA read transfer (initiated by RC remotely)
4. consume (commit)
5. memcpy to application side
[A]: Memory window that aggregates eDMA regs and LL.
IB iATU translations (Address Match Mode).
[B]: Control plane ring buffer (for "produce")
[C]: Control plane ring buffer (for "consume")
Figure 2. EP->RC traffic via ntb_netdev+ntb_transport
backed by Remote eDMA
EP RC
phys addr phys addr
space space
+-+ +-+
| | | |
| | || | |
+-+-----. || | |
EDMA REG | | \ [A] || | |
+-+----. '---+-+ || | |
| | \ | |<----------[0]-----------
+-+-----------| |<----------[3]----------.
EDMA LL | | | | || | | :
| | | | || | | :
+-+-----------+-+ || [B] | | :
| | || ++ | | :
-----------[2]----------->||----------------'
| | ++ || || | |
| | || || ++ | |
| | ||<----------[5]-----------
| | ++ || | |
| | [C] || | |
.->|#|--------[4]---------------------->|#|--.
: |#| || |#| :
[1] | | || | | [6]
: | | || | | :
'--|#| |#|<-'
|#| |#|
| | | |
0. configure Remote eDMA
1. memcpy while building skb in ntb_netdev case
2. DMA-map SAR and "produce"
3. consume SAR, DMA-map DAR and kick DMA write transfer
4. DMA write transfer (initiated by RC remotely)
5. consume (commit)
6. memcpy to application side
[A]: Memory window that aggregates eDMA regs and LL.
IB iATU translations (Address Match Mode).
[B]: Control plane ring buffer (for "produce")
[C]: Control plane ring buffer (for "consume")
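As a reading aid for Figure 1 steps 2-3, a heavily simplified sketch of
the RC-side kick path follows. Every helper and type below is
hypothetical and only mirrors the numbered steps; it is not the code
added in patch 20:

    /* RC side, Figure 1 steps 2-3: pull a DAR produced by the EP, map a
     * local source buffer and start a remote eDMA read transfer. */
    static int rc_kick_read(struct ntb_edma_queue *q, void *buf, size_t len)
    {
            dma_addr_t sar, dar;

            if (!consume_dar(q, &dar))              /* step 2: EP-produced DAR */
                    return -EAGAIN;

            sar = dma_map_single(q->dma_dev, buf, len, DMA_TO_DEVICE);
            if (dma_mapping_error(q->dma_dev, sar))
                    return -ENOMEM;

            write_ll_entry(q, sar, dar, len);       /* fill the remote LL ring */
            ring_edma_doorbell(q);                  /* step 3: kick DMA read */
            return 0;
    }

The EP->RC direction (Figure 2) mirrors this with a write channel: the
EP produces SARs, while the RC maps local destination buffers and kicks
a DMA write transfer.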
Patch layout
============
Patch 01-19 : preparation for Patch 20
- 01-10: support multiple MWs in a BAR
- 11-19: other misc preparations
Patch 20 : the main and most important patch, adds remote eDMA support
Patch 21-22 : multi-queue support; thanks to the remote eDMA, performance
scales
Patch 23-27 : handle several SoC-specific issues so that the remote eDMA
mode of ntb_transport works on R-Car S4
Changelog
=========
RFCv1->RFCv2 changes:
- Architecture
- Drop the generic interrupt backend + DW eDMA test-interrupt backend
approach and instead adopt the remote eDMA-backed ntb_transport mode
proposed by Frank Li. The BAR-sharing / mwN_offset / inbound
mapping (Address Match Mode) infrastructure from RFC v1 is largely
kept, with only minor refinements and code motion where necessary
to fit the new transport-mode design.
- For Patch 01
- Rework the array_index_nospec() conversion to address review
comments on "[RFC PATCH 01/25]".
RFCv1: https://lore.kernel.org/all/20251023071916.901355-1-den@valinux.co.jp/
Tested on
=========
* 2x Renesas R-Car S4 Spider (RC<->EP connected with OcuLink cable)
* Kernel base: next-20251128
Performance measurement
=======================
No serious measurements yet, because:
* For "before the change", even use_dma/use_msi does not work on the
upstream kernel unless we apply some patches for R-Car S4. With some
unmerged patch series I had posted earlier, it was observed that we
can achieve about 7 Gbps for the RC->EP direction. Pure upstream
kernel can achieve around 500 Mbps though.
* For "after the change", measurements are not mature because this
RFC v2 patch series is not yet performance-optimized at this stage.
Also, somewhat unstable behaviour remains around ntb_edma_isr().
Here are the rough measurements showing the achievable performance on
the R-Car S4:
- Before this change:
* ping
64 bytes from 10.0.0.11: icmp_seq=1 ttl=64 time=12.3 ms
64 bytes from 10.0.0.11: icmp_seq=2 ttl=64 time=6.58 ms
64 bytes from 10.0.0.11: icmp_seq=3 ttl=64 time=1.26 ms
64 bytes from 10.0.0.11: icmp_seq=4 ttl=64 time=7.43 ms
64 bytes from 10.0.0.11: icmp_seq=5 ttl=64 time=1.39 ms
64 bytes from 10.0.0.11: icmp_seq=6 ttl=64 time=7.38 ms
64 bytes from 10.0.0.11: icmp_seq=7 ttl=64 time=1.42 ms
64 bytes from 10.0.0.11: icmp_seq=8 ttl=64 time=7.41 ms
* RC->EP (`sudo iperf3 -ub0 -l 65480 -P 2`)
[ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams
[ 5] 0.00-10.01 sec 344 MBytes 288 Mbits/sec 3.483 ms 51/5555 (0.92%) receiver
[ 6] 0.00-10.01 sec 342 MBytes 287 Mbits/sec 3.814 ms 38/5517 (0.69%) receiver
[SUM] 0.00-10.01 sec 686 MBytes 575 Mbits/sec 3.648 ms 89/11072 (0.8%) receiver
* EP->RC (`sudo iperf3 -ub0 -l 65480 -P 2`)
[ 5] 0.00-10.03 sec 334 MBytes 279 Mbits/sec 3.164 ms 390/5731 (6.8%) receiver
[ 6] 0.00-10.03 sec 334 MBytes 279 Mbits/sec 2.416 ms 396/5741 (6.9%) receiver
[SUM] 0.00-10.03 sec 667 MBytes 558 Mbits/sec 2.790 ms 786/11472 (6.9%) receiver
Note: with `-P 2`, the best total bitrate (receiver side) was achieved.
- After this change (use_remote_edma=1) [1]:
* ping
64 bytes from 10.0.0.11: icmp_seq=1 ttl=64 time=1.48 ms
64 bytes from 10.0.0.11: icmp_seq=2 ttl=64 time=1.03 ms
64 bytes from 10.0.0.11: icmp_seq=3 ttl=64 time=0.931 ms
64 bytes from 10.0.0.11: icmp_seq=4 ttl=64 time=0.910 ms
64 bytes from 10.0.0.11: icmp_seq=5 ttl=64 time=1.07 ms
64 bytes from 10.0.0.11: icmp_seq=6 ttl=64 time=0.986 ms
64 bytes from 10.0.0.11: icmp_seq=7 ttl=64 time=0.910 ms
64 bytes from 10.0.0.11: icmp_seq=8 ttl=64 time=0.883 ms
* RC->EP (`sudo iperf3 -ub0 -l 65480 -P 4`)
[ 5] 0.00-10.01 sec 3.54 GBytes 3.04 Gbits/sec 0.030 ms 0/58007 (0%) receiver
[ 6] 0.00-10.01 sec 3.71 GBytes 3.19 Gbits/sec 0.453 ms 0/60909 (0%) receiver
[ 9] 0.00-10.01 sec 3.85 GBytes 3.30 Gbits/sec 0.027 ms 0/63072 (0%) receiver
[ 11] 0.00-10.01 sec 3.26 GBytes 2.80 Gbits/sec 0.070 ms 1/53512 (0.0019%) receiver
[SUM] 0.00-10.01 sec 14.4 GBytes 12.3 Gbits/sec 0.145 ms 1/235500 (0.00042%) receiver
* EP->RC (`sudo iperf3 -ub0 -l 65480 -P 4`)
[ 5] 0.00-10.03 sec 3.40 GBytes 2.91 Gbits/sec 0.104 ms 15467/71208 (22%) receiver
[ 6] 0.00-10.03 sec 3.08 GBytes 2.64 Gbits/sec 0.176 ms 12097/62609 (19%) receiver
[ 9] 0.00-10.03 sec 3.38 GBytes 2.90 Gbits/sec 0.270 ms 17212/72710 (24%) receiver
[ 11] 0.00-10.03 sec 2.56 GBytes 2.19 Gbits/sec 0.200 ms 11193/53090 (21%) receiver
[SUM] 0.00-10.03 sec 12.4 GBytes 10.6 Gbits/sec 0.188 ms 55969/259617 (22%) receiver
[1] configfs settings:
# modprobe pci_epf_vntb dyndbg=+pmf
# cd /sys/kernel/config/pci_ep/
# mkdir functions/pci_epf_vntb/func1
# echo 0x1912 > functions/pci_epf_vntb/func1/vendorid
# echo 0x0030 > functions/pci_epf_vntb/func1/deviceid
# echo 32 > functions/pci_epf_vntb/func1/msi_interrupts
# echo 16 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/db_count
# echo 128 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/spad_count
# echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/num_mws
# echo 0xe0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1
# echo 0x20000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2
# echo 0xe0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2_offset
# echo 0x1912 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vntb_vid
# echo 0x0030 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vntb_pid
# echo 0x10 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vbus_number
# echo 0 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/ctrl_bar
# echo 4 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/db_bar
# echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1_bar
# echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2_bar
# ln -s controllers/e65d0000.pcie-ep functions/pci_epf_vntb/func1/primary/
# echo 1 > controllers/e65d0000.pcie-ep/start
Thanks for taking a look.
Koichiro Den (27):
PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on mws_size[]
access
PCI: endpoint: pci-epf-vntb: Add mwN_offset configfs attributes
NTB: epf: Handle mwN_offset for inbound MW regions
PCI: endpoint: Add inbound mapping ops to EPC core
PCI: dwc: ep: Implement EPC inbound mapping support
PCI: endpoint: pci-epf-vntb: Use pci_epc_map_inbound() for MW mapping
NTB: Add offset parameter to MW translation APIs
PCI: endpoint: pci-epf-vntb: Propagate MW offset from configfs when
present
NTB: ntb_transport: Support offsetted partial memory windows
NTB: core: Add .get_pci_epc() to ntb_dev_ops
NTB: epf: vntb: Implement .get_pci_epc() callback
dmaengine: dw-edma: Fix MSI data values for multi-vector IMWr
interrupts
NTB: ntb_transport: Use seq_file for QP stats debugfs
NTB: ntb_transport: Move TX memory window setup into setup_qp_mw()
NTB: ntb_transport: Dynamically determine qp count
NTB: ntb_transport: Introduce get_dma_dev() helper
NTB: epf: Reserve a subset of MSI vectors for non-NTB users
NTB: ntb_transport: Introduce ntb_transport_backend_ops
PCI: dwc: ep: Cache MSI outbound iATU mapping
NTB: ntb_transport: Introduce remote eDMA backed transport mode
NTB: epf: Provide db_vector_count/db_vector_mask callbacks
ntb_netdev: Multi-queue support
NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car)
iommu: ipmmu-vmsa: Add PCIe ch0 to devices_allowlist
iommu: ipmmu-vmsa: Add support for reserved regions
arm64: dts: renesas: Add Spider RC/EP DTs for NTB with remote DW PCIe
eDMA
NTB: epf: Add an additional memory window (MW2) barno mapping on
Renesas R-Car
arch/arm64/boot/dts/renesas/Makefile | 2 +
.../boot/dts/renesas/r8a779f0-spider-ep.dts | 46 +
.../boot/dts/renesas/r8a779f0-spider-rc.dts | 52 +
drivers/dma/dw-edma/dw-edma-core.c | 28 +-
drivers/iommu/ipmmu-vmsa.c | 7 +-
drivers/net/ntb_netdev.c | 341 ++-
drivers/ntb/Kconfig | 11 +
drivers/ntb/Makefile | 3 +
drivers/ntb/hw/amd/ntb_hw_amd.c | 6 +-
drivers/ntb/hw/epf/ntb_hw_epf.c | 177 +-
drivers/ntb/hw/idt/ntb_hw_idt.c | 3 +-
drivers/ntb/hw/intel/ntb_hw_gen1.c | 6 +-
drivers/ntb/hw/intel/ntb_hw_gen1.h | 2 +-
drivers/ntb/hw/intel/ntb_hw_gen3.c | 3 +-
drivers/ntb/hw/intel/ntb_hw_gen4.c | 6 +-
drivers/ntb/hw/mscc/ntb_hw_switchtec.c | 6 +-
drivers/ntb/msi.c | 6 +-
drivers/ntb/ntb_edma.c | 628 ++++++
drivers/ntb/ntb_edma.h | 128 ++
.../{ntb_transport.c => ntb_transport_core.c} | 1829 ++++++++++++++---
drivers/ntb/test/ntb_perf.c | 4 +-
drivers/ntb/test/ntb_tool.c | 6 +-
.../pci/controller/dwc/pcie-designware-ep.c | 287 ++-
drivers/pci/controller/dwc/pcie-designware.h | 7 +
drivers/pci/endpoint/functions/pci-epf-vntb.c | 229 ++-
drivers/pci/endpoint/pci-epc-core.c | 44 +
include/linux/ntb.h | 39 +-
include/linux/ntb_transport.h | 21 +
include/linux/pci-epc.h | 11 +
29 files changed, 3415 insertions(+), 523 deletions(-)
create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-ep.dts
create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-rc.dts
create mode 100644 drivers/ntb/ntb_edma.c
create mode 100644 drivers/ntb/ntb_edma.h
rename drivers/ntb/{ntb_transport.c => ntb_transport_core.c} (59%)
--
2.48.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* [RFC PATCH v2 01/27] PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on mws_size[] access
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-12-01 18:59 ` Frank Li
2025-11-29 16:03 ` [RFC PATCH v2 02/27] PCI: endpoint: pci-epf-vntb: Add mwN_offset configfs attributes Koichiro Den
` (26 subsequent siblings)
27 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Follow common kernel idioms for indices derived from configfs attributes
and suppress Smatch warnings:
epf_ntb_mw1_show() warn: potential spectre issue 'ntb->mws_size' [r]
epf_ntb_mw1_store() warn: potential spectre issue 'ntb->mws_size' [w]
Also fix the error message for out-of-range MW indices.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/pci/endpoint/functions/pci-epf-vntb.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/drivers/pci/endpoint/functions/pci-epf-vntb.c b/drivers/pci/endpoint/functions/pci-epf-vntb.c
index 3ecc5059f92b..6c4c78915970 100644
--- a/drivers/pci/endpoint/functions/pci-epf-vntb.c
+++ b/drivers/pci/endpoint/functions/pci-epf-vntb.c
@@ -995,17 +995,18 @@ static ssize_t epf_ntb_##_name##_show(struct config_item *item, \
struct config_group *group = to_config_group(item); \
struct epf_ntb *ntb = to_epf_ntb(group); \
struct device *dev = &ntb->epf->dev; \
- int win_no; \
+ int win_no, idx; \
\
if (sscanf(#_name, "mw%d", &win_no) != 1) \
return -EINVAL; \
\
if (win_no <= 0 || win_no > ntb->num_mws) { \
- dev_err(dev, "Invalid num_nws: %d value\n", ntb->num_mws); \
+ dev_err(dev, "MW%d out of range (num_mws=%d)\n", \
+ win_no, ntb->num_mws); \
return -EINVAL; \
} \
- \
- return sprintf(page, "%lld\n", ntb->mws_size[win_no - 1]); \
+ idx = array_index_nospec(win_no - 1, ntb->num_mws); \
+ return sprintf(page, "%lld\n", ntb->mws_size[idx]); \
}
#define EPF_NTB_MW_W(_name) \
@@ -1015,7 +1016,7 @@ static ssize_t epf_ntb_##_name##_store(struct config_item *item, \
struct config_group *group = to_config_group(item); \
struct epf_ntb *ntb = to_epf_ntb(group); \
struct device *dev = &ntb->epf->dev; \
- int win_no; \
+ int win_no, idx; \
u64 val; \
int ret; \
\
@@ -1027,11 +1028,13 @@ static ssize_t epf_ntb_##_name##_store(struct config_item *item, \
return -EINVAL; \
\
if (win_no <= 0 || win_no > ntb->num_mws) { \
- dev_err(dev, "Invalid num_nws: %d value\n", ntb->num_mws); \
+ dev_err(dev, "MW%d out of range (num_mws=%d)\n", \
+ win_no, ntb->num_mws); \
return -EINVAL; \
} \
\
- ntb->mws_size[win_no - 1] = val; \
+ idx = array_index_nospec(win_no - 1, ntb->num_mws); \
+ ntb->mws_size[idx] = val; \
\
return len; \
}
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 02/27] PCI: endpoint: pci-epf-vntb: Add mwN_offset configfs attributes
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 01/27] PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on mws_size[] access Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-12-01 19:11 ` Frank Li
2025-11-29 16:03 ` [RFC PATCH v2 03/27] NTB: epf: Handle mwN_offset for inbound MW regions Koichiro Den
` (25 subsequent siblings)
27 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Introduce new mwN_offset configfs attributes to specify memory window
offsets. This enables mapping multiple windows into a single BAR at
arbitrary offsets, improving layout flexibility.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/pci/endpoint/functions/pci-epf-vntb.c | 133 ++++++++++++++++--
1 file changed, 120 insertions(+), 13 deletions(-)
diff --git a/drivers/pci/endpoint/functions/pci-epf-vntb.c b/drivers/pci/endpoint/functions/pci-epf-vntb.c
index 6c4c78915970..1ff414703566 100644
--- a/drivers/pci/endpoint/functions/pci-epf-vntb.c
+++ b/drivers/pci/endpoint/functions/pci-epf-vntb.c
@@ -39,6 +39,7 @@
#include <linux/atomic.h>
#include <linux/delay.h>
#include <linux/io.h>
+#include <linux/log2.h>
#include <linux/module.h>
#include <linux/slab.h>
@@ -111,7 +112,8 @@ struct epf_ntb_ctrl {
u64 addr;
u64 size;
u32 num_mws;
- u32 reserved;
+ u32 mw_offset[MAX_MW];
+ u32 mw_size[MAX_MW];
u32 spad_offset;
u32 spad_count;
u32 db_entry_size;
@@ -128,6 +130,7 @@ struct epf_ntb {
u32 db_count;
u32 spad_count;
u64 mws_size[MAX_MW];
+ u64 mws_offset[MAX_MW];
atomic64_t db;
u32 vbus_number;
u16 vntb_pid;
@@ -458,6 +461,8 @@ static int epf_ntb_config_spad_bar_alloc(struct epf_ntb *ntb)
ctrl->spad_count = spad_count;
ctrl->num_mws = ntb->num_mws;
+ memset(ctrl->mw_offset, 0, sizeof(ctrl->mw_offset));
+ memset(ctrl->mw_size, 0, sizeof(ctrl->mw_size));
ntb->spad_size = spad_size;
ctrl->db_entry_size = sizeof(u32);
@@ -689,15 +694,31 @@ static void epf_ntb_db_bar_clear(struct epf_ntb *ntb)
*/
static int epf_ntb_mw_bar_init(struct epf_ntb *ntb)
{
+ struct device *dev = &ntb->epf->dev;
+ u64 bar_ends[BAR_5 + 1] = { 0 };
+ unsigned long bars_used = 0;
+ enum pci_barno barno;
+ u64 off, size, end;
int ret = 0;
int i;
- u64 size;
- enum pci_barno barno;
- struct device *dev = &ntb->epf->dev;
for (i = 0; i < ntb->num_mws; i++) {
- size = ntb->mws_size[i];
barno = ntb->epf_ntb_bar[BAR_MW1 + i];
+ off = ntb->mws_offset[i];
+ size = ntb->mws_size[i];
+ end = off + size;
+ if (end > bar_ends[barno])
+ bar_ends[barno] = end;
+ bars_used |= BIT(barno);
+ }
+
+ for (barno = BAR_0; barno <= BAR_5; barno++) {
+ if (!(bars_used & BIT(barno)))
+ continue;
+ if (bar_ends[barno] < SZ_4K)
+ size = SZ_4K;
+ else
+ size = roundup_pow_of_two(bar_ends[barno]);
ntb->epf->bar[barno].barno = barno;
ntb->epf->bar[barno].size = size;
@@ -713,8 +734,12 @@ static int epf_ntb_mw_bar_init(struct epf_ntb *ntb)
&ntb->epf->bar[barno]);
if (ret) {
dev_err(dev, "MW set failed\n");
- goto err_alloc_mem;
+ goto err_set_bar;
}
+ }
+
+ for (i = 0; i < ntb->num_mws; i++) {
+ size = ntb->mws_size[i];
/* Allocate EPC outbound memory windows to vpci vntb device */
ntb->vpci_mw_addr[i] = pci_epc_mem_alloc_addr(ntb->epf->epc,
@@ -723,19 +748,31 @@ static int epf_ntb_mw_bar_init(struct epf_ntb *ntb)
if (!ntb->vpci_mw_addr[i]) {
ret = -ENOMEM;
dev_err(dev, "Failed to allocate source address\n");
- goto err_set_bar;
+ goto err_alloc_mem;
}
}
+ for (i = 0; i < ntb->num_mws; i++) {
+ ntb->reg->mw_offset[i] = (u32)ntb->mws_offset[i];
+ ntb->reg->mw_size[i] = (u32)ntb->mws_size[i];
+ }
+
return ret;
-err_set_bar:
- pci_epc_clear_bar(ntb->epf->epc,
- ntb->epf->func_no,
- ntb->epf->vfunc_no,
- &ntb->epf->bar[barno]);
err_alloc_mem:
- epf_ntb_mw_bar_clear(ntb, i);
+ while (--i >= 0)
+ pci_epc_mem_free_addr(ntb->epf->epc,
+ ntb->vpci_mw_phy[i],
+ ntb->vpci_mw_addr[i],
+ ntb->mws_size[i]);
+err_set_bar:
+ while (--barno >= BAR_0)
+ if (bars_used & BIT(barno))
+ pci_epc_clear_bar(ntb->epf->epc,
+ ntb->epf->func_no,
+ ntb->epf->vfunc_no,
+ &ntb->epf->bar[barno]);
+
return ret;
}
@@ -1039,6 +1076,60 @@ static ssize_t epf_ntb_##_name##_store(struct config_item *item, \
return len; \
}
+#define EPF_NTB_MW_OFF_R(_name) \
+static ssize_t epf_ntb_##_name##_show(struct config_item *item, \
+ char *page) \
+{ \
+ struct config_group *group = to_config_group(item); \
+ struct epf_ntb *ntb = to_epf_ntb(group); \
+ struct device *dev = &ntb->epf->dev; \
+ int win_no, idx; \
+ \
+ if (sscanf(#_name, "mw%d_offset", &win_no) != 1) \
+ return -EINVAL; \
+ \
+ idx = win_no - 1; \
+ if (idx < 0 || idx >= ntb->num_mws) { \
+ dev_err(dev, "MW%d out of range (num_mws=%d)\n", \
+ win_no, ntb->num_mws); \
+ return -EINVAL; \
+ } \
+ \
+ idx = array_index_nospec(idx, ntb->num_mws); \
+ return sprintf(page, "%lld\n", ntb->mws_offset[idx]); \
+}
+
+#define EPF_NTB_MW_OFF_W(_name) \
+static ssize_t epf_ntb_##_name##_store(struct config_item *item, \
+ const char *page, size_t len) \
+{ \
+ struct config_group *group = to_config_group(item); \
+ struct epf_ntb *ntb = to_epf_ntb(group); \
+ struct device *dev = &ntb->epf->dev; \
+ int win_no, idx; \
+ u64 val; \
+ int ret; \
+ \
+ ret = kstrtou64(page, 0, &val); \
+ if (ret) \
+ return ret; \
+ \
+ if (sscanf(#_name, "mw%d_offset", &win_no) != 1) \
+ return -EINVAL; \
+ \
+ idx = win_no - 1; \
+ if (idx < 0 || idx >= ntb->num_mws) { \
+ dev_err(dev, "MW%d out of range (num_mws=%d)\n", \
+ win_no, ntb->num_mws); \
+ return -EINVAL; \
+ } \
+ \
+ idx = array_index_nospec(idx, ntb->num_mws); \
+ ntb->mws_offset[idx] = val; \
+ \
+ return len; \
+}
+
#define EPF_NTB_BAR_R(_name, _id) \
static ssize_t epf_ntb_##_name##_show(struct config_item *item, \
char *page) \
@@ -1109,6 +1200,14 @@ EPF_NTB_MW_R(mw3)
EPF_NTB_MW_W(mw3)
EPF_NTB_MW_R(mw4)
EPF_NTB_MW_W(mw4)
+EPF_NTB_MW_OFF_R(mw1_offset)
+EPF_NTB_MW_OFF_W(mw1_offset)
+EPF_NTB_MW_OFF_R(mw2_offset)
+EPF_NTB_MW_OFF_W(mw2_offset)
+EPF_NTB_MW_OFF_R(mw3_offset)
+EPF_NTB_MW_OFF_W(mw3_offset)
+EPF_NTB_MW_OFF_R(mw4_offset)
+EPF_NTB_MW_OFF_W(mw4_offset)
EPF_NTB_BAR_R(ctrl_bar, BAR_CONFIG)
EPF_NTB_BAR_W(ctrl_bar, BAR_CONFIG)
EPF_NTB_BAR_R(db_bar, BAR_DB)
@@ -1129,6 +1228,10 @@ CONFIGFS_ATTR(epf_ntb_, mw1);
CONFIGFS_ATTR(epf_ntb_, mw2);
CONFIGFS_ATTR(epf_ntb_, mw3);
CONFIGFS_ATTR(epf_ntb_, mw4);
+CONFIGFS_ATTR(epf_ntb_, mw1_offset);
+CONFIGFS_ATTR(epf_ntb_, mw2_offset);
+CONFIGFS_ATTR(epf_ntb_, mw3_offset);
+CONFIGFS_ATTR(epf_ntb_, mw4_offset);
CONFIGFS_ATTR(epf_ntb_, vbus_number);
CONFIGFS_ATTR(epf_ntb_, vntb_pid);
CONFIGFS_ATTR(epf_ntb_, vntb_vid);
@@ -1147,6 +1250,10 @@ static struct configfs_attribute *epf_ntb_attrs[] = {
&epf_ntb_attr_mw2,
&epf_ntb_attr_mw3,
&epf_ntb_attr_mw4,
+ &epf_ntb_attr_mw1_offset,
+ &epf_ntb_attr_mw2_offset,
+ &epf_ntb_attr_mw3_offset,
+ &epf_ntb_attr_mw4_offset,
&epf_ntb_attr_vbus_number,
&epf_ntb_attr_vntb_pid,
&epf_ntb_attr_vntb_vid,
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 03/27] NTB: epf: Handle mwN_offset for inbound MW regions
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 01/27] PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on mws_size[] access Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 02/27] PCI: endpoint: pci-epf-vntb: Add mwN_offset configfs attributes Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-12-01 19:14 ` Frank Li
2025-11-29 16:03 ` [RFC PATCH v2 04/27] PCI: endpoint: Add inbound mapping ops to EPC core Koichiro Den
` (24 subsequent siblings)
27 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Add and use new fields in the common control register block to convey
both the offset and the size for each memory window (MW), so that the
host side can correctly handle flexible MW layouts and support partial
BAR mappings.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/ntb/hw/epf/ntb_hw_epf.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)
diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
index d3ecf25a5162..91d3f8e05807 100644
--- a/drivers/ntb/hw/epf/ntb_hw_epf.c
+++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
@@ -36,12 +36,13 @@
#define NTB_EPF_LOWER_SIZE 0x18
#define NTB_EPF_UPPER_SIZE 0x1C
#define NTB_EPF_MW_COUNT 0x20
-#define NTB_EPF_MW1_OFFSET 0x24
-#define NTB_EPF_SPAD_OFFSET 0x28
-#define NTB_EPF_SPAD_COUNT 0x2C
-#define NTB_EPF_DB_ENTRY_SIZE 0x30
-#define NTB_EPF_DB_DATA(n) (0x34 + (n) * 4)
-#define NTB_EPF_DB_OFFSET(n) (0xB4 + (n) * 4)
+#define NTB_EPF_MW_OFFSET(n) (0x24 + (n) * 4)
+#define NTB_EPF_MW_SIZE(n) (0x34 + (n) * 4)
+#define NTB_EPF_SPAD_OFFSET 0x44
+#define NTB_EPF_SPAD_COUNT 0x48
+#define NTB_EPF_DB_ENTRY_SIZE 0x4C
+#define NTB_EPF_DB_DATA(n) (0x50 + (n) * 4)
+#define NTB_EPF_DB_OFFSET(n) (0xD0 + (n) * 4)
#define NTB_EPF_MIN_DB_COUNT 3
#define NTB_EPF_MAX_DB_COUNT 31
@@ -451,11 +452,12 @@ static int ntb_epf_peer_mw_get_addr(struct ntb_dev *ntb, int idx,
phys_addr_t *base, resource_size_t *size)
{
struct ntb_epf_dev *ndev = ntb_ndev(ntb);
- u32 offset = 0;
+ resource_size_t bar_sz;
+ u32 offset, sz;
int bar;
- if (idx == 0)
- offset = readl(ndev->ctrl_reg + NTB_EPF_MW1_OFFSET);
+ offset = readl(ndev->ctrl_reg + NTB_EPF_MW_OFFSET(idx));
+ sz = readl(ndev->ctrl_reg + NTB_EPF_MW_SIZE(idx));
bar = ntb_epf_mw_to_bar(ndev, idx);
if (bar < 0)
@@ -464,8 +466,11 @@ static int ntb_epf_peer_mw_get_addr(struct ntb_dev *ntb, int idx,
if (base)
*base = pci_resource_start(ndev->ntb.pdev, bar) + offset;
- if (size)
- *size = pci_resource_len(ndev->ntb.pdev, bar) - offset;
+ if (size) {
+ bar_sz = pci_resource_len(ndev->ntb.pdev, bar);
+ *size = sz ? min_t(resource_size_t, sz, bar_sz - offset)
+ : (bar_sz > offset ? bar_sz - offset : 0);
+ }
return 0;
}
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 04/27] PCI: endpoint: Add inbound mapping ops to EPC core
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (2 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 03/27] NTB: epf: Handle mwN_offset for inbound MW regions Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-12-01 19:19 ` Frank Li
2025-11-29 16:03 ` [RFC PATCH v2 05/27] PCI: dwc: ep: Implement EPC inbound mapping support Koichiro Den
` (23 subsequent siblings)
27 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Add new EPC ops map_inbound() and unmap_inbound() for mapping a subrange
of a BAR into CPU space. These will be implemented by controller drivers
such as DesignWare.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/pci/endpoint/pci-epc-core.c | 44 +++++++++++++++++++++++++++++
include/linux/pci-epc.h | 11 ++++++++
2 files changed, 55 insertions(+)
diff --git a/drivers/pci/endpoint/pci-epc-core.c b/drivers/pci/endpoint/pci-epc-core.c
index ca7f19cc973a..825109e54ba9 100644
--- a/drivers/pci/endpoint/pci-epc-core.c
+++ b/drivers/pci/endpoint/pci-epc-core.c
@@ -444,6 +444,50 @@ int pci_epc_map_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
}
EXPORT_SYMBOL_GPL(pci_epc_map_addr);
+/**
+ * pci_epc_map_inbound() - map a BAR subrange to the local CPU address
+ * @epc: the EPC device on which BAR has to be configured
+ * @func_no: the physical endpoint function number in the EPC device
+ * @vfunc_no: the virtual endpoint function number in the physical function
+ * @epf_bar: the struct epf_bar that contains the BAR information
+ * @offset: byte offset from the BAR base selected by the host
+ *
+ * Invoke to configure the BAR of the endpoint device and map a subrange
+ * selected by @offset to a CPU address.
+ *
+ * Returns 0 on success, -EOPNOTSUPP if unsupported, or a negative errno.
+ */
+int pci_epc_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
+ struct pci_epf_bar *epf_bar, u64 offset)
+{
+ if (!epc || !epc->ops || !epc->ops->map_inbound)
+ return -EOPNOTSUPP;
+
+ return epc->ops->map_inbound(epc, func_no, vfunc_no, epf_bar, offset);
+}
+EXPORT_SYMBOL_GPL(pci_epc_map_inbound);
+
+/**
+ * pci_epc_unmap_inbound() - unmap a previously mapped BAR subrange
+ * @epc: the EPC device on which the inbound mapping was programmed
+ * @func_no: the physical endpoint function number in the EPC device
+ * @vfunc_no: the virtual endpoint function number in the physical function
+ * @epf_bar: the struct epf_bar used when the mapping was created
+ * @offset: byte offset from the BAR base that was mapped
+ *
+ * Invoke to remove a BAR subrange mapping created by pci_epc_map_inbound().
+ * If the controller has no support, this call is a no-op.
+ */
+void pci_epc_unmap_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
+ struct pci_epf_bar *epf_bar, u64 offset)
+{
+ if (!epc || !epc->ops || !epc->ops->unmap_inbound)
+ return;
+
+ epc->ops->unmap_inbound(epc, func_no, vfunc_no, epf_bar, offset);
+}
+EXPORT_SYMBOL_GPL(pci_epc_unmap_inbound);
+
/**
* pci_epc_mem_map() - allocate and map a PCI address to a CPU address
* @epc: the EPC device on which the CPU address is to be allocated and mapped
diff --git a/include/linux/pci-epc.h b/include/linux/pci-epc.h
index 4286bfdbfdfa..a5fb91cc2982 100644
--- a/include/linux/pci-epc.h
+++ b/include/linux/pci-epc.h
@@ -71,6 +71,8 @@ struct pci_epc_map {
* region
* @map_addr: ops to map CPU address to PCI address
* @unmap_addr: ops to unmap CPU address and PCI address
+ * @map_inbound: ops to map a subrange inside a BAR to CPU address.
+ * @unmap_inbound: ops to unmap a subrange inside a BAR and CPU address.
* @set_msi: ops to set the requested number of MSI interrupts in the MSI
* capability register
* @get_msi: ops to get the number of MSI interrupts allocated by the RC from
@@ -99,6 +101,10 @@ struct pci_epc_ops {
phys_addr_t addr, u64 pci_addr, size_t size);
void (*unmap_addr)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
phys_addr_t addr);
+ int (*map_inbound)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
+ struct pci_epf_bar *epf_bar, u64 offset);
+ void (*unmap_inbound)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
+ struct pci_epf_bar *epf_bar, u64 offset);
int (*set_msi)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
u8 nr_irqs);
int (*get_msi)(struct pci_epc *epc, u8 func_no, u8 vfunc_no);
@@ -286,6 +292,11 @@ int pci_epc_map_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
u64 pci_addr, size_t size);
void pci_epc_unmap_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
phys_addr_t phys_addr);
+
+int pci_epc_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
+ struct pci_epf_bar *epf_bar, u64 offset);
+void pci_epc_unmap_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
+ struct pci_epf_bar *epf_bar, u64 offset);
int pci_epc_set_msi(struct pci_epc *epc, u8 func_no, u8 vfunc_no, u8 nr_irqs);
int pci_epc_get_msi(struct pci_epc *epc, u8 func_no, u8 vfunc_no);
int pci_epc_set_msix(struct pci_epc *epc, u8 func_no, u8 vfunc_no, u16 nr_irqs,
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 05/27] PCI: dwc: ep: Implement EPC inbound mapping support
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (3 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 04/27] PCI: endpoint: Add inbound mapping ops to EPC core Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-12-01 19:32 ` Frank Li
2025-11-29 16:03 ` [RFC PATCH v2 06/27] PCI: endpoint: pci-epf-vntb: Use pci_epc_map_inbound() for MW mapping Koichiro Den
` (22 subsequent siblings)
27 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Implement map_inbound() and unmap_inbound() for DesignWare endpoint
controllers (Address Match mode). This allows subrange mappings within a
BAR, enabling advanced endpoint functions such as NTB with offset-based
windows.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
.../pci/controller/dwc/pcie-designware-ep.c | 239 +++++++++++++++---
drivers/pci/controller/dwc/pcie-designware.h | 2 +
2 files changed, 212 insertions(+), 29 deletions(-)
diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
index 19571ac2b961..3780a9bd6f79 100644
--- a/drivers/pci/controller/dwc/pcie-designware-ep.c
+++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
@@ -8,13 +8,25 @@
#include <linux/align.h>
#include <linux/bitfield.h>
+#include <linux/list.h>
#include <linux/of.h>
+#include <linux/pci_regs.h>
#include <linux/platform_device.h>
+#include <linux/spinlock.h>
#include "pcie-designware.h"
#include <linux/pci-epc.h>
#include <linux/pci-epf.h>
+struct dw_pcie_ib_map {
+ struct list_head node;
+ enum pci_barno bar;
+ u64 pci_addr; /* BAR base + offset at map time */
+ phys_addr_t cpu_addr; /* EP local phys */
+ u64 size;
+ u32 index; /* iATU inbound window index */
+};
+
/**
* dw_pcie_ep_get_func_from_ep - Get the struct dw_pcie_ep_func corresponding to
* the endpoint function
@@ -205,6 +217,7 @@ static void dw_pcie_ep_clear_bar(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
enum pci_barno bar = epf_bar->barno;
u32 atu_index = ep->bar_to_atu[bar] - 1;
+ struct dw_pcie_ib_map *m, *tmp;
if (!ep->bar_to_atu[bar])
return;
@@ -215,6 +228,16 @@ static void dw_pcie_ep_clear_bar(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
clear_bit(atu_index, ep->ib_window_map);
ep->epf_bar[bar] = NULL;
ep->bar_to_atu[bar] = 0;
+
+ guard(spinlock_irqsave)(&ep->ib_map_lock);
+ list_for_each_entry_safe(m, tmp, &ep->ib_map_list, node) {
+ if (m->bar != bar)
+ continue;
+ dw_pcie_disable_atu(pci, PCIE_ATU_REGION_DIR_IB, m->index);
+ clear_bit(m->index, ep->ib_window_map);
+ list_del(&m->node);
+ kfree(m);
+ }
}
static unsigned int dw_pcie_ep_get_rebar_offset(struct dw_pcie *pci,
@@ -336,14 +359,46 @@ static enum pci_epc_bar_type dw_pcie_ep_get_bar_type(struct dw_pcie_ep *ep,
return epc_features->bar[bar].type;
}
+static int dw_pcie_ep_set_bar_init(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
+ struct pci_epf_bar *epf_bar)
+{
+ struct dw_pcie_ep *ep = epc_get_drvdata(epc);
+ struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
+ enum pci_barno bar = epf_bar->barno;
+ enum pci_epc_bar_type bar_type;
+ int ret;
+
+ bar_type = dw_pcie_ep_get_bar_type(ep, bar);
+ switch (bar_type) {
+ case BAR_FIXED:
+ /*
+ * There is no need to write a BAR mask for a fixed BAR (except
+ * to write 1 to the LSB of the BAR mask register, to enable the
+ * BAR). Write the BAR mask regardless. (The fixed bits in the
+ * BAR mask register will be read-only anyway.)
+ */
+ fallthrough;
+ case BAR_PROGRAMMABLE:
+ ret = dw_pcie_ep_set_bar_programmable(ep, func_no, epf_bar);
+ break;
+ case BAR_RESIZABLE:
+ ret = dw_pcie_ep_set_bar_resizable(ep, func_no, epf_bar);
+ break;
+ default:
+ ret = -EINVAL;
+ dev_err(pci->dev, "Invalid BAR type\n");
+ break;
+ }
+
+ return ret;
+}
+
static int dw_pcie_ep_set_bar(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
struct pci_epf_bar *epf_bar)
{
struct dw_pcie_ep *ep = epc_get_drvdata(epc);
- struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
enum pci_barno bar = epf_bar->barno;
size_t size = epf_bar->size;
- enum pci_epc_bar_type bar_type;
int flags = epf_bar->flags;
int ret, type;
@@ -374,35 +429,12 @@ static int dw_pcie_ep_set_bar(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
* When dynamically changing a BAR, skip writing the BAR reg, as
* that would clear the BAR's PCI address assigned by the host.
*/
- goto config_atu;
- }
-
- bar_type = dw_pcie_ep_get_bar_type(ep, bar);
- switch (bar_type) {
- case BAR_FIXED:
- /*
- * There is no need to write a BAR mask for a fixed BAR (except
- * to write 1 to the LSB of the BAR mask register, to enable the
- * BAR). Write the BAR mask regardless. (The fixed bits in the
- * BAR mask register will be read-only anyway.)
- */
- fallthrough;
- case BAR_PROGRAMMABLE:
- ret = dw_pcie_ep_set_bar_programmable(ep, func_no, epf_bar);
- break;
- case BAR_RESIZABLE:
- ret = dw_pcie_ep_set_bar_resizable(ep, func_no, epf_bar);
- break;
- default:
- ret = -EINVAL;
- dev_err(pci->dev, "Invalid BAR type\n");
- break;
+ } else {
+ ret = dw_pcie_ep_set_bar_init(epc, func_no, vfunc_no, epf_bar);
+ if (ret)
+ return ret;
}
- if (ret)
- return ret;
-
-config_atu:
if (!(flags & PCI_BASE_ADDRESS_SPACE))
type = PCIE_ATU_TYPE_MEM;
else
@@ -488,6 +520,151 @@ static int dw_pcie_ep_map_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
return 0;
}
+static inline u64 dw_pcie_ep_read_bar_assigned(struct dw_pcie_ep *ep, u8 func_no,
+ enum pci_barno bar, bool is_io,
+ bool is_64)
+{
+ u32 reg = PCI_BASE_ADDRESS_0 + 4 * bar;
+ u32 lo, hi = 0;
+ u64 base;
+
+ lo = dw_pcie_ep_readl_dbi(ep, func_no, reg);
+ if (is_io)
+ base = lo & PCI_BASE_ADDRESS_IO_MASK;
+ else {
+ base = lo & PCI_BASE_ADDRESS_MEM_MASK;
+ if (is_64) {
+ hi = dw_pcie_ep_readl_dbi(ep, func_no, reg + 4);
+ base |= ((u64)hi) << 32;
+ }
+ }
+ return base;
+}
+
+static int dw_pcie_ep_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
+ struct pci_epf_bar *epf_bar, u64 offset)
+{
+ struct dw_pcie_ep *ep = epc_get_drvdata(epc);
+ struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
+ enum pci_barno bar = epf_bar->barno;
+ size_t size = epf_bar->size;
+ int flags = epf_bar->flags;
+ struct dw_pcie_ib_map *m;
+ u64 base, pci_addr;
+ int ret, type, win;
+
+ /*
+ * DWC does not allow BAR pairs to overlap, e.g. you cannot combine BARs
+ * 1 and 2 to form a 64-bit BAR.
+ */
+ if ((flags & PCI_BASE_ADDRESS_MEM_TYPE_64) && (bar & 1))
+ return -EINVAL;
+
+ /*
+ * Certain EPF drivers dynamically change the physical address of a BAR
+ * (i.e. they call set_bar() twice, without ever calling clear_bar(), as
+ * calling clear_bar() would clear the BAR's PCI address assigned by the
+ * host).
+ */
+ if (epf_bar->phys_addr && ep->epf_bar[bar]) {
+ /*
+ * We can only dynamically add a whole or partial mapping if the
+ * BAR flags do not differ from the existing configuration.
+ */
+ if (ep->epf_bar[bar]->barno != bar ||
+ ep->epf_bar[bar]->flags != flags)
+ return -EINVAL;
+
+ /*
+ * When dynamically changing a BAR, skip writing the BAR reg, as
+ * that would clear the BAR's PCI address assigned by the host.
+ */
+ }
+
+ /*
+ * Skip programming the inbound translation if phys_addr is 0.
+ * In this case, the caller only intends to initialize the BAR.
+ */
+ if (!epf_bar->phys_addr) {
+ ret = dw_pcie_ep_set_bar_init(epc, func_no, vfunc_no, epf_bar);
+ ep->epf_bar[bar] = epf_bar;
+ return ret;
+ }
+
+ base = dw_pcie_ep_read_bar_assigned(ep, func_no, bar,
+ flags & PCI_BASE_ADDRESS_SPACE,
+ flags & PCI_BASE_ADDRESS_MEM_TYPE_64);
+ if (!(flags & PCI_BASE_ADDRESS_SPACE))
+ type = PCIE_ATU_TYPE_MEM;
+ else
+ type = PCIE_ATU_TYPE_IO;
+ pci_addr = base + offset;
+
+ /* Allocate an inbound iATU window */
+ win = find_first_zero_bit(ep->ib_window_map, pci->num_ib_windows);
+ if (win >= pci->num_ib_windows)
+ return -ENOSPC;
+
+ /* Program address-match inbound iATU */
+ ret = dw_pcie_prog_inbound_atu(pci, win, type,
+ epf_bar->phys_addr - pci->parent_bus_offset,
+ pci_addr, size);
+ if (ret)
+ return ret;
+
+ m = kzalloc(sizeof(*m), GFP_KERNEL);
+ if (!m) {
+ dw_pcie_disable_atu(pci, PCIE_ATU_REGION_DIR_IB, win);
+ return -ENOMEM;
+ }
+ m->bar = bar;
+ m->pci_addr = pci_addr;
+ m->cpu_addr = epf_bar->phys_addr;
+ m->size = size;
+ m->index = win;
+
+ guard(spinlock_irqsave)(&ep->ib_map_lock);
+ set_bit(win, ep->ib_window_map);
+ list_add(&m->node, &ep->ib_map_list);
+
+ return 0;
+}
+
+static void dw_pcie_ep_unmap_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
+ struct pci_epf_bar *epf_bar, u64 offset)
+{
+ struct dw_pcie_ep *ep = epc_get_drvdata(epc);
+ struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
+ enum pci_barno bar = epf_bar->barno;
+ struct dw_pcie_ib_map *m, *tmp;
+ size_t size = epf_bar->size;
+ int flags = epf_bar->flags;
+ u64 match_pci = 0;
+ u64 base;
+
+ /* If BAR base isn't assigned, there can't be any programmed sub-window */
+ base = dw_pcie_ep_read_bar_assigned(ep, func_no, bar,
+ flags & PCI_BASE_ADDRESS_SPACE,
+ flags & PCI_BASE_ADDRESS_MEM_TYPE_64);
+ if (base)
+ match_pci = base + offset;
+
+ guard(spinlock_irqsave)(&ep->ib_map_lock);
+ list_for_each_entry_safe(m, tmp, &ep->ib_map_list, node) {
+ if (m->bar != bar)
+ continue;
+ if (match_pci && m->pci_addr != match_pci)
+ continue;
+ if (size && m->size != size)
+ /* Partial unmap is unsupported for now */
+ continue;
+ dw_pcie_disable_atu(pci, PCIE_ATU_REGION_DIR_IB, m->index);
+ clear_bit(m->index, ep->ib_window_map);
+ list_del(&m->node);
+ kfree(m);
+ }
+}
+
static int dw_pcie_ep_get_msi(struct pci_epc *epc, u8 func_no, u8 vfunc_no)
{
struct dw_pcie_ep *ep = epc_get_drvdata(epc);
@@ -630,6 +807,8 @@ static const struct pci_epc_ops epc_ops = {
.align_addr = dw_pcie_ep_align_addr,
.map_addr = dw_pcie_ep_map_addr,
.unmap_addr = dw_pcie_ep_unmap_addr,
+ .map_inbound = dw_pcie_ep_map_inbound,
+ .unmap_inbound = dw_pcie_ep_unmap_inbound,
.set_msi = dw_pcie_ep_set_msi,
.get_msi = dw_pcie_ep_get_msi,
.set_msix = dw_pcie_ep_set_msix,
@@ -1087,6 +1266,8 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
struct device *dev = pci->dev;
INIT_LIST_HEAD(&ep->func_list);
+ INIT_LIST_HEAD(&ep->ib_map_list);
+ spin_lock_init(&ep->ib_map_lock);
epc = devm_pci_epc_create(dev, &epc_ops);
if (IS_ERR(epc)) {
diff --git a/drivers/pci/controller/dwc/pcie-designware.h b/drivers/pci/controller/dwc/pcie-designware.h
index 31685951a080..269a9fe0501f 100644
--- a/drivers/pci/controller/dwc/pcie-designware.h
+++ b/drivers/pci/controller/dwc/pcie-designware.h
@@ -476,6 +476,8 @@ struct dw_pcie_ep {
phys_addr_t *outbound_addr;
unsigned long *ib_window_map;
unsigned long *ob_window_map;
+ struct list_head ib_map_list;
+ spinlock_t ib_map_lock;
void __iomem *msi_mem;
phys_addr_t msi_mem_phys;
struct pci_epf_bar *epf_bar[PCI_STD_NUM_BARS];
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 06/27] PCI: endpoint: pci-epf-vntb: Use pci_epc_map_inbound() for MW mapping
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (4 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 05/27] PCI: dwc: ep: Implement EPC inbound mapping support Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-12-01 19:34 ` Frank Li
2025-11-29 16:03 ` [RFC PATCH v2 07/27] NTB: Add offset parameter to MW translation APIs Koichiro Den
` (21 subsequent siblings)
27 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Switch MW setup to use pci_epc_map_inbound() when supported. This allows
mapping portions of a BAR rather than the entire region, supporting
partial BAR usage on capable controllers.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/pci/endpoint/functions/pci-epf-vntb.c | 21 ++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)
diff --git a/drivers/pci/endpoint/functions/pci-epf-vntb.c b/drivers/pci/endpoint/functions/pci-epf-vntb.c
index 1ff414703566..42e57721dcb4 100644
--- a/drivers/pci/endpoint/functions/pci-epf-vntb.c
+++ b/drivers/pci/endpoint/functions/pci-epf-vntb.c
@@ -728,10 +728,15 @@ static int epf_ntb_mw_bar_init(struct epf_ntb *ntb)
PCI_BASE_ADDRESS_MEM_TYPE_64 :
PCI_BASE_ADDRESS_MEM_TYPE_32;
- ret = pci_epc_set_bar(ntb->epf->epc,
- ntb->epf->func_no,
- ntb->epf->vfunc_no,
- &ntb->epf->bar[barno]);
+ ret = pci_epc_map_inbound(ntb->epf->epc,
+ ntb->epf->func_no,
+ ntb->epf->vfunc_no,
+ &ntb->epf->bar[barno], 0);
+ if (ret == -EOPNOTSUPP)
+ ret = pci_epc_set_bar(ntb->epf->epc,
+ ntb->epf->func_no,
+ ntb->epf->vfunc_no,
+ &ntb->epf->bar[barno]);
if (ret) {
dev_err(dev, "MW set failed\n");
goto err_set_bar;
@@ -1385,17 +1390,23 @@ static int vntb_epf_mw_set_trans(struct ntb_dev *ndev, int pidx, int idx,
struct epf_ntb *ntb = ntb_ndev(ndev);
struct pci_epf_bar *epf_bar;
enum pci_barno barno;
+ struct pci_epc *epc;
int ret;
struct device *dev;
+ epc = ntb->epf->epc;
dev = &ntb->ntb.dev;
barno = ntb->epf_ntb_bar[BAR_MW1 + idx];
+
epf_bar = &ntb->epf->bar[barno];
epf_bar->phys_addr = addr;
epf_bar->barno = barno;
epf_bar->size = size;
- ret = pci_epc_set_bar(ntb->epf->epc, 0, 0, epf_bar);
+ ret = pci_epc_map_inbound(epc, 0, 0, epf_bar, 0);
+ if (ret == -EOPNOTSUPP)
+ ret = pci_epc_set_bar(epc, 0, 0, epf_bar);
+
if (ret) {
dev_err(dev, "failure set mw trans\n");
return ret;
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 07/27] NTB: Add offset parameter to MW translation APIs
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (5 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 06/27] PCI: endpoint: pci-epf-vntb: Use pci_epc_map_inbound() for MW mapping Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 08/27] PCI: endpoint: pci-epf-vntb: Propagate MW offset from configfs when present Koichiro Den
` (20 subsequent siblings)
27 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Extend ntb_mw_set_trans() and ntb_mw_get_align() with an offset
argument. This supports subrange mapping inside a BAR for platforms that
require offset-based translations.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/ntb/hw/amd/ntb_hw_amd.c | 6 ++++--
drivers/ntb/hw/epf/ntb_hw_epf.c | 6 ++++--
drivers/ntb/hw/idt/ntb_hw_idt.c | 3 ++-
drivers/ntb/hw/intel/ntb_hw_gen1.c | 6 ++++--
drivers/ntb/hw/intel/ntb_hw_gen1.h | 2 +-
drivers/ntb/hw/intel/ntb_hw_gen3.c | 3 ++-
drivers/ntb/hw/intel/ntb_hw_gen4.c | 6 ++++--
drivers/ntb/hw/mscc/ntb_hw_switchtec.c | 6 ++++--
drivers/ntb/msi.c | 6 +++---
drivers/ntb/ntb_transport.c | 4 ++--
drivers/ntb/test/ntb_perf.c | 4 ++--
drivers/ntb/test/ntb_tool.c | 6 +++---
drivers/pci/endpoint/functions/pci-epf-vntb.c | 7 ++++---
include/linux/ntb.h | 18 +++++++++++-------
14 files changed, 50 insertions(+), 33 deletions(-)
diff --git a/drivers/ntb/hw/amd/ntb_hw_amd.c b/drivers/ntb/hw/amd/ntb_hw_amd.c
index 1a163596ddf5..c0137df413c4 100644
--- a/drivers/ntb/hw/amd/ntb_hw_amd.c
+++ b/drivers/ntb/hw/amd/ntb_hw_amd.c
@@ -92,7 +92,8 @@ static int amd_ntb_mw_count(struct ntb_dev *ntb, int pidx)
static int amd_ntb_mw_get_align(struct ntb_dev *ntb, int pidx, int idx,
resource_size_t *addr_align,
resource_size_t *size_align,
- resource_size_t *size_max)
+ resource_size_t *size_max,
+ resource_size_t *offset)
{
struct amd_ntb_dev *ndev = ntb_ndev(ntb);
int bar;
@@ -117,7 +118,8 @@ static int amd_ntb_mw_get_align(struct ntb_dev *ntb, int pidx, int idx,
}
static int amd_ntb_mw_set_trans(struct ntb_dev *ntb, int pidx, int idx,
- dma_addr_t addr, resource_size_t size)
+ dma_addr_t addr, resource_size_t size,
+ resource_size_t offset)
{
struct amd_ntb_dev *ndev = ntb_ndev(ntb);
unsigned long xlat_reg, limit_reg = 0;
diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
index 91d3f8e05807..a3ec411bfe49 100644
--- a/drivers/ntb/hw/epf/ntb_hw_epf.c
+++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
@@ -164,7 +164,8 @@ static int ntb_epf_mw_count(struct ntb_dev *ntb, int pidx)
static int ntb_epf_mw_get_align(struct ntb_dev *ntb, int pidx, int idx,
resource_size_t *addr_align,
resource_size_t *size_align,
- resource_size_t *size_max)
+ resource_size_t *size_max,
+ resource_size_t *offset)
{
struct ntb_epf_dev *ndev = ntb_ndev(ntb);
struct device *dev = ndev->dev;
@@ -402,7 +403,8 @@ static int ntb_epf_db_set_mask(struct ntb_dev *ntb, u64 db_bits)
}
static int ntb_epf_mw_set_trans(struct ntb_dev *ntb, int pidx, int idx,
- dma_addr_t addr, resource_size_t size)
+ dma_addr_t addr, resource_size_t size,
+ resource_size_t offset)
{
struct ntb_epf_dev *ndev = ntb_ndev(ntb);
struct device *dev = ndev->dev;
diff --git a/drivers/ntb/hw/idt/ntb_hw_idt.c b/drivers/ntb/hw/idt/ntb_hw_idt.c
index f27df8d7f3b9..8c2cf149b99b 100644
--- a/drivers/ntb/hw/idt/ntb_hw_idt.c
+++ b/drivers/ntb/hw/idt/ntb_hw_idt.c
@@ -1190,7 +1190,8 @@ static int idt_ntb_mw_count(struct ntb_dev *ntb, int pidx)
static int idt_ntb_mw_get_align(struct ntb_dev *ntb, int pidx, int widx,
resource_size_t *addr_align,
resource_size_t *size_align,
- resource_size_t *size_max)
+ resource_size_t *size_max,
+ resource_size_t *offset)
{
struct idt_ntb_dev *ndev = to_ndev_ntb(ntb);
struct idt_ntb_peer *peer;
diff --git a/drivers/ntb/hw/intel/ntb_hw_gen1.c b/drivers/ntb/hw/intel/ntb_hw_gen1.c
index 079b8cd79785..6cbbd6cdf4c0 100644
--- a/drivers/ntb/hw/intel/ntb_hw_gen1.c
+++ b/drivers/ntb/hw/intel/ntb_hw_gen1.c
@@ -804,7 +804,8 @@ int intel_ntb_mw_count(struct ntb_dev *ntb, int pidx)
int intel_ntb_mw_get_align(struct ntb_dev *ntb, int pidx, int idx,
resource_size_t *addr_align,
resource_size_t *size_align,
- resource_size_t *size_max)
+ resource_size_t *size_max,
+ resource_size_t *offset)
{
struct intel_ntb_dev *ndev = ntb_ndev(ntb);
resource_size_t bar_size, mw_size;
@@ -840,7 +841,8 @@ int intel_ntb_mw_get_align(struct ntb_dev *ntb, int pidx, int idx,
}
static int intel_ntb_mw_set_trans(struct ntb_dev *ntb, int pidx, int idx,
- dma_addr_t addr, resource_size_t size)
+ dma_addr_t addr, resource_size_t size,
+ resource_size_t offset)
{
struct intel_ntb_dev *ndev = ntb_ndev(ntb);
unsigned long base_reg, xlat_reg, limit_reg;
diff --git a/drivers/ntb/hw/intel/ntb_hw_gen1.h b/drivers/ntb/hw/intel/ntb_hw_gen1.h
index 344249fc18d1..f9ebd2780b7f 100644
--- a/drivers/ntb/hw/intel/ntb_hw_gen1.h
+++ b/drivers/ntb/hw/intel/ntb_hw_gen1.h
@@ -159,7 +159,7 @@ int ndev_mw_to_bar(struct intel_ntb_dev *ndev, int idx);
int intel_ntb_mw_count(struct ntb_dev *ntb, int pidx);
int intel_ntb_mw_get_align(struct ntb_dev *ntb, int pidx, int idx,
resource_size_t *addr_align, resource_size_t *size_align,
- resource_size_t *size_max);
+ resource_size_t *size_max, resource_size_t *offset);
int intel_ntb_peer_mw_count(struct ntb_dev *ntb);
int intel_ntb_peer_mw_get_addr(struct ntb_dev *ntb, int idx,
phys_addr_t *base, resource_size_t *size);
diff --git a/drivers/ntb/hw/intel/ntb_hw_gen3.c b/drivers/ntb/hw/intel/ntb_hw_gen3.c
index a5aa96a31f4a..98722032ca5d 100644
--- a/drivers/ntb/hw/intel/ntb_hw_gen3.c
+++ b/drivers/ntb/hw/intel/ntb_hw_gen3.c
@@ -444,7 +444,8 @@ int intel_ntb3_link_enable(struct ntb_dev *ntb, enum ntb_speed max_speed,
return 0;
}
static int intel_ntb3_mw_set_trans(struct ntb_dev *ntb, int pidx, int idx,
- dma_addr_t addr, resource_size_t size)
+ dma_addr_t addr, resource_size_t size,
+ resource_size_t offset)
{
struct intel_ntb_dev *ndev = ntb_ndev(ntb);
unsigned long xlat_reg, limit_reg;
diff --git a/drivers/ntb/hw/intel/ntb_hw_gen4.c b/drivers/ntb/hw/intel/ntb_hw_gen4.c
index 22cac7975b3c..8df90ea04c7c 100644
--- a/drivers/ntb/hw/intel/ntb_hw_gen4.c
+++ b/drivers/ntb/hw/intel/ntb_hw_gen4.c
@@ -335,7 +335,8 @@ ssize_t ndev_ntb4_debugfs_read(struct file *filp, char __user *ubuf,
}
static int intel_ntb4_mw_set_trans(struct ntb_dev *ntb, int pidx, int idx,
- dma_addr_t addr, resource_size_t size)
+ dma_addr_t addr, resource_size_t size,
+ resource_size_t offset)
{
struct intel_ntb_dev *ndev = ntb_ndev(ntb);
unsigned long xlat_reg, limit_reg, idx_reg;
@@ -524,7 +525,8 @@ static int intel_ntb4_link_disable(struct ntb_dev *ntb)
static int intel_ntb4_mw_get_align(struct ntb_dev *ntb, int pidx, int idx,
resource_size_t *addr_align,
resource_size_t *size_align,
- resource_size_t *size_max)
+ resource_size_t *size_max,
+ resource_size_t *offset)
{
struct intel_ntb_dev *ndev = ntb_ndev(ntb);
resource_size_t bar_size, mw_size;
diff --git a/drivers/ntb/hw/mscc/ntb_hw_switchtec.c b/drivers/ntb/hw/mscc/ntb_hw_switchtec.c
index e38540b92716..5d8bace78d4f 100644
--- a/drivers/ntb/hw/mscc/ntb_hw_switchtec.c
+++ b/drivers/ntb/hw/mscc/ntb_hw_switchtec.c
@@ -191,7 +191,8 @@ static int peer_lut_index(struct switchtec_ntb *sndev, int mw_idx)
static int switchtec_ntb_mw_get_align(struct ntb_dev *ntb, int pidx,
int widx, resource_size_t *addr_align,
resource_size_t *size_align,
- resource_size_t *size_max)
+ resource_size_t *size_max,
+ resource_size_t *offset)
{
struct switchtec_ntb *sndev = ntb_sndev(ntb);
int lut;
@@ -268,7 +269,8 @@ static void switchtec_ntb_mw_set_lut(struct switchtec_ntb *sndev, int idx,
}
static int switchtec_ntb_mw_set_trans(struct ntb_dev *ntb, int pidx, int widx,
- dma_addr_t addr, resource_size_t size)
+ dma_addr_t addr, resource_size_t size,
+ resource_size_t offset)
{
struct switchtec_ntb *sndev = ntb_sndev(ntb);
struct ntb_ctrl_regs __iomem *ctl = sndev->mmio_peer_ctrl;
diff --git a/drivers/ntb/msi.c b/drivers/ntb/msi.c
index 6817d504c12a..8875bcbf2ea4 100644
--- a/drivers/ntb/msi.c
+++ b/drivers/ntb/msi.c
@@ -117,7 +117,7 @@ int ntb_msi_setup_mws(struct ntb_dev *ntb)
return peer_widx;
ret = ntb_mw_get_align(ntb, peer, peer_widx, &addr_align,
- NULL, NULL);
+ NULL, NULL, NULL);
if (ret)
return ret;
@@ -132,7 +132,7 @@ int ntb_msi_setup_mws(struct ntb_dev *ntb)
}
ret = ntb_mw_get_align(ntb, peer, peer_widx, NULL,
- &size_align, &size_max);
+ &size_align, &size_max, NULL);
if (ret)
goto error_out;
@@ -142,7 +142,7 @@ int ntb_msi_setup_mws(struct ntb_dev *ntb)
mw_min_size = mw_size;
ret = ntb_mw_set_trans(ntb, peer, peer_widx,
- addr, mw_size);
+ addr, mw_size, 0);
if (ret)
goto error_out;
}
diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
index eb875e3db2e3..4bb1a64c1090 100644
--- a/drivers/ntb/ntb_transport.c
+++ b/drivers/ntb/ntb_transport.c
@@ -883,7 +883,7 @@ static int ntb_set_mw(struct ntb_transport_ctx *nt, int num_mw,
return -EINVAL;
rc = ntb_mw_get_align(nt->ndev, PIDX, num_mw, &xlat_align,
- &xlat_align_size, NULL);
+ &xlat_align_size, NULL, NULL);
if (rc)
return rc;
@@ -918,7 +918,7 @@ static int ntb_set_mw(struct ntb_transport_ctx *nt, int num_mw,
/* Notify HW the memory location of the receive buffer */
rc = ntb_mw_set_trans(nt->ndev, PIDX, num_mw, mw->dma_addr,
- mw->xlat_size);
+ mw->xlat_size, 0);
if (rc) {
dev_err(&pdev->dev, "Unable to set mw%d translation", num_mw);
ntb_free_mw(nt, num_mw);
diff --git a/drivers/ntb/test/ntb_perf.c b/drivers/ntb/test/ntb_perf.c
index dfd175f79e8f..b842b69e4242 100644
--- a/drivers/ntb/test/ntb_perf.c
+++ b/drivers/ntb/test/ntb_perf.c
@@ -573,7 +573,7 @@ static int perf_setup_inbuf(struct perf_peer *peer)
/* Get inbound MW parameters */
ret = ntb_mw_get_align(perf->ntb, peer->pidx, perf->gidx,
- &xlat_align, &size_align, &size_max);
+ &xlat_align, &size_align, &size_max, NULL);
if (ret) {
dev_err(&perf->ntb->dev, "Couldn't get inbuf restrictions\n");
return ret;
@@ -604,7 +604,7 @@ static int perf_setup_inbuf(struct perf_peer *peer)
}
ret = ntb_mw_set_trans(perf->ntb, peer->pidx, peer->gidx,
- peer->inbuf_xlat, peer->inbuf_size);
+ peer->inbuf_xlat, peer->inbuf_size, 0);
if (ret) {
dev_err(&perf->ntb->dev, "Failed to set inbuf translation\n");
goto err_free_inbuf;
diff --git a/drivers/ntb/test/ntb_tool.c b/drivers/ntb/test/ntb_tool.c
index 641cb7e05a47..7a7ba486bba7 100644
--- a/drivers/ntb/test/ntb_tool.c
+++ b/drivers/ntb/test/ntb_tool.c
@@ -578,7 +578,7 @@ static int tool_setup_mw(struct tool_ctx *tc, int pidx, int widx,
return 0;
ret = ntb_mw_get_align(tc->ntb, pidx, widx, &addr_align,
- &size_align, &size);
+ &size_align, &size, NULL);
if (ret)
return ret;
@@ -595,7 +595,7 @@ static int tool_setup_mw(struct tool_ctx *tc, int pidx, int widx,
goto err_free_dma;
}
- ret = ntb_mw_set_trans(tc->ntb, pidx, widx, inmw->dma_base, inmw->size);
+ ret = ntb_mw_set_trans(tc->ntb, pidx, widx, inmw->dma_base, inmw->size, 0);
if (ret)
goto err_free_dma;
@@ -652,7 +652,7 @@ static ssize_t tool_mw_trans_read(struct file *filep, char __user *ubuf,
return -ENOMEM;
ret = ntb_mw_get_align(inmw->tc->ntb, inmw->pidx, inmw->widx,
- &addr_align, &size_align, &size_max);
+ &addr_align, &size_align, &size_max, NULL);
if (ret)
goto err;
diff --git a/drivers/pci/endpoint/functions/pci-epf-vntb.c b/drivers/pci/endpoint/functions/pci-epf-vntb.c
index 42e57721dcb4..8dbae9be9402 100644
--- a/drivers/pci/endpoint/functions/pci-epf-vntb.c
+++ b/drivers/pci/endpoint/functions/pci-epf-vntb.c
@@ -1385,7 +1385,7 @@ static int vntb_epf_db_set_mask(struct ntb_dev *ntb, u64 db_bits)
}
static int vntb_epf_mw_set_trans(struct ntb_dev *ndev, int pidx, int idx,
- dma_addr_t addr, resource_size_t size)
+ dma_addr_t addr, resource_size_t size, resource_size_t offset)
{
struct epf_ntb *ntb = ntb_ndev(ndev);
struct pci_epf_bar *epf_bar;
@@ -1403,7 +1403,7 @@ static int vntb_epf_mw_set_trans(struct ntb_dev *ndev, int pidx, int idx,
epf_bar->barno = barno;
epf_bar->size = size;
- ret = pci_epc_map_inbound(epc, 0, 0, epf_bar, 0);
+ ret = pci_epc_map_inbound(epc, 0, 0, epf_bar, offset);
if (ret == -EOPNOTSUPP)
ret = pci_epc_set_bar(epc, 0, 0, epf_bar);
@@ -1514,7 +1514,8 @@ static u64 vntb_epf_db_read(struct ntb_dev *ndev)
static int vntb_epf_mw_get_align(struct ntb_dev *ndev, int pidx, int idx,
resource_size_t *addr_align,
resource_size_t *size_align,
- resource_size_t *size_max)
+ resource_size_t *size_max,
+ resource_size_t *offset)
{
struct epf_ntb *ntb = ntb_ndev(ndev);
diff --git a/include/linux/ntb.h b/include/linux/ntb.h
index 8ff9d663096b..d7ce5d2e60d0 100644
--- a/include/linux/ntb.h
+++ b/include/linux/ntb.h
@@ -273,9 +273,11 @@ struct ntb_dev_ops {
int (*mw_get_align)(struct ntb_dev *ntb, int pidx, int widx,
resource_size_t *addr_align,
resource_size_t *size_align,
- resource_size_t *size_max);
+ resource_size_t *size_max,
+ resource_size_t *offset);
int (*mw_set_trans)(struct ntb_dev *ntb, int pidx, int widx,
- dma_addr_t addr, resource_size_t size);
+ dma_addr_t addr, resource_size_t size,
+ resource_size_t offset);
int (*mw_clear_trans)(struct ntb_dev *ntb, int pidx, int widx);
int (*peer_mw_count)(struct ntb_dev *ntb);
int (*peer_mw_get_addr)(struct ntb_dev *ntb, int widx,
@@ -823,13 +825,14 @@ static inline int ntb_mw_count(struct ntb_dev *ntb, int pidx)
static inline int ntb_mw_get_align(struct ntb_dev *ntb, int pidx, int widx,
resource_size_t *addr_align,
resource_size_t *size_align,
- resource_size_t *size_max)
+ resource_size_t *size_max,
+ resource_size_t *offset)
{
if (!(ntb_link_is_up(ntb, NULL, NULL) & BIT_ULL(pidx)))
return -ENOTCONN;
return ntb->ops->mw_get_align(ntb, pidx, widx, addr_align, size_align,
- size_max);
+ size_max, offset);
}
/**
@@ -852,12 +855,13 @@ static inline int ntb_mw_get_align(struct ntb_dev *ntb, int pidx, int widx,
* Return: Zero on success, otherwise an error number.
*/
static inline int ntb_mw_set_trans(struct ntb_dev *ntb, int pidx, int widx,
- dma_addr_t addr, resource_size_t size)
+ dma_addr_t addr, resource_size_t size,
+ resource_size_t offset)
{
if (!ntb->ops->mw_set_trans)
return 0;
- return ntb->ops->mw_set_trans(ntb, pidx, widx, addr, size);
+ return ntb->ops->mw_set_trans(ntb, pidx, widx, addr, size, offset);
}
/**
@@ -875,7 +879,7 @@ static inline int ntb_mw_set_trans(struct ntb_dev *ntb, int pidx, int widx,
static inline int ntb_mw_clear_trans(struct ntb_dev *ntb, int pidx, int widx)
{
if (!ntb->ops->mw_clear_trans)
- return ntb_mw_set_trans(ntb, pidx, widx, 0, 0);
+ return ntb_mw_set_trans(ntb, pidx, widx, 0, 0, 0);
return ntb->ops->mw_clear_trans(ntb, pidx, widx);
}
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 08/27] PCI: endpoint: pci-epf-vntb: Propagate MW offset from configfs when present
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (6 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 07/27] NTB: Add offset parameter to MW translation APIs Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-12-01 19:35 ` Frank Li
2025-11-29 16:03 ` [RFC PATCH v2 09/27] NTB: ntb_transport: Support offsetted partial memory windows Koichiro Den
` (19 subsequent siblings)
27 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
The NTB API functions ntb_mw_set_trans() and ntb_mw_get_align() now
support non-zero MW offsets. Update pci-epf-vntb so that
vntb_epf_mw_get_align() reports the configfs-provided mws_offset[idx]
through the new offset parameter when the caller supplies one. Users can
then retrieve the offset and pass it to ntb_mw_set_trans().
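As an illustration (not part of this patch), an NTB client that honours
the endpoint-provided offset would query it via ntb_mw_get_align() and
pass it straight back to ntb_mw_set_trans(); pidx, widx, dma_addr and
size below are placeholders:

	resource_size_t addr_align, size_align, size_max, offset;
	int rc;

	rc = ntb_mw_get_align(ntb, pidx, widx, &addr_align,
			      &size_align, &size_max, &offset);
	if (rc)
		return rc;

	/* Map the inbound buffer at the sub-range the endpoint asked for. */
	rc = ntb_mw_set_trans(ntb, pidx, widx, dma_addr, size, offset);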
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/pci/endpoint/functions/pci-epf-vntb.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/pci/endpoint/functions/pci-epf-vntb.c b/drivers/pci/endpoint/functions/pci-epf-vntb.c
index 8dbae9be9402..aa44dcd5c943 100644
--- a/drivers/pci/endpoint/functions/pci-epf-vntb.c
+++ b/drivers/pci/endpoint/functions/pci-epf-vntb.c
@@ -1528,6 +1528,9 @@ static int vntb_epf_mw_get_align(struct ntb_dev *ndev, int pidx, int idx,
if (size_max)
*size_max = ntb->mws_size[idx];
+ if (offset)
+ *offset = ntb->mws_offset[idx];
+
return 0;
}
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 09/27] NTB: ntb_transport: Support offsetted partial memory windows
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (7 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 08/27] PCI: endpoint: pci-epf-vntb: Propagate MW offset from configfs when present Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 10/27] NTB: core: Add .get_pci_epc() to ntb_dev_ops Koichiro Den
` (18 subsequent siblings)
27 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
The NTB API functions ntb_mw_set_trans() and ntb_mw_get_align() now
support non-zero MW offsets. Update ntb_transport to make use of this
capability by propagating the offset when setting up MW translations.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/ntb/ntb_transport.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
index 4bb1a64c1090..3f3bc991e667 100644
--- a/drivers/ntb/ntb_transport.c
+++ b/drivers/ntb/ntb_transport.c
@@ -877,13 +877,14 @@ static int ntb_set_mw(struct ntb_transport_ctx *nt, int num_mw,
size_t xlat_size, buff_size;
resource_size_t xlat_align;
resource_size_t xlat_align_size;
+ resource_size_t offset;
int rc;
if (!size)
return -EINVAL;
rc = ntb_mw_get_align(nt->ndev, PIDX, num_mw, &xlat_align,
- &xlat_align_size, NULL, NULL);
+ &xlat_align_size, NULL, &offset);
if (rc)
return rc;
@@ -918,7 +919,7 @@ static int ntb_set_mw(struct ntb_transport_ctx *nt, int num_mw,
/* Notify HW the memory location of the receive buffer */
rc = ntb_mw_set_trans(nt->ndev, PIDX, num_mw, mw->dma_addr,
- mw->xlat_size, 0);
+ mw->xlat_size, offset);
if (rc) {
dev_err(&pdev->dev, "Unable to set mw%d translation", num_mw);
ntb_free_mw(nt, num_mw);
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 10/27] NTB: core: Add .get_pci_epc() to ntb_dev_ops
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (8 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 09/27] NTB: ntb_transport: Support offsetted partial memory windows Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-12-01 19:39 ` Frank Li
2025-12-01 21:08 ` Dave Jiang
2025-11-29 16:03 ` [RFC PATCH v2 11/27] NTB: epf: vntb: Implement .get_pci_epc() callback Koichiro Den
` (17 subsequent siblings)
27 siblings, 2 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Add an optional get_pci_epc() callback to retrieve the underlying
pci_epc device associated with the NTB implementation.
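As a minimal usage sketch (the same pattern appears in the ntb_transport
change later in this series), a consumer can ask for the EPC and fall
back to the regular PCI device when the NTB is not EPF-backed:

	struct pci_epc *epc = ntb_get_pci_epc(ntb);
	struct device *dma_dev = epc ? epc->dev.parent : &ntb->pdev->dev;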
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/ntb/hw/epf/ntb_hw_epf.c | 11 +----------
include/linux/ntb.h | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+), 10 deletions(-)
diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
index a3ec411bfe49..d55ce6b0fad4 100644
--- a/drivers/ntb/hw/epf/ntb_hw_epf.c
+++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
@@ -9,6 +9,7 @@
#include <linux/delay.h>
#include <linux/module.h>
#include <linux/pci.h>
+#include <linux/pci-epf.h>
#include <linux/slab.h>
#include <linux/ntb.h>
@@ -49,16 +50,6 @@
#define NTB_EPF_COMMAND_TIMEOUT 1000 /* 1 Sec */
-enum pci_barno {
- NO_BAR = -1,
- BAR_0,
- BAR_1,
- BAR_2,
- BAR_3,
- BAR_4,
- BAR_5,
-};
-
enum epf_ntb_bar {
BAR_CONFIG,
BAR_PEER_SPAD,
diff --git a/include/linux/ntb.h b/include/linux/ntb.h
index d7ce5d2e60d0..04dc9a4d6b85 100644
--- a/include/linux/ntb.h
+++ b/include/linux/ntb.h
@@ -64,6 +64,7 @@ struct ntb_client;
struct ntb_dev;
struct ntb_msi;
struct pci_dev;
+struct pci_epc;
/**
* enum ntb_topo - NTB connection topology
@@ -256,6 +257,7 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops *ops)
* @msg_clear_mask: See ntb_msg_clear_mask().
* @msg_read: See ntb_msg_read().
* @peer_msg_write: See ntb_peer_msg_write().
+ * @get_pci_epc: See ntb_get_pci_epc().
*/
struct ntb_dev_ops {
int (*port_number)(struct ntb_dev *ntb);
@@ -331,6 +333,7 @@ struct ntb_dev_ops {
int (*msg_clear_mask)(struct ntb_dev *ntb, u64 mask_bits);
u32 (*msg_read)(struct ntb_dev *ntb, int *pidx, int midx);
int (*peer_msg_write)(struct ntb_dev *ntb, int pidx, int midx, u32 msg);
+ struct pci_epc *(*get_pci_epc)(struct ntb_dev *ntb);
};
static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
@@ -393,6 +396,9 @@ static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
/* !ops->msg_clear_mask == !ops->msg_count && */
!ops->msg_read == !ops->msg_count &&
!ops->peer_msg_write == !ops->msg_count &&
+
+ /* Miscellaneous optional callbacks */
+ /* ops->get_pci_epc && */
1;
}
@@ -1567,6 +1573,21 @@ static inline int ntb_peer_msg_write(struct ntb_dev *ntb, int pidx, int midx,
return ntb->ops->peer_msg_write(ntb, pidx, midx, msg);
}
+/**
+ * ntb_get_pci_epc() - get backing PCI endpoint controller if possible.
+ * @ntb: NTB device context.
+ *
+ * Get the backing PCI endpoint controller representation.
+ *
+ * Return: A pointer to the pci_epc instance if available, or %NULL if not.
+ */
+static inline struct pci_epc __maybe_unused *ntb_get_pci_epc(struct ntb_dev *ntb)
+{
+ if (!ntb->ops->get_pci_epc)
+ return NULL;
+ return ntb->ops->get_pci_epc(ntb);
+}
+
/**
* ntb_peer_resource_idx() - get a resource index for a given peer idx
* @ntb: NTB device context.
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 11/27] NTB: epf: vntb: Implement .get_pci_epc() callback
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (9 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 10/27] NTB: core: Add .get_pci_epc() to ntb_dev_ops Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 12/27] dmaengine: dw-edma: Fix MSI data values for multi-vector IMWr interrupts Koichiro Den
` (16 subsequent siblings)
27 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Implement the new get_pci_epc() operation for the EPF vNTB driver to
expose its associated EPC device to NTB subsystems.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/pci/endpoint/functions/pci-epf-vntb.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/pci/endpoint/functions/pci-epf-vntb.c b/drivers/pci/endpoint/functions/pci-epf-vntb.c
index aa44dcd5c943..93fd724a8faa 100644
--- a/drivers/pci/endpoint/functions/pci-epf-vntb.c
+++ b/drivers/pci/endpoint/functions/pci-epf-vntb.c
@@ -1561,6 +1561,15 @@ static int vntb_epf_link_disable(struct ntb_dev *ntb)
return 0;
}
+static struct pci_epc *vntb_epf_get_pci_epc(struct ntb_dev *ntb)
+{
+ struct epf_ntb *ndev = ntb_ndev(ntb);
+
+ if (!ndev || !ndev->epf)
+ return NULL;
+ return ndev->epf->epc;
+}
+
static const struct ntb_dev_ops vntb_epf_ops = {
.mw_count = vntb_epf_mw_count,
.spad_count = vntb_epf_spad_count,
@@ -1582,6 +1591,7 @@ static const struct ntb_dev_ops vntb_epf_ops = {
.db_clear_mask = vntb_epf_db_clear_mask,
.db_clear = vntb_epf_db_clear,
.link_disable = vntb_epf_link_disable,
+ .get_pci_epc = vntb_epf_get_pci_epc,
};
static int pci_vntb_probe(struct pci_dev *pdev, const struct pci_device_id *id)
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 12/27] dmaengine: dw-edma: Fix MSI data values for multi-vector IMWr interrupts
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (10 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 11/27] NTB: epf: vntb: Implement .get_pci_epc() callback Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-12-01 19:46 ` Frank Li
2025-11-29 16:03 ` [RFC PATCH v2 13/27] NTB: ntb_transport: Use seq_file for QP stats debugfs Koichiro Den
` (15 subsequent siblings)
27 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
When multiple MSI vectors are allocated for the DesignWare eDMA, the
driver currently records the same MSI message for all IRQs by calling
get_cached_msi_msg() per vector. For multi-vector MSI (as opposed to
MSI-X), the cached message corresponds to vector 0 and msg.data is
supposed to be adjusted by the IRQ index.
As a result, all eDMA interrupts share the same MSI data value and the
interrupt controller cannot distinguish between them.
Introduce dw_edma_compose_msi() to construct the correct MSI message for
each vector. For MSI-X nothing changes. For multi-vector MSI, derive the
base IRQ with msi_get_virq(dev, 0) and OR the per-vector offset into
msg.data before storing the message in dw->irq[i].msi.
This makes each IMWr MSI vector use a unique MSI data value.
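For illustration, assume four MSI vectors and a cached vector-0 message
whose data value is 0x40 (multi-vector MSI requires the base data to be
aligned to the vector count, so the low-order bits are free):

	vector 0: 0x40 | 0 = 0x40
	vector 1: 0x40 | 1 = 0x41
	vector 2: 0x40 | 2 = 0x42
	vector 3: 0x40 | 3 = 0x43

Each IMWr Data Register value now identifies its own vector.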
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/dma/dw-edma/dw-edma-core.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)
diff --git a/drivers/dma/dw-edma/dw-edma-core.c b/drivers/dma/dw-edma/dw-edma-core.c
index 8e5f7defa6b6..3542177a4a8e 100644
--- a/drivers/dma/dw-edma/dw-edma-core.c
+++ b/drivers/dma/dw-edma/dw-edma-core.c
@@ -839,6 +839,28 @@ static inline void dw_edma_add_irq_mask(u32 *mask, u32 alloc, u16 cnt)
(*mask)++;
}
+static void dw_edma_compose_msi(struct device *dev, int irq, struct msi_msg *out)
+{
+ struct msi_desc *desc = irq_get_msi_desc(irq);
+ struct msi_msg msg;
+ unsigned int base;
+
+ if (!desc)
+ return;
+
+ get_cached_msi_msg(irq, &msg);
+ if (!desc->pci.msi_attrib.is_msix) {
+ /*
+ * For multi-vector MSI, the cached message corresponds to
+ * vector 0. Adjust msg.data by the IRQ index so that each
+ * vector gets a unique MSI data value for IMWr Data Register.
+ */
+ base = msi_get_virq(dev, 0);
+ msg.data |= (irq - base);
+ }
+ *out = msg;
+}
+
static int dw_edma_irq_request(struct dw_edma *dw,
u32 *wr_alloc, u32 *rd_alloc)
{
@@ -869,8 +891,7 @@ static int dw_edma_irq_request(struct dw_edma *dw,
return err;
}
- if (irq_get_msi_desc(irq))
- get_cached_msi_msg(irq, &dw->irq[0].msi);
+ dw_edma_compose_msi(dev, irq, &dw->irq[0].msi);
dw->nr_irqs = 1;
} else {
@@ -896,8 +917,7 @@ static int dw_edma_irq_request(struct dw_edma *dw,
if (err)
goto err_irq_free;
- if (irq_get_msi_desc(irq))
- get_cached_msi_msg(irq, &dw->irq[i].msi);
+ dw_edma_compose_msi(dev, irq, &dw->irq[i].msi);
}
dw->nr_irqs = i;
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 13/27] NTB: ntb_transport: Use seq_file for QP stats debugfs
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (11 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 12/27] dmaengine: dw-edma: Fix MSI data values for multi-vector IMWr interrupts Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-12-01 19:50 ` Frank Li
2025-11-29 16:03 ` [RFC PATCH v2 14/27] NTB: ntb_transport: Move TX memory window setup into setup_qp_mw() Koichiro Den
` (14 subsequent siblings)
27 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
The ./qp*/stats debugfs file for each NTB transport QP is currently
implemented with a hand-crafted kmalloc() buffer and a series of
scnprintf() calls. This is a pre-seq_file pattern, and its fixed-size
buffer makes the output prone to truncation as more fields are added.
Convert the stats file to use the seq_file helpers via
DEFINE_SHOW_ATTRIBUTE(), which simplifies the code and lets the seq_file
core handle buffering and partial reads.
While touching this area, fix a bug in the per-QP debugfs directory
naming: the buffer used for "qp%d" was only 4 bytes, which truncates
names like "qp10" to "qp1" and causes multiple queues to share the same
directory. Enlarge the buffer and use sizeof() to avoid truncation.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/ntb/ntb_transport.c | 136 +++++++++++-------------------------
1 file changed, 41 insertions(+), 95 deletions(-)
diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
index 3f3bc991e667..57b4c0511927 100644
--- a/drivers/ntb/ntb_transport.c
+++ b/drivers/ntb/ntb_transport.c
@@ -57,6 +57,7 @@
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/slab.h>
+#include <linux/seq_file.h>
#include <linux/types.h>
#include <linux/uaccess.h>
#include <linux/mutex.h>
@@ -466,104 +467,49 @@ void ntb_transport_unregister_client(struct ntb_transport_client *drv)
}
EXPORT_SYMBOL_GPL(ntb_transport_unregister_client);
-static ssize_t debugfs_read(struct file *filp, char __user *ubuf, size_t count,
- loff_t *offp)
+static int ntb_qp_debugfs_stats_show(struct seq_file *s, void *v)
{
- struct ntb_transport_qp *qp;
- char *buf;
- ssize_t ret, out_offset, out_count;
-
- qp = filp->private_data;
+ struct ntb_transport_qp *qp = s->private;
if (!qp || !qp->link_is_up)
return 0;
- out_count = 1000;
-
- buf = kmalloc(out_count, GFP_KERNEL);
- if (!buf)
- return -ENOMEM;
+ seq_puts(s, "\nNTB QP stats:\n\n");
+
+ seq_printf(s, "rx_bytes - \t%llu\n", qp->rx_bytes);
+ seq_printf(s, "rx_pkts - \t%llu\n", qp->rx_pkts);
+ seq_printf(s, "rx_memcpy - \t%llu\n", qp->rx_memcpy);
+ seq_printf(s, "rx_async - \t%llu\n", qp->rx_async);
+ seq_printf(s, "rx_ring_empty - %llu\n", qp->rx_ring_empty);
+ seq_printf(s, "rx_err_no_buf - %llu\n", qp->rx_err_no_buf);
+ seq_printf(s, "rx_err_oflow - \t%llu\n", qp->rx_err_oflow);
+ seq_printf(s, "rx_err_ver - \t%llu\n", qp->rx_err_ver);
+ seq_printf(s, "rx_buff - \t0x%p\n", qp->rx_buff);
+ seq_printf(s, "rx_index - \t%u\n", qp->rx_index);
+ seq_printf(s, "rx_max_entry - \t%u\n", qp->rx_max_entry);
+ seq_printf(s, "rx_alloc_entry - \t%u\n\n", qp->rx_alloc_entry);
+
+ seq_printf(s, "tx_bytes - \t%llu\n", qp->tx_bytes);
+ seq_printf(s, "tx_pkts - \t%llu\n", qp->tx_pkts);
+ seq_printf(s, "tx_memcpy - \t%llu\n", qp->tx_memcpy);
+ seq_printf(s, "tx_async - \t%llu\n", qp->tx_async);
+ seq_printf(s, "tx_ring_full - \t%llu\n", qp->tx_ring_full);
+ seq_printf(s, "tx_err_no_buf - %llu\n", qp->tx_err_no_buf);
+ seq_printf(s, "tx_mw - \t0x%p\n", qp->tx_mw);
+ seq_printf(s, "tx_index (H) - \t%u\n", qp->tx_index);
+ seq_printf(s, "RRI (T) - \t%u\n", qp->remote_rx_info->entry);
+ seq_printf(s, "tx_max_entry - \t%u\n", qp->tx_max_entry);
+ seq_printf(s, "free tx - \t%u\n", ntb_transport_tx_free_entry(qp));
+ seq_putc(s, '\n');
+
+ seq_printf(s, "Using TX DMA - \t%s\n", qp->tx_dma_chan ? "Yes" : "No");
+ seq_printf(s, "Using RX DMA - \t%s\n", qp->rx_dma_chan ? "Yes" : "No");
+ seq_printf(s, "QP Link - \t%s\n", qp->link_is_up ? "Up" : "Down");
+ seq_putc(s, '\n');
- out_offset = 0;
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "\nNTB QP stats:\n\n");
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "rx_bytes - \t%llu\n", qp->rx_bytes);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "rx_pkts - \t%llu\n", qp->rx_pkts);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "rx_memcpy - \t%llu\n", qp->rx_memcpy);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "rx_async - \t%llu\n", qp->rx_async);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "rx_ring_empty - %llu\n", qp->rx_ring_empty);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "rx_err_no_buf - %llu\n", qp->rx_err_no_buf);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "rx_err_oflow - \t%llu\n", qp->rx_err_oflow);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "rx_err_ver - \t%llu\n", qp->rx_err_ver);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "rx_buff - \t0x%p\n", qp->rx_buff);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "rx_index - \t%u\n", qp->rx_index);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "rx_max_entry - \t%u\n", qp->rx_max_entry);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "rx_alloc_entry - \t%u\n\n", qp->rx_alloc_entry);
-
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "tx_bytes - \t%llu\n", qp->tx_bytes);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "tx_pkts - \t%llu\n", qp->tx_pkts);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "tx_memcpy - \t%llu\n", qp->tx_memcpy);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "tx_async - \t%llu\n", qp->tx_async);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "tx_ring_full - \t%llu\n", qp->tx_ring_full);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "tx_err_no_buf - %llu\n", qp->tx_err_no_buf);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "tx_mw - \t0x%p\n", qp->tx_mw);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "tx_index (H) - \t%u\n", qp->tx_index);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "RRI (T) - \t%u\n",
- qp->remote_rx_info->entry);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "tx_max_entry - \t%u\n", qp->tx_max_entry);
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "free tx - \t%u\n",
- ntb_transport_tx_free_entry(qp));
-
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "\n");
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "Using TX DMA - \t%s\n",
- qp->tx_dma_chan ? "Yes" : "No");
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "Using RX DMA - \t%s\n",
- qp->rx_dma_chan ? "Yes" : "No");
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "QP Link - \t%s\n",
- qp->link_is_up ? "Up" : "Down");
- out_offset += scnprintf(buf + out_offset, out_count - out_offset,
- "\n");
-
- if (out_offset > out_count)
- out_offset = out_count;
-
- ret = simple_read_from_buffer(ubuf, count, offp, buf, out_offset);
- kfree(buf);
- return ret;
-}
-
-static const struct file_operations ntb_qp_debugfs_stats = {
- .owner = THIS_MODULE,
- .open = simple_open,
- .read = debugfs_read,
-};
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(ntb_qp_debugfs_stats);
static void ntb_list_add(spinlock_t *lock, struct list_head *entry,
struct list_head *list)
@@ -1237,15 +1183,15 @@ static int ntb_transport_init_queue(struct ntb_transport_ctx *nt,
qp->tx_max_entry = tx_size / qp->tx_max_frame;
if (nt->debugfs_node_dir) {
- char debugfs_name[4];
+ char debugfs_name[8];
- snprintf(debugfs_name, 4, "qp%d", qp_num);
+ snprintf(debugfs_name, sizeof(debugfs_name), "qp%d", qp_num);
qp->debugfs_dir = debugfs_create_dir(debugfs_name,
nt->debugfs_node_dir);
qp->debugfs_stats = debugfs_create_file("stats", S_IRUSR,
qp->debugfs_dir, qp,
- &ntb_qp_debugfs_stats);
+ &ntb_qp_debugfs_stats_fops);
} else {
qp->debugfs_dir = NULL;
qp->debugfs_stats = NULL;
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 14/27] NTB: ntb_transport: Move TX memory window setup into setup_qp_mw()
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (12 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 13/27] NTB: ntb_transport: Use seq_file for QP stats debugfs Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-12-01 20:02 ` Frank Li
2025-11-29 16:03 ` [RFC PATCH v2 15/27] NTB: ntb_transport: Dynamically determine qp count Koichiro Den
` (13 subsequent siblings)
27 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Historically both TX and RX have assumed the same per-QP MW slice
(tx_max_entry == remote rx_max_entry), even though the two are
calculated separately in different places (before and after the link-up
negotiation point). This has been safe because nt->link_is_up is never
set to true unless both sides pre-determine the same qp_count, and
qp_count is typically limited to nt->mw_count, which the admin is
expected to configure carefully.
However, setup_qp_mw() can already split an MW and host multiple QPs in
one MW correctly, so qp_count need not be limited by nt->mw_count. Once
that limitation is relaxed, the pre-determined qp_count can differ
between host and endpoint, and link-up negotiation can easily fail.
Move the TX MW configuration (per-QP offset and size) into
ntb_transport_setup_qp_mw() so that both RX and TX layout decisions are
centralized in a single helper. ntb_transport_init_queue() now deals
only with per-QP software state, not with MW layout.
This keeps the previous behaviour while preparing to relax the qp_count
limitation, and improves readability.
No functional change is intended.
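To make the unchanged layout concrete, take mw_count = 2 and
qp_count = 3 (mw_num = qp_num % mw_count):

	MW0 hosts QP0 and QP2 -> num_qps_mw = 2, each QP gets mw_size / 2
	MW1 hosts QP1         -> num_qps_mw = 1, QP1 gets the whole MW

	QP2 offset into MW0 = (mw_size / 2) * (qp_num / mw_count)
	                    = (mw_size / 2) * 1

with remote_rx_info placed at the end of each per-QP slice, as before.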
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/ntb/ntb_transport.c | 67 ++++++++++++++-----------------------
1 file changed, 26 insertions(+), 41 deletions(-)
diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
index 57b4c0511927..79063e2f911b 100644
--- a/drivers/ntb/ntb_transport.c
+++ b/drivers/ntb/ntb_transport.c
@@ -569,7 +569,8 @@ static int ntb_transport_setup_qp_mw(struct ntb_transport_ctx *nt,
struct ntb_transport_mw *mw;
struct ntb_dev *ndev = nt->ndev;
struct ntb_queue_entry *entry;
- unsigned int rx_size, num_qps_mw;
+ unsigned int num_qps_mw;
+ unsigned int mw_size, mw_size_per_qp, qp_offset, rx_info_offset;
unsigned int mw_num, mw_count, qp_count;
unsigned int i;
int node;
@@ -588,15 +589,33 @@ static int ntb_transport_setup_qp_mw(struct ntb_transport_ctx *nt,
else
num_qps_mw = qp_count / mw_count;
- rx_size = (unsigned int)mw->xlat_size / num_qps_mw;
- qp->rx_buff = mw->virt_addr + rx_size * (qp_num / mw_count);
- rx_size -= sizeof(struct ntb_rx_info);
+ mw_size = min(nt->mw_vec[mw_num].phys_size, mw->xlat_size);
+ if (max_mw_size && mw_size > max_mw_size)
+ mw_size = max_mw_size;
- qp->remote_rx_info = qp->rx_buff + rx_size;
+ /* Split this MW evenly among the queue pairs mapped to it. */
+ mw_size_per_qp = (unsigned int)mw_size / num_qps_mw;
+ qp_offset = mw_size_per_qp * (qp_num / mw_count);
+
+ /* Place remote_rx_info at the end of the per-QP region. */
+ rx_info_offset = mw_size_per_qp - sizeof(struct ntb_rx_info);
+
+ qp->tx_mw_size = mw_size_per_qp;
+ qp->tx_mw = nt->mw_vec[mw_num].vbase + qp_offset;
+ if (!qp->tx_mw)
+ return -EINVAL;
+ qp->tx_mw_phys = nt->mw_vec[mw_num].phys_addr + qp_offset;
+ if (!qp->tx_mw_phys)
+ return -EINVAL;
+ qp->rx_info = qp->tx_mw + rx_info_offset;
+ qp->rx_buff = mw->virt_addr + qp_offset;
+ qp->remote_rx_info = qp->rx_buff + rx_info_offset;
/* Due to housekeeping, there must be atleast 2 buffs */
- qp->rx_max_frame = min(transport_mtu, rx_size / 2);
- qp->rx_max_entry = rx_size / qp->rx_max_frame;
+ qp->tx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
+ qp->tx_max_entry = mw_size_per_qp / qp->tx_max_frame;
+ qp->rx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
+ qp->rx_max_entry = mw_size_per_qp / qp->rx_max_frame;
qp->rx_index = 0;
/*
@@ -1133,11 +1152,7 @@ static int ntb_transport_init_queue(struct ntb_transport_ctx *nt,
unsigned int qp_num)
{
struct ntb_transport_qp *qp;
- phys_addr_t mw_base;
- resource_size_t mw_size;
- unsigned int num_qps_mw, tx_size;
unsigned int mw_num, mw_count, qp_count;
- u64 qp_offset;
mw_count = nt->mw_count;
qp_count = nt->qp_count;
@@ -1152,36 +1167,6 @@ static int ntb_transport_init_queue(struct ntb_transport_ctx *nt,
qp->event_handler = NULL;
ntb_qp_link_context_reset(qp);
- if (mw_num < qp_count % mw_count)
- num_qps_mw = qp_count / mw_count + 1;
- else
- num_qps_mw = qp_count / mw_count;
-
- mw_base = nt->mw_vec[mw_num].phys_addr;
- mw_size = nt->mw_vec[mw_num].phys_size;
-
- if (max_mw_size && mw_size > max_mw_size)
- mw_size = max_mw_size;
-
- tx_size = (unsigned int)mw_size / num_qps_mw;
- qp_offset = tx_size * (qp_num / mw_count);
-
- qp->tx_mw_size = tx_size;
- qp->tx_mw = nt->mw_vec[mw_num].vbase + qp_offset;
- if (!qp->tx_mw)
- return -EINVAL;
-
- qp->tx_mw_phys = mw_base + qp_offset;
- if (!qp->tx_mw_phys)
- return -EINVAL;
-
- tx_size -= sizeof(struct ntb_rx_info);
- qp->rx_info = qp->tx_mw + tx_size;
-
- /* Due to housekeeping, there must be atleast 2 buffs */
- qp->tx_max_frame = min(transport_mtu, tx_size / 2);
- qp->tx_max_entry = tx_size / qp->tx_max_frame;
-
if (nt->debugfs_node_dir) {
char debugfs_name[8];
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 15/27] NTB: ntb_transport: Dynamically determine qp count
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (13 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 14/27] NTB: ntb_transport: Move TX memory window setup into setup_qp_mw() Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 16/27] NTB: ntb_transport: Introduce get_dma_dev() helper Koichiro Den
` (12 subsequent siblings)
27 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
One MW can host multiple queue pairs, so stop limiting qp_count to the
number of MWs.
Now that TX and RX MW sizing is done in the same place, the MW
layout is derived from a single code path on both host and endpoint, so
the layout cannot diverge between the two sides.
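For example, if the local side computed qp_count = 8 but the peer
advertises NUM_QPS = 4, bits 4..7 are cleared from qp_bitmap and
qp_bitmap_free and the local qp_count is reduced to 4, so both sides end
up negotiating the same set of queue pairs.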
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/ntb/ntb_transport.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
index 79063e2f911b..5e2a87358e87 100644
--- a/drivers/ntb/ntb_transport.c
+++ b/drivers/ntb/ntb_transport.c
@@ -1015,6 +1015,7 @@ static void ntb_transport_link_work(struct work_struct *work)
struct ntb_dev *ndev = nt->ndev;
struct pci_dev *pdev = ndev->pdev;
resource_size_t size;
+ u64 qp_bitmap_free;
u32 val;
int rc = 0, i, spad;
@@ -1062,8 +1063,23 @@ static void ntb_transport_link_work(struct work_struct *work)
val = ntb_spad_read(ndev, NUM_QPS);
dev_dbg(&pdev->dev, "Remote max number of qps = %d\n", val);
- if (val != nt->qp_count)
+ if (val == 0)
goto out;
+ else if (val < nt->qp_count) {
+ /*
+ * Clamp local qp_count to peer-advertised NUM_QPS to avoid
+ * mismatched queues.
+ */
+ qp_bitmap_free = nt->qp_bitmap_free;
+ for (i = val; i < nt->qp_count; i++) {
+ nt->qp_bitmap &= ~BIT_ULL(i);
+ nt->qp_bitmap_free &= ~BIT_ULL(i);
+ }
+ dev_warn(&pdev->dev,
+ "Local number of qps is reduced: %d->%d (qp_bitmap_free: 0x%llx->0x%llx)\n",
+ nt->qp_count, val, qp_bitmap_free, nt->qp_bitmap_free);
+ nt->qp_count = val;
+ }
val = ntb_spad_read(ndev, NUM_MWS);
dev_dbg(&pdev->dev, "Remote number of mws = %d\n", val);
@@ -1298,8 +1314,6 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
if (max_num_clients && max_num_clients < qp_count)
qp_count = max_num_clients;
- else if (nt->mw_count < qp_count)
- qp_count = nt->mw_count;
qp_bitmap &= BIT_ULL(qp_count) - 1;
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 16/27] NTB: ntb_transport: Introduce get_dma_dev() helper
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (14 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 15/27] NTB: ntb_transport: Dynamically determine qp count Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 17/27] NTB: epf: Reserve a subset of MSI vectors for non-NTB users Koichiro Den
` (11 subsequent siblings)
27 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
When ntb_transport is used on top of an endpoint function (EPF) NTB
implementation, DMA mappings should be associated with the underlying
PCIe controller device rather than the virtual NTB PCI function. This
matters for IOMMU configuration and DMA mask validation.
Add a small helper, get_dma_dev(), that returns the appropriate struct
device for DMA mapping, i.e. &pdev->dev for a regular NTB host bridge
and the EPC parent device for EPF-based NTB endpoints. Use it in the
places where we set up DMA mappings or log DMA-related errors.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/ntb/ntb_transport.c | 35 ++++++++++++++++++++++++++++-------
1 file changed, 28 insertions(+), 7 deletions(-)
diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
index 5e2a87358e87..dad596e3a405 100644
--- a/drivers/ntb/ntb_transport.c
+++ b/drivers/ntb/ntb_transport.c
@@ -63,6 +63,7 @@
#include <linux/mutex.h>
#include "linux/ntb.h"
#include "linux/ntb_transport.h"
+#include <linux/pci-epc.h>
#define NTB_TRANSPORT_VERSION 4
#define NTB_TRANSPORT_VER "4"
@@ -259,6 +260,26 @@ struct ntb_payload_header {
unsigned int flags;
};
+/*
+ * Return the device that should be used for DMA mapping.
+ *
+ * On RC, this is simply &pdev->dev.
+ * On EPF-backed NTB endpoints, use the EPC parent device so that
+ * DMA capabilities and IOMMU configuration are taken from the
+ * controller rather than the virtual NTB PCI function.
+ */
+static struct device *get_dma_dev(struct ntb_dev *ndev)
+{
+ struct device *dev = &ndev->pdev->dev;
+ struct pci_epc *epc;
+
+ epc = ntb_get_pci_epc(ndev);
+ if (epc)
+ dev = epc->dev.parent;
+
+ return dev;
+}
+
enum {
VERSION = 0,
QP_LINKS,
@@ -762,13 +783,13 @@ static void ntb_transport_msi_desc_changed(void *data)
static void ntb_free_mw(struct ntb_transport_ctx *nt, int num_mw)
{
struct ntb_transport_mw *mw = &nt->mw_vec[num_mw];
- struct pci_dev *pdev = nt->ndev->pdev;
+ struct device *dev = get_dma_dev(nt->ndev);
if (!mw->virt_addr)
return;
ntb_mw_clear_trans(nt->ndev, PIDX, num_mw);
- dma_free_coherent(&pdev->dev, mw->alloc_size,
+ dma_free_coherent(dev, mw->alloc_size,
mw->alloc_addr, mw->dma_addr);
mw->xlat_size = 0;
mw->buff_size = 0;
@@ -838,7 +859,7 @@ static int ntb_set_mw(struct ntb_transport_ctx *nt, int num_mw,
resource_size_t size)
{
struct ntb_transport_mw *mw = &nt->mw_vec[num_mw];
- struct pci_dev *pdev = nt->ndev->pdev;
+ struct device *dev = get_dma_dev(nt->ndev);
size_t xlat_size, buff_size;
resource_size_t xlat_align;
resource_size_t xlat_align_size;
@@ -868,12 +889,12 @@ static int ntb_set_mw(struct ntb_transport_ctx *nt, int num_mw,
mw->buff_size = buff_size;
mw->alloc_size = buff_size;
- rc = ntb_alloc_mw_buffer(mw, &pdev->dev, xlat_align);
+ rc = ntb_alloc_mw_buffer(mw, dev, xlat_align);
if (rc) {
mw->alloc_size *= 2;
- rc = ntb_alloc_mw_buffer(mw, &pdev->dev, xlat_align);
+ rc = ntb_alloc_mw_buffer(mw, dev, xlat_align);
if (rc) {
- dev_err(&pdev->dev,
+ dev_err(dev,
"Unable to alloc aligned MW buff\n");
mw->xlat_size = 0;
mw->buff_size = 0;
@@ -886,7 +907,7 @@ static int ntb_set_mw(struct ntb_transport_ctx *nt, int num_mw,
rc = ntb_mw_set_trans(nt->ndev, PIDX, num_mw, mw->dma_addr,
mw->xlat_size, offset);
if (rc) {
- dev_err(&pdev->dev, "Unable to set mw%d translation", num_mw);
+ dev_err(dev, "Unable to set mw%d translation", num_mw);
ntb_free_mw(nt, num_mw);
return -EIO;
}
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 17/27] NTB: epf: Reserve a subset of MSI vectors for non-NTB users
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (15 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 16/27] NTB: ntb_transport: Introduce get_dma_dev() helper Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 18/27] NTB: ntb_transport: Introduce ntb_transport_backend_ops Koichiro Den
` (10 subsequent siblings)
27 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
The ntb_hw_epf driver currently uses all MSI/MSI-X vectors allocated for
the endpoint as doorbell interrupts. On SoCs that also run other
functions on the same PCIe controller (e.g. DesignWare eDMA), we need to
reserve some vectors for those other consumers.
Introduce NTB_EPF_IRQ_RESERVE and track the total number of allocated
vectors in ntb_epf_dev's 'num_irqs' field. Use only (num_irqs -
NTB_EPF_IRQ_RESERVE) vectors for NTB doorbells and free all num_irqs
vectors in the teardown path, so that the remaining vectors can be used
by other endpoint functions such as the integrated DesignWare eDMA.
This makes it possible to share the PCIe controller MSI space between
NTB and other on-chip IP blocks.
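For example, if pci_alloc_irq_vectors() returns 16 vectors, num_irqs is
recorded as 16, the first 8 (16 - NTB_EPF_IRQ_RESERVE) are requested as
NTB doorbell interrupts (db_count = 8), and the remaining 8 vectors stay
free for other endpoint functions such as the integrated eDMA.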
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/ntb/hw/epf/ntb_hw_epf.c | 34 +++++++++++++++++++++------------
1 file changed, 22 insertions(+), 12 deletions(-)
diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
index d55ce6b0fad4..c94bf63d69ff 100644
--- a/drivers/ntb/hw/epf/ntb_hw_epf.c
+++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
@@ -47,6 +47,7 @@
#define NTB_EPF_MIN_DB_COUNT 3
#define NTB_EPF_MAX_DB_COUNT 31
+#define NTB_EPF_IRQ_RESERVE 8
#define NTB_EPF_COMMAND_TIMEOUT 1000 /* 1 Sec */
@@ -75,6 +76,8 @@ struct ntb_epf_dev {
unsigned int spad_count;
unsigned int db_count;
+ unsigned int num_irqs;
+
void __iomem *ctrl_reg;
void __iomem *db_reg;
void __iomem *peer_spad_reg;
@@ -329,7 +332,7 @@ static int ntb_epf_init_isr(struct ntb_epf_dev *ndev, int msi_min, int msi_max)
u32 argument = MSIX_ENABLE;
int irq;
int ret;
- int i;
+ int i = 0;
irq = pci_alloc_irq_vectors(pdev, msi_min, msi_max, PCI_IRQ_MSIX);
if (irq < 0) {
@@ -343,33 +346,39 @@ static int ntb_epf_init_isr(struct ntb_epf_dev *ndev, int msi_min, int msi_max)
argument &= ~MSIX_ENABLE;
}
+ ndev->num_irqs = irq;
+ irq -= NTB_EPF_IRQ_RESERVE;
+ if (irq <= 0) {
+ dev_err(dev, "Not enough irqs allocated\n");
+ ret = -ENOSPC;
+ goto err_out;
+ }
+
for (i = 0; i < irq; i++) {
ret = request_irq(pci_irq_vector(pdev, i), ntb_epf_vec_isr,
0, "ntb_epf", ndev);
if (ret) {
dev_err(dev, "Failed to request irq\n");
- goto err_request_irq;
+ goto err_out;
}
}
- ndev->db_count = irq - 1;
+ ndev->db_count = irq;
ret = ntb_epf_send_command(ndev, CMD_CONFIGURE_DOORBELL,
argument | irq);
if (ret) {
dev_err(dev, "Failed to configure doorbell\n");
- goto err_configure_db;
+ goto err_out;
}
return 0;
-err_configure_db:
- for (i = 0; i < ndev->db_count + 1; i++)
+err_out:
+ while (i-- > 0)
free_irq(pci_irq_vector(pdev, i), ndev);
-err_request_irq:
pci_free_irq_vectors(pdev);
-
return ret;
}
@@ -477,7 +486,7 @@ static int ntb_epf_peer_db_set(struct ntb_dev *ntb, u64 db_bits)
u32 db_offset;
u32 db_data;
- if (interrupt_num > ndev->db_count) {
+ if (interrupt_num >= ndev->db_count) {
dev_err(dev, "DB interrupt %d greater than Max Supported %d\n",
interrupt_num, ndev->db_count);
return -EINVAL;
@@ -487,6 +496,7 @@ static int ntb_epf_peer_db_set(struct ntb_dev *ntb, u64 db_bits)
db_data = readl(ndev->ctrl_reg + NTB_EPF_DB_DATA(interrupt_num));
db_offset = readl(ndev->ctrl_reg + NTB_EPF_DB_OFFSET(interrupt_num));
+
writel(db_data, ndev->db_reg + (db_entry_size * interrupt_num) +
db_offset);
@@ -551,8 +561,8 @@ static int ntb_epf_init_dev(struct ntb_epf_dev *ndev)
int ret;
/* One Link interrupt and rest doorbell interrupt */
- ret = ntb_epf_init_isr(ndev, NTB_EPF_MIN_DB_COUNT + 1,
- NTB_EPF_MAX_DB_COUNT + 1);
+ ret = ntb_epf_init_isr(ndev, NTB_EPF_MIN_DB_COUNT + NTB_EPF_IRQ_RESERVE,
+ NTB_EPF_MAX_DB_COUNT + NTB_EPF_IRQ_RESERVE);
if (ret) {
dev_err(dev, "Failed to init ISR\n");
return ret;
@@ -659,7 +669,7 @@ static void ntb_epf_cleanup_isr(struct ntb_epf_dev *ndev)
ntb_epf_send_command(ndev, CMD_TEARDOWN_DOORBELL, ndev->db_count + 1);
- for (i = 0; i < ndev->db_count + 1; i++)
+ for (i = 0; i < ndev->num_irqs; i++)
free_irq(pci_irq_vector(pdev, i), ndev);
pci_free_irq_vectors(pdev);
}
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 18/27] NTB: ntb_transport: Introduce ntb_transport_backend_ops
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (16 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 17/27] NTB: epf: Reserve a subset of MSI vectors for non-NTB users Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping Koichiro Den
` (9 subsequent siblings)
27 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Introduce struct ntb_transport_backend_ops to abstract queue setup and
enqueue/poll operations. The existing implementation is moved behind
this interface, and a subsequent patch will add an eDMA-backed
implementation.
No functional changes intended.
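As a sketch of the intended use (the edma_* names below are
hypothetical, standing in for the eDMA backend added later in this
series), an alternative backend supplies its own ops table and installs
it where the default one is installed today:

	static const struct ntb_transport_backend_ops edma_backend_ops = {
		.setup_qp_mw		= edma_setup_qp_mw,
		.tx_free_entry		= edma_tx_free_entry,
		.tx_enqueue		= edma_tx_enqueue,
		.rx_enqueue		= edma_rx_enqueue,
		.rx_poll		= edma_rx_poll,
		.debugfs_stats_show	= edma_debugfs_stats_show,
	};

	/* in ntb_transport_probe(), instead of default_backend_ops: */
	nt->backend_ops = edma_backend_ops;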
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/ntb/ntb_transport.c | 127 +++++++++++++++++++++++-----------
include/linux/ntb_transport.h | 21 ++++++
2 files changed, 106 insertions(+), 42 deletions(-)
diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
index dad596e3a405..907db6c93d4d 100644
--- a/drivers/ntb/ntb_transport.c
+++ b/drivers/ntb/ntb_transport.c
@@ -228,6 +228,8 @@ struct ntb_transport_ctx {
struct ntb_dev *ndev;
+ struct ntb_transport_backend_ops backend_ops;
+
struct ntb_transport_mw *mw_vec;
struct ntb_transport_qp *qp_vec;
unsigned int mw_count;
@@ -488,15 +490,9 @@ void ntb_transport_unregister_client(struct ntb_transport_client *drv)
}
EXPORT_SYMBOL_GPL(ntb_transport_unregister_client);
-static int ntb_qp_debugfs_stats_show(struct seq_file *s, void *v)
+static void ntb_transport_default_debugfs_stats_show(struct seq_file *s,
+ struct ntb_transport_qp *qp)
{
- struct ntb_transport_qp *qp = s->private;
-
- if (!qp || !qp->link_is_up)
- return 0;
-
- seq_puts(s, "\nNTB QP stats:\n\n");
-
seq_printf(s, "rx_bytes - \t%llu\n", qp->rx_bytes);
seq_printf(s, "rx_pkts - \t%llu\n", qp->rx_pkts);
seq_printf(s, "rx_memcpy - \t%llu\n", qp->rx_memcpy);
@@ -526,6 +522,17 @@ static int ntb_qp_debugfs_stats_show(struct seq_file *s, void *v)
seq_printf(s, "Using TX DMA - \t%s\n", qp->tx_dma_chan ? "Yes" : "No");
seq_printf(s, "Using RX DMA - \t%s\n", qp->rx_dma_chan ? "Yes" : "No");
seq_printf(s, "QP Link - \t%s\n", qp->link_is_up ? "Up" : "Down");
+}
+
+static int ntb_qp_debugfs_stats_show(struct seq_file *s, void *v)
+{
+ struct ntb_transport_qp *qp = s->private;
+
+ if (!qp || !qp->link_is_up)
+ return 0;
+
+ seq_puts(s, "\nNTB QP stats:\n\n");
+ qp->transport->backend_ops.debugfs_stats_show(s, qp);
seq_putc(s, '\n');
return 0;
@@ -583,8 +590,8 @@ static struct ntb_queue_entry *ntb_list_mv(spinlock_t *lock,
return entry;
}
-static int ntb_transport_setup_qp_mw(struct ntb_transport_ctx *nt,
- unsigned int qp_num)
+static int ntb_transport_default_setup_qp_mw(struct ntb_transport_ctx *nt,
+ unsigned int qp_num)
{
struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
struct ntb_transport_mw *mw;
@@ -1128,7 +1135,7 @@ static void ntb_transport_link_work(struct work_struct *work)
for (i = 0; i < nt->qp_count; i++) {
struct ntb_transport_qp *qp = &nt->qp_vec[i];
- ntb_transport_setup_qp_mw(nt, i);
+ nt->backend_ops.setup_qp_mw(nt, i);
ntb_transport_setup_qp_peer_msi(nt, i);
if (qp->client_ready)
@@ -1236,6 +1243,40 @@ static int ntb_transport_init_queue(struct ntb_transport_ctx *nt,
return 0;
}
+static unsigned int ntb_transport_default_tx_free_entry(struct ntb_transport_qp *qp)
+{
+ unsigned int head = qp->tx_index;
+ unsigned int tail = qp->remote_rx_info->entry;
+
+ return tail >= head ? tail - head : qp->tx_max_entry + tail - head;
+}
+
+static int ntb_transport_default_rx_enqueue(struct ntb_transport_qp *qp,
+ struct ntb_queue_entry *entry)
+{
+ ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_pend_q);
+
+ if (qp->active)
+ tasklet_schedule(&qp->rxc_db_work);
+
+ return 0;
+}
+
+static void ntb_transport_default_rx_poll(struct ntb_transport_qp *qp);
+static int ntb_transport_default_tx_enqueue(struct ntb_transport_qp *qp,
+ struct ntb_queue_entry *entry,
+ void *cb, void *data, unsigned int len,
+ unsigned int flags);
+
+static const struct ntb_transport_backend_ops default_backend_ops = {
+ .setup_qp_mw = ntb_transport_default_setup_qp_mw,
+ .tx_free_entry = ntb_transport_default_tx_free_entry,
+ .tx_enqueue = ntb_transport_default_tx_enqueue,
+ .rx_enqueue = ntb_transport_default_rx_enqueue,
+ .rx_poll = ntb_transport_default_rx_poll,
+ .debugfs_stats_show = ntb_transport_default_debugfs_stats_show,
+};
+
static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
{
struct ntb_transport_ctx *nt;
@@ -1270,6 +1311,8 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
nt->ndev = ndev;
+ nt->backend_ops = default_backend_ops;
+
/*
* If we are using MSI, and have at least one extra memory window,
* we will reserve the last MW for the MSI window.
@@ -1679,14 +1722,10 @@ static int ntb_process_rxc(struct ntb_transport_qp *qp)
return 0;
}
-static void ntb_transport_rxc_db(unsigned long data)
+static void ntb_transport_default_rx_poll(struct ntb_transport_qp *qp)
{
- struct ntb_transport_qp *qp = (void *)data;
int rc, i;
- dev_dbg(&qp->ndev->pdev->dev, "%s: doorbell %d received\n",
- __func__, qp->qp_num);
-
/* Limit the number of packets processed in a single interrupt to
* provide fairness to others
*/
@@ -1718,6 +1757,17 @@ static void ntb_transport_rxc_db(unsigned long data)
}
}
+static void ntb_transport_rxc_db(unsigned long data)
+{
+ struct ntb_transport_qp *qp = (void *)data;
+ struct ntb_transport_ctx *nt = qp->transport;
+
+ dev_dbg(&qp->ndev->pdev->dev, "%s: doorbell %d received\n",
+ __func__, qp->qp_num);
+
+ nt->backend_ops.rx_poll(qp);
+}
+
static void ntb_tx_copy_callback(void *data,
const struct dmaengine_result *res)
{
@@ -1887,9 +1937,18 @@ static void ntb_async_tx(struct ntb_transport_qp *qp,
qp->tx_memcpy++;
}
-static int ntb_process_tx(struct ntb_transport_qp *qp,
- struct ntb_queue_entry *entry)
+static int ntb_transport_default_tx_enqueue(struct ntb_transport_qp *qp,
+ struct ntb_queue_entry *entry,
+ void *cb, void *data, unsigned int len,
+ unsigned int flags)
{
+ entry->cb_data = cb;
+ entry->buf = data;
+ entry->len = len;
+ entry->flags = flags;
+ entry->errors = 0;
+ entry->tx_index = 0;
+
if (!ntb_transport_tx_free_entry(qp)) {
qp->tx_ring_full++;
return -EAGAIN;
@@ -1916,6 +1975,7 @@ static int ntb_process_tx(struct ntb_transport_qp *qp,
static void ntb_send_link_down(struct ntb_transport_qp *qp)
{
+ struct ntb_transport_ctx *nt = qp->transport;
struct pci_dev *pdev = qp->ndev->pdev;
struct ntb_queue_entry *entry;
int i, rc;
@@ -1935,12 +1995,7 @@ static void ntb_send_link_down(struct ntb_transport_qp *qp)
if (!entry)
return;
- entry->cb_data = NULL;
- entry->buf = NULL;
- entry->len = 0;
- entry->flags = LINK_DOWN_FLAG;
-
- rc = ntb_process_tx(qp, entry);
+ rc = nt->backend_ops.tx_enqueue(qp, entry, NULL, NULL, 0, LINK_DOWN_FLAG);
if (rc)
dev_err(&pdev->dev, "ntb: QP%d unable to send linkdown msg\n",
qp->qp_num);
@@ -2227,6 +2282,7 @@ EXPORT_SYMBOL_GPL(ntb_transport_rx_remove);
int ntb_transport_rx_enqueue(struct ntb_transport_qp *qp, void *cb, void *data,
unsigned int len)
{
+ struct ntb_transport_ctx *nt = qp->transport;
struct ntb_queue_entry *entry;
if (!qp)
@@ -2244,12 +2300,7 @@ int ntb_transport_rx_enqueue(struct ntb_transport_qp *qp, void *cb, void *data,
entry->errors = 0;
entry->rx_index = 0;
- ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_pend_q);
-
- if (qp->active)
- tasklet_schedule(&qp->rxc_db_work);
-
- return 0;
+ return nt->backend_ops.rx_enqueue(qp, entry);
}
EXPORT_SYMBOL_GPL(ntb_transport_rx_enqueue);
@@ -2269,6 +2320,7 @@ EXPORT_SYMBOL_GPL(ntb_transport_rx_enqueue);
int ntb_transport_tx_enqueue(struct ntb_transport_qp *qp, void *cb, void *data,
unsigned int len)
{
+ struct ntb_transport_ctx *nt = qp->transport;
struct ntb_queue_entry *entry;
int rc;
@@ -2285,15 +2337,7 @@ int ntb_transport_tx_enqueue(struct ntb_transport_qp *qp, void *cb, void *data,
return -EBUSY;
}
- entry->cb_data = cb;
- entry->buf = data;
- entry->len = len;
- entry->flags = 0;
- entry->errors = 0;
- entry->retries = 0;
- entry->tx_index = 0;
-
- rc = ntb_process_tx(qp, entry);
+ rc = nt->backend_ops.tx_enqueue(qp, entry, cb, data, len, 0);
if (rc)
ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry,
&qp->tx_free_q);
@@ -2415,10 +2459,9 @@ EXPORT_SYMBOL_GPL(ntb_transport_max_size);
unsigned int ntb_transport_tx_free_entry(struct ntb_transport_qp *qp)
{
- unsigned int head = qp->tx_index;
- unsigned int tail = qp->remote_rx_info->entry;
+ struct ntb_transport_ctx *nt = qp->transport;
- return tail >= head ? tail - head : qp->tx_max_entry + tail - head;
+ return nt->backend_ops.tx_free_entry(qp);
}
EXPORT_SYMBOL_GPL(ntb_transport_tx_free_entry);
diff --git a/include/linux/ntb_transport.h b/include/linux/ntb_transport.h
index 7243eb98a722..297099d42370 100644
--- a/include/linux/ntb_transport.h
+++ b/include/linux/ntb_transport.h
@@ -49,6 +49,8 @@
*/
struct ntb_transport_qp;
+struct ntb_transport_ctx;
+struct ntb_queue_entry;
struct ntb_transport_client {
struct device_driver driver;
@@ -84,3 +86,22 @@ void ntb_transport_link_up(struct ntb_transport_qp *qp);
void ntb_transport_link_down(struct ntb_transport_qp *qp);
bool ntb_transport_link_query(struct ntb_transport_qp *qp);
unsigned int ntb_transport_tx_free_entry(struct ntb_transport_qp *qp);
+
+/**
+ * struct ntb_transport_backend_ops - backend-specific transport hooks
+ * @setup_qp_mw: Set up memory windows for a given queue pair.
+ * @tx_free_entry: Return the number of free TX entries for the queue pair.
+ * @tx_enqueue: Backend-specific TX enqueue implementation.
+ * @rx_enqueue: Backend-specific RX enqueue implementation.
+ * @rx_poll: Poll for RX completions / push new RX buffers.
+ * @debugfs_stats_show: Dump backend-specific statistics, if any.
+ */
+struct ntb_transport_backend_ops {
+ int (*setup_qp_mw)(struct ntb_transport_ctx *nt, unsigned int qp_num);
+ unsigned int (*tx_free_entry)(struct ntb_transport_qp *qp);
+ int (*tx_enqueue)(struct ntb_transport_qp *qp, struct ntb_queue_entry *entry,
+ void *cb, void *data, unsigned int len, unsigned int flags);
+ int (*rx_enqueue)(struct ntb_transport_qp *qp, struct ntb_queue_entry *entry);
+ void (*rx_poll)(struct ntb_transport_qp *qp);
+ void (*debugfs_stats_show)(struct seq_file *s, struct ntb_transport_qp *qp);
+};
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (17 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 18/27] NTB: ntb_transport: Introduce ntb_transport_backend_ops Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-12-01 20:41 ` Frank Li
` (3 more replies)
2025-11-29 16:03 ` [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode Koichiro Den
` (8 subsequent siblings)
27 siblings, 4 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
for the MSI target address on every interrupt and tears it down again
via dw_pcie_ep_unmap_addr().
On systems that heavily use the AXI bridge interface (for example when
the integrated eDMA engine is active), this means the outbound iATU
registers are updated while traffic is in flight. The DesignWare
endpoint spec warns that updating iATU registers in this situation is
not supported, and the behavior is undefined.
Under high MSI and eDMA load this pattern results in occasional bogus
outbound transactions and IOMMU faults such as:
ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
followed by the system becoming unresponsive. This is actual output
observed on Renesas R-Car S4, where ipmmu_hc is used with PCIe channel 0.
There is no need to reprogram the iATU region used for MSI on every
interrupt. The host-provided MSI address is stable while MSI is enabled,
and the endpoint driver already dedicates a scratch buffer for MSI
generation.
Cache the aligned MSI address and map size, program the outbound iATU
once, and keep the window enabled. Subsequent interrupts only perform a
write to the MSI scratch buffer, avoiding dynamic iATU reprogramming in
the hot path and fixing the lockups seen under load.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
.../pci/controller/dwc/pcie-designware-ep.c | 48 ++++++++++++++++---
drivers/pci/controller/dwc/pcie-designware.h | 5 ++
2 files changed, 47 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
index 3780a9bd6f79..ef8ded34d9ab 100644
--- a/drivers/pci/controller/dwc/pcie-designware-ep.c
+++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
@@ -778,6 +778,16 @@ static void dw_pcie_ep_stop(struct pci_epc *epc)
struct dw_pcie_ep *ep = epc_get_drvdata(epc);
struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
+ /*
+ * Tear down the dedicated outbound window used for MSI
+ * generation. This avoids leaking an iATU window across
+ * endpoint stop/start cycles.
+ */
+ if (ep->msi_iatu_mapped) {
+ dw_pcie_ep_unmap_addr(epc, 0, 0, ep->msi_mem_phys);
+ ep->msi_iatu_mapped = false;
+ }
+
dw_pcie_stop_link(pci);
}
@@ -881,14 +891,37 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 func_no,
msg_addr = ((u64)msg_addr_upper) << 32 | msg_addr_lower;
msg_addr = dw_pcie_ep_align_addr(epc, msg_addr, &map_size, &offset);
- ret = dw_pcie_ep_map_addr(epc, func_no, 0, ep->msi_mem_phys, msg_addr,
- map_size);
- if (ret)
- return ret;
- writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
+ /*
+ * Program the outbound iATU once and keep it enabled.
+ *
+ * The spec warns that updating iATU registers while there are
+ * operations in flight on the AXI bridge interface is not
+ * supported, so we avoid reprogramming the region on every MSI,
+ * specifically unmapping immediately after writel().
+ */
+ if (!ep->msi_iatu_mapped) {
+ ret = dw_pcie_ep_map_addr(epc, func_no, 0,
+ ep->msi_mem_phys, msg_addr,
+ map_size);
+ if (ret)
+ return ret;
- dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);
+ ep->msi_iatu_mapped = true;
+ ep->msi_msg_addr = msg_addr;
+ ep->msi_map_size = map_size;
+ } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
+ ep->msi_map_size != map_size)) {
+ /*
+ * The host changed the MSI target address or the required
+ * mapping size. Reprogramming the iATU at runtime is unsafe
+ * on this controller, so bail out instead of trying to update
+ * the existing region.
+ */
+ return -EINVAL;
+ }
+
+ writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
return 0;
}
@@ -1268,6 +1301,9 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
INIT_LIST_HEAD(&ep->func_list);
INIT_LIST_HEAD(&ep->ib_map_list);
spin_lock_init(&ep->ib_map_lock);
+ ep->msi_iatu_mapped = false;
+ ep->msi_msg_addr = 0;
+ ep->msi_map_size = 0;
epc = devm_pci_epc_create(dev, &epc_ops);
if (IS_ERR(epc)) {
diff --git a/drivers/pci/controller/dwc/pcie-designware.h b/drivers/pci/controller/dwc/pcie-designware.h
index 269a9fe0501f..1770a2318557 100644
--- a/drivers/pci/controller/dwc/pcie-designware.h
+++ b/drivers/pci/controller/dwc/pcie-designware.h
@@ -481,6 +481,11 @@ struct dw_pcie_ep {
void __iomem *msi_mem;
phys_addr_t msi_mem_phys;
struct pci_epf_bar *epf_bar[PCI_STD_NUM_BARS];
+
+ /* MSI outbound iATU state */
+ bool msi_iatu_mapped;
+ u64 msi_msg_addr;
+ size_t msi_map_size;
};
struct dw_pcie_ops {
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (18 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-12-01 21:41 ` Frank Li
2025-12-01 21:46 ` Dave Jiang
2025-11-29 16:03 ` [RFC PATCH v2 21/27] NTB: epf: Provide db_vector_count/db_vector_mask callbacks Koichiro Den
` (7 subsequent siblings)
27 siblings, 2 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Add a new transport backend that uses a remote DesignWare eDMA engine
located on the NTB endpoint to move data between host and endpoint.
In this mode:
- The endpoint exposes a dedicated memory window that contains the
eDMA register block followed by a small control structure (struct
ntb_edma_info) and per-channel linked-list (LL) rings.
- On the endpoint side, ntb_edma_setup_mws() allocates the control
structure and LL rings in endpoint memory, then programs an inbound
iATU region so that the host can access them via a peer MW.
- On the host side, ntb_edma_setup_peer() ioremaps the peer MW, reads
ntb_edma_info and configures a dw-edma DMA device to use the LL
rings provided by the endpoint.
- ntb_transport is extended with a new backend_ops implementation that
routes TX and RX enqueue/poll operations through the remote eDMA
rings while keeping the existing shared-memory backend intact.
- The host signals the endpoint via a dedicated DMA read channel.
The 'use_msi' module parameter is ignored when 'use_remote_edma=1'.
The new mode is guarded by a Kconfig option (NTB_TRANSPORT_EDMA) and a
module parameter (use_remote_edma). When disabled, the existing
ntb_transport behaviour is unchanged.
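For reference, a minimal way to build and select the new backend looks
like the following (the config symbol and module parameter are the ones
added by this patch; the exact load sequence is only illustrative):

	# .config fragment
	CONFIG_NTB_TRANSPORT=m
	CONFIG_NTB_TRANSPORT_EDMA=y

	# on both RC and EP, once the underlying NTB device is bound
	modprobe ntb_transport use_remote_edma=1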
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/ntb/Kconfig | 11 +
drivers/ntb/Makefile | 3 +
drivers/ntb/ntb_edma.c | 628 ++++++++
drivers/ntb/ntb_edma.h | 128 ++
.../{ntb_transport.c => ntb_transport_core.c} | 1281 ++++++++++++++++-
5 files changed, 2048 insertions(+), 3 deletions(-)
create mode 100644 drivers/ntb/ntb_edma.c
create mode 100644 drivers/ntb/ntb_edma.h
rename drivers/ntb/{ntb_transport.c => ntb_transport_core.c} (65%)
diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
index df16c755b4da..db63f02bb116 100644
--- a/drivers/ntb/Kconfig
+++ b/drivers/ntb/Kconfig
@@ -37,4 +37,15 @@ config NTB_TRANSPORT
If unsure, say N.
+config NTB_TRANSPORT_EDMA
+ bool "NTB Transport backed by remote eDMA"
+ depends on NTB_TRANSPORT
+ depends on PCI
+ select DMA_ENGINE
+ help
+ Enable a transport backend that uses a remote DesignWare eDMA engine
+ exposed through a dedicated NTB memory window. The host uses the
+ endpoint's eDMA engine to move data in both directions.
+ Say Y here if you intend to use the 'use_remote_edma' module parameter.
+
endif # NTB
diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
index 3a6fa181ff99..51f0e1e3aec7 100644
--- a/drivers/ntb/Makefile
+++ b/drivers/ntb/Makefile
@@ -4,3 +4,6 @@ obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
ntb-y := core.o
ntb-$(CONFIG_NTB_MSI) += msi.o
+
+ntb_transport-y := ntb_transport_core.o
+ntb_transport-$(CONFIG_NTB_TRANSPORT_EDMA) += ntb_edma.o
diff --git a/drivers/ntb/ntb_edma.c b/drivers/ntb/ntb_edma.c
new file mode 100644
index 000000000000..cb35e0d56aa8
--- /dev/null
+++ b/drivers/ntb/ntb_edma.c
@@ -0,0 +1,628 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/pci.h>
+#include <linux/ntb.h>
+#include <linux/io.h>
+#include <linux/iommu.h>
+#include <linux/dmaengine.h>
+#include <linux/pci-epc.h>
+#include <linux/dma/edma.h>
+#include <linux/irq.h>
+#include <linux/irqdomain.h>
+#include <linux/of.h>
+#include <linux/of_irq.h>
+#include <dt-bindings/interrupt-controller/arm-gic.h>
+
+#include "ntb_edma.h"
+
+/*
+ * The interrupt register offsets below are taken from the DesignWare
+ * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
+ * backend currently only supports this layout.
+ */
+#define DMA_WRITE_INT_STATUS_OFF 0x4c
+#define DMA_WRITE_INT_MASK_OFF 0x54
+#define DMA_WRITE_INT_CLEAR_OFF 0x58
+#define DMA_READ_INT_STATUS_OFF 0xa0
+#define DMA_READ_INT_MASK_OFF 0xa8
+#define DMA_READ_INT_CLEAR_OFF 0xac
+
+#define NTB_EDMA_NOTIFY_MAX_QP 64
+
+static unsigned int edma_spi = 417; /* 0x1a1 */
+module_param(edma_spi, uint, 0644);
+MODULE_PARM_DESC(edma_spi, "SPI number used by remote eDMA interrupt (EP local)");
+
+static u64 edma_regs_phys = 0xe65d5000;
+module_param(edma_regs_phys, ullong, 0644);
+MODULE_PARM_DESC(edma_regs_phys, "Physical base address of local eDMA registers (EP)");
+
+static unsigned long edma_regs_size = 0x1200;
+module_param(edma_regs_size, ulong, 0644);
+MODULE_PARM_DESC(edma_regs_size, "Size of the local eDMA register space (EP)");
+
+struct ntb_edma_intr {
+ u32 db[NTB_EDMA_NOTIFY_MAX_QP];
+};
+
+struct ntb_edma_ctx {
+ void *ll_wr_virt[EDMA_WR_CH_NUM];
+ dma_addr_t ll_wr_phys[EDMA_WR_CH_NUM];
+ void *ll_rd_virt[EDMA_RD_CH_NUM + 1];
+ dma_addr_t ll_rd_phys[EDMA_RD_CH_NUM + 1];
+
+ struct ntb_edma_intr *intr_ep_virt;
+ dma_addr_t intr_ep_phys;
+ struct ntb_edma_intr *intr_rc_virt;
+ dma_addr_t intr_rc_phys;
+ u32 notify_qp_max;
+
+ bool initialized;
+};
+
+static struct ntb_edma_ctx edma_ctx;
+
+typedef void (*ntb_edma_interrupt_cb_t)(void *data, int qp_num);
+
+struct ntb_edma_interrupt {
+ int virq;
+ void __iomem *base;
+ ntb_edma_interrupt_cb_t cb;
+ void *data;
+};
+
+static struct ntb_edma_interrupt ntb_edma_intr;
+
+static int ntb_edma_map_spi_to_virq(struct device *dev, unsigned int spi)
+{
+ struct device_node *np = dev_of_node(dev);
+ struct device_node *parent;
+ struct irq_fwspec fwspec = { 0 };
+ int virq;
+
+ parent = of_irq_find_parent(np);
+ if (!parent)
+ return -ENODEV;
+
+ fwspec.fwnode = of_fwnode_handle(parent);
+ fwspec.param_count = 3;
+ fwspec.param[0] = GIC_SPI;
+ fwspec.param[1] = spi;
+ fwspec.param[2] = IRQ_TYPE_LEVEL_HIGH;
+
+ virq = irq_create_fwspec_mapping(&fwspec);
+ of_node_put(parent);
+ return (virq > 0) ? virq : -EINVAL;
+}
+
+static irqreturn_t ntb_edma_isr(int irq, void *data)
+{
+ struct ntb_edma_interrupt *v = data;
+ u32 mask = BIT(EDMA_RD_CH_NUM);
+ u32 i, val;
+
+ /*
+ * We do not ack interrupts here but instead we mask all local interrupt
+ * sources except the read channel used for notification. This reduces
+ * needless ISR invocations.
+ *
+ * In theory we could configure LIE=1/RIE=0 only for the notification
+ * transfer (keeping all other channels at LIE=1/RIE=1), but that would
+ * require intrusive changes to the dw-edma core.
+ *
+ * Note: The host side may have already cleared the read interrupt used
+ * for notification, so reading DMA_READ_INT_CLEAR_OFF is not a reliable
+ * way to detect it. As a result, we cannot reliably tell which specific
+ * channel triggered this interrupt; we consult intr_ep_virt->db[i]
+ * instead.
+ */
+ iowrite32(~0x0, v->base + DMA_WRITE_INT_MASK_OFF);
+ iowrite32(~mask, v->base + DMA_READ_INT_MASK_OFF);
+
+ if (!v->cb || !edma_ctx.intr_ep_virt)
+ return IRQ_HANDLED;
+
+ for (i = 0; i < edma_ctx.notify_qp_max; i++) {
+ val = READ_ONCE(edma_ctx.intr_ep_virt->db[i]);
+ if (!val)
+ continue;
+
+ WRITE_ONCE(edma_ctx.intr_ep_virt->db[i], 0);
+ v->cb(v->data, i);
+ }
+
+ return IRQ_HANDLED;
+}
+
+int ntb_edma_setup_isr(struct device *dev, struct device *epc_dev,
+ ntb_edma_interrupt_cb_t cb, void *data)
+{
+ struct ntb_edma_interrupt *v = &ntb_edma_intr;
+ int virq = ntb_edma_map_spi_to_virq(epc_dev->parent, edma_spi);
+ int ret;
+
+ if (virq < 0) {
+ dev_err(dev, "failed to get virq (%d)\n", virq);
+ return virq;
+ }
+
+ v->virq = virq;
+ v->cb = cb;
+ v->data = data;
+ if (edma_regs_phys && !v->base)
+ v->base = devm_ioremap(dev, edma_regs_phys, edma_regs_size);
+ if (!v->base) {
+ dev_err(dev, "failed to setup v->base\n");
+ return -1;
+ }
+ ret = devm_request_irq(dev, v->virq, ntb_edma_isr, 0, "ntb-edma", v);
+ if (ret)
+ return ret;
+
+ if (v->base) {
+ iowrite32(0x0, v->base + DMA_WRITE_INT_MASK_OFF);
+ iowrite32(0x0, v->base + DMA_READ_INT_MASK_OFF);
+ }
+ return 0;
+}
+
+void ntb_edma_teardown_isr(struct device *dev)
+{
+ struct ntb_edma_interrupt *v = &ntb_edma_intr;
+
+ /* Mask all write/read interrupts so we don't get called again. */
+ if (v->base) {
+ iowrite32(~0x0, v->base + DMA_WRITE_INT_MASK_OFF);
+ iowrite32(~0x0, v->base + DMA_READ_INT_MASK_OFF);
+ }
+
+ if (v->virq > 0)
+ devm_free_irq(dev, v->virq, v);
+
+ if (v->base)
+ devm_iounmap(dev, v->base);
+
+ v->virq = 0;
+ v->cb = NULL;
+ v->data = NULL;
+}
+
+int ntb_edma_setup_mws(struct ntb_dev *ndev)
+{
+ const size_t info_bytes = PAGE_SIZE;
+ resource_size_t size_max, offset;
+ dma_addr_t intr_phys, info_phys;
+ u32 wr_done = 0, rd_done = 0;
+ struct ntb_edma_intr *intr;
+ struct ntb_edma_info *info;
+ int peer_mw, mw_index, rc;
+ struct iommu_domain *dom;
+ bool reg_mapped = false;
+ size_t ll_bytes, size;
+ struct pci_epc *epc;
+ struct device *dev;
+ unsigned long iova;
+ phys_addr_t phys;
+ u64 need;
+ u32 i;
+
+ /* +1 is for the notification interrupt channel */
+ ll_bytes = (EDMA_WR_CH_NUM + EDMA_RD_CH_NUM + 1) * DMA_LLP_MEM_SIZE;
+ need = EDMA_REG_SIZE + info_bytes + ll_bytes;
+
+ epc = ntb_get_pci_epc(ndev);
+ if (!epc)
+ return -ENODEV;
+ dev = epc->dev.parent;
+
+ if (edma_ctx.initialized)
+ return 0;
+
+ info = dma_alloc_coherent(dev, info_bytes, &info_phys, GFP_KERNEL);
+ if (!info)
+ return -ENOMEM;
+
+ memset(info, 0, info_bytes);
+ info->magic = NTB_EDMA_INFO_MAGIC;
+ info->wr_cnt = EDMA_WR_CH_NUM;
+ info->rd_cnt = EDMA_RD_CH_NUM + 1; /* +1 for the notification interrupt */
+ info->regs_phys = edma_regs_phys;
+ info->ll_stride = DMA_LLP_MEM_SIZE;
+
+ for (i = 0; i < EDMA_WR_CH_NUM; i++) {
+ edma_ctx.ll_wr_virt[i] = dma_alloc_attrs(dev, DMA_LLP_MEM_SIZE,
+ &edma_ctx.ll_wr_phys[i],
+ GFP_KERNEL,
+ DMA_ATTR_FORCE_CONTIGUOUS);
+ if (!edma_ctx.ll_wr_virt[i]) {
+ rc = -ENOMEM;
+ goto err_free_ll;
+ }
+ wr_done++;
+ info->ll_wr_phys[i] = edma_ctx.ll_wr_phys[i];
+ }
+ for (i = 0; i < EDMA_RD_CH_NUM + 1; i++) {
+ edma_ctx.ll_rd_virt[i] = dma_alloc_attrs(dev, DMA_LLP_MEM_SIZE,
+ &edma_ctx.ll_rd_phys[i],
+ GFP_KERNEL,
+ DMA_ATTR_FORCE_CONTIGUOUS);
+ if (!edma_ctx.ll_rd_virt[i]) {
+ rc = -ENOMEM;
+ goto err_free_ll;
+ }
+ rd_done++;
+ info->ll_rd_phys[i] = edma_ctx.ll_rd_phys[i];
+ }
+
+ /* For notification interrupts */
+ edma_ctx.notify_qp_max = NTB_EDMA_NOTIFY_MAX_QP;
+ intr = dma_alloc_coherent(dev, sizeof(*intr), &intr_phys, GFP_KERNEL);
+ if (!intr) {
+ rc = -ENOMEM;
+ goto err_free_ll;
+ }
+ memset(intr, 0, sizeof(*intr));
+ edma_ctx.intr_ep_virt = intr;
+ edma_ctx.intr_ep_phys = intr_phys;
+ info->intr_dar_base = intr_phys;
+
+ peer_mw = ntb_peer_mw_count(ndev);
+ if (peer_mw <= 0) {
+ rc = -ENODEV;
+ goto err_free_ll;
+ }
+
+ mw_index = peer_mw - 1; /* last MW */
+
+ rc = ntb_mw_get_align(ndev, 0, mw_index, 0, NULL, &size_max,
+ &offset);
+ if (rc)
+ goto err_free_ll;
+
+ if (size_max < need) {
+ rc = -ENOSPC;
+ goto err_free_ll;
+ }
+
+ /* Map register space (direct) */
+ dom = iommu_get_domain_for_dev(dev);
+ if (dom) {
+ phys = edma_regs_phys & PAGE_MASK;
+ size = PAGE_ALIGN(EDMA_REG_SIZE + edma_regs_phys - phys);
+ iova = phys;
+
+ rc = iommu_map(dom, iova, phys, EDMA_REG_SIZE,
+ IOMMU_READ | IOMMU_WRITE | IOMMU_MMIO, GFP_KERNEL);
+ if (rc)
+ dev_err(&ndev->dev, "failed to create direct mapping for eDMA reg space\n");
+ reg_mapped = true;
+ }
+
+ rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_regs_phys, EDMA_REG_SIZE, offset);
+ if (rc)
+ goto err_unmap_reg;
+
+ offset += EDMA_REG_SIZE;
+
+ /* Map ntb_edma_info */
+ rc = ntb_mw_set_trans(ndev, 0, mw_index, info_phys, info_bytes, offset);
+ if (rc)
+ goto err_clear_trans;
+ offset += info_bytes;
+
+ /* Map LL location */
+ for (i = 0; i < EDMA_WR_CH_NUM; i++) {
+ rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_ctx.ll_wr_phys[i],
+ DMA_LLP_MEM_SIZE, offset);
+ if (rc)
+ goto err_clear_trans;
+ offset += DMA_LLP_MEM_SIZE;
+ }
+ for (i = 0; i < EDMA_RD_CH_NUM + 1; i++) {
+ rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_ctx.ll_rd_phys[i],
+ DMA_LLP_MEM_SIZE, offset);
+ if (rc)
+ goto err_clear_trans;
+ offset += DMA_LLP_MEM_SIZE;
+ }
+ edma_ctx.initialized = true;
+
+ return 0;
+
+err_clear_trans:
+ /*
+ * Tear down the NTB translation window used for the eDMA MW.
+ * There is no sub-range clear API for ntb_mw_set_trans(), so we
+ * unconditionally drop the whole mapping on error.
+ */
+ ntb_mw_clear_trans(ndev, 0, mw_index);
+
+err_unmap_reg:
+ if (reg_mapped)
+ iommu_unmap(dom, iova, size);
+err_free_ll:
+ while (rd_done--)
+ dma_free_attrs(dev, DMA_LLP_MEM_SIZE,
+ edma_ctx.ll_rd_virt[rd_done],
+ edma_ctx.ll_rd_phys[rd_done],
+ DMA_ATTR_FORCE_CONTIGUOUS);
+ while (wr_done--)
+ dma_free_attrs(dev, DMA_LLP_MEM_SIZE,
+ edma_ctx.ll_wr_virt[wr_done],
+ edma_ctx.ll_wr_phys[wr_done],
+ DMA_ATTR_FORCE_CONTIGUOUS);
+ if (edma_ctx.intr_ep_virt)
+ dma_free_coherent(dev, sizeof(struct ntb_edma_intr),
+ edma_ctx.intr_ep_virt,
+ edma_ctx.intr_ep_phys);
+ dma_free_coherent(dev, info_bytes, info, info_phys);
+ return rc;
+}
+
+static int ntb_edma_irq_vector(struct device *dev, unsigned int nr)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ int ret, nvec;
+
+ nvec = pci_msi_vec_count(pdev);
+ for (; nr < nvec; nr++) {
+ ret = pci_irq_vector(pdev, nr);
+ if (!irq_has_action(ret))
+ return ret;
+ }
+ return 0;
+}
+
+static const struct dw_edma_plat_ops ntb_edma_ops = {
+ .irq_vector = ntb_edma_irq_vector,
+};
+
+int ntb_edma_setup_peer(struct ntb_dev *ndev)
+{
+ struct ntb_edma_info *info;
+ unsigned int wr_cnt, rd_cnt;
+ struct dw_edma_chip *chip;
+ void __iomem *edma_virt;
+ phys_addr_t edma_phys;
+ resource_size_t mw_size;
+ u64 off = EDMA_REG_SIZE;
+ int peer_mw, mw_index;
+ unsigned int i;
+ int ret;
+
+ peer_mw = ntb_peer_mw_count(ndev);
+ if (peer_mw <= 0)
+ return -ENODEV;
+
+ mw_index = peer_mw - 1; /* last MW */
+
+ ret = ntb_peer_mw_get_addr(ndev, mw_index, &edma_phys,
+ &mw_size);
+ if (ret)
+ return ret;
+
+ edma_virt = ioremap(edma_phys, mw_size);
+ if (!edma_virt)
+ return -ENOMEM;
+
+ chip = devm_kzalloc(&ndev->dev, sizeof(*chip), GFP_KERNEL);
+ if (!chip) {
+ ret = -ENOMEM;
+ return ret;
+ }
+
+ chip->dev = &ndev->pdev->dev;
+ chip->nr_irqs = 4;
+ chip->ops = &ntb_edma_ops;
+ chip->flags = 0;
+ chip->reg_base = edma_virt;
+ chip->mf = EDMA_MF_EDMA_UNROLL;
+
+ info = edma_virt + off;
+ if (info->magic != NTB_EDMA_INFO_MAGIC)
+ return -EINVAL;
+ wr_cnt = info->wr_cnt;
+ rd_cnt = info->rd_cnt;
+ chip->ll_wr_cnt = wr_cnt;
+ chip->ll_rd_cnt = rd_cnt;
+ off += PAGE_SIZE;
+
+ edma_ctx.notify_qp_max = NTB_EDMA_NOTIFY_MAX_QP;
+ edma_ctx.intr_ep_phys = info->intr_dar_base;
+ if (edma_ctx.intr_ep_phys) {
+ edma_ctx.intr_rc_virt =
+ dma_alloc_coherent(&ndev->pdev->dev,
+ sizeof(struct ntb_edma_intr),
+ &edma_ctx.intr_rc_phys,
+ GFP_KERNEL);
+ if (!edma_ctx.intr_rc_virt)
+ return -ENOMEM;
+ memset(edma_ctx.intr_rc_virt, 0,
+ sizeof(struct ntb_edma_intr));
+ }
+
+ for (i = 0; i < wr_cnt; i++) {
+ chip->ll_region_wr[i].vaddr.io = edma_virt + off;
+ chip->ll_region_wr[i].paddr = info->ll_wr_phys[i];
+ chip->ll_region_wr[i].sz = DMA_LLP_MEM_SIZE;
+ off += DMA_LLP_MEM_SIZE;
+ }
+ for (i = 0; i < rd_cnt; i++) {
+ chip->ll_region_rd[i].vaddr.io = edma_virt + off;
+ chip->ll_region_rd[i].paddr = info->ll_rd_phys[i];
+ chip->ll_region_rd[i].sz = DMA_LLP_MEM_SIZE;
+ off += DMA_LLP_MEM_SIZE;
+ }
+
+ if (!pci_dev_msi_enabled(ndev->pdev))
+ return -ENXIO;
+
+ ret = dw_edma_probe(chip);
+ if (ret) {
+ dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
+ return ret;
+ }
+
+ return 0;
+}
+
+struct ntb_edma_filter {
+ struct device *dma_dev;
+ u32 direction;
+};
+
+static bool ntb_edma_filter_fn(struct dma_chan *chan, void *arg)
+{
+ struct ntb_edma_filter *filter = arg;
+ u32 dir = filter->direction;
+ struct dma_slave_caps caps;
+ int ret;
+
+ if (chan->device->dev != filter->dma_dev)
+ return false;
+
+ ret = dma_get_slave_caps(chan, &caps);
+ if (ret < 0)
+ return false;
+
+ return !!(caps.directions & dir);
+}
+
+void ntb_edma_teardown_chans(struct ntb_edma_chans *edma)
+{
+ unsigned int i;
+
+ for (i = 0; i < edma->num_wr_chan; i++)
+ dma_release_channel(edma->wr_chan[i]);
+
+ for (i = 0; i < edma->num_rd_chan; i++)
+ dma_release_channel(edma->rd_chan[i]);
+
+ if (edma->intr_chan)
+ dma_release_channel(edma->intr_chan);
+}
+
+int ntb_edma_setup_chans(struct device *dma_dev, struct ntb_edma_chans *edma)
+{
+ struct ntb_edma_filter filter;
+ dma_cap_mask_t dma_mask;
+ unsigned int i;
+
+ dma_cap_zero(dma_mask);
+ dma_cap_set(DMA_SLAVE, dma_mask);
+
+ memset(edma, 0, sizeof(*edma));
+ edma->dev = dma_dev;
+
+ filter.dma_dev = dma_dev;
+ filter.direction = BIT(DMA_DEV_TO_MEM);
+ for (i = 0; i < EDMA_WR_CH_NUM; i++) {
+ edma->wr_chan[i] = dma_request_channel(dma_mask,
+ ntb_edma_filter_fn,
+ &filter);
+ if (!edma->wr_chan[i])
+ break;
+ edma->num_wr_chan++;
+ }
+
+ filter.direction = BIT(DMA_MEM_TO_DEV);
+ for (i = 0; i < EDMA_RD_CH_NUM; i++) {
+ edma->rd_chan[i] = dma_request_channel(dma_mask,
+ ntb_edma_filter_fn,
+ &filter);
+ if (!edma->rd_chan[i])
+ break;
+ edma->num_rd_chan++;
+ }
+
+ edma->intr_chan = dma_request_channel(dma_mask, ntb_edma_filter_fn,
+ &filter);
+ if (!edma->intr_chan)
+ dev_warn(dma_dev,
+ "Remote eDMA notify channel could not be allocated\n");
+
+ if (!edma->num_wr_chan || !edma->num_rd_chan) {
+ dev_warn(dma_dev, "Remote eDMA channels failed to initialize\n");
+ ntb_edma_teardown_chans(edma);
+ return -ENODEV;
+ }
+ return 0;
+}
+
+struct dma_chan *ntb_edma_pick_chan(struct ntb_edma_chans *edma,
+ remote_edma_dir_t dir)
+{
+ unsigned int n, cur, idx;
+ struct dma_chan **chans;
+ atomic_t *cur_chan;
+
+ if (dir == REMOTE_EDMA_WRITE) {
+ n = edma->num_wr_chan;
+ chans = edma->wr_chan;
+ cur_chan = &edma->cur_wr_chan;
+ } else {
+ n = edma->num_rd_chan;
+ chans = edma->rd_chan;
+ cur_chan = &edma->cur_rd_chan;
+ }
+ if (WARN_ON_ONCE(!n))
+ return NULL;
+
+ /* Simple round-robin */
+ cur = (unsigned int)atomic_inc_return(cur_chan) - 1;
+ idx = cur % n;
+ return chans[idx];
+}
+
+int ntb_edma_notify_peer(struct ntb_edma_chans *edma, int qp_num)
+{
+ struct dma_async_tx_descriptor *txd;
+ struct dma_slave_config cfg;
+ struct scatterlist sgl;
+ dma_cookie_t cookie;
+ struct device *dev;
+
+ if (!edma || !edma->intr_chan)
+ return -ENXIO;
+
+ if (qp_num < 0 || qp_num >= edma_ctx.notify_qp_max)
+ return -EINVAL;
+
+ if (!edma_ctx.intr_rc_virt || !edma_ctx.intr_ep_phys)
+ return -EINVAL;
+
+ dev = edma->dev;
+ if (!dev)
+ return -ENODEV;
+
+ WRITE_ONCE(edma_ctx.intr_rc_virt->db[qp_num], 1);
+
+ /* Ensure store is visible before kicking the DMA transfer */
+ wmb();
+
+ sg_init_table(&sgl, 1);
+ sg_dma_address(&sgl) = edma_ctx.intr_rc_phys + qp_num * sizeof(u32);
+ sg_dma_len(&sgl) = sizeof(u32);
+
+ memset(&cfg, 0, sizeof(cfg));
+ cfg.dst_addr = edma_ctx.intr_ep_phys + qp_num * sizeof(u32);
+ cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
+ cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
+ cfg.direction = DMA_MEM_TO_DEV;
+
+ if (dmaengine_slave_config(edma->intr_chan, &cfg))
+ return -EINVAL;
+
+ txd = dmaengine_prep_slave_sg(edma->intr_chan, &sgl, 1,
+ DMA_MEM_TO_DEV,
+ DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
+ if (!txd)
+ return -ENOSPC;
+
+ cookie = dmaengine_submit(txd);
+ if (dma_submit_error(cookie))
+ return -ENOSPC;
+
+ dma_async_issue_pending(edma->intr_chan);
+ return 0;
+}
diff --git a/drivers/ntb/ntb_edma.h b/drivers/ntb/ntb_edma.h
new file mode 100644
index 000000000000..da0451827edb
--- /dev/null
+++ b/drivers/ntb/ntb_edma.h
@@ -0,0 +1,128 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+#ifndef _NTB_EDMA_H_
+#define _NTB_EDMA_H_
+
+#include <linux/completion.h>
+#include <linux/device.h>
+#include <linux/interrupt.h>
+
+#define EDMA_REG_SIZE SZ_64K
+#define DMA_LLP_MEM_SIZE SZ_4K
+#define EDMA_WR_CH_NUM 4
+#define EDMA_RD_CH_NUM 4
+#define NTB_EDMA_MAX_CH 8
+
+#define NTB_EDMA_INFO_MAGIC 0x45444D41 /* "EDMA" */
+#define NTB_EDMA_INFO_OFF EDMA_REG_SIZE
+
+#define NTB_EDMA_RING_ORDER 7
+#define NTB_EDMA_RING_ENTRIES (1U << NTB_EDMA_RING_ORDER)
+#define NTB_EDMA_RING_MASK (NTB_EDMA_RING_ENTRIES - 1)
+
+typedef void (*ntb_edma_interrupt_cb_t)(void *data, int qp_num);
+
+/*
+ * REMOTE_EDMA_EP:
+ * Endpoint owns the eDMA engine and pushes descriptors into a shared MW.
+ *
+ * REMOTE_EDMA_RC:
+ * Root Complex controls the endpoint eDMA through the shared MW and
+ * drives reads/writes on behalf of the host.
+ */
+typedef enum {
+ REMOTE_EDMA_UNKNOWN,
+ REMOTE_EDMA_EP,
+ REMOTE_EDMA_RC,
+} remote_edma_mode_t;
+
+typedef enum {
+ REMOTE_EDMA_WRITE,
+ REMOTE_EDMA_READ,
+} remote_edma_dir_t;
+
+/*
+ * Layout of remote eDMA MW (EP local address space, RC sees via peer MW):
+ *
+ * 0 .. EDMA_REG_SIZE-1 : DesignWare eDMA registers
+ * EDMA_REG_SIZE .. +PAGE_SIZE : struct ntb_edma_info (EP writes, RC reads)
+ * +PAGE_SIZE .. : LL ring buffers (EP allocates phys addresses,
+ * RC configures via dw_edma)
+ *
+ * ntb_edma_setup_mws() on EP:
+ * - allocates ntb_edma_info and LLs in EP memory
+ * - programs inbound iATU so that RC peer MW[n] points at this block
+ *
+ * ntb_edma_setup_peer() on RC:
+ * - ioremaps peer MW[n]
+ * - reads ntb_edma_info
+ * - sets up dw_edma_chip ll_region_* from that info
+ */
+struct ntb_edma_info {
+ u32 magic;
+ u16 wr_cnt;
+ u16 rd_cnt;
+ u64 regs_phys;
+ u32 ll_stride;
+ u32 rsvd;
+ u64 ll_wr_phys[NTB_EDMA_MAX_CH];
+ u64 ll_rd_phys[NTB_EDMA_MAX_CH];
+
+ u64 intr_dar_base;
+} __packed;
+
+struct ll_dma_addrs {
+ dma_addr_t wr[EDMA_WR_CH_NUM];
+ dma_addr_t rd[EDMA_RD_CH_NUM];
+};
+
+struct ntb_edma_chans {
+ struct device *dev;
+
+ struct dma_chan *wr_chan[EDMA_WR_CH_NUM];
+ struct dma_chan *rd_chan[EDMA_RD_CH_NUM];
+ struct dma_chan *intr_chan;
+
+ unsigned int num_wr_chan;
+ unsigned int num_rd_chan;
+ atomic_t cur_wr_chan;
+ atomic_t cur_rd_chan;
+};
+
+static __always_inline u32 ntb_edma_ring_idx(u32 v)
+{
+ return v & NTB_EDMA_RING_MASK;
+}
+
+static __always_inline u32 ntb_edma_ring_used_entry(u32 head, u32 tail)
+{
+ if (head >= tail) {
+ WARN_ON_ONCE((head - tail) > (NTB_EDMA_RING_ENTRIES - 1));
+ return head - tail;
+ }
+
+ WARN_ON_ONCE((U32_MAX - tail + head + 1) > (NTB_EDMA_RING_ENTRIES - 1));
+ return U32_MAX - tail + head + 1;
+}
+
+static __always_inline u32 ntb_edma_ring_free_entry(u32 head, u32 tail)
+{
+ return NTB_EDMA_RING_ENTRIES - ntb_edma_ring_used_entry(head, tail) - 1;
+}
+
+static __always_inline bool ntb_edma_ring_full(u32 head, u32 tail)
+{
+ return ntb_edma_ring_free_entry(head, tail) == 0;
+}
+
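+/*
+ * For example, with NTB_EDMA_RING_ENTRIES == 128, head == 130 and
+ * tail == 5 give used == 125 and free == 2. One slot is always left
+ * unused so that a completely full ring (free == 0) can be told apart
+ * from an empty one (used == 0).
+ */
+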
+int ntb_edma_setup_isr(struct device *dev, struct device *epc_dev,
+ ntb_edma_interrupt_cb_t cb, void *data);
+void ntb_edma_teardown_isr(struct device *dev);
+int ntb_edma_setup_mws(struct ntb_dev *ndev);
+int ntb_edma_setup_peer(struct ntb_dev *ndev);
+int ntb_edma_setup_chans(struct device *dma_dev, struct ntb_edma_chans *edma);
+struct dma_chan *ntb_edma_pick_chan(struct ntb_edma_chans *edma,
+ remote_edma_dir_t dir);
+void ntb_edma_teardown_chans(struct ntb_edma_chans *edma);
+int ntb_edma_notify_peer(struct ntb_edma_chans *edma, int qp_num);
+
+#endif
diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport_core.c
similarity index 65%
rename from drivers/ntb/ntb_transport.c
rename to drivers/ntb/ntb_transport_core.c
index 907db6c93d4d..48d48921978d 100644
--- a/drivers/ntb/ntb_transport.c
+++ b/drivers/ntb/ntb_transport_core.c
@@ -47,6 +47,9 @@
* Contact Information:
* Jon Mason <jon.mason@intel.com>
*/
+#include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/compiler.h>
#include <linux/debugfs.h>
#include <linux/delay.h>
#include <linux/dmaengine.h>
@@ -71,6 +74,8 @@
#define NTB_TRANSPORT_DESC "Software Queue-Pair Transport over NTB"
#define NTB_TRANSPORT_MIN_SPADS (MW0_SZ_HIGH + 2)
+#define NTB_EDMA_MAX_POLL 32
+
MODULE_DESCRIPTION(NTB_TRANSPORT_DESC);
MODULE_VERSION(NTB_TRANSPORT_VER);
MODULE_LICENSE("Dual BSD/GPL");
@@ -102,6 +107,13 @@ module_param(use_msi, bool, 0644);
MODULE_PARM_DESC(use_msi, "Use MSI interrupts instead of doorbells");
#endif
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+#include "ntb_edma.h"
+static bool use_remote_edma;
+module_param(use_remote_edma, bool, 0644);
+MODULE_PARM_DESC(use_remote_edma, "Use remote eDMA mode (when enabled, use_msi is ignored)");
+#endif
+
static struct dentry *nt_debugfs_dir;
/* Only two-ports NTB devices are supported */
@@ -125,6 +137,14 @@ struct ntb_queue_entry {
struct ntb_payload_header __iomem *tx_hdr;
struct ntb_payload_header *rx_hdr;
};
+
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+ dma_addr_t addr;
+
+ /* Used by RC side only */
+ struct scatterlist sgl;
+ struct work_struct dma_work;
+#endif
};
struct ntb_rx_info {
@@ -202,6 +222,33 @@ struct ntb_transport_qp {
int msi_irq;
struct ntb_msi_desc msi_desc;
struct ntb_msi_desc peer_msi_desc;
+
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+ /*
+ * For ensuring peer notification in non-atomic context.
+ * ntb_peer_db_set might sleep or schedule.
+ */
+ struct work_struct db_work;
+
+ /*
+ * wr: remote eDMA write transfer (EP -> RC direction)
+ * rd: remote eDMA read transfer (RC -> EP direction)
+ */
+ u32 wr_cons;
+ u32 rd_cons;
+ u32 wr_prod;
+ u32 rd_prod;
+ u32 wr_issue;
+ u32 rd_issue;
+
+ spinlock_t ep_tx_lock;
+ spinlock_t ep_rx_lock;
+ spinlock_t rc_lock;
+
+ /* Completion work for read/write transfers. */
+ struct work_struct read_work;
+ struct work_struct write_work;
+#endif
};
struct ntb_transport_mw {
@@ -249,6 +296,13 @@ struct ntb_transport_ctx {
/* Make sure workq of link event be executed serially */
struct mutex link_event_lock;
+
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+ remote_edma_mode_t remote_edma_mode;
+ struct device *dma_dev;
+ struct workqueue_struct *wq;
+ struct ntb_edma_chans edma;
+#endif
};
enum {
@@ -262,6 +316,19 @@ struct ntb_payload_header {
unsigned int flags;
};
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+static void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt);
+static int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
+ unsigned int *mw_count);
+static void ntb_transport_edma_init_queue(struct ntb_transport_ctx *nt,
+ unsigned int qp_num);
+static void ntb_transport_edma_create_queue(struct ntb_transport_ctx *nt,
+ struct ntb_transport_qp *qp);
+static int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt);
+static int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt);
+static void ntb_transport_edma_rc_dma_work(struct work_struct *work);
+#endif /* CONFIG_NTB_TRANSPORT_EDMA */
+
/*
* Return the device that should be used for DMA mapping.
*
@@ -298,7 +365,7 @@ enum {
container_of((__drv), struct ntb_transport_client, driver)
#define QP_TO_MW(nt, qp) ((qp) % nt->mw_count)
-#define NTB_QP_DEF_NUM_ENTRIES 100
+#define NTB_QP_DEF_NUM_ENTRIES 128
#define NTB_LINK_DOWN_TIMEOUT 10
static void ntb_transport_rxc_db(unsigned long data);
@@ -1015,6 +1082,10 @@ static void ntb_transport_link_cleanup(struct ntb_transport_ctx *nt)
count = ntb_spad_count(nt->ndev);
for (i = 0; i < count; i++)
ntb_spad_write(nt->ndev, i, 0);
+
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+ ntb_edma_teardown_chans(&nt->edma);
+#endif
}
static void ntb_transport_link_cleanup_work(struct work_struct *work)
@@ -1051,6 +1122,14 @@ static void ntb_transport_link_work(struct work_struct *work)
/* send the local info, in the opposite order of the way we read it */
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+ rc = ntb_transport_edma_ep_init(nt);
+ if (rc) {
+ dev_err(&pdev->dev, "Failed to init EP: %d\n", rc);
+ return;
+ }
+#endif
+
if (nt->use_msi) {
rc = ntb_msi_setup_mws(ndev);
if (rc) {
@@ -1132,6 +1211,14 @@ static void ntb_transport_link_work(struct work_struct *work)
nt->link_is_up = true;
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+ rc = ntb_transport_edma_rc_init(nt);
+ if (rc) {
+ dev_err(&pdev->dev, "Failed to init RC: %d\n", rc);
+ goto out1;
+ }
+#endif
+
for (i = 0; i < nt->qp_count; i++) {
struct ntb_transport_qp *qp = &nt->qp_vec[i];
@@ -1277,6 +1364,8 @@ static const struct ntb_transport_backend_ops default_backend_ops = {
.debugfs_stats_show = ntb_transport_default_debugfs_stats_show,
};
+static const struct ntb_transport_backend_ops edma_backend_ops;
+
static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
{
struct ntb_transport_ctx *nt;
@@ -1311,7 +1400,23 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
nt->ndev = ndev;
- nt->backend_ops = default_backend_ops;
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+ if (use_remote_edma) {
+ rc = ntb_transport_edma_init(nt, &mw_count);
+ if (rc) {
+ nt->mw_count = 0;
+ goto err;
+ }
+ nt->backend_ops = edma_backend_ops;
+
+ /*
+ * On remote eDMA mode, we reserve a read channel for Host->EP
+ * interruption.
+ */
+ use_msi = false;
+ } else
+#endif
+ nt->backend_ops = default_backend_ops;
/*
* If we are using MSI, and have at least one extra memory window,
@@ -1402,6 +1507,10 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
rc = ntb_transport_init_queue(nt, i);
if (rc)
goto err2;
+
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+ ntb_transport_edma_init_queue(nt, i);
+#endif
}
INIT_DELAYED_WORK(&nt->link_work, ntb_transport_link_work);
@@ -1433,6 +1542,9 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
}
kfree(nt->mw_vec);
err:
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+ ntb_transport_edma_uninit(nt);
+#endif
kfree(nt);
return rc;
}
@@ -2055,11 +2167,16 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
nt->qp_bitmap_free &= ~qp_bit;
+ qp->qp_bit = qp_bit;
qp->cb_data = data;
qp->rx_handler = handlers->rx_handler;
qp->tx_handler = handlers->tx_handler;
qp->event_handler = handlers->event_handler;
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+ ntb_transport_edma_create_queue(nt, qp);
+#endif
+
dma_cap_zero(dma_mask);
dma_cap_set(DMA_MEMCPY, dma_mask);
@@ -2105,6 +2222,9 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
goto err1;
entry->qp = qp;
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+ INIT_WORK(&entry->dma_work, ntb_transport_edma_rc_dma_work);
+#endif
ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
&qp->rx_free_q);
}
@@ -2156,8 +2276,8 @@ EXPORT_SYMBOL_GPL(ntb_transport_create_queue);
*/
void ntb_transport_free_queue(struct ntb_transport_qp *qp)
{
- struct pci_dev *pdev;
struct ntb_queue_entry *entry;
+ struct pci_dev *pdev;
u64 qp_bit;
if (!qp)
@@ -2208,6 +2328,10 @@ void ntb_transport_free_queue(struct ntb_transport_qp *qp)
tasklet_kill(&qp->rxc_db_work);
cancel_delayed_work_sync(&qp->link_work);
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+ cancel_work_sync(&qp->read_work);
+ cancel_work_sync(&qp->write_work);
+#endif
qp->cb_data = NULL;
qp->rx_handler = NULL;
@@ -2346,6 +2470,1157 @@ int ntb_transport_tx_enqueue(struct ntb_transport_qp *qp, void *cb, void *data,
}
EXPORT_SYMBOL_GPL(ntb_transport_tx_enqueue);
+#ifdef CONFIG_NTB_TRANSPORT_EDMA
+/*
+ * Remote eDMA mode implementation
+ */
+struct ntb_edma_desc {
+ u32 len;
+ u32 flags;
+ u64 addr; /* DMA address */
+ u64 data;
+};
+
+struct ntb_edma_ring {
+ struct ntb_edma_desc desc[NTB_EDMA_RING_ENTRIES];
+ u32 head;
+ u32 tail;
+};
+
+#define NTB_EDMA_DESC_OFF(i) ((size_t)(i) * sizeof(struct ntb_edma_desc))
+
+#define __NTB_EDMA_CHECK_INDEX(_i) \
+({ \
+ unsigned long __i = (unsigned long)(_i); \
+ WARN_ONCE(__i >= (unsigned long)NTB_EDMA_RING_ENTRIES, \
+ "ntb_edma: index i=%lu >= ring_entries=%lu\n", \
+ __i, (unsigned long)NTB_EDMA_RING_ENTRIES); \
+ __i; \
+})
+
+#define NTB_EDMA_DESC_I(qp, i, n) \
+({ \
+ typeof(qp) __qp = (qp); \
+ unsigned long __i = __NTB_EDMA_CHECK_INDEX(i); \
+ (struct ntb_edma_desc *) \
+ ((char *)(__qp)->rx_buff + \
+ (sizeof(struct ntb_edma_ring) * n) + \
+ NTB_EDMA_DESC_OFF(__i)); \
+})
+
+#define NTB_EDMA_DESC_O(qp, i, n) \
+({ \
+ typeof(qp) __qp = (qp); \
+ unsigned long __i = __NTB_EDMA_CHECK_INDEX(i); \
+ (struct ntb_edma_desc __iomem *) \
+ ((char __iomem *)(__qp)->tx_mw + \
+ (sizeof(struct ntb_edma_ring) * n) + \
+ NTB_EDMA_DESC_OFF(__i)); \
+})
+
+#define NTB_EDMA_HEAD_I(qp, n) ((u32 *)((char *)qp->rx_buff + \
+ (sizeof(struct ntb_edma_ring) * n) + \
+ offsetof(struct ntb_edma_ring, head)))
+#define NTB_EDMA_HEAD_O(qp, n) ((u32 *)((char __iomem *)qp->tx_mw + \
+ (sizeof(struct ntb_edma_ring) * n) + \
+ offsetof(struct ntb_edma_ring, head)))
+#define NTB_EDMA_TAIL_I(qp, n) ((u32 *)((char *)qp->rx_buff + \
+ (sizeof(struct ntb_edma_ring) * n) + \
+ offsetof(struct ntb_edma_ring, tail)))
+#define NTB_EDMA_TAIL_O(qp, n) ((u32 *)((char __iomem *)qp->tx_mw + \
+ (sizeof(struct ntb_edma_ring) * n) + \
+ offsetof(struct ntb_edma_ring, tail)))
+
+/*
+ * Macro naming rule:
+ * NTB_DESC_RD_EP_I (as an example)
+ * ^^ ^^ ^
+ * : : `-- I(n) or O(ut). In = Read, Out = Write.
+ * : `----- Who uses this macro.
+ * `-------- DESC / HEAD / TAIL
+ *
+ * Read transfers (RC->EP):
+ *
+ * EP view (outbound, written via NTB):
+ * - descs: NTB_DESC_RD_EP_O(qp, i) / NTB_DESC_RD_EP_I(qp, i)
+ * [ len ][ flags ][ addr ][ data ]
+ * [ len ][ flags ][ addr ][ data ]
+ * :
+ * [ len ][ flags ][ addr ][ data ]
+ * - head: NTB_HEAD_RD_EP_O(qp)
+ * - tail: NTB_TAIL_RD_EP_I(qp)
+ *
+ * RC view (inbound, local mapping):
+ * - descs: NTB_DESC_RD_RC_I(qp, i) / NTB_DESC_RD_RC_O(qp, i)
+ * [ len ][ flags ][ addr ][ data ]
+ * [ len ][ flags ][ addr ][ data ]
+ * :
+ * [ len ][ flags ][ addr ][ data ]
+ * - head: NTB_HEAD_RD_RC_I(qp)
+ * - tail: NTB_TAIL_RD_RC_O(qp)
+ *
+ * Write transfers (EP -> RC) are analogous but use
+ * NTB_DESC_WR_{EP_O,RC_I}(), NTB_HEAD_WR_{EP_O,RC_I}(),
+ * and NTB_TAIL_WR_{EP_I,RC_O}().
+ */
+#define NTB_DESC_RD_EP_I(qp, i) NTB_EDMA_DESC_I(qp, i, 0)
+#define NTB_DESC_RD_EP_O(qp, i) NTB_EDMA_DESC_O(qp, i, 0)
+#define NTB_DESC_WR_EP_I(qp, i) NTB_EDMA_DESC_I(qp, i, 1)
+#define NTB_DESC_WR_EP_O(qp, i) NTB_EDMA_DESC_O(qp, i, 1)
+#define NTB_DESC_RD_RC_I(qp, i) NTB_EDMA_DESC_I(qp, i, 0)
+#define NTB_DESC_RD_RC_O(qp, i) NTB_EDMA_DESC_O(qp, i, 0)
+#define NTB_DESC_WR_RC_I(qp, i) NTB_EDMA_DESC_I(qp, i, 1)
+#define NTB_DESC_WR_RC_O(qp, i) NTB_EDMA_DESC_O(qp, i, 1)
+
+#define NTB_HEAD_RD_EP_O(qp) NTB_EDMA_HEAD_O(qp, 0)
+#define NTB_HEAD_WR_EP_O(qp) NTB_EDMA_HEAD_O(qp, 1)
+#define NTB_HEAD_RD_RC_I(qp) NTB_EDMA_HEAD_I(qp, 0)
+#define NTB_HEAD_WR_RC_I(qp) NTB_EDMA_HEAD_I(qp, 1)
+
+#define NTB_TAIL_RD_EP_I(qp) NTB_EDMA_TAIL_I(qp, 0)
+#define NTB_TAIL_WR_EP_I(qp) NTB_EDMA_TAIL_I(qp, 1)
+#define NTB_TAIL_RD_RC_O(qp) NTB_EDMA_TAIL_O(qp, 0)
+#define NTB_TAIL_WR_RC_O(qp) NTB_EDMA_TAIL_O(qp, 1)
+
+static inline bool ntb_qp_edma_is_rc(struct ntb_transport_qp *qp)
+{
+ return qp->transport->remote_edma_mode == REMOTE_EDMA_RC;
+}
+
+static inline bool ntb_qp_edma_is_ep(struct ntb_transport_qp *qp)
+{
+ return qp->transport->remote_edma_mode == REMOTE_EDMA_EP;
+}
+
+static inline bool ntb_qp_edma_enabled(struct ntb_transport_qp *qp)
+{
+ return ntb_qp_edma_is_rc(qp) || ntb_qp_edma_is_ep(qp);
+}
+
+static unsigned int ntb_transport_edma_tx_free_entry(struct ntb_transport_qp *qp)
+{
+ unsigned int head, tail;
+
+ if (ntb_qp_edma_is_ep(qp)) {
+ scoped_guard(spinlock_irqsave, &qp->ep_tx_lock) {
+ /* In this scope, only 'head' might proceed */
+ tail = READ_ONCE(qp->wr_cons);
+ head = READ_ONCE(qp->wr_prod);
+ }
+ return ntb_edma_ring_free_entry(head, tail);
+ }
+
+ scoped_guard(spinlock_irqsave, &qp->rc_lock) {
+ /* In this scope, only 'head' might proceed */
+ tail = READ_ONCE(qp->rd_issue);
+ head = READ_ONCE(*NTB_HEAD_RD_RC_I(qp));
+ }
+ /*
+ * On the RC side, the 'used' count indicates how many entries the EP
+ * side has refilled and are therefore available for TX.
+ */
+ return ntb_edma_ring_used_entry(head, tail);
+}
+
+static void ntb_transport_edma_debugfs_stats_show(struct seq_file *s,
+ struct ntb_transport_qp *qp)
+{
+ seq_printf(s, "rx_bytes - \t%llu\n", qp->rx_bytes);
+ seq_printf(s, "rx_pkts - \t%llu\n", qp->rx_pkts);
+ seq_printf(s, "rx_err_no_buf - %llu\n", qp->rx_err_no_buf);
+ seq_printf(s, "rx_buff - \t0x%p\n", qp->rx_buff);
+ seq_printf(s, "rx_max_entry - \t%u\n", qp->rx_max_entry);
+ seq_printf(s, "rx_alloc_entry - \t%u\n\n", qp->rx_alloc_entry);
+
+ seq_printf(s, "tx_bytes - \t%llu\n", qp->tx_bytes);
+ seq_printf(s, "tx_pkts - \t%llu\n", qp->tx_pkts);
+ seq_printf(s, "tx_ring_full - \t%llu\n", qp->tx_ring_full);
+ seq_printf(s, "tx_err_no_buf - %llu\n", qp->tx_err_no_buf);
+ seq_printf(s, "tx_mw - \t0x%p\n", qp->tx_mw);
+ seq_printf(s, "tx_max_entry - \t%u\n", qp->tx_max_entry);
+ seq_printf(s, "free tx - \t%u\n", ntb_transport_tx_free_entry(qp));
+ seq_putc(s, '\n');
+
+ seq_puts(s, "Using Remote eDMA - Yes\n");
+ seq_printf(s, "QP Link - \t%s\n", qp->link_is_up ? "Up" : "Down");
+}
+
+static void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt)
+{
+ struct ntb_dev *ndev = nt->ndev;
+
+ if (nt->remote_edma_mode == REMOTE_EDMA_EP && ndev && ndev->pdev)
+ ntb_edma_teardown_isr(&ndev->pdev->dev);
+
+ if (nt->wq)
+ destroy_workqueue(nt->wq);
+ nt->wq = NULL;
+}
+
+static int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
+ unsigned int *mw_count)
+{
+ struct ntb_dev *ndev = nt->ndev;
+
+ /*
+ * We need at least one MW for the transport plus one MW reserved
+ * for the remote eDMA window (see ntb_edma_setup_mws/peer).
+ */
+ if (*mw_count <= 1) {
+ dev_err(&ndev->dev,
+ "remote eDMA requires at least two MWS (have %u)\n",
+ *mw_count);
+ return -ENODEV;
+ }
+
+ nt->wq = alloc_workqueue("ntb-edma-wq", WQ_UNBOUND | WQ_SYSFS, 0);
+ if (!nt->wq) {
+ ntb_transport_edma_uninit(nt);
+ return -ENOMEM;
+ }
+
+ /* Reserve the last peer MW exclusively for the eDMA window. */
+ *mw_count -= 1;
+
+ return 0;
+}
+
+static void ntb_transport_edma_db_work(struct work_struct *work)
+{
+ struct ntb_transport_qp *qp =
+ container_of(work, struct ntb_transport_qp, db_work);
+
+ ntb_peer_db_set(qp->ndev, qp->qp_bit);
+}
+
+static void ntb_transport_edma_notify_peer(struct ntb_transport_qp *qp)
+{
+ if (ntb_qp_edma_is_rc(qp))
+ if (!ntb_edma_notify_peer(&qp->transport->edma, qp->qp_num))
+ return;
+
+ /*
+ * Called from contexts that may be atomic. Since ntb_peer_db_set()
+ * may sleep, delegate the actual doorbell write to a workqueue.
+ */
+ queue_work(system_highpri_wq, &qp->db_work);
+}
+
+static void ntb_transport_edma_isr(void *data, int qp_num)
+{
+ struct ntb_transport_ctx *nt = data;
+ struct ntb_transport_qp *qp;
+
+ if (qp_num < 0 || qp_num >= nt->qp_count)
+ return;
+
+ qp = &nt->qp_vec[qp_num];
+ if (WARN_ON(!qp))
+ return;
+
+ queue_work(nt->wq, &qp->read_work);
+ queue_work(nt->wq, &qp->write_work);
+}
+
+static int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt)
+{
+ struct ntb_dev *ndev = nt->ndev;
+ struct pci_dev *pdev = ndev->pdev;
+ int rc;
+
+ if (!use_remote_edma || nt->remote_edma_mode != REMOTE_EDMA_UNKNOWN)
+ return 0;
+
+ rc = ntb_edma_setup_peer(ndev);
+ if (rc) {
+ dev_err(&pdev->dev, "Failed to enable remote eDMA: %d\n", rc);
+ return rc;
+ }
+
+ rc = ntb_edma_setup_chans(get_dma_dev(ndev), &nt->edma);
+ if (rc) {
+ dev_err(&pdev->dev, "Failed to setup eDMA channels: %d\n", rc);
+ return rc;
+ }
+
+ nt->remote_edma_mode = REMOTE_EDMA_RC;
+ return 0;
+}
+
+static int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt)
+{
+ struct ntb_dev *ndev = nt->ndev;
+ struct pci_dev *pdev = ndev->pdev;
+ struct pci_epc *epc;
+ int rc;
+
+ if (!use_remote_edma || nt->remote_edma_mode == REMOTE_EDMA_EP)
+ return 0;
+
+ /* Only EP side can return pci_epc */
+ epc = ntb_get_pci_epc(ndev);
+ if (!epc)
+ return 0;
+
+ rc = ntb_edma_setup_mws(ndev);
+ if (rc) {
+ dev_err(&pdev->dev,
+ "Failed to set up memory window for eDMA: %d\n", rc);
+ return rc;
+ }
+
+ rc = ntb_edma_setup_isr(&pdev->dev, &epc->dev, ntb_transport_edma_isr, nt);
+ if (rc) {
+ dev_err(&pdev->dev, "Failed to setup eDMA ISR (%d)\n", rc);
+ return rc;
+ }
+
+ nt->remote_edma_mode = REMOTE_EDMA_EP;
+ return 0;
+}
+
+static int ntb_transport_edma_setup_qp_mw(struct ntb_transport_ctx *nt,
+ unsigned int qp_num)
+{
+ struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
+ struct ntb_dev *ndev = nt->ndev;
+ struct ntb_queue_entry *entry;
+ struct ntb_transport_mw *mw;
+ unsigned int mw_num, mw_count, qp_count;
+ unsigned int qp_offset, rx_info_offset;
+ unsigned int mw_size, mw_size_per_qp;
+ unsigned int num_qps_mw;
+ size_t edma_total;
+ unsigned int i;
+ int node;
+
+ mw_count = nt->mw_count;
+ qp_count = nt->qp_count;
+
+ mw_num = QP_TO_MW(nt, qp_num);
+ mw = &nt->mw_vec[mw_num];
+
+ if (!mw->virt_addr)
+ return -ENOMEM;
+
+ if (mw_num < qp_count % mw_count)
+ num_qps_mw = qp_count / mw_count + 1;
+ else
+ num_qps_mw = qp_count / mw_count;
+
+ mw_size = min(nt->mw_vec[mw_num].phys_size, mw->xlat_size);
+ if (max_mw_size && mw_size > max_mw_size)
+ mw_size = max_mw_size;
+
+ mw_size_per_qp = round_down((unsigned int)mw_size / num_qps_mw, SZ_64);
+ qp_offset = mw_size_per_qp * (qp_num / mw_count);
+ rx_info_offset = mw_size_per_qp - sizeof(struct ntb_rx_info);
+
+ qp->tx_mw_size = mw_size_per_qp;
+ qp->tx_mw = nt->mw_vec[mw_num].vbase + qp_offset;
+ if (!qp->tx_mw)
+ return -EINVAL;
+ qp->tx_mw_phys = nt->mw_vec[mw_num].phys_addr + qp_offset;
+ if (!qp->tx_mw_phys)
+ return -EINVAL;
+ qp->rx_info = qp->tx_mw + rx_info_offset;
+ qp->rx_buff = mw->virt_addr + qp_offset;
+ qp->remote_rx_info = qp->rx_buff + rx_info_offset;
+
+ /* Due to housekeeping, there must be at least 2 buffs */
+ qp->tx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
+ qp->rx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
+
+ /* In eDMA mode, decouple from MW sizing and force ring-sized entries */
+ edma_total = 2 * sizeof(struct ntb_edma_ring);
+ if (rx_info_offset < edma_total) {
+ dev_err(&ndev->dev, "Ring space requires %luB (>=%uB)\n",
+ edma_total, rx_info_offset);
+ return -EINVAL;
+ }
+ qp->tx_max_entry = NTB_EDMA_RING_ENTRIES;
+ qp->rx_max_entry = NTB_EDMA_RING_ENTRIES;
+
+ /*
+ * If the ring requires more entries than were allocated by default,
+ * allocate the difference so the entry pool stays in sync with the
+ * ring size.
+ */
+ node = dev_to_node(&ndev->dev);
+ for (i = qp->rx_alloc_entry; i < qp->rx_max_entry; i++) {
+ entry = kzalloc_node(sizeof(*entry), GFP_KERNEL, node);
+ if (!entry)
+ return -ENOMEM;
+
+ entry->qp = qp;
+ INIT_WORK(&entry->dma_work, ntb_transport_edma_rc_dma_work);
+ ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
+ &qp->rx_free_q);
+ qp->rx_alloc_entry++;
+ }
+
+ memset(qp->rx_buff, 0, edma_total);
+
+ qp->rx_pkts = 0;
+ qp->tx_pkts = 0;
+
+ return 0;
+}
+
+static int ntb_transport_edma_ep_read_complete(struct ntb_transport_qp *qp)
+{
+ struct device *dma_dev = get_dma_dev(qp->ndev);
+ struct ntb_queue_entry *entry;
+ struct ntb_edma_desc *in;
+ unsigned int len;
+ u32 flags, idx;
+
+ if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_TAIL_RD_EP_I(qp)),
+ qp->rd_cons) == 0)
+ return 0;
+
+ idx = ntb_edma_ring_idx(qp->rd_cons);
+ in = NTB_DESC_RD_EP_I(qp, idx);
+ if (!(in->flags & DESC_DONE_FLAG))
+ return 0;
+
+ flags = in->flags;
+ in->flags = 0;
+ len = in->len; /* might be smaller than entry->len */
+
+ entry = (struct ntb_queue_entry *)(in->data);
+ if (WARN_ON(!entry))
+ return 0;
+
+ if (flags & LINK_DOWN_FLAG) {
+ ntb_qp_link_down(qp);
+ qp->rd_cons++;
+ ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
+ return 1;
+ }
+
+ dma_unmap_single(dma_dev, entry->addr, entry->len, DMA_FROM_DEVICE);
+
+ qp->rx_bytes += len;
+ qp->rx_pkts++;
+ qp->rd_cons++;
+
+ if (qp->rx_handler && qp->client_ready)
+ qp->rx_handler(qp, qp->cb_data, entry->cb_data, len);
+
+ ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
+ return 1;
+}
+
+static int ntb_transport_edma_ep_write_complete(struct ntb_transport_qp *qp)
+{
+ struct ntb_queue_entry *entry;
+ struct ntb_edma_desc *in;
+ u32 idx;
+
+ if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_TAIL_WR_EP_I(qp)),
+ qp->wr_cons) == 0)
+ return 0;
+
+ idx = ntb_edma_ring_idx(qp->wr_cons);
+ in = NTB_DESC_WR_EP_I(qp, idx);
+
+ entry = (struct ntb_queue_entry *)(in->data);
+ if (WARN_ON(!entry))
+ return 0;
+
+ qp->wr_cons++;
+
+ if (qp->tx_handler)
+ qp->tx_handler(qp, qp->cb_data, entry->cb_data, entry->len);
+
+ ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry, &qp->tx_free_q);
+ return 1;
+}
+
+static void ntb_transport_edma_ep_read_work(struct work_struct *work)
+{
+ struct ntb_transport_qp *qp = container_of(
+ work, struct ntb_transport_qp, read_work);
+ unsigned int i;
+
+ for (i = 0; i < NTB_EDMA_MAX_POLL; i++) {
+ if (!ntb_transport_edma_ep_read_complete(qp))
+ break;
+ }
+
+ if (ntb_transport_edma_ep_read_complete(qp))
+ queue_work(qp->transport->wq, &qp->read_work);
+}
+
+static void ntb_transport_edma_ep_write_work(struct work_struct *work)
+{
+ struct ntb_transport_qp *qp = container_of(
+ work, struct ntb_transport_qp, write_work);
+ unsigned int i;
+
+ for (i = 0; i < NTB_EDMA_MAX_POLL; i++) {
+ if (!ntb_transport_edma_ep_write_complete(qp))
+ break;
+ }
+
+ if (ntb_transport_edma_ep_write_complete(qp))
+ queue_work(qp->transport->wq, &qp->write_work);
+}
+
+static void ntb_transport_edma_rc_write_complete_work(struct work_struct *work)
+{
+ struct ntb_transport_qp *qp = container_of(
+ work, struct ntb_transport_qp, write_work);
+ struct ntb_queue_entry *entry;
+ struct ntb_edma_desc *in;
+ unsigned int len;
+ void *cb_data;
+ u32 idx;
+
+ while (ntb_edma_ring_used_entry(READ_ONCE(qp->wr_issue),
+ qp->wr_cons) != 0) {
+ /* Paired with smp_wmb() in ntb_transport_edma_rc_poll() */
+ smp_rmb();
+
+ idx = ntb_edma_ring_idx(qp->wr_cons);
+ in = NTB_DESC_WR_RC_I(qp, idx);
+ entry = (struct ntb_queue_entry *)READ_ONCE(in->data);
+ if (!entry || !(entry->flags & DESC_DONE_FLAG))
+ break;
+
+ in->data = 0;
+
+ cb_data = entry->cb_data;
+ len = entry->len;
+
+ iowrite32(++qp->wr_cons, NTB_TAIL_WR_RC_O(qp));
+
+ if (unlikely(entry->flags & LINK_DOWN_FLAG)) {
+ ntb_qp_link_down(qp);
+ continue;
+ }
+
+ ntb_transport_edma_notify_peer(qp);
+
+ ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
+
+ if (qp->rx_handler && qp->client_ready)
+ qp->rx_handler(qp, qp->cb_data, cb_data, len);
+
+ /* stat updates */
+ qp->rx_bytes += len;
+ qp->rx_pkts++;
+ }
+}
+
+static void ntb_transport_edma_rc_write_cb(void *data,
+ const struct dmaengine_result *res)
+{
+ struct ntb_queue_entry *entry = data;
+ struct ntb_transport_qp *qp = entry->qp;
+ struct ntb_transport_ctx *nt = qp->transport;
+ enum dmaengine_tx_result dma_err = res->result;
+ struct device *dma_dev = get_dma_dev(qp->ndev);
+
+ switch (dma_err) {
+ case DMA_TRANS_READ_FAILED:
+ case DMA_TRANS_WRITE_FAILED:
+ case DMA_TRANS_ABORTED:
+ entry->errors++;
+ entry->len = -EIO;
+ break;
+ case DMA_TRANS_NOERROR:
+ default:
+ break;
+ }
+ dma_unmap_sg(dma_dev, &entry->sgl, 1, DMA_FROM_DEVICE);
+ sg_dma_address(&entry->sgl) = 0;
+
+ entry->flags |= DESC_DONE_FLAG;
+
+ queue_work(nt->wq, &qp->write_work);
+}
+
+static void ntb_transport_edma_rc_read_complete_work(struct work_struct *work)
+{
+ struct ntb_transport_qp *qp = container_of(
+ work, struct ntb_transport_qp, read_work);
+ struct ntb_edma_desc *in, __iomem *out;
+ struct ntb_queue_entry *entry;
+ unsigned int len;
+ void *cb_data;
+ u32 idx;
+
+ while (ntb_edma_ring_used_entry(READ_ONCE(qp->rd_issue),
+ qp->rd_cons) != 0) {
+ /* Paired with smp_wmb() in ntb_transport_edma_rc_tx_enqueue() */
+ smp_rmb();
+
+ idx = ntb_edma_ring_idx(qp->rd_cons);
+ in = NTB_DESC_RD_RC_I(qp, idx);
+ entry = (struct ntb_queue_entry *)in->data;
+ if (!entry || !(entry->flags & DESC_DONE_FLAG))
+ break;
+
+ in->data = 0;
+
+ cb_data = entry->cb_data;
+ len = entry->len;
+
+ out = NTB_DESC_RD_RC_O(qp, idx);
+
+ WRITE_ONCE(qp->rd_cons, qp->rd_cons + 1);
+
+ /*
+ * No need to add barrier in-between to enforce ordering here.
+ * The other side proceeds only after both flags and tail are
+ * updated.
+ */
+ iowrite32(entry->flags, &out->flags);
+ iowrite32(qp->rd_cons, NTB_TAIL_RD_RC_O(qp));
+
+ ntb_transport_edma_notify_peer(qp);
+
+ ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry,
+ &qp->tx_free_q);
+
+ if (qp->tx_handler)
+ qp->tx_handler(qp, qp->cb_data, cb_data, len);
+
+ /* stat updates */
+ qp->tx_bytes += len;
+ qp->tx_pkts++;
+ }
+}
+
+static void ntb_transport_edma_rc_read_cb(void *data,
+ const struct dmaengine_result *res)
+{
+ struct ntb_queue_entry *entry = data;
+ struct ntb_transport_qp *qp = entry->qp;
+ struct ntb_transport_ctx *nt = qp->transport;
+ struct device *dma_dev = get_dma_dev(qp->ndev);
+ enum dmaengine_tx_result dma_err = res->result;
+
+ switch (dma_err) {
+ case DMA_TRANS_READ_FAILED:
+ case DMA_TRANS_WRITE_FAILED:
+ case DMA_TRANS_ABORTED:
+ entry->errors++;
+ entry->len = -EIO;
+ break;
+ case DMA_TRANS_NOERROR:
+ default:
+ break;
+ }
+ dma_unmap_sg(dma_dev, &entry->sgl, 1, DMA_TO_DEVICE);
+ sg_dma_address(&entry->sgl) = 0;
+
+ entry->flags |= DESC_DONE_FLAG;
+
+ queue_work(nt->wq, &qp->read_work);
+}
+
+static int ntb_transport_edma_rc_write_start(struct device *d,
+ struct dma_chan *chan, size_t len,
+ dma_addr_t ep_src, void *rc_dst,
+ struct ntb_queue_entry *entry)
+{
+ struct scatterlist *sgl = &entry->sgl;
+ struct dma_async_tx_descriptor *txd;
+ struct dma_slave_config cfg;
+ dma_cookie_t cookie;
+ int nents, rc;
+
+ if (!d)
+ return -ENODEV;
+
+ if (!chan)
+ return -ENXIO;
+
+ if (WARN_ON(!ep_src || !rc_dst))
+ return -EINVAL;
+
+ if (WARN_ON(sg_dma_address(sgl)))
+ return -EINVAL;
+
+ sg_init_one(sgl, rc_dst, len);
+ nents = dma_map_sg(d, sgl, 1, DMA_FROM_DEVICE);
+ if (nents <= 0)
+ return -EIO;
+
+ memset(&cfg, 0, sizeof(cfg));
+ cfg.src_addr = ep_src;
+ cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
+ cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
+ cfg.direction = DMA_DEV_TO_MEM;
+ rc = dmaengine_slave_config(chan, &cfg);
+ if (rc)
+ goto out_unmap;
+
+ txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_DEV_TO_MEM,
+ DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
+ if (!txd) {
+ rc = -EIO;
+ goto out_unmap;
+ }
+
+ txd->callback_result = ntb_transport_edma_rc_write_cb;
+ txd->callback_param = entry;
+
+ cookie = dmaengine_submit(txd);
+ if (dma_submit_error(cookie)) {
+ rc = -EIO;
+ goto out_unmap;
+ }
+ dma_async_issue_pending(chan);
+ return 0;
+out_unmap:
+ dma_unmap_sg(d, sgl, 1, DMA_FROM_DEVICE);
+ return rc;
+}
+
+static int ntb_transport_edma_rc_read_start(struct device *d,
+ struct dma_chan *chan, size_t len,
+ void *rc_src, dma_addr_t ep_dst,
+ struct ntb_queue_entry *entry)
+{
+ struct scatterlist *sgl = &entry->sgl;
+ struct dma_async_tx_descriptor *txd;
+ struct dma_slave_config cfg;
+ dma_cookie_t cookie;
+ int nents, rc;
+
+ if (!d)
+ return -ENODEV;
+
+ if (!chan)
+ return -ENXIO;
+
+ if (WARN_ON(!rc_src || !ep_dst))
+ return -EINVAL;
+
+ if (WARN_ON(sg_dma_address(sgl)))
+ return -EINVAL;
+
+ sg_init_one(sgl, rc_src, len);
+ nents = dma_map_sg(d, sgl, 1, DMA_TO_DEVICE);
+ if (nents <= 0)
+ return -EIO;
+
+ memset(&cfg, 0, sizeof(cfg));
+ cfg.dst_addr = ep_dst;
+ cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
+ cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
+ cfg.direction = DMA_MEM_TO_DEV;
+ rc = dmaengine_slave_config(chan, &cfg);
+ if (rc)
+ goto out_unmap;
+
+ txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_MEM_TO_DEV,
+ DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
+ if (!txd) {
+ rc = -EIO;
+ goto out_unmap;
+ }
+
+ txd->callback_result = ntb_transport_edma_rc_read_cb;
+ txd->callback_param = entry;
+
+ cookie = dmaengine_submit(txd);
+ if (dma_submit_error(cookie)) {
+ rc = -EIO;
+ goto out_unmap;
+ }
+ dma_async_issue_pending(chan);
+ return 0;
+out_unmap:
+ dma_unmap_sg(d, sgl, 1, DMA_TO_DEVICE);
+ return rc;
+}
+
+static void ntb_transport_edma_rc_dma_work(struct work_struct *work)
+{
+ struct ntb_queue_entry *entry = container_of(
+ work, struct ntb_queue_entry, dma_work);
+ struct ntb_transport_qp *qp = entry->qp;
+ struct ntb_transport_ctx *nt = qp->transport;
+ struct device *dma_dev = get_dma_dev(qp->ndev);
+ struct dma_chan *chan;
+ int rc;
+
+ chan = ntb_edma_pick_chan(&nt->edma, REMOTE_EDMA_WRITE);
+ rc = ntb_transport_edma_rc_write_start(dma_dev, chan, entry->len,
+ entry->addr, entry->buf, entry);
+ if (rc) {
+ entry->errors++;
+ entry->len = -EIO;
+ entry->flags |= DESC_DONE_FLAG;
+ queue_work(nt->wq, &qp->write_work);
+ return;
+ }
+}
+
+static void ntb_transport_edma_rc_poll(struct ntb_transport_qp *qp)
+{
+ struct ntb_transport_ctx *nt = qp->transport;
+ unsigned int budget = NTB_EDMA_MAX_POLL;
+ struct ntb_queue_entry *entry;
+ struct ntb_edma_desc *in;
+ dma_addr_t ep_src;
+ u32 len, idx;
+
+ while (budget--) {
+ if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_HEAD_WR_RC_I(qp)),
+ qp->wr_issue) == 0)
+ break;
+
+ idx = ntb_edma_ring_idx(qp->wr_issue);
+ in = NTB_DESC_WR_RC_I(qp, idx);
+
+ len = READ_ONCE(in->len);
+ ep_src = (dma_addr_t)READ_ONCE(in->addr);
+
+ /* Prepare 'entry' for write completion */
+ entry = ntb_list_rm(&qp->ntb_rx_q_lock, &qp->rx_pend_q);
+ if (!entry) {
+ qp->rx_err_no_buf++;
+ break;
+ }
+ if (WARN_ON(entry->flags & DESC_DONE_FLAG))
+ entry->flags &= ~DESC_DONE_FLAG;
+ entry->len = len; /* NB. entry->len can be <=0 */
+ entry->addr = ep_src;
+
+ /*
+ * ntb_transport_edma_rc_write_complete_work() checks entry->flags
+ * so it needs to be set before wr_issue++.
+ */
+ in->data = (uintptr_t)entry;
+
+ /* Ensure in->data visible before wr_issue++ */
+ smp_wmb();
+
+ WRITE_ONCE(qp->wr_issue, qp->wr_issue + 1);
+
+ if (!len) {
+ entry->flags |= DESC_DONE_FLAG;
+ queue_work(nt->wq, &qp->write_work);
+ continue;
+ }
+
+ if (in->flags & LINK_DOWN_FLAG) {
+ dev_dbg(&qp->ndev->pdev->dev, "link down flag set\n");
+ entry->flags |= DESC_DONE_FLAG | LINK_DOWN_FLAG;
+ queue_work(nt->wq, &qp->write_work);
+ continue;
+ }
+
+ queue_work(nt->wq, &entry->dma_work);
+ }
+
+ if (!budget)
+ tasklet_schedule(&qp->rxc_db_work);
+}
+
+static int ntb_transport_edma_rc_tx_enqueue(struct ntb_transport_qp *qp,
+ struct ntb_queue_entry *entry)
+{
+ struct device *dma_dev = get_dma_dev(qp->ndev);
+ struct ntb_transport_ctx *nt = qp->transport;
+ struct ntb_edma_desc *in, __iomem *out;
+ unsigned int len = entry->len;
+ struct dma_chan *chan;
+ u32 issue, idx, head;
+ dma_addr_t ep_dst;
+ int rc;
+
+ WARN_ON_ONCE(entry->flags & DESC_DONE_FLAG);
+
+ scoped_guard(spinlock_irqsave, &qp->rc_lock) {
+ head = READ_ONCE(*NTB_HEAD_RD_RC_I(qp));
+ issue = qp->rd_issue;
+ if (ntb_edma_ring_used_entry(head, issue) == 0) {
+ qp->tx_ring_full++;
+ return -ENOSPC;
+ }
+
+ /*
+ * ntb_transport_edma_rc_read_complete_work() checks entry->flags
+ * so it needs to be set before rd_issue++.
+ */
+ idx = ntb_edma_ring_idx(issue);
+ in = NTB_DESC_RD_RC_I(qp, idx);
+ in->data = (uintptr_t)entry;
+
+ /* Make in->data visible before rd_issue++ */
+ smp_wmb();
+
+ WRITE_ONCE(qp->rd_issue, qp->rd_issue + 1);
+ }
+
+ /* Publish the final transfer length to the EP side */
+ out = NTB_DESC_RD_RC_O(qp, idx);
+ iowrite32(len, &out->len);
+ ioread32(&out->len);
+
+ if (unlikely(!len)) {
+ entry->flags |= DESC_DONE_FLAG;
+ queue_work(nt->wq, &qp->read_work);
+ return 0;
+ }
+
+ /* Paired with dma_wmb() in ntb_transport_edma_ep_rx_enqueue() */
+ dma_rmb();
+
+ /* kick remote eDMA read transfer */
+ ep_dst = (dma_addr_t)in->addr;
+ chan = ntb_edma_pick_chan(&nt->edma, REMOTE_EDMA_READ);
+ rc = ntb_transport_edma_rc_read_start(dma_dev, chan, len,
+ entry->buf, ep_dst, entry);
+ if (rc) {
+ entry->errors++;
+ entry->len = -EIO;
+ entry->flags |= DESC_DONE_FLAG;
+ queue_work(nt->wq, &qp->read_work);
+ }
+ return 0;
+}
+
+static int ntb_transport_edma_ep_tx_enqueue(struct ntb_transport_qp *qp,
+ struct ntb_queue_entry *entry)
+{
+ struct device *dma_dev = get_dma_dev(qp->ndev);
+ struct ntb_edma_desc *in, __iomem *out;
+ unsigned int len = entry->len;
+ dma_addr_t ep_src = 0;
+ u32 idx;
+ int rc;
+
+ if (likely(len)) {
+ ep_src = dma_map_single(dma_dev, entry->buf, len,
+ DMA_TO_DEVICE);
+ rc = dma_mapping_error(dma_dev, ep_src);
+ if (rc)
+ return rc;
+ }
+
+ scoped_guard(spinlock_irqsave, &qp->ep_tx_lock) {
+ if (ntb_edma_ring_full(qp->wr_prod, qp->wr_cons)) {
+ rc = -ENOSPC;
+ qp->tx_ring_full++;
+ goto out_unmap;
+ }
+
+ idx = ntb_edma_ring_idx(qp->wr_prod);
+ in = NTB_DESC_WR_EP_I(qp, idx);
+ out = NTB_DESC_WR_EP_O(qp, idx);
+
+ WARN_ON(in->flags & DESC_DONE_FLAG);
+ WARN_ON(entry->flags & DESC_DONE_FLAG);
+ in->flags = 0;
+ in->data = (uintptr_t)entry;
+ entry->addr = ep_src;
+
+ iowrite32(len, &out->len);
+ iowrite32(entry->flags, &out->flags);
+ iowrite64(ep_src, &out->addr);
+ WRITE_ONCE(qp->wr_prod, qp->wr_prod + 1);
+
+ dma_wmb();
+ iowrite32(qp->wr_prod, NTB_HEAD_WR_EP_O(qp));
+
+ qp->tx_bytes += len;
+ qp->tx_pkts++;
+ }
+
+ ntb_transport_edma_notify_peer(qp);
+
+ return 0;
+out_unmap:
+ if (likely(len))
+ dma_unmap_single(dma_dev, ep_src, len, DMA_TO_DEVICE);
+ return rc;
+}
+
+static int ntb_transport_edma_tx_enqueue(struct ntb_transport_qp *qp,
+ struct ntb_queue_entry *entry,
+ void *cb, void *data, unsigned int len,
+ unsigned int flags)
+{
+ struct device *dma_dev;
+
+ if (entry->addr) {
+ /* Deferred unmap */
+ dma_dev = get_dma_dev(qp->ndev);
+ dma_unmap_single(dma_dev, entry->addr, entry->len, DMA_TO_DEVICE);
+ }
+
+ entry->cb_data = cb;
+ entry->buf = data;
+ entry->len = len;
+ entry->flags = flags;
+ entry->errors = 0;
+ entry->addr = 0;
+
+ WARN_ON_ONCE(!ntb_qp_edma_enabled(qp));
+
+ if (ntb_qp_edma_is_ep(qp))
+ return ntb_transport_edma_ep_tx_enqueue(qp, entry);
+ else
+ return ntb_transport_edma_rc_tx_enqueue(qp, entry);
+}
+
+static int ntb_transport_edma_ep_rx_enqueue(struct ntb_transport_qp *qp,
+ struct ntb_queue_entry *entry)
+{
+ struct device *dma_dev = get_dma_dev(qp->ndev);
+ struct ntb_edma_desc *in, __iomem *out;
+ unsigned int len = entry->len;
+ void *data = entry->buf;
+ dma_addr_t ep_dst;
+ u32 idx;
+ int rc;
+
+ ep_dst = dma_map_single(dma_dev, data, len, DMA_FROM_DEVICE);
+ rc = dma_mapping_error(dma_dev, ep_dst);
+ if (rc)
+ return rc;
+
+ scoped_guard(spinlock_bh, &qp->ep_rx_lock) {
+ if (ntb_edma_ring_full(READ_ONCE(qp->rd_prod),
+ READ_ONCE(qp->rd_cons))) {
+ rc = -ENOSPC;
+ goto out_unmap;
+ }
+
+ idx = ntb_edma_ring_idx(qp->rd_prod);
+ in = NTB_DESC_RD_EP_I(qp, idx);
+ out = NTB_DESC_RD_EP_O(qp, idx);
+
+ iowrite32(len, &out->len);
+ iowrite64(ep_dst, &out->addr);
+
+ WARN_ON(in->flags & DESC_DONE_FLAG);
+ in->data = (uintptr_t)entry;
+ entry->addr = ep_dst;
+
+ /* Ensure len/addr are visible before the head update */
+ dma_wmb();
+
+ WRITE_ONCE(qp->rd_prod, qp->rd_prod + 1);
+ iowrite32(qp->rd_prod, NTB_HEAD_RD_EP_O(qp));
+ }
+ return 0;
+out_unmap:
+ dma_unmap_single(dma_dev, ep_dst, len, DMA_FROM_DEVICE);
+ return rc;
+}
+
+static int ntb_transport_edma_rx_enqueue(struct ntb_transport_qp *qp,
+ struct ntb_queue_entry *entry)
+{
+ int rc;
+
+ /* The behaviour is the same as the default backend for RC side */
+ if (ntb_qp_edma_is_ep(qp)) {
+ rc = ntb_transport_edma_ep_rx_enqueue(qp, entry);
+ if (rc) {
+ ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
+ &qp->rx_free_q);
+ return rc;
+ }
+ }
+
+ ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_pend_q);
+
+ if (qp->active)
+ tasklet_schedule(&qp->rxc_db_work);
+
+ return 0;
+}
+
+static void ntb_transport_edma_rx_poll(struct ntb_transport_qp *qp)
+{
+ struct ntb_transport_ctx *nt = qp->transport;
+
+ if (ntb_qp_edma_is_rc(qp))
+ ntb_transport_edma_rc_poll(qp);
+ else if (ntb_qp_edma_is_ep(qp)) {
+ /*
+ * Make sure we poll the rings even if an eDMA interrupt is
+ * cleared on the RC side earlier.
+ */
+ queue_work(nt->wq, &qp->read_work);
+ queue_work(nt->wq, &qp->write_work);
+ } else
+ /* Unreachable */
+ WARN_ON_ONCE(1);
+}
+
+static void ntb_transport_edma_read_work(struct work_struct *work)
+{
+ struct ntb_transport_qp *qp = container_of(
+ work, struct ntb_transport_qp, read_work);
+
+ if (ntb_qp_edma_is_rc(qp))
+ ntb_transport_edma_rc_read_complete_work(work);
+ else if (ntb_qp_edma_is_ep(qp))
+ ntb_transport_edma_ep_read_work(work);
+ else
+ /* Unreachable */
+ WARN_ON_ONCE(1);
+}
+
+static void ntb_transport_edma_write_work(struct work_struct *work)
+{
+ struct ntb_transport_qp *qp = container_of(
+ work, struct ntb_transport_qp, write_work);
+
+ if (ntb_qp_edma_is_rc(qp))
+ ntb_transport_edma_rc_write_complete_work(work);
+ else if (ntb_qp_edma_is_ep(qp))
+ ntb_transport_edma_ep_write_work(work);
+ else
+ /* Unreachable */
+ WARN_ON_ONCE(1);
+}
+
+static void ntb_transport_edma_init_queue(struct ntb_transport_ctx *nt,
+ unsigned int qp_num)
+{
+ struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
+
+ qp->wr_cons = 0;
+ qp->rd_cons = 0;
+ qp->wr_prod = 0;
+ qp->rd_prod = 0;
+ qp->wr_issue = 0;
+ qp->rd_issue = 0;
+
+ INIT_WORK(&qp->db_work, ntb_transport_edma_db_work);
+ INIT_WORK(&qp->read_work, ntb_transport_edma_read_work);
+ INIT_WORK(&qp->write_work, ntb_transport_edma_write_work);
+}
+
+static void ntb_transport_edma_create_queue(struct ntb_transport_ctx *nt,
+ struct ntb_transport_qp *qp)
+{
+ spin_lock_init(&qp->ep_tx_lock);
+ spin_lock_init(&qp->ep_rx_lock);
+ spin_lock_init(&qp->rc_lock);
+}
+
+static const struct ntb_transport_backend_ops edma_backend_ops = {
+ .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
+ .tx_free_entry = ntb_transport_edma_tx_free_entry,
+ .tx_enqueue = ntb_transport_edma_tx_enqueue,
+ .rx_enqueue = ntb_transport_edma_rx_enqueue,
+ .rx_poll = ntb_transport_edma_rx_poll,
+ .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
+};
+#endif /* CONFIG_NTB_TRANSPORT_EDMA */
+
/**
* ntb_transport_link_up - Notify NTB transport of client readiness to use queue
* @qp: NTB transport layer queue to be enabled
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 21/27] NTB: epf: Provide db_vector_count/db_vector_mask callbacks
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (19 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode Koichiro Den
@ 2025-11-29 16:03 ` Koichiro Den
2025-11-29 16:04 ` [RFC PATCH v2 22/27] ntb_netdev: Multi-queue support Koichiro Den
` (6 subsequent siblings)
27 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:03 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Provide db_vector_count() and db_vector_mask() implementations for both
ntb_hw_epf and pci-epf-vntb so that ntb_transport can map MSI vectors to
doorbell bits. Without them, the upper layer cannot identify which
doorbell vector fired and ends up scheduling rxc_db_work() for all queue
pairs, resulting in a thundering-herd effect when multiple queue pairs
(QPs) are enabled.
With this change, .peer_db_set() must honor the db_bits mask and raise
all requested doorbell interrupts, so update those implementations
accordingly.
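For reference, a minimal sketch of how a client can consume these callbacks
through the existing ntb_db_vector_count()/ntb_db_vector_mask() wrappers.
example_service_qp() is a hypothetical stand-in for scheduling the per-QP
doorbell work and is not part of this series:

  /* called from the per-vector interrupt path of an NTB client */
  static void example_db_vector_event(struct ntb_dev *ntb, int vec)
  {
          u64 db_bits = ntb_db_vector_mask(ntb, vec);
          int bit;

          /* wake only the QPs whose doorbell bits belong to 'vec' */
          for (bit = 0; bit < BITS_PER_LONG_LONG; bit++)
                  if (db_bits & BIT_ULL(bit))
                          example_service_qp(bit);
  }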
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/ntb/hw/epf/ntb_hw_epf.c | 47 ++++++++++++-------
drivers/pci/endpoint/functions/pci-epf-vntb.c | 40 +++++++++++++---
2 files changed, 63 insertions(+), 24 deletions(-)
diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
index c94bf63d69ff..d9811da90599 100644
--- a/drivers/ntb/hw/epf/ntb_hw_epf.c
+++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
@@ -363,7 +363,7 @@ static int ntb_epf_init_isr(struct ntb_epf_dev *ndev, int msi_min, int msi_max)
}
}
- ndev->db_count = irq;
+ ndev->db_count = irq - 1;
ret = ntb_epf_send_command(ndev, CMD_CONFIGURE_DOORBELL,
argument | irq);
@@ -397,6 +397,22 @@ static u64 ntb_epf_db_valid_mask(struct ntb_dev *ntb)
return ntb_ndev(ntb)->db_valid_mask;
}
+static int ntb_epf_db_vector_count(struct ntb_dev *ntb)
+{
+ return ntb_ndev(ntb)->db_count;
+}
+
+static u64 ntb_epf_db_vector_mask(struct ntb_dev *ntb, int db_vector)
+{
+ struct ntb_epf_dev *ndev = ntb_ndev(ntb);
+
+ db_vector--; /* vector 0 is reserved for link events */
+ if (db_vector < 0 || db_vector >= ndev->db_count)
+ return 0;
+
+ return ndev->db_valid_mask & (1ULL << db_vector);
+}
+
static int ntb_epf_db_set_mask(struct ntb_dev *ntb, u64 db_bits)
{
return 0;
@@ -480,26 +496,21 @@ static int ntb_epf_peer_mw_get_addr(struct ntb_dev *ntb, int idx,
static int ntb_epf_peer_db_set(struct ntb_dev *ntb, u64 db_bits)
{
struct ntb_epf_dev *ndev = ntb_ndev(ntb);
- u32 interrupt_num = ffs(db_bits) + 1;
- struct device *dev = ndev->dev;
+ u32 interrupt_num;
u32 db_entry_size;
u32 db_offset;
u32 db_data;
-
- if (interrupt_num >= ndev->db_count) {
- dev_err(dev, "DB interrupt %d greater than Max Supported %d\n",
- interrupt_num, ndev->db_count);
- return -EINVAL;
- }
+ int i;
db_entry_size = readl(ndev->ctrl_reg + NTB_EPF_DB_ENTRY_SIZE);
- db_data = readl(ndev->ctrl_reg + NTB_EPF_DB_DATA(interrupt_num));
- db_offset = readl(ndev->ctrl_reg + NTB_EPF_DB_OFFSET(interrupt_num));
-
- writel(db_data, ndev->db_reg + (db_entry_size * interrupt_num) +
- db_offset);
-
+ for_each_set_bit(i, (unsigned long *)&db_bits, ndev->db_count) {
+ interrupt_num = i + 1;
+ db_data = readl(ndev->ctrl_reg + NTB_EPF_DB_DATA(interrupt_num));
+ db_offset = readl(ndev->ctrl_reg + NTB_EPF_DB_OFFSET(interrupt_num));
+ writel(db_data, ndev->db_reg + (db_entry_size * interrupt_num) +
+ db_offset);
+ }
return 0;
}
@@ -529,6 +540,8 @@ static const struct ntb_dev_ops ntb_epf_ops = {
.spad_count = ntb_epf_spad_count,
.peer_mw_count = ntb_epf_peer_mw_count,
.db_valid_mask = ntb_epf_db_valid_mask,
+ .db_vector_count = ntb_epf_db_vector_count,
+ .db_vector_mask = ntb_epf_db_vector_mask,
.db_set_mask = ntb_epf_db_set_mask,
.mw_set_trans = ntb_epf_mw_set_trans,
.mw_clear_trans = ntb_epf_mw_clear_trans,
@@ -561,8 +574,8 @@ static int ntb_epf_init_dev(struct ntb_epf_dev *ndev)
int ret;
/* One Link interrupt and rest doorbell interrupt */
- ret = ntb_epf_init_isr(ndev, NTB_EPF_MIN_DB_COUNT + NTB_EPF_IRQ_RESERVE,
- NTB_EPF_MAX_DB_COUNT + NTB_EPF_IRQ_RESERVE);
+ ret = ntb_epf_init_isr(ndev, NTB_EPF_MIN_DB_COUNT + 1 + NTB_EPF_IRQ_RESERVE,
+ NTB_EPF_MAX_DB_COUNT + 1 + NTB_EPF_IRQ_RESERVE);
if (ret) {
dev_err(dev, "Failed to init ISR\n");
return ret;
diff --git a/drivers/pci/endpoint/functions/pci-epf-vntb.c b/drivers/pci/endpoint/functions/pci-epf-vntb.c
index 93fd724a8faa..af8753650051 100644
--- a/drivers/pci/endpoint/functions/pci-epf-vntb.c
+++ b/drivers/pci/endpoint/functions/pci-epf-vntb.c
@@ -1379,6 +1379,22 @@ static u64 vntb_epf_db_valid_mask(struct ntb_dev *ntb)
return BIT_ULL(ntb_ndev(ntb)->db_count) - 1;
}
+static int vntb_epf_db_vector_count(struct ntb_dev *ntb)
+{
+ return ntb_ndev(ntb)->db_count;
+}
+
+static u64 vntb_epf_db_vector_mask(struct ntb_dev *ntb, int db_vector)
+{
+ struct epf_ntb *ndev = ntb_ndev(ntb);
+
+ db_vector--; /* vector 0 is reserved for link events */
+ if (db_vector < 0 || db_vector >= ndev->db_count)
+ return 0;
+
+ return 1ULL << db_vector;
+}
+
static int vntb_epf_db_set_mask(struct ntb_dev *ntb, u64 db_bits)
{
return 0;
@@ -1488,20 +1504,28 @@ static int vntb_epf_peer_spad_write(struct ntb_dev *ndev, int pidx, int idx, u32
static int vntb_epf_peer_db_set(struct ntb_dev *ndev, u64 db_bits)
{
- u32 interrupt_num = ffs(db_bits) + 1;
struct epf_ntb *ntb = ntb_ndev(ndev);
u8 func_no, vfunc_no;
- int ret;
+ u64 failed = 0;
+ int i;
func_no = ntb->epf->func_no;
vfunc_no = ntb->epf->vfunc_no;
- ret = pci_epc_raise_irq(ntb->epf->epc, func_no, vfunc_no,
- PCI_IRQ_MSI, interrupt_num + 1);
- if (ret)
- dev_err(&ntb->ntb.dev, "Failed to raise IRQ\n");
+ for_each_set_bit(i, (unsigned long *)&db_bits, ntb->db_count) {
+ /*
+ * DB bit i is MSI interrupt (i + 2).
+ * Vector 0 is used for link events and MSI vectors are
+ * 1-based for pci_epc_raise_irq().
+ */
+ if (pci_epc_raise_irq(ntb->epf->epc, func_no, vfunc_no,
+ PCI_IRQ_MSI, i + 2))
+ failed |= BIT_ULL(i);
+ }
+ if (failed)
+ dev_err(&ntb->ntb.dev, "Failed to raise IRQ (0x%llx)\n", failed);
- return ret;
+ return failed ? -EIO : 0;
}
static u64 vntb_epf_db_read(struct ntb_dev *ndev)
@@ -1575,6 +1599,8 @@ static const struct ntb_dev_ops vntb_epf_ops = {
.spad_count = vntb_epf_spad_count,
.peer_mw_count = vntb_epf_peer_mw_count,
.db_valid_mask = vntb_epf_db_valid_mask,
+ .db_vector_count = vntb_epf_db_vector_count,
+ .db_vector_mask = vntb_epf_db_vector_mask,
.db_set_mask = vntb_epf_db_set_mask,
.mw_set_trans = vntb_epf_mw_set_trans,
.mw_clear_trans = vntb_epf_mw_clear_trans,
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 22/27] ntb_netdev: Multi-queue support
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (20 preceding siblings ...)
2025-11-29 16:03 ` [RFC PATCH v2 21/27] NTB: epf: Provide db_vector_count/db_vector_mask callbacks Koichiro Den
@ 2025-11-29 16:04 ` Koichiro Den
2025-11-29 16:04 ` [RFC PATCH v2 23/27] NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car) Koichiro Den
` (5 subsequent siblings)
27 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:04 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
ntb_transport can now scale throughput across multiple queue pairs when its
remote eDMA mode is enabled (use_remote_edma=1).
Teach ntb_netdev to allocate multiple ntb_transport queue pairs and
expose them as a multi-queue net_device. In particular, when remote eDMA
is enabled, each queue pair can be serviced in parallel by the eDMA
engine.
With this patch, up to N queue pairs are created, where N is chosen as
follows:
- By default, N is num_online_cpus(), to give each CPU its own queue.
- If the ntb_num_queues module parameter is non-zero, it overrides the
default and requests that many queues.
- In both cases the requested value is capped at a fixed upper bound
to avoid unbounded allocations, and by the number of queue pairs
actually available from ntb_transport.
If only one queue pair can be created (or ntb_num_queues=1 is set), the
driver effectively falls back to the previous single-queue behaviour.
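The resulting queue count can be sketched as follows (this just condenses
ntb_netdev_default_queues() below; the figures in the comments assume a
hypothetical 8-CPU system):

  static unsigned int example_queue_count(void)
  {
          /* ntb_num_queues == 0 (default) -> num_online_cpus() == 8 */
          unsigned int n = ntb_num_queues ?: num_online_cpus();

          /* cap at NTB_NETDEV_MAX_QUEUES (64): min(8, 64) == 8 */
          return min_t(unsigned int, n ?: 1, NTB_NETDEV_MAX_QUEUES);
  }
  /* the result is further limited by how many queue pairs
   * ntb_transport_create_queue() actually succeeds in creating */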
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/net/ntb_netdev.c | 341 ++++++++++++++++++++++++++++-----------
1 file changed, 243 insertions(+), 98 deletions(-)
diff --git a/drivers/net/ntb_netdev.c b/drivers/net/ntb_netdev.c
index fbeae05817e9..7aeca35b46c5 100644
--- a/drivers/net/ntb_netdev.c
+++ b/drivers/net/ntb_netdev.c
@@ -53,6 +53,8 @@
#include <linux/pci.h>
#include <linux/ntb.h>
#include <linux/ntb_transport.h>
+#include <linux/cpumask.h>
+#include <linux/slab.h>
#define NTB_NETDEV_VER "0.7"
@@ -70,26 +72,84 @@ static unsigned int tx_start = 10;
/* Number of descriptors still available before stop upper layer tx */
static unsigned int tx_stop = 5;
+/*
+ * Upper bound on how many queue pairs we will try to create even if
+ * ntb_num_queues or num_online_cpus() is very large. This is an
+ * arbitrary safety cap to avoid unbounded allocations.
+ */
+#define NTB_NETDEV_MAX_QUEUES 64
+
+/*
+ * ntb_num_queues == 0 (default) means:
+ * - use num_online_cpus() as the desired queue count, capped by
+ * NTB_NETDEV_MAX_QUEUES.
+ * ntb_num_queues > 0:
+ * - try to create exactly ntb_num_queues queue pairs (again capped
+ * by NTB_NETDEV_MAX_QUEUES), but fall back to the number of queue
+ * pairs actually available from ntb_transport.
+ */
+static unsigned int ntb_num_queues;
+module_param(ntb_num_queues, uint, 0644);
+MODULE_PARM_DESC(ntb_num_queues,
+ "Number of NTB netdev queue pairs to use (0 = per-CPU)");
+
+struct ntb_netdev;
+
+struct ntb_netdev_queue {
+ struct ntb_netdev *ntdev;
+ struct ntb_transport_qp *qp;
+ struct timer_list tx_timer;
+ u16 qid;
+};
+
struct ntb_netdev {
struct pci_dev *pdev;
struct net_device *ndev;
- struct ntb_transport_qp *qp;
- struct timer_list tx_timer;
+ unsigned int num_queues;
+ struct ntb_netdev_queue *queues;
};
#define NTB_TX_TIMEOUT_MS 1000
#define NTB_RXQ_SIZE 100
+static unsigned int ntb_netdev_default_queues(void)
+{
+ unsigned int n;
+
+ if (ntb_num_queues)
+ n = ntb_num_queues;
+ else
+ n = num_online_cpus();
+
+ if (!n)
+ n = 1;
+
+ if (n > NTB_NETDEV_MAX_QUEUES)
+ n = NTB_NETDEV_MAX_QUEUES;
+
+ return n;
+}
+
static void ntb_netdev_event_handler(void *data, int link_is_up)
{
- struct net_device *ndev = data;
- struct ntb_netdev *dev = netdev_priv(ndev);
+ struct ntb_netdev_queue *q = data;
+ struct ntb_netdev *dev = q->ntdev;
+ struct net_device *ndev = dev->ndev;
+ bool any_up = false;
+ unsigned int i;
- netdev_dbg(ndev, "Event %x, Link %x\n", link_is_up,
- ntb_transport_link_query(dev->qp));
+ netdev_dbg(ndev, "Event %x, Link %x, qp %u\n", link_is_up,
+ ntb_transport_link_query(q->qp), q->qid);
if (link_is_up) {
- if (ntb_transport_link_query(dev->qp))
+ for (i = 0; i < dev->num_queues; i++) {
+ if (ntb_transport_link_query(dev->queues[i].qp)) {
+ any_up = true;
+ break;
+ }
+ }
+
+ if (any_up)
netif_carrier_on(ndev);
} else {
netif_carrier_off(ndev);
@@ -99,7 +159,9 @@ static void ntb_netdev_event_handler(void *data, int link_is_up)
static void ntb_netdev_rx_handler(struct ntb_transport_qp *qp, void *qp_data,
void *data, int len)
{
- struct net_device *ndev = qp_data;
+ struct ntb_netdev_queue *q = qp_data;
+ struct ntb_netdev *dev = q->ntdev;
+ struct net_device *ndev = dev->ndev;
struct sk_buff *skb;
int rc;
@@ -135,7 +197,8 @@ static void ntb_netdev_rx_handler(struct ntb_transport_qp *qp, void *qp_data,
}
enqueue_again:
- rc = ntb_transport_rx_enqueue(qp, skb, skb->data, ndev->mtu + ETH_HLEN);
+ rc = ntb_transport_rx_enqueue(q->qp, skb, skb->data,
+ ndev->mtu + ETH_HLEN);
if (rc) {
dev_kfree_skb_any(skb);
ndev->stats.rx_errors++;
@@ -143,42 +206,37 @@ static void ntb_netdev_rx_handler(struct ntb_transport_qp *qp, void *qp_data,
}
}
-static int __ntb_netdev_maybe_stop_tx(struct net_device *netdev,
- struct ntb_transport_qp *qp, int size)
+static int ntb_netdev_maybe_stop_tx(struct ntb_netdev_queue *q, int size)
{
- struct ntb_netdev *dev = netdev_priv(netdev);
+ struct net_device *ndev = q->ntdev->ndev;
+
+ if (ntb_transport_tx_free_entry(q->qp) >= size)
+ return 0;
+
+ netif_stop_subqueue(ndev, q->qid);
- netif_stop_queue(netdev);
/* Make sure to see the latest value of ntb_transport_tx_free_entry()
* since the queue was last started.
*/
smp_mb();
- if (likely(ntb_transport_tx_free_entry(qp) < size)) {
- mod_timer(&dev->tx_timer, jiffies + usecs_to_jiffies(tx_time));
+ if (likely(ntb_transport_tx_free_entry(q->qp) < size)) {
+ mod_timer(&q->tx_timer, jiffies + usecs_to_jiffies(tx_time));
return -EBUSY;
}
- netif_start_queue(netdev);
- return 0;
-}
-
-static int ntb_netdev_maybe_stop_tx(struct net_device *ndev,
- struct ntb_transport_qp *qp, int size)
-{
- if (netif_queue_stopped(ndev) ||
- (ntb_transport_tx_free_entry(qp) >= size))
- return 0;
+ netif_wake_subqueue(ndev, q->qid);
- return __ntb_netdev_maybe_stop_tx(ndev, qp, size);
+ return 0;
}
static void ntb_netdev_tx_handler(struct ntb_transport_qp *qp, void *qp_data,
void *data, int len)
{
- struct net_device *ndev = qp_data;
+ struct ntb_netdev_queue *q = qp_data;
+ struct ntb_netdev *dev = q->ntdev;
+ struct net_device *ndev = dev->ndev;
struct sk_buff *skb;
- struct ntb_netdev *dev = netdev_priv(ndev);
skb = data;
if (!skb || !ndev)
@@ -194,13 +252,12 @@ static void ntb_netdev_tx_handler(struct ntb_transport_qp *qp, void *qp_data,
dev_kfree_skb_any(skb);
- if (ntb_transport_tx_free_entry(dev->qp) >= tx_start) {
+ if (ntb_transport_tx_free_entry(qp) >= tx_start) {
/* Make sure anybody stopping the queue after this sees the new
* value of ntb_transport_tx_free_entry()
*/
smp_mb();
- if (netif_queue_stopped(ndev))
- netif_wake_queue(ndev);
+ netif_wake_subqueue(ndev, q->qid);
}
}
@@ -208,16 +265,26 @@ static netdev_tx_t ntb_netdev_start_xmit(struct sk_buff *skb,
struct net_device *ndev)
{
struct ntb_netdev *dev = netdev_priv(ndev);
+ u16 qid = skb_get_queue_mapping(skb);
+ struct ntb_netdev_queue *q;
int rc;
- ntb_netdev_maybe_stop_tx(ndev, dev->qp, tx_stop);
+ if (unlikely(!dev->num_queues))
+ goto err;
+
+ if (unlikely(qid >= dev->num_queues))
+ qid = qid % dev->num_queues;
- rc = ntb_transport_tx_enqueue(dev->qp, skb, skb->data, skb->len);
+ q = &dev->queues[qid];
+
+ ntb_netdev_maybe_stop_tx(q, tx_stop);
+
+ rc = ntb_transport_tx_enqueue(q->qp, skb, skb->data, skb->len);
if (rc)
goto err;
/* check for next submit */
- ntb_netdev_maybe_stop_tx(ndev, dev->qp, tx_stop);
+ ntb_netdev_maybe_stop_tx(q, tx_stop);
return NETDEV_TX_OK;
@@ -229,80 +296,103 @@ static netdev_tx_t ntb_netdev_start_xmit(struct sk_buff *skb,
static void ntb_netdev_tx_timer(struct timer_list *t)
{
- struct ntb_netdev *dev = timer_container_of(dev, t, tx_timer);
+ struct ntb_netdev_queue *q = container_of(t, struct ntb_netdev_queue, tx_timer);
+ struct ntb_netdev *dev = q->ntdev;
struct net_device *ndev = dev->ndev;
- if (ntb_transport_tx_free_entry(dev->qp) < tx_stop) {
- mod_timer(&dev->tx_timer, jiffies + usecs_to_jiffies(tx_time));
+ if (ntb_transport_tx_free_entry(q->qp) < tx_stop) {
+ mod_timer(&q->tx_timer, jiffies + usecs_to_jiffies(tx_time));
} else {
- /* Make sure anybody stopping the queue after this sees the new
+ /*
+ * Make sure anybody stopping the queue after this sees the new
* value of ntb_transport_tx_free_entry()
*/
smp_mb();
- if (netif_queue_stopped(ndev))
- netif_wake_queue(ndev);
+ netif_wake_subqueue(ndev, q->qid);
}
}
static int ntb_netdev_open(struct net_device *ndev)
{
struct ntb_netdev *dev = netdev_priv(ndev);
+ struct ntb_netdev_queue *queue;
struct sk_buff *skb;
- int rc, i, len;
-
- /* Add some empty rx bufs */
- for (i = 0; i < NTB_RXQ_SIZE; i++) {
- skb = netdev_alloc_skb(ndev, ndev->mtu + ETH_HLEN);
- if (!skb) {
- rc = -ENOMEM;
- goto err;
- }
+ int rc = 0, i, len;
+ unsigned int q;
- rc = ntb_transport_rx_enqueue(dev->qp, skb, skb->data,
- ndev->mtu + ETH_HLEN);
- if (rc) {
- dev_kfree_skb(skb);
- goto err;
+ /* Add some empty rx bufs for each queue */
+ for (q = 0; q < dev->num_queues; q++) {
+ queue = &dev->queues[q];
+
+ for (i = 0; i < NTB_RXQ_SIZE; i++) {
+ skb = netdev_alloc_skb(ndev, ndev->mtu + ETH_HLEN);
+ if (!skb) {
+ rc = -ENOMEM;
+ goto err;
+ }
+
+ rc = ntb_transport_rx_enqueue(queue->qp, skb, skb->data,
+ ndev->mtu + ETH_HLEN);
+ if (rc) {
+ dev_kfree_skb(skb);
+ goto err;
+ }
}
- }
- timer_setup(&dev->tx_timer, ntb_netdev_tx_timer, 0);
+ timer_setup(&queue->tx_timer, ntb_netdev_tx_timer, 0);
+ }
netif_carrier_off(ndev);
- ntb_transport_link_up(dev->qp);
- netif_start_queue(ndev);
+
+ for (q = 0; q < dev->num_queues; q++)
+ ntb_transport_link_up(dev->queues[q].qp);
+
+ netif_tx_start_all_queues(ndev);
return 0;
err:
- while ((skb = ntb_transport_rx_remove(dev->qp, &len)))
- dev_kfree_skb(skb);
+ for (q = 0; q < dev->num_queues; q++) {
+ queue = &dev->queues[q];
+
+ while ((skb = ntb_transport_rx_remove(queue->qp, &len)))
+ dev_kfree_skb(skb);
+ }
return rc;
}
static int ntb_netdev_close(struct net_device *ndev)
{
struct ntb_netdev *dev = netdev_priv(ndev);
+ struct ntb_netdev_queue *queue;
struct sk_buff *skb;
+ unsigned int q;
int len;
- ntb_transport_link_down(dev->qp);
+ netif_tx_stop_all_queues(ndev);
+
+ for (q = 0; q < dev->num_queues; q++) {
+ queue = &dev->queues[q];
- while ((skb = ntb_transport_rx_remove(dev->qp, &len)))
- dev_kfree_skb(skb);
+ ntb_transport_link_down(queue->qp);
- timer_delete_sync(&dev->tx_timer);
+ while ((skb = ntb_transport_rx_remove(queue->qp, &len)))
+ dev_kfree_skb(skb);
+ timer_delete_sync(&queue->tx_timer);
+ }
return 0;
}
static int ntb_netdev_change_mtu(struct net_device *ndev, int new_mtu)
{
struct ntb_netdev *dev = netdev_priv(ndev);
+ struct ntb_netdev_queue *queue;
struct sk_buff *skb;
- int len, rc;
+ unsigned int q, i;
+ int len, rc = 0;
- if (new_mtu > ntb_transport_max_size(dev->qp) - ETH_HLEN)
+ if (new_mtu > ntb_transport_max_size(dev->queues[0].qp) - ETH_HLEN)
return -EINVAL;
if (!netif_running(ndev)) {
@@ -311,41 +401,54 @@ static int ntb_netdev_change_mtu(struct net_device *ndev, int new_mtu)
}
/* Bring down the link and dispose of posted rx entries */
- ntb_transport_link_down(dev->qp);
+ for (q = 0; q < dev->num_queues; q++)
+ ntb_transport_link_down(dev->queues[q].qp);
if (ndev->mtu < new_mtu) {
- int i;
-
- for (i = 0; (skb = ntb_transport_rx_remove(dev->qp, &len)); i++)
- dev_kfree_skb(skb);
+ for (q = 0; q < dev->num_queues; q++) {
+ queue = &dev->queues[q];
- for (; i; i--) {
- skb = netdev_alloc_skb(ndev, new_mtu + ETH_HLEN);
- if (!skb) {
- rc = -ENOMEM;
- goto err;
- }
-
- rc = ntb_transport_rx_enqueue(dev->qp, skb, skb->data,
- new_mtu + ETH_HLEN);
- if (rc) {
+ for (i = 0;
+ (skb = ntb_transport_rx_remove(queue->qp, &len));
+ i++)
dev_kfree_skb(skb);
- goto err;
+
+ for (; i; i--) {
+ skb = netdev_alloc_skb(ndev,
+ new_mtu + ETH_HLEN);
+ if (!skb) {
+ rc = -ENOMEM;
+ goto err;
+ }
+
+ rc = ntb_transport_rx_enqueue(queue->qp, skb,
+ skb->data,
+ new_mtu +
+ ETH_HLEN);
+ if (rc) {
+ dev_kfree_skb(skb);
+ goto err;
+ }
}
}
}
WRITE_ONCE(ndev->mtu, new_mtu);
- ntb_transport_link_up(dev->qp);
+ for (q = 0; q < dev->num_queues; q++)
+ ntb_transport_link_up(dev->queues[q].qp);
return 0;
err:
- ntb_transport_link_down(dev->qp);
+ for (q = 0; q < dev->num_queues; q++) {
+ struct ntb_netdev_queue *queue = &dev->queues[q];
+
+ ntb_transport_link_down(queue->qp);
- while ((skb = ntb_transport_rx_remove(dev->qp, &len)))
- dev_kfree_skb(skb);
+ while ((skb = ntb_transport_rx_remove(queue->qp, &len)))
+ dev_kfree_skb(skb);
+ }
netdev_err(ndev, "Error changing MTU, device inoperable\n");
return rc;
@@ -404,6 +507,7 @@ static int ntb_netdev_probe(struct device *client_dev)
struct net_device *ndev;
struct pci_dev *pdev;
struct ntb_netdev *dev;
+ unsigned int q, desired_queues;
int rc;
ntb = dev_ntb(client_dev->parent);
@@ -411,7 +515,9 @@ static int ntb_netdev_probe(struct device *client_dev)
if (!pdev)
return -ENODEV;
- ndev = alloc_etherdev(sizeof(*dev));
+ desired_queues = ntb_netdev_default_queues();
+
+ ndev = alloc_etherdev_mq(sizeof(*dev), desired_queues);
if (!ndev)
return -ENOMEM;
@@ -420,6 +526,15 @@ static int ntb_netdev_probe(struct device *client_dev)
dev = netdev_priv(ndev);
dev->ndev = ndev;
dev->pdev = pdev;
+ dev->num_queues = 0;
+
+ dev->queues = kcalloc(desired_queues, sizeof(*dev->queues),
+ GFP_KERNEL);
+ if (!dev->queues) {
+ rc = -ENOMEM;
+ goto err_free_netdev;
+ }
+
ndev->features = NETIF_F_HIGHDMA;
ndev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
@@ -436,26 +551,51 @@ static int ntb_netdev_probe(struct device *client_dev)
ndev->min_mtu = 0;
ndev->max_mtu = ETH_MAX_MTU;
- dev->qp = ntb_transport_create_queue(ndev, client_dev,
- &ntb_netdev_handlers);
- if (!dev->qp) {
+ for (q = 0; q < desired_queues; q++) {
+ struct ntb_netdev_queue *queue = &dev->queues[q];
+
+ queue->ntdev = dev;
+ queue->qid = q;
+ queue->qp = ntb_transport_create_queue(queue, client_dev,
+ &ntb_netdev_handlers);
+ if (!queue->qp)
+ break;
+
+ dev->num_queues++;
+ }
+
+ if (!dev->num_queues) {
rc = -EIO;
- goto err;
+ goto err_free_queues;
}
- ndev->mtu = ntb_transport_max_size(dev->qp) - ETH_HLEN;
+ rc = netif_set_real_num_tx_queues(ndev, dev->num_queues);
+ if (rc)
+ goto err_free_qps;
+
+ rc = netif_set_real_num_rx_queues(ndev, dev->num_queues);
+ if (rc)
+ goto err_free_qps;
+
+ ndev->mtu = ntb_transport_max_size(dev->queues[0].qp) - ETH_HLEN;
rc = register_netdev(ndev);
if (rc)
- goto err1;
+ goto err_free_qps;
dev_set_drvdata(client_dev, ndev);
- dev_info(&pdev->dev, "%s created\n", ndev->name);
+ dev_info(&pdev->dev, "%s created with %u queue pairs\n",
+ ndev->name, dev->num_queues);
return 0;
-err1:
- ntb_transport_free_queue(dev->qp);
-err:
+err_free_qps:
+ for (q = 0; q < dev->num_queues; q++)
+ ntb_transport_free_queue(dev->queues[q].qp);
+
+err_free_queues:
+ kfree(dev->queues);
+
+err_free_netdev:
free_netdev(ndev);
return rc;
}
@@ -464,9 +604,14 @@ static void ntb_netdev_remove(struct device *client_dev)
{
struct net_device *ndev = dev_get_drvdata(client_dev);
struct ntb_netdev *dev = netdev_priv(ndev);
+ unsigned int q;
+
unregister_netdev(ndev);
- ntb_transport_free_queue(dev->qp);
+ for (q = 0; q < dev->num_queues; q++)
+ ntb_transport_free_queue(dev->queues[q].qp);
+
+ kfree(dev->queues);
free_netdev(ndev);
}
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 23/27] NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car)
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (21 preceding siblings ...)
2025-11-29 16:04 ` [RFC PATCH v2 22/27] ntb_netdev: Multi-queue support Koichiro Den
@ 2025-11-29 16:04 ` Koichiro Den
2025-12-01 20:47 ` Frank Li
2025-11-29 16:04 ` [RFC PATCH v2 24/27] iommu: ipmmu-vmsa: Add PCIe ch0 to devices_allowlist Koichiro Den
` (4 subsequent siblings)
27 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:04 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Some R-Car platforms using Synopsys DesignWare PCIe with the integrated
eDMA exhibit reproducible payload corruption in RC->EP remote DMA read
traffic whenever the endpoint issues 256-byte Memory Read (MRd) TLPs.
The eDMA injects multiple MRd requests of size less than or equal to
min(MRRS, MPS), so constraining the endpoint's MRd request size removes
256-byte MRd TLPs and avoids the issue. This change adds a per-SoC knob
in the ntb_hw_epf driver and sets MRRS=128 on R-Car.
We intentionally do not change the endpoint's MPS. Per PCIe Base
Specification, MPS limits the payload size of TLPs with data transmitted
by the Function, while Max_Read_Request_Size limits the size of read
requests produced by the Function as a Requester. Limiting MRRS is
sufficient to constrain MRd Byte Count, while lowering MPS would also
throttle unrelated traffic (e.g. endpoint-originated Posted Writes and
Completions with Data) without being necessary for this fix.
This quirk is scoped to the affected endpoint only and can be removed
once the underlying issue is resolved in the controller/IP.
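To make the effect of the cap explicit (a sketch only; pcie_set_readrq() is
the standard PCI core helper used below, and the 512/256 figures are merely
example MRRS/MPS defaults, not measured values from the affected hardware):

  /* before: MRRS = 512, MPS = 256
   *   largest MRd the eDMA may issue = min(512, 256) = 256 bytes -> corruption
   */
  pcie_set_readrq(pdev, 128);
  /* after:  MRRS = 128, MPS = 256
   *   largest MRd the eDMA may issue = min(128, 256) = 128 bytes -> OK
   */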
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/ntb/hw/epf/ntb_hw_epf.c | 66 +++++++++++++++++++++++++++++----
1 file changed, 58 insertions(+), 8 deletions(-)
diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
index d9811da90599..21eb26b2f7cc 100644
--- a/drivers/ntb/hw/epf/ntb_hw_epf.c
+++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
@@ -51,6 +51,12 @@
#define NTB_EPF_COMMAND_TIMEOUT 1000 /* 1 Sec */
+struct ntb_epf_soc_data {
+ const enum pci_barno *barno_map;
+ /* non-zero to override MRRS for this SoC */
+ int force_mrrs;
+};
+
enum epf_ntb_bar {
BAR_CONFIG,
BAR_PEER_SPAD,
@@ -594,11 +600,12 @@ static int ntb_epf_init_dev(struct ntb_epf_dev *ndev)
}
static int ntb_epf_init_pci(struct ntb_epf_dev *ndev,
- struct pci_dev *pdev)
+ struct pci_dev *pdev,
+ const struct ntb_epf_soc_data *soc)
{
struct device *dev = ndev->dev;
size_t spad_sz, spad_off;
- int ret;
+ int ret, cur;
pci_set_drvdata(pdev, ndev);
@@ -616,6 +623,17 @@ static int ntb_epf_init_pci(struct ntb_epf_dev *ndev,
pci_set_master(pdev);
+ if (soc && pci_is_pcie(pdev) && soc->force_mrrs) {
+ cur = pcie_get_readrq(pdev);
+ ret = pcie_set_readrq(pdev, soc->force_mrrs);
+ if (ret)
+ dev_warn(&pdev->dev, "failed to set MRRS=%d: %d\n",
+ soc->force_mrrs, ret);
+ else
+ dev_info(&pdev->dev, "capped MRRS: %d->%d for ntb-epf\n",
+ cur, soc->force_mrrs);
+ }
+
ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64));
if (ret) {
ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
@@ -690,6 +708,7 @@ static void ntb_epf_cleanup_isr(struct ntb_epf_dev *ndev)
static int ntb_epf_pci_probe(struct pci_dev *pdev,
const struct pci_device_id *id)
{
+ const struct ntb_epf_soc_data *soc = (const void *)id->driver_data;
struct device *dev = &pdev->dev;
struct ntb_epf_dev *ndev;
int ret;
@@ -701,16 +720,16 @@ static int ntb_epf_pci_probe(struct pci_dev *pdev,
if (!ndev)
return -ENOMEM;
- ndev->barno_map = (const enum pci_barno *)id->driver_data;
- if (!ndev->barno_map)
+ if (!soc || !soc->barno_map)
return -EINVAL;
+ ndev->barno_map = soc->barno_map;
ndev->dev = dev;
ntb_epf_init_struct(ndev, pdev);
mutex_init(&ndev->cmd_lock);
- ret = ntb_epf_init_pci(ndev, pdev);
+ ret = ntb_epf_init_pci(ndev, pdev, soc);
if (ret) {
dev_err(dev, "Failed to init PCI\n");
return ret;
@@ -778,21 +797,52 @@ static const enum pci_barno rcar_barno[NTB_BAR_NUM] = {
[BAR_MW4] = NO_BAR,
};
+static const struct ntb_epf_soc_data j721e_soc = {
+ .barno_map = j721e_map,
+};
+
+static const struct ntb_epf_soc_data mx8_soc = {
+ .barno_map = mx8_map,
+};
+
+static const struct ntb_epf_soc_data rcar_soc = {
+ .barno_map = rcar_barno,
+ /*
+ * On some R-Car platforms using the Synopsys DWC PCIe + eDMA we
+ * observe data corruption on RC->EP Remote DMA Read paths whenever
+ * the EP issues large MRd requests. The corruption consistently
+ * hits the tail of each 256-byte segment (e.g. offsets
+ * 0x00E0..0x00FF within a 256B block, and again at 0x01E0..0x01FF
+ * for larger transfers).
+ *
+ * The DMA injects multiple MRd requests of size less than or equal
+ * to the min(MRRS, MPS) into the outbound request path. By
+ * lowering MRRS to 128 we prevent 256B MRd TLPs from being
+ * generated and avoid the issue on the affected hardware. We
+ * intentionally keep MPS unchanged and scope this quirk to this
+ * endpoint to avoid impacting unrelated devices.
+ *
+ * Remove this once the issue is resolved (maybe controller/IP
+ * level) or a more preferable workaround becomes available.
+ */
+ .force_mrrs = 128,
+};
+
static const struct pci_device_id ntb_epf_pci_tbl[] = {
{
PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_J721E),
.class = PCI_CLASS_MEMORY_RAM << 8, .class_mask = 0xffff00,
- .driver_data = (kernel_ulong_t)j721e_map,
+ .driver_data = (kernel_ulong_t)&j721e_soc,
},
{
PCI_DEVICE(PCI_VENDOR_ID_FREESCALE, 0x0809),
.class = PCI_CLASS_MEMORY_RAM << 8, .class_mask = 0xffff00,
- .driver_data = (kernel_ulong_t)mx8_map,
+ .driver_data = (kernel_ulong_t)&mx8_soc,
},
{
PCI_DEVICE(PCI_VENDOR_ID_RENESAS, 0x0030),
.class = PCI_CLASS_MEMORY_RAM << 8, .class_mask = 0xffff00,
- .driver_data = (kernel_ulong_t)rcar_barno,
+ .driver_data = (kernel_ulong_t)&rcar_soc,
},
{ },
};
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 24/27] iommu: ipmmu-vmsa: Add PCIe ch0 to devices_allowlist
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (22 preceding siblings ...)
2025-11-29 16:04 ` [RFC PATCH v2 23/27] NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car) Koichiro Den
@ 2025-11-29 16:04 ` Koichiro Den
2025-11-29 16:04 ` [RFC PATCH v2 25/27] iommu: ipmmu-vmsa: Add support for reserved regions Koichiro Den
` (3 subsequent siblings)
27 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:04 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Add PCIe ch0 to the ipmmu-vmsa devices_allowlist so that traffic routed
through this PCIe instance can be translated by the IOMMU.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/iommu/ipmmu-vmsa.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index ca848288dbf2..724d67ad5ef2 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -743,7 +743,9 @@ static const char * const devices_allowlist[] = {
"ee100000.mmc",
"ee120000.mmc",
"ee140000.mmc",
- "ee160000.mmc"
+ "ee160000.mmc",
+ "e65d0000.pcie",
+ "e65d0000.pcie-ep",
};
static bool ipmmu_device_is_allowed(struct device *dev)
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 25/27] iommu: ipmmu-vmsa: Add support for reserved regions
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (23 preceding siblings ...)
2025-11-29 16:04 ` [RFC PATCH v2 24/27] iommu: ipmmu-vmsa: Add PCIe ch0 to devices_allowlist Koichiro Den
@ 2025-11-29 16:04 ` Koichiro Den
2025-11-29 16:04 ` [RFC PATCH v2 26/27] arm64: dts: renesas: Add Spider RC/EP DTs for NTB with remote DW PCIe eDMA Koichiro Den
` (2 subsequent siblings)
27 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:04 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Add support for reserved regions using iommu_dma_get_resv_regions().
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/iommu/ipmmu-vmsa.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 724d67ad5ef2..4a89d95db0f8 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -25,6 +25,8 @@
#include <linux/slab.h>
#include <linux/sys_soc.h>
+#include "dma-iommu.h"
+
#if defined(CONFIG_ARM) && !defined(CONFIG_IOMMU_DMA)
#include <asm/dma-iommu.h>
#else
@@ -888,6 +890,7 @@ static const struct iommu_ops ipmmu_ops = {
.device_group = IS_ENABLED(CONFIG_ARM) && !IS_ENABLED(CONFIG_IOMMU_DMA)
? generic_device_group : generic_single_device_group,
.of_xlate = ipmmu_of_xlate,
+ .get_resv_regions = iommu_dma_get_resv_regions,
.default_domain_ops = &(const struct iommu_domain_ops) {
.attach_dev = ipmmu_attach_device,
.map_pages = ipmmu_map,
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 26/27] arm64: dts: renesas: Add Spider RC/EP DTs for NTB with remote DW PCIe eDMA
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (24 preceding siblings ...)
2025-11-29 16:04 ` [RFC PATCH v2 25/27] iommu: ipmmu-vmsa: Add support for reserved regions Koichiro Den
@ 2025-11-29 16:04 ` Koichiro Den
2025-11-29 16:04 ` [RFC PATCH v2 27/27] NTB: epf: Add an additional memory window (MW2) barno mapping on Renesas R-Car Koichiro Den
2025-12-01 22:02 ` [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Frank Li
27 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:04 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
Add dedicated DTs for the Spider CPU+BreakOut boards when used in PCIe
RC/EP mode with DW PCIe eDMA based NTB transport.
* r8a779f0-spider-rc.dts describes the board in RC mode.
It reserves 4 MiB of IOVA starting at 0xfe000000, which on this SoC
is the ECAM/Config aperture of the PCIe host bridge. In stress
testing with the remote eDMA, allowing generic DMA mappings to occupy
this range led to immediate instability. The exact mechanism is under
investigation, but reserving the range avoids the issue in practice.
* r8a779f0-spider-ep.dts describes the board in EP mode.
The RC interface is disabled and the EP interface is enabled. IPMMU
usage matches the RC case.
The base r8a779f0-spider.dts is intentionally left unchanged and
continues to describe the default RC-only board configuration.
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
arch/arm64/boot/dts/renesas/Makefile | 2 +
.../boot/dts/renesas/r8a779f0-spider-ep.dts | 46 ++++++++++++++++
.../boot/dts/renesas/r8a779f0-spider-rc.dts | 52 +++++++++++++++++++
3 files changed, 100 insertions(+)
create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-ep.dts
create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-rc.dts
diff --git a/arch/arm64/boot/dts/renesas/Makefile b/arch/arm64/boot/dts/renesas/Makefile
index 1fab1b50f20e..e8d312be515b 100644
--- a/arch/arm64/boot/dts/renesas/Makefile
+++ b/arch/arm64/boot/dts/renesas/Makefile
@@ -82,6 +82,8 @@ dtb-$(CONFIG_ARCH_R8A77995) += r8a77995-draak-panel-aa104xd12.dtb
dtb-$(CONFIG_ARCH_R8A779A0) += r8a779a0-falcon.dtb
dtb-$(CONFIG_ARCH_R8A779F0) += r8a779f0-spider.dtb
+dtb-$(CONFIG_ARCH_R8A779F0) += r8a779f0-spider-ep.dtb
+dtb-$(CONFIG_ARCH_R8A779F0) += r8a779f0-spider-rc.dtb
dtb-$(CONFIG_ARCH_R8A779F0) += r8a779f4-s4sk.dtb
dtb-$(CONFIG_ARCH_R8A779G0) += r8a779g0-white-hawk.dtb
diff --git a/arch/arm64/boot/dts/renesas/r8a779f0-spider-ep.dts b/arch/arm64/boot/dts/renesas/r8a779f0-spider-ep.dts
new file mode 100644
index 000000000000..9c9e29226458
--- /dev/null
+++ b/arch/arm64/boot/dts/renesas/r8a779f0-spider-ep.dts
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+/*
+ * Device Tree Source for the Spider CPU and BreakOut boards
+ * (PCIe EP mode with DW PCIe eDMA used for NTB transport)
+ *
+ * Based on the base r8a779f0-spider.dts.
+ *
+ * Copyright (C) 2025 Renesas Electronics Corp.
+ */
+
+/dts-v1/;
+#include "r8a779f0-spider-cpu.dtsi"
+#include "r8a779f0-spider-ethernet.dtsi"
+
+/ {
+ model = "Renesas Spider CPU and Breakout boards based on r8a779f0";
+ compatible = "renesas,spider-breakout", "renesas,spider-cpu",
+ "renesas,r8a779f0";
+};
+
+&i2c4 {
+ eeprom@51 {
+ compatible = "rohm,br24g01", "atmel,24c01";
+ label = "breakout-board";
+ reg = <0x51>;
+ pagesize = <8>;
+ };
+};
+
+&pciec0 {
+ status = "disabled";
+};
+
+&pciec0_ep {
+ iommus = <&ipmmu_hc 32>;
+ status = "okay";
+ /* Hide eDMA from generic EP users; it is driven remotely by the host side */
+ reg = <0 0xe65d0000 0 0x2000>, <0 0xe65d2000 0 0x1000>,
+ <0 0xe65d3000 0 0x2000>, <0 0xe65d6200 0 0x0e00>,
+ <0 0xe65d7000 0 0x0400>, <0 0xfe000000 0 0x400000>;
+ reg-names = "dbi", "dbi2", "atu", "app", "phy", "addr_space";
+ interrupts = <GIC_SPI 418 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 422 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "sft_ce", "app";
+ interrupt-parent = <&gic>;
+};
diff --git a/arch/arm64/boot/dts/renesas/r8a779f0-spider-rc.dts b/arch/arm64/boot/dts/renesas/r8a779f0-spider-rc.dts
new file mode 100644
index 000000000000..c7112862e1e1
--- /dev/null
+++ b/arch/arm64/boot/dts/renesas/r8a779f0-spider-rc.dts
@@ -0,0 +1,52 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+/*
+ * Device Tree Source for the Spider CPU and BreakOut boards
+ * (PCIe RC mode with remote DW PCIe eDMA used for NTB transport)
+ *
+ * Based on the base r8a779f0-spider.dts.
+ *
+ * Copyright (C) 2025 Renesas Electronics Corp.
+ */
+
+/dts-v1/;
+#include "r8a779f0-spider-cpu.dtsi"
+#include "r8a779f0-spider-ethernet.dtsi"
+
+/ {
+ model = "Renesas Spider CPU and Breakout boards based on r8a779f0";
+ compatible = "renesas,spider-breakout", "renesas,spider-cpu",
+ "renesas,r8a779f0";
+
+ reserved-memory {
+ #address-cells = <2>;
+ #size-cells = <2>;
+ ranges;
+
+ /*
+ * Reserve 4 MiB of IOVA starting at 0xfe000000. Allowing DMA
+ * writes whose DAR (destination IOVA) falls numerically inside
+ * the ECAM/config window has been observed to trigger
+ * controller misbehavior.
+ */
+ pciec0_iova_resv: pcie-iova-resv {
+ iommu-addresses = <&pciec0 0x0 0xfe000000 0x0 0x00400000>;
+ };
+ };
+};
+
+&i2c4 {
+ eeprom@51 {
+ compatible = "rohm,br24g01", "atmel,24c01";
+ label = "breakout-board";
+ reg = <0x51>;
+ pagesize = <8>;
+ };
+};
+
+&pciec0 {
+ iommus = <&ipmmu_hc 32>;
+ iommu-map = <0 &ipmmu_hc 32 1>;
+ iommu-map-mask = <0>;
+
+ memory-region = <&pciec0_iova_resv>;
+};
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* [RFC PATCH v2 27/27] NTB: epf: Add an additional memory window (MW2) barno mapping on Renesas R-Car
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (25 preceding siblings ...)
2025-11-29 16:04 ` [RFC PATCH v2 26/27] arm64: dts: renesas: Add Spider RC/EP DTs for NTB with remote DW PCIe eDMA Koichiro Den
@ 2025-11-29 16:04 ` Koichiro Den
2025-12-01 22:02 ` [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Frank Li
27 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-11-29 16:04 UTC (permalink / raw)
To: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
To enable remote eDMA mode on NTB transport, one additional memory
window is required. Since a single BAR can now be split into multiple
memory windows, add MW2 to BAR2 on R-Car.
For pci_epf_vntb configfs settings, users who want to use MW2 (e.g. to
enable remote eDMA mode for NTB transport as mentioned above) may
configure as follows:
$ echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/num_mws
$ echo 0xE0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1
$ echo 0x20000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2
$ echo 0xE0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2_offset
$ echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1_bar
$ echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2_bar
Signed-off-by: Koichiro Den <den@valinux.co.jp>
---
drivers/ntb/hw/epf/ntb_hw_epf.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
index 21eb26b2f7cc..19a4c07bbc8f 100644
--- a/drivers/ntb/hw/epf/ntb_hw_epf.c
+++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
@@ -792,7 +792,7 @@ static const enum pci_barno rcar_barno[NTB_BAR_NUM] = {
[BAR_PEER_SPAD] = BAR_0,
[BAR_DB] = BAR_4,
[BAR_MW1] = BAR_2,
- [BAR_MW2] = NO_BAR,
+ [BAR_MW2] = BAR_2,
[BAR_MW3] = NO_BAR,
[BAR_MW4] = NO_BAR,
};
--
2.48.1
^ permalink raw reply related [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 01/27] PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on mws_size[] access
2025-11-29 16:03 ` [RFC PATCH v2 01/27] PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on mws_size[] access Koichiro Den
@ 2025-12-01 18:59 ` Frank Li
0 siblings, 0 replies; 97+ messages in thread
From: Frank Li @ 2025-12-01 18:59 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:39AM +0900, Koichiro Den wrote:
> Follow common kernel idioms for indices derived from configfs attributes
> and suppress Smatch warnings:
>
> epf_ntb_mw1_show() warn: potential spectre issue 'ntb->mws_size' [r]
> epf_ntb_mw1_store() warn: potential spectre issue 'ntb->mws_size' [w]
>
> Also fix the error message for out-of-range MW indices.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
Reviewed-by: Frank Li <Frank.Li@nxp.com>
> drivers/pci/endpoint/functions/pci-epf-vntb.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/pci/endpoint/functions/pci-epf-vntb.c b/drivers/pci/endpoint/functions/pci-epf-vntb.c
> index 3ecc5059f92b..6c4c78915970 100644
> --- a/drivers/pci/endpoint/functions/pci-epf-vntb.c
> +++ b/drivers/pci/endpoint/functions/pci-epf-vntb.c
> @@ -995,17 +995,18 @@ static ssize_t epf_ntb_##_name##_show(struct config_item *item, \
> struct config_group *group = to_config_group(item); \
> struct epf_ntb *ntb = to_epf_ntb(group); \
> struct device *dev = &ntb->epf->dev; \
> - int win_no; \
> + int win_no, idx; \
> \
> if (sscanf(#_name, "mw%d", &win_no) != 1) \
> return -EINVAL; \
> \
> if (win_no <= 0 || win_no > ntb->num_mws) { \
> - dev_err(dev, "Invalid num_nws: %d value\n", ntb->num_mws); \
> + dev_err(dev, "MW%d out of range (num_mws=%d)\n", \
> + win_no, ntb->num_mws); \
> return -EINVAL; \
> } \
> - \
> - return sprintf(page, "%lld\n", ntb->mws_size[win_no - 1]); \
> + idx = array_index_nospec(win_no - 1, ntb->num_mws); \
> + return sprintf(page, "%lld\n", ntb->mws_size[idx]); \
> }
>
> #define EPF_NTB_MW_W(_name) \
> @@ -1015,7 +1016,7 @@ static ssize_t epf_ntb_##_name##_store(struct config_item *item, \
> struct config_group *group = to_config_group(item); \
> struct epf_ntb *ntb = to_epf_ntb(group); \
> struct device *dev = &ntb->epf->dev; \
> - int win_no; \
> + int win_no, idx; \
> u64 val; \
> int ret; \
> \
> @@ -1027,11 +1028,13 @@ static ssize_t epf_ntb_##_name##_store(struct config_item *item, \
> return -EINVAL; \
> \
> if (win_no <= 0 || win_no > ntb->num_mws) { \
> - dev_err(dev, "Invalid num_nws: %d value\n", ntb->num_mws); \
> + dev_err(dev, "MW%d out of range (num_mws=%d)\n", \
> + win_no, ntb->num_mws); \
> return -EINVAL; \
> } \
> \
> - ntb->mws_size[win_no - 1] = val; \
> + idx = array_index_nospec(win_no - 1, ntb->num_mws); \
> + ntb->mws_size[idx] = val; \
> \
> return len; \
> }
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 02/27] PCI: endpoint: pci-epf-vntb: Add mwN_offset configfs attributes
2025-11-29 16:03 ` [RFC PATCH v2 02/27] PCI: endpoint: pci-epf-vntb: Add mwN_offset configfs attributes Koichiro Den
@ 2025-12-01 19:11 ` Frank Li
2025-12-02 6:23 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-01 19:11 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:40AM +0900, Koichiro Den wrote:
> Introduce new mwN_offset configfs attributes to specify memory window
> offsets. This enables mapping multiple windows into a single BAR at
This need update documents. But I am not clear how multiple windows map
into a single BAR, could you please give me an example, how to config
more than 2 memory space to one bar by echo to sys file commands.
Frank
> arbitrary offsets, improving layout flexibility.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> drivers/pci/endpoint/functions/pci-epf-vntb.c | 133 ++++++++++++++++--
> 1 file changed, 120 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/pci/endpoint/functions/pci-epf-vntb.c b/drivers/pci/endpoint/functions/pci-epf-vntb.c
> index 6c4c78915970..1ff414703566 100644
> --- a/drivers/pci/endpoint/functions/pci-epf-vntb.c
> +++ b/drivers/pci/endpoint/functions/pci-epf-vntb.c
> @@ -39,6 +39,7 @@
> #include <linux/atomic.h>
> #include <linux/delay.h>
> #include <linux/io.h>
> +#include <linux/log2.h>
> #include <linux/module.h>
> #include <linux/slab.h>
>
> @@ -111,7 +112,8 @@ struct epf_ntb_ctrl {
> u64 addr;
> u64 size;
> u32 num_mws;
> - u32 reserved;
> + u32 mw_offset[MAX_MW];
> + u32 mw_size[MAX_MW];
> u32 spad_offset;
> u32 spad_count;
> u32 db_entry_size;
> @@ -128,6 +130,7 @@ struct epf_ntb {
> u32 db_count;
> u32 spad_count;
> u64 mws_size[MAX_MW];
> + u64 mws_offset[MAX_MW];
> atomic64_t db;
> u32 vbus_number;
> u16 vntb_pid;
> @@ -458,6 +461,8 @@ static int epf_ntb_config_spad_bar_alloc(struct epf_ntb *ntb)
>
> ctrl->spad_count = spad_count;
> ctrl->num_mws = ntb->num_mws;
> + memset(ctrl->mw_offset, 0, sizeof(ctrl->mw_offset));
> + memset(ctrl->mw_size, 0, sizeof(ctrl->mw_size));
> ntb->spad_size = spad_size;
>
> ctrl->db_entry_size = sizeof(u32);
> @@ -689,15 +694,31 @@ static void epf_ntb_db_bar_clear(struct epf_ntb *ntb)
> */
> static int epf_ntb_mw_bar_init(struct epf_ntb *ntb)
> {
> + struct device *dev = &ntb->epf->dev;
> + u64 bar_ends[BAR_5 + 1] = { 0 };
> + unsigned long bars_used = 0;
> + enum pci_barno barno;
> + u64 off, size, end;
> int ret = 0;
> int i;
> - u64 size;
> - enum pci_barno barno;
> - struct device *dev = &ntb->epf->dev;
>
> for (i = 0; i < ntb->num_mws; i++) {
> - size = ntb->mws_size[i];
> barno = ntb->epf_ntb_bar[BAR_MW1 + i];
> + off = ntb->mws_offset[i];
> + size = ntb->mws_size[i];
> + end = off + size;
> + if (end > bar_ends[barno])
> + bar_ends[barno] = end;
> + bars_used |= BIT(barno);
> + }
> +
> + for (barno = BAR_0; barno <= BAR_5; barno++) {
> + if (!(bars_used & BIT(barno)))
> + continue;
> + if (bar_ends[barno] < SZ_4K)
> + size = SZ_4K;
> + else
> + size = roundup_pow_of_two(bar_ends[barno]);
>
> ntb->epf->bar[barno].barno = barno;
> ntb->epf->bar[barno].size = size;
> @@ -713,8 +734,12 @@ static int epf_ntb_mw_bar_init(struct epf_ntb *ntb)
> &ntb->epf->bar[barno]);
> if (ret) {
> dev_err(dev, "MW set failed\n");
> - goto err_alloc_mem;
> + goto err_set_bar;
> }
> + }
> +
> + for (i = 0; i < ntb->num_mws; i++) {
> + size = ntb->mws_size[i];
>
> /* Allocate EPC outbound memory windows to vpci vntb device */
> ntb->vpci_mw_addr[i] = pci_epc_mem_alloc_addr(ntb->epf->epc,
> @@ -723,19 +748,31 @@ static int epf_ntb_mw_bar_init(struct epf_ntb *ntb)
> if (!ntb->vpci_mw_addr[i]) {
> ret = -ENOMEM;
> dev_err(dev, "Failed to allocate source address\n");
> - goto err_set_bar;
> + goto err_alloc_mem;
> }
> }
>
> + for (i = 0; i < ntb->num_mws; i++) {
> + ntb->reg->mw_offset[i] = (u32)ntb->mws_offset[i];
> + ntb->reg->mw_size[i] = (u32)ntb->mws_size[i];
> + }
> +
> return ret;
>
> -err_set_bar:
> - pci_epc_clear_bar(ntb->epf->epc,
> - ntb->epf->func_no,
> - ntb->epf->vfunc_no,
> - &ntb->epf->bar[barno]);
> err_alloc_mem:
> - epf_ntb_mw_bar_clear(ntb, i);
> + while (--i >= 0)
> + pci_epc_mem_free_addr(ntb->epf->epc,
> + ntb->vpci_mw_phy[i],
> + ntb->vpci_mw_addr[i],
> + ntb->mws_size[i]);
> +err_set_bar:
> + while (--barno >= BAR_0)
> + if (bars_used & BIT(barno))
> + pci_epc_clear_bar(ntb->epf->epc,
> + ntb->epf->func_no,
> + ntb->epf->vfunc_no,
> + &ntb->epf->bar[barno]);
> +
> return ret;
> }
>
> @@ -1039,6 +1076,60 @@ static ssize_t epf_ntb_##_name##_store(struct config_item *item, \
> return len; \
> }
>
> +#define EPF_NTB_MW_OFF_R(_name) \
> +static ssize_t epf_ntb_##_name##_show(struct config_item *item, \
> + char *page) \
> +{ \
> + struct config_group *group = to_config_group(item); \
> + struct epf_ntb *ntb = to_epf_ntb(group); \
> + struct device *dev = &ntb->epf->dev; \
> + int win_no, idx; \
> + \
> + if (sscanf(#_name, "mw%d_offset", &win_no) != 1) \
> + return -EINVAL; \
> + \
> + idx = win_no - 1; \
> + if (idx < 0 || idx >= ntb->num_mws) { \
> + dev_err(dev, "MW%d out of range (num_mws=%d)\n", \
> + win_no, ntb->num_mws); \
> + return -EINVAL; \
> + } \
> + \
> + idx = array_index_nospec(idx, ntb->num_mws); \
> + return sprintf(page, "%lld\n", ntb->mws_offset[idx]); \
> +}
> +
> +#define EPF_NTB_MW_OFF_W(_name) \
> +static ssize_t epf_ntb_##_name##_store(struct config_item *item, \
> + const char *page, size_t len) \
> +{ \
> + struct config_group *group = to_config_group(item); \
> + struct epf_ntb *ntb = to_epf_ntb(group); \
> + struct device *dev = &ntb->epf->dev; \
> + int win_no, idx; \
> + u64 val; \
> + int ret; \
> + \
> + ret = kstrtou64(page, 0, &val); \
> + if (ret) \
> + return ret; \
> + \
> + if (sscanf(#_name, "mw%d_offset", &win_no) != 1) \
> + return -EINVAL; \
> + \
> + idx = win_no - 1; \
> + if (idx < 0 || idx >= ntb->num_mws) { \
> + dev_err(dev, "MW%d out of range (num_mws=%d)\n", \
> + win_no, ntb->num_mws); \
> + return -EINVAL; \
> + } \
> + \
> + idx = array_index_nospec(idx, ntb->num_mws); \
> + ntb->mws_offset[idx] = val; \
> + \
> + return len; \
> +}
> +
> #define EPF_NTB_BAR_R(_name, _id) \
> static ssize_t epf_ntb_##_name##_show(struct config_item *item, \
> char *page) \
> @@ -1109,6 +1200,14 @@ EPF_NTB_MW_R(mw3)
> EPF_NTB_MW_W(mw3)
> EPF_NTB_MW_R(mw4)
> EPF_NTB_MW_W(mw4)
> +EPF_NTB_MW_OFF_R(mw1_offset)
> +EPF_NTB_MW_OFF_W(mw1_offset)
> +EPF_NTB_MW_OFF_R(mw2_offset)
> +EPF_NTB_MW_OFF_W(mw2_offset)
> +EPF_NTB_MW_OFF_R(mw3_offset)
> +EPF_NTB_MW_OFF_W(mw3_offset)
> +EPF_NTB_MW_OFF_R(mw4_offset)
> +EPF_NTB_MW_OFF_W(mw4_offset)
> EPF_NTB_BAR_R(ctrl_bar, BAR_CONFIG)
> EPF_NTB_BAR_W(ctrl_bar, BAR_CONFIG)
> EPF_NTB_BAR_R(db_bar, BAR_DB)
> @@ -1129,6 +1228,10 @@ CONFIGFS_ATTR(epf_ntb_, mw1);
> CONFIGFS_ATTR(epf_ntb_, mw2);
> CONFIGFS_ATTR(epf_ntb_, mw3);
> CONFIGFS_ATTR(epf_ntb_, mw4);
> +CONFIGFS_ATTR(epf_ntb_, mw1_offset);
> +CONFIGFS_ATTR(epf_ntb_, mw2_offset);
> +CONFIGFS_ATTR(epf_ntb_, mw3_offset);
> +CONFIGFS_ATTR(epf_ntb_, mw4_offset);
> CONFIGFS_ATTR(epf_ntb_, vbus_number);
> CONFIGFS_ATTR(epf_ntb_, vntb_pid);
> CONFIGFS_ATTR(epf_ntb_, vntb_vid);
> @@ -1147,6 +1250,10 @@ static struct configfs_attribute *epf_ntb_attrs[] = {
> &epf_ntb_attr_mw2,
> &epf_ntb_attr_mw3,
> &epf_ntb_attr_mw4,
> + &epf_ntb_attr_mw1_offset,
> + &epf_ntb_attr_mw2_offset,
> + &epf_ntb_attr_mw3_offset,
> + &epf_ntb_attr_mw4_offset,
> &epf_ntb_attr_vbus_number,
> &epf_ntb_attr_vntb_pid,
> &epf_ntb_attr_vntb_vid,
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 03/27] NTB: epf: Handle mwN_offset for inbound MW regions
2025-11-29 16:03 ` [RFC PATCH v2 03/27] NTB: epf: Handle mwN_offset for inbound MW regions Koichiro Den
@ 2025-12-01 19:14 ` Frank Li
2025-12-02 6:23 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-01 19:14 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:41AM +0900, Koichiro Den wrote:
> Add and use new fields in the common control register to convey both
> offset and size for each memory window (MW), so that it can correctly
> handle flexible MW layouts and support partial BAR mappings.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> drivers/ntb/hw/epf/ntb_hw_epf.c | 27 ++++++++++++++++-----------
> 1 file changed, 16 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
> index d3ecf25a5162..91d3f8e05807 100644
> --- a/drivers/ntb/hw/epf/ntb_hw_epf.c
> +++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
> @@ -36,12 +36,13 @@
> #define NTB_EPF_LOWER_SIZE 0x18
> #define NTB_EPF_UPPER_SIZE 0x1C
> #define NTB_EPF_MW_COUNT 0x20
> -#define NTB_EPF_MW1_OFFSET 0x24
> -#define NTB_EPF_SPAD_OFFSET 0x28
> -#define NTB_EPF_SPAD_COUNT 0x2C
> -#define NTB_EPF_DB_ENTRY_SIZE 0x30
> -#define NTB_EPF_DB_DATA(n) (0x34 + (n) * 4)
> -#define NTB_EPF_DB_OFFSET(n) (0xB4 + (n) * 4)
> +#define NTB_EPF_MW_OFFSET(n) (0x24 + (n) * 4)
> +#define NTB_EPF_MW_SIZE(n) (0x34 + (n) * 4)
> +#define NTB_EPF_SPAD_OFFSET 0x44
> +#define NTB_EPF_SPAD_COUNT 0x48
> +#define NTB_EPF_DB_ENTRY_SIZE 0x4C
> +#define NTB_EPF_DB_DATA(n) (0x50 + (n) * 4)
> +#define NTB_EPF_DB_OFFSET(n) (0xD0 + (n) * 4)
You need to check for version differences when the register layout changes.
The EP and RC side software are not necessarily running the same kernel:
the EP side may be running an old version while the RC side runs a new one.
Frank
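A rough sketch of the kind of compatibility check meant here; it is purely
hypothetical, since the patch defines no version field, and NTB_EPF_VERSION
below is an assumed name and location:

	/*
	 * Hypothetical sketch, not part of this patch: gate the new MW
	 * offset/size register layout on a version field reserved in the
	 * control register block.
	 */
	static bool ntb_epf_has_mw_size_regs(struct ntb_epf_dev *ndev)
	{
		u32 ver = readl(ndev->ctrl_reg + NTB_EPF_VERSION);

		/* An EP running an older kernel would leave this at zero. */
		return ver >= 2;
	}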
>
> #define NTB_EPF_MIN_DB_COUNT 3
> #define NTB_EPF_MAX_DB_COUNT 31
> @@ -451,11 +452,12 @@ static int ntb_epf_peer_mw_get_addr(struct ntb_dev *ntb, int idx,
> phys_addr_t *base, resource_size_t *size)
> {
> struct ntb_epf_dev *ndev = ntb_ndev(ntb);
> - u32 offset = 0;
> + resource_size_t bar_sz;
> + u32 offset, sz;
> int bar;
>
> - if (idx == 0)
> - offset = readl(ndev->ctrl_reg + NTB_EPF_MW1_OFFSET);
> + offset = readl(ndev->ctrl_reg + NTB_EPF_MW_OFFSET(idx));
> + sz = readl(ndev->ctrl_reg + NTB_EPF_MW_SIZE(idx));
>
> bar = ntb_epf_mw_to_bar(ndev, idx);
> if (bar < 0)
> @@ -464,8 +466,11 @@ static int ntb_epf_peer_mw_get_addr(struct ntb_dev *ntb, int idx,
> if (base)
> *base = pci_resource_start(ndev->ntb.pdev, bar) + offset;
>
> - if (size)
> - *size = pci_resource_len(ndev->ntb.pdev, bar) - offset;
> + if (size) {
> + bar_sz = pci_resource_len(ndev->ntb.pdev, bar);
> + *size = sz ? min_t(resource_size_t, sz, bar_sz - offset)
> + : (bar_sz > offset ? bar_sz - offset : 0);
> + }
>
> return 0;
> }
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 04/27] PCI: endpoint: Add inbound mapping ops to EPC core
2025-11-29 16:03 ` [RFC PATCH v2 04/27] PCI: endpoint: Add inbound mapping ops to EPC core Koichiro Den
@ 2025-12-01 19:19 ` Frank Li
2025-12-02 6:25 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-01 19:19 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:42AM +0900, Koichiro Den wrote:
> Add new EPC ops map_inbound() and unmap_inbound() for mapping a subrange
> of a BAR into CPU space. These will be implemented by controller drivers
> such as DesignWare.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> drivers/pci/endpoint/pci-epc-core.c | 44 +++++++++++++++++++++++++++++
> include/linux/pci-epc.h | 11 ++++++++
> 2 files changed, 55 insertions(+)
>
> diff --git a/drivers/pci/endpoint/pci-epc-core.c b/drivers/pci/endpoint/pci-epc-core.c
> index ca7f19cc973a..825109e54ba9 100644
> --- a/drivers/pci/endpoint/pci-epc-core.c
> +++ b/drivers/pci/endpoint/pci-epc-core.c
> @@ -444,6 +444,50 @@ int pci_epc_map_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> }
> EXPORT_SYMBOL_GPL(pci_epc_map_addr);
>
> +/**
> + * pci_epc_map_inbound() - map a BAR subrange to the local CPU address
> + * @epc: the EPC device on which BAR has to be configured
> + * @func_no: the physical endpoint function number in the EPC device
> + * @vfunc_no: the virtual endpoint function number in the physical function
> + * @epf_bar: the struct epf_bar that contains the BAR information
> + * @offset: byte offset from the BAR base selected by the host
> + *
> + * Invoke to configure the BAR of the endpoint device and map a subrange
> + * selected by @offset to a CPU address.
> + *
> + * Returns 0 on success, -EOPNOTSUPP if unsupported, or a negative errno.
> + */
> +int pci_epc_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> + struct pci_epf_bar *epf_bar, u64 offset)
Shouldn't this also take size information? If the BAR's size is 4G, you may
want to map only 0x4000 to 0x5000, not the whole range from the offset to
the end of the BAR.
The commit message says this maps into CPU space; where is the CPU address?
Frank
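A sketch of the signature variant being suggested; the extra size parameter
is an assumption and is not part of this patch:

	/* Hypothetical variant: map only [offset, offset + size) of the BAR. */
	int pci_epc_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
				struct pci_epf_bar *epf_bar, u64 offset,
				size_t size);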
> +{
> + if (!epc || !epc->ops || !epc->ops->map_inbound)
> + return -EOPNOTSUPP;
> +
> + return epc->ops->map_inbound(epc, func_no, vfunc_no, epf_bar, offset);
> +}
> +EXPORT_SYMBOL_GPL(pci_epc_map_inbound);
> +
> +/**
> + * pci_epc_unmap_inbound() - unmap a previously mapped BAR subrange
> + * @epc: the EPC device on which the inbound mapping was programmed
> + * @func_no: the physical endpoint function number in the EPC device
> + * @vfunc_no: the virtual endpoint function number in the physical function
> + * @epf_bar: the struct epf_bar used when the mapping was created
> + * @offset: byte offset from the BAR base that was mapped
> + *
> + * Invoke to remove a BAR subrange mapping created by pci_epc_map_inbound().
> + * If the controller has no support, this call is a no-op.
> + */
> +void pci_epc_unmap_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> + struct pci_epf_bar *epf_bar, u64 offset)
> +{
> + if (!epc || !epc->ops || !epc->ops->unmap_inbound)
> + return;
> +
> + epc->ops->unmap_inbound(epc, func_no, vfunc_no, epf_bar, offset);
> +}
> +EXPORT_SYMBOL_GPL(pci_epc_unmap_inbound);
> +
> /**
> * pci_epc_mem_map() - allocate and map a PCI address to a CPU address
> * @epc: the EPC device on which the CPU address is to be allocated and mapped
> diff --git a/include/linux/pci-epc.h b/include/linux/pci-epc.h
> index 4286bfdbfdfa..a5fb91cc2982 100644
> --- a/include/linux/pci-epc.h
> +++ b/include/linux/pci-epc.h
> @@ -71,6 +71,8 @@ struct pci_epc_map {
> * region
> * @map_addr: ops to map CPU address to PCI address
> * @unmap_addr: ops to unmap CPU address and PCI address
> + * @map_inbound: ops to map a subrange inside a BAR to CPU address.
> + * @unmap_inbound: ops to unmap a subrange inside a BAR and CPU address.
> * @set_msi: ops to set the requested number of MSI interrupts in the MSI
> * capability register
> * @get_msi: ops to get the number of MSI interrupts allocated by the RC from
> @@ -99,6 +101,10 @@ struct pci_epc_ops {
> phys_addr_t addr, u64 pci_addr, size_t size);
> void (*unmap_addr)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> phys_addr_t addr);
> + int (*map_inbound)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> + struct pci_epf_bar *epf_bar, u64 offset);
> + void (*unmap_inbound)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> + struct pci_epf_bar *epf_bar, u64 offset);
> int (*set_msi)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> u8 nr_irqs);
> int (*get_msi)(struct pci_epc *epc, u8 func_no, u8 vfunc_no);
> @@ -286,6 +292,11 @@ int pci_epc_map_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> u64 pci_addr, size_t size);
> void pci_epc_unmap_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> phys_addr_t phys_addr);
> +
> +int pci_epc_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> + struct pci_epf_bar *epf_bar, u64 offset);
> +void pci_epc_unmap_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> + struct pci_epf_bar *epf_bar, u64 offset);
> int pci_epc_set_msi(struct pci_epc *epc, u8 func_no, u8 vfunc_no, u8 nr_irqs);
> int pci_epc_get_msi(struct pci_epc *epc, u8 func_no, u8 vfunc_no);
> int pci_epc_set_msix(struct pci_epc *epc, u8 func_no, u8 vfunc_no, u16 nr_irqs,
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 05/27] PCI: dwc: ep: Implement EPC inbound mapping support
2025-11-29 16:03 ` [RFC PATCH v2 05/27] PCI: dwc: ep: Implement EPC inbound mapping support Koichiro Den
@ 2025-12-01 19:32 ` Frank Li
2025-12-02 6:26 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-01 19:32 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:43AM +0900, Koichiro Den wrote:
> Implement map_inbound() and unmap_inbound() for DesignWare endpoint
> controllers (Address Match mode). Allows subrange mappings within a BAR,
> enabling advanced endpoint functions such as NTB with offset-based
> windows.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> .../pci/controller/dwc/pcie-designware-ep.c | 239 +++++++++++++++---
> drivers/pci/controller/dwc/pcie-designware.h | 2 +
> 2 files changed, 212 insertions(+), 29 deletions(-)
>
> diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> index 19571ac2b961..3780a9bd6f79 100644
> --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
...
>
> +static int dw_pcie_ep_set_bar_init(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> + struct pci_epf_bar *epf_bar)
> +{
> + struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> + struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> + enum pci_barno bar = epf_bar->barno;
> + enum pci_epc_bar_type bar_type;
> + int ret;
> +
> + bar_type = dw_pcie_ep_get_bar_type(ep, bar);
> + switch (bar_type) {
> + case BAR_FIXED:
> + /*
> + * There is no need to write a BAR mask for a fixed BAR (except
> + * to write 1 to the LSB of the BAR mask register, to enable the
> + * BAR). Write the BAR mask regardless. (The fixed bits in the
> + * BAR mask register will be read-only anyway.)
> + */
> + fallthrough;
> + case BAR_PROGRAMMABLE:
> + ret = dw_pcie_ep_set_bar_programmable(ep, func_no, epf_bar);
> + break;
> + case BAR_RESIZABLE:
> + ret = dw_pcie_ep_set_bar_resizable(ep, func_no, epf_bar);
> + break;
> + default:
> + ret = -EINVAL;
> + dev_err(pci->dev, "Invalid BAR type\n");
> + break;
> + }
> +
> + return ret;
> +}
> +
Please create a new patch for the code movement above.
> static int dw_pcie_ep_set_bar(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> struct pci_epf_bar *epf_bar)
> {
> struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> - struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> enum pci_barno bar = epf_bar->barno;
> size_t size = epf_bar->size;
> - enum pci_epc_bar_type bar_type;
> int flags = epf_bar->flags;
> int ret, type;
>
> @@ -374,35 +429,12 @@ static int dw_pcie_ep_set_bar(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> * When dynamically changing a BAR, skip writing the BAR reg, as
> * that would clear the BAR's PCI address assigned by the host.
> */
> - goto config_atu;
> - }
> -
> - bar_type = dw_pcie_ep_get_bar_type(ep, bar);
> - switch (bar_type) {
> - case BAR_FIXED:
> - /*
> - * There is no need to write a BAR mask for a fixed BAR (except
> - * to write 1 to the LSB of the BAR mask register, to enable the
> - * BAR). Write the BAR mask regardless. (The fixed bits in the
> - * BAR mask register will be read-only anyway.)
> - */
> - fallthrough;
> - case BAR_PROGRAMMABLE:
> - ret = dw_pcie_ep_set_bar_programmable(ep, func_no, epf_bar);
> - break;
> - case BAR_RESIZABLE:
> - ret = dw_pcie_ep_set_bar_resizable(ep, func_no, epf_bar);
> - break;
> - default:
> - ret = -EINVAL;
> - dev_err(pci->dev, "Invalid BAR type\n");
> - break;
> + } else {
> + ret = dw_pcie_ep_set_bar_init(epc, func_no, vfunc_no, epf_bar);
> + if (ret)
> + return ret;
> }
>
> - if (ret)
> - return ret;
> -
> -config_atu:
> if (!(flags & PCI_BASE_ADDRESS_SPACE))
> type = PCIE_ATU_TYPE_MEM;
> else
> @@ -488,6 +520,151 @@ static int dw_pcie_ep_map_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> return 0;
> }
>
...
> +
> +static int dw_pcie_ep_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> + struct pci_epf_bar *epf_bar, u64 offset)
> +{
> + struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> + struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> + enum pci_barno bar = epf_bar->barno;
> + size_t size = epf_bar->size;
> + int flags = epf_bar->flags;
> + struct dw_pcie_ib_map *m;
> + u64 base, pci_addr;
> + int ret, type, win;
> +
> + /*
> + * DWC does not allow BAR pairs to overlap, e.g. you cannot combine BARs
> + * 1 and 2 to form a 64-bit BAR.
> + */
> + if ((flags & PCI_BASE_ADDRESS_MEM_TYPE_64) && (bar & 1))
> + return -EINVAL;
> +
> + /*
> + * Certain EPF drivers dynamically change the physical address of a BAR
> + * (i.e. they call set_bar() twice, without ever calling clear_bar(), as
> + * calling clear_bar() would clear the BAR's PCI address assigned by the
> + * host).
> + */
> + if (epf_bar->phys_addr && ep->epf_bar[bar]) {
> + /*
> + * We can only dynamically add a whole or partial mapping if the
> + * BAR flags do not differ from the existing configuration.
> + */
> + if (ep->epf_bar[bar]->barno != bar ||
> + ep->epf_bar[bar]->flags != flags)
> + return -EINVAL;
> +
> + /*
> + * When dynamically changing a BAR, skip writing the BAR reg, as
> + * that would clear the BAR's PCI address assigned by the host.
> + */
> + }
> +
> + /*
> + * Skip programming the inbound translation if phys_addr is 0.
> + * In this case, the caller only intends to initialize the BAR.
> + */
> + if (!epf_bar->phys_addr) {
> + ret = dw_pcie_ep_set_bar_init(epc, func_no, vfunc_no, epf_bar);
> + ep->epf_bar[bar] = epf_bar;
> + return ret;
> + }
> +
> + base = dw_pcie_ep_read_bar_assigned(ep, func_no, bar,
> + flags & PCI_BASE_ADDRESS_SPACE,
> + flags & PCI_BASE_ADDRESS_MEM_TYPE_64);
> + if (!(flags & PCI_BASE_ADDRESS_SPACE))
> + type = PCIE_ATU_TYPE_MEM;
> + else
> + type = PCIE_ATU_TYPE_IO;
> + pci_addr = base + offset;
> +
> + /* Allocate an inbound iATU window */
> + win = find_first_zero_bit(ep->ib_window_map, pci->num_ib_windows);
> + if (win >= pci->num_ib_windows)
> + return -ENOSPC;
> +
> + /* Program address-match inbound iATU */
> + ret = dw_pcie_prog_inbound_atu(pci, win, type,
> + epf_bar->phys_addr - pci->parent_bus_offset,
> + pci_addr, size);
> + if (ret)
> + return ret;
> +
> + m = kzalloc(sizeof(*m), GFP_KERNEL);
> + if (!m) {
> + dw_pcie_disable_atu(pci, PCIE_ATU_REGION_DIR_IB, win);
> + return -ENOMEM;
> + }
At least this allocation should come before dw_pcie_prog_inbound_atu(); if
it returns an error here, the ATU has already been programmed.
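A sketch of the suggested ordering, with error handling condensed and
identifiers taken from the quoted hunk:

	/*
	 * Allocate the bookkeeping entry first, so an allocation failure
	 * leaves the iATU untouched.
	 */
	m = kzalloc(sizeof(*m), GFP_KERNEL);
	if (!m)
		return -ENOMEM;

	ret = dw_pcie_prog_inbound_atu(pci, win, type,
				       epf_bar->phys_addr - pci->parent_bus_offset,
				       pci_addr, size);
	if (ret) {
		kfree(m);
		return ret;
	}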
> + m->bar = bar;
> + m->pci_addr = pci_addr;
> + m->cpu_addr = epf_bar->phys_addr;
> + m->size = size;
> + m->index = win;
> +
> + guard(spinlock_irqsave)(&ep->ib_map_lock);
> + set_bit(win, ep->ib_window_map);
> + list_add(&m->node, &ep->ib_map_list);
> +
> + return 0;
> +}
> +
...
>
> INIT_LIST_HEAD(&ep->func_list);
> + INIT_LIST_HEAD(&ep->ib_map_list);
> + spin_lock_init(&ep->ib_map_lock);
>
> epc = devm_pci_epc_create(dev, &epc_ops);
> if (IS_ERR(epc)) {
> diff --git a/drivers/pci/controller/dwc/pcie-designware.h b/drivers/pci/controller/dwc/pcie-designware.h
> index 31685951a080..269a9fe0501f 100644
> --- a/drivers/pci/controller/dwc/pcie-designware.h
> +++ b/drivers/pci/controller/dwc/pcie-designware.h
> @@ -476,6 +476,8 @@ struct dw_pcie_ep {
> phys_addr_t *outbound_addr;
> unsigned long *ib_window_map;
> unsigned long *ob_window_map;
> + struct list_head ib_map_list;
> + spinlock_t ib_map_lock;
Please add a comment for the lock.
Frank
> void __iomem *msi_mem;
> phys_addr_t msi_mem_phys;
> struct pci_epf_bar *epf_bar[PCI_STD_NUM_BARS];
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 06/27] PCI: endpoint: pci-epf-vntb: Use pci_epc_map_inbound() for MW mapping
2025-11-29 16:03 ` [RFC PATCH v2 06/27] PCI: endpoint: pci-epf-vntb: Use pci_epc_map_inbound() for MW mapping Koichiro Den
@ 2025-12-01 19:34 ` Frank Li
2025-12-02 6:26 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-01 19:34 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:44AM +0900, Koichiro Den wrote:
> Switch MW setup to use pci_epc_map_inbound() when supported. This allows
> mapping portions of a BAR rather than the entire region, supporting
> partial BAR usage on capable controllers.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> drivers/pci/endpoint/functions/pci-epf-vntb.c | 21 ++++++++++++++-----
> 1 file changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/pci/endpoint/functions/pci-epf-vntb.c b/drivers/pci/endpoint/functions/pci-epf-vntb.c
> index 1ff414703566..42e57721dcb4 100644
> --- a/drivers/pci/endpoint/functions/pci-epf-vntb.c
> +++ b/drivers/pci/endpoint/functions/pci-epf-vntb.c
> @@ -728,10 +728,15 @@ static int epf_ntb_mw_bar_init(struct epf_ntb *ntb)
> PCI_BASE_ADDRESS_MEM_TYPE_64 :
> PCI_BASE_ADDRESS_MEM_TYPE_32;
>
> - ret = pci_epc_set_bar(ntb->epf->epc,
> - ntb->epf->func_no,
> - ntb->epf->vfunc_no,
> - &ntb->epf->bar[barno]);
> + ret = pci_epc_map_inbound(ntb->epf->epc,
> + ntb->epf->func_no,
> + ntb->epf->vfunc_no,
> + &ntb->epf->bar[barno], 0);
> + if (ret == -EOPNOTSUPP)
> + ret = pci_epc_set_bar(ntb->epf->epc,
> + ntb->epf->func_no,
> + ntb->epf->vfunc_no,
> + &ntb->epf->bar[barno]);
> if (ret) {
> dev_err(dev, "MW set failed\n");
> goto err_set_bar;
> @@ -1385,17 +1390,23 @@ static int vntb_epf_mw_set_trans(struct ntb_dev *ndev, int pidx, int idx,
> struct epf_ntb *ntb = ntb_ndev(ndev);
> struct pci_epf_bar *epf_bar;
> enum pci_barno barno;
> + struct pci_epc *epc;
> int ret;
> struct device *dev;
>
> + epc = ntb->epf->epc;
> dev = &ntb->ntb.dev;
> barno = ntb->epf_ntb_bar[BAR_MW1 + idx];
> +
Nit: unnecessary change here
Reviewed-by: Frank Li <Frank.Li@nxp.com>
> epf_bar = &ntb->epf->bar[barno];
> epf_bar->phys_addr = addr;
> epf_bar->barno = barno;
> epf_bar->size = size;
>
> - ret = pci_epc_set_bar(ntb->epf->epc, 0, 0, epf_bar);
> + ret = pci_epc_map_inbound(epc, 0, 0, epf_bar, 0);
> + if (ret == -EOPNOTSUPP)
> + ret = pci_epc_set_bar(epc, 0, 0, epf_bar);
> +
> if (ret) {
> dev_err(dev, "failure set mw trans\n");
> return ret;
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 08/27] PCI: endpoint: pci-epf-vntb: Propagate MW offset from configfs when present
2025-11-29 16:03 ` [RFC PATCH v2 08/27] PCI: endpoint: pci-epf-vntb: Propagate MW offset from configfs when present Koichiro Den
@ 2025-12-01 19:35 ` Frank Li
0 siblings, 0 replies; 97+ messages in thread
From: Frank Li @ 2025-12-01 19:35 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:46AM +0900, Koichiro Den wrote:
> The NTB API functions ntb_mw_set_trans() and ntb_mw_get_align() now
> support non-zero MW offsets. Update pci-epf-vntb to populate
> mws_offset[idx] when the offset parameter is provided. Users can now get
> the offset value and use it on ntb_mw_set_trans().
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
Reviewed-by: Frank Li <Frank.Li@nxp.com>
> drivers/pci/endpoint/functions/pci-epf-vntb.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/pci/endpoint/functions/pci-epf-vntb.c b/drivers/pci/endpoint/functions/pci-epf-vntb.c
> index 8dbae9be9402..aa44dcd5c943 100644
> --- a/drivers/pci/endpoint/functions/pci-epf-vntb.c
> +++ b/drivers/pci/endpoint/functions/pci-epf-vntb.c
> @@ -1528,6 +1528,9 @@ static int vntb_epf_mw_get_align(struct ntb_dev *ndev, int pidx, int idx,
> if (size_max)
> *size_max = ntb->mws_size[idx];
>
> + if (offset)
> + *offset = ntb->mws_offset[idx];
> +
> return 0;
> }
>
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 10/27] NTB: core: Add .get_pci_epc() to ntb_dev_ops
2025-11-29 16:03 ` [RFC PATCH v2 10/27] NTB: core: Add .get_pci_epc() to ntb_dev_ops Koichiro Den
@ 2025-12-01 19:39 ` Frank Li
2025-12-02 6:31 ` Koichiro Den
2025-12-01 21:08 ` Dave Jiang
1 sibling, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-01 19:39 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:48AM +0900, Koichiro Den wrote:
> Add an optional get_pci_epc() callback to retrieve the underlying
> pci_epc device associated with the NTB implementation.
The EPC runs on the EP side, while this driver runs on the RC side. Why does
it need to get the EP side's controller pointer?
Frank
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> drivers/ntb/hw/epf/ntb_hw_epf.c | 11 +----------
> include/linux/ntb.h | 21 +++++++++++++++++++++
> 2 files changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
> index a3ec411bfe49..d55ce6b0fad4 100644
> --- a/drivers/ntb/hw/epf/ntb_hw_epf.c
> +++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
> @@ -9,6 +9,7 @@
> #include <linux/delay.h>
> #include <linux/module.h>
> #include <linux/pci.h>
> +#include <linux/pci-epf.h>
> #include <linux/slab.h>
> #include <linux/ntb.h>
>
> @@ -49,16 +50,6 @@
>
> #define NTB_EPF_COMMAND_TIMEOUT 1000 /* 1 Sec */
>
> -enum pci_barno {
> - NO_BAR = -1,
> - BAR_0,
> - BAR_1,
> - BAR_2,
> - BAR_3,
> - BAR_4,
> - BAR_5,
> -};
> -
> enum epf_ntb_bar {
> BAR_CONFIG,
> BAR_PEER_SPAD,
> diff --git a/include/linux/ntb.h b/include/linux/ntb.h
> index d7ce5d2e60d0..04dc9a4d6b85 100644
> --- a/include/linux/ntb.h
> +++ b/include/linux/ntb.h
> @@ -64,6 +64,7 @@ struct ntb_client;
> struct ntb_dev;
> struct ntb_msi;
> struct pci_dev;
> +struct pci_epc;
>
> /**
> * enum ntb_topo - NTB connection topology
> @@ -256,6 +257,7 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops *ops)
> * @msg_clear_mask: See ntb_msg_clear_mask().
> * @msg_read: See ntb_msg_read().
> * @peer_msg_write: See ntb_peer_msg_write().
> + * @get_pci_epc: See ntb_get_pci_epc().
> */
> struct ntb_dev_ops {
> int (*port_number)(struct ntb_dev *ntb);
> @@ -331,6 +333,7 @@ struct ntb_dev_ops {
> int (*msg_clear_mask)(struct ntb_dev *ntb, u64 mask_bits);
> u32 (*msg_read)(struct ntb_dev *ntb, int *pidx, int midx);
> int (*peer_msg_write)(struct ntb_dev *ntb, int pidx, int midx, u32 msg);
> + struct pci_epc *(*get_pci_epc)(struct ntb_dev *ntb);
> };
>
> static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> @@ -393,6 +396,9 @@ static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> /* !ops->msg_clear_mask == !ops->msg_count && */
> !ops->msg_read == !ops->msg_count &&
> !ops->peer_msg_write == !ops->msg_count &&
> +
> + /* Miscellaneous optional callbacks */
> + /* ops->get_pci_epc && */
> 1;
> }
>
> @@ -1567,6 +1573,21 @@ static inline int ntb_peer_msg_write(struct ntb_dev *ntb, int pidx, int midx,
> return ntb->ops->peer_msg_write(ntb, pidx, midx, msg);
> }
>
> +/**
> + * ntb_get_pci_epc() - get backing PCI endpoint controller if possible.
> + * @ntb: NTB device context.
> + *
> + * Get the backing PCI endpoint controller representation.
> + *
> + * Return: A pointer to the pci_epc instance if available. or %NULL if not.
> + */
> +static inline struct pci_epc __maybe_unused *ntb_get_pci_epc(struct ntb_dev *ntb)
> +{
> + if (!ntb->ops->get_pci_epc)
> + return NULL;
> + return ntb->ops->get_pci_epc(ntb);
> +}
> +
> /**
> * ntb_peer_resource_idx() - get a resource index for a given peer idx
> * @ntb: NTB device context.
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 12/27] damengine: dw-edma: Fix MSI data values for multi-vector IMWr interrupts
2025-11-29 16:03 ` [RFC PATCH v2 12/27] damengine: dw-edma: Fix MSI data values for multi-vector IMWr interrupts Koichiro Den
@ 2025-12-01 19:46 ` Frank Li
2025-12-02 6:32 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-01 19:46 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:50AM +0900, Koichiro Den wrote:
> When multiple MSI vectors are allocated for the DesignWare eDMA, the
> driver currently records the same MSI message for all IRQs by calling
> get_cached_msi_msg() per vector. For multi-vector MSI (as opposed to
> MSI-X), the cached message corresponds to vector 0 and msg.data is
> supposed to be adjusted by the IRQ index.
>
> As a result, all eDMA interrupts share the same MSI data value and the
> interrupt controller cannot distinguish between them.
>
> Introduce dw_edma_compose_msi() to construct the correct MSI message for
> each vector. For MSI-X nothing changes. For multi-vector MSI, derive the
> base IRQ with msi_get_virq(dev, 0) and OR in the per-vector offset into
> msg.data before storing it in dw->irq[i].msi.
>
> This makes each IMWr MSI vector use a unique MSI data value.
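
A worked illustration with made-up numbers:

	/*
	 * Example values (illustrative): with 4 MSI vectors and a cached
	 * base msg.data of 0x20 for vector 0, the per-vector data becomes:
	 *
	 *   vector 0: 0x20 | 0 = 0x20
	 *   vector 1: 0x20 | 1 = 0x21
	 *   vector 2: 0x20 | 2 = 0x22
	 *   vector 3: 0x20 | 3 = 0x23
	 *
	 * Multi-vector MSI allocates a naturally aligned block of data
	 * values, so the low bits of the base are zero and OR behaves
	 * like addition here.
	 */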
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> drivers/dma/dw-edma/dw-edma-core.c | 28 ++++++++++++++++++++++++----
> 1 file changed, 24 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dma/dw-edma/dw-edma-core.c b/drivers/dma/dw-edma/dw-edma-core.c
> index 8e5f7defa6b6..3542177a4a8e 100644
> --- a/drivers/dma/dw-edma/dw-edma-core.c
> +++ b/drivers/dma/dw-edma/dw-edma-core.c
> @@ -839,6 +839,28 @@ static inline void dw_edma_add_irq_mask(u32 *mask, u32 alloc, u16 cnt)
> (*mask)++;
> }
>
> +static void dw_edma_compose_msi(struct device *dev, int irq, struct msi_msg *out)
> +{
> + struct msi_desc *desc = irq_get_msi_desc(irq);
> + struct msi_msg msg;
> + unsigned int base;
> +
> + if (!desc)
> + return;
> +
> + get_cached_msi_msg(irq, &msg);
> + if (!desc->pci.msi_attrib.is_msix) {
> + /*
> + * For multi-vector MSI, the cached message corresponds to
> + * vector 0. Adjust msg.data by the IRQ index so that each
> + * vector gets a unique MSI data value for IMWr Data Register.
> + */
> + base = msi_get_virq(dev, 0);
> + msg.data |= (irq - base);
why "|=", not "=" here?
Frank
> + }
> + *out = msg;
> +}
> +
> static int dw_edma_irq_request(struct dw_edma *dw,
> u32 *wr_alloc, u32 *rd_alloc)
> {
> @@ -869,8 +891,7 @@ static int dw_edma_irq_request(struct dw_edma *dw,
> return err;
> }
>
> - if (irq_get_msi_desc(irq))
> - get_cached_msi_msg(irq, &dw->irq[0].msi);
> + dw_edma_compose_msi(dev, irq, &dw->irq[0].msi);
>
> dw->nr_irqs = 1;
> } else {
> @@ -896,8 +917,7 @@ static int dw_edma_irq_request(struct dw_edma *dw,
> if (err)
> goto err_irq_free;
>
> - if (irq_get_msi_desc(irq))
> - get_cached_msi_msg(irq, &dw->irq[i].msi);
> + dw_edma_compose_msi(dev, irq, &dw->irq[i].msi);
> }
>
> dw->nr_irqs = i;
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 13/27] NTB: ntb_transport: Use seq_file for QP stats debugfs
2025-11-29 16:03 ` [RFC PATCH v2 13/27] NTB: ntb_transport: Use seq_file for QP stats debugfs Koichiro Den
@ 2025-12-01 19:50 ` Frank Li
2025-12-02 6:33 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-01 19:50 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:51AM +0900, Koichiro Den wrote:
> The ./qp*/stats debugfs file for each NTB transport QP is currently
> implemented with a hand-crafted kmalloc() buffer and a series of
> scnprintf() calls. This is a pre-seq_file style pattern and makes future
> extensions easy to truncate.
>
> Convert the stats file to use the seq_file helpers via
> DEFINE_SHOW_ATTRIBUTE(), which simplifies the code and lets the seq_file
> core handle buffering and partial reads.
>
> While touching this area, fix a bug in the per-QP debugfs directory
> naming: the buffer used for "qp%d" was only 4 bytes, which truncates
> names like "qp10" to "qp1" and causes multiple queues to share the same
> directory. Enlarge the buffer and use sizeof() to avoid truncation.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
This one is independent of the others; you could post it separately so this
simple change gets merged quickly.
Reviewed-by: Frank Li <Frank.Li@nxp.com>
> drivers/ntb/ntb_transport.c | 136 +++++++++++-------------------------
> 1 file changed, 41 insertions(+), 95 deletions(-)
>
> diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
> index 3f3bc991e667..57b4c0511927 100644
> --- a/drivers/ntb/ntb_transport.c
> +++ b/drivers/ntb/ntb_transport.c
> @@ -57,6 +57,7 @@
> #include <linux/module.h>
> #include <linux/pci.h>
> #include <linux/slab.h>
> +#include <linux/seq_file.h>
> #include <linux/types.h>
> #include <linux/uaccess.h>
> #include <linux/mutex.h>
> @@ -466,104 +467,49 @@ void ntb_transport_unregister_client(struct ntb_transport_client *drv)
> }
> EXPORT_SYMBOL_GPL(ntb_transport_unregister_client);
>
> -static ssize_t debugfs_read(struct file *filp, char __user *ubuf, size_t count,
> - loff_t *offp)
> +static int ntb_qp_debugfs_stats_show(struct seq_file *s, void *v)
> {
> - struct ntb_transport_qp *qp;
> - char *buf;
> - ssize_t ret, out_offset, out_count;
> -
> - qp = filp->private_data;
> + struct ntb_transport_qp *qp = s->private;
>
> if (!qp || !qp->link_is_up)
> return 0;
>
> - out_count = 1000;
> -
> - buf = kmalloc(out_count, GFP_KERNEL);
> - if (!buf)
> - return -ENOMEM;
> + seq_puts(s, "\nNTB QP stats:\n\n");
> +
> + seq_printf(s, "rx_bytes - \t%llu\n", qp->rx_bytes);
> + seq_printf(s, "rx_pkts - \t%llu\n", qp->rx_pkts);
> + seq_printf(s, "rx_memcpy - \t%llu\n", qp->rx_memcpy);
> + seq_printf(s, "rx_async - \t%llu\n", qp->rx_async);
> + seq_printf(s, "rx_ring_empty - %llu\n", qp->rx_ring_empty);
> + seq_printf(s, "rx_err_no_buf - %llu\n", qp->rx_err_no_buf);
> + seq_printf(s, "rx_err_oflow - \t%llu\n", qp->rx_err_oflow);
> + seq_printf(s, "rx_err_ver - \t%llu\n", qp->rx_err_ver);
> + seq_printf(s, "rx_buff - \t0x%p\n", qp->rx_buff);
> + seq_printf(s, "rx_index - \t%u\n", qp->rx_index);
> + seq_printf(s, "rx_max_entry - \t%u\n", qp->rx_max_entry);
> + seq_printf(s, "rx_alloc_entry - \t%u\n\n", qp->rx_alloc_entry);
> +
> + seq_printf(s, "tx_bytes - \t%llu\n", qp->tx_bytes);
> + seq_printf(s, "tx_pkts - \t%llu\n", qp->tx_pkts);
> + seq_printf(s, "tx_memcpy - \t%llu\n", qp->tx_memcpy);
> + seq_printf(s, "tx_async - \t%llu\n", qp->tx_async);
> + seq_printf(s, "tx_ring_full - \t%llu\n", qp->tx_ring_full);
> + seq_printf(s, "tx_err_no_buf - %llu\n", qp->tx_err_no_buf);
> + seq_printf(s, "tx_mw - \t0x%p\n", qp->tx_mw);
> + seq_printf(s, "tx_index (H) - \t%u\n", qp->tx_index);
> + seq_printf(s, "RRI (T) - \t%u\n", qp->remote_rx_info->entry);
> + seq_printf(s, "tx_max_entry - \t%u\n", qp->tx_max_entry);
> + seq_printf(s, "free tx - \t%u\n", ntb_transport_tx_free_entry(qp));
> + seq_putc(s, '\n');
> +
> + seq_printf(s, "Using TX DMA - \t%s\n", qp->tx_dma_chan ? "Yes" : "No");
> + seq_printf(s, "Using RX DMA - \t%s\n", qp->rx_dma_chan ? "Yes" : "No");
> + seq_printf(s, "QP Link - \t%s\n", qp->link_is_up ? "Up" : "Down");
> + seq_putc(s, '\n');
>
> - out_offset = 0;
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "\nNTB QP stats:\n\n");
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "rx_bytes - \t%llu\n", qp->rx_bytes);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "rx_pkts - \t%llu\n", qp->rx_pkts);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "rx_memcpy - \t%llu\n", qp->rx_memcpy);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "rx_async - \t%llu\n", qp->rx_async);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "rx_ring_empty - %llu\n", qp->rx_ring_empty);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "rx_err_no_buf - %llu\n", qp->rx_err_no_buf);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "rx_err_oflow - \t%llu\n", qp->rx_err_oflow);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "rx_err_ver - \t%llu\n", qp->rx_err_ver);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "rx_buff - \t0x%p\n", qp->rx_buff);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "rx_index - \t%u\n", qp->rx_index);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "rx_max_entry - \t%u\n", qp->rx_max_entry);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "rx_alloc_entry - \t%u\n\n", qp->rx_alloc_entry);
> -
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "tx_bytes - \t%llu\n", qp->tx_bytes);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "tx_pkts - \t%llu\n", qp->tx_pkts);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "tx_memcpy - \t%llu\n", qp->tx_memcpy);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "tx_async - \t%llu\n", qp->tx_async);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "tx_ring_full - \t%llu\n", qp->tx_ring_full);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "tx_err_no_buf - %llu\n", qp->tx_err_no_buf);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "tx_mw - \t0x%p\n", qp->tx_mw);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "tx_index (H) - \t%u\n", qp->tx_index);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "RRI (T) - \t%u\n",
> - qp->remote_rx_info->entry);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "tx_max_entry - \t%u\n", qp->tx_max_entry);
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "free tx - \t%u\n",
> - ntb_transport_tx_free_entry(qp));
> -
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "\n");
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "Using TX DMA - \t%s\n",
> - qp->tx_dma_chan ? "Yes" : "No");
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "Using RX DMA - \t%s\n",
> - qp->rx_dma_chan ? "Yes" : "No");
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "QP Link - \t%s\n",
> - qp->link_is_up ? "Up" : "Down");
> - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> - "\n");
> -
> - if (out_offset > out_count)
> - out_offset = out_count;
> -
> - ret = simple_read_from_buffer(ubuf, count, offp, buf, out_offset);
> - kfree(buf);
> - return ret;
> -}
> -
> -static const struct file_operations ntb_qp_debugfs_stats = {
> - .owner = THIS_MODULE,
> - .open = simple_open,
> - .read = debugfs_read,
> -};
> + return 0;
> +}
> +DEFINE_SHOW_ATTRIBUTE(ntb_qp_debugfs_stats);
>
> static void ntb_list_add(spinlock_t *lock, struct list_head *entry,
> struct list_head *list)
> @@ -1237,15 +1183,15 @@ static int ntb_transport_init_queue(struct ntb_transport_ctx *nt,
> qp->tx_max_entry = tx_size / qp->tx_max_frame;
>
> if (nt->debugfs_node_dir) {
> - char debugfs_name[4];
> + char debugfs_name[8];
>
> - snprintf(debugfs_name, 4, "qp%d", qp_num);
> + snprintf(debugfs_name, sizeof(debugfs_name), "qp%d", qp_num);
> qp->debugfs_dir = debugfs_create_dir(debugfs_name,
> nt->debugfs_node_dir);
>
> qp->debugfs_stats = debugfs_create_file("stats", S_IRUSR,
> qp->debugfs_dir, qp,
> - &ntb_qp_debugfs_stats);
> + &ntb_qp_debugfs_stats_fops);
> } else {
> qp->debugfs_dir = NULL;
> qp->debugfs_stats = NULL;
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 14/27] NTB: ntb_transport: Move TX memory window setup into setup_qp_mw()
2025-11-29 16:03 ` [RFC PATCH v2 14/27] NTB: ntb_transport: Move TX memory window setup into setup_qp_mw() Koichiro Den
@ 2025-12-01 20:02 ` Frank Li
2025-12-02 6:33 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-01 20:02 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:52AM +0900, Koichiro Den wrote:
> Historically both TX and RX have assumed the same per-QP MW slice
> (tx_max_entry == remote rx_max_entry), while those are calculated
> separately in different places (pre and post the link-up negotiation
> point). This has been safe because nt->link_is_up is never set to true
> unless the pre-determined qp_count values are the same on both sides, and
> qp_count is typically limited to nt->mw_count, which should be carefully
> configured by the admin.
>
> However, setup_qp_mw() can actually split an MW and handle multiple QPs in
> one MW properly, so qp_count need not be limited by nt->mw_count. Once we
> relax that limitation, the pre-determined qp_count can differ between the
> host side and the endpoint, and link-up negotiation can easily fail.
>
> Move the TX MW configuration (per-QP offset and size) into
> ntb_transport_setup_qp_mw() so that both RX and TX layout decisions are
> centralized in a single helper. ntb_transport_init_queue() now deals
> only with per-QP software state, not with MW layout.
>
> This keeps the previous behaviour, while preparing for relaxing the
> qp_count limitation and improving readability.
>
> No functional change is intended.
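
A worked example of the split, with illustrative numbers:

	/*
	 * Example: qp_count = 6, mw_count = 4.
	 *   qp_count % mw_count = 2, so MWs 0 and 1 each carry
	 *     num_qps_mw = 6 / 4 + 1 = 2 queue pairs,
	 *   while MWs 2 and 3 carry 6 / 4 = 1 each (2 + 2 + 1 + 1 = 6).
	 * QP n uses MW (n % mw_count) at slot (n / mw_count); e.g. QP 5
	 * lands in MW 1 at offset mw_size_per_qp * 1.
	 */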
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> drivers/ntb/ntb_transport.c | 67 ++++++++++++++-----------------------
> 1 file changed, 26 insertions(+), 41 deletions(-)
>
> diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
> index 57b4c0511927..79063e2f911b 100644
> --- a/drivers/ntb/ntb_transport.c
> +++ b/drivers/ntb/ntb_transport.c
> @@ -569,7 +569,8 @@ static int ntb_transport_setup_qp_mw(struct ntb_transport_ctx *nt,
> struct ntb_transport_mw *mw;
> struct ntb_dev *ndev = nt->ndev;
> struct ntb_queue_entry *entry;
> - unsigned int rx_size, num_qps_mw;
> + unsigned int num_qps_mw;
> + unsigned int mw_size, mw_size_per_qp, qp_offset, rx_info_offset;
> unsigned int mw_num, mw_count, qp_count;
> unsigned int i;
> int node;
> @@ -588,15 +589,33 @@ static int ntb_transport_setup_qp_mw(struct ntb_transport_ctx *nt,
> else
> num_qps_mw = qp_count / mw_count;
>
> - rx_size = (unsigned int)mw->xlat_size / num_qps_mw;
> - qp->rx_buff = mw->virt_addr + rx_size * (qp_num / mw_count);
> - rx_size -= sizeof(struct ntb_rx_info);
> + mw_size = min(nt->mw_vec[mw_num].phys_size, mw->xlat_size);
> + if (max_mw_size && mw_size > max_mw_size)
> + mw_size = max_mw_size;
>
> - qp->remote_rx_info = qp->rx_buff + rx_size;
> + /* Split this MW evenly among the queue pairs mapped to it. */
> + mw_size_per_qp = (unsigned int)mw_size / num_qps_mw;
Can you keep the same variable name at first, to make review easier?
	tx_size = (unsigned int)mw_size / num_qps_mw;
It is hard to verify that the code logic is the same as the old one.
Frank
> + qp_offset = mw_size_per_qp * (qp_num / mw_count);
> +
> + /* Place remote_rx_info at the end of the per-QP region. */
> + rx_info_offset = mw_size_per_qp - sizeof(struct ntb_rx_info);
> +
> + qp->tx_mw_size = mw_size_per_qp;
> + qp->tx_mw = nt->mw_vec[mw_num].vbase + qp_offset;
> + if (!qp->tx_mw)
> + return -EINVAL;
> + qp->tx_mw_phys = nt->mw_vec[mw_num].phys_addr + qp_offset;
> + if (!qp->tx_mw_phys)
> + return -EINVAL;
> + qp->rx_info = qp->tx_mw + rx_info_offset;
> + qp->rx_buff = mw->virt_addr + qp_offset;
> + qp->remote_rx_info = qp->rx_buff + rx_info_offset;
>
> /* Due to housekeeping, there must be atleast 2 buffs */
> - qp->rx_max_frame = min(transport_mtu, rx_size / 2);
> - qp->rx_max_entry = rx_size / qp->rx_max_frame;
> + qp->tx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
> + qp->tx_max_entry = mw_size_per_qp / qp->tx_max_frame;
> + qp->rx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
> + qp->rx_max_entry = mw_size_per_qp / qp->rx_max_frame;
> qp->rx_index = 0;
>
> /*
> @@ -1133,11 +1152,7 @@ static int ntb_transport_init_queue(struct ntb_transport_ctx *nt,
> unsigned int qp_num)
> {
> struct ntb_transport_qp *qp;
> - phys_addr_t mw_base;
> - resource_size_t mw_size;
> - unsigned int num_qps_mw, tx_size;
> unsigned int mw_num, mw_count, qp_count;
> - u64 qp_offset;
>
> mw_count = nt->mw_count;
> qp_count = nt->qp_count;
> @@ -1152,36 +1167,6 @@ static int ntb_transport_init_queue(struct ntb_transport_ctx *nt,
> qp->event_handler = NULL;
> ntb_qp_link_context_reset(qp);
>
> - if (mw_num < qp_count % mw_count)
> - num_qps_mw = qp_count / mw_count + 1;
> - else
> - num_qps_mw = qp_count / mw_count;
> -
> - mw_base = nt->mw_vec[mw_num].phys_addr;
> - mw_size = nt->mw_vec[mw_num].phys_size;
> -
> - if (max_mw_size && mw_size > max_mw_size)
> - mw_size = max_mw_size;
> -
> - tx_size = (unsigned int)mw_size / num_qps_mw;
> - qp_offset = tx_size * (qp_num / mw_count);
> -
> - qp->tx_mw_size = tx_size;
> - qp->tx_mw = nt->mw_vec[mw_num].vbase + qp_offset;
> - if (!qp->tx_mw)
> - return -EINVAL;
> -
> - qp->tx_mw_phys = mw_base + qp_offset;
> - if (!qp->tx_mw_phys)
> - return -EINVAL;
> -
> - tx_size -= sizeof(struct ntb_rx_info);
> - qp->rx_info = qp->tx_mw + tx_size;
> -
> - /* Due to housekeeping, there must be atleast 2 buffs */
> - qp->tx_max_frame = min(transport_mtu, tx_size / 2);
> - qp->tx_max_entry = tx_size / qp->tx_max_frame;
> -
> if (nt->debugfs_node_dir) {
> char debugfs_name[8];
>
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-11-29 16:03 ` [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping Koichiro Den
@ 2025-12-01 20:41 ` Frank Li
2025-12-02 6:35 ` Koichiro Den
2025-12-02 6:32 ` Niklas Cassel
` (2 subsequent siblings)
3 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-01 20:41 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> for the MSI target address on every interrupt and tears it down again
> via dw_pcie_ep_unmap_addr().
>
> On systems that heavily use the AXI bridge interface (for example when
> the integrated eDMA engine is active), this means the outbound iATU
> registers are updated while traffic is in flight. The DesignWare
> endpoint spec warns that updating iATU registers in this situation is
> not supported, and the behavior is undefined.
>
> Under high MSI and eDMA load this pattern results in occasional bogus
> outbound transactions and IOMMU faults such as:
>
> ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
>
I agree there is no need to map/unmap the MSI window every time. But I think
there is a logic problem behind this. An IOMMU fault means the page table
entry was already removed, yet something still tries to access it afterwards.
You had better find out where the MSI memory is accessed after
dw_pcie_ep_unmap_addr().
dw_pcie_ep_unmap_addr() uses writel(), which issues a write barrier
(dma_wmb()) before changing the register, so the previous write should have
completed before the ATU register is written.
Frank
> followed by the system becoming unresponsive. This is the actual output
> observed on Renesas R-Car S4, with its ipmmu_hc used with PCIe ch0.
>
> There is no need to reprogram the iATU region used for MSI on every
> interrupt. The host-provided MSI address is stable while MSI is enabled,
> and the endpoint driver already dedicates a scratch buffer for MSI
> generation.
>
> Cache the aligned MSI address and map size, program the outbound iATU
> once, and keep the window enabled. Subsequent interrupts only perform a
> write to the MSI scratch buffer, avoiding dynamic iATU reprogramming in
> the hot path and fixing the lockups seen under load.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> .../pci/controller/dwc/pcie-designware-ep.c | 48 ++++++++++++++++---
> drivers/pci/controller/dwc/pcie-designware.h | 5 ++
> 2 files changed, 47 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> index 3780a9bd6f79..ef8ded34d9ab 100644
> --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> @@ -778,6 +778,16 @@ static void dw_pcie_ep_stop(struct pci_epc *epc)
> struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>
> + /*
> + * Tear down the dedicated outbound window used for MSI
> + * generation. This avoids leaking an iATU window across
> + * endpoint stop/start cycles.
> + */
> + if (ep->msi_iatu_mapped) {
> + dw_pcie_ep_unmap_addr(epc, 0, 0, ep->msi_mem_phys);
> + ep->msi_iatu_mapped = false;
> + }
> +
> dw_pcie_stop_link(pci);
> }
>
> @@ -881,14 +891,37 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 func_no,
> msg_addr = ((u64)msg_addr_upper) << 32 | msg_addr_lower;
>
> msg_addr = dw_pcie_ep_align_addr(epc, msg_addr, &map_size, &offset);
> - ret = dw_pcie_ep_map_addr(epc, func_no, 0, ep->msi_mem_phys, msg_addr,
> - map_size);
> - if (ret)
> - return ret;
>
> - writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
> + /*
> + * Program the outbound iATU once and keep it enabled.
> + *
> + * The spec warns that updating iATU registers while there are
> + * operations in flight on the AXI bridge interface is not
> + * supported, so we avoid reprogramming the region on every MSI,
> + * specifically unmapping immediately after writel().
> + */
> + if (!ep->msi_iatu_mapped) {
> + ret = dw_pcie_ep_map_addr(epc, func_no, 0,
> + ep->msi_mem_phys, msg_addr,
> + map_size);
> + if (ret)
> + return ret;
>
> - dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);
> + ep->msi_iatu_mapped = true;
> + ep->msi_msg_addr = msg_addr;
> + ep->msi_map_size = map_size;
> + } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
> + ep->msi_map_size != map_size)) {
> + /*
> + * The host changed the MSI target address or the required
> + * mapping size. Reprogramming the iATU at runtime is unsafe
> + * on this controller, so bail out instead of trying to update
> + * the existing region.
> + */
> + return -EINVAL;
> + }
> +
> + writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
>
> return 0;
> }
> @@ -1268,6 +1301,9 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
> INIT_LIST_HEAD(&ep->func_list);
> INIT_LIST_HEAD(&ep->ib_map_list);
> spin_lock_init(&ep->ib_map_lock);
> + ep->msi_iatu_mapped = false;
> + ep->msi_msg_addr = 0;
> + ep->msi_map_size = 0;
>
> epc = devm_pci_epc_create(dev, &epc_ops);
> if (IS_ERR(epc)) {
> diff --git a/drivers/pci/controller/dwc/pcie-designware.h b/drivers/pci/controller/dwc/pcie-designware.h
> index 269a9fe0501f..1770a2318557 100644
> --- a/drivers/pci/controller/dwc/pcie-designware.h
> +++ b/drivers/pci/controller/dwc/pcie-designware.h
> @@ -481,6 +481,11 @@ struct dw_pcie_ep {
> void __iomem *msi_mem;
> phys_addr_t msi_mem_phys;
> struct pci_epf_bar *epf_bar[PCI_STD_NUM_BARS];
> +
> + /* MSI outbound iATU state */
> + bool msi_iatu_mapped;
> + u64 msi_msg_addr;
> + size_t msi_map_size;
> };
>
> struct dw_pcie_ops {
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 23/27] NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car)
2025-11-29 16:04 ` [RFC PATCH v2 23/27] NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car) Koichiro Den
@ 2025-12-01 20:47 ` Frank Li
0 siblings, 0 replies; 97+ messages in thread
From: Frank Li @ 2025-12-01 20:47 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:04:01AM +0900, Koichiro Den wrote:
> Some R-Car platforms using Synopsys DesignWare PCIe with the integrated
> eDMA exhibit reproducible payload corruption in RC->EP remote DMA read
> traffic whenever the endpoint issues 256-byte Memory Read (MRd) TLPs.
>
> The eDMA injects multiple MRd requests of size less than or equal to
> min(MRRS, MPS), so constraining the endpoint's MRd request size removes
> 256-byte MRd TLPs and avoids the issue. This change adds a per-SoC knob
> in the ntb_hw_epf driver and sets MRRS=128 on R-Car.
>
> We intentionally do not change the endpoint's MPS. Per PCIe Base
> Specification, MPS limits the payload size of TLPs with data transmitted
> by the Function, while Max_Read_Request_Size limits the size of read
> requests produced by the Function as a Requester. Limiting MRRS is
> sufficient to constrain MRd Byte Count, while lowering MPS would also
> throttle unrelated traffic (e.g. endpoint-originated Posted Writes and
> Completions with Data) without being necessary for this fix.
>
> This quirk is scoped to the affected endpoint only and can be removed
> once the underlying issue is resolved in the controller/IP.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
Reviewed-by: Frank Li <Frank.Li@nxp.com>
> drivers/ntb/hw/epf/ntb_hw_epf.c | 66 +++++++++++++++++++++++++++++----
> 1 file changed, 58 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
> index d9811da90599..21eb26b2f7cc 100644
> --- a/drivers/ntb/hw/epf/ntb_hw_epf.c
> +++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
> @@ -51,6 +51,12 @@
>
> #define NTB_EPF_COMMAND_TIMEOUT 1000 /* 1 Sec */
>
> +struct ntb_epf_soc_data {
> + const enum pci_barno *barno_map;
> + /* non-zero to override MRRS for this SoC */
> + int force_mrrs;
> +};
> +
> enum epf_ntb_bar {
> BAR_CONFIG,
> BAR_PEER_SPAD,
> @@ -594,11 +600,12 @@ static int ntb_epf_init_dev(struct ntb_epf_dev *ndev)
> }
>
> static int ntb_epf_init_pci(struct ntb_epf_dev *ndev,
> - struct pci_dev *pdev)
> + struct pci_dev *pdev,
> + const struct ntb_epf_soc_data *soc)
> {
> struct device *dev = ndev->dev;
> size_t spad_sz, spad_off;
> - int ret;
> + int ret, cur;
>
> pci_set_drvdata(pdev, ndev);
>
> @@ -616,6 +623,17 @@ static int ntb_epf_init_pci(struct ntb_epf_dev *ndev,
>
> pci_set_master(pdev);
>
> + if (soc && pci_is_pcie(pdev) && soc->force_mrrs) {
> + cur = pcie_get_readrq(pdev);
> + ret = pcie_set_readrq(pdev, soc->force_mrrs);
> + if (ret)
> + dev_warn(&pdev->dev, "failed to set MRRS=%d: %d\n",
> + soc->force_mrrs, ret);
> + else
> + dev_info(&pdev->dev, "capped MRRS: %d->%d for ntb-epf\n",
> + cur, soc->force_mrrs);
> + }
> +
> ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64));
> if (ret) {
> ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
> @@ -690,6 +708,7 @@ static void ntb_epf_cleanup_isr(struct ntb_epf_dev *ndev)
> static int ntb_epf_pci_probe(struct pci_dev *pdev,
> const struct pci_device_id *id)
> {
> + const struct ntb_epf_soc_data *soc = (const void *)id->driver_data;
> struct device *dev = &pdev->dev;
> struct ntb_epf_dev *ndev;
> int ret;
> @@ -701,16 +720,16 @@ static int ntb_epf_pci_probe(struct pci_dev *pdev,
> if (!ndev)
> return -ENOMEM;
>
> - ndev->barno_map = (const enum pci_barno *)id->driver_data;
> - if (!ndev->barno_map)
> + if (!soc || !soc->barno_map)
> return -EINVAL;
>
> + ndev->barno_map = soc->barno_map;
> ndev->dev = dev;
>
> ntb_epf_init_struct(ndev, pdev);
> mutex_init(&ndev->cmd_lock);
>
> - ret = ntb_epf_init_pci(ndev, pdev);
> + ret = ntb_epf_init_pci(ndev, pdev, soc);
> if (ret) {
> dev_err(dev, "Failed to init PCI\n");
> return ret;
> @@ -778,21 +797,52 @@ static const enum pci_barno rcar_barno[NTB_BAR_NUM] = {
> [BAR_MW4] = NO_BAR,
> };
>
> +static const struct ntb_epf_soc_data j721e_soc = {
> + .barno_map = j721e_map,
> +};
> +
> +static const struct ntb_epf_soc_data mx8_soc = {
> + .barno_map = mx8_map,
> +};
> +
> +static const struct ntb_epf_soc_data rcar_soc = {
> + .barno_map = rcar_barno,
> + /*
> + * On some R-Car platforms using the Synopsys DWC PCIe + eDMA we
> + * observe data corruption on RC->EP Remote DMA Read paths whenever
> + * the EP issues large MRd requests. The corruption consistently
> + * hits the tail of each 256-byte segment (e.g. offsets
> + * 0x00E0..0x00FF within a 256B block, and again at 0x01E0..0x01FF
> + * for larger transfers).
> + *
> + * The DMA injects multiple MRd requests of size less than or equal
> + * to the min(MRRS, MPS) into the outbound request path. By
> + * lowering MRRS to 128 we prevent 256B MRd TLPs from being
> + * generated and avoid the issue on the affected hardware. We
> + * intentionally keep MPS unchanged and scope this quirk to this
> + * endpoint to avoid impacting unrelated devices.
> + *
> +	 * Remove this once the issue is resolved (possibly at the
> +	 * controller/IP level) or a better workaround becomes available.
> + */
> + .force_mrrs = 128,
> +};
> +
> static const struct pci_device_id ntb_epf_pci_tbl[] = {
> {
> PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_J721E),
> .class = PCI_CLASS_MEMORY_RAM << 8, .class_mask = 0xffff00,
> - .driver_data = (kernel_ulong_t)j721e_map,
> + .driver_data = (kernel_ulong_t)&j721e_soc,
> },
> {
> PCI_DEVICE(PCI_VENDOR_ID_FREESCALE, 0x0809),
> .class = PCI_CLASS_MEMORY_RAM << 8, .class_mask = 0xffff00,
> - .driver_data = (kernel_ulong_t)mx8_map,
> + .driver_data = (kernel_ulong_t)&mx8_soc,
> },
> {
> PCI_DEVICE(PCI_VENDOR_ID_RENESAS, 0x0030),
> .class = PCI_CLASS_MEMORY_RAM << 8, .class_mask = 0xffff00,
> - .driver_data = (kernel_ulong_t)rcar_barno,
> + .driver_data = (kernel_ulong_t)&rcar_soc,
> },
> { },
> };
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 10/27] NTB: core: Add .get_pci_epc() to ntb_dev_ops
2025-11-29 16:03 ` [RFC PATCH v2 10/27] NTB: core: Add .get_pci_epc() to ntb_dev_ops Koichiro Den
2025-12-01 19:39 ` Frank Li
@ 2025-12-01 21:08 ` Dave Jiang
2025-12-02 6:32 ` Koichiro Den
1 sibling, 1 reply; 97+ messages in thread
From: Dave Jiang @ 2025-12-01 21:08 UTC (permalink / raw)
To: Koichiro Den, ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring
On 11/29/25 9:03 AM, Koichiro Den wrote:
> Add an optional get_pci_epc() callback to retrieve the underlying
> pci_epc device associated with the NTB implementation.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> drivers/ntb/hw/epf/ntb_hw_epf.c | 11 +----------
> include/linux/ntb.h | 21 +++++++++++++++++++++
> 2 files changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
> index a3ec411bfe49..d55ce6b0fad4 100644
> --- a/drivers/ntb/hw/epf/ntb_hw_epf.c
> +++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
> @@ -9,6 +9,7 @@
> #include <linux/delay.h>
> #include <linux/module.h>
> #include <linux/pci.h>
> +#include <linux/pci-epf.h>
> #include <linux/slab.h>
> #include <linux/ntb.h>
>
> @@ -49,16 +50,6 @@
>
> #define NTB_EPF_COMMAND_TIMEOUT 1000 /* 1 Sec */
>
> -enum pci_barno {
> - NO_BAR = -1,
> - BAR_0,
> - BAR_1,
> - BAR_2,
> - BAR_3,
> - BAR_4,
> - BAR_5,
> -};
> -
> enum epf_ntb_bar {
> BAR_CONFIG,
> BAR_PEER_SPAD,
> diff --git a/include/linux/ntb.h b/include/linux/ntb.h
> index d7ce5d2e60d0..04dc9a4d6b85 100644
> --- a/include/linux/ntb.h
> +++ b/include/linux/ntb.h
> @@ -64,6 +64,7 @@ struct ntb_client;
> struct ntb_dev;
> struct ntb_msi;
> struct pci_dev;
> +struct pci_epc;
>
> /**
> * enum ntb_topo - NTB connection topology
> @@ -256,6 +257,7 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops *ops)
> * @msg_clear_mask: See ntb_msg_clear_mask().
> * @msg_read: See ntb_msg_read().
> * @peer_msg_write: See ntb_peer_msg_write().
> + * @get_pci_epc: See ntb_get_pci_epc().
> */
> struct ntb_dev_ops {
> int (*port_number)(struct ntb_dev *ntb);
> @@ -331,6 +333,7 @@ struct ntb_dev_ops {
> int (*msg_clear_mask)(struct ntb_dev *ntb, u64 mask_bits);
> u32 (*msg_read)(struct ntb_dev *ntb, int *pidx, int midx);
> int (*peer_msg_write)(struct ntb_dev *ntb, int pidx, int midx, u32 msg);
> + struct pci_epc *(*get_pci_epc)(struct ntb_dev *ntb);
This seems like a callback very specific to this particular hardware rather than something generic for the NTB dev ops. Maybe it should be something like get_private_data() instead?
DJ
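
For illustration, a minimal sketch of the more generic shape suggested above; the get_private_data() callback and ntb_get_private_data() helper names are hypothetical and not part of the posted series.

#include <linux/stddef.h>

/* Hypothetical generic alternative sketched from the review comment. */
struct ntb_dev;

struct ntb_dev_ops {
	/* ... existing callbacks elided ... */
	void *(*get_private_data)(struct ntb_dev *ntb);
};

struct ntb_dev {
	const struct ntb_dev_ops *ops;
	/* ... */
};

static inline void *ntb_get_private_data(struct ntb_dev *ntb)
{
	if (!ntb->ops->get_private_data)
		return NULL;
	return ntb->ops->get_private_data(ntb);
}

/* ntb_hw_epf could hand back its pci_epc through such a callback, and a
 * caller like ntb_transport would cast the opaque pointer back as needed. */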
> };
>
> static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> @@ -393,6 +396,9 @@ static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> /* !ops->msg_clear_mask == !ops->msg_count && */
> !ops->msg_read == !ops->msg_count &&
> !ops->peer_msg_write == !ops->msg_count &&
> +
> + /* Miscellaneous optional callbacks */
> + /* ops->get_pci_epc && */
> 1;
> }
>
> @@ -1567,6 +1573,21 @@ static inline int ntb_peer_msg_write(struct ntb_dev *ntb, int pidx, int midx,
> return ntb->ops->peer_msg_write(ntb, pidx, midx, msg);
> }
>
> +/**
> + * ntb_get_pci_epc() - get backing PCI endpoint controller if possible.
> + * @ntb: NTB device context.
> + *
> + * Get the backing PCI endpoint controller representation.
> + *
> + * Return: A pointer to the pci_epc instance if available, or %NULL if not.
> + */
> +static inline struct pci_epc __maybe_unused *ntb_get_pci_epc(struct ntb_dev *ntb)
> +{
> + if (!ntb->ops->get_pci_epc)
> + return NULL;
> + return ntb->ops->get_pci_epc(ntb);
> +}
> +
> /**
> * ntb_peer_resource_idx() - get a resource index for a given peer idx
> * @ntb: NTB device context.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-11-29 16:03 ` [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode Koichiro Den
@ 2025-12-01 21:41 ` Frank Li
2025-12-02 6:43 ` Koichiro Den
2025-12-01 21:46 ` Dave Jiang
1 sibling, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-01 21:41 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:58AM +0900, Koichiro Den wrote:
> Add a new transport backend that uses a remote DesignWare eDMA engine
> located on the NTB endpoint to move data between host and endpoint.
>
> In this mode:
>
> - The endpoint exposes a dedicated memory window that contains the
> eDMA register block followed by a small control structure (struct
> ntb_edma_info) and per-channel linked-list (LL) rings.
>
> - On the endpoint side, ntb_edma_setup_mws() allocates the control
> structure and LL rings in endpoint memory, then programs an inbound
> iATU region so that the host can access them via a peer MW.
>
> - On the host side, ntb_edma_setup_peer() ioremaps the peer MW, reads
> ntb_edma_info and configures a dw-edma DMA device to use the LL
> rings provided by the endpoint.
>
> - ntb_transport is extended with a new backend_ops implementation that
> routes TX and RX enqueue/poll operations through the remote eDMA
> rings while keeping the existing shared-memory backend intact.
>
> - The host signals the endpoint via a dedicated DMA read channel.
> 'use_msi' module option is ignored when 'use_remote_edma=1'.
>
> The new mode is guarded by a Kconfig option (NTB_TRANSPORT_EDMA) and a
> module parameter (use_remote_edma). When disabled, the existing
> ntb_transport behaviour is unchanged.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> drivers/ntb/Kconfig | 11 +
> drivers/ntb/Makefile | 3 +
> drivers/ntb/ntb_edma.c | 628 ++++++++
> drivers/ntb/ntb_edma.h | 128 ++
> .../{ntb_transport.c => ntb_transport_core.c} | 1281 ++++++++++++++++-
> 5 files changed, 2048 insertions(+), 3 deletions(-)
> create mode 100644 drivers/ntb/ntb_edma.c
> create mode 100644 drivers/ntb/ntb_edma.h
> rename drivers/ntb/{ntb_transport.c => ntb_transport_core.c} (65%)
>
> diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
> index df16c755b4da..db63f02bb116 100644
> --- a/drivers/ntb/Kconfig
> +++ b/drivers/ntb/Kconfig
> @@ -37,4 +37,15 @@ config NTB_TRANSPORT
>
> If unsure, say N.
>
> +config NTB_TRANSPORT_EDMA
> + bool "NTB Transport backed by remote eDMA"
> + depends on NTB_TRANSPORT
> + depends on PCI
> + select DMA_ENGINE
> + help
> + Enable a transport backend that uses a remote DesignWare eDMA engine
> + exposed through a dedicated NTB memory window. The host uses the
> + endpoint's eDMA engine to move data in both directions.
> + Say Y here if you intend to use the 'use_remote_edma' module parameter.
> +
> endif # NTB
> diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
> index 3a6fa181ff99..51f0e1e3aec7 100644
> --- a/drivers/ntb/Makefile
> +++ b/drivers/ntb/Makefile
> @@ -4,3 +4,6 @@ obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
>
> ntb-y := core.o
> ntb-$(CONFIG_NTB_MSI) += msi.o
> +
> +ntb_transport-y := ntb_transport_core.o
> +ntb_transport-$(CONFIG_NTB_TRANSPORT_EDMA) += ntb_edma.o
> diff --git a/drivers/ntb/ntb_edma.c b/drivers/ntb/ntb_edma.c
> new file mode 100644
> index 000000000000..cb35e0d56aa8
> --- /dev/null
> +++ b/drivers/ntb/ntb_edma.c
> @@ -0,0 +1,628 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/pci.h>
> +#include <linux/ntb.h>
> +#include <linux/io.h>
> +#include <linux/iommu.h>
> +#include <linux/dmaengine.h>
> +#include <linux/pci-epc.h>
> +#include <linux/dma/edma.h>
> +#include <linux/irq.h>
> +#include <linux/irqdomain.h>
> +#include <linux/of.h>
> +#include <linux/of_irq.h>
> +#include <dt-bindings/interrupt-controller/arm-gic.h>
> +
> +#include "ntb_edma.h"
> +
> +/*
> + * The interrupt register offsets below are taken from the DesignWare
> + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> + * backend currently only supports this layout.
> + */
> +#define DMA_WRITE_INT_STATUS_OFF 0x4c
> +#define DMA_WRITE_INT_MASK_OFF 0x54
> +#define DMA_WRITE_INT_CLEAR_OFF 0x58
> +#define DMA_READ_INT_STATUS_OFF 0xa0
> +#define DMA_READ_INT_MASK_OFF 0xa8
> +#define DMA_READ_INT_CLEAR_OFF 0xac
Not sure why you need to access the eDMA registers directly here; the eDMA
engine is already exported through its dmaengine driver.
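
For reference, a minimal sketch of the dmaengine-only path being suggested, assuming a channel already obtained from the dw-edma device and an sg list that is already mapped and configured (ntb_edma_submit_one() and ntb_edma_xfer_done() are illustrative names, not part of the series); the dw-edma core ISR then handles the DMA_{WRITE,READ}_INT_* registers itself and invokes the per-descriptor callback:

#include <linux/completion.h>
#include <linux/dmaengine.h>
#include <linux/errno.h>
#include <linux/scatterlist.h>

/* Sketch only: rely on the dmaengine completion callback instead of
 * touching the eDMA interrupt registers directly.
 */
static void ntb_edma_xfer_done(void *param, const struct dmaengine_result *res)
{
	struct completion *done = param;

	/* res->result reports DMA_TRANS_NOERROR or the failure cause */
	complete(done);
}

static int ntb_edma_submit_one(struct dma_chan *chan, struct scatterlist *sgl)
{
	struct dma_async_tx_descriptor *txd;
	DECLARE_COMPLETION_ONSTACK(done);
	dma_cookie_t cookie;

	txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_MEM_TO_DEV,
				      DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
	if (!txd)
		return -ENOSPC;

	txd->callback_result = ntb_edma_xfer_done;
	txd->callback_param = &done;

	cookie = dmaengine_submit(txd);
	if (dma_submit_error(cookie))
		return -ENOSPC;

	dma_async_issue_pending(chan);
	wait_for_completion(&done);	/* driven by dw-edma's own ISR */
	return 0;
}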
> +
> +#define NTB_EDMA_NOTIFY_MAX_QP 64
> +
> +static unsigned int edma_spi = 417; /* 0x1a1 */
> +module_param(edma_spi, uint, 0644);
> +MODULE_PARM_DESC(edma_spi, "SPI number used by remote eDMA interrupt (EP local)");
> +
> +static u64 edma_regs_phys = 0xe65d5000;
> +module_param(edma_regs_phys, ullong, 0644);
> +MODULE_PARM_DESC(edma_regs_phys, "Physical base address of local eDMA registers (EP)");
> +
> +static unsigned long edma_regs_size = 0x1200;
> +module_param(edma_regs_size, ulong, 0644);
> +MODULE_PARM_DESC(edma_regs_size, "Size of the local eDMA register space (EP)");
> +
> +struct ntb_edma_intr {
> + u32 db[NTB_EDMA_NOTIFY_MAX_QP];
> +};
> +
> +struct ntb_edma_ctx {
> + void *ll_wr_virt[EDMA_WR_CH_NUM];
> + dma_addr_t ll_wr_phys[EDMA_WR_CH_NUM];
> + void *ll_rd_virt[EDMA_RD_CH_NUM + 1];
> + dma_addr_t ll_rd_phys[EDMA_RD_CH_NUM + 1];
> +
> + struct ntb_edma_intr *intr_ep_virt;
> + dma_addr_t intr_ep_phys;
> + struct ntb_edma_intr *intr_rc_virt;
> + dma_addr_t intr_rc_phys;
> + u32 notify_qp_max;
> +
> + bool initialized;
> +};
> +
> +static struct ntb_edma_ctx edma_ctx;
> +
> +typedef void (*ntb_edma_interrupt_cb_t)(void *data, int qp_num);
> +
> +struct ntb_edma_interrupt {
> + int virq;
> + void __iomem *base;
> + ntb_edma_interrupt_cb_t cb;
> + void *data;
> +};
> +
> +static struct ntb_edma_interrupt ntb_edma_intr;
> +
> +static int ntb_edma_map_spi_to_virq(struct device *dev, unsigned int spi)
> +{
> + struct device_node *np = dev_of_node(dev);
> + struct device_node *parent;
> + struct irq_fwspec fwspec = { 0 };
> + int virq;
> +
> + parent = of_irq_find_parent(np);
> + if (!parent)
> + return -ENODEV;
> +
> + fwspec.fwnode = of_fwnode_handle(parent);
> + fwspec.param_count = 3;
> + fwspec.param[0] = GIC_SPI;
> + fwspec.param[1] = spi;
> + fwspec.param[2] = IRQ_TYPE_LEVEL_HIGH;
> +
> + virq = irq_create_fwspec_mapping(&fwspec);
> + of_node_put(parent);
> + return (virq > 0) ? virq : -EINVAL;
> +}
> +
> +static irqreturn_t ntb_edma_isr(int irq, void *data)
> +{
Not sure why dw_edma_interrupt_write/read() doesn't work for your case; you
should be able to just register a callback with the dmaengine.
> + struct ntb_edma_interrupt *v = data;
> + u32 mask = BIT(EDMA_RD_CH_NUM);
> + u32 i, val;
> +
> + /*
> + * We do not ack interrupts here but instead we mask all local interrupt
> + * sources except the read channel used for notification. This reduces
> + * needless ISR invocations.
> + *
> + * In theory we could configure LIE=1/RIE=0 only for the notification
> + * transfer (keeping all other channels at LIE=1/RIE=1), but that would
> + * require intrusive changes to the dw-edma core.
> + *
> + * Note: The host side may have already cleared the read interrupt used
> + * for notification, so reading DMA_READ_INT_CLEAR_OFF is not a reliable
> + * way to detect it. As a result, we cannot reliably tell which specific
> + * channel triggered this interrupt; we consult intr_ep_virt->db[i]
> + * instead.
> + */
> + iowrite32(~0x0, v->base + DMA_WRITE_INT_MASK_OFF);
> + iowrite32(~mask, v->base + DMA_READ_INT_MASK_OFF);
> +
> + if (!v->cb || !edma_ctx.intr_ep_virt)
> + return IRQ_HANDLED;
> +
> + for (i = 0; i < edma_ctx.notify_qp_max; i++) {
> + val = READ_ONCE(edma_ctx.intr_ep_virt->db[i]);
> + if (!val)
> + continue;
> +
> + WRITE_ONCE(edma_ctx.intr_ep_virt->db[i], 0);
> + v->cb(v->data, i);
> + }
> +
> + return IRQ_HANDLED;
> +}
> +
...
> +
> +int ntb_edma_setup_peer(struct ntb_dev *ndev)
> +{
> + struct ntb_edma_info *info;
> + unsigned int wr_cnt, rd_cnt;
> + struct dw_edma_chip *chip;
> + void __iomem *edma_virt;
> + phys_addr_t edma_phys;
> + resource_size_t mw_size;
> + u64 off = EDMA_REG_SIZE;
> + int peer_mw, mw_index;
> + unsigned int i;
> + int ret;
> +
> + peer_mw = ntb_peer_mw_count(ndev);
> + if (peer_mw <= 0)
> + return -ENODEV;
> +
> + mw_index = peer_mw - 1; /* last MW */
> +
> + ret = ntb_peer_mw_get_addr(ndev, mw_index, &edma_phys,
> + &mw_size);
> + if (ret)
> + return -1;
> +
> + edma_virt = ioremap(edma_phys, mw_size);
> +
> + chip = devm_kzalloc(&ndev->dev, sizeof(*chip), GFP_KERNEL);
> + if (!chip) {
> + ret = -ENOMEM;
> + return ret;
> + }
> +
> + chip->dev = &ndev->pdev->dev;
> + chip->nr_irqs = 4;
> + chip->ops = &ntb_edma_ops;
> + chip->flags = 0;
> + chip->reg_base = edma_virt;
> + chip->mf = EDMA_MF_EDMA_UNROLL;
> +
> + info = edma_virt + off;
> + if (info->magic != NTB_EDMA_INFO_MAGIC)
> + return -EINVAL;
> + wr_cnt = info->wr_cnt;
> + rd_cnt = info->rd_cnt;
> + chip->ll_wr_cnt = wr_cnt;
> + chip->ll_rd_cnt = rd_cnt;
> + off += PAGE_SIZE;
> +
> + edma_ctx.notify_qp_max = NTB_EDMA_NOTIFY_MAX_QP;
> + edma_ctx.intr_ep_phys = info->intr_dar_base;
> + if (edma_ctx.intr_ep_phys) {
> + edma_ctx.intr_rc_virt =
> + dma_alloc_coherent(&ndev->pdev->dev,
> + sizeof(struct ntb_edma_intr),
> + &edma_ctx.intr_rc_phys,
> + GFP_KERNEL);
> + if (!edma_ctx.intr_rc_virt)
> + return -ENOMEM;
> + memset(edma_ctx.intr_rc_virt, 0,
> + sizeof(struct ntb_edma_intr));
> + }
> +
> + for (i = 0; i < wr_cnt; i++) {
> + chip->ll_region_wr[i].vaddr.io = edma_virt + off;
> + chip->ll_region_wr[i].paddr = info->ll_wr_phys[i];
> + chip->ll_region_wr[i].sz = DMA_LLP_MEM_SIZE;
> + off += DMA_LLP_MEM_SIZE;
> + }
> + for (i = 0; i < rd_cnt; i++) {
> + chip->ll_region_rd[i].vaddr.io = edma_virt + off;
> + chip->ll_region_rd[i].paddr = info->ll_rd_phys[i];
> + chip->ll_region_rd[i].sz = DMA_LLP_MEM_SIZE;
> + off += DMA_LLP_MEM_SIZE;
> + }
> +
> + if (!pci_dev_msi_enabled(ndev->pdev))
> + return -ENXIO;
> +
> + ret = dw_edma_probe(chip);
I think dw_edma_probe() should be in ntb_hw_epf.c, which would then provide
the DMA engine support.
On the EP side, the default DWC controller driver presumably already sets up
the eDMA engine, so with the correct filter function you should be able to
get a dma chan.
Frank
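
As a rough illustration of that suggestion, a sketch of an EP-side channel lookup, assuming the DWC controller driver has already registered the embedded eDMA with the dmaengine framework (ep_edma_filter() and ep_edma_get_chan() are hypothetical names):

#include <linux/dmaengine.h>

/* Sketch: request a channel from the eDMA assumed to be registered by the
 * DWC controller driver, instead of a second dw_edma_probe().
 */
static bool ep_edma_filter(struct dma_chan *chan, void *arg)
{
	return chan->device->dev == arg;	/* match the controller's device */
}

static struct dma_chan *ep_edma_get_chan(struct device *dwc_dev)
{
	dma_cap_mask_t mask;

	dma_cap_zero(mask);
	dma_cap_set(DMA_SLAVE, mask);

	return dma_request_channel(mask, ep_edma_filter, dwc_dev);
}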
> + if (ret) {
> + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +struct ntb_edma_filter {
> + struct device *dma_dev;
> + u32 direction;
> +};
> +
> +static bool ntb_edma_filter_fn(struct dma_chan *chan, void *arg)
> +{
> + struct ntb_edma_filter *filter = arg;
> + u32 dir = filter->direction;
> + struct dma_slave_caps caps;
> + int ret;
> +
> + if (chan->device->dev != filter->dma_dev)
> + return false;
> +
> + ret = dma_get_slave_caps(chan, &caps);
> + if (ret < 0)
> + return false;
> +
> + return !!(caps.directions & dir);
> +}
> +
> +void ntb_edma_teardown_chans(struct ntb_edma_chans *edma)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < edma->num_wr_chan; i++)
> + dma_release_channel(edma->wr_chan[i]);
> +
> + for (i = 0; i < edma->num_rd_chan; i++)
> + dma_release_channel(edma->rd_chan[i]);
> +
> + if (edma->intr_chan)
> + dma_release_channel(edma->intr_chan);
> +}
> +
> +int ntb_edma_setup_chans(struct device *dma_dev, struct ntb_edma_chans *edma)
> +{
> + struct ntb_edma_filter filter;
> + dma_cap_mask_t dma_mask;
> + unsigned int i;
> +
> + dma_cap_zero(dma_mask);
> + dma_cap_set(DMA_SLAVE, dma_mask);
> +
> + memset(edma, 0, sizeof(*edma));
> + edma->dev = dma_dev;
> +
> + filter.dma_dev = dma_dev;
> + filter.direction = BIT(DMA_DEV_TO_MEM);
> + for (i = 0; i < EDMA_WR_CH_NUM; i++) {
> + edma->wr_chan[i] = dma_request_channel(dma_mask,
> + ntb_edma_filter_fn,
> + &filter);
> + if (!edma->wr_chan[i])
> + break;
> + edma->num_wr_chan++;
> + }
> +
> + filter.direction = BIT(DMA_MEM_TO_DEV);
> + for (i = 0; i < EDMA_RD_CH_NUM; i++) {
> + edma->rd_chan[i] = dma_request_channel(dma_mask,
> + ntb_edma_filter_fn,
> + &filter);
> + if (!edma->rd_chan[i])
> + break;
> + edma->num_rd_chan++;
> + }
> +
> + edma->intr_chan = dma_request_channel(dma_mask, ntb_edma_filter_fn,
> + &filter);
> + if (!edma->intr_chan)
> + dev_warn(dma_dev,
> + "Remote eDMA notify channel could not be allocated\n");
> +
> + if (!edma->num_wr_chan || !edma->num_rd_chan) {
> + dev_warn(dma_dev, "Remote eDMA channels failed to initialize\n");
> + ntb_edma_teardown_chans(edma);
> + return -ENODEV;
> + }
> + return 0;
> +}
> +
> +struct dma_chan *ntb_edma_pick_chan(struct ntb_edma_chans *edma,
> + remote_edma_dir_t dir)
> +{
> + unsigned int n, cur, idx;
> + struct dma_chan **chans;
> + atomic_t *cur_chan;
> +
> + if (dir == REMOTE_EDMA_WRITE) {
> + n = edma->num_wr_chan;
> + chans = edma->wr_chan;
> + cur_chan = &edma->cur_wr_chan;
> + } else {
> + n = edma->num_rd_chan;
> + chans = edma->rd_chan;
> + cur_chan = &edma->cur_rd_chan;
> + }
> + if (WARN_ON_ONCE(!n))
> + return NULL;
> +
> + /* Simple round-robin */
> + cur = (unsigned int)atomic_inc_return(cur_chan) - 1;
> + idx = cur % n;
> + return chans[idx];
> +}
> +
> +int ntb_edma_notify_peer(struct ntb_edma_chans *edma, int qp_num)
> +{
> + struct dma_async_tx_descriptor *txd;
> + struct dma_slave_config cfg;
> + struct scatterlist sgl;
> + dma_cookie_t cookie;
> + struct device *dev;
> +
> + if (!edma || !edma->intr_chan)
> + return -ENXIO;
> +
> + if (qp_num < 0 || qp_num >= edma_ctx.notify_qp_max)
> + return -EINVAL;
> +
> + if (!edma_ctx.intr_rc_virt || !edma_ctx.intr_ep_phys)
> + return -EINVAL;
> +
> + dev = edma->dev;
> + if (!dev)
> + return -ENODEV;
> +
> + WRITE_ONCE(edma_ctx.intr_rc_virt->db[qp_num], 1);
> +
> + /* Ensure store is visible before kicking the DMA transfer */
> + wmb();
> +
> + sg_init_table(&sgl, 1);
> + sg_dma_address(&sgl) = edma_ctx.intr_rc_phys + qp_num * sizeof(u32);
> + sg_dma_len(&sgl) = sizeof(u32);
> +
> + memset(&cfg, 0, sizeof(cfg));
> + cfg.dst_addr = edma_ctx.intr_ep_phys + qp_num * sizeof(u32);
> + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> + cfg.direction = DMA_MEM_TO_DEV;
> +
> + if (dmaengine_slave_config(edma->intr_chan, &cfg))
> + return -EINVAL;
> +
> + txd = dmaengine_prep_slave_sg(edma->intr_chan, &sgl, 1,
> + DMA_MEM_TO_DEV,
> + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> + if (!txd)
> + return -ENOSPC;
> +
> + cookie = dmaengine_submit(txd);
> + if (dma_submit_error(cookie))
> + return -ENOSPC;
> +
> + dma_async_issue_pending(edma->intr_chan);
> + return 0;
> +}
> diff --git a/drivers/ntb/ntb_edma.h b/drivers/ntb/ntb_edma.h
> new file mode 100644
> index 000000000000..da0451827edb
> --- /dev/null
> +++ b/drivers/ntb/ntb_edma.h
> @@ -0,0 +1,128 @@
> +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> +#ifndef _NTB_EDMA_H_
> +#define _NTB_EDMA_H_
> +
> +#include <linux/completion.h>
> +#include <linux/device.h>
> +#include <linux/interrupt.h>
> +
> +#define EDMA_REG_SIZE SZ_64K
> +#define DMA_LLP_MEM_SIZE SZ_4K
> +#define EDMA_WR_CH_NUM 4
> +#define EDMA_RD_CH_NUM 4
> +#define NTB_EDMA_MAX_CH 8
> +
> +#define NTB_EDMA_INFO_MAGIC 0x45444D41 /* "EDMA" */
> +#define NTB_EDMA_INFO_OFF EDMA_REG_SIZE
> +
> +#define NTB_EDMA_RING_ORDER 7
> +#define NTB_EDMA_RING_ENTRIES (1U << NTB_EDMA_RING_ORDER)
> +#define NTB_EDMA_RING_MASK (NTB_EDMA_RING_ENTRIES - 1)
> +
> +typedef void (*ntb_edma_interrupt_cb_t)(void *data, int qp_num);
> +
> +/*
> + * REMOTE_EDMA_EP:
> + * Endpoint owns the eDMA engine and pushes descriptors into a shared MW.
> + *
> + * REMOTE_EDMA_RC:
> + * Root Complex controls the endpoint eDMA through the shared MW and
> + * drives reads/writes on behalf of the host.
> + */
> +typedef enum {
> + REMOTE_EDMA_UNKNOWN,
> + REMOTE_EDMA_EP,
> + REMOTE_EDMA_RC,
> +} remote_edma_mode_t;
> +
> +typedef enum {
> + REMOTE_EDMA_WRITE,
> + REMOTE_EDMA_READ,
> +} remote_edma_dir_t;
> +
> +/*
> + * Layout of remote eDMA MW (EP local address space, RC sees via peer MW):
> + *
> + * 0 .. EDMA_REG_SIZE-1 : DesignWare eDMA registers
> + * EDMA_REG_SIZE .. +PAGE_SIZE : struct ntb_edma_info (EP writes, RC reads)
> + * +PAGE_SIZE .. : LL ring buffers (EP allocates phys addresses,
> + * RC configures via dw_edma)
> + *
> + * ntb_edma_setup_mws() on EP:
> + * - allocates ntb_edma_info and LLs in EP memory
> + * - programs inbound iATU so that RC peer MW[n] points at this block
> + *
> + * ntb_edma_setup_peer() on RC:
> + * - ioremaps peer MW[n]
> + * - reads ntb_edma_info
> + * - sets up dw_edma_chip ll_region_* from that info
> + */
> +struct ntb_edma_info {
> + u32 magic;
> + u16 wr_cnt;
> + u16 rd_cnt;
> + u64 regs_phys;
> + u32 ll_stride;
> + u32 rsvd;
> + u64 ll_wr_phys[NTB_EDMA_MAX_CH];
> + u64 ll_rd_phys[NTB_EDMA_MAX_CH];
> +
> + u64 intr_dar_base;
> +} __packed;
> +
> +struct ll_dma_addrs {
> + dma_addr_t wr[EDMA_WR_CH_NUM];
> + dma_addr_t rd[EDMA_RD_CH_NUM];
> +};
> +
> +struct ntb_edma_chans {
> + struct device *dev;
> +
> + struct dma_chan *wr_chan[EDMA_WR_CH_NUM];
> + struct dma_chan *rd_chan[EDMA_RD_CH_NUM];
> + struct dma_chan *intr_chan;
> +
> + unsigned int num_wr_chan;
> + unsigned int num_rd_chan;
> + atomic_t cur_wr_chan;
> + atomic_t cur_rd_chan;
> +};
> +
> +static __always_inline u32 ntb_edma_ring_idx(u32 v)
> +{
> + return v & NTB_EDMA_RING_MASK;
> +}
> +
> +static __always_inline u32 ntb_edma_ring_used_entry(u32 head, u32 tail)
> +{
> + if (head >= tail) {
> + WARN_ON_ONCE((head - tail) > (NTB_EDMA_RING_ENTRIES - 1));
> + return head - tail;
> + }
> +
> + WARN_ON_ONCE((U32_MAX - tail + head + 1) > (NTB_EDMA_RING_ENTRIES - 1));
> + return U32_MAX - tail + head + 1;
> +}
> +
> +static __always_inline u32 ntb_edma_ring_free_entry(u32 head, u32 tail)
> +{
> + return NTB_EDMA_RING_ENTRIES - ntb_edma_ring_used_entry(head, tail) - 1;
> +}
> +
> +static __always_inline bool ntb_edma_ring_full(u32 head, u32 tail)
> +{
> + return ntb_edma_ring_free_entry(head, tail) == 0;
> +}
> +
> +int ntb_edma_setup_isr(struct device *dev, struct device *epc_dev,
> + ntb_edma_interrupt_cb_t cb, void *data);
> +void ntb_edma_teardown_isr(struct device *dev);
> +int ntb_edma_setup_mws(struct ntb_dev *ndev);
> +int ntb_edma_setup_peer(struct ntb_dev *ndev);
> +int ntb_edma_setup_chans(struct device *dma_dev, struct ntb_edma_chans *edma);
> +struct dma_chan *ntb_edma_pick_chan(struct ntb_edma_chans *edma,
> + remote_edma_dir_t dir);
> +void ntb_edma_teardown_chans(struct ntb_edma_chans *edma);
> +int ntb_edma_notify_peer(struct ntb_edma_chans *edma, int qp_num);
> +
> +#endif
> diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport_core.c
> similarity index 65%
> rename from drivers/ntb/ntb_transport.c
> rename to drivers/ntb/ntb_transport_core.c
> index 907db6c93d4d..48d48921978d 100644
> --- a/drivers/ntb/ntb_transport.c
> +++ b/drivers/ntb/ntb_transport_core.c
> @@ -47,6 +47,9 @@
> * Contact Information:
> * Jon Mason <jon.mason@intel.com>
> */
> +#include <linux/atomic.h>
> +#include <linux/bug.h>
> +#include <linux/compiler.h>
> #include <linux/debugfs.h>
> #include <linux/delay.h>
> #include <linux/dmaengine.h>
> @@ -71,6 +74,8 @@
> #define NTB_TRANSPORT_DESC "Software Queue-Pair Transport over NTB"
> #define NTB_TRANSPORT_MIN_SPADS (MW0_SZ_HIGH + 2)
>
> +#define NTB_EDMA_MAX_POLL 32
> +
> MODULE_DESCRIPTION(NTB_TRANSPORT_DESC);
> MODULE_VERSION(NTB_TRANSPORT_VER);
> MODULE_LICENSE("Dual BSD/GPL");
> @@ -102,6 +107,13 @@ module_param(use_msi, bool, 0644);
> MODULE_PARM_DESC(use_msi, "Use MSI interrupts instead of doorbells");
> #endif
>
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> +#include "ntb_edma.h"
> +static bool use_remote_edma;
> +module_param(use_remote_edma, bool, 0644);
> +MODULE_PARM_DESC(use_remote_edma, "Use remote eDMA mode (when enabled, use_msi is ignored)");
> +#endif
> +
> static struct dentry *nt_debugfs_dir;
>
> /* Only two-ports NTB devices are supported */
> @@ -125,6 +137,14 @@ struct ntb_queue_entry {
> struct ntb_payload_header __iomem *tx_hdr;
> struct ntb_payload_header *rx_hdr;
> };
> +
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + dma_addr_t addr;
> +
> + /* Used by RC side only */
> + struct scatterlist sgl;
> + struct work_struct dma_work;
> +#endif
> };
>
> struct ntb_rx_info {
> @@ -202,6 +222,33 @@ struct ntb_transport_qp {
> int msi_irq;
> struct ntb_msi_desc msi_desc;
> struct ntb_msi_desc peer_msi_desc;
> +
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + /*
> + * For ensuring peer notification in non-atomic context.
> + * ntb_peer_db_set might sleep or schedule.
> + */
> + struct work_struct db_work;
> +
> + /*
> + * wr: remote eDMA write transfer (EP -> RC direction)
> + * rd: remote eDMA read transfer (RC -> EP direction)
> + */
> + u32 wr_cons;
> + u32 rd_cons;
> + u32 wr_prod;
> + u32 rd_prod;
> + u32 wr_issue;
> + u32 rd_issue;
> +
> + spinlock_t ep_tx_lock;
> + spinlock_t ep_rx_lock;
> + spinlock_t rc_lock;
> +
> + /* Completion work for read/write transfers. */
> + struct work_struct read_work;
> + struct work_struct write_work;
> +#endif
> };
>
> struct ntb_transport_mw {
> @@ -249,6 +296,13 @@ struct ntb_transport_ctx {
>
> /* Make sure workq of link event be executed serially */
> struct mutex link_event_lock;
> +
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + remote_edma_mode_t remote_edma_mode;
> + struct device *dma_dev;
> + struct workqueue_struct *wq;
> + struct ntb_edma_chans edma;
> +#endif
> };
>
> enum {
> @@ -262,6 +316,19 @@ struct ntb_payload_header {
> unsigned int flags;
> };
>
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> +static void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt);
> +static int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
> + unsigned int *mw_count);
> +static void ntb_transport_edma_init_queue(struct ntb_transport_ctx *nt,
> + unsigned int qp_num);
> +static void ntb_transport_edma_create_queue(struct ntb_transport_ctx *nt,
> + struct ntb_transport_qp *qp);
> +static int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt);
> +static int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt);
> +static void ntb_transport_edma_rc_dma_work(struct work_struct *work);
> +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> +
> /*
> * Return the device that should be used for DMA mapping.
> *
> @@ -298,7 +365,7 @@ enum {
> container_of((__drv), struct ntb_transport_client, driver)
>
> #define QP_TO_MW(nt, qp) ((qp) % nt->mw_count)
> -#define NTB_QP_DEF_NUM_ENTRIES 100
> +#define NTB_QP_DEF_NUM_ENTRIES 128
> #define NTB_LINK_DOWN_TIMEOUT 10
>
> static void ntb_transport_rxc_db(unsigned long data);
> @@ -1015,6 +1082,10 @@ static void ntb_transport_link_cleanup(struct ntb_transport_ctx *nt)
> count = ntb_spad_count(nt->ndev);
> for (i = 0; i < count; i++)
> ntb_spad_write(nt->ndev, i, 0);
> +
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + ntb_edma_teardown_chans(&nt->edma);
> +#endif
> }
>
> static void ntb_transport_link_cleanup_work(struct work_struct *work)
> @@ -1051,6 +1122,14 @@ static void ntb_transport_link_work(struct work_struct *work)
>
> /* send the local info, in the opposite order of the way we read it */
>
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + rc = ntb_transport_edma_ep_init(nt);
> + if (rc) {
> + dev_err(&pdev->dev, "Failed to init EP: %d\n", rc);
> + return;
> + }
> +#endif
> +
> if (nt->use_msi) {
> rc = ntb_msi_setup_mws(ndev);
> if (rc) {
> @@ -1132,6 +1211,14 @@ static void ntb_transport_link_work(struct work_struct *work)
>
> nt->link_is_up = true;
>
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + rc = ntb_transport_edma_rc_init(nt);
> + if (rc) {
> + dev_err(&pdev->dev, "Failed to init RC: %d\n", rc);
> + goto out1;
> + }
> +#endif
> +
> for (i = 0; i < nt->qp_count; i++) {
> struct ntb_transport_qp *qp = &nt->qp_vec[i];
>
> @@ -1277,6 +1364,8 @@ static const struct ntb_transport_backend_ops default_backend_ops = {
> .debugfs_stats_show = ntb_transport_default_debugfs_stats_show,
> };
>
> +static const struct ntb_transport_backend_ops edma_backend_ops;
> +
> static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> {
> struct ntb_transport_ctx *nt;
> @@ -1311,7 +1400,23 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
>
> nt->ndev = ndev;
>
> - nt->backend_ops = default_backend_ops;
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + if (use_remote_edma) {
> + rc = ntb_transport_edma_init(nt, &mw_count);
> + if (rc) {
> + nt->mw_count = 0;
> + goto err;
> + }
> + nt->backend_ops = edma_backend_ops;
> +
> + /*
> +		 * In remote eDMA mode, we reserve a read channel for Host->EP
> +		 * interrupt notification.
> + */
> + use_msi = false;
> + } else
> +#endif
> + nt->backend_ops = default_backend_ops;
>
> /*
> * If we are using MSI, and have at least one extra memory window,
> @@ -1402,6 +1507,10 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> rc = ntb_transport_init_queue(nt, i);
> if (rc)
> goto err2;
> +
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + ntb_transport_edma_init_queue(nt, i);
> +#endif
> }
>
> INIT_DELAYED_WORK(&nt->link_work, ntb_transport_link_work);
> @@ -1433,6 +1542,9 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> }
> kfree(nt->mw_vec);
> err:
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + ntb_transport_edma_uninit(nt);
> +#endif
> kfree(nt);
> return rc;
> }
> @@ -2055,11 +2167,16 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
>
> nt->qp_bitmap_free &= ~qp_bit;
>
> + qp->qp_bit = qp_bit;
> qp->cb_data = data;
> qp->rx_handler = handlers->rx_handler;
> qp->tx_handler = handlers->tx_handler;
> qp->event_handler = handlers->event_handler;
>
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + ntb_transport_edma_create_queue(nt, qp);
> +#endif
> +
> dma_cap_zero(dma_mask);
> dma_cap_set(DMA_MEMCPY, dma_mask);
>
> @@ -2105,6 +2222,9 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
> goto err1;
>
> entry->qp = qp;
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + INIT_WORK(&entry->dma_work, ntb_transport_edma_rc_dma_work);
> +#endif
> ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> &qp->rx_free_q);
> }
> @@ -2156,8 +2276,8 @@ EXPORT_SYMBOL_GPL(ntb_transport_create_queue);
> */
> void ntb_transport_free_queue(struct ntb_transport_qp *qp)
> {
> - struct pci_dev *pdev;
> struct ntb_queue_entry *entry;
> + struct pci_dev *pdev;
> u64 qp_bit;
>
> if (!qp)
> @@ -2208,6 +2328,10 @@ void ntb_transport_free_queue(struct ntb_transport_qp *qp)
> tasklet_kill(&qp->rxc_db_work);
>
> cancel_delayed_work_sync(&qp->link_work);
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + cancel_work_sync(&qp->read_work);
> + cancel_work_sync(&qp->write_work);
> +#endif
>
> qp->cb_data = NULL;
> qp->rx_handler = NULL;
> @@ -2346,6 +2470,1157 @@ int ntb_transport_tx_enqueue(struct ntb_transport_qp *qp, void *cb, void *data,
> }
> EXPORT_SYMBOL_GPL(ntb_transport_tx_enqueue);
>
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> +/*
> + * Remote eDMA mode implementation
> + */
> +struct ntb_edma_desc {
> + u32 len;
> + u32 flags;
> + u64 addr; /* DMA address */
> + u64 data;
> +};
> +
> +struct ntb_edma_ring {
> + struct ntb_edma_desc desc[NTB_EDMA_RING_ENTRIES];
> + u32 head;
> + u32 tail;
> +};
> +
> +#define NTB_EDMA_DESC_OFF(i) ((size_t)(i) * sizeof(struct ntb_edma_desc))
> +
> +#define __NTB_EDMA_CHECK_INDEX(_i) \
> +({ \
> + unsigned long __i = (unsigned long)(_i); \
> + WARN_ONCE(__i >= (unsigned long)NTB_EDMA_RING_ENTRIES, \
> + "ntb_edma: index i=%lu >= ring_entries=%lu\n", \
> + __i, (unsigned long)NTB_EDMA_RING_ENTRIES); \
> + __i; \
> +})
> +
> +#define NTB_EDMA_DESC_I(qp, i, n) \
> +({ \
> + typeof(qp) __qp = (qp); \
> + unsigned long __i = __NTB_EDMA_CHECK_INDEX(i); \
> + (struct ntb_edma_desc *) \
> + ((char *)(__qp)->rx_buff + \
> + (sizeof(struct ntb_edma_ring) * n) + \
> + NTB_EDMA_DESC_OFF(__i)); \
> +})
> +
> +#define NTB_EDMA_DESC_O(qp, i, n) \
> +({ \
> + typeof(qp) __qp = (qp); \
> + unsigned long __i = __NTB_EDMA_CHECK_INDEX(i); \
> + (struct ntb_edma_desc __iomem *) \
> + ((char __iomem *)(__qp)->tx_mw + \
> + (sizeof(struct ntb_edma_ring) * n) + \
> + NTB_EDMA_DESC_OFF(__i)); \
> +})
> +
> +#define NTB_EDMA_HEAD_I(qp, n) ((u32 *)((char *)qp->rx_buff + \
> + (sizeof(struct ntb_edma_ring) * n) + \
> + offsetof(struct ntb_edma_ring, head)))
> +#define NTB_EDMA_HEAD_O(qp, n) ((u32 *)((char __iomem *)qp->tx_mw + \
> + (sizeof(struct ntb_edma_ring) * n) + \
> + offsetof(struct ntb_edma_ring, head)))
> +#define NTB_EDMA_TAIL_I(qp, n) ((u32 *)((char *)qp->rx_buff + \
> + (sizeof(struct ntb_edma_ring) * n) + \
> + offsetof(struct ntb_edma_ring, tail)))
> +#define NTB_EDMA_TAIL_O(qp, n) ((u32 *)((char __iomem *)qp->tx_mw + \
> + (sizeof(struct ntb_edma_ring) * n) + \
> + offsetof(struct ntb_edma_ring, tail)))
> +
> +/*
> + * Macro naming rule:
> + * NTB_DESC_RD_EP_I (as an example)
> + * ^^ ^^ ^
> + * : : `-- I(n) or O(ut). In = Read, Out = Write.
> + * : `----- Who uses this macro.
> + * `-------- DESC / HEAD / TAIL
> + *
> + * Read transfers (RC->EP):
> + *
> + * EP view (outbound, written via NTB):
> + * - descs: NTB_DESC_RD_EP_O(qp, i) / NTB_DESC_RD_EP_I(qp, i)
> + * [ len ][ flags ][ addr ][ data ]
> + * [ len ][ flags ][ addr ][ data ]
> + * :
> + * [ len ][ flags ][ addr ][ data ]
> + * - head: NTB_HEAD_RD_EP_O(qp)
> + * - tail: NTB_TAIL_RD_EP_I(qp)
> + *
> + * RC view (inbound, local mapping):
> + * - descs: NTB_DESC_RD_RC_I(qp, i) / NTB_DESC_RD_RC_O(qp, i)
> + * [ len ][ flags ][ addr ][ data ]
> + * [ len ][ flags ][ addr ][ data ]
> + * :
> + * [ len ][ flags ][ addr ][ data ]
> + * - head: NTB_HEAD_RD_RC_I(qp)
> + * - tail: NTB_TAIL_RD_RC_O(qp)
> + *
> + * Write transfers (EP -> RC) are analogous but use
> + * NTB_DESC_WR_{EP_O,RC_I}(), NTB_HEAD_WR_{EP_O,RC_I}(),
> + * and NTB_TAIL_WR_{EP_I,RC_O}().
> + */
> +#define NTB_DESC_RD_EP_I(qp, i) NTB_EDMA_DESC_I(qp, i, 0)
> +#define NTB_DESC_RD_EP_O(qp, i) NTB_EDMA_DESC_O(qp, i, 0)
> +#define NTB_DESC_WR_EP_I(qp, i) NTB_EDMA_DESC_I(qp, i, 1)
> +#define NTB_DESC_WR_EP_O(qp, i) NTB_EDMA_DESC_O(qp, i, 1)
> +#define NTB_DESC_RD_RC_I(qp, i) NTB_EDMA_DESC_I(qp, i, 0)
> +#define NTB_DESC_RD_RC_O(qp, i) NTB_EDMA_DESC_O(qp, i, 0)
> +#define NTB_DESC_WR_RC_I(qp, i) NTB_EDMA_DESC_I(qp, i, 1)
> +#define NTB_DESC_WR_RC_O(qp, i) NTB_EDMA_DESC_O(qp, i, 1)
> +
> +#define NTB_HEAD_RD_EP_O(qp) NTB_EDMA_HEAD_O(qp, 0)
> +#define NTB_HEAD_WR_EP_O(qp) NTB_EDMA_HEAD_O(qp, 1)
> +#define NTB_HEAD_RD_RC_I(qp) NTB_EDMA_HEAD_I(qp, 0)
> +#define NTB_HEAD_WR_RC_I(qp) NTB_EDMA_HEAD_I(qp, 1)
> +
> +#define NTB_TAIL_RD_EP_I(qp) NTB_EDMA_TAIL_I(qp, 0)
> +#define NTB_TAIL_WR_EP_I(qp) NTB_EDMA_TAIL_I(qp, 1)
> +#define NTB_TAIL_RD_RC_O(qp) NTB_EDMA_TAIL_O(qp, 0)
> +#define NTB_TAIL_WR_RC_O(qp) NTB_EDMA_TAIL_O(qp, 1)
> +
> +static inline bool ntb_qp_edma_is_rc(struct ntb_transport_qp *qp)
> +{
> + return qp->transport->remote_edma_mode == REMOTE_EDMA_RC;
> +}
> +
> +static inline bool ntb_qp_edma_is_ep(struct ntb_transport_qp *qp)
> +{
> + return qp->transport->remote_edma_mode == REMOTE_EDMA_EP;
> +}
> +
> +static inline bool ntb_qp_edma_enabled(struct ntb_transport_qp *qp)
> +{
> + return ntb_qp_edma_is_rc(qp) || ntb_qp_edma_is_ep(qp);
> +}
> +
> +static unsigned int ntb_transport_edma_tx_free_entry(struct ntb_transport_qp *qp)
> +{
> + unsigned int head, tail;
> +
> + if (ntb_qp_edma_is_ep(qp)) {
> + scoped_guard(spinlock_irqsave, &qp->ep_tx_lock) {
> + /* In this scope, only 'head' might proceed */
> + tail = READ_ONCE(qp->wr_cons);
> + head = READ_ONCE(qp->wr_prod);
> + }
> + return ntb_edma_ring_free_entry(head, tail);
> + }
> +
> + scoped_guard(spinlock_irqsave, &qp->rc_lock) {
> + /* In this scope, only 'head' might proceed */
> + tail = READ_ONCE(qp->rd_issue);
> + head = READ_ONCE(*NTB_HEAD_RD_RC_I(qp));
> + }
> + /*
> + * On RC side, 'used' amount indicates how much EP side
> + * has refilled, which are available for us to use for TX.
> + */
> + return ntb_edma_ring_used_entry(head, tail);
> +}
> +
> +static void ntb_transport_edma_debugfs_stats_show(struct seq_file *s,
> + struct ntb_transport_qp *qp)
> +{
> + seq_printf(s, "rx_bytes - \t%llu\n", qp->rx_bytes);
> + seq_printf(s, "rx_pkts - \t%llu\n", qp->rx_pkts);
> + seq_printf(s, "rx_err_no_buf - %llu\n", qp->rx_err_no_buf);
> + seq_printf(s, "rx_buff - \t0x%p\n", qp->rx_buff);
> + seq_printf(s, "rx_max_entry - \t%u\n", qp->rx_max_entry);
> + seq_printf(s, "rx_alloc_entry - \t%u\n\n", qp->rx_alloc_entry);
> +
> + seq_printf(s, "tx_bytes - \t%llu\n", qp->tx_bytes);
> + seq_printf(s, "tx_pkts - \t%llu\n", qp->tx_pkts);
> + seq_printf(s, "tx_ring_full - \t%llu\n", qp->tx_ring_full);
> + seq_printf(s, "tx_err_no_buf - %llu\n", qp->tx_err_no_buf);
> + seq_printf(s, "tx_mw - \t0x%p\n", qp->tx_mw);
> + seq_printf(s, "tx_max_entry - \t%u\n", qp->tx_max_entry);
> + seq_printf(s, "free tx - \t%u\n", ntb_transport_tx_free_entry(qp));
> + seq_putc(s, '\n');
> +
> + seq_puts(s, "Using Remote eDMA - Yes\n");
> + seq_printf(s, "QP Link - \t%s\n", qp->link_is_up ? "Up" : "Down");
> +}
> +
> +static void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt)
> +{
> + struct ntb_dev *ndev = nt->ndev;
> +
> + if (nt->remote_edma_mode == REMOTE_EDMA_EP && ndev && ndev->pdev)
> + ntb_edma_teardown_isr(&ndev->pdev->dev);
> +
> +	if (nt->wq)
> + destroy_workqueue(nt->wq);
> + nt->wq = NULL;
> +}
> +
> +static int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
> + unsigned int *mw_count)
> +{
> + struct ntb_dev *ndev = nt->ndev;
> +
> + /*
> + * We need at least one MW for the transport plus one MW reserved
> + * for the remote eDMA window (see ntb_edma_setup_mws/peer).
> + */
> + if (*mw_count <= 1) {
> + dev_err(&ndev->dev,
> + "remote eDMA requires at least two MWS (have %u)\n",
> + *mw_count);
> + return -ENODEV;
> + }
> +
> + nt->wq = alloc_workqueue("ntb-edma-wq", WQ_UNBOUND | WQ_SYSFS, 0);
> + if (!nt->wq) {
> + ntb_transport_edma_uninit(nt);
> + return -ENOMEM;
> + }
> +
> + /* Reserve the last peer MW exclusively for the eDMA window. */
> + *mw_count -= 1;
> +
> + return 0;
> +}
> +
> +static void ntb_transport_edma_db_work(struct work_struct *work)
> +{
> + struct ntb_transport_qp *qp =
> + container_of(work, struct ntb_transport_qp, db_work);
> +
> + ntb_peer_db_set(qp->ndev, qp->qp_bit);
> +}
> +
> +static void ntb_transport_edma_notify_peer(struct ntb_transport_qp *qp)
> +{
> + if (ntb_qp_edma_is_rc(qp))
> + if (!ntb_edma_notify_peer(&qp->transport->edma, qp->qp_num))
> + return;
> +
> + /*
> + * Called from contexts that may be atomic. Since ntb_peer_db_set()
> + * may sleep, delegate the actual doorbell write to a workqueue.
> + */
> + queue_work(system_highpri_wq, &qp->db_work);
> +}
> +
> +static void ntb_transport_edma_isr(void *data, int qp_num)
> +{
> + struct ntb_transport_ctx *nt = data;
> + struct ntb_transport_qp *qp;
> +
> + if (qp_num < 0 || qp_num >= nt->qp_count)
> + return;
> +
> + qp = &nt->qp_vec[qp_num];
> + if (WARN_ON(!qp))
> + return;
> +
> + queue_work(nt->wq, &qp->read_work);
> + queue_work(nt->wq, &qp->write_work);
> +}
> +
> +static int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt)
> +{
> + struct ntb_dev *ndev = nt->ndev;
> + struct pci_dev *pdev = ndev->pdev;
> + int rc;
> +
> + if (!use_remote_edma || nt->remote_edma_mode != REMOTE_EDMA_UNKNOWN)
> + return 0;
> +
> + rc = ntb_edma_setup_peer(ndev);
> + if (rc) {
> + dev_err(&pdev->dev, "Failed to enable remote eDMA: %d\n", rc);
> + return rc;
> + }
> +
> + rc = ntb_edma_setup_chans(get_dma_dev(ndev), &nt->edma);
> + if (rc) {
> + dev_err(&pdev->dev, "Failed to setup eDMA channels: %d\n", rc);
> + return rc;
> + }
> +
> + nt->remote_edma_mode = REMOTE_EDMA_RC;
> + return 0;
> +}
> +
> +static int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt)
> +{
> + struct ntb_dev *ndev = nt->ndev;
> + struct pci_dev *pdev = ndev->pdev;
> + struct pci_epc *epc;
> + int rc;
> +
> + if (!use_remote_edma || nt->remote_edma_mode == REMOTE_EDMA_EP)
> + return 0;
> +
> + /* Only EP side can return pci_epc */
> + epc = ntb_get_pci_epc(ndev);
> + if (!epc)
> + return 0;
> +
> + rc = ntb_edma_setup_mws(ndev);
> + if (rc) {
> + dev_err(&pdev->dev,
> + "Failed to set up memory window for eDMA: %d\n", rc);
> + return rc;
> + }
> +
> + rc = ntb_edma_setup_isr(&pdev->dev, &epc->dev, ntb_transport_edma_isr, nt);
> + if (rc) {
> + dev_err(&pdev->dev, "Failed to setup eDMA ISR (%d)\n", rc);
> + return rc;
> + }
> +
> + nt->remote_edma_mode = REMOTE_EDMA_EP;
> + return 0;
> +}
> +
> +static int ntb_transport_edma_setup_qp_mw(struct ntb_transport_ctx *nt,
> + unsigned int qp_num)
> +{
> + struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
> + struct ntb_dev *ndev = nt->ndev;
> + struct ntb_queue_entry *entry;
> + struct ntb_transport_mw *mw;
> + unsigned int mw_num, mw_count, qp_count;
> + unsigned int qp_offset, rx_info_offset;
> + unsigned int mw_size, mw_size_per_qp;
> + unsigned int num_qps_mw;
> + size_t edma_total;
> + unsigned int i;
> + int node;
> +
> + mw_count = nt->mw_count;
> + qp_count = nt->qp_count;
> +
> + mw_num = QP_TO_MW(nt, qp_num);
> + mw = &nt->mw_vec[mw_num];
> +
> + if (!mw->virt_addr)
> + return -ENOMEM;
> +
> + if (mw_num < qp_count % mw_count)
> + num_qps_mw = qp_count / mw_count + 1;
> + else
> + num_qps_mw = qp_count / mw_count;
> +
> + mw_size = min(nt->mw_vec[mw_num].phys_size, mw->xlat_size);
> + if (max_mw_size && mw_size > max_mw_size)
> + mw_size = max_mw_size;
> +
> + mw_size_per_qp = round_down((unsigned int)mw_size / num_qps_mw, SZ_64);
> + qp_offset = mw_size_per_qp * (qp_num / mw_count);
> + rx_info_offset = mw_size_per_qp - sizeof(struct ntb_rx_info);
> +
> + qp->tx_mw_size = mw_size_per_qp;
> + qp->tx_mw = nt->mw_vec[mw_num].vbase + qp_offset;
> + if (!qp->tx_mw)
> + return -EINVAL;
> + qp->tx_mw_phys = nt->mw_vec[mw_num].phys_addr + qp_offset;
> + if (!qp->tx_mw_phys)
> + return -EINVAL;
> + qp->rx_info = qp->tx_mw + rx_info_offset;
> + qp->rx_buff = mw->virt_addr + qp_offset;
> + qp->remote_rx_info = qp->rx_buff + rx_info_offset;
> +
> + /* Due to housekeeping, there must be at least 2 buffs */
> + qp->tx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
> + qp->rx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
> +
> + /* In eDMA mode, decouple from MW sizing and force ring-sized entries */
> + edma_total = 2 * sizeof(struct ntb_edma_ring);
> + if (rx_info_offset < edma_total) {
> + dev_err(&ndev->dev, "Ring space requires %luB (>=%uB)\n",
> + edma_total, rx_info_offset);
> + return -EINVAL;
> + }
> + qp->tx_max_entry = NTB_EDMA_RING_ENTRIES;
> + qp->rx_max_entry = NTB_EDMA_RING_ENTRIES;
> +
> + /*
> + * Checking to see if we have more entries than the default.
> + * We should add additional entries if that is the case so we
> + * can be in sync with the transport frames.
> + */
> + node = dev_to_node(&ndev->dev);
> + for (i = qp->rx_alloc_entry; i < qp->rx_max_entry; i++) {
> + entry = kzalloc_node(sizeof(*entry), GFP_KERNEL, node);
> + if (!entry)
> + return -ENOMEM;
> +
> + entry->qp = qp;
> + INIT_WORK(&entry->dma_work, ntb_transport_edma_rc_dma_work);
> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> + &qp->rx_free_q);
> + qp->rx_alloc_entry++;
> + }
> +
> + memset(qp->rx_buff, 0, edma_total);
> +
> + qp->rx_pkts = 0;
> + qp->tx_pkts = 0;
> +
> + return 0;
> +}
> +
> +static int ntb_transport_edma_ep_read_complete(struct ntb_transport_qp *qp)
> +{
> + struct device *dma_dev = get_dma_dev(qp->ndev);
> + struct ntb_queue_entry *entry;
> + struct ntb_edma_desc *in;
> + unsigned int len;
> + u32 idx;
> +
> + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_TAIL_RD_EP_I(qp)),
> + qp->rd_cons) == 0)
> + return 0;
> +
> + idx = ntb_edma_ring_idx(qp->rd_cons);
> + in = NTB_DESC_RD_EP_I(qp, idx);
> + if (!(in->flags & DESC_DONE_FLAG))
> + return 0;
> +
> + in->flags = 0;
> + len = in->len; /* might be smaller than entry->len */
> +
> + entry = (struct ntb_queue_entry *)(in->data);
> + if (WARN_ON(!entry))
> + return 0;
> +
> + if (in->flags & LINK_DOWN_FLAG) {
> + ntb_qp_link_down(qp);
> + qp->rd_cons++;
> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> + return 1;
> + }
> +
> + dma_unmap_single(dma_dev, entry->addr, entry->len, DMA_FROM_DEVICE);
> +
> + qp->rx_bytes += len;
> + qp->rx_pkts++;
> + qp->rd_cons++;
> +
> + if (qp->rx_handler && qp->client_ready)
> + qp->rx_handler(qp, qp->cb_data, entry->cb_data, len);
> +
> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> + return 1;
> +}
> +
> +static int ntb_transport_edma_ep_write_complete(struct ntb_transport_qp *qp)
> +{
> + struct ntb_queue_entry *entry;
> + struct ntb_edma_desc *in;
> + u32 idx;
> +
> + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_TAIL_WR_EP_I(qp)),
> + qp->wr_cons) == 0)
> + return 0;
> +
> + idx = ntb_edma_ring_idx(qp->wr_cons);
> + in = NTB_DESC_WR_EP_I(qp, idx);
> +
> + entry = (struct ntb_queue_entry *)(in->data);
> + if (WARN_ON(!entry))
> + return 0;
> +
> + qp->wr_cons++;
> +
> + if (qp->tx_handler)
> + qp->tx_handler(qp, qp->cb_data, entry->cb_data, entry->len);
> +
> + ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry, &qp->tx_free_q);
> + return 1;
> +}
> +
> +static void ntb_transport_edma_ep_read_work(struct work_struct *work)
> +{
> + struct ntb_transport_qp *qp = container_of(
> + work, struct ntb_transport_qp, read_work);
> + unsigned int i;
> +
> + for (i = 0; i < NTB_EDMA_MAX_POLL; i++) {
> + if (!ntb_transport_edma_ep_read_complete(qp))
> + break;
> + }
> +
> + if (ntb_transport_edma_ep_read_complete(qp))
> + queue_work(qp->transport->wq, &qp->read_work);
> +}
> +
> +static void ntb_transport_edma_ep_write_work(struct work_struct *work)
> +{
> + struct ntb_transport_qp *qp = container_of(
> + work, struct ntb_transport_qp, write_work);
> + unsigned int i;
> +
> + for (i = 0; i < NTB_EDMA_MAX_POLL; i++) {
> + if (!ntb_transport_edma_ep_write_complete(qp))
> + break;
> + }
> +
> + if (ntb_transport_edma_ep_write_complete(qp))
> + queue_work(qp->transport->wq, &qp->write_work);
> +}
> +
> +static void ntb_transport_edma_rc_write_complete_work(struct work_struct *work)
> +{
> + struct ntb_transport_qp *qp = container_of(
> + work, struct ntb_transport_qp, write_work);
> + struct ntb_queue_entry *entry;
> + struct ntb_edma_desc *in;
> + unsigned int len;
> + void *cb_data;
> + u32 idx;
> +
> + while (ntb_edma_ring_used_entry(READ_ONCE(qp->wr_issue),
> + qp->wr_cons) != 0) {
> + /* Paired with smp_wmb() in ntb_transport_edma_rc_poll() */
> + smp_rmb();
> +
> + idx = ntb_edma_ring_idx(qp->wr_cons);
> + in = NTB_DESC_WR_RC_I(qp, idx);
> + entry = (struct ntb_queue_entry *)READ_ONCE(in->data);
> + if (!entry || !(entry->flags & DESC_DONE_FLAG))
> + break;
> +
> + in->data = 0;
> +
> + cb_data = entry->cb_data;
> + len = entry->len;
> +
> + iowrite32(++qp->wr_cons, NTB_TAIL_WR_RC_O(qp));
> +
> + if (unlikely(entry->flags & LINK_DOWN_FLAG)) {
> + ntb_qp_link_down(qp);
> + continue;
> + }
> +
> + ntb_transport_edma_notify_peer(qp);
> +
> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> +
> + if (qp->rx_handler && qp->client_ready)
> + qp->rx_handler(qp, qp->cb_data, cb_data, len);
> +
> + /* stat updates */
> + qp->rx_bytes += len;
> + qp->rx_pkts++;
> + }
> +}
> +
> +static void ntb_transport_edma_rc_write_cb(void *data,
> + const struct dmaengine_result *res)
> +{
> + struct ntb_queue_entry *entry = data;
> + struct ntb_transport_qp *qp = entry->qp;
> + struct ntb_transport_ctx *nt = qp->transport;
> + enum dmaengine_tx_result dma_err = res->result;
> + struct device *dma_dev = get_dma_dev(qp->ndev);
> +
> + switch (dma_err) {
> + case DMA_TRANS_READ_FAILED:
> + case DMA_TRANS_WRITE_FAILED:
> + case DMA_TRANS_ABORTED:
> + entry->errors++;
> + entry->len = -EIO;
> + break;
> + case DMA_TRANS_NOERROR:
> + default:
> + break;
> + }
> + dma_unmap_sg(dma_dev, &entry->sgl, 1, DMA_FROM_DEVICE);
> + sg_dma_address(&entry->sgl) = 0;
> +
> + entry->flags |= DESC_DONE_FLAG;
> +
> + queue_work(nt->wq, &qp->write_work);
> +}
> +
> +static void ntb_transport_edma_rc_read_complete_work(struct work_struct *work)
> +{
> + struct ntb_transport_qp *qp = container_of(
> + work, struct ntb_transport_qp, read_work);
> + struct ntb_edma_desc *in, __iomem *out;
> + struct ntb_queue_entry *entry;
> + unsigned int len;
> + void *cb_data;
> + u32 idx;
> +
> + while (ntb_edma_ring_used_entry(READ_ONCE(qp->rd_issue),
> + qp->rd_cons) != 0) {
> + /* Paired with smp_wmb() in ntb_transport_edma_rc_tx_enqueue() */
> + smp_rmb();
> +
> + idx = ntb_edma_ring_idx(qp->rd_cons);
> + in = NTB_DESC_RD_RC_I(qp, idx);
> + entry = (struct ntb_queue_entry *)in->data;
> + if (!entry || !(entry->flags & DESC_DONE_FLAG))
> + break;
> +
> + in->data = 0;
> +
> + cb_data = entry->cb_data;
> + len = entry->len;
> +
> + out = NTB_DESC_RD_RC_O(qp, idx);
> +
> + WRITE_ONCE(qp->rd_cons, qp->rd_cons + 1);
> +
> + /*
> + * No need to add barrier in-between to enforce ordering here.
> + * The other side proceeds only after both flags and tail are
> + * updated.
> + */
> + iowrite32(entry->flags, &out->flags);
> + iowrite32(qp->rd_cons, NTB_TAIL_RD_RC_O(qp));
> +
> + ntb_transport_edma_notify_peer(qp);
> +
> + ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry,
> + &qp->tx_free_q);
> +
> + if (qp->tx_handler)
> + qp->tx_handler(qp, qp->cb_data, cb_data, len);
> +
> + /* stat updates */
> + qp->tx_bytes += len;
> + qp->tx_pkts++;
> + }
> +}
> +
> +static void ntb_transport_edma_rc_read_cb(void *data,
> + const struct dmaengine_result *res)
> +{
> + struct ntb_queue_entry *entry = data;
> + struct ntb_transport_qp *qp = entry->qp;
> + struct ntb_transport_ctx *nt = qp->transport;
> + struct device *dma_dev = get_dma_dev(qp->ndev);
> + enum dmaengine_tx_result dma_err = res->result;
> +
> + switch (dma_err) {
> + case DMA_TRANS_READ_FAILED:
> + case DMA_TRANS_WRITE_FAILED:
> + case DMA_TRANS_ABORTED:
> + entry->errors++;
> + entry->len = -EIO;
> + break;
> + case DMA_TRANS_NOERROR:
> + default:
> + break;
> + }
> + dma_unmap_sg(dma_dev, &entry->sgl, 1, DMA_TO_DEVICE);
> + sg_dma_address(&entry->sgl) = 0;
> +
> + entry->flags |= DESC_DONE_FLAG;
> +
> + queue_work(nt->wq, &qp->read_work);
> +}
> +
> +static int ntb_transport_edma_rc_write_start(struct device *d,
> + struct dma_chan *chan, size_t len,
> + dma_addr_t ep_src, void *rc_dst,
> + struct ntb_queue_entry *entry)
> +{
> + struct scatterlist *sgl = &entry->sgl;
> + struct dma_async_tx_descriptor *txd;
> + struct dma_slave_config cfg;
> + dma_cookie_t cookie;
> + int nents, rc;
> +
> + if (!d)
> + return -ENODEV;
> +
> + if (!chan)
> + return -ENXIO;
> +
> + if (WARN_ON(!ep_src || !rc_dst))
> + return -EINVAL;
> +
> + if (WARN_ON(sg_dma_address(sgl)))
> + return -EINVAL;
> +
> + sg_init_one(sgl, rc_dst, len);
> + nents = dma_map_sg(d, sgl, 1, DMA_FROM_DEVICE);
> + if (nents <= 0)
> + return -EIO;
> +
> + memset(&cfg, 0, sizeof(cfg));
> + cfg.src_addr = ep_src;
> + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> + cfg.direction = DMA_DEV_TO_MEM;
> + rc = dmaengine_slave_config(chan, &cfg);
> + if (rc)
> + goto out_unmap;
> +
> + txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_DEV_TO_MEM,
> + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> + if (!txd) {
> + rc = -EIO;
> + goto out_unmap;
> + }
> +
> + txd->callback_result = ntb_transport_edma_rc_write_cb;
> + txd->callback_param = entry;
> +
> + cookie = dmaengine_submit(txd);
> + if (dma_submit_error(cookie)) {
> + rc = -EIO;
> + goto out_unmap;
> + }
> + dma_async_issue_pending(chan);
> + return 0;
> +out_unmap:
> + dma_unmap_sg(d, sgl, 1, DMA_FROM_DEVICE);
> + return rc;
> +}
> +
> +static int ntb_transport_edma_rc_read_start(struct device *d,
> + struct dma_chan *chan, size_t len,
> + void *rc_src, dma_addr_t ep_dst,
> + struct ntb_queue_entry *entry)
> +{
> + struct scatterlist *sgl = &entry->sgl;
> + struct dma_async_tx_descriptor *txd;
> + struct dma_slave_config cfg;
> + dma_cookie_t cookie;
> + int nents, rc;
> +
> + if (!d)
> + return -ENODEV;
> +
> + if (!chan)
> + return -ENXIO;
> +
> + if (WARN_ON(!rc_src || !ep_dst))
> + return -EINVAL;
> +
> + if (WARN_ON(sg_dma_address(sgl)))
> + return -EINVAL;
> +
> + sg_init_one(sgl, rc_src, len);
> + nents = dma_map_sg(d, sgl, 1, DMA_TO_DEVICE);
> + if (nents <= 0)
> + return -EIO;
> +
> + memset(&cfg, 0, sizeof(cfg));
> + cfg.dst_addr = ep_dst;
> + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> + cfg.direction = DMA_MEM_TO_DEV;
> + rc = dmaengine_slave_config(chan, &cfg);
> + if (rc)
> + goto out_unmap;
> +
> + txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_MEM_TO_DEV,
> + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> + if (!txd) {
> + rc = -EIO;
> + goto out_unmap;
> + }
> +
> + txd->callback_result = ntb_transport_edma_rc_read_cb;
> + txd->callback_param = entry;
> +
> + cookie = dmaengine_submit(txd);
> + if (dma_submit_error(cookie)) {
> + rc = -EIO;
> + goto out_unmap;
> + }
> + dma_async_issue_pending(chan);
> + return 0;
> +out_unmap:
> + dma_unmap_sg(d, sgl, 1, DMA_TO_DEVICE);
> + return rc;
> +}
> +
> +static void ntb_transport_edma_rc_dma_work(struct work_struct *work)
> +{
> + struct ntb_queue_entry *entry = container_of(
> + work, struct ntb_queue_entry, dma_work);
> + struct ntb_transport_qp *qp = entry->qp;
> + struct ntb_transport_ctx *nt = qp->transport;
> + struct device *dma_dev = get_dma_dev(qp->ndev);
> + struct dma_chan *chan;
> + int rc;
> +
> + chan = ntb_edma_pick_chan(&nt->edma, REMOTE_EDMA_WRITE);
> + rc = ntb_transport_edma_rc_write_start(dma_dev, chan, entry->len,
> + entry->addr, entry->buf, entry);
> + if (rc) {
> + entry->errors++;
> + entry->len = -EIO;
> + entry->flags |= DESC_DONE_FLAG;
> + queue_work(nt->wq, &qp->write_work);
> + return;
> + }
> +}
> +
> +static void ntb_transport_edma_rc_poll(struct ntb_transport_qp *qp)
> +{
> + struct ntb_transport_ctx *nt = qp->transport;
> + unsigned int budget = NTB_EDMA_MAX_POLL;
> + struct ntb_queue_entry *entry;
> + struct ntb_edma_desc *in;
> + dma_addr_t ep_src;
> + u32 len, idx;
> +
> + while (budget--) {
> + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_HEAD_WR_RC_I(qp)),
> + qp->wr_issue) == 0)
> + break;
> +
> + idx = ntb_edma_ring_idx(qp->wr_issue);
> + in = NTB_DESC_WR_RC_I(qp, idx);
> +
> + len = READ_ONCE(in->len);
> + ep_src = (dma_addr_t)READ_ONCE(in->addr);
> +
> + /* Prepare 'entry' for write completion */
> + entry = ntb_list_rm(&qp->ntb_rx_q_lock, &qp->rx_pend_q);
> + if (!entry) {
> + qp->rx_err_no_buf++;
> + break;
> + }
> + if (WARN_ON(entry->flags & DESC_DONE_FLAG))
> + entry->flags &= ~DESC_DONE_FLAG;
> + entry->len = len; /* NB. entry->len can be <=0 */
> + entry->addr = ep_src;
> +
> + /*
> + * ntb_transport_edma_rc_write_complete_work() checks entry->flags
> + * so it needs to be set before wr_issue++.
> + */
> + in->data = (uintptr_t)entry;
> +
> + /* Ensure in->data visible before wr_issue++ */
> + smp_wmb();
> +
> + WRITE_ONCE(qp->wr_issue, qp->wr_issue + 1);
> +
> + if (!len) {
> + entry->flags |= DESC_DONE_FLAG;
> + queue_work(nt->wq, &qp->write_work);
> + continue;
> + }
> +
> + if (in->flags & LINK_DOWN_FLAG) {
> + dev_dbg(&qp->ndev->pdev->dev, "link down flag set\n");
> + entry->flags |= DESC_DONE_FLAG | LINK_DOWN_FLAG;
> + queue_work(nt->wq, &qp->write_work);
> + continue;
> + }
> +
> + queue_work(nt->wq, &entry->dma_work);
> + }
> +
> + if (!budget)
> + tasklet_schedule(&qp->rxc_db_work);
> +}
> +
> +static int ntb_transport_edma_rc_tx_enqueue(struct ntb_transport_qp *qp,
> + struct ntb_queue_entry *entry)
> +{
> + struct device *dma_dev = get_dma_dev(qp->ndev);
> + struct ntb_transport_ctx *nt = qp->transport;
> + struct ntb_edma_desc *in, __iomem *out;
> + unsigned int len = entry->len;
> + struct dma_chan *chan;
> + u32 issue, idx, head;
> + dma_addr_t ep_dst;
> + int rc;
> +
> + WARN_ON_ONCE(entry->flags & DESC_DONE_FLAG);
> +
> + scoped_guard(spinlock_irqsave, &qp->rc_lock) {
> + head = READ_ONCE(*NTB_HEAD_RD_RC_I(qp));
> + issue = qp->rd_issue;
> + if (ntb_edma_ring_used_entry(head, issue) == 0) {
> + qp->tx_ring_full++;
> + return -ENOSPC;
> + }
> +
> + /*
> + * ntb_transport_edma_rc_read_complete_work() checks entry->flags
> + * so it needs to be set before rd_issue++.
> + */
> + idx = ntb_edma_ring_idx(issue);
> + in = NTB_DESC_RD_RC_I(qp, idx);
> + in->data = (uintptr_t)entry;
> +
> + /* Make in->data visible before rd_issue++ */
> + smp_wmb();
> +
> + WRITE_ONCE(qp->rd_issue, qp->rd_issue + 1);
> + }
> +
> + /* Publish the final transfer length to the EP side */
> + out = NTB_DESC_RD_RC_O(qp, idx);
> + iowrite32(len, &out->len);
> + ioread32(&out->len);
> +
> + if (unlikely(!len)) {
> + entry->flags |= DESC_DONE_FLAG;
> + queue_work(nt->wq, &qp->read_work);
> + return 0;
> + }
> +
> + /* Paired with dma_wmb() in ntb_transport_edma_ep_rx_enqueue() */
> + dma_rmb();
> +
> + /* kick remote eDMA read transfer */
> + ep_dst = (dma_addr_t)in->addr;
> + chan = ntb_edma_pick_chan(&nt->edma, REMOTE_EDMA_READ);
> + rc = ntb_transport_edma_rc_read_start(dma_dev, chan, len,
> + entry->buf, ep_dst, entry);
> + if (rc) {
> + entry->errors++;
> + entry->len = -EIO;
> + entry->flags |= DESC_DONE_FLAG;
> + queue_work(nt->wq, &qp->read_work);
> + }
> + return 0;
> +}
> +
> +static int ntb_transport_edma_ep_tx_enqueue(struct ntb_transport_qp *qp,
> + struct ntb_queue_entry *entry)
> +{
> + struct device *dma_dev = get_dma_dev(qp->ndev);
> + struct ntb_edma_desc *in, __iomem *out;
> + unsigned int len = entry->len;
> + dma_addr_t ep_src = 0;
> + u32 idx;
> + int rc;
> +
> + if (likely(len)) {
> + ep_src = dma_map_single(dma_dev, entry->buf, len,
> + DMA_TO_DEVICE);
> + rc = dma_mapping_error(dma_dev, ep_src);
> + if (rc)
> + return rc;
> + }
> +
> + scoped_guard(spinlock_irqsave, &qp->ep_tx_lock) {
> + if (ntb_edma_ring_full(qp->wr_prod, qp->wr_cons)) {
> + rc = -ENOSPC;
> + qp->tx_ring_full++;
> + goto out_unmap;
> + }
> +
> + idx = ntb_edma_ring_idx(qp->wr_prod);
> + in = NTB_DESC_WR_EP_I(qp, idx);
> + out = NTB_DESC_WR_EP_O(qp, idx);
> +
> + WARN_ON(in->flags & DESC_DONE_FLAG);
> + WARN_ON(entry->flags & DESC_DONE_FLAG);
> + in->flags = 0;
> + in->data = (uintptr_t)entry;
> + entry->addr = ep_src;
> +
> + iowrite32(len, &out->len);
> + iowrite32(entry->flags, &out->flags);
> + iowrite64(ep_src, &out->addr);
> + WRITE_ONCE(qp->wr_prod, qp->wr_prod + 1);
> +
> + dma_wmb();
> + iowrite32(qp->wr_prod, NTB_HEAD_WR_EP_O(qp));
> +
> + qp->tx_bytes += len;
> + qp->tx_pkts++;
> + }
> +
> + ntb_transport_edma_notify_peer(qp);
> +
> + return 0;
> +out_unmap:
> + if (likely(len))
> + dma_unmap_single(dma_dev, ep_src, len, DMA_TO_DEVICE);
> + return rc;
> +}
> +
> +static int ntb_transport_edma_tx_enqueue(struct ntb_transport_qp *qp,
> + struct ntb_queue_entry *entry,
> + void *cb, void *data, unsigned int len,
> + unsigned int flags)
> +{
> + struct device *dma_dev;
> +
> + if (entry->addr) {
> + /* Deferred unmap */
> + dma_dev = get_dma_dev(qp->ndev);
> + dma_unmap_single(dma_dev, entry->addr, entry->len, DMA_TO_DEVICE);
> + }
> +
> + entry->cb_data = cb;
> + entry->buf = data;
> + entry->len = len;
> + entry->flags = flags;
> + entry->errors = 0;
> + entry->addr = 0;
> +
> + WARN_ON_ONCE(!ntb_qp_edma_enabled(qp));
> +
> + if (ntb_qp_edma_is_ep(qp))
> + return ntb_transport_edma_ep_tx_enqueue(qp, entry);
> + else
> + return ntb_transport_edma_rc_tx_enqueue(qp, entry);
> +}
> +
> +static int ntb_transport_edma_ep_rx_enqueue(struct ntb_transport_qp *qp,
> + struct ntb_queue_entry *entry)
> +{
> + struct device *dma_dev = get_dma_dev(qp->ndev);
> + struct ntb_edma_desc *in, __iomem *out;
> + unsigned int len = entry->len;
> + void *data = entry->buf;
> + dma_addr_t ep_dst;
> + u32 idx;
> + int rc;
> +
> + ep_dst = dma_map_single(dma_dev, data, len, DMA_FROM_DEVICE);
> + rc = dma_mapping_error(dma_dev, ep_dst);
> + if (rc)
> + return rc;
> +
> + scoped_guard(spinlock_bh, &qp->ep_rx_lock) {
> + if (ntb_edma_ring_full(READ_ONCE(qp->rd_prod),
> + READ_ONCE(qp->rd_cons))) {
> + rc = -ENOSPC;
> + goto out_unmap;
> + }
> +
> + idx = ntb_edma_ring_idx(qp->rd_prod);
> + in = NTB_DESC_RD_EP_I(qp, idx);
> + out = NTB_DESC_RD_EP_O(qp, idx);
> +
> + iowrite32(len, &out->len);
> + iowrite64(ep_dst, &out->addr);
> +
> + WARN_ON(in->flags & DESC_DONE_FLAG);
> + in->data = (uintptr_t)entry;
> + entry->addr = ep_dst;
> +
> + /* Ensure len/addr are visible before the head update */
> + dma_wmb();
> +
> + WRITE_ONCE(qp->rd_prod, qp->rd_prod + 1);
> + iowrite32(qp->rd_prod, NTB_HEAD_RD_EP_O(qp));
> + }
> + return 0;
> +out_unmap:
> + dma_unmap_single(dma_dev, ep_dst, len, DMA_FROM_DEVICE);
> + return rc;
> +}
> +
> +static int ntb_transport_edma_rx_enqueue(struct ntb_transport_qp *qp,
> + struct ntb_queue_entry *entry)
> +{
> + int rc;
> +
> + /* The behaviour is the same as the default backend for RC side */
> + if (ntb_qp_edma_is_ep(qp)) {
> + rc = ntb_transport_edma_ep_rx_enqueue(qp, entry);
> + if (rc) {
> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> + &qp->rx_free_q);
> + return rc;
> + }
> + }
> +
> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_pend_q);
> +
> + if (qp->active)
> + tasklet_schedule(&qp->rxc_db_work);
> +
> + return 0;
> +}
> +
> +static void ntb_transport_edma_rx_poll(struct ntb_transport_qp *qp)
> +{
> + struct ntb_transport_ctx *nt = qp->transport;
> +
> + if (ntb_qp_edma_is_rc(qp))
> + ntb_transport_edma_rc_poll(qp);
> + else if (ntb_qp_edma_is_ep(qp)) {
> + /*
> + * Make sure we poll the rings even if an eDMA interrupt is
> + * cleared on the RC side earlier.
> + */
> + queue_work(nt->wq, &qp->read_work);
> + queue_work(nt->wq, &qp->write_work);
> + } else
> + /* Unreachable */
> + WARN_ON_ONCE(1);
> +}
> +
> +static void ntb_transport_edma_read_work(struct work_struct *work)
> +{
> + struct ntb_transport_qp *qp = container_of(
> + work, struct ntb_transport_qp, read_work);
> +
> + if (ntb_qp_edma_is_rc(qp))
> + ntb_transport_edma_rc_read_complete_work(work);
> + else if (ntb_qp_edma_is_ep(qp))
> + ntb_transport_edma_ep_read_work(work);
> + else
> + /* Unreachable */
> + WARN_ON_ONCE(1);
> +}
> +
> +static void ntb_transport_edma_write_work(struct work_struct *work)
> +{
> + struct ntb_transport_qp *qp = container_of(
> + work, struct ntb_transport_qp, write_work);
> +
> + if (ntb_qp_edma_is_rc(qp))
> + ntb_transport_edma_rc_write_complete_work(work);
> + else if (ntb_qp_edma_is_ep(qp))
> + ntb_transport_edma_ep_write_work(work);
> + else
> + /* Unreachable */
> + WARN_ON_ONCE(1);
> +}
> +
> +static void ntb_transport_edma_init_queue(struct ntb_transport_ctx *nt,
> + unsigned int qp_num)
> +{
> + struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
> +
> + qp->wr_cons = 0;
> + qp->rd_cons = 0;
> + qp->wr_prod = 0;
> + qp->rd_prod = 0;
> + qp->wr_issue = 0;
> + qp->rd_issue = 0;
> +
> + INIT_WORK(&qp->db_work, ntb_transport_edma_db_work);
> + INIT_WORK(&qp->read_work, ntb_transport_edma_read_work);
> + INIT_WORK(&qp->write_work, ntb_transport_edma_write_work);
> +}
> +
> +static void ntb_transport_edma_create_queue(struct ntb_transport_ctx *nt,
> + struct ntb_transport_qp *qp)
> +{
> + spin_lock_init(&qp->ep_tx_lock);
> + spin_lock_init(&qp->ep_rx_lock);
> + spin_lock_init(&qp->rc_lock);
> +}
> +
> +static const struct ntb_transport_backend_ops edma_backend_ops = {
> + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> + .tx_free_entry = ntb_transport_edma_tx_free_entry,
> + .tx_enqueue = ntb_transport_edma_tx_enqueue,
> + .rx_enqueue = ntb_transport_edma_rx_enqueue,
> + .rx_poll = ntb_transport_edma_rx_poll,
> + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> +};
> +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> +
> /**
> * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> * @qp: NTB transport layer queue to be enabled
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-11-29 16:03 ` [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode Koichiro Den
2025-12-01 21:41 ` Frank Li
@ 2025-12-01 21:46 ` Dave Jiang
2025-12-02 6:59 ` Koichiro Den
1 sibling, 1 reply; 97+ messages in thread
From: Dave Jiang @ 2025-12-01 21:46 UTC (permalink / raw)
To: Koichiro Den, ntb, linux-pci, dmaengine, linux-kernel, Frank.Li
Cc: mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring
On 11/29/25 9:03 AM, Koichiro Den wrote:
> Add a new transport backend that uses a remote DesignWare eDMA engine
> located on the NTB endpoint to move data between host and endpoint.
>
> In this mode:
>
> - The endpoint exposes a dedicated memory window that contains the
> eDMA register block followed by a small control structure (struct
> ntb_edma_info) and per-channel linked-list (LL) rings.
>
> - On the endpoint side, ntb_edma_setup_mws() allocates the control
> structure and LL rings in endpoint memory, then programs an inbound
> iATU region so that the host can access them via a peer MW.
>
> - On the host side, ntb_edma_setup_peer() ioremaps the peer MW, reads
> ntb_edma_info and configures a dw-edma DMA device to use the LL
> rings provided by the endpoint.
>
> - ntb_transport is extended with a new backend_ops implementation that
> routes TX and RX enqueue/poll operations through the remote eDMA
> rings while keeping the existing shared-memory backend intact.
>
> - The host signals the endpoint via a dedicated DMA read channel.
> 'use_msi' module option is ignored when 'use_remote_edma=1'.
>
> The new mode is guarded by a Kconfig option (NTB_TRANSPORT_EDMA) and a
> module parameter (use_remote_edma). When disabled, the existing
> ntb_transport behaviour is unchanged.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> drivers/ntb/Kconfig | 11 +
> drivers/ntb/Makefile | 3 +
> drivers/ntb/ntb_edma.c | 628 ++++++++
> drivers/ntb/ntb_edma.h | 128 ++
I briefly looked over the code. It feels like the eDMA bits should go in drivers/ntb/hw/ rather than drivers/ntb/, given they are pretty specific to the DesignWare hardware. What sits in drivers/ntb/ should be generic APIs that a different vendor can use without having to adapt to DesignWare hardware specifics. So maybe a bit more abstraction is needed?
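Roughly what I have in mind (completely untested, names made up just to show the shape): ntb_transport would only know about a generic remote-DMA ops table, and the DesignWare-specific provider would live under drivers/ntb/hw/ and register itself at probe time:

struct ntb_remote_dma_ops {
	int (*setup_mws)(struct ntb_dev *ndev);
	int (*setup_peer)(struct ntb_dev *ndev);
	struct dma_chan *(*pick_chan)(void *priv, bool write);
	int (*notify_peer)(void *priv, int qp_num);
};

int ntb_remote_dma_register(struct ntb_dev *ndev,
			    const struct ntb_remote_dma_ops *ops,
			    void *priv);

That way another vendor's DMA engine could plug in without ntb_transport knowing about eDMA register layouts at all.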
> .../{ntb_transport.c => ntb_transport_core.c} | 1281 ++++++++++++++++-
> 5 files changed, 2048 insertions(+), 3 deletions(-)
> create mode 100644 drivers/ntb/ntb_edma.c
> create mode 100644 drivers/ntb/ntb_edma.h
> rename drivers/ntb/{ntb_transport.c => ntb_transport_core.c} (65%)
>
> diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
> index df16c755b4da..db63f02bb116 100644
> --- a/drivers/ntb/Kconfig
> +++ b/drivers/ntb/Kconfig
> @@ -37,4 +37,15 @@ config NTB_TRANSPORT
>
> If unsure, say N.
>
> +config NTB_TRANSPORT_EDMA
> + bool "NTB Transport backed by remote eDMA"
> + depends on NTB_TRANSPORT
> + depends on PCI
> + select DMA_ENGINE
> + help
> + Enable a transport backend that uses a remote DesignWare eDMA engine
> + exposed through a dedicated NTB memory window. The host uses the
> + endpoint's eDMA engine to move data in both directions.
> + Say Y here if you intend to use the 'use_remote_edma' module parameter.
> +
> endif # NTB
> diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
> index 3a6fa181ff99..51f0e1e3aec7 100644
> --- a/drivers/ntb/Makefile
> +++ b/drivers/ntb/Makefile
> @@ -4,3 +4,6 @@ obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
>
> ntb-y := core.o
> ntb-$(CONFIG_NTB_MSI) += msi.o
> +
> +ntb_transport-y := ntb_transport_core.o
> +ntb_transport-$(CONFIG_NTB_TRANSPORT_EDMA) += ntb_edma.o
> diff --git a/drivers/ntb/ntb_edma.c b/drivers/ntb/ntb_edma.c
> new file mode 100644
> index 000000000000..cb35e0d56aa8
> --- /dev/null
> +++ b/drivers/ntb/ntb_edma.c
> @@ -0,0 +1,628 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/pci.h>
> +#include <linux/ntb.h>
> +#include <linux/io.h>
> +#include <linux/iommu.h>
> +#include <linux/dmaengine.h>
> +#include <linux/pci-epc.h>
> +#include <linux/dma/edma.h>
> +#include <linux/irq.h>
> +#include <linux/irqdomain.h>
> +#include <linux/of.h>
> +#include <linux/of_irq.h>
> +#include <dt-bindings/interrupt-controller/arm-gic.h>
> +
> +#include "ntb_edma.h"
> +
> +/*
> + * The interrupt register offsets below are taken from the DesignWare
> + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> + * backend currently only supports this layout.
> + */
> +#define DMA_WRITE_INT_STATUS_OFF 0x4c
> +#define DMA_WRITE_INT_MASK_OFF 0x54
> +#define DMA_WRITE_INT_CLEAR_OFF 0x58
> +#define DMA_READ_INT_STATUS_OFF 0xa0
> +#define DMA_READ_INT_MASK_OFF 0xa8
> +#define DMA_READ_INT_CLEAR_OFF 0xac
> +
> +#define NTB_EDMA_NOTIFY_MAX_QP 64
> +
> +static unsigned int edma_spi = 417; /* 0x1a1 */
> +module_param(edma_spi, uint, 0644);
> +MODULE_PARM_DESC(edma_spi, "SPI number used by remote eDMA interrupt (EP local)");
> +
> +static u64 edma_regs_phys = 0xe65d5000;
> +module_param(edma_regs_phys, ullong, 0644);
> +MODULE_PARM_DESC(edma_regs_phys, "Physical base address of local eDMA registers (EP)");
> +
> +static unsigned long edma_regs_size = 0x1200;
> +module_param(edma_regs_size, ulong, 0644);
> +MODULE_PARM_DESC(edma_regs_size, "Size of the local eDMA register space (EP)");
> +
> +struct ntb_edma_intr {
> + u32 db[NTB_EDMA_NOTIFY_MAX_QP];
> +};
> +
> +struct ntb_edma_ctx {
> + void *ll_wr_virt[EDMA_WR_CH_NUM];
> + dma_addr_t ll_wr_phys[EDMA_WR_CH_NUM];
> + void *ll_rd_virt[EDMA_RD_CH_NUM + 1];
> + dma_addr_t ll_rd_phys[EDMA_RD_CH_NUM + 1];
> +
> + struct ntb_edma_intr *intr_ep_virt;
> + dma_addr_t intr_ep_phys;
> + struct ntb_edma_intr *intr_rc_virt;
> + dma_addr_t intr_rc_phys;
> + u32 notify_qp_max;
> +
> + bool initialized;
> +};
> +
> +static struct ntb_edma_ctx edma_ctx;
> +
> +typedef void (*ntb_edma_interrupt_cb_t)(void *data, int qp_num);
> +
> +struct ntb_edma_interrupt {
> + int virq;
> + void __iomem *base;
> + ntb_edma_interrupt_cb_t cb;
> + void *data;
> +};
> +
> +static struct ntb_edma_interrupt ntb_edma_intr;
> +
> +static int ntb_edma_map_spi_to_virq(struct device *dev, unsigned int spi)
> +{
> + struct device_node *np = dev_of_node(dev);
> + struct device_node *parent;
> + struct irq_fwspec fwspec = { 0 };
> + int virq;
> +
> + parent = of_irq_find_parent(np);
> + if (!parent)
> + return -ENODEV;
> +
> + fwspec.fwnode = of_fwnode_handle(parent);
> + fwspec.param_count = 3;
> + fwspec.param[0] = GIC_SPI;
> + fwspec.param[1] = spi;
> + fwspec.param[2] = IRQ_TYPE_LEVEL_HIGH;
> +
> + virq = irq_create_fwspec_mapping(&fwspec);
> + of_node_put(parent);
> + return (virq > 0) ? virq : -EINVAL;
> +}
> +
> +static irqreturn_t ntb_edma_isr(int irq, void *data)
> +{
> + struct ntb_edma_interrupt *v = data;
> + u32 mask = BIT(EDMA_RD_CH_NUM);
> + u32 i, val;
> +
> + /*
> + * We do not ack interrupts here but instead we mask all local interrupt
> + * sources except the read channel used for notification. This reduces
> + * needless ISR invocations.
> + *
> + * In theory we could configure LIE=1/RIE=0 only for the notification
> + * transfer (keeping all other channels at LIE=1/RIE=1), but that would
> + * require intrusive changes to the dw-edma core.
> + *
> + * Note: The host side may have already cleared the read interrupt used
> + * for notification, so reading DMA_READ_INT_CLEAR_OFF is not a reliable
> + * way to detect it. As a result, we cannot reliably tell which specific
> + * channel triggered this interrupt. intr_ep_virt->db[i] tells us
> + * instead.
> + */
> + iowrite32(~0x0, v->base + DMA_WRITE_INT_MASK_OFF);
> + iowrite32(~mask, v->base + DMA_READ_INT_MASK_OFF);
> +
> + if (!v->cb || !edma_ctx.intr_ep_virt)
> + return IRQ_HANDLED;
> +
> + for (i = 0; i < edma_ctx.notify_qp_max; i++) {
> + val = READ_ONCE(edma_ctx.intr_ep_virt->db[i]);
> + if (!val)
> + continue;
> +
> + WRITE_ONCE(edma_ctx.intr_ep_virt->db[i], 0);
> + v->cb(v->data, i);
> + }
> +
> + return IRQ_HANDLED;
> +}
> +
> +int ntb_edma_setup_isr(struct device *dev, struct device *epc_dev,
> + ntb_edma_interrupt_cb_t cb, void *data)
> +{
> + struct ntb_edma_interrupt *v = &ntb_edma_intr;
> + int virq = ntb_edma_map_spi_to_virq(epc_dev->parent, edma_spi);
> + int ret;
> +
> + if (virq < 0) {
> + dev_err(dev, "failed to get virq (%d)\n", virq);
> + return virq;
> + }
> +
> + v->virq = virq;
> + v->cb = cb;
> + v->data = data;
> + if (edma_regs_phys && !v->base)
> + v->base = devm_ioremap(dev, edma_regs_phys, edma_regs_size);
> + if (!v->base) {
> + dev_err(dev, "failed to setup v->base\n");
> + return -1;
> + }
> + ret = devm_request_irq(dev, v->virq, ntb_edma_isr, 0, "ntb-edma", v);
> + if (ret)
> + return ret;
> +
> + if (v->base) {
> + iowrite32(0x0, v->base + DMA_WRITE_INT_MASK_OFF);
> + iowrite32(0x0, v->base + DMA_READ_INT_MASK_OFF);
> + }
> + return 0;
> +}
> +
> +void ntb_edma_teardown_isr(struct device *dev)
> +{
> + struct ntb_edma_interrupt *v = &ntb_edma_intr;
> +
> + /* Mask all write/read interrupts so we don't get called again. */
> + if (v->base) {
> + iowrite32(~0x0, v->base + DMA_WRITE_INT_MASK_OFF);
> + iowrite32(~0x0, v->base + DMA_READ_INT_MASK_OFF);
> + }
> +
> + if (v->virq > 0)
> + devm_free_irq(dev, v->virq, v);
> +
> + if (v->base)
> + devm_iounmap(dev, v->base);
> +
> + v->virq = 0;
> + v->cb = NULL;
> + v->data = NULL;
> +}
> +
> +int ntb_edma_setup_mws(struct ntb_dev *ndev)
> +{
> + const size_t info_bytes = PAGE_SIZE;
> + resource_size_t size_max, offset;
> + dma_addr_t intr_phys, info_phys;
> + u32 wr_done = 0, rd_done = 0;
> + struct ntb_edma_intr *intr;
> + struct ntb_edma_info *info;
> + int peer_mw, mw_index, rc;
> + struct iommu_domain *dom;
> + bool reg_mapped = false;
> + size_t ll_bytes, size;
> + struct pci_epc *epc;
> + struct device *dev;
> + unsigned long iova;
> + phys_addr_t phys;
> + u64 need;
> + u32 i;
> +
> + /* +1 is for interruption */
> + ll_bytes = (EDMA_WR_CH_NUM + EDMA_RD_CH_NUM + 1) * DMA_LLP_MEM_SIZE;
> + need = EDMA_REG_SIZE + info_bytes + ll_bytes;
> +
> + epc = ntb_get_pci_epc(ndev);
> + if (!epc)
> + return -ENODEV;
> + dev = epc->dev.parent;
> +
> + if (edma_ctx.initialized)
> + return 0;
> +
> + info = dma_alloc_coherent(dev, info_bytes, &info_phys, GFP_KERNEL);
> + if (!info)
> + return -ENOMEM;
> +
> + memset(info, 0, info_bytes);
> + info->magic = NTB_EDMA_INFO_MAGIC;
> + info->wr_cnt = EDMA_WR_CH_NUM;
> + info->rd_cnt = EDMA_RD_CH_NUM + 1; /* +1 for interruption */
> + info->regs_phys = edma_regs_phys;
> + info->ll_stride = DMA_LLP_MEM_SIZE;
> +
> + for (i = 0; i < EDMA_WR_CH_NUM; i++) {
> + edma_ctx.ll_wr_virt[i] = dma_alloc_attrs(dev, DMA_LLP_MEM_SIZE,
> + &edma_ctx.ll_wr_phys[i],
> + GFP_KERNEL,
> + DMA_ATTR_FORCE_CONTIGUOUS);
> + if (!edma_ctx.ll_wr_virt[i]) {
> + rc = -ENOMEM;
> + goto err_free_ll;
> + }
> + wr_done++;
> + info->ll_wr_phys[i] = edma_ctx.ll_wr_phys[i];
> + }
> + for (i = 0; i < EDMA_RD_CH_NUM + 1; i++) {
> + edma_ctx.ll_rd_virt[i] = dma_alloc_attrs(dev, DMA_LLP_MEM_SIZE,
> + &edma_ctx.ll_rd_phys[i],
> + GFP_KERNEL,
> + DMA_ATTR_FORCE_CONTIGUOUS);
> + if (!edma_ctx.ll_rd_virt[i]) {
> + rc = -ENOMEM;
> + goto err_free_ll;
> + }
> + rd_done++;
> + info->ll_rd_phys[i] = edma_ctx.ll_rd_phys[i];
> + }
> +
> + /* For interruption */
> + edma_ctx.notify_qp_max = NTB_EDMA_NOTIFY_MAX_QP;
> + intr = dma_alloc_coherent(dev, sizeof(*intr), &intr_phys, GFP_KERNEL);
> + if (!intr) {
> + rc = -ENOMEM;
> + goto err_free_ll;
> + }
> + memset(intr, 0, sizeof(*intr));
> + edma_ctx.intr_ep_virt = intr;
> + edma_ctx.intr_ep_phys = intr_phys;
> + info->intr_dar_base = intr_phys;
> +
> + peer_mw = ntb_peer_mw_count(ndev);
> + if (peer_mw <= 0) {
> + rc = -ENODEV;
> + goto err_free_ll;
> + }
> +
> + mw_index = peer_mw - 1; /* last MW */
> +
> + rc = ntb_mw_get_align(ndev, 0, mw_index, 0, NULL, &size_max,
> + &offset);
> + if (rc)
> + goto err_free_ll;
> +
> + if (size_max < need) {
> + rc = -ENOSPC;
> + goto err_free_ll;
> + }
> +
> + /* Map register space (direct) */
> + dom = iommu_get_domain_for_dev(dev);
> + if (dom) {
> + phys = edma_regs_phys & PAGE_MASK;
> + size = PAGE_ALIGN(EDMA_REG_SIZE + edma_regs_phys - phys);
> + iova = phys;
> +
> + rc = iommu_map(dom, iova, phys, EDMA_REG_SIZE,
> + IOMMU_READ | IOMMU_WRITE | IOMMU_MMIO, GFP_KERNEL);
> + if (rc)
> + dev_err(&ndev->dev, "failed to create direct mapping for eDMA reg space\n");
> + reg_mapped = true;
> + }
> +
> + rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_regs_phys, EDMA_REG_SIZE, offset);
> + if (rc)
> + goto err_unmap_reg;
> +
> + offset += EDMA_REG_SIZE;
> +
> + /* Map ntb_edma_info */
> + rc = ntb_mw_set_trans(ndev, 0, mw_index, info_phys, info_bytes, offset);
> + if (rc)
> + goto err_clear_trans;
> + offset += info_bytes;
> +
> + /* Map LL location */
> + for (i = 0; i < EDMA_WR_CH_NUM; i++) {
> + rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_ctx.ll_wr_phys[i],
> + DMA_LLP_MEM_SIZE, offset);
> + if (rc)
> + goto err_clear_trans;
> + offset += DMA_LLP_MEM_SIZE;
> + }
> + for (i = 0; i < EDMA_RD_CH_NUM + 1; i++) {
> + rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_ctx.ll_rd_phys[i],
> + DMA_LLP_MEM_SIZE, offset);
> + if (rc)
> + goto err_clear_trans;
> + offset += DMA_LLP_MEM_SIZE;
> + }
> + edma_ctx.initialized = true;
> +
> + return 0;
> +
> +err_clear_trans:
> + /*
> + * Tear down the NTB translation window used for the eDMA MW.
> + * There is no sub-range clear API for ntb_mw_set_trans(), so we
> + * unconditionally drop the whole mapping on error.
> + */
> + ntb_mw_clear_trans(ndev, 0, mw_index);
> +
> +err_unmap_reg:
> + if (reg_mapped)
> + iommu_unmap(dom, iova, size);
> +err_free_ll:
> + while (rd_done--)
> + dma_free_attrs(dev, DMA_LLP_MEM_SIZE,
> + edma_ctx.ll_rd_virt[rd_done],
> + edma_ctx.ll_rd_phys[rd_done],
> + DMA_ATTR_FORCE_CONTIGUOUS);
> + while (wr_done--)
> + dma_free_attrs(dev, DMA_LLP_MEM_SIZE,
> + edma_ctx.ll_wr_virt[wr_done],
> + edma_ctx.ll_wr_phys[wr_done],
> + DMA_ATTR_FORCE_CONTIGUOUS);
> + if (edma_ctx.intr_ep_virt)
> + dma_free_coherent(dev, sizeof(struct ntb_edma_intr),
> + edma_ctx.intr_ep_virt,
> + edma_ctx.intr_ep_phys);
> + dma_free_coherent(dev, info_bytes, info, info_phys);
> + return rc;
> +}
> +
> +static int ntb_edma_irq_vector(struct device *dev, unsigned int nr)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + int ret, nvec;
> +
> + nvec = pci_msi_vec_count(pdev);
> + for (; nr < nvec; nr++) {
> + ret = pci_irq_vector(pdev, nr);
> + if (!irq_has_action(ret))
> + return ret;
> + }
> + return 0;
> +}
> +
> +static const struct dw_edma_plat_ops ntb_edma_ops = {
> + .irq_vector = ntb_edma_irq_vector,
> +};
> +
> +int ntb_edma_setup_peer(struct ntb_dev *ndev)
> +{
> + struct ntb_edma_info *info;
> + unsigned int wr_cnt, rd_cnt;
> + struct dw_edma_chip *chip;
> + void __iomem *edma_virt;
> + phys_addr_t edma_phys;
> + resource_size_t mw_size;
> + u64 off = EDMA_REG_SIZE;
> + int peer_mw, mw_index;
> + unsigned int i;
> + int ret;
> +
> + peer_mw = ntb_peer_mw_count(ndev);
> + if (peer_mw <= 0)
> + return -ENODEV;
> +
> + mw_index = peer_mw - 1; /* last MW */
> +
> + ret = ntb_peer_mw_get_addr(ndev, mw_index, &edma_phys,
> + &mw_size);
> + if (ret)
> + return -1;
> +
> + edma_virt = ioremap(edma_phys, mw_size);
> +
> + chip = devm_kzalloc(&ndev->dev, sizeof(*chip), GFP_KERNEL);
> + if (!chip) {
> + ret = -ENOMEM;
> + return ret;
> + }
> +
> + chip->dev = &ndev->pdev->dev;
> + chip->nr_irqs = 4;
> + chip->ops = &ntb_edma_ops;
> + chip->flags = 0;
> + chip->reg_base = edma_virt;
> + chip->mf = EDMA_MF_EDMA_UNROLL;
> +
> + info = edma_virt + off;
> + if (info->magic != NTB_EDMA_INFO_MAGIC)
> + return -EINVAL;
> + wr_cnt = info->wr_cnt;
> + rd_cnt = info->rd_cnt;
> + chip->ll_wr_cnt = wr_cnt;
> + chip->ll_rd_cnt = rd_cnt;
> + off += PAGE_SIZE;
> +
> + edma_ctx.notify_qp_max = NTB_EDMA_NOTIFY_MAX_QP;
> + edma_ctx.intr_ep_phys = info->intr_dar_base;
> + if (edma_ctx.intr_ep_phys) {
> + edma_ctx.intr_rc_virt =
> + dma_alloc_coherent(&ndev->pdev->dev,
> + sizeof(struct ntb_edma_intr),
> + &edma_ctx.intr_rc_phys,
> + GFP_KERNEL);
> + if (!edma_ctx.intr_rc_virt)
> + return -ENOMEM;
> + memset(edma_ctx.intr_rc_virt, 0,
> + sizeof(struct ntb_edma_intr));
> + }
> +
> + for (i = 0; i < wr_cnt; i++) {
> + chip->ll_region_wr[i].vaddr.io = edma_virt + off;
> + chip->ll_region_wr[i].paddr = info->ll_wr_phys[i];
> + chip->ll_region_wr[i].sz = DMA_LLP_MEM_SIZE;
> + off += DMA_LLP_MEM_SIZE;
> + }
> + for (i = 0; i < rd_cnt; i++) {
> + chip->ll_region_rd[i].vaddr.io = edma_virt + off;
> + chip->ll_region_rd[i].paddr = info->ll_rd_phys[i];
> + chip->ll_region_rd[i].sz = DMA_LLP_MEM_SIZE;
> + off += DMA_LLP_MEM_SIZE;
> + }
> +
> + if (!pci_dev_msi_enabled(ndev->pdev))
> + return -ENXIO;
> +
> + ret = dw_edma_probe(chip);
> + if (ret) {
> + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +struct ntb_edma_filter {
> + struct device *dma_dev;
> + u32 direction;
> +};
> +
> +static bool ntb_edma_filter_fn(struct dma_chan *chan, void *arg)
> +{
> + struct ntb_edma_filter *filter = arg;
> + u32 dir = filter->direction;
> + struct dma_slave_caps caps;
> + int ret;
> +
> + if (chan->device->dev != filter->dma_dev)
> + return false;
> +
> + ret = dma_get_slave_caps(chan, &caps);
> + if (ret < 0)
> + return false;
> +
> + return !!(caps.directions & dir);
> +}
> +
> +void ntb_edma_teardown_chans(struct ntb_edma_chans *edma)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < edma->num_wr_chan; i++)
> + dma_release_channel(edma->wr_chan[i]);
> +
> + for (i = 0; i < edma->num_rd_chan; i++)
> + dma_release_channel(edma->rd_chan[i]);
> +
> + if (edma->intr_chan)
> + dma_release_channel(edma->intr_chan);
> +}
> +
> +int ntb_edma_setup_chans(struct device *dma_dev, struct ntb_edma_chans *edma)
> +{
> + struct ntb_edma_filter filter;
> + dma_cap_mask_t dma_mask;
> + unsigned int i;
> +
> + dma_cap_zero(dma_mask);
> + dma_cap_set(DMA_SLAVE, dma_mask);
> +
> + memset(edma, 0, sizeof(*edma));
> + edma->dev = dma_dev;
> +
> + filter.dma_dev = dma_dev;
> + filter.direction = BIT(DMA_DEV_TO_MEM);
> + for (i = 0; i < EDMA_WR_CH_NUM; i++) {
> + edma->wr_chan[i] = dma_request_channel(dma_mask,
> + ntb_edma_filter_fn,
> + &filter);
> + if (!edma->wr_chan[i])
> + break;
> + edma->num_wr_chan++;
> + }
> +
> + filter.direction = BIT(DMA_MEM_TO_DEV);
> + for (i = 0; i < EDMA_RD_CH_NUM; i++) {
> + edma->rd_chan[i] = dma_request_channel(dma_mask,
> + ntb_edma_filter_fn,
> + &filter);
> + if (!edma->rd_chan[i])
> + break;
> + edma->num_rd_chan++;
> + }
> +
> + edma->intr_chan = dma_request_channel(dma_mask, ntb_edma_filter_fn,
> + &filter);
> + if (!edma->intr_chan)
> + dev_warn(dma_dev,
> + "Remote eDMA notify channel could not be allocated\n");
> +
> + if (!edma->num_wr_chan || !edma->num_rd_chan) {
> + dev_warn(dma_dev, "Remote eDMA channels failed to initialize\n");
> + ntb_edma_teardown_chans(edma);
> + return -ENODEV;
> + }
> + return 0;
> +}
> +
> +struct dma_chan *ntb_edma_pick_chan(struct ntb_edma_chans *edma,
> + remote_edma_dir_t dir)
> +{
> + unsigned int n, cur, idx;
> + struct dma_chan **chans;
> + atomic_t *cur_chan;
> +
> + if (dir == REMOTE_EDMA_WRITE) {
> + n = edma->num_wr_chan;
> + chans = edma->wr_chan;
> + cur_chan = &edma->cur_wr_chan;
> + } else {
> + n = edma->num_rd_chan;
> + chans = edma->rd_chan;
> + cur_chan = &edma->cur_rd_chan;
> + }
> + if (WARN_ON_ONCE(!n))
> + return NULL;
> +
> + /* Simple round-robin */
> + cur = (unsigned int)atomic_inc_return(cur_chan) - 1;
> + idx = cur % n;
> + return chans[idx];
> +}
> +
> +int ntb_edma_notify_peer(struct ntb_edma_chans *edma, int qp_num)
> +{
> + struct dma_async_tx_descriptor *txd;
> + struct dma_slave_config cfg;
> + struct scatterlist sgl;
> + dma_cookie_t cookie;
> + struct device *dev;
> +
> + if (!edma || !edma->intr_chan)
> + return -ENXIO;
> +
> + if (qp_num < 0 || qp_num >= edma_ctx.notify_qp_max)
> + return -EINVAL;
> +
> + if (!edma_ctx.intr_rc_virt || !edma_ctx.intr_ep_phys)
> + return -EINVAL;
> +
> + dev = edma->dev;
> + if (!dev)
> + return -ENODEV;
> +
> + WRITE_ONCE(edma_ctx.intr_rc_virt->db[qp_num], 1);
> +
> + /* Ensure store is visible before kicking the DMA transfer */
> + wmb();
> +
> + sg_init_table(&sgl, 1);
> + sg_dma_address(&sgl) = edma_ctx.intr_rc_phys + qp_num * sizeof(u32);
> + sg_dma_len(&sgl) = sizeof(u32);
> +
> + memset(&cfg, 0, sizeof(cfg));
> + cfg.dst_addr = edma_ctx.intr_ep_phys + qp_num * sizeof(u32);
> + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> + cfg.direction = DMA_MEM_TO_DEV;
> +
> + if (dmaengine_slave_config(edma->intr_chan, &cfg))
> + return -EINVAL;
> +
> + txd = dmaengine_prep_slave_sg(edma->intr_chan, &sgl, 1,
> + DMA_MEM_TO_DEV,
> + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> + if (!txd)
> + return -ENOSPC;
> +
> + cookie = dmaengine_submit(txd);
> + if (dma_submit_error(cookie))
> + return -ENOSPC;
> +
> + dma_async_issue_pending(edma->intr_chan);
> + return 0;
> +}
> diff --git a/drivers/ntb/ntb_edma.h b/drivers/ntb/ntb_edma.h
> new file mode 100644
> index 000000000000..da0451827edb
> --- /dev/null
> +++ b/drivers/ntb/ntb_edma.h
> @@ -0,0 +1,128 @@
> +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> +#ifndef _NTB_EDMA_H_
> +#define _NTB_EDMA_H_
> +
> +#include <linux/completion.h>
> +#include <linux/device.h>
> +#include <linux/interrupt.h>
> +
> +#define EDMA_REG_SIZE SZ_64K
> +#define DMA_LLP_MEM_SIZE SZ_4K
> +#define EDMA_WR_CH_NUM 4
> +#define EDMA_RD_CH_NUM 4
> +#define NTB_EDMA_MAX_CH 8
> +
> +#define NTB_EDMA_INFO_MAGIC 0x45444D41 /* "EDMA" */
> +#define NTB_EDMA_INFO_OFF EDMA_REG_SIZE
> +
> +#define NTB_EDMA_RING_ORDER 7
> +#define NTB_EDMA_RING_ENTRIES (1U << NTB_EDMA_RING_ORDER)
> +#define NTB_EDMA_RING_MASK (NTB_EDMA_RING_ENTRIES - 1)
> +
> +typedef void (*ntb_edma_interrupt_cb_t)(void *data, int qp_num);
> +
> +/*
> + * REMOTE_EDMA_EP:
> + * Endpoint owns the eDMA engine and pushes descriptors into a shared MW.
> + *
> + * REMOTE_EDMA_RC:
> + * Root Complex controls the endpoint eDMA through the shared MW and
> + * drives reads/writes on behalf of the host.
> + */
> +typedef enum {
> + REMOTE_EDMA_UNKNOWN,
> + REMOTE_EDMA_EP,
> + REMOTE_EDMA_RC,
> +} remote_edma_mode_t;
> +
> +typedef enum {
> + REMOTE_EDMA_WRITE,
> + REMOTE_EDMA_READ,
> +} remote_edma_dir_t;
> +
> +/*
> + * Layout of remote eDMA MW (EP local address space, RC sees via peer MW):
> + *
> + * 0 .. EDMA_REG_SIZE-1 : DesignWare eDMA registers
> + * EDMA_REG_SIZE .. +PAGE_SIZE : struct ntb_edma_info (EP writes, RC reads)
> + * +PAGE_SIZE .. : LL ring buffers (EP allocates phys addresses,
> + * RC configures via dw_edma)
> + *
> + * ntb_edma_setup_mws() on EP:
> + * - allocates ntb_edma_info and LLs in EP memory
> + * - programs inbound iATU so that RC peer MW[n] points at this block
> + *
> + * ntb_edma_setup_peer() on RC:
> + * - ioremaps peer MW[n]
> + * - reads ntb_edma_info
> + * - sets up dw_edma_chip ll_region_* from that info
> + */
> +struct ntb_edma_info {
> + u32 magic;
> + u16 wr_cnt;
> + u16 rd_cnt;
> + u64 regs_phys;
> + u32 ll_stride;
> + u32 rsvd;
> + u64 ll_wr_phys[NTB_EDMA_MAX_CH];
> + u64 ll_rd_phys[NTB_EDMA_MAX_CH];
> +
> + u64 intr_dar_base;
> +} __packed;
> +
> +struct ll_dma_addrs {
> + dma_addr_t wr[EDMA_WR_CH_NUM];
> + dma_addr_t rd[EDMA_RD_CH_NUM];
> +};
> +
> +struct ntb_edma_chans {
> + struct device *dev;
> +
> + struct dma_chan *wr_chan[EDMA_WR_CH_NUM];
> + struct dma_chan *rd_chan[EDMA_RD_CH_NUM];
> + struct dma_chan *intr_chan;
> +
> + unsigned int num_wr_chan;
> + unsigned int num_rd_chan;
> + atomic_t cur_wr_chan;
> + atomic_t cur_rd_chan;
> +};
> +
> +static __always_inline u32 ntb_edma_ring_idx(u32 v)
> +{
> + return v & NTB_EDMA_RING_MASK;
> +}
> +
> +static __always_inline u32 ntb_edma_ring_used_entry(u32 head, u32 tail)
> +{
> + if (head >= tail) {
> + WARN_ON_ONCE((head - tail) > (NTB_EDMA_RING_ENTRIES - 1));
> + return head - tail;
> + }
> +
> + WARN_ON_ONCE((U32_MAX - tail + head + 1) > (NTB_EDMA_RING_ENTRIES - 1));
> + return U32_MAX - tail + head + 1;
> +}
> +
> +static __always_inline u32 ntb_edma_ring_free_entry(u32 head, u32 tail)
> +{
> + return NTB_EDMA_RING_ENTRIES - ntb_edma_ring_used_entry(head, tail) - 1;
> +}
> +
> +static __always_inline bool ntb_edma_ring_full(u32 head, u32 tail)
> +{
> + return ntb_edma_ring_free_entry(head, tail) == 0;
> +}
> +
> +int ntb_edma_setup_isr(struct device *dev, struct device *epc_dev,
> + ntb_edma_interrupt_cb_t cb, void *data);
> +void ntb_edma_teardown_isr(struct device *dev);
> +int ntb_edma_setup_mws(struct ntb_dev *ndev);
> +int ntb_edma_setup_peer(struct ntb_dev *ndev);
> +int ntb_edma_setup_chans(struct device *dma_dev, struct ntb_edma_chans *edma);
> +struct dma_chan *ntb_edma_pick_chan(struct ntb_edma_chans *edma,
> + remote_edma_dir_t dir);
> +void ntb_edma_teardown_chans(struct ntb_edma_chans *edma);
> +int ntb_edma_notify_peer(struct ntb_edma_chans *edma, int qp_num);
> +
> +#endif
> diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport_core.c
> similarity index 65%
> rename from drivers/ntb/ntb_transport.c
> rename to drivers/ntb/ntb_transport_core.c
> index 907db6c93d4d..48d48921978d 100644
> --- a/drivers/ntb/ntb_transport.c
> +++ b/drivers/ntb/ntb_transport_core.c
> @@ -47,6 +47,9 @@
> * Contact Information:
> * Jon Mason <jon.mason@intel.com>
> */
> +#include <linux/atomic.h>
> +#include <linux/bug.h>
> +#include <linux/compiler.h>
> #include <linux/debugfs.h>
> #include <linux/delay.h>
> #include <linux/dmaengine.h>
> @@ -71,6 +74,8 @@
> #define NTB_TRANSPORT_DESC "Software Queue-Pair Transport over NTB"
> #define NTB_TRANSPORT_MIN_SPADS (MW0_SZ_HIGH + 2)
>
> +#define NTB_EDMA_MAX_POLL 32
> +
> MODULE_DESCRIPTION(NTB_TRANSPORT_DESC);
> MODULE_VERSION(NTB_TRANSPORT_VER);
> MODULE_LICENSE("Dual BSD/GPL");
> @@ -102,6 +107,13 @@ module_param(use_msi, bool, 0644);
> MODULE_PARM_DESC(use_msi, "Use MSI interrupts instead of doorbells");
> #endif
>
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
This comment applies throughout the patch. Doing ifdefs inside a C source file is pretty frowned upon in the kernel. The preferred way is to only have ifdefs in the header files. So please give this a bit more consideration and see if it can be done differently to address this.
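The usual way to handle this is to keep the ifdef in the header and provide static inline no-op stubs for the disabled case, e.g. (untested sketch, just illustrating the pattern with one of your functions):

/* ntb_edma.h */
#ifdef CONFIG_NTB_TRANSPORT_EDMA
int ntb_edma_setup_peer(struct ntb_dev *ndev);
#else
static inline int ntb_edma_setup_peer(struct ntb_dev *ndev)
{
	return 0;
}
#endif

Then the C file can call ntb_edma_setup_peer() unconditionally and the compiler drops the dead code when the option is off.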
> +#include "ntb_edma.h"
> +static bool use_remote_edma;
> +module_param(use_remote_edma, bool, 0644);
> +MODULE_PARM_DESC(use_remote_edma, "Use remote eDMA mode (when enabled, use_msi is ignored)");
> +#endif
> +
> static struct dentry *nt_debugfs_dir;
>
> /* Only two-ports NTB devices are supported */
> @@ -125,6 +137,14 @@ struct ntb_queue_entry {
> struct ntb_payload_header __iomem *tx_hdr;
> struct ntb_payload_header *rx_hdr;
> };
> +
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + dma_addr_t addr;
> +
> + /* Used by RC side only */
> + struct scatterlist sgl;
> + struct work_struct dma_work;
> +#endif
> };
>
> struct ntb_rx_info {
> @@ -202,6 +222,33 @@ struct ntb_transport_qp {
> int msi_irq;
> struct ntb_msi_desc msi_desc;
> struct ntb_msi_desc peer_msi_desc;
> +
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + /*
> + * For ensuring peer notification in non-atomic context.
> + * ntb_peer_db_set might sleep or schedule.
> + */
> + struct work_struct db_work;
> +
> + /*
> + * wr: remote eDMA write transfer (EP -> RC direction)
> + * rd: remote eDMA read transfer (RC -> EP direction)
> + */
> + u32 wr_cons;
> + u32 rd_cons;
> + u32 wr_prod;
> + u32 rd_prod;
> + u32 wr_issue;
> + u32 rd_issue;
> +
> + spinlock_t ep_tx_lock;
> + spinlock_t ep_rx_lock;
> + spinlock_t rc_lock;
> +
> + /* Completion work for read/write transfers. */
> + struct work_struct read_work;
> + struct work_struct write_work;
> +#endif
For something like this, maybe it needs its own struct instead of an ifdef chunk. Perhaps 'ntb_rx_info' can serve as the core data struct, with eDMA having its own 'ntb_rx_info_edma' that embeds 'ntb_rx_info'.
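i.e. something along these lines (illustrative only, the exact field selection is yours):

struct ntb_rx_info_edma {
	struct ntb_rx_info info;	/* core struct stays as-is */
	u32 wr_cons;
	u32 rd_cons;
	u32 wr_prod;
	u32 rd_prod;
};

static inline struct ntb_rx_info_edma *
to_edma_rx_info(struct ntb_rx_info *info)
{
	return container_of(info, struct ntb_rx_info_edma, info);
}

The eDMA backend then gets from the core struct to its private state with container_of() instead of sprinkling ifdefs into the shared definitions.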
DJ
> };
>
> struct ntb_transport_mw {
> @@ -249,6 +296,13 @@ struct ntb_transport_ctx {
>
> /* Make sure workq of link event be executed serially */
> struct mutex link_event_lock;
> +
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + remote_edma_mode_t remote_edma_mode;
> + struct device *dma_dev;
> + struct workqueue_struct *wq;
> + struct ntb_edma_chans edma;
> +#endif
> };
>
> enum {
> @@ -262,6 +316,19 @@ struct ntb_payload_header {
> unsigned int flags;
> };
>
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> +static void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt);
> +static int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
> + unsigned int *mw_count);
> +static void ntb_transport_edma_init_queue(struct ntb_transport_ctx *nt,
> + unsigned int qp_num);
> +static void ntb_transport_edma_create_queue(struct ntb_transport_ctx *nt,
> + struct ntb_transport_qp *qp);
> +static int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt);
> +static int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt);
> +static void ntb_transport_edma_rc_dma_work(struct work_struct *work);
> +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> +
> /*
> * Return the device that should be used for DMA mapping.
> *
> @@ -298,7 +365,7 @@ enum {
> container_of((__drv), struct ntb_transport_client, driver)
>
> #define QP_TO_MW(nt, qp) ((qp) % nt->mw_count)
> -#define NTB_QP_DEF_NUM_ENTRIES 100
> +#define NTB_QP_DEF_NUM_ENTRIES 128
> #define NTB_LINK_DOWN_TIMEOUT 10
>
> static void ntb_transport_rxc_db(unsigned long data);
> @@ -1015,6 +1082,10 @@ static void ntb_transport_link_cleanup(struct ntb_transport_ctx *nt)
> count = ntb_spad_count(nt->ndev);
> for (i = 0; i < count; i++)
> ntb_spad_write(nt->ndev, i, 0);
> +
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + ntb_edma_teardown_chans(&nt->edma);
> +#endif
> }
>
> static void ntb_transport_link_cleanup_work(struct work_struct *work)
> @@ -1051,6 +1122,14 @@ static void ntb_transport_link_work(struct work_struct *work)
>
> /* send the local info, in the opposite order of the way we read it */
>
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + rc = ntb_transport_edma_ep_init(nt);
> + if (rc) {
> + dev_err(&pdev->dev, "Failed to init EP: %d\n", rc);
> + return;
> + }
> +#endif
> +
> if (nt->use_msi) {
> rc = ntb_msi_setup_mws(ndev);
> if (rc) {
> @@ -1132,6 +1211,14 @@ static void ntb_transport_link_work(struct work_struct *work)
>
> nt->link_is_up = true;
>
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + rc = ntb_transport_edma_rc_init(nt);
> + if (rc) {
> + dev_err(&pdev->dev, "Failed to init RC: %d\n", rc);
> + goto out1;
> + }
> +#endif
> +
> for (i = 0; i < nt->qp_count; i++) {
> struct ntb_transport_qp *qp = &nt->qp_vec[i];
>
> @@ -1277,6 +1364,8 @@ static const struct ntb_transport_backend_ops default_backend_ops = {
> .debugfs_stats_show = ntb_transport_default_debugfs_stats_show,
> };
>
> +static const struct ntb_transport_backend_ops edma_backend_ops;
> +
> static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> {
> struct ntb_transport_ctx *nt;
> @@ -1311,7 +1400,23 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
>
> nt->ndev = ndev;
>
> - nt->backend_ops = default_backend_ops;
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + if (use_remote_edma) {
> + rc = ntb_transport_edma_init(nt, &mw_count);
> + if (rc) {
> + nt->mw_count = 0;
> + goto err;
> + }
> + nt->backend_ops = edma_backend_ops;
> +
> + /*
> + * In remote eDMA mode, we reserve a read channel for Host->EP
> + * interruption.
> + */
> + use_msi = false;
> + } else
> +#endif
> + nt->backend_ops = default_backend_ops;
>
> /*
> * If we are using MSI, and have at least one extra memory window,
> @@ -1402,6 +1507,10 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> rc = ntb_transport_init_queue(nt, i);
> if (rc)
> goto err2;
> +
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + ntb_transport_edma_init_queue(nt, i);
> +#endif
> }
>
> INIT_DELAYED_WORK(&nt->link_work, ntb_transport_link_work);
> @@ -1433,6 +1542,9 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> }
> kfree(nt->mw_vec);
> err:
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + ntb_transport_edma_uninit(nt);
> +#endif
> kfree(nt);
> return rc;
> }
> @@ -2055,11 +2167,16 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
>
> nt->qp_bitmap_free &= ~qp_bit;
>
> + qp->qp_bit = qp_bit;
> qp->cb_data = data;
> qp->rx_handler = handlers->rx_handler;
> qp->tx_handler = handlers->tx_handler;
> qp->event_handler = handlers->event_handler;
>
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + ntb_transport_edma_create_queue(nt, qp);
> +#endif
> +
> dma_cap_zero(dma_mask);
> dma_cap_set(DMA_MEMCPY, dma_mask);
>
> @@ -2105,6 +2222,9 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
> goto err1;
>
> entry->qp = qp;
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + INIT_WORK(&entry->dma_work, ntb_transport_edma_rc_dma_work);
> +#endif
> ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> &qp->rx_free_q);
> }
> @@ -2156,8 +2276,8 @@ EXPORT_SYMBOL_GPL(ntb_transport_create_queue);
> */
> void ntb_transport_free_queue(struct ntb_transport_qp *qp)
> {
> - struct pci_dev *pdev;
> struct ntb_queue_entry *entry;
> + struct pci_dev *pdev;
> u64 qp_bit;
>
> if (!qp)
> @@ -2208,6 +2328,10 @@ void ntb_transport_free_queue(struct ntb_transport_qp *qp)
> tasklet_kill(&qp->rxc_db_work);
>
> cancel_delayed_work_sync(&qp->link_work);
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> + cancel_work_sync(&qp->read_work);
> + cancel_work_sync(&qp->write_work);
> +#endif
>
> qp->cb_data = NULL;
> qp->rx_handler = NULL;
> @@ -2346,6 +2470,1157 @@ int ntb_transport_tx_enqueue(struct ntb_transport_qp *qp, void *cb, void *data,
> }
> EXPORT_SYMBOL_GPL(ntb_transport_tx_enqueue);
>
> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> +/*
> + * Remote eDMA mode implementation
> + */
> +struct ntb_edma_desc {
> + u32 len;
> + u32 flags;
> + u64 addr; /* DMA address */
> + u64 data;
> +};
> +
> +struct ntb_edma_ring {
> + struct ntb_edma_desc desc[NTB_EDMA_RING_ENTRIES];
> + u32 head;
> + u32 tail;
> +};
> +
> +#define NTB_EDMA_DESC_OFF(i) ((size_t)(i) * sizeof(struct ntb_edma_desc))
> +
> +#define __NTB_EDMA_CHECK_INDEX(_i) \
> +({ \
> + unsigned long __i = (unsigned long)(_i); \
> + WARN_ONCE(__i >= (unsigned long)NTB_EDMA_RING_ENTRIES, \
> + "ntb_edma: index i=%lu >= ring_entries=%lu\n", \
> + __i, (unsigned long)NTB_EDMA_RING_ENTRIES); \
> + __i; \
> +})
> +
> +#define NTB_EDMA_DESC_I(qp, i, n) \
> +({ \
> + typeof(qp) __qp = (qp); \
> + unsigned long __i = __NTB_EDMA_CHECK_INDEX(i); \
> + (struct ntb_edma_desc *) \
> + ((char *)(__qp)->rx_buff + \
> + (sizeof(struct ntb_edma_ring) * n) + \
> + NTB_EDMA_DESC_OFF(__i)); \
> +})
> +
> +#define NTB_EDMA_DESC_O(qp, i, n) \
> +({ \
> + typeof(qp) __qp = (qp); \
> + unsigned long __i = __NTB_EDMA_CHECK_INDEX(i); \
> + (struct ntb_edma_desc __iomem *) \
> + ((char __iomem *)(__qp)->tx_mw + \
> + (sizeof(struct ntb_edma_ring) * n) + \
> + NTB_EDMA_DESC_OFF(__i)); \
> +})
> +
> +#define NTB_EDMA_HEAD_I(qp, n) ((u32 *)((char *)qp->rx_buff + \
> + (sizeof(struct ntb_edma_ring) * n) + \
> + offsetof(struct ntb_edma_ring, head)))
> +#define NTB_EDMA_HEAD_O(qp, n) ((u32 *)((char __iomem *)qp->tx_mw + \
> + (sizeof(struct ntb_edma_ring) * n) + \
> + offsetof(struct ntb_edma_ring, head)))
> +#define NTB_EDMA_TAIL_I(qp, n) ((u32 *)((char *)qp->rx_buff + \
> + (sizeof(struct ntb_edma_ring) * n) + \
> + offsetof(struct ntb_edma_ring, tail)))
> +#define NTB_EDMA_TAIL_O(qp, n) ((u32 *)((char __iomem *)qp->tx_mw + \
> + (sizeof(struct ntb_edma_ring) * n) + \
> + offsetof(struct ntb_edma_ring, tail)))
> +
> +/*
> + * Macro naming rule:
> + * NTB_DESC_RD_EP_I (as an example)
> + * ^^ ^^ ^
> + * : : `-- I(n) or O(ut). In = Read, Out = Write.
> + * : `----- Who uses this macro.
> + * `-------- DESC / HEAD / TAIL
> + *
> + * Read transfers (RC->EP):
> + *
> + * EP view (outbound, written via NTB):
> + * - descs: NTB_DESC_RD_EP_O(qp, i) / NTB_DESC_RD_EP_I(qp, i)
> + * [ len ][ flags ][ addr ][ data ]
> + * [ len ][ flags ][ addr ][ data ]
> + * :
> + * [ len ][ flags ][ addr ][ data ]
> + * - head: NTB_HEAD_RD_EP_O(qp)
> + * - tail: NTB_TAIL_RD_EP_I(qp)
> + *
> + * RC view (inbound, local mapping):
> + * - descs: NTB_DESC_RD_RC_I(qp, i) / NTB_DESC_RD_RC_O(qp, i)
> + * [ len ][ flags ][ addr ][ data ]
> + * [ len ][ flags ][ addr ][ data ]
> + * :
> + * [ len ][ flags ][ addr ][ data ]
> + * - head: NTB_HEAD_RD_RC_I(qp)
> + * - tail: NTB_TAIL_RD_RC_O(qp)
> + *
> + * Write transfers (EP -> RC) are analogous but use
> + * NTB_DESC_WR_{EP_O,RC_I}(), NTB_HEAD_WR_{EP_O,RC_I}(),
> + * and NTB_TAIL_WR_{EP_I,RC_O}().
> + */
> +#define NTB_DESC_RD_EP_I(qp, i) NTB_EDMA_DESC_I(qp, i, 0)
> +#define NTB_DESC_RD_EP_O(qp, i) NTB_EDMA_DESC_O(qp, i, 0)
> +#define NTB_DESC_WR_EP_I(qp, i) NTB_EDMA_DESC_I(qp, i, 1)
> +#define NTB_DESC_WR_EP_O(qp, i) NTB_EDMA_DESC_O(qp, i, 1)
> +#define NTB_DESC_RD_RC_I(qp, i) NTB_EDMA_DESC_I(qp, i, 0)
> +#define NTB_DESC_RD_RC_O(qp, i) NTB_EDMA_DESC_O(qp, i, 0)
> +#define NTB_DESC_WR_RC_I(qp, i) NTB_EDMA_DESC_I(qp, i, 1)
> +#define NTB_DESC_WR_RC_O(qp, i) NTB_EDMA_DESC_O(qp, i, 1)
> +
> +#define NTB_HEAD_RD_EP_O(qp) NTB_EDMA_HEAD_O(qp, 0)
> +#define NTB_HEAD_WR_EP_O(qp) NTB_EDMA_HEAD_O(qp, 1)
> +#define NTB_HEAD_RD_RC_I(qp) NTB_EDMA_HEAD_I(qp, 0)
> +#define NTB_HEAD_WR_RC_I(qp) NTB_EDMA_HEAD_I(qp, 1)
> +
> +#define NTB_TAIL_RD_EP_I(qp) NTB_EDMA_TAIL_I(qp, 0)
> +#define NTB_TAIL_WR_EP_I(qp) NTB_EDMA_TAIL_I(qp, 1)
> +#define NTB_TAIL_RD_RC_O(qp) NTB_EDMA_TAIL_O(qp, 0)
> +#define NTB_TAIL_WR_RC_O(qp) NTB_EDMA_TAIL_O(qp, 1)
> +
> +static inline bool ntb_qp_edma_is_rc(struct ntb_transport_qp *qp)
> +{
> + return qp->transport->remote_edma_mode == REMOTE_EDMA_RC;
> +}
> +
> +static inline bool ntb_qp_edma_is_ep(struct ntb_transport_qp *qp)
> +{
> + return qp->transport->remote_edma_mode == REMOTE_EDMA_EP;
> +}
> +
> +static inline bool ntb_qp_edma_enabled(struct ntb_transport_qp *qp)
> +{
> + return ntb_qp_edma_is_rc(qp) || ntb_qp_edma_is_ep(qp);
> +}
> +
> +static unsigned int ntb_transport_edma_tx_free_entry(struct ntb_transport_qp *qp)
> +{
> + unsigned int head, tail;
> +
> + if (ntb_qp_edma_is_ep(qp)) {
> + scoped_guard(spinlock_irqsave, &qp->ep_tx_lock) {
> + /* In this scope, only 'head' might proceed */
> + tail = READ_ONCE(qp->wr_cons);
> + head = READ_ONCE(qp->wr_prod);
> + }
> + return ntb_edma_ring_free_entry(head, tail);
> + }
> +
> + scoped_guard(spinlock_irqsave, &qp->rc_lock) {
> + /* In this scope, only 'head' might proceed */
> + tail = READ_ONCE(qp->rd_issue);
> + head = READ_ONCE(*NTB_HEAD_RD_RC_I(qp));
> + }
> + /*
> + * On RC side, 'used' amount indicates how much EP side
> + * has refilled, which is available for us to use for TX.
> + */
> + return ntb_edma_ring_used_entry(head, tail);
> +}
> +
> +static void ntb_transport_edma_debugfs_stats_show(struct seq_file *s,
> + struct ntb_transport_qp *qp)
> +{
> + seq_printf(s, "rx_bytes - \t%llu\n", qp->rx_bytes);
> + seq_printf(s, "rx_pkts - \t%llu\n", qp->rx_pkts);
> + seq_printf(s, "rx_err_no_buf - %llu\n", qp->rx_err_no_buf);
> + seq_printf(s, "rx_buff - \t0x%p\n", qp->rx_buff);
> + seq_printf(s, "rx_max_entry - \t%u\n", qp->rx_max_entry);
> + seq_printf(s, "rx_alloc_entry - \t%u\n\n", qp->rx_alloc_entry);
> +
> + seq_printf(s, "tx_bytes - \t%llu\n", qp->tx_bytes);
> + seq_printf(s, "tx_pkts - \t%llu\n", qp->tx_pkts);
> + seq_printf(s, "tx_ring_full - \t%llu\n", qp->tx_ring_full);
> + seq_printf(s, "tx_err_no_buf - %llu\n", qp->tx_err_no_buf);
> + seq_printf(s, "tx_mw - \t0x%p\n", qp->tx_mw);
> + seq_printf(s, "tx_max_entry - \t%u\n", qp->tx_max_entry);
> + seq_printf(s, "free tx - \t%u\n", ntb_transport_tx_free_entry(qp));
> + seq_putc(s, '\n');
> +
> + seq_puts(s, "Using Remote eDMA - Yes\n");
> + seq_printf(s, "QP Link - \t%s\n", qp->link_is_up ? "Up" : "Down");
> +}
> +
> +static void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt)
> +{
> + struct ntb_dev *ndev = nt->ndev;
> +
> + if (nt->remote_edma_mode == REMOTE_EDMA_EP && ndev && ndev->pdev)
> + ntb_edma_teardown_isr(&ndev->pdev->dev);
> +
> + if (nt->wq)
> + destroy_workqueue(nt->wq);
> + nt->wq = NULL;
> +}
> +
> +static int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
> + unsigned int *mw_count)
> +{
> + struct ntb_dev *ndev = nt->ndev;
> +
> + /*
> + * We need at least one MW for the transport plus one MW reserved
> + * for the remote eDMA window (see ntb_edma_setup_mws/peer).
> + */
> + if (*mw_count <= 1) {
> + dev_err(&ndev->dev,
> + "remote eDMA requires at least two MWS (have %u)\n",
> + *mw_count);
> + return -ENODEV;
> + }
> +
> + nt->wq = alloc_workqueue("ntb-edma-wq", WQ_UNBOUND | WQ_SYSFS, 0);
> + if (!nt->wq) {
> + ntb_transport_edma_uninit(nt);
> + return -ENOMEM;
> + }
> +
> + /* Reserve the last peer MW exclusively for the eDMA window. */
> + *mw_count -= 1;
> +
> + return 0;
> +}
> +
> +static void ntb_transport_edma_db_work(struct work_struct *work)
> +{
> + struct ntb_transport_qp *qp =
> + container_of(work, struct ntb_transport_qp, db_work);
> +
> + ntb_peer_db_set(qp->ndev, qp->qp_bit);
> +}
> +
> +static void ntb_transport_edma_notify_peer(struct ntb_transport_qp *qp)
> +{
> + if (ntb_qp_edma_is_rc(qp))
> + if (!ntb_edma_notify_peer(&qp->transport->edma, qp->qp_num))
> + return;
> +
> + /*
> + * Called from contexts that may be atomic. Since ntb_peer_db_set()
> + * may sleep, delegate the actual doorbell write to a workqueue.
> + */
> + queue_work(system_highpri_wq, &qp->db_work);
> +}
> +
> +static void ntb_transport_edma_isr(void *data, int qp_num)
> +{
> + struct ntb_transport_ctx *nt = data;
> + struct ntb_transport_qp *qp;
> +
> + if (qp_num < 0 || qp_num >= nt->qp_count)
> + return;
> +
> + qp = &nt->qp_vec[qp_num];
> + if (WARN_ON(!qp))
> + return;
> +
> + queue_work(nt->wq, &qp->read_work);
> + queue_work(nt->wq, &qp->write_work);
> +}
> +
> +static int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt)
> +{
> + struct ntb_dev *ndev = nt->ndev;
> + struct pci_dev *pdev = ndev->pdev;
> + int rc;
> +
> + if (!use_remote_edma || nt->remote_edma_mode != REMOTE_EDMA_UNKNOWN)
> + return 0;
> +
> + rc = ntb_edma_setup_peer(ndev);
> + if (rc) {
> + dev_err(&pdev->dev, "Failed to enable remote eDMA: %d\n", rc);
> + return rc;
> + }
> +
> + rc = ntb_edma_setup_chans(get_dma_dev(ndev), &nt->edma);
> + if (rc) {
> + dev_err(&pdev->dev, "Failed to setup eDMA channels: %d\n", rc);
> + return rc;
> + }
> +
> + nt->remote_edma_mode = REMOTE_EDMA_RC;
> + return 0;
> +}
> +
> +static int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt)
> +{
> + struct ntb_dev *ndev = nt->ndev;
> + struct pci_dev *pdev = ndev->pdev;
> + struct pci_epc *epc;
> + int rc;
> +
> + if (!use_remote_edma || nt->remote_edma_mode == REMOTE_EDMA_EP)
> + return 0;
> +
> + /* Only EP side can return pci_epc */
> + epc = ntb_get_pci_epc(ndev);
> + if (!epc)
> + return 0;
> +
> + rc = ntb_edma_setup_mws(ndev);
> + if (rc) {
> + dev_err(&pdev->dev,
> + "Failed to set up memory window for eDMA: %d\n", rc);
> + return rc;
> + }
> +
> + rc = ntb_edma_setup_isr(&pdev->dev, &epc->dev, ntb_transport_edma_isr, nt);
> + if (rc) {
> + dev_err(&pdev->dev, "Failed to setup eDMA ISR (%d)\n", rc);
> + return rc;
> + }
> +
> + nt->remote_edma_mode = REMOTE_EDMA_EP;
> + return 0;
> +}
> +
> +static int ntb_transport_edma_setup_qp_mw(struct ntb_transport_ctx *nt,
> + unsigned int qp_num)
> +{
> + struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
> + struct ntb_dev *ndev = nt->ndev;
> + struct ntb_queue_entry *entry;
> + struct ntb_transport_mw *mw;
> + unsigned int mw_num, mw_count, qp_count;
> + unsigned int qp_offset, rx_info_offset;
> + unsigned int mw_size, mw_size_per_qp;
> + unsigned int num_qps_mw;
> + size_t edma_total;
> + unsigned int i;
> + int node;
> +
> + mw_count = nt->mw_count;
> + qp_count = nt->qp_count;
> +
> + mw_num = QP_TO_MW(nt, qp_num);
> + mw = &nt->mw_vec[mw_num];
> +
> + if (!mw->virt_addr)
> + return -ENOMEM;
> +
> + if (mw_num < qp_count % mw_count)
> + num_qps_mw = qp_count / mw_count + 1;
> + else
> + num_qps_mw = qp_count / mw_count;
> +
> + mw_size = min(nt->mw_vec[mw_num].phys_size, mw->xlat_size);
> + if (max_mw_size && mw_size > max_mw_size)
> + mw_size = max_mw_size;
> +
> + mw_size_per_qp = round_down((unsigned int)mw_size / num_qps_mw, SZ_64);
> + qp_offset = mw_size_per_qp * (qp_num / mw_count);
> + rx_info_offset = mw_size_per_qp - sizeof(struct ntb_rx_info);
> +
> + qp->tx_mw_size = mw_size_per_qp;
> + qp->tx_mw = nt->mw_vec[mw_num].vbase + qp_offset;
> + if (!qp->tx_mw)
> + return -EINVAL;
> + qp->tx_mw_phys = nt->mw_vec[mw_num].phys_addr + qp_offset;
> + if (!qp->tx_mw_phys)
> + return -EINVAL;
> + qp->rx_info = qp->tx_mw + rx_info_offset;
> + qp->rx_buff = mw->virt_addr + qp_offset;
> + qp->remote_rx_info = qp->rx_buff + rx_info_offset;
> +
> + /* Due to housekeeping, there must be at least 2 buffs */
> + qp->tx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
> + qp->rx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
> +
> + /* In eDMA mode, decouple from MW sizing and force ring-sized entries */
> + edma_total = 2 * sizeof(struct ntb_edma_ring);
> + if (rx_info_offset < edma_total) {
> + dev_err(&ndev->dev, "Ring space requires %luB (>=%uB)\n",
> + edma_total, rx_info_offset);
> + return -EINVAL;
> + }
> + qp->tx_max_entry = NTB_EDMA_RING_ENTRIES;
> + qp->rx_max_entry = NTB_EDMA_RING_ENTRIES;
> +
> + /*
> + * Checking to see if we have more entries than the default.
> + * We should add additional entries if that is the case so we
> + * can be in sync with the transport frames.
> + */
> + node = dev_to_node(&ndev->dev);
> + for (i = qp->rx_alloc_entry; i < qp->rx_max_entry; i++) {
> + entry = kzalloc_node(sizeof(*entry), GFP_KERNEL, node);
> + if (!entry)
> + return -ENOMEM;
> +
> + entry->qp = qp;
> + INIT_WORK(&entry->dma_work, ntb_transport_edma_rc_dma_work);
> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> + &qp->rx_free_q);
> + qp->rx_alloc_entry++;
> + }
> +
> + memset(qp->rx_buff, 0, edma_total);
> +
> + qp->rx_pkts = 0;
> + qp->tx_pkts = 0;
> +
> + return 0;
> +}
> +
> +static int ntb_transport_edma_ep_read_complete(struct ntb_transport_qp *qp)
> +{
> + struct device *dma_dev = get_dma_dev(qp->ndev);
> + struct ntb_queue_entry *entry;
> + struct ntb_edma_desc *in;
> + unsigned int len;
> + u32 idx;
> +
> + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_TAIL_RD_EP_I(qp)),
> + qp->rd_cons) == 0)
> + return 0;
> +
> + idx = ntb_edma_ring_idx(qp->rd_cons);
> + in = NTB_DESC_RD_EP_I(qp, idx);
> + if (!(in->flags & DESC_DONE_FLAG))
> + return 0;
> +
> + in->flags = 0;
> + len = in->len; /* might be smaller than entry->len */
> +
> + entry = (struct ntb_queue_entry *)(in->data);
> + if (WARN_ON(!entry))
> + return 0;
> +
> + if (in->flags & LINK_DOWN_FLAG) {
> + ntb_qp_link_down(qp);
> + qp->rd_cons++;
> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> + return 1;
> + }
> +
> + dma_unmap_single(dma_dev, entry->addr, entry->len, DMA_FROM_DEVICE);
> +
> + qp->rx_bytes += len;
> + qp->rx_pkts++;
> + qp->rd_cons++;
> +
> + if (qp->rx_handler && qp->client_ready)
> + qp->rx_handler(qp, qp->cb_data, entry->cb_data, len);
> +
> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> + return 1;
> +}
> +
> +static int ntb_transport_edma_ep_write_complete(struct ntb_transport_qp *qp)
> +{
> + struct ntb_queue_entry *entry;
> + struct ntb_edma_desc *in;
> + u32 idx;
> +
> + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_TAIL_WR_EP_I(qp)),
> + qp->wr_cons) == 0)
> + return 0;
> +
> + idx = ntb_edma_ring_idx(qp->wr_cons);
> + in = NTB_DESC_WR_EP_I(qp, idx);
> +
> + entry = (struct ntb_queue_entry *)(in->data);
> + if (WARN_ON(!entry))
> + return 0;
> +
> + qp->wr_cons++;
> +
> + if (qp->tx_handler)
> + qp->tx_handler(qp, qp->cb_data, entry->cb_data, entry->len);
> +
> + ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry, &qp->tx_free_q);
> + return 1;
> +}
> +
> +static void ntb_transport_edma_ep_read_work(struct work_struct *work)
> +{
> + struct ntb_transport_qp *qp = container_of(
> + work, struct ntb_transport_qp, read_work);
> + unsigned int i;
> +
> + for (i = 0; i < NTB_EDMA_MAX_POLL; i++) {
> + if (!ntb_transport_edma_ep_read_complete(qp))
> + break;
> + }
> +
> + if (ntb_transport_edma_ep_read_complete(qp))
> + queue_work(qp->transport->wq, &qp->read_work);
> +}
> +
> +static void ntb_transport_edma_ep_write_work(struct work_struct *work)
> +{
> + struct ntb_transport_qp *qp = container_of(
> + work, struct ntb_transport_qp, write_work);
> + unsigned int i;
> +
> + for (i = 0; i < NTB_EDMA_MAX_POLL; i++) {
> + if (!ntb_transport_edma_ep_write_complete(qp))
> + break;
> + }
> +
> + if (ntb_transport_edma_ep_write_complete(qp))
> + queue_work(qp->transport->wq, &qp->write_work);
> +}
> +
> +static void ntb_transport_edma_rc_write_complete_work(struct work_struct *work)
> +{
> + struct ntb_transport_qp *qp = container_of(
> + work, struct ntb_transport_qp, write_work);
> + struct ntb_queue_entry *entry;
> + struct ntb_edma_desc *in;
> + unsigned int len;
> + void *cb_data;
> + u32 idx;
> +
> + while (ntb_edma_ring_used_entry(READ_ONCE(qp->wr_issue),
> + qp->wr_cons) != 0) {
> + /* Paired with smp_wmb() in ntb_transport_edma_rc_poll() */
> + smp_rmb();
> +
> + idx = ntb_edma_ring_idx(qp->wr_cons);
> + in = NTB_DESC_WR_RC_I(qp, idx);
> + entry = (struct ntb_queue_entry *)READ_ONCE(in->data);
> + if (!entry || !(entry->flags & DESC_DONE_FLAG))
> + break;
> +
> + in->data = 0;
> +
> + cb_data = entry->cb_data;
> + len = entry->len;
> +
> + iowrite32(++qp->wr_cons, NTB_TAIL_WR_RC_O(qp));
> +
> + if (unlikely(entry->flags & LINK_DOWN_FLAG)) {
> + ntb_qp_link_down(qp);
> + continue;
> + }
> +
> + ntb_transport_edma_notify_peer(qp);
> +
> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> +
> + if (qp->rx_handler && qp->client_ready)
> + qp->rx_handler(qp, qp->cb_data, cb_data, len);
> +
> + /* stat updates */
> + qp->rx_bytes += len;
> + qp->rx_pkts++;
> + }
> +}
> +
> +static void ntb_transport_edma_rc_write_cb(void *data,
> + const struct dmaengine_result *res)
> +{
> + struct ntb_queue_entry *entry = data;
> + struct ntb_transport_qp *qp = entry->qp;
> + struct ntb_transport_ctx *nt = qp->transport;
> + enum dmaengine_tx_result dma_err = res->result;
> + struct device *dma_dev = get_dma_dev(qp->ndev);
> +
> + switch (dma_err) {
> + case DMA_TRANS_READ_FAILED:
> + case DMA_TRANS_WRITE_FAILED:
> + case DMA_TRANS_ABORTED:
> + entry->errors++;
> + entry->len = -EIO;
> + break;
> + case DMA_TRANS_NOERROR:
> + default:
> + break;
> + }
> + dma_unmap_sg(dma_dev, &entry->sgl, 1, DMA_FROM_DEVICE);
> + sg_dma_address(&entry->sgl) = 0;
> +
> + entry->flags |= DESC_DONE_FLAG;
> +
> + queue_work(nt->wq, &qp->write_work);
> +}
> +
> +static void ntb_transport_edma_rc_read_complete_work(struct work_struct *work)
> +{
> + struct ntb_transport_qp *qp = container_of(
> + work, struct ntb_transport_qp, read_work);
> + struct ntb_edma_desc *in, __iomem *out;
> + struct ntb_queue_entry *entry;
> + unsigned int len;
> + void *cb_data;
> + u32 idx;
> +
> + while (ntb_edma_ring_used_entry(READ_ONCE(qp->rd_issue),
> + qp->rd_cons) != 0) {
> + /* Paired with smp_wmb() in ntb_transport_edma_rc_tx_enqueue() */
> + smp_rmb();
> +
> + idx = ntb_edma_ring_idx(qp->rd_cons);
> + in = NTB_DESC_RD_RC_I(qp, idx);
> + entry = (struct ntb_queue_entry *)in->data;
> + if (!entry || !(entry->flags & DESC_DONE_FLAG))
> + break;
> +
> + in->data = 0;
> +
> + cb_data = entry->cb_data;
> + len = entry->len;
> +
> + out = NTB_DESC_RD_RC_O(qp, idx);
> +
> + WRITE_ONCE(qp->rd_cons, qp->rd_cons + 1);
> +
> + /*
> + * No need to add barrier in-between to enforce ordering here.
> + * The other side proceeds only after both flags and tail are
> + * updated.
> + */
> + iowrite32(entry->flags, &out->flags);
> + iowrite32(qp->rd_cons, NTB_TAIL_RD_RC_O(qp));
> +
> + ntb_transport_edma_notify_peer(qp);
> +
> + ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry,
> + &qp->tx_free_q);
> +
> + if (qp->tx_handler)
> + qp->tx_handler(qp, qp->cb_data, cb_data, len);
> +
> + /* stat updates */
> + qp->tx_bytes += len;
> + qp->tx_pkts++;
> + }
> +}
> +
> +static void ntb_transport_edma_rc_read_cb(void *data,
> + const struct dmaengine_result *res)
> +{
> + struct ntb_queue_entry *entry = data;
> + struct ntb_transport_qp *qp = entry->qp;
> + struct ntb_transport_ctx *nt = qp->transport;
> + struct device *dma_dev = get_dma_dev(qp->ndev);
> + enum dmaengine_tx_result dma_err = res->result;
> +
> + switch (dma_err) {
> + case DMA_TRANS_READ_FAILED:
> + case DMA_TRANS_WRITE_FAILED:
> + case DMA_TRANS_ABORTED:
> + entry->errors++;
> + entry->len = -EIO;
> + break;
> + case DMA_TRANS_NOERROR:
> + default:
> + break;
> + }
> + dma_unmap_sg(dma_dev, &entry->sgl, 1, DMA_TO_DEVICE);
> + sg_dma_address(&entry->sgl) = 0;
> +
> + entry->flags |= DESC_DONE_FLAG;
> +
> + queue_work(nt->wq, &qp->read_work);
> +}
> +
> +static int ntb_transport_edma_rc_write_start(struct device *d,
> + struct dma_chan *chan, size_t len,
> + dma_addr_t ep_src, void *rc_dst,
> + struct ntb_queue_entry *entry)
> +{
> + struct scatterlist *sgl = &entry->sgl;
> + struct dma_async_tx_descriptor *txd;
> + struct dma_slave_config cfg;
> + dma_cookie_t cookie;
> + int nents, rc;
> +
> + if (!d)
> + return -ENODEV;
> +
> + if (!chan)
> + return -ENXIO;
> +
> + if (WARN_ON(!ep_src || !rc_dst))
> + return -EINVAL;
> +
> + if (WARN_ON(sg_dma_address(sgl)))
> + return -EINVAL;
> +
> + sg_init_one(sgl, rc_dst, len);
> + nents = dma_map_sg(d, sgl, 1, DMA_FROM_DEVICE);
> + if (nents <= 0)
> + return -EIO;
> +
> + memset(&cfg, 0, sizeof(cfg));
> + cfg.src_addr = ep_src;
> + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> + cfg.direction = DMA_DEV_TO_MEM;
> + rc = dmaengine_slave_config(chan, &cfg);
> + if (rc)
> + goto out_unmap;
> +
> + txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_DEV_TO_MEM,
> + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> + if (!txd) {
> + rc = -EIO;
> + goto out_unmap;
> + }
> +
> + txd->callback_result = ntb_transport_edma_rc_write_cb;
> + txd->callback_param = entry;
> +
> + cookie = dmaengine_submit(txd);
> + if (dma_submit_error(cookie)) {
> + rc = -EIO;
> + goto out_unmap;
> + }
> + dma_async_issue_pending(chan);
> + return 0;
> +out_unmap:
> + dma_unmap_sg(d, sgl, 1, DMA_FROM_DEVICE);
> + return rc;
> +}
> +
> +static int ntb_transport_edma_rc_read_start(struct device *d,
> + struct dma_chan *chan, size_t len,
> + void *rc_src, dma_addr_t ep_dst,
> + struct ntb_queue_entry *entry)
> +{
> + struct scatterlist *sgl = &entry->sgl;
> + struct dma_async_tx_descriptor *txd;
> + struct dma_slave_config cfg;
> + dma_cookie_t cookie;
> + int nents, rc;
> +
> + if (!d)
> + return -ENODEV;
> +
> + if (!chan)
> + return -ENXIO;
> +
> + if (WARN_ON(!rc_src || !ep_dst))
> + return -EINVAL;
> +
> + if (WARN_ON(sg_dma_address(sgl)))
> + return -EINVAL;
> +
> + sg_init_one(sgl, rc_src, len);
> + nents = dma_map_sg(d, sgl, 1, DMA_TO_DEVICE);
> + if (nents <= 0)
> + return -EIO;
> +
> + memset(&cfg, 0, sizeof(cfg));
> + cfg.dst_addr = ep_dst;
> + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> + cfg.direction = DMA_MEM_TO_DEV;
> + rc = dmaengine_slave_config(chan, &cfg);
> + if (rc)
> + goto out_unmap;
> +
> + txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_MEM_TO_DEV,
> + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> + if (!txd) {
> + rc = -EIO;
> + goto out_unmap;
> + }
> +
> + txd->callback_result = ntb_transport_edma_rc_read_cb;
> + txd->callback_param = entry;
> +
> + cookie = dmaengine_submit(txd);
> + if (dma_submit_error(cookie)) {
> + rc = -EIO;
> + goto out_unmap;
> + }
> + dma_async_issue_pending(chan);
> + return 0;
> +out_unmap:
> + dma_unmap_sg(d, sgl, 1, DMA_TO_DEVICE);
> + return rc;
> +}
> +
> +static void ntb_transport_edma_rc_dma_work(struct work_struct *work)
> +{
> + struct ntb_queue_entry *entry = container_of(
> + work, struct ntb_queue_entry, dma_work);
> + struct ntb_transport_qp *qp = entry->qp;
> + struct ntb_transport_ctx *nt = qp->transport;
> + struct device *dma_dev = get_dma_dev(qp->ndev);
> + struct dma_chan *chan;
> + int rc;
> +
> + chan = ntb_edma_pick_chan(&nt->edma, REMOTE_EDMA_WRITE);
> + rc = ntb_transport_edma_rc_write_start(dma_dev, chan, entry->len,
> + entry->addr, entry->buf, entry);
> + if (rc) {
> + entry->errors++;
> + entry->len = -EIO;
> + entry->flags |= DESC_DONE_FLAG;
> + queue_work(nt->wq, &qp->write_work);
> + return;
> + }
> +}
> +
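> +/*
> + * RC-side RX path: walk the WR ring entries published by the EP, pair each
> + * descriptor with a buffer from rx_pend_q, and schedule dma_work to kick a
> + * remote eDMA write transfer that moves the payload from EP memory into the
> + * RC buffer. Completion is reported via ntb_transport_edma_rc_write_cb() and
> + * finished in ntb_transport_edma_rc_write_complete_work().
> + */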
> +static void ntb_transport_edma_rc_poll(struct ntb_transport_qp *qp)
> +{
> + struct ntb_transport_ctx *nt = qp->transport;
> + int budget = NTB_EDMA_MAX_POLL;
> + struct ntb_queue_entry *entry;
> + struct ntb_edma_desc *in;
> + dma_addr_t ep_src;
> + u32 len, idx;
> +
> + while (budget--) {
> + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_HEAD_WR_RC_I(qp)),
> + qp->wr_issue) == 0)
> + break;
> +
> + idx = ntb_edma_ring_idx(qp->wr_issue);
> + in = NTB_DESC_WR_RC_I(qp, idx);
> +
> + len = READ_ONCE(in->len);
> + ep_src = (dma_addr_t)READ_ONCE(in->addr);
> +
> + /* Prepare 'entry' for write completion */
> + entry = ntb_list_rm(&qp->ntb_rx_q_lock, &qp->rx_pend_q);
> + if (!entry) {
> + qp->rx_err_no_buf++;
> + break;
> + }
> + if (WARN_ON(entry->flags & DESC_DONE_FLAG))
> + entry->flags &= ~DESC_DONE_FLAG;
> + entry->len = len; /* NB. entry->len can be <=0 */
> + entry->addr = ep_src;
> +
> + /*
> + * ntb_transport_edma_rc_write_complete_work() checks entry->flags
> + * so it needs to be set before wr_issue++.
> + */
> + in->data = (uintptr_t)entry;
> +
> + /* Ensure in->data visible before wr_issue++ */
> + smp_wmb();
> +
> + WRITE_ONCE(qp->wr_issue, qp->wr_issue + 1);
> +
> + if (!len) {
> + entry->flags |= DESC_DONE_FLAG;
> + queue_work(nt->wq, &qp->write_work);
> + continue;
> + }
> +
> + if (in->flags & LINK_DOWN_FLAG) {
> + dev_dbg(&qp->ndev->pdev->dev, "link down flag set\n");
> + entry->flags |= DESC_DONE_FLAG | LINK_DOWN_FLAG;
> + queue_work(nt->wq, &qp->write_work);
> + continue;
> + }
> +
> + queue_work(nt->wq, &entry->dma_work);
> + }
> +
> + /* Budget exhausted: more entries may be pending, so poll again */
> + if (budget < 0)
> + tasklet_schedule(&qp->rxc_db_work);
> +}
> +
> +static int ntb_transport_edma_rc_tx_enqueue(struct ntb_transport_qp *qp,
> + struct ntb_queue_entry *entry)
> +{
> + struct device *dma_dev = get_dma_dev(qp->ndev);
> + struct ntb_transport_ctx *nt = qp->transport;
> + struct ntb_edma_desc *in, __iomem *out;
> + unsigned int len = entry->len;
> + struct dma_chan *chan;
> + u32 issue, idx, head;
> + dma_addr_t ep_dst;
> + int rc;
> +
> + WARN_ON_ONCE(entry->flags & DESC_DONE_FLAG);
> +
> + scoped_guard(spinlock_irqsave, &qp->rc_lock) {
> + head = READ_ONCE(*NTB_HEAD_RD_RC_I(qp));
> + issue = qp->rd_issue;
> + if (ntb_edma_ring_used_entry(head, issue) == 0) {
> + qp->tx_ring_full++;
> + return -ENOSPC;
> + }
> +
> + /*
> + * ntb_transport_edma_rc_read_complete_work() checks entry->flags
> + * so it needs to be set before rd_issue++.
> + */
> + idx = ntb_edma_ring_idx(issue);
> + in = NTB_DESC_RD_RC_I(qp, idx);
> + in->data = (uintptr_t)entry;
> +
> + /* Make in->data visible before rd_issue++ */
> + smp_wmb();
> +
> + WRITE_ONCE(qp->rd_issue, qp->rd_issue + 1);
> + }
> +
> + /* Publish the final transfer length to the EP side */
> + out = NTB_DESC_RD_RC_O(qp, idx);
> + iowrite32(len, &out->len);
> + ioread32(&out->len);
> +
> + if (unlikely(!len)) {
> + entry->flags |= DESC_DONE_FLAG;
> + queue_work(nt->wq, &qp->read_work);
> + return 0;
> + }
> +
> + /* Paired with dma_wmb() in ntb_transport_edma_ep_rx_enqueue() */
> + dma_rmb();
> +
> + /* kick remote eDMA read transfer */
> + ep_dst = (dma_addr_t)in->addr;
> + chan = ntb_edma_pick_chan(&nt->edma, REMOTE_EDMA_READ);
> + rc = ntb_transport_edma_rc_read_start(dma_dev, chan, len,
> + entry->buf, ep_dst, entry);
> + if (rc) {
> + entry->errors++;
> + entry->len = -EIO;
> + entry->flags |= DESC_DONE_FLAG;
> + queue_work(nt->wq, &qp->read_work);
> + }
> + return 0;
> +}
> +
> +static int ntb_transport_edma_ep_tx_enqueue(struct ntb_transport_qp *qp,
> + struct ntb_queue_entry *entry)
> +{
> + struct device *dma_dev = get_dma_dev(qp->ndev);
> + struct ntb_edma_desc *in, __iomem *out;
> + unsigned int len = entry->len;
> + dma_addr_t ep_src = 0;
> + u32 idx;
> + int rc;
> +
> + if (likely(len)) {
> + ep_src = dma_map_single(dma_dev, entry->buf, len,
> + DMA_TO_DEVICE);
> + rc = dma_mapping_error(dma_dev, ep_src);
> + if (rc)
> + return rc;
> + }
> +
> + scoped_guard(spinlock_irqsave, &qp->ep_tx_lock) {
> + if (ntb_edma_ring_full(qp->wr_prod, qp->wr_cons)) {
> + rc = -ENOSPC;
> + qp->tx_ring_full++;
> + goto out_unmap;
> + }
> +
> + idx = ntb_edma_ring_idx(qp->wr_prod);
> + in = NTB_DESC_WR_EP_I(qp, idx);
> + out = NTB_DESC_WR_EP_O(qp, idx);
> +
> + WARN_ON(in->flags & DESC_DONE_FLAG);
> + WARN_ON(entry->flags & DESC_DONE_FLAG);
> + in->flags = 0;
> + in->data = (uintptr_t)entry;
> + entry->addr = ep_src;
> +
> + iowrite32(len, &out->len);
> + iowrite32(entry->flags, &out->flags);
> + iowrite64(ep_src, &out->addr);
> + WRITE_ONCE(qp->wr_prod, qp->wr_prod + 1);
> +
> + dma_wmb();
> + iowrite32(qp->wr_prod, NTB_HEAD_WR_EP_O(qp));
> +
> + qp->tx_bytes += len;
> + qp->tx_pkts++;
> + }
> +
> + ntb_transport_edma_notify_peer(qp);
> +
> + return 0;
> +out_unmap:
> + if (likely(len))
> + dma_unmap_single(dma_dev, ep_src, len, DMA_TO_DEVICE);
> + return rc;
> +}
> +
> +static int ntb_transport_edma_tx_enqueue(struct ntb_transport_qp *qp,
> + struct ntb_queue_entry *entry,
> + void *cb, void *data, unsigned int len,
> + unsigned int flags)
> +{
> + struct device *dma_dev;
> +
> + if (entry->addr) {
> + /* Deferred unmap */
> + dma_dev = get_dma_dev(qp->ndev);
> + dma_unmap_single(dma_dev, entry->addr, entry->len, DMA_TO_DEVICE);
> + }
> +
> + entry->cb_data = cb;
> + entry->buf = data;
> + entry->len = len;
> + entry->flags = flags;
> + entry->errors = 0;
> + entry->addr = 0;
> +
> + WARN_ON_ONCE(!ntb_qp_edma_enabled(qp));
> +
> + if (ntb_qp_edma_is_ep(qp))
> + return ntb_transport_edma_ep_tx_enqueue(qp, entry);
> + else
> + return ntb_transport_edma_rc_tx_enqueue(qp, entry);
> +}
> +
> +static int ntb_transport_edma_ep_rx_enqueue(struct ntb_transport_qp *qp,
> + struct ntb_queue_entry *entry)
> +{
> + struct device *dma_dev = get_dma_dev(qp->ndev);
> + struct ntb_edma_desc *in, __iomem *out;
> + unsigned int len = entry->len;
> + void *data = entry->buf;
> + dma_addr_t ep_dst;
> + u32 idx;
> + int rc;
> +
> + ep_dst = dma_map_single(dma_dev, data, len, DMA_FROM_DEVICE);
> + rc = dma_mapping_error(dma_dev, ep_dst);
> + if (rc)
> + return rc;
> +
> + scoped_guard(spinlock_bh, &qp->ep_rx_lock) {
> + if (ntb_edma_ring_full(READ_ONCE(qp->rd_prod),
> + READ_ONCE(qp->rd_cons))) {
> + rc = -ENOSPC;
> + goto out_unmap;
> + }
> +
> + idx = ntb_edma_ring_idx(qp->rd_prod);
> + in = NTB_DESC_RD_EP_I(qp, idx);
> + out = NTB_DESC_RD_EP_O(qp, idx);
> +
> + iowrite32(len, &out->len);
> + iowrite64(ep_dst, &out->addr);
> +
> + WARN_ON(in->flags & DESC_DONE_FLAG);
> + in->data = (uintptr_t)entry;
> + entry->addr = ep_dst;
> +
> + /* Ensure len/addr are visible before the head update */
> + dma_wmb();
> +
> + WRITE_ONCE(qp->rd_prod, qp->rd_prod + 1);
> + iowrite32(qp->rd_prod, NTB_HEAD_RD_EP_O(qp));
> + }
> + return 0;
> +out_unmap:
> + dma_unmap_single(dma_dev, ep_dst, len, DMA_FROM_DEVICE);
> + return rc;
> +}
> +
> +static int ntb_transport_edma_rx_enqueue(struct ntb_transport_qp *qp,
> + struct ntb_queue_entry *entry)
> +{
> + int rc;
> +
> + /* The behaviour is the same as the default backend for RC side */
> + if (ntb_qp_edma_is_ep(qp)) {
> + rc = ntb_transport_edma_ep_rx_enqueue(qp, entry);
> + if (rc) {
> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> + &qp->rx_free_q);
> + return rc;
> + }
> + }
> +
> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_pend_q);
> +
> + if (qp->active)
> + tasklet_schedule(&qp->rxc_db_work);
> +
> + return 0;
> +}
> +
> +static void ntb_transport_edma_rx_poll(struct ntb_transport_qp *qp)
> +{
> + struct ntb_transport_ctx *nt = qp->transport;
> +
> + if (ntb_qp_edma_is_rc(qp))
> + ntb_transport_edma_rc_poll(qp);
> + else if (ntb_qp_edma_is_ep(qp)) {
> + /*
> + * Make sure we poll the rings even if an eDMA interrupt is
> + * cleared on the RC side earlier.
> + */
> + queue_work(nt->wq, &qp->read_work);
> + queue_work(nt->wq, &qp->write_work);
> + } else
> + /* Unreachable */
> + WARN_ON_ONCE(1);
> +}
> +
> +static void ntb_transport_edma_read_work(struct work_struct *work)
> +{
> + struct ntb_transport_qp *qp = container_of(
> + work, struct ntb_transport_qp, read_work);
> +
> + if (ntb_qp_edma_is_rc(qp))
> + ntb_transport_edma_rc_read_complete_work(work);
> + else if (ntb_qp_edma_is_ep(qp))
> + ntb_transport_edma_ep_read_work(work);
> + else
> + /* Unreachable */
> + WARN_ON_ONCE(1);
> +}
> +
> +static void ntb_transport_edma_write_work(struct work_struct *work)
> +{
> + struct ntb_transport_qp *qp = container_of(
> + work, struct ntb_transport_qp, write_work);
> +
> + if (ntb_qp_edma_is_rc(qp))
> + ntb_transport_edma_rc_write_complete_work(work);
> + else if (ntb_qp_edma_is_ep(qp))
> + ntb_transport_edma_ep_write_work(work);
> + else
> + /* Unreachable */
> + WARN_ON_ONCE(1);
> +}
> +
> +static void ntb_transport_edma_init_queue(struct ntb_transport_ctx *nt,
> + unsigned int qp_num)
> +{
> + struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
> +
> + qp->wr_cons = 0;
> + qp->rd_cons = 0;
> + qp->wr_prod = 0;
> + qp->rd_prod = 0;
> + qp->wr_issue = 0;
> + qp->rd_issue = 0;
> +
> + INIT_WORK(&qp->db_work, ntb_transport_edma_db_work);
> + INIT_WORK(&qp->read_work, ntb_transport_edma_read_work);
> + INIT_WORK(&qp->write_work, ntb_transport_edma_write_work);
> +}
> +
> +static void ntb_transport_edma_create_queue(struct ntb_transport_ctx *nt,
> + struct ntb_transport_qp *qp)
> +{
> + spin_lock_init(&qp->ep_tx_lock);
> + spin_lock_init(&qp->ep_rx_lock);
> + spin_lock_init(&qp->rc_lock);
> +}
> +
> +static const struct ntb_transport_backend_ops edma_backend_ops = {
> + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> + .tx_free_entry = ntb_transport_edma_tx_free_entry,
> + .tx_enqueue = ntb_transport_edma_tx_enqueue,
> + .rx_enqueue = ntb_transport_edma_rx_enqueue,
> + .rx_poll = ntb_transport_edma_rx_poll,
> + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> +};
> +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> +
> /**
> * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> * @qp: NTB transport layer queue to be enabled
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
` (26 preceding siblings ...)
2025-11-29 16:04 ` [RFC PATCH v2 27/27] NTB: epf: Add an additional memory window (MW2) barno mapping on Renesas R-Car Koichiro Den
@ 2025-12-01 22:02 ` Frank Li
2025-12-02 6:20 ` Koichiro Den
27 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-01 22:02 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:38AM +0900, Koichiro Den wrote:
> Hi,
>
> This is RFC v2 of the NTB/PCI series for Renesas R-Car S4. The ultimate
> goal is unchanged, i.e. to improve performance between RC and EP
> (with vNTB) over ntb_transport, but the approach has changed drastically.
> Based on the feedback from Frank Li in the v1 thread, in particular:
> https://lore.kernel.org/all/aQEsip3TsPn4LJY9@lizhi-Precision-Tower-5810/
> this RFC v2 instead builds an NTB transport backed by remote eDMA
> architecture and reshapes the series around it. The RC->EP interruption
> is now achieved using a dedicated eDMA read channel, so the somewhat
> "hack"-ish approach in RFC v1 is no longer needed.
>
> Compared to RFC v1, this v2 series enables NTB transport backed by
> remote DW eDMA, so the current ntb_transport handling of Memory Window
> is no longer needed, and direct DMA transfers between EP and RC are
> used.
>
> I realize this is quite a large series. Sorry for the volume, but for
> the RFC stage I believe presenting the full picture in a single set
> helps with reviewing the overall architecture. Once the direction is
> agreed, I will respin it split by subsystem and topic.
>
>
...
>
> - Before this change:
>
> * ping
> 64 bytes from 10.0.0.11: icmp_seq=1 ttl=64 time=12.3 ms
> 64 bytes from 10.0.0.11: icmp_seq=2 ttl=64 time=6.58 ms
> 64 bytes from 10.0.0.11: icmp_seq=3 ttl=64 time=1.26 ms
> 64 bytes from 10.0.0.11: icmp_seq=4 ttl=64 time=7.43 ms
> 64 bytes from 10.0.0.11: icmp_seq=5 ttl=64 time=1.39 ms
> 64 bytes from 10.0.0.11: icmp_seq=6 ttl=64 time=7.38 ms
> 64 bytes from 10.0.0.11: icmp_seq=7 ttl=64 time=1.42 ms
> 64 bytes from 10.0.0.11: icmp_seq=8 ttl=64 time=7.41 ms
>
> * RC->EP (`sudo iperf3 -ub0 -l 65480 -P 2`)
> [ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams
> [ 5] 0.00-10.01 sec 344 MBytes 288 Mbits/sec 3.483 ms 51/5555 (0.92%) receiver
> [ 6] 0.00-10.01 sec 342 MBytes 287 Mbits/sec 3.814 ms 38/5517 (0.69%) receiver
> [SUM] 0.00-10.01 sec 686 MBytes 575 Mbits/sec 3.648 ms 89/11072 (0.8%) receiver
>
> * EP->RC (`sudo iperf3 -ub0 -l 65480 -P 2`)
> [ 5] 0.00-10.03 sec 334 MBytes 279 Mbits/sec 3.164 ms 390/5731 (6.8%) receiver
> [ 6] 0.00-10.03 sec 334 MBytes 279 Mbits/sec 2.416 ms 396/5741 (6.9%) receiver
> [SUM] 0.00-10.03 sec 667 MBytes 558 Mbits/sec 2.790 ms 786/11472 (6.9%) receiver
>
> Note: with `-P 2`, the best total bitrate (receiver side) was achieved.
>
> - After this change (use_remote_edma=1) [1]:
>
> * ping
> 64 bytes from 10.0.0.11: icmp_seq=1 ttl=64 time=1.48 ms
> 64 bytes from 10.0.0.11: icmp_seq=2 ttl=64 time=1.03 ms
> 64 bytes from 10.0.0.11: icmp_seq=3 ttl=64 time=0.931 ms
> 64 bytes from 10.0.0.11: icmp_seq=4 ttl=64 time=0.910 ms
> 64 bytes from 10.0.0.11: icmp_seq=5 ttl=64 time=1.07 ms
> 64 bytes from 10.0.0.11: icmp_seq=6 ttl=64 time=0.986 ms
> 64 bytes from 10.0.0.11: icmp_seq=7 ttl=64 time=0.910 ms
> 64 bytes from 10.0.0.11: icmp_seq=8 ttl=64 time=0.883 ms
>
> * RC->EP (`sudo iperf3 -ub0 -l 65480 -P 4`)
> [ 5] 0.00-10.01 sec 3.54 GBytes 3.04 Gbits/sec 0.030 ms 0/58007 (0%) receiver
> [ 6] 0.00-10.01 sec 3.71 GBytes 3.19 Gbits/sec 0.453 ms 0/60909 (0%) receiver
> [ 9] 0.00-10.01 sec 3.85 GBytes 3.30 Gbits/sec 0.027 ms 0/63072 (0%) receiver
> [ 11] 0.00-10.01 sec 3.26 GBytes 2.80 Gbits/sec 0.070 ms 1/53512 (0.0019%) receiver
> [SUM] 0.00-10.01 sec 14.4 GBytes 12.3 Gbits/sec 0.145 ms 1/235500 (0.00042%) receiver
>
> * EP->RC (`sudo iperf3 -ub0 -l 65480 -P 4`)
> [ 5] 0.00-10.03 sec 3.40 GBytes 2.91 Gbits/sec 0.104 ms 15467/71208 (22%) receiver
> [ 6] 0.00-10.03 sec 3.08 GBytes 2.64 Gbits/sec 0.176 ms 12097/62609 (19%) receiver
> [ 9] 0.00-10.03 sec 3.38 GBytes 2.90 Gbits/sec 0.270 ms 17212/72710 (24%) receiver
> [ 11] 0.00-10.03 sec 2.56 GBytes 2.19 Gbits/sec 0.200 ms 11193/53090 (21%) receiver
Almost 10x faster, 2.9G vs 279M? Highlighting this will get more people
interested in this topic.
> [SUM] 0.00-10.03 sec 12.4 GBytes 10.6 Gbits/sec 0.188 ms 55969/259617 (22%) receiver
>
> [1] configfs settings:
> # modprobe pci_epf_vntb dyndbg=+pmf
> # cd /sys/kernel/config/pci_ep/
> # mkdir functions/pci_epf_vntb/func1
> # echo 0x1912 > functions/pci_epf_vntb/func1/vendorid
> # echo 0x0030 > functions/pci_epf_vntb/func1/deviceid
> # echo 32 > functions/pci_epf_vntb/func1/msi_interrupts
> # echo 16 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/db_count
> # echo 128 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/spad_count
> # echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/num_mws
> # echo 0xe0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1
> # echo 0x20000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2
> # echo 0xe0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2_offset
It looks like you are trying to create small sub memory windows.
Would it be cleaner as:
echo 0xe0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1.0
echo 0x20000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1.1
so that mw1.1 naturally continues from the previous one.
Frank
> # echo 0x1912 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vntb_vid
> # echo 0x0030 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vntb_pid
> # echo 0x10 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vbus_number
> # echo 0 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/ctrl_bar
> # echo 4 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/db_bar
> # echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1_bar
> # echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2_bar
> # ln -s controllers/e65d0000.pcie-ep functions/pci_epf_vntb/func1/primary/
> # echo 1 > controllers/e65d0000.pcie-ep/start
>
>
> Thanks for taking a look.
>
>
> Koichiro Den (27):
> PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on mws_size[]
> access
> PCI: endpoint: pci-epf-vntb: Add mwN_offset configfs attributes
> NTB: epf: Handle mwN_offset for inbound MW regions
> PCI: endpoint: Add inbound mapping ops to EPC core
> PCI: dwc: ep: Implement EPC inbound mapping support
> PCI: endpoint: pci-epf-vntb: Use pci_epc_map_inbound() for MW mapping
> NTB: Add offset parameter to MW translation APIs
> PCI: endpoint: pci-epf-vntb: Propagate MW offset from configfs when
> present
> NTB: ntb_transport: Support offsetted partial memory windows
> NTB: core: Add .get_pci_epc() to ntb_dev_ops
> NTB: epf: vntb: Implement .get_pci_epc() callback
> damengine: dw-edma: Fix MSI data values for multi-vector IMWr
> interrupts
> NTB: ntb_transport: Use seq_file for QP stats debugfs
> NTB: ntb_transport: Move TX memory window setup into setup_qp_mw()
> NTB: ntb_transport: Dynamically determine qp count
> NTB: ntb_transport: Introduce get_dma_dev() helper
> NTB: epf: Reserve a subset of MSI vectors for non-NTB users
> NTB: ntb_transport: Introduce ntb_transport_backend_ops
> PCI: dwc: ep: Cache MSI outbound iATU mapping
> NTB: ntb_transport: Introduce remote eDMA backed transport mode
> NTB: epf: Provide db_vector_count/db_vector_mask callbacks
> ntb_netdev: Multi-queue support
> NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car)
> iommu: ipmmu-vmsa: Add PCIe ch0 to devices_allowlist
> iommu: ipmmu-vmsa: Add support for reserved regions
> arm64: dts: renesas: Add Spider RC/EP DTs for NTB with remote DW PCIe
> eDMA
> NTB: epf: Add an additional memory window (MW2) barno mapping on
> Renesas R-Car
>
> arch/arm64/boot/dts/renesas/Makefile | 2 +
> .../boot/dts/renesas/r8a779f0-spider-ep.dts | 46 +
> .../boot/dts/renesas/r8a779f0-spider-rc.dts | 52 +
> drivers/dma/dw-edma/dw-edma-core.c | 28 +-
> drivers/iommu/ipmmu-vmsa.c | 7 +-
> drivers/net/ntb_netdev.c | 341 ++-
> drivers/ntb/Kconfig | 11 +
> drivers/ntb/Makefile | 3 +
> drivers/ntb/hw/amd/ntb_hw_amd.c | 6 +-
> drivers/ntb/hw/epf/ntb_hw_epf.c | 177 +-
> drivers/ntb/hw/idt/ntb_hw_idt.c | 3 +-
> drivers/ntb/hw/intel/ntb_hw_gen1.c | 6 +-
> drivers/ntb/hw/intel/ntb_hw_gen1.h | 2 +-
> drivers/ntb/hw/intel/ntb_hw_gen3.c | 3 +-
> drivers/ntb/hw/intel/ntb_hw_gen4.c | 6 +-
> drivers/ntb/hw/mscc/ntb_hw_switchtec.c | 6 +-
> drivers/ntb/msi.c | 6 +-
> drivers/ntb/ntb_edma.c | 628 ++++++
> drivers/ntb/ntb_edma.h | 128 ++
> .../{ntb_transport.c => ntb_transport_core.c} | 1829 ++++++++++++++---
> drivers/ntb/test/ntb_perf.c | 4 +-
> drivers/ntb/test/ntb_tool.c | 6 +-
> .../pci/controller/dwc/pcie-designware-ep.c | 287 ++-
> drivers/pci/controller/dwc/pcie-designware.h | 7 +
> drivers/pci/endpoint/functions/pci-epf-vntb.c | 229 ++-
> drivers/pci/endpoint/pci-epc-core.c | 44 +
> include/linux/ntb.h | 39 +-
> include/linux/ntb_transport.h | 21 +
> include/linux/pci-epc.h | 11 +
> 29 files changed, 3415 insertions(+), 523 deletions(-)
> create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-ep.dts
> create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-rc.dts
> create mode 100644 drivers/ntb/ntb_edma.c
> create mode 100644 drivers/ntb/ntb_edma.h
> rename drivers/ntb/{ntb_transport.c => ntb_transport_core.c} (59%)
>
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA
2025-12-01 22:02 ` [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Frank Li
@ 2025-12-02 6:20 ` Koichiro Den
2025-12-02 16:07 ` Frank Li
0 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-12-02 6:20 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Mon, Dec 01, 2025 at 05:02:57PM -0500, Frank Li wrote:
> On Sun, Nov 30, 2025 at 01:03:38AM +0900, Koichiro Den wrote:
> > Hi,
> >
> > This is RFC v2 of the NTB/PCI series for Renesas R-Car S4. The ultimate
> > goal is unchanged, i.e. to improve performance between RC and EP
> > (with vNTB) over ntb_transport, but the approach has changed drastically.
> > Based on the feedback from Frank Li in the v1 thread, in particular:
> > https://lore.kernel.org/all/aQEsip3TsPn4LJY9@lizhi-Precision-Tower-5810/
> > this RFC v2 instead builds an NTB transport backed by remote eDMA
> > architecture and reshapes the series around it. The RC->EP interruption
> > is now achieved using a dedicated eDMA read channel, so the somewhat
> > "hack"-ish approach in RFC v1 is no longer needed.
> >
> > Compared to RFC v1, this v2 series enables NTB transport backed by
> > remote DW eDMA, so the current ntb_transport handling of Memory Window
> > is no longer needed, and direct DMA transfers between EP and RC are
> > used.
> >
> > I realize this is quite a large series. Sorry for the volume, but for
> > the RFC stage I believe presenting the full picture in a single set
> > helps with reviewing the overall architecture. Once the direction is
> > agreed, I will respin it split by subsystem and topic.
> >
> >
> ...
> >
> > - Before this change:
> >
> > * ping
> > 64 bytes from 10.0.0.11: icmp_seq=1 ttl=64 time=12.3 ms
> > 64 bytes from 10.0.0.11: icmp_seq=2 ttl=64 time=6.58 ms
> > 64 bytes from 10.0.0.11: icmp_seq=3 ttl=64 time=1.26 ms
> > 64 bytes from 10.0.0.11: icmp_seq=4 ttl=64 time=7.43 ms
> > 64 bytes from 10.0.0.11: icmp_seq=5 ttl=64 time=1.39 ms
> > 64 bytes from 10.0.0.11: icmp_seq=6 ttl=64 time=7.38 ms
> > 64 bytes from 10.0.0.11: icmp_seq=7 ttl=64 time=1.42 ms
> > 64 bytes from 10.0.0.11: icmp_seq=8 ttl=64 time=7.41 ms
> >
> > * RC->EP (`sudo iperf3 -ub0 -l 65480 -P 2`)
> > [ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams
> > [ 5] 0.00-10.01 sec 344 MBytes 288 Mbits/sec 3.483 ms 51/5555 (0.92%) receiver
> > [ 6] 0.00-10.01 sec 342 MBytes 287 Mbits/sec 3.814 ms 38/5517 (0.69%) receiver
> > [SUM] 0.00-10.01 sec 686 MBytes 575 Mbits/sec 3.648 ms 89/11072 (0.8%) receiver
> >
> > * EP->RC (`sudo iperf3 -ub0 -l 65480 -P 2`)
> > [ 5] 0.00-10.03 sec 334 MBytes 279 Mbits/sec 3.164 ms 390/5731 (6.8%) receiver
> > [ 6] 0.00-10.03 sec 334 MBytes 279 Mbits/sec 2.416 ms 396/5741 (6.9%) receiver
> > [SUM] 0.00-10.03 sec 667 MBytes 558 Mbits/sec 2.790 ms 786/11472 (6.9%) receiver
> >
> > Note: with `-P 2`, the best total bitrate (receiver side) was achieved.
> >
> > - After this change (use_remote_edma=1) [1]:
> >
> > * ping
> > 64 bytes from 10.0.0.11: icmp_seq=1 ttl=64 time=1.48 ms
> > 64 bytes from 10.0.0.11: icmp_seq=2 ttl=64 time=1.03 ms
> > 64 bytes from 10.0.0.11: icmp_seq=3 ttl=64 time=0.931 ms
> > 64 bytes from 10.0.0.11: icmp_seq=4 ttl=64 time=0.910 ms
> > 64 bytes from 10.0.0.11: icmp_seq=5 ttl=64 time=1.07 ms
> > 64 bytes from 10.0.0.11: icmp_seq=6 ttl=64 time=0.986 ms
> > 64 bytes from 10.0.0.11: icmp_seq=7 ttl=64 time=0.910 ms
> > 64 bytes from 10.0.0.11: icmp_seq=8 ttl=64 time=0.883 ms
> >
> > * RC->EP (`sudo iperf3 -ub0 -l 65480 -P 4`)
> > [ 5] 0.00-10.01 sec 3.54 GBytes 3.04 Gbits/sec 0.030 ms 0/58007 (0%) receiver
> > [ 6] 0.00-10.01 sec 3.71 GBytes 3.19 Gbits/sec 0.453 ms 0/60909 (0%) receiver
> > [ 9] 0.00-10.01 sec 3.85 GBytes 3.30 Gbits/sec 0.027 ms 0/63072 (0%) receiver
> > [ 11] 0.00-10.01 sec 3.26 GBytes 2.80 Gbits/sec 0.070 ms 1/53512 (0.0019%) receiver
> > [SUM] 0.00-10.01 sec 14.4 GBytes 12.3 Gbits/sec 0.145 ms 1/235500 (0.00042%) receiver
> >
> > * EP->RC (`sudo iperf3 -ub0 -l 65480 -P 4`)
> > [ 5] 0.00-10.03 sec 3.40 GBytes 2.91 Gbits/sec 0.104 ms 15467/71208 (22%) receiver
> > [ 6] 0.00-10.03 sec 3.08 GBytes 2.64 Gbits/sec 0.176 ms 12097/62609 (19%) receiver
> > [ 9] 0.00-10.03 sec 3.38 GBytes 2.90 Gbits/sec 0.270 ms 17212/72710 (24%) receiver
> > [ 11] 0.00-10.03 sec 2.56 GBytes 2.19 Gbits/sec 0.200 ms 11193/53090 (21%) receiver
>
> Almost 10x faster, 2.9G vs 279M? Highlighting this will get more people
> interested in this topic.
Thank you for the review!
OK, I'll highlight this in the next iteration.
By the way, my impression is that we can achieve even higher throughput with
this remote eDMA architecture.
>
> > [SUM] 0.00-10.03 sec 12.4 GBytes 10.6 Gbits/sec 0.188 ms 55969/259617 (22%) receiver
> >
> > [1] configfs settings:
> > # modprobe pci_epf_vntb dyndbg=+pmf
> > # cd /sys/kernel/config/pci_ep/
> > # mkdir functions/pci_epf_vntb/func1
> > # echo 0x1912 > functions/pci_epf_vntb/func1/vendorid
> > # echo 0x0030 > functions/pci_epf_vntb/func1/deviceid
> > # echo 32 > functions/pci_epf_vntb/func1/msi_interrupts
> > # echo 16 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/db_count
> > # echo 128 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/spad_count
> > # echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/num_mws
> > # echo 0xe0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1
> > # echo 0x20000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2
> > # echo 0xe0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2_offset
>
> It looks like you are trying to create small sub memory windows.
>
> Would it be cleaner as:
>
> echo 0xe0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1.0
> echo 0x20000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1.1
>
> so that mw1.1 naturally continues from the previous one.
Thanks for the suggestion.
I was trying to keep the small sub-windows referred to in the same way as
normal windows for simplicity and readability, but I agree your proposal
looks more intuitive from a user-experience point of view.
My only concern is that e.g. {mw1.0, mw1.1, mw2.0} may effectively translate
internally into something like {mw1, mw2, mw3}, and that numbering mismatch
might become confusing when reading or debugging the code.
-Koichiro
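
For illustration only, the mismatch being described could look like the
hypothetical table below (the names and indices are made up, not from the
series):

/*
 * The user writes per-BAR mwN.M attributes, but the driver would still
 * index a flat mws_size[] array internally, so the same window carries
 * two different numbers depending on where you look.
 */
static const struct {
	const char *configfs_name;	/* what the user writes */
	int flat_idx;			/* index into mws_size[] */
} example_map[] = {
	{ "mw1.0", 0 },	/* referred to internally as MW1 */
	{ "mw1.1", 1 },	/* referred to internally as MW2 */
	{ "mw2.0", 2 },	/* referred to internally as MW3 */
};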
>
> Frank
>
> > # echo 0x1912 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vntb_vid
> > # echo 0x0030 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vntb_pid
> > # echo 0x10 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vbus_number
> > # echo 0 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/ctrl_bar
> > # echo 4 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/db_bar
> > # echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1_bar
> > # echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2_bar
> > # ln -s controllers/e65d0000.pcie-ep functions/pci_epf_vntb/func1/primary/
> > # echo 1 > controllers/e65d0000.pcie-ep/start
> >
> >
> > Thanks for taking a look.
> >
> >
> > Koichiro Den (27):
> > PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on mws_size[]
> > access
> > PCI: endpoint: pci-epf-vntb: Add mwN_offset configfs attributes
> > NTB: epf: Handle mwN_offset for inbound MW regions
> > PCI: endpoint: Add inbound mapping ops to EPC core
> > PCI: dwc: ep: Implement EPC inbound mapping support
> > PCI: endpoint: pci-epf-vntb: Use pci_epc_map_inbound() for MW mapping
> > NTB: Add offset parameter to MW translation APIs
> > PCI: endpoint: pci-epf-vntb: Propagate MW offset from configfs when
> > present
> > NTB: ntb_transport: Support offsetted partial memory windows
> > NTB: core: Add .get_pci_epc() to ntb_dev_ops
> > NTB: epf: vntb: Implement .get_pci_epc() callback
> > damengine: dw-edma: Fix MSI data values for multi-vector IMWr
> > interrupts
> > NTB: ntb_transport: Use seq_file for QP stats debugfs
> > NTB: ntb_transport: Move TX memory window setup into setup_qp_mw()
> > NTB: ntb_transport: Dynamically determine qp count
> > NTB: ntb_transport: Introduce get_dma_dev() helper
> > NTB: epf: Reserve a subset of MSI vectors for non-NTB users
> > NTB: ntb_transport: Introduce ntb_transport_backend_ops
> > PCI: dwc: ep: Cache MSI outbound iATU mapping
> > NTB: ntb_transport: Introduce remote eDMA backed transport mode
> > NTB: epf: Provide db_vector_count/db_vector_mask callbacks
> > ntb_netdev: Multi-queue support
> > NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car)
> > iommu: ipmmu-vmsa: Add PCIe ch0 to devices_allowlist
> > iommu: ipmmu-vmsa: Add support for reserved regions
> > arm64: dts: renesas: Add Spider RC/EP DTs for NTB with remote DW PCIe
> > eDMA
> > NTB: epf: Add an additional memory window (MW2) barno mapping on
> > Renesas R-Car
> >
> > arch/arm64/boot/dts/renesas/Makefile | 2 +
> > .../boot/dts/renesas/r8a779f0-spider-ep.dts | 46 +
> > .../boot/dts/renesas/r8a779f0-spider-rc.dts | 52 +
> > drivers/dma/dw-edma/dw-edma-core.c | 28 +-
> > drivers/iommu/ipmmu-vmsa.c | 7 +-
> > drivers/net/ntb_netdev.c | 341 ++-
> > drivers/ntb/Kconfig | 11 +
> > drivers/ntb/Makefile | 3 +
> > drivers/ntb/hw/amd/ntb_hw_amd.c | 6 +-
> > drivers/ntb/hw/epf/ntb_hw_epf.c | 177 +-
> > drivers/ntb/hw/idt/ntb_hw_idt.c | 3 +-
> > drivers/ntb/hw/intel/ntb_hw_gen1.c | 6 +-
> > drivers/ntb/hw/intel/ntb_hw_gen1.h | 2 +-
> > drivers/ntb/hw/intel/ntb_hw_gen3.c | 3 +-
> > drivers/ntb/hw/intel/ntb_hw_gen4.c | 6 +-
> > drivers/ntb/hw/mscc/ntb_hw_switchtec.c | 6 +-
> > drivers/ntb/msi.c | 6 +-
> > drivers/ntb/ntb_edma.c | 628 ++++++
> > drivers/ntb/ntb_edma.h | 128 ++
> > .../{ntb_transport.c => ntb_transport_core.c} | 1829 ++++++++++++++---
> > drivers/ntb/test/ntb_perf.c | 4 +-
> > drivers/ntb/test/ntb_tool.c | 6 +-
> > .../pci/controller/dwc/pcie-designware-ep.c | 287 ++-
> > drivers/pci/controller/dwc/pcie-designware.h | 7 +
> > drivers/pci/endpoint/functions/pci-epf-vntb.c | 229 ++-
> > drivers/pci/endpoint/pci-epc-core.c | 44 +
> > include/linux/ntb.h | 39 +-
> > include/linux/ntb_transport.h | 21 +
> > include/linux/pci-epc.h | 11 +
> > 29 files changed, 3415 insertions(+), 523 deletions(-)
> > create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-ep.dts
> > create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-rc.dts
> > create mode 100644 drivers/ntb/ntb_edma.c
> > create mode 100644 drivers/ntb/ntb_edma.h
> > rename drivers/ntb/{ntb_transport.c => ntb_transport_core.c} (59%)
> >
> > --
> > 2.48.1
> >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 02/27] PCI: endpoint: pci-epf-vntb: Add mwN_offset configfs attributes
2025-12-01 19:11 ` Frank Li
@ 2025-12-02 6:23 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-02 6:23 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Mon, Dec 01, 2025 at 02:11:07PM -0500, Frank Li wrote:
> On Sun, Nov 30, 2025 at 01:03:40AM +0900, Koichiro Den wrote:
> > Introduce new mwN_offset configfs attributes to specify memory window
> > offsets. This enables mapping multiple windows into a single BAR at
>
> This needs a documentation update. I am also not clear on how multiple
> windows map into a single BAR; could you please give an example of how to
> configure two or more memory spaces in one BAR via echo commands to the
> configfs files?
For some reason, I mistakenly dropped the doc update that I had added in RFC v1:
https://lore.kernel.org/all/20251023071916.901355-26-den@valinux.co.jp/
I'll restore it in the next iteration.
Thanks for the review.
-Koichiro
>
> Frank
>
> > arbitrary offsets, improving layout flexibility.
> >
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > drivers/pci/endpoint/functions/pci-epf-vntb.c | 133 ++++++++++++++++--
> > 1 file changed, 120 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/pci/endpoint/functions/pci-epf-vntb.c b/drivers/pci/endpoint/functions/pci-epf-vntb.c
> > index 6c4c78915970..1ff414703566 100644
> > --- a/drivers/pci/endpoint/functions/pci-epf-vntb.c
> > +++ b/drivers/pci/endpoint/functions/pci-epf-vntb.c
> > @@ -39,6 +39,7 @@
> > #include <linux/atomic.h>
> > #include <linux/delay.h>
> > #include <linux/io.h>
> > +#include <linux/log2.h>
> > #include <linux/module.h>
> > #include <linux/slab.h>
> >
> > @@ -111,7 +112,8 @@ struct epf_ntb_ctrl {
> > u64 addr;
> > u64 size;
> > u32 num_mws;
> > - u32 reserved;
> > + u32 mw_offset[MAX_MW];
> > + u32 mw_size[MAX_MW];
> > u32 spad_offset;
> > u32 spad_count;
> > u32 db_entry_size;
> > @@ -128,6 +130,7 @@ struct epf_ntb {
> > u32 db_count;
> > u32 spad_count;
> > u64 mws_size[MAX_MW];
> > + u64 mws_offset[MAX_MW];
> > atomic64_t db;
> > u32 vbus_number;
> > u16 vntb_pid;
> > @@ -458,6 +461,8 @@ static int epf_ntb_config_spad_bar_alloc(struct epf_ntb *ntb)
> >
> > ctrl->spad_count = spad_count;
> > ctrl->num_mws = ntb->num_mws;
> > + memset(ctrl->mw_offset, 0, sizeof(ctrl->mw_offset));
> > + memset(ctrl->mw_size, 0, sizeof(ctrl->mw_size));
> > ntb->spad_size = spad_size;
> >
> > ctrl->db_entry_size = sizeof(u32);
> > @@ -689,15 +694,31 @@ static void epf_ntb_db_bar_clear(struct epf_ntb *ntb)
> > */
> > static int epf_ntb_mw_bar_init(struct epf_ntb *ntb)
> > {
> > + struct device *dev = &ntb->epf->dev;
> > + u64 bar_ends[BAR_5 + 1] = { 0 };
> > + unsigned long bars_used = 0;
> > + enum pci_barno barno;
> > + u64 off, size, end;
> > int ret = 0;
> > int i;
> > - u64 size;
> > - enum pci_barno barno;
> > - struct device *dev = &ntb->epf->dev;
> >
> > for (i = 0; i < ntb->num_mws; i++) {
> > - size = ntb->mws_size[i];
> > barno = ntb->epf_ntb_bar[BAR_MW1 + i];
> > + off = ntb->mws_offset[i];
> > + size = ntb->mws_size[i];
> > + end = off + size;
> > + if (end > bar_ends[barno])
> > + bar_ends[barno] = end;
> > + bars_used |= BIT(barno);
> > + }
> > +
> > + for (barno = BAR_0; barno <= BAR_5; barno++) {
> > + if (!(bars_used & BIT(barno)))
> > + continue;
> > + if (bar_ends[barno] < SZ_4K)
> > + size = SZ_4K;
> > + else
> > + size = roundup_pow_of_two(bar_ends[barno]);
> >
> > ntb->epf->bar[barno].barno = barno;
> > ntb->epf->bar[barno].size = size;
> > @@ -713,8 +734,12 @@ static int epf_ntb_mw_bar_init(struct epf_ntb *ntb)
> > &ntb->epf->bar[barno]);
> > if (ret) {
> > dev_err(dev, "MW set failed\n");
> > - goto err_alloc_mem;
> > + goto err_set_bar;
> > }
> > + }
> > +
> > + for (i = 0; i < ntb->num_mws; i++) {
> > + size = ntb->mws_size[i];
> >
> > /* Allocate EPC outbound memory windows to vpci vntb device */
> > ntb->vpci_mw_addr[i] = pci_epc_mem_alloc_addr(ntb->epf->epc,
> > @@ -723,19 +748,31 @@ static int epf_ntb_mw_bar_init(struct epf_ntb *ntb)
> > if (!ntb->vpci_mw_addr[i]) {
> > ret = -ENOMEM;
> > dev_err(dev, "Failed to allocate source address\n");
> > - goto err_set_bar;
> > + goto err_alloc_mem;
> > }
> > }
> >
> > + for (i = 0; i < ntb->num_mws; i++) {
> > + ntb->reg->mw_offset[i] = (u32)ntb->mws_offset[i];
> > + ntb->reg->mw_size[i] = (u32)ntb->mws_size[i];
> > + }
> > +
> > return ret;
> >
> > -err_set_bar:
> > - pci_epc_clear_bar(ntb->epf->epc,
> > - ntb->epf->func_no,
> > - ntb->epf->vfunc_no,
> > - &ntb->epf->bar[barno]);
> > err_alloc_mem:
> > - epf_ntb_mw_bar_clear(ntb, i);
> > + while (--i >= 0)
> > + pci_epc_mem_free_addr(ntb->epf->epc,
> > + ntb->vpci_mw_phy[i],
> > + ntb->vpci_mw_addr[i],
> > + ntb->mws_size[i]);
> > +err_set_bar:
> > + while (--barno >= BAR_0)
> > + if (bars_used & BIT(barno))
> > + pci_epc_clear_bar(ntb->epf->epc,
> > + ntb->epf->func_no,
> > + ntb->epf->vfunc_no,
> > + &ntb->epf->bar[barno]);
> > +
> > return ret;
> > }
> >
> > @@ -1039,6 +1076,60 @@ static ssize_t epf_ntb_##_name##_store(struct config_item *item, \
> > return len; \
> > }
> >
> > +#define EPF_NTB_MW_OFF_R(_name) \
> > +static ssize_t epf_ntb_##_name##_show(struct config_item *item, \
> > + char *page) \
> > +{ \
> > + struct config_group *group = to_config_group(item); \
> > + struct epf_ntb *ntb = to_epf_ntb(group); \
> > + struct device *dev = &ntb->epf->dev; \
> > + int win_no, idx; \
> > + \
> > + if (sscanf(#_name, "mw%d_offset", &win_no) != 1) \
> > + return -EINVAL; \
> > + \
> > + idx = win_no - 1; \
> > + if (idx < 0 || idx >= ntb->num_mws) { \
> > + dev_err(dev, "MW%d out of range (num_mws=%d)\n", \
> > + win_no, ntb->num_mws); \
> > + return -EINVAL; \
> > + } \
> > + \
> > + idx = array_index_nospec(idx, ntb->num_mws); \
> > + return sprintf(page, "%lld\n", ntb->mws_offset[idx]); \
> > +}
> > +
> > +#define EPF_NTB_MW_OFF_W(_name) \
> > +static ssize_t epf_ntb_##_name##_store(struct config_item *item, \
> > + const char *page, size_t len) \
> > +{ \
> > + struct config_group *group = to_config_group(item); \
> > + struct epf_ntb *ntb = to_epf_ntb(group); \
> > + struct device *dev = &ntb->epf->dev; \
> > + int win_no, idx; \
> > + u64 val; \
> > + int ret; \
> > + \
> > + ret = kstrtou64(page, 0, &val); \
> > + if (ret) \
> > + return ret; \
> > + \
> > + if (sscanf(#_name, "mw%d_offset", &win_no) != 1) \
> > + return -EINVAL; \
> > + \
> > + idx = win_no - 1; \
> > + if (idx < 0 || idx >= ntb->num_mws) { \
> > + dev_err(dev, "MW%d out of range (num_mws=%d)\n", \
> > + win_no, ntb->num_mws); \
> > + return -EINVAL; \
> > + } \
> > + \
> > + idx = array_index_nospec(idx, ntb->num_mws); \
> > + ntb->mws_offset[idx] = val; \
> > + \
> > + return len; \
> > +}
> > +
> > #define EPF_NTB_BAR_R(_name, _id) \
> > static ssize_t epf_ntb_##_name##_show(struct config_item *item, \
> > char *page) \
> > @@ -1109,6 +1200,14 @@ EPF_NTB_MW_R(mw3)
> > EPF_NTB_MW_W(mw3)
> > EPF_NTB_MW_R(mw4)
> > EPF_NTB_MW_W(mw4)
> > +EPF_NTB_MW_OFF_R(mw1_offset)
> > +EPF_NTB_MW_OFF_W(mw1_offset)
> > +EPF_NTB_MW_OFF_R(mw2_offset)
> > +EPF_NTB_MW_OFF_W(mw2_offset)
> > +EPF_NTB_MW_OFF_R(mw3_offset)
> > +EPF_NTB_MW_OFF_W(mw3_offset)
> > +EPF_NTB_MW_OFF_R(mw4_offset)
> > +EPF_NTB_MW_OFF_W(mw4_offset)
> > EPF_NTB_BAR_R(ctrl_bar, BAR_CONFIG)
> > EPF_NTB_BAR_W(ctrl_bar, BAR_CONFIG)
> > EPF_NTB_BAR_R(db_bar, BAR_DB)
> > @@ -1129,6 +1228,10 @@ CONFIGFS_ATTR(epf_ntb_, mw1);
> > CONFIGFS_ATTR(epf_ntb_, mw2);
> > CONFIGFS_ATTR(epf_ntb_, mw3);
> > CONFIGFS_ATTR(epf_ntb_, mw4);
> > +CONFIGFS_ATTR(epf_ntb_, mw1_offset);
> > +CONFIGFS_ATTR(epf_ntb_, mw2_offset);
> > +CONFIGFS_ATTR(epf_ntb_, mw3_offset);
> > +CONFIGFS_ATTR(epf_ntb_, mw4_offset);
> > CONFIGFS_ATTR(epf_ntb_, vbus_number);
> > CONFIGFS_ATTR(epf_ntb_, vntb_pid);
> > CONFIGFS_ATTR(epf_ntb_, vntb_vid);
> > @@ -1147,6 +1250,10 @@ static struct configfs_attribute *epf_ntb_attrs[] = {
> > &epf_ntb_attr_mw2,
> > &epf_ntb_attr_mw3,
> > &epf_ntb_attr_mw4,
> > + &epf_ntb_attr_mw1_offset,
> > + &epf_ntb_attr_mw2_offset,
> > + &epf_ntb_attr_mw3_offset,
> > + &epf_ntb_attr_mw4_offset,
> > &epf_ntb_attr_vbus_number,
> > &epf_ntb_attr_vntb_pid,
> > &epf_ntb_attr_vntb_vid,
> > --
> > 2.48.1
> >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 03/27] NTB: epf: Handle mwN_offset for inbound MW regions
2025-12-01 19:14 ` Frank Li
@ 2025-12-02 6:23 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-02 6:23 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Mon, Dec 01, 2025 at 02:14:03PM -0500, Frank Li wrote:
> On Sun, Nov 30, 2025 at 01:03:41AM +0900, Koichiro Den wrote:
> > Add and use new fields in the common control register to convey both
> > offset and size for each memory window (MW), so that it can correctly
> > handle flexible MW layouts and support partial BAR mappings.
> >
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > drivers/ntb/hw/epf/ntb_hw_epf.c | 27 ++++++++++++++++-----------
> > 1 file changed, 16 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
> > index d3ecf25a5162..91d3f8e05807 100644
> > --- a/drivers/ntb/hw/epf/ntb_hw_epf.c
> > +++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
> > @@ -36,12 +36,13 @@
> > #define NTB_EPF_LOWER_SIZE 0x18
> > #define NTB_EPF_UPPER_SIZE 0x1C
> > #define NTB_EPF_MW_COUNT 0x20
> > -#define NTB_EPF_MW1_OFFSET 0x24
> > -#define NTB_EPF_SPAD_OFFSET 0x28
> > -#define NTB_EPF_SPAD_COUNT 0x2C
> > -#define NTB_EPF_DB_ENTRY_SIZE 0x30
> > -#define NTB_EPF_DB_DATA(n) (0x34 + (n) * 4)
> > -#define NTB_EPF_DB_OFFSET(n) (0xB4 + (n) * 4)
> > +#define NTB_EPF_MW_OFFSET(n) (0x24 + (n) * 4)
> > +#define NTB_EPF_MW_SIZE(n) (0x34 + (n) * 4)
> > +#define NTB_EPF_SPAD_OFFSET 0x44
> > +#define NTB_EPF_SPAD_COUNT 0x48
> > +#define NTB_EPF_DB_ENTRY_SIZE 0x4C
> > +#define NTB_EPF_DB_DATA(n) (0x50 + (n) * 4)
> > +#define NTB_EPF_DB_OFFSET(n) (0xD0 + (n) * 4)
>
> You need to check for version differences when you change the register
> layout. The EP and RC sides do not necessarily run the same kernel; the EP
> side may be running an old version while the RC side runs a new one.
That totally makes sense; I'll do so. Thank you.
-Koichiro
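
For illustration, a version check on the RC side could look roughly like the
sketch below. The layout_version field is purely hypothetical (the posted
series defines no such register); NTB_EPF_MW_OFFSET()/NTB_EPF_MW_SIZE() are
the macros introduced by this patch, and 0x24 is the pre-series
NTB_EPF_MW1_OFFSET location:

/* Sketch only: "layout_version" is an imaginary field for illustration. */
static void ntb_epf_mw_get_offset_size(struct ntb_epf_dev *ndev, int idx,
				       u32 *offset, u32 *size)
{
	if (ndev->layout_version >= 2) {
		/* New layout from this series: per-MW offset/size arrays */
		*offset = readl(ndev->ctrl_reg + NTB_EPF_MW_OFFSET(idx));
		*size = readl(ndev->ctrl_reg + NTB_EPF_MW_SIZE(idx));
	} else {
		/* Legacy layout: only MW1 carried an offset, and no size */
		*offset = idx ? 0 : readl(ndev->ctrl_reg + 0x24);
		*size = 0;
	}
}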
>
> Frank
>
> >
> > #define NTB_EPF_MIN_DB_COUNT 3
> > #define NTB_EPF_MAX_DB_COUNT 31
> > @@ -451,11 +452,12 @@ static int ntb_epf_peer_mw_get_addr(struct ntb_dev *ntb, int idx,
> > phys_addr_t *base, resource_size_t *size)
> > {
> > struct ntb_epf_dev *ndev = ntb_ndev(ntb);
> > - u32 offset = 0;
> > + resource_size_t bar_sz;
> > + u32 offset, sz;
> > int bar;
> >
> > - if (idx == 0)
> > - offset = readl(ndev->ctrl_reg + NTB_EPF_MW1_OFFSET);
> > + offset = readl(ndev->ctrl_reg + NTB_EPF_MW_OFFSET(idx));
> > + sz = readl(ndev->ctrl_reg + NTB_EPF_MW_SIZE(idx));
> >
> > bar = ntb_epf_mw_to_bar(ndev, idx);
> > if (bar < 0)
> > @@ -464,8 +466,11 @@ static int ntb_epf_peer_mw_get_addr(struct ntb_dev *ntb, int idx,
> > if (base)
> > *base = pci_resource_start(ndev->ntb.pdev, bar) + offset;
> >
> > - if (size)
> > - *size = pci_resource_len(ndev->ntb.pdev, bar) - offset;
> > + if (size) {
> > + bar_sz = pci_resource_len(ndev->ntb.pdev, bar);
> > + *size = sz ? min_t(resource_size_t, sz, bar_sz - offset)
> > + : (bar_sz > offset ? bar_sz - offset : 0);
> > + }
> >
> > return 0;
> > }
> > --
> > 2.48.1
> >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 04/27] PCI: endpoint: Add inbound mapping ops to EPC core
2025-12-01 19:19 ` Frank Li
@ 2025-12-02 6:25 ` Koichiro Den
2025-12-02 15:58 ` Frank Li
0 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-12-02 6:25 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Mon, Dec 01, 2025 at 02:19:55PM -0500, Frank Li wrote:
> On Sun, Nov 30, 2025 at 01:03:42AM +0900, Koichiro Den wrote:
> > Add new EPC ops map_inbound() and unmap_inbound() for mapping a subrange
> > of a BAR into CPU space. These will be implemented by controller drivers
> > such as DesignWare.
> >
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > drivers/pci/endpoint/pci-epc-core.c | 44 +++++++++++++++++++++++++++++
> > include/linux/pci-epc.h | 11 ++++++++
> > 2 files changed, 55 insertions(+)
> >
> > diff --git a/drivers/pci/endpoint/pci-epc-core.c b/drivers/pci/endpoint/pci-epc-core.c
> > index ca7f19cc973a..825109e54ba9 100644
> > --- a/drivers/pci/endpoint/pci-epc-core.c
> > +++ b/drivers/pci/endpoint/pci-epc-core.c
> > @@ -444,6 +444,50 @@ int pci_epc_map_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > }
> > EXPORT_SYMBOL_GPL(pci_epc_map_addr);
> >
> > +/**
> > + * pci_epc_map_inbound() - map a BAR subrange to the local CPU address
> > + * @epc: the EPC device on which BAR has to be configured
> > + * @func_no: the physical endpoint function number in the EPC device
> > + * @vfunc_no: the virtual endpoint function number in the physical function
> > + * @epf_bar: the struct epf_bar that contains the BAR information
> > + * @offset: byte offset from the BAR base selected by the host
> > + *
> > + * Invoke to configure the BAR of the endpoint device and map a subrange
> > + * selected by @offset to a CPU address.
> > + *
> > + * Returns 0 on success, -EOPNOTSUPP if unsupported, or a negative errno.
> > + */
> > +int pci_epc_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > + struct pci_epf_bar *epf_bar, u64 offset)
>
> Shouldn't this also take size information? If the BAR's size is 4G,
>
> you may want to map only 0x4000 to 0x5000, not the whole range from the
> offset to the end of the BAR.
That sounds reasonable; the interface should accept a size parameter so that it
is flexible enough to configure arbitrary sub-ranges inside a BAR, instead of
implicitly using "offset to end of BAR".
For the ntb_transport use_remote_edma=1 testing on R‑Car S4 I only needed at
most two sub-ranges inside one BAR, so a size parameter was not strictly
necessary in that setup, but I agree that the current interface looks
half-baked and not very generic. I'll extend it to take size as well.
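
For illustration, a size-aware variant of the wrapper could look like the
sketch below; this only sketches the direction agreed above, it is not code
from the posted series (which takes no size argument yet):

/* Sketch of a size-extended pci_epc_map_inbound(); not part of this RFC. */
int pci_epc_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
			struct pci_epf_bar *epf_bar, u64 offset, size_t size)
{
	if (!epc || !epc->ops || !epc->ops->map_inbound)
		return -EOPNOTSUPP;

	return epc->ops->map_inbound(epc, func_no, vfunc_no, epf_bar,
				     offset, size);
}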
>
> The commit message says this maps into CPU space; where is the CPU address?
The interface currently requires a pointer to a struct pci_epf_bar instance and
uses its phys_addr field as the CPU physical base address.
I'm not fully convinced that using struct pci_epf_bar this way is the cleanest
approach, so I'm open to better suggestions if you have any.
Koichiro
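
As a rough caller-side sketch of the semantics described above (the
vntb_map_subrange() helper name is made up for illustration;
pci_epc_map_inbound() is the wrapper proposed in this patch):

/*
 * The caller sets epf_bar->phys_addr to the CPU physical base of the
 * EP-local backing memory; map_inbound() then programs an inbound iATU so
 * that host accesses to "BAR + offset" land in that memory.
 */
static int vntb_map_subrange(struct pci_epf *epf, struct pci_epf_bar *bar,
			     u64 offset)
{
	/* bar->phys_addr must already point at the EP-local buffer */
	return pci_epc_map_inbound(epf->epc, epf->func_no, epf->vfunc_no,
				   bar, offset);
}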
>
> Frank
> > +{
> > + if (!epc || !epc->ops || !epc->ops->map_inbound)
> > + return -EOPNOTSUPP;
> > +
> > + return epc->ops->map_inbound(epc, func_no, vfunc_no, epf_bar, offset);
> > +}
> > +EXPORT_SYMBOL_GPL(pci_epc_map_inbound);
> > +
> > +/**
> > + * pci_epc_unmap_inbound() - unmap a previously mapped BAR subrange
> > + * @epc: the EPC device on which the inbound mapping was programmed
> > + * @func_no: the physical endpoint function number in the EPC device
> > + * @vfunc_no: the virtual endpoint function number in the physical function
> > + * @epf_bar: the struct epf_bar used when the mapping was created
> > + * @offset: byte offset from the BAR base that was mapped
> > + *
> > + * Invoke to remove a BAR subrange mapping created by pci_epc_map_inbound().
> > + * If the controller has no support, this call is a no-op.
> > + */
> > +void pci_epc_unmap_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > + struct pci_epf_bar *epf_bar, u64 offset)
> > +{
> > + if (!epc || !epc->ops || !epc->ops->unmap_inbound)
> > + return;
> > +
> > + epc->ops->unmap_inbound(epc, func_no, vfunc_no, epf_bar, offset);
> > +}
> > +EXPORT_SYMBOL_GPL(pci_epc_unmap_inbound);
> > +
> > /**
> > * pci_epc_mem_map() - allocate and map a PCI address to a CPU address
> > * @epc: the EPC device on which the CPU address is to be allocated and mapped
> > diff --git a/include/linux/pci-epc.h b/include/linux/pci-epc.h
> > index 4286bfdbfdfa..a5fb91cc2982 100644
> > --- a/include/linux/pci-epc.h
> > +++ b/include/linux/pci-epc.h
> > @@ -71,6 +71,8 @@ struct pci_epc_map {
> > * region
> > * @map_addr: ops to map CPU address to PCI address
> > * @unmap_addr: ops to unmap CPU address and PCI address
> > + * @map_inbound: ops to map a subrange inside a BAR to CPU address.
> > + * @unmap_inbound: ops to unmap a subrange inside a BAR and CPU address.
> > * @set_msi: ops to set the requested number of MSI interrupts in the MSI
> > * capability register
> > * @get_msi: ops to get the number of MSI interrupts allocated by the RC from
> > @@ -99,6 +101,10 @@ struct pci_epc_ops {
> > phys_addr_t addr, u64 pci_addr, size_t size);
> > void (*unmap_addr)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > phys_addr_t addr);
> > + int (*map_inbound)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > + struct pci_epf_bar *epf_bar, u64 offset);
> > + void (*unmap_inbound)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > + struct pci_epf_bar *epf_bar, u64 offset);
> > int (*set_msi)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > u8 nr_irqs);
> > int (*get_msi)(struct pci_epc *epc, u8 func_no, u8 vfunc_no);
> > @@ -286,6 +292,11 @@ int pci_epc_map_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > u64 pci_addr, size_t size);
> > void pci_epc_unmap_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > phys_addr_t phys_addr);
> > +
> > +int pci_epc_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > + struct pci_epf_bar *epf_bar, u64 offset);
> > +void pci_epc_unmap_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > + struct pci_epf_bar *epf_bar, u64 offset);
> > int pci_epc_set_msi(struct pci_epc *epc, u8 func_no, u8 vfunc_no, u8 nr_irqs);
> > int pci_epc_get_msi(struct pci_epc *epc, u8 func_no, u8 vfunc_no);
> > int pci_epc_set_msix(struct pci_epc *epc, u8 func_no, u8 vfunc_no, u16 nr_irqs,
> > --
> > 2.48.1
> >
* Re: [RFC PATCH v2 05/27] PCI: dwc: ep: Implement EPC inbound mapping support
2025-12-01 19:32 ` Frank Li
@ 2025-12-02 6:26 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-02 6:26 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Mon, Dec 01, 2025 at 02:32:10PM -0500, Frank Li wrote:
> On Sun, Nov 30, 2025 at 01:03:43AM +0900, Koichiro Den wrote:
> > Implement map_inbound() and unmap_inbound() for DesignWare endpoint
> > controllers (Address Match mode). Allows subrange mappings within a BAR,
> > enabling advanced endpoint functions such as NTB with offset-based
> > windows.
> >
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > .../pci/controller/dwc/pcie-designware-ep.c | 239 +++++++++++++++---
> > drivers/pci/controller/dwc/pcie-designware.h | 2 +
> > 2 files changed, 212 insertions(+), 29 deletions(-)
> >
> > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > index 19571ac2b961..3780a9bd6f79 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> ...
> >
> > +static int dw_pcie_ep_set_bar_init(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > + struct pci_epf_bar *epf_bar)
> > +{
> > + struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> > + struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > + enum pci_barno bar = epf_bar->barno;
> > + enum pci_epc_bar_type bar_type;
> > + int ret;
> > +
> > + bar_type = dw_pcie_ep_get_bar_type(ep, bar);
> > + switch (bar_type) {
> > + case BAR_FIXED:
> > + /*
> > + * There is no need to write a BAR mask for a fixed BAR (except
> > + * to write 1 to the LSB of the BAR mask register, to enable the
> > + * BAR). Write the BAR mask regardless. (The fixed bits in the
> > + * BAR mask register will be read-only anyway.)
> > + */
> > + fallthrough;
> > + case BAR_PROGRAMMABLE:
> > + ret = dw_pcie_ep_set_bar_programmable(ep, func_no, epf_bar);
> > + break;
> > + case BAR_RESIZABLE:
> > + ret = dw_pcie_ep_set_bar_resizable(ep, func_no, epf_bar);
> > + break;
> > + default:
> > + ret = -EINVAL;
> > + dev_err(pci->dev, "Invalid BAR type\n");
> > + break;
> > + }
> > +
> > + return ret;
> > +}
> > +
>
> Please create a new patch for the above code movement.
Ok I'll do so, thanks.
Koichiro
>
> > static int dw_pcie_ep_set_bar(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > struct pci_epf_bar *epf_bar)
> > {
> > struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> > - struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > enum pci_barno bar = epf_bar->barno;
> > size_t size = epf_bar->size;
> > - enum pci_epc_bar_type bar_type;
> > int flags = epf_bar->flags;
> > int ret, type;
> >
> > @@ -374,35 +429,12 @@ static int dw_pcie_ep_set_bar(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > * When dynamically changing a BAR, skip writing the BAR reg, as
> > * that would clear the BAR's PCI address assigned by the host.
> > */
> > - goto config_atu;
> > - }
> > -
> > - bar_type = dw_pcie_ep_get_bar_type(ep, bar);
> > - switch (bar_type) {
> > - case BAR_FIXED:
> > - /*
> > - * There is no need to write a BAR mask for a fixed BAR (except
> > - * to write 1 to the LSB of the BAR mask register, to enable the
> > - * BAR). Write the BAR mask regardless. (The fixed bits in the
> > - * BAR mask register will be read-only anyway.)
> > - */
> > - fallthrough;
> > - case BAR_PROGRAMMABLE:
> > - ret = dw_pcie_ep_set_bar_programmable(ep, func_no, epf_bar);
> > - break;
> > - case BAR_RESIZABLE:
> > - ret = dw_pcie_ep_set_bar_resizable(ep, func_no, epf_bar);
> > - break;
> > - default:
> > - ret = -EINVAL;
> > - dev_err(pci->dev, "Invalid BAR type\n");
> > - break;
> > + } else {
> > + ret = dw_pcie_ep_set_bar_init(epc, func_no, vfunc_no, epf_bar);
> > + if (ret)
> > + return ret;
> > }
> >
> > - if (ret)
> > - return ret;
> > -
> > -config_atu:
> > if (!(flags & PCI_BASE_ADDRESS_SPACE))
> > type = PCIE_ATU_TYPE_MEM;
> > else
> > @@ -488,6 +520,151 @@ static int dw_pcie_ep_map_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > return 0;
> > }
> >
> ...
> > +
> > +static int dw_pcie_ep_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > + struct pci_epf_bar *epf_bar, u64 offset)
> > +{
> > + struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> > + struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > + enum pci_barno bar = epf_bar->barno;
> > + size_t size = epf_bar->size;
> > + int flags = epf_bar->flags;
> > + struct dw_pcie_ib_map *m;
> > + u64 base, pci_addr;
> > + int ret, type, win;
> > +
> > + /*
> > + * DWC does not allow BAR pairs to overlap, e.g. you cannot combine BARs
> > + * 1 and 2 to form a 64-bit BAR.
> > + */
> > + if ((flags & PCI_BASE_ADDRESS_MEM_TYPE_64) && (bar & 1))
> > + return -EINVAL;
> > +
> > + /*
> > + * Certain EPF drivers dynamically change the physical address of a BAR
> > + * (i.e. they call set_bar() twice, without ever calling clear_bar(), as
> > + * calling clear_bar() would clear the BAR's PCI address assigned by the
> > + * host).
> > + */
> > + if (epf_bar->phys_addr && ep->epf_bar[bar]) {
> > + /*
> > + * We can only dynamically add a whole or partial mapping if the
> > + * BAR flags do not differ from the existing configuration.
> > + */
> > + if (ep->epf_bar[bar]->barno != bar ||
> > + ep->epf_bar[bar]->flags != flags)
> > + return -EINVAL;
> > +
> > + /*
> > + * When dynamically changing a BAR, skip writing the BAR reg, as
> > + * that would clear the BAR's PCI address assigned by the host.
> > + */
> > + }
> > +
> > + /*
> > + * Skip programming the inbound translation if phys_addr is 0.
> > + * In this case, the caller only intends to initialize the BAR.
> > + */
> > + if (!epf_bar->phys_addr) {
> > + ret = dw_pcie_ep_set_bar_init(epc, func_no, vfunc_no, epf_bar);
> > + ep->epf_bar[bar] = epf_bar;
> > + return ret;
> > + }
> > +
> > + base = dw_pcie_ep_read_bar_assigned(ep, func_no, bar,
> > + flags & PCI_BASE_ADDRESS_SPACE,
> > + flags & PCI_BASE_ADDRESS_MEM_TYPE_64);
> > + if (!(flags & PCI_BASE_ADDRESS_SPACE))
> > + type = PCIE_ATU_TYPE_MEM;
> > + else
> > + type = PCIE_ATU_TYPE_IO;
> > + pci_addr = base + offset;
> > +
> > + /* Allocate an inbound iATU window */
> > + win = find_first_zero_bit(ep->ib_window_map, pci->num_ib_windows);
> > + if (win >= pci->num_ib_windows)
> > + return -ENOSPC;
> > +
> > + /* Program address-match inbound iATU */
> > + ret = dw_pcie_prog_inbound_atu(pci, win, type,
> > + epf_bar->phys_addr - pci->parent_bus_offset,
> > + pci_addr, size);
> > + if (ret)
> > + return ret;
> > +
> > + m = kzalloc(sizeof(*m), GFP_KERNEL);
> > + if (!m) {
> > + dw_pcie_disable_atu(pci, PCIE_ATU_REGION_DIR_IB, win);
> > + return -ENOMEM;
> > + }
>
> At least this allocation should be above dw_pcie_prog_inbound_atu(); if it
> returns an error here, the ATU has already been programmed.
>
>
> > + m->bar = bar;
> > + m->pci_addr = pci_addr;
> > + m->cpu_addr = epf_bar->phys_addr;
> > + m->size = size;
> > + m->index = win;
> > +
> > + guard(spinlock_irqsave)(&ep->ib_map_lock);
> > + set_bit(win, ep->ib_window_map);
> > + list_add(&m->node, &ep->ib_map_list);
> > +
> > + return 0;
> > +}
> > +
> ...
> >
> > INIT_LIST_HEAD(&ep->func_list);
> > + INIT_LIST_HEAD(&ep->ib_map_list);
> > + spin_lock_init(&ep->ib_map_lock);
> >
> > epc = devm_pci_epc_create(dev, &epc_ops);
> > if (IS_ERR(epc)) {
> > diff --git a/drivers/pci/controller/dwc/pcie-designware.h b/drivers/pci/controller/dwc/pcie-designware.h
> > index 31685951a080..269a9fe0501f 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware.h
> > +++ b/drivers/pci/controller/dwc/pcie-designware.h
> > @@ -476,6 +476,8 @@ struct dw_pcie_ep {
> > phys_addr_t *outbound_addr;
> > unsigned long *ib_window_map;
> > unsigned long *ob_window_map;
> > + struct list_head ib_map_list;
> > + spinlock_t ib_map_lock;
>
> Please add a comment for the lock.
>
> Frank
> > void __iomem *msi_mem;
> > phys_addr_t msi_mem_phys;
> > struct pci_epf_bar *epf_bar[PCI_STD_NUM_BARS];
> > --
> > 2.48.1
> >
* Re: [RFC PATCH v2 06/27] PCI: endpoint: pci-epf-vntb: Use pci_epc_map_inbound() for MW mapping
2025-12-01 19:34 ` Frank Li
@ 2025-12-02 6:26 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-02 6:26 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Mon, Dec 01, 2025 at 02:34:07PM -0500, Frank Li wrote:
> On Sun, Nov 30, 2025 at 01:03:44AM +0900, Koichiro Den wrote:
> > Switch MW setup to use pci_epc_map_inbound() when supported. This allows
> > mapping portions of a BAR rather than the entire region, supporting
> > partial BAR usage on capable controllers.
> >
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > drivers/pci/endpoint/functions/pci-epf-vntb.c | 21 ++++++++++++++-----
> > 1 file changed, 16 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/pci/endpoint/functions/pci-epf-vntb.c b/drivers/pci/endpoint/functions/pci-epf-vntb.c
> > index 1ff414703566..42e57721dcb4 100644
> > --- a/drivers/pci/endpoint/functions/pci-epf-vntb.c
> > +++ b/drivers/pci/endpoint/functions/pci-epf-vntb.c
> > @@ -728,10 +728,15 @@ static int epf_ntb_mw_bar_init(struct epf_ntb *ntb)
> > PCI_BASE_ADDRESS_MEM_TYPE_64 :
> > PCI_BASE_ADDRESS_MEM_TYPE_32;
> >
> > - ret = pci_epc_set_bar(ntb->epf->epc,
> > - ntb->epf->func_no,
> > - ntb->epf->vfunc_no,
> > - &ntb->epf->bar[barno]);
> > + ret = pci_epc_map_inbound(ntb->epf->epc,
> > + ntb->epf->func_no,
> > + ntb->epf->vfunc_no,
> > + &ntb->epf->bar[barno], 0);
> > + if (ret == -EOPNOTSUPP)
> > + ret = pci_epc_set_bar(ntb->epf->epc,
> > + ntb->epf->func_no,
> > + ntb->epf->vfunc_no,
> > + &ntb->epf->bar[barno]);
> > if (ret) {
> > dev_err(dev, "MW set failed\n");
> > goto err_set_bar;
> > @@ -1385,17 +1390,23 @@ static int vntb_epf_mw_set_trans(struct ntb_dev *ndev, int pidx, int idx,
> > struct epf_ntb *ntb = ntb_ndev(ndev);
> > struct pci_epf_bar *epf_bar;
> > enum pci_barno barno;
> > + struct pci_epc *epc;
> > int ret;
> > struct device *dev;
> >
> > + epc = ntb->epf->epc;
> > dev = &ntb->ntb.dev;
> > barno = ntb->epf_ntb_bar[BAR_MW1 + idx];
> > +
>
> Nit: unnecessary change here
I'll drop it in the next iteration. Thanks,
Koichiro
>
> Reviewed-by: Frank Li <Frank.Li@nxp.com>
>
> > epf_bar = &ntb->epf->bar[barno];
> > epf_bar->phys_addr = addr;
> > epf_bar->barno = barno;
> > epf_bar->size = size;
> >
> > - ret = pci_epc_set_bar(ntb->epf->epc, 0, 0, epf_bar);
> > + ret = pci_epc_map_inbound(epc, 0, 0, epf_bar, 0);
> > + if (ret == -EOPNOTSUPP)
> > + ret = pci_epc_set_bar(epc, 0, 0, epf_bar);
> > +
> > if (ret) {
> > dev_err(dev, "failure set mw trans\n");
> > return ret;
> > --
> > 2.48.1
> >
* Re: [RFC PATCH v2 10/27] NTB: core: Add .get_pci_epc() to ntb_dev_ops
2025-12-01 19:39 ` Frank Li
@ 2025-12-02 6:31 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-02 6:31 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Mon, Dec 01, 2025 at 02:39:25PM -0500, Frank Li wrote:
> On Sun, Nov 30, 2025 at 01:03:48AM +0900, Koichiro Den wrote:
> > Add an optional get_pci_epc() callback to retrieve the underlying
> > pci_epc device associated with the NTB implementation.
>
> The EPC runs on the EP side, while this code runs on the RC side. Why does it
> need to get the EP side's controller pointer?
Thanks for pointing it out; that inclusion was just a mistake.
I'm unsure how it slipped in; I guess I was naively trying to get rid of
'enum pci_barno' by (ab)using pci-epf.h, which is unrelated.
I'll drop the changes to ntb_hw_epf.c.
-Koichiro
>
> Frank
> >
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > drivers/ntb/hw/epf/ntb_hw_epf.c | 11 +----------
> > include/linux/ntb.h | 21 +++++++++++++++++++++
> > 2 files changed, 22 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
> > index a3ec411bfe49..d55ce6b0fad4 100644
> > --- a/drivers/ntb/hw/epf/ntb_hw_epf.c
> > +++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
> > @@ -9,6 +9,7 @@
> > #include <linux/delay.h>
> > #include <linux/module.h>
> > #include <linux/pci.h>
> > +#include <linux/pci-epf.h>
> > #include <linux/slab.h>
> > #include <linux/ntb.h>
> >
> > @@ -49,16 +50,6 @@
> >
> > #define NTB_EPF_COMMAND_TIMEOUT 1000 /* 1 Sec */
> >
> > -enum pci_barno {
> > - NO_BAR = -1,
> > - BAR_0,
> > - BAR_1,
> > - BAR_2,
> > - BAR_3,
> > - BAR_4,
> > - BAR_5,
> > -};
> > -
> > enum epf_ntb_bar {
> > BAR_CONFIG,
> > BAR_PEER_SPAD,
> > diff --git a/include/linux/ntb.h b/include/linux/ntb.h
> > index d7ce5d2e60d0..04dc9a4d6b85 100644
> > --- a/include/linux/ntb.h
> > +++ b/include/linux/ntb.h
> > @@ -64,6 +64,7 @@ struct ntb_client;
> > struct ntb_dev;
> > struct ntb_msi;
> > struct pci_dev;
> > +struct pci_epc;
> >
> > /**
> > * enum ntb_topo - NTB connection topology
> > @@ -256,6 +257,7 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops *ops)
> > * @msg_clear_mask: See ntb_msg_clear_mask().
> > * @msg_read: See ntb_msg_read().
> > * @peer_msg_write: See ntb_peer_msg_write().
> > + * @get_pci_epc: See ntb_get_pci_epc().
> > */
> > struct ntb_dev_ops {
> > int (*port_number)(struct ntb_dev *ntb);
> > @@ -331,6 +333,7 @@ struct ntb_dev_ops {
> > int (*msg_clear_mask)(struct ntb_dev *ntb, u64 mask_bits);
> > u32 (*msg_read)(struct ntb_dev *ntb, int *pidx, int midx);
> > int (*peer_msg_write)(struct ntb_dev *ntb, int pidx, int midx, u32 msg);
> > + struct pci_epc *(*get_pci_epc)(struct ntb_dev *ntb);
> > };
> >
> > static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> > @@ -393,6 +396,9 @@ static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> > /* !ops->msg_clear_mask == !ops->msg_count && */
> > !ops->msg_read == !ops->msg_count &&
> > !ops->peer_msg_write == !ops->msg_count &&
> > +
> > + /* Miscellaneous optional callbacks */
> > + /* ops->get_pci_epc && */
> > 1;
> > }
> >
> > @@ -1567,6 +1573,21 @@ static inline int ntb_peer_msg_write(struct ntb_dev *ntb, int pidx, int midx,
> > return ntb->ops->peer_msg_write(ntb, pidx, midx, msg);
> > }
> >
> > +/**
> > + * ntb_get_pci_epc() - get backing PCI endpoint controller if possible.
> > + * @ntb: NTB device context.
> > + *
> > + * Get the backing PCI endpoint controller representation.
> > + *
> > + * Return: A pointer to the pci_epc instance if available. or %NULL if not.
> > + */
> > +static inline struct pci_epc __maybe_unused *ntb_get_pci_epc(struct ntb_dev *ntb)
> > +{
> > + if (!ntb->ops->get_pci_epc)
> > + return NULL;
> > + return ntb->ops->get_pci_epc(ntb);
> > +}
> > +
> > /**
> > * ntb_peer_resource_idx() - get a resource index for a given peer idx
> > * @ntb: NTB device context.
> > --
> > 2.48.1
> >
* Re: [RFC PATCH v2 10/27] NTB: core: Add .get_pci_epc() to ntb_dev_ops
2025-12-01 21:08 ` Dave Jiang
@ 2025-12-02 6:32 ` Koichiro Den
2025-12-02 14:49 ` Dave Jiang
0 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-12-02 6:32 UTC (permalink / raw)
To: Dave Jiang
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Mon, Dec 01, 2025 at 02:08:14PM -0700, Dave Jiang wrote:
>
>
> On 11/29/25 9:03 AM, Koichiro Den wrote:
> > Add an optional get_pci_epc() callback to retrieve the underlying
> > pci_epc device associated with the NTB implementation.
> >
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > drivers/ntb/hw/epf/ntb_hw_epf.c | 11 +----------
> > include/linux/ntb.h | 21 +++++++++++++++++++++
> > 2 files changed, 22 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
> > index a3ec411bfe49..d55ce6b0fad4 100644
> > --- a/drivers/ntb/hw/epf/ntb_hw_epf.c
> > +++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
> > @@ -9,6 +9,7 @@
> > #include <linux/delay.h>
> > #include <linux/module.h>
> > #include <linux/pci.h>
> > +#include <linux/pci-epf.h>
> > #include <linux/slab.h>
> > #include <linux/ntb.h>
> >
> > @@ -49,16 +50,6 @@
> >
> > #define NTB_EPF_COMMAND_TIMEOUT 1000 /* 1 Sec */
> >
> > -enum pci_barno {
> > - NO_BAR = -1,
> > - BAR_0,
> > - BAR_1,
> > - BAR_2,
> > - BAR_3,
> > - BAR_4,
> > - BAR_5,
> > -};
> > -
> > enum epf_ntb_bar {
> > BAR_CONFIG,
> > BAR_PEER_SPAD,
> > diff --git a/include/linux/ntb.h b/include/linux/ntb.h
> > index d7ce5d2e60d0..04dc9a4d6b85 100644
> > --- a/include/linux/ntb.h
> > +++ b/include/linux/ntb.h
> > @@ -64,6 +64,7 @@ struct ntb_client;
> > struct ntb_dev;
> > struct ntb_msi;
> > struct pci_dev;
> > +struct pci_epc;
> >
> > /**
> > * enum ntb_topo - NTB connection topology
> > @@ -256,6 +257,7 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops *ops)
> > * @msg_clear_mask: See ntb_msg_clear_mask().
> > * @msg_read: See ntb_msg_read().
> > * @peer_msg_write: See ntb_peer_msg_write().
> > + * @get_pci_epc: See ntb_get_pci_epc().
> > */
> > struct ntb_dev_ops {
> > int (*port_number)(struct ntb_dev *ntb);
> > @@ -331,6 +333,7 @@ struct ntb_dev_ops {
> > int (*msg_clear_mask)(struct ntb_dev *ntb, u64 mask_bits);
> > u32 (*msg_read)(struct ntb_dev *ntb, int *pidx, int midx);
> > int (*peer_msg_write)(struct ntb_dev *ntb, int pidx, int midx, u32 msg);
> > + struct pci_epc *(*get_pci_epc)(struct ntb_dev *ntb);
>
> This seems like a very specific call tied to this particular hardware rather than something generic for the NTB dev ops. Maybe it should be something like get_private_data() or similar?
Thank you for the suggestion.
I also felt that it's too specific, but I couldn't come up with a clean
generic interface at the time, so I left it in this form.
.get_private_data() might indeed be better. In the callback doc comment we
could describe it as "may be used to obtain a backing PCI controller
pointer"?
-Koichiro
>
> DJ
>
>
> > };
> >
> > static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> > @@ -393,6 +396,9 @@ static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> > /* !ops->msg_clear_mask == !ops->msg_count && */
> > !ops->msg_read == !ops->msg_count &&
> > !ops->peer_msg_write == !ops->msg_count &&
> > +
> > + /* Miscellaneous optional callbacks */
> > + /* ops->get_pci_epc && */
> > 1;
> > }
> >
> > @@ -1567,6 +1573,21 @@ static inline int ntb_peer_msg_write(struct ntb_dev *ntb, int pidx, int midx,
> > return ntb->ops->peer_msg_write(ntb, pidx, midx, msg);
> > }
> >
> > +/**
> > + * ntb_get_pci_epc() - get backing PCI endpoint controller if possible.
> > + * @ntb: NTB device context.
> > + *
> > + * Get the backing PCI endpoint controller representation.
> > + *
> > + * Return: A pointer to the pci_epc instance if available. or %NULL if not.
> > + */
> > +static inline struct pci_epc __maybe_unused *ntb_get_pci_epc(struct ntb_dev *ntb)
> > +{
> > + if (!ntb->ops->get_pci_epc)
> > + return NULL;
> > + return ntb->ops->get_pci_epc(ntb);
> > +}
> > +
> > /**
> > * ntb_peer_resource_idx() - get a resource index for a given peer idx
> > * @ntb: NTB device context.
>
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-11-29 16:03 ` [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping Koichiro Den
2025-12-01 20:41 ` Frank Li
@ 2025-12-02 6:32 ` Niklas Cassel
2025-12-03 8:30 ` Koichiro Den
2025-12-08 7:57 ` Niklas Cassel
2025-12-12 3:38 ` Manivannan Sadhasivam
3 siblings, 1 reply; 97+ messages in thread
From: Niklas Cassel @ 2025-12-02 6:32 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring, Damien Le Moal
On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> for the MSI target address on every interrupt and tears it down again
> via dw_pcie_ep_unmap_addr().
>
> On systems that heavily use the AXI bridge interface (for example when
> the integrated eDMA engine is active), this means the outbound iATU
> registers are updated while traffic is in flight. The DesignWare
> endpoint spec warns that updating iATU registers in this situation is
> not supported, and the behavior is undefined.
Please reference a specific section in the EP databook, and the specific
EP databook version that you are using.
This patch appears to address quite a serious issue, so I think that you
should submit it as a standalone patch, and not as part of a series.
(Especially not as part of an RFC which can take quite long before it is
even submitted as a normal (non-RFC) series.)
Kind regards,
Niklas
* Re: [RFC PATCH v2 12/27] dmaengine: dw-edma: Fix MSI data values for multi-vector IMWr interrupts
2025-12-01 19:46 ` Frank Li
@ 2025-12-02 6:32 ` Koichiro Den
2025-12-18 6:52 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-12-02 6:32 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Mon, Dec 01, 2025 at 02:46:43PM -0500, Frank Li wrote:
> On Sun, Nov 30, 2025 at 01:03:50AM +0900, Koichiro Den wrote:
> > When multiple MSI vectors are allocated for the DesignWare eDMA, the
> > driver currently records the same MSI message for all IRQs by calling
> > get_cached_msi_msg() per vector. For multi-vector MSI (as opposed to
> > MSI-X), the cached message corresponds to vector 0 and msg.data is
> > supposed to be adjusted by the IRQ index.
> >
> > As a result, all eDMA interrupts share the same MSI data value and the
> > interrupt controller cannot distinguish between them.
> >
> > Introduce dw_edma_compose_msi() to construct the correct MSI message for
> > each vector. For MSI-X nothing changes. For multi-vector MSI, derive the
> > base IRQ with msi_get_virq(dev, 0) and OR in the per-vector offset into
> > msg.data before storing it in dw->irq[i].msi.
> >
> > This makes each IMWr MSI vector use a unique MSI data value.
> >
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > drivers/dma/dw-edma/dw-edma-core.c | 28 ++++++++++++++++++++++++----
> > 1 file changed, 24 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/dma/dw-edma/dw-edma-core.c b/drivers/dma/dw-edma/dw-edma-core.c
> > index 8e5f7defa6b6..3542177a4a8e 100644
> > --- a/drivers/dma/dw-edma/dw-edma-core.c
> > +++ b/drivers/dma/dw-edma/dw-edma-core.c
> > @@ -839,6 +839,28 @@ static inline void dw_edma_add_irq_mask(u32 *mask, u32 alloc, u16 cnt)
> > (*mask)++;
> > }
> >
> > +static void dw_edma_compose_msi(struct device *dev, int irq, struct msi_msg *out)
> > +{
> > + struct msi_desc *desc = irq_get_msi_desc(irq);
> > + struct msi_msg msg;
> > + unsigned int base;
> > +
> > + if (!desc)
> > + return;
> > +
> > + get_cached_msi_msg(irq, &msg);
> > + if (!desc->pci.msi_attrib.is_msix) {
> > + /*
> > + * For multi-vector MSI, the cached message corresponds to
> > + * vector 0. Adjust msg.data by the IRQ index so that each
> > + * vector gets a unique MSI data value for IMWr Data Register.
> > + */
> > + base = msi_get_virq(dev, 0);
> > + msg.data |= (irq - base);
>
> why "|=", not "=" here?
"=" is better and safe here. Thanks for pointing it out, I'll fix it.
Koichiro
>
> Frank
>
> > + }
> > + *out = msg;
> > +}
> > +
> > static int dw_edma_irq_request(struct dw_edma *dw,
> > u32 *wr_alloc, u32 *rd_alloc)
> > {
> > @@ -869,8 +891,7 @@ static int dw_edma_irq_request(struct dw_edma *dw,
> > return err;
> > }
> >
> > - if (irq_get_msi_desc(irq))
> > - get_cached_msi_msg(irq, &dw->irq[0].msi);
> > + dw_edma_compose_msi(dev, irq, &dw->irq[0].msi);
> >
> > dw->nr_irqs = 1;
> > } else {
> > @@ -896,8 +917,7 @@ static int dw_edma_irq_request(struct dw_edma *dw,
> > if (err)
> > goto err_irq_free;
> >
> > - if (irq_get_msi_desc(irq))
> > - get_cached_msi_msg(irq, &dw->irq[i].msi);
> > + dw_edma_compose_msi(dev, irq, &dw->irq[i].msi);
> > }
> >
> > dw->nr_irqs = i;
> > --
> > 2.48.1
> >
* Re: [RFC PATCH v2 13/27] NTB: ntb_transport: Use seq_file for QP stats debugfs
2025-12-01 19:50 ` Frank Li
@ 2025-12-02 6:33 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-02 6:33 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Mon, Dec 01, 2025 at 02:50:19PM -0500, Frank Li wrote:
> On Sun, Nov 30, 2025 at 01:03:51AM +0900, Koichiro Den wrote:
> > The ./qp*/stats debugfs file for each NTB transport QP is currently
> > implemented with a hand-crafted kmalloc() buffer and a series of
> > scnprintf() calls. This is a pre-seq_file style pattern that makes future
> > extensions prone to truncation.
> >
> > Convert the stats file to use the seq_file helpers via
> > DEFINE_SHOW_ATTRIBUTE(), which simplifies the code and lets the seq_file
> > core handle buffering and partial reads.
> >
> > While touching this area, fix a bug in the per-QP debugfs directory
> > naming: the buffer used for "qp%d" was only 4 bytes, which truncates
> > names like "qp10" to "qp1" and causes multiple queues to share the same
> > directory. Enlarge the buffer and use sizeof() to avoid truncation.
> >
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
>
> This one is independent of the others; you may post it separately so this
> simple change can get merged quickly.
Ok I'll do so, thanks.
(it was a semi-preparatory commit for [RFC PATCH v2 20/27], to avoid having
different styles co-existing.)
Koichiro
>
> Reviewed-by: Frank Li <Frank.Li@nxp.com>
> > drivers/ntb/ntb_transport.c | 136 +++++++++++-------------------------
> > 1 file changed, 41 insertions(+), 95 deletions(-)
> >
> > diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
> > index 3f3bc991e667..57b4c0511927 100644
> > --- a/drivers/ntb/ntb_transport.c
> > +++ b/drivers/ntb/ntb_transport.c
> > @@ -57,6 +57,7 @@
> > #include <linux/module.h>
> > #include <linux/pci.h>
> > #include <linux/slab.h>
> > +#include <linux/seq_file.h>
> > #include <linux/types.h>
> > #include <linux/uaccess.h>
> > #include <linux/mutex.h>
> > @@ -466,104 +467,49 @@ void ntb_transport_unregister_client(struct ntb_transport_client *drv)
> > }
> > EXPORT_SYMBOL_GPL(ntb_transport_unregister_client);
> >
> > -static ssize_t debugfs_read(struct file *filp, char __user *ubuf, size_t count,
> > - loff_t *offp)
> > +static int ntb_qp_debugfs_stats_show(struct seq_file *s, void *v)
> > {
> > - struct ntb_transport_qp *qp;
> > - char *buf;
> > - ssize_t ret, out_offset, out_count;
> > -
> > - qp = filp->private_data;
> > + struct ntb_transport_qp *qp = s->private;
> >
> > if (!qp || !qp->link_is_up)
> > return 0;
> >
> > - out_count = 1000;
> > -
> > - buf = kmalloc(out_count, GFP_KERNEL);
> > - if (!buf)
> > - return -ENOMEM;
> > + seq_puts(s, "\nNTB QP stats:\n\n");
> > +
> > + seq_printf(s, "rx_bytes - \t%llu\n", qp->rx_bytes);
> > + seq_printf(s, "rx_pkts - \t%llu\n", qp->rx_pkts);
> > + seq_printf(s, "rx_memcpy - \t%llu\n", qp->rx_memcpy);
> > + seq_printf(s, "rx_async - \t%llu\n", qp->rx_async);
> > + seq_printf(s, "rx_ring_empty - %llu\n", qp->rx_ring_empty);
> > + seq_printf(s, "rx_err_no_buf - %llu\n", qp->rx_err_no_buf);
> > + seq_printf(s, "rx_err_oflow - \t%llu\n", qp->rx_err_oflow);
> > + seq_printf(s, "rx_err_ver - \t%llu\n", qp->rx_err_ver);
> > + seq_printf(s, "rx_buff - \t0x%p\n", qp->rx_buff);
> > + seq_printf(s, "rx_index - \t%u\n", qp->rx_index);
> > + seq_printf(s, "rx_max_entry - \t%u\n", qp->rx_max_entry);
> > + seq_printf(s, "rx_alloc_entry - \t%u\n\n", qp->rx_alloc_entry);
> > +
> > + seq_printf(s, "tx_bytes - \t%llu\n", qp->tx_bytes);
> > + seq_printf(s, "tx_pkts - \t%llu\n", qp->tx_pkts);
> > + seq_printf(s, "tx_memcpy - \t%llu\n", qp->tx_memcpy);
> > + seq_printf(s, "tx_async - \t%llu\n", qp->tx_async);
> > + seq_printf(s, "tx_ring_full - \t%llu\n", qp->tx_ring_full);
> > + seq_printf(s, "tx_err_no_buf - %llu\n", qp->tx_err_no_buf);
> > + seq_printf(s, "tx_mw - \t0x%p\n", qp->tx_mw);
> > + seq_printf(s, "tx_index (H) - \t%u\n", qp->tx_index);
> > + seq_printf(s, "RRI (T) - \t%u\n", qp->remote_rx_info->entry);
> > + seq_printf(s, "tx_max_entry - \t%u\n", qp->tx_max_entry);
> > + seq_printf(s, "free tx - \t%u\n", ntb_transport_tx_free_entry(qp));
> > + seq_putc(s, '\n');
> > +
> > + seq_printf(s, "Using TX DMA - \t%s\n", qp->tx_dma_chan ? "Yes" : "No");
> > + seq_printf(s, "Using RX DMA - \t%s\n", qp->rx_dma_chan ? "Yes" : "No");
> > + seq_printf(s, "QP Link - \t%s\n", qp->link_is_up ? "Up" : "Down");
> > + seq_putc(s, '\n');
> >
> > - out_offset = 0;
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "\nNTB QP stats:\n\n");
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "rx_bytes - \t%llu\n", qp->rx_bytes);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "rx_pkts - \t%llu\n", qp->rx_pkts);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "rx_memcpy - \t%llu\n", qp->rx_memcpy);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "rx_async - \t%llu\n", qp->rx_async);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "rx_ring_empty - %llu\n", qp->rx_ring_empty);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "rx_err_no_buf - %llu\n", qp->rx_err_no_buf);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "rx_err_oflow - \t%llu\n", qp->rx_err_oflow);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "rx_err_ver - \t%llu\n", qp->rx_err_ver);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "rx_buff - \t0x%p\n", qp->rx_buff);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "rx_index - \t%u\n", qp->rx_index);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "rx_max_entry - \t%u\n", qp->rx_max_entry);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "rx_alloc_entry - \t%u\n\n", qp->rx_alloc_entry);
> > -
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "tx_bytes - \t%llu\n", qp->tx_bytes);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "tx_pkts - \t%llu\n", qp->tx_pkts);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "tx_memcpy - \t%llu\n", qp->tx_memcpy);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "tx_async - \t%llu\n", qp->tx_async);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "tx_ring_full - \t%llu\n", qp->tx_ring_full);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "tx_err_no_buf - %llu\n", qp->tx_err_no_buf);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "tx_mw - \t0x%p\n", qp->tx_mw);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "tx_index (H) - \t%u\n", qp->tx_index);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "RRI (T) - \t%u\n",
> > - qp->remote_rx_info->entry);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "tx_max_entry - \t%u\n", qp->tx_max_entry);
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "free tx - \t%u\n",
> > - ntb_transport_tx_free_entry(qp));
> > -
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "\n");
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "Using TX DMA - \t%s\n",
> > - qp->tx_dma_chan ? "Yes" : "No");
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "Using RX DMA - \t%s\n",
> > - qp->rx_dma_chan ? "Yes" : "No");
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "QP Link - \t%s\n",
> > - qp->link_is_up ? "Up" : "Down");
> > - out_offset += scnprintf(buf + out_offset, out_count - out_offset,
> > - "\n");
> > -
> > - if (out_offset > out_count)
> > - out_offset = out_count;
> > -
> > - ret = simple_read_from_buffer(ubuf, count, offp, buf, out_offset);
> > - kfree(buf);
> > - return ret;
> > -}
> > -
> > -static const struct file_operations ntb_qp_debugfs_stats = {
> > - .owner = THIS_MODULE,
> > - .open = simple_open,
> > - .read = debugfs_read,
> > -};
> > + return 0;
> > +}
> > +DEFINE_SHOW_ATTRIBUTE(ntb_qp_debugfs_stats);
> >
> > static void ntb_list_add(spinlock_t *lock, struct list_head *entry,
> > struct list_head *list)
> > @@ -1237,15 +1183,15 @@ static int ntb_transport_init_queue(struct ntb_transport_ctx *nt,
> > qp->tx_max_entry = tx_size / qp->tx_max_frame;
> >
> > if (nt->debugfs_node_dir) {
> > - char debugfs_name[4];
> > + char debugfs_name[8];
> >
> > - snprintf(debugfs_name, 4, "qp%d", qp_num);
> > + snprintf(debugfs_name, sizeof(debugfs_name), "qp%d", qp_num);
> > qp->debugfs_dir = debugfs_create_dir(debugfs_name,
> > nt->debugfs_node_dir);
> >
> > qp->debugfs_stats = debugfs_create_file("stats", S_IRUSR,
> > qp->debugfs_dir, qp,
> > - &ntb_qp_debugfs_stats);
> > + &ntb_qp_debugfs_stats_fops);
> > } else {
> > qp->debugfs_dir = NULL;
> > qp->debugfs_stats = NULL;
> > --
> > 2.48.1
> >
* Re: [RFC PATCH v2 14/27] NTB: ntb_transport: Move TX memory window setup into setup_qp_mw()
2025-12-01 20:02 ` Frank Li
@ 2025-12-02 6:33 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-02 6:33 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Mon, Dec 01, 2025 at 03:02:40PM -0500, Frank Li wrote:
> On Sun, Nov 30, 2025 at 01:03:52AM +0900, Koichiro Den wrote:
> > Historically both TX and RX have assumed the same per-QP MW slice
> > (tx_max_entry == remote rx_max_entry), while those are calculated
> > separately in different places (pre and post the link-up negotiation
> > point). This has been safe because nt->link_is_up is never set to true
> > unless the pre-determined qp_count are the same among them, and qp_count
> > is typically limited to nt->mw_count, which should be carefully
> > configured by admin.
> >
> > However, setup_qp_mw() can actually split an MW and handle multiple QPs in
> > one MW properly, so qp_count need not be limited by nt->mw_count. Once we
> > relax the limitation, the pre-determined qp_count can differ between the
> > host side and the endpoint, and link-up negotiation can easily fail.
> >
> > Move the TX MW configuration (per-QP offset and size) into
> > ntb_transport_setup_qp_mw() so that both RX and TX layout decisions are
> > centralized in a single helper. ntb_transport_init_queue() now deals
> > only with per-QP software state, not with MW layout.
> >
> > This keeps the previous behaviour, while preparing to relax the qp_count
> > limitation and improving readability.
> >
> > No functional change is intended.
> >
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > drivers/ntb/ntb_transport.c | 67 ++++++++++++++-----------------------
> > 1 file changed, 26 insertions(+), 41 deletions(-)
> >
> > diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
> > index 57b4c0511927..79063e2f911b 100644
> > --- a/drivers/ntb/ntb_transport.c
> > +++ b/drivers/ntb/ntb_transport.c
> > @@ -569,7 +569,8 @@ static int ntb_transport_setup_qp_mw(struct ntb_transport_ctx *nt,
> > struct ntb_transport_mw *mw;
> > struct ntb_dev *ndev = nt->ndev;
> > struct ntb_queue_entry *entry;
> > - unsigned int rx_size, num_qps_mw;
> > + unsigned int num_qps_mw;
> > + unsigned int mw_size, mw_size_per_qp, qp_offset, rx_info_offset;
> > unsigned int mw_num, mw_count, qp_count;
> > unsigned int i;
> > int node;
> > @@ -588,15 +589,33 @@ static int ntb_transport_setup_qp_mw(struct ntb_transport_ctx *nt,
> > else
> > num_qps_mw = qp_count / mw_count;
> >
> > - rx_size = (unsigned int)mw->xlat_size / num_qps_mw;
> > - qp->rx_buff = mw->virt_addr + rx_size * (qp_num / mw_count);
> > - rx_size -= sizeof(struct ntb_rx_info);
> > + mw_size = min(nt->mw_vec[mw_num].phys_size, mw->xlat_size);
> > + if (max_mw_size && mw_size > max_mw_size)
> > + mw_size = max_mw_size;
> >
> > - qp->remote_rx_info = qp->rx_buff + rx_size;
> > + /* Split this MW evenly among the queue pairs mapped to it. */
> > + mw_size_per_qp = (unsigned int)mw_size / num_qps_mw;
>
> Can you keep using the same variable name at first, to make review easier?
>
> tx_size = (unsigned int)mw_size / num_qps_mw;
>
> Otherwise it is hard to verify that the code logic is the same as the old one.
I'll do so. Thank you!
Koichiro
>
> Frank
>
> > + qp_offset = mw_size_per_qp * (qp_num / mw_count);
> > +
> > + /* Place remote_rx_info at the end of the per-QP region. */
> > + rx_info_offset = mw_size_per_qp - sizeof(struct ntb_rx_info);
> > +
> > + qp->tx_mw_size = mw_size_per_qp;
> > + qp->tx_mw = nt->mw_vec[mw_num].vbase + qp_offset;
> > + if (!qp->tx_mw)
> > + return -EINVAL;
> > + qp->tx_mw_phys = nt->mw_vec[mw_num].phys_addr + qp_offset;
> > + if (!qp->tx_mw_phys)
> > + return -EINVAL;
> > + qp->rx_info = qp->tx_mw + rx_info_offset;
> > + qp->rx_buff = mw->virt_addr + qp_offset;
> > + qp->remote_rx_info = qp->rx_buff + rx_info_offset;
> >
> > /* Due to housekeeping, there must be atleast 2 buffs */
> > - qp->rx_max_frame = min(transport_mtu, rx_size / 2);
> > - qp->rx_max_entry = rx_size / qp->rx_max_frame;
> > + qp->tx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
> > + qp->tx_max_entry = mw_size_per_qp / qp->tx_max_frame;
> > + qp->rx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
> > + qp->rx_max_entry = mw_size_per_qp / qp->rx_max_frame;
> > qp->rx_index = 0;
> >
> > /*
> > @@ -1133,11 +1152,7 @@ static int ntb_transport_init_queue(struct ntb_transport_ctx *nt,
> > unsigned int qp_num)
> > {
> > struct ntb_transport_qp *qp;
> > - phys_addr_t mw_base;
> > - resource_size_t mw_size;
> > - unsigned int num_qps_mw, tx_size;
> > unsigned int mw_num, mw_count, qp_count;
> > - u64 qp_offset;
> >
> > mw_count = nt->mw_count;
> > qp_count = nt->qp_count;
> > @@ -1152,36 +1167,6 @@ static int ntb_transport_init_queue(struct ntb_transport_ctx *nt,
> > qp->event_handler = NULL;
> > ntb_qp_link_context_reset(qp);
> >
> > - if (mw_num < qp_count % mw_count)
> > - num_qps_mw = qp_count / mw_count + 1;
> > - else
> > - num_qps_mw = qp_count / mw_count;
> > -
> > - mw_base = nt->mw_vec[mw_num].phys_addr;
> > - mw_size = nt->mw_vec[mw_num].phys_size;
> > -
> > - if (max_mw_size && mw_size > max_mw_size)
> > - mw_size = max_mw_size;
> > -
> > - tx_size = (unsigned int)mw_size / num_qps_mw;
> > - qp_offset = tx_size * (qp_num / mw_count);
> > -
> > - qp->tx_mw_size = tx_size;
> > - qp->tx_mw = nt->mw_vec[mw_num].vbase + qp_offset;
> > - if (!qp->tx_mw)
> > - return -EINVAL;
> > -
> > - qp->tx_mw_phys = mw_base + qp_offset;
> > - if (!qp->tx_mw_phys)
> > - return -EINVAL;
> > -
> > - tx_size -= sizeof(struct ntb_rx_info);
> > - qp->rx_info = qp->tx_mw + tx_size;
> > -
> > - /* Due to housekeeping, there must be atleast 2 buffs */
> > - qp->tx_max_frame = min(transport_mtu, tx_size / 2);
> > - qp->tx_max_entry = tx_size / qp->tx_max_frame;
> > -
> > if (nt->debugfs_node_dir) {
> > char debugfs_name[8];
> >
> > --
> > 2.48.1
> >
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-01 20:41 ` Frank Li
@ 2025-12-02 6:35 ` Koichiro Den
2025-12-02 9:32 ` Niklas Cassel
0 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-12-02 6:35 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Mon, Dec 01, 2025 at 03:41:38PM -0500, Frank Li wrote:
> On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> > for the MSI target address on every interrupt and tears it down again
> > via dw_pcie_ep_unmap_addr().
> >
> > On systems that heavily use the AXI bridge interface (for example when
> > the integrated eDMA engine is active), this means the outbound iATU
> > registers are updated while traffic is in flight. The DesignWare
> > endpoint spec warns that updating iATU registers in this situation is
> > not supported, and the behavior is undefined.
> >
> > Under high MSI and eDMA load this pattern results in occasional bogus
> > outbound transactions and IOMMU faults such as:
> >
> > ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
> >
>
> I agree there is no need to map/unmap the MSI window every time. But I think
> there is a logic problem behind this. An IOMMU fault means the page table
> entry was already removed, but something still tries to access it after that.
> You'd better find out what accesses the MSI memory after dw_pcie_ep_unmap_addr().
I don't see any other callers that access the MSI region after
dw_pcie_ep_unmap_addr(), but I might be missing something. Also, even if I
serialize dw_pcie_ep_raise_msi_irq() invocations, the problem still
appears.
A couple of details I forgot to describe in the commit message:
(1). The IOMMU error is only reported on the RC side.
(2). Sometimes there is no IOMMU error printed and the board just freezes (becomes unresponsive).
The faulting iova is 0xfe000000. The iova 0xfe000000 is the base of
"addr_space" for R-Car S4 in EP mode:
https://github.com/jonmason/ntb/blob/68113d260674/arch/arm64/boot/dts/renesas/r8a779f0.dtsi#L847
So it looks like the EP sometimes issues an MWr at the "addr_space" base (offset
0), the RC forwards it to its IOMMU (IPMMUHC), and that faults. My working theory
is that when the iATU registers are updated under heavy DMA load, the DAR of
some in-flight transfer can get corrupted to 0xfe000000. That would match one
possible symptom of the undefined behaviour that the DW EPC spec warns about
when changing iATU configuration under load.
-Koichiro
>
> dw_pcie_ep_unmap_addr() uses writel(), which uses dma_dmb() before changing the
> register, so any previous write should be completed before the ATU register is
> written.
>
> Frank
>
> > followed by the system becoming unresponsive. This is the actual output
> > observed on Renesas R-Car S4, with its ipmmu_hc used with PCIe ch0.
> >
> > There is no need to reprogram the iATU region used for MSI on every
> > interrupt. The host-provided MSI address is stable while MSI is enabled,
> > and the endpoint driver already dedicates a scratch buffer for MSI
> > generation.
> >
> > Cache the aligned MSI address and map size, program the outbound iATU
> > once, and keep the window enabled. Subsequent interrupts only perform a
> > write to the MSI scratch buffer, avoiding dynamic iATU reprogramming in
> > the hot path and fixing the lockups seen under load.
> >
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > .../pci/controller/dwc/pcie-designware-ep.c | 48 ++++++++++++++++---
> > drivers/pci/controller/dwc/pcie-designware.h | 5 ++
> > 2 files changed, 47 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > index 3780a9bd6f79..ef8ded34d9ab 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > @@ -778,6 +778,16 @@ static void dw_pcie_ep_stop(struct pci_epc *epc)
> > struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> > struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> >
> > + /*
> > + * Tear down the dedicated outbound window used for MSI
> > + * generation. This avoids leaking an iATU window across
> > + * endpoint stop/start cycles.
> > + */
> > + if (ep->msi_iatu_mapped) {
> > + dw_pcie_ep_unmap_addr(epc, 0, 0, ep->msi_mem_phys);
> > + ep->msi_iatu_mapped = false;
> > + }
> > +
> > dw_pcie_stop_link(pci);
> > }
> >
> > @@ -881,14 +891,37 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 func_no,
> > msg_addr = ((u64)msg_addr_upper) << 32 | msg_addr_lower;
> >
> > msg_addr = dw_pcie_ep_align_addr(epc, msg_addr, &map_size, &offset);
> > - ret = dw_pcie_ep_map_addr(epc, func_no, 0, ep->msi_mem_phys, msg_addr,
> > - map_size);
> > - if (ret)
> > - return ret;
> >
> > - writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
> > + /*
> > + * Program the outbound iATU once and keep it enabled.
> > + *
> > + * The spec warns that updating iATU registers while there are
> > + * operations in flight on the AXI bridge interface is not
> > + * supported, so we avoid reprogramming the region on every MSI,
> > + * specifically unmapping immediately after writel().
> > + */
> > + if (!ep->msi_iatu_mapped) {
> > + ret = dw_pcie_ep_map_addr(epc, func_no, 0,
> > + ep->msi_mem_phys, msg_addr,
> > + map_size);
> > + if (ret)
> > + return ret;
> >
> > - dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);
> > + ep->msi_iatu_mapped = true;
> > + ep->msi_msg_addr = msg_addr;
> > + ep->msi_map_size = map_size;
> > + } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
> > + ep->msi_map_size != map_size)) {
> > + /*
> > + * The host changed the MSI target address or the required
> > + * mapping size. Reprogramming the iATU at runtime is unsafe
> > + * on this controller, so bail out instead of trying to update
> > + * the existing region.
> > + */
> > + return -EINVAL;
> > + }
> > +
> > + writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
> >
> > return 0;
> > }
> > @@ -1268,6 +1301,9 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
> > INIT_LIST_HEAD(&ep->func_list);
> > INIT_LIST_HEAD(&ep->ib_map_list);
> > spin_lock_init(&ep->ib_map_lock);
> > + ep->msi_iatu_mapped = false;
> > + ep->msi_msg_addr = 0;
> > + ep->msi_map_size = 0;
> >
> > epc = devm_pci_epc_create(dev, &epc_ops);
> > if (IS_ERR(epc)) {
> > diff --git a/drivers/pci/controller/dwc/pcie-designware.h b/drivers/pci/controller/dwc/pcie-designware.h
> > index 269a9fe0501f..1770a2318557 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware.h
> > +++ b/drivers/pci/controller/dwc/pcie-designware.h
> > @@ -481,6 +481,11 @@ struct dw_pcie_ep {
> > void __iomem *msi_mem;
> > phys_addr_t msi_mem_phys;
> > struct pci_epf_bar *epf_bar[PCI_STD_NUM_BARS];
> > +
> > + /* MSI outbound iATU state */
> > + bool msi_iatu_mapped;
> > + u64 msi_msg_addr;
> > + size_t msi_map_size;
> > };
> >
> > struct dw_pcie_ops {
> > --
> > 2.48.1
> >
* Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-12-01 21:41 ` Frank Li
@ 2025-12-02 6:43 ` Koichiro Den
2025-12-02 15:42 ` Frank Li
0 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-12-02 6:43 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Mon, Dec 01, 2025 at 04:41:05PM -0500, Frank Li wrote:
> On Sun, Nov 30, 2025 at 01:03:58AM +0900, Koichiro Den wrote:
> > Add a new transport backend that uses a remote DesignWare eDMA engine
> > located on the NTB endpoint to move data between host and endpoint.
> >
> > In this mode:
> >
> > - The endpoint exposes a dedicated memory window that contains the
> > eDMA register block followed by a small control structure (struct
> > ntb_edma_info) and per-channel linked-list (LL) rings.
> >
> > - On the endpoint side, ntb_edma_setup_mws() allocates the control
> > structure and LL rings in endpoint memory, then programs an inbound
> > iATU region so that the host can access them via a peer MW.
> >
> > - On the host side, ntb_edma_setup_peer() ioremaps the peer MW, reads
> > ntb_edma_info and configures a dw-edma DMA device to use the LL
> > rings provided by the endpoint.
> >
> > - ntb_transport is extended with a new backend_ops implementation that
> > routes TX and RX enqueue/poll operations through the remote eDMA
> > rings while keeping the existing shared-memory backend intact.
> >
> > - The host signals the endpoint via a dedicated DMA read channel.
> > 'use_msi' module option is ignored when 'use_remote_edma=1'.
> >
> > The new mode is guarded by a Kconfig option (NTB_TRANSPORT_EDMA) and a
> > module parameter (use_remote_edma). When disabled, the existing
> > ntb_transport behaviour is unchanged.
> >
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > drivers/ntb/Kconfig | 11 +
> > drivers/ntb/Makefile | 3 +
> > drivers/ntb/ntb_edma.c | 628 ++++++++
> > drivers/ntb/ntb_edma.h | 128 ++
> > .../{ntb_transport.c => ntb_transport_core.c} | 1281 ++++++++++++++++-
> > 5 files changed, 2048 insertions(+), 3 deletions(-)
> > create mode 100644 drivers/ntb/ntb_edma.c
> > create mode 100644 drivers/ntb/ntb_edma.h
> > rename drivers/ntb/{ntb_transport.c => ntb_transport_core.c} (65%)
> >
> > diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
> > index df16c755b4da..db63f02bb116 100644
> > --- a/drivers/ntb/Kconfig
> > +++ b/drivers/ntb/Kconfig
> > @@ -37,4 +37,15 @@ config NTB_TRANSPORT
> >
> > If unsure, say N.
> >
> > +config NTB_TRANSPORT_EDMA
> > + bool "NTB Transport backed by remote eDMA"
> > + depends on NTB_TRANSPORT
> > + depends on PCI
> > + select DMA_ENGINE
> > + help
> > + Enable a transport backend that uses a remote DesignWare eDMA engine
> > + exposed through a dedicated NTB memory window. The host uses the
> > + endpoint's eDMA engine to move data in both directions.
> > + Say Y here if you intend to use the 'use_remote_edma' module parameter.
> > +
> > endif # NTB
> > diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
> > index 3a6fa181ff99..51f0e1e3aec7 100644
> > --- a/drivers/ntb/Makefile
> > +++ b/drivers/ntb/Makefile
> > @@ -4,3 +4,6 @@ obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
> >
> > ntb-y := core.o
> > ntb-$(CONFIG_NTB_MSI) += msi.o
> > +
> > +ntb_transport-y := ntb_transport_core.o
> > +ntb_transport-$(CONFIG_NTB_TRANSPORT_EDMA) += ntb_edma.o
> > diff --git a/drivers/ntb/ntb_edma.c b/drivers/ntb/ntb_edma.c
> > new file mode 100644
> > index 000000000000..cb35e0d56aa8
> > --- /dev/null
> > +++ b/drivers/ntb/ntb_edma.c
> > @@ -0,0 +1,628 @@
> > +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> > +
> > +#include <linux/module.h>
> > +#include <linux/device.h>
> > +#include <linux/pci.h>
> > +#include <linux/ntb.h>
> > +#include <linux/io.h>
> > +#include <linux/iommu.h>
> > +#include <linux/dmaengine.h>
> > +#include <linux/pci-epc.h>
> > +#include <linux/dma/edma.h>
> > +#include <linux/irq.h>
> > +#include <linux/irqdomain.h>
> > +#include <linux/of.h>
> > +#include <linux/of_irq.h>
> > +#include <dt-bindings/interrupt-controller/arm-gic.h>
> > +
> > +#include "ntb_edma.h"
> > +
> > +/*
> > + * The interrupt register offsets below are taken from the DesignWare
> > + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> > + * backend currently only supports this layout.
> > + */
> > +#define DMA_WRITE_INT_STATUS_OFF 0x4c
> > +#define DMA_WRITE_INT_MASK_OFF 0x54
> > +#define DMA_WRITE_INT_CLEAR_OFF 0x58
> > +#define DMA_READ_INT_STATUS_OFF 0xa0
> > +#define DMA_READ_INT_MASK_OFF 0xa8
> > +#define DMA_READ_INT_CLEAR_OFF 0xac
>
> Not sure why you need to access the eDMA registers directly, since the eDMA
> driver is already exposed as a dmaengine driver.
These are intended for EP use. In my current design I intentionally don't
use the standard dw-edma dmaengine driver on the EP side.
>
> > +
> > +#define NTB_EDMA_NOTIFY_MAX_QP 64
> > +
> > +static unsigned int edma_spi = 417; /* 0x1a1 */
> > +module_param(edma_spi, uint, 0644);
> > +MODULE_PARM_DESC(edma_spi, "SPI number used by remote eDMA interrupt (EP local)");
> > +
> > +static u64 edma_regs_phys = 0xe65d5000;
> > +module_param(edma_regs_phys, ullong, 0644);
> > +MODULE_PARM_DESC(edma_regs_phys, "Physical base address of local eDMA registers (EP)");
> > +
> > +static unsigned long edma_regs_size = 0x1200;
> > +module_param(edma_regs_size, ulong, 0644);
> > +MODULE_PARM_DESC(edma_regs_size, "Size of the local eDMA register space (EP)");
> > +
> > +struct ntb_edma_intr {
> > + u32 db[NTB_EDMA_NOTIFY_MAX_QP];
> > +};
> > +
> > +struct ntb_edma_ctx {
> > + void *ll_wr_virt[EDMA_WR_CH_NUM];
> > + dma_addr_t ll_wr_phys[EDMA_WR_CH_NUM];
> > + void *ll_rd_virt[EDMA_RD_CH_NUM + 1];
> > + dma_addr_t ll_rd_phys[EDMA_RD_CH_NUM + 1];
> > +
> > + struct ntb_edma_intr *intr_ep_virt;
> > + dma_addr_t intr_ep_phys;
> > + struct ntb_edma_intr *intr_rc_virt;
> > + dma_addr_t intr_rc_phys;
> > + u32 notify_qp_max;
> > +
> > + bool initialized;
> > +};
> > +
> > +static struct ntb_edma_ctx edma_ctx;
> > +
> > +typedef void (*ntb_edma_interrupt_cb_t)(void *data, int qp_num);
> > +
> > +struct ntb_edma_interrupt {
> > + int virq;
> > + void __iomem *base;
> > + ntb_edma_interrupt_cb_t cb;
> > + void *data;
> > +};
> > +
> > +static struct ntb_edma_interrupt ntb_edma_intr;
> > +
> > +static int ntb_edma_map_spi_to_virq(struct device *dev, unsigned int spi)
> > +{
> > + struct device_node *np = dev_of_node(dev);
> > + struct device_node *parent;
> > + struct irq_fwspec fwspec = { 0 };
> > + int virq;
> > +
> > + parent = of_irq_find_parent(np);
> > + if (!parent)
> > + return -ENODEV;
> > +
> > + fwspec.fwnode = of_fwnode_handle(parent);
> > + fwspec.param_count = 3;
> > + fwspec.param[0] = GIC_SPI;
> > + fwspec.param[1] = spi;
> > + fwspec.param[2] = IRQ_TYPE_LEVEL_HIGH;
> > +
> > + virq = irq_create_fwspec_mapping(&fwspec);
> > + of_node_put(parent);
> > + return (virq > 0) ? virq : -EINVAL;
> > +}
> > +
> > +static irqreturn_t ntb_edma_isr(int irq, void *data)
> > +{
>
> Not sure why dw_edma_interrupt_write/read() does not work for your case. I
> suppose you could just register a callback with the dmaengine.
If we ran dw_edma_probe() on both the EP and RC sides and let the dmaengine
callbacks handle int_status/int_clear, I think we could hit races. One side
might clear a status bit before the other side has a chance to see it and
invoke its callback. Please correct me if I'm missing something here.
To avoid that, in my current implementation the RC side handles the
status/int_clear registers in the usual way, and the EP side only tries to
suppress needless eDMA interrupts as much as possible.
That said, I'm now wondering whether it would be better to set LIE=0/RIE=1 for
the DMA transfer channels and LIE=1/RIE=0 for the notification channel. That
would require some changes to the dw-edma core.
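To make the idea concrete, here is a minimal sketch of the kind of change to
the linked-list control-word setup I have in mind. The 'notify_only' flag is
hypothetical and does not exist in today's dw-edma code; LIE/RIE are the
local/remote interrupt enable bits of the LL element:

	if (chan->notify_only) {
		/* notification channel: interrupt the local (EP) side only */
		control |= DW_EDMA_V0_LIE;
		control &= ~DW_EDMA_V0_RIE;
	} else {
		/* data channels: interrupt the remote (RC) side only */
		control &= ~DW_EDMA_V0_LIE;
		control |= DW_EDMA_V0_RIE;
	}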
>
> > + struct ntb_edma_interrupt *v = data;
> > + u32 mask = BIT(EDMA_RD_CH_NUM);
> > + u32 i, val;
> > +
> > + /*
> > + * We do not ack interrupts here but instead we mask all local interrupt
> > + * sources except the read channel used for notification. This reduces
> > + * needless ISR invocations.
> > + *
> > + * In theory we could configure LIE=1/RIE=0 only for the notification
> > + * transfer (keeping all other channels at LIE=1/RIE=1), but that would
> > + * require intrusive changes to the dw-edma core.
> > + *
> > + * Note: The host side may have already cleared the read interrupt used
> > + * for notification, so reading DMA_READ_INT_CLEAR_OFF is not a reliable
> > + * way to detect it. As a result, we cannot reliably tell which specific
> > + * channel triggered this interrupt. intr_ep_virt->db[i] teaches us
> > + * instead.
> > + */
> > + iowrite32(~0x0, v->base + DMA_WRITE_INT_MASK_OFF);
> > + iowrite32(~mask, v->base + DMA_READ_INT_MASK_OFF);
> > +
> > + if (!v->cb || !edma_ctx.intr_ep_virt)
> > + return IRQ_HANDLED;
> > +
> > + for (i = 0; i < edma_ctx.notify_qp_max; i++) {
> > + val = READ_ONCE(edma_ctx.intr_ep_virt->db[i]);
> > + if (!val)
> > + continue;
> > +
> > + WRITE_ONCE(edma_ctx.intr_ep_virt->db[i], 0);
> > + v->cb(v->data, i);
> > + }
> > +
> > + return IRQ_HANDLED;
> > +}
> > +
> ...
> > +
> > +int ntb_edma_setup_peer(struct ntb_dev *ndev)
> > +{
> > + struct ntb_edma_info *info;
> > + unsigned int wr_cnt, rd_cnt;
> > + struct dw_edma_chip *chip;
> > + void __iomem *edma_virt;
> > + phys_addr_t edma_phys;
> > + resource_size_t mw_size;
> > + u64 off = EDMA_REG_SIZE;
> > + int peer_mw, mw_index;
> > + unsigned int i;
> > + int ret;
> > +
> > + peer_mw = ntb_peer_mw_count(ndev);
> > + if (peer_mw <= 0)
> > + return -ENODEV;
> > +
> > + mw_index = peer_mw - 1; /* last MW */
> > +
> > + ret = ntb_peer_mw_get_addr(ndev, mw_index, &edma_phys,
> > + &mw_size);
> > + if (ret)
> > + return -1;
> > +
> > + edma_virt = ioremap(edma_phys, mw_size);
> > +
> > + chip = devm_kzalloc(&ndev->dev, sizeof(*chip), GFP_KERNEL);
> > + if (!chip) {
> > + ret = -ENOMEM;
> > + return ret;
> > + }
> > +
> > + chip->dev = &ndev->pdev->dev;
> > + chip->nr_irqs = 4;
> > + chip->ops = &ntb_edma_ops;
> > + chip->flags = 0;
> > + chip->reg_base = edma_virt;
> > + chip->mf = EDMA_MF_EDMA_UNROLL;
> > +
> > + info = edma_virt + off;
> > + if (info->magic != NTB_EDMA_INFO_MAGIC)
> > + return -EINVAL;
> > + wr_cnt = info->wr_cnt;
> > + rd_cnt = info->rd_cnt;
> > + chip->ll_wr_cnt = wr_cnt;
> > + chip->ll_rd_cnt = rd_cnt;
> > + off += PAGE_SIZE;
> > +
> > + edma_ctx.notify_qp_max = NTB_EDMA_NOTIFY_MAX_QP;
> > + edma_ctx.intr_ep_phys = info->intr_dar_base;
> > + if (edma_ctx.intr_ep_phys) {
> > + edma_ctx.intr_rc_virt =
> > + dma_alloc_coherent(&ndev->pdev->dev,
> > + sizeof(struct ntb_edma_intr),
> > + &edma_ctx.intr_rc_phys,
> > + GFP_KERNEL);
> > + if (!edma_ctx.intr_rc_virt)
> > + return -ENOMEM;
> > + memset(edma_ctx.intr_rc_virt, 0,
> > + sizeof(struct ntb_edma_intr));
> > + }
> > +
> > + for (i = 0; i < wr_cnt; i++) {
> > + chip->ll_region_wr[i].vaddr.io = edma_virt + off;
> > + chip->ll_region_wr[i].paddr = info->ll_wr_phys[i];
> > + chip->ll_region_wr[i].sz = DMA_LLP_MEM_SIZE;
> > + off += DMA_LLP_MEM_SIZE;
> > + }
> > + for (i = 0; i < rd_cnt; i++) {
> > + chip->ll_region_rd[i].vaddr.io = edma_virt + off;
> > + chip->ll_region_rd[i].paddr = info->ll_rd_phys[i];
> > + chip->ll_region_rd[i].sz = DMA_LLP_MEM_SIZE;
> > + off += DMA_LLP_MEM_SIZE;
> > + }
> > +
> > + if (!pci_dev_msi_enabled(ndev->pdev))
> > + return -ENXIO;
> > +
> > + ret = dw_edma_probe(chip);
>
> I think dw_edma_probe() should be in ntb_hw_epf.c, which provides the DMA
> engine support.
>
> On the EP side, I suppose the default DWC controller driver already sets up
> the eDMA engine, so with the correct filter function you should get a DMA channel.
I intentionally hid the eDMA node for the EP side in the .dts patch in
[RFC PATCH v2 26/27], so that only the RC side manages the eDMA remotely,
avoiding the potential race condition I mentioned above.
Thanks for reviewing,
Koichiro
>
> Frank
>
> > + if (ret) {
> > + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> > + return ret;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +struct ntb_edma_filter {
> > + struct device *dma_dev;
> > + u32 direction;
> > +};
> > +
> > +static bool ntb_edma_filter_fn(struct dma_chan *chan, void *arg)
> > +{
> > + struct ntb_edma_filter *filter = arg;
> > + u32 dir = filter->direction;
> > + struct dma_slave_caps caps;
> > + int ret;
> > +
> > + if (chan->device->dev != filter->dma_dev)
> > + return false;
> > +
> > + ret = dma_get_slave_caps(chan, &caps);
> > + if (ret < 0)
> > + return false;
> > +
> > + return !!(caps.directions & dir);
> > +}
> > +
> > +void ntb_edma_teardown_chans(struct ntb_edma_chans *edma)
> > +{
> > + unsigned int i;
> > +
> > + for (i = 0; i < edma->num_wr_chan; i++)
> > + dma_release_channel(edma->wr_chan[i]);
> > +
> > + for (i = 0; i < edma->num_rd_chan; i++)
> > + dma_release_channel(edma->rd_chan[i]);
> > +
> > + if (edma->intr_chan)
> > + dma_release_channel(edma->intr_chan);
> > +}
> > +
> > +int ntb_edma_setup_chans(struct device *dma_dev, struct ntb_edma_chans *edma)
> > +{
> > + struct ntb_edma_filter filter;
> > + dma_cap_mask_t dma_mask;
> > + unsigned int i;
> > +
> > + dma_cap_zero(dma_mask);
> > + dma_cap_set(DMA_SLAVE, dma_mask);
> > +
> > + memset(edma, 0, sizeof(*edma));
> > + edma->dev = dma_dev;
> > +
> > + filter.dma_dev = dma_dev;
> > + filter.direction = BIT(DMA_DEV_TO_MEM);
> > + for (i = 0; i < EDMA_WR_CH_NUM; i++) {
> > + edma->wr_chan[i] = dma_request_channel(dma_mask,
> > + ntb_edma_filter_fn,
> > + &filter);
> > + if (!edma->wr_chan[i])
> > + break;
> > + edma->num_wr_chan++;
> > + }
> > +
> > + filter.direction = BIT(DMA_MEM_TO_DEV);
> > + for (i = 0; i < EDMA_RD_CH_NUM; i++) {
> > + edma->rd_chan[i] = dma_request_channel(dma_mask,
> > + ntb_edma_filter_fn,
> > + &filter);
> > + if (!edma->rd_chan[i])
> > + break;
> > + edma->num_rd_chan++;
> > + }
> > +
> > + edma->intr_chan = dma_request_channel(dma_mask, ntb_edma_filter_fn,
> > + &filter);
> > + if (!edma->intr_chan)
> > + dev_warn(dma_dev,
> > + "Remote eDMA notify channel could not be allocated\n");
> > +
> > + if (!edma->num_wr_chan || !edma->num_rd_chan) {
> > + dev_warn(dma_dev, "Remote eDMA channels failed to initialize\n");
> > + ntb_edma_teardown_chans(edma);
> > + return -ENODEV;
> > + }
> > + return 0;
> > +}
> > +
> > +struct dma_chan *ntb_edma_pick_chan(struct ntb_edma_chans *edma,
> > + remote_edma_dir_t dir)
> > +{
> > + unsigned int n, cur, idx;
> > + struct dma_chan **chans;
> > + atomic_t *cur_chan;
> > +
> > + if (dir == REMOTE_EDMA_WRITE) {
> > + n = edma->num_wr_chan;
> > + chans = edma->wr_chan;
> > + cur_chan = &edma->cur_wr_chan;
> > + } else {
> > + n = edma->num_rd_chan;
> > + chans = edma->rd_chan;
> > + cur_chan = &edma->cur_rd_chan;
> > + }
> > + if (WARN_ON_ONCE(!n))
> > + return NULL;
> > +
> > + /* Simple round-robin */
> > + cur = (unsigned int)atomic_inc_return(cur_chan) - 1;
> > + idx = cur % n;
> > + return chans[idx];
> > +}
> > +
> > +int ntb_edma_notify_peer(struct ntb_edma_chans *edma, int qp_num)
> > +{
> > + struct dma_async_tx_descriptor *txd;
> > + struct dma_slave_config cfg;
> > + struct scatterlist sgl;
> > + dma_cookie_t cookie;
> > + struct device *dev;
> > +
> > + if (!edma || !edma->intr_chan)
> > + return -ENXIO;
> > +
> > + if (qp_num < 0 || qp_num >= edma_ctx.notify_qp_max)
> > + return -EINVAL;
> > +
> > + if (!edma_ctx.intr_rc_virt || !edma_ctx.intr_ep_phys)
> > + return -EINVAL;
> > +
> > + dev = edma->dev;
> > + if (!dev)
> > + return -ENODEV;
> > +
> > + WRITE_ONCE(edma_ctx.intr_rc_virt->db[qp_num], 1);
> > +
> > + /* Ensure store is visible before kicking the DMA transfer */
> > + wmb();
> > +
> > + sg_init_table(&sgl, 1);
> > + sg_dma_address(&sgl) = edma_ctx.intr_rc_phys + qp_num * sizeof(u32);
> > + sg_dma_len(&sgl) = sizeof(u32);
> > +
> > + memset(&cfg, 0, sizeof(cfg));
> > + cfg.dst_addr = edma_ctx.intr_ep_phys + qp_num * sizeof(u32);
> > + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> > + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> > + cfg.direction = DMA_MEM_TO_DEV;
> > +
> > + if (dmaengine_slave_config(edma->intr_chan, &cfg))
> > + return -EINVAL;
> > +
> > + txd = dmaengine_prep_slave_sg(edma->intr_chan, &sgl, 1,
> > + DMA_MEM_TO_DEV,
> > + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> > + if (!txd)
> > + return -ENOSPC;
> > +
> > + cookie = dmaengine_submit(txd);
> > + if (dma_submit_error(cookie))
> > + return -ENOSPC;
> > +
> > + dma_async_issue_pending(edma->intr_chan);
> > + return 0;
> > +}
> > diff --git a/drivers/ntb/ntb_edma.h b/drivers/ntb/ntb_edma.h
> > new file mode 100644
> > index 000000000000..da0451827edb
> > --- /dev/null
> > +++ b/drivers/ntb/ntb_edma.h
> > @@ -0,0 +1,128 @@
> > +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> > +#ifndef _NTB_EDMA_H_
> > +#define _NTB_EDMA_H_
> > +
> > +#include <linux/completion.h>
> > +#include <linux/device.h>
> > +#include <linux/interrupt.h>
> > +
> > +#define EDMA_REG_SIZE SZ_64K
> > +#define DMA_LLP_MEM_SIZE SZ_4K
> > +#define EDMA_WR_CH_NUM 4
> > +#define EDMA_RD_CH_NUM 4
> > +#define NTB_EDMA_MAX_CH 8
> > +
> > +#define NTB_EDMA_INFO_MAGIC 0x45444D41 /* "EDMA" */
> > +#define NTB_EDMA_INFO_OFF EDMA_REG_SIZE
> > +
> > +#define NTB_EDMA_RING_ORDER 7
> > +#define NTB_EDMA_RING_ENTRIES (1U << NTB_EDMA_RING_ORDER)
> > +#define NTB_EDMA_RING_MASK (NTB_EDMA_RING_ENTRIES - 1)
> > +
> > +typedef void (*ntb_edma_interrupt_cb_t)(void *data, int qp_num);
> > +
> > +/*
> > + * REMOTE_EDMA_EP:
> > + * Endpoint owns the eDMA engine and pushes descriptors into a shared MW.
> > + *
> > + * REMOTE_EDMA_RC:
> > + * Root Complex controls the endpoint eDMA through the shared MW and
> > + * drives reads/writes on behalf of the host.
> > + */
> > +typedef enum {
> > + REMOTE_EDMA_UNKNOWN,
> > + REMOTE_EDMA_EP,
> > + REMOTE_EDMA_RC,
> > +} remote_edma_mode_t;
> > +
> > +typedef enum {
> > + REMOTE_EDMA_WRITE,
> > + REMOTE_EDMA_READ,
> > +} remote_edma_dir_t;
> > +
> > +/*
> > + * Layout of remote eDMA MW (EP local address space, RC sees via peer MW):
> > + *
> > + * 0 .. EDMA_REG_SIZE-1 : DesignWare eDMA registers
> > + * EDMA_REG_SIZE .. +PAGE_SIZE : struct ntb_edma_info (EP writes, RC reads)
> > + * +PAGE_SIZE .. : LL ring buffers (EP allocates phys addresses,
> > + * RC configures via dw_edma)
> > + *
> > + * ntb_edma_setup_mws() on EP:
> > + * - allocates ntb_edma_info and LLs in EP memory
> > + * - programs inbound iATU so that RC peer MW[n] points at this block
> > + *
> > + * ntb_edma_setup_peer() on RC:
> > + * - ioremaps peer MW[n]
> > + * - reads ntb_edma_info
> > + * - sets up dw_edma_chip ll_region_* from that info
> > + */
> > +struct ntb_edma_info {
> > + u32 magic;
> > + u16 wr_cnt;
> > + u16 rd_cnt;
> > + u64 regs_phys;
> > + u32 ll_stride;
> > + u32 rsvd;
> > + u64 ll_wr_phys[NTB_EDMA_MAX_CH];
> > + u64 ll_rd_phys[NTB_EDMA_MAX_CH];
> > +
> > + u64 intr_dar_base;
> > +} __packed;
> > +
> > +struct ll_dma_addrs {
> > + dma_addr_t wr[EDMA_WR_CH_NUM];
> > + dma_addr_t rd[EDMA_RD_CH_NUM];
> > +};
> > +
> > +struct ntb_edma_chans {
> > + struct device *dev;
> > +
> > + struct dma_chan *wr_chan[EDMA_WR_CH_NUM];
> > + struct dma_chan *rd_chan[EDMA_RD_CH_NUM];
> > + struct dma_chan *intr_chan;
> > +
> > + unsigned int num_wr_chan;
> > + unsigned int num_rd_chan;
> > + atomic_t cur_wr_chan;
> > + atomic_t cur_rd_chan;
> > +};
> > +
> > +static __always_inline u32 ntb_edma_ring_idx(u32 v)
> > +{
> > + return v & NTB_EDMA_RING_MASK;
> > +}
> > +
> > +static __always_inline u32 ntb_edma_ring_used_entry(u32 head, u32 tail)
> > +{
> > + if (head >= tail) {
> > + WARN_ON_ONCE((head - tail) > (NTB_EDMA_RING_ENTRIES - 1));
> > + return head - tail;
> > + }
> > +
> > + WARN_ON_ONCE((U32_MAX - tail + head + 1) > (NTB_EDMA_RING_ENTRIES - 1));
> > + return U32_MAX - tail + head + 1;
> > +}
> > +
> > +static __always_inline u32 ntb_edma_ring_free_entry(u32 head, u32 tail)
> > +{
> > + return NTB_EDMA_RING_ENTRIES - ntb_edma_ring_used_entry(head, tail) - 1;
> > +}
> > +
> > +static __always_inline bool ntb_edma_ring_full(u32 head, u32 tail)
> > +{
> > + return ntb_edma_ring_free_entry(head, tail) == 0;
> > +}
> > +
> > +int ntb_edma_setup_isr(struct device *dev, struct device *epc_dev,
> > + ntb_edma_interrupt_cb_t cb, void *data);
> > +void ntb_edma_teardown_isr(struct device *dev);
> > +int ntb_edma_setup_mws(struct ntb_dev *ndev);
> > +int ntb_edma_setup_peer(struct ntb_dev *ndev);
> > +int ntb_edma_setup_chans(struct device *dma_dev, struct ntb_edma_chans *edma);
> > +struct dma_chan *ntb_edma_pick_chan(struct ntb_edma_chans *edma,
> > + remote_edma_dir_t dir);
> > +void ntb_edma_teardown_chans(struct ntb_edma_chans *edma);
> > +int ntb_edma_notify_peer(struct ntb_edma_chans *edma, int qp_num);
> > +
> > +#endif
> > diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport_core.c
> > similarity index 65%
> > rename from drivers/ntb/ntb_transport.c
> > rename to drivers/ntb/ntb_transport_core.c
> > index 907db6c93d4d..48d48921978d 100644
> > --- a/drivers/ntb/ntb_transport.c
> > +++ b/drivers/ntb/ntb_transport_core.c
> > @@ -47,6 +47,9 @@
> > * Contact Information:
> > * Jon Mason <jon.mason@intel.com>
> > */
> > +#include <linux/atomic.h>
> > +#include <linux/bug.h>
> > +#include <linux/compiler.h>
> > #include <linux/debugfs.h>
> > #include <linux/delay.h>
> > #include <linux/dmaengine.h>
> > @@ -71,6 +74,8 @@
> > #define NTB_TRANSPORT_DESC "Software Queue-Pair Transport over NTB"
> > #define NTB_TRANSPORT_MIN_SPADS (MW0_SZ_HIGH + 2)
> >
> > +#define NTB_EDMA_MAX_POLL 32
> > +
> > MODULE_DESCRIPTION(NTB_TRANSPORT_DESC);
> > MODULE_VERSION(NTB_TRANSPORT_VER);
> > MODULE_LICENSE("Dual BSD/GPL");
> > @@ -102,6 +107,13 @@ module_param(use_msi, bool, 0644);
> > MODULE_PARM_DESC(use_msi, "Use MSI interrupts instead of doorbells");
> > #endif
> >
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > +#include "ntb_edma.h"
> > +static bool use_remote_edma;
> > +module_param(use_remote_edma, bool, 0644);
> > +MODULE_PARM_DESC(use_remote_edma, "Use remote eDMA mode (when enabled, use_msi is ignored)");
> > +#endif
> > +
> > static struct dentry *nt_debugfs_dir;
> >
> > /* Only two-ports NTB devices are supported */
> > @@ -125,6 +137,14 @@ struct ntb_queue_entry {
> > struct ntb_payload_header __iomem *tx_hdr;
> > struct ntb_payload_header *rx_hdr;
> > };
> > +
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + dma_addr_t addr;
> > +
> > + /* Used by RC side only */
> > + struct scatterlist sgl;
> > + struct work_struct dma_work;
> > +#endif
> > };
> >
> > struct ntb_rx_info {
> > @@ -202,6 +222,33 @@ struct ntb_transport_qp {
> > int msi_irq;
> > struct ntb_msi_desc msi_desc;
> > struct ntb_msi_desc peer_msi_desc;
> > +
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + /*
> > + * For ensuring peer notification in non-atomic context.
> > + * ntb_peer_db_set might sleep or schedule.
> > + */
> > + struct work_struct db_work;
> > +
> > + /*
> > + * wr: remote eDMA write transfer (EP -> RC direction)
> > + * rd: remote eDMA read transfer (RC -> EP direction)
> > + */
> > + u32 wr_cons;
> > + u32 rd_cons;
> > + u32 wr_prod;
> > + u32 rd_prod;
> > + u32 wr_issue;
> > + u32 rd_issue;
> > +
> > + spinlock_t ep_tx_lock;
> > + spinlock_t ep_rx_lock;
> > + spinlock_t rc_lock;
> > +
> > + /* Completion work for read/write transfers. */
> > + struct work_struct read_work;
> > + struct work_struct write_work;
> > +#endif
> > };
> >
> > struct ntb_transport_mw {
> > @@ -249,6 +296,13 @@ struct ntb_transport_ctx {
> >
> > /* Make sure workq of link event be executed serially */
> > struct mutex link_event_lock;
> > +
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + remote_edma_mode_t remote_edma_mode;
> > + struct device *dma_dev;
> > + struct workqueue_struct *wq;
> > + struct ntb_edma_chans edma;
> > +#endif
> > };
> >
> > enum {
> > @@ -262,6 +316,19 @@ struct ntb_payload_header {
> > unsigned int flags;
> > };
> >
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > +static void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt);
> > +static int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
> > + unsigned int *mw_count);
> > +static void ntb_transport_edma_init_queue(struct ntb_transport_ctx *nt,
> > + unsigned int qp_num);
> > +static void ntb_transport_edma_create_queue(struct ntb_transport_ctx *nt,
> > + struct ntb_transport_qp *qp);
> > +static int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt);
> > +static int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt);
> > +static void ntb_transport_edma_rc_dma_work(struct work_struct *work);
> > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > +
> > /*
> > * Return the device that should be used for DMA mapping.
> > *
> > @@ -298,7 +365,7 @@ enum {
> > container_of((__drv), struct ntb_transport_client, driver)
> >
> > #define QP_TO_MW(nt, qp) ((qp) % nt->mw_count)
> > -#define NTB_QP_DEF_NUM_ENTRIES 100
> > +#define NTB_QP_DEF_NUM_ENTRIES 128
> > #define NTB_LINK_DOWN_TIMEOUT 10
> >
> > static void ntb_transport_rxc_db(unsigned long data);
> > @@ -1015,6 +1082,10 @@ static void ntb_transport_link_cleanup(struct ntb_transport_ctx *nt)
> > count = ntb_spad_count(nt->ndev);
> > for (i = 0; i < count; i++)
> > ntb_spad_write(nt->ndev, i, 0);
> > +
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + ntb_edma_teardown_chans(&nt->edma);
> > +#endif
> > }
> >
> > static void ntb_transport_link_cleanup_work(struct work_struct *work)
> > @@ -1051,6 +1122,14 @@ static void ntb_transport_link_work(struct work_struct *work)
> >
> > /* send the local info, in the opposite order of the way we read it */
> >
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + rc = ntb_transport_edma_ep_init(nt);
> > + if (rc) {
> > + dev_err(&pdev->dev, "Failed to init EP: %d\n", rc);
> > + return;
> > + }
> > +#endif
> > +
> > if (nt->use_msi) {
> > rc = ntb_msi_setup_mws(ndev);
> > if (rc) {
> > @@ -1132,6 +1211,14 @@ static void ntb_transport_link_work(struct work_struct *work)
> >
> > nt->link_is_up = true;
> >
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + rc = ntb_transport_edma_rc_init(nt);
> > + if (rc) {
> > + dev_err(&pdev->dev, "Failed to init RC: %d\n", rc);
> > + goto out1;
> > + }
> > +#endif
> > +
> > for (i = 0; i < nt->qp_count; i++) {
> > struct ntb_transport_qp *qp = &nt->qp_vec[i];
> >
> > @@ -1277,6 +1364,8 @@ static const struct ntb_transport_backend_ops default_backend_ops = {
> > .debugfs_stats_show = ntb_transport_default_debugfs_stats_show,
> > };
> >
> > +static const struct ntb_transport_backend_ops edma_backend_ops;
> > +
> > static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> > {
> > struct ntb_transport_ctx *nt;
> > @@ -1311,7 +1400,23 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> >
> > nt->ndev = ndev;
> >
> > - nt->backend_ops = default_backend_ops;
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + if (use_remote_edma) {
> > + rc = ntb_transport_edma_init(nt, &mw_count);
> > + if (rc) {
> > + nt->mw_count = 0;
> > + goto err;
> > + }
> > + nt->backend_ops = edma_backend_ops;
> > +
> > + /*
> > + * On remote eDMA mode, we reserve a read channel for Host->EP
> > + * interruption.
> > + */
> > + use_msi = false;
> > + } else
> > +#endif
> > + nt->backend_ops = default_backend_ops;
> >
> > /*
> > * If we are using MSI, and have at least one extra memory window,
> > @@ -1402,6 +1507,10 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> > rc = ntb_transport_init_queue(nt, i);
> > if (rc)
> > goto err2;
> > +
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + ntb_transport_edma_init_queue(nt, i);
> > +#endif
> > }
> >
> > INIT_DELAYED_WORK(&nt->link_work, ntb_transport_link_work);
> > @@ -1433,6 +1542,9 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> > }
> > kfree(nt->mw_vec);
> > err:
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + ntb_transport_edma_uninit(nt);
> > +#endif
> > kfree(nt);
> > return rc;
> > }
> > @@ -2055,11 +2167,16 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
> >
> > nt->qp_bitmap_free &= ~qp_bit;
> >
> > + qp->qp_bit = qp_bit;
> > qp->cb_data = data;
> > qp->rx_handler = handlers->rx_handler;
> > qp->tx_handler = handlers->tx_handler;
> > qp->event_handler = handlers->event_handler;
> >
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + ntb_transport_edma_create_queue(nt, qp);
> > +#endif
> > +
> > dma_cap_zero(dma_mask);
> > dma_cap_set(DMA_MEMCPY, dma_mask);
> >
> > @@ -2105,6 +2222,9 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
> > goto err1;
> >
> > entry->qp = qp;
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + INIT_WORK(&entry->dma_work, ntb_transport_edma_rc_dma_work);
> > +#endif
> > ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> > &qp->rx_free_q);
> > }
> > @@ -2156,8 +2276,8 @@ EXPORT_SYMBOL_GPL(ntb_transport_create_queue);
> > */
> > void ntb_transport_free_queue(struct ntb_transport_qp *qp)
> > {
> > - struct pci_dev *pdev;
> > struct ntb_queue_entry *entry;
> > + struct pci_dev *pdev;
> > u64 qp_bit;
> >
> > if (!qp)
> > @@ -2208,6 +2328,10 @@ void ntb_transport_free_queue(struct ntb_transport_qp *qp)
> > tasklet_kill(&qp->rxc_db_work);
> >
> > cancel_delayed_work_sync(&qp->link_work);
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + cancel_work_sync(&qp->read_work);
> > + cancel_work_sync(&qp->write_work);
> > +#endif
> >
> > qp->cb_data = NULL;
> > qp->rx_handler = NULL;
> > @@ -2346,6 +2470,1157 @@ int ntb_transport_tx_enqueue(struct ntb_transport_qp *qp, void *cb, void *data,
> > }
> > EXPORT_SYMBOL_GPL(ntb_transport_tx_enqueue);
> >
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > +/*
> > + * Remote eDMA mode implementation
> > + */
> > +struct ntb_edma_desc {
> > + u32 len;
> > + u32 flags;
> > + u64 addr; /* DMA address */
> > + u64 data;
> > +};
> > +
> > +struct ntb_edma_ring {
> > + struct ntb_edma_desc desc[NTB_EDMA_RING_ENTRIES];
> > + u32 head;
> > + u32 tail;
> > +};
> > +
> > +#define NTB_EDMA_DESC_OFF(i) ((size_t)(i) * sizeof(struct ntb_edma_desc))
> > +
> > +#define __NTB_EDMA_CHECK_INDEX(_i) \
> > +({ \
> > + unsigned long __i = (unsigned long)(_i); \
> > + WARN_ONCE(__i >= (unsigned long)NTB_EDMA_RING_ENTRIES, \
> > + "ntb_edma: index i=%lu >= ring_entries=%lu\n", \
> > + __i, (unsigned long)NTB_EDMA_RING_ENTRIES); \
> > + __i; \
> > +})
> > +
> > +#define NTB_EDMA_DESC_I(qp, i, n) \
> > +({ \
> > + typeof(qp) __qp = (qp); \
> > + unsigned long __i = __NTB_EDMA_CHECK_INDEX(i); \
> > + (struct ntb_edma_desc *) \
> > + ((char *)(__qp)->rx_buff + \
> > + (sizeof(struct ntb_edma_ring) * n) + \
> > + NTB_EDMA_DESC_OFF(__i)); \
> > +})
> > +
> > +#define NTB_EDMA_DESC_O(qp, i, n) \
> > +({ \
> > + typeof(qp) __qp = (qp); \
> > + unsigned long __i = __NTB_EDMA_CHECK_INDEX(i); \
> > + (struct ntb_edma_desc __iomem *) \
> > + ((char __iomem *)(__qp)->tx_mw + \
> > + (sizeof(struct ntb_edma_ring) * n) + \
> > + NTB_EDMA_DESC_OFF(__i)); \
> > +})
> > +
> > +#define NTB_EDMA_HEAD_I(qp, n) ((u32 *)((char *)qp->rx_buff + \
> > + (sizeof(struct ntb_edma_ring) * n) + \
> > + offsetof(struct ntb_edma_ring, head)))
> > +#define NTB_EDMA_HEAD_O(qp, n) ((u32 *)((char __iomem *)qp->tx_mw + \
> > + (sizeof(struct ntb_edma_ring) * n) + \
> > + offsetof(struct ntb_edma_ring, head)))
> > +#define NTB_EDMA_TAIL_I(qp, n) ((u32 *)((char *)qp->rx_buff + \
> > + (sizeof(struct ntb_edma_ring) * n) + \
> > + offsetof(struct ntb_edma_ring, tail)))
> > +#define NTB_EDMA_TAIL_O(qp, n) ((u32 *)((char __iomem *)qp->tx_mw + \
> > + (sizeof(struct ntb_edma_ring) * n) + \
> > + offsetof(struct ntb_edma_ring, tail)))
> > +
> > +/*
> > + * Macro naming rule:
> > + * NTB_DESC_RD_EP_I (as an example)
> > + * ^^ ^^ ^
> > + * : : `-- I(n) or O(ut). In = Read, Out = Write.
> > + * : `----- Who uses this macro.
> > + * `-------- DESC / HEAD / TAIL
> > + *
> > + * Read transfers (RC->EP):
> > + *
> > + * EP view (outbound, written via NTB):
> > + * - descs: NTB_DESC_RD_EP_O(qp, i) / NTB_DESC_RD_EP_I(qp, i)
> > + * [ len ][ flags ][ addr ][ data ]
> > + * [ len ][ flags ][ addr ][ data ]
> > + * :
> > + * [ len ][ flags ][ addr ][ data ]
> > + * - head: NTB_HEAD_RD_EP_O(qp)
> > + * - tail: NTB_TAIL_RD_EP_I(qp)
> > + *
> > + * RC view (inbound, local mapping):
> > + * - descs: NTB_DESC_RD_RC_I(qp, i) / NTB_DESC_RD_RC_O(qp, i)
> > + * [ len ][ flags ][ addr ][ data ]
> > + * [ len ][ flags ][ addr ][ data ]
> > + * :
> > + * [ len ][ flags ][ addr ][ data ]
> > + * - head: NTB_HEAD_RD_RC_I(qp)
> > + * - tail: NTB_TAIL_RD_RC_O(qp)
> > + *
> > + * Write transfers (EP -> RC) are analogous but use
> > + * NTB_DESC_WR_{EP_O,RC_I}(), NTB_HEAD_WR_{EP_O,RC_I}(),
> > + * and NTB_TAIL_WR_{EP_I,RC_O}().
> > + */
> > +#define NTB_DESC_RD_EP_I(qp, i) NTB_EDMA_DESC_I(qp, i, 0)
> > +#define NTB_DESC_RD_EP_O(qp, i) NTB_EDMA_DESC_O(qp, i, 0)
> > +#define NTB_DESC_WR_EP_I(qp, i) NTB_EDMA_DESC_I(qp, i, 1)
> > +#define NTB_DESC_WR_EP_O(qp, i) NTB_EDMA_DESC_O(qp, i, 1)
> > +#define NTB_DESC_RD_RC_I(qp, i) NTB_EDMA_DESC_I(qp, i, 0)
> > +#define NTB_DESC_RD_RC_O(qp, i) NTB_EDMA_DESC_O(qp, i, 0)
> > +#define NTB_DESC_WR_RC_I(qp, i) NTB_EDMA_DESC_I(qp, i, 1)
> > +#define NTB_DESC_WR_RC_O(qp, i) NTB_EDMA_DESC_O(qp, i, 1)
> > +
> > +#define NTB_HEAD_RD_EP_O(qp) NTB_EDMA_HEAD_O(qp, 0)
> > +#define NTB_HEAD_WR_EP_O(qp) NTB_EDMA_HEAD_O(qp, 1)
> > +#define NTB_HEAD_RD_RC_I(qp) NTB_EDMA_HEAD_I(qp, 0)
> > +#define NTB_HEAD_WR_RC_I(qp) NTB_EDMA_HEAD_I(qp, 1)
> > +
> > +#define NTB_TAIL_RD_EP_I(qp) NTB_EDMA_TAIL_I(qp, 0)
> > +#define NTB_TAIL_WR_EP_I(qp) NTB_EDMA_TAIL_I(qp, 1)
> > +#define NTB_TAIL_RD_RC_O(qp) NTB_EDMA_TAIL_O(qp, 0)
> > +#define NTB_TAIL_WR_RC_O(qp) NTB_EDMA_TAIL_O(qp, 1)
> > +
> > +static inline bool ntb_qp_edma_is_rc(struct ntb_transport_qp *qp)
> > +{
> > + return qp->transport->remote_edma_mode == REMOTE_EDMA_RC;
> > +}
> > +
> > +static inline bool ntb_qp_edma_is_ep(struct ntb_transport_qp *qp)
> > +{
> > + return qp->transport->remote_edma_mode == REMOTE_EDMA_EP;
> > +}
> > +
> > +static inline bool ntb_qp_edma_enabled(struct ntb_transport_qp *qp)
> > +{
> > + return ntb_qp_edma_is_rc(qp) || ntb_qp_edma_is_ep(qp);
> > +}
> > +
> > +static unsigned int ntb_transport_edma_tx_free_entry(struct ntb_transport_qp *qp)
> > +{
> > + unsigned int head, tail;
> > +
> > + if (ntb_qp_edma_is_ep(qp)) {
> > + scoped_guard(spinlock_irqsave, &qp->ep_tx_lock) {
> > + /* In this scope, only 'head' might proceed */
> > + tail = READ_ONCE(qp->wr_cons);
> > + head = READ_ONCE(qp->wr_prod);
> > + }
> > + return ntb_edma_ring_free_entry(head, tail);
> > + }
> > +
> > + scoped_guard(spinlock_irqsave, &qp->rc_lock) {
> > + /* In this scope, only 'head' might proceed */
> > + tail = READ_ONCE(qp->rd_issue);
> > + head = READ_ONCE(*NTB_HEAD_RD_RC_I(qp));
> > + }
> > + /*
> > + * On RC side, 'used' amount indicates how much EP side
> > + * has refilled, which are available for us to use for TX.
> > + */
> > + return ntb_edma_ring_used_entry(head, tail);
> > +}
> > +
> > +static void ntb_transport_edma_debugfs_stats_show(struct seq_file *s,
> > + struct ntb_transport_qp *qp)
> > +{
> > + seq_printf(s, "rx_bytes - \t%llu\n", qp->rx_bytes);
> > + seq_printf(s, "rx_pkts - \t%llu\n", qp->rx_pkts);
> > + seq_printf(s, "rx_err_no_buf - %llu\n", qp->rx_err_no_buf);
> > + seq_printf(s, "rx_buff - \t0x%p\n", qp->rx_buff);
> > + seq_printf(s, "rx_max_entry - \t%u\n", qp->rx_max_entry);
> > + seq_printf(s, "rx_alloc_entry - \t%u\n\n", qp->rx_alloc_entry);
> > +
> > + seq_printf(s, "tx_bytes - \t%llu\n", qp->tx_bytes);
> > + seq_printf(s, "tx_pkts - \t%llu\n", qp->tx_pkts);
> > + seq_printf(s, "tx_ring_full - \t%llu\n", qp->tx_ring_full);
> > + seq_printf(s, "tx_err_no_buf - %llu\n", qp->tx_err_no_buf);
> > + seq_printf(s, "tx_mw - \t0x%p\n", qp->tx_mw);
> > + seq_printf(s, "tx_max_entry - \t%u\n", qp->tx_max_entry);
> > + seq_printf(s, "free tx - \t%u\n", ntb_transport_tx_free_entry(qp));
> > + seq_putc(s, '\n');
> > +
> > + seq_puts(s, "Using Remote eDMA - Yes\n");
> > + seq_printf(s, "QP Link - \t%s\n", qp->link_is_up ? "Up" : "Down");
> > +}
> > +
> > +static void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt)
> > +{
> > + struct ntb_dev *ndev = nt->ndev;
> > +
> > + if (nt->remote_edma_mode == REMOTE_EDMA_EP && ndev && ndev->pdev)
> > + ntb_edma_teardown_isr(&ndev->pdev->dev);
> > +
> > +	if (nt->wq)
> > + destroy_workqueue(nt->wq);
> > + nt->wq = NULL;
> > +}
> > +
> > +static int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
> > + unsigned int *mw_count)
> > +{
> > + struct ntb_dev *ndev = nt->ndev;
> > +
> > + /*
> > + * We need at least one MW for the transport plus one MW reserved
> > + * for the remote eDMA window (see ntb_edma_setup_mws/peer).
> > + */
> > + if (*mw_count <= 1) {
> > + dev_err(&ndev->dev,
> > + "remote eDMA requires at least two MWS (have %u)\n",
> > + *mw_count);
> > + return -ENODEV;
> > + }
> > +
> > + nt->wq = alloc_workqueue("ntb-edma-wq", WQ_UNBOUND | WQ_SYSFS, 0);
> > + if (!nt->wq) {
> > + ntb_transport_edma_uninit(nt);
> > + return -ENOMEM;
> > + }
> > +
> > + /* Reserve the last peer MW exclusively for the eDMA window. */
> > + *mw_count -= 1;
> > +
> > + return 0;
> > +}
> > +
> > +static void ntb_transport_edma_db_work(struct work_struct *work)
> > +{
> > + struct ntb_transport_qp *qp =
> > + container_of(work, struct ntb_transport_qp, db_work);
> > +
> > + ntb_peer_db_set(qp->ndev, qp->qp_bit);
> > +}
> > +
> > +static void ntb_transport_edma_notify_peer(struct ntb_transport_qp *qp)
> > +{
> > + if (ntb_qp_edma_is_rc(qp))
> > + if (!ntb_edma_notify_peer(&qp->transport->edma, qp->qp_num))
> > + return;
> > +
> > + /*
> > + * Called from contexts that may be atomic. Since ntb_peer_db_set()
> > + * may sleep, delegate the actual doorbell write to a workqueue.
> > + */
> > + queue_work(system_highpri_wq, &qp->db_work);
> > +}
> > +
> > +static void ntb_transport_edma_isr(void *data, int qp_num)
> > +{
> > + struct ntb_transport_ctx *nt = data;
> > + struct ntb_transport_qp *qp;
> > +
> > + if (qp_num < 0 || qp_num >= nt->qp_count)
> > + return;
> > +
> > + qp = &nt->qp_vec[qp_num];
> > + if (WARN_ON(!qp))
> > + return;
> > +
> > + queue_work(nt->wq, &qp->read_work);
> > + queue_work(nt->wq, &qp->write_work);
> > +}
> > +
> > +static int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt)
> > +{
> > + struct ntb_dev *ndev = nt->ndev;
> > + struct pci_dev *pdev = ndev->pdev;
> > + int rc;
> > +
> > + if (!use_remote_edma || nt->remote_edma_mode != REMOTE_EDMA_UNKNOWN)
> > + return 0;
> > +
> > + rc = ntb_edma_setup_peer(ndev);
> > + if (rc) {
> > + dev_err(&pdev->dev, "Failed to enable remote eDMA: %d\n", rc);
> > + return rc;
> > + }
> > +
> > + rc = ntb_edma_setup_chans(get_dma_dev(ndev), &nt->edma);
> > + if (rc) {
> > + dev_err(&pdev->dev, "Failed to setup eDMA channels: %d\n", rc);
> > + return rc;
> > + }
> > +
> > + nt->remote_edma_mode = REMOTE_EDMA_RC;
> > + return 0;
> > +}
> > +
> > +static int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt)
> > +{
> > + struct ntb_dev *ndev = nt->ndev;
> > + struct pci_dev *pdev = ndev->pdev;
> > + struct pci_epc *epc;
> > + int rc;
> > +
> > + if (!use_remote_edma || nt->remote_edma_mode == REMOTE_EDMA_EP)
> > + return 0;
> > +
> > + /* Only EP side can return pci_epc */
> > + epc = ntb_get_pci_epc(ndev);
> > + if (!epc)
> > + return 0;
> > +
> > + rc = ntb_edma_setup_mws(ndev);
> > + if (rc) {
> > + dev_err(&pdev->dev,
> > + "Failed to set up memory window for eDMA: %d\n", rc);
> > + return rc;
> > + }
> > +
> > + rc = ntb_edma_setup_isr(&pdev->dev, &epc->dev, ntb_transport_edma_isr, nt);
> > + if (rc) {
> > + dev_err(&pdev->dev, "Failed to setup eDMA ISR (%d)\n", rc);
> > + return rc;
> > + }
> > +
> > + nt->remote_edma_mode = REMOTE_EDMA_EP;
> > + return 0;
> > +}
> > +
> > +static int ntb_transport_edma_setup_qp_mw(struct ntb_transport_ctx *nt,
> > + unsigned int qp_num)
> > +{
> > + struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
> > + struct ntb_dev *ndev = nt->ndev;
> > + struct ntb_queue_entry *entry;
> > + struct ntb_transport_mw *mw;
> > + unsigned int mw_num, mw_count, qp_count;
> > + unsigned int qp_offset, rx_info_offset;
> > + unsigned int mw_size, mw_size_per_qp;
> > + unsigned int num_qps_mw;
> > + size_t edma_total;
> > + unsigned int i;
> > + int node;
> > +
> > + mw_count = nt->mw_count;
> > + qp_count = nt->qp_count;
> > +
> > + mw_num = QP_TO_MW(nt, qp_num);
> > + mw = &nt->mw_vec[mw_num];
> > +
> > + if (!mw->virt_addr)
> > + return -ENOMEM;
> > +
> > + if (mw_num < qp_count % mw_count)
> > + num_qps_mw = qp_count / mw_count + 1;
> > + else
> > + num_qps_mw = qp_count / mw_count;
> > +
> > + mw_size = min(nt->mw_vec[mw_num].phys_size, mw->xlat_size);
> > + if (max_mw_size && mw_size > max_mw_size)
> > + mw_size = max_mw_size;
> > +
> > + mw_size_per_qp = round_down((unsigned int)mw_size / num_qps_mw, SZ_64);
> > + qp_offset = mw_size_per_qp * (qp_num / mw_count);
> > + rx_info_offset = mw_size_per_qp - sizeof(struct ntb_rx_info);
> > +
> > + qp->tx_mw_size = mw_size_per_qp;
> > + qp->tx_mw = nt->mw_vec[mw_num].vbase + qp_offset;
> > + if (!qp->tx_mw)
> > + return -EINVAL;
> > + qp->tx_mw_phys = nt->mw_vec[mw_num].phys_addr + qp_offset;
> > + if (!qp->tx_mw_phys)
> > + return -EINVAL;
> > + qp->rx_info = qp->tx_mw + rx_info_offset;
> > + qp->rx_buff = mw->virt_addr + qp_offset;
> > + qp->remote_rx_info = qp->rx_buff + rx_info_offset;
> > +
> > + /* Due to housekeeping, there must be at least 2 buffs */
> > + qp->tx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
> > + qp->rx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
> > +
> > + /* In eDMA mode, decouple from MW sizing and force ring-sized entries */
> > + edma_total = 2 * sizeof(struct ntb_edma_ring);
> > + if (rx_info_offset < edma_total) {
> > + dev_err(&ndev->dev, "Ring space requires %luB (>=%uB)\n",
> > + edma_total, rx_info_offset);
> > + return -EINVAL;
> > + }
> > + qp->tx_max_entry = NTB_EDMA_RING_ENTRIES;
> > + qp->rx_max_entry = NTB_EDMA_RING_ENTRIES;
> > +
> > + /*
> > + * Checking to see if we have more entries than the default.
> > + * We should add additional entries if that is the case so we
> > + * can be in sync with the transport frames.
> > + */
> > + node = dev_to_node(&ndev->dev);
> > + for (i = qp->rx_alloc_entry; i < qp->rx_max_entry; i++) {
> > + entry = kzalloc_node(sizeof(*entry), GFP_KERNEL, node);
> > + if (!entry)
> > + return -ENOMEM;
> > +
> > + entry->qp = qp;
> > + INIT_WORK(&entry->dma_work, ntb_transport_edma_rc_dma_work);
> > + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> > + &qp->rx_free_q);
> > + qp->rx_alloc_entry++;
> > + }
> > +
> > + memset(qp->rx_buff, 0, edma_total);
> > +
> > + qp->rx_pkts = 0;
> > + qp->tx_pkts = 0;
> > +
> > + return 0;
> > +}
> > +
> > +static int ntb_transport_edma_ep_read_complete(struct ntb_transport_qp *qp)
> > +{
> > + struct device *dma_dev = get_dma_dev(qp->ndev);
> > + struct ntb_queue_entry *entry;
> > + struct ntb_edma_desc *in;
> > + unsigned int len;
> > + u32 idx;
> > +
> > + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_TAIL_RD_EP_I(qp)),
> > + qp->rd_cons) == 0)
> > + return 0;
> > +
> > + idx = ntb_edma_ring_idx(qp->rd_cons);
> > + in = NTB_DESC_RD_EP_I(qp, idx);
> > + if (!(in->flags & DESC_DONE_FLAG))
> > + return 0;
> > +
> > + in->flags = 0;
> > + len = in->len; /* might be smaller than entry->len */
> > +
> > + entry = (struct ntb_queue_entry *)(in->data);
> > + if (WARN_ON(!entry))
> > + return 0;
> > +
> > + if (in->flags & LINK_DOWN_FLAG) {
> > + ntb_qp_link_down(qp);
> > + qp->rd_cons++;
> > + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> > + return 1;
> > + }
> > +
> > + dma_unmap_single(dma_dev, entry->addr, entry->len, DMA_FROM_DEVICE);
> > +
> > + qp->rx_bytes += len;
> > + qp->rx_pkts++;
> > + qp->rd_cons++;
> > +
> > + if (qp->rx_handler && qp->client_ready)
> > + qp->rx_handler(qp, qp->cb_data, entry->cb_data, len);
> > +
> > + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> > + return 1;
> > +}
> > +
> > +static int ntb_transport_edma_ep_write_complete(struct ntb_transport_qp *qp)
> > +{
> > + struct ntb_queue_entry *entry;
> > + struct ntb_edma_desc *in;
> > + u32 idx;
> > +
> > + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_TAIL_WR_EP_I(qp)),
> > + qp->wr_cons) == 0)
> > + return 0;
> > +
> > + idx = ntb_edma_ring_idx(qp->wr_cons);
> > + in = NTB_DESC_WR_EP_I(qp, idx);
> > +
> > + entry = (struct ntb_queue_entry *)(in->data);
> > + if (WARN_ON(!entry))
> > + return 0;
> > +
> > + qp->wr_cons++;
> > +
> > + if (qp->tx_handler)
> > + qp->tx_handler(qp, qp->cb_data, entry->cb_data, entry->len);
> > +
> > + ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry, &qp->tx_free_q);
> > + return 1;
> > +}
> > +
> > +static void ntb_transport_edma_ep_read_work(struct work_struct *work)
> > +{
> > + struct ntb_transport_qp *qp = container_of(
> > + work, struct ntb_transport_qp, read_work);
> > + unsigned int i;
> > +
> > + for (i = 0; i < NTB_EDMA_MAX_POLL; i++) {
> > + if (!ntb_transport_edma_ep_read_complete(qp))
> > + break;
> > + }
> > +
> > + if (ntb_transport_edma_ep_read_complete(qp))
> > + queue_work(qp->transport->wq, &qp->read_work);
> > +}
> > +
> > +static void ntb_transport_edma_ep_write_work(struct work_struct *work)
> > +{
> > + struct ntb_transport_qp *qp = container_of(
> > + work, struct ntb_transport_qp, write_work);
> > + unsigned int i;
> > +
> > + for (i = 0; i < NTB_EDMA_MAX_POLL; i++) {
> > + if (!ntb_transport_edma_ep_write_complete(qp))
> > + break;
> > + }
> > +
> > + if (ntb_transport_edma_ep_write_complete(qp))
> > + queue_work(qp->transport->wq, &qp->write_work);
> > +}
> > +
> > +static void ntb_transport_edma_rc_write_complete_work(struct work_struct *work)
> > +{
> > + struct ntb_transport_qp *qp = container_of(
> > + work, struct ntb_transport_qp, write_work);
> > + struct ntb_queue_entry *entry;
> > + struct ntb_edma_desc *in;
> > + unsigned int len;
> > + void *cb_data;
> > + u32 idx;
> > +
> > + while (ntb_edma_ring_used_entry(READ_ONCE(qp->wr_issue),
> > + qp->wr_cons) != 0) {
> > + /* Paired with smp_wmb() in ntb_transport_edma_rc_poll() */
> > + smp_rmb();
> > +
> > + idx = ntb_edma_ring_idx(qp->wr_cons);
> > + in = NTB_DESC_WR_RC_I(qp, idx);
> > + entry = (struct ntb_queue_entry *)READ_ONCE(in->data);
> > + if (!entry || !(entry->flags & DESC_DONE_FLAG))
> > + break;
> > +
> > + in->data = 0;
> > +
> > + cb_data = entry->cb_data;
> > + len = entry->len;
> > +
> > + iowrite32(++qp->wr_cons, NTB_TAIL_WR_RC_O(qp));
> > +
> > + if (unlikely(entry->flags & LINK_DOWN_FLAG)) {
> > + ntb_qp_link_down(qp);
> > + continue;
> > + }
> > +
> > + ntb_transport_edma_notify_peer(qp);
> > +
> > + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> > +
> > + if (qp->rx_handler && qp->client_ready)
> > + qp->rx_handler(qp, qp->cb_data, cb_data, len);
> > +
> > + /* stat updates */
> > + qp->rx_bytes += len;
> > + qp->rx_pkts++;
> > + }
> > +}
> > +
> > +static void ntb_transport_edma_rc_write_cb(void *data,
> > + const struct dmaengine_result *res)
> > +{
> > + struct ntb_queue_entry *entry = data;
> > + struct ntb_transport_qp *qp = entry->qp;
> > + struct ntb_transport_ctx *nt = qp->transport;
> > + enum dmaengine_tx_result dma_err = res->result;
> > + struct device *dma_dev = get_dma_dev(qp->ndev);
> > +
> > + switch (dma_err) {
> > + case DMA_TRANS_READ_FAILED:
> > + case DMA_TRANS_WRITE_FAILED:
> > + case DMA_TRANS_ABORTED:
> > + entry->errors++;
> > + entry->len = -EIO;
> > + break;
> > + case DMA_TRANS_NOERROR:
> > + default:
> > + break;
> > + }
> > + dma_unmap_sg(dma_dev, &entry->sgl, 1, DMA_FROM_DEVICE);
> > + sg_dma_address(&entry->sgl) = 0;
> > +
> > + entry->flags |= DESC_DONE_FLAG;
> > +
> > + queue_work(nt->wq, &qp->write_work);
> > +}
> > +
> > +static void ntb_transport_edma_rc_read_complete_work(struct work_struct *work)
> > +{
> > + struct ntb_transport_qp *qp = container_of(
> > + work, struct ntb_transport_qp, read_work);
> > + struct ntb_edma_desc *in, __iomem *out;
> > + struct ntb_queue_entry *entry;
> > + unsigned int len;
> > + void *cb_data;
> > + u32 idx;
> > +
> > + while (ntb_edma_ring_used_entry(READ_ONCE(qp->rd_issue),
> > + qp->rd_cons) != 0) {
> > + /* Paired with smp_wmb() in ntb_transport_edma_rc_tx_enqueue() */
> > + smp_rmb();
> > +
> > + idx = ntb_edma_ring_idx(qp->rd_cons);
> > + in = NTB_DESC_RD_RC_I(qp, idx);
> > + entry = (struct ntb_queue_entry *)in->data;
> > + if (!entry || !(entry->flags & DESC_DONE_FLAG))
> > + break;
> > +
> > + in->data = 0;
> > +
> > + cb_data = entry->cb_data;
> > + len = entry->len;
> > +
> > + out = NTB_DESC_RD_RC_O(qp, idx);
> > +
> > + WRITE_ONCE(qp->rd_cons, qp->rd_cons + 1);
> > +
> > + /*
> > + * No need to add barrier in-between to enforce ordering here.
> > + * The other side proceeds only after both flags and tail are
> > + * updated.
> > + */
> > + iowrite32(entry->flags, &out->flags);
> > + iowrite32(qp->rd_cons, NTB_TAIL_RD_RC_O(qp));
> > +
> > + ntb_transport_edma_notify_peer(qp);
> > +
> > + ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry,
> > + &qp->tx_free_q);
> > +
> > + if (qp->tx_handler)
> > + qp->tx_handler(qp, qp->cb_data, cb_data, len);
> > +
> > + /* stat updates */
> > + qp->tx_bytes += len;
> > + qp->tx_pkts++;
> > + }
> > +}
> > +
> > +static void ntb_transport_edma_rc_read_cb(void *data,
> > + const struct dmaengine_result *res)
> > +{
> > + struct ntb_queue_entry *entry = data;
> > + struct ntb_transport_qp *qp = entry->qp;
> > + struct ntb_transport_ctx *nt = qp->transport;
> > + struct device *dma_dev = get_dma_dev(qp->ndev);
> > + enum dmaengine_tx_result dma_err = res->result;
> > +
> > + switch (dma_err) {
> > + case DMA_TRANS_READ_FAILED:
> > + case DMA_TRANS_WRITE_FAILED:
> > + case DMA_TRANS_ABORTED:
> > + entry->errors++;
> > + entry->len = -EIO;
> > + break;
> > + case DMA_TRANS_NOERROR:
> > + default:
> > + break;
> > + }
> > + dma_unmap_sg(dma_dev, &entry->sgl, 1, DMA_TO_DEVICE);
> > + sg_dma_address(&entry->sgl) = 0;
> > +
> > + entry->flags |= DESC_DONE_FLAG;
> > +
> > + queue_work(nt->wq, &qp->read_work);
> > +}
> > +
> > +static int ntb_transport_edma_rc_write_start(struct device *d,
> > + struct dma_chan *chan, size_t len,
> > + dma_addr_t ep_src, void *rc_dst,
> > + struct ntb_queue_entry *entry)
> > +{
> > + struct scatterlist *sgl = &entry->sgl;
> > + struct dma_async_tx_descriptor *txd;
> > + struct dma_slave_config cfg;
> > + dma_cookie_t cookie;
> > + int nents, rc;
> > +
> > + if (!d)
> > + return -ENODEV;
> > +
> > + if (!chan)
> > + return -ENXIO;
> > +
> > + if (WARN_ON(!ep_src || !rc_dst))
> > + return -EINVAL;
> > +
> > + if (WARN_ON(sg_dma_address(sgl)))
> > + return -EINVAL;
> > +
> > + sg_init_one(sgl, rc_dst, len);
> > + nents = dma_map_sg(d, sgl, 1, DMA_FROM_DEVICE);
> > + if (nents <= 0)
> > + return -EIO;
> > +
> > + memset(&cfg, 0, sizeof(cfg));
> > + cfg.src_addr = ep_src;
> > + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> > + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> > + cfg.direction = DMA_DEV_TO_MEM;
> > + rc = dmaengine_slave_config(chan, &cfg);
> > + if (rc)
> > + goto out_unmap;
> > +
> > + txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_DEV_TO_MEM,
> > + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> > + if (!txd) {
> > + rc = -EIO;
> > + goto out_unmap;
> > + }
> > +
> > + txd->callback_result = ntb_transport_edma_rc_write_cb;
> > + txd->callback_param = entry;
> > +
> > + cookie = dmaengine_submit(txd);
> > + if (dma_submit_error(cookie)) {
> > + rc = -EIO;
> > + goto out_unmap;
> > + }
> > + dma_async_issue_pending(chan);
> > + return 0;
> > +out_unmap:
> > + dma_unmap_sg(d, sgl, 1, DMA_FROM_DEVICE);
> > + return rc;
> > +}
> > +
> > +static int ntb_transport_edma_rc_read_start(struct device *d,
> > + struct dma_chan *chan, size_t len,
> > + void *rc_src, dma_addr_t ep_dst,
> > + struct ntb_queue_entry *entry)
> > +{
> > + struct scatterlist *sgl = &entry->sgl;
> > + struct dma_async_tx_descriptor *txd;
> > + struct dma_slave_config cfg;
> > + dma_cookie_t cookie;
> > + int nents, rc;
> > +
> > + if (!d)
> > + return -ENODEV;
> > +
> > + if (!chan)
> > + return -ENXIO;
> > +
> > + if (WARN_ON(!rc_src || !ep_dst))
> > + return -EINVAL;
> > +
> > + if (WARN_ON(sg_dma_address(sgl)))
> > + return -EINVAL;
> > +
> > + sg_init_one(sgl, rc_src, len);
> > + nents = dma_map_sg(d, sgl, 1, DMA_TO_DEVICE);
> > + if (nents <= 0)
> > + return -EIO;
> > +
> > + memset(&cfg, 0, sizeof(cfg));
> > + cfg.dst_addr = ep_dst;
> > + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> > + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> > + cfg.direction = DMA_MEM_TO_DEV;
> > + rc = dmaengine_slave_config(chan, &cfg);
> > + if (rc)
> > + goto out_unmap;
> > +
> > + txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_MEM_TO_DEV,
> > + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> > + if (!txd) {
> > + rc = -EIO;
> > + goto out_unmap;
> > + }
> > +
> > + txd->callback_result = ntb_transport_edma_rc_read_cb;
> > + txd->callback_param = entry;
> > +
> > + cookie = dmaengine_submit(txd);
> > + if (dma_submit_error(cookie)) {
> > + rc = -EIO;
> > + goto out_unmap;
> > + }
> > + dma_async_issue_pending(chan);
> > + return 0;
> > +out_unmap:
> > + dma_unmap_sg(d, sgl, 1, DMA_TO_DEVICE);
> > + return rc;
> > +}
> > +
> > +static void ntb_transport_edma_rc_dma_work(struct work_struct *work)
> > +{
> > + struct ntb_queue_entry *entry = container_of(
> > + work, struct ntb_queue_entry, dma_work);
> > + struct ntb_transport_qp *qp = entry->qp;
> > + struct ntb_transport_ctx *nt = qp->transport;
> > + struct device *dma_dev = get_dma_dev(qp->ndev);
> > + struct dma_chan *chan;
> > + int rc;
> > +
> > + chan = ntb_edma_pick_chan(&nt->edma, REMOTE_EDMA_WRITE);
> > + rc = ntb_transport_edma_rc_write_start(dma_dev, chan, entry->len,
> > + entry->addr, entry->buf, entry);
> > + if (rc) {
> > + entry->errors++;
> > + entry->len = -EIO;
> > + entry->flags |= DESC_DONE_FLAG;
> > + queue_work(nt->wq, &qp->write_work);
> > + return;
> > + }
> > +}
> > +
> > +static void ntb_transport_edma_rc_poll(struct ntb_transport_qp *qp)
> > +{
> > + struct ntb_transport_ctx *nt = qp->transport;
> > + unsigned int budget = NTB_EDMA_MAX_POLL;
> > + struct ntb_queue_entry *entry;
> > + struct ntb_edma_desc *in;
> > + dma_addr_t ep_src;
> > + u32 len, idx;
> > +
> > + while (budget--) {
> > + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_HEAD_WR_RC_I(qp)),
> > + qp->wr_issue) == 0)
> > + break;
> > +
> > + idx = ntb_edma_ring_idx(qp->wr_issue);
> > + in = NTB_DESC_WR_RC_I(qp, idx);
> > +
> > + len = READ_ONCE(in->len);
> > + ep_src = (dma_addr_t)READ_ONCE(in->addr);
> > +
> > + /* Prepare 'entry' for write completion */
> > + entry = ntb_list_rm(&qp->ntb_rx_q_lock, &qp->rx_pend_q);
> > + if (!entry) {
> > + qp->rx_err_no_buf++;
> > + break;
> > + }
> > + if (WARN_ON(entry->flags & DESC_DONE_FLAG))
> > + entry->flags &= ~DESC_DONE_FLAG;
> > + entry->len = len; /* NB. entry->len can be <=0 */
> > + entry->addr = ep_src;
> > +
> > + /*
> > + * ntb_transport_edma_rc_write_complete_work() checks entry->flags
> > + * so it needs to be set before wr_issue++.
> > + */
> > + in->data = (uintptr_t)entry;
> > +
> > + /* Ensure in->data visible before wr_issue++ */
> > + smp_wmb();
> > +
> > + WRITE_ONCE(qp->wr_issue, qp->wr_issue + 1);
> > +
> > + if (!len) {
> > + entry->flags |= DESC_DONE_FLAG;
> > + queue_work(nt->wq, &qp->write_work);
> > + continue;
> > + }
> > +
> > + if (in->flags & LINK_DOWN_FLAG) {
> > + dev_dbg(&qp->ndev->pdev->dev, "link down flag set\n");
> > + entry->flags |= DESC_DONE_FLAG | LINK_DOWN_FLAG;
> > + queue_work(nt->wq, &qp->write_work);
> > + continue;
> > + }
> > +
> > + queue_work(nt->wq, &entry->dma_work);
> > + }
> > +
> > + if (!budget)
> > + tasklet_schedule(&qp->rxc_db_work);
> > +}
> > +
> > +static int ntb_transport_edma_rc_tx_enqueue(struct ntb_transport_qp *qp,
> > + struct ntb_queue_entry *entry)
> > +{
> > + struct device *dma_dev = get_dma_dev(qp->ndev);
> > + struct ntb_transport_ctx *nt = qp->transport;
> > + struct ntb_edma_desc *in, __iomem *out;
> > + unsigned int len = entry->len;
> > + struct dma_chan *chan;
> > + u32 issue, idx, head;
> > + dma_addr_t ep_dst;
> > + int rc;
> > +
> > + WARN_ON_ONCE(entry->flags & DESC_DONE_FLAG);
> > +
> > + scoped_guard(spinlock_irqsave, &qp->rc_lock) {
> > + head = READ_ONCE(*NTB_HEAD_RD_RC_I(qp));
> > + issue = qp->rd_issue;
> > + if (ntb_edma_ring_used_entry(head, issue) == 0) {
> > + qp->tx_ring_full++;
> > + return -ENOSPC;
> > + }
> > +
> > + /*
> > + * ntb_transport_edma_rc_read_complete_work() checks entry->flags
> > + * so it needs to be set before rd_issue++.
> > + */
> > + idx = ntb_edma_ring_idx(issue);
> > + in = NTB_DESC_RD_RC_I(qp, idx);
> > + in->data = (uintptr_t)entry;
> > +
> > + /* Make in->data visible before rd_issue++ */
> > + smp_wmb();
> > +
> > + WRITE_ONCE(qp->rd_issue, qp->rd_issue + 1);
> > + }
> > +
> > + /* Publish the final transfer length to the EP side */
> > + out = NTB_DESC_RD_RC_O(qp, idx);
> > + iowrite32(len, &out->len);
> > + ioread32(&out->len);
> > +
> > + if (unlikely(!len)) {
> > + entry->flags |= DESC_DONE_FLAG;
> > + queue_work(nt->wq, &qp->read_work);
> > + return 0;
> > + }
> > +
> > + /* Paired with dma_wmb() in ntb_transport_edma_ep_rx_enqueue() */
> > + dma_rmb();
> > +
> > + /* kick remote eDMA read transfer */
> > + ep_dst = (dma_addr_t)in->addr;
> > + chan = ntb_edma_pick_chan(&nt->edma, REMOTE_EDMA_READ);
> > + rc = ntb_transport_edma_rc_read_start(dma_dev, chan, len,
> > + entry->buf, ep_dst, entry);
> > + if (rc) {
> > + entry->errors++;
> > + entry->len = -EIO;
> > + entry->flags |= DESC_DONE_FLAG;
> > + queue_work(nt->wq, &qp->read_work);
> > + }
> > + return 0;
> > +}
> > +
> > +static int ntb_transport_edma_ep_tx_enqueue(struct ntb_transport_qp *qp,
> > + struct ntb_queue_entry *entry)
> > +{
> > + struct device *dma_dev = get_dma_dev(qp->ndev);
> > + struct ntb_edma_desc *in, __iomem *out;
> > + unsigned int len = entry->len;
> > + dma_addr_t ep_src = 0;
> > + u32 idx;
> > + int rc;
> > +
> > + if (likely(len)) {
> > + ep_src = dma_map_single(dma_dev, entry->buf, len,
> > + DMA_TO_DEVICE);
> > + rc = dma_mapping_error(dma_dev, ep_src);
> > + if (rc)
> > + return rc;
> > + }
> > +
> > + scoped_guard(spinlock_irqsave, &qp->ep_tx_lock) {
> > + if (ntb_edma_ring_full(qp->wr_prod, qp->wr_cons)) {
> > + rc = -ENOSPC;
> > + qp->tx_ring_full++;
> > + goto out_unmap;
> > + }
> > +
> > + idx = ntb_edma_ring_idx(qp->wr_prod);
> > + in = NTB_DESC_WR_EP_I(qp, idx);
> > + out = NTB_DESC_WR_EP_O(qp, idx);
> > +
> > + WARN_ON(in->flags & DESC_DONE_FLAG);
> > + WARN_ON(entry->flags & DESC_DONE_FLAG);
> > + in->flags = 0;
> > + in->data = (uintptr_t)entry;
> > + entry->addr = ep_src;
> > +
> > + iowrite32(len, &out->len);
> > + iowrite32(entry->flags, &out->flags);
> > + iowrite64(ep_src, &out->addr);
> > + WRITE_ONCE(qp->wr_prod, qp->wr_prod + 1);
> > +
> > + dma_wmb();
> > + iowrite32(qp->wr_prod, NTB_HEAD_WR_EP_O(qp));
> > +
> > + qp->tx_bytes += len;
> > + qp->tx_pkts++;
> > + }
> > +
> > + ntb_transport_edma_notify_peer(qp);
> > +
> > + return 0;
> > +out_unmap:
> > + if (likely(len))
> > + dma_unmap_single(dma_dev, ep_src, len, DMA_TO_DEVICE);
> > + return rc;
> > +}
> > +
> > +static int ntb_transport_edma_tx_enqueue(struct ntb_transport_qp *qp,
> > + struct ntb_queue_entry *entry,
> > + void *cb, void *data, unsigned int len,
> > + unsigned int flags)
> > +{
> > + struct device *dma_dev;
> > +
> > + if (entry->addr) {
> > + /* Deferred unmap */
> > + dma_dev = get_dma_dev(qp->ndev);
> > + dma_unmap_single(dma_dev, entry->addr, entry->len, DMA_TO_DEVICE);
> > + }
> > +
> > + entry->cb_data = cb;
> > + entry->buf = data;
> > + entry->len = len;
> > + entry->flags = flags;
> > + entry->errors = 0;
> > + entry->addr = 0;
> > +
> > + WARN_ON_ONCE(!ntb_qp_edma_enabled(qp));
> > +
> > + if (ntb_qp_edma_is_ep(qp))
> > + return ntb_transport_edma_ep_tx_enqueue(qp, entry);
> > + else
> > + return ntb_transport_edma_rc_tx_enqueue(qp, entry);
> > +}
> > +
> > +static int ntb_transport_edma_ep_rx_enqueue(struct ntb_transport_qp *qp,
> > + struct ntb_queue_entry *entry)
> > +{
> > + struct device *dma_dev = get_dma_dev(qp->ndev);
> > + struct ntb_edma_desc *in, __iomem *out;
> > + unsigned int len = entry->len;
> > + void *data = entry->buf;
> > + dma_addr_t ep_dst;
> > + u32 idx;
> > + int rc;
> > +
> > + ep_dst = dma_map_single(dma_dev, data, len, DMA_FROM_DEVICE);
> > + rc = dma_mapping_error(dma_dev, ep_dst);
> > + if (rc)
> > + return rc;
> > +
> > + scoped_guard(spinlock_bh, &qp->ep_rx_lock) {
> > + if (ntb_edma_ring_full(READ_ONCE(qp->rd_prod),
> > + READ_ONCE(qp->rd_cons))) {
> > + rc = -ENOSPC;
> > + goto out_unmap;
> > + }
> > +
> > + idx = ntb_edma_ring_idx(qp->rd_prod);
> > + in = NTB_DESC_RD_EP_I(qp, idx);
> > + out = NTB_DESC_RD_EP_O(qp, idx);
> > +
> > + iowrite32(len, &out->len);
> > + iowrite64(ep_dst, &out->addr);
> > +
> > + WARN_ON(in->flags & DESC_DONE_FLAG);
> > + in->data = (uintptr_t)entry;
> > + entry->addr = ep_dst;
> > +
> > + /* Ensure len/addr are visible before the head update */
> > + dma_wmb();
> > +
> > + WRITE_ONCE(qp->rd_prod, qp->rd_prod + 1);
> > + iowrite32(qp->rd_prod, NTB_HEAD_RD_EP_O(qp));
> > + }
> > + return 0;
> > +out_unmap:
> > + dma_unmap_single(dma_dev, ep_dst, len, DMA_FROM_DEVICE);
> > + return rc;
> > +}
> > +
> > +static int ntb_transport_edma_rx_enqueue(struct ntb_transport_qp *qp,
> > + struct ntb_queue_entry *entry)
> > +{
> > + int rc;
> > +
> > + /* The behaviour is the same as the default backend for RC side */
> > + if (ntb_qp_edma_is_ep(qp)) {
> > + rc = ntb_transport_edma_ep_rx_enqueue(qp, entry);
> > + if (rc) {
> > + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> > + &qp->rx_free_q);
> > + return rc;
> > + }
> > + }
> > +
> > + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_pend_q);
> > +
> > + if (qp->active)
> > + tasklet_schedule(&qp->rxc_db_work);
> > +
> > + return 0;
> > +}
> > +
> > +static void ntb_transport_edma_rx_poll(struct ntb_transport_qp *qp)
> > +{
> > + struct ntb_transport_ctx *nt = qp->transport;
> > +
> > + if (ntb_qp_edma_is_rc(qp))
> > + ntb_transport_edma_rc_poll(qp);
> > + else if (ntb_qp_edma_is_ep(qp)) {
> > + /*
> > + * Make sure we poll the rings even if an eDMA interrupt is
> > + * cleared on the RC side earlier.
> > + */
> > + queue_work(nt->wq, &qp->read_work);
> > + queue_work(nt->wq, &qp->write_work);
> > + } else
> > + /* Unreachable */
> > + WARN_ON_ONCE(1);
> > +}
> > +
> > +static void ntb_transport_edma_read_work(struct work_struct *work)
> > +{
> > + struct ntb_transport_qp *qp = container_of(
> > + work, struct ntb_transport_qp, read_work);
> > +
> > + if (ntb_qp_edma_is_rc(qp))
> > + ntb_transport_edma_rc_read_complete_work(work);
> > + else if (ntb_qp_edma_is_ep(qp))
> > + ntb_transport_edma_ep_read_work(work);
> > + else
> > + /* Unreachable */
> > + WARN_ON_ONCE(1);
> > +}
> > +
> > +static void ntb_transport_edma_write_work(struct work_struct *work)
> > +{
> > + struct ntb_transport_qp *qp = container_of(
> > + work, struct ntb_transport_qp, write_work);
> > +
> > + if (ntb_qp_edma_is_rc(qp))
> > + ntb_transport_edma_rc_write_complete_work(work);
> > + else if (ntb_qp_edma_is_ep(qp))
> > + ntb_transport_edma_ep_write_work(work);
> > + else
> > + /* Unreachable */
> > + WARN_ON_ONCE(1);
> > +}
> > +
> > +static void ntb_transport_edma_init_queue(struct ntb_transport_ctx *nt,
> > + unsigned int qp_num)
> > +{
> > + struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
> > +
> > + qp->wr_cons = 0;
> > + qp->rd_cons = 0;
> > + qp->wr_prod = 0;
> > + qp->rd_prod = 0;
> > + qp->wr_issue = 0;
> > + qp->rd_issue = 0;
> > +
> > + INIT_WORK(&qp->db_work, ntb_transport_edma_db_work);
> > + INIT_WORK(&qp->read_work, ntb_transport_edma_read_work);
> > + INIT_WORK(&qp->write_work, ntb_transport_edma_write_work);
> > +}
> > +
> > +static void ntb_transport_edma_create_queue(struct ntb_transport_ctx *nt,
> > + struct ntb_transport_qp *qp)
> > +{
> > + spin_lock_init(&qp->ep_tx_lock);
> > + spin_lock_init(&qp->ep_rx_lock);
> > + spin_lock_init(&qp->rc_lock);
> > +}
> > +
> > +static const struct ntb_transport_backend_ops edma_backend_ops = {
> > + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> > + .tx_free_entry = ntb_transport_edma_tx_free_entry,
> > + .tx_enqueue = ntb_transport_edma_tx_enqueue,
> > + .rx_enqueue = ntb_transport_edma_rx_enqueue,
> > + .rx_poll = ntb_transport_edma_rx_poll,
> > + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> > +};
> > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > +
> > /**
> > * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> > * @qp: NTB transport layer queue to be enabled
> > --
> > 2.48.1
> >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-12-01 21:46 ` Dave Jiang
@ 2025-12-02 6:59 ` Koichiro Den
2025-12-02 14:53 ` Dave Jiang
0 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-12-02 6:59 UTC (permalink / raw)
To: Dave Jiang
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Mon, Dec 01, 2025 at 02:46:41PM -0700, Dave Jiang wrote:
>
>
> On 11/29/25 9:03 AM, Koichiro Den wrote:
> > Add a new transport backend that uses a remote DesignWare eDMA engine
> > located on the NTB endpoint to move data between host and endpoint.
> >
> > In this mode:
> >
> > - The endpoint exposes a dedicated memory window that contains the
> > eDMA register block followed by a small control structure (struct
> > ntb_edma_info) and per-channel linked-list (LL) rings.
> >
> > - On the endpoint side, ntb_edma_setup_mws() allocates the control
> > structure and LL rings in endpoint memory, then programs an inbound
> > iATU region so that the host can access them via a peer MW.
> >
> > - On the host side, ntb_edma_setup_peer() ioremaps the peer MW, reads
> > ntb_edma_info and configures a dw-edma DMA device to use the LL
> > rings provided by the endpoint.
> >
> > - ntb_transport is extended with a new backend_ops implementation that
> > routes TX and RX enqueue/poll operations through the remote eDMA
> > rings while keeping the existing shared-memory backend intact.
> >
> > - The host signals the endpoint via a dedicated DMA read channel.
> >    The 'use_msi' module parameter is ignored when 'use_remote_edma=1'.
> >
> > The new mode is guarded by a Kconfig option (NTB_TRANSPORT_EDMA) and a
> > module parameter (use_remote_edma). When disabled, the existing
> > ntb_transport behaviour is unchanged.
> >
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > drivers/ntb/Kconfig | 11 +
> > drivers/ntb/Makefile | 3 +
> > drivers/ntb/ntb_edma.c | 628 ++++++++
> > drivers/ntb/ntb_edma.h | 128 ++
>
> I briefly looked over the code. It feels like the EDMA bits should go in drivers/ntb/hw/ rather than drivers/ntb/ given it's pretty specific to the DesignWare hardware. What sits in drivers/ntb should be generic APIs that a different vendor can utilize without having to adapt to DesignWare hardware specifics. So maybe a bit more abstraction is needed?
That makes sense, I'll reorganize things. Thank you for the suggestion.
>
> > .../{ntb_transport.c => ntb_transport_core.c} | 1281 ++++++++++++++++-
> > 5 files changed, 2048 insertions(+), 3 deletions(-)
> > create mode 100644 drivers/ntb/ntb_edma.c
> > create mode 100644 drivers/ntb/ntb_edma.h
> > rename drivers/ntb/{ntb_transport.c => ntb_transport_core.c} (65%)
> >
> > diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
> > index df16c755b4da..db63f02bb116 100644
> > --- a/drivers/ntb/Kconfig
> > +++ b/drivers/ntb/Kconfig
> > @@ -37,4 +37,15 @@ config NTB_TRANSPORT
> >
> > If unsure, say N.
> >
> > +config NTB_TRANSPORT_EDMA
> > + bool "NTB Transport backed by remote eDMA"
> > + depends on NTB_TRANSPORT
> > + depends on PCI
> > + select DMA_ENGINE
> > + help
> > + Enable a transport backend that uses a remote DesignWare eDMA engine
> > + exposed through a dedicated NTB memory window. The host uses the
> > + endpoint's eDMA engine to move data in both directions.
> > + Say Y here if you intend to use the 'use_remote_edma' module parameter.
> > +
> > endif # NTB
> > diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
> > index 3a6fa181ff99..51f0e1e3aec7 100644
> > --- a/drivers/ntb/Makefile
> > +++ b/drivers/ntb/Makefile
> > @@ -4,3 +4,6 @@ obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
> >
> > ntb-y := core.o
> > ntb-$(CONFIG_NTB_MSI) += msi.o
> > +
> > +ntb_transport-y := ntb_transport_core.o
> > +ntb_transport-$(CONFIG_NTB_TRANSPORT_EDMA) += ntb_edma.o
> > diff --git a/drivers/ntb/ntb_edma.c b/drivers/ntb/ntb_edma.c
> > new file mode 100644
> > index 000000000000..cb35e0d56aa8
> > --- /dev/null
> > +++ b/drivers/ntb/ntb_edma.c
> > @@ -0,0 +1,628 @@
> > +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> > +
> > +#include <linux/module.h>
> > +#include <linux/device.h>
> > +#include <linux/pci.h>
> > +#include <linux/ntb.h>
> > +#include <linux/io.h>
> > +#include <linux/iommu.h>
> > +#include <linux/dmaengine.h>
> > +#include <linux/pci-epc.h>
> > +#include <linux/dma/edma.h>
> > +#include <linux/irq.h>
> > +#include <linux/irqdomain.h>
> > +#include <linux/of.h>
> > +#include <linux/of_irq.h>
> > +#include <dt-bindings/interrupt-controller/arm-gic.h>
> > +
> > +#include "ntb_edma.h"
> > +
> > +/*
> > + * The interrupt register offsets below are taken from the DesignWare
> > + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> > + * backend currently only supports this layout.
> > + */
> > +#define DMA_WRITE_INT_STATUS_OFF 0x4c
> > +#define DMA_WRITE_INT_MASK_OFF 0x54
> > +#define DMA_WRITE_INT_CLEAR_OFF 0x58
> > +#define DMA_READ_INT_STATUS_OFF 0xa0
> > +#define DMA_READ_INT_MASK_OFF 0xa8
> > +#define DMA_READ_INT_CLEAR_OFF 0xac
> > +
> > +#define NTB_EDMA_NOTIFY_MAX_QP 64
> > +
> > +static unsigned int edma_spi = 417; /* 0x1a1 */
> > +module_param(edma_spi, uint, 0644);
> > +MODULE_PARM_DESC(edma_spi, "SPI number used by remote eDMA interrupt (EP local)");
> > +
> > +static u64 edma_regs_phys = 0xe65d5000;
> > +module_param(edma_regs_phys, ullong, 0644);
> > +MODULE_PARM_DESC(edma_regs_phys, "Physical base address of local eDMA registers (EP)");
> > +
> > +static unsigned long edma_regs_size = 0x1200;
> > +module_param(edma_regs_size, ulong, 0644);
> > +MODULE_PARM_DESC(edma_regs_size, "Size of the local eDMA register space (EP)");
> > +
> > +struct ntb_edma_intr {
> > + u32 db[NTB_EDMA_NOTIFY_MAX_QP];
> > +};
> > +
> > +struct ntb_edma_ctx {
> > + void *ll_wr_virt[EDMA_WR_CH_NUM];
> > + dma_addr_t ll_wr_phys[EDMA_WR_CH_NUM];
> > + void *ll_rd_virt[EDMA_RD_CH_NUM + 1];
> > + dma_addr_t ll_rd_phys[EDMA_RD_CH_NUM + 1];
> > +
> > + struct ntb_edma_intr *intr_ep_virt;
> > + dma_addr_t intr_ep_phys;
> > + struct ntb_edma_intr *intr_rc_virt;
> > + dma_addr_t intr_rc_phys;
> > + u32 notify_qp_max;
> > +
> > + bool initialized;
> > +};
> > +
> > +static struct ntb_edma_ctx edma_ctx;
> > +
> > +typedef void (*ntb_edma_interrupt_cb_t)(void *data, int qp_num);
> > +
> > +struct ntb_edma_interrupt {
> > + int virq;
> > + void __iomem *base;
> > + ntb_edma_interrupt_cb_t cb;
> > + void *data;
> > +};
> > +
> > +static struct ntb_edma_interrupt ntb_edma_intr;
> > +
> > +static int ntb_edma_map_spi_to_virq(struct device *dev, unsigned int spi)
> > +{
> > + struct device_node *np = dev_of_node(dev);
> > + struct device_node *parent;
> > + struct irq_fwspec fwspec = { 0 };
> > + int virq;
> > +
> > + parent = of_irq_find_parent(np);
> > + if (!parent)
> > + return -ENODEV;
> > +
> > + fwspec.fwnode = of_fwnode_handle(parent);
> > + fwspec.param_count = 3;
> > + fwspec.param[0] = GIC_SPI;
> > + fwspec.param[1] = spi;
> > + fwspec.param[2] = IRQ_TYPE_LEVEL_HIGH;
> > +
> > + virq = irq_create_fwspec_mapping(&fwspec);
> > + of_node_put(parent);
> > + return (virq > 0) ? virq : -EINVAL;
> > +}
> > +
> > +static irqreturn_t ntb_edma_isr(int irq, void *data)
> > +{
> > + struct ntb_edma_interrupt *v = data;
> > + u32 mask = BIT(EDMA_RD_CH_NUM);
> > + u32 i, val;
> > +
> > + /*
> > + * We do not ack interrupts here but instead we mask all local interrupt
> > + * sources except the read channel used for notification. This reduces
> > + * needless ISR invocations.
> > + *
> > + * In theory we could configure LIE=1/RIE=0 only for the notification
> > + * transfer (keeping all other channels at LIE=1/RIE=1), but that would
> > + * require intrusive changes to the dw-edma core.
> > + *
> > + * Note: The host side may have already cleared the read interrupt used
> > + * for notification, so reading DMA_READ_INT_CLEAR_OFF is not a reliable
> > + * way to detect it. As a result, we cannot reliably tell which specific
> > + * channel triggered this interrupt. intr_ep_virt->db[i] teaches us
> > +	 * channel triggered this interrupt. We consult intr_ep_virt->db[i]
> > +	 * instead.
> > + iowrite32(~0x0, v->base + DMA_WRITE_INT_MASK_OFF);
> > + iowrite32(~mask, v->base + DMA_READ_INT_MASK_OFF);
> > +
> > + if (!v->cb || !edma_ctx.intr_ep_virt)
> > + return IRQ_HANDLED;
> > +
> > + for (i = 0; i < edma_ctx.notify_qp_max; i++) {
> > + val = READ_ONCE(edma_ctx.intr_ep_virt->db[i]);
> > + if (!val)
> > + continue;
> > +
> > + WRITE_ONCE(edma_ctx.intr_ep_virt->db[i], 0);
> > + v->cb(v->data, i);
> > + }
> > +
> > + return IRQ_HANDLED;
> > +}
> > +
> > +int ntb_edma_setup_isr(struct device *dev, struct device *epc_dev,
> > + ntb_edma_interrupt_cb_t cb, void *data)
> > +{
> > + struct ntb_edma_interrupt *v = &ntb_edma_intr;
> > + int virq = ntb_edma_map_spi_to_virq(epc_dev->parent, edma_spi);
> > + int ret;
> > +
> > + if (virq < 0) {
> > + dev_err(dev, "failed to get virq (%d)\n", virq);
> > + return virq;
> > + }
> > +
> > + v->virq = virq;
> > + v->cb = cb;
> > + v->data = data;
> > + if (edma_regs_phys && !v->base)
> > + v->base = devm_ioremap(dev, edma_regs_phys, edma_regs_size);
> > + if (!v->base) {
> > +		dev_err(dev, "failed to map local eDMA registers\n");
> > +		return -ENOMEM;
> > + }
> > + ret = devm_request_irq(dev, v->virq, ntb_edma_isr, 0, "ntb-edma", v);
> > + if (ret)
> > + return ret;
> > +
> > + if (v->base) {
> > + iowrite32(0x0, v->base + DMA_WRITE_INT_MASK_OFF);
> > + iowrite32(0x0, v->base + DMA_READ_INT_MASK_OFF);
> > + }
> > + return 0;
> > +}
> > +
> > +void ntb_edma_teardown_isr(struct device *dev)
> > +{
> > + struct ntb_edma_interrupt *v = &ntb_edma_intr;
> > +
> > + /* Mask all write/read interrupts so we don't get called again. */
> > + if (v->base) {
> > + iowrite32(~0x0, v->base + DMA_WRITE_INT_MASK_OFF);
> > + iowrite32(~0x0, v->base + DMA_READ_INT_MASK_OFF);
> > + }
> > +
> > + if (v->virq > 0)
> > + devm_free_irq(dev, v->virq, v);
> > +
> > + if (v->base)
> > + devm_iounmap(dev, v->base);
> > +
> > + v->virq = 0;
> > + v->cb = NULL;
> > + v->data = NULL;
> > +}
> > +
> > +int ntb_edma_setup_mws(struct ntb_dev *ndev)
> > +{
> > + const size_t info_bytes = PAGE_SIZE;
> > + resource_size_t size_max, offset;
> > + dma_addr_t intr_phys, info_phys;
> > + u32 wr_done = 0, rd_done = 0;
> > + struct ntb_edma_intr *intr;
> > + struct ntb_edma_info *info;
> > + int peer_mw, mw_index, rc;
> > + struct iommu_domain *dom;
> > + bool reg_mapped = false;
> > + size_t ll_bytes, size;
> > + struct pci_epc *epc;
> > + struct device *dev;
> > + unsigned long iova;
> > + phys_addr_t phys;
> > + u64 need;
> > + u32 i;
> > +
> > + /* +1 is for interruption */
> > + ll_bytes = (EDMA_WR_CH_NUM + EDMA_RD_CH_NUM + 1) * DMA_LLP_MEM_SIZE;
> > + need = EDMA_REG_SIZE + info_bytes + ll_bytes;
> > +
> > + epc = ntb_get_pci_epc(ndev);
> > + if (!epc)
> > + return -ENODEV;
> > + dev = epc->dev.parent;
> > +
> > + if (edma_ctx.initialized)
> > + return 0;
> > +
> > + info = dma_alloc_coherent(dev, info_bytes, &info_phys, GFP_KERNEL);
> > + if (!info)
> > + return -ENOMEM;
> > +
> > + memset(info, 0, info_bytes);
> > + info->magic = NTB_EDMA_INFO_MAGIC;
> > + info->wr_cnt = EDMA_WR_CH_NUM;
> > + info->rd_cnt = EDMA_RD_CH_NUM + 1; /* +1 for interruption */
> > + info->regs_phys = edma_regs_phys;
> > + info->ll_stride = DMA_LLP_MEM_SIZE;
> > +
> > + for (i = 0; i < EDMA_WR_CH_NUM; i++) {
> > + edma_ctx.ll_wr_virt[i] = dma_alloc_attrs(dev, DMA_LLP_MEM_SIZE,
> > + &edma_ctx.ll_wr_phys[i],
> > + GFP_KERNEL,
> > + DMA_ATTR_FORCE_CONTIGUOUS);
> > + if (!edma_ctx.ll_wr_virt[i]) {
> > + rc = -ENOMEM;
> > + goto err_free_ll;
> > + }
> > + wr_done++;
> > + info->ll_wr_phys[i] = edma_ctx.ll_wr_phys[i];
> > + }
> > + for (i = 0; i < EDMA_RD_CH_NUM + 1; i++) {
> > + edma_ctx.ll_rd_virt[i] = dma_alloc_attrs(dev, DMA_LLP_MEM_SIZE,
> > + &edma_ctx.ll_rd_phys[i],
> > + GFP_KERNEL,
> > + DMA_ATTR_FORCE_CONTIGUOUS);
> > + if (!edma_ctx.ll_rd_virt[i]) {
> > + rc = -ENOMEM;
> > + goto err_free_ll;
> > + }
> > + rd_done++;
> > + info->ll_rd_phys[i] = edma_ctx.ll_rd_phys[i];
> > + }
> > +
> > + /* For interruption */
> > + edma_ctx.notify_qp_max = NTB_EDMA_NOTIFY_MAX_QP;
> > + intr = dma_alloc_coherent(dev, sizeof(*intr), &intr_phys, GFP_KERNEL);
> > + if (!intr) {
> > + rc = -ENOMEM;
> > + goto err_free_ll;
> > + }
> > + memset(intr, 0, sizeof(*intr));
> > + edma_ctx.intr_ep_virt = intr;
> > + edma_ctx.intr_ep_phys = intr_phys;
> > + info->intr_dar_base = intr_phys;
> > +
> > + peer_mw = ntb_peer_mw_count(ndev);
> > + if (peer_mw <= 0) {
> > + rc = -ENODEV;
> > + goto err_free_ll;
> > + }
> > +
> > + mw_index = peer_mw - 1; /* last MW */
> > +
> > + rc = ntb_mw_get_align(ndev, 0, mw_index, 0, NULL, &size_max,
> > + &offset);
> > + if (rc)
> > + goto err_free_ll;
> > +
> > + if (size_max < need) {
> > + rc = -ENOSPC;
> > + goto err_free_ll;
> > + }
> > +
> > + /* Map register space (direct) */
> > + dom = iommu_get_domain_for_dev(dev);
> > + if (dom) {
> > + phys = edma_regs_phys & PAGE_MASK;
> > + size = PAGE_ALIGN(EDMA_REG_SIZE + edma_regs_phys - phys);
> > + iova = phys;
> > +
> > +		rc = iommu_map(dom, iova, phys, size,
> > +			       IOMMU_READ | IOMMU_WRITE | IOMMU_MMIO, GFP_KERNEL);
> > +		if (rc)
> > +			dev_err(&ndev->dev, "failed to create direct mapping for eDMA reg space\n");
> > +		else
> > +			reg_mapped = true;
> > + }
> > +
> > + rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_regs_phys, EDMA_REG_SIZE, offset);
> > + if (rc)
> > + goto err_unmap_reg;
> > +
> > + offset += EDMA_REG_SIZE;
> > +
> > + /* Map ntb_edma_info */
> > + rc = ntb_mw_set_trans(ndev, 0, mw_index, info_phys, info_bytes, offset);
> > + if (rc)
> > + goto err_clear_trans;
> > + offset += info_bytes;
> > +
> > + /* Map LL location */
> > + for (i = 0; i < EDMA_WR_CH_NUM; i++) {
> > + rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_ctx.ll_wr_phys[i],
> > + DMA_LLP_MEM_SIZE, offset);
> > + if (rc)
> > + goto err_clear_trans;
> > + offset += DMA_LLP_MEM_SIZE;
> > + }
> > + for (i = 0; i < EDMA_RD_CH_NUM + 1; i++) {
> > + rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_ctx.ll_rd_phys[i],
> > + DMA_LLP_MEM_SIZE, offset);
> > + if (rc)
> > + goto err_clear_trans;
> > + offset += DMA_LLP_MEM_SIZE;
> > + }
> > + edma_ctx.initialized = true;
> > +
> > + return 0;
> > +
> > +err_clear_trans:
> > + /*
> > + * Tear down the NTB translation window used for the eDMA MW.
> > + * There is no sub-range clear API for ntb_mw_set_trans(), so we
> > + * unconditionally drop the whole mapping on error.
> > + */
> > + ntb_mw_clear_trans(ndev, 0, mw_index);
> > +
> > +err_unmap_reg:
> > + if (reg_mapped)
> > + iommu_unmap(dom, iova, size);
> > +err_free_ll:
> > + while (rd_done--)
> > + dma_free_attrs(dev, DMA_LLP_MEM_SIZE,
> > + edma_ctx.ll_rd_virt[rd_done],
> > + edma_ctx.ll_rd_phys[rd_done],
> > + DMA_ATTR_FORCE_CONTIGUOUS);
> > + while (wr_done--)
> > + dma_free_attrs(dev, DMA_LLP_MEM_SIZE,
> > + edma_ctx.ll_wr_virt[wr_done],
> > + edma_ctx.ll_wr_phys[wr_done],
> > + DMA_ATTR_FORCE_CONTIGUOUS);
> > + if (edma_ctx.intr_ep_virt)
> > + dma_free_coherent(dev, sizeof(struct ntb_edma_intr),
> > + edma_ctx.intr_ep_virt,
> > + edma_ctx.intr_ep_phys);
> > + dma_free_coherent(dev, info_bytes, info, info_phys);
> > + return rc;
> > +}
> > +
> > +static int ntb_edma_irq_vector(struct device *dev, unsigned int nr)
> > +{
> > + struct pci_dev *pdev = to_pci_dev(dev);
> > + int ret, nvec;
> > +
> > + nvec = pci_msi_vec_count(pdev);
> > + for (; nr < nvec; nr++) {
> > + ret = pci_irq_vector(pdev, nr);
> > + if (!irq_has_action(ret))
> > + return ret;
> > + }
> > + return 0;
> > +}
> > +
> > +static const struct dw_edma_plat_ops ntb_edma_ops = {
> > + .irq_vector = ntb_edma_irq_vector,
> > +};
> > +
> > +int ntb_edma_setup_peer(struct ntb_dev *ndev)
> > +{
> > + struct ntb_edma_info *info;
> > + unsigned int wr_cnt, rd_cnt;
> > + struct dw_edma_chip *chip;
> > + void __iomem *edma_virt;
> > + phys_addr_t edma_phys;
> > + resource_size_t mw_size;
> > + u64 off = EDMA_REG_SIZE;
> > + int peer_mw, mw_index;
> > + unsigned int i;
> > + int ret;
> > +
> > + peer_mw = ntb_peer_mw_count(ndev);
> > + if (peer_mw <= 0)
> > + return -ENODEV;
> > +
> > + mw_index = peer_mw - 1; /* last MW */
> > +
> > + ret = ntb_peer_mw_get_addr(ndev, mw_index, &edma_phys,
> > + &mw_size);
> > + if (ret)
> > +		return ret;
> > +
> > +	edma_virt = ioremap(edma_phys, mw_size);
> > +	if (!edma_virt)
> > +		return -ENOMEM;
> > +
> > + chip = devm_kzalloc(&ndev->dev, sizeof(*chip), GFP_KERNEL);
> > + if (!chip) {
> > + ret = -ENOMEM;
> > + return ret;
> > + }
> > +
> > + chip->dev = &ndev->pdev->dev;
> > + chip->nr_irqs = 4;
> > + chip->ops = &ntb_edma_ops;
> > + chip->flags = 0;
> > + chip->reg_base = edma_virt;
> > + chip->mf = EDMA_MF_EDMA_UNROLL;
> > +
> > + info = edma_virt + off;
> > + if (info->magic != NTB_EDMA_INFO_MAGIC)
> > + return -EINVAL;
> > + wr_cnt = info->wr_cnt;
> > + rd_cnt = info->rd_cnt;
> > + chip->ll_wr_cnt = wr_cnt;
> > + chip->ll_rd_cnt = rd_cnt;
> > + off += PAGE_SIZE;
> > +
> > + edma_ctx.notify_qp_max = NTB_EDMA_NOTIFY_MAX_QP;
> > + edma_ctx.intr_ep_phys = info->intr_dar_base;
> > + if (edma_ctx.intr_ep_phys) {
> > + edma_ctx.intr_rc_virt =
> > + dma_alloc_coherent(&ndev->pdev->dev,
> > + sizeof(struct ntb_edma_intr),
> > + &edma_ctx.intr_rc_phys,
> > + GFP_KERNEL);
> > + if (!edma_ctx.intr_rc_virt)
> > + return -ENOMEM;
> > + memset(edma_ctx.intr_rc_virt, 0,
> > + sizeof(struct ntb_edma_intr));
> > + }
> > +
> > + for (i = 0; i < wr_cnt; i++) {
> > + chip->ll_region_wr[i].vaddr.io = edma_virt + off;
> > + chip->ll_region_wr[i].paddr = info->ll_wr_phys[i];
> > + chip->ll_region_wr[i].sz = DMA_LLP_MEM_SIZE;
> > + off += DMA_LLP_MEM_SIZE;
> > + }
> > + for (i = 0; i < rd_cnt; i++) {
> > + chip->ll_region_rd[i].vaddr.io = edma_virt + off;
> > + chip->ll_region_rd[i].paddr = info->ll_rd_phys[i];
> > + chip->ll_region_rd[i].sz = DMA_LLP_MEM_SIZE;
> > + off += DMA_LLP_MEM_SIZE;
> > + }
> > +
> > + if (!pci_dev_msi_enabled(ndev->pdev))
> > + return -ENXIO;
> > +
> > + ret = dw_edma_probe(chip);
> > + if (ret) {
> > + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> > + return ret;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +struct ntb_edma_filter {
> > + struct device *dma_dev;
> > + u32 direction;
> > +};
> > +
> > +static bool ntb_edma_filter_fn(struct dma_chan *chan, void *arg)
> > +{
> > + struct ntb_edma_filter *filter = arg;
> > + u32 dir = filter->direction;
> > + struct dma_slave_caps caps;
> > + int ret;
> > +
> > + if (chan->device->dev != filter->dma_dev)
> > + return false;
> > +
> > + ret = dma_get_slave_caps(chan, &caps);
> > + if (ret < 0)
> > + return false;
> > +
> > + return !!(caps.directions & dir);
> > +}
> > +
> > +void ntb_edma_teardown_chans(struct ntb_edma_chans *edma)
> > +{
> > + unsigned int i;
> > +
> > + for (i = 0; i < edma->num_wr_chan; i++)
> > + dma_release_channel(edma->wr_chan[i]);
> > +
> > + for (i = 0; i < edma->num_rd_chan; i++)
> > + dma_release_channel(edma->rd_chan[i]);
> > +
> > + if (edma->intr_chan)
> > + dma_release_channel(edma->intr_chan);
> > +}
> > +
> > +int ntb_edma_setup_chans(struct device *dma_dev, struct ntb_edma_chans *edma)
> > +{
> > + struct ntb_edma_filter filter;
> > + dma_cap_mask_t dma_mask;
> > + unsigned int i;
> > +
> > + dma_cap_zero(dma_mask);
> > + dma_cap_set(DMA_SLAVE, dma_mask);
> > +
> > + memset(edma, 0, sizeof(*edma));
> > + edma->dev = dma_dev;
> > +
> > + filter.dma_dev = dma_dev;
> > + filter.direction = BIT(DMA_DEV_TO_MEM);
> > + for (i = 0; i < EDMA_WR_CH_NUM; i++) {
> > + edma->wr_chan[i] = dma_request_channel(dma_mask,
> > + ntb_edma_filter_fn,
> > + &filter);
> > + if (!edma->wr_chan[i])
> > + break;
> > + edma->num_wr_chan++;
> > + }
> > +
> > + filter.direction = BIT(DMA_MEM_TO_DEV);
> > + for (i = 0; i < EDMA_RD_CH_NUM; i++) {
> > + edma->rd_chan[i] = dma_request_channel(dma_mask,
> > + ntb_edma_filter_fn,
> > + &filter);
> > + if (!edma->rd_chan[i])
> > + break;
> > + edma->num_rd_chan++;
> > + }
> > +
> > + edma->intr_chan = dma_request_channel(dma_mask, ntb_edma_filter_fn,
> > + &filter);
> > + if (!edma->intr_chan)
> > + dev_warn(dma_dev,
> > + "Remote eDMA notify channel could not be allocated\n");
> > +
> > + if (!edma->num_wr_chan || !edma->num_rd_chan) {
> > + dev_warn(dma_dev, "Remote eDMA channels failed to initialize\n");
> > + ntb_edma_teardown_chans(edma);
> > + return -ENODEV;
> > + }
> > + return 0;
> > +}
> > +
> > +struct dma_chan *ntb_edma_pick_chan(struct ntb_edma_chans *edma,
> > + remote_edma_dir_t dir)
> > +{
> > + unsigned int n, cur, idx;
> > + struct dma_chan **chans;
> > + atomic_t *cur_chan;
> > +
> > + if (dir == REMOTE_EDMA_WRITE) {
> > + n = edma->num_wr_chan;
> > + chans = edma->wr_chan;
> > + cur_chan = &edma->cur_wr_chan;
> > + } else {
> > + n = edma->num_rd_chan;
> > + chans = edma->rd_chan;
> > + cur_chan = &edma->cur_rd_chan;
> > + }
> > + if (WARN_ON_ONCE(!n))
> > + return NULL;
> > +
> > + /* Simple round-robin */
> > + cur = (unsigned int)atomic_inc_return(cur_chan) - 1;
> > + idx = cur % n;
> > + return chans[idx];
> > +}
> > +
> > +int ntb_edma_notify_peer(struct ntb_edma_chans *edma, int qp_num)
> > +{
> > + struct dma_async_tx_descriptor *txd;
> > + struct dma_slave_config cfg;
> > + struct scatterlist sgl;
> > + dma_cookie_t cookie;
> > + struct device *dev;
> > +
> > + if (!edma || !edma->intr_chan)
> > + return -ENXIO;
> > +
> > + if (qp_num < 0 || qp_num >= edma_ctx.notify_qp_max)
> > + return -EINVAL;
> > +
> > + if (!edma_ctx.intr_rc_virt || !edma_ctx.intr_ep_phys)
> > + return -EINVAL;
> > +
> > + dev = edma->dev;
> > + if (!dev)
> > + return -ENODEV;
> > +
> > + WRITE_ONCE(edma_ctx.intr_rc_virt->db[qp_num], 1);
> > +
> > + /* Ensure store is visible before kicking the DMA transfer */
> > + wmb();
> > +
> > + sg_init_table(&sgl, 1);
> > + sg_dma_address(&sgl) = edma_ctx.intr_rc_phys + qp_num * sizeof(u32);
> > + sg_dma_len(&sgl) = sizeof(u32);
> > +
> > + memset(&cfg, 0, sizeof(cfg));
> > + cfg.dst_addr = edma_ctx.intr_ep_phys + qp_num * sizeof(u32);
> > + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> > + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> > + cfg.direction = DMA_MEM_TO_DEV;
> > +
> > + if (dmaengine_slave_config(edma->intr_chan, &cfg))
> > + return -EINVAL;
> > +
> > + txd = dmaengine_prep_slave_sg(edma->intr_chan, &sgl, 1,
> > + DMA_MEM_TO_DEV,
> > + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> > + if (!txd)
> > + return -ENOSPC;
> > +
> > + cookie = dmaengine_submit(txd);
> > + if (dma_submit_error(cookie))
> > + return -ENOSPC;
> > +
> > + dma_async_issue_pending(edma->intr_chan);
> > + return 0;
> > +}
> > diff --git a/drivers/ntb/ntb_edma.h b/drivers/ntb/ntb_edma.h
> > new file mode 100644
> > index 000000000000..da0451827edb
> > --- /dev/null
> > +++ b/drivers/ntb/ntb_edma.h
> > @@ -0,0 +1,128 @@
> > +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> > +#ifndef _NTB_EDMA_H_
> > +#define _NTB_EDMA_H_
> > +
> > +#include <linux/completion.h>
> > +#include <linux/device.h>
> > +#include <linux/interrupt.h>
> > +
> > +#define EDMA_REG_SIZE SZ_64K
> > +#define DMA_LLP_MEM_SIZE SZ_4K
> > +#define EDMA_WR_CH_NUM 4
> > +#define EDMA_RD_CH_NUM 4
> > +#define NTB_EDMA_MAX_CH 8
> > +
> > +#define NTB_EDMA_INFO_MAGIC 0x45444D41 /* "EDMA" */
> > +#define NTB_EDMA_INFO_OFF EDMA_REG_SIZE
> > +
> > +#define NTB_EDMA_RING_ORDER 7
> > +#define NTB_EDMA_RING_ENTRIES (1U << NTB_EDMA_RING_ORDER)
> > +#define NTB_EDMA_RING_MASK (NTB_EDMA_RING_ENTRIES - 1)
> > +
> > +typedef void (*ntb_edma_interrupt_cb_t)(void *data, int qp_num);
> > +
> > +/*
> > + * REMOTE_EDMA_EP:
> > + * Endpoint owns the eDMA engine and pushes descriptors into a shared MW.
> > + *
> > + * REMOTE_EDMA_RC:
> > + * Root Complex controls the endpoint eDMA through the shared MW and
> > + * drives reads/writes on behalf of the host.
> > + */
> > +typedef enum {
> > + REMOTE_EDMA_UNKNOWN,
> > + REMOTE_EDMA_EP,
> > + REMOTE_EDMA_RC,
> > +} remote_edma_mode_t;
> > +
> > +typedef enum {
> > + REMOTE_EDMA_WRITE,
> > + REMOTE_EDMA_READ,
> > +} remote_edma_dir_t;
> > +
> > +/*
> > + * Layout of remote eDMA MW (EP local address space, RC sees via peer MW):
> > + *
> > + * 0 .. EDMA_REG_SIZE-1 : DesignWare eDMA registers
> > + * EDMA_REG_SIZE .. +PAGE_SIZE : struct ntb_edma_info (EP writes, RC reads)
> > + * +PAGE_SIZE .. : LL ring buffers (EP allocates phys addresses,
> > + * RC configures via dw_edma)
> > + *
> > + * ntb_edma_setup_mws() on EP:
> > + * - allocates ntb_edma_info and LLs in EP memory
> > + * - programs inbound iATU so that RC peer MW[n] points at this block
> > + *
> > + * ntb_edma_setup_peer() on RC:
> > + * - ioremaps peer MW[n]
> > + * - reads ntb_edma_info
> > + * - sets up dw_edma_chip ll_region_* from that info
> > + */
> > +struct ntb_edma_info {
> > + u32 magic;
> > + u16 wr_cnt;
> > + u16 rd_cnt;
> > + u64 regs_phys;
> > + u32 ll_stride;
> > + u32 rsvd;
> > + u64 ll_wr_phys[NTB_EDMA_MAX_CH];
> > + u64 ll_rd_phys[NTB_EDMA_MAX_CH];
> > +
> > + u64 intr_dar_base;
> > +} __packed;
> > +
> > +struct ll_dma_addrs {
> > + dma_addr_t wr[EDMA_WR_CH_NUM];
> > + dma_addr_t rd[EDMA_RD_CH_NUM];
> > +};
> > +
> > +struct ntb_edma_chans {
> > + struct device *dev;
> > +
> > + struct dma_chan *wr_chan[EDMA_WR_CH_NUM];
> > + struct dma_chan *rd_chan[EDMA_RD_CH_NUM];
> > + struct dma_chan *intr_chan;
> > +
> > + unsigned int num_wr_chan;
> > + unsigned int num_rd_chan;
> > + atomic_t cur_wr_chan;
> > + atomic_t cur_rd_chan;
> > +};
> > +
> > +static __always_inline u32 ntb_edma_ring_idx(u32 v)
> > +{
> > + return v & NTB_EDMA_RING_MASK;
> > +}
> > +
> > +static __always_inline u32 ntb_edma_ring_used_entry(u32 head, u32 tail)
> > +{
> > + if (head >= tail) {
> > + WARN_ON_ONCE((head - tail) > (NTB_EDMA_RING_ENTRIES - 1));
> > + return head - tail;
> > + }
> > +
> > + WARN_ON_ONCE((U32_MAX - tail + head + 1) > (NTB_EDMA_RING_ENTRIES - 1));
> > + return U32_MAX - tail + head + 1;
> > +}
> > +
> > +static __always_inline u32 ntb_edma_ring_free_entry(u32 head, u32 tail)
> > +{
> > + return NTB_EDMA_RING_ENTRIES - ntb_edma_ring_used_entry(head, tail) - 1;
> > +}
> > +
> > +static __always_inline bool ntb_edma_ring_full(u32 head, u32 tail)
> > +{
> > + return ntb_edma_ring_free_entry(head, tail) == 0;
> > +}
> > +
> > +int ntb_edma_setup_isr(struct device *dev, struct device *epc_dev,
> > + ntb_edma_interrupt_cb_t cb, void *data);
> > +void ntb_edma_teardown_isr(struct device *dev);
> > +int ntb_edma_setup_mws(struct ntb_dev *ndev);
> > +int ntb_edma_setup_peer(struct ntb_dev *ndev);
> > +int ntb_edma_setup_chans(struct device *dma_dev, struct ntb_edma_chans *edma);
> > +struct dma_chan *ntb_edma_pick_chan(struct ntb_edma_chans *edma,
> > + remote_edma_dir_t dir);
> > +void ntb_edma_teardown_chans(struct ntb_edma_chans *edma);
> > +int ntb_edma_notify_peer(struct ntb_edma_chans *edma, int qp_num);
> > +
> > +#endif
> > diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport_core.c
> > similarity index 65%
> > rename from drivers/ntb/ntb_transport.c
> > rename to drivers/ntb/ntb_transport_core.c
> > index 907db6c93d4d..48d48921978d 100644
> > --- a/drivers/ntb/ntb_transport.c
> > +++ b/drivers/ntb/ntb_transport_core.c
> > @@ -47,6 +47,9 @@
> > * Contact Information:
> > * Jon Mason <jon.mason@intel.com>
> > */
> > +#include <linux/atomic.h>
> > +#include <linux/bug.h>
> > +#include <linux/compiler.h>
> > #include <linux/debugfs.h>
> > #include <linux/delay.h>
> > #include <linux/dmaengine.h>
> > @@ -71,6 +74,8 @@
> > #define NTB_TRANSPORT_DESC "Software Queue-Pair Transport over NTB"
> > #define NTB_TRANSPORT_MIN_SPADS (MW0_SZ_HIGH + 2)
> >
> > +#define NTB_EDMA_MAX_POLL 32
> > +
> > MODULE_DESCRIPTION(NTB_TRANSPORT_DESC);
> > MODULE_VERSION(NTB_TRANSPORT_VER);
> > MODULE_LICENSE("Dual BSD/GPL");
> > @@ -102,6 +107,13 @@ module_param(use_msi, bool, 0644);
> > MODULE_PARM_DESC(use_msi, "Use MSI interrupts instead of doorbells");
> > #endif
> >
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>
> This comment applies throughout this patch. Doing ifdefs inside C source is pretty frowned upon in the kernel. The preferred way is to only have ifdefs in the header files. So please give this a bit more consideration and see if it can be done differently to address this.
I agree, there is no good reason to keep those remaining ifdefs at all.
I'll clean it up. Thanks for pointing this out.
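
For example, the usual pattern with static inline stubs in a shared private
header should cover most of the call sites -- a rough sketch only, assuming
the eDMA hooks move into something like a ntb_transport.h (names and
location are not final):

    #ifdef CONFIG_NTB_TRANSPORT_EDMA
    int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
                                unsigned int *mw_count);
    void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt);
    #else
    static inline int
    ntb_transport_edma_init(struct ntb_transport_ctx *nt, unsigned int *mw_count)
    {
        return 0;
    }
    static inline void
    ntb_transport_edma_uninit(struct ntb_transport_ctx *nt) { }
    #endif

That way the callers in ntb_transport_core.c stay ifdef-free.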
>
> > +#include "ntb_edma.h"
> > +static bool use_remote_edma;
> > +module_param(use_remote_edma, bool, 0644);
> > +MODULE_PARM_DESC(use_remote_edma, "Use remote eDMA mode (when enabled, use_msi is ignored)");
> > +#endif
> > +
> > static struct dentry *nt_debugfs_dir;
> >
> > /* Only two-ports NTB devices are supported */
> > @@ -125,6 +137,14 @@ struct ntb_queue_entry {
> > struct ntb_payload_header __iomem *tx_hdr;
> > struct ntb_payload_header *rx_hdr;
> > };
> > +
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + dma_addr_t addr;
> > +
> > + /* Used by RC side only */
> > + struct scatterlist sgl;
> > + struct work_struct dma_work;
> > +#endif
> > };
> >
> > struct ntb_rx_info {
> > @@ -202,6 +222,33 @@ struct ntb_transport_qp {
> > int msi_irq;
> > struct ntb_msi_desc msi_desc;
> > struct ntb_msi_desc peer_msi_desc;
> > +
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + /*
> > + * For ensuring peer notification in non-atomic context.
> > + * ntb_peer_db_set might sleep or schedule.
> > + */
> > + struct work_struct db_work;
> > +
> > + /*
> > + * wr: remote eDMA write transfer (EP -> RC direction)
> > + * rd: remote eDMA read transfer (RC -> EP direction)
> > + */
> > + u32 wr_cons;
> > + u32 rd_cons;
> > + u32 wr_prod;
> > + u32 rd_prod;
> > + u32 wr_issue;
> > + u32 rd_issue;
> > +
> > + spinlock_t ep_tx_lock;
> > + spinlock_t ep_rx_lock;
> > + spinlock_t rc_lock;
> > +
> > + /* Completion work for read/write transfers. */
> > + struct work_struct read_work;
> > + struct work_struct write_work;
> > +#endif
>
> For something like this, maybe it needs its own struct instead of an ifdef chunk. Perhaps 'ntb_rx_info' can serve as a core data struct, with EDMA having an 'ntb_rx_info_edma' that embeds 'ntb_rx_info'.
Thanks again for the suggestion. I'll reorganize things.
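
For the qp state above, I'm currently thinking of grouping the eDMA-only
fields into their own struct, roughly along these lines (sketch only, the
final layout will be decided in the respin):

    struct ntb_transport_qp_edma {
        struct work_struct db_work;

        u32 wr_cons, rd_cons;
        u32 wr_prod, rd_prod;
        u32 wr_issue, rd_issue;

        spinlock_t ep_tx_lock;
        spinlock_t ep_rx_lock;
        spinlock_t rc_lock;

        struct work_struct read_work;
        struct work_struct write_work;
    };

and only instantiating it when the remote eDMA backend is in use.
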
Koichiro
>
> DJ
>
> > };
> >
> > struct ntb_transport_mw {
> > @@ -249,6 +296,13 @@ struct ntb_transport_ctx {
> >
> > /* Make sure workq of link event be executed serially */
> > struct mutex link_event_lock;
> > +
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + remote_edma_mode_t remote_edma_mode;
> > + struct device *dma_dev;
> > + struct workqueue_struct *wq;
> > + struct ntb_edma_chans edma;
> > +#endif
> > };
> >
> > enum {
> > @@ -262,6 +316,19 @@ struct ntb_payload_header {
> > unsigned int flags;
> > };
> >
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > +static void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt);
> > +static int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
> > + unsigned int *mw_count);
> > +static void ntb_transport_edma_init_queue(struct ntb_transport_ctx *nt,
> > + unsigned int qp_num);
> > +static void ntb_transport_edma_create_queue(struct ntb_transport_ctx *nt,
> > + struct ntb_transport_qp *qp);
> > +static int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt);
> > +static int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt);
> > +static void ntb_transport_edma_rc_dma_work(struct work_struct *work);
> > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > +
> > /*
> > * Return the device that should be used for DMA mapping.
> > *
> > @@ -298,7 +365,7 @@ enum {
> > container_of((__drv), struct ntb_transport_client, driver)
> >
> > #define QP_TO_MW(nt, qp) ((qp) % nt->mw_count)
> > -#define NTB_QP_DEF_NUM_ENTRIES 100
> > +#define NTB_QP_DEF_NUM_ENTRIES 128
> > #define NTB_LINK_DOWN_TIMEOUT 10
> >
> > static void ntb_transport_rxc_db(unsigned long data);
> > @@ -1015,6 +1082,10 @@ static void ntb_transport_link_cleanup(struct ntb_transport_ctx *nt)
> > count = ntb_spad_count(nt->ndev);
> > for (i = 0; i < count; i++)
> > ntb_spad_write(nt->ndev, i, 0);
> > +
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + ntb_edma_teardown_chans(&nt->edma);
> > +#endif
> > }
> >
> > static void ntb_transport_link_cleanup_work(struct work_struct *work)
> > @@ -1051,6 +1122,14 @@ static void ntb_transport_link_work(struct work_struct *work)
> >
> > /* send the local info, in the opposite order of the way we read it */
> >
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + rc = ntb_transport_edma_ep_init(nt);
> > + if (rc) {
> > + dev_err(&pdev->dev, "Failed to init EP: %d\n", rc);
> > + return;
> > + }
> > +#endif
> > +
> > if (nt->use_msi) {
> > rc = ntb_msi_setup_mws(ndev);
> > if (rc) {
> > @@ -1132,6 +1211,14 @@ static void ntb_transport_link_work(struct work_struct *work)
> >
> > nt->link_is_up = true;
> >
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + rc = ntb_transport_edma_rc_init(nt);
> > + if (rc) {
> > + dev_err(&pdev->dev, "Failed to init RC: %d\n", rc);
> > + goto out1;
> > + }
> > +#endif
> > +
> > for (i = 0; i < nt->qp_count; i++) {
> > struct ntb_transport_qp *qp = &nt->qp_vec[i];
> >
> > @@ -1277,6 +1364,8 @@ static const struct ntb_transport_backend_ops default_backend_ops = {
> > .debugfs_stats_show = ntb_transport_default_debugfs_stats_show,
> > };
> >
> > +static const struct ntb_transport_backend_ops edma_backend_ops;
> > +
> > static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> > {
> > struct ntb_transport_ctx *nt;
> > @@ -1311,7 +1400,23 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> >
> > nt->ndev = ndev;
> >
> > - nt->backend_ops = default_backend_ops;
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + if (use_remote_edma) {
> > + rc = ntb_transport_edma_init(nt, &mw_count);
> > + if (rc) {
> > + nt->mw_count = 0;
> > + goto err;
> > + }
> > + nt->backend_ops = edma_backend_ops;
> > +
> > + /*
> > + * On remote eDMA mode, we reserve a read channel for Host->EP
> > + * interruption.
> > + */
> > + use_msi = false;
> > + } else
> > +#endif
> > + nt->backend_ops = default_backend_ops;
> >
> > /*
> > * If we are using MSI, and have at least one extra memory window,
> > @@ -1402,6 +1507,10 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> > rc = ntb_transport_init_queue(nt, i);
> > if (rc)
> > goto err2;
> > +
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + ntb_transport_edma_init_queue(nt, i);
> > +#endif
> > }
> >
> > INIT_DELAYED_WORK(&nt->link_work, ntb_transport_link_work);
> > @@ -1433,6 +1542,9 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> > }
> > kfree(nt->mw_vec);
> > err:
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + ntb_transport_edma_uninit(nt);
> > +#endif
> > kfree(nt);
> > return rc;
> > }
> > @@ -2055,11 +2167,16 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
> >
> > nt->qp_bitmap_free &= ~qp_bit;
> >
> > + qp->qp_bit = qp_bit;
> > qp->cb_data = data;
> > qp->rx_handler = handlers->rx_handler;
> > qp->tx_handler = handlers->tx_handler;
> > qp->event_handler = handlers->event_handler;
> >
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + ntb_transport_edma_create_queue(nt, qp);
> > +#endif
> > +
> > dma_cap_zero(dma_mask);
> > dma_cap_set(DMA_MEMCPY, dma_mask);
> >
> > @@ -2105,6 +2222,9 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
> > goto err1;
> >
> > entry->qp = qp;
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + INIT_WORK(&entry->dma_work, ntb_transport_edma_rc_dma_work);
> > +#endif
> > ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> > &qp->rx_free_q);
> > }
> > @@ -2156,8 +2276,8 @@ EXPORT_SYMBOL_GPL(ntb_transport_create_queue);
> > */
> > void ntb_transport_free_queue(struct ntb_transport_qp *qp)
> > {
> > - struct pci_dev *pdev;
> > struct ntb_queue_entry *entry;
> > + struct pci_dev *pdev;
> > u64 qp_bit;
> >
> > if (!qp)
> > @@ -2208,6 +2328,10 @@ void ntb_transport_free_queue(struct ntb_transport_qp *qp)
> > tasklet_kill(&qp->rxc_db_work);
> >
> > cancel_delayed_work_sync(&qp->link_work);
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > + cancel_work_sync(&qp->read_work);
> > + cancel_work_sync(&qp->write_work);
> > +#endif
> >
> > qp->cb_data = NULL;
> > qp->rx_handler = NULL;
> > @@ -2346,6 +2470,1157 @@ int ntb_transport_tx_enqueue(struct ntb_transport_qp *qp, void *cb, void *data,
> > }
> > EXPORT_SYMBOL_GPL(ntb_transport_tx_enqueue);
> >
> > +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> > +/*
> > + * Remote eDMA mode implementation
> > + */
> > +struct ntb_edma_desc {
> > + u32 len;
> > + u32 flags;
> > + u64 addr; /* DMA address */
> > + u64 data;
> > +};
> > +
> > +struct ntb_edma_ring {
> > + struct ntb_edma_desc desc[NTB_EDMA_RING_ENTRIES];
> > + u32 head;
> > + u32 tail;
> > +};
> > +
> > +#define NTB_EDMA_DESC_OFF(i) ((size_t)(i) * sizeof(struct ntb_edma_desc))
> > +
> > +#define __NTB_EDMA_CHECK_INDEX(_i) \
> > +({ \
> > + unsigned long __i = (unsigned long)(_i); \
> > + WARN_ONCE(__i >= (unsigned long)NTB_EDMA_RING_ENTRIES, \
> > + "ntb_edma: index i=%lu >= ring_entries=%lu\n", \
> > + __i, (unsigned long)NTB_EDMA_RING_ENTRIES); \
> > + __i; \
> > +})
> > +
> > +#define NTB_EDMA_DESC_I(qp, i, n) \
> > +({ \
> > + typeof(qp) __qp = (qp); \
> > + unsigned long __i = __NTB_EDMA_CHECK_INDEX(i); \
> > + (struct ntb_edma_desc *) \
> > + ((char *)(__qp)->rx_buff + \
> > + (sizeof(struct ntb_edma_ring) * n) + \
> > + NTB_EDMA_DESC_OFF(__i)); \
> > +})
> > +
> > +#define NTB_EDMA_DESC_O(qp, i, n) \
> > +({ \
> > + typeof(qp) __qp = (qp); \
> > + unsigned long __i = __NTB_EDMA_CHECK_INDEX(i); \
> > + (struct ntb_edma_desc __iomem *) \
> > + ((char __iomem *)(__qp)->tx_mw + \
> > + (sizeof(struct ntb_edma_ring) * n) + \
> > + NTB_EDMA_DESC_OFF(__i)); \
> > +})
> > +
> > +#define NTB_EDMA_HEAD_I(qp, n) ((u32 *)((char *)qp->rx_buff + \
> > + (sizeof(struct ntb_edma_ring) * n) + \
> > + offsetof(struct ntb_edma_ring, head)))
> > +#define NTB_EDMA_HEAD_O(qp, n)	((u32 __iomem *)((char __iomem *)qp->tx_mw + \
> > + (sizeof(struct ntb_edma_ring) * n) + \
> > + offsetof(struct ntb_edma_ring, head)))
> > +#define NTB_EDMA_TAIL_I(qp, n) ((u32 *)((char *)qp->rx_buff + \
> > + (sizeof(struct ntb_edma_ring) * n) + \
> > + offsetof(struct ntb_edma_ring, tail)))
> > +#define NTB_EDMA_TAIL_O(qp, n)	((u32 __iomem *)((char __iomem *)qp->tx_mw + \
> > + (sizeof(struct ntb_edma_ring) * n) + \
> > + offsetof(struct ntb_edma_ring, tail)))
> > +
> > +/*
> > + * Macro naming rule:
> > + * NTB_DESC_RD_EP_I (as an example)
> > + * ^^ ^^ ^
> > + * : : `-- I(n) or O(ut). In = Read, Out = Write.
> > + * : `----- Who uses this macro.
> > + * `-------- DESC / HEAD / TAIL
> > + *
> > + * Read transfers (RC->EP):
> > + *
> > + * EP view (outbound, written via NTB):
> > + * - descs: NTB_DESC_RD_EP_O(qp, i) / NTB_DESC_RD_EP_I(qp, i)
> > + * [ len ][ flags ][ addr ][ data ]
> > + * [ len ][ flags ][ addr ][ data ]
> > + * :
> > + * [ len ][ flags ][ addr ][ data ]
> > + * - head: NTB_HEAD_RD_EP_O(qp)
> > + * - tail: NTB_TAIL_RD_EP_I(qp)
> > + *
> > + * RC view (inbound, local mapping):
> > + * - descs: NTB_DESC_RD_RC_I(qp, i) / NTB_DESC_RD_RC_O(qp, i)
> > + * [ len ][ flags ][ addr ][ data ]
> > + * [ len ][ flags ][ addr ][ data ]
> > + * :
> > + * [ len ][ flags ][ addr ][ data ]
> > + * - head: NTB_HEAD_RD_RC_I(qp)
> > + * - tail: NTB_TAIL_RD_RC_O(qp)
> > + *
> > + * Write transfers (EP -> RC) are analogous but use
> > + * NTB_DESC_WR_{EP_O,RC_I}(), NTB_HEAD_WR_{EP_O,RC_I}(),
> > + * and NTB_TAIL_WR_{EP_I,RC_O}().
> > + */
> > +#define NTB_DESC_RD_EP_I(qp, i) NTB_EDMA_DESC_I(qp, i, 0)
> > +#define NTB_DESC_RD_EP_O(qp, i) NTB_EDMA_DESC_O(qp, i, 0)
> > +#define NTB_DESC_WR_EP_I(qp, i) NTB_EDMA_DESC_I(qp, i, 1)
> > +#define NTB_DESC_WR_EP_O(qp, i) NTB_EDMA_DESC_O(qp, i, 1)
> > +#define NTB_DESC_RD_RC_I(qp, i) NTB_EDMA_DESC_I(qp, i, 0)
> > +#define NTB_DESC_RD_RC_O(qp, i) NTB_EDMA_DESC_O(qp, i, 0)
> > +#define NTB_DESC_WR_RC_I(qp, i) NTB_EDMA_DESC_I(qp, i, 1)
> > +#define NTB_DESC_WR_RC_O(qp, i) NTB_EDMA_DESC_O(qp, i, 1)
> > +
> > +#define NTB_HEAD_RD_EP_O(qp) NTB_EDMA_HEAD_O(qp, 0)
> > +#define NTB_HEAD_WR_EP_O(qp) NTB_EDMA_HEAD_O(qp, 1)
> > +#define NTB_HEAD_RD_RC_I(qp) NTB_EDMA_HEAD_I(qp, 0)
> > +#define NTB_HEAD_WR_RC_I(qp) NTB_EDMA_HEAD_I(qp, 1)
> > +
> > +#define NTB_TAIL_RD_EP_I(qp) NTB_EDMA_TAIL_I(qp, 0)
> > +#define NTB_TAIL_WR_EP_I(qp) NTB_EDMA_TAIL_I(qp, 1)
> > +#define NTB_TAIL_RD_RC_O(qp) NTB_EDMA_TAIL_O(qp, 0)
> > +#define NTB_TAIL_WR_RC_O(qp) NTB_EDMA_TAIL_O(qp, 1)
> > +
> > +static inline bool ntb_qp_edma_is_rc(struct ntb_transport_qp *qp)
> > +{
> > + return qp->transport->remote_edma_mode == REMOTE_EDMA_RC;
> > +}
> > +
> > +static inline bool ntb_qp_edma_is_ep(struct ntb_transport_qp *qp)
> > +{
> > + return qp->transport->remote_edma_mode == REMOTE_EDMA_EP;
> > +}
> > +
> > +static inline bool ntb_qp_edma_enabled(struct ntb_transport_qp *qp)
> > +{
> > + return ntb_qp_edma_is_rc(qp) || ntb_qp_edma_is_ep(qp);
> > +}
> > +
> > +static unsigned int ntb_transport_edma_tx_free_entry(struct ntb_transport_qp *qp)
> > +{
> > + unsigned int head, tail;
> > +
> > + if (ntb_qp_edma_is_ep(qp)) {
> > + scoped_guard(spinlock_irqsave, &qp->ep_tx_lock) {
> > + /* In this scope, only 'head' might proceed */
> > + tail = READ_ONCE(qp->wr_cons);
> > + head = READ_ONCE(qp->wr_prod);
> > + }
> > + return ntb_edma_ring_free_entry(head, tail);
> > + }
> > +
> > + scoped_guard(spinlock_irqsave, &qp->rc_lock) {
> > + /* In this scope, only 'head' might proceed */
> > + tail = READ_ONCE(qp->rd_issue);
> > + head = READ_ONCE(*NTB_HEAD_RD_RC_I(qp));
> > + }
> > + /*
> > +	 * On the RC side, the 'used' count indicates how many entries the EP
> > +	 * side has refilled, i.e. how many are available for us to use for TX.
> > + */
> > + return ntb_edma_ring_used_entry(head, tail);
> > +}
> > +
> > +static void ntb_transport_edma_debugfs_stats_show(struct seq_file *s,
> > + struct ntb_transport_qp *qp)
> > +{
> > + seq_printf(s, "rx_bytes - \t%llu\n", qp->rx_bytes);
> > + seq_printf(s, "rx_pkts - \t%llu\n", qp->rx_pkts);
> > + seq_printf(s, "rx_err_no_buf - %llu\n", qp->rx_err_no_buf);
> > + seq_printf(s, "rx_buff - \t0x%p\n", qp->rx_buff);
> > + seq_printf(s, "rx_max_entry - \t%u\n", qp->rx_max_entry);
> > + seq_printf(s, "rx_alloc_entry - \t%u\n\n", qp->rx_alloc_entry);
> > +
> > + seq_printf(s, "tx_bytes - \t%llu\n", qp->tx_bytes);
> > + seq_printf(s, "tx_pkts - \t%llu\n", qp->tx_pkts);
> > + seq_printf(s, "tx_ring_full - \t%llu\n", qp->tx_ring_full);
> > + seq_printf(s, "tx_err_no_buf - %llu\n", qp->tx_err_no_buf);
> > + seq_printf(s, "tx_mw - \t0x%p\n", qp->tx_mw);
> > + seq_printf(s, "tx_max_entry - \t%u\n", qp->tx_max_entry);
> > + seq_printf(s, "free tx - \t%u\n", ntb_transport_tx_free_entry(qp));
> > + seq_putc(s, '\n');
> > +
> > + seq_puts(s, "Using Remote eDMA - Yes\n");
> > + seq_printf(s, "QP Link - \t%s\n", qp->link_is_up ? "Up" : "Down");
> > +}
> > +
> > +static void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt)
> > +{
> > + struct ntb_dev *ndev = nt->ndev;
> > +
> > + if (nt->remote_edma_mode == REMOTE_EDMA_EP && ndev && ndev->pdev)
> > + ntb_edma_teardown_isr(&ndev->pdev->dev);
> > +
> > + if (nt->wq)
> > + destroy_workqueue(nt->wq);
> > + nt->wq = NULL;
> > +}
> > +
> > +static int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
> > + unsigned int *mw_count)
> > +{
> > + struct ntb_dev *ndev = nt->ndev;
> > +
> > + /*
> > + * We need at least one MW for the transport plus one MW reserved
> > + * for the remote eDMA window (see ntb_edma_setup_mws/peer).
> > + */
> > + if (*mw_count <= 1) {
> > + dev_err(&ndev->dev,
> > +			"remote eDMA requires at least two MWs (have %u)\n",
> > + *mw_count);
> > + return -ENODEV;
> > + }
> > +
> > + nt->wq = alloc_workqueue("ntb-edma-wq", WQ_UNBOUND | WQ_SYSFS, 0);
> > + if (!nt->wq) {
> > + ntb_transport_edma_uninit(nt);
> > + return -ENOMEM;
> > + }
> > +
> > + /* Reserve the last peer MW exclusively for the eDMA window. */
> > + *mw_count -= 1;
> > +
> > + return 0;
> > +}
> > +
> > +static void ntb_transport_edma_db_work(struct work_struct *work)
> > +{
> > + struct ntb_transport_qp *qp =
> > + container_of(work, struct ntb_transport_qp, db_work);
> > +
> > + ntb_peer_db_set(qp->ndev, qp->qp_bit);
> > +}
> > +
> > +static void ntb_transport_edma_notify_peer(struct ntb_transport_qp *qp)
> > +{
> > + if (ntb_qp_edma_is_rc(qp))
> > + if (!ntb_edma_notify_peer(&qp->transport->edma, qp->qp_num))
> > + return;
> > +
> > + /*
> > + * Called from contexts that may be atomic. Since ntb_peer_db_set()
> > + * may sleep, delegate the actual doorbell write to a workqueue.
> > + */
> > + queue_work(system_highpri_wq, &qp->db_work);
> > +}
> > +
> > +static void ntb_transport_edma_isr(void *data, int qp_num)
> > +{
> > + struct ntb_transport_ctx *nt = data;
> > + struct ntb_transport_qp *qp;
> > +
> > + if (qp_num < 0 || qp_num >= nt->qp_count)
> > + return;
> > +
> > + qp = &nt->qp_vec[qp_num];
> > + if (WARN_ON(!qp))
> > + return;
> > +
> > + queue_work(nt->wq, &qp->read_work);
> > + queue_work(nt->wq, &qp->write_work);
> > +}
> > +
> > +static int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt)
> > +{
> > + struct ntb_dev *ndev = nt->ndev;
> > + struct pci_dev *pdev = ndev->pdev;
> > + int rc;
> > +
> > + if (!use_remote_edma || nt->remote_edma_mode != REMOTE_EDMA_UNKNOWN)
> > + return 0;
> > +
> > + rc = ntb_edma_setup_peer(ndev);
> > + if (rc) {
> > + dev_err(&pdev->dev, "Failed to enable remote eDMA: %d\n", rc);
> > + return rc;
> > + }
> > +
> > + rc = ntb_edma_setup_chans(get_dma_dev(ndev), &nt->edma);
> > + if (rc) {
> > + dev_err(&pdev->dev, "Failed to setup eDMA channels: %d\n", rc);
> > + return rc;
> > + }
> > +
> > + nt->remote_edma_mode = REMOTE_EDMA_RC;
> > + return 0;
> > +}
> > +
> > +static int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt)
> > +{
> > + struct ntb_dev *ndev = nt->ndev;
> > + struct pci_dev *pdev = ndev->pdev;
> > + struct pci_epc *epc;
> > + int rc;
> > +
> > + if (!use_remote_edma || nt->remote_edma_mode == REMOTE_EDMA_EP)
> > + return 0;
> > +
> > + /* Only EP side can return pci_epc */
> > + epc = ntb_get_pci_epc(ndev);
> > + if (!epc)
> > + return 0;
> > +
> > + rc = ntb_edma_setup_mws(ndev);
> > + if (rc) {
> > + dev_err(&pdev->dev,
> > + "Failed to set up memory window for eDMA: %d\n", rc);
> > + return rc;
> > + }
> > +
> > + rc = ntb_edma_setup_isr(&pdev->dev, &epc->dev, ntb_transport_edma_isr, nt);
> > + if (rc) {
> > + dev_err(&pdev->dev, "Failed to setup eDMA ISR (%d)\n", rc);
> > + return rc;
> > + }
> > +
> > + nt->remote_edma_mode = REMOTE_EDMA_EP;
> > + return 0;
> > +}
> > +
> > +static int ntb_transport_edma_setup_qp_mw(struct ntb_transport_ctx *nt,
> > + unsigned int qp_num)
> > +{
> > + struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
> > + struct ntb_dev *ndev = nt->ndev;
> > + struct ntb_queue_entry *entry;
> > + struct ntb_transport_mw *mw;
> > + unsigned int mw_num, mw_count, qp_count;
> > + unsigned int qp_offset, rx_info_offset;
> > + unsigned int mw_size, mw_size_per_qp;
> > + unsigned int num_qps_mw;
> > + size_t edma_total;
> > + unsigned int i;
> > + int node;
> > +
> > + mw_count = nt->mw_count;
> > + qp_count = nt->qp_count;
> > +
> > + mw_num = QP_TO_MW(nt, qp_num);
> > + mw = &nt->mw_vec[mw_num];
> > +
> > + if (!mw->virt_addr)
> > + return -ENOMEM;
> > +
> > + if (mw_num < qp_count % mw_count)
> > + num_qps_mw = qp_count / mw_count + 1;
> > + else
> > + num_qps_mw = qp_count / mw_count;
> > +
> > + mw_size = min(nt->mw_vec[mw_num].phys_size, mw->xlat_size);
> > + if (max_mw_size && mw_size > max_mw_size)
> > + mw_size = max_mw_size;
> > +
> > + mw_size_per_qp = round_down((unsigned int)mw_size / num_qps_mw, SZ_64);
> > + qp_offset = mw_size_per_qp * (qp_num / mw_count);
> > + rx_info_offset = mw_size_per_qp - sizeof(struct ntb_rx_info);
> > +
> > + qp->tx_mw_size = mw_size_per_qp;
> > + qp->tx_mw = nt->mw_vec[mw_num].vbase + qp_offset;
> > + if (!qp->tx_mw)
> > + return -EINVAL;
> > + qp->tx_mw_phys = nt->mw_vec[mw_num].phys_addr + qp_offset;
> > + if (!qp->tx_mw_phys)
> > + return -EINVAL;
> > + qp->rx_info = qp->tx_mw + rx_info_offset;
> > + qp->rx_buff = mw->virt_addr + qp_offset;
> > + qp->remote_rx_info = qp->rx_buff + rx_info_offset;
> > +
> > + /* Due to housekeeping, there must be at least 2 buffs */
> > + qp->tx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
> > + qp->rx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
> > +
> > + /* In eDMA mode, decouple from MW sizing and force ring-sized entries */
> > + edma_total = 2 * sizeof(struct ntb_edma_ring);
> > + if (rx_info_offset < edma_total) {
> > +		dev_err(&ndev->dev, "Ring space requires %zuB but only %uB is available\n",
> > +			edma_total, rx_info_offset);
> > + return -EINVAL;
> > + }
> > + qp->tx_max_entry = NTB_EDMA_RING_ENTRIES;
> > + qp->rx_max_entry = NTB_EDMA_RING_ENTRIES;
> > +
> > + /*
> > + * Checking to see if we have more entries than the default.
> > + * We should add additional entries if that is the case so we
> > + * can be in sync with the transport frames.
> > + */
> > + node = dev_to_node(&ndev->dev);
> > + for (i = qp->rx_alloc_entry; i < qp->rx_max_entry; i++) {
> > + entry = kzalloc_node(sizeof(*entry), GFP_KERNEL, node);
> > + if (!entry)
> > + return -ENOMEM;
> > +
> > + entry->qp = qp;
> > + INIT_WORK(&entry->dma_work, ntb_transport_edma_rc_dma_work);
> > + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> > + &qp->rx_free_q);
> > + qp->rx_alloc_entry++;
> > + }
> > +
> > + memset(qp->rx_buff, 0, edma_total);
> > +
> > + qp->rx_pkts = 0;
> > + qp->tx_pkts = 0;
> > +
> > + return 0;
> > +}
> > +
> > +static int ntb_transport_edma_ep_read_complete(struct ntb_transport_qp *qp)
> > +{
> > + struct device *dma_dev = get_dma_dev(qp->ndev);
> > + struct ntb_queue_entry *entry;
> > + struct ntb_edma_desc *in;
> > + unsigned int len;
> > +	u32 idx, flags;
> > +
> > + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_TAIL_RD_EP_I(qp)),
> > + qp->rd_cons) == 0)
> > + return 0;
> > +
> > + idx = ntb_edma_ring_idx(qp->rd_cons);
> > + in = NTB_DESC_RD_EP_I(qp, idx);
> > + if (!(in->flags & DESC_DONE_FLAG))
> > + return 0;
> > +
> > +	/* Latch flags before clearing them; they are checked below */
> > +	flags = in->flags;
> > +	in->flags = 0;
> > +	len = in->len; /* might be smaller than entry->len */
> > +
> > +	entry = (struct ntb_queue_entry *)(in->data);
> > +	if (WARN_ON(!entry))
> > +		return 0;
> > +
> > +	if (flags & LINK_DOWN_FLAG) {
> > + ntb_qp_link_down(qp);
> > + qp->rd_cons++;
> > + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> > + return 1;
> > + }
> > +
> > + dma_unmap_single(dma_dev, entry->addr, entry->len, DMA_FROM_DEVICE);
> > +
> > + qp->rx_bytes += len;
> > + qp->rx_pkts++;
> > + qp->rd_cons++;
> > +
> > + if (qp->rx_handler && qp->client_ready)
> > + qp->rx_handler(qp, qp->cb_data, entry->cb_data, len);
> > +
> > + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> > + return 1;
> > +}
> > +
> > +static int ntb_transport_edma_ep_write_complete(struct ntb_transport_qp *qp)
> > +{
> > + struct ntb_queue_entry *entry;
> > + struct ntb_edma_desc *in;
> > + u32 idx;
> > +
> > + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_TAIL_WR_EP_I(qp)),
> > + qp->wr_cons) == 0)
> > + return 0;
> > +
> > + idx = ntb_edma_ring_idx(qp->wr_cons);
> > + in = NTB_DESC_WR_EP_I(qp, idx);
> > +
> > + entry = (struct ntb_queue_entry *)(in->data);
> > + if (WARN_ON(!entry))
> > + return 0;
> > +
> > + qp->wr_cons++;
> > +
> > + if (qp->tx_handler)
> > + qp->tx_handler(qp, qp->cb_data, entry->cb_data, entry->len);
> > +
> > + ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry, &qp->tx_free_q);
> > + return 1;
> > +}
> > +
> > +static void ntb_transport_edma_ep_read_work(struct work_struct *work)
> > +{
> > + struct ntb_transport_qp *qp = container_of(
> > + work, struct ntb_transport_qp, read_work);
> > + unsigned int i;
> > +
> > + for (i = 0; i < NTB_EDMA_MAX_POLL; i++) {
> > + if (!ntb_transport_edma_ep_read_complete(qp))
> > + break;
> > + }
> > +
> > + if (ntb_transport_edma_ep_read_complete(qp))
> > + queue_work(qp->transport->wq, &qp->read_work);
> > +}
> > +
> > +static void ntb_transport_edma_ep_write_work(struct work_struct *work)
> > +{
> > + struct ntb_transport_qp *qp = container_of(
> > + work, struct ntb_transport_qp, write_work);
> > + unsigned int i;
> > +
> > + for (i = 0; i < NTB_EDMA_MAX_POLL; i++) {
> > + if (!ntb_transport_edma_ep_write_complete(qp))
> > + break;
> > + }
> > +
> > + if (ntb_transport_edma_ep_write_complete(qp))
> > + queue_work(qp->transport->wq, &qp->write_work);
> > +}
> > +
> > +static void ntb_transport_edma_rc_write_complete_work(struct work_struct *work)
> > +{
> > + struct ntb_transport_qp *qp = container_of(
> > + work, struct ntb_transport_qp, write_work);
> > + struct ntb_queue_entry *entry;
> > + struct ntb_edma_desc *in;
> > + unsigned int len;
> > + void *cb_data;
> > + u32 idx;
> > +
> > + while (ntb_edma_ring_used_entry(READ_ONCE(qp->wr_issue),
> > + qp->wr_cons) != 0) {
> > + /* Paired with smp_wmb() in ntb_transport_edma_rc_poll() */
> > + smp_rmb();
> > +
> > + idx = ntb_edma_ring_idx(qp->wr_cons);
> > + in = NTB_DESC_WR_RC_I(qp, idx);
> > + entry = (struct ntb_queue_entry *)READ_ONCE(in->data);
> > + if (!entry || !(entry->flags & DESC_DONE_FLAG))
> > + break;
> > +
> > + in->data = 0;
> > +
> > + cb_data = entry->cb_data;
> > + len = entry->len;
> > +
> > + iowrite32(++qp->wr_cons, NTB_TAIL_WR_RC_O(qp));
> > +
> > + if (unlikely(entry->flags & LINK_DOWN_FLAG)) {
> > + ntb_qp_link_down(qp);
> > + continue;
> > + }
> > +
> > + ntb_transport_edma_notify_peer(qp);
> > +
> > + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> > +
> > + if (qp->rx_handler && qp->client_ready)
> > + qp->rx_handler(qp, qp->cb_data, cb_data, len);
> > +
> > + /* stat updates */
> > + qp->rx_bytes += len;
> > + qp->rx_pkts++;
> > + }
> > +}
> > +
> > +static void ntb_transport_edma_rc_write_cb(void *data,
> > + const struct dmaengine_result *res)
> > +{
> > + struct ntb_queue_entry *entry = data;
> > + struct ntb_transport_qp *qp = entry->qp;
> > + struct ntb_transport_ctx *nt = qp->transport;
> > + enum dmaengine_tx_result dma_err = res->result;
> > + struct device *dma_dev = get_dma_dev(qp->ndev);
> > +
> > + switch (dma_err) {
> > + case DMA_TRANS_READ_FAILED:
> > + case DMA_TRANS_WRITE_FAILED:
> > + case DMA_TRANS_ABORTED:
> > + entry->errors++;
> > + entry->len = -EIO;
> > + break;
> > + case DMA_TRANS_NOERROR:
> > + default:
> > + break;
> > + }
> > + dma_unmap_sg(dma_dev, &entry->sgl, 1, DMA_FROM_DEVICE);
> > + sg_dma_address(&entry->sgl) = 0;
> > +
> > + entry->flags |= DESC_DONE_FLAG;
> > +
> > + queue_work(nt->wq, &qp->write_work);
> > +}
> > +
> > +static void ntb_transport_edma_rc_read_complete_work(struct work_struct *work)
> > +{
> > + struct ntb_transport_qp *qp = container_of(
> > + work, struct ntb_transport_qp, read_work);
> > + struct ntb_edma_desc *in, __iomem *out;
> > + struct ntb_queue_entry *entry;
> > + unsigned int len;
> > + void *cb_data;
> > + u32 idx;
> > +
> > + while (ntb_edma_ring_used_entry(READ_ONCE(qp->rd_issue),
> > + qp->rd_cons) != 0) {
> > + /* Paired with smp_wmb() in ntb_transport_edma_rc_tx_enqueue() */
> > + smp_rmb();
> > +
> > + idx = ntb_edma_ring_idx(qp->rd_cons);
> > + in = NTB_DESC_RD_RC_I(qp, idx);
> > + entry = (struct ntb_queue_entry *)in->data;
> > + if (!entry || !(entry->flags & DESC_DONE_FLAG))
> > + break;
> > +
> > + in->data = 0;
> > +
> > + cb_data = entry->cb_data;
> > + len = entry->len;
> > +
> > + out = NTB_DESC_RD_RC_O(qp, idx);
> > +
> > + WRITE_ONCE(qp->rd_cons, qp->rd_cons + 1);
> > +
> > + /*
> > + * No need to add barrier in-between to enforce ordering here.
> > + * The other side proceeds only after both flags and tail are
> > + * updated.
> > + */
> > + iowrite32(entry->flags, &out->flags);
> > + iowrite32(qp->rd_cons, NTB_TAIL_RD_RC_O(qp));
> > +
> > + ntb_transport_edma_notify_peer(qp);
> > +
> > + ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry,
> > + &qp->tx_free_q);
> > +
> > + if (qp->tx_handler)
> > + qp->tx_handler(qp, qp->cb_data, cb_data, len);
> > +
> > + /* stat updates */
> > + qp->tx_bytes += len;
> > + qp->tx_pkts++;
> > + }
> > +}
> > +
> > +static void ntb_transport_edma_rc_read_cb(void *data,
> > + const struct dmaengine_result *res)
> > +{
> > + struct ntb_queue_entry *entry = data;
> > + struct ntb_transport_qp *qp = entry->qp;
> > + struct ntb_transport_ctx *nt = qp->transport;
> > + struct device *dma_dev = get_dma_dev(qp->ndev);
> > + enum dmaengine_tx_result dma_err = res->result;
> > +
> > + switch (dma_err) {
> > + case DMA_TRANS_READ_FAILED:
> > + case DMA_TRANS_WRITE_FAILED:
> > + case DMA_TRANS_ABORTED:
> > + entry->errors++;
> > + entry->len = -EIO;
> > + break;
> > + case DMA_TRANS_NOERROR:
> > + default:
> > + break;
> > + }
> > + dma_unmap_sg(dma_dev, &entry->sgl, 1, DMA_TO_DEVICE);
> > + sg_dma_address(&entry->sgl) = 0;
> > +
> > + entry->flags |= DESC_DONE_FLAG;
> > +
> > + queue_work(nt->wq, &qp->read_work);
> > +}
> > +
> > +static int ntb_transport_edma_rc_write_start(struct device *d,
> > + struct dma_chan *chan, size_t len,
> > + dma_addr_t ep_src, void *rc_dst,
> > + struct ntb_queue_entry *entry)
> > +{
> > + struct scatterlist *sgl = &entry->sgl;
> > + struct dma_async_tx_descriptor *txd;
> > + struct dma_slave_config cfg;
> > + dma_cookie_t cookie;
> > + int nents, rc;
> > +
> > + if (!d)
> > + return -ENODEV;
> > +
> > + if (!chan)
> > + return -ENXIO;
> > +
> > + if (WARN_ON(!ep_src || !rc_dst))
> > + return -EINVAL;
> > +
> > + if (WARN_ON(sg_dma_address(sgl)))
> > + return -EINVAL;
> > +
> > + sg_init_one(sgl, rc_dst, len);
> > + nents = dma_map_sg(d, sgl, 1, DMA_FROM_DEVICE);
> > + if (nents <= 0)
> > + return -EIO;
> > +
> > + memset(&cfg, 0, sizeof(cfg));
> > + cfg.src_addr = ep_src;
> > + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> > + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> > + cfg.direction = DMA_DEV_TO_MEM;
> > + rc = dmaengine_slave_config(chan, &cfg);
> > + if (rc)
> > + goto out_unmap;
> > +
> > + txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_DEV_TO_MEM,
> > + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> > + if (!txd) {
> > + rc = -EIO;
> > + goto out_unmap;
> > + }
> > +
> > + txd->callback_result = ntb_transport_edma_rc_write_cb;
> > + txd->callback_param = entry;
> > +
> > + cookie = dmaengine_submit(txd);
> > + if (dma_submit_error(cookie)) {
> > + rc = -EIO;
> > + goto out_unmap;
> > + }
> > + dma_async_issue_pending(chan);
> > + return 0;
> > +out_unmap:
> > + dma_unmap_sg(d, sgl, 1, DMA_FROM_DEVICE);
> > + return rc;
> > +}
> > +
> > +static int ntb_transport_edma_rc_read_start(struct device *d,
> > + struct dma_chan *chan, size_t len,
> > + void *rc_src, dma_addr_t ep_dst,
> > + struct ntb_queue_entry *entry)
> > +{
> > + struct scatterlist *sgl = &entry->sgl;
> > + struct dma_async_tx_descriptor *txd;
> > + struct dma_slave_config cfg;
> > + dma_cookie_t cookie;
> > + int nents, rc;
> > +
> > + if (!d)
> > + return -ENODEV;
> > +
> > + if (!chan)
> > + return -ENXIO;
> > +
> > + if (WARN_ON(!rc_src || !ep_dst))
> > + return -EINVAL;
> > +
> > + if (WARN_ON(sg_dma_address(sgl)))
> > + return -EINVAL;
> > +
> > + sg_init_one(sgl, rc_src, len);
> > + nents = dma_map_sg(d, sgl, 1, DMA_TO_DEVICE);
> > + if (nents <= 0)
> > + return -EIO;
> > +
> > + memset(&cfg, 0, sizeof(cfg));
> > + cfg.dst_addr = ep_dst;
> > + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> > + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> > + cfg.direction = DMA_MEM_TO_DEV;
> > + rc = dmaengine_slave_config(chan, &cfg);
> > + if (rc)
> > + goto out_unmap;
> > +
> > + txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_MEM_TO_DEV,
> > + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> > + if (!txd) {
> > + rc = -EIO;
> > + goto out_unmap;
> > + }
> > +
> > + txd->callback_result = ntb_transport_edma_rc_read_cb;
> > + txd->callback_param = entry;
> > +
> > + cookie = dmaengine_submit(txd);
> > + if (dma_submit_error(cookie)) {
> > + rc = -EIO;
> > + goto out_unmap;
> > + }
> > + dma_async_issue_pending(chan);
> > + return 0;
> > +out_unmap:
> > + dma_unmap_sg(d, sgl, 1, DMA_TO_DEVICE);
> > + return rc;
> > +}
> > +
> > +static void ntb_transport_edma_rc_dma_work(struct work_struct *work)
> > +{
> > + struct ntb_queue_entry *entry = container_of(
> > + work, struct ntb_queue_entry, dma_work);
> > + struct ntb_transport_qp *qp = entry->qp;
> > + struct ntb_transport_ctx *nt = qp->transport;
> > + struct device *dma_dev = get_dma_dev(qp->ndev);
> > + struct dma_chan *chan;
> > + int rc;
> > +
> > + chan = ntb_edma_pick_chan(&nt->edma, REMOTE_EDMA_WRITE);
> > + rc = ntb_transport_edma_rc_write_start(dma_dev, chan, entry->len,
> > + entry->addr, entry->buf, entry);
> > + if (rc) {
> > + entry->errors++;
> > + entry->len = -EIO;
> > + entry->flags |= DESC_DONE_FLAG;
> > + queue_work(nt->wq, &qp->write_work);
> > + return;
> > + }
> > +}
> > +
> > +static void ntb_transport_edma_rc_poll(struct ntb_transport_qp *qp)
> > +{
> > + struct ntb_transport_ctx *nt = qp->transport;
> > + unsigned int budget = NTB_EDMA_MAX_POLL;
> > + struct ntb_queue_entry *entry;
> > + struct ntb_edma_desc *in;
> > + dma_addr_t ep_src;
> > + u32 len, idx;
> > +
> > + while (budget--) {
> > + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_HEAD_WR_RC_I(qp)),
> > + qp->wr_issue) == 0)
> > + break;
> > +
> > + idx = ntb_edma_ring_idx(qp->wr_issue);
> > + in = NTB_DESC_WR_RC_I(qp, idx);
> > +
> > + len = READ_ONCE(in->len);
> > + ep_src = (dma_addr_t)READ_ONCE(in->addr);
> > +
> > + /* Prepare 'entry' for write completion */
> > + entry = ntb_list_rm(&qp->ntb_rx_q_lock, &qp->rx_pend_q);
> > + if (!entry) {
> > + qp->rx_err_no_buf++;
> > + break;
> > + }
> > + if (WARN_ON(entry->flags & DESC_DONE_FLAG))
> > + entry->flags &= ~DESC_DONE_FLAG;
> > + entry->len = len; /* NB. entry->len can be <=0 */
> > + entry->addr = ep_src;
> > +
> > + /*
> > + * ntb_transport_edma_rc_write_complete_work() checks entry->flags
> > + * so it needs to be set before wr_issue++.
> > + */
> > + in->data = (uintptr_t)entry;
> > +
> > + /* Ensure in->data visible before wr_issue++ */
> > + smp_wmb();
> > +
> > + WRITE_ONCE(qp->wr_issue, qp->wr_issue + 1);
> > +
> > + if (!len) {
> > + entry->flags |= DESC_DONE_FLAG;
> > + queue_work(nt->wq, &qp->write_work);
> > + continue;
> > + }
> > +
> > + if (in->flags & LINK_DOWN_FLAG) {
> > + dev_dbg(&qp->ndev->pdev->dev, "link down flag set\n");
> > + entry->flags |= DESC_DONE_FLAG | LINK_DOWN_FLAG;
> > + queue_work(nt->wq, &qp->write_work);
> > + continue;
> > + }
> > +
> > + queue_work(nt->wq, &entry->dma_work);
> > + }
> > +
> > + if (!budget)
> > + tasklet_schedule(&qp->rxc_db_work);
> > +}
> > +
> > +static int ntb_transport_edma_rc_tx_enqueue(struct ntb_transport_qp *qp,
> > + struct ntb_queue_entry *entry)
> > +{
> > + struct device *dma_dev = get_dma_dev(qp->ndev);
> > + struct ntb_transport_ctx *nt = qp->transport;
> > + struct ntb_edma_desc *in, __iomem *out;
> > + unsigned int len = entry->len;
> > + struct dma_chan *chan;
> > + u32 issue, idx, head;
> > + dma_addr_t ep_dst;
> > + int rc;
> > +
> > + WARN_ON_ONCE(entry->flags & DESC_DONE_FLAG);
> > +
> > + scoped_guard(spinlock_irqsave, &qp->rc_lock) {
> > + head = READ_ONCE(*NTB_HEAD_RD_RC_I(qp));
> > + issue = qp->rd_issue;
> > + if (ntb_edma_ring_used_entry(head, issue) == 0) {
> > + qp->tx_ring_full++;
> > + return -ENOSPC;
> > + }
> > +
> > + /*
> > + * ntb_transport_edma_rc_read_complete_work() checks entry->flags
> > + * so it needs to be set before rd_issue++.
> > + */
> > + idx = ntb_edma_ring_idx(issue);
> > + in = NTB_DESC_RD_RC_I(qp, idx);
> > + in->data = (uintptr_t)entry;
> > +
> > + /* Make in->data visible before rd_issue++ */
> > + smp_wmb();
> > +
> > + WRITE_ONCE(qp->rd_issue, qp->rd_issue + 1);
> > + }
> > +
> > + /* Publish the final transfer length to the EP side */
> > + out = NTB_DESC_RD_RC_O(qp, idx);
> > + iowrite32(len, &out->len);
> > + ioread32(&out->len);
> > +
> > + if (unlikely(!len)) {
> > + entry->flags |= DESC_DONE_FLAG;
> > + queue_work(nt->wq, &qp->read_work);
> > + return 0;
> > + }
> > +
> > + /* Paired with dma_wmb() in ntb_transport_edma_ep_rx_enqueue() */
> > + dma_rmb();
> > +
> > + /* kick remote eDMA read transfer */
> > + ep_dst = (dma_addr_t)in->addr;
> > + chan = ntb_edma_pick_chan(&nt->edma, REMOTE_EDMA_READ);
> > + rc = ntb_transport_edma_rc_read_start(dma_dev, chan, len,
> > + entry->buf, ep_dst, entry);
> > + if (rc) {
> > + entry->errors++;
> > + entry->len = -EIO;
> > + entry->flags |= DESC_DONE_FLAG;
> > + queue_work(nt->wq, &qp->read_work);
> > + }
> > + return 0;
> > +}
> > +
> > +static int ntb_transport_edma_ep_tx_enqueue(struct ntb_transport_qp *qp,
> > + struct ntb_queue_entry *entry)
> > +{
> > + struct device *dma_dev = get_dma_dev(qp->ndev);
> > + struct ntb_edma_desc *in, __iomem *out;
> > + unsigned int len = entry->len;
> > + dma_addr_t ep_src = 0;
> > + u32 idx;
> > + int rc;
> > +
> > + if (likely(len)) {
> > + ep_src = dma_map_single(dma_dev, entry->buf, len,
> > + DMA_TO_DEVICE);
> > + rc = dma_mapping_error(dma_dev, ep_src);
> > + if (rc)
> > + return rc;
> > + }
> > +
> > + scoped_guard(spinlock_irqsave, &qp->ep_tx_lock) {
> > + if (ntb_edma_ring_full(qp->wr_prod, qp->wr_cons)) {
> > + rc = -ENOSPC;
> > + qp->tx_ring_full++;
> > + goto out_unmap;
> > + }
> > +
> > + idx = ntb_edma_ring_idx(qp->wr_prod);
> > + in = NTB_DESC_WR_EP_I(qp, idx);
> > + out = NTB_DESC_WR_EP_O(qp, idx);
> > +
> > + WARN_ON(in->flags & DESC_DONE_FLAG);
> > + WARN_ON(entry->flags & DESC_DONE_FLAG);
> > + in->flags = 0;
> > + in->data = (uintptr_t)entry;
> > + entry->addr = ep_src;
> > +
> > + iowrite32(len, &out->len);
> > + iowrite32(entry->flags, &out->flags);
> > + iowrite64(ep_src, &out->addr);
> > + WRITE_ONCE(qp->wr_prod, qp->wr_prod + 1);
> > +
> > + dma_wmb();
> > + iowrite32(qp->wr_prod, NTB_HEAD_WR_EP_O(qp));
> > +
> > + qp->tx_bytes += len;
> > + qp->tx_pkts++;
> > + }
> > +
> > + ntb_transport_edma_notify_peer(qp);
> > +
> > + return 0;
> > +out_unmap:
> > + if (likely(len))
> > + dma_unmap_single(dma_dev, ep_src, len, DMA_TO_DEVICE);
> > + return rc;
> > +}
> > +
> > +static int ntb_transport_edma_tx_enqueue(struct ntb_transport_qp *qp,
> > + struct ntb_queue_entry *entry,
> > + void *cb, void *data, unsigned int len,
> > + unsigned int flags)
> > +{
> > + struct device *dma_dev;
> > +
> > + if (entry->addr) {
> > + /* Deferred unmap */
> > + dma_dev = get_dma_dev(qp->ndev);
> > + dma_unmap_single(dma_dev, entry->addr, entry->len, DMA_TO_DEVICE);
> > + }
> > +
> > + entry->cb_data = cb;
> > + entry->buf = data;
> > + entry->len = len;
> > + entry->flags = flags;
> > + entry->errors = 0;
> > + entry->addr = 0;
> > +
> > + WARN_ON_ONCE(!ntb_qp_edma_enabled(qp));
> > +
> > + if (ntb_qp_edma_is_ep(qp))
> > + return ntb_transport_edma_ep_tx_enqueue(qp, entry);
> > + else
> > + return ntb_transport_edma_rc_tx_enqueue(qp, entry);
> > +}
> > +
> > +static int ntb_transport_edma_ep_rx_enqueue(struct ntb_transport_qp *qp,
> > + struct ntb_queue_entry *entry)
> > +{
> > + struct device *dma_dev = get_dma_dev(qp->ndev);
> > + struct ntb_edma_desc *in, __iomem *out;
> > + unsigned int len = entry->len;
> > + void *data = entry->buf;
> > + dma_addr_t ep_dst;
> > + u32 idx;
> > + int rc;
> > +
> > + ep_dst = dma_map_single(dma_dev, data, len, DMA_FROM_DEVICE);
> > + rc = dma_mapping_error(dma_dev, ep_dst);
> > + if (rc)
> > + return rc;
> > +
> > + scoped_guard(spinlock_bh, &qp->ep_rx_lock) {
> > + if (ntb_edma_ring_full(READ_ONCE(qp->rd_prod),
> > + READ_ONCE(qp->rd_cons))) {
> > + rc = -ENOSPC;
> > + goto out_unmap;
> > + }
> > +
> > + idx = ntb_edma_ring_idx(qp->rd_prod);
> > + in = NTB_DESC_RD_EP_I(qp, idx);
> > + out = NTB_DESC_RD_EP_O(qp, idx);
> > +
> > + iowrite32(len, &out->len);
> > + iowrite64(ep_dst, &out->addr);
> > +
> > + WARN_ON(in->flags & DESC_DONE_FLAG);
> > + in->data = (uintptr_t)entry;
> > + entry->addr = ep_dst;
> > +
> > + /* Ensure len/addr are visible before the head update */
> > + dma_wmb();
> > +
> > + WRITE_ONCE(qp->rd_prod, qp->rd_prod + 1);
> > + iowrite32(qp->rd_prod, NTB_HEAD_RD_EP_O(qp));
> > + }
> > + return 0;
> > +out_unmap:
> > + dma_unmap_single(dma_dev, ep_dst, len, DMA_FROM_DEVICE);
> > + return rc;
> > +}
> > +
> > +static int ntb_transport_edma_rx_enqueue(struct ntb_transport_qp *qp,
> > + struct ntb_queue_entry *entry)
> > +{
> > + int rc;
> > +
> > + /* The behaviour is the same as the default backend for RC side */
> > + if (ntb_qp_edma_is_ep(qp)) {
> > + rc = ntb_transport_edma_ep_rx_enqueue(qp, entry);
> > + if (rc) {
> > + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> > + &qp->rx_free_q);
> > + return rc;
> > + }
> > + }
> > +
> > + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_pend_q);
> > +
> > + if (qp->active)
> > + tasklet_schedule(&qp->rxc_db_work);
> > +
> > + return 0;
> > +}
> > +
> > +static void ntb_transport_edma_rx_poll(struct ntb_transport_qp *qp)
> > +{
> > + struct ntb_transport_ctx *nt = qp->transport;
> > +
> > + if (ntb_qp_edma_is_rc(qp))
> > + ntb_transport_edma_rc_poll(qp);
> > + else if (ntb_qp_edma_is_ep(qp)) {
> > + /*
> > + * Make sure we poll the rings even if an eDMA interrupt is
> > + * cleared on the RC side earlier.
> > + */
> > + queue_work(nt->wq, &qp->read_work);
> > + queue_work(nt->wq, &qp->write_work);
> > + } else
> > + /* Unreachable */
> > + WARN_ON_ONCE(1);
> > +}
> > +
> > +static void ntb_transport_edma_read_work(struct work_struct *work)
> > +{
> > + struct ntb_transport_qp *qp = container_of(
> > + work, struct ntb_transport_qp, read_work);
> > +
> > + if (ntb_qp_edma_is_rc(qp))
> > + ntb_transport_edma_rc_read_complete_work(work);
> > + else if (ntb_qp_edma_is_ep(qp))
> > + ntb_transport_edma_ep_read_work(work);
> > + else
> > + /* Unreachable */
> > + WARN_ON_ONCE(1);
> > +}
> > +
> > +static void ntb_transport_edma_write_work(struct work_struct *work)
> > +{
> > + struct ntb_transport_qp *qp = container_of(
> > + work, struct ntb_transport_qp, write_work);
> > +
> > + if (ntb_qp_edma_is_rc(qp))
> > + ntb_transport_edma_rc_write_complete_work(work);
> > + else if (ntb_qp_edma_is_ep(qp))
> > + ntb_transport_edma_ep_write_work(work);
> > + else
> > + /* Unreachable */
> > + WARN_ON_ONCE(1);
> > +}
> > +
> > +static void ntb_transport_edma_init_queue(struct ntb_transport_ctx *nt,
> > + unsigned int qp_num)
> > +{
> > + struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
> > +
> > + qp->wr_cons = 0;
> > + qp->rd_cons = 0;
> > + qp->wr_prod = 0;
> > + qp->rd_prod = 0;
> > + qp->wr_issue = 0;
> > + qp->rd_issue = 0;
> > +
> > + INIT_WORK(&qp->db_work, ntb_transport_edma_db_work);
> > + INIT_WORK(&qp->read_work, ntb_transport_edma_read_work);
> > + INIT_WORK(&qp->write_work, ntb_transport_edma_write_work);
> > +}
> > +
> > +static void ntb_transport_edma_create_queue(struct ntb_transport_ctx *nt,
> > + struct ntb_transport_qp *qp)
> > +{
> > + spin_lock_init(&qp->ep_tx_lock);
> > + spin_lock_init(&qp->ep_rx_lock);
> > + spin_lock_init(&qp->rc_lock);
> > +}
> > +
> > +static const struct ntb_transport_backend_ops edma_backend_ops = {
> > + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> > + .tx_free_entry = ntb_transport_edma_tx_free_entry,
> > + .tx_enqueue = ntb_transport_edma_tx_enqueue,
> > + .rx_enqueue = ntb_transport_edma_rx_enqueue,
> > + .rx_poll = ntb_transport_edma_rx_poll,
> > + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> > +};
> > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > +
> > /**
> > * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> > * @qp: NTB transport layer queue to be enabled
>
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-02 6:35 ` Koichiro Den
@ 2025-12-02 9:32 ` Niklas Cassel
2025-12-02 15:20 ` Frank Li
2025-12-03 8:40 ` Koichiro Den
0 siblings, 2 replies; 97+ messages in thread
From: Niklas Cassel @ 2025-12-02 9:32 UTC (permalink / raw)
To: Koichiro Den
Cc: Frank Li, ntb, linux-pci, dmaengine, linux-kernel, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring, Damien Le Moal
Hello Koichiro,
On Tue, Dec 02, 2025 at 03:35:36PM +0900, Koichiro Den wrote:
> On Mon, Dec 01, 2025 at 03:41:38PM -0500, Frank Li wrote:
> > On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > > dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> > > for the MSI target address on every interrupt and tears it down again
> > > via dw_pcie_ep_unmap_addr().
> > >
> > > On systems that heavily use the AXI bridge interface (for example when
> > > the integrated eDMA engine is active), this means the outbound iATU
> > > registers are updated while traffic is in flight. The DesignWare
> > > endpoint spec warns that updating iATU registers in this situation is
> > > not supported, and the behavior is undefined.
> > >
> > > Under high MSI and eDMA load this pattern results in occasional bogus
> > > outbound transactions and IOMMU faults such as:
> > >
> > > ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
> > >
> >
> > I agree there is no need to map/unmap the MSI window on every interrupt. But I
> > think there is a logic problem behind this. An IOMMU fault report means the page
> > table entry has already been removed, yet something still accesses it afterwards.
> > You'd better find out where the MSI memory is accessed after dw_pcie_ep_unmap_addr().
>
> I don't see any other callers that access the MSI region after
> dw_pcie_ep_unmap_addr(), but I might be missing something. Also, even if I
> serialize dw_pcie_ep_raise_msi_irq() invocations, the problem still
> appears.
>
> A couple of details I forgot to describe in the commit message:
> (1). The IOMMU error is only reported on the RC side.
> (2). Sometimes there is no IOMMU error printed and the board just freezes (becomes unresponsive).
>
> The faulting iova is 0xfe000000. The iova 0xfe000000 is the base of
> "addr_space" for R-Car S4 in EP mode:
> https://github.com/jonmason/ntb/blob/68113d260674/arch/arm64/boot/dts/renesas/r8a779f0.dtsi#L847
>
> So it looks like the EP sometimes issues MWr at "addr_space" base (offset 0),
> the RC forwards it to its IOMMU (IPMMUHC) and that faults. My working theory
> is that when the iATU registers are updated under heavy DMA load, the DAR of
> some in-flight transfer can get corrupted to 0xfe000000. That would match one
> possible symptom of the undefined behaviour that the DW EPC spec warns about
> when changing iATU configuration under load.
For your information, in the NVMe PCI EPF driver:
https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L389-L429
We take a mutex around the dmaengine_slave_config() and dma_sync_wait() calls.
Without the mutex, with multiple outstanding transfers, we noticed that a
dmaengine_slave_config() call (which specifies the src/dst address) would affect
other concurrent DMA transfers, leading to corruption because of invalid src/dst
addresses.
Having a mutex so that we can only have one outstanding transfer solves these
issues, but is obviously very bad for performance.
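Roughly, that serialized path looks like the sketch below (not the exact
pci-epf code; the helper name and error handling are simplified for
illustration):

static int epf_dma_xfer_serialized(struct mutex *lock, struct dma_chan *chan,
                                   dma_addr_t dma_addr, dma_addr_t pci_addr,
                                   size_t len, enum dma_transfer_direction dir)
{
        struct dma_async_tx_descriptor *txd;
        struct dma_slave_config cfg = {};
        dma_cookie_t cookie;
        int ret;

        /* One transfer at a time: slave_config carries the PCI address */
        mutex_lock(lock);

        cfg.direction = dir;
        if (dir == DMA_MEM_TO_DEV)
                cfg.dst_addr = pci_addr;
        else
                cfg.src_addr = pci_addr;
        ret = dmaengine_slave_config(chan, &cfg);
        if (ret)
                goto out;

        txd = dmaengine_prep_slave_single(chan, dma_addr, len, dir,
                                          DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
        if (!txd) {
                ret = -EIO;
                goto out;
        }

        cookie = dmaengine_submit(txd);
        ret = dma_submit_error(cookie);
        if (ret)
                goto out;

        dma_async_issue_pending(chan);
        /* Block until this single transfer completes */
        if (dma_sync_wait(chan, cookie) != DMA_COMPLETE)
                ret = -EIO;
out:
        mutex_unlock(lock);
        return ret;
}

The mutex is what limits us to a single outstanding transfer per channel.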
I did try to add DMA_MEMCPY support to the dw-edma driver:
https://lore.kernel.org/linux-pci/20241217160448.199310-4-cassel@kernel.org/
That would allow us to specify both the src and dst address in a single
dmaengine function call (so that we would no longer need a mutex).
However, because the eDMA hardware (at least for EDMA_LEGACY_UNROLL) does not
support PCI-to-PCI transfers, only PCI to local DDR or local DDR to PCI,
using prep_memcpy() is wrong, as it does not take a direction:
https://lore.kernel.org/linux-pci/Z4jf2s5SaUu3wdJi@ryzen/
If we want to improve the dw-edma driver so that an EPF driver can have
multiple outstanding transfers, I think the best way forward would be to create
a new _prep_slave_memcpy() or similar that does take a direction, and thus does
not require dmaengine_slave_config() to be called before every
_prep_slave_memcpy() call.
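For reference, one possible shape (purely hypothetical, this helper does not
exist in dmaengine today):

/*
 * Hypothetical helper: both addresses and the direction travel with the
 * descriptor, so no per-transfer dmaengine_slave_config() is required and
 * multiple transfers can be outstanding on the same channel.
 */
struct dma_async_tx_descriptor *
dmaengine_prep_slave_memcpy(struct dma_chan *chan, dma_addr_t dst,
                            dma_addr_t src, size_t len,
                            enum dma_transfer_direction dir,
                            unsigned long flags);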
Kind regards,
Niklas
* Re: [RFC PATCH v2 10/27] NTB: core: Add .get_pci_epc() to ntb_dev_ops
2025-12-02 6:32 ` Koichiro Den
@ 2025-12-02 14:49 ` Dave Jiang
2025-12-03 15:02 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Dave Jiang @ 2025-12-02 14:49 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On 12/1/25 11:32 PM, Koichiro Den wrote:
> On Mon, Dec 01, 2025 at 02:08:14PM -0700, Dave Jiang wrote:
>>
>>
>> On 11/29/25 9:03 AM, Koichiro Den wrote:
>>> Add an optional get_pci_epc() callback to retrieve the underlying
>>> pci_epc device associated with the NTB implementation.
>>>
>>> Signed-off-by: Koichiro Den <den@valinux.co.jp>
>>> ---
>>> drivers/ntb/hw/epf/ntb_hw_epf.c | 11 +----------
>>> include/linux/ntb.h | 21 +++++++++++++++++++++
>>> 2 files changed, 22 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
>>> index a3ec411bfe49..d55ce6b0fad4 100644
>>> --- a/drivers/ntb/hw/epf/ntb_hw_epf.c
>>> +++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
>>> @@ -9,6 +9,7 @@
>>> #include <linux/delay.h>
>>> #include <linux/module.h>
>>> #include <linux/pci.h>
>>> +#include <linux/pci-epf.h>
>>> #include <linux/slab.h>
>>> #include <linux/ntb.h>
>>>
>>> @@ -49,16 +50,6 @@
>>>
>>> #define NTB_EPF_COMMAND_TIMEOUT 1000 /* 1 Sec */
>>>
>>> -enum pci_barno {
>>> - NO_BAR = -1,
>>> - BAR_0,
>>> - BAR_1,
>>> - BAR_2,
>>> - BAR_3,
>>> - BAR_4,
>>> - BAR_5,
>>> -};
>>> -
>>> enum epf_ntb_bar {
>>> BAR_CONFIG,
>>> BAR_PEER_SPAD,
>>> diff --git a/include/linux/ntb.h b/include/linux/ntb.h
>>> index d7ce5d2e60d0..04dc9a4d6b85 100644
>>> --- a/include/linux/ntb.h
>>> +++ b/include/linux/ntb.h
>>> @@ -64,6 +64,7 @@ struct ntb_client;
>>> struct ntb_dev;
>>> struct ntb_msi;
>>> struct pci_dev;
>>> +struct pci_epc;
>>>
>>> /**
>>> * enum ntb_topo - NTB connection topology
>>> @@ -256,6 +257,7 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops *ops)
>>> * @msg_clear_mask: See ntb_msg_clear_mask().
>>> * @msg_read: See ntb_msg_read().
>>> * @peer_msg_write: See ntb_peer_msg_write().
>>> + * @get_pci_epc: See ntb_get_pci_epc().
>>> */
>>> struct ntb_dev_ops {
>>> int (*port_number)(struct ntb_dev *ntb);
>>> @@ -331,6 +333,7 @@ struct ntb_dev_ops {
>>> int (*msg_clear_mask)(struct ntb_dev *ntb, u64 mask_bits);
>>> u32 (*msg_read)(struct ntb_dev *ntb, int *pidx, int midx);
>>> int (*peer_msg_write)(struct ntb_dev *ntb, int pidx, int midx, u32 msg);
>>> + struct pci_epc *(*get_pci_epc)(struct ntb_dev *ntb);
>>
>> This seems like a call that is very specific to this particular hardware rather than something generic for the NTB dev ops. Maybe it should be something like get_private_data() or similar?
>
> Thank you for the suggestion.
>
> I also felt that it's too specific, but I couldn't come up with a clean
> generic interface at the time, so I left it in this form.
>
> .get_private_data() might indeed be better. In the callback doc comment we
> could describe it as "may be used to obtain a backing PCI controller
> pointer"?
I would add that comment in the defined callback for the hardware driver. For the actual API, it would be something like "for retrieving private data specific to the hardware driver"?
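Something along these lines is what I had in mind (just a sketch; the
get_private_data() name and the void * return type are placeholders, not a
final interface):

struct ntb_dev_ops {
        /* ... existing callbacks ... */
        void *(*get_private_data)(struct ntb_dev *ntb);
};

/**
 * ntb_get_private_data() - retrieve private data specific to the hardware driver
 * @ntb: NTB device context.
 *
 * Return: an implementation-defined pointer (e.g. a backing pci_epc), or
 * %NULL if the hardware driver does not provide the callback.
 */
static inline void *ntb_get_private_data(struct ntb_dev *ntb)
{
        if (!ntb->ops->get_private_data)
                return NULL;
        return ntb->ops->get_private_data(ntb);
}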
DJ
>
> -Koichiro
>
>>
>> DJ
>>
>>
>>> };
>>>
>>> static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
>>> @@ -393,6 +396,9 @@ static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
>>> /* !ops->msg_clear_mask == !ops->msg_count && */
>>> !ops->msg_read == !ops->msg_count &&
>>> !ops->peer_msg_write == !ops->msg_count &&
>>> +
>>> + /* Miscellaneous optional callbacks */
>>> + /* ops->get_pci_epc && */
>>> 1;
>>> }
>>>
>>> @@ -1567,6 +1573,21 @@ static inline int ntb_peer_msg_write(struct ntb_dev *ntb, int pidx, int midx,
>>> return ntb->ops->peer_msg_write(ntb, pidx, midx, msg);
>>> }
>>>
>>> +/**
>>> + * ntb_get_pci_epc() - get backing PCI endpoint controller if possible.
>>> + * @ntb: NTB device context.
>>> + *
>>> + * Get the backing PCI endpoint controller representation.
>>> + *
>>> + * Return: A pointer to the pci_epc instance if available, or %NULL if not.
>>> + */
>>> +static inline struct pci_epc __maybe_unused *ntb_get_pci_epc(struct ntb_dev *ntb)
>>> +{
>>> + if (!ntb->ops->get_pci_epc)
>>> + return NULL;
>>> + return ntb->ops->get_pci_epc(ntb);
>>> +}
>>> +
>>> /**
>>> * ntb_peer_resource_idx() - get a resource index for a given peer idx
>>> * @ntb: NTB device context.
>>
* Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-12-02 6:59 ` Koichiro Den
@ 2025-12-02 14:53 ` Dave Jiang
2025-12-03 14:19 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Dave Jiang @ 2025-12-02 14:53 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On 12/1/25 11:59 PM, Koichiro Den wrote:
> On Mon, Dec 01, 2025 at 02:46:41PM -0700, Dave Jiang wrote:
>>
>>
>> On 11/29/25 9:03 AM, Koichiro Den wrote:
>>> Add a new transport backend that uses a remote DesignWare eDMA engine
>>> located on the NTB endpoint to move data between host and endpoint.
>>>
>>> In this mode:
>>>
>>> - The endpoint exposes a dedicated memory window that contains the
>>> eDMA register block followed by a small control structure (struct
>>> ntb_edma_info) and per-channel linked-list (LL) rings.
>>>
>>> - On the endpoint side, ntb_edma_setup_mws() allocates the control
>>> structure and LL rings in endpoint memory, then programs an inbound
>>> iATU region so that the host can access them via a peer MW.
>>>
>>> - On the host side, ntb_edma_setup_peer() ioremaps the peer MW, reads
>>> ntb_edma_info and configures a dw-edma DMA device to use the LL
>>> rings provided by the endpoint.
>>>
>>> - ntb_transport is extended with a new backend_ops implementation that
>>> routes TX and RX enqueue/poll operations through the remote eDMA
>>> rings while keeping the existing shared-memory backend intact.
>>>
>>> - The host signals the endpoint via a dedicated DMA read channel.
>>> 'use_msi' module option is ignored when 'use_remote_edma=1'.
>>>
>>> The new mode is guarded by a Kconfig option (NTB_TRANSPORT_EDMA) and a
>>> module parameter (use_remote_edma). When disabled, the existing
>>> ntb_transport behaviour is unchanged.
>>>
>>> Signed-off-by: Koichiro Den <den@valinux.co.jp>
>>> ---
>>> drivers/ntb/Kconfig | 11 +
>>> drivers/ntb/Makefile | 3 +
>>> drivers/ntb/ntb_edma.c | 628 ++++++++
>>> drivers/ntb/ntb_edma.h | 128 ++
>>
>> I briefly looked over the code. It feels like the eDMA bits should go in drivers/ntb/hw/ rather than drivers/ntb/, given they are pretty specific to the DesignWare hardware. What sits in drivers/ntb should be generic APIs that a different vendor can use without having to adapt to DesignWare hardware specifics. So maybe a bit more abstraction is needed?
>
> That makes sense, I'll reorganize things. Thank you for the suggestion.
Also, since a new transport is being introduced, please update Documentation/driver-api/ntb.rst. While the current documentation doesn't provide adequate API documentation for the ntb_transport APIs, hopefully the new transport can do better going forward. :) Thank you!
DJ
>
>>
>>> .../{ntb_transport.c => ntb_transport_core.c} | 1281 ++++++++++++++++-
>>> 5 files changed, 2048 insertions(+), 3 deletions(-)
>>> create mode 100644 drivers/ntb/ntb_edma.c
>>> create mode 100644 drivers/ntb/ntb_edma.h
>>> rename drivers/ntb/{ntb_transport.c => ntb_transport_core.c} (65%)
>>>
>>> diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
>>> index df16c755b4da..db63f02bb116 100644
>>> --- a/drivers/ntb/Kconfig
>>> +++ b/drivers/ntb/Kconfig
>>> @@ -37,4 +37,15 @@ config NTB_TRANSPORT
>>>
>>> If unsure, say N.
>>>
>>> +config NTB_TRANSPORT_EDMA
>>> + bool "NTB Transport backed by remote eDMA"
>>> + depends on NTB_TRANSPORT
>>> + depends on PCI
>>> + select DMA_ENGINE
>>> + help
>>> + Enable a transport backend that uses a remote DesignWare eDMA engine
>>> + exposed through a dedicated NTB memory window. The host uses the
>>> + endpoint's eDMA engine to move data in both directions.
>>> + Say Y here if you intend to use the 'use_remote_edma' module parameter.
>>> +
>>> endif # NTB
>>> diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
>>> index 3a6fa181ff99..51f0e1e3aec7 100644
>>> --- a/drivers/ntb/Makefile
>>> +++ b/drivers/ntb/Makefile
>>> @@ -4,3 +4,6 @@ obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
>>>
>>> ntb-y := core.o
>>> ntb-$(CONFIG_NTB_MSI) += msi.o
>>> +
>>> +ntb_transport-y := ntb_transport_core.o
>>> +ntb_transport-$(CONFIG_NTB_TRANSPORT_EDMA) += ntb_edma.o
>>> diff --git a/drivers/ntb/ntb_edma.c b/drivers/ntb/ntb_edma.c
>>> new file mode 100644
>>> index 000000000000..cb35e0d56aa8
>>> --- /dev/null
>>> +++ b/drivers/ntb/ntb_edma.c
>>> @@ -0,0 +1,628 @@
>>> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
>>> +
>>> +#include <linux/module.h>
>>> +#include <linux/device.h>
>>> +#include <linux/pci.h>
>>> +#include <linux/ntb.h>
>>> +#include <linux/io.h>
>>> +#include <linux/iommu.h>
>>> +#include <linux/dmaengine.h>
>>> +#include <linux/pci-epc.h>
>>> +#include <linux/dma/edma.h>
>>> +#include <linux/irq.h>
>>> +#include <linux/irqdomain.h>
>>> +#include <linux/of.h>
>>> +#include <linux/of_irq.h>
>>> +#include <dt-bindings/interrupt-controller/arm-gic.h>
>>> +
>>> +#include "ntb_edma.h"
>>> +
>>> +/*
>>> + * The interrupt register offsets below are taken from the DesignWare
>>> + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
>>> + * backend currently only supports this layout.
>>> + */
>>> +#define DMA_WRITE_INT_STATUS_OFF 0x4c
>>> +#define DMA_WRITE_INT_MASK_OFF 0x54
>>> +#define DMA_WRITE_INT_CLEAR_OFF 0x58
>>> +#define DMA_READ_INT_STATUS_OFF 0xa0
>>> +#define DMA_READ_INT_MASK_OFF 0xa8
>>> +#define DMA_READ_INT_CLEAR_OFF 0xac
>>> +
>>> +#define NTB_EDMA_NOTIFY_MAX_QP 64
>>> +
>>> +static unsigned int edma_spi = 417; /* 0x1a1 */
>>> +module_param(edma_spi, uint, 0644);
>>> +MODULE_PARM_DESC(edma_spi, "SPI number used by remote eDMA interrupt (EP local)");
>>> +
>>> +static u64 edma_regs_phys = 0xe65d5000;
>>> +module_param(edma_regs_phys, ullong, 0644);
>>> +MODULE_PARM_DESC(edma_regs_phys, "Physical base address of local eDMA registers (EP)");
>>> +
>>> +static unsigned long edma_regs_size = 0x1200;
>>> +module_param(edma_regs_size, ulong, 0644);
>>> +MODULE_PARM_DESC(edma_regs_size, "Size of the local eDMA register space (EP)");
>>> +
>>> +struct ntb_edma_intr {
>>> + u32 db[NTB_EDMA_NOTIFY_MAX_QP];
>>> +};
>>> +
>>> +struct ntb_edma_ctx {
>>> + void *ll_wr_virt[EDMA_WR_CH_NUM];
>>> + dma_addr_t ll_wr_phys[EDMA_WR_CH_NUM];
>>> + void *ll_rd_virt[EDMA_RD_CH_NUM + 1];
>>> + dma_addr_t ll_rd_phys[EDMA_RD_CH_NUM + 1];
>>> +
>>> + struct ntb_edma_intr *intr_ep_virt;
>>> + dma_addr_t intr_ep_phys;
>>> + struct ntb_edma_intr *intr_rc_virt;
>>> + dma_addr_t intr_rc_phys;
>>> + u32 notify_qp_max;
>>> +
>>> + bool initialized;
>>> +};
>>> +
>>> +static struct ntb_edma_ctx edma_ctx;
>>> +
>>> +typedef void (*ntb_edma_interrupt_cb_t)(void *data, int qp_num);
>>> +
>>> +struct ntb_edma_interrupt {
>>> + int virq;
>>> + void __iomem *base;
>>> + ntb_edma_interrupt_cb_t cb;
>>> + void *data;
>>> +};
>>> +
>>> +static struct ntb_edma_interrupt ntb_edma_intr;
>>> +
>>> +static int ntb_edma_map_spi_to_virq(struct device *dev, unsigned int spi)
>>> +{
>>> + struct device_node *np = dev_of_node(dev);
>>> + struct device_node *parent;
>>> + struct irq_fwspec fwspec = { 0 };
>>> + int virq;
>>> +
>>> + parent = of_irq_find_parent(np);
>>> + if (!parent)
>>> + return -ENODEV;
>>> +
>>> + fwspec.fwnode = of_fwnode_handle(parent);
>>> + fwspec.param_count = 3;
>>> + fwspec.param[0] = GIC_SPI;
>>> + fwspec.param[1] = spi;
>>> + fwspec.param[2] = IRQ_TYPE_LEVEL_HIGH;
>>> +
>>> + virq = irq_create_fwspec_mapping(&fwspec);
>>> + of_node_put(parent);
>>> + return (virq > 0) ? virq : -EINVAL;
>>> +}
>>> +
>>> +static irqreturn_t ntb_edma_isr(int irq, void *data)
>>> +{
>>> + struct ntb_edma_interrupt *v = data;
>>> + u32 mask = BIT(EDMA_RD_CH_NUM);
>>> + u32 i, val;
>>> +
>>> + /*
>>> + * We do not ack interrupts here but instead we mask all local interrupt
>>> + * sources except the read channel used for notification. This reduces
>>> + * needless ISR invocations.
>>> + *
>>> + * In theory we could configure LIE=1/RIE=0 only for the notification
>>> + * transfer (keeping all other channels at LIE=1/RIE=1), but that would
>>> + * require intrusive changes to the dw-edma core.
>>> + *
>>> + * Note: The host side may have already cleared the read interrupt used
>>> + * for notification, so reading DMA_READ_INT_CLEAR_OFF is not a reliable
>>> + * way to detect it. As a result, we cannot reliably tell which specific
>>> + * channel triggered this interrupt. intr_ep_virt->db[i] teaches us
>>> + * instead.
>>> + */
>>> + iowrite32(~0x0, v->base + DMA_WRITE_INT_MASK_OFF);
>>> + iowrite32(~mask, v->base + DMA_READ_INT_MASK_OFF);
>>> +
>>> + if (!v->cb || !edma_ctx.intr_ep_virt)
>>> + return IRQ_HANDLED;
>>> +
>>> + for (i = 0; i < edma_ctx.notify_qp_max; i++) {
>>> + val = READ_ONCE(edma_ctx.intr_ep_virt->db[i]);
>>> + if (!val)
>>> + continue;
>>> +
>>> + WRITE_ONCE(edma_ctx.intr_ep_virt->db[i], 0);
>>> + v->cb(v->data, i);
>>> + }
>>> +
>>> + return IRQ_HANDLED;
>>> +}
>>> +
>>> +int ntb_edma_setup_isr(struct device *dev, struct device *epc_dev,
>>> + ntb_edma_interrupt_cb_t cb, void *data)
>>> +{
>>> + struct ntb_edma_interrupt *v = &ntb_edma_intr;
>>> + int virq = ntb_edma_map_spi_to_virq(epc_dev->parent, edma_spi);
>>> + int ret;
>>> +
>>> + if (virq < 0) {
>>> + dev_err(dev, "failed to get virq (%d)\n", virq);
>>> + return virq;
>>> + }
>>> +
>>> + v->virq = virq;
>>> + v->cb = cb;
>>> + v->data = data;
>>> + if (edma_regs_phys && !v->base)
>>> + v->base = devm_ioremap(dev, edma_regs_phys, edma_regs_size);
>>> + if (!v->base) {
>>> + dev_err(dev, "failed to setup v->base\n");
>>> + return -1;
>>> + }
>>> + ret = devm_request_irq(dev, v->virq, ntb_edma_isr, 0, "ntb-edma", v);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + if (v->base) {
>>> + iowrite32(0x0, v->base + DMA_WRITE_INT_MASK_OFF);
>>> + iowrite32(0x0, v->base + DMA_READ_INT_MASK_OFF);
>>> + }
>>> + return 0;
>>> +}
>>> +
>>> +void ntb_edma_teardown_isr(struct device *dev)
>>> +{
>>> + struct ntb_edma_interrupt *v = &ntb_edma_intr;
>>> +
>>> + /* Mask all write/read interrupts so we don't get called again. */
>>> + if (v->base) {
>>> + iowrite32(~0x0, v->base + DMA_WRITE_INT_MASK_OFF);
>>> + iowrite32(~0x0, v->base + DMA_READ_INT_MASK_OFF);
>>> + }
>>> +
>>> + if (v->virq > 0)
>>> + devm_free_irq(dev, v->virq, v);
>>> +
>>> + if (v->base)
>>> + devm_iounmap(dev, v->base);
>>> +
>>> + v->virq = 0;
>>> + v->cb = NULL;
>>> + v->data = NULL;
>>> +}
>>> +
>>> +int ntb_edma_setup_mws(struct ntb_dev *ndev)
>>> +{
>>> + const size_t info_bytes = PAGE_SIZE;
>>> + resource_size_t size_max, offset;
>>> + dma_addr_t intr_phys, info_phys;
>>> + u32 wr_done = 0, rd_done = 0;
>>> + struct ntb_edma_intr *intr;
>>> + struct ntb_edma_info *info;
>>> + int peer_mw, mw_index, rc;
>>> + struct iommu_domain *dom;
>>> + bool reg_mapped = false;
>>> + size_t ll_bytes, size;
>>> + struct pci_epc *epc;
>>> + struct device *dev;
>>> + unsigned long iova;
>>> + phys_addr_t phys;
>>> + u64 need;
>>> + u32 i;
>>> +
>>> + /* +1 is for interruption */
>>> + ll_bytes = (EDMA_WR_CH_NUM + EDMA_RD_CH_NUM + 1) * DMA_LLP_MEM_SIZE;
>>> + need = EDMA_REG_SIZE + info_bytes + ll_bytes;
>>> +
>>> + epc = ntb_get_pci_epc(ndev);
>>> + if (!epc)
>>> + return -ENODEV;
>>> + dev = epc->dev.parent;
>>> +
>>> + if (edma_ctx.initialized)
>>> + return 0;
>>> +
>>> + info = dma_alloc_coherent(dev, info_bytes, &info_phys, GFP_KERNEL);
>>> + if (!info)
>>> + return -ENOMEM;
>>> +
>>> + memset(info, 0, info_bytes);
>>> + info->magic = NTB_EDMA_INFO_MAGIC;
>>> + info->wr_cnt = EDMA_WR_CH_NUM;
>>> + info->rd_cnt = EDMA_RD_CH_NUM + 1; /* +1 for interruption */
>>> + info->regs_phys = edma_regs_phys;
>>> + info->ll_stride = DMA_LLP_MEM_SIZE;
>>> +
>>> + for (i = 0; i < EDMA_WR_CH_NUM; i++) {
>>> + edma_ctx.ll_wr_virt[i] = dma_alloc_attrs(dev, DMA_LLP_MEM_SIZE,
>>> + &edma_ctx.ll_wr_phys[i],
>>> + GFP_KERNEL,
>>> + DMA_ATTR_FORCE_CONTIGUOUS);
>>> + if (!edma_ctx.ll_wr_virt[i]) {
>>> + rc = -ENOMEM;
>>> + goto err_free_ll;
>>> + }
>>> + wr_done++;
>>> + info->ll_wr_phys[i] = edma_ctx.ll_wr_phys[i];
>>> + }
>>> + for (i = 0; i < EDMA_RD_CH_NUM + 1; i++) {
>>> + edma_ctx.ll_rd_virt[i] = dma_alloc_attrs(dev, DMA_LLP_MEM_SIZE,
>>> + &edma_ctx.ll_rd_phys[i],
>>> + GFP_KERNEL,
>>> + DMA_ATTR_FORCE_CONTIGUOUS);
>>> + if (!edma_ctx.ll_rd_virt[i]) {
>>> + rc = -ENOMEM;
>>> + goto err_free_ll;
>>> + }
>>> + rd_done++;
>>> + info->ll_rd_phys[i] = edma_ctx.ll_rd_phys[i];
>>> + }
>>> +
>>> + /* For interruption */
>>> + edma_ctx.notify_qp_max = NTB_EDMA_NOTIFY_MAX_QP;
>>> + intr = dma_alloc_coherent(dev, sizeof(*intr), &intr_phys, GFP_KERNEL);
>>> + if (!intr) {
>>> + rc = -ENOMEM;
>>> + goto err_free_ll;
>>> + }
>>> + memset(intr, 0, sizeof(*intr));
>>> + edma_ctx.intr_ep_virt = intr;
>>> + edma_ctx.intr_ep_phys = intr_phys;
>>> + info->intr_dar_base = intr_phys;
>>> +
>>> + peer_mw = ntb_peer_mw_count(ndev);
>>> + if (peer_mw <= 0) {
>>> + rc = -ENODEV;
>>> + goto err_free_ll;
>>> + }
>>> +
>>> + mw_index = peer_mw - 1; /* last MW */
>>> +
>>> + rc = ntb_mw_get_align(ndev, 0, mw_index, 0, NULL, &size_max,
>>> + &offset);
>>> + if (rc)
>>> + goto err_free_ll;
>>> +
>>> + if (size_max < need) {
>>> + rc = -ENOSPC;
>>> + goto err_free_ll;
>>> + }
>>> +
>>> + /* Map register space (direct) */
>>> + dom = iommu_get_domain_for_dev(dev);
>>> + if (dom) {
>>> + phys = edma_regs_phys & PAGE_MASK;
>>> + size = PAGE_ALIGN(EDMA_REG_SIZE + edma_regs_phys - phys);
>>> + iova = phys;
>>> +
>>> + rc = iommu_map(dom, iova, phys, EDMA_REG_SIZE,
>>> + IOMMU_READ | IOMMU_WRITE | IOMMU_MMIO, GFP_KERNEL);
>>> + if (rc)
>>> + dev_err(&ndev->dev, "failed to create direct mapping for eDMA reg space\n");
>>> + reg_mapped = true;
>>> + }
>>> +
>>> + rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_regs_phys, EDMA_REG_SIZE, offset);
>>> + if (rc)
>>> + goto err_unmap_reg;
>>> +
>>> + offset += EDMA_REG_SIZE;
>>> +
>>> + /* Map ntb_edma_info */
>>> + rc = ntb_mw_set_trans(ndev, 0, mw_index, info_phys, info_bytes, offset);
>>> + if (rc)
>>> + goto err_clear_trans;
>>> + offset += info_bytes;
>>> +
>>> + /* Map LL location */
>>> + for (i = 0; i < EDMA_WR_CH_NUM; i++) {
>>> + rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_ctx.ll_wr_phys[i],
>>> + DMA_LLP_MEM_SIZE, offset);
>>> + if (rc)
>>> + goto err_clear_trans;
>>> + offset += DMA_LLP_MEM_SIZE;
>>> + }
>>> + for (i = 0; i < EDMA_RD_CH_NUM + 1; i++) {
>>> + rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_ctx.ll_rd_phys[i],
>>> + DMA_LLP_MEM_SIZE, offset);
>>> + if (rc)
>>> + goto err_clear_trans;
>>> + offset += DMA_LLP_MEM_SIZE;
>>> + }
>>> + edma_ctx.initialized = true;
>>> +
>>> + return 0;
>>> +
>>> +err_clear_trans:
>>> + /*
>>> + * Tear down the NTB translation window used for the eDMA MW.
>>> + * There is no sub-range clear API for ntb_mw_set_trans(), so we
>>> + * unconditionally drop the whole mapping on error.
>>> + */
>>> + ntb_mw_clear_trans(ndev, 0, mw_index);
>>> +
>>> +err_unmap_reg:
>>> + if (reg_mapped)
>>> + iommu_unmap(dom, iova, size);
>>> +err_free_ll:
>>> + while (rd_done--)
>>> + dma_free_attrs(dev, DMA_LLP_MEM_SIZE,
>>> + edma_ctx.ll_rd_virt[rd_done],
>>> + edma_ctx.ll_rd_phys[rd_done],
>>> + DMA_ATTR_FORCE_CONTIGUOUS);
>>> + while (wr_done--)
>>> + dma_free_attrs(dev, DMA_LLP_MEM_SIZE,
>>> + edma_ctx.ll_wr_virt[wr_done],
>>> + edma_ctx.ll_wr_phys[wr_done],
>>> + DMA_ATTR_FORCE_CONTIGUOUS);
>>> + if (edma_ctx.intr_ep_virt)
>>> + dma_free_coherent(dev, sizeof(struct ntb_edma_intr),
>>> + edma_ctx.intr_ep_virt,
>>> + edma_ctx.intr_ep_phys);
>>> + dma_free_coherent(dev, info_bytes, info, info_phys);
>>> + return rc;
>>> +}
>>> +
>>> +static int ntb_edma_irq_vector(struct device *dev, unsigned int nr)
>>> +{
>>> + struct pci_dev *pdev = to_pci_dev(dev);
>>> + int ret, nvec;
>>> +
>>> + nvec = pci_msi_vec_count(pdev);
>>> + for (; nr < nvec; nr++) {
>>> + ret = pci_irq_vector(pdev, nr);
>>> + if (!irq_has_action(ret))
>>> + return ret;
>>> + }
>>> + return 0;
>>> +}
>>> +
>>> +static const struct dw_edma_plat_ops ntb_edma_ops = {
>>> + .irq_vector = ntb_edma_irq_vector,
>>> +};
>>> +
>>> +int ntb_edma_setup_peer(struct ntb_dev *ndev)
>>> +{
>>> + struct ntb_edma_info *info;
>>> + unsigned int wr_cnt, rd_cnt;
>>> + struct dw_edma_chip *chip;
>>> + void __iomem *edma_virt;
>>> + phys_addr_t edma_phys;
>>> + resource_size_t mw_size;
>>> + u64 off = EDMA_REG_SIZE;
>>> + int peer_mw, mw_index;
>>> + unsigned int i;
>>> + int ret;
>>> +
>>> + peer_mw = ntb_peer_mw_count(ndev);
>>> + if (peer_mw <= 0)
>>> + return -ENODEV;
>>> +
>>> + mw_index = peer_mw - 1; /* last MW */
>>> +
>>> + ret = ntb_peer_mw_get_addr(ndev, mw_index, &edma_phys,
>>> + &mw_size);
>>> + if (ret)
>>> + return -1;
>>> +
>>> + edma_virt = ioremap(edma_phys, mw_size);
>>> +
>>> + chip = devm_kzalloc(&ndev->dev, sizeof(*chip), GFP_KERNEL);
>>> + if (!chip) {
>>> + ret = -ENOMEM;
>>> + return ret;
>>> + }
>>> +
>>> + chip->dev = &ndev->pdev->dev;
>>> + chip->nr_irqs = 4;
>>> + chip->ops = &ntb_edma_ops;
>>> + chip->flags = 0;
>>> + chip->reg_base = edma_virt;
>>> + chip->mf = EDMA_MF_EDMA_UNROLL;
>>> +
>>> + info = edma_virt + off;
>>> + if (info->magic != NTB_EDMA_INFO_MAGIC)
>>> + return -EINVAL;
>>> + wr_cnt = info->wr_cnt;
>>> + rd_cnt = info->rd_cnt;
>>> + chip->ll_wr_cnt = wr_cnt;
>>> + chip->ll_rd_cnt = rd_cnt;
>>> + off += PAGE_SIZE;
>>> +
>>> + edma_ctx.notify_qp_max = NTB_EDMA_NOTIFY_MAX_QP;
>>> + edma_ctx.intr_ep_phys = info->intr_dar_base;
>>> + if (edma_ctx.intr_ep_phys) {
>>> + edma_ctx.intr_rc_virt =
>>> + dma_alloc_coherent(&ndev->pdev->dev,
>>> + sizeof(struct ntb_edma_intr),
>>> + &edma_ctx.intr_rc_phys,
>>> + GFP_KERNEL);
>>> + if (!edma_ctx.intr_rc_virt)
>>> + return -ENOMEM;
>>> + memset(edma_ctx.intr_rc_virt, 0,
>>> + sizeof(struct ntb_edma_intr));
>>> + }
>>> +
>>> + for (i = 0; i < wr_cnt; i++) {
>>> + chip->ll_region_wr[i].vaddr.io = edma_virt + off;
>>> + chip->ll_region_wr[i].paddr = info->ll_wr_phys[i];
>>> + chip->ll_region_wr[i].sz = DMA_LLP_MEM_SIZE;
>>> + off += DMA_LLP_MEM_SIZE;
>>> + }
>>> + for (i = 0; i < rd_cnt; i++) {
>>> + chip->ll_region_rd[i].vaddr.io = edma_virt + off;
>>> + chip->ll_region_rd[i].paddr = info->ll_rd_phys[i];
>>> + chip->ll_region_rd[i].sz = DMA_LLP_MEM_SIZE;
>>> + off += DMA_LLP_MEM_SIZE;
>>> + }
>>> +
>>> + if (!pci_dev_msi_enabled(ndev->pdev))
>>> + return -ENXIO;
>>> +
>>> + ret = dw_edma_probe(chip);
>>> + if (ret) {
>>> + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
>>> + return ret;
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +struct ntb_edma_filter {
>>> + struct device *dma_dev;
>>> + u32 direction;
>>> +};
>>> +
>>> +static bool ntb_edma_filter_fn(struct dma_chan *chan, void *arg)
>>> +{
>>> + struct ntb_edma_filter *filter = arg;
>>> + u32 dir = filter->direction;
>>> + struct dma_slave_caps caps;
>>> + int ret;
>>> +
>>> + if (chan->device->dev != filter->dma_dev)
>>> + return false;
>>> +
>>> + ret = dma_get_slave_caps(chan, &caps);
>>> + if (ret < 0)
>>> + return false;
>>> +
>>> + return !!(caps.directions & dir);
>>> +}
>>> +
>>> +void ntb_edma_teardown_chans(struct ntb_edma_chans *edma)
>>> +{
>>> + unsigned int i;
>>> +
>>> + for (i = 0; i < edma->num_wr_chan; i++)
>>> + dma_release_channel(edma->wr_chan[i]);
>>> +
>>> + for (i = 0; i < edma->num_rd_chan; i++)
>>> + dma_release_channel(edma->rd_chan[i]);
>>> +
>>> + if (edma->intr_chan)
>>> + dma_release_channel(edma->intr_chan);
>>> +}
>>> +
>>> +int ntb_edma_setup_chans(struct device *dma_dev, struct ntb_edma_chans *edma)
>>> +{
>>> + struct ntb_edma_filter filter;
>>> + dma_cap_mask_t dma_mask;
>>> + unsigned int i;
>>> +
>>> + dma_cap_zero(dma_mask);
>>> + dma_cap_set(DMA_SLAVE, dma_mask);
>>> +
>>> + memset(edma, 0, sizeof(*edma));
>>> + edma->dev = dma_dev;
>>> +
>>> + filter.dma_dev = dma_dev;
>>> + filter.direction = BIT(DMA_DEV_TO_MEM);
>>> + for (i = 0; i < EDMA_WR_CH_NUM; i++) {
>>> + edma->wr_chan[i] = dma_request_channel(dma_mask,
>>> + ntb_edma_filter_fn,
>>> + &filter);
>>> + if (!edma->wr_chan[i])
>>> + break;
>>> + edma->num_wr_chan++;
>>> + }
>>> +
>>> + filter.direction = BIT(DMA_MEM_TO_DEV);
>>> + for (i = 0; i < EDMA_RD_CH_NUM; i++) {
>>> + edma->rd_chan[i] = dma_request_channel(dma_mask,
>>> + ntb_edma_filter_fn,
>>> + &filter);
>>> + if (!edma->rd_chan[i])
>>> + break;
>>> + edma->num_rd_chan++;
>>> + }
>>> +
>>> + edma->intr_chan = dma_request_channel(dma_mask, ntb_edma_filter_fn,
>>> + &filter);
>>> + if (!edma->intr_chan)
>>> + dev_warn(dma_dev,
>>> + "Remote eDMA notify channel could not be allocated\n");
>>> +
>>> + if (!edma->num_wr_chan || !edma->num_rd_chan) {
>>> + dev_warn(dma_dev, "Remote eDMA channels failed to initialize\n");
>>> + ntb_edma_teardown_chans(edma);
>>> + return -ENODEV;
>>> + }
>>> + return 0;
>>> +}
>>> +
>>> +struct dma_chan *ntb_edma_pick_chan(struct ntb_edma_chans *edma,
>>> + remote_edma_dir_t dir)
>>> +{
>>> + unsigned int n, cur, idx;
>>> + struct dma_chan **chans;
>>> + atomic_t *cur_chan;
>>> +
>>> + if (dir == REMOTE_EDMA_WRITE) {
>>> + n = edma->num_wr_chan;
>>> + chans = edma->wr_chan;
>>> + cur_chan = &edma->cur_wr_chan;
>>> + } else {
>>> + n = edma->num_rd_chan;
>>> + chans = edma->rd_chan;
>>> + cur_chan = &edma->cur_rd_chan;
>>> + }
>>> + if (WARN_ON_ONCE(!n))
>>> + return NULL;
>>> +
>>> + /* Simple round-robin */
>>> + cur = (unsigned int)atomic_inc_return(cur_chan) - 1;
>>> + idx = cur % n;
>>> + return chans[idx];
>>> +}
>>> +
>>> +int ntb_edma_notify_peer(struct ntb_edma_chans *edma, int qp_num)
>>> +{
>>> + struct dma_async_tx_descriptor *txd;
>>> + struct dma_slave_config cfg;
>>> + struct scatterlist sgl;
>>> + dma_cookie_t cookie;
>>> + struct device *dev;
>>> +
>>> + if (!edma || !edma->intr_chan)
>>> + return -ENXIO;
>>> +
>>> + if (qp_num < 0 || qp_num >= edma_ctx.notify_qp_max)
>>> + return -EINVAL;
>>> +
>>> + if (!edma_ctx.intr_rc_virt || !edma_ctx.intr_ep_phys)
>>> + return -EINVAL;
>>> +
>>> + dev = edma->dev;
>>> + if (!dev)
>>> + return -ENODEV;
>>> +
>>> + WRITE_ONCE(edma_ctx.intr_rc_virt->db[qp_num], 1);
>>> +
>>> + /* Ensure store is visible before kicking the DMA transfer */
>>> + wmb();
>>> +
>>> + sg_init_table(&sgl, 1);
>>> + sg_dma_address(&sgl) = edma_ctx.intr_rc_phys + qp_num * sizeof(u32);
>>> + sg_dma_len(&sgl) = sizeof(u32);
>>> +
>>> + memset(&cfg, 0, sizeof(cfg));
>>> + cfg.dst_addr = edma_ctx.intr_ep_phys + qp_num * sizeof(u32);
>>> + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
>>> + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
>>> + cfg.direction = DMA_MEM_TO_DEV;
>>> +
>>> + if (dmaengine_slave_config(edma->intr_chan, &cfg))
>>> + return -EINVAL;
>>> +
>>> + txd = dmaengine_prep_slave_sg(edma->intr_chan, &sgl, 1,
>>> + DMA_MEM_TO_DEV,
>>> + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
>>> + if (!txd)
>>> + return -ENOSPC;
>>> +
>>> + cookie = dmaengine_submit(txd);
>>> + if (dma_submit_error(cookie))
>>> + return -ENOSPC;
>>> +
>>> + dma_async_issue_pending(edma->intr_chan);
>>> + return 0;
>>> +}
>>> diff --git a/drivers/ntb/ntb_edma.h b/drivers/ntb/ntb_edma.h
>>> new file mode 100644
>>> index 000000000000..da0451827edb
>>> --- /dev/null
>>> +++ b/drivers/ntb/ntb_edma.h
>>> @@ -0,0 +1,128 @@
>>> +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
>>> +#ifndef _NTB_EDMA_H_
>>> +#define _NTB_EDMA_H_
>>> +
>>> +#include <linux/completion.h>
>>> +#include <linux/device.h>
>>> +#include <linux/interrupt.h>
>>> +
>>> +#define EDMA_REG_SIZE SZ_64K
>>> +#define DMA_LLP_MEM_SIZE SZ_4K
>>> +#define EDMA_WR_CH_NUM 4
>>> +#define EDMA_RD_CH_NUM 4
>>> +#define NTB_EDMA_MAX_CH 8
>>> +
>>> +#define NTB_EDMA_INFO_MAGIC 0x45444D41 /* "EDMA" */
>>> +#define NTB_EDMA_INFO_OFF EDMA_REG_SIZE
>>> +
>>> +#define NTB_EDMA_RING_ORDER 7
>>> +#define NTB_EDMA_RING_ENTRIES (1U << NTB_EDMA_RING_ORDER)
>>> +#define NTB_EDMA_RING_MASK (NTB_EDMA_RING_ENTRIES - 1)
>>> +
>>> +typedef void (*ntb_edma_interrupt_cb_t)(void *data, int qp_num);
>>> +
>>> +/*
>>> + * REMOTE_EDMA_EP:
>>> + * Endpoint owns the eDMA engine and pushes descriptors into a shared MW.
>>> + *
>>> + * REMOTE_EDMA_RC:
>>> + * Root Complex controls the endpoint eDMA through the shared MW and
>>> + * drives reads/writes on behalf of the host.
>>> + */
>>> +typedef enum {
>>> + REMOTE_EDMA_UNKNOWN,
>>> + REMOTE_EDMA_EP,
>>> + REMOTE_EDMA_RC,
>>> +} remote_edma_mode_t;
>>> +
>>> +typedef enum {
>>> + REMOTE_EDMA_WRITE,
>>> + REMOTE_EDMA_READ,
>>> +} remote_edma_dir_t;
>>> +
>>> +/*
>>> + * Layout of remote eDMA MW (EP local address space, RC sees via peer MW):
>>> + *
>>> + * 0 .. EDMA_REG_SIZE-1 : DesignWare eDMA registers
>>> + * EDMA_REG_SIZE .. +PAGE_SIZE : struct ntb_edma_info (EP writes, RC reads)
>>> + * +PAGE_SIZE .. : LL ring buffers (EP allocates phys addresses,
>>> + * RC configures via dw_edma)
>>> + *
>>> + * ntb_edma_setup_mws() on EP:
>>> + * - allocates ntb_edma_info and LLs in EP memory
>>> + * - programs inbound iATU so that RC peer MW[n] points at this block
>>> + *
>>> + * ntb_edma_setup_peer() on RC:
>>> + * - ioremaps peer MW[n]
>>> + * - reads ntb_edma_info
>>> + * - sets up dw_edma_chip ll_region_* from that info
>>> + */
>>> +struct ntb_edma_info {
>>> + u32 magic;
>>> + u16 wr_cnt;
>>> + u16 rd_cnt;
>>> + u64 regs_phys;
>>> + u32 ll_stride;
>>> + u32 rsvd;
>>> + u64 ll_wr_phys[NTB_EDMA_MAX_CH];
>>> + u64 ll_rd_phys[NTB_EDMA_MAX_CH];
>>> +
>>> + u64 intr_dar_base;
>>> +} __packed;
>>> +
>>> +struct ll_dma_addrs {
>>> + dma_addr_t wr[EDMA_WR_CH_NUM];
>>> + dma_addr_t rd[EDMA_RD_CH_NUM];
>>> +};
>>> +
>>> +struct ntb_edma_chans {
>>> + struct device *dev;
>>> +
>>> + struct dma_chan *wr_chan[EDMA_WR_CH_NUM];
>>> + struct dma_chan *rd_chan[EDMA_RD_CH_NUM];
>>> + struct dma_chan *intr_chan;
>>> +
>>> + unsigned int num_wr_chan;
>>> + unsigned int num_rd_chan;
>>> + atomic_t cur_wr_chan;
>>> + atomic_t cur_rd_chan;
>>> +};
>>> +
>>> +static __always_inline u32 ntb_edma_ring_idx(u32 v)
>>> +{
>>> + return v & NTB_EDMA_RING_MASK;
>>> +}
>>> +
>>> +static __always_inline u32 ntb_edma_ring_used_entry(u32 head, u32 tail)
>>> +{
>>> + if (head >= tail) {
>>> + WARN_ON_ONCE((head - tail) > (NTB_EDMA_RING_ENTRIES - 1));
>>> + return head - tail;
>>> + }
>>> +
>>> + WARN_ON_ONCE((U32_MAX - tail + head + 1) > (NTB_EDMA_RING_ENTRIES - 1));
>>> + return U32_MAX - tail + head + 1;
>>> +}
>>> +
>>> +static __always_inline u32 ntb_edma_ring_free_entry(u32 head, u32 tail)
>>> +{
>>> + return NTB_EDMA_RING_ENTRIES - ntb_edma_ring_used_entry(head, tail) - 1;
>>> +}
>>> +
>>> +static __always_inline bool ntb_edma_ring_full(u32 head, u32 tail)
>>> +{
>>> + return ntb_edma_ring_free_entry(head, tail) == 0;
>>> +}
>>> +
>>> +int ntb_edma_setup_isr(struct device *dev, struct device *epc_dev,
>>> + ntb_edma_interrupt_cb_t cb, void *data);
>>> +void ntb_edma_teardown_isr(struct device *dev);
>>> +int ntb_edma_setup_mws(struct ntb_dev *ndev);
>>> +int ntb_edma_setup_peer(struct ntb_dev *ndev);
>>> +int ntb_edma_setup_chans(struct device *dma_dev, struct ntb_edma_chans *edma);
>>> +struct dma_chan *ntb_edma_pick_chan(struct ntb_edma_chans *edma,
>>> + remote_edma_dir_t dir);
>>> +void ntb_edma_teardown_chans(struct ntb_edma_chans *edma);
>>> +int ntb_edma_notify_peer(struct ntb_edma_chans *edma, int qp_num);
>>> +
>>> +#endif
>>> diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport_core.c
>>> similarity index 65%
>>> rename from drivers/ntb/ntb_transport.c
>>> rename to drivers/ntb/ntb_transport_core.c
>>> index 907db6c93d4d..48d48921978d 100644
>>> --- a/drivers/ntb/ntb_transport.c
>>> +++ b/drivers/ntb/ntb_transport_core.c
>>> @@ -47,6 +47,9 @@
>>> * Contact Information:
>>> * Jon Mason <jon.mason@intel.com>
>>> */
>>> +#include <linux/atomic.h>
>>> +#include <linux/bug.h>
>>> +#include <linux/compiler.h>
>>> #include <linux/debugfs.h>
>>> #include <linux/delay.h>
>>> #include <linux/dmaengine.h>
>>> @@ -71,6 +74,8 @@
>>> #define NTB_TRANSPORT_DESC "Software Queue-Pair Transport over NTB"
>>> #define NTB_TRANSPORT_MIN_SPADS (MW0_SZ_HIGH + 2)
>>>
>>> +#define NTB_EDMA_MAX_POLL 32
>>> +
>>> MODULE_DESCRIPTION(NTB_TRANSPORT_DESC);
>>> MODULE_VERSION(NTB_TRANSPORT_VER);
>>> MODULE_LICENSE("Dual BSD/GPL");
>>> @@ -102,6 +107,13 @@ module_param(use_msi, bool, 0644);
>>> MODULE_PARM_DESC(use_msi, "Use MSI interrupts instead of doorbells");
>>> #endif
>>>
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>
>> This comment applies throughout this patch. Doing ifdefs inside C source is pretty frowned upon in the kernel. The preferred way is to have ifdefs only in the header files. So please give this a bit more consideration and see if it can be done differently to address this.
>
> I agree, there is no good reason to keep those remaining ifdefs at all.
> I'll clean it up. Thanks for pointing this out.
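
To be concrete, the usual pattern is to keep the #ifdef in the header and
provide static inline stubs for the disabled case, so the .c file can call the
helpers unconditionally. A rough sketch, reusing the names already in this
patch (not a final interface):

/* ntb_edma.h */
#ifdef CONFIG_NTB_TRANSPORT_EDMA
int ntb_transport_edma_init(struct ntb_transport_ctx *nt, unsigned int *mw_count);
void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt);
#else
static inline int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
                                          unsigned int *mw_count)
{
        return 0;
}
static inline void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt)
{
}
#endif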
>
>>
>>> +#include "ntb_edma.h"
>>> +static bool use_remote_edma;
>>> +module_param(use_remote_edma, bool, 0644);
>>> +MODULE_PARM_DESC(use_remote_edma, "Use remote eDMA mode (when enabled, use_msi is ignored)");
>>> +#endif
>>> +
>>> static struct dentry *nt_debugfs_dir;
>>>
>>> /* Only two-ports NTB devices are supported */
>>> @@ -125,6 +137,14 @@ struct ntb_queue_entry {
>>> struct ntb_payload_header __iomem *tx_hdr;
>>> struct ntb_payload_header *rx_hdr;
>>> };
>>> +
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>> + dma_addr_t addr;
>>> +
>>> + /* Used by RC side only */
>>> + struct scatterlist sgl;
>>> + struct work_struct dma_work;
>>> +#endif
>>> };
>>>
>>> struct ntb_rx_info {
>>> @@ -202,6 +222,33 @@ struct ntb_transport_qp {
>>> int msi_irq;
>>> struct ntb_msi_desc msi_desc;
>>> struct ntb_msi_desc peer_msi_desc;
>>> +
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>> + /*
>>> + * For ensuring peer notification in non-atomic context.
>>> + * ntb_peer_db_set might sleep or schedule.
>>> + */
>>> + struct work_struct db_work;
>>> +
>>> + /*
>>> + * wr: remote eDMA write transfer (EP -> RC direction)
>>> + * rd: remote eDMA read transfer (RC -> EP direction)
>>> + */
>>> + u32 wr_cons;
>>> + u32 rd_cons;
>>> + u32 wr_prod;
>>> + u32 rd_prod;
>>> + u32 wr_issue;
>>> + u32 rd_issue;
>>> +
>>> + spinlock_t ep_tx_lock;
>>> + spinlock_t ep_rx_lock;
>>> + spinlock_t rc_lock;
>>> +
>>> + /* Completion work for read/write transfers. */
>>> + struct work_struct read_work;
>>> + struct work_struct write_work;
>>> +#endif
>>
>> For something like this, maybe it needs its own struct instead of an ifdef chunk. Perhaps 'ntb_rx_info' can serve as a core data struct, with the eDMA backend having an 'ntb_rx_info_edma' that embeds 'ntb_rx_info'.
>
> Thanks again for the suggestion. I'll reorganize things.
>
> Koichiro
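
In other words, rather than an #ifdef block inside the generic queue-pair
struct, the backend-specific state could live in its own struct. A rough
sketch (the names and the exact field split are placeholders, not a final
layout):

struct ntb_transport_qp_edma {
        /* remote eDMA ring indices */
        u32 wr_cons, rd_cons;
        u32 wr_prod, rd_prod;
        u32 wr_issue, rd_issue;

        spinlock_t ep_tx_lock;
        spinlock_t ep_rx_lock;
        spinlock_t rc_lock;

        struct work_struct db_work;
        struct work_struct read_work;
        struct work_struct write_work;
};

struct ntb_transport_qp {
        /* ... existing generic fields ... */
        struct ntb_transport_qp_edma *edma;     /* NULL unless the eDMA backend is used */
};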
>
>>
>> DJ
>>
>>> };
>>>
>>> struct ntb_transport_mw {
>>> @@ -249,6 +296,13 @@ struct ntb_transport_ctx {
>>>
>>> /* Make sure workq of link event be executed serially */
>>> struct mutex link_event_lock;
>>> +
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>> + remote_edma_mode_t remote_edma_mode;
>>> + struct device *dma_dev;
>>> + struct workqueue_struct *wq;
>>> + struct ntb_edma_chans edma;
>>> +#endif
>>> };
>>>
>>> enum {
>>> @@ -262,6 +316,19 @@ struct ntb_payload_header {
>>> unsigned int flags;
>>> };
>>>
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>> +static void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt);
>>> +static int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
>>> + unsigned int *mw_count);
>>> +static void ntb_transport_edma_init_queue(struct ntb_transport_ctx *nt,
>>> + unsigned int qp_num);
>>> +static void ntb_transport_edma_create_queue(struct ntb_transport_ctx *nt,
>>> + struct ntb_transport_qp *qp);
>>> +static int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt);
>>> +static int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt);
>>> +static void ntb_transport_edma_rc_dma_work(struct work_struct *work);
>>> +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
>>> +
>>> /*
>>> * Return the device that should be used for DMA mapping.
>>> *
>>> @@ -298,7 +365,7 @@ enum {
>>> container_of((__drv), struct ntb_transport_client, driver)
>>>
>>> #define QP_TO_MW(nt, qp) ((qp) % nt->mw_count)
>>> -#define NTB_QP_DEF_NUM_ENTRIES 100
>>> +#define NTB_QP_DEF_NUM_ENTRIES 128
>>> #define NTB_LINK_DOWN_TIMEOUT 10
>>>
>>> static void ntb_transport_rxc_db(unsigned long data);
>>> @@ -1015,6 +1082,10 @@ static void ntb_transport_link_cleanup(struct ntb_transport_ctx *nt)
>>> count = ntb_spad_count(nt->ndev);
>>> for (i = 0; i < count; i++)
>>> ntb_spad_write(nt->ndev, i, 0);
>>> +
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>> + ntb_edma_teardown_chans(&nt->edma);
>>> +#endif
>>> }
>>>
>>> static void ntb_transport_link_cleanup_work(struct work_struct *work)
>>> @@ -1051,6 +1122,14 @@ static void ntb_transport_link_work(struct work_struct *work)
>>>
>>> /* send the local info, in the opposite order of the way we read it */
>>>
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>> + rc = ntb_transport_edma_ep_init(nt);
>>> + if (rc) {
>>> + dev_err(&pdev->dev, "Failed to init EP: %d\n", rc);
>>> + return;
>>> + }
>>> +#endif
>>> +
>>> if (nt->use_msi) {
>>> rc = ntb_msi_setup_mws(ndev);
>>> if (rc) {
>>> @@ -1132,6 +1211,14 @@ static void ntb_transport_link_work(struct work_struct *work)
>>>
>>> nt->link_is_up = true;
>>>
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>> + rc = ntb_transport_edma_rc_init(nt);
>>> + if (rc) {
>>> + dev_err(&pdev->dev, "Failed to init RC: %d\n", rc);
>>> + goto out1;
>>> + }
>>> +#endif
>>> +
>>> for (i = 0; i < nt->qp_count; i++) {
>>> struct ntb_transport_qp *qp = &nt->qp_vec[i];
>>>
>>> @@ -1277,6 +1364,8 @@ static const struct ntb_transport_backend_ops default_backend_ops = {
>>> .debugfs_stats_show = ntb_transport_default_debugfs_stats_show,
>>> };
>>>
>>> +static const struct ntb_transport_backend_ops edma_backend_ops;
>>> +
>>> static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
>>> {
>>> struct ntb_transport_ctx *nt;
>>> @@ -1311,7 +1400,23 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
>>>
>>> nt->ndev = ndev;
>>>
>>> - nt->backend_ops = default_backend_ops;
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>> + if (use_remote_edma) {
>>> + rc = ntb_transport_edma_init(nt, &mw_count);
>>> + if (rc) {
>>> + nt->mw_count = 0;
>>> + goto err;
>>> + }
>>> + nt->backend_ops = edma_backend_ops;
>>> +
>>> + /*
>>> + * On remote eDMA mode, we reserve a read channel for Host->EP
>>> + * interruption.
>>> + */
>>> + use_msi = false;
>>> + } else
>>> +#endif
>>> + nt->backend_ops = default_backend_ops;
>>>
>>> /*
>>> * If we are using MSI, and have at least one extra memory window,
>>> @@ -1402,6 +1507,10 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
>>> rc = ntb_transport_init_queue(nt, i);
>>> if (rc)
>>> goto err2;
>>> +
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>> + ntb_transport_edma_init_queue(nt, i);
>>> +#endif
>>> }
>>>
>>> INIT_DELAYED_WORK(&nt->link_work, ntb_transport_link_work);
>>> @@ -1433,6 +1542,9 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
>>> }
>>> kfree(nt->mw_vec);
>>> err:
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>> + ntb_transport_edma_uninit(nt);
>>> +#endif
>>> kfree(nt);
>>> return rc;
>>> }
>>> @@ -2055,11 +2167,16 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
>>>
>>> nt->qp_bitmap_free &= ~qp_bit;
>>>
>>> + qp->qp_bit = qp_bit;
>>> qp->cb_data = data;
>>> qp->rx_handler = handlers->rx_handler;
>>> qp->tx_handler = handlers->tx_handler;
>>> qp->event_handler = handlers->event_handler;
>>>
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>> + ntb_transport_edma_create_queue(nt, qp);
>>> +#endif
>>> +
>>> dma_cap_zero(dma_mask);
>>> dma_cap_set(DMA_MEMCPY, dma_mask);
>>>
>>> @@ -2105,6 +2222,9 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
>>> goto err1;
>>>
>>> entry->qp = qp;
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>> + INIT_WORK(&entry->dma_work, ntb_transport_edma_rc_dma_work);
>>> +#endif
>>> ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
>>> &qp->rx_free_q);
>>> }
>>> @@ -2156,8 +2276,8 @@ EXPORT_SYMBOL_GPL(ntb_transport_create_queue);
>>> */
>>> void ntb_transport_free_queue(struct ntb_transport_qp *qp)
>>> {
>>> - struct pci_dev *pdev;
>>> struct ntb_queue_entry *entry;
>>> + struct pci_dev *pdev;
>>> u64 qp_bit;
>>>
>>> if (!qp)
>>> @@ -2208,6 +2328,10 @@ void ntb_transport_free_queue(struct ntb_transport_qp *qp)
>>> tasklet_kill(&qp->rxc_db_work);
>>>
>>> cancel_delayed_work_sync(&qp->link_work);
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>> + cancel_work_sync(&qp->read_work);
>>> + cancel_work_sync(&qp->write_work);
>>> +#endif
>>>
>>> qp->cb_data = NULL;
>>> qp->rx_handler = NULL;
>>> @@ -2346,6 +2470,1157 @@ int ntb_transport_tx_enqueue(struct ntb_transport_qp *qp, void *cb, void *data,
>>> }
>>> EXPORT_SYMBOL_GPL(ntb_transport_tx_enqueue);
>>>
>>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
>>> +/*
>>> + * Remote eDMA mode implementation
>>> + */
>>> +struct ntb_edma_desc {
>>> + u32 len;
>>> + u32 flags;
>>> + u64 addr; /* DMA address */
>>> + u64 data;
>>> +};
>>> +
>>> +struct ntb_edma_ring {
>>> + struct ntb_edma_desc desc[NTB_EDMA_RING_ENTRIES];
>>> + u32 head;
>>> + u32 tail;
>>> +};
>>> +
>>> +#define NTB_EDMA_DESC_OFF(i) ((size_t)(i) * sizeof(struct ntb_edma_desc))
>>> +
>>> +#define __NTB_EDMA_CHECK_INDEX(_i) \
>>> +({ \
>>> + unsigned long __i = (unsigned long)(_i); \
>>> + WARN_ONCE(__i >= (unsigned long)NTB_EDMA_RING_ENTRIES, \
>>> + "ntb_edma: index i=%lu >= ring_entries=%lu\n", \
>>> + __i, (unsigned long)NTB_EDMA_RING_ENTRIES); \
>>> + __i; \
>>> +})
>>> +
>>> +#define NTB_EDMA_DESC_I(qp, i, n) \
>>> +({ \
>>> + typeof(qp) __qp = (qp); \
>>> + unsigned long __i = __NTB_EDMA_CHECK_INDEX(i); \
>>> + (struct ntb_edma_desc *) \
>>> + ((char *)(__qp)->rx_buff + \
>>> + (sizeof(struct ntb_edma_ring) * n) + \
>>> + NTB_EDMA_DESC_OFF(__i)); \
>>> +})
>>> +
>>> +#define NTB_EDMA_DESC_O(qp, i, n) \
>>> +({ \
>>> + typeof(qp) __qp = (qp); \
>>> + unsigned long __i = __NTB_EDMA_CHECK_INDEX(i); \
>>> + (struct ntb_edma_desc __iomem *) \
>>> + ((char __iomem *)(__qp)->tx_mw + \
>>> + (sizeof(struct ntb_edma_ring) * n) + \
>>> + NTB_EDMA_DESC_OFF(__i)); \
>>> +})
>>> +
>>> +#define NTB_EDMA_HEAD_I(qp, n) ((u32 *)((char *)qp->rx_buff + \
>>> + (sizeof(struct ntb_edma_ring) * n) + \
>>> + offsetof(struct ntb_edma_ring, head)))
>>> +#define NTB_EDMA_HEAD_O(qp, n) ((u32 *)((char __iomem *)qp->tx_mw + \
>>> + (sizeof(struct ntb_edma_ring) * n) + \
>>> + offsetof(struct ntb_edma_ring, head)))
>>> +#define NTB_EDMA_TAIL_I(qp, n) ((u32 *)((char *)qp->rx_buff + \
>>> + (sizeof(struct ntb_edma_ring) * n) + \
>>> + offsetof(struct ntb_edma_ring, tail)))
>>> +#define NTB_EDMA_TAIL_O(qp, n) ((u32 *)((char __iomem *)qp->tx_mw + \
>>> + (sizeof(struct ntb_edma_ring) * n) + \
>>> + offsetof(struct ntb_edma_ring, tail)))
>>> +
>>> +/*
>>> + * Macro naming rule:
>>> + * NTB_DESC_RD_EP_I (as an example)
>>> + * ^^ ^^ ^
>>> + * : : `-- I(n) or O(ut). In = Read, Out = Write.
>>> + * : `----- Who uses this macro.
>>> + * `-------- DESC / HEAD / TAIL
>>> + *
>>> + * Read transfers (RC->EP):
>>> + *
>>> + * EP view (outbound, written via NTB):
>>> + * - descs: NTB_DESC_RD_EP_O(qp, i) / NTB_DESC_RD_EP_I(qp, i)
>>> + * [ len ][ flags ][ addr ][ data ]
>>> + * [ len ][ flags ][ addr ][ data ]
>>> + * :
>>> + * [ len ][ flags ][ addr ][ data ]
>>> + * - head: NTB_HEAD_RD_EP_O(qp)
>>> + * - tail: NTB_TAIL_RD_EP_I(qp)
>>> + *
>>> + * RC view (inbound, local mapping):
>>> + * - descs: NTB_DESC_RD_RC_I(qp, i) / NTB_DESC_RD_RC_O(qp, i)
>>> + * [ len ][ flags ][ addr ][ data ]
>>> + * [ len ][ flags ][ addr ][ data ]
>>> + * :
>>> + * [ len ][ flags ][ addr ][ data ]
>>> + * - head: NTB_HEAD_RD_RC_I(qp)
>>> + * - tail: NTB_TAIL_RD_RC_O(qp)
>>> + *
>>> + * Write transfers (EP -> RC) are analogous but use
>>> + * NTB_DESC_WR_{EP_O,RC_I}(), NTB_HEAD_WR_{EP_O,RC_I}(),
>>> + * and NTB_TAIL_WR_{EP_I,RC_O}().
>>> + */
>>> +#define NTB_DESC_RD_EP_I(qp, i) NTB_EDMA_DESC_I(qp, i, 0)
>>> +#define NTB_DESC_RD_EP_O(qp, i) NTB_EDMA_DESC_O(qp, i, 0)
>>> +#define NTB_DESC_WR_EP_I(qp, i) NTB_EDMA_DESC_I(qp, i, 1)
>>> +#define NTB_DESC_WR_EP_O(qp, i) NTB_EDMA_DESC_O(qp, i, 1)
>>> +#define NTB_DESC_RD_RC_I(qp, i) NTB_EDMA_DESC_I(qp, i, 0)
>>> +#define NTB_DESC_RD_RC_O(qp, i) NTB_EDMA_DESC_O(qp, i, 0)
>>> +#define NTB_DESC_WR_RC_I(qp, i) NTB_EDMA_DESC_I(qp, i, 1)
>>> +#define NTB_DESC_WR_RC_O(qp, i) NTB_EDMA_DESC_O(qp, i, 1)
>>> +
>>> +#define NTB_HEAD_RD_EP_O(qp) NTB_EDMA_HEAD_O(qp, 0)
>>> +#define NTB_HEAD_WR_EP_O(qp) NTB_EDMA_HEAD_O(qp, 1)
>>> +#define NTB_HEAD_RD_RC_I(qp) NTB_EDMA_HEAD_I(qp, 0)
>>> +#define NTB_HEAD_WR_RC_I(qp) NTB_EDMA_HEAD_I(qp, 1)
>>> +
>>> +#define NTB_TAIL_RD_EP_I(qp) NTB_EDMA_TAIL_I(qp, 0)
>>> +#define NTB_TAIL_WR_EP_I(qp) NTB_EDMA_TAIL_I(qp, 1)
>>> +#define NTB_TAIL_RD_RC_O(qp) NTB_EDMA_TAIL_O(qp, 0)
>>> +#define NTB_TAIL_WR_RC_O(qp) NTB_EDMA_TAIL_O(qp, 1)
>>> +
>>> +static inline bool ntb_qp_edma_is_rc(struct ntb_transport_qp *qp)
>>> +{
>>> + return qp->transport->remote_edma_mode == REMOTE_EDMA_RC;
>>> +}
>>> +
>>> +static inline bool ntb_qp_edma_is_ep(struct ntb_transport_qp *qp)
>>> +{
>>> + return qp->transport->remote_edma_mode == REMOTE_EDMA_EP;
>>> +}
>>> +
>>> +static inline bool ntb_qp_edma_enabled(struct ntb_transport_qp *qp)
>>> +{
>>> + return ntb_qp_edma_is_rc(qp) || ntb_qp_edma_is_ep(qp);
>>> +}
>>> +
>>> +static unsigned int ntb_transport_edma_tx_free_entry(struct ntb_transport_qp *qp)
>>> +{
>>> + unsigned int head, tail;
>>> +
>>> + if (ntb_qp_edma_is_ep(qp)) {
>>> + scoped_guard(spinlock_irqsave, &qp->ep_tx_lock) {
>>> + /* In this scope, only 'head' might proceed */
>>> + tail = READ_ONCE(qp->wr_cons);
>>> + head = READ_ONCE(qp->wr_prod);
>>> + }
>>> + return ntb_edma_ring_free_entry(head, tail);
>>> + }
>>> +
>>> + scoped_guard(spinlock_irqsave, &qp->rc_lock) {
>>> + /* In this scope, only 'head' might proceed */
>>> + tail = READ_ONCE(qp->rd_issue);
>>> + head = READ_ONCE(*NTB_HEAD_RD_RC_I(qp));
>>> + }
>>> + /*
>>> + * On RC side, 'used' amount indicates how much EP side
>>> + * has refilled, which are available for us to use for TX.
>>> + */
>>> + return ntb_edma_ring_used_entry(head, tail);
>>> +}
>>> +
>>> +static void ntb_transport_edma_debugfs_stats_show(struct seq_file *s,
>>> + struct ntb_transport_qp *qp)
>>> +{
>>> + seq_printf(s, "rx_bytes - \t%llu\n", qp->rx_bytes);
>>> + seq_printf(s, "rx_pkts - \t%llu\n", qp->rx_pkts);
>>> + seq_printf(s, "rx_err_no_buf - %llu\n", qp->rx_err_no_buf);
>>> + seq_printf(s, "rx_buff - \t0x%p\n", qp->rx_buff);
>>> + seq_printf(s, "rx_max_entry - \t%u\n", qp->rx_max_entry);
>>> + seq_printf(s, "rx_alloc_entry - \t%u\n\n", qp->rx_alloc_entry);
>>> +
>>> + seq_printf(s, "tx_bytes - \t%llu\n", qp->tx_bytes);
>>> + seq_printf(s, "tx_pkts - \t%llu\n", qp->tx_pkts);
>>> + seq_printf(s, "tx_ring_full - \t%llu\n", qp->tx_ring_full);
>>> + seq_printf(s, "tx_err_no_buf - %llu\n", qp->tx_err_no_buf);
>>> + seq_printf(s, "tx_mw - \t0x%p\n", qp->tx_mw);
>>> + seq_printf(s, "tx_max_entry - \t%u\n", qp->tx_max_entry);
>>> + seq_printf(s, "free tx - \t%u\n", ntb_transport_tx_free_entry(qp));
>>> + seq_putc(s, '\n');
>>> +
>>> + seq_puts(s, "Using Remote eDMA - Yes\n");
>>> + seq_printf(s, "QP Link - \t%s\n", qp->link_is_up ? "Up" : "Down");
>>> +}
>>> +
>>> +static void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt)
>>> +{
>>> + struct ntb_dev *ndev = nt->ndev;
>>> +
>>> + if (nt->remote_edma_mode == REMOTE_EDMA_EP && ndev && ndev->pdev)
>>> + ntb_edma_teardown_isr(&ndev->pdev->dev);
>>> +
>>> + if (nt->wq)
>>> + destroy_workqueue(nt->wq);
>>> + nt->wq = NULL;
>>> +}
>>> +
>>> +static int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
>>> + unsigned int *mw_count)
>>> +{
>>> + struct ntb_dev *ndev = nt->ndev;
>>> +
>>> + /*
>>> + * We need at least one MW for the transport plus one MW reserved
>>> + * for the remote eDMA window (see ntb_edma_setup_mws/peer).
>>> + */
>>> + if (*mw_count <= 1) {
>>> + dev_err(&ndev->dev,
>>> + "remote eDMA requires at least two MWS (have %u)\n",
>>> + *mw_count);
>>> + return -ENODEV;
>>> + }
>>> +
>>> + nt->wq = alloc_workqueue("ntb-edma-wq", WQ_UNBOUND | WQ_SYSFS, 0);
>>> + if (!nt->wq) {
>>> + ntb_transport_edma_uninit(nt);
>>> + return -ENOMEM;
>>> + }
>>> +
>>> + /* Reserve the last peer MW exclusively for the eDMA window. */
>>> + *mw_count -= 1;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static void ntb_transport_edma_db_work(struct work_struct *work)
>>> +{
>>> + struct ntb_transport_qp *qp =
>>> + container_of(work, struct ntb_transport_qp, db_work);
>>> +
>>> + ntb_peer_db_set(qp->ndev, qp->qp_bit);
>>> +}
>>> +
>>> +static void ntb_transport_edma_notify_peer(struct ntb_transport_qp *qp)
>>> +{
>>> + if (ntb_qp_edma_is_rc(qp))
>>> + if (!ntb_edma_notify_peer(&qp->transport->edma, qp->qp_num))
>>> + return;
>>> +
>>> + /*
>>> + * Called from contexts that may be atomic. Since ntb_peer_db_set()
>>> + * may sleep, delegate the actual doorbell write to a workqueue.
>>> + */
>>> + queue_work(system_highpri_wq, &qp->db_work);
>>> +}
>>> +
>>> +static void ntb_transport_edma_isr(void *data, int qp_num)
>>> +{
>>> + struct ntb_transport_ctx *nt = data;
>>> + struct ntb_transport_qp *qp;
>>> +
>>> + if (qp_num < 0 || qp_num >= nt->qp_count)
>>> + return;
>>> +
>>> + qp = &nt->qp_vec[qp_num];
>>> + if (WARN_ON(!qp))
>>> + return;
>>> +
>>> + queue_work(nt->wq, &qp->read_work);
>>> + queue_work(nt->wq, &qp->write_work);
>>> +}
>>> +
>>> +static int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt)
>>> +{
>>> + struct ntb_dev *ndev = nt->ndev;
>>> + struct pci_dev *pdev = ndev->pdev;
>>> + int rc;
>>> +
>>> + if (!use_remote_edma || nt->remote_edma_mode != REMOTE_EDMA_UNKNOWN)
>>> + return 0;
>>> +
>>> + rc = ntb_edma_setup_peer(ndev);
>>> + if (rc) {
>>> + dev_err(&pdev->dev, "Failed to enable remote eDMA: %d\n", rc);
>>> + return rc;
>>> + }
>>> +
>>> + rc = ntb_edma_setup_chans(get_dma_dev(ndev), &nt->edma);
>>> + if (rc) {
>>> + dev_err(&pdev->dev, "Failed to setup eDMA channels: %d\n", rc);
>>> + return rc;
>>> + }
>>> +
>>> + nt->remote_edma_mode = REMOTE_EDMA_RC;
>>> + return 0;
>>> +}
>>> +
>>> +static int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt)
>>> +{
>>> + struct ntb_dev *ndev = nt->ndev;
>>> + struct pci_dev *pdev = ndev->pdev;
>>> + struct pci_epc *epc;
>>> + int rc;
>>> +
>>> + if (!use_remote_edma || nt->remote_edma_mode == REMOTE_EDMA_EP)
>>> + return 0;
>>> +
>>> + /* Only EP side can return pci_epc */
>>> + epc = ntb_get_pci_epc(ndev);
>>> + if (!epc)
>>> + return 0;
>>> +
>>> + rc = ntb_edma_setup_mws(ndev);
>>> + if (rc) {
>>> + dev_err(&pdev->dev,
>>> + "Failed to set up memory window for eDMA: %d\n", rc);
>>> + return rc;
>>> + }
>>> +
>>> + rc = ntb_edma_setup_isr(&pdev->dev, &epc->dev, ntb_transport_edma_isr, nt);
>>> + if (rc) {
>>> + dev_err(&pdev->dev, "Failed to setup eDMA ISR (%d)\n", rc);
>>> + return rc;
>>> + }
>>> +
>>> + nt->remote_edma_mode = REMOTE_EDMA_EP;
>>> + return 0;
>>> +}
>>> +
>>> +static int ntb_transport_edma_setup_qp_mw(struct ntb_transport_ctx *nt,
>>> + unsigned int qp_num)
>>> +{
>>> + struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
>>> + struct ntb_dev *ndev = nt->ndev;
>>> + struct ntb_queue_entry *entry;
>>> + struct ntb_transport_mw *mw;
>>> + unsigned int mw_num, mw_count, qp_count;
>>> + unsigned int qp_offset, rx_info_offset;
>>> + unsigned int mw_size, mw_size_per_qp;
>>> + unsigned int num_qps_mw;
>>> + size_t edma_total;
>>> + unsigned int i;
>>> + int node;
>>> +
>>> + mw_count = nt->mw_count;
>>> + qp_count = nt->qp_count;
>>> +
>>> + mw_num = QP_TO_MW(nt, qp_num);
>>> + mw = &nt->mw_vec[mw_num];
>>> +
>>> + if (!mw->virt_addr)
>>> + return -ENOMEM;
>>> +
>>> + if (mw_num < qp_count % mw_count)
>>> + num_qps_mw = qp_count / mw_count + 1;
>>> + else
>>> + num_qps_mw = qp_count / mw_count;
>>> +
>>> + mw_size = min(nt->mw_vec[mw_num].phys_size, mw->xlat_size);
>>> + if (max_mw_size && mw_size > max_mw_size)
>>> + mw_size = max_mw_size;
>>> +
>>> + mw_size_per_qp = round_down((unsigned int)mw_size / num_qps_mw, SZ_64);
>>> + qp_offset = mw_size_per_qp * (qp_num / mw_count);
>>> + rx_info_offset = mw_size_per_qp - sizeof(struct ntb_rx_info);
>>> +
>>> + qp->tx_mw_size = mw_size_per_qp;
>>> + qp->tx_mw = nt->mw_vec[mw_num].vbase + qp_offset;
>>> + if (!qp->tx_mw)
>>> + return -EINVAL;
>>> + qp->tx_mw_phys = nt->mw_vec[mw_num].phys_addr + qp_offset;
>>> + if (!qp->tx_mw_phys)
>>> + return -EINVAL;
>>> + qp->rx_info = qp->tx_mw + rx_info_offset;
>>> + qp->rx_buff = mw->virt_addr + qp_offset;
>>> + qp->remote_rx_info = qp->rx_buff + rx_info_offset;
>>> +
>>> + /* Due to housekeeping, there must be at least 2 buffs */
>>> + qp->tx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
>>> + qp->rx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
>>> +
>>> + /* In eDMA mode, decouple from MW sizing and force ring-sized entries */
>>> + edma_total = 2 * sizeof(struct ntb_edma_ring);
>>> + if (rx_info_offset < edma_total) {
>>> + dev_err(&ndev->dev, "Ring space requires %luB (>=%uB)\n",
>>> + edma_total, rx_info_offset);
>>> + return -EINVAL;
>>> + }
>>> + qp->tx_max_entry = NTB_EDMA_RING_ENTRIES;
>>> + qp->rx_max_entry = NTB_EDMA_RING_ENTRIES;
>>> +
>>> + /*
>>> + * Checking to see if we have more entries than the default.
>>> + * We should add additional entries if that is the case so we
>>> + * can be in sync with the transport frames.
>>> + */
>>> + node = dev_to_node(&ndev->dev);
>>> + for (i = qp->rx_alloc_entry; i < qp->rx_max_entry; i++) {
>>> + entry = kzalloc_node(sizeof(*entry), GFP_KERNEL, node);
>>> + if (!entry)
>>> + return -ENOMEM;
>>> +
>>> + entry->qp = qp;
>>> + INIT_WORK(&entry->dma_work, ntb_transport_edma_rc_dma_work);
>>> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
>>> + &qp->rx_free_q);
>>> + qp->rx_alloc_entry++;
>>> + }
>>> +
>>> + memset(qp->rx_buff, 0, edma_total);
>>> +
>>> + qp->rx_pkts = 0;
>>> + qp->tx_pkts = 0;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int ntb_transport_edma_ep_read_complete(struct ntb_transport_qp *qp)
>>> +{
>>> + struct device *dma_dev = get_dma_dev(qp->ndev);
>>> + struct ntb_queue_entry *entry;
>>> + struct ntb_edma_desc *in;
>>> + unsigned int len;
>>> + u32 idx;
>>> +
>>> + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_TAIL_RD_EP_I(qp)),
>>> + qp->rd_cons) == 0)
>>> + return 0;
>>> +
>>> + idx = ntb_edma_ring_idx(qp->rd_cons);
>>> + in = NTB_DESC_RD_EP_I(qp, idx);
>>> + if (!(in->flags & DESC_DONE_FLAG))
>>> + return 0;
>>> +
>>> + in->flags = 0;
>>> + len = in->len; /* might be smaller than entry->len */
>>> +
>>> + entry = (struct ntb_queue_entry *)(in->data);
>>> + if (WARN_ON(!entry))
>>> + return 0;
>>> +
>>> + if (in->flags & LINK_DOWN_FLAG) {
>>> + ntb_qp_link_down(qp);
>>> + qp->rd_cons++;
>>> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
>>> + return 1;
>>> + }
>>> +
>>> + dma_unmap_single(dma_dev, entry->addr, entry->len, DMA_FROM_DEVICE);
>>> +
>>> + qp->rx_bytes += len;
>>> + qp->rx_pkts++;
>>> + qp->rd_cons++;
>>> +
>>> + if (qp->rx_handler && qp->client_ready)
>>> + qp->rx_handler(qp, qp->cb_data, entry->cb_data, len);
>>> +
>>> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
>>> + return 1;
>>> +}
>>> +
>>> +static int ntb_transport_edma_ep_write_complete(struct ntb_transport_qp *qp)
>>> +{
>>> + struct ntb_queue_entry *entry;
>>> + struct ntb_edma_desc *in;
>>> + u32 idx;
>>> +
>>> + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_TAIL_WR_EP_I(qp)),
>>> + qp->wr_cons) == 0)
>>> + return 0;
>>> +
>>> + idx = ntb_edma_ring_idx(qp->wr_cons);
>>> + in = NTB_DESC_WR_EP_I(qp, idx);
>>> +
>>> + entry = (struct ntb_queue_entry *)(in->data);
>>> + if (WARN_ON(!entry))
>>> + return 0;
>>> +
>>> + qp->wr_cons++;
>>> +
>>> + if (qp->tx_handler)
>>> + qp->tx_handler(qp, qp->cb_data, entry->cb_data, entry->len);
>>> +
>>> + ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry, &qp->tx_free_q);
>>> + return 1;
>>> +}
>>> +
>>> +static void ntb_transport_edma_ep_read_work(struct work_struct *work)
>>> +{
>>> + struct ntb_transport_qp *qp = container_of(
>>> + work, struct ntb_transport_qp, read_work);
>>> + unsigned int i;
>>> +
>>> + for (i = 0; i < NTB_EDMA_MAX_POLL; i++) {
>>> + if (!ntb_transport_edma_ep_read_complete(qp))
>>> + break;
>>> + }
>>> +
>>> + if (ntb_transport_edma_ep_read_complete(qp))
>>> + queue_work(qp->transport->wq, &qp->read_work);
>>> +}
>>> +
>>> +static void ntb_transport_edma_ep_write_work(struct work_struct *work)
>>> +{
>>> + struct ntb_transport_qp *qp = container_of(
>>> + work, struct ntb_transport_qp, write_work);
>>> + unsigned int i;
>>> +
>>> + for (i = 0; i < NTB_EDMA_MAX_POLL; i++) {
>>> + if (!ntb_transport_edma_ep_write_complete(qp))
>>> + break;
>>> + }
>>> +
>>> + if (ntb_transport_edma_ep_write_complete(qp))
>>> + queue_work(qp->transport->wq, &qp->write_work);
>>> +}
>>> +
>>> +static void ntb_transport_edma_rc_write_complete_work(struct work_struct *work)
>>> +{
>>> + struct ntb_transport_qp *qp = container_of(
>>> + work, struct ntb_transport_qp, write_work);
>>> + struct ntb_queue_entry *entry;
>>> + struct ntb_edma_desc *in;
>>> + unsigned int len;
>>> + void *cb_data;
>>> + u32 idx;
>>> +
>>> + while (ntb_edma_ring_used_entry(READ_ONCE(qp->wr_issue),
>>> + qp->wr_cons) != 0) {
>>> + /* Paired with smp_wmb() in ntb_transport_edma_rc_poll() */
>>> + smp_rmb();
>>> +
>>> + idx = ntb_edma_ring_idx(qp->wr_cons);
>>> + in = NTB_DESC_WR_RC_I(qp, idx);
>>> + entry = (struct ntb_queue_entry *)READ_ONCE(in->data);
>>> + if (!entry || !(entry->flags & DESC_DONE_FLAG))
>>> + break;
>>> +
>>> + in->data = 0;
>>> +
>>> + cb_data = entry->cb_data;
>>> + len = entry->len;
>>> +
>>> + iowrite32(++qp->wr_cons, NTB_TAIL_WR_RC_O(qp));
>>> +
>>> + if (unlikely(entry->flags & LINK_DOWN_FLAG)) {
>>> + ntb_qp_link_down(qp);
>>> + continue;
>>> + }
>>> +
>>> + ntb_transport_edma_notify_peer(qp);
>>> +
>>> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
>>> +
>>> + if (qp->rx_handler && qp->client_ready)
>>> + qp->rx_handler(qp, qp->cb_data, cb_data, len);
>>> +
>>> + /* stat updates */
>>> + qp->rx_bytes += len;
>>> + qp->rx_pkts++;
>>> + }
>>> +}
>>> +
>>> +static void ntb_transport_edma_rc_write_cb(void *data,
>>> + const struct dmaengine_result *res)
>>> +{
>>> + struct ntb_queue_entry *entry = data;
>>> + struct ntb_transport_qp *qp = entry->qp;
>>> + struct ntb_transport_ctx *nt = qp->transport;
>>> + enum dmaengine_tx_result dma_err = res->result;
>>> + struct device *dma_dev = get_dma_dev(qp->ndev);
>>> +
>>> + switch (dma_err) {
>>> + case DMA_TRANS_READ_FAILED:
>>> + case DMA_TRANS_WRITE_FAILED:
>>> + case DMA_TRANS_ABORTED:
>>> + entry->errors++;
>>> + entry->len = -EIO;
>>> + break;
>>> + case DMA_TRANS_NOERROR:
>>> + default:
>>> + break;
>>> + }
>>> + dma_unmap_sg(dma_dev, &entry->sgl, 1, DMA_FROM_DEVICE);
>>> + sg_dma_address(&entry->sgl) = 0;
>>> +
>>> + entry->flags |= DESC_DONE_FLAG;
>>> +
>>> + queue_work(nt->wq, &qp->write_work);
>>> +}
>>> +
>>> +static void ntb_transport_edma_rc_read_complete_work(struct work_struct *work)
>>> +{
>>> + struct ntb_transport_qp *qp = container_of(
>>> + work, struct ntb_transport_qp, read_work);
>>> + struct ntb_edma_desc *in, __iomem *out;
>>> + struct ntb_queue_entry *entry;
>>> + unsigned int len;
>>> + void *cb_data;
>>> + u32 idx;
>>> +
>>> + while (ntb_edma_ring_used_entry(READ_ONCE(qp->rd_issue),
>>> + qp->rd_cons) != 0) {
>>> + /* Paired with smp_wmb() in ntb_transport_edma_rc_tx_enqueue() */
>>> + smp_rmb();
>>> +
>>> + idx = ntb_edma_ring_idx(qp->rd_cons);
>>> + in = NTB_DESC_RD_RC_I(qp, idx);
>>> + entry = (struct ntb_queue_entry *)in->data;
>>> + if (!entry || !(entry->flags & DESC_DONE_FLAG))
>>> + break;
>>> +
>>> + in->data = 0;
>>> +
>>> + cb_data = entry->cb_data;
>>> + len = entry->len;
>>> +
>>> + out = NTB_DESC_RD_RC_O(qp, idx);
>>> +
>>> + WRITE_ONCE(qp->rd_cons, qp->rd_cons + 1);
>>> +
>>> + /*
>>> + * No need to add barrier in-between to enforce ordering here.
>>> + * The other side proceeds only after both flags and tail are
>>> + * updated.
>>> + */
>>> + iowrite32(entry->flags, &out->flags);
>>> + iowrite32(qp->rd_cons, NTB_TAIL_RD_RC_O(qp));
>>> +
>>> + ntb_transport_edma_notify_peer(qp);
>>> +
>>> + ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry,
>>> + &qp->tx_free_q);
>>> +
>>> + if (qp->tx_handler)
>>> + qp->tx_handler(qp, qp->cb_data, cb_data, len);
>>> +
>>> + /* stat updates */
>>> + qp->tx_bytes += len;
>>> + qp->tx_pkts++;
>>> + }
>>> +}
>>> +
>>> +static void ntb_transport_edma_rc_read_cb(void *data,
>>> + const struct dmaengine_result *res)
>>> +{
>>> + struct ntb_queue_entry *entry = data;
>>> + struct ntb_transport_qp *qp = entry->qp;
>>> + struct ntb_transport_ctx *nt = qp->transport;
>>> + struct device *dma_dev = get_dma_dev(qp->ndev);
>>> + enum dmaengine_tx_result dma_err = res->result;
>>> +
>>> + switch (dma_err) {
>>> + case DMA_TRANS_READ_FAILED:
>>> + case DMA_TRANS_WRITE_FAILED:
>>> + case DMA_TRANS_ABORTED:
>>> + entry->errors++;
>>> + entry->len = -EIO;
>>> + break;
>>> + case DMA_TRANS_NOERROR:
>>> + default:
>>> + break;
>>> + }
>>> + dma_unmap_sg(dma_dev, &entry->sgl, 1, DMA_TO_DEVICE);
>>> + sg_dma_address(&entry->sgl) = 0;
>>> +
>>> + entry->flags |= DESC_DONE_FLAG;
>>> +
>>> + queue_work(nt->wq, &qp->read_work);
>>> +}
>>> +
>>> +static int ntb_transport_edma_rc_write_start(struct device *d,
>>> + struct dma_chan *chan, size_t len,
>>> + dma_addr_t ep_src, void *rc_dst,
>>> + struct ntb_queue_entry *entry)
>>> +{
>>> + struct scatterlist *sgl = &entry->sgl;
>>> + struct dma_async_tx_descriptor *txd;
>>> + struct dma_slave_config cfg;
>>> + dma_cookie_t cookie;
>>> + int nents, rc;
>>> +
>>> + if (!d)
>>> + return -ENODEV;
>>> +
>>> + if (!chan)
>>> + return -ENXIO;
>>> +
>>> + if (WARN_ON(!ep_src || !rc_dst))
>>> + return -EINVAL;
>>> +
>>> + if (WARN_ON(sg_dma_address(sgl)))
>>> + return -EINVAL;
>>> +
>>> + sg_init_one(sgl, rc_dst, len);
>>> + nents = dma_map_sg(d, sgl, 1, DMA_FROM_DEVICE);
>>> + if (nents <= 0)
>>> + return -EIO;
>>> +
>>> + memset(&cfg, 0, sizeof(cfg));
>>> + cfg.src_addr = ep_src;
>>> + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
>>> + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
>>> + cfg.direction = DMA_DEV_TO_MEM;
>>> + rc = dmaengine_slave_config(chan, &cfg);
>>> + if (rc)
>>> + goto out_unmap;
>>> +
>>> + txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_DEV_TO_MEM,
>>> + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
>>> + if (!txd) {
>>> + rc = -EIO;
>>> + goto out_unmap;
>>> + }
>>> +
>>> + txd->callback_result = ntb_transport_edma_rc_write_cb;
>>> + txd->callback_param = entry;
>>> +
>>> + cookie = dmaengine_submit(txd);
>>> + if (dma_submit_error(cookie)) {
>>> + rc = -EIO;
>>> + goto out_unmap;
>>> + }
>>> + dma_async_issue_pending(chan);
>>> + return 0;
>>> +out_unmap:
>>> + dma_unmap_sg(d, sgl, 1, DMA_FROM_DEVICE);
>>> + return rc;
>>> +}
>>> +
>>> +static int ntb_transport_edma_rc_read_start(struct device *d,
>>> + struct dma_chan *chan, size_t len,
>>> + void *rc_src, dma_addr_t ep_dst,
>>> + struct ntb_queue_entry *entry)
>>> +{
>>> + struct scatterlist *sgl = &entry->sgl;
>>> + struct dma_async_tx_descriptor *txd;
>>> + struct dma_slave_config cfg;
>>> + dma_cookie_t cookie;
>>> + int nents, rc;
>>> +
>>> + if (!d)
>>> + return -ENODEV;
>>> +
>>> + if (!chan)
>>> + return -ENXIO;
>>> +
>>> + if (WARN_ON(!rc_src || !ep_dst))
>>> + return -EINVAL;
>>> +
>>> + if (WARN_ON(sg_dma_address(sgl)))
>>> + return -EINVAL;
>>> +
>>> + sg_init_one(sgl, rc_src, len);
>>> + nents = dma_map_sg(d, sgl, 1, DMA_TO_DEVICE);
>>> + if (nents <= 0)
>>> + return -EIO;
>>> +
>>> + memset(&cfg, 0, sizeof(cfg));
>>> + cfg.dst_addr = ep_dst;
>>> + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
>>> + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
>>> + cfg.direction = DMA_MEM_TO_DEV;
>>> + rc = dmaengine_slave_config(chan, &cfg);
>>> + if (rc)
>>> + goto out_unmap;
>>> +
>>> + txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_MEM_TO_DEV,
>>> + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
>>> + if (!txd) {
>>> + rc = -EIO;
>>> + goto out_unmap;
>>> + }
>>> +
>>> + txd->callback_result = ntb_transport_edma_rc_read_cb;
>>> + txd->callback_param = entry;
>>> +
>>> + cookie = dmaengine_submit(txd);
>>> + if (dma_submit_error(cookie)) {
>>> + rc = -EIO;
>>> + goto out_unmap;
>>> + }
>>> + dma_async_issue_pending(chan);
>>> + return 0;
>>> +out_unmap:
>>> + dma_unmap_sg(d, sgl, 1, DMA_TO_DEVICE);
>>> + return rc;
>>> +}
>>> +
>>> +static void ntb_transport_edma_rc_dma_work(struct work_struct *work)
>>> +{
>>> + struct ntb_queue_entry *entry = container_of(
>>> + work, struct ntb_queue_entry, dma_work);
>>> + struct ntb_transport_qp *qp = entry->qp;
>>> + struct ntb_transport_ctx *nt = qp->transport;
>>> + struct device *dma_dev = get_dma_dev(qp->ndev);
>>> + struct dma_chan *chan;
>>> + int rc;
>>> +
>>> + chan = ntb_edma_pick_chan(&nt->edma, REMOTE_EDMA_WRITE);
>>> + rc = ntb_transport_edma_rc_write_start(dma_dev, chan, entry->len,
>>> + entry->addr, entry->buf, entry);
>>> + if (rc) {
>>> + entry->errors++;
>>> + entry->len = -EIO;
>>> + entry->flags |= DESC_DONE_FLAG;
>>> + queue_work(nt->wq, &qp->write_work);
>>> + return;
>>> + }
>>> +}
>>> +
>>> +static void ntb_transport_edma_rc_poll(struct ntb_transport_qp *qp)
>>> +{
>>> + struct ntb_transport_ctx *nt = qp->transport;
>>> + unsigned int budget = NTB_EDMA_MAX_POLL;
>>> + struct ntb_queue_entry *entry;
>>> + struct ntb_edma_desc *in;
>>> + dma_addr_t ep_src;
>>> + u32 len, idx;
>>> +
>>> + while (budget--) {
>>> + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_HEAD_WR_RC_I(qp)),
>>> + qp->wr_issue) == 0)
>>> + break;
>>> +
>>> + idx = ntb_edma_ring_idx(qp->wr_issue);
>>> + in = NTB_DESC_WR_RC_I(qp, idx);
>>> +
>>> + len = READ_ONCE(in->len);
>>> + ep_src = (dma_addr_t)READ_ONCE(in->addr);
>>> +
>>> + /* Prepare 'entry' for write completion */
>>> + entry = ntb_list_rm(&qp->ntb_rx_q_lock, &qp->rx_pend_q);
>>> + if (!entry) {
>>> + qp->rx_err_no_buf++;
>>> + break;
>>> + }
>>> + if (WARN_ON(entry->flags & DESC_DONE_FLAG))
>>> + entry->flags &= ~DESC_DONE_FLAG;
>>> + entry->len = len; /* NB. entry->len can be <=0 */
>>> + entry->addr = ep_src;
>>> +
>>> + /*
>>> + * ntb_transport_edma_rc_write_complete_work() checks entry->flags
>>> + * so it needs to be set before wr_issue++.
>>> + */
>>> + in->data = (uintptr_t)entry;
>>> +
>>> + /* Ensure in->data visible before wr_issue++ */
>>> + smp_wmb();
>>> +
>>> + WRITE_ONCE(qp->wr_issue, qp->wr_issue + 1);
>>> +
>>> + if (!len) {
>>> + entry->flags |= DESC_DONE_FLAG;
>>> + queue_work(nt->wq, &qp->write_work);
>>> + continue;
>>> + }
>>> +
>>> + if (in->flags & LINK_DOWN_FLAG) {
>>> + dev_dbg(&qp->ndev->pdev->dev, "link down flag set\n");
>>> + entry->flags |= DESC_DONE_FLAG | LINK_DOWN_FLAG;
>>> + queue_work(nt->wq, &qp->write_work);
>>> + continue;
>>> + }
>>> +
>>> + queue_work(nt->wq, &entry->dma_work);
>>> + }
>>> +
>>> + if (!budget)
>>> + tasklet_schedule(&qp->rxc_db_work);
>>> +}
>>> +
>>> +static int ntb_transport_edma_rc_tx_enqueue(struct ntb_transport_qp *qp,
>>> + struct ntb_queue_entry *entry)
>>> +{
>>> + struct device *dma_dev = get_dma_dev(qp->ndev);
>>> + struct ntb_transport_ctx *nt = qp->transport;
>>> + struct ntb_edma_desc *in, __iomem *out;
>>> + unsigned int len = entry->len;
>>> + struct dma_chan *chan;
>>> + u32 issue, idx, head;
>>> + dma_addr_t ep_dst;
>>> + int rc;
>>> +
>>> + WARN_ON_ONCE(entry->flags & DESC_DONE_FLAG);
>>> +
>>> + scoped_guard(spinlock_irqsave, &qp->rc_lock) {
>>> + head = READ_ONCE(*NTB_HEAD_RD_RC_I(qp));
>>> + issue = qp->rd_issue;
>>> + if (ntb_edma_ring_used_entry(head, issue) == 0) {
>>> + qp->tx_ring_full++;
>>> + return -ENOSPC;
>>> + }
>>> +
>>> + /*
>>> + * ntb_transport_edma_rc_read_complete_work() checks entry->flags
>>> + * so it needs to be set before rd_issue++.
>>> + */
>>> + idx = ntb_edma_ring_idx(issue);
>>> + in = NTB_DESC_RD_RC_I(qp, idx);
>>> + in->data = (uintptr_t)entry;
>>> +
>>> + /* Make in->data visible before rd_issue++ */
>>> + smp_wmb();
>>> +
>>> + WRITE_ONCE(qp->rd_issue, qp->rd_issue + 1);
>>> + }
>>> +
>>> + /* Publish the final transfer length to the EP side */
>>> + out = NTB_DESC_RD_RC_O(qp, idx);
>>> + iowrite32(len, &out->len);
>>> + ioread32(&out->len);
>>> +
>>> + if (unlikely(!len)) {
>>> + entry->flags |= DESC_DONE_FLAG;
>>> + queue_work(nt->wq, &qp->read_work);
>>> + return 0;
>>> + }
>>> +
>>> + /* Paired with dma_wmb() in ntb_transport_edma_ep_rx_enqueue() */
>>> + dma_rmb();
>>> +
>>> + /* kick remote eDMA read transfer */
>>> + ep_dst = (dma_addr_t)in->addr;
>>> + chan = ntb_edma_pick_chan(&nt->edma, REMOTE_EDMA_READ);
>>> + rc = ntb_transport_edma_rc_read_start(dma_dev, chan, len,
>>> + entry->buf, ep_dst, entry);
>>> + if (rc) {
>>> + entry->errors++;
>>> + entry->len = -EIO;
>>> + entry->flags |= DESC_DONE_FLAG;
>>> + queue_work(nt->wq, &qp->read_work);
>>> + }
>>> + return 0;
>>> +}
>>> +
>>> +static int ntb_transport_edma_ep_tx_enqueue(struct ntb_transport_qp *qp,
>>> + struct ntb_queue_entry *entry)
>>> +{
>>> + struct device *dma_dev = get_dma_dev(qp->ndev);
>>> + struct ntb_edma_desc *in, __iomem *out;
>>> + unsigned int len = entry->len;
>>> + dma_addr_t ep_src = 0;
>>> + u32 idx;
>>> + int rc;
>>> +
>>> + if (likely(len)) {
>>> + ep_src = dma_map_single(dma_dev, entry->buf, len,
>>> + DMA_TO_DEVICE);
>>> + rc = dma_mapping_error(dma_dev, ep_src);
>>> + if (rc)
>>> + return rc;
>>> + }
>>> +
>>> + scoped_guard(spinlock_irqsave, &qp->ep_tx_lock) {
>>> + if (ntb_edma_ring_full(qp->wr_prod, qp->wr_cons)) {
>>> + rc = -ENOSPC;
>>> + qp->tx_ring_full++;
>>> + goto out_unmap;
>>> + }
>>> +
>>> + idx = ntb_edma_ring_idx(qp->wr_prod);
>>> + in = NTB_DESC_WR_EP_I(qp, idx);
>>> + out = NTB_DESC_WR_EP_O(qp, idx);
>>> +
>>> + WARN_ON(in->flags & DESC_DONE_FLAG);
>>> + WARN_ON(entry->flags & DESC_DONE_FLAG);
>>> + in->flags = 0;
>>> + in->data = (uintptr_t)entry;
>>> + entry->addr = ep_src;
>>> +
>>> + iowrite32(len, &out->len);
>>> + iowrite32(entry->flags, &out->flags);
>>> + iowrite64(ep_src, &out->addr);
>>> + WRITE_ONCE(qp->wr_prod, qp->wr_prod + 1);
>>> +
>>> + dma_wmb();
>>> + iowrite32(qp->wr_prod, NTB_HEAD_WR_EP_O(qp));
>>> +
>>> + qp->tx_bytes += len;
>>> + qp->tx_pkts++;
>>> + }
>>> +
>>> + ntb_transport_edma_notify_peer(qp);
>>> +
>>> + return 0;
>>> +out_unmap:
>>> + if (likely(len))
>>> + dma_unmap_single(dma_dev, ep_src, len, DMA_TO_DEVICE);
>>> + return rc;
>>> +}
>>> +
>>> +static int ntb_transport_edma_tx_enqueue(struct ntb_transport_qp *qp,
>>> + struct ntb_queue_entry *entry,
>>> + void *cb, void *data, unsigned int len,
>>> + unsigned int flags)
>>> +{
>>> + struct device *dma_dev;
>>> +
>>> + if (entry->addr) {
>>> + /* Deferred unmap */
>>> + dma_dev = get_dma_dev(qp->ndev);
>>> + dma_unmap_single(dma_dev, entry->addr, entry->len, DMA_TO_DEVICE);
>>> + }
>>> +
>>> + entry->cb_data = cb;
>>> + entry->buf = data;
>>> + entry->len = len;
>>> + entry->flags = flags;
>>> + entry->errors = 0;
>>> + entry->addr = 0;
>>> +
>>> + WARN_ON_ONCE(!ntb_qp_edma_enabled(qp));
>>> +
>>> + if (ntb_qp_edma_is_ep(qp))
>>> + return ntb_transport_edma_ep_tx_enqueue(qp, entry);
>>> + else
>>> + return ntb_transport_edma_rc_tx_enqueue(qp, entry);
>>> +}
>>> +
>>> +static int ntb_transport_edma_ep_rx_enqueue(struct ntb_transport_qp *qp,
>>> + struct ntb_queue_entry *entry)
>>> +{
>>> + struct device *dma_dev = get_dma_dev(qp->ndev);
>>> + struct ntb_edma_desc *in, __iomem *out;
>>> + unsigned int len = entry->len;
>>> + void *data = entry->buf;
>>> + dma_addr_t ep_dst;
>>> + u32 idx;
>>> + int rc;
>>> +
>>> + ep_dst = dma_map_single(dma_dev, data, len, DMA_FROM_DEVICE);
>>> + rc = dma_mapping_error(dma_dev, ep_dst);
>>> + if (rc)
>>> + return rc;
>>> +
>>> + scoped_guard(spinlock_bh, &qp->ep_rx_lock) {
>>> + if (ntb_edma_ring_full(READ_ONCE(qp->rd_prod),
>>> + READ_ONCE(qp->rd_cons))) {
>>> + rc = -ENOSPC;
>>> + goto out_unmap;
>>> + }
>>> +
>>> + idx = ntb_edma_ring_idx(qp->rd_prod);
>>> + in = NTB_DESC_RD_EP_I(qp, idx);
>>> + out = NTB_DESC_RD_EP_O(qp, idx);
>>> +
>>> + iowrite32(len, &out->len);
>>> + iowrite64(ep_dst, &out->addr);
>>> +
>>> + WARN_ON(in->flags & DESC_DONE_FLAG);
>>> + in->data = (uintptr_t)entry;
>>> + entry->addr = ep_dst;
>>> +
>>> + /* Ensure len/addr are visible before the head update */
>>> + dma_wmb();
>>> +
>>> + WRITE_ONCE(qp->rd_prod, qp->rd_prod + 1);
>>> + iowrite32(qp->rd_prod, NTB_HEAD_RD_EP_O(qp));
>>> + }
>>> + return 0;
>>> +out_unmap:
>>> + dma_unmap_single(dma_dev, ep_dst, len, DMA_FROM_DEVICE);
>>> + return rc;
>>> +}
>>> +
>>> +static int ntb_transport_edma_rx_enqueue(struct ntb_transport_qp *qp,
>>> + struct ntb_queue_entry *entry)
>>> +{
>>> + int rc;
>>> +
>>> + /* The behaviour is the same as the default backend for RC side */
>>> + if (ntb_qp_edma_is_ep(qp)) {
>>> + rc = ntb_transport_edma_ep_rx_enqueue(qp, entry);
>>> + if (rc) {
>>> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
>>> + &qp->rx_free_q);
>>> + return rc;
>>> + }
>>> + }
>>> +
>>> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_pend_q);
>>> +
>>> + if (qp->active)
>>> + tasklet_schedule(&qp->rxc_db_work);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static void ntb_transport_edma_rx_poll(struct ntb_transport_qp *qp)
>>> +{
>>> + struct ntb_transport_ctx *nt = qp->transport;
>>> +
>>> + if (ntb_qp_edma_is_rc(qp))
>>> + ntb_transport_edma_rc_poll(qp);
>>> + else if (ntb_qp_edma_is_ep(qp)) {
>>> + /*
>>> + * Make sure we poll the rings even if an eDMA interrupt is
>>> + * cleared on the RC side earlier.
>>> + */
>>> + queue_work(nt->wq, &qp->read_work);
>>> + queue_work(nt->wq, &qp->write_work);
>>> + } else
>>> + /* Unreachable */
>>> + WARN_ON_ONCE(1);
>>> +}
>>> +
>>> +static void ntb_transport_edma_read_work(struct work_struct *work)
>>> +{
>>> + struct ntb_transport_qp *qp = container_of(
>>> + work, struct ntb_transport_qp, read_work);
>>> +
>>> + if (ntb_qp_edma_is_rc(qp))
>>> + ntb_transport_edma_rc_read_complete_work(work);
>>> + else if (ntb_qp_edma_is_ep(qp))
>>> + ntb_transport_edma_ep_read_work(work);
>>> + else
>>> + /* Unreachable */
>>> + WARN_ON_ONCE(1);
>>> +}
>>> +
>>> +static void ntb_transport_edma_write_work(struct work_struct *work)
>>> +{
>>> + struct ntb_transport_qp *qp = container_of(
>>> + work, struct ntb_transport_qp, write_work);
>>> +
>>> + if (ntb_qp_edma_is_rc(qp))
>>> + ntb_transport_edma_rc_write_complete_work(work);
>>> + else if (ntb_qp_edma_is_ep(qp))
>>> + ntb_transport_edma_ep_write_work(work);
>>> + else
>>> + /* Unreachable */
>>> + WARN_ON_ONCE(1);
>>> +}
>>> +
>>> +static void ntb_transport_edma_init_queue(struct ntb_transport_ctx *nt,
>>> + unsigned int qp_num)
>>> +{
>>> + struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
>>> +
>>> + qp->wr_cons = 0;
>>> + qp->rd_cons = 0;
>>> + qp->wr_prod = 0;
>>> + qp->rd_prod = 0;
>>> + qp->wr_issue = 0;
>>> + qp->rd_issue = 0;
>>> +
>>> + INIT_WORK(&qp->db_work, ntb_transport_edma_db_work);
>>> + INIT_WORK(&qp->read_work, ntb_transport_edma_read_work);
>>> + INIT_WORK(&qp->write_work, ntb_transport_edma_write_work);
>>> +}
>>> +
>>> +static void ntb_transport_edma_create_queue(struct ntb_transport_ctx *nt,
>>> + struct ntb_transport_qp *qp)
>>> +{
>>> + spin_lock_init(&qp->ep_tx_lock);
>>> + spin_lock_init(&qp->ep_rx_lock);
>>> + spin_lock_init(&qp->rc_lock);
>>> +}
>>> +
>>> +static const struct ntb_transport_backend_ops edma_backend_ops = {
>>> + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
>>> + .tx_free_entry = ntb_transport_edma_tx_free_entry,
>>> + .tx_enqueue = ntb_transport_edma_tx_enqueue,
>>> + .rx_enqueue = ntb_transport_edma_rx_enqueue,
>>> + .rx_poll = ntb_transport_edma_rx_poll,
>>> + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
>>> +};
>>> +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
>>> +
>>> /**
>>> * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
>>> * @qp: NTB transport layer queue to be enabled
>>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-02 9:32 ` Niklas Cassel
@ 2025-12-02 15:20 ` Frank Li
2025-12-03 8:40 ` Koichiro Den
1 sibling, 0 replies; 97+ messages in thread
From: Frank Li @ 2025-12-02 15:20 UTC (permalink / raw)
To: Niklas Cassel
Cc: Koichiro Den, ntb, linux-pci, dmaengine, linux-kernel, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring, Damien Le Moal
On Tue, Dec 02, 2025 at 10:32:19AM +0100, Niklas Cassel wrote:
> Hello Koichiro,
>
> On Tue, Dec 02, 2025 at 03:35:36PM +0900, Koichiro Den wrote:
> > On Mon, Dec 01, 2025 at 03:41:38PM -0500, Frank Li wrote:
> > > On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > > > dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> > > > for the MSI target address on every interrupt and tears it down again
> > > > via dw_pcie_ep_unmap_addr().
> > > >
> > > > On systems that heavily use the AXI bridge interface (for example when
> > > > the integrated eDMA engine is active), this means the outbound iATU
> > > > registers are updated while traffic is in flight. The DesignWare
> > > > endpoint spec warns that updating iATU registers in this situation is
> > > > not supported, and the behavior is undefined.
> > > >
> > > > Under high MSI and eDMA load this pattern results in occasional bogus
> > > > outbound transactions and IOMMU faults such as:
> > > >
> > > > ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
> > > >
> > >
> > > I agree there is no need to map/unmap the MSI window every time. But I think
> > > there is a logic problem behind this. An IOMMU fault means the page table
> > > entry has already been removed, yet something still tries to access it after
> > > that. You'd better find out where the MSI memory is accessed after
> > > dw_pcie_ep_unmap_addr().
> >
> > I don't see any other callers that access the MSI region after
> > dw_pcie_ep_unmap_addr(), but I might be missing something. Also, even if I
> > serialize dw_pcie_ep_raise_msi_irq() invocations, the problem still
> > appears.
> >
> > A couple of details I forgot to describe in the commit message:
> > (1). The IOMMU error is only reported on the RC side.
> > (2). Sometimes there is no IOMMU error printed and the board just freezes (becomes unresponsive).
> >
> > The faulting iova, 0xfe000000, is the base of "addr_space" for R-Car S4 in
> > EP mode:
> > https://github.com/jonmason/ntb/blob/68113d260674/arch/arm64/boot/dts/renesas/r8a779f0.dtsi#L847
> >
> > So it looks like the EP sometimes issues an MWr at "addr_space" base (offset 0),
> > the RC forwards it to its IOMMU (IPMMUHC) and that faults. My working theory
> > is that when the iATU registers are updated under heavy DMA load, the DAR of
> > some in-flight transfer can get corrupted to 0xfe000000. That would match one
> > possible symptom of the undefined behaviour that the DW EPC spec warns about
> > when changing iATU configuration under load.
>
> For your information, in the NVMe PCI EPF driver:
> https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L389-L429
>
> We take a mutex around the dmaengine_slave_config() and dma_sync_wait() calls.
> Without the mutex, with multiple outstanding transfers, a
> dmaengine_slave_config() call (which specifies the src/dst address) would
> affect the other in-flight DMA transfers, leading to corruption because of
> invalid src/dst addresses.
>
> Having a mutex so that we can only have one outstanding transfer solves these
> issues, but is obviously very bad for performance.
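The serialization pattern described there boils down to roughly the following (a simplified sketch, not the actual pci-epf.c code; the epf_dma structure and the function name are made up for illustration):

struct epf_dma {
	struct dma_chan *chan;
	struct mutex lock;
};

static int epf_dma_transfer_one(struct epf_dma *d, struct scatterlist *sgl,
				dma_addr_t dev_addr,
				enum dma_transfer_direction dir)
{
	struct dma_slave_config cfg = { .direction = dir };
	struct dma_async_tx_descriptor *txd;
	dma_cookie_t cookie;
	int ret;

	if (dir == DMA_DEV_TO_MEM)
		cfg.src_addr = dev_addr;
	else
		cfg.dst_addr = dev_addr;

	/* Only one outstanding transfer: the slave config is global to the channel. */
	mutex_lock(&d->lock);

	ret = dmaengine_slave_config(d->chan, &cfg);
	if (ret)
		goto out;

	txd = dmaengine_prep_slave_sg(d->chan, sgl, 1, dir, DMA_CTRL_ACK);
	if (!txd) {
		ret = -EIO;
		goto out;
	}

	cookie = dmaengine_submit(txd);
	ret = dma_submit_error(cookie);
	if (ret)
		goto out;

	dma_async_issue_pending(d->chan);
	ret = dma_sync_wait(d->chan, cookie) == DMA_COMPLETE ? 0 : -EIO;
out:
	mutex_unlock(&d->lock);
	return ret;
}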
>
>
> I did try to add DMA_MEMCPY support to the dw-edma driver:
> https://lore.kernel.org/linux-pci/20241217160448.199310-4-cassel@kernel.org/
>
> That would allow us to specify both the src and dst address in a single
> dmaengine function call (so that we would no longer need a mutex).
> However, because the eDMA hardware (at least for EDMA_LEGACY_UNROLL) does not
> support PCI-to-PCI transfers, only PCI to local DDR or local DDR to PCI,
> using prep_memcpy() is wrong, as it does not take a direction:
> https://lore.kernel.org/linux-pci/Z4jf2s5SaUu3wdJi@ryzen/
>
> If we want to improve the dw-edma driver, so that an EPF driver can have
> multiple outstanding transfers, I think the best way forward would be to create
> a new _prep_slave_memcpy() or similar, that does take a direction, and thus
> does not require dmaengine_slave_config() to be called before every
> _prep_slave_memcpy() call.
Thanks, we are also considering putting the slave config information into
prep_sg(single), i.e.

dmaengine_prep_slave_sg(single)_config(..., struct dma_slave_config *config)

to simplify error handling in the DMA transfer functions:

dmaengine_prep_slave_single_config(
	struct dma_chan *chan, size_t size,
	unsigned long flags, struct dma_slave_config *config)

config already includes the src and dst addresses.
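A rough sketch of how such a helper could look and be used on the caller side (hypothetical; this API does not exist today, the final signature may differ, and ep_src/rc_dst just follow the naming used in the patch above):

struct dma_async_tx_descriptor *
dmaengine_prep_slave_single_config(struct dma_chan *chan, dma_addr_t buf,
				   size_t len, unsigned long flags,
				   struct dma_slave_config *config);

/* caller side: remote eDMA write transfer (data flows EP memory -> RC buffer) */
struct dma_slave_config cfg = {
	.direction      = DMA_DEV_TO_MEM,
	.src_addr       = ep_src,
	.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES,
	.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES,
};

txd = dmaengine_prep_slave_single_config(chan, rc_dst, len,
					  DMA_CTRL_ACK | DMA_PREP_INTERRUPT,
					  &cfg);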
Frank
>
>
> Kind regards,
> Niklas
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-12-02 6:43 ` Koichiro Den
@ 2025-12-02 15:42 ` Frank Li
2025-12-03 8:53 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-02 15:42 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Tue, Dec 02, 2025 at 03:43:10PM +0900, Koichiro Den wrote:
> On Mon, Dec 01, 2025 at 04:41:05PM -0500, Frank Li wrote:
> > On Sun, Nov 30, 2025 at 01:03:58AM +0900, Koichiro Den wrote:
> > > Add a new transport backend that uses a remote DesignWare eDMA engine
> > > located on the NTB endpoint to move data between host and endpoint.
> > >
...
> > > +#include "ntb_edma.h"
> > > +
> > > +/*
> > > + * The interrupt register offsets below are taken from the DesignWare
> > > + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> > > + * backend currently only supports this layout.
> > > + */
> > > +#define DMA_WRITE_INT_STATUS_OFF 0x4c
> > > +#define DMA_WRITE_INT_MASK_OFF 0x54
> > > +#define DMA_WRITE_INT_CLEAR_OFF 0x58
> > > +#define DMA_READ_INT_STATUS_OFF 0xa0
> > > +#define DMA_READ_INT_MASK_OFF 0xa8
> > > +#define DMA_READ_INT_CLEAR_OFF 0xac
> >
> > Not sure why you need to access the eDMA registers directly, since the eDMA
> > driver is already exposed as a dmaengine driver.
>
> These are intended for EP use. In my current design I intentionally don't
> use the standard dw-edma dmaengine driver on the EP side.
why not?
>
> >
> > > +
> > > +#define NTB_EDMA_NOTIFY_MAX_QP 64
> > > +
...
> > > +
> > > + virq = irq_create_fwspec_mapping(&fwspec);
> > > + of_node_put(parent);
> > > + return (virq > 0) ? virq : -EINVAL;
> > > +}
> > > +
> > > +static irqreturn_t ntb_edma_isr(int irq, void *data)
> > > +{
> >
> > Not sure why dw_edma_interrupt_write/read() doesn't work for your case. I
> > suppose you could just register a callback with the dmaengine.
>
> If we ran dw_edma_probe() on both the EP and RC sides and let the dmaengine
> callbacks handle int_status/int_clear, I think we could hit races. One side
> might clear a status bit before the other side has a chance to see it and
> invoke its callback. Please correct me if I'm missing something here.
Shouldn't you use different channels for the two sides?
>
> To avoid that, in my current implementation, the RC side handles the
> status/int_clear registers in the usual way, and the EP side only tries to
> suppress needless edma_int as much as possible.
>
> That said, I'm now wondering if it would be better to set LIE=0/RIE=1 for
> the DMA transfer channels and LIE=1/RIE=0 for the notification channel.
> That would require some changes to the dw-edma core.
If dw-edma works as a remote DMA engine, it should enable RIE, like
dw-edma-pcie.c does, though no one has actually used that path recently.
Using eDMA as a doorbell would be a new use case, and I think it is quite useful.
> >
> > > + struct ntb_edma_interrupt *v = data;
> > > + u32 mask = BIT(EDMA_RD_CH_NUM);
> > > + u32 i, val;
> > > +
...
> > > + ret = dw_edma_probe(chip);
> >
> > I think dw_edma_probe() should be in ntb_hw_epf.c, which would provide the
> > dmaengine support.
> >
> > On the EP side, the default dwc controller driver has presumably already set
> > up the eDMA engine, so with the correct filter function you should get a DMA
> > channel.
>
> I intentionally hid the eDMA for the EP side in the .dts patch in
> [RFC PATCH v2 26/27] so that only the RC side manages the eDMA (remotely),
> avoiding the potential race condition I mentioned above.
Improve the eDMA core to support some DMA channels working locally and some
remotely.
Frank
>
> Thanks for reviewing,
> Koichiro
>
> >
> > Frank
> >
> > > + if (ret) {
> > > + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> > > + return ret;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
...
> > > +{
> > > + spin_lock_init(&qp->ep_tx_lock);
> > > + spin_lock_init(&qp->ep_rx_lock);
> > > + spin_lock_init(&qp->rc_lock);
> > > +}
> > > +
> > > +static const struct ntb_transport_backend_ops edma_backend_ops = {
> > > + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> > > + .tx_free_entry = ntb_transport_edma_tx_free_entry,
> > > + .tx_enqueue = ntb_transport_edma_tx_enqueue,
> > > + .rx_enqueue = ntb_transport_edma_rx_enqueue,
> > > + .rx_poll = ntb_transport_edma_rx_poll,
> > > + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> > > +};
> > > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > > +
> > > /**
> > > * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> > > * @qp: NTB transport layer queue to be enabled
> > > --
> > > 2.48.1
> > >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 04/27] PCI: endpoint: Add inbound mapping ops to EPC core
2025-12-02 6:25 ` Koichiro Den
@ 2025-12-02 15:58 ` Frank Li
2025-12-03 14:12 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-02 15:58 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Tue, Dec 02, 2025 at 03:25:31PM +0900, Koichiro Den wrote:
> On Mon, Dec 01, 2025 at 02:19:55PM -0500, Frank Li wrote:
> > On Sun, Nov 30, 2025 at 01:03:42AM +0900, Koichiro Den wrote:
> > > Add new EPC ops map_inbound() and unmap_inbound() for mapping a subrange
> > > of a BAR into CPU space. These will be implemented by controller drivers
> > > such as DesignWare.
> > >
> > > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > > ---
> > > drivers/pci/endpoint/pci-epc-core.c | 44 +++++++++++++++++++++++++++++
> > > include/linux/pci-epc.h | 11 ++++++++
> > > 2 files changed, 55 insertions(+)
> > >
> > > diff --git a/drivers/pci/endpoint/pci-epc-core.c b/drivers/pci/endpoint/pci-epc-core.c
> > > index ca7f19cc973a..825109e54ba9 100644
> > > --- a/drivers/pci/endpoint/pci-epc-core.c
> > > +++ b/drivers/pci/endpoint/pci-epc-core.c
> > > @@ -444,6 +444,50 @@ int pci_epc_map_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > }
> > > EXPORT_SYMBOL_GPL(pci_epc_map_addr);
> > >
> > > +/**
> > > + * pci_epc_map_inbound() - map a BAR subrange to the local CPU address
> > > + * @epc: the EPC device on which BAR has to be configured
> > > + * @func_no: the physical endpoint function number in the EPC device
> > > + * @vfunc_no: the virtual endpoint function number in the physical function
> > > + * @epf_bar: the struct epf_bar that contains the BAR information
> > > + * @offset: byte offset from the BAR base selected by the host
> > > + *
> > > + * Invoke to configure the BAR of the endpoint device and map a subrange
> > > + * selected by @offset to a CPU address.
> > > + *
> > > + * Returns 0 on success, -EOPNOTSUPP if unsupported, or a negative errno.
> > > + */
> > > +int pci_epc_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > + struct pci_epf_bar *epf_bar, u64 offset)
> >
> > Shouldn't this need size information? If the BAR's size is 4G, you may want
> > to map just 0x4000 to 0x5000, not the whole range from the offset to the
> > end of the BAR.
>
> That sounds reasonable; the interface should accept a size parameter so that it
> is flexible enough to configure arbitrary sub-ranges inside a BAR, instead of
> implicitly using "offset to end of BAR".
>
> For the ntb_transport use_remote_edma=1 testing on R-Car S4 I only needed at
> most two sub-ranges inside one BAR, so a size parameter was not strictly
> necessary in that setup, but I agree that the current interface looks
> half-baked and not very generic. I'll extend it to take size as well.
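For illustration, the extended prototype could look like this (a sketch only, not the final form):

int pci_epc_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
			struct pci_epf_bar *epf_bar, u64 offset, size_t size);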
>
> >
> > The commit message says this maps into CPU space; which CPU address is used?
>
> The interface currently requires a pointer to a struct pci_epf_bar instance and
> uses its phys_addr field as the CPU physical base address.
> I'm not fully convinced that using struct pci_epf_bar this way is the cleanest
> approach, so I'm open to better suggestions if you have any.
struct pci_epf_bar already has phys_addr and size information. Something like

pci_epc_set_bars(..., struct pci_epf_bar *epf_bar, size_t num_of_bar)

could set several memory regions into one BAR space; when num_of_bar is 1, it
falls back to the existing pci_epc_set_bar().

If there is an IOMMU in the EP system, it may be easier to use the IOMMU to map
to a different place.
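A rough usage sketch of that idea (hypothetical helper, not an existing EPC API; the sub-region names and sizes are made up):

/* Two EP-memory sub-regions exposed behind a single BAR. */
struct pci_epf_bar sub[2] = {
	{ .barno = BAR_2, .phys_addr = edma_ll_phys,  .size = SZ_64K },
	{ .barno = BAR_2, .phys_addr = edma_reg_phys, .size = SZ_4K  },
};

ret = pci_epc_set_bars(epc, epf->func_no, epf->vfunc_no, sub, ARRAY_SIZE(sub));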
Frank
>
> Koichiro
>
> >
> > Frank
> > > +{
> > > + if (!epc || !epc->ops || !epc->ops->map_inbound)
> > > + return -EOPNOTSUPP;
> > > +
> > > + return epc->ops->map_inbound(epc, func_no, vfunc_no, epf_bar, offset);
> > > +}
> > > +EXPORT_SYMBOL_GPL(pci_epc_map_inbound);
> > > +
> > > +/**
> > > + * pci_epc_unmap_inbound() - unmap a previously mapped BAR subrange
> > > + * @epc: the EPC device on which the inbound mapping was programmed
> > > + * @func_no: the physical endpoint function number in the EPC device
> > > + * @vfunc_no: the virtual endpoint function number in the physical function
> > > + * @epf_bar: the struct epf_bar used when the mapping was created
> > > + * @offset: byte offset from the BAR base that was mapped
> > > + *
> > > + * Invoke to remove a BAR subrange mapping created by pci_epc_map_inbound().
> > > + * If the controller has no support, this call is a no-op.
> > > + */
> > > +void pci_epc_unmap_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > + struct pci_epf_bar *epf_bar, u64 offset)
> > > +{
> > > + if (!epc || !epc->ops || !epc->ops->unmap_inbound)
> > > + return;
> > > +
> > > + epc->ops->unmap_inbound(epc, func_no, vfunc_no, epf_bar, offset);
> > > +}
> > > +EXPORT_SYMBOL_GPL(pci_epc_unmap_inbound);
> > > +
> > > /**
> > > * pci_epc_mem_map() - allocate and map a PCI address to a CPU address
> > > * @epc: the EPC device on which the CPU address is to be allocated and mapped
> > > diff --git a/include/linux/pci-epc.h b/include/linux/pci-epc.h
> > > index 4286bfdbfdfa..a5fb91cc2982 100644
> > > --- a/include/linux/pci-epc.h
> > > +++ b/include/linux/pci-epc.h
> > > @@ -71,6 +71,8 @@ struct pci_epc_map {
> > > * region
> > > * @map_addr: ops to map CPU address to PCI address
> > > * @unmap_addr: ops to unmap CPU address and PCI address
> > > + * @map_inbound: ops to map a subrange inside a BAR to CPU address.
> > > + * @unmap_inbound: ops to unmap a subrange inside a BAR and CPU address.
> > > * @set_msi: ops to set the requested number of MSI interrupts in the MSI
> > > * capability register
> > > * @get_msi: ops to get the number of MSI interrupts allocated by the RC from
> > > @@ -99,6 +101,10 @@ struct pci_epc_ops {
> > > phys_addr_t addr, u64 pci_addr, size_t size);
> > > void (*unmap_addr)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > phys_addr_t addr);
> > > + int (*map_inbound)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > + struct pci_epf_bar *epf_bar, u64 offset);
> > > + void (*unmap_inbound)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > + struct pci_epf_bar *epf_bar, u64 offset);
> > > int (*set_msi)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > u8 nr_irqs);
> > > int (*get_msi)(struct pci_epc *epc, u8 func_no, u8 vfunc_no);
> > > @@ -286,6 +292,11 @@ int pci_epc_map_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > u64 pci_addr, size_t size);
> > > void pci_epc_unmap_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > phys_addr_t phys_addr);
> > > +
> > > +int pci_epc_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > + struct pci_epf_bar *epf_bar, u64 offset);
> > > +void pci_epc_unmap_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > + struct pci_epf_bar *epf_bar, u64 offset);
> > > int pci_epc_set_msi(struct pci_epc *epc, u8 func_no, u8 vfunc_no, u8 nr_irqs);
> > > int pci_epc_get_msi(struct pci_epc *epc, u8 func_no, u8 vfunc_no);
> > > int pci_epc_set_msix(struct pci_epc *epc, u8 func_no, u8 vfunc_no, u16 nr_irqs,
> > > --
> > > 2.48.1
> > >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA
2025-12-02 6:20 ` Koichiro Den
@ 2025-12-02 16:07 ` Frank Li
2025-12-03 8:43 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-02 16:07 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Tue, Dec 02, 2025 at 03:20:01PM +0900, Koichiro Den wrote:
> On Mon, Dec 01, 2025 at 05:02:57PM -0500, Frank Li wrote:
> > On Sun, Nov 30, 2025 at 01:03:38AM +0900, Koichiro Den wrote:
> > > Hi,
> > >
> > > This is RFC v2 of the NTB/PCI series for Renesas R-Car S4. The ultimate
> > > goal is unchanged, i.e. to improve performance between RC and EP
> > > (with vNTB) over ntb_transport, but the approach has changed drastically.
> > > Based on the feedback from Frank Li in the v1 thread, in particular:
> > > https://lore.kernel.org/all/aQEsip3TsPn4LJY9@lizhi-Precision-Tower-5810/
> > > this RFC v2 instead builds an NTB transport backed by remote eDMA
> > > architecture and reshapes the series around it. The RC->EP interruption
> > > is now achieved using a dedicated eDMA read channel, so the somewhat
> > > "hack"-ish approach in RFC v1 is no longer needed.
> > >
> > > Compared to RFC v1, this v2 series enables NTB transport backed by
> > > remote DW eDMA, so the current ntb_transport handling of Memory Window
> > > is no longer needed, and direct DMA transfers between EP and RC are
> > > used.
> > >
> > > I realize this is quite a large series. Sorry for the volume, but for
> > > the RFC stage I believe presenting the full picture in a single set
> > > helps with reviewing the overall architecture. Once the direction is
> > > agreed, I will respin it split by subsystem and topic.
> > >
> > >
> > ...
> > >
> > > - Before this change:
> > >
> > > * ping
> > > 64 bytes from 10.0.0.11: icmp_seq=1 ttl=64 time=12.3 ms
> > > 64 bytes from 10.0.0.11: icmp_seq=2 ttl=64 time=6.58 ms
> > > 64 bytes from 10.0.0.11: icmp_seq=3 ttl=64 time=1.26 ms
> > > 64 bytes from 10.0.0.11: icmp_seq=4 ttl=64 time=7.43 ms
> > > 64 bytes from 10.0.0.11: icmp_seq=5 ttl=64 time=1.39 ms
> > > 64 bytes from 10.0.0.11: icmp_seq=6 ttl=64 time=7.38 ms
> > > 64 bytes from 10.0.0.11: icmp_seq=7 ttl=64 time=1.42 ms
> > > 64 bytes from 10.0.0.11: icmp_seq=8 ttl=64 time=7.41 ms
> > >
> > > * RC->EP (`sudo iperf3 -ub0 -l 65480 -P 2`)
> > > [ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams
> > > [ 5] 0.00-10.01 sec 344 MBytes 288 Mbits/sec 3.483 ms 51/5555 (0.92%) receiver
> > > [ 6] 0.00-10.01 sec 342 MBytes 287 Mbits/sec 3.814 ms 38/5517 (0.69%) receiver
> > > [SUM] 0.00-10.01 sec 686 MBytes 575 Mbits/sec 3.648 ms 89/11072 (0.8%) receiver
> > >
> > > * EP->RC (`sudo iperf3 -ub0 -l 65480 -P 2`)
> > > [ 5] 0.00-10.03 sec 334 MBytes 279 Mbits/sec 3.164 ms 390/5731 (6.8%) receiver
> > > [ 6] 0.00-10.03 sec 334 MBytes 279 Mbits/sec 2.416 ms 396/5741 (6.9%) receiver
> > > [SUM] 0.00-10.03 sec 667 MBytes 558 Mbits/sec 2.790 ms 786/11472 (6.9%) receiver
> > >
> > > Note: with `-P 2`, the best total bitrate (receiver side) was achieved.
> > >
> > > - After this change (use_remote_edma=1) [1]:
> > >
> > > * ping
> > > 64 bytes from 10.0.0.11: icmp_seq=1 ttl=64 time=1.48 ms
> > > 64 bytes from 10.0.0.11: icmp_seq=2 ttl=64 time=1.03 ms
> > > 64 bytes from 10.0.0.11: icmp_seq=3 ttl=64 time=0.931 ms
> > > 64 bytes from 10.0.0.11: icmp_seq=4 ttl=64 time=0.910 ms
> > > 64 bytes from 10.0.0.11: icmp_seq=5 ttl=64 time=1.07 ms
> > > 64 bytes from 10.0.0.11: icmp_seq=6 ttl=64 time=0.986 ms
> > > 64 bytes from 10.0.0.11: icmp_seq=7 ttl=64 time=0.910 ms
> > > 64 bytes from 10.0.0.11: icmp_seq=8 ttl=64 time=0.883 ms
> > >
> > > * RC->EP (`sudo iperf3 -ub0 -l 65480 -P 4`)
> > > [ 5] 0.00-10.01 sec 3.54 GBytes 3.04 Gbits/sec 0.030 ms 0/58007 (0%) receiver
> > > [ 6] 0.00-10.01 sec 3.71 GBytes 3.19 Gbits/sec 0.453 ms 0/60909 (0%) receiver
> > > [ 9] 0.00-10.01 sec 3.85 GBytes 3.30 Gbits/sec 0.027 ms 0/63072 (0%) receiver
> > > [ 11] 0.00-10.01 sec 3.26 GBytes 2.80 Gbits/sec 0.070 ms 1/53512 (0.0019%) receiver
> > > [SUM] 0.00-10.01 sec 14.4 GBytes 12.3 Gbits/sec 0.145 ms 1/235500 (0.00042%) receiver
> > >
> > > * EP->RC (`sudo iperf3 -ub0 -l 65480 -P 4`)
> > > [ 5] 0.00-10.03 sec 3.40 GBytes 2.91 Gbits/sec 0.104 ms 15467/71208 (22%) receiver
> > > [ 6] 0.00-10.03 sec 3.08 GBytes 2.64 Gbits/sec 0.176 ms 12097/62609 (19%) receiver
> > > [ 9] 0.00-10.03 sec 3.38 GBytes 2.90 Gbits/sec 0.270 ms 17212/72710 (24%) receiver
> > > [ 11] 0.00-10.03 sec 2.56 GBytes 2.19 Gbits/sec 0.200 ms 11193/53090 (21%) receiver
> >
> > Almost 10x fast, 2.9G vs 279M? high light this one will bring more peopole
> > interesting about this topic.
>
> Thank you for the review!
>
> OK, I'll highlight this in the next iteration.
> By the way, my impression is that we can achieve even higher with this remote
> eDMA architecture.
eDMA can eliminate one memory copy and use a longer TLP data length. Previously,
I tried using the RDMA framework some years ago, but it was overly complex and I
stopped that work.
>
> >
> > > [SUM] 0.00-10.03 sec 12.4 GBytes 10.6 Gbits/sec 0.188 ms 55969/259617 (22%) receiver
> > >
> > > [1] configfs settings:
> > > # modprobe pci_epf_vntb dyndbg=+pmf
> > > # cd /sys/kernel/config/pci_ep/
> > > # mkdir functions/pci_epf_vntb/func1
> > > # echo 0x1912 > functions/pci_epf_vntb/func1/vendorid
> > > # echo 0x0030 > functions/pci_epf_vntb/func1/deviceid
> > > # echo 32 > functions/pci_epf_vntb/func1/msi_interrupts
> > > # echo 16 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/db_count
> > > # echo 128 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/spad_count
> > > # echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/num_mws
> > > # echo 0xe0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1
> > > # echo 0x20000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2
> > > # echo 0xe0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2_offset
> >
> > look like, you try to create sub-small mw windows.
> >
> > Is it more clean ?
> >
> > echo 0xe0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1.0
> > echo 0x20000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1.1
> >
> > so wm1.1 natively continue from prevous one.
>
> Thanks for the suggestion.
>
> I was trying to keep the sub-small mw windows referred to in the same way
> as normal windows for simplicity and readability, but I agree your proposal
> looks intuitive from a User-eXperience point of view.
>
> My only concern is that e.g. {mw1.0, mw1.1, mw2.0} may translate internally
> into something like {mw1, mw2, mw3} effectively, and that numbering
> mismatch might become confusing when reading or debugging the code.
If there are enough BARs, you can try using one dedicated BAR for the eDMA register
space, with the LL space shared with BAR0 (the control BAR), to reduce complexity and
get better performance as a first step.
Frank
>
> -Koichiro
>
> >
> > Frank
> >
> > > # echo 0x1912 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vntb_vid
> > > # echo 0x0030 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vntb_pid
> > > # echo 0x10 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vbus_number
> > > # echo 0 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/ctrl_bar
> > > # echo 4 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/db_bar
> > > # echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1_bar
> > > # echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2_bar
> > > # ln -s controllers/e65d0000.pcie-ep functions/pci_epf_vntb/func1/primary/
> > > # echo 1 > controllers/e65d0000.pcie-ep/start
> > >
> > >
> > > Thanks for taking a look.
> > >
> > >
> > > Koichiro Den (27):
> > > PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on mws_size[]
> > > access
> > > PCI: endpoint: pci-epf-vntb: Add mwN_offset configfs attributes
> > > NTB: epf: Handle mwN_offset for inbound MW regions
> > > PCI: endpoint: Add inbound mapping ops to EPC core
> > > PCI: dwc: ep: Implement EPC inbound mapping support
> > > PCI: endpoint: pci-epf-vntb: Use pci_epc_map_inbound() for MW mapping
> > > NTB: Add offset parameter to MW translation APIs
> > > PCI: endpoint: pci-epf-vntb: Propagate MW offset from configfs when
> > > present
> > > NTB: ntb_transport: Support offsetted partial memory windows
> > > NTB: core: Add .get_pci_epc() to ntb_dev_ops
> > > NTB: epf: vntb: Implement .get_pci_epc() callback
> > > damengine: dw-edma: Fix MSI data values for multi-vector IMWr
> > > interrupts
> > > NTB: ntb_transport: Use seq_file for QP stats debugfs
> > > NTB: ntb_transport: Move TX memory window setup into setup_qp_mw()
> > > NTB: ntb_transport: Dynamically determine qp count
> > > NTB: ntb_transport: Introduce get_dma_dev() helper
> > > NTB: epf: Reserve a subset of MSI vectors for non-NTB users
> > > NTB: ntb_transport: Introduce ntb_transport_backend_ops
> > > PCI: dwc: ep: Cache MSI outbound iATU mapping
> > > NTB: ntb_transport: Introduce remote eDMA backed transport mode
> > > NTB: epf: Provide db_vector_count/db_vector_mask callbacks
> > > ntb_netdev: Multi-queue support
> > > NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car)
> > > iommu: ipmmu-vmsa: Add PCIe ch0 to devices_allowlist
> > > iommu: ipmmu-vmsa: Add support for reserved regions
> > > arm64: dts: renesas: Add Spider RC/EP DTs for NTB with remote DW PCIe
> > > eDMA
> > > NTB: epf: Add an additional memory window (MW2) barno mapping on
> > > Renesas R-Car
> > >
> > > arch/arm64/boot/dts/renesas/Makefile | 2 +
> > > .../boot/dts/renesas/r8a779f0-spider-ep.dts | 46 +
> > > .../boot/dts/renesas/r8a779f0-spider-rc.dts | 52 +
> > > drivers/dma/dw-edma/dw-edma-core.c | 28 +-
> > > drivers/iommu/ipmmu-vmsa.c | 7 +-
> > > drivers/net/ntb_netdev.c | 341 ++-
> > > drivers/ntb/Kconfig | 11 +
> > > drivers/ntb/Makefile | 3 +
> > > drivers/ntb/hw/amd/ntb_hw_amd.c | 6 +-
> > > drivers/ntb/hw/epf/ntb_hw_epf.c | 177 +-
> > > drivers/ntb/hw/idt/ntb_hw_idt.c | 3 +-
> > > drivers/ntb/hw/intel/ntb_hw_gen1.c | 6 +-
> > > drivers/ntb/hw/intel/ntb_hw_gen1.h | 2 +-
> > > drivers/ntb/hw/intel/ntb_hw_gen3.c | 3 +-
> > > drivers/ntb/hw/intel/ntb_hw_gen4.c | 6 +-
> > > drivers/ntb/hw/mscc/ntb_hw_switchtec.c | 6 +-
> > > drivers/ntb/msi.c | 6 +-
> > > drivers/ntb/ntb_edma.c | 628 ++++++
> > > drivers/ntb/ntb_edma.h | 128 ++
> > > .../{ntb_transport.c => ntb_transport_core.c} | 1829 ++++++++++++++---
> > > drivers/ntb/test/ntb_perf.c | 4 +-
> > > drivers/ntb/test/ntb_tool.c | 6 +-
> > > .../pci/controller/dwc/pcie-designware-ep.c | 287 ++-
> > > drivers/pci/controller/dwc/pcie-designware.h | 7 +
> > > drivers/pci/endpoint/functions/pci-epf-vntb.c | 229 ++-
> > > drivers/pci/endpoint/pci-epc-core.c | 44 +
> > > include/linux/ntb.h | 39 +-
> > > include/linux/ntb_transport.h | 21 +
> > > include/linux/pci-epc.h | 11 +
> > > 29 files changed, 3415 insertions(+), 523 deletions(-)
> > > create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-ep.dts
> > > create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-rc.dts
> > > create mode 100644 drivers/ntb/ntb_edma.c
> > > create mode 100644 drivers/ntb/ntb_edma.h
> > > rename drivers/ntb/{ntb_transport.c => ntb_transport_core.c} (59%)
> > >
> > > --
> > > 2.48.1
> > >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-02 6:32 ` Niklas Cassel
@ 2025-12-03 8:30 ` Koichiro Den
2025-12-03 10:19 ` Niklas Cassel
0 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-12-03 8:30 UTC (permalink / raw)
To: Niklas Cassel
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring, Damien Le Moal
On Tue, Dec 02, 2025 at 07:32:31AM +0100, Niklas Cassel wrote:
> On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> > for the MSI target address on every interrupt and tears it down again
> > via dw_pcie_ep_unmap_addr().
> >
> > On systems that heavily use the AXI bridge interface (for example when
> > the integrated eDMA engine is active), this means the outbound iATU
> > registers are updated while traffic is in flight. The DesignWare
> > endpoint spec warns that updating iATU registers in this situation is
> > not supported, and the behavior is undefined.
>
> Please reference a specific section in the EP databook, and the specific
> EP databook version that you are using.
Ok, the section I was referring to in the commit message is:
DW EPC databook 5.40a - 3.10.6.1 iATU Outbound Programming Overview
"Caution: Dynamic iATU Programming with AXI Bridge Module You must not update
the iATU registers while operations are in progress on the AXI bridge slave
interface."
>
> This patch appears to address quite a serious issue, so I think that you
> should submit it as a standalone patch, and not as part of a series.
>
> (Especially not as part of an RFC which can take quite long before it is
> even submitted as a normal (non-RFC) series.)
Makes sense, thank you for the guidance.
Koichiro
>
>
> Kind regards,
> Niklas
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-02 9:32 ` Niklas Cassel
2025-12-02 15:20 ` Frank Li
@ 2025-12-03 8:40 ` Koichiro Den
2025-12-03 10:39 ` Niklas Cassel
1 sibling, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-12-03 8:40 UTC (permalink / raw)
To: Niklas Cassel
Cc: Frank Li, ntb, linux-pci, dmaengine, linux-kernel, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring, Damien Le Moal
On Tue, Dec 02, 2025 at 10:32:19AM +0100, Niklas Cassel wrote:
> Hello Koichiro,
>
> On Tue, Dec 02, 2025 at 03:35:36PM +0900, Koichiro Den wrote:
> > On Mon, Dec 01, 2025 at 03:41:38PM -0500, Frank Li wrote:
> > > On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > > > dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> > > > for the MSI target address on every interrupt and tears it down again
> > > > via dw_pcie_ep_unmap_addr().
> > > >
> > > > On systems that heavily use the AXI bridge interface (for example when
> > > > the integrated eDMA engine is active), this means the outbound iATU
> > > > registers are updated while traffic is in flight. The DesignWare
> > > > endpoint spec warns that updating iATU registers in this situation is
> > > > not supported, and the behavior is undefined.
> > > >
> > > > Under high MSI and eDMA load this pattern results in occasional bogus
> > > > outbound transactions and IOMMU faults such as:
> > > >
> > > > ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
> > > >
> > >
> > > I agree needn't map/unmap MSI every time. But I think there should be
> > > logic problem behind this. IOMMU report error means page table already
> > > removed, but you still try to access it after that. You'd better find where
> > > access MSI memory after dw_pcie_ep_unmap_addr().
> >
> > I don't see any other callers that access the MSI region after
> > dw_pcie_ep_unmap_addr(), but I might be missing something. Also, even if I
> > serialize dw_pcie_ep_raise_msi_irq() invocations, the problem still
> > appears.
> >
> > A couple of details I forgot to describe in the commit message:
> > (1). The IOMMU error is only reported on the RC side.
> > (2). Sometimes there is no IOMMU error printed and the board just freezes (becomes unresponsive).
> >
> > The faulting iova is 0xfe000000. The iova 0xfe000000 is the base of
> > "addr_space" for R-Car S4 in EP mode:
> > https://github.com/jonmason/ntb/blob/68113d260674/arch/arm64/boot/dts/renesas/r8a779f0.dtsi#L847
> >
> > So it looks like the EP sometimes issue MWr at "addr_space" base (offset 0),
> > the RC forwards it to its IOMMU (IPMMUHC) and that faults. My working theory
> > is that when the iATU registers are updated under heavy DMA load, the DAR of
> > some in-flight transfer can get corrupted to 0xfe000000. That would match one
> > possible symptom of the undefined behaviour that the DW EPC spec warns about
> > when changing iATU configuration under load.
>
> For your information, in the NVMe PCI EPF driver:
> https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L389-L429
>
> We take a mutex around the dmaengine_slave_config() and dma_sync_wait() calls,
> because without a mutex, we noticed that having multiple outstanding transfers,
> since the dmaengine_slave_config() specifies the src/dst address, the function
> call would affect other concurrent DMA transfers, leading to corruption because
> of invalid src/dst addresses.
>
> Having a mutex so that we can only have one outstanding transfer solves these
> issues, but is obviously very bad for performance.
Thank you for the info, it helps a lot.
As for DW eDMA, it seems that at least dmaengine_slave_config() and
dmaengine_prep_slave_sg() needs to be executed under an exclusive
per-dma_chan lock. In hindsight, [RFC PATCH v2 20/27] should've done so,
although I still think it's unrelated to the particular issue addressed by
this commit.
For testing, I tried adding dma_sync_wait() and taking a per-dma_chan mutex
around the dmaengine_slave_config() ~ dma_sync_wait() sequence, but the problem
(that often occurs with the IOMMU fault at 0xfe000000) remains under heavy
load if I revert this commit. The diff for the experiment is a bit large
(+117, -89), so I'm not posting it here, but I can do so if that would be
useful.
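For reference, the serialized pattern I used in that experiment is roughly the
following minimal sketch (hypothetical function and lock names, not the actual
diff; only the dmaengine calls are real APIs):

#include <linux/dmaengine.h>
#include <linux/mutex.h>

static int ntb_edma_xfer_serialized(struct dma_chan *chan, struct mutex *chan_lock,
				    struct dma_slave_config *cfg,
				    dma_addr_t buf, size_t len,
				    enum dma_transfer_direction dir)
{
	struct dma_async_tx_descriptor *desc;
	dma_cookie_t cookie;
	int ret;

	/* One outstanding transfer per channel: config, prep, submit, wait. */
	mutex_lock(chan_lock);

	ret = dmaengine_slave_config(chan, cfg);
	if (ret)
		goto unlock;

	desc = dmaengine_prep_slave_single(chan, buf, len, dir, DMA_PREP_INTERRUPT);
	if (!desc) {
		ret = -EIO;
		goto unlock;
	}

	cookie = dmaengine_submit(desc);
	ret = dma_submit_error(cookie);
	if (ret)
		goto unlock;

	dma_async_issue_pending(chan);

	/* Block until this descriptor completes before allowing the next one. */
	if (dma_sync_wait(chan, cookie) != DMA_COMPLETE)
		ret = -EIO;

unlock:
	mutex_unlock(chan_lock);
	return ret;
}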
>
>
> I did try to add DMA_MEMCPY support to the dw-edma driver:
> https://lore.kernel.org/linux-pci/20241217160448.199310-4-cassel@kernel.org/
>
> Since that would allow us to specify both the src and dst address in a single
> dmaengine function call (so that we would no longer need a mutex).
> However, because the eDMA hardware (at least for EDMA_LEGACY_UNROLL) does not
> support transfers between PCI to PCI, only PCI to local DDR or local DDR to
> PCI, using prep_memcpy() is wrong, as it does not take a direction:
> https://lore.kernel.org/linux-pci/Z4jf2s5SaUu3wdJi@ryzen/
Interesting, I didn't know that.
>
> If we want to improve the dw-edma driver, so that an EPF driver can have
> multiple outstanding transfers, I think the best way forward would be to create
> a new _prep_slave_memcpy() or similar, that does take a direction, and thus
> does not require dmaengine_slave_config() to be called before every
> _prep_slave_memcpy() call.
Would dmaengine_prep_slave_single_config(), which Frank told us about in this
thread, be sufficient?
Thanks for reviewing,
Koichiro
>
>
> Kind regards,
> Niklas
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA
2025-12-02 16:07 ` Frank Li
@ 2025-12-03 8:43 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-03 8:43 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Tue, Dec 02, 2025 at 11:07:23AM -0500, Frank Li wrote:
> On Tue, Dec 02, 2025 at 03:20:01PM +0900, Koichiro Den wrote:
> > On Mon, Dec 01, 2025 at 05:02:57PM -0500, Frank Li wrote:
> > > On Sun, Nov 30, 2025 at 01:03:38AM +0900, Koichiro Den wrote:
> > > > Hi,
> > > >
> > > > This is RFC v2 of the NTB/PCI series for Renesas R-Car S4. The ultimate
> > > > goal is unchanged, i.e. to improve performance between RC and EP
> > > > (with vNTB) over ntb_transport, but the approach has changed drastically.
> > > > Based on the feedback from Frank Li in the v1 thread, in particular:
> > > > https://lore.kernel.org/all/aQEsip3TsPn4LJY9@lizhi-Precision-Tower-5810/
> > > > this RFC v2 instead builds an NTB transport backed by remote eDMA
> > > > architecture and reshapes the series around it. The RC->EP interruption
> > > > is now achieved using a dedicated eDMA read channel, so the somewhat
> > > > "hack"-ish approach in RFC v1 is no longer needed.
> > > >
> > > > Compared to RFC v1, this v2 series enables NTB transport backed by
> > > > remote DW eDMA, so the current ntb_transport handling of Memory Window
> > > > is no longer needed, and direct DMA transfers between EP and RC are
> > > > used.
> > > >
> > > > I realize this is quite a large series. Sorry for the volume, but for
> > > > the RFC stage I believe presenting the full picture in a single set
> > > > helps with reviewing the overall architecture. Once the direction is
> > > > agreed, I will respin it split by subsystem and topic.
> > > >
> > > >
> > > ...
> > > >
> > > > - Before this change:
> > > >
> > > > * ping
> > > > 64 bytes from 10.0.0.11: icmp_seq=1 ttl=64 time=12.3 ms
> > > > 64 bytes from 10.0.0.11: icmp_seq=2 ttl=64 time=6.58 ms
> > > > 64 bytes from 10.0.0.11: icmp_seq=3 ttl=64 time=1.26 ms
> > > > 64 bytes from 10.0.0.11: icmp_seq=4 ttl=64 time=7.43 ms
> > > > 64 bytes from 10.0.0.11: icmp_seq=5 ttl=64 time=1.39 ms
> > > > 64 bytes from 10.0.0.11: icmp_seq=6 ttl=64 time=7.38 ms
> > > > 64 bytes from 10.0.0.11: icmp_seq=7 ttl=64 time=1.42 ms
> > > > 64 bytes from 10.0.0.11: icmp_seq=8 ttl=64 time=7.41 ms
> > > >
> > > > * RC->EP (`sudo iperf3 -ub0 -l 65480 -P 2`)
> > > > [ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams
> > > > [ 5] 0.00-10.01 sec 344 MBytes 288 Mbits/sec 3.483 ms 51/5555 (0.92%) receiver
> > > > [ 6] 0.00-10.01 sec 342 MBytes 287 Mbits/sec 3.814 ms 38/5517 (0.69%) receiver
> > > > [SUM] 0.00-10.01 sec 686 MBytes 575 Mbits/sec 3.648 ms 89/11072 (0.8%) receiver
> > > >
> > > > * EP->RC (`sudo iperf3 -ub0 -l 65480 -P 2`)
> > > > [ 5] 0.00-10.03 sec 334 MBytes 279 Mbits/sec 3.164 ms 390/5731 (6.8%) receiver
> > > > [ 6] 0.00-10.03 sec 334 MBytes 279 Mbits/sec 2.416 ms 396/5741 (6.9%) receiver
> > > > [SUM] 0.00-10.03 sec 667 MBytes 558 Mbits/sec 2.790 ms 786/11472 (6.9%) receiver
> > > >
> > > > Note: with `-P 2`, the best total bitrate (receiver side) was achieved.
> > > >
> > > > - After this change (use_remote_edma=1) [1]:
> > > >
> > > > * ping
> > > > 64 bytes from 10.0.0.11: icmp_seq=1 ttl=64 time=1.48 ms
> > > > 64 bytes from 10.0.0.11: icmp_seq=2 ttl=64 time=1.03 ms
> > > > 64 bytes from 10.0.0.11: icmp_seq=3 ttl=64 time=0.931 ms
> > > > 64 bytes from 10.0.0.11: icmp_seq=4 ttl=64 time=0.910 ms
> > > > 64 bytes from 10.0.0.11: icmp_seq=5 ttl=64 time=1.07 ms
> > > > 64 bytes from 10.0.0.11: icmp_seq=6 ttl=64 time=0.986 ms
> > > > 64 bytes from 10.0.0.11: icmp_seq=7 ttl=64 time=0.910 ms
> > > > 64 bytes from 10.0.0.11: icmp_seq=8 ttl=64 time=0.883 ms
> > > >
> > > > * RC->EP (`sudo iperf3 -ub0 -l 65480 -P 4`)
> > > > [ 5] 0.00-10.01 sec 3.54 GBytes 3.04 Gbits/sec 0.030 ms 0/58007 (0%) receiver
> > > > [ 6] 0.00-10.01 sec 3.71 GBytes 3.19 Gbits/sec 0.453 ms 0/60909 (0%) receiver
> > > > [ 9] 0.00-10.01 sec 3.85 GBytes 3.30 Gbits/sec 0.027 ms 0/63072 (0%) receiver
> > > > [ 11] 0.00-10.01 sec 3.26 GBytes 2.80 Gbits/sec 0.070 ms 1/53512 (0.0019%) receiver
> > > > [SUM] 0.00-10.01 sec 14.4 GBytes 12.3 Gbits/sec 0.145 ms 1/235500 (0.00042%) receiver
> > > >
> > > > * EP->RC (`sudo iperf3 -ub0 -l 65480 -P 4`)
> > > > [ 5] 0.00-10.03 sec 3.40 GBytes 2.91 Gbits/sec 0.104 ms 15467/71208 (22%) receiver
> > > > [ 6] 0.00-10.03 sec 3.08 GBytes 2.64 Gbits/sec 0.176 ms 12097/62609 (19%) receiver
> > > > [ 9] 0.00-10.03 sec 3.38 GBytes 2.90 Gbits/sec 0.270 ms 17212/72710 (24%) receiver
> > > > [ 11] 0.00-10.03 sec 2.56 GBytes 2.19 Gbits/sec 0.200 ms 11193/53090 (21%) receiver
> > >
> > > Almost 10x fast, 2.9G vs 279M? high light this one will bring more peopole
> > > interesting about this topic.
> >
> > Thank you for the review!
> >
> > OK, I'll highlight this in the next iteration.
> > By the way, my impression is that we can achieve even higher with this remote
> > eDMA architecture.
>
> eDMA can eliminate one memory copy and use a longer TLP data length. Previously,
> I tried using the RDMA framework some years ago, but it was overly complex and I
> stopped that work.
That's interesting. Thank you for the info.
>
> >
> > >
> > > > [SUM] 0.00-10.03 sec 12.4 GBytes 10.6 Gbits/sec 0.188 ms 55969/259617 (22%) receiver
> > > >
> > > > [1] configfs settings:
> > > > # modprobe pci_epf_vntb dyndbg=+pmf
> > > > # cd /sys/kernel/config/pci_ep/
> > > > # mkdir functions/pci_epf_vntb/func1
> > > > # echo 0x1912 > functions/pci_epf_vntb/func1/vendorid
> > > > # echo 0x0030 > functions/pci_epf_vntb/func1/deviceid
> > > > # echo 32 > functions/pci_epf_vntb/func1/msi_interrupts
> > > > # echo 16 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/db_count
> > > > # echo 128 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/spad_count
> > > > # echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/num_mws
> > > > # echo 0xe0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1
> > > > # echo 0x20000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2
> > > > # echo 0xe0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2_offset
> > >
> > > look like, you try to create sub-small mw windows.
> > >
> > > Is it more clean ?
> > >
> > > echo 0xe0000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1.0
> > > echo 0x20000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1.1
> > >
> > > so wm1.1 natively continue from prevous one.
> >
> > Thanks for the suggestion.
> >
> > I was trying to keep the sub-small mw windows referred to in the same way
> > as normal windows for simplicity and readability, but I agree your proposal
> > looks intuitive from a User-eXperience point of view.
> >
> > My only concern is that e.g. {mw1.0, mw1.1, mw2.0} may translate internally
> > into something like {mw1, mw2, mw3} effectively, and that numbering
> > mismatch might become confusing when reading or debugging the code.
>
> If there are enough BARs, you can try using one dedicated BAR for the eDMA register
> space, with the LL space shared with BAR0 (the control BAR), to reduce complexity and
> get better performance as a first step.
Thank you for the suggestion. Once I have the critical pieces (which we are
discussing in several threads for this RFCv2 series) sorted out and start
preparing the next iteration, I'll revisit this.
Koichiro
>
> Frank
>
> >
> > -Koichiro
> >
> > >
> > > Frank
> > >
> > > > # echo 0x1912 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vntb_vid
> > > > # echo 0x0030 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vntb_pid
> > > > # echo 0x10 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vbus_number
> > > > # echo 0 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/ctrl_bar
> > > > # echo 4 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/db_bar
> > > > # echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1_bar
> > > > # echo 2 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2_bar
> > > > # ln -s controllers/e65d0000.pcie-ep functions/pci_epf_vntb/func1/primary/
> > > > # echo 1 > controllers/e65d0000.pcie-ep/start
> > > >
> > > >
> > > > Thanks for taking a look.
> > > >
> > > >
> > > > Koichiro Den (27):
> > > > PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on mws_size[]
> > > > access
> > > > PCI: endpoint: pci-epf-vntb: Add mwN_offset configfs attributes
> > > > NTB: epf: Handle mwN_offset for inbound MW regions
> > > > PCI: endpoint: Add inbound mapping ops to EPC core
> > > > PCI: dwc: ep: Implement EPC inbound mapping support
> > > > PCI: endpoint: pci-epf-vntb: Use pci_epc_map_inbound() for MW mapping
> > > > NTB: Add offset parameter to MW translation APIs
> > > > PCI: endpoint: pci-epf-vntb: Propagate MW offset from configfs when
> > > > present
> > > > NTB: ntb_transport: Support offsetted partial memory windows
> > > > NTB: core: Add .get_pci_epc() to ntb_dev_ops
> > > > NTB: epf: vntb: Implement .get_pci_epc() callback
> > > > damengine: dw-edma: Fix MSI data values for multi-vector IMWr
> > > > interrupts
> > > > NTB: ntb_transport: Use seq_file for QP stats debugfs
> > > > NTB: ntb_transport: Move TX memory window setup into setup_qp_mw()
> > > > NTB: ntb_transport: Dynamically determine qp count
> > > > NTB: ntb_transport: Introduce get_dma_dev() helper
> > > > NTB: epf: Reserve a subset of MSI vectors for non-NTB users
> > > > NTB: ntb_transport: Introduce ntb_transport_backend_ops
> > > > PCI: dwc: ep: Cache MSI outbound iATU mapping
> > > > NTB: ntb_transport: Introduce remote eDMA backed transport mode
> > > > NTB: epf: Provide db_vector_count/db_vector_mask callbacks
> > > > ntb_netdev: Multi-queue support
> > > > NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car)
> > > > iommu: ipmmu-vmsa: Add PCIe ch0 to devices_allowlist
> > > > iommu: ipmmu-vmsa: Add support for reserved regions
> > > > arm64: dts: renesas: Add Spider RC/EP DTs for NTB with remote DW PCIe
> > > > eDMA
> > > > NTB: epf: Add an additional memory window (MW2) barno mapping on
> > > > Renesas R-Car
> > > >
> > > > arch/arm64/boot/dts/renesas/Makefile | 2 +
> > > > .../boot/dts/renesas/r8a779f0-spider-ep.dts | 46 +
> > > > .../boot/dts/renesas/r8a779f0-spider-rc.dts | 52 +
> > > > drivers/dma/dw-edma/dw-edma-core.c | 28 +-
> > > > drivers/iommu/ipmmu-vmsa.c | 7 +-
> > > > drivers/net/ntb_netdev.c | 341 ++-
> > > > drivers/ntb/Kconfig | 11 +
> > > > drivers/ntb/Makefile | 3 +
> > > > drivers/ntb/hw/amd/ntb_hw_amd.c | 6 +-
> > > > drivers/ntb/hw/epf/ntb_hw_epf.c | 177 +-
> > > > drivers/ntb/hw/idt/ntb_hw_idt.c | 3 +-
> > > > drivers/ntb/hw/intel/ntb_hw_gen1.c | 6 +-
> > > > drivers/ntb/hw/intel/ntb_hw_gen1.h | 2 +-
> > > > drivers/ntb/hw/intel/ntb_hw_gen3.c | 3 +-
> > > > drivers/ntb/hw/intel/ntb_hw_gen4.c | 6 +-
> > > > drivers/ntb/hw/mscc/ntb_hw_switchtec.c | 6 +-
> > > > drivers/ntb/msi.c | 6 +-
> > > > drivers/ntb/ntb_edma.c | 628 ++++++
> > > > drivers/ntb/ntb_edma.h | 128 ++
> > > > .../{ntb_transport.c => ntb_transport_core.c} | 1829 ++++++++++++++---
> > > > drivers/ntb/test/ntb_perf.c | 4 +-
> > > > drivers/ntb/test/ntb_tool.c | 6 +-
> > > > .../pci/controller/dwc/pcie-designware-ep.c | 287 ++-
> > > > drivers/pci/controller/dwc/pcie-designware.h | 7 +
> > > > drivers/pci/endpoint/functions/pci-epf-vntb.c | 229 ++-
> > > > drivers/pci/endpoint/pci-epc-core.c | 44 +
> > > > include/linux/ntb.h | 39 +-
> > > > include/linux/ntb_transport.h | 21 +
> > > > include/linux/pci-epc.h | 11 +
> > > > 29 files changed, 3415 insertions(+), 523 deletions(-)
> > > > create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-ep.dts
> > > > create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-rc.dts
> > > > create mode 100644 drivers/ntb/ntb_edma.c
> > > > create mode 100644 drivers/ntb/ntb_edma.h
> > > > rename drivers/ntb/{ntb_transport.c => ntb_transport_core.c} (59%)
> > > >
> > > > --
> > > > 2.48.1
> > > >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-12-02 15:42 ` Frank Li
@ 2025-12-03 8:53 ` Koichiro Den
2025-12-03 16:14 ` Frank Li
0 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-12-03 8:53 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Tue, Dec 02, 2025 at 10:42:29AM -0500, Frank Li wrote:
> On Tue, Dec 02, 2025 at 03:43:10PM +0900, Koichiro Den wrote:
> > On Mon, Dec 01, 2025 at 04:41:05PM -0500, Frank Li wrote:
> > > On Sun, Nov 30, 2025 at 01:03:58AM +0900, Koichiro Den wrote:
> > > > Add a new transport backend that uses a remote DesignWare eDMA engine
> > > > located on the NTB endpoint to move data between host and endpoint.
> > > >
> ...
> > > > +#include "ntb_edma.h"
> > > > +
> > > > +/*
> > > > + * The interrupt register offsets below are taken from the DesignWare
> > > > + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> > > > + * backend currently only supports this layout.
> > > > + */
> > > > +#define DMA_WRITE_INT_STATUS_OFF 0x4c
> > > > +#define DMA_WRITE_INT_MASK_OFF 0x54
> > > > +#define DMA_WRITE_INT_CLEAR_OFF 0x58
> > > > +#define DMA_READ_INT_STATUS_OFF 0xa0
> > > > +#define DMA_READ_INT_MASK_OFF 0xa8
> > > > +#define DMA_READ_INT_CLEAR_OFF 0xac
> > >
> > > Not sure why need access EDMA register because EMDA driver already export
> > > as dmaengine driver.
> >
> > These are intended for EP use. In my current design I intentionally don't
> > use the standard dw-edma dmaengine driver on the EP side.
>
> why not?
Conceptually I agree that using the standard dw-edma driver on both sides
would be attractive for future extensibility and maintainability. However,
there are a couple of concerns for me, some of which might be alleviated by
your suggestion below, and some which are more generic safety concerns that
I tried to outline in my replies to your other comments.
>
> >
> > >
> > > > +
> > > > +#define NTB_EDMA_NOTIFY_MAX_QP 64
> > > > +
> ...
> > > > +
> > > > + virq = irq_create_fwspec_mapping(&fwspec);
> > > > + of_node_put(parent);
> > > > + return (virq > 0) ? virq : -EINVAL;
> > > > +}
> > > > +
> > > > +static irqreturn_t ntb_edma_isr(int irq, void *data)
> > > > +{
> > >
> > > Not sue why dw_edma_interrupt_write/read() does work for your case. Suppose
> > > just register callback for dmeengine.
> >
> > If we ran dw_edma_probe() on both the EP and RC sides and let the dmaengine
> > callbacks handle int_status/int_clear, I think we could hit races. One side
> > might clear a status bit before the other side has a chance to see it and
> > invoke its callback. Please correct me if I'm missing something here.
>
> You should use a different channel?
Do you mean something like this:
- on the EP side, dw_edma_probe() only sets up a dedicated channel for notification,
- on the RC side, do not set up that particular channel via dw_edma_channel_setup(),
  but do set up the remaining channels for DMA transfers.
Also, is it generically safe to have dw_edma_probe() executed from both ends on
the same eDMA instance, as long as the channels are carefully partitioned
between them?
>
> >
> > To avoid that, in my current implementation, the RC side handles the
> > status/int_clear registers in the usual way, and the EP side only tries to
> > suppress needless edma_int as much as possible.
> >
> > That said, I'm now wondering if it would be better to set LIE=0/RIE=1 for
> > the DMA transfer channels and LIE=1/RIE=0 for the notification channel.
> > That would require some changes on dw-edma core.
>
> If dw-edma works as a remote DMA, it should enable RIE, like
> dw-edma-pcie.c, but no one actually uses it recently.
>
> Using eDMA as a doorbell would be a new use case, and I think it is quite useful.
>
> > >
> > > > + struct ntb_edma_interrupt *v = data;
> > > > + u32 mask = BIT(EDMA_RD_CH_NUM);
> > > > + u32 i, val;
> > > > +
> ...
> > > > + ret = dw_edma_probe(chip);
> > >
> > > I think dw_edma_probe() should be in ntb_hw_epf.c, which provide DMA
> > > dma engine support.
> > >
> > > EP side, suppose default dwc controller driver already setup edma engine,
> > > so use correct filter function, you should get dma chan.
> >
> > I intentionally hid edma for EP side in .dts patch in [RFC PATCH v2 26/27]
> > so that RC side only manages eDMA remotely and avoids the potential race
> > condition I mentioned above.
>
> Improve the eDMA core to support some DMA channels working locally and some
> remotely.
Right. First, I experimented a bit more with different LIE/RIE settings and
ended up with the following observations:
* LIE=0/RIE=1 does not seem to work at the hardware level. When I tried this for
DMA transfer channels, the RC side never received any interrupt. The databook
(5.40a, 8.2.2 "Interrupts and Error Handling") has a hint that says
"If you want a remote interrupt and not a local interrupt then: Set LIE and
RIE [...]", so I think this behaviour is expected.
* LIE=1/RIE=0 does work at the hardware level, but is problematic for my current
design, where the RC issues the DMA transfer for the notification via
ntb_edma_notify_peer(). With RIE=0, the RC never calls
dw_edma_core_handle_int() for that channel, which means that internal state
such as dw_edma_chan.status is never managed correctly.
>
> Frank
> >
> > Thanks for reviewing,
> > Koichiro
> >
> > >
> > > Frank
> > >
> > > > + if (ret) {
> > > > + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> > > > + return ret;
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> ...
>
> > > > +{
> > > > + spin_lock_init(&qp->ep_tx_lock);
> > > > + spin_lock_init(&qp->ep_rx_lock);
> > > > + spin_lock_init(&qp->rc_lock);
> > > > +}
> > > > +
> > > > +static const struct ntb_transport_backend_ops edma_backend_ops = {
> > > > + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> > > > + .tx_free_entry = ntb_transport_edma_tx_free_entry,
> > > > + .tx_enqueue = ntb_transport_edma_tx_enqueue,
> > > > + .rx_enqueue = ntb_transport_edma_rx_enqueue,
> > > > + .rx_poll = ntb_transport_edma_rx_poll,
> > > > + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> > > > +};
> > > > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > > > +
> > > > /**
> > > > * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> > > > * @qp: NTB transport layer queue to be enabled
> > > > --
> > > > 2.48.1
> > > >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-03 8:30 ` Koichiro Den
@ 2025-12-03 10:19 ` Niklas Cassel
2025-12-03 14:56 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Niklas Cassel @ 2025-12-03 10:19 UTC (permalink / raw)
To: Koichiro Den, mani
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring, Damien Le Moal
On Wed, Dec 03, 2025 at 05:30:52PM +0900, Koichiro Den wrote:
> On Tue, Dec 02, 2025 at 07:32:31AM +0100, Niklas Cassel wrote:
> > On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > > dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> > > for the MSI target address on every interrupt and tears it down again
> > > via dw_pcie_ep_unmap_addr().
> > >
> > > On systems that heavily use the AXI bridge interface (for example when
> > > the integrated eDMA engine is active), this means the outbound iATU
> > > registers are updated while traffic is in flight. The DesignWare
> > > endpoint spec warns that updating iATU registers in this situation is
> > > not supported, and the behavior is undefined.
> >
> > Please reference a specific section in the EP databook, and the specific
> > EP databook version that you are using.
>
> Ok, the section I was referring to in the commit message is:
>
> DW EPC databook 5.40a - 3.10.6.1 iATU Outbound Programming Overview
> "Caution: Dynamic iATU Programming with AXI Bridge Module You must not update
> the iATU registers while operations are in progress on the AXI bridge slave
> interface."
Please add this text to the commit message when sending a proper patch.
Nit: I think it is "DW EP databook" and not "DW EPC databook".
However, if what you are suggesting is true, that would have an implication
for all PCI EPF drivers.
E.g. the MHI EPF driver:
https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-mhi.c#L394-L395
https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-mhi.c#L323-L324
uses either the eDMA (without calling pci_epc_map_addr()) or MMIO
(which does call pci_epc_map_addr(), which will update the iATU registers),
depending on the I/O size.
And I assume that the MHI bus can have multiple outgoing reads/writes
at the same time.
If what you are suggesting is true, AFAICT, any EPF driver that could have
multiple outgoing transactions occurring at the same time, cannot be allowed
to have calls to pci_epc_map_addr().
Which would mean that, even if we change dw_pcie_ep_raise_msix_irq() and
dw_pcie_ep_raise_msi_irq() to not call map_addr() after
dw_pcie_ep_init_registers(), we could never have an EPF driver that mixes
MMIO and DMA. (Or even has multiple outgoing MMIO transactions, as that
requires updating iATU registers.)
Kind regards,
Niklas
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-03 8:40 ` Koichiro Den
@ 2025-12-03 10:39 ` Niklas Cassel
2025-12-03 14:36 ` Koichiro Den
` (2 more replies)
0 siblings, 3 replies; 97+ messages in thread
From: Niklas Cassel @ 2025-12-03 10:39 UTC (permalink / raw)
To: Koichiro Den
Cc: Frank Li, ntb, linux-pci, dmaengine, linux-kernel, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring, Damien Le Moal
On Wed, Dec 03, 2025 at 05:40:45PM +0900, Koichiro Den wrote:
> >
> > If we want to improve the dw-edma driver, so that an EPF driver can have
> > multiple outstanding transfers, I think the best way forward would be to create
> > a new _prep_slave_memcpy() or similar, that does take a direction, and thus
> > does not require dmaengine_slave_config() to be called before every
> > _prep_slave_memcpy() call.
>
> Would dmaengine_prep_slave_single_config(), which Frank told us about in this
> thread, be sufficient?
I think that Frank is suggesting a new dmaengine API,
dmaengine_prep_slave_single_config(), which is like
dmaengine_prep_slave_single(), but also takes a struct dma_slave_config *
as a parameter.
I really like the idea.
I think it would allow us to remove the mutex in nvmet_pci_epf_dma_transfer():
https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L389-L429
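To make the idea concrete, a rough sketch of the shape such a helper could take is
below. This helper does not exist in mainline; today it can only be approximated by
the two-step sequence shown here, which is exactly what races between concurrent
callers. A real implementation would have to apply the config atomically with the
descriptor preparation inside the dmaengine provider:

#include <linux/dmaengine.h>

static inline struct dma_async_tx_descriptor *
dmaengine_prep_slave_single_config(struct dma_chan *chan,
				   struct dma_slave_config *cfg,
				   dma_addr_t buf, size_t len,
				   enum dma_transfer_direction dir,
				   unsigned long flags)
{
	/* Apply the per-transfer config (src/dst addresses, widths, ...). */
	if (dmaengine_slave_config(chan, cfg))
		return NULL;

	/* Then prepare the descriptor as usual. */
	return dmaengine_prep_slave_single(chan, buf, len, dir, flags);
}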
Frank, you wrote: "Thanks, we also consider ..."
Does that mean that you have any plans to work on this?
I would definitely be interested.
Kind regards,
Niklas
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 04/27] PCI: endpoint: Add inbound mapping ops to EPC core
2025-12-02 15:58 ` Frank Li
@ 2025-12-03 14:12 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-03 14:12 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Tue, Dec 02, 2025 at 10:58:08AM -0500, Frank Li wrote:
> On Tue, Dec 02, 2025 at 03:25:31PM +0900, Koichiro Den wrote:
> > On Mon, Dec 01, 2025 at 02:19:55PM -0500, Frank Li wrote:
> > > On Sun, Nov 30, 2025 at 01:03:42AM +0900, Koichiro Den wrote:
> > > > Add new EPC ops map_inbound() and unmap_inbound() for mapping a subrange
> > > > of a BAR into CPU space. These will be implemented by controller drivers
> > > > such as DesignWare.
> > > >
> > > > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > > > ---
> > > > drivers/pci/endpoint/pci-epc-core.c | 44 +++++++++++++++++++++++++++++
> > > > include/linux/pci-epc.h | 11 ++++++++
> > > > 2 files changed, 55 insertions(+)
> > > >
> > > > diff --git a/drivers/pci/endpoint/pci-epc-core.c b/drivers/pci/endpoint/pci-epc-core.c
> > > > index ca7f19cc973a..825109e54ba9 100644
> > > > --- a/drivers/pci/endpoint/pci-epc-core.c
> > > > +++ b/drivers/pci/endpoint/pci-epc-core.c
> > > > @@ -444,6 +444,50 @@ int pci_epc_map_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > > }
> > > > EXPORT_SYMBOL_GPL(pci_epc_map_addr);
> > > >
> > > > +/**
> > > > + * pci_epc_map_inbound() - map a BAR subrange to the local CPU address
> > > > + * @epc: the EPC device on which BAR has to be configured
> > > > + * @func_no: the physical endpoint function number in the EPC device
> > > > + * @vfunc_no: the virtual endpoint function number in the physical function
> > > > + * @epf_bar: the struct epf_bar that contains the BAR information
> > > > + * @offset: byte offset from the BAR base selected by the host
> > > > + *
> > > > + * Invoke to configure the BAR of the endpoint device and map a subrange
> > > > + * selected by @offset to a CPU address.
> > > > + *
> > > > + * Returns 0 on success, -EOPNOTSUPP if unsupported, or a negative errno.
> > > > + */
> > > > +int pci_epc_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > > + struct pci_epf_bar *epf_bar, u64 offset)
> > >
> > > Supposed need size information? if BAR's size is 4G,
> > >
> > > you may just want to map from 0x4000 to 0x5000, not whole offset to end's
> > > space.
> >
> > That sounds reasonable, the interface should accept a size parameter so that it
> > is flexible enough to configure arbitrary sub-ranges inside a BAR, instead of
> > implicitly using "offset to end of BAR".
> >
> > For the ntb_transport use_remote_edma=1 testing on R‑Car S4 I only needed at
> > most two sub-ranges inside one BAR, so a size parameter was not strictly
> > necessary in that setup, but I agree that the current interface looks
> > half-baked and not very generic. I'll extend it to take size as well.
> >
> > >
> > > commit message said map into CPU space, where CPU address?
> >
> > The interface currently requires a pointer to a struct pci_epf_bar instance and
> > uses its phys_addr field as the CPU physical base address.
> > I'm not fully convinced that using struct pci_epf_bar this way is the cleanest
> > approach, so I'm open to better suggestions if you have any.
>
> struct pci_epf_bar already has phys_addr and size information.
>
> pci_epc_set_bars(..., struct pci_epf_bar *epf_bar, size_t num_of_bar)
>
> to set many memory regions in one BAR space. When num_of_bar is 1, fall back
> to the existing pci_epc_set_bar().
My concern was that reusing struct pci_epf_bar, which represents an entire
BAR (starting at offset 0), for the new pci_epc_map_inbound(), might feel
semantically a bit unclean.
The pci_epc_set_bars() idea sounds useful for some scenarios, thank you for the
suggestion. I also think it's not uncommon to want to add Address Match
Mode sub-range mappings incrementally, rather than configuring all ranges
in one shot through pci_epc_set_bars().
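To illustrate what I have in mind, here is a purely hypothetical sketch of the
interface extended with an explicit size, used to add two Address Match Mode
sub-ranges of one BAR incrementally. Neither the size parameter nor this helper
exists in the posted series; the sub-range sizes just mirror the configfs example
from the cover letter:

/* Hypothetical extended prototypes (the posted series has no size argument): */
int pci_epc_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
			struct pci_epf_bar *epf_bar, u64 offset, size_t size);
void pci_epc_unmap_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
			   struct pci_epf_bar *epf_bar, u64 offset, size_t size);

static int vntb_map_mw_subranges(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
				 struct pci_epf_bar *epf_bar)
{
	int ret;

	/* Data MW: first 0xe0000 bytes of the BAR. */
	ret = pci_epc_map_inbound(epc, func_no, vfunc_no, epf_bar, 0, 0xe0000);
	if (ret)
		return ret;

	/* eDMA control structures: next 0x20000 bytes, added incrementally. */
	ret = pci_epc_map_inbound(epc, func_no, vfunc_no, epf_bar, 0xe0000, 0x20000);
	if (ret)
		pci_epc_unmap_inbound(epc, func_no, vfunc_no, epf_bar, 0, 0xe0000);

	return ret;
}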
>
> If there is an IOMMU in the EP system, it may be easier to use the IOMMU to
> map to a different place.
That certainly makes sense, though personally I would prefer a single, more
generic and intuitive interface that does not depend on the presence of an
IOMMU.
Thank you for the review,
Koichiro
>
> Frank
>
> >
> > Koichiro
> >
> > >
> > > Frank
> > > > +{
> > > > + if (!epc || !epc->ops || !epc->ops->map_inbound)
> > > > + return -EOPNOTSUPP;
> > > > +
> > > > + return epc->ops->map_inbound(epc, func_no, vfunc_no, epf_bar, offset);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(pci_epc_map_inbound);
> > > > +
> > > > +/**
> > > > + * pci_epc_unmap_inbound() - unmap a previously mapped BAR subrange
> > > > + * @epc: the EPC device on which the inbound mapping was programmed
> > > > + * @func_no: the physical endpoint function number in the EPC device
> > > > + * @vfunc_no: the virtual endpoint function number in the physical function
> > > > + * @epf_bar: the struct epf_bar used when the mapping was created
> > > > + * @offset: byte offset from the BAR base that was mapped
> > > > + *
> > > > + * Invoke to remove a BAR subrange mapping created by pci_epc_map_inbound().
> > > > + * If the controller has no support, this call is a no-op.
> > > > + */
> > > > +void pci_epc_unmap_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > > + struct pci_epf_bar *epf_bar, u64 offset)
> > > > +{
> > > > + if (!epc || !epc->ops || !epc->ops->unmap_inbound)
> > > > + return;
> > > > +
> > > > + epc->ops->unmap_inbound(epc, func_no, vfunc_no, epf_bar, offset);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(pci_epc_unmap_inbound);
> > > > +
> > > > /**
> > > > * pci_epc_mem_map() - allocate and map a PCI address to a CPU address
> > > > * @epc: the EPC device on which the CPU address is to be allocated and mapped
> > > > diff --git a/include/linux/pci-epc.h b/include/linux/pci-epc.h
> > > > index 4286bfdbfdfa..a5fb91cc2982 100644
> > > > --- a/include/linux/pci-epc.h
> > > > +++ b/include/linux/pci-epc.h
> > > > @@ -71,6 +71,8 @@ struct pci_epc_map {
> > > > * region
> > > > * @map_addr: ops to map CPU address to PCI address
> > > > * @unmap_addr: ops to unmap CPU address and PCI address
> > > > + * @map_inbound: ops to map a subrange inside a BAR to CPU address.
> > > > + * @unmap_inbound: ops to unmap a subrange inside a BAR and CPU address.
> > > > * @set_msi: ops to set the requested number of MSI interrupts in the MSI
> > > > * capability register
> > > > * @get_msi: ops to get the number of MSI interrupts allocated by the RC from
> > > > @@ -99,6 +101,10 @@ struct pci_epc_ops {
> > > > phys_addr_t addr, u64 pci_addr, size_t size);
> > > > void (*unmap_addr)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > > phys_addr_t addr);
> > > > + int (*map_inbound)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > > + struct pci_epf_bar *epf_bar, u64 offset);
> > > > + void (*unmap_inbound)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > > + struct pci_epf_bar *epf_bar, u64 offset);
> > > > int (*set_msi)(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > > u8 nr_irqs);
> > > > int (*get_msi)(struct pci_epc *epc, u8 func_no, u8 vfunc_no);
> > > > @@ -286,6 +292,11 @@ int pci_epc_map_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > > u64 pci_addr, size_t size);
> > > > void pci_epc_unmap_addr(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > > phys_addr_t phys_addr);
> > > > +
> > > > +int pci_epc_map_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > > + struct pci_epf_bar *epf_bar, u64 offset);
> > > > +void pci_epc_unmap_inbound(struct pci_epc *epc, u8 func_no, u8 vfunc_no,
> > > > + struct pci_epf_bar *epf_bar, u64 offset);
> > > > int pci_epc_set_msi(struct pci_epc *epc, u8 func_no, u8 vfunc_no, u8 nr_irqs);
> > > > int pci_epc_get_msi(struct pci_epc *epc, u8 func_no, u8 vfunc_no);
> > > > int pci_epc_set_msix(struct pci_epc *epc, u8 func_no, u8 vfunc_no, u16 nr_irqs,
> > > > --
> > > > 2.48.1
> > > >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-12-02 14:53 ` Dave Jiang
@ 2025-12-03 14:19 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-03 14:19 UTC (permalink / raw)
To: Dave Jiang
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Tue, Dec 02, 2025 at 07:53:59AM -0700, Dave Jiang wrote:
>
>
> On 12/1/25 11:59 PM, Koichiro Den wrote:
> > On Mon, Dec 01, 2025 at 02:46:41PM -0700, Dave Jiang wrote:
> >>
> >>
> >> On 11/29/25 9:03 AM, Koichiro Den wrote:
> >>> Add a new transport backend that uses a remote DesignWare eDMA engine
> >>> located on the NTB endpoint to move data between host and endpoint.
> >>>
> >>> In this mode:
> >>>
> >>> - The endpoint exposes a dedicated memory window that contains the
> >>> eDMA register block followed by a small control structure (struct
> >>> ntb_edma_info) and per-channel linked-list (LL) rings.
> >>>
> >>> - On the endpoint side, ntb_edma_setup_mws() allocates the control
> >>> structure and LL rings in endpoint memory, then programs an inbound
> >>> iATU region so that the host can access them via a peer MW.
> >>>
> >>> - On the host side, ntb_edma_setup_peer() ioremaps the peer MW, reads
> >>> ntb_edma_info and configures a dw-edma DMA device to use the LL
> >>> rings provided by the endpoint.
> >>>
> >>> - ntb_transport is extended with a new backend_ops implementation that
> >>> routes TX and RX enqueue/poll operations through the remote eDMA
> >>> rings while keeping the existing shared-memory backend intact.
> >>>
> >>> - The host signals the endpoint via a dedicated DMA read channel.
> >>> 'use_msi' module option is ignored when 'use_remote_edma=1'.
> >>>
> >>> The new mode is guarded by a Kconfig option (NTB_TRANSPORT_EDMA) and a
> >>> module parameter (use_remote_edma). When disabled, the existing
> >>> ntb_transport behaviour is unchanged.
> >>>
> >>> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> >>> ---
> >>> drivers/ntb/Kconfig | 11 +
> >>> drivers/ntb/Makefile | 3 +
> >>> drivers/ntb/ntb_edma.c | 628 ++++++++
> >>> drivers/ntb/ntb_edma.h | 128 ++
> >>
> >> I briefly looked over the code. It feels like the EDMA bits should go in drivers/ntb/hw/ rather than drivers/ntb/ given it's pretty specific to the designware hardware. What sits in drivers/ntb should be generic APIs where a different vendor can utilize it and not having to adopt to designware hardware specifics. So maybe a bit more abstractions are needed?
> >
> > That makes sense, I'll reorganize things. Thank you for the suggestion.
>
> Also, since a new transport is being introduced. Please update Documentation/driver-api/ntb.rst. While the current documentation doesn't provide adaquate API documentation for ntb_transport APIs, hopefully the new transport can do better going forward. :) Thank you!
Thanks for the reminder. I'll update ntb.rst in the next revision (perhaps
as part of a split-out series).
Thank you for the review,
Koichiro
>
> DJ
>
> >
> >>
> >>> .../{ntb_transport.c => ntb_transport_core.c} | 1281 ++++++++++++++++-
> >>> 5 files changed, 2048 insertions(+), 3 deletions(-)
> >>> create mode 100644 drivers/ntb/ntb_edma.c
> >>> create mode 100644 drivers/ntb/ntb_edma.h
> >>> rename drivers/ntb/{ntb_transport.c => ntb_transport_core.c} (65%)
> >>>
> >>> diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
> >>> index df16c755b4da..db63f02bb116 100644
> >>> --- a/drivers/ntb/Kconfig
> >>> +++ b/drivers/ntb/Kconfig
> >>> @@ -37,4 +37,15 @@ config NTB_TRANSPORT
> >>>
> >>> If unsure, say N.
> >>>
> >>> +config NTB_TRANSPORT_EDMA
> >>> + bool "NTB Transport backed by remote eDMA"
> >>> + depends on NTB_TRANSPORT
> >>> + depends on PCI
> >>> + select DMA_ENGINE
> >>> + help
> >>> + Enable a transport backend that uses a remote DesignWare eDMA engine
> >>> + exposed through a dedicated NTB memory window. The host uses the
> >>> + endpoint's eDMA engine to move data in both directions.
> >>> + Say Y here if you intend to use the 'use_remote_edma' module parameter.
> >>> +
> >>> endif # NTB
> >>> diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
> >>> index 3a6fa181ff99..51f0e1e3aec7 100644
> >>> --- a/drivers/ntb/Makefile
> >>> +++ b/drivers/ntb/Makefile
> >>> @@ -4,3 +4,6 @@ obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
> >>>
> >>> ntb-y := core.o
> >>> ntb-$(CONFIG_NTB_MSI) += msi.o
> >>> +
> >>> +ntb_transport-y := ntb_transport_core.o
> >>> +ntb_transport-$(CONFIG_NTB_TRANSPORT_EDMA) += ntb_edma.o
> >>> diff --git a/drivers/ntb/ntb_edma.c b/drivers/ntb/ntb_edma.c
> >>> new file mode 100644
> >>> index 000000000000..cb35e0d56aa8
> >>> --- /dev/null
> >>> +++ b/drivers/ntb/ntb_edma.c
> >>> @@ -0,0 +1,628 @@
> >>> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> >>> +
> >>> +#include <linux/module.h>
> >>> +#include <linux/device.h>
> >>> +#include <linux/pci.h>
> >>> +#include <linux/ntb.h>
> >>> +#include <linux/io.h>
> >>> +#include <linux/iommu.h>
> >>> +#include <linux/dmaengine.h>
> >>> +#include <linux/pci-epc.h>
> >>> +#include <linux/dma/edma.h>
> >>> +#include <linux/irq.h>
> >>> +#include <linux/irqdomain.h>
> >>> +#include <linux/of.h>
> >>> +#include <linux/of_irq.h>
> >>> +#include <dt-bindings/interrupt-controller/arm-gic.h>
> >>> +
> >>> +#include "ntb_edma.h"
> >>> +
> >>> +/*
> >>> + * The interrupt register offsets below are taken from the DesignWare
> >>> + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> >>> + * backend currently only supports this layout.
> >>> + */
> >>> +#define DMA_WRITE_INT_STATUS_OFF 0x4c
> >>> +#define DMA_WRITE_INT_MASK_OFF 0x54
> >>> +#define DMA_WRITE_INT_CLEAR_OFF 0x58
> >>> +#define DMA_READ_INT_STATUS_OFF 0xa0
> >>> +#define DMA_READ_INT_MASK_OFF 0xa8
> >>> +#define DMA_READ_INT_CLEAR_OFF 0xac
> >>> +
> >>> +#define NTB_EDMA_NOTIFY_MAX_QP 64
> >>> +
> >>> +static unsigned int edma_spi = 417; /* 0x1a1 */
> >>> +module_param(edma_spi, uint, 0644);
> >>> +MODULE_PARM_DESC(edma_spi, "SPI number used by remote eDMA interrupt (EP local)");
> >>> +
> >>> +static u64 edma_regs_phys = 0xe65d5000;
> >>> +module_param(edma_regs_phys, ullong, 0644);
> >>> +MODULE_PARM_DESC(edma_regs_phys, "Physical base address of local eDMA registers (EP)");
> >>> +
> >>> +static unsigned long edma_regs_size = 0x1200;
> >>> +module_param(edma_regs_size, ulong, 0644);
> >>> +MODULE_PARM_DESC(edma_regs_size, "Size of the local eDMA register space (EP)");
> >>> +
> >>> +struct ntb_edma_intr {
> >>> + u32 db[NTB_EDMA_NOTIFY_MAX_QP];
> >>> +};
> >>> +
> >>> +struct ntb_edma_ctx {
> >>> + void *ll_wr_virt[EDMA_WR_CH_NUM];
> >>> + dma_addr_t ll_wr_phys[EDMA_WR_CH_NUM];
> >>> + void *ll_rd_virt[EDMA_RD_CH_NUM + 1];
> >>> + dma_addr_t ll_rd_phys[EDMA_RD_CH_NUM + 1];
> >>> +
> >>> + struct ntb_edma_intr *intr_ep_virt;
> >>> + dma_addr_t intr_ep_phys;
> >>> + struct ntb_edma_intr *intr_rc_virt;
> >>> + dma_addr_t intr_rc_phys;
> >>> + u32 notify_qp_max;
> >>> +
> >>> + bool initialized;
> >>> +};
> >>> +
> >>> +static struct ntb_edma_ctx edma_ctx;
> >>> +
> >>> +typedef void (*ntb_edma_interrupt_cb_t)(void *data, int qp_num);
> >>> +
> >>> +struct ntb_edma_interrupt {
> >>> + int virq;
> >>> + void __iomem *base;
> >>> + ntb_edma_interrupt_cb_t cb;
> >>> + void *data;
> >>> +};
> >>> +
> >>> +static struct ntb_edma_interrupt ntb_edma_intr;
> >>> +
> >>> +static int ntb_edma_map_spi_to_virq(struct device *dev, unsigned int spi)
> >>> +{
> >>> + struct device_node *np = dev_of_node(dev);
> >>> + struct device_node *parent;
> >>> + struct irq_fwspec fwspec = { 0 };
> >>> + int virq;
> >>> +
> >>> + parent = of_irq_find_parent(np);
> >>> + if (!parent)
> >>> + return -ENODEV;
> >>> +
> >>> + fwspec.fwnode = of_fwnode_handle(parent);
> >>> + fwspec.param_count = 3;
> >>> + fwspec.param[0] = GIC_SPI;
> >>> + fwspec.param[1] = spi;
> >>> + fwspec.param[2] = IRQ_TYPE_LEVEL_HIGH;
> >>> +
> >>> + virq = irq_create_fwspec_mapping(&fwspec);
> >>> + of_node_put(parent);
> >>> + return (virq > 0) ? virq : -EINVAL;
> >>> +}
> >>> +
> >>> +static irqreturn_t ntb_edma_isr(int irq, void *data)
> >>> +{
> >>> + struct ntb_edma_interrupt *v = data;
> >>> + u32 mask = BIT(EDMA_RD_CH_NUM);
> >>> + u32 i, val;
> >>> +
> >>> + /*
> >>> + * We do not ack interrupts here but instead we mask all local interrupt
> >>> + * sources except the read channel used for notification. This reduces
> >>> + * needless ISR invocations.
> >>> + *
> >>> + * In theory we could configure LIE=1/RIE=0 only for the notification
> >>> + * transfer (keeping all other channels at LIE=1/RIE=1), but that would
> >>> + * require intrusive changes to the dw-edma core.
> >>> + *
> >>> + * Note: The host side may have already cleared the read interrupt used
> >>> + * for notification, so reading DMA_READ_INT_CLEAR_OFF is not a reliable
> >>> + * way to detect it. As a result, we cannot reliably tell which specific
> >>> + * channel triggered this interrupt. intr_ep_virt->db[i] teaches us
> >>> + * instead.
> >>> + */
> >>> + iowrite32(~0x0, v->base + DMA_WRITE_INT_MASK_OFF);
> >>> + iowrite32(~mask, v->base + DMA_READ_INT_MASK_OFF);
> >>> +
> >>> + if (!v->cb || !edma_ctx.intr_ep_virt)
> >>> + return IRQ_HANDLED;
> >>> +
> >>> + for (i = 0; i < edma_ctx.notify_qp_max; i++) {
> >>> + val = READ_ONCE(edma_ctx.intr_ep_virt->db[i]);
> >>> + if (!val)
> >>> + continue;
> >>> +
> >>> + WRITE_ONCE(edma_ctx.intr_ep_virt->db[i], 0);
> >>> + v->cb(v->data, i);
> >>> + }
> >>> +
> >>> + return IRQ_HANDLED;
> >>> +}
> >>> +
> >>> +int ntb_edma_setup_isr(struct device *dev, struct device *epc_dev,
> >>> + ntb_edma_interrupt_cb_t cb, void *data)
> >>> +{
> >>> + struct ntb_edma_interrupt *v = &ntb_edma_intr;
> >>> + int virq = ntb_edma_map_spi_to_virq(epc_dev->parent, edma_spi);
> >>> + int ret;
> >>> +
> >>> + if (virq < 0) {
> >>> + dev_err(dev, "failed to get virq (%d)\n", virq);
> >>> + return virq;
> >>> + }
> >>> +
> >>> + v->virq = virq;
> >>> + v->cb = cb;
> >>> + v->data = data;
> >>> + if (edma_regs_phys && !v->base)
> >>> + v->base = devm_ioremap(dev, edma_regs_phys, edma_regs_size);
> >>> + if (!v->base) {
> >>> + dev_err(dev, "failed to setup v->base\n");
> >>> + return -1;
> >>> + }
> >>> + ret = devm_request_irq(dev, v->virq, ntb_edma_isr, 0, "ntb-edma", v);
> >>> + if (ret)
> >>> + return ret;
> >>> +
> >>> + if (v->base) {
> >>> + iowrite32(0x0, v->base + DMA_WRITE_INT_MASK_OFF);
> >>> + iowrite32(0x0, v->base + DMA_READ_INT_MASK_OFF);
> >>> + }
> >>> + return 0;
> >>> +}
> >>> +
> >>> +void ntb_edma_teardown_isr(struct device *dev)
> >>> +{
> >>> + struct ntb_edma_interrupt *v = &ntb_edma_intr;
> >>> +
> >>> + /* Mask all write/read interrupts so we don't get called again. */
> >>> + if (v->base) {
> >>> + iowrite32(~0x0, v->base + DMA_WRITE_INT_MASK_OFF);
> >>> + iowrite32(~0x0, v->base + DMA_READ_INT_MASK_OFF);
> >>> + }
> >>> +
> >>> + if (v->virq > 0)
> >>> + devm_free_irq(dev, v->virq, v);
> >>> +
> >>> + if (v->base)
> >>> + devm_iounmap(dev, v->base);
> >>> +
> >>> + v->virq = 0;
> >>> + v->cb = NULL;
> >>> + v->data = NULL;
> >>> +}
> >>> +
> >>> +int ntb_edma_setup_mws(struct ntb_dev *ndev)
> >>> +{
> >>> + const size_t info_bytes = PAGE_SIZE;
> >>> + resource_size_t size_max, offset;
> >>> + dma_addr_t intr_phys, info_phys;
> >>> + u32 wr_done = 0, rd_done = 0;
> >>> + struct ntb_edma_intr *intr;
> >>> + struct ntb_edma_info *info;
> >>> + int peer_mw, mw_index, rc;
> >>> + struct iommu_domain *dom;
> >>> + bool reg_mapped = false;
> >>> + size_t ll_bytes, size;
> >>> + struct pci_epc *epc;
> >>> + struct device *dev;
> >>> + unsigned long iova;
> >>> + phys_addr_t phys;
> >>> + u64 need;
> >>> + u32 i;
> >>> +
> >>> + /* +1 is for interruption */
> >>> + ll_bytes = (EDMA_WR_CH_NUM + EDMA_RD_CH_NUM + 1) * DMA_LLP_MEM_SIZE;
> >>> + need = EDMA_REG_SIZE + info_bytes + ll_bytes;
> >>> +
> >>> + epc = ntb_get_pci_epc(ndev);
> >>> + if (!epc)
> >>> + return -ENODEV;
> >>> + dev = epc->dev.parent;
> >>> +
> >>> + if (edma_ctx.initialized)
> >>> + return 0;
> >>> +
> >>> + info = dma_alloc_coherent(dev, info_bytes, &info_phys, GFP_KERNEL);
> >>> + if (!info)
> >>> + return -ENOMEM;
> >>> +
> >>> + memset(info, 0, info_bytes);
> >>> + info->magic = NTB_EDMA_INFO_MAGIC;
> >>> + info->wr_cnt = EDMA_WR_CH_NUM;
> >>> + info->rd_cnt = EDMA_RD_CH_NUM + 1; /* +1 for interruption */
> >>> + info->regs_phys = edma_regs_phys;
> >>> + info->ll_stride = DMA_LLP_MEM_SIZE;
> >>> +
> >>> + for (i = 0; i < EDMA_WR_CH_NUM; i++) {
> >>> + edma_ctx.ll_wr_virt[i] = dma_alloc_attrs(dev, DMA_LLP_MEM_SIZE,
> >>> + &edma_ctx.ll_wr_phys[i],
> >>> + GFP_KERNEL,
> >>> + DMA_ATTR_FORCE_CONTIGUOUS);
> >>> + if (!edma_ctx.ll_wr_virt[i]) {
> >>> + rc = -ENOMEM;
> >>> + goto err_free_ll;
> >>> + }
> >>> + wr_done++;
> >>> + info->ll_wr_phys[i] = edma_ctx.ll_wr_phys[i];
> >>> + }
> >>> + for (i = 0; i < EDMA_RD_CH_NUM + 1; i++) {
> >>> + edma_ctx.ll_rd_virt[i] = dma_alloc_attrs(dev, DMA_LLP_MEM_SIZE,
> >>> + &edma_ctx.ll_rd_phys[i],
> >>> + GFP_KERNEL,
> >>> + DMA_ATTR_FORCE_CONTIGUOUS);
> >>> + if (!edma_ctx.ll_rd_virt[i]) {
> >>> + rc = -ENOMEM;
> >>> + goto err_free_ll;
> >>> + }
> >>> + rd_done++;
> >>> + info->ll_rd_phys[i] = edma_ctx.ll_rd_phys[i];
> >>> + }
> >>> +
> >>> + /* For interruption */
> >>> + edma_ctx.notify_qp_max = NTB_EDMA_NOTIFY_MAX_QP;
> >>> + intr = dma_alloc_coherent(dev, sizeof(*intr), &intr_phys, GFP_KERNEL);
> >>> + if (!intr) {
> >>> + rc = -ENOMEM;
> >>> + goto err_free_ll;
> >>> + }
> >>> + memset(intr, 0, sizeof(*intr));
> >>> + edma_ctx.intr_ep_virt = intr;
> >>> + edma_ctx.intr_ep_phys = intr_phys;
> >>> + info->intr_dar_base = intr_phys;
> >>> +
> >>> + peer_mw = ntb_peer_mw_count(ndev);
> >>> + if (peer_mw <= 0) {
> >>> + rc = -ENODEV;
> >>> + goto err_free_ll;
> >>> + }
> >>> +
> >>> + mw_index = peer_mw - 1; /* last MW */
> >>> +
> >>> + rc = ntb_mw_get_align(ndev, 0, mw_index, 0, NULL, &size_max,
> >>> + &offset);
> >>> + if (rc)
> >>> + goto err_free_ll;
> >>> +
> >>> + if (size_max < need) {
> >>> + rc = -ENOSPC;
> >>> + goto err_free_ll;
> >>> + }
> >>> +
> >>> + /* Map register space (direct) */
> >>> + dom = iommu_get_domain_for_dev(dev);
> >>> + if (dom) {
> >>> + phys = edma_regs_phys & PAGE_MASK;
> >>> + size = PAGE_ALIGN(EDMA_REG_SIZE + edma_regs_phys - phys);
> >>> + iova = phys;
> >>> +
> >>> + rc = iommu_map(dom, iova, phys, EDMA_REG_SIZE,
> >>> + IOMMU_READ | IOMMU_WRITE | IOMMU_MMIO, GFP_KERNEL);
> >>> + if (rc)
> >>> + dev_err(&ndev->dev, "failed to create direct mapping for eDMA reg space\n");
> >>> + reg_mapped = true;
> >>> + }
> >>> +
> >>> + rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_regs_phys, EDMA_REG_SIZE, offset);
> >>> + if (rc)
> >>> + goto err_unmap_reg;
> >>> +
> >>> + offset += EDMA_REG_SIZE;
> >>> +
> >>> + /* Map ntb_edma_info */
> >>> + rc = ntb_mw_set_trans(ndev, 0, mw_index, info_phys, info_bytes, offset);
> >>> + if (rc)
> >>> + goto err_clear_trans;
> >>> + offset += info_bytes;
> >>> +
> >>> + /* Map LL location */
> >>> + for (i = 0; i < EDMA_WR_CH_NUM; i++) {
> >>> + rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_ctx.ll_wr_phys[i],
> >>> + DMA_LLP_MEM_SIZE, offset);
> >>> + if (rc)
> >>> + goto err_clear_trans;
> >>> + offset += DMA_LLP_MEM_SIZE;
> >>> + }
> >>> + for (i = 0; i < EDMA_RD_CH_NUM + 1; i++) {
> >>> + rc = ntb_mw_set_trans(ndev, 0, mw_index, edma_ctx.ll_rd_phys[i],
> >>> + DMA_LLP_MEM_SIZE, offset);
> >>> + if (rc)
> >>> + goto err_clear_trans;
> >>> + offset += DMA_LLP_MEM_SIZE;
> >>> + }
> >>> + edma_ctx.initialized = true;
> >>> +
> >>> + return 0;
> >>> +
> >>> +err_clear_trans:
> >>> + /*
> >>> + * Tear down the NTB translation window used for the eDMA MW.
> >>> + * There is no sub-range clear API for ntb_mw_set_trans(), so we
> >>> + * unconditionally drop the whole mapping on error.
> >>> + */
> >>> + ntb_mw_clear_trans(ndev, 0, mw_index);
> >>> +
> >>> +err_unmap_reg:
> >>> + if (reg_mapped)
> >>> + iommu_unmap(dom, iova, size);
> >>> +err_free_ll:
> >>> + while (rd_done--)
> >>> + dma_free_attrs(dev, DMA_LLP_MEM_SIZE,
> >>> + edma_ctx.ll_rd_virt[rd_done],
> >>> + edma_ctx.ll_rd_phys[rd_done],
> >>> + DMA_ATTR_FORCE_CONTIGUOUS);
> >>> + while (wr_done--)
> >>> + dma_free_attrs(dev, DMA_LLP_MEM_SIZE,
> >>> + edma_ctx.ll_wr_virt[wr_done],
> >>> + edma_ctx.ll_wr_phys[wr_done],
> >>> + DMA_ATTR_FORCE_CONTIGUOUS);
> >>> + if (edma_ctx.intr_ep_virt)
> >>> + dma_free_coherent(dev, sizeof(struct ntb_edma_intr),
> >>> + edma_ctx.intr_ep_virt,
> >>> + edma_ctx.intr_ep_phys);
> >>> + dma_free_coherent(dev, info_bytes, info, info_phys);
> >>> + return rc;
> >>> +}
> >>> +
> >>> +static int ntb_edma_irq_vector(struct device *dev, unsigned int nr)
> >>> +{
> >>> + struct pci_dev *pdev = to_pci_dev(dev);
> >>> + int ret, nvec;
> >>> +
> >>> + nvec = pci_msi_vec_count(pdev);
> >>> + for (; nr < nvec; nr++) {
> >>> + ret = pci_irq_vector(pdev, nr);
> >>> + if (!irq_has_action(ret))
> >>> + return ret;
> >>> + }
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static const struct dw_edma_plat_ops ntb_edma_ops = {
> >>> + .irq_vector = ntb_edma_irq_vector,
> >>> +};
> >>> +
> >>> +int ntb_edma_setup_peer(struct ntb_dev *ndev)
> >>> +{
> >>> + struct ntb_edma_info *info;
> >>> + unsigned int wr_cnt, rd_cnt;
> >>> + struct dw_edma_chip *chip;
> >>> + void __iomem *edma_virt;
> >>> + phys_addr_t edma_phys;
> >>> + resource_size_t mw_size;
> >>> + u64 off = EDMA_REG_SIZE;
> >>> + int peer_mw, mw_index;
> >>> + unsigned int i;
> >>> + int ret;
> >>> +
> >>> + peer_mw = ntb_peer_mw_count(ndev);
> >>> + if (peer_mw <= 0)
> >>> + return -ENODEV;
> >>> +
> >>> + mw_index = peer_mw - 1; /* last MW */
> >>> +
> >>> + ret = ntb_peer_mw_get_addr(ndev, mw_index, &edma_phys,
> >>> + &mw_size);
> >>> + if (ret)
> >>> + return -1;
> >>> +
> >>> + edma_virt = ioremap(edma_phys, mw_size);
> >>> +
> >>> + chip = devm_kzalloc(&ndev->dev, sizeof(*chip), GFP_KERNEL);
> >>> + if (!chip) {
> >>> + ret = -ENOMEM;
> >>> + return ret;
> >>> + }
> >>> +
> >>> + chip->dev = &ndev->pdev->dev;
> >>> + chip->nr_irqs = 4;
> >>> + chip->ops = &ntb_edma_ops;
> >>> + chip->flags = 0;
> >>> + chip->reg_base = edma_virt;
> >>> + chip->mf = EDMA_MF_EDMA_UNROLL;
> >>> +
> >>> + info = edma_virt + off;
> >>> + if (info->magic != NTB_EDMA_INFO_MAGIC)
> >>> + return -EINVAL;
> >>> + wr_cnt = info->wr_cnt;
> >>> + rd_cnt = info->rd_cnt;
> >>> + chip->ll_wr_cnt = wr_cnt;
> >>> + chip->ll_rd_cnt = rd_cnt;
> >>> + off += PAGE_SIZE;
> >>> +
> >>> + edma_ctx.notify_qp_max = NTB_EDMA_NOTIFY_MAX_QP;
> >>> + edma_ctx.intr_ep_phys = info->intr_dar_base;
> >>> + if (edma_ctx.intr_ep_phys) {
> >>> + edma_ctx.intr_rc_virt =
> >>> + dma_alloc_coherent(&ndev->pdev->dev,
> >>> + sizeof(struct ntb_edma_intr),
> >>> + &edma_ctx.intr_rc_phys,
> >>> + GFP_KERNEL);
> >>> + if (!edma_ctx.intr_rc_virt)
> >>> + return -ENOMEM;
> >>> + memset(edma_ctx.intr_rc_virt, 0,
> >>> + sizeof(struct ntb_edma_intr));
> >>> + }
> >>> +
> >>> + for (i = 0; i < wr_cnt; i++) {
> >>> + chip->ll_region_wr[i].vaddr.io = edma_virt + off;
> >>> + chip->ll_region_wr[i].paddr = info->ll_wr_phys[i];
> >>> + chip->ll_region_wr[i].sz = DMA_LLP_MEM_SIZE;
> >>> + off += DMA_LLP_MEM_SIZE;
> >>> + }
> >>> + for (i = 0; i < rd_cnt; i++) {
> >>> + chip->ll_region_rd[i].vaddr.io = edma_virt + off;
> >>> + chip->ll_region_rd[i].paddr = info->ll_rd_phys[i];
> >>> + chip->ll_region_rd[i].sz = DMA_LLP_MEM_SIZE;
> >>> + off += DMA_LLP_MEM_SIZE;
> >>> + }
> >>> +
> >>> + if (!pci_dev_msi_enabled(ndev->pdev))
> >>> + return -ENXIO;
> >>> +
> >>> + ret = dw_edma_probe(chip);
> >>> + if (ret) {
> >>> + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> >>> + return ret;
> >>> + }
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +struct ntb_edma_filter {
> >>> + struct device *dma_dev;
> >>> + u32 direction;
> >>> +};
> >>> +
> >>> +static bool ntb_edma_filter_fn(struct dma_chan *chan, void *arg)
> >>> +{
> >>> + struct ntb_edma_filter *filter = arg;
> >>> + u32 dir = filter->direction;
> >>> + struct dma_slave_caps caps;
> >>> + int ret;
> >>> +
> >>> + if (chan->device->dev != filter->dma_dev)
> >>> + return false;
> >>> +
> >>> + ret = dma_get_slave_caps(chan, &caps);
> >>> + if (ret < 0)
> >>> + return false;
> >>> +
> >>> + return !!(caps.directions & dir);
> >>> +}
> >>> +
> >>> +void ntb_edma_teardown_chans(struct ntb_edma_chans *edma)
> >>> +{
> >>> + unsigned int i;
> >>> +
> >>> + for (i = 0; i < edma->num_wr_chan; i++)
> >>> + dma_release_channel(edma->wr_chan[i]);
> >>> +
> >>> + for (i = 0; i < edma->num_rd_chan; i++)
> >>> + dma_release_channel(edma->rd_chan[i]);
> >>> +
> >>> + if (edma->intr_chan)
> >>> + dma_release_channel(edma->intr_chan);
> >>> +}
> >>> +
> >>> +int ntb_edma_setup_chans(struct device *dma_dev, struct ntb_edma_chans *edma)
> >>> +{
> >>> + struct ntb_edma_filter filter;
> >>> + dma_cap_mask_t dma_mask;
> >>> + unsigned int i;
> >>> +
> >>> + dma_cap_zero(dma_mask);
> >>> + dma_cap_set(DMA_SLAVE, dma_mask);
> >>> +
> >>> + memset(edma, 0, sizeof(*edma));
> >>> + edma->dev = dma_dev;
> >>> +
> >>> + filter.dma_dev = dma_dev;
> >>> + filter.direction = BIT(DMA_DEV_TO_MEM);
> >>> + for (i = 0; i < EDMA_WR_CH_NUM; i++) {
> >>> + edma->wr_chan[i] = dma_request_channel(dma_mask,
> >>> + ntb_edma_filter_fn,
> >>> + &filter);
> >>> + if (!edma->wr_chan[i])
> >>> + break;
> >>> + edma->num_wr_chan++;
> >>> + }
> >>> +
> >>> + filter.direction = BIT(DMA_MEM_TO_DEV);
> >>> + for (i = 0; i < EDMA_RD_CH_NUM; i++) {
> >>> + edma->rd_chan[i] = dma_request_channel(dma_mask,
> >>> + ntb_edma_filter_fn,
> >>> + &filter);
> >>> + if (!edma->rd_chan[i])
> >>> + break;
> >>> + edma->num_rd_chan++;
> >>> + }
> >>> +
> >>> + edma->intr_chan = dma_request_channel(dma_mask, ntb_edma_filter_fn,
> >>> + &filter);
> >>> + if (!edma->intr_chan)
> >>> + dev_warn(dma_dev,
> >>> + "Remote eDMA notify channel could not be allocated\n");
> >>> +
> >>> + if (!edma->num_wr_chan || !edma->num_rd_chan) {
> >>> + dev_warn(dma_dev, "Remote eDMA channels failed to initialize\n");
> >>> + ntb_edma_teardown_chans(edma);
> >>> + return -ENODEV;
> >>> + }
> >>> + return 0;
> >>> +}
> >>> +
> >>> +struct dma_chan *ntb_edma_pick_chan(struct ntb_edma_chans *edma,
> >>> + remote_edma_dir_t dir)
> >>> +{
> >>> + unsigned int n, cur, idx;
> >>> + struct dma_chan **chans;
> >>> + atomic_t *cur_chan;
> >>> +
> >>> + if (dir == REMOTE_EDMA_WRITE) {
> >>> + n = edma->num_wr_chan;
> >>> + chans = edma->wr_chan;
> >>> + cur_chan = &edma->cur_wr_chan;
> >>> + } else {
> >>> + n = edma->num_rd_chan;
> >>> + chans = edma->rd_chan;
> >>> + cur_chan = &edma->cur_rd_chan;
> >>> + }
> >>> + if (WARN_ON_ONCE(!n))
> >>> + return NULL;
> >>> +
> >>> + /* Simple round-robin */
> >>> + cur = (unsigned int)atomic_inc_return(cur_chan) - 1;
> >>> + idx = cur % n;
> >>> + return chans[idx];
> >>> +}
> >>> +
> >>> +int ntb_edma_notify_peer(struct ntb_edma_chans *edma, int qp_num)
> >>> +{
> >>> + struct dma_async_tx_descriptor *txd;
> >>> + struct dma_slave_config cfg;
> >>> + struct scatterlist sgl;
> >>> + dma_cookie_t cookie;
> >>> + struct device *dev;
> >>> +
> >>> + if (!edma || !edma->intr_chan)
> >>> + return -ENXIO;
> >>> +
> >>> + if (qp_num < 0 || qp_num >= edma_ctx.notify_qp_max)
> >>> + return -EINVAL;
> >>> +
> >>> + if (!edma_ctx.intr_rc_virt || !edma_ctx.intr_ep_phys)
> >>> + return -EINVAL;
> >>> +
> >>> + dev = edma->dev;
> >>> + if (!dev)
> >>> + return -ENODEV;
> >>> +
> >>> + WRITE_ONCE(edma_ctx.intr_rc_virt->db[qp_num], 1);
> >>> +
> >>> + /* Ensure store is visible before kicking the DMA transfer */
> >>> + wmb();
> >>> +
> >>> + sg_init_table(&sgl, 1);
> >>> + sg_dma_address(&sgl) = edma_ctx.intr_rc_phys + qp_num * sizeof(u32);
> >>> + sg_dma_len(&sgl) = sizeof(u32);
> >>> +
> >>> + memset(&cfg, 0, sizeof(cfg));
> >>> + cfg.dst_addr = edma_ctx.intr_ep_phys + qp_num * sizeof(u32);
> >>> + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> >>> + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> >>> + cfg.direction = DMA_MEM_TO_DEV;
> >>> +
> >>> + if (dmaengine_slave_config(edma->intr_chan, &cfg))
> >>> + return -EINVAL;
> >>> +
> >>> + txd = dmaengine_prep_slave_sg(edma->intr_chan, &sgl, 1,
> >>> + DMA_MEM_TO_DEV,
> >>> + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> >>> + if (!txd)
> >>> + return -ENOSPC;
> >>> +
> >>> + cookie = dmaengine_submit(txd);
> >>> + if (dma_submit_error(cookie))
> >>> + return -ENOSPC;
> >>> +
> >>> + dma_async_issue_pending(edma->intr_chan);
> >>> + return 0;
> >>> +}
> >>> diff --git a/drivers/ntb/ntb_edma.h b/drivers/ntb/ntb_edma.h
> >>> new file mode 100644
> >>> index 000000000000..da0451827edb
> >>> --- /dev/null
> >>> +++ b/drivers/ntb/ntb_edma.h
> >>> @@ -0,0 +1,128 @@
> >>> +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> >>> +#ifndef _NTB_EDMA_H_
> >>> +#define _NTB_EDMA_H_
> >>> +
> >>> +#include <linux/completion.h>
> >>> +#include <linux/device.h>
> >>> +#include <linux/interrupt.h>
> >>> +
> >>> +#define EDMA_REG_SIZE SZ_64K
> >>> +#define DMA_LLP_MEM_SIZE SZ_4K
> >>> +#define EDMA_WR_CH_NUM 4
> >>> +#define EDMA_RD_CH_NUM 4
> >>> +#define NTB_EDMA_MAX_CH 8
> >>> +
> >>> +#define NTB_EDMA_INFO_MAGIC 0x45444D41 /* "EDMA" */
> >>> +#define NTB_EDMA_INFO_OFF EDMA_REG_SIZE
> >>> +
> >>> +#define NTB_EDMA_RING_ORDER 7
> >>> +#define NTB_EDMA_RING_ENTRIES (1U << NTB_EDMA_RING_ORDER)
> >>> +#define NTB_EDMA_RING_MASK (NTB_EDMA_RING_ENTRIES - 1)
> >>> +
> >>> +typedef void (*ntb_edma_interrupt_cb_t)(void *data, int qp_num);
> >>> +
> >>> +/*
> >>> + * REMOTE_EDMA_EP:
> >>> + * Endpoint owns the eDMA engine and pushes descriptors into a shared MW.
> >>> + *
> >>> + * REMOTE_EDMA_RC:
> >>> + * Root Complex controls the endpoint eDMA through the shared MW and
> >>> + * drives reads/writes on behalf of the host.
> >>> + */
> >>> +typedef enum {
> >>> + REMOTE_EDMA_UNKNOWN,
> >>> + REMOTE_EDMA_EP,
> >>> + REMOTE_EDMA_RC,
> >>> +} remote_edma_mode_t;
> >>> +
> >>> +typedef enum {
> >>> + REMOTE_EDMA_WRITE,
> >>> + REMOTE_EDMA_READ,
> >>> +} remote_edma_dir_t;
> >>> +
> >>> +/*
> >>> + * Layout of remote eDMA MW (EP local address space, RC sees via peer MW):
> >>> + *
> >>> + * 0 .. EDMA_REG_SIZE-1 : DesignWare eDMA registers
> >>> + * EDMA_REG_SIZE .. +PAGE_SIZE : struct ntb_edma_info (EP writes, RC reads)
> >>> + * +PAGE_SIZE .. : LL ring buffers (EP allocates phys addresses,
> >>> + * RC configures via dw_edma)
> >>> + *
> >>> + * ntb_edma_setup_mws() on EP:
> >>> + * - allocates ntb_edma_info and LLs in EP memory
> >>> + * - programs inbound iATU so that RC peer MW[n] points at this block
> >>> + *
> >>> + * ntb_edma_setup_peer() on RC:
> >>> + * - ioremaps peer MW[n]
> >>> + * - reads ntb_edma_info
> >>> + * - sets up dw_edma_chip ll_region_* from that info
> >>> + */
> >>> +struct ntb_edma_info {
> >>> + u32 magic;
> >>> + u16 wr_cnt;
> >>> + u16 rd_cnt;
> >>> + u64 regs_phys;
> >>> + u32 ll_stride;
> >>> + u32 rsvd;
> >>> + u64 ll_wr_phys[NTB_EDMA_MAX_CH];
> >>> + u64 ll_rd_phys[NTB_EDMA_MAX_CH];
> >>> +
> >>> + u64 intr_dar_base;
> >>> +} __packed;
> >>> +
> >>> +struct ll_dma_addrs {
> >>> + dma_addr_t wr[EDMA_WR_CH_NUM];
> >>> + dma_addr_t rd[EDMA_RD_CH_NUM];
> >>> +};
> >>> +
> >>> +struct ntb_edma_chans {
> >>> + struct device *dev;
> >>> +
> >>> + struct dma_chan *wr_chan[EDMA_WR_CH_NUM];
> >>> + struct dma_chan *rd_chan[EDMA_RD_CH_NUM];
> >>> + struct dma_chan *intr_chan;
> >>> +
> >>> + unsigned int num_wr_chan;
> >>> + unsigned int num_rd_chan;
> >>> + atomic_t cur_wr_chan;
> >>> + atomic_t cur_rd_chan;
> >>> +};
> >>> +
> >>> +static __always_inline u32 ntb_edma_ring_idx(u32 v)
> >>> +{
> >>> + return v & NTB_EDMA_RING_MASK;
> >>> +}
> >>> +
> >>> +static __always_inline u32 ntb_edma_ring_used_entry(u32 head, u32 tail)
> >>> +{
> >>> + if (head >= tail) {
> >>> + WARN_ON_ONCE((head - tail) > (NTB_EDMA_RING_ENTRIES - 1));
> >>> + return head - tail;
> >>> + }
> >>> +
> >>> + WARN_ON_ONCE((U32_MAX - tail + head + 1) > (NTB_EDMA_RING_ENTRIES - 1));
> >>> + return U32_MAX - tail + head + 1;
> >>> +}
> >>> +
> >>> +static __always_inline u32 ntb_edma_ring_free_entry(u32 head, u32 tail)
> >>> +{
> >>> + return NTB_EDMA_RING_ENTRIES - ntb_edma_ring_used_entry(head, tail) - 1;
> >>> +}
> >>> +
> >>> +static __always_inline bool ntb_edma_ring_full(u32 head, u32 tail)
> >>> +{
> >>> + return ntb_edma_ring_free_entry(head, tail) == 0;
> >>> +}
> >>> +
> >>> +int ntb_edma_setup_isr(struct device *dev, struct device *epc_dev,
> >>> + ntb_edma_interrupt_cb_t cb, void *data);
> >>> +void ntb_edma_teardown_isr(struct device *dev);
> >>> +int ntb_edma_setup_mws(struct ntb_dev *ndev);
> >>> +int ntb_edma_setup_peer(struct ntb_dev *ndev);
> >>> +int ntb_edma_setup_chans(struct device *dma_dev, struct ntb_edma_chans *edma);
> >>> +struct dma_chan *ntb_edma_pick_chan(struct ntb_edma_chans *edma,
> >>> + remote_edma_dir_t dir);
> >>> +void ntb_edma_teardown_chans(struct ntb_edma_chans *edma);
> >>> +int ntb_edma_notify_peer(struct ntb_edma_chans *edma, int qp_num);
> >>> +
> >>> +#endif
> >>> diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport_core.c
> >>> similarity index 65%
> >>> rename from drivers/ntb/ntb_transport.c
> >>> rename to drivers/ntb/ntb_transport_core.c
> >>> index 907db6c93d4d..48d48921978d 100644
> >>> --- a/drivers/ntb/ntb_transport.c
> >>> +++ b/drivers/ntb/ntb_transport_core.c
> >>> @@ -47,6 +47,9 @@
> >>> * Contact Information:
> >>> * Jon Mason <jon.mason@intel.com>
> >>> */
> >>> +#include <linux/atomic.h>
> >>> +#include <linux/bug.h>
> >>> +#include <linux/compiler.h>
> >>> #include <linux/debugfs.h>
> >>> #include <linux/delay.h>
> >>> #include <linux/dmaengine.h>
> >>> @@ -71,6 +74,8 @@
> >>> #define NTB_TRANSPORT_DESC "Software Queue-Pair Transport over NTB"
> >>> #define NTB_TRANSPORT_MIN_SPADS (MW0_SZ_HIGH + 2)
> >>>
> >>> +#define NTB_EDMA_MAX_POLL 32
> >>> +
> >>> MODULE_DESCRIPTION(NTB_TRANSPORT_DESC);
> >>> MODULE_VERSION(NTB_TRANSPORT_VER);
> >>> MODULE_LICENSE("Dual BSD/GPL");
> >>> @@ -102,6 +107,13 @@ module_param(use_msi, bool, 0644);
> >>> MODULE_PARM_DESC(use_msi, "Use MSI interrupts instead of doorbells");
> >>> #endif
> >>>
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>
> >> This comment applies throughout this patch. Doing ifdefs inside C source is pretty frowned upon in the kernel. The preferred way is to only have ifdefs in the header files. So please give this a bit more consideration and see if it can be done differently to address this.
> >
> > I agree, there is no good reason to keep those remaining ifdefs at all.
> > I'll clean it up. Thanks for pointing this out.
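To be concrete, a rough sketch of what I have in mind (naming and
placement tentative, not necessarily what the next revision will look
like): keep the #ifdef in ntb_edma.h only, and give the
!CONFIG_NTB_TRANSPORT_EDMA case static inline no-op stubs, so that
ntb_transport_core.c can call the hooks unconditionally:

/* ntb_edma.h -- sketch only, names tentative */
struct ntb_transport_ctx;

#ifdef CONFIG_NTB_TRANSPORT_EDMA
int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
			    unsigned int *mw_count);
void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt);
int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt);
int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt);
#else /* !CONFIG_NTB_TRANSPORT_EDMA */
static inline int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
					  unsigned int *mw_count)
{
	/* remote eDMA not built in: nothing to set up */
	return 0;
}
static inline void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt)
{
}
static inline int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt)
{
	return 0;
}
static inline int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt)
{
	return 0;
}
#endif /* CONFIG_NTB_TRANSPORT_EDMA */

With stubs like these, most of the #ifdef blocks around the call sites
in ntb_transport_probe() and ntb_transport_link_work() disappear, and
the compiler discards the empty inlines when the option is off. The
remaining #ifdef around the 'use_remote_edma' module parameter could be
handled similarly, e.g. by moving it into ntb_edma.c, which is only
built when the option is enabled.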
> >
> >>
> >>> +#include "ntb_edma.h"
> >>> +static bool use_remote_edma;
> >>> +module_param(use_remote_edma, bool, 0644);
> >>> +MODULE_PARM_DESC(use_remote_edma, "Use remote eDMA mode (when enabled, use_msi is ignored)");
> >>> +#endif
> >>> +
> >>> static struct dentry *nt_debugfs_dir;
> >>>
> >>> /* Only two-ports NTB devices are supported */
> >>> @@ -125,6 +137,14 @@ struct ntb_queue_entry {
> >>> struct ntb_payload_header __iomem *tx_hdr;
> >>> struct ntb_payload_header *rx_hdr;
> >>> };
> >>> +
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>> + dma_addr_t addr;
> >>> +
> >>> + /* Used by RC side only */
> >>> + struct scatterlist sgl;
> >>> + struct work_struct dma_work;
> >>> +#endif
> >>> };
> >>>
> >>> struct ntb_rx_info {
> >>> @@ -202,6 +222,33 @@ struct ntb_transport_qp {
> >>> int msi_irq;
> >>> struct ntb_msi_desc msi_desc;
> >>> struct ntb_msi_desc peer_msi_desc;
> >>> +
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>> + /*
> >>> + * For ensuring peer notification in non-atomic context.
> >>> + * ntb_peer_db_set might sleep or schedule.
> >>> + */
> >>> + struct work_struct db_work;
> >>> +
> >>> + /*
> >>> + * wr: remote eDMA write transfer (EP -> RC direction)
> >>> + * rd: remote eDMA read transfer (RC -> EP direction)
> >>> + */
> >>> + u32 wr_cons;
> >>> + u32 rd_cons;
> >>> + u32 wr_prod;
> >>> + u32 rd_prod;
> >>> + u32 wr_issue;
> >>> + u32 rd_issue;
> >>> +
> >>> + spinlock_t ep_tx_lock;
> >>> + spinlock_t ep_rx_lock;
> >>> + spinlock_t rc_lock;
> >>> +
> >>> + /* Completion work for read/write transfers. */
> >>> + struct work_struct read_work;
> >>> + struct work_struct write_work;
> >>> +#endif
> >>
> >> For something like this, maybe it needs its own struct instead of an ifdef chunk. Perhaps 'ntb_rx_info' can serve as a core data struct, with EDMA having an 'ntb_rx_info_edma' that embeds 'ntb_rx_info'.
> >
> > Thanks again for the suggestion. I'll reorganize things.
> >
> > Koichiro
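For the qp-side chunk specifically, I am thinking along these lines
(again only a sketch, with the field names carried over from this RFC):
group the eDMA-only members into their own structure and hang it off
struct ntb_transport_qp, so the core structure stays free of #ifdef
blocks:

/* Sketch only: per-queue state used by the remote eDMA backend. */
struct ntb_transport_qp_edma {
	/* doorbell kick deferred to process context */
	struct work_struct db_work;

	/*
	 * wr: remote eDMA write transfer (EP -> RC direction)
	 * rd: remote eDMA read transfer (RC -> EP direction)
	 */
	u32 wr_cons, wr_prod, wr_issue;
	u32 rd_cons, rd_prod, rd_issue;

	spinlock_t ep_tx_lock;
	spinlock_t ep_rx_lock;
	spinlock_t rc_lock;

	/* completion work for read/write transfers */
	struct work_struct read_work;
	struct work_struct write_work;
};

struct ntb_transport_qp would then carry a single pointer to this (NULL
unless the remote eDMA backend is active), allocated from
ntb_transport_edma_create_queue(), and the eDMA-only members of
ntb_queue_entry and the context struct can get the same treatment,
along the lines you suggest for ntb_rx_info.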
> >
> >>
> >> DJ
> >>
> >>> };
> >>>
> >>> struct ntb_transport_mw {
> >>> @@ -249,6 +296,13 @@ struct ntb_transport_ctx {
> >>>
> >>> /* Make sure workq of link event be executed serially */
> >>> struct mutex link_event_lock;
> >>> +
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>> + remote_edma_mode_t remote_edma_mode;
> >>> + struct device *dma_dev;
> >>> + struct workqueue_struct *wq;
> >>> + struct ntb_edma_chans edma;
> >>> +#endif
> >>> };
> >>>
> >>> enum {
> >>> @@ -262,6 +316,19 @@ struct ntb_payload_header {
> >>> unsigned int flags;
> >>> };
> >>>
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>> +static void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt);
> >>> +static int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
> >>> + unsigned int *mw_count);
> >>> +static void ntb_transport_edma_init_queue(struct ntb_transport_ctx *nt,
> >>> + unsigned int qp_num);
> >>> +static void ntb_transport_edma_create_queue(struct ntb_transport_ctx *nt,
> >>> + struct ntb_transport_qp *qp);
> >>> +static int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt);
> >>> +static int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt);
> >>> +static void ntb_transport_edma_rc_dma_work(struct work_struct *work);
> >>> +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> >>> +
> >>> /*
> >>> * Return the device that should be used for DMA mapping.
> >>> *
> >>> @@ -298,7 +365,7 @@ enum {
> >>> container_of((__drv), struct ntb_transport_client, driver)
> >>>
> >>> #define QP_TO_MW(nt, qp) ((qp) % nt->mw_count)
> >>> -#define NTB_QP_DEF_NUM_ENTRIES 100
> >>> +#define NTB_QP_DEF_NUM_ENTRIES 128
> >>> #define NTB_LINK_DOWN_TIMEOUT 10
> >>>
> >>> static void ntb_transport_rxc_db(unsigned long data);
> >>> @@ -1015,6 +1082,10 @@ static void ntb_transport_link_cleanup(struct ntb_transport_ctx *nt)
> >>> count = ntb_spad_count(nt->ndev);
> >>> for (i = 0; i < count; i++)
> >>> ntb_spad_write(nt->ndev, i, 0);
> >>> +
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>> + ntb_edma_teardown_chans(&nt->edma);
> >>> +#endif
> >>> }
> >>>
> >>> static void ntb_transport_link_cleanup_work(struct work_struct *work)
> >>> @@ -1051,6 +1122,14 @@ static void ntb_transport_link_work(struct work_struct *work)
> >>>
> >>> /* send the local info, in the opposite order of the way we read it */
> >>>
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>> + rc = ntb_transport_edma_ep_init(nt);
> >>> + if (rc) {
> >>> + dev_err(&pdev->dev, "Failed to init EP: %d\n", rc);
> >>> + return;
> >>> + }
> >>> +#endif
> >>> +
> >>> if (nt->use_msi) {
> >>> rc = ntb_msi_setup_mws(ndev);
> >>> if (rc) {
> >>> @@ -1132,6 +1211,14 @@ static void ntb_transport_link_work(struct work_struct *work)
> >>>
> >>> nt->link_is_up = true;
> >>>
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>> + rc = ntb_transport_edma_rc_init(nt);
> >>> + if (rc) {
> >>> + dev_err(&pdev->dev, "Failed to init RC: %d\n", rc);
> >>> + goto out1;
> >>> + }
> >>> +#endif
> >>> +
> >>> for (i = 0; i < nt->qp_count; i++) {
> >>> struct ntb_transport_qp *qp = &nt->qp_vec[i];
> >>>
> >>> @@ -1277,6 +1364,8 @@ static const struct ntb_transport_backend_ops default_backend_ops = {
> >>> .debugfs_stats_show = ntb_transport_default_debugfs_stats_show,
> >>> };
> >>>
> >>> +static const struct ntb_transport_backend_ops edma_backend_ops;
> >>> +
> >>> static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> >>> {
> >>> struct ntb_transport_ctx *nt;
> >>> @@ -1311,7 +1400,23 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> >>>
> >>> nt->ndev = ndev;
> >>>
> >>> - nt->backend_ops = default_backend_ops;
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>> + if (use_remote_edma) {
> >>> + rc = ntb_transport_edma_init(nt, &mw_count);
> >>> + if (rc) {
> >>> + nt->mw_count = 0;
> >>> + goto err;
> >>> + }
> >>> + nt->backend_ops = edma_backend_ops;
> >>> +
> >>> + /*
> >>> + * On remote eDMA mode, we reserve a read channel for Host->EP
> >>> + * interruption.
> >>> + */
> >>> + use_msi = false;
> >>> + } else
> >>> +#endif
> >>> + nt->backend_ops = default_backend_ops;
> >>>
> >>> /*
> >>> * If we are using MSI, and have at least one extra memory window,
> >>> @@ -1402,6 +1507,10 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> >>> rc = ntb_transport_init_queue(nt, i);
> >>> if (rc)
> >>> goto err2;
> >>> +
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>> + ntb_transport_edma_init_queue(nt, i);
> >>> +#endif
> >>> }
> >>>
> >>> INIT_DELAYED_WORK(&nt->link_work, ntb_transport_link_work);
> >>> @@ -1433,6 +1542,9 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> >>> }
> >>> kfree(nt->mw_vec);
> >>> err:
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>> + ntb_transport_edma_uninit(nt);
> >>> +#endif
> >>> kfree(nt);
> >>> return rc;
> >>> }
> >>> @@ -2055,11 +2167,16 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
> >>>
> >>> nt->qp_bitmap_free &= ~qp_bit;
> >>>
> >>> + qp->qp_bit = qp_bit;
> >>> qp->cb_data = data;
> >>> qp->rx_handler = handlers->rx_handler;
> >>> qp->tx_handler = handlers->tx_handler;
> >>> qp->event_handler = handlers->event_handler;
> >>>
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>> + ntb_transport_edma_create_queue(nt, qp);
> >>> +#endif
> >>> +
> >>> dma_cap_zero(dma_mask);
> >>> dma_cap_set(DMA_MEMCPY, dma_mask);
> >>>
> >>> @@ -2105,6 +2222,9 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
> >>> goto err1;
> >>>
> >>> entry->qp = qp;
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>> + INIT_WORK(&entry->dma_work, ntb_transport_edma_rc_dma_work);
> >>> +#endif
> >>> ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> >>> &qp->rx_free_q);
> >>> }
> >>> @@ -2156,8 +2276,8 @@ EXPORT_SYMBOL_GPL(ntb_transport_create_queue);
> >>> */
> >>> void ntb_transport_free_queue(struct ntb_transport_qp *qp)
> >>> {
> >>> - struct pci_dev *pdev;
> >>> struct ntb_queue_entry *entry;
> >>> + struct pci_dev *pdev;
> >>> u64 qp_bit;
> >>>
> >>> if (!qp)
> >>> @@ -2208,6 +2328,10 @@ void ntb_transport_free_queue(struct ntb_transport_qp *qp)
> >>> tasklet_kill(&qp->rxc_db_work);
> >>>
> >>> cancel_delayed_work_sync(&qp->link_work);
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>> + cancel_work_sync(&qp->read_work);
> >>> + cancel_work_sync(&qp->write_work);
> >>> +#endif
> >>>
> >>> qp->cb_data = NULL;
> >>> qp->rx_handler = NULL;
> >>> @@ -2346,6 +2470,1157 @@ int ntb_transport_tx_enqueue(struct ntb_transport_qp *qp, void *cb, void *data,
> >>> }
> >>> EXPORT_SYMBOL_GPL(ntb_transport_tx_enqueue);
> >>>
> >>> +#ifdef CONFIG_NTB_TRANSPORT_EDMA
> >>> +/*
> >>> + * Remote eDMA mode implementation
> >>> + */
> >>> +struct ntb_edma_desc {
> >>> + u32 len;
> >>> + u32 flags;
> >>> + u64 addr; /* DMA address */
> >>> + u64 data;
> >>> +};
> >>> +
> >>> +struct ntb_edma_ring {
> >>> + struct ntb_edma_desc desc[NTB_EDMA_RING_ENTRIES];
> >>> + u32 head;
> >>> + u32 tail;
> >>> +};
> >>> +
> >>> +#define NTB_EDMA_DESC_OFF(i) ((size_t)(i) * sizeof(struct ntb_edma_desc))
> >>> +
> >>> +#define __NTB_EDMA_CHECK_INDEX(_i) \
> >>> +({ \
> >>> + unsigned long __i = (unsigned long)(_i); \
> >>> + WARN_ONCE(__i >= (unsigned long)NTB_EDMA_RING_ENTRIES, \
> >>> + "ntb_edma: index i=%lu >= ring_entries=%lu\n", \
> >>> + __i, (unsigned long)NTB_EDMA_RING_ENTRIES); \
> >>> + __i; \
> >>> +})
> >>> +
> >>> +#define NTB_EDMA_DESC_I(qp, i, n) \
> >>> +({ \
> >>> + typeof(qp) __qp = (qp); \
> >>> + unsigned long __i = __NTB_EDMA_CHECK_INDEX(i); \
> >>> + (struct ntb_edma_desc *) \
> >>> + ((char *)(__qp)->rx_buff + \
> >>> + (sizeof(struct ntb_edma_ring) * n) + \
> >>> + NTB_EDMA_DESC_OFF(__i)); \
> >>> +})
> >>> +
> >>> +#define NTB_EDMA_DESC_O(qp, i, n) \
> >>> +({ \
> >>> + typeof(qp) __qp = (qp); \
> >>> + unsigned long __i = __NTB_EDMA_CHECK_INDEX(i); \
> >>> + (struct ntb_edma_desc __iomem *) \
> >>> + ((char __iomem *)(__qp)->tx_mw + \
> >>> + (sizeof(struct ntb_edma_ring) * n) + \
> >>> + NTB_EDMA_DESC_OFF(__i)); \
> >>> +})
> >>> +
> >>> +#define NTB_EDMA_HEAD_I(qp, n) ((u32 *)((char *)qp->rx_buff + \
> >>> + (sizeof(struct ntb_edma_ring) * n) + \
> >>> + offsetof(struct ntb_edma_ring, head)))
> >>> +#define NTB_EDMA_HEAD_O(qp, n) ((u32 *)((char __iomem *)qp->tx_mw + \
> >>> + (sizeof(struct ntb_edma_ring) * n) + \
> >>> + offsetof(struct ntb_edma_ring, head)))
> >>> +#define NTB_EDMA_TAIL_I(qp, n) ((u32 *)((char *)qp->rx_buff + \
> >>> + (sizeof(struct ntb_edma_ring) * n) + \
> >>> + offsetof(struct ntb_edma_ring, tail)))
> >>> +#define NTB_EDMA_TAIL_O(qp, n) ((u32 *)((char __iomem *)qp->tx_mw + \
> >>> + (sizeof(struct ntb_edma_ring) * n) + \
> >>> + offsetof(struct ntb_edma_ring, tail)))
> >>> +
> >>> +/*
> >>> + * Macro naming rule:
> >>> + * NTB_DESC_RD_EP_I (as an example)
> >>> + * ^^ ^^ ^
> >>> + * : : `-- I(n) or O(ut). In = Read, Out = Write.
> >>> + * : `----- Who uses this macro.
> >>> + * `-------- DESC / HEAD / TAIL
> >>> + *
> >>> + * Read transfers (RC->EP):
> >>> + *
> >>> + * EP view (outbound, written via NTB):
> >>> + * - descs: NTB_DESC_RD_EP_O(qp, i) / NTB_DESC_RD_EP_I(qp, i)
> >>> + * [ len ][ flags ][ addr ][ data ]
> >>> + * [ len ][ flags ][ addr ][ data ]
> >>> + * :
> >>> + * [ len ][ flags ][ addr ][ data ]
> >>> + * - head: NTB_HEAD_RD_EP_O(qp)
> >>> + * - tail: NTB_TAIL_RD_EP_I(qp)
> >>> + *
> >>> + * RC view (inbound, local mapping):
> >>> + * - descs: NTB_DESC_RD_RC_I(qp, i) / NTB_DESC_RD_RC_O(qp, i)
> >>> + * [ len ][ flags ][ addr ][ data ]
> >>> + * [ len ][ flags ][ addr ][ data ]
> >>> + * :
> >>> + * [ len ][ flags ][ addr ][ data ]
> >>> + * - head: NTB_HEAD_RD_RC_I(qp)
> >>> + * - tail: NTB_TAIL_RD_RC_O(qp)
> >>> + *
> >>> + * Write transfers (EP -> RC) are analogous but use
> >>> + * NTB_DESC_WR_{EP_O,RC_I}(), NTB_HEAD_WR_{EP_O,RC_I}(),
> >>> + * and NTB_TAIL_WR_{EP_I,RC_O}().
> >>> + */
> >>> +#define NTB_DESC_RD_EP_I(qp, i) NTB_EDMA_DESC_I(qp, i, 0)
> >>> +#define NTB_DESC_RD_EP_O(qp, i) NTB_EDMA_DESC_O(qp, i, 0)
> >>> +#define NTB_DESC_WR_EP_I(qp, i) NTB_EDMA_DESC_I(qp, i, 1)
> >>> +#define NTB_DESC_WR_EP_O(qp, i) NTB_EDMA_DESC_O(qp, i, 1)
> >>> +#define NTB_DESC_RD_RC_I(qp, i) NTB_EDMA_DESC_I(qp, i, 0)
> >>> +#define NTB_DESC_RD_RC_O(qp, i) NTB_EDMA_DESC_O(qp, i, 0)
> >>> +#define NTB_DESC_WR_RC_I(qp, i) NTB_EDMA_DESC_I(qp, i, 1)
> >>> +#define NTB_DESC_WR_RC_O(qp, i) NTB_EDMA_DESC_O(qp, i, 1)
> >>> +
> >>> +#define NTB_HEAD_RD_EP_O(qp) NTB_EDMA_HEAD_O(qp, 0)
> >>> +#define NTB_HEAD_WR_EP_O(qp) NTB_EDMA_HEAD_O(qp, 1)
> >>> +#define NTB_HEAD_RD_RC_I(qp) NTB_EDMA_HEAD_I(qp, 0)
> >>> +#define NTB_HEAD_WR_RC_I(qp) NTB_EDMA_HEAD_I(qp, 1)
> >>> +
> >>> +#define NTB_TAIL_RD_EP_I(qp) NTB_EDMA_TAIL_I(qp, 0)
> >>> +#define NTB_TAIL_WR_EP_I(qp) NTB_EDMA_TAIL_I(qp, 1)
> >>> +#define NTB_TAIL_RD_RC_O(qp) NTB_EDMA_TAIL_O(qp, 0)
> >>> +#define NTB_TAIL_WR_RC_O(qp) NTB_EDMA_TAIL_O(qp, 1)
> >>> +
> >>> +static inline bool ntb_qp_edma_is_rc(struct ntb_transport_qp *qp)
> >>> +{
> >>> + return qp->transport->remote_edma_mode == REMOTE_EDMA_RC;
> >>> +}
> >>> +
> >>> +static inline bool ntb_qp_edma_is_ep(struct ntb_transport_qp *qp)
> >>> +{
> >>> + return qp->transport->remote_edma_mode == REMOTE_EDMA_EP;
> >>> +}
> >>> +
> >>> +static inline bool ntb_qp_edma_enabled(struct ntb_transport_qp *qp)
> >>> +{
> >>> + return ntb_qp_edma_is_rc(qp) || ntb_qp_edma_is_ep(qp);
> >>> +}
> >>> +
> >>> +static unsigned int ntb_transport_edma_tx_free_entry(struct ntb_transport_qp *qp)
> >>> +{
> >>> + unsigned int head, tail;
> >>> +
> >>> + if (ntb_qp_edma_is_ep(qp)) {
> >>> + scoped_guard(spinlock_irqsave, &qp->ep_tx_lock) {
> >>> + /* In this scope, only 'head' might proceed */
> >>> + tail = READ_ONCE(qp->wr_cons);
> >>> + head = READ_ONCE(qp->wr_prod);
> >>> + }
> >>> + return ntb_edma_ring_free_entry(head, tail);
> >>> + }
> >>> +
> >>> + scoped_guard(spinlock_irqsave, &qp->rc_lock) {
> >>> + /* In this scope, only 'head' might proceed */
> >>> + tail = READ_ONCE(qp->rd_issue);
> >>> + head = READ_ONCE(*NTB_HEAD_RD_RC_I(qp));
> >>> + }
> >>> + /*
> >>> + * On RC side, 'used' amount indicates how much EP side
> >>> + * has refilled, which are available for us to use for TX.
> >>> + */
> >>> + return ntb_edma_ring_used_entry(head, tail);
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_debugfs_stats_show(struct seq_file *s,
> >>> + struct ntb_transport_qp *qp)
> >>> +{
> >>> + seq_printf(s, "rx_bytes - \t%llu\n", qp->rx_bytes);
> >>> + seq_printf(s, "rx_pkts - \t%llu\n", qp->rx_pkts);
> >>> + seq_printf(s, "rx_err_no_buf - %llu\n", qp->rx_err_no_buf);
> >>> + seq_printf(s, "rx_buff - \t0x%p\n", qp->rx_buff);
> >>> + seq_printf(s, "rx_max_entry - \t%u\n", qp->rx_max_entry);
> >>> + seq_printf(s, "rx_alloc_entry - \t%u\n\n", qp->rx_alloc_entry);
> >>> +
> >>> + seq_printf(s, "tx_bytes - \t%llu\n", qp->tx_bytes);
> >>> + seq_printf(s, "tx_pkts - \t%llu\n", qp->tx_pkts);
> >>> + seq_printf(s, "tx_ring_full - \t%llu\n", qp->tx_ring_full);
> >>> + seq_printf(s, "tx_err_no_buf - %llu\n", qp->tx_err_no_buf);
> >>> + seq_printf(s, "tx_mw - \t0x%p\n", qp->tx_mw);
> >>> + seq_printf(s, "tx_max_entry - \t%u\n", qp->tx_max_entry);
> >>> + seq_printf(s, "free tx - \t%u\n", ntb_transport_tx_free_entry(qp));
> >>> + seq_putc(s, '\n');
> >>> +
> >>> + seq_puts(s, "Using Remote eDMA - Yes\n");
> >>> + seq_printf(s, "QP Link - \t%s\n", qp->link_is_up ? "Up" : "Down");
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_uninit(struct ntb_transport_ctx *nt)
> >>> +{
> >>> + struct ntb_dev *ndev = nt->ndev;
> >>> +
> >>> + if (nt->remote_edma_mode == REMOTE_EDMA_EP && ndev && ndev->pdev)
> >>> + ntb_edma_teardown_isr(&ndev->pdev->dev);
> >>> +
> >>> + if (nt->wq)
> >>> + destroy_workqueue(nt->wq);
> >>> + nt->wq = NULL;
> >>> +}
> >>> +
> >>> +static int ntb_transport_edma_init(struct ntb_transport_ctx *nt,
> >>> + unsigned int *mw_count)
> >>> +{
> >>> + struct ntb_dev *ndev = nt->ndev;
> >>> +
> >>> + /*
> >>> + * We need at least one MW for the transport plus one MW reserved
> >>> + * for the remote eDMA window (see ntb_edma_setup_mws/peer).
> >>> + */
> >>> + if (*mw_count <= 1) {
> >>> + dev_err(&ndev->dev,
> >>> + "remote eDMA requires at least two MWS (have %u)\n",
> >>> + *mw_count);
> >>> + return -ENODEV;
> >>> + }
> >>> +
> >>> + nt->wq = alloc_workqueue("ntb-edma-wq", WQ_UNBOUND | WQ_SYSFS, 0);
> >>> + if (!nt->wq) {
> >>> + ntb_transport_edma_uninit(nt);
> >>> + return -ENOMEM;
> >>> + }
> >>> +
> >>> + /* Reserve the last peer MW exclusively for the eDMA window. */
> >>> + *mw_count -= 1;
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_db_work(struct work_struct *work)
> >>> +{
> >>> + struct ntb_transport_qp *qp =
> >>> + container_of(work, struct ntb_transport_qp, db_work);
> >>> +
> >>> + ntb_peer_db_set(qp->ndev, qp->qp_bit);
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_notify_peer(struct ntb_transport_qp *qp)
> >>> +{
> >>> + if (ntb_qp_edma_is_rc(qp))
> >>> + if (!ntb_edma_notify_peer(&qp->transport->edma, qp->qp_num))
> >>> + return;
> >>> +
> >>> + /*
> >>> + * Called from contexts that may be atomic. Since ntb_peer_db_set()
> >>> + * may sleep, delegate the actual doorbell write to a workqueue.
> >>> + */
> >>> + queue_work(system_highpri_wq, &qp->db_work);
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_isr(void *data, int qp_num)
> >>> +{
> >>> + struct ntb_transport_ctx *nt = data;
> >>> + struct ntb_transport_qp *qp;
> >>> +
> >>> + if (qp_num < 0 || qp_num >= nt->qp_count)
> >>> + return;
> >>> +
> >>> + qp = &nt->qp_vec[qp_num];
> >>> + if (WARN_ON(!qp))
> >>> + return;
> >>> +
> >>> + queue_work(nt->wq, &qp->read_work);
> >>> + queue_work(nt->wq, &qp->write_work);
> >>> +}
> >>> +
> >>> +static int ntb_transport_edma_rc_init(struct ntb_transport_ctx *nt)
> >>> +{
> >>> + struct ntb_dev *ndev = nt->ndev;
> >>> + struct pci_dev *pdev = ndev->pdev;
> >>> + int rc;
> >>> +
> >>> + if (!use_remote_edma || nt->remote_edma_mode != REMOTE_EDMA_UNKNOWN)
> >>> + return 0;
> >>> +
> >>> + rc = ntb_edma_setup_peer(ndev);
> >>> + if (rc) {
> >>> + dev_err(&pdev->dev, "Failed to enable remote eDMA: %d\n", rc);
> >>> + return rc;
> >>> + }
> >>> +
> >>> + rc = ntb_edma_setup_chans(get_dma_dev(ndev), &nt->edma);
> >>> + if (rc) {
> >>> + dev_err(&pdev->dev, "Failed to setup eDMA channels: %d\n", rc);
> >>> + return rc;
> >>> + }
> >>> +
> >>> + nt->remote_edma_mode = REMOTE_EDMA_RC;
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static int ntb_transport_edma_ep_init(struct ntb_transport_ctx *nt)
> >>> +{
> >>> + struct ntb_dev *ndev = nt->ndev;
> >>> + struct pci_dev *pdev = ndev->pdev;
> >>> + struct pci_epc *epc;
> >>> + int rc;
> >>> +
> >>> + if (!use_remote_edma || nt->remote_edma_mode == REMOTE_EDMA_EP)
> >>> + return 0;
> >>> +
> >>> + /* Only EP side can return pci_epc */
> >>> + epc = ntb_get_pci_epc(ndev);
> >>> + if (!epc)
> >>> + return 0;
> >>> +
> >>> + rc = ntb_edma_setup_mws(ndev);
> >>> + if (rc) {
> >>> + dev_err(&pdev->dev,
> >>> + "Failed to set up memory window for eDMA: %d\n", rc);
> >>> + return rc;
> >>> + }
> >>> +
> >>> + rc = ntb_edma_setup_isr(&pdev->dev, &epc->dev, ntb_transport_edma_isr, nt);
> >>> + if (rc) {
> >>> + dev_err(&pdev->dev, "Failed to setup eDMA ISR (%d)\n", rc);
> >>> + return rc;
> >>> + }
> >>> +
> >>> + nt->remote_edma_mode = REMOTE_EDMA_EP;
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static int ntb_transport_edma_setup_qp_mw(struct ntb_transport_ctx *nt,
> >>> + unsigned int qp_num)
> >>> +{
> >>> + struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
> >>> + struct ntb_dev *ndev = nt->ndev;
> >>> + struct ntb_queue_entry *entry;
> >>> + struct ntb_transport_mw *mw;
> >>> + unsigned int mw_num, mw_count, qp_count;
> >>> + unsigned int qp_offset, rx_info_offset;
> >>> + unsigned int mw_size, mw_size_per_qp;
> >>> + unsigned int num_qps_mw;
> >>> + size_t edma_total;
> >>> + unsigned int i;
> >>> + int node;
> >>> +
> >>> + mw_count = nt->mw_count;
> >>> + qp_count = nt->qp_count;
> >>> +
> >>> + mw_num = QP_TO_MW(nt, qp_num);
> >>> + mw = &nt->mw_vec[mw_num];
> >>> +
> >>> + if (!mw->virt_addr)
> >>> + return -ENOMEM;
> >>> +
> >>> + if (mw_num < qp_count % mw_count)
> >>> + num_qps_mw = qp_count / mw_count + 1;
> >>> + else
> >>> + num_qps_mw = qp_count / mw_count;
> >>> +
> >>> + mw_size = min(nt->mw_vec[mw_num].phys_size, mw->xlat_size);
> >>> + if (max_mw_size && mw_size > max_mw_size)
> >>> + mw_size = max_mw_size;
> >>> +
> >>> + mw_size_per_qp = round_down((unsigned int)mw_size / num_qps_mw, SZ_64);
> >>> + qp_offset = mw_size_per_qp * (qp_num / mw_count);
> >>> + rx_info_offset = mw_size_per_qp - sizeof(struct ntb_rx_info);
> >>> +
> >>> + qp->tx_mw_size = mw_size_per_qp;
> >>> + qp->tx_mw = nt->mw_vec[mw_num].vbase + qp_offset;
> >>> + if (!qp->tx_mw)
> >>> + return -EINVAL;
> >>> + qp->tx_mw_phys = nt->mw_vec[mw_num].phys_addr + qp_offset;
> >>> + if (!qp->tx_mw_phys)
> >>> + return -EINVAL;
> >>> + qp->rx_info = qp->tx_mw + rx_info_offset;
> >>> + qp->rx_buff = mw->virt_addr + qp_offset;
> >>> + qp->remote_rx_info = qp->rx_buff + rx_info_offset;
> >>> +
> >>> + /* Due to housekeeping, there must be at least 2 buffs */
> >>> + qp->tx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
> >>> + qp->rx_max_frame = min(transport_mtu, mw_size_per_qp / 2);
> >>> +
> >>> + /* In eDMA mode, decouple from MW sizing and force ring-sized entries */
> >>> + edma_total = 2 * sizeof(struct ntb_edma_ring);
> >>> + if (rx_info_offset < edma_total) {
> >>> + dev_err(&ndev->dev, "Ring space requires %luB (>=%uB)\n",
> >>> + edma_total, rx_info_offset);
> >>> + return -EINVAL;
> >>> + }
> >>> + qp->tx_max_entry = NTB_EDMA_RING_ENTRIES;
> >>> + qp->rx_max_entry = NTB_EDMA_RING_ENTRIES;
> >>> +
> >>> + /*
> >>> + * Checking to see if we have more entries than the default.
> >>> + * We should add additional entries if that is the case so we
> >>> + * can be in sync with the transport frames.
> >>> + */
> >>> + node = dev_to_node(&ndev->dev);
> >>> + for (i = qp->rx_alloc_entry; i < qp->rx_max_entry; i++) {
> >>> + entry = kzalloc_node(sizeof(*entry), GFP_KERNEL, node);
> >>> + if (!entry)
> >>> + return -ENOMEM;
> >>> +
> >>> + entry->qp = qp;
> >>> + INIT_WORK(&entry->dma_work, ntb_transport_edma_rc_dma_work);
> >>> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> >>> + &qp->rx_free_q);
> >>> + qp->rx_alloc_entry++;
> >>> + }
> >>> +
> >>> + memset(qp->rx_buff, 0, edma_total);
> >>> +
> >>> + qp->rx_pkts = 0;
> >>> + qp->tx_pkts = 0;
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static int ntb_transport_edma_ep_read_complete(struct ntb_transport_qp *qp)
> >>> +{
> >>> + struct device *dma_dev = get_dma_dev(qp->ndev);
> >>> + struct ntb_queue_entry *entry;
> >>> + struct ntb_edma_desc *in;
> >>> + unsigned int len;
> >>> + u32 idx;
> >>> +
> >>> + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_TAIL_RD_EP_I(qp)),
> >>> + qp->rd_cons) == 0)
> >>> + return 0;
> >>> +
> >>> + idx = ntb_edma_ring_idx(qp->rd_cons);
> >>> + in = NTB_DESC_RD_EP_I(qp, idx);
> >>> + if (!(in->flags & DESC_DONE_FLAG))
> >>> + return 0;
> >>> +
> >>> + in->flags = 0;
> >>> + len = in->len; /* might be smaller than entry->len */
> >>> +
> >>> + entry = (struct ntb_queue_entry *)(in->data);
> >>> + if (WARN_ON(!entry))
> >>> + return 0;
> >>> +
> >>> + if (in->flags & LINK_DOWN_FLAG) {
> >>> + ntb_qp_link_down(qp);
> >>> + qp->rd_cons++;
> >>> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> >>> + return 1;
> >>> + }
> >>> +
> >>> + dma_unmap_single(dma_dev, entry->addr, entry->len, DMA_FROM_DEVICE);
> >>> +
> >>> + qp->rx_bytes += len;
> >>> + qp->rx_pkts++;
> >>> + qp->rd_cons++;
> >>> +
> >>> + if (qp->rx_handler && qp->client_ready)
> >>> + qp->rx_handler(qp, qp->cb_data, entry->cb_data, len);
> >>> +
> >>> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> >>> + return 1;
> >>> +}
> >>> +
> >>> +static int ntb_transport_edma_ep_write_complete(struct ntb_transport_qp *qp)
> >>> +{
> >>> + struct ntb_queue_entry *entry;
> >>> + struct ntb_edma_desc *in;
> >>> + u32 idx;
> >>> +
> >>> + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_TAIL_WR_EP_I(qp)),
> >>> + qp->wr_cons) == 0)
> >>> + return 0;
> >>> +
> >>> + idx = ntb_edma_ring_idx(qp->wr_cons);
> >>> + in = NTB_DESC_WR_EP_I(qp, idx);
> >>> +
> >>> + entry = (struct ntb_queue_entry *)(in->data);
> >>> + if (WARN_ON(!entry))
> >>> + return 0;
> >>> +
> >>> + qp->wr_cons++;
> >>> +
> >>> + if (qp->tx_handler)
> >>> + qp->tx_handler(qp, qp->cb_data, entry->cb_data, entry->len);
> >>> +
> >>> + ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry, &qp->tx_free_q);
> >>> + return 1;
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_ep_read_work(struct work_struct *work)
> >>> +{
> >>> + struct ntb_transport_qp *qp = container_of(
> >>> + work, struct ntb_transport_qp, read_work);
> >>> + unsigned int i;
> >>> +
> >>> + for (i = 0; i < NTB_EDMA_MAX_POLL; i++) {
> >>> + if (!ntb_transport_edma_ep_read_complete(qp))
> >>> + break;
> >>> + }
> >>> +
> >>> + if (ntb_transport_edma_ep_read_complete(qp))
> >>> + queue_work(qp->transport->wq, &qp->read_work);
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_ep_write_work(struct work_struct *work)
> >>> +{
> >>> + struct ntb_transport_qp *qp = container_of(
> >>> + work, struct ntb_transport_qp, write_work);
> >>> + unsigned int i;
> >>> +
> >>> + for (i = 0; i < NTB_EDMA_MAX_POLL; i++) {
> >>> + if (!ntb_transport_edma_ep_write_complete(qp))
> >>> + break;
> >>> + }
> >>> +
> >>> + if (ntb_transport_edma_ep_write_complete(qp))
> >>> + queue_work(qp->transport->wq, &qp->write_work);
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_rc_write_complete_work(struct work_struct *work)
> >>> +{
> >>> + struct ntb_transport_qp *qp = container_of(
> >>> + work, struct ntb_transport_qp, write_work);
> >>> + struct ntb_queue_entry *entry;
> >>> + struct ntb_edma_desc *in;
> >>> + unsigned int len;
> >>> + void *cb_data;
> >>> + u32 idx;
> >>> +
> >>> + while (ntb_edma_ring_used_entry(READ_ONCE(qp->wr_issue),
> >>> + qp->wr_cons) != 0) {
> >>> + /* Paired with smp_wmb() in ntb_transport_edma_rc_poll() */
> >>> + smp_rmb();
> >>> +
> >>> + idx = ntb_edma_ring_idx(qp->wr_cons);
> >>> + in = NTB_DESC_WR_RC_I(qp, idx);
> >>> + entry = (struct ntb_queue_entry *)READ_ONCE(in->data);
> >>> + if (!entry || !(entry->flags & DESC_DONE_FLAG))
> >>> + break;
> >>> +
> >>> + in->data = 0;
> >>> +
> >>> + cb_data = entry->cb_data;
> >>> + len = entry->len;
> >>> +
> >>> + iowrite32(++qp->wr_cons, NTB_TAIL_WR_RC_O(qp));
> >>> +
> >>> + if (unlikely(entry->flags & LINK_DOWN_FLAG)) {
> >>> + ntb_qp_link_down(qp);
> >>> + continue;
> >>> + }
> >>> +
> >>> + ntb_transport_edma_notify_peer(qp);
> >>> +
> >>> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_free_q);
> >>> +
> >>> + if (qp->rx_handler && qp->client_ready)
> >>> + qp->rx_handler(qp, qp->cb_data, cb_data, len);
> >>> +
> >>> + /* stat updates */
> >>> + qp->rx_bytes += len;
> >>> + qp->rx_pkts++;
> >>> + }
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_rc_write_cb(void *data,
> >>> + const struct dmaengine_result *res)
> >>> +{
> >>> + struct ntb_queue_entry *entry = data;
> >>> + struct ntb_transport_qp *qp = entry->qp;
> >>> + struct ntb_transport_ctx *nt = qp->transport;
> >>> + enum dmaengine_tx_result dma_err = res->result;
> >>> + struct device *dma_dev = get_dma_dev(qp->ndev);
> >>> +
> >>> + switch (dma_err) {
> >>> + case DMA_TRANS_READ_FAILED:
> >>> + case DMA_TRANS_WRITE_FAILED:
> >>> + case DMA_TRANS_ABORTED:
> >>> + entry->errors++;
> >>> + entry->len = -EIO;
> >>> + break;
> >>> + case DMA_TRANS_NOERROR:
> >>> + default:
> >>> + break;
> >>> + }
> >>> + dma_unmap_sg(dma_dev, &entry->sgl, 1, DMA_FROM_DEVICE);
> >>> + sg_dma_address(&entry->sgl) = 0;
> >>> +
> >>> + entry->flags |= DESC_DONE_FLAG;
> >>> +
> >>> + queue_work(nt->wq, &qp->write_work);
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_rc_read_complete_work(struct work_struct *work)
> >>> +{
> >>> + struct ntb_transport_qp *qp = container_of(
> >>> + work, struct ntb_transport_qp, read_work);
> >>> + struct ntb_edma_desc *in, __iomem *out;
> >>> + struct ntb_queue_entry *entry;
> >>> + unsigned int len;
> >>> + void *cb_data;
> >>> + u32 idx;
> >>> +
> >>> + while (ntb_edma_ring_used_entry(READ_ONCE(qp->rd_issue),
> >>> + qp->rd_cons) != 0) {
> >>> + /* Paired with smp_wmb() in ntb_transport_edma_rc_tx_enqueue() */
> >>> + smp_rmb();
> >>> +
> >>> + idx = ntb_edma_ring_idx(qp->rd_cons);
> >>> + in = NTB_DESC_RD_RC_I(qp, idx);
> >>> + entry = (struct ntb_queue_entry *)in->data;
> >>> + if (!entry || !(entry->flags & DESC_DONE_FLAG))
> >>> + break;
> >>> +
> >>> + in->data = 0;
> >>> +
> >>> + cb_data = entry->cb_data;
> >>> + len = entry->len;
> >>> +
> >>> + out = NTB_DESC_RD_RC_O(qp, idx);
> >>> +
> >>> + WRITE_ONCE(qp->rd_cons, qp->rd_cons + 1);
> >>> +
> >>> + /*
> >>> + * No need to add barrier in-between to enforce ordering here.
> >>> + * The other side proceeds only after both flags and tail are
> >>> + * updated.
> >>> + */
> >>> + iowrite32(entry->flags, &out->flags);
> >>> + iowrite32(qp->rd_cons, NTB_TAIL_RD_RC_O(qp));
> >>> +
> >>> + ntb_transport_edma_notify_peer(qp);
> >>> +
> >>> + ntb_list_add(&qp->ntb_tx_free_q_lock, &entry->entry,
> >>> + &qp->tx_free_q);
> >>> +
> >>> + if (qp->tx_handler)
> >>> + qp->tx_handler(qp, qp->cb_data, cb_data, len);
> >>> +
> >>> + /* stat updates */
> >>> + qp->tx_bytes += len;
> >>> + qp->tx_pkts++;
> >>> + }
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_rc_read_cb(void *data,
> >>> + const struct dmaengine_result *res)
> >>> +{
> >>> + struct ntb_queue_entry *entry = data;
> >>> + struct ntb_transport_qp *qp = entry->qp;
> >>> + struct ntb_transport_ctx *nt = qp->transport;
> >>> + struct device *dma_dev = get_dma_dev(qp->ndev);
> >>> + enum dmaengine_tx_result dma_err = res->result;
> >>> +
> >>> + switch (dma_err) {
> >>> + case DMA_TRANS_READ_FAILED:
> >>> + case DMA_TRANS_WRITE_FAILED:
> >>> + case DMA_TRANS_ABORTED:
> >>> + entry->errors++;
> >>> + entry->len = -EIO;
> >>> + break;
> >>> + case DMA_TRANS_NOERROR:
> >>> + default:
> >>> + break;
> >>> + }
> >>> + dma_unmap_sg(dma_dev, &entry->sgl, 1, DMA_TO_DEVICE);
> >>> + sg_dma_address(&entry->sgl) = 0;
> >>> +
> >>> + entry->flags |= DESC_DONE_FLAG;
> >>> +
> >>> + queue_work(nt->wq, &qp->read_work);
> >>> +}
> >>> +
> >>> +static int ntb_transport_edma_rc_write_start(struct device *d,
> >>> + struct dma_chan *chan, size_t len,
> >>> + dma_addr_t ep_src, void *rc_dst,
> >>> + struct ntb_queue_entry *entry)
> >>> +{
> >>> + struct scatterlist *sgl = &entry->sgl;
> >>> + struct dma_async_tx_descriptor *txd;
> >>> + struct dma_slave_config cfg;
> >>> + dma_cookie_t cookie;
> >>> + int nents, rc;
> >>> +
> >>> + if (!d)
> >>> + return -ENODEV;
> >>> +
> >>> + if (!chan)
> >>> + return -ENXIO;
> >>> +
> >>> + if (WARN_ON(!ep_src || !rc_dst))
> >>> + return -EINVAL;
> >>> +
> >>> + if (WARN_ON(sg_dma_address(sgl)))
> >>> + return -EINVAL;
> >>> +
> >>> + sg_init_one(sgl, rc_dst, len);
> >>> + nents = dma_map_sg(d, sgl, 1, DMA_FROM_DEVICE);
> >>> + if (nents <= 0)
> >>> + return -EIO;
> >>> +
> >>> + memset(&cfg, 0, sizeof(cfg));
> >>> + cfg.src_addr = ep_src;
> >>> + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> >>> + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> >>> + cfg.direction = DMA_DEV_TO_MEM;
> >>> + rc = dmaengine_slave_config(chan, &cfg);
> >>> + if (rc)
> >>> + goto out_unmap;
> >>> +
> >>> + txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_DEV_TO_MEM,
> >>> + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> >>> + if (!txd) {
> >>> + rc = -EIO;
> >>> + goto out_unmap;
> >>> + }
> >>> +
> >>> + txd->callback_result = ntb_transport_edma_rc_write_cb;
> >>> + txd->callback_param = entry;
> >>> +
> >>> + cookie = dmaengine_submit(txd);
> >>> + if (dma_submit_error(cookie)) {
> >>> + rc = -EIO;
> >>> + goto out_unmap;
> >>> + }
> >>> + dma_async_issue_pending(chan);
> >>> + return 0;
> >>> +out_unmap:
> >>> + dma_unmap_sg(d, sgl, 1, DMA_FROM_DEVICE);
> >>> + return rc;
> >>> +}
> >>> +
> >>> +static int ntb_transport_edma_rc_read_start(struct device *d,
> >>> + struct dma_chan *chan, size_t len,
> >>> + void *rc_src, dma_addr_t ep_dst,
> >>> + struct ntb_queue_entry *entry)
> >>> +{
> >>> + struct scatterlist *sgl = &entry->sgl;
> >>> + struct dma_async_tx_descriptor *txd;
> >>> + struct dma_slave_config cfg;
> >>> + dma_cookie_t cookie;
> >>> + int nents, rc;
> >>> +
> >>> + if (!d)
> >>> + return -ENODEV;
> >>> +
> >>> + if (!chan)
> >>> + return -ENXIO;
> >>> +
> >>> + if (WARN_ON(!rc_src || !ep_dst))
> >>> + return -EINVAL;
> >>> +
> >>> + if (WARN_ON(sg_dma_address(sgl)))
> >>> + return -EINVAL;
> >>> +
> >>> + sg_init_one(sgl, rc_src, len);
> >>> + nents = dma_map_sg(d, sgl, 1, DMA_TO_DEVICE);
> >>> + if (nents <= 0)
> >>> + return -EIO;
> >>> +
> >>> + memset(&cfg, 0, sizeof(cfg));
> >>> + cfg.dst_addr = ep_dst;
> >>> + cfg.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> >>> + cfg.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
> >>> + cfg.direction = DMA_MEM_TO_DEV;
> >>> + rc = dmaengine_slave_config(chan, &cfg);
> >>> + if (rc)
> >>> + goto out_unmap;
> >>> +
> >>> + txd = dmaengine_prep_slave_sg(chan, sgl, 1, DMA_MEM_TO_DEV,
> >>> + DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
> >>> + if (!txd) {
> >>> + rc = -EIO;
> >>> + goto out_unmap;
> >>> + }
> >>> +
> >>> + txd->callback_result = ntb_transport_edma_rc_read_cb;
> >>> + txd->callback_param = entry;
> >>> +
> >>> + cookie = dmaengine_submit(txd);
> >>> + if (dma_submit_error(cookie)) {
> >>> + rc = -EIO;
> >>> + goto out_unmap;
> >>> + }
> >>> + dma_async_issue_pending(chan);
> >>> + return 0;
> >>> +out_unmap:
> >>> + dma_unmap_sg(d, sgl, 1, DMA_TO_DEVICE);
> >>> + return rc;
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_rc_dma_work(struct work_struct *work)
> >>> +{
> >>> + struct ntb_queue_entry *entry = container_of(
> >>> + work, struct ntb_queue_entry, dma_work);
> >>> + struct ntb_transport_qp *qp = entry->qp;
> >>> + struct ntb_transport_ctx *nt = qp->transport;
> >>> + struct device *dma_dev = get_dma_dev(qp->ndev);
> >>> + struct dma_chan *chan;
> >>> + int rc;
> >>> +
> >>> + chan = ntb_edma_pick_chan(&nt->edma, REMOTE_EDMA_WRITE);
> >>> + rc = ntb_transport_edma_rc_write_start(dma_dev, chan, entry->len,
> >>> + entry->addr, entry->buf, entry);
> >>> + if (rc) {
> >>> + entry->errors++;
> >>> + entry->len = -EIO;
> >>> + entry->flags |= DESC_DONE_FLAG;
> >>> + queue_work(nt->wq, &qp->write_work);
> >>> + return;
> >>> + }
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_rc_poll(struct ntb_transport_qp *qp)
> >>> +{
> >>> + struct ntb_transport_ctx *nt = qp->transport;
> >>> + unsigned int budget = NTB_EDMA_MAX_POLL;
> >>> + struct ntb_queue_entry *entry;
> >>> + struct ntb_edma_desc *in;
> >>> + dma_addr_t ep_src;
> >>> + u32 len, idx;
> >>> +
> >>> + while (budget--) {
> >>> + if (ntb_edma_ring_used_entry(READ_ONCE(*NTB_HEAD_WR_RC_I(qp)),
> >>> + qp->wr_issue) == 0)
> >>> + break;
> >>> +
> >>> + idx = ntb_edma_ring_idx(qp->wr_issue);
> >>> + in = NTB_DESC_WR_RC_I(qp, idx);
> >>> +
> >>> + len = READ_ONCE(in->len);
> >>> + ep_src = (dma_addr_t)READ_ONCE(in->addr);
> >>> +
> >>> + /* Prepare 'entry' for write completion */
> >>> + entry = ntb_list_rm(&qp->ntb_rx_q_lock, &qp->rx_pend_q);
> >>> + if (!entry) {
> >>> + qp->rx_err_no_buf++;
> >>> + break;
> >>> + }
> >>> + if (WARN_ON(entry->flags & DESC_DONE_FLAG))
> >>> + entry->flags &= ~DESC_DONE_FLAG;
> >>> + entry->len = len; /* NB. entry->len can be <=0 */
> >>> + entry->addr = ep_src;
> >>> +
> >>> + /*
> >>> + * ntb_transport_edma_rc_write_complete_work() checks entry->flags
> >>> + * so it needs to be set before wr_issue++.
> >>> + */
> >>> + in->data = (uintptr_t)entry;
> >>> +
> >>> + /* Ensure in->data visible before wr_issue++ */
> >>> + smp_wmb();
> >>> +
> >>> + WRITE_ONCE(qp->wr_issue, qp->wr_issue + 1);
> >>> +
> >>> + if (!len) {
> >>> + entry->flags |= DESC_DONE_FLAG;
> >>> + queue_work(nt->wq, &qp->write_work);
> >>> + continue;
> >>> + }
> >>> +
> >>> + if (in->flags & LINK_DOWN_FLAG) {
> >>> + dev_dbg(&qp->ndev->pdev->dev, "link down flag set\n");
> >>> + entry->flags |= DESC_DONE_FLAG | LINK_DOWN_FLAG;
> >>> + queue_work(nt->wq, &qp->write_work);
> >>> + continue;
> >>> + }
> >>> +
> >>> + queue_work(nt->wq, &entry->dma_work);
> >>> + }
> >>> +
> >>> + if (!budget)
> >>> + tasklet_schedule(&qp->rxc_db_work);
> >>> +}
> >>> +
> >>> +static int ntb_transport_edma_rc_tx_enqueue(struct ntb_transport_qp *qp,
> >>> + struct ntb_queue_entry *entry)
> >>> +{
> >>> + struct device *dma_dev = get_dma_dev(qp->ndev);
> >>> + struct ntb_transport_ctx *nt = qp->transport;
> >>> + struct ntb_edma_desc *in, __iomem *out;
> >>> + unsigned int len = entry->len;
> >>> + struct dma_chan *chan;
> >>> + u32 issue, idx, head;
> >>> + dma_addr_t ep_dst;
> >>> + int rc;
> >>> +
> >>> + WARN_ON_ONCE(entry->flags & DESC_DONE_FLAG);
> >>> +
> >>> + scoped_guard(spinlock_irqsave, &qp->rc_lock) {
> >>> + head = READ_ONCE(*NTB_HEAD_RD_RC_I(qp));
> >>> + issue = qp->rd_issue;
> >>> + if (ntb_edma_ring_used_entry(head, issue) == 0) {
> >>> + qp->tx_ring_full++;
> >>> + return -ENOSPC;
> >>> + }
> >>> +
> >>> + /*
> >>> + * ntb_transport_edma_rc_read_complete_work() checks entry->flags
> >>> + * so it needs to be set before rd_issue++.
> >>> + */
> >>> + idx = ntb_edma_ring_idx(issue);
> >>> + in = NTB_DESC_RD_RC_I(qp, idx);
> >>> + in->data = (uintptr_t)entry;
> >>> +
> >>> + /* Make in->data visible before rd_issue++ */
> >>> + smp_wmb();
> >>> +
> >>> + WRITE_ONCE(qp->rd_issue, qp->rd_issue + 1);
> >>> + }
> >>> +
> >>> + /* Publish the final transfer length to the EP side */
> >>> + out = NTB_DESC_RD_RC_O(qp, idx);
> >>> + iowrite32(len, &out->len);
> >>> + ioread32(&out->len);
> >>> +
> >>> + if (unlikely(!len)) {
> >>> + entry->flags |= DESC_DONE_FLAG;
> >>> + queue_work(nt->wq, &qp->read_work);
> >>> + return 0;
> >>> + }
> >>> +
> >>> + /* Paired with dma_wmb() in ntb_transport_edma_ep_rx_enqueue() */
> >>> + dma_rmb();
> >>> +
> >>> + /* kick remote eDMA read transfer */
> >>> + ep_dst = (dma_addr_t)in->addr;
> >>> + chan = ntb_edma_pick_chan(&nt->edma, REMOTE_EDMA_READ);
> >>> + rc = ntb_transport_edma_rc_read_start(dma_dev, chan, len,
> >>> + entry->buf, ep_dst, entry);
> >>> + if (rc) {
> >>> + entry->errors++;
> >>> + entry->len = -EIO;
> >>> + entry->flags |= DESC_DONE_FLAG;
> >>> + queue_work(nt->wq, &qp->read_work);
> >>> + }
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static int ntb_transport_edma_ep_tx_enqueue(struct ntb_transport_qp *qp,
> >>> + struct ntb_queue_entry *entry)
> >>> +{
> >>> + struct device *dma_dev = get_dma_dev(qp->ndev);
> >>> + struct ntb_edma_desc *in, __iomem *out;
> >>> + unsigned int len = entry->len;
> >>> + dma_addr_t ep_src = 0;
> >>> + u32 idx;
> >>> + int rc;
> >>> +
> >>> + if (likely(len)) {
> >>> + ep_src = dma_map_single(dma_dev, entry->buf, len,
> >>> + DMA_TO_DEVICE);
> >>> + rc = dma_mapping_error(dma_dev, ep_src);
> >>> + if (rc)
> >>> + return rc;
> >>> + }
> >>> +
> >>> + scoped_guard(spinlock_irqsave, &qp->ep_tx_lock) {
> >>> + if (ntb_edma_ring_full(qp->wr_prod, qp->wr_cons)) {
> >>> + rc = -ENOSPC;
> >>> + qp->tx_ring_full++;
> >>> + goto out_unmap;
> >>> + }
> >>> +
> >>> + idx = ntb_edma_ring_idx(qp->wr_prod);
> >>> + in = NTB_DESC_WR_EP_I(qp, idx);
> >>> + out = NTB_DESC_WR_EP_O(qp, idx);
> >>> +
> >>> + WARN_ON(in->flags & DESC_DONE_FLAG);
> >>> + WARN_ON(entry->flags & DESC_DONE_FLAG);
> >>> + in->flags = 0;
> >>> + in->data = (uintptr_t)entry;
> >>> + entry->addr = ep_src;
> >>> +
> >>> + iowrite32(len, &out->len);
> >>> + iowrite32(entry->flags, &out->flags);
> >>> + iowrite64(ep_src, &out->addr);
> >>> + WRITE_ONCE(qp->wr_prod, qp->wr_prod + 1);
> >>> +
> >>> + dma_wmb();
> >>> + iowrite32(qp->wr_prod, NTB_HEAD_WR_EP_O(qp));
> >>> +
> >>> + qp->tx_bytes += len;
> >>> + qp->tx_pkts++;
> >>> + }
> >>> +
> >>> + ntb_transport_edma_notify_peer(qp);
> >>> +
> >>> + return 0;
> >>> +out_unmap:
> >>> + if (likely(len))
> >>> + dma_unmap_single(dma_dev, ep_src, len, DMA_TO_DEVICE);
> >>> + return rc;
> >>> +}
> >>> +
> >>> +static int ntb_transport_edma_tx_enqueue(struct ntb_transport_qp *qp,
> >>> + struct ntb_queue_entry *entry,
> >>> + void *cb, void *data, unsigned int len,
> >>> + unsigned int flags)
> >>> +{
> >>> + struct device *dma_dev;
> >>> +
> >>> + if (entry->addr) {
> >>> + /* Deferred unmap */
> >>> + dma_dev = get_dma_dev(qp->ndev);
> >>> + dma_unmap_single(dma_dev, entry->addr, entry->len, DMA_TO_DEVICE);
> >>> + }
> >>> +
> >>> + entry->cb_data = cb;
> >>> + entry->buf = data;
> >>> + entry->len = len;
> >>> + entry->flags = flags;
> >>> + entry->errors = 0;
> >>> + entry->addr = 0;
> >>> +
> >>> + WARN_ON_ONCE(!ntb_qp_edma_enabled(qp));
> >>> +
> >>> + if (ntb_qp_edma_is_ep(qp))
> >>> + return ntb_transport_edma_ep_tx_enqueue(qp, entry);
> >>> + else
> >>> + return ntb_transport_edma_rc_tx_enqueue(qp, entry);
> >>> +}
> >>> +
> >>> +static int ntb_transport_edma_ep_rx_enqueue(struct ntb_transport_qp *qp,
> >>> + struct ntb_queue_entry *entry)
> >>> +{
> >>> + struct device *dma_dev = get_dma_dev(qp->ndev);
> >>> + struct ntb_edma_desc *in, __iomem *out;
> >>> + unsigned int len = entry->len;
> >>> + void *data = entry->buf;
> >>> + dma_addr_t ep_dst;
> >>> + u32 idx;
> >>> + int rc;
> >>> +
> >>> + ep_dst = dma_map_single(dma_dev, data, len, DMA_FROM_DEVICE);
> >>> + rc = dma_mapping_error(dma_dev, ep_dst);
> >>> + if (rc)
> >>> + return rc;
> >>> +
> >>> + scoped_guard(spinlock_bh, &qp->ep_rx_lock) {
> >>> + if (ntb_edma_ring_full(READ_ONCE(qp->rd_prod),
> >>> + READ_ONCE(qp->rd_cons))) {
> >>> + rc = -ENOSPC;
> >>> + goto out_unmap;
> >>> + }
> >>> +
> >>> + idx = ntb_edma_ring_idx(qp->rd_prod);
> >>> + in = NTB_DESC_RD_EP_I(qp, idx);
> >>> + out = NTB_DESC_RD_EP_O(qp, idx);
> >>> +
> >>> + iowrite32(len, &out->len);
> >>> + iowrite64(ep_dst, &out->addr);
> >>> +
> >>> + WARN_ON(in->flags & DESC_DONE_FLAG);
> >>> + in->data = (uintptr_t)entry;
> >>> + entry->addr = ep_dst;
> >>> +
> >>> + /* Ensure len/addr are visible before the head update */
> >>> + dma_wmb();
> >>> +
> >>> + WRITE_ONCE(qp->rd_prod, qp->rd_prod + 1);
> >>> + iowrite32(qp->rd_prod, NTB_HEAD_RD_EP_O(qp));
> >>> + }
> >>> + return 0;
> >>> +out_unmap:
> >>> + dma_unmap_single(dma_dev, ep_dst, len, DMA_FROM_DEVICE);
> >>> + return rc;
> >>> +}
> >>> +
> >>> +static int ntb_transport_edma_rx_enqueue(struct ntb_transport_qp *qp,
> >>> + struct ntb_queue_entry *entry)
> >>> +{
> >>> + int rc;
> >>> +
> >>> + /* The behaviour is the same as the default backend for RC side */
> >>> + if (ntb_qp_edma_is_ep(qp)) {
> >>> + rc = ntb_transport_edma_ep_rx_enqueue(qp, entry);
> >>> + if (rc) {
> >>> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry,
> >>> + &qp->rx_free_q);
> >>> + return rc;
> >>> + }
> >>> + }
> >>> +
> >>> + ntb_list_add(&qp->ntb_rx_q_lock, &entry->entry, &qp->rx_pend_q);
> >>> +
> >>> + if (qp->active)
> >>> + tasklet_schedule(&qp->rxc_db_work);
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_rx_poll(struct ntb_transport_qp *qp)
> >>> +{
> >>> + struct ntb_transport_ctx *nt = qp->transport;
> >>> +
> >>> + if (ntb_qp_edma_is_rc(qp))
> >>> + ntb_transport_edma_rc_poll(qp);
> >>> + else if (ntb_qp_edma_is_ep(qp)) {
> >>> + /*
> >>> + * Make sure we poll the rings even if an eDMA interrupt is
> >>> + * cleared on the RC side earlier.
> >>> + */
> >>> + queue_work(nt->wq, &qp->read_work);
> >>> + queue_work(nt->wq, &qp->write_work);
> >>> + } else
> >>> + /* Unreachable */
> >>> + WARN_ON_ONCE(1);
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_read_work(struct work_struct *work)
> >>> +{
> >>> + struct ntb_transport_qp *qp = container_of(
> >>> + work, struct ntb_transport_qp, read_work);
> >>> +
> >>> + if (ntb_qp_edma_is_rc(qp))
> >>> + ntb_transport_edma_rc_read_complete_work(work);
> >>> + else if (ntb_qp_edma_is_ep(qp))
> >>> + ntb_transport_edma_ep_read_work(work);
> >>> + else
> >>> + /* Unreachable */
> >>> + WARN_ON_ONCE(1);
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_write_work(struct work_struct *work)
> >>> +{
> >>> + struct ntb_transport_qp *qp = container_of(
> >>> + work, struct ntb_transport_qp, write_work);
> >>> +
> >>> + if (ntb_qp_edma_is_rc(qp))
> >>> + ntb_transport_edma_rc_write_complete_work(work);
> >>> + else if (ntb_qp_edma_is_ep(qp))
> >>> + ntb_transport_edma_ep_write_work(work);
> >>> + else
> >>> + /* Unreachable */
> >>> + WARN_ON_ONCE(1);
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_init_queue(struct ntb_transport_ctx *nt,
> >>> + unsigned int qp_num)
> >>> +{
> >>> + struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
> >>> +
> >>> + qp->wr_cons = 0;
> >>> + qp->rd_cons = 0;
> >>> + qp->wr_prod = 0;
> >>> + qp->rd_prod = 0;
> >>> + qp->wr_issue = 0;
> >>> + qp->rd_issue = 0;
> >>> +
> >>> + INIT_WORK(&qp->db_work, ntb_transport_edma_db_work);
> >>> + INIT_WORK(&qp->read_work, ntb_transport_edma_read_work);
> >>> + INIT_WORK(&qp->write_work, ntb_transport_edma_write_work);
> >>> +}
> >>> +
> >>> +static void ntb_transport_edma_create_queue(struct ntb_transport_ctx *nt,
> >>> + struct ntb_transport_qp *qp)
> >>> +{
> >>> + spin_lock_init(&qp->ep_tx_lock);
> >>> + spin_lock_init(&qp->ep_rx_lock);
> >>> + spin_lock_init(&qp->rc_lock);
> >>> +}
> >>> +
> >>> +static const struct ntb_transport_backend_ops edma_backend_ops = {
> >>> + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> >>> + .tx_free_entry = ntb_transport_edma_tx_free_entry,
> >>> + .tx_enqueue = ntb_transport_edma_tx_enqueue,
> >>> + .rx_enqueue = ntb_transport_edma_rx_enqueue,
> >>> + .rx_poll = ntb_transport_edma_rx_poll,
> >>> + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> >>> +};
> >>> +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> >>> +
> >>> /**
> >>> * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> >>> * @qp: NTB transport layer queue to be enabled
> >>
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-03 10:39 ` Niklas Cassel
@ 2025-12-03 14:36 ` Koichiro Den
2025-12-03 14:40 ` Koichiro Den
2025-12-04 17:10 ` Frank Li
2025-12-05 16:28 ` Frank Li
2 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-12-03 14:36 UTC (permalink / raw)
To: Niklas Cassel
Cc: Frank Li, ntb, linux-pci, dmaengine, linux-kernel, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring, Damien Le Moal
On Wed, Dec 03, 2025 at 11:39:21AM +0100, Niklas Cassel wrote:
> On Wed, Dec 03, 2025 at 05:40:45PM +0900, Koichiro Den wrote:
> > >
> > > If we want to improve the dw-edma driver, so that an EPF driver can have
> > > multiple outstanding transfers, I think the best way forward would be to create
> > > a new _prep_slave_memcpy() or similar, that does take a direction, and thus
> > > does not require dmaengine_slave_config() to be called before every
> > > _prep_slave_memcpy() call.
> >
> > Would dmaengine_prep_slave_single_config(), which Frank tolds us in this
> > thread, be sufficient?
>
> I think that Frank is suggesting a new dmaengine API,
> dmaengine_prep_slave_single_config(), which is like
> dmaengine_prep_slave_single(), but also takes a struct dma_slave_config *
> as a parameter.
>
>
> I really like the idea.
> I think it would allow us to remove the mutex in nvmet_pci_epf_dma_transfer():
> https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L389-L429
Thank you for the clarification. I was wondering whether there were any
particular reasons for covering such a broad window (i.e. from
dmaengine_prep_slave_sg() to the end of dma_sync_wait()) with the mutex in
the nvme case (but it seems there are none, right?).
>
> Frank you wrote: "Thanks, we also consider ..."
> Does that mean that you have any plans to work on this?
> I would definitely be interested.
No, I only learned about the idea in this thread. I also think it is a good
idea, but I would be interested to know why it has not been upstreamed so
far, I mean, whether there were any technical hurdles. Frank, any input
would be greatly appreciated.
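Just to check that I'm reading the proposal correctly, I sketched what such
a helper might look like (purely illustrative and based only on this thread;
the interesting part would be having the provider apply the config as part
of descriptor preparation rather than via a separate, channel-wide
dmaengine_slave_config() call):

static inline struct dma_async_tx_descriptor *
dmaengine_prep_slave_single_config(struct dma_chan *chan, dma_addr_t buf,
				   size_t len, enum dma_transfer_direction dir,
				   unsigned long flags,
				   struct dma_slave_config *config)
{
	struct scatterlist sg;

	if (!chan || !chan->device || !chan->device->device_prep_slave_sg)
		return NULL;

	sg_init_table(&sg, 1);
	sg_dma_address(&sg) = buf;
	sg_dma_len(&sg) = len;

	/*
	 * Naive placeholder: a real implementation would need the provider
	 * to apply 'config' atomically as part of preparing the descriptor,
	 * otherwise concurrent callers on the same channel still race on
	 * the channel-wide slave config.
	 */
	if (dmaengine_slave_config(chan, config))
		return NULL;

	return chan->device->device_prep_slave_sg(chan, &sg, 1, dir, flags,
						  NULL);
}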
Thank you,
Koichiro
>
>
> Kind regards,
> Niklas
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-03 14:36 ` Koichiro Den
@ 2025-12-03 14:40 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-03 14:40 UTC (permalink / raw)
To: Niklas Cassel
Cc: Frank Li, ntb, linux-pci, dmaengine, linux-kernel, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring, Damien Le Moal
On Wed, Dec 03, 2025 at 11:36:31PM +0900, Koichiro Den wrote:
> On Wed, Dec 03, 2025 at 11:39:21AM +0100, Niklas Cassel wrote:
> > On Wed, Dec 03, 2025 at 05:40:45PM +0900, Koichiro Den wrote:
> > > >
> > > > If we want to improve the dw-edma driver, so that an EPF driver can have
> > > > multiple outstanding transfers, I think the best way forward would be to create
> > > > a new _prep_slave_memcpy() or similar, that does take a direction, and thus
> > > > does not require dmaengine_slave_config() to be called before every
> > > > _prep_slave_memcpy() call.
> > >
> > > Would dmaengine_prep_slave_single_config(), which Frank tolds us in this
> > > thread, be sufficient?
> >
> > I think that Frank is suggesting a new dmaengine API,
> > dmaengine_prep_slave_single_config(), which is like
> > dmaengine_prep_slave_single(), but also takes a struct dma_slave_config *
> > as a parameter.
> >
> >
> > I really like the idea.
> > I think it would allow us to remove the mutex in nvmet_pci_epf_dma_transfer():
> > https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L389-L429
>
> Thank you for the clarification. I was wondering whether there were any
> particular reasons for covering such a broad window (i.e. from
> dmaengine_prep_slave_sg() to the end of dma_sync_wait()) with the mutex in
> the nvme case (but it seems there are none, right?).
>
> >
> > Frank you wrote: "Thanks, we also consider ..."
> > Does that mean that you have any plans to work on this?
> > I would definitely be interested.
>
> No, I only learned about the idea in this thread. I also think it is a good
> idea, but I would be interested to know why it has not been upstreamed so
> far, I mean, whether there were any technical hurdles. Frank, any input
> would be greatly appreciated.
Oh, sorry, this part was addressed to Frank. Please disregard my comment here.
Koichiro
>
> Thank you,
> Koichiro
>
> >
> >
> > Kind regards,
> > Niklas
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-03 10:19 ` Niklas Cassel
@ 2025-12-03 14:56 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-03 14:56 UTC (permalink / raw)
To: Niklas Cassel
Cc: mani, ntb, linux-pci, dmaengine, linux-kernel, Frank.Li,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring, Damien Le Moal
On Wed, Dec 03, 2025 at 11:19:13AM +0100, Niklas Cassel wrote:
> On Wed, Dec 03, 2025 at 05:30:52PM +0900, Koichiro Den wrote:
> > On Tue, Dec 02, 2025 at 07:32:31AM +0100, Niklas Cassel wrote:
> > > On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > > > dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> > > > for the MSI target address on every interrupt and tears it down again
> > > > via dw_pcie_ep_unmap_addr().
> > > >
> > > > On systems that heavily use the AXI bridge interface (for example when
> > > > the integrated eDMA engine is active), this means the outbound iATU
> > > > registers are updated while traffic is in flight. The DesignWare
> > > > endpoint spec warns that updating iATU registers in this situation is
> > > > not supported, and the behavior is undefined.
> > >
> > > Please reference a specific section in the EP databook, and the specific
> > > EP databook version that you are using.
> >
> > Ok, the section I was referring to in the commit message is:
> >
> > DW EPC databook 5.40a - 3.10.6.1 iATU Outbound Programming Overview
> > "Caution: Dynamic iATU Programming with AXI Bridge Module You must not update
> > the iATU registers while operations are in progress on the AXI bridge slave
> > interface."
>
> Please add this text to the commit message when sending a proper patch.
>
> Nit: I think it is "DW EP databook" and not "DW EPC databook".
Thanks for pointing it out. Noted.
>
>
> However, if what you are suggesting is true, that would have an implication
> for all PCI EPF drivers.
>
> E.g. the MHI EPF driver:
> https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-mhi.c#L394-L395
> https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-mhi.c#L323-L324
>
> uses either the eDMA (without calling pci_epc_map_addr()) or MMIO
> (which does call pci_epc_map_addr(), which will update the iATU registers),
> depending on the I/O size.
>
> And I assume that the MHI bus can have multiple outgoing reads/writes
> at the same time.
>
> If what you are suggesting is true, AFAICT, any EPF driver that could have
> multiple outgoing transactions occuring at the same time, can not be allowed
> to have calls to pci_epc_map_addr().
>
> Which would mean that, even if we change dw_pcie_ep_raise_msix_irq() and
> dw_pcie_ep_raise_msi_irq() to not call map_addr() after
> dw_pcie_ep_init_registers(), we could never have an EPF driver that mixes
> MMIO and DMA. (Or even has multiple outgoing MMIO transactions, as that
> requires updating iATU registers.)
I understand. That's a very good point. I'm not really sure whether the
issue this patch attempts to address is SoC-specific (i.e. observable only
on R-Car S4), but I still think it's a good idea to conform to the databook
and avoid the situation the Caution describes. It is also still somewhat
speculative on my side, as I have not been able to verify what is happening
at the hardware level.
Koichiro
>
>
> Kind regards,
> Niklas
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 10/27] NTB: core: Add .get_pci_epc() to ntb_dev_ops
2025-12-02 14:49 ` Dave Jiang
@ 2025-12-03 15:02 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-03 15:02 UTC (permalink / raw)
To: Dave Jiang
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Tue, Dec 02, 2025 at 07:49:08AM -0700, Dave Jiang wrote:
>
>
> On 12/1/25 11:32 PM, Koichiro Den wrote:
> > On Mon, Dec 01, 2025 at 02:08:14PM -0700, Dave Jiang wrote:
> >>
> >>
> >> On 11/29/25 9:03 AM, Koichiro Den wrote:
> >>> Add an optional get_pci_epc() callback to retrieve the underlying
> >>> pci_epc device associated with the NTB implementation.
> >>>
> >>> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> >>> ---
> >>> drivers/ntb/hw/epf/ntb_hw_epf.c | 11 +----------
> >>> include/linux/ntb.h | 21 +++++++++++++++++++++
> >>> 2 files changed, 22 insertions(+), 10 deletions(-)
> >>>
> >>> diff --git a/drivers/ntb/hw/epf/ntb_hw_epf.c b/drivers/ntb/hw/epf/ntb_hw_epf.c
> >>> index a3ec411bfe49..d55ce6b0fad4 100644
> >>> --- a/drivers/ntb/hw/epf/ntb_hw_epf.c
> >>> +++ b/drivers/ntb/hw/epf/ntb_hw_epf.c
> >>> @@ -9,6 +9,7 @@
> >>> #include <linux/delay.h>
> >>> #include <linux/module.h>
> >>> #include <linux/pci.h>
> >>> +#include <linux/pci-epf.h>
> >>> #include <linux/slab.h>
> >>> #include <linux/ntb.h>
> >>>
> >>> @@ -49,16 +50,6 @@
> >>>
> >>> #define NTB_EPF_COMMAND_TIMEOUT 1000 /* 1 Sec */
> >>>
> >>> -enum pci_barno {
> >>> - NO_BAR = -1,
> >>> - BAR_0,
> >>> - BAR_1,
> >>> - BAR_2,
> >>> - BAR_3,
> >>> - BAR_4,
> >>> - BAR_5,
> >>> -};
> >>> -
> >>> enum epf_ntb_bar {
> >>> BAR_CONFIG,
> >>> BAR_PEER_SPAD,
> >>> diff --git a/include/linux/ntb.h b/include/linux/ntb.h
> >>> index d7ce5d2e60d0..04dc9a4d6b85 100644
> >>> --- a/include/linux/ntb.h
> >>> +++ b/include/linux/ntb.h
> >>> @@ -64,6 +64,7 @@ struct ntb_client;
> >>> struct ntb_dev;
> >>> struct ntb_msi;
> >>> struct pci_dev;
> >>> +struct pci_epc;
> >>>
> >>> /**
> >>> * enum ntb_topo - NTB connection topology
> >>> @@ -256,6 +257,7 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops *ops)
> >>> * @msg_clear_mask: See ntb_msg_clear_mask().
> >>> * @msg_read: See ntb_msg_read().
> >>> * @peer_msg_write: See ntb_peer_msg_write().
> >>> + * @get_pci_epc: See ntb_get_pci_epc().
> >>> */
> >>> struct ntb_dev_ops {
> >>> int (*port_number)(struct ntb_dev *ntb);
> >>> @@ -331,6 +333,7 @@ struct ntb_dev_ops {
> >>> int (*msg_clear_mask)(struct ntb_dev *ntb, u64 mask_bits);
> >>> u32 (*msg_read)(struct ntb_dev *ntb, int *pidx, int midx);
> >>> int (*peer_msg_write)(struct ntb_dev *ntb, int pidx, int midx, u32 msg);
> >>> + struct pci_epc *(*get_pci_epc)(struct ntb_dev *ntb);
> >>
> >> Seems very specific call to this particular hardware instead of something generic for the NTB dev ops. Maybe it should be something like get_private_data() or something like that?
> >
> > Thank you for the suggestion.
> >
> > I also felt that it's too specific, but I couldn't come up with a clean
> > generic interface at the time, so I left it in this form.
> >
> > .get_private_data() might indeed be better. In the callback doc comment we
> > could describe it as "may be used to obtain a backing PCI controller
> > pointer"?
>
> I would add that comment in the defined callback for the hardware driver. For the actual API, it would be something like "for retrieving private data specific to the hardware driver"?
Thank you, that sounds reasonable. I'll adjust the interface and comments.
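For illustration, the adjusted interface might look roughly like this (only
a sketch; the exact naming and kdoc wording will be settled in the respin):

	/* In struct ntb_dev_ops, replacing .get_pci_epc: */
	void *(*get_private_data)(struct ntb_dev *ntb);

/**
 * ntb_get_private_data() - retrieve private data specific to the hardware driver
 * @ntb: NTB device context.
 *
 * May be used, for example, to obtain a backing PCI controller pointer.
 *
 * Return: The hardware driver specific pointer, or %NULL if not implemented.
 */
static inline void *ntb_get_private_data(struct ntb_dev *ntb)
{
	if (!ntb->ops->get_private_data)
		return NULL;
	return ntb->ops->get_private_data(ntb);
}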
Koichiro
>
> DJ
>
> >
> > -Koichiro
> >
> >>
> >> DJ
> >>
> >>
> >>> };
> >>>
> >>> static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> >>> @@ -393,6 +396,9 @@ static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> >>> /* !ops->msg_clear_mask == !ops->msg_count && */
> >>> !ops->msg_read == !ops->msg_count &&
> >>> !ops->peer_msg_write == !ops->msg_count &&
> >>> +
> >>> + /* Miscellaneous optional callbacks */
> >>> + /* ops->get_pci_epc && */
> >>> 1;
> >>> }
> >>>
> >>> @@ -1567,6 +1573,21 @@ static inline int ntb_peer_msg_write(struct ntb_dev *ntb, int pidx, int midx,
> >>> return ntb->ops->peer_msg_write(ntb, pidx, midx, msg);
> >>> }
> >>>
> >>> +/**
> >>> + * ntb_get_pci_epc() - get backing PCI endpoint controller if possible.
> >>> + * @ntb: NTB device context.
> >>> + *
> >>> + * Get the backing PCI endpoint controller representation.
> >>> + *
> >>> + * Return: A pointer to the pci_epc instance if available. or %NULL if not.
> >>> + */
> >>> +static inline struct pci_epc __maybe_unused *ntb_get_pci_epc(struct ntb_dev *ntb)
> >>> +{
> >>> + if (!ntb->ops->get_pci_epc)
> >>> + return NULL;
> >>> + return ntb->ops->get_pci_epc(ntb);
> >>> +}
> >>> +
> >>> /**
> >>> * ntb_peer_resource_idx() - get a resource index for a given peer idx
> >>> * @ntb: NTB device context.
> >>
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-12-03 8:53 ` Koichiro Den
@ 2025-12-03 16:14 ` Frank Li
2025-12-04 15:42 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-03 16:14 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Wed, Dec 03, 2025 at 05:53:03PM +0900, Koichiro Den wrote:
> On Tue, Dec 02, 2025 at 10:42:29AM -0500, Frank Li wrote:
> > On Tue, Dec 02, 2025 at 03:43:10PM +0900, Koichiro Den wrote:
> > > On Mon, Dec 01, 2025 at 04:41:05PM -0500, Frank Li wrote:
> > > > On Sun, Nov 30, 2025 at 01:03:58AM +0900, Koichiro Den wrote:
> > > > > Add a new transport backend that uses a remote DesignWare eDMA engine
> > > > > located on the NTB endpoint to move data between host and endpoint.
> > > > >
> > ...
> > > > > +#include "ntb_edma.h"
> > > > > +
> > > > > +/*
> > > > > + * The interrupt register offsets below are taken from the DesignWare
> > > > > + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> > > > > + * backend currently only supports this layout.
> > > > > + */
> > > > > +#define DMA_WRITE_INT_STATUS_OFF 0x4c
> > > > > +#define DMA_WRITE_INT_MASK_OFF 0x54
> > > > > +#define DMA_WRITE_INT_CLEAR_OFF 0x58
> > > > > +#define DMA_READ_INT_STATUS_OFF 0xa0
> > > > > +#define DMA_READ_INT_MASK_OFF 0xa8
> > > > > +#define DMA_READ_INT_CLEAR_OFF 0xac
> > > >
> > > > Not sure why need access EDMA register because EMDA driver already export
> > > > as dmaengine driver.
> > >
> > > These are intended for EP use. In my current design I intentionally don't
> > > use the standard dw-edma dmaengine driver on the EP side.
> >
> > why not?
>
> Conceptually I agree that using the standard dw-edma driver on both sides
> would be attractive for future extensibility and maintainability. However,
> there are a couple of concerns for me, some of which might be alleviated by
> your suggestion below, and some which are more generic safety concerns that
> I tried to outline in my replies to your other comments.
>
> >
> > >
> > > >
> > > > > +
> > > > > +#define NTB_EDMA_NOTIFY_MAX_QP 64
> > > > > +
> > ...
> > > > > +
> > > > > + virq = irq_create_fwspec_mapping(&fwspec);
> > > > > + of_node_put(parent);
> > > > > + return (virq > 0) ? virq : -EINVAL;
> > > > > +}
> > > > > +
> > > > > +static irqreturn_t ntb_edma_isr(int irq, void *data)
> > > > > +{
> > > >
> > > > Not sue why dw_edma_interrupt_write/read() does work for your case. Suppose
> > > > just register callback for dmeengine.
> > >
> > > If we ran dw_edma_probe() on both the EP and RC sides and let the dmaengine
> > > callbacks handle int_status/int_clear, I think we could hit races. One side
> > > might clear a status bit before the other side has a chance to see it and
> > > invoke its callback. Please correct me if I'm missing something here.
> >
> > You should use difference channel?
>
> Do you mean something like this:
> - on EP side, dw_edma_probe() only set up a dedicated channel for notification,
> - on RC side, do not set up that particular channel via dw_edma_channel_setup(),
> but do other remaining channels for DMA transfers.
Yes, it may be simple overall. Of course this will waste a channel.
>
> Also, is it generically safe to have dw_edma_probe() executed from both ends on
> the same eDMA instance, as long as the channels are carefully partitioned
> between them?
The per-channel register MMIO space is separate, although some per-channel
fields are packed into a single 32-bit register.
But the critical one, the interrupt status register, is W1C, so writing
only BIT(channel) is safe.
IRQ enable/disable needs careful handling.
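For example, clearing our own bit like below is safe from either side,
because a W1C write only touches the bit we own (offsets as in the unrolled
map in your patch; illustrative only, 'my_rd_ch' and 'edma_regs' are just
placeholders):

	/*
	 * Ack only the read channel owned by this side; the other W1C bits
	 * in the register belong to the peer and are left untouched.
	 */
	writel(BIT(my_rd_ch), edma_regs + DMA_READ_INT_CLEAR_OFF);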
Alternatively, you can defer all actual DMA transfers to the EP side and
append an MSI write as the last item of the linked list to notify the RC
side that the DMA is done (actually RIE should do the same thing).
>
> >
> > >
> > > To avoid that, in my current implementation, the RC side handles the
> > > status/int_clear registers in the usual way, and the EP side only tries to
> > > suppress needless edma_int as much as possible.
> > >
> > > That said, I'm now wondering if it would be better to set LIE=0/RIE=1 for
> > > the DMA transfer channels and LIE=1/RIE=0 for the notification channel.
> > > That would require some changes on dw-edma core.
> >
> > If dw-edma work as remote DMA, which should enable RIE. like
> > dw-edma-pcie.c, but not one actually use it recently.
> >
> > Use EDMA as doorbell should be new case and I think it is quite useful.
> >
> > > >
> > > > > + struct ntb_edma_interrupt *v = data;
> > > > > + u32 mask = BIT(EDMA_RD_CH_NUM);
> > > > > + u32 i, val;
> > > > > +
> > ...
> > > > > + ret = dw_edma_probe(chip);
> > > >
> > > > I think dw_edma_probe() should be in ntb_hw_epf.c, which provide DMA
> > > > dma engine support.
> > > >
> > > > EP side, suppose default dwc controller driver already setup edma engine,
> > > > so use correct filter function, you should get dma chan.
> > >
> > > I intentionally hid edma for EP side in .dts patch in [RFC PATCH v2 26/27]
> > > so that RC side only manages eDMA remotely and avoids the potential race
> > > condition I mentioned above.
> >
> > Improve eDMA core to suppport some dma channel work at local, some for
> > remote.
>
> Right, Firstly I experimented a bit more with different LIE/RIE settings and
> ended up with the following observations:
>
> * LIE=0/RIE=1 does not seem to work at the hardware level. When I tried this for
> DMA transfer channels, the RC side never received any interrupt. The databook
> (5.40a, 8.2.2 "Interrupts and Error Handling") has a hint that says
> "If you want a remote interrupt and not a local interrupt then: Set LIE and
> RIE [...]", so I think this behaviour is expected.
Actually, you can append an MSI write as the last element of the DMA
descriptor linked list, so it will not depend on the eDMA's IRQ at all.
> * LIE=1/RIE=0 does work at the hardware level, but is problematic for my current
> design, where the RC issues the DMA transfer for the notification via
> ntb_edma_notify_peer(). With RIE=0, the RC never calls
> dw_edma_core_handle_int() for that channel, which means that internal state
> such as dw_edma_chan.status is never managed correctly.
If you append an MSI write to the DMA linked list, you needn't check the
status register; just check the current LL position to know which
descriptors are already done.
Or you can enable LIE but disable the related IRQ line (without registering
an irq handler), so the local IRQ will be ignored by the GIC and you can
safely handle everything on the RC side.
Frank
>
> >
> > Frank
> > >
> > > Thanks for reviewing,
> > > Koichiro
> > >
> > > >
> > > > Frank
> > > >
> > > > > + if (ret) {
> > > > > + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> > > > > + return ret;
> > > > > + }
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +
> > ...
> >
> > > > > +{
> > > > > + spin_lock_init(&qp->ep_tx_lock);
> > > > > + spin_lock_init(&qp->ep_rx_lock);
> > > > > + spin_lock_init(&qp->rc_lock);
> > > > > +}
> > > > > +
> > > > > +static const struct ntb_transport_backend_ops edma_backend_ops = {
> > > > > + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> > > > > + .tx_free_entry = ntb_transport_edma_tx_free_entry,
> > > > > + .tx_enqueue = ntb_transport_edma_tx_enqueue,
> > > > > + .rx_enqueue = ntb_transport_edma_rx_enqueue,
> > > > > + .rx_poll = ntb_transport_edma_rx_poll,
> > > > > + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> > > > > +};
> > > > > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > > > > +
> > > > > /**
> > > > > * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> > > > > * @qp: NTB transport layer queue to be enabled
> > > > > --
> > > > > 2.48.1
> > > > >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-12-03 16:14 ` Frank Li
@ 2025-12-04 15:42 ` Koichiro Den
2025-12-04 20:16 ` Frank Li
0 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-12-04 15:42 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Wed, Dec 03, 2025 at 11:14:43AM -0500, Frank Li wrote:
> On Wed, Dec 03, 2025 at 05:53:03PM +0900, Koichiro Den wrote:
> > On Tue, Dec 02, 2025 at 10:42:29AM -0500, Frank Li wrote:
> > > On Tue, Dec 02, 2025 at 03:43:10PM +0900, Koichiro Den wrote:
> > > > On Mon, Dec 01, 2025 at 04:41:05PM -0500, Frank Li wrote:
> > > > > On Sun, Nov 30, 2025 at 01:03:58AM +0900, Koichiro Den wrote:
> > > > > > Add a new transport backend that uses a remote DesignWare eDMA engine
> > > > > > located on the NTB endpoint to move data between host and endpoint.
> > > > > >
> > > ...
> > > > > > +#include "ntb_edma.h"
> > > > > > +
> > > > > > +/*
> > > > > > + * The interrupt register offsets below are taken from the DesignWare
> > > > > > + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> > > > > > + * backend currently only supports this layout.
> > > > > > + */
> > > > > > +#define DMA_WRITE_INT_STATUS_OFF 0x4c
> > > > > > +#define DMA_WRITE_INT_MASK_OFF 0x54
> > > > > > +#define DMA_WRITE_INT_CLEAR_OFF 0x58
> > > > > > +#define DMA_READ_INT_STATUS_OFF 0xa0
> > > > > > +#define DMA_READ_INT_MASK_OFF 0xa8
> > > > > > +#define DMA_READ_INT_CLEAR_OFF 0xac
> > > > >
> > > > > Not sure why need access EDMA register because EMDA driver already export
> > > > > as dmaengine driver.
> > > >
> > > > These are intended for EP use. In my current design I intentionally don't
> > > > use the standard dw-edma dmaengine driver on the EP side.
> > >
> > > why not?
> >
> > Conceptually I agree that using the standard dw-edma driver on both sides
> > would be attractive for future extensibility and maintainability. However,
> > there are a couple of concerns for me, some of which might be alleviated by
> > your suggestion below, and some which are more generic safety concerns that
> > I tried to outline in my replies to your other comments.
> >
> > >
> > > >
> > > > >
> > > > > > +
> > > > > > +#define NTB_EDMA_NOTIFY_MAX_QP 64
> > > > > > +
> > > ...
> > > > > > +
> > > > > > + virq = irq_create_fwspec_mapping(&fwspec);
> > > > > > + of_node_put(parent);
> > > > > > + return (virq > 0) ? virq : -EINVAL;
> > > > > > +}
> > > > > > +
> > > > > > +static irqreturn_t ntb_edma_isr(int irq, void *data)
> > > > > > +{
> > > > >
> > > > > Not sue why dw_edma_interrupt_write/read() does work for your case. Suppose
> > > > > just register callback for dmeengine.
> > > >
> > > > If we ran dw_edma_probe() on both the EP and RC sides and let the dmaengine
> > > > callbacks handle int_status/int_clear, I think we could hit races. One side
> > > > might clear a status bit before the other side has a chance to see it and
> > > > invoke its callback. Please correct me if I'm missing something here.
> > >
> > > You should use difference channel?
> >
> > Do you mean something like this:
> > - on EP side, dw_edma_probe() only set up a dedicated channel for notification,
> > - on RC side, do not set up that particular channel via dw_edma_channel_setup(),
> > but do other remaining channels for DMA transfers.
>
> Yes, it may be simple overall. Of course this will waste a channel.
So, on the EP side I see two possible approaches:
(a) Hide "dma" [1] as in [RFC PATCH v2 26/27] and call dw_edma_probe() with
    hand-crafted settings (chip->ll_rd_cnt = 1, chip->ll_wr_cnt = 0), as
    sketched below.
(b) Or, teach this special-purpose policy (i.e. configuring only a single
    notification channel) to the SoC glue driver's dw_pcie_ep_init_registers(),
    for example via Kconfig. I don't think DT is a good place to describe
    such a policy.
There is also another option, which does not require running dw_edma_probe()
ourselves:
(c) Leave the default initialization by the SoC glue as-is, and override the
    per-channel role via some new dw-edma interface, with the guarantee
    that all channels except the notification channel remain unused on its
    side afterwards. In this model, the EP side builds the LL locations
    for data transfers and the RC configures all channels, but it sets up
    the notification channel in a special manner.
[1] https://github.com/jonmason/ntb/blob/68113d260674/Documentation/devicetree/bindings/pci/snps%2Cdw-pcie-ep.yaml#L83
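For (a), the hand-crafted probe would be roughly the following (a minimal
sketch only; the LL/data region and irq plumbing are omitted, and
'edma_reg_base'/'my_edma_plat_ops' are placeholders):

	struct dw_edma_chip *chip;
	int ret;

	chip = devm_kzalloc(dev, sizeof(*chip), GFP_KERNEL);
	if (!chip)
		return -ENOMEM;

	chip->dev = dev;
	chip->reg_base = edma_reg_base;	/* unrolled eDMA register block */
	chip->mf = EDMA_MF_EDMA_UNROLL;
	chip->nr_irqs = 1;
	chip->ops = &my_edma_plat_ops;	/* irq_vector()/pci_address() callbacks */

	/* Single read channel for notification only, no write channels */
	chip->ll_rd_cnt = 1;
	chip->ll_wr_cnt = 0;
	/* chip->ll_region_rd[0] / chip->dt_region_rd[0] setup omitted here */

	ret = dw_edma_probe(chip);
	if (ret)
		return ret;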
>
> >
> > Also, is it generically safe to have dw_edma_probe() executed from both ends on
> > the same eDMA instance, as long as the channels are carefully partitioned
> > between them?
>
> Channel register MMIO space is sperated. Some channel register shared
> into one 32bit register.
>
> But the critical one, interrupt status is w1c. So only write BIT(channel)
> is safe.
>
> Need careful handle irq enable/disable.
Yeah, I agree it is unavoidable in this model.
>
> Or you can defer all actual DMA transfer to EP side, you can append
> MSI write at last item of link to notify RC side about DMA done. (actually
> RIE should do the same thing)
>
> >
> > >
> > > >
> > > > To avoid that, in my current implementation, the RC side handles the
> > > > status/int_clear registers in the usual way, and the EP side only tries to
> > > > suppress needless edma_int as much as possible.
> > > >
> > > > That said, I'm now wondering if it would be better to set LIE=0/RIE=1 for
> > > > the DMA transfer channels and LIE=1/RIE=0 for the notification channel.
> > > > That would require some changes on dw-edma core.
> > >
> > > If dw-edma work as remote DMA, which should enable RIE. like
> > > dw-edma-pcie.c, but not one actually use it recently.
> > >
> > > Use EDMA as doorbell should be new case and I think it is quite useful.
> > >
> > > > >
> > > > > > + struct ntb_edma_interrupt *v = data;
> > > > > > + u32 mask = BIT(EDMA_RD_CH_NUM);
> > > > > > + u32 i, val;
> > > > > > +
> > > ...
> > > > > > + ret = dw_edma_probe(chip);
> > > > >
> > > > > I think dw_edma_probe() should be in ntb_hw_epf.c, which provide DMA
> > > > > dma engine support.
> > > > >
> > > > > EP side, suppose default dwc controller driver already setup edma engine,
> > > > > so use correct filter function, you should get dma chan.
> > > >
> > > > I intentionally hid edma for EP side in .dts patch in [RFC PATCH v2 26/27]
> > > > so that RC side only manages eDMA remotely and avoids the potential race
> > > > condition I mentioned above.
> > >
> > > Improve eDMA core to suppport some dma channel work at local, some for
> > > remote.
> >
> > Right, Firstly I experimented a bit more with different LIE/RIE settings and
> > ended up with the following observations:
> >
> > * LIE=0/RIE=1 does not seem to work at the hardware level. When I tried this for
> > DMA transfer channels, the RC side never received any interrupt. The databook
> > (5.40a, 8.2.2 "Interrupts and Error Handling") has a hint that says
> > "If you want a remote interrupt and not a local interrupt then: Set LIE and
> > RIE [...]", so I think this behaviour is expected.
>
> Actually, you can append MSI write at last one of DMA descriptor link. So
> it will not depend on eDMA's IRQ at all.
For RC->EP interrupts on R-Car S4 in EP mode, using ITS_TRANSLATER as the
IB iATU target did not appear to work in practice. Indeed that was the
motivation for the RFC v1 series [2]. I have not tried using ITS_TRANSLATER
as the eDMA read transfer DAR.
But in any case, simply masking the local interrupt is sufficient here. I
mainly wanted to point out that my naive idea of LIE=0/RIE=1 is not
implementable with this hardware. This whole LIE/RIE topic is a bit
off-track, sorry for the noise.
[2] For the record, RFC v2 is conceptually orthogonal and introduces a
broader concept, i.e. the remote eDMA model, but I reused many of the
preparatory commits from v1, which is why this is RFC v2 rather than a
separate series.
>
> > * LIE=1/RIE=0 does work at the hardware level, but is problematic for my current
> > design, where the RC issues the DMA transfer for the notification via
> > ntb_edma_notify_peer(). With RIE=0, the RC never calls
> > dw_edma_core_handle_int() for that channel, which means that internal state
> > such as dw_edma_chan.status is never managed correctly.
>
> If you append on MSI write at DMA link, you needn't check status register,
> just check current LL pos to know which descrptor already done.
>
> Or you also enable LIE and disable related IRQ line(without register
> irq handler), so Local IRQ will be ignore by GIC, you can safe handle at
> RC side.
What I was worried about here is that, with RIE=0 the current dw-edma
handling of struct dw_edma_chan::status field (not status register) would
not run for that channel, which could affect subsequent tx submissions. But
your suggestion also makes sense, thank you.
--8<--
So anyway the key point seems to be that we should avoid such hard-coded
register handling in [RFC PATCH v2 20/27] and rely only on the standard
dw-edma interfaces (possibly with some extensions to the dw-edma core).
From your feedback, I feel this is the essential direction.
From that perspective, I'm leaning toward (b) (which I wrote above in a
reply comment) with a Kconfig guard, i.e. in dw_pcie_ep_init_registers(),
if IS_ENABLED(CONFIG_DW_REMOTE_EDMA) we only configure the notification
channel. In practice, a DT-based variant of (b) (for example a new property
such as "dma-notification-channel = <N>;" and making
dw_pcie_ep_init_registers() honour it) would be very handy for users, but I
suspect putting this kind of policy into DT is not acceptable.
Assuming careful handling, (c) might actually be the simplest approach. I
may need to add a small hook for the notification channel in
dw_edma_done_interrupt(), via a new API such as
dw_edma_chan_register_notify().
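The hook I have in mind is something as small as the following (tentative
sketch only; the names are placeholders and the exact shape would follow
review of the dw-edma core changes):

	/* Per-channel completion notifier, registered by the EP side */
	typedef void (*dw_edma_notify_fn)(void *data);

	int dw_edma_chan_register_notify(struct dma_chan *dchan,
					 dw_edma_notify_fn notify, void *data);

dw_edma_done_interrupt() would then invoke the registered notifier for that
channel in addition to the normal vchan completion path, so the notification
channel would not need any descriptor bookkeeping of its own.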
Thank you for your time and review,
Koichiro
>
> Frank
> >
> > >
> > > Frank
> > > >
> > > > Thanks for reviewing,
> > > > Koichiro
> > > >
> > > > >
> > > > > Frank
> > > > >
> > > > > > + if (ret) {
> > > > > > + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> > > > > > + return ret;
> > > > > > + }
> > > > > > +
> > > > > > + return 0;
> > > > > > +}
> > > > > > +
> > > ...
> > >
> > > > > > +{
> > > > > > + spin_lock_init(&qp->ep_tx_lock);
> > > > > > + spin_lock_init(&qp->ep_rx_lock);
> > > > > > + spin_lock_init(&qp->rc_lock);
> > > > > > +}
> > > > > > +
> > > > > > +static const struct ntb_transport_backend_ops edma_backend_ops = {
> > > > > > + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> > > > > > + .tx_free_entry = ntb_transport_edma_tx_free_entry,
> > > > > > + .tx_enqueue = ntb_transport_edma_tx_enqueue,
> > > > > > + .rx_enqueue = ntb_transport_edma_rx_enqueue,
> > > > > > + .rx_poll = ntb_transport_edma_rx_poll,
> > > > > > + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> > > > > > +};
> > > > > > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > > > > > +
> > > > > > /**
> > > > > > * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> > > > > > * @qp: NTB transport layer queue to be enabled
> > > > > > --
> > > > > > 2.48.1
> > > > > >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-03 10:39 ` Niklas Cassel
2025-12-03 14:36 ` Koichiro Den
@ 2025-12-04 17:10 ` Frank Li
2025-12-05 16:28 ` Frank Li
2 siblings, 0 replies; 97+ messages in thread
From: Frank Li @ 2025-12-04 17:10 UTC (permalink / raw)
To: Niklas Cassel
Cc: Koichiro Den, ntb, linux-pci, dmaengine, linux-kernel, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring, Damien Le Moal
On Wed, Dec 03, 2025 at 11:39:21AM +0100, Niklas Cassel wrote:
> On Wed, Dec 03, 2025 at 05:40:45PM +0900, Koichiro Den wrote:
> > >
> > > If we want to improve the dw-edma driver, so that an EPF driver can have
> > > multiple outstanding transfers, I think the best way forward would be to create
> > > a new _prep_slave_memcpy() or similar, that does take a direction, and thus
> > > does not require dmaengine_slave_config() to be called before every
> > > _prep_slave_memcpy() call.
> >
> > Would dmaengine_prep_slave_single_config(), which Frank tolds us in this
> > thread, be sufficient?
>
> I think that Frank is suggesting a new dmaengine API,
> dmaengine_prep_slave_single_config(), which is like
> dmaengine_prep_slave_single(), but also takes a struct dma_slave_config *
> as a parameter.
>
>
> I really like the idea.
> I think it would allow us to remove the mutex in nvmet_pci_epf_dma_transfer():
> https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L389-L429
>
> Frank you wrote: "Thanks, we also consider ..."
> Does that mean that you have any plans to work on this?
> I would definitely be interested.
Great, let me draft one version. I am looking for enough use cases that
can really reduce code or provide enough benefit.
Frank
>
>
> Kind regards,
> Niklas
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-12-04 15:42 ` Koichiro Den
@ 2025-12-04 20:16 ` Frank Li
2025-12-05 3:04 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-04 20:16 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Fri, Dec 05, 2025 at 12:42:03AM +0900, Koichiro Den wrote:
> On Wed, Dec 03, 2025 at 11:14:43AM -0500, Frank Li wrote:
> > On Wed, Dec 03, 2025 at 05:53:03PM +0900, Koichiro Den wrote:
> > > On Tue, Dec 02, 2025 at 10:42:29AM -0500, Frank Li wrote:
> > > > On Tue, Dec 02, 2025 at 03:43:10PM +0900, Koichiro Den wrote:
> > > > > On Mon, Dec 01, 2025 at 04:41:05PM -0500, Frank Li wrote:
> > > > > > On Sun, Nov 30, 2025 at 01:03:58AM +0900, Koichiro Den wrote:
> > > > > > > Add a new transport backend that uses a remote DesignWare eDMA engine
> > > > > > > located on the NTB endpoint to move data between host and endpoint.
> > > > > > >
> > > > ...
> > > > > > > +#include "ntb_edma.h"
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * The interrupt register offsets below are taken from the DesignWare
> > > > > > > + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> > > > > > > + * backend currently only supports this layout.
> > > > > > > + */
> > > > > > > +#define DMA_WRITE_INT_STATUS_OFF 0x4c
> > > > > > > +#define DMA_WRITE_INT_MASK_OFF 0x54
> > > > > > > +#define DMA_WRITE_INT_CLEAR_OFF 0x58
> > > > > > > +#define DMA_READ_INT_STATUS_OFF 0xa0
> > > > > > > +#define DMA_READ_INT_MASK_OFF 0xa8
> > > > > > > +#define DMA_READ_INT_CLEAR_OFF 0xac
> > > > > >
> > > > > > Not sure why need access EDMA register because EMDA driver already export
> > > > > > as dmaengine driver.
> > > > >
> > > > > These are intended for EP use. In my current design I intentionally don't
> > > > > use the standard dw-edma dmaengine driver on the EP side.
> > > >
> > > > why not?
> > >
> > > Conceptually I agree that using the standard dw-edma driver on both sides
> > > would be attractive for future extensibility and maintainability. However,
> > > there are a couple of concerns for me, some of which might be alleviated by
> > > your suggestion below, and some which are more generic safety concerns that
> > > I tried to outline in my replies to your other comments.
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > +
> > > > > > > +#define NTB_EDMA_NOTIFY_MAX_QP 64
> > > > > > > +
> > > > ...
> > > > > > > +
> > > > > > > + virq = irq_create_fwspec_mapping(&fwspec);
> > > > > > > + of_node_put(parent);
> > > > > > > + return (virq > 0) ? virq : -EINVAL;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static irqreturn_t ntb_edma_isr(int irq, void *data)
> > > > > > > +{
> > > > > >
> > > > > > Not sue why dw_edma_interrupt_write/read() does work for your case. Suppose
> > > > > > just register callback for dmeengine.
> > > > >
> > > > > If we ran dw_edma_probe() on both the EP and RC sides and let the dmaengine
> > > > > callbacks handle int_status/int_clear, I think we could hit races. One side
> > > > > might clear a status bit before the other side has a chance to see it and
> > > > > invoke its callback. Please correct me if I'm missing something here.
> > > >
> > > > You should use difference channel?
> > >
> > > Do you mean something like this:
> > > - on EP side, dw_edma_probe() only set up a dedicated channel for notification,
> > > - on RC side, do not set up that particular channel via dw_edma_channel_setup(),
> > > but do other remaining channels for DMA transfers.
> >
> > Yes, it may be simple overall. Of course this will waste a channel.
>
> So, on the EP side I see two possible approaches:
>
> (a) Hide "dma" [1] as in [RFC PATCH v2 26/27] and call dw_edma_probe() with
> hand-crafted settings (chip->ll_rd_cnt = 1, chip->ll_wr_cnt = 0).
> (b) Or, teach this special-purpose policy (i.e. configuring only a single
> notification channel) to the SoC glue driver's dw_pcie_ep_init_registers(),
> for example via Kconfig. I don't think DT is a good place to describe
> such a policy.
>
> There is also another option, which do not necessarily run dw_edma_probe()
> by ourselves:
>
> (c) Leave the default initialization by the SoC glue as-is, and override the
> per-channel role via some new dw-edma interface, with the guarantee
> that all channels except the notification channel remain unused on its
> side afterwards. In this model, the EP side builds the LL locations
> for data transfers and the RC configures all channels, but it sets up
> the notification channel in a special manner.
>
> [1] https://github.com/jonmason/ntb/blob/68113d260674/Documentation/devicetree/bindings/pci/snps%2Cdw-pcie-ep.yaml#L83
>
> >
> > >
> > > Also, is it generically safe to have dw_edma_probe() executed from both ends on
> > > the same eDMA instance, as long as the channels are carefully partitioned
> > > between them?
> >
> > Channel register MMIO space is sperated. Some channel register shared
> > into one 32bit register.
> >
> > But the critical one, interrupt status is w1c. So only write BIT(channel)
> > is safe.
> >
> > Need careful handle irq enable/disable.
>
> Yeah, I agree it is unavoidable in this model.
>
> >
> > Or you can defer all actual DMA transfer to EP side, you can append
> > MSI write at last item of link to notify RC side about DMA done. (actually
> > RIE should do the same thing)
> >
> > >
> > > >
> > > > >
> > > > > To avoid that, in my current implementation, the RC side handles the
> > > > > status/int_clear registers in the usual way, and the EP side only tries to
> > > > > suppress needless edma_int as much as possible.
> > > > >
> > > > > That said, I'm now wondering if it would be better to set LIE=0/RIE=1 for
> > > > > the DMA transfer channels and LIE=1/RIE=0 for the notification channel.
> > > > > That would require some changes on dw-edma core.
> > > >
> > > > If dw-edma work as remote DMA, which should enable RIE. like
> > > > dw-edma-pcie.c, but not one actually use it recently.
> > > >
> > > > Use EDMA as doorbell should be new case and I think it is quite useful.
> > > >
> > > > > >
> > > > > > > + struct ntb_edma_interrupt *v = data;
> > > > > > > + u32 mask = BIT(EDMA_RD_CH_NUM);
> > > > > > > + u32 i, val;
> > > > > > > +
> > > > ...
> > > > > > > + ret = dw_edma_probe(chip);
> > > > > >
> > > > > > I think dw_edma_probe() should be in ntb_hw_epf.c, which provide DMA
> > > > > > dma engine support.
> > > > > >
> > > > > > EP side, suppose default dwc controller driver already setup edma engine,
> > > > > > so use correct filter function, you should get dma chan.
> > > > >
> > > > > I intentionally hid edma for EP side in .dts patch in [RFC PATCH v2 26/27]
> > > > > so that RC side only manages eDMA remotely and avoids the potential race
> > > > > condition I mentioned above.
> > > >
> > > > Improve eDMA core to suppport some dma channel work at local, some for
> > > > remote.
> > >
> > > Right, Firstly I experimented a bit more with different LIE/RIE settings and
> > > ended up with the following observations:
> > >
> > > * LIE=0/RIE=1 does not seem to work at the hardware level. When I tried this for
> > > DMA transfer channels, the RC side never received any interrupt. The databook
> > > (5.40a, 8.2.2 "Interrupts and Error Handling") has a hint that says
> > > "If you want a remote interrupt and not a local interrupt then: Set LIE and
> > > RIE [...]", so I think this behaviour is expected.
> >
> > Actually, you can append MSI write at last one of DMA descriptor link. So
> > it will not depend on eDMA's IRQ at all.
>
> For RC->EP interrupts on R-Car S4 in EP mode, using ITS_TRANSLATER as the
> IB iATU target did not appear to work in practice. Indeed that was the
> motivation for the RFC v1 series [2]. I have not tried using ITS_TRANSLATER
> as the eDMA read transfer DAR.
>
> But in any case, simply masking the local interrupt is sufficient here. I
> mainly wanted to point out that my naive idea of LIE=0/RIE=1 is not
> implementable with this hardware. This whole LIE/RIE topic is a bit
> off-track, sorry for the noise.
>
> [2] For the record, RFC v2 is conceptually orthogonal and introduces a
> broader concept ie. remote eDMA model, but I reused many of the
> preparatory commits from v1, which is why this is RFC v2 rather than a
> separate series.
>
> >
> > > * LIE=1/RIE=0 does work at the hardware level, but is problematic for my current
> > > design, where the RC issues the DMA transfer for the notification via
> > > ntb_edma_notify_peer(). With RIE=0, the RC never calls
> > > dw_edma_core_handle_int() for that channel, which means that internal state
> > > such as dw_edma_chan.status is never managed correctly.
> >
> > If you append on MSI write at DMA link, you needn't check status register,
> > just check current LL pos to know which descrptor already done.
> >
> > Or you also enable LIE and disable related IRQ line(without register
> > irq handler), so Local IRQ will be ignore by GIC, you can safe handle at
> > RC side.
>
> What I was worried about here is that, with RIE=0 the current dw-edma
> handling of struct dw_edma_chan::status field (not status register) would
> not run for that channel, which could affect subsequent tx submissions. But
> your suggestion also makes sense, thank you.
>
> --8<--
>
> So anyway the key point seems that we should avoid such hard-coded register
> handling in [RFC PATCH v2 20/27] and rely only on the standard dw-edma
> interfaces (possibly with some extensions to the dw-edma core). From your
> feedback, I feel this is the essential direction.
>
> From that perspective, I'm leaning toward (b) (which I wrote above in a
> reply comment) with a Kconfig guard, i.e. in dw_pcie_ep_init_registers(),
> if IS_ENABLED(CONFIG_DW_REMOTE_EDMA) we only configure the notification
> channel. In practice, a DT-based variant of (b) (for example a new property
> such as "dma-notification-channel = <N>;" and making
> dw_pcie_ep_init_registers() honour it) would be very handy for users, but I
> suspect putting this kind of policy into DT is not acceptable.
>
> Assuming careful handling, (c) might actually be the simplest approach. I
> may need to add a small hook for the notification channel in
> dw_edma_done_interrupt(), via a new API such as
> dw_edma_chan_register_notify().
Let me reply to everything here, since this concerns the overall design.

The eDMA can actually access all memory on both the EP and the RC side,
regardless of the PCI mapping windows. NTB's definition is that each side
only accesses part of the other's system memory, so one memcpy is needed
anyway. Although NTB cannot take 100% advantage of the eDMA, it is still
the easiest path for now. I have a draft idea that (most likely) does not
touch the NTB core code.
EP side RC side
1: Control bar
2: Doorbell bar
3: MW1

MW1 is a fixed-size array of [ntb_payload_header + data]. The current NTB
builds the queue in system memory and transfers data (R/W) to/from this
array.

Use the eDMA on only one side, RC or EP. Let's use the EP as the example.

In 1 (the control bar), reserve a memory space, which we'll call B.

In the ntb_hw_epf.c driver, create a simple 'fake' DMA memcpy driver which
just implements device_prep_dma_memcpy(). It only puts the src/dst/size
info into memory space B, then pushes the doorbell.

On the EP side, a workqueue fetches the info from B and sends it to the
eDMA queue to do the actual transfer. After the EP's DMA finishes, it
marks the request done in B and raises an MSI IRQ, which triggers the
completion in the 'fake' DMA memcpy driver.

Further, 3 (MW1) does not need to exist at all, because neither side
accesses it directly.
For example:

Case: RC TX, EP RX

The RC's ntb_async_tx_submit() uses device_prep_dma_memcpy() to copy
user-space memory (0xRC_1000 to 0xPCI_1000, size 0x1000), and puts the
request into the shared BAR0 area:

	0xRC_1000 -> 0xPCI_1000, size 0x1000

On the EP side, there is an RX request via ntb_async_rx_submit(), from
0xPCI_1000 to 0xEP_8000, buffer size 0x20000.

So set up an eDMA transfer from 0xRC_1000 -> 0xEP_8000, size 0x1000.
After it completes, mark both sides done, then trigger the related
callback functions.

You can see 0xPCI_1000 is not used at all. Actually 0xPCI_1000 is a
trouble maker: the RC and EP system PCI space is not necessarily the same
as the CPU space, and the PCI controller may do address conversion.
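
To make it a bit more concrete, a rough, untested sketch of one request
slot in B and the RC-side publish step (all structure, macro and function
names below are placeholders, not from the series):

	/* One request slot inside control-BAR space B (placeholder layout). */
	struct ntb_fake_memcpy_desc {
		__le64	src;	/* source bus address as seen by the eDMA     */
		__le64	dst;	/* destination bus address                    */
		__le32	len;
		__le32	flags;	/* VALID set by requester, DONE set by the EP */
	};

	#define NTB_FAKE_DESC_VALID	BIT(0)
	#define NTB_FAKE_DESC_DONE	BIT(1)

	/*
	 * RC side, called from the 'fake' memcpy driver's submit path: publish
	 * one request into B and ring the doorbell. No data moves here; the
	 * EP-side workqueue later feeds src/dst/len to its local eDMA, sets
	 * DONE and raises an MSI, which completes the dmaengine descriptor.
	 */
	static void ntb_fake_memcpy_push(struct ntb_fake_memcpy_desc __iomem *slot,
					 struct ntb_dev *ntb, int db,
					 dma_addr_t src, dma_addr_t dst, u32 len)
	{
		writeq(src, &slot->src);
		writeq(dst, &slot->dst);
		writel(len, &slot->len);
		wmb();	/* payload must be visible before the VALID flag */
		writel(NTB_FAKE_DESC_VALID, &slot->flags);
		ntb_peer_db_set(ntb, BIT_ULL(db));
	}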
Frank
>
> Thank you for your time and review,
> Koichiro
>
> >
> > Frank
> > >
> > > >
> > > > Frank
> > > > >
> > > > > Thanks for reviewing,
> > > > > Koichiro
> > > > >
> > > > > >
> > > > > > Frank
> > > > > >
> > > > > > > + if (ret) {
> > > > > > > + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> > > > > > > + return ret;
> > > > > > > + }
> > > > > > > +
> > > > > > > + return 0;
> > > > > > > +}
> > > > > > > +
> > > > ...
> > > >
> > > > > > > +{
> > > > > > > + spin_lock_init(&qp->ep_tx_lock);
> > > > > > > + spin_lock_init(&qp->ep_rx_lock);
> > > > > > > + spin_lock_init(&qp->rc_lock);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static const struct ntb_transport_backend_ops edma_backend_ops = {
> > > > > > > + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> > > > > > > + .tx_free_entry = ntb_transport_edma_tx_free_entry,
> > > > > > > + .tx_enqueue = ntb_transport_edma_tx_enqueue,
> > > > > > > + .rx_enqueue = ntb_transport_edma_rx_enqueue,
> > > > > > > + .rx_poll = ntb_transport_edma_rx_poll,
> > > > > > > + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> > > > > > > +};
> > > > > > > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > > > > > > +
> > > > > > > /**
> > > > > > > * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> > > > > > > * @qp: NTB transport layer queue to be enabled
> > > > > > > --
> > > > > > > 2.48.1
> > > > > > >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-12-04 20:16 ` Frank Li
@ 2025-12-05 3:04 ` Koichiro Den
2025-12-05 15:06 ` Frank Li
0 siblings, 1 reply; 97+ messages in thread
From: Koichiro Den @ 2025-12-05 3:04 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Thu, Dec 04, 2025 at 03:16:25PM -0500, Frank Li wrote:
> On Fri, Dec 05, 2025 at 12:42:03AM +0900, Koichiro Den wrote:
> > On Wed, Dec 03, 2025 at 11:14:43AM -0500, Frank Li wrote:
> > > On Wed, Dec 03, 2025 at 05:53:03PM +0900, Koichiro Den wrote:
> > > > On Tue, Dec 02, 2025 at 10:42:29AM -0500, Frank Li wrote:
> > > > > On Tue, Dec 02, 2025 at 03:43:10PM +0900, Koichiro Den wrote:
> > > > > > On Mon, Dec 01, 2025 at 04:41:05PM -0500, Frank Li wrote:
> > > > > > > On Sun, Nov 30, 2025 at 01:03:58AM +0900, Koichiro Den wrote:
> > > > > > > > Add a new transport backend that uses a remote DesignWare eDMA engine
> > > > > > > > located on the NTB endpoint to move data between host and endpoint.
> > > > > > > >
> > > > > ...
> > > > > > > > +#include "ntb_edma.h"
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > + * The interrupt register offsets below are taken from the DesignWare
> > > > > > > > + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> > > > > > > > + * backend currently only supports this layout.
> > > > > > > > + */
> > > > > > > > +#define DMA_WRITE_INT_STATUS_OFF 0x4c
> > > > > > > > +#define DMA_WRITE_INT_MASK_OFF 0x54
> > > > > > > > +#define DMA_WRITE_INT_CLEAR_OFF 0x58
> > > > > > > > +#define DMA_READ_INT_STATUS_OFF 0xa0
> > > > > > > > +#define DMA_READ_INT_MASK_OFF 0xa8
> > > > > > > > +#define DMA_READ_INT_CLEAR_OFF 0xac
> > > > > > >
> > > > > > > Not sure why need access EDMA register because EMDA driver already export
> > > > > > > as dmaengine driver.
> > > > > >
> > > > > > These are intended for EP use. In my current design I intentionally don't
> > > > > > use the standard dw-edma dmaengine driver on the EP side.
> > > > >
> > > > > why not?
> > > >
> > > > Conceptually I agree that using the standard dw-edma driver on both sides
> > > > would be attractive for future extensibility and maintainability. However,
> > > > there are a couple of concerns for me, some of which might be alleviated by
> > > > your suggestion below, and some which are more generic safety concerns that
> > > > I tried to outline in my replies to your other comments.
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > +
> > > > > > > > +#define NTB_EDMA_NOTIFY_MAX_QP 64
> > > > > > > > +
> > > > > ...
> > > > > > > > +
> > > > > > > > + virq = irq_create_fwspec_mapping(&fwspec);
> > > > > > > > + of_node_put(parent);
> > > > > > > > + return (virq > 0) ? virq : -EINVAL;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static irqreturn_t ntb_edma_isr(int irq, void *data)
> > > > > > > > +{
> > > > > > >
> > > > > > > Not sue why dw_edma_interrupt_write/read() does work for your case. Suppose
> > > > > > > just register callback for dmeengine.
> > > > > >
> > > > > > If we ran dw_edma_probe() on both the EP and RC sides and let the dmaengine
> > > > > > callbacks handle int_status/int_clear, I think we could hit races. One side
> > > > > > might clear a status bit before the other side has a chance to see it and
> > > > > > invoke its callback. Please correct me if I'm missing something here.
> > > > >
> > > > > You should use difference channel?
> > > >
> > > > Do you mean something like this:
> > > > - on EP side, dw_edma_probe() only set up a dedicated channel for notification,
> > > > - on RC side, do not set up that particular channel via dw_edma_channel_setup(),
> > > > but do other remaining channels for DMA transfers.
> > >
> > > Yes, it may be simple overall. Of course this will waste a channel.
> >
> > So, on the EP side I see two possible approaches:
> >
> > (a) Hide "dma" [1] as in [RFC PATCH v2 26/27] and call dw_edma_probe() with
> > hand-crafted settings (chip->ll_rd_cnt = 1, chip->ll_wr_cnt = 0).
> > (b) Or, teach this special-purpose policy (i.e. configuring only a single
> > notification channel) to the SoC glue driver's dw_pcie_ep_init_registers(),
> > for example via Kconfig. I don't think DT is a good place to describe
> > such a policy.
> >
> > There is also another option, which do not necessarily run dw_edma_probe()
> > by ourselves:
> >
> > (c) Leave the default initialization by the SoC glue as-is, and override the
> > per-channel role via some new dw-edma interface, with the guarantee
> > that all channels except the notification channel remain unused on its
> > side afterwards. In this model, the EP side builds the LL locations
> > for data transfers and the RC configures all channels, but it sets up
> > the notification channel in a special manner.
> >
> > [1] https://github.com/jonmason/ntb/blob/68113d260674/Documentation/devicetree/bindings/pci/snps%2Cdw-pcie-ep.yaml#L83
> >
> > >
> > > >
> > > > Also, is it generically safe to have dw_edma_probe() executed from both ends on
> > > > the same eDMA instance, as long as the channels are carefully partitioned
> > > > between them?
> > >
> > > Channel register MMIO space is sperated. Some channel register shared
> > > into one 32bit register.
> > >
> > > But the critical one, interrupt status is w1c. So only write BIT(channel)
> > > is safe.
> > >
> > > Need careful handle irq enable/disable.
> >
> > Yeah, I agree it is unavoidable in this model.
> >
> > >
> > > Or you can defer all actual DMA transfer to EP side, you can append
> > > MSI write at last item of link to notify RC side about DMA done. (actually
> > > RIE should do the same thing)
> > >
> > > >
> > > > >
> > > > > >
> > > > > > To avoid that, in my current implementation, the RC side handles the
> > > > > > status/int_clear registers in the usual way, and the EP side only tries to
> > > > > > suppress needless edma_int as much as possible.
> > > > > >
> > > > > > That said, I'm now wondering if it would be better to set LIE=0/RIE=1 for
> > > > > > the DMA transfer channels and LIE=1/RIE=0 for the notification channel.
> > > > > > That would require some changes on dw-edma core.
> > > > >
> > > > > If dw-edma work as remote DMA, which should enable RIE. like
> > > > > dw-edma-pcie.c, but not one actually use it recently.
> > > > >
> > > > > Use EDMA as doorbell should be new case and I think it is quite useful.
> > > > >
> > > > > > >
> > > > > > > > + struct ntb_edma_interrupt *v = data;
> > > > > > > > + u32 mask = BIT(EDMA_RD_CH_NUM);
> > > > > > > > + u32 i, val;
> > > > > > > > +
> > > > > ...
> > > > > > > > + ret = dw_edma_probe(chip);
> > > > > > >
> > > > > > > I think dw_edma_probe() should be in ntb_hw_epf.c, which provide DMA
> > > > > > > dma engine support.
> > > > > > >
> > > > > > > EP side, suppose default dwc controller driver already setup edma engine,
> > > > > > > so use correct filter function, you should get dma chan.
> > > > > >
> > > > > > I intentionally hid edma for EP side in .dts patch in [RFC PATCH v2 26/27]
> > > > > > so that RC side only manages eDMA remotely and avoids the potential race
> > > > > > condition I mentioned above.
> > > > >
> > > > > Improve eDMA core to suppport some dma channel work at local, some for
> > > > > remote.
> > > >
> > > > Right, Firstly I experimented a bit more with different LIE/RIE settings and
> > > > ended up with the following observations:
> > > >
> > > > * LIE=0/RIE=1 does not seem to work at the hardware level. When I tried this for
> > > > DMA transfer channels, the RC side never received any interrupt. The databook
> > > > (5.40a, 8.2.2 "Interrupts and Error Handling") has a hint that says
> > > > "If you want a remote interrupt and not a local interrupt then: Set LIE and
> > > > RIE [...]", so I think this behaviour is expected.
> > >
> > > Actually, you can append MSI write at last one of DMA descriptor link. So
> > > it will not depend on eDMA's IRQ at all.
> >
> > For RC->EP interrupts on R-Car S4 in EP mode, using ITS_TRANSLATER as the
> > IB iATU target did not appear to work in practice. Indeed that was the
> > motivation for the RFC v1 series [2]. I have not tried using ITS_TRANSLATER
> > as the eDMA read transfer DAR.
> >
> > But in any case, simply masking the local interrupt is sufficient here. I
> > mainly wanted to point out that my naive idea of LIE=0/RIE=1 is not
> > implementable with this hardware. This whole LIE/RIE topic is a bit
> > off-track, sorry for the noise.
> >
> > [2] For the record, RFC v2 is conceptually orthogonal and introduces a
> > broader concept ie. remote eDMA model, but I reused many of the
> > preparatory commits from v1, which is why this is RFC v2 rather than a
> > separate series.
> >
> > >
> > > > * LIE=1/RIE=0 does work at the hardware level, but is problematic for my current
> > > > design, where the RC issues the DMA transfer for the notification via
> > > > ntb_edma_notify_peer(). With RIE=0, the RC never calls
> > > > dw_edma_core_handle_int() for that channel, which means that internal state
> > > > such as dw_edma_chan.status is never managed correctly.
> > >
> > > If you append on MSI write at DMA link, you needn't check status register,
> > > just check current LL pos to know which descrptor already done.
> > >
> > > Or you also enable LIE and disable related IRQ line(without register
> > > irq handler), so Local IRQ will be ignore by GIC, you can safe handle at
> > > RC side.
> >
> > What I was worried about here is that, with RIE=0 the current dw-edma
> > handling of struct dw_edma_chan::status field (not status register) would
> > not run for that channel, which could affect subsequent tx submissions. But
> > your suggestion also makes sense, thank you.
> >
> > --8<--
> >
> > So anyway the key point seems that we should avoid such hard-coded register
> > handling in [RFC PATCH v2 20/27] and rely only on the standard dw-edma
> > interfaces (possibly with some extensions to the dw-edma core). From your
> > feedback, I feel this is the essential direction.
> >
> > From that perspective, I'm leaning toward (b) (which I wrote above in a
> > reply comment) with a Kconfig guard, i.e. in dw_pcie_ep_init_registers(),
> > if IS_ENABLED(CONFIG_DW_REMOTE_EDMA) we only configure the notification
> > channel. In practice, a DT-based variant of (b) (for example a new property
> > such as "dma-notification-channel = <N>;" and making
> > dw_pcie_ep_init_registers() honour it) would be very handy for users, but I
> > suspect putting this kind of policy into DT is not acceptable.
> >
> > Assuming careful handling, (c) might actually be the simplest approach. I
> > may need to add a small hook for the notification channel in
> > dw_edma_done_interrupt(), via a new API such as
> > dw_edma_chan_register_notify().
>
> I reply everything here for overall design
>
> EDMA actually can access all memory at both EP and RC side regardless PCI
> map windows. NTB defination is that only access part of both system memory,
> so anyway need once memcpy. Although NTB can't take 100% eDMA advantage, it
> is still easiest path now. I have a draft idea without touch NTB core code
> (most likley).
>
> EP side RC side
> 1: Control bar
> 2: Doorbell bar
> 3: WM1
>
> MW1 is fixed sized array [ntb_payload_header + data]. Current NTB built
> queue in system memory, transfer data (RW) to this array.
>
> Use EDMA only one side, RC/EP. use EP as example.
>
> In 1 (control bar, resever memory space, which call B)
>
> In ntb_hw_epf.c driver, create a simple 'fake' DMA memcpy driver, which
> just implement device_prep_dma_memcpy(). That just put src\dest\size info
> to memory space B, then push doorbell.
>
> in EP side's a workqueue, fetch info from B, the send to EDMA queue to
> do actual transfer, after EP DMA finish, mark done at B, then raise msi irq,
> 'fake' DMA memcpy driver will be triggered.
>
> Futher, 3 WM1 is not necessary existed at all, because both side don't
> access it directly.
>
> For example:
>
> case RC TX, EP RX
>
> RC ntb_async_tx_submit() use device_prep_dma_memcpy() copy user space
> memory (0xRC_1000 to PCI_1000, size 0x1000), put into share bar0 position
>
> 0xRC_1000 -> 0xPCI_1000 0x1000
>
> EP side, there RX request ntb_async_rx_submit(), from 0xPCI_1000 to
> 0xEP_8000 size 0x20000.
>
> so setup eDMA transfer form 0xRC_1000 -> 0xEP_8000 size 1000. After complete
> mark both side done, then trigger related callback functions.
>
> You can see 0xPCI_1000 is not used at all. Actually 0xPCI_1000 is trouble
> maker, RC and EP system PCI space is not necesary the same as CPU space,
> PCI controller may do address convert.
Thanks for the detailed explanation.

Just to clarify: regarding your comments about the number of memcpy
operations and about not using the 0xPCI_1000 window for the data path, I
think RFC v2 is already similar to what you're describing.
To me it seems the key differences in your proposal are mainly two-fold:
(1) the layering, and (2) local eDMA use rather than remote.

For (1), instead of adding more eDMA-specific handling into the
ntb_transport layer, your approach would keep changes to ntb_transport
minimal and encapsulate the eDMA usage inside the "fake DMA memcpy driver"
as much as possible. In that design, would the MW1 layout change? Leaving
the existing layout as-is would waste space (which is why RFC v2
introduced a new layout).

Also, one point I'm still unsure about is the opposite direction (i.e.
EP->RC). In that case, do you also expect the EP to trigger its local eDMA
engine? If so, then, similar to the RC->EP direction in RFC v2, the EP
would need to know the RC-side receive buffer address (e.g. 0xRC_1000) in
advance.

You also mentioned that you already have a draft. Are you planning to post
that as a patch series? If not, I can of course try to implement or
prototype this approach based on your suggestion.

Please let me know if the above understanding does not match what you had
in mind.
Thank you,
Koichiro
>
> Frank
> >
> > Thank you for your time and review,
> > Koichiro
> >
> > >
> > > Frank
> > > >
> > > > >
> > > > > Frank
> > > > > >
> > > > > > Thanks for reviewing,
> > > > > > Koichiro
> > > > > >
> > > > > > >
> > > > > > > Frank
> > > > > > >
> > > > > > > > + if (ret) {
> > > > > > > > + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> > > > > > > > + return ret;
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > + return 0;
> > > > > > > > +}
> > > > > > > > +
> > > > > ...
> > > > >
> > > > > > > > +{
> > > > > > > > + spin_lock_init(&qp->ep_tx_lock);
> > > > > > > > + spin_lock_init(&qp->ep_rx_lock);
> > > > > > > > + spin_lock_init(&qp->rc_lock);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static const struct ntb_transport_backend_ops edma_backend_ops = {
> > > > > > > > + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> > > > > > > > + .tx_free_entry = ntb_transport_edma_tx_free_entry,
> > > > > > > > + .tx_enqueue = ntb_transport_edma_tx_enqueue,
> > > > > > > > + .rx_enqueue = ntb_transport_edma_rx_enqueue,
> > > > > > > > + .rx_poll = ntb_transport_edma_rx_poll,
> > > > > > > > + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> > > > > > > > +};
> > > > > > > > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > > > > > > > +
> > > > > > > > /**
> > > > > > > > * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> > > > > > > > * @qp: NTB transport layer queue to be enabled
> > > > > > > > --
> > > > > > > > 2.48.1
> > > > > > > >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-12-05 3:04 ` Koichiro Den
@ 2025-12-05 15:06 ` Frank Li
2025-12-18 4:34 ` Koichiro Den
0 siblings, 1 reply; 97+ messages in thread
From: Frank Li @ 2025-12-05 15:06 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Fri, Dec 05, 2025 at 12:04:24PM +0900, Koichiro Den wrote:
> On Thu, Dec 04, 2025 at 03:16:25PM -0500, Frank Li wrote:
> > On Fri, Dec 05, 2025 at 12:42:03AM +0900, Koichiro Den wrote:
> > > On Wed, Dec 03, 2025 at 11:14:43AM -0500, Frank Li wrote:
> > > > On Wed, Dec 03, 2025 at 05:53:03PM +0900, Koichiro Den wrote:
> > > > > On Tue, Dec 02, 2025 at 10:42:29AM -0500, Frank Li wrote:
> > > > > > On Tue, Dec 02, 2025 at 03:43:10PM +0900, Koichiro Den wrote:
> > > > > > > On Mon, Dec 01, 2025 at 04:41:05PM -0500, Frank Li wrote:
> > > > > > > > On Sun, Nov 30, 2025 at 01:03:58AM +0900, Koichiro Den wrote:
> > > > > > > > > Add a new transport backend that uses a remote DesignWare eDMA engine
> > > > > > > > > located on the NTB endpoint to move data between host and endpoint.
> > > > > > > > >
> > > > > > ...
> > > > > > > > > +#include "ntb_edma.h"
> > > > > > > > > +
> > > > > > > > > +/*
> > > > > > > > > + * The interrupt register offsets below are taken from the DesignWare
> > > > > > > > > + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> > > > > > > > > + * backend currently only supports this layout.
> > > > > > > > > + */
> > > > > > > > > +#define DMA_WRITE_INT_STATUS_OFF 0x4c
> > > > > > > > > +#define DMA_WRITE_INT_MASK_OFF 0x54
> > > > > > > > > +#define DMA_WRITE_INT_CLEAR_OFF 0x58
> > > > > > > > > +#define DMA_READ_INT_STATUS_OFF 0xa0
> > > > > > > > > +#define DMA_READ_INT_MASK_OFF 0xa8
> > > > > > > > > +#define DMA_READ_INT_CLEAR_OFF 0xac
> > > > > > > >
> > > > > > > > Not sure why need access EDMA register because EMDA driver already export
> > > > > > > > as dmaengine driver.
> > > > > > >
> > > > > > > These are intended for EP use. In my current design I intentionally don't
> > > > > > > use the standard dw-edma dmaengine driver on the EP side.
> > > > > >
> > > > > > why not?
> > > > >
> > > > > Conceptually I agree that using the standard dw-edma driver on both sides
> > > > > would be attractive for future extensibility and maintainability. However,
> > > > > there are a couple of concerns for me, some of which might be alleviated by
> > > > > your suggestion below, and some which are more generic safety concerns that
> > > > > I tried to outline in my replies to your other comments.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > +#define NTB_EDMA_NOTIFY_MAX_QP 64
> > > > > > > > > +
> > > > > > ...
> > > > > > > > > +
> > > > > > > > > + virq = irq_create_fwspec_mapping(&fwspec);
> > > > > > > > > + of_node_put(parent);
> > > > > > > > > + return (virq > 0) ? virq : -EINVAL;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static irqreturn_t ntb_edma_isr(int irq, void *data)
> > > > > > > > > +{
> > > > > > > >
> > > > > > > > Not sue why dw_edma_interrupt_write/read() does work for your case. Suppose
> > > > > > > > just register callback for dmeengine.
> > > > > > >
> > > > > > > If we ran dw_edma_probe() on both the EP and RC sides and let the dmaengine
> > > > > > > callbacks handle int_status/int_clear, I think we could hit races. One side
> > > > > > > might clear a status bit before the other side has a chance to see it and
> > > > > > > invoke its callback. Please correct me if I'm missing something here.
> > > > > >
> > > > > > You should use difference channel?
> > > > >
> > > > > Do you mean something like this:
> > > > > - on EP side, dw_edma_probe() only set up a dedicated channel for notification,
> > > > > - on RC side, do not set up that particular channel via dw_edma_channel_setup(),
> > > > > but do other remaining channels for DMA transfers.
> > > >
> > > > Yes, it may be simple overall. Of course this will waste a channel.
> > >
> > > So, on the EP side I see two possible approaches:
> > >
> > > (a) Hide "dma" [1] as in [RFC PATCH v2 26/27] and call dw_edma_probe() with
> > > hand-crafted settings (chip->ll_rd_cnt = 1, chip->ll_wr_cnt = 0).
> > > (b) Or, teach this special-purpose policy (i.e. configuring only a single
> > > notification channel) to the SoC glue driver's dw_pcie_ep_init_registers(),
> > > for example via Kconfig. I don't think DT is a good place to describe
> > > such a policy.
> > >
> > > There is also another option, which do not necessarily run dw_edma_probe()
> > > by ourselves:
> > >
> > > (c) Leave the default initialization by the SoC glue as-is, and override the
> > > per-channel role via some new dw-edma interface, with the guarantee
> > > that all channels except the notification channel remain unused on its
> > > side afterwards. In this model, the EP side builds the LL locations
> > > for data transfers and the RC configures all channels, but it sets up
> > > the notification channel in a special manner.
> > >
> > > [1] https://github.com/jonmason/ntb/blob/68113d260674/Documentation/devicetree/bindings/pci/snps%2Cdw-pcie-ep.yaml#L83
> > >
> > > >
> > > > >
> > > > > Also, is it generically safe to have dw_edma_probe() executed from both ends on
> > > > > the same eDMA instance, as long as the channels are carefully partitioned
> > > > > between them?
> > > >
> > > > Channel register MMIO space is sperated. Some channel register shared
> > > > into one 32bit register.
> > > >
> > > > But the critical one, interrupt status is w1c. So only write BIT(channel)
> > > > is safe.
> > > >
> > > > Need careful handle irq enable/disable.
> > >
> > > Yeah, I agree it is unavoidable in this model.
> > >
> > > >
> > > > Or you can defer all actual DMA transfer to EP side, you can append
> > > > MSI write at last item of link to notify RC side about DMA done. (actually
> > > > RIE should do the same thing)
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > To avoid that, in my current implementation, the RC side handles the
> > > > > > > status/int_clear registers in the usual way, and the EP side only tries to
> > > > > > > suppress needless edma_int as much as possible.
> > > > > > >
> > > > > > > That said, I'm now wondering if it would be better to set LIE=0/RIE=1 for
> > > > > > > the DMA transfer channels and LIE=1/RIE=0 for the notification channel.
> > > > > > > That would require some changes on dw-edma core.
> > > > > >
> > > > > > If dw-edma work as remote DMA, which should enable RIE. like
> > > > > > dw-edma-pcie.c, but not one actually use it recently.
> > > > > >
> > > > > > Use EDMA as doorbell should be new case and I think it is quite useful.
> > > > > >
> > > > > > > >
> > > > > > > > > + struct ntb_edma_interrupt *v = data;
> > > > > > > > > + u32 mask = BIT(EDMA_RD_CH_NUM);
> > > > > > > > > + u32 i, val;
> > > > > > > > > +
> > > > > > ...
> > > > > > > > > + ret = dw_edma_probe(chip);
> > > > > > > >
> > > > > > > > I think dw_edma_probe() should be in ntb_hw_epf.c, which provide DMA
> > > > > > > > dma engine support.
> > > > > > > >
> > > > > > > > EP side, suppose default dwc controller driver already setup edma engine,
> > > > > > > > so use correct filter function, you should get dma chan.
> > > > > > >
> > > > > > > I intentionally hid edma for EP side in .dts patch in [RFC PATCH v2 26/27]
> > > > > > > so that RC side only manages eDMA remotely and avoids the potential race
> > > > > > > condition I mentioned above.
> > > > > >
> > > > > > Improve eDMA core to suppport some dma channel work at local, some for
> > > > > > remote.
> > > > >
> > > > > Right, Firstly I experimented a bit more with different LIE/RIE settings and
> > > > > ended up with the following observations:
> > > > >
> > > > > * LIE=0/RIE=1 does not seem to work at the hardware level. When I tried this for
> > > > > DMA transfer channels, the RC side never received any interrupt. The databook
> > > > > (5.40a, 8.2.2 "Interrupts and Error Handling") has a hint that says
> > > > > "If you want a remote interrupt and not a local interrupt then: Set LIE and
> > > > > RIE [...]", so I think this behaviour is expected.
> > > >
> > > > Actually, you can append MSI write at last one of DMA descriptor link. So
> > > > it will not depend on eDMA's IRQ at all.
> > >
> > > For RC->EP interrupts on R-Car S4 in EP mode, using ITS_TRANSLATER as the
> > > IB iATU target did not appear to work in practice. Indeed that was the
> > > motivation for the RFC v1 series [2]. I have not tried using ITS_TRANSLATER
> > > as the eDMA read transfer DAR.
> > >
> > > But in any case, simply masking the local interrupt is sufficient here. I
> > > mainly wanted to point out that my naive idea of LIE=0/RIE=1 is not
> > > implementable with this hardware. This whole LIE/RIE topic is a bit
> > > off-track, sorry for the noise.
> > >
> > > [2] For the record, RFC v2 is conceptually orthogonal and introduces a
> > > broader concept ie. remote eDMA model, but I reused many of the
> > > preparatory commits from v1, which is why this is RFC v2 rather than a
> > > separate series.
> > >
> > > >
> > > > > * LIE=1/RIE=0 does work at the hardware level, but is problematic for my current
> > > > > design, where the RC issues the DMA transfer for the notification via
> > > > > ntb_edma_notify_peer(). With RIE=0, the RC never calls
> > > > > dw_edma_core_handle_int() for that channel, which means that internal state
> > > > > such as dw_edma_chan.status is never managed correctly.
> > > >
> > > > If you append on MSI write at DMA link, you needn't check status register,
> > > > just check current LL pos to know which descrptor already done.
> > > >
> > > > Or you also enable LIE and disable related IRQ line(without register
> > > > irq handler), so Local IRQ will be ignore by GIC, you can safe handle at
> > > > RC side.
> > >
> > > What I was worried about here is that, with RIE=0 the current dw-edma
> > > handling of struct dw_edma_chan::status field (not status register) would
> > > not run for that channel, which could affect subsequent tx submissions. But
> > > your suggestion also makes sense, thank you.
> > >
> > > --8<--
> > >
> > > So anyway the key point seems that we should avoid such hard-coded register
> > > handling in [RFC PATCH v2 20/27] and rely only on the standard dw-edma
> > > interfaces (possibly with some extensions to the dw-edma core). From your
> > > feedback, I feel this is the essential direction.
> > >
> > > From that perspective, I'm leaning toward (b) (which I wrote above in a
> > > reply comment) with a Kconfig guard, i.e. in dw_pcie_ep_init_registers(),
> > > if IS_ENABLED(CONFIG_DW_REMOTE_EDMA) we only configure the notification
> > > channel. In practice, a DT-based variant of (b) (for example a new property
> > > such as "dma-notification-channel = <N>;" and making
> > > dw_pcie_ep_init_registers() honour it) would be very handy for users, but I
> > > suspect putting this kind of policy into DT is not acceptable.
> > >
> > > Assuming careful handling, (c) might actually be the simplest approach. I
> > > may need to add a small hook for the notification channel in
> > > dw_edma_done_interrupt(), via a new API such as
> > > dw_edma_chan_register_notify().
> >
> > I reply everything here for overall design
> >
> > EDMA actually can access all memory at both EP and RC side regardless PCI
> > map windows. NTB defination is that only access part of both system memory,
> > so anyway need once memcpy. Although NTB can't take 100% eDMA advantage, it
> > is still easiest path now. I have a draft idea without touch NTB core code
> > (most likley).
> >
> > EP side RC side
> > 1: Control bar
> > 2: Doorbell bar
> > 3: WM1
> >
> > MW1 is fixed sized array [ntb_payload_header + data]. Current NTB built
> > queue in system memory, transfer data (RW) to this array.
> >
> > Use EDMA only one side, RC/EP. use EP as example.
> >
> > In 1 (control bar, resever memory space, which call B)
> >
> > In ntb_hw_epf.c driver, create a simple 'fake' DMA memcpy driver, which
> > just implement device_prep_dma_memcpy(). That just put src\dest\size info
> > to memory space B, then push doorbell.
> >
> > in EP side's a workqueue, fetch info from B, the send to EDMA queue to
> > do actual transfer, after EP DMA finish, mark done at B, then raise msi irq,
> > 'fake' DMA memcpy driver will be triggered.
> >
> > Futher, 3 WM1 is not necessary existed at all, because both side don't
> > access it directly.
> >
> > For example:
> >
> > case RC TX, EP RX
> >
> > RC ntb_async_tx_submit() use device_prep_dma_memcpy() copy user space
> > memory (0xRC_1000 to PCI_1000, size 0x1000), put into share bar0 position
> >
> > 0xRC_1000 -> 0xPCI_1000 0x1000
> >
> > EP side, there RX request ntb_async_rx_submit(), from 0xPCI_1000 to
> > 0xEP_8000 size 0x20000.
> >
> > so setup eDMA transfer form 0xRC_1000 -> 0xEP_8000 size 1000. After complete
> > mark both side done, then trigger related callback functions.
> >
> > You can see 0xPCI_1000 is not used at all. Actually 0xPCI_1000 is trouble
> > maker, RC and EP system PCI space is not necesary the same as CPU space,
> > PCI controller may do address convert.
>
> Thanks for the detailed explanation.
>
> Just to clarify, regarding your comments about the number of memcpy
> operations and not using the 0xPCI_1000 window for data path, I think RFC
> v2 is already similar to what you're describing.
>
> To me it seems the key differences in your proposal are mainly two-fold:
> (1) the layering, and (2) local eDMA use rather than remote.
There is no big difference between remote and local DMA. My main point is
that using just one side is enough. If the eDMA is handled remotely, the
EP side needs the virtual memcpy and the RC side handles the actual
transfer.

I used the EP as the example just because some of the R/W logic is
reversed between EP and RC: the RC's write is the EP's read.
>
> For (1), instead of adding more eDMA-specific handling into ntb_transport
> layer, your approach would keep changes to ntb_transport minimal and
> encapsulate the eDMA usage inside the "fake DMA memcpy driver" as much as
> possible. In that design, would the MW1 layout change? Leaving the existing
> layout as-is would waste the space (so RFC v2 had introduced a new layout).
It is fine if the NTB maintainers agree to it.
>
> Also, one point I'm still unsure about is the opposite direction (ie.
> EP->RC). In that case, do you also expect the EP to trigger its local eDMA
> engine? If yes, then, similar to the RC->EP direction in RFC v2, the EP
> would need to know the RC-side receive buffer address (e.g. 0xRC_1000) in
> advance.
The 'fake DMA memcpy driver' already puts 0xRC_1000 into a shared memory
location.
>
> You also mentioned that you already have some draft. Are you planning to
> post that as a patch series? If not, I can of course try to
> implement/prototype this approach based on your suggestion.
Sorry, I have not actually worked on NTB with eDMA before. My work is
based on the RDMA framework. Ideally, RDMA can do user-space (EP) to
user-space (RC) data transfer with zero copy.

But I think NTB is also a good path, since RDMA is overly complex.
Frank
>
> Please let me know if the above understanding does not match what you had
> in mind.
>
> Thank you,
> Koichiro
>
>
> >
> > Frank
> > >
> > > Thank you for your time and review,
> > > Koichiro
> > >
> > > >
> > > > Frank
> > > > >
> > > > > >
> > > > > > Frank
> > > > > > >
> > > > > > > Thanks for reviewing,
> > > > > > > Koichiro
> > > > > > >
> > > > > > > >
> > > > > > > > Frank
> > > > > > > >
> > > > > > > > > + if (ret) {
> > > > > > > > > + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> > > > > > > > > + return ret;
> > > > > > > > > + }
> > > > > > > > > +
> > > > > > > > > + return 0;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > ...
> > > > > >
> > > > > > > > > +{
> > > > > > > > > + spin_lock_init(&qp->ep_tx_lock);
> > > > > > > > > + spin_lock_init(&qp->ep_rx_lock);
> > > > > > > > > + spin_lock_init(&qp->rc_lock);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static const struct ntb_transport_backend_ops edma_backend_ops = {
> > > > > > > > > + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> > > > > > > > > + .tx_free_entry = ntb_transport_edma_tx_free_entry,
> > > > > > > > > + .tx_enqueue = ntb_transport_edma_tx_enqueue,
> > > > > > > > > + .rx_enqueue = ntb_transport_edma_rx_enqueue,
> > > > > > > > > + .rx_poll = ntb_transport_edma_rx_poll,
> > > > > > > > > + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> > > > > > > > > +};
> > > > > > > > > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > > > > > > > > +
> > > > > > > > > /**
> > > > > > > > > * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> > > > > > > > > * @qp: NTB transport layer queue to be enabled
> > > > > > > > > --
> > > > > > > > > 2.48.1
> > > > > > > > >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-03 10:39 ` Niklas Cassel
2025-12-03 14:36 ` Koichiro Den
2025-12-04 17:10 ` Frank Li
@ 2025-12-05 16:28 ` Frank Li
2 siblings, 0 replies; 97+ messages in thread
From: Frank Li @ 2025-12-05 16:28 UTC (permalink / raw)
To: Niklas Cassel
Cc: Koichiro Den, ntb, linux-pci, dmaengine, linux-kernel, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring, Damien Le Moal
On Wed, Dec 03, 2025 at 11:39:21AM +0100, Niklas Cassel wrote:
> On Wed, Dec 03, 2025 at 05:40:45PM +0900, Koichiro Den wrote:
> > >
> > > If we want to improve the dw-edma driver, so that an EPF driver can have
> > > multiple outstanding transfers, I think the best way forward would be to create
> > > a new _prep_slave_memcpy() or similar, that does take a direction, and thus
> > > does not require dmaengine_slave_config() to be called before every
> > > _prep_slave_memcpy() call.
> >
> > Would dmaengine_prep_slave_single_config(), which Frank tolds us in this
> > thread, be sufficient?
>
> I think that Frank is suggesting a new dmaengine API,
> dmaengine_prep_slave_single_config(), which is like
> dmaengine_prep_slave_single(), but also takes a struct dma_slave_config *
> as a parameter.
>
>
> I really like the idea.
> I think it would allow us to remove the mutex in nvmet_pci_epf_dma_transfer():
> https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L389-L429
I checked the code in drivers/dma/dw-edma/dw-edma-core.c.

Does device_prep_interleaved_dma() work? That path does not use the slave
config.
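
Untested sketch of what I mean, assuming the dw-edma interleaved path
really takes both addresses from the template (prep_one_copy is just a
placeholder helper name), so no dmaengine_slave_config() call is needed
per transfer:

	static struct dma_async_tx_descriptor *
	prep_one_copy(struct dma_chan *chan, dma_addr_t dst, dma_addr_t src,
		      size_t len, enum dma_transfer_direction dir)
	{
		struct dma_async_tx_descriptor *desc;
		struct dma_interleaved_template *xt;

		xt = kzalloc(struct_size(xt, sgl, 1), GFP_NOWAIT);
		if (!xt)
			return NULL;

		xt->src_start = src;
		xt->dst_start = dst;
		xt->dir = dir;		/* DMA_MEM_TO_DEV or DMA_DEV_TO_MEM */
		xt->src_inc = true;
		xt->dst_inc = true;
		xt->numf = 1;		/* one frame ...           */
		xt->frame_size = 1;	/* ... with a single chunk */
		xt->sgl[0].size = len;

		desc = dmaengine_prep_interleaved_dma(chan, xt, DMA_PREP_INTERRUPT);
		/* dw-edma builds its linked list during prep, so xt can go. */
		kfree(xt);
		return desc;
	}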
Frank
>
> Frank you wrote: "Thanks, we also consider ..."
> Does that mean that you have any plans to work on this?
> I would definitely be interested.
>
>
> Kind regards,
> Niklas
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-11-29 16:03 ` [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping Koichiro Den
2025-12-01 20:41 ` Frank Li
2025-12-02 6:32 ` Niklas Cassel
@ 2025-12-08 7:57 ` Niklas Cassel
2025-12-09 8:15 ` Niklas Cassel
2025-12-22 5:10 ` Krishna Chaitanya Chundru
2025-12-12 3:38 ` Manivannan Sadhasivam
3 siblings, 2 replies; 97+ messages in thread
From: Niklas Cassel @ 2025-12-08 7:57 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> for the MSI target address on every interrupt and tears it down again
> via dw_pcie_ep_unmap_addr().
>
> On systems that heavily use the AXI bridge interface (for example when
> the integrated eDMA engine is active), this means the outbound iATU
> registers are updated while traffic is in flight. The DesignWare
> endpoint spec warns that updating iATU registers in this situation is
> not supported, and the behavior is undefined.
>
> Under high MSI and eDMA load this pattern results in occasional bogus
> outbound transactions and IOMMU faults such as:
>
> ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
>
> followed by the system becoming unresponsive. This is the actual output
> observed on Renesas R-Car S4, with its ipmmu_hc used with PCIe ch0.
>
> There is no need to reprogram the iATU region used for MSI on every
> interrupt. The host-provided MSI address is stable while MSI is enabled,
> and the endpoint driver already dedicates a scratch buffer for MSI
> generation.
>
> Cache the aligned MSI address and map size, program the outbound iATU
> once, and keep the window enabled. Subsequent interrupts only perform a
> write to the MSI scratch buffer, avoiding dynamic iATU reprogramming in
> the hot path and fixing the lockups seen under load.
>
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> .../pci/controller/dwc/pcie-designware-ep.c | 48 ++++++++++++++++---
> drivers/pci/controller/dwc/pcie-designware.h | 5 ++
> 2 files changed, 47 insertions(+), 6 deletions(-)
>
I don't like that this patch modifies dw_pcie_ep_raise_msi_irq() but does
not modify dw_pcie_ep_raise_msix_irq().

Both functions call dw_pcie_ep_map_addr() before doing the writel(),
so I think they should be treated the same.

I do however understand that it is a bit wasteful to dedicate one
outbound iATU window for MSI and another for MSI-X, as the PCI spec
does not allow both of them to be enabled at the same time, see:
6.1.4 MSI and MSI-X Operation § in PCIe 6.0 spec:
"A Function is permitted to implement both MSI and MSI-X,
but system software is prohibited from enabling both at the
same time. If system software enables both at the same time,
the behavior is undefined."
I guess the problem is that some EPF drivers, even if only
one capability can be enabled (MSI/MSI-X), call both
pci_epc_set_msi() and pci_epc_set_msix(), e.g.:
https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-test.c#L969-L987
To fill in the number of MSI/MSI-X irqs.
While other EPF drivers only call either pci_epc_set_msi() or
pci_epc_set_msix(), depending on the IRQ type that will actually
be used:
https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L2247-L2262
I think both versions are okay: even though the number of IRQs is filled
in for both MSI and MSI-X, AFAICT only one of them will get enabled.

I guess it might be hard for an EPC driver to know which capability is
currently enabled, as enabling a capability is only a config space write
by the host side.

I guess in most real hardware, e.g. a NIC device, you do an
"enable engine"/"stop engine" type of write to a BAR.
Perhaps we should have similar callbacks in struct pci_epc_ops?

My thinking is that after "start engine", an EPC driver could read the
MSI and MSI-X capabilities to see which one is enabled, as it should not
be allowed to change between MSI and MSI-X without doing a "stop engine"
first.
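
Something like the following (untested, just to illustrate the check
itself; the "start"/"stop engine" hooks would still need to be defined)
could let the EPC driver ask the hardware which capability the host
enabled:

	static bool dw_pcie_ep_host_enabled_msix(struct dw_pcie_ep *ep, u8 func_no)
	{
		struct dw_pcie_ep_func *ep_func;
		u16 flags;

		ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
		if (!ep_func || !ep_func->msix_cap)
			return false;

		/* Read our own MSI-X Message Control via DBI. */
		flags = dw_pcie_ep_readw_dbi(ep, func_no,
					     ep_func->msix_cap + PCI_MSIX_FLAGS);
		return flags & PCI_MSIX_FLAGS_ENABLE;
	}

The MSI variant would be analogous, using PCI_MSI_FLAGS and
PCI_MSI_FLAGS_ENABLE at ep_func->msi_cap.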
Kind regards,
Niklas
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-08 7:57 ` Niklas Cassel
@ 2025-12-09 8:15 ` Niklas Cassel
2025-12-12 3:56 ` Koichiro Den
2025-12-22 5:10 ` Krishna Chaitanya Chundru
1 sibling, 1 reply; 97+ messages in thread
From: Niklas Cassel @ 2025-12-09 8:15 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring
On Mon, Dec 08, 2025 at 08:57:19AM +0100, Niklas Cassel wrote:
> On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
>
> I don't like that this patch modifies dw_pcie_ep_raise_msi_irq() but does
> not modify dw_pcie_ep_raise_msix_irq()
>
> both functions call dw_pcie_ep_map_addr() before doing the writel(),
> so I think they should be treated the same.
Btw, when using nvmet-pci-epf:
drivers/nvme/target/pci-epf.c
With queue depth == 32, I get:
[ 106.585811] arm-smmu-v3 fc900000.iommu: event 0x10 received:
[ 106.586349] arm-smmu-v3 fc900000.iommu: 0x0000010000000010
[ 106.586846] arm-smmu-v3 fc900000.iommu: 0x0000020000000000
[ 106.587341] arm-smmu-v3 fc900000.iommu: 0x000000090000f040
[ 106.587841] arm-smmu-v3 fc900000.iommu: 0x0000000000000000
[ 106.588335] arm-smmu-v3 fc900000.iommu: event: F_TRANSLATION client: 0000:01:00.0 sid: 0x100 ssid: 0x0 iova: 0x90000f040 ipa: 0x0
[ 106.589383] arm-smmu-v3 fc900000.iommu: unpriv data write s1 "Input address caused fault" stag: 0x0
(If I only run with queue depth == 1, I cannot trigger this IOMMU error.)
So, I really think that this patch is important, as it solves a real
problem also for the nvmet-pci-epf driver.
With this patch applied, I cannot trigger the IOMMU error, so I really
think that you should send this as a standalone patch.
I still think that we need to modify dw_pcie_ep_raise_msix_irq() in a similar
way.
Perhaps instead of:
if (!ep->msi_iatu_mapped) {
ret = dw_pcie_ep_map_addr(epc, func_no, 0,
ep->msi_mem_phys, msg_addr,
map_size);
if (ret)
return ret;
ep->msi_iatu_mapped = true;
ep->msi_msg_addr = msg_addr;
ep->msi_map_size = map_size;
} else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
ep->msi_map_size != map_size)) {
/*
* The host changed the MSI target address or the required
* mapping size. Reprogramming the iATU at runtime is unsafe
* on this controller, so bail out instead of trying to update
* the existing region.
*/
return -EINVAL;
}
writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
We could modify both dw_pcie_ep_raise_msix_irq and dw_pcie_ep_raise_msi_irq
to do something like:
if (!ep->msi_iatu_mapped) {
ret = dw_pcie_ep_map_addr(epc, func_no, 0,
ep->msi_mem_phys, msg_addr,
map_size);
if (ret)
return ret;
ep->msi_iatu_mapped = true;
ep->msi_msg_addr = msg_addr;
ep->msi_map_size = map_size;
} else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
ep->msi_map_size != map_size)) {
/*
* Host driver might have changed from MSI to MSI-X
* or the other way around.
*/
dw_pcie_ep_unmap_addr(epc, 0, 0, ep->msi_mem_phys);
ret = dw_pcie_ep_map_addr(epc, func_no, 0,
ep->msi_mem_phys, msg_addr,
map_size);
	if (ret)
		return ret;

	/* Keep the cached target in sync so later IRQs do not remap again. */
	ep->msi_msg_addr = msg_addr;
	ep->msi_map_size = map_size;
}
writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
I think that should work without needing to introduce any
.start_engine() / .stop_engine() APIs.
Kind regards,
Niklas
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-11-29 16:03 ` [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping Koichiro Den
` (2 preceding siblings ...)
2025-12-08 7:57 ` Niklas Cassel
@ 2025-12-12 3:38 ` Manivannan Sadhasivam
2025-12-18 8:28 ` Koichiro Den
3 siblings, 1 reply; 97+ messages in thread
From: Manivannan Sadhasivam @ 2025-12-12 3:38 UTC (permalink / raw)
To: Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> for the MSI target address on every interrupt and tears it down again
> via dw_pcie_ep_unmap_addr().
>
> On systems that heavily use the AXI bridge interface (for example when
> the integrated eDMA engine is active), this means the outbound iATU
> registers are updated while traffic is in flight. The DesignWare
> endpoint spec warns that updating iATU registers in this situation is
> not supported, and the behavior is undefined.
>
When claiming a spec violation, you should quote the spec reference, i.e.
the spec version, section, and the actual wording snippet.
> Under high MSI and eDMA load this pattern results in occasional bogus
> outbound transactions and IOMMU faults such as:
>
> ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
>
> followed by the system becoming unresponsive. This is the actual output
> observed on Renesas R-Car S4, with its ipmmu_hc used with PCIe ch0.
>
> There is no need to reprogram the iATU region used for MSI on every
> interrupt. The host-provided MSI address is stable while MSI is enabled,
> and the endpoint driver already dedicates a scratch buffer for MSI
> generation.
>
> Cache the aligned MSI address and map size, program the outbound iATU
> once, and keep the window enabled. Subsequent interrupts only perform a
> write to the MSI scratch buffer, avoiding dynamic iATU reprogramming in
> the hot path and fixing the lockups seen under load.
>
iATU windows are very limited (just 8 in some cases), so I don't like allocating
fixed windows for MSIs.
> Signed-off-by: Koichiro Den <den@valinux.co.jp>
> ---
> .../pci/controller/dwc/pcie-designware-ep.c | 48 ++++++++++++++++---
> drivers/pci/controller/dwc/pcie-designware.h | 5 ++
> 2 files changed, 47 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> index 3780a9bd6f79..ef8ded34d9ab 100644
> --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> @@ -778,6 +778,16 @@ static void dw_pcie_ep_stop(struct pci_epc *epc)
> struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>
> + /*
> + * Tear down the dedicated outbound window used for MSI
> + * generation. This avoids leaking an iATU window across
> + * endpoint stop/start cycles.
> + */
> + if (ep->msi_iatu_mapped) {
> + dw_pcie_ep_unmap_addr(epc, 0, 0, ep->msi_mem_phys);
> + ep->msi_iatu_mapped = false;
> + }
> +
> dw_pcie_stop_link(pci);
> }
>
> @@ -881,14 +891,37 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 func_no,
> msg_addr = ((u64)msg_addr_upper) << 32 | msg_addr_lower;
>
> msg_addr = dw_pcie_ep_align_addr(epc, msg_addr, &map_size, &offset);
> - ret = dw_pcie_ep_map_addr(epc, func_no, 0, ep->msi_mem_phys, msg_addr,
> - map_size);
> - if (ret)
> - return ret;
>
> - writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
> + /*
> + * Program the outbound iATU once and keep it enabled.
> + *
> + * The spec warns that updating iATU registers while there are
> + * operations in flight on the AXI bridge interface is not
> + * supported, so we avoid reprogramming the region on every MSI,
> + * specifically unmapping immediately after writel().
> + */
> + if (!ep->msi_iatu_mapped) {
> + ret = dw_pcie_ep_map_addr(epc, func_no, 0,
> + ep->msi_mem_phys, msg_addr,
> + map_size);
> + if (ret)
> + return ret;
>
> - dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);
> + ep->msi_iatu_mapped = true;
> + ep->msi_msg_addr = msg_addr;
> + ep->msi_map_size = map_size;
> + } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
> + ep->msi_map_size != map_size)) {
> + /*
> + * The host changed the MSI target address or the required
> + * mapping size. Reprogramming the iATU at runtime is unsafe
> + * on this controller, so bail out instead of trying to update
> + * the existing region.
> + */
I'd prefer having some sort of locking to program the iATU registers at
runtime instead of bailing out.
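
Something along these lines (untested sketch; 'iatu_lock' would be a new,
hypothetical lock in struct dw_pcie_ep, and it only serializes CPU-side
reprogramming, the in-flight-traffic concern still needs its own answer):

	spin_lock_irqsave(&ep->iatu_lock, irqflags);
	if (!ep->msi_iatu_mapped || ep->msi_msg_addr != msg_addr ||
	    ep->msi_map_size != map_size) {
		if (ep->msi_iatu_mapped)
			dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);
		ret = dw_pcie_ep_map_addr(epc, func_no, 0, ep->msi_mem_phys,
					  msg_addr, map_size);
		if (ret) {
			spin_unlock_irqrestore(&ep->iatu_lock, irqflags);
			return ret;
		}
		ep->msi_iatu_mapped = true;
		ep->msi_msg_addr = msg_addr;
		ep->msi_map_size = map_size;
	}
	spin_unlock_irqrestore(&ep->iatu_lock, irqflags);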
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-09 8:15 ` Niklas Cassel
@ 2025-12-12 3:56 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-12 3:56 UTC (permalink / raw)
To: Niklas Cassel
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring
On Tue, Dec 09, 2025 at 09:15:57AM +0100, Niklas Cassel wrote:
> On Mon, Dec 08, 2025 at 08:57:19AM +0100, Niklas Cassel wrote:
> > On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> >
> > I don't like that this patch modifies dw_pcie_ep_raise_msi_irq() but does
> > not modify dw_pcie_ep_raise_msix_irq()
> >
> > both functions call dw_pcie_ep_map_addr() before doing the writel(),
> > so I think they should be treated the same.
>
> Btw, when using nvmet-pci-epf:
> drivers/nvme/target/pci-epf.c
>
> With queue depth == 32, I get:
> [ 106.585811] arm-smmu-v3 fc900000.iommu: event 0x10 received:
> [ 106.586349] arm-smmu-v3 fc900000.iommu: 0x0000010000000010
> [ 106.586846] arm-smmu-v3 fc900000.iommu: 0x0000020000000000
> [ 106.587341] arm-smmu-v3 fc900000.iommu: 0x000000090000f040
> [ 106.587841] arm-smmu-v3 fc900000.iommu: 0x0000000000000000
> [ 106.588335] arm-smmu-v3 fc900000.iommu: event: F_TRANSLATION client: 0000:01:00.0 sid: 0x100 ssid: 0x0 iova: 0x90000f040 ipa: 0x0
> [ 106.589383] arm-smmu-v3 fc900000.iommu: unpriv data write s1 "Input address caused fault" stag: 0x0
>
> (If I only run with queue depth == 1, I cannot trigger this IOMMU error.)
>
>
> So, I really think that this patch is important, as it solves a real
> problem also for the nvmet-pci-epf driver.
>
>
> With this patch applied, I cannot trigger the IOMMU error,
> so I really think that you should send this a a stand alone patch.
>
>
> I still think that we need to modify dw_pcie_ep_raise_msix_irq() in a similar
> way.
Sorry about my late response, and thank you for handling this:
https://lore.kernel.org/all/20251210071358.2267494-2-cassel@kernel.org/
Koichiro
>
>
> Perhaps instead of:
>
>
> if (!ep->msi_iatu_mapped) {
> ret = dw_pcie_ep_map_addr(epc, func_no, 0,
> ep->msi_mem_phys, msg_addr,
> map_size);
> if (ret)
> return ret;
>
> ep->msi_iatu_mapped = true;
> ep->msi_msg_addr = msg_addr;
> ep->msi_map_size = map_size;
> } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
> ep->msi_map_size != map_size)) {
> /*
> * The host changed the MSI target address or the required
> * mapping size. Reprogramming the iATU at runtime is unsafe
> * on this controller, so bail out instead of trying to update
> * the existing region.
> */
> return -EINVAL;
> }
>
> writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
>
>
>
> We could modify both dw_pcie_ep_raise_msix_irq and dw_pcie_ep_raise_msi_irq
> to do something like:
>
>
>
> if (!ep->msi_iatu_mapped) {
> ret = dw_pcie_ep_map_addr(epc, func_no, 0,
> ep->msi_mem_phys, msg_addr,
> map_size);
> if (ret)
> return ret;
>
> ep->msi_iatu_mapped = true;
> ep->msi_msg_addr = msg_addr;
> ep->msi_map_size = map_size;
> } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
> ep->msi_map_size != map_size)) {
> /*
> * Host driver might have changed from MSI to MSI-X
> * or the other way around.
> */
> dw_pcie_ep_unmap_addr(epc, 0, 0, ep->msi_mem_phys);
> ret = dw_pcie_ep_map_addr(epc, func_no, 0,
> ep->msi_mem_phys, msg_addr,
> map_size);
> if (ret)
> return ret;
> }
>
> writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
>
>
>
> I think that should work without needing to introuce any
> .start_engine() / .stop_engine() APIs.
>
>
>
> Kind regards,
> Niklas
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode
2025-12-05 15:06 ` Frank Li
@ 2025-12-18 4:34 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-18 4:34 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Fri, Dec 05, 2025 at 10:06:30AM -0500, Frank Li wrote:
> On Fri, Dec 05, 2025 at 12:04:24PM +0900, Koichiro Den wrote:
> > On Thu, Dec 04, 2025 at 03:16:25PM -0500, Frank Li wrote:
> > > On Fri, Dec 05, 2025 at 12:42:03AM +0900, Koichiro Den wrote:
> > > > On Wed, Dec 03, 2025 at 11:14:43AM -0500, Frank Li wrote:
> > > > > On Wed, Dec 03, 2025 at 05:53:03PM +0900, Koichiro Den wrote:
> > > > > > On Tue, Dec 02, 2025 at 10:42:29AM -0500, Frank Li wrote:
> > > > > > > On Tue, Dec 02, 2025 at 03:43:10PM +0900, Koichiro Den wrote:
> > > > > > > > On Mon, Dec 01, 2025 at 04:41:05PM -0500, Frank Li wrote:
> > > > > > > > > On Sun, Nov 30, 2025 at 01:03:58AM +0900, Koichiro Den wrote:
> > > > > > > > > > Add a new transport backend that uses a remote DesignWare eDMA engine
> > > > > > > > > > located on the NTB endpoint to move data between host and endpoint.
> > > > > > > > > >
> > > > > > > ...
> > > > > > > > > > +#include "ntb_edma.h"
> > > > > > > > > > +
> > > > > > > > > > +/*
> > > > > > > > > > + * The interrupt register offsets below are taken from the DesignWare
> > > > > > > > > > + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> > > > > > > > > > + * backend currently only supports this layout.
> > > > > > > > > > + */
> > > > > > > > > > +#define DMA_WRITE_INT_STATUS_OFF 0x4c
> > > > > > > > > > +#define DMA_WRITE_INT_MASK_OFF 0x54
> > > > > > > > > > +#define DMA_WRITE_INT_CLEAR_OFF 0x58
> > > > > > > > > > +#define DMA_READ_INT_STATUS_OFF 0xa0
> > > > > > > > > > +#define DMA_READ_INT_MASK_OFF 0xa8
> > > > > > > > > > +#define DMA_READ_INT_CLEAR_OFF 0xac
> > > > > > > > >
> > > > > > > > > Not sure why you need to access eDMA registers, because the eDMA driver is
> > > > > > > > > already exported as a dmaengine driver.
> > > > > > > >
> > > > > > > > These are intended for EP use. In my current design I intentionally don't
> > > > > > > > use the standard dw-edma dmaengine driver on the EP side.
> > > > > > >
> > > > > > > why not?
> > > > > >
> > > > > > Conceptually I agree that using the standard dw-edma driver on both sides
> > > > > > would be attractive for future extensibility and maintainability. However,
> > > > > > there are a couple of concerns for me, some of which might be alleviated by
> > > > > > your suggestion below, and some which are more generic safety concerns that
> > > > > > I tried to outline in my replies to your other comments.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > +
> > > > > > > > > > +#define NTB_EDMA_NOTIFY_MAX_QP 64
> > > > > > > > > > +
> > > > > > > ...
> > > > > > > > > > +
> > > > > > > > > > + virq = irq_create_fwspec_mapping(&fwspec);
> > > > > > > > > > + of_node_put(parent);
> > > > > > > > > > + return (virq > 0) ? virq : -EINVAL;
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +static irqreturn_t ntb_edma_isr(int irq, void *data)
> > > > > > > > > > +{
> > > > > > > > >
> > > > > > > > > Not sure why dw_edma_interrupt_write/read() would not work for your case.
> > > > > > > > > Suppose you just register a callback with the dmaengine.
> > > > > > > >
> > > > > > > > If we ran dw_edma_probe() on both the EP and RC sides and let the dmaengine
> > > > > > > > callbacks handle int_status/int_clear, I think we could hit races. One side
> > > > > > > > might clear a status bit before the other side has a chance to see it and
> > > > > > > > invoke its callback. Please correct me if I'm missing something here.
> > > > > > >
> > > > > > > You should use a different channel?
> > > > > >
> > > > > > Do you mean something like this:
> > > > > > - on EP side, dw_edma_probe() only set up a dedicated channel for notification,
> > > > > > - on RC side, do not set up that particular channel via dw_edma_channel_setup(),
> > > > > > but do other remaining channels for DMA transfers.
> > > > >
> > > > > Yes, it may be simple overall. Of course this will waste a channel.
> > > >
> > > > So, on the EP side I see two possible approaches:
> > > >
> > > > (a) Hide "dma" [1] as in [RFC PATCH v2 26/27] and call dw_edma_probe() with
> > > > hand-crafted settings (chip->ll_rd_cnt = 1, chip->ll_wr_cnt = 0).
> > > > (b) Or, teach this special-purpose policy (i.e. configuring only a single
> > > > notification channel) to the SoC glue driver's dw_pcie_ep_init_registers(),
> > > > for example via Kconfig. I don't think DT is a good place to describe
> > > > such a policy.
> > > >
> > > > There is also another option, which does not necessarily require running
> > > > dw_edma_probe() ourselves:
> > > >
> > > > (c) Leave the default initialization by the SoC glue as-is, and override the
> > > > per-channel role via some new dw-edma interface, with the guarantee
> > > > that all channels except the notification channel remain unused on its
> > > > side afterwards. In this model, the EP side builds the LL locations
> > > > for data transfers and the RC configures all channels, but it sets up
> > > > the notification channel in a special manner.
> > > >
> > > > [1] https://github.com/jonmason/ntb/blob/68113d260674/Documentation/devicetree/bindings/pci/snps%2Cdw-pcie-ep.yaml#L83
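(For reference, option (a) would look roughly like the sketch below. This is only a
minimal illustration: the function name, the way reg_base and the notification LL
region are obtained, and the platform-device assumption in the irq_vector callback
are all placeholders, not existing code.)

#include <linux/dma/edma.h>
#include <linux/platform_device.h>

static int ntb_edma_irq_vector(struct device *dev, unsigned int nr)
{
	/* Assumes the EP glue is a platform device providing the eDMA IRQ(s). */
	return platform_get_irq(to_platform_device(dev), nr);
}

static const struct dw_edma_plat_ops ntb_edma_plat_ops = {
	.irq_vector = ntb_edma_irq_vector,
};

static int ntb_edma_probe_notify_only(struct device *dev,
				      void __iomem *edma_reg_base,
				      const struct dw_edma_region *ll_rd)
{
	struct dw_edma_chip *chip;

	chip = devm_kzalloc(dev, sizeof(*chip), GFP_KERNEL);
	if (!chip)
		return -ENOMEM;

	chip->dev = dev;
	chip->ops = &ntb_edma_plat_ops;
	chip->reg_base = edma_reg_base;
	chip->mf = EDMA_MF_EDMA_UNROLL;		/* unrolled map, as in this series */
	chip->nr_irqs = 1;
	chip->ll_wr_cnt = 0;			/* no locally driven write channels */
	chip->ll_rd_cnt = 1;			/* single read channel for notification */
	chip->ll_region_rd[0] = *ll_rd;
	chip->flags = DW_EDMA_CHIP_LOCAL;	/* this side handles its own interrupt */

	return dw_edma_probe(chip);
}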
> > > >
> > > > >
> > > > > >
> > > > > > Also, is it generically safe to have dw_edma_probe() executed from both ends on
> > > > > > the same eDMA instance, as long as the channels are carefully partitioned
> > > > > > between them?
> > > > >
> > > > > Channel register MMIO space is separated. Some channel registers are shared
> > > > > in one 32-bit register.
> > > > >
> > > > > But the critical one, the interrupt status, is w1c. So only writing BIT(channel)
> > > > > is safe.
> > > > >
> > > > > Need to handle irq enable/disable carefully.
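(To illustrate the w1c point with the offsets defined in the patch above: each side
only ever writes the bit of the channel it owns when clearing. edma_base and ch are
placeholders here.)

	/* EP side: clear only its notification (read) channel. */
	writel(BIT(EDMA_RD_CH_NUM), edma_base + DMA_READ_INT_CLEAR_OFF);

	/* RC side: clear only the data-transfer (write) channel it drives. */
	writel(BIT(ch), edma_base + DMA_WRITE_INT_CLEAR_OFF);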
> > > >
> > > > Yeah, I agree it is unavoidable in this model.
> > > >
> > > > >
> > > > > Or you can defer all actual DMA transfer to EP side, you can append
> > > > > MSI write at last item of link to notify RC side about DMA done. (actually
> > > > > RIE should do the same thing)
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > To avoid that, in my current implementation, the RC side handles the
> > > > > > > > status/int_clear registers in the usual way, and the EP side only tries to
> > > > > > > > suppress needless edma_int as much as possible.
> > > > > > > >
> > > > > > > > That said, I'm now wondering if it would be better to set LIE=0/RIE=1 for
> > > > > > > > the DMA transfer channels and LIE=1/RIE=0 for the notification channel.
> > > > > > > > That would require some changes on dw-edma core.
> > > > > > >
> > > > > > > If dw-edma works as remote DMA, it should enable RIE, like
> > > > > > > dw-edma-pcie.c, but no one actually uses it recently.
> > > > > > >
> > > > > > > Using eDMA as a doorbell would be a new case and I think it is quite useful.
> > > > > > >
> > > > > > > > >
> > > > > > > > > > + struct ntb_edma_interrupt *v = data;
> > > > > > > > > > + u32 mask = BIT(EDMA_RD_CH_NUM);
> > > > > > > > > > + u32 i, val;
> > > > > > > > > > +
> > > > > > > ...
> > > > > > > > > > + ret = dw_edma_probe(chip);
> > > > > > > > >
> > > > > > > > > I think dw_edma_probe() should be in ntb_hw_epf.c, which provides
> > > > > > > > > dmaengine support.
> > > > > > > > >
> > > > > > > > > On the EP side, suppose the default dwc controller driver has already set up
> > > > > > > > > the eDMA engine; with the correct filter function, you should get a dma chan.
> > > > > > > >
> > > > > > > > I intentionally hid edma for EP side in .dts patch in [RFC PATCH v2 26/27]
> > > > > > > > so that RC side only manages eDMA remotely and avoids the potential race
> > > > > > > > condition I mentioned above.
> > > > > > >
> > > > > > > Improve the eDMA core to support some dma channels working locally and
> > > > > > > some remotely.
> > > > > >
> > > > > > Right, Firstly I experimented a bit more with different LIE/RIE settings and
> > > > > > ended up with the following observations:
> > > > > >
> > > > > > * LIE=0/RIE=1 does not seem to work at the hardware level. When I tried this for
> > > > > > DMA transfer channels, the RC side never received any interrupt. The databook
> > > > > > (5.40a, 8.2.2 "Interrupts and Error Handling") has a hint that says
> > > > > > "If you want a remote interrupt and not a local interrupt then: Set LIE and
> > > > > > RIE [...]", so I think this behaviour is expected.
> > > > >
> > > > > Actually, you can append an MSI write as the last item of the DMA descriptor
> > > > > link. Then it will not depend on the eDMA's IRQ at all.
> > > >
> > > > For RC->EP interrupts on R-Car S4 in EP mode, using ITS_TRANSLATER as the
> > > > IB iATU target did not appear to work in practice. Indeed that was the
> > > > motivation for the RFC v1 series [2]. I have not tried using ITS_TRANSLATER
> > > > as the eDMA read transfer DAR.
> > > >
> > > > But in any case, simply masking the local interrupt is sufficient here. I
> > > > mainly wanted to point out that my naive idea of LIE=0/RIE=1 is not
> > > > implementable with this hardware. This whole LIE/RIE topic is a bit
> > > > off-track, sorry for the noise.
> > > >
> > > > [2] For the record, RFC v2 is conceptually orthogonal and introduces a
> > > > broader concept, i.e. the remote eDMA model, but I reused many of the
> > > > preparatory commits from v1, which is why this is RFC v2 rather than a
> > > > separate series.
> > > >
> > > > >
> > > > > > * LIE=1/RIE=0 does work at the hardware level, but is problematic for my current
> > > > > > design, where the RC issues the DMA transfer for the notification via
> > > > > > ntb_edma_notify_peer(). With RIE=0, the RC never calls
> > > > > > dw_edma_core_handle_int() for that channel, which means that internal state
> > > > > > such as dw_edma_chan.status is never managed correctly.
> > > > >
> > > > > If you append an MSI write to the DMA link, you needn't check the status
> > > > > register; just check the current LL position to know which descriptors are
> > > > > already done.
> > > > >
> > > > > Or you can also enable LIE and disable the related IRQ line (without registering
> > > > > an irq handler), so the local IRQ will be ignored by the GIC and you can safely
> > > > > handle it on the RC side.
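(A conceptual sketch of the "MSI write as the last linked-list element" idea, using
the field and flag names from drivers/dma/dw-edma/dw-edma-v0-regs.h and
dw-edma-v0-core.c; the CB/LLP element handling is omitted, and msi_data_dma,
host_msi_msg_addr, ll and nr_data_elems are placeholders.)

	struct dw_edma_v0_lli *last = &ll[nr_data_elems];

	last->control       = DW_EDMA_V0_CB;	/* plain data element, no IRQ bits */
	last->transfer_size = sizeof(u32);	/* 4-byte MSI message write */
	last->sar.reg       = msi_data_dma;	/* local buffer holding msg_data */
	last->dar.reg       = host_msi_msg_addr; /* host's MSI target address */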
> > > >
> > > > What I was worried about here is that, with RIE=0 the current dw-edma
> > > > handling of struct dw_edma_chan::status field (not status register) would
> > > > not run for that channel, which could affect subsequent tx submissions. But
> > > > your suggestion also makes sense, thank you.
> > > >
> > > > --8<--
> > > >
> > > > So anyway the key point seems that we should avoid such hard-coded register
> > > > handling in [RFC PATCH v2 20/27] and rely only on the standard dw-edma
> > > > interfaces (possibly with some extensions to the dw-edma core). From your
> > > > feedback, I feel this is the essential direction.
> > > >
> > > > From that perspective, I'm leaning toward (b) (which I wrote above in a
> > > > reply comment) with a Kconfig guard, i.e. in dw_pcie_ep_init_registers(),
> > > > if IS_ENABLED(CONFIG_DW_REMOTE_EDMA) we only configure the notification
> > > > channel. In practice, a DT-based variant of (b) (for example a new property
> > > > such as "dma-notification-channel = <N>;" and making
> > > > dw_pcie_ep_init_registers() honour it) would be very handy for users, but I
> > > > suspect putting this kind of policy into DT is not acceptable.
> > > >
> > > > Assuming careful handling, (c) might actually be the simplest approach. I
> > > > may need to add a small hook for the notification channel in
> > > > dw_edma_done_interrupt(), via a new API such as
> > > > dw_edma_chan_register_notify().
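(To make option (c) concrete, the hook could have roughly the following shape.
dw_edma_chan_register_notify() does not exist today; this only sketches the proposal.)

	/* Hypothetical API, not part of the current dw-edma core. */
	typedef void (*dw_edma_notify_fn)(void *data);

	int dw_edma_chan_register_notify(struct dma_chan *dchan,
					 dw_edma_notify_fn fn, void *data);

	/*
	 * dw_edma_done_interrupt() would then, for a channel that has a notify
	 * callback registered, call fn(data) instead of the usual descriptor
	 * completion path, since the notification channel has no locally
	 * submitted descriptors on this side.
	 */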
> > >
> > > I reply everything here for overall design
> > >
> > > eDMA can actually access all memory on both the EP and RC sides regardless of the
> > > PCI map windows. The NTB definition is that only part of each system's memory is
> > > accessible, so one memcpy is needed anyway. Although NTB can't take 100% advantage
> > > of eDMA, it is still the easiest path now. I have a draft idea that (most likely)
> > > does not touch the NTB core code.
> > >
> > > EP side RC side
> > > 1: Control bar
> > > 2: Doorbell bar
> > > 3: MW1
> > >
> > > MW1 is a fixed-size array of [ntb_payload_header + data]. The current NTB builds
> > > the queue in system memory and transfers the data (RW) to this array.
> > >
> > > Use eDMA on only one side, RC or EP; use EP as an example.
> > >
> > > In 1 (the control bar), reserve a memory space, which we call B.
> > >
> > > In the ntb_hw_epf.c driver, create a simple 'fake' DMA memcpy driver, which
> > > just implements device_prep_dma_memcpy(). It just puts the src/dest/size info
> > > into memory space B, then pushes the doorbell.
> > >
> > > On the EP side, a workqueue fetches the info from B and then sends it to the eDMA
> > > queue to do the actual transfer. After the EP DMA finishes, it marks it done in B,
> > > then raises an MSI irq, and the 'fake' DMA memcpy driver's callback is triggered.
> > >
> > > Further, 3 (MW1) does not necessarily need to exist at all, because neither side
> > > accesses it directly.
> > >
> > > For example:
> > >
> > > case RC TX, EP RX
> > >
> > > RC ntb_async_tx_submit() uses device_prep_dma_memcpy() to copy user space
> > > memory (0xRC_1000 to 0xPCI_1000, size 0x1000), and puts this into the shared
> > > bar0 position:
> > >
> > > 0xRC_1000 -> 0xPCI_1000 0x1000
> > >
> > > On the EP side, there is an RX request ntb_async_rx_submit(), from 0xPCI_1000 to
> > > 0xEP_8000, size 0x20000.
> > >
> > > So set up an eDMA transfer from 0xRC_1000 -> 0xEP_8000, size 0x1000. After it
> > > completes, mark both sides done, then trigger the related callback functions.
> > >
> > > You can see 0xPCI_1000 is not used at all. Actually, 0xPCI_1000 is a
> > > troublemaker: the RC and EP system PCI space is not necessarily the same as the
> > > CPU space, and the PCI controller may do address conversion.
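(A very rough sketch of what the 'fake' memcpy dma_device could look like on the
side that does not own the eDMA. All names here, ntb_memcpy_slot, ntb_memcpy_chan
and the doorbell bit, are made up for illustration, and registration, locking,
cookie completion and teardown are elided.)

#include <linux/dmaengine.h>
#include <linux/ntb.h>
#include <linux/io.h>
#include <linux/slab.h>

/* One request slot in the shared control BAR ("memory space B"). */
struct ntb_memcpy_slot {
	__le64 src;
	__le64 dst;
	__le32 len;
	__le32 valid;
};

struct ntb_memcpy_chan {
	struct dma_chan chan;
	struct ntb_memcpy_slot __iomem *slot;	/* mapped control-BAR slot */
	struct ntb_dev *ntb;
};

struct ntb_memcpy_desc {
	struct dma_async_tx_descriptor txd;
	dma_addr_t src, dst;
	size_t len;
};

static dma_cookie_t ntb_memcpy_tx_submit(struct dma_async_tx_descriptor *txd)
{
	struct ntb_memcpy_desc *d = container_of(txd, struct ntb_memcpy_desc, txd);
	struct ntb_memcpy_chan *nc = container_of(txd->chan,
						  struct ntb_memcpy_chan, chan);

	/* Publish src/dst/len to the peer-visible slot, then ring the doorbell. */
	writeq(d->src, &nc->slot->src);
	writeq(d->dst, &nc->slot->dst);
	writel(d->len, &nc->slot->len);
	writel(1, &nc->slot->valid);
	ntb_peer_db_set(nc->ntb, BIT_ULL(0));

	return dma_cookie_assign(txd);	/* helper from drivers/dma/dmaengine.h */
}

static struct dma_async_tx_descriptor *
ntb_memcpy_prep_dma_memcpy(struct dma_chan *chan, dma_addr_t dst,
			   dma_addr_t src, size_t len, unsigned long flags)
{
	struct ntb_memcpy_desc *d = kzalloc(sizeof(*d), GFP_NOWAIT);

	if (!d)
		return NULL;

	d->src = src;
	d->dst = dst;
	d->len = len;
	dma_async_tx_descriptor_init(&d->txd, chan);
	d->txd.flags = flags;
	d->txd.tx_submit = ntb_memcpy_tx_submit;
	return &d->txd;
}

The side that owns the eDMA would then read the slot from its doorbell handler or a
workqueue, hand the real transfer to the eDMA via dmaengine, mark the slot done and
raise an MSI so the submitting side can run the descriptor callback.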
> >
> > Thanks for the detailed explanation.
> >
> > Just to clarify, regarding your comments about the number of memcpy
> > operations and not using the 0xPCI_1000 window for data path, I think RFC
> > v2 is already similar to what you're describing.
> >
> > To me it seems the key differences in your proposal are mainly two-fold:
> > (1) the layering, and (2) local eDMA use rather than remote.
>
> There is not a big difference between remote and local DMA. My main point is that
> using just one side is enough. If the eDMA is handled remotely, the EP side needs
> the virtual memcpy and the RC side handles the actual transfer.
>
> I use EP as the example just because some of the R/W logic is reversed between EP
> and RC: RC's write is EP's read.
>
> >
> > For (1), instead of adding more eDMA-specific handling into ntb_transport
> > layer, your approach would keep changes to ntb_transport minimal and
> > encapsulate the eDMA usage inside the "fake DMA memcpy driver" as much as
> > possible. In that design, would the MW1 layout change? Leaving the existing
> > layout as-is would waste the space (so RFC v2 had introduced a new layout).
>
> It is fine if the NTB maintainers agree with it.
>
> >
> > Also, one point I'm still unsure about is the opposite direction (ie.
> > EP->RC). In that case, do you also expect the EP to trigger its local eDMA
> > engine? If yes, then, similar to the RC->EP direction in RFC v2, the EP
> > would need to know the RC-side receive buffer address (e.g. 0xRC_1000) in
> > advance.
>
> The 'fake DMA memcpy driver' already puts 0xRC_1000 into one shared memory place.
>
> >
> > You also mentioned that you already have some draft. Are you planning to
> > post that as a patch series? If not, I can of course try to
> > implement/prototype this approach based on your suggestion.
>
> Sorry, I have not actually worked on NTB eDMA before. My work is based on the RDMA
> framework. Ideally, RDMA can do user-space (EP) to user-space (RC) data transfer
> with zero copy.
>
> But I think NTB is also a good path since RDMA is overly complex.
>
> Frank
Hi Frank,
Thank you for the review and the discussion. Apologies for the delayed
response here in this thread; I needed some time to think through the
redesign.
After further consideration, I sent RFC v3 [1] with the following design:
* all the read channels (including a channel for notification) are driven
by host (RC)
* all the write channels are driven by endpoint (EP)
This way we can avoid both ends touching and updating per-direction
registers concurrently at runtime [2]. Also the data plane behaviour
becomes symmetric in both directions, resulting in a simpler data path in
the NTB transport layer compared to RFC v2. As you commented earlier, RFC
v3 no longer relies on the duplicate hard-coded register offsets, and leaves
dma_device/dma_chan initialization to the standard path. RFC v3 also no longer
hides the eDMA instance on the endpoint side, as I did in [RFC PATCH v2 26/27]
[3].
But still I didn't implement the fake DMA memcpy driver idea in RFC v3.
Instead, I chose an MW1 layout optimized for the eDMA-backed transport, since it
reduces MW1 usage and makes it possible to scale to multiple queue pairs
with deeper ring buffers, which helps fully exploit the potential of the
eDMA-backed transport.
[1] https://lore.kernel.org/all/20251217151609.3162665-1-den@valinux.co.jp/
[2] as a somewhat relevant topic, I've found an existing issue that becomes
observable under heavy load across multiple channels.
https://lore.kernel.org/all/20251217151609.3162665-23-den@valinux.co.jp/
[3] https://lore.kernel.org/all/20251129160405.2568284-27-den@valinux.co.jp/
Thank you again for your time and for the review,
Koichiro
>
> >
> > Please let me know if the above understanding does not match what you had
> > in mind.
> >
> > Thank you,
> > Koichiro
> >
> >
> > >
> > > Frank
> > > >
> > > > Thank you for your time and review,
> > > > Koichiro
> > > >
> > > > >
> > > > > Frank
> > > > > >
> > > > > > >
> > > > > > > Frank
> > > > > > > >
> > > > > > > > Thanks for reviewing,
> > > > > > > > Koichiro
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Frank
> > > > > > > > >
> > > > > > > > > > + if (ret) {
> > > > > > > > > > + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> > > > > > > > > > + return ret;
> > > > > > > > > > + }
> > > > > > > > > > +
> > > > > > > > > > + return 0;
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > ...
> > > > > > >
> > > > > > > > > > +{
> > > > > > > > > > + spin_lock_init(&qp->ep_tx_lock);
> > > > > > > > > > + spin_lock_init(&qp->ep_rx_lock);
> > > > > > > > > > + spin_lock_init(&qp->rc_lock);
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +static const struct ntb_transport_backend_ops edma_backend_ops = {
> > > > > > > > > > + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> > > > > > > > > > + .tx_free_entry = ntb_transport_edma_tx_free_entry,
> > > > > > > > > > + .tx_enqueue = ntb_transport_edma_tx_enqueue,
> > > > > > > > > > + .rx_enqueue = ntb_transport_edma_rx_enqueue,
> > > > > > > > > > + .rx_poll = ntb_transport_edma_rx_poll,
> > > > > > > > > > + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> > > > > > > > > > +};
> > > > > > > > > > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > > > > > > > > > +
> > > > > > > > > > /**
> > > > > > > > > > * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> > > > > > > > > > * @qp: NTB transport layer queue to be enabled
> > > > > > > > > > --
> > > > > > > > > > 2.48.1
> > > > > > > > > >
* Re: [RFC PATCH v2 12/27] dmaengine: dw-edma: Fix MSI data values for multi-vector IMWr interrupts
2025-12-02 6:32 ` Koichiro Den
@ 2025-12-18 6:52 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-18 6:52 UTC (permalink / raw)
To: Frank Li
Cc: ntb, linux-pci, dmaengine, linux-kernel, mani, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Tue, Dec 02, 2025 at 03:32:49PM +0900, Koichiro Den wrote:
> On Mon, Dec 01, 2025 at 02:46:43PM -0500, Frank Li wrote:
> > On Sun, Nov 30, 2025 at 01:03:50AM +0900, Koichiro Den wrote:
> > > When multiple MSI vectors are allocated for the DesignWare eDMA, the
> > > driver currently records the same MSI message for all IRQs by calling
> > > get_cached_msi_msg() per vector. For multi-vector MSI (as opposed to
> > > MSI-X), the cached message corresponds to vector 0 and msg.data is
> > > supposed to be adjusted by the IRQ index.
> > >
> > > As a result, all eDMA interrupts share the same MSI data value and the
> > > interrupt controller cannot distinguish between them.
> > >
> > > Introduce dw_edma_compose_msi() to construct the correct MSI message for
> > > each vector. For MSI-X nothing changes. For multi-vector MSI, derive the
> > > base IRQ with msi_get_virq(dev, 0) and OR in the per-vector offset into
> > > msg.data before storing it in dw->irq[i].msi.
> > >
> > > This makes each IMWr MSI vector use a unique MSI data value.
> > >
> > > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > > ---
> > > drivers/dma/dw-edma/dw-edma-core.c | 28 ++++++++++++++++++++++++----
> > > 1 file changed, 24 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/dma/dw-edma/dw-edma-core.c b/drivers/dma/dw-edma/dw-edma-core.c
> > > index 8e5f7defa6b6..3542177a4a8e 100644
> > > --- a/drivers/dma/dw-edma/dw-edma-core.c
> > > +++ b/drivers/dma/dw-edma/dw-edma-core.c
> > > @@ -839,6 +839,28 @@ static inline void dw_edma_add_irq_mask(u32 *mask, u32 alloc, u16 cnt)
> > > (*mask)++;
> > > }
> > >
> > > +static void dw_edma_compose_msi(struct device *dev, int irq, struct msi_msg *out)
> > > +{
> > > + struct msi_desc *desc = irq_get_msi_desc(irq);
> > > + struct msi_msg msg;
> > > + unsigned int base;
> > > +
> > > + if (!desc)
> > > + return;
> > > +
> > > + get_cached_msi_msg(irq, &msg);
> > > + if (!desc->pci.msi_attrib.is_msix) {
> > > + /*
> > > + * For multi-vector MSI, the cached message corresponds to
> > > + * vector 0. Adjust msg.data by the IRQ index so that each
> > > + * vector gets a unique MSI data value for IMWr Data Register.
> > > + */
> > > + base = msi_get_virq(dev, 0);
> > > + msg.data |= (irq - base);
> >
> > why "|=", not "=" here?
>
> "=" is better and safe here. Thanks for pointing it out, I'll fix it.
I forgot to add a follow-up comment before sending RFC v3. Here I was too
distracted; applying OR to the cached base value was in fact intentional.
So [RFC PATCH v3 10/35] remains unchanged.
https://lore.kernel.org/all/20251217151609.3162665-11-den@valinux.co.jp/
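(For context, a worked example with made-up numbers of why OR-ing into the cached
value is correct for multi-vector MSI: the device overwrites the low-order bits of
the message data with the vector index, so the cached vector-0 value has those bits
as zero.)

	/*
	 * Hypothetical example: the host allocates 4 MSI vectors and vector 0 is
	 * cached with msg.data == 0x4340 (low two bits zero).  For irq == base + 2:
	 *
	 *	msg.data | (irq - base) == 0x4340 | 2 == 0x4342
	 *
	 * which is exactly vector 2's message data, whereas a plain '=' would have
	 * discarded the cached base value.
	 */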
Thanks,
Koichiro
>
> Koichiro
>
> >
> > Frank
> >
> > > + }
> > > + *out = msg;
> > > +}
> > > +
> > > static int dw_edma_irq_request(struct dw_edma *dw,
> > > u32 *wr_alloc, u32 *rd_alloc)
> > > {
> > > @@ -869,8 +891,7 @@ static int dw_edma_irq_request(struct dw_edma *dw,
> > > return err;
> > > }
> > >
> > > - if (irq_get_msi_desc(irq))
> > > - get_cached_msi_msg(irq, &dw->irq[0].msi);
> > > + dw_edma_compose_msi(dev, irq, &dw->irq[0].msi);
> > >
> > > dw->nr_irqs = 1;
> > > } else {
> > > @@ -896,8 +917,7 @@ static int dw_edma_irq_request(struct dw_edma *dw,
> > > if (err)
> > > goto err_irq_free;
> > >
> > > - if (irq_get_msi_desc(irq))
> > > - get_cached_msi_msg(irq, &dw->irq[i].msi);
> > > + dw_edma_compose_msi(dev, irq, &dw->irq[i].msi);
> > > }
> > >
> > > dw->nr_irqs = i;
> > > --
> > > 2.48.1
> > >
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-12 3:38 ` Manivannan Sadhasivam
@ 2025-12-18 8:28 ` Koichiro Den
0 siblings, 0 replies; 97+ messages in thread
From: Koichiro Den @ 2025-12-18 8:28 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, kwilczynski,
kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang, allenbh,
Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer, logang,
jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer, arnd,
pstanner, elfring
On Fri, Dec 12, 2025 at 12:38:02PM +0900, Manivannan Sadhasivam wrote:
> On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
> > for the MSI target address on every interrupt and tears it down again
> > via dw_pcie_ep_unmap_addr().
> >
> > On systems that heavily use the AXI bridge interface (for example when
> > the integrated eDMA engine is active), this means the outbound iATU
> > registers are updated while traffic is in flight. The DesignWare
> > endpoint spec warns that updating iATU registers in this situation is
> > not supported, and the behavior is undefined.
> >
>
> When claiming spec violation, you should quote the spec reference such as the
> spec version, section, and actual wording snippet.
Thank you for the review and sorry about my late response.
The relevant wording is from the DW EPC databook 5.40a - 3.10.6.1 iATU
Outbound Programming Overview:
"Dynamic iATU Programming with AXI Bridge Module - You must not update the
iATU registers while operations are in progress on the AXI bridge slave
interface."
Niklas had pointed this out earlier and posted a stand-alone patch that
includes the reference and quote:
https://lore.kernel.org/all/20251210071358.2267494-2-cassel@kernel.org/
>
> > Under high MSI and eDMA load this pattern results in occasional bogus
> > outbound transactions and IOMMU faults such as:
> >
> > ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
> >
> > followed by the system becoming unresponsive. This is the actual output
> > observed on Renesas R-Car S4, with its ipmmu_hc used with PCIe ch0.
> >
> > There is no need to reprogram the iATU region used for MSI on every
> > interrupt. The host-provided MSI address is stable while MSI is enabled,
> > and the endpoint driver already dedicates a scratch buffer for MSI
> > generation.
> >
> > Cache the aligned MSI address and map size, program the outbound iATU
> > once, and keep the window enabled. Subsequent interrupts only perform a
> > write to the MSI scratch buffer, avoiding dynamic iATU reprogramming in
> > the hot path and fixing the lockups seen under load.
> >
>
> iATU windows are very limited (just 8 in some cases), so I don't like allocating
> fixed windows for MSIs.
Do you think there is a generic way to resolve the issue without an OB iATU
window? If so, I'd be happy to pursue it. I wonder whether the MSI sideband
interface is something that can be used generically from software.
>
> > Signed-off-by: Koichiro Den <den@valinux.co.jp>
> > ---
> > .../pci/controller/dwc/pcie-designware-ep.c | 48 ++++++++++++++++---
> > drivers/pci/controller/dwc/pcie-designware.h | 5 ++
> > 2 files changed, 47 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > index 3780a9bd6f79..ef8ded34d9ab 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > @@ -778,6 +778,16 @@ static void dw_pcie_ep_stop(struct pci_epc *epc)
> > struct dw_pcie_ep *ep = epc_get_drvdata(epc);
> > struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> >
> > + /*
> > + * Tear down the dedicated outbound window used for MSI
> > + * generation. This avoids leaking an iATU window across
> > + * endpoint stop/start cycles.
> > + */
> > + if (ep->msi_iatu_mapped) {
> > + dw_pcie_ep_unmap_addr(epc, 0, 0, ep->msi_mem_phys);
> > + ep->msi_iatu_mapped = false;
> > + }
> > +
> > dw_pcie_stop_link(pci);
> > }
> >
> > @@ -881,14 +891,37 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 func_no,
> > msg_addr = ((u64)msg_addr_upper) << 32 | msg_addr_lower;
> >
> > msg_addr = dw_pcie_ep_align_addr(epc, msg_addr, &map_size, &offset);
> > - ret = dw_pcie_ep_map_addr(epc, func_no, 0, ep->msi_mem_phys, msg_addr,
> > - map_size);
> > - if (ret)
> > - return ret;
> >
> > - writel(msg_data | (interrupt_num - 1), ep->msi_mem + offset);
> > + /*
> > + * Program the outbound iATU once and keep it enabled.
> > + *
> > + * The spec warns that updating iATU registers while there are
> > + * operations in flight on the AXI bridge interface is not
> > + * supported, so we avoid reprogramming the region on every MSI,
> > + * specifically unmapping immediately after writel().
> > + */
> > + if (!ep->msi_iatu_mapped) {
> > + ret = dw_pcie_ep_map_addr(epc, func_no, 0,
> > + ep->msi_mem_phys, msg_addr,
> > + map_size);
> > + if (ret)
> > + return ret;
> >
> > - dw_pcie_ep_unmap_addr(epc, func_no, 0, ep->msi_mem_phys);
> > + ep->msi_iatu_mapped = true;
> > + ep->msi_msg_addr = msg_addr;
> > + ep->msi_map_size = map_size;
> > + } else if (WARN_ON_ONCE(ep->msi_msg_addr != msg_addr ||
> > + ep->msi_map_size != map_size)) {
> > + /*
> > + * The host changed the MSI target address or the required
> > + * mapping size. Reprogramming the iATU at runtime is unsafe
> > + * on this controller, so bail out instead of trying to update
> > + * the existing region.
> > + */
>
> I'd perfer having some sort of locking to program the iATU registers during
> runtime instead of bailing out.
Here, does the "locking" mean any mechanism to ensure the quiescence that
allows safe reprogramming?
Thank you,
Koichiro
>
> - Mani
>
> --
> மணிவண்ணன் சதாசிவம்
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-08 7:57 ` Niklas Cassel
2025-12-09 8:15 ` Niklas Cassel
@ 2025-12-22 5:10 ` Krishna Chaitanya Chundru
2025-12-22 7:50 ` Niklas Cassel
1 sibling, 1 reply; 97+ messages in thread
From: Krishna Chaitanya Chundru @ 2025-12-22 5:10 UTC (permalink / raw)
To: Niklas Cassel, Koichiro Den
Cc: ntb, linux-pci, dmaengine, linux-kernel, Frank.Li, mani,
kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason, dave.jiang,
allenbh, Basavaraj.Natikar, Shyam-sundar.S-k, kurt.schwemmer,
logang, jingoohan1, lpieralisi, robh, jbrunet, fancer.lancer,
arnd, pstanner, elfring
On 12/8/2025 1:27 PM, Niklas Cassel wrote:
> On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
>> dw_pcie_ep_raise_msi_irq() currently programs an outbound iATU window
>> for the MSI target address on every interrupt and tears it down again
>> via dw_pcie_ep_unmap_addr().
>>
>> On systems that heavily use the AXI bridge interface (for example when
>> the integrated eDMA engine is active), this means the outbound iATU
>> registers are updated while traffic is in flight. The DesignWare
>> endpoint spec warns that updating iATU registers in this situation is
>> not supported, and the behavior is undefined.
>>
>> Under high MSI and eDMA load this pattern results in occasional bogus
>> outbound transactions and IOMMU faults such as:
>>
>> ipmmu-vmsa eed40000.iommu: Unhandled fault: status 0x00001502 iova 0xfe000000
>>
>> followed by the system becoming unresponsive. This is the actual output
>> observed on Renesas R-Car S4, with its ipmmu_hc used with PCIe ch0.
>>
>> There is no need to reprogram the iATU region used for MSI on every
>> interrupt. The host-provided MSI address is stable while MSI is enabled,
>> and the endpoint driver already dedicates a scratch buffer for MSI
>> generation.
>>
>> Cache the aligned MSI address and map size, program the outbound iATU
>> once, and keep the window enabled. Subsequent interrupts only perform a
>> write to the MSI scratch buffer, avoiding dynamic iATU reprogramming in
>> the hot path and fixing the lockups seen under load.
>>
>> Signed-off-by: Koichiro Den <den@valinux.co.jp>
>> ---
>> .../pci/controller/dwc/pcie-designware-ep.c | 48 ++++++++++++++++---
>> drivers/pci/controller/dwc/pcie-designware.h | 5 ++
>> 2 files changed, 47 insertions(+), 6 deletions(-)
>>
> I don't like that this patch modifies dw_pcie_ep_raise_msi_irq() but does
> not modify dw_pcie_ep_raise_msix_irq()
>
> both functions call dw_pcie_ep_map_addr() before doing the writel(),
> so I think they should be treated the same.
>
>
> I do however understand that it is a bit wasteful to dedicate one
> outbound iATU for MSI and one outbound iATU for MSI-X, as the PCI
> spec does not allow both of them to be enabled at the same time,
> see:
>
> 6.1.4 MSI and MSI-X Operation § in PCIe 6.0 spec:
> "A Function is permitted to implement both MSI and MSI-X,
> but system software is prohibited from enabling both at the
> same time. If system software enables both at the same time,
> the behavior is undefined."
>
>
> I guess the problem is that some EPF drivers, even if only
> one capability can be enabled (MSI/MSI-X), call both
> pci_epc_set_msi() and pci_epc_set_msix(), e.g.:
> https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-test.c#L969-L987
>
> To fill in the number of MSI/MSI-X irqs.
>
> While other EPF drivers only call either pci_epc_set_msi() or
> pci_epc_set_msix(), depending on the IRQ type that will actually
> be used:
> https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L2247-L2262
>
> I think both versions is okay, just because the number of IRQs
> is filled in for both MSI/MSI-X, AFAICT, only one of them will
> get enabled.
>
>
> I guess it might be hard for an EPC driver to know which capability
> that is currently enabled, as to enable a capability is only a config
> space write by the host side.
As the host is the one which enables MSI/MSI-X, it would be better for the
controller driver to take this decision and for the EPF driver to just send
raise_irq. Because technically, the host can also disable MSI and enable MSI-X
at runtime.

The controller driver can check which one is enabled and choose between
MSI-X/MSI/legacy.
- Krishna Chaitanya.
> I guess in most real hardware, e.g. a NIC device, you do an
> "enable engine"/"stop enginge" type of write to a BAR.
>
> Perhaps we should have similar callbacks in struct pci_epc_ops ?
>
> My thinking is that after "start engine", an EPC driver could read
> the MSI and MSI-X capabilities, to see which is enabled.
> As it should not be allowed to change between MSI and MSI-X without
> doing a "stop engine" first.
>
>
> Kind regards,
> Niklas
>
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-22 5:10 ` Krishna Chaitanya Chundru
@ 2025-12-22 7:50 ` Niklas Cassel
2025-12-22 8:14 ` Krishna Chaitanya Chundru
0 siblings, 1 reply; 97+ messages in thread
From: Niklas Cassel @ 2025-12-22 7:50 UTC (permalink / raw)
To: Krishna Chaitanya Chundru
Cc: Koichiro Den, ntb, linux-pci, dmaengine, linux-kernel, Frank.Li,
mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
On Mon, Dec 22, 2025 at 10:40:12AM +0530, Krishna Chaitanya Chundru wrote:
> On 12/8/2025 1:27 PM, Niklas Cassel wrote:
> > On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> >
> > I guess the problem is that some EPF drivers, even if only
> > one capability can be enabled (MSI/MSI-X), call both
> > pci_epc_set_msi() and pci_epc_set_msix(), e.g.:
> > https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-test.c#L969-L987
> >
> > To fill in the number of MSI/MSI-X irqs.
> >
> > While other EPF drivers only call either pci_epc_set_msi() or
> > pci_epc_set_msix(), depending on the IRQ type that will actually
> > be used:
> > https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L2247-L2262
> >
> > I think both versions is okay, just because the number of IRQs
> > is filled in for both MSI/MSI-X, AFAICT, only one of them will
> > get enabled.
> >
> >
> > I guess it might be hard for an EPC driver to know which capability
> > that is currently enabled, as to enable a capability is only a config
> > space write by the host side.
> As the host is the one which enables MSI/MSIX, it should be better the
> controller
> driver takes this decision and the EPF driver just sends only raise_irq.
> Because technically, host can disable MSI and enable MSIX at runtime also.
>
> In the controller driver, it can check which is enabled and chose b/w
> MSIX/MSI/Legacy.
I'm not sure if I'm following, but if by "the controller driver", you
mean the EPC driver, and not the host side driver, how can the EPC
driver know how many interrupts a specific EPF driver wants to use?
From the kdoc to pci_epc_set_msi(), the nr_irqs parameter is defined as:
@nr_irqs: number of MSI interrupts required by the EPF
https://github.com/torvalds/linux/blob/v6.19-rc2/drivers/pci/endpoint/pci-epc-core.c#L305
Anyway, I posted Koichiro's patch here:
https://lore.kernel.org/linux-pci/20251210071358.2267494-2-cassel@kernel.org/
See my comment:
pci-epf-test does change between MSI and MSI-X without calling
dw_pcie_ep_stop(), however, the msg_addr address written by the host
will be the same address, at least when using a Linux host using a DWC
based controller. If another host ends up using different msg_addr for
MSI and MSI-X, then I think that we will need to modify pci-epf-test to
call a function when changing IRQ type, such that pcie-designware-ep.c
can tear down the MSI/MSI-X mapping.
So if we want to improve things, I think we need to modify the EPF drivers
to call a function when changing the IRQ type. The EPF driver should know
which IRQ type that is currently in use (see e.g. nvme_epf->irq_type in
drivers/nvme/target/pci-epf.c).
Additionally, I don't think that the host side should be allowed to change
the IRQ type (using e.g. setpci) when the EPF driver is in a "running state".
I think things will break badly if you e.g. try to do this on an PCIe
connected network card while the network card is in use.
Kind regards,
Niklas
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-22 7:50 ` Niklas Cassel
@ 2025-12-22 8:14 ` Krishna Chaitanya Chundru
2025-12-22 10:21 ` Manivannan Sadhasivam
0 siblings, 1 reply; 97+ messages in thread
From: Krishna Chaitanya Chundru @ 2025-12-22 8:14 UTC (permalink / raw)
To: Niklas Cassel
Cc: Koichiro Den, ntb, linux-pci, dmaengine, linux-kernel, Frank.Li,
mani, kwilczynski, kishon, bhelgaas, corbet, vkoul, jdmason,
dave.jiang, allenbh, Basavaraj.Natikar, Shyam-sundar.S-k,
kurt.schwemmer, logang, jingoohan1, lpieralisi, robh, jbrunet,
fancer.lancer, arnd, pstanner, elfring
On 12/22/2025 1:20 PM, Niklas Cassel wrote:
> On Mon, Dec 22, 2025 at 10:40:12AM +0530, Krishna Chaitanya Chundru wrote:
>> On 12/8/2025 1:27 PM, Niklas Cassel wrote:
>>> On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
>>>
>>> I guess the problem is that some EPF drivers, even if only
>>> one capability can be enabled (MSI/MSI-X), call both
>>> pci_epc_set_msi() and pci_epc_set_msix(), e.g.:
>>> https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-test.c#L969-L987
>>>
>>> To fill in the number of MSI/MSI-X irqs.
>>>
>>> While other EPF drivers only call either pci_epc_set_msi() or
>>> pci_epc_set_msix(), depending on the IRQ type that will actually
>>> be used:
>>> https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L2247-L2262
>>>
>>> I think both versions is okay, just because the number of IRQs
>>> is filled in for both MSI/MSI-X, AFAICT, only one of them will
>>> get enabled.
>>>
>>>
>>> I guess it might be hard for an EPC driver to know which capability
>>> that is currently enabled, as to enable a capability is only a config
>>> space write by the host side.
>> As the host is the one which enables MSI/MSIX, it should be better the
>> controller
>> driver takes this decision and the EPF driver just sends only raise_irq.
>> Because technically, host can disable MSI and enable MSIX at runtime also.
>>
>> In the controller driver, it can check which is enabled and chose b/w
>> MSIX/MSI/Legacy.
> I'm not sure if I'm following, but if by "the controller driver", you
> mean the EPC driver, and not the host side driver, how can the EPC
> driver know how many interrupts a specific EPF driver wants to use?
I meant the dwc drivers here.
Set MSI & set MSI-X still need to be called from the EPF driver, only to tell
how many interrupts they want to configure, etc.
>
> From the kdoc to pci_epc_set_msi(), the nr_irqs parameter is defined as:
> @nr_irqs: number of MSI interrupts required by the EPF
> https://github.com/torvalds/linux/blob/v6.19-rc2/drivers/pci/endpoint/pci-epc-core.c#L305
>
>
> Anyway, I posted Koichiro's patch here:
> https://lore.kernel.org/linux-pci/20251210071358.2267494-2-cassel@kernel.org/
I will comment on that patch.
>
> See my comment:
> pci-epf-test does change between MSI and MSI-X without calling
> dw_pcie_ep_stop(), however, the msg_addr address written by the host
> will be the same address, at least when using a Linux host using a DWC
> based controller. If another host ends up using different msg_addr for
> MSI and MSI-X, then I think that we will need to modify pci-epf-test to
> call a function when changing IRQ type, such that pcie-designware-ep.c
> can tear down the MSI/MSI-X mapping.
Maybe for arm-based systems we are getting the same address, but for x86-based
systems it is not guaranteed that you will get the same address.
> So if we want to improve things, I think we need to modify the EPF drivers
> to call a function when changing the IRQ type. The EPF driver should know
> which IRQ type that is currently in use (see e.g. nvme_epf->irq_type in
> drivers/nvme/target/pci-epf.c).
My suggestion is to let the EPF driver call raise_irq with the vector number;
then the dwc driver can raise the IRQ based on which IRQ type the host has
enabled.
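(A minimal sketch of how the dwc EP side could decide this by itself, reusing the
config-space helpers pcie-designware-ep.c already uses when raising IRQs; the
function name is made up, and the PCI_IRQ_* values are the ones the EPC core passes
to raise_irq.)

static unsigned int dw_pcie_ep_active_irq_type(struct dw_pcie_ep *ep, u8 func_no)
{
	struct dw_pcie_ep_func *ep_func;
	u16 ctrl;

	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
	if (!ep_func)
		return PCI_IRQ_INTX;

	/* Prefer whichever message-signalled capability the host enabled. */
	if (ep_func->msix_cap) {
		ctrl = dw_pcie_ep_readw_dbi(ep, func_no,
					    ep_func->msix_cap + PCI_MSIX_FLAGS);
		if (ctrl & PCI_MSIX_FLAGS_ENABLE)
			return PCI_IRQ_MSIX;
	}

	if (ep_func->msi_cap) {
		ctrl = dw_pcie_ep_readw_dbi(ep, func_no,
					    ep_func->msi_cap + PCI_MSI_FLAGS);
		if (ctrl & PCI_MSI_FLAGS_ENABLE)
			return PCI_IRQ_MSI;
	}

	return PCI_IRQ_INTX;
}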
> Additionally, I don't think that the host side should be allowed to change
> the IRQ type (using e.g. setpci) when the EPF driver is in a "running state".
In the host driver itself they can choose to change it by using
pci_alloc_irq_vectors()
<https://elixir.bootlin.com/linux/v6.18.2/C/ident/pci_alloc_irq_vectors>.
Currently it is not present, but in the future someone could change it, as the
spec doesn't say you can't update it.
> I think things will break badly if you e.g. try to do this on an PCIe
> connected network card while the network card is in use.
I agree on this.
I just want to highlight that there is a possibility of this in the future, if
someone comes up with clean logic.
- Krishna Chaitanya.
>
>
> Kind regards,
> Niklas
* Re: [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping
2025-12-22 8:14 ` Krishna Chaitanya Chundru
@ 2025-12-22 10:21 ` Manivannan Sadhasivam
0 siblings, 0 replies; 97+ messages in thread
From: Manivannan Sadhasivam @ 2025-12-22 10:21 UTC (permalink / raw)
To: Krishna Chaitanya Chundru
Cc: Niklas Cassel, Koichiro Den, ntb, linux-pci, dmaengine,
linux-kernel, Frank.Li, kwilczynski, kishon, bhelgaas, corbet,
vkoul, jdmason, dave.jiang, allenbh, Basavaraj.Natikar,
Shyam-sundar.S-k, kurt.schwemmer, logang, jingoohan1, lpieralisi,
robh, jbrunet, fancer.lancer, arnd, pstanner, elfring
On Mon, Dec 22, 2025 at 01:44:02PM +0530, Krishna Chaitanya Chundru wrote:
>
>
> On 12/22/2025 1:20 PM, Niklas Cassel wrote:
> > On Mon, Dec 22, 2025 at 10:40:12AM +0530, Krishna Chaitanya Chundru wrote:
> > > On 12/8/2025 1:27 PM, Niklas Cassel wrote:
> > > > On Sun, Nov 30, 2025 at 01:03:57AM +0900, Koichiro Den wrote:
> > > >
> > > > I guess the problem is that some EPF drivers, even if only
> > > > one capability can be enabled (MSI/MSI-X), call both
> > > > pci_epc_set_msi() and pci_epc_set_msix(), e.g.:
> > > > https://github.com/torvalds/linux/blob/v6.18/drivers/pci/endpoint/functions/pci-epf-test.c#L969-L987
> > > >
> > > > To fill in the number of MSI/MSI-X irqs.
> > > >
> > > > While other EPF drivers only call either pci_epc_set_msi() or
> > > > pci_epc_set_msix(), depending on the IRQ type that will actually
> > > > be used:
> > > > https://github.com/torvalds/linux/blob/v6.18/drivers/nvme/target/pci-epf.c#L2247-L2262
> > > >
> > > > I think both versions is okay, just because the number of IRQs
> > > > is filled in for both MSI/MSI-X, AFAICT, only one of them will
> > > > get enabled.
> > > >
> > > >
> > > > I guess it might be hard for an EPC driver to know which capability
> > > > that is currently enabled, as to enable a capability is only a config
> > > > space write by the host side.
> > > As the host is the one which enables MSI/MSIX, it should be better the
> > > controller
> > > driver takes this decision and the EPF driver just sends only raise_irq.
> > > Because technically, host can disable MSI and enable MSIX at runtime also.
> > >
> > > In the controller driver, it can check which is enabled and chose b/w
> > > MSIX/MSI/Legacy.
> > I'm not sure if I'm following, but if by "the controller driver", you
> > mean the EPC driver, and not the host side driver, how can the EPC
> > driver know how many interrupts a specific EPF driver wants to use?
> I meant the dwc drivers here.
> Set msi & set msix still need to called from the EPF driver only to tell how
> many
> interrupts they want to configure etc.
Please leave a newline before and after your reply to make it readable in text
based clients, which some of the poor folks like me still use.
> >
> > From the kdoc to pci_epc_set_msi(), the nr_irqs parameter is defined as:
> > @nr_irqs: number of MSI interrupts required by the EPF
> > https://github.com/torvalds/linux/blob/v6.19-rc2/drivers/pci/endpoint/pci-epc-core.c#L305
> >
> >
> > Anyway, I posted Koichiro's patch here:
> > https://lore.kernel.org/linux-pci/20251210071358.2267494-2-cassel@kernel.org/
> I will comment on that patch.
> >
> > See my comment:
> > pci-epf-test does change between MSI and MSI-X without calling
> > dw_pcie_ep_stop(), however, the msg_addr address written by the host
> > will be the same address, at least when using a Linux host using a DWC
> > based controller. If another host ends up using different msg_addr for
> > MSI and MSI-X, then I think that we will need to modify pci-epf-test to
> > call a function when changing IRQ type, such that pcie-designware-ep.c
> > can tear down the MSI/MSI-X mapping.
> Maybe for arm based systems we are getting same address but for x86 based
> systems
> it is not guarantee that you will get same address.
> > So if we want to improve things, I think we need to modify the EPF drivers
> > to call a function when changing the IRQ type. The EPF driver should know
> > which IRQ type that is currently in use (see e.g. nvme_epf->irq_type in
> > drivers/nvme/target/pci-epf.c).
> My suggestion is let EPF driver call raise_irq with the vector number then
> the dwc driver
> can raise IRQ based on which IRQ host enables it.
> > Additionally, I don't think that the host side should be allowed to change
> > the IRQ type (using e.g. setpci) when the EPF driver is in a "running state".
> In the host driver itelf they can choose to change it by using
> pci_alloc_irq_vectors
> <https://elixir.bootlin.com/linux/v6.18.2/C/ident/pci_alloc_irq_vectors>,
> Currently it is not present but in future someone can change it, as spec
> didn't say you
> can't update it.
The spec has some wording on it (though not very clear) in r6.0, sec 6.1.4
"System software initializes the message address and message data (from here on
referred to as the “vector”) during device configuration, allocating one or more
vectors to each MSI-capable Function."
This *sounds* like the MSI/MSI-X initialization should happen during device
configuration.
> > I think things will break badly if you e.g. try to do this on an PCIe
> > connected network card while the network card is in use.
> I agree on this.
>
> I just want to highlight there is possibility of this in future, if someone
> comes up with a
> clean logic.
>
I don't know if this is even possible. For example, I don't think a host is
allowed to reattach a device which was using MSI to a VM which only uses MSI-X
during live device migration in the virtualization world. I bet the device might not
perform as expected if that happens.
- Mani
--
மணிவண்ணன் சதாசிவம்
Thread overview: 97+ messages
2025-11-29 16:03 [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 01/27] PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on mws_size[] access Koichiro Den
2025-12-01 18:59 ` Frank Li
2025-11-29 16:03 ` [RFC PATCH v2 02/27] PCI: endpoint: pci-epf-vntb: Add mwN_offset configfs attributes Koichiro Den
2025-12-01 19:11 ` Frank Li
2025-12-02 6:23 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 03/27] NTB: epf: Handle mwN_offset for inbound MW regions Koichiro Den
2025-12-01 19:14 ` Frank Li
2025-12-02 6:23 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 04/27] PCI: endpoint: Add inbound mapping ops to EPC core Koichiro Den
2025-12-01 19:19 ` Frank Li
2025-12-02 6:25 ` Koichiro Den
2025-12-02 15:58 ` Frank Li
2025-12-03 14:12 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 05/27] PCI: dwc: ep: Implement EPC inbound mapping support Koichiro Den
2025-12-01 19:32 ` Frank Li
2025-12-02 6:26 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 06/27] PCI: endpoint: pci-epf-vntb: Use pci_epc_map_inbound() for MW mapping Koichiro Den
2025-12-01 19:34 ` Frank Li
2025-12-02 6:26 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 07/27] NTB: Add offset parameter to MW translation APIs Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 08/27] PCI: endpoint: pci-epf-vntb: Propagate MW offset from configfs when present Koichiro Den
2025-12-01 19:35 ` Frank Li
2025-11-29 16:03 ` [RFC PATCH v2 09/27] NTB: ntb_transport: Support offsetted partial memory windows Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 10/27] NTB: core: Add .get_pci_epc() to ntb_dev_ops Koichiro Den
2025-12-01 19:39 ` Frank Li
2025-12-02 6:31 ` Koichiro Den
2025-12-01 21:08 ` Dave Jiang
2025-12-02 6:32 ` Koichiro Den
2025-12-02 14:49 ` Dave Jiang
2025-12-03 15:02 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 11/27] NTB: epf: vntb: Implement .get_pci_epc() callback Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 12/27] dmaengine: dw-edma: Fix MSI data values for multi-vector IMWr interrupts Koichiro Den
2025-12-01 19:46 ` Frank Li
2025-12-02 6:32 ` Koichiro Den
2025-12-18 6:52 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 13/27] NTB: ntb_transport: Use seq_file for QP stats debugfs Koichiro Den
2025-12-01 19:50 ` Frank Li
2025-12-02 6:33 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 14/27] NTB: ntb_transport: Move TX memory window setup into setup_qp_mw() Koichiro Den
2025-12-01 20:02 ` Frank Li
2025-12-02 6:33 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 15/27] NTB: ntb_transport: Dynamically determine qp count Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 16/27] NTB: ntb_transport: Introduce get_dma_dev() helper Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 17/27] NTB: epf: Reserve a subset of MSI vectors for non-NTB users Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 18/27] NTB: ntb_transport: Introduce ntb_transport_backend_ops Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 19/27] PCI: dwc: ep: Cache MSI outbound iATU mapping Koichiro Den
2025-12-01 20:41 ` Frank Li
2025-12-02 6:35 ` Koichiro Den
2025-12-02 9:32 ` Niklas Cassel
2025-12-02 15:20 ` Frank Li
2025-12-03 8:40 ` Koichiro Den
2025-12-03 10:39 ` Niklas Cassel
2025-12-03 14:36 ` Koichiro Den
2025-12-03 14:40 ` Koichiro Den
2025-12-04 17:10 ` Frank Li
2025-12-05 16:28 ` Frank Li
2025-12-02 6:32 ` Niklas Cassel
2025-12-03 8:30 ` Koichiro Den
2025-12-03 10:19 ` Niklas Cassel
2025-12-03 14:56 ` Koichiro Den
2025-12-08 7:57 ` Niklas Cassel
2025-12-09 8:15 ` Niklas Cassel
2025-12-12 3:56 ` Koichiro Den
2025-12-22 5:10 ` Krishna Chaitanya Chundru
2025-12-22 7:50 ` Niklas Cassel
2025-12-22 8:14 ` Krishna Chaitanya Chundru
2025-12-22 10:21 ` Manivannan Sadhasivam
2025-12-12 3:38 ` Manivannan Sadhasivam
2025-12-18 8:28 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode Koichiro Den
2025-12-01 21:41 ` Frank Li
2025-12-02 6:43 ` Koichiro Den
2025-12-02 15:42 ` Frank Li
2025-12-03 8:53 ` Koichiro Den
2025-12-03 16:14 ` Frank Li
2025-12-04 15:42 ` Koichiro Den
2025-12-04 20:16 ` Frank Li
2025-12-05 3:04 ` Koichiro Den
2025-12-05 15:06 ` Frank Li
2025-12-18 4:34 ` Koichiro Den
2025-12-01 21:46 ` Dave Jiang
2025-12-02 6:59 ` Koichiro Den
2025-12-02 14:53 ` Dave Jiang
2025-12-03 14:19 ` Koichiro Den
2025-11-29 16:03 ` [RFC PATCH v2 21/27] NTB: epf: Provide db_vector_count/db_vector_mask callbacks Koichiro Den
2025-11-29 16:04 ` [RFC PATCH v2 22/27] ntb_netdev: Multi-queue support Koichiro Den
2025-11-29 16:04 ` [RFC PATCH v2 23/27] NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car) Koichiro Den
2025-12-01 20:47 ` Frank Li
2025-11-29 16:04 ` [RFC PATCH v2 24/27] iommu: ipmmu-vmsa: Add PCIe ch0 to devices_allowlist Koichiro Den
2025-11-29 16:04 ` [RFC PATCH v2 25/27] iommu: ipmmu-vmsa: Add support for reserved regions Koichiro Den
2025-11-29 16:04 ` [RFC PATCH v2 26/27] arm64: dts: renesas: Add Spider RC/EP DTs for NTB with remote DW PCIe eDMA Koichiro Den
2025-11-29 16:04 ` [RFC PATCH v2 27/27] NTB: epf: Add an additional memory window (MW2) barno mapping on Renesas R-Car Koichiro Den
2025-12-01 22:02 ` [RFC PATCH v2 00/27] NTB transport backed by remote DW eDMA Frank Li
2025-12-02 6:20 ` Koichiro Den
2025-12-02 16:07 ` Frank Li
2025-12-03 8:43 ` Koichiro Den