* [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
[not found] <0e54954b-0880-4ebc-8ef0-13b3ac0a6838@huawei.com>
@ 2024-07-30 13:08 ` Yonglong Liu
2024-07-30 17:21 ` Jesper Dangaard Brouer
2024-08-02 16:38 ` Alexander Duyck
0 siblings, 2 replies; 21+ messages in thread
From: Yonglong Liu @ 2024-07-30 13:08 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski, pabeni, hawk, ilias.apalodimas
Cc: netdev, linux-kernel, Alexander Duyck, Alexei Starovoitov,
linyunsheng, shenjian (K), Salil Mehta
I found a bug when running the hns3 driver with page pool enabled; the
log is below:
[ 4406.956606] Unable to handle kernel NULL pointer dereference at
virtual address 00000000000000a8
[ 4406.965379] Mem abort info:
[ 4406.968160] ESR = 0x0000000096000004
[ 4406.971906] EC = 0x25: DABT (current EL), IL = 32 bits
[ 4406.977218] SET = 0, FnV = 0
[ 4406.980258] EA = 0, S1PTW = 0
[ 4406.983404] FSC = 0x04: level 0 translation fault
[ 4406.988273] Data abort info:
[ 4406.991154] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 4406.996632] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 4407.001681] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 4407.006985] user pgtable: 4k pages, 48-bit VAs, pgdp=0000202828326000
[ 4407.013430] [00000000000000a8] pgd=0000000000000000, p4d=0000000000000000
[ 4407.020212] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 4407.026454] Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT
nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle
ip6table_filter ip6_tables hns_roce_hw_v2 hns3 hclge hnae3 xt_addrtype
iptable_filter xt_conntrack overlay arm_spe_pmu arm_smmuv3_pmu
hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_uncore_l3c_pmu
hisi_uncore_pmu fuse rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
scsi_transport_iscsi crct10dif_ce hisi_sec2 hisi_hpre hisi_zip
hisi_sas_v3_hw xhci_pci sbsa_gwdt hisi_qm hisi_sas_main hisi_dma
xhci_pci_renesas uacce libsas [last unloaded: hnae3]
[ 4407.076027] CPU: 48 PID: 610 Comm: kworker/48:1
[ 4407.093343] Workqueue: events page_pool_release_retry
[ 4407.098384] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
BTYPE=--)
[ 4407.105316] pc : iommu_get_dma_domain+0xc/0x20
[ 4407.109744] lr : iommu_dma_unmap_page+0x38/0xe8
[ 4407.114255] sp : ffff80008bacbc80
[ 4407.117554] x29: ffff80008bacbc80 x28: 0000000000000000 x27: ffffc31806be7000
[ 4407.124659] x26: ffff2020002b6ac0 x25: 0000000000000000 x24: 0000000000000002
[ 4407.131762] x23: 0000000000000022 x22: 0000000000001000 x21: 00000000fcd7c000
[ 4407.138865] x20: ffff0020c9882800 x19: ffff0020856f60c8 x18: ffff8000d3503c58
[ 4407.145968] x17: 0000000000000000 x16: 1fffe00419521061 x15: 0000000000000001
[ 4407.153073] x14: 0000000000000003 x13: 00000401850ae012 x12: 000006b10004e7fb
[ 4407.160177] x11: 0000000000000067 x10: 0000000000000c70 x9 : ffffc3180405cd20
[ 4407.167280] x8 : fefefefefefefeff x7 : 0000000000000001 x6 : 0000000000000010
[ 4407.174382] x5 : ffffc3180405cce8 x4 : 0000000000000022 x3 : 0000000000000002
[ 4407.181485] x2 : 0000000000001000 x1 : 00000000fcd7c000 x0 : 0000000000000000
[ 4407.188589] Call trace:
[ 4407.191027] iommu_get_dma_domain+0xc/0x20
[ 4407.195105] dma_unmap_page_attrs+0x38/0x1d0
[ 4407.199361] page_pool_return_page+0x48/0x180
[ 4407.203699] page_pool_release+0xd4/0x1f0
[ 4407.207692] page_pool_release_retry+0x28/0xe8
[ 4407.212119] process_one_work+0x164/0x3e0
[ 4407.216116] worker_thread+0x310/0x420
[ 4407.219851] kthread+0x120/0x130
[ 4407.223066] ret_from_fork+0x10/0x20
[ 4407.226630] Code: ffffc318 aa1e03e9 d503201f f9416c00 (f9405400)
[ 4407.232697] ---[ end trace 0000000000000000 ]---
The hns3 driver uses the page pool like this, calling it just once when
the driver initializes:
static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
{
	struct page_pool_params pp_params = {
		.flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
			 PP_FLAG_DMA_SYNC_DEV,
		.order = hns3_page_order(ring),
		.pool_size = ring->desc_num * hns3_buf_size(ring) /
			     (PAGE_SIZE << hns3_page_order(ring)),
		.nid = dev_to_node(ring_to_dev(ring)),
		.dev = ring_to_dev(ring),
		.dma_dir = DMA_FROM_DEVICE,
		.offset = 0,
		.max_len = PAGE_SIZE << hns3_page_order(ring),
	};

	ring->page_pool = page_pool_create(&pp_params);
	if (IS_ERR(ring->page_pool)) {
		dev_warn(ring_to_dev(ring), "page pool creation failed: %ld\n",
			 PTR_ERR(ring->page_pool));
		ring->page_pool = NULL;
	}
}
And we call page_pool_destroy(ring->page_pool) when the driver is
uninitialized.
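The teardown side is minimal; roughly (the function name here is
illustrative, not necessarily the exact hns3 symbol):

static void hns3_free_page_pool(struct hns3_enet_ring *ring)
{
	if (!ring->page_pool)
		return;

	/* Drops the pool's own reference; the core defers the actual
	 * freeing (via page_pool_release_retry()) until all inflight
	 * pages have been returned. */
	page_pool_destroy(ring->page_pool);
	ring->page_pool = NULL;
}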
We use two devices with the net ports connected directly, and the steps
of the test case are as below (a condensed shell sketch follows the list):
1. enable a vf of '7d:00.0': echo 1 >
/sys/class/net/eno1/device/sriov_numvfs
2. use iperf to produce some flows (the problem happens on the side that
runs 'iperf -s')
3. use ifconfig to bring the vf down and up
4. kill iperf
5. disable the vf: echo 0 > /sys/class/net/eno1/device/sriov_numvfs
6. run steps 1~5 with the other port, bd:00.0
7. repeat steps 1~6
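The same steps as a rough shell sketch (interface and VF netdev names are
the ones from the logs in this thread; starting the iperf client on the
peer machine is assumed):

# One iteration; loop this (and the bd:00.0 variant) for about two
# hours to reproduce the crash.
echo 1 > /sys/class/net/eno1/device/sriov_numvfs   # step 1: enable a VF of 7d:00.0
iperf -s &                                         # step 2: receive the flows on this side
ifconfig eno1v0 down; ifconfig eno1v0 up           # step 3: bounce the VF netdev
kill %1                                            # step 4: kill iperf
echo 0 > /sys/class/net/eno1/device/sriov_numvfs   # step 5: disable the VF
# steps 6~7: repeat the above with port bd:00.0 (VF netdev enp189s0f0v0)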
When running this test case, we also found another related message (I
replaced pr_warn() with dev_warn()):
pci 0000:7d:01.0: page_pool_release_retry() stalled pool shutdown: id
949, 98 inflight 1449 sec
Even when we stop the traffic, stop the test case, and disable the vf,
this message is still being printed.
We must run the test case for about two hours to reproduce the problem.
Is there any advice on how to solve or debug this?
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-07-30 13:08 ` [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20 Yonglong Liu
@ 2024-07-30 17:21 ` Jesper Dangaard Brouer
2024-07-31 8:42 ` Somnath Kotur
2024-08-02 16:38 ` Alexander Duyck
1 sibling, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2024-07-30 17:21 UTC (permalink / raw)
To: Yonglong Liu, David S. Miller, Jakub Kicinski, pabeni,
ilias.apalodimas
Cc: netdev, linux-kernel, Alexander Duyck, Alexei Starovoitov,
linyunsheng, shenjian (K), Salil Mehta
On 30/07/2024 15.08, Yonglong Liu wrote:
> I found a bug when running hns3 driver with page pool enabled, the log
> as below:
>
> [ 4406.956606] Unable to handle kernel NULL pointer dereference at
> virtual address 00000000000000a8
struct iommu_domain *iommu_get_dma_domain(struct device *dev)
{
	return dev->iommu_group->default_domain;
}
$ pahole -C iommu_group --hex | grep default_domain
struct iommu_domain * default_domain; /* 0xa8 0x8 */
Looks like iommu_group is a NULL pointer (so dereferencing the member
'default_domain' causes this fault).
> [ 4406.965379] Mem abort info:
> [ 4406.968160] ESR = 0x0000000096000004
> [ 4406.971906] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 4406.977218] SET = 0, FnV = 0
> [ 4406.980258] EA = 0, S1PTW = 0
> [ 4406.983404] FSC = 0x04: level 0 translation fault
> [ 4406.988273] Data abort info:
> [ 4406.991154] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> [ 4406.996632] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [ 4407.001681] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [ 4407.006985] user pgtable: 4k pages, 48-bit VAs, pgdp=0000202828326000
> [ 4407.013430] [00000000000000a8] pgd=0000000000000000,
> p4d=0000000000000000
> [ 4407.020212] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
> [ 4407.026454] Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT
> nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle
> ip6table_filter ip6_tables hns_roce_hw_v2 hns3 hclge hnae3 xt_addrtype
> iptable_filter xt_conntrack overlay arm_spe_pmu arm_smmuv3_pmu
> hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_uncore_l3c_pmu
> hisi_uncore_pmu fuse rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
> scsi_transport_iscsi crct10dif_ce hisi_sec2 hisi_hpre hisi_zip
> hisi_sas_v3_hw xhci_pci sbsa_gwdt hisi_qm hisi_sas_main hisi_dma
> xhci_pci_renesas uacce libsas [last unloaded: hnae3]
> [ 4407.076027] CPU: 48 PID: 610 Comm: kworker/48:1
> [ 4407.093343] Workqueue: events page_pool_release_retry
> [ 4407.098384] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
> BTYPE=--)
> [ 4407.105316] pc : iommu_get_dma_domain+0xc/0x20
> [ 4407.109744] lr : iommu_dma_unmap_page+0x38/0xe8
> [ 4407.114255] sp : ffff80008bacbc80
> [ 4407.117554] x29: ffff80008bacbc80 x28: 0000000000000000 x27:
> ffffc31806be7000
> [ 4407.124659] x26: ffff2020002b6ac0 x25: 0000000000000000 x24:
> 0000000000000002
> [ 4407.131762] x23: 0000000000000022 x22: 0000000000001000 x21:
> 00000000fcd7c000
> [ 4407.138865] x20: ffff0020c9882800 x19: ffff0020856f60c8 x18:
> ffff8000d3503c58
> [ 4407.145968] x17: 0000000000000000 x16: 1fffe00419521061 x15:
> 0000000000000001
> [ 4407.153073] x14: 0000000000000003 x13: 00000401850ae012 x12:
> 000006b10004e7fb
> [ 4407.160177] x11: 0000000000000067 x10: 0000000000000c70 x9 :
> ffffc3180405cd20
> [ 4407.167280] x8 : fefefefefefefeff x7 : 0000000000000001 x6 :
> 0000000000000010
> [ 4407.174382] x5 : ffffc3180405cce8 x4 : 0000000000000022 x3 :
> 0000000000000002
> [ 4407.181485] x2 : 0000000000001000 x1 : 00000000fcd7c000 x0 :
> 0000000000000000
> [ 4407.188589] Call trace:
> [ 4407.191027] iommu_get_dma_domain+0xc/0x20
> [ 4407.195105] dma_unmap_page_attrs+0x38/0x1d0
> [ 4407.199361] page_pool_return_page+0x48/0x180
> [ 4407.203699] page_pool_release+0xd4/0x1f0
> [ 4407.207692] page_pool_release_retry+0x28/0xe8
I suspect that the DMA IOMMU part was deallocated and freed by the
driver even though page_pool still has inflight packets.
The page_pool bumps the refcnt via get_device() + put_device() on the DMA
'struct device' to avoid it going away, but I guess there is also some
IOMMU code that we need to make sure doesn't go away (until all inflight
pages are returned)???
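For reference, the device refcounting is roughly the following
(paraphrased from net/core/page_pool.c; the exact placement of the
put_device() varies across kernel versions):

static int page_pool_init(struct page_pool *pool,
			  const struct page_pool_params *params)
{
	/* ... */
	if (pool->dma_map)
		get_device(pool->p.dev);	/* pin the DMA device */
	/* ... */
}

static void __page_pool_destroy(struct page_pool *pool)
{
	/* ... */
	if (pool->dma_map)
		put_device(pool->p.dev);	/* only after all pages returned */
	/* ... */
}

Note there is no equivalent pinning of dev->iommu_group, which is the
pointer the crash above dereferences.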
> [ 4407.212119] process_one_work+0x164/0x3e0
> [ 4407.216116] worker_thread+0x310/0x420
> [ 4407.219851] kthread+0x120/0x130
> [ 4407.223066] ret_from_fork+0x10/0x20
> [ 4407.226630] Code: ffffc318 aa1e03e9 d503201f f9416c00 (f9405400)
> [ 4407.232697] ---[ end trace 0000000000000000 ]---
>
>
> The hns3 driver use page pool like this, just call once when the driver
> initialize:
>
> static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
> {
> struct page_pool_params pp_params = {
> .flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
> PP_FLAG_DMA_SYNC_DEV,
> .order = hns3_page_order(ring),
> .pool_size = ring->desc_num * hns3_buf_size(ring) /
> (PAGE_SIZE << hns3_page_order(ring)),
> .nid = dev_to_node(ring_to_dev(ring)),
> .dev = ring_to_dev(ring),
> .dma_dir = DMA_FROM_DEVICE,
> .offset = 0,
> .max_len = PAGE_SIZE << hns3_page_order(ring),
> };
>
> ring->page_pool = page_pool_create(&pp_params);
> if (IS_ERR(ring->page_pool)) {
> dev_warn(ring_to_dev(ring), "page pool creation failed: %ld\n",
> PTR_ERR(ring->page_pool));
> ring->page_pool = NULL;
> }
> }
>
> And call page_pool_destroy(ring->page_pool) when the driver uninitialized.
>
>
> We use two devices, the net port connect directory, and the step of the
> test case like below:
>
> 1. enable a vf of '7d:00.0': echo 1 >
> /sys/class/net/eno1/device/sriov_numvfs
>
> 2. use iperf to produce some flows(the problem happens to the side which
> runs 'iperf -s')
>
> 3. use ifconfig down/up to the vf
>
> 4. kill iperf
>
> 5. disable the vf: echo 0 > /sys/class/net/eno1/device/sriov_numvfs
>
> 6. run 1~5 with another port bd:00.0
>
> 7. repeat 1~6
>
>
> And when running this test case, we can found another related message (I
> replaced pr_warn() to dev_warn()):
>
> pci 0000:7d:01.0: page_pool_release_retry() stalled pool shutdown: id
> 949, 98 inflight 1449 sec
>
>
> Even when stop the traffic, stop the test case, disable the vf, this
> message is still being printed.
>
> We must run the test case for about two hours to reproduce the problem.
> Is there some advise to solve or debug the problem?
>
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-07-30 17:21 ` Jesper Dangaard Brouer
@ 2024-07-31 8:42 ` Somnath Kotur
2024-07-31 11:32 ` Yonglong Liu
2024-08-05 12:19 ` Yunsheng Lin
0 siblings, 2 replies; 21+ messages in thread
From: Somnath Kotur @ 2024-07-31 8:42 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Yonglong Liu, David S. Miller, Jakub Kicinski, pabeni,
ilias.apalodimas, netdev, linux-kernel, Alexander Duyck,
Alexei Starovoitov, linyunsheng, shenjian (K), Salil Mehta
On Tue, Jul 30, 2024 at 10:51 PM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
> On 30/07/2024 15.08, Yonglong Liu wrote:
> > I found a bug when running hns3 driver with page pool enabled, the log
> > as below:
> >
> > [ 4406.956606] Unable to handle kernel NULL pointer dereference at
> > virtual address 00000000000000a8
>
> struct iommu_domain *iommu_get_dma_domain(struct device *dev)
> {
> return dev->iommu_group->default_domain;
> }
>
> $ pahole -C iommu_group --hex | grep default_domain
> struct iommu_domain * default_domain; /* 0xa8 0x8 */
>
> Looks like iommu_group is a NULL pointer (that when deref member
> 'default_domain' cause this fault).
>
>
> > [ 4406.965379] Mem abort info:
> > [ 4406.968160] ESR = 0x0000000096000004
> > [ 4406.971906] EC = 0x25: DABT (current EL), IL = 32 bits
> > [ 4406.977218] SET = 0, FnV = 0
> > [ 4406.980258] EA = 0, S1PTW = 0
> > [ 4406.983404] FSC = 0x04: level 0 translation fault
> > [ 4406.988273] Data abort info:
> > [ 4406.991154] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> > [ 4406.996632] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > [ 4407.001681] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > [ 4407.006985] user pgtable: 4k pages, 48-bit VAs, pgdp=0000202828326000
> > [ 4407.013430] [00000000000000a8] pgd=0000000000000000,
> > p4d=0000000000000000
> > [ 4407.020212] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
> > [ 4407.026454] Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT
> > nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle
> > ip6table_filter ip6_tables hns_roce_hw_v2 hns3 hclge hnae3 xt_addrtype
> > iptable_filter xt_conntrack overlay arm_spe_pmu arm_smmuv3_pmu
> > hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_uncore_l3c_pmu
> > hisi_uncore_pmu fuse rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
> > scsi_transport_iscsi crct10dif_ce hisi_sec2 hisi_hpre hisi_zip
> > hisi_sas_v3_hw xhci_pci sbsa_gwdt hisi_qm hisi_sas_main hisi_dma
> > xhci_pci_renesas uacce libsas [last unloaded: hnae3]
> > [ 4407.076027] CPU: 48 PID: 610 Comm: kworker/48:1
> > [ 4407.093343] Workqueue: events page_pool_release_retry
> > [ 4407.098384] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
> > BTYPE=--)
> > [ 4407.105316] pc : iommu_get_dma_domain+0xc/0x20
> > [ 4407.109744] lr : iommu_dma_unmap_page+0x38/0xe8
> > [ 4407.114255] sp : ffff80008bacbc80
> > [ 4407.117554] x29: ffff80008bacbc80 x28: 0000000000000000 x27:
> > ffffc31806be7000
> > [ 4407.124659] x26: ffff2020002b6ac0 x25: 0000000000000000 x24:
> > 0000000000000002
> > [ 4407.131762] x23: 0000000000000022 x22: 0000000000001000 x21:
> > 00000000fcd7c000
> > [ 4407.138865] x20: ffff0020c9882800 x19: ffff0020856f60c8 x18:
> > ffff8000d3503c58
> > [ 4407.145968] x17: 0000000000000000 x16: 1fffe00419521061 x15:
> > 0000000000000001
> > [ 4407.153073] x14: 0000000000000003 x13: 00000401850ae012 x12:
> > 000006b10004e7fb
> > [ 4407.160177] x11: 0000000000000067 x10: 0000000000000c70 x9 :
> > ffffc3180405cd20
> > [ 4407.167280] x8 : fefefefefefefeff x7 : 0000000000000001 x6 :
> > 0000000000000010
> > [ 4407.174382] x5 : ffffc3180405cce8 x4 : 0000000000000022 x3 :
> > 0000000000000002
> > [ 4407.181485] x2 : 0000000000001000 x1 : 00000000fcd7c000 x0 :
> > 0000000000000000
> > [ 4407.188589] Call trace:
> > [ 4407.191027] iommu_get_dma_domain+0xc/0x20
> > [ 4407.195105] dma_unmap_page_attrs+0x38/0x1d0
> > [ 4407.199361] page_pool_return_page+0x48/0x180
> > [ 4407.203699] page_pool_release+0xd4/0x1f0
> > [ 4407.207692] page_pool_release_retry+0x28/0xe8
>
> I suspect that the DMA IOMMU part was deallocated and freed by the
> driver even-though page_pool still have inflight packets.
When you say driver, which 'driver' do you mean?
I suspect this could be because of the VF instance going away with
this cmd - disabling the vf: echo 0 >
/sys/class/net/eno1/device/sriov_numvfs - what do you think?
>
> The page_pool bumps refcnt via get_device() + put_device() on the DMA
> 'struct device', to avoid it going away, but I guess there is also some
> IOMMU code that we need to make sure doesn't go away (until all inflight
> pages are returned) ???
>
>
> > [ 4407.212119] process_one_work+0x164/0x3e0
> > [ 4407.216116] worker_thread+0x310/0x420
> > [ 4407.219851] kthread+0x120/0x130
> > [ 4407.223066] ret_from_fork+0x10/0x20
> > [ 4407.226630] Code: ffffc318 aa1e03e9 d503201f f9416c00 (f9405400)
> > [ 4407.232697] ---[ end trace 0000000000000000 ]---
> >
> >
> > The hns3 driver use page pool like this, just call once when the driver
> > initialize:
> >
> > static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
> > {
> > struct page_pool_params pp_params = {
> > .flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
> > PP_FLAG_DMA_SYNC_DEV,
> > .order = hns3_page_order(ring),
> > .pool_size = ring->desc_num * hns3_buf_size(ring) /
> > (PAGE_SIZE << hns3_page_order(ring)),
> > .nid = dev_to_node(ring_to_dev(ring)),
> > .dev = ring_to_dev(ring),
> > .dma_dir = DMA_FROM_DEVICE,
> > .offset = 0,
> > .max_len = PAGE_SIZE << hns3_page_order(ring),
> > };
> >
> > ring->page_pool = page_pool_create(&pp_params);
> > if (IS_ERR(ring->page_pool)) {
> > dev_warn(ring_to_dev(ring), "page pool creation failed: %ld\n",
> > PTR_ERR(ring->page_pool));
> > ring->page_pool = NULL;
> > }
> > }
> >
> > And call page_pool_destroy(ring->page_pool) when the driver uninitialized.
> >
> >
> > We use two devices, the net port connect directory, and the step of the
> > test case like below:
> >
> > 1. enable a vf of '7d:00.0': echo 1 >
> > /sys/class/net/eno1/device/sriov_numvfs
> >
> > 2. use iperf to produce some flows(the problem happens to the side which
> > runs 'iperf -s')
> >
> > 3. use ifconfig down/up to the vf
> >
> > 4. kill iperf
> >
> > 5. disable the vf: echo 0 > /sys/class/net/eno1/device/sriov_numvfs
> >
> > 6. run 1~5 with another port bd:00.0
> >
> > 7. repeat 1~6
> >
> >
> > And when running this test case, we can found another related message (I
> > replaced pr_warn() to dev_warn()):
> >
> > pci 0000:7d:01.0: page_pool_release_retry() stalled pool shutdown: id
> > 949, 98 inflight 1449 sec
> >
> >
> > Even when stop the traffic, stop the test case, disable the vf, this
> > message is still being printed.
> >
> > We must run the test case for about two hours to reproduce the problem.
> > Is there some advise to solve or debug the problem?
> >
>
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-07-31 8:42 ` Somnath Kotur
@ 2024-07-31 11:32 ` Yonglong Liu
2024-08-02 2:06 ` Yonglong Liu
2024-08-05 12:19 ` Yunsheng Lin
1 sibling, 1 reply; 21+ messages in thread
From: Yonglong Liu @ 2024-07-31 11:32 UTC (permalink / raw)
To: Somnath Kotur, Jesper Dangaard Brouer
Cc: David S. Miller, Jakub Kicinski, pabeni, ilias.apalodimas, netdev,
linux-kernel, Alexander Duyck, Alexei Starovoitov, linyunsheng,
shenjian (K), Salil Mehta
On 2024/7/31 16:42, Somnath Kotur wrote:
> On Tue, Jul 30, 2024 at 10:51 PM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>
>>
>> On 30/07/2024 15.08, Yonglong Liu wrote:
>>> I found a bug when running hns3 driver with page pool enabled, the log
>>> as below:
>>>
>>> [ 4406.956606] Unable to handle kernel NULL pointer dereference at
>>> virtual address 00000000000000a8
>> struct iommu_domain *iommu_get_dma_domain(struct device *dev)
>> {
>> return dev->iommu_group->default_domain;
>> }
>>
>> $ pahole -C iommu_group --hex | grep default_domain
>> struct iommu_domain * default_domain; /* 0xa8 0x8 */
>>
>> Looks like iommu_group is a NULL pointer (that when deref member
>> 'default_domain' cause this fault).
>>
>>
>>> [ 4406.965379] Mem abort info:
>>> [ 4406.968160] ESR = 0x0000000096000004
>>> [ 4406.971906] EC = 0x25: DABT (current EL), IL = 32 bits
>>> [ 4406.977218] SET = 0, FnV = 0
>>> [ 4406.980258] EA = 0, S1PTW = 0
>>> [ 4406.983404] FSC = 0x04: level 0 translation fault
>>> [ 4406.988273] Data abort info:
>>> [ 4406.991154] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>> [ 4406.996632] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>> [ 4407.001681] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>> [ 4407.006985] user pgtable: 4k pages, 48-bit VAs, pgdp=0000202828326000
>>> [ 4407.013430] [00000000000000a8] pgd=0000000000000000,
>>> p4d=0000000000000000
>>> [ 4407.020212] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>> [ 4407.026454] Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT
>>> nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle
>>> ip6table_filter ip6_tables hns_roce_hw_v2 hns3 hclge hnae3 xt_addrtype
>>> iptable_filter xt_conntrack overlay arm_spe_pmu arm_smmuv3_pmu
>>> hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_uncore_l3c_pmu
>>> hisi_uncore_pmu fuse rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
>>> scsi_transport_iscsi crct10dif_ce hisi_sec2 hisi_hpre hisi_zip
>>> hisi_sas_v3_hw xhci_pci sbsa_gwdt hisi_qm hisi_sas_main hisi_dma
>>> xhci_pci_renesas uacce libsas [last unloaded: hnae3]
>>> [ 4407.076027] CPU: 48 PID: 610 Comm: kworker/48:1
>>> [ 4407.093343] Workqueue: events page_pool_release_retry
>>> [ 4407.098384] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
>>> BTYPE=--)
>>> [ 4407.105316] pc : iommu_get_dma_domain+0xc/0x20
>>> [ 4407.109744] lr : iommu_dma_unmap_page+0x38/0xe8
>>> [ 4407.114255] sp : ffff80008bacbc80
>>> [ 4407.117554] x29: ffff80008bacbc80 x28: 0000000000000000 x27:
>>> ffffc31806be7000
>>> [ 4407.124659] x26: ffff2020002b6ac0 x25: 0000000000000000 x24:
>>> 0000000000000002
>>> [ 4407.131762] x23: 0000000000000022 x22: 0000000000001000 x21:
>>> 00000000fcd7c000
>>> [ 4407.138865] x20: ffff0020c9882800 x19: ffff0020856f60c8 x18:
>>> ffff8000d3503c58
>>> [ 4407.145968] x17: 0000000000000000 x16: 1fffe00419521061 x15:
>>> 0000000000000001
>>> [ 4407.153073] x14: 0000000000000003 x13: 00000401850ae012 x12:
>>> 000006b10004e7fb
>>> [ 4407.160177] x11: 0000000000000067 x10: 0000000000000c70 x9 :
>>> ffffc3180405cd20
>>> [ 4407.167280] x8 : fefefefefefefeff x7 : 0000000000000001 x6 :
>>> 0000000000000010
>>> [ 4407.174382] x5 : ffffc3180405cce8 x4 : 0000000000000022 x3 :
>>> 0000000000000002
>>> [ 4407.181485] x2 : 0000000000001000 x1 : 00000000fcd7c000 x0 :
>>> 0000000000000000
>>> [ 4407.188589] Call trace:
>>> [ 4407.191027] iommu_get_dma_domain+0xc/0x20
>>> [ 4407.195105] dma_unmap_page_attrs+0x38/0x1d0
>>> [ 4407.199361] page_pool_return_page+0x48/0x180
>>> [ 4407.203699] page_pool_release+0xd4/0x1f0
>>> [ 4407.207692] page_pool_release_retry+0x28/0xe8
>> I suspect that the DMA IOMMU part was deallocated and freed by the
>> driver even-though page_pool still have inflight packets.
> When you say driver, which 'driver' do you mean?
> I suspect this could be because of the VF instance going away with
> this cmd - disable the vf: echo 0 >
> /sys/class/net/eno1/device/sriov_numvfs, what do you think?
I found that this happens when the vf is enabled and running some
packets; below is more information:
[ 4391.596558] pci 0000:7d:01.0: page_pool_release_retry() stalled pool
shutdown: id 1145, 33 inflight 906 sec
[ 4397.111484] hns3 0000:bd:00.0: SRIOV setting: 0
[ 4397.118155] hns3 0000:bd:01.0 enp189s0f0v0: link down
[ 4397.416623] hns3 0000:bd:01.0: finished uninitializing hclgevf driver
[ 4397.480572] pci 0000:7d:01.0: page_pool_release_retry() stalled pool
shutdown: id 1279, 98 inflight 362 sec
[ 4400.948362] hns3 0000:7d:00.0: SRIOV setting: 1
[ 4401.060569] pci 0000:7d:01.0: [19e5:a22f] type 00 class 0x020000 PCIe
Endpoint
[ 4401.067795] pci 0000:7d:01.0: enabling Extended Tags
[ 4401.073090] hns3 0000:7d:01.0: Adding to iommu group 48
[ 4401.078494] hns3 0000:7d:01.0: enabling device (0000 -> 0002)
[ 4401.084348] hns3 0000:7d:01.0: The firmware version is 1.20.0.17
[ 4401.102911] hns3 0000:7d:01.0: finished initializing hclgevf driver
[ 4401.111212] hns3 0000:7d:01.0: using random MAC address da:**:**:**:a3:47
[ 4401.138375] hns3 0000:7d:01.0 eno1v0: renamed from eth0
[ 4403.939449] hns3 0000:7d:01.0 eno1v0: link up
[ 4403.940237] 8021q: adding VLAN 0 to HW filter on device eno1v0
[ 4406.956606] Unable to handle kernel NULL pointer dereference at
virtual address 00000000000000a8
another log:
[11550.197002] hns3 0000:bd:01.0 enp189s0f0v0: link up
[11550.197034] hns3 0000:bd:01.0 enp189s0f0v0: net open
[11550.206910] 8021q: adding VLAN 0 to HW filter on device enp189s0f0v0
[11564.872929] page_pool_release_retry() stalled pool shutdown: id 2330,
99 inflight 60 sec
[11568.353723] hns3 0000:bd:01.0 enp189s0f0v0: net stop
[11568.360723] hns3 0000:bd:01.0 enp189s0f0v0: link down
[11568.519899] hns3 0000:bd:01.0 enp189s0f0v0: link up
[11568.519935] hns3 0000:bd:01.0 enp189s0f0v0: net open
[11568.529840] 8021q: adding VLAN 0 to HW filter on device enp189s0f0v0
[11589.640930] page_pool_release_retry() stalled pool shutdown: id 1996,
50 inflight 2054 sec
[11592.554875] hns3 0000:bd:01.0 enp189s0f0v0: net stop
[11592.560930] hns3 0000:bd:01.0 enp189s0f0v0: link down
[11596.684935] pci 0000:7d:01.0: [19e5:a22f] type 00 class 0x020000 PCIe
Endpoint
[11596.692140] pci 0000:7d:01.0: enabling Extended Tags
[11596.697324] hns3 0000:7d:01.0: Adding to iommu group 48
[11596.702988] hns3 0000:7d:01.0: enabling device (0000 -> 0002)
[11596.708808] hns3 0000:7d:01.0: The firmware version is 1.20.0.17
[11596.727263] hns3 0000:7d:01.0: finished initializing hclgevf driver
[11596.735561] hns3 0000:7d:01.0: using random MAC address 72:**:**:**:55:d7
[11596.760621] hns3 0000:7d:01.0 eno1v0: renamed from eth0
[11599.545341] hns3 0000:7d:01.0 eno1v0: link up
[11599.545409] hns3 0000:7d:01.0 eno1v0: net open
[11599.554858] 8021q: adding VLAN 0 to HW filter on device eno1v0
[11608.908922] Unable to handle kernel NULL pointer dereference at
virtual address 00000000000000a8
>> The page_pool bumps refcnt via get_device() + put_device() on the DMA
>> 'struct device', to avoid it going away, but I guess there is also some
>> IOMMU code that we need to make sure doesn't go away (until all inflight
>> pages are returned) ???
>>
>>
>>> [ 4407.212119] process_one_work+0x164/0x3e0
>>> [ 4407.216116] worker_thread+0x310/0x420
>>> [ 4407.219851] kthread+0x120/0x130
>>> [ 4407.223066] ret_from_fork+0x10/0x20
>>> [ 4407.226630] Code: ffffc318 aa1e03e9 d503201f f9416c00 (f9405400)
>>> [ 4407.232697] ---[ end trace 0000000000000000 ]---
>>>
>>>
>>> The hns3 driver use page pool like this, just call once when the driver
>>> initialize:
>>>
>>> static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
>>> {
>>> struct page_pool_params pp_params = {
>>> .flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
>>> PP_FLAG_DMA_SYNC_DEV,
>>> .order = hns3_page_order(ring),
>>> .pool_size = ring->desc_num * hns3_buf_size(ring) /
>>> (PAGE_SIZE << hns3_page_order(ring)),
>>> .nid = dev_to_node(ring_to_dev(ring)),
>>> .dev = ring_to_dev(ring),
>>> .dma_dir = DMA_FROM_DEVICE,
>>> .offset = 0,
>>> .max_len = PAGE_SIZE << hns3_page_order(ring),
>>> };
>>>
>>> ring->page_pool = page_pool_create(&pp_params);
>>> if (IS_ERR(ring->page_pool)) {
>>> dev_warn(ring_to_dev(ring), "page pool creation failed: %ld\n",
>>> PTR_ERR(ring->page_pool));
>>> ring->page_pool = NULL;
>>> }
>>> }
>>>
>>> And call page_pool_destroy(ring->page_pool) when the driver uninitialized.
>>>
>>>
>>> We use two devices, the net port connect directory, and the step of the
>>> test case like below:
>>>
>>> 1. enable a vf of '7d:00.0': echo 1 >
>>> /sys/class/net/eno1/device/sriov_numvfs
>>>
>>> 2. use iperf to produce some flows(the problem happens to the side which
>>> runs 'iperf -s')
>>>
>>> 3. use ifconfig down/up to the vf
>>>
>>> 4. kill iperf
>>>
>>> 5. disable the vf: echo 0 > /sys/class/net/eno1/device/sriov_numvfs
>>>
>>> 6. run 1~5 with another port bd:00.0
>>>
>>> 7. repeat 1~6
>>>
>>>
>>> And when running this test case, we can found another related message (I
>>> replaced pr_warn() to dev_warn()):
>>>
>>> pci 0000:7d:01.0: page_pool_release_retry() stalled pool shutdown: id
>>> 949, 98 inflight 1449 sec
>>>
>>>
>>> Even when stop the traffic, stop the test case, disable the vf, this
>>> message is still being printed.
>>>
>>> We must run the test case for about two hours to reproduce the problem.
>>> Is there some advise to solve or debug the problem?
>>>
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-07-31 11:32 ` Yonglong Liu
@ 2024-08-02 2:06 ` Yonglong Liu
2024-08-06 9:51 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 21+ messages in thread
From: Yonglong Liu @ 2024-08-02 2:06 UTC (permalink / raw)
To: Somnath Kotur, Jesper Dangaard Brouer
Cc: David S. Miller, Jakub Kicinski, pabeni, ilias.apalodimas, netdev,
linux-kernel, Alexander Duyck, Alexei Starovoitov, linyunsheng,
shenjian (K), Salil Mehta
On 2024/7/31 19:32, Yonglong Liu wrote:
>
> On 2024/7/31 16:42, Somnath Kotur wrote:
>> On Tue, Jul 30, 2024 at 10:51 PM Jesper Dangaard Brouer
>> <hawk@kernel.org> wrote:
>>>
>>>
>>> On 30/07/2024 15.08, Yonglong Liu wrote:
>>>> I found a bug when running hns3 driver with page pool enabled, the log
>>>> as below:
>>>>
>>>> [ 4406.956606] Unable to handle kernel NULL pointer dereference at
>>>> virtual address 00000000000000a8
>>> struct iommu_domain *iommu_get_dma_domain(struct device *dev)
>>> {
>>> return dev->iommu_group->default_domain;
>>> }
>>>
>>> $ pahole -C iommu_group --hex | grep default_domain
>>> struct iommu_domain * default_domain; /* 0xa8 0x8 */
>>>
>>> Looks like iommu_group is a NULL pointer (that when deref member
>>> 'default_domain' cause this fault).
>>>
>>>
>>>> [ 4406.965379] Mem abort info:
>>>> [ 4406.968160] ESR = 0x0000000096000004
>>>> [ 4406.971906] EC = 0x25: DABT (current EL), IL = 32 bits
>>>> [ 4406.977218] SET = 0, FnV = 0
>>>> [ 4406.980258] EA = 0, S1PTW = 0
>>>> [ 4406.983404] FSC = 0x04: level 0 translation fault
>>>> [ 4406.988273] Data abort info:
>>>> [ 4406.991154] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>>> [ 4406.996632] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>>> [ 4407.001681] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>>> [ 4407.006985] user pgtable: 4k pages, 48-bit VAs,
>>>> pgdp=0000202828326000
>>>> [ 4407.013430] [00000000000000a8] pgd=0000000000000000,
>>>> p4d=0000000000000000
>>>> [ 4407.020212] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>>> [ 4407.026454] Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT
>>>> nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle
>>>> ip6table_filter ip6_tables hns_roce_hw_v2 hns3 hclge hnae3 xt_addrtype
>>>> iptable_filter xt_conntrack overlay arm_spe_pmu arm_smmuv3_pmu
>>>> hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_uncore_l3c_pmu
>>>> hisi_uncore_pmu fuse rpcrdma ib_isert iscsi_target_mod ib_iser
>>>> libiscsi
>>>> scsi_transport_iscsi crct10dif_ce hisi_sec2 hisi_hpre hisi_zip
>>>> hisi_sas_v3_hw xhci_pci sbsa_gwdt hisi_qm hisi_sas_main hisi_dma
>>>> xhci_pci_renesas uacce libsas [last unloaded: hnae3]
>>>> [ 4407.076027] CPU: 48 PID: 610 Comm: kworker/48:1
>>>> [ 4407.093343] Workqueue: events page_pool_release_retry
>>>> [ 4407.098384] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
>>>> BTYPE=--)
>>>> [ 4407.105316] pc : iommu_get_dma_domain+0xc/0x20
>>>> [ 4407.109744] lr : iommu_dma_unmap_page+0x38/0xe8
>>>> [ 4407.114255] sp : ffff80008bacbc80
>>>> [ 4407.117554] x29: ffff80008bacbc80 x28: 0000000000000000 x27:
>>>> ffffc31806be7000
>>>> [ 4407.124659] x26: ffff2020002b6ac0 x25: 0000000000000000 x24:
>>>> 0000000000000002
>>>> [ 4407.131762] x23: 0000000000000022 x22: 0000000000001000 x21:
>>>> 00000000fcd7c000
>>>> [ 4407.138865] x20: ffff0020c9882800 x19: ffff0020856f60c8 x18:
>>>> ffff8000d3503c58
>>>> [ 4407.145968] x17: 0000000000000000 x16: 1fffe00419521061 x15:
>>>> 0000000000000001
>>>> [ 4407.153073] x14: 0000000000000003 x13: 00000401850ae012 x12:
>>>> 000006b10004e7fb
>>>> [ 4407.160177] x11: 0000000000000067 x10: 0000000000000c70 x9 :
>>>> ffffc3180405cd20
>>>> [ 4407.167280] x8 : fefefefefefefeff x7 : 0000000000000001 x6 :
>>>> 0000000000000010
>>>> [ 4407.174382] x5 : ffffc3180405cce8 x4 : 0000000000000022 x3 :
>>>> 0000000000000002
>>>> [ 4407.181485] x2 : 0000000000001000 x1 : 00000000fcd7c000 x0 :
>>>> 0000000000000000
>>>> [ 4407.188589] Call trace:
>>>> [ 4407.191027] iommu_get_dma_domain+0xc/0x20
>>>> [ 4407.195105] dma_unmap_page_attrs+0x38/0x1d0
>>>> [ 4407.199361] page_pool_return_page+0x48/0x180
>>>> [ 4407.203699] page_pool_release+0xd4/0x1f0
>>>> [ 4407.207692] page_pool_release_retry+0x28/0xe8
>>> I suspect that the DMA IOMMU part was deallocated and freed by the
>>> driver even-though page_pool still have inflight packets.
>> When you say driver, which 'driver' do you mean?
>> I suspect this could be because of the VF instance going away with
>> this cmd - disable the vf: echo 0 >
>> /sys/class/net/eno1/device/sriov_numvfs, what do you think?
>
> I found that this happen when the vf enabled and running some packets,
> below is more infomation:
>
>
> [ 4391.596558] pci 0000:7d:01.0: page_pool_release_retry() stalled
> pool shutdown: id 1145, 33 inflight 906 sec
> [ 4397.111484] hns3 0000:bd:00.0: SRIOV setting: 0
> [ 4397.118155] hns3 0000:bd:01.0 enp189s0f0v0: link down
> [ 4397.416623] hns3 0000:bd:01.0: finished uninitializing hclgevf driver
> [ 4397.480572] pci 0000:7d:01.0: page_pool_release_retry() stalled
> pool shutdown: id 1279, 98 inflight 362 sec
> [ 4400.948362] hns3 0000:7d:00.0: SRIOV setting: 1
> [ 4401.060569] pci 0000:7d:01.0: [19e5:a22f] type 00 class 0x020000
> PCIe Endpoint
> [ 4401.067795] pci 0000:7d:01.0: enabling Extended Tags
> [ 4401.073090] hns3 0000:7d:01.0: Adding to iommu group 48
> [ 4401.078494] hns3 0000:7d:01.0: enabling device (0000 -> 0002)
> [ 4401.084348] hns3 0000:7d:01.0: The firmware version is 1.20.0.17
> [ 4401.102911] hns3 0000:7d:01.0: finished initializing hclgevf driver
> [ 4401.111212] hns3 0000:7d:01.0: using random MAC address
> da:**:**:**:a3:47
> [ 4401.138375] hns3 0000:7d:01.0 eno1v0: renamed from eth0
> [ 4403.939449] hns3 0000:7d:01.0 eno1v0: link up
> [ 4403.940237] 8021q: adding VLAN 0 to HW filter on device eno1v0
> [ 4406.956606] Unable to handle kernel NULL pointer dereference at
> virtual address 00000000000000a8
>
>
> another log:
>
> [11550.197002] hns3 0000:bd:01.0 enp189s0f0v0: link up
> [11550.197034] hns3 0000:bd:01.0 enp189s0f0v0: net open
> [11550.206910] 8021q: adding VLAN 0 to HW filter on device enp189s0f0v0
> [11564.872929] page_pool_release_retry() stalled pool shutdown: id
> 2330, 99 inflight 60 sec
> [11568.353723] hns3 0000:bd:01.0 enp189s0f0v0: net stop
> [11568.360723] hns3 0000:bd:01.0 enp189s0f0v0: link down
> [11568.519899] hns3 0000:bd:01.0 enp189s0f0v0: link up
> [11568.519935] hns3 0000:bd:01.0 enp189s0f0v0: net open
> [11568.529840] 8021q: adding VLAN 0 to HW filter on device enp189s0f0v0
> [11589.640930] page_pool_release_retry() stalled pool shutdown: id
> 1996, 50 inflight 2054 sec
> [11592.554875] hns3 0000:bd:01.0 enp189s0f0v0: net stop
> [11592.560930] hns3 0000:bd:01.0 enp189s0f0v0: link down
> [11596.684935] pci 0000:7d:01.0: [19e5:a22f] type 00 class 0x020000
> PCIe Endpoint
> [11596.692140] pci 0000:7d:01.0: enabling Extended Tags
> [11596.697324] hns3 0000:7d:01.0: Adding to iommu group 48
> [11596.702988] hns3 0000:7d:01.0: enabling device (0000 -> 0002)
> [11596.708808] hns3 0000:7d:01.0: The firmware version is 1.20.0.17
> [11596.727263] hns3 0000:7d:01.0: finished initializing hclgevf driver
> [11596.735561] hns3 0000:7d:01.0: using random MAC address
> 72:**:**:**:55:d7
> [11596.760621] hns3 0000:7d:01.0 eno1v0: renamed from eth0
> [11599.545341] hns3 0000:7d:01.0 eno1v0: link up
> [11599.545409] hns3 0000:7d:01.0 eno1v0: net open
> [11599.554858] 8021q: adding VLAN 0 to HW filter on device eno1v0
> [11608.908922] Unable to handle kernel NULL pointer dereference at
> virtual address 00000000000000a8
>
I added more debug info, and found that:
[ 4573.356891] pci 0000:7d:01.0: page_pool_release_retry() stalled pool
shutdown: id 1046, 82 inflight 60 sec, iommu_group=0x0
The iommu_group gets released whether the page_pool is still using it or
not, so once page_pool_return_page() is called (why does this occur when
the device is reloaded and packets are transmitted?), this crash will
happen.
I tried the following patch, but it doesn't work :(
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index f4444b4e39e6..d03a87407ca8 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -21,6 +21,7 @@
 #include <linux/poison.h>
 #include <linux/ethtool.h>
 #include <linux/netdevice.h>
+#include <linux/iommu.h>

 #include <trace/events/page_pool.h>

@@ -306,6 +307,9 @@ page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
 	if (err)
 		goto err_uninit;

+	if (pool->dma_map)
+		iommu_group_get(pool->p.dev);
+
 	return pool;

 err_uninit:
@@ -974,8 +978,11 @@ static int page_pool_release(struct page_pool *pool)
 	page_pool_scrub(pool);
 	inflight = page_pool_inflight(pool, true);
-	if (!inflight)
+	if (!inflight) {
 		__page_pool_destroy(pool);
+		if (pool->dma_map)
+			iommu_group_put(pool->p.dev->iommu_group);
+	}

 	return inflight;
 }
>>> The page_pool bumps refcnt via get_device() + put_device() on the DMA
>>> 'struct device', to avoid it going away, but I guess there is also some
>>> IOMMU code that we need to make sure doesn't go away (until all
>>> inflight
>>> pages are returned) ???
>>>
>>>
>>>> [ 4407.212119] process_one_work+0x164/0x3e0
>>>> [ 4407.216116] worker_thread+0x310/0x420
>>>> [ 4407.219851] kthread+0x120/0x130
>>>> [ 4407.223066] ret_from_fork+0x10/0x20
>>>> [ 4407.226630] Code: ffffc318 aa1e03e9 d503201f f9416c00 (f9405400)
>>>> [ 4407.232697] ---[ end trace 0000000000000000 ]---
>>>>
>>>>
>>>> The hns3 driver use page pool like this, just call once when the
>>>> driver
>>>> initialize:
>>>>
>>>> static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
>>>> {
>>>> struct page_pool_params pp_params = {
>>>> .flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
>>>> PP_FLAG_DMA_SYNC_DEV,
>>>> .order = hns3_page_order(ring),
>>>> .pool_size = ring->desc_num * hns3_buf_size(ring) /
>>>> (PAGE_SIZE << hns3_page_order(ring)),
>>>> .nid = dev_to_node(ring_to_dev(ring)),
>>>> .dev = ring_to_dev(ring),
>>>> .dma_dir = DMA_FROM_DEVICE,
>>>> .offset = 0,
>>>> .max_len = PAGE_SIZE << hns3_page_order(ring),
>>>> };
>>>>
>>>> ring->page_pool = page_pool_create(&pp_params);
>>>> if (IS_ERR(ring->page_pool)) {
>>>> dev_warn(ring_to_dev(ring), "page pool creation failed:
>>>> %ld\n",
>>>> PTR_ERR(ring->page_pool));
>>>> ring->page_pool = NULL;
>>>> }
>>>> }
>>>>
>>>> And call page_pool_destroy(ring->page_pool) when the driver
>>>> uninitialized.
>>>>
>>>>
>>>> We use two devices, the net port connect directory, and the step of
>>>> the
>>>> test case like below:
>>>>
>>>> 1. enable a vf of '7d:00.0': echo 1 >
>>>> /sys/class/net/eno1/device/sriov_numvfs
>>>>
>>>> 2. use iperf to produce some flows(the problem happens to the side
>>>> which
>>>> runs 'iperf -s')
>>>>
>>>> 3. use ifconfig down/up to the vf
>>>>
>>>> 4. kill iperf
>>>>
>>>> 5. disable the vf: echo 0 > /sys/class/net/eno1/device/sriov_numvfs
>>>>
>>>> 6. run 1~5 with another port bd:00.0
>>>>
>>>> 7. repeat 1~6
>>>>
>>>>
>>>> And when running this test case, we can found another related
>>>> message (I
>>>> replaced pr_warn() to dev_warn()):
>>>>
>>>> pci 0000:7d:01.0: page_pool_release_retry() stalled pool shutdown: id
>>>> 949, 98 inflight 1449 sec
>>>>
>>>>
>>>> Even when stop the traffic, stop the test case, disable the vf, this
>>>> message is still being printed.
>>>>
>>>> We must run the test case for about two hours to reproduce the
>>>> problem.
>>>> Is there some advise to solve or debug the problem?
>>>>
>
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-07-30 13:08 ` [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20 Yonglong Liu
2024-07-30 17:21 ` Jesper Dangaard Brouer
@ 2024-08-02 16:38 ` Alexander Duyck
2024-08-05 12:50 ` Yunsheng Lin
1 sibling, 1 reply; 21+ messages in thread
From: Alexander Duyck @ 2024-08-02 16:38 UTC (permalink / raw)
To: Yonglong Liu
Cc: David S. Miller, Jakub Kicinski, pabeni, hawk, ilias.apalodimas,
netdev, linux-kernel, Alexei Starovoitov, linyunsheng,
shenjian (K), Salil Mehta
On Tue, Jul 30, 2024 at 6:08 AM Yonglong Liu <liuyonglong@huawei.com> wrote:
>
> I found a bug when running hns3 driver with page pool enabled, the log
> as below:
>
> [ 4406.956606] Unable to handle kernel NULL pointer dereference at
> virtual address 00000000000000a8
> [ 4406.965379] Mem abort info:
> [ 4406.968160] ESR = 0x0000000096000004
> [ 4406.971906] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 4406.977218] SET = 0, FnV = 0
> [ 4406.980258] EA = 0, S1PTW = 0
> [ 4406.983404] FSC = 0x04: level 0 translation fault
> [ 4406.988273] Data abort info:
> [ 4406.991154] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> [ 4406.996632] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [ 4407.001681] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [ 4407.006985] user pgtable: 4k pages, 48-bit VAs, pgdp=0000202828326000
> [ 4407.013430] [00000000000000a8] pgd=0000000000000000, p4d=0000000000000000
> [ 4407.020212] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
> [ 4407.026454] Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT
> nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle
> ip6table_filter ip6_tables hns_roce_hw_v2 hns3 hclge hnae3 xt_addrtype
> iptable_filter xt_conntrack overlay arm_spe_pmu arm_smmuv3_pmu
> hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_uncore_l3c_pmu
> hisi_uncore_pmu fuse rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
> scsi_transport_iscsi crct10dif_ce hisi_sec2 hisi_hpre hisi_zip
> hisi_sas_v3_hw xhci_pci sbsa_gwdt hisi_qm hisi_sas_main hisi_dma
> xhci_pci_renesas uacce libsas [last unloaded: hnae3]
> [ 4407.076027] CPU: 48 PID: 610 Comm: kworker/48:1
> [ 4407.093343] Workqueue: events page_pool_release_retry
> [ 4407.098384] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
> BTYPE=--)
> [ 4407.105316] pc : iommu_get_dma_domain+0xc/0x20
> [ 4407.109744] lr : iommu_dma_unmap_page+0x38/0xe8
> [ 4407.114255] sp : ffff80008bacbc80
> [ 4407.117554] x29: ffff80008bacbc80 x28: 0000000000000000 x27:
> ffffc31806be7000
> [ 4407.124659] x26: ffff2020002b6ac0 x25: 0000000000000000 x24:
> 0000000000000002
> [ 4407.131762] x23: 0000000000000022 x22: 0000000000001000 x21:
> 00000000fcd7c000
> [ 4407.138865] x20: ffff0020c9882800 x19: ffff0020856f60c8 x18:
> ffff8000d3503c58
> [ 4407.145968] x17: 0000000000000000 x16: 1fffe00419521061 x15:
> 0000000000000001
> [ 4407.153073] x14: 0000000000000003 x13: 00000401850ae012 x12:
> 000006b10004e7fb
> [ 4407.160177] x11: 0000000000000067 x10: 0000000000000c70 x9 :
> ffffc3180405cd20
> [ 4407.167280] x8 : fefefefefefefeff x7 : 0000000000000001 x6 :
> 0000000000000010
> [ 4407.174382] x5 : ffffc3180405cce8 x4 : 0000000000000022 x3 :
> 0000000000000002
> [ 4407.181485] x2 : 0000000000001000 x1 : 00000000fcd7c000 x0 :
> 0000000000000000
> [ 4407.188589] Call trace:
> [ 4407.191027] iommu_get_dma_domain+0xc/0x20
> [ 4407.195105] dma_unmap_page_attrs+0x38/0x1d0
> [ 4407.199361] page_pool_return_page+0x48/0x180
> [ 4407.203699] page_pool_release+0xd4/0x1f0
> [ 4407.207692] page_pool_release_retry+0x28/0xe8
> [ 4407.212119] process_one_work+0x164/0x3e0
> [ 4407.216116] worker_thread+0x310/0x420
> [ 4407.219851] kthread+0x120/0x130
> [ 4407.223066] ret_from_fork+0x10/0x20
> [ 4407.226630] Code: ffffc318 aa1e03e9 d503201f f9416c00 (f9405400)
> [ 4407.232697] ---[ end trace 0000000000000000 ]---
The issue as I see it is that we aren't unmapping the pages when we
call page_pool_destroy. There must be no pages remaining that still need
a DMA unmapping *after* that is called. Otherwise we will see this
issue regularly.
What we probably need to look at doing is beefing up page_pool_release
to add a step that will take an additional reference on the inflight
pages, then call __page_pool_put_page to switch them to a reference
counted page.
Seems like the worst case scenario is that we are talking about having
to walk the page table to do the above for any inflight pages but it
would certainly be a much more deterministic amount of time needed to
do that versus waiting on a page that may or may not return.
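A very rough sketch of that idea (every identifier below that does not
appear elsewhere in this thread is hypothetical; page_pool does not track
individual inflight pages today, so this only shows the shape of the
proposal):

/* Hypothetical: convert every page still owned by this pool into a
 * plain refcounted page while pool->p.dev is guaranteed to be alive. */
static void page_pool_evict_inflight(struct page_pool *pool)
{
	struct page *page;

	page_pool_for_each_inflight(pool, page) {	/* hypothetical walker */
		get_page(page);		/* extra ref so the page outlives us */
		/* unmap the DMA and sever the page's link to the pool, so
		 * the skb's eventual put_page() frees it the normal way */
		page_pool_detach_page(pool, page);	/* hypothetical helper */
		put_page(page);
	}
}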
Alternatively, a quick hack that would probably also address this would
be to clear pool->dma_map in page_pool_destroy or maybe in
page_pool_unreg_netdev so that any of those residual mappings would
essentially get leaked, but we wouldn't have to worry about trying to
unmap while the device doesn't exist.
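Expressed as code the quick hack is tiny; a sketch (where exactly to
clear the flag, and how to close the window against an unmap already in
flight, is hand-waved here):

void page_pool_destroy(struct page_pool *pool)
{
	/* ... existing shutdown steps ... */

	/* Quick hack: make the page return path skip dma_unmap_page_attrs()
	 * for any still-inflight page; their IOVA mappings are knowingly
	 * leaked instead of being unmapped against a dead device. */
	pool->dma_map = false;
}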
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-07-31 8:42 ` Somnath Kotur
2024-07-31 11:32 ` Yonglong Liu
@ 2024-08-05 12:19 ` Yunsheng Lin
2024-08-05 12:53 ` Robin Murphy
2024-08-06 13:35 ` Niklas Schnelle
1 sibling, 2 replies; 21+ messages in thread
From: Yunsheng Lin @ 2024-08-05 12:19 UTC (permalink / raw)
To: Somnath Kotur, Jesper Dangaard Brouer
Cc: Yonglong Liu, David S. Miller, Jakub Kicinski, pabeni,
ilias.apalodimas, netdev, linux-kernel, Alexander Duyck,
Alexei Starovoitov, shenjian (K), Salil Mehta, joro, will,
robin.murphy, iommu
On 2024/7/31 16:42, Somnath Kotur wrote:
> On Tue, Jul 30, 2024 at 10:51 PM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>
+cc iommu maintainers and list
>>
>> On 30/07/2024 15.08, Yonglong Liu wrote:
>>> I found a bug when running hns3 driver with page pool enabled, the log
>>> as below:
>>>
>>> [ 4406.956606] Unable to handle kernel NULL pointer dereference at
>>> virtual address 00000000000000a8
>>
>> struct iommu_domain *iommu_get_dma_domain(struct device *dev)
>> {
>> return dev->iommu_group->default_domain;
>> }
>>
>> $ pahole -C iommu_group --hex | grep default_domain
>> struct iommu_domain * default_domain; /* 0xa8 0x8 */
>>
>> Looks like iommu_group is a NULL pointer (that when deref member
>> 'default_domain' cause this fault).
>>
>>
>>> [ 4406.965379] Mem abort info:
>>> [ 4406.968160] ESR = 0x0000000096000004
>>> [ 4406.971906] EC = 0x25: DABT (current EL), IL = 32 bits
>>> [ 4406.977218] SET = 0, FnV = 0
>>> [ 4406.980258] EA = 0, S1PTW = 0
>>> [ 4406.983404] FSC = 0x04: level 0 translation fault
>>> [ 4406.988273] Data abort info:
>>> [ 4406.991154] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>> [ 4406.996632] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>> [ 4407.001681] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>> [ 4407.006985] user pgtable: 4k pages, 48-bit VAs, pgdp=0000202828326000
>>> [ 4407.013430] [00000000000000a8] pgd=0000000000000000,
>>> p4d=0000000000000000
>>> [ 4407.020212] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>> [ 4407.026454] Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT
>>> nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle
>>> ip6table_filter ip6_tables hns_roce_hw_v2 hns3 hclge hnae3 xt_addrtype
>>> iptable_filter xt_conntrack overlay arm_spe_pmu arm_smmuv3_pmu
>>> hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_uncore_l3c_pmu
>>> hisi_uncore_pmu fuse rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
>>> scsi_transport_iscsi crct10dif_ce hisi_sec2 hisi_hpre hisi_zip
>>> hisi_sas_v3_hw xhci_pci sbsa_gwdt hisi_qm hisi_sas_main hisi_dma
>>> xhci_pci_renesas uacce libsas [last unloaded: hnae3]
>>> [ 4407.076027] CPU: 48 PID: 610 Comm: kworker/48:1
>>> [ 4407.093343] Workqueue: events page_pool_release_retry
>>> [ 4407.098384] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
>>> BTYPE=--)
>>> [ 4407.105316] pc : iommu_get_dma_domain+0xc/0x20
>>> [ 4407.109744] lr : iommu_dma_unmap_page+0x38/0xe8
>>> [ 4407.114255] sp : ffff80008bacbc80
>>> [ 4407.117554] x29: ffff80008bacbc80 x28: 0000000000000000 x27:
>>> ffffc31806be7000
>>> [ 4407.124659] x26: ffff2020002b6ac0 x25: 0000000000000000 x24:
>>> 0000000000000002
>>> [ 4407.131762] x23: 0000000000000022 x22: 0000000000001000 x21:
>>> 00000000fcd7c000
>>> [ 4407.138865] x20: ffff0020c9882800 x19: ffff0020856f60c8 x18:
>>> ffff8000d3503c58
>>> [ 4407.145968] x17: 0000000000000000 x16: 1fffe00419521061 x15:
>>> 0000000000000001
>>> [ 4407.153073] x14: 0000000000000003 x13: 00000401850ae012 x12:
>>> 000006b10004e7fb
>>> [ 4407.160177] x11: 0000000000000067 x10: 0000000000000c70 x9 :
>>> ffffc3180405cd20
>>> [ 4407.167280] x8 : fefefefefefefeff x7 : 0000000000000001 x6 :
>>> 0000000000000010
>>> [ 4407.174382] x5 : ffffc3180405cce8 x4 : 0000000000000022 x3 :
>>> 0000000000000002
>>> [ 4407.181485] x2 : 0000000000001000 x1 : 00000000fcd7c000 x0 :
>>> 0000000000000000
>>> [ 4407.188589] Call trace:
>>> [ 4407.191027] iommu_get_dma_domain+0xc/0x20
>>> [ 4407.195105] dma_unmap_page_attrs+0x38/0x1d0
>>> [ 4407.199361] page_pool_return_page+0x48/0x180
>>> [ 4407.203699] page_pool_release+0xd4/0x1f0
>>> [ 4407.207692] page_pool_release_retry+0x28/0xe8
>>
>> I suspect that the DMA IOMMU part was deallocated and freed by the
>> driver even-though page_pool still have inflight packets.
> When you say driver, which 'driver' do you mean?
> I suspect this could be because of the VF instance going away with
> this cmd - disable the vf: echo 0 >
> /sys/class/net/eno1/device/sriov_numvfs, what do you think?
>>
>> The page_pool bumps refcnt via get_device() + put_device() on the DMA
>> 'struct device', to avoid it going away, but I guess there is also some
>> IOMMU code that we need to make sure doesn't go away (until all inflight
>> pages are returned) ???
I guess the above is why things went wrong here; the question is which
IOMMU code needs to be called here to stop them from going away.
What I am also curious about: there should be a pool of allocated iovas
in the iommu corresponding to the in-flight pages of the page_pool;
shouldn't the iommu wait for those allocated iovas to be freed, similarly
to what page_pool does for its in-flight pages?
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-08-02 16:38 ` Alexander Duyck
@ 2024-08-05 12:50 ` Yunsheng Lin
2024-08-06 0:39 ` Alexander Duyck
0 siblings, 1 reply; 21+ messages in thread
From: Yunsheng Lin @ 2024-08-05 12:50 UTC (permalink / raw)
To: Alexander Duyck, Yonglong Liu
Cc: David S. Miller, Jakub Kicinski, pabeni, hawk, ilias.apalodimas,
netdev, linux-kernel, Alexei Starovoitov, shenjian (K),
Salil Mehta, iommu
On 2024/8/3 0:38, Alexander Duyck wrote:
...
>
> The issue as I see it is that we aren't unmapping the pages when we
> call page_pool_destroy. There need to be no pages remaining with a DMA
> unmapping needed *after* that is called. Otherwise we will see this
> issue regularly.
>
> What we probably need to look at doing is beefing up page_pool_release
> to add a step that will take an additional reference on the inflight
> pages, then call __page_pool_put_page to switch them to a reference
> counted page.
I am not sure I understand what you meant; did you mean making
page_pool_destroy() synchronously wait for all the in-flight pages to
come back before returning to the driver?
>
> Seems like the worst case scenario is that we are talking about having
> to walk the page table to do the above for any inflight pages but it
Which page table are we talking about here?
> would certainly be a much more deterministic amount of time needed to
> do that versus waiting on a page that may or may not return.
>
> Alternatively a quick hack that would probably also address this would
> be to clear poll->dma_map in page_pool_destroy or maybe in
It seems we may need to clear pool->dma_sync too, and there may be some
time window between the clearing and the checking/dma_unmap?
> page_pool_unreg_netdev so that any of those residual mappings would
> essentially get leaked, but we wouldn't have to worry about trying to
> unmap while the device doesn't exist.
But how does the page_pool know whether this is the normal unloading
process without VF disabling, where the device still exists, or the
abnormal one caused by VF disabling, where the device will disappear? If
it is the former, does skipping some of the iommu calls cause a resource
leak in the iommu?
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-08-05 12:19 ` Yunsheng Lin
@ 2024-08-05 12:53 ` Robin Murphy
2024-08-06 11:54 ` Yunsheng Lin
2024-08-06 13:35 ` Niklas Schnelle
1 sibling, 1 reply; 21+ messages in thread
From: Robin Murphy @ 2024-08-05 12:53 UTC (permalink / raw)
To: Yunsheng Lin, Somnath Kotur, Jesper Dangaard Brouer
Cc: Yonglong Liu, David S. Miller, Jakub Kicinski, pabeni,
ilias.apalodimas, netdev, linux-kernel, Alexander Duyck,
Alexei Starovoitov, shenjian (K), Salil Mehta, joro, will, iommu
On 05/08/2024 1:19 pm, Yunsheng Lin wrote:
> On 2024/7/31 16:42, Somnath Kotur wrote:
>> On Tue, Jul 30, 2024 at 10:51 PM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>>
>
> +cc iommu maintainers and list
>
>>>
>>> On 30/07/2024 15.08, Yonglong Liu wrote:
>>>> I found a bug when running hns3 driver with page pool enabled, the log
>>>> as below:
>>>>
>>>> [ 4406.956606] Unable to handle kernel NULL pointer dereference at
>>>> virtual address 00000000000000a8
>>>
>>> struct iommu_domain *iommu_get_dma_domain(struct device *dev)
>>> {
>>> return dev->iommu_group->default_domain;
>>> }
>>>
>>> $ pahole -C iommu_group --hex | grep default_domain
>>> struct iommu_domain * default_domain; /* 0xa8 0x8 */
>>>
>>> Looks like iommu_group is a NULL pointer (that when deref member
>>> 'default_domain' cause this fault).
>>>
>>>
>>>> [ 4406.965379] Mem abort info:
>>>> [ 4406.968160] ESR = 0x0000000096000004
>>>> [ 4406.971906] EC = 0x25: DABT (current EL), IL = 32 bits
>>>> [ 4406.977218] SET = 0, FnV = 0
>>>> [ 4406.980258] EA = 0, S1PTW = 0
>>>> [ 4406.983404] FSC = 0x04: level 0 translation fault
>>>> [ 4406.988273] Data abort info:
>>>> [ 4406.991154] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>>> [ 4406.996632] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>>> [ 4407.001681] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>>> [ 4407.006985] user pgtable: 4k pages, 48-bit VAs, pgdp=0000202828326000
>>>> [ 4407.013430] [00000000000000a8] pgd=0000000000000000,
>>>> p4d=0000000000000000
>>>> [ 4407.020212] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>>> [ 4407.026454] Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT
>>>> nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle
>>>> ip6table_filter ip6_tables hns_roce_hw_v2 hns3 hclge hnae3 xt_addrtype
>>>> iptable_filter xt_conntrack overlay arm_spe_pmu arm_smmuv3_pmu
>>>> hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_uncore_l3c_pmu
>>>> hisi_uncore_pmu fuse rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
>>>> scsi_transport_iscsi crct10dif_ce hisi_sec2 hisi_hpre hisi_zip
>>>> hisi_sas_v3_hw xhci_pci sbsa_gwdt hisi_qm hisi_sas_main hisi_dma
>>>> xhci_pci_renesas uacce libsas [last unloaded: hnae3]
>>>> [ 4407.076027] CPU: 48 PID: 610 Comm: kworker/48:1
>>>> [ 4407.093343] Workqueue: events page_pool_release_retry
>>>> [ 4407.098384] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
>>>> BTYPE=--)
>>>> [ 4407.105316] pc : iommu_get_dma_domain+0xc/0x20
>>>> [ 4407.109744] lr : iommu_dma_unmap_page+0x38/0xe8
>>>> [ 4407.114255] sp : ffff80008bacbc80
>>>> [ 4407.117554] x29: ffff80008bacbc80 x28: 0000000000000000 x27:
>>>> ffffc31806be7000
>>>> [ 4407.124659] x26: ffff2020002b6ac0 x25: 0000000000000000 x24:
>>>> 0000000000000002
>>>> [ 4407.131762] x23: 0000000000000022 x22: 0000000000001000 x21:
>>>> 00000000fcd7c000
>>>> [ 4407.138865] x20: ffff0020c9882800 x19: ffff0020856f60c8 x18:
>>>> ffff8000d3503c58
>>>> [ 4407.145968] x17: 0000000000000000 x16: 1fffe00419521061 x15:
>>>> 0000000000000001
>>>> [ 4407.153073] x14: 0000000000000003 x13: 00000401850ae012 x12:
>>>> 000006b10004e7fb
>>>> [ 4407.160177] x11: 0000000000000067 x10: 0000000000000c70 x9 :
>>>> ffffc3180405cd20
>>>> [ 4407.167280] x8 : fefefefefefefeff x7 : 0000000000000001 x6 :
>>>> 0000000000000010
>>>> [ 4407.174382] x5 : ffffc3180405cce8 x4 : 0000000000000022 x3 :
>>>> 0000000000000002
>>>> [ 4407.181485] x2 : 0000000000001000 x1 : 00000000fcd7c000 x0 :
>>>> 0000000000000000
>>>> [ 4407.188589] Call trace:
>>>> [ 4407.191027] iommu_get_dma_domain+0xc/0x20
>>>> [ 4407.195105] dma_unmap_page_attrs+0x38/0x1d0
>>>> [ 4407.199361] page_pool_return_page+0x48/0x180
>>>> [ 4407.203699] page_pool_release+0xd4/0x1f0
>>>> [ 4407.207692] page_pool_release_retry+0x28/0xe8
>>>
>>> I suspect that the DMA IOMMU part was deallocated and freed by the
>>> driver even though page_pool still has inflight packets.
>> When you say driver, which 'driver' do you mean?
>> I suspect this could be because of the VF instance going away with
>> this cmd - disable the vf: echo 0 >
>> /sys/class/net/eno1/device/sriov_numvfs, what do you think?
>>>
>>> The page_pool bumps refcnt via get_device() + put_device() on the DMA
>>> 'struct device', to avoid it going away, but I guess there is also some
>>> IOMMU code that we need to make sure doesn't go away (until all inflight
>>> pages are returned) ???
>
> I guess the above is why things went wrong here; the question is which
> IOMMU code needs to be called here to stop them from going away.
This looks like the wrong device is being passed to dma_unmap_page() -
if a device had an IOMMU DMA domain at the point when the DMA mapping
was created, then neither that domain nor its group can legitimately have
disappeared while that device still had a driver bound. Or if it *was*
the right device, but it's already had device_del() called on it, then
you have a fundamental lifecycle problem - a device with no driver bound
should not be passed to the DMA API, much less a dead device that's
already been removed from its parent bus.
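As a minimal sketch of that rule (all foo_* names below are hypothetical,
shown purely for illustration): the driver's .remove() callback quiesces
the hardware and unmaps everything it mapped before returning, because
afterwards the device may no longer be passed to the DMA API.

#include <linux/pci.h>
#include <linux/dma-mapping.h>
#include <linux/gfp.h>

struct foo_priv {                       /* hypothetical driver state */
        struct page *rx_page;
        dma_addr_t rx_dma;
};

static void foo_remove(struct pci_dev *pdev)
{
        struct foo_priv *priv = pci_get_drvdata(pdev);

        /* quiesce the hardware first so no DMA is still in flight */
        dma_unmap_page(&pdev->dev, priv->rx_dma, PAGE_SIZE,
                       DMA_FROM_DEVICE);
        __free_page(priv->rx_page);     /* only now may the page be reused */
}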
Thanks,
Robin.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-08-05 12:50 ` Yunsheng Lin
@ 2024-08-06 0:39 ` Alexander Duyck
0 siblings, 0 replies; 21+ messages in thread
From: Alexander Duyck @ 2024-08-06 0:39 UTC (permalink / raw)
To: Yunsheng Lin
Cc: Yonglong Liu, David S. Miller, Jakub Kicinski, pabeni, hawk,
ilias.apalodimas, netdev, linux-kernel, Alexei Starovoitov,
shenjian (K), Salil Mehta, iommu
On Mon, Aug 5, 2024 at 6:20 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/8/3 0:38, Alexander Duyck wrote:
>
> ...
>
> >
> > The issue as I see it is that we aren't unmapping the pages when we
> > call page_pool_destroy. There need to be no pages remaining with a DMA
> > unmapping needed *after* that is called. Otherwise we will see this
> > issue regularly.
> >
> > What we probably need to look at doing is beefing up page_pool_release
> > to add a step that will take an additional reference on the inflight
> > pages, then call __page_pool_put_page to switch them to a reference
> > counted page.
>
> I am not sure if I understand what you meant about, did you mean making
> page_pool_destroy() synchronously wait for the all in-flight pages to
> come back before returning to driver?
Part of the issue is that the device appears to be removed from the iommu
before all the pages have been unmapped. To fix that we would either need
to unmap all the pages, or force the kernel to wait until all of the
pages have been unmapped before the device can be removed from the
iommu group.
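A minimal sketch of the second option, reusing the names already visible
in this thread (page_pool_scrub(), page_pool_inflight() and
__page_pool_destroy() are internal to net/core/page_pool.c; the polling
policy here is illustrative only, not a proposal):

#include <linux/delay.h>        /* msleep() */

/* Hypothetical: block until every inflight page has come back and been
 * unmapped, so the device cannot be torn down with mappings outstanding. */
static void page_pool_destroy_sync(struct page_pool *pool)
{
        page_pool_scrub(pool);          /* flush the pool-internal caches */
        while (page_pool_inflight(pool, false))
                msleep(20);             /* wait for the stack to return pages */
        __page_pool_destroy(pool);      /* nothing left to unmap now */
}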
> >
> > Seems like the worst case scenario is that we are talking about having
> > to walk the page table to do the above for any inflight pages but it
>
> Which page table are we talking about here?
The internal memory managed by the kernel in the form of struct
page. Basically we would need to walk through all the struct page
entries, and if they are set up to use the page_pool we are freeing, we
would have to force them out of the pool.
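A rough sketch of that walk (racy and illustrative only; the page->pp
back-pointer and the pp_magic/PP_SIGNATURE tag are how page_pool marks
its pages, and page_pool_return_page() is the internal helper seen in
the backtrace above):

#include <linux/mm.h>
#include <linux/poison.h>       /* PP_SIGNATURE */

static void page_pool_evict_inflight(struct page_pool *pool)
{
        unsigned long pfn;

        for (pfn = 0; pfn < max_pfn; pfn++) {
                struct page *page;

                if (!pfn_valid(pfn))
                        continue;
                page = pfn_to_page(pfn);
                /* skip pages not tagged as belonging to this pool */
                if ((page->pp_magic & ~0x3UL) != PP_SIGNATURE ||
                    page->pp != pool)
                        continue;
                page_pool_return_page(pool, page); /* unmap and release */
        }
}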
> > would certainly be a much more deterministic amount of time needed to
> > do that versus waiting on a page that may or may not return.
> >
> > Alternatively a quick hack that would probably also address this would
> > be to clear pool->dma_map in page_pool_destroy or maybe in
>
> It seems we may need to clear pool->dma_sync too, and there may be some
> time window between clearing and checking/dma_unmap?
That is a possibility. However for many platforms dma_sync is a no-op.
> > page_pool_unreg_netdev so that any of those residual mappings would
> > essentially get leaked, but we wouldn't have to worry about trying to
> > unmap while the device doesn't exist.
>
> But how does the page_pool know whether it is the normal unloading process
> without VF disabling, where the device still exists, or the abnormal one
> caused by VF disabling, where the device will disappear? If it is the first
> one, does skipping some iommu calls cause a resource leak in the iommu?
It wouldn't. Basically we would have to do this for any page pool that
is being destroyed with pages left in-flight. That is why the
preference would likely be to stall for some time and hope that the
pages get unmapped on their own, and then, if they aren't, we would need
to force this process to kick in so we don't wait forever.
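For reference, the "leak the residual mappings" fallback could be as
small as this (pool->dma_map and pool->dma_sync are the existing flags;
the time window Yunsheng mentions between clearing them and a concurrent
unmap is deliberately left unsolved here):

/* Hypothetical last resort: stop touching the (possibly dead) device.
 * Pages still inflight leak their mapping instead of corrupting memory
 * through a stale unmap. */
static void page_pool_disable_unmap(struct page_pool *pool)
{
        pool->dma_map = false;  /* page_pool_return_page() skips dma_unmap */
        pool->dma_sync = false; /* and skips the dma_sync_for_device path */
}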
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-08-02 2:06 ` Yonglong Liu
@ 2024-08-06 9:51 ` Jesper Dangaard Brouer
2024-08-06 14:23 ` Robin Murphy
0 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2024-08-06 9:51 UTC (permalink / raw)
To: Yonglong Liu, Somnath Kotur
Cc: David S. Miller, Jakub Kicinski, pabeni, ilias.apalodimas, netdev,
linux-kernel, Alexander Duyck, Alexei Starovoitov, linyunsheng,
shenjian (K), Salil Mehta, iommu, Jean-Philippe Brucker,
linux-acpi
Cc. IOMMU list+maintainers, question below...
On 02/08/2024 04.06, Yonglong Liu wrote:
>
> On 2024/7/31 19:32, Yonglong Liu wrote:
>>
>> On 2024/7/31 16:42, Somnath Kotur wrote:
>>> On Tue, Jul 30, 2024 at 10:51 PM Jesper Dangaard Brouer
>>> <hawk@kernel.org> wrote:
>>>>
>>>>
>>>> On 30/07/2024 15.08, Yonglong Liu wrote:
>>>>> I found a bug when running hns3 driver with page pool enabled, the log
>>>>> as below:
>>>>>
>>>>> [ 4406.956606] Unable to handle kernel NULL pointer dereference at
>>>>> virtual address 00000000000000a8
>>>> struct iommu_domain *iommu_get_dma_domain(struct device *dev)
>>>> {
>>>> return dev->iommu_group->default_domain;
>>>> }
>>>>
>>>> $ pahole -C iommu_group --hex | grep default_domain
>>>> struct iommu_domain * default_domain; /* 0xa8 0x8 */
>>>>
>>>> Looks like iommu_group is a NULL pointer (that when deref member
>>>> 'default_domain' cause this fault).
>>>>
>>>>
>>>>> [ 4406.965379] Mem abort info:
>>>>> [ 4406.968160] ESR = 0x0000000096000004
>>>>> [ 4406.971906] EC = 0x25: DABT (current EL), IL = 32 bits
>>>>> [ 4406.977218] SET = 0, FnV = 0
>>>>> [ 4406.980258] EA = 0, S1PTW = 0
>>>>> [ 4406.983404] FSC = 0x04: level 0 translation fault
>>>>> [ 4406.988273] Data abort info:
>>>>> [ 4406.991154] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>>>> [ 4406.996632] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>>>> [ 4407.001681] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>>>> [ 4407.006985] user pgtable: 4k pages, 48-bit VAs,
>>>>> pgdp=0000202828326000
>>>>> [ 4407.013430] [00000000000000a8] pgd=0000000000000000,
>>>>> p4d=0000000000000000
>>>>> [ 4407.020212] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>>>> [ 4407.026454] Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT
>>>>> nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle
>>>>> ip6table_filter ip6_tables hns_roce_hw_v2 hns3 hclge hnae3 xt_addrtype
>>>>> iptable_filter xt_conntrack overlay arm_spe_pmu arm_smmuv3_pmu
>>>>> hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_uncore_l3c_pmu
>>>>> hisi_uncore_pmu fuse rpcrdma ib_isert iscsi_target_mod ib_iser
>>>>> libiscsi
>>>>> scsi_transport_iscsi crct10dif_ce hisi_sec2 hisi_hpre hisi_zip
>>>>> hisi_sas_v3_hw xhci_pci sbsa_gwdt hisi_qm hisi_sas_main hisi_dma
>>>>> xhci_pci_renesas uacce libsas [last unloaded: hnae3]
>>>>> [ 4407.076027] CPU: 48 PID: 610 Comm: kworker/48:1
>>>>> [ 4407.093343] Workqueue: events page_pool_release_retry
>>>>> [ 4407.098384] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
>>>>> BTYPE=--)
>>>>> [ 4407.105316] pc : iommu_get_dma_domain+0xc/0x20
>>>>> [ 4407.109744] lr : iommu_dma_unmap_page+0x38/0xe8
>>>>> [ 4407.114255] sp : ffff80008bacbc80
>>>>> [ 4407.117554] x29: ffff80008bacbc80 x28: 0000000000000000 x27:
>>>>> ffffc31806be7000
>>>>> [ 4407.124659] x26: ffff2020002b6ac0 x25: 0000000000000000 x24:
>>>>> 0000000000000002
>>>>> [ 4407.131762] x23: 0000000000000022 x22: 0000000000001000 x21:
>>>>> 00000000fcd7c000
>>>>> [ 4407.138865] x20: ffff0020c9882800 x19: ffff0020856f60c8 x18:
>>>>> ffff8000d3503c58
>>>>> [ 4407.145968] x17: 0000000000000000 x16: 1fffe00419521061 x15:
>>>>> 0000000000000001
>>>>> [ 4407.153073] x14: 0000000000000003 x13: 00000401850ae012 x12:
>>>>> 000006b10004e7fb
>>>>> [ 4407.160177] x11: 0000000000000067 x10: 0000000000000c70 x9 :
>>>>> ffffc3180405cd20
>>>>> [ 4407.167280] x8 : fefefefefefefeff x7 : 0000000000000001 x6 :
>>>>> 0000000000000010
>>>>> [ 4407.174382] x5 : ffffc3180405cce8 x4 : 0000000000000022 x3 :
>>>>> 0000000000000002
>>>>> [ 4407.181485] x2 : 0000000000001000 x1 : 00000000fcd7c000 x0 :
>>>>> 0000000000000000
>>>>> [ 4407.188589] Call trace:
>>>>> [ 4407.191027] iommu_get_dma_domain+0xc/0x20
>>>>> [ 4407.195105] dma_unmap_page_attrs+0x38/0x1d0
>>>>> [ 4407.199361] page_pool_return_page+0x48/0x180
>>>>> [ 4407.203699] page_pool_release+0xd4/0x1f0
>>>>> [ 4407.207692] page_pool_release_retry+0x28/0xe8
>>>> I suspect that the DMA IOMMU part was deallocated and freed by the
>>>> driver even though page_pool still has inflight packets.
>>> When you say driver, which 'driver' do you mean?
>>> I suspect this could be because of the VF instance going away with
>>> this cmd - disable the vf: echo 0 >
>>> /sys/class/net/eno1/device/sriov_numvfs, what do you think?
>>
>> I found that this happens when the vf is enabled and running some packets;
>> below is more information:
>>
>>
>> [ 4391.596558] pci 0000:7d:01.0: page_pool_release_retry() stalled
>> pool shutdown: id 1145, 33 inflight 906 sec
>> [ 4397.111484] hns3 0000:bd:00.0: SRIOV setting: 0
>> [ 4397.118155] hns3 0000:bd:01.0 enp189s0f0v0: link down
>> [ 4397.416623] hns3 0000:bd:01.0: finished uninitializing hclgevf driver
>> [ 4397.480572] pci 0000:7d:01.0: page_pool_release_retry() stalled
>> pool shutdown: id 1279, 98 inflight 362 sec
>> [ 4400.948362] hns3 0000:7d:00.0: SRIOV setting: 1
>> [ 4401.060569] pci 0000:7d:01.0: [19e5:a22f] type 00 class 0x020000
>> PCIe Endpoint
>> [ 4401.067795] pci 0000:7d:01.0: enabling Extended Tags
>> [ 4401.073090] hns3 0000:7d:01.0: Adding to iommu group 48
>> [ 4401.078494] hns3 0000:7d:01.0: enabling device (0000 -> 0002)
>> [ 4401.084348] hns3 0000:7d:01.0: The firmware version is 1.20.0.17
>> [ 4401.102911] hns3 0000:7d:01.0: finished initializing hclgevf driver
>> [ 4401.111212] hns3 0000:7d:01.0: using random MAC address
>> da:**:**:**:a3:47
>> [ 4401.138375] hns3 0000:7d:01.0 eno1v0: renamed from eth0
>> [ 4403.939449] hns3 0000:7d:01.0 eno1v0: link up
>> [ 4403.940237] 8021q: adding VLAN 0 to HW filter on device eno1v0
>> [ 4406.956606] Unable to handle kernel NULL pointer dereference at
>> virtual address 00000000000000a8
>>
>>
>> another log:
>>
>> [11550.197002] hns3 0000:bd:01.0 enp189s0f0v0: link up
>> [11550.197034] hns3 0000:bd:01.0 enp189s0f0v0: net open
>> [11550.206910] 8021q: adding VLAN 0 to HW filter on device enp189s0f0v0
>> [11564.872929] page_pool_release_retry() stalled pool shutdown: id
>> 2330, 99 inflight 60 sec
>> [11568.353723] hns3 0000:bd:01.0 enp189s0f0v0: net stop
>> [11568.360723] hns3 0000:bd:01.0 enp189s0f0v0: link down
>> [11568.519899] hns3 0000:bd:01.0 enp189s0f0v0: link up
>> [11568.519935] hns3 0000:bd:01.0 enp189s0f0v0: net open
>> [11568.529840] 8021q: adding VLAN 0 to HW filter on device enp189s0f0v0
>> [11589.640930] page_pool_release_retry() stalled pool shutdown: id
>> 1996, 50 inflight 2054 sec
>> [11592.554875] hns3 0000:bd:01.0 enp189s0f0v0: net stop
>> [11592.560930] hns3 0000:bd:01.0 enp189s0f0v0: link down
>> [11596.684935] pci 0000:7d:01.0: [19e5:a22f] type 00 class 0x020000
>> PCIe Endpoint
>> [11596.692140] pci 0000:7d:01.0: enabling Extended Tags
>> [11596.697324] hns3 0000:7d:01.0: Adding to iommu group 48
>> [11596.702988] hns3 0000:7d:01.0: enabling device (0000 -> 0002)
>> [11596.708808] hns3 0000:7d:01.0: The firmware version is 1.20.0.17
>> [11596.727263] hns3 0000:7d:01.0: finished initializing hclgevf driver
>> [11596.735561] hns3 0000:7d:01.0: using random MAC address
>> 72:**:**:**:55:d7
>> [11596.760621] hns3 0000:7d:01.0 eno1v0: renamed from eth0
>> [11599.545341] hns3 0000:7d:01.0 eno1v0: link up
>> [11599.545409] hns3 0000:7d:01.0 eno1v0: net open
>> [11599.554858] 8021q: adding VLAN 0 to HW filter on device eno1v0
>> [11608.908922] Unable to handle kernel NULL pointer dereference at
>> virtual address 00000000000000a8
>>
> I added more debug info and found that:
>
> [ 4573.356891] pci 0000:7d:01.0: page_pool_release_retry() stalled pool
> shutdown: id 1046, 82 inflight 60 sec, iommu_group=0x0
>
> The iommu_group will be released whether the page_pool is using it or
> not, so once page_pool_return_page() is called (why does this occur when
> the device is reloaded and packets are transmitted?), this crash will
> happen.
>
> I tried the following patch, but it doesn't work :(
>
The idea of taking a refcnt on the IOMMU to avoid dev->iommu_group
getting freed makes sense to me.
The question is whether iommu_group_get() and iommu_group_put() are the
correct APIs to use in this case?
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index f4444b4e39e6..d03a87407ca8 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -21,6 +21,7 @@
> #include <linux/poison.h>
> #include <linux/ethtool.h>
> #include <linux/netdevice.h>
> +#include <linux/iommu.h>
>
The page_pool already has a system workqueue that waits for inflight
"packet" pages, and calls the struct device API get_device() and put_device().
Why didn't the patch add the code together with those struct device API calls?
Like this:
$ git diff
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 2abe6e919224..686ff1d31aff 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -265,8 +265,10 @@ static int page_pool_init(struct page_pool *pool,
/* Driver calling page_pool_create() also call
page_pool_destroy() */
refcount_set(&pool->user_cnt, 1);
- if (pool->dma_map)
+ if (pool->dma_map) {
+ iommu_group_get(pool->p.dev);
get_device(pool->p.dev);
+ }
return 0;
}
@@ -275,8 +277,10 @@ static void page_pool_uninit(struct page_pool *pool)
{
ptr_ring_cleanup(&pool->ring, NULL);
- if (pool->dma_map)
+ if (pool->dma_map) {
+ iommu_group_put(pool->p.dev->iommu_group);
put_device(pool->p.dev);
+ }
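One nit on the sketch above: iommu_group_get() returns the refcounted
struct iommu_group *, so a safer variant would stash that pointer rather
than re-read dev->iommu_group (which may already be NULL) at uninit
time. 'iommu_grp' below is a hypothetical new page_pool member:

static void page_pool_hold_iommu(struct page_pool *pool)
{
        /* may return NULL when the device has no IOMMU group */
        pool->iommu_grp = iommu_group_get(pool->p.dev);
        get_device(pool->p.dev);
}

static void page_pool_release_iommu(struct page_pool *pool)
{
        iommu_group_put(pool->iommu_grp);       /* tolerates NULL */
        put_device(pool->p.dev);
}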
--Jesper
> #include <trace/events/page_pool.h>
> > @@ -306,6 +307,9 @@ page_pool_create_percpu(const struct
> page_pool_params *params, int cpuid)
> if (err)
> goto err_uninit;
>
> + if (pool->dma_map)
> + iommu_group_get(pool->p.dev);
> +
> return pool;
>
> err_uninit:
> @@ -974,8 +978,11 @@ static int page_pool_release(struct page_pool *pool)
>
> page_pool_scrub(pool);
> inflight = page_pool_inflight(pool, true);
> - if (!inflight)
> + if (!inflight) {
> __page_pool_destroy(pool);
> + if (pool->dma_map)
> + iommu_group_put(pool->p.dev->iommu_group);
> + }
>
> return inflight;
> }
>
>
>>>> The page_pool bumps refcnt via get_device() + put_device() on the DMA
>>>> 'struct device', to avoid it going away, but I guess there is also some
>>>> IOMMU code that we need to make sure doesn't go away (until all
>>>> inflight
>>>> pages are returned) ???
>>>>
>>>>
>>>>> [ 4407.212119] process_one_work+0x164/0x3e0
>>>>> [ 4407.216116] worker_thread+0x310/0x420
>>>>> [ 4407.219851] kthread+0x120/0x130
>>>>> [ 4407.223066] ret_from_fork+0x10/0x20
>>>>> [ 4407.226630] Code: ffffc318 aa1e03e9 d503201f f9416c00 (f9405400)
>>>>> [ 4407.232697] ---[ end trace 0000000000000000 ]---
>>>>>
>>>>>
>>>>> The hns3 driver use page pool like this, just call once when the
>>>>> driver
>>>>> initialize:
>>>>>
>>>>> static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
>>>>> {
>>>>> struct page_pool_params pp_params = {
>>>>> .flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
>>>>> PP_FLAG_DMA_SYNC_DEV,
>>>>> .order = hns3_page_order(ring),
>>>>> .pool_size = ring->desc_num * hns3_buf_size(ring) /
>>>>> (PAGE_SIZE << hns3_page_order(ring)),
>>>>> .nid = dev_to_node(ring_to_dev(ring)),
>>>>> .dev = ring_to_dev(ring),
>>>>> .dma_dir = DMA_FROM_DEVICE,
>>>>> .offset = 0,
>>>>> .max_len = PAGE_SIZE << hns3_page_order(ring),
>>>>> };
>>>>>
>>>>> ring->page_pool = page_pool_create(&pp_params);
>>>>> if (IS_ERR(ring->page_pool)) {
>>>>> dev_warn(ring_to_dev(ring), "page pool creation failed:
>>>>> %ld\n",
>>>>> PTR_ERR(ring->page_pool));
>>>>> ring->page_pool = NULL;
>>>>> }
>>>>> }
>>>>>
>>>>> And call page_pool_destroy(ring->page_pool) when the driver
>>>>> uninitialized.
>>>>>
>>>>>
>>>>> We use two devices, the net port connect directory, and the step of
>>>>> the
>>>>> test case like below:
>>>>>
>>>>> 1. enable a vf of '7d:00.0': echo 1 >
>>>>> /sys/class/net/eno1/device/sriov_numvfs
>>>>>
>>>>> 2. use iperf to produce some flows(the problem happens to the side
>>>>> which
>>>>> runs 'iperf -s')
>>>>>
>>>>> 3. use ifconfig down/up to the vf
>>>>>
>>>>> 4. kill iperf
>>>>>
>>>>> 5. disable the vf: echo 0 > /sys/class/net/eno1/device/sriov_numvfs
>>>>>
>>>>> 6. run 1~5 with another port bd:00.0
>>>>>
>>>>> 7. repeat 1~6
>>>>>
>>>>>
>>>>> And when running this test case, we can found another related
>>>>> message (I
>>>>> replaced pr_warn() to dev_warn()):
>>>>>
>>>>> pci 0000:7d:01.0: page_pool_release_retry() stalled pool shutdown: id
>>>>> 949, 98 inflight 1449 sec
>>>>>
>>>>>
>>>>> Even when stop the traffic, stop the test case, disable the vf, this
>>>>> message is still being printed.
>>>>>
>>>>> We must run the test case for about two hours to reproduce the
>>>>> problem.
>>>>> Is there some advise to solve or debug the problem?
>>>>>
>>
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-08-05 12:53 ` Robin Murphy
@ 2024-08-06 11:54 ` Yunsheng Lin
2024-08-06 12:50 ` Robin Murphy
0 siblings, 1 reply; 21+ messages in thread
From: Yunsheng Lin @ 2024-08-06 11:54 UTC (permalink / raw)
To: Robin Murphy, Somnath Kotur, Jesper Dangaard Brouer
Cc: Yonglong Liu, David S. Miller, Jakub Kicinski, pabeni,
ilias.apalodimas, netdev, linux-kernel, Alexander Duyck,
Alexei Starovoitov, shenjian (K), Salil Mehta, joro, will, iommu
On 2024/8/5 20:53, Robin Murphy wrote:
>>>>
>>>> The page_pool bumps refcnt via get_device() + put_device() on the DMA
>>>> 'struct device', to avoid it going away, but I guess there is also some
>>>> IOMMU code that we need to make sure doesn't go away (until all inflight
>>>> pages are returned) ???
>>
>> I guess the above is why things went wrong here; the question is which
>> IOMMU code needs to be called here to stop them from going away.
>
> This looks like the wrong device is being passed to dma_unmap_page() - if a device had an IOMMU DMA domain at the point when the DMA mapping was created, then neither that domain nor its group can legitimately have disappeared while that device still had a driver bound. Or if it *was* the right device, but it's already had device_del() called on it, then you have a fundamental lifecycle problem - a device with no driver bound should not be passed to the DMA API, much less a dead device that's already been removed from its parent bus.
Yes, the device *was* the right device, and it had already had device_del()
called on it.
page_pool calls get_device() on the DMA 'struct device' to avoid the
above lifecycle problem, but it seems get_device() does not stop device_del()
from being called, and that is where we have the problem here:
https://elixir.bootlin.com/linux/v6.11-rc2/source/net/core/page_pool.c#L269
The above happens because a driver with page_pool support may hand over
pages that still have DMA mappings to the network stack, and reuse those
pages after the network stack is done with them and passes them back to
the page_pool, to avoid the penalty of DMA mapping/unmapping. With all
the caching in the network stack, some pages may be held there without
returning to the page_pool soon enough, and with the VF disable causing
the driver to unbind, the page_pool does not stop the driver from doing
its unbinding work; instead, the page_pool uses a workqueue to
periodically check whether any pages have come back from the network
stack, and if so, it does the DMA-unmapping-related cleanup work.
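To make the gap concrete, here is a stand-alone model of it (plain
userspace C; the functions are toy stand-ins for the kernel helpers of
the same names):

#include <stdio.h>
#include <stdlib.h>

struct group { int dummy; };

struct device {
        int refs;
        struct group *iommu_group;
};

static struct device *get_device(struct device *d) { d->refs++; return d; }
static void put_device(struct device *d) { if (--d->refs == 0) free(d); }

static void device_del(struct device *d)
{
        d->iommu_group = NULL;          /* group torn down at unbind */
}

int main(void)
{
        static struct group grp;
        struct device *dev = calloc(1, sizeof(*dev));

        dev->refs = 1;
        dev->iommu_group = &grp;

        get_device(dev);        /* page_pool takes its reference */
        device_del(dev);        /* VF disabled, driver unbound */

        /* a late page_pool_return_page(): memory alive, state gone */
        printf("iommu_group = %p\n", (void *)dev->iommu_group); /* NULL */

        put_device(dev);
        put_device(dev);
        return 0;
}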
>
> Thanks,
> Robin.
>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-08-06 11:54 ` Yunsheng Lin
@ 2024-08-06 12:50 ` Robin Murphy
2024-08-06 13:50 ` Jason Gunthorpe
2024-08-20 7:18 ` Yonglong Liu
0 siblings, 2 replies; 21+ messages in thread
From: Robin Murphy @ 2024-08-06 12:50 UTC (permalink / raw)
To: Yunsheng Lin, Somnath Kotur, Jesper Dangaard Brouer
Cc: Yonglong Liu, David S. Miller, Jakub Kicinski, pabeni,
ilias.apalodimas, netdev, linux-kernel, Alexander Duyck,
Alexei Starovoitov, shenjian (K), Salil Mehta, joro, will, iommu
On 06/08/2024 12:54 pm, Yunsheng Lin wrote:
> On 2024/8/5 20:53, Robin Murphy wrote:
>>>>>
>>>>> The page_pool bumps refcnt via get_device() + put_device() on the DMA
>>>>> 'struct device', to avoid it going away, but I guess there is also some
>>>>> IOMMU code that we need to make sure doesn't go away (until all inflight
>>>>> pages are returned) ???
>>>
>>> I guess the above is why things went wrong here; the question is which
>>> IOMMU code needs to be called here to stop them from going away.
>>
>> This looks like the wrong device is being passed to dma_unmap_page() - if a device had an IOMMU DMA domain at the point when the DMA mapping was created, then neither that domain nor its group can legitimately have disappeared while that device still had a driver bound. Or if it *was* the right device, but it's already had device_del() called on it, then you have a fundamental lifecycle problem - a device with no driver bound should not be passed to the DMA API, much less a dead device that's already been removed from its parent bus.
>
> Yes, the device *was* the right device, And it's already had device_del()
> called on it.
> page_pool tries to call get_device() on the DMA 'struct device' to avoid the
> above lifecycle problem, it seems get_device() does not stop device_del()
> from being called, and that is where we have the problem here:
> https://elixir.bootlin.com/linux/v6.11-rc2/source/net/core/page_pool.c#L269
>
> The above happens because driver with page_pool support may hand over
> page still with dma mapping to network stack and try to reuse that page
> after network stack is done with it and passes it back to page_pool to avoid
> the penalty of dma mapping/unmapping. With all the caching in the network
> stack, some pages may be held in the network stack without returning to the
> page_pool soon enough, and with VF disable causing the driver unbound, the
> page_pool does not stop the driver from doing it's unbounding work, instead
> page_pool uses workqueue to check if there is some pages coming back from the
> network stack periodically, if there is any, it will do the dma unmmapping
> related cleanup work.
OK, that sounds like a more insidious problem - it's not just IOMMU
stuff, in general the page pool should not be holding and using the
device pointer after the device has already been destroyed. Even without
an IOMMU, attempting DMA unmaps after the driver has already unbound may
leak resources or at worst corrupt memory. Fundamentally, the page pool
code cannot allow DMA mappings to outlive the driver they belong to.
Thanks,
Robin.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-08-05 12:19 ` Yunsheng Lin
2024-08-05 12:53 ` Robin Murphy
@ 2024-08-06 13:35 ` Niklas Schnelle
2024-08-13 18:49 ` Matthew Rosato
1 sibling, 1 reply; 21+ messages in thread
From: Niklas Schnelle @ 2024-08-06 13:35 UTC (permalink / raw)
To: Yunsheng Lin, Somnath Kotur, Jesper Dangaard Brouer
Cc: Yonglong Liu, David S. Miller, Jakub Kicinski, pabeni,
ilias.apalodimas, netdev, linux-kernel, Alexander Duyck,
Alexei Starovoitov, shenjian (K), Salil Mehta, joro, will,
robin.murphy, iommu
On Mon, 2024-08-05 at 20:19 +0800, Yunsheng Lin wrote:
> On 2024/7/31 16:42, Somnath Kotur wrote:
> > On Tue, Jul 30, 2024 at 10:51 PM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> > >
>
> +cc iommu maintainers and list
>
> > >
> > > On 30/07/2024 15.08, Yonglong Liu wrote:
> > > > I found a bug when running hns3 driver with page pool enabled, the log
> > > > as below:
> > > >
> > > > [ 4406.956606] Unable to handle kernel NULL pointer dereference at
> > > > virtual address 00000000000000a8
> > >
> > > struct iommu_domain *iommu_get_dma_domain(struct device *dev)
> > > {
> > > return dev->iommu_group->default_domain;
> > > }
> > >
> > > $ pahole -C iommu_group --hex | grep default_domain
> > > struct iommu_domain * default_domain; /* 0xa8 0x8 */
> > >
> > > Looks like iommu_group is a NULL pointer (that when deref member
> > > 'default_domain' cause this fault).
> > >
> > >
> > > > [ 4406.965379] Mem abort info:
> > > > [ 4406.968160] ESR = 0x0000000096000004
> > > > [ 4406.971906] EC = 0x25: DABT (current EL), IL = 32 bits
> > > > [ 4406.977218] SET = 0, FnV = 0
> > > > [ 4406.980258] EA = 0, S1PTW = 0
> > > > [ 4406.983404] FSC = 0x04: level 0 translation fault
> > > > [ 4406.988273] Data abort info:
> > > > [ 4406.991154] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> > > > [ 4406.996632] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > > > [ 4407.001681] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > > > [ 4407.006985] user pgtable: 4k pages, 48-bit VAs, pgdp=0000202828326000
> > > > [ 4407.013430] [00000000000000a8] pgd=0000000000000000,
> > > > p4d=0000000000000000
> > > > [ 4407.020212] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
> > > > [ 4407.026454] Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT
> > > > nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle
> > > > ip6table_filter ip6_tables hns_roce_hw_v2 hns3 hclge hnae3 xt_addrtype
> > > > iptable_filter xt_conntrack overlay arm_spe_pmu arm_smmuv3_pmu
> > > > hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_uncore_l3c_pmu
> > > > hisi_uncore_pmu fuse rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
> > > > scsi_transport_iscsi crct10dif_ce hisi_sec2 hisi_hpre hisi_zip
> > > > hisi_sas_v3_hw xhci_pci sbsa_gwdt hisi_qm hisi_sas_main hisi_dma
> > > > xhci_pci_renesas uacce libsas [last unloaded: hnae3]
> > > > [ 4407.076027] CPU: 48 PID: 610 Comm: kworker/48:1
> > > > [ 4407.093343] Workqueue: events page_pool_release_retry
> > > > [ 4407.098384] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
> > > > BTYPE=--)
> > > > [ 4407.105316] pc : iommu_get_dma_domain+0xc/0x20
> > > > [ 4407.109744] lr : iommu_dma_unmap_page+0x38/0xe8
> > > > [ 4407.114255] sp : ffff80008bacbc80
> > > > [ 4407.117554] x29: ffff80008bacbc80 x28: 0000000000000000 x27:
> > > > ffffc31806be7000
> > > > [ 4407.124659] x26: ffff2020002b6ac0 x25: 0000000000000000 x24:
> > > > 0000000000000002
> > > > [ 4407.131762] x23: 0000000000000022 x22: 0000000000001000 x21:
> > > > 00000000fcd7c000
> > > > [ 4407.138865] x20: ffff0020c9882800 x19: ffff0020856f60c8 x18:
> > > > ffff8000d3503c58
> > > > [ 4407.145968] x17: 0000000000000000 x16: 1fffe00419521061 x15:
> > > > 0000000000000001
> > > > [ 4407.153073] x14: 0000000000000003 x13: 00000401850ae012 x12:
> > > > 000006b10004e7fb
> > > > [ 4407.160177] x11: 0000000000000067 x10: 0000000000000c70 x9 :
> > > > ffffc3180405cd20
> > > > [ 4407.167280] x8 : fefefefefefefeff x7 : 0000000000000001 x6 :
> > > > 0000000000000010
> > > > [ 4407.174382] x5 : ffffc3180405cce8 x4 : 0000000000000022 x3 :
> > > > 0000000000000002
> > > > [ 4407.181485] x2 : 0000000000001000 x1 : 00000000fcd7c000 x0 :
> > > > 0000000000000000
> > > > [ 4407.188589] Call trace:
> > > > [ 4407.191027] iommu_get_dma_domain+0xc/0x20
> > > > [ 4407.195105] dma_unmap_page_attrs+0x38/0x1d0
> > > > [ 4407.199361] page_pool_return_page+0x48/0x180
> > > > [ 4407.203699] page_pool_release+0xd4/0x1f0
> > > > [ 4407.207692] page_pool_release_retry+0x28/0xe8
> > >
> > > I suspect that the DMA IOMMU part was deallocated and freed by the
> > > driver even though page_pool still has inflight packets.
> > When you say driver, which 'driver' do you mean?
> > I suspect this could be because of the VF instance going away with
> > this cmd - disable the vf: echo 0 >
> > /sys/class/net/eno1/device/sriov_numvfs, what do you think?
> > >
> > > The page_pool bumps refcnt via get_device() + put_device() on the DMA
> > > 'struct device', to avoid it going away, but I guess there is also some
> > > IOMMU code that we need to make sure doesn't go away (until all inflight
> > > pages are returned) ???
>
> I guess the above is why things went wrong here; the question is which
> IOMMU code needs to be called here to stop them from going away.
>
> What I am also curious is that there should be a pool of allocated iova in
> iommu that is corresponding to the in-flight page for page_pool, shouldn't
> iommu wait for the corresponding allocated iova to be freed similarly as
> page_pool does for it's in-flight pages?
>
Is it possible you're using an IOMMU whose driver doesn't yet support a
blocking domain? I'm currently working on an issue on s390 that also
occurs during device removal and is fixed by implementing a blocking
domain in the s390 IOMMU driver (patch forthcoming). The root cause
there is that our domain->ops->attach_dev() fails when, during
hot-unplug, the device is already gone from the platform's point of
view, and we then end up with a NULL domain unless we have a blocking
domain, which can handle nonexistent devices and gets set as a fallback
in __iommu_device_set_domain(). In the case I can reproduce, the
backtrace is different[0], but we also saw at least two cases with the
exact same call trace as in the first mail of this thread. So far I
have suspected those to be due to the blocking-domain issue, but it
could be a separate issue too.
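For readers following along, a compressed sketch of what such a
fallback domain looks like (modeled loosely on existing drivers'
blocked-domain implementations; the real ones also tear down the
translation tables):

static int blocking_attach_dev(struct iommu_domain *domain,
                               struct device *dev)
{
        /* cut off translation; must succeed even when the device has
         * already vanished from the platform's point of view */
        return 0;
}

static const struct iommu_domain_ops blocking_ops = {
        .attach_dev = blocking_attach_dev,
};

static struct iommu_domain blocking_domain = {
        .type   = IOMMU_DOMAIN_BLOCKED,
        .ops    = &blocking_ops,
};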
Thanks,
Niklas
[0]
Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 0000000000000000 TEID: 0000000000000483
Fault in home space mode while using kernel ASCE.
AS:0000000159f00007 R3:00000003fe900007 S:00000003fe8ff800 P:000000000000013d
Oops: 0004 ilc:2 [#1] SMP
Modules linked in: ...
CPU: 15 UID: 0 PID: 139 Comm: kmcheck Kdump: loaded ...
Tainted: [W]=WARN
Hardware name: IBM 3931 A01 701 (LPAR)
Krnl PSW : 0404e00180000000 00000109fc6c2d98 (s390_iommu_release_device+0x58/0xf0)
R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 0000000000000010 0000000000000000 0000000000000000 00000109fd18eb40
00000109fcf78c34 0000000000000037 0000000000000000 00000109fd33ae60
0000000804768748 0000000804768700 00000109fd133f10 0700000000000000
0000000000000000 000000080474a400 00000109fc6b9d9a 00000089fef6b968
Krnl Code: 00000109fc6c2d8c: a51e0000 llilh %r1,0
0109fc6c2d90: 580013ac l %r0,940(%r1)
0109fc6c2d94: a7180000 lhi %r1,0
0109fc6c2d98: ba10c0b8 cs %r1,%r0,184(%r12)
0109fc6c2d9c: ec180008007e cij %r1,0,8,00000109fc6c2dac
0109fc6c2da2: 4120c0b8 la %r2,184(%r12)
0109fc6c2da6: c0e50023b3f5 brasl %r14,00000109fcb39590
0109fc6c2dac: e310d0200004 lg %r1,32(%r13)
Call Trace:
[<00000109fc6c2d98>] s390_iommu_release_device+0x58/0xf0
[<00000109fc6b9d9a>] iommu_deinit_device+0x7a/0x1c0
[<00000109fc6b9920>] iommu_release_device+0x160/0x240
[<00000109fc6b97ae>] iommu_bus_notifier+0x9e/0xb0
[<00000109fbbb1be2>] blocking_notifier_call_chain+0x72/0x130
[<00000109fc6d2c0c>] bus_notify+0x10c/0x130
[<00000109fc6ccc16>] device_del+0x4c6/0x700
[<00000109fc6349fa>] pci_remove_bus_device+0xfa/0x1c0
[<00000109fc634af8>] pci_stop_and_remove_bus_device_locked+0x38/0x50
[<00000109fbb6cb86>] zpci_bus_remove_device+0x66/0xa0
[<00000109fbb6a9ac>] zpci_event_availability+0x15c/0x270
[<00000109fc77b16a>] chsc_process_crw+0x48a/0xca0
[<00000109fc7842c2>] crw_collect_info+0x1d2/0x310
[<00000109fbbaf85c>] kthread+0x1bc/0x1e0
[<00000109fbb0f5fa>] __ret_from_fork+0x3a/0x60
[<00000109fcb5807a>] ret_from_fork+0xa/0x40
Last Breaking-Event-Address:
[<00000109fc6b9d98>] iommu_deinit_device+0x78/0x1c0
Kernel panic - not syncing: Fatal exception: panic_on_oops
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-08-06 12:50 ` Robin Murphy
@ 2024-08-06 13:50 ` Jason Gunthorpe
2024-08-20 13:22 ` Mina Almasry
2024-08-20 7:18 ` Yonglong Liu
1 sibling, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2024-08-06 13:50 UTC (permalink / raw)
To: Robin Murphy
Cc: Yunsheng Lin, Somnath Kotur, Jesper Dangaard Brouer, Yonglong Liu,
David S. Miller, Jakub Kicinski, pabeni, ilias.apalodimas, netdev,
linux-kernel, Alexander Duyck, Alexei Starovoitov, shenjian (K),
Salil Mehta, joro, will, iommu
On Tue, Aug 06, 2024 at 01:50:08PM +0100, Robin Murphy wrote:
> On 06/08/2024 12:54 pm, Yunsheng Lin wrote:
> > On 2024/8/5 20:53, Robin Murphy wrote:
> > > > > >
> > > > > > The page_pool bumps refcnt via get_device() + put_device() on the DMA
> > > > > > 'struct device', to avoid it going away, but I guess there is also some
> > > > > > IOMMU code that we need to make sure doesn't go away (until all inflight
> > > > > > pages are returned) ???
> > > >
> > > > I guess the above is why things went wrong here; the question is which
> > > > IOMMU code needs to be called here to stop them from going away.
> > >
> > > This looks like the wrong device is being passed to dma_unmap_page() - if a device had an IOMMU DMA domain at the point when the DMA mapping was created, then neither that domain nor its group can legitimately have disappeared while that device still had a driver bound. Or if it *was* the right device, but it's already had device_del() called on it, then you have a fundamental lifecycle problem - a device with no driver bound should not be passed to the DMA API, much less a dead device that's already been removed from its parent bus.
> >
> > Yes, the device *was* the right device, And it's already had device_del()
> > called on it.
> > page_pool tries to call get_device() on the DMA 'struct device' to avoid the
> > above lifecycle problem, it seems get_device() does not stop device_del()
> > from being called, and that is where we have the problem here:
> > https://elixir.bootlin.com/linux/v6.11-rc2/source/net/core/page_pool.c#L269
> >
> > The above happens because driver with page_pool support may hand over
> > page still with dma mapping to network stack and try to reuse that page
> > after network stack is done with it and passes it back to page_pool to avoid
> > the penalty of dma mapping/unmapping. With all the caching in the network
> > stack, some pages may be held in the network stack without returning to the
> > page_pool soon enough, and with VF disable causing the driver unbound, the
> > page_pool does not stop the driver from doing it's unbounding work, instead
> > page_pool uses workqueue to check if there is some pages coming back from the
> > network stack periodically, if there is any, it will do the dma unmmapping
> > related cleanup work.
>
> OK, that sounds like a more insidious problem - it's not just IOMMU stuff,
> in general the page pool should not be holding and using the device pointer
> after the device has already been destroyed. Even without an IOMMU,
> attempting DMA unmaps after the driver has already unbound may leak
> resources or at worst corrupt memory. Fundamentally, the page pool code
> cannot allow DMA mappings to outlive the driver they belong to.
+1
There is more that gets broken here if these basic driver model rules
are not followed!
netdev must fix this by waiting during driver remove until all this DMA
activity is finished somehow.
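A sketch of what that waiting could look like at unbind time (every
name below is invented; the only point is that .remove() must not
return while DMA mappings are outstanding):

#include <linux/completion.h>

struct drained_pool {                   /* hypothetical wrapper */
        struct page_pool *pool;
        struct completion drained;      /* the pool would complete this when
                                         * the last inflight page returns */
};

static void drained_pool_remove(struct drained_pool *dp)
{
        page_pool_destroy(dp->pool);            /* begin the release */
        wait_for_completion(&dp->drained);      /* block unbind until drained */
}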
Jason
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-08-06 9:51 ` Jesper Dangaard Brouer
@ 2024-08-06 14:23 ` Robin Murphy
0 siblings, 0 replies; 21+ messages in thread
From: Robin Murphy @ 2024-08-06 14:23 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Yonglong Liu, Somnath Kotur
Cc: David S. Miller, Jakub Kicinski, pabeni, ilias.apalodimas, netdev,
linux-kernel, Alexander Duyck, Alexei Starovoitov, linyunsheng,
shenjian (K), Salil Mehta, iommu, Jean-Philippe Brucker,
linux-acpi
On 06/08/2024 10:51 am, Jesper Dangaard Brouer wrote:
[...]
>> The iommu_group will be released whether the page_pool is using it or
>> not, so once page_pool_return_page() is called (why does this occur when
>> the device is reloaded and packets are transmitted?), this crash will
>> happen.
>>
>> I tried the following patch, but it doesn't work :(
>>
>
> The idea of taking a refcnt on the IOMMU to avoid dev->iommu_group
> getting freed makes sense to me.
>
> The question is whether iommu_group_get() and iommu_group_put() are the
> correct APIs to use in this case?
>
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index f4444b4e39e6..d03a87407ca8 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -21,6 +21,7 @@
>> #include <linux/poison.h>
>> #include <linux/ethtool.h>
>> #include <linux/netdevice.h>
>> +#include <linux/iommu.h>
>>
>
> The page_pool already has a system workqueue that waits for inflight
> "packet" pages, and calls the struct device API get_device() and put_device().
>
> Why didn't the patch add the code together with those struct device API calls?
> Like this:
Now do the one where there is no IOMMU, and dma_unmap_page() corrupts
random unrelated memory because the mapped DMA address was relative to
dev->dma_range_map which has since become NULL.
In other words, no, hacking one particular IOMMU API symptom does not
solve the fundamental lifecycle problem that you have here.
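(For the record, the reason that corrupts memory: on the direct path
the physical address is recovered from the DMA address by arithmetic
over the dev->dma_range_map entries, roughly:

/* simplified model: each bus_dma_region entry describes a
 * (cpu_start, dma_start, size) window */
static u64 dma_to_phys_model(u64 dma_addr, u64 cpu_start, u64 dma_start)
{
        return dma_addr - dma_start + cpu_start;
}

With dma_range_map gone, the offset is effectively zero and the
"unmapped" physical address is simply wrong.)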
Thanks,
Robin.
>
> $ git diff
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 2abe6e919224..686ff1d31aff 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -265,8 +265,10 @@ static int page_pool_init(struct page_pool *pool,
> /* Driver calling page_pool_create() also call
> page_pool_destroy() */
> refcount_set(&pool->user_cnt, 1);
>
> - if (pool->dma_map)
> + if (pool->dma_map) {
> + iommu_group_get(pool->p.dev);
> get_device(pool->p.dev);
> + }
>
> return 0;
> }
> @@ -275,8 +277,10 @@ static void page_pool_uninit(struct page_pool *pool)
> {
> ptr_ring_cleanup(&pool->ring, NULL);
>
> - if (pool->dma_map)
> + if (pool->dma_map) {
> + iommu_group_put(pool->p.dev->iommu_group);
> put_device(pool->p.dev);
> + }
>
>
> --Jesper
>
>> #include <trace/events/page_pool.h>
>> > @@ -306,6 +307,9 @@ page_pool_create_percpu(const struct
>> page_pool_params *params, int cpuid)
>> if (err)
>> goto err_uninit;
>>
>> + if (pool->dma_map)
>> + iommu_group_get(pool->p.dev);
>> +
>> return pool;
>>
>> err_uninit:
>> @@ -974,8 +978,11 @@ static int page_pool_release(struct page_pool *pool)
>>
>> page_pool_scrub(pool);
>> inflight = page_pool_inflight(pool, true);
>> - if (!inflight)
>> + if (!inflight) {
>> __page_pool_destroy(pool);
>> + if (pool->dma_map)
>> + iommu_group_put(pool->p.dev->iommu_group);
>> + }
>>
>> return inflight;
>> }
>>
>>
>>>>> The page_pool bumps refcnt via get_device() + put_device() on the DMA
>>>>> 'struct device', to avoid it going away, but I guess there is also
>>>>> some
>>>>> IOMMU code that we need to make sure doesn't go away (until all
>>>>> inflight
>>>>> pages are returned) ???
>>>>>
>>>>>
>>>>>> [ 4407.212119] process_one_work+0x164/0x3e0
>>>>>> [ 4407.216116] worker_thread+0x310/0x420
>>>>>> [ 4407.219851] kthread+0x120/0x130
>>>>>> [ 4407.223066] ret_from_fork+0x10/0x20
>>>>>> [ 4407.226630] Code: ffffc318 aa1e03e9 d503201f f9416c00 (f9405400)
>>>>>> [ 4407.232697] ---[ end trace 0000000000000000 ]---
>>>>>>
>>>>>>
>>>>>> The hns3 driver use page pool like this, just call once when the
>>>>>> driver
>>>>>> initialize:
>>>>>>
>>>>>> static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
>>>>>> {
>>>>>> struct page_pool_params pp_params = {
>>>>>> .flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
>>>>>> PP_FLAG_DMA_SYNC_DEV,
>>>>>> .order = hns3_page_order(ring),
>>>>>> .pool_size = ring->desc_num * hns3_buf_size(ring) /
>>>>>> (PAGE_SIZE << hns3_page_order(ring)),
>>>>>> .nid = dev_to_node(ring_to_dev(ring)),
>>>>>> .dev = ring_to_dev(ring),
>>>>>> .dma_dir = DMA_FROM_DEVICE,
>>>>>> .offset = 0,
>>>>>> .max_len = PAGE_SIZE << hns3_page_order(ring),
>>>>>> };
>>>>>>
>>>>>> ring->page_pool = page_pool_create(&pp_params);
>>>>>> if (IS_ERR(ring->page_pool)) {
>>>>>> dev_warn(ring_to_dev(ring), "page pool creation failed:
>>>>>> %ld\n",
>>>>>> PTR_ERR(ring->page_pool));
>>>>>> ring->page_pool = NULL;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> And call page_pool_destroy(ring->page_pool) when the driver
>>>>>> uninitialized.
>>>>>>
>>>>>>
>>>>>> We use two devices, the net port connect directory, and the step
>>>>>> of the
>>>>>> test case like below:
>>>>>>
>>>>>> 1. enable a vf of '7d:00.0': echo 1 >
>>>>>> /sys/class/net/eno1/device/sriov_numvfs
>>>>>>
>>>>>> 2. use iperf to produce some flows(the problem happens to the side
>>>>>> which
>>>>>> runs 'iperf -s')
>>>>>>
>>>>>> 3. use ifconfig down/up to the vf
>>>>>>
>>>>>> 4. kill iperf
>>>>>>
>>>>>> 5. disable the vf: echo 0 > /sys/class/net/eno1/device/sriov_numvfs
>>>>>>
>>>>>> 6. run 1~5 with another port bd:00.0
>>>>>>
>>>>>> 7. repeat 1~6
>>>>>>
>>>>>>
>>>>>> And when running this test case, we can found another related
>>>>>> message (I
>>>>>> replaced pr_warn() to dev_warn()):
>>>>>>
>>>>>> pci 0000:7d:01.0: page_pool_release_retry() stalled pool shutdown: id
>>>>>> 949, 98 inflight 1449 sec
>>>>>>
>>>>>>
>>>>>> Even when stop the traffic, stop the test case, disable the vf, this
>>>>>> message is still being printed.
>>>>>>
>>>>>> We must run the test case for about two hours to reproduce the
>>>>>> problem.
>>>>>> Is there some advise to solve or debug the problem?
>>>>>>
>>>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-08-06 13:35 ` Niklas Schnelle
@ 2024-08-13 18:49 ` Matthew Rosato
0 siblings, 0 replies; 21+ messages in thread
From: Matthew Rosato @ 2024-08-13 18:49 UTC (permalink / raw)
To: Niklas Schnelle, Yunsheng Lin, Somnath Kotur,
Jesper Dangaard Brouer
Cc: Yonglong Liu, David S. Miller, Jakub Kicinski, pabeni,
ilias.apalodimas, netdev, linux-kernel, Alexander Duyck,
Alexei Starovoitov, shenjian (K), Salil Mehta, joro, will,
robin.murphy, iommu
On 8/6/24 9:35 AM, Niklas Schnelle wrote:
> On Mon, 2024-08-05 at 20:19 +0800, Yunsheng Lin wrote:
>> On 2024/7/31 16:42, Somnath Kotur wrote:
>>> On Tue, Jul 30, 2024 at 10:51 PM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>>>
>>
>> +cc iommu maintainers and list
>>
>>>>
>>>> On 30/07/2024 15.08, Yonglong Liu wrote:
>>>>> I found a bug when running hns3 driver with page pool enabled, the log
>>>>> as below:
>>>>>
>>>>> [ 4406.956606] Unable to handle kernel NULL pointer dereference at
>>>>> virtual address 00000000000000a8
>>>>
>>>> struct iommu_domain *iommu_get_dma_domain(struct device *dev)
>>>> {
>>>> return dev->iommu_group->default_domain;
>>>> }
>>>>
>>>> $ pahole -C iommu_group --hex | grep default_domain
>>>> struct iommu_domain * default_domain; /* 0xa8 0x8 */
>>>>
>>>> Looks like iommu_group is a NULL pointer (that when deref member
>>>> 'default_domain' cause this fault).
>>>>
>>>>
>>>>> [ 4406.965379] Mem abort info:
>>>>> [ 4406.968160] ESR = 0x0000000096000004
>>>>> [ 4406.971906] EC = 0x25: DABT (current EL), IL = 32 bits
>>>>> [ 4406.977218] SET = 0, FnV = 0
>>>>> [ 4406.980258] EA = 0, S1PTW = 0
>>>>> [ 4406.983404] FSC = 0x04: level 0 translation fault
>>>>> [ 4406.988273] Data abort info:
>>>>> [ 4406.991154] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>>>> [ 4406.996632] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>>>> [ 4407.001681] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>>>> [ 4407.006985] user pgtable: 4k pages, 48-bit VAs, pgdp=0000202828326000
>>>>> [ 4407.013430] [00000000000000a8] pgd=0000000000000000,
>>>>> p4d=0000000000000000
>>>>> [ 4407.020212] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>>>> [ 4407.026454] Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT
>>>>> nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle
>>>>> ip6table_filter ip6_tables hns_roce_hw_v2 hns3 hclge hnae3 xt_addrtype
>>>>> iptable_filter xt_conntrack overlay arm_spe_pmu arm_smmuv3_pmu
>>>>> hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_uncore_l3c_pmu
>>>>> hisi_uncore_pmu fuse rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
>>>>> scsi_transport_iscsi crct10dif_ce hisi_sec2 hisi_hpre hisi_zip
>>>>> hisi_sas_v3_hw xhci_pci sbsa_gwdt hisi_qm hisi_sas_main hisi_dma
>>>>> xhci_pci_renesas uacce libsas [last unloaded: hnae3]
>>>>> [ 4407.076027] CPU: 48 PID: 610 Comm: kworker/48:1
>>>>> [ 4407.093343] Workqueue: events page_pool_release_retry
>>>>> [ 4407.098384] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
>>>>> BTYPE=--)
>>>>> [ 4407.105316] pc : iommu_get_dma_domain+0xc/0x20
>>>>> [ 4407.109744] lr : iommu_dma_unmap_page+0x38/0xe8
>>>>> [ 4407.114255] sp : ffff80008bacbc80
>>>>> [ 4407.117554] x29: ffff80008bacbc80 x28: 0000000000000000 x27:
>>>>> ffffc31806be7000
>>>>> [ 4407.124659] x26: ffff2020002b6ac0 x25: 0000000000000000 x24:
>>>>> 0000000000000002
>>>>> [ 4407.131762] x23: 0000000000000022 x22: 0000000000001000 x21:
>>>>> 00000000fcd7c000
>>>>> [ 4407.138865] x20: ffff0020c9882800 x19: ffff0020856f60c8 x18:
>>>>> ffff8000d3503c58
>>>>> [ 4407.145968] x17: 0000000000000000 x16: 1fffe00419521061 x15:
>>>>> 0000000000000001
>>>>> [ 4407.153073] x14: 0000000000000003 x13: 00000401850ae012 x12:
>>>>> 000006b10004e7fb
>>>>> [ 4407.160177] x11: 0000000000000067 x10: 0000000000000c70 x9 :
>>>>> ffffc3180405cd20
>>>>> [ 4407.167280] x8 : fefefefefefefeff x7 : 0000000000000001 x6 :
>>>>> 0000000000000010
>>>>> [ 4407.174382] x5 : ffffc3180405cce8 x4 : 0000000000000022 x3 :
>>>>> 0000000000000002
>>>>> [ 4407.181485] x2 : 0000000000001000 x1 : 00000000fcd7c000 x0 :
>>>>> 0000000000000000
>>>>> [ 4407.188589] Call trace:
>>>>> [ 4407.191027] iommu_get_dma_domain+0xc/0x20
>>>>> [ 4407.195105] dma_unmap_page_attrs+0x38/0x1d0
>>>>> [ 4407.199361] page_pool_return_page+0x48/0x180
>>>>> [ 4407.203699] page_pool_release+0xd4/0x1f0
>>>>> [ 4407.207692] page_pool_release_retry+0x28/0xe8
>>>>
>>>> I suspect that the DMA IOMMU part was deallocated and freed by the
>>>> driver even though page_pool still has inflight packets.
>>> When you say driver, which 'driver' do you mean?
>>> I suspect this could be because of the VF instance going away with
>>> this cmd - disable the vf: echo 0 >
>>> /sys/class/net/eno1/device/sriov_numvfs, what do you think?
>>>>
>>>> The page_pool bumps refcnt via get_device() + put_device() on the DMA
>>>> 'struct device', to avoid it going away, but I guess there is also some
>>>> IOMMU code that we need to make sure doesn't go away (until all inflight
>>>> pages are returned) ???
>>
>> I guess the above is why thing went wrong here, the question is which
>> IOMMU code need to be called here to stop them from going away.
>>
>> What I am also curious is that there should be a pool of allocated iova in
>> iommu that is corresponding to the in-flight page for page_pool, shouldn't
>> iommu wait for the corresponding allocated iova to be freed similarly as
>> page_pool does for it's in-flight pages?
>>
>
>
> Is it possible you're using an IOMMU whose driver doesn't yet support
> blocking_domain? I'm currently working an issue on s390 that also
> occurs during device removal and is fixed by implementing blocking
> domain in the s390 IOMMU driver (patch forthcoming). The root cause for
> that is that our domain->ops->attach_dev() fails when during hot-unplug
> the device is already gone from the platform's point of view and then
> we ended up with a NULL domain unless we have a blocking domain which
> can handle non existant devices and gets set as fallback in
> __iommu_device_set_domain(). In the case I can reproduce the backtrace
> is different[0] but we also saw at least two cases where we see the
> exact same call trace as in the first mail of this thread. So far I
> suspected them to be due to the blocking domain issue but it could be a
> separate issue too.
>
> Thanks,
> Niklas
>
A couple of things to follow up on from Niklas' statement above...
First, after further testing on my end, I wanted to clarify that the implementation of a blocked domain is unrelated to this bug. Sorry for the noise.
Second, it looks like Niklas copied an unrelated backtrace into his report (I snipped it from my reply).
But I want to be clear: we can reproduce this same sort of error on s390 using a different device driver (mlx5_core), and the backtrace is almost identical to what this thread is reporting, in the same area. The most reliable repro method I've found so far is to use a few Mellanox VFs and power one down (echo 0 > /sys/bus/pci/slots/.../power) during or shortly after a TCP workload (iperf3). I verified that I can reproduce on a kernel as old as the 6.7 release tag; I stopped there because that's the release when s390 converted to use dma-iommu, but I assume the problem really existed longer than that.
Here's a backtrace of a repro on s390 (6.11-rc3):
[ 691.860855] Unable to handle kernel pointer dereference in virtual kernel address space
[ 691.861089] Failing address: 0706c00180000000 TEID: 0706c00180000803
[ 691.861097] Fault in home space mode while using kernel ASCE.
[ 691.861118] AS:0000000154fbc007 R3:0000000000000024
[ 691.861153] Oops: 0038 ilc:2 [#1] PREEMPT SMP
[ 691.861161] Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables rpcrdma rdma_ucm rdma_cm iw_cm ib_cm uvdevic
e s390_trng ism eadm_sch sunrpc tape_34xx tape tape_class mlx5_ib vfio_ap ib_uverbs kvm ib_core zcrypt_cex4 vfio_ccw mdev vfio_iommu_type1 vfio sch_fq_codel loop dm_multipath nfnetlink lcs ctcm fsm mlx5_core ghash_s390 prng chacha_s390 aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s3
90 sha_common zfcp scsi_transport_fc scsi_dh_rdac scsi_dh_emc scsi_dh_alua pkey zcrypt rng_core autofs4
[ 691.861319] CPU: 9 UID: 0 PID: 283 Comm: kworker/9:2 Not tainted 6.11.0-rc3 #1
[ 691.861325] Hardware name: IBM 8561 T01 772 (LPAR)
[ 691.861329] Workqueue: events page_pool_release_retry
[ 691.861342] Krnl PSW : 0704e00180000000 0000013bd3a6ee26 (iommu_iova_to_phys+0x6/0x40)
[ 691.861355] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 691.861363] Krnl GPRS: 0000000080000000 0000000000000000 0706c00180000000 0000000afba90000
[ 691.861369] 0000000000001000 0000000000000002 0000000000000022 0000000000000002
[ 691.861374] 0000000000001000 0000000afba90000 00000039c0e19000 00000039c0bc9098
[ 691.861379] 000000427ac33400 00000039c0e19068 0000013bd3a7655c 000000bbd8d1bb28
[ 691.861389] Krnl Code: 0000013bd3a6ee1c: 0707 bcr 0,%r7
[ 691.861389] 0000013bd3a6ee1e: 0707 bcr 0,%r7
[ 691.861389] #0000013bd3a6ee20: c004002a8d6c brcl 0,0000013bd3fc08f8
[ 691.861389] >0000013bd3a6ee26: 58502000 l %r5,0(%r2)
[ 691.861389] 0000013bd3a6ee2a: ec580014047e cij %r5,4,8,0000013bd3a6ee52
[ 691.861389] 0000013bd3a6ee30: ec58000c007e cij %r5,0,8,0000013bd3a6ee48
[ 691.861389] 0000013bd3a6ee36: e31020080004 lg %r1,8(%r2)
[ 691.861389] 0000013bd3a6ee3c: e31010400004 lg %r1,64(%r1)
[ 691.861426] Call Trace:
[ 691.861430] [<0000013bd3a6ee26>] iommu_iova_to_phys+0x6/0x40
[ 691.861436] [<0000013bd2f47a32>] dma_unmap_page_attrs+0x1a2/0x1e0
[ 691.861443] [<0000013bd3c2b81a>] page_pool_return_page+0x5a/0x130
[ 691.861449] [<0000013bd3c2cb68>] page_pool_release+0xb8/0x1f0
[ 691.861455] [<0000013bd3c2ce9c>] page_pool_release_retry+0x2c/0x120
[ 691.861461] [<0000013bd2e95652>] process_one_work+0x2b2/0x5d0
[ 691.861467] [<0000013bd2e9625e>] worker_thread+0x20e/0x3f0
[ 691.861473] [<0000013bd2ea25e2>] kthread+0x152/0x170
[ 691.861478] [<0000013bd2e135ac>] __ret_from_fork+0x3c/0x60
[ 691.861484] [<0000013bd3ecf0ca>] ret_from_fork+0xa/0x38
[ 691.861491] INFO: lockdep is turned off.
[ 691.861495] Last Breaking-Event-Address:
[ 691.861499] [<0000013bd3a76556>] iommu_dma_unmap_page+0x36/0xb0
[ 691.861507] Kernel panic - not syncing: Fatal exception: panic_on_oops
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-08-06 12:50 ` Robin Murphy
2024-08-06 13:50 ` Jason Gunthorpe
@ 2024-08-20 7:18 ` Yonglong Liu
2024-08-20 10:05 ` Robin Murphy
1 sibling, 1 reply; 21+ messages in thread
From: Yonglong Liu @ 2024-08-20 7:18 UTC (permalink / raw)
To: Robin Murphy, Yunsheng Lin, Somnath Kotur, Jesper Dangaard Brouer
Cc: David S. Miller, Jakub Kicinski, pabeni, ilias.apalodimas, netdev,
linux-kernel, Alexander Duyck, Alexei Starovoitov, shenjian (K),
Salil Mehta, joro, will, iommu
On 2024/8/6 20:50, Robin Murphy wrote:
> On 06/08/2024 12:54 pm, Yunsheng Lin wrote:
>> On 2024/8/5 20:53, Robin Murphy wrote:
>>>>>>
>>>>>> The page_pool bumps refcnt via get_device() + put_device() on the
>>>>>> DMA
>>>>>> 'struct device', to avoid it going away, but I guess there is
>>>>>> also some
>>>>>> IOMMU code that we need to make sure doesn't go away (until all
>>>>>> inflight
>>>>>> pages are returned) ???
>>>>
>>>> I guess the above is why things went wrong here; the question is which
>>>> IOMMU code needs to be called here to stop them from going away.
>>>
>>> This looks like the wrong device is being passed to dma_unmap_page()
>>> - if a device had an IOMMU DMA domain at the point when the DMA
>>> mapping was created, then neither that domain nor its group can
>>> legitimately have disappeared while that device still had a driver
>>> bound. Or if it *was* the right device, but it's already had
>>> device_del() called on it, then you have a fundamental lifecycle
>>> problem - a device with no driver bound should not be passed to the
>>> DMA API, much less a dead device that's already been removed from
>>> its parent bus.
>>
>> Yes, the device *was* the right device, And it's already had
>> device_del()
>> called on it.
>> page_pool tries to call get_device() on the DMA 'struct device' to
>> avoid the
>> above lifecycle problem, it seems get_device() does not stop
>> device_del()
>> from being called, and that is where we have the problem here:
>> https://elixir.bootlin.com/linux/v6.11-rc2/source/net/core/page_pool.c#L269
>>
>>
>> The above happens because driver with page_pool support may hand over
>> page still with dma mapping to network stack and try to reuse that page
>> after network stack is done with it and passes it back to page_pool
>> to avoid
>> the penalty of dma mapping/unmapping. With all the caching in the
>> network
>> stack, some pages may be held in the network stack without returning
>> to the
>> page_pool soon enough, and with VF disable causing the driver
>> unbound, the
>> page_pool does not stop the driver from doing it's unbounding work,
>> instead
>> page_pool uses workqueue to check if there is some pages coming back
>> from the
>> network stack periodically, if there is any, it will do the dma
>> unmmapping
>> related cleanup work.
>
> OK, that sounds like a more insidious problem - it's not just IOMMU
> stuff, in general the page pool should not be holding and using the
> device pointer after the device has already been destroyed. Even
> without an IOMMU, attempting DMA unmaps after the driver has already
> unbound may leak resources or at worst corrupt memory. Fundamentally,
> the page pool code cannot allow DMA mappings to outlive the driver
> they belong to.
>
> Thanks,
> Robin.
>
I found an easier way to reproduce the problem:
1. enable a vf
2. use mz to send only one fragment of a multi-fragment packet
3. disable the vf within 30 seconds
Jakub's patch:
https://lore.kernel.org/netdev/20240809205717.0c966bad@kernel.org/T/
can solve this case, but cannot solve the original case.
Based on the simpler case, I found that v6.9 does not have the problem
and v6.10-rc1 introduces it.
I traced the difference between the two versions:
v6.9:
page_pool_return_page()->dma_unmap_page_attrs()->*dma_direct_unmap_page()*
v6.10-rc1:
page_pool_return_page()->dma_unmap_page_attrs()->*iommu_dma_unmap_page()*
That's because in v6.9 dev->dma_ops is NULL and in v6.10-rc1 it isn't.
(The driver has already been unbound, but the device is still there.)
If I revert the patch:
https://lore.kernel.org/linux-arm-kernel/4a4482a0-4408-4d86-9f61-6589ab78686b@somainline.org/T/
then the bug I reported can no longer be reproduced.
So maybe this problem was introduced by that patch or the following
patch set?
https://lore.kernel.org/linux-iommu/cover.1713523152.git.robin.murphy@arm.com/
Or has the page pool always had this problem? In v6.9 and earlier, in
this case, the page pool maps with iommu_dma_map_page() but unmaps
with dma_direct_unmap_page() - the calls are not paired.
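The dma_ops difference matters because the unmap path dispatches on
whatever ops the device has at *unmap* time, not at map time - roughly
(a simplified sketch of the dispatch in kernel/dma/mapping.c; debug
hooks and corner cases omitted):

void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
			  enum dma_data_direction dir, unsigned long attrs)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	/* Dispatch is decided by dev->dma_ops at the time of this call. */
	if (dma_map_direct(dev, ops))
		dma_direct_unmap_page(dev, addr, size, dir, attrs);
	else if (ops->unmap_page)
		ops->unmap_page(dev, addr, size, dir, attrs);
}

So if dev->dma_ops changes between map and unmap, as it does across the
unbind here, the two calls are no longer a matched pair.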
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-08-20 7:18 ` Yonglong Liu
@ 2024-08-20 10:05 ` Robin Murphy
0 siblings, 0 replies; 21+ messages in thread
From: Robin Murphy @ 2024-08-20 10:05 UTC (permalink / raw)
To: Yonglong Liu, Yunsheng Lin, Somnath Kotur, Jesper Dangaard Brouer
Cc: David S. Miller, Jakub Kicinski, pabeni, ilias.apalodimas, netdev,
linux-kernel, Alexander Duyck, Alexei Starovoitov, shenjian (K),
Salil Mehta, joro, will, iommu, Matthew Rosato
On 2024-08-20 8:18 am, Yonglong Liu wrote:
>
> On 2024/8/6 20:50, Robin Murphy wrote:
>> On 06/08/2024 12:54 pm, Yunsheng Lin wrote:
>>> On 2024/8/5 20:53, Robin Murphy wrote:
>>>>>>>
>>>>>>> The page_pool bumps refcnt via get_device() + put_device() on the
>>>>>>> DMA 'struct device', to avoid it going away, but I guess there is
>>>>>>> also some IOMMU code that we need to make sure doesn't go away
>>>>>>> (until all inflight pages are returned) ???
>>>>>
>>>>> I guess the above is why things went wrong here; the question is which
>>>>> IOMMU code needs to be called here to stop them from going away.
>>>>
>>>> This looks like the wrong device is being passed to dma_unmap_page()
>>>> - if a device had an IOMMU DMA domain at the point when the DMA
>>>> mapping was created, then neither that domain nor its group can
>>>> legitimately have disappeared while that device still had a driver
>>>> bound. Or if it *was* the right device, but it's already had
>>>> device_del() called on it, then you have a fundamental lifecycle
>>>> problem - a device with no driver bound should not be passed to the
>>>> DMA API, much less a dead device that's already been removed from
>>>> its parent bus.
>>>
>>> Yes, the device *was* the right device, and device_del() had already
>>> been called on it.
>>> page_pool tries to call get_device() on the DMA 'struct device' to
>>> avoid the above lifecycle problem, but it seems get_device() does not
>>> stop device_del() from being called, and that is where we have the
>>> problem here:
>>> https://elixir.bootlin.com/linux/v6.11-rc2/source/net/core/page_pool.c#L269
>>>
>>> The above happens because a driver with page_pool support may hand
>>> pages that still carry DMA mappings over to the network stack, and
>>> then reuse those pages after the network stack is done with them and
>>> passes them back to the page_pool, to avoid the penalty of DMA
>>> mapping/unmapping. With all the caching in the network stack, some
>>> pages may be held there without returning to the page_pool soon
>>> enough, and with VF disable causing the driver to be unbound, the
>>> page_pool does not stop the driver from doing its unbinding work;
>>> instead the page_pool uses a workqueue to check periodically whether
>>> any pages have come back from the network stack, and if so, it does
>>> the DMA-unmapping-related cleanup work.
>>
>> OK, that sounds like a more insidious problem - it's not just IOMMU
>> stuff, in general the page pool should not be holding and using the
>> device pointer after the device has already been destroyed. Even
>> without an IOMMU, attempting DMA unmaps after the driver has already
>> unbound may leak resources or at worst corrupt memory. Fundamentally,
>> the page pool code cannot allow DMA mappings to outlive the driver
>> they belong to.
>>
>> Thanks,
>> Robin.
>>
> I found an easier way to reproduce the problem:
> 1. enable a VF
> 2. use mz to send only one fragment of a multi-fragment packet
> 3. disable the VF within 30 seconds
>
> Jakub's patch:
> https://lore.kernel.org/netdev/20240809205717.0c966bad@kernel.org/T/
> can solve this case, but cannot solve the original case.
>
> Based on the simple case, I found that v6.9 does not have the problem;
> v6.10-rc1 introduces it.
No, just because you don't see this particular crash symptom doesn't
mean the fundamental issue isn't still there. Besides, Matthew already
said he reproduced it on s390 back to v6.7 (and with the old DMA API
implementation before then I think it wouldn't have crashed, but would
have still leaked the hypervisor DMA mappings).
> I traced the difference between the two versions:
> v6.9:
> page_pool_return_page()->dma_unmap_page_attrs()->*dma_direct_unmap_page()*
> v6.10-rc1:
> page_pool_return_page()->dma_unmap_page_attrs()->*iommu_dma_unmap_page()*
> That's because in v6.9 dev->dma_ops is NULL at this point, while in
> v6.10-rc1 it is not. (The driver has already been unbound, but the
> device is still there.)
>
> If I revert the patch:
> https://lore.kernel.org/linux-arm-kernel/4a4482a0-4408-4d86-9f61-6589ab78686b@somainline.org/T/
> then the bug I reported can no longer be reproduced.
>
> So maybe this problem was introduced by that patch or the following
> patch set?
That patch fixes a bug. Reintroducing that bug and breaking probe
deferral on arm64 to attempt to paper over a design flaw in the page
pool code is clearly not a reasonable course of action.
> https://lore.kernel.org/linux-iommu/cover.1713523152.git.robin.murphy@arm.com/
> Or has the page pool always had this problem? In v6.9 and earlier, in
> this case, the page pool maps with iommu_dma_map_page() but unmaps
> with dma_direct_unmap_page() - the calls are not paired.
OK, do you start to see the problem there? If you got an address from
vmalloc() and passed it to free_pages(), would you expect that to work
either? Sure, with enough luck on a cache-coherent system you may have
been getting away with dma_direct_unmap_page() being largely a no-op,
but if the arbitrary IOVA DMA address from iommu-dma happens to look
like a SWIOTLB DMA address to dma-direct, that could be much more fun.
Thanks,
Robin.
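
To make that last point concrete, here is roughly what dma-direct does
on unmap (a rough sketch modelled on kernel/dma/direct.h; simplified,
and the exact helpers may differ between kernel versions):

static void dma_direct_unmap_page(struct device *dev, dma_addr_t addr,
				  size_t size, enum dma_data_direction dir,
				  unsigned long attrs)
{
	/* Treats the handle as derived from a CPU physical address... */
	phys_addr_t phys = dma_to_phys(dev, addr);

	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
		dma_direct_sync_single_for_cpu(dev, addr, size, dir);

	/*
	 * ...so if an arbitrary iommu-dma IOVA happens to decode to an
	 * address inside the SWIOTLB pool, this "bounces" data out of an
	 * unrelated buffer.
	 */
	if (unlikely(is_swiotlb_buffer(dev, phys)))
		swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
}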
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-08-06 13:50 ` Jason Gunthorpe
@ 2024-08-20 13:22 ` Mina Almasry
2024-08-20 14:43 ` Jakub Kicinski
0 siblings, 1 reply; 21+ messages in thread
From: Mina Almasry @ 2024-08-20 13:22 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Robin Murphy, Yunsheng Lin, Somnath Kotur, Jesper Dangaard Brouer,
Yonglong Liu, David S. Miller, Jakub Kicinski, pabeni,
ilias.apalodimas, netdev, linux-kernel, Alexander Duyck,
Alexei Starovoitov, shenjian (K), Salil Mehta, joro, will, iommu
On Tue, Aug 6, 2024 at 9:50 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Aug 06, 2024 at 01:50:08PM +0100, Robin Murphy wrote:
> > On 06/08/2024 12:54 pm, Yunsheng Lin wrote:
> > > On 2024/8/5 20:53, Robin Murphy wrote:
> > > > > > >
> > > > > > > The page_pool bumps refcnt via get_device() + put_device() on the DMA
> > > > > > > 'struct device', to avoid it going away, but I guess there is also some
> > > > > > > IOMMU code that we need to make sure doesn't go away (until all inflight
> > > > > > > pages are returned) ???
> > > > >
> > > > > I guess the above is why things went wrong here; the question is which
> > > > > IOMMU code needs to be called here to stop them from going away.
> > > >
> > > > This looks like the wrong device is being passed to dma_unmap_page() - if a device had an IOMMU DMA domain at the point when the DMA mapping was created, then neither that domain nor its group can legitimately have disappeared while that device still had a driver bound. Or if it *was* the right device, but it's already had device_del() called on it, then you have a fundamental lifecycle problem - a device with no driver bound should not be passed to the DMA API, much less a dead device that's already been removed from its parent bus.
> > >
> > > Yes, the device *was* the right device, and device_del() had already
> > > been called on it.
> > > page_pool tries to call get_device() on the DMA 'struct device' to avoid the
> > > above lifecycle problem, but it seems get_device() does not stop device_del()
> > > from being called, and that is where we have the problem here:
> > > https://elixir.bootlin.com/linux/v6.11-rc2/source/net/core/page_pool.c#L269
> > >
> > > The above happens because a driver with page_pool support may hand pages
> > > that still carry DMA mappings over to the network stack, and then reuse those
> > > pages after the network stack is done with them and passes them back to the
> > > page_pool, to avoid the penalty of DMA mapping/unmapping. With all the
> > > caching in the network stack, some pages may be held there without returning
> > > to the page_pool soon enough, and with VF disable causing the driver to be
> > > unbound, the page_pool does not stop the driver from doing its unbinding
> > > work; instead the page_pool uses a workqueue to check periodically whether
> > > any pages have come back from the network stack, and if so, it does the
> > > DMA-unmapping-related cleanup work.
> >
> > OK, that sounds like a more insidious problem - it's not just IOMMU stuff,
> > in general the page pool should not be holding and using the device pointer
> > after the device has already been destroyed. Even without an IOMMU,
> > attempting DMA unmaps after the driver has already unbound may leak
> > resources or at worst corrupt memory. Fundamentally, the page pool code
> > cannot allow DMA mappings to outlive the driver they belong to.
>
> +1
>
> There is more that gets broken here if these basic driver model rules
> are not followed!
>
> netdev must fix this by waiting during driver remove until all this DMA
> activity is finished somehow.
>
>
Sorry if this is a naive question, but why don't non-page_pool drivers
run into this issue? On driver unload, GVE seems to loop over the pages
in its rx queues and call gve_free_page_dqo(), which simply subtracts
the bias refcount, calls dma_unmap_page(), and then put_page(). The
pages will live on if they have references in the net stack, but the
driver is free to unload.
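
For comparison, that pattern looks roughly like the following (an
illustrative sketch only, not actual GVE code - the structure and field
names here are made up):

static void rxq_free_pages(struct my_rx_queue *rxq)
{
	int i;

	for (i = 0; i < rxq->fill_cnt; i++) {
		struct my_rx_buf *buf = &rxq->bufs[i];

		/* Unmap now, while the device is guaranteed valid. */
		dma_unmap_page(rxq->dev, buf->dma_addr, PAGE_SIZE,
			       DMA_FROM_DEVICE);
		/*
		 * Drop the driver's reference. The page may outlive the
		 * driver if the stack still holds references, but by then
		 * it no longer carries a DMA mapping.
		 */
		put_page(buf->page);
	}
}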
Is it for some reason not feasible for the page_pool to release the
pages on driver unload and destroy itself, rather than having to stick
around until all pending pages have been returned from the net stack?
--
Thanks,
Mina
* Re: [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20
2024-08-20 13:22 ` Mina Almasry
@ 2024-08-20 14:43 ` Jakub Kicinski
0 siblings, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2024-08-20 14:43 UTC (permalink / raw)
To: Mina Almasry
Cc: Jason Gunthorpe, Robin Murphy, Yunsheng Lin, Somnath Kotur,
Jesper Dangaard Brouer, Yonglong Liu, David S. Miller, pabeni,
ilias.apalodimas, netdev, linux-kernel, Alexander Duyck,
Alexei Starovoitov, shenjian (K), Salil Mehta, joro, will, iommu
On Tue, 20 Aug 2024 09:22:39 -0400 Mina Almasry wrote:
> Is it for some reason not feasible for the page_pool to release the
> pages on driver unload and destroy itself, rather than have to stick
> around until all pending pages have been returned from the net stack?
Normal page pool does not track all the pages it allocates.
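
That is, the pool only keeps an inflight *count* - something like (a
simplified sketch of the accounting in page_pool):

static s32 page_pool_inflight(const struct page_pool *pool)
{
	u32 hold_cnt = READ_ONCE(pool->pages_state_hold_cnt);
	u32 release_cnt = atomic_read(&pool->pages_state_release_cnt);

	/* Pages handed out minus pages returned; no per-page tracking. */
	return (s32)(hold_cnt - release_cnt);
}

With no list of outstanding pages, there is nothing the pool could walk
at unload time to unmap them eagerly.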
end of thread, other threads: [~2024-08-20 14:43 UTC | newest]
Thread overview: 21+ messages
[not found] <0e54954b-0880-4ebc-8ef0-13b3ac0a6838@huawei.com>
2024-07-30 13:08 ` [BUG REPORT]net: page_pool: kernel crash at iommu_get_dma_domain+0xc/0x20 Yonglong Liu
2024-07-30 17:21 ` Jesper Dangaard Brouer
2024-07-31 8:42 ` Somnath Kotur
2024-07-31 11:32 ` Yonglong Liu
2024-08-02 2:06 ` Yonglong Liu
2024-08-06 9:51 ` Jesper Dangaard Brouer
2024-08-06 14:23 ` Robin Murphy
2024-08-05 12:19 ` Yunsheng Lin
2024-08-05 12:53 ` Robin Murphy
2024-08-06 11:54 ` Yunsheng Lin
2024-08-06 12:50 ` Robin Murphy
2024-08-06 13:50 ` Jason Gunthorpe
2024-08-20 13:22 ` Mina Almasry
2024-08-20 14:43 ` Jakub Kicinski
2024-08-20 7:18 ` Yonglong Liu
2024-08-20 10:05 ` Robin Murphy
2024-08-06 13:35 ` Niklas Schnelle
2024-08-13 18:49 ` Matthew Rosato
2024-08-02 16:38 ` Alexander Duyck
2024-08-05 12:50 ` Yunsheng Lin
2024-08-06 0:39 ` Alexander Duyck