[PATCH 1/3] habanalabs: fix dma_addr passed to dma_mmap

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* [PATCH 1/3] habanalabs: fix dma_addr passed to dma_mmap_coherent
@ 2021-01-12 19:01 Oded Gabbay
  2021-01-12 19:01 ` [PATCH 2/3] habanalabs: fix reset process in case of failures Oded Gabbay
  2021-01-12 19:01 ` [PATCH 3/3] habanalabs: prevent soft lockup during unmap Oded Gabbay
  0 siblings, 2 replies; 3+ messages in thread
From: Oded Gabbay @ 2021-01-12 19:01 UTC (permalink / raw)
  To: linux-kernel

When doing dma_alloc_coherent in the driver, we add a certain hard-coded
offset to the DMA address before returning to the callee function. This
offset is needed when our device use this DMA address to perform
outbound transactions to the host.

However, if we want to map the DMA'able memory to the user via
dma_mmap_coherent(), we need to pass the original dma address, without
this offset. Otherwise, we will get erronouos mapping.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/misc/habanalabs/gaudi/gaudi.c | 3 ++-
 drivers/misc/habanalabs/goya/goya.c   | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index 8c09e4466af8..b328ddaa64ee 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -4002,7 +4002,8 @@ static int gaudi_cb_mmap(struct hl_device *hdev, struct vm_area_struct *vma,
 	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP |
 			VM_DONTCOPY | VM_NORESERVE;
 
-	rc = dma_mmap_coherent(hdev->dev, vma, cpu_addr, dma_addr, size);
+	rc = dma_mmap_coherent(hdev->dev, vma, cpu_addr,
+				(dma_addr - HOST_PHYS_BASE), size);
 	if (rc)
 		dev_err(hdev->dev, "dma_mmap_coherent error %d", rc);
 
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index b8b4aa636b7c..63679a747d2c 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -2719,7 +2719,8 @@ static int goya_cb_mmap(struct hl_device *hdev, struct vm_area_struct *vma,
 	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP |
 			VM_DONTCOPY | VM_NORESERVE;
 
-	rc = dma_mmap_coherent(hdev->dev, vma, cpu_addr, dma_addr, size);
+	rc = dma_mmap_coherent(hdev->dev, vma, cpu_addr,
+				(dma_addr - HOST_PHYS_BASE), size);
 	if (rc)
 		dev_err(hdev->dev, "dma_mmap_coherent error %d", rc);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH 2/3] habanalabs: fix reset process in case of failures
  2021-01-12 19:01 [PATCH 1/3] habanalabs: fix dma_addr passed to dma_mmap_coherent Oded Gabbay
@ 2021-01-12 19:01 ` Oded Gabbay
  2021-01-12 19:01 ` [PATCH 3/3] habanalabs: prevent soft lockup during unmap Oded Gabbay
  1 sibling, 0 replies; 3+ messages in thread
From: Oded Gabbay @ 2021-01-12 19:01 UTC (permalink / raw)
  To: linux-kernel

There are some points in the reset process where if the code fails
for some reason, and the system admin tries to initiate the reset
process again we will get a kernel panic.

This is because there aren't any protections in different fini
functions that are called during the reset process.

The protections that are added in this patch make sure that if the fini
functions are called multiple times, without calling init functions
between them, there won't be double release of already released
resources.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/misc/habanalabs/common/device.c |  2 +-
 drivers/misc/habanalabs/common/mmu_v1.c | 12 ++++++++++--
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/misc/habanalabs/common/device.c b/drivers/misc/habanalabs/common/device.c
index 1456eabf9601..1ea57d86caa3 100644
--- a/drivers/misc/habanalabs/common/device.c
+++ b/drivers/misc/habanalabs/common/device.c
@@ -1037,7 +1037,7 @@ int hl_device_reset(struct hl_device *hdev, bool hard_reset,
 
 	if (hard_reset) {
 		/* Release kernel context */
-		if (hl_ctx_put(hdev->kernel_ctx) == 1)
+		if (hdev->kernel_ctx && hl_ctx_put(hdev->kernel_ctx) == 1)
 			hdev->kernel_ctx = NULL;
 		hl_vm_fini(hdev);
 		hl_mmu_fini(hdev);
diff --git a/drivers/misc/habanalabs/common/mmu_v1.c b/drivers/misc/habanalabs/common/mmu_v1.c
index 2ce6ea89d4fa..06d8a44dd5d4 100644
--- a/drivers/misc/habanalabs/common/mmu_v1.c
+++ b/drivers/misc/habanalabs/common/mmu_v1.c
@@ -467,8 +467,16 @@ static void hl_mmu_v1_fini(struct hl_device *hdev)
 {
 	/* MMU H/W fini was already done in device hw_fini() */
 
-	kvfree(hdev->mmu_priv.dr.mmu_shadow_hop0);
-	gen_pool_destroy(hdev->mmu_priv.dr.mmu_pgt_pool);
+	if (!ZERO_OR_NULL_PTR(hdev->mmu_priv.hr.mmu_shadow_hop0)) {
+		kvfree(hdev->mmu_priv.dr.mmu_shadow_hop0);
+		gen_pool_destroy(hdev->mmu_priv.dr.mmu_pgt_pool);
+	}
+
+	/* Make sure that if we arrive here again without init was called we
+	 * won't cause kernel panic. This can happen for example if we fail
+	 * during hard reset code at certain points
+	 */
+	hdev->mmu_priv.dr.mmu_shadow_hop0 = NULL;
 }
 
 /**
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH 3/3] habanalabs: prevent soft lockup during unmap
  2021-01-12 19:01 [PATCH 1/3] habanalabs: fix dma_addr passed to dma_mmap_coherent Oded Gabbay
  2021-01-12 19:01 ` [PATCH 2/3] habanalabs: fix reset process in case of failures Oded Gabbay
@ 2021-01-12 19:01 ` Oded Gabbay
  1 sibling, 0 replies; 3+ messages in thread
From: Oded Gabbay @ 2021-01-12 19:01 UTC (permalink / raw)
  To: linux-kernel

When using Deep learning framework such as tensorflow or pytorch, there
are tens of thousands of host memory mappings. When the user frees
all those mappings at the same time, the process of unmapping and
unpinning them can take a long time, which may cause a soft lockup
bug.

To prevent this, we need to free the core to do other things during
the unmapping process. For now, we chose to do it every 32K unmappings
(each unmap is a single 4K page).

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/misc/habanalabs/common/habanalabs.h |  1 +
 drivers/misc/habanalabs/common/memory.c     | 10 ++++++++--
 drivers/misc/habanalabs/common/mmu.c        |  6 +++---
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h
index e0d7f5fbaa5c..60e16dc4bcac 100644
--- a/drivers/misc/habanalabs/common/habanalabs.h
+++ b/drivers/misc/habanalabs/common/habanalabs.h
@@ -2182,6 +2182,7 @@ void hl_mmu_v1_set_funcs(struct hl_device *hdev, struct hl_mmu_funcs *mmu);
 int hl_mmu_va_to_pa(struct hl_ctx *ctx, u64 virt_addr, u64 *phys_addr);
 int hl_mmu_get_tlb_info(struct hl_ctx *ctx, u64 virt_addr,
 			struct hl_mmu_hop_info *hops);
+bool hl_is_dram_va(struct hl_device *hdev, u64 virt_addr);
 
 int hl_fw_load_fw_to_device(struct hl_device *hdev, const char *fw_name,
 				void __iomem *dst, u32 src_offset, u32 size);
diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c
index cbe9da4e0211..5d4fbdcaefe3 100644
--- a/drivers/misc/habanalabs/common/memory.c
+++ b/drivers/misc/habanalabs/common/memory.c
@@ -886,8 +886,10 @@ static void unmap_phys_pg_pack(struct hl_ctx *ctx, u64 vaddr,
 {
 	struct hl_device *hdev = ctx->hdev;
 	u64 next_vaddr, i;
+	bool is_host_addr;
 	u32 page_size;
 
+	is_host_addr = !hl_is_dram_va(hdev, vaddr);
 	page_size = phys_pg_pack->page_size;
 	next_vaddr = vaddr;
 
@@ -900,9 +902,13 @@ static void unmap_phys_pg_pack(struct hl_ctx *ctx, u64 vaddr,
 		/*
 		 * unmapping on Palladium can be really long, so avoid a CPU
 		 * soft lockup bug by sleeping a little between unmapping pages
+		 *
+		 * In addition, when unmapping host memory we pass through
+		 * the Linux kernel to unpin the pages and that takes a long
+		 * time. Therefore, sleep every 32K pages to avoid soft lockup
 		 */
-		if (hdev->pldm)
-			usleep_range(500, 1000);
+		if (hdev->pldm || (is_host_addr && (i & 0x7FFF) == 0))
+			usleep_range(50, 200);
 	}
 }
 
diff --git a/drivers/misc/habanalabs/common/mmu.c b/drivers/misc/habanalabs/common/mmu.c
index 33ae953d3a36..28a4638741d8 100644
--- a/drivers/misc/habanalabs/common/mmu.c
+++ b/drivers/misc/habanalabs/common/mmu.c
@@ -9,7 +9,7 @@
 
 #include "habanalabs.h"
 
-static bool is_dram_va(struct hl_device *hdev, u64 virt_addr)
+bool hl_is_dram_va(struct hl_device *hdev, u64 virt_addr)
 {
 	struct asic_fixed_properties *prop = &hdev->asic_prop;
 
@@ -156,7 +156,7 @@ int hl_mmu_unmap_page(struct hl_ctx *ctx, u64 virt_addr, u32 page_size,
 	if (!hdev->mmu_enable)
 		return 0;
 
-	is_dram_addr = is_dram_va(hdev, virt_addr);
+	is_dram_addr = hl_is_dram_va(hdev, virt_addr);
 
 	if (is_dram_addr)
 		mmu_prop = &prop->dmmu;
@@ -236,7 +236,7 @@ int hl_mmu_map_page(struct hl_ctx *ctx, u64 virt_addr, u64 phys_addr,
 	if (!hdev->mmu_enable)
 		return 0;
 
-	is_dram_addr = is_dram_va(hdev, virt_addr);
+	is_dram_addr = hl_is_dram_va(hdev, virt_addr);
 
 	if (is_dram_addr)
 		mmu_prop = &prop->dmmu;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-01-12 19:02 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-01-12 19:01 [PATCH 1/3] habanalabs: fix dma_addr passed to dma_mmap_coherent Oded Gabbay
2021-01-12 19:01 ` [PATCH 2/3] habanalabs: fix reset process in case of failures Oded Gabbay
2021-01-12 19:01 ` [PATCH 3/3] habanalabs: prevent soft lockup during unmap Oded Gabbay

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox