* [PATCH kernel v2 0/6] powerpc/powernv/iommu: Optimize memory use
From: Alexey Kardashevskiy @ 2018-06-17 11:14 UTC (permalink / raw)
To: linuxppc-dev
Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Alex Williamson,
Benjamin Herrenschmidt, Russell Currey
This patchset aims to reduce actual memory use for guests with
sparse memory. The pseries guest uses dynamic DMA windows to map
the entire guest RAM but it only actually maps onlined memory
which may be not be contiguous. I hit this when tried passing
through NVLink2-connected GPU RAM of NVIDIA V100 and trying to
map this RAM at the same offset as in the real hardware
forced me to rework I handle these windows.
This moves userspace-to-host-physical translation table
(iommu_table::it_userspace) from VFIO TCE IOMMU subdriver to
the platform code and reuses the already existing multilevel
TCE table code which we have for the hardware tables.
At last in 6/6 I switch to on-demand allocation so we do not
allocate huge chunks of the table if we do not have to;
there is some math in 6/6.
Changes:
v2:
* bugfix and error handling in 6/6
Please comment. Thanks.
Alexey Kardashevskiy (6):
powerpc/powernv: Remove useless wrapper
powerpc/powernv: Move TCE manupulation code to its own file
KVM: PPC: Make iommu_table::it_userspace big endian
powerpc/powernv: Add indirect levels to it_userspace
powerpc/powernv: Rework TCE level allocation
powerpc/powernv/ioda: Allocate indirect TCE levels on demand
arch/powerpc/platforms/powernv/Makefile | 2 +-
arch/powerpc/include/asm/iommu.h | 11 +-
arch/powerpc/platforms/powernv/pci.h | 44 ++-
arch/powerpc/kvm/book3s_64_vio.c | 11 +-
arch/powerpc/kvm/book3s_64_vio_hv.c | 18 +-
arch/powerpc/platforms/powernv/pci-ioda-tce.c | 395 ++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/pci-ioda.c | 192 ++-----------
arch/powerpc/platforms/powernv/pci.c | 158 -----------
drivers/vfio/vfio_iommu_spapr_tce.c | 65 +----
9 files changed, 482 insertions(+), 414 deletions(-)
create mode 100644 arch/powerpc/platforms/powernv/pci-ioda-tce.c
--
2.11.0
^ permalink raw reply
* [PATCH kernel v2 6/6] powerpc/powernv/ioda: Allocate indirect TCE levels on demand
From: Alexey Kardashevskiy @ 2018-06-17 11:14 UTC (permalink / raw)
To: linuxppc-dev
Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Alex Williamson,
Benjamin Herrenschmidt, Russell Currey
In-Reply-To: <20180617111428.24349-1-aik@ozlabs.ru>
At the moment we allocate the entire TCE table, twice (hardware part and
userspace translation cache). This normally works as we normally have
contigous memory and the guest will map entire RAM for 64bit DMA.
However if we have sparse RAM (one example is a memory device), then
we will allocate TCEs which will never be used as the guest only maps
actual memory for DMA. If it is a single level TCE table, there is nothing
we can really do but if it a multilevel table, we can skip allocating
TCEs we know we won't need.
This adds ability to allocate only first level, saving memory.
This changes iommu_table::free() to avoid allocating of an extra level;
iommu_table::set() will do this when needed.
This adds @alloc parameter to iommu_table::exchange() to tell the callback
if it can allocate an extra level; the flag is set to "false" for
the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns
H_TOO_HARD.
This still requires the entire table to be counted in mm::locked_vm.
To be conservative, this only does on-demand allocation when
the usespace cache table is requested which is the case of VFIO.
The example math for a system replicating a powernv setup with NVLink2
in a guest:
16GB RAM mapped at 0x0
128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000
the table to cover that all with 64K pages takes:
(((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB
If we allocate only necessary TCE levels, we will only need:
(((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect
levels).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* fixed bug in cleanup path which forced the entire table to be
allocated right before destroying
* added memory allocation error handling pnv_tce()
---
arch/powerpc/include/asm/iommu.h | 7 ++-
arch/powerpc/platforms/powernv/pci.h | 6 ++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 4 +-
arch/powerpc/platforms/powernv/pci-ioda-tce.c | 69 ++++++++++++++++++++-------
arch/powerpc/platforms/powernv/pci-ioda.c | 8 ++--
drivers/vfio/vfio_iommu_spapr_tce.c | 2 +-
6 files changed, 69 insertions(+), 27 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 4bdcf22..daa3ee5 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -70,7 +70,7 @@ struct iommu_table_ops {
unsigned long *hpa,
enum dma_data_direction *direction);
- __be64 *(*useraddrptr)(struct iommu_table *tbl, long index);
+ __be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc);
#endif
void (*clear)(struct iommu_table *tbl,
long index, long npages);
@@ -122,10 +122,13 @@ struct iommu_table {
__be64 *it_userspace; /* userspace view of the table */
struct iommu_table_ops *it_ops;
struct kref it_kref;
+ int it_nid;
};
+#define IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry) \
+ ((tbl)->it_ops->useraddrptr((tbl), (entry), false))
#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
- ((tbl)->it_ops->useraddrptr((tbl), (entry)))
+ ((tbl)->it_ops->useraddrptr((tbl), (entry), true))
/* Pure 2^n version of get_order */
static inline __attribute_const__
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 5e02408..1fa5590 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -267,8 +267,10 @@ extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
unsigned long attrs);
extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
- unsigned long *hpa, enum dma_data_direction *direction);
-extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index);
+ unsigned long *hpa, enum dma_data_direction *direction,
+ bool alloc);
+extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index,
+ bool alloc);
extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
extern long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index db0490c..05b4865 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -200,7 +200,7 @@ static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
{
struct mm_iommu_table_group_mem_t *mem = NULL;
const unsigned long pgsize = 1ULL << tbl->it_page_shift;
- __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+ __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry);
if (!pua)
/* it_userspace allocation might be delayed */
@@ -264,7 +264,7 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
{
long ret;
unsigned long hpa = 0;
- __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+ __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry);
struct mm_iommu_table_group_mem_t *mem;
if (!pua)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index 36c2eb0..a7debfb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -48,7 +48,7 @@ static __be64 *pnv_alloc_tce_level(int nid, unsigned int shift)
return addr;
}
-static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx)
+static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx, bool alloc)
{
__be64 *tmp = user ? tbl->it_userspace : (__be64 *) tbl->it_base;
int level = tbl->it_indirect_levels;
@@ -57,7 +57,20 @@ static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx)
while (level) {
int n = (idx & mask) >> (level * shift);
- unsigned long tce = be64_to_cpu(tmp[n]);
+ unsigned long tce;
+
+ if (tmp[n] == 0) {
+ __be64 *tmp2;
+
+ if (!alloc)
+ return NULL;
+
+ tmp2 = pnv_alloc_tce_level(tbl->it_nid,
+ ilog2(tbl->it_level_size) + 3);
+ tmp[n] = cpu_to_be64(__pa(tmp2) |
+ TCE_PCI_READ | TCE_PCI_WRITE);
+ }
+ tce = be64_to_cpu(tmp[n]);
tmp = __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE));
idx &= ~mask;
@@ -84,7 +97,7 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
((rpn + i) << tbl->it_page_shift);
unsigned long idx = index - tbl->it_offset + i;
- *(pnv_tce(tbl, false, idx)) = cpu_to_be64(newtce);
+ *(pnv_tce(tbl, false, idx, true)) = cpu_to_be64(newtce);
}
return 0;
@@ -92,31 +105,45 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
#ifdef CONFIG_IOMMU_API
int pnv_tce_xchg(struct iommu_table *tbl, long index,
- unsigned long *hpa, enum dma_data_direction *direction)
+ unsigned long *hpa, enum dma_data_direction *direction,
+ bool alloc)
{
u64 proto_tce = iommu_direction_to_tce_perm(*direction);
unsigned long newtce = *hpa | proto_tce, oldtce;
unsigned long idx = index - tbl->it_offset;
+ __be64 *ptce;
BUG_ON(*hpa & ~IOMMU_PAGE_MASK(tbl));
if (newtce & TCE_PCI_WRITE)
newtce |= TCE_PCI_READ;
- oldtce = be64_to_cpu(xchg(pnv_tce(tbl, false, idx),
- cpu_to_be64(newtce)));
+ ptce = pnv_tce(tbl, false, idx, alloc);
+ if (!ptce) {
+ if (*direction == DMA_NONE) {
+ *hpa = 0;
+ return 0;
+ }
+ /* It is likely to be realmode */
+ if (!alloc)
+ return H_TOO_HARD;
+
+ return H_HARDWARE;
+ }
+
+ oldtce = be64_to_cpu(xchg(ptce, cpu_to_be64(newtce)));
*hpa = oldtce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
*direction = iommu_tce_direction(oldtce);
return 0;
}
-__be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index)
+__be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index, bool alloc)
{
if (WARN_ON_ONCE(!tbl->it_userspace))
return NULL;
- return pnv_tce(tbl, true, index - tbl->it_offset);
+ return pnv_tce(tbl, true, index - tbl->it_offset, alloc);
}
#endif
@@ -126,14 +153,19 @@ void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
for (i = 0; i < npages; i++) {
unsigned long idx = index - tbl->it_offset + i;
+ __be64 *ptce = pnv_tce(tbl, false, idx, false);
- *(pnv_tce(tbl, false, idx)) = cpu_to_be64(0);
+ if (ptce)
+ *ptce = cpu_to_be64(0);
}
}
unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
{
- __be64 *ptce = pnv_tce(tbl, false, index - tbl->it_offset);
+ __be64 *ptce = pnv_tce(tbl, false, index - tbl->it_offset, false);
+
+ if (!ptce)
+ return 0;
return be64_to_cpu(*ptce);
}
@@ -224,6 +256,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
unsigned int table_shift = max_t(unsigned int, entries_shift + 3,
PAGE_SHIFT);
const unsigned long tce_table_size = 1UL << table_shift;
+ unsigned int tmplevels = levels;
if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
return -EINVAL;
@@ -231,6 +264,9 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
if (!is_power_of_2(window_size))
return -EINVAL;
+ if (alloc_userspace_copy && (window_size > (1ULL << 32)))
+ tmplevels = 1;
+
/* Adjust direct table size from window_size and levels */
entries_shift = (entries_shift + levels - 1) / levels;
level_shift = entries_shift + 3;
@@ -241,7 +277,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
/* Allocate TCE table */
addr = pnv_pci_ioda2_table_do_alloc_pages(nid, level_shift,
- levels, tce_table_size, &offset, &total_allocated);
+ tmplevels, tce_table_size, &offset, &total_allocated);
/* addr==NULL means that the first level allocation failed */
if (!addr)
@@ -252,7 +288,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
* we did not allocate as much as we wanted,
* release partially allocated table.
*/
- if (offset < tce_table_size)
+ if (tmplevels == levels && offset < tce_table_size)
goto free_tces_exit;
/* Allocate userspace view of the TCE table */
@@ -263,8 +299,8 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
&total_allocated_uas);
if (!uas)
goto free_tces_exit;
- if (offset < tce_table_size ||
- total_allocated_uas != total_allocated)
+ if (tmplevels == levels && (offset < tce_table_size ||
+ total_allocated_uas != total_allocated))
goto free_uas_exit;
}
@@ -275,10 +311,11 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
tbl->it_indirect_levels = levels - 1;
tbl->it_allocated_size = total_allocated;
tbl->it_userspace = uas;
+ tbl->it_nid = nid;
- pr_debug("Created TCE table: ws=%08llx ts=%lx @%08llx base=%lx uas=%p levels=%d\n",
+ pr_debug("Created TCE table: ws=%08llx ts=%lx @%08llx base=%lx uas=%p levels=%d/%d\n",
window_size, tce_table_size, bus_offset, tbl->it_base,
- tbl->it_userspace, levels);
+ tbl->it_userspace, tmplevels, levels);
return 0;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c61c04d..d9df620 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2010,7 +2010,7 @@ static int pnv_ioda1_tce_build(struct iommu_table *tbl, long index,
static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
unsigned long *hpa, enum dma_data_direction *direction)
{
- long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+ long ret = pnv_tce_xchg(tbl, index, hpa, direction, true);
if (!ret)
pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, false);
@@ -2021,7 +2021,7 @@ static int pnv_ioda1_tce_xchg(struct iommu_table *tbl, long index,
static int pnv_ioda1_tce_xchg_rm(struct iommu_table *tbl, long index,
unsigned long *hpa, enum dma_data_direction *direction)
{
- long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+ long ret = pnv_tce_xchg(tbl, index, hpa, direction, false);
if (!ret)
pnv_pci_p7ioc_tce_invalidate(tbl, index, 1, true);
@@ -2175,7 +2175,7 @@ static int pnv_ioda2_tce_build(struct iommu_table *tbl, long index,
static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
unsigned long *hpa, enum dma_data_direction *direction)
{
- long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+ long ret = pnv_tce_xchg(tbl, index, hpa, direction, true);
if (!ret)
pnv_pci_ioda2_tce_invalidate(tbl, index, 1, false);
@@ -2186,7 +2186,7 @@ static int pnv_ioda2_tce_xchg(struct iommu_table *tbl, long index,
static int pnv_ioda2_tce_xchg_rm(struct iommu_table *tbl, long index,
unsigned long *hpa, enum dma_data_direction *direction)
{
- long ret = pnv_tce_xchg(tbl, index, hpa, direction);
+ long ret = pnv_tce_xchg(tbl, index, hpa, direction, false);
if (!ret)
pnv_pci_ioda2_tce_invalidate(tbl, index, 1, true);
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 833f926..1e58fb9 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -635,7 +635,7 @@ static long tce_iommu_create_table(struct tce_container *container,
page_shift, window_size, levels, ptbl);
WARN_ON(!ret && !(*ptbl)->it_ops->free);
- WARN_ON(!ret && ((*ptbl)->it_allocated_size != table_size));
+ WARN_ON(!ret && ((*ptbl)->it_allocated_size > table_size));
return ret;
}
--
2.11.0
^ permalink raw reply related
* [PATCH kernel v2 4/6] powerpc/powernv: Add indirect levels to it_userspace
From: Alexey Kardashevskiy @ 2018-06-17 11:14 UTC (permalink / raw)
To: linuxppc-dev
Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Alex Williamson,
Benjamin Herrenschmidt, Russell Currey
In-Reply-To: <20180617111428.24349-1-aik@ozlabs.ru>
We want to support sparse memory and therefore huge chunks of DMA windows
do not need to be mapped. If a DMA window big enough to require 2 or more
indirect levels, and a DMA window is used to map all RAM (which is
a default case for 64bit window), we can actually save some memory by
not allocation TCE for regions which we are not going to map anyway.
The hardware tables alreary support indirect levels but we also keep
host-physical-to-userspace translation array which is allocated by
vmalloc() and is a flat array which might use quite some memory.
This converts it_userspace from vmalloc'ed array to a multi level table.
As the format becomes platform dependend, this replaces the direct access
to it_usespace with a iommu_table_ops::useraddrptr hook which returns
a pointer to the userspace copy of a TCE; future extension will return
NULL if the level was not allocated.
This should not change non-KVM handling of TCE tables and it_userspace
will not be allocated for non-KVM tables.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
arch/powerpc/include/asm/iommu.h | 6 +--
arch/powerpc/platforms/powernv/pci.h | 3 +-
arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ----
arch/powerpc/platforms/powernv/pci-ioda-tce.c | 65 +++++++++++++++++++++------
arch/powerpc/platforms/powernv/pci-ioda.c | 31 ++++++++++---
drivers/vfio/vfio_iommu_spapr_tce.c | 46 -------------------
6 files changed, 81 insertions(+), 78 deletions(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 803ac70..4bdcf22 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -69,6 +69,8 @@ struct iommu_table_ops {
long index,
unsigned long *hpa,
enum dma_data_direction *direction);
+
+ __be64 *(*useraddrptr)(struct iommu_table *tbl, long index);
#endif
void (*clear)(struct iommu_table *tbl,
long index, long npages);
@@ -123,9 +125,7 @@ struct iommu_table {
};
#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
- ((tbl)->it_userspace ? \
- &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
- NULL)
+ ((tbl)->it_ops->useraddrptr((tbl), (entry)))
/* Pure 2^n version of get_order */
static inline __attribute_const__
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index f507baf..5e02408 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -268,11 +268,12 @@ extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
unsigned long *hpa, enum dma_data_direction *direction);
+extern __be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index);
extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
extern long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
__u32 page_shift, __u64 window_size, __u32 levels,
- struct iommu_table *tbl);
+ bool alloc_userspace_copy, struct iommu_table *tbl);
extern void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl);
extern long pnv_pci_link_table_and_group(int node, int num,
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 18109f3..db0490c 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -206,10 +206,6 @@ static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
/* it_userspace allocation might be delayed */
return H_TOO_HARD;
- pua = (void *) vmalloc_to_phys(pua);
- if (WARN_ON_ONCE_RM(!pua))
- return H_HARDWARE;
-
mem = mm_iommu_lookup_rm(kvm->mm, be64_to_cpu(*pua), pgsize);
if (!mem)
return H_TOO_HARD;
@@ -282,10 +278,6 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
return H_HARDWARE;
- pua = (void *) vmalloc_to_phys(pua);
- if (WARN_ON_ONCE_RM(!pua))
- return H_HARDWARE;
-
if (WARN_ON_ONCE_RM(mm_iommu_mapped_inc(mem)))
return H_CLOSED;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index 700ceb1..f14b282 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -31,9 +31,9 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
tbl->it_type = TCE_PCI;
}
-static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
+static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx)
{
- __be64 *tmp = ((__be64 *)tbl->it_base);
+ __be64 *tmp = user ? tbl->it_userspace : (__be64 *) tbl->it_base;
int level = tbl->it_indirect_levels;
const long shift = ilog2(tbl->it_level_size);
unsigned long mask = (tbl->it_level_size - 1) << (level * shift);
@@ -67,7 +67,7 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
((rpn + i) << tbl->it_page_shift);
unsigned long idx = index - tbl->it_offset + i;
- *(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
+ *(pnv_tce(tbl, false, idx)) = cpu_to_be64(newtce);
}
return 0;
@@ -86,12 +86,21 @@ int pnv_tce_xchg(struct iommu_table *tbl, long index,
if (newtce & TCE_PCI_WRITE)
newtce |= TCE_PCI_READ;
- oldtce = be64_to_cpu(xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce)));
+ oldtce = be64_to_cpu(xchg(pnv_tce(tbl, false, idx),
+ cpu_to_be64(newtce)));
*hpa = oldtce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
*direction = iommu_tce_direction(oldtce);
return 0;
}
+
+__be64 *pnv_tce_useraddrptr(struct iommu_table *tbl, long index)
+{
+ if (WARN_ON_ONCE(!tbl->it_userspace))
+ return NULL;
+
+ return pnv_tce(tbl, true, index - tbl->it_offset);
+}
#endif
void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
@@ -101,13 +110,15 @@ void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
for (i = 0; i < npages; i++) {
unsigned long idx = index - tbl->it_offset + i;
- *(pnv_tce(tbl, idx)) = cpu_to_be64(0);
+ *(pnv_tce(tbl, false, idx)) = cpu_to_be64(0);
}
}
unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
{
- return be64_to_cpu(*(pnv_tce(tbl, index - tbl->it_offset)));
+ __be64 *ptce = pnv_tce(tbl, false, index - tbl->it_offset);
+
+ return be64_to_cpu(*ptce);
}
static void pnv_pci_ioda2_table_do_free_pages(__be64 *addr,
@@ -144,6 +155,10 @@ void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl)
pnv_pci_ioda2_table_do_free_pages((__be64 *)tbl->it_base, size,
tbl->it_indirect_levels);
+ if (tbl->it_userspace) {
+ pnv_pci_ioda2_table_do_free_pages(tbl->it_userspace, size,
+ tbl->it_indirect_levels);
+ }
}
static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned int shift,
@@ -191,10 +206,11 @@ static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned int shift,
long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
__u32 page_shift, __u64 window_size, __u32 levels,
- struct iommu_table *tbl)
+ bool alloc_userspace_copy, struct iommu_table *tbl)
{
- void *addr;
+ void *addr, *uas = NULL;
unsigned long offset = 0, level_shift, total_allocated = 0;
+ unsigned long total_allocated_uas = 0;
const unsigned int window_shift = ilog2(window_size);
unsigned int entries_shift = window_shift - page_shift;
unsigned int table_shift = max_t(unsigned int, entries_shift + 3,
@@ -228,10 +244,20 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
* we did not allocate as much as we wanted,
* release partially allocated table.
*/
- if (offset < tce_table_size) {
- pnv_pci_ioda2_table_do_free_pages(addr,
- 1ULL << (level_shift - 3), levels - 1);
- return -ENOMEM;
+ if (offset < tce_table_size)
+ goto free_tces_exit;
+
+ /* Allocate userspace view of the TCE table */
+ if (alloc_userspace_copy) {
+ offset = 0;
+ uas = pnv_pci_ioda2_table_do_alloc_pages(nid, level_shift,
+ levels, tce_table_size, &offset,
+ &total_allocated_uas);
+ if (!uas)
+ goto free_tces_exit;
+ if (offset < tce_table_size ||
+ total_allocated_uas != total_allocated)
+ goto free_uas_exit;
}
/* Setup linux iommu table */
@@ -240,11 +266,22 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
tbl->it_level_size = 1ULL << (level_shift - 3);
tbl->it_indirect_levels = levels - 1;
tbl->it_allocated_size = total_allocated;
+ tbl->it_userspace = uas;
- pr_devel("Created TCE table: ws=%08llx ts=%lx @%08llx\n",
- window_size, tce_table_size, bus_offset);
+ pr_debug("Created TCE table: ws=%08llx ts=%lx @%08llx base=%lx uas=%p levels=%d\n",
+ window_size, tce_table_size, bus_offset, tbl->it_base,
+ tbl->it_userspace, levels);
return 0;
+
+free_uas_exit:
+ pnv_pci_ioda2_table_do_free_pages(uas,
+ 1ULL << (level_shift - 3), levels - 1);
+free_tces_exit:
+ pnv_pci_ioda2_table_do_free_pages(addr,
+ 1ULL << (level_shift - 3), levels - 1);
+
+ return -ENOMEM;
}
static void pnv_iommu_table_group_link_free(struct rcu_head *head)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 9577059..c61c04d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2043,6 +2043,7 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
#ifdef CONFIG_IOMMU_API
.exchange = pnv_ioda1_tce_xchg,
.exchange_rm = pnv_ioda1_tce_xchg_rm,
+ .useraddrptr = pnv_tce_useraddrptr,
#endif
.clear = pnv_ioda1_tce_free,
.get = pnv_tce_get,
@@ -2207,6 +2208,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
#ifdef CONFIG_IOMMU_API
.exchange = pnv_ioda2_tce_xchg,
.exchange_rm = pnv_ioda2_tce_xchg_rm,
+ .useraddrptr = pnv_tce_useraddrptr,
#endif
.clear = pnv_ioda2_tce_free,
.get = pnv_tce_get,
@@ -2460,9 +2462,9 @@ void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
pe->tce_bypass_enabled = enable;
}
-static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
+static long pnv_pci_ioda2_do_create_table(struct iommu_table_group *table_group,
int num, __u32 page_shift, __u64 window_size, __u32 levels,
- struct iommu_table **ptbl)
+ bool alloc_userspace_copy, struct iommu_table **ptbl)
{
struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
table_group);
@@ -2479,7 +2481,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
ret = pnv_pci_ioda2_table_alloc_pages(nid,
bus_offset, page_shift, window_size,
- levels, tbl);
+ levels, alloc_userspace_copy, tbl);
if (ret) {
iommu_tce_table_put(tbl);
return ret;
@@ -2599,7 +2601,24 @@ static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
tce_table_size, direct_table_size);
}
- return bytes;
+ return bytes + bytes; /* one for HW table, one for userspace copy */
+}
+
+static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
+ int num, __u32 page_shift, __u64 window_size, __u32 levels,
+ struct iommu_table **ptbl)
+{
+ return pnv_pci_ioda2_do_create_table(table_group,
+ num, page_shift, window_size, levels, false, ptbl);
+}
+
+static long pnv_pci_ioda2_create_table_userspace(
+ struct iommu_table_group *table_group,
+ int num, __u32 page_shift, __u64 window_size, __u32 levels,
+ struct iommu_table **ptbl)
+{
+ return pnv_pci_ioda2_do_create_table(table_group,
+ num, page_shift, window_size, levels, true, ptbl);
}
static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
@@ -2628,7 +2647,7 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
.get_table_size = pnv_pci_ioda2_get_table_size,
- .create_table = pnv_pci_ioda2_create_table,
+ .create_table = pnv_pci_ioda2_create_table_userspace,
.set_window = pnv_pci_ioda2_set_window,
.unset_window = pnv_pci_ioda2_unset_window,
.take_ownership = pnv_ioda2_take_ownership,
@@ -2733,7 +2752,7 @@ static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group)
static struct iommu_table_group_ops pnv_pci_ioda2_npu_ops = {
.get_table_size = pnv_pci_ioda2_get_table_size,
- .create_table = pnv_pci_ioda2_create_table,
+ .create_table = pnv_pci_ioda2_create_table_userspace,
.set_window = pnv_pci_ioda2_npu_set_window,
.unset_window = pnv_pci_ioda2_npu_unset_window,
.take_ownership = pnv_ioda2_npu_take_ownership,
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 8283a4a..833f926 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -212,44 +212,6 @@ static long tce_iommu_register_pages(struct tce_container *container,
return 0;
}
-static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl,
- struct mm_struct *mm)
-{
- unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
- tbl->it_size, PAGE_SIZE);
- unsigned long *uas;
- long ret;
-
- BUG_ON(tbl->it_userspace);
-
- ret = try_increment_locked_vm(mm, cb >> PAGE_SHIFT);
- if (ret)
- return ret;
-
- uas = vzalloc(cb);
- if (!uas) {
- decrement_locked_vm(mm, cb >> PAGE_SHIFT);
- return -ENOMEM;
- }
- tbl->it_userspace = (__be64 *) uas;
-
- return 0;
-}
-
-static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
- struct mm_struct *mm)
-{
- unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
- tbl->it_size, PAGE_SIZE);
-
- if (!tbl->it_userspace)
- return;
-
- vfree(tbl->it_userspace);
- tbl->it_userspace = NULL;
- decrement_locked_vm(mm, cb >> PAGE_SHIFT);
-}
-
static bool tce_page_is_contained(unsigned long hpa, unsigned page_shift)
{
struct page *page = __va(realmode_pfn_to_page(hpa >> PAGE_SHIFT));
@@ -604,12 +566,6 @@ static long tce_iommu_build_v2(struct tce_container *container,
unsigned long hpa;
enum dma_data_direction dirtmp;
- if (!tbl->it_userspace) {
- ret = tce_iommu_userspace_view_alloc(tbl, container->mm);
- if (ret)
- return ret;
- }
-
for (i = 0; i < pages; ++i) {
struct mm_iommu_table_group_mem_t *mem = NULL;
__be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry + i);
@@ -689,7 +645,6 @@ static void tce_iommu_free_table(struct tce_container *container,
{
unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
- tce_iommu_userspace_view_free(tbl, container->mm);
iommu_tce_table_put(tbl);
decrement_locked_vm(container->mm, pages);
}
@@ -1204,7 +1159,6 @@ static void tce_iommu_release_ownership(struct tce_container *container,
continue;
tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
- tce_iommu_userspace_view_free(tbl, container->mm);
if (tbl->it_map)
iommu_release_ownership(tbl);
--
2.11.0
^ permalink raw reply related
* [PATCH kernel v2 2/6] powerpc/powernv: Move TCE manupulation code to its own file
From: Alexey Kardashevskiy @ 2018-06-17 11:14 UTC (permalink / raw)
To: linuxppc-dev
Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Alex Williamson,
Benjamin Herrenschmidt, Russell Currey
In-Reply-To: <20180617111428.24349-1-aik@ozlabs.ru>
Right now we have allocation code in pci-ioda.c and traversing code in
pci.c, let's keep them toghether. However both files are big enough
already so let's move this business to a new file.
While we at it, move the code which links IOMMU table groups to
IOMMU tables as it is not specific to any PNV PHB model.
These puts exported symbols from the new file together.
This fixes several warnings from checkpatch.pl like this:
"WARNING: Prefer 'unsigned int' to bare use of 'unsigned'".
As this is almost cut-n-paste, there should be no behavioral change.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
arch/powerpc/platforms/powernv/Makefile | 2 +-
arch/powerpc/platforms/powernv/pci.h | 41 ++--
arch/powerpc/platforms/powernv/pci-ioda-tce.c | 313 ++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/pci-ioda.c | 146 ------------
arch/powerpc/platforms/powernv/pci.c | 158 -------------
5 files changed, 340 insertions(+), 320 deletions(-)
create mode 100644 arch/powerpc/platforms/powernv/pci-ioda-tce.c
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index 703a350..b540ce8e 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -6,7 +6,7 @@ obj-y += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
obj-y += opal-kmsg.o opal-powercap.o opal-psr.o opal-sensor-groups.o
obj-$(CONFIG_SMP) += smp.o subcore.o subcore-asm.o
-obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o
+obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o pci-ioda-tce.o
obj-$(CONFIG_CXL_BASE) += pci-cxl.o
obj-$(CONFIG_EEH) += eeh-powernv.o
obj-$(CONFIG_PPC_SCOM) += opal-xscom.o
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 1408247..f507baf 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -202,13 +202,6 @@ struct pnv_phb {
};
extern struct pci_ops pnv_pci_ops;
-extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
- unsigned long uaddr, enum dma_data_direction direction,
- unsigned long attrs);
-extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
-extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
- unsigned long *hpa, enum dma_data_direction *direction);
-extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
unsigned char *log_buff);
@@ -218,14 +211,6 @@ int pnv_pci_cfg_write(struct pci_dn *pdn,
int where, int size, u32 val);
extern struct iommu_table *pnv_pci_table_alloc(int nid);
-extern long pnv_pci_link_table_and_group(int node, int num,
- struct iommu_table *tbl,
- struct iommu_table_group *table_group);
-extern void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
- struct iommu_table_group *table_group);
-extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
- void *tce_mem, u64 tce_size,
- u64 dma_offset, unsigned page_shift);
extern void pnv_pci_init_ioda_hub(struct device_node *np);
extern void pnv_pci_init_ioda2_phb(struct device_node *np);
extern void pnv_pci_init_npu_phb(struct device_node *np);
@@ -273,4 +258,30 @@ extern void pnv_cxl_cx4_teardown_msi_irqs(struct pci_dev *pdev);
/* phb ops (cxl switches these when enabling the kernel api on the phb) */
extern const struct pci_controller_ops pnv_cxl_cx4_ioda_controller_ops;
+/* pci-ioda-tce.c */
+#define POWERNV_IOMMU_DEFAULT_LEVELS 1
+#define POWERNV_IOMMU_MAX_LEVELS 5
+
+extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr, enum dma_data_direction direction,
+ unsigned long attrs);
+extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
+extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
+ unsigned long *hpa, enum dma_data_direction *direction);
+extern unsigned long pnv_tce_get(struct iommu_table *tbl, long index);
+
+extern long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
+ __u32 page_shift, __u64 window_size, __u32 levels,
+ struct iommu_table *tbl);
+extern void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl);
+
+extern long pnv_pci_link_table_and_group(int node, int num,
+ struct iommu_table *tbl,
+ struct iommu_table_group *table_group);
+extern void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
+ struct iommu_table_group *table_group);
+extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
+ void *tce_mem, u64 tce_size,
+ u64 dma_offset, unsigned int page_shift);
+
#endif /* __POWERNV_PCI_H */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
new file mode 100644
index 0000000..700ceb1
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * TCE helpers for IODA PCI/PCIe on PowerNV platforms
+ *
+ * Copyright 2018 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/iommu.h>
+
+#include <asm/iommu.h>
+#include <asm/tce.h>
+#include "pci.h"
+
+void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
+ void *tce_mem, u64 tce_size,
+ u64 dma_offset, unsigned int page_shift)
+{
+ tbl->it_blocksize = 16;
+ tbl->it_base = (unsigned long)tce_mem;
+ tbl->it_page_shift = page_shift;
+ tbl->it_offset = dma_offset >> tbl->it_page_shift;
+ tbl->it_index = 0;
+ tbl->it_size = tce_size >> 3;
+ tbl->it_busno = 0;
+ tbl->it_type = TCE_PCI;
+}
+
+static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
+{
+ __be64 *tmp = ((__be64 *)tbl->it_base);
+ int level = tbl->it_indirect_levels;
+ const long shift = ilog2(tbl->it_level_size);
+ unsigned long mask = (tbl->it_level_size - 1) << (level * shift);
+
+ while (level) {
+ int n = (idx & mask) >> (level * shift);
+ unsigned long tce = be64_to_cpu(tmp[n]);
+
+ tmp = __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE));
+ idx &= ~mask;
+ mask >>= shift;
+ --level;
+ }
+
+ return tmp + idx;
+}
+
+int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
+ unsigned long uaddr, enum dma_data_direction direction,
+ unsigned long attrs)
+{
+ u64 proto_tce = iommu_direction_to_tce_perm(direction);
+ u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
+ long i;
+
+ if (proto_tce & TCE_PCI_WRITE)
+ proto_tce |= TCE_PCI_READ;
+
+ for (i = 0; i < npages; i++) {
+ unsigned long newtce = proto_tce |
+ ((rpn + i) << tbl->it_page_shift);
+ unsigned long idx = index - tbl->it_offset + i;
+
+ *(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
+ }
+
+ return 0;
+}
+
+#ifdef CONFIG_IOMMU_API
+int pnv_tce_xchg(struct iommu_table *tbl, long index,
+ unsigned long *hpa, enum dma_data_direction *direction)
+{
+ u64 proto_tce = iommu_direction_to_tce_perm(*direction);
+ unsigned long newtce = *hpa | proto_tce, oldtce;
+ unsigned long idx = index - tbl->it_offset;
+
+ BUG_ON(*hpa & ~IOMMU_PAGE_MASK(tbl));
+
+ if (newtce & TCE_PCI_WRITE)
+ newtce |= TCE_PCI_READ;
+
+ oldtce = be64_to_cpu(xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce)));
+ *hpa = oldtce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+ *direction = iommu_tce_direction(oldtce);
+
+ return 0;
+}
+#endif
+
+void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
+{
+ long i;
+
+ for (i = 0; i < npages; i++) {
+ unsigned long idx = index - tbl->it_offset + i;
+
+ *(pnv_tce(tbl, idx)) = cpu_to_be64(0);
+ }
+}
+
+unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
+{
+ return be64_to_cpu(*(pnv_tce(tbl, index - tbl->it_offset)));
+}
+
+static void pnv_pci_ioda2_table_do_free_pages(__be64 *addr,
+ unsigned long size, unsigned int levels)
+{
+ const unsigned long addr_ul = (unsigned long) addr &
+ ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+ if (levels) {
+ long i;
+ u64 *tmp = (u64 *) addr_ul;
+
+ for (i = 0; i < size; ++i) {
+ unsigned long hpa = be64_to_cpu(tmp[i]);
+
+ if (!(hpa & (TCE_PCI_READ | TCE_PCI_WRITE)))
+ continue;
+
+ pnv_pci_ioda2_table_do_free_pages(__va(hpa), size,
+ levels - 1);
+ }
+ }
+
+ free_pages(addr_ul, get_order(size << 3));
+}
+
+void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl)
+{
+ const unsigned long size = tbl->it_indirect_levels ?
+ tbl->it_level_size : tbl->it_size;
+
+ if (!tbl->it_size)
+ return;
+
+ pnv_pci_ioda2_table_do_free_pages((__be64 *)tbl->it_base, size,
+ tbl->it_indirect_levels);
+}
+
+static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned int shift,
+ unsigned int levels, unsigned long limit,
+ unsigned long *current_offset, unsigned long *total_allocated)
+{
+ struct page *tce_mem = NULL;
+ __be64 *addr, *tmp;
+ unsigned int order = max_t(unsigned int, shift, PAGE_SHIFT) -
+ PAGE_SHIFT;
+ unsigned long allocated = 1UL << (order + PAGE_SHIFT);
+ unsigned int entries = 1UL << (shift - 3);
+ long i;
+
+ tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
+ if (!tce_mem) {
+ pr_err("Failed to allocate a TCE memory, order=%d\n", order);
+ return NULL;
+ }
+ addr = page_address(tce_mem);
+ memset(addr, 0, allocated);
+ *total_allocated += allocated;
+
+ --levels;
+ if (!levels) {
+ *current_offset += allocated;
+ return addr;
+ }
+
+ for (i = 0; i < entries; ++i) {
+ tmp = pnv_pci_ioda2_table_do_alloc_pages(nid, shift,
+ levels, limit, current_offset, total_allocated);
+ if (!tmp)
+ break;
+
+ addr[i] = cpu_to_be64(__pa(tmp) |
+ TCE_PCI_READ | TCE_PCI_WRITE);
+
+ if (*current_offset >= limit)
+ break;
+ }
+
+ return addr;
+}
+
+long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
+ __u32 page_shift, __u64 window_size, __u32 levels,
+ struct iommu_table *tbl)
+{
+ void *addr;
+ unsigned long offset = 0, level_shift, total_allocated = 0;
+ const unsigned int window_shift = ilog2(window_size);
+ unsigned int entries_shift = window_shift - page_shift;
+ unsigned int table_shift = max_t(unsigned int, entries_shift + 3,
+ PAGE_SHIFT);
+ const unsigned long tce_table_size = 1UL << table_shift;
+
+ if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
+ return -EINVAL;
+
+ if (!is_power_of_2(window_size))
+ return -EINVAL;
+
+ /* Adjust direct table size from window_size and levels */
+ entries_shift = (entries_shift + levels - 1) / levels;
+ level_shift = entries_shift + 3;
+ level_shift = max_t(unsigned int, level_shift, PAGE_SHIFT);
+
+ if ((level_shift - 3) * levels + page_shift >= 55)
+ return -EINVAL;
+
+ /* Allocate TCE table */
+ addr = pnv_pci_ioda2_table_do_alloc_pages(nid, level_shift,
+ levels, tce_table_size, &offset, &total_allocated);
+
+ /* addr==NULL means that the first level allocation failed */
+ if (!addr)
+ return -ENOMEM;
+
+ /*
+ * First level was allocated but some lower level failed as
+ * we did not allocate as much as we wanted,
+ * release partially allocated table.
+ */
+ if (offset < tce_table_size) {
+ pnv_pci_ioda2_table_do_free_pages(addr,
+ 1ULL << (level_shift - 3), levels - 1);
+ return -ENOMEM;
+ }
+
+ /* Setup linux iommu table */
+ pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
+ page_shift);
+ tbl->it_level_size = 1ULL << (level_shift - 3);
+ tbl->it_indirect_levels = levels - 1;
+ tbl->it_allocated_size = total_allocated;
+
+ pr_devel("Created TCE table: ws=%08llx ts=%lx @%08llx\n",
+ window_size, tce_table_size, bus_offset);
+
+ return 0;
+}
+
+static void pnv_iommu_table_group_link_free(struct rcu_head *head)
+{
+ struct iommu_table_group_link *tgl = container_of(head,
+ struct iommu_table_group_link, rcu);
+
+ kfree(tgl);
+}
+
+void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
+ struct iommu_table_group *table_group)
+{
+ long i;
+ bool found;
+ struct iommu_table_group_link *tgl;
+
+ if (!tbl || !table_group)
+ return;
+
+ /* Remove link to a group from table's list of attached groups */
+ found = false;
+ list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+ if (tgl->table_group == table_group) {
+ list_del_rcu(&tgl->next);
+ call_rcu(&tgl->rcu, pnv_iommu_table_group_link_free);
+ found = true;
+ break;
+ }
+ }
+ if (WARN_ON(!found))
+ return;
+
+ /* Clean a pointer to iommu_table in iommu_table_group::tables[] */
+ found = false;
+ for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+ if (table_group->tables[i] == tbl) {
+ table_group->tables[i] = NULL;
+ found = true;
+ break;
+ }
+ }
+ WARN_ON(!found);
+}
+
+long pnv_pci_link_table_and_group(int node, int num,
+ struct iommu_table *tbl,
+ struct iommu_table_group *table_group)
+{
+ struct iommu_table_group_link *tgl = NULL;
+
+ if (WARN_ON(!tbl || !table_group))
+ return -EINVAL;
+
+ tgl = kzalloc_node(sizeof(struct iommu_table_group_link), GFP_KERNEL,
+ node);
+ if (!tgl)
+ return -ENOMEM;
+
+ tgl->table_group = table_group;
+ list_add_rcu(&tgl->next, &tbl->it_group_list);
+
+ table_group->tables[num] = tbl;
+
+ return 0;
+}
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index d4c60b6..9577059 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -51,12 +51,8 @@
#define PNV_IODA1_M64_SEGS 8 /* Segments per M64 BAR */
#define PNV_IODA1_DMA32_SEGSIZE 0x10000000
-#define POWERNV_IOMMU_DEFAULT_LEVELS 1
-#define POWERNV_IOMMU_MAX_LEVELS 5
-
static const char * const pnv_phb_names[] = { "IODA1", "IODA2", "NPU_NVLINK",
"NPU_OCAPI" };
-static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl);
void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
const char *fmt, ...)
@@ -2464,10 +2460,6 @@ void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
pe->tce_bypass_enabled = enable;
}
-static long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
- __u32 page_shift, __u64 window_size, __u32 levels,
- struct iommu_table *tbl);
-
static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
int num, __u32 page_shift, __u64 window_size, __u32 levels,
struct iommu_table **ptbl)
@@ -2775,144 +2767,6 @@ static void pnv_pci_ioda_setup_iommu_api(void)
static void pnv_pci_ioda_setup_iommu_api(void) { };
#endif
-static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift,
- unsigned levels, unsigned long limit,
- unsigned long *current_offset, unsigned long *total_allocated)
-{
- struct page *tce_mem = NULL;
- __be64 *addr, *tmp;
- unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
- unsigned long allocated = 1UL << (order + PAGE_SHIFT);
- unsigned entries = 1UL << (shift - 3);
- long i;
-
- tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
- if (!tce_mem) {
- pr_err("Failed to allocate a TCE memory, order=%d\n", order);
- return NULL;
- }
- addr = page_address(tce_mem);
- memset(addr, 0, allocated);
- *total_allocated += allocated;
-
- --levels;
- if (!levels) {
- *current_offset += allocated;
- return addr;
- }
-
- for (i = 0; i < entries; ++i) {
- tmp = pnv_pci_ioda2_table_do_alloc_pages(nid, shift,
- levels, limit, current_offset, total_allocated);
- if (!tmp)
- break;
-
- addr[i] = cpu_to_be64(__pa(tmp) |
- TCE_PCI_READ | TCE_PCI_WRITE);
-
- if (*current_offset >= limit)
- break;
- }
-
- return addr;
-}
-
-static void pnv_pci_ioda2_table_do_free_pages(__be64 *addr,
- unsigned long size, unsigned level);
-
-static long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
- __u32 page_shift, __u64 window_size, __u32 levels,
- struct iommu_table *tbl)
-{
- void *addr;
- unsigned long offset = 0, level_shift, total_allocated = 0;
- const unsigned window_shift = ilog2(window_size);
- unsigned entries_shift = window_shift - page_shift;
- unsigned table_shift = max_t(unsigned, entries_shift + 3, PAGE_SHIFT);
- const unsigned long tce_table_size = 1UL << table_shift;
-
- if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
- return -EINVAL;
-
- if (!is_power_of_2(window_size))
- return -EINVAL;
-
- /* Adjust direct table size from window_size and levels */
- entries_shift = (entries_shift + levels - 1) / levels;
- level_shift = entries_shift + 3;
- level_shift = max_t(unsigned, level_shift, PAGE_SHIFT);
-
- if ((level_shift - 3) * levels + page_shift >= 55)
- return -EINVAL;
-
- /* Allocate TCE table */
- addr = pnv_pci_ioda2_table_do_alloc_pages(nid, level_shift,
- levels, tce_table_size, &offset, &total_allocated);
-
- /* addr==NULL means that the first level allocation failed */
- if (!addr)
- return -ENOMEM;
-
- /*
- * First level was allocated but some lower level failed as
- * we did not allocate as much as we wanted,
- * release partially allocated table.
- */
- if (offset < tce_table_size) {
- pnv_pci_ioda2_table_do_free_pages(addr,
- 1ULL << (level_shift - 3), levels - 1);
- return -ENOMEM;
- }
-
- /* Setup linux iommu table */
- pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
- page_shift);
- tbl->it_level_size = 1ULL << (level_shift - 3);
- tbl->it_indirect_levels = levels - 1;
- tbl->it_allocated_size = total_allocated;
-
- pr_devel("Created TCE table: ws=%08llx ts=%lx @%08llx\n",
- window_size, tce_table_size, bus_offset);
-
- return 0;
-}
-
-static void pnv_pci_ioda2_table_do_free_pages(__be64 *addr,
- unsigned long size, unsigned level)
-{
- const unsigned long addr_ul = (unsigned long) addr &
- ~(TCE_PCI_READ | TCE_PCI_WRITE);
-
- if (level) {
- long i;
- u64 *tmp = (u64 *) addr_ul;
-
- for (i = 0; i < size; ++i) {
- unsigned long hpa = be64_to_cpu(tmp[i]);
-
- if (!(hpa & (TCE_PCI_READ | TCE_PCI_WRITE)))
- continue;
-
- pnv_pci_ioda2_table_do_free_pages(__va(hpa), size,
- level - 1);
- }
- }
-
- free_pages(addr_ul, get_order(size << 3));
-}
-
-static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl)
-{
- const unsigned long size = tbl->it_indirect_levels ?
- tbl->it_level_size : tbl->it_size;
-
- if (!tbl->it_size)
- return;
-
- pnv_pci_ioda2_table_do_free_pages((__be64 *)tbl->it_base, size,
- tbl->it_indirect_levels);
-}
-
static unsigned long pnv_ioda_parse_tce_sizes(struct pnv_phb *phb)
{
struct pci_controller *hose = phb->hose;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index b265ecc..13aef23 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -802,85 +802,6 @@ struct pci_ops pnv_pci_ops = {
.write = pnv_pci_write_config,
};
-static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
-{
- __be64 *tmp = ((__be64 *)tbl->it_base);
- int level = tbl->it_indirect_levels;
- const long shift = ilog2(tbl->it_level_size);
- unsigned long mask = (tbl->it_level_size - 1) << (level * shift);
-
- while (level) {
- int n = (idx & mask) >> (level * shift);
- unsigned long tce = be64_to_cpu(tmp[n]);
-
- tmp = __va(tce & ~(TCE_PCI_READ | TCE_PCI_WRITE));
- idx &= ~mask;
- mask >>= shift;
- --level;
- }
-
- return tmp + idx;
-}
-
-int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
- unsigned long uaddr, enum dma_data_direction direction,
- unsigned long attrs)
-{
- u64 proto_tce = iommu_direction_to_tce_perm(direction);
- u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
- long i;
-
- if (proto_tce & TCE_PCI_WRITE)
- proto_tce |= TCE_PCI_READ;
-
- for (i = 0; i < npages; i++) {
- unsigned long newtce = proto_tce |
- ((rpn + i) << tbl->it_page_shift);
- unsigned long idx = index - tbl->it_offset + i;
-
- *(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
- }
-
- return 0;
-}
-
-#ifdef CONFIG_IOMMU_API
-int pnv_tce_xchg(struct iommu_table *tbl, long index,
- unsigned long *hpa, enum dma_data_direction *direction)
-{
- u64 proto_tce = iommu_direction_to_tce_perm(*direction);
- unsigned long newtce = *hpa | proto_tce, oldtce;
- unsigned long idx = index - tbl->it_offset;
-
- BUG_ON(*hpa & ~IOMMU_PAGE_MASK(tbl));
-
- if (newtce & TCE_PCI_WRITE)
- newtce |= TCE_PCI_READ;
-
- oldtce = be64_to_cpu(xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce)));
- *hpa = oldtce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
- *direction = iommu_tce_direction(oldtce);
-
- return 0;
-}
-#endif
-
-void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
-{
- long i;
-
- for (i = 0; i < npages; i++) {
- unsigned long idx = index - tbl->it_offset + i;
-
- *(pnv_tce(tbl, idx)) = cpu_to_be64(0);
- }
-}
-
-unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
-{
- return be64_to_cpu(*(pnv_tce(tbl, index - tbl->it_offset)));
-}
-
struct iommu_table *pnv_pci_table_alloc(int nid)
{
struct iommu_table *tbl;
@@ -895,85 +816,6 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
return tbl;
}
-long pnv_pci_link_table_and_group(int node, int num,
- struct iommu_table *tbl,
- struct iommu_table_group *table_group)
-{
- struct iommu_table_group_link *tgl = NULL;
-
- if (WARN_ON(!tbl || !table_group))
- return -EINVAL;
-
- tgl = kzalloc_node(sizeof(struct iommu_table_group_link), GFP_KERNEL,
- node);
- if (!tgl)
- return -ENOMEM;
-
- tgl->table_group = table_group;
- list_add_rcu(&tgl->next, &tbl->it_group_list);
-
- table_group->tables[num] = tbl;
-
- return 0;
-}
-
-static void pnv_iommu_table_group_link_free(struct rcu_head *head)
-{
- struct iommu_table_group_link *tgl = container_of(head,
- struct iommu_table_group_link, rcu);
-
- kfree(tgl);
-}
-
-void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
- struct iommu_table_group *table_group)
-{
- long i;
- bool found;
- struct iommu_table_group_link *tgl;
-
- if (!tbl || !table_group)
- return;
-
- /* Remove link to a group from table's list of attached groups */
- found = false;
- list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
- if (tgl->table_group == table_group) {
- list_del_rcu(&tgl->next);
- call_rcu(&tgl->rcu, pnv_iommu_table_group_link_free);
- found = true;
- break;
- }
- }
- if (WARN_ON(!found))
- return;
-
- /* Clean a pointer to iommu_table in iommu_table_group::tables[] */
- found = false;
- for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
- if (table_group->tables[i] == tbl) {
- table_group->tables[i] = NULL;
- found = true;
- break;
- }
- }
- WARN_ON(!found);
-}
-
-void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
- void *tce_mem, u64 tce_size,
- u64 dma_offset, unsigned page_shift)
-{
- tbl->it_blocksize = 16;
- tbl->it_base = (unsigned long)tce_mem;
- tbl->it_page_shift = page_shift;
- tbl->it_offset = dma_offset >> tbl->it_page_shift;
- tbl->it_index = 0;
- tbl->it_size = tce_size >> 3;
- tbl->it_busno = 0;
- tbl->it_type = TCE_PCI;
-}
-
void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
{
struct pci_controller *hose = pci_bus_to_host(pdev->bus);
--
2.11.0
^ permalink raw reply related
* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Andreas Schwab @ 2018-06-17 10:27 UTC (permalink / raw)
To: Eric Dumazet
Cc: Mathieu Malaterre, David S. Miller, Eric Dumazet, LKML,
Christophe LEROY, Meelis Roos, netdev, linuxppc-dev
In-Reply-To: <9d88677a-f2be-2089-79df-15df4e9a5dd6@gmail.com>
On Jun 16 2018, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I would try something like :
>
> Basically do not bother using CHECKSUM_COMPLETE for small frames that might have been padded.
>
> Since we need to bring one cache line in eth_type_trans() and further header processing,
> computing the checksum in software will be almost free anyway.
>
> diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
> index 7a16d40a72d13cf1d522e8a3a396c826fe76f9b9..071039f211a8a33153e888bd4014314ba5e686a4 100644
> --- a/drivers/net/ethernet/sun/sungem.c
> +++ b/drivers/net/ethernet/sun/sungem.c
> @@ -855,9 +855,11 @@ static int gem_rx(struct gem *gp, int work_to_do)
> skb = copy_skb;
> }
>
> - csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 0xffff);
> - skb->csum = csum_unfold(csum);
> - skb->ip_summed = CHECKSUM_COMPLETE;
> + if (len > ETH_ZLEN) {
> + csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 0xffff);
> + skb->csum = csum_unfold(csum);
> + skb->ip_summed = CHECKSUM_COMPLETE;
> + }
> skb->protocol = eth_type_trans(skb, gp->dev);
>
> napi_gro_receive(&gp->napi, skb);
That doesn't change anything.
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
^ permalink raw reply
* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Andreas Schwab @ 2018-06-17 10:09 UTC (permalink / raw)
To: Mathieu Malaterre
Cc: eric.dumazet, David S. Miller, Eric Dumazet, LKML,
Christophe LEROY, Meelis Roos, netdev, linuxppc-dev
In-Reply-To: <CA+7wUsysaip70ARGQ7Byg1Zktfg0Y1esjEc9HxBGU4sKL8crhg@mail.gmail.com>
On Jun 16 2018, Mathieu Malaterre <malat@debian.org> wrote:
> That's odd since it seems to only affect g4+sungem user. None of the
> ppc64 seems to be having it.
I'm also seeing it on a PowerMac G5.
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
^ permalink raw reply
* [PATCH v2] mm: convert return type of handle_mm_fault() caller to vm_fault_t
From: Souptick Joarder @ 2018-06-17 8:48 UTC (permalink / raw)
To: willy, rth, tony.luck, mattst88, vgupta, linux, catalin.marinas,
will.deacon, rkuo, geert, monstr, jhogan, lftan, jonas, jejb,
benh, palmer, ysato, davem, richard, gxt, tglx, hpa,
alexander.levin, akpm
Cc: linux-alpha, linux-kernel, linux-snps-arc, linux-arm-kernel,
linux-hexagon, linux-ia64, linux-m68k, linux-mips, nios2-dev,
openrisc, linux-parisc, linuxppc-dev, linux-riscv, linux-s390,
linux-sh, sparclinux, user-mode-linux-user, linux-xtensa, iommu,
linux-mm, brajeswar.linux, sabyasachi.linux
Use new return type vm_fault_t for fault handler. For
now, this is just documenting that the function returns
a VM_FAULT value rather than an errno. Once all instances
are converted, vm_fault_t will become a distinct type.
Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")
In this patch all the caller of handle_mm_fault()
are changed to return vm_fault_t type.
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
---
v2: Fixed kbuild error
arch/alpha/mm/fault.c | 3 ++-
arch/arc/mm/fault.c | 4 +++-
arch/arm/mm/fault.c | 7 ++++---
arch/arm64/mm/fault.c | 6 +++---
arch/hexagon/mm/vm_fault.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m68k/mm/fault.c | 4 ++--
arch/microblaze/mm/fault.c | 2 +-
arch/mips/mm/fault.c | 2 +-
arch/nds32/mm/fault.c | 2 +-
arch/nios2/mm/fault.c | 2 +-
arch/openrisc/mm/fault.c | 2 +-
arch/parisc/mm/fault.c | 2 +-
arch/powerpc/include/asm/copro.h | 4 +++-
arch/powerpc/mm/copro_fault.c | 2 +-
arch/powerpc/mm/fault.c | 7 ++++---
arch/powerpc/platforms/cell/spufs/fault.c | 2 +-
arch/riscv/mm/fault.c | 3 ++-
arch/s390/mm/fault.c | 13 ++++++++-----
arch/sh/mm/fault.c | 4 ++--
arch/sparc/mm/fault_32.c | 3 ++-
arch/sparc/mm/fault_64.c | 3 ++-
arch/um/kernel/trap.c | 2 +-
arch/unicore32/mm/fault.c | 9 +++++----
arch/x86/mm/fault.c | 5 +++--
arch/xtensa/mm/fault.c | 2 +-
drivers/iommu/amd_iommu_v2.c | 2 +-
drivers/iommu/intel-svm.c | 4 +++-
drivers/misc/cxl/fault.c | 2 +-
drivers/misc/ocxl/link.c | 3 ++-
mm/hmm.c | 8 ++++----
mm/ksm.c | 2 +-
32 files changed, 69 insertions(+), 51 deletions(-)
diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index cd3c572..2a979ee 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -87,7 +87,8 @@
struct vm_area_struct * vma;
struct mm_struct *mm = current->mm;
const struct exception_table_entry *fixup;
- int fault, si_code = SEGV_MAPERR;
+ int si_code = SEGV_MAPERR;
+ vm_fault_t fault;
siginfo_t info;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index a0b7bd6..3a18d33 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -15,6 +15,7 @@
#include <linux/uaccess.h>
#include <linux/kdebug.h>
#include <linux/perf_event.h>
+#include <linux/mm_types.h>
#include <asm/pgalloc.h>
#include <asm/mmu.h>
@@ -66,7 +67,8 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
siginfo_t info;
- int fault, ret;
+ int ret;
+ vm_fault_t fault;
int write = regs->ecr_cause & ECR_C_PROTV_STORE; /* ST/EX */
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index b75eada..758abcb 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -219,12 +219,12 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)
return vma->vm_flags & mask ? false : true;
}
-static int __kprobes
+static vm_fault_t __kprobes
__do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
unsigned int flags, struct task_struct *tsk)
{
struct vm_area_struct *vma;
- int fault;
+ vm_fault_t fault;
vma = find_vma(mm, addr);
fault = VM_FAULT_BADMAP;
@@ -259,7 +259,8 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)
{
struct task_struct *tsk;
struct mm_struct *mm;
- int fault, sig, code;
+ int sig, code;
+ vm_fault_t fault;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
if (notify_page_fault(regs, fsr))
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 2af3dd8..8da263b 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -371,12 +371,12 @@ static void do_bad_area(unsigned long addr, unsigned int esr, struct pt_regs *re
#define VM_FAULT_BADMAP 0x010000
#define VM_FAULT_BADACCESS 0x020000
-static int __do_page_fault(struct mm_struct *mm, unsigned long addr,
+static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr,
unsigned int mm_flags, unsigned long vm_flags,
struct task_struct *tsk)
{
struct vm_area_struct *vma;
- int fault;
+ vm_fault_t fault;
vma = find_vma(mm, addr);
fault = VM_FAULT_BADMAP;
@@ -419,7 +419,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
struct task_struct *tsk;
struct mm_struct *mm;
struct siginfo si;
- int fault, major = 0;
+ vm_fault_t fault, major = 0;
unsigned long vm_flags = VM_READ | VM_WRITE;
unsigned int mm_flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index 3eec33c..5d1de6c 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -52,7 +52,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
struct mm_struct *mm = current->mm;
siginfo_t info;
int si_code = SEGV_MAPERR;
- int fault;
+ vm_fault_t fault;
const struct exception_table_entry *fixup;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index dfdc152..e085d89 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -87,7 +87,7 @@ static inline int notify_page_fault(struct pt_regs *regs, int trap)
struct mm_struct *mm = current->mm;
struct siginfo si;
unsigned long mask;
- int fault;
+ vm_fault_t fault;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
mask = ((((isr >> IA64_ISR_X_BIT) & 1UL) << VM_EXEC_BIT)
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 03253c4..1fc7ac0 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -73,7 +73,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
{
struct mm_struct *mm = current->mm;
struct vm_area_struct * vma;
- int fault;
+ vm_fault_t fault;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
pr_debug("do page fault:\nregs->sr=%#x, regs->pc=%#lx, address=%#lx, %ld, %p\n",
@@ -139,7 +139,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
*/
fault = handle_mm_fault(vma, address, flags);
- pr_debug("handle_mm_fault returns %d\n", fault);
+ pr_debug("handle_mm_fault returns %x\n", fault);
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return 0;
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index f91b30f..92a8682 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -91,7 +91,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
siginfo_t info;
int code = SEGV_MAPERR;
int is_write = error_code & ESR_S;
- int fault;
+ vm_fault_t fault;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
regs->ear = address;
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 4f8f5bf..0bc5030 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -43,7 +43,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
struct mm_struct *mm = tsk->mm;
const int field = sizeof(unsigned long) * 2;
siginfo_t info;
- int fault;
+ vm_fault_t fault;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
static DEFINE_RATELIMIT_STATE(ratelimit_state, 5 * HZ, 10);
diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index 3a246fb..96796d3 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -73,7 +73,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
struct mm_struct *mm;
struct vm_area_struct *vma;
siginfo_t info;
- int fault;
+ vm_fault_t fault;
unsigned int mask = VM_READ | VM_WRITE | VM_EXEC;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index b804dd0..24fd84c 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -47,7 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
int code = SEGV_MAPERR;
- int fault;
+ vm_fault_t fault;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
cause >>= 2;
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index d0021df..21e8f16 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -53,7 +53,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
struct mm_struct *mm;
struct vm_area_struct *vma;
siginfo_t info;
- int fault;
+ vm_fault_t fault;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
tsk = current;
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index e247edb..ff9e634 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -262,7 +262,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
struct task_struct *tsk;
struct mm_struct *mm;
unsigned long acc_type;
- int fault = 0;
+ vm_fault_t fault = 0;
unsigned int flags;
if (faulthandler_disabled())
diff --git a/arch/powerpc/include/asm/copro.h b/arch/powerpc/include/asm/copro.h
index ce216df..48616fe 100644
--- a/arch/powerpc/include/asm/copro.h
+++ b/arch/powerpc/include/asm/copro.h
@@ -10,13 +10,15 @@
#ifndef _ASM_POWERPC_COPRO_H
#define _ASM_POWERPC_COPRO_H
+#include <linux/mm_types.h>
+
struct copro_slb
{
u64 esid, vsid;
};
int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
- unsigned long dsisr, unsigned *flt);
+ unsigned long dsisr, vm_fault_t *flt);
int copro_calculate_slb(struct mm_struct *mm, u64 ea, struct copro_slb *slb);
diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 7d0945b..c8da352 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -34,7 +34,7 @@
* to handle fortunately.
*/
int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
- unsigned long dsisr, unsigned *flt)
+ unsigned long dsisr, vm_fault_t *flt)
{
struct vm_area_struct *vma;
unsigned long is_write;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index c01d627..17cce1b 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -159,7 +159,7 @@ static noinline int bad_access(struct pt_regs *regs, unsigned long address)
}
static int do_sigbus(struct pt_regs *regs, unsigned long address,
- unsigned int fault)
+ vm_fault_t fault)
{
siginfo_t info;
unsigned int lsb = 0;
@@ -189,7 +189,8 @@ static int do_sigbus(struct pt_regs *regs, unsigned long address,
return 0;
}
-static int mm_fault_error(struct pt_regs *regs, unsigned long addr, int fault)
+static int mm_fault_error(struct pt_regs *regs, unsigned long addr,
+ vm_fault_t fault)
{
/*
* Kernel page fault interrupted by SIGKILL. We have no reason to
@@ -402,7 +403,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
int is_exec = TRAP(regs) == 0x400;
int is_user = user_mode(regs);
int is_write = page_fault_is_write(error_code);
- int fault, major = 0;
+ vm_fault_t fault, major = 0;
bool store_update_sp = false;
if (notify_page_fault(regs))
diff --git a/arch/powerpc/platforms/cell/spufs/fault.c b/arch/powerpc/platforms/cell/spufs/fault.c
index 870c0a8..0195076 100644
--- a/arch/powerpc/platforms/cell/spufs/fault.c
+++ b/arch/powerpc/platforms/cell/spufs/fault.c
@@ -111,7 +111,7 @@ int spufs_handle_class1(struct spu_context *ctx)
{
u64 ea, dsisr, access;
unsigned long flags;
- unsigned flt = 0;
+ vm_fault_t flt = 0;
int ret;
/*
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 148c98c..88401d5 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -41,7 +41,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
struct mm_struct *mm;
unsigned long addr, cause;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
- int fault, code = SEGV_MAPERR;
+ int code = SEGV_MAPERR;
+ vm_fault_t fault;
cause = regs->scause;
addr = regs->sbadaddr;
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 93faeca..8ea0855 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -350,7 +350,8 @@ static noinline int signal_return(struct pt_regs *regs)
return -EACCES;
}
-static noinline void do_fault_error(struct pt_regs *regs, int access, int fault)
+static noinline void do_fault_error(struct pt_regs *regs, int access,
+ vm_fault_t fault)
{
int si_code;
@@ -410,7 +411,7 @@ static noinline void do_fault_error(struct pt_regs *regs, int access, int fault)
* 11 Page translation -> Not present (nullification)
* 3b Region third trans. -> Not present (nullification)
*/
-static inline int do_exception(struct pt_regs *regs, int access)
+static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
{
struct gmap *gmap;
struct task_struct *tsk;
@@ -420,7 +421,7 @@ static inline int do_exception(struct pt_regs *regs, int access)
unsigned long trans_exc_code;
unsigned long address;
unsigned int flags;
- int fault;
+ vm_fault_t fault;
tsk = current;
/*
@@ -571,7 +572,8 @@ static inline int do_exception(struct pt_regs *regs, int access)
void do_protection_exception(struct pt_regs *regs)
{
unsigned long trans_exc_code;
- int access, fault;
+ int access;
+ vm_fault_t fault;
trans_exc_code = regs->int_parm_long;
/*
@@ -606,7 +608,8 @@ void do_protection_exception(struct pt_regs *regs)
void do_dat_exception(struct pt_regs *regs)
{
- int access, fault;
+ int access;
+ vm_fault_t fault;
access = VM_READ | VM_EXEC | VM_WRITE;
fault = do_exception(regs, access);
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 6fd1bf7..474bf14 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -320,7 +320,7 @@ static noinline int vmalloc_fault(unsigned long address)
static noinline int
mm_fault_error(struct pt_regs *regs, unsigned long error_code,
- unsigned long address, unsigned int fault)
+ unsigned long address, vm_fault_t fault)
{
/*
* Pagefault was interrupted by SIGKILL. We have no reason to
@@ -403,7 +403,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
struct task_struct *tsk;
struct mm_struct *mm;
struct vm_area_struct * vma;
- int fault;
+ vm_fault_t fault;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
tsk = current;
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index a8103a8..1a44a4e 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -174,7 +174,8 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
unsigned int fixup;
unsigned long g2;
int from_user = !(regs->psr & PSR_PS);
- int fault, code;
+ int code;
+ vm_fault_t fault;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
if (text_fault)
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 41363f4..2078bfe 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -284,7 +284,8 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned int insn = 0;
- int si_code, fault_code, fault;
+ int si_code, fault_code;
+ vm_fault_t fault;
unsigned long address, mm_rss;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index b2b02df..0afcd09 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -72,7 +72,7 @@ int handle_page_fault(unsigned long address, unsigned long ip,
}
do {
- int fault;
+ vm_fault_t fault;
fault = handle_mm_fault(vma, address, flags);
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index bbefcc4..2982140 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -167,11 +167,11 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)
return vma->vm_flags & mask ? false : true;
}
-static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
- unsigned int flags, struct task_struct *tsk)
+static vm_fault_t __do_pf(struct mm_struct *mm, unsigned long addr,
+ unsigned int fsr, unsigned int flags, struct task_struct *tsk)
{
struct vm_area_struct *vma;
- int fault;
+ vm_fault_t fault;
vma = find_vma(mm, addr);
fault = VM_FAULT_BADMAP;
@@ -208,7 +208,8 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
{
struct task_struct *tsk;
struct mm_struct *mm;
- int fault, sig, code;
+ int sig, code;
+ vm_fault_t fault;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
tsk = current;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 73bd8c9..5171d60 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -16,6 +16,7 @@
#include <linux/prefetch.h> /* prefetchw */
#include <linux/context_tracking.h> /* exception_enter(), ... */
#include <linux/uaccess.h> /* faulthandler_disabled() */
+#include <linux/mm_types.h>
#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/traps.h> /* dotraplinkage, ... */
@@ -1004,7 +1005,7 @@ static inline bool bad_area_access_from_pkeys(unsigned long error_code,
static noinline void
mm_fault_error(struct pt_regs *regs, unsigned long error_code,
- unsigned long address, u32 *pkey, unsigned int fault)
+ unsigned long address, u32 *pkey, vm_fault_t fault)
{
if (fatal_signal_pending(current) && !(error_code & X86_PF_USER)) {
no_context(regs, error_code, address, 0, 0);
@@ -1218,7 +1219,7 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
struct vm_area_struct *vma;
struct task_struct *tsk;
struct mm_struct *mm;
- int fault, major = 0;
+ vm_fault_t fault, major = 0;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
u32 pkey;
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 8b9b6f4..203fade 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -42,7 +42,7 @@ void do_page_fault(struct pt_regs *regs)
siginfo_t info;
int is_write, is_exec;
- int fault;
+ vm_fault_t fault;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
info.si_code = SEGV_MAPERR;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 1d0b53a0..58da65d 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -508,7 +508,7 @@ static void do_fault(struct work_struct *work)
{
struct fault *fault = container_of(work, struct fault, work);
struct vm_area_struct *vma;
- int ret = VM_FAULT_ERROR;
+ vm_fault_t ret = VM_FAULT_ERROR;
unsigned int flags = 0;
struct mm_struct *mm;
u64 address;
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index e8cd984..75189c0 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -24,6 +24,7 @@
#include <linux/pci-ats.h>
#include <linux/dmar.h>
#include <linux/interrupt.h>
+#include <linux/mm_types.h>
#include <asm/page.h>
#define PASID_ENTRY_P BIT_ULL(0)
@@ -594,7 +595,8 @@ static irqreturn_t prq_event_thread(int irq, void *d)
struct vm_area_struct *vma;
struct page_req_dsc *req;
struct qi_desc resp;
- int ret, result;
+ int result;
+ vm_fault_t ret;
u64 address;
handled = 1;
diff --git a/drivers/misc/cxl/fault.c b/drivers/misc/cxl/fault.c
index 70dbb6d..93ecc67 100644
--- a/drivers/misc/cxl/fault.c
+++ b/drivers/misc/cxl/fault.c
@@ -134,7 +134,7 @@ static int cxl_handle_segment_miss(struct cxl_context *ctx,
int cxl_handle_mm_fault(struct mm_struct *mm, u64 dsisr, u64 dar)
{
- unsigned flt = 0;
+ vm_fault_t flt = 0;
int result;
unsigned long access, flags, inv_flags = 0;
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index f307905..4e155fb 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -2,6 +2,7 @@
// Copyright 2017 IBM Corp.
#include <linux/sched/mm.h>
#include <linux/mutex.h>
+#include <linux/mm_types.h>
#include <linux/mmu_context.h>
#include <asm/copro.h>
#include <asm/pnv-ocxl.h>
@@ -126,7 +127,7 @@ static void ack_irq(struct spa *spa, enum xsl_response r)
static void xsl_fault_handler_bh(struct work_struct *fault_work)
{
- unsigned int flt = 0;
+ vm_fault_t flt = 0;
unsigned long access, flags, inv_flags = 0;
enum xsl_response r;
struct xsl_fault *fault = container_of(fault_work, struct xsl_fault,
diff --git a/mm/hmm.c b/mm/hmm.c
index 486dc39..d7919e5 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -308,14 +308,14 @@ static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
struct hmm_vma_walk *hmm_vma_walk = walk->private;
struct hmm_range *range = hmm_vma_walk->range;
struct vm_area_struct *vma = walk->vma;
- int r;
+ vm_fault_t ret;
flags |= hmm_vma_walk->block ? 0 : FAULT_FLAG_ALLOW_RETRY;
flags |= write_fault ? FAULT_FLAG_WRITE : 0;
- r = handle_mm_fault(vma, addr, flags);
- if (r & VM_FAULT_RETRY)
+ ret = handle_mm_fault(vma, addr, flags);
+ if (ret & VM_FAULT_RETRY)
return -EBUSY;
- if (r & VM_FAULT_ERROR) {
+ if (ret & VM_FAULT_ERROR) {
*pfn = range->values[HMM_PFN_ERROR];
return -EFAULT;
}
diff --git a/mm/ksm.c b/mm/ksm.c
index e3cbf9a..cb4e6ed 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -451,7 +451,7 @@ static inline bool ksm_test_exit(struct mm_struct *mm)
static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
{
struct page *page;
- int ret = 0;
+ vm_fault_t ret = 0;
do {
cond_resched();
--
1.9.1
^ permalink raw reply related
* Re: [powerpc/powervmc]kernel BUG at arch/powerpc/mm/pgtable-book3s64.c:414!
From: Venkat Rao B @ 2018-06-17 7:52 UTC (permalink / raw)
To: Michael Ellerman, linuxppc-dev; +Cc: sachinp, aneesh.kumar, Nicholas Piggin
In-Reply-To: <87h8m49nb7.fsf@concordia.ellerman.id.au>
On Friday 15 June 2018 07:14 PM, Michael Ellerman wrote:
> vrbagal1 <vrbagal1@linux.vnet.ibm.com> writes:
>
>> Hi,
>>
>> Observing kernel bug followed by kernel oops and system reboots, while
>> running kselftest on Power8 LPAR.
>>
>> Machine Details: Power8 LPAR
>> Git Tree:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>> Commit ID: f5b7769eb0400ec5217a47e41148a9f816ca1f9f
>> Kernel version: 4.17.0-autotest
>> GCC version: (gcc version 6.3.1 20161221 (Red Hat 6.3.1-1) (GCC))
>>
>
> This is fixed by your patch Aneesh?
>
> http://patchwork.ozlabs.org/patch/929325/
Yes, this patch fixes the issue.
Regards,
Venkat
>
> cheers
>
^ permalink raw reply
* Re: [PATCH 0/3] Resolve -Wattribute-alias warnings from SYSCALL_DEFINEx()
From: Stafford Horne @ 2018-06-17 1:11 UTC (permalink / raw)
To: Paul Burton
Cc: linux-kbuild, Mauro Carvalho Chehab, linux-mips, Arnd Bergmann,
Ingo Molnar, Matthew Wilcox, Thomas Gleixner, Douglas Anderson,
Josh Poimboeuf, Andrew Morton, Matthias Kaehlcke, He Zhe,
Benjamin Herrenschmidt, Michal Marek, Khem Raj, Christophe Leroy,
Al Viro, Gideon Israel Dsouza, Masahiro Yamada, Kees Cook,
Michael Ellerman, Heiko Carstens, linux-kernel, Paul Mackerras,
linuxppc-dev
In-Reply-To: <20180616005323.7938-1-paul.burton@mips.com>
On Fri, Jun 15, 2018 at 05:53:19PM -0700, Paul Burton wrote:
> This series introduces infrastructure allowing compiler diagnostics to
> be disabled or their severity modified for specific pieces of code, with
> suitable abstractions to prevent that code from becoming tied to a
> specific compiler.
>
> This infrastructure is then used to disable the -Wattribute-alias
> warning around syscall definitions, which rely on type mismatches to
> sanitize arguments.
>
> Finally PowerPC-specific #pragma's are removed now that the generic code
> is handling this.
>
> The series takes Arnd's RFC patches & addresses the review comments they
> received. The most notable effect of this series to to avoid warnings &
> build failures caused by -Wattribute-alias when compiling the kernel
> with GCC 8.
>
> Applies cleanly atop master as of 9215310cf13b ("Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
>
> Thanks,
> Paul
>
> Arnd Bergmann (2):
> kbuild: add macro for controlling warnings to linux/compiler.h
> disable -Wattribute-alias warning for SYSCALL_DEFINEx()
>
> Paul Burton (1):
> Revert "powerpc: fix build failure by disabling attribute-alias
> warning in pci_32"
>
> arch/powerpc/kernel/pci_32.c | 4 ---
> include/linux/compat.h | 8 ++++-
> include/linux/compiler-gcc.h | 66 ++++++++++++++++++++++++++++++++++
> include/linux/compiler_types.h | 18 ++++++++++
> include/linux/syscalls.h | 4 +++
> 5 files changed, 95 insertions(+), 5 deletions(-)
Hello Paul,
I tested the series out with the new OpenRISC 9.0.0 port and the
-Wattribute-alias warnings are gone. Thank you.
Using toolchain binaries from:
https://github.com/stffrdhrn/gcc/releases/tag/or1k-9.0.0-20180613
For the series:
Tested-by: Stafford Horne <shorne@gmail.com>
-Stafford
^ permalink raw reply
* Re: [PATCH] mm: convert return type of handle_mm_fault() caller to vm_fault_t
From: kbuild test robot @ 2018-06-16 18:14 UTC (permalink / raw)
To: Souptick Joarder
Cc: kbuild-all, willy, rth, tony.luck, mattst88, vgupta, linux,
catalin.marinas, will.deacon, rkuo, geert, monstr, jhogan, lftan,
jonas, jejb, benh, palmer, ysato, davem, richard, gxt, tglx, hpa,
alexander.levin, akpm, linux-alpha, linux-kernel, linux-snps-arc,
linux-arm-kernel, linux-hexagon, linux-ia64, linux-m68k,
linux-mips, nios2-dev, openrisc, linux-parisc, linuxppc-dev,
linux-riscv, linux-s390, linux-sh, sparclinux,
user-mode-linux-devel, user-mode-linux-user, linux-xtensa, iommu,
linux-mm, brajeswar.linux, sabyasachi.linux
In-Reply-To: <20180614190629.GA18576@jordon-HP-15-Notebook-PC>
[-- Attachment #1: Type: text/plain, Size: 11761 bytes --]
Hi Souptick,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on v4.17]
[cannot apply to linus/master powerpc/next sparc-next/master next-20180615]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Souptick-Joarder/mm-convert-return-type-of-handle_mm_fault-caller-to-vm_fault_t/20180615-030636
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=powerpc
All warnings (new ones prefixed by >>):
arch/powerpc/mm/copro_fault.c:36:5: error: conflicting types for 'copro_handle_mm_fault'
int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
^~~~~~~~~~~~~~~~~~~~~
In file included from arch/powerpc/mm/copro_fault.c:27:0:
arch/powerpc/include/asm/copro.h:18:5: note: previous declaration of 'copro_handle_mm_fault' was here
int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
^~~~~~~~~~~~~~~~~~~~~
In file included from include/linux/linkage.h:7:0,
from include/linux/kernel.h:7,
from include/asm-generic/bug.h:18,
from arch/powerpc/include/asm/bug.h:128,
from include/linux/bug.h:5,
from arch/powerpc/include/asm/mmu.h:126,
from arch/powerpc/include/asm/lppaca.h:36,
from arch/powerpc/include/asm/paca.h:21,
from arch/powerpc/include/asm/current.h:16,
from include/linux/sched.h:12,
from arch/powerpc/mm/copro_fault.c:23:
arch/powerpc/mm/copro_fault.c:101:19: error: conflicting types for 'copro_handle_mm_fault'
EXPORT_SYMBOL_GPL(copro_handle_mm_fault);
^
include/linux/export.h:65:21: note: in definition of macro '___EXPORT_SYMBOL'
extern typeof(sym) sym; \
^~~
>> arch/powerpc/mm/copro_fault.c:101:1: note: in expansion of macro 'EXPORT_SYMBOL_GPL'
EXPORT_SYMBOL_GPL(copro_handle_mm_fault);
^~~~~~~~~~~~~~~~~
In file included from arch/powerpc/mm/copro_fault.c:27:0:
arch/powerpc/include/asm/copro.h:18:5: note: previous declaration of 'copro_handle_mm_fault' was here
int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
^~~~~~~~~~~~~~~~~~~~~
vim +/EXPORT_SYMBOL_GPL +101 arch/powerpc/mm/copro_fault.c
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 @23 #include <linux/sched.h>
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 24 #include <linux/mm.h>
4b16f8e2d arch/powerpc/platforms/cell/spu_fault.c Paul Gortmaker 2011-07-22 25 #include <linux/export.h>
e83d01697 arch/powerpc/mm/copro_fault.c Ian Munsie 2014-10-08 26 #include <asm/reg.h>
73d16a6e0 arch/powerpc/mm/copro_fault.c Ian Munsie 2014-10-08 27 #include <asm/copro.h>
be3ebfe82 arch/powerpc/mm/copro_fault.c Ian Munsie 2014-10-08 28 #include <asm/spu.h>
ec249dd86 arch/powerpc/mm/copro_fault.c Michael Neuling 2015-05-27 29 #include <misc/cxl-base.h>
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 30
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 31 /*
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 32 * This ought to be kept in sync with the powerpc specific do_page_fault
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 33 * function. Currently, there are a few corner cases that we haven't had
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 34 * to handle fortunately.
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 35 */
e83d01697 arch/powerpc/mm/copro_fault.c Ian Munsie 2014-10-08 36 int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
7c71e54a4 arch/powerpc/mm/copro_fault.c Souptick Joarder 2018-06-15 37 unsigned long dsisr, vm_fault_t *flt)
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 38 {
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 39 struct vm_area_struct *vma;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 40 unsigned long is_write;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 41 int ret;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 42
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 43 if (mm == NULL)
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 44 return -EFAULT;
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 45
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 46 if (mm->pgd == NULL)
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 47 return -EFAULT;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 48
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 49 down_read(&mm->mmap_sem);
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 50 ret = -EFAULT;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 51 vma = find_vma(mm, ea);
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 52 if (!vma)
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 53 goto out_unlock;
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 54
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 55 if (ea < vma->vm_start) {
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 56 if (!(vma->vm_flags & VM_GROWSDOWN))
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 57 goto out_unlock;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 58 if (expand_stack(vma, ea))
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 59 goto out_unlock;
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 60 }
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 61
e83d01697 arch/powerpc/mm/copro_fault.c Ian Munsie 2014-10-08 62 is_write = dsisr & DSISR_ISSTORE;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 63 if (is_write) {
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 64 if (!(vma->vm_flags & VM_WRITE))
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 65 goto out_unlock;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 66 } else {
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 67 if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 68 goto out_unlock;
842915f56 arch/powerpc/mm/copro_fault.c Mel Gorman 2015-02-12 69 /*
18061c17c arch/powerpc/mm/copro_fault.c Aneesh Kumar K.V 2017-01-30 70 * PROT_NONE is covered by the VMA check above.
18061c17c arch/powerpc/mm/copro_fault.c Aneesh Kumar K.V 2017-01-30 71 * and hash should get a NOHPTE fault instead of
18061c17c arch/powerpc/mm/copro_fault.c Aneesh Kumar K.V 2017-01-30 72 * a PROTFAULT in case fixup is needed for things
18061c17c arch/powerpc/mm/copro_fault.c Aneesh Kumar K.V 2017-01-30 73 * like autonuma.
842915f56 arch/powerpc/mm/copro_fault.c Mel Gorman 2015-02-12 74 */
18061c17c arch/powerpc/mm/copro_fault.c Aneesh Kumar K.V 2017-01-30 75 if (!radix_enabled())
842915f56 arch/powerpc/mm/copro_fault.c Mel Gorman 2015-02-12 76 WARN_ON_ONCE(dsisr & DSISR_PROTFAULT);
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 77 }
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 78
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 79 ret = 0;
dcddffd41 arch/powerpc/mm/copro_fault.c Kirill A. Shutemov 2016-07-26 80 *flt = handle_mm_fault(vma, ea, is_write ? FAULT_FLAG_WRITE : 0);
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 81 if (unlikely(*flt & VM_FAULT_ERROR)) {
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 82 if (*flt & VM_FAULT_OOM) {
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 83 ret = -ENOMEM;
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 84 goto out_unlock;
33692f275 arch/powerpc/mm/copro_fault.c Linus Torvalds 2015-01-29 85 } else if (*flt & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV)) {
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 86 ret = -EFAULT;
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 87 goto out_unlock;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 88 }
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 89 BUG();
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 90 }
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 91
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 92 if (*flt & VM_FAULT_MAJOR)
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 93 current->maj_flt++;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 94 else
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 95 current->min_flt++;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 96
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 97 out_unlock:
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 98 up_read(&mm->mmap_sem);
60ee03194 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2009-02-17 99 return ret;
7cd58e438 arch/powerpc/platforms/cell/spu_fault.c Jeremy Kerr 2007-12-20 100 }
e83d01697 arch/powerpc/mm/copro_fault.c Ian Munsie 2014-10-08 @101 EXPORT_SYMBOL_GPL(copro_handle_mm_fault);
73d16a6e0 arch/powerpc/mm/copro_fault.c Ian Munsie 2014-10-08 102
:::::: The code at line 101 was first introduced by commit
:::::: e83d01697583d8610d1d62279758c2a881e3396f powerpc/cell: Move spu_handle_mm_fault() out of cell platform
:::::: TO: Ian Munsie <imunsie@au1.ibm.com>
:::::: CC: Michael Ellerman <mpe@ellerman.id.au>
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 56250 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 03/23] genirq: Introduce IRQF_DELIVER_AS_NMI
From: Thomas Gleixner @ 2018-06-16 13:36 UTC (permalink / raw)
To: Ricardo Neri
Cc: Julien Thierry, Marc Zyngier, Peter Zijlstra, Ingo Molnar,
H. Peter Anvin, Andi Kleen, Ashok Raj, Borislav Petkov, Tony Luck,
Ravi V. Shankar, x86, sparclinux, linuxppc-dev, linux-kernel,
Jacob Pan, Daniel Lezcano, Andrew Morton,
Levin, Alexander (Sasha Levin), Randy Dunlap, Masami Hiramatsu,
Bartosz Golaszewski, Doug Berger, Palmer Dabbelt, iommu
In-Reply-To: <20180616003916.GA6659@voyager>
On Fri, 15 Jun 2018, Ricardo Neri wrote:
> On Fri, Jun 15, 2018 at 09:01:02AM +0100, Julien Thierry wrote:
> > In my patches, I'm not sure there is much to adapt to your work as most of
> > it is arch specific (although I wont say no to another pair of eyes looking
> > at them). From what I've seen of your patches, the point where we converge
> > is that need for some code to be able to tell the irqchip "I want that
> > particular interrupt line to be treated/setup as an NMI".
>
> Indeed, there has to be a generic way for the irqchip to announce that it
> supports configuring an interrupt as NMI... and a way to actually configuring
> it.
There has to be nothing. The irqchip infrastructure might be able to
provide certain aspects of NMI support, perhaps for initialization, but
everything else is fundamentally different and the executional parts simply
cannot use any of the irq chip functions at all.
Thanks,
tglx
^ permalink raw reply
* Re: [RFC PATCH 20/23] watchdog/hardlockup/hpet: Rotate interrupt among all monitored CPUs
From: Thomas Gleixner @ 2018-06-16 13:27 UTC (permalink / raw)
To: Ricardo Neri
Cc: Ingo Molnar, H. Peter Anvin, Andi Kleen, Ashok Raj,
Borislav Petkov, Tony Luck, Ravi V. Shankar, x86, sparclinux,
linuxppc-dev, linux-kernel, Jacob Pan, Rafael J. Wysocki,
Don Zickus, Nicholas Piggin, Michael Ellerman,
Frederic Weisbecker, Alexei Starovoitov, Babu Moger,
Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
Andrew Morton, Philippe Ombredanne, Colin Ian King,
Byungchul Park, Paul E. McKenney, Luis R. Rodriguez, Waiman Long,
Josh Poimboeuf, Randy Dunlap, Davidlohr Bueso, Christoffer Dall,
Marc Zyngier, Kai-Heng Feng, Konrad Rzeszutek Wilk,
David Rientjes, iommu
In-Reply-To: <20180616004631.GB6659@voyager>
On Fri, 15 Jun 2018, Ricardo Neri wrote:
> On Fri, Jun 15, 2018 at 12:29:06PM +0200, Thomas Gleixner wrote:
> > You have to consider two cases:
> >
> > 1) !remapped mode:
> >
> > That's reasonably simple because you just have to deal with the HPET
> > TIMERn_PROCMSG_ROUT register. But then you need to do this directly and
> > not through any of the existing interrupt facilities.
>
> Indeed, there is no need to use the generic interrupt faciities to set affinity;
> I am dealing with an NMI anyways.
> >
> > 2) remapped mode:
> >
> > That's way more complex as you _cannot_ ever do anything which touches
> > the IOMMU and the related tables.
> >
> > So you'd need to reserve an IOMMU remapping entry for each CPU upfront,
> > store the resulting value for the HPET TIMERn_PROCMSG_ROUT register in
> > per cpu storage and just modify that one from NMI.
> >
> > Though there might be subtle side effects involved, which are related to
> > the acknowledge part. You need to talk to the IOMMU wizards first.
>
> I see. I will look into the code and prototype something that makes sense for
> the IOMMU maintainers.
I'd recommend to talk to them _before_ you cobble something together. If we
cannot reliably switch the affinity by directing the HPET NMI to a
different IOMMU remapping entry then the whole scheme does not work at all.
Thanks,
tglx
^ permalink raw reply
* Re: [RFC PATCH 17/23] watchdog/hardlockup/hpet: Convert the timer's interrupt to NMI
From: Thomas Gleixner @ 2018-06-16 13:24 UTC (permalink / raw)
To: Ricardo Neri
Cc: Ingo Molnar, H. Peter Anvin, Andi Kleen, Ashok Raj,
Borislav Petkov, Tony Luck, Ravi V. Shankar, x86, sparclinux,
linuxppc-dev, linux-kernel, Jacob Pan, Rafael J. Wysocki,
Don Zickus, Nicholas Piggin, Michael Ellerman,
Frederic Weisbecker, Alexei Starovoitov, Babu Moger,
Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
Andrew Morton, Philippe Ombredanne, Colin Ian King,
Byungchul Park, Paul E. McKenney, Luis R. Rodriguez, Waiman Long,
Josh Poimboeuf, Randy Dunlap, Davidlohr Bueso, Christoffer Dall,
Marc Zyngier, Kai-Heng Feng, Konrad Rzeszutek Wilk,
David Rientjes, iommu
In-Reply-To: <20180616005103.GC6659@voyager>
On Fri, 15 Jun 2018, Ricardo Neri wrote:
> On Fri, Jun 15, 2018 at 11:19:09AM +0200, Thomas Gleixner wrote:
> > On Thu, 14 Jun 2018, Ricardo Neri wrote:
> > > Alternatively, there could be a counter that skips reading the HPET status
> > > register (and the detection of hardlockups) for every X NMIs. This would
> > > reduce the overall frequency of HPET register reads.
> >
> > Great plan. So if the watchdog is the only NMI (because perf is off) then
> > you delay the watchdog detection by that count.
>
> OK. This was a bad idea. Then, is it acceptable to have an read to an HPET
> register per NMI just to check in the status register if the HPET timer
> caused the NMI?
The status register is useless in case of MSI. MSI is edge triggered ....
The only register which gives you proper information is the counter
register itself. That adds an massive overhead to each NMI, because the
counter register access is synchronized to the HPET clock with hardware
magic. Plus on larger systems, the HPET access is cross node and even
slower.
Thanks,
tglx
^ permalink raw reply
* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Eric Dumazet @ 2018-06-16 12:45 UTC (permalink / raw)
To: Mathieu Malaterre
Cc: David S. Miller, Eric Dumazet, LKML, Christophe LEROY,
Meelis Roos, schwab, netdev, linuxppc-dev
In-Reply-To: <CA+7wUsysaip70ARGQ7Byg1Zktfg0Y1esjEc9HxBGU4sKL8crhg@mail.gmail.com>
On 06/16/2018 12:14 AM, Mathieu Malaterre wrote:
> Hi Eric,
>
> On Fri, Jun 15, 2018 at 9:14 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>>
>> On 06/15/2018 11:56 AM, Mathieu Malaterre wrote:
>>> This reverts commit 88078d98d1bb085d72af8437707279e203524fa5.
>>>
>>> It causes regressions for people using chips driven by the sungem
>>> driver. Suspicion is that the skb->csum value isn't being adjusted
>>> properly.
>>>
>>> Symptoms as seen on G4+sungem are:
>>>
>>> [ 34.023281] eth0: hw csum failure
>>> [ 34.023438] CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0+ #2
>>> [ 34.023618] Call Trace:
>>> [ 34.023707] [dffedbd0] [c069ddac] __skb_checksum_complete+0xf0/0x108 (unreliable)
>>> [ 34.023948] [dffedbf0] [c0777a70] tcp_v4_rcv+0x604/0xe00
>>> [ 34.024118] [dffedc70] [c0731624] ip_local_deliver_finish+0xa8/0x3c4
>>> [ 34.024315] [dffedcb0] [c0732430] ip_local_deliver+0xf0/0x154
>>> [ 34.024493] [dffedcf0] [c07328dc] ip_rcv+0x448/0x774
>>> [ 34.024653] [dffedd50] [c06aeae0] __netif_receive_skb_core+0x5e8/0x1184
>>> [ 34.024857] [dffedde0] [c06bba20] napi_gro_receive+0x160/0x22c
>>> [ 34.025044] [dffede10] [e14b2590] gem_poll+0x7fc/0x1ac0 [sungem]
>>> [ 34.025228] [dffedee0] [c06bacf0] net_rx_action+0x34c/0x618
>>> [ 34.025402] [dffedf60] [c07fd27c] __do_softirq+0x16c/0x5f0
>>> [ 34.025575] [dffedfd0] [c0064c7c] irq_exit+0x110/0x1a8
>>> [ 34.025738] [dffedff0] [c0016170] call_do_irq+0x24/0x3c
>>> [ 34.025903] [c0cf7e80] [c0009a84] do_IRQ+0x98/0x1a0
>>> [ 34.026055] [c0cf7eb0] [c001b474] ret_from_except+0x0/0x14
>>> [ 34.026225] --- interrupt: 501 at arch_cpu_idle+0x30/0x78
>>> LR = arch_cpu_idle+0x30/0x78
>>> [ 34.026510] [c0cf7f70] [c0cf6000] 0xc0cf6000 (unreliable)
>>> [ 34.026682] [c0cf7f80] [c00a3868] do_idle+0xc4/0x158
>>> [ 34.026835] [c0cf7fb0] [c00a3ab0] cpu_startup_entry+0x20/0x28
>>> [ 34.027013] [c0cf7fc0] [c0998820] start_kernel+0x47c/0x490
>>> [ 34.027181] [c0cf7ff0] [00003444] 0x3444
>>>
>>> See commit 7ce5a27f2ef8 ("Revert "net: Handle CHECKSUM_COMPLETE more
>>> adequately in pskb_trim_rcsum()."") for previous reference.
>>
>> This fix seems to hide a bug in csum functions on this architecture.
>
> That's odd since it seems to only affect g4+sungem user. None of the
> ppc64 seems to be having it. And some ppc32 users are not even seeing
> it.
>
>> Or a bug on this NIC when receiving a small packet (less than 60 bytes).
>> Maybe the padding bytes are not included in NIC provided csum, and not 0.
>
> Ok in that case the bug is located in
> ./drivers/net/ethernet/sun/sungem.c that seems more likely. I'll try
> to understand that code, then.
I would try something like :
Basically do not bother using CHECKSUM_COMPLETE for small frames that might have been padded.
Since we need to bring one cache line in eth_type_trans() and further header processing,
computing the checksum in software will be almost free anyway.
diff --git a/drivers/net/ethernet/sun/sungem.c b/drivers/net/ethernet/sun/sungem.c
index 7a16d40a72d13cf1d522e8a3a396c826fe76f9b9..071039f211a8a33153e888bd4014314ba5e686a4 100644
--- a/drivers/net/ethernet/sun/sungem.c
+++ b/drivers/net/ethernet/sun/sungem.c
@@ -855,9 +855,11 @@ static int gem_rx(struct gem *gp, int work_to_do)
skb = copy_skb;
}
- csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 0xffff);
- skb->csum = csum_unfold(csum);
- skb->ip_summed = CHECKSUM_COMPLETE;
+ if (len > ETH_ZLEN) {
+ csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 0xffff);
+ skb->csum = csum_unfold(csum);
+ skb->ip_summed = CHECKSUM_COMPLETE;
+ }
skb->protocol = eth_type_trans(skb, gp->dev);
napi_gro_receive(&gp->napi, skb);
^ permalink raw reply related
* Re: [PATCH v1] mm: relax deferred struct page requirements
From: Jiri Slaby @ 2018-06-16 8:04 UTC (permalink / raw)
To: Michal Hocko, Pavel Tatashin
Cc: steven.sistare, daniel.m.jordan, benh, paulus, akpm,
kirill.shutemov, arbab, schwidefsky, heiko.carstens, x86,
linux-kernel, tglx, linuxppc-dev, linux-mm, linux-s390, mgorman
In-Reply-To: <20171121072416.v77vu4osm2s4o5sq@dhcp22.suse.cz>
On 11/21/2017, 08:24 AM, Michal Hocko wrote:
> On Thu 16-11-17 20:46:01, Pavel Tatashin wrote:
>> There is no need to have ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT,
>> as all the page initialization code is in common code.
>>
>> Also, there is no need to depend on MEMORY_HOTPLUG, as initialization code
>> does not really use hotplug memory functionality. So, we can remove this
>> requirement as well.
>>
>> This patch allows to use deferred struct page initialization on all
>> platforms with memblock allocator.
>>
>> Tested on x86, arm64, and sparc. Also, verified that code compiles on
>> PPC with CONFIG_MEMORY_HOTPLUG disabled.
>
> There is slight risk that we will encounter corner cases on some
> architectures with weird memory layout/topology
Which x86_32-pae seems to be. Many bad page state errors are emitted
during boot when this patch is applied:
BUG: Bad page state in process swapper pfn:3c01c
page:f566c3f0 count:0 mapcount:1 mapping:00000000 index:0x0
flags: 0x0()
raw: 00000000 00000000 00000000 00000000 00000000 00000100 00000200 00000000
raw: 00000000
page dumped because: nonzero mapcount
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Tainted: G B
4.17.1-4.gdf028bb-pae #1 openSUSE Tumbleweed (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0x7d/0xbd
bad_page.cold.111+0x90/0xc7
free_pages_check_bad+0x52/0x70
free_pcppages_bulk+0x37d/0x570
free_unref_page_commit+0x9a/0xc0
free_unref_page+0x6a/0xa0
__free_pages+0x17/0x30
free_highmem_page+0x1e/0x50
add_highpages_with_active_regions+0xd6/0x113
set_highmem_pages_init+0x67/0x7d
mem_init+0x23/0x1d9
start_kernel+0x1c2/0x437
i386_start_kernel+0x98/0x9c
startup_32_smp+0x164/0x168
free_pages_check_bad expects mapcount == -1, but it is 1 with this patch.
Reverting the patch makes the BUGs go away -- the config diff is then:
@@ -617,7 +617,7 @@
# CONFIG_PGTABLE_MAPPING is not set
# CONFIG_ZSMALLOC_STAT is not set
CONFIG_GENERIC_EARLY_IOREMAP=y
-CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
+CONFIG_ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT=y
# CONFIG_IDLE_PAGE_TRACKING is not set
CONFIG_FRAME_VECTOR=y
# CONFIG_PERCPU_STATS is not set
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -148,7 +148,6 @@ config PPC
>> select ARCH_MIGHT_HAVE_PC_PARPORT
>> select ARCH_MIGHT_HAVE_PC_SERIO
>> select ARCH_SUPPORTS_ATOMIC_RMW
>> - select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
>> select ARCH_USE_BUILTIN_BSWAP
>> select ARCH_USE_CMPXCHG_LOCKREF if PPC64
>> select ARCH_WANT_IPC_PARSE_VERSION
>> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
>> index 863a62a6de3c..525c2e3df6f5 100644
>> --- a/arch/s390/Kconfig
>> +++ b/arch/s390/Kconfig
>> @@ -108,7 +108,6 @@ config S390
>> select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE
>> select ARCH_SAVE_PAGE_KEYS if HIBERNATION
>> select ARCH_SUPPORTS_ATOMIC_RMW
>> - select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
>> select ARCH_SUPPORTS_NUMA_BALANCING
>> select ARCH_USE_BUILTIN_BSWAP
>> select ARCH_USE_CMPXCHG_LOCKREF
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index df3276d6bfe3..00a5446de394 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -69,7 +69,6 @@ config X86
>> select ARCH_MIGHT_HAVE_PC_PARPORT
>> select ARCH_MIGHT_HAVE_PC_SERIO
>> select ARCH_SUPPORTS_ATOMIC_RMW
>> - select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
>> select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
>> select ARCH_USE_BUILTIN_BSWAP
>> select ARCH_USE_QUEUED_RWLOCKS
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 9c4bdddd80c2..c6bd0309ce7a 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -639,15 +639,10 @@ config MAX_STACK_SIZE_MB
>>
>> A sane initial value is 80 MB.
>>
>> -# For architectures that support deferred memory initialisation
>> -config ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
>> - bool
>> -
>> config DEFERRED_STRUCT_PAGE_INIT
>> bool "Defer initialisation of struct pages to kthreads"
>> default n
>> - depends on ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT
>> - depends on NO_BOOTMEM && MEMORY_HOTPLUG
>> + depends on NO_BOOTMEM
>> depends on !FLATMEM
>> help
>> Ordinarily all struct pages are initialised during early boot in a
thanks,
--
js
suse labs
^ permalink raw reply
* Re: [PATCH] Revert "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"
From: Mathieu Malaterre @ 2018-06-16 7:14 UTC (permalink / raw)
To: eric.dumazet
Cc: David S. Miller, Eric Dumazet, LKML, Christophe LEROY,
Meelis Roos, schwab, netdev, linuxppc-dev
In-Reply-To: <fbb95c11-240c-1a11-0a62-0483908c577e@gmail.com>
Hi Eric,
On Fri, Jun 15, 2018 at 9:14 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 06/15/2018 11:56 AM, Mathieu Malaterre wrote:
> > This reverts commit 88078d98d1bb085d72af8437707279e203524fa5.
> >
> > It causes regressions for people using chips driven by the sungem
> > driver. Suspicion is that the skb->csum value isn't being adjusted
> > properly.
> >
> > Symptoms as seen on G4+sungem are:
> >
> > [ 34.023281] eth0: hw csum failure
> > [ 34.023438] CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0+ #2
> > [ 34.023618] Call Trace:
> > [ 34.023707] [dffedbd0] [c069ddac] __skb_checksum_complete+0xf0/0x108 (unreliable)
> > [ 34.023948] [dffedbf0] [c0777a70] tcp_v4_rcv+0x604/0xe00
> > [ 34.024118] [dffedc70] [c0731624] ip_local_deliver_finish+0xa8/0x3c4
> > [ 34.024315] [dffedcb0] [c0732430] ip_local_deliver+0xf0/0x154
> > [ 34.024493] [dffedcf0] [c07328dc] ip_rcv+0x448/0x774
> > [ 34.024653] [dffedd50] [c06aeae0] __netif_receive_skb_core+0x5e8/0x1184
> > [ 34.024857] [dffedde0] [c06bba20] napi_gro_receive+0x160/0x22c
> > [ 34.025044] [dffede10] [e14b2590] gem_poll+0x7fc/0x1ac0 [sungem]
> > [ 34.025228] [dffedee0] [c06bacf0] net_rx_action+0x34c/0x618
> > [ 34.025402] [dffedf60] [c07fd27c] __do_softirq+0x16c/0x5f0
> > [ 34.025575] [dffedfd0] [c0064c7c] irq_exit+0x110/0x1a8
> > [ 34.025738] [dffedff0] [c0016170] call_do_irq+0x24/0x3c
> > [ 34.025903] [c0cf7e80] [c0009a84] do_IRQ+0x98/0x1a0
> > [ 34.026055] [c0cf7eb0] [c001b474] ret_from_except+0x0/0x14
> > [ 34.026225] --- interrupt: 501 at arch_cpu_idle+0x30/0x78
> > LR = arch_cpu_idle+0x30/0x78
> > [ 34.026510] [c0cf7f70] [c0cf6000] 0xc0cf6000 (unreliable)
> > [ 34.026682] [c0cf7f80] [c00a3868] do_idle+0xc4/0x158
> > [ 34.026835] [c0cf7fb0] [c00a3ab0] cpu_startup_entry+0x20/0x28
> > [ 34.027013] [c0cf7fc0] [c0998820] start_kernel+0x47c/0x490
> > [ 34.027181] [c0cf7ff0] [00003444] 0x3444
> >
> > See commit 7ce5a27f2ef8 ("Revert "net: Handle CHECKSUM_COMPLETE more
> > adequately in pskb_trim_rcsum()."") for previous reference.
>
> This fix seems to hide a bug in csum functions on this architecture.
That's odd since it seems to only affect g4+sungem user. None of the
ppc64 seems to be having it. And some ppc32 users are not even seeing
it.
> Or a bug on this NIC when receiving a small packet (less than 60 bytes).
> Maybe the padding bytes are not included in NIC provided csum, and not 0.
Ok in that case the bug is located in
./drivers/net/ethernet/sun/sungem.c that seems more likely. I'll try
to understand that code, then.
Thanks
>
>
>
^ permalink raw reply
* [PATCH 0/3] Resolve -Wattribute-alias warnings from SYSCALL_DEFINEx()
From: Paul Burton @ 2018-06-16 0:53 UTC (permalink / raw)
To: linux-kbuild
Cc: Mauro Carvalho Chehab, linux-mips, Arnd Bergmann, Ingo Molnar,
Matthew Wilcox, Thomas Gleixner, Douglas Anderson, Josh Poimboeuf,
Andrew Morton, Matthias Kaehlcke, He Zhe, Benjamin Herrenschmidt,
Michal Marek, Khem Raj, Christophe Leroy, Al Viro, Stafford Horne,
Gideon Israel Dsouza, Masahiro Yamada, Kees Cook,
Michael Ellerman, Heiko Carstens, linux-kernel, Paul Mackerras,
linuxppc-dev, Paul Burton
This series introduces infrastructure allowing compiler diagnostics to
be disabled or their severity modified for specific pieces of code, with
suitable abstractions to prevent that code from becoming tied to a
specific compiler.
This infrastructure is then used to disable the -Wattribute-alias
warning around syscall definitions, which rely on type mismatches to
sanitize arguments.
Finally PowerPC-specific #pragma's are removed now that the generic code
is handling this.
The series takes Arnd's RFC patches & addresses the review comments they
received. The most notable effect of this series to to avoid warnings &
build failures caused by -Wattribute-alias when compiling the kernel
with GCC 8.
Applies cleanly atop master as of 9215310cf13b ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
Thanks,
Paul
Arnd Bergmann (2):
kbuild: add macro for controlling warnings to linux/compiler.h
disable -Wattribute-alias warning for SYSCALL_DEFINEx()
Paul Burton (1):
Revert "powerpc: fix build failure by disabling attribute-alias
warning in pci_32"
arch/powerpc/kernel/pci_32.c | 4 ---
include/linux/compat.h | 8 ++++-
include/linux/compiler-gcc.h | 66 ++++++++++++++++++++++++++++++++++
include/linux/compiler_types.h | 18 ++++++++++
include/linux/syscalls.h | 4 +++
5 files changed, 95 insertions(+), 5 deletions(-)
--
2.17.1
^ permalink raw reply
* [PATCH 2/3] disable -Wattribute-alias warning for SYSCALL_DEFINEx()
From: Paul Burton @ 2018-06-16 0:53 UTC (permalink / raw)
To: linux-kbuild
Cc: Mauro Carvalho Chehab, linux-mips, Arnd Bergmann, Ingo Molnar,
Matthew Wilcox, Thomas Gleixner, Douglas Anderson, Josh Poimboeuf,
Andrew Morton, Matthias Kaehlcke, He Zhe, Benjamin Herrenschmidt,
Michal Marek, Khem Raj, Christophe Leroy, Al Viro, Stafford Horne,
Gideon Israel Dsouza, Masahiro Yamada, Kees Cook,
Michael Ellerman, Heiko Carstens, linux-kernel, Paul Mackerras,
linuxppc-dev, Paul Burton
In-Reply-To: <20180616005323.7938-1-paul.burton@mips.com>
From: Arnd Bergmann <arnd@arndb.de>
gcc-8 warns for every single definition of a system call entry
point, e.g.:
include/linux/compat.h:56:18: error: 'compat_sys_rt_sigprocmask' alias between functions of incompatible types 'long int(int, compat_sigset_t *, compat_sigset_t *, compat_size_t)' {aka 'long int(int, struct <anonymous> *, struct <anonymous> *, unsigned int)'} and 'long int(long int, long int, long int, long int)' [-Werror=attribute-alias]
asmlinkage long compat_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))\
^~~~~~~~~~
include/linux/compat.h:45:2: note: in expansion of macro 'COMPAT_SYSCALL_DEFINEx'
COMPAT_SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
^~~~~~~~~~~~~~~~~~~~~~
kernel/signal.c:2601:1: note: in expansion of macro 'COMPAT_SYSCALL_DEFINE4'
COMPAT_SYSCALL_DEFINE4(rt_sigprocmask, int, how, compat_sigset_t __user *, nset,
^~~~~~~~~~~~~~~~~~~~~~
include/linux/compat.h:60:18: note: aliased declaration here
asmlinkage long compat_SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__))\
^~~~~~~~~~
The new warning seems reasonable in principle, but it doesn't
help us here, since we rely on the type mismatch to sanitize the
system call arguments. After I reported this as GCC PR82435, a new
-Wno-attribute-alias option was added that could be used to turn the
warning off globally on the command line, but I'd prefer to do it a
little more fine-grained.
Interestingly, turning a warning off and on again inside of
a single macro doesn't always work, in this case I had to add
an extra statement inbetween and decided to copy the __SC_TEST
one from the native syscall to the compat syscall macro. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83256 for more details
about this.
[paul.burton@mips.com:
- Rebase atop current master.
- Split GCC & version arguments to __diag_ignore() in order to match
changes to the preceding patch.
- Add the comment argument to match the preceding patch.]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82435
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Burton <paul.burton@mips.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Gideon Israel Dsouza <gidisrael@gmail.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Khem Raj <raj.khem@gmail.com>
Cc: He Zhe <zhe.he@windriver.com>
Cc: linux-kbuild@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
---
include/linux/compat.h | 8 +++++++-
include/linux/syscalls.h | 4 ++++
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/include/linux/compat.h b/include/linux/compat.h
index b1a5562b3215..c68acc47da57 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -72,6 +72,9 @@
*/
#ifndef COMPAT_SYSCALL_DEFINEx
#define COMPAT_SYSCALL_DEFINEx(x, name, ...) \
+ __diag_push(); \
+ __diag_ignore(GCC, 8, "-Wattribute-alias", \
+ "Type aliasing is used to sanitize syscall arguments");\
asmlinkage long compat_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)); \
asmlinkage long compat_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
__attribute__((alias(__stringify(__se_compat_sys##name)))); \
@@ -80,8 +83,11 @@
asmlinkage long __se_compat_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \
asmlinkage long __se_compat_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
{ \
- return __do_compat_sys##name(__MAP(x,__SC_DELOUSE,__VA_ARGS__));\
+ long ret = __do_compat_sys##name(__MAP(x,__SC_DELOUSE,__VA_ARGS__));\
+ __MAP(x,__SC_TEST,__VA_ARGS__); \
+ return ret; \
} \
+ __diag_pop(); \
static inline long __do_compat_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))
#endif /* COMPAT_SYSCALL_DEFINEx */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 73810808cdf2..a368a68cb667 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -231,6 +231,9 @@ static inline int is_syscall_trace_event(struct trace_event_call *tp_event)
*/
#ifndef __SYSCALL_DEFINEx
#define __SYSCALL_DEFINEx(x, name, ...) \
+ __diag_push(); \
+ __diag_ignore(GCC, 8, "-Wattribute-alias", \
+ "Type aliasing is used to sanitize syscall arguments");\
asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
__attribute__((alias(__stringify(__se_sys##name)))); \
ALLOW_ERROR_INJECTION(sys##name, ERRNO); \
@@ -243,6 +246,7 @@ static inline int is_syscall_trace_event(struct trace_event_call *tp_event)
__PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__)); \
return ret; \
} \
+ __diag_pop(); \
static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))
#endif /* __SYSCALL_DEFINEx */
--
2.17.1
^ permalink raw reply related
* [PATCH 3/3] Revert "powerpc: fix build failure by disabling attribute-alias warning in pci_32"
From: Paul Burton @ 2018-06-16 0:53 UTC (permalink / raw)
To: linux-kbuild
Cc: Mauro Carvalho Chehab, linux-mips, Arnd Bergmann, Ingo Molnar,
Matthew Wilcox, Thomas Gleixner, Douglas Anderson, Josh Poimboeuf,
Andrew Morton, Matthias Kaehlcke, He Zhe, Benjamin Herrenschmidt,
Michal Marek, Khem Raj, Christophe Leroy, Al Viro, Stafford Horne,
Gideon Israel Dsouza, Masahiro Yamada, Kees Cook,
Michael Ellerman, Heiko Carstens, linux-kernel, Paul Mackerras,
linuxppc-dev, Paul Burton
In-Reply-To: <20180616005323.7938-1-paul.burton@mips.com>
With SYSCALL_DEFINEx() disabling -Wattribute-alias generically, there's
no need to duplicate that for PowerPC's pciconfig_iobase syscall.
This reverts commit 415520373975 ("powerpc: fix build failure by
disabling attribute-alias warning in pci_32").
Signed-off-by: Paul Burton <paul.burton@mips.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Gideon Israel Dsouza <gidisrael@gmail.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Khem Raj <raj.khem@gmail.com>
Cc: He Zhe <zhe.he@windriver.com>
Cc: linux-kbuild@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
---
arch/powerpc/kernel/pci_32.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/arch/powerpc/kernel/pci_32.c b/arch/powerpc/kernel/pci_32.c
index 4f861055a852..d63b488d34d7 100644
--- a/arch/powerpc/kernel/pci_32.c
+++ b/arch/powerpc/kernel/pci_32.c
@@ -285,9 +285,6 @@ pci_bus_to_hose(int bus)
* Note that the returned IO or memory base is a physical address
*/
-#pragma GCC diagnostic push
-#pragma GCC diagnostic ignored "-Wpragmas"
-#pragma GCC diagnostic ignored "-Wattribute-alias"
SYSCALL_DEFINE3(pciconfig_iobase, long, which,
unsigned long, bus, unsigned long, devfn)
{
@@ -313,4 +310,3 @@ SYSCALL_DEFINE3(pciconfig_iobase, long, which,
return result;
}
-#pragma GCC diagnostic pop
--
2.17.1
^ permalink raw reply related
* [PATCH 1/3] kbuild: add macro for controlling warnings to linux/compiler.h
From: Paul Burton @ 2018-06-16 0:53 UTC (permalink / raw)
To: linux-kbuild
Cc: Mauro Carvalho Chehab, linux-mips, Arnd Bergmann, Ingo Molnar,
Matthew Wilcox, Thomas Gleixner, Douglas Anderson, Josh Poimboeuf,
Andrew Morton, Matthias Kaehlcke, He Zhe, Benjamin Herrenschmidt,
Michal Marek, Khem Raj, Christophe Leroy, Al Viro, Stafford Horne,
Gideon Israel Dsouza, Masahiro Yamada, Kees Cook,
Michael Ellerman, Heiko Carstens, linux-kernel, Paul Mackerras,
linuxppc-dev, Paul Burton
In-Reply-To: <20180616005323.7938-1-paul.burton@mips.com>
From: Arnd Bergmann <arnd@arndb.de>
I have occasionally run into a situation where it would make sense to
control a compiler warning from a source file rather than doing so from
a Makefile using the $(cc-disable-warning, ...) or $(cc-option, ...)
helpers.
The approach here is similar to what glibc uses, using __diag() and
related macros to encapsulate a _Pragma("GCC diagnostic ...") statement
that gets turned into the respective "#pragma GCC diagnostic ..." by
the preprocessor when the macro gets expanded.
Like glibc, I also have an argument to pass the affected compiler
version, but decided to actually evaluate that one. For now, this
supports GCC_4_6, GCC_4_7, GCC_4_8, GCC_4_9, GCC_5, GCC_6, GCC_7,
GCC_8 and GCC_9. Adding support for CLANG_5 and other interesting
versions is straightforward here. GNU compilers starting with gcc-4.2
could support it in principle, but "#pragma GCC diagnostic push"
was only added in gcc-4.6, so it seems simpler to not deal with those
at all. The same versions show a large number of warnings already,
so it seems easier to just leave it at that and not do a more
fine-grained control for them.
The use cases I found so far include:
- turning off the gcc-8 -Wattribute-alias warning inside of the
SYSCALL_DEFINEx() macro without having to do it globally.
- Reducing the build time for a simple re-make after a change,
once we move the warnings from ./Makefile and
./scripts/Makefile.extrawarn into linux/compiler.h
- More control over the warnings based on other configurations,
using preprocessor syntax instead of Makefile syntax. This should make
it easier for the average developer to understand and change things.
- Adding an easy way to turn the W=1 option on unconditionally
for a subdirectory or a specific file. This has been requested
by several developers in the past that want to have their subsystems
W=1 clean.
- Integrating clang better into the build systems. Clang supports
more warnings than GCC, and we probably want to classify them
as default, W=1, W=2 etc, but there are cases in which the
warnings should be classified differently due to excessive false
positives from one or the other compiler.
- Adding a way to turn the default warnings into errors (e.g. using
a new "make E=0" tag) while not also turning the W=1 warnings into
errors.
This patch for now just adds the minimal infrastructure in order to
do the first of the list above. As the #pragma GCC diagnostic
takes precedence over command line options, the next step would be
to convert a lot of the individual Makefiles that set nonstandard
options to use __diag() instead.
[paul.burton@mips.com:
- Rebase atop current master.
- Add __diag_GCC, or more generally __diag_<compiler>, abstraction to
avoid code outside of linux/compiler-gcc.h needing to duplicate
knowledge about different GCC versions.
- Add a comment argument to __diag_{ignore,warn,error} which isn't
used in the expansion of the macros but serves to push people to
document the reason for using them - per feedback from Kees Cook.]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Burton <paul.burton@mips.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Gideon Israel Dsouza <gidisrael@gmail.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Khem Raj <raj.khem@gmail.com>
Cc: He Zhe <zhe.he@windriver.com>
Cc: linux-kbuild@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
---
include/linux/compiler-gcc.h | 66 ++++++++++++++++++++++++++++++++++
include/linux/compiler_types.h | 18 ++++++++++
2 files changed, 84 insertions(+)
diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index f1a7492a5cc8..aba64a2912d8 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -347,3 +347,69 @@
#if GCC_VERSION >= 50100
#define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1
#endif
+
+/*
+ * turn individual warnings and errors on and off locally, depending
+ * on version.
+ */
+#define __diag_GCC(version, s) __diag_GCC_ ## version(s)
+
+#if GCC_VERSION >= 40600
+#define __diag_str1(s) #s
+#define __diag_str(s) __diag_str1(s)
+#define __diag(s) _Pragma(__diag_str(GCC diagnostic s))
+
+/* compilers before gcc-4.6 do not understand "#pragma GCC diagnostic push" */
+#define __diag_GCC_4_6(s) __diag(s)
+#else
+#define __diag(s)
+#define __diag_GCC_4_6(s)
+#endif
+
+#if GCC_VERSION >= 40700
+#define __diag_GCC_4_7(s) __diag(s)
+#else
+#define __diag_GCC_4_7(s)
+#endif
+
+#if GCC_VERSION >= 40800
+#define __diag_GCC_4_8(s) __diag(s)
+#else
+#define __diag_GCC_4_8(s)
+#endif
+
+#if GCC_VERSION >= 40900
+#define __diag_GCC_4_9(s) __diag(s)
+#else
+#define __diag_GCC_4_9(s)
+#endif
+
+#if GCC_VERSION >= 50000
+#define __diag_GCC_5(s) __diag(s)
+#else
+#define __diag_GCC_5(s)
+#endif
+
+#if GCC_VERSION >= 60000
+#define __diag_GCC_6(s) __diag(s)
+#else
+#define __diag_GCC_6(s)
+#endif
+
+#if GCC_VERSION >= 70000
+#define __diag_GCC_7(s) __diag(s)
+#else
+#define __diag_GCC_7(s)
+#endif
+
+#if GCC_VERSION >= 80000
+#define __diag_GCC_8(s) __diag(s)
+#else
+#define __diag_GCC_8(s)
+#endif
+
+#if GCC_VERSION >= 90000
+#define __diag_GCC_9(s) __diag(s)
+#else
+#define __diag_GCC_9(s)
+#endif
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index 6b79a9bba9a7..313a2ad884e0 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -271,4 +271,22 @@ struct ftrace_likely_data {
# define __native_word(t) (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
#endif
+#ifndef __diag
+#define __diag(string)
+#endif
+
+#ifndef __diag_GCC
+#define __diag_GCC(string)
+#endif
+
+#define __diag_push() __diag(push)
+#define __diag_pop() __diag(pop)
+
+#define __diag_ignore(compiler, version, option, comment) \
+ __diag_ ## compiler(version, ignored option)
+#define __diag_warn(compiler, version, option, comment) \
+ __diag_ ## compiler(version, warning option)
+#define __diag_error(compiler, version, option, comment) \
+ __diag_ ## compiler(version, error option)
+
#endif /* __LINUX_COMPILER_TYPES_H */
--
2.17.1
^ permalink raw reply related
* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Benjamin Herrenschmidt @ 2018-06-16 1:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Ram Pai, Michael S. Tsirkin, robh, pawel.moll, Tom Lendacky, aik,
jasowang, cohuck, linux-kernel, virtualization, joe,
Rustad, Mark D, david, linuxppc-dev, elfring, Anshuman Khandual
In-Reply-To: <20180615091624.GA1064@infradead.org>
On Fri, 2018-06-15 at 02:16 -0700, Christoph Hellwig wrote:
> On Wed, Jun 13, 2018 at 11:11:01PM +1000, Benjamin Herrenschmidt wrote:
> > Actually ... the stuff in lib/dma-direct.c seems to be just it, no ?
> >
> > There's no cache flushing and there's no architecture hooks that I can
> > see other than the AMD security stuff which is probably fine.
> >
> > Or am I missing something ?
>
> You are missing the __phys_to_dma arch hook that allows architectures
> to adjust the dma address. Various systems have offsets, or even
> multiple banks with different offsets there. Most of them don't
> use the dma-direct code yet (working on it), but there are a few
> examples in the tree already.
Ok and on those systems, qemu will bypass said offset ? Maybe we could
just create a device DMA "flag" or (or use the attributes) to instruct
dma-direct to not use that for legacy virtio ?
Cheers,
Ben.
^ permalink raw reply
* Re: [PATCH kernel 2/2] powerpc/powernv: Define PHB4 type and enable sketchy bypass on POWER9
From: Benjamin Herrenschmidt @ 2018-06-16 1:05 UTC (permalink / raw)
To: Alexey Kardashevskiy, linuxppc-dev; +Cc: Alistair Popple, Russell Currey
In-Reply-To: <20180601081028.29401-3-aik@ozlabs.ru>
On Fri, 2018-06-01 at 18:10 +1000, Alexey Kardashevskiy wrote:
> These are found in POWER9 chips. Right now these PHBs have unknown type
> so changing it to PHB4 won't make much of a difference except enabling
> sketchy bypass for POWER9 as this does below.
And that will break on multi-chip systems since P9 doesn't have the
memory contiguous (it has the chip ID in the top bits).
Russell is working on a different implementation that should be much
more imune to the system physical memory layout.
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> arch/powerpc/platforms/powernv/pci.h | 1 +
> arch/powerpc/platforms/powernv/pci-ioda.c | 5 ++++-
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index eada4b6..1408247 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -23,6 +23,7 @@ enum pnv_phb_model {
> PNV_PHB_MODEL_UNKNOWN,
> PNV_PHB_MODEL_P7IOC,
> PNV_PHB_MODEL_PHB3,
> + PNV_PHB_MODEL_PHB4,
> PNV_PHB_MODEL_NPU,
> PNV_PHB_MODEL_NPU2,
> };
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 9239142..66c2804 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1882,7 +1882,8 @@ static int pnv_pci_ioda_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)
> if (dma_mask >> 32 &&
> dma_mask > (memory_hotplug_max() + (1ULL << 32)) &&
> pnv_pci_ioda_pe_single_vendor(pe) &&
> - phb->model == PNV_PHB_MODEL_PHB3) {
> + (phb->model == PNV_PHB_MODEL_PHB3 ||
> + phb->model == PNV_PHB_MODEL_PHB4)) {
> /* Configure the bypass mode */
> rc = pnv_pci_ioda_dma_64bit_bypass(pe);
> if (rc)
> @@ -3930,6 +3931,8 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
> phb->model = PNV_PHB_MODEL_P7IOC;
> else if (of_device_is_compatible(np, "ibm,power8-pciex"))
> phb->model = PNV_PHB_MODEL_PHB3;
> + else if (of_device_is_compatible(np, "ibm,power9-pciex"))
> + phb->model = PNV_PHB_MODEL_PHB4;
> else if (of_device_is_compatible(np, "ibm,power8-npu-pciex"))
> phb->model = PNV_PHB_MODEL_NPU;
> else if (of_device_is_compatible(np, "ibm,power9-npu-pciex"))
^ permalink raw reply
* Re: [PATCH kernel 1/2] powerpc/powernv: Reuse existing TCE code for sketchy bypass
From: Benjamin Herrenschmidt @ 2018-06-16 1:04 UTC (permalink / raw)
To: Alexey Kardashevskiy, linuxppc-dev; +Cc: Alistair Popple, Russell Currey
In-Reply-To: <20180601081028.29401-2-aik@ozlabs.ru>
On Fri, 2018-06-01 at 18:10 +1000, Alexey Kardashevskiy wrote:
> The existing sketchy bypass ignores the existing default 32bit TCE table
> (created by default for every PE at boot time or after being used by
> VFIO) and it allocates another table instead without updating PE DMA
> config (pe->table_group). So if we decide to use such device for VFIO
> later, this new table will also leak memory.
>
> This replaces adhoc table allocation and programming with the existing
> API which handles memory leaks.
>
> This programs the default 32bit table back to TVE#0 if configuring
> the new table failed for some reason.
>
> While we are at it, switch from the hardcoded 256MB TCEs to the biggest
> size supported by the hardware and reported by the firmware. This allows
> the sketchy bypass (originally made for POWER8 only) to work on POWER9
> too assuming that PHB4 type is defined and pnv_pci_ioda_dma_64bit_bypass()
> is called (coming next).
It won't work on POWER9 for other reasons. Mostly memory isn't
contiguous there.
What we shuld look into rather than removing the 32-bit table is
exploit DD2 P9 feature allowing to overlay TVE0 and TVE1 so that 0..4G
is in control of TVE0 and the rest goes to TVE1.
> This does not call iommu_init_table() for the new table as the caller
> will use &dma_nommu_ops and therefore ::it_map is not needed.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>
> Tested with:
> if (pe->tce_bypass_enabled) {
> top = pe->tce_bypass_base + memblock_end_of_DRAM() - 1;
> - bypass = (dma_mask >= top);
> + bypass = false;//(dma_mask >= top);
> }
> ---
> arch/powerpc/platforms/powernv/pci-ioda.c | 71 +++++++++++++++++--------------
> 1 file changed, 39 insertions(+), 32 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index ceb7e64..9239142 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1791,54 +1791,61 @@ static bool pnv_pci_ioda_pe_single_vendor(struct pnv_ioda_pe *pe)
> *
> * Currently this will only work on PHB3 (POWER8).
> */
> +static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
> + int num, __u32 page_shift, __u64 window_size, __u32 levels,
> + struct iommu_table **ptbl);
> +
> +static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
> + int num, struct iommu_table *tbl);
> +
> +static unsigned long pnv_ioda_parse_tce_sizes(struct pnv_phb *phb);
> +
> static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe)
> {
> - u64 window_size, table_size, tce_count, addr;
> - struct page *table_pages;
> - u64 tce_order = 28; /* 256MB TCEs */
> - __be64 *tces;
> + u64 window_size;
> s64 rc;
> + struct iommu_table *tbl, *oldtbl = NULL;
> + unsigned long shift, offset;
>
> /*
> * Window size needs to be a power of two, but needs to account for
> * shifting memory by the 4GB offset required to skip 32bit space.
> */
> - window_size = roundup_pow_of_two(memory_hotplug_max() + (1ULL << 32));
> - tce_count = window_size >> tce_order;
> - table_size = tce_count << 3;
> -
> - if (table_size < PAGE_SIZE)
> - table_size = PAGE_SIZE;
> + window_size = roundup_pow_of_two(memory_hotplug_max() + SZ_4G);
> + shift = ilog2(pnv_ioda_parse_tce_sizes(pe->phb));
> + rc = pnv_pci_ioda2_create_table(&pe->table_group, 0, shift, window_size,
> + POWERNV_IOMMU_DEFAULT_LEVELS, &tbl);
> + if (rc) {
> + pe_err(pe, "Failed to create 64-bypass TCE table, err %ld", rc);
> + return rc;
> + }
>
> - table_pages = alloc_pages_node(pe->phb->hose->node, GFP_KERNEL,
> - get_order(table_size));
> - if (!table_pages)
> + offset = SZ_4G >> shift;
> + rc = tbl->it_ops->set(tbl, offset, tbl->it_size - offset,
> + 0 /* uaddr */, DMA_BIDIRECTIONAL, 0 /* attrs */);
> + if (rc)
> goto err;
>
> - tces = page_address(table_pages);
> - if (!tces)
> + if (pe->table_group.tables[0]) {
> + oldtbl = pe->table_group.tables[0];
> + pnv_pci_ioda2_unset_window(&pe->table_group, 0);
> + }
> +
> + rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
> + if (rc != OPAL_SUCCESS) {
> + rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, oldtbl);
> goto err;
> + }
>
> - memset(tces, 0, table_size);
> + if (oldtbl)
> + iommu_tce_table_put(oldtbl);
>
> - for (addr = 0; addr < memory_hotplug_max(); addr += (1 << tce_order)) {
> - tces[(addr + (1ULL << 32)) >> tce_order] =
> - cpu_to_be64(addr | TCE_PCI_READ | TCE_PCI_WRITE);
> - }
> + pe_info(pe, "Using 64-bit DMA iommu bypass (through TVE#0)\n");
> + return 0;
>
> - rc = opal_pci_map_pe_dma_window(pe->phb->opal_id,
> - pe->pe_number,
> - /* reconfigure window 0 */
> - (pe->pe_number << 1) + 0,
> - 1,
> - __pa(tces),
> - table_size,
> - 1 << tce_order);
> - if (rc == OPAL_SUCCESS) {
> - pe_info(pe, "Using 64-bit DMA iommu bypass (through TVE#0)\n");
> - return 0;
> - }
> err:
> + iommu_tce_table_put(tbl);
> +
> pe_err(pe, "Error configuring 64-bit DMA bypass\n");
> return -EIO;
> }
^ permalink raw reply
* Re: [RFC PATCH 17/23] watchdog/hardlockup/hpet: Convert the timer's interrupt to NMI
From: Ricardo Neri @ 2018-06-16 0:51 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Ingo Molnar, H. Peter Anvin, Andi Kleen, Ashok Raj,
Borislav Petkov, Tony Luck, Ravi V. Shankar, x86, sparclinux,
linuxppc-dev, linux-kernel, Jacob Pan, Rafael J. Wysocki,
Don Zickus, Nicholas Piggin, Michael Ellerman,
Frederic Weisbecker, Alexei Starovoitov, Babu Moger,
Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
Andrew Morton, Philippe Ombredanne, Colin Ian King,
Byungchul Park, Paul E. McKenney, Luis R. Rodriguez, Waiman Long,
Josh Poimboeuf, Randy Dunlap, Davidlohr Bueso, Christoffer Dall,
Marc Zyngier, Kai-Heng Feng, Konrad Rzeszutek Wilk,
David Rientjes, iommu
In-Reply-To: <alpine.DEB.2.21.1806151029210.2079@nanos.tec.linutronix.de>
On Fri, Jun 15, 2018 at 11:19:09AM +0200, Thomas Gleixner wrote:
> On Thu, 14 Jun 2018, Ricardo Neri wrote:
> > On Wed, Jun 13, 2018 at 11:40:00AM +0200, Thomas Gleixner wrote:
> > > On Tue, 12 Jun 2018, Ricardo Neri wrote:
> > > > @@ -183,6 +184,8 @@ static irqreturn_t hardlockup_detector_irq_handler(int irq, void *data)
> > > > if (!(hdata->flags & HPET_DEV_PERI_CAP))
> > > > kick_timer(hdata);
> > > >
> > > > + pr_err("This interrupt should not have happened. Ensure delivery mode is NMI.\n");
> > >
> > > Eeew.
> >
> > If you don't mind me asking. What is the problem with this error message?
>
> The problem is not the error message. The problem is the abuse of
> request_irq() and the fact that this irq handler function exists in the
> first place for something which is NMI based.
I wanted to add this handler in case the interrupt was not configured correctly
to be delivered as NMI (e.g., not supported by the hardware). I see your point.
Perhaps this is not needed. There is code in place to complain when an interrupt
that nobody was expecting happens.
>
> > > And in case that the HPET does not support periodic mode this reprogramms
> > > the timer on every NMI which means that while perf is running the watchdog
> > > will never ever detect anything.
> >
> > Yes. I see that this is wrong. With MSI interrupts, as far as I can
> > see, there is not a way to make sure that the HPET timer caused the NMI
> > perhaps the only option is to use an IO APIC interrupt and read the
> > interrupt status register.
> >
> > > Aside of that, reading TWO HPET registers for every NMI is insane. HPET
> > > access is horribly slow, so any high frequency perf monitoring will take a
> > > massive performance hit.
> >
> > If an IO APIC interrupt is used, only HPET register (the status register)
> > would need to be read for every NMI. Would that be more acceptable? Otherwise,
> > there is no way to determine if the HPET cause the NMI.
>
> You need level trigger for the HPET status register to be useful at all
> because in edge mode the interrupt status bits read always 0.
Indeed.
>
> That means you have to fiddle with the IOAPIC acknowledge magic from NMI
> context. Brilliant idea. If the NMI hits in the middle of a regular
> io_apic_read() then the interrupted code will endup with the wrong index
> register. Not to talk about the fun which the affinity rotation from NMI
> context would bring.
>
> Do not even think about using IOAPIC and level for this.
OK. I will stay away of it and focus on MSI.
>
> > Alternatively, there could be a counter that skips reading the HPET status
> > register (and the detection of hardlockups) for every X NMIs. This would
> > reduce the overall frequency of HPET register reads.
>
> Great plan. So if the watchdog is the only NMI (because perf is off) then
> you delay the watchdog detection by that count.
OK. This was a bad idea. Then, is it acceptable to have an read to an HPET
register per NMI just to check in the status register if the HPET timer
caused the NMI?
Thanks and BR,
Ricardo
^ permalink raw reply
* Re: [RFC PATCH 20/23] watchdog/hardlockup/hpet: Rotate interrupt among all monitored CPUs
From: Ricardo Neri @ 2018-06-16 0:46 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Ingo Molnar, H. Peter Anvin, Andi Kleen, Ashok Raj,
Borislav Petkov, Tony Luck, Ravi V. Shankar, x86, sparclinux,
linuxppc-dev, linux-kernel, Jacob Pan, Rafael J. Wysocki,
Don Zickus, Nicholas Piggin, Michael Ellerman,
Frederic Weisbecker, Alexei Starovoitov, Babu Moger,
Mathieu Desnoyers, Masami Hiramatsu, Peter Zijlstra,
Andrew Morton, Philippe Ombredanne, Colin Ian King,
Byungchul Park, Paul E. McKenney, Luis R. Rodriguez, Waiman Long,
Josh Poimboeuf, Randy Dunlap, Davidlohr Bueso, Christoffer Dall,
Marc Zyngier, Kai-Heng Feng, Konrad Rzeszutek Wilk,
David Rientjes, iommu
In-Reply-To: <alpine.DEB.2.21.1806151122070.2079@nanos.tec.linutronix.de>
On Fri, Jun 15, 2018 at 12:29:06PM +0200, Thomas Gleixner wrote:
> On Thu, 14 Jun 2018, Ricardo Neri wrote:
> > On Wed, Jun 13, 2018 at 11:48:09AM +0200, Thomas Gleixner wrote:
> > > On Tue, 12 Jun 2018, Ricardo Neri wrote:
> > > > + /* There are no CPUs to monitor. */
> > > > + if (!cpumask_weight(&hdata->monitored_mask))
> > > > + return NMI_HANDLED;
> > > > +
> > > > inspect_for_hardlockups(regs);
> > > >
> > > > + /*
> > > > + * Target a new CPU. Keep trying until we find a monitored CPU. CPUs
> > > > + * are addded and removed to this mask at cpu_up() and cpu_down(),
> > > > + * respectively. Thus, the interrupt should be able to be moved to
> > > > + * the next monitored CPU.
> > > > + */
> > > > + spin_lock(&hld_data->lock);
> > >
> > > Yuck. Taking a spinlock from NMI ...
> >
> > I am sorry. I will look into other options for locking. Do you think rcu_lock
> > would help in this case? I need this locking because the CPUs being monitored
> > changes as CPUs come online and offline.
>
> Sure, but you _cannot_ take any locks in NMI context which are also taken
> in !NMI context. And RCU will not help either. How so? The NMI can hit
> exactly before the CPU bit is cleared and then the CPU goes down. So RCU
> _cannot_ protect anything.
>
> All you can do there is make sure that the TIMn_CONF is only ever accessed
> in !NMI code. Then you can stop the timer _before_ a CPU goes down and make
> sure that the eventually on the fly NMI is finished. After that you can
> fiddle with the CPU mask and restart the timer. Be aware that this is going
> to be more corner case handling that actual functionality.
Thanks for the suggestion. It makes sense to stop the timer when updating the
CPU mask. In this manner the timer will not cause any NMI.
>
> > > > + for_each_cpu_wrap(cpu, &hdata->monitored_mask, smp_processor_id() + 1) {
> > > > + if (!irq_set_affinity(hld_data->irq, cpumask_of(cpu)))
> > > > + break;
> > >
> > > ... and then calling into generic interrupt code which will take even more
> > > locks is completely broken.
> >
> > I will into reworking how the destination of the interrupt is set.
>
> You have to consider two cases:
>
> 1) !remapped mode:
>
> That's reasonably simple because you just have to deal with the HPET
> TIMERn_PROCMSG_ROUT register. But then you need to do this directly and
> not through any of the existing interrupt facilities.
Indeed, there is no need to use the generic interrupt faciities to set affinity;
I am dealing with an NMI anyways.
>
> 2) remapped mode:
>
> That's way more complex as you _cannot_ ever do anything which touches
> the IOMMU and the related tables.
>
> So you'd need to reserve an IOMMU remapping entry for each CPU upfront,
> store the resulting value for the HPET TIMERn_PROCMSG_ROUT register in
> per cpu storage and just modify that one from NMI.
>
> Though there might be subtle side effects involved, which are related to
> the acknowledge part. You need to talk to the IOMMU wizards first.
I see. I will look into the code and prototype something that makes sense for
the IOMMU maintainers.
>
> All in all, the idea itself is interesting, but the envisioned approach of
> round robin and no fast accessible NMI reason detection is going to create
> more problems than it solves.
I see it more clearly now.
Thanks and BR,
Ricardo
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox