* Re: [PATCH v2 09/10] tools/perf: Add perf tools support for extended register capability in powerpc
From: Athira Rajeev @ 2020-07-13 2:36 UTC (permalink / raw)
To: Michael Ellerman; +Cc: Michael Neuling, maddy, linuxppc-dev
In-Reply-To: <87pn962owo.fsf@mpe.ellerman.id.au>
[-- Attachment #1: Type: text/plain, Size: 6296 bytes --]
> On 08-Jul-2020, at 5:34 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> Athira Rajeev <atrajeev@linux.vnet.ibm.com <mailto:atrajeev@linux.vnet.ibm.com>> writes:
>> From: Anju T Sudhakar <anju@linux.vnet.ibm.com>
>>
>> Add extended regs to sample_reg_mask in the tool side to use
>> with `-I?` option. Perf tools side uses extended mask to display
>> the platform supported register names (with -I? option) to the user
>> and also send this mask to the kernel to capture the extended registers
>> in each sample. Hence decide the mask value based on the processor
>> version.
>>
>> Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
>> [Decide extended mask at run time based on platform]
>> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
>> Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
>
> Will need an ack from perf tools folks, who are not on Cc by the looks.
>
>> diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
>> index f599064..485b1d5 100644
>> --- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h
>> +++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
>> @@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
>> PERF_REG_POWERPC_DSISR,
>> PERF_REG_POWERPC_SIER,
>> PERF_REG_POWERPC_MMCRA,
>> - PERF_REG_POWERPC_MAX,
>> + /* Extended registers */
>> + PERF_REG_POWERPC_MMCR0,
>> + PERF_REG_POWERPC_MMCR1,
>> + PERF_REG_POWERPC_MMCR2,
>> + /* Max regs without the extended regs */
>> + PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
>
> I don't really understand this idea of a max that's not the max.
Hi Michael
This is the MAX without extended regs. This is mainly used in `arch/powerpc/perf/perf_regs.c` to define pt_regs_offset ( to get index
for other regs ) and also is used to determine whether requested register is an extended reg while capturing data in sample
( in `perf_reg_value` )
Thanks
Athira
>
>> };
>> +
>> +#define PERF_REG_PMU_MASK ((1ULL << PERF_REG_POWERPC_MAX) - 1)
>> +
>> +/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
>> +#define PERF_REG_PMU_MASK_300 (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) \
>> + - PERF_REG_PMU_MASK)
>> +
>> #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
>> diff --git a/tools/perf/arch/powerpc/include/perf_regs.h b/tools/perf/arch/powerpc/include/perf_regs.h
>> index e18a355..46ed00d 100644
>> --- a/tools/perf/arch/powerpc/include/perf_regs.h
>> +++ b/tools/perf/arch/powerpc/include/perf_regs.h
>> @@ -64,7 +64,10 @@
>> [PERF_REG_POWERPC_DAR] = "dar",
>> [PERF_REG_POWERPC_DSISR] = "dsisr",
>> [PERF_REG_POWERPC_SIER] = "sier",
>> - [PERF_REG_POWERPC_MMCRA] = "mmcra"
>> + [PERF_REG_POWERPC_MMCRA] = "mmcra",
>> + [PERF_REG_POWERPC_MMCR0] = "mmcr0",
>> + [PERF_REG_POWERPC_MMCR1] = "mmcr1",
>> + [PERF_REG_POWERPC_MMCR2] = "mmcr2",
>> };
>>
>> static inline const char *perf_reg_name(int id)
>> diff --git a/tools/perf/arch/powerpc/util/perf_regs.c b/tools/perf/arch/powerpc/util/perf_regs.c
>> index 0a52429..9179230 100644
>> --- a/tools/perf/arch/powerpc/util/perf_regs.c
>> +++ b/tools/perf/arch/powerpc/util/perf_regs.c
>> @@ -6,9 +6,14 @@
>>
>> #include "../../../util/perf_regs.h"
>> #include "../../../util/debug.h"
>> +#include "../../../util/event.h"
>> +#include "../../../util/header.h"
>> +#include "../../../perf-sys.h"
>>
>> #include <linux/kernel.h>
>>
>> +#define PVR_POWER9 0x004E
>> +
>> const struct sample_reg sample_reg_masks[] = {
>> SMPL_REG(r0, PERF_REG_POWERPC_R0),
>> SMPL_REG(r1, PERF_REG_POWERPC_R1),
>> @@ -55,6 +60,9 @@
>> SMPL_REG(dsisr, PERF_REG_POWERPC_DSISR),
>> SMPL_REG(sier, PERF_REG_POWERPC_SIER),
>> SMPL_REG(mmcra, PERF_REG_POWERPC_MMCRA),
>> + SMPL_REG(mmcr0, PERF_REG_POWERPC_MMCR0),
>> + SMPL_REG(mmcr1, PERF_REG_POWERPC_MMCR1),
>> + SMPL_REG(mmcr2, PERF_REG_POWERPC_MMCR2),
>> SMPL_REG_END
>> };
>>
>> @@ -163,3 +171,50 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>>
>> return SDT_ARG_VALID;
>> }
>> +
>> +uint64_t arch__intr_reg_mask(void)
>> +{
>> + struct perf_event_attr attr = {
>> + .type = PERF_TYPE_HARDWARE,
>> + .config = PERF_COUNT_HW_CPU_CYCLES,
>> + .sample_type = PERF_SAMPLE_REGS_INTR,
>> + .precise_ip = 1,
>> + .disabled = 1,
>> + .exclude_kernel = 1,
>> + };
>> + int fd, ret;
>> + char buffer[64];
>> + u32 version;
>> + u64 extended_mask = 0;
>> +
>> + /* Get the PVR value to set the extended
>> + * mask specific to platform
>
> Comment format is wrong, and punctuation please.
>
>> + */
>> + get_cpuid(buffer, sizeof(buffer));
>> + ret = sscanf(buffer, "%u,", &version);
>
> This is powerpc specific code, why not just use mfspr(SPRN_PVR), rather
> than redirecting via printf/sscanf.
>
>> +
>> + if (ret != 1) {
>> + pr_debug("Failed to get the processor version, unable to output extended registers\n");
>> + return PERF_REGS_MASK;
>> + }
>> +
>> + if (version == PVR_POWER9)
>> + extended_mask = PERF_REG_PMU_MASK_300;
>> + else
>> + return PERF_REGS_MASK;
>> +
>> + attr.sample_regs_intr = extended_mask;
>> + attr.sample_period = 1;
>> + event_attr_init(&attr);
>> +
>> + /*
>> + * check if the pmu supports perf extended regs, before
>> + * returning the register mask to sample.
>> + */
>> + fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
>> + if (fd != -1) {
>> + close(fd);
>> + return (extended_mask | PERF_REGS_MASK);
>> + }
>> + return PERF_REGS_MASK;
>
> I think this would read a bit better like:
>
> mask = PERF_REGS_MASK;
>
> if (version == PVR_POWER9)
> extended_mask = PERF_REG_PMU_MASK_300;
> else
> return mask;
>
> attr.sample_regs_intr = extended_mask;
> attr.sample_period = 1;
> event_attr_init(&attr);
>
> /*
> * check if the pmu supports perf extended regs, before
> * returning the register mask to sample.
> */
> fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> if (fd != -1) {
> close(fd);
> mask |= extended_mask;
> }
>
> return mask;
>
>
> cheers
[-- Attachment #2: Type: text/html, Size: 43626 bytes --]
^ permalink raw reply
* [PATCH] powerpc/book3s64/pkeys: Fix pkey_access_permitted w.r.t execute disable pkey
From: Aneesh Kumar K.V @ 2020-07-12 13:20 UTC (permalink / raw)
To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V, linuxram, Sandipan Das
Even if the IAMR value deny the execute access, current kernel return true
w.r.t pkey_access_permitted() for execute permission check if the AMR
read pkey bit is cleared.
This results in repeated page fault loop with a test like below.
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <inttypes.h>
#include <assert.h>
#include <malloc.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/mman.h>
#ifdef SYS_pkey_mprotect
#undef SYS_pkey_mprotect
#endif
#ifdef SYS_pkey_alloc
#undef SYS_pkey_alloc
#endif
#ifdef SYS_pkey_free
#undef SYS_pkey_free
#endif
#undef PKEY_DISABLE_EXECUTE
#define PKEY_DISABLE_EXECUTE 0x4
#define SYS_pkey_mprotect 386
#define SYS_pkey_alloc 384
#define SYS_pkey_free 385
#define PPC_INST_NOP 0x60000000
#define PPC_INST_BLR 0x4e800020
#define PROT_RWX (PROT_READ | PROT_WRITE | PROT_EXEC)
static int sys_pkey_mprotect(void *addr, size_t len, int prot, int pkey)
{
return syscall(SYS_pkey_mprotect, addr, len, prot, pkey);
}
static int sys_pkey_alloc(unsigned long flags, unsigned long access_rights)
{
return syscall(SYS_pkey_alloc, flags, access_rights);
}
static int sys_pkey_free(int pkey)
{
return syscall(SYS_pkey_free, pkey);
}
static void do_execute(void *region)
{
/* jump to region */
asm volatile(
"mtctr %0;"
"bctrl"
: : "r"(region) : "ctr", "lr");
}
static void do_protect(void *region)
{
size_t pgsize;
int i, pkey;
pgsize = getpagesize();
pkey = sys_pkey_alloc(0, PKEY_DISABLE_EXECUTE);
assert (pkey > 0);
/* perform mprotect */
assert(!sys_pkey_mprotect(region, pgsize, PROT_RWX, pkey));
do_execute(region);
/* free pkey */
assert(!sys_pkey_free(pkey));
}
int main(int argc, char **argv)
{
size_t pgsize, numinsns;
unsigned int *region;
int i;
/* allocate memory region to protect */
pgsize = getpagesize();
region = memalign(pgsize, pgsize);
assert(region != NULL);
assert(!mprotect(region, pgsize, PROT_RWX));
/* fill page with NOPs with a BLR at the end */
numinsns = pgsize / sizeof(region[0]);
for (i = 0; i < numinsns - 1; i++)
region[i] = PPC_INST_NOP;
region[i] = PPC_INST_BLR;
do_protect(region);
return EXIT_SUCCESS;
}
Fixes: f2407ef3ba22 ("powerpc: helper to validate key-access permissions of a pte")
Reported-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
arch/powerpc/mm/book3s64/pkeys.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/pkeys.c b/arch/powerpc/mm/book3s64/pkeys.c
index ca5fcb4bff32..d174106bab67 100644
--- a/arch/powerpc/mm/book3s64/pkeys.c
+++ b/arch/powerpc/mm/book3s64/pkeys.c
@@ -354,12 +354,14 @@ static bool pkey_access_permitted(int pkey, bool write, bool execute)
u64 amr;
pkey_shift = pkeyshift(pkey);
- if (execute && !(read_iamr() & (IAMR_EX_BIT << pkey_shift)))
- return true;
+ if (execute)
+ return !(read_iamr() & (IAMR_EX_BIT << pkey_shift));
+
+ amr = read_amr();
+ if (write)
+ return !(amr & (AMR_WR_BIT << pkey_shift));
- amr = read_amr(); /* Delay reading amr until absolutely needed */
- return ((!write && !(amr & (AMR_RD_BIT << pkey_shift))) ||
- (write && !(amr & (AMR_WR_BIT << pkey_shift))));
+ return !(amr & (AMR_RD_BIT << pkey_shift));
}
bool arch_pte_access_permitted(u64 pte, bool write, bool execute)
--
2.26.2
^ permalink raw reply related
* Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.8-6 tag
From: pr-tracker-bot @ 2020-07-11 18:25 UTC (permalink / raw)
To: Michael Ellerman; +Cc: linuxppc-dev, Linus Torvalds, linux-kernel, npiggin
In-Reply-To: <87v9iuuv7n.fsf@mpe.ellerman.id.au>
The pull request you sent on Sat, 11 Jul 2020 21:50:20 +1000:
> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-5.8-6
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/997c4431f04d8f0d6063bac3bcdceba8d939b879
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker
^ permalink raw reply
* [GIT PULL] Please pull powerpc/linux.git powerpc-5.8-6 tag
From: Michael Ellerman @ 2020-07-11 11:50 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linuxppc-dev, linux-kernel, npiggin
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Hi Linus,
Please pull another powerpc fix for 5.8:
The following changes since commit 19ab500edb5d6020010caba48ce3b4ce4182ab63:
powerpc/mm/pkeys: Make pkey access check work on execute_only_key (2020-06-29 16:17:02 +1000)
are available in the git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-5.8-6
for you to fetch changes up to 4557ac6b344b8cdf948ff8b007e8e1de34832f2e:
powerpc/64s/exception: Fix 0x1500 interrupt handler crash (2020-07-08 20:41:06 +1000)
- ------------------------------------------------------------------
powerpc fixes for 5.8 #6
One fix for a crash/soft lockup on Power8, caused by the exception rework we did
in v5.7.
Thanks to:
Paul Menzel, Nicholas Piggin.
- ------------------------------------------------------------------
Nicholas Piggin (1):
powerpc/64s/exception: Fix 0x1500 interrupt handler crash
arch/powerpc/kernel/exceptions-64s.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCAAdFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl8JpzUACgkQUevqPMjh
pYD5Zg/+JSMK0phSPCEpBusdlwIeuP5JwQzMCPZgiz7wGsOl6OMF/pMekvQmYzsD
/9VKTWejulkn1Q1Gz3dhEEaVRJhjErPN0xb3tdrXmzyK2s67kwbkd+PhkmXyzRJ/
ByEHmg8K3kKhB3FEC/BimNzcjAhVy2bIqBR1+4gTjF/PDc9heugv9iOpnw2uvhGu
/DoYyQScWnC01PgOd8/dFDykELj98s/p4hKNx6JnteGi5cjrWClqJGJYCBc4xhs/
rPSA5ybUzuN+CbMZbvDi7sJPw+w4AYIC5kVYsZqRuJVGYGh0y4qcxSYK4BL0FDO2
UWdCLUFOkZqHtGhOUMn886YQnhPFukxVOVfLCIlkWcYeXK3HwQld8LrfzHUoMjBk
ao06xwccNyzGlRUn0lYKuRFRGyEtWXNDqLQZ9L4Klowwe9BG5HdYRzrNxLsyi7p0
yom6hevA5w2Sw9rZJrtImWevrsIp7ZexOcVBdfCGwCt9WlWHvT4C5iT0kMFGBh50
Tx/nKAaBWVN9p8OLX4qsmUlRIeAGQdV18HiAaLo5YubUSJ9Ne5ZGcTTZ42VwxvaM
1VCMUfexdr/oljy1BUP/ZTldcj8dF6saaWDb+yWWh8sUmKdsduXJP/5640au9FxS
EBofwuyiU0u8aoV9oH4km+Tqp2WtFhMtrgTNNgHtPix7C8ck2Y4=
=u9rR
-----END PGP SIGNATURE-----
^ permalink raw reply
* Re: [PATCH 11/20] Documentation: leds/ledtrig-transient: eliminate duplicated word
From: Pavel Machek @ 2020-07-11 9:35 UTC (permalink / raw)
To: Randy Dunlap
Cc: kvm, linux-doc, David Airlie, kgdb-bugreport, linux-fpga,
Liviu Dudau, dri-devel, Douglas Anderson, Paul Cercueil, keyrings,
Paul Mackerras, linux-i2c, Srinivas Pandruvada, Mihail Atanassov,
linux-leds, linux-s390, Daniel Thompson, linux-scsi,
Jonathan Corbet, Masahiro Yamada, Matthew Wilcox, Halil Pasic,
Jarkko Sakkinen, James Wang, linux-input, Mali DP Maintainers,
Derek Kiernan, linux-mips, Dragan Cvetic, Wu Hao, Tony Krowiak,
linux-kbuild, James E.J. Bottomley, Jiri Kosina, Hannes Reinecke,
linux-block, Thomas Bogendoerfer, Jacek Anaszewski, linux-mm,
Dan Williams, Andrew Morton, Mimi Zohar, Jens Axboe, Michal Marek,
Martin K. Petersen, Pierre Morel, linux-kernel, Wolfram Sang,
Daniel Vetter, Jason Wessel, Paolo Bonzini, linux-integrity,
linuxppc-dev, Mike Rapoport, Dan Murphy
In-Reply-To: <20200707180414.10467-12-rdunlap@infradead.org>
[-- Attachment #1: Type: text/plain, Size: 502 bytes --]
On Tue 2020-07-07 11:04:05, Randy Dunlap wrote:
> Drop the doubled word "for".
>
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: linux-doc@vger.kernel.org
> Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
(I expect documentation people take this, not me).
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply
* [v3 4/5] KVM: PPC: Book3S HV: retry page migration before erroring-out H_SVM_PAGE_IN
From: Ram Pai @ 2020-07-11 9:13 UTC (permalink / raw)
To: kvm-ppc, linuxppc-dev
Cc: ldufour, linuxram, cclaudio, bharata, sathnaga, aneesh.kumar,
sukadev, bauerman, david
In-Reply-To: <1594458827-31866-1-git-send-email-linuxram@us.ibm.com>
The page requested for page-in; sometimes, can have transient
references, and hence cannot migrate immediately. Retry a few times
before returning error.
H_SVM_PAGE_IN interface is enhanced to return H_BUSY if the page is
not in a migratable state.
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Claudio Carvalho <cclaudio@linux.ibm.com>
Cc: kvm-ppc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
Documentation/powerpc/ultravisor.rst | 1 +
arch/powerpc/kvm/book3s_hv_uvmem.c | 54 +++++++++++++++++++++---------------
2 files changed, 33 insertions(+), 22 deletions(-)
diff --git a/Documentation/powerpc/ultravisor.rst b/Documentation/powerpc/ultravisor.rst
index d98fc85..638d1a7 100644
--- a/Documentation/powerpc/ultravisor.rst
+++ b/Documentation/powerpc/ultravisor.rst
@@ -1034,6 +1034,7 @@ Return values
* H_PARAMETER if ``guest_pa`` is invalid.
* H_P2 if ``flags`` is invalid.
* H_P3 if ``order`` of page is invalid.
+ * H_BUSY if ``page`` is not in a state to pagein
Description
~~~~~~~~~~~
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 12ed52a..c9bdef6 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -843,7 +843,7 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
struct vm_area_struct *vma;
int srcu_idx;
unsigned long gfn = gpa >> page_shift;
- int ret;
+ int ret, repeat_count = REPEAT_COUNT;
if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
return H_UNSUPPORTED;
@@ -857,34 +857,44 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
if (flags & H_PAGE_IN_SHARED)
return kvmppc_share_page(kvm, gpa, page_shift);
- ret = H_PARAMETER;
srcu_idx = srcu_read_lock(&kvm->srcu);
- down_write(&kvm->mm->mmap_sem);
- start = gfn_to_hva(kvm, gfn);
- if (kvm_is_error_hva(start))
- goto out;
-
- mutex_lock(&kvm->arch.uvmem_lock);
/* Fail the page-in request of an already paged-in page */
- if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, NULL))
- goto out_unlock;
+ mutex_lock(&kvm->arch.uvmem_lock);
+ ret = kvmppc_gfn_is_uvmem_pfn(gfn, kvm, NULL);
+ mutex_unlock(&kvm->arch.uvmem_lock);
+ if (ret) {
+ srcu_read_unlock(&kvm->srcu, srcu_idx);
+ return H_PARAMETER;
+ }
- end = start + (1UL << page_shift);
- vma = find_vma_intersection(kvm->mm, start, end);
- if (!vma || vma->vm_start > start || vma->vm_end < end)
- goto out_unlock;
+ do {
+ ret = H_PARAMETER;
+ down_write(&kvm->mm->mmap_sem);
- if (kvmppc_svm_migrate_page(vma, start, end, gpa, kvm, page_shift,
- true))
- goto out_unlock;
+ start = gfn_to_hva(kvm, gfn);
+ if (kvm_is_error_hva(start)) {
+ up_write(&kvm->mm->mmap_sem);
+ break;
+ }
- ret = H_SUCCESS;
+ end = start + (1UL << page_shift);
+ vma = find_vma_intersection(kvm->mm, start, end);
+ if (!vma || vma->vm_start > start || vma->vm_end < end) {
+ up_write(&kvm->mm->mmap_sem);
+ break;
+ }
+
+ mutex_lock(&kvm->arch.uvmem_lock);
+ ret = kvmppc_svm_migrate_page(vma, start, end, gpa, kvm, page_shift, true);
+ mutex_unlock(&kvm->arch.uvmem_lock);
+
+ up_write(&kvm->mm->mmap_sem);
+ } while (ret == -2 && repeat_count--);
+
+ if (ret == -2)
+ ret = H_BUSY;
-out_unlock:
- mutex_unlock(&kvm->arch.uvmem_lock);
-out:
- up_write(&kvm->mm->mmap_sem);
srcu_read_unlock(&kvm->srcu, srcu_idx);
return ret;
}
--
1.8.3.1
^ permalink raw reply related
* [v3 0/5] Migrate non-migrated pages of a SVM.
From: Ram Pai @ 2020-07-11 9:13 UTC (permalink / raw)
To: kvm-ppc, linuxppc-dev
Cc: ldufour, linuxram, cclaudio, bharata, sathnaga, aneesh.kumar,
sukadev, bauerman, david
The time taken to switch a VM to Secure-VM, increases by the size of the VM. A
100GB VM takes about 7minutes. This is unacceptable. This linear increase is
caused by a suboptimal behavior by the Ultravisor and the Hypervisor. The
Ultravisor unnecessarily migrates all the GFN of the VM from normal-memory to
secure-memory. It has to just migrate the necessary and sufficient GFNs.
However when the optimization is incorporated in the Ultravisor, the Hypervisor
starts misbehaving. The Hypervisor has a inbuilt assumption that the Ultravisor
will explicitly request to migrate, each and every GFN of the VM. If only
necessary and sufficient GFNs are requested for migration, the Hypervisor
continues to manage the remaining GFNs as normal GFNs. This leads of memory
corruption, manifested consistently when the SVM reboots.
The same is true, when a memory slot is hotplugged into a SVM. The Hypervisor
expects the ultravisor to request migration of all GFNs to secure-GFN. But the
hypervisor cannot handle any H_SVM_PAGE_IN requests from the Ultravisor, done
in the context of UV_REGISTER_MEM_SLOT ucall. This problem manifests as random
errors in the SVM, when a memory-slot is hotplugged.
This patch series automatically migrates the non-migrated pages of a SVM,
and thus solves the problem.
Testing: Passed rigorous testing using various sized SVMs.
Changelog:
v3: . Optimized the page-migration retry-logic.
. Relax and relinquish the cpu regularly while bulk migrating
the non-migrated pages. This issue was causing soft-lockups.
Fixed it.
. Added a new patch, to retry page-migration a couple of times
before returning H_BUSY in H_SVM_PAGE_IN. This issue was
seen a few times in a 24hour continuous reboot test of the SVMs.
v2: . fixed a bug observed by Laurent. The state of the GFN's associated
with Secure-VMs were not reset during memslot flush.
. Re-organized the code, for easier review.
. Better description of the patch series.
v1: fixed a bug observed by Bharata. Pages that where paged-in and later
paged-out must also be skipped from migration during H_SVM_INIT_DONE.
Laurent Dufour (1):
KVM: PPC: Book3S HV: migrate hot plugged memory
Ram Pai (4):
KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START
KVM: PPC: Book3S HV: track the state GFNs associated with secure VMs
KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in
H_SVM_INIT_DONE
KVM: PPC: Book3S HV: retry page migration before erroring-out
H_SVM_PAGE_IN
Documentation/powerpc/ultravisor.rst | 3 +
arch/powerpc/include/asm/kvm_book3s_uvmem.h | 2 +
arch/powerpc/kvm/book3s_hv.c | 10 +-
arch/powerpc/kvm/book3s_hv_uvmem.c | 487 ++++++++++++++++++++++++----
4 files changed, 429 insertions(+), 73 deletions(-)
--
1.8.3.1
^ permalink raw reply
* [v3 5/5] KVM: PPC: Book3S HV: migrate hot plugged memory
From: Ram Pai @ 2020-07-11 9:13 UTC (permalink / raw)
To: kvm-ppc, linuxppc-dev
Cc: ldufour, linuxram, cclaudio, bharata, sathnaga, aneesh.kumar,
sukadev, bauerman, david
In-Reply-To: <1594458827-31866-1-git-send-email-linuxram@us.ibm.com>
From: Laurent Dufour <ldufour@linux.ibm.com>
When a memory slot is hot plugged to a SVM, PFNs associated with the
GFNs in that slot must be migrated to the secure-PFNs, aka device-PFNs.
kvmppc_uv_migrate_mem_slot() is called to accomplish this. UV_PAGE_IN
ucall is skipped, since the ultravisor does not trust the content of
those pages and hence ignores it.
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[resolved conflicts, and modified the commit log]
---
arch/powerpc/kvm/book3s_hv.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 819f96d..b0d4231 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4523,10 +4523,12 @@ static void kvmppc_core_commit_memory_region_hv(struct kvm *kvm,
case KVM_MR_CREATE:
if (kvmppc_uvmem_slot_init(kvm, new))
return;
- uv_register_mem_slot(kvm->arch.lpid,
- new->base_gfn << PAGE_SHIFT,
- new->npages * PAGE_SIZE,
- 0, new->id);
+ if (uv_register_mem_slot(kvm->arch.lpid,
+ new->base_gfn << PAGE_SHIFT,
+ new->npages * PAGE_SIZE,
+ 0, new->id))
+ return;
+ kvmppc_uv_migrate_mem_slot(kvm, new);
break;
case KVM_MR_DELETE:
uv_unregister_mem_slot(kvm->arch.lpid, old->id);
--
1.8.3.1
^ permalink raw reply related
* [v3 3/5] KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in H_SVM_INIT_DONE
From: Ram Pai @ 2020-07-11 9:13 UTC (permalink / raw)
To: kvm-ppc, linuxppc-dev
Cc: ldufour, linuxram, cclaudio, bharata, sathnaga, aneesh.kumar,
sukadev, bauerman, david
In-Reply-To: <1594458827-31866-1-git-send-email-linuxram@us.ibm.com>
The Ultravisor is expected to explicitly call H_SVM_PAGE_IN for all the pages
of the SVM before calling H_SVM_INIT_DONE. This causes a huge delay in
tranistioning the VM to SVM. The Ultravisor is interested in the pages that
contain the kernel, initrd and other important data structures. The rest of the
pages contain throw-away content. Hence requesting just the necessary and
sufficient pages from the Hypervisor is sufficient.
However if not all pages are requested by the Ultravisor, the Hypervisor
continues to consider the GFNs corresponding to the non-requested pages as
normal GFNs. This can lead to data-corruption and undefined behavior.
Move all the PFNs associated with the SVM's GFNs to secure-PFNs, in
H_SVM_INIT_DONE. Skip the GFNs that are already Paged-in or Shared or
Paged-in followed by a Paged-out.
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Claudio Carvalho <cclaudio@linux.ibm.com>
Cc: kvm-ppc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
Documentation/powerpc/ultravisor.rst | 2 +
arch/powerpc/include/asm/kvm_book3s_uvmem.h | 2 +
arch/powerpc/kvm/book3s_hv_uvmem.c | 166 +++++++++++++++++++++++++---
3 files changed, 156 insertions(+), 14 deletions(-)
diff --git a/Documentation/powerpc/ultravisor.rst b/Documentation/powerpc/ultravisor.rst
index df136c8..d98fc85 100644
--- a/Documentation/powerpc/ultravisor.rst
+++ b/Documentation/powerpc/ultravisor.rst
@@ -933,6 +933,8 @@ Return values
* H_UNSUPPORTED if called from the wrong context (e.g.
from an SVM or before an H_SVM_INIT_START
hypercall).
+ * H_STATE if the hypervisor could not successfully
+ transition the VM to Secure VM.
Description
~~~~~~~~~~~
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 9cb7d8b..f229ab5 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -23,6 +23,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm);
void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
struct kvm *kvm, bool skip_page_out);
+int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
+ const struct kvm_memory_slot *memslot);
#else
static inline int kvmppc_uvmem_init(void)
{
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 6d6c256..12ed52a 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -93,6 +93,7 @@
#include <asm/ultravisor.h>
#include <asm/mman.h>
#include <asm/kvm_ppc.h>
+#include <asm/kvm_book3s_uvmem.h>
static struct dev_pagemap kvmppc_uvmem_pgmap;
static unsigned long *kvmppc_uvmem_bitmap;
@@ -348,6 +349,43 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, struct kvm *kvm,
return false;
}
+/*
+ * starting from *gfn search for the next available GFN that is not yet
+ * transitioned to a secure GFN. return the value of that GFN in *gfn. If a
+ * GFN is found, return true, else return false
+ */
+static bool kvmppc_next_nontransitioned_gfn(const struct kvm_memory_slot *memslot,
+ struct kvm *kvm, unsigned long *gfn)
+{
+ struct kvmppc_uvmem_slot *p;
+ bool ret = false;
+ unsigned long i;
+
+ mutex_lock(&kvm->arch.uvmem_lock);
+
+ list_for_each_entry(p, &kvm->arch.uvmem_pfns, list)
+ if (*gfn >= p->base_pfn && *gfn < p->base_pfn + p->nr_pfns)
+ break;
+ if (!p)
+ goto out;
+ /*
+ * The code below assumes, one to one correspondence between
+ * kvmppc_uvmem_slot and memslot.
+ */
+ for (i = *gfn; i < p->base_pfn + p->nr_pfns; i++) {
+ unsigned long index = i - p->base_pfn;
+
+ if (!(p->pfns[index] & KVMPPC_GFN_FLAG_MASK)) {
+ *gfn = i;
+ ret = true;
+ break;
+ }
+ }
+out:
+ mutex_unlock(&kvm->arch.uvmem_lock);
+ return ret;
+}
+
static int kvmppc_memslot_page_merge(struct kvm *kvm,
struct kvm_memory_slot *memslot, bool merge)
{
@@ -461,12 +499,31 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *memslot;
+ int srcu_idx;
+ long ret = H_SUCCESS;
+
if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
return H_UNSUPPORTED;
+ /* migrate any unmoved normal pfn to device pfns*/
+ srcu_idx = srcu_read_lock(&kvm->srcu);
+ slots = kvm_memslots(kvm);
+ kvm_for_each_memslot(memslot, slots) {
+ ret = kvmppc_uv_migrate_mem_slot(kvm, memslot);
+ if (ret) {
+ ret = H_STATE;
+ goto out;
+ }
+ }
+
kvm->arch.secure_guest |= KVMPPC_SECURE_INIT_DONE;
pr_info("LPID %d went secure\n", kvm->arch.lpid);
- return H_SUCCESS;
+
+out:
+ srcu_read_unlock(&kvm->srcu, srcu_idx);
+ return ret;
}
/*
@@ -587,12 +644,14 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
}
/*
- * Alloc a PFN from private device memory pool and copy page from normal
- * memory to secure memory using UV_PAGE_IN uvcall.
+ * Alloc a PFN from private device memory pool. If @pagein is true,
+ * copy page from normal memory to secure memory using UV_PAGE_IN uvcall.
*/
-static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, unsigned long gpa, struct kvm *kvm,
- unsigned long page_shift)
+static int kvmppc_svm_migrate_page(struct vm_area_struct *vma,
+ unsigned long start,
+ unsigned long end, unsigned long gpa, struct kvm *kvm,
+ unsigned long page_shift,
+ bool pagein)
{
unsigned long src_pfn, dst_pfn = 0;
struct migrate_vma mig;
@@ -613,7 +672,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
return ret;
if (!(*mig.src & MIGRATE_PFN_MIGRATE)) {
- ret = -1;
+ ret = -2;
goto out_finalize;
}
@@ -623,11 +682,16 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
goto out_finalize;
}
- pfn = *mig.src >> MIGRATE_PFN_SHIFT;
- spage = migrate_pfn_to_page(*mig.src);
- if (spage)
- uv_page_in(kvm->arch.lpid, pfn << page_shift, gpa, 0,
- page_shift);
+ if (pagein) {
+ pfn = *mig.src >> MIGRATE_PFN_SHIFT;
+ spage = migrate_pfn_to_page(*mig.src);
+ if (spage) {
+ ret = uv_page_in(kvm->arch.lpid, pfn << page_shift,
+ gpa, 0, page_shift);
+ if (ret)
+ goto out_finalize;
+ }
+ }
*mig.dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
migrate_vma_pages(&mig);
@@ -637,6 +701,77 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
}
/*
+ * return 1, if some page migration failed because of transient error,
+ * while the remaining pages migrated successfully.
+ * The caller can use this as a hint to retry.
+ *
+ * return 0 otherwise. *ret indicates the success status
+ * of this call.
+ */
+static int __kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
+ const struct kvm_memory_slot *memslot, int *ret)
+{
+ unsigned long gfn = memslot->base_gfn;
+ struct vm_area_struct *vma;
+ unsigned long start, end;
+ bool retry = 0;
+
+ *ret = 0;
+ while (kvmppc_next_nontransitioned_gfn(memslot, kvm, &gfn)) {
+
+ down_write(&kvm->mm->mmap_sem);
+ start = gfn_to_hva(kvm, gfn);
+ if (kvm_is_error_hva(start)) {
+ *ret = H_STATE;
+ goto next;
+ }
+
+ end = start + (1UL << PAGE_SHIFT);
+ vma = find_vma_intersection(kvm->mm, start, end);
+ if (!vma || vma->vm_start > start || vma->vm_end < end) {
+ *ret = H_STATE;
+ goto next;
+ }
+
+ mutex_lock(&kvm->arch.uvmem_lock);
+ *ret = kvmppc_svm_migrate_page(vma, start, end,
+ (gfn << PAGE_SHIFT), kvm, PAGE_SHIFT, false);
+ mutex_unlock(&kvm->arch.uvmem_lock);
+
+next:
+ up_write(&kvm->mm->mmap_sem);
+
+ if (*ret == -2) {
+ retry = 1;
+ continue;
+ }
+
+ if (*ret)
+ return 0;
+ }
+ return retry;
+}
+
+#define REPEAT_COUNT 10
+
+int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
+ const struct kvm_memory_slot *memslot)
+{
+ int ret = 0, repeat_count = REPEAT_COUNT;
+
+ /*
+ * try migration of pages in the memslot 'repeat_count' number of
+ * times, provided each time migration fails because of transient
+ * errors only.
+ */
+ while (__kvmppc_uv_migrate_mem_slot(kvm, memslot, &ret) &&
+ repeat_count--)
+ ;
+
+ return ret;
+}
+
+/*
* Shares the page with HV, thus making it a normal page.
*
* - If the page is already secure, then provision a new page and share
@@ -740,8 +875,11 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
if (!vma || vma->vm_start > start || vma->vm_end < end)
goto out_unlock;
- if (!kvmppc_svm_page_in(vma, start, end, gpa, kvm, page_shift))
- ret = H_SUCCESS;
+ if (kvmppc_svm_migrate_page(vma, start, end, gpa, kvm, page_shift,
+ true))
+ goto out_unlock;
+
+ ret = H_SUCCESS;
out_unlock:
mutex_unlock(&kvm->arch.uvmem_lock);
--
1.8.3.1
^ permalink raw reply related
* [v3 1/5] KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START
From: Ram Pai @ 2020-07-11 9:13 UTC (permalink / raw)
To: kvm-ppc, linuxppc-dev
Cc: ldufour, linuxram, cclaudio, bharata, sathnaga, aneesh.kumar,
sukadev, bauerman, david
In-Reply-To: <1594458827-31866-1-git-send-email-linuxram@us.ibm.com>
Merging of pages associated with each memslot of a SVM is
disabled the page is migrated in H_SVM_PAGE_IN handler.
This operation should have been done much earlier; the moment the VM
is initiated for secure-transition. Delaying this operation, increases
the probability for those pages to acquire new references , making it
impossible to migrate those pages in H_SVM_PAGE_IN handler.
Disable page-migration in H_SVM_INIT_START handling.
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 96 +++++++++++++++++++++++++++++---------
1 file changed, 74 insertions(+), 22 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 3d987b1..bfc3841 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -211,6 +211,65 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, struct kvm *kvm,
return false;
}
+static int kvmppc_memslot_page_merge(struct kvm *kvm,
+ struct kvm_memory_slot *memslot, bool merge)
+{
+ unsigned long gfn = memslot->base_gfn;
+ unsigned long end, start = gfn_to_hva(kvm, gfn);
+ int ret = 0;
+ struct vm_area_struct *vma;
+ int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
+
+ if (kvm_is_error_hva(start))
+ return H_STATE;
+
+ end = start + (memslot->npages << PAGE_SHIFT);
+
+ down_write(&kvm->mm->mmap_sem);
+ do {
+ vma = find_vma_intersection(kvm->mm, start, end);
+ if (!vma) {
+ ret = H_STATE;
+ break;
+ }
+ ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
+ merge_flag, &vma->vm_flags);
+ if (ret) {
+ ret = H_STATE;
+ break;
+ }
+ start = vma->vm_end + 1;
+ } while (end > vma->vm_end);
+
+ up_write(&kvm->mm->mmap_sem);
+ return ret;
+}
+
+static int __kvmppc_page_merge(struct kvm *kvm, bool merge)
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *memslot;
+ int ret = 0;
+
+ slots = kvm_memslots(kvm);
+ kvm_for_each_memslot(memslot, slots) {
+ ret = kvmppc_memslot_page_merge(kvm, memslot, merge);
+ if (ret)
+ break;
+ }
+ return ret;
+}
+
+static inline int kvmppc_disable_page_merge(struct kvm *kvm)
+{
+ return __kvmppc_page_merge(kvm, false);
+}
+
+static inline int kvmppc_enable_page_merge(struct kvm *kvm)
+{
+ return __kvmppc_page_merge(kvm, true);
+}
+
unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
{
struct kvm_memslots *slots;
@@ -232,11 +291,18 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
return H_AUTHORITY;
srcu_idx = srcu_read_lock(&kvm->srcu);
+
+ /* disable page-merging for all memslot */
+ ret = kvmppc_disable_page_merge(kvm);
+ if (ret)
+ goto out;
+
+ /* register the memslot */
slots = kvm_memslots(kvm);
kvm_for_each_memslot(memslot, slots) {
if (kvmppc_uvmem_slot_init(kvm, memslot)) {
ret = H_PARAMETER;
- goto out;
+ break;
}
ret = uv_register_mem_slot(kvm->arch.lpid,
memslot->base_gfn << PAGE_SHIFT,
@@ -245,9 +311,12 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
if (ret < 0) {
kvmppc_uvmem_slot_free(kvm, memslot);
ret = H_PARAMETER;
- goto out;
+ break;
}
}
+
+ if (ret)
+ kvmppc_enable_page_merge(kvm);
out:
srcu_read_unlock(&kvm->srcu, srcu_idx);
return ret;
@@ -384,7 +453,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
*/
static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
unsigned long end, unsigned long gpa, struct kvm *kvm,
- unsigned long page_shift, bool *downgrade)
+ unsigned long page_shift)
{
unsigned long src_pfn, dst_pfn = 0;
struct migrate_vma mig;
@@ -400,18 +469,6 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
mig.src = &src_pfn;
mig.dst = &dst_pfn;
- /*
- * We come here with mmap_sem write lock held just for
- * ksm_madvise(), otherwise we only need read mmap_sem.
- * Hence downgrade to read lock once ksm_madvise() is done.
- */
- ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
- MADV_UNMERGEABLE, &vma->vm_flags);
- downgrade_write(&kvm->mm->mmap_sem);
- *downgrade = true;
- if (ret)
- return ret;
-
ret = migrate_vma_setup(&mig);
if (ret)
return ret;
@@ -503,7 +560,6 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
unsigned long flags,
unsigned long page_shift)
{
- bool downgrade = false;
unsigned long start, end;
struct vm_area_struct *vma;
int srcu_idx;
@@ -540,16 +596,12 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
if (!vma || vma->vm_start > start || vma->vm_end < end)
goto out_unlock;
- if (!kvmppc_svm_page_in(vma, start, end, gpa, kvm, page_shift,
- &downgrade))
+ if (!kvmppc_svm_page_in(vma, start, end, gpa, kvm, page_shift))
ret = H_SUCCESS;
out_unlock:
mutex_unlock(&kvm->arch.uvmem_lock);
out:
- if (downgrade)
- up_read(&kvm->mm->mmap_sem);
- else
- up_write(&kvm->mm->mmap_sem);
+ up_write(&kvm->mm->mmap_sem);
srcu_read_unlock(&kvm->srcu, srcu_idx);
return ret;
}
--
1.8.3.1
^ permalink raw reply related
* [v3 2/5] KVM: PPC: Book3S HV: track the state of GFNs associated with secure VMs
From: Ram Pai @ 2020-07-11 9:13 UTC (permalink / raw)
To: kvm-ppc, linuxppc-dev
Cc: ldufour, linuxram, cclaudio, bharata, sathnaga, aneesh.kumar,
sukadev, bauerman, david
In-Reply-To: <1594458827-31866-1-git-send-email-linuxram@us.ibm.com>
During the life of SVM, its GFNs transition through normal, secure and
shared states. Since the kernel does not track GFNs that are shared, it
is not possible to disambiguate a shared GFN from a GFN whose PFN has
not yet been migrated to a secure-PFN. Also it is not possible to
disambiguate a secure-GFN from a GFN whose GFN has been pagedout from
the ultravisor.
The ability to identify the state of a GFN is needed to skip migration
of its PFN to secure-PFN during ESM transition.
The code is re-organized to track the states of a GFN as explained
below.
************************************************************************
1. States of a GFN
---------------
The GFN can be in one of the following states.
(a) Secure - The GFN is secure. The GFN is associated with
a Secure VM, the contents of the GFN is not accessible
to the Hypervisor. This GFN can be backed by a secure-PFN,
or can be backed by a normal-PFN with contents encrypted.
The former is true when the GFN is paged-in into the
ultravisor. The latter is true when the GFN is paged-out
of the ultravisor.
(b) Shared - The GFN is shared. The GFN is associated with a
a secure VM. The contents of the GFN is accessible to
Hypervisor. This GFN is backed by a normal-PFN and its
content is un-encrypted.
(c) Normal - The GFN is a normal. The GFN is associated with
a normal VM. The contents of the GFN is accesible to
the Hypervisor. Its content is never encrypted.
2. States of a VM.
---------------
(a) Normal VM: A VM whose contents are always accessible to
the hypervisor. All its GFNs are normal-GFNs.
(b) Secure VM: A VM whose contents are not accessible to the
hypervisor without the VM's consent. Its GFNs are
either Shared-GFN or Secure-GFNs.
(c) Transient VM: A Normal VM that is transitioning to secure VM.
The transition starts on successful return of
H_SVM_INIT_START, and ends on successful return
of H_SVM_INIT_DONE. This transient VM, can have GFNs
in any of the three states; i.e Secure-GFN, Shared-GFN,
and Normal-GFN. The VM never executes in this state
in supervisor-mode.
3. Memory slot State.
------------------
The state of a memory slot mirrors the state of the
VM the memory slot is associated with.
4. VM State transition.
--------------------
A VM always starts in Normal Mode.
H_SVM_INIT_START moves the VM into transient state. During this
time the Ultravisor may request some of its GFNs to be shared or
secured. So its GFNs can be in one of the three GFN states.
H_SVM_INIT_DONE moves the VM entirely from transient state to
secure-state. At this point any left-over normal-GFNs are
transitioned to Secure-GFN.
H_SVM_INIT_ABORT moves the transient VM back to normal VM.
All its GFNs are moved to Normal-GFNs.
UV_TERMINATE transitions the secure-VM back to normal-VM. All
the secure-GFN and shared-GFNs are tranistioned to normal-GFN
Note: The contents of the normal-GFN is undefined at this point.
5. GFN state implementation:
-------------------------
Secure GFN is associated with a secure-PFN; also called uvmem_pfn,
when the GFN is paged-in. Its pfn[] has KVMPPC_GFN_UVMEM_PFN flag
set, and contains the value of the secure-PFN.
It is associated with a normal-PFN; also called mem_pfn, when
the GFN is pagedout. Its pfn[] has KVMPPC_GFN_MEM_PFN flag set.
The value of the normal-PFN is not tracked.
Shared GFN is associated with a normal-PFN. Its pfn[] has
KVMPPC_UVMEM_SHARED_PFN flag set. The value of the normal-PFN
is not tracked.
Normal GFN is associated with normal-PFN. Its pfn[] has
no flag set. The value of the normal-PFN is not tracked.
6. Life cycle of a GFN
--------------------
--------------------------------------------------------------
| | Share | Unshare | SVM |H_SVM_INIT_DONE|
| |operation |operation | abort/ | |
| | | | terminate | |
-------------------------------------------------------------
| | | | | |
| Secure | Shared | Secure |Normal |Secure |
| | | | | |
| Shared | Shared | Secure |Normal |Shared |
| | | | | |
| Normal | Shared | Secure |Normal |Secure |
--------------------------------------------------------------
7. Life cycle of a VM
--------------------
--------------------------------------------------------------------
| | start | H_SVM_ |H_SVM_ |H_SVM_ |UV_SVM_ |
| | VM |INIT_START|INIT_DONE|INIT_ABORT |TERMINATE |
| | | | | | |
--------- ----------------------------------------------------------
| | | | | | |
| Normal | Normal | Transient|Error |Error |Normal |
| | | | | | |
| Secure | Error | Error |Error |Error |Normal |
| | | | | | |
|Transient| N/A | Error |Secure |Normal |Normal |
--------------------------------------------------------------------
************************************************************************
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Claudio Carvalho <cclaudio@linux.ibm.com>
Cc: kvm-ppc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 187 +++++++++++++++++++++++++++++++++----
1 file changed, 168 insertions(+), 19 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index bfc3841..6d6c256 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -98,7 +98,127 @@
static unsigned long *kvmppc_uvmem_bitmap;
static DEFINE_SPINLOCK(kvmppc_uvmem_bitmap_lock);
-#define KVMPPC_UVMEM_PFN (1UL << 63)
+/*
+ * States of a GFN
+ * ---------------
+ * The GFN can be in one of the following states.
+ *
+ * (a) Secure - The GFN is secure. The GFN is associated with
+ * a Secure VM, the contents of the GFN is not accessible
+ * to the Hypervisor. This GFN can be backed by a secure-PFN,
+ * or can be backed by a normal-PFN with contents encrypted.
+ * The former is true when the GFN is paged-in into the
+ * ultravisor. The latter is true when the GFN is paged-out
+ * of the ultravisor.
+ *
+ * (b) Shared - The GFN is shared. The GFN is associated with a
+ * a secure VM. The contents of the GFN is accessible to
+ * Hypervisor. This GFN is backed by a normal-PFN and its
+ * content is un-encrypted.
+ *
+ * (c) Normal - The GFN is a normal. The GFN is associated with
+ * a normal VM. The contents of the GFN is accesible to
+ * the Hypervisor. Its content is never encrypted.
+ *
+ * States of a VM.
+ * ---------------
+ *
+ * Normal VM: A VM whose contents are always accessible to
+ * the hypervisor. All its GFNs are normal-GFNs.
+ *
+ * Secure VM: A VM whose contents are not accessible to the
+ * hypervisor without the VM's consent. Its GFNs are
+ * either Shared-GFN or Secure-GFNs.
+ *
+ * Transient VM: A Normal VM that is transitioning to secure VM.
+ * The transition starts on successful return of
+ * H_SVM_INIT_START, and ends on successful return
+ * of H_SVM_INIT_DONE. This transient VM, can have GFNs
+ * in any of the three states; i.e Secure-GFN, Shared-GFN,
+ * and Normal-GFN. The VM never executes in this state
+ * in supervisor-mode.
+ *
+ * Memory slot State.
+ * -----------------------------
+ * The state of a memory slot mirrors the state of the
+ * VM the memory slot is associated with.
+ *
+ * VM State transition.
+ * --------------------
+ *
+ * A VM always starts in Normal Mode.
+ *
+ * H_SVM_INIT_START moves the VM into transient state. During this
+ * time the Ultravisor may request some of its GFNs to be shared or
+ * secured. So its GFNs can be in one of the three GFN states.
+ *
+ * H_SVM_INIT_DONE moves the VM entirely from transient state to
+ * secure-state. At this point any left-over normal-GFNs are
+ * transitioned to Secure-GFN.
+ *
+ * H_SVM_INIT_ABORT moves the transient VM back to normal VM.
+ * All its GFNs are moved to Normal-GFNs.
+ *
+ * UV_TERMINATE transitions the secure-VM back to normal-VM. All
+ * the secure-GFN and shared-GFNs are tranistioned to normal-GFN
+ * Note: The contents of the normal-GFN is undefined at this point.
+ *
+ * GFN state implementation:
+ * -------------------------
+ *
+ * Secure GFN is associated with a secure-PFN; also called uvmem_pfn,
+ * when the GFN is paged-in. Its pfn[] has KVMPPC_GFN_UVMEM_PFN flag
+ * set, and contains the value of the secure-PFN.
+ * It is associated with a normal-PFN; also called mem_pfn, when
+ * the GFN is pagedout. Its pfn[] has KVMPPC_GFN_MEM_PFN flag set.
+ * The value of the normal-PFN is not tracked.
+ *
+ * Shared GFN is associated with a normal-PFN. Its pfn[] has
+ * KVMPPC_UVMEM_SHARED_PFN flag set. The value of the normal-PFN
+ * is not tracked.
+ *
+ * Normal GFN is associated with normal-PFN. Its pfn[] has
+ * no flag set. The value of the normal-PFN is not tracked.
+ *
+ * Life cycle of a GFN
+ * --------------------
+ *
+ * --------------------------------------------------------------
+ * | | Share | Unshare | SVM |H_SVM_INIT_DONE|
+ * | |operation |operation | abort/ | |
+ * | | | | terminate | |
+ * -------------------------------------------------------------
+ * | | | | | |
+ * | Secure | Shared | Secure |Normal |Secure |
+ * | | | | | |
+ * | Shared | Shared | Secure |Normal |Shared |
+ * | | | | | |
+ * | Normal | Shared | Secure |Normal |Secure |
+ * --------------------------------------------------------------
+ *
+ * Life cycle of a VM
+ * --------------------
+ *
+ * --------------------------------------------------------------------
+ * | | start | H_SVM_ |H_SVM_ |H_SVM_ |UV_SVM_ |
+ * | | VM |INIT_START|INIT_DONE|INIT_ABORT |TERMINATE |
+ * | | | | | | |
+ * --------- ----------------------------------------------------------
+ * | | | | | | |
+ * | Normal | Normal | Transient|Error |Error |Normal |
+ * | | | | | | |
+ * | Secure | Error | Error |Error |Error |Normal |
+ * | | | | | | |
+ * |Transient| N/A | Error |Secure |Normal |Normal |
+ * --------------------------------------------------------------------
+ */
+
+#define KVMPPC_GFN_UVMEM_PFN (1UL << 63)
+#define KVMPPC_GFN_MEM_PFN (1UL << 62)
+#define KVMPPC_GFN_SHARED (1UL << 61)
+#define KVMPPC_GFN_SECURE (KVMPPC_GFN_UVMEM_PFN | KVMPPC_GFN_MEM_PFN)
+#define KVMPPC_GFN_FLAG_MASK (KVMPPC_GFN_SECURE | KVMPPC_GFN_SHARED)
+#define KVMPPC_GFN_PFN_MASK (~KVMPPC_GFN_FLAG_MASK)
struct kvmppc_uvmem_slot {
struct list_head list;
@@ -106,11 +226,11 @@ struct kvmppc_uvmem_slot {
unsigned long base_pfn;
unsigned long *pfns;
};
-
struct kvmppc_uvmem_page_pvt {
struct kvm *kvm;
unsigned long gpa;
bool skip_page_out;
+ bool remove_gfn;
};
bool kvmppc_uvmem_available(void)
@@ -163,8 +283,8 @@ void kvmppc_uvmem_slot_free(struct kvm *kvm, const struct kvm_memory_slot *slot)
mutex_unlock(&kvm->arch.uvmem_lock);
}
-static void kvmppc_uvmem_pfn_insert(unsigned long gfn, unsigned long uvmem_pfn,
- struct kvm *kvm)
+static void kvmppc_mark_gfn(unsigned long gfn, struct kvm *kvm,
+ unsigned long flag, unsigned long uvmem_pfn)
{
struct kvmppc_uvmem_slot *p;
@@ -172,24 +292,41 @@ static void kvmppc_uvmem_pfn_insert(unsigned long gfn, unsigned long uvmem_pfn,
if (gfn >= p->base_pfn && gfn < p->base_pfn + p->nr_pfns) {
unsigned long index = gfn - p->base_pfn;
- p->pfns[index] = uvmem_pfn | KVMPPC_UVMEM_PFN;
+ if (flag == KVMPPC_GFN_UVMEM_PFN)
+ p->pfns[index] = uvmem_pfn | flag;
+ else
+ p->pfns[index] = flag;
return;
}
}
}
-static void kvmppc_uvmem_pfn_remove(unsigned long gfn, struct kvm *kvm)
+/* mark the GFN as secure-GFN associated with @uvmem pfn device-PFN. */
+static void kvmppc_gfn_secure_uvmem_pfn(unsigned long gfn,
+ unsigned long uvmem_pfn, struct kvm *kvm)
{
- struct kvmppc_uvmem_slot *p;
+ kvmppc_mark_gfn(gfn, kvm, KVMPPC_GFN_UVMEM_PFN, uvmem_pfn);
+}
- list_for_each_entry(p, &kvm->arch.uvmem_pfns, list) {
- if (gfn >= p->base_pfn && gfn < p->base_pfn + p->nr_pfns) {
- p->pfns[gfn - p->base_pfn] = 0;
- return;
- }
- }
+/* mark the GFN as secure-GFN associated with a memory-PFN. */
+static void kvmppc_gfn_secure_mem_pfn(unsigned long gfn, struct kvm *kvm)
+{
+ kvmppc_mark_gfn(gfn, kvm, KVMPPC_GFN_MEM_PFN, 0);
+}
+
+/* mark the GFN as a shared GFN. */
+static void kvmppc_gfn_shared(unsigned long gfn, struct kvm *kvm)
+{
+ kvmppc_mark_gfn(gfn, kvm, KVMPPC_GFN_SHARED, 0);
+}
+
+/* mark the GFN as a non-existent GFN. */
+static void kvmppc_gfn_remove(unsigned long gfn, struct kvm *kvm)
+{
+ kvmppc_mark_gfn(gfn, kvm, 0, 0);
}
+/* return true, if the GFN is a secure-GFN backed by a secure-PFN */
static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, struct kvm *kvm,
unsigned long *uvmem_pfn)
{
@@ -199,10 +336,10 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, struct kvm *kvm,
if (gfn >= p->base_pfn && gfn < p->base_pfn + p->nr_pfns) {
unsigned long index = gfn - p->base_pfn;
- if (p->pfns[index] & KVMPPC_UVMEM_PFN) {
+ if (p->pfns[index] & KVMPPC_GFN_UVMEM_PFN) {
if (uvmem_pfn)
*uvmem_pfn = p->pfns[index] &
- ~KVMPPC_UVMEM_PFN;
+ KVMPPC_GFN_PFN_MASK;
return true;
} else
return false;
@@ -353,6 +490,7 @@ void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
mutex_lock(&kvm->arch.uvmem_lock);
if (!kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+ kvmppc_gfn_remove(gfn, kvm);
mutex_unlock(&kvm->arch.uvmem_lock);
continue;
}
@@ -360,6 +498,7 @@ void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
uvmem_page = pfn_to_page(uvmem_pfn);
pvt = uvmem_page->zone_device_data;
pvt->skip_page_out = skip_page_out;
+ pvt->remove_gfn = true;
mutex_unlock(&kvm->arch.uvmem_lock);
pfn = gfn_to_pfn(kvm, gfn);
@@ -429,7 +568,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
goto out_clear;
uvmem_pfn = bit + pfn_first;
- kvmppc_uvmem_pfn_insert(gpa >> PAGE_SHIFT, uvmem_pfn, kvm);
+ kvmppc_gfn_secure_uvmem_pfn(gpa >> PAGE_SHIFT, uvmem_pfn, kvm);
pvt->gpa = gpa;
pvt->kvm = kvm;
@@ -524,6 +663,7 @@ static unsigned long kvmppc_share_page(struct kvm *kvm, unsigned long gpa,
uvmem_page = pfn_to_page(uvmem_pfn);
pvt = uvmem_page->zone_device_data;
pvt->skip_page_out = true;
+ pvt->remove_gfn = false;
}
retry:
@@ -537,12 +677,16 @@ static unsigned long kvmppc_share_page(struct kvm *kvm, unsigned long gpa,
uvmem_page = pfn_to_page(uvmem_pfn);
pvt = uvmem_page->zone_device_data;
pvt->skip_page_out = true;
+ pvt->remove_gfn = false;
kvm_release_pfn_clean(pfn);
goto retry;
}
- if (!uv_page_in(kvm->arch.lpid, pfn << page_shift, gpa, 0, page_shift))
+ if (!uv_page_in(kvm->arch.lpid, pfn << page_shift, gpa, 0,
+ page_shift)) {
+ kvmppc_gfn_shared(gfn, kvm);
ret = H_SUCCESS;
+ }
kvm_release_pfn_clean(pfn);
mutex_unlock(&kvm->arch.uvmem_lock);
out:
@@ -598,6 +742,7 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
if (!kvmppc_svm_page_in(vma, start, end, gpa, kvm, page_shift))
ret = H_SUCCESS;
+
out_unlock:
mutex_unlock(&kvm->arch.uvmem_lock);
out:
@@ -706,7 +851,8 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct vm_fault *vmf)
/*
* Release the device PFN back to the pool
*
- * Gets called when secure page becomes a normal page during H_SVM_PAGE_OUT.
+ * Gets called when secure GFN tranistions from a secure-PFN
+ * to a normal PFN during H_SVM_PAGE_OUT.
* Gets called with kvm->arch.uvmem_lock held.
*/
static void kvmppc_uvmem_page_free(struct page *page)
@@ -721,7 +867,10 @@ static void kvmppc_uvmem_page_free(struct page *page)
pvt = page->zone_device_data;
page->zone_device_data = NULL;
- kvmppc_uvmem_pfn_remove(pvt->gpa >> PAGE_SHIFT, pvt->kvm);
+ if (pvt->remove_gfn)
+ kvmppc_gfn_remove(pvt->gpa >> PAGE_SHIFT, pvt->kvm);
+ else
+ kvmppc_gfn_secure_mem_pfn(pvt->gpa >> PAGE_SHIFT, pvt->kvm);
kfree(pvt);
}
--
1.8.3.1
^ permalink raw reply related
* [PATCH v2] powerpc/pseries: detect secure and trusted boot state of the system.
From: Nayna Jain @ 2020-07-11 2:25 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Nayna Jain, linux-kernel, Mimi Zohar
The device-tree property to check secure and trusted boot state is
different for guests(pseries) compared to baremetal(powernv).
This patch updates the existing is_ppc_secureboot_enabled() and
is_ppc_trustedboot_enabled() function to add support for pseries.
Signed-off-by: Nayna Jain <nayna@linux.ibm.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>
---
v2:
* included Michael Ellerman's feedback.
* added Daniel Axtens's Reviewed-by.
arch/powerpc/kernel/secure_boot.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/secure_boot.c b/arch/powerpc/kernel/secure_boot.c
index 4b982324d368..efb325cbd42f 100644
--- a/arch/powerpc/kernel/secure_boot.c
+++ b/arch/powerpc/kernel/secure_boot.c
@@ -6,6 +6,7 @@
#include <linux/types.h>
#include <linux/of.h>
#include <asm/secure_boot.h>
+#include <asm/machdep.h>
static struct device_node *get_ppc_fw_sb_node(void)
{
@@ -23,12 +24,21 @@ bool is_ppc_secureboot_enabled(void)
{
struct device_node *node;
bool enabled = false;
+ u32 secureboot;
node = get_ppc_fw_sb_node();
enabled = of_property_read_bool(node, "os-secureboot-enforcing");
-
of_node_put(node);
+ if (enabled)
+ goto out;
+
+ if (!of_property_read_u32(of_root, "ibm,secure-boot", &secureboot)) {
+ if (secureboot)
+ enabled = (secureboot > 1) ? true : false;
+ }
+
+out:
pr_info("Secure boot mode %s\n", enabled ? "enabled" : "disabled");
return enabled;
@@ -38,12 +48,21 @@ bool is_ppc_trustedboot_enabled(void)
{
struct device_node *node;
bool enabled = false;
+ u32 trustedboot;
node = get_ppc_fw_sb_node();
enabled = of_property_read_bool(node, "trusted-enabled");
-
of_node_put(node);
+ if (enabled)
+ goto out;
+
+ if (!of_property_read_u32(of_root, "ibm,trusted-boot", &trustedboot)) {
+ if (trustedboot)
+ enabled = (trustedboot > 0) ? true : false;
+ }
+
+out:
pr_info("Trusted boot mode %s\n", enabled ? "enabled" : "disabled");
return enabled;
--
2.26.2
^ permalink raw reply related
* [PATCH V2 2/2] selftests/powerpc: Use proper error code to check fault address
From: Haren Myneni @ 2020-07-10 23:49 UTC (permalink / raw)
To: mpe; +Cc: tulioqm, linuxppc-dev, abali, rzinsly
In-Reply-To: <019fd53e7538c6f8f332d175df74b1815ef5aa8c.camel@linux.ibm.com>
ERR_NX_TRANSLATION(CSB.CC=5) is for internal to VAS for fault handling
and should not used by OS. ERR_NX_AT_FAULT(CSB.CC=250) is the proper
error code should be reported by OS when NX encounters address
translation failure.
This patch uses CC=250 to determine the fault address when the request
is not successful.
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
---
tools/testing/selftests/powerpc/nx-gzip/gunz_test.c | 4 ++--
tools/testing/selftests/powerpc/nx-gzip/gzfht_test.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/powerpc/nx-gzip/gunz_test.c b/tools/testing/selftests/powerpc/nx-gzip/gunz_test.c
index 6ee0fde..7c23d3d 100644
--- a/tools/testing/selftests/powerpc/nx-gzip/gunz_test.c
+++ b/tools/testing/selftests/powerpc/nx-gzip/gunz_test.c
@@ -698,13 +698,13 @@ int decompress_file(int argc, char **argv, void *devhandle)
switch (cc) {
- case ERR_NX_TRANSLATION:
+ case ERR_NX_AT_FAULT:
/* We touched the pages ahead of time. In the most common case
* we shouldn't be here. But may be some pages were paged out.
* Kernel should have placed the faulting address to fsaddr.
*/
- NXPRT(fprintf(stderr, "ERR_NX_TRANSLATION %p\n",
+ NXPRT(fprintf(stderr, "ERR_NX_AT_FAULT %p\n",
(void *)cmdp->crb.csb.fsaddr));
if (pgfault_retries == NX_MAX_FAULTS) {
diff --git a/tools/testing/selftests/powerpc/nx-gzip/gzfht_test.c b/tools/testing/selftests/powerpc/nx-gzip/gzfht_test.c
index 7496a83..02dffb6 100644
--- a/tools/testing/selftests/powerpc/nx-gzip/gzfht_test.c
+++ b/tools/testing/selftests/powerpc/nx-gzip/gzfht_test.c
@@ -306,13 +306,13 @@ int compress_file(int argc, char **argv, void *handle)
lzcounts, cmdp, handle);
if (cc != ERR_NX_OK && cc != ERR_NX_TPBC_GT_SPBC &&
- cc != ERR_NX_TRANSLATION) {
+ cc != ERR_NX_AT_FAULT) {
fprintf(stderr, "nx error: cc= %d\n", cc);
exit(-1);
}
/* Page faults are handled by the user code */
- if (cc == ERR_NX_TRANSLATION) {
+ if (cc == ERR_NX_AT_FAULT) {
NXPRT(fprintf(stderr, "page fault: cc= %d, ", cc));
NXPRT(fprintf(stderr, "try= %d, fsa= %08llx\n",
fault_tries,
--
1.8.3.1
^ permalink raw reply related
* [PATCH V2 1/2] powerpc/vas: Report proper error code for address translation failure
From: Haren Myneni @ 2020-07-10 23:47 UTC (permalink / raw)
To: mpe; +Cc: tulioqm, abali, linuxppc-dev, rzinsly
P9 DD2 NX workbook (Table 4-36) says DMA controller uses CC=5
internally for translation fault handling. NX reserves CC=250 for
OS to notify user space when NX encounters address translation
failure on the request buffer. Not an issue in earlier releases
as NX does not get faults on kernel addresses.
This patch defines CSB_CC_FAULT_ADDRESS(250) and updates CSB.CC with
this proper error code for user space.
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Changes in V2:
- Use CSB_CC_FAULT_ADDRESS instead of CSB_CC_ADDRESS_TRANSLATION
to distinguish from other error codes.
- Add NX workbook reference in the comment.
---
Documentation/powerpc/vas-api.rst | 2 +-
arch/powerpc/include/asm/icswx.h | 4 ++++
arch/powerpc/platforms/powernv/vas-fault.c | 2 +-
3 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/Documentation/powerpc/vas-api.rst b/Documentation/powerpc/vas-api.rst
index 1217c2f..788dc83 100644
--- a/Documentation/powerpc/vas-api.rst
+++ b/Documentation/powerpc/vas-api.rst
@@ -213,7 +213,7 @@ request buffers are not in memory. The operating system handles the fault by
updating CSB with the following data:
csb.flags = CSB_V;
- csb.cc = CSB_CC_TRANSLATION;
+ csb.cc = CSB_CC_FAULT_ADDRESS;
csb.ce = CSB_CE_TERMINATION;
csb.address = fault_address;
diff --git a/arch/powerpc/include/asm/icswx.h b/arch/powerpc/include/asm/icswx.h
index 965b1f3..9bc7c58 100644
--- a/arch/powerpc/include/asm/icswx.h
+++ b/arch/powerpc/include/asm/icswx.h
@@ -77,6 +77,10 @@ struct coprocessor_completion_block {
#define CSB_CC_CHAIN (37)
#define CSB_CC_SEQUENCE (38)
#define CSB_CC_HW (39)
+/*
+ * P9 DD NX Workbook 3.2 (Table 4-36): Address translation fault
+ */
+#define CSB_CC_FAULT_ADDRESS (250)
#define CSB_SIZE (0x10)
#define CSB_ALIGN CSB_SIZE
diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
index 266a6ca..3d21fce 100644
--- a/arch/powerpc/platforms/powernv/vas-fault.c
+++ b/arch/powerpc/platforms/powernv/vas-fault.c
@@ -79,7 +79,7 @@ static void update_csb(struct vas_window *window,
csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
memset(&csb, 0, sizeof(csb));
- csb.cc = CSB_CC_TRANSLATION;
+ csb.cc = CSB_CC_FAULT_ADDRESS;
csb.ce = CSB_CE_TERMINATION;
csb.cs = 0;
csb.count = 0;
--
1.8.3.1
^ permalink raw reply related
* Re: generic DMA bypass flag v4
From: Jesper Dangaard Brouer @ 2020-07-10 13:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Daniel Borkmann, Alexey Kardashevskiy, Greg Kroah-Hartman,
Joerg Roedel, brouer, Robin Murphy, linux-kernel, iommu,
Björn Töpel, linuxppc-dev, Lu Baolu
In-Reply-To: <20200708152449.316476-1-hch@lst.de>
On Wed, 8 Jul 2020 17:24:44 +0200
Christoph Hellwig <hch@lst.de> wrote:
> Note that as-is this breaks the XSK buffer pool, which unfortunately
> poked directly into DMA internals. A fix for that is already queued
> up in the netdev tree.
>
> Jesper and XDP gang: this should not regress any performance as
> the dma-direct calls are now inlined into the out of line DMA mapping
> calls. But if you can verify the performance numbers that would be
> greatly appreciated.
From a superficial review of the patches, they look okay to me. I don't
have time to run a performance benchmark (before I go on vacation).
I hoped Björn could test/benchmark this(?), given (as mentioned) this
also affect XSK / AF_XDP performance.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply
* [PATCH 14/14 v3] PCI: Remove '*val = 0' from pcie_capability_read_*()
From: Saheed Olayemi Bolarinwa @ 2020-07-10 21:20 UTC (permalink / raw)
To: helgaas
Cc: linux-wireless, linux-pci, QCA ath9k Development, netdev,
Oliver O'Halloran, Stanislaw Gruszka, linux-acpi, linux-rdma,
Jason Gunthorpe, Doug Ledford, Jakub Kicinski,
linux-kernel-mentees, Len Brown, Arnd Bergmann, skhan, bjorn,
Kalle Valo, Mike Marciniszyn, Sam Bobroff, Greg Kroah-Hartman,
Dennis Dalessandro, Rafael J. Wysocki, linux-kernel, Lukas Wunner,
Bolarinwa Olayemi Saheed, linuxppc-dev, David S. Miller
In-Reply-To: <20200710212026.27136-1-refactormyself@gmail.com>
From: Bolarinwa Olayemi Saheed <refactormyself@gmail.com>
There are several reasons why a PCI capability read may fail whether the
device is present or not. If this happens, pcie_capability_read_*() will
return -EINVAL/PCIBIOS_BAD_REGISTER_NUMBER or PCIBIOS_DEVICE_NOT_FOUND
and *val is set to 0.
This behaviour if further ensured by this code inside
pcie_capability_read_*()
ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
/*
* Reset *val to 0 if pci_read_config_dword() fails, it may
* have been written as 0xFFFFFFFF if hardware error happens
* during pci_read_config_dword().
*/
if (ret)
*val = 0;
return ret;
a) Since all pci_generic_config_read() does is read a register value,
it may return success after reading a ~0 which *may* have been fabricated
by the PCI host bridge due to a read timeout. Hence pci_read_config_*()
will return success with a fabricated ~0 in *val, indicating a problem.
In this case, the assumed behaviour of pcie_capability_read_*() will be
wrong. To avoid error slipping through, more checks are necessary.
b) pci_read_config_*() will return PCIBIOS_DEVICE_NOT_FOUND only if
dev->error_state = pci_channel_io_perm_failure (i.e.
pci_dev_is_disconnected()) or if pci_generic_config_read() can't find the
device. In both cases *val is initially set to ~0 but as shown in the code
above pcie_capability_read_*() resets it back to 0. Even with this effort,
drivers still have to perform validation checks more so if 0 is a valid
value.
Most drivers only consider the case (b) and in some cases, there is the
expectation that on timeout *val has a fabricated value of ~0, which *may*
not always be true as explained in (a).
In any case, checks need to be done to validate the value read and maybe
confirm which error has occurred. It is better left to the drivers to do.
Remove the reset of *val to 0 when pci_read_config_*() fails.
Suggested-by: Bjorn Helgaas <bjorn@helgaas.com>
Signed-off-by: Bolarinwa Olayemi Saheed <refactormyself@gmail.com>
---
This patch depends on all of the preceeding patches in this series,
otherwise it will introduce bugs as pointed out in the commit message
of each.
drivers/pci/access.c | 14 --------------
1 file changed, 14 deletions(-)
diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 79c4a2ef269a..ec95edbb1ac8 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -413,13 +413,6 @@ int pcie_capability_read_word(struct pci_dev *dev, int pos, u16 *val)
if (pcie_capability_reg_implemented(dev, pos)) {
ret = pci_read_config_word(dev, pci_pcie_cap(dev) + pos, val);
- /*
- * Reset *val to 0 if pci_read_config_word() fails, it may
- * have been written as 0xFFFF if hardware error happens
- * during pci_read_config_word().
- */
- if (ret)
- *val = 0;
return ret;
}
@@ -448,13 +441,6 @@ int pcie_capability_read_dword(struct pci_dev *dev, int pos, u32 *val)
if (pcie_capability_reg_implemented(dev, pos)) {
ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
- /*
- * Reset *val to 0 if pci_read_config_dword() fails, it may
- * have been written as 0xFFFFFFFF if hardware error happens
- * during pci_read_config_dword().
- */
- if (ret)
- *val = 0;
return ret;
}
--
2.18.2
^ permalink raw reply related
* [PATCH 0/14 v3] PCI: Remove '*val = 0' from pcie_capability_read_*()
From: Saheed Olayemi Bolarinwa @ 2020-07-10 21:20 UTC (permalink / raw)
To: helgaas
Cc: linux-wireless, linux-pci, QCA ath9k Development, netdev,
Oliver O'Halloran, Stanislaw Gruszka, linux-acpi, linux-rdma,
Jason Gunthorpe, Doug Ledford, Jakub Kicinski,
linux-kernel-mentees, Len Brown, Arnd Bergmann, skhan, bjorn,
Kalle Valo, Mike Marciniszyn, Sam Bobroff, Greg Kroah-Hartman,
Dennis Dalessandro, Rafael J. Wysocki, linux-kernel, Lukas Wunner,
Bolarinwa Olayemi Saheed, linuxppc-dev, David S. Miller
From: Bolarinwa Olayemi Saheed <refactormyself@gmail.com>
v3 CHANGES:
- Split previous PATCH 6/13 into two : PATCH 6/14 and PATCH 7/14
- Fix commit message of PATCH 5/14
- Update Patch numbering and Commit messages
- Add 'Acked by Greg KH' to PATCH 2/14
- Add PATCH version
v2 CHANGES:
- Fix missing comma, causing the email cc error
- Fix typos and numbering errors in commit messages
- Add commit message to 13/13
- Add two more patches: PATCH 3/13 and PATCH 4/13
MERGING:
Patch 7/14 depends on Patch 6/14. However Patch 6/14 has no dependency.
Please, merge PATCH 7/14 only after Patch 6/14.
Patch 14/14 depend on all preceeding patchs. Except for Patch 6/14 and
Patch 7/14, all other patches are independent of one another. Hence,
please merge Patch 14/14 only after other patches in this series have
been merged.
PATCH 6/14:
Make the function set status to "Power On" by default and only set to
Set "Power Off" only if pcie_capability_read_word() is successful and
(slot_ctrl & PCI_EXP_SLTCTL_PCC) == PCI_EXP_SLTCTL_PWR_OFF.
PATCH 1/14 to 13/14:
Check the return value of pcie_capability_read_*() to ensure success or
confirm failure. While maintaining these functions, this ensures that the
changes in PATCH 14/14 does not introduce any bug.
PATCH 14/14:
There are several reasons why a PCI capability read may fail whether the
device is present or not. If this happens, pcie_capability_read_*() will
return -EINVAL/PCIBIOS_BAD_REGISTER_NUMBER or PCIBIOS_DEVICE_NOT_FOUND
and *val is set to 0.
This behaviour if further ensured by this code inside
pcie_capability_read_*()
ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
/*
* Reset *val to 0 if pci_read_config_dword() fails, it may
* have been written as 0xFFFFFFFF if hardware error happens
* during pci_read_config_dword().
*/
if (ret)
*val = 0;
return ret;
a) Since all pci_generic_config_read() does is read a register value,
it may return success after reading a ~0 which *may* have been fabricated
by the PCI host bridge due to a read timeout. Hence pci_read_config_*()
will return success with a fabricated ~0 in *val, indicating a problem.
In this case, the assumed behaviour of pcie_capability_read_*() will be
wrong. To avoid error slipping through, more checks are necessary.
b) pci_read_config_*() will return PCIBIOS_DEVICE_NOT_FOUND only if
dev->error_state = pci_channel_io_perm_failure (i.e.
pci_dev_is_disconnected()) or if pci_generic_config_read() can't find the
device. In both cases *val is initially set to ~0 but as shown in the code
above pcie_capability_read_*() resets it back to 0. Even with this effort,
drivers still have to perform validation checks more so if 0 is a valid
value.
Most drivers only consider the case (b) and in some cases, there is the
expectation that on timeout *val has a fabricated value of ~0, which *may*
not always be true as explained in (a).
In any case, checks need to be done to validate the value read and maybe
confirm which error has occurred. It is better left to the drivers to do.
Check the return value of pcie_capability_read_dword() to ensure success
and avoid bug as a result of Patch 14/14.
Remove the reset of *val to 0 when pci_read_config_*() fails.
Bolarinwa Olayemi Saheed (14):
IB/hfi1: Check the return value of pcie_capability_read_*()
misc: rtsx: Check the return value of pcie_capability_read_*()
ath9k: Check the return value of pcie_capability_read_*()
iwlegacy: Check the return value of pcie_capability_read_*()
PCI: pciehp: Check the return value of pcie_capability_read_*()
PCI: pciehp: Make "Power On" the default
PCI: pciehp: Check the return value of pcie_capability_read_*()
PCI/ACPI: Check the return value of pcie_capability_read_*()
PCI: pciehp: Check the return value of pcie_capability_read_*()
PCI: Check the return value of pcie_capability_read_*()
PCI/PM: Check return value of pcie_capability_read_*()
PCI/AER: Check the return value of pcie_capability_read_*()
PCI/ASPM: Check the return value of pcie_capability_read_*()
PCI: Remove '*val = 0' from pcie_capability_read_*()
drivers/net/wireless/ath/ath9k/pci.c | 5 +++--
drivers/net/wireless/intel/iwlegacy/common.c | 4 ++--
drivers/infiniband/hw/hfi1/aspm.c | 7 ++++---
drivers/misc/cardreader/rts5227.c | 5 +++--
drivers/misc/cardreader/rts5249.c | 5 +++--
drivers/misc/cardreader/rts5260.c | 5 +++--
drivers/misc/cardreader/rts5261.c | 5 +++--
drivers/pci/pcie/aer.c | 5 +++--
drivers/pci/pcie/aspm.c | 33 +++++++++++++++++----------------
drivers/pci/hotplug/pciehp_hpc.c | 47 ++++++++++++++++----------------
drivers/pci/pci-acpi.c | 10 ++++---
drivers/pci/probe.c | 29 ++++++++++++--------
drivers/pci/access.c | 14 --------------
13 files changed, 87 insertions(+), 87 deletions(-)
--
2.18.2
^ permalink raw reply
* Re: [PATCH v5] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
From: Bruno Meneguele @ 2020-07-10 19:25 UTC (permalink / raw)
To: Mimi Zohar
Cc: linux-s390, nayna, erichte, x86, linux-kernel, stable,
linux-integrity, linuxppc-dev
In-Reply-To: <1594407288.14405.36.camel@linux.ibm.com>
[-- Attachment #1: Type: text/plain, Size: 5131 bytes --]
On Fri, Jul 10, 2020 at 02:54:48PM -0400, Mimi Zohar wrote:
> On Fri, 2020-07-10 at 15:34 -0300, Bruno Meneguele wrote:
> > On Fri, Jul 10, 2020 at 03:03:38PM -0300, Bruno Meneguele wrote:
> > > On Fri, Jul 10, 2020 at 01:23:24PM -0400, Mimi Zohar wrote:
> > > > On Thu, 2020-07-09 at 13:46 -0300, Bruno Meneguele wrote:
> > > > > APPRAISE_BOOTPARAM has been marked as dependent on !ARCH_POLICY in compile
> > > > > time, enforcing the appraisal whenever the kernel had the arch policy option
> > > > > enabled.
> > > >
> > > > > However it breaks systems where the option is set but the system didn't
> > > > > boot in a "secure boot" platform. In this scenario, anytime an appraisal
> > > > > policy (i.e. ima_policy=appraisal_tcb) is used it will be forced, without
> > > > > giving the user the opportunity to label the filesystem, before enforcing
> > > > > integrity.
> > > > >
> > > > > Considering the ARCH_POLICY is only effective when secure boot is actually
> > > > > enabled this patch remove the compile time dependency and move it to a
> > > > > runtime decision, based on the secure boot state of that platform.
> > > >
> > > > Perhaps we could simplify this patch description a bit?
> > > >
> > > > The IMA_APPRAISE_BOOTPARAM config allows enabling different
> > > > "ima_appraise=" modes - log, fix, enforce - at run time, but not when
> > > > IMA architecture specific policies are enabled. This prevents
> > > > properly labeling the filesystem on systems where secure boot is
> > > > supported, but not enabled on the platform. Only when secure boot is
> > > > enabled, should these IMA appraise modes be disabled.
> > > >
> > > > This patch removes the compile time dependency and makes it a runtime
> > > > decision, based on the secure boot state of that platform.
> > > >
> > >
> > > Sounds good to me.
> > >
> > > > <snip>
> > > >
> > > > > diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
> > > > > index a9649b04b9f1..884de471b38a 100644
> > > > > --- a/security/integrity/ima/ima_appraise.c
> > > > > +++ b/security/integrity/ima/ima_appraise.c
> > > > > @@ -19,6 +19,11 @@
> > > > > static int __init default_appraise_setup(c
> > > >
> > > > > har *str)
> > > > > {
> > > > > #ifdef CONFIG_IMA_APPRAISE_BOOTPARAM
> > > > > + if (arch_ima_get_secureboot()) {
> > > > > + pr_info("appraise boot param ignored: secure boot enabled");
> > > >
> > > > Instead of a generic statement, is it possible to include the actual
> > > > option being denied? Perhaps something like: "Secure boot enabled,
> > > > ignoring %s boot command line option"
> > > >
> > > > Mimi
> > > >
> > >
> > > Yes, sure.
> > >
> >
> > Btw, would it make sense to first make sure we have a valid "str"
> > option and not something random to print?
> >
> > diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
> > index a9649b04b9f1..1f1175531d3e 100644
> > --- a/security/integrity/ima/ima_appraise.c
> > +++ b/security/integrity/ima/ima_appraise.c
> > @@ -25,6 +25,16 @@ static int __init default_appraise_setup(char *str)
> > ima_appraise = IMA_APPRAISE_LOG;
> > else if (strncmp(str, "fix", 3) == 0)
> > ima_appraise = IMA_APPRAISE_FIX;
> > + else
> > + pr_info("invalid \"%s\" appraise option");
> > +
> > + if (arch_ima_get_secureboot()) {
> > + if (!is_ima_appraise_enabled()) {
> > + pr_info("Secure boot enabled: ignoring ima_appraise=%s boot parameter option",
> > + str);
> > + ima_appraise = IMA_APPRAISE_ENFORCE;
> > + }
> > + }
>
> Providing feedback is probably a good idea. However, the
> "arch_ima_get_secureboot" test can't come after setting
> "ima_appraise."
>
Sorry, but I'm not sure if I got the reason to why it can't be done
after: would it be basically to prevent any further processing about
ima_appraise as a matter of security principle? Or maybe to keep the
dependency between secureboot and bootparam truly strict?
Or are there something else I'm missing?
> Mimi
>
> > #endif
> > return 1;
> > }
> >
> >
> > The "else" there I think would make sense as well, at least to give the
> > user some feedback about a possible mispelling of him (as a separate
> > patch).
> >
> > And "if(!is_ima_appraise_enabled())" would avoid to print anything about
> > "ignoring the option" to the user in case he explicitly set "enforce",
> > which we know there isn't any real effect but is allowed and shown in
> > kernel-parameters.txt.
> >
> > > Thanks!
> > >
> > > > > + return 1;
> > > > > + }
> > > > > +
> > > > > if (strncmp(str, "off", 3) == 0)
> > > > > ima_appraise = 0;
> > > > > else if (strncmp(str, "log", 3) == 0)
> > > >
> > >
> > > --
> > > bmeneg
> > > PGP Key: http://bmeneg.com/pubkey.txt
> >
> >
> >
>
--
bmeneg
PGP Key: http://bmeneg.com/pubkey.txt
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [PATCH v5] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
From: Mimi Zohar @ 2020-07-10 18:54 UTC (permalink / raw)
To: Bruno Meneguele
Cc: linux-s390, nayna, erichte, x86, linux-kernel, stable,
linux-integrity, linuxppc-dev
In-Reply-To: <20200710183420.GB10547@glitch>
On Fri, 2020-07-10 at 15:34 -0300, Bruno Meneguele wrote:
> On Fri, Jul 10, 2020 at 03:03:38PM -0300, Bruno Meneguele wrote:
> > On Fri, Jul 10, 2020 at 01:23:24PM -0400, Mimi Zohar wrote:
> > > On Thu, 2020-07-09 at 13:46 -0300, Bruno Meneguele wrote:
> > > > APPRAISE_BOOTPARAM has been marked as dependent on !ARCH_POLICY in compile
> > > > time, enforcing the appraisal whenever the kernel had the arch policy option
> > > > enabled.
> > >
> > > > However it breaks systems where the option is set but the system didn't
> > > > boot in a "secure boot" platform. In this scenario, anytime an appraisal
> > > > policy (i.e. ima_policy=appraisal_tcb) is used it will be forced, without
> > > > giving the user the opportunity to label the filesystem, before enforcing
> > > > integrity.
> > > >
> > > > Considering the ARCH_POLICY is only effective when secure boot is actually
> > > > enabled this patch remove the compile time dependency and move it to a
> > > > runtime decision, based on the secure boot state of that platform.
> > >
> > > Perhaps we could simplify this patch description a bit?
> > >
> > > The IMA_APPRAISE_BOOTPARAM config allows enabling different
> > > "ima_appraise=" modes - log, fix, enforce - at run time, but not when
> > > IMA architecture specific policies are enabled. This prevents
> > > properly labeling the filesystem on systems where secure boot is
> > > supported, but not enabled on the platform. Only when secure boot is
> > > enabled, should these IMA appraise modes be disabled.
> > >
> > > This patch removes the compile time dependency and makes it a runtime
> > > decision, based on the secure boot state of that platform.
> > >
> >
> > Sounds good to me.
> >
> > > <snip>
> > >
> > > > diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
> > > > index a9649b04b9f1..884de471b38a 100644
> > > > --- a/security/integrity/ima/ima_appraise.c
> > > > +++ b/security/integrity/ima/ima_appraise.c
> > > > @@ -19,6 +19,11 @@
> > > > static int __init default_appraise_setup(c
> > >
> > > > har *str)
> > > > {
> > > > #ifdef CONFIG_IMA_APPRAISE_BOOTPARAM
> > > > + if (arch_ima_get_secureboot()) {
> > > > + pr_info("appraise boot param ignored: secure boot enabled");
> > >
> > > Instead of a generic statement, is it possible to include the actual
> > > option being denied? Perhaps something like: "Secure boot enabled,
> > > ignoring %s boot command line option"
> > >
> > > Mimi
> > >
> >
> > Yes, sure.
> >
>
> Btw, would it make sense to first make sure we have a valid "str"
> option and not something random to print?
>
> diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
> index a9649b04b9f1..1f1175531d3e 100644
> --- a/security/integrity/ima/ima_appraise.c
> +++ b/security/integrity/ima/ima_appraise.c
> @@ -25,6 +25,16 @@ static int __init default_appraise_setup(char *str)
> ima_appraise = IMA_APPRAISE_LOG;
> else if (strncmp(str, "fix", 3) == 0)
> ima_appraise = IMA_APPRAISE_FIX;
> + else
> + pr_info("invalid \"%s\" appraise option");
> +
> + if (arch_ima_get_secureboot()) {
> + if (!is_ima_appraise_enabled()) {
> + pr_info("Secure boot enabled: ignoring ima_appraise=%s boot parameter option",
> + str);
> + ima_appraise = IMA_APPRAISE_ENFORCE;
> + }
> + }
Providing feedback is probably a good idea. However, the
"arch_ima_get_secureboot" test can't come after setting
"ima_appraise."
Mimi
> #endif
> return 1;
> }
>
>
> The "else" there I think would make sense as well, at least to give the
> user some feedback about a possible mispelling of him (as a separate
> patch).
>
> And "if(!is_ima_appraise_enabled())" would avoid to print anything about
> "ignoring the option" to the user in case he explicitly set "enforce",
> which we know there isn't any real effect but is allowed and shown in
> kernel-parameters.txt.
>
> > Thanks!
> >
> > > > + return 1;
> > > > + }
> > > > +
> > > > if (strncmp(str, "off", 3) == 0)
> > > > ima_appraise = 0;
> > > > else if (strncmp(str, "log", 3) == 0)
> > >
> >
> > --
> > bmeneg
> > PGP Key: http://bmeneg.com/pubkey.txt
>
>
>
^ permalink raw reply
* Re: [PATCH v5] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
From: Bruno Meneguele @ 2020-07-10 18:34 UTC (permalink / raw)
To: Mimi Zohar
Cc: linux-s390, nayna, erichte, x86, linux-kernel, stable,
linux-integrity, linuxppc-dev
In-Reply-To: <20200710180338.GA10547@glitch>
[-- Attachment #1: Type: text/plain, Size: 4095 bytes --]
On Fri, Jul 10, 2020 at 03:03:38PM -0300, Bruno Meneguele wrote:
> On Fri, Jul 10, 2020 at 01:23:24PM -0400, Mimi Zohar wrote:
> > On Thu, 2020-07-09 at 13:46 -0300, Bruno Meneguele wrote:
> > > APPRAISE_BOOTPARAM has been marked as dependent on !ARCH_POLICY in compile
> > > time, enforcing the appraisal whenever the kernel had the arch policy option
> > > enabled.
> >
> > > However it breaks systems where the option is set but the system didn't
> > > boot in a "secure boot" platform. In this scenario, anytime an appraisal
> > > policy (i.e. ima_policy=appraisal_tcb) is used it will be forced, without
> > > giving the user the opportunity to label the filesystem, before enforcing
> > > integrity.
> > >
> > > Considering the ARCH_POLICY is only effective when secure boot is actually
> > > enabled this patch remove the compile time dependency and move it to a
> > > runtime decision, based on the secure boot state of that platform.
> >
> > Perhaps we could simplify this patch description a bit?
> >
> > The IMA_APPRAISE_BOOTPARAM config allows enabling different
> > "ima_appraise=" modes - log, fix, enforce - at run time, but not when
> > IMA architecture specific policies are enabled. This prevents
> > properly labeling the filesystem on systems where secure boot is
> > supported, but not enabled on the platform. Only when secure boot is
> > enabled, should these IMA appraise modes be disabled.
> >
> > This patch removes the compile time dependency and makes it a runtime
> > decision, based on the secure boot state of that platform.
> >
>
> Sounds good to me.
>
> > <snip>
> >
> > > diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
> > > index a9649b04b9f1..884de471b38a 100644
> > > --- a/security/integrity/ima/ima_appraise.c
> > > +++ b/security/integrity/ima/ima_appraise.c
> > > @@ -19,6 +19,11 @@
> > > static int __init default_appraise_setup(c
> >
> > > har *str)
> > > {
> > > #ifdef CONFIG_IMA_APPRAISE_BOOTPARAM
> > > + if (arch_ima_get_secureboot()) {
> > > + pr_info("appraise boot param ignored: secure boot enabled");
> >
> > Instead of a generic statement, is it possible to include the actual
> > option being denied? Perhaps something like: "Secure boot enabled,
> > ignoring %s boot command line option"
> >
> > Mimi
> >
>
> Yes, sure.
>
Btw, would it make sense to first make sure we have a valid "str"
option and not something random to print?
diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
index a9649b04b9f1..1f1175531d3e 100644
--- a/security/integrity/ima/ima_appraise.c
+++ b/security/integrity/ima/ima_appraise.c
@@ -25,6 +25,16 @@ static int __init default_appraise_setup(char *str)
ima_appraise = IMA_APPRAISE_LOG;
else if (strncmp(str, "fix", 3) == 0)
ima_appraise = IMA_APPRAISE_FIX;
+ else
+ pr_info("invalid \"%s\" appraise option");
+
+ if (arch_ima_get_secureboot()) {
+ if (!is_ima_appraise_enabled()) {
+ pr_info("Secure boot enabled: ignoring ima_appraise=%s boot parameter option",
+ str);
+ ima_appraise = IMA_APPRAISE_ENFORCE;
+ }
+ }
#endif
return 1;
}
The "else" there I think would make sense as well, at least to give the
user some feedback about a possible mispelling of him (as a separate
patch).
And "if(!is_ima_appraise_enabled())" would avoid to print anything about
"ignoring the option" to the user in case he explicitly set "enforce",
which we know there isn't any real effect but is allowed and shown in
kernel-parameters.txt.
> Thanks!
>
> > > + return 1;
> > > + }
> > > +
> > > if (strncmp(str, "off", 3) == 0)
> > > ima_appraise = 0;
> > > else if (strncmp(str, "log", 3) == 0)
> >
>
> --
> bmeneg
> PGP Key: http://bmeneg.com/pubkey.txt
--
bmeneg
PGP Key: http://bmeneg.com/pubkey.txt
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply related
* Re: [PATCH v5] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
From: Bruno Meneguele @ 2020-07-10 18:03 UTC (permalink / raw)
To: Mimi Zohar
Cc: linux-s390, nayna, erichte, x86, linux-kernel, stable,
linux-integrity, linuxppc-dev
In-Reply-To: <1594401804.14405.8.camel@linux.ibm.com>
[-- Attachment #1: Type: text/plain, Size: 2439 bytes --]
On Fri, Jul 10, 2020 at 01:23:24PM -0400, Mimi Zohar wrote:
> On Thu, 2020-07-09 at 13:46 -0300, Bruno Meneguele wrote:
> > APPRAISE_BOOTPARAM has been marked as dependent on !ARCH_POLICY in compile
> > time, enforcing the appraisal whenever the kernel had the arch policy option
> > enabled.
>
> > However it breaks systems where the option is set but the system didn't
> > boot in a "secure boot" platform. In this scenario, anytime an appraisal
> > policy (i.e. ima_policy=appraisal_tcb) is used it will be forced, without
> > giving the user the opportunity to label the filesystem, before enforcing
> > integrity.
> >
> > Considering the ARCH_POLICY is only effective when secure boot is actually
> > enabled this patch remove the compile time dependency and move it to a
> > runtime decision, based on the secure boot state of that platform.
>
> Perhaps we could simplify this patch description a bit?
>
> The IMA_APPRAISE_BOOTPARAM config allows enabling different
> "ima_appraise=" modes - log, fix, enforce - at run time, but not when
> IMA architecture specific policies are enabled. This prevents
> properly labeling the filesystem on systems where secure boot is
> supported, but not enabled on the platform. Only when secure boot is
> enabled, should these IMA appraise modes be disabled.
>
> This patch removes the compile time dependency and makes it a runtime
> decision, based on the secure boot state of that platform.
>
Sounds good to me.
> <snip>
>
> > diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
> > index a9649b04b9f1..884de471b38a 100644
> > --- a/security/integrity/ima/ima_appraise.c
> > +++ b/security/integrity/ima/ima_appraise.c
> > @@ -19,6 +19,11 @@
> > static int __init default_appraise_setup(c
>
> > har *str)
> > {
> > #ifdef CONFIG_IMA_APPRAISE_BOOTPARAM
> > + if (arch_ima_get_secureboot()) {
> > + pr_info("appraise boot param ignored: secure boot enabled");
>
> Instead of a generic statement, is it possible to include the actual
> option being denied? Perhaps something like: "Secure boot enabled,
> ignoring %s boot command line option"
>
> Mimi
>
Yes, sure.
Thanks!
> > + return 1;
> > + }
> > +
> > if (strncmp(str, "off", 3) == 0)
> > ima_appraise = 0;
> > else if (strncmp(str, "log", 3) == 0)
>
--
bmeneg
PGP Key: http://bmeneg.com/pubkey.txt
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [PATCH] powerpc/numa: Restrict possible nodes based on platform
From: Nathan Lynch @ 2020-07-10 17:41 UTC (permalink / raw)
To: Srikar Dronamraju; +Cc: linuxppc-dev, Bharata B Rao
In-Reply-To: <20200707084203.GC874@linux.vnet.ibm.com>
Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2020-07-07 15:02:17]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
>> > /proc/device-tree/rtas/ibm,current-associativity-domains
>> > 00000005 00000001 00000002 00000002 00000002 00000010
>> > /proc/device-tree/rtas/ibm,max-associativity-domains
>> > 00000005 00000001 00000008 00000020 00000020 00000100
>> >
>> > $ cat /sys/devices/system/node/possible ##Before patch
>> > 0-31
>> >
>> > $ cat /sys/devices/system/node/possible ##After patch
>> > 0-1
>> >
>> > Note the maximum nodes this platform can support is only 2 but the
>> > possible nodes is set to 32.
>>
>> But what about LPM to a system with more nodes?
>>
>
> I have very less info on LPM, so I checked with Nathan Lynch before posting
> and as per Nathan in the current design of LPM, Linux wouldn't use the new
> node numbers.
(I see a v2 has been posted already but I needed a little time to check
with our hypervisor people on this point.)
It's less of a design and more of a least-bad option in the absence of a
more flexible NUMA architecture in Linux.
For now, the "rule" with LPM and NUMA has to be that Linux uses the NUMA
information from the device tree that it was booted with, and it must
disregard ibm,associativity and similar information after LPM or certain
other platform events. Historically there has been code that tried to
honor changes in NUMA information but it caused much worse problems than
degraded performance. That code has been disabled by default since last
year and is now subject to removal:
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=182897
Most NUMA-aware code happens to follow that rule because the device tree
associativity information tends to get cached on first access in Linux's
own data structures. It all feels a little fragile to me though,
especially since we can DLPAR add processors and memory after LPM with
"new" associativity properties which don't relate to the logical
topology Linux has already built. However, on currently available hw, as
long as we're using ibm,max-associativity-domains to limit the set of
possible nodes, I believe such resources will always receive valid (but
possibly suboptimal) NUMA assignments. That's because as of this
writing ibm,max-associativity-domains has the same contents on all
currently available PowerVM systems.
Now if we change to using ibm,current-associativity-domains, which we
*can* expect to differ between differently configured systems, post-LPM
DLPAR additions can yield resources with node assignments that
fall outside of the possible range, especially when we've migrated from
a smaller system to a larger one.
Is the current code robust against that possibility? I don't think so:
it looks like of_node_to_nid_single(), of_drconf_to_nid_single() and
possibly more code need to guard against this in order to prevent
NODE_DATA() null dereferences and the like. Probably those functions
should be made to clamp the nid assignment at num_possible_nodes()
instead of MAX_NUMNODES, which strikes me as more correct regardless of
your patch.
^ permalink raw reply
* Re: [PATCH v5] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
From: Mimi Zohar @ 2020-07-10 17:23 UTC (permalink / raw)
To: Bruno Meneguele, linux-kernel, x86, linuxppc-dev, linux-s390,
linux-integrity
Cc: erichte, nayna, stable
In-Reply-To: <20200709164647.45153-1-bmeneg@redhat.com>
On Thu, 2020-07-09 at 13:46 -0300, Bruno Meneguele wrote:
> APPRAISE_BOOTPARAM has been marked as dependent on !ARCH_POLICY in compile
> time, enforcing the appraisal whenever the kernel had the arch policy option
> enabled.
> However it breaks systems where the option is set but the system didn't
> boot in a "secure boot" platform. In this scenario, anytime an appraisal
> policy (i.e. ima_policy=appraisal_tcb) is used it will be forced, without
> giving the user the opportunity to label the filesystem, before enforcing
> integrity.
>
> Considering the ARCH_POLICY is only effective when secure boot is actually
> enabled this patch remove the compile time dependency and move it to a
> runtime decision, based on the secure boot state of that platform.
Perhaps we could simplify this patch description a bit?
The IMA_APPRAISE_BOOTPARAM config allows enabling different
"ima_appraise=" modes - log, fix, enforce - at run time, but not when
IMA architecture specific policies are enabled. This prevents
properly labeling the filesystem on systems where secure boot is
supported, but not enabled on the platform. Only when secure boot is
enabled, should these IMA appraise modes be disabled.
This patch removes the compile time dependency and makes it a runtime
decision, based on the secure boot state of that platform.
<snip>
> diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
> index a9649b04b9f1..884de471b38a 100644
> --- a/security/integrity/ima/ima_appraise.c
> +++ b/security/integrity/ima/ima_appraise.c
> @@ -19,6 +19,11 @@
> static int __init default_appraise_setup(c
> har *str)
> {
> #ifdef CONFIG_IMA_APPRAISE_BOOTPARAM
> + if (arch_ima_get_secureboot()) {
> + pr_info("appraise boot param ignored: secure boot enabled");
Instead of a generic statement, is it possible to include the actual
option being denied? Perhaps something like: "Secure boot enabled,
ignoring %s boot command line option"
Mimi
> + return 1;
> + }
> +
> if (strncmp(str, "off", 3) == 0)
> ima_appraise = 0;
> else if (strncmp(str, "log", 3) == 0)
^ permalink raw reply
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Andy Lutomirski @ 2020-07-10 17:04 UTC (permalink / raw)
To: Nicholas Piggin
Cc: linux-arch, Arnd Bergmann, Peter Zijlstra, X86 ML, LKML, Linux-MM,
Mathieu Desnoyers, linuxppc-dev
In-Reply-To: <20200710015646.2020871-5-npiggin@gmail.com>
On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> And get rid of the generic sync_core_before_usermode facility.
>
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> tricky detail that is better put in x86 lazy tlb code.
>
> Consider if an arch did not synchronize core in switch_mm either, then
> membarrier_mm_sync_core_before_usermode would be in the wrong place
> but arch specific mmu context functions would still be the right place.
> There is also a exit_lazy_tlb case that is not covered by this call, which
> could be a bugs (kthread use mm the membarrier process's mm then context
> switch back to the process without switching mm or lazy mm switch).
>
> This makes lazy tlb code a bit more modular.
The mm-switching and TLB-management has often had the regrettable
property that it gets wired up in a way that seems to work at the time
but doesn't have clear semantics, and I'm a bit concerned that this
patch is in that category. If I'm understanding right, you're trying
to enforce the property that exiting lazy TLB mode will promise to
sync the core eventually. But this has all kinds of odd properties:
- Why is exit_lazy_tlb() getting called at all in the relevant cases?
When is it permissible to call it? I look at your new code and see:
> +/*
> + * Ensure that a core serializing instruction is issued before returning
> + * to user-mode, if a SYNC_CORE was requested. x86 implements return to
> + * user-space through sysexit, sysrel, and sysretq, which are not core
> + * serializing.
> + *
> + * See the membarrier comment in finish_task_switch as to why this is done
> + * in exit_lazy_tlb.
> + */
> +#define exit_lazy_tlb exit_lazy_tlb
> +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> +{
> + /* Switching mm is serializing with write_cr3 */
> + if (tsk->mm != mm)
> + return;
And my brain says WTF? Surely you meant something like if
(WARN_ON_ONCE(tsk->mm != mm)) { /* egads, what even happened? how do
we try to recover well enough to get a crashed logged at least? */ }
So this needs actual documentation, preferably in comments near the
function, of what the preconditions are and what this mm parameter is.
Once that's done, then we could consider whether it's appropriate to
have this function promise to sync the core under some conditions.
- This whole structure seems to rely on the idea that switching mm
syncs something. I periodically ask chip vendor for non-serializing
mm switches. Specifically, in my dream world, we have totally
separate user and kernel page tables. Changing out the user tables
doesn't serialize or even create a fence. Instead it creates the
minimum required pipeline hazard such that user memory access and
switches to usermode will make sure they access memory through the
correct page tables. I haven't convinced a chip vendor yet, but there
are quite a few hundreds of cycles to be saved here. With this in
mind, I see the fencing aspects of the TLB handling code as somewhat
of an accident. I'm fine with documenting them and using them to
optimize other paths, but I think it should be explicit. For example:
/* Also does a full barrier? (Or a sync_core()-style barrier.)
However, if you rely on this, you must document it in a comment where
you call this function. *?
void switch_mm_irqs_off()
{
}
This is kind of like how we strongly encourage anyone using smp_?mb()
to document what they are fencing against.
Also, as it stands, I can easily see in_irq() ceasing to promise to
serialize. There are older kernels for which it does not promise to
serialize. And I have plans to make it stop serializing in the
nearish future.
--Andy
^ permalink raw reply
* Re: [PATCH 0/2] ASoC: fsl_spdif: Clear the validity bit for TX
From: Mark Brown @ 2020-07-10 15:39 UTC (permalink / raw)
To: timur, alsa-devel, perex, festevam, Shengjiu Wang, nicoleotsuka,
tiwai, Xiubo.Lee
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1594112066-31297-1-git-send-email-shengjiu.wang@nxp.com>
On Tue, 7 Jul 2020 16:54:24 +0800, Shengjiu Wang wrote:
> Clear the validity bit for TX
> Add kctl for configuring TX validity bit
>
> Shengjiu Wang (2):
> ASoC: fsl_spdif: Clear the validity bit for TX
> ASoC: fsl_spdif: Add kctl for configuring TX validity bit
>
> [...]
Applied to
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next
Thanks!
[1/2] ASoC: fsl_spdif: Clear the validity bit for TX
commit: 055b082156704b85a059770359d6cdedfb24831d
[2/2] ASoC: fsl_spdif: Add kctl for configuring TX validity bit
commit: aa3fce5cd454db551a4ea1656bab9c2bacd2d1f4
All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.
You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.
If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.
Please add any relevant lists and maintainers to the CCs when replying
to this mail.
Thanks,
Mark
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox