* [PATCH 11/24] KVM: arm64: Clean up the checking for huge mapping
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvmarm, kvm, Will Deacon, Jiang Yi, Ard Biesheuvel,
linux-arm-kernel
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
From: Suzuki K Poulose <suzuki.poulose@arm.com>
If we are checking whether the stage2 can map PAGE_SIZE,
we don't have to do the boundary checks as both the host
VMA and the guest memslots are page aligned. Bail the case
easily.
While we're at it, fixup a typo in the comment below.
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200507123546.1875-2-yuzenghui@huawei.com
---
arch/arm64/kvm/mmu.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 917363375e8a..ccb44e7d30d9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1610,6 +1610,10 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
hva_t uaddr_start, uaddr_end;
size_t size;
+ /* The memslot and the VMA are guaranteed to be aligned to PAGE_SIZE */
+ if (map_size == PAGE_SIZE)
+ return true;
+
size = memslot->npages * PAGE_SIZE;
gpa_start = memslot->base_gfn << PAGE_SHIFT;
@@ -1629,7 +1633,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
* |abcde|fgh Stage-1 block | Stage-1 block tv|xyz|
* +-----+--------------------+--------------------+---+
*
- * memslot->base_gfn << PAGE_SIZE:
+ * memslot->base_gfn << PAGE_SHIFT:
* +---+--------------------+--------------------+-----+
* |abc|def Stage-2 block | Stage-2 block |tvxyz|
* +---+--------------------+--------------------+-----+
--
2.26.2
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
^ permalink raw reply related
* [PATCH 16/24] KVM: arm64: Fix incorrect comment on kvm_get_hyp_vector()
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvmarm, kvm, Will Deacon, Jiang Yi, Ard Biesheuvel,
linux-arm-kernel
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
From: David Brazdil <dbrazdil@google.com>
The comment used to say that kvm_get_hyp_vector is only called on VHE systems.
In fact, it is also called from the nVHE init function cpu_init_hyp_mode().
Fix the comment to stop confusing devs.
Signed-off-by: David Brazdil <dbrazdil@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200515152550.83810-1-dbrazdil@google.com
---
arch/arm64/include/asm/kvm_mmu.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 30b0e8d6b895..796f6a2e794a 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -473,7 +473,7 @@ static inline int kvm_write_guest_lock(struct kvm *kvm, gpa_t gpa,
extern void *__kvm_bp_vect_base;
extern int __kvm_harden_el2_vector_slot;
-/* This is only called on a VHE system */
+/* This is called on both VHE and !VHE systems */
static inline void *kvm_get_hyp_vector(void)
{
struct bp_hardening_data *data = arm64_get_bp_hardening_data();
--
2.26.2
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
^ permalink raw reply related
* [PATCH 23/24] KVM: arm64: Parametrize exception entry with a target EL
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Alexandru Elisei, Andrew Scull, Ard Biesheuvel, Christoffer Dall,
David Brazdil, Fuad Tabba, James Morse, Jiang Yi, Keqian Zhu,
Mark Rutland, Suzuki K Poulose, Will Deacon, Zenghui Yu,
Julien Thierry, linux-arm-kernel, kvm, kvmarm
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
We currently assume that an exception is delivered to EL1, always.
Once we emulate EL2, this no longer will be the case. To prepare
for this, add a target_mode parameter.
While we're at it, merge the computing of the target PC and PSTATE in
a single function that updates both PC and CPSR after saving their
previous values in the corresponding ELR/SPSR. This ensures that they
are updated in the correct order (a pretty common source of bugs...).
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
arch/arm64/include/asm/ptrace.h | 1 +
arch/arm64/kvm/inject_fault.c | 75 +++++++++++++++++----------------
2 files changed, 39 insertions(+), 37 deletions(-)
diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index bf57308fcd63..953b6a1ce549 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -35,6 +35,7 @@
#define GIC_PRIO_PSR_I_SET (1 << 4)
/* Additional SPSR bits not exposed in the UABI */
+#define PSR_MODE_THREAD_BIT (1 << 0)
#define PSR_IL_BIT (1 << 20)
/* AArch32-specific ptrace requests */
diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
index 6aafc2825c1c..e21fdd93027a 100644
--- a/arch/arm64/kvm/inject_fault.c
+++ b/arch/arm64/kvm/inject_fault.c
@@ -26,28 +26,12 @@ enum exception_type {
except_type_serror = 0x180,
};
-static u64 get_except_vector(struct kvm_vcpu *vcpu, enum exception_type type)
-{
- u64 exc_offset;
-
- switch (*vcpu_cpsr(vcpu) & (PSR_MODE_MASK | PSR_MODE32_BIT)) {
- case PSR_MODE_EL1t:
- exc_offset = CURRENT_EL_SP_EL0_VECTOR;
- break;
- case PSR_MODE_EL1h:
- exc_offset = CURRENT_EL_SP_ELx_VECTOR;
- break;
- case PSR_MODE_EL0t:
- exc_offset = LOWER_EL_AArch64_VECTOR;
- break;
- default:
- exc_offset = LOWER_EL_AArch32_VECTOR;
- }
-
- return vcpu_read_sys_reg(vcpu, VBAR_EL1) + exc_offset + type;
-}
-
/*
+ * This performs the exception entry at a given EL (@target_mode), stashing PC
+ * and PSTATE into ELR and SPSR respectively, and compute the new PC/PSTATE.
+ * The EL passed to this function *must* be a non-secure, privileged mode with
+ * bit 0 being set (PSTATE.SP == 1).
+ *
* When an exception is taken, most PSTATE fields are left unchanged in the
* handler. However, some are explicitly overridden (e.g. M[4:0]). Luckily all
* of the inherited bits have the same position in the AArch64/AArch32 SPSR_ELx
@@ -59,10 +43,35 @@ static u64 get_except_vector(struct kvm_vcpu *vcpu, enum exception_type type)
* Here we manipulate the fields in order of the AArch64 SPSR_ELx layout, from
* MSB to LSB.
*/
-static unsigned long get_except64_pstate(struct kvm_vcpu *vcpu)
+static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long target_mode,
+ enum exception_type type)
{
- unsigned long sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1);
- unsigned long old, new;
+ unsigned long sctlr, vbar, old, new, mode;
+ u64 exc_offset;
+
+ mode = *vcpu_cpsr(vcpu) & (PSR_MODE_MASK | PSR_MODE32_BIT);
+
+ if (mode == target_mode)
+ exc_offset = CURRENT_EL_SP_ELx_VECTOR;
+ else if ((mode | PSR_MODE_THREAD_BIT) == target_mode)
+ exc_offset = CURRENT_EL_SP_EL0_VECTOR;
+ else if (!(mode & PSR_MODE32_BIT))
+ exc_offset = LOWER_EL_AArch64_VECTOR;
+ else
+ exc_offset = LOWER_EL_AArch32_VECTOR;
+
+ switch (target_mode) {
+ case PSR_MODE_EL1h:
+ vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
+ sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1);
+ vcpu_write_elr_el1(vcpu, *vcpu_pc(vcpu));
+ break;
+ default:
+ /* Don't do that */
+ BUG();
+ }
+
+ *vcpu_pc(vcpu) = vbar + exc_offset + type;
old = *vcpu_cpsr(vcpu);
new = 0;
@@ -105,9 +114,10 @@ static unsigned long get_except64_pstate(struct kvm_vcpu *vcpu)
new |= PSR_I_BIT;
new |= PSR_F_BIT;
- new |= PSR_MODE_EL1h;
+ new |= target_mode;
- return new;
+ *vcpu_cpsr(vcpu) = new;
+ vcpu_write_spsr(vcpu, old);
}
static void inject_abt64(struct kvm_vcpu *vcpu, bool is_iabt, unsigned long addr)
@@ -116,11 +126,7 @@ static void inject_abt64(struct kvm_vcpu *vcpu, bool is_iabt, unsigned long addr
bool is_aarch32 = vcpu_mode_is_32bit(vcpu);
u32 esr = 0;
- vcpu_write_elr_el1(vcpu, *vcpu_pc(vcpu));
- *vcpu_pc(vcpu) = get_except_vector(vcpu, except_type_sync);
-
- *vcpu_cpsr(vcpu) = get_except64_pstate(vcpu);
- vcpu_write_spsr(vcpu, cpsr);
+ enter_exception64(vcpu, PSR_MODE_EL1h, except_type_sync);
vcpu_write_sys_reg(vcpu, addr, FAR_EL1);
@@ -148,14 +154,9 @@ static void inject_abt64(struct kvm_vcpu *vcpu, bool is_iabt, unsigned long addr
static void inject_undef64(struct kvm_vcpu *vcpu)
{
- unsigned long cpsr = *vcpu_cpsr(vcpu);
u32 esr = (ESR_ELx_EC_UNKNOWN << ESR_ELx_EC_SHIFT);
- vcpu_write_elr_el1(vcpu, *vcpu_pc(vcpu));
- *vcpu_pc(vcpu) = get_except_vector(vcpu, except_type_sync);
-
- *vcpu_cpsr(vcpu) = get_except64_pstate(vcpu);
- vcpu_write_spsr(vcpu, cpsr);
+ enter_exception64(vcpu, PSR_MODE_EL1h, except_type_sync);
/*
* Build an unknown exception, depending on the instruction
--
2.26.2
^ permalink raw reply related
* [PATCH 22/24] KVM: arm64: Don't use empty structures as CPU reset state
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Alexandru Elisei, Andrew Scull, Ard Biesheuvel, Christoffer Dall,
David Brazdil, Fuad Tabba, James Morse, Jiang Yi, Keqian Zhu,
Mark Rutland, Suzuki K Poulose, Will Deacon, Zenghui Yu,
Julien Thierry, linux-arm-kernel, kvm, kvmarm
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
Keeping empty structure as the vcpu state initializer is slightly
wasteful: we only want to set pstate, and zero everything else.
Just do that.
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
arch/arm64/kvm/reset.c | 21 +++++++++------------
1 file changed, 9 insertions(+), 12 deletions(-)
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 658f3a79617b..865c8aa670bc 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -36,15 +36,11 @@ static u32 kvm_ipa_limit;
/*
* ARMv8 Reset Values
*/
-static const struct kvm_regs default_regs_reset = {
- .regs.pstate = (PSR_MODE_EL1h | PSR_A_BIT | PSR_I_BIT |
- PSR_F_BIT | PSR_D_BIT),
-};
+#define VCPU_RESET_PSTATE_EL1 (PSR_MODE_EL1h | PSR_A_BIT | PSR_I_BIT | \
+ PSR_F_BIT | PSR_D_BIT)
-static const struct kvm_regs default_regs_reset32 = {
- .regs.pstate = (PSR_AA32_MODE_SVC | PSR_AA32_A_BIT |
- PSR_AA32_I_BIT | PSR_AA32_F_BIT),
-};
+#define VCPU_RESET_PSTATE_SVC (PSR_AA32_MODE_SVC | PSR_AA32_A_BIT | \
+ PSR_AA32_I_BIT | PSR_AA32_F_BIT)
static bool cpu_has_32bit_el1(void)
{
@@ -257,9 +253,9 @@ static int kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
*/
int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
{
- const struct kvm_regs *cpu_reset;
int ret = -EINVAL;
bool loaded;
+ u32 pstate;
/* Reset PMU outside of the non-preemptible section */
kvm_pmu_vcpu_reset(vcpu);
@@ -290,16 +286,17 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
if (!cpu_has_32bit_el1())
goto out;
- cpu_reset = &default_regs_reset32;
+ pstate = VCPU_RESET_PSTATE_SVC;
} else {
- cpu_reset = &default_regs_reset;
+ pstate = VCPU_RESET_PSTATE_EL1;
}
break;
}
/* Reset core registers */
- memcpy(vcpu_gp_regs(vcpu), cpu_reset, sizeof(*cpu_reset));
+ memset(vcpu_gp_regs(vcpu), 0, sizeof(*vcpu_gp_regs(vcpu)));
+ vcpu_gp_regs(vcpu)->regs.pstate = pstate;
/* Reset system registers */
kvm_reset_sys_regs(vcpu);
--
2.26.2
^ permalink raw reply related
* [PATCH 23/24] KVM: arm64: Parametrize exception entry with a target EL
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvmarm, kvm, Will Deacon, Jiang Yi, Ard Biesheuvel,
linux-arm-kernel
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
We currently assume that an exception is delivered to EL1, always.
Once we emulate EL2, this no longer will be the case. To prepare
for this, add a target_mode parameter.
While we're at it, merge the computing of the target PC and PSTATE in
a single function that updates both PC and CPSR after saving their
previous values in the corresponding ELR/SPSR. This ensures that they
are updated in the correct order (a pretty common source of bugs...).
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
arch/arm64/include/asm/ptrace.h | 1 +
arch/arm64/kvm/inject_fault.c | 75 +++++++++++++++++----------------
2 files changed, 39 insertions(+), 37 deletions(-)
diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index bf57308fcd63..953b6a1ce549 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -35,6 +35,7 @@
#define GIC_PRIO_PSR_I_SET (1 << 4)
/* Additional SPSR bits not exposed in the UABI */
+#define PSR_MODE_THREAD_BIT (1 << 0)
#define PSR_IL_BIT (1 << 20)
/* AArch32-specific ptrace requests */
diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
index 6aafc2825c1c..e21fdd93027a 100644
--- a/arch/arm64/kvm/inject_fault.c
+++ b/arch/arm64/kvm/inject_fault.c
@@ -26,28 +26,12 @@ enum exception_type {
except_type_serror = 0x180,
};
-static u64 get_except_vector(struct kvm_vcpu *vcpu, enum exception_type type)
-{
- u64 exc_offset;
-
- switch (*vcpu_cpsr(vcpu) & (PSR_MODE_MASK | PSR_MODE32_BIT)) {
- case PSR_MODE_EL1t:
- exc_offset = CURRENT_EL_SP_EL0_VECTOR;
- break;
- case PSR_MODE_EL1h:
- exc_offset = CURRENT_EL_SP_ELx_VECTOR;
- break;
- case PSR_MODE_EL0t:
- exc_offset = LOWER_EL_AArch64_VECTOR;
- break;
- default:
- exc_offset = LOWER_EL_AArch32_VECTOR;
- }
-
- return vcpu_read_sys_reg(vcpu, VBAR_EL1) + exc_offset + type;
-}
-
/*
+ * This performs the exception entry at a given EL (@target_mode), stashing PC
+ * and PSTATE into ELR and SPSR respectively, and compute the new PC/PSTATE.
+ * The EL passed to this function *must* be a non-secure, privileged mode with
+ * bit 0 being set (PSTATE.SP == 1).
+ *
* When an exception is taken, most PSTATE fields are left unchanged in the
* handler. However, some are explicitly overridden (e.g. M[4:0]). Luckily all
* of the inherited bits have the same position in the AArch64/AArch32 SPSR_ELx
@@ -59,10 +43,35 @@ static u64 get_except_vector(struct kvm_vcpu *vcpu, enum exception_type type)
* Here we manipulate the fields in order of the AArch64 SPSR_ELx layout, from
* MSB to LSB.
*/
-static unsigned long get_except64_pstate(struct kvm_vcpu *vcpu)
+static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long target_mode,
+ enum exception_type type)
{
- unsigned long sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1);
- unsigned long old, new;
+ unsigned long sctlr, vbar, old, new, mode;
+ u64 exc_offset;
+
+ mode = *vcpu_cpsr(vcpu) & (PSR_MODE_MASK | PSR_MODE32_BIT);
+
+ if (mode == target_mode)
+ exc_offset = CURRENT_EL_SP_ELx_VECTOR;
+ else if ((mode | PSR_MODE_THREAD_BIT) == target_mode)
+ exc_offset = CURRENT_EL_SP_EL0_VECTOR;
+ else if (!(mode & PSR_MODE32_BIT))
+ exc_offset = LOWER_EL_AArch64_VECTOR;
+ else
+ exc_offset = LOWER_EL_AArch32_VECTOR;
+
+ switch (target_mode) {
+ case PSR_MODE_EL1h:
+ vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
+ sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1);
+ vcpu_write_elr_el1(vcpu, *vcpu_pc(vcpu));
+ break;
+ default:
+ /* Don't do that */
+ BUG();
+ }
+
+ *vcpu_pc(vcpu) = vbar + exc_offset + type;
old = *vcpu_cpsr(vcpu);
new = 0;
@@ -105,9 +114,10 @@ static unsigned long get_except64_pstate(struct kvm_vcpu *vcpu)
new |= PSR_I_BIT;
new |= PSR_F_BIT;
- new |= PSR_MODE_EL1h;
+ new |= target_mode;
- return new;
+ *vcpu_cpsr(vcpu) = new;
+ vcpu_write_spsr(vcpu, old);
}
static void inject_abt64(struct kvm_vcpu *vcpu, bool is_iabt, unsigned long addr)
@@ -116,11 +126,7 @@ static void inject_abt64(struct kvm_vcpu *vcpu, bool is_iabt, unsigned long addr
bool is_aarch32 = vcpu_mode_is_32bit(vcpu);
u32 esr = 0;
- vcpu_write_elr_el1(vcpu, *vcpu_pc(vcpu));
- *vcpu_pc(vcpu) = get_except_vector(vcpu, except_type_sync);
-
- *vcpu_cpsr(vcpu) = get_except64_pstate(vcpu);
- vcpu_write_spsr(vcpu, cpsr);
+ enter_exception64(vcpu, PSR_MODE_EL1h, except_type_sync);
vcpu_write_sys_reg(vcpu, addr, FAR_EL1);
@@ -148,14 +154,9 @@ static void inject_abt64(struct kvm_vcpu *vcpu, bool is_iabt, unsigned long addr
static void inject_undef64(struct kvm_vcpu *vcpu)
{
- unsigned long cpsr = *vcpu_cpsr(vcpu);
u32 esr = (ESR_ELx_EC_UNKNOWN << ESR_ELx_EC_SHIFT);
- vcpu_write_elr_el1(vcpu, *vcpu_pc(vcpu));
- *vcpu_pc(vcpu) = get_except_vector(vcpu, except_type_sync);
-
- *vcpu_cpsr(vcpu) = get_except64_pstate(vcpu);
- vcpu_write_spsr(vcpu, cpsr);
+ enter_exception64(vcpu, PSR_MODE_EL1h, except_type_sync);
/*
* Build an unknown exception, depending on the instruction
--
2.26.2
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
^ permalink raw reply related
* [PATCH 22/24] KVM: arm64: Don't use empty structures as CPU reset state
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvmarm, kvm, Will Deacon, Jiang Yi, Ard Biesheuvel,
linux-arm-kernel
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
Keeping empty structure as the vcpu state initializer is slightly
wasteful: we only want to set pstate, and zero everything else.
Just do that.
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
arch/arm64/kvm/reset.c | 21 +++++++++------------
1 file changed, 9 insertions(+), 12 deletions(-)
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 658f3a79617b..865c8aa670bc 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -36,15 +36,11 @@ static u32 kvm_ipa_limit;
/*
* ARMv8 Reset Values
*/
-static const struct kvm_regs default_regs_reset = {
- .regs.pstate = (PSR_MODE_EL1h | PSR_A_BIT | PSR_I_BIT |
- PSR_F_BIT | PSR_D_BIT),
-};
+#define VCPU_RESET_PSTATE_EL1 (PSR_MODE_EL1h | PSR_A_BIT | PSR_I_BIT | \
+ PSR_F_BIT | PSR_D_BIT)
-static const struct kvm_regs default_regs_reset32 = {
- .regs.pstate = (PSR_AA32_MODE_SVC | PSR_AA32_A_BIT |
- PSR_AA32_I_BIT | PSR_AA32_F_BIT),
-};
+#define VCPU_RESET_PSTATE_SVC (PSR_AA32_MODE_SVC | PSR_AA32_A_BIT | \
+ PSR_AA32_I_BIT | PSR_AA32_F_BIT)
static bool cpu_has_32bit_el1(void)
{
@@ -257,9 +253,9 @@ static int kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
*/
int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
{
- const struct kvm_regs *cpu_reset;
int ret = -EINVAL;
bool loaded;
+ u32 pstate;
/* Reset PMU outside of the non-preemptible section */
kvm_pmu_vcpu_reset(vcpu);
@@ -290,16 +286,17 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
if (!cpu_has_32bit_el1())
goto out;
- cpu_reset = &default_regs_reset32;
+ pstate = VCPU_RESET_PSTATE_SVC;
} else {
- cpu_reset = &default_regs_reset;
+ pstate = VCPU_RESET_PSTATE_EL1;
}
break;
}
/* Reset core registers */
- memcpy(vcpu_gp_regs(vcpu), cpu_reset, sizeof(*cpu_reset));
+ memset(vcpu_gp_regs(vcpu), 0, sizeof(*vcpu_gp_regs(vcpu)));
+ vcpu_gp_regs(vcpu)->regs.pstate = pstate;
/* Reset system registers */
kvm_reset_sys_regs(vcpu);
--
2.26.2
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
^ permalink raw reply related
* [PATCH 06/24] KVM: arm64: Simplify __kvm_timer_set_cntvoff implementation
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Mark Rutland, kvmarm, kvm, Will Deacon, Suzuki K Poulose,
Keqian Zhu, Christoffer Dall, Jiang Yi, James Morse, Andrew Scull,
Zenghui Yu, Julien Thierry, David Brazdil, Alexandru Elisei,
Ard Biesheuvel, Fuad Tabba, linux-arm-kernel
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
Now that this function isn't constrained by the 32bit PCS,
let's simplify it by taking a single 64bit offset instead
of two 32bit parameters.
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
arch/arm64/include/asm/kvm_asm.h | 2 +-
arch/arm64/kvm/arch_timer.c | 12 +-----------
arch/arm64/kvm/hyp/timer-sr.c | 3 +--
3 files changed, 3 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 7c7eeeaab9fa..59e314f38e43 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -64,7 +64,7 @@ extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa);
extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
extern void __kvm_tlb_flush_local_vmid(struct kvm_vcpu *vcpu);
-extern void __kvm_timer_set_cntvoff(u32 cntvoff_low, u32 cntvoff_high);
+extern void __kvm_timer_set_cntvoff(u64 cntvoff);
extern int kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index 93bd59b46848..487eba9f87cd 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -451,17 +451,7 @@ static void timer_restore_state(struct arch_timer_context *ctx)
static void set_cntvoff(u64 cntvoff)
{
- u32 low = lower_32_bits(cntvoff);
- u32 high = upper_32_bits(cntvoff);
-
- /*
- * Since kvm_call_hyp doesn't fully support the ARM PCS especially on
- * 32-bit systems, but rather passes register by register shifted one
- * place (we put the function address in r0/x0), we cannot simply pass
- * a 64-bit value as an argument, but have to split the value in two
- * 32-bit halves.
- */
- kvm_call_hyp(__kvm_timer_set_cntvoff, low, high);
+ kvm_call_hyp(__kvm_timer_set_cntvoff, cntvoff);
}
static inline void set_timer_irq_phys_active(struct arch_timer_context *ctx, bool active)
diff --git a/arch/arm64/kvm/hyp/timer-sr.c b/arch/arm64/kvm/hyp/timer-sr.c
index ff76e6845fe4..fb5c0be33223 100644
--- a/arch/arm64/kvm/hyp/timer-sr.c
+++ b/arch/arm64/kvm/hyp/timer-sr.c
@@ -10,9 +10,8 @@
#include <asm/kvm_hyp.h>
-void __hyp_text __kvm_timer_set_cntvoff(u32 cntvoff_low, u32 cntvoff_high)
+void __hyp_text __kvm_timer_set_cntvoff(u64 cntvoff)
{
- u64 cntvoff = (u64)cntvoff_high << 32 | cntvoff_low;
write_sysreg(cntvoff, cntvoff_el2);
}
--
2.26.2
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related
* [PATCH] block: improve discard bio alignment in __blkdev_issue_discard()
From: Coly Li @ 2020-05-29 16:03 UTC (permalink / raw)
To: linux-block
Cc: Coly Li, Acshai Manoj, Christoph Hellwig, Enzo Matsumiya,
Hannes Reinecke, Jens Axboe, Ming Lei, Xiao Ni
This patch improves discard bio split for address and size alignment in
__blkdev_issue_discard(). The aligned discard bio may help underlying
device controller to perform better discard and internal garbage
collection, and avoid unnecessary internal fragment.
Current discard bio split algorithm in __blkdev_issue_discard() may have
non-discarded fregment on device even the discard bio LBA and size are
both aligned to device's discard granularity size.
Here is the example steps on how to reproduce the above problem.
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the 50GB virtual disk shows up as
/dev/sdb, fill data into the first 50GB by,
# dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
# blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device
# sg_get_lba_status /dev/sdb -m 1048 --lba=0
descriptor LBA: 0x0000000000000000 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000000000800 blocks: 16773120 deallocated
descriptor LBA: 0x0000000000fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000001000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000017ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000001800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000001fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000002000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000027ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000002800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000002fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000003000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000037ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000003800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000003fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000004000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000047ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000004800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000004fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000005000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000057ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000005800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000005fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000006000000 blocks: 6291456 deallocated
descriptor LBA: 0x0000000006600000 blocks: 0 deallocated
Although the discard bio starts at LBA 0 and has 50<<30 bytes size which
are perfect aligned to the discard granularity, from the above list
these are many 1MB (2048 sectors) internal fragments exist unexpectedly.
The problem is in __blkdev_issue_discard(), an improper algorithm causes
an improper bio size which is not aligned.
25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
26 sector_t nr_sects, gfp_t gfp_mask, int flags,
27 struct bio **biop)
28 {
29 struct request_queue *q = bdev_get_queue(bdev);
[snipped]
56
57 while (nr_sects) {
58 sector_t req_sects = min_t(sector_t, nr_sects,
59 bio_allowed_max_sectors(q));
60
61 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
62
63 bio = blk_next_bio(bio, 0, gfp_mask);
64 bio->bi_iter.bi_sector = sector;
65 bio_set_dev(bio, bdev);
66 bio_set_op_attrs(bio, op, 0);
67
68 bio->bi_iter.bi_size = req_sects << 9;
69 sector += req_sects;
70 nr_sects -= req_sects;
[snipped]
79 }
80
81 *biop = bio;
82 return 0;
83 }
84 EXPORT_SYMBOL(__blkdev_issue_discard);
At line 58-59, to discard a 50GB range, req_sets is set as return value
of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
case, the discard granularity is 2048 sectors, although the start LBA
and discard length are aligned to discard granularity, seq_sets never
has chance to be aligned to discard granularity. This is why there are
some still-mapped 2048 sectors segment in every 4 or 8 GB range.
Because queue's max_discard_sectors is aligned to discard granularity,
if req_sects at line 58 is set to a value closest to UINT_MAX and
aligned to q->limits.max_discard_sectors, then all consequent split bios
inside device driver are (almostly) aligned to discard_granularity of
the device queue.
This patch introduces bio_aligned_discard_max_sectors() to return the
closet to UINT_MAX and aligned to q->limits.discard_granularity value,
and replace bio_allowed_max_sectors() with this new inline routine to
decide the split bio length.
But we still need to handle the situation when discard start LBA is not
aligned to q->limits.discard_granularity, otherwise even the length is
aligned, current code may still leave 2048 segment around every 4BG
range. Thereforeto calculate req_sects, firstly the start LBA of discard
request command is checked, if it is not aligned to discard granularity,
the first split location should make sure following bio has bi_sector
aligned to discard granularity. Then there won't be still-mapped segment
in the middle of the discard range.
The above is how this patch improves discard bio alignment in
__blkdev_issue_discard(). Now with this patch, after discard with same
command line mentiond previously, sg_get_lba_status returns,
descriptor LBA: 0x0000000000000000 blocks: 106954752 deallocated
descriptor LBA: 0x0000000006600000 blocks: 0 deallocated
We an see there is no 2048 sectors segment anymore, everything is clean.
Reported-by: Acshai Manoj <acshai.manoj@microfocus.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Xiao Ni <xni@redhat.com>
---
block/blk-lib.c | 25 +++++++++++++++++++++++--
block/blk.h | 13 +++++++++++++
2 files changed, 36 insertions(+), 2 deletions(-)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..0149c8e8a39e 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -55,8 +55,29 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
return -EINVAL;
while (nr_sects) {
- sector_t req_sects = min_t(sector_t, nr_sects,
- bio_allowed_max_sectors(q));
+ sector_t granularity_aligned_lba;
+ sector_t req_sects;
+
+ granularity_aligned_lba =
+ round_up(sector, q->limits.discard_granularity);
+
+ /*
+ * Check whether the discard bio starts at a discard_granularity
+ * aligned LBA,
+ * - If no: set (granularity_aligned_lba - sector) to bi_size of
+ * the first split bio, then the second bio will start at a
+ * discard_granularity aligned LBA.
+ * - If yes: use bio_aligned_discard_max_sectors() as the max
+ * possible bi_size of th first split bio. Then when this bio
+ * is split in device drive, the split ones are always easier
+ * to be aligned to max_discard_sectors of the device's queue.
+ */
+ if (granularity_aligned_lba == sector)
+ req_sects = min_t(sector_t, nr_sects,
+ bio_aligned_discard_max_sectors(q));
+ else
+ req_sects = min_t(sector_t, nr_sects,
+ granularity_aligned_lba - sector);
WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
diff --git a/block/blk.h b/block/blk.h
index 0a94ec68af32..18453eaddc7e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -292,6 +292,19 @@ static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
}
+/*
+ * The max bio size which is aligned to q->limits.max_discard_sectors. This
+ * is a hint to split large discard bio in generic block layer, then if device
+ * driver needs to split the discard bio into smaller ones, their bi_size can
+ * be very probably and easily ligned to max_discard_sectors of the device's
+ * queue.
+ */
+static inline unsigned int bio_aligned_discard_max_sectors(
+ struct request_queue *q)
+{
+ return round_down(UINT_MAX, (q->limits.max_discard_sectors << 9)) >> 9;
+}
+
/*
* Internal io_context interface
*/
--
2.25.0
^ permalink raw reply related
* [PATCH 12/24] KVM: arm64: Unify handling THP backed host memory
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Alexandru Elisei, Andrew Scull, Ard Biesheuvel, Christoffer Dall,
David Brazdil, Fuad Tabba, James Morse, Jiang Yi, Keqian Zhu,
Mark Rutland, Suzuki K Poulose, Will Deacon, Zenghui Yu,
Julien Thierry, linux-arm-kernel, kvm, kvmarm
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
From: Suzuki K Poulose <suzuki.poulose@arm.com>
We support mapping host memory backed by PMD transparent hugepages
at stage2 as huge pages. However the checks are now spread across
two different places. Let us unify the handling of the THPs to
keep the code cleaner (and future proof for PUD THP support).
This patch moves transparent_hugepage_adjust() closer to the caller
to avoid a forward declaration for fault_supports_stage2_huge_mappings().
Also, since we already handle the case where the host VA and the guest
PA may not be aligned, the explicit VM_BUG_ON() is not required.
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200507123546.1875-3-yuzenghui@huawei.com
---
arch/arm64/kvm/mmu.c | 115 ++++++++++++++++++++++---------------------
1 file changed, 60 insertions(+), 55 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ccb44e7d30d9..66eb8e3f6e8c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1375,47 +1375,6 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
return ret;
}
-static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
-{
- kvm_pfn_t pfn = *pfnp;
- gfn_t gfn = *ipap >> PAGE_SHIFT;
-
- if (kvm_is_transparent_hugepage(pfn)) {
- unsigned long mask;
- /*
- * The address we faulted on is backed by a transparent huge
- * page. However, because we map the compound huge page and
- * not the individual tail page, we need to transfer the
- * refcount to the head page. We have to be careful that the
- * THP doesn't start to split while we are adjusting the
- * refcounts.
- *
- * We are sure this doesn't happen, because mmu_notifier_retry
- * was successful and we are holding the mmu_lock, so if this
- * THP is trying to split, it will be blocked in the mmu
- * notifier before touching any of the pages, specifically
- * before being able to call __split_huge_page_refcount().
- *
- * We can therefore safely transfer the refcount from PG_tail
- * to PG_head and switch the pfn from a tail page to the head
- * page accordingly.
- */
- mask = PTRS_PER_PMD - 1;
- VM_BUG_ON((gfn & mask) != (pfn & mask));
- if (pfn & mask) {
- *ipap &= PMD_MASK;
- kvm_release_pfn_clean(pfn);
- pfn &= ~mask;
- kvm_get_pfn(pfn);
- *pfnp = pfn;
- }
-
- return true;
- }
-
- return false;
-}
-
/**
* stage2_wp_ptes - write protect PMD range
* @pmd: pointer to pmd entry
@@ -1663,6 +1622,59 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
(hva & ~(map_size - 1)) + map_size <= uaddr_end;
}
+/*
+ * Check if the given hva is backed by a transparent huge page (THP) and
+ * whether it can be mapped using block mapping in stage2. If so, adjust
+ * the stage2 PFN and IPA accordingly. Only PMD_SIZE THPs are currently
+ * supported. This will need to be updated to support other THP sizes.
+ *
+ * Returns the size of the mapping.
+ */
+static unsigned long
+transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
+ unsigned long hva, kvm_pfn_t *pfnp,
+ phys_addr_t *ipap)
+{
+ kvm_pfn_t pfn = *pfnp;
+
+ /*
+ * Make sure the adjustment is done only for THP pages. Also make
+ * sure that the HVA and IPA are sufficiently aligned and that the
+ * block map is contained within the memslot.
+ */
+ if (kvm_is_transparent_hugepage(pfn) &&
+ fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
+ /*
+ * The address we faulted on is backed by a transparent huge
+ * page. However, because we map the compound huge page and
+ * not the individual tail page, we need to transfer the
+ * refcount to the head page. We have to be careful that the
+ * THP doesn't start to split while we are adjusting the
+ * refcounts.
+ *
+ * We are sure this doesn't happen, because mmu_notifier_retry
+ * was successful and we are holding the mmu_lock, so if this
+ * THP is trying to split, it will be blocked in the mmu
+ * notifier before touching any of the pages, specifically
+ * before being able to call __split_huge_page_refcount().
+ *
+ * We can therefore safely transfer the refcount from PG_tail
+ * to PG_head and switch the pfn from a tail page to the head
+ * page accordingly.
+ */
+ *ipap &= PMD_MASK;
+ kvm_release_pfn_clean(pfn);
+ pfn &= ~(PTRS_PER_PMD - 1);
+ kvm_get_pfn(pfn);
+ *pfnp = pfn;
+
+ return PMD_SIZE;
+ }
+
+ /* Use page mapping if we cannot use block mapping. */
+ return PAGE_SIZE;
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_memory_slot *memslot, unsigned long hva,
unsigned long fault_status)
@@ -1776,20 +1788,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (mmu_notifier_retry(kvm, mmu_seq))
goto out_unlock;
- if (vma_pagesize == PAGE_SIZE && !force_pte) {
- /*
- * Only PMD_SIZE transparent hugepages(THP) are
- * currently supported. This code will need to be
- * updated to support other THP sizes.
- *
- * Make sure the host VA and the guest IPA are sufficiently
- * aligned and that the block is contained within the memslot.
- */
- if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE) &&
- transparent_hugepage_adjust(&pfn, &fault_ipa))
- vma_pagesize = PMD_SIZE;
- }
-
+ /*
+ * If we are not forced to use page mapping, check if we are
+ * backed by a THP and thus use block mapping if possible.
+ */
+ if (vma_pagesize == PAGE_SIZE && !force_pte)
+ vma_pagesize = transparent_hugepage_adjust(memslot, hva,
+ &pfn, &fault_ipa);
if (writable)
kvm_set_pfn_dirty(pfn);
--
2.26.2
^ permalink raw reply related
* [PATCH 12/24] KVM: arm64: Unify handling THP backed host memory
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvmarm, kvm, Will Deacon, Jiang Yi, Ard Biesheuvel,
linux-arm-kernel
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
From: Suzuki K Poulose <suzuki.poulose@arm.com>
We support mapping host memory backed by PMD transparent hugepages
at stage2 as huge pages. However the checks are now spread across
two different places. Let us unify the handling of the THPs to
keep the code cleaner (and future proof for PUD THP support).
This patch moves transparent_hugepage_adjust() closer to the caller
to avoid a forward declaration for fault_supports_stage2_huge_mappings().
Also, since we already handle the case where the host VA and the guest
PA may not be aligned, the explicit VM_BUG_ON() is not required.
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200507123546.1875-3-yuzenghui@huawei.com
---
arch/arm64/kvm/mmu.c | 115 ++++++++++++++++++++++---------------------
1 file changed, 60 insertions(+), 55 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ccb44e7d30d9..66eb8e3f6e8c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1375,47 +1375,6 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
return ret;
}
-static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
-{
- kvm_pfn_t pfn = *pfnp;
- gfn_t gfn = *ipap >> PAGE_SHIFT;
-
- if (kvm_is_transparent_hugepage(pfn)) {
- unsigned long mask;
- /*
- * The address we faulted on is backed by a transparent huge
- * page. However, because we map the compound huge page and
- * not the individual tail page, we need to transfer the
- * refcount to the head page. We have to be careful that the
- * THP doesn't start to split while we are adjusting the
- * refcounts.
- *
- * We are sure this doesn't happen, because mmu_notifier_retry
- * was successful and we are holding the mmu_lock, so if this
- * THP is trying to split, it will be blocked in the mmu
- * notifier before touching any of the pages, specifically
- * before being able to call __split_huge_page_refcount().
- *
- * We can therefore safely transfer the refcount from PG_tail
- * to PG_head and switch the pfn from a tail page to the head
- * page accordingly.
- */
- mask = PTRS_PER_PMD - 1;
- VM_BUG_ON((gfn & mask) != (pfn & mask));
- if (pfn & mask) {
- *ipap &= PMD_MASK;
- kvm_release_pfn_clean(pfn);
- pfn &= ~mask;
- kvm_get_pfn(pfn);
- *pfnp = pfn;
- }
-
- return true;
- }
-
- return false;
-}
-
/**
* stage2_wp_ptes - write protect PMD range
* @pmd: pointer to pmd entry
@@ -1663,6 +1622,59 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
(hva & ~(map_size - 1)) + map_size <= uaddr_end;
}
+/*
+ * Check if the given hva is backed by a transparent huge page (THP) and
+ * whether it can be mapped using block mapping in stage2. If so, adjust
+ * the stage2 PFN and IPA accordingly. Only PMD_SIZE THPs are currently
+ * supported. This will need to be updated to support other THP sizes.
+ *
+ * Returns the size of the mapping.
+ */
+static unsigned long
+transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
+ unsigned long hva, kvm_pfn_t *pfnp,
+ phys_addr_t *ipap)
+{
+ kvm_pfn_t pfn = *pfnp;
+
+ /*
+ * Make sure the adjustment is done only for THP pages. Also make
+ * sure that the HVA and IPA are sufficiently aligned and that the
+ * block map is contained within the memslot.
+ */
+ if (kvm_is_transparent_hugepage(pfn) &&
+ fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
+ /*
+ * The address we faulted on is backed by a transparent huge
+ * page. However, because we map the compound huge page and
+ * not the individual tail page, we need to transfer the
+ * refcount to the head page. We have to be careful that the
+ * THP doesn't start to split while we are adjusting the
+ * refcounts.
+ *
+ * We are sure this doesn't happen, because mmu_notifier_retry
+ * was successful and we are holding the mmu_lock, so if this
+ * THP is trying to split, it will be blocked in the mmu
+ * notifier before touching any of the pages, specifically
+ * before being able to call __split_huge_page_refcount().
+ *
+ * We can therefore safely transfer the refcount from PG_tail
+ * to PG_head and switch the pfn from a tail page to the head
+ * page accordingly.
+ */
+ *ipap &= PMD_MASK;
+ kvm_release_pfn_clean(pfn);
+ pfn &= ~(PTRS_PER_PMD - 1);
+ kvm_get_pfn(pfn);
+ *pfnp = pfn;
+
+ return PMD_SIZE;
+ }
+
+ /* Use page mapping if we cannot use block mapping. */
+ return PAGE_SIZE;
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_memory_slot *memslot, unsigned long hva,
unsigned long fault_status)
@@ -1776,20 +1788,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (mmu_notifier_retry(kvm, mmu_seq))
goto out_unlock;
- if (vma_pagesize == PAGE_SIZE && !force_pte) {
- /*
- * Only PMD_SIZE transparent hugepages(THP) are
- * currently supported. This code will need to be
- * updated to support other THP sizes.
- *
- * Make sure the host VA and the guest IPA are sufficiently
- * aligned and that the block is contained within the memslot.
- */
- if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE) &&
- transparent_hugepage_adjust(&pfn, &fault_ipa))
- vma_pagesize = PMD_SIZE;
- }
-
+ /*
+ * If we are not forced to use page mapping, check if we are
+ * backed by a THP and thus use block mapping if possible.
+ */
+ if (vma_pagesize == PAGE_SIZE && !force_pte)
+ vma_pagesize = transparent_hugepage_adjust(memslot, hva,
+ &pfn, &fault_ipa);
if (writable)
kvm_set_pfn_dirty(pfn);
--
2.26.2
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
^ permalink raw reply related
* [PATCH 10/24] KVM: arm/arm64: Release kvm->mmu_lock in loop to prevent starvation
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Alexandru Elisei, Andrew Scull, Ard Biesheuvel, Christoffer Dall,
David Brazdil, Fuad Tabba, James Morse, Jiang Yi, Keqian Zhu,
Mark Rutland, Suzuki K Poulose, Will Deacon, Zenghui Yu,
Julien Thierry, linux-arm-kernel, kvm, kvmarm
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
From: Jiang Yi <giangyi@amazon.com>
Do cond_resched_lock() in stage2_flush_memslot() like what is done in
unmap_stage2_range() and other places holding mmu_lock while processing
a possibly large range of memory.
Signed-off-by: Jiang Yi <giangyi@amazon.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20200415084229.29992-1-giangyi@amazon.com
---
arch/arm64/kvm/mmu.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 29d8f24df944..917363375e8a 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -422,6 +422,9 @@ static void stage2_flush_memslot(struct kvm *kvm,
next = stage2_pgd_addr_end(kvm, addr, end);
if (!stage2_pgd_none(kvm, *pgd))
stage2_flush_puds(kvm, pgd, addr, next);
+
+ if (next != end)
+ cond_resched_lock(&kvm->mmu_lock);
} while (pgd++, addr = next, addr != end);
}
--
2.26.2
^ permalink raw reply related
* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
From: Feng Tang @ 2020-05-29 16:04 UTC (permalink / raw)
To: Andi Kleen
Cc: Qian Cai, Michal Hocko, Andrew Morton, Johannes Weiner,
Stephen Rothwell, Matthew Wilcox, Mel Gorman, Kees Cook,
Luis Chamberlain, Iurii Zaikin, Chen, Tim C, Hansen, Dave,
Huang, Ying, linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <20200529155025.GC621576@tassilo.jf.intel.com>
On Fri, May 29, 2020 at 08:50:25AM -0700, Andi Kleen wrote:
> >
> > ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> > - if (ret == 0 && write)
> > + if (ret == 0 && write) {
> > + if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> > + schedule_on_each_cpu(sync_overcommit_as);
>
> The schedule is not atomic.
> There's still a race window here over all the CPUs where the WARN_ON could
> happen because you change the global first.
The re-computing of batch number comes after this sync, so at this
point the batch is still the bigger one, and won't trigger the warning.
> Probably you would need another global that says "i'm currently changing
> the mode" and then skip the WARN_ON in that window. Maybe a sequence lock.
>
> Seems all overkill to me. Better to kill the warning.
Yes, the cost is high, schedule_on_each_cpu is labeled as "very slow"
in the code comments.
Thanks,
Feng
>
> -Andi
^ permalink raw reply
* [PATCH 10/24] KVM: arm/arm64: Release kvm->mmu_lock in loop to prevent starvation
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvmarm, kvm, Will Deacon, Jiang Yi, Ard Biesheuvel,
linux-arm-kernel
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
From: Jiang Yi <giangyi@amazon.com>
Do cond_resched_lock() in stage2_flush_memslot() like what is done in
unmap_stage2_range() and other places holding mmu_lock while processing
a possibly large range of memory.
Signed-off-by: Jiang Yi <giangyi@amazon.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20200415084229.29992-1-giangyi@amazon.com
---
arch/arm64/kvm/mmu.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 29d8f24df944..917363375e8a 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -422,6 +422,9 @@ static void stage2_flush_memslot(struct kvm *kvm,
next = stage2_pgd_addr_end(kvm, addr, end);
if (!stage2_pgd_none(kvm, *pgd))
stage2_flush_puds(kvm, pgd, addr, next);
+
+ if (next != end)
+ cond_resched_lock(&kvm->mmu_lock);
} while (pgd++, addr = next, addr != end);
}
--
2.26.2
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
^ permalink raw reply related
* [PATCH 13/24] KVM: arm64: Support enabling dirty log gradually in small chunks
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Alexandru Elisei, Andrew Scull, Ard Biesheuvel, Christoffer Dall,
David Brazdil, Fuad Tabba, James Morse, Jiang Yi, Keqian Zhu,
Mark Rutland, Suzuki K Poulose, Will Deacon, Zenghui Yu,
Julien Thierry, linux-arm-kernel, kvm, kvmarm
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
From: Keqian Zhu <zhukeqian1@huawei.com>
There is already support of enabling dirty log gradually in small chunks
for x86 in commit 3c9bd4006bfc ("KVM: x86: enable dirty log gradually in
small chunks"). This adds support for arm64.
x86 still writes protect all huge pages when DIRTY_LOG_INITIALLY_ALL_SET
is enabled. However, for arm64, both huge pages and normal pages can be
write protected gradually by userspace.
Under the Huawei Kunpeng 920 2.6GHz platform, I did some tests on 128G
Linux VMs with different page size. The memory pressure is 127G in each
case. The time taken of memory_global_dirty_log_start in QEMU is listed
below:
Page Size Before After Optimization
4K 650ms 1.8ms
2M 4ms 1.8ms
1G 2ms 1.8ms
Besides the time reduction, the biggest improvement is that we will minimize
the performance side effect (because of dissolving huge pages and marking
memslots dirty) on guest after enabling dirty log.
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200413122023.52583-1-zhukeqian1@huawei.com
---
Documentation/virt/kvm/api.rst | 2 +-
arch/arm64/include/asm/kvm_host.h | 3 +++
arch/arm64/kvm/mmu.c | 12 ++++++++++--
3 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index efbbe570aa9b..0017f63fa44f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5777,7 +5777,7 @@ will be initialized to 1 when created. This also improves performance because
dirty logging can be enabled gradually in small chunks on the first call
to KVM_CLEAR_DIRTY_LOG. KVM_DIRTY_LOG_INITIALLY_SET depends on
KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (it is also only available on
-x86 for now).
+x86 and arm64 for now).
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 was previously available under the name
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT, but the implementation had bugs that make
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 32c8a675e5a4..a723f84fab83 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -46,6 +46,9 @@
#define KVM_REQ_RECORD_STEAL KVM_ARCH_REQ(3)
#define KVM_REQ_RELOAD_GICv4 KVM_ARCH_REQ(4)
+#define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
+ KVM_DIRTY_LOG_INITIALLY_SET)
+
DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
extern unsigned int kvm_sve_max_vl;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 66eb8e3f6e8c..ddf85bf21897 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2277,8 +2277,16 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
* allocated dirty_bitmap[], dirty pages will be tracked while the
* memory slot is write protected.
*/
- if (change != KVM_MR_DELETE && mem->flags & KVM_MEM_LOG_DIRTY_PAGES)
- kvm_mmu_wp_memory_region(kvm, mem->slot);
+ if (change != KVM_MR_DELETE && mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+ /*
+ * If we're with initial-all-set, we don't need to write
+ * protect any pages because they're all reported as dirty.
+ * Huge pages and normal pages will be write protect gradually.
+ */
+ if (!kvm_dirty_log_manual_protect_and_init_set(kvm)) {
+ kvm_mmu_wp_memory_region(kvm, mem->slot);
+ }
+ }
}
int kvm_arch_prepare_memory_region(struct kvm *kvm,
--
2.26.2
^ permalink raw reply related
* [PATCH 13/24] KVM: arm64: Support enabling dirty log gradually in small chunks
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvmarm, kvm, Will Deacon, Jiang Yi, Ard Biesheuvel,
linux-arm-kernel
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
From: Keqian Zhu <zhukeqian1@huawei.com>
There is already support of enabling dirty log gradually in small chunks
for x86 in commit 3c9bd4006bfc ("KVM: x86: enable dirty log gradually in
small chunks"). This adds support for arm64.
x86 still writes protect all huge pages when DIRTY_LOG_INITIALLY_ALL_SET
is enabled. However, for arm64, both huge pages and normal pages can be
write protected gradually by userspace.
Under the Huawei Kunpeng 920 2.6GHz platform, I did some tests on 128G
Linux VMs with different page size. The memory pressure is 127G in each
case. The time taken of memory_global_dirty_log_start in QEMU is listed
below:
Page Size Before After Optimization
4K 650ms 1.8ms
2M 4ms 1.8ms
1G 2ms 1.8ms
Besides the time reduction, the biggest improvement is that we will minimize
the performance side effect (because of dissolving huge pages and marking
memslots dirty) on guest after enabling dirty log.
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200413122023.52583-1-zhukeqian1@huawei.com
---
Documentation/virt/kvm/api.rst | 2 +-
arch/arm64/include/asm/kvm_host.h | 3 +++
arch/arm64/kvm/mmu.c | 12 ++++++++++--
3 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index efbbe570aa9b..0017f63fa44f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5777,7 +5777,7 @@ will be initialized to 1 when created. This also improves performance because
dirty logging can be enabled gradually in small chunks on the first call
to KVM_CLEAR_DIRTY_LOG. KVM_DIRTY_LOG_INITIALLY_SET depends on
KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (it is also only available on
-x86 for now).
+x86 and arm64 for now).
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 was previously available under the name
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT, but the implementation had bugs that make
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 32c8a675e5a4..a723f84fab83 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -46,6 +46,9 @@
#define KVM_REQ_RECORD_STEAL KVM_ARCH_REQ(3)
#define KVM_REQ_RELOAD_GICv4 KVM_ARCH_REQ(4)
+#define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
+ KVM_DIRTY_LOG_INITIALLY_SET)
+
DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
extern unsigned int kvm_sve_max_vl;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 66eb8e3f6e8c..ddf85bf21897 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2277,8 +2277,16 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
* allocated dirty_bitmap[], dirty pages will be tracked while the
* memory slot is write protected.
*/
- if (change != KVM_MR_DELETE && mem->flags & KVM_MEM_LOG_DIRTY_PAGES)
- kvm_mmu_wp_memory_region(kvm, mem->slot);
+ if (change != KVM_MR_DELETE && mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+ /*
+ * If we're with initial-all-set, we don't need to write
+ * protect any pages because they're all reported as dirty.
+ * Huge pages and normal pages will be write protect gradually.
+ */
+ if (!kvm_dirty_log_manual_protect_and_init_set(kvm)) {
+ kvm_mmu_wp_memory_region(kvm, mem->slot);
+ }
+ }
}
int kvm_arch_prepare_memory_region(struct kvm *kvm,
--
2.26.2
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
^ permalink raw reply related
* [PATCH 07/24] KVM: arm64: Use cpus_have_final_cap for has_vhe()
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Mark Rutland, kvmarm, kvm, Will Deacon, Suzuki K Poulose,
Keqian Zhu, Christoffer Dall, Jiang Yi, James Morse, Andrew Scull,
Zenghui Yu, Julien Thierry, David Brazdil, Alexandru Elisei,
Ard Biesheuvel, Fuad Tabba, linux-arm-kernel
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
By the time we start using the has_vhe() helper, we have long
discovered whether we are running VHE or not. It thus makes
sense to use cpus_have_final_cap() instead of cpus_have_const_cap(),
which leads to a small text size reduction.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: David Brazdil <dbrazdil@google.com>
Link: https://lore.kernel.org/r/20200513103828.74580-1-maz@kernel.org
---
arch/arm64/include/asm/virt.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
index 61fd26752adc..5051b388c654 100644
--- a/arch/arm64/include/asm/virt.h
+++ b/arch/arm64/include/asm/virt.h
@@ -85,7 +85,7 @@ static inline bool is_kernel_in_hyp_mode(void)
static __always_inline bool has_vhe(void)
{
- if (cpus_have_const_cap(ARM64_HAS_VIRT_HOST_EXTN))
+ if (cpus_have_final_cap(ARM64_HAS_VIRT_HOST_EXTN))
return true;
return false;
--
2.26.2
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related
* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
From: Qais Yousef @ 2020-05-29 16:04 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
Quentin Perret, Valentin Schneider, Patrick Bellasi,
Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel
In-Reply-To: <20200529100806.GA3070@suse.de>
On 05/29/20 11:08, Mel Gorman wrote:
> On Thu, May 28, 2020 at 06:11:12PM +0200, Peter Zijlstra wrote:
> > > FWIW, I think you're referring to Mel's notice in OSPM regarding the overhead.
> > > Trying to see what goes on in there.
> >
> > Indeed, that one. The fact that regular distros cannot enable this
> > feature due to performance overhead is unfortunate. It means there is a
> > lot less potential for this stuff.
>
> During that talk, I was a vague about the cost, admitted I had not looked
> too closely at mainline performance and had since deleted the data given
> that the problem was first spotted in early April. If I heard someone
> else making statements like I did at the talk, I would consider it a bit
> vague, potentially FUD, possibly wrong and worth rechecking myself. In
> terms of distributions "cannot enable this", we could but I was unwilling
> to pay the cost for a feature no one has asked for yet. If they had, I
> would endevour to put it behind static branches and disable it by default
> (like what happened for PSI). I was contacted offlist about my comments
> at OSPM and gathered new data to respond properly. For the record, here
> is an editted version of my response;
I had this impression too that's why I had a rather humble attempt.
[...]
> # Event 'cycles:ppp'
> #
> # Baseline Delta Abs Shared Object Symbol
> # ........ ......... ........................ ..............................................
> #
> 9.59% -2.87% [kernel.vmlinux] [k] poll_idle
> 0.19% +1.85% [kernel.vmlinux] [k] activate_task
> +1.17% [kernel.vmlinux] [k] dequeue_task
> +0.89% [kernel.vmlinux] [k] update_rq_clock.part.73
> 3.88% +0.73% [kernel.vmlinux] [k] try_to_wake_up
> 3.17% +0.68% [kernel.vmlinux] [k] __schedule
> 1.16% -0.60% [kernel.vmlinux] [k] __update_load_avg_cfs_rq
> 2.20% -0.54% [kernel.vmlinux] [k] resched_curr
> 2.08% -0.29% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> 0.44% -0.29% [kernel.vmlinux] [k] cpus_share_cache
> 1.13% +0.23% [kernel.vmlinux] [k] _raw_spin_lock_bh
>
> A lot of the uclamp functions appear to be inlined so it is not be
> particularly obvious from a raw profile but it shows up in the annotated
> profile in activate_task and dequeue_task for example. In the case of
> dequeue_task, uclamp_rq_dec_id() is extremely expensive according to the
> annotated profile.
>
> I'm afraid I did not dig into this deeply once I knew I could just disable
> it even within the distribution.
Could by any chance the vmlinux (with debug symbols hopefully) and perf.dat are
still lying around to share?
I could send you a link to drop them somewhere.
Thanks
--
Qais Yousef
^ permalink raw reply
* [PATCH 09/24] KVM: arm64: Sidestep stage2_unmap_vm() on vcpu reset when S2FWB is supported
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Mark Rutland, kvmarm, kvm, Will Deacon, Suzuki K Poulose,
Keqian Zhu, Christoffer Dall, Jiang Yi, James Morse, Andrew Scull,
Zenghui Yu, Julien Thierry, David Brazdil, Alexandru Elisei,
Ard Biesheuvel, Fuad Tabba, linux-arm-kernel
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
From: Zenghui Yu <yuzenghui@huawei.com>
stage2_unmap_vm() was introduced to unmap user RAM region in the stage2
page table to make the caches coherent. E.g., a guest reboot with stage1
MMU disabled will access memory using non-cacheable attributes. If the
RAM and caches are not coherent at this stage, some evicted dirty cache
line may go and corrupt guest data in RAM.
Since ARMv8.4, S2FWB feature is mandatory and KVM will take advantage
of it to configure the stage2 page table and the attributes of memory
access. So we ensure that guests always access memory using cacheable
attributes and thus, the caches always be coherent.
So on CPUs that support S2FWB, we can safely reset the vcpu without a
heavy stage2 unmapping.
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200415072835.1164-1-yuzenghui@huawei.com
---
arch/arm64/kvm/arm.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index ee1b5bba1d08..0ea9a0266d9a 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -983,8 +983,11 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
/*
* Ensure a rebooted VM will fault in RAM pages and detect if the
* guest MMU is turned off and flush the caches as needed.
+ *
+ * S2FWB enforces all memory accesses to RAM being cacheable, we
+ * ensure that the cache is always coherent.
*/
- if (vcpu->arch.has_run_once)
+ if (vcpu->arch.has_run_once && !cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
stage2_unmap_vm(vcpu->kvm);
vcpu_reset_hcr(vcpu);
--
2.26.2
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related
* [PATCH 08/24] KVM: Fix spelling in code comments
From: Marc Zyngier @ 2020-05-29 16:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Mark Rutland, kvmarm, kvm, Will Deacon, Suzuki K Poulose,
Keqian Zhu, Christoffer Dall, Jiang Yi, James Morse, Andrew Scull,
Zenghui Yu, Julien Thierry, David Brazdil, Alexandru Elisei,
Ard Biesheuvel, Fuad Tabba, linux-arm-kernel
In-Reply-To: <20200529160121.899083-1-maz@kernel.org>
From: Fuad Tabba <tabba@google.com>
Fix spelling and typos (e.g., repeated words) in comments.
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200401140310.29701-1-tabba@google.com
---
arch/arm64/kvm/arm.c | 6 +++---
arch/arm64/kvm/guest.c | 4 ++--
arch/arm64/kvm/hyp/vgic-v3-sr.c | 2 +-
arch/arm64/kvm/mmio.c | 2 +-
arch/arm64/kvm/mmu.c | 6 +++---
arch/arm64/kvm/psci.c | 6 +++---
arch/arm64/kvm/reset.c | 6 +++---
arch/arm64/kvm/sys_regs.c | 6 +++---
arch/arm64/kvm/vgic/vgic-v3.c | 2 +-
virt/kvm/coalesced_mmio.c | 2 +-
virt/kvm/eventfd.c | 2 +-
virt/kvm/kvm_main.c | 2 +-
12 files changed, 23 insertions(+), 23 deletions(-)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index c958bb37b769..ee1b5bba1d08 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -455,9 +455,9 @@ void force_vm_exit(const cpumask_t *mask)
*
* The hardware supports a limited set of values with the value zero reserved
* for the host, so we check if an assigned value belongs to a previous
- * generation, which which requires us to assign a new value. If we're the
- * first to use a VMID for the new generation, we must flush necessary caches
- * and TLBs on all CPUs.
+ * generation, which requires us to assign a new value. If we're the first to
+ * use a VMID for the new generation, we must flush necessary caches and TLBs
+ * on all CPUs.
*/
static bool need_new_vmid_gen(struct kvm_vmid *vmid)
{
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 50a279d3ddd7..871d51729b63 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -267,7 +267,7 @@ static int set_sve_vls(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
/*
* Vector lengths supported by the host can't currently be
* hidden from the guest individually: instead we can only set a
- * maxmium via ZCR_EL2.LEN. So, make sure the available vector
+ * maximum via ZCR_EL2.LEN. So, make sure the available vector
* lengths match the set requested exactly up to the requested
* maximum:
*/
@@ -337,7 +337,7 @@ static int sve_reg_to_region(struct sve_state_reg_region *region,
unsigned int reg_num;
unsigned int reqoffset, reqlen; /* User-requested offset and length */
- unsigned int maxlen; /* Maxmimum permitted length */
+ unsigned int maxlen; /* Maximum permitted length */
size_t sve_state_size;
diff --git a/arch/arm64/kvm/hyp/vgic-v3-sr.c b/arch/arm64/kvm/hyp/vgic-v3-sr.c
index 49fedf6710f9..6b85773e15c4 100644
--- a/arch/arm64/kvm/hyp/vgic-v3-sr.c
+++ b/arch/arm64/kvm/hyp/vgic-v3-sr.c
@@ -577,7 +577,7 @@ static u8 __hyp_text __vgic_v3_pri_to_pre(u8 pri, u32 vmcr, int grp)
/*
* The priority value is independent of any of the BPR values, so we
- * normalize it using the minumal BPR value. This guarantees that no
+ * normalize it using the minimal BPR value. This guarantees that no
* matter what the guest does with its BPR, we can always set/get the
* same value of a priority.
*/
diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
index aedfcff99ac5..4e0366759726 100644
--- a/arch/arm64/kvm/mmio.c
+++ b/arch/arm64/kvm/mmio.c
@@ -131,7 +131,7 @@ int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
/*
* No valid syndrome? Ask userspace for help if it has
- * voluntered to do so, and bail out otherwise.
+ * volunteered to do so, and bail out otherwise.
*/
if (!kvm_vcpu_dabt_isvalid(vcpu)) {
if (vcpu->kvm->arch.return_nisv_io_abort_to_user) {
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index e3b9ee268823..29d8f24df944 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -784,7 +784,7 @@ static int __create_hyp_private_mapping(phys_addr_t phys_addr, size_t size,
mutex_lock(&kvm_hyp_pgd_mutex);
/*
- * This assumes that we we have enough space below the idmap
+ * This assumes that we have enough space below the idmap
* page to allocate our VAs. If not, the check below will
* kick. A potential alternative would be to detect that
* overflow and switch to an allocation above the idmap.
@@ -964,7 +964,7 @@ static void stage2_unmap_memslot(struct kvm *kvm,
* stage2_unmap_vm - Unmap Stage-2 RAM mappings
* @kvm: The struct kvm pointer
*
- * Go through the memregions and unmap any reguler RAM
+ * Go through the memregions and unmap any regular RAM
* backing memory already mapped to the VM.
*/
void stage2_unmap_vm(struct kvm *kvm)
@@ -2262,7 +2262,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
{
/*
* At this point memslot has been committed and there is an
- * allocated dirty_bitmap[], dirty pages will be be tracked while the
+ * allocated dirty_bitmap[], dirty pages will be tracked while the
* memory slot is write protected.
*/
if (change != KVM_MR_DELETE && mem->flags & KVM_MEM_LOG_DIRTY_PAGES)
diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
index ae364716ee40..83415e96b589 100644
--- a/arch/arm64/kvm/psci.c
+++ b/arch/arm64/kvm/psci.c
@@ -94,7 +94,7 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
/*
* NOTE: We always update r0 (or x0) because for PSCI v0.1
- * the general puspose registers are undefined upon CPU_ON.
+ * the general purpose registers are undefined upon CPU_ON.
*/
reset_state->r0 = smccc_get_arg3(source_vcpu);
@@ -265,10 +265,10 @@ static int kvm_psci_0_2_call(struct kvm_vcpu *vcpu)
case PSCI_0_2_FN_SYSTEM_OFF:
kvm_psci_system_off(vcpu);
/*
- * We should'nt be going back to guest VCPU after
+ * We shouldn't be going back to guest VCPU after
* receiving SYSTEM_OFF request.
*
- * If user space accidently/deliberately resumes
+ * If user space accidentally/deliberately resumes
* guest VCPU after SYSTEM_OFF request then guest
* VCPU should see internal failure from PSCI return
* value. To achieve this, we preload r0 (or x0) with
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 30b7ea680f66..658f3a79617b 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -163,7 +163,7 @@ static int kvm_vcpu_finalize_sve(struct kvm_vcpu *vcpu)
vl = vcpu->arch.sve_max_vl;
/*
- * Resposibility for these properties is shared between
+ * Responsibility for these properties is shared between
* kvm_arm_init_arch_resources(), kvm_vcpu_enable_sve() and
* set_sve_vls(). Double-check here just to be sure:
*/
@@ -249,7 +249,7 @@ static int kvm_vcpu_enable_ptrauth(struct kvm_vcpu *vcpu)
* ioctl or as part of handling a request issued by another VCPU in the PSCI
* handling code. In the first case, the VCPU will not be loaded, and in the
* second case the VCPU will be loaded. Because this function operates purely
- * on the memory-backed valus of system registers, we want to do a full put if
+ * on the memory-backed values of system registers, we want to do a full put if
* we were loaded (handling a request) and load the values back at the end of
* the function. Otherwise we leave the state alone. In both cases, we
* disable preemption around the vcpu reset as we would otherwise race with
@@ -357,7 +357,7 @@ void kvm_set_ipa_limit(void)
*
* So clamp the ipa limit further down to limit the number of levels.
* Since we can concatenate upto 16 tables at entry level, we could
- * go upto 4bits above the maximum VA addressible with the current
+ * go upto 4bits above the maximum VA addressable with the current
* number of levels.
*/
va_max = PGDIR_SHIFT + PAGE_SHIFT - 3;
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 51db934702b6..620eaf11e672 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -34,7 +34,7 @@
#include "trace.h"
/*
- * All of this file is extremly similar to the ARM coproc.c, but the
+ * All of this file is extremely similar to the ARM coproc.c, but the
* types are different. My gut feeling is that it should be pretty
* easy to merge, but that would be an ABI breakage -- again. VFP
* would also need to be abstracted.
@@ -118,8 +118,8 @@ void vcpu_write_sys_reg(struct kvm_vcpu *vcpu, u64 val, int reg)
* entry to the guest but are only restored on vcpu_load.
*
* Note that MPIDR_EL1 for the guest is set by KVM via VMPIDR_EL2 but
- * should never be listed below, because the the MPIDR should only be
- * set once, before running the VCPU, and never changed later.
+ * should never be listed below, because the MPIDR should only be set
+ * once, before running the VCPU, and never changed later.
*/
switch (reg) {
case CSSELR_EL1: write_sysreg_s(val, SYS_CSSELR_EL1); return;
diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
index 5bc2ab58954b..3ccd6d3cb4d3 100644
--- a/arch/arm64/kvm/vgic/vgic-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-v3.c
@@ -587,7 +587,7 @@ int vgic_v3_probe(const struct gic_kvm_info *info)
int ret;
/*
- * The ListRegs field is 5 bits, but there is a architectural
+ * The ListRegs field is 5 bits, but there is an architectural
* maximum of 16 list registers. Just ignore bit 4...
*/
kvm_vgic_global_state.nr_lr = (ich_vtr_el2 & 0xf) + 1;
diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c
index 00c747dbc82e..e2c197fd4f9d 100644
--- a/virt/kvm/coalesced_mmio.c
+++ b/virt/kvm/coalesced_mmio.c
@@ -119,7 +119,7 @@ int kvm_coalesced_mmio_init(struct kvm *kvm)
/*
* We're using this spinlock to sync access to the coalesced ring.
- * The list doesn't need it's own lock since device registration and
+ * The list doesn't need its own lock since device registration and
* unregistration should only happen when kvm->slots_lock is held.
*/
spin_lock_init(&kvm->ring_lock);
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 67b6fc153e9c..e586d1395c28 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -116,7 +116,7 @@ irqfd_shutdown(struct work_struct *work)
struct kvm *kvm = irqfd->kvm;
u64 cnt;
- /* Make sure irqfd has been initalized in assign path. */
+ /* Make sure irqfd has been initialized in assign path. */
synchronize_srcu(&kvm->irq_srcu);
/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 74bdb7bf3295..f57792b1541b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2799,7 +2799,7 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);
*
* (a) VCPU which has not done pl-exit or cpu relax intercepted recently
* (preempted lock holder), indicated by @in_spin_loop.
- * Set at the beiginning and cleared at the end of interception/PLE handler.
+ * Set at the beginning and cleared at the end of interception/PLE handler.
*
* (b) VCPU which has done pl-exit/ cpu relax intercepted but did not get
* chance last time (mostly it has become eligible now since we have probably
--
2.26.2
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related
* Re: CFLAGS overridden by distribution build system
From: William Roberts @ 2020-05-29 16:04 UTC (permalink / raw)
To: Stephen Smalley; +Cc: Laurent Bigonville, SElinux list
In-Reply-To: <CAEjxPJ6XwXLBKYFaNO-kZz-Vgvbb2B7VKqD3DKQFf_ebBuNiBg@mail.gmail.com>
On Fri, May 29, 2020 at 8:31 AM Stephen Smalley
<stephen.smalley.work@gmail.com> wrote:
>
> On Sat, May 23, 2020 at 11:59 AM William Roberts
> <bill.c.roberts@gmail.com> wrote:
> >
> > On Sat, May 23, 2020 at 5:57 AM Laurent Bigonville <bigon@debian.org> wrote:
> > >
> > > Hello,
> > >
> > > The current build system of the userspace is setting a lot of CFLAGS,
> > > but most of these are overridden by the distributions when building.
> > >
> > > Today I received a bug report[0] from Christian Göttsche asking me to
> > > set -fno-semantic-interposition again in libsepol. I see also the same
> > > flag and also a lot of others set in libselinux and libsemanage build
> > > system.
> >
> > Why would it be again? The old DSO design made that option impotent.
> > Clang does it by default.
> >
> > >
> > > For what I understand some of these are just needed for code quality
> > > (-W) and could be controlled by distributions but others might actually
> > > need to be always set (-f?).
> >
> > If you look at the Makefiles overall in all the projects, you'll see hardening
> > flags, etc. Libsepol has a pretty minimal set compared to say libselinux, but
> > they all get overridden by build time CFLAGS if you want.
> >
> > >
> > > Shouldn't the flags that always need to be set (which ones?) be moved to
> > > a "override CFLAGS" directive to avoid these to be unset by distributions?
> >
> > If you we're cleaver with CFLAGS before, you could acually circumvent
> > the buildtime
> > DSO stuff. While i'm not opposed to adding it to immutable flag, if
> > you're setting
> > CFLAGS, you're on your own. At least with the current design.
> >
> > I have no issues with adding it to override, but we would have to
> > overhaul a bunch
> > of them and make them consistent.
>
> I'm inclined to agree with Laurent here - we should always set this
> flag in order to preserve the same behavior prior to the patches that
> removed hidden_def/hidden_proto and switched to relying on this
> instead.
Theirs a lot of features that rely on particular cflags, consider hardening
options. A lot of the warnings/error stuff is not just a code hygiene, but also
spotting potential issues that could arise as security issues. If one starts
tinkering with cflags, you can really change the code quite a bit. This is why
some places only approve sanctioned builds of crypto libraries.
But one of the things to consider is that for clang builds, clang
doesn't support
this option, so we would need to move the compiler checking code into each
projects root makefile and ensure any consuming subordinate leafs add a
default, or move it to the global makefile and make sure each leaf that needs it
has a default.
^ permalink raw reply
* [Cluster-devel] [PATCH 1/4] sctp: add sctp_sock_set_nodelay
From: Marcelo Ricardo Leitner @ 2020-05-29 16:05 UTC (permalink / raw)
To: cluster-devel.redhat.com
In-Reply-To: <20200529120943.101454-2-hch@lst.de>
On Fri, May 29, 2020 at 02:09:40PM +0200, Christoph Hellwig wrote:
> Add a helper to directly set the SCTP_NODELAY sockopt from kernel space
> without going through a fake uaccess.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
I'm taking the action item to make sctp_setsockopt_nodelay() use the
new helper. Will likely create a __ variant of it, due to sock lock.
> ---
> fs/dlm/lowcomms.c | 10 ++--------
> include/net/sctp/sctp.h | 7 +++++++
> 2 files changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
> index 69333728d871b..9f1c3cdc9d653 100644
> --- a/fs/dlm/lowcomms.c
> +++ b/fs/dlm/lowcomms.c
> @@ -914,7 +914,6 @@ static int sctp_bind_addrs(struct connection *con, uint16_t port)
> static void sctp_connect_to_sock(struct connection *con)
> {
> struct sockaddr_storage daddr;
> - int one = 1;
> int result;
> int addr_len;
> struct socket *sock;
> @@ -961,8 +960,7 @@ static void sctp_connect_to_sock(struct connection *con)
> log_print("connecting to %d", con->nodeid);
>
> /* Turn off Nagle's algorithm */
> - kernel_setsockopt(sock, SOL_SCTP, SCTP_NODELAY, (char *)&one,
> - sizeof(one));
> + sctp_sock_set_nodelay(sock->sk);
>
> /*
> * Make sock->ops->connect() function return in specified time,
> @@ -1176,7 +1174,6 @@ static int sctp_listen_for_all(void)
> struct socket *sock = NULL;
> int result = -EINVAL;
> struct connection *con = nodeid2con(0, GFP_NOFS);
> - int one = 1;
>
> if (!con)
> return -ENOMEM;
> @@ -1191,10 +1188,7 @@ static int sctp_listen_for_all(void)
> }
>
> sock_set_rcvbuf(sock->sk, NEEDED_RMEM);
> - result = kernel_setsockopt(sock, SOL_SCTP, SCTP_NODELAY, (char *)&one,
> - sizeof(one));
> - if (result < 0)
> - log_print("Could not set SCTP NODELAY error %d\n", result);
> + sctp_sock_set_nodelay(sock->sk);
>
> write_lock_bh(&sock->sk->sk_callback_lock);
> /* Init con struct */
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index 3ab5c6bbb90bd..f8bcb75bb0448 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -615,4 +615,11 @@ static inline bool sctp_newsk_ready(const struct sock *sk)
> return sock_flag(sk, SOCK_DEAD) || sk->sk_socket;
> }
>
> +static inline void sctp_sock_set_nodelay(struct sock *sk)
> +{
> + lock_sock(sk);
> + sctp_sk(sk)->nodelay = true;
> + release_sock(sk);
> +}
> +
> #endif /* __net_sctp_h__ */
> --
> 2.26.2
>
^ permalink raw reply
* Re: [PATCH 1/4] sctp: add sctp_sock_set_nodelay
From: Marcelo Ricardo Leitner @ 2020-05-29 16:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: David S. Miller, Jakub Kicinski, Vlad Yasevich, Neil Horman,
David Laight, linux-sctp, linux-kernel, cluster-devel, netdev
In-Reply-To: <20200529120943.101454-2-hch@lst.de>
On Fri, May 29, 2020 at 02:09:40PM +0200, Christoph Hellwig wrote:
> Add a helper to directly set the SCTP_NODELAY sockopt from kernel space
> without going through a fake uaccess.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
I'm taking the action item to make sctp_setsockopt_nodelay() use the
new helper. Will likely create a __ variant of it, due to sock lock.
> ---
> fs/dlm/lowcomms.c | 10 ++--------
> include/net/sctp/sctp.h | 7 +++++++
> 2 files changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
> index 69333728d871b..9f1c3cdc9d653 100644
> --- a/fs/dlm/lowcomms.c
> +++ b/fs/dlm/lowcomms.c
> @@ -914,7 +914,6 @@ static int sctp_bind_addrs(struct connection *con, uint16_t port)
> static void sctp_connect_to_sock(struct connection *con)
> {
> struct sockaddr_storage daddr;
> - int one = 1;
> int result;
> int addr_len;
> struct socket *sock;
> @@ -961,8 +960,7 @@ static void sctp_connect_to_sock(struct connection *con)
> log_print("connecting to %d", con->nodeid);
>
> /* Turn off Nagle's algorithm */
> - kernel_setsockopt(sock, SOL_SCTP, SCTP_NODELAY, (char *)&one,
> - sizeof(one));
> + sctp_sock_set_nodelay(sock->sk);
>
> /*
> * Make sock->ops->connect() function return in specified time,
> @@ -1176,7 +1174,6 @@ static int sctp_listen_for_all(void)
> struct socket *sock = NULL;
> int result = -EINVAL;
> struct connection *con = nodeid2con(0, GFP_NOFS);
> - int one = 1;
>
> if (!con)
> return -ENOMEM;
> @@ -1191,10 +1188,7 @@ static int sctp_listen_for_all(void)
> }
>
> sock_set_rcvbuf(sock->sk, NEEDED_RMEM);
> - result = kernel_setsockopt(sock, SOL_SCTP, SCTP_NODELAY, (char *)&one,
> - sizeof(one));
> - if (result < 0)
> - log_print("Could not set SCTP NODELAY error %d\n", result);
> + sctp_sock_set_nodelay(sock->sk);
>
> write_lock_bh(&sock->sk->sk_callback_lock);
> /* Init con struct */
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index 3ab5c6bbb90bd..f8bcb75bb0448 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -615,4 +615,11 @@ static inline bool sctp_newsk_ready(const struct sock *sk)
> return sock_flag(sk, SOCK_DEAD) || sk->sk_socket;
> }
>
> +static inline void sctp_sock_set_nodelay(struct sock *sk)
> +{
> + lock_sock(sk);
> + sctp_sk(sk)->nodelay = true;
> + release_sock(sk);
> +}
> +
> #endif /* __net_sctp_h__ */
> --
> 2.26.2
>
^ permalink raw reply
* Re: [PATCH 1/4] sctp: add sctp_sock_set_nodelay
From: Marcelo Ricardo Leitner @ 2020-05-29 16:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: David S. Miller, Jakub Kicinski, Vlad Yasevich, Neil Horman,
David Laight, linux-sctp, linux-kernel, cluster-devel, netdev
In-Reply-To: <20200529120943.101454-2-hch@lst.de>
On Fri, May 29, 2020 at 02:09:40PM +0200, Christoph Hellwig wrote:
> Add a helper to directly set the SCTP_NODELAY sockopt from kernel space
> without going through a fake uaccess.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
I'm taking the action item to make sctp_setsockopt_nodelay() use the
new helper. Will likely create a __ variant of it, due to sock lock.
> ---
> fs/dlm/lowcomms.c | 10 ++--------
> include/net/sctp/sctp.h | 7 +++++++
> 2 files changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
> index 69333728d871b..9f1c3cdc9d653 100644
> --- a/fs/dlm/lowcomms.c
> +++ b/fs/dlm/lowcomms.c
> @@ -914,7 +914,6 @@ static int sctp_bind_addrs(struct connection *con, uint16_t port)
> static void sctp_connect_to_sock(struct connection *con)
> {
> struct sockaddr_storage daddr;
> - int one = 1;
> int result;
> int addr_len;
> struct socket *sock;
> @@ -961,8 +960,7 @@ static void sctp_connect_to_sock(struct connection *con)
> log_print("connecting to %d", con->nodeid);
>
> /* Turn off Nagle's algorithm */
> - kernel_setsockopt(sock, SOL_SCTP, SCTP_NODELAY, (char *)&one,
> - sizeof(one));
> + sctp_sock_set_nodelay(sock->sk);
>
> /*
> * Make sock->ops->connect() function return in specified time,
> @@ -1176,7 +1174,6 @@ static int sctp_listen_for_all(void)
> struct socket *sock = NULL;
> int result = -EINVAL;
> struct connection *con = nodeid2con(0, GFP_NOFS);
> - int one = 1;
>
> if (!con)
> return -ENOMEM;
> @@ -1191,10 +1188,7 @@ static int sctp_listen_for_all(void)
> }
>
> sock_set_rcvbuf(sock->sk, NEEDED_RMEM);
> - result = kernel_setsockopt(sock, SOL_SCTP, SCTP_NODELAY, (char *)&one,
> - sizeof(one));
> - if (result < 0)
> - log_print("Could not set SCTP NODELAY error %d\n", result);
> + sctp_sock_set_nodelay(sock->sk);
>
> write_lock_bh(&sock->sk->sk_callback_lock);
> /* Init con struct */
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index 3ab5c6bbb90bd..f8bcb75bb0448 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -615,4 +615,11 @@ static inline bool sctp_newsk_ready(const struct sock *sk)
> return sock_flag(sk, SOCK_DEAD) || sk->sk_socket;
> }
>
> +static inline void sctp_sock_set_nodelay(struct sock *sk)
> +{
> + lock_sock(sk);
> + sctp_sk(sk)->nodelay = true;
> + release_sock(sk);
> +}
> +
> #endif /* __net_sctp_h__ */
> --
> 2.26.2
>
^ permalink raw reply
* Re: [RESEND PATCH v2] perf tools: correct license on jsmn json parser
From: Arnaldo Carvalho de Melo @ 2020-05-29 16:05 UTC (permalink / raw)
To: Ed Maste; +Cc: linux-kernel, Ed Maste, Andi Kleen, Greg Kroah-Hartman
In-Reply-To: <20200528170858.48457-1-emaste@freefall.freebsd.org>
Em Thu, May 28, 2020 at 05:08:59PM +0000, Ed Maste escreveu:
> From: Ed Maste <emaste@freebsd.org>
>
> This header is part of the jsmn json parser, introduced in 867a979a83.
> Correct the SPDX tag to indicate that it is under the MIT license.
Thanks, applied.
- Arnaldo
> Signed-off-by: Ed Maste <emaste@freebsd.org>
> Acked-by: Andi Kleen <ak@linux.intel.com>
> ---
> tools/perf/pmu-events/jsmn.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/perf/pmu-events/jsmn.h b/tools/perf/pmu-events/jsmn.h
> index c7b0f6ea2a31..1bdfd55fff30 100644
> --- a/tools/perf/pmu-events/jsmn.h
> +++ b/tools/perf/pmu-events/jsmn.h
> @@ -1,4 +1,4 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> +/* SPDX-License-Identifier: MIT */
> #ifndef __JSMN_H_
> #define __JSMN_H_
>
> --
> 2.24.0
>
--
- Arnaldo
^ permalink raw reply
* Re: mmotm 2020-05-13-20-30 uploaded (objtool warnings)
From: Josh Poimboeuf @ 2020-05-29 16:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christoph Hellwig, Randy Dunlap, Andrew Morton, broonie,
linux-fsdevel, linux-kernel, linux-mm, linux-next, mhocko,
mm-commits, sfr, Linus Torvalds, viro, x86, Steven Rostedt
In-Reply-To: <20200529153336.GC706518@hirez.programming.kicks-ass.net>
On Fri, May 29, 2020 at 05:33:36PM +0200, Peter Zijlstra wrote:
> On Fri, May 29, 2020 at 04:53:25PM +0200, Peter Zijlstra wrote:
> > On Fri, May 29, 2020 at 04:35:56PM +0200, Peter Zijlstra wrote:
>
> > *groan*, this is one of those CONFIG_PROFILE_ALL_BRANCHES builds. If I
> > disable that it goes away.
> >
> > Still trying to untangle the mess it generated, but on first go it
> > looks like objtool is right, but I'm not sure what went wrong.
>
> $ tools/objtool/objtool check -fab arch/x86/lib/csum-wrappers_64.o
> arch/x86/lib/csum-wrappers_64.o: warning: objtool: csum_and_copy_from_user()+0x29f: call to memset() with UACCESS enabled
> arch/x86/lib/csum-wrappers_64.o: warning: objtool: csum_and_copy_from_user()+0x283: (branch)
> arch/x86/lib/csum-wrappers_64.o: warning: objtool: csum_and_copy_from_user()+0x113: (branch)
> arch/x86/lib/csum-wrappers_64.o: warning: objtool: .altinstr_replacement+0xffffffffffffffff: (branch)
> arch/x86/lib/csum-wrappers_64.o: warning: objtool: csum_and_copy_from_user()+0xea: (alt)
> arch/x86/lib/csum-wrappers_64.o: warning: objtool: .altinstr_replacement+0xffffffffffffffff: (branch)
> arch/x86/lib/csum-wrappers_64.o: warning: objtool: csum_and_copy_from_user()+0xe7: (alt)
> arch/x86/lib/csum-wrappers_64.o: warning: objtool: csum_and_copy_from_user()+0xd2: (branch)
> arch/x86/lib/csum-wrappers_64.o: warning: objtool: csum_and_copy_from_user()+0x7e: (branch)
> arch/x86/lib/csum-wrappers_64.o: warning: objtool: csum_and_copy_from_user()+0x43: (branch)
> arch/x86/lib/csum-wrappers_64.o: warning: objtool: csum_and_copy_from_user()+0x0: <=== (sym)
>
> The problem is with the +0x113 branch, which is at 0x1d1.
>
> That looks to be:
>
> if (!likely(user_access_begin(src, len)))
> goto out_err;
>
> Except that the brach profiling stuff confused GCC enough to leak STAC
> into the error path or something.
It looks to me like GCC is doing the right thing. That likely()
translates to:
# define likely(x) (__branch_check__(x, 1, __builtin_constant_p(x)))
which becomes:
#define __branch_check__(x, expect, is_constant) ({ \
long ______r; \
static struct ftrace_likely_data \
__aligned(4) \
__section(_ftrace_annotated_branch) \
______f = { \
.data.func = __func__, \
.data.file = __FILE__, \
.data.line = __LINE__, \
}; \
______r = __builtin_expect(!!(x), expect); \
ftrace_likely_update(&______f, ______r, \
expect, is_constant); \
______r; \
})
Here 'x' is the call to user_access_begin(). It evaluates 'x' -- and
thus calls user_access_begin() -- before the call to
ftrace_likely_update().
So it's working as designed, right? The likely() just needs to be
changed to likely_notrace().
--
Josh
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.