* [PATCH V4 0/7] KVM, pkeys: add memory protection-key support
@ 2016-03-05 11:27 Huaitong Han
  2016-03-05 11:27 ` [PATCH V4 1/7] KVM, pkeys: expose CPUID/CR4 to guest Huaitong Han
                   ` (6 more replies)
  0 siblings, 7 replies; 47+ messages in thread
From: Huaitong Han @ 2016-03-05 11:27 UTC (permalink / raw)
  To: pbonzini, gleb; +Cc: kvm, Huaitong Han

Changes in v4:
*Patches 2 and 4 have been rebased on http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=mm/pkeys

Changes in v3:
*Add comments for the patch that disables the PKU feature without ept.

Changes in v2:
*Add pku.c for kvm-unit-tests.
*Optimize the permission_fault code for patch 4.
*Delete is_long_mode and PK for patch 5.
*Squash cpuid and cr4 patches.

The protection-key feature provides an additional mechanism by which IA-32e
paging controls access to user-mode addresses.

Hardware support for protection keys for user pages is enumerated with CPUID
feature flag CPUID.7.0.ECX[3]:PKU. Software support is enumerated with
CPUID.7.0.ECX[4]:OSPKE, which reflects the setting of CR4.PKE (bit 22).

When CR4.PKE = 1, every linear address is associated with the 4-bit protection
key located in bits 62:59 of the paging-structure entry that mapped the page
containing the linear address. The PKRU register determines, for each
protection key, whether user-mode addresses with that protection key may be
read or written.
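
For illustration only (the macro and helper names below are invented for
this sketch; the series itself extracts the key with the kernel's
pte_flags_pkey()), pulling the key out of a 64-bit leaf entry looks like:

#define PTE_PKEY_SHIFT	59
#define PTE_PKEY_MASK	0xfULL

/* Return the 4-bit protection key stored in bits 62:59 of a leaf entry. */
static inline unsigned int pte_protection_key(unsigned long long pte)
{
	return (pte >> PTE_PKEY_SHIFT) & PTE_PKEY_MASK;
}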

The PKRU register (protection key rights for user pages) is a 32-bit register
with the following format: for each i (0 ≤ i ≤ 15), PKRU[2i] is the
access-disable bit for protection key i (ADi); PKRU[2i+1] is the write-disable
bit for protection key i (WDi).
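
In code form (helper names invented for this sketch), the two bits for
protection key i decode as:

/* PKRU[2i] is the access-disable bit ADi; PKRU[2i+1] is the write-disable
   bit WDi. */
static inline int pkey_access_disabled(unsigned int pkru, unsigned int i)
{
	return (pkru >> (2 * i)) & 1;
}

static inline int pkey_write_disabled(unsigned int pkru, unsigned int i)
{
	return (pkru >> (2 * i + 1)) & 1;
}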

Software can use the RDPKRU and WRPKRU instructions with ECX = 0 to read and
write PKRU. In addition, the PKRU register is XSAVE-managed state and can thus
be read and written by instructions in the XSAVE feature set.
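
A sketch of the two instructions, encoded as raw bytes because older
assemblers do not know the mnemonics (this mirrors what the kernel's
helpers do, but the names here are illustrative):

static inline unsigned int rdpkru(void)
{
	unsigned int eax, edx;

	/* RDPKRU (0f 01 ee): ECX must be 0; result in EAX, EDX is cleared. */
	asm volatile(".byte 0x0f, 0x01, 0xee"
		     : "=a" (eax), "=d" (edx)
		     : "c" (0));
	return eax;
}

static inline void wrpkru(unsigned int pkru)
{
	/* WRPKRU (0f 01 ef): writes EAX to PKRU; ECX and EDX must be 0. */
	asm volatile(".byte 0x0f, 0x01, 0xef"
		     : : "a" (pkru), "c" (0), "d" (0));
}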

PFEC.PK (bit 5) is defined to report protection-key violations.

The specification of Protection Keys can be found in the Intel SDM (Section
4.6.2, Volume 3):
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf


Huaitong Han (7):
  KVM, pkeys: expose CPUID/CR4 to guest
  KVM, pkeys: disable pkeys for guests in non-paging mode
  KVM, pkeys: update memory permission bitmask for pkeys
  KVM, pkeys: add pkeys support for permission_fault logic
  KVM, pkeys: Add pkeys support for gva_to_gpa functions
  KVM, pkeys: add pkeys support for xsave state
  KVM, pkeys: disable PKU feature without ept

 arch/x86/include/asm/kvm_host.h | 11 +++++---
 arch/x86/kvm/cpuid.c            | 24 ++++++++++++++++--
 arch/x86/kvm/cpuid.h            |  8 ++++++
 arch/x86/kvm/mmu.c              | 32 +++++++++++++++++++++--
 arch/x86/kvm/mmu.h              | 56 +++++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/paging_tmpl.h      | 18 ++++++++++---
 arch/x86/kvm/vmx.c              |  6 ++---
 arch/x86/kvm/x86.c              | 27 ++++++++++++++------
 arch/x86/kvm/x86.h              |  3 ++-
 9 files changed, 159 insertions(+), 26 deletions(-)

-- 
1.8.3.1



* [PATCH V4 1/7] KVM, pkeys: expose CPUID/CR4 to guest
  2016-03-05 11:27 [PATCH V4 0/7] KVM, pkeys: add memory protection-key support Huaitong Han
@ 2016-03-05 11:27 ` Huaitong Han
  2016-03-06  7:15   ` Xiao Guangrong
  2016-03-05 11:27 ` [PATCH V4 2/7] KVM, pkeys: disable pkeys for guests in non-paging mode Huaitong Han
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 47+ messages in thread
From: Huaitong Han @ 2016-03-05 11:27 UTC (permalink / raw)
  To: pbonzini, gleb; +Cc: kvm, Huaitong Han

X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:
CPUID.7.0.ECX[3]:PKU. X86_FEATURE_OSPKE is software support for pkeys,
enumerated with CPUID.7.0.ECX[4]:OSPKE, and it reflects the setting of
CR4.PKE (bit 22).

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/cpuid.c            | 21 +++++++++++++++++++--
 arch/x86/kvm/cpuid.h            |  8 ++++++++
 arch/x86/kvm/x86.c              |  9 ++++++---
 4 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 44adbb8..2867626 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -83,7 +83,8 @@
 			  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE     \
 			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
 			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
-			  | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP))
+			  | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP \
+			  | X86_CR4_PKE))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 6525e92..f058db5 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -81,6 +81,17 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
 			apic->lapic_timer.timer_mode_mask = 1 << 17;
 	}
 
+	best = kvm_find_cpuid_entry(vcpu, 7, 0);
+	if (!best)
+		return 0;
+
+	/*Update OSPKE bit */
+	if (boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7) {
+		best->ecx &= ~F(OSPKE);
+		if (kvm_read_cr4_bits(vcpu, X86_CR4_PKE))
+			best->ecx |= F(OSPKE);
+	}
+
 	best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
 	if (!best) {
 		vcpu->arch.guest_supported_xcr0 = 0;
@@ -354,6 +365,9 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 	const u32 kvm_supported_word10_x86_features =
 		F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | f_xsaves;
 
+	/* cpuid 7.0.ecx*/
+	const u32 kvm_supported_word11_x86_features = F(PKU) | 0 /*OSPKE*/;
+
 	/* all calls to cpuid_count() should be made on the same cpu */
 	get_cpu();
 
@@ -431,10 +445,13 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 			cpuid_mask(&entry->ebx, 9);
 			// TSC_ADJUST is emulated
 			entry->ebx |= F(TSC_ADJUST);
-		} else
+			entry->ecx &= kvm_supported_word11_x86_features;
+			cpuid_mask(&entry->ecx, 13);
+		} else {
 			entry->ebx = 0;
+			entry->ecx = 0;
+		}
 		entry->eax = 0;
-		entry->ecx = 0;
 		entry->edx = 0;
 		break;
 	}
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index c8eda14..3bacab1 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -79,6 +79,14 @@ static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu)
 	return best && (best->ebx & bit(X86_FEATURE_FSGSBASE));
 }
 
+static inline bool guest_cpuid_has_pku(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry(vcpu, 7, 0);
+	return best && (best->ecx & bit(X86_FEATURE_PKU));
+}
+
 static inline bool guest_cpuid_has_longmode(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4244c2b..05e7838 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -720,7 +720,7 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long old_cr4 = kvm_read_cr4(vcpu);
 	unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
-				   X86_CR4_SMEP | X86_CR4_SMAP;
+				   X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
 
 	if (cr4 & CR4_RESERVED_BITS)
 		return 1;
@@ -737,6 +737,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	if (!guest_cpuid_has_fsgsbase(vcpu) && (cr4 & X86_CR4_FSGSBASE))
 		return 1;
 
+	if (!guest_cpuid_has_pku(vcpu) && (cr4 & X86_CR4_PKE))
+		return 1;
+
 	if (is_long_mode(vcpu)) {
 		if (!(cr4 & X86_CR4_PAE))
 			return 1;
@@ -762,7 +765,7 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	    (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
 		kvm_mmu_reset_context(vcpu);
 
-	if ((cr4 ^ old_cr4) & X86_CR4_OSXSAVE)
+	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
 		kvm_update_cpuid(vcpu);
 
 	return 0;
@@ -7109,7 +7112,7 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 
 	mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
 	kvm_x86_ops->set_cr4(vcpu, sregs->cr4);
-	if (sregs->cr4 & X86_CR4_OSXSAVE)
+	if (sregs->cr4 & (X86_CR4_OSXSAVE | X86_CR4_PKE))
 		kvm_update_cpuid(vcpu);
 
 	idx = srcu_read_lock(&vcpu->kvm->srcu);
-- 
1.8.3.1



* [PATCH V4 2/7] KVM, pkeys: disable pkeys for guests in non-paging mode
  2016-03-05 11:27 [PATCH V4 0/7] KVM, pkeys: add memory protection-key support Huaitong Han
  2016-03-05 11:27 ` [PATCH V4 1/7] KVM, pkeys: expose CPUID/CR4 to guest Huaitong Han
@ 2016-03-05 11:27 ` Huaitong Han
  2016-03-06  7:19   ` Xiao Guangrong
  2016-03-08 12:09   ` Yang Zhang
  2016-03-05 11:27 ` [PATCH V4 3/7] KVM, pkeys: update memory permission bitmask for pkeys Huaitong Han
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 47+ messages in thread
From: Huaitong Han @ 2016-03-05 11:27 UTC (permalink / raw)
  To: pbonzini, gleb; +Cc: kvm, Huaitong Han

Pkeys are disabled if the CPU is in non-paging mode in hardware. However,
KVM always uses paging mode to emulate guest non-paging mode with TDP. To
emulate this behavior, pkeys need to be manually disabled when the guest
switches to non-paging mode.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
---
Changes in v4:
*Patch has been rebased on http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=mm/pkeys

 arch/x86/kvm/vmx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e2951b6..db33c22 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3855,13 +3855,13 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 
 	if (!enable_unrestricted_guest && !is_paging(vcpu))
 		/*
-		 * SMEP/SMAP is disabled if CPU is in non-paging mode in
+		 * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
 		 * hardware.  However KVM always uses paging mode without
 		 * unrestricted guest.
-		 * To emulate this behavior, SMEP/SMAP needs to be manually
+		 * To emulate this behavior, SMEP/SMAP/PKU needs to be manually
 		 * disabled when guest switches to non-paging mode.
 		 */
-		hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
+		hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
 
 	vmcs_writel(CR4_READ_SHADOW, cr4);
 	vmcs_writel(GUEST_CR4, hw_cr4);
-- 
1.8.3.1



* [PATCH V4 3/7] KVM, pkeys: update memory permission bitmask for pkeys
  2016-03-05 11:27 [PATCH V4 0/7] KVM, pkeys: add memory protection-key support Huaitong Han
  2016-03-05 11:27 ` [PATCH V4 1/7] KVM, pkeys: expose CPUID/CR4 to guest Huaitong Han
  2016-03-05 11:27 ` [PATCH V4 2/7] KVM, pkeys: disable pkeys for guests in non-paging mode Huaitong Han
@ 2016-03-05 11:27 ` Huaitong Han
  2016-03-06  7:42   ` Xiao Guangrong
  2016-03-05 11:27 ` [PATCH V4 4/7] KVM, pkeys: add pkeys support for permission_fault logic Huaitong Han
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 47+ messages in thread
From: Huaitong Han @ 2016-03-05 11:27 UTC (permalink / raw)
  To: pbonzini, gleb; +Cc: kvm, Huaitong Han

Protection keys define a new status bit in the PFEC: PFEC.PK (bit 5). If
certain conditions are true, the fault is considered a PKU violation. This
patch updates the memory permission bitmask for pkeys.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  8 ++++++--
 arch/x86/kvm/mmu.c              | 32 ++++++++++++++++++++++++++++++--
 2 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2867626..9eb54e8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -187,12 +187,14 @@ enum {
 #define PFERR_USER_BIT 2
 #define PFERR_RSVD_BIT 3
 #define PFERR_FETCH_BIT 4
+#define PFERR_PK_BIT 5
 
 #define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
 #define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
 #define PFERR_USER_MASK (1U << PFERR_USER_BIT)
 #define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT)
 #define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT)
+#define PFERR_PK_MASK (1U << PFERR_PK_BIT)
 
 /* apic attention bits */
 #define KVM_APIC_CHECK_VAPIC	0
@@ -322,10 +324,12 @@ struct kvm_mmu {
 
 	/*
 	 * Bitmap; bit set = permission fault
-	 * Byte index: page fault error code [4:1]
+	 * Byte index: page fault error code [5:1]
 	 * Bit index: pte permissions in ACC_* format
+	 *
+	 * Add PFEC.PK (bit 5) for protection-key violations
 	 */
-	u8 permissions[16];
+	u8 permissions[32];
 
 	u64 *pae_root;
 	u64 *lm_root;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 95a955d..1ef66f4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3776,16 +3776,22 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
 {
 	unsigned bit, byte, pfec;
 	u8 map;
-	bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, cr4_smep, smap = 0;
+	bool fault, x, w, u, smap = 0, pku = 0;
+	bool wf, uf, ff, smapf, rsvdf, pkuf;
+	bool cr4_smap, cr4_smep, cr4_pku;
 
 	cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
 	cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
+	cr4_pku = kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
+
 	for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
 		pfec = byte << 1;
 		map = 0;
 		wf = pfec & PFERR_WRITE_MASK;
 		uf = pfec & PFERR_USER_MASK;
 		ff = pfec & PFERR_FETCH_MASK;
+		rsvdf = pfec & PFERR_RSVD_MASK;
+		pkuf = pfec & PFERR_PK_MASK;
 		/*
 		 * PFERR_RSVD_MASK bit is set in PFEC if the access is not
 		 * subject to SMAP restrictions, and cleared otherwise. The
@@ -3824,12 +3830,34 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
 				 *   clearer.
 				 */
 				smap = cr4_smap && u && !uf && !ff;
+
+				/*
+				* PKU:additional mechanism by which the paging
+				* controls access to user-mode addresses based
+				* on the value in the PKRU register. A fault is
+				* considered as a PKU violation if all of the
+				* following conditions are true:
+				* 1.CR4_PKE=1.
+				* 2.EFER_LMA=1.
+				* 3.page is present with no reserved bit
+				*   violations.
+				* 4.the access is not an instruction fetch.
+				* 5.the access is to a user page.
+				* 6.PKRU.AD=1
+				*	or The access is a data write and
+				*	   PKRU.WD=1 and either CR0.WP=1
+				*	   or it is a user access.
+				*
+				* The 2nd and 6th conditions are computed
+				* dynamically in permission_fault.
+				*/
+				pku = cr4_pku && !rsvdf && !ff && u;
 			} else
 				/* Not really needed: no U/S accesses on ept  */
 				u = 1;
 
 			fault = (ff && !x) || (uf && !u) || (wf && !w) ||
-				(smapf && smap);
+				(smapf && smap) || (pkuf && pku);
 			map |= fault << bit;
 		}
 		mmu->permissions[byte] = map;
-- 
1.8.3.1



* [PATCH V4 4/7] KVM, pkeys: add pkeys support for permission_fault logic
  2016-03-05 11:27 [PATCH V4 0/7] KVM, pkeys: add memory protection-key support Huaitong Han
                   ` (2 preceding siblings ...)
  2016-03-05 11:27 ` [PATCH V4 3/7] KVM, pkeys: update memory permission bitmask for pkeys Huaitong Han
@ 2016-03-05 11:27 ` Huaitong Han
  2016-03-06  8:00   ` Xiao Guangrong
  2016-03-05 11:27 ` [PATCH V4 5/7] KVM, pkeys: Add pkeys support for gva_to_gpa functions Huaitong Han
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 47+ messages in thread
From: Huaitong Han @ 2016-03-05 11:27 UTC (permalink / raw)
  To: pbonzini, gleb; +Cc: kvm, Huaitong Han

Protection keys define a new 4-bit protection key field (PKEY) in bits
62:59 of leaf entries of the page tables. The PKEY is an index into the
PKRU register (16 domains); every domain has 2 bits (a write-disable bit
and an access-disable bit).

The static logic is built in update_permission_bitmask; the dynamic logic
needs to read the pkey from the page-table entry, get the PKRU value, and
deduce the correct result.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
---
Changes in v4:
*Patch has been rebased on http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=mm/pkeys
 The pte_pkey function has been deleted from the kernel PKU support, so the
 KVM patches use pte_flags_pkey(pte_flags(pte)) instead of pte_pkey(pte).

 arch/x86/kvm/mmu.h         | 56 +++++++++++++++++++++++++++++++++++++++++-----
 arch/x86/kvm/paging_tmpl.h | 18 ++++++++++++---
 arch/x86/kvm/x86.c         |  2 +-
 3 files changed, 67 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 55ffb7b..7a9f627 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -3,6 +3,7 @@
 
 #include <linux/kvm_host.h>
 #include "kvm_cache_regs.h"
+#include "x86.h"
 
 #define PT64_PT_BITS 9
 #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS)
@@ -24,6 +25,7 @@
 #define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT)
 #define PT_PAT_MASK (1ULL << 7)
 #define PT_GLOBAL_MASK (1ULL << 8)
+
 #define PT64_NX_SHIFT 63
 #define PT64_NX_MASK (1ULL << PT64_NX_SHIFT)
 
@@ -45,6 +47,10 @@
 #define PT_PAGE_TABLE_LEVEL 1
 #define PT_MAX_HUGEPAGE_LEVEL (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES - 1)
 
+#define PKRU_READ   0
+#define PKRU_WRITE  1
+#define PKRU_ATTRS  2
+
 static inline u64 rsvd_bits(int s, int e)
 {
 	return ((1ULL << (e - s + 1)) - 1) << s;
@@ -145,10 +151,50 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu)
  * fault with the given access (in ACC_* format)?
  */
 static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
-				    unsigned pte_access, unsigned pfec)
+		unsigned pte_access, unsigned pte_pkeys, unsigned pfec)
 {
-	int cpl = kvm_x86_ops->get_cpl(vcpu);
-	unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
+	unsigned long smap, rflags;
+	u32 pkru, pkru_bits;
+	int cpl, index;
+	bool wf, uf;
+
+	cpl = kvm_x86_ops->get_cpl(vcpu);
+	rflags = kvm_x86_ops->get_rflags(vcpu);
+
+	/*
+	* PKU is computed dynamically in permission_fault.
+	* 2nd and 6th conditions:
+	* 2.EFER_LMA=1
+	* 6.PKRU.AD=1
+	*	or The access is a data write and PKRU.WD=1 and
+	*	   either CR0.WP=1 or it is a user mode access
+	*/
+	pkru = is_long_mode(vcpu) ? read_pkru() : 0;
+	if (unlikely(pkru) && (pfec & PFERR_PK_MASK))
+	{
+		/*
+		* PKRU defines 32 bits, there are 16 domains and 2 attribute bits per
+		* domain in pkru, pkey is the index to a defined domain, so the value
+		* of pkey * PKRU_ATTRS is offset of a defined domain.
+		*/
+		pkru_bits = (pkru >> (pte_pkeys * PKRU_ATTRS)) & 3;
+
+		wf = pfec & PFERR_WRITE_MASK;
+		uf = pfec & PFERR_USER_MASK;
+
+		/*
+		* Ignore PKRU.WD if not relevant to this access (a read,
+		* or a supervisor mode access if CR0.WP=0).
+		* So 6th conditions is equivalent to "pkru_bits != 0"
+		*/
+		if (!wf || (!uf && !is_write_protection(vcpu)))
+			pkru_bits &= ~(1 << PKRU_WRITE);
+
+		/* Flip pfec on PK bit if pkru_bits is zero */
+		pfec ^= pkru_bits ? 0 : PFERR_PK_MASK;
+	}
+	else
+		pfec &= ~PFERR_PK_MASK;
 
 	/*
 	 * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1.
@@ -163,8 +209,8 @@ static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	 * but it will be one in index if SMAP checks are being overridden.
 	 * It is important to keep this branchless.
 	 */
-	unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
-	int index = (pfec >> 1) +
+	smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
+	index = (pfec >> 1) +
 		    (smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1));
 
 	WARN_ON(pfec & PFERR_RSVD_MASK);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 6c9fed9..9afc066 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -253,6 +253,15 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 	}
 	return 0;
 }
+static inline unsigned FNAME(gpte_pkeys)(struct kvm_vcpu *vcpu, u64 gpte)
+{
+	unsigned pkeys = 0;
+#if PTTYPE == 64
+	pte_t pte = {.pte = gpte};
+	pkeys = pte_flags_pkey(pte_flags(pte));
+#endif
+	return pkeys;
+}
 
 /*
  * Fetch a guest pte for a guest virtual address
@@ -265,12 +274,13 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	pt_element_t pte;
 	pt_element_t __user *uninitialized_var(ptep_user);
 	gfn_t table_gfn;
-	unsigned index, pt_access, pte_access, accessed_dirty;
+	unsigned index, pt_access, pte_access, accessed_dirty, pte_pkeys;
 	gpa_t pte_gpa;
 	int offset;
 	const int write_fault = access & PFERR_WRITE_MASK;
 	const int user_fault  = access & PFERR_USER_MASK;
 	const int fetch_fault = access & PFERR_FETCH_MASK;
+	const int pk_fault = access & PFERR_PK_MASK;
 	u16 errcode = 0;
 	gpa_t real_gpa;
 	gfn_t gfn;
@@ -356,7 +366,9 @@ retry_walk:
 		walker->ptes[walker->level - 1] = pte;
 	} while (!is_last_gpte(mmu, walker->level, pte));
 
-	if (unlikely(permission_fault(vcpu, mmu, pte_access, access))) {
+	pte_pkeys = FNAME(gpte_pkeys)(vcpu, pte);
+	if (unlikely(permission_fault(vcpu, mmu, pte_access, pte_pkeys,
+					access))) {
 		errcode |= PFERR_PRESENT_MASK;
 		goto error;
 	}
@@ -399,7 +411,7 @@ retry_walk:
 	return 1;
 
 error:
-	errcode |= write_fault | user_fault;
+	errcode |= write_fault | user_fault | pk_fault;
 	if (fetch_fault && (mmu->nx ||
 			    kvm_read_cr4_bits(vcpu, X86_CR4_SMEP)))
 		errcode |= PFERR_FETCH_MASK;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 05e7838..b697a99 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4317,7 +4317,7 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
 
 	if (vcpu_match_mmio_gva(vcpu, gva)
 	    && !permission_fault(vcpu, vcpu->arch.walk_mmu,
-				 vcpu->arch.access, access)) {
+				 vcpu->arch.access, 0, access)) {
 		*gpa = vcpu->arch.mmio_gfn << PAGE_SHIFT |
 					(gva & (PAGE_SIZE - 1));
 		trace_vcpu_match_mmio(gva, *gpa, write, false);
-- 
1.8.3.1



* [PATCH V4 5/7] KVM, pkeys: Add pkeys support for gva_to_gpa functions
  2016-03-05 11:27 [PATCH V4 0/7] KVM, pkeys: add memory protection-key support Huaitong Han
                   ` (3 preceding siblings ...)
  2016-03-05 11:27 ` [PATCH V4 4/7] KVM, pkeys: add pkeys support for permission_fault logic Huaitong Han
@ 2016-03-05 11:27 ` Huaitong Han
  2016-03-06  8:01   ` Xiao Guangrong
  2016-03-05 11:27 ` [PATCH V4 6/7] KVM, pkeys: add pkeys support for xsave state Huaitong Han
  2016-03-05 11:27 ` [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept Huaitong Han
  6 siblings, 1 reply; 47+ messages in thread
From: Huaitong Han @ 2016-03-05 11:27 UTC (permalink / raw)
  To: pbonzini, gleb; +Cc: kvm, Huaitong Han

This patch adds pkeys support for the gva_to_gpa functions.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
---
 arch/x86/kvm/x86.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b697a99..34e511b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4161,6 +4161,7 @@ gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, gva_t gva,
 			      struct x86_exception *exception)
 {
 	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
+	access |= kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ? PFERR_PK_MASK : 0;
 	return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
 }
 
@@ -4177,6 +4178,7 @@ gpa_t kvm_mmu_gva_to_gpa_write(struct kvm_vcpu *vcpu, gva_t gva,
 {
 	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
 	access |= PFERR_WRITE_MASK;
+	access |= kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ? PFERR_PK_MASK : 0;
 	return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
 }
 
@@ -4227,10 +4229,13 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
 	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
 	unsigned offset;
 	int ret;
+	gpa_t gpa;
+
+	access |= kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ? PFERR_PK_MASK : 0;
 
 	/* Inline kvm_read_guest_virt_helper for speed.  */
-	gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, access|PFERR_FETCH_MASK,
-						    exception);
+	gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr,
+			access | PFERR_FETCH_MASK, exception);
 	if (unlikely(gpa == UNMAPPED_GVA))
 		return X86EMUL_PROPAGATE_FAULT;
 
@@ -4251,6 +4256,7 @@ int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
 {
 	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
 	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
+	access |= kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ? PFERR_PK_MASK : 0;
 
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
 					  exception);
@@ -4283,9 +4289,13 @@ int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt,
 	void *data = val;
 	int r = X86EMUL_CONTINUE;
 
+	u32 access = PFERR_WRITE_MASK;
+
+	access |= kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ? PFERR_PK_MASK : 0;
+
 	while (bytes) {
 		gpa_t gpa =  vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr,
-							     PFERR_WRITE_MASK,
+							     access,
 							     exception);
 		unsigned offset = addr & (PAGE_SIZE-1);
 		unsigned towrite = min(bytes, (unsigned)PAGE_SIZE - offset);
-- 
1.8.3.1



* [PATCH V4 6/7] KVM, pkeys: add pkeys support for xsave state
  2016-03-05 11:27 [PATCH V4 0/7] KVM, pkeys: add memory protection-key support Huaitong Han
                   ` (4 preceding siblings ...)
  2016-03-05 11:27 ` [PATCH V4 5/7] KVM, pkeys: Add pkeys support for gva_to_gpa functions Huaitong Han
@ 2016-03-05 11:27 ` Huaitong Han
  2016-03-06  8:27   ` Xiao Guangrong
  2016-03-05 11:27 ` [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept Huaitong Han
  6 siblings, 1 reply; 47+ messages in thread
From: Huaitong Han @ 2016-03-05 11:27 UTC (permalink / raw)
  To: pbonzini, gleb; +Cc: kvm, Huaitong Han

This patch adds pkeys support for xsave state.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
---
 arch/x86/kvm/x86.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f2afa5f..0f71d5d 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -182,7 +182,8 @@ bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 #define KVM_SUPPORTED_XCR0     (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \
 				| XFEATURE_MASK_YMM | XFEATURE_MASK_BNDREGS \
-				| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512)
+				| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
+				| XFEATURE_MASK_PKRU)
 extern u64 host_xcr0;
 
 extern u64 kvm_supported_xcr0(void);
-- 
1.8.3.1



* [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-05 11:27 [PATCH V4 0/7] KVM, pkeys: add memory protection-key support Huaitong Han
                   ` (5 preceding siblings ...)
  2016-03-05 11:27 ` [PATCH V4 6/7] KVM, pkeys: add pkeys support for xsave state Huaitong Han
@ 2016-03-05 11:27 ` Huaitong Han
  2016-03-06  9:28   ` Xiao Guangrong
  6 siblings, 1 reply; 47+ messages in thread
From: Huaitong Han @ 2016-03-05 11:27 UTC (permalink / raw)
  To: pbonzini, gleb; +Cc: kvm, Huaitong Han

This patch disables CPUID:PKU without ept, because pkeys are not yet
implemented for shadow paging.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
---
 arch/x86/kvm/cpuid.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f058db5..8ac559d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -447,6 +447,9 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 			entry->ebx |= F(TSC_ADJUST);
 			entry->ecx &= kvm_supported_word11_x86_features;
 			cpuid_mask(&entry->ecx, 13);
+			if (!tdp_enabled)
+				/* PKU is not yet implemented for shadow paging. */
+				entry->ecx &= ~F(PKU);
 		} else {
 			entry->ebx = 0;
 			entry->ecx = 0;
-- 
1.8.3.1



* Re: [PATCH V4 1/7] KVM, pkeys: expose CPUID/CR4 to guest
  2016-03-05 11:27 ` [PATCH V4 1/7] KVM, pkeys: expose CPUID/CR4 to guest Huaitong Han
@ 2016-03-06  7:15   ` Xiao Guangrong
  2016-03-06 23:20     ` Paolo Bonzini
  0 siblings, 1 reply; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-06  7:15 UTC (permalink / raw)
  To: Huaitong Han, pbonzini, gleb; +Cc: kvm



On 03/05/2016 07:27 PM, Huaitong Han wrote:
> X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:
> CPUID.7.0.ECX[3]:PKU. X86_FEATURE_OSPKE is software support for pkeys,
> enumerated with CPUID.7.0.ECX[4]:OSPKE, and it reflects the setting of
> CR4.PKE (bit 22).
>
> Signed-off-by: Huaitong Han <huaitong.han@intel.com>
> ---
>   arch/x86/include/asm/kvm_host.h |  3 ++-
>   arch/x86/kvm/cpuid.c            | 21 +++++++++++++++++++--
>   arch/x86/kvm/cpuid.h            |  8 ++++++++
>   arch/x86/kvm/x86.c              |  9 ++++++---
>   4 files changed, 35 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 44adbb8..2867626 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -83,7 +83,8 @@
>   			  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE     \
>   			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
>   			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
> -			  | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP))
> +			  | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP \
> +			  | X86_CR4_PKE))
>
>   #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 6525e92..f058db5 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -81,6 +81,17 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
>   			apic->lapic_timer.timer_mode_mask = 1 << 17;
>   	}
>
> +	best = kvm_find_cpuid_entry(vcpu, 7, 0);
> +	if (!best)
> +		return 0;
> +
> +	/*Update OSPKE bit */

Missed a space here.

> +	if (boot_cpu_has(X86_FEATURE_PKU) && best->function == 0x7) {
> +		best->ecx &= ~F(OSPKE);
> +		if (kvm_read_cr4_bits(vcpu, X86_CR4_PKE))
> +			best->ecx |= F(OSPKE);
> +	}
> +
>   	best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
>   	if (!best) {
>   		vcpu->arch.guest_supported_xcr0 = 0;
> @@ -354,6 +365,9 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>   	const u32 kvm_supported_word10_x86_features =
>   		F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | f_xsaves;
>
> +	/* cpuid 7.0.ecx*/
> +	const u32 kvm_supported_word11_x86_features = F(PKU) | 0 /*OSPKE*/;
> +
>   	/* all calls to cpuid_count() should be made on the same cpu */
>   	get_cpu();
>
> @@ -431,10 +445,13 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>   			cpuid_mask(&entry->ebx, 9);
>   			// TSC_ADJUST is emulated
>   			entry->ebx |= F(TSC_ADJUST);
> -		} else
> +			entry->ecx &= kvm_supported_word11_x86_features;
> +			cpuid_mask(&entry->ecx, 13);

Can we use a meaningful name defined by cpuid_leafs instead of the raw number?



* Re: [PATCH V4 2/7] KVM, pkeys: disable pkeys for guests in non-paging mode
  2016-03-05 11:27 ` [PATCH V4 2/7] KVM, pkeys: disable pkeys for guests in non-paging mode Huaitong Han
@ 2016-03-06  7:19   ` Xiao Guangrong
  2016-03-08 12:09   ` Yang Zhang
  1 sibling, 0 replies; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-06  7:19 UTC (permalink / raw)
  To: Huaitong Han, pbonzini, gleb; +Cc: kvm



On 03/05/2016 07:27 PM, Huaitong Han wrote:
> Pkeys are disabled if the CPU is in non-paging mode in hardware. However,
> KVM always uses paging mode to emulate guest non-paging mode with TDP. To
> emulate this behavior, pkeys need to be manually disabled when the guest
> switches to non-paging mode.

Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>


* Re: [PATCH V4 3/7] KVM, pkeys: update memory permission bitmask for pkeys
  2016-03-05 11:27 ` [PATCH V4 3/7] KVM, pkeys: update memory permission bitmask for pkeys Huaitong Han
@ 2016-03-06  7:42   ` Xiao Guangrong
  2016-03-06 23:14     ` Paolo Bonzini
  0 siblings, 1 reply; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-06  7:42 UTC (permalink / raw)
  To: Huaitong Han, pbonzini, gleb; +Cc: kvm



On 03/05/2016 07:27 PM, Huaitong Han wrote:
> Protection keys define a new status bit in the PFEC: PFEC.PK (bit 5). If
> certain conditions are true, the fault is considered a PKU violation. This
> patch updates the memory permission bitmask for pkeys.
>
> Signed-off-by: Huaitong Han <huaitong.han@intel.com>
> ---
>   arch/x86/include/asm/kvm_host.h |  8 ++++++--
>   arch/x86/kvm/mmu.c              | 32 ++++++++++++++++++++++++++++++--
>   2 files changed, 36 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2867626..9eb54e8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -187,12 +187,14 @@ enum {
>   #define PFERR_USER_BIT 2
>   #define PFERR_RSVD_BIT 3
>   #define PFERR_FETCH_BIT 4
> +#define PFERR_PK_BIT 5
>
>   #define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
>   #define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
>   #define PFERR_USER_MASK (1U << PFERR_USER_BIT)
>   #define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT)
>   #define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT)
> +#define PFERR_PK_MASK (1U << PFERR_PK_BIT)
>
>   /* apic attention bits */
>   #define KVM_APIC_CHECK_VAPIC	0
> @@ -322,10 +324,12 @@ struct kvm_mmu {
>
>   	/*
>   	 * Bitmap; bit set = permission fault
> -	 * Byte index: page fault error code [4:1]
> +	 * Byte index: page fault error code [5:1]
>   	 * Bit index: pte permissions in ACC_* format
> +	 *
> +	 * Add PFEC.PK (bit 5) for protection-key violations
>   	 */
> -	u8 permissions[16];
> +	u8 permissions[32];
>
>   	u64 *pae_root;
>   	u64 *lm_root;
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 95a955d..1ef66f4 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3776,16 +3776,22 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
>   {
>   	unsigned bit, byte, pfec;
>   	u8 map;
> -	bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, cr4_smep, smap = 0;
> +	bool fault, x, w, u, smap = 0, pku = 0;
> +	bool wf, uf, ff, smapf, rsvdf, pkuf;
> +	bool cr4_smap, cr4_smep, cr4_pku;
>
>   	cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
>   	cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
> +	cr4_pku = kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
> +
>   	for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
>   		pfec = byte << 1;
>   		map = 0;
>   		wf = pfec & PFERR_WRITE_MASK;
>   		uf = pfec & PFERR_USER_MASK;
>   		ff = pfec & PFERR_FETCH_MASK;
> +		rsvdf = pfec & PFERR_RSVD_MASK;

No. RSVD is reserved by SMAP and it should not be used to walk the guest page table.

> +		pkuf = pfec & PFERR_PK_MASK;
>   		/*
>   		 * PFERR_RSVD_MASK bit is set in PFEC if the access is not
>   		 * subject to SMAP restrictions, and cleared otherwise. The
> @@ -3824,12 +3830,34 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
>   				 *   clearer.
>   				 */
>   				smap = cr4_smap && u && !uf && !ff;
> +
> +				/*
> +				* PKU:additional mechanism by which the paging
> +				* controls access to user-mode addresses based
> +				* on the value in the PKRU register. A fault is
> +				* considered as a PKU violation if all of the
> +				* following conditions are true:
> +				* 1.CR4_PKE=1.
> +				* 2.EFER_LMA=1.
> +				* 3.page is present with no reserved bit
> +				*   violations.
> +				* 4.the access is not an instruction fetch.
> +				* 5.the access is to a user page.
> +				* 6.PKRU.AD=1
> +				*	or The access is a data write and
> +				*	   PKRU.WD=1 and either CR0.WP=1
> +				*	   or it is a user access.
> +				*
> +				* The 2nd and 6th conditions are computed
> +				* dynamically in permission_fault.
> +				*/

It is not good as there are branches in the next patch.

However, I do not think we need a new byte index for PK. The conditions detecting PK enablement
can be fully found in the current vcpu content (i.e., CR4, EFER and U/S access).

So you only need to extend permissions to u32, with the highest two bits used to
indicate the PKEY taken from the guest pte.


* Re: [PATCH V4 4/7] KVM, pkeys: add pkeys support for permission_fault logic
  2016-03-05 11:27 ` [PATCH V4 4/7] KVM, pkeys: add pkeys support for permission_fault logic Huaitong Han
@ 2016-03-06  8:00   ` Xiao Guangrong
  2016-03-06 20:36     ` Paolo Bonzini
  0 siblings, 1 reply; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-06  8:00 UTC (permalink / raw)
  To: Huaitong Han, pbonzini, gleb; +Cc: kvm



On 03/05/2016 07:27 PM, Huaitong Han wrote:
> Protection keys define a new 4-bit protection key field (PKEY) in bits
> 62:59 of leaf entries of the page tables. The PKEY is an index into the
> PKRU register (16 domains); every domain has 2 bits (a write-disable bit
> and an access-disable bit).
>
> The static logic is built in update_permission_bitmask; the dynamic logic
> needs to read the pkey from the page-table entry, get the PKRU value, and
> deduce the correct result.
>
> Signed-off-by: Huaitong Han <huaitong.han@intel.com>
> ---
> Changes in v4:
> *Patch has been rebased on http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=mm/pkeys
>   The pte_pkey function has been deleted from the kernel PKU support, so the
>   KVM patches use pte_flags_pkey(pte_flags(pte)) instead of pte_pkey(pte).
>
>   arch/x86/kvm/mmu.h         | 56 +++++++++++++++++++++++++++++++++++++++++-----
>   arch/x86/kvm/paging_tmpl.h | 18 ++++++++++++---
>   arch/x86/kvm/x86.c         |  2 +-
>   3 files changed, 67 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 55ffb7b..7a9f627 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -3,6 +3,7 @@
>
>   #include <linux/kvm_host.h>
>   #include "kvm_cache_regs.h"
> +#include "x86.h"
>
>   #define PT64_PT_BITS 9
>   #define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS)
> @@ -24,6 +25,7 @@
>   #define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT)
>   #define PT_PAT_MASK (1ULL << 7)
>   #define PT_GLOBAL_MASK (1ULL << 8)
> +
>   #define PT64_NX_SHIFT 63
>   #define PT64_NX_MASK (1ULL << PT64_NX_SHIFT)
>
> @@ -45,6 +47,10 @@
>   #define PT_PAGE_TABLE_LEVEL 1
>   #define PT_MAX_HUGEPAGE_LEVEL (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES - 1)
>
> +#define PKRU_READ   0
> +#define PKRU_WRITE  1
> +#define PKRU_ATTRS  2
> +
>   static inline u64 rsvd_bits(int s, int e)
>   {
>   	return ((1ULL << (e - s + 1)) - 1) << s;
> @@ -145,10 +151,50 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu)
>    * fault with the given access (in ACC_* format)?
>    */
>   static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> -				    unsigned pte_access, unsigned pfec)
> +		unsigned pte_access, unsigned pte_pkeys, unsigned pfec)
>   {
> -	int cpl = kvm_x86_ops->get_cpl(vcpu);
> -	unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
> +	unsigned long smap, rflags;
> +	u32 pkru, pkru_bits;
> +	int cpl, index;
> +	bool wf, uf;
> +
> +	cpl = kvm_x86_ops->get_cpl(vcpu);
> +	rflags = kvm_x86_ops->get_rflags(vcpu);
> +
> +	/*
> +	* PKU is computed dynamically in permission_fault.
> +	* 2nd and 6th conditions:
> +	* 2.EFER_LMA=1
> +	* 6.PKRU.AD=1
> +	*	or The access is a data write and PKRU.WD=1 and
> +	*	   either CR0.WP=1 or it is a user mode access
> +	*/
> +	pkru = is_long_mode(vcpu) ? read_pkru() : 0;
> +	if (unlikely(pkru) && (pfec & PFERR_PK_MASK))
> +	{
> +		/*
> +		* PKRU defines 32 bits, there are 16 domains and 2 attribute bits per
> +		* domain in pkru, pkey is the index to a defined domain, so the value
> +		* of pkey * PKRU_ATTRS is offset of a defined domain.
> +		*/
> +		pkru_bits = (pkru >> (pte_pkeys * PKRU_ATTRS)) & 3;
> +
> +		wf = pfec & PFERR_WRITE_MASK;
> +		uf = pfec & PFERR_USER_MASK;
> +
> +		/*
> +		* Ignore PKRU.WD if not relevant to this access (a read,
> +		* or a supervisor mode access if CR0.WP=0).
> +		* So 6th conditions is equivalent to "pkru_bits != 0"
> +		*/
> +		if (!wf || (!uf && !is_write_protection(vcpu)))
> +			pkru_bits &= ~(1 << PKRU_WRITE);
> +
> +		/* Flip pfec on PK bit if pkru_bits is zero */
> +		pfec ^= pkru_bits ? 0 : PFERR_PK_MASK;
> +	}
> +	else
> +		pfec &= ~PFERR_PK_MASK;
>
>   	/*
>   	 * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1.
> @@ -163,8 +209,8 @@ static inline bool permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>   	 * but it will be one in index if SMAP checks are being overridden.
>   	 * It is important to keep this branchless.
>   	 */
> -	unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
> -	int index = (pfec >> 1) +
> +	smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
> +	index = (pfec >> 1) +
>   		    (smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1));

See my comments in the previous patch.

>
>   	WARN_ON(pfec & PFERR_RSVD_MASK);
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 6c9fed9..9afc066 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -253,6 +253,15 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
>   	}
>   	return 0;
>   }
> +static inline unsigned FNAME(gpte_pkeys)(struct kvm_vcpu *vcpu, u64 gpte)
> +{
> +	unsigned pkeys = 0;
> +#if PTTYPE == 64
> +	pte_t pte = {.pte = gpte};
> +	pkeys = pte_flags_pkey(pte_flags(pte));
> +#endif

This code path is also taken for PAE guests.

It is not a problem, as the bits used by PKEY are reserved in PAE mode, which
should be caught by is_rsvd_bits_set() prior to this call, and PKEY is ignored
if EFER.LMA = 0. But please add some comments here to explain why it is safe.

> +	return pkeys;
> +}
>
>   /*
>    * Fetch a guest pte for a guest virtual address
> @@ -265,12 +274,13 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>   	pt_element_t pte;
>   	pt_element_t __user *uninitialized_var(ptep_user);
>   	gfn_t table_gfn;
> -	unsigned index, pt_access, pte_access, accessed_dirty;
> +	unsigned index, pt_access, pte_access, accessed_dirty, pte_pkeys;
>   	gpa_t pte_gpa;
>   	int offset;
>   	const int write_fault = access & PFERR_WRITE_MASK;
>   	const int user_fault  = access & PFERR_USER_MASK;
>   	const int fetch_fault = access & PFERR_FETCH_MASK;
> +	const int pk_fault = access & PFERR_PK_MASK;
>   	u16 errcode = 0;
>   	gpa_t real_gpa;
>   	gfn_t gfn;
> @@ -356,7 +366,9 @@ retry_walk:
>   		walker->ptes[walker->level - 1] = pte;
>   	} while (!is_last_gpte(mmu, walker->level, pte));
>
> -	if (unlikely(permission_fault(vcpu, mmu, pte_access, access))) {
> +	pte_pkeys = FNAME(gpte_pkeys)(vcpu, pte);
> +	if (unlikely(permission_fault(vcpu, mmu, pte_access, pte_pkeys,
> +					access))) {
>   		errcode |= PFERR_PRESENT_MASK;
>   		goto error;
>   	}
> @@ -399,7 +411,7 @@ retry_walk:
>   	return 1;
>
>   error:
> -	errcode |= write_fault | user_fault;
> +	errcode |= write_fault | user_fault | pk_fault;

No. The pk_fault is not always caused by the guest page table, as it depends on
CR0.WP, which is always true on shadow page tables.

>   	if (fetch_fault && (mmu->nx ||
>   			    kvm_read_cr4_bits(vcpu, X86_CR4_SMEP)))
>   		errcode |= PFERR_FETCH_MASK;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 05e7838..b697a99 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4317,7 +4317,7 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
>
>   	if (vcpu_match_mmio_gva(vcpu, gva)
>   	    && !permission_fault(vcpu, vcpu->arch.walk_mmu,
> -				 vcpu->arch.access, access)) {
> +				 vcpu->arch.access, 0, access)) {

No. The pkey is not always 0.

We should cache PKEY for the mmio access and use it here to check if the right
is adequate.


* Re: [PATCH V4 5/7] KVM, pkeys: Add pkeys support for gva_to_gpa functions
  2016-03-05 11:27 ` [PATCH V4 5/7] KVM, pkeys: Add pkeys support for gva_to_gpa functions Huaitong Han
@ 2016-03-06  8:01   ` Xiao Guangrong
  2016-03-06 21:33     ` Paolo Bonzini
  0 siblings, 1 reply; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-06  8:01 UTC (permalink / raw)
  To: Huaitong Han, pbonzini, gleb; +Cc: kvm



On 03/05/2016 07:27 PM, Huaitong Han wrote:
> This patch adds pkeys support for the gva_to_gpa functions.

It is not needed if you follow the approach I raised in the previous patch.



* Re: [PATCH V4 6/7] KVM, pkeys: add pkeys support for xsave state
  2016-03-05 11:27 ` [PATCH V4 6/7] KVM, pkeys: add pkeys support for xsave state Huaitong Han
@ 2016-03-06  8:27   ` Xiao Guangrong
  0 siblings, 0 replies; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-06  8:27 UTC (permalink / raw)
  To: Huaitong Han, pbonzini, gleb; +Cc: kvm



On 03/05/2016 07:27 PM, Huaitong Han wrote:
> This patch adds pkeys support for xsave state.
>

Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>


* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-05 11:27 ` [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept Huaitong Han
@ 2016-03-06  9:28   ` Xiao Guangrong
  2016-03-06 20:32     ` Paolo Bonzini
  0 siblings, 1 reply; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-06  9:28 UTC (permalink / raw)
  To: Huaitong Han, pbonzini, gleb; +Cc: kvm



On 03/05/2016 07:27 PM, Huaitong Han wrote:
> This patch disables CPUID:PKU without ept, because pkeys are not yet
> implemented for shadow paging.
>

Is the PKRU loaded/saved during vm-enter/vm-exit?

BTW, I just went through the spec very quickly; it seems VMX lacks the ability
to intercept accesses to PKRU. Right?


* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-06  9:28   ` Xiao Guangrong
@ 2016-03-06 20:32     ` Paolo Bonzini
  2016-03-08  5:54       ` Xiao Guangrong
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-06 20:32 UTC (permalink / raw)
  To: Xiao Guangrong, Huaitong Han, gleb; +Cc: kvm



On 06/03/2016 10:28, Xiao Guangrong wrote:
>> This patch disables CPUID:PKU without ept, because pkeys are not yet
>> implemented for shadow paging.
> 
> Is the PKRU loaded/saved during vm-enter/vm-exit?

Yes, through XSAVE/XRSTOR (which uses eager mode when PKE is active).

> BTW, I just went through the spec very quickly; it seems VMX lacks the
> ability to intercept accesses to PKRU. Right?

Indeed RDPKRU/WRPKRU cannot be intercepted.

Paolo


* Re: [PATCH V4 4/7] KVM, pkeys: add pkeys support for permission_fault logic
  2016-03-06  8:00   ` Xiao Guangrong
@ 2016-03-06 20:36     ` Paolo Bonzini
  2016-03-06 23:29       ` Paolo Bonzini
  2016-03-08  5:57       ` Xiao Guangrong
  0 siblings, 2 replies; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-06 20:36 UTC (permalink / raw)
  To: Xiao Guangrong, Huaitong Han, gleb; +Cc: kvm



On 06/03/2016 09:00, Xiao Guangrong wrote:
>>
>>       if (vcpu_match_mmio_gva(vcpu, gva)
>>           && !permission_fault(vcpu, vcpu->arch.walk_mmu,
>> -                 vcpu->arch.access, access)) {
>> +                 vcpu->arch.access, 0, access)) {
> 
> No. The pkey is not always 0.
> 
> We should cache PKEY for the mmio access and use it here to check if the
> right is adequate.

This is just an optimization I think, so it can have false negatives (it
won't have many in practice because MMIO accesses are usually done in
supervisor mode).  The actual check is done when
vcpu->arch.walk_mmu->gva_to_gpa is called.

Paolo


* Re: [PATCH V4 5/7] KVM, pkeys: Add pkeys support for gva_to_gpa functions
  2016-03-06  8:01   ` Xiao Guangrong
@ 2016-03-06 21:33     ` Paolo Bonzini
  0 siblings, 0 replies; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-06 21:33 UTC (permalink / raw)
  To: Xiao Guangrong, Huaitong Han, gleb; +Cc: kvm



On 06/03/2016 09:01, Xiao Guangrong wrote:
> 
> 
> On 03/05/2016 07:27 PM, Huaitong Han wrote:
>> This patch adds pkeys support for the gva_to_gpa functions.
> 
> It is not needed if you follow the approach I raised in the previous patch.

It's not needed anyway, I think.

PFERR_PK_MASK is set based on CR4.PKE.  However, patch 3 sets pku to false if
CR4.PKE is clear, and then the corresponding bit is always clear in
mmu->permissions[byte].

Thus I think permission_fault() operates in just the same way even if
PFERR_PK_MASK is always set; CR4.PKE=0 is caught through the permissions
bitmask (as it's meant to be).

Instead of setting PFERR_PK_MASK, we can handle it in
permission_fault like:

	pkru = is_long_mode(vcpu) ? read_pkru() : 0;
	if (unlikely(pkru)) {
		...
		/* Set PK bit of pfec if pkru_bits is non-zero */
		pfec |= pkru_bits ? PFERR_PK_MASK : 0;
	}

I'll get to the topic of branches in another message.

Paolo


* Re: [PATCH V4 3/7] KVM, pkeys: update memory permission bitmask for pkeys
  2016-03-06  7:42   ` Xiao Guangrong
@ 2016-03-06 23:14     ` Paolo Bonzini
  2016-03-08  7:35       ` Xiao Guangrong
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-06 23:14 UTC (permalink / raw)
  To: Xiao Guangrong, Huaitong Han, gleb; +Cc: kvm



On 06/03/2016 08:42, Xiao Guangrong wrote:
>>
>> +        rsvdf = pfec & PFERR_RSVD_MASK;
> 
> No. RSVD is reserved by SMAP and it should not be used to walk guest
> page page table.

Agreed.  You can treat your code as if rsvdf was always false.  Reserved
bits are handled elsewhere.

>> +        pkuf = pfec & PFERR_PK_MASK;
>>           /*
>>            * PFERR_RSVD_MASK bit is set in PFEC if the access is not
>>            * subject to SMAP restrictions, and cleared otherwise. The
>> @@ -3824,12 +3830,34 @@ static void update_permission_bitmask(struct
>> kvm_vcpu *vcpu,
>>                    *   clearer.
>>                    */
>>                   smap = cr4_smap && u && !uf && !ff;
>> +
>> +                /*
>> +                * PKU:additional mechanism by which the paging
>> +                * controls access to user-mode addresses based
>> +                * on the value in the PKRU register. A fault is
>> +                * considered as a PKU violation if all of the
>> +                * following conditions are true:
>> +                * 1.CR4_PKE=1.
>> +                * 2.EFER_LMA=1.
>> +                * 3.page is present with no reserved bit
>> +                *   violations.
>> +                * 4.the access is not an instruction fetch.
>> +                * 5.the access is to a user page.
>> +                * 6.PKRU.AD=1
>> +                *    or The access is a data write and
>> +                *       PKRU.WD=1 and either CR0.WP=1
>> +                *       or it is a user access.
>> +                *
>> +                * The 2nd and 6th conditions are computed
>> +                * dynamically in permission_fault.
>> +                */
> 
> It is not good as there are branches in the next patch.

It's important to note that branches in general are _not_ a problem.
Only badly-predicted branches are a problem; well-predicted branches are
_faster_ than branchless code.  For example, take is_last_gpte.  The
branchy way to write it in walk_addr_generic would be (excluding the
32-bit !PSE case) something like:

	do {
	} while (level > PT_PAGE_TABLE_LEVEL &&
		 (!(gpte & PT_PAGE_SIZE_MASK) ||
		  level == mmu->root_level));

Here none of the branches is easily predicted, so we want to get rid of
them.

The next patch adds three branches, and they are not all equal:

- is_long_vcpu is well predicted to true (or even for 32-bit OSes it
should be well predicted if the host is not overcommitted).

- pkru != 0 should be well-predicted to false, at least for a few
years... and perhaps even later considering that most MMIO access
happens in the kernel.

- !wf || (!uf && !is_write_protection(vcpu)) is badly predicted and
should be removed

So only the last one is a problem.

> However, i do not think we need a new byte index for PK. The conditions
> detecting PK enablement
> can be fully found in current vcpu content (i.e, CR4, EFER and U/S access).

Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA
too, though Huaitong's patch doesn't do that).  It's a good thing to do.
U/S is also handled by adding a new byte index, see Huaitong's

	pku = cr4_pku && !ff && u;

If this is improved to

	pku = cr4_pku && long_mode_vcpu && !ff && u;

one branch goes away in permission_fault.  The read_pkru() branch, if
well predicted, lets you optimize away the pkru tests.  I think it
_would_ be well predicted, so I think it should remain.

The "(!wf || (!uf && !is_write_protection(vcpu)))" is indeed the worst
of the three.  I was lenient in my previous review because this code
won't run on any system being sold now and in the next 1-2 (?) years.
However, we can indeed get rid of the branch, so let's do it. :)

I don't like the idea of making permissions[] four times larger.
Instead, we can find the value of the expression elsewhere in
mmu->permissions (!), or cache it separately in a different field of mmu.

If I interpret the rules correctly, WD works like this.  First, we take
the PTE and imagine that it had W=0.  Then, if this access would not
fault, WD is ignored.  This is because:

- on reads, WD is always ignored

- on writes, WD is ignored in supervisor mode if !CR0.WP

... and this is how W=0 pages work, isn't it?

If so, I think it's something like this in code:

-		if (!wf || (!uf && !is_write_protection(vcpu)))
-			pkru_bits &= ~(1 << PKRU_WRITE);
+		/* Only testing writes, so ignore SMAP and fetch.  */
+		pfec_uw = pfec & (PFERR_WRITE_MASK|PFERR_USER_MASK);
+		fault_uw = mmu->permissions[pfec_uw >> 1];
+		/*
+		 * This page has U=1, so check if a U=1 W=0 page faults
+		 * on this access; if not ignore WD.
+		 */
+		pkru_bits &= ~(1 << PKRU_WRITE) |
+			(fault_uw >> (ACC_USER_MASK - PKRU_WRITE));

I think I even prefer if update_permission_bitmask sets up a separate
bitmask:

		mmu->fault_u1w0 |= (wf && !w) << byte;

and then this other bitmap can be tested in permission_fault:


-		if (!wf || (!uf && !is_write_protection(vcpu)))
-			pkru_bits &= ~(1 << PKRU_WRITE);
+		/*
+		 * fault_u1w0 ignores SMAP and PKRU, so use the
+		 * partially-computed PFEC that we were given.
+		 */
+		fault_uw = (mmu->fault_u1w0 >> (pfec >> 1)) & 1;
+		pkru_bits &= ~(1 << PKRU_WRITE) |
+			(fault_uw << PKRU_WRITE);

These ideas are untested, of course.  I apologize for any mistake.
However, they should apply both to Huaitong's current code (which needs
PFERR_PK_MASK in gva_to_gpa) and to my other suggestion from the reply
to patch 5 (http://article.gmane.org/gmane.comp.emulators.kvm.devel/148311).

Paolo


* Re: [PATCH V4 1/7] KVM, pkeys: expose CPUID/CR4 to guest
  2016-03-06  7:15   ` Xiao Guangrong
@ 2016-03-06 23:20     ` Paolo Bonzini
  2016-03-08  7:39       ` Xiao Guangrong
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-06 23:20 UTC (permalink / raw)
  To: Xiao Guangrong, Huaitong Han, gleb; +Cc: kvm



On 06/03/2016 08:15, Xiao Guangrong wrote:
>> +    /* cpuid 7.0.ecx*/
>> +    const u32 kvm_supported_word11_x86_features = F(PKU) | 0 /*OSPKE*/;
>> + 
>>
>>               cpuid_mask(&entry->ebx, 9);
>>               // TSC_ADJUST is emulated
>>               entry->ebx |= F(TSC_ADJUST);
>> -        } else
>> +            entry->ecx &= kvm_supported_word11_x86_features;
>> +            cpuid_mask(&entry->ecx, 13);
> 
> Can we use a meaningful name defined by cpuid_leafs instead of the raw
> number?

This is not a cpuid leaf number, it's the Linux feature word number.  In
other words, the "11" in kvm_supported_word11_x86_features should have
matched the "13" in cpuid_mask(&entry->ecx, 13).

However, looking at
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/tree/arch/x86/include/asm/cpufeatures.h
I see that all word numbers until 15 are taken, and PKE is not found
anywhere.  What's going on?

Thanks,

Paolo


* Re: [PATCH V4 4/7] KVM, pkeys: add pkeys support for permission_fault logic
  2016-03-06 20:36     ` Paolo Bonzini
@ 2016-03-06 23:29       ` Paolo Bonzini
  2016-03-08  5:57       ` Xiao Guangrong
  1 sibling, 0 replies; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-06 23:29 UTC (permalink / raw)
  To: Xiao Guangrong, Huaitong Han, gleb; +Cc: kvm



On 06/03/2016 21:36, Paolo Bonzini wrote:
> 
> 
> On 06/03/2016 09:00, Xiao Guangrong wrote:
>>>
>>>       if (vcpu_match_mmio_gva(vcpu, gva)
>>>           && !permission_fault(vcpu, vcpu->arch.walk_mmu,
>>> -                 vcpu->arch.access, access)) {
>>> +                 vcpu->arch.access, 0, access)) {
>>
>> No. The pkey is not always 0.
>>
>> We should cache PKEY for the mmio access and use it here to check if the
>> right is adequate.
> 
> This is just an optimization I think, so it can have false negatives (it
> won't have many in practice because MMIO accesses are usually done in
> supervisor mode).  The actual check is done when
> vcpu->arch.walk_mmu->gva_to_gpa is called.

Duh, sorry, false _positives_ are okay (i.e. you can say something
faults even if it actually doesn't).

One thing you could do is:

- do not pass pte_pkeys to permission_fault, instead read PKRU in
walk_addr_generic and pass "(pkru >> (pte_pkeys * PKRU_ATTRS)) & 3" to
permission_fault.

- here, pass ~0.  So if CR4.PKE=0 or U=0 caching works properly, but if
CR4.PKE=1 and U=1 it is disabled gracefully and gva_to_gpa does the page
walk correctly.
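
For illustration, the first bullet could look roughly like this in
walk_addr_generic(); pte_pkeys, PKRU_ATTRS (assumed to be 2) and the
permission_fault() signature are taken from the patchset, the rest is
just a sketch:

	u32 pkru = read_pkru();
	u32 pkru_ad_wd = (pkru >> (walker->pte_pkeys * PKRU_ATTRS)) & 3;

	/* Pass the two relevant PKRU bits instead of the raw key. */
	if (permission_fault(vcpu, mmu, walker->pte_access, pkru_ad_wd,
			     access))
		goto error;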

Paolo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-06 20:32     ` Paolo Bonzini
@ 2016-03-08  5:54       ` Xiao Guangrong
  2016-03-08  8:47         ` Paolo Bonzini
  0 siblings, 1 reply; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-08  5:54 UTC (permalink / raw)
  To: Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm



On 03/07/2016 04:32 AM, Paolo Bonzini wrote:
>
>
> On 06/03/2016 10:28, Xiao Guangrong wrote:
>>> This patch disables CPUID:PKU without ept, because pkeys is not yet
>>> implemented for shadow paging.
>>
>> Does the PKRU is loaded/saved during vm-enter/vm-exit?
>
> Yes, through XSAVE/XRSTOR (which uses eager mode when PKE is active).

You mean eager FPU? However, eager FPU depends on 'eagerfpu', which is a kernel
parameter, and this patchset did not force it on.

However, even if we use eager FPU, KVM can still save/load lazily due to some
FPU optimizations in KVM.

>> BTW, I just very quickly go through the spec, it seems VMX lacks the
>> ability to intercept the access to PKRU. Right?
>
> Indeed RDPKRU/WRPKRU cannot be intercepted.

Er, I was thinking of using this feature to speed up write-protection for
shadow page tables and dirty-logging... it seems not easy, as PKRU cannot
be intercepted. :(

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 4/7] KVM, pkeys: add pkeys support for permission_fault logic
  2016-03-06 20:36     ` Paolo Bonzini
  2016-03-06 23:29       ` Paolo Bonzini
@ 2016-03-08  5:57       ` Xiao Guangrong
  1 sibling, 0 replies; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-08  5:57 UTC (permalink / raw)
  To: Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm



On 03/07/2016 04:36 AM, Paolo Bonzini wrote:
>
>
> On 06/03/2016 09:00, Xiao Guangrong wrote:
>>>
>>>        if (vcpu_match_mmio_gva(vcpu, gva)
>>>            && !permission_fault(vcpu, vcpu->arch.walk_mmu,
>>> -                 vcpu->arch.access, access)) {
>>> +                 vcpu->arch.access, 0, access)) {
>>
>> No. The pkey is not always 0.
>>
>> We should cache PKEY for the mmio access and use it here to check if the
>> right is adequate.
>
> This is just an optimization I think, so it can have false negatives (it
> won't have many in practice because MMIO accesses are usually done in
> supervisor mode).  The actual check is done when
> vcpu->arch.walk_mmu->gva_to_gpa is called.

Okay, this patchset disables PKEY for the soft mmu (ept = 0), so it should be
safe; however, some comments would be appreciated.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 3/7] KVM, pkeys: update memeory permission bitmask for pkeys
  2016-03-06 23:14     ` Paolo Bonzini
@ 2016-03-08  7:35       ` Xiao Guangrong
  2016-03-08  8:29         ` Paolo Bonzini
  0 siblings, 1 reply; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-08  7:35 UTC (permalink / raw)
  To: Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm



On 03/07/2016 07:14 AM, Paolo Bonzini wrote:
>
>
> On 06/03/2016 08:42, Xiao Guangrong wrote:
>>>
>>> +        rsvdf = pfec & PFERR_RSVD_MASK;
>>
>> No. RSVD is reserved by SMAP and it should not be used to walk guest
>> page page table.
>
> Agreed.  You can treat your code as if rsvdf was always false.  Reserved
> bits are handled elsewhere.
>
>>> +        pkuf = pfec & PFERR_PK_MASK;
>>>            /*
>>>             * PFERR_RSVD_MASK bit is set in PFEC if the access is not
>>>             * subject to SMAP restrictions, and cleared otherwise. The
>>> @@ -3824,12 +3830,34 @@ static void update_permission_bitmask(struct
>>> kvm_vcpu *vcpu,
>>>                     *   clearer.
>>>                     */
>>>                    smap = cr4_smap && u && !uf && !ff;
>>> +
>>> +                /*
>>> +                * PKU:additional mechanism by which the paging
>>> +                * controls access to user-mode addresses based
>>> +                * on the value in the PKRU register. A fault is
>>> +                * considered as a PKU violation if all of the
>>> +                * following conditions are true:
>>> +                * 1.CR4_PKE=1.
>>> +                * 2.EFER_LMA=1.
>>> +                * 3.page is present with no reserved bit
>>> +                *   violations.
>>> +                * 4.the access is not an instruction fetch.
>>> +                * 5.the access is to a user page.
>>> +                * 6.PKRU.AD=1
>>> +                *    or The access is a data write and
>>> +                *       PKRU.WD=1 and either CR0.WP=1
>>> +                *       or it is a user access.
>>> +                *
>>> +                * The 2nd and 6th conditions are computed
>>> +                * dynamically in permission_fault.
>>> +                */
>>
>> It is not good as there are branches in the next patch.
>
> It's important to note that branches in general are _not_ a problem.
> Only badly-predicted branches are a problem;

I agreed on this point.
> well-predicted branches are
> _faster_ than branchless code.

Er, I do not understand this. If these two cases have the same cache hits,
how can a branch be faster?

> For example, take is_last_gpte.  The
> branchy way to write it in walk_addr_generic would be (excluding the
> 32-bit !PSE case) something like:
>
> 	do {
> 	} while (level > PT_PAGE_TABLE_LEVEL &&
> 		 (!(gpte & PT_PAGE_SIZE_MASK) ||
> 		  level == mmu->root_level));
>
> Here none of the branches is easily predicted, so we want to get rid of
> them.
>
> The next patch adds three branches, and they are not all equal:
>
> - is_long_vcpu is well predicted to true (or even for 32-bit OSes it
> should be well predicted if the host is not overcommitted).
>

But in production, CPU over-commit is the normal case...

> - pkru != 0 should be well-predicted to false, at least for a few
> years... and perhaps even later considering that most MMIO access
> happens in the kernel.
>
> - !wf || (!uf && !is_write_protection(vcpu)) is badly predicted and
> should be removed
>
> So only the last one is a problem.
>
>> However, i do not think we need a new byte index for PK. The conditions
>> detecting PK enablement
>> can be fully found in current vcpu content (i.e, CR4, EFER and U/S access).
>
> Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA
> too, though Huaitong's patch doesn't do that).  It's a good thing to do.
>   U/S is also handled by adding a new byte index, see Huaitong's

We are not on the same page: the U/S here is the type of memory access, which
depends on the vCPU at runtime. But the condition of whether PKEY is enabled
is fully dependent on the environment of the CPU, and we should _always_ check
PKEY even if PFEC.PKEY is not set.

As PKEY is not enabled on softmmu, gva_to_gpa mostly comes from internal
KVM; that means we should always set PFEC.PKEY for all gva_to_gpa requests.
Wasting a bit is really unnecessary.

And it is always better to move workload from permission_fault() to
update_permission_bitmask(), as the former is much hotter than the latter.

>
> 	pku = cr4_pku && !ff && u;
>
> If this is improved to
>
> 	pku = cr4_pku && long_mode_vcpu && !ff && u;
>
> one branch goes away in permission_fault.  The read_pkru() branch, if
> well predicted, lets you optimize away the pkru tests.  I think it
> _would_ be well predicted, so I think it should remain.
>
> The "(!wf || (!uf && !is_write_protection(vcpu)))" is indeed the worst
> of the three.  I was lenient in my previous review because this code
> won't run on any system being sold now and in the next 1-2 (?) years.
> However, we can indeed get rid of the branch, so let's do it. :)
>
> I don't like the idea of making permissions[] four times larger.

Okay, then let's introduce a new field for PKEY separately. Your approach,
fault_u1w0, looks good to me.

> Instead, we can find the value of the expression elsewhere in
> mmu->permissions (!), or cache it separately in a different field of mmu.
>
> If I interpret the rules correctly, WD works like this.  First, we take
> the PTE and imagine that it had W=0.  Then, if this access would not
> fault, WD is ignored.  This is because:
>
> - on reads, WD is always ignored
>
> - on writes, WD is ignored in supervisor mode if !CR0.WP
>
> ... and this is how W=0 page work, isn't it?
>

Yes, it is.

> If so, I think it's something like this in code:
>
> -		if (!wf || (!uf && !is_write_protection(vcpu)))
> -			pkru_bits &= ~(1 << PKRU_WRITE);
> +		/* Only testing writes, so ignore SMAP and fetch.  */
> +		pfec_uw = pfec & (PFERR_WRITE_MASK|PFERR_USER_MASK);
> +		fault_uw = mmu->permissions[pfec_uw >> 1];
> +		/*
> +		 * This page has U=1, so check if a U=1 W=0 page faults
> +		 * on this access; if not ignore WD.
> +		 */
> +		pkru_bits &= ~(1 << PKRU_WRITE) |
> +			(fault_uw >> (ACC_USER_MASK - PKRU_WRITE));
>

This is tricky, and I finally understand it; yeah, it works. :) Except
I do not think PFEC.PKEY should be taken into the index, as I explained above.

> I think I even prefer if update_permission_bitmask sets up a separate
> bitmask:
>
> 		mmu->fault_u1w0 |= (wf && !w) << byte;
>
> and then this other bitmap can be tested in permission_fault:
>
>
> -		if (!wf || (!uf && !is_write_protection(vcpu)))
> -			pkru_bits &= ~(1 << PKRU_WRITE);
> +		/*
> +		 * fault_u1w0 ignores SMAP and PKRU, so use the
> +		 * partially-computed PFEC that we were given.
> +		 */
> +		fault_uw = (mmu->fault_u1w0 >> (pfec >> 1)) & 1;
> +		pkru_bits &= ~(1 << PKRU_WRITE) |
> +			(fault_uw << PKRU_WRITE);
>

It looks good to me!

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 1/7] KVM, pkeys: expose CPUID/CR4 to guest
  2016-03-06 23:20     ` Paolo Bonzini
@ 2016-03-08  7:39       ` Xiao Guangrong
  2016-03-08  7:58         ` Paolo Bonzini
  0 siblings, 1 reply; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-08  7:39 UTC (permalink / raw)
  To: Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm



On 03/07/2016 07:20 AM, Paolo Bonzini wrote:
>
>
> On 06/03/2016 08:15, Xiao Guangrong wrote:
>>> +    /* cpuid 7.0.ecx*/
>>> +    const u32 kvm_supported_word11_x86_features = F(PKU) | 0 /*OSPKE*/;
>>> +
>>>
>>>                cpuid_mask(&entry->ebx, 9);
>>>                // TSC_ADJUST is emulated
>>>                entry->ebx |= F(TSC_ADJUST);
>>> -        } else
>>> +            entry->ecx &= kvm_supported_word11_x86_features;
>>> +            cpuid_mask(&entry->ecx, 13);
>>
>> Can we use a meaningful name defined by cpuid_leafs instead of the raw
>> number?
>
> This is not a cpuid leaf number, it's the Linux feature word number.  In
> other words, the "11" in kvm_supported_word11_x86_features should have
> matched the "13" in cpuid_mask(&entry->ecx, 13).

But I noticed the kernel uses more meaningful names to update
boot_cpu_data.x86_capability[]:
$ git grep x86_capability arch/x86/
......
arch/x86/kernel/cpu/centaur.c:          c->x86_capability[CPUID_C000_0001_EDX] = cpuid_edx(0xC0000001);
arch/x86/kernel/cpu/common.c:           c->x86_capability[CPUID_1_ECX] = ecx;
arch/x86/kernel/cpu/common.c:           c->x86_capability[CPUID_1_EDX] = edx;
arch/x86/kernel/cpu/common.c:           c->x86_capability[CPUID_7_0_EBX] = ebx;
arch/x86/kernel/cpu/common.c:           c->x86_capability[CPUID_6_EAX] = cpuid_eax(0x00000006);
arch/x86/kernel/cpu/common.c:           c->x86_capability[CPUID_D_1_EAX] = eax;
arch/x86/kernel/cpu/common.c:           c->x86_capability[CPUID_F_0_EDX] = edx;
.....

Why not follow it in KVM?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 1/7] KVM, pkeys: expose CPUID/CR4 to guest
  2016-03-08  7:39       ` Xiao Guangrong
@ 2016-03-08  7:58         ` Paolo Bonzini
  0 siblings, 0 replies; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-08  7:58 UTC (permalink / raw)
  To: Xiao Guangrong, Huaitong Han, gleb; +Cc: kvm



On 08/03/2016 08:39, Xiao Guangrong wrote:
> But i noticed kernel uses the more meaningful name to update
> boot_cpu_data.x86_capability[]:
> $ git grep x86_capability arch/x86/
> ......
> arch/x86/kernel/cpu/centaur.c:         
> c->x86_capability[CPUID_C000_0001_EDX] = cpuid_edx(0xC0000001);
> arch/x86/kernel/cpu/common.c:           c->x86_capability[CPUID_1_ECX] = ecx;
> arch/x86/kernel/cpu/common.c:           c->x86_capability[CPUID_1_EDX] = edx;
> arch/x86/kernel/cpu/common.c:           c->x86_capability[CPUID_7_0_EBX] = ebx;
> arch/x86/kernel/cpu/common.c:           c->x86_capability[CPUID_6_EAX] = cpuid_eax(0x00000006);
> arch/x86/kernel/cpu/common.c:           c->x86_capability[CPUID_D_1_EAX] = eax;
> arch/x86/kernel/cpu/common.c:           c->x86_capability[CPUID_F_0_EDX] = edx;
> .....
> 
> Why not follow it in KVM?

Oh, indeed. That's fairly new (commit 39c06df4dc10, "x86/cpufeature:
Cleanup get_cpu_cap()", 2015-12-07); I had not noticed it. We can change
it, but it's probably best to do so only for 4.7.

In the meantime, for PKRU it's enough to use the right feature number.  :)
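
Assuming a CPUID_7_ECX entry eventually gets added to enum cpuid_leafs
(per the above, no such entry existed at the time of writing), the 4.7
cleanup on the KVM side might look something like:

	/* cpuid 7.0.ecx */
	const u32 kvm_supported_7_0_ecx_x86_features = F(PKU) | 0 /*OSPKE*/;

		entry->ecx &= kvm_supported_7_0_ecx_x86_features;
		/* Symbolic feature-word name instead of the raw 13. */
		cpuid_mask(&entry->ecx, CPUID_7_ECX);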

Paolo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 3/7] KVM, pkeys: update memeory permission bitmask for pkeys
  2016-03-08  7:35       ` Xiao Guangrong
@ 2016-03-08  8:29         ` Paolo Bonzini
  2016-03-08  9:19           ` Xiao Guangrong
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-08  8:29 UTC (permalink / raw)
  To: Xiao Guangrong, Huaitong Han, gleb; +Cc: kvm



On 08/03/2016 08:35, Xiao Guangrong wrote:
>> well-predicted branches are _faster_ than branchless code.
> 
> Er, i do not understand this. If these two case have the same cache hit,
> how can a branch be faster?

Because branchless code typically executes _more_ instructions.

Take the same example here:

>>     do {
>>     } while (level > PT_PAGE_TABLE_LEVEL &&
>>          (!(gpte & PT_PAGE_SIZE_MASK) ||
>>           level == mmu->root_level));

The assembly looks like (assuming %level, %gpte and %mmu are registers)

	cmp $1, %level
	jbe 1f
	test $128, %gpte
	jz beginning_of_loop
	cmpb ROOT_LEVEL_OFFSET(%mmu), %level
	je beginning_of_loop
1:

These are two to six instructions with no dependencies, which the
processor can fuse into one to three macro-ops.  For the branchless
code (I posted a patch to implement this algorithm yesterday):

	lea -2(%level), %temp1
	orl %temp1, %gpte
	movzbl LAST_NONLEAF_LEVEL_OFFSET(%mmu), %temp1
	movl %level, %temp2
	subl %temp1, %temp2
	andl %temp2, %gpte
	test $128, %gpte
	jz beginning_of_loop

These are eight instructions, with some dependencies between them too.
In some cases branchless code throws away the result of 10-15
instructions (because in the end it's ANDed with 0, for example).  If it
weren't for mispredictions, the branchy code would be faster.
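
For reference, in C the branchless version corresponds roughly to the
following; last_nonleaf_level is assumed to be the mmu field behind
LAST_NONLEAF_LEVEL_OFFSET, and PT_PAGE_SIZE_MASK is the $128 (bit 7)
tested above:

	static bool is_last_gpte(struct kvm_mmu *mmu, unsigned level,
				 unsigned gpte)
	{
		/*
		 * level == PT_PAGE_TABLE_LEVEL (1) always terminates:
		 * level - 2 is then all ones, so the OR sets
		 * PT_PAGE_SIZE_MASK in gpte.
		 */
		gpte |= level - 2;

		/*
		 * level - mmu->last_nonleaf_level has bit 7 set exactly
		 * for levels below last_nonleaf_level, so the AND clears
		 * PT_PAGE_SIZE_MASK at levels that can never map a page.
		 */
		gpte &= level - mmu->last_nonleaf_level;

		return gpte & PT_PAGE_SIZE_MASK;
	}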

>> Here none of the branches is easily predicted, so we want to get rid of
>> them.
>>
>> The next patch adds three branches, and they are not all equal:
>>
>> - is_long_vcpu is well predicted to true (or even for 32-bit OSes it
>> should be well predicted if the host is not overcommitted).
> 
> But, in the production, cpu over-commit is the normal case...

It depends on the workload.  I would guess that 32-bit OSes are more
common in setups with a single legacy guest, e.g. because it lacks
drivers for recent hardware.

>>> However, i do not think we need a new byte index for PK. The conditions
>>> detecting PK enablement
>>> can be fully found in current vcpu content (i.e, CR4, EFER and U/S
>>> access).
>>
>> Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA
>> too, though Huaitong's patch doesn't do that).  It's a good thing to do.
>>   U/S is also handled by adding a new byte index, see Huaitong's
> 
> It is not on the same page, the U/S is the type of memory access which
> is depended on vCPU runtime.

Do you mean the type of page (ACC_USER_MASK)?  Only U=1 pages are
subject to PKRU, even in the kernel.  The processor CPL
(PFERR_USER_MASK) only matters if CR0.WP=0.

> But the condition whether PKEY is enabled or not
> is fully depended on the envorment of CPU and we should _always_
> check PKEY even if PFEC_PKEY is not set.
> 
> As PKEY is not enabled on softmmu, the gva_to_gpa mostly comes from internal
> KVM, that means we should always set PFEC.PKEY for all the gva_to_gpa request.
> Wasting a bit is really unnecessary.
> 
> And it is always better to move more workload from permission_fault() to
> update_permission_bitmask() as the former is much hotter than the latter.

I agree, but I'm not sure why you say that adding a bit adds more work
to permission_fault().

Adding a bit lets us skip CR4.PKU and EFER.LMA checks in
permission_fault() and in all gva_to_gpa() callers.

So my proposal is to compute the "effective" PKRU bits (i.e. extract the
relevant AD and WD bits, and mask away WD if irrelevant) in
update_permission_bitmask(), and add PFERR_PK_MASK to the error code if
they are nonzero.

PFERR_PK_MASK must be computed in permission_fault().  It's a runtime
condition that is not known beforehand.

>> I don't like the idea of making permissions[] four times larger.
> 
> Okay, then lets introduce a new field for PKEY separately. Your approach
> , fault_u1w0, looks good to me.
>
>> I think I even prefer if update_permission_bitmask sets up a separate
>> bitmask:
>>
>>         mmu->fault_u1w0 |= (wf && !w) << byte;
>>
>> and then this other bitmap can be tested in permission_fault:
>>
>> -        if (!wf || (!uf && !is_write_protection(vcpu)))
>> -            pkru_bits &= ~(1 << PKRU_WRITE);
>> +        /*
>> +         * fault_u1w0 ignores SMAP and PKRU, so use the
>> +         * partially-computed PFEC that we were given.
>> +         */
>> +        fault_uw = (mmu->fault_u1w0 >> (pfec >> 1)) & 1;
>> +        pkru_bits &= ~(1 << PKRU_WRITE) |
>> +            (fault_uw << PKRU_WRITE);
> 
> It looks good to me!

Good. Thanks for reviewing the idea!

Paolo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-08  5:54       ` Xiao Guangrong
@ 2016-03-08  8:47         ` Paolo Bonzini
  2016-03-08  9:32           ` Xiao Guangrong
  2016-03-09  6:24           ` Yang Zhang
  0 siblings, 2 replies; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-08  8:47 UTC (permalink / raw)
  To: Xiao Guangrong, Huaitong Han, gleb; +Cc: kvm



On 08/03/2016 06:54, Xiao Guangrong wrote:
> 
> 
> On 03/07/2016 04:32 AM, Paolo Bonzini wrote:
>>
>>
>> On 06/03/2016 10:28, Xiao Guangrong wrote:
>>>> This patch disables CPUID:PKU without ept, because pkeys is not yet
>>>> implemented for shadow paging.
>>>
>>> Does the PKRU is loaded/saved during vm-enter/vm-exit?
>>
>> Yes, through XSAVE/XRSTOR (which uses eager mode when PKE is active).
> 
> You mean eager fpu? however, eager-fpu depends on 'eagerfpu' which is a
> kernel parameter and this patchset did not force it on.

Some XSAVE features (currently only MPX, but in the future PKRU too)
will force eagerfpu on, see fpu__init_system_ctx_switch:

        if (xfeatures_mask & XFEATURE_MASK_EAGER) {
                if (eagerfpu == DISABLE) {
                        xfeatures_mask &= ~XFEATURE_MASK_EAGER;
                } else {
                        eagerfpu = ENABLE;
                }
        }

        if (eagerfpu == ENABLE)
                setup_force_cpu_cap(X86_FEATURE_EAGER_FPU);

KVM only exposes a subset of the host XSAVE features so the FPU is
always eager if KVM exposes MPX and PKRU.

> However, even if we use eager-fpu kvm still can lazily save/load due to
> some fpu optimizations in kvm.

KVM will use eager FPU if the host uses it.  See arch/x86/kvm/cpuid.c:

	vcpu->arch.eager_fpu =
		use_eager_fpu() || guest_cpuid_has_mpx(vcpu);

But the guest_cpuid_has_mpx(vcpu) check is unnecessary.  The guest CPUID
cannot have MPX if the host doesn't have the BNDREGS and BNDCSR
features...  Another patch to send. :)

>>> BTW, I just very quickly go through the spec, it seems VMX lacks the
>>> ability to intercept the access to PKRU. Right?
>>
>> Indeed RDPKRU/WRPKRU cannot be intercepted.
> 
> Er, i was thinking using this feature to speedup write-protection for
> shadow page table and dirty-logging... it seems not easy as PKRU can not
> be intercepted. :(

Also it only works on U=1 pages.

Paolo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 3/7] KVM, pkeys: update memeory permission bitmask for pkeys
  2016-03-08  8:29         ` Paolo Bonzini
@ 2016-03-08  9:19           ` Xiao Guangrong
  2016-03-08 10:01             ` Paolo Bonzini
  0 siblings, 1 reply; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-08  9:19 UTC (permalink / raw)
  To: Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm



On 03/08/2016 04:29 PM, Paolo Bonzini wrote:
>
>
> On 08/03/2016 08:35, Xiao Guangrong wrote:
>>> well-predicted branches are _faster_ than branchless code.
>>
>> Er, i do not understand this. If these two case have the same cache hit,
>> how can a branch be faster?
>
> Because branchless code typically executes fewer instructions.
>
> Take the same example here:
>
>>>      do {
>>>      } while (level > PT_PAGE_TABLE_LEVEL &&
>>>           (!(gpte & PT_PAGE_SIZE_MASK) ||
>>>            level == mmu->root_level));
>
> The assembly looks like (assuming %level, %gpte and %mmu are registers)
>
> 	cmp $1, %level
> 	jbe 1f
> 	test $128, %gpte
> 	jz beginning_of_loop
> 	cmpb ROOT_LEVEL_OFFSET(%mmu), %level
> 	je beginning_of_loop
> 1:
>
> These are two to six instructions, with no dependency and which the
> processor can change into one to three macro-ops.  For the branchless
> code (I posted a patch to implement this algorithm yesterday):
>
> 	lea -2(%level), %temp1
> 	orl %temp1, %gpte
> 	movzbl LAST_NONLEAF_LEVEL_OFFSET(%mmu), %temp1
> 	movl %level, %temp2
> 	subl %temp1, %temp2
> 	andl %temp2, %gpte
> 	test $128, %gpte
> 	jz beginning_of_loop
>
> These are eight instructions, with some dependencies between them too.
> In some cases branchless code throws away the result of 10-15
> instructions (because in the end it's ANDed with 0, for example).  If it
> weren't for mispredictions, the branchy code would be faster.
>

Good lesson, thank you, Paolo. :)

>>> Here none of the branches is easily predicted, so we want to get rid of
>>> them.
>>>
>>> The next patch adds three branches, and they are not all equal:
>>>
>>> - is_long_vcpu is well predicted to true (or even for 32-bit OSes it
>>> should be well predicted if the host is not overcommitted).
>>
>> But, in the production, cpu over-commit is the normal case...
>
> It depends on the workload.  I would guess that 32-bit OSes are more
> common where you have a single legacy guest because e.g. it doesn't have
> drivers for recent hardware.
>
>>>> However, i do not think we need a new byte index for PK. The conditions
>>>> detecting PK enablement
>>>> can be fully found in current vcpu content (i.e, CR4, EFER and U/S
>>>> access).
>>>
>>> Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA
>>> too, though Huaitong's patch doesn't do that).  It's a good thing to do.
>>>    U/S is also handled by adding a new byte index, see Huaitong's
>>
>> It is not on the same page, the U/S is the type of memory access which
>> is depended on vCPU runtime.
>
> Do you mean the type of page (ACC_USER_MASK)?  Only U=1 pages are
> subject to PKRU, even in the kernel.  The processor CPL
> (PFERR_USER_MASK) only matters if CR0.WP=0.

No. The index is:
| Byte index: page fault error code [4:1]

So, the type I mentioned is the type of memory access issued by the CPU, e.g.
the CPU is writing to the memory or executing from it.

>
>> But the condition whether PKEY is enabled or not
>> is fully depended on the envorment of CPU and we should _always_
>> check PKEY even if PFEC_PKEY is not set.
>>
>> As PKEY is not enabled on softmmu, the gva_to_gpa mostly comes from internal
>> KVM, that means we should always set PFEC.PKEY for all the gva_to_gpa request.
>> Wasting a bit is really unnecessary.
>>
>> And it is always better to move more workload from permission_fault() to
>> update_permission_bitmask() as the former is much hotter than the latter.
>
> I agree, but I'm not sure why you say that adding a bits adds more work
> to permission_fault().

A branch to check PFEC.PKEY, which is not well predicted on soft mmu. (It
should always be set with EPT, as the page table walking is done by software;
however, if we only consider EPT we can assume it is always true.)

>
> Adding a bit lets us skip CR4.PKU and EFER.LMA checks in
> permission_fault() and in all gva_to_gpa() callers.

The point is when we can clear this bit to skip these checks. We should
_always_ check PKEY even if PFEC.PKEY = 0, because:
1) all gva_to_gpa()s issued by KVM should always check PKEY. This is the
    case for ept only.

2) if the feature is enabled in softmmu, the shadow page table may change
    its behavior; for example, an mmio access causes a reserved #PF, which
    may clear PFEC.PKEY.

And skipping these checks is not really necessary, as we can take them into
account when we update the bitmask.

>
> So my proposal is to compute the "effective" PKRU bits (i.e. extract the
> relevant AD and WD bits, and mask away WD if irrelevant) in
> update_permission_bitmask(), and add PFERR_PK_MASK to the error code if
> they are nonzero.
>
> PFERR_PK_MASK must be computed in permission_fault().  It's a runtime
> condition that it's not known before.
>

Yes, you are right.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-08  8:47         ` Paolo Bonzini
@ 2016-03-08  9:32           ` Xiao Guangrong
  2016-03-08 10:02             ` Paolo Bonzini
  2016-03-09  6:24           ` Yang Zhang
  1 sibling, 1 reply; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-08  9:32 UTC (permalink / raw)
  To: Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm



On 03/08/2016 04:47 PM, Paolo Bonzini wrote:
>
>
> On 08/03/2016 06:54, Xiao Guangrong wrote:
>>
>>
>> On 03/07/2016 04:32 AM, Paolo Bonzini wrote:
>>>
>>>
>>> On 06/03/2016 10:28, Xiao Guangrong wrote:
>>>>> This patch disables CPUID:PKU without ept, because pkeys is not yet
>>>>> implemented for shadow paging.
>>>>
>>>> Does the PKRU is loaded/saved during vm-enter/vm-exit?
>>>
>>> Yes, through XSAVE/XRSTOR (which uses eager mode when PKE is active).
>>
>> You mean eager fpu? however, eager-fpu depends on 'eagerfpu' which is a
>> kernel parameter and this patchset did not force it on.
>
> Some XSAVE features (currently only MPX, but in the future PKRU too)
> will force eagerfpu on, see fpu__init_system_ctx_switch:
>
>          if (xfeatures_mask & XFEATURE_MASK_EAGER) {
>                  if (eagerfpu == DISABLE) {
>                          xfeatures_mask &= ~XFEATURE_MASK_EAGER;

So if the kernel parameter eagerfpu is set to "off", then eager FPU is not
enabled, so PKRU cannot work in KVM?

>                  } else {
>                          eagerfpu = ENABLE;
>                  }
>          }
>
>          if (eagerfpu == ENABLE)
>                  setup_force_cpu_cap(X86_FEATURE_EAGER_FPU);
>
> KVM only exposes a subset of the host XSAVE features so the FPU is
> always eager if KVM exposes MPX and PKRU.
>
>> However, even if we use eager-fpu kvm still can lazily save/load due to
>> some fpu optimizations in kvm.
>
> KVM will use eager FPU if the host uses it.  See arch/x86/kvm/cpuid.c:
>
> 	vcpu->arch.eager_fpu =
> 		use_eager_fpu() || guest_cpuid_has_mpx(vcpu);
>

> But the guest_cpuid_has_mpx(vcpu) check is unnecessary.  The guest CPUID
> cannot have MPX if the host doesn't have the BNDREGS and BNDCSR
> features...  Another patch to send. :)
>

Sorry, I misread the code; yes, if vcpu->arch.eager_fpu is true, it always
saves/loads the FPU on every vm-exit/vm-enter.

>>>> BTW, I just very quickly go through the spec, it seems VMX lacks the
>>>> ability to intercept the access to PKRU. Right?
>>>
>>> Indeed RDPKRU/WRPKRU cannot be intercepted.
>>
>> Er, i was thinking using this feature to speedup write-protection for
>> shadow page table and dirty-logging... it seems not easy as PKRU can not
>> be intercepted. :(
>
> Also it only works on U=1 pages.

Yes, indeed.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 3/7] KVM, pkeys: update memeory permission bitmask for pkeys
  2016-03-08  9:19           ` Xiao Guangrong
@ 2016-03-08 10:01             ` Paolo Bonzini
  2016-03-09  5:03               ` Xiao Guangrong
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-08 10:01 UTC (permalink / raw)
  To: Xiao Guangrong, Huaitong Han, gleb; +Cc: kvm



On 08/03/2016 10:19, Xiao Guangrong wrote:
>>>
>>> As PKEY is not enabled on softmmu, the gva_to_gpa mostly comes from
>>> internal
>>> KVM, that means we should always set PFEC.PKEY for all the gva_to_gpa
>>> request.
>>> Wasting a bit is really unnecessary.
>>>
>>> And it is always better to move more workload from permission_fault() to
>>> update_permission_bitmask() as the former is much hotter than the
>>> latter.
>>
>> I agree, but I'm not sure why you say that adding a bits adds more work
>> to permission_fault().
> 
> A branch to check PFEC.PKEY, which is not well predictable on soft mmu. (It
> should always be set in EPT as the page table walking is done by software,
> however, if we only consider EPT we can assume it is always true).

Ok, I agree we shouldn't add such a branch.

It's really messy that PFERR_PK_MASK is computed at runtime.  It means
we need to check U=1 in permission_fault, and we need to rethink all of
the handling of PFERR_PK_MASK in fact.  I think we should extend the
algorithm from earlier in the thread to handle the AD bit too:

	/* Bit 0: should PKRU.AD be considered if PFEC=0 or 1?
	 * Bit 1: should PKRU.WD be considered if PFEC=0 or 1?
	 * Bit 2: should PKRU.AD be considered if PFEC=2 or 3?
	 * Bit 3: should PKRU.WD be considered if PFEC=2 or 3?
	 * Ignore PKRU if U=0.
	 * etc.
	 */
	u32 pkru_mask;

In update_permission_bitmask, the bits of mmu->pkru_mask only depend on
the byte, not on the bit.

Bits 0, 2, ... are computed from pku, i.e. cr4_pku && long_mode && !ff.
Bits 1, 3, ... are computed from pku && (wf &&
(is_write_protection(vcpu) || uf)).

The permissions[] array can keep 16 entries, because PFERR_PK_MASK is
never passed to permission_fault (like PFERR_RSVD_MASK).  The
permission_fault code is something like:

	WARN_ON(pfec & PFERR_PK_MASK);
	pkru_mask = pte_access & ACC_USER_MASK ? mmu->pkru_mask : 0;
	pkru_bits &= pkru_mask >> (pfec & ~1);
	pfec |= -pkru_bits & PFERR_PK_MASK;

What do you think?

I still believe that the above checks should be wrapped in a

	if (unlikely(mmu->pkru_mask)) {
		...
	}

It should be predicted well and removes the potential performance cost
of branchless code on !PKE OSes.
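
Spelled out, the update_permission_bitmask() side might be roughly the
following, inside the existing per-byte loop; cr4_pku is a local assumed
to mirror cr4_smap/cr4_smep, and pkru_mask is zeroed before the loop:

		bool pkey_ad, pkey_wd;

		/* PKRU never applies to fetches and requires long mode. */
		pkey_ad = cr4_pku && is_long_mode(vcpu) && !ff;
		/* WD matters only if a read-only page would fault too. */
		pkey_wd = pkey_ad && wf &&
			  (is_write_protection(vcpu) || uf);

		/* Two bits per PFEC index: AD at 2*byte, WD just above. */
		mmu->pkru_mask |= (u32)pkey_ad << (byte * 2);
		mmu->pkru_mask |= (u32)pkey_wd << (byte * 2 + 1);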

>> Adding a bit lets us skip CR4.PKU and EFER.LMA checks in
>> permission_fault() and in all gva_to_gpa() callers.
> 
> The point is when we can clear this bit to skip these checks. We should
> _always_ check PKEY even if PFEC.PKEY = 0, because:
> 1) all gva_to_gpa()s issued by KVM should always check PKEY. This is the
>    case of ept only.
> 
> 2) if the feature is enabled in softmmu, shadow page table may change its
>    behavior, for example, the mmio-access causes a reserved PF which
>    may clear PFEC.PKEY.
> 
> And skipping these checks is not really necessary as we can take them into
> account when we update the bitmask.

Got it.  Does the above pseudocode match your idea?  It disregards the
pagefault's PFERR_PK_MASK completely.

Paolo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-08  9:32           ` Xiao Guangrong
@ 2016-03-08 10:02             ` Paolo Bonzini
  2016-03-09  5:51               ` Xiao Guangrong
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-08 10:02 UTC (permalink / raw)
  To: Xiao Guangrong, Huaitong Han, gleb; +Cc: kvm



On 08/03/2016 10:32, Xiao Guangrong wrote:
> On 03/08/2016 04:47 PM, Paolo Bonzini wrote:
>> Some XSAVE features (currently only MPX, but in the future PKRU too)
>> will force eagerfpu on, see fpu__init_system_ctx_switch:
>>
>>          if (xfeatures_mask & XFEATURE_MASK_EAGER) {
>>                  if (eagerfpu == DISABLE) {
>>                          xfeatures_mask &= ~XFEATURE_MASK_EAGER;
> 
> So if the kernel parameter, eagerfpu is set to "off", then eager is not
> enabled, so PKRU can not work in KVM?

Yes.  Neither PKRU nor MPX.

>> KVM will use eager FPU if the host uses it.  See arch/x86/kvm/cpuid.c:
>>
>>     vcpu->arch.eager_fpu =
>>         use_eager_fpu() || guest_cpuid_has_mpx(vcpu);
>>
> 
> Sorry, i missread the code, yes, if vcpu->arch.eager_fpu is true, it is
> always save/load fpu for every vm-exit/vm-enter.

No problem, it's impossible to know all the code. :)

Paolo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 2/7] KVM, pkeys: disable pkeys for guests in non-paging mode
  2016-03-05 11:27 ` [PATCH V4 2/7] KVM, pkeys: disable pkeys for guests in non-paging mode Huaitong Han
  2016-03-06  7:19   ` Xiao Guangrong
@ 2016-03-08 12:09   ` Yang Zhang
  2016-03-08 12:11     ` Paolo Bonzini
  1 sibling, 1 reply; 47+ messages in thread
From: Yang Zhang @ 2016-03-08 12:09 UTC (permalink / raw)
  To: Huaitong Han, pbonzini, gleb; +Cc: kvm

On 2016/3/5 19:27, Huaitong Han wrote:
> Pkeys is disabled if CPU is in non-paging mode in hardware. However KVM
> always uses paging mode to emulate guest non-paging mode with TDP. To
> emulate this behavior, pkeys needs to be manually disabled when guest
> switches to non-paging mode.
>
> Signed-off-by: Huaitong Han <huaitong.han@intel.com>
> ---
> Changes in v4:
> *Patch has rebased on http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=mm/pkeys
>
>   arch/x86/kvm/vmx.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index e2951b6..db33c22 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3855,13 +3855,13 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>
>   	if (!enable_unrestricted_guest && !is_paging(vcpu))

The comment says "Pkeys is disabled if CPU is in non-paging mode in 
hardware". Why check enable_unrestricted_guest here?

>   		/*
> -		 * SMEP/SMAP is disabled if CPU is in non-paging mode in
> +		 * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
>   		 * hardware.  However KVM always uses paging mode without
>   		 * unrestricted guest.
> -		 * To emulate this behavior, SMEP/SMAP needs to be manually
> +		 * To emulate this behavior, SMEP/SMAP/PKU needs to be manually
>   		 * disabled when guest switches to non-paging mode.
>   		 */
> -		hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
> +		hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
>
>   	vmcs_writel(CR4_READ_SHADOW, cr4);
>   	vmcs_writel(GUEST_CR4, hw_cr4);
>


-- 
best regards
yang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 2/7] KVM, pkeys: disable pkeys for guests in non-paging mode
  2016-03-08 12:09   ` Yang Zhang
@ 2016-03-08 12:11     ` Paolo Bonzini
  2016-03-08 13:02       ` Yang Zhang
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-08 12:11 UTC (permalink / raw)
  To: Yang Zhang, Huaitong Han, gleb; +Cc: kvm



On 08/03/2016 13:09, Yang Zhang wrote:
>>
>>       if (!enable_unrestricted_guest && !is_paging(vcpu))
> 
> The comment says "Pkeys is disabled if CPU is in non-paging mode in
> hardware". Why to check enable_unrestricted_guest here?

If !enable_unrestricted_guest, the CPU is always running with CR0.PG=1
and CR4 needs to be modified.

If enable_unrestricted_guest, the CPU automatically disables
SMEP/SMAP/PKU when the guest sets CR0.PG=0.

Paolo

>>           /*
>> -         * SMEP/SMAP is disabled if CPU is in non-paging mode in
>> +         * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
>>            * hardware.  However KVM always uses paging mode without
>>            * unrestricted guest.
>> -         * To emulate this behavior, SMEP/SMAP needs to be manually
>> +         * To emulate this behavior, SMEP/SMAP/PKU needs to be manually
>>            * disabled when guest switches to non-paging mode.
>>            */
>> -        hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
>> +        hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 2/7] KVM, pkeys: disable pkeys for guests in non-paging mode
  2016-03-08 12:11     ` Paolo Bonzini
@ 2016-03-08 13:02       ` Yang Zhang
  0 siblings, 0 replies; 47+ messages in thread
From: Yang Zhang @ 2016-03-08 13:02 UTC (permalink / raw)
  To: Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm

On 2016/3/8 20:11, Paolo Bonzini wrote:
>
>
> On 08/03/2016 13:09, Yang Zhang wrote:
>>>
>>>        if (!enable_unrestricted_guest && !is_paging(vcpu))
>>
>> The comment says "Pkeys is disabled if CPU is in non-paging mode in
>> hardware". Why to check enable_unrestricted_guest here?
>
> If !enable_unrestricted_guest, the CPU is always running with CR0.PG=1
> and CR4 needs to be modified.
>
> If enable_unrestricted_guest, the CPU automatically disables
> SMEP/SMAP/PKU when the guest sets CR0.PG=0.

Thanks for your explanation. It would be better to add it to the comment
below, to explain why only the !enable_unrestricted_guest case is handled.

>
> Paolo
>
>>>            /*
>>> -         * SMEP/SMAP is disabled if CPU is in non-paging mode in
>>> +         * SMEP/SMAP/PKU is disabled if CPU is in non-paging mode in
>>>             * hardware.  However KVM always uses paging mode without
>>>             * unrestricted guest.
>>> -         * To emulate this behavior, SMEP/SMAP needs to be manually
>>> +         * To emulate this behavior, SMEP/SMAP/PKU needs to be manually
>>>             * disabled when guest switches to non-paging mode.
>>>             */
>>> -        hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
>>> +        hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);


-- 
best regards
yang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 3/7] KVM, pkeys: update memeory permission bitmask for pkeys
  2016-03-08 10:01             ` Paolo Bonzini
@ 2016-03-09  5:03               ` Xiao Guangrong
  2016-03-09  8:10                 ` Paolo Bonzini
  0 siblings, 1 reply; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-09  5:03 UTC (permalink / raw)
  To: Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm



On 03/08/2016 06:01 PM, Paolo Bonzini wrote:
>
>
> On 08/03/2016 10:19, Xiao Guangrong wrote:
>>>>
>>>> As PKEY is not enabled on softmmu, the gva_to_gpa mostly comes from
>>>> internal
>>>> KVM, that means we should always set PFEC.PKEY for all the gva_to_gpa
>>>> request.
>>>> Wasting a bit is really unnecessary.
>>>>
>>>> And it is always better to move more workload from permission_fault() to
>>>> update_permission_bitmask() as the former is much hotter than the
>>>> latter.
>>>
>>> I agree, but I'm not sure why you say that adding a bits adds more work
>>> to permission_fault().
>>
>> A branch to check PFEC.PKEY, which is not well predictable on soft mmu. (It
>> should always be set in EPT as the page table walking is done by software,
>> however, if we only consider EPT we can assume it is always true).
>
> Ok, I agree we shouldn't add such a branch.
>
> It's really messy that PKERR_PK_MASK is computed at runtime.  It means
> we need to check U=1 in permission_fault, and we need to rethink all of
> the handling of PKERR_PK_MASK in fact.  I think we should extend the
> algorithm from earlier in the thread to handle the AD bit too:
>
> 	/* Bit 0: should PKRU.AD be considered if PFEC=0 or 1?
> 	 * Bit 1: should PKRU.WD be considered if PFEC=0 or 1?
> 	 * Bit 2: should PKRU.AD be considered if PFEC=2 or 3?
> 	 * Bit 3: should PKRU.WD be considered if PFEC=2 or 3?
> 	 * Ignore PKRU if U=0.
> 	 * etc.
> 	 */
> 	u32 pkru_mask;
>
> In update_permission_bitmask, the bits of mmu->pkru_mask only depend on
> the byte, not on the bit.
>
> Bits 0, 2, ... are computed from pku, i.e. cr4_pku && long_mode && !ff.
> Bits 1, 3, ... are computed from pku && (wf &&
> (is_write_protection(vcpu) || uf)).
>
> The permissions[] array can keep 16 entries, because PFERR_PK_MASK is
> never passed to permission_fault (like PFERR_RSVD_MASK).  The
> permission_fault code is something like:
>
> 	WARN_ON(pfec & PFERR_PK_MASK);
> 	pkru_mask = pte_access & ACC_USER_MASK ? mmu->pkru_mask : 0;
>          pkru_bits &= pkru_mask >> (pfec & ~1);

                                       ^ it should be pfec >> 1? As we
never take PFEC.P into account?

Actually pkru_mask can be u16, since only three bits (PFEC.W, PFEC.U
and PFEC.F) need to be taken into account, and every offset occupies 2 bits.

> 	pfec |= -pkru_bits & PFERR_PK_MASK;
>
> What do you think?

Very good. :)

>
> I still believe that the above checks should be wrapped in a
>
> 	if (unlikely(mmu->pkru_mask) {
> 		...
> 	}
>
> It should be predicted well and removes the potential performance cost
> of branchless code on !PKE OSes.

I am considering that we can use 'pte_access & ACC_USER_MASK' in place of
PFEC.RSVD in your algorithm (then pkru_mask is 32 bits); permission_fault()
would look like this:

WARN_ON(pfec & PFERR_PK_MASK);

offset = pfec & (PFERR_WRITE_MASK | PFERR_USER_MASK);
offset |=  !!(pte_access & ACC_USER_MASK) << PFERR_RSVD_BIT;
offset >>= 1;

pkru_bits &= pkru_mask >> offset;
pfec |= -pkru_bits & PFERR_PK_MASK;

Now there is no branch, and not much overhead is introduced in
update_permission_bitmask().
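
For what it's worth, a rough sketch of how these pieces would combine in
permission_fault(), with the unlikely() wrapper from earlier in the
thread; pte_pkey, read_pkru() and the pkru_mask layout follow the
discussion, everything else is a placeholder:

	if (unlikely(mmu->pkru_mask)) {
		u32 pkru_bits, offset;

		/* The AD/WD pair for the key carried by this pte. */
		pkru_bits = (read_pkru() >> (pte_pkey * 2)) & 3;

		/*
		 * Index pkru_mask with two bits per PFEC combination:
		 * drop PFEC.P, and reuse the RSVD bit position (never
		 * set here) to encode ACC_USER_MASK from the pte.
		 */
		offset = (pfec & ~1) |
			 ((pte_access & ACC_USER_MASK) ? PFERR_RSVD_MASK
						       : 0);

		pkru_bits &= mmu->pkru_mask >> offset;
		pfec |= -pkru_bits & PFERR_PK_MASK;
	}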

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-08 10:02             ` Paolo Bonzini
@ 2016-03-09  5:51               ` Xiao Guangrong
  2016-03-09  6:37                 ` Yang Zhang
  2016-03-09  8:13                 ` Paolo Bonzini
  0 siblings, 2 replies; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-09  5:51 UTC (permalink / raw)
  To: Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm



On 03/08/2016 06:02 PM, Paolo Bonzini wrote:
>
>
> On 08/03/2016 10:32, Xiao Guangrong wrote:
>> On 03/08/2016 04:47 PM, Paolo Bonzini wrote:
>>> Some XSAVE features (currently only MPX, but in the future PKRU too)
>>> will force eagerfpu on, see fpu__init_system_ctx_switch:
>>>
>>>           if (xfeatures_mask & XFEATURE_MASK_EAGER) {
>>>                   if (eagerfpu == DISABLE) {
>>>                           xfeatures_mask &= ~XFEATURE_MASK_EAGER;
>>
>> So if the kernel parameter, eagerfpu is set to "off", then eager is not
>> enabled, so PKRU can not work in KVM?
>
> Yes.  Neither PKRU nor MPX.

Er... I noticed the fpregs are not switched while the CPU is running in the
KVM module (the vcpu is not scheduled out and does not exit to userspace);
that is why read_pkru() can be used to read the guest's PKRU in patch 4.

However, the guest can then fully control access to userspace's memory if
CR4.PKE is enabled on the host, and KVM needs to access QEMU's memory to do
some emulation anyway. Is it really safe?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-08  8:47         ` Paolo Bonzini
  2016-03-08  9:32           ` Xiao Guangrong
@ 2016-03-09  6:24           ` Yang Zhang
  1 sibling, 0 replies; 47+ messages in thread
From: Yang Zhang @ 2016-03-09  6:24 UTC (permalink / raw)
  To: Paolo Bonzini, Xiao Guangrong, Huaitong Han, gleb; +Cc: kvm

On 2016/3/8 16:47, Paolo Bonzini wrote:
>
>
> On 08/03/2016 06:54, Xiao Guangrong wrote:
>>
>>
>> On 03/07/2016 04:32 AM, Paolo Bonzini wrote:
>>>
>>>
>>> On 06/03/2016 10:28, Xiao Guangrong wrote:
>>>>> This patch disables CPUID:PKU without ept, because pkeys is not yet
>>>>> implemented for shadow paging.
>>>>
>>>> Does the PKRU is loaded/saved during vm-enter/vm-exit?
>>>
>>> Yes, through XSAVE/XRSTOR (which uses eager mode when PKE is active).
>>
>> You mean eager fpu? however, eager-fpu depends on 'eagerfpu' which is a
>> kernel parameter and this patchset did not force it on.
>
> Some XSAVE features (currently only MPX, but in the future PKRU too)
> will force eagerfpu on, see fpu__init_system_ctx_switch:
>
>          if (xfeatures_mask & XFEATURE_MASK_EAGER) {
>                  if (eagerfpu == DISABLE) {
>                          xfeatures_mask &= ~XFEATURE_MASK_EAGER;
>                  } else {
>                          eagerfpu = ENABLE;
>                  }
>          }
>
>          if (eagerfpu == ENABLE)
>                  setup_force_cpu_cap(X86_FEATURE_EAGER_FPU);
>
> KVM only exposes a subset of the host XSAVE features so the FPU is
> always eager if KVM exposes MPX and PKRU.
>
>> However, even if we use eager-fpu kvm still can lazily save/load due to
>> some fpu optimizations in kvm.
>
> KVM will use eager FPU if the host uses it.  See arch/x86/kvm/cpuid.c:

Why does KVM need to use eager FPU if the host uses it? I remember the
prerequisite for eager FPU was that the guest has the MPX feature. Besides, I
noticed the original patch only enabled eager_fpu when the guest has it:

vcpu->arch.eager_fpu = guest_cpuid_has_mpx(vcpu);

Is there any discussion around this change? I cannot find it through
google. :(

>
> 	vcpu->arch.eager_fpu =
> 		use_eager_fpu() || guest_cpuid_has_mpx(vcpu);
>
> But the guest_cpuid_has_mpx(vcpu) check is unnecessary.  The guest CPUID
> cannot have MPX if the host doesn't have the BNDREGS and BNDCSR
> features...  Another patch to send. :)
>
>>>> BTW, I just very quickly go through the spec, it seems VMX lacks the
>>>> ability to intercept the access to PKRU. Right?
>>>
>>> Indeed RDPKRU/WRPKRU cannot be intercepted.
>>
>> Er, i was thinking using this feature to speedup write-protection for
>> shadow page table and dirty-logging... it seems not easy as PKRU can not
>> be intercepted. :(
>
> Also it only works on U=1 pages.
>
> Paolo
>


-- 
best regards
yang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-09  5:51               ` Xiao Guangrong
@ 2016-03-09  6:37                 ` Yang Zhang
  2016-03-09  7:21                   ` Xiao Guangrong
  2016-03-09  8:13                 ` Paolo Bonzini
  1 sibling, 1 reply; 47+ messages in thread
From: Yang Zhang @ 2016-03-09  6:37 UTC (permalink / raw)
  To: Xiao Guangrong, Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm

On 2016/3/9 13:51, Xiao Guangrong wrote:
>
>
> On 03/08/2016 06:02 PM, Paolo Bonzini wrote:
>>
>>
>> On 08/03/2016 10:32, Xiao Guangrong wrote:
>>> On 03/08/2016 04:47 PM, Paolo Bonzini wrote:
>>>> Some XSAVE features (currently only MPX, but in the future PKRU too)
>>>> will force eagerfpu on, see fpu__init_system_ctx_switch:
>>>>
>>>>           if (xfeatures_mask & XFEATURE_MASK_EAGER) {
>>>>                   if (eagerfpu == DISABLE) {
>>>>                           xfeatures_mask &= ~XFEATURE_MASK_EAGER;
>>>
>>> So if the kernel parameter, eagerfpu is set to "off", then eager is not
>>> enabled, so PKRU can not work in KVM?
>>
>> Yes.  Neither PKRU nor MPX.
>
> Er... I noticed fpregs is not switched if the CPU is running in KVM module
> (vcpu is not scheduled out and does not exit to userspace), that is why
> read_pkru() can be used to read guest's PKRU in the patch 4.
>
> However, then guest can fully control the access of userspace's memory if
> CR4.PKRU is enabled on host and KVM needs to access QEMU's memory to do
> some
> emulation anyway. Is it really safe?

I think it depends on how we understand the guest's use of pkeys. From my
point of view, the guest only wants to protect the pages from the guest's
view, not from KVM's, so access from KVM should be totally transparent to
the guest and should not be observed by it. As for modification, I think
current KVM only touches DMA buffers that are set up by the guest driver.
It's the guest's responsibility to ensure the pages it wants to protect
are not used as DMA buffers.

-- 
best regards
yang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-09  6:37                 ` Yang Zhang
@ 2016-03-09  7:21                   ` Xiao Guangrong
  2016-03-09  7:41                     ` Yang Zhang
  0 siblings, 1 reply; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-09  7:21 UTC (permalink / raw)
  To: Yang Zhang, Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm



On 03/09/2016 02:37 PM, Yang Zhang wrote:
> On 2016/3/9 13:51, Xiao Guangrong wrote:
>>
>>
>> On 03/08/2016 06:02 PM, Paolo Bonzini wrote:
>>>
>>>
>>> On 08/03/2016 10:32, Xiao Guangrong wrote:
>>>> On 03/08/2016 04:47 PM, Paolo Bonzini wrote:
>>>>> Some XSAVE features (currently only MPX, but in the future PKRU too)
>>>>> will force eagerfpu on, see fpu__init_system_ctx_switch:
>>>>>
>>>>>           if (xfeatures_mask & XFEATURE_MASK_EAGER) {
>>>>>                   if (eagerfpu == DISABLE) {
>>>>>                           xfeatures_mask &= ~XFEATURE_MASK_EAGER;
>>>>
>>>> So if the kernel parameter, eagerfpu is set to "off", then eager is not
>>>> enabled, so PKRU can not work in KVM?
>>>
>>> Yes.  Neither PKRU nor MPX.
>>
>> Er... I noticed fpregs is not switched if the CPU is running in KVM module
>> (vcpu is not scheduled out and does not exit to userspace), that is why
>> read_pkru() can be used to read guest's PKRU in the patch 4.
>>
>> However, then guest can fully control the access of userspace's memory if
>> CR4.PKRU is enabled on host and KVM needs to access QEMU's memory to do
>> some
>> emulation anyway. Is it really safe?
>
> I think it depends on how we understand the guest uses Pkeys. From my point, guest only wants to
> protect the pages from guest's view, not kvm. So the access from KVM should be totally transparent
> to guest. And should not be aware by guest. For modification, i think current KVM only touch the DMA
> buffer which is setup by guest driver. It's guest responsibility to ensure the pages he want to
> protect are not used as DMA buffer.
>

Not really. An example is that KVM needs to read guest memory to emulate some
instructions.

Worse, the pkey bits in a pte entry differ between guest and host (they
are using different page tables), so the guest cannot know what index in PKRU
will be used by KVM. An evil guest can use that to attack the host...


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-09  7:21                   ` Xiao Guangrong
@ 2016-03-09  7:41                     ` Yang Zhang
  2016-03-09  7:50                       ` Xiao Guangrong
  0 siblings, 1 reply; 47+ messages in thread
From: Yang Zhang @ 2016-03-09  7:41 UTC (permalink / raw)
  To: Xiao Guangrong, Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm

On 2016/3/9 15:21, Xiao Guangrong wrote:
>
>
> On 03/09/2016 02:37 PM, Yang Zhang wrote:
>> On 2016/3/9 13:51, Xiao Guangrong wrote:
>>>
>>>
>>> On 03/08/2016 06:02 PM, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 08/03/2016 10:32, Xiao Guangrong wrote:
>>>>> On 03/08/2016 04:47 PM, Paolo Bonzini wrote:
>>>>>> Some XSAVE features (currently only MPX, but in the future PKRU too)
>>>>>> will force eagerfpu on, see fpu__init_system_ctx_switch:
>>>>>>
>>>>>>           if (xfeatures_mask & XFEATURE_MASK_EAGER) {
>>>>>>                   if (eagerfpu == DISABLE) {
>>>>>>                           xfeatures_mask &= ~XFEATURE_MASK_EAGER;
>>>>>
>>>>> So if the kernel parameter, eagerfpu is set to "off", then eager is
>>>>> not
>>>>> enabled, so PKRU can not work in KVM?
>>>>
>>>> Yes.  Neither PKRU nor MPX.
>>>
>>> Er... I noticed fpregs is not switched if the CPU is running in KVM
>>> module
>>> (vcpu is not scheduled out and does not exit to userspace), that is why
>>> read_pkru() can be used to read guest's PKRU in the patch 4.
>>>
>>> However, then guest can fully control the access of userspace's
>>> memory if
>>> CR4.PKRU is enabled on host and KVM needs to access QEMU's memory to do
>>> some
>>> emulation anyway. Is it really safe?
>>
>> I think it depends on how we understand the guest uses Pkeys. From my
>> point, guest only wants to
>> protect the pages from guest's view, not kvm. So the access from KVM
>> should be totally transparent
>> to guest. And should not be aware by guest. For modification, i think
>> current KVM only touch the DMA
>> buffer which is setup by guest driver. It's guest responsibility to
>> ensure the pages he want to
>> protect are not used as DMA buffer.
>>
>
> No really. A example is KVM need to read guest memory to emulate some
> instructions.

This is the access case. I don't think the guest is interested in such accesses.

>
> More worse, the pkey-bits on pte entry is different between guest and
> host (they
> are using different page tables) so guest can not know what index in
> PKRU will be
> used by KVM. A evil guest can use it to attack host...

Can you give an example of how an evil guest can attack the host?

One question: if host and guest are both using pkeys, how do we handle
that case? I don't see the current patch considering it.

-- 
best regards
yang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-09  7:41                     ` Yang Zhang
@ 2016-03-09  7:50                       ` Xiao Guangrong
  2016-03-09  8:00                         ` Yang Zhang
  0 siblings, 1 reply; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-09  7:50 UTC (permalink / raw)
  To: Yang Zhang, Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm



On 03/09/2016 03:41 PM, Yang Zhang wrote:
> On 2016/3/9 15:21, Xiao Guangrong wrote:
>>
>>
>> On 03/09/2016 02:37 PM, Yang Zhang wrote:
>>> On 2016/3/9 13:51, Xiao Guangrong wrote:
>>>>
>>>>
>>>> On 03/08/2016 06:02 PM, Paolo Bonzini wrote:
>>>>>
>>>>>
>>>>> On 08/03/2016 10:32, Xiao Guangrong wrote:
>>>>>> On 03/08/2016 04:47 PM, Paolo Bonzini wrote:
>>>>>>> Some XSAVE features (currently only MPX, but in the future PKRU too)
>>>>>>> will force eagerfpu on, see fpu__init_system_ctx_switch:
>>>>>>>
>>>>>>>           if (xfeatures_mask & XFEATURE_MASK_EAGER) {
>>>>>>>                   if (eagerfpu == DISABLE) {
>>>>>>>                           xfeatures_mask &= ~XFEATURE_MASK_EAGER;
>>>>>>
>>>>>> So if the kernel parameter, eagerfpu is set to "off", then eager is
>>>>>> not
>>>>>> enabled, so PKRU can not work in KVM?
>>>>>
>>>>> Yes.  Neither PKRU nor MPX.
>>>>
>>>> Er... I noticed fpregs is not switched if the CPU is running in KVM
>>>> module
>>>> (vcpu is not scheduled out and does not exit to userspace), that is why
>>>> read_pkru() can be used to read guest's PKRU in the patch 4.
>>>>
>>>> However, then guest can fully control the access of userspace's
>>>> memory if
>>>> CR4.PKRU is enabled on host and KVM needs to access QEMU's memory to do
>>>> some
>>>> emulation anyway. Is it really safe?
>>>
>>> I think it depends on how we understand the guest uses Pkeys. From my
>>> point, guest only wants to
>>> protect the pages from guest's view, not kvm. So the access from KVM
>>> should be totally transparent
>>> to guest. And should not be aware by guest. For modification, i think
>>> current KVM only touch the DMA
>>> buffer which is setup by guest driver. It's guest responsibility to
>>> ensure the pages he want to
>>> protect are not used as DMA buffer.
>>>
>>
>> Not really. An example is that KVM needs to read guest memory to
>> emulate some instructions.
>
> This is the access case. I don't think the guest is interested in
> such accesses.

See below.

>
>>
>> Worse, the pkey bits in the pte entries differ between guest and
>> host (they use different page tables), so the guest cannot know which
>> PKRU index will be used by KVM. An evil guest can use this to attack
>> the host...
>
> Can you give an example of how an evil guest can attack the host?
>
> One question: if host and guest are both using pkeys, how is that case
> handled? I don't see the current patches considering it.

This is exactly what I said...

Currently, this patchset does not restore the host's PKRU, so the host
will use the guest's PKRU for memory accesses.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-09  7:50                       ` Xiao Guangrong
@ 2016-03-09  8:00                         ` Yang Zhang
  2016-03-09  8:05                           ` Xiao Guangrong
  0 siblings, 1 reply; 47+ messages in thread
From: Yang Zhang @ 2016-03-09  8:00 UTC (permalink / raw)
  To: Xiao Guangrong, Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm

On 2016/3/9 15:50, Xiao Guangrong wrote:
>
>
> On 03/09/2016 03:41 PM, Yang Zhang wrote:
>> On 2016/3/9 15:21, Xiao Guangrong wrote:
>>>
>>>
>>> On 03/09/2016 02:37 PM, Yang Zhang wrote:
>>>> On 2016/3/9 13:51, Xiao Guangrong wrote:
>>>>>
>>>>>
>>>>> On 03/08/2016 06:02 PM, Paolo Bonzini wrote:
>>>>>>
>>>>>>
>>>>>> On 08/03/2016 10:32, Xiao Guangrong wrote:
>>>>>>> On 03/08/2016 04:47 PM, Paolo Bonzini wrote:
>>>>>>>> Some XSAVE features (currently only MPX, but in the future PKRU
>>>>>>>> too)
>>>>>>>> will force eagerfpu on, see fpu__init_system_ctx_switch:
>>>>>>>>
>>>>>>>>           if (xfeatures_mask & XFEATURE_MASK_EAGER) {
>>>>>>>>                   if (eagerfpu == DISABLE) {
>>>>>>>>                           xfeatures_mask &= ~XFEATURE_MASK_EAGER;
>>>>>>>
>>>>>>> So if the kernel parameter eagerfpu is set to "off", then eager
>>>>>>> FPU switching is not enabled, and PKRU cannot work in KVM?
>>>>>>
>>>>>> Yes.  Neither PKRU nor MPX.
>>>>>
>>>>> Er... I noticed that fpregs are not switched while the CPU is
>>>>> running in the KVM module (the vcpu is not scheduled out and does
>>>>> not exit to userspace), which is why read_pkru() can be used to
>>>>> read the guest's PKRU in patch 4.
>>>>>
>>>>> However, the guest can then fully control access to userspace
>>>>> memory if CR4.PKE is enabled on the host, and KVM needs to access
>>>>> QEMU's memory to do some emulation anyway. Is it really safe?
>>>>
>>>> I think it depends on how we understand the guest's use of pkeys.
>>>> From my point of view, the guest only wants to protect pages from
>>>> the guest's own view, not from KVM, so accesses from KVM should be
>>>> totally transparent to the guest and not visible to it. As for
>>>> modifications, I think current KVM only touches the DMA buffers
>>>> which are set up by the guest driver. It's the guest's
>>>> responsibility to ensure that the pages it wants to protect are
>>>> not used as DMA buffers.
>>>>
>>>
>>> Not really. An example is that KVM needs to read guest memory to
>>> emulate some instructions.
>>
>> This is the access case. I don't think the guest is interested in
>> such accesses.
>
> See below.
>
>>
>>>
>>> Worse, the pkey bits in the pte entries differ between guest and
>>> host (they use different page tables), so the guest cannot know
>>> which PKRU index will be used by KVM. An evil guest can use this to
>>> attack the host...
>>
>> Can you give an example of how an evil guest can attack the host?
>>
>> One question: if host and guest are both using pkeys, how is that
>> case handled? I don't see the current patches considering it.
>
> This is exactly what I said...

Ah... I thought you were asking whether we should let the guest be
aware of accesses from the hypervisor. Sorry for misreading your mail.

>
> Currently, this patchset does not restore the host's PKRU, so the host
> will use the guest's PKRU for memory accesses.

This is definitely wrong. Besides, should we consider the host's
setting while the guest is running?

-- 
best regards
yang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-09  8:00                         ` Yang Zhang
@ 2016-03-09  8:05                           ` Xiao Guangrong
  2016-03-09  8:18                             ` Paolo Bonzini
  0 siblings, 1 reply; 47+ messages in thread
From: Xiao Guangrong @ 2016-03-09  8:05 UTC (permalink / raw)
  To: Yang Zhang, Paolo Bonzini, Huaitong Han, gleb; +Cc: kvm



On 03/09/2016 04:00 PM, Yang Zhang wrote:
> On 2016/3/9 15:50, Xiao Guangrong wrote:
>>
>>
>> On 03/09/2016 03:41 PM, Yang Zhang wrote:
>>> On 2016/3/9 15:21, Xiao Guangrong wrote:
>>>>
>>>>
>>>> On 03/09/2016 02:37 PM, Yang Zhang wrote:
>>>>> On 2016/3/9 13:51, Xiao Guangrong wrote:
>>>>>>
>>>>>>
>>>>>> On 03/08/2016 06:02 PM, Paolo Bonzini wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 08/03/2016 10:32, Xiao Guangrong wrote:
>>>>>>>> On 03/08/2016 04:47 PM, Paolo Bonzini wrote:
>>>>>>>>> Some XSAVE features (currently only MPX, but in the future PKRU
>>>>>>>>> too)
>>>>>>>>> will force eagerfpu on, see fpu__init_system_ctx_switch:
>>>>>>>>>
>>>>>>>>>           if (xfeatures_mask & XFEATURE_MASK_EAGER) {
>>>>>>>>>                   if (eagerfpu == DISABLE) {
>>>>>>>>>                           xfeatures_mask &= ~XFEATURE_MASK_EAGER;
>>>>>>>>
>>>>>>>> So if the kernel parameter eagerfpu is set to "off", then eager
>>>>>>>> FPU switching is not enabled, and PKRU cannot work in KVM?
>>>>>>>
>>>>>>> Yes.  Neither PKRU nor MPX.
>>>>>>
>>>>>> Er... I noticed that fpregs are not switched while the CPU is
>>>>>> running in the KVM module (the vcpu is not scheduled out and does
>>>>>> not exit to userspace), which is why read_pkru() can be used to
>>>>>> read the guest's PKRU in patch 4.
>>>>>>
>>>>>> However, the guest can then fully control access to userspace
>>>>>> memory if CR4.PKE is enabled on the host, and KVM needs to access
>>>>>> QEMU's memory to do some emulation anyway. Is it really safe?
>>>>>
>>>>> I think it depends on how we understand the guest's use of pkeys.
>>>>> From my point of view, the guest only wants to protect pages from
>>>>> the guest's own view, not from KVM, so accesses from KVM should be
>>>>> totally transparent to the guest and not visible to it. As for
>>>>> modifications, I think current KVM only touches the DMA buffers
>>>>> which are set up by the guest driver. It's the guest's
>>>>> responsibility to ensure that the pages it wants to protect are
>>>>> not used as DMA buffers.
>>>>>
>>>>
>>>> Not really. An example is that KVM needs to read guest memory to
>>>> emulate some instructions.
>>>
>>> This is the access case. I don't think the guest is interested in
>>> such accesses.
>>
>> See below.
>>
>>>
>>>>
>>>> Worse, the pkey bits in the pte entries differ between guest and
>>>> host (they use different page tables), so the guest cannot know
>>>> which PKRU index will be used by KVM. An evil guest can use this to
>>>> attack the host...
>>>
>>> Can you give an example of how an evil guest can attack the host?
>>>
>>> One question: if host and guest are both using pkeys, how is that
>>> case handled? I don't see the current patches considering it.
>>
>> This is exactly what I said...
>
> Ah... I thought you were asking whether we should let the guest be
> aware of accesses from the hypervisor. Sorry for misreading your mail.
>
>>
>> Currently, this patchset does not restore the host's PKRU, so the
>> host will use the guest's PKRU for memory accesses.
>
> This is definitely wrong. Besides, should we consider the host's
> setting while the guest is running?

We should. There is no reason to stop QEMU and other KVM-based
hypervisors from using protection keys. :)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 3/7] KVM, pkeys: update memory permission bitmask for pkeys
  2016-03-09  5:03               ` Xiao Guangrong
@ 2016-03-09  8:10                 ` Paolo Bonzini
  0 siblings, 0 replies; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-09  8:10 UTC (permalink / raw)
  To: Xiao Guangrong, Huaitong Han, gleb; +Cc: kvm



On 09/03/2016 06:03, Xiao Guangrong wrote:
>>     WARN_ON(pfec & PFERR_PK_MASK);
>>     pkru_mask = pte_access & ACC_USER_MASK ? mmu->pkru_mask : 0;
>>          pkru_bits &= pkru_mask >> (pfec & ~1);
> 
>                                       ^ should it be pfec >> 1, since
> we never take PFEC.P into account?

It should be "(pfec >> 1) * 2" (don't take PFEC.P into account, but
reserve two bits per entry), which is the same as pfec & ~1.

> Actually pkru_mask can be u16, since only three bits (PFEC.W, PFEC.U
> and PFEC.F) need to be taken into account, and every offset occupies
> 2 bits.

Right, but PFEC.F is bit 4.  There is RSVD in the middle.  So it needs
to be u32.
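
For reference, this is the PFEC bit numbering the lookup relies on (the
PFERR_*_BIT constants are the ones defined in
arch/x86/include/asm/kvm_host.h; the annotations are illustrative):

	PFERR_PRESENT_BIT = 0	/* PFEC.P, ignored by the lookup */
	PFERR_WRITE_BIT   = 1	/* PFEC.W */
	PFERR_USER_BIT    = 2	/* PFEC.U */
	PFERR_RSVD_BIT    = 3	/* reused to carry the pte_access user bit */
	PFERR_FETCH_BIT   = 4	/* PFEC.F */

With W, U, RSVD and F all set, the offset is 30 and the two bits read
from pkru_mask occupy bits 30-31, hence u32.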

> I am considering using 'pte_access & ACC_USER_MASK' to replace
> PFEC.RSVD in your algorithm (then pkru_mask is 32 bits);
> permission_fault() would look like this:
> 
> WARN_ON(pfec & PFERR_PK_MASK);
> offset = pfec & (PFERR_WRITE_MASK | PFERR_USER_MASK);

You're forgetting PFERR_FETCH_MASK.

> offset |=  !!(pte_access & ACC_USER_MASK) << PFERR_RSVD_BIT;

Better:

offset |= (pte_access & ACC_USER_MASK) <<
     (PFERR_RSVD_BIT - ACC_USER_BIT);

> offset >>= 1;

As discussed above, bit 0 of the offset is necessary, since each entry
occupies two bits.

> pkru_bits &= pkru_mask >> offset;
> pfec |= -pkru_bits & PFERR_PK_MASK;

So, putting it together:

	if (mmu->pkru_mask) {
		WARN_ON(pfec & PFERR_PK_MASK);
		offset = pfec & ~1;
		offset |= (pte_access & ACC_USER_MASK) <<
			(PFERR_RSVD_BIT - ACC_USER_BIT);
		pkru_bits &= mmu->pkru_mask >> offset;
		pfec |= -pkru_bits & PFERR_PK_MASK;
	}

With this change, I think it's better to make update_pkru_bitmask a
separate function, instead of merging it into
update_permissions_bitmask.  There will be a little duplicated code,
but not much.
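
For illustration only, a minimal sketch of such a separate function,
assuming the PFERR_* encoding discussed above and a u32 pkru_mask field
in struct kvm_mmu (the gating on CR4.PKE and EPT is omitted, and names
and placement are assumptions, not the final patch):

	static void update_pkru_bitmask(struct kvm_vcpu *vcpu,
					struct kvm_mmu *mmu)
	{
		bool wp = is_write_protection(vcpu);	/* CR0.WP */
		unsigned bit;

		mmu->pkru_mask = 0;
		/* Walk every PFEC combination, ignoring PFEC.P (bit 0). */
		for (bit = 0; bit < 16; ++bit) {
			unsigned pfec = bit << 1;
			bool ff = pfec & PFERR_FETCH_MASK;
			bool uf = pfec & PFERR_USER_MASK;
			bool wf = pfec & PFERR_WRITE_MASK;
			/* The RSVD position carries ACC_USER_MASK here. */
			bool pte_user = pfec & PFERR_RSVD_MASK;
			/* Pkeys only guard data accesses to user pages. */
			bool check_pkey = !ff && pte_user;
			/* WD hits user writes, or any write if CR0.WP=1. */
			bool check_write = check_pkey && wf && (uf || wp);
			unsigned pkey_bits;

			pkey_bits = !!check_pkey;		/* AD */
			pkey_bits |= (!!check_write) << 1;	/* WD */
			mmu->pkru_mask |= pkey_bits << pfec;
		}
	}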

Thanks!

Paolo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-09  5:51               ` Xiao Guangrong
  2016-03-09  6:37                 ` Yang Zhang
@ 2016-03-09  8:13                 ` Paolo Bonzini
  1 sibling, 0 replies; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-09  8:13 UTC (permalink / raw)
  To: Xiao Guangrong, Huaitong Han, gleb; +Cc: kvm



On 09/03/2016 06:51, Xiao Guangrong wrote:
>>>
>>
>> Yes.  Neither PKRU nor MPX.
> 
> Er... I noticed that fpregs are not switched while the CPU is running
> in the KVM module (the vcpu is not scheduled out and does not exit to
> userspace), which is why read_pkru() can be used to read the guest's
> PKRU in patch 4.
> 
> However, the guest can then fully control access to userspace memory
> if CR4.PKE is enabled on the host, and KVM needs to access QEMU's
> memory to do some emulation anyway. Is it really safe?

I was thinking the same, and I think you're right.  We need to
save/restore PKRU in vmx_vcpu_run, and access a field in kvm_arch_vcpu
instead of using __read_pkru directly.
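
A minimal sketch of that save/restore around the entry/exit path in
vmx_vcpu_run, assuming a vcpu->arch.pkru field and a cached
vmx->host_pkru (both fields are assumptions here, not existing code):

	/* Before entering the guest: install the guest's PKRU. */
	if (static_cpu_has(X86_FEATURE_PKU) &&
	    kvm_read_cr4_bits(vcpu, X86_CR4_PKE) &&
	    vcpu->arch.pkru != vmx->host_pkru)
		__write_pkru(vcpu->arch.pkru);

	/* ... VMLAUNCH/VMRESUME ... */

	/* After VM-exit: save the guest's PKRU, restore the host's. */
	if (static_cpu_has(X86_FEATURE_PKU) &&
	    kvm_read_cr4_bits(vcpu, X86_CR4_PKE)) {
		vcpu->arch.pkru = __read_pkru();
		if (vcpu->arch.pkru != vmx->host_pkru)
			__write_pkru(vmx->host_pkru);
	}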

Paolo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept
  2016-03-09  8:05                           ` Xiao Guangrong
@ 2016-03-09  8:18                             ` Paolo Bonzini
  0 siblings, 0 replies; 47+ messages in thread
From: Paolo Bonzini @ 2016-03-09  8:18 UTC (permalink / raw)
  To: Xiao Guangrong, Yang Zhang, Huaitong Han, gleb; +Cc: kvm



On 09/03/2016 09:05, Xiao Guangrong wrote:
>> Besides, should we consider the host's setting while the guest is
>> running?
> 
> We should. There is no reason to stop QEMU and other KVM-based
> hypervisors from using protection keys. :)

This is a bit tricky.  Without pkey support in EPT, you'd also have to:
1) save the host PKRU somewhere in kvm_mmu between invocations of
KVM_RUN, and call kvm_mmu_reset_context when it changes; 2) get the
pkeys from the host pages, compute AD/WD, and use them to fill in the
permissions for the shadow or EPT page tables; 3) add AD/WD to the page
role (sketched below).

So it's a good feature, but it should be a separate one.
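
Purely as an illustration of item 3, one could imagine extending the
page role with host-side AD/WD bits, so that shadow pages built under
different host pkey rights are kept apart. These fields are
hypothetical and do not exist in the real kvm_mmu_page_role:

	union kvm_mmu_page_role {
		unsigned word;
		struct {
			/* ... the existing role bits ... */
			unsigned host_pkey_ad:1;	/* hypothetical */
			unsigned host_pkey_wd:1;	/* hypothetical */
		};
	};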

Paolo

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2016-03-09  8:18 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-05 11:27 [PATCH V4 0/7] KVM, pkeys: add memory protection-key support Huaitong Han
2016-03-05 11:27 ` [PATCH V4 1/7] KVM, pkeys: expose CPUID/CR4 to guest Huaitong Han
2016-03-06  7:15   ` Xiao Guangrong
2016-03-06 23:20     ` Paolo Bonzini
2016-03-08  7:39       ` Xiao Guangrong
2016-03-08  7:58         ` Paolo Bonzini
2016-03-05 11:27 ` [PATCH V4 2/7] KVM, pkeys: disable pkeys for guests in non-paging mode Huaitong Han
2016-03-06  7:19   ` Xiao Guangrong
2016-03-08 12:09   ` Yang Zhang
2016-03-08 12:11     ` Paolo Bonzini
2016-03-08 13:02       ` Yang Zhang
2016-03-05 11:27 ` [PATCH V4 3/7] KVM, pkeys: update memory permission bitmask for pkeys Huaitong Han
2016-03-06  7:42   ` Xiao Guangrong
2016-03-06 23:14     ` Paolo Bonzini
2016-03-08  7:35       ` Xiao Guangrong
2016-03-08  8:29         ` Paolo Bonzini
2016-03-08  9:19           ` Xiao Guangrong
2016-03-08 10:01             ` Paolo Bonzini
2016-03-09  5:03               ` Xiao Guangrong
2016-03-09  8:10                 ` Paolo Bonzini
2016-03-05 11:27 ` [PATCH V4 4/7] KVM, pkeys: add pkeys support for permission_fault logic Huaitong Han
2016-03-06  8:00   ` Xiao Guangrong
2016-03-06 20:36     ` Paolo Bonzini
2016-03-06 23:29       ` Paolo Bonzini
2016-03-08  5:57       ` Xiao Guangrong
2016-03-05 11:27 ` [PATCH V4 5/7] KVM, pkeys: Add pkeys support for gva_to_gpa functions Huaitong Han
2016-03-06  8:01   ` Xiao Guangrong
2016-03-06 21:33     ` Paolo Bonzini
2016-03-05 11:27 ` [PATCH V4 6/7] KVM, pkeys: add pkeys support for xsave state Huaitong Han
2016-03-06  8:27   ` Xiao Guangrong
2016-03-05 11:27 ` [PATCH V4 7/7] KVM, pkeys: disable PKU feature without ept Huaitong Han
2016-03-06  9:28   ` Xiao Guangrong
2016-03-06 20:32     ` Paolo Bonzini
2016-03-08  5:54       ` Xiao Guangrong
2016-03-08  8:47         ` Paolo Bonzini
2016-03-08  9:32           ` Xiao Guangrong
2016-03-08 10:02             ` Paolo Bonzini
2016-03-09  5:51               ` Xiao Guangrong
2016-03-09  6:37                 ` Yang Zhang
2016-03-09  7:21                   ` Xiao Guangrong
2016-03-09  7:41                     ` Yang Zhang
2016-03-09  7:50                       ` Xiao Guangrong
2016-03-09  8:00                         ` Yang Zhang
2016-03-09  8:05                           ` Xiao Guangrong
2016-03-09  8:18                             ` Paolo Bonzini
2016-03-09  8:13                 ` Paolo Bonzini
2016-03-09  6:24           ` Yang Zhang
