* Memory Read Only Enforcement: VMM assisted kernel rootkit mitigation for KVM
From: Ahmed Abd El Mawgood @ 2018-07-19 21:37 UTC
To: kvm, Kernel Hardening, virtualization, linux-doc, x86
Cc: Ard Biesheuvel, Kees Cook, Jonathan Corbet, David Vrabel, rkrcmar,
Boris Lukashev, Ingo Molnar, nigel.edwards, hpa, Paolo Bonzini,
Thomas Gleixner, Rik van Riel
Hi,
This is the first version of these patches that works as I expect, and
the third revision I have sent to the mailing lists.
This follows up on my previous discussions about kernel rootkit mitigation
by placing R/O protection on critical data structures, static data, and
privileged registers with static content. These patches present the
first part, in which protection can only be placed on memory pages.
Feature-wise, this set of patches is incomplete in the following ways:
- It does not yet protect privileged registers.
- It does not protect the guest TLB from malicious gva -> gpa page mappings.
Still, it sketches a basic working design. Note that I am a complete
newcomer and it took a lot of time and effort to get to this point, so
apologies in advance if I have overlooked something.
[PATCH 1/3] [RFC V3] KVM: X86: Memory ROE documentation
[PATCH 2/3] [RFC V3] KVM: X86: Adding arbitrary data pointer in kvm memslot iterator functions
[PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE
Summary:
Documentation/virtual/kvm/hypercalls.txt | 14 ++++
arch/x86/include/asm/kvm_host.h | 11 ++-
arch/x86/kvm/Kconfig | 7 ++
arch/x86/kvm/mmu.c | 127 ++++++++++++++++++++++---------
arch/x86/kvm/x86.c | 82 +++++++++++++++++++-
include/linux/kvm_host.h | 3 +
include/uapi/linux/kvm_para.h | 1 +
virt/kvm/kvm_main.c | 29 ++++++-
8 files changed, 232 insertions(+), 42 deletions(-)
* [PATCH 1/3] [RFC V3] KVM: X86: Memory ROE documentation
From: Ahmed Abd El Mawgood @ 2018-07-19 21:38 UTC
To: kvm, Kernel Hardening, virtualization, linux-doc, x86
Cc: Ard Biesheuvel, Kees Cook, Jonathan Corbet, David Vrabel, rkrcmar,
Boris Lukashev, Ingo Molnar, nigel.edwards, hpa, Paolo Bonzini,
Thomas Gleixner, Rik van Riel
In-Reply-To: <20180719213802.17161-1-ahmedsoliman0x666@gmail.com>
This follows up on my previous threads on KVM-assisted anti-rootkit
protections.
The current version does not address attacks that involve page
remapping. That part of the design is still in progress and will be
covered in my later patch sets.
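To illustrate why page remapping matters, here is a toy user-space model.
It is not KVM code: struct toy_guest and the toy_* helpers are invented for
this sketch, and one byte stands in for a whole page. Protection is tracked
per guest physical frame, while the guest still controls which frame a
virtual page points to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_FRAMES 4	/* hypothetical number of guest physical frames */
#define NR_VPAGES 4	/* hypothetical number of guest virtual pages */

struct toy_guest {
	unsigned char frame[NR_FRAMES];	/* one byte stands in for a page */
	bool frame_ro[NR_FRAMES];	/* host-side read-only state, per frame */
	size_t page_table[NR_VPAGES];	/* guest-controlled gva -> gpa mapping */
};

/* Host side: mark a guest physical frame read-only. There is no undo. */
static void toy_roe_protect(struct toy_guest *g, size_t frame)
{
	assert(frame < NR_FRAMES);
	g->frame_ro[frame] = true;
}

/* A guest write is translated first; only the resulting frame is checked. */
static bool toy_guest_write(struct toy_guest *g, size_t vpage,
			    unsigned char val)
{
	size_t frame;

	assert(vpage < NR_VPAGES);
	frame = g->page_table[vpage];
	assert(frame < NR_FRAMES);
	if (g->frame_ro[frame])
		return false;	/* write blocked by the host */
	g->frame[frame] = val;
	return true;
}

/* What code reading through the virtual page observes. */
static unsigned char toy_guest_read(const struct toy_guest *g, size_t vpage)
{
	assert(vpage < NR_VPAGES);
	assert(g->page_table[vpage] < NR_FRAMES);
	return g->frame[g->page_table[vpage]];
}
```

In this model, once an attacker rewrites page_table[] so that the virtual
page points at an unprotected frame, writes succeed again and readers of
that virtual page see attacker-controlled data, even though the protected
frame itself is unchanged.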
Signed-off-by: Ahmed Abd El Mawgood <ahmedsoliman0x666@gmail.com>
---
Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt
index a890529c63ed..a9db68adb7c9 100644
--- a/Documentation/virtual/kvm/hypercalls.txt
+++ b/Documentation/virtual/kvm/hypercalls.txt
@@ -121,3 +121,17 @@ compute the CLOCK_REALTIME for its clock, at the same instant.
Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
+
+7. KVM_HC_HMROE
+----------------
+Architecture: x86
+Status: active
+Purpose: Hypercall used to apply Read-Only Enforcement to guest pages
+Usage:
+ a0: start address of page that should be protected.
+
+This hypercall lets a guest kernel have part of its read/write memory
+converted to read-only. This action is irreversible. KVM_HC_HMROE cannot
+be triggered from guest Ring 3 (user mode), because malicious user-mode
+software could otherwise use it to enforce read-only protection on an
+arbitrary memory page and thus crash the kernel.
--
2.16.4
* [PATCH 2/3] [RFC V3] KVM: X86: Adding arbitrary data pointer in kvm memslot iterator functions
From: Ahmed Abd El Mawgood @ 2018-07-19 21:38 UTC
To: kvm, Kernel Hardening, virtualization, linux-doc, x86
Cc: Ard Biesheuvel, Kees Cook, Jonathan Corbet, David Vrabel, rkrcmar,
Boris Lukashev, Ingo Molnar, nigel.edwards, hpa, Paolo Bonzini,
Thomas Gleixner, Rik van Riel
In-Reply-To: <20180719213802.17161-1-ahmedsoliman0x666@gmail.com>
This makes it possible to share data with the slot_level_handler
callback. In my case I need to share a counter of the pages traversed so
that it can be used as an index into a bitmap. Being able to pass an
arbitrary memory pointer to the slot_level_handler callback makes that easy.
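The pattern can be sketched in user space as follows. This is only an
illustration of passing an opaque pointer through a walker to its callback;
toy_handler, toy_walk and struct toy_walk_state are made-up names, not the
KVM functions touched by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical handler type: like slot_level_handler plus a data pointer. */
typedef bool (*toy_handler)(int *entry, void *data);

/* Walk every entry, forwarding the caller's opaque pointer to the handler. */
static bool toy_walk(int *entries, size_t n, toy_handler fn, void *data)
{
	bool flush = false;
	size_t i;

	for (i = 0; i < n; i++)
		flush |= fn(&entries[i], data);
	return flush;
}

/* State shared across handler invocations: a running index into a bitmap. */
struct toy_walk_state {
	size_t index;
	const bool *protected_map;
};

/* Zero entries whose position is flagged; report whether anything changed. */
static bool toy_clear_if_protected(int *entry, void *data)
{
	struct toy_walk_state *st = data;
	bool changed = false;

	assert(st != NULL);
	if (st->protected_map[st->index] && *entry != 0) {
		*entry = 0;
		changed = true;
	}
	st->index++;
	return changed;
}

/* Handlers that need no state ignore the pointer; callers pass NULL. */
static bool toy_noop(int *entry, void *data)
{
	(void)entry;
	(void)data;
	return false;
}
```

Existing handlers keep their behaviour by ignoring the extra argument, which
is why most call sites in the diff below simply pass NULL.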
Signed-off-by: Ahmed Abd El Mawgood <ahmedsoliman0x666@gmail.com>
---
arch/x86/kvm/mmu.c | 65 +++++++++++++++++++++++++++++++-----------------------
1 file changed, 37 insertions(+), 28 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d594690d8b95..77661530b2c4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1418,7 +1418,7 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
static bool __rmap_write_protect(struct kvm *kvm,
struct kvm_rmap_head *rmap_head,
- bool pt_protect)
+ bool pt_protect, void *data)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -1457,7 +1457,8 @@ static bool wrprot_ad_disabled_spte(u64 *sptep)
* - W bit on ad-disabled SPTEs.
* Returns true iff any D or W bits were cleared.
*/
-static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
+static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ void *data)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -1483,7 +1484,8 @@ static bool spte_set_dirty(u64 *sptep)
return mmu_spte_update(sptep, spte);
}
-static bool __rmap_set_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
+static bool __rmap_set_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ void *data)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -1515,7 +1517,7 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
while (mask) {
rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
PT_PAGE_TABLE_LEVEL, slot);
- __rmap_write_protect(kvm, rmap_head, false);
+ __rmap_write_protect(kvm, rmap_head, false, NULL);
/* clear the first set bit */
mask &= mask - 1;
@@ -1541,7 +1543,7 @@ void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
while (mask) {
rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
PT_PAGE_TABLE_LEVEL, slot);
- __rmap_clear_dirty(kvm, rmap_head);
+ __rmap_clear_dirty(kvm, rmap_head, NULL);
/* clear the first set bit */
mask &= mask - 1;
@@ -1594,7 +1596,8 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
rmap_head = __gfn_to_rmap(gfn, i, slot);
- write_protected |= __rmap_write_protect(kvm, rmap_head, true);
+ write_protected |= __rmap_write_protect(kvm, rmap_head, true,
+ NULL);
}
return write_protected;
@@ -1608,7 +1611,8 @@ static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn);
}
-static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
+static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ void *data)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -1628,7 +1632,7 @@ static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
struct kvm_memory_slot *slot, gfn_t gfn, int level,
unsigned long data)
{
- return kvm_zap_rmapp(kvm, rmap_head);
+ return kvm_zap_rmapp(kvm, rmap_head, NULL);
}
static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
@@ -5086,13 +5090,15 @@ void kvm_mmu_uninit_vm(struct kvm *kvm)
}
/* The return value indicates if tlb flush on all vcpus is needed. */
-typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head *rmap_head);
+typedef bool (*slot_level_handler) (struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head, void *data);
/* The caller should hold mmu-lock before calling this function. */
static __always_inline bool
slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
slot_level_handler fn, int start_level, int end_level,
- gfn_t start_gfn, gfn_t end_gfn, bool lock_flush_tlb)
+ gfn_t start_gfn, gfn_t end_gfn, bool lock_flush_tlb,
+ void *data)
{
struct slot_rmap_walk_iterator iterator;
bool flush = false;
@@ -5100,7 +5106,7 @@ slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
for_each_slot_rmap_range(memslot, start_level, end_level, start_gfn,
end_gfn, &iterator) {
if (iterator.rmap)
- flush |= fn(kvm, iterator.rmap);
+ flush |= fn(kvm, iterator.rmap, data);
if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
if (flush && lock_flush_tlb) {
@@ -5122,36 +5128,36 @@ slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
static __always_inline bool
slot_handle_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
slot_level_handler fn, int start_level, int end_level,
- bool lock_flush_tlb)
+ bool lock_flush_tlb, void *data)
{
return slot_handle_level_range(kvm, memslot, fn, start_level,
end_level, memslot->base_gfn,
memslot->base_gfn + memslot->npages - 1,
- lock_flush_tlb);
+ lock_flush_tlb, data);
}
static __always_inline bool
slot_handle_all_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
- slot_level_handler fn, bool lock_flush_tlb)
+ slot_level_handler fn, bool lock_flush_tlb, void *data)
{
return slot_handle_level(kvm, memslot, fn, PT_PAGE_TABLE_LEVEL,
- PT_MAX_HUGEPAGE_LEVEL, lock_flush_tlb);
+ PT_MAX_HUGEPAGE_LEVEL, lock_flush_tlb, data);
}
static __always_inline bool
slot_handle_large_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
- slot_level_handler fn, bool lock_flush_tlb)
+ slot_level_handler fn, bool lock_flush_tlb, void *data)
{
return slot_handle_level(kvm, memslot, fn, PT_PAGE_TABLE_LEVEL + 1,
- PT_MAX_HUGEPAGE_LEVEL, lock_flush_tlb);
+ PT_MAX_HUGEPAGE_LEVEL, lock_flush_tlb, data);
}
static __always_inline bool
slot_handle_leaf(struct kvm *kvm, struct kvm_memory_slot *memslot,
- slot_level_handler fn, bool lock_flush_tlb)
+ slot_level_handler fn, bool lock_flush_tlb, void *data)
{
return slot_handle_level(kvm, memslot, fn, PT_PAGE_TABLE_LEVEL,
- PT_PAGE_TABLE_LEVEL, lock_flush_tlb);
+ PT_PAGE_TABLE_LEVEL, lock_flush_tlb, data);
}
void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
@@ -5173,7 +5179,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
slot_handle_level_range(kvm, memslot, kvm_zap_rmapp,
PT_PAGE_TABLE_LEVEL, PT_MAX_HUGEPAGE_LEVEL,
- start, end - 1, true);
+ start, end - 1, true, NULL);
}
}
@@ -5181,9 +5187,10 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
}
static bool slot_rmap_write_protect(struct kvm *kvm,
- struct kvm_rmap_head *rmap_head)
+ struct kvm_rmap_head *rmap_head,
+ void *data)
{
- return __rmap_write_protect(kvm, rmap_head, false);
+ return __rmap_write_protect(kvm, rmap_head, false, data);
}
void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
@@ -5193,7 +5200,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
spin_lock(&kvm->mmu_lock);
flush = slot_handle_all_level(kvm, memslot, slot_rmap_write_protect,
- false);
+ false, NULL);
spin_unlock(&kvm->mmu_lock);
/*
@@ -5219,7 +5226,8 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
}
static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
- struct kvm_rmap_head *rmap_head)
+ struct kvm_rmap_head *rmap_head,
+ void *data)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -5257,7 +5265,7 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
/* FIXME: const-ify all uses of struct kvm_memory_slot. */
spin_lock(&kvm->mmu_lock);
slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
- kvm_mmu_zap_collapsible_spte, true);
+ kvm_mmu_zap_collapsible_spte, true, NULL);
spin_unlock(&kvm->mmu_lock);
}
@@ -5267,7 +5275,7 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
bool flush;
spin_lock(&kvm->mmu_lock);
- flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false);
+ flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false, NULL);
spin_unlock(&kvm->mmu_lock);
lockdep_assert_held(&kvm->slots_lock);
@@ -5290,7 +5298,7 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
spin_lock(&kvm->mmu_lock);
flush = slot_handle_large_level(kvm, memslot, slot_rmap_write_protect,
- false);
+ false, NULL);
spin_unlock(&kvm->mmu_lock);
/* see kvm_mmu_slot_remove_write_access */
@@ -5307,7 +5315,8 @@ void kvm_mmu_slot_set_dirty(struct kvm *kvm,
bool flush;
spin_lock(&kvm->mmu_lock);
- flush = slot_handle_all_level(kvm, memslot, __rmap_set_dirty, false);
+ flush = slot_handle_all_level(kvm, memslot, __rmap_set_dirty, false,
+ NULL);
spin_unlock(&kvm->mmu_lock);
lockdep_assert_held(&kvm->slots_lock);
--
2.16.4
* [PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE
From: Ahmed Abd El Mawgood @ 2018-07-19 21:38 UTC
To: kvm, Kernel Hardening, virtualization, linux-doc, x86
Cc: Ard Biesheuvel, Kees Cook, Jonathan Corbet, David Vrabel, rkrcmar,
Boris Lukashev, Ingo Molnar, nigel.edwards, hpa, Paolo Bonzini,
Thomas Gleixner, Rik van Riel
In-Reply-To: <20180719213802.17161-1-ahmedsoliman0x666@gmail.com>
This patch introduces a hypercall, implemented for x86, that can help
against a subset of kernel rootkits. It works by placing read-only
protection in the shadow PTEs. The resulting protection state is also kept
in a bitmap for each kvm_memory_slot and is used as a reference when
updating SPTEs. The goal is to protect the guest kernel's static data from
modification even if the attacker is running in guest ring 0; for this
reason there is no hypercall to revert the effect of the Memory ROE
hypercall. This patch doesn't implement integrity checks on the guest TLB,
so an obvious attack on the current implementation involves guest virtual
address -> guest physical address remapping, but there are plans to fix that.
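As a rough user-space sketch of the per-slot bitmap idea (one bit per page,
indexed by the page's offset from the start of the slot): struct toy_slot
and the toy_* helpers below are assumptions made for illustration and are
not the structures or functions added by this patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for a memory slot covering npages guest frames. */
struct toy_slot {
	uint64_t base_gfn;
	unsigned long npages;
	bool slot_readonly;		/* stands in for KVM_MEM_READONLY */
	unsigned long *roe_bitmap;	/* one bit per page in the slot */
};

/* Allocate a zeroed bitmap large enough for every page in the slot. */
static int toy_slot_init(struct toy_slot *s, uint64_t base_gfn,
			 unsigned long npages)
{
	size_t longs = (npages + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG;

	s->base_gfn = base_gfn;
	s->npages = npages;
	s->slot_readonly = false;
	s->roe_bitmap = calloc(longs, sizeof(unsigned long));
	return s->roe_bitmap ? 0 : -1;
}

/* Mark one frame read-only; valid frames are [base_gfn, base_gfn + npages). */
static int toy_slot_protect(struct toy_slot *s, uint64_t gfn)
{
	uint64_t off;

	if (gfn < s->base_gfn || gfn - s->base_gfn >= s->npages)
		return -1;
	off = gfn - s->base_gfn;
	s->roe_bitmap[off / TOY_BITS_PER_LONG] |=
		1UL << (off % TOY_BITS_PER_LONG);
	return 0;
}

/* A frame is read-only if the whole slot is, or if its own bit is set. */
static bool toy_gfn_is_readonly(const struct toy_slot *s, uint64_t gfn)
{
	uint64_t off = gfn - s->base_gfn;

	assert(gfn >= s->base_gfn && off < s->npages);
	return s->slot_readonly ||
	       ((s->roe_bitmap[off / TOY_BITS_PER_LONG] >>
		 (off % TOY_BITS_PER_LONG)) & 1UL);
}
```

There is deliberately no function that clears a bit, which mirrors the
irreversibility described above.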
Signed-off-by: Ahmed Abd El Mawgood <ahmedsoliman0x666@gmail.com>
---
arch/x86/include/asm/kvm_host.h | 11 +++++-
arch/x86/kvm/Kconfig | 7 ++++
arch/x86/kvm/mmu.c | 72 ++++++++++++++++++++++++++++++------
arch/x86/kvm/x86.c | 82 +++++++++++++++++++++++++++++++++++++++--
include/linux/kvm_host.h | 3 ++
include/uapi/linux/kvm_para.h | 1 +
virt/kvm/kvm_main.c | 29 +++++++++++++--
7 files changed, 186 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c13cd28d9d1b..128bcfa246a3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -236,6 +236,15 @@ struct kvm_mmu_memory_cache {
void *objects[KVM_NR_MEM_OBJS];
};
+/*
+ * This is an internal structure used to access the kvm memory slot and
+ * keep track of the index of the current PTE during a shadow PTE walk
+ */
+struct kvm_write_access_data {
+ int i;
+ struct kvm_memory_slot *memslot;
+};
+
/*
* the pages used as guest page table on soft mmu are tracked by
* kvm_memory_slot.arch.gfn_track which is 16 bits, so the role bits used
@@ -1130,7 +1139,7 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
u64 acc_track_mask, u64 me_mask);
void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
+void kvm_mmu_slot_apply_write_access(struct kvm *kvm,
struct kvm_memory_slot *memslot);
void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
const struct kvm_memory_slot *memslot);
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 92fd433c50b9..8ae822a8dc7a 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -96,6 +96,13 @@ config KVM_MMU_AUDIT
This option adds a R/W kVM module parameter 'mmu_audit', which allows
auditing of KVM MMU events at runtime.
+config KVM_MROE
+ bool "Hypercall Memory Read-Only Enforcement"
+ depends on KVM && X86
+ help
+ This option adds the KVM_HC_HMROE hypercall to KVM as a hardening
+ mechanism to protect memory pages from being modified.
+
# OK, it's a little counter-intuitive to do this, but it puts it neatly under
# the virtualization menu.
source drivers/vhost/Kconfig
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 77661530b2c4..4ce6a9a19a23 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1416,9 +1416,8 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
return mmu_spte_update(sptep, spte);
}
-static bool __rmap_write_protect(struct kvm *kvm,
- struct kvm_rmap_head *rmap_head,
- bool pt_protect, void *data)
+static bool __rmap_write_protection(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head, bool pt_protect)
{
u64 *sptep;
struct rmap_iterator iter;
@@ -1430,6 +1429,38 @@ static bool __rmap_write_protect(struct kvm *kvm,
return flush;
}
+#ifdef CONFIG_KVM_MROE
+static bool __rmap_write_protect_mroe(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
+ bool pt_protect,
+ struct kvm_write_access_data *d)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+ bool prot;
+ bool flush = false;
+
+ for_each_rmap_spte(rmap_head, &iter, sptep) {
+ prot = !test_bit(d->i, d->memslot->mroe_bitmap) && pt_protect;
+ flush |= spte_write_protect(sptep, prot);
+ d->i++;
+ }
+ return flush;
+}
+#endif
+
+static bool __rmap_write_protect(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
+ bool pt_protect,
+ struct kvm_write_access_data *d)
+{
+#ifdef CONFIG_KVM_MROE
+ if (d != NULL)
+ return __rmap_write_protect_mroe(kvm, rmap_head, pt_protect, d);
+#endif
+ return __rmap_write_protection(kvm, rmap_head, pt_protect);
+}
+
static bool spte_clear_dirty(u64 *sptep)
{
u64 spte = *sptep;
@@ -1517,7 +1548,7 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
while (mask) {
rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
PT_PAGE_TABLE_LEVEL, slot);
- __rmap_write_protect(kvm, rmap_head, false, NULL);
+ __rmap_write_protection(kvm, rmap_head, false);
/* clear the first set bit */
mask &= mask - 1;
@@ -1593,11 +1624,15 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
struct kvm_rmap_head *rmap_head;
int i;
bool write_protected = false;
+ struct kvm_write_access_data data = {
+ .i = 0,
+ .memslot = slot,
+ };
for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
rmap_head = __gfn_to_rmap(gfn, i, slot);
write_protected |= __rmap_write_protect(kvm, rmap_head, true,
- NULL);
+ &data);
}
return write_protected;
@@ -5190,21 +5225,36 @@ static bool slot_rmap_write_protect(struct kvm *kvm,
struct kvm_rmap_head *rmap_head,
void *data)
{
- return __rmap_write_protect(kvm, rmap_head, false, data);
+ return __rmap_write_protect(kvm, rmap_head, false,
+ (struct kvm_write_access_data *)data);
}
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
+static bool slot_rmap_apply_protection(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
+ void *data)
+{
+ struct kvm_write_access_data *d = (struct kvm_write_access_data *) data;
+ bool prot_mask = !(d->memslot->flags & KVM_MEM_READONLY);
+
+ return __rmap_write_protect(kvm, rmap_head, prot_mask, d);
+}
+
+void kvm_mmu_slot_apply_write_access(struct kvm *kvm,
struct kvm_memory_slot *memslot)
{
bool flush;
+ struct kvm_write_access_data data = {
+ .i = 0,
+ .memslot = memslot,
+ };
spin_lock(&kvm->mmu_lock);
- flush = slot_handle_all_level(kvm, memslot, slot_rmap_write_protect,
- false, NULL);
+ flush = slot_handle_all_level(kvm, memslot, slot_rmap_apply_protection,
+ false, &data);
spin_unlock(&kvm->mmu_lock);
/*
- * kvm_mmu_slot_remove_write_access() and kvm_vm_ioctl_get_dirty_log()
+ * kvm_mmu_slot_apply_write_access() and kvm_vm_ioctl_get_dirty_log()
* which do tlb flush out of mmu-lock should be serialized by
* kvm->slots_lock otherwise tlb flush would be missed.
*/
@@ -5301,7 +5351,7 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
false, NULL);
spin_unlock(&kvm->mmu_lock);
- /* see kvm_mmu_slot_remove_write_access */
+ /* see kvm_mmu_slot_apply_write_access */
lockdep_assert_held(&kvm->slots_lock);
if (flush)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0046aa70205a..9addc46d75be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4177,7 +4177,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
/*
* All the TLBs can be flushed out of mmu lock, see the comments in
- * kvm_mmu_slot_remove_write_access().
+ * kvm_mmu_slot_apply_write_access().
*/
lockdep_assert_held(&kvm->slots_lock);
if (is_dirty)
@@ -6670,7 +6670,76 @@ static int kvm_pv_clock_pairing(struct kvm_vcpu *vcpu, gpa_t paddr,
}
#endif
-/*
+#ifdef CONFIG_KVM_MROE
+static int __roe_protect_frame(struct kvm *kvm, gpa_t gpa)
+{
+ struct kvm_memory_slot *slot;
+ gfn_t gfn = gpa >> PAGE_SHIFT;
+
+ slot = gfn_to_memslot(kvm, gfn);
+ if (!slot || gfn >= slot->base_gfn + slot->npages)
+ return -EINVAL;
+ set_bit(gfn - slot->base_gfn, slot->mroe_bitmap);
+ kvm_mmu_slot_apply_write_access(kvm, slot);
+ kvm_arch_flush_shadow_memslot(kvm, slot);
+
+ return 0;
+}
+
+static int roe_protect_frame(struct kvm *kvm, gpa_t gpa)
+{
+ int r;
+
+ mutex_lock(&kvm->slots_lock);
+ r = __roe_protect_frame(kvm, gpa);
+ mutex_unlock(&kvm->slots_lock);
+ return r;
+}
+
+static bool kvm_mroe_userspace(struct kvm_vcpu *vcpu)
+{
+ u64 rflags;
+ u64 cr0 = kvm_read_cr0(vcpu);
+ u64 iopl;
+
+ /* real mode (CR0.PE clear) has no user mode, so nothing to reject */
+ if ((cr0 & 1) == 0)
+ return false;
+ /*
+ * we don't need to worry about comments in __get_regs
+ * because we are sure that this function will only be
+ * triggered at the end of a hypercall
+ */
+ rflags = kvm_get_rflags(vcpu);
+ iopl = (rflags >> 12) & 3;
+ if (iopl != 3)
+ return false;
+ return true;
+}
+
+static int kvm_mroe(struct kvm_vcpu *vcpu, u64 gva)
+{
+ struct kvm *kvm = vcpu->kvm;
+ gpa_t gpa;
+ u64 hva;
+
+ /*
+ * First we need to make sure that we are running from something that
+ * isn't usermode
+ */
+ if (kvm_mroe_userspace(vcpu))
+ return -1;//I don't really know what to return
+ if (gva & ~PAGE_MASK)
+ return -EINVAL;
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
+ hva = gfn_to_hva(kvm, gpa >> PAGE_SHIFT);
+ if (!access_ok(VERIFY_WRITE, hva, PAGE_SIZE))
+ return -EINVAL;
+ return roe_protect_frame(vcpu->kvm, gpa);
+}
+#endif
+
+ /*
* kvm_pv_kick_cpu_op: Kick a vcpu.
*
* @apicid - apicid of vcpu to be kicked.
@@ -6737,6 +6806,11 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
case KVM_HC_CLOCK_PAIRING:
ret = kvm_pv_clock_pairing(vcpu, a0, a1);
break;
+#endif
+#ifdef CONFIG_KVM_MROE
+ case KVM_HC_HMROE:
+ ret = kvm_mroe(vcpu, a0);
+ break;
#endif
default:
ret = -KVM_ENOSYS;
@@ -8971,8 +9045,8 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
struct kvm_memory_slot *new)
{
/* Still write protect RO slot */
+ kvm_mmu_slot_apply_write_access(kvm, new);
if (new->flags & KVM_MEM_READONLY) {
- kvm_mmu_slot_remove_write_access(kvm, new);
return;
}
@@ -9010,7 +9084,7 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
if (kvm_x86_ops->slot_enable_log_dirty)
kvm_x86_ops->slot_enable_log_dirty(kvm, new);
else
- kvm_mmu_slot_remove_write_access(kvm, new);
+ kvm_mmu_slot_apply_write_access(kvm, new);
} else {
if (kvm_x86_ops->slot_disable_log_dirty)
kvm_x86_ops->slot_disable_log_dirty(kvm, new);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4ee7bc548a83..82c5780e11d9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -297,6 +297,9 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
struct kvm_memory_slot {
gfn_t base_gfn;
unsigned long npages;
+#ifdef CONFIG_KVM_MROE
+ unsigned long *mroe_bitmap;
+#endif
unsigned long *dirty_bitmap;
struct kvm_arch_memory_slot arch;
unsigned long userspace_addr;
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index dcf629dd2889..4e2badc09b5b 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -26,6 +26,7 @@
#define KVM_HC_MIPS_EXIT_VM 7
#define KVM_HC_MIPS_CONSOLE_OUTPUT 8
#define KVM_HC_CLOCK_PAIRING 9
+#define KVM_HC_HMROE 10
/*
* hypercalls use architecture specific
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b47507faab5..0f7141e4d550 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -794,6 +794,17 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
return 0;
}
+static int kvm_init_mroe_bitmap(struct kvm_memory_slot *slot)
+{
+#ifdef CONFIG_KVM_MROE
+ slot->mroe_bitmap = kvzalloc(BITS_TO_LONGS(slot->npages) *
+ sizeof(unsigned long), GFP_KERNEL);
+ if (!slot->mroe_bitmap)
+ return -ENOMEM;
+#endif
+ return 0;
+}
+
/*
* Insert memslot and re-sort memslots based on their GFN,
* so binary search could be used to lookup GFN.
@@ -1011,6 +1022,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
if (kvm_create_dirty_bitmap(&new) < 0)
goto out_free;
}
+ if (kvm_init_mroe_bitmap(&new) < 0)
+ goto out_free;
slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
if (!slots)
@@ -1264,13 +1277,23 @@ static bool memslot_is_readonly(struct kvm_memory_slot *slot)
return slot->flags & KVM_MEM_READONLY;
}
+static bool gfn_is_readonly(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+#ifdef CONFIG_KVM_MROE
+ return test_bit(gfn - slot->base_gfn, slot->mroe_bitmap) ||
+ memslot_is_readonly(slot);
+#else
+ return memslot_is_readonly(slot);
+#endif
+}
+
static unsigned long __gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
gfn_t *nr_pages, bool write)
{
if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
return KVM_HVA_ERR_BAD;
- if (memslot_is_readonly(slot) && write)
+ if (gfn_is_readonly(slot, gfn) && write)
return KVM_HVA_ERR_RO_BAD;
if (nr_pages)
@@ -1314,7 +1337,7 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot,
unsigned long hva = __gfn_to_hva_many(slot, gfn, NULL, false);
if (!kvm_is_error_hva(hva) && writable)
- *writable = !memslot_is_readonly(slot);
+ *writable = !gfn_is_readonly(slot, gfn);
return hva;
}
@@ -1554,7 +1577,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
}
/* Do not map writable pfn in the readonly memslot. */
- if (writable && memslot_is_readonly(slot)) {
+ if (writable && gfn_is_readonly(slot, gfn)) {
*writable = false;
writable = NULL;
}
--
2.16.4
* [PATCH net-next 0/9] TX used ring batched updating for vhost
From: Jason Wang @ 2018-07-20 0:15 UTC
To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
Hi:
This series implements batched updating of the used ring for TX. This
helps to reduce cache contention on the used ring. The idea is to first
split the datacopy path from the zerocopy path, and to batch only for
datacopy, because zerocopy already has its own batching.
TX PPS increased by 25.8%, and Netperf TCP does not show obvious
differences.
The split of the datapath will also be helpful for future work such as
in-order completion.
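As a rough illustration of the batching idea, here is a user-space toy, not
the vhost code: struct toy_used_ring, struct toy_batch, TOY_RING_SIZE and
TOY_BATCH are invented for this sketch. Completed heads are collected
locally and the shared index is written once per batch instead of once per
packet.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RING_SIZE 64	/* hypothetical ring size */
#define TOY_BATCH 4		/* hypothetical batch size */

struct toy_used_ring {
	unsigned int ring[TOY_RING_SIZE];	/* completed heads, seen by the peer */
	size_t used_idx;			/* shared index, contended cache line */
	size_t idx_updates;			/* number of writes to used_idx */
};

struct toy_batch {
	unsigned int heads[TOY_BATCH];	/* completions not yet published */
	size_t count;
};

/* Publish all pending heads with a single update of the shared index. */
static void toy_flush_used(struct toy_used_ring *r, struct toy_batch *b)
{
	size_t i;

	if (b->count == 0)
		return;
	for (i = 0; i < b->count; i++)
		r->ring[(r->used_idx + i) % TOY_RING_SIZE] = b->heads[i];
	r->used_idx += b->count;
	r->idx_updates++;
	b->count = 0;
}

/* Queue one completion locally; flush only when the batch is full. */
static void toy_add_used(struct toy_used_ring *r, struct toy_batch *b,
			 unsigned int head)
{
	assert(b->count < TOY_BATCH);
	b->heads[b->count++] = head;
	if (b->count == TOY_BATCH)
		toy_flush_used(r, b);
}
```

A caller would still need one final toy_flush_used() when it stops
processing, so that a partially filled batch is not left unpublished.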
Please review.
Thanks
Jason Wang (9):
vhost_net: drop unnecessary parameter
vhost_net: introduce helper to initialize tx iov iter
vhost_net: introduce vhost_exceeds_weight()
vhost_net: introduce get_tx_bufs()
vhost_net: introduce tx_can_batch()
vhost_net: split out datacopy logic
vhost_net: rename vhost_rx_signal_used() to vhost_net_signal_used()
vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
vhost_net: batch update used ring for datacopy TX
drivers/vhost/net.c | 249 +++++++++++++++++++++++++++++++++++++---------------
1 file changed, 179 insertions(+), 70 deletions(-)
--
2.7.4
* [PATCH net-next 1/9] vhost_net: drop unnecessary parameter
From: Jason Wang @ 2018-07-20 0:15 UTC
To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b224036..1a8175a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -430,7 +430,6 @@ static int vhost_net_enable_vq(struct vhost_net *n,
static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
struct vhost_virtqueue *vq,
- struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num,
bool *busyloop_intr)
{
@@ -512,9 +511,8 @@ static void handle_tx(struct vhost_net *net)
vhost_zerocopy_signal_used(net, vq);
busyloop_intr = false;
- head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
- ARRAY_SIZE(vq->iov),
- &out, &in, &busyloop_intr);
+ head = vhost_net_tx_get_vq_desc(net, vq, &out, &in,
+ &busyloop_intr);
/* On error, stop handling until the next kick. */
if (unlikely(head < 0))
break;
--
2.7.4
* [PATCH net-next 2/9] vhost_net: introduce helper to initialize tx iov iter
From: Jason Wang @ 2018-07-20 0:15 UTC
To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>
Introduce init_iov_iter() so that it can be reused by a future patch.
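What the helper does can be sketched in user space roughly as follows:
compute the total length of the scatter list, skip the header, and return
how many payload bytes remain. struct toy_iter and the toy_* functions are
stand-ins invented for this sketch; they are not the kernel's struct
iov_iter API.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical cursor over an iovec array; a stand-in for struct iov_iter. */
struct toy_iter {
	const struct iovec *iov;
	int nr_segs;
	int seg;		/* current segment */
	size_t seg_off;		/* offset inside the current segment */
	size_t count;		/* bytes remaining */
};

/* Total number of bytes described by the first nr_segs segments. */
static size_t toy_iov_length(const struct iovec *iov, int nr_segs)
{
	size_t len = 0;
	int i;

	for (i = 0; i < nr_segs; i++)
		len += iov[i].iov_len;
	return len;
}

/* Move the cursor forward, clamping at the end of the data. */
static void toy_iter_advance(struct toy_iter *it, size_t bytes)
{
	if (bytes > it->count)
		bytes = it->count;
	it->count -= bytes;
	while (bytes) {
		size_t left = it->iov[it->seg].iov_len - it->seg_off;

		if (bytes < left) {
			it->seg_off += bytes;
			break;
		}
		bytes -= left;
		it->seg++;
		it->seg_off = 0;
	}
}

/* Set up the cursor, skip hdr_size bytes, and return the payload length. */
static size_t toy_init_iter(struct toy_iter *it, const struct iovec *iov,
			    int nr_segs, size_t hdr_size)
{
	it->iov = iov;
	it->nr_segs = nr_segs;
	it->seg = 0;
	it->seg_off = 0;
	it->count = toy_iov_length(iov, nr_segs);
	toy_iter_advance(it, hdr_size);
	return it->count;
}
```

A return value of zero means the buffers held nothing beyond the header,
which is the condition the caller treats as an error.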
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 26 +++++++++++++++++---------
1 file changed, 17 insertions(+), 9 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 1a8175a..cac28fd 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -466,6 +466,18 @@ static bool vhost_exceeds_maxpend(struct vhost_net *net)
min_t(unsigned int, VHOST_MAX_PEND, vq->num >> 2);
}
+static size_t init_iov_iter(struct vhost_virtqueue *vq, struct iov_iter *iter,
+ size_t hdr_size, int out)
+{
+ /* Skip header. TODO: support TSO. */
+ size_t len = iov_length(vq->iov, out);
+
+ iov_iter_init(iter, WRITE, vq->iov, out, len);
+ iov_iter_advance(iter, hdr_size);
+
+ return iov_iter_count(iter);
+}
+
/* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
static void handle_tx(struct vhost_net *net)
@@ -531,18 +543,14 @@ static void handle_tx(struct vhost_net *net)
"out %d, int %d\n", out, in);
break;
}
- /* Skip header. TODO: support TSO. */
- len = iov_length(vq->iov, out);
- iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
- iov_iter_advance(&msg.msg_iter, hdr_size);
+
/* Sanity check */
- if (!msg_data_left(&msg)) {
- vq_err(vq, "Unexpected header len for TX: "
- "%zd expected %zd\n",
- len, hdr_size);
+ len = init_iov_iter(vq, &msg.msg_iter, hdr_size, out);
+ if (!len) {
+ vq_err(vq, "Unexpected header len for TX: %zd expected %zd\n",
+ len, hdr_size);
break;
}
- len = msg_data_left(&msg);
zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
&& !vhost_exceeds_maxpend(net)
--
2.7.4
* [PATCH net-next 3/9] vhost_net: introduce vhost_exceeds_weight()
From: Jason Wang @ 2018-07-20 0:15 UTC
To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index cac28fd..b9e1674 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -478,6 +478,12 @@ static size_t init_iov_iter(struct vhost_virtqueue *vq, struct iov_iter *iter,
return iov_iter_count(iter);
}
+static bool vhost_exceeds_weight(int pkts, int total_len)
+{
+ return total_len >= VHOST_NET_WEIGHT ||
+ pkts >= VHOST_NET_PKT_WEIGHT;
+}
+
/* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
static void handle_tx(struct vhost_net *net)
@@ -576,7 +582,6 @@ static void handle_tx(struct vhost_net *net)
msg.msg_control = NULL;
ubufs = NULL;
}
-
total_len += len;
if (total_len < VHOST_NET_WEIGHT &&
!vhost_vq_avail_empty(&net->dev, vq) &&
@@ -606,8 +611,7 @@ static void handle_tx(struct vhost_net *net)
else
vhost_zerocopy_signal_used(net, vq);
vhost_net_tx_packet(net);
- if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
- unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
+ if (unlikely(vhost_exceeds_weight(++sent_pkts, total_len))) {
vhost_poll_queue(&vq->poll);
break;
}
@@ -918,8 +922,7 @@ static void handle_rx(struct vhost_net *net)
if (unlikely(vq_log))
vhost_log_write(vq, vq_log, log, vhost_len);
total_len += vhost_len;
- if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
- unlikely(++recv_pkts >= VHOST_NET_PKT_WEIGHT)) {
+ if (unlikely(vhost_exceeds_weight(++recv_pkts, total_len))) {
vhost_poll_queue(&vq->poll);
goto out;
}
--
2.7.4
^ permalink raw reply related
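Editor's sketch for patch 3/9: the helper folds the byte budget and the packet budget into one predicate, so the TX and RX loops stop on whichever limit is reached first. The user-space model below is only an illustration. The `demo_` names are hypothetical, and the two budget values are assumptions, since the patch does not show the definitions of VHOST_NET_WEIGHT and VHOST_NET_PKT_WEIGHT.

```c
#include <stdbool.h>

/* Assumed budget values for illustration; the patch does not show them. */
#define DEMO_NET_WEIGHT     0x80000 /* byte budget per handler run */
#define DEMO_NET_PKT_WEIGHT 256     /* packet budget per handler run */

/* Same predicate as vhost_exceeds_weight() in the patch. */
static bool demo_exceeds_weight(int pkts, int total_len)
{
	return total_len >= DEMO_NET_WEIGHT ||
	       pkts >= DEMO_NET_PKT_WEIGHT;
}

/*
 * Model of a handler loop that processes fixed-size packets.
 * Returns how many packets were handled before the loop stopped,
 * either because a budget was exceeded or because nothing was left.
 */
static int demo_packets_before_requeue(int pkt_len, int available)
{
	int pkts = 0;
	int total_len = 0;

	while (pkts < available) {
		total_len += pkt_len;
		if (demo_exceeds_weight(++pkts, total_len))
			break; /* the real code requeues the poll here */
	}
	return pkts;
}
```

Under these assumed values, small packets run into the packet budget and large packets run into the byte budget, which is the behaviour the two separate unlikely() tests expressed before the refactoring.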
* [PATCH net-next 4/9] vhost_net: introduce get_tx_bufs()
From: Jason Wang @ 2018-07-20 0:15 UTC (permalink / raw)
To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>
Factor out the logic for getting the TX buffer and initializing the iov
iterator. This will be used to reduce code duplication in the
future.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 49 ++++++++++++++++++++++++++++++++-----------------
1 file changed, 32 insertions(+), 17 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b9e1674..a014ca0 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -484,6 +484,36 @@ static bool vhost_exceeds_weight(int pkts, int total_len)
pkts >= VHOST_NET_PKT_WEIGHT;
}
+static int get_tx_bufs(struct vhost_net *net,
+ struct vhost_net_virtqueue *nvq,
+ struct msghdr *msg,
+ unsigned int *out, unsigned int *in,
+ size_t *len, bool *busyloop_intr)
+{
+ struct vhost_virtqueue *vq = &nvq->vq;
+ int ret;
+
+ ret = vhost_net_tx_get_vq_desc(net, vq, out, in, busyloop_intr);
+ if (ret < 0 || ret == vq->num)
+ return ret;
+
+ if (*in) {
+ vq_err(vq, "Unexpected descriptor format for TX: out %d, int %d\n",
+ *out, *in);
+ return -EFAULT;
+ }
+
+ /* Sanity check */
+ *len = init_iov_iter(vq, &msg->msg_iter, nvq->vhost_hlen, *out);
+ if (*len == 0) {
+ vq_err(vq, "Unexpected header len for TX: %zd expected %zd\n",
+ *len, nvq->vhost_hlen);
+ return -EFAULT;
+ }
+
+ return ret;
+}
+
/* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
static void handle_tx(struct vhost_net *net)
@@ -501,7 +531,6 @@ static void handle_tx(struct vhost_net *net)
};
size_t len, total_len = 0;
int err;
- size_t hdr_size;
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
bool zcopy, zcopy_used;
@@ -518,7 +547,6 @@ static void handle_tx(struct vhost_net *net)
vhost_disable_notify(&net->dev, vq);
vhost_net_disable_vq(net, vq);
- hdr_size = nvq->vhost_hlen;
zcopy = nvq->ubufs;
for (;;) {
@@ -529,8 +557,8 @@ static void handle_tx(struct vhost_net *net)
vhost_zerocopy_signal_used(net, vq);
busyloop_intr = false;
- head = vhost_net_tx_get_vq_desc(net, vq, &out, &in,
- &busyloop_intr);
+ head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
+ &busyloop_intr);
/* On error, stop handling until the next kick. */
if (unlikely(head < 0))
break;
@@ -544,19 +572,6 @@ static void handle_tx(struct vhost_net *net)
}
break;
}
- if (in) {
- vq_err(vq, "Unexpected descriptor format for TX: "
- "out %d, int %d\n", out, in);
- break;
- }
-
- /* Sanity check */
- len = init_iov_iter(vq, &msg.msg_iter, hdr_size, out);
- if (!len) {
- vq_err(vq, "Unexpected header len for TX: %zd expected %zd\n",
- len, hdr_size);
- break;
- }
zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
&& !vhost_exceeds_maxpend(net)
--
2.7.4
^ permalink raw reply related
* [PATCH net-next 5/9] vhost_net: introduce tx_can_batch()
From: Jason Wang @ 2018-07-20 0:15 UTC (permalink / raw)
To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>
Introduce tx_can_batch() to determine whether TX could be
batched. This will help to reduce the code duplication in the future.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index a014ca0..f59b615 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -514,6 +514,12 @@ static int get_tx_bufs(struct vhost_net *net,
return ret;
}
+static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
+{
+ return total_len < VHOST_NET_WEIGHT &&
+ !vhost_vq_avail_empty(vq->dev, vq);
+}
+
/* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
static void handle_tx(struct vhost_net *net)
@@ -598,8 +604,7 @@ static void handle_tx(struct vhost_net *net)
ubufs = NULL;
}
total_len += len;
- if (total_len < VHOST_NET_WEIGHT &&
- !vhost_vq_avail_empty(&net->dev, vq) &&
+ if (tx_can_batch(vq, total_len) &&
likely(!vhost_exceeds_maxpend(net))) {
msg.msg_flags |= MSG_MORE;
} else {
--
2.7.4
^ permalink raw reply related
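Editor's sketch for patch 5/9: tx_can_batch() decides whether the next sendmsg() call should carry MSG_MORE. The model below is a hedged user-space illustration, not the kernel code. The `demo_` names are hypothetical, DEMO_MSG_MORE is a stand-in flag, and the byte budget is an assumed value.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NET_WEIGHT 0x80000u /* assumed byte budget, not shown in the patch */
#define DEMO_MSG_MORE   0x8000   /* stand-in for the kernel's MSG_MORE flag */

/* Mirrors tx_can_batch(): batch only while under budget and more work is queued. */
static bool demo_tx_can_batch(size_t total_len, bool avail_ring_empty)
{
	return total_len < DEMO_NET_WEIGHT && !avail_ring_empty;
}

/* Returns the message flags to use for the next send. */
static int demo_update_flags(int flags, size_t total_len, bool avail_ring_empty)
{
	if (demo_tx_can_batch(total_len, avail_ring_empty))
		return flags | DEMO_MSG_MORE;
	return flags & ~DEMO_MSG_MORE;
}
```

The flag is cleared as soon as either condition fails, so the last packet of a burst is never left waiting for a follow-up that will not come.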
* [PATCH net-next 6/9] vhost_net: split out datacopy logic
From: Jason Wang @ 2018-07-20 0:15 UTC (permalink / raw)
To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>
Instead of mixing the zerocopy and datacopy logic, this patch splits the
datacopy logic out. This results in more compact code, and ad-hoc
optimizations can be done on top more easily.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 110 ++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 90 insertions(+), 20 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f59b615..9cef0b2 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -520,9 +520,7 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
!vhost_vq_avail_empty(vq->dev, vq);
}
-/* Expects to be always run from workqueue - which acts as
- * read-size critical section for our kind of RCU. */
-static void handle_tx(struct vhost_net *net)
+static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
{
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = &nvq->vq;
@@ -537,30 +535,76 @@ static void handle_tx(struct vhost_net *net)
};
size_t len, total_len = 0;
int err;
- struct socket *sock;
- struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
- bool zcopy, zcopy_used;
int sent_pkts = 0;
- mutex_lock(&vq->mutex);
- sock = vq->private_data;
- if (!sock)
- goto out;
+ for (;;) {
+ bool busyloop_intr = false;
- if (!vq_iotlb_prefetch(vq))
- goto out;
+ head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
+ &busyloop_intr);
+ /* On error, stop handling until the next kick. */
+ if (unlikely(head < 0))
+ break;
+ /* Nothing new? Wait for eventfd to tell us they refilled. */
+ if (head == vq->num) {
+ if (unlikely(busyloop_intr)) {
+ vhost_poll_queue(&vq->poll);
+ } else if (unlikely(vhost_enable_notify(&net->dev,
+ vq))) {
+ vhost_disable_notify(&net->dev, vq);
+ continue;
+ }
+ break;
+ }
- vhost_disable_notify(&net->dev, vq);
- vhost_net_disable_vq(net, vq);
+ total_len += len;
+ if (tx_can_batch(vq, total_len))
+ msg.msg_flags |= MSG_MORE;
+ else
+ msg.msg_flags &= ~MSG_MORE;
+
+ /* TODO: Check specific error and bomb out unless ENOBUFS? */
+ err = sock->ops->sendmsg(sock, &msg, len);
+ if (unlikely(err < 0)) {
+ vhost_discard_vq_desc(vq, 1);
+ vhost_net_enable_vq(net, vq);
+ break;
+ }
+ if (err != len)
+ pr_debug("Truncated TX packet: len %d != %zd\n",
+ err, len);
+ vhost_add_used_and_signal(&net->dev, vq, head, 0);
+ if (vhost_exceeds_weight(++sent_pkts, total_len)) {
+ vhost_poll_queue(&vq->poll);
+ break;
+ }
+ }
+}
- zcopy = nvq->ubufs;
+static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
+{
+ struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+ struct vhost_virtqueue *vq = &nvq->vq;
+ unsigned out, in;
+ int head;
+ struct msghdr msg = {
+ .msg_name = NULL,
+ .msg_namelen = 0,
+ .msg_control = NULL,
+ .msg_controllen = 0,
+ .msg_flags = MSG_DONTWAIT,
+ };
+ size_t len, total_len = 0;
+ int err;
+ struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
+ bool zcopy_used;
+ int sent_pkts = 0;
for (;;) {
bool busyloop_intr;
/* Release DMAs done buffers first */
- if (zcopy)
- vhost_zerocopy_signal_used(net, vq);
+ vhost_zerocopy_signal_used(net, vq);
busyloop_intr = false;
head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
@@ -579,9 +623,9 @@ static void handle_tx(struct vhost_net *net)
break;
}
- zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
- && !vhost_exceeds_maxpend(net)
- && vhost_net_tx_select_zcopy(net);
+ zcopy_used = len >= VHOST_GOODCOPY_LEN
+ && !vhost_exceeds_maxpend(net)
+ && vhost_net_tx_select_zcopy(net);
/* use msg_control to pass vhost zerocopy ubuf info to skb */
if (zcopy_used) {
@@ -636,6 +680,32 @@ static void handle_tx(struct vhost_net *net)
break;
}
}
+}
+
+/* Expects to be always run from workqueue - which acts as
+ * read-size critical section for our kind of RCU. */
+static void handle_tx(struct vhost_net *net)
+{
+ struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+ struct vhost_virtqueue *vq = &nvq->vq;
+ struct socket *sock;
+
+ mutex_lock(&vq->mutex);
+ sock = vq->private_data;
+ if (!sock)
+ goto out;
+
+ if (!vq_iotlb_prefetch(vq))
+ goto out;
+
+ vhost_disable_notify(&net->dev, vq);
+ vhost_net_disable_vq(net, vq);
+
+ if (vhost_sock_zcopy(sock))
+ handle_tx_zerocopy(net, sock);
+ else
+ handle_tx_copy(net, sock);
+
out:
mutex_unlock(&vq->mutex);
}
--
2.7.4
^ permalink raw reply related
* [PATCH net-next 7/9] vhost_net: rename vhost_rx_signal_used() to vhost_net_signal_used()
From: Jason Wang @ 2018-07-20 0:15 UTC (permalink / raw)
To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>
Rename the function so that it can be reused for TX.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9cef0b2..53d305b 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -741,7 +741,7 @@ static int sk_has_rx_data(struct sock *sk)
return skb_queue_empty(&sk->sk_receive_queue);
}
-static void vhost_rx_signal_used(struct vhost_net_virtqueue *nvq)
+static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
{
struct vhost_virtqueue *vq = &nvq->vq;
struct vhost_dev *dev = vq->dev;
@@ -765,7 +765,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
if (!len && tvq->busyloop_timeout) {
/* Flush batched heads first */
- vhost_rx_signal_used(rnvq);
+ vhost_net_signal_used(rnvq);
/* Both tx vq and rx socket were polled here */
mutex_lock_nested(&tvq->mutex, 1);
vhost_disable_notify(&net->dev, tvq);
@@ -1008,7 +1008,7 @@ static void handle_rx(struct vhost_net *net)
}
nvq->done_idx += headcount;
if (nvq->done_idx > VHOST_RX_BATCH)
- vhost_rx_signal_used(nvq);
+ vhost_net_signal_used(nvq);
if (unlikely(vq_log))
vhost_log_write(vq, vq_log, log, vhost_len);
total_len += vhost_len;
@@ -1022,7 +1022,7 @@ static void handle_rx(struct vhost_net *net)
else
vhost_net_enable_vq(net, vq);
out:
- vhost_rx_signal_used(nvq);
+ vhost_net_signal_used(nvq);
mutex_unlock(&vq->mutex);
}
--
2.7.4
^ permalink raw reply related
* [PATCH net-next 8/9] vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
From: Jason Wang @ 2018-07-20 0:15 UTC (permalink / raw)
To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>
A more generic name which could be used for TX as well.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 53d305b..2fd2f0e3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -94,7 +94,7 @@ struct vhost_net_ubuf_ref {
struct vhost_virtqueue *vq;
};
-#define VHOST_RX_BATCH 64
+#define VHOST_NET_BATCH 64
struct vhost_net_buf {
void **queue;
int tail;
@@ -168,7 +168,7 @@ static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
rxq->head = 0;
rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
- VHOST_RX_BATCH);
+ VHOST_NET_BATCH);
return rxq->tail;
}
@@ -1007,7 +1007,7 @@ static void handle_rx(struct vhost_net *net)
goto out;
}
nvq->done_idx += headcount;
- if (nvq->done_idx > VHOST_RX_BATCH)
+ if (nvq->done_idx > VHOST_NET_BATCH)
vhost_net_signal_used(nvq);
if (unlikely(vq_log))
vhost_log_write(vq, vq_log, log, vhost_len);
@@ -1075,7 +1075,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
return -ENOMEM;
}
- queue = kmalloc_array(VHOST_RX_BATCH, sizeof(void *),
+ queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *),
GFP_KERNEL);
if (!queue) {
kfree(vqs);
--
2.7.4
^ permalink raw reply related
* [PATCH net-next 9/9] vhost_net: batch update used ring for datacopy TX
From: Jason Wang @ 2018-07-20 0:15 UTC (permalink / raw)
To: mst, jasowang, netdev; +Cc: linux-kernel, kvm, virtualization
In-Reply-To: <1532045721-4958-1-git-send-email-jasowang@redhat.com>
Like commit e2b3b35eb989 ("vhost_net: batch used ring update in rx"),
this patch implements batched used ring updates for datacopy TX
(zerocopy already does some kind of batching).
Testpmd transmission from guest to host (XDP_DROP on tap) shows a 25.8%
improvement (from ~3.1Mpps to ~3.9Mpps) on a Broadwell i7-5600U CPU @
2.60GHz machine. Netperf TCP tests do not show obvious differences.
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2fd2f0e3..367d802 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -428,16 +428,31 @@ static int vhost_net_enable_vq(struct vhost_net *n,
return vhost_poll_start(poll, sock->file);
}
+static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
+{
+ struct vhost_virtqueue *vq = &nvq->vq;
+ struct vhost_dev *dev = vq->dev;
+
+ if (!nvq->done_idx)
+ return;
+
+ vhost_add_used_and_signal_n(dev, vq, vq->heads, nvq->done_idx);
+ nvq->done_idx = 0;
+}
+
static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
- struct vhost_virtqueue *vq,
+ struct vhost_net_virtqueue *nvq,
unsigned int *out_num, unsigned int *in_num,
bool *busyloop_intr)
{
+ struct vhost_virtqueue *vq = &nvq->vq;
unsigned long uninitialized_var(endtime);
int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
out_num, in_num, NULL, NULL);
if (r == vq->num && vq->busyloop_timeout) {
+ if (!vhost_sock_zcopy(vq->private_data))
+ vhost_net_signal_used(nvq);
preempt_disable();
endtime = busy_clock() + vq->busyloop_timeout;
while (vhost_can_busy_poll(endtime)) {
@@ -493,7 +508,8 @@ static int get_tx_bufs(struct vhost_net *net,
struct vhost_virtqueue *vq = &nvq->vq;
int ret;
- ret = vhost_net_tx_get_vq_desc(net, vq, out, in, busyloop_intr);
+ ret = vhost_net_tx_get_vq_desc(net, nvq, out, in, busyloop_intr);
+
if (ret < 0 || ret == vq->num)
return ret;
@@ -557,6 +573,9 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
break;
}
+ vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
+ vq->heads[nvq->done_idx].len = 0;
+
total_len += len;
if (tx_can_batch(vq, total_len))
msg.msg_flags |= MSG_MORE;
@@ -573,12 +592,15 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
if (err != len)
pr_debug("Truncated TX packet: len %d != %zd\n",
err, len);
- vhost_add_used_and_signal(&net->dev, vq, head, 0);
+ if (++nvq->done_idx >= VHOST_NET_BATCH)
+ vhost_net_signal_used(nvq);
if (vhost_exceeds_weight(++sent_pkts, total_len)) {
vhost_poll_queue(&vq->poll);
break;
}
}
+
+ vhost_net_signal_used(nvq);
}
static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
@@ -741,18 +763,6 @@ static int sk_has_rx_data(struct sock *sk)
return skb_queue_empty(&sk->sk_receive_queue);
}
-static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
-{
- struct vhost_virtqueue *vq = &nvq->vq;
- struct vhost_dev *dev = vq->dev;
-
- if (!nvq->done_idx)
- return;
-
- vhost_add_used_and_signal_n(dev, vq, vq->heads, nvq->done_idx);
- nvq->done_idx = 0;
-}
-
static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
bool *busyloop_intr)
{
--
2.7.4
^ permalink raw reply related
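Editor's sketch for patch 9/9: instead of publishing every completed TX descriptor individually, completions are buffered in vq->heads and flushed once a batch is full, with a final flush when the loop exits. The user-space model below only counts publications and notifications. All `demo_` names and the counters are hypothetical, and the real flush goes through vhost_add_used_and_signal_n().

```c
#include <stddef.h>

#define DEMO_BATCH 64 /* same value as VHOST_NET_BATCH in patch 8/9 */

struct demo_used_elem {
	unsigned int id;  /* descriptor head index */
	unsigned int len; /* bytes written by the device; 0 for TX */
};

struct demo_vq {
	struct demo_used_elem heads[DEMO_BATCH];
	int done_idx;  /* completions buffered but not yet published */
	int published; /* total entries made visible to the guest */
	int signals;   /* number of flushes (potential guest notifications) */
};

/* Publish all buffered completions at once, like vhost_net_signal_used(). */
static void demo_signal_used(struct demo_vq *vq)
{
	if (!vq->done_idx)
		return;
	vq->published += vq->done_idx;
	vq->signals++;
	vq->done_idx = 0;
}

/* Record one completed TX descriptor and flush when the batch is full. */
static void demo_complete_tx(struct demo_vq *vq, unsigned int head)
{
	vq->heads[vq->done_idx].id = head;
	vq->heads[vq->done_idx].len = 0;
	if (++vq->done_idx >= DEMO_BATCH)
		demo_signal_used(vq);
}

/* Send `count` packets, then flush the tail as handle_tx_copy() does on exit. */
static void demo_send(struct demo_vq *vq, int count)
{
	for (int i = 0; i < count; i++)
		demo_complete_tx(vq, (unsigned int)i);
	demo_signal_used(vq);
}
```

In this model 150 packets cost three flushes rather than 150, which is the kind of saving the Testpmd numbers in the commit message reflect. The tail flush matters: without it, up to DEMO_BATCH - 1 completions would stay invisible to the guest until the next run.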
* Re: [PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE
From: Ahmed Soliman @ 2018-07-20 0:26 UTC (permalink / raw)
To: Jann Horn
Cc: Jonathan Corbet, Ard Biesheuvel, Radim Krčmář,
Kees Cook, kvm, linux-doc, David Vrabel, the arch/x86 maintainers,
Boris Lukashev, virtualization, Ingo Molnar, nigel.edwards,
H . Peter Anvin, Kernel Hardening, Paolo Bonzini, Thomas Gleixner,
Rik van Riel
In-Reply-To: <CAG48ez3EyU=ROBczUdHEuOYBtZghYqOpq3K16Bs4RQLO1OO6oA@mail.gmail.com>
On 20 July 2018 at 00:59, Jann Horn <jannh@google.com> wrote:
> On Thu, Jul 19, 2018 at 11:40 PM Ahmed Abd El Mawgood
> Why are you implementing this in the kernel, instead of doing it in
> host userspace?
I thought about implementing it completely in QEMU, but it won't be
possible for a few reasons:
- After talking to the QEMU folks I came to the conclusion that when it
comes to managing memory allocated for the guest, it is always better to let
KVM handle everything, unless there is a good reason to play with that
memory chunk inside QEMU itself.
- There is also a good reason for implementing ROE in kernel space:
ROE is architecture dependent to a great extent. I should have
emphasized that the only currently supported architecture is x86. I am
not sure how deep the dependency on architecture goes, but for now
the current set of patches does an SPTE enumeration as part of the process.
To the best of my knowledge, this isn't exposed outside arch/x86/kvm, let alone
through a host user space interface. Also, the way I am planning to
protect the TLB from malicious gva -> gpa mappings relies on the fact that on x86
it is possible to VMEXIT on page faults. I am not sure it would be safe to
assume that all KVM-supported architectures behave this way.
For these reasons I thought it would be better if the arch dependent stuff (the
mechanism implementation) is kept in the arch/*/kvm folder, with minimal
modifications to virt/kvm/*, after setting a kconfig variable to enable ROE.
But I left room for the user space app using KVM to decide the rightful policy
for handling ROE violations. It works by reporting a KVM_EXIT_MMIO error to user
space, keeping all the architectural details hidden away from user space.
A last note is that I didn't create this from scratch; instead I extended the
KVM_MEM_READONLY implementation to also allow R/O per page instead of
R/O per whole slot, which is already done in kernel space.
^ permalink raw reply
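Editor's sketch of the per-page idea described in the reply above: a whole-slot read-only flag is complemented by one bit per page, and there is deliberately no operation that clears a bit. This is a user-space model under stated assumptions, not the patch's implementation. The structure layout, the slot size and all `demo_` names are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT    12
#define DEMO_SLOT_PAGES    1024 /* hypothetical slot size in pages */
#define DEMO_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

struct demo_memslot {
	uint64_t base_gfn;  /* first guest frame number covered by the slot */
	bool slot_readonly; /* whole-slot flag, analogous to KVM_MEM_READONLY */
	unsigned long roe_bitmap[DEMO_SLOT_PAGES / DEMO_BITS_PER_WORD];
};

/* Translate a guest physical address to a page index inside the slot. */
static bool demo_page_index(const struct demo_memslot *slot, uint64_t gpa,
			    size_t *idx)
{
	uint64_t gfn = gpa >> DEMO_PAGE_SHIFT;

	if (gfn < slot->base_gfn || gfn - slot->base_gfn >= DEMO_SLOT_PAGES)
		return false;
	*idx = (size_t)(gfn - slot->base_gfn);
	return true;
}

/* Mark one page read-only. There is no counterpart that clears the bit. */
static bool demo_roe_protect(struct demo_memslot *slot, uint64_t gpa)
{
	size_t idx;

	if (!demo_page_index(slot, gpa, &idx))
		return false;
	slot->roe_bitmap[idx / DEMO_BITS_PER_WORD] |=
		1UL << (idx % DEMO_BITS_PER_WORD);
	return true;
}

/* A write is allowed only if neither the slot nor the page is read-only. */
static bool demo_write_allowed(const struct demo_memslot *slot, uint64_t gpa)
{
	size_t idx;

	if (!demo_page_index(slot, gpa, &idx) || slot->slot_readonly)
		return false;
	return !(slot->roe_bitmap[idx / DEMO_BITS_PER_WORD] &
		 (1UL << (idx % DEMO_BITS_PER_WORD)));
}
```

In the real design the bitmap is consulted when SPTEs are created or updated, and a denied write is what would surface to user space as the KVM_EXIT_MMIO exit mentioned above.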
* Re: [PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE
From: Randy Dunlap @ 2018-07-20 1:07 UTC (permalink / raw)
To: Ahmed Abd El Mawgood, kvm, Kernel Hardening, virtualization,
linux-doc, x86
Cc: Ard Biesheuvel, Kees Cook, nathan Corbet, David Vrabel, rkrcmar,
Boris Lukashev, Ingo Molnar, nigel.edwards, hpa, Paolo Bonzini,
Thomas Gleixner, Rik van Riel
In-Reply-To: <20180719213802.17161-4-ahmedsoliman0x666@gmail.com>
On 07/19/2018 02:38 PM, Ahmed Abd El Mawgood wrote:
> This patch introduces a hypercall implemented for X86 that can assist
> against subset of kernel rootkits, it works by place readonly protection in
> shadow PTE. The end result protection is also kept in a bitmap for each
> kvm_memory_slot and is used as reference when updating SPTEs. The whole
> goal is to protect the guest kernel static data from modification if
> attacker is running from guest ring 0, for this reason there is no
> hypercall to revert effect of Memory ROE hypercall. This patch doesn't
> implement integrity check on guest TLB so obvious attack on the current
> implementation will involve guest virtual address -> guest physical
> address remapping, but there are plans to fix that.
>
> Signed-off-by: Ahmed Abd El Mawgood <ahmedsoliman0x666@gmail.com>
> ---
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 92fd433c50b9..8ae822a8dc7a 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -96,6 +96,13 @@ config KVM_MMU_AUDIT
> This option adds a R/W kVM module parameter 'mmu_audit', which allows
> auditing of KVM MMU events at runtime.
>
> +config KVM_MROE
> + bool "Hypercall Memory Read-Only Enforcement"
> + depends on KVM && X86
> + help
> + This option add KVM_HC_HMROE hypercall to kvm which as hardening
adds to kvm as a hardening (???)
> + mechanism to protect memory pages from being edited.
> +
> # OK, it's a little counter-intuitive to do this, but it puts it neatly under
> # the virtualization menu.
> source drivers/vhost/Kconfig
--
~Randy
^ permalink raw reply
* Re: [PATCH 1/3] [RFC V3] KVM: X86: Memory ROE documentation
From: Randy Dunlap @ 2018-07-20 1:11 UTC (permalink / raw)
To: Ahmed Abd El Mawgood, kvm, Kernel Hardening, virtualization,
linux-doc, x86
Cc: Ard Biesheuvel, Kees Cook, nathan Corbet, David Vrabel, rkrcmar,
Boris Lukashev, Ingo Molnar, nigel.edwards, hpa, Paolo Bonzini,
Thomas Gleixner, Rik van Riel
In-Reply-To: <20180719213802.17161-2-ahmedsoliman0x666@gmail.com>
On 07/19/2018 02:38 PM, Ahmed Abd El Mawgood wrote:
> Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt
> index a890529c63ed..a9db68adb7c9 100644
> --- a/Documentation/virtual/kvm/hypercalls.txt
> +++ b/Documentation/virtual/kvm/hypercalls.txt
> @@ -121,3 +121,17 @@ compute the CLOCK_REALTIME for its clock, at the same instant.
>
> Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
> or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
> +
> +7. KVM_HC_HMROE
> +----------------
> +Architecture: x86
> +Status: active
> +Purpose: Hypercall used to apply Read-Only Enforcement to guest pages
> +Usage:
> + a0: start address of page that should be protected.
Is this done one page per call? No grouping, no multiple pages?
> +
> +This hypercall lets a guest kernel to have part of its read/write memory
lets a guest kernel have part of
> +converted into read-only. This action is irreversible. KVM_HC_HMROE can
> +not be triggered from guest Ring 3 (user mode). The reason is that user
> +mode malicious software can make use of it enforce read only protection on
make use of it to enforce
> +an arbitrary memory page thus crashing the kernel.
>
--
~Randy
^ permalink raw reply
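Editor's sketch relating to the question above about one page per call: if the hypercall takes a single page address in a0, as the documentation text suggests, a guest would have to loop over a range itself. The model below is hypothetical throughout. The callback type stands in for the real hypercall, and the range helper is not part of the patch set.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u

/* Stand-in for issuing the hypercall with a0 = page address; 0 means success. */
typedef long (*demo_hypercall_fn)(uint64_t page_addr);

/*
 * Protect every page touched by [start, start + len).
 * Returns the number of pages protected, or -1 on the first failure.
 * Assumes start + len does not wrap around.
 */
static long demo_protect_range(demo_hypercall_fn hc, uint64_t start,
			       uint64_t len)
{
	uint64_t mask = ~(uint64_t)(DEMO_PAGE_SIZE - 1);
	uint64_t first, last;
	long count = 0;

	if (len == 0)
		return 0;
	first = start & mask;
	last = (start + len - 1) & mask;
	for (uint64_t addr = first; ; addr += DEMO_PAGE_SIZE) {
		if (hc(addr) != 0)
			return -1;
		count++;
		if (addr == last)
			break;
	}
	return count;
}

/* Stub hypercall used for demonstration; it only counts invocations. */
static int demo_calls;

static long demo_stub_hypercall(uint64_t page_addr)
{
	(void)page_addr;
	demo_calls++;
	return 0;
}
```

Note that a range which is not page aligned is widened to whole pages, so unrelated data sharing those pages would become read-only as well. Since the protection is irreversible, a caller would want the protected objects to be page aligned.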
* Re: Memory Read Only Enforcement: VMM assisted kernel rootkit mitigation for KVM
From: Konrad Rzeszutek Wilk @ 2018-07-20 2:45 UTC (permalink / raw)
To: Ahmed Abd El Mawgood, xen-devel
Cc: nathan Corbet, Ard Biesheuvel, rkrcmar, Kees Cook, kvm, linux-doc,
David Vrabel, x86, Boris Lukashev, virtualization, Ingo Molnar,
nigel.edwards, hpa, Kernel Hardening, Paolo Bonzini,
Thomas Gleixner, Rik van Riel
In-Reply-To: <20180719213802.17161-1-ahmedsoliman0x666@gmail.com>
On Thu, Jul 19, 2018 at 11:37:59PM +0200, Ahmed Abd El Mawgood wrote:
> Hi,
>
> This is my first set of patches that works as I would expect, and the
> third revision I sent to mailing lists.
>
> Following up with my previous discussions about kernel rootkit mitigation
> via placing R/O protection on critical data structure, static data,
> privileged registers with static content. These patches present the
> first part where it is only possible to place these protections on
> memory pages. Feature-wise, this set of patches is incomplete in the sense of:
> - They still don't protect privileged registers
> - They don't protect guest TLB from malicious gva -> gpa page mappings.
> But they provide sketches for a basic working design. Note that I am totally
> noob and it took lots of time and effort to get to this point. So sorry in
> advance if I overlooked something.
This reminds me of the Xen PV page model. That is, the hypervisor is the one
auditing the page tables, and the guest's pages are read-only.
Ditto for IDT, GDT, etc. Gosh, did you by chance look at how the
Xen PV mechanism is done? It may provide the protection you are looking for.
CC-ing xen-devel.
>
> [PATCH 1/3] [RFC V3] KVM: X86: Memory ROE documentation
> [PATCH 2/3] [RFC V3] KVM: X86: Adding arbitrary data pointer in kvm memslot itterator functions
> [PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE
>
> Summery:
>
> Documentation/virtual/kvm/hypercalls.txt | 14 ++++
> arch/x86/include/asm/kvm_host.h | 11 ++-
> arch/x86/kvm/Kconfig | 7 ++
> arch/x86/kvm/mmu.c | 127 ++++++++++++++++++++++---------
> arch/x86/kvm/x86.c | 82 +++++++++++++++++++-
> include/linux/kvm_host.h | 3 +
> include/uapi/linux/kvm_para.h | 1 +
> virt/kvm/kvm_main.c | 29 ++++++-
> 8 files changed, 232 insertions(+), 42 deletions(-)
>
^ permalink raw reply
* [RFC 0/4] Virtio uses DMA API for all devices
From: Anshuman Khandual @ 2018-07-20 3:59 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: robh, srikar, mst, benh, linuxram, hch, paulus, mpe, joe,
khandual, linuxppc-dev, elfring, haren, david
This patch series is the follow up on the discussions we had before about
the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
for virito devices (https://patchwork.kernel.org/patch/10417371/). There
were suggestions about doing away with two different paths of transactions
with the host/QEMU, first being the direct GPA and the other being the DMA
API based translations.
First patch attempts to create a direct GPA mapping based DMA operations
structure called 'virtio_direct_dma_ops' with exact same implementation
of the direct GPA path which virtio core currently has but just wrapped in
a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
existing semantics. The second patch does exactly that inside the function
virtio_finalize_features(). The third patch removes the default direct GPA
path from virtio core forcing it to use DMA API callbacks for all devices.
Now with that change, every device must have a DMA operations structure
associated with it. The fourth patch adds an additional hook which gives
the platform an opportunity to do yet another override if required. This
platform hook can be used on POWER Ultravisor based protected guests to
load up SWIOTLB DMA callbacks that do the required bounce buffering into
shared memory for all I/O scatter gather buffers to be consumed on the
host side (as discussed previously in the above mentioned thread, the host
is allowed to access only parts of the guest GPA range).
Please go through these patches and review whether this approach broadly
makes sense. I will appreciate suggestions, inputs, comments regarding
the patches or the approach in general. Thank you.
Anshuman Khandual (4):
virtio: Define virtio_direct_dma_ops structure
virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
virtio: Force virtio core to use DMA API callbacks for all virtio devices
virtio: Add platform specific DMA API translation for virito devices
arch/powerpc/include/asm/dma-mapping.h | 6 +++
arch/powerpc/platforms/pseries/iommu.c | 6 +++
drivers/virtio/virtio.c | 72 ++++++++++++++++++++++++++++++++++
drivers/virtio/virtio_pci_common.h | 3 ++
drivers/virtio/virtio_ring.c | 65 +-----------------------------
5 files changed, 89 insertions(+), 63 deletions(-)
--
2.9.3
^ permalink raw reply
* [RFC 1/4] virtio: Define virtio_direct_dma_ops structure
From: Anshuman Khandual @ 2018-07-20 3:59 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: robh, srikar, mst, benh, linuxram, hch, paulus, mpe, joe,
khandual, linuxppc-dev, elfring, haren, david
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>
The current implementation of the DMA API inside virtio core calls the device's
DMA OPS callback functions when the VIRTIO_F_IOMMU_PLATFORM flag is set. But
in the absence of the flag, virtio core falls back to a basic transformation
of the incoming SG addresses into GPA. Going forward, virtio should only call
DMA API based transformations, generating either GPA or IOVA depending on
QEMU expectations, again based on the VIRTIO_F_IOMMU_PLATFORM flag. This requires
removing the existing fallback code path for GPA transformation and replacing
it with a direct map DMA OPS structure. This patch adds that direct mapping DMA
OPS structure, to be used in later patches which will make virtio core call
the DMA API all the time for all virtio devices.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
drivers/virtio/virtio.c | 60 ++++++++++++++++++++++++++++++++++++++
drivers/virtio/virtio_pci_common.h | 3 ++
2 files changed, 63 insertions(+)
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 59e36ef..7907ad3 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -3,6 +3,7 @@
#include <linux/virtio_config.h>
#include <linux/module.h>
#include <linux/idr.h>
+#include <linux/dma-mapping.h>
#include <uapi/linux/virtio_ids.h>
/* Unique numbering for virtio devices. */
@@ -442,3 +443,62 @@ core_initcall(virtio_init);
module_exit(virtio_exit);
MODULE_LICENSE("GPL");
+
+/*
+ * Virtio direct mapping DMA API operations structure
+ *
+ * This defines DMA API structure for all virtio devices which would not
+ * either bring in their own DMA OPS from architecture or they would not
+ * like to use architecture specific IOMMU based DMA OPS because QEMU
+ * expects GPA instead of an IOVA in absence of VIRTIO_F_IOMMU_PLATFORM.
+ */
+dma_addr_t virtio_direct_map_page(struct device *dev, struct page *page,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ return page_to_phys(page) + offset;
+}
+
+void virtio_direct_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+}
+
+int virtio_direct_mapping_error(struct device *hwdev, dma_addr_t dma_addr)
+{
+ return 0;
+}
+
+void *virtio_direct_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
+ gfp_t gfp, unsigned long attrs)
+{
+ void *queue = alloc_pages_exact(PAGE_ALIGN(size), gfp);
+
+ if (queue) {
+ phys_addr_t phys_addr = virt_to_phys(queue);
+ *dma_handle = (dma_addr_t)phys_addr;
+
+ if (WARN_ON_ONCE(*dma_handle != phys_addr)) {
+ free_pages_exact(queue, PAGE_ALIGN(size));
+ return NULL;
+ }
+ }
+ return queue;
+}
+
+void virtio_direct_free(struct device *dev, size_t size, void *vaddr,
+ dma_addr_t dma_addr, unsigned long attrs)
+{
+ free_pages_exact(vaddr, PAGE_ALIGN(size));
+}
+
+const struct dma_map_ops virtio_direct_dma_ops = {
+ .alloc = virtio_direct_alloc,
+ .free = virtio_direct_free,
+ .map_page = virtio_direct_map_page,
+ .unmap_page = virtio_direct_unmap_page,
+ .mapping_error = virtio_direct_mapping_error,
+};
+EXPORT_SYMBOL(virtio_direct_dma_ops);
diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index 135ee3c..ec44d2f 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -31,6 +31,9 @@
#include <linux/highmem.h>
#include <linux/spinlock.h>
+extern struct dma_map_ops virtio_direct_dma_ops;
+
+
struct virtio_pci_vq_info {
/* the actual virtqueue */
struct virtqueue *vq;
--
2.9.3
^ permalink raw reply related
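Editor's sketch for RFC 1/4: the direct ops hand the physical address back as the DMA handle, and virtio_direct_alloc() rejects the allocation if the handle type cannot represent the address. The model below forces that situation by assuming a 32-bit handle and a 64-bit physical address. The type widths and `demo_` names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t demo_phys_addr_t; /* assumed wider than the DMA handle */
typedef uint32_t demo_dma_addr_t;  /* assumed 32-bit handle to show truncation */

/*
 * Identity mapping: the handle is the physical address itself.
 * Returns false if the handle type cannot represent the address,
 * mirroring the WARN_ON_ONCE() check in virtio_direct_alloc().
 */
static bool demo_direct_map(demo_phys_addr_t phys, demo_dma_addr_t *handle)
{
	*handle = (demo_dma_addr_t)phys;
	return *handle == phys;
}
```

An address below 4 GiB round-trips unchanged, while one above it is reported as unusable instead of being silently truncated.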
* [RFC 2/4] virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
From: Anshuman Khandual @ 2018-07-20 3:59 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: robh, srikar, mst, benh, linuxram, hch, paulus, mpe, joe,
khandual, linuxppc-dev, elfring, haren, david
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>
Now that virtio core always needs all virtio devices to have DMA OPS, we
need to make sure that the structure it points to is the right one. In the
absence of the VIRTIO_F_IOMMU_PLATFORM flag, QEMU expects GPA from the guest
kernel. In that case, the virtio device must use the default
virtio_direct_dma_ops DMA OPS structure, which transforms scatter gather
buffer addresses into GPA. This DMA OPS override must happen as early as
possible during the virtio device initialization sequence, before virtio core
starts using the given device's DMA OPS callbacks for I/O transactions. This
change detects the device's IOMMU flag and does the override in case the flag
is cleared.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
drivers/virtio/virtio.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 7907ad3..6b13987 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -166,6 +166,8 @@ void virtio_add_status(struct virtio_device *dev, unsigned int status)
}
EXPORT_SYMBOL_GPL(virtio_add_status);
+const struct dma_map_ops virtio_direct_dma_ops;
+
int virtio_finalize_features(struct virtio_device *dev)
{
int ret = dev->config->finalize_features(dev);
@@ -174,6 +176,9 @@ int virtio_finalize_features(struct virtio_device *dev)
if (ret)
return ret;
+ if (virtio_has_iommu_quirk(dev))
+ set_dma_ops(dev->dev.parent, &virtio_direct_dma_ops);
+
if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1))
return 0;
--
2.9.3
^ permalink raw reply related
* [RFC 3/4] virtio: Force virtio core to use DMA API callbacks for all virtio devices
From: Anshuman Khandual @ 2018-07-20 3:59 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: robh, srikar, mst, benh, linuxram, hch, paulus, mpe, joe,
khandual, linuxppc-dev, elfring, haren, david
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>
Virtio core should use DMA API callbacks for all virtio devices, which may
generate either GPA or IOVA depending on the VIRTIO_F_IOMMU_PLATFORM flag and
the resulting QEMU expectations. This implies that every virtio device needs
to have a DMA OPS structure. This removes the previous GPA fallback code paths.
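As a rough user-space model of the single code path this leaves behind, the
sketch below always maps through an ops table. All sketch_* names are
hypothetical, and the fixed offset used by the toy IOMMU callback is an
assumption that merely stands in for a real IOVA allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sketch_dma_addr_t;

/* Hypothetical ops table; the kernel's struct dma_map_ops has more callbacks. */
struct sketch_dma_ops {
	sketch_dma_addr_t (*map)(uint64_t guest_phys, void *ctx);
};

/* Direct ops: the device sees guest physical addresses unchanged. */
sketch_dma_addr_t sketch_direct_map(uint64_t guest_phys, void *ctx)
{
	(void)ctx;
	return guest_phys;
}

/* Toy IOMMU ops: a fixed offset stands in for a real IOVA allocation. */
sketch_dma_addr_t sketch_iommu_map(uint64_t guest_phys, void *ctx)
{
	const uint64_t *iova_offset = ctx;

	return *iova_offset + guest_phys;
}

/* No fallback branch: the address always comes from the ops callback. */
sketch_dma_addr_t sketch_vring_map(const struct sketch_dma_ops *ops,
				   uint64_t guest_phys, void *ctx)
{
	assert(ops != NULL && ops->map != NULL);
	return ops->map(guest_phys, ctx);
}
```

With the direct callback the returned address equals the GPA; with the toy
IOMMU callback it is a translated address, and the caller cannot tell the
difference.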
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
drivers/virtio/virtio_ring.c | 65 ++------------------------------------------
1 file changed, 2 insertions(+), 63 deletions(-)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 814b395..c265964 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -141,26 +141,6 @@ struct vring_virtqueue {
* unconditionally on data path.
*/
-static bool vring_use_dma_api(struct virtio_device *vdev)
-{
- if (!virtio_has_iommu_quirk(vdev))
- return true;
-
- /* Otherwise, we are left to guess. */
- /*
- * In theory, it's possible to have a buggy QEMU-supposed
- * emulated Q35 IOMMU and Xen enabled at the same time. On
- * such a configuration, virtio has never worked and will
- * not work without an even larger kludge. Instead, enable
- * the DMA API if we're a Xen guest, which at least allows
- * all of the sensible Xen configurations to work correctly.
- */
- if (xen_domain())
- return true;
-
- return false;
-}
-
/*
* The DMA ops on various arches are rather gnarly right now, and
* making all of the arch DMA ops work on the vring device itself
@@ -176,9 +156,6 @@ static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
struct scatterlist *sg,
enum dma_data_direction direction)
{
- if (!vring_use_dma_api(vq->vq.vdev))
- return (dma_addr_t)sg_phys(sg);
-
/*
* We can't use dma_map_sg, because we don't use scatterlists in
* the way it expects (we don't guarantee that the scatterlist
@@ -193,9 +170,6 @@ static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
void *cpu_addr, size_t size,
enum dma_data_direction direction)
{
- if (!vring_use_dma_api(vq->vq.vdev))
- return (dma_addr_t)virt_to_phys(cpu_addr);
-
return dma_map_single(vring_dma_dev(vq),
cpu_addr, size, direction);
}
@@ -205,9 +179,6 @@ static void vring_unmap_one(const struct vring_virtqueue *vq,
{
u16 flags;
- if (!vring_use_dma_api(vq->vq.vdev))
- return;
-
flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
if (flags & VRING_DESC_F_INDIRECT) {
@@ -228,9 +199,6 @@ static void vring_unmap_one(const struct vring_virtqueue *vq,
static int vring_mapping_error(const struct vring_virtqueue *vq,
dma_addr_t addr)
{
- if (!vring_use_dma_api(vq->vq.vdev))
- return 0;
-
return dma_mapping_error(vring_dma_dev(vq), addr);
}
@@ -1016,43 +984,14 @@ EXPORT_SYMBOL_GPL(__vring_new_virtqueue);
static void *vring_alloc_queue(struct virtio_device *vdev, size_t size,
dma_addr_t *dma_handle, gfp_t flag)
{
- if (vring_use_dma_api(vdev)) {
- return dma_alloc_coherent(vdev->dev.parent, size,
+ return dma_alloc_coherent(vdev->dev.parent, size,
dma_handle, flag);
- } else {
- void *queue = alloc_pages_exact(PAGE_ALIGN(size), flag);
- if (queue) {
- phys_addr_t phys_addr = virt_to_phys(queue);
- *dma_handle = (dma_addr_t)phys_addr;
-
- /*
- * Sanity check: make sure we dind't truncate
- * the address. The only arches I can find that
- * have 64-bit phys_addr_t but 32-bit dma_addr_t
- * are certain non-highmem MIPS and x86
- * configurations, but these configurations
- * should never allocate physical pages above 32
- * bits, so this is fine. Just in case, throw a
- * warning and abort if we end up with an
- * unrepresentable address.
- */
- if (WARN_ON_ONCE(*dma_handle != phys_addr)) {
- free_pages_exact(queue, PAGE_ALIGN(size));
- return NULL;
- }
- }
- return queue;
- }
}
static void vring_free_queue(struct virtio_device *vdev, size_t size,
void *queue, dma_addr_t dma_handle)
{
- if (vring_use_dma_api(vdev)) {
- dma_free_coherent(vdev->dev.parent, size, queue, dma_handle);
- } else {
- free_pages_exact(queue, PAGE_ALIGN(size));
- }
+ dma_free_coherent(vdev->dev.parent, size, queue, dma_handle);
}
struct virtqueue *vring_create_virtqueue(
--
2.9.3
^ permalink raw reply related
* [RFC 4/4] virtio: Add platform specific DMA API translation for virtio devices
From: Anshuman Khandual @ 2018-07-20 3:59 UTC (permalink / raw)
To: virtualization, linux-kernel
Cc: robh, srikar, mst, benh, linuxram, hch, paulus, mpe, joe,
khandual, linuxppc-dev, elfring, haren, david
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>
This adds a hook which a platform can define in order to allow it to
override virtio device's DMA OPS irrespective of whether it has the
flag VIRTIO_F_IOMMU_PLATFORM set or not. We want to use this to do
bounce-buffering of data on the new secure pSeries platform, currently
under development, where a KVM host cannot access all of the memory
space of a secure KVM guest. The host can only access the pages which
the guest has explicitly requested to be shared with the host, thus
the virtio implementation in the guest has to copy data to and from
shared pages.
With this hook, the platform code in the secure guest can force the
use of swiotlb for virtio buffers, with a back-end for swiotlb which
will use a pool of pre-allocated shared pages. Thus all data being
sent or received by virtio devices will be copied through pages which
the host has access to.
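The bounce-buffering idea can be illustrated with a small user-space sketch.
It is not the swiotlb implementation: the pool layout, the slot size and all
sketch_* names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SLOT_SIZE 4096
#define SKETCH_NR_SLOTS  4

/* Hypothetical pool of pages that the guest has shared with the host. */
struct sketch_bounce_pool {
	unsigned char slots[SKETCH_NR_SLOTS][SKETCH_SLOT_SIZE];
	int in_use[SKETCH_NR_SLOTS];
};

/* Copy private data into a free shared slot; returns the slot index or -1. */
int sketch_bounce_map(struct sketch_bounce_pool *pool, const void *priv,
		      size_t len)
{
	if (len > SKETCH_SLOT_SIZE)
		return -1;
	for (int i = 0; i < SKETCH_NR_SLOTS; i++) {
		if (!pool->in_use[i]) {
			pool->in_use[i] = 1;
			memcpy(pool->slots[i], priv, len);
			return i;
		}
	}
	return -1;
}

/* Copy device-written data back to private memory and release the slot. */
void sketch_bounce_unmap(struct sketch_bounce_pool *pool, int slot,
			 void *priv, size_t len)
{
	assert(slot >= 0 && slot < SKETCH_NR_SLOTS && pool->in_use[slot]);
	assert(len <= SKETCH_SLOT_SIZE);
	memcpy(priv, pool->slots[slot], len);
	pool->in_use[slot] = 0;
}
```

The host only ever touches pool->slots; the private buffers stay out of its
reach, at the cost of one extra copy in each direction.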
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/dma-mapping.h | 6 ++++++
arch/powerpc/platforms/pseries/iommu.c | 6 ++++++
drivers/virtio/virtio.c | 7 +++++++
3 files changed, 19 insertions(+)
diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index 8fa3945..bc5a9d3 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -116,3 +116,9 @@ extern u64 __dma_get_required_mask(struct device *dev);
#endif /* __KERNEL__ */
#endif /* _ASM_DMA_MAPPING_H */
+
+#define platform_override_dma_ops platform_override_dma_ops
+
+struct virtio_device;
+
+extern void platform_override_dma_ops(struct virtio_device *vdev);
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 06f0296..5773bc7 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -38,6 +38,7 @@
#include <linux/of.h>
#include <linux/iommu.h>
#include <linux/rculist.h>
+#include <linux/virtio.h>
#include <asm/io.h>
#include <asm/prom.h>
#include <asm/rtas.h>
@@ -1396,3 +1397,8 @@ static int __init disable_multitce(char *str)
__setup("multitce=", disable_multitce);
machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
+
+void platform_override_dma_ops(struct virtio_device *vdev)
+{
+ /* Override vdev->parent.dma_ops if required */
+}
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 6b13987..432c332 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -168,6 +168,12 @@ EXPORT_SYMBOL_GPL(virtio_add_status);
const struct dma_map_ops virtio_direct_dma_ops;
+#ifndef platform_override_dma_ops
+static inline void platform_override_dma_ops(struct virtio_device *vdev)
+{
+}
+#endif
+
int virtio_finalize_features(struct virtio_device *dev)
{
int ret = dev->config->finalize_features(dev);
@@ -179,6 +185,7 @@ int virtio_finalize_features(struct virtio_device *dev)
if (virtio_has_iommu_quirk(dev))
set_dma_ops(dev->dev.parent, &virtio_direct_dma_ops);
+ platform_override_dma_ops(dev);
if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1))
return 0;
--
2.9.3
^ permalink raw reply related
* [PATCH v36 0/5] Virtio-balloon: support free page reporting
From: Wei Wang @ 2018-07-20 8:33 UTC (permalink / raw)
To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
mhocko, akpm, torvalds
Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
nilal
This patch series is separated from the previous "Virtio-balloon
Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,
implemented by this series enables the virtio-balloon driver to report
hints of guest free pages to the host. It can be used to accelerate live
migration of VMs. Here is an introduction to this usage:
Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.
This feature optimizes live migration by skipping the transfer of guest
free pages. It does not matter if the memory pages are used after they are
given to the hypervisor as a hint of the free pages, because they will be
tracked by the hypervisor and transferred in a subsequent round if they are
used and written.
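A toy model of the rounds described above may help. It is a sketch under
simplifying assumptions: one flag per page instead of a real dirty bitmap, and
hypothetical sketch_* names that correspond to no QEMU or kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_PAGES 8

/* Hypothetical per-VM migration state kept by the hypervisor. */
struct sketch_migration {
	bool dirty[SKETCH_NR_PAGES]; /* pages that still need to be sent */
};

/* The first round starts with every page marked as needing transfer. */
void sketch_start(struct sketch_migration *m)
{
	for (size_t i = 0; i < SKETCH_NR_PAGES; i++)
		m->dirty[i] = true;
}

/* A free page hint only clears the pending flag; it is not a promise. */
void sketch_free_page_hint(struct sketch_migration *m, size_t page)
{
	assert(page < SKETCH_NR_PAGES);
	m->dirty[page] = false;
}

/* Write tracking marks the page again when the guest writes to it. */
void sketch_guest_write(struct sketch_migration *m, size_t page)
{
	assert(page < SKETCH_NR_PAGES);
	m->dirty[page] = true;
}

/* Send all pending pages, clear their flags, return how many were sent. */
size_t sketch_run_round(struct sketch_migration *m)
{
	size_t sent = 0;

	for (size_t i = 0; i < SKETCH_NR_PAGES; i++) {
		if (m->dirty[i]) {
			m->dirty[i] = false;
			sent++;
		}
	}
	return sent;
}
```

If three of the eight pages are hinted as free, the first round sends five
pages. A hinted page that the guest writes afterwards is marked again and goes
out in the next round, which is why a stale hint is harmless.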
* Tests
- Test Environment
Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Guest: 8G RAM, 4 vCPU
Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second
- Test Results
- Idle Guest Live Migration Time (results are averaged over 10 runs):
- Optimization v.s. Legacy = 409ms v.s. 1757ms --> ~77% reduction
(setting page poisoning zero and enabling ksm don't affect the
comparison result)
- Guest with Linux Compilation Workload (make bzImage -j4):
- Live Migration Time (average)
Optimization v.s. Legacy = 1407ms v.s. 2528ms --> ~44% reduction
- Linux Compilation Time
Optimization v.s. Legacy = 5min4s v.s. 5min12s
--> no obvious difference
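For reference, the reduction figures above follow from
(legacy - optimized) / legacy. A minimal helper for the arithmetic, with a
hypothetical name, could look like this:

```c
#include <assert.h>

/* Percentage reduction of the optimized time relative to the legacy time. */
double sketch_reduction_pct(double legacy_ms, double optimized_ms)
{
	assert(legacy_ms > 0.0);
	return (legacy_ms - optimized_ms) / legacy_ms * 100.0;
}
```

Feeding in the idle guest numbers (1757ms and 409ms) gives a value just below
77, and the compilation workload numbers (2528ms and 1407ms) give a value just
above 44.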
ChangeLog:
v35->v36:
- remove the mm patch, as Linus has a suggestion to get free page
addresses via allocation, instead of reading from the free page
list.
- virtio-balloon:
- replace oom notifier with shrinker;
- the guest to host communication interface remains the same as
v32.
- allocate free page blocks and send to host one by one, and free
them after sending all the pages.
For ChangeLogs from v22 to v35, please refer to
https://lwn.net/Articles/759413/
For ChangeLogs before v21, please refer to
https://lwn.net/Articles/743660/
Wei Wang (5):
virtio-balloon: remove BUG() in init_vqs
virtio_balloon: replace oom notifier with shrinker
virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
mm/page_poison: expose page_poisoning_enabled to kernel modules
virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
drivers/virtio/virtio_balloon.c | 456 ++++++++++++++++++++++++++++++------
include/uapi/linux/virtio_balloon.h | 7 +
mm/page_poison.c | 6 +
3 files changed, 394 insertions(+), 75 deletions(-)
--
2.7.4
^ permalink raw reply
* [PATCH v36 1/5] virtio-balloon: remove BUG() in init_vqs
From: Wei Wang @ 2018-07-20 8:33 UTC (permalink / raw)
To: virtio-dev, linux-kernel, virtualization, kvm, linux-mm, mst,
mhocko, akpm, torvalds
Cc: yang.zhang.wz, riel, quan.xu0, liliang.opensource, pbonzini,
nilal
In-Reply-To: <1532075585-39067-1-git-send-email-wei.w.wang@intel.com>
It's a bit overkill to use BUG when failing to add an entry to the
stats_vq in init_vqs. So remove it and just return the error to the
caller to bail out nicely.
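The pattern is ordinary error propagation. A user-space sketch follows, in
which sketch_add_outbuf() is a hypothetical stand-in for
virtqueue_add_outbuf() and the free_slots parameter is an assumption used only
to force a failure:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in: returns 0 on success or a negative errno value. */
int sketch_add_outbuf(int free_slots)
{
	assert(free_slots >= 0);
	return free_slots > 0 ? 0 : -ENOSPC;
}

/* Warn and hand the error to the caller instead of halting the system. */
int sketch_init_vqs(int free_slots)
{
	int err = sketch_add_outbuf(free_slots);

	if (err) {
		fprintf(stderr, "%s: add stat_vq failed\n", __func__);
		return err;
	}
	return 0;
}
```

The caller can then unwind whatever it has set up so far and fail the probe,
which is less drastic than taking the whole guest down.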
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
---
drivers/virtio/virtio_balloon.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6b237e3..9356a1a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -455,9 +455,13 @@ static int init_vqs(struct virtio_balloon *vb)
num_stats = update_balloon_stats(vb);
sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
- if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
- < 0)
- BUG();
+ err = virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb,
+ GFP_KERNEL);
+ if (err) {
+ dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+ __func__);
+ return err;
+ }
virtqueue_kick(vb->stats_vq);
}
return 0;
--
2.7.4
^ permalink raw reply related