Linux Documentation

Linux Documentation
 help / color / mirror / Atom feed

* [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing
From: Joel Fernandes (Google) @ 2019-08-05 17:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google), Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas, Christian Hansen,
	dancol, fmayer, H. Peter Anvin, Ingo Molnar, joelaf,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport, minchan,
	namhyung, paulmck, Robin Murphy, Roman Gushchin, Stephen Rothwell,
	surenb, Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon

The page_idle tracking feature currently requires looking up the pagemap
for a process followed by interacting with /sys/kernel/mm/page_idle.
Looking up PFN from pagemap in Android devices is not supported by
unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.

This patch adds support to directly interact with page_idle tracking at
the PID level by introducing a /proc/<pid>/page_idle file.  It follows
the exact same semantics as the global /sys/kernel/mm/page_idle, but now
looking up PFN through pagemap is not needed since the interface uses
virtual frame numbers, and at the same time also does not require
SYS_ADMIN.

In Android, we are using this for the heap profiler (heapprofd) which
profiles and pin points code paths which allocates and leaves memory
idle for long periods of time. This method solves the security issue
with userspace learning the PFN, and while at it is also shown to yield
better results than the pagemap lookup, the theory being that the window
where the address space can change is reduced by eliminating the
intermediate pagemap look up stage. In virtual address indexing, the
process's mmap_sem is held for the duration of the access.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
v3->v4: Minor fixups (Minchan)
        Add swap pte handling (Konstantin, Minchan)
v2->v3:
Fixed a bug where I was doing a kfree that is not needed due to not
needing to do GFP_ATOMIC allocations.

v1->v2:
Mark swap ptes as idle (Minchan)
Avoid need for GFP_ATOMIC (Andrew)
Get rid of idle_page_list lock by moving list to stack

Internal review -> v1:
Fixes from Suren.
Corrections to change log, docs (Florian, Sandeep)

 arch/Kconfig                  |   3 +
 fs/proc/base.c                |   3 +
 fs/proc/internal.h            |   1 +
 fs/proc/task_mmu.c            |  43 ++++
 include/asm-generic/pgtable.h |   6 +
 include/linux/page_idle.h     |   4 +
 mm/page_idle.c                | 359 +++++++++++++++++++++++++++++-----
 mm/rmap.c                     |   2 +
 8 files changed, 376 insertions(+), 45 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a7b57dd42c26..3aa121ce824e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -575,6 +575,9 @@ config ARCH_WANT_HUGE_PMD_SHARE
 config HAVE_ARCH_SOFT_DIRTY
 	bool
 
+config HAVE_ARCH_PTE_SWP_PGIDLE
+	bool
+
 config HAVE_MOD_ARCH_SPECIFIC
 	bool
 	help
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..fd2f74bd4e35 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3039,6 +3039,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
+#endif
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cd0c8d5ce9a1..bc9371880c63 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
 extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_page_idle_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 582c5e680176..a9003fe8d267 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1650,6 +1650,49 @@ const struct file_operations proc_pagemap_operations = {
 	.open		= pagemap_open,
 	.release	= pagemap_release,
 };
+
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
+				   size_t count, loff_t *ppos)
+{
+	return page_idle_proc_read(file, buf, count, ppos);
+}
+
+static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	return page_idle_proc_write(file, (char __user *)buf, count, ppos);
+}
+
+static int proc_page_idle_open(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm;
+
+	mm = proc_mem_open(inode, PTRACE_MODE_READ);
+	if (IS_ERR(mm))
+		return PTR_ERR(mm);
+	file->private_data = mm;
+	return 0;
+}
+
+static int proc_page_idle_release(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm = file->private_data;
+
+	if (mm)
+		mmdrop(mm);
+	return 0;
+}
+
+const struct file_operations proc_page_idle_operations = {
+	.llseek		= mem_lseek, /* borrow this */
+	.read		= proc_page_idle_read,
+	.write		= proc_page_idle_write,
+	.open		= proc_page_idle_open,
+	.release	= proc_page_idle_release,
+};
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 75d9d68a6de7..6d51d0a355a7 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -712,6 +712,12 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif
 
+#ifndef CONFIG_HAVE_ARCH_PTE_SWP_PGIDLE
+static inline pte_t pte_swp_mkpage_idle(pte_t pte) { return pte; }
+static inline int pte_swp_page_idle(pte_t pte) { return 0; }
+static inline pte_t pte_swp_clear_mkpage_idle(pte_t pte) { return pte; }
+#endif
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 #ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f1bc2640d85e 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
 }
 #endif /* CONFIG_64BIT */
 
+ssize_t page_idle_proc_write(struct file *file,
+	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
+ssize_t page_idle_proc_read(struct file *file,
+	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
 #else /* !CONFIG_IDLE_PAGE_TRACKING */
 
 static inline bool page_is_young(struct page *page)
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 295512465065..a5b00d63216c 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -5,17 +5,22 @@
 #include <linux/sysfs.h>
 #include <linux/kobject.h>
 #include <linux/mm.h>
-#include <linux/mmzone.h>
-#include <linux/pagemap.h>
-#include <linux/rmap.h>
 #include <linux/mmu_notifier.h>
+#include <linux/mmzone.h>
 #include <linux/page_ext.h>
 #include <linux/page_idle.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/sched/mm.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #define BITMAP_CHUNK_SIZE	sizeof(u64)
 #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
 
 /*
+ * Get a reference to a page for idle tracking purposes, with additional checks.
+ *
  * Idle page tracking only considers user memory pages, for other types of
  * pages the idle flag is always unset and an attempt to set it is silently
  * ignored.
@@ -25,18 +30,13 @@
  * page tracking. With such an indicator of user pages we can skip isolated
  * pages, but since there are not usually many of them, it will hardly affect
  * the overall result.
- *
- * This function tries to get a user memory page by pfn as described above.
  */
-static struct page *page_idle_get_page(unsigned long pfn)
+static struct page *page_idle_get_page(struct page *page_in)
 {
 	struct page *page;
 	pg_data_t *pgdat;
 
-	if (!pfn_valid(pfn))
-		return NULL;
-
-	page = pfn_to_page(pfn);
+	page = page_in;
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;
@@ -51,6 +51,18 @@ static struct page *page_idle_get_page(unsigned long pfn)
 	return page;
 }
 
+/*
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *page_idle_get_page_pfn(unsigned long pfn)
+{
+
+	if (!pfn_valid(pfn))
+		return NULL;
+
+	return page_idle_get_page(pfn_to_page(pfn));
+}
+
 static bool page_idle_clear_pte_refs_one(struct page *page,
 					struct vm_area_struct *vma,
 					unsigned long addr, void *arg)
@@ -118,6 +130,47 @@ static void page_idle_clear_pte_refs(struct page *page)
 		unlock_page(page);
 }
 
+/* Helper to get the start and end frame given a pos and count */
+static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long max_frame;
+
+	/* If an mm is not given, assume we want physical frames */
+	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
+
+	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
+		return -EINVAL;
+
+	*start = pos * BITS_PER_BYTE;
+	if (*start >= max_frame)
+		return -ENXIO;
+
+	*end = *start + count * BITS_PER_BYTE;
+	if (*end > max_frame)
+		*end = max_frame;
+	return 0;
+}
+
+static bool page_idle_pte_check(struct page *page)
+{
+	if (!page)
+		return false;
+
+	if (page_is_idle(page)) {
+		/*
+		 * The page might have been referenced via a
+		 * pte, in which case it is not idle. Clear
+		 * refs and recheck.
+		 */
+		page_idle_clear_pte_refs(page);
+		if (page_is_idle(page))
+			return true;
+	}
+
+	return false;
+}
+
 static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 				     struct bin_attribute *attr, char *buf,
 				     loff_t pos, size_t count)
@@ -125,35 +178,21 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 	u64 *out = (u64 *)buf;
 	struct page *page;
 	unsigned long pfn, end_pfn;
-	int bit;
+	int bit, ret;
 
-	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
-		return -EINVAL;
-
-	pfn = pos * BITS_PER_BYTE;
-	if (pfn >= max_pfn)
-		return 0;
-
-	end_pfn = pfn + count * BITS_PER_BYTE;
-	if (end_pfn > max_pfn)
-		end_pfn = max_pfn;
+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret == -ENXIO)
+		return 0;  /* Reads beyond max_pfn do nothing */
+	else if (ret)
+		return ret;
 
 	for (; pfn < end_pfn; pfn++) {
 		bit = pfn % BITMAP_CHUNK_BITS;
 		if (!bit)
 			*out = 0ULL;
-		page = page_idle_get_page(pfn);
-		if (page) {
-			if (page_is_idle(page)) {
-				/*
-				 * The page might have been referenced via a
-				 * pte, in which case it is not idle. Clear
-				 * refs and recheck.
-				 */
-				page_idle_clear_pte_refs(page);
-				if (page_is_idle(page))
-					*out |= 1ULL << bit;
-			}
+		page = page_idle_get_page_pfn(pfn);
+		if (page && page_idle_pte_check(page)) {
+			*out |= 1ULL << bit;
 			put_page(page);
 		}
 		if (bit == BITMAP_CHUNK_BITS - 1)
@@ -170,23 +209,16 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
 	const u64 *in = (u64 *)buf;
 	struct page *page;
 	unsigned long pfn, end_pfn;
-	int bit;
-
-	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
-		return -EINVAL;
+	int bit, ret;
 
-	pfn = pos * BITS_PER_BYTE;
-	if (pfn >= max_pfn)
-		return -ENXIO;
-
-	end_pfn = pfn + count * BITS_PER_BYTE;
-	if (end_pfn > max_pfn)
-		end_pfn = max_pfn;
+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret)
+		return ret;
 
 	for (; pfn < end_pfn; pfn++) {
 		bit = pfn % BITMAP_CHUNK_BITS;
 		if ((*in >> bit) & 1) {
-			page = page_idle_get_page(pfn);
+			page = page_idle_get_page_pfn(pfn);
 			if (page) {
 				page_idle_clear_pte_refs(page);
 				set_page_idle(page);
@@ -224,6 +256,243 @@ struct page_ext_operations page_idle_ops = {
 };
 #endif
 
+/*  page_idle tracking for /proc/<pid>/page_idle */
+
+struct page_node {
+	struct page *page;
+	unsigned long addr;
+	struct list_head list;
+};
+
+struct page_idle_proc_priv {
+	unsigned long start_addr;
+	char *buffer;
+	int write;
+
+	/* Pre-allocate and provide nodes to pte_page_idle_proc_add() */
+	struct page_node *page_nodes;
+	int cur_page_node;
+	struct list_head *idle_page_list;
+};
+
+/*
+ * Set a page as idle or add it to a list to be set as idle later.
+ */
+static void pte_page_idle_proc_add(struct page *page,
+			       unsigned long addr, struct mm_walk *walk)
+{
+	struct page *page_get = NULL;
+	struct page_node *pn;
+	int bit;
+	unsigned long frames;
+	struct page_idle_proc_priv *priv = walk->private;
+	u64 *chunk = (u64 *)priv->buffer;
+
+	if (priv->write) {
+		VM_BUG_ON(!page);
+
+		/* Find whether this page was asked to be marked */
+		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
+		bit = frames % BITMAP_CHUNK_BITS;
+		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
+		if (((*chunk >> bit) & 1) == 0)
+			return;
+	}
+
+	if (page) {
+		page_get = page_idle_get_page(page);
+		if (!page_get)
+			return;
+	} else {
+		/* For swapped pages, set output bit as idle */
+		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
+		bit = frames % BITMAP_CHUNK_BITS;
+		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
+		*chunk |= (1 << bit);
+		return;
+	}
+
+	/*
+	 * For all other pages, add it to a list since we have to walk rmap,
+	 * which acquires ptlock, and we cannot walk rmap right now.
+	 */
+	pn = &(priv->page_nodes[priv->cur_page_node++]);
+	pn->page = page_get;
+	pn->addr = addr;
+	list_add(&pn->list, priv->idle_page_list);
+}
+
+static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	pte_t *pte;
+	spinlock_t *ptl;
+	struct page *page;
+	struct vm_area_struct *vma = walk->vma;
+	struct page_idle_proc_priv *priv = walk->private;
+
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		if (pmd_present(*pmd)) {
+			page = follow_trans_huge_pmd(vma, addr, pmd,
+						     FOLL_DUMP|FOLL_WRITE);
+			if (!IS_ERR_OR_NULL(page))
+				pte_page_idle_proc_add(page, addr, walk);
+		}
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		/* For swap_pte handling, we use an idle bit in the swap pte. */
+		if (is_swap_pte(*pte)) {
+			if (priv->write) {
+				set_pte_at(walk->mm, addr, pte,
+					   pte_swp_mkpage_idle(*pte));
+			} else {
+				/* If swap pte has idle bit set, report it as idle */
+				if (pte_swp_page_idle(*pte))
+					pte_page_idle_proc_add(NULL, addr, walk);
+			}
+			continue;
+		}
+
+		if (!pte_present(*pte))
+			continue;
+
+		page = vm_normal_page(vma, addr, *pte);
+		if (page)
+			pte_page_idle_proc_add(page, addr, walk);
+	}
+
+	pte_unmap_unlock(pte - 1, ptl);
+	return 0;
+}
+
+ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
+			       size_t count, loff_t *pos, int write)
+{
+	int ret;
+	char *buffer;
+	u64 *out;
+	unsigned long start_addr, end_addr, start_frame, end_frame;
+	struct mm_struct *mm = file->private_data;
+	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
+	struct page_node *cur;
+	struct page_idle_proc_priv priv;
+	bool walk_error = false;
+	LIST_HEAD(idle_page_list);
+
+	if (!mm || !mmget_not_zero(mm))
+		return -EINVAL;
+
+	if (count > PAGE_SIZE)
+		count = PAGE_SIZE;
+
+	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buffer) {
+		ret = -ENOMEM;
+		goto out_mmput;
+	}
+	out = (u64 *)buffer;
+
+	if (write && copy_from_user(buffer, ubuff, count)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
+	if (ret)
+		goto out;
+
+	start_addr = (start_frame << PAGE_SHIFT);
+	end_addr = (end_frame << PAGE_SHIFT);
+	priv.buffer = buffer;
+	priv.start_addr = start_addr;
+	priv.write = write;
+
+	priv.idle_page_list = &idle_page_list;
+	priv.cur_page_node = 0;
+	priv.page_nodes = kzalloc(sizeof(struct page_node) *
+				  (end_frame - start_frame), GFP_KERNEL);
+	if (!priv.page_nodes) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	walk.private = &priv;
+	walk.mm = mm;
+
+	down_read(&mm->mmap_sem);
+
+	/*
+	 * idle_page_list is needed because walk_page_vma() holds ptlock which
+	 * deadlocks with page_idle_clear_pte_refs(). So we have to collect all
+	 * pages first, and then call page_idle_clear_pte_refs().
+	 */
+	ret = walk_page_range(start_addr, end_addr, &walk);
+	if (ret)
+		walk_error = true;
+
+	list_for_each_entry(cur, &idle_page_list, list) {
+		int bit, index;
+		unsigned long off;
+		struct page *page = cur->page;
+
+		if (unlikely(walk_error))
+			goto remove_page;
+
+		if (write) {
+			if (page) {
+				page_idle_clear_pte_refs(page);
+				set_page_idle(page);
+			}
+		} else {
+			/* If page is NULL, it was swapped out */
+			if (!page || page_idle_pte_check(page)) {
+				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
+				bit = off % BITMAP_CHUNK_BITS;
+				index = off / BITMAP_CHUNK_BITS;
+				out[index] |= 1ULL << bit;
+			}
+		}
+remove_page:
+		if (page)
+			put_page(page);
+	}
+
+	if (!write && !walk_error)
+		ret = copy_to_user(ubuff, buffer, count);
+
+	up_read(&mm->mmap_sem);
+	kfree(priv.page_nodes);
+out:
+	kfree(buffer);
+out_mmput:
+	mmput(mm);
+	if (!ret)
+		ret = count;
+	return ret;
+
+}
+
+ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
+			    size_t count, loff_t *pos)
+{
+	return page_idle_proc_generic(file, ubuff, count, pos, 0);
+}
+
+ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
+			     size_t count, loff_t *pos, struct mm_struct *mm)
+{
+	return page_idle_proc_generic(file, ubuff, count, pos, 1);
+}
+
 static int __init page_idle_init(void)
 {
 	int err;
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..4bd618aab402 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1629,6 +1629,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (page_is_idle(page))
+				swp_pte = pte_swp_mkpage_idle(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
 			/* Invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
-- 
2.22.0.770.g0f2c4a37fd-goog

^ permalink raw reply related

* Re: [PATCH 1/9] KVM: arm64: Document PV-time interface
From: Christophe de Dinechin @ 2019-08-05 16:40 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Marc Zyngier, Paolo Bonzini,
	Radim Krčmář, Russell King, Will Deacon,
	James Morse, Julien Thierry, Suzuki K Pouloze, kvm, kvmarm,
	linux-arm-kernel, linux-doc, linux-kernel
In-Reply-To: <20190802145017.42543-2-steven.price@arm.com>


Steven Price writes:

> Introduce a paravirtualization interface for KVM/arm64 based on the
> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>
> This only adds the details about "Stolen Time" as the details of "Live
> Physical Time" have not been fully agreed.
>
[...]

> +
> +Stolen Time
> +-----------
> +
> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
> +
> +  Field       | Byte Length | Byte Offset | Description
> +  ----------- | ----------- | ----------- | --------------------------
> +  Revision    |      4      |      0      | Must be 0 for version 0.1
> +  Attributes  |      4      |      4      | Must be 0
> +  Stolen time |      8      |      8      | Stolen time in unsigned
> +              |             |             | nanoseconds indicating how
> +              |             |             | much time this VCPU thread
> +              |             |             | was involuntarily not
> +              |             |             | running on a physical CPU.

I know very little about the topic, but I don't understand how the spec
as proposed allows an accurate reading of the relation between physical
time and stolen time simultaneously. In other words, could you draw
Figure 1 of the spec from within the guest? Or is it a non-objective?

For example, if you read the stolen time before you read CNTVCT_EL0,
isn't it possible for a lengthy event like a migration to occur between
the two reads, causing the stolen time to be obsolete and off by seconds?

--
Cheers,
Christophe de Dinechin (IRC c3d)

^ permalink raw reply

* [PATCH] kernel-doc: ignore __printf attribute
From: Randy Dunlap @ 2019-08-05 16:29 UTC (permalink / raw)
  To: linux-doc@vger.kernel.org, Jonathan Corbet; +Cc: LKML, Brendan Higgins

From: Randy Dunlap <rdunlap@infradead.org>

Ignore __printf() function attributes just as other __attribute__
strings are ignored.

Fixes this kernel-doc warning message:
include/kunit/kunit-stream.h:58: warning: Function parameter or member '2' not described in '__printf'

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Tested-by: Brendan Higgins <brendanhiggins@google.com>
---
 scripts/kernel-doc |    1 +
 1 file changed, 1 insertion(+)

--- linux-next-20190805.orig/scripts/kernel-doc
+++ linux-next-20190805/scripts/kernel-doc
@@ -1580,6 +1580,7 @@ sub dump_function($$) {
     $prototype =~ s/__must_check +//;
     $prototype =~ s/__weak +//;
     $prototype =~ s/__sched +//;
+    $prototype =~ s/__printf\s*\(\s*\d*\s*,\s*\d*\s*\) +//;
     my $define = $prototype =~ s/^#\s*define\s+//; #ak added
     $prototype =~ s/__attribute__\s*\(\(
             (?:



^ permalink raw reply

* Re: [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space
From: Marc Zyngier @ 2019-08-05 16:28 UTC (permalink / raw)
  To: Steven Price
  Cc: kvm, Radim Krčmář, Catalin Marinas,
	Suzuki K Pouloze, linux-doc, Russell King, linux-kernel,
	James Morse, linux-arm-kernel, Paolo Bonzini, Will Deacon, kvmarm,
	Julien Thierry
In-Reply-To: <1a7d5be6-184b-0c78-61a3-b01730cb5df9@arm.com>

On 05/08/2019 17:10, Steven Price wrote:
> On 03/08/2019 13:51, Marc Zyngier wrote:
>> On Fri,  2 Aug 2019 15:50:14 +0100
>> Steven Price <steven.price@arm.com> wrote:
>>
>>> Allow user space to inform the KVM host where in the physical memory
>>> map the paravirtualized time structures should be located.
>>>
>>> A device is created which provides the base address of an array of
>>> Stolen Time (ST) structures, one for each VCPU. There must be (64 *
>>> total number of VCPUs) bytes of memory available at this location.
>>>
>>> The address is given in terms of the physical address visible to
>>> the guest and must be 64 byte aligned. The memory should be marked as
>>> reserved to the guest to stop it allocating it for other purposes.
>>
>> Why? You seem to be allocating the memory from the kernel, so as far as
>> the guest is concerned, this isn't generally usable memory.
> 
> I obviously didn't word it very well - that's what I meant. The "memory"
> that represents the stolen time structure shouldn't be shown to the
> guest as normal memory, but "reserved" for the purpose of stolen time.
> 
> To be honest it looks like I forgot to rewrite this commit message -
> which 64 byte alignment is all that the guest can rely on (because each
> vCPU has it's own structure), the actual array of structures needs to be
> page aligned to ensure we can safely map it into the guest.
> 
>>>
>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>> ---
>>>  arch/arm64/include/asm/kvm_mmu.h  |   2 +
>>>  arch/arm64/include/uapi/asm/kvm.h |   6 +
>>>  arch/arm64/kvm/Makefile           |   1 +
>>>  include/uapi/linux/kvm.h          |   2 +
>>>  virt/kvm/arm/mmu.c                |  44 +++++++
>>>  virt/kvm/arm/pvtime.c             | 190 ++++++++++++++++++++++++++++++
>>>  6 files changed, 245 insertions(+)
>>>  create mode 100644 virt/kvm/arm/pvtime.c

[...]

>>> +static int kvm_arm_pvtime_set_attr(struct kvm_device *dev,
>>> +				   struct kvm_device_attr *attr)
>>> +{
>>> +	struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
>>> +	u64 __user *user = (u64 __user *)attr->addr;
>>> +	u64 paddr;
>>> +	int ret;
>>> +
>>> +	switch (attr->group) {
>>> +	case KVM_DEV_ARM_PV_TIME_PADDR:
>>> +		if (get_user(paddr, user))
>>> +			return -EFAULT;
>>> +		if (paddr & 63)
>>> +			return -EINVAL;
>>
>> You should check whether the device fits into the IPA space for this
>> guest, and whether it overlaps with anything else.
> 
> pvtime_map_pages() should fail in the case of overlap. That seems
> sufficient to me - do you think we need something stronger?

Definitely. stage2_set_pte() won't fail for a non-IO overlapping
mapping, and will just treat it as guest memory. If this overlaps with a
memslot, we'll never be able to fault that page in, ending up with
interesting memory corruption... :-/

That's one of the reasons why I think option (2) in your earlier email
is an interesting one, as it sidesteps a whole lot of ugly and hard to
test corner cases.

Thanks,

	M.
-- 
Jazz is not dead, it just smells funny...

^ permalink raw reply

* Re: [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space
From: Steven Price @ 2019-08-05 16:10 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, Radim Krčmář, Catalin Marinas,
	Suzuki K Pouloze, linux-doc, Russell King, linux-kernel,
	James Morse, linux-arm-kernel, Paolo Bonzini, Will Deacon, kvmarm,
	Julien Thierry
In-Reply-To: <20190803135113.6cdf500c@why>

On 03/08/2019 13:51, Marc Zyngier wrote:
> On Fri,  2 Aug 2019 15:50:14 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> Allow user space to inform the KVM host where in the physical memory
>> map the paravirtualized time structures should be located.
>>
>> A device is created which provides the base address of an array of
>> Stolen Time (ST) structures, one for each VCPU. There must be (64 *
>> total number of VCPUs) bytes of memory available at this location.
>>
>> The address is given in terms of the physical address visible to
>> the guest and must be 64 byte aligned. The memory should be marked as
>> reserved to the guest to stop it allocating it for other purposes.
> 
> Why? You seem to be allocating the memory from the kernel, so as far as
> the guest is concerned, this isn't generally usable memory.

I obviously didn't word it very well - that's what I meant. The "memory"
that represents the stolen time structure shouldn't be shown to the
guest as normal memory, but "reserved" for the purpose of stolen time.

To be honest it looks like I forgot to rewrite this commit message -
which 64 byte alignment is all that the guest can rely on (because each
vCPU has it's own structure), the actual array of structures needs to be
page aligned to ensure we can safely map it into the guest.

>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  arch/arm64/include/asm/kvm_mmu.h  |   2 +
>>  arch/arm64/include/uapi/asm/kvm.h |   6 +
>>  arch/arm64/kvm/Makefile           |   1 +
>>  include/uapi/linux/kvm.h          |   2 +
>>  virt/kvm/arm/mmu.c                |  44 +++++++
>>  virt/kvm/arm/pvtime.c             | 190 ++++++++++++++++++++++++++++++
>>  6 files changed, 245 insertions(+)
>>  create mode 100644 virt/kvm/arm/pvtime.c
>>
>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index befe37d4bc0e..88c8a4b2836f 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -157,6 +157,8 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm);
>>  void kvm_free_stage2_pgd(struct kvm *kvm);
>>  int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>>  			  phys_addr_t pa, unsigned long size, bool writable);
>> +int kvm_phys_addr_memremap(struct kvm *kvm, phys_addr_t guest_ipa,
>> +			  phys_addr_t pa, unsigned long size, bool writable);
>>  
>>  int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run);
>>  
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index 9a507716ae2f..95516a4198ea 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -367,6 +367,12 @@ struct kvm_vcpu_events {
>>  #define KVM_PSCI_RET_INVAL		PSCI_RET_INVALID_PARAMS
>>  #define KVM_PSCI_RET_DENIED		PSCI_RET_DENIED
>>  
>> +/* Device Control API: PV_TIME */
>> +#define KVM_DEV_ARM_PV_TIME_PADDR	0
>> +#define  KVM_DEV_ARM_PV_TIME_ST		0
>> +#define KVM_DEV_ARM_PV_TIME_STATE_SIZE	1
>> +#define KVM_DEV_ARM_PV_TIME_STATE	2
>> +
>>  #endif
>>  
>>  #endif /* __ARM_KVM_H__ */
>> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
>> index 73dce4d47d47..5ffbdc39e780 100644
>> --- a/arch/arm64/kvm/Makefile
>> +++ b/arch/arm64/kvm/Makefile
>> @@ -14,6 +14,7 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/e
>>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arm.o $(KVM)/arm/mmu.o $(KVM)/arm/mmio.o
>>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/psci.o $(KVM)/arm/perf.o
>>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hypercalls.o
>> +kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/pvtime.o
>>  
>>  kvm-$(CONFIG_KVM_ARM_HOST) += inject_fault.o regmap.o va_layout.o
>>  kvm-$(CONFIG_KVM_ARM_HOST) += hyp.o hyp-init.o handle_exit.o
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index a7c19540ce21..04bffafa0708 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1222,6 +1222,8 @@ enum kvm_device_type {
>>  #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
>>  	KVM_DEV_TYPE_XIVE,
>>  #define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
>> +	KVM_DEV_TYPE_ARM_PV_TIME,
>> +#define KVM_DEV_TYPE_ARM_PV_TIME	KVM_DEV_TYPE_ARM_PV_TIME
>>  	KVM_DEV_TYPE_MAX,
>>  };
>>  
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 38b4c910b6c3..be28a4aee451 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1368,6 +1368,50 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>>  	return ret;
>>  }
>>  
>> +/**
>> + * kvm_phys_addr_memremap - map a memory range to guest IPA
>> + *
>> + * @kvm:	The KVM pointer
>> + * @guest_ipa:	The IPA at which to insert the mapping
>> + * @pa:		The physical address of the memory
>> + * @size:	The size of the mapping
>> + */
>> +int kvm_phys_addr_memremap(struct kvm *kvm, phys_addr_t guest_ipa,
>> +			  phys_addr_t pa, unsigned long size, bool writable)
>> +{
>> +	phys_addr_t addr, end;
>> +	int ret = 0;
>> +	unsigned long pfn;
>> +	struct kvm_mmu_memory_cache cache = { 0, };
>> +
>> +	end = (guest_ipa + size + PAGE_SIZE - 1) & PAGE_MASK;
>> +	pfn = __phys_to_pfn(pa);
>> +
>> +	for (addr = guest_ipa; addr < end; addr += PAGE_SIZE) {
>> +		pte_t pte = pfn_pte(pfn, PAGE_S2);
>> +
>> +		if (writable)
>> +			pte = kvm_s2pte_mkwrite(pte);
>> +
>> +		ret = mmu_topup_memory_cache(&cache,
>> +					     kvm_mmu_cache_min_pages(kvm),
>> +					     KVM_NR_MEM_OBJS);
>> +		if (ret)
>> +			goto out;
>> +		spin_lock(&kvm->mmu_lock);
>> +		ret = stage2_set_pte(kvm, &cache, addr, &pte, 0);
>> +		spin_unlock(&kvm->mmu_lock);
>> +		if (ret)
>> +			goto out;
>> +
>> +		pfn++;
>> +	}
>> +
>> +out:
>> +	mmu_free_memory_cache(&cache);
>> +	return ret;
>> +}
> 
> This is an exact copy of kvm_phys_addr_ioremap(), with only the memory
> attributes changing. Surely we can have a shared implementation that
> takes the memory attribute as a parameter.

Good point - although due to below I'm going to need something which can
deal with non-contiguous memory...

>> +
>>  static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
>>  {
>>  	kvm_pfn_t pfn = *pfnp;
>> diff --git a/virt/kvm/arm/pvtime.c b/virt/kvm/arm/pvtime.c
>> new file mode 100644
>> index 000000000000..9051bc07eae1
>> --- /dev/null
>> +++ b/virt/kvm/arm/pvtime.c
>> @@ -0,0 +1,190 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright (C) 2019 Arm Ltd.
>> +
>> +#include <linux/kvm_host.h>
>> +#include <asm/kvm_mmu.h>
>> +
>> +/* We currently only support PV time on ARM64 */
>> +#ifdef CONFIG_ARM64
> 
> And we're only compiling it on arm64, so why the #ifdef?

Another good point - will remove.

>> +
>> +#include <asm/pvclock-abi.h>
>> +
>> +static int max_stolen_size(void)
>> +{
>> +	int size = KVM_MAX_VCPUS * sizeof(struct pvclock_vcpu_stolen_time_info);
> 
> So we're always allocating enough memory for 512 CPUs? That's an
> additional 32kB of contiguous memory...
> 
>> +
>> +	return ALIGN(size, PAGE_SIZE);
>> +}
>> +
>> +static int kvm_arm_pvtime_create(struct kvm_device *dev, u32 type)
>> +{
>> +	struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
>> +
>> +	pvtime->st = alloc_pages_exact(max_stolen_size(),
>> +				       GFP_KERNEL | __GFP_ZERO);
>> +	if (!pvtime->st)
>> +		return -ENOMEM;
> 
> Is there any chance we could use a vmalloc allocation instead? This
> would lift the requirement on having physically contiguous memory.

Yes I think it should be possible to use non-contiguous memory and vmalloc.

>> +
>> +	return 0;
>> +}
>> +
>> +static void kvm_arm_pvtime_destroy(struct kvm_device *dev)
>> +{
>> +	struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
>> +
>> +	pvtime->st_base = GPA_INVALID;
>> +	free_pages_exact(pvtime->st, max_stolen_size());
>> +	kfree(dev);
>> +}
>> +
>> +static int pvtime_map_pages(struct kvm *kvm, gpa_t guest_paddr,
>> +			    void *kaddr, int size)
>> +{
>> +	return kvm_phys_addr_memremap(kvm, guest_paddr,
>> +			virt_to_phys(kaddr),
>> +			size, false);
>> +}
>> +
>> +static int pvtime_save_state(struct kvm *kvm, u64 type, void __user *user)
>> +{
>> +	void *source;
>> +	size_t size;
>> +
>> +	switch (type) {
>> +	case KVM_DEV_ARM_PV_TIME_ST:
>> +		source = kvm->arch.pvtime.st;
>> +		size = sizeof(struct pvclock_vcpu_stolen_time_info) *
>> +			atomic_read(&kvm->online_vcpus);
>> +		break;
>> +	default:
>> +		return -ENXIO;
>> +	}
>> +
>> +	if (copy_to_user(user, source, size))
>> +		return -EFAULT;
>> +	return 0;
>> +}
>> +
>> +static int pvtime_restore_state(struct kvm *kvm, u64 type, void __user *user)
>> +{
>> +	void *dest;
>> +	size_t size;
>> +
>> +	switch (type) {
>> +	case KVM_DEV_ARM_PV_TIME_ST:
>> +		dest = kvm->arch.pvtime.st;
>> +		size = sizeof(struct pvclock_vcpu_stolen_time_info) *
>> +			atomic_read(&kvm->online_vcpus);
>> +		break;
>> +	default:
>> +		return -ENXIO;
>> +	}
>> +
>> +	if (copy_from_user(dest, user, size))
>> +		return -EFAULT;
>> +
>> +	return 0;
>> +}
>> +
>> +static int kvm_arm_pvtime_set_attr(struct kvm_device *dev,
>> +				   struct kvm_device_attr *attr)
>> +{
>> +	struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
>> +	u64 __user *user = (u64 __user *)attr->addr;
>> +	u64 paddr;
>> +	int ret;
>> +
>> +	switch (attr->group) {
>> +	case KVM_DEV_ARM_PV_TIME_PADDR:
>> +		if (get_user(paddr, user))
>> +			return -EFAULT;
>> +		if (paddr & 63)
>> +			return -EINVAL;
> 
> You should check whether the device fits into the IPA space for this
> guest, and whether it overlaps with anything else.

pvtime_map_pages() should fail in the case of overlap. That seems
sufficient to me - do you think we need something stronger?

>> +		switch (attr->attr) {
>> +		case KVM_DEV_ARM_PV_TIME_ST:
>> +			if (pvtime->st_base != GPA_INVALID)
>> +				return -EEXIST;
>> +			ret = pvtime_map_pages(dev->kvm, paddr, pvtime->st,
>> +					max_stolen_size());
> 
> Consider moving the size directly into pvtime_map_pages(), and dropping
> the pvtime->st parameter. All you need is kvm and paddr.

Will do.

Thanks,

Steve

^ permalink raw reply

* Re: [PATCH v4 11/12] fpga: dfl: fme: add global error reporting support
From: Greg KH @ 2019-08-05 15:56 UTC (permalink / raw)
  To: Wu Hao
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Luwei Kang, Ananda Ravuri, Xu Yilun
In-Reply-To: <1564914022-3710-12-git-send-email-hao.wu@intel.com>

On Sun, Aug 04, 2019 at 06:20:21PM +0800, Wu Hao wrote:
> +static int fme_global_err_init(struct platform_device *pdev,
> +			       struct dfl_feature *feature)
> +{
> +	struct device *dev;
> +	int ret = 0;
> +
> +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +	if (!dev)
> +		return -ENOMEM;
> +
> +	dev->parent = &pdev->dev;
> +	dev->release = err_dev_release;
> +	dev_set_name(dev, "errors");
> +
> +	fme_error_enable(feature);
> +
> +	ret = device_register(dev);
> +	if (ret) {
> +		put_device(dev);
> +		return ret;
> +	}
> +
> +	ret = device_add_groups(dev, error_groups);

cute, but no, you do not create a whole struct device for a subdir.  Use
the attribute group name like you did on earlier patches.

And again, you raced userspace and lost :(

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v4 07/12] fpga: dfl: afu: add error reporting support.
From: Greg KH @ 2019-08-05 15:54 UTC (permalink / raw)
  To: Wu Hao; +Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Xu Yilun
In-Reply-To: <1564914022-3710-8-git-send-email-hao.wu@intel.com>

On Sun, Aug 04, 2019 at 06:20:17PM +0800, Wu Hao wrote:
> Error reporting is one important private feature, it reports error
> detected on port and accelerated function unit (AFU). It introduces
> several sysfs interfaces to allow userspace to check and clear
> errors detected by hardware.
> 
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> Acked-by: Alan Tull <atull@kernel.org>
> Signed-off-by: Moritz Fischer <mdf@kernel.org>
> ---
> v2: switch to device_add/remove_group for sysfs.
> v3: update kernel version and date in sysfs doc
> v4: remove dev_dbg in init/uinit callback function.
> ---
>  Documentation/ABI/testing/sysfs-platform-dfl-port |  39 ++++
>  drivers/fpga/Makefile                             |   1 +
>  drivers/fpga/dfl-afu-error.c                      | 221 ++++++++++++++++++++++
>  drivers/fpga/dfl-afu-main.c                       |   4 +
>  drivers/fpga/dfl-afu.h                            |   4 +
>  5 files changed, 269 insertions(+)
>  create mode 100644 drivers/fpga/dfl-afu-error.c
> 
> diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-port b/Documentation/ABI/testing/sysfs-platform-dfl-port
> index 5663441..3b6580b 100644
> --- a/Documentation/ABI/testing/sysfs-platform-dfl-port
> +++ b/Documentation/ABI/testing/sysfs-platform-dfl-port
> @@ -81,3 +81,42 @@ KernelVersion:	5.4
>  Contact:	Wu Hao <hao.wu@intel.com>
>  Description:	Read-only. Read this file to get the status of issued command
>  		to userclck_freqcntrcmd.
> +
> +What:		/sys/bus/platform/devices/dfl-port.0/errors/revision
> +Date:		August 2019
> +KernelVersion:	5.4
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-only. Read this file to get the revision of this error
> +		reporting private feature.

Same revision question here that I had on an earlier patch.


> +
> +What:		/sys/bus/platform/devices/dfl-port.0/errors/errors
> +Date:		August 2019
> +KernelVersion:	5.4
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-only. Read this file to get errors detected on port and
> +		Accelerated Function Unit (AFU).
> +
> +What:		/sys/bus/platform/devices/dfl-port.0/errors/first_error
> +Date:		August 2019
> +KernelVersion:	5.4
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-only. Read this file to get the first error detected by
> +		hardware.
> +
> +What:		/sys/bus/platform/devices/dfl-port.0/errors/first_malformed_req
> +Date:		August 2019
> +KernelVersion:	5.4
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-only. Read this file to get the first malformed request
> +		captured by hardware.
> +
> +What:		/sys/bus/platform/devices/dfl-port.0/errors/clear
> +Date:		August 2019
> +KernelVersion:	5.4
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Write-only. Write error code to this file to clear errors.
> +		Write fails with -EINVAL if input parsing fails or input error
> +		code doesn't match.
> +		Write fails with -EBUSY or -ETIMEDOUT if error can't be cleared
> +		as hardware is in low power state (-EBUSY) or not responding
> +		(-ETIMEDOUT).
> diff --git a/drivers/fpga/Makefile b/drivers/fpga/Makefile
> index 312b937..7255891 100644
> --- a/drivers/fpga/Makefile
> +++ b/drivers/fpga/Makefile
> @@ -41,6 +41,7 @@ obj-$(CONFIG_FPGA_DFL_AFU)		+= dfl-afu.o
>  
>  dfl-fme-objs := dfl-fme-main.o dfl-fme-pr.o
>  dfl-afu-objs := dfl-afu-main.o dfl-afu-region.o dfl-afu-dma-region.o
> +dfl-afu-objs += dfl-afu-error.o
>  
>  # Drivers for FPGAs which implement DFL
>  obj-$(CONFIG_FPGA_DFL_PCI)		+= dfl-pci.o
> diff --git a/drivers/fpga/dfl-afu-error.c b/drivers/fpga/dfl-afu-error.c
> new file mode 100644
> index 0000000..c5e0efa
> --- /dev/null
> +++ b/drivers/fpga/dfl-afu-error.c
> @@ -0,0 +1,221 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Driver for FPGA Accelerated Function Unit (AFU) Error Reporting
> + *
> + * Copyright 2019 Intel Corporation, Inc.
> + *
> + * Authors:
> + *   Wu Hao <hao.wu@linux.intel.com>
> + *   Xiao Guangrong <guangrong.xiao@linux.intel.com>
> + *   Joseph Grecco <joe.grecco@intel.com>
> + *   Enno Luebbers <enno.luebbers@intel.com>
> + *   Tim Whisonant <tim.whisonant@intel.com>
> + *   Ananda Ravuri <ananda.ravuri@intel.com>
> + *   Mitchel Henry <henry.mitchel@intel.com>
> + */
> +
> +#include <linux/uaccess.h>
> +
> +#include "dfl-afu.h"
> +
> +#define PORT_ERROR_MASK		0x8
> +#define PORT_ERROR		0x10
> +#define PORT_FIRST_ERROR	0x18
> +#define PORT_MALFORMED_REQ0	0x20
> +#define PORT_MALFORMED_REQ1	0x28
> +
> +#define ERROR_MASK		GENMASK_ULL(63, 0)
> +
> +/* mask or unmask port errors by the error mask register. */
> +static void __port_err_mask(struct device *dev, bool mask)
> +{
> +	void __iomem *base;
> +
> +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> +
> +	writeq(mask ? ERROR_MASK : 0, base + PORT_ERROR_MASK);
> +}
> +
> +/* clear port errors. */
> +static int __port_err_clear(struct device *dev, u64 err)
> +{
> +	struct platform_device *pdev = to_platform_device(dev);
> +	void __iomem *base_err, *base_hdr;
> +	int ret;
> +	u64 v;
> +
> +	base_err = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> +	base_hdr = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> +
> +	/*
> +	 * clear Port Errors
> +	 *
> +	 * - Check for AP6 State
> +	 * - Halt Port by keeping Port in reset
> +	 * - Set PORT Error mask to all 1 to mask errors
> +	 * - Clear all errors
> +	 * - Set Port mask to all 0 to enable errors
> +	 * - All errors start capturing new errors
> +	 * - Enable Port by pulling the port out of reset
> +	 */
> +
> +	/* if device is still in AP6 power state, can not clear any error. */
> +	v = readq(base_hdr + PORT_HDR_STS);
> +	if (FIELD_GET(PORT_STS_PWR_STATE, v) == PORT_STS_PWR_STATE_AP6) {
> +		dev_err(dev, "Could not clear errors, device in AP6 state.\n");
> +		return -EBUSY;
> +	}
> +
> +	/* Halt Port by keeping Port in reset */
> +	ret = __port_disable(pdev);
> +	if (ret)
> +		return ret;
> +
> +	/* Mask all errors */
> +	__port_err_mask(dev, true);
> +
> +	/* Clear errors if err input matches with current port errors.*/
> +	v = readq(base_err + PORT_ERROR);
> +
> +	if (v == err) {
> +		writeq(v, base_err + PORT_ERROR);
> +
> +		v = readq(base_err + PORT_FIRST_ERROR);
> +		writeq(v, base_err + PORT_FIRST_ERROR);
> +	} else {
> +		ret = -EINVAL;
> +	}
> +
> +	/* Clear mask */
> +	__port_err_mask(dev, false);
> +
> +	/* Enable the Port by clear the reset */
> +	__port_enable(pdev);
> +
> +	return ret;
> +}
> +
> +static ssize_t revision_show(struct device *dev, struct device_attribute *attr,
> +			     char *buf)
> +{
> +	void __iomem *base;
> +
> +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> +
> +	return sprintf(buf, "%u\n", dfl_feature_revision(base));
> +}
> +static DEVICE_ATTR_RO(revision);
> +
> +static ssize_t errors_show(struct device *dev, struct device_attribute *attr,
> +			   char *buf)
> +{
> +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> +	void __iomem *base;
> +	u64 error;
> +
> +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> +
> +	mutex_lock(&pdata->lock);
> +	error = readq(base + PORT_ERROR);
> +	mutex_unlock(&pdata->lock);
> +
> +	return sprintf(buf, "0x%llx\n", (unsigned long long)error);
> +}
> +static DEVICE_ATTR_RO(errors);
> +
> +static ssize_t first_error_show(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> +	void __iomem *base;
> +	u64 error;
> +
> +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> +
> +	mutex_lock(&pdata->lock);
> +	error = readq(base + PORT_FIRST_ERROR);
> +	mutex_unlock(&pdata->lock);
> +
> +	return sprintf(buf, "0x%llx\n", (unsigned long long)error);
> +}
> +static DEVICE_ATTR_RO(first_error);
> +
> +static ssize_t first_malformed_req_show(struct device *dev,
> +					struct device_attribute *attr,
> +					char *buf)
> +{
> +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> +	void __iomem *base;
> +	u64 req0, req1;
> +
> +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_ERROR);
> +
> +	mutex_lock(&pdata->lock);
> +	req0 = readq(base + PORT_MALFORMED_REQ0);
> +	req1 = readq(base + PORT_MALFORMED_REQ1);
> +	mutex_unlock(&pdata->lock);
> +
> +	return sprintf(buf, "0x%016llx%016llx\n",
> +		       (unsigned long long)req1, (unsigned long long)req0);
> +}
> +static DEVICE_ATTR_RO(first_malformed_req);
> +
> +static ssize_t clear_store(struct device *dev, struct device_attribute *attr,
> +			   const char *buff, size_t count)
> +{
> +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> +	u64 value;
> +	int ret;
> +
> +	if (kstrtou64(buff, 0, &value))
> +		return -EINVAL;
> +
> +	mutex_lock(&pdata->lock);
> +	ret = __port_err_clear(dev, value);
> +	mutex_unlock(&pdata->lock);
> +
> +	return ret ? ret : count;
> +}
> +static DEVICE_ATTR_WO(clear);
> +
> +static struct attribute *port_err_attrs[] = {
> +	&dev_attr_revision.attr,
> +	&dev_attr_errors.attr,
> +	&dev_attr_first_error.attr,
> +	&dev_attr_first_malformed_req.attr,
> +	&dev_attr_clear.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group port_err_attr_group = {
> +	.attrs = port_err_attrs,
> +	.name = "errors",
> +};
> +
> +static int port_err_init(struct platform_device *pdev,
> +			 struct dfl_feature *feature)
> +{
> +	struct dfl_feature_platform_data *pdata = dev_get_platdata(&pdev->dev);
> +
> +	mutex_lock(&pdata->lock);
> +	__port_err_mask(&pdev->dev, false);
> +	mutex_unlock(&pdata->lock);

Locking one data structure and then modifying another one is up there
with "things never to do in the kernel unless you want a huge
headache!".

> +
> +	return device_add_group(&pdev->dev, &port_err_attr_group);

You raced userspace and lost :(

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v4 06/12] fpga: dfl: afu: export __port_enable/disable function.
From: Greg KH @ 2019-08-05 15:52 UTC (permalink / raw)
  To: Wu Hao; +Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Xu Yilun
In-Reply-To: <1564914022-3710-7-git-send-email-hao.wu@intel.com>

On Sun, Aug 04, 2019 at 06:20:16PM +0800, Wu Hao wrote:
> As these two functions are used by other private features. e.g.
> in error reporting private feature, it requires to check port status
> and reset port for error clearing.
> 
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> Acked-by: Moritz Fischer <mdf@kernel.org>
> Acked-by: Alan Tull <atull@kernel.org>
> Signed-off-by: Moritz Fischer <mdf@kernel.org>
> ---
> v2: rebased
> ---
>  drivers/fpga/dfl-afu-main.c | 25 ++++++++++++++-----------
>  drivers/fpga/dfl-afu.h      |  3 +++
>  2 files changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> index e013afb..e312179 100644
> --- a/drivers/fpga/dfl-afu-main.c
> +++ b/drivers/fpga/dfl-afu-main.c
> @@ -22,14 +22,16 @@
>  #include "dfl-afu.h"
>  
>  /**
> - * port_enable - enable a port
> + * __port_enable - enable a port
>   * @pdev: port platform device.
>   *
>   * Enable Port by clear the port soft reset bit, which is set by default.
>   * The AFU is unable to respond to any MMIO access while in reset.
> - * port_enable function should only be used after port_disable function.
> + * __port_enable function should only be used after __port_disable function.
> + *
> + * The caller needs to hold lock for protection.
>   */
> -static void port_enable(struct platform_device *pdev)
> +void __port_enable(struct platform_device *pdev)

worst global function name ever.

Don't polute the global namespace like this for a single driver.  If you
REALLY need it, then use a prefix that shows it is your individual
dfl_special_sauce_platform_device_only type thing.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v4 04/12] fpga: dfl: afu: add userclock sysfs interfaces.
From: Greg KH @ 2019-08-05 15:51 UTC (permalink / raw)
  To: Wu Hao
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Ananda Ravuri, Russ Weight, Xu Yilun
In-Reply-To: <1564914022-3710-5-git-send-email-hao.wu@intel.com>

On Sun, Aug 04, 2019 at 06:20:14PM +0800, Wu Hao wrote:
> This patch introduces userclock sysfs interfaces for AFU, user
> could use these interfaces for clock setting to AFU.
> 
> Please note that, this is only working for port header feature
> with revision 0, for later revisions, userclock setting is moved
> to a separated private feature, so one revision sysfs interface
> is exposed to userspace application for this purpose too.
> 
> Signed-off-by: Ananda Ravuri <ananda.ravuri@intel.com>
> Signed-off-by: Russ Weight <russell.h.weight@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>
> Acked-by: Alan Tull <atull@kernel.org>
> Signed-off-by: Moritz Fischer <mdf@kernel.org>
> ---
> v2: rebased, and switched to use device_add/remove_groups for sysfs
> v3: update kernel version and date in sysfs doc
> v4: rebased.
> ---
>  Documentation/ABI/testing/sysfs-platform-dfl-port |  35 +++++++
>  drivers/fpga/dfl-afu-main.c                       | 114 +++++++++++++++++++++-
>  drivers/fpga/dfl.h                                |   9 ++
>  3 files changed, 157 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-port b/Documentation/ABI/testing/sysfs-platform-dfl-port
> index 1ab3e6f..5663441 100644
> --- a/Documentation/ABI/testing/sysfs-platform-dfl-port
> +++ b/Documentation/ABI/testing/sysfs-platform-dfl-port
> @@ -46,3 +46,38 @@ Contact:	Wu Hao <hao.wu@intel.com>
>  Description:	Read-write. Read or set AFU latency tolerance reporting value.
>  		Set ltr to 1 if the AFU can tolerate latency >= 40us or set it
>  		to 0 if it is latency sensitive.
> +
> +What:		/sys/bus/platform/devices/dfl-port.0/revision
> +Date:		August 2019
> +KernelVersion:	5.4
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-only. Read this file to get the revision of port header
> +		feature.

What does "revision" mean?

It feels like you are creating a different set of sysfs files depending
on the revision field.  Which is fine, sysfs is one-value-per-file and
userspace needs to handle if the file is present or not.  So why not
just rely on that and not have to mess with 'revision' at all?  What is
userspace going to do with that information?

> +
> +What:		/sys/bus/platform/devices/dfl-port.0/userclk_freqcmd
> +Date:		August 2019
> +KernelVersion:	5.4
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Write-only. User writes command to this interface to set
> +		userclock to AFU.
> +
> +What:		/sys/bus/platform/devices/dfl-port.0/userclk_freqsts
> +Date:		August 2019
> +KernelVersion:	5.4
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-only. Read this file to get the status of issued command
> +		to userclck_freqcmd.
> +
> +What:		/sys/bus/platform/devices/dfl-port.0/userclk_freqcntrcmd
> +Date:		August 2019
> +KernelVersion:	5.4
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Write-only. User writes command to this interface to set
> +		userclock counter.
> +
> +What:		/sys/bus/platform/devices/dfl-port.0/userclk_freqcntrsts
> +Date:		August 2019
> +KernelVersion:	5.4
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-only. Read this file to get the status of issued command
> +		to userclck_freqcntrcmd.
> diff --git a/drivers/fpga/dfl-afu-main.c b/drivers/fpga/dfl-afu-main.c
> index 12175bb..407c97d 100644
> --- a/drivers/fpga/dfl-afu-main.c
> +++ b/drivers/fpga/dfl-afu-main.c
> @@ -142,6 +142,17 @@ static int port_get_id(struct platform_device *pdev)
>  static DEVICE_ATTR_RO(id);
>  
>  static ssize_t
> +revision_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +	void __iomem *base;
> +
> +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> +
> +	return sprintf(buf, "%x\n", dfl_feature_revision(base));
> +}
> +static DEVICE_ATTR_RO(revision);
> +
> +static ssize_t
>  ltr_show(struct device *dev, struct device_attribute *attr, char *buf)
>  {
>  	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> @@ -276,6 +287,7 @@ static int port_get_id(struct platform_device *pdev)
>  
>  static struct attribute *port_hdr_attrs[] = {
>  	&dev_attr_id.attr,
> +	&dev_attr_revision.attr,
>  	&dev_attr_ltr.attr,
>  	&dev_attr_ap1_event.attr,
>  	&dev_attr_ap2_event.attr,
> @@ -284,14 +296,113 @@ static int port_get_id(struct platform_device *pdev)
>  };
>  ATTRIBUTE_GROUPS(port_hdr);
>  
> +static ssize_t
> +userclk_freqcmd_store(struct device *dev, struct device_attribute *attr,
> +		      const char *buf, size_t count)
> +{
> +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> +	u64 userclk_freq_cmd;
> +	void __iomem *base;
> +
> +	if (kstrtou64(buf, 0, &userclk_freq_cmd))
> +		return -EINVAL;
> +
> +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> +
> +	mutex_lock(&pdata->lock);
> +	writeq(userclk_freq_cmd, base + PORT_HDR_USRCLK_CMD0);
> +	mutex_unlock(&pdata->lock);
> +
> +	return count;
> +}
> +static DEVICE_ATTR_WO(userclk_freqcmd);
> +
> +static ssize_t
> +userclk_freqcntrcmd_store(struct device *dev, struct device_attribute *attr,
> +			  const char *buf, size_t count)
> +{
> +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev);
> +	u64 userclk_freqcntr_cmd;
> +	void __iomem *base;
> +
> +	if (kstrtou64(buf, 0, &userclk_freqcntr_cmd))
> +		return -EINVAL;
> +
> +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> +
> +	mutex_lock(&pdata->lock);
> +	writeq(userclk_freqcntr_cmd, base + PORT_HDR_USRCLK_CMD1);
> +	mutex_unlock(&pdata->lock);
> +
> +	return count;
> +}
> +static DEVICE_ATTR_WO(userclk_freqcntrcmd);
> +
> +static ssize_t
> +userclk_freqsts_show(struct device *dev, struct device_attribute *attr,
> +		     char *buf)
> +{
> +	u64 userclk_freqsts;
> +	void __iomem *base;
> +
> +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> +
> +	userclk_freqsts = readq(base + PORT_HDR_USRCLK_STS0);
> +
> +	return sprintf(buf, "0x%llx\n", (unsigned long long)userclk_freqsts);
> +}
> +static DEVICE_ATTR_RO(userclk_freqsts);
> +
> +static ssize_t
> +userclk_freqcntrsts_show(struct device *dev, struct device_attribute *attr,
> +			 char *buf)
> +{
> +	u64 userclk_freqcntrsts;
> +	void __iomem *base;
> +
> +	base = dfl_get_feature_ioaddr_by_id(dev, PORT_FEATURE_ID_HEADER);
> +
> +	userclk_freqcntrsts = readq(base + PORT_HDR_USRCLK_STS1);
> +
> +	return sprintf(buf, "0x%llx\n",
> +		       (unsigned long long)userclk_freqcntrsts);
> +}
> +static DEVICE_ATTR_RO(userclk_freqcntrsts);
> +
> +static struct attribute *port_hdr_userclk_attrs[] = {
> +	&dev_attr_userclk_freqcmd.attr,
> +	&dev_attr_userclk_freqcntrcmd.attr,
> +	&dev_attr_userclk_freqsts.attr,
> +	&dev_attr_userclk_freqcntrsts.attr,
> +	NULL,
> +};
> +ATTRIBUTE_GROUPS(port_hdr_userclk);
> +
>  static int port_hdr_init(struct platform_device *pdev,
>  			 struct dfl_feature *feature)
>  {
> +	int ret;
> +
>  	dev_dbg(&pdev->dev, "PORT HDR Init.\n");
>  
>  	port_reset(pdev);
>  
> -	return device_add_groups(&pdev->dev, port_hdr_groups);
> +	ret = device_add_groups(&pdev->dev, port_hdr_groups);

This all needs to be reworked based on the ability for devices to
properly add groups when they are bound on probe (the core does it for
you, no need for the driver to do it.)  But until then, you should at
least consider:

> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * if revision > 0, the userclock will be moved from port hdr register
> +	 * region to a separated private feature.
> +	 */
> +	if (dfl_feature_revision(feature->ioaddr) > 0)
> +		return 0;
> +
> +	ret = device_add_groups(&pdev->dev, port_hdr_userclk_groups);
> +	if (ret)
> +		device_remove_groups(&pdev->dev, port_hdr_groups);

struct attribute_group has is_visible() as a callback to have the core
show or not show, individual attributes when they are created.  So no
need for a second group of attributes and you needing to add/remove
them, just add them all and let the callback handle the "is visible"
logic.  Makes cleanup _so_ much easier (i.e. you don't have to do it.)

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH 5/6] tty: serial: Add linflexuart driver for S32V234
From: gregkh @ 2019-08-05 15:31 UTC (permalink / raw)
  To: Stefan-gabriel Mirea
  Cc: corbet@lwn.net, robh+dt@kernel.org, mark.rutland@arm.com,
	catalin.marinas@arm.com, will@kernel.org, shawnguo@kernel.org,
	Leo Li, jslaby@suse.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
	linux-serial@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Cosmin Stefan Stoica,
	Larisa Ileana Grigore
In-Reply-To: <20190802194702.30249-6-stefan-gabriel.mirea@nxp.com>

On Fri, Aug 02, 2019 at 07:47:23PM +0000, Stefan-gabriel Mirea wrote:
> --- a/include/uapi/linux/serial_core.h
> +++ b/include/uapi/linux/serial_core.h
> @@ -293,4 +293,7 @@
>  /* SiFive UART */
>  #define PORT_SIFIVE_V0	120
>  
> +/* Freescale Linflex UART */
> +#define PORT_LINFLEXUART	121

Do you really need this modified?

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] Documentation/admin-guide: Embargoed hardware security issues
From: Greg Kroah-Hartman @ 2019-08-05 15:12 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: linux-kernel, Jonathan Corbet, security, linux-doc,
	Thomas Gleixner, Mauro Carvalho Chehab
In-Reply-To: <nycvar.YFH.7.76.1908040214090.5899@cbobk.fhfr.pm>

On Sun, Aug 04, 2019 at 02:17:00AM +0200, Jiri Kosina wrote:
> On Thu, 25 Jul 2019, Greg Kroah-Hartman wrote:
> 
> > To address the requirements of embargoed hardware issues, like Meltdown,
> > Spectre, L1TF, etc. it is necessary to define and document a process for
> > handling embargoed hardware security issues.
> 
> I don't know what exactly went wrong, but there is a much more up-to-date 
> version of that document (especially when it comes to vendor contacts), 
> which I sent around on Thu, 2 May 2019 20:23:48 +0200 (CEST) already. 
> Please find it below.

Ah, sorry, don't know what happened here, we had too many different
versions floating around.

I'll take your version, make the edits recommended and send out a new
one in a few days, thanks!

greg k-h

^ permalink raw reply

* Re: [PATCH] Documentation/admin-guide: Embargoed hardware security issues
From: Greg Kroah-Hartman @ 2019-08-05 14:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jiri Kosina, linux-kernel, Jonathan Corbet, security, linux-doc,
	Thomas Gleixner, Mauro Carvalho Chehab
In-Reply-To: <87blx3n0a2.fsf@xmission.com>

On Mon, Aug 05, 2019 at 09:40:21AM -0500, Eric W. Biederman wrote:
> 
> I skimmed this and a couple things jumped out at me.
> 
> 1) PGP and S/MIME because of their use of long term keys do not provide
>    forward secrecy.  Which can makes it worth while to cryptographically
>    factor a key or to obtain knowledge of a private key without the key
>    holders knowledge.  As the keys will be used again and again over a
>    long period of time.

Secrecy over a "long period of time" is not what is needed here.  6
months max is what I have seen, why would you need longer?

>    More recent protocol's such as Signal's Double Ratchet Protocol
>    enable forward secrecy for store and foward communications, and
>    remove the problem of long term keys.

And how does that work with email?  We need something that actually
works with a tool that everyone can use for development (i.e. email)

> 2) The existence of such a process with encrypted communications to
>    ensure long term confidentiality is going to make our contact people
>    the targets of people who want access to knolwedge about hardware
>    bugs like meltdown, before they become public.

Why are those same people not "targets" today?

And again, it's not long-term.

> I am just mentioning these things in case they are not immediately
> obvious to everyone else involved, so that people can be certain
> they are comfortable with the tradeoffs being made.

I know of no other thing that actually works (and lots of people can't
even get PGP to work as they use foolish email clients.)  Do you?

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] Documentation/admin-guide: Embargoed hardware security issues
From: Eric W. Biederman @ 2019-08-05 14:40 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Greg Kroah-Hartman, linux-kernel, Jonathan Corbet, security,
	linux-doc, Thomas Gleixner, Mauro Carvalho Chehab
In-Reply-To: <nycvar.YFH.7.76.1908040214090.5899@cbobk.fhfr.pm>

I skimmed this and a couple things jumped out at me.

1) PGP and S/MIME because of their use of long term keys do not provide
   forward secrecy.  Which can makes it worth while to cryptographically
   factor a key or to obtain knowledge of a private key without the key
   holders knowledge.  As the keys will be used again and again over a
   long period of time.

   More recent protocol's such as Signal's Double Ratchet Protocol
   enable forward secrecy for store and foward communications, and
   remove the problem of long term keys.

2) The existence of such a process with encrypted communications to
   ensure long term confidentiality is going to make our contact people
   the targets of people who want access to knolwedge about hardware
   bugs like meltdown, before they become public.

I am just mentioning these things in case they are not immediately
obvious to everyone else involved, so that people can be certain
they are comfortable with the tradeoffs being made.

Eric

^ permalink raw reply

* Re: [PATCH v12 01/11] MODSIGN: Export module signature definitions
From: Mimi Zohar @ 2019-08-05 14:25 UTC (permalink / raw)
  To: Philipp Rudo, Thiago Jung Bauermann
  Cc: Jessica Yu, linux-integrity, linux-security-module, keyrings,
	linux-crypto, linuxppc-dev, linux-doc, linux-kernel,
	Dmitry Kasatkin, James Morris, Serge E. Hallyn, David Howells,
	David Woodhouse, Herbert Xu, David S. Miller, Jonathan Corbet,
	AKASHI, Takahiro, Heiko Carstens, linux-s390
In-Reply-To: <20190805151123.12510d72@laptop-ibm>

On Mon, 2019-08-05 at 15:11 +0200, Philipp Rudo wrote:
> Hi Thiago,
> 
> > > The patch looks good now.  
> > 
> > Thanks! Can I add your Reviewed-by?
> 
> sorry, for the late answer, but I was on vacation the last two weeks. I hope
> it's not too late now.
> 
> Reviewed-by: Philipp Rudo <prudo@linux.ibm.com>

Thanks!  This patch set is still in the #next-queued-testing
branch.  I'm still hoping for a few more tags, before pushing it out
to the #next-integrity branch later today.

Mimi


^ permalink raw reply

* Re: [PATCH 4/9] KVM: arm64: Support stolen time reporting via shared structure
From: Steven Price @ 2019-08-05 14:18 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, linux-doc, Catalin Marinas, linux-kernel, Russell King,
	Paolo Bonzini, Will Deacon, kvmarm, linux-arm-kernel
In-Reply-To: <20190803191303.02e9bcc9@why>

On 03/08/2019 19:13, Marc Zyngier wrote:
> On Sat, 3 Aug 2019 18:58:17 +0100
> Marc Zyngier <maz@kernel.org> wrote:
> 
>> On Fri,  2 Aug 2019 15:50:12 +0100
>> Steven Price <steven.price@arm.com> wrote:
>>
>>> Implement the service call for configuring a shared structre between a
>>> VCPU and the hypervisor in which the hypervisor can write the time
>>> stolen from the VCPU's execution time by other tasks on the host.
>>>
>>> The hypervisor allocates memory which is placed at an IPA chosen by user
>>> space. The hypervisor then uses WRITE_ONCE() to update the shared
>>> structre ensuring single copy atomicity of the 64-bit unsigned value
>>> that reports stolen time in nanoseconds.
>>>
>>> Whenever stolen time is enabled by the guest, the stolen time counter is
>>> reset.
>>>
>>> The stolen time itself is retrieved from the sched_info structure
>>> maintained by the Linux scheduler code. We enable SCHEDSTATS when
>>> selecting KVM Kconfig to ensure this value is meaningful.
>>>
>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>> ---
>>>  arch/arm64/include/asm/kvm_host.h | 13 +++++-
>>>  arch/arm64/kvm/Kconfig            |  1 +
>>>  include/kvm/arm_hypercalls.h      |  1 +
>>>  include/linux/kvm_types.h         |  2 +
>>>  virt/kvm/arm/arm.c                | 18 ++++++++
>>>  virt/kvm/arm/hypercalls.c         | 70 +++++++++++++++++++++++++++++++
>>>  6 files changed, 104 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>> index f656169db8c3..78f270190d43 100644
>>> --- a/arch/arm64/include/asm/kvm_host.h
>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>> @@ -44,6 +44,7 @@
>>>  	KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>>>  #define KVM_REQ_IRQ_PENDING	KVM_ARCH_REQ(1)
>>>  #define KVM_REQ_VCPU_RESET	KVM_ARCH_REQ(2)
>>> +#define KVM_REQ_RECORD_STEAL	KVM_ARCH_REQ(3)
>>>  
>>>  DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
>>>  
>>> @@ -83,6 +84,11 @@ struct kvm_arch {
>>>  
>>>  	/* Mandated version of PSCI */
>>>  	u32 psci_version;
>>> +
>>> +	struct kvm_arch_pvtime {
>>> +		void *st;
>>> +		gpa_t st_base;
>>> +	} pvtime;
>>>  };
>>>  
>>>  #define KVM_NR_MEM_OBJS     40
>>> @@ -338,8 +344,13 @@ struct kvm_vcpu_arch {
>>>  	/* True when deferrable sysregs are loaded on the physical CPU,
>>>  	 * see kvm_vcpu_load_sysregs and kvm_vcpu_put_sysregs. */
>>>  	bool sysregs_loaded_on_cpu;
>>> -};
>>>  
>>> +	/* Guest PV state */
>>> +	struct {
>>> +		u64 steal;
>>> +		u64 last_steal;
>>> +	} steal;
>>> +};
>>>  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
>>>  #define vcpu_sve_pffr(vcpu) ((void *)((char *)((vcpu)->arch.sve_state) + \
>>>  				      sve_ffr_offset((vcpu)->arch.sve_max_vl)))
>>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>>> index a67121d419a2..d8b88e40d223 100644
>>> --- a/arch/arm64/kvm/Kconfig
>>> +++ b/arch/arm64/kvm/Kconfig
>>> @@ -39,6 +39,7 @@ config KVM
>>>  	select IRQ_BYPASS_MANAGER
>>>  	select HAVE_KVM_IRQ_BYPASS
>>>  	select HAVE_KVM_VCPU_RUN_PID_CHANGE
>>> +	select SCHEDSTATS
>>>  	---help---
>>>  	  Support hosting virtualized guest machines.
>>>  	  We don't support KVM with 16K page tables yet, due to the multiple
>>> diff --git a/include/kvm/arm_hypercalls.h b/include/kvm/arm_hypercalls.h
>>> index 35a5abcc4ca3..9f0710ab4292 100644
>>> --- a/include/kvm/arm_hypercalls.h
>>> +++ b/include/kvm/arm_hypercalls.h
>>> @@ -7,6 +7,7 @@
>>>  #include <asm/kvm_emulate.h>
>>>  
>>>  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu);
>>> +int kvm_update_stolen_time(struct kvm_vcpu *vcpu);
>>>  
>>>  static inline u32 smccc_get_function(struct kvm_vcpu *vcpu)
>>>  {
>>> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
>>> index bde5374ae021..1c88e69db3d9 100644
>>> --- a/include/linux/kvm_types.h
>>> +++ b/include/linux/kvm_types.h
>>> @@ -35,6 +35,8 @@ typedef unsigned long  gva_t;
>>>  typedef u64            gpa_t;
>>>  typedef u64            gfn_t;
>>>  
>>> +#define GPA_INVALID	(~(gpa_t)0)
>>> +
>>>  typedef unsigned long  hva_t;
>>>  typedef u64            hpa_t;
>>>  typedef u64            hfn_t;
>>> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
>>> index f645c0fbf7ec..ebd963d2580b 100644
>>> --- a/virt/kvm/arm/arm.c
>>> +++ b/virt/kvm/arm/arm.c
>>> @@ -40,6 +40,10 @@
>>>  #include <asm/kvm_coproc.h>
>>>  #include <asm/sections.h>
>>>  
>>> +#include <kvm/arm_hypercalls.h>
>>> +#include <kvm/arm_pmu.h>
>>> +#include <kvm/arm_psci.h>
>>> +
>>>  #ifdef REQUIRES_VIRT
>>>  __asm__(".arch_extension	virt");
>>>  #endif
>>> @@ -135,6 +139,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>>  	kvm->arch.max_vcpus = vgic_present ?
>>>  				kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
>>>  
>>> +	kvm->arch.pvtime.st_base = GPA_INVALID;
>>>  	return ret;
>>>  out_free_stage2_pgd:
>>>  	kvm_free_stage2_pgd(kvm);
>>> @@ -371,6 +376,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>  	kvm_vcpu_load_sysregs(vcpu);
>>>  	kvm_arch_vcpu_load_fp(vcpu);
>>>  	kvm_vcpu_pmu_restore_guest(vcpu);
>>> +	kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);
>>>  
>>>  	if (single_task_running())
>>>  		vcpu_clear_wfe_traps(vcpu);
>>> @@ -617,6 +623,15 @@ static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
>>>  	smp_rmb();
>>>  }
>>>  
>>> +static void vcpu_req_record_steal(struct kvm_vcpu *vcpu)
>>> +{
>>> +	int idx;
>>> +
>>> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
>>> +	kvm_update_stolen_time(vcpu);
>>> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>>> +}
>>> +
>>>  static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
>>>  {
>>>  	return vcpu->arch.target >= 0;
>>> @@ -636,6 +651,9 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
>>>  		 * that a VCPU sees new virtual interrupts.
>>>  		 */
>>>  		kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu);
>>> +
>>> +		if (kvm_check_request(KVM_REQ_RECORD_STEAL, vcpu))
>>> +			vcpu_req_record_steal(vcpu);  
>>
>> Something troubles me. Here, you've set the request on load. But you
>> can be preempted at any time (preemption gets disabled just after).
>>
>> I have the feeling that should you get preempted right here, you'll
>> end-up having accumulated the wrong amount of steal time, as the
>> request put via load when you'll get scheduled back in will only get
>> processed after a full round of entry/exit/entry, which doesn't look
>> great.
> 
> Ah, no. We're saved by the check for pending requests right before we
> jump in the guest, causing an early exit and the whole shebang to be
> restarted.

Yes, that's my understanding. Obviously not ideal if it happens in that
small window, but everything is redone to get the right values in the end.

Steve

^ permalink raw reply

* Re: [PATCH 4/9] KVM: arm64: Support stolen time reporting via shared structure
From: Steven Price @ 2019-08-05 14:09 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, Radim Krčmář, Catalin Marinas,
	Suzuki K Pouloze, linux-doc, Russell King, linux-kernel,
	James Morse, linux-arm-kernel, Paolo Bonzini, Will Deacon, kvmarm,
	Julien Thierry
In-Reply-To: <20190803125515.6aa50084@why>

On 03/08/2019 12:55, Marc Zyngier wrote:
> On Fri,  2 Aug 2019 15:50:12 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> Implement the service call for configuring a shared structre between a
> 
> structure
> 
>> VCPU and the hypervisor in which the hypervisor can write the time
>> stolen from the VCPU's execution time by other tasks on the host.
>>
>> The hypervisor allocates memory which is placed at an IPA chosen by user
>> space. The hypervisor then uses WRITE_ONCE() to update the shared
>> structre ensuring single copy atomicity of the 64-bit unsigned value
> 
> structure

Twice in one commit message... thanks for spotting! :)

>> that reports stolen time in nanoseconds.
>>
>> Whenever stolen time is enabled by the guest, the stolen time counter is
>> reset.
>>
>> The stolen time itself is retrieved from the sched_info structure
>> maintained by the Linux scheduler code. We enable SCHEDSTATS when
>> selecting KVM Kconfig to ensure this value is meaningful.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  arch/arm64/include/asm/kvm_host.h | 13 +++++-
>>  arch/arm64/kvm/Kconfig            |  1 +
>>  include/kvm/arm_hypercalls.h      |  1 +
>>  include/linux/kvm_types.h         |  2 +
>>  virt/kvm/arm/arm.c                | 18 ++++++++
>>  virt/kvm/arm/hypercalls.c         | 70 +++++++++++++++++++++++++++++++
>>  6 files changed, 104 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index f656169db8c3..78f270190d43 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -44,6 +44,7 @@
>>  	KVM_ARCH_REQ_FLAGS(0, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>>  #define KVM_REQ_IRQ_PENDING	KVM_ARCH_REQ(1)
>>  #define KVM_REQ_VCPU_RESET	KVM_ARCH_REQ(2)
>> +#define KVM_REQ_RECORD_STEAL	KVM_ARCH_REQ(3)
>>  
>>  DECLARE_STATIC_KEY_FALSE(userspace_irqchip_in_use);
>>  
>> @@ -83,6 +84,11 @@ struct kvm_arch {
>>  
>>  	/* Mandated version of PSCI */
>>  	u32 psci_version;
>> +
>> +	struct kvm_arch_pvtime {
>> +		void *st;
> 
> Is it really a void *? I'm sure you can use a proper type here...

Indeed that sounds like a good idea!

>> +		gpa_t st_base;
>> +	} pvtime;
>>  };
>>  
>>  #define KVM_NR_MEM_OBJS     40
>> @@ -338,8 +344,13 @@ struct kvm_vcpu_arch {
>>  	/* True when deferrable sysregs are loaded on the physical CPU,
>>  	 * see kvm_vcpu_load_sysregs and kvm_vcpu_put_sysregs. */
>>  	bool sysregs_loaded_on_cpu;
>> -};
>>  
>> +	/* Guest PV state */
>> +	struct {
>> +		u64 steal;
>> +		u64 last_steal;
>> +	} steal;
>> +};
>>  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
>>  #define vcpu_sve_pffr(vcpu) ((void *)((char *)((vcpu)->arch.sve_state) + \
>>  				      sve_ffr_offset((vcpu)->arch.sve_max_vl)))
>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>> index a67121d419a2..d8b88e40d223 100644
>> --- a/arch/arm64/kvm/Kconfig
>> +++ b/arch/arm64/kvm/Kconfig
>> @@ -39,6 +39,7 @@ config KVM
>>  	select IRQ_BYPASS_MANAGER
>>  	select HAVE_KVM_IRQ_BYPASS
>>  	select HAVE_KVM_VCPU_RUN_PID_CHANGE
>> +	select SCHEDSTATS
>>  	---help---
>>  	  Support hosting virtualized guest machines.
>>  	  We don't support KVM with 16K page tables yet, due to the multiple
>> diff --git a/include/kvm/arm_hypercalls.h b/include/kvm/arm_hypercalls.h
>> index 35a5abcc4ca3..9f0710ab4292 100644
>> --- a/include/kvm/arm_hypercalls.h
>> +++ b/include/kvm/arm_hypercalls.h
>> @@ -7,6 +7,7 @@
>>  #include <asm/kvm_emulate.h>
>>  
>>  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu);
>> +int kvm_update_stolen_time(struct kvm_vcpu *vcpu);
>>  
>>  static inline u32 smccc_get_function(struct kvm_vcpu *vcpu)
>>  {
>> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
>> index bde5374ae021..1c88e69db3d9 100644
>> --- a/include/linux/kvm_types.h
>> +++ b/include/linux/kvm_types.h
>> @@ -35,6 +35,8 @@ typedef unsigned long  gva_t;
>>  typedef u64            gpa_t;
>>  typedef u64            gfn_t;
>>  
>> +#define GPA_INVALID	(~(gpa_t)0)
>> +
>>  typedef unsigned long  hva_t;
>>  typedef u64            hpa_t;
>>  typedef u64            hfn_t;
>> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
>> index f645c0fbf7ec..ebd963d2580b 100644
>> --- a/virt/kvm/arm/arm.c
>> +++ b/virt/kvm/arm/arm.c
>> @@ -40,6 +40,10 @@
>>  #include <asm/kvm_coproc.h>
>>  #include <asm/sections.h>
>>  
>> +#include <kvm/arm_hypercalls.h>
>> +#include <kvm/arm_pmu.h>
>> +#include <kvm/arm_psci.h>
>> +
>>  #ifdef REQUIRES_VIRT
>>  __asm__(".arch_extension	virt");
>>  #endif
>> @@ -135,6 +139,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>  	kvm->arch.max_vcpus = vgic_present ?
>>  				kvm_vgic_get_max_vcpus() : KVM_MAX_VCPUS;
>>  
>> +	kvm->arch.pvtime.st_base = GPA_INVALID;
>>  	return ret;
>>  out_free_stage2_pgd:
>>  	kvm_free_stage2_pgd(kvm);
>> @@ -371,6 +376,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>  	kvm_vcpu_load_sysregs(vcpu);
>>  	kvm_arch_vcpu_load_fp(vcpu);
>>  	kvm_vcpu_pmu_restore_guest(vcpu);
>> +	kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);
>>  
>>  	if (single_task_running())
>>  		vcpu_clear_wfe_traps(vcpu);
>> @@ -617,6 +623,15 @@ static void vcpu_req_sleep(struct kvm_vcpu *vcpu)
>>  	smp_rmb();
>>  }
>>  
>> +static void vcpu_req_record_steal(struct kvm_vcpu *vcpu)
>> +{
>> +	int idx;
>> +
>> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
>> +	kvm_update_stolen_time(vcpu);
>> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>> +}
>> +
>>  static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
>>  {
>>  	return vcpu->arch.target >= 0;
>> @@ -636,6 +651,9 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
>>  		 * that a VCPU sees new virtual interrupts.
>>  		 */
>>  		kvm_check_request(KVM_REQ_IRQ_PENDING, vcpu);
>> +
>> +		if (kvm_check_request(KVM_REQ_RECORD_STEAL, vcpu))
>> +			vcpu_req_record_steal(vcpu);
>>  	}
>>  }
>>  
>> diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
>> index 2906b2df99df..196c71c8dd87 100644
>> --- a/virt/kvm/arm/hypercalls.c
>> +++ b/virt/kvm/arm/hypercalls.c
>> @@ -10,6 +10,70 @@
>>  #include <kvm/arm_hypercalls.h>
>>  #include <kvm/arm_psci.h>
>>  
>> +
>> +static struct pvclock_vcpu_stolen_time_info *pvtime_get_st(
>> +		struct kvm_vcpu *vcpu)
> 
> nit: on a single line.
> 
>> +{
>> +	struct pvclock_vcpu_stolen_time_info *st = vcpu->kvm->arch.pvtime.st;
>> +
>> +	if (!st)
>> +		return NULL;
>> +
>> +	return &st[kvm_vcpu_get_idx(vcpu)];
>> +}
>> +
>> +int kvm_update_stolen_time(struct kvm_vcpu *vcpu)
>> +{
>> +	u64 steal;
>> +	struct pvclock_vcpu_stolen_time_info *kaddr;
>> +
>> +	if (vcpu->kvm->arch.pvtime.st_base == GPA_INVALID)
>> +		return -ENOTSUPP;
> 
> So for a guest that doesn't have stolen time support (which is 100% of
> them for the foreseeable future), we still set a request, take the srcu
> lock and end-up here for nothing. I'd rather we test this st_base
> early, as it should never change once the guest has started.

Good point - I'll make the call to kvm_make_request() conditional on
st_base being set.

>> +
>> +	kaddr = pvtime_get_st(vcpu);
>> +
>> +	if (!kaddr)
>> +		return -ENOTSUPP;
> 
> How can this happen?

Good question, and it makes the pvtime_get_st() helper rather pointless.
Will remove.

>> +
>> +	kaddr->revision = 0;
>> +	kaddr->attributes = 0;
> 
> Why does this need to be written each time we update the stolen time? I
> have the feeling this would be better moved to the hypercall
> initializing the data structure.

Sure.

>> +
>> +	/* Let's do the local bookkeeping */
>> +	steal = vcpu->arch.steal.steal;
>> +	steal += current->sched_info.run_delay - vcpu->arch.steal.last_steal;
>> +	vcpu->arch.steal.last_steal = current->sched_info.run_delay;
>> +	vcpu->arch.steal.steal = steal;
>> +
>> +	/* Now write out the value to the shared page */
>> +	WRITE_ONCE(kaddr->stolen_time, cpu_to_le64(steal));
> 
> Is there any requirement for this to be visible to another CPU than the
> one this is being written from?

I don't believe there is a requirement for this to be visible
synchronously - there is an implicit race here if another vCPU is
accessing the structure (because the hypervisor can preempt and update
at any point), so we don't need to push this out with a barrier. However
we must use 64-bit single-copy atomicity writes to ensure that another
vCPU can't see a half-updated value.

Thanks,

Steve

^ permalink raw reply

* Re: [PATCH 0/9] arm64: Stolen time support
From: Marc Zyngier @ 2019-08-05 13:26 UTC (permalink / raw)
  To: Steven Price
  Cc: kvm, Radim Krčmář, Catalin Marinas,
	Suzuki K Pouloze, linux-doc, Russell King, linux-kernel,
	James Morse, linux-arm-kernel, Paolo Bonzini, Will Deacon, kvmarm,
	Julien Thierry
In-Reply-To: <6789f477-8ab5-cc54-1ad2-8627917b07c9@arm.com>

On 05/08/2019 14:06, Steven Price wrote:
> On 03/08/2019 19:05, Marc Zyngier wrote:
>> On Fri,  2 Aug 2019 15:50:08 +0100
>> Steven Price <steven.price@arm.com> wrote:
>>
>> Hi Steven,
>>
>>> This series add support for paravirtualized time for arm64 guests and
>>> KVM hosts following the specification in Arm's document DEN 0057A:
>>>
>>> https://developer.arm.com/docs/den0057/a
>>>
>>> It implements support for stolen time, allowing the guest to
>>> identify time when it is forcibly not executing.
>>>
>>> It doesn't implement support for Live Physical Time (LPT) as there are
>>> some concerns about the overheads and approach in the above
>>> specification, and I expect an updated version of the specification to
>>> be released soon with just the stolen time parts.
>>
>> Thanks for posting this.
>>
>> My current concern with this series is around the fact that we allocate
>> memory from the kernel on behalf of the guest. It is the first example
>> of such thing in the ARM port, and I can't really say I'm fond of it.
>>
>> x86 seems to get away with it by having the memory allocated from
>> userspace, why I tend to like more. Yes, put_user is more
>> expensive than a straight store, but this isn't done too often either.
>>
>> What is the rational for your current approach?
> 
> As I see it there are 3 approaches that can be taken here:
> 
> 1. Hypervisor allocates memory and adds it to the virtual machine. This
> means that everything to do with the 'device' is encapsulated behind the
> KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want the
> stolen time structure to be fast it cannot be a trapping region and has
> to be backed by real memory - in this case allocated by the host kernel.
> 
> 2. Host user space allocates memory. Similar to above, but this time
> user space needs to manage the memory region as well as the usual
> KVM_CREATE_DEVICE dance. I've no objection to this, but it means
> kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
> to size the memory region).
> 
> 3. Guest kernel "donates" the memory to the hypervisor for the
> structure. As far as I'm aware this is what x86 does. The problems I see
> this approach are:
> 
>  a) kexec becomes much more tricky - there needs to be a disabling
> mechanism for the guest to stop the hypervisor scribbling on memory
> before starting the new kernel.
> 
>  b) If there is more than one entity that is interested in the
> information (e.g. firmware and kernel) then this requires some form of
> arbitration in the guest because the hypervisor doesn't want to have to
> track an arbitrary number of regions to update.
> 
>  c) Performance can suffer if the host kernel doesn't have a suitably
> aligned/sized area to use. As you say - put_user() is more expensive.
> The structure is updated on every return to the VM.
> 
> 
> Of course x86 does prove the third approach can work, but I'm not sure
> which is actually better. Avoid the kexec cancellation requirements was
> the main driver of the current approach. Although many of the
> conversations about this were also tied up with Live Physical Time which
> adds its own complications.

My current train of thoughts is around (2):

- We don't need a new mechanism to track pages or deal with overlapping
IPA ranges
- We can get rid of the save/restore interface

The drawback is that the amount of memory required per vcpu becomes ABI.
I don't think that's a huge deal, as the hypervisor has the same
contract with the guest.

We also take a small hit with put_user(), but this is only done as a
consequence of vcpu_load() (and not on every entry as you suggest
above). It'd be worth quantifying this overhead before making any
decision one way or another.

Thanks,

	M.
-- 
Jazz is not dead, it just smells funny...

^ permalink raw reply

* Re: [PATCH 1/9] KVM: arm64: Document PV-time interface
From: Steven Price @ 2019-08-05 13:06 UTC (permalink / raw)
  To: Zenghui Yu
  Cc: kvm, linux-doc, Catalin Marinas, Russell King, linux-kernel,
	Marc Zyngier, Paolo Bonzini, Will Deacon, kvmarm,
	linux-arm-kernel
In-Reply-To: <3bdd764a-b6f5-d17e-a703-d8eb13838efc@huawei.com>

On 05/08/2019 04:23, Zenghui Yu wrote:
> Hi Steven,
> 
> On 2019/8/2 22:50, Steven Price wrote:
>> Introduce a paravirtualization interface for KVM/arm64 based on the
>> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>>
>> This only adds the details about "Stolen Time" as the details of "Live
>> Physical Time" have not been fully agreed.
>>
>> User space can specify a reserved area of memory for the guest and
>> inform KVM to populate the memory with information on time that the host
>> kernel has stolen from the guest.
>>
>> A hypercall interface is provided for the guest to interrogate the
>> hypervisor's support for this interface and the location of the shared
>> memory structures.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>   Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++++++++++++
>>   1 file changed, 107 insertions(+)
>>   create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
>                                     ^^^^^^^
> This directory has been renamed recently, see:
> 
> https://patchwork.ozlabs.org/patch/1136104/

Thanks for pointing that out - I'll move it in the next version.

Steve

^ permalink raw reply

* Re: [PATCH 1/9] KVM: arm64: Document PV-time interface
From: Steven Price @ 2019-08-05 13:06 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: peter.maydell, kvm, Radim Krčmář, Catalin Marinas,
	Suzuki K Pouloze, linux-doc, Russell King, linux-kernel,
	James Morse, linux-arm-kernel, Paolo Bonzini, Will Deacon, kvmarm,
	Julien Thierry
In-Reply-To: <20190803121343.2f482200@why>

On 03/08/2019 12:13, Marc Zyngier wrote:
> On Fri,  2 Aug 2019 15:50:09 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
> [+Peter for the userspace aspect of things]
> 
> Hi Steve,
> 
>> Introduce a paravirtualization interface for KVM/arm64 based on the
>> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>>
>> This only adds the details about "Stolen Time" as the details of "Live
>> Physical Time" have not been fully agreed.
>>
>> User space can specify a reserved area of memory for the guest and
>> inform KVM to populate the memory with information on time that the host
>> kernel has stolen from the guest.
>>
>> A hypercall interface is provided for the guest to interrogate the
>> hypervisor's support for this interface and the location of the shared
>> memory structures.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++++++++++++
>>  1 file changed, 107 insertions(+)
>>  create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
>>
>> diff --git a/Documentation/virtual/kvm/arm/pvtime.txt b/Documentation/virtual/kvm/arm/pvtime.txt
>> new file mode 100644
>> index 000000000000..e6ae9799e1d5
>> --- /dev/null
>> +++ b/Documentation/virtual/kvm/arm/pvtime.txt
>> @@ -0,0 +1,107 @@
>> +Paravirtualized time support for arm64
>> +======================================
>> +
>> +Arm specification DEN0057/A defined a standard for paravirtualised time
>> +support for Aarch64 guests:
> 
> nit: AArch64
> 
>> +
>> +https://developer.arm.com/docs/den0057/a
> 
> Between this file and the above document, which one is authoritative?

The above document should be authoritative - although I'm still waiting
for the final version to be published. I'm not expecting any changes to
the stolen time part though.

>> +
>> +KVM/Arm64 implements the stolen time part of this specification by providing
> 
> nit: KVM/arm64
> 
>> +some hypervisor service calls to support a paravirtualized guest obtaining a
>> +view of the amount of time stolen from its execution.
>> +
>> +Two new SMCCC compatible hypercalls are defined:
>> +
>> +PV_FEATURES 0xC5000020
>> +PV_TIME_ST  0xC5000022
>> +
>> +These are only available in the SMC64/HVC64 calling convention as
>> +paravirtualized time is not available to 32 bit Arm guests.
>> +
>> +PV_FEATURES
>> +    Function ID:  (uint32)  : 0xC5000020
>> +    PV_func_id:   (uint32)  : Either PV_TIME_LPT or PV_TIME_ST
>> +    Return value: (int32)   : NOT_SUPPORTED (-1) or SUCCESS (0) if the relevant
>> +                              PV-time feature is supported by the hypervisor.
> 
> How is PV_FEATURES discovered? Is the intention to make it a generic
> ARM-wide PV discovery mechanism, not specific to PV time?

SMCCC is mandated for PV time. So, assuming the hypervisor supports
SMCCC, the "NOT_SUPPORTED" return is mandated by SMCCC if PV time isn't
supported.

However, we do also use the SMCCC_ARCH_FEATURES mechanism to check the
existence of PV_FEATURES before use. I'll update the document to call
this out.

>> +
>> +PV_TIME_ST
>> +    Function ID:  (uint32)  : 0xC5000022
>> +    Return value: (int64)   : IPA of the stolen time data structure for this
>> +                              (V)CPU. On failure:
>> +                              NOT_SUPPORTED (-1)
>> +
> 
> Is the size implicit? What are the memory attributes? This either needs
> documenting here, or point to the right bit to the spec.

The size is implicit - it's a pointer to the below structure, so the
guest can only rely on the first 16 bytes being valid. The memory
attributes are described in the specification as:

"The calling guest can map the IPA into normal memory with inner and
outer write back caching attributes, in the inner sharable domain"

I'll put those details in this document for completeness.

>> +Stolen Time
>> +-----------
>> +
>> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
>> +
>> +  Field       | Byte Length | Byte Offset | Description
>> +  ----------- | ----------- | ----------- | --------------------------
>> +  Revision    |      4      |      0      | Must be 0 for version 0.1
>> +  Attributes  |      4      |      4      | Must be 0
>> +  Stolen time |      8      |      8      | Stolen time in unsigned
>> +              |             |             | nanoseconds indicating how
>> +              |             |             | much time this VCPU thread
>> +              |             |             | was involuntarily not
>> +              |             |             | running on a physical CPU.
>> +
>> +The structure will be updated by the hypervisor periodically as time is stolen
> 
> Is it really periodic? If so, when is the update frequency?

Hmm, periodic might be the wrong term - there is no guaranteed frequency
of update. The spec says:

"The hypervisor must update this value prior to scheduling a virtual CPU"

I guess that's probably the best description.

>> +from the VCPU. It will be present within a reserved region of the normal
>> +memory given to the guest. The guest should not attempt to write into this
>> +memory. There is a structure by VCPU of the guest.
> 
> What if the vcpu writes to it? Does it get a fault?

From the perspective from the specification this is undefined. A fault
would therefore be acceptable but isn't generated in the implementation
defined here.

> If there is a
> structure per vcpu, what is the layout in memory? How does a vcpu find
> its own data structure? Is that the address returned by PV_TIME_ST?

A call to PV_TIME_ST returns the structure for the calling vCPU - I'll
make that explicit. The layout is therefore defined by the hypervisor
and cannot be relied on by the guest. As below this implementation uses
a simple array of structures.

>> +
>> +User space interface
>> +====================
>> +
>> +User space can request that KVM provide the paravirtualized time interface to
>> +a guest by creating a KVM_DEV_TYPE_ARM_PV_TIME device, for example:
>> +
>> +    struct kvm_create_device pvtime_device = {
>> +            .type = KVM_DEV_TYPE_ARM_PV_TIME,
>> +            .attr = 0,
>> +            .flags = 0,
>> +    };
>> +
>> +    pvtime_fd = ioctl(vm_fd, KVM_CREATE_DEVICE, &pvtime_device);
>> +
>> +The guest IPA of the structures must be given to KVM. This is the base address
> 
> nit: s/guest //
> 
>> +of an array of stolen time structures (one for each VCPU). For example:
>> +
>> +    struct kvm_device_attr st_base = {
>> +            .group = KVM_DEV_ARM_PV_TIME_PADDR,
>> +            .attr = KVM_DEV_ARM_PV_TIME_ST,
>> +            .addr = (u64)(unsigned long)&st_paddr
>> +    };
>> +
>> +    ioctl(pvtime_fd, KVM_SET_DEVICE_ATTR, &st_base);
> 
> So the allocation itself is performed by the kernel? What are the
> ordering requirements between creating vcpus and the device? What are
> the alignment requirements for the base address?

The base address should be page aligned - I'll spell that out.

There are currently no ordering requirements between creating vcpus and
the device. However...

>> +
>> +For migration (or save/restore) of a guest it is necessary to save the contents
>> +of the shared page(s) and later restore them. KVM_DEV_ARM_PV_TIME_STATE_SIZE
>> +provides the size of this data and KVM_DEV_ARM_PV_TIME_STATE allows the state
>> +to be read/written.
> 
> Is the size variable depending on the number of vcpus?

...yes - so restoring the state after migration must be done after
creating the vcpus. I'll point out that the device should created after.

Thanks for the review,

Steve

^ permalink raw reply

* Re: [PATCH 0/9] arm64: Stolen time support
From: Steven Price @ 2019-08-05 13:06 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, Radim Krčmář, Catalin Marinas,
	Suzuki K Pouloze, linux-doc, Russell King, linux-kernel,
	James Morse, linux-arm-kernel, Paolo Bonzini, Will Deacon, kvmarm,
	Julien Thierry
In-Reply-To: <20190803190522.5fec8f7d@why>

On 03/08/2019 19:05, Marc Zyngier wrote:
> On Fri,  2 Aug 2019 15:50:08 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
> Hi Steven,
> 
>> This series add support for paravirtualized time for arm64 guests and
>> KVM hosts following the specification in Arm's document DEN 0057A:
>>
>> https://developer.arm.com/docs/den0057/a
>>
>> It implements support for stolen time, allowing the guest to
>> identify time when it is forcibly not executing.
>>
>> It doesn't implement support for Live Physical Time (LPT) as there are
>> some concerns about the overheads and approach in the above
>> specification, and I expect an updated version of the specification to
>> be released soon with just the stolen time parts.
> 
> Thanks for posting this.
> 
> My current concern with this series is around the fact that we allocate
> memory from the kernel on behalf of the guest. It is the first example
> of such thing in the ARM port, and I can't really say I'm fond of it.
> 
> x86 seems to get away with it by having the memory allocated from
> userspace, why I tend to like more. Yes, put_user is more
> expensive than a straight store, but this isn't done too often either.
> 
> What is the rational for your current approach?

As I see it there are 3 approaches that can be taken here:

1. Hypervisor allocates memory and adds it to the virtual machine. This
means that everything to do with the 'device' is encapsulated behind the
KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want the
stolen time structure to be fast it cannot be a trapping region and has
to be backed by real memory - in this case allocated by the host kernel.

2. Host user space allocates memory. Similar to above, but this time
user space needs to manage the memory region as well as the usual
KVM_CREATE_DEVICE dance. I've no objection to this, but it means
kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
to size the memory region).

3. Guest kernel "donates" the memory to the hypervisor for the
structure. As far as I'm aware this is what x86 does. The problems I see
this approach are:

 a) kexec becomes much more tricky - there needs to be a disabling
mechanism for the guest to stop the hypervisor scribbling on memory
before starting the new kernel.

 b) If there is more than one entity that is interested in the
information (e.g. firmware and kernel) then this requires some form of
arbitration in the guest because the hypervisor doesn't want to have to
track an arbitrary number of regions to update.

 c) Performance can suffer if the host kernel doesn't have a suitably
aligned/sized area to use. As you say - put_user() is more expensive.
The structure is updated on every return to the VM.

Of course x86 does prove the third approach can work, but I'm not sure
which is actually better. Avoid the kexec cancellation requirements was
the main driver of the current approach. Although many of the
conversations about this were also tied up with Live Physical Time which
adds its own complications.

Steve

^ permalink raw reply

* Re: [PATCH 3/9] KVM: arm64: Implement PV_FEATURES call
From: Steven Price @ 2019-08-05 13:14 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, Radim Krčmář, Catalin Marinas,
	Suzuki K Pouloze, linux-doc, Russell King, linux-kernel,
	James Morse, linux-arm-kernel, Paolo Bonzini, Will Deacon, kvmarm,
	Julien Thierry
In-Reply-To: <20190803122124.7700f700@why>

On 03/08/2019 12:21, Marc Zyngier wrote:
> On Fri,  2 Aug 2019 15:50:11 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> This provides a mechanism for querying which paravirtualized features
>> are available in this hypervisor.
>>
>> Also add the header file which defines the ABI for the paravirtualized
>> clock features we're about to add.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>>  arch/arm64/include/asm/pvclock-abi.h | 20 ++++++++++++++++++++
>>  include/linux/arm-smccc.h            | 14 ++++++++++++++
>>  virt/kvm/arm/hypercalls.c            |  9 +++++++++
>>  3 files changed, 43 insertions(+)
>>  create mode 100644 arch/arm64/include/asm/pvclock-abi.h
>>
>> diff --git a/arch/arm64/include/asm/pvclock-abi.h b/arch/arm64/include/asm/pvclock-abi.h
>> new file mode 100644
>> index 000000000000..1f7cdc102691
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/pvclock-abi.h
>> @@ -0,0 +1,20 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright (C) 2019 Arm Ltd. */
>> +
>> +#ifndef __ASM_PVCLOCK_ABI_H
>> +#define __ASM_PVCLOCK_ABI_H
>> +
>> +/* The below structures and constants are defined in ARM DEN0057A */
>> +
>> +struct pvclock_vcpu_stolen_time_info {
>> +	__le32 revision;
>> +	__le32 attributes;
>> +	__le64 stolen_time;
>> +	/* Structure must be 64 byte aligned, pad to that size */
>> +	u8 padding[48];
>> +} __packed;
>> +
>> +#define PV_VM_TIME_NOT_SUPPORTED	-1
> 
> Isn't the intent for this to be the same value as
> SMCCC_RET_NOT_SUPPORTED?

Yes it is, I guess there's not much point defining it again.

>> +#define PV_VM_TIME_INVALID_PARAMETERS	-2
> 
> It overlaps with SMCCC_RET_NOT_REQUIRED. Is that a problem? Should we
> consider a spec change for this?

Actually INVALID_PARAMETERS is only for Live Physical Time, since we're
not implementing it here, this can go as well.

Thanks,

Steve

^ permalink raw reply

* Re: [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing
From: Joel Fernandes @ 2019-08-05 13:10 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, Alexey Dobriyan, Andrew Morton, Brendan Gregg,
	Christian Hansen, dancol, fmayer, joaodias, Jonathan Corbet,
	Kees Cook, kernel-team, linux-api, linux-doc, linux-fsdevel,
	linux-mm, Michal Hocko, Mike Rapoport, namhyung, Roman Gushchin,
	Stephen Rothwell, surenb, tkjos, Vladimir Davydov,
	Vlastimil Babka, wvw
In-Reply-To: <20190805075547.GA196934@google.com>

On Mon, Aug 05, 2019 at 04:55:47PM +0900, Minchan Kim wrote:
> Hi Joel,

Hi Minchan,

> On Wed, Jul 31, 2019 at 01:19:37PM -0400, Joel Fernandes wrote:
> > > > -static struct page *page_idle_get_page(unsigned long pfn)
> > > > +static struct page *page_idle_get_page(struct page *page_in)
> > > 
> > > Looks weird function name after you changed the argument.
> > > Maybe "bool check_valid_page(struct page *page)"?
> > 
> > 
> > I don't think so, this function does a get_page_unless_zero() on the page as well.
> > 
> > > >  {
> > > >  	struct page *page;
> > > >  	pg_data_t *pgdat;
> > > >  
> > > > -	if (!pfn_valid(pfn))
> > > > -		return NULL;
> > > > -
> > > > -	page = pfn_to_page(pfn);
> > > > +	page = page_in;
> > > >  	if (!page || !PageLRU(page) ||
> > > >  	    !get_page_unless_zero(page))
> > > >  		return NULL;
> > > > @@ -51,6 +49,18 @@ static struct page *page_idle_get_page(unsigned long pfn)
> > > >  	return page;
> > > >  }
> > > >  
> > > > +/*
> > > > + * This function tries to get a user memory page by pfn as described above.
> > > > + */
> > > > +static struct page *page_idle_get_page_pfn(unsigned long pfn)
> > > 
> > > So we could use page_idle_get_page name here.
> > 
> > 
> > Based on above comment, I prefer to keep same name. Do you agree?
> 
> Yes, I agree. Just please add a comment about refcount in the description
> on page_idle_get_page.

Ok.


> > > > +	return page_idle_get_page(pfn_to_page(pfn));
> > > > +}
> > > > +
> > > >  static bool page_idle_clear_pte_refs_one(struct page *page,
> > > >  					struct vm_area_struct *vma,
> > > >  					unsigned long addr, void *arg)
> > > > @@ -118,6 +128,47 @@ static void page_idle_clear_pte_refs(struct page *page)
> > > >  		unlock_page(page);
> > > >  }
> > > >  
> > > > +/* Helper to get the start and end frame given a pos and count */
> > > > +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> > > > +				unsigned long *start, unsigned long *end)
> > > > +{
> > > > +	unsigned long max_frame;
> > > > +
> > > > +	/* If an mm is not given, assume we want physical frames */
> > > > +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> > > > +
> > > > +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > > > +		return -EINVAL;
> > > > +
> > > > +	*start = pos * BITS_PER_BYTE;
> > > > +	if (*start >= max_frame)
> > > > +		return -ENXIO;
> > > > +
> > > > +	*end = *start + count * BITS_PER_BYTE;
> > > > +	if (*end > max_frame)
> > > > +		*end = max_frame;
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static bool page_really_idle(struct page *page)
> > > 
> > > Just minor:
> > > Instead of creating new API, could we combine page_is_idle with
> > > introducing furthere argument pte_check?
> > 
> > 
> > I cannot see in the code where pte_check will be false when this is called? I
> > could rename the function to page_idle_check_ptes() if that's Ok with you.
> 
> What I don't like is _*really*_ part of the funcion name.
> 
> I see several page_is_idle calls in huge_memory.c, migration.c, swap.c.
> They could just check only page flag so they could use "false" with pte_check.

I will rename it to page_idle_check_ptes(). If you want pte_check argument,
that can be a later patch if/when there are other users for it in other
files. Hope that's reasonable.


> > > > +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> > > > +			       size_t count, loff_t *pos,
> > > > +			       struct task_struct *tsk, int write)
> > > > +{
> > > > +	int ret;
> > > > +	char *buffer;
> > > > +	u64 *out;
> > > > +	unsigned long start_addr, end_addr, start_frame, end_frame;
> > > > +	struct mm_struct *mm = file->private_data;
> > > > +	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
> > > > +	struct page_node *cur;
> > > > +	struct page_idle_proc_priv priv;
> > > > +	bool walk_error = false;
> > > > +	LIST_HEAD(idle_page_list);
> > > > +
> > > > +	if (!mm || !mmget_not_zero(mm))
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (count > PAGE_SIZE)
> > > > +		count = PAGE_SIZE;
> > > > +
> > > > +	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > > > +	if (!buffer) {
> > > > +		ret = -ENOMEM;
> > > > +		goto out_mmput;
> > > > +	}
> > > > +	out = (u64 *)buffer;
> > > > +
> > > > +	if (write && copy_from_user(buffer, ubuff, count)) {
> > > > +		ret = -EFAULT;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
> > > > +	if (ret)
> > > > +		goto out;
> > > > +
> > > > +	start_addr = (start_frame << PAGE_SHIFT);
> > > > +	end_addr = (end_frame << PAGE_SHIFT);
> > > > +	priv.buffer = buffer;
> > > > +	priv.start_addr = start_addr;
> > > > +	priv.write = write;
> > > > +
> > > > +	priv.idle_page_list = &idle_page_list;
> > > > +	priv.cur_page_node = 0;
> > > > +	priv.page_nodes = kzalloc(sizeof(struct page_node) *
> > > > +				  (end_frame - start_frame), GFP_KERNEL);
> > > > +	if (!priv.page_nodes) {
> > > > +		ret = -ENOMEM;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	walk.private = &priv;
> > > > +	walk.mm = mm;
> > > > +
> > > > +	down_read(&mm->mmap_sem);
> > > > +
> > > > +	/*
> > > > +	 * idle_page_list is needed because walk_page_vma() holds ptlock which
> > > > +	 * deadlocks with page_idle_clear_pte_refs(). So we have to collect all
> > > > +	 * pages first, and then call page_idle_clear_pte_refs().
> > > > +	 */
> > > 
> > > Thanks for the comment, I was curious why you want to have
> > > idle_page_list and the reason is here.
> > > 
> > > How about making this /proc/<pid>/page_idle per-process granuariy,
> > > unlike system level /sys/xxx/page_idle? What I meant is not to check
> > > rmap to see any reference from random process but just check only
> > > access from the target process. It would be more proper as /proc/
> > > <pid>/ interface and good for per-process tracking as well as
> > > fast.
> > 
> > 
> > I prefer not to do this for the following reasons:
> > (1) It makes a feature lost, now accesses to shared pages will not be
> > accounted properly. 
> 
> Do you really want to check global attribute by per-process interface?

Pages are inherrently not per-process, they are global. A page does not
necessarily belong to a process. An anonymous page can be shared. We are
operating on pages in the end of the day.

I think you are confusing the per-process file interface with the core
mechanism. The core mechanism always operations on physical PAGES.


> That would be doable with existing idle page tracking feature and that's
> the one of reasons page idle tracking was born(e.g. even, page cache
> for non-mapped) unlike clear_refs.

I think you are misunderstanding the patch, the patch does not want to change
the core mechanism. That is a bit out of scope for the patch. Page
idle-tracking at the core of it looks at PTE of all processes. We are just
using the VFN (virtual frame) interface to skip the need for separate pagemap
look up -- that's it.


> Once we create a new interface by per-process, just checking the process
> -granuariy access check sounds more reasonable to me.

It sounds reasonable but there is no reason to not do the full and proper
page tracking for now, including shared pages. Otherwise it makes it
inconsistent with the existing mechanism and can confuse the user about what
to expect (especially for shared pages).


> With that, we could catch only idle pages of the target process even though
> the page was touched by several other processes.
> If the user want to know global level access point, they could use
> exisint interface(If there is a concern(e.g., security) to use existing
> idle page tracking, let's discuss it as other topic how we could make
> existing feature more useful).
> 
> IOW, my point is that we already have global access check(1. from ptes
> among several processes, 2. from page flag for non-mapped pages) feature
> from from existing idle page tracking interface and now we are about to create
> new interface for per-process wise so I wanted to create a particular
> feature which cannot be covered by existing iterface.

Yes, it sounds like you want to create a different feature. Then that can be
a follow-up different patch, and that is out of scope for this patch.


> > (2) It makes it inconsistent with other idle page tracking mechanism. I
> 
> That's the my comment to create different idle page tracking we couldn't
> do with existing interface.

Yes, sure. But that can be a different patch and we can weigh the benefits of
it at that time. I don't want to introduce a new page tracking mechanism, I
am just trying to reuse the existing one.


> > prefer if post per-process. At the heart of it, the tracking is always at the
> 
> What does it mean "post per-process"?

Sorry it was a typo, I meant "the core mechanism should not be a per-process
one, but a global one". We are just changing the interface in this patch, we
are not changing the existing core mechanism. That gives us all the benefits
of the existing code such as non-interference with page reclaim code, without
introducing any new bugs. By the way I did fix a bug in the existing original
code as well!


> > physical page level -- I feel that is how it should be. Other drawback, is
> > also we have to document this subtlety.
> 
> Sorry, Could you elaborate it a bit?

I meant, with a new mechanism as the one you are proposing, we have to
document that now shared pages will not be tracked properly. That is a
'subtle difference' and will have to be documented appropriated in the
'internals' section of the idle page tracking document.

thanks,

 - Joel


^ permalink raw reply

* Re: [PATCH v12 01/11] MODSIGN: Export module signature definitions
From: Philipp Rudo @ 2019-08-05 13:11 UTC (permalink / raw)
  To: Thiago Jung Bauermann
  Cc: Jessica Yu, linux-integrity, linux-security-module, keyrings,
	linux-crypto, linuxppc-dev, linux-doc, linux-kernel, Mimi Zohar,
	Dmitry Kasatkin, James Morris, Serge E. Hallyn, David Howells,
	David Woodhouse, Herbert Xu, David S. Miller, Jonathan Corbet,
	AKASHI, Takahiro, Heiko Carstens, linux-s390
In-Reply-To: <8736iw9y00.fsf@morokweng.localdomain>

Hi Thiago,

> > The patch looks good now.  
> 
> Thanks! Can I add your Reviewed-by?

sorry, for the late answer, but I was on vacation the last two weeks. I hope
it's not too late now.

Reviewed-by: Philipp Rudo <prudo@linux.ibm.com>


^ permalink raw reply

* [PATCH] Input: docs: fix spelling mistake "potocol" -> "protocol"
From: Colin King @ 2019-08-05 10:49 UTC (permalink / raw)
  To: Henrik Rydberg, Dmitry Torokhov, Jonathan Corbet, linux-input,
	linux-doc
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

There is a minor spelling mistake in the documentation, fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 Documentation/input/multi-touch-protocol.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/input/multi-touch-protocol.rst b/Documentation/input/multi-touch-protocol.rst
index 6be70342e709..307fe22d9668 100644
--- a/Documentation/input/multi-touch-protocol.rst
+++ b/Documentation/input/multi-touch-protocol.rst
@@ -23,7 +23,7 @@ devices capable of tracking identifiable contacts (type B), the protocol
 describes how to send updates for individual contacts via event slots.
 
 .. note::
-   MT potocol type A is obsolete, all kernel drivers have been
+   MT protocol type A is obsolete, all kernel drivers have been
    converted to use type B.
 
 Protocol Usage
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH 7/9] arm/arm64: Provide a wrapper for SMCCC 1.1 calls
From: Will Deacon @ 2019-08-05 10:03 UTC (permalink / raw)
  To: Steven Price
  Cc: Catalin Marinas, Marc Zyngier, Paolo Bonzini,
	Radim Krčmář, Russell King, James Morse,
	Julien Thierry, Suzuki K Pouloze, kvm, kvmarm, linux-arm-kernel,
	linux-doc, linux-kernel
In-Reply-To: <20190802145017.42543-8-steven.price@arm.com>

On Fri, Aug 02, 2019 at 03:50:15PM +0100, Steven Price wrote:
> SMCCC 1.1 calls may use either HVC or SMC depending on the PSCI
> conduit. Rather than coding this in every call site provide a macro
> which uses the correct instruction. The macro also handles the case
> where no PSCI conduit is configured returning a not supported error
> in res, along with returning the conduit used for the call.
> 
> This allow us to remove some duplicated code and will be useful later
> when adding paravirtualized time hypervisor calls.
> 
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
>  include/linux/arm-smccc.h | 44 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)

Acked-by: Will Deacon <will@kernel.org>

Will

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox