Linux Documentation

Linux Documentation
 help / color / mirror / Atom feed

* [PATCH v5 3/6] [RFC] x86: Add support for idle bit in swap PTE
From: Joel Fernandes (Google) @ 2019-08-07 17:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google), Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas, Christian Hansen,
	dancol, fmayer, H. Peter Anvin, Ingo Molnar, joelaf,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport, minchan,
	namhyung, paulmck, Robin Murphy, Roman Gushchin, Stephen Rothwell,
	surenb, Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon
In-Reply-To: <20190807171559.182301-1-joel@joelfernandes.org>

This bit will be used by idle page tracking code to correctly identify
if a page that was swapped out was idle before it got swapped out.
Without this PTE bit, we lose information about if a page is idle or not
since the page frame gets unmapped and the page gets freed.

Bits 2-6 are unused in the swap PTE (see the comment in
arch/x86/include/asm/pgtable_64.h). Bit 2 corresponds to _PAGE_USER. Use
it for swap PTE purposes.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/x86/Kconfig                     |  1 +
 arch/x86/include/asm/pgtable.h       | 15 +++++++++++++++
 arch/x86/include/asm/pgtable_types.h |  6 ++++++
 3 files changed, 22 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 222855cc0158..728f22370f17 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -139,6 +139,7 @@ config X86
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if MMU && COMPAT
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
+	select HAVE_ARCH_PTE_SWP_PGIDLE
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_STACKLEAK
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0bc530c4eb13..ef3e662cee4a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1371,6 +1371,21 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
 #endif
 #endif
 
+static inline pte_t pte_swp_mkpage_idle(pte_t pte)
+{
+	return pte_set_flags(pte, _PAGE_SWP_PGIDLE);
+}
+
+static inline int pte_swp_page_idle(pte_t pte)
+{
+	return pte_flags(pte) & _PAGE_SWP_PGIDLE;
+}
+
+static inline pte_t pte_swp_clear_mkpage_idle(pte_t pte)
+{
+	return pte_clear_flags(pte, _PAGE_SWP_PGIDLE);
+}
+
 #define PKRU_AD_BIT 0x1
 #define PKRU_WD_BIT 0x2
 #define PKRU_BITS_PER_PKEY 2
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b5e49e6bac63..6739cba4c900 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -100,6 +100,12 @@
 #define _PAGE_SWP_SOFT_DIRTY	(_AT(pteval_t, 0))
 #endif
 
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+#define _PAGE_SWP_PGIDLE	_PAGE_USER
+#else
+#define _PAGE_SWP_PGIDLE	(_AT(pteval_t, 0))
+#endif
+
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
 #define _PAGE_DEVMAP	(_AT(u64, 1) << _PAGE_BIT_DEVMAP)
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH v5 2/6] mm/page_idle: Add support for handling swapped PG_Idle pages
From: Joel Fernandes (Google) @ 2019-08-07 17:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google), Minchan Kim, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	joelaf, Jonathan Corbet, Kees Cook, kernel-team, linux-api,
	linux-doc, linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport,
	namhyung, paulmck, Robin Murphy, Roman Gushchin, Stephen Rothwell,
	surenb, Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon
In-Reply-To: <20190807171559.182301-1-joel@joelfernandes.org>

Idle page tracking currently does not work well in the following
scenario:
 1. mark page-A idle which was present at that time.
 2. run workload
 3. page-A is not touched by workload
 4. *sudden* memory pressure happen so finally page A is finally swapped out
 5. now see the page A - it appears as if it was accessed (pte unmapped
    so idle bit not set in output) - but it's incorrect.

To fix this, we store the idle information into a new idle bit of the
swap PTE during swapping of anonymous pages.

Also in the future, madvise extensions will allow a system process
manager (like Android's ActivityManager) to swap pages out of a process
that it knows will be cold. To an external process like a heap profiler
that is doing idle tracking on another process, this procedure will
interfere with the idle page tracking similar to the above steps.

Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/Kconfig                  |  3 +++
 include/asm-generic/pgtable.h |  6 ++++++
 mm/page_idle.c                | 26 ++++++++++++++++++++++++--
 mm/rmap.c                     |  2 ++
 4 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a7b57dd42c26..3aa121ce824e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -575,6 +575,9 @@ config ARCH_WANT_HUGE_PMD_SHARE
 config HAVE_ARCH_SOFT_DIRTY
 	bool
 
+config HAVE_ARCH_PTE_SWP_PGIDLE
+	bool
+
 config HAVE_MOD_ARCH_SPECIFIC
 	bool
 	help
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 75d9d68a6de7..6d51d0a355a7 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -712,6 +712,12 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif
 
+#ifndef CONFIG_HAVE_ARCH_PTE_SWP_PGIDLE
+static inline pte_t pte_swp_mkpage_idle(pte_t pte) { return pte; }
+static inline int pte_swp_page_idle(pte_t pte) { return 0; }
+static inline pte_t pte_swp_clear_mkpage_idle(pte_t pte) { return pte; }
+#endif
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 #ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 9de4f4c67a8c..2766d4ab348c 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -276,7 +276,7 @@ struct page_idle_proc_priv {
 };
 
 /*
- * Add page to list to be set as idle later.
+ * Set a page as idle or add it to a list to be set as idle later.
  */
 static void pte_page_idle_proc_add(struct page *page,
 			       unsigned long addr, struct mm_walk *walk)
@@ -303,6 +303,13 @@ static void pte_page_idle_proc_add(struct page *page,
 		page_get = page_idle_get_page(page);
 		if (!page_get)
 			return;
+	} else {
+		/* For swapped pages, set output bit as idle */
+		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
+		bit = frames % BITMAP_CHUNK_BITS;
+		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
+		*chunk |= (1 << bit);
+		return;
 	}
 
 	/*
@@ -323,6 +330,7 @@ static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	struct page *page;
 	struct vm_area_struct *vma = walk->vma;
+	struct page_idle_proc_priv *priv = walk->private;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -341,6 +349,19 @@ static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		/* For swap_pte handling, we use an idle bit in the swap pte. */
+		if (is_swap_pte(*pte)) {
+			if (priv->write) {
+				set_pte_at(walk->mm, addr, pte,
+					   pte_swp_mkpage_idle(*pte));
+			} else {
+				/* If swap pte has idle bit set, report it as idle */
+				if (pte_swp_page_idle(*pte))
+					pte_page_idle_proc_add(NULL, addr, walk);
+			}
+			continue;
+		}
+
 		if (!pte_present(*pte))
 			continue;
 
@@ -432,7 +453,8 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
 				set_page_idle(page);
 			}
 		} else {
-			if (page_idle_pte_check(page)) {
+			/* If page is NULL, it was swapped out */
+			if (!page || page_idle_pte_check(page)) {
 				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
 				bit = off % BITMAP_CHUNK_BITS;
 				index = off / BITMAP_CHUNK_BITS;
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..4bd618aab402 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1629,6 +1629,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (page_is_idle(page))
+				swp_pte = pte_swp_mkpage_idle(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
 			/* Invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH v5 6/6] doc: Update documentation for page_idle virtual address indexing
From: Joel Fernandes (Google) @ 2019-08-07 17:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google), Mike Rapoport, Sandeep Patil,
	Alexey Dobriyan, Andrew Morton, Borislav Petkov, Brendan Gregg,
	Catalin Marinas, Christian Hansen, dancol, fmayer, H. Peter Anvin,
	Ingo Molnar, joelaf, Jonathan Corbet, Kees Cook, kernel-team,
	linux-api, linux-doc, linux-fsdevel, linux-mm, Michal Hocko,
	minchan, namhyung, paulmck, Robin Murphy, Roman Gushchin,
	Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon
In-Reply-To: <20190807171559.182301-1-joel@joelfernandes.org>

This patch updates the documentation with the new page_idle tracking
feature which uses virtual address indexing.

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Sandeep Patil <sspatil@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 .../admin-guide/mm/idle_page_tracking.rst     | 43 ++++++++++++++++---
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst
index df9394fb39c2..9eef32000f5e 100644
--- a/Documentation/admin-guide/mm/idle_page_tracking.rst
+++ b/Documentation/admin-guide/mm/idle_page_tracking.rst
@@ -19,10 +19,14 @@ It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
 
 User API
 ========
+There are 2 ways to access the idle page tracking API. One uses physical
+address indexing, another uses a simpler virtual address indexing scheme.
 
-The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
-Currently, it consists of the only read-write file,
-``/sys/kernel/mm/page_idle/bitmap``.
+Physical address indexing
+-------------------------
+The idle page tracking API for physical address indexing using page frame
+numbers (PFN) is located at ``/sys/kernel/mm/page_idle``.  Currently, it
+consists of the only read-write file, ``/sys/kernel/mm/page_idle/bitmap``.
 
 The file implements a bitmap where each bit corresponds to a memory page. The
 bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
@@ -74,6 +78,31 @@ See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
 information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
 ``/proc/kpagecgroup``.
 
+Virtual address indexing
+------------------------
+The idle page tracking API for virtual address indexing using virtual frame
+numbers (VFN) for a process ``<pid>`` is located at ``/proc/<pid>/page_idle``.
+It is a bitmap that follows the same semantics as
+``/sys/kernel/mm/page_idle/bitmap`` except that it uses virtual instead of
+physical frame numbers.
+
+This idle page tracking API does not deal with PFN so it does not require prior
+lookups of ``pagemap``. This is an advantage on some systems where looking up
+PFN is considered a security issue.  Also in some cases, this interface could
+be slightly more reliable to use than physical address indexing, since in
+physical address indexing, address space changes can occur between reading the
+``pagemap`` and reading the ``bitmap``, while in virtual address indexing, the
+process's ``mmap_sem`` is held for the duration of the access.
+
+To estimate the amount of pages that are not used by a workload one should:
+
+ 1. Mark all the workload's pages as idle by setting corresponding bits in
+    ``/proc/<pid>/page_idle``.
+
+ 2. Wait until the workload accesses its working set.
+
+ 3. Read ``/proc/<pid>/page_idle`` and count the number of bits set.
+
 .. _impl_details:
 
 Implementation Details
@@ -99,10 +128,10 @@ When a dirty page is written to swap or disk as a result of memory reclaim or
 exceeding the dirty memory limit, it is not marked referenced.
 
 The idle memory tracking feature adds a new page flag, the Idle flag. This flag
-is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
-:ref:`User API <user_api>`
-section), and cleared automatically whenever a page is referenced as defined
-above.
+is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` for physical
+addressing or by writing to ``/proc/<pid>/page_idle`` for virtual
+addressing (see the :ref:`User API <user_api>` section), and cleared
+automatically whenever a page is referenced as defined above.
 
 When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
 mapped to, otherwise we will not be able to detect accesses to the page coming
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH v5 4/6] [RFC] arm64: Add support for idle bit in swap PTE
From: Joel Fernandes (Google) @ 2019-08-07 17:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google), Robin Murphy, Alexey Dobriyan,
	Andrew Morton, Borislav Petkov, Brendan Gregg, Catalin Marinas,
	Christian Hansen, dancol, fmayer, H. Peter Anvin, Ingo Molnar,
	joelaf, Jonathan Corbet, Kees Cook, kernel-team, linux-api,
	linux-doc, linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport,
	minchan, namhyung, paulmck, Roman Gushchin, Stephen Rothwell,
	surenb, Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon
In-Reply-To: <20190807171559.182301-1-joel@joelfernandes.org>

This bit will be used by idle page tracking code to correctly identify
if a page that was swapped out was idle before it got swapped out.
Without this PTE bit, we lose information about if a page is idle or not
since the page frame gets unmapped.

In this patch we reuse PTE_DEVMAP bit since idle page tracking only
works on user pages in the LRU. Device pages should not consitute those
so it should be unused and safe to use.

Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/arm64/Kconfig                    |  1 +
 arch/arm64/include/asm/pgtable-prot.h |  1 +
 arch/arm64/include/asm/pgtable.h      | 15 +++++++++++++++
 3 files changed, 17 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3adcec05b1f6..9d1412c693d7 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -128,6 +128,7 @@ config ARM64
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
 	select HAVE_ARCH_PREL32_RELOCATIONS
+	select HAVE_ARCH_PTE_SWP_PGIDLE
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 92d2e9f28f28..917b15c5d63a 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -18,6 +18,7 @@
 #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
 #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
 #define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
+#define PTE_SWP_PGIDLE		PTE_DEVMAP		 /* for idle page tracking during swapout */
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3f5461f7b560..558f5ebd81ba 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -212,6 +212,21 @@ static inline pte_t pte_mkdevmap(pte_t pte)
 	return set_pte_bit(pte, __pgprot(PTE_DEVMAP));
 }
 
+static inline int pte_swp_page_idle(pte_t pte)
+{
+	return 0;
+}
+
+static inline pte_t pte_swp_mkpage_idle(pte_t pte)
+{
+	return set_pte_bit(pte, __pgprot(PTE_SWP_PGIDLE));
+}
+
+static inline pte_t pte_swp_clear_page_idle(pte_t pte)
+{
+	return clear_pte_bit(pte, __pgprot(PTE_SWP_PGIDLE));
+}
+
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
 	WRITE_ONCE(*ptep, pte);
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH v5 5/6] page_idle: Drain all LRU pagevec before idle tracking
From: Joel Fernandes (Google) @ 2019-08-07 17:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google), Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas, Christian Hansen,
	dancol, fmayer, H. Peter Anvin, Ingo Molnar, joelaf,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport, minchan,
	namhyung, paulmck, Robin Murphy, Roman Gushchin, Stephen Rothwell,
	surenb, Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon
In-Reply-To: <20190807171559.182301-1-joel@joelfernandes.org>

During idle page tracking, we see that sometimes faulted anon pages are in
pagevec but are not drained to LRU. Idle page tracking only considers pages
on LRU.

I am able to find multiple issues involving this. One issue looks like
idle tracking is completely broken. It shows up in my testing as if a
page that is marked as idle is always "accessed" -- because it was never
marked as idle (due to not draining of pagevec).

The other issue shows up as a failure during swapping (support for which
this series adds), with the following sequence:
 1. Allocate some pages
 2. Write to them
 3. Mark them as idle                                  <--- fails
 4. Introduce some memory pressure to induce swapping.
 5. Check the swap bit I introduced in this series.    <--- fails to set idle
                                                            bit in swap PTE.

To fix this, this patch drains all CPU's pagevec before starting idle tracking.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 mm/page_idle.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 2766d4ab348c..26440a497609 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -180,6 +180,13 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 	unsigned long pfn, end_pfn;
 	int bit, ret;
 
+	/*
+	 * Idle page tracking currently works only on LRU pages, so drain
+	 * them. This can cause slowness, but in the future we could
+	 * remove this operation if we are tracking non-LRU pages too.
+	 */
+	lru_add_drain_all();
+
 	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
 	if (ret == -ENXIO)
 		return 0;  /* Reads beyond max_pfn do nothing */
@@ -211,6 +218,13 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
 	unsigned long pfn, end_pfn;
 	int bit, ret;
 
+	/*
+	 * Idle page tracking currently works only on LRU pages, so drain
+	 * them. This can cause slowness, but in the future we could
+	 * remove this operation if we are tracking non-LRU pages too.
+	 */
+	lru_add_drain_all();
+
 	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
 	if (ret)
 		return ret;
@@ -428,6 +442,13 @@ ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
 	walk.private = &priv;
 	walk.mm = mm;
 
+	/*
+	 * Idle page tracking currently works only on LRU pages, so drain
+	 * them. This can cause slowness, but in the future we could
+	 * remove this operation if we are tracking non-LRU pages too.
+	 */
+	lru_add_drain_all();
+
 	down_read(&mm->mmap_sem);
 
 	/*
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH v5 1/6] mm/page_idle: Add per-pid idle page tracking using virtual index
From: Joel Fernandes (Google) @ 2019-08-07 17:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google), Alexey Dobriyan, Andrew Morton,
	Borislav Petkov, Brendan Gregg, Catalin Marinas, Christian Hansen,
	dancol, fmayer, H. Peter Anvin, Ingo Molnar, joelaf,
	Jonathan Corbet, Kees Cook, kernel-team, linux-api, linux-doc,
	linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport, minchan,
	namhyung, paulmck, Robin Murphy, Roman Gushchin, Stephen Rothwell,
	surenb, Thomas Gleixner, tkjos, Vladimir Davydov, Vlastimil Babka,
	Will Deacon

The page_idle tracking feature currently requires looking up the pagemap
for a process followed by interacting with /sys/kernel/mm/page_idle.
Looking up PFN from pagemap in Android devices is not supported by
unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.

This patch adds support to directly interact with page_idle tracking at
the PID level by introducing a /proc/<pid>/page_idle file.  It follows
the exact same semantics as the global /sys/kernel/mm/page_idle, but now
looking up PFN through pagemap is not needed since the interface uses
virtual frame numbers, and at the same time also does not require
SYS_ADMIN.

In Android, we are using this for the heap profiler (heapprofd) which
profiles and pin points code paths which allocates and leaves memory
idle for long periods of time. This method solves the security issue
with userspace learning the PFN, and while at it is also shown to yield
better results than the pagemap lookup, the theory being that the window
where the address space can change is reduced by eliminating the
intermediate pagemap look up stage. In virtual address indexing, the
process's mmap_sem is held for the duration of the access.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
v3->v4: Minor fixups (Minchan)
        Add swap pte handling (Konstantin, Minchan)
v2->v3:
Fixed a bug where I was doing a kfree that is not needed due to not
needing to do GFP_ATOMIC allocations.

v1->v2:
Mark swap ptes as idle (Minchan)
Avoid need for GFP_ATOMIC (Andrew)
Get rid of idle_page_list lock by moving list to stack

Internal review -> v1:
Fixes from Suren.
Corrections to change log, docs (Florian, Sandeep)

 fs/proc/base.c            |   3 +
 fs/proc/internal.h        |   1 +
 fs/proc/task_mmu.c        |  42 +++++
 include/linux/page_idle.h |   4 +
 mm/page_idle.c            | 337 +++++++++++++++++++++++++++++++++-----
 5 files changed, 342 insertions(+), 45 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..fd2f74bd4e35 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3039,6 +3039,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
+#endif
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cd0c8d5ce9a1..bc9371880c63 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
 extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_page_idle_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 582c5e680176..192ffc4e24d7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1650,6 +1650,48 @@ const struct file_operations proc_pagemap_operations = {
 	.open		= pagemap_open,
 	.release	= pagemap_release,
 };
+
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
+				   size_t count, loff_t *ppos)
+{
+	return page_idle_proc_read(file, buf, count, ppos);
+}
+
+static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	return page_idle_proc_write(file, (char __user *)buf, count, ppos);
+}
+
+static int proc_page_idle_open(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm;
+
+	mm = proc_mem_open(inode, PTRACE_MODE_READ);
+	if (IS_ERR(mm))
+		return PTR_ERR(mm);
+	file->private_data = mm;
+	return 0;
+}
+
+static int proc_page_idle_release(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm = file->private_data;
+
+	mmdrop(mm);
+	return 0;
+}
+
+const struct file_operations proc_page_idle_operations = {
+	.llseek		= mem_lseek, /* borrow this */
+	.read		= proc_page_idle_read,
+	.write		= proc_page_idle_write,
+	.open		= proc_page_idle_open,
+	.release	= proc_page_idle_release,
+};
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..a765a6d14e1a 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
 }
 #endif /* CONFIG_64BIT */
 
+ssize_t page_idle_proc_write(struct file *file,
+	char __user *buf, size_t count, loff_t *ppos);
+ssize_t page_idle_proc_read(struct file *file,
+	char __user *buf, size_t count, loff_t *ppos);
 #else /* !CONFIG_IDLE_PAGE_TRACKING */
 
 static inline bool page_is_young(struct page *page)
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 295512465065..9de4f4c67a8c 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -5,17 +5,22 @@
 #include <linux/sysfs.h>
 #include <linux/kobject.h>
 #include <linux/mm.h>
-#include <linux/mmzone.h>
-#include <linux/pagemap.h>
-#include <linux/rmap.h>
 #include <linux/mmu_notifier.h>
+#include <linux/mmzone.h>
 #include <linux/page_ext.h>
 #include <linux/page_idle.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/sched/mm.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #define BITMAP_CHUNK_SIZE	sizeof(u64)
 #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
 
 /*
+ * Get a reference to a page for idle tracking purposes, with additional checks.
+ *
  * Idle page tracking only considers user memory pages, for other types of
  * pages the idle flag is always unset and an attempt to set it is silently
  * ignored.
@@ -25,18 +30,13 @@
  * page tracking. With such an indicator of user pages we can skip isolated
  * pages, but since there are not usually many of them, it will hardly affect
  * the overall result.
- *
- * This function tries to get a user memory page by pfn as described above.
  */
-static struct page *page_idle_get_page(unsigned long pfn)
+static struct page *page_idle_get_page(struct page *page_in)
 {
 	struct page *page;
 	pg_data_t *pgdat;
 
-	if (!pfn_valid(pfn))
-		return NULL;
-
-	page = pfn_to_page(pfn);
+	page = page_in;
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;
@@ -51,6 +51,18 @@ static struct page *page_idle_get_page(unsigned long pfn)
 	return page;
 }
 
+/*
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *page_idle_get_page_pfn(unsigned long pfn)
+{
+
+	if (!pfn_valid(pfn))
+		return NULL;
+
+	return page_idle_get_page(pfn_to_page(pfn));
+}
+
 static bool page_idle_clear_pte_refs_one(struct page *page,
 					struct vm_area_struct *vma,
 					unsigned long addr, void *arg)
@@ -118,6 +130,47 @@ static void page_idle_clear_pte_refs(struct page *page)
 		unlock_page(page);
 }
 
+/* Helper to get the start and end frame given a pos and count */
+static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long max_frame;
+
+	/* If an mm is not given, assume we want physical frames */
+	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
+
+	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
+		return -EINVAL;
+
+	*start = pos * BITS_PER_BYTE;
+	if (*start >= max_frame)
+		return -ENXIO;
+
+	*end = *start + count * BITS_PER_BYTE;
+	if (*end > max_frame)
+		*end = max_frame;
+	return 0;
+}
+
+static bool page_idle_pte_check(struct page *page)
+{
+	if (!page)
+		return false;
+
+	if (page_is_idle(page)) {
+		/*
+		 * The page might have been referenced via a
+		 * pte, in which case it is not idle. Clear
+		 * refs and recheck.
+		 */
+		page_idle_clear_pte_refs(page);
+		if (page_is_idle(page))
+			return true;
+	}
+
+	return false;
+}
+
 static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 				     struct bin_attribute *attr, char *buf,
 				     loff_t pos, size_t count)
@@ -125,35 +178,21 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 	u64 *out = (u64 *)buf;
 	struct page *page;
 	unsigned long pfn, end_pfn;
-	int bit;
+	int bit, ret;
 
-	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
-		return -EINVAL;
-
-	pfn = pos * BITS_PER_BYTE;
-	if (pfn >= max_pfn)
-		return 0;
-
-	end_pfn = pfn + count * BITS_PER_BYTE;
-	if (end_pfn > max_pfn)
-		end_pfn = max_pfn;
+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret == -ENXIO)
+		return 0;  /* Reads beyond max_pfn do nothing */
+	else if (ret)
+		return ret;
 
 	for (; pfn < end_pfn; pfn++) {
 		bit = pfn % BITMAP_CHUNK_BITS;
 		if (!bit)
 			*out = 0ULL;
-		page = page_idle_get_page(pfn);
-		if (page) {
-			if (page_is_idle(page)) {
-				/*
-				 * The page might have been referenced via a
-				 * pte, in which case it is not idle. Clear
-				 * refs and recheck.
-				 */
-				page_idle_clear_pte_refs(page);
-				if (page_is_idle(page))
-					*out |= 1ULL << bit;
-			}
+		page = page_idle_get_page_pfn(pfn);
+		if (page && page_idle_pte_check(page)) {
+			*out |= 1ULL << bit;
 			put_page(page);
 		}
 		if (bit == BITMAP_CHUNK_BITS - 1)
@@ -170,23 +209,16 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
 	const u64 *in = (u64 *)buf;
 	struct page *page;
 	unsigned long pfn, end_pfn;
-	int bit;
+	int bit, ret;
 
-	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
-		return -EINVAL;
-
-	pfn = pos * BITS_PER_BYTE;
-	if (pfn >= max_pfn)
-		return -ENXIO;
-
-	end_pfn = pfn + count * BITS_PER_BYTE;
-	if (end_pfn > max_pfn)
-		end_pfn = max_pfn;
+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret)
+		return ret;
 
 	for (; pfn < end_pfn; pfn++) {
 		bit = pfn % BITMAP_CHUNK_BITS;
 		if ((*in >> bit) & 1) {
-			page = page_idle_get_page(pfn);
+			page = page_idle_get_page_pfn(pfn);
 			if (page) {
 				page_idle_clear_pte_refs(page);
 				set_page_idle(page);
@@ -224,6 +256,221 @@ struct page_ext_operations page_idle_ops = {
 };
 #endif
 
+/*  page_idle tracking for /proc/<pid>/page_idle */
+
+struct page_node {
+	struct page *page;
+	unsigned long addr;
+	struct list_head list;
+};
+
+struct page_idle_proc_priv {
+	unsigned long start_addr;
+	char *buffer;
+	int write;
+
+	/* Pre-allocate and provide nodes to pte_page_idle_proc_add() */
+	struct page_node *page_nodes;
+	int cur_page_node;
+	struct list_head *idle_page_list;
+};
+
+/*
+ * Add page to list to be set as idle later.
+ */
+static void pte_page_idle_proc_add(struct page *page,
+			       unsigned long addr, struct mm_walk *walk)
+{
+	struct page *page_get = NULL;
+	struct page_node *pn;
+	int bit;
+	unsigned long frames;
+	struct page_idle_proc_priv *priv = walk->private;
+	u64 *chunk = (u64 *)priv->buffer;
+
+	if (priv->write) {
+		VM_BUG_ON(!page);
+
+		/* Find whether this page was asked to be marked */
+		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
+		bit = frames % BITMAP_CHUNK_BITS;
+		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
+		if (((*chunk >> bit) & 1) == 0)
+			return;
+	}
+
+	if (page) {
+		page_get = page_idle_get_page(page);
+		if (!page_get)
+			return;
+	}
+
+	/*
+	 * For all other pages, add it to a list since we have to walk rmap,
+	 * which acquires ptlock, and we cannot walk rmap right now.
+	 */
+	pn = &(priv->page_nodes[priv->cur_page_node++]);
+	pn->page = page_get;
+	pn->addr = addr;
+	list_add(&pn->list, priv->idle_page_list);
+}
+
+static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	pte_t *pte;
+	spinlock_t *ptl;
+	struct page *page;
+	struct vm_area_struct *vma = walk->vma;
+
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		if (pmd_present(*pmd)) {
+			page = follow_trans_huge_pmd(vma, addr, pmd,
+						     FOLL_DUMP|FOLL_WRITE);
+			if (!IS_ERR_OR_NULL(page))
+				pte_page_idle_proc_add(page, addr, walk);
+		}
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		if (!pte_present(*pte))
+			continue;
+
+		page = vm_normal_page(vma, addr, *pte);
+		if (page)
+			pte_page_idle_proc_add(page, addr, walk);
+	}
+
+	pte_unmap_unlock(pte - 1, ptl);
+	return 0;
+}
+
+ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
+			       size_t count, loff_t *pos, int write)
+{
+	int ret;
+	char *buffer;
+	u64 *out;
+	unsigned long start_addr, end_addr, start_frame, end_frame;
+	struct mm_struct *mm = file->private_data;
+	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
+	struct page_node *cur;
+	struct page_idle_proc_priv priv;
+	bool walk_error = false;
+	LIST_HEAD(idle_page_list);
+
+	if (!mm || !mmget_not_zero(mm))
+		return -EINVAL;
+
+	if (count > PAGE_SIZE)
+		count = PAGE_SIZE;
+
+	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buffer) {
+		ret = -ENOMEM;
+		goto out_mmput;
+	}
+	out = (u64 *)buffer;
+
+	if (write && copy_from_user(buffer, ubuff, count)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
+	if (ret)
+		goto out;
+
+	start_addr = (start_frame << PAGE_SHIFT);
+	end_addr = (end_frame << PAGE_SHIFT);
+	priv.buffer = buffer;
+	priv.start_addr = start_addr;
+	priv.write = write;
+
+	priv.idle_page_list = &idle_page_list;
+	priv.cur_page_node = 0;
+	priv.page_nodes = kzalloc(sizeof(struct page_node) *
+				  (end_frame - start_frame), GFP_KERNEL);
+	if (!priv.page_nodes) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	walk.private = &priv;
+	walk.mm = mm;
+
+	down_read(&mm->mmap_sem);
+
+	/*
+	 * idle_page_list is needed because walk_page_vma() holds ptlock which
+	 * deadlocks with page_idle_clear_pte_refs(). So we have to collect all
+	 * pages first, and then call page_idle_clear_pte_refs().
+	 */
+	ret = walk_page_range(start_addr, end_addr, &walk);
+	if (ret)
+		walk_error = true;
+
+	list_for_each_entry(cur, &idle_page_list, list) {
+		int bit, index;
+		unsigned long off;
+		struct page *page = cur->page;
+
+		if (unlikely(walk_error))
+			goto remove_page;
+
+		if (write) {
+			if (page) {
+				page_idle_clear_pte_refs(page);
+				set_page_idle(page);
+			}
+		} else {
+			if (page_idle_pte_check(page)) {
+				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
+				bit = off % BITMAP_CHUNK_BITS;
+				index = off / BITMAP_CHUNK_BITS;
+				out[index] |= 1ULL << bit;
+			}
+		}
+remove_page:
+		if (page)
+			put_page(page);
+	}
+
+	if (!write && !walk_error)
+		ret = copy_to_user(ubuff, buffer, count);
+
+	up_read(&mm->mmap_sem);
+	kfree(priv.page_nodes);
+out:
+	kfree(buffer);
+out_mmput:
+	mmput(mm);
+	if (!ret)
+		ret = count;
+	return ret;
+
+}
+
+ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
+			    size_t count, loff_t *pos)
+{
+	return page_idle_proc_generic(file, ubuff, count, pos, 0);
+}
+
+ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
+			     size_t count, loff_t *pos)
+{
+	return page_idle_proc_generic(file, ubuff, count, pos, 1);
+}
+
 static int __init page_idle_init(void)
 {
 	int err;
-- 
2.22.0.770.g0f2c4a37fd-goog

^ permalink raw reply related

* Re: [PATCH 5/6] tty: serial: Add linflexuart driver for S32V234
From: Stefan-gabriel Mirea @ 2019-08-07 17:05 UTC (permalink / raw)
  To: gregkh@linuxfoundation.org
  Cc: corbet@lwn.net, robh+dt@kernel.org, mark.rutland@arm.com,
	catalin.marinas@arm.com, will@kernel.org, shawnguo@kernel.org,
	Leo Li, jslaby@suse.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
	linux-serial@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Cosmin Stefan Stoica,
	Larisa Ileana Grigore
In-Reply-To: <20190807165646.GA6584@kroah.com>

On 8/7/2019 7:56 PM, gregkh@linuxfoundation.org wrote:
> On Wed, Aug 07, 2019 at 04:42:17PM +0000, Stefan-gabriel Mirea wrote:
>> On 8/6/2019 9:40 PM, gregkh@linuxfoundation.org wrote:
>>> On Tue, Aug 06, 2019 at 05:11:17PM +0000, Stefan-gabriel Mirea wrote:

<snip>

>>>> Other than that, I do not see anything wrong with the addition of a
>>>> define in serial_core.h for this purpose (which is also what most of the
>>>> serial drivers do, including amba-pl011.c, mentioned in
>>>> Documentation/driver-api/serial/driver.rst as providing the reference
>>>> implementation), so please be more specific.
>>>
>>> I am getting tired of dealing with merge issues with that list, and no
>>> one seems to be able to find where they are really needed for userspace,
>>> especially for new devices.  What happens if you do not have use it?
>>
>> I see. If I drop its usage completely and leave 'type' from the
>> uart_port as 0, uart_port_startup() will fail when finding that
>> uport->type == PORT_UNKNOWN at [1] (there may be other effects as well,
>> e.g. due to the check in uart_configure_port[2]).
>>
>> So I suppose that I need to define some nonzero 'PORT_KNOWN' macro in
>> the driver and use that one internally for 'type'. Is my understanding
>> correct? Will there be any problems if I define it to a positive integer
>> which is already assigned to another driver, according to serial_core.h?
> 
> Ugh, ok, that's messy, nevermind.  Keep the #define in there, I will try
> to figure out how to move all of these at once sometime in the future...
> 
> sorry for the noise.

No problem, thank you for your time.

Regards,
Stefan

^ permalink raw reply

* Re: [PATCH 5/6] tty: serial: Add linflexuart driver for S32V234
From: gregkh @ 2019-08-07 16:56 UTC (permalink / raw)
  To: Stefan-gabriel Mirea
  Cc: corbet@lwn.net, robh+dt@kernel.org, mark.rutland@arm.com,
	catalin.marinas@arm.com, will@kernel.org, shawnguo@kernel.org,
	Leo Li, jslaby@suse.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
	linux-serial@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Cosmin Stefan Stoica,
	Larisa Ileana Grigore
In-Reply-To: <VI1PR0402MB2863C4406C06B0BDA3581822DFD40@VI1PR0402MB2863.eurprd04.prod.outlook.com>

On Wed, Aug 07, 2019 at 04:42:17PM +0000, Stefan-gabriel Mirea wrote:
> On 8/6/2019 9:40 PM, gregkh@linuxfoundation.org wrote:
> > On Tue, Aug 06, 2019 at 05:11:17PM +0000, Stefan-gabriel Mirea wrote:
> >> On 8/5/2019 6:31 PM, gregkh@linuxfoundation.org wrote:
> >>> On Fri, Aug 02, 2019 at 07:47:23PM +0000, Stefan-gabriel Mirea wrote:
> >>>>
> >>>> +/* Freescale Linflex UART */
> >>>> +#define PORT_LINFLEXUART     121
> >>>
> >>> Do you really need this modified?
> >>
> >> Hello Greg,
> >>
> >> This macro is meant to be assigned to port->type in the config_port
> >> method from uart_ops, in order for verify_port to know if the received
> >> serial_struct structure was really targeted for a LINFlex port. It
> >> needs to be defined outside, to avoid "collisions" with other drivers.
> > 
> > Yes, I know what it goes to, but does anyone in userspace actually use
> > it?
> 
> No, we do not use it from userspace, but kept the pattern only for
> conformance.
> 
> >> Other than that, I do not see anything wrong with the addition of a
> >> define in serial_core.h for this purpose (which is also what most of the
> >> serial drivers do, including amba-pl011.c, mentioned in
> >> Documentation/driver-api/serial/driver.rst as providing the reference
> >> implementation), so please be more specific.
> > 
> > I am getting tired of dealing with merge issues with that list, and no
> > one seems to be able to find where they are really needed for userspace,
> > especially for new devices.  What happens if you do not have use it?
> 
> I see. If I drop its usage completely and leave 'type' from the
> uart_port as 0, uart_port_startup() will fail when finding that
> uport->type == PORT_UNKNOWN at [1] (there may be other effects as well,
> e.g. due to the check in uart_configure_port[2]).
> 
> So I suppose that I need to define some nonzero 'PORT_KNOWN' macro in
> the driver and use that one internally for 'type'. Is my understanding
> correct? Will there be any problems if I define it to a positive integer
> which is already assigned to another driver, according to serial_core.h?

Ugh, ok, that's messy, nevermind.  Keep the #define in there, I will try
to figure out how to move all of these at once sometime in the future...

sorry for the noise.

greg k-h

^ permalink raw reply

* Re: [PATCH 5/6] tty: serial: Add linflexuart driver for S32V234
From: Stefan-gabriel Mirea @ 2019-08-07 16:42 UTC (permalink / raw)
  To: gregkh@linuxfoundation.org
  Cc: corbet@lwn.net, robh+dt@kernel.org, mark.rutland@arm.com,
	catalin.marinas@arm.com, will@kernel.org, shawnguo@kernel.org,
	Leo Li, jslaby@suse.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
	linux-serial@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Cosmin Stefan Stoica,
	Larisa Ileana Grigore
In-Reply-To: <20190806184042.GA26041@kroah.com>

On 8/6/2019 9:40 PM, gregkh@linuxfoundation.org wrote:
> On Tue, Aug 06, 2019 at 05:11:17PM +0000, Stefan-gabriel Mirea wrote:
>> On 8/5/2019 6:31 PM, gregkh@linuxfoundation.org wrote:
>>> On Fri, Aug 02, 2019 at 07:47:23PM +0000, Stefan-gabriel Mirea wrote:
>>>>
>>>> +/* Freescale Linflex UART */
>>>> +#define PORT_LINFLEXUART     121
>>>
>>> Do you really need this modified?
>>
>> Hello Greg,
>>
>> This macro is meant to be assigned to port->type in the config_port
>> method from uart_ops, in order for verify_port to know if the received
>> serial_struct structure was really targeted for a LINFlex port. It
>> needs to be defined outside, to avoid "collisions" with other drivers.
> 
> Yes, I know what it goes to, but does anyone in userspace actually use
> it?

No, we do not use it from userspace, but kept the pattern only for
conformance.

>> Other than that, I do not see anything wrong with the addition of a
>> define in serial_core.h for this purpose (which is also what most of the
>> serial drivers do, including amba-pl011.c, mentioned in
>> Documentation/driver-api/serial/driver.rst as providing the reference
>> implementation), so please be more specific.
> 
> I am getting tired of dealing with merge issues with that list, and no
> one seems to be able to find where they are really needed for userspace,
> especially for new devices.  What happens if you do not have use it?

I see. If I drop its usage completely and leave 'type' from the
uart_port as 0, uart_port_startup() will fail when finding that
uport->type == PORT_UNKNOWN at [1] (there may be other effects as well,
e.g. due to the check in uart_configure_port[2]).

So I suppose that I need to define some nonzero 'PORT_KNOWN' macro in
the driver and use that one internally for 'type'. Is my understanding
correct? Will there be any problems if I define it to a positive integer
which is already assigned to another driver, according to serial_core.h?

Regards,
Stefan

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/tty/serial/serial_core.c?h=v5.3-rc1#n188
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/tty/serial/serial_core.c?h=v5.3-rc1#n2348

^ permalink raw reply

* Re: [PATCH] mm, slab: Extend slab/shrink to shrink all the memcg caches
From: Vlastimil Babka @ 2019-08-07 16:33 UTC (permalink / raw)
  To: Waiman Long, peter enderborg, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Alexander Viro,
	Jonathan Corbet, Luis Chamberlain, Kees Cook, Johannes Weiner,
	Michal Hocko, Vladimir Davydov
  Cc: linux-mm, linux-doc, linux-fsdevel, cgroups, linux-kernel,
	Roman Gushchin, Shakeel Butt, Andrea Arcangeli
In-Reply-To: <f878a00c-5d84-534b-deac-5736534a61cd@redhat.com>

On 7/23/19 4:30 PM, Waiman Long wrote:
> On 7/22/19 8:46 AM, peter enderborg wrote:
>> On 7/2/19 8:37 PM, Waiman Long wrote:
>>> Currently, a value of '1" is written to /sys/kernel/slab/<slab>/shrink
>>> file to shrink the slab by flushing all the per-cpu slabs and free
>>> slabs in partial lists. This applies only to the root caches, though.
>>>
>>> Extends this capability by shrinking all the child memcg caches and
>>> the root cache when a value of '2' is written to the shrink sysfs file.
>>>
>>> On a 4-socket 112-core 224-thread x86-64 system after a parallel kernel
>>> build, the the amount of memory occupied by slabs before shrinking
>>> slabs were:
>>>
>>>  # grep task_struct /proc/slabinfo
>>>  task_struct         7114   7296   7744    4    8 : tunables    0    0
>>>  0 : slabdata   1824   1824      0
>>>  # grep "^S[lRU]" /proc/meminfo
>>>  Slab:            1310444 kB
>>>  SReclaimable:     377604 kB
>>>  SUnreclaim:       932840 kB
>>>
>>> After shrinking slabs:
>>>
>>>  # grep "^S[lRU]" /proc/meminfo
>>>  Slab:             695652 kB
>>>  SReclaimable:     322796 kB
>>>  SUnreclaim:       372856 kB
>>>  # grep task_struct /proc/slabinfo
>>>  task_struct         2262   2572   7744    4    8 : tunables    0    0
>>>  0 : slabdata    643    643      0
>>
>> What is the time between this measurement points? Should not the shrinked memory show up as reclaimable?
> 
> In this case, I echoed '2' to all the shrink sysfs files under
> /sys/kernel/slab. The purpose of shrinking caches is to reclaim as much
> unused memory slabs from all the caches, irrespective if they are
> reclaimable or not.

Well, SReclaimable counts pages allocated by kmem caches with
SLAB_RECLAIM_ACCOUNT flags, which should match those that have a shrinker
associated and can thus actually reclaim objects. That shrinking slabs affected
SReclaimable just a bit while reducing SUnreclaim by more than 50% looks
certainly odd.
For example the task_struct cache is not a reclaimable one, yet shows massive
reduction. Could be that the reclaimable objects were pinning non-reclaimable
ones, so the shrinking had secondary effects in non-reclaimable caches.

> We do not reclaim any used objects. That is why we
> see the numbers were reduced in both cases.
> 
> Cheers,
> Longman
> 


^ permalink raw reply

* Re: [PATCH 0/6] hwspinlock: allow sharing of hwspinlocks
From: Suman Anna @ 2019-08-07 16:19 UTC (permalink / raw)
  To: Fabien DESSENNE, Bjorn Andersson
  Cc: Ohad Ben-Cohen, Rob Herring, Mark Rutland, Maxime Coquelin,
	Alexandre TORGUE, Jonathan Corbet,
	linux-remoteproc@vger.kernel.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-stm32@st-md-mailman.stormreply.com,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	Benjamin GAIGNARD
In-Reply-To: <02329102-5571-c6c1-b78c-693747133f0e@st.com>

Hi Fabien,

On 8/7/19 3:39 AM, Fabien DESSENNE wrote:
> Hi
> 
> On 06/08/2019 11:30 PM, Suman Anna wrote:
>> On 8/6/19 1:21 PM, Bjorn Andersson wrote:
>>> On Tue 06 Aug 10:38 PDT 2019, Suman Anna wrote:
>>>
>>>> Hi Fabien,
>>>>
>>>> On 8/5/19 12:46 PM, Bjorn Andersson wrote:
>>>>> On Mon 05 Aug 01:48 PDT 2019, Fabien DESSENNE wrote:
>>>>>
>>>>>> On 01/08/2019 9:14 PM, Bjorn Andersson wrote:
>>>>>>> On Wed 13 Mar 08:50 PDT 2019, Fabien Dessenne wrote:
>>> [..]
>>>>>> B/ This would introduce some inconsistency between the two 'request' API
>>>>>> which are hwspin_lock_request() and hwspin_lock_request_specific().
>>>>>> hwspin_lock_request() looks for an unused lock, so requests for an exclusive
>>>>>> usage. On the other side, request_specific() would request shared locks.
>>>>>> Worst the following sequence can transform an exclusive usage into a shared
>>>>>>
>>>>> There is already an inconsistency in between these; as with above any
>>>>> system that uses both request() and request_specific() will be suffering
>>>>> from intermittent failures due to probe ordering.
>>>>>
>>>>>> one:
>>>>>>     -hwspin_lock_request() -> returns Id#0 (exclusive)
>>>>>>     -hwspin_lock_request() -> returns Id#1 (exclusive)
>>>>>>     -hwspin_lock_request_specific(0) -> returns Id#0 and makes Id#0 shared
>>>>>> Honestly I am not sure that this is a real issue, but it's better to have it
>>>>>> in mind before we take ay decision
>>>> Wouldn't it be actually simpler to just introduce a new specific API
>>>> variant for this, similar to the reset core for example (it uses a
>>>> separate exclusive API), without having to modify the bindings at all.
>>>> It is just a case of your driver using the right API, and the core can
>>>> be modified to use the additional tag semantics based on the API. It
>>>> should avoid any confusion with say using a different second cell value
>>>> for the same lock in two different nodes.
>>>>
>>> But this implies that there is an actual need to hold these locks
>>> exclusively. Given that they are (except for the raw case) all wrapped
>>> by Linux locking primitives there shouldn't be a problem sharing a lock
>>> (except possibly for the raw case).
>> Yes agreed, the HWLOCK_RAW and HWLOCK_IN_ATOMIC cases are unprotected. I
>> am still trying to understand better the usecase to see if the same lock
>> is being multiplexed for different protection contexts, or if all of
>> them are protecting the same context.
> 
> 
> Here are two different examples that explain the need for changes.
> In both cases the Linux clients are talking to a single entity on the
> remote-side.
> 
> Example 1:
>      exti: interrupt-controller@5000d000 {
>          compatible = "st,stm32mp1-exti", "syscon";
>          interrupt-controller;
>          #interrupt-cells = <2>;
>          reg = <0x5000d000 0x400>;
>          hwlocks = <&hsem 1>;
>      };
> The two drivers (stm32mp1-exti and syscon) refer to the same hwlock.
> With the current hwspinlock implementation, only the first driver succeeds
> in requesting (hwspin_lock_request_specific) the hwlock. The second request
> fails.
> Here, we really need to share the hwlock between the two drivers.
> Note: hardware spinlock support for regmap was 'recently' introduced in 4.15
> see https://lore.kernel.org/patchwork/patch/845941/
> 
> 
> 
> Example 2:
> Here it is more a question of optimization : we want to save the number of
> hwlocks used to protect resources, using an unique hwlock to protect all
> pinctrl resources:
>          pinctrl: pin-controller@50002000 {
>              compatible = "st,stm32mp157-pinctrl";
>              ranges = <0 0x50002000 0xa400>;
>              hwlocks = <&hsem 0 1>;
> 
>          pinctrl_z: pin-controller-z@54004000 {
>              compatible = "st,stm32mp157-z-pinctrl";
>              ranges = <0 0x54004000 0x400>;
>              pins-are-numbered;
>              hwlocks = <&hsem 0 1>;

Thanks for the examples.

> 
>>
>>> I agree that we shouldn't specify this property in DT - if anything it
>>> should be a variant of the API.
> 
> 
> If we decide to add a 'shared' API, then, what about the generic regmap 
> driver?
> 
> In the context of above example1, this would require to update the 
> regmap driver.
> 
> But would this be acceptable for any driver using syscon/regmap?
> 
> 
> I think it is better to keep the existing API (modifying it so it always 
> allows
> 
> hwlocks sharing, so no need for bindings update) than adding another API.

For your usecases, you would definitely need the syscon/regmap behavior
to be shared right. Whether we introduce a 'shared' API or an
'exclusive' API and change the current API behavior to shared, it is
definitely a case-by-case usage scenario for the existing drivers and
usage right. The main contention point is what to do with the
unprotected usecases like Bjorn originally pointed out.

regards
Suman

> 
> 
> 
>>>
>>>> If you are sharing a hwlock on the Linux side, surely your driver should
>>>> be aware that it is a shared lock. The tag can be set during the first
>>>> request API, and you look through both tags when giving out a handle.
>>>>
>>> Why would the driver need to know about it?
>> Just the semantics if we were to support single user vs multiple users
>> on Linux-side to even get a handle. Your point is that this may be moot
>> since we have protection anyway other than the raw cases. But we need to
>> be able to have the same API work across all cases.
>>
>> So far, it had mostly been that there would be one user on Linux
>> competing with other equivalent peer entities on different processors.
>> It is not common to have multiple users since these protection schemes
>> are usually needed only at the lowest levels of a stack, so the
>> exclusive handle stuff had been sufficient.
>>
>>>> Obviously, the hwspin_lock_request() API usage semantics always had the
>>>> implied additional need for communicating the lock id to the other peer
>>>> entity, so a realistic usage is most always the specific API variant. I
>>>> doubt this API would be of much use for the shared driver usage. This
>>>> also implies that the client user does not care about specifying a lock
>>>> in DT.
>>>>
>>> Afaict if the lock are shared then there shouldn't be a problem with
>>> some clients using the request API and others request_specific(). As any
>>> collisions would simply mean that there are more contention on the lock.
>>>
>>> With the current exclusive model that is not possible and the success of
>>> the request_specific will depend on probe order.
>>>
>>> But perhaps it should be explicitly prohibited to use both APIs on the
>>> same hwspinlock instance?
>> Yeah, they are meant to be complimentary usage, though I doubt we will
>> ever have any realistic users for the generic API if we haven't had a
>> usage so far. I had posted a concept of reserved locks long back [1] to
>> keep away certain locks from the generic requestor, but dropped it since
>> we did not have an actual use-case needing it.
>>
>> regards
>> Suman
>>
>> [1] https://lwn.net/Articles/611944/


^ permalink raw reply

* [PATCH v7 2/2] arm64: Relax Documentation/arm64/tagged-pointers.rst
From: Catalin Marinas @ 2019-08-07 15:53 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Vincenzo Frascino, Will Deacon, Andrey Konovalov, Szabolcs Nagy,
	Kevin Brodsky, linux-doc, linux-arch, Dave Hansen
In-Reply-To: <20190807155321.9648-1-catalin.marinas@arm.com>

From: Vincenzo Frascino <vincenzo.frascino@arm.com>

On arm64 the TCR_EL1.TBI0 bit has been always enabled hence
the userspace (EL0) is allowed to set a non-zero value in the
top byte but the resulting pointers are not allowed at the
user-kernel syscall ABI boundary.

With the relaxed ABI proposed in this set, it is now possible to pass
tagged pointers to the syscalls, when these pointers are in memory
ranges obtained by an anonymous (MAP_ANONYMOUS) mmap().

Relax the requirements described in tagged-pointers.rst to be compliant
with the behaviours guaranteed by the ARM64 Tagged Address ABI.

Cc: Will Deacon <will.deacon@arm.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
[catalin.marinas@arm.com: minor tweaks]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
---
 Documentation/arm64/tagged-pointers.rst | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/Documentation/arm64/tagged-pointers.rst b/Documentation/arm64/tagged-pointers.rst
index 2acdec3ebbeb..82a3eff71a70 100644
--- a/Documentation/arm64/tagged-pointers.rst
+++ b/Documentation/arm64/tagged-pointers.rst
@@ -20,7 +20,8 @@ Passing tagged addresses to the kernel
 --------------------------------------
 
 All interpretation of userspace memory addresses by the kernel assumes
-an address tag of 0x00.
+an address tag of 0x00, unless the application enables the AArch64
+Tagged Address ABI explicitly.
 
 This includes, but is not limited to, addresses found in:
 
@@ -33,18 +34,23 @@ This includes, but is not limited to, addresses found in:
  - the frame pointer (x29) and frame records, e.g. when interpreting
    them to generate a backtrace or call graph.
 
-Using non-zero address tags in any of these locations may result in an
-error code being returned, a (fatal) signal being raised, or other modes
-of failure.
+Using non-zero address tags in any of these locations when the
+userspace application did not enable the AArch64 Tagged Address ABI may
+result in an error code being returned, a (fatal) signal being raised,
+or other modes of failure.
 
-For these reasons, passing non-zero address tags to the kernel via
-system calls is forbidden, and using a non-zero address tag for sp is
-strongly discouraged.
+For these reasons, when the AArch64 Tagged Address ABI is disabled,
+passing non-zero address tags to the kernel via system calls is
+forbidden, and using a non-zero address tag for sp is strongly
+discouraged.
 
 Programs maintaining a frame pointer and frame records that use non-zero
 address tags may suffer impaired or inaccurate debug and profiling
 visibility.
 
+The AArch64 Tagged Address ABI description and the guarantees it
+provides can be found in: Documentation/arm64/tagged-address-abi.rst.
+
 
 Preserving tags
 ---------------
@@ -59,6 +65,9 @@ be preserved.
 The architecture prevents the use of a tagged PC, so the upper byte will
 be set to a sign-extension of bit 55 on exception return.
 
+This behaviour is preserved when the AArch64 Tagged Address ABI is
+enabled.
+
 
 Other considerations
 --------------------

^ permalink raw reply related

* [PATCH v7 1/2] arm64: Define Documentation/arm64/tagged-address-abi.rst
From: Catalin Marinas @ 2019-08-07 15:53 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Vincenzo Frascino, Will Deacon, Andrey Konovalov, Szabolcs Nagy,
	Kevin Brodsky, linux-doc, linux-arch, Dave Hansen
In-Reply-To: <20190807155321.9648-1-catalin.marinas@arm.com>

From: Vincenzo Frascino <vincenzo.frascino@arm.com>

On arm64 the TCR_EL1.TBI0 bit has been always enabled hence
the userspace (EL0) is allowed to set a non-zero value in the
top byte but the resulting pointers are not allowed at the
user-kernel syscall ABI boundary.

With the relaxed ABI proposed through this document, it is now possible
to pass tagged pointers to the syscalls, when these pointers are in
memory ranges obtained by an anonymous (MAP_ANONYMOUS) mmap().

This change in the ABI requires a mechanism to requires the userspace
to opt-in to such an option.

Specify and document the way in which sysctl and prctl() can be used
in combination to allow the userspace to opt-in this feature.

Cc: Will Deacon <will.deacon@arm.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
[catalin.marinas@arm.com: some rewording, dropped MAP_PRIVATE]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
---
 Documentation/arm64/tagged-address-abi.rst | 151 +++++++++++++++++++++
 1 file changed, 151 insertions(+)
 create mode 100644 Documentation/arm64/tagged-address-abi.rst

diff --git a/Documentation/arm64/tagged-address-abi.rst b/Documentation/arm64/tagged-address-abi.rst
new file mode 100644
index 000000000000..f91a5d2ac865
--- /dev/null
+++ b/Documentation/arm64/tagged-address-abi.rst
@@ -0,0 +1,151 @@
+==========================
+AArch64 TAGGED ADDRESS ABI
+==========================
+
+Author: Vincenzo Frascino <vincenzo.frascino@arm.com>
+
+Date: 25 July 2019
+
+This document describes the usage and semantics of the Tagged Address
+ABI on AArch64 Linux.
+
+1. Introduction
+---------------
+
+On AArch64 the TCR_EL1.TBI0 bit has always been enabled, allowing userspace
+(EL0) to perform memory accesses through 64-bit pointers with a non-zero
+top byte. Such tagged pointers, however, were not allowed at the
+user-kernel syscall ABI boundary.
+
+This document describes the relaxation of the syscall ABI that allows
+userspace to pass certain tagged pointers to kernel syscalls, as described
+in section 2.
+
+2. AArch64 Tagged Address ABI
+-----------------------------
+
+From the kernel syscall interface perspective and for the purposes of this
+document, a "valid tagged pointer" is a pointer with a potentially non-zero
+top-byte that references an address in the user process address space
+obtained in one of the following ways:
+
+- mmap() done by the process itself (or its parent), where either:
+
+  - flags have the **MAP_ANONYMOUS** bit set
+  - the file descriptor refers to a regular file (including those returned
+    by memfd_create()) or **/dev/zero**
+
+- brk() system call done by the process itself (i.e. the heap area between
+  the initial location of the program break at process creation and its
+  current location).
+
+- any memory mapped by the kernel in the address space of the process
+  during creation and with the same restrictions as for mmap() above (e.g.
+  data, bss, stack).
+
+The AArch64 Tagged Address ABI is an opt-in feature and an application can
+control it via **prctl()** as follows:
+
+- **PR_SET_TAGGED_ADDR_CTRL**: enable or disable the AArch64 Tagged Address
+  ABI for the calling process.
+
+  The (unsigned int) arg2 argument is a bit mask describing the control mode
+  used:
+
+  - **PR_TAGGED_ADDR_ENABLE**: enable AArch64 Tagged Address ABI. Default
+    status is disabled.
+
+  The arguments arg3, arg4, and arg5 are ignored.
+
+- **PR_GET_TAGGED_ADDR_CTRL**: get the status of the AArch64 Tagged Address
+  ABI for the calling process.
+
+  The arguments arg2, arg3, arg4, and arg5 are ignored.
+
+The prctl(PR_SET_TAGGED_ADDR_CTRL, ...) will return -EINVAL if the
+AArch64 Tagged Address ABI is not available
+(CONFIG_ARM64_TAGGED_ADDR_ABI disabled or sysctl abi.tagged_addr=0).
+
+The ABI properties set by the mechanism described above are inherited by
+threads of the same application and fork()'ed children but cleared by
+execve().
+
+Opting in (the prctl() option described above only) to or out of the
+AArch64 Tagged Address ABI can be disabled globally at runtime using the
+sysctl interface:
+
+- **abi.tagged_addr**: a new sysctl interface that can be used to prevent
+  applications from enabling or disabling the relaxed ABI. The sysctl
+  supports the following configuration options:
+
+  - **0**: disable the prctl(PR_SET_TAGGED_ADDR_CTRL) option to
+    enable/disable the AArch64 Tagged Address ABI globally
+
+  - **1** (Default): enable the prctl(PR_SET_TAGGED_ADDR_CTRL) option to
+    enable/disable the AArch64 Tagged Address ABI globally
+
+  Note that this sysctl does not affect the status of the AArch64 Tagged
+  Address ABI of the running processes.
+
+When a process has successfully enabled the new ABI by invoking
+prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE), the following
+behaviours are guaranteed:
+
+- Every currently available syscall, except the cases mentioned in section
+  3, can accept any valid tagged pointer. The same rule is applicable to
+  any syscall introduced in the future.
+
+- The syscall behaviour is undefined for non valid tagged pointers.
+
+- Every valid tagged pointer is expected to work as an untagged one.
+
+A definition of the meaning of tagged pointers on AArch64 can be found in:
+Documentation/arm64/tagged-pointers.txt.
+
+3. AArch64 Tagged Address ABI Exceptions
+-----------------------------------------
+
+The behaviour described in section 2, with particular reference to the
+acceptance by the syscalls of any valid tagged pointer, is not applicable
+to the following cases:
+
+- mmap() addr parameter.
+
+- mremap() new_address parameter.
+
+- prctl(PR_SET_MM, ``*``, ...) other than arg2 PR_SET_MM_MAP and
+  PR_SET_MM_MAP_SIZE.
+
+- prctl(PR_SET_MM, PR_SET_MM_MAP{,_SIZE}, ...) struct prctl_mm_map fields.
+
+Any attempt to use non-zero tagged pointers will lead to undefined
+behaviour.
+
+4. Example of correct usage
+---------------------------
+.. code-block:: c
+
+   void main(void)
+   {
+           static int tbi_enabled = 0;
+           unsigned long tag = 0;
+
+           char *ptr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
+                            MAP_ANONYMOUS, -1, 0);
+
+           if (prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE,
+                     0, 0, 0) == 0)
+                   tbi_enabled = 1;
+
+           if (ptr == (void *)-1) /* MAP_FAILED */
+                   return -1;
+
+           if (tbi_enabled)
+                   tag = rand() & 0xff;
+
+           ptr = (char *)((unsigned long)ptr | (tag << TAG_SHIFT));
+
+           *ptr = 'a';
+
+           ...
+   }

^ permalink raw reply related

* [PATCH v7 0/2] arm64 tagged address ABI
From: Catalin Marinas @ 2019-08-07 15:53 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Vincenzo Frascino, Will Deacon, Andrey Konovalov, Szabolcs Nagy,
	Kevin Brodsky, linux-doc, linux-arch, Dave Hansen

Hi,

Thanks for the feedback so far. This is an updated series documenting
the AArch64 Tagged Address ABI as implemented by these patches:

http://lkml.kernel.org/r/cover.1563904656.git.andreyknvl@google.com

Version 6 of the documentation series is available here:

http://lkml.kernel.org/r/20190725135044.24381-1-vincenzo.frascino@arm.com

Changes in v7:

- Dropped the MAP_PRIVATE requirements for tagged pointers for both
  anonymous and file mappings. One reason is that we can't enforce such
  restriction anyway. The other reason is that a future series
  implementing support for the hardware MTE will detect
  incompatibilities of the new PROT_MTE flag with various mmap()
  options.

- As a consequence of the above, I removed Szabolcs ack as I'm not sure
  he's ok with the change.

- Clarified the sysctl and prctl() interaction and reordered the
  descriptions.

- Reworded the prctl(PR_SET_MM) restrictions.

- Removed the description of the tag preservation from the first patch
  as it didn't really make sense (the syscall ABI has always preserved
  all registers other than x0 on return to user).

- s/ARM64/AArch64/ for consistency with the tagged-pointers.rst
  document.

- Other minor rewordings.

Vincenzo Frascino (2):
  arm64: Define Documentation/arm64/tagged-address-abi.rst
  arm64: Relax Documentation/arm64/tagged-pointers.rst

 Documentation/arm64/tagged-address-abi.rst | 151 +++++++++++++++++++++
 Documentation/arm64/tagged-pointers.rst    |  23 +++-
 2 files changed, 167 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/arm64/tagged-address-abi.rst


^ permalink raw reply

* Re: [PATCH 1/9] KVM: arm64: Document PV-time interface
From: Steven Price @ 2019-08-07 15:26 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: KVM list, Catalin Marinas, linux-doc, Russell King, open list,
	Marc Zyngier, Paolo Bonzini, Will Deacon, kvmarm,
	linux-arm-kernel
In-Reply-To: <9F77FA64-C71B-4025-A58D-3AC07E6688DE@dinechin.org>

On 07/08/2019 15:28, Christophe de Dinechin wrote:
> 
> 
>> On 7 Aug 2019, at 15:21, Steven Price <steven.price@arm.com
>> <mailto:steven.price@arm.com>> wrote:
>>
>> On 05/08/2019 17:40, Christophe de Dinechin wrote:
>>>
>>> Steven Price writes:
>>>
>>>> Introduce a paravirtualization interface for KVM/arm64 based on the
>>>> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>>>>
>>>> This only adds the details about "Stolen Time" as the details of "Live
>>>> Physical Time" have not been fully agreed.
>>>>
>>> [...]
>>>
>>>> +
>>>> +Stolen Time
>>>> +-----------
>>>> +
>>>> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
>>>> +
>>>> +  Field       | Byte Length | Byte Offset | Description
>>>> +  ----------- | ----------- | ----------- | --------------------------
>>>> +  Revision    |      4      |      0      | Must be 0 for version 0.1
>>>> +  Attributes  |      4      |      4      | Must be 0
>>>> +  Stolen time |      8      |      8      | Stolen time in unsigned
>>>> +              |             |             | nanoseconds indicating how
>>>> +              |             |             | much time this VCPU thread
>>>> +              |             |             | was involuntarily not
>>>> +              |             |             | running on a physical CPU.
>>>
>>> I know very little about the topic, but I don't understand how the spec
>>> as proposed allows an accurate reading of the relation between physical
>>> time and stolen time simultaneously. In other words, could you draw
>>> Figure 1 of the spec from within the guest? Or is it a non-objective?
>>
>> Figure 1 is mostly attempting to explain Live Physical Time (LPT), which
>> is not part of this patch series. But it does touch on stolen time by
>> the difference between "live physical time" and "virtual time".
>>
>> I'm not sure what you mean by "from within the guest". From the
>> perspective of the guest the parts of the diagram where the guest isn't
>> running don't exist (therefore there are discontinuities in the
>> "physical time" and "live physical time" lines).
> 
> I meant: If I run code within the guest that attempts to draw Figure 1,
> race conditions may cause the diagram actually drawn by your guest
> program to look completely wrong on occasions.
> 
>> This patch series doesn't attempt to provide the guest with a view of
>> "physical time" (or LPT) - but it might be able to observe that by
>> consulting something external (e.g. an NTP server, or an emulated RTC
>> which reports wall-clock time).
> 
> … with what appear to be like a built-in race condition, as you correctly
> identified. I was wondering if the built-in race condition was deliberate
> and/or necessary, or if it was irrelevant for the planned uses of the value.
> 
>> What it does provide is a mechanism for obtaining the difference (as
>> reported by the host) between "live physical time" and "virtual time" -
>> this is reported in nanoseconds in the above structure.
>>
>>> For example, if you read the stolen time before you read CNTVCT_EL0,
>>> isn't it possible for a lengthy event like a migration to occur between
>>> the two reads, causing the stolen time to be obsolete and off by seconds?
>>
>> "Lengthy events" like migration are represented by the "paused" state in
>> the diagram - i.e. it's the difference between "physical time" and "live
>> physical time". So stolen time doesn't attempt to represent that.
>>
>> And yes, there is a race between reading CNTVCT_EL0 and reading stolen
>> time - but in practice this doesn't really matter. The usual pseudo-code
>> way of using stolen time is:
> 
> I’m assuming this is the guest scheduler you are talking about,

yes

> and I’m assuming virtualization can preempt that code anywhere.
> Maybe that’s where I’m wrong?

You are correct, the guest can be preempted at any point.

> 
> For the sake of the argument, assume there is a 1s pause.
> Not completely unreasonable in a migration scenario.

As I mentioned before, events like migration are not represented by
stolen time. They would be represented by CNTVCT_EL0 appearing to pause
during the migration (so showing a difference between "physical time"
and "live physical time"). The stolen time value would not be incremented.

>>  * scheduler captures stolen time from structure and CNTVCT_EL0:
>>      before_timer = CNTVCT_EL0
> 
> [insert optional 1s pause here, case A]
> 
>>      before_stolen = stolen
>>  * schedule in process
>>  * process is pre-empted (or blocked in some way)
>>  * scheduler captures stolen time from structure and CNTVCT_EL0:
>>      after_timer = CNTVCT_EL0
> 
> [insert optional 1s pause here, case B]
> 
>>      after_stolen = stolen
>>      time = to_nsecs(after_timer - before_timer) -
>>             (after_stolen - before_stolen)
> 
> In case A, time is too big by one second. In case B, it is too small,
> to the point where your code might need to be ready for
> “time” unexpectedly showing up as negative.

So a 1 second pause is unlikely for stolen time - this means that the
VCPU was ready to run, but the host didn't run it for some reason. But
in theory you are correct this could happen. The core code deals with it
like this (update_rq_clock_task):
> 	if (static_key_false((&paravirt_steal_rq_enabled))) {
> 		steal = paravirt_steal_clock(cpu_of(rq));
> 		steal -= rq->prev_steal_time_rq;
> 
> 		if (unlikely(steal > delta))
> 			steal = delta;
> 
> 		rq->prev_steal_time_rq += steal;
> 		delta -= steal;
> 	}

So if (steal > delta) then steal is capped to delta, preventing the
final delta from going negative.

>>
>> The scheduler can then charge the process for "time" nanoseconds of
>> time. This ensures that a process isn't unfairly penalised if the host
>> doesn't schedule the VCPU while it is supposed to be running.
>>
>> The race is very small in comparison to the time the process is running,
>> and in the worst case just means the process is charged slightly more
>> (or less) than it should be.
> 
> At this point, what I don’t understand is why the race would be
> “very small” or why you would only be charged “slightly” more or less?

The window between measuring the time using CNTVCT_EL0 and getting the
stolen time from the hypervisor is pretty short. The amount of time that
is (normally) stolen in one go is also small. So the race is unlikely
and the error when it occurs is (usually) small.

Long events (such as migration or pausing the guest) are not considered
"stolen time" and should be reflected to the guest in other ways.

>> I guess if you're really worried about it, you could do a dance like:
>>
>> do {
>> before = stolen
>> timer = CNTVCT_EL0
>> after = stolen
>> } while (before != after);
> 
> That will work as long as nothing in that loop requires something
> that would cause `stolen` to jump. If there is such a guarantee,
> then that’s even efficient, because in most cases the loop
> would only run once, at the cost of one extra read and one test.

Note that other architectures don't have such loops, so arm64 is just
following the lead of existing architecture.

>> But I don't see the need to have such an accurate view of elapsed time
>> that the VCPU was scheduled. And of course at the moment (without this
>> series) the guest has no idea about time stolen by the host.
> 
> I’m certainly not arguing that exposing stolen time is a bad idea,
> I’m only wondering if the proposed solution is racy, and if so, if
> it is intentional.
> 
> If it’s indeed racy, the problem could be mitigated in a number of
> ways
> 
> a) document your loop or something similar as being the recommended
> way to avoid the race, and then ensure that the loop actually
> will always work as intended. The upside is that it’s just a change in
> some comments or documentation.
> 
> b) having a single interface that exposes multiple times. For example,
> you could have a copy of CNTVCT_EL0 written alongside stolen time,
> and then the scheduler could use that copy for its decision.

That would still be racy - the structure can be updated at any time (as
the host could interrupt the VCPU at any time), so you would still be
left with the problem of reading both atomically - which would mean
going back to the loop. This is the approach that LPT takes and is
documented in the spec.

Also I can't see why you would want to know the CNTVCT_EL0 value at the
point the stolen time was updated, it's much more useful to know the
current CNTVCT_EL0 value.

Ultimately reading the stolen time is always going to be slightly racy
because you are including some of the scheduler's time in the
calculation of how much time the process was running for. The pauses you
describe above are instances where time has been stolen from the
scheduler, but that time is being accounted for/against a user space
process. While the algorithm could be changed so that it's always a
positive for the user space process I'm not sure that's a benefit (it's
probably better that statistically it can go either way).

Steve


^ permalink raw reply

* Re: [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space
From: Marc Zyngier @ 2019-08-07 13:51 UTC (permalink / raw)
  To: Steven Price
  Cc: kvm, linux-doc, Catalin Marinas, linux-kernel, Russell King,
	Paolo Bonzini, Will Deacon, kvmarm, linux-arm-kernel
In-Reply-To: <5aa54066-b9f6-22d1-fa2b-ce5cbf244ab5@arm.com>

On 07/08/2019 14:39, Steven Price wrote:
> On 03/08/2019 18:34, Marc Zyngier wrote:
>> On Sat, 3 Aug 2019 13:51:13 +0100
>> Marc Zyngier <maz@kernel.org> wrote:
>>
>> [forgot that one]
>>
>>> On Fri,  2 Aug 2019 15:50:14 +0100
>>> Steven Price <steven.price@arm.com> wrote:
>>
>> [...]
>>
>>>> +static int __init kvm_pvtime_init(void)
>>>> +{
>>>> +	kvm_register_device_ops(&pvtime_ops, KVM_DEV_TYPE_ARM_PV_TIME);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +late_initcall(kvm_pvtime_init);
>>
>> Why is it an initcall? So far, the only initcall we've used is the one
>> that initializes KVM itself. Can't we just the device_ops just like we
>> do for the vgic?
> 
> So would you prefer a direct call from init_subsystems() in
> virt/kvm/arm/arm.c?

Yes. Consistency is important.

> The benefit of initcall is just that it keeps the code self-contained.
> In init_subsystems() I'd either need a #ifdef CONFIG_ARM64 or a dummy
> function for arm.

Having a dummy function for 32bit ARM is fine. Most of the code we add
to the 32bit port is made of empty stubs anyway.

Thanks,

	M.
-- 
Jazz is not dead, it just smells funny...

^ permalink raw reply

* Re: [PATCH 6/9] KVM: arm64: Provide a PV_TIME device to user space
From: Steven Price @ 2019-08-07 13:39 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, linux-doc, Catalin Marinas, linux-kernel, Russell King,
	Paolo Bonzini, Will Deacon, kvmarm, linux-arm-kernel
In-Reply-To: <20190803183335.149e0113@why>

On 03/08/2019 18:34, Marc Zyngier wrote:
> On Sat, 3 Aug 2019 13:51:13 +0100
> Marc Zyngier <maz@kernel.org> wrote:
> 
> [forgot that one]
> 
>> On Fri,  2 Aug 2019 15:50:14 +0100
>> Steven Price <steven.price@arm.com> wrote:
> 
> [...]
> 
>>> +static int __init kvm_pvtime_init(void)
>>> +{
>>> +	kvm_register_device_ops(&pvtime_ops, KVM_DEV_TYPE_ARM_PV_TIME);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +late_initcall(kvm_pvtime_init);
> 
> Why is it an initcall? So far, the only initcall we've used is the one
> that initializes KVM itself. Can't we just the device_ops just like we
> do for the vgic?

So would you prefer a direct call from init_subsystems() in
virt/kvm/arm/arm.c?

The benefit of initcall is just that it keeps the code self-contained.
In init_subsystems() I'd either need a #ifdef CONFIG_ARM64 or a dummy
function for arm.

Steve

^ permalink raw reply

* Re: [PATCH 1/9] KVM: arm64: Document PV-time interface
From: Steven Price @ 2019-08-07 13:21 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: kvm, Radim Krčmář, Catalin Marinas,
	Suzuki K Pouloze, linux-doc, Russell King, linux-kernel,
	James Morse, linux-arm-kernel, Marc Zyngier, Paolo Bonzini,
	Will Deacon, kvmarm, Julien Thierry
In-Reply-To: <m1mugnmv0x.fsf@dinechin.org>

On 05/08/2019 17:40, Christophe de Dinechin wrote:
> 
> Steven Price writes:
> 
>> Introduce a paravirtualization interface for KVM/arm64 based on the
>> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>>
>> This only adds the details about "Stolen Time" as the details of "Live
>> Physical Time" have not been fully agreed.
>>
> [...]
> 
>> +
>> +Stolen Time
>> +-----------
>> +
>> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
>> +
>> +  Field       | Byte Length | Byte Offset | Description
>> +  ----------- | ----------- | ----------- | --------------------------
>> +  Revision    |      4      |      0      | Must be 0 for version 0.1
>> +  Attributes  |      4      |      4      | Must be 0
>> +  Stolen time |      8      |      8      | Stolen time in unsigned
>> +              |             |             | nanoseconds indicating how
>> +              |             |             | much time this VCPU thread
>> +              |             |             | was involuntarily not
>> +              |             |             | running on a physical CPU.
> 
> I know very little about the topic, but I don't understand how the spec
> as proposed allows an accurate reading of the relation between physical
> time and stolen time simultaneously. In other words, could you draw
> Figure 1 of the spec from within the guest? Or is it a non-objective?

Figure 1 is mostly attempting to explain Live Physical Time (LPT), which
is not part of this patch series. But it does touch on stolen time by
the difference between "live physical time" and "virtual time".

I'm not sure what you mean by "from within the guest". From the
perspective of the guest the parts of the diagram where the guest isn't
running don't exist (therefore there are discontinuities in the
"physical time" and "live physical time" lines).

This patch series doesn't attempt to provide the guest with a view of
"physical time" (or LPT) - but it might be able to observe that by
consulting something external (e.g. an NTP server, or an emulated RTC
which reports wall-clock time).

What it does provide is a mechanism for obtaining the difference (as
reported by the host) between "live physical time" and "virtual time" -
this is reported in nanoseconds in the above structure.

> For example, if you read the stolen time before you read CNTVCT_EL0,
> isn't it possible for a lengthy event like a migration to occur between
> the two reads, causing the stolen time to be obsolete and off by seconds?

"Lengthy events" like migration are represented by the "paused" state in
the diagram - i.e. it's the difference between "physical time" and "live
physical time". So stolen time doesn't attempt to represent that.

And yes, there is a race between reading CNTVCT_EL0 and reading stolen
time - but in practice this doesn't really matter. The usual pseudo-code
way of using stolen time is:

  * scheduler captures stolen time from structure and CNTVCT_EL0:
      before_timer = CNTVCT_EL0
      before_stolen = stolen
  * schedule in process
  * process is pre-empted (or blocked in some way)
  * scheduler captures stolen time from structure and CNTVCT_EL0:
      after_timer = CNTVCT_EL0
      after_stolen = stolen
      time = to_nsecs(after_timer - before_timer) -
             (after_stolen - before_stolen)

The scheduler can then charge the process for "time" nanoseconds of
time. This ensures that a process isn't unfairly penalised if the host
doesn't schedule the VCPU while it is supposed to be running.

The race is very small in comparison to the time the process is running,
and in the worst case just means the process is charged slightly more
(or less) than it should be.

I guess if you're really worried about it, you could do a dance like:

	do {
		before = stolen
		timer = CNTVCT_EL0
		after = stolen
	} while (before != after);

But I don't see the need to have such an accurate view of elapsed time
that the VCPU was scheduled. And of course at the moment (without this
series) the guest has no idea about time stolen by the host.

Steve

^ permalink raw reply

* Re: [Tee-dev] [PATCH v8 0/2] fTPM: firmware TPM running in TEE
From: Rouven Czerwinski @ 2019-08-07 13:21 UTC (permalink / raw)
  To: Jarkko Sakkinen, Sasha Levin
  Cc: linux-kernel, linux-doc, corbet, linux-kernel, thiruan, tee-dev,
	jgg, ilias.apalodimas, rdunlap, linux-integrity, peterhuewe,
	bryankel
In-Reply-To: <20190711200858.xydm3wujikufxjcw@linux.intel.com>

Hi,

I spent some time with the fTPM module and TA on a Nitrogen6X with the
latest OP-TEE master. After stumbling through the "tee_supplicant no
persistent storage" problem, my module now issues the following error
message on module load:

[   34.633252] tpm tpm0: ftpm_tee_tpm_op_send: SUBMIT_COMMAND invoke error: 0xffff0006
[   34.641035] tpm tpm0: tpm_try_transmit: send(): error -65530
[   34.647008] tpm tpm0: ftpm_tee_tpm_op_send: SUBMIT_COMMAND invoke error: 0xffff0006
[   34.654788] tpm tpm0: tpm_try_transmit: send(): error -65530
[   34.660480] ftpm-tee ftpm: ftpm_tee_probe: tpm_chip_register failed with rc=-65530
[   34.678087] ftpm-tee: probe of ftpm failed with error -65530

To me the TEE_ERROR_BAD_PARAMETERS indicates some ABI issue between the
TA and the kernel module. Note that I built the TA from 
https://github.com/microsoft/MSRSec.git with commit
6bb57db632c424f87cbaf7ec6f9c89be7682b3c0. Maybe this is not the correct
version, I had some problems building the module from the repository
mentioned in the Patches

Regards,
Rouven Czerwinski
-- 
Pengutronix e.K.                           |            		 |
Industrial Linux Solutions                 | https://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

^ permalink raw reply

* [PATCH v2 1/3] drm: add gem ttm helpers
From: Gerd Hoffmann @ 2019-08-07 10:38 UTC (permalink / raw)
  To: dri-devel
  Cc: tzimmermann, Gerd Hoffmann, David Airlie, Daniel Vetter,
	Maarten Lankhorst, Maxime Ripard, Sean Paul, Jonathan Corbet,
	open list:DOCUMENTATION, open list
In-Reply-To: <20190807103849.7289-1-kraxel@redhat.com>

Now with ttm_buffer_object being a subclass of drm_gem_object we can
easily lookup ttm_buffer_object for a given drm_gem_object, which in
turm allows to create common helper functions.  This patch starts off
with dump mmap helpers.

v2:
 - drop drm_gem_ttm_mmap_offset().
 - improve kerneldocs.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
---
 include/drm/drm_gem_ttm_helper.h     | 27 +++++++++++++++++
 drivers/gpu/drm/drm_gem_ttm_helper.c | 45 ++++++++++++++++++++++++++++
 Documentation/gpu/drm-mm.rst         | 12 ++++++++
 drivers/gpu/drm/Kconfig              |  7 +++++
 drivers/gpu/drm/Makefile             |  3 ++
 5 files changed, 94 insertions(+)
 create mode 100644 include/drm/drm_gem_ttm_helper.h
 create mode 100644 drivers/gpu/drm/drm_gem_ttm_helper.c

diff --git a/include/drm/drm_gem_ttm_helper.h b/include/drm/drm_gem_ttm_helper.h
new file mode 100644
index 000000000000..0c0fbf76af60
--- /dev/null
+++ b/include/drm/drm_gem_ttm_helper.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef DRM_GEM_TTM_HELPER_H
+#define DRM_GEM_TTM_HELPER_H
+
+#include <linux/kernel.h>
+
+#include <drm/drm_gem.h>
+#include <drm/ttm/ttm_bo_api.h>
+
+/**
+ * Returns the container of type &struct ttm_buffer_object
+ * for field base.
+ * @gem:	the GEM object
+ * Returns:	The containing GEM VRAM object
+ */
+static inline struct ttm_buffer_object *drm_gem_ttm_of_gem(
+	struct drm_gem_object *gem)
+{
+	return container_of(gem, struct ttm_buffer_object, base);
+}
+
+int drm_gem_ttm_driver_dumb_mmap_offset(struct drm_file *file,
+					struct drm_device *dev,
+					uint32_t handle, uint64_t *offset);
+
+#endif
diff --git a/drivers/gpu/drm/drm_gem_ttm_helper.c b/drivers/gpu/drm/drm_gem_ttm_helper.c
new file mode 100644
index 000000000000..7aa3c115e42c
--- /dev/null
+++ b/drivers/gpu/drm/drm_gem_ttm_helper.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <drm/drm_gem_ttm_helper.h>
+
+/**
+ * DOC: overview
+ *
+ * This library provides helper functions for gem objects backed by
+ * ttm.
+ */
+
+/**
+ * drm_gem_ttm_driver_dumb_mmap_offset() - \
+	Implements &struct drm_driver.dumb_mmap_offset
+ * @file:	DRM file pointer.
+ * @dev:	DRM device.
+ * @handle:	GEM handle
+ * @offset:	Returns the mapping's memory offset on success
+ *
+ * Returns:
+ * 0 on success, or
+ * a negative errno code otherwise.
+ */
+int drm_gem_ttm_driver_dumb_mmap_offset(struct drm_file *file,
+					 struct drm_device *dev,
+					 uint32_t handle, uint64_t *offset)
+{
+	struct drm_gem_object *gem;
+
+	gem = drm_gem_object_lookup(file, handle);
+	if (!gem)
+		return -ENOENT;
+
+	/*
+	 * No drm_gem_create_mmap_offset() call here, ttm handles this
+	 * at bo initialization time instead.
+	 */
+
+	*offset = drm_vma_node_offset_addr(&gem->vma_node);
+
+	drm_gem_object_put_unlocked(gem);
+
+	return 0;
+}
+EXPORT_SYMBOL(drm_gem_ttm_driver_dumb_mmap_offset);
diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index b664f054c259..a70a1d9f30ec 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -412,6 +412,18 @@ VRAM MM Helper Functions Reference
 .. kernel-doc:: drivers/gpu/drm/drm_vram_mm_helper.c
    :export:
 
+GEM TTM Helper Functions Reference
+-----------------------------------
+
+.. kernel-doc:: drivers/gpu/drm/drm_gem_ttm_helper.c
+   :doc: overview
+
+.. kernel-doc:: include/drm/drm_gem_ttm_helper.h
+   :internal:
+
+.. kernel-doc:: drivers/gpu/drm/drm_gem_ttm_helper.c
+   :export:
+
 VMA Offset Manager
 ==================
 
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index e6f40fb54c9a..f7b25519f95c 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -172,6 +172,13 @@ config DRM_VRAM_HELPER
 	help
 	  Helpers for VRAM memory management
 
+config DRM_TTM_HELPER
+	tristate
+	depends on DRM
+	select DRM_TTM
+	help
+	  Helpers for ttm-based gem objects
+
 config DRM_GEM_CMA_HELPER
 	bool
 	depends on DRM
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 10f8329a8b71..545c61d6528b 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -37,6 +37,9 @@ drm_vram_helper-y := drm_gem_vram_helper.o \
 		     drm_vram_mm_helper.o
 obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
 
+drm_ttm_helper-y := drm_gem_ttm_helper.o
+obj-$(CONFIG_DRM_TTM_HELPER) += drm_ttm_helper.o
+
 drm_kms_helper-y := drm_crtc_helper.o drm_dp_helper.o drm_dsc.o drm_probe_helper.o \
 		drm_plane_helper.o drm_dp_mst_topology.o drm_atomic_helper.o \
 		drm_kms_helper_common.o drm_dp_dual_mode_helper.o \
-- 
2.18.1


^ permalink raw reply related

* Re: [PATCH v4 1/5] mm/page_idle: Add per-pid idle page tracking using virtual indexing
From: Joel Fernandes @ 2019-08-07 10:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexey Dobriyan, Borislav Petkov, Brendan Gregg,
	Catalin Marinas, Christian Hansen, dancol, fmayer, H. Peter Anvin,
	Ingo Molnar, Jonathan Corbet, Kees Cook, kernel-team, linux-api,
	linux-doc, linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport,
	minchan, namhyung, paulmck, Robin Murphy, Roman Gushchin,
	Stephen Rothwell, surenb, Thomas Gleixner, tkjos,
	Vladimir Davydov, Vlastimil Babka, Will Deacon, Brendan Gregg
In-Reply-To: <20190806151921.edec128271caccb5214fc1bd@linux-foundation.org>

On Tue, Aug 06, 2019 at 03:19:21PM -0700, Andrew Morton wrote:
> (cc Brendan's other email address, hoping for review input ;))

;)

> On Mon,  5 Aug 2019 13:04:47 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
> 
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > Looking up PFN from pagemap in Android devices is not supported by
> > unprivileged process and requires SYS_ADMIN and gives 0 for the PFN.
> > 
> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file.  It follows
> > the exact same semantics as the global /sys/kernel/mm/page_idle, but now
> > looking up PFN through pagemap is not needed since the interface uses
> > virtual frame numbers, and at the same time also does not require
> > SYS_ADMIN.
> > 
> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pin points code paths which allocates and leaves memory
> > idle for long periods of time. This method solves the security issue
> > with userspace learning the PFN, and while at it is also shown to yield
> > better results than the pagemap lookup, the theory being that the window
> > where the address space can change is reduced by eliminating the
> > intermediate pagemap look up stage. In virtual address indexing, the
> > process's mmap_sem is held for the duration of the access.
> 
> Quite a lot of changes to the page_idle code.  Has this all been
> runtime tested on architectures where
> CONFIG_HAVE_ARCH_PTE_SWP_PGIDLE=n?  That could be x86 with a little
> Kconfig fiddle-for-testing-purposes.

I will do this Kconfig fiddle test with CONFIG_HAVE_ARCH_PTE_SWP_PGIDLE=n and test
the patch as well.

In previous series, this flag was not there (which should have been
equivalent to the above test), and things are working fine.

> > 8 files changed, 376 insertions(+), 45 deletions(-)
> 
> Quite a lot of new code unconditionally added to major architectures. 
> Are we confident that everyone will want this feature?

I did not follow, could you clarify more? All of this diff stat is not to
architecture code:

 arch/Kconfig                  |   3 ++
 fs/proc/base.c                |   3 ++
 fs/proc/internal.h            |   1 +
 fs/proc/task_mmu.c            |  43 +++++++++++++++++++++
 include/asm-generic/pgtable.h |   6 +++
 include/linux/page_idle.h     |   4 ++
 mm/page_idle.c                | 359 +++++++++++++++++++++++++++++..
 mm/rmap.c                     |   2 +
 8 files changed, 376 insertions(+), 45 deletions(-)

The arcitecture change is in a later patch, and is not that many lines.

Also, I am planning to split the swap functionality of the patch into a
separate one for easier review.

> > +static int proc_page_idle_open(struct inode *inode, struct file *file)
> > +{
> > +	struct mm_struct *mm;
> > +
> > +	mm = proc_mem_open(inode, PTRACE_MODE_READ);
> > +	if (IS_ERR(mm))
> > +		return PTR_ERR(mm);
> > +	file->private_data = mm;
> > +	return 0;
> > +}
> > +
> > +static int proc_page_idle_release(struct inode *inode, struct file *file)
> > +{
> > +	struct mm_struct *mm = file->private_data;
> > +
> > +	if (mm)
> 
> I suspect the test isn't needed?  proc_page_idle_release) won't be
> called if proc_page_idle_open() failed?

Yes you are right, will remove the test.

thanks,

 - Joel


^ permalink raw reply

* Re: [PATCH v4 11/12] fpga: dfl: fme: add global error reporting support
From: Greg KH @ 2019-08-07  9:30 UTC (permalink / raw)
  To: Wu Hao
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Luwei Kang, Ananda Ravuri, Xu Yilun
In-Reply-To: <20190807080825.GA10344@hao-dev>

On Wed, Aug 07, 2019 at 04:08:25PM +0800, Wu Hao wrote:
> On Wed, Aug 07, 2019 at 10:45:22AM +0800, Wu Hao wrote:
> > On Mon, Aug 05, 2019 at 05:56:26PM +0200, Greg KH wrote:
> > > On Sun, Aug 04, 2019 at 06:20:21PM +0800, Wu Hao wrote:
> > > > +static int fme_global_err_init(struct platform_device *pdev,
> > > > +			       struct dfl_feature *feature)
> > > > +{
> > > > +	struct device *dev;
> > > > +	int ret = 0;
> > > > +
> > > > +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > > > +	if (!dev)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	dev->parent = &pdev->dev;
> > > > +	dev->release = err_dev_release;
> > > > +	dev_set_name(dev, "errors");
> > > > +
> > > > +	fme_error_enable(feature);
> > > > +
> > > > +	ret = device_register(dev);
> > > > +	if (ret) {
> > > > +		put_device(dev);
> > > > +		return ret;
> > > > +	}
> > > > +
> > > > +	ret = device_add_groups(dev, error_groups);
> > > 
> > > cute, but no, you do not create a whole struct device for a subdir.  Use
> > > the attribute group name like you did on earlier patches.
> > 
> > Sure, let me fix it in the next version.
> > 
> > > 
> > > And again, you raced userspace and lost :(
> > 
> > Same here, could you please give some more hints here?
> 
> Oh.. I see..
> 
> I should follow [1] as this is a platform driver. I will fix it. Thanks!
> 
> [PATCH 00/11] Platform drivers, provide a way to add sysfs groups easily
> 
> [1]https://lkml.org/lkml/2019/7/4/181

Yes, that is the correct thing to do.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH 0/6] hwspinlock: allow sharing of hwspinlocks
From: Fabien DESSENNE @ 2019-08-07  8:39 UTC (permalink / raw)
  To: Suman Anna, Bjorn Andersson
  Cc: Ohad Ben-Cohen, Rob Herring, Mark Rutland, Maxime Coquelin,
	Alexandre TORGUE, Jonathan Corbet,
	linux-remoteproc@vger.kernel.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-stm32@st-md-mailman.stormreply.com,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	Benjamin GAIGNARD
In-Reply-To: <1aea3d28-29dc-f9de-3b86-cf777e0d5caa@ti.com>

Hi


On 06/08/2019 11:30 PM, Suman Anna wrote:
> On 8/6/19 1:21 PM, Bjorn Andersson wrote:
>> On Tue 06 Aug 10:38 PDT 2019, Suman Anna wrote:
>>
>>> Hi Fabien,
>>>
>>> On 8/5/19 12:46 PM, Bjorn Andersson wrote:
>>>> On Mon 05 Aug 01:48 PDT 2019, Fabien DESSENNE wrote:
>>>>
>>>>> On 01/08/2019 9:14 PM, Bjorn Andersson wrote:
>>>>>> On Wed 13 Mar 08:50 PDT 2019, Fabien Dessenne wrote:
>> [..]
>>>>> B/ This would introduce some inconsistency between the two 'request' API
>>>>> which are hwspin_lock_request() and hwspin_lock_request_specific().
>>>>> hwspin_lock_request() looks for an unused lock, so requests for an exclusive
>>>>> usage. On the other side, request_specific() would request shared locks.
>>>>> Worst the following sequence can transform an exclusive usage into a shared
>>>>>
>>>> There is already an inconsistency in between these; as with above any
>>>> system that uses both request() and request_specific() will be suffering
>>>> from intermittent failures due to probe ordering.
>>>>
>>>>> one:
>>>>>     -hwspin_lock_request() -> returns Id#0 (exclusive)
>>>>>     -hwspin_lock_request() -> returns Id#1 (exclusive)
>>>>>     -hwspin_lock_request_specific(0) -> returns Id#0 and makes Id#0 shared
>>>>> Honestly I am not sure that this is a real issue, but it's better to have it
>>>>> in mind before we take ay decision
>>> Wouldn't it be actually simpler to just introduce a new specific API
>>> variant for this, similar to the reset core for example (it uses a
>>> separate exclusive API), without having to modify the bindings at all.
>>> It is just a case of your driver using the right API, and the core can
>>> be modified to use the additional tag semantics based on the API. It
>>> should avoid any confusion with say using a different second cell value
>>> for the same lock in two different nodes.
>>>
>> But this implies that there is an actual need to hold these locks
>> exclusively. Given that they are (except for the raw case) all wrapped
>> by Linux locking primitives there shouldn't be a problem sharing a lock
>> (except possibly for the raw case).
> Yes agreed, the HWLOCK_RAW and HWLOCK_IN_ATOMIC cases are unprotected. I
> am still trying to understand better the usecase to see if the same lock
> is being multiplexed for different protection contexts, or if all of
> them are protecting the same context.


Here are two different examples that explain the need for changes.
In both cases the Linux clients are talking to a single entity on the
remote-side.

Example 1:
     exti: interrupt-controller@5000d000 {
         compatible = "st,stm32mp1-exti", "syscon";
         interrupt-controller;
         #interrupt-cells = <2>;
         reg = <0x5000d000 0x400>;
         hwlocks = <&hsem 1>;
     };
The two drivers (stm32mp1-exti and syscon) refer to the same hwlock.
With the current hwspinlock implementation, only the first driver succeeds
in requesting (hwspin_lock_request_specific) the hwlock. The second request
fails.
Here, we really need to share the hwlock between the two drivers.
Note: hardware spinlock support for regmap was 'recently' introduced in 4.15
see https://lore.kernel.org/patchwork/patch/845941/



Example 2:
Here it is more a question of optimization : we want to save the number of
hwlocks used to protect resources, using an unique hwlock to protect all
pinctrl resources:
         pinctrl: pin-controller@50002000 {
             compatible = "st,stm32mp157-pinctrl";
             ranges = <0 0x50002000 0xa400>;
             hwlocks = <&hsem 0 1>;

         pinctrl_z: pin-controller-z@54004000 {
             compatible = "st,stm32mp157-z-pinctrl";
             ranges = <0 0x54004000 0x400>;
             pins-are-numbered;
             hwlocks = <&hsem 0 1>;

>
>> I agree that we shouldn't specify this property in DT - if anything it
>> should be a variant of the API.


If we decide to add a 'shared' API, then, what about the generic regmap 
driver?

In the context of above example1, this would require to update the 
regmap driver.

But would this be acceptable for any driver using syscon/regmap?


I think it is better to keep the existing API (modifying it so it always 
allows

hwlocks sharing, so no need for bindings update) than adding another API.



>>
>>> If you are sharing a hwlock on the Linux side, surely your driver should
>>> be aware that it is a shared lock. The tag can be set during the first
>>> request API, and you look through both tags when giving out a handle.
>>>
>> Why would the driver need to know about it?
> Just the semantics if we were to support single user vs multiple users
> on Linux-side to even get a handle. Your point is that this may be moot
> since we have protection anyway other than the raw cases. But we need to
> be able to have the same API work across all cases.
>
> So far, it had mostly been that there would be one user on Linux
> competing with other equivalent peer entities on different processors.
> It is not common to have multiple users since these protection schemes
> are usually needed only at the lowest levels of a stack, so the
> exclusive handle stuff had been sufficient.
>
>>> Obviously, the hwspin_lock_request() API usage semantics always had the
>>> implied additional need for communicating the lock id to the other peer
>>> entity, so a realistic usage is most always the specific API variant. I
>>> doubt this API would be of much use for the shared driver usage. This
>>> also implies that the client user does not care about specifying a lock
>>> in DT.
>>>
>> Afaict if the lock are shared then there shouldn't be a problem with
>> some clients using the request API and others request_specific(). As any
>> collisions would simply mean that there are more contention on the lock.
>>
>> With the current exclusive model that is not possible and the success of
>> the request_specific will depend on probe order.
>>
>> But perhaps it should be explicitly prohibited to use both APIs on the
>> same hwspinlock instance?
> Yeah, they are meant to be complimentary usage, though I doubt we will
> ever have any realistic users for the generic API if we haven't had a
> usage so far. I had posted a concept of reserved locks long back [1] to
> keep away certain locks from the generic requestor, but dropped it since
> we did not have an actual use-case needing it.
>
> regards
> Suman
>
> [1] https://lwn.net/Articles/611944/

^ permalink raw reply

* Re: [PATCH v4 11/12] fpga: dfl: fme: add global error reporting support
From: Wu Hao @ 2019-08-07  8:08 UTC (permalink / raw)
  To: Greg KH
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Luwei Kang, Ananda Ravuri, Xu Yilun
In-Reply-To: <20190807024521.GD24158@hao-dev>

On Wed, Aug 07, 2019 at 10:45:22AM +0800, Wu Hao wrote:
> On Mon, Aug 05, 2019 at 05:56:26PM +0200, Greg KH wrote:
> > On Sun, Aug 04, 2019 at 06:20:21PM +0800, Wu Hao wrote:
> > > +static int fme_global_err_init(struct platform_device *pdev,
> > > +			       struct dfl_feature *feature)
> > > +{
> > > +	struct device *dev;
> > > +	int ret = 0;
> > > +
> > > +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > > +	if (!dev)
> > > +		return -ENOMEM;
> > > +
> > > +	dev->parent = &pdev->dev;
> > > +	dev->release = err_dev_release;
> > > +	dev_set_name(dev, "errors");
> > > +
> > > +	fme_error_enable(feature);
> > > +
> > > +	ret = device_register(dev);
> > > +	if (ret) {
> > > +		put_device(dev);
> > > +		return ret;
> > > +	}
> > > +
> > > +	ret = device_add_groups(dev, error_groups);
> > 
> > cute, but no, you do not create a whole struct device for a subdir.  Use
> > the attribute group name like you did on earlier patches.
> 
> Sure, let me fix it in the next version.
> 
> > 
> > And again, you raced userspace and lost :(
> 
> Same here, could you please give some more hints here?

Oh.. I see..

I should follow [1] as this is a platform driver. I will fix it. Thanks!

[PATCH 00/11] Platform drivers, provide a way to add sysfs groups easily

[1]https://lkml.org/lkml/2019/7/4/181

Hao

> 
> Thanks in advance.
> Hao
> 
> > 
> > thanks,
> > 
> > greg k-h

^ permalink raw reply

* Re: [PATCH v4 11/12] fpga: dfl: fme: add global error reporting support
From: Wu Hao @ 2019-08-07  2:45 UTC (permalink / raw)
  To: Greg KH
  Cc: mdf, linux-fpga, linux-kernel, linux-api, linux-doc, atull,
	Luwei Kang, Ananda Ravuri, Xu Yilun
In-Reply-To: <20190805155626.GD8107@kroah.com>

On Mon, Aug 05, 2019 at 05:56:26PM +0200, Greg KH wrote:
> On Sun, Aug 04, 2019 at 06:20:21PM +0800, Wu Hao wrote:
> > +static int fme_global_err_init(struct platform_device *pdev,
> > +			       struct dfl_feature *feature)
> > +{
> > +	struct device *dev;
> > +	int ret = 0;
> > +
> > +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> > +	if (!dev)
> > +		return -ENOMEM;
> > +
> > +	dev->parent = &pdev->dev;
> > +	dev->release = err_dev_release;
> > +	dev_set_name(dev, "errors");
> > +
> > +	fme_error_enable(feature);
> > +
> > +	ret = device_register(dev);
> > +	if (ret) {
> > +		put_device(dev);
> > +		return ret;
> > +	}
> > +
> > +	ret = device_add_groups(dev, error_groups);
> 
> cute, but no, you do not create a whole struct device for a subdir.  Use
> the attribute group name like you did on earlier patches.

Sure, let me fix it in the next version.

> 
> And again, you raced userspace and lost :(

Same here, could you please give some more hints here?

Thanks in advance.
Hao

> 
> thanks,
> 
> greg k-h

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox