All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RESEND v6 02/16] mm/gup: Fix __get_user_pages() on fault retry of hugetlb
From: Peter Xu @ 2020-02-20 15:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	David Hildenbrand, Bobby Powers, Maya Gokhale, Jerome Glisse,
	Mike Kravetz, Matthew Wilcox, Marty McFadden, Mel Gorman, peterx,
	Hugh Dickins, Brian Geffon, Denis Plotnikov, Pavel Emelyanov
In-Reply-To: <20200220155353.8676-1-peterx@redhat.com>

When follow_hugetlb_page() returns with *locked==0, it means we've got
a VM_FAULT_RETRY within the fauling process and we've released the
mmap_sem.  When that happens, we should stop and bail out.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 1b4411bd0042..76cb420c0fb7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -849,6 +849,16 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
 						gup_flags, locked);
+				if (locked && *locked == 0) {
+					/*
+					 * We've got a VM_FAULT_RETRY
+					 * and we've lost mmap_sem.
+					 * We must stop here.
+					 */
+					BUG_ON(gup_flags & FOLL_NOWAIT);
+					BUG_ON(ret != 0);
+					goto out;
+				}
 				continue;
 			}
 		}
-- 
2.24.1



^ permalink raw reply related

* [PATCH RESEND v6 03/16] mm: Introduce fault_signal_pending()
From: Peter Xu @ 2020-02-20 15:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	David Hildenbrand, Bobby Powers, Maya Gokhale, Jerome Glisse,
	Mike Kravetz, Matthew Wilcox, Marty McFadden, Mel Gorman, peterx,
	Hugh Dickins, Brian Geffon, Denis Plotnikov, Pavel Emelyanov
In-Reply-To: <20200220155353.8676-1-peterx@redhat.com>

For most architectures, we've got a quick path to detect fatal signal
after a handle_mm_fault().  Introduce a helper for that quick path.

It cleans the current codes a bit so we don't need to duplicate the
same check across archs.  More importantly, this will be an unified
place that we handle the signal immediately right after an interrupted
page fault, so it'll be much easier for us if we want to change the
behavior of handling signals later on for all the archs.

Note that currently only part of the archs are using this new helper,
because some archs have their own way to handle signals.  In the
follow up patches, we'll try to apply this helper to all the rest of
archs.

Another note is that the "regs" parameter in the new helper is not
used yet.  It'll be used very soon.  Now we kept it in this patch only
to avoid touching all the archs again in the follow up patches.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/alpha/mm/fault.c        |  2 +-
 arch/arm/mm/fault.c          |  2 +-
 arch/hexagon/mm/vm_fault.c   |  2 +-
 arch/ia64/mm/fault.c         |  2 +-
 arch/m68k/mm/fault.c         |  2 +-
 arch/microblaze/mm/fault.c   |  2 +-
 arch/mips/mm/fault.c         |  2 +-
 arch/nds32/mm/fault.c        |  2 +-
 arch/nios2/mm/fault.c        |  2 +-
 arch/openrisc/mm/fault.c     |  2 +-
 arch/parisc/mm/fault.c       |  2 +-
 arch/riscv/mm/fault.c        |  2 +-
 arch/s390/mm/fault.c         |  3 +--
 arch/sparc/mm/fault_32.c     |  2 +-
 arch/sparc/mm/fault_64.c     |  2 +-
 arch/unicore32/mm/fault.c    |  2 +-
 arch/xtensa/mm/fault.c       |  2 +-
 include/linux/sched/signal.h | 13 +++++++++++++
 18 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 741e61ef9d3f..aea33b599037 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -150,7 +150,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	   the fault.  */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index bd0f4821f7e1..937b81ff8649 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -295,7 +295,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	 * signal first. We do not need to release the mmap_sem because
 	 * it would already be released in __lock_page_or_retry in
 	 * mm/filemap.c. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
+	if (fault_signal_pending(fault, regs)) {
 		if (!user_mode(regs))
 			goto no_context;
 		return 0;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index b3bc71680ae4..d19beaf11b4c 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -91,7 +91,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	/* The most common case -- we are done. */
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index c2f299fe9e04..211b4f439384 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -141,7 +141,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index e9b1d7585b43..a455e202691b 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -138,7 +138,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	fault = handle_mm_fault(vma, address, flags);
 	pr_debug("handle_mm_fault returns %x\n", fault);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return 0;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index e6a810b0c7ad..cdde01dcdfc3 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -217,7 +217,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 1e8d00793784..0ee6fafc57bc 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -154,7 +154,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(regs))
 		return;
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index 906dfb25353c..0e63f81eff5b 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -214,7 +214,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 	 * signal first. We do not need to release the mmap_sem because it
 	 * would already be released in __lock_page_or_retry in mm/filemap.c.
 	 */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
+	if (fault_signal_pending(fault, regs)) {
 		if (!user_mode(regs))
 			goto no_context;
 		return;
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index 6a2e716b959f..704ace8ca0ee 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -133,7 +133,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 5d4d3a9691d0..85c7eb0c0186 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -161,7 +161,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index adbd5e2144a3..f9be1d1cb43f 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -304,7 +304,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index cf7248e07f43..1d3869e9ddef 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -117,7 +117,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	 * signal first. We do not need to release the mmap_sem because it
 	 * would already be released in __lock_page_or_retry in mm/filemap.c.
 	 */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(tsk))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 7b0bb475c166..179cf92a56e5 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -480,8 +480,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
-	/* No reason to continue if interrupted by SIGKILL. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
+	if (fault_signal_pending(fault, regs)) {
 		fault = VM_FAULT_SIGNAL;
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
 			goto out_up;
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index 89976c9b936c..6efbeb227644 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -237,7 +237,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 8b7ddbd14b65..dd1ed6555831 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -425,7 +425,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		goto exit_exception;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 76342de9cf8c..59d0e6ec2cfc 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -250,7 +250,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	 * signal first. We do not need to release the mmap_sem because
 	 * it would already be released in __lock_page_or_retry in
 	 * mm/filemap.c. */
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return 0;
 
 	if (!(fault & VM_FAULT_ERROR) && (flags & FAULT_FLAG_ALLOW_RETRY)) {
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index bee30a77cd70..59515905d4ad 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -110,7 +110,7 @@ void do_page_fault(struct pt_regs *regs)
 	 */
 	fault = handle_mm_fault(vma, address, flags);
 
-	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+	if (fault_signal_pending(fault, regs))
 		return;
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 88050259c466..4c87ffce64d1 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -369,6 +369,19 @@ static inline int signal_pending_state(long state, struct task_struct *p)
 	return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
 }
 
+/*
+ * This should only be used in fault handlers to decide whether we
+ * should stop the current fault routine to handle the signals
+ * instead, especially with the case where we've got interrupted with
+ * a VM_FAULT_RETRY.
+ */
+static inline bool fault_signal_pending(unsigned int fault_flags,
+					struct pt_regs *regs)
+{
+	return unlikely((fault_flags & VM_FAULT_RETRY) &&
+			fatal_signal_pending(current));
+}
+
 /*
  * Reevaluate whether the task has signals pending delivery.
  * Wake the task if so.
-- 
2.24.1



^ permalink raw reply related

* [PATCH RESEND v6 00/16] mm: Page fault enhancements
From: Peter Xu @ 2020-02-20 15:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	David Hildenbrand, Bobby Powers, Maya Gokhale, Jerome Glisse,
	Mike Kravetz, Matthew Wilcox, Marty McFadden, Mel Gorman, peterx,
	Hugh Dickins, Brian Geffon, Denis Plotnikov, Pavel Emelyanov

[Resend v6]

This is v6 of the series.  It is majorly a rebase to 5.6-rc2, nothing
else to be expected (plus some tests after the rebase).  Instead of
rewrite the cover letter I decided to use what we have for v5.

Adding extra CCs for both Bobby Powers <bobbypowers@gmail.com> and
Brian Geffon <bgeffon@google.com>.

Online repo: https://github.com/xzpeter/linux/tree/mm-pf-signal-retry

Any review comment is appreciated.  Thanks,

=============== v5 cover letter ==================

This is v5 of the series.  As Matthew suggested, I split the previous
patch "mm: Return faster for non-fatal signals in user mode faults"
into a few smaller ones:

  1. One patch to introduce fatal_signal_pending(), and use it in
     archs that can directly apply

  2. A few more patches to let the rest archs to use the new helper.
     With that we can have an unified entry for signal detection

  3. One last patch to change fatal_signal_pending() to detect
     userspace non-fatal signal

Nothing should have changed in the rest patches.  Because the fault
retry patches will depend on the previous ones, I decided to simply
repost all the patches.

Here's the new patchset layout:

Patch 1-2:      cleanup, and potential bugfix of hugetlbfs on fault retry

Patch 3-9:      let page fault to respond to non-fatal signals faster

Patch 10:       remove the userfaultfd NOPAGE emulation

Patch 11-14:    allow page fault to retry more than once

Patch 15-16:    let gup code to use FAULT_FLAG_KILLABLE too

I would really appreciate any review comments for the series,
especially for the first two patches which IMHO are even not related
to this patchset and they should either cleanup or fix things.

Smoke tested on x86 only.

Thanks,

v5:
- split "mm: Return faster for non-fatal signals in user mode faults"
  into a few more patches, let all archs to use an unified entry for
  fast signal handling (fatal_signal_pending)

v4:
- use lore.kernel.org for all the links in commit messages [Kirill]
- one more patch ("mm/gup: Fix __get_user_pages() on fault retry of
  hugetlb") to fix hugetlb path on fault retry
- one more patch ("mm/gup: Allow to react to fatal signals") to:
  - use down_read_killable() properly [Linus]
  - pass in FAULT_FLAG_KILLABLE for all GUP [Linus]
- one more patch ("mm/userfaultfd: Honor FAULT_FLAG_KILLABLE in fault
  path") to let handle_userfaultfd() respect FAULT_FLAG_KILLABLE.
  Should have no functional change after previous two new patches.

v3:
- check fatal signals in __get_user_page_locked() [Linus]
- add r-bs

v2:
- resent previous version, rebase only

=============== v1 cover letter ==================

This series is split out of userfaultfd-wp series to only cover the
general page fault changes, since it seems to make sense itself.

Basically it does two things:

  (a) Allows the page fault handlers to be more interactive on not
      only SIGKILL, but also the rest of userspace signals (especially
      for user-mode faults), and,

  (b) Allows the page fault retry (VM_FAULT_RETRY) to happen for more
      than once.

I'm keeping the CC list as in uffd-wp v5, hopefully I'm not sending
too much spams...

And, instead of writting again the cover letter, I'm just copy-pasting
my previous link here which has more details on why we do this:

  https://patchwork.kernel.org/cover/10691991/

The major change from that latest version should be that we introduced
a new page fault flag FAULT_FLAG_INTERRUPTIBLE as suggested by Linus
[1] to represents that we would like the fault handler to respond to
non-fatal signals.  Also, we're more careful now on when to do the
immediate return of the page fault for such signals.  For example, now
we'll only check against signal_pending() for user-mode page faults
and we keep the kernel-mode page fault patch untouched for it.  More
information can be found in separate patches.

The patchset is only lightly tested on x86.

All comments are greatly welcomed.  Thanks,

[1] https://lkml.org/lkml/2019/6/25/1382

Peter Xu (16):
  mm/gup: Rename "nonblocking" to "locked" where proper
  mm/gup: Fix __get_user_pages() on fault retry of hugetlb
  mm: Introduce fault_signal_pending()
  x86/mm: Use helper fault_signal_pending()
  arc/mm: Use helper fault_signal_pending()
  arm64/mm: Use helper fault_signal_pending()
  powerpc/mm: Use helper fault_signal_pending()
  sh/mm: Use helper fault_signal_pending()
  mm: Return faster for non-fatal signals in user mode faults
  userfaultfd: Don't retake mmap_sem to emulate NOPAGE
  mm: Introduce FAULT_FLAG_DEFAULT
  mm: Introduce FAULT_FLAG_INTERRUPTIBLE
  mm: Allow VM_FAULT_RETRY for multiple times
  mm/gup: Allow VM_FAULT_RETRY for multiple times
  mm/gup: Allow to react to fatal signals
  mm/userfaultfd: Honor FAULT_FLAG_KILLABLE in fault path

 arch/alpha/mm/fault.c           |  6 +--
 arch/arc/mm/fault.c             | 35 +++++--------
 arch/arm/mm/fault.c             |  7 +--
 arch/arm64/mm/fault.c           | 26 +++------
 arch/hexagon/mm/vm_fault.c      |  5 +-
 arch/ia64/mm/fault.c            |  5 +-
 arch/m68k/mm/fault.c            |  7 +--
 arch/microblaze/mm/fault.c      |  5 +-
 arch/mips/mm/fault.c            |  5 +-
 arch/nds32/mm/fault.c           |  5 +-
 arch/nios2/mm/fault.c           |  7 +--
 arch/openrisc/mm/fault.c        |  5 +-
 arch/parisc/mm/fault.c          |  8 ++-
 arch/powerpc/mm/fault.c         | 20 ++-----
 arch/riscv/mm/fault.c           |  9 +---
 arch/s390/mm/fault.c            | 10 ++--
 arch/sh/mm/fault.c              | 13 +++--
 arch/sparc/mm/fault_32.c        |  5 +-
 arch/sparc/mm/fault_64.c        |  5 +-
 arch/um/kernel/trap.c           |  3 +-
 arch/unicore32/mm/fault.c       |  8 ++-
 arch/x86/mm/fault.c             | 30 +++++------
 arch/xtensa/mm/fault.c          |  5 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 12 +++--
 fs/userfaultfd.c                | 62 ++++++++++------------
 include/linux/mm.h              | 81 ++++++++++++++++++++++++----
 include/linux/sched/signal.h    | 14 +++++
 mm/filemap.c                    |  2 +-
 mm/gup.c                        | 93 +++++++++++++++++++++------------
 mm/hugetlb.c                    | 17 +++---
 mm/internal.h                   |  6 +--
 31 files changed, 281 insertions(+), 240 deletions(-)

-- 
2.24.1



^ permalink raw reply

* [PATCH RESEND v6 04/16] x86/mm: Use helper fault_signal_pending()
From: Peter Xu @ 2020-02-20 15:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, Martin Cracauer, Linus Torvalds, Mike Rapoport,
	Kirill A . Shutemov, Johannes Weiner, Dr . David Alan Gilbert,
	David Hildenbrand, Bobby Powers, Maya Gokhale, Jerome Glisse,
	Mike Kravetz, Matthew Wilcox, Marty McFadden, Mel Gorman, peterx,
	Hugh Dickins, Brian Geffon, Denis Plotnikov, Pavel Emelyanov
In-Reply-To: <20200220155353.8676-1-peterx@redhat.com>

Let's move the fatal signal check even earlier so that we can directly
use the new fault_signal_pending() in x86 mm code.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/mm/fault.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fa4ea09593ab..6a00bc8d047f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1442,27 +1442,25 @@ void do_user_addr_fault(struct pt_regs *regs,
 	fault = handle_mm_fault(vma, address, flags);
 	major |= fault & VM_FAULT_MAJOR;
 
+	/* Quick path to respond to signals */
+	if (fault_signal_pending(fault, regs)) {
+		if (!user_mode(regs))
+			no_context(regs, hw_error_code, address, SIGBUS,
+				   BUS_ADRERR);
+		return;
+	}
+
 	/*
 	 * If we need to retry the mmap_sem has already been released,
 	 * and if there is a fatal signal pending there is no guarantee
 	 * that we made any progress. Handle this case first.
 	 */
-	if (unlikely(fault & VM_FAULT_RETRY)) {
+	if (unlikely((fault & VM_FAULT_RETRY) &&
+		     (flags & FAULT_FLAG_ALLOW_RETRY))) {
 		/* Retry at most once */
-		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
-			flags |= FAULT_FLAG_TRIED;
-			if (!fatal_signal_pending(tsk))
-				goto retry;
-		}
-
-		/* User mode? Just return to handle the fatal exception */
-		if (flags & FAULT_FLAG_USER)
-			return;
-
-		/* Not returning to user mode? Handle exceptions or die: */
-		no_context(regs, hw_error_code, address, SIGBUS, BUS_ADRERR);
-		return;
+		flags &= ~FAULT_FLAG_ALLOW_RETRY;
+		flags |= FAULT_FLAG_TRIED;
+		goto retry;
 	}
 
 	up_read(&mm->mmap_sem);
-- 
2.24.1



^ permalink raw reply related

* Re: omap-secure.c:undefined reference to `__arm_smccc_smc'
From: Tony Lindgren @ 2020-02-20 15:54 UTC (permalink / raw)
  To: kbuild-all
In-Reply-To: <202002131856.VeW4PhBJ%lkp@intel.com>

[-- Attachment #1: Type: text/plain, Size: 1187 bytes --]

Andrew,

* kbuild test robot <lkp@intel.com> [200213 10:27]:
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> head:   0bf999f9c5e74c7ecf9dafb527146601e5c848b9
> commit: c37baa06f8a970e4a533d41f7d33e5e57de5ad25 ARM: OMAP2+: Fix undefined reference to omap_secure_init
> date:   3 weeks ago
> config: arm-randconfig-a001-20200213 (attached as .config)
> compiler: arm-linux-gnueabi-gcc (GCC) 7.5.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         git checkout c37baa06f8a970e4a533d41f7d33e5e57de5ad25
>         # save the attached .config to linux build tree
>         GCC_VERSION=7.5.0 make.cross ARCH=arm 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot <lkp@intel.com>
> 
> All errors (new ones prefixed by >>):
> 
>    arch/arm/mach-omap2/omap-secure.o: In function `omap_smccc_smc':
> >> omap-secure.c:(.text+0x94): undefined reference to `__arm_smccc_smc'

Have you looked at this one? Looks like there's still an unhandled
randconfig build case.

Regards,

Tony

^ permalink raw reply

* Re: omap-secure.c:undefined reference to `__arm_smccc_smc'
From: Tony Lindgren @ 2020-02-20 15:54 UTC (permalink / raw)
  To: Andrew F. Davis; +Cc: kbuild-all, linux-kernel, kbuild test robot
In-Reply-To: <202002131856.VeW4PhBJ%lkp@intel.com>

Andrew,

* kbuild test robot <lkp@intel.com> [200213 10:27]:
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> head:   0bf999f9c5e74c7ecf9dafb527146601e5c848b9
> commit: c37baa06f8a970e4a533d41f7d33e5e57de5ad25 ARM: OMAP2+: Fix undefined reference to omap_secure_init
> date:   3 weeks ago
> config: arm-randconfig-a001-20200213 (attached as .config)
> compiler: arm-linux-gnueabi-gcc (GCC) 7.5.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         git checkout c37baa06f8a970e4a533d41f7d33e5e57de5ad25
>         # save the attached .config to linux build tree
>         GCC_VERSION=7.5.0 make.cross ARCH=arm 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot <lkp@intel.com>
> 
> All errors (new ones prefixed by >>):
> 
>    arch/arm/mach-omap2/omap-secure.o: In function `omap_smccc_smc':
> >> omap-secure.c:(.text+0x94): undefined reference to `__arm_smccc_smc'

Have you looked at this one? Looks like there's still an unhandled
randconfig build case.

Regards,

Tony

^ permalink raw reply

* [Cluster-devel] [PATCH v7 14/24] btrfs: Convert from readpages to readahead
From: Matthew Wilcox @ 2020-02-20 15:54 UTC (permalink / raw)
  To: cluster-devel.redhat.com
In-Reply-To: <20200220154658.GA19577@infradead.org>

On Thu, Feb 20, 2020 at 07:46:58AM -0800, Christoph Hellwig wrote:
> On Thu, Feb 20, 2020 at 05:48:49AM -0800, Matthew Wilcox wrote:
> > btrfs: Convert from readpages to readahead
> >   
> > Implement the new readahead method in btrfs.  Add a readahead_page_batch()
> > to optimise fetching a batch of pages at once.
> 
> Shouldn't this readahead_page_batch heper go into a separate patch so
> that it clearly stands out?

I'll move it into 'Put readahead pages in cache earlier' for v8 (the
same patch where we add readahead_page())




^ permalink raw reply

* [Ocfs2-devel] [PATCH v7 14/24] btrfs: Convert from readpages to readahead
From: Matthew Wilcox @ 2020-02-20 15:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Johannes Thumshirn, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-btrfs@vger.kernel.org, linux-erofs@lists.ozlabs.org,
	linux-ext4@vger.kernel.org,
	linux-f2fs-devel@lists.sourceforge.net, cluster-devel@redhat.com,
	ocfs2-devel@oss.oracle.com, linux-xfs@vger.kernel.org
In-Reply-To: <20200220154658.GA19577@infradead.org>

On Thu, Feb 20, 2020 at 07:46:58AM -0800, Christoph Hellwig wrote:
> On Thu, Feb 20, 2020 at 05:48:49AM -0800, Matthew Wilcox wrote:
> > btrfs: Convert from readpages to readahead
> >   
> > Implement the new readahead method in btrfs.  Add a readahead_page_batch()
> > to optimise fetching a batch of pages at once.
> 
> Shouldn't this readahead_page_batch heper go into a separate patch so
> that it clearly stands out?

I'll move it into 'Put readahead pages in cache earlier' for v8 (the
same patch where we add readahead_page())

^ permalink raw reply

* Re: [PATCH v7 14/24] btrfs: Convert from readpages to readahead
From: Matthew Wilcox @ 2020-02-20 15:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Johannes Thumshirn, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-btrfs@vger.kernel.org, linux-erofs@lists.ozlabs.org,
	linux-ext4@vger.kernel.org,
	linux-f2fs-devel@lists.sourceforge.net, cluster-devel@redhat.com,
	ocfs2-devel@oss.oracle.com, linux-xfs@vger.kernel.org
In-Reply-To: <20200220154658.GA19577@infradead.org>

On Thu, Feb 20, 2020 at 07:46:58AM -0800, Christoph Hellwig wrote:
> On Thu, Feb 20, 2020 at 05:48:49AM -0800, Matthew Wilcox wrote:
> > btrfs: Convert from readpages to readahead
> >   
> > Implement the new readahead method in btrfs.  Add a readahead_page_batch()
> > to optimise fetching a batch of pages at once.
> 
> Shouldn't this readahead_page_batch heper go into a separate patch so
> that it clearly stands out?

I'll move it into 'Put readahead pages in cache earlier' for v8 (the
same patch where we add readahead_page())

^ permalink raw reply

* Re: [PATCH v7 14/24] btrfs: Convert from readpages to readahead
From: Matthew Wilcox @ 2020-02-20 15:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs@vger.kernel.org, Johannes Thumshirn,
	linux-kernel@vger.kernel.org,
	linux-f2fs-devel@lists.sourceforge.net, cluster-devel@redhat.com,
	linux-mm@kvack.org, ocfs2-devel@oss.oracle.com,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-erofs@lists.ozlabs.org, linux-btrfs@vger.kernel.org
In-Reply-To: <20200220154658.GA19577@infradead.org>

On Thu, Feb 20, 2020 at 07:46:58AM -0800, Christoph Hellwig wrote:
> On Thu, Feb 20, 2020 at 05:48:49AM -0800, Matthew Wilcox wrote:
> > btrfs: Convert from readpages to readahead
> >   
> > Implement the new readahead method in btrfs.  Add a readahead_page_batch()
> > to optimise fetching a batch of pages at once.
> 
> Shouldn't this readahead_page_batch heper go into a separate patch so
> that it clearly stands out?

I'll move it into 'Put readahead pages in cache earlier' for v8 (the
same patch where we add readahead_page())

^ permalink raw reply

* Re: [PATCH 10/12] mm: x86: Invoke hypercall when page encryption status is changed
From: Brijesh Singh @ 2020-02-20 15:54 UTC (permalink / raw)
  To: Andy Lutomirski, Steve Rutherford
  Cc: brijesh.singh, Ashish Kalra, Paolo Bonzini, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Radim Krčmář,
	Joerg Roedel, Borislav Petkov, Tom Lendacky, David Rientjes, x86,
	KVM list, LKML
In-Reply-To: <52450536-AF7B-4206-8F05-CF387A216031@amacapital.net>



On 2/19/20 8:12 PM, Andy Lutomirski wrote:
> 
> 
>> On Feb 19, 2020, at 5:58 PM, Steve Rutherford <srutherford@google.com> wrote:
>>
>> On Wed, Feb 12, 2020 at 5:18 PM Ashish Kalra <Ashish.Kalra@amd.com> wrote:
>>>
>>> From: Brijesh Singh <brijesh.singh@amd.com>
>>>
>>> Invoke a hypercall when a memory region is changed from encrypted ->
>>> decrypted and vice versa. Hypervisor need to know the page encryption
>>> status during the guest migration.
>>
>> One messy aspect, which I think is fine in practice, is that this
>> presumes that pages are either treated as encrypted or decrypted. If
>> also done on SEV, the in-place re-encryption supported by SME would
>> break SEV migration. Linux doesn't do this now on SEV, and I don't
>> have an intuition for why Linux might want this, but we will need to
>> ensure it is never done in order to ensure that migration works down
>> the line. I don't believe the AMD manual promises this will work
>> anyway.
>>
>> Something feels a bit wasteful about having all future kernels
>> universally announce c-bit status when SEV is enabled, even if KVM
>> isn't listening, since it may be too old (or just not want to know).
>> Might be worth eliding the hypercalls if you get ENOSYS back? There
>> might be a better way of passing paravirt config metadata across than
>> just trying and seeing if the hypercall succeeds, but I'm not super
>> familiar with it.
> 
> I actually think this should be a hard requirement to merge this. The host needs to tell the guest that it supports this particular migration strategy and the guest needs to tell the host that it is using it.  And the guest needs a way to tell the host that it’s *not* using it right now due to kexec, for example.
> 
> I’m still uneasy about a guest being migrated in the window where the hypercall tracking and the page encryption bit don’t match.  I guess maybe corruption in this window doesn’t matter?
> 

I don't think there is a corruption issue here. Let's consider the below
case:

1) A page is transmitted as C=1 (encrypted)

2) During the migration window, the page encryption bit is changed
  to C=0 (decrypted)

3) #2 will cause a change in page table memory, thus dirty memory
  the tracker will create retransmission of the page table memory.

4) The page itself will not be re-transmitted because there was
  no change to the content of the page.

On destination, the read from the page will get the ciphertext.

The encryption bit change in the page table is used on the next access.
The user of the page needs to ensure that data is written with the
correct encryption bit before reading.

thanks

^ permalink raw reply

* Re: [RFC PATCH v3 05/27] qcow2: Document the Extended L2 Entries feature
From: Max Reitz @ 2020-02-20 15:54 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Anton Nefedov, qemu-block,
	Vladimir Sementsov-Ogievskiy, Denis V . Lunev
In-Reply-To: <0b884ddcd0ac3a3c0b8cdd9d09c74566ac107c9a.1577014346.git.berto@igalia.com>


[-- Attachment #1.1: Type: text/plain, Size: 3344 bytes --]

On 22.12.19 12:36, Alberto Garcia wrote:
> Subcluster allocation in qcow2 is implemented by extending the
> existing L2 table entries and adding additional information to
> indicate the allocation status of each subcluster.
> 
> This patch documents the changes to the qcow2 format and how they
> affect the calculation of the L2 cache size.
> 
> Signed-off-by: Alberto Garcia <berto@igalia.com>
> ---
>  docs/interop/qcow2.txt | 68 ++++++++++++++++++++++++++++++++++++++++--
>  docs/qcow2-cache.txt   | 19 +++++++++++-
>  2 files changed, 83 insertions(+), 4 deletions(-)
> 
> diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
> index af5711e533..d34261f955 100644
> --- a/docs/interop/qcow2.txt
> +++ b/docs/interop/qcow2.txt
> @@ -39,6 +39,9 @@ The first cluster of a qcow2 image contains the file header:
>                      as the maximum cluster size and won't be able to open images
>                      with larger cluster sizes.
>  
> +                    Note: if the image has Extended L2 Entries then cluster_bits
> +                    must be at least 14 (i.e. 16384 byte clusters).
> +
>           24 - 31:   size
>                      Virtual disk size in bytes.
>  
> @@ -109,7 +112,12 @@ in the description of a field.
>                                  An External Data File Name header extension may
>                                  be present if this bit is set.
>  
> -                    Bits 3-63:  Reserved (set to 0)
> +                    Bit 3:      Extended L2 Entries.  If this bit is set then

I suppose bit 4 now.  (Compression is bit 3.)

[...]

> +Subcluster Allocation Bitmap (for standard clusters):
> +
> +    Bit  0 -  31:   Allocation status (one bit per subcluster)
> +
> +                    1: the subcluster is allocated. In this case the
> +                       host cluster offset field must contain a valid
> +                       offset.
> +                    0: the subcluster is not allocated. In this case
> +                       read requests shall go to the backing file or
> +                       return zeros if there is no backing file data.
> +
> +                    Bits are assigned starting from the most significant one.
> +                    (i.e. bit x is used for subcluster 31 - x)

I still prefer it the other way round, both personally (e.g. it’s the C
ordering), and because other places in qcow2 use LSb for bit ordering
(the refcount order).

I don’t see ease of debugging as a particularly good reason; but then
again, I didn’t have to debug this feature yet (as opposed to you).

But since I’m used to counting bits from the right (because this is how
it’s done basically everywhere), I can’t imagine I would find it more
difficult than counting them from the left.

Max

> +        32 -  63    Subcluster reads as zeros (one bit per subcluster)
> +
> +                    1: the subcluster reads as zeros. In this case the
> +                       allocation status bit must be unset. The host
> +                       cluster offset field may or may not be set.
> +                    0: no effect.
> +
> +                    Bits are assigned starting from the most significant one.
> +                    (i.e. bit x is used for subcluster 63 - x)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v4 1/6] driver core: Fix driver_deferred_probe_check_state() logic
From: Bjorn Andersson @ 2020-02-20 15:55 UTC (permalink / raw)
  To: John Stultz
  Cc: lkml, Rob Herring, Rafael J. Wysocki, Kevin Hilman, Ulf Hansson,
	Pavel Machek, Len Brown, Todd Kjos, Liam Girdwood, Mark Brown,
	Thierry Reding, Linus Walleij, Greg Kroah-Hartman, linux-pm
In-Reply-To: <20200220050440.45878-2-john.stultz@linaro.org>

On Wed 19 Feb 21:04 PST 2020, John Stultz wrote:

> driver_deferred_probe_check_state() has some uninituitive behavior.
> 
> * From boot to late_initcall, it returns -EPROBE_DEFER
> 
> * From late_initcall to the deferred_probe_timeout (if set)
>   it returns -ENODEV
> 
> * If the deferred_probe_timeout it set, after it fires, it
>   returns -ETIMEDOUT
> 
> This is a bit confusing, as its useful to have the function
> return -EPROBE_DEFER while the timeout is still running. This
> behavior has resulted in the somwhat duplicative
> driver_deferred_probe_check_state_continue() function being
> added.
> 
> Thus this patch tries to improve the logic, so that it behaves
> as such:
> 
> * If deferred_probe_timeout is set, it returns -EPROBE_DEFER
>   until the timeout, afterwhich it returns -ETIMEDOUT.
> 
> * If deferred_probe_timeout is not set (-1), it returns
>   -EPROBE_DEFER until late_initcall, after which it returns
> 
> This will make the deferred_probe_timeout value much more
> functional, and will allow us to consolidate the
> driver_deferred_probe_check_state() and
> driver_deferred_probe_check_state_continue() logic in a later
> patch.
> 
> Cc: Rob Herring <robh@kernel.org>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Kevin Hilman <khilman@kernel.org>
> Cc: Ulf Hansson <ulf.hansson@linaro.org>
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Todd Kjos <tkjos@google.com>
> Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
> Cc: Liam Girdwood <lgirdwood@gmail.com>
> Cc: Mark Brown <broonie@kernel.org>
> Cc: Thierry Reding <treding@nvidia.com>
> Cc: Linus Walleij <linus.walleij@linaro.org>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: linux-pm@vger.kernel.org
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> Change-Id: I8349b7a403ce8cbce485ea0a0a5512fddffb635c

Please drop the Change-Id.

> ---
> v4:
> * Simplified logic suggested by Andy Shevchenko
> * Clarified commit message to focus on logic change
> ---
>  drivers/base/dd.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index b25bcab2a26b..bb383dca39c1 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -237,7 +237,7 @@ __setup("deferred_probe_timeout=", deferred_probe_timeout_setup);
>  
>  static int __driver_deferred_probe_check_state(struct device *dev)
>  {
> -	if (!initcalls_done)
> +	if (!initcalls_done || deferred_probe_timeout > 0)
>  		return -EPROBE_DEFER;
>  
>  	if (!deferred_probe_timeout) {
> @@ -252,9 +252,11 @@ static int __driver_deferred_probe_check_state(struct device *dev)
>   * driver_deferred_probe_check_state() - Check deferred probe state
>   * @dev: device to check
>   *
> - * Returns -ENODEV if init is done and all built-in drivers have had a chance
> - * to probe (i.e. initcalls are done), -ETIMEDOUT if deferred probe debug
> - * timeout has expired, or -EPROBE_DEFER if none of those conditions are met.
> + * Returnes -EPROBE_DEFER if initcalls have not completed, or the deferred

As pointed out by Rafael, this should be Return:

With that addressed, you have my
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>

Regards,
Bjorn

> + * probe timeout is set, but not expried.
> + * Returns -ETIMEDOUT if the probe timeout was set and has expired.
> + * Returns -ENODEV if initcalls have completed and the deferred probe timeout
> + * was not set.
>   *
>   * Drivers or subsystems can opt-in to calling this function instead of directly
>   * returning -EPROBE_DEFER.
> -- 
> 2.17.1
> 

^ permalink raw reply

* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints
From: chris hyser @ 2020-02-20 15:55 UTC (permalink / raw)
  To: David Laight, Parth Shah, vincent.guittot@linaro.org,
	patrick.bellasi@matbug.net, valentin.schneider@arm.com,
	dhaval.giani@oracle.com, dietmar.eggemann@arm.com
  Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	mingo@redhat.com, qais.yousef@arm.com, pavel@ucw.cz,
	qperret@qperret.net, pjt@google.com, tj@kernel.org
In-Reply-To: <2870e44f41414fd58b58f7831d7386fe@AcuMS.aculab.com>



On 2/20/20 9:39 AM, David Laight wrote:
> From: chris hyser <chris.hyser@oracle.com>
>> Sent: 19 February 2020 17:17
>>
>> On 2/19/20 6:18 AM, David Laight wrote:
>>> From: chris hyser
>>>> Sent: 18 February 2020 23:00
>>> ...
>>>> All, I was asked to take a look at the original latency_nice patchset.
>>>> First, to clarify objectives, Oracle is not
>>>> interested in trading throughput for latency.
>>>> What we found is that the DB has specific tasks which do very little but
>>>> need to do this as absolutely quickly as possible, ie extreme latency
>>>> sensitivity. Second, the key to latency reduction
>>>> in the task wakeup path seems to be limiting variations of "idle cpu" search.
>>>> The latter particularly interests me as an example of "platform size
>>>> based latency" which I believe to be important given all the varying size
>>>> VMs and containers.
>>>
>>>   From my experiments there are a few things that seem to affect latency
>>> of waking up real time (sched fifo) tasks on a normal kernel:
>>
>> Sorry. I was only ever talking about sched_other as per the original patchset. I realize the term
>> extreme latency
>> sensitivity may have caused confusion. What that means to DB people is no doubt different than audio
>> people. :-)
> 
> Shorter lines.....
> 
> ISTM you are making some already complicated code even more complex.
> Better to make it simpler instead.

The code already exists to set a limit to bail out of what is sometimes a needlessly excessive search. Setting that 
based on an integer doesn't seem particularly complex. Now whether that is actually useful is what I'm currently looking at.

> 
> If you need a thread to run as soon as possible after it is woken
> why not use the RT scheduler (eg SCHED_FIFO) that is what it is for.

Overkill and doesn't play well with cpu cgroup controller.


> 
> If there are delays finding an idle cpu to migrate a process to
> (especially on systems with large numbers of cpu) then that is a
> general problem that can be addressed without extra knobs.

There is no if. It is a brute force search. There are delays proportional to the search domain size. You can optimize 
the hell of out the brute force, or you use obtained knowledge to bail out early. Getting that knowledge from the user 
is a time honored tradition. :-)

-chrish

^ permalink raw reply

* Re: [PATCH] mm: memcontrol: asynchronous reclaim for memory.high
From: Tejun Heo @ 2020-02-20 15:56 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Johannes Weiner, Michal Hocko, Andrew Morton, Roman Gushchin,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg,
	Peter Zijlstra
In-Reply-To: <20200220154524.dql3i5brnjjwecft-S51bK0XF4qpuJJETbFA3a0B3C2bhBk7L0E9HWUfgJXw@public.gmane.org>

On Thu, Feb 20, 2020 at 10:45:24AM -0500, Daniel Jordan wrote:
> Ok, consistency with io and memory is one advantage to doing it that way.
> Creating kthreads in cgroups also seems viable so far, and it's unclear whether
> either approach is significantly simpler or more maintainable than the other,
> at least to me.

The problem with separate kthread approach is that many of these work
units are tiny, and cgroup membership might not be known or doesn't
agree with the processing context from the beginning

For example, the ownership of network packets can't be determined till
processing has progressed quite a bit in shared contexts and each item
too small to bounce around. The only viable way I can think of
splitting aggregate overhead according to the number of packets (or
some other trivially measureable quntity) processed.

Anything sitting in reclaim layer is the same. Reclaim should be
charged to the cgroup whose memory is reclaimed *but* shouldn't block
other cgroups which are waiting for that memory. It has to happen in
the context of the highest priority entity waiting for memory but the
costs incurred must be charged to the memory owners.

So, one way or the other, I think we'll need back charging and once
back charging is needed for big ticket items like network and reclaim,
it's kinda silly to use separate mechanisms for other stuff.

> Is someone on your side working on remote charging right now?  I was planning
> to post an RFD comparing these soon and it would make sense to include them.

It's been on the to do list but nobody is working on it yet.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH] mm: memcontrol: asynchronous reclaim for memory.high
From: Tejun Heo @ 2020-02-20 15:56 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Johannes Weiner, Michal Hocko, Andrew Morton, Roman Gushchin,
	linux-mm, cgroups, linux-kernel, kernel-team, Peter Zijlstra
In-Reply-To: <20200220154524.dql3i5brnjjwecft@ca-dmjordan1.us.oracle.com>

On Thu, Feb 20, 2020 at 10:45:24AM -0500, Daniel Jordan wrote:
> Ok, consistency with io and memory is one advantage to doing it that way.
> Creating kthreads in cgroups also seems viable so far, and it's unclear whether
> either approach is significantly simpler or more maintainable than the other,
> at least to me.

The problem with separate kthread approach is that many of these work
units are tiny, and cgroup membership might not be known or doesn't
agree with the processing context from the beginning

For example, the ownership of network packets can't be determined till
processing has progressed quite a bit in shared contexts and each item
too small to bounce around. The only viable way I can think of
splitting aggregate overhead according to the number of packets (or
some other trivially measureable quntity) processed.

Anything sitting in reclaim layer is the same. Reclaim should be
charged to the cgroup whose memory is reclaimed *but* shouldn't block
other cgroups which are waiting for that memory. It has to happen in
the context of the highest priority entity waiting for memory but the
costs incurred must be charged to the memory owners.

So, one way or the other, I think we'll need back charging and once
back charging is needed for big ticket items like network and reclaim,
it's kinda silly to use separate mechanisms for other stuff.

> Is someone on your side working on remote charging right now?  I was planning
> to post an RFD comparing these soon and it would make sense to include them.

It's been on the to do list but nobody is working on it yet.

Thanks.

-- 
tejun


^ permalink raw reply

* Re: [PATCH v2] xfs: add agf freeblocks verify in xfs_agf_verify
From: Darrick J. Wong @ 2020-02-20 15:56 UTC (permalink / raw)
  To: Zheng Bin; +Cc: sandeen, bfoster, dchinner, linux-xfs, renxudong1, yi.zhang
In-Reply-To: <1582197182-142137-1-git-send-email-zhengbin13@huawei.com>

On Thu, Feb 20, 2020 at 07:13:02PM +0800, Zheng Bin wrote:
> We recently used fuzz(hydra) to test XFS and automatically generate
> tmp.img(XFS v5 format, but some metadata is wrong)
> 
> xfs_repair information(just one AG):
> agf_freeblks 0, counted 3224 in ag 0
> agf_longest 536874136, counted 3224 in ag 0
> sb_fdblocks 613, counted 3228
> 
> Test as follows:
> mount tmp.img tmpdir
> cp file1M tmpdir
> sync
> 
> In 4.19-stable, sync will stuck, the reason is:
> xfs_mountfs
>   xfs_check_summary_counts
>     if ((!xfs_sb_version_haslazysbcount(&mp->m_sb) ||
>        XFS_LAST_UNMOUNT_WAS_CLEAN(mp)) &&
>        !xfs_fs_has_sickness(mp, XFS_SICK_FS_COUNTERS))
> 	return 0;  -->just return, incore sb_fdblocks still be 613
>     xfs_initialize_perag_data
> 
> cp file1M tmpdir -->ok(write file to pagecache)
> sync -->stuck(write pagecache to disk)
> xfs_map_blocks
>   xfs_iomap_write_allocate
>     while (count_fsb != 0) {
>       nimaps = 0;
>       while (nimaps == 0) { --> endless loop
>          nimaps = 1;
>          xfs_bmapi_write(..., &nimaps) --> nimaps becomes 0 again
> xfs_bmapi_write
>   xfs_bmap_alloc
>     xfs_bmap_btalloc
>       xfs_alloc_vextent
>         xfs_alloc_fix_freelist
>           xfs_alloc_space_available -->fail(agf_freeblks is 0)
> 
> In linux-next, sync not stuck, cause commit c2b3164320b5 ("xfs:
> use the latest extent at writeback delalloc conversion time") remove
> the above while, dmesg is as follows:
> [   55.250114] XFS (loop0): page discard on page ffffea0008bc7380, inode 0x1b0c, offset 0.
> 
> Users do not know why this page is discard, the better soultion is:
> 1. Like xfs_repair, make sure sb_fdblocks is equal to counted
> (xfs_initialize_perag_data did this, who is not called at this mount)
> 2. Add agf verify, if fail, will tell users to repair
> 
> This patch use the second soultion.
> 
> Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
> Signed-off-by: Ren Xudong <renxudong1@huawei.com>
> ---
> v1->v2: modify comment, add more agf verify
>  fs/xfs/libxfs/xfs_alloc.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index d8053bc..5faed42 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2839,6 +2839,7 @@ xfs_agf_verify(
>  {
>  	struct xfs_mount	*mp = bp->b_mount;
>  	struct xfs_agf		*agf = XFS_BUF_TO_AGF(bp);
> +	int i;
> 
>  	if (xfs_sb_version_hascrc(&mp->m_sb)) {
>  		if (!uuid_equal(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid))
> @@ -2858,6 +2859,22 @@ xfs_agf_verify(
>  	      be32_to_cpu(agf->agf_flcount) <= xfs_agfl_size(mp)))
>  		return __this_address;
> 
> +	if (be32_to_cpu(agf->agf_length) > mp->m_sb.sb_dblocks ||
> +	    be32_to_cpu(agf->agf_btreeblks) > be32_to_cpu(agf->agf_length) ||

Isn't this already covered later on?

> +	    be32_to_cpu(agf->agf_rmap_blocks) > be32_to_cpu(agf->agf_length) ||
> +	    be32_to_cpu(agf->agf_refcount_blocks) > be32_to_cpu(agf->agf_length) ||

Do these fields need checking when the corresponding feature isn't
enabled?

> +	    be32_to_cpu(agf->agf_spare2) != 0)

If, for some reason, these unused "spare" fields are *not* zero, won't
this cause mount failures on existing filesystems?

> +		return __this_address;

Please try to check only one agf field per if clause, because we use the
logged __this_address to figure out which field triggered the corruption
error.

> +
> +	for (i = 0; i < ARRAY_SIZE(agf->agf_spare64); i++)
> +		if (be64_to_cpu(agf->agf_spare64[i]) != 0)
> +			return __this_address;

memchr_inv if you leave in the spare check.

> +
> +	if (be32_to_cpu(agf->agf_freeblks) < be32_to_cpu(agf->agf_longest) ||
> +	    be32_to_cpu(agf->agf_freeblks) > be32_to_cpu(agf->agf_length) ||

Already covered in this function.

--D

> +	    be32_to_cpu(agf->agf_freeblks) > mp->m_sb.sb_fdblocks)
> +		return __this_address;
> +
>  	if (be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]) < 1 ||
>  	    be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]) < 1 ||
>  	    be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]) > XFS_BTREE_MAXLEVELS ||
> --
> 2.7.4
> 

^ permalink raw reply

* Re: [PATCH v4 2/6] driver core: Set deferred_probe_timeout to a longer default if CONFIG_MODULES is set
From: Bjorn Andersson @ 2020-02-20 15:55 UTC (permalink / raw)
  To: John Stultz
  Cc: lkml, Rob Herring, Rafael J. Wysocki, Kevin Hilman, Ulf Hansson,
	Pavel Machek, Len Brown, Todd Kjos, Liam Girdwood, Mark Brown,
	Thierry Reding, Linus Walleij, Greg Kroah-Hartman, linux-pm
In-Reply-To: <20200220050440.45878-3-john.stultz@linaro.org>

On Wed 19 Feb 21:04 PST 2020, John Stultz wrote:

> When using modules, its common for the modules not to be loaded
> until quite late by userland. With the current code,
> driver_deferred_probe_check_state() will stop returning
> EPROBE_DEFER after late_initcall, which can cause module
> dependency resolution to fail after that.
> 
> So allow a longer window of 30 seconds (picked somewhat
> arbitrarily, but influenced by the similar regulator core
> timeout value) in the case where modules are enabled.
> 
> Cc: Rob Herring <robh@kernel.org>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Kevin Hilman <khilman@kernel.org>
> Cc: Ulf Hansson <ulf.hansson@linaro.org>
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Todd Kjos <tkjos@google.com>
> Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
> Cc: Liam Girdwood <lgirdwood@gmail.com>
> Cc: Mark Brown <broonie@kernel.org>
> Cc: Thierry Reding <treding@nvidia.com>
> Cc: Linus Walleij <linus.walleij@linaro.org>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: linux-pm@vger.kernel.org
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> Change-Id: I9c5a02a54915ff53f9f14d49c601f41d7105e05e

Change-Id...

Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>

Regards,
Bjorn

> ---
> v4:
> * Split out into its own patch as suggested by Mark
> * Made change conditional on CONFIG_MODULES
> ---
>  drivers/base/dd.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index bb383dca39c1..fa138f24e2d3 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -224,7 +224,16 @@ static int deferred_devs_show(struct seq_file *s, void *data)
>  }
>  DEFINE_SHOW_ATTRIBUTE(deferred_devs);
>  
> +#ifdef CONFIG_MODULES
> +/*
> + * In the case of modules, set the default probe timeout to
> + * 30 seconds to give userland some time to load needed modules
> + */
> +static int deferred_probe_timeout = 30;
> +#else
> +/* In the case of !modules, no probe timeout needed */
>  static int deferred_probe_timeout = -1;
> +#endif
>  static int __init deferred_probe_timeout_setup(char *str)
>  {
>  	int timeout;
> -- 
> 2.17.1
> 

^ permalink raw reply

* Re: [PATCH v8 0/8] btrfs: remove buffer heads form superblock handling
From: David Sterba @ 2020-02-20 15:56 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: David Sterba, Nikolay Borisov, Josef Bacik,
	linux-btrfs @ vger . kernel . org, Christoph Hellwig
In-Reply-To: <20200213152436.13276-1-johannes.thumshirn@wdc.com>

On Fri, Feb 14, 2020 at 12:24:28AM +0900, Johannes Thumshirn wrote:
> This patch series removes the use of buffer_heads from btrfs' super block read
> and write paths. It also converts the integrity-checking code to only work
> with pages and BIOs.

I've fixed the small things and added this patchset to misc-next.
Thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: mscc: ocelot: Workaround to allow traffic to CPU in standalone mode
From: Vladimir Oltean @ 2020-02-20 15:56 UTC (permalink / raw)
  To: Allan W. Nielsen
  Cc: Andrew Lunn, David S. Miller, Horatiu Vultur, Alexandre Belloni,
	Florian Fainelli, Vivien Didelot, Joergen Andreasen,
	Claudiu Manoil, netdev, Microchip Linux Driver Support
In-Reply-To: <20200220132329.43s6tq3xsoo7htuz@lx-anielsen.microsemi.net>

On Thu, 20 Feb 2020 at 15:23, Allan W. Nielsen
<allan.nielsen@microchip.com> wrote:
>
> Horatiu and I have looked further into this, done a few experiments, and
> discussed with the HW engineers who have a more detailed version of how
> the chips are working and how Ocelot and Felix differs.
>
> Here are our findings:
>
> - The most significant bit in the PGID table is "special" as it is a
>    CPU-copy bit.

Wow.
Looking at the code after this realization, it is so confusing to call
the NPI port "ocelot->cpu" now, since it doesn't benefit from this
"privilege" that a "real" CPU port (from the Ocelot hardware
perspective) has.
Not to mention how strange it is for the hardware to behave this way.

> - This bit is not being used in the source filtering!

So BIT(6) on Felix and BIT(11) on Ocelot are just being interpreted
for the first and second PGID lookup (destination and aggregation
masks) but not for source masks? Don't you want to actually document
this somewhere?

So the frame is copied to the CPU based on the AND between first and
second lookup, and to all the other ports, including the NPI port,
based on the first, second and third PGID lookup.
So frames can reach the NPI port directly, via 3 PGID lookups, or
indirectly, via 2 PGID lookups. What I did is make the direct path
work. You're suggesting me to set the indirect mode up.

> This means that
>    your original patch can be applied without breaking Ocelot (the
>    uninitialized cpu field must be fixed though).

So my patch makes the Felix NPI port work in an unintended way and
does not affect Ocelot in any way. Roger.

>    - Still I do not think we should do this as it is not the root-casuse
> - In Felix we have 2 ways to get frames to the CPU, in Ocelot we have 1
>    (Ocelot also has two if it uses an NPI port, but it does not do that
>    in the current driver).
>    - In Felix you can get frames to the CPU by either using the CPU port
>      (port 6), or by using the NPI port (which can be any in the range of
>      0-5).
>      - But you should only use the CPU port, and not the NPI port
>        directly. Using the NPI port directly will cause the two targets
>        to behave differently, and this is not what we do when testing all
>        the use-cases on the switch.

Differently in what way?

>    - In Ocelot you can only get frames to the CPU by using the CPU port
>      (port 11).
>
> Due to this, I very much think you need to fix this, such that Felix
> always port 6 to reach the CPU (with the exception of writing
> QSYS_EXT_CPU_CFG where you "connect" the CPU queue/port to the NPI
> port).
>

What PGIDs should I use for the NPI port if I use it with indirection
via port 6?

> If you do this change, then the Ocelot and Felix should start to work in
> the same way.
>
> Then, if you want the CPU to be part of the unicast flooding (this is
> where this discussion started), then you should add the CPU port to the
> PGID entry pointed at in ANA:ANA:FLOODING:FLD_UNICAST. This should be
> done for Felix and not for Ocelot.

No, the question is why don't _you_ want the CPU to be in the
FLD_UNICAST PGID (which is PGID_UC). The distinction you're making
between Felix and Ocelot here is quite arbitrary, and seems to be
based just on "your CPU is more powerful".

So right now, multicast and broadcast traffic goes to PGID_MC (61),
and unknown unicast goes to PGID_UC (60).
The destination ports mask for PGID_MC is
GENMASK(ocelot->num_phys_ports, 0) - 0x7f for me, 0xfff for you.
The destination ports mask for PGID_UC is
GENMASK(ocelot->num_phys_ports - 1, 0) - this is the hardware default
value - 0x3f for me, 0x7ff for you.
So you are keeping the non-physical CPU port in the destination mask
for broadcast, but not in that for unknown unicast. In the way the
system is configured, it is still susceptible to broadcast storms. So
I don't think there is any real benefit in crippling the system like
this. If port 6 was in PGID_UC by default, I would have never bat an
eye and more than likely never needed to know the gory details.

Is it possible to set up policers for traffic going to CPU? I _did_
see this phrase already in the manual:

"Frames where the DMAC lookup returned a PGID with the CPU port set
are always forwarded to the CPU even when the frame
is policed by the storm policers."

So the traffic I'm seeing now on the NPI port is not copied to the
CPU, it is forwarded.
If there's no other way to set up storm policers for the mode you're
suggesting me to change Felix to, then I would respectfully keep it
the way it is right now.

>
> If you want the analyser (where the MAC table sits), to "learn" the CPU
> MAC (which is needed if you do not want to have the CPU mac as a static
> entry in the MAC-table), then you need to set the 'src-port' to 6 (if it
> is Ocelot then it will be 11) in the IFH:
>

Please tell me more about this. Don't I need to set BYPASS=1 for
frames to go on the front-panel port specified in DEST? And doesn't
BYPASS=1 mean no source MAC learning for injected traffic?
As much as I would like the analyzer to run, I won't do it if it is
going to compromise xmit from Linux. I don't want traffic sent from
Linux to an unknown unicast in standalone mode to be flooded to all
front-panel ports with no way for me to control it.

> anielsen@lx-anielsen ~ $ ef hex ifh-oc1 help
> ifh-oc1          Injection Frame Header for Ocelot1
>
> Specify the ifh-oc1 header by using one or more of the following fields:
> - Name ------------ offset:width --- Description --------------------------
>    bypass              +  0:  1  Skip analyzer processing
>    b1-rew-mac          +  1:  1  Replace SMAC address
>    b1-rew-op           +  2:  9  Rewriter operation command
>    b0-masq             +  1:  1  Enable masquerading
>    b0-masq-port        +  2:  4  Masquerading port
>    rew-val             + 11: 32  Receive time stamp
>    res1                + 43: 17  Reserved
>    dest                + 60: 12  Destination set for the frame. Dest[11] is the CPU
>    res2                + 72:  9  Reserved
>    src-port            + 81:  4  The port number where the frame was injected (0-12)  <------------------- THIS FIELD
>    res3                + 85:  2  Reserved
>    trfm-timer          + 87:  4  Timer for periodic transmissions (1..8). If zero then normal injection
>    res4                + 91:  6  Reserved
>    dp                  + 97:  1  Drop precedence level after policing
>    pop-cnt             + 98:  2  Number of VLAN tags that must be popped
>    cpuq                +100:  8  CPU extraction queue mask
>    qos-class           +108:  3  Classified QoS class
>    tag-type            +111:  1  Tag information's associated Tag Protocol Identifier (TPID)
>    pcp                 +112:  3  Classified PCP
>    dei                 +115:  1  Classified DEI
>    vid                 +116: 12  Classified VID
>
>
> /Allan
>

Thanks,
-Vladimir

^ permalink raw reply

* [PATCH] firmware: imx: Remove IMX_SC_RPC_SVC_ABORT
From: Leonard Crestez @ 2020-02-20 15:56 UTC (permalink / raw)
  To: Shawn Guo, Dong Aisheng
  Cc: Fabio Estevam, Franck LENORMAND, linux-imx, kernel,
	linux-arm-kernel

This is not used by linux and not supported as part of imx SCU api, it
was added by mistake.

The constant value "9" has since been reassigned in firmware to a
different service.

Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
---
 include/linux/firmware/imx/ipc.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/firmware/imx/ipc.h b/include/linux/firmware/imx/ipc.h
index 6312c8cb084a..891057434858 100644
--- a/include/linux/firmware/imx/ipc.h
+++ b/include/linux/firmware/imx/ipc.h
@@ -23,11 +23,10 @@ enum imx_sc_rpc_svc {
 	IMX_SC_RPC_SVC_RM = 3,
 	IMX_SC_RPC_SVC_TIMER = 5,
 	IMX_SC_RPC_SVC_PAD = 6,
 	IMX_SC_RPC_SVC_MISC = 7,
 	IMX_SC_RPC_SVC_IRQ = 8,
-	IMX_SC_RPC_SVC_ABORT = 9
 };
 
 struct imx_sc_rpc_msg {
 	uint8_t ver;
 	uint8_t size;
-- 
2.17.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related

* Re: [igt-dev] [PATCH i-g-t 9/9] i915: Exercise I915_CONTEXT_PARAM_RINGSIZE
From: Janusz Krzysztofik @ 2020-02-20 15:57 UTC (permalink / raw)
  To: Chris Wilson; +Cc: igt-dev, intel-gfx
In-Reply-To: <157529875955.27263.14886639874940144583@skylake-alporthouse-com>

Hi Chris,

On Monday, December 2, 2019 3:59:19 PM CET Chris Wilson wrote:
> Quoting Janusz Krzysztofik (2019-12-02 14:42:58)
> > Hi Chris,
> > 
> > I have a few questions rather than comments.  I hope they are worth spending 
> > your time.
> > 
> > On Wednesday, November 13, 2019 1:52:40 PM CET Chris Wilson wrote:
> > > I915_CONTEXT_PARAM_RINGSIZE specifies how large to create the command
> > > ringbuffer for logical ring contects. This directly affects the number
> > 
> > s/contects/contexts/
> > 
> > > of batches userspace can submit before blocking waiting for space.
> > > 
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Have you got this patch still queued somewhere?  As UMD has accepted the 
solution and are ready with changes on their side, I think we need to merge it 
soon, and the kernel side as well.

Thanks,
Janusz


> > > ---
> > >  tests/Makefile.sources        |   3 +
> > >  tests/i915/gem_ctx_ringsize.c | 296 ++++++++++++++++++++++++++++++++++
> > >  tests/meson.build             |   1 +
> > >  3 files changed, 300 insertions(+)
> > >  create mode 100644 tests/i915/gem_ctx_ringsize.c
> > > 
> > > diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> > > index e17d43155..801fc52f3 100644
> > > --- a/tests/Makefile.sources
> > > +++ b/tests/Makefile.sources
> > > @@ -163,6 +163,9 @@ gem_ctx_param_SOURCES = i915/gem_ctx_param.c
> > >  TESTS_progs += gem_ctx_persistence
> > >  gem_ctx_persistence_SOURCES = i915/gem_ctx_persistence.c
> > >  
> > > +TESTS_progs += gem_ctx_ringsize
> > > +gem_ctx_ringsize_SOURCES = i915/gem_ctx_ringsize.c
> > > +
> > >  TESTS_progs += gem_ctx_shared
> > >  gem_ctx_shared_SOURCES = i915/gem_ctx_shared.c
> > >  
> > > diff --git a/tests/i915/gem_ctx_ringsize.c b/tests/i915/gem_ctx_ringsize.c
> > > new file mode 100644
> > > index 000000000..1450e8f0d
> > > --- /dev/null
> > > +++ b/tests/i915/gem_ctx_ringsize.c
> > > @@ -0,0 +1,296 @@
> > > +/*
> > > + * Copyright © 2019 Intel Corporation
> > > + *
> > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > + * copy of this software and associated documentation files (the "Software"),
> > > + * to deal in the Software without restriction, including without limitation
> > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > + * Software is furnished to do so, subject to the following conditions:
> > > + *
> > > + * The above copyright notice and this permission notice (including the next
> > > + * paragraph) shall be included in all copies or substantial portions of the
> > > + * Software.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> > > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> > > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> > > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> > > + * IN THE SOFTWARE.
> > > + */
> > > +
> > > +#include <errno.h>
> > > +#include <fcntl.h>
> > > +#include <inttypes.h>
> > > +#include <sys/ioctl.h>
> > > +#include <sys/types.h>
> > > +#include <unistd.h>
> > > +
> > > +#include "drmtest.h" /* gem_quiescent_gpu()! */
> > > +#include "i915/gem_context.h"
> > > +#include "i915/gem_engine_topology.h"
> > > +#include "ioctl_wrappers.h" /* gem_wait()! */
> > > +#include "sw_sync.h"
> > > +
> > > +#define I915_CONTEXT_PARAM_RINGSIZE 0xc
> > 
> > How are we going to handle symbol redefinition conflict which arises as soon 
> > as this symbol is also included from kernel headers (e.g. via 
> > "i915/gem_engine_topology.h")?
> 
> Final version we copy the headers form the kernel. Conflicts remind us
> when we forget.
> 
> > 
> > > +
> > > +static bool has_ringsize(int i915)
> > > +{
> > > +     struct drm_i915_gem_context_param p = {
> > > +             .param = I915_CONTEXT_PARAM_RINGSIZE,
> > > +     };
> > > +
> > > +     return __gem_context_get_param(i915, &p) == 0;
> > > +}
> > > +
> > > +static void test_idempotent(int i915)
> > > +{
> > > +     struct drm_i915_gem_context_param p = {
> > > +             .param = I915_CONTEXT_PARAM_RINGSIZE,
> > > +     };
> > > +     uint32_t saved;
> > > +
> > > +     /*
> > > +      * Simple test to verify that we are able to read back the same
> > > +      * value as we set.
> > > +      */
> > > +
> > > +     gem_context_get_param(i915, &p);
> > > +     saved = p.value;
> > > +
> > > +     for (uint32_t x = 1 << 12; x <= 128 << 12; x <<= 1) {
> > 
> > I've noticed you are using two different notations for those minimum/maximum 
> > constants.  I think that may be confusing.  How about defining and using 
> > macros?  
> 
> A range in pages...
>  
> > > +             p.value = x;
> > > +             gem_context_set_param(i915, &p);
> > > +             gem_context_get_param(i915, &p);
> > > +             igt_assert_eq_u32(p.value, x);
> > > +     }
> > > +
> > > +     p.value = saved;
> > > +     gem_context_set_param(i915, &p);
> > > +}
> > > +
> > > +static void test_invalid(int i915)
> > > +{
> > > +     struct drm_i915_gem_context_param p = {
> > > +             .param = I915_CONTEXT_PARAM_RINGSIZE,
> > > +     };
> > > +     uint64_t invalid[] = {
> > > +             0, 1, 4095, 4097, 8191, 8193,
> > > +             /* upper limit may be HW dependent, atm it is 512KiB */
> > > +             (512 << 10) - 1, (512 << 10) + 1,
> > 
> > Here is an example of that different notation mentioned above.
> 
> And here written in KiB to match comments.
> 
> > 
> > > +             -1, -1u
> > > +     };
> > > +     uint32_t saved;
> > > +
> > > +     gem_context_get_param(i915, &p);
> > > +     saved = p.value;
> > > +
> > > +     for (int i = 0; i < ARRAY_SIZE(invalid); i++) {
> > > +             p.value = invalid[i];
> > > +             igt_assert_eq(__gem_context_set_param(i915, &p), -EINVAL);
> > > +             gem_context_get_param(i915, &p);
> > > +             igt_assert_eq_u64(p.value, saved);
> > > +     }
> > > +}
> > > +
> > > +static int create_ext_ioctl(int i915,
> > > +                         struct drm_i915_gem_context_create_ext *arg)
> > > +{
> > > +     int err;
> > > +
> > > +     err = 0;
> > > +     if (igt_ioctl(i915, DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT, arg)) {
> > > +             err = -errno;
> > > +             igt_assume(err);
> > > +     }
> > > +
> > > +     errno = 0;
> > > +     return err;
> > > +}
> > 
> > This helper looks like pretty standard for me.  Why there are no library 
> > functions for such generic operations?
> 
> Because no one has written that yet.
> 
> > 
> > > +
> > > +static void test_create(int i915)
> > > +{
> > > +     struct drm_i915_gem_context_create_ext_setparam p = {
> > > +             .base = {
> > > +                     .name = I915_CONTEXT_CREATE_EXT_SETPARAM,
> > > +                     .next_extension = 0, /* end of chain */
> > > +             },
> > > +             .param = {
> > > +                     .param = I915_CONTEXT_PARAM_RINGSIZE,
> > > +                     .value = 512 << 10,
> > > +             }
> > > +     };
> > > +     struct drm_i915_gem_context_create_ext create = {
> > > +             .flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
> > > +             .extensions = to_user_pointer(&p),
> > > +     };
> > > +
> > > +     igt_assert_eq(create_ext_ioctl(i915, &create),  0);
> > > +
> > > +     p.param.ctx_id = create.ctx_id;
> > > +     p.param.value = 0;
> > > +     gem_context_get_param(i915, &p.param);
> > > +     igt_assert_eq(p.param.value, 512 << 10);
> > > +
> > > +     gem_context_destroy(i915, create.ctx_id);
> > > +}
> > > +
> > > +static void test_clone(int i915)
> > > +{
> > > +     struct drm_i915_gem_context_create_ext_setparam p = {
> > > +             .base = {
> > > +                     .name = I915_CONTEXT_CREATE_EXT_SETPARAM,
> > > +                     .next_extension = 0, /* end of chain */
> > > +             },
> > > +             .param = {
> > > +                     .param = I915_CONTEXT_PARAM_RINGSIZE,
> > > +                     .value = 512 << 10,
> > > +             }
> > > +     };
> > > +     struct drm_i915_gem_context_create_ext create = {
> > > +             .flags = I915_CONTEXT_CREATE_FLAGS_USE_EXTENSIONS,
> > > +             .extensions = to_user_pointer(&p),
> > > +     };
> > > +
> > > +     igt_assert_eq(create_ext_ioctl(i915, &create),  0);
> > > +
> > > +     p.param.ctx_id = gem_context_clone(i915, create.ctx_id,
> > > +                                        I915_CONTEXT_CLONE_ENGINES, 0);
> > > +     igt_assert_neq(p.param.ctx_id, create.ctx_id);
> > > +     gem_context_destroy(i915, create.ctx_id);
> > > +
> > > +     p.param.value = 0;
> > > +     gem_context_get_param(i915, &p.param);
> > > +     igt_assert_eq(p.param.value, 512 << 10);
> > > +
> > > +     gem_context_destroy(i915, p.param.ctx_id);
> > > +}
> > > +
> > > +static int __execbuf(int i915, struct drm_i915_gem_execbuffer2 *execbuf)
> > > +{
> > > +     int err;
> > > +
> > > +     err = 0;
> > > +     if (ioctl(i915, DRM_IOCTL_I915_GEM_EXECBUFFER2, execbuf))
> > > +             err = -errno;
> > > +
> > > +     errno = 0;
> > > +     return err;
> > > +}
> > 
> > The above helper looks pretty the same as lib/ioctlwrappers.c:__gem_execbuf().  
> > Does igt_assume(err) found in the latter matter so much that you use your own 
> > version?
> 
> It's very, very different from that one.
> 
> > > +
> > > +static uint32_t __batch_create(int i915, uint32_t offset)
> > 
> > This is always called with offset = 0, do we expect other values to be used 
> > later?
> 
> Why not.
>  
> > > +{
> > > +     const uint32_t bbe = 0xa << 23;
> > > +     uint32_t handle;
> > > +
> > > +     handle = gem_create(i915, ALIGN(offset + sizeof(bbe), 4096));
> > 
> > Why don't we rely on the driver making the alignment for us?
> 
> I'm used to being inside the kernel where it's expected to be correct.
> 
> > > +     gem_write(i915, handle, offset, &bbe, sizeof(bbe));
> > > +
> > > +     return handle;
> > > +}
> > > +
> > > +static uint32_t batch_create(int i915)
> > > +{
> > > +     return __batch_create(i915, 0);
> > > +}
> > > +
> > > +static unsigned int measure_inflight(int i915, unsigned int engine)
> > > +{
> > > +     IGT_CORK_FENCE(cork);
> > > +     struct drm_i915_gem_exec_object2 obj = {
> > > +             .handle = batch_create(i915)
> > > +     };
> > > +     struct drm_i915_gem_execbuffer2 execbuf = {
> > > +             .buffers_ptr = to_user_pointer(&obj),
> > > +             .buffer_count = 1,
> > > +             .flags = engine | I915_EXEC_FENCE_IN,
> > > +             .rsvd2 = igt_cork_plug(&cork, i915),
> > > +     };
> > > +     unsigned int count;
> > > +
> > > +     fcntl(i915, F_SETFL, fcntl(i915, F_GETFL) | O_NONBLOCK);
> > > +
> > > +     gem_execbuf(i915, &execbuf);
> > > +     for (count = 1; __execbuf(i915, &execbuf) == 0; count++)
> > > +             ;
> > 
> > Shouldn't we check if the reason for the failure is what we expect, i.e., 
> > -EWOULDBLOCK (or -EINTR)?  And why don't we put a time constraint on that loop 
> > in case O_NONBLOCK handling is not supported (yet)?
> 
> Sure. The idea is that O_NONBLOCK is supported, otherwise we don't
> have fast and precise feedback.
> 
> > > +static void test_resize(int i915,
> > > +                     const struct intel_execution_engine2 *e,
> > > +                     unsigned int flags)
> > > +#define IDLE (1 << 0)
> > > +{
> > > +     struct drm_i915_gem_context_param p = {
> > > +             .param = I915_CONTEXT_PARAM_RINGSIZE,
> > > +     };
> > > +     unsigned int prev[2] = {};
> > > +     uint32_t saved;
> > > +
> > > +     gem_context_get_param(i915, &p);
> > > +     saved = p.value;
> > > +
> > > +     gem_quiescent_gpu(i915);
> > > +     for (p.value = 1 << 12; p.value <= 128 << 12; p.value <<= 1) {
> > > +             unsigned int count;
> > > +
> > > +             gem_context_set_param(i915, &p);
> > > +
> > > +             count = measure_inflight(i915, e->flags);
> > > +             igt_info("%s: %llx -> %d\n", e->name, p.value, count);
> > > +             igt_assert(count > 3 * (prev[1] - prev[0]) / 4 + prev[1]);
> > 
> > Where does this formula come from?  Why not just count == 2 * prev[1] ?
> > What results should we expect in "active" vs. "idle" mode?
> 
> I've explained somewhere why it is not 2*prev... And there's a small
> amount of imprecision (+-1 request). In test_resize is the comment:
> 
>         /*
>          * The ringsize directly affects the number of batches we can have
>          * inflight -- when we run out of room in the ring, the client is
>          * blocked (or if O_NONBLOCK is specified, -EWOULDBLOCK is reported).
>          * The kernel throttles the client when they enter the last 4KiB page,
>          * so as we double the size of the ring, we nearly double the number
>          * of requests we can fit as 2^n-1: i.e 0, 1, 3, 7, 15, 31 pages.
>          */
> 
> -Chris
> 




_______________________________________________
igt-dev mailing list
igt-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/igt-dev

^ permalink raw reply

* [Cluster-devel] [PATCH v7 14/24] btrfs: Convert from readpages to readahead
From: Christoph Hellwig @ 2020-02-20 15:57 UTC (permalink / raw)
  To: cluster-devel.redhat.com
In-Reply-To: <20200220155452.GX24185@bombadil.infradead.org>

On Thu, Feb 20, 2020 at 07:54:52AM -0800, Matthew Wilcox wrote:
> On Thu, Feb 20, 2020 at 07:46:58AM -0800, Christoph Hellwig wrote:
> > On Thu, Feb 20, 2020 at 05:48:49AM -0800, Matthew Wilcox wrote:
> > > btrfs: Convert from readpages to readahead
> > >   
> > > Implement the new readahead method in btrfs.  Add a readahead_page_batch()
> > > to optimise fetching a batch of pages at once.
> > 
> > Shouldn't this readahead_page_batch heper go into a separate patch so
> > that it clearly stands out?
> 
> I'll move it into 'Put readahead pages in cache earlier' for v8 (the
> same patch where we add readahead_page())

One argument for keeping it in a patch of its own is that btrfs appears
to be the only user, and Goldwyn has a WIP conversion of btrfs to iomap,
so it might go away pretty soon and we could just revert the commit.

But this starts to get into really minor details, so I'll shut up now :)




^ permalink raw reply

* [Ocfs2-devel] [PATCH v7 14/24] btrfs: Convert from readpages to readahead
From: Christoph Hellwig @ 2020-02-20 15:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Johannes Thumshirn,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-btrfs@vger.kernel.org,
	linux-erofs@lists.ozlabs.org, linux-ext4@vger.kernel.org,
	linux-f2fs-devel@lists.sourceforge.net, cluster-devel@redhat.com,
	ocfs2-devel@oss.oracle.com, linux-xfs@vger.kernel.org
In-Reply-To: <20200220155452.GX24185@bombadil.infradead.org>

On Thu, Feb 20, 2020 at 07:54:52AM -0800, Matthew Wilcox wrote:
> On Thu, Feb 20, 2020 at 07:46:58AM -0800, Christoph Hellwig wrote:
> > On Thu, Feb 20, 2020 at 05:48:49AM -0800, Matthew Wilcox wrote:
> > > btrfs: Convert from readpages to readahead
> > >   
> > > Implement the new readahead method in btrfs.  Add a readahead_page_batch()
> > > to optimise fetching a batch of pages at once.
> > 
> > Shouldn't this readahead_page_batch heper go into a separate patch so
> > that it clearly stands out?
> 
> I'll move it into 'Put readahead pages in cache earlier' for v8 (the
> same patch where we add readahead_page())

One argument for keeping it in a patch of its own is that btrfs appears
to be the only user, and Goldwyn has a WIP conversion of btrfs to iomap,
so it might go away pretty soon and we could just revert the commit.

But this starts to get into really minor details, so I'll shut up now :)

^ permalink raw reply

* Re: [PATCH v7 14/24] btrfs: Convert from readpages to readahead
From: Christoph Hellwig @ 2020-02-20 15:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Johannes Thumshirn,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-btrfs@vger.kernel.org,
	linux-erofs@lists.ozlabs.org, linux-ext4@vger.kernel.org,
	linux-f2fs-devel@lists.sourceforge.net, cluster-devel@redhat.com,
	ocfs2-devel@oss.oracle.com, linux-xfs@vger.kernel.org
In-Reply-To: <20200220155452.GX24185@bombadil.infradead.org>

On Thu, Feb 20, 2020 at 07:54:52AM -0800, Matthew Wilcox wrote:
> On Thu, Feb 20, 2020 at 07:46:58AM -0800, Christoph Hellwig wrote:
> > On Thu, Feb 20, 2020 at 05:48:49AM -0800, Matthew Wilcox wrote:
> > > btrfs: Convert from readpages to readahead
> > >   
> > > Implement the new readahead method in btrfs.  Add a readahead_page_batch()
> > > to optimise fetching a batch of pages at once.
> > 
> > Shouldn't this readahead_page_batch heper go into a separate patch so
> > that it clearly stands out?
> 
> I'll move it into 'Put readahead pages in cache earlier' for v8 (the
> same patch where we add readahead_page())

One argument for keeping it in a patch of its own is that btrfs appears
to be the only user, and Goldwyn has a WIP conversion of btrfs to iomap,
so it might go away pretty soon and we could just revert the commit.

But this starts to get into really minor details, so I'll shut up now :)

^ permalink raw reply


This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.