LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 6/7] powerpc/traps: Print signal name for unhandled signals
From: Murilo Opsfelder Araujo @ 2018-07-25 20:06 UTC (permalink / raw)
  To: LEROY Christophe
  Cc: linuxppc-dev, Tobin C . Harding, Sukadev Bhattiprolu, Simon Guo,
	Paul Mackerras, Nicholas Piggin, Michael Neuling,
	Michael Ellerman, Eric W . Biederman, Cyril Bur,
	Benjamin Herrenschmidt, Balbir Singh, Andrew Donnellan,
	Alastair D'Silva, linux-kernel
In-Reply-To: <20180725174927.Horde.Sfg4oyTmcaLgHIJr3Gf-oQ1@messagerie.si.c-s.fr>

Hi, Christophe.

On Wed, Jul 25, 2018 at 05:49:27PM +0200, LEROY Christophe wrote:
> Murilo Opsfelder Araujo <muriloo@linux.ibm.com> a écrit :
>
> > This adds a human-readable name in the unhandled signal message.
> >
> > Before this patch, a page fault looked like:
> >
> >     Jul 11 16:04:11 localhost kernel: pandafault[6303]: unhandled signal
> > 11 at 00000000100007d0 nip 000000001000061c lr 00007fff93c55100 code 2
> > in pandafault[10000000+10000]
> >
> > After this patch, a page fault looks like:
> >
> >     Jul 11 18:14:48 localhost kernel: pandafault[6352]: segfault (11) at
> > 000000013a2a09f8 nip 000000013a2a086c lr 00007fffb63e5100 code 2 in
> > pandafault[13a2a0000+10000]
> >
> > Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> > ---
> >  arch/powerpc/kernel/traps.c | 43 +++++++++++++++++++++++++++++++++----
> >  1 file changed, 39 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> > index e6c43ef9fb50..e55ee639d010 100644
> > --- a/arch/powerpc/kernel/traps.c
> > +++ b/arch/powerpc/kernel/traps.c
> > @@ -96,6 +96,41 @@ EXPORT_SYMBOL(__debugger_fault_handler);
> >  #define TM_DEBUG(x...) do { } while(0)
> >  #endif
> >
> > +static const char *signames[SIGRTMIN + 1] = {
> > +	"UNKNOWN",
> > +	"SIGHUP",			// 1
> > +	"SIGINT",			// 2
> > +	"SIGQUIT",			// 3
> > +	"SIGILL",			// 4
> > +	"unhandled trap",		// 5 = SIGTRAP
> > +	"SIGABRT",			// 6 = SIGIOT
> > +	"bus error",			// 7 = SIGBUS
> > +	"floating point exception",	// 8 = SIGFPE
> > +	"illegal instruction",		// 9 = SIGILL
> > +	"SIGUSR1",			// 10
> > +	"segfault",			// 11 = SIGSEGV
> > +	"SIGUSR2",			// 12
> > +	"SIGPIPE",			// 13
> > +	"SIGALRM",			// 14
> > +	"SIGTERM",			// 15
> > +	"SIGSTKFLT",			// 16
> > +	"SIGCHLD",			// 17
> > +	"SIGCONT",			// 18
> > +	"SIGSTOP",			// 19
> > +	"SIGTSTP",			// 20
> > +	"SIGTTIN",			// 21
> > +	"SIGTTOU",			// 22
> > +	"SIGURG",			// 23
> > +	"SIGXCPU",			// 24
> > +	"SIGXFSZ",			// 25
> > +	"SIGVTALRM",			// 26
> > +	"SIGPROF",			// 27
> > +	"SIGWINCH",			// 28
> > +	"SIGIO",			// 29 = SIGPOLL = SIGLOST
> > +	"SIGPWR",			// 30
> > +	"SIGSYS",			// 31 = SIGUNUSED
> > +};
> > +
> >  /*
> >   * Trap & Exception support
> >   */
> > @@ -314,10 +349,10 @@ static void show_signal_msg(int signr, struct
> > pt_regs *regs, int code,
> >  	if (!unhandled_signal(current, signr))
> >  		return;
> >
> > -	pr_info("%s[%d]: unhandled signal %d at "REG_FMT \
> > -		" nip "REG_FMT" lr "REG_FMT" code %x",
> > -		current->comm, current->pid, signr, addr,
> > -		regs->nip, regs->link, code);
> > +	pr_info("%s[%d]: %s (%d) at "REG_FMT" nip "REG_FMT \
> > +		" lr "REG_FMT" code %x",
> > +		current->comm, current->pid, signames[signr],
> > +		signr, addr, regs->nip, regs->link, code);
>
> Are we sure that signr is always within the limits of the table ?

Looking at the code, we only pass the following signals:

    SIGBUS
    SIGFPE
    SIGILL
    SIGSEGV
    SIGTRAP

All of them are within the limits of the table.  We've added other
signals for completeness.

Cheers
Murilo

^ permalink raw reply

* Re: Infinite looping observed in __offline_pages
From: Michal Hocko @ 2018-07-25 20:03 UTC (permalink / raw)
  To: John Allen
  Cc: linux-kernel, linuxppc-dev, kamezawa.hiroyu, n-horiguchi, mgorman
In-Reply-To: <20180725181115.hmlyd3tmnu3mn3sf@p50.austin.ibm.com>

On Wed 25-07-18 13:11:15, John Allen wrote:
[...]
> Does a failure in do_migrate_range indicate that the range is unmigratable
> and the loop in __offline_pages should terminate and goto failed_removal? Or
> should we allow a certain number of retrys before we
> give up on migrating the range?

Unfortunatelly not. Migration code doesn't tell a difference between
ephemeral and permanent failures. We are relying on
start_isolate_page_range to tell us this. So the question is, what kind
of page is not migratable and for what reason.

Are you able to add some debugging to give us more information. The
current debugging code in the hotplug/migration sucks...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH 6/7] powerpc/traps: Print signal name for unhandled signals
From: Murilo Opsfelder Araujo @ 2018-07-25 19:41 UTC (permalink / raw)
  To: Gustavo Romero
  Cc: linux-kernel, Michael Neuling, Simon Guo, Nicholas Piggin,
	Paul Mackerras, Eric W . Biederman, Andrew Donnellan,
	Alastair D'Silva, Sukadev Bhattiprolu, linuxppc-dev,
	Cyril Bur, Tobin C . Harding
In-Reply-To: <0364bdc4-ddf9-85de-1128-8fc8791d63dd@linux.vnet.ibm.com>

Hi, Gustavo.

On Wed, Jul 25, 2018 at 12:19:00PM -0300, Gustavo Romero wrote:
> Hi Murilo,
> 
> LGTM.
> 
> Just a comment:
> 
> On 07/24/2018 04:27 PM, Murilo Opsfelder Araujo wrote:
> > This adds a human-readable name in the unhandled signal message.
> > 
> > Before this patch, a page fault looked like:
> > 
> >      Jul 11 16:04:11 localhost kernel: pandafault[6303]: unhandled signal 11 at 00000000100007d0 nip 000000001000061c lr 00007fff93c55100 code 2 in pandafault[10000000+10000]
> > 
> > After this patch, a page fault looks like:
> > 
> >      Jul 11 18:14:48 localhost kernel: pandafault[6352]: segfault (11) at 000000013a2a09f8 nip 000000013a2a086c lr 00007fffb63e5100 code 2 in pandafault[13a2a0000+10000]
> 
> I _really_ don't want to bikeshed here, but I vouch for keeping the
> "unhandled" word before the signal name, like:
> 
> [...] pandafault[6352]: unhandled segfault (11) at 000000013a2a09f8 nip [...]
> 
> because the issue reported here is really that we got a segfault _and_
> there was no handler to catch it.

Either way works for me.

> But feel free to wait for additional comments to decide it.

Sure.  I intend to wait a couple of weeks to respin the series based on
community feedback.

Cheers
Murilo

^ permalink raw reply

* Re: [PATCH 0/7] powerpc: Modernize unhandled signals message
From: Murilo Opsfelder Araujo @ 2018-07-25 19:36 UTC (permalink / raw)
  To: Michael Neuling
  Cc: linux-kernel, Alastair D'Silva, Andrew Donnellan,
	Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Cyril Bur,
	Eric W . Biederman, Michael Ellerman, Nicholas Piggin,
	Paul Mackerras, Simon Guo, Sukadev Bhattiprolu, Tobin C . Harding,
	linuxppc-dev
In-Reply-To: <0f185351330a7f1d66409fe6a2dc02aae6826039.camel@neuling.org>

Hi, Mikey.

On Wed, Jul 25, 2018 at 05:00:21PM +1000, Michael Neuling wrote:
>  On Tue, 2018-07-24 at 16:27 -0300, Murilo Opsfelder Araujo wrote:
> > Hi, everyone.
> > 
> > This series was inspired by the need to modernize and display more
> > informative messages about unhandled signals.
> > 
> > The "unhandled signal NN" is not very informative.  We thought it would
> > be helpful adding a human-readable message describing what the signal
> > number means, printing the VMA address, and dumping the instructions.
> > 
> > We can add more informative messages, like informing what each code of a
> > SIGSEGV signal means.  We are open to suggestions.
> > 
> > I have collected some early feedback from Michael Ellerman about this
> > series and would love to hear more feedback from you all.
> 
> Nice.. the instruction dump would have been very handy when debugging the PCR
> init issue I had a month or so back.
> 
> > Before this series:
> > 
> >     Jul 24 13:01:07 localhost kernel: pandafault[5989]: unhandled signal 11 at 00000000100007d0 nip 000000001000061c lr 00003fff85a75100 code 2
> > 
> > After this series:
> > 
> >     Jul 24 13:08:01 localhost kernel: pandafault[10758]: segfault (11) at 00000000100007d0 nip 000000001000061c lr 00007fffabc85100 code 2 in pandafault[10000000+10000]
> >     Jul 24 13:08:01 localhost kernel: Instruction dump:
> >     Jul 24 13:08:01 localhost kernel: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
> >     Jul 24 13:08:01 localhost kernel: 392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f0040
> 
> What happens if we get a sudden flood of these from different processes that
> overlap their output? Are we going to be able to match up the process with
> instruction dump?

As to the flood of messages, ___ratelimit() makes me think that we'll
likely see some warn messages informing how many show_signal_msg()
callbacks were suppressed, instead of interleaved messages and
instruction dumps.

As to matching process with instruction dump, I believe we'd need more
information to glue them together.

What if we modify show_instructions() to accept a string prefix to be
printed along with each line?

> Should we prefix every line with the PID to avoid this?

That's possible.  An alternative would be prefixing each line with the
process name and its PID, as in the first line.  For example:

    pandafault[10758]: segfault (11) at 00000000100007d0 nip 000000001000061c lr 00007fffabc85100 code 2 in pandafault[10000000+10000]
    pandafault[10758]: Instruction dump:
    pandafault[10758]: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
    pandafault[10758]: 392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f0040

The above can be interleaved with other messages and we'll still be able
to match process and its corresponding instruction dump.

Cheers
Murilo

^ permalink raw reply

* Infinite looping observed in __offline_pages
From: John Allen @ 2018-07-25 18:11 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev
  Cc: jallen, kamezawa.hiroyu, n-horiguchi, mgorman, mhocko

Hi All,

Under heavy stress and constant memory hot add/remove, I have observed 
the following loop to occasionally loop infinitely:

mm/memory_hotplug.c:__offline_pages

repeat:
        /* start memory hot removal */
        ret = -EINTR;
        if (signal_pending(current))
                goto failed_removal;

        cond_resched();
        lru_add_drain_all();
        drain_all_pages(zone);

        pfn = scan_movable_pages(start_pfn, end_pfn);
        if (pfn) { /* We have movable pages */
                ret = do_migrate_range(pfn, end_pfn);
                goto repeat;
        }

What appears to be happening in this case is that do_migrate_range 
returns a failure code which is being ignored. The failure is stemming 
from migrate_pages returning "1" which I'm guessing is the result of us 
hitting the following case:

mm/migrate.c: migrate_pages

	default:
		/*
		 * Permanent failure (-EBUSY, -ENOSYS, etc.):
		 * unlike -EAGAIN case, the failed page is
		 * removed from migration page list and not
		 * retried in the next outer loop.
		 */
		nr_failed++;
		break;
	}

Does a failure in do_migrate_range indicate that the range is 
unmigratable and the loop in __offline_pages should terminate and goto 
failed_removal? Or should we allow a certain number of retrys before we
give up on migrating the range?

This issue was observed on a ppc64le lpar on a 4.18-rc6 kernel.

-John

^ permalink raw reply

* [PATCH V2 6/6] powerpc/mm:book3s: Enable THP migration support
From: Aneesh Kumar K.V @ 2018-07-25 16:19 UTC (permalink / raw)
  To: npiggin, benh, paulus, mpe, Naoya Horiguchi
  Cc: linuxppc-dev, linux-mm, Aneesh Kumar K.V
In-Reply-To: <20180725161903.31257-1-aneesh.kumar@linux.ibm.com>

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 8 ++++++++
 arch/powerpc/platforms/Kconfig.cputype       | 1 +
 2 files changed, 9 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 8b80c5e16896..2f12732952f0 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -734,6 +734,8 @@ static inline bool pte_user(pte_t pte)
  */
 #define __pte_to_swp_entry(pte)	((swp_entry_t) { pte_val((pte)) & ~_PAGE_PTE })
 #define __swp_entry_to_pte(x)	__pte((x).val | _PAGE_PTE)
+#define __pmd_to_swp_entry(pmd)	(__pte_to_swp_entry(pmd_pte(pmd)))
+#define __swp_entry_to_pmd(x)	(pte_pmd(__swp_entry_to_pte(x)))
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
 #define _PAGE_SWP_SOFT_DIRTY   (1UL << (SWP_TYPE_BITS + _PAGE_BIT_SWAP_TYPE))
@@ -1084,6 +1086,12 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 #define pmd_soft_dirty(pmd)    pte_soft_dirty(pmd_pte(pmd))
 #define pmd_mksoft_dirty(pmd)  pte_pmd(pte_mksoft_dirty(pmd_pte(pmd)))
 #define pmd_clear_soft_dirty(pmd) pte_pmd(pte_clear_soft_dirty(pmd_pte(pmd)))
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#define pmd_swp_mksoft_dirty(pmd)	pte_pmd(pte_swp_mksoft_dirty(pmd_pte(pmd)))
+#define pmd_swp_soft_dirty(pmd)		pte_swp_soft_dirty(pmd_pte(pmd))
+#define pmd_swp_clear_soft_dirty(pmd)	pte_pmd(pte_swp_clear_soft_dirty(pmd_pte(pmd)))
+#endif
 #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index e93d6186de6e..f83ca0ca3bd8 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -72,6 +72,7 @@ config PPC_BOOK3S_64
 	select PPC_HAVE_PMU_SUPPORT
 	select SYS_SUPPORTS_HUGETLBFS
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
+	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select IRQ_WORK
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH V2 5/6] powerpc/mm/thp: update pmd_trans_huge to check for pmd_present
From: Aneesh Kumar K.V @ 2018-07-25 16:19 UTC (permalink / raw)
  To: npiggin, benh, paulus, mpe, Naoya Horiguchi
  Cc: linuxppc-dev, linux-mm, Aneesh Kumar K.V
In-Reply-To: <20180725161903.31257-1-aneesh.kumar@linux.ibm.com>

We need to make sure pmd_trans_huge returns false for a pmd migration entry.
We mark the migration entry by clearing the _PAGE_PRESENT bit. We keep the
_PAGE_PTE bit set to indicate a leaf page table entry. Hence we need to make
sure we check for pmd_present() so that pmd_trans_huge won't return true on
pmd migration entry.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 .../include/asm/book3s/64/pgtable-64k.h        |  3 +++
 arch/powerpc/include/asm/book3s/64/pgtable.h   | 18 ++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
index d7ee249d6890..e3d4dd4ae2fa 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
@@ -10,6 +10,9 @@
  *
  * Defined in such a way that we can optimize away code block at build time
  * if CONFIG_HUGETLB_PAGE=n.
+ *
+ * returns true for pmd migration entries, THP, devmap, hugetlb
+ * But compile time dependent on CONFIG_HUGETLB_PAGE
  */
 static inline int pmd_huge(pmd_t pmd)
 {
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index fce9ce8781a0..8b80c5e16896 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1129,6 +1129,10 @@ pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp,
 	return hash__pmd_hugepage_update(mm, addr, pmdp, clr, set);
 }
 
+/*
+ * returns true for pmd migration entries, THP, devmap, hugetlb
+ * But compile time dependent on THP config
+ */
 static inline int pmd_large(pmd_t pmd)
 {
 	return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
@@ -1163,8 +1167,22 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 		pmd_hugepage_update(mm, addr, pmdp, 0, _PAGE_PRIVILEGED);
 }
 
+/*
+ * Only returns true for a THP. False for pmd migration entry.
+ * We also need to return true when we come across a pte that
+ * in between a thp split. While splitting THP, we mark the pmd
+ * invalid (pmdp_invalidate()) before we set it with pte page
+ * address. A pmd_trans_huge() check against a pmd entry during that time
+ * should return true.
+ * We should not call this on a hugetlb entry. We should check for HugeTLB
+ * entry using vma->vm_flags
+ * The page table walk rule is explained in Documentation/vm/transhuge.rst
+ */
 static inline int pmd_trans_huge(pmd_t pmd)
 {
+	if (!pmd_present(pmd))
+		return false;
+
 	if (radix_enabled())
 		return radix__pmd_trans_huge(pmd);
 	return hash__pmd_trans_huge(pmd);
-- 
2.17.1

^ permalink raw reply related

* [PATCH V2 4/6] arch/powerpc/mm/hash: validate the pte entries before handling the hash fault
From: Aneesh Kumar K.V @ 2018-07-25 16:19 UTC (permalink / raw)
  To: npiggin, benh, paulus, mpe, Naoya Horiguchi
  Cc: linuxppc-dev, linux-mm, Aneesh Kumar K.V
In-Reply-To: <20180725161903.31257-1-aneesh.kumar@linux.ibm.com>

Make sure we are operating on THP and hugetlb entries in the respective hash
fault handling routines.

No functional change in this patch. If we walked the table wrongly before, we
will retry the access.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/hugepage-hash64.c    | 6 ++++++
 arch/powerpc/mm/hugetlbpage-hash64.c | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/arch/powerpc/mm/hugepage-hash64.c b/arch/powerpc/mm/hugepage-hash64.c
index f20d16f849c5..049dcb8c95c2 100644
--- a/arch/powerpc/mm/hugepage-hash64.c
+++ b/arch/powerpc/mm/hugepage-hash64.c
@@ -51,6 +51,12 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
 			new_pmd |= _PAGE_DIRTY;
 	} while (!pmd_xchg(pmdp, __pmd(old_pmd), __pmd(new_pmd)));
 
+	/*
+	 * Make sure this is thp or devmap entry
+	 */
+	if (!(old_pmd & (H_PAGE_THP_HUGE | _PAGE_DEVMAP)))
+		return 0;
+
 	rflags = htab_convert_pte_flags(new_pmd);
 
 #if 0
diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c
index b320f5097a06..2e6a8f9345d3 100644
--- a/arch/powerpc/mm/hugetlbpage-hash64.c
+++ b/arch/powerpc/mm/hugetlbpage-hash64.c
@@ -62,6 +62,10 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
 			new_pte |= _PAGE_DIRTY;
 	} while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
 
+	/* Make sure this is a hugetlb entry */
+	if (old_pte & (H_PAGE_THP_HUGE | _PAGE_DEVMAP))
+		return 0;
+
 	rflags = htab_convert_pte_flags(new_pte);
 	if (unlikely(mmu_psize == MMU_PAGE_16G))
 		offset = PTRS_PER_PUD;
-- 
2.17.1

^ permalink raw reply related

* [PATCH V2 3/6] powerpc/mm/book3s: Check for pmd_large instead of pmd_trans_huge
From: Aneesh Kumar K.V @ 2018-07-25 16:19 UTC (permalink / raw)
  To: npiggin, benh, paulus, mpe, Naoya Horiguchi
  Cc: linuxppc-dev, linux-mm, Aneesh Kumar K.V
In-Reply-To: <20180725161903.31257-1-aneesh.kumar@linux.ibm.com>

Update few code paths to check for pmd_large.

set_pmd_at:
We want to use this to store swap pte at pmd level. For swap ptes we don't want
to set H_PAGE_THP_HUGE. Hence check for pmd_large in set_pmd_at. This remove
the false WARN_ON when using this with swap pmd entry.

pmd_page:
We don't really use them on pmd migration entries. But they can also work with
migration entries and we don't differentiate at the pte level. Hence update
pmd_page to work with pmd migration entries too

__find_linux_pte:
lockless page table walk need to handle pmd migration entries. pmd_trans_huge
check will return false on them. We don't set thp = 1 for such entries, but
update hpage_shift correctly. Without this we will walk pmd migration entries
as a pte page pointer which is wrong.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/hugetlbpage.c      | 8 ++++++--
 arch/powerpc/mm/pgtable-book3s64.c | 2 +-
 arch/powerpc/mm/pgtable_64.c       | 2 +-
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 7ae5e4bfd318..80fdeec89698 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -846,8 +846,12 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
 				ret_pte = (pte_t *) pmdp;
 				goto out;
 			}
-
-			if (pmd_huge(pmd)) {
+			/*
+			 * pmd_large check below will handle the swap pmd pte
+			 * we need to do both the check because they are config
+			 * dependent.
+			 */
+			if (pmd_huge(pmd) || pmd_large(pmd)) {
 				ret_pte = (pte_t *) pmdp;
 				goto out;
 			} else if (is_hugepd(__hugepd(pmd_val(pmd))))
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 24346ab4cd37..16d963f281a3 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -71,7 +71,7 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 #ifdef CONFIG_DEBUG_VM
 	WARN_ON(pte_present(pmd_pte(*pmdp)) && !pte_protnone(pmd_pte(*pmdp)));
 	assert_spin_locked(pmd_lockptr(mm, pmdp));
-	WARN_ON(!(pmd_trans_huge(pmd) || pmd_devmap(pmd)));
+	WARN_ON(!(pmd_large(pmd) || pmd_devmap(pmd)));
 #endif
 	trace_hugepage_set_pmd(addr, pmd_val(pmd));
 	return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 53e9eeecd5d4..e15e63079ba8 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -306,7 +306,7 @@ struct page *pud_page(pud_t pud)
  */
 struct page *pmd_page(pmd_t pmd)
 {
-	if (pmd_trans_huge(pmd) || pmd_huge(pmd) || pmd_devmap(pmd))
+	if (pmd_large(pmd) || pmd_huge(pmd) || pmd_devmap(pmd))
 		return pte_page(pmd_pte(pmd));
 	return virt_to_page(pmd_page_vaddr(pmd));
 }
-- 
2.17.1

^ permalink raw reply related

* [PATCH V2 2/6] powerpc/mm/hugetlb/book3s: add _PAGE_PRESENT to hugepd pointer.
From: Aneesh Kumar K.V @ 2018-07-25 16:18 UTC (permalink / raw)
  To: npiggin, benh, paulus, mpe, Naoya Horiguchi
  Cc: linuxppc-dev, linux-mm, Aneesh Kumar K.V
In-Reply-To: <20180725161903.31257-1-aneesh.kumar@linux.ibm.com>

This make hugetlb directory pointer similar to other page able entries. A hugepd
entry is identified by lack of _PAGE_PTE bit set and directory size stored in
HUGEPD_SHIFT_MASK. We update that to also look at _PAGE_PRESENT

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h | 2 +-
 arch/powerpc/include/asm/book3s/64/hugetlb.h | 3 +++
 arch/powerpc/mm/hugetlbpage.c                | 2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 9a3798660cef..15bc16b1dc9c 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -66,7 +66,7 @@ static inline int hash__hugepd_ok(hugepd_t hpd)
 	 * if it is not a pte and have hugepd shift mask
 	 * set, then it is a hugepd directory pointer
 	 */
-	if (!(hpdval & _PAGE_PTE) &&
+	if (!(hpdval & _PAGE_PTE) && (hpdval & _PAGE_PRESENT) &&
 	    ((hpdval & HUGEPD_SHIFT_MASK) != 0))
 		return true;
 	return false;
diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index 50888388a359..5b0177733994 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -39,4 +39,7 @@ static inline bool gigantic_page_supported(void)
 }
 #endif
 
+/* hugepd entry valid bit */
+#define HUGEPD_VAL_BITS		(0x8000000000000000UL)
+
 #endif
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index f425b5b37d58..7ae5e4bfd318 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -95,7 +95,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			break;
 		else {
 #ifdef CONFIG_PPC_BOOK3S_64
-			*hpdp = __hugepd(__pa(new) |
+			*hpdp = __hugepd(__pa(new) | HUGEPD_VAL_BITS |
 					 (shift_to_mmu_psize(pshift) << 2));
 #elif defined(CONFIG_PPC_8xx)
 			*hpdp = __hugepd(__pa(new) | _PMD_USER |
-- 
2.17.1

^ permalink raw reply related

* [PATCH V2 1/6] powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bit
From: Aneesh Kumar K.V @ 2018-07-25 16:18 UTC (permalink / raw)
  To: npiggin, benh, paulus, mpe, Naoya Horiguchi
  Cc: linuxppc-dev, linux-mm, Aneesh Kumar K.V

With this patch we use 0x8000000000000000UL (_PAGE_PRESENT) to indicate a valid
pgd/pud/pmd entry. We also switch the p**_present() to look at this bit.

With pmd_present, we have a special case. We need to make sure we consider a
pmd marked invalid during THP split as present. Right now we clear the
_PAGE_PRESENT bit during a pmdp_invalidate. Inorder to consider this special
case we add a new pte bit _PAGE_INVALID (mapped to _RPAGE_SW0). This bit is
only used with _PAGE_PRESENT cleared. Hence we are not really losing a pte bit
for this special case. pmd_present is also updated to look at _PAGE_INVALID.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/hash.h    |  5 +++++
 arch/powerpc/include/asm/book3s/64/pgtable.h | 23 +++++++++++++++++---
 arch/powerpc/mm/hash_utils_64.c              |  6 ++---
 arch/powerpc/mm/pgtable-book3s64.c           |  2 +-
 4 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index 0387b155f13d..a371ac7c3183 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -16,6 +16,11 @@
 #include <asm/book3s/64/hash-4k.h>
 #endif
 
+/* Bits to set in a PMD/PUD/PGD entry valid bit*/
+#define HASH_PMD_VAL_BITS		(0x8000000000000000UL)
+#define HASH_PUD_VAL_BITS		(0x8000000000000000UL)
+#define HASH_PGD_VAL_BITS		(0x8000000000000000UL)
+
 /*
  * Size of EA range mapped by our pagetables.
  */
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 676118743a06..fce9ce8781a0 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -44,6 +44,15 @@
 
 #define _PAGE_PTE		0x4000000000000000UL	/* distinguishes PTEs from pointers */
 #define _PAGE_PRESENT		0x8000000000000000UL	/* pte contains a translation */
+/*
+ * We need to mark a pmd pte invalid while splitting. We can do that by clearing the
+ * _PAGE_PRESENT bit. But then that will be taken as a swap pte. Inorder to differentiate
+ * between two use a SW field when invalidating. We don't add a special bit to indicate
+ * swap pte because that is also used for migration ptes, and we do back and forth between
+ * a valid pte entry and migration ptes. So any information in the software bits will be
+ * lost if we overload those bits.
+ */
+#define _PAGE_INVALID		_RPAGE_SW0
 
 /*
  * Top and bottom bits of RPN which can be used by hash
@@ -859,8 +868,16 @@ static inline int pmd_none(pmd_t pmd)
 
 static inline int pmd_present(pmd_t pmd)
 {
+	/*
+	 * A pmd is considerent present if _PAGE_PRESENT is set.
+	 * We also need to consider the pmd present which is marked
+	 * invalid during a split. Hence we look for _PAGE_INVALID
+	 * if we find _PAGE_PRESENT cleared.
+	 */
+	if (pmd_raw(pmd) & cpu_to_be64(_PAGE_PRESENT | _PAGE_INVALID))
+		return true;
 
-	return !pmd_none(pmd);
+	return false;
 }
 
 static inline int pmd_bad(pmd_t pmd)
@@ -887,7 +904,7 @@ static inline int pud_none(pud_t pud)
 
 static inline int pud_present(pud_t pud)
 {
-	return !pud_none(pud);
+	return (pud_raw(pud) & cpu_to_be64(_PAGE_PRESENT));
 }
 
 extern struct page *pud_page(pud_t pud);
@@ -934,7 +951,7 @@ static inline int pgd_none(pgd_t pgd)
 
 static inline int pgd_present(pgd_t pgd)
 {
-	return !pgd_none(pgd);
+	return (pgd_raw(pgd) & cpu_to_be64(_PAGE_PRESENT));
 }
 
 static inline pte_t pgd_pte(pgd_t pgd)
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 5a72e980e25a..7ce7fa5397d5 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1002,9 +1002,9 @@ void __init hash__early_init_mmu(void)
 	 * 4k use hugepd format, so for hash set then to
 	 * zero
 	 */
-	__pmd_val_bits = 0;
-	__pud_val_bits = 0;
-	__pgd_val_bits = 0;
+	__pmd_val_bits = HASH_PMD_VAL_BITS;
+	__pud_val_bits = HASH_PUD_VAL_BITS;
+	__pgd_val_bits = HASH_PGD_VAL_BITS;
 
 	__kernel_virt_start = H_KERN_VIRT_START;
 	__kernel_virt_size = H_KERN_VIRT_SIZE;
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 4afbfbb64bfd..24346ab4cd37 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -106,7 +106,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 {
 	unsigned long old_pmd;
 
-	old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
+	old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID);
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	/*
 	 * This ensures that generic code that rely on IRQ disabling
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH v8 2/2] hwmon: ibmpowernv: Add attributes to enable/disable sensor groups
From: Guenter Roeck @ 2018-07-25 16:10 UTC (permalink / raw)
  To: Shilpasri G Bhat
  Cc: mpe, linuxppc-dev, linux-hwmon, linux-kernel, ego, stewart
In-Reply-To: <1532423589-18730-3-git-send-email-shilpa.bhat@linux.vnet.ibm.com>

On Tue, Jul 24, 2018 at 02:43:09PM +0530, Shilpasri G Bhat wrote:
> OPAL firmware provides the facility for some groups of sensors to be
> enabled/disabled at runtime to give the user the option of using the
> system resources for collecting these sensors or not.
> 
> For example, on POWER9 systems, the On Chip Controller (OCC) gathers
> various system and chip level sensors and maintains their values in
> main memory.
> 
> This patch provides support for enabling/disabling the sensor groups
> like power, temperature, current and voltage.
> 
> Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
> [stewart@linux.vnet.ibm.com: Commit message]

Acked-by: Guenter Roeck <linux@roeck-us.net>

I would appreciate if subsequent changes were handled as additional
patches instead of forcing me to re-review everything again from
scratch.

I would also appreciate if those requesting changes would have the
courtesy of reviewing the changes triggered by those requests.

Thanks,
Guenter

> ---
> Changes from v7:
> - Use of_for_each_phandle() and of_count_phandle_with_args() to parse
>   through the phandle array
> 
>  Documentation/hwmon/ibmpowernv |  43 +++++++-
>  drivers/hwmon/ibmpowernv.c     | 238 +++++++++++++++++++++++++++++++++++------
>  2 files changed, 247 insertions(+), 34 deletions(-)
> 
> diff --git a/Documentation/hwmon/ibmpowernv b/Documentation/hwmon/ibmpowernv
> index 8826ba2..5646825 100644
> --- a/Documentation/hwmon/ibmpowernv
> +++ b/Documentation/hwmon/ibmpowernv
> @@ -33,9 +33,48 @@ fanX_input		Measured RPM value.
>  fanX_min		Threshold RPM for alert generation.
>  fanX_fault		0: No fail condition
>  			1: Failing fan
> +
>  tempX_input		Measured ambient temperature.
>  tempX_max		Threshold ambient temperature for alert generation.
> -inX_input		Measured power supply voltage
> +tempX_highest		Historical maximum temperature
> +tempX_lowest		Historical minimum temperature
> +tempX_enable		Enable/disable all temperature sensors belonging to the
> +			sub-group. In POWER9, this attribute corresponds to
> +			each OCC. Using this attribute each OCC can be asked to
> +			disable/enable all of its temperature sensors.
> +			1: Enable
> +			0: Disable
> +
> +inX_input		Measured power supply voltage (millivolt)
>  inX_fault		0: No fail condition.
>  			1: Failing power supply.
> -power1_input		System power consumption (microWatt)
> +inX_highest		Historical maximum voltage
> +inX_lowest		Historical minimum voltage
> +inX_enable		Enable/disable all voltage sensors belonging to the
> +			sub-group. In POWER9, this attribute corresponds to
> +			each OCC. Using this attribute each OCC can be asked to
> +			disable/enable all of its voltage sensors.
> +			1: Enable
> +			0: Disable
> +
> +powerX_input		Power consumption (microWatt)
> +powerX_input_highest	Historical maximum power
> +powerX_input_lowest	Historical minimum power
> +powerX_enable		Enable/disable all power sensors belonging to the
> +			sub-group. In POWER9, this attribute corresponds to
> +			each OCC. Using this attribute each OCC can be asked to
> +			disable/enable all of its power sensors.
> +			1: Enable
> +			0: Disable
> +
> +currX_input		Measured current (milliampere)
> +currX_highest		Historical maximum current
> +currX_lowest		Historical minimum current
> +currX_enable		Enable/disable all current sensors belonging to the
> +			sub-group. In POWER9, this attribute corresponds to
> +			each OCC. Using this attribute each OCC can be asked to
> +			disable/enable all of its current sensors.
> +			1: Enable
> +			0: Disable
> +
> +energyX_input		Cumulative energy (microJoule)
> diff --git a/drivers/hwmon/ibmpowernv.c b/drivers/hwmon/ibmpowernv.c
> index f829dad..8347280 100644
> --- a/drivers/hwmon/ibmpowernv.c
> +++ b/drivers/hwmon/ibmpowernv.c
> @@ -90,11 +90,20 @@ struct sensor_data {
>  	char label[MAX_LABEL_LEN];
>  	char name[MAX_ATTR_LEN];
>  	struct device_attribute dev_attr;
> +	struct sensor_group_data *sgrp_data;
> +};
> +
> +struct sensor_group_data {
> +	struct mutex mutex;
> +	u32 gid;
> +	bool enable;
>  };
>  
>  struct platform_data {
>  	const struct attribute_group *attr_groups[MAX_SENSOR_TYPE + 1];
> +	struct sensor_group_data *sgrp_data;
>  	u32 sensors_count; /* Total count of sensors from each group */
> +	u32 nr_sensor_groups; /* Total number of sensor groups */
>  };
>  
>  static ssize_t show_sensor(struct device *dev, struct device_attribute *devattr,
> @@ -105,6 +114,9 @@ static ssize_t show_sensor(struct device *dev, struct device_attribute *devattr,
>  	ssize_t ret;
>  	u64 x;
>  
> +	if (sdata->sgrp_data && !sdata->sgrp_data->enable)
> +		return -ENODATA;
> +
>  	ret =  opal_get_sensor_data_u64(sdata->id, &x);
>  
>  	if (ret)
> @@ -120,6 +132,46 @@ static ssize_t show_sensor(struct device *dev, struct device_attribute *devattr,
>  	return sprintf(buf, "%llu\n", x);
>  }
>  
> +static ssize_t show_enable(struct device *dev,
> +			   struct device_attribute *devattr, char *buf)
> +{
> +	struct sensor_data *sdata = container_of(devattr, struct sensor_data,
> +						 dev_attr);
> +
> +	return sprintf(buf, "%u\n", sdata->sgrp_data->enable);
> +}
> +
> +static ssize_t store_enable(struct device *dev,
> +			    struct device_attribute *devattr,
> +			    const char *buf, size_t count)
> +{
> +	struct sensor_data *sdata = container_of(devattr, struct sensor_data,
> +						 dev_attr);
> +	struct sensor_group_data *sgrp_data = sdata->sgrp_data;
> +	int ret;
> +	bool data;
> +
> +	ret = kstrtobool(buf, &data);
> +	if (ret)
> +		return ret;
> +
> +	ret = mutex_lock_interruptible(&sgrp_data->mutex);
> +	if (ret)
> +		return ret;
> +
> +	if (data != sgrp_data->enable) {
> +		ret =  sensor_group_enable(sgrp_data->gid, data);
> +		if (!ret)
> +			sgrp_data->enable = data;
> +	}
> +
> +	if (!ret)
> +		ret = count;
> +
> +	mutex_unlock(&sgrp_data->mutex);
> +	return ret;
> +}
> +
>  static ssize_t show_label(struct device *dev, struct device_attribute *devattr,
>  			  char *buf)
>  {
> @@ -292,12 +344,115 @@ static u32 get_sensor_hwmon_index(struct sensor_data *sdata,
>  	return ++sensor_groups[sdata->type].hwmon_index;
>  }
>  
> +static int init_sensor_group_data(struct platform_device *pdev,
> +				  struct platform_data *pdata)
> +{
> +	struct sensor_group_data *sgrp_data;
> +	struct device_node *groups, *sgrp;
> +	int count = 0, ret = 0;
> +	enum sensors type;
> +
> +	groups = of_find_compatible_node(NULL, NULL, "ibm,opal-sensor-group");
> +	if (!groups)
> +		return ret;
> +
> +	for_each_child_of_node(groups, sgrp) {
> +		type = get_sensor_type(sgrp);
> +		if (type != MAX_SENSOR_TYPE)
> +			pdata->nr_sensor_groups++;
> +	}
> +
> +	if (!pdata->nr_sensor_groups)
> +		goto out;
> +
> +	sgrp_data = devm_kcalloc(&pdev->dev, pdata->nr_sensor_groups,
> +				 sizeof(*sgrp_data), GFP_KERNEL);
> +	if (!sgrp_data) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	for_each_child_of_node(groups, sgrp) {
> +		u32 gid;
> +
> +		type = get_sensor_type(sgrp);
> +		if (type == MAX_SENSOR_TYPE)
> +			continue;
> +
> +		if (of_property_read_u32(sgrp, "sensor-group-id", &gid))
> +			continue;
> +
> +		if (of_count_phandle_with_args(sgrp, "sensors", NULL) <= 0)
> +			continue;
> +
> +		sensor_groups[type].attr_count++;
> +		sgrp_data[count].gid = gid;
> +		mutex_init(&sgrp_data[count].mutex);
> +		sgrp_data[count++].enable = false;
> +	}
> +
> +	pdata->sgrp_data = sgrp_data;
> +out:
> +	of_node_put(groups);
> +	return ret;
> +}
> +
> +static struct sensor_group_data *get_sensor_group(struct platform_data *pdata,
> +						  struct device_node *node,
> +						  enum sensors gtype)
> +{
> +	struct sensor_group_data *sgrp_data = pdata->sgrp_data;
> +	struct device_node *groups, *sgrp;
> +
> +	groups = of_find_compatible_node(NULL, NULL, "ibm,opal-sensor-group");
> +	if (!groups)
> +		return NULL;
> +
> +	for_each_child_of_node(groups, sgrp) {
> +		struct of_phandle_iterator it;
> +		u32 gid;
> +		int rc, i;
> +		enum sensors type;
> +
> +		type = get_sensor_type(sgrp);
> +		if (type != gtype)
> +			continue;
> +
> +		if (of_property_read_u32(sgrp, "sensor-group-id", &gid))
> +			continue;
> +
> +		of_for_each_phandle(&it, rc, sgrp, "sensors", NULL, 0)
> +			if (it.phandle == node->phandle) {
> +				of_node_put(it.node);
> +				break;
> +			}
> +
> +		if (rc)
> +			continue;
> +
> +		for (i = 0; i < pdata->nr_sensor_groups; i++)
> +			if (gid == sgrp_data[i].gid) {
> +				of_node_put(sgrp);
> +				of_node_put(groups);
> +				return &sgrp_data[i];
> +			}
> +	}
> +
> +	of_node_put(groups);
> +	return NULL;
> +}
> +
>  static int populate_attr_groups(struct platform_device *pdev)
>  {
>  	struct platform_data *pdata = platform_get_drvdata(pdev);
>  	const struct attribute_group **pgroups = pdata->attr_groups;
>  	struct device_node *opal, *np;
>  	enum sensors type;
> +	int ret;
> +
> +	ret = init_sensor_group_data(pdev, pdata);
> +	if (ret)
> +		return ret;
>  
>  	opal = of_find_node_by_path("/ibm,opal/sensors");
>  	for_each_child_of_node(opal, np) {
> @@ -344,7 +499,10 @@ static int populate_attr_groups(struct platform_device *pdev)
>  static void create_hwmon_attr(struct sensor_data *sdata, const char *attr_name,
>  			      ssize_t (*show)(struct device *dev,
>  					      struct device_attribute *attr,
> -					      char *buf))
> +					      char *buf),
> +			    ssize_t (*store)(struct device *dev,
> +					     struct device_attribute *attr,
> +					     const char *buf, size_t count))
>  {
>  	snprintf(sdata->name, MAX_ATTR_LEN, "%s%d_%s",
>  		 sensor_groups[sdata->type].name, sdata->hwmon_index,
> @@ -352,23 +510,33 @@ static void create_hwmon_attr(struct sensor_data *sdata, const char *attr_name,
>  
>  	sysfs_attr_init(&sdata->dev_attr.attr);
>  	sdata->dev_attr.attr.name = sdata->name;
> -	sdata->dev_attr.attr.mode = S_IRUGO;
>  	sdata->dev_attr.show = show;
> +	if (store) {
> +		sdata->dev_attr.store = store;
> +		sdata->dev_attr.attr.mode = 0664;
> +	} else {
> +		sdata->dev_attr.attr.mode = 0444;
> +	}
>  }
>  
>  static void populate_sensor(struct sensor_data *sdata, int od, int hd, int sid,
>  			    const char *attr_name, enum sensors type,
>  			    const struct attribute_group *pgroup,
> +			    struct sensor_group_data *sgrp_data,
>  			    ssize_t (*show)(struct device *dev,
>  					    struct device_attribute *attr,
> -					    char *buf))
> +					    char *buf),
> +			    ssize_t (*store)(struct device *dev,
> +					     struct device_attribute *attr,
> +					     const char *buf, size_t count))
>  {
>  	sdata->id = sid;
>  	sdata->type = type;
>  	sdata->opal_index = od;
>  	sdata->hwmon_index = hd;
> -	create_hwmon_attr(sdata, attr_name, show);
> +	create_hwmon_attr(sdata, attr_name, show, store);
>  	pgroup->attrs[sensor_groups[type].attr_count++] = &sdata->dev_attr.attr;
> +	sdata->sgrp_data = sgrp_data;
>  }
>  
>  static char *get_max_attr(enum sensors type)
> @@ -403,24 +571,23 @@ static int create_device_attrs(struct platform_device *pdev)
>  	const struct attribute_group **pgroups = pdata->attr_groups;
>  	struct device_node *opal, *np;
>  	struct sensor_data *sdata;
> -	u32 sensor_id;
> -	enum sensors type;
>  	u32 count = 0;
> -	int err = 0;
> +	u32 group_attr_id[MAX_SENSOR_TYPE] = {0};
>  
> -	opal = of_find_node_by_path("/ibm,opal/sensors");
>  	sdata = devm_kcalloc(&pdev->dev,
>  			     pdata->sensors_count, sizeof(*sdata),
>  			     GFP_KERNEL);
> -	if (!sdata) {
> -		err = -ENOMEM;
> -		goto exit_put_node;
> -	}
> +	if (!sdata)
> +		return -ENOMEM;
>  
> +	opal = of_find_node_by_path("/ibm,opal/sensors");
>  	for_each_child_of_node(opal, np) {
> +		struct sensor_group_data *sgrp_data;
>  		const char *attr_name;
> -		u32 opal_index;
> +		u32 opal_index, hw_id;
> +		u32 sensor_id;
>  		const char *label;
> +		enum sensors type;
>  
>  		if (np->name == NULL)
>  			continue;
> @@ -456,14 +623,12 @@ static int create_device_attrs(struct platform_device *pdev)
>  			opal_index = INVALID_INDEX;
>  		}
>  
> -		sdata[count].opal_index = opal_index;
> -		sdata[count].hwmon_index =
> -			get_sensor_hwmon_index(&sdata[count], sdata, count);
> -
> -		create_hwmon_attr(&sdata[count], attr_name, show_sensor);
> -
> -		pgroups[type]->attrs[sensor_groups[type].attr_count++] =
> -				&sdata[count++].dev_attr.attr;
> +		hw_id = get_sensor_hwmon_index(&sdata[count], sdata, count);
> +		sgrp_data = get_sensor_group(pdata, np, type);
> +		populate_sensor(&sdata[count], opal_index, hw_id, sensor_id,
> +				attr_name, type, pgroups[type], sgrp_data,
> +				show_sensor, NULL);
> +		count++;
>  
>  		if (!of_property_read_string(np, "label", &label)) {
>  			/*
> @@ -474,35 +639,43 @@ static int create_device_attrs(struct platform_device *pdev)
>  			 */
>  
>  			make_sensor_label(np, &sdata[count], label);
> -			populate_sensor(&sdata[count], opal_index,
> -					sdata[count - 1].hwmon_index,
> +			populate_sensor(&sdata[count], opal_index, hw_id,
>  					sensor_id, "label", type, pgroups[type],
> -					show_label);
> +					NULL, show_label, NULL);
>  			count++;
>  		}
>  
>  		if (!of_property_read_u32(np, "sensor-data-max", &sensor_id)) {
>  			attr_name = get_max_attr(type);
> -			populate_sensor(&sdata[count], opal_index,
> -					sdata[count - 1].hwmon_index,
> +			populate_sensor(&sdata[count], opal_index, hw_id,
>  					sensor_id, attr_name, type,
> -					pgroups[type], show_sensor);
> +					pgroups[type], sgrp_data, show_sensor,
> +					NULL);
>  			count++;
>  		}
>  
>  		if (!of_property_read_u32(np, "sensor-data-min", &sensor_id)) {
>  			attr_name = get_min_attr(type);
> -			populate_sensor(&sdata[count], opal_index,
> -					sdata[count - 1].hwmon_index,
> +			populate_sensor(&sdata[count], opal_index, hw_id,
>  					sensor_id, attr_name, type,
> -					pgroups[type], show_sensor);
> +					pgroups[type], sgrp_data, show_sensor,
> +					NULL);
> +			count++;
> +		}
> +
> +		if (sgrp_data && !sgrp_data->enable) {
> +			sgrp_data->enable = true;
> +			hw_id = ++group_attr_id[type];
> +			populate_sensor(&sdata[count], opal_index, hw_id,
> +					sgrp_data->gid, "enable", type,
> +					pgroups[type], sgrp_data, show_enable,
> +					store_enable);
>  			count++;
>  		}
>  	}
>  
> -exit_put_node:
>  	of_node_put(opal);
> -	return err;
> +	return 0;
>  }
>  
>  static int ibmpowernv_probe(struct platform_device *pdev)
> @@ -517,6 +690,7 @@ static int ibmpowernv_probe(struct platform_device *pdev)
>  
>  	platform_set_drvdata(pdev, pdata);
>  	pdata->sensors_count = 0;
> +	pdata->nr_sensor_groups = 0;
>  	err = populate_attr_groups(pdev);
>  	if (err)
>  		return err;
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: [PATCH 7/7] powerpc/traps: Show instructions on exceptions
From: LEROY Christophe @ 2018-07-25 16:01 UTC (permalink / raw)
  To: Murilo Opsfelder Araujo
  Cc: linuxppc-dev, Tobin C . Harding, Sukadev Bhattiprolu, Simon Guo,
	Paul Mackerras, Nicholas Piggin, Michael Neuling,
	Michael Ellerman, Eric W . Biederman, Cyril Bur,
	Benjamin Herrenschmidt, Balbir Singh, Andrew Donnellan,
	Alastair D'Silva, linux-kernel
In-Reply-To: <20180724192720.32417-8-muriloo@linux.ibm.com>

Murilo Opsfelder Araujo <muriloo@linux.ibm.com> a =C3=A9crit=C2=A0:

> Move show_instructions() declaration to arch/powerpc/include/asm/stacktra=
ce.h
> and include asm/stracktrace.h in arch/powerpc/kernel/process.c,=20=20
>=20which contains
> the implementation.
>
> Modify show_instructions() not to call __kernel_text_address(), allowing
> userspace instruction dump.  probe_kernel_address(), which returns -EFAUL=
T if
> something goes wrong, is still being called.
>
> Call show_instructions() in arch/powerpc/kernel/traps.c to dump=20=20
>=20instructions at
> faulty location, useful to debugging.

Shouldn't this part be in a second patch ?

Wouldn't it be better to also see regs in addition if we want to use=20=20
this=20to understand what happened ?
So you could call show_regs() instead of show_instructions() ?

Christophe

>
> Before this patch, an unhandled signal message looked like:
>
>     Jul 24 09:57:00 localhost kernel: pandafault[10524]: segfault=20=20
>=20(11) at 00000000100007d0 nip 000000001000061c lr 00007fffbd295100=20=20
>=20code 2 in pandafault[10000000+10000]
>
> After this patch, it looks like:
>
>     Jul 24 09:57:00 localhost kernel: pandafault[10524]: segfault=20=20
>=20(11) at 00000000100007d0 nip 000000001000061c lr 00007fffbd295100=20=20
>=20code 2 in pandafault[10000000+10000]
>     Jul 24 09:57:00 localhost kernel: Instruction dump:
>     Jul 24 09:57:00 localhost kernel: 4bfffeec 4bfffee8 3c401002=20=20
>=2038427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
>     Jul 24 09:57:00 localhost kernel: 392988d0 f93f0020 e93f0020=20=20
>=2039400048 <99490000> 39200000 7d234b78 383f0040
>
> Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/stacktrace.h | 7 +++++++
>  arch/powerpc/kernel/process.c         | 6 +++---
>  arch/powerpc/kernel/traps.c           | 3 +++
>  3 files changed, 13 insertions(+), 3 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/stacktrace.h
>
> diff --git a/arch/powerpc/include/asm/stacktrace.h=20=20
>=20b/arch/powerpc/include/asm/stacktrace.h
> new file mode 100644
> index 000000000000..46e5ef451578
> --- /dev/null
> +++ b/arch/powerpc/include/asm/stacktrace.h
> @@ -0,0 +1,7 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_POWERPC_STACKTRACE_H
> +#define _ASM_POWERPC_STACKTRACE_H
> +
> +void show_instructions(struct pt_regs *regs);
> +
> +#endif /* _ASM_POWERPC_STACKTRACE_H */
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.=
c
> index b1af3390249c..ee1d63e03c52 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -52,6 +52,7 @@
>  #include <asm/machdep.h>
>  #include <asm/time.h>
>  #include <asm/runlatch.h>
> +#include <asm/stacktrace.h>
>  #include <asm/syscalls.h>
>  #include <asm/switch_to.h>
>  #include <asm/tm.h>
> @@ -1261,7 +1262,7 @@ struct task_struct *__switch_to(struct=20=20
>=20task_struct *prev,
>
>  static int instructions_to_print =3D 16;
>
> -static void show_instructions(struct pt_regs *regs)
> +void show_instructions(struct pt_regs *regs)
>  {
>  	int i;
>  	unsigned long pc =3D regs->nip - (instructions_to_print * 3 / 4 *
> @@ -1283,8 +1284,7 @@ static void show_instructions(struct pt_regs *regs)
>  			pc =3D (unsigned long)phys_to_virt(pc);
>  #endif
>
> -		if (!__kernel_text_address(pc) ||
> -		     probe_kernel_address((unsigned int __user *)pc, instr)) {
> +		if (probe_kernel_address((unsigned int __user *)pc, instr)) {
>  			pr_cont("XXXXXXXX ");
>  		} else {
>  			if (regs->nip =3D=3D pc)
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index e55ee639d010..3beca17ac1b1 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -70,6 +70,7 @@
>  #include <asm/hmi.h>
>  #include <sysdev/fsl_pci.h>
>  #include <asm/kprobes.h>
> +#include <asm/stacktrace.h>
>
>  #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC_CORE)
>  int (*__debugger)(struct pt_regs *regs) __read_mostly;
> @@ -357,6 +358,8 @@ static void show_signal_msg(int signr, struct=20=20
>=20pt_regs *regs, int code,
>  	print_vma_addr(KERN_CONT " in ", regs->nip);
>
>  	pr_cont("\n");
> +
> +	show_instructions(regs);
>  }
>
>  void _exception_pkey(int signr, struct pt_regs *regs, int code,
> --
> 2.17.1

^ permalink raw reply

* Re: [PATCH v07 2/9] hotplug/cpu: Add operation queuing function
From: Michael Bringmann @ 2018-07-25 15:57 UTC (permalink / raw)
  To: Nathan Fontenot, linuxppc-dev; +Cc: John Allen, Thomas Falcon, Tyrel Datwyler
In-Reply-To: <71455ad3-482b-d801-8f2e-fa2ddcb4017f@linux.vnet.ibm.com>

See below.

On 07/23/2018 12:51 PM, Nathan Fontenot wrote:
> On 07/13/2018 03:18 PM, Michael Bringmann wrote:
>> migration/dlpar: This patch adds function dlpar_queue_action()
>> which will queued up information about a CPU/Memory 'readd'
>> operation according to resource type, action code, and DRC index.
>> At a subsequent point, the list of operations can be run/played
>> in series.  Examples of such oprations include 'readd' of CPU
>> and Memory blocks identified as having changed their associativity
>> during an LPAR migration event. >
>> Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
>> ---
>> Changes in patch:
>>    -- Correct drc_index before adding to pseries_hp_errorlog struct
>>    -- Correct text of notice
>>    -- Revise queuing model to save up all of the DLPAR actions for
>>       later execution.
>>    -- Restore list init statement missing from patch
>>    -- Move call to apply queued operations into 'mobility.c'
>>    -- Compress some code
>>    -- Rename some of queueing function APIs
>>    -- Revise implementation to push execution of queued operations
>>       to a workqueue task.
>>    -- Cleanup reference to outdated queuing operation.
>> ---
>>   arch/powerpc/include/asm/rtas.h           |    2 +
>>   arch/powerpc/platforms/pseries/dlpar.c    |   61 +++++++++++++++++++++++++++++
>>   arch/powerpc/platforms/pseries/mobility.c |    4 ++
>>   arch/powerpc/platforms/pseries/pseries.h  |    2 +
>>   4 files changed, 69 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
>> index 71e393c..4f601c7 100644
>> --- a/arch/powerpc/include/asm/rtas.h
>> +++ b/arch/powerpc/include/asm/rtas.h
>> @@ -310,12 +310,14 @@ struct pseries_hp_errorlog {
>>           struct { __be32 count, index; } ic;
>>           char    drc_name[1];
>>       } _drc_u;
>> +    struct list_head list;
>>   };
>>
>>   #define PSERIES_HP_ELOG_RESOURCE_CPU    1
>>   #define PSERIES_HP_ELOG_RESOURCE_MEM    2
>>   #define PSERIES_HP_ELOG_RESOURCE_SLOT    3
>>   #define PSERIES_HP_ELOG_RESOURCE_PHB    4
>> +#define PSERIES_HP_ELOG_RESOURCE_PMT    5
>>
>>   #define PSERIES_HP_ELOG_ACTION_ADD    1
>>   #define PSERIES_HP_ELOG_ACTION_REMOVE    2
>> diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
>> index a0b20c0..7264b8e 100644
>> --- a/arch/powerpc/platforms/pseries/dlpar.c
>> +++ b/arch/powerpc/platforms/pseries/dlpar.c
>> @@ -25,6 +25,7 @@
>>   #include <asm/prom.h>
>>   #include <asm/machdep.h>
>>   #include <linux/uaccess.h>
>> +#include <linux/delay.h>
>>   #include <asm/rtas.h>
>>
>>   static struct workqueue_struct *pseries_hp_wq;
>> @@ -329,6 +330,8 @@ int dlpar_release_drc(u32 drc_index)
>>       return 0;
>>   }
>>
>> +static int dlpar_pmt(struct pseries_hp_errorlog *work);
>> +
>>   static int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_elog)
>>   {
>>       int rc;
>> @@ -357,6 +360,9 @@ static int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_elog)
>>       case PSERIES_HP_ELOG_RESOURCE_CPU:
>>           rc = dlpar_cpu(hp_elog);
>>           break;
>> +    case PSERIES_HP_ELOG_RESOURCE_PMT:
>> +        rc = dlpar_pmt(hp_elog);
>> +        break;
>>       default:
>>           pr_warn_ratelimited("Invalid resource (%d) specified\n",
>>                       hp_elog->resource);
>> @@ -407,6 +413,61 @@ void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog,
>>       }
>>   }
>>
>> +LIST_HEAD(dlpar_delayed_list);
>> +
>> +int dlpar_queue_action(int resource, int action, u32 drc_index)
>> +{
>> +    struct pseries_hp_errorlog *hp_errlog;
>> +
>> +    hp_errlog = kmalloc(sizeof(struct pseries_hp_errorlog), GFP_KERNEL);
>> +    if (!hp_errlog)
>> +        return -ENOMEM;
>> +
>> +    hp_errlog->resource = resource;
>> +    hp_errlog->action = action;
>> +    hp_errlog->id_type = PSERIES_HP_ELOG_ID_DRC_INDEX;
>> +    hp_errlog->_drc_u.drc_index = cpu_to_be32(drc_index);
>> +
>> +    list_add_tail(&hp_errlog->list, &dlpar_delayed_list);
>> +
>> +    return 0;
>> +}
>> +
>> +static int dlpar_pmt(struct pseries_hp_errorlog *work)
>> +{
>> +    struct list_head *pos, *q;
>> +
>> +    ssleep(15);
>> +
>> +    list_for_each_safe(pos, q, &dlpar_delayed_list) {
>> +        struct pseries_hp_errorlog *tmp;
>> +
>> +        tmp = list_entry(pos, struct pseries_hp_errorlog, list);
>> +        handle_dlpar_errorlog(tmp);
>> +
>> +        list_del(pos);
>> +        kfree(tmp);
>> +
>> +        ssleep(10);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int dlpar_queued_actions_run(void)
>> +{
>> +    if (!list_empty(&dlpar_delayed_list)) {
>> +        struct pseries_hp_errorlog hp_errlog;
>> +
>> +        hp_errlog.resource = PSERIES_HP_ELOG_RESOURCE_PMT;
>> +        hp_errlog.action = 0;
>> +        hp_errlog.id_type = 0;
>> +
>> +        queue_hotplug_event(&hp_errlog, 0, 0); > +    }
>> +    return 0;
>> +}
> 
> I'm a bit confused by this. Is there a reason this needs to queue a
> hotplug event instead of just walking the list as is done in dlpar_pmt?

Up to this point, the operations have only been added to 'dlpar_delayed_list'.
This function separates the execution of the CPU readd and Memory readd
operations from the execution of 'migration store'.  If we walk the list
here, then we add the execution time of all of the readd operations to
the time of 'migration store'.  This is not a large problem in small
systems like we have in the Kernel Team.  This may be a major issue though,
for production SAP HANA systems where we may be readding thousands of pages
of memory.  By pushing the execution of the CPU readd and Memory readd
operations after, and separate, from the execution of 'migration store',
we do not delay the end of the operation or the return of the completion
status to an associated HMC.

> 
> -Nathan

Michael

> 
>> +
>>   static int dlpar_parse_resource(char **cmd, struct pseries_hp_errorlog *hp_elog)
>>   {
>>       char *arg;
>> diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
>> index f6364d9..d0d1cae 100644
>> --- a/arch/powerpc/platforms/pseries/mobility.c
>> +++ b/arch/powerpc/platforms/pseries/mobility.c
>> @@ -378,6 +378,10 @@ static ssize_t migration_store(struct class *class,
>>           return rc;
>>
>>       post_mobility_fixup();
>> +
>> +    /* Apply any necessary changes identified during fixup */
>> +    dlpar_queued_actions_run();
>> +
>>       return count;
>>   }
>>
>> diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
>> index 60db2ee..72ca996 100644
>> --- a/arch/powerpc/platforms/pseries/pseries.h
>> +++ b/arch/powerpc/platforms/pseries/pseries.h
>> @@ -61,6 +61,8 @@ extern struct device_node *dlpar_configure_connector(__be32,
>>
>>   void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog,
>>                struct completion *hotplug_done, int *rc);
>> +int dlpar_queue_action(int resource, int action, u32 drc_index);
>> +int dlpar_queued_actions_run(void);
>>   #ifdef CONFIG_MEMORY_HOTPLUG
>>   int dlpar_memory(struct pseries_hp_errorlog *hp_elog);
>>   #else
>>
> 
> 

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
mwb@linux.vnet.ibm.com

^ permalink raw reply

* Re: [PATCH v07 2/9] hotplug/cpu: Add operation queuing function
From: Michael Bringmann @ 2018-07-25 15:49 UTC (permalink / raw)
  To: John Allen
  Cc: Nathan Fontenot, linuxppc-dev, Tyrel Datwyler, Thomas Falcon,
	John Allen
In-Reply-To: <20180723155414.2apx5uxrn3462qks@p50>

See below.

On 07/23/2018 10:54 AM, John Allen wrote:
> On Fri, Jul 13, 2018 at 03:18:01PM -0500, Michael Bringmann wrote:
>> migration/dlpar: This patch adds function dlpar_queue_action()
>> which will queued up information about a CPU/Memory 'readd'
>> operation according to resource type, action code, and DRC index.
>> At a subsequent point, the list of operations can be run/played
>> in series.  Examples of such oprations include 'readd' of CPU
>> and Memory blocks identified as having changed their associativity
>> during an LPAR migration event.
>>
>> Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
>> ---
>> Changes in patch:
>>  -- Correct drc_index before adding to pseries_hp_errorlog struct
>>  -- Correct text of notice
>>  -- Revise queuing model to save up all of the DLPAR actions for
>>     later execution.
>>  -- Restore list init statement missing from patch
>>  -- Move call to apply queued operations into 'mobility.c'
>>  -- Compress some code
>>  -- Rename some of queueing function APIs
>>  -- Revise implementation to push execution of queued operations
>>     to a workqueue task.
>>  -- Cleanup reference to outdated queuing operation.
>> ---
>> arch/powerpc/include/asm/rtas.h           |    2 +
>> arch/powerpc/platforms/pseries/dlpar.c    |   61 +++++++++++++++++++++++++++++
>> arch/powerpc/platforms/pseries/mobility.c |    4 ++
>> arch/powerpc/platforms/pseries/pseries.h  |    2 +
>> 4 files changed, 69 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
>> index 71e393c..4f601c7 100644
>> --- a/arch/powerpc/include/asm/rtas.h
>> +++ b/arch/powerpc/include/asm/rtas.h
>> @@ -310,12 +310,14 @@ struct pseries_hp_errorlog {
>>         struct { __be32 count, index; } ic;
>>         char    drc_name[1];
>>     } _drc_u;
>> +    struct list_head list;
>> };
>>
>> #define PSERIES_HP_ELOG_RESOURCE_CPU    1
>> #define PSERIES_HP_ELOG_RESOURCE_MEM    2
>> #define PSERIES_HP_ELOG_RESOURCE_SLOT    3
>> #define PSERIES_HP_ELOG_RESOURCE_PHB    4
>> +#define PSERIES_HP_ELOG_RESOURCE_PMT    5
>>
>> #define PSERIES_HP_ELOG_ACTION_ADD    1
>> #define PSERIES_HP_ELOG_ACTION_REMOVE    2
>> diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
>> index a0b20c0..7264b8e 100644
>> --- a/arch/powerpc/platforms/pseries/dlpar.c
>> +++ b/arch/powerpc/platforms/pseries/dlpar.c
>> @@ -25,6 +25,7 @@
>> #include <asm/prom.h>
>> #include <asm/machdep.h>
>> #include <linux/uaccess.h>
>> +#include <linux/delay.h>
>> #include <asm/rtas.h>
>>
>> static struct workqueue_struct *pseries_hp_wq;
>> @@ -329,6 +330,8 @@ int dlpar_release_drc(u32 drc_index)
>>     return 0;
>> }
>>
>> +static int dlpar_pmt(struct pseries_hp_errorlog *work);
>> +
>> static int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_elog)
>> {
>>     int rc;
>> @@ -357,6 +360,9 @@ static int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_elog)
>>     case PSERIES_HP_ELOG_RESOURCE_CPU:
>>         rc = dlpar_cpu(hp_elog);
>>         break;
>> +    case PSERIES_HP_ELOG_RESOURCE_PMT:
>> +        rc = dlpar_pmt(hp_elog);
>> +        break;
>>     default:
>>         pr_warn_ratelimited("Invalid resource (%d) specified\n",
>>                     hp_elog->resource);
>> @@ -407,6 +413,61 @@ void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog,
>>     }
>> }
>>
>> +LIST_HEAD(dlpar_delayed_list);
>> +
>> +int dlpar_queue_action(int resource, int action, u32 drc_index)
>> +{
>> +    struct pseries_hp_errorlog *hp_errlog;
>> +
>> +    hp_errlog = kmalloc(sizeof(struct pseries_hp_errorlog), GFP_KERNEL);
>> +    if (!hp_errlog)
>> +        return -ENOMEM;
>> +
>> +    hp_errlog->resource = resource;
>> +    hp_errlog->action = action;
>> +    hp_errlog->id_type = PSERIES_HP_ELOG_ID_DRC_INDEX;
>> +    hp_errlog->_drc_u.drc_index = cpu_to_be32(drc_index);
>> +
>> +    list_add_tail(&hp_errlog->list, &dlpar_delayed_list);
>> +
>> +    return 0;
>> +}
>> +
>> +static int dlpar_pmt(struct pseries_hp_errorlog *work)
>> +{
>> +    struct list_head *pos, *q;
>> +
>> +    ssleep(15);
> 
> Why do we need to sleep for so long here?

One or more drivers were finishing their initialization about the same
time as the 'migration_store' operation was completing.  E.g. 'ibmvscsi'
When we did 'dlpar cpu readd' operations at the same time early on in
testing, the errors / warnings were inconsistent.  I put in a delay to
get beyond the driver re-init after migration.  15 seconds was an estimate.
As it occurs after the completion of the 'migration store' operation,
it does not delay any responses returned to the HMC.

We can rerun tests with reduced / removed delays to see if they are still
necessary given some other corrections in the mix.

> 
> -John

Michael

>> +
>> +    list_for_each_safe(pos, q, &dlpar_delayed_list) {
>> +        struct pseries_hp_errorlog *tmp;
>> +
>> +        tmp = list_entry(pos, struct pseries_hp_errorlog, list);
>> +        handle_dlpar_errorlog(tmp);
>> +
>> +        list_del(pos);
>> +        kfree(tmp);
>> +
>> +        ssleep(10);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int dlpar_queued_actions_run(void)
>> +{
>> +    if (!list_empty(&dlpar_delayed_list)) {
>> +        struct pseries_hp_errorlog hp_errlog;
>> +
>> +        hp_errlog.resource = PSERIES_HP_ELOG_RESOURCE_PMT;
>> +        hp_errlog.action = 0;
>> +        hp_errlog.id_type = 0;
>> +
>> +        queue_hotplug_event(&hp_errlog, 0, 0);
>> +    }
>> +    return 0;
>> +}
>> +
>> static int dlpar_parse_resource(char **cmd, struct pseries_hp_errorlog *hp_elog)
>> {
>>     char *arg;
>> diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
>> index f6364d9..d0d1cae 100644
>> --- a/arch/powerpc/platforms/pseries/mobility.c
>> +++ b/arch/powerpc/platforms/pseries/mobility.c
>> @@ -378,6 +378,10 @@ static ssize_t migration_store(struct class *class,
>>         return rc;
>>
>>     post_mobility_fixup();
>> +
>> +    /* Apply any necessary changes identified during fixup */
>> +    dlpar_queued_actions_run();
>> +
>>     return count;
>> }
>>
>> diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
>> index 60db2ee..72ca996 100644
>> --- a/arch/powerpc/platforms/pseries/pseries.h
>> +++ b/arch/powerpc/platforms/pseries/pseries.h
>> @@ -61,6 +61,8 @@ extern struct device_node *dlpar_configure_connector(__be32,
>>
>> void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog,
>>              struct completion *hotplug_done, int *rc);
>> +int dlpar_queue_action(int resource, int action, u32 drc_index);
>> +int dlpar_queued_actions_run(void);
>> #ifdef CONFIG_MEMORY_HOTPLUG
>> int dlpar_memory(struct pseries_hp_errorlog *hp_elog);
>> #else
>>
> 
> 

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
mwb@linux.vnet.ibm.com

^ permalink raw reply

* Re: [PATCH 6/7] powerpc/traps: Print signal name for unhandled signals
From: LEROY Christophe @ 2018-07-25 15:49 UTC (permalink / raw)
  To: Murilo Opsfelder Araujo
  Cc: linuxppc-dev, Tobin C . Harding, Sukadev Bhattiprolu, Simon Guo,
	Paul Mackerras, Nicholas Piggin, Michael Neuling,
	Michael Ellerman, Eric W . Biederman, Cyril Bur,
	Benjamin Herrenschmidt, Balbir Singh, Andrew Donnellan,
	Alastair D'Silva, linux-kernel
In-Reply-To: <20180724192720.32417-7-muriloo@linux.ibm.com>

Murilo Opsfelder Araujo <muriloo@linux.ibm.com> a =C3=A9crit=C2=A0:

> This adds a human-readable name in the unhandled signal message.
>
> Before this patch, a page fault looked like:
>
>     Jul 11 16:04:11 localhost kernel: pandafault[6303]: unhandled=20=20
>=20signal 11 at 00000000100007d0 nip 000000001000061c lr=20=20
>=2000007fff93c55100 code 2 in pandafault[10000000+10000]
>
> After this patch, a page fault looks like:
>
>     Jul 11 18:14:48 localhost kernel: pandafault[6352]: segfault=20=20
>=20(11) at 000000013a2a09f8 nip 000000013a2a086c lr 00007fffb63e5100=20=20
>=20code 2 in pandafault[13a2a0000+10000]
>
> Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> ---
>  arch/powerpc/kernel/traps.c | 43 +++++++++++++++++++++++++++++++++----
>  1 file changed, 39 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index e6c43ef9fb50..e55ee639d010 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -96,6 +96,41 @@ EXPORT_SYMBOL(__debugger_fault_handler);
>  #define TM_DEBUG(x...) do { } while(0)
>  #endif
>
> +static const char *signames[SIGRTMIN + 1] =3D {
> +	"UNKNOWN",
> +	"SIGHUP",			// 1
> +	"SIGINT",			// 2
> +	"SIGQUIT",			// 3
> +	"SIGILL",			// 4
> +	"unhandled trap",		// 5 =3D SIGTRAP
> +	"SIGABRT",			// 6 =3D SIGIOT
> +	"bus error",			// 7 =3D SIGBUS
> +	"floating point exception",	// 8 =3D SIGFPE
> +	"illegal instruction",		// 9 =3D SIGILL
> +	"SIGUSR1",			// 10
> +	"segfault",			// 11 =3D SIGSEGV
> +	"SIGUSR2",			// 12
> +	"SIGPIPE",			// 13
> +	"SIGALRM",			// 14
> +	"SIGTERM",			// 15
> +	"SIGSTKFLT",			// 16
> +	"SIGCHLD",			// 17
> +	"SIGCONT",			// 18
> +	"SIGSTOP",			// 19
> +	"SIGTSTP",			// 20
> +	"SIGTTIN",			// 21
> +	"SIGTTOU",			// 22
> +	"SIGURG",			// 23
> +	"SIGXCPU",			// 24
> +	"SIGXFSZ",			// 25
> +	"SIGVTALRM",			// 26
> +	"SIGPROF",			// 27
> +	"SIGWINCH",			// 28
> +	"SIGIO",			// 29 =3D SIGPOLL =3D SIGLOST
> +	"SIGPWR",			// 30
> +	"SIGSYS",			// 31 =3D SIGUNUSED
> +};
> +
>  /*
>   * Trap & Exception support
>   */
> @@ -314,10 +349,10 @@ static void show_signal_msg(int signr, struct=20=20
>=20pt_regs *regs, int code,
>  	if (!unhandled_signal(current, signr))
>  		return;
>
> -	pr_info("%s[%d]: unhandled signal %d at "REG_FMT \
> -		" nip "REG_FMT" lr "REG_FMT" code %x",
> -		current->comm, current->pid, signr, addr,
> -		regs->nip, regs->link, code);
> +	pr_info("%s[%d]: %s (%d) at "REG_FMT" nip "REG_FMT \
> +		" lr "REG_FMT" code %x",
> +		current->comm, current->pid, signames[signr],
> +		signr, addr, regs->nip, regs->link, code);

Are we sure that signr is always within the limits of the table ?

Christophe
>
>  	print_vma_addr(KERN_CONT " in ", regs->nip);
>
> --
> 2.17.1

^ permalink raw reply

* Re: [PATCH 2/7] powerpc/traps: Return early in show_signal_msg()
From: LEROY Christophe @ 2018-07-25 15:42 UTC (permalink / raw)
  To: Murilo Opsfelder Araujo
  Cc: linuxppc-dev, Tobin C . Harding, Sukadev Bhattiprolu, Simon Guo,
	Paul Mackerras, Nicholas Piggin, Michael Neuling,
	Michael Ellerman, Eric W . Biederman, Cyril Bur,
	Benjamin Herrenschmidt, Balbir Singh, Andrew Donnellan,
	Alastair D'Silva, linux-kernel
In-Reply-To: <20180724192720.32417-3-muriloo@linux.ibm.com>

Murilo Opsfelder Araujo <muriloo@linux.ibm.com> a =C3=A9crit=C2=A0:

> Modify logic of show_signal_msg() to return early, if possible.  Replace
> printk_ratelimited() by printk() and a default rate limit burst to limit
> displaying unhandled signals messages.

Can you explain more the benefits of this change ?

Christophe

>
> Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> ---
>  arch/powerpc/kernel/traps.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index cbd3dc365193..4faab4705774 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -301,6 +301,13 @@ void user_single_step_siginfo(struct task_struct *ts=
k,
>  	info->si_addr =3D (void __user *)regs->nip;
>  }
>
> +static bool show_unhandled_signals_ratelimited(void)
> +{
> +	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
> +				      DEFAULT_RATELIMIT_BURST);
> +	return show_unhandled_signals && __ratelimit(&rs);
> +}
> +
>  static void show_signal_msg(int signr, struct pt_regs *regs, int code,
>  			    unsigned long addr)
>  {
> @@ -309,11 +316,12 @@ static void show_signal_msg(int signr, struct=20=20
>=20pt_regs *regs, int code,
>  	const char fmt64[] =3D KERN_INFO "%s[%d]: unhandled signal %d " \
>  		"at %016lx nip %016lx lr %016lx code %x\n";
>
> -	if (show_unhandled_signals && unhandled_signal(current, signr)) {
> -		printk_ratelimited(regs->msr & MSR_64BIT ? fmt64 : fmt32,
> -				   current->comm, current->pid, signr,
> -				   addr, regs->nip, regs->link, code);
> -	}
> +	if (!unhandled_signal(current, signr))
> +		return;
> +
> +	printk(regs->msr & MSR_64BIT ? fmt64 : fmt32,
> +	       current->comm, current->pid, signr,
> +	       addr, regs->nip, regs->link, code);
>  }
>
>  void _exception_pkey(int signr, struct pt_regs *regs, int code,
> @@ -326,7 +334,8 @@ void _exception_pkey(int signr, struct pt_regs=20=20
>=20*regs, int code,
>  		return;
>  	}
>
> -	show_signal_msg(signr, regs, code, addr);
> +	if (show_unhandled_signals_ratelimited())
> +		show_signal_msg(signr, regs, code, addr);
>
>  	if (arch_irqs_disabled() && !arch_irq_disabled_regs(regs))
>  		local_irq_enable();
> --
> 2.17.1

^ permalink raw reply

* Re: [PATCH 1/4] treewide: convert ISO_8859-1 text comments to utf-8
From: Joe Perches @ 2018-07-25 15:33 UTC (permalink / raw)
  To: Arnd Bergmann, Andrew Morton
  Cc: Samuel Ortiz, David S. Miller, Rob Herring, Michael Ellerman,
	Jonathan Cameron, linux-wireless, Networking, DTML,
	Linux Kernel Mailing List, Linux ARM,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE, linuxppc-dev,
	linux-iio, Linux PM list, lvs-devel, netfilter-devel, coreteam
In-Reply-To: <CAK8P3a3tOuP1FVS7oD1UhO5s4C+fLkL8VT3eCpRnSxBZxKzf6A@mail.gmail.com>

On Wed, 2018-07-25 at 15:12 +0200, Arnd Bergmann wrote:
> tools/perf/tests/.gitignore:
>                             LLVM byte-codes, uncompressed
> On Wed, Jul 25, 2018 at 2:55 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Tue, 24 Jul 2018 17:13:20 -0700 Joe Perches <joe@perches.com> wrote:
> > 
> > > On Tue, 2018-07-24 at 14:00 -0700, Andrew Morton wrote:
> > > > On Tue, 24 Jul 2018 13:13:25 +0200 Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > Almost all files in the kernel are either plain text or UTF-8
> > > > > encoded. A couple however are ISO_8859-1, usually just a few
> > > > > characters in a C comments, for historic reasons.
> > > > > This converts them all to UTF-8 for consistency.
> > > 
> > > []
> > > > Will we be getting a checkpatch rule to keep things this way?
> > > 
> > > How would that be done?
> > 
> > I'm using this, seems to work.
> > 
> >         if ! file $p | grep -q -P ", ASCII text|, UTF-8 Unicode text"
> >         then
> >                 echo $p: weird charset
> >         fi
> 
> There are a couple of files that my version of 'find' incorrectly identified as
> something completely different, like:
> 
> Documentation/devicetree/bindings/pinctrl/pinctrl-sx150x.txt:
>             SemOne archive data
> Documentation/devicetree/bindings/rtc/epson,rtc7301.txt:
>             Microsoft Document Imaging Format
> Documentation/filesystems/nfs/pnfs-block-server.txt:
>             PPMN archive data
> arch/arm/boot/dts/bcm283x-rpi-usb-host.dtsi:
>         Sendmail frozen configuration  - version = "host";
> Documentation/networking/segmentation-offloads.txt:
>         StuffIt Deluxe Segment (data) : gmentation Offloads in the
> Linux Networking Stack
> arch/sparc/include/asm/visasm.h:                              SAS 7+
> arch/xtensa/kernel/setup.c:                                         ,
> init=0x454c, stat=0x090a, dev=0x2009, bas=0x2020
> drivers/cpufreq/powernow-k8.c:
> TI-XX Graphing Calculator (FLASH)
> tools/testing/selftests/net/forwarding/tc_shblocks.sh:
>                             Minix filesystem, V2 (big endian)
> tools/perf/tests/.gitignore:
>                             LLVM byte-codes, uncompressed
> 
> All of the above seem to be valid ASCII or UTF-8 files, so the check
> above will lead
> to false-positives, but it may be good enough as they are the
> exception, and may be
> bugs in 'file'.
> 
> Not sure if we need to worry about 'file' not being installed.

checkpatch works on patches so I think the test isn't
really relevant.  It has to use the appropriate email
header that sets the charset.

perhaps:
---
 scripts/checkpatch.pl | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 34e4683de7a3..57355fbd2d28 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2765,9 +2765,13 @@ sub process {
 # Check if there is UTF-8 in a commit log when a mail header has explicitly
 # declined it, i.e defined some charset where it is missing.
 		if ($in_header_lines &&
-		    $rawline =~ /^Content-Type:.+charset="(.+)".*$/ &&
-		    $1 !~ /utf-8/i) {
-			$non_utf8_charset = 1;
+		    $rawline =~ /^Content-Type:.+charset="?([^\s;"]+)/) {
+			my $charset = $1;
+			$non_utf8_charset = 1 if ($charset !~ /^utf-8$/i);
+			if ($charset !~ /^(?:us-ascii|utf-8|iso-8859-1)$/) {
+				WARN("PATCH_CHARSET",
+				     "Unpreferred email header charset '$charset'\n" . $herecurr);
+			}
 		}
 
 		if ($in_commit_log && $non_utf8_charset && $realfile =~ /^$/ &&

^ permalink raw reply related

* Re: [PATCH 6/7] powerpc/traps: Print signal name for unhandled signals
From: Gustavo Romero @ 2018-07-25 15:19 UTC (permalink / raw)
  To: Murilo Opsfelder Araujo, linux-kernel
  Cc: Michael Neuling, Simon Guo, Nicholas Piggin, Paul Mackerras,
	Eric W . Biederman, Andrew Donnellan, Alastair D'Silva,
	Sukadev Bhattiprolu, linuxppc-dev, Cyril Bur, Tobin C . Harding
In-Reply-To: <20180724192720.32417-7-muriloo@linux.ibm.com>

Hi Murilo,

LGTM.

Just a comment:

On 07/24/2018 04:27 PM, Murilo Opsfelder Araujo wrote:
> This adds a human-readable name in the unhandled signal message.
> 
> Before this patch, a page fault looked like:
> 
>      Jul 11 16:04:11 localhost kernel: pandafault[6303]: unhandled signal 11 at 00000000100007d0 nip 000000001000061c lr 00007fff93c55100 code 2 in pandafault[10000000+10000]
> 
> After this patch, a page fault looks like:
> 
>      Jul 11 18:14:48 localhost kernel: pandafault[6352]: segfault (11) at 000000013a2a09f8 nip 000000013a2a086c lr 00007fffb63e5100 code 2 in pandafault[13a2a0000+10000]

I _really_ don't want to bikeshed here, but I vouch for keeping the
"unhandled" word before the signal name, like:

[...] pandafault[6352]: unhandled segfault (11) at 000000013a2a09f8 nip [...]

because the issue reported here is really that we got a segfault _and_
there was no handler to catch it.

But feel free to wait for additional comments to decide it.


Cheers,
Gustavo

^ permalink raw reply

* Re: [PATCH] powerpc/64s/radix: tlb do not flush on page size when fullmm
From: Aneesh Kumar K.V @ 2018-07-25 14:17 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Aneesh Kumar K . V, Nicholas Piggin
In-Reply-To: <20180725135806.29648-1-npiggin@gmail.com>

Nicholas Piggin <npiggin@gmail.com> writes:

> When the mm is being torn down there will be a full PID flush so
> there is no need to flush the TLB on page size changes.

and that tlb flush is PID flush since tlb->fulmm is set? So this avoids
multiple PID tlb flush.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>


>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  arch/powerpc/include/asm/tlb.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
> index 5d3107f2b014..d1b3dc4a6a0a 100644
> --- a/arch/powerpc/include/asm/tlb.h
> +++ b/arch/powerpc/include/asm/tlb.h
> @@ -84,6 +84,9 @@ static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
>  static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
>  						     unsigned int page_size)
>  {
> +	if (tlb->fullmm)
> +		return;
> +
>  	if (!tlb->page_size)
>  		tlb->page_size = page_size;
>  	else if (tlb->page_size != page_size) {
> -- 
> 2.17.0

^ permalink raw reply

* [RFC PATCH 4/4] powerpc/64s/radix: optimise TLB flush with precise TLB ranges in mmu_gather
From: Nicholas Piggin @ 2018-07-25 14:06 UTC (permalink / raw)
  To: linux-mm; +Cc: Nicholas Piggin, linuxppc-dev, linux-arch, linux-arm-kernel
In-Reply-To: <20180725140641.30372-1-npiggin@gmail.com>

The mmu_gather APIs keep track of the invalidated address range, and
the generic page table freeing accessors expand the invalidated range
to cover the addresses corresponding to the page tables even if there
are no ptes and therefore no TLB entries to invalidate. This is done
for architectures that have paging structure caches that are
invalidated with their TLB invalidate instructions (e.g., x86).

powerpc/64s/radix does have a "page walk cache" (PWC), but it is
invalidated with a specific instruction and tracked independently in
the mmu_gather (using the need_flush_all flag to indicate PWC must be
flushed). Therefore TLB invalidation does not have to be expanded to
cover freed page tables.

This patch defines p??_free_tlb functions for 64s, which do not expand
the TLB flush range over page table pages. This brings the number of
tlbiel instructions required by a kernel compile from 33M to 25M, most
avoided from exec => shift_arg_pages().

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/tlb.h | 34 ++++++++++++++++++++++++++++++++++
 arch/powerpc/mm/tlb-radix.c    | 10 ++++++++++
 include/asm-generic/tlb.h      |  5 +++++
 3 files changed, 49 insertions(+)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 9138baccebb0..5d3107f2b014 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -30,6 +30,40 @@
 #define __tlb_remove_tlb_entry	__tlb_remove_tlb_entry
 #define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
 
+#ifdef CONFIG_PPC_BOOK3S_64
+/*
+ * powerpc book3s hash does not have page table structure caches, and
+ * radix requires explicit management with PWC invalidate tlb type, so
+ * there is no need to expand the mmu_gather range over invalidated page
+ * table pages like the generic code does.
+ */
+
+#define pte_free_tlb(tlb, ptep, address)			\
+	do {							\
+		__pte_free_tlb(tlb, ptep, address);		\
+	} while (0)
+
+#define pmd_free_tlb(tlb, pmdp, address)			\
+	do {							\
+		__pmd_free_tlb(tlb, pmdp, address);		\
+	} while (0)
+
+#define pud_free_tlb(tlb, pudp, address)			\
+	do {							\
+		__pud_free_tlb(tlb, pudp, address);		\
+	} while (0)
+
+/*
+ * Radix sets need_flush_all when page table pages have been unmapped
+ * and the PWC needs flushing. Generic code must call our tlb_flush
+ * even on empty ranges in this case.
+ *
+ * This will always be false for hash.
+ */
+#define arch_tlb_mustflush(tlb) (tlb->need_flush_all)
+
+#endif
+
 extern void tlb_flush(struct mmu_gather *tlb);
 
 /* Get the generic bits... */
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 1135b43a597c..238b20a513e7 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -862,6 +862,16 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 	unsigned long start = tlb->start;
 	unsigned long end = tlb->end;
 
+	/*
+	 * This can happen if need_flush_all is set due to a page table
+	 * invalidate, but no ptes under it freed (see arch_tlb_mustflush).
+	 * Set end = start to prevent any TLB flushing here (only PWC).
+	 */
+	if (!end) {
+		WARN_ON_ONCE(!tlb->need_flush_all);
+		end = start;
+	}
+
 	/*
 	 * if page size is not something we understand, do a full mm flush
 	 *
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b320c0cc8996..a55ef1425f0d 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -285,6 +285,11 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
  * http://lkml.kernel.org/r/CA+55aFzBggoXtNXQeng5d_mRoDnaMBE5Y+URs+PHR67nUpMtaw@mail.gmail.com
  *
  * For now w.r.t page table cache, mark the range_size as PAGE_SIZE
+ *
+ * Update: powerpc (Book3S 64-bit, radix MMU) has an architected page table
+ * cache (called PWC), and invalidates it specifically. It sets the
+ * need_flush_all flag to indicate the PWC requires flushing, so it defines
+ * its own p??_free_tlb functions which do not expand the TLB range.
  */
 
 #ifndef pte_free_tlb
-- 
2.17.0

^ permalink raw reply related

* [RFC PATCH 3/4] mm: allow arch to have tlb_flush caled on an empty TLB range
From: Nicholas Piggin @ 2018-07-25 14:06 UTC (permalink / raw)
  To: linux-mm; +Cc: Nicholas Piggin, linuxppc-dev, linux-arch, linux-arm-kernel
In-Reply-To: <20180725140641.30372-1-npiggin@gmail.com>

powerpc wants to de-couple page table caching structure flushes
from TLB flushes, which will make it possible to have mmu_gather
with freed page table pages but no TLB range. These must be sent
to tlb_flush, so allow the arch to specify when mmu_gather with
empty ranges should have tlb_flush called.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 include/asm-generic/tlb.h | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b3353e21f3b3..b320c0cc8996 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -139,14 +139,27 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 	}
 }
 
+/*
+ * arch_tlb_mustflush specifies if tlb_flush is to be called even if the
+ * TLB range is empty (this can be the case for freeing page table pages
+ * if the arch does not adjust TLB range to cover them).
+ */
+#ifndef arch_tlb_mustflush
+#define arch_tlb_mustflush(tlb) false
+#endif
+
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-	if (!tlb->end)
+	unsigned long start = tlb->start;
+	unsigned long end = tlb->end;
+
+	if (!(end || arch_tlb_mustflush(tlb)))
 		return;
 
 	tlb_flush(tlb);
-	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
 	__tlb_reset_range(tlb);
+	if (end)
+		mmu_notifier_invalidate_range(tlb->mm, start, end);
 }
 
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
-- 
2.17.0

^ permalink raw reply related

* [RFC PATCH 2/4] mm: mmu_notifier fix for tlb_end_vma
From: Nicholas Piggin @ 2018-07-25 14:06 UTC (permalink / raw)
  To: linux-mm; +Cc: Nicholas Piggin, linuxppc-dev, linux-arch, linux-arm-kernel
In-Reply-To: <20180725140641.30372-1-npiggin@gmail.com>

The generic tlb_end_vma does not call invalidate_range mmu notifier,
and it resets resets the mmu_gather range, which means the notifier
won't be called on part of the range in case of an unmap that spans
multiple vmas.

ARM64 seems to be the only arch I could see that has notifiers and
uses the generic tlb_end_vma. I have not actually tested it.
---
 include/asm-generic/tlb.h | 17 +++++++++++++----
 mm/memory.c               | 10 ----------
 2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 3063125197ad..b3353e21f3b3 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -15,6 +15,7 @@
 #ifndef _ASM_GENERIC__TLB_H
 #define _ASM_GENERIC__TLB_H
 
+#include <linux/mmu_notifier.h>
 #include <linux/swap.h>
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
@@ -138,6 +139,16 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 	}
 }
 
+static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+{
+	if (!tlb->end)
+		return;
+
+	tlb_flush(tlb);
+	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
+	__tlb_reset_range(tlb);
+}
+
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
 					struct page *page, int page_size)
 {
@@ -186,10 +197,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 
 #define __tlb_end_vma(tlb, vma)					\
 	do {							\
-		if (!tlb->fullmm && tlb->end) {			\
-			tlb_flush(tlb);				\
-			__tlb_reset_range(tlb);			\
-		}						\
+		if (!tlb->fullmm)				\
+			tlb_flush_mmu_tlbonly(tlb);		\
 	} while (0)
 
 #ifndef tlb_end_vma
diff --git a/mm/memory.c b/mm/memory.c
index bc053d5e9d41..135d18b31e44 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -238,16 +238,6 @@ void arch_tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 	__tlb_reset_range(tlb);
 }
 
-static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
-{
-	if (!tlb->end)
-		return;
-
-	tlb_flush(tlb);
-	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
-	__tlb_reset_range(tlb);
-}
-
 static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
-- 
2.17.0

^ permalink raw reply related

* [RFC PATCH 1/4] mm: move tlb_table_flush to tlb_flush_mmu_free
From: Nicholas Piggin @ 2018-07-25 14:06 UTC (permalink / raw)
  To: linux-mm; +Cc: Nicholas Piggin, linuxppc-dev, linux-arch, linux-arm-kernel
In-Reply-To: <20180725140641.30372-1-npiggin@gmail.com>

There is no need to call this from tlb_flush_mmu_tlbonly, it
logically belongs with tlb_flush_mmu_free. This allows some
code consolidation with a subsequent fix.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/memory.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..bc053d5e9d41 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -245,9 +245,6 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 
 	tlb_flush(tlb);
 	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-	tlb_table_flush(tlb);
-#endif
 	__tlb_reset_range(tlb);
 }
 
@@ -255,6 +252,9 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
 
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+	tlb_table_flush(tlb);
+#endif
 	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
 		free_pages_and_swap_cache(batch->pages, batch->nr);
 		batch->nr = 0;
-- 
2.17.0

^ permalink raw reply related

* [RFC PATCH 0/4] mm: mmu_gather changes to support explicit paging
From: Nicholas Piggin @ 2018-07-25 14:06 UTC (permalink / raw)
  To: linux-mm; +Cc: Nicholas Piggin, linuxppc-dev, linux-arch, linux-arm-kernel

The first 3 patches in this series are some generic mm changes I
would like to make, including a possible fix which may(?) be needed
for ARM64. Other than the bugfix, these first 3 patches should not
change anything so hopefully they aren't too controversial.

The powerpc patch is also there for reference. 

Thanks,
Nick

Nicholas Piggin (4):
  mm: move tlb_table_flush to tlb_flush_mmu_free
  mm: mmu_notifier fix for tlb_end_vma
  mm: allow arch to have tlb_flush caled on an empty TLB range
  powerpc/64s/radix: optimise TLB flush with precise TLB ranges in
    mmu_gather

 arch/powerpc/include/asm/tlb.h | 34 +++++++++++++++++++++++++++++++++
 arch/powerpc/mm/tlb-radix.c    | 10 ++++++++++
 include/asm-generic/tlb.h      | 35 ++++++++++++++++++++++++++++++----
 mm/memory.c                    | 14 ++------------
 4 files changed, 77 insertions(+), 16 deletions(-)

-- 
2.17.0

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox