LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 0/5] Powerpc/hw-breakpoint: Fixes plus Code refactor
From: Ravi Bangoria @ 2019-06-19  7:47 UTC (permalink / raw)
  To: Michael Neuling, Christophe Leroy
  Cc: Ravi Bangoria, linux-kernel, npiggin, paulus, naveen.n.rao,
	linuxppc-dev
In-Reply-To: <85d5494d2a5d4a887e739c379105436e498217a8.camel@neuling.org>



On 6/18/19 11:47 AM, Michael Neuling wrote:
> On Tue, 2019-06-18 at 08:01 +0200, Christophe Leroy wrote:
>>
>> Le 18/06/2019 à 06:27, Ravi Bangoria a écrit :
>>> patch 1-3: Code refactor
>>> patch 4: Speedup disabling breakpoint
>>> patch 5: Fix length calculation for unaligned targets
>>
>> While you are playing with hw breakpoints, did you have a look at 
>> https://github.com/linuxppc/issues/issues/38 ?
> 
> Agreed and also: 
> 
> https://github.com/linuxppc/issues/issues/170
> 
> https://github.com/linuxppc/issues/issues/128 
> 

Yes, I'm aware of those. Will have a look at them.


^ permalink raw reply

* [RFC PATCH v0] powerpc: Fix BUG_ON during memory unplug on radix
From: Bharata B Rao @ 2019-06-19  7:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: aneesh.kumar, npiggin, Bharata B Rao

We hit the following BUG_ON when memory hotplugged before reboot
is unplugged after reboot:

kernel BUG at arch/powerpc/mm/pgtable-frag.c:113!

 remove_pagetable+0x594/0x6a0
 (unreliable)
 remove_pagetable+0x94/0x6a0
 vmemmap_free+0x394/0x410
 sparse_remove_one_section+0x26c/0x2e8
 __remove_pages+0x428/0x540
 arch_remove_memory+0xd0/0x170
 __remove_memory+0xd4/0x1a0
 dlpar_remove_lmb+0xbc/0x110
 dlpar_memory+0xa80/0xd20
 handle_dlpar_errorlog+0xa8/0x160
 pseries_hp_work_fn+0x2c/0x60
 process_one_work+0x46c/0x860
 worker_thread+0x364/0x5e0
 kthread+0x1b0/0x1c0
 ret_from_kernel_thread+0x5c/0x68

This occurs because, during reboot-after-hotplug, the hotplugged
memory range gets initialized as regular memory and page
tables are setup using memblock allocator. This means that we
wouldn't have initialized the PMD or PTE fragment count for
those PMD or PTE pages.

Fixing this includes 3 aspects:

- Walk the init_mm page tables from mem_init() and initialize
  the PMD and PTE fragment counts appropriately.
- When we do early allocation of PMD (and PGD as well) pages,
  allocate in page size PAGE_SIZE granularity so that we are
  sure that the complete page is available for us to set the
  fragment count which is part of struct page.
- When PMD or PTE page is freed, check if it comes from memblock
  allocator and free it appropriately.

Reported-by: Srikanth Aithal <sraithal@linux.vnet.ibm.com>
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/radix.h |  1 +
 arch/powerpc/include/asm/sparsemem.h       |  1 +
 arch/powerpc/mm/book3s64/pgtable.c         | 12 +++-
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 67 +++++++++++++++++++++-
 arch/powerpc/mm/mem.c                      |  5 ++
 arch/powerpc/mm/pgtable-frag.c             |  5 +-
 6 files changed, 87 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 574eca33f893..4320f2790e8d 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -285,6 +285,7 @@ static inline unsigned long radix__get_tree_size(void)
 #ifdef CONFIG_MEMORY_HOTPLUG
 int radix__create_section_mapping(unsigned long start, unsigned long end, int nid);
 int radix__remove_section_mapping(unsigned long start, unsigned long end);
+void radix__fixup_pgtable_fragments(void);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 #endif /* __ASSEMBLY__ */
 #endif
diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h
index 3192d454a733..e662f9232d35 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -15,6 +15,7 @@
 #ifdef CONFIG_MEMORY_HOTPLUG
 extern int create_section_mapping(unsigned long start, unsigned long end, int nid);
 extern int remove_section_mapping(unsigned long start, unsigned long end);
+void fixup_pgtable_fragments(void);
 
 #ifdef CONFIG_PPC_BOOK3S_64
 extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 01bc9663360d..7efe9cc16b39 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -186,6 +186,13 @@ int __meminit remove_section_mapping(unsigned long start, unsigned long end)
 
 	return hash__remove_section_mapping(start, end);
 }
+
+void fixup_pgtable_fragments(void)
+{
+	if (radix_enabled())
+		radix__fixup_pgtable_fragments();
+}
+
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 void __init mmu_partition_table_init(void)
@@ -320,7 +327,10 @@ void pmd_fragment_free(unsigned long *pmd)
 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
 		pgtable_pmd_page_dtor(page);
-		__free_page(page);
+		if (PageReserved(page))
+			free_reserved_page(page);
+		else
+			__free_page(page);
 	}
 }
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 273ae66a9a45..402e8da28cab 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -32,6 +32,69 @@
 unsigned int mmu_pid_bits;
 unsigned int mmu_base_pid;
 
+static void fixup_pmd_fragments(pmd_t *pmd)
+{
+	int i;
+
+	for (i = 0; i < PTRS_PER_PMD; i++, pmd++) {
+		pte_t *pte;
+		struct page *page;
+
+		if (pmd_none(*pmd))
+			continue;
+		if (pmd_huge(*pmd))
+			continue;
+
+		pte = pte_offset_kernel(pmd, 0);
+		if (!pte_none(*pte)) {
+			page = virt_to_page(pte);
+			atomic_inc(&page->pt_frag_refcount);
+		}
+	}
+}
+
+static void fixup_pud_fragments(pud_t *pud)
+{
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
+		pmd_t *pmd;
+		struct page *page;
+
+		if (pud_none(*pud))
+			continue;
+		if (pud_huge(*pud))
+			continue;
+
+		pmd = pmd_offset(pud, 0);
+		if (!pmd_none(*pmd)) {
+			page = virt_to_page(pmd);
+			atomic_inc(&page->pt_frag_refcount);
+		}
+		fixup_pmd_fragments(pmd);
+	}
+}
+
+void radix__fixup_pgtable_fragments(void)
+{
+	int i;
+	pgd_t *pgd = pgd_offset_k(0UL);
+
+	spin_lock(&init_mm.page_table_lock);
+	for (i = 0; i < PTRS_PER_PGD; i++, pgd++) {
+		pud_t *pud;
+
+		if (pgd_none(*pgd))
+			continue;
+		if (pgd_huge(*pgd))
+			continue;
+
+		pud = pud_offset(pgd, 0);
+		fixup_pud_fragments(pud);
+	}
+	spin_unlock(&init_mm.page_table_lock);
+}
+
 static int native_register_process_table(unsigned long base, unsigned long pg_sz,
 					 unsigned long table_size)
 {
@@ -80,7 +143,7 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
 
 	pgdp = pgd_offset_k(ea);
 	if (pgd_none(*pgdp)) {
-		pudp = early_alloc_pgtable(PUD_TABLE_SIZE, nid,
+		pudp = early_alloc_pgtable(PAGE_SIZE, nid,
 						region_start, region_end);
 		pgd_populate(&init_mm, pgdp, pudp);
 	}
@@ -90,7 +153,7 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
 		goto set_the_pte;
 	}
 	if (pud_none(*pudp)) {
-		pmdp = early_alloc_pgtable(PMD_TABLE_SIZE, nid,
+		pmdp = early_alloc_pgtable(PAGE_SIZE, nid,
 						region_start, region_end);
 		pud_populate(&init_mm, pudp, pmdp);
 	}
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index cba29131bccc..a8788b404266 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -51,6 +51,10 @@
 
 #include <mm/mmu_decl.h>
 
+void __weak fixup_pgtable_fragments(void)
+{
+}
+
 #ifndef CPU_FTR_COHERENT_ICACHE
 #define CPU_FTR_COHERENT_ICACHE	0	/* XXX for now */
 #define CPU_FTR_NOEXECUTE	0
@@ -276,6 +280,7 @@ void __init mem_init(void)
 	set_max_mapnr(max_pfn);
 	memblock_free_all();
 
+	fixup_pgtable_fragments();
 #ifdef CONFIG_HIGHMEM
 	{
 		unsigned long pfn, highmem_mapnr;
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index a7b05214760c..694de7c731aa 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -114,6 +114,9 @@ void pte_fragment_free(unsigned long *table, int kernel)
 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
 		if (!kernel)
 			pgtable_page_dtor(page);
-		__free_page(page);
+		if (PageReserved(page))
+			free_reserved_page(page);
+		else
+			__free_page(page);
 	}
 }
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH 5/5] Powerpc/Watchpoint: Fix length calculation for unaligned target
From: Ravi Bangoria @ 2019-06-19  7:45 UTC (permalink / raw)
  To: Michael Neuling
  Cc: Ravi Bangoria, linux-kernel, npiggin, paulus, naveen.n.rao,
	linuxppc-dev
In-Reply-To: <707bc0b664b8ebbb843a1541155fed219c216035.camel@neuling.org>



On 6/18/19 7:02 PM, Michael Neuling wrote:
> On Tue, 2019-06-18 at 09:57 +0530, Ravi Bangoria wrote:
>> Watchpoint match range is always doubleword(8 bytes) aligned on
>> powerpc. If the given range is crossing doubleword boundary, we
>> need to increase the length such that next doubleword also get
>> covered. Ex,
>>
>>           address   len = 6 bytes
>>                 |=========.
>>    |------------v--|------v--------|
>>    | | | | | | | | | | | | | | | | |
>>    |---------------|---------------|
>>     <---8 bytes--->
>>
>> In such case, current code configures hw as:
>>   start_addr = address & ~HW_BREAKPOINT_ALIGN
>>   len = 8 bytes
>>
>> And thus read/write in last 4 bytes of the given range is ignored.
>> Fix this by including next doubleword in the length. Watchpoint
>> exception handler already ignores extraneous exceptions, so no
>> changes required for that.
> 
> Nice catch. Thanks.
> 
> I assume this has been broken forever? Should we be CCing stable? If so, it
> would be nice to have this self contained (separate from the refactor) so we can
> more easily backport it.

Yes this has been broken forever. I'll add Fixes: tag and cc stable.

> 
> Also, can you update 
> tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c to catch this issue?

Sure, will add the test case.

[...]

>> +u16 hw_breakpoint_get_final_len(struct arch_hw_breakpoint *brk,
>> +				unsigned long *start_addr,
>> +				unsigned long *end_addr)
> 
> I don't really like this.  "final" is not a good name. Something like hardware
> would be better.
> 
> Also, can you put the start_addr and end addr in the arch_hw_breakpoint rather
> than doing what you have above.  Call them hw_start_addr, hw_end_addr.
> 
> We could even set these two new addresses where we set the set of
> arch_hw_breakpoint rather than having this late call.

Sure, will use 'hw_' prefix for them.


^ permalink raw reply

* Re: [PATCH kernel] powerpc/powernv/npu: Export symbols as GPL
From: Christoph Hellwig @ 2019-06-19  7:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alistair Popple, linuxppc-dev, Piotr Jaroszynski, Reza Arbab,
	Christoph Hellwig
In-Reply-To: <20190619041312.81217-1-aik@ozlabs.ru>

On Wed, Jun 19, 2019 at 02:13:12PM +1000, Alexey Kardashevskiy wrote:
> The out-of-tree NVIDIA driver has been re-licensed recently to MIT/GPL
> so we can do the right thing and restrict the exported symbols to GPL
> only.

Which still does not matter until it actually is in mainline.  We don't
keep kernel infrastructure for purely out of tree drivers around.

^ permalink raw reply

* Re: [PATCH 3/4] powerpc/powernv: remove dead NPU DMA code
From: Christoph Hellwig @ 2019-06-19  7:28 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linux-kernel, Paul Mackerras, linuxppc-dev, Christoph Hellwig
In-Reply-To: <db502ec4-2e8f-fbc3-9db2-3fe98464a62c@ozlabs.ru>

On Wed, Jun 19, 2019 at 10:34:54AM +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 23/05/2019 17:49, Christoph Hellwig wrote:
> > None of these routines were ever used since they were added to the
> > kernel.
> 
> 
> It is still being used exactly in the way as it was explained before in
> previous respins. Thanks.

Please point to the in-kernel user, because that is the only relevant
one.  This is not just my opinion but we had a clear discussion on that
at least years kernel summit.

^ permalink raw reply

* Re: [PATCH 4/7] powerpc/ftrace: Additionally nop out the preceding mflr with -mprofile-kernel
From: Nicholas Piggin @ 2019-06-19  7:10 UTC (permalink / raw)
  To: Masami Hiramatsu, Ingo Molnar, Michael Ellerman, Naveen N. Rao,
	Steven Rostedt
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <877e9idum7.fsf@concordia.ellerman.id.au>

Michael Ellerman's on June 19, 2019 3:14 pm:
> Hi Naveen,
> 
> Sorry I meant to reply to this earlier .. :/
> 
> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
>> With -mprofile-kernel, gcc emits 'mflr r0', followed by 'bl _mcount' to
>> enable function tracing and profiling. So far, with dynamic ftrace, we
>> used to only patch out the branch to _mcount(). However, mflr is
>> executed by the branch unit that can only execute one per cycle on
>> POWER9 and shared with branches, so it would be nice to avoid it where
>> possible.
>>
>> We cannot simply nop out the mflr either. When enabling function
>> tracing, there can be a race if tracing is enabled when some thread was
>> interrupted after executing a nop'ed out mflr. In this case, the thread
>> would execute the now-patched-in branch to _mcount() without having
>> executed the preceding mflr.
>>
>> To solve this, we now enable function tracing in 2 steps: patch in the
>> mflr instruction, use synchronize_rcu_tasks() to ensure all existing
>> threads make progress, and then patch in the branch to _mcount(). We
>> override ftrace_replace_code() with a powerpc64 variant for this
>> purpose.
> 
> According to the ISA we're not allowed to patch mflr at runtime. See the
> section on "CMODX".

According to "quasi patch class" engineering note, we can patch
anything with a preferred nop. But that's written as an optional
facility, which we don't have a feature to test for.

> 
> I'm also not convinced the ordering between the two patches is
> guaranteed by the ISA, given that there's possibly no isync on the other
> CPU.

Will they go through a context synchronizing event?

synchronize_rcu_tasks() should ensure a thread is scheduled away, but
I'm not actually sure it guarantees CSI if it's kernel->kernel. Could
do a smp_call_function to do the isync on each CPU to be sure.

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH 5/5] Powerpc/Watchpoint: Fix length calculation for unaligned target
From: Ravi Bangoria @ 2019-06-19  6:51 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Ravi Bangoria, mikey, linux-kernel, npiggin, paulus, naveen.n.rao,
	linuxppc-dev
In-Reply-To: <3390c3c2-8290-da55-a183-872c593c7b1e@c-s.fr>


On 6/18/19 12:16 PM, Christophe Leroy wrote:
>>   +/* Maximum len for DABR is 8 bytes and DAWR is 512 bytes */
>> +static int hw_breakpoint_validate_len(struct arch_hw_breakpoint *hw)
>> +{
>> +    u16 length_max = 8;
>> +    u16 final_len;
> 
> You should be more consistent in naming. If one is called final_len, the other one should be called max_len.

Copy/paste :). Will change it.

> 
>> +    unsigned long start_addr, end_addr;
>> +
>> +    final_len = hw_breakpoint_get_final_len(hw, &start_addr, &end_addr);
>> +
>> +    if (dawr_enabled()) {
>> +        length_max = 512;
>> +        /* DAWR region can't cross 512 bytes boundary */
>> +        if ((start_addr >> 9) != (end_addr >> 9))
>> +            return -EINVAL;
>> +    }
>> +
>> +    if (final_len > length_max)
>> +        return -EINVAL;
>> +
>> +    return 0;
>> +}
>> +
> 
> Is many places, we have those numeric 512 and 9 shift. Could we replace them by some symbol, for instance DAWR_SIZE and DAWR_SHIFT ?

I don't see any other place where we check for boundary limit.

[...]

> 
>> +u16 hw_breakpoint_get_final_len(struct arch_hw_breakpoint *brk,
>> +                unsigned long *start_addr,
>> +                unsigned long *end_addr)
>> +{
>> +    *start_addr = brk->address & ~HW_BREAKPOINT_ALIGN;
>> +    *end_addr = (brk->address + brk->len - 1) | HW_BREAKPOINT_ALIGN;
>> +    return *end_addr - *start_addr + 1;
>> +}
> 
> This function gives horrible code (a couple of unneeded store/re-read and read/re-read).
> 
> 000006bc <hw_breakpoint_get_final_len>:
>      6bc:    81 23 00 00     lwz     r9,0(r3)
>      6c0:    55 29 00 38     rlwinm  r9,r9,0,0,28
>      6c4:    91 24 00 00     stw     r9,0(r4)
>      6c8:    81 43 00 00     lwz     r10,0(r3)
>      6cc:    a1 23 00 06     lhz     r9,6(r3)
>      6d0:    38 6a ff ff     addi    r3,r10,-1
>      6d4:    7c 63 4a 14     add     r3,r3,r9
>      6d8:    60 63 00 07     ori     r3,r3,7
>      6dc:    90 65 00 00     stw     r3,0(r5)
>      6e0:    38 63 00 01     addi    r3,r3,1
>      6e4:    81 24 00 00     lwz     r9,0(r4)
>      6e8:    7c 69 18 50     subf    r3,r9,r3
>      6ec:    54 63 04 3e     clrlwi  r3,r3,16
>      6f0:    4e 80 00 20     blr
> 
> Below code gives something better:
> 
> u16 hw_breakpoint_get_final_len(struct arch_hw_breakpoint *brk,
>                 unsigned long *start_addr,
>                 unsigned long *end_addr)
> {
>     unsigned long address = brk->address;
>     unsigned long len = brk->len;
>     unsigned long start = address & ~HW_BREAKPOINT_ALIGN;
>     unsigned long end = (address + len - 1) | HW_BREAKPOINT_ALIGN;
> 
>     *start_addr = start;
>     *end_addr = end;
>     return end - start + 1;
> }
> 
> 000006bc <hw_breakpoint_get_final_len>:
>      6bc:    81 43 00 00     lwz     r10,0(r3)
>      6c0:    a1 03 00 06     lhz     r8,6(r3)
>      6c4:    39 2a ff ff     addi    r9,r10,-1
>      6c8:    7d 28 4a 14     add     r9,r8,r9
>      6cc:    55 4a 00 38     rlwinm  r10,r10,0,0,28
>      6d0:    61 29 00 07     ori     r9,r9,7
>      6d4:    91 44 00 00     stw     r10,0(r4)
>      6d8:    20 6a 00 01     subfic  r3,r10,1
>      6dc:    91 25 00 00     stw     r9,0(r5)
>      6e0:    7c 63 4a 14     add     r3,r3,r9
>      6e4:    54 63 04 3e     clrlwi  r3,r3,16
>      6e8:    4e 80 00 20     blr
> 
> 
> And regardless, that's a pitty to have this function using pointers which are from local variables in the callers, as we loose the benefit of registers. Couldn't this function go in the .h as a static inline ? I'm sure the result would be worth it.

This is obviously a bit of optimization, but I like Mikey's idea of
storing start_addr and end_addr in the arch_hw_breakpoint. That way
we don't have to recalculate length every time in set_dawr.


^ permalink raw reply

* Re: [RFC PATCH v2] powerpc/xmon: restrict when kernel is locked down
From: Daniel Axtens @ 2019-06-19  6:24 UTC (permalink / raw)
  To: Andrew Donnellan, Christopher M Riedl, linuxppc-dev,
	kernel-hardening; +Cc: mjg59
In-Reply-To: <57844920-c17b-d93c-66c0-e6822af71929@linux.ibm.com>

Andrew Donnellan <ajd@linux.ibm.com> writes:

> On 4/6/19 1:05 pm, Christopher M Riedl wrote:>>> +	if (!xmon_is_ro) {
>>>> +		xmon_is_ro = kernel_is_locked_down("Using xmon write-access",
>>>> +						   LOCKDOWN_INTEGRITY);
>>>> +		if (xmon_is_ro) {
>>>> +			printf("xmon: Read-only due to kernel lockdown\n");
>>>> +			clear_all_bpt();
>>>
>>> Remind me again why we need to clear breakpoints in integrity mode?
>>>
>>>
>>> Andrew
>>>
>> 
>> I interpreted "integrity" mode as meaning that any changes made by xmon should
>> be reversed. This also covers the case when a user creates some breakpoint(s)
>> in xmon, exits xmon, and then elevates the lockdown state. Upon hitting the
>> first breakpoint and (re-)entering xmon, xmon will clear all breakpoints.
>> 
>> Xmon can only take action in response to dynamic lockdown level changes when
>> xmon is invoked in some manner - if there is a better way I am all ears :)
>> 
>
> Integrity mode merely means we are aiming to prevent modifications to 
> kernel memory. IMHO leaving existing breakpoints in place is fine as 
> long as when we hit the breakpoint xmon is in read-only mode.
>
> (dja/mpe might have opinions on this)

Apologies for taking so long to get back to you.

I think ajd is right. 

I think about it like this. There are 2 transitions:

 - into integrity mode

   Here, we need to go into r/o, but do not need to clear breakpoints.
   You can still insert breakpoints in readonly mode, so clearing them
   just makes things more irritating rather than safer.

 - into confidentiality mode

   Here we need to purge breakpoints and disable xmon completely.

Kind regards,
Daniel

>
> -- 
> Andrew Donnellan              OzLabs, ADL Canberra
> ajd@linux.ibm.com             IBM Australia Limited

^ permalink raw reply

* Re: [PATCH 4/5] Powerpc/hw-breakpoint: Optimize disable path
From: Ravi Bangoria @ 2019-06-19  6:14 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Ravi Bangoria, mikey, linux-kernel, npiggin, paulus, naveen.n.rao,
	linuxppc-dev
In-Reply-To: <6e7c6054-b152-40db-c7d3-89901949460f@c-s.fr>



On 6/18/19 12:01 PM, Christophe Leroy wrote:
>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>> index f002d2ffff86..265fac9fb3a4 100644
>> --- a/arch/powerpc/kernel/process.c
>> +++ b/arch/powerpc/kernel/process.c
>> @@ -793,10 +793,22 @@ static inline int set_dabr(struct arch_hw_breakpoint *brk)
>>       return __set_dabr(dabr, dabrx);
>>   }
>>   +static int disable_dawr(void)
>> +{
>> +    if (ppc_md.set_dawr)
>> +        return ppc_md.set_dawr(0, 0);
>> +
>> +    mtspr(SPRN_DAWRX, 0);
> 
> And SPRN_DAWR ?

Setting DAWRx with 0 should be enough to disable the breakpoint.


^ permalink raw reply

* Re: [PATCH 4/5] Powerpc/hw-breakpoint: Optimize disable path
From: Ravi Bangoria @ 2019-06-19  6:02 UTC (permalink / raw)
  To: Michael Neuling
  Cc: Ravi Bangoria, linux-kernel, npiggin, paulus, naveen.n.rao,
	linuxppc-dev
In-Reply-To: <ab7e5fac2a1ea78181900f5df7411b1f51b65eb9.camel@neuling.org>



On 6/18/19 11:45 AM, Michael Neuling wrote:
> On Tue, 2019-06-18 at 09:57 +0530, Ravi Bangoria wrote:
>> Directly setting dawr and dawrx with 0 should be enough to
>> disable watchpoint. No need to reset individual bits in
>> variable and then set in hw.
> 
> This seems like a pointless optimisation to me. 
> 
> I'm all for adding more code/complexity if it buys us some performance, but I
> can't imagine this is a fast path (nor have you stated any performance
> benefits). 

This gets called from sched_switch. I expected the improvement when
we switch from monitored process to non-monitored process. With such
scenario, I tried to measure the difference in execution time of
set_dawr but I don't see any improvement. So I'll drop the patch.


^ permalink raw reply

* Re: [PATCH v2 6/6] powerpc/eeh: Refactor around eeh_probe_devices()
From: Sam Bobroff @ 2019-06-19  5:53 UTC (permalink / raw)
  To: Oliver; +Cc: Alexey Kardashevskiy, linuxppc-dev, Tyrel Datwyler
In-Reply-To: <CAOSf1CFYtmbq2ZMpAPuJiyG7gkUO=w-mp-61jW1uQCHfF3=LpQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 9571 bytes --]

On Wed, Jun 05, 2019 at 03:49:15PM +1000, Oliver wrote:
> On Tue, May 7, 2019 at 2:30 PM Sam Bobroff <sbobroff@linux.ibm.com> wrote:
> >
> > Now that EEH support for all devices (on PowerNV and pSeries) is
> > provided by the pcibios bus add device hooks, eeh_probe_devices() and
> > eeh_addr_cache_build() are redundant and can be removed.
> >
> > Move the EEH enabled message into it's own function so that it can be
> > called from multiple places.
> >
> > Note that previously on pSeries, useless EEH sysfs files were created
> > for some devices that did not have EEH support and this change
> > prevents them from being created.
> >
> > Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
> > ---
> > v2 - As it's so small, merged the enablement message patch into this one (where it's used).
> >    - Reworked enablement messages.
> >
> >  arch/powerpc/include/asm/eeh.h               |  7 ++---
> >  arch/powerpc/kernel/eeh.c                    | 27 ++++++-----------
> >  arch/powerpc/kernel/eeh_cache.c              | 32 --------------------
> >  arch/powerpc/platforms/powernv/eeh-powernv.c |  4 +--
> >  arch/powerpc/platforms/pseries/pci.c         |  3 +-
> >  5 files changed, 14 insertions(+), 59 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> > index 12baf1df134c..3994d45ae0d4 100644
> > --- a/arch/powerpc/include/asm/eeh.h
> > +++ b/arch/powerpc/include/asm/eeh.h
> > @@ -283,13 +283,12 @@ struct pci_bus *eeh_pe_bus_get(struct eeh_pe *pe);
> >
> >  struct eeh_dev *eeh_dev_init(struct pci_dn *pdn);
> >  void eeh_dev_phb_init_dynamic(struct pci_controller *phb);
> > -void eeh_probe_devices(void);
> > +void eeh_show_enabled(void);
> >  int __init eeh_ops_register(struct eeh_ops *ops);
> >  int __exit eeh_ops_unregister(const char *name);
> >  int eeh_check_failure(const volatile void __iomem *token);
> >  int eeh_dev_check_failure(struct eeh_dev *edev);
> >  void eeh_addr_cache_init(void);
> > -void eeh_addr_cache_build(void);
> >  void eeh_add_device_early(struct pci_dn *);
> >  void eeh_add_device_tree_early(struct pci_dn *);
> >  void eeh_add_device_late(struct pci_dev *);
> > @@ -333,7 +332,7 @@ static inline bool eeh_enabled(void)
> >          return false;
> >  }
> >
> > -static inline void eeh_probe_devices(void) { }
> > +static inline void eeh_show_enabled(void) { }
> >
> >  static inline void *eeh_dev_init(struct pci_dn *pdn, void *data)
> >  {
> > @@ -351,8 +350,6 @@ static inline int eeh_check_failure(const volatile void __iomem *token)
> >
> >  static inline void eeh_addr_cache_init(void) { }
> >
> > -static inline void eeh_addr_cache_build(void) { }
> > -
> >  static inline void eeh_add_device_early(struct pci_dn *pdn) { }
> >
> >  static inline void eeh_add_device_tree_early(struct pci_dn *pdn) { }
> > diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> > index 1ed80adb40a1..f905235f0307 100644
> > --- a/arch/powerpc/kernel/eeh.c
> > +++ b/arch/powerpc/kernel/eeh.c
> > @@ -163,6 +163,16 @@ static int __init eeh_setup(char *str)
> >  }
> >  __setup("eeh=", eeh_setup);
> >
> > +void eeh_show_enabled(void)
> > +{
> > +       if (eeh_has_flag(EEH_FORCE_DISABLED))
> > +               pr_info("EEH: Recovery disabled by kernel parameter.\n");
> > +       else if (eeh_has_flag(EEH_ENABLED))
> > +               pr_info("EEH: Capable adapter found: recovery enabled.\n");
> > +       else
> > +               pr_info("EEH: No capable adapters found: recovery disabled.\n");
> > +}
> > +
> >  /*
> >   * This routine captures assorted PCI configuration space data
> >   * for the indicated PCI device, and puts them into a buffer
> > @@ -1156,23 +1166,6 @@ static struct notifier_block eeh_reboot_nb = {
> >         .notifier_call = eeh_reboot_notifier,
> >  };
> >
> 
> > -void eeh_probe_devices(void)
> > -{
> > -       struct pci_controller *hose, *tmp;
> > -       struct pci_dn *pdn;
> > -
> > -       /* Enable EEH for all adapters */
> > -       list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
> > -               pdn = hose->pci_data;
> > -               traverse_pci_dn(pdn, eeh_ops->probe, NULL);
> > -       }
> > -       if (eeh_enabled())
> > -               pr_info("EEH: PCI Enhanced I/O Error Handling Enabled\n");
> > -       else
> > -               pr_info("EEH: No capable adapters found\n");
> > -
> > -}
> 
> The one concern I have about this is that PAPR requires us to enable
> EEH for the device before we do any config accesses. From PAPR:
> 
> R1–7.3.11.1–5. For the EEH option: If a device driver is going to
> enable EEH and the platform has not defaulted
> to EEH enabled, then it must do so before it does any operations with
> its IOA, including any configuration
> cycles or Load or Store operations.
> 
> So if we want to be strictly compatible we'd need to ensure the
> set-eeh-option RTAS call happens before we read the VDID in
> pci_scan_device(). The pseries eeh_probe() function does this
> currently, but if we defer it until the pcibios call happens we'll
> have done a pile of config accesses before then. Maybe it doesn't
> matter, but we'd need to do further testing under phyp or work out
> some other way to ensure it's done pre-probe.

Hmm! I had not looked at this specifically, but I don't think I've made
it any worse. The reordering is all underneath pcibios_init(), and it
changes things from this...

	for each PHB: pcibios_scan_phb() and then pci_bus_add_devices().
	pcibios_resource_survey() [calls ppc_md.pcibios_fixup()->eeh_probe_devices()]

... to this:

	for each PHB: pcibios_scan_phb()
	pcibios_resource_survey()
	for each PHB: pci_bus_add_devices()
	ppc_md.pcibios_fixup()->eeh_probe_devices()

So either way, the EEH probe (and therefore setup) happens last. Moving
the probe into pseries_pcibios_bus_add_device() actually moves it a
little earlier than before.

Just for kicks, I instrumented rtas_pci_read_config() and it shows
that there are indeed several accesses before the probe function is
called. Here's the first:

pcibios_init() ->
pcibios_scan_phb() ->
__of_scan_bus() ->
of_scan_pci_dev() ->
of_create_pci_dev() ->
set_pcie_port_type() ->
pci_find_capability()

Most of that is PowerPC specific, and there's already an EEH hack in
of_scan_pci_dev(), so I tried a quick hack to probe there and it does
seem at least boot OK. And it's before any accesses! :-)

What do you think? (Although, I think I'd prefer to leave this as follow
up work.)

> >  /**
> >   * eeh_init - EEH initialization
> >   *
> > diff --git a/arch/powerpc/kernel/eeh_cache.c b/arch/powerpc/kernel/eeh_cache.c
> > index f93dd5cf6a39..c40078d036af 100644
> > --- a/arch/powerpc/kernel/eeh_cache.c
> > +++ b/arch/powerpc/kernel/eeh_cache.c
> > @@ -278,38 +278,6 @@ void eeh_addr_cache_init(void)
> >         spin_lock_init(&pci_io_addr_cache_root.piar_lock);
> >  }
> >
> > -/**
> > - * eeh_addr_cache_build - Build a cache of I/O addresses
> > - *
> > - * Build a cache of pci i/o addresses.  This cache will be used to
> > - * find the pci device that corresponds to a given address.
> > - * This routine scans all pci busses to build the cache.
> > - * Must be run late in boot process, after the pci controllers
> > - * have been scanned for devices (after all device resources are known).
> > - */
> > -void eeh_addr_cache_build(void)
> > -{
> > -       struct pci_dn *pdn;
> > -       struct eeh_dev *edev;
> > -       struct pci_dev *dev = NULL;
> > -
> > -       for_each_pci_dev(dev) {
> > -               pdn = pci_get_pdn_by_devfn(dev->bus, dev->devfn);
> > -               if (!pdn)
> > -                       continue;
> > -
> > -               edev = pdn_to_eeh_dev(pdn);
> > -               if (!edev)
> > -                       continue;
> > -
> > -               dev->dev.archdata.edev = edev;
> > -               edev->pdev = dev;
> > -
> > -               eeh_addr_cache_insert_dev(dev);
> > -               eeh_sysfs_add_device(dev);
> > -       }
> > -}
> > -
> >  static int eeh_addr_cache_show(struct seq_file *s, void *v)
> >  {
> >         struct pci_io_addr_range *piar;
> > diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> > index 90729d908a54..22a94f4b8586 100644
> > --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> > +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> > @@ -259,9 +259,7 @@ int pnv_eeh_post_init(void)
> >         struct pnv_phb *phb;
> >         int ret = 0;
> >
> > -       /* Probe devices & build address cache */
> > -       eeh_probe_devices();
> > -       eeh_addr_cache_build();
> > +       eeh_show_enabled();
> >
> >         /* Register OPAL event notifier */
> >         eeh_event_irq = opal_event_request(ilog2(OPAL_EVENT_PCI_ERROR));
> > diff --git a/arch/powerpc/platforms/pseries/pci.c b/arch/powerpc/platforms/pseries/pci.c
> > index 37a77e57893e..d6a5f4f27507 100644
> > --- a/arch/powerpc/platforms/pseries/pci.c
> > +++ b/arch/powerpc/platforms/pseries/pci.c
> > @@ -242,8 +242,7 @@ void __init pSeries_final_fixup(void)
> >
> >         pSeries_request_regions();
> >
> > -       eeh_probe_devices();
> > -       eeh_addr_cache_build();
> > +       eeh_show_enabled();
> >
> >  #ifdef CONFIG_PCI_IOV
> >         ppc_md.pcibios_sriov_enable = pseries_pcibios_sriov_enable;
> > --
> > 2.19.0.2.gcad72f5712
> >
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH 4/7] powerpc/ftrace: Additionally nop out the preceding mflr with -mprofile-kernel
From: Michael Ellerman @ 2019-06-19  5:14 UTC (permalink / raw)
  To: Naveen N. Rao, Steven Rostedt, Masami Hiramatsu, Ingo Molnar,
	Nicholas Piggin
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <72492bc769cd6f40a536e689fc3195570d07fd5c.1560868106.git.naveen.n.rao@linux.vnet.ibm.com>

Hi Naveen,

Sorry I meant to reply to this earlier .. :/

"Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
> With -mprofile-kernel, gcc emits 'mflr r0', followed by 'bl _mcount' to
> enable function tracing and profiling. So far, with dynamic ftrace, we
> used to only patch out the branch to _mcount(). However, mflr is
> executed by the branch unit that can only execute one per cycle on
> POWER9 and shared with branches, so it would be nice to avoid it where
> possible.
>
> We cannot simply nop out the mflr either. When enabling function
> tracing, there can be a race if tracing is enabled when some thread was
> interrupted after executing a nop'ed out mflr. In this case, the thread
> would execute the now-patched-in branch to _mcount() without having
> executed the preceding mflr.
>
> To solve this, we now enable function tracing in 2 steps: patch in the
> mflr instruction, use synchronize_rcu_tasks() to ensure all existing
> threads make progress, and then patch in the branch to _mcount(). We
> override ftrace_replace_code() with a powerpc64 variant for this
> purpose.

According to the ISA we're not allowed to patch mflr at runtime. See the
section on "CMODX".

I'm also not convinced the ordering between the two patches is
guaranteed by the ISA, given that there's possibly no isync on the other
CPU.

But I haven't had time to dig into it sorry, hopefully later in the
week?

cheers

> diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
> index 517662a56bdc..5e2b29808af1 100644
> --- a/arch/powerpc/kernel/trace/ftrace.c
> +++ b/arch/powerpc/kernel/trace/ftrace.c
> @@ -125,7 +125,7 @@ __ftrace_make_nop(struct module *mod,
>  {
>  	unsigned long entry, ptr, tramp;
>  	unsigned long ip = rec->ip;
> -	unsigned int op, pop;
> +	unsigned int op;
>  
>  	/* read where this goes */
>  	if (probe_kernel_read(&op, (void *)ip, sizeof(int))) {
> @@ -160,8 +160,6 @@ __ftrace_make_nop(struct module *mod,
>  
>  #ifdef CONFIG_MPROFILE_KERNEL
>  	/* When using -mkernel_profile there is no load to jump over */
> -	pop = PPC_INST_NOP;
> -
>  	if (probe_kernel_read(&op, (void *)(ip - 4), 4)) {
>  		pr_err("Fetching instruction at %lx failed.\n", ip - 4);
>  		return -EFAULT;
> @@ -169,26 +167,23 @@ __ftrace_make_nop(struct module *mod,
>  
>  	/* We expect either a mflr r0, or a std r0, LRSAVE(r1) */
>  	if (op != PPC_INST_MFLR && op != PPC_INST_STD_LR) {
> -		pr_err("Unexpected instruction %08x around bl _mcount\n", op);
> +		pr_err("Unexpected instruction %08x before bl _mcount\n", op);
>  		return -EINVAL;
>  	}
> -#else
> -	/*
> -	 * Our original call site looks like:
> -	 *
> -	 * bl <tramp>
> -	 * ld r2,XX(r1)
> -	 *
> -	 * Milton Miller pointed out that we can not simply nop the branch.
> -	 * If a task was preempted when calling a trace function, the nops
> -	 * will remove the way to restore the TOC in r2 and the r2 TOC will
> -	 * get corrupted.
> -	 *
> -	 * Use a b +8 to jump over the load.
> -	 */
>  
> -	pop = PPC_INST_BRANCH | 8;	/* b +8 */
> +	/* We should patch out the bl to _mcount first */
> +	if (patch_instruction((unsigned int *)ip, PPC_INST_NOP)) {
> +		pr_err("Patching NOP failed.\n");
> +		return -EPERM;
> +	}
>  
> +	/* then, nop out the preceding 'mflr r0' as an optimization */
> +	if (op == PPC_INST_MFLR &&
> +		patch_instruction((unsigned int *)(ip - 4), PPC_INST_NOP)) {
> +		pr_err("Patching NOP failed.\n");
> +		return -EPERM;
> +	}
> +#else
>  	/*
>  	 * Check what is in the next instruction. We can see ld r2,40(r1), but
>  	 * on first pass after boot we will see mflr r0.
> @@ -202,12 +197,25 @@ __ftrace_make_nop(struct module *mod,
>  		pr_err("Expected %08x found %08x\n", PPC_INST_LD_TOC, op);
>  		return -EINVAL;
>  	}
> -#endif /* CONFIG_MPROFILE_KERNEL */
>  
> -	if (patch_instruction((unsigned int *)ip, pop)) {
> +	/*
> +	 * Our original call site looks like:
> +	 *
> +	 * bl <tramp>
> +	 * ld r2,XX(r1)
> +	 *
> +	 * Milton Miller pointed out that we can not simply nop the branch.
> +	 * If a task was preempted when calling a trace function, the nops
> +	 * will remove the way to restore the TOC in r2 and the r2 TOC will
> +	 * get corrupted.
> +	 *
> +	 * Use a b +8 to jump over the load.
> +	 */
> +	if (patch_instruction((unsigned int *)ip, PPC_INST_BRANCH | 8)) {
>  		pr_err("Patching NOP failed.\n");
>  		return -EPERM;
>  	}
> +#endif /* CONFIG_MPROFILE_KERNEL */
>  
>  	return 0;
>  }
> @@ -421,6 +429,26 @@ static int __ftrace_make_nop_kernel(struct dyn_ftrace *rec, unsigned long addr)
>  		return -EPERM;
>  	}
>  
> +#ifdef CONFIG_MPROFILE_KERNEL
> +	/* Nop out the preceding 'mflr r0' as an optimization */
> +	if (probe_kernel_read(&op, (void *)(ip - 4), 4)) {
> +		pr_err("Fetching instruction at %lx failed.\n", ip - 4);
> +		return -EFAULT;
> +	}
> +
> +	/* We expect either a mflr r0, or a std r0, LRSAVE(r1) */
> +	if (op != PPC_INST_MFLR && op != PPC_INST_STD_LR) {
> +		pr_err("Unexpected instruction %08x before bl _mcount\n", op);
> +		return -EINVAL;
> +	}
> +
> +	if (op == PPC_INST_MFLR &&
> +		patch_instruction((unsigned int *)(ip - 4), PPC_INST_NOP)) {
> +		pr_err("Patching NOP failed.\n");
> +		return -EPERM;
> +	}
> +#endif
> +
>  	return 0;
>  }
>  
> @@ -429,6 +457,7 @@ int ftrace_make_nop(struct module *mod,
>  {
>  	unsigned long ip = rec->ip;
>  	unsigned int old, new;
> +	int rc;
>  
>  	/*
>  	 * If the calling address is more that 24 bits away,
> @@ -439,7 +468,34 @@ int ftrace_make_nop(struct module *mod,
>  		/* within range */
>  		old = ftrace_call_replace(ip, addr, 1);
>  		new = PPC_INST_NOP;
> -		return ftrace_modify_code(ip, old, new);
> +		rc = ftrace_modify_code(ip, old, new);
> +#ifdef CONFIG_MPROFILE_KERNEL
> +		if (rc)
> +			return rc;
> +
> +		/*
> +		 * For -mprofile-kernel, we patch out the preceding 'mflr r0'
> +		 * instruction, as an optimization. It is important to nop out
> +		 * the branch to _mcount() first, as a lone 'mflr r0' is
> +		 * harmless.
> +		 */
> +		if (probe_kernel_read(&old, (void *)(ip - 4), 4)) {
> +			pr_err("Fetching instruction at %lx failed.\n", ip - 4);
> +			return -EFAULT;
> +		}
> +
> +		/* We expect either a mflr r0, or a std r0, LRSAVE(r1) */
> +		if (old != PPC_INST_MFLR && old != PPC_INST_STD_LR) {
> +			pr_err("Unexpected instruction %08x before bl _mcount\n",
> +					old);
> +			return -EINVAL;
> +		}
> +
> +		if (old == PPC_INST_MFLR)
> +			rc = patch_instruction((unsigned int *)(ip - 4),
> +					PPC_INST_NOP);
> +#endif
> +		return rc;
>  	} else if (core_kernel_text(ip))
>  		return __ftrace_make_nop_kernel(rec, addr);
>  
> @@ -567,6 +623,37 @@ __ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
>  		return -EINVAL;
>  	}
>  
> +#ifdef CONFIG_MPROFILE_KERNEL
> +	/*
> +	 * We could end up here without having called __ftrace_make_call_prep()
> +	 * if function tracing is enabled before a module is loaded.
> +	 *
> +	 * ftrace_module_enable() --> ftrace_replace_code_rec() -->
> +	 *	ftrace_make_call() --> __ftrace_make_call()
> +	 *
> +	 * In this scenario, the previous instruction will be a NOP. It is
> +	 * safe to patch it with a 'mflr r0' since we know for a fact that
> +	 * this code is not yet being run.
> +	 */
> +	ip -= MCOUNT_INSN_SIZE;
> +
> +	/* read where this goes */
> +	if (probe_kernel_read(op, ip, MCOUNT_INSN_SIZE))
> +		return -EFAULT;
> +
> +	/*
> +	 * nothing to do if this is using the older -mprofile-kernel
> +	 * instruction sequence
> +	 */
> +	if (op[0] != PPC_INST_NOP)
> +		return 0;
> +
> +	if (patch_instruction((unsigned int *)ip, PPC_INST_MFLR)) {
> +		pr_err("Patching MFLR failed.\n");
> +		return -EPERM;
> +	}
> +#endif
> +
>  	return 0;
>  }
>  
> @@ -863,6 +950,116 @@ void arch_ftrace_update_code(int command)
>  	ftrace_modify_all_code(command);
>  }
>  
> +#ifdef CONFIG_MPROFILE_KERNEL
> +/* Returns 1 if we patched in the mflr */
> +static int __ftrace_make_call_prep(struct dyn_ftrace *rec)
> +{
> +	void *ip = (void *)rec->ip - MCOUNT_INSN_SIZE;
> +	unsigned int op[2];
> +
> +	/* read where this goes */
> +	if (probe_kernel_read(op, ip, sizeof(op)))
> +		return -EFAULT;
> +
> +	if (op[1] != PPC_INST_NOP) {
> +		pr_err("Unexpected call sequence at %p: %x %x\n",
> +							ip, op[0], op[1]);
> +		return -EINVAL;
> +	}
> +
> +	/*
> +	 * nothing to do if this is using the older -mprofile-kernel
> +	 * instruction sequence
> +	 */
> +	if (op[0] != PPC_INST_NOP)
> +		return 0;
> +
> +	if (patch_instruction((unsigned int *)ip, PPC_INST_MFLR)) {
> +		pr_err("Patching MFLR failed.\n");
> +		return -EPERM;
> +	}
> +
> +	return 1;
> +}
> +
> +/*
> + * When enabling function tracing for -mprofile-kernel that uses a
> + * 2-instruction sequence of 'mflr r0, bl _mcount()', we use a 2 step process:
> + * 1. Patch in the 'mflr r0' instruction
> + * 1a. synchronize_rcu_tasks() to ensure that any threads that had executed
> + *     the earlier nop there make progress (and hence do not branch into
> + *     ftrace without executing the preceding mflr)
> + * 2. Patch in the branch to ftrace
> + */
> +void ftrace_replace_code(int mod_flags)
> +{
> +	int enable = mod_flags & FTRACE_MODIFY_ENABLE_FL;
> +	int schedulable = mod_flags & FTRACE_MODIFY_MAY_SLEEP_FL;
> +	int ret, failed, make_call = 0;
> +	struct ftrace_rec_iter *iter;
> +	struct dyn_ftrace *rec;
> +
> +	if (unlikely(!ftrace_enabled))
> +		return;
> +
> +	for_ftrace_rec_iter(iter) {
> +		rec = ftrace_rec_iter_record(iter);
> +
> +		if (rec->flags & FTRACE_FL_DISABLED)
> +			continue;
> +
> +		failed = 0;
> +		ret = ftrace_test_record(rec, enable);
> +		if (ret == FTRACE_UPDATE_MAKE_CALL) {
> +			failed = __ftrace_make_call_prep(rec);
> +			if (failed < 0) {
> +				ftrace_bug(failed, rec);
> +				return;
> +			} else if (failed == 1)
> +				make_call++;
> +		}
> +
> +		if (!failed) {
> +			/* We can patch the record directly */
> +			failed = ftrace_replace_code_rec(rec, enable);
> +			if (failed) {
> +				ftrace_bug(failed, rec);
> +				return;
> +			}
> +		}
> +
> +		if (schedulable)
> +			cond_resched();
> +	}
> +
> +	if (!make_call)
> +		/* No records needed patching a preceding mflr */
> +		return;
> +
> +	synchronize_rcu_tasks();
> +
> +	for_ftrace_rec_iter(iter) {
> +		rec = ftrace_rec_iter_record(iter);
> +
> +		if (rec->flags & FTRACE_FL_DISABLED)
> +			continue;
> +
> +		ret = ftrace_test_record(rec, enable);
> +		if (ret == FTRACE_UPDATE_MAKE_CALL)
> +			failed = ftrace_replace_code_rec(rec, enable);
> +
> +		if (failed) {
> +			ftrace_bug(failed, rec);
> +			return;
> +		}
> +
> +		if (schedulable)
> +			cond_resched();
> +	}
> +
> +}
> +#endif
> +
>  #ifdef CONFIG_PPC64
>  #define PACATOC offsetof(struct paca_struct, kernel_toc)
>  
> -- 
> 2.22.0

^ permalink raw reply

* Re: [PATCH v2 1/1] cpuidle-powernv : forced wakeup for stop states
From: Nicholas Piggin @ 2019-06-19  4:23 UTC (permalink / raw)
  To: Abhishek Goel, linux-kernel, linux-pm, linuxppc-dev
  Cc: ego, daniel.lezcano, rjw, dja
In-Reply-To: <20190617095648.18847-2-huntbag@linux.vnet.ibm.com>

Abhishek Goel's on June 17, 2019 7:56 pm:
> Currently, the cpuidle governors determine what idle state a idling CPU
> should enter into based on heuristics that depend on the idle history on
> that CPU. Given that no predictive heuristic is perfect, there are cases
> where the governor predicts a shallow idle state, hoping that the CPU will
> be busy soon. However, if no new workload is scheduled on that CPU in the
> near future, the CPU may end up in the shallow state.
> 
> This is problematic, when the predicted state in the aforementioned
> scenario is a shallow stop state on a tickless system. As we might get
> stuck into shallow states for hours, in absence of ticks or interrupts.
> 
> To address this, We forcefully wakeup the cpu by setting the
> decrementer. The decrementer is set to a value that corresponds with the
> residency of the next available state. Thus firing up a timer that will
> forcefully wakeup the cpu. Few such iterations will essentially train the
> governor to select a deeper state for that cpu, as the timer here
> corresponds to the next available cpuidle state residency. Thus, cpu will
> eventually end up in the deepest possible state.
> 
> Signed-off-by: Abhishek Goel <huntbag@linux.vnet.ibm.com>
> ---
> 
> Auto-promotion
>  v1 : started as auto promotion logic for cpuidle states in generic
> driver
>  v2 : Removed timeout_needed and rebased the code to upstream kernel
> Forced-wakeup
>  v1 : New patch with name of forced wakeup started
>  v2 : Extending the forced wakeup logic for all states. Setting the
> decrementer instead of queuing up a hrtimer to implement the logic.
> 
>  drivers/cpuidle/cpuidle-powernv.c | 38 +++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index 84b1ebe212b3..bc9ca18ae7e3 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -46,6 +46,26 @@ static struct stop_psscr_table stop_psscr_table[CPUIDLE_STATE_MAX] __read_mostly
>  static u64 default_snooze_timeout __read_mostly;
>  static bool snooze_timeout_en __read_mostly;
>  
> +static u64 forced_wakeup_timeout(struct cpuidle_device *dev,
> +				 struct cpuidle_driver *drv,
> +				 int index)
> +{
> +	int i;
> +
> +	for (i = index + 1; i < drv->state_count; i++) {
> +		struct cpuidle_state *s = &drv->states[i];
> +		struct cpuidle_state_usage *su = &dev->states_usage[i];
> +
> +		if (s->disabled || su->disable)
> +			continue;
> +
> +		return (s->target_residency + 2 * s->exit_latency) *
> +			tb_ticks_per_usec;
> +	}
> +
> +	return 0;
> +}

It would be nice to not have this kind of loop iteration in the
idle fast path. Can we add a flag or something to the idle state?

> +
>  static u64 get_snooze_timeout(struct cpuidle_device *dev,
>  			      struct cpuidle_driver *drv,
>  			      int index)
> @@ -144,8 +164,26 @@ static int stop_loop(struct cpuidle_device *dev,
>  		     struct cpuidle_driver *drv,
>  		     int index)
>  {
> +	u64 dec_expiry_tb, dec, timeout_tb, forced_wakeup;
> +
> +	dec = mfspr(SPRN_DEC);
> +	timeout_tb = forced_wakeup_timeout(dev, drv, index);
> +	forced_wakeup = 0;
> +
> +	if (timeout_tb && timeout_tb < dec) {
> +		forced_wakeup = 1;
> +		dec_expiry_tb = mftb() + dec;
> +	}

The compiler probably can't optimise away the SPR manipulations so try
to avoid them if possible.

> +
> +	if (forced_wakeup)
> +		mtspr(SPRN_DEC, timeout_tb);

This should just be put in the above 'if'.

> +
>  	power9_idle_type(stop_psscr_table[index].val,
>  			 stop_psscr_table[index].mask);
> +
> +	if (forced_wakeup)
> +		mtspr(SPRN_DEC, dec_expiry_tb - mftb());

This will sometimes go negative and result in another timer interrupt.

It also breaks irq work (which can be set here by machine check I
believe.

May need to implement some timer code to do this for you.

static void reset_dec_after_idle(void)
{
	u64 now;
        u64 *next_tb;

	if (test_irq_work_pending())
		return;
	now = mftb;
	next_tb = this_cpu_ptr(&decrementers_next_tb);

	if (now >= *next_tb)
		return;
	set_dec(*next_tb - now);
	if (test_irq_work_pending())
		set_dec(1);
}

Something vaguely like that. See timer_interrupt().

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v2 3/6] powerpc/eeh: Improve debug messages around device addition
From: Sam Bobroff @ 2019-06-19  4:27 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: oohall, linuxppc-dev, tyreld
In-Reply-To: <ef181b9d-54df-23f9-2f06-f0f4d0bd8e8a@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 6225 bytes --]

On Tue, Jun 11, 2019 at 03:47:58PM +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 07/05/2019 14:30, Sam Bobroff wrote:
> > Also remove useless comment.
> > 
> > Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
> > Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> >  arch/powerpc/kernel/eeh.c                    |  2 +-
> >  arch/powerpc/platforms/powernv/eeh-powernv.c | 14 ++++++++----
> >  arch/powerpc/platforms/pseries/eeh_pseries.c | 23 +++++++++++++++-----
> >  3 files changed, 28 insertions(+), 11 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> > index 8d3c36a1f194..b14d89547895 100644
> > --- a/arch/powerpc/kernel/eeh.c
> > +++ b/arch/powerpc/kernel/eeh.c
> > @@ -1291,7 +1291,7 @@ void eeh_add_device_late(struct pci_dev *dev)
> >  	pdn = pci_get_pdn_by_devfn(dev->bus, dev->devfn);
> >  	edev = pdn_to_eeh_dev(pdn);
> >  	if (edev->pdev == dev) {
> > -		pr_debug("EEH: Already referenced !\n");
> > +		pr_debug("EEH: Device %s already referenced!\n", pci_name(dev));
> >  		return;
> >  	}
> >  
> > diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> > index 6fc1a463b796..0e374cdba961 100644
> > --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> > +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> > @@ -50,10 +50,7 @@ void pnv_pcibios_bus_add_device(struct pci_dev *pdev)
> >  	if (!pdev->is_virtfn)
> >  		return;
> >  
> > -	/*
> > -	 * The following operations will fail if VF's sysfs files
> > -	 * aren't created or its resources aren't finalized.
> > -	 */
> > +	pr_debug("%s: EEH: Setting up device %s.\n", __func__, pci_name(pdev));
> 
> 
> dev_dbg() seems more appropriate.

Oh! It does, or even pci_debug() :-)

I'll change it if I need to do another version, otherwise I'll clean it
up later.

> >  	eeh_add_device_early(pdn);
> >  	eeh_add_device_late(pdev);
> >  	eeh_sysfs_add_device(pdev);
> > @@ -397,6 +394,10 @@ static void *pnv_eeh_probe(struct pci_dn *pdn, void *data)
> >  	int ret;
> >  	int config_addr = (pdn->busno << 8) | (pdn->devfn);
> >  
> > +	pr_debug("%s: probing %04x:%02x:%02x.%01x\n",
> > +		__func__, hose->global_number, pdn->busno,
> > +		PCI_SLOT(pdn->devfn), PCI_FUNC(pdn->devfn));
> > +
> >  	/*
> >  	 * When probing the root bridge, which doesn't have any
> >  	 * subordinate PCI devices. We don't have OF node for
> > @@ -491,6 +492,11 @@ static void *pnv_eeh_probe(struct pci_dn *pdn, void *data)
> >  	/* Save memory bars */
> >  	eeh_save_bars(edev);
> >  
> > +	pr_debug("%s: EEH enabled on %02x:%02x.%01x PHB#%x-PE#%x\n",
> > +		__func__, pdn->busno, PCI_SLOT(pdn->devfn),
> > +		PCI_FUNC(pdn->devfn), edev->pe->phb->global_number,
> > +		edev->pe->addr);
> > +
> >  	return NULL;
> >  }
> >  
> > diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
> > index 7aa50258dd42..ae06878fbdea 100644
> > --- a/arch/powerpc/platforms/pseries/eeh_pseries.c
> > +++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
> > @@ -65,6 +65,8 @@ void pseries_pcibios_bus_add_device(struct pci_dev *pdev)
> >  	if (!pdev->is_virtfn)
> >  		return;
> >  
> > +	pr_debug("%s: EEH: Setting up device %s.\n", __func__, pci_name(pdev));
> > +
> >  	pdn->device_id  =  pdev->device;
> >  	pdn->vendor_id  =  pdev->vendor;
> >  	pdn->class_code =  pdev->class;
> > @@ -251,6 +253,10 @@ static void *pseries_eeh_probe(struct pci_dn *pdn, void *data)
> >  	int enable = 0;
> >  	int ret;
> >  
> > +	pr_debug("%s: probing %04x:%02x:%02x.%01x\n",
> > +		__func__, pdn->phb->global_number, pdn->busno,
> > +		PCI_SLOT(pdn->devfn), PCI_FUNC(pdn->devfn));
> > +
> >  	/* Retrieve OF node and eeh device */
> >  	edev = pdn_to_eeh_dev(pdn);
> >  	if (!edev || edev->pe)
> > @@ -294,7 +300,12 @@ static void *pseries_eeh_probe(struct pci_dn *pdn, void *data)
> >  
> >  	/* Enable EEH on the device */
> >  	ret = eeh_ops->set_option(&pe, EEH_OPT_ENABLE);
> > -	if (!ret) {
> > +	if (ret) {
> > +		pr_debug("%s: EEH failed to enable on %02x:%02x.%01x PHB#%x-PE#%x (code %d)\n",
> > +			__func__, pdn->busno, PCI_SLOT(pdn->devfn),
> > +			PCI_FUNC(pdn->devfn), pe.phb->global_number,
> > +			pe.addr, ret);
> > +	} else {
> 
> 
> edev!=NULL here so you could do dev_dbg(&edev->pdev->dev,...) and skip
> PCI_SLOT/PCI_FUNC. Or is (edev!=NULL && edev->pdev==NULL) possible (it
> could be, just asking)?

I can see that edev will be non-NULL here, but that pr_debug() pattern
(using the PDN information to form the PCI address) is quite common
across the EEH code, so I think rather than changing a couple of
specific cases, I should do a separate cleanup patch and introduce
something like pdn_debug(pdn, "...."). What do you think?

(I don't know exactly when edev->pdev can be NULL.)

> 
> >  		/* Retrieve PE address */
> >  		edev->pe_config_addr = eeh_ops->get_pe_addr(&pe);
> >  		pe.addr = edev->pe_config_addr;
> > @@ -310,11 +321,6 @@ static void *pseries_eeh_probe(struct pci_dn *pdn, void *data)
> >  		if (enable) {
> >  			eeh_add_flag(EEH_ENABLED);
> >  			eeh_add_to_parent_pe(edev);
> > -
> > -			pr_debug("%s: EEH enabled on %02x:%02x.%01x PHB#%x-PE#%x\n",
> > -				__func__, pdn->busno, PCI_SLOT(pdn->devfn),
> > -				PCI_FUNC(pdn->devfn), pe.phb->global_number,
> > -				pe.addr);
> >  		} else if (pdn->parent && pdn_to_eeh_dev(pdn->parent) &&
> >  			   (pdn_to_eeh_dev(pdn->parent))->pe) {
> >  			/* This device doesn't support EEH, but it may have an
> > @@ -323,6 +329,11 @@ static void *pseries_eeh_probe(struct pci_dn *pdn, void *data)
> >  			edev->pe_config_addr = pdn_to_eeh_dev(pdn->parent)->pe_config_addr;
> >  			eeh_add_to_parent_pe(edev);
> >  		}
> > +		pr_debug("%s: EEH %s on %02x:%02x.%01x PHB#%x-PE#%x (code %d)\n",
> > +			__func__, (enable ? "enabled" : "unsupported"),
> > +			pdn->busno, PCI_SLOT(pdn->devfn),
> > +			PCI_FUNC(pdn->devfn), pe.phb->global_number,
> > +			pe.addr, ret);
> 
> Same here. I understand though this one is a cut-n-paste :)
> 
> 
> >  	}
> >  
> >  	/* Save memory bars */
> > 
> 
> -- 
> Alexey

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH kernel] powerpc/powernv/npu: Export symbols as GPL
From: Alexey Kardashevskiy @ 2019-06-19  4:13 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Alistair Popple, Piotr Jaroszynski,
	Reza Arbab, Christoph Hellwig

The out-of-tree NVIDIA driver has been re-licensed recently to MIT/GPL
so we can do the right thing and restrict the exported symbols to GPL
only.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/npu-dma.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index dc23d9d2a7d9..459d9e728003 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -58,7 +58,7 @@ struct pci_dev *pnv_pci_get_gpu_dev(struct pci_dev *npdev)
 
 	return gpdev;
 }
-EXPORT_SYMBOL(pnv_pci_get_gpu_dev);
+EXPORT_SYMBOL_GPL(pnv_pci_get_gpu_dev);
 
 /* Given the real PCI device get a linked NPU device. */
 struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index)
@@ -1052,7 +1052,7 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev,
 
 	return npu_context;
 }
-EXPORT_SYMBOL(pnv_npu2_init_context);
+EXPORT_SYMBOL_GPL(pnv_npu2_init_context);
 
 static void pnv_npu2_release_context(struct kref *kref)
 {
@@ -1107,7 +1107,7 @@ void pnv_npu2_destroy_context(struct npu_context *npu_context,
 	}
 
 }
-EXPORT_SYMBOL(pnv_npu2_destroy_context);
+EXPORT_SYMBOL_GPL(pnv_npu2_destroy_context);
 
 /*
  * Assumes mmap_sem is held for the contexts associated mm.
@@ -1149,7 +1149,7 @@ int pnv_npu2_handle_fault(struct npu_context *context, uintptr_t *ea,
 
 	return result;
 }
-EXPORT_SYMBOL(pnv_npu2_handle_fault);
+EXPORT_SYMBOL_GPL(pnv_npu2_handle_fault);
 
 int pnv_npu2_init(struct pci_controller *hose)
 {
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH kernel 1/2] powerpc/pseries/dma: Allow swiotlb
From: Thiago Jung Bauermann @ 2019-06-19  4:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Alistair Popple, linuxppc-dev, David Gibson
In-Reply-To: <bf9f80d8-0179-5330-531a-a243d5c358d4@ozlabs.ru>


Alexey Kardashevskiy <aik@ozlabs.ru> writes:

> On 11/05/2019 08:36, Thiago Jung Bauermann wrote:
>>
>> Alexey Kardashevskiy <aik@ozlabs.ru> writes:
>>
>>> The commit 8617a5c5bc00 ("powerpc/dma: handle iommu bypass in
>>> dma_iommu_ops") merged direct DMA ops into the IOMMU DMA ops allowing
>>> SWIOTLB as well but only for mapping; the unmapping and bouncing parts
>>> were left unmodified.
>>>
>>> This adds missing direct unmapping calls to .unmap_page() and .unmap_sg().
>>>
>>> This adds missing sync callbacks and directs them to the direct DMA hooks.
>>>
>>> Fixes: 8617a5c5bc00 (powerpc/dma: handle iommu bypass in dma_iommu_ops)
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>
>> Nice! Thanks for working on this. I have the patch at the end of this
>> email to get virtio-scsi-pci and virtio-blk-pci working in a secure
>> guest.
>
> I saw the set_pci_dma_ops(NULL) patch but could not figure out how pass
> NULL there sets the DMA ops to direct.

That causes pcibios_setup_device() to call set_dma_ops(&dev->dev, NULL),
which in turn causes dma_is_direct(get_dma_ops(dev)) to return true.

>> I applied your patch and reverted my patch and unfortunately the guest
>> hangs right after mounting the disk:
>
> Have you applied it on upstream kernel? I cannot see how it affects
> current guests as it is...

I applied it on a branch containing both Claudio Carvalho's "kvmppc:
Paravirtualize KVM to support ultravisor" series as well as my "Secure
Virtual Machine Enablement" patch series.

https://lore.kernel.org/linuxppc-dev/20190518142524.28528-1-cclaudio@linux.ibm.com/
https://lore.kernel.org/linuxppc-dev/20190521044912.1375-1-bauerman@linux.ibm.com/

--
Thiago Jung Bauermann
IBM Linux Technology Center


^ permalink raw reply

* Re: [PATCH 1/3] powerpc/64: __ioremap_at clean up in the error case
From: Nicholas Piggin @ 2019-06-19  4:04 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev
In-Reply-To: <5d358284-c31c-5e01-240a-54b3491a8915@c-s.fr>

Christophe Leroy's on June 11, 2019 4:28 pm:
> 
> 
> Le 10/06/2019 à 05:08, Nicholas Piggin a écrit :
>> __ioremap_at error handling is wonky, it requires caller to clean up
>> after it. Implement a helper that does the map and error cleanup and
>> remove the requirement from the caller.
>> 
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>> 
>> This series is a different approach to the problem, using the generic
>> ioremap_page_range directly which reduces added code, and moves
>> the radix specific code into radix files. Thanks to Christophe for
>> pointing out various problems with the previous patch.
>> 
>>   arch/powerpc/mm/pgtable_64.c | 27 ++++++++++++++++++++-------
>>   1 file changed, 20 insertions(+), 7 deletions(-)
>> 
>> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
>> index d2d976ff8a0e..6bd3660388aa 100644
>> --- a/arch/powerpc/mm/pgtable_64.c
>> +++ b/arch/powerpc/mm/pgtable_64.c
>> @@ -108,14 +108,30 @@ unsigned long ioremap_bot;
>>   unsigned long ioremap_bot = IOREMAP_BASE;
>>   #endif
>>   
>> +static int ioremap_range(unsigned long ea, phys_addr_t pa, unsigned long size, pgprot_t prot, int nid)
>> +{
>> +	unsigned long i;
>> +
>> +	for (i = 0; i < size; i += PAGE_SIZE) {
>> +		int err = map_kernel_page(ea + i, pa + i, prot);
> 
> Missing a blank line
> 
>> +		if (err) {
> 
> I'd have done the following to reduce indentation depth
> 
> 		if (!err)
> 			continue

I'll consider it, line lengths were not too bad.

>> +			if (slab_is_available())
>> +				unmap_kernel_range(ea, size);
> 
> Shouldn't it be unmap_kernel_range(ea, i) ?

I guess (i - PAGE_SIZE really), although the old code effectively did
the full range. As a "clean up" it may be better to avoid subtle
change in behaviour and do that in another patch?

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH 2/3] powerpc/64s/radix: ioremap use ioremap_page_range
From: Nicholas Piggin @ 2019-06-19  3:59 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev
In-Reply-To: <1a9a36aa-f2bb-1ce8-78d5-ddf24e336078@c-s.fr>

Christophe Leroy's on June 11, 2019 4:46 pm:
> 
> 
> Le 10/06/2019 à 05:08, Nicholas Piggin a écrit :
>> Radix can use ioremap_page_range for ioremap, after slab is available.
>> This makes it possible to enable huge ioremap mapping support.
>> 
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>>   arch/powerpc/include/asm/book3s/64/radix.h |  3 +++
>>   arch/powerpc/mm/book3s64/pgtable.c         | 21 +++++++++++++++++++++
>>   arch/powerpc/mm/book3s64/radix_pgtable.c   | 21 +++++++++++++++++++++
>>   arch/powerpc/mm/pgtable_64.c               |  2 +-
>>   4 files changed, 46 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
>> index 574eca33f893..e04a839cb5b9 100644
>> --- a/arch/powerpc/include/asm/book3s/64/radix.h
>> +++ b/arch/powerpc/include/asm/book3s/64/radix.h
>> @@ -266,6 +266,9 @@ extern void radix__vmemmap_remove_mapping(unsigned long start,
>>   extern int radix__map_kernel_page(unsigned long ea, unsigned long pa,
>>   				 pgprot_t flags, unsigned int psz);
>>   
>> +extern int radix__ioremap_range(unsigned long ea, phys_addr_t pa,
>> +				unsigned long size, pgprot_t prot, int nid);
>> +
> 
> 'extern' is pointless here, and checkpatch will cry.
> 
>>   static inline unsigned long radix__get_tree_size(void)
>>   {
>>   	unsigned long rts_field;
>> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
>> index ff98b663c83e..953850a602f7 100644
>> --- a/arch/powerpc/mm/book3s64/pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/pgtable.c
>> @@ -450,3 +450,24 @@ int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
>>   
>>   	return true;
>>   }
>> +
>> +int ioremap_range(unsigned long ea, phys_addr_t pa, unsigned long size, pgprot_t prot, int nid)
>> +{
>> +	unsigned long i;
>> +
>> +	if (radix_enabled())
>> +		return radix__ioremap_range(ea, pa, size, prot, nid);
> 
> This function looks pretty similar to the one in the previous patch.
> Since radix_enabled() is available and return false for all other 
> subarches, I think the above could go in the generic ioremap_range(), 
> you'll only need to move the function declaration in a common file, for 
> instance asm/io.h
> 
>> +
>> +	for (i = 0; i < size; i += PAGE_SIZE) {
>> +		int err = map_kernel_page(ea + i, pa + i, prot);
>> +		if (err) {
>> +			if (slab_is_available())
>> +				unmap_kernel_range(ea, size);
>> +			else
>> +				WARN_ON_ONCE(1); /* Should clean up */
>> +			return err;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
>> index c9bcf428dd2b..db993bc1aef3 100644
>> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
>> @@ -11,6 +11,7 @@
>>   
>>   #define pr_fmt(fmt) "radix-mmu: " fmt
>>   
>> +#include <linux/io.h>
>>   #include <linux/kernel.h>
>>   #include <linux/sched/mm.h>
>>   #include <linux/memblock.h>
>> @@ -1122,3 +1123,23 @@ void radix__ptep_modify_prot_commit(struct vm_area_struct *vma,
>>   
>>   	set_pte_at(mm, addr, ptep, pte);
>>   }
>> +
>> +int radix__ioremap_range(unsigned long ea, phys_addr_t pa, unsigned long size,
>> +			pgprot_t prot, int nid)
>> +{
>> +	if (likely(slab_is_available())) {
>> +		int err = ioremap_page_range(ea, ea + size, pa, prot);
>> +		if (err)
>> +			unmap_kernel_range(ea, size);
>> +		return err;
>> +	} else {
>> +		unsigned long i;
>> +
>> +		for (i = 0; i < size; i += PAGE_SIZE) {
>> +			int err = map_kernel_page(ea + i, pa + i, prot);
>> +			if (WARN_ON_ONCE(err)) /* Should clean up */
>> +				return err;
>> +		}
> 
> Same loop again.
> 
> What about not doing a radix specific function and just putting 
> something like below in the core ioremap_range() function ?
> 
> 	if (likely(slab_is_available()) && radix_enabled()) {
> 		int err = ioremap_page_range(ea, ea + size, pa, prot);
> 
> 		if (err)
> 			unmap_kernel_range(ea, size);
> 		return err;
> 	}
> 
> Because I'm pretty sure will more and more use ioremap_page_range().

Well I agree the duplication is not so nice, but it's convenient
to see what is going on for each MMU type.

There is a significant amount of churn that needs to be done in
this layer so I prefer to make it a bit simpler despite duplication.

I would like to remove the early ioremap or make it into its own
function. Re-implement map_kernel_page with ioremap_page_range,
allow page tables that don't use slab to avoid the early check,
unbolt the hptes mapped in early boot, etc.

I just wanted to escape out the 64s and hash/radix implementations
completely until that settles.

>> -static int ioremap_range(unsigned long ea, phys_addr_t pa, unsigned long size, pgprot_t prot, int nid)
>> +int __weak ioremap_range(unsigned long ea, phys_addr_t pa, unsigned long size, pgprot_t prot, int nid)
> 
> Hum. Weak functions remain in unused in vmlinux unless 
> CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is selected.
> 
> Also, they are some how dangerous because people might change them 
> without seeing that it is overridden for some particular configuration.

Well you shouldn't assume that when you see a weak function, but
what's the preferred alternative? A config option?

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH 1/4] mm: Move ioremap page table mapping function to mm/
From: Nicholas Piggin @ 2019-06-19  3:43 UTC (permalink / raw)
  To: Christophe Leroy, linux-mm; +Cc: linuxppc-dev, linux-arm-kernel
In-Reply-To: <86991f76-2101-8087-37db-d939d5d744fa@c-s.fr>

Christophe Leroy's on June 11, 2019 3:24 pm:
> 
> 
> Le 10/06/2019 à 06:38, Nicholas Piggin a écrit :
>> ioremap_page_range is a generic function to create a kernel virtual
>> mapping, move it to mm/vmalloc.c and rename it vmap_range.
>> 
>> For clarity with this move, also:
>> - Rename vunmap_page_range (vmap_range's inverse) to vunmap_range.
>> - Rename vmap_page_range (which takes a page array) to vmap_pages.
> 
> Maybe it would be easier to follow the change if the name change was 
> done in another patch than the move.

I could do that.

>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>> 
>> Fixed up the arm64 compile errors, fixed a few bugs, and tidied
>> things up a bit more.
>> 
>> Have tested powerpc and x86 but not arm64, would appreciate a review
>> and test of the arm64 patch if possible.
>> 
>>   include/linux/vmalloc.h |   3 +
>>   lib/ioremap.c           | 173 +++---------------------------
>>   mm/vmalloc.c            | 228 ++++++++++++++++++++++++++++++++++++----
>>   3 files changed, 229 insertions(+), 175 deletions(-)
>> 
>> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
>> index 51e131245379..812bea5866d6 100644
>> --- a/include/linux/vmalloc.h
>> +++ b/include/linux/vmalloc.h
>> @@ -147,6 +147,9 @@ extern struct vm_struct *find_vm_area(const void *addr);
>>   extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
>>   			struct page **pages);
>>   #ifdef CONFIG_MMU
>> +extern int vmap_range(unsigned long addr,
>> +		       unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
>> +		       unsigned int max_page_shift);
> 
> Drop extern keyword here.

I don't know if I was going crazy but at one point I was getting
duplicate symbol errors that were fixed by adding extern somewhere.
Maybe sleep depravation. However...

> As checkpatch tells you, 'CHECK:AVOID_EXTERNS: extern prototypes should 
> be avoided in .h files'

I prefer to follow existing style in surrounding code at the expense
of some checkpatch warnings. If somebody later wants to "fix" it
that's fine.

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH 4/4] mm/vmalloc: Hugepage vmalloc mappings
From: Nicholas Piggin @ 2019-06-19  3:39 UTC (permalink / raw)
  To: Christophe Leroy, linux-mm, Russell Currey; +Cc: linuxppc-dev, linux-arm-kernel
In-Reply-To: <b79bf11d-43c7-88c9-8395-239625a1bdcf@c-s.fr>

Christophe Leroy's on June 11, 2019 3:39 pm:
> 
> 
> Le 10/06/2019 à 06:38, Nicholas Piggin a écrit :
>> For platforms that define HAVE_ARCH_HUGE_VMAP, have vmap allow vmalloc to
>> allocate huge pages and map them
> 
> Will this be compatible with Russell's series 
> https://patchwork.ozlabs.org/patch/1099857/ for the implementation of 
> STRICT_MODULE_RWX ?
> I see that apply_to_page_range() have things like BUG_ON(pud_huge(*pud));
> 
> Might also be an issue for arm64 as I think Russell's implementation 
> comes from there.

Yeah you're right (and correct about arm64 problem). I'll fix that up.

>> +static int vmap_hpages_range(unsigned long start, unsigned long end,
>> +			   pgprot_t prot, struct page **pages,
>> +			   unsigned int page_shift)
>> +{
>> +	BUG_ON(page_shift != PAGE_SIZE);
> 
> Do we really need a BUG_ON() there ? What happens if this condition is 
> true ?

If it's true then vmap_pages_range would die horribly reading off the
end of the pages array thinking they are struct page pointers.

I guess it could return failure.

>> +	return vmap_pages_range(start, end, prot, pages);
>> +}
>> +#endif
>> +
>> +
>>   int is_vmalloc_or_module_addr(const void *x)
>>   {
>>   	/*
>> @@ -462,7 +498,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>   {
>>   	unsigned long addr = (unsigned long) vmalloc_addr;
>>   	struct page *page = NULL;
>> -	pgd_t *pgd = pgd_offset_k(addr);
>> +	pgd_t *pgd;
>>   	p4d_t *p4d;
>>   	pud_t *pud;
>>   	pmd_t *pmd;
>> @@ -474,27 +510,38 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>   	 */
>>   	VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));
>>   
>> +	pgd = pgd_offset_k(addr);
>>   	if (pgd_none(*pgd))
>>   		return NULL;
>> +
>>   	p4d = p4d_offset(pgd, addr);
>>   	if (p4d_none(*p4d))
>>   		return NULL;
>> -	pud = pud_offset(p4d, addr);
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> 
> Do we really need that ifdef ? Won't p4d_large() always return 0 when is 
> not set ?
> Otherwise, could we use IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP) instead ?
> 
> Same several places below.

Possibly some of them are not defined without HAVE_ARCH_HUGE_VMAP
I think. I'll try to apply this pattern as much as possible.

>> @@ -2541,14 +2590,17 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>>   				 pgprot_t prot, int node)
>>   {
>>   	struct page **pages;
>> +	unsigned long addr = (unsigned long)area->addr;
>> +	unsigned long size = get_vm_area_size(area);
>> +	unsigned int page_shift = area->page_shift;
>> +	unsigned int shift = page_shift + PAGE_SHIFT;
>>   	unsigned int nr_pages, array_size, i;
>>   	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
>>   	const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
>>   	const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ?
>> -					0 :
>> -					__GFP_HIGHMEM;
>> +					0 : __GFP_HIGHMEM;
> 
> This patch is already quite big, shouldn't this kind of unrelated 
> cleanups be in another patch ?

Okay, 2 against 1. I'll minimise changes like this.

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH 4/4] mm/vmalloc: Hugepage vmalloc mappings
From: Nicholas Piggin @ 2019-06-19  3:33 UTC (permalink / raw)
  To: Anshuman Khandual, Mark Rutland; +Cc: linux-mm, linuxppc-dev, linux-arm-kernel
In-Reply-To: <a1747247-f4f6-ea9a-149c-07c7eb9193d8@arm.com>

Anshuman Khandual's on June 11, 2019 4:17 pm:
> 
> 
> On 06/10/2019 08:14 PM, Nicholas Piggin wrote:
>> Mark Rutland's on June 11, 2019 12:10 am:
>>> Hi,
>>>
>>> On Mon, Jun 10, 2019 at 02:38:38PM +1000, Nicholas Piggin wrote:
>>>> For platforms that define HAVE_ARCH_HUGE_VMAP, have vmap allow vmalloc to
>>>> allocate huge pages and map them
>>>>
>>>> This brings dTLB misses for linux kernel tree `git diff` from 45,000 to
>>>> 8,000 on a Kaby Lake KVM guest with 8MB dentry hash and mitigations=off
>>>> (performance is in the noise, under 1% difference, page tables are likely
>>>> to be well cached for this workload). Similar numbers are seen on POWER9.
>>>
>>> Do you happen to know which vmalloc mappings these get used for in the
>>> above case? Where do we see vmalloc mappings that large?
>> 
>> Large module vmalloc could be subject to huge mappings.
>> 
>>> I'm worried as to how this would interact with the set_memory_*()
>>> functions, as on arm64 those can only operate on page-granular mappings.
>>> Those may need fixing up to handle huge mappings; certainly if the above
>>> is all for modules.
>> 
>> Good point, that looks like it would break on arm64 at least. I'll
>> work on it. We may have to make this opt in beyond HUGE_VMAP.
> 
> This is another reason we might need to have an arch opt-ins like the one
> I mentioned before.
> 

Let's try to get the precursor stuff like page table functions and
vmalloc_to_page in this merge window, and then concentrate on the
huge vmalloc support issues after that.

Christophe points out that powerpc is likely to have a similar 
problem which I didn't realise, so I'll re think it.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH 4/4] mm/vmalloc: Hugepage vmalloc mappings
From: Nicholas Piggin @ 2019-06-19  3:29 UTC (permalink / raw)
  To: Anshuman Khandual, linux-mm
  Cc: linuxppc-dev, linux-arm-kernel, Ard Biesheuvel
In-Reply-To: <2bd573d5-84ab-4b27-2126-863681ca3ef4@arm.com>

Anshuman Khandual's on June 11, 2019 4:59 pm:
> On 06/11/2019 05:46 AM, Nicholas Piggin wrote:
>> Anshuman Khandual's on June 10, 2019 6:53 pm:
>>> On 06/10/2019 10:08 AM, Nicholas Piggin wrote:
>>>> For platforms that define HAVE_ARCH_HUGE_VMAP, have vmap allow vmalloc to
>>>> allocate huge pages and map them.
>>>
>>> IIUC that extends HAVE_ARCH_HUGE_VMAP from iormap to vmalloc. 
>>>
>>>>
>>>> This brings dTLB misses for linux kernel tree `git diff` from 45,000 to
>>>> 8,000 on a Kaby Lake KVM guest with 8MB dentry hash and mitigations=off
>>>> (performance is in the noise, under 1% difference, page tables are likely
>>>> to be well cached for this workload). Similar numbers are seen on POWER9.
>>>
>>> Sure will try this on arm64.
>>>
>>>>
>>>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>>>> ---
>>>>  include/asm-generic/4level-fixup.h |   1 +
>>>>  include/asm-generic/5level-fixup.h |   1 +
>>>>  include/linux/vmalloc.h            |   1 +
>>>>  mm/vmalloc.c                       | 132 +++++++++++++++++++++++------
>>>>  4 files changed, 107 insertions(+), 28 deletions(-)
>>>>
>>>> diff --git a/include/asm-generic/4level-fixup.h b/include/asm-generic/4level-fixup.h
>>>> index e3667c9a33a5..3cc65a4dd093 100644
>>>> --- a/include/asm-generic/4level-fixup.h
>>>> +++ b/include/asm-generic/4level-fixup.h
>>>> @@ -20,6 +20,7 @@
>>>>  #define pud_none(pud)			0
>>>>  #define pud_bad(pud)			0
>>>>  #define pud_present(pud)		1
>>>> +#define pud_large(pud)			0
>>>>  #define pud_ERROR(pud)			do { } while (0)
>>>>  #define pud_clear(pud)			pgd_clear(pud)
>>>>  #define pud_val(pud)			pgd_val(pud)
>>>> diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h
>>>> index bb6cb347018c..c4377db09a4f 100644
>>>> --- a/include/asm-generic/5level-fixup.h
>>>> +++ b/include/asm-generic/5level-fixup.h
>>>> @@ -22,6 +22,7 @@
>>>>  #define p4d_none(p4d)			0
>>>>  #define p4d_bad(p4d)			0
>>>>  #define p4d_present(p4d)		1
>>>> +#define p4d_large(p4d)			0
>>>>  #define p4d_ERROR(p4d)			do { } while (0)
>>>>  #define p4d_clear(p4d)			pgd_clear(p4d)
>>>>  #define p4d_val(p4d)			pgd_val(p4d)
>>>
>>> Both of these are required from vmalloc_to_page() which as per a later comment
>>> should be part of a prerequisite patch before this series.
>> 
>> I'm not sure what you mean. This patch is where they get used.
> 
> In case you move out vmalloc_to_page() changes to a separate patch.

Sorry for the delay in reply.

I'll split this and see if we might be able to get it into next
merge window. I can have another try at the huge vmalloc patch
after that.

> 
>> 
>> Possibly I could split this and the vmalloc_to_page change out. I'll
>> consider it.
>> 
>>>> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
>>>> index 812bea5866d6..4c92dc608928 100644
>>>> --- a/include/linux/vmalloc.h
>>>> +++ b/include/linux/vmalloc.h
>>>> @@ -42,6 +42,7 @@ struct vm_struct {
>>>>  	unsigned long		size;
>>>>  	unsigned long		flags;
>>>>  	struct page		**pages;
>>>> +	unsigned int		page_shift;
>>>
>>> So the entire vm_struct will be mapped with a single page_shift. It cannot have
>>> mix and match mappings with PAGE_SIZE, PMD_SIZE, PUD_SIZE etc in case the
>>> allocation fails for larger ones, falling back etc what over other reasons.
>> 
>> For now, yes. I have a bit of follow up work to improve that and make
>> it able to fall back, but it's a bit more churn and not a significant
>> benefit just yet because there are not a lot of very large vmallocs
>> (except the early hashes which can be satisfied with large allocs).
> 
> Right but it will make this new feature complete like ioremap which logically
> supports till P4D (though AFAICT not used). If there are no actual vmalloc
> requests that large it is fine. Allocation attempts will start from the page
> table level depending on the requested size. It is better to have PUD/P4D
> considerations now rather than trying to after fit it later.

I've considered them, which is why e.g., a shift gets passed around 
rather than a bool for small/large.

I won't over complicate this page array data structure for something
that may never be supported though. I think we may actually be better
moving away from it in the vmalloc code and just referencing pages
from the page tables, so it's just something we can cross when we get
to it.

>>> Also should not we check for the alignment of the range [start...end] with
>>> respect to (1UL << [PAGE_SHIFT + page_shift]).
>> 
>> The caller should if it specifies large page. Could check and -EINVAL
>> for incorrect alignment.
> 
> That might be a good check here.

Will add.

>>>> @@ -474,27 +510,38 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>>>  	 */
>>>>  	VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));
>>>>  
>>>> +	pgd = pgd_offset_k(addr);
>>>>  	if (pgd_none(*pgd))
>>>>  		return NULL;
>>>> +
>>>
>>> Small nit. Stray line here.
>>>
>>> 'pgd' related changes here seem to be just cleanups and should not part of this patch.
>> 
>> Yeah I figure it doesn't matter to make small changes close by, but
>> maybe that's more frowned upon now for git blame?
> 
> Right. But I guess it should be okay if you can make vmalloc_to_page()
> changes as a separate patch. This patch which adds a new feature should
> not have any clean ups IMHO.

Well... that alone would be a new feature too. Or could be considered
a bug fix, which makes it even more important not to contain
superfluous changes.

Is there a real prohibition on small slightly peripheral tidying
like this? I don't think I'd bother sending a lone patch just to
change a couple lines of spacing.

>>>>  	p4d = p4d_offset(pgd, addr);
>>>>  	if (p4d_none(*p4d))
>>>>  		return NULL;
>>>> -	pud = pud_offset(p4d, addr);
>>>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>>>> +	if (p4d_large(*p4d))
>>>> +		return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
>>>> +#endif
>>>> +	if (WARN_ON_ONCE(p4d_bad(*p4d)))
>>>> +		return NULL;
>>>>  
>>>> -	/*
>>>> -	 * Don't dereference bad PUD or PMD (below) entries. This will also
>>>> -	 * identify huge mappings, which we may encounter on architectures
>>>> -	 * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
>>>> -	 * identified as vmalloc addresses by is_vmalloc_addr(), but are
>>>> -	 * not [unambiguously] associated with a struct page, so there is
>>>> -	 * no correct value to return for them.
>>>> -	 */
>>>
>>> What changed the situation so that we could return struct page for a huge
>>> mapping now ?
>> 
>> For the PUD case? Nothing changed, per se, we I just calculate the
>> correct struct page now, so I may return it.
> 
> I was just curious what prevented this earlier (before this series). The
> comment here and commit message which added this change making me wonder
> what was the reason for not doing this earlier.  

Just not implemented I guess.

>>> AFAICT even after this patch, PUD/P4D level huge pages can only
>>> be created with ioremap_page_range() not with vmalloc() which creates PMD
>>> sized mappings only. Hence if it's okay to dereference struct page of a huge
>>> mapping (not withstanding the comment here) it should be part of an earlier
>>> patch fixing it first for existing ioremap_page_range() huge mappings.
>> 
>> Possibly yes, we can consider 029c54b095995 to be a band-aid for huge
>> vmaps which is fixed properly by this change, in which case it could
>> make sense to break this into its own patch.
> 
> On arm64 [pud|pmd]_bad() calls out huge mappings at PUD or PMD. I still wonder what
> Ard (copied him now) meant by "not [unambiguously] associated with a struct page".
> He also mentioned about compound pages in the commit message. Anyways these makes
> sense (fetching the struct page) unless I am missing something. But should be part
> of a separate patch.

I do somewhat see the intention of the commit message, but if we
consider the vmap/iomap layer's choice of page size as transparent to
the caller, and the vmalloc_to_page API has always been specifically
interested in the PAGE_SIZE struct page, then my patch is fine and
introduces no problems. It restores the API functionality to be the
same regardless of whether small or large pages were used for the
actual mapping.

>>>> +	if (IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP)) {
>>>> +		unsigned long size_per_node;
>>>> +
>>>> +		size_per_node = size;
>>>> +		if (node == NUMA_NO_NODE)
>>>> +			size_per_node /= num_online_nodes();
>>>> +		if (size_per_node >= PMD_SIZE)
>>>> +			shift = PMD_SHIFT;
>>>
>>> There are two problems here.
>>>
>>> 1. Should not size_per_node be aligned with PMD_SIZE to avoid wasting memory later
>>>    because of alignment upwards (making it worse for NUMA_NO_NODE)
>> 
>> I'm not sure what you mean, it's just a heuristic to check for node
>> interleaving, and use small pages if large can not interleave well.
>> 
>>> 2. What about PUD_SIZE which is not considered here at all
>> 
>> Yeah, not doing PUD pages at all. It would be pretty trivial to add 
>> after PMD is working, but would it actually get used anywhere?
> 
> But it should make this feature logically complete. Allocation attempts can start
> at right pgtable level depending on the requested size. I dont think it will have
> any performance impact or something.

I disagree that's necessary or desirable for PMD support here. Sure
an arch might have PUD size within MAX_ORDER and implement that, but
it's just something that can be implemented when the time comes.

There's nothing about this patch that hinders being extendedto PUD
level I just won't add code that's not used and I can't test.

Thanks for the detailed review, I appreciate it.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH kernel] powerpc/pci/of: Parse unassigned resources
From: Alexey Kardashevskiy @ 2019-06-19  1:20 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev
  Cc: Sam Bobroff, Alistair Popple, kvm-ppc, Oliver O'Halloran,
	David Gibson
In-Reply-To: <87sgs7ozsa.fsf@concordia.ellerman.id.au>



On 18/06/2019 22:15, Michael Ellerman wrote:
> Alexey Kardashevskiy <aik@ozlabs.ru> writes:
>> The pseries platform uses the PCI_PROBE_DEVTREE method of PCI probing
>> which is basically reading "assigned-addresses" of every PCI device.
>> However if the property is missing or zero sized, then there is
>> no fallback of any kind and the PCI resources remain undiscovered, i.e.
>> pdev->resource[] array is empty.
>>
>> This adds a fallback which parses the "reg" property in pretty much same
>> way except it marks resources as "unset" which later makes Linux assign
>> those resources with proper addresses.
> 
> What happens under PowerVM is the big question.
> 
> ie. if we see such a device under PowerVM and then do our own assignment
> what happens?

I'd be surprised not to see at least one "assigned-addresses" under
powervm, and a single assigned bar will do the old behavior.

I guess I could make it depend on "linux,pci-probe-only" (which I will
need for this to work anyway), if that helps, should I?


-- 
Alexey

^ permalink raw reply

* Re: [PATCH v5 2/2] powerpc: Fix compile issue with force DAWR
From: Michael Neuling @ 2019-06-19  1:11 UTC (permalink / raw)
  To: Christophe Leroy, mpe; +Cc: Mathieu Malaterre, hch, linuxppc-dev
In-Reply-To: <e20b2d44-508c-7d06-1af8-b608563b8c57@c-s.fr>

On Tue, 2019-06-18 at 18:28 +0200, Christophe Leroy wrote:
> 
> Le 04/06/2019 à 05:00, Michael Neuling a écrit :
> > If you compile with KVM but without CONFIG_HAVE_HW_BREAKPOINT you fail
> > at linking with:
> >    arch/powerpc/kvm/book3s_hv_rmhandlers.o:(.text+0x708): undefined
> > reference to `dawr_force_enable'
> > 
> > This was caused by commit c1fe190c0672 ("powerpc: Add force enable of
> > DAWR on P9 option").
> > 
> > This moves a bunch of code around to fix this. It moves a lot of the
> > DAWR code in a new file and creates a new CONFIG_PPC_DAWR to enable
> > compiling it.
> 
> After looking at all this once more, I'm just wondering: why are we 
> creating stuff specific to DAWR ?
> 
> In the old days, we only add DABR, and everything was named on DABR.
> When DAWR was introduced some years ago we renamed stuff like do_dabr() 
> to do_break() so that we could regroup things together. And now we are 
> taking dawr() out of the rest. Why not keep dabr() stuff and dawr() 
> stuff all together in something dedicated to breakpoints, and try to 
> regroup all breakpoint stuff in a single place ? I see some 
> breakpointing stuff done in kernel/process.c and other things done in 
> hw_breakpoint.c, to common functions call from one file to the other, 
> preventing GCC to fully optimise, etc ...
> 
> Also, behing this thinking, I have the idea that we could easily 
> implement 512 bytes breakpoints on the 8xx too. The 8xx have neither 
> DABR nor DAWR, but is using a set of comparators. And as you can see in 
> the 8xx version of __set_dabr() in kernel/process.c, we emulate the DABR 
> behaviour by setting two comparators. By using the same comparators with 
> a different setup, we should be able to implement breakpoints on larger 
> ranges of address.

Christophe

I agree that their are opportunities to refactor this code and I appreciate your
efforts in making this code better but... 

We have a problem here of not being able to compile an odd ball case that almost
no one ever hits (it was just an odd mpe CI case). We're up to v5 of a simple
fix which is just silly. 

So let's get this fix in and move on to the whole bunch of refactoring we can do
in this code which is already documented in the github issue tracking.

Regards,
Mikey


^ permalink raw reply

* Re: [PATCH 0/1] PPC32: fix ptrace() access to FPU registers
From: Daniel Axtens @ 2019-06-19  0:36 UTC (permalink / raw)
  To: Andreas Schwab, Radu Rendec; +Cc: Paul Mackerras, linuxppc-dev, Oleg Nesterov
In-Reply-To: <875zp2rcip.fsf@igel.home>

Andreas Schwab <schwab@linux-m68k.org> writes:

> On Jun 18 2019, Radu Rendec <radu.rendec@gmail.com> wrote:
>
>> Since you already have a working setup, it would be nice if you could
>> add a printk to arch_ptrace() to print the address and confirm what I
>> believe happens (by reading the gdb source code).
>
> A ppc32 ptrace syscall goes through compat_arch_ptrace.


Ah right, and that (in ptrace32.c) contains code that will work:


			/*
			 * the user space code considers the floating point
			 * to be an array of unsigned int (32 bits) - the
			 * index passed in is based on this assumption.
			 */
			tmp = ((unsigned int *)child->thread.fp_state.fpr)
				[FPRINDEX(index)];

FPRINDEX is defined above to deal with the various manipulations you
need to do.

Radu: I think we want to copy that working code back into ptrace.c. The
challenge will be unpicking the awful mess of ifdefs in ptrace.c and
making it somewhat more comprehensible.

Regards,
Daniel

>
> Andreas.
>
> -- 
> Andreas Schwab, schwab@linux-m68k.org
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> "And now for something completely different."

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox