LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 1/2] powerpc/powernv: Make pnv_pci_sriov_enable() and friends static
From: Michael Ellerman @ 2020-07-16 12:56 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev; +Cc: kernel test robot
In-Reply-To: <20200705133557.443607-1-oohall@gmail.com>

On Sun, 5 Jul 2020 23:35:56 +1000, Oliver O'Halloran wrote:
> The kernel test robot noticed these are non-static which causes Clang to
> print some warnings. These are called via ppc_md function pointers so
> there's no need for them to be non-static.

Applied to powerpc/next.

[1/2] powerpc/powernv: Make pnv_pci_sriov_enable() and friends static
      https://git.kernel.org/powerpc/c/93eacd94e09db2b1bb0343f8115385e5c34abf0a
[2/2] powerpc/powernv: Move pnv_ioda_setup_bus_dma under CONFIG_IOMMU_API
      https://git.kernel.org/powerpc/c/e3417faec526cbf97773dca691dcd743f5bfeb64

cheers

^ permalink raw reply

* Re: [PATCH 1/3] powerpc/64s: restore_math remove TM test
From: Michael Ellerman @ 2020-07-16 12:56 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Anton Blanchard, Gustavo Romero
In-Reply-To: <20200623234139.2262227-1-npiggin@gmail.com>

On Wed, 24 Jun 2020 09:41:37 +1000, Nicholas Piggin wrote:
> The TM test in restore_math added by commit dc16b553c949e ("powerpc:
> Always restore FPU/VEC/VSX if hardware transactional memory in use") is
> no longer necessary after commit a8318c13e79ba ("powerpc/tm: Fix
> restoring FP/VMX facility incorrectly on interrupts"), which removed
> the cases where restore_math has to restore if TM is active.

Applied to powerpc/next.

[1/3] powerpc/64s: restore_math remove TM test
      https://git.kernel.org/powerpc/c/891b4fe8fe3d09f20948b391f24c9fc5b7580a2b
[2/3] powerpc/64s: Fix restore_math unnecessarily changing MSR
      https://git.kernel.org/powerpc/c/01eb01877f3386d4bd5de75909abdd0af45a5fa2
[3/3] powerpc: re-initialise lazy FPU/VEC counters on every fault
      https://git.kernel.org/powerpc/c/b2b46304e9360f3dda49c9d8ba4a1478b9eecf1d

cheers

^ permalink raw reply

* Re: [PATCH 1/1] MAINTAINERS: Remove self
From: Michael Ellerman @ 2020-07-16 12:56 UTC (permalink / raw)
  To: Sam Bobroff, linuxppc-dev
In-Reply-To: <aec7d729c28e35c7fa9969ec50229080c771195c.1593471043.git.sbobroff@linux.ibm.com>

On Tue, 30 Jun 2020 08:50:44 +1000, Sam Bobroff wrote:
> I'm sorry to say I can no longer maintain this position.

Applied to powerpc/next.

[1/1] MAINTAINERS: Remove self from powerpc EEH
      https://git.kernel.org/powerpc/c/acccc984c1f2e49225b40f1d0d20d383ec27d4d0

cheers

^ permalink raw reply

* Re: [PATCH v6] powerpc/fadump: fix race between pstore write and fadump crash trigger
From: Michael Ellerman @ 2020-07-16 12:56 UTC (permalink / raw)
  To: mpe, Sourabh Jain; +Cc: linuxppc-dev, linux-kernel, hbathini, mahesh
In-Reply-To: <20200713052435.183750-1-sourabhjain@linux.ibm.com>

On Mon, 13 Jul 2020 10:54:35 +0530, Sourabh Jain wrote:
> When we enter into fadump crash path via system reset we fail to update
> the pstore.
> 
> On the system reset path we first update the pstore then we go for fadump
> crash. But the problem here is when all the CPUs try to get the pstore
> lock to initiate the pstore write, only one CPUs will acquire the lock
> and proceed with the pstore write. Since it in NMI context CPUs that fail
> to get lock do not wait for their turn to write to the pstore and simply
> proceed with the next operation which is fadump crash. One of the CPU who
> proceeded with fadump crash path triggers the crash and does not wait for
> the CPU who gets the pstore lock to complete the pstore update.
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/fadump: fix race between pstore write and fadump crash trigger
      https://git.kernel.org/powerpc/c/ba608c4fa12cfd0cab0e153249c29441f4dd3312

cheers

^ permalink raw reply

* Re: [PATCH -next] powerpc/xive: Remove unused inline function xive_kexec_teardown_cpu()
From: Michael Ellerman @ 2020-07-16 12:56 UTC (permalink / raw)
  To: paulus, mpe, groug, benh, YueHaibing; +Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200715025040.33952-1-yuehaibing@huawei.com>

On Wed, 15 Jul 2020 10:50:40 +0800, YueHaibing wrote:
> commit e27e0a94651e ("powerpc/xive: Remove xive_kexec_teardown_cpu()")
> left behind this, remove it.

Applied to powerpc/next.

[1/1] powerpc/xive: Remove unused inline function xive_kexec_teardown_cpu()
      https://git.kernel.org/powerpc/c/29d9407e1037868b59d12948d42ad3ef58fc3a5a

cheers

^ permalink raw reply

* Re: [PATCH -next] cpuidle/pseries: Make symbol 'pseries_idle_driver' static
From: Michael Ellerman @ 2020-07-16 12:56 UTC (permalink / raw)
  To: Hulk Robot, Wei Yongjun, Daniel Lezcano, Rafael J. Wysocki,
	Michael Ellerman
  Cc: linuxppc-dev, linux-pm
In-Reply-To: <20200714142424.66648-1-weiyongjun1@huawei.com>

On Tue, 14 Jul 2020 22:24:24 +0800, Wei Yongjun wrote:
> The sparse tool complains as follows:
> 
> drivers/cpuidle/cpuidle-pseries.c:25:23: warning:
>  symbol 'pseries_idle_driver' was not declared. Should it be static?
> 
> 'pseries_idle_driver' is not used outside of this file, so marks
> it static.

Applied to powerpc/next.

[1/1] cpuidle/pseries: Make symbol 'pseries_idle_driver' static
      https://git.kernel.org/powerpc/c/92fe8483b1660feaa602d8be6ca7efe95ae4789b

cheers

^ permalink raw reply

* Re: [PATCH 0/3] Implement shared_cpu_list for powerpc
From: Michael Ellerman @ 2020-07-16 12:56 UTC (permalink / raw)
  To: Michael Ellerman, Srikar Dronamraju; +Cc: Nathan Lynch, linuxppc-dev
In-Reply-To: <20200629103703.4538-1-srikar@linux.vnet.ibm.com>

On Mon, 29 Jun 2020 16:07:00 +0530, Srikar Dronamraju wrote:
> shared_cpu_list sysfs file is missing in powerpc and shared_cpu_map gives an
> extra newline character.
> 
> Before this patchset
> # ls /sys/devices/system/cpu0/cache/index1
> coherency_line_size  number_of_sets  size  ways_of_associativity
> level                shared_cpu_map  type
> # cat /sys/devices/system/cpu0/cache/index1/shared_cpu_map
> 00ff
> 
> [...]

Applied to powerpc/next.

[1/3] powerpc/cacheinfo: Use cpumap_print to print cpumap
      https://git.kernel.org/powerpc/c/5658cf085ba3c3f3c24ac0f7210f0473794df506
[2/3] powerpc/cacheinfo: Make cpumap_show code reusable
      https://git.kernel.org/powerpc/c/74b7492e417812ea0f5002e210e2ac07a5728d17
[3/3] powerpc/cacheinfo: Add per cpu per index shared_cpu_list
      https://git.kernel.org/powerpc/c/a87a77cb947cc9fc89f0dad51aeee66a61cc7fc4

cheers

^ permalink raw reply

* Re: [PATCH v3 0/4] powerpc/mm/radix: Memory unplug fixes
From: Nathan Lynch @ 2020-07-16 14:00 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, Bharata B Rao
In-Reply-To: <20200709131925.922266-1-aneesh.kumar@linux.ibm.com>

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> This is the next version of the fixes for memory unplug on radix.
> The issues and the fix are described in the actual patches.

I guess this isn't actually causing problems at runtime right now, but I
notice calls to resize_hpt_for_hotplug() from arch_add_memory() and
arch_remove_memory(), which ought to be mmu-agnostic:

int __ref arch_add_memory(int nid, u64 start, u64 size,
			  struct mhp_params *params)
{
	unsigned long start_pfn = start >> PAGE_SHIFT;
	unsigned long nr_pages = size >> PAGE_SHIFT;
	int rc;

	resize_hpt_for_hotplug(memblock_phys_mem_size());

	start = (unsigned long)__va(start);
	rc = create_section_mapping(start, start + size, nid,
				    params->pgprot);
...


^ permalink raw reply

* Re: [PATCH] pseries: Fix 64 bit logical memory block panic
From: Aneesh Kumar K.V @ 2020-07-16 14:07 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: nathanl, linuxppc-dev
In-Reply-To: <20200716013030.GA4076912@thinks.paulus.ozlabs.org>

On 7/16/20 7:00 AM, Paul Mackerras wrote:
> On Wed, Jul 15, 2020 at 06:12:25PM +0530, Aneesh Kumar K.V wrote:
>> Anton Blanchard <anton@ozlabs.org> writes:
>>
>>> Booting with a 4GB LMB size causes us to panic:
>>>
>>>    qemu-system-ppc64: OS terminated: OS panic:
>>>        Memory block size not suitable: 0x0
>>>
>>> Fix pseries_memory_block_size() to handle 64 bit LMBs.
>>>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Anton Blanchard <anton@ozlabs.org>
>>> ---
>>>   arch/powerpc/platforms/pseries/hotplug-memory.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
>>> index 5ace2f9a277e..6574ac33e887 100644
>>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>>> @@ -27,7 +27,7 @@ static bool rtas_hp_event;
>>>   unsigned long pseries_memory_block_size(void)
>>>   {
>>>   	struct device_node *np;
>>> -	unsigned int memblock_size = MIN_MEMORY_BLOCK_SIZE;
>>> +	uint64_t memblock_size = MIN_MEMORY_BLOCK_SIZE;
>>>   	struct resource r;
>>>   
>>>   	np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
>>
>> We need similar changes at more places?
>>
>> modified   arch/powerpc/include/asm/book3s/64/mmu.h
>> @@ -85,7 +85,7 @@ extern unsigned int mmu_base_pid;
>>   /*
>>    * memory block size used with radix translation.
>>    */
>> -extern unsigned int __ro_after_init radix_mem_block_size;
>> +extern unsigned long __ro_after_init radix_mem_block_size;
>>   
>>   #define PRTB_SIZE_SHIFT	(mmu_pid_bits + 4)
>>   #define PRTB_ENTRIES	(1ul << mmu_pid_bits)
>> modified   arch/powerpc/include/asm/drmem.h
>> @@ -21,7 +21,7 @@ struct drmem_lmb {
>>   struct drmem_lmb_info {
>>   	struct drmem_lmb        *lmbs;
>>   	int                     n_lmbs;
>> -	u32                     lmb_size;
>> +	u64                     lmb_size;
>>   };
>>   
>>   extern struct drmem_lmb_info *drmem_info;
>> modified   arch/powerpc/mm/book3s64/radix_pgtable.c
>> @@ -34,7 +34,7 @@
>>   
>>   unsigned int mmu_pid_bits;
>>   unsigned int mmu_base_pid;
>> -unsigned int radix_mem_block_size __ro_after_init;
>> +unsigned long radix_mem_block_size __ro_after_init;
> 
> These changes look fine.
> 
>>   static __ref void *early_alloc_pgtable(unsigned long size, int nid,
>>   			unsigned long region_start, unsigned long region_end)
>> modified   arch/powerpc/mm/drmem.c
>> @@ -268,14 +268,15 @@ static void __init __walk_drmem_v2_lmbs(const __be32 *prop, const __be32 *usm,
>>   void __init walk_drmem_lmbs_early(unsigned long node,
>>   			void (*func)(struct drmem_lmb *, const __be32 **))
>>   {
>> +	const __be64 *lmb_prop;
>>   	const __be32 *prop, *usm;
>>   	int len;
>>   
>> -	prop = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
>> -	if (!prop || len < dt_root_size_cells * sizeof(__be32))
>> +	lmb_prop = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
>> +	if (!lmb_prop || len < sizeof(__be64))
>>   		return;
>>   
>> -	drmem_info->lmb_size = dt_mem_next_cell(dt_root_size_cells, &prop);
>> +	drmem_info->lmb_size = be64_to_cpup(lmb_prop);
> 
> This particular change shouldn't be necessary.  We already have
> dt_mem_next_cell() returning u64, and it knows how to combine two
> cells to give a u64 (for dt_root_size_cells == 2).


agreed. I added it here because in another patch i was confused about 
the usage of dt_root_size_cells. We don't generally use that in other 
device tree parsing code. I will move that to a separate patch as cleanup.

> 
>>   	usm = of_get_flat_dt_prop(node, "linux,drconf-usable-memory", &len);
>>   
>> @@ -296,19 +297,19 @@ void __init walk_drmem_lmbs_early(unsigned long node,
>>   
>>   static int __init init_drmem_lmb_size(struct device_node *dn)
>>   {
>> -	const __be32 *prop;
>> +	const __be64 *prop;
>>   	int len;
>>   
>>   	if (drmem_info->lmb_size)
>>   		return 0;
>>   
>>   	prop = of_get_property(dn, "ibm,lmb-size", &len);
>> -	if (!prop || len < dt_root_size_cells * sizeof(__be32)) {
>> +	if (!prop || len < sizeof(__be64)) {
>>   		pr_info("Could not determine LMB size\n");
>>   		return -1;
>>   	}
>>   
>> -	drmem_info->lmb_size = dt_mem_next_cell(dt_root_size_cells, &prop);
>> +	drmem_info->lmb_size = be64_to_cpup(prop);
> 
> Same comment here.
> 

-aneesh

^ permalink raw reply

* Re: [PATCH V5 1/4] mm/debug_vm_pgtable: Add tests validating arch helpers for core MM features
From: Steven Price @ 2020-07-16 14:14 UTC (permalink / raw)
  To: Anshuman Khandual, linux-mm
  Cc: Heiko Carstens, Paul Mackerras, H. Peter Anvin, agordeev,
	Will Deacon, linux-riscv, linux-arch, linux-s390, x86,
	Mike Rapoport, Christian Borntraeger, Ingo Molnar,
	gerald.schaefer, ziy, Catalin Marinas, linux-snps-arc,
	Vasily Gorbik, cai, Paul Walmsley, Kirill A . Shutemov,
	Thomas Gleixner, linux-arm-kernel, christophe.leroy, Vineet Gupta,
	linux-kernel, Palmer Dabbelt, aneesh.kumar, Borislav Petkov,
	Andrew Morton, linuxppc-dev, rppt
In-Reply-To: <1594610587-4172-2-git-send-email-anshuman.khandual@arm.com>

On 13/07/2020 04:23, Anshuman Khandual wrote:
> This adds new tests validating arch page table helpers for these following
> core memory features. These tests create and test specific mapping types at
> various page table levels.
> 
> 1. SPECIAL mapping
> 2. PROTNONE mapping
> 3. DEVMAP mapping
> 4. SOFTDIRTY mapping
> 5. SWAP mapping
> 6. MIGRATION mapping
> 7. HUGETLB mapping
> 8. THP mapping
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
> Cc: Christophe Leroy <christophe.leroy@c-s.fr>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Vineet Gupta <vgupta@synopsys.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@de.ibm.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Paul Walmsley <paul.walmsley@sifive.com>
> Cc: Palmer Dabbelt <palmer@dabbelt.com>
> Cc: linux-snps-arc@lists.infradead.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s390@vger.kernel.org
> Cc: linux-riscv@lists.infradead.org
> Cc: x86@kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-arch@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Tested-by: Vineet Gupta <vgupta@synopsys.com>	#arc
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>   mm/debug_vm_pgtable.c | 302 +++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 301 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 61ab16fb2e36..2fac47db3eb7 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
[...]
> +
> +static void __init pte_swap_tests(unsigned long pfn, pgprot_t prot)
> +{
> +	swp_entry_t swp;
> +	pte_t pte;
> +
> +	pte = pfn_pte(pfn, prot);
> +	swp = __pte_to_swp_entry(pte);

Minor issue: this doesn't look necessarily valid - there's no reason a 
normal PTE can be turned into a swp_entry. In practise this is likely to 
work on all architectures because there's no reason not to use (at 
least) all the PFN bits for the swap entry, but it doesn't exactly seem 
correct.

Can we start with a swp_entry_t (from __swp_entry()) and check the round 
trip of that?

It would also seem sensible to have a check that 
is_swap_pte(__swp_entry_to_pte(__swp_entry(x,y))) is true.

> +	pte = __swp_entry_to_pte(swp);
> +	WARN_ON(pfn != pte_pfn(pte));
> +}
> +
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +static void __init pmd_swap_tests(unsigned long pfn, pgprot_t prot)
> +{
> +	swp_entry_t swp;
> +	pmd_t pmd;
> +
> +	pmd = pfn_pmd(pfn, prot);
> +	swp = __pmd_to_swp_entry(pmd);
> +	pmd = __swp_entry_to_pmd(swp);
> +	WARN_ON(pfn != pmd_pfn(pmd));
> +}
> +#else  /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +static void __init pmd_swap_tests(unsigned long pfn, pgprot_t prot) { }
> +#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
> +static void __init swap_migration_tests(void)
> +{
> +	struct page *page;
> +	swp_entry_t swp;
> +
> +	if (!IS_ENABLED(CONFIG_MIGRATION))
> +		return;
> +	/*
> +	 * swap_migration_tests() requires a dedicated page as it needs to
> +	 * be locked before creating a migration entry from it. Locking the
> +	 * page that actually maps kernel text ('start_kernel') can be real
> +	 * problematic. Lets allocate a dedicated page explicitly for this

NIT: s/Lets/Let's

Otherwise looks good to me.

Steve

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Mathieu Desnoyers @ 2020-07-16 15:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, Arnd Bergmann, x86, linux-kernel, Nicholas Piggin,
	Andy Lutomirski, linux-mm, Andy Lutomirski, linuxppc-dev
In-Reply-To: <20200716110038.GA119549@hirez.programming.kicks-ass.net>

----- On Jul 16, 2020, at 7:00 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Thu, Jul 16, 2020 at 08:03:36PM +1000, Nicholas Piggin wrote:
>> Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm:
>> > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote:
>> >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
>> >> But I’m wondering if all this deferred sync stuff is wrong. In the
>> >> brave new world of io_uring and such, perhaps kernel access matter
>> >> too.  Heck, even:
>> > 
>> > IIRC the membarrier SYNC_CORE use-case is about user-space
>> > self-modifying code.
>> > 
>> > Userspace re-uses a text address and needs to SYNC_CORE before it can be
>> > sure the old text is forgotten. Nothing the kernel does matters there.
>> > 
>> > I suppose the manpage could be more clear there.
>> 
>> True, but memory ordering of kernel stores from kernel threads for
>> regular mem barrier is the concern here.
>> 
>> Does io_uring update completion queue from kernel thread or interrupt,
>> for example? If it does, then membarrier will not order such stores
>> with user memory accesses.
> 
> So we're talking about regular membarrier() then? Not the SYNC_CORE
> variant per-se.
> 
> Even there, I'll argue we don't care, but perhaps Mathieu has a
> different opinion.

I agree with Peter that we don't care about accesses to user-space
memory performed concurrently with membarrier.

What we'd care about in terms of accesses to user-space memory from the
kernel is something that would be clearly ordered as happening before
or after the membarrier call, for instance a read(2) followed by
membarrier(2) after the read returns, or a read(2) issued after return
from membarrier(2). The other scenario we'd care about is with the compiler
barrier paired with membarrier: e.g. read(2) returns, compiler barrier,
followed by a store. Or load, compiler barrier, followed by write(2).

All those scenarios imply before/after ordering wrt either membarrier or
the compiler barrier. I notice that io_uring has a "completion" queue.
Let's try to come up with realistic usage scenarios.

So the dependency chain would be provided by e.g.:

* Infrequent read / Frequent write, communicating read completion through variable X

wait for io_uring read request completion -> membarrier -> store X=1

with matching

load from X (waiting for X==1) -> asm volatile (::: "memory") -> submit io_uring write request

or this other scenario:

* Frequent read / Infrequent write, communicating read completion through variable X

load from X (waiting for X==1) -> membarrier -> submit io_uring write request

with matching

wait for io_uring read request completion -> asm volatile (::: "memory") -> store X=1

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Mathieu Desnoyers @ 2020-07-16 15:46 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-arch, Arnd Bergmann, Peter Zijlstra, x86, linux-kernel,
	linux-mm, Andy Lutomirski, linuxppc-dev
In-Reply-To: <1594873644.viept6os6j.astroid@bobo.none>



----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote:
> I should be more complete here, especially since I was complaining
> about unclear barrier comment :)
> 
> 
> CPU0                     CPU1
> a. user stuff            1. user stuff
> b. membarrier()          2. enter kernel
> c. smp_mb()              3. smp_mb__after_spinlock(); // in __schedule
> d. read rq->curr         4. rq->curr switched to kthread
> e. is kthread, skip IPI  5. switch_to kthread
> f. return to user        6. rq->curr switched to user thread
> g. user stuff            7. switch_to user thread
>                         8. exit kernel
>                         9. more user stuff
> 
> What you're really ordering is a, g vs 1, 9 right?
> 
> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
> etc.
> 
> Userspace does not care where the barriers are exactly or what kernel
> memory accesses might be being ordered by them, so long as there is a
> mb somewhere between a and g, and 1 and 9. Right?

This is correct. Note that the accesses to user-space memory can be
done either by user-space code or kernel code, it doesn't matter.
However, in order to be considered as happening before/after
either membarrier or the matching compiler barrier, kernel code
needs to have causality relationship with user-space execution,
e.g. user-space does a system call, or returns from a system call.

In the case of io_uring, submitting a request or returning from waiting
on request completion appear to provide this causality relationship.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH net-next] ibmvnic: Increase driver logging
From: Thomas Falcon @ 2020-07-16 15:59 UTC (permalink / raw)
  To: David Miller, kuba; +Cc: drt, netdev, linuxppc-dev
In-Reply-To: <20200715.182956.490791427431304861.davem@davemloft.net>

On 7/15/20 8:29 PM, David Miller wrote:
> From: Jakub Kicinski <kuba@kernel.org>
> Date: Wed, 15 Jul 2020 17:06:32 -0700
>
>> On Wed, 15 Jul 2020 18:51:55 -0500 Thomas Falcon wrote:
>>>   	free_netdev(netdev);
>>>   	dev_set_drvdata(&dev->dev, NULL);
>>> +	netdev_info(netdev, "VNIC client device has been successfully removed.\n");
>> A step too far, perhaps.
>>
>> In general this patch looks a little questionable IMHO, this amount of
>> logging output is not commonly seen in drivers. All the the info
>> messages are just static text, not even carrying any extra information.
>> In an era of ftrace, and bpftrace, do we really need this?
> Agreed, this is too much.  This is debugging, and thus suitable for tracing
> facilities, at best.

Thanks for your feedback. I see now that I was overly aggressive with 
this patch to be sure, but it would help with narrowing down problems at 
a first glance, should they arise. The driver in its current state logs 
very little of what is it doing without the use of additional debugging 
or tracing facilities. Would it be worth it to pursue a less aggressive 
version or would that be dead on arrival? What are acceptable driver 
operations to log at this level?

Thanks,

Tom

^ permalink raw reply

* Re: [PATCH net-next] ibmvnic: Increase driver logging
From: Michal Suchánek @ 2020-07-16 16:07 UTC (permalink / raw)
  To: Thomas Falcon; +Cc: drt, kuba, linuxppc-dev, David Miller, netdev
In-Reply-To: <9c9d6e46-240b-8513-08e4-e1c7556cb3c8@linux.ibm.com>

On Thu, Jul 16, 2020 at 10:59:58AM -0500, Thomas Falcon wrote:
> 
> On 7/15/20 8:29 PM, David Miller wrote:
> > From: Jakub Kicinski <kuba@kernel.org>
> > Date: Wed, 15 Jul 2020 17:06:32 -0700
> > 
> > > On Wed, 15 Jul 2020 18:51:55 -0500 Thomas Falcon wrote:
> > > >   	free_netdev(netdev);
> > > >   	dev_set_drvdata(&dev->dev, NULL);
> > > > +	netdev_info(netdev, "VNIC client device has been successfully removed.\n");
> > > A step too far, perhaps.
> > > 
> > > In general this patch looks a little questionable IMHO, this amount of
> > > logging output is not commonly seen in drivers. All the the info
> > > messages are just static text, not even carrying any extra information.
> > > In an era of ftrace, and bpftrace, do we really need this?
> > Agreed, this is too much.  This is debugging, and thus suitable for tracing
> > facilities, at best.
> 
> Thanks for your feedback. I see now that I was overly aggressive with this
> patch to be sure, but it would help with narrowing down problems at a first
> glance, should they arise. The driver in its current state logs very little
> of what is it doing without the use of additional debugging or tracing
> facilities. Would it be worth it to pursue a less aggressive version or
> would that be dead on arrival? What are acceptable driver operations to log
> at this level?

Also would it be advisable to add the messages as pr_dbg to be enabled on demand?

Thanks

Michal

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Mathieu Desnoyers @ 2020-07-16 16:03 UTC (permalink / raw)
  To: Nicholas Piggin, paulmck
  Cc: linux-arch, Arnd Bergmann, Peter Zijlstra, x86, linux-kernel,
	linux-mm, Andy Lutomirski, linuxppc-dev
In-Reply-To: <1494299304.15894.1594914382695.JavaMail.zimbra@efficios.com>

----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote:
>> I should be more complete here, especially since I was complaining
>> about unclear barrier comment :)
>> 
>> 
>> CPU0                     CPU1
>> a. user stuff            1. user stuff
>> b. membarrier()          2. enter kernel
>> c. smp_mb()              3. smp_mb__after_spinlock(); // in __schedule
>> d. read rq->curr         4. rq->curr switched to kthread
>> e. is kthread, skip IPI  5. switch_to kthread
>> f. return to user        6. rq->curr switched to user thread
>> g. user stuff            7. switch_to user thread
>>                         8. exit kernel
>>                         9. more user stuff
>> 
>> What you're really ordering is a, g vs 1, 9 right?
>> 
>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
>> etc.
>> 
>> Userspace does not care where the barriers are exactly or what kernel
>> memory accesses might be being ordered by them, so long as there is a
>> mb somewhere between a and g, and 1 and 9. Right?
> 
> This is correct.

Actually, sorry, the above is not quite right. It's been a while
since I looked into the details of membarrier.

The smp_mb() at the beginning of membarrier() needs to be paired with a
smp_mb() _after_ rq->curr is switched back to the user thread, so the
memory barrier is between store to rq->curr and following user-space
accesses.

The smp_mb() at the end of membarrier() needs to be paired with the
smp_mb__after_spinlock() at the beginning of schedule, which is
between accesses to userspace memory and switching rq->curr to kthread.

As to *why* this ordering is needed, I'd have to dig through additional
scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ?

Thanks,

Mathieu


> Note that the accesses to user-space memory can be
> done either by user-space code or kernel code, it doesn't matter.
> However, in order to be considered as happening before/after
> either membarrier or the matching compiler barrier, kernel code
> needs to have causality relationship with user-space execution,
> e.g. user-space does a system call, or returns from a system call.
> 
> In the case of io_uring, submitting a request or returning from waiting
> on request completion appear to provide this causality relationship.
> 
> Thanks,
> 
> Mathieu
> 
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH v3 0/3] Off-load TLB invalidations to host for !GTSE
From: Qian Cai @ 2020-07-16 17:27 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: sfr, aneesh.kumar, linux-kernel, npiggin, linux-next,
	linuxppc-dev
In-Reply-To: <20200703053608.12884-1-bharata@linux.ibm.com>

On Fri, Jul 03, 2020 at 11:06:05AM +0530, Bharata B Rao wrote:
> Hypervisor may choose not to enable Guest Translation Shootdown Enable
> (GTSE) option for the guest. When GTSE isn't ON, the guest OS isn't
> permitted to use instructions like tblie and tlbsync directly, but is
> expected to make hypervisor calls to get the TLB flushed.
> 
> This series enables the TLB flush routines in the radix code to
> off-load TLB flushing to hypervisor via the newly proposed hcall
> H_RPT_INVALIDATE. 
> 
> To easily check the availability of GTSE, it is made an MMU feature.
> The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
> handle GTSE as an optionally available feature and to not assume GTSE
> when radix support is available.
> 
> The actual hcall implementation for KVM isn't included in this
> patchset and will be posted separately.
> 
> Changes in v3
> =============
> - Fixed a bug in the hcall wrapper code where we were missing setting
>   H_RPTI_TYPE_NESTED while retrying the failed flush request with
>   a full flush for the nested case.
> - s/psize_to_h_rpti/psize_to_rpti_pgsize
> 
> v2: https://lore.kernel.org/linuxppc-dev/20200626131000.5207-1-bharata@linux.ibm.com/T/#t
> 
> Bharata B Rao (2):
>   powerpc/mm: Enable radix GTSE only if supported.
>   powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
>     enabled
> 
> Nicholas Piggin (1):
>   powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
>     !GTSE

Reverting the whole series fixed random memory corruptions during boot on
POWER9 PowerNV systems below.

IBM 8335-GTH (ibm,witherspoon)
POWER9, altivec supported
262144 MB memory, 2000 GB disk space

.config:
https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config

[    9.338996][  T925] BUG: Unable to handle kernel instruction fetch (NULL pointer?)
[    9.339026][  T925] Faulting instruction address: 0x00000000
[    9.339051][  T925] Oops: Kernel access of bad area, sig: 11 [#1]
[    9.339064][  T925] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 NUMA PowerNV
[    9.339098][  T925] Modules linked in: dm_mirror dm_region_hash dm_log dm_mod
[    9.339150][  T925] CPU: 92 PID: 925 Comm: (md-udevd) Not tainted 5.8.0-rc5-next-20200716 #3
[    9.339186][  T925] NIP:  0000000000000000 LR: c00000000021f2cc CTR: 0000000000000000
[    9.339210][  T925] REGS: c000201cb52d79b0 TRAP: 0400   Not tainted  (5.8.0-rc5-next-20200716)
[    9.339244][  T925] MSR:  9000000040009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24222292  XER: 00000000
[    9.339278][  T925] CFAR: c00000000021f2c8 IRQMASK: 0 
[    9.339278][  T925] GPR00: c00000000021f2cc c000201cb52d7c40 c000000005901000 c000201cb52d7ca8 
[    9.339278][  T925] GPR04: c00800000ea60038 0000000000000000 000000007fff0000 000000007fff0000 
[    9.339278][  T925] GPR08: 0000000000000000 0000000000000000 c000201cb50bd500 0000000000000003 
[    9.339278][  T925] GPR12: 0000000000000000 c000201fff694500 00007fffa4a8a940 00007fffa4a8a6c8 
[    9.339278][  T925] GPR16: 00007fffa4a8a8f8 00007fffa4a8a650 00007fffa4a8a488 0000000000000000 
[    9.339278][  T925] GPR20: 0000000000050001 00007fffa4a8a984 000000007fff0000 c00000000a4545cc 
[    9.339278][  T925] GPR24: c000000000affe28 0000000000000000 0000000000000000 0000000000000166 
[    9.339278][  T925] GPR28: c000201cb52d7ca8 c00800000ea60000 c000201cc3b72600 000000007fff0000 
[    9.339493][  T925] NIP [0000000000000000] 0x0
[    9.339516][  T925] LR [c00000000021f2cc] __seccomp_filter+0xec/0x530
bpf_dispatcher_nop_func at include/linux/bpf.h:567
(inlined by) bpf_prog_run_pin_on_cpu at include/linux/filter.h:597
(inlined by) seccomp_run_filters at kernel/seccomp.c:324
(inlined by) __seccomp_filter at kernel/seccomp.c:937
[    9.339538][  T925] Call Trace:
[    9.339548][  T925] [c000201cb52d7c40] [c00000000021f2cc] __seccomp_filter+0xec/0x530 (unreliable)
[    9.339566][  T925] [c000201cb52d7d50] [c000000000025af8] do_syscall_trace_enter+0xb8/0x470
do_seccomp at arch/powerpc/kernel/ptrace/ptrace.c:252
(inlined by) do_syscall_trace_enter at arch/powerpc/kernel/ptrace/ptrace.c:327
[    9.339600][  T925] [c000201cb52d7dc0] [c00000000002c8f8] system_call_exception+0x138/0x180
[    9.339625][  T925] [c000201cb52d7e20] [c00000000000c9e8] system_call_common+0xe8/0x214
[    9.339648][  T925] Instruction dump:
[    9.339667][  T925] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
[    9.339706][  T925] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
[    9.339748][  T925] ---[ end trace d89eb80f9a6bc141 ]---
[  OK  ] Started Journal Service.
[   10.452364][  T925] Kernel panic - not syncing: Fatal exception
[   11.876655][  T925] ---[ end Kernel panic - not syncing: Fatal exception ]---

There could also be lots of random userspace segfault like,

[   16.463545][  T771] rngd[771]: segfault (11) at 0 nip 0 lr 0 code 1 in rngd[106d60000+20000]
[   16.463620][  T771] rngd[771]: code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
[   16.463656][  T771] rngd[771]: code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX

Occasionally, there are many soft-lockups,

[  396.920702][   C99] watchdog: BUG: soft lockup - CPU#99 stuck for 22s! [(spawn):2692]
[  396.920754][   C99] Modules linked in: kvm_hv kvm ip_tables x_tables sd_mod bnx2x tg3 ahci mdio libahci libphy firmware_class libata dm_mirror dm_region_hash dm_log dm_mod
[  396.920843][   C99] irq event stamp: 1731717220
[  396.920860][   C99] hardirqs last  enabled at (1731717219): [<c00000000004d6f4>] do_page_fault+0x324/0xd90
[  396.920889][   C99] hardirqs last disabled at (1731717220): [<c000000000015638>] arch_local_irq_restore+0x48/0xd0
[  396.920919][   C99] softirqs last  enabled at (41260): [<c0000000009abbe8>] __do_softirq+0x648/0x8c8
[  396.920948][   C99] softirqs last disabled at (41125): [<c0000000000d717c>] irq_exit+0x15c/0x1c0
[  396.920976][   C99] CPU: 99 PID: 2692 Comm: (spawn) Tainted: G             L    5.8.0-rc5-next-20200716 #3
[  396.921001][   C99] NIP:  c0000000000152b4 LR: c000000000015640 CTR: 0000000000000000
[  396.921037][   C99] REGS: c000201cbc3d7178 TRAP: 0900   Tainted: G             L     (5.8.0-rc5-next-20200716)
[  396.921074][   C99] MSR:  9000000000001033 <SF,HV,ME,IR,DR,RI,LE>  CR: 28022482  XER: 20040000
[  396.921122][   C99] CFAR: 0000000000000000 IRQMASK: 3 
[  396.921122][   C99] GPR00: c000000000015640 c000201cbc3d7340 c000000005901000 c000201cbc3d7178 
[  396.921122][   C99] GPR04: c0000000057d7280 0000000000000000 000000000002000a 0000000000000003 
[  396.921122][   C99] GPR08: 0000201cc61c0000 0000000000000000 0000000000000001 c00000000593f868 
[  396.921122][   C99] GPR12: 0000000000002000 c000201fff67e700 00007fffdcda3918 0000000139eeba60 
[  396.921122][   C99] GPR16: 0000000139f30130 00007fffdcda39c8 0000000139eea708 0000000000000000 
[  396.921122][   C99] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000008 
[  396.921122][   C99] GPR24: 0000000000000e60 0000000000000900 0000000000000500 0000000000000a00 
[  396.921122][   C99] GPR28: 0000000000000f00 0000000000000002 0000000000000003 c000201ca4212400 
[  396.921468][   C99] NIP [c0000000000152b4] replay_soft_interrupts+0x74/0x3b0
replay_soft_interrupts at arch/powerpc/kernel/irq.c:216
[  396.921504][   C99] LR [c000000000015640] arch_local_irq_restore+0x50/0xd0
arch_local_irq_restore at arch/powerpc/kernel/irq.c:375
[  396.921539][   C99] Call Trace:
[  396.921560][   C99] [c000201cbc3d7340] [c000000000015640] arch_local_irq_restore+0x50/0xd0 (unreliable)
[  396.921602][   C99] [c000201cbc3d7360] [c0000000009a0c68] lock_is_held_type+0xf8/0x180
[  396.921641][   C99] [c000201cbc3d73c0] [c0000000003e8cf0] mem_cgroup_from_task+0xa0/0x130
[  396.921666][   C99] [c000201cbc3d7400] [c000000000337950] handle_mm_fault+0x140/0x1d20
[  396.921703][   C99] [c000201cbc3d7500] [c00000000004d5ac] do_page_fault+0x1dc/0xd90
[  396.921763][   C99] [c000201cbc3d7600] [c00000000000c028] handle_page_fault+0x10/0x2c
[  396.921804][   C99] --- interrupt: 300 at futex_cleanup+0x3c0/0x740
[  396.921804][   C99]     LR = futex_cleanup+0x35c/0x740
[  396.921879][   C99] [c000201cbc3d79c0] [c0000000001df2e8] futex_exec_release+0x28/0x50
[  396.921929][   C99] [c000201cbc3d79f0] [c0000000000c5e54] exec_mm_release+0x24/0x50
[  396.921968][   C99] [c000201cbc3d7a30] [c000000000421e84] begin_new_exec+0x324/0xea0
[  396.922005][   C99] [c000201cbc3d7af0] [c0000000004d8f1c] load_elf_binary+0x7fc/0x1110
[  396.922042][   C99] [c000201cbc3d7bf0] [c000000000420824] exec_binprm+0x1c4/0x7d0
[  396.922079][   C99] [c000201cbc3d7cb0] [c000000000421540] do_execveat_common+0x710/0x960
[  396.922117][   C99] [c000201cbc3d7d90] [c000000000422a44] sys_execve+0x44/0x60
[  396.922156][   C99] [c000201cbc3d7dc0] [c00000000002c8b8] system_call_exception+0xf8/0x180
[  396.922205][   C99] [c000201cbc3d7e20] [c00000000000c9e8] system_call_common+0xe8/0x214
[  396.922253][   C99] Instruction dump:
[  396.922286][   C99] 3b000e60 3b400500 3b600a00 3b800f00 f8010010 f821fe11 38610028 e92d0c70 
[  396.922316][   C99] f9210198 39200000 8aed0989 48037df9 <60000000> 39200003 f9210160 56e90738


[  248.821138][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.821170][  T676] khugepaged      D28416   682      2 0x00000800 
[  248.821212][  T676] Call Trace: 
[  248.821241][  T676] [c000001ff310f4e0] [c000000000caeb80] lru_add_drain_work+0x0/0x48 (unreliable) 
[  248.821275][  T676] [c000001ff310f6c0] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.821308][  T676] [c000001ff310f720] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.821352][  T676] [c000001ff310f7f0] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.821387][  T676] [c000001ff310f820] [c0000000009a9814] schedule_timeout+0x304/0x520 
[  248.821432][  T676] [c000001ff310f960] [c0000000009a3c84] wait_for_completion+0xc4/0x1b0 
[  248.821460][  T676] [c000001ff310f9d0] [c0000000000fd0c8] __flush_work+0x3b8/0x770 
[  248.821491][  T676] [c000001ff310faf0] [c0000000002e0ac4] lru_add_drain_all+0x3e4/0x760 
[  248.821521][  T676] [c000001ff310fbf0] [c0000000003e0f18] khugepaged+0xd8/0x1770 
[  248.821560][  T676] [c000001ff310fdb0] [c0000000001095fc] kthread+0x1bc/0x1d0 
[  248.821611][  T676] [c000001ff310fe20] [c00000000000cbc4] ret_from_kernel_thread+0x5c/0x78 
[  248.821655][  T676] INFO: task kworker/56:1:719 blocked for more than 122 seconds. 
[  248.821689][  T676]       Tainted: G             L    5.8.0-rc5-next-20200716 #3 
[  248.821729][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.821779][  T676] kworker/56:1    D27584   719      2 0x00000800 
[  248.821839][  T676] Workqueue: rcu_gp wait_rcu_exp_gp 
[  248.821862][  T676] Call Trace: 
[  248.821888][  T676] [c000001ff2b57660] [c0000000057c9fb0] rcu_state+0x4fb0/0x5100 (unreliable) 
[  248.821934][  T676] [c000001ff2b57840] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.821977][  T676] [c000001ff2b578a0] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.822021][  T676] [c000001ff2b57970] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.822066][  T676] [c000001ff2b579a0] [c0000000009a970c] schedule_timeout+0x1fc/0x520 
[  248.822110][  T676] [c000001ff2b57ae0] [c0000000001a86d0] rcu_exp_wait_wake+0x1b0/0x950 
[  248.822153][  T676] [c000001ff2b57c30] [c0000000000fb754] process_one_work+0x304/0x900 
[  248.822197][  T676] [c000001ff2b57d20] [c0000000000fbdc8] worker_thread+0x78/0x520 
[  248.822242][  T676] [c000001ff2b57db0] [c0000000001095fc] kthread+0x1bc/0x1d0 
[  248.822279][  T676] [c000001ff2b57e20] [c00000000000cbc4] ret_from_kernel_thread+0x5c/0x78 
[  248.822385][  T676] INFO: task lvm:3123 blocked for more than 122 seconds. 
[  248.822413][  T676]       Tainted: G             L    5.8.0-rc5-next-20200716 #3 
[  248.822462][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.822503][  T676] lvm             D26608  3123      1 0x00040000 
[  248.822552][  T676] Call Trace: 
[  248.822576][  T676] [c000001feee27a00] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.822620][  T676] [c000001feee27a60] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.822648][  T676] [c000001feee27b30] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.822680][  T676] [c000001feee27b60] [c0000000009a9814] schedule_timeout+0x304/0x520 
[  248.822724][  T676] [c000001feee27ca0] [c0000000009a3c84] wait_for_completion+0xc4/0x1b0 
[  248.822768][  T676] [c000001feee27d10] [c0000000004b0e88] sys_io_destroy+0x238/0x2f0 
[  248.822808][  T676] [c000001feee27dc0] [c00000000002c8b8] system_call_exception+0xf8/0x180 
[  248.822840][  T676] [c000001feee27e20] [c00000000000c9e8] system_call_common+0xe8/0x214 
[  248.822873][  T676] INFO: task lvm:3126 blocked for more than 122 seconds. 
[  248.822901][  T676]       Tainted: G             L    5.8.0-rc5-next-20200716 #3 
[  248.822938][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.822987][  T676] lvm             D26608  3126      1 0x00040000 
[  248.823017][  T676] Call Trace: 
[  248.823031][  T676] [c000001fc0b27a00] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.823075][  T676] [c000001fc0b27a60] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.823113][  T676] [c000001fc0b27b30] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.823158][  T676] [c000001fc0b27b60] [c0000000009a9814] schedule_timeout+0x304/0x520 
[  248.823199][  T676] [c000001fc0b27ca0] [c0000000009a3c84] wait_for_completion+0xc4/0x1b0 
[  248.823250][  T676] [c000001fc0b27d10] [c0000000004b0e88] sys_io_destroy+0x238/0x2f0 
[  248.823294][  T676] [c000001fc0b27dc0] [c00000000002c8b8] system_call_exception+0xf8/0x180 
[  248.823332][  T676] [c000001fc0b27e20] [c00000000000c9e8] system_call_common+0xe8/0x214 
[  248.823374][  T676] INFO: task auditd:3163 blocked for more than 122 seconds. 
[  248.823424][  T676]       Tainted: G             L    5.8.0-rc5-next-20200716 #3 
[  248.823471][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.823512][  T676] auditd          D27088  3163      1 0x00042408 
[  248.823551][  T676] Call Trace: 
[  248.823583][  T676] [c000001fc080f760] [0000000200000001] 0x200000001 (unreliable) 
[  248.823648][  T676] [c000001fc080f940] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.823689][  T676] [c000001fc080f9a0] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.823742][  T676] [c000001fc080fa70] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.823784][  T676] [c000001fc080faa0] [c0000000001a9244] synchronize_rcu_expedited+0x394/0x600 
[  248.823837][  T676] [c000001fc080fba0] [c0000000004504c4] namespace_unlock+0xf4/0x230 
[  248.823881][  T676] [c000001fc080fc00] [c000000000456dec] put_mnt_ns+0x5c/0x80 
[  248.823926][  T676] [c000001fc080fc30] [c00000000010ba6c] free_nsproxy+0x2c/0x1e0 
[  248.823966][  T676] [c000001fc080fc60] [c0000000000d5130] do_exit+0x4e0/0xee0 
[  248.823997][  T676] [c000001fc080fd60] [c0000000000d5bec] do_group_exit+0x5c/0xd0 
[  248.824019][  T676] [c000001fc080fda0] [c0000000000d5c7c] sys_exit_group+0x1c/0x20 
[  248.824060][  T676] [c000001fc080fdc0] [c00000000002c8b8] system_call_exception+0xf8/0x180 
[  248.824103][  T676] [c000001fc080fe20] [c00000000000c9e8] system_call_common+0xe8/0x214 
[  248.824192][  T676]  
[  248.824192][  T676] Showing all locks held in the system: 
[  248.824419][  T676] 1 lock held by khungtaskd/676: 
[  248.824455][  T676]  #0: c0000000057c44c0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.29+0x8/0x30 
[  248.824531][  T676] 1 lock held by khugepaged/682: 
[  248.824565][  T676]  #0: c0000000057f42c8 (lock#4){+.+.}-{3:3}, at: lru_add_drain_all+0x68/0x760 
[  248.824679][  T676] 2 locks held by kworker/56:1/719: 
[  248.824742][  T676]  #0: c00000000bcc8938 ((wq_completion)rcu_gp){+.+.}-{0:0}, at: process_one_work+0x21c/0x900 
[  248.824857][  T676]  #1: c000001ff2b57c90 ((work_completion)(&rew.rew_work)){+.+.}-{0:0}, at: process_one_work+0x21c/0x900 
[  248.825026][  T676] 3 locks held by (spawn)/2692: 
[  248.825077][  T676] 1 lock held by auditd/3163: 
[  248.825135][  T676]  #0: c0000000057c9ee8 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x254/0x600 
[  248.825296][  T676] ============================================= 
[  248.825296][  T676]  

> 
>  .../include/asm/book3s/64/tlbflush-radix.h    | 15 ++++
>  arch/powerpc/include/asm/hvcall.h             | 34 +++++++-
>  arch/powerpc/include/asm/mmu.h                |  4 +
>  arch/powerpc/include/asm/plpar_wrappers.h     | 52 ++++++++++++
>  arch/powerpc/kernel/dt_cpu_ftrs.c             |  1 +
>  arch/powerpc/kernel/prom_init.c               | 13 +--
>  arch/powerpc/mm/book3s64/radix_tlb.c          | 82 +++++++++++++++++--
>  arch/powerpc/mm/init_64.c                     |  5 +-
>  arch/powerpc/platforms/pseries/lpar.c         |  8 +-
>  9 files changed, 197 insertions(+), 17 deletions(-)
> 
> -- 
> 2.21.3
> 

^ permalink raw reply

* Re: [V2 PATCH 1/3] Refactoring powerpc code for carrying over IMA measurement logs, to move non architecture specific code to security/ima.
From: Thiago Jung Bauermann @ 2020-07-16 17:51 UTC (permalink / raw)
  To: Prakhar Srivastava
  Cc: kstewart, mark.rutland, gregkh, bhsharma, tao.li, zohar, paulus,
	vincenzo.frascino, will, nramas, frowand.list, masahiroy, jmorris,
	takahiro.akashi, linux-arm-kernel, catalin.marinas, serge,
	devicetree, pasha.tatashin, robh+dt, hsinyi, tusharsu, tglx,
	allison, christophe.leroy, mbrugger, balajib, dmitry.kasatkin,
	linux-kernel, linux-security-module, james.morse, linux-integrity,
	linuxppc-dev
In-Reply-To: <1385c8bb-cd25-8dc4-7224-8e27135f3356@linux.microsoft.com>


Hello Prakhar,

Prakhar Srivastava <prsriva@linux.microsoft.com> writes:

> On 6/19/20 5:19 PM, Thiago Jung Bauermann wrote:
>>
>> Prakhar Srivastava <prsriva@linux.microsoft.com> writes:
>>
>>> Powerpc has support to carry over the IMA measurement logs. Refatoring the
>>> non-architecture specific code out of arch/powerpc and into security/ima.
>>>
>>> The code adds support for reserving and freeing up of memory for IMA measurement
>>> logs.
>>
>> Last week, Mimi provided this feedback:
>>
>> "From your patch description, this patch should be broken up.  Moving
>> the non-architecture specific code out of powerpc should be one patch.
>>   Additional support should be in another patch.  After each patch, the
>> code should work properly."
>>
>> That's not what you do here. You move the code, but you also make other
>> changes at the same time. This has two problems:
>>
>> 1. It makes the patch harder to review, because it's very easy to miss a
>>     change.
>>
>> 2. If in the future a git bisect later points to this patch, it's not
>>     clear whether the problem is because of the code movement, or because
>>     of the other changes.
>>
>> When you move code, ideally the patch should only make the changes
>> necessary to make the code work at its new location. The patch which
>> does code movement should not cause any change in behavior.
>>
>> Other changes should go in separate patches, either before or after the
>> one moving the code.
>>
>> More comments below.
>>
> Hi Thiago,
>
> Apologies for the delayed response i was away for a few days.
> I am working on breaking up the changes so that its easier to review and update
> as well.

No problem.

>
> Thanks,
> Prakhar Srivastava
>
>>>
>>> ---
>>>   arch/powerpc/include/asm/ima.h     |  10 ---
>>>   arch/powerpc/kexec/ima.c           | 126 ++---------------------------
>>>   security/integrity/ima/ima_kexec.c | 116 ++++++++++++++++++++++++++
>>>   3 files changed, 124 insertions(+), 128 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/ima.h b/arch/powerpc/include/asm/ima.h
>>> index ead488cf3981..c29ec86498f8 100644
>>> --- a/arch/powerpc/include/asm/ima.h
>>> +++ b/arch/powerpc/include/asm/ima.h
>>> @@ -4,15 +4,6 @@
>>>
>>>   struct kimage;
>>>
>>> -int ima_get_kexec_buffer(void **addr, size_t *size);
>>> -int ima_free_kexec_buffer(void);
>>> -
>>> -#ifdef CONFIG_IMA
>>> -void remove_ima_buffer(void *fdt, int chosen_node);
>>> -#else
>>> -static inline void remove_ima_buffer(void *fdt, int chosen_node) {}
>>> -#endif
>>> -
>>>   #ifdef CONFIG_IMA_KEXEC
>>>   int arch_ima_add_kexec_buffer(struct kimage *image, unsigned long load_addr,
>>>   			      size_t size);
>>> @@ -22,7 +13,6 @@ int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node);
>>>   static inline int setup_ima_buffer(const struct kimage *image, void *fdt,
>>>   				   int chosen_node)
>>>   {
>>> -	remove_ima_buffer(fdt, chosen_node);
>>>   	return 0;
>>>   }
>>
>> This is wrong. Even if the currently running kernel doesn't have
>> CONFIG_IMA_KEXEC, it should remove the IMA buffer property and memory
>> reservation from the FDT that is being prepared for the next kernel.
>>
>> This is because the IMA kexec buffer is useless for the next kernel,
>> regardless of whether the current kernel supports CONFIG_IMA_KEXEC or
>> not. Keeping it around would be a waste of memory.
>>
> I will keep it in my next revision.
> My understanding was the reserved memory is freed and property removed when IMA
> loads the logs on init.

If CONFIG_IMA_KEXEC is set, then yes. If it isn't then that needs to
happen in the function above.

> During setup_fdt in kexec, a duplicate copy of the dt is
> used, but memory still needs to be allocated, thus the property itself indicats
> presence of reserved memory.
> 
>>> @@ -179,13 +64,18 @@ int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node)
>>>   	int ret, addr_cells, size_cells, entry_size;
>>>   	u8 value[16];
>>>
>>> -	remove_ima_buffer(fdt, chosen_node);
>>
>> This is wrong, for the same reason stated above.
>>
>>>   	if (!image->arch.ima_buffer_size)
>>>   		return 0;
>>>
>>> -	ret = get_addr_size_cells(&addr_cells, &size_cells);
>>> -	if (ret)
>>> +	ret = fdt_address_cells(fdt, chosen_node);
>>> +	if (ret < 0)
>>> +		return ret;
>>> +	addr_cells = ret;
>>> +
>>> +	ret = fdt_size_cells(fdt, chosen_node);
>>> +	if (ret < 0)
>>>   		return ret;
>>> +	size_cells = ret;
>>>
>>>   	entry_size = 4 * (addr_cells + size_cells);
>>>
>>
>> I liked this change. Thanks! I agree it's better to use
>> fdt_address_cells() and fdt_size_cells() here.
>>
>> But it should be in a separate patch. Either before or after the one
>> moving the code.
>>
>>> diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
>>> index 121de3e04af2..e1e6d6154015 100644
>>> --- a/security/integrity/ima/ima_kexec.c
>>> +++ b/security/integrity/ima/ima_kexec.c
>>> @@ -10,8 +10,124 @@
>>>   #include <linux/seq_file.h>
>>>   #include <linux/vmalloc.h>
>>>   #include <linux/kexec.h>
>>> +#include <linux/of.h>
>>> +#include <linux/memblock.h>
>>> +#include <linux/libfdt.h>
>>>   #include "ima.h"
>>>
>>> +static int get_addr_size_cells(int *addr_cells, int *size_cells)
>>> +{
>>> +	struct device_node *root;
>>> +
>>> +	root = of_find_node_by_path("/");
>>> +	if (!root)
>>> +		return -EINVAL;
>>> +
>>> +	*addr_cells = of_n_addr_cells(root);
>>> +	*size_cells = of_n_size_cells(root);
>>> +
>>> +	of_node_put(root);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int do_get_kexec_buffer(const void *prop, int len, unsigned long *addr,
>>> +			       size_t *size)
>>> +{
>>> +	int ret, addr_cells, size_cells;
>>> +
>>> +	ret = get_addr_size_cells(&addr_cells, &size_cells);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	if (len < 4 * (addr_cells + size_cells))
>>> +		return -ENOENT;
>>> +
>>> +	*addr = of_read_number(prop, addr_cells);
>>> +	*size = of_read_number(prop + 4 * addr_cells, size_cells);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +/**
>>> + * ima_get_kexec_buffer - get IMA buffer from the previous kernel
>>> + * @addr:	On successful return, set to point to the buffer contents.
>>> + * @size:	On successful return, set to the buffer size.
>>> + *
>>> + * Return: 0 on success, negative errno on error.
>>> + */
>>> +int ima_get_kexec_buffer(void **addr, size_t *size)
>>> +{
>>> +	int ret, len;
>>> +	unsigned long tmp_addr;
>>> +	size_t tmp_size;
>>> +	const void *prop;
>>> +
>>> +	prop = of_get_property(of_chosen, "linux,ima-kexec-buffer", &len);
>>> +	if (!prop)
>>> +		return -ENOENT;
>>> +
>>> +	ret = do_get_kexec_buffer(prop, len, &tmp_addr, &tmp_size);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	*addr = __va(tmp_addr);
>>> +	*size = tmp_size;
>>> +
>>> +	return 0;
>>> +}
>>
>> The functions above were moved without being changed. Good.
>>
>>> +/**
>>> + * ima_free_kexec_buffer - free memory used by the IMA buffer
>>> + */
>>> +int ima_free_kexec_buffer(void)
>>> +{
>>> +	int ret;
>>> +	unsigned long addr;
>>> +	size_t size;
>>> +	struct property *prop;
>>> +
>>> +	prop = of_find_property(of_chosen, "linux,ima-kexec-buffer", NULL);
>>> +	if (!prop)
>>> +		return -ENOENT;
>>> +
>>> +	ret = do_get_kexec_buffer(prop->value, prop->length, &addr, &size);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	ret = of_remove_property(of_chosen, prop);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	return memblock_free(__pa(addr), size);
>>
>> Here you added a __pa() call. Do you store a virtual address in
>> linux,ima-kexec-buffer property? Doesn't it make more sense to store a
>> physical address?
>>
> trying to minimize the changes here as do_get_kexec_buffer return the va.
> I will refactor this to remove the double translation.

In the powerpc version, do_get_kexec_buffer() returns the pa, and one of
its callers does the va translation. I think that worked well.

>> Even if making this change is the correct thing to do, it should be a
>> separate patch, unless it can't be avoided. And if that is the case,
>> then it should be explained in the commit message.
>>
>>> +
>>> +}
>>> +
>>> +/**
>>> + * remove_ima_buffer - remove the IMA buffer property and reservation from @fdt
>>> + *
>>> + * The IMA measurement buffer is of no use to a subsequent kernel, so we always
>>> + * remove it from the device tree.
>>> + */
>>> +void remove_ima_buffer(void *fdt, int chosen_node)
>>> +{
>>> +	int ret, len;
>>> +	unsigned long addr;
>>> +	size_t size;
>>> +	const void *prop;
>>> +
>>> +	prop = fdt_getprop(fdt, chosen_node, "linux,ima-kexec-buffer", &len);
>>> +	if (!prop)
>>> +		return;
>>> +
>>> +	do_get_kexec_buffer(prop, len, &addr, &size);
>>> +	ret = fdt_delprop(fdt, chosen_node, "linux,ima-kexec-buffer");
>>> +	if (ret < 0)
>>> +		return;
>>> +
>>> +	memblock_free(addr, size);
>>> +}
>>
>> Here is another function that changed when moved. This one I know to be
>> wrong. You're confusing the purposes of remove_ima_buffer() and
>> ima_free_kexec_buffer().
>>
>> You did send me a question about them nearly three weeks ago which I
>> only answered today, so I apologize. Also, their names could more
>> clearly reflect their differences, so it's bad naming on my part.
>>
>> With IMA kexec buffers, there are two kernels (and thus their two
>> respective, separate device trees) to be concerned about:
>>
>> 1. the currently running kernel, which uses the live device tree
>> (accessed with the of_* functions) and the memblock subsystem;
>>
>> 2. the kernel which is being loaded by kexec, which will use the FDT
>> blob being passed around as argument to these functions, and the memory
>> reservations in the memory reservation table of the FDT blob.
>>
>> ima_free_kexec_buffer() is used by IMA in the currently running kernel.
>> Therefore the device tree it is concerned about is the live one, and
>> thus uses the of_* functions to access it. And uses memblock to change
>> the memory reservation.
>>
>> remove_ima_buffer() on the other hand is used by the kexec code to
>> prepare an FDT blob for the kernel that is being loaded. It should not
>> make any changes to live device tree, nor to memblock allocations. It
>> should only make changes to the FDT blob.
>>
> Thank you for this, greatly appreciate clearing my misunderstandings.

You're welcome. Sorry again for not answering your question before you
sent this patch series.

-- 
Thiago Jung Bauermann
IBM Linux Technology Center

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Mathieu Desnoyers @ 2020-07-16 18:58 UTC (permalink / raw)
  To: Nicholas Piggin, paulmck, Alan Stern
  Cc: linux-arch, Arnd Bergmann, Peter Zijlstra, x86, linux-kernel,
	linux-mm, Andy Lutomirski, linuxppc-dev
In-Reply-To: <1370747990.15974.1594915396143.JavaMail.zimbra@efficios.com>

----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers
> mathieu.desnoyers@efficios.com wrote:
> 
>> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote:
>>> I should be more complete here, especially since I was complaining
>>> about unclear barrier comment :)
>>> 
>>> 
>>> CPU0                     CPU1
>>> a. user stuff            1. user stuff
>>> b. membarrier()          2. enter kernel
>>> c. smp_mb()              3. smp_mb__after_spinlock(); // in __schedule
>>> d. read rq->curr         4. rq->curr switched to kthread
>>> e. is kthread, skip IPI  5. switch_to kthread
>>> f. return to user        6. rq->curr switched to user thread
>>> g. user stuff            7. switch_to user thread
>>>                         8. exit kernel
>>>                         9. more user stuff
>>> 
>>> What you're really ordering is a, g vs 1, 9 right?
>>> 
>>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
>>> etc.
>>> 
>>> Userspace does not care where the barriers are exactly or what kernel
>>> memory accesses might be being ordered by them, so long as there is a
>>> mb somewhere between a and g, and 1 and 9. Right?
>> 
>> This is correct.
> 
> Actually, sorry, the above is not quite right. It's been a while
> since I looked into the details of membarrier.
> 
> The smp_mb() at the beginning of membarrier() needs to be paired with a
> smp_mb() _after_ rq->curr is switched back to the user thread, so the
> memory barrier is between store to rq->curr and following user-space
> accesses.
> 
> The smp_mb() at the end of membarrier() needs to be paired with the
> smp_mb__after_spinlock() at the beginning of schedule, which is
> between accesses to userspace memory and switching rq->curr to kthread.
> 
> As to *why* this ordering is needed, I'd have to dig through additional
> scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ?

Thinking further about this, I'm beginning to consider that maybe we have been
overly cautious by requiring memory barriers before and after store to rq->curr.

If CPU0 observes a CPU1's rq->curr->mm which differs from its own process (current)
while running the membarrier system call, it necessarily means that CPU1 had
to issue smp_mb__after_spinlock when entering the scheduler, between any user-space
loads/stores and update of rq->curr.

Requiring a memory barrier between update of rq->curr (back to current process's
thread) and following user-space memory accesses does not seem to guarantee
anything more than what the initial barrier at the beginning of __schedule already
provides, because the guarantees are only about accesses to user-space memory.

Therefore, with the memory barrier at the beginning of __schedule, just observing that
CPU1's rq->curr differs from current should guarantee that a memory barrier was issued
between any sequentially consistent instructions belonging to the current process on
CPU1.

Or am I missing/misremembering an important point here ?

Thanks,

Mathieu

> 
> Thanks,
> 
> Mathieu
> 
> 
>> Note that the accesses to user-space memory can be
>> done either by user-space code or kernel code, it doesn't matter.
>> However, in order to be considered as happening before/after
>> either membarrier or the matching compiler barrier, kernel code
>> needs to have causality relationship with user-space execution,
>> e.g. user-space does a system call, or returns from a system call.
>> 
>> In the case of io_uring, submitting a request or returning from waiting
>> on request completion appear to provide this causality relationship.
>> 
>> Thanks,
>> 
>> Mathieu
>> 
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* [PATCH] powerpc/64: Fix an out of date comment about MMIO ordering
From: Palmer Dabbelt @ 2020-07-16 19:38 UTC (permalink / raw)
  To: Will Deacon
  Cc: kernel-team, bigeasy, Palmer Dabbelt, linuxppc-dev, npiggin,
	linux-kernel, paulus, tglx, msuchanek, jniethe5

From: Palmer Dabbelt <palmerdabbelt@google.com>

This primitive has been renamed, but because it was spelled incorrectly in the
first place it must have escaped the fixup patch.  As far as I can tell this
logic is still correct: smp_mb__after_spinlock() uses the default smp_mb()
implementation, which is "sync" rather than "hwsync" but those are the same
(though I'm not that familiar with PowerPC).

Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
---
 arch/powerpc/kernel/entry_64.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index b3c9f15089b6..7b38b4daca93 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -357,7 +357,7 @@ _GLOBAL(_switch)
 	 * kernel/sched/core.c).
 	 *
 	 * Uncacheable stores in the case of involuntary preemption must
-	 * be taken care of. The smp_mb__before_spin_lock() in __schedule()
+	 * be taken care of. The smp_mb__after_spinlock() in __schedule()
 	 * is implemented as hwsync on powerpc, which orders MMIO too. So
 	 * long as there is an hwsync in the context switch path, it will
 	 * be executed on the source CPU after the task has performed
-- 
2.28.0.rc0.105.gf9edc3c819-goog


^ permalink raw reply related

* Re: [PATCH v2 2/3] ASoC: bindings: fsl-asoc-card: Support hp-det-gpio and mic-det-gpio
From: Rob Herring @ 2020-07-16 19:53 UTC (permalink / raw)
  To: Shengjiu Wang
  Cc: devicetree, alsa-devel, timur, lgirdwood, linux-kernel, Xiubo.Lee,
	tiwai, robh+dt, perex, nicoleotsuka, broonie, festevam,
	linuxppc-dev
In-Reply-To: <1594822179-1849-3-git-send-email-shengjiu.wang@nxp.com>

On Wed, 15 Jul 2020 22:09:38 +0800, Shengjiu Wang wrote:
> Add headphone and microphone detection GPIO support.
> These properties are optional.
> 
> Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
> Acked-by: Nicolin Chen <nicoleotsuka@gmail.com>
> ---
>  Documentation/devicetree/bindings/sound/fsl-asoc-card.txt | 3 +++
>  1 file changed, 3 insertions(+)
> 

Acked-by: Rob Herring <robh@kernel.org>

^ permalink raw reply

* Re: [PATCH net-next] ibmvnic: Increase driver logging
From: Jakub Kicinski @ 2020-07-16 20:22 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: drt, netdev, Thomas Falcon, linuxppc-dev, David Miller
In-Reply-To: <20200716160736.GI32107@kitsune.suse.cz>

On Thu, 16 Jul 2020 18:07:37 +0200 Michal Suchánek wrote:
> On Thu, Jul 16, 2020 at 10:59:58AM -0500, Thomas Falcon wrote:
> > On 7/15/20 8:29 PM, David Miller wrote:  
> > > From: Jakub Kicinski <kuba@kernel.org>
> > > Date: Wed, 15 Jul 2020 17:06:32 -0700
> > >   
> > > > On Wed, 15 Jul 2020 18:51:55 -0500 Thomas Falcon wrote:  
> > > > >   	free_netdev(netdev);
> > > > >   	dev_set_drvdata(&dev->dev, NULL);
> > > > > +	netdev_info(netdev, "VNIC client device has been successfully removed.\n");  
> > > > A step too far, perhaps.
> > > > 
> > > > In general this patch looks a little questionable IMHO, this amount of
> > > > logging output is not commonly seen in drivers. All the the info
> > > > messages are just static text, not even carrying any extra information.
> > > > In an era of ftrace, and bpftrace, do we really need this?  
> > > Agreed, this is too much.  This is debugging, and thus suitable for tracing
> > > facilities, at best.  
> > 
> > Thanks for your feedback. I see now that I was overly aggressive with this
> > patch to be sure, but it would help with narrowing down problems at a first
> > glance, should they arise. The driver in its current state logs very little
> > of what is it doing without the use of additional debugging or tracing
> > facilities. Would it be worth it to pursue a less aggressive version or
> > would that be dead on arrival? What are acceptable driver operations to log
> > at this level?  

Sadly it's much more of an art than hard science. Most networking
drivers will print identifying information when they probe the device
and then only about major config changes or when link comes up or goes
down. And obviously when anything unexpected, like an error happens,
that's key.

You seem to be adding start / end information for each driver init /
deinit stage. I'd say try to focus on the actual errors you're trying
to catch.

> Also would it be advisable to add the messages as pr_dbg to be enabled on demand?

I personally have had a pretty poor experience with pr_debug() because
CONFIG_DYNAMIC_DEBUG is not always enabled. Since you're just printing
static text there shouldn't be much difference between pr_debug and
ftrace and/or bpftrace, honestly.

Again, slightly hard to advise not knowing what you're trying to catch.

^ permalink raw reply

* Re: [PATCH v3 10/12] ppc64/kexec_file: prepare elfcore header for crashing kernel
From: Hari Bathini @ 2020-07-16 21:07 UTC (permalink / raw)
  To: Thiago Jung Bauermann
  Cc: Pingfan Liu, Petr Tesarik, Nayna Jain, Kexec-ml,
	Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
	Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <87tuy88ai7.fsf@morokweng.localdomain>



On 16/07/20 7:52 am, Thiago Jung Bauermann wrote:
> 
> Hari Bathini <hbathini@linux.ibm.com> writes:
> 
>>  /**
>> + * get_crash_memory_ranges - Get crash memory ranges. This list includes
>> + *                           first/crashing kernel's memory regions that
>> + *                           would be exported via an elfcore.
>> + * @mem_ranges:              Range list to add the memory ranges to.
>> + *
>> + * Returns 0 on success, negative errno on error.
>> + */
>> +static int get_crash_memory_ranges(struct crash_mem **mem_ranges)
>> +{
>> +	struct memblock_region *reg;
>> +	struct crash_mem *tmem;
>> +	int ret;
>> +
>> +	for_each_memblock(memory, reg) {
>> +		u64 base, size;
>> +
>> +		base = (u64)reg->base;
>> +		size = (u64)reg->size;
>> +
>> +		/* Skip backup memory region, which needs a separate entry */
>> +		if (base == BACKUP_SRC_START) {
>> +			if (size > BACKUP_SRC_SIZE) {
>> +				base = BACKUP_SRC_END + 1;
>> +				size -= BACKUP_SRC_SIZE;
>> +			} else
>> +				continue;
>> +		}
>> +
>> +		ret = add_mem_range(mem_ranges, base, size);
>> +		if (ret)
>> +			goto out;
>> +
>> +		/* Try merging adjacent ranges before reallocation attempt */
>> +		if ((*mem_ranges)->nr_ranges == (*mem_ranges)->max_nr_ranges)
>> +			sort_memory_ranges(*mem_ranges, true);
>> +	}
>> +
>> +	/* Reallocate memory ranges if there is no space to split ranges */
>> +	tmem = *mem_ranges;
>> +	if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) {
>> +		tmem = realloc_mem_ranges(mem_ranges);
>> +		if (!tmem)
>> +			goto out;
>> +	}
>> +
>> +	/* Exclude crashkernel region */
>> +	ret = crash_exclude_mem_range(tmem, crashk_res.start, crashk_res.end);
>> +	if (ret)
>> +		goto out;
>> +
>> +	ret = add_rtas_mem_range(mem_ranges);
>> +	if (ret)
>> +		goto out;
>> +
>> +	ret = add_opal_mem_range(mem_ranges);
>> +	if (ret)
>> +		goto out;
> 
> Maybe I'm confused, but don't you add the RTAS and OPAL regions as
> usable memory for the crashkernel? In that case they shouldn't show up
> in the core file.

kexec-tools does the same thing. I am not endorsing it but I was trying to stay
in parity to avoid breaking any userspace tools/commands. But as you rightly
pointed, this is NOT right. The right thing to do, to get the rtas/opal data at
the time of crash, is to have a backup region for them just like we have for
the first 64K memory. I was hoping to do that later.

Will check how userspace tools respond to dropping these regions. If that makes
the tools unhappy, will retain the regions with a FIXME. Sorry about the confusion.

Thanks
Hari

^ permalink raw reply

* Re: [PATCH v3 03/12] powerpc/kexec_file: add helper functions for getting memory ranges
From: Hari Bathini @ 2020-07-16 21:08 UTC (permalink / raw)
  To: Thiago Jung Bauermann
  Cc: Pingfan Liu, Petr Tesarik, Nayna Jain, Kexec-ml,
	Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
	Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <874kq98xo4.fsf@morokweng.localdomain>



On 15/07/20 5:19 am, Thiago Jung Bauermann wrote:
> 

<snip>

> <snip>
> 
>> +/**
>> + * get_mem_rngs_size - Get the allocated size of mrngs based on
>> + *                     max_nr_ranges and chunk size.
>> + * @mrngs:             Memory ranges.
>> + *
>> + * Returns the maximum no. of ranges.
> 
> This isn't correct. It returns the maximum size of @mrngs.

True. Will update..

> <snip>
> 
>> +/**
>> + * add_tce_mem_ranges - Adds tce-table range to the given memory ranges list.
>> + * @mem_ranges:         Range list to add the memory range(s) to.
>> + *
>> + * Returns 0 on success, negative errno on error.
>> + */
>> +int add_tce_mem_ranges(struct crash_mem **mem_ranges)
>> +{
>> +	struct device_node *dn;
>> +	int ret;
>> +
>> +	for_each_node_by_type(dn, "pci") {
>> +		u64 base;
>> +		u32 size;
>> +
>> +		ret = of_property_read_u64(dn, "linux,tce-base", &base);
>> +		ret |= of_property_read_u32(dn, "linux,tce-size", &size);
>> +		if (!ret)
> 
> Shouldn't the condition be `ret` instead of `!ret`?

Oops! Will fix it.

>> +/**
>> + * sort_memory_ranges - Sorts the given memory ranges list.
>> + * @mem_ranges:         Range list to sort.
>> + * @merge:              If true, merge the list after sorting.
>> + *
>> + * Returns nothing.
>> + */
>> +void sort_memory_ranges(struct crash_mem *mrngs, bool merge)
>> +{
>> +	struct crash_mem_range *rngs;
>> +	struct crash_mem_range rng;
>> +	int i, j, idx;
>> +
>> +	if (!mrngs)
>> +		return;
>> +
>> +	/* Sort the ranges in-place */
>> +	rngs = &mrngs->ranges[0];
>> +	for (i = 0; i < mrngs->nr_ranges; i++) {
>> +		idx = i;
>> +		for (j = (i + 1); j < mrngs->nr_ranges; j++) {
>> +			if (rngs[idx].start > rngs[j].start)
>> +				idx = j;
>> +		}
>> +		if (idx != i) {
>> +			rng = rngs[idx];
>> +			rngs[idx] = rngs[i];
>> +			rngs[i] = rng;
>> +		}
>> +	}
> 
> Would it work using sort() from lib/sort.c here?

Yeah. I think we could reuse it with a simple compare callback. Will do that.

Thanks
Hari

^ permalink raw reply

* Re: [PATCH v3 04/12] ppc64/kexec_file: avoid stomping memory used by special regions
From: Hari Bathini @ 2020-07-16 21:09 UTC (permalink / raw)
  To: Thiago Jung Bauermann
  Cc: Pingfan Liu, Petr Tesarik, Nayna Jain, Kexec-ml,
	Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
	Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <87365t8pse.fsf@morokweng.localdomain>



On 15/07/20 8:09 am, Thiago Jung Bauermann wrote:
> 
> Hari Bathini <hbathini@linux.ibm.com> writes:
> 

<snip>
 
>> +/**
>> + * __locate_mem_hole_top_down - Looks top down for a large enough memory hole
>> + *                              in the memory regions between buf_min & buf_max
>> + *                              for the buffer. If found, sets kbuf->mem.
>> + * @kbuf:                       Buffer contents and memory parameters.
>> + * @buf_min:                    Minimum address for the buffer.
>> + * @buf_max:                    Maximum address for the buffer.
>> + *
>> + * Returns 0 on success, negative errno on error.
>> + */
>> +static int __locate_mem_hole_top_down(struct kexec_buf *kbuf,
>> +				      u64 buf_min, u64 buf_max)
>> +{
>> +	int ret = -EADDRNOTAVAIL;
>> +	phys_addr_t start, end;
>> +	u64 i;
>> +
>> +	for_each_mem_range_rev(i, &memblock.memory, NULL, NUMA_NO_NODE,
>> +			       MEMBLOCK_NONE, &start, &end, NULL) {
>> +		if (start > buf_max)
>> +			continue;
>> +
>> +		/* Memory hole not found */
>> +		if (end < buf_min)
>> +			break;
>> +
>> +		/* Adjust memory region based on the given range */
>> +		if (start < buf_min)
>> +			start = buf_min;
>> +		if (end > buf_max)
>> +			end = buf_max;
>> +
>> +		start = ALIGN(start, kbuf->buf_align);
>> +		if (start < end && (end - start + 1) >= kbuf->memsz) {
> 
> This is why I dislike using start and end to express address ranges:
> 
> While struct resource seems to use the [address, end] convention, my

struct crash_mem also uses [address, end] convention.
This off-by-one error did not cause any issues as the hole start and size we try to find
are at least page aligned.

Nonetheless, I think fixing 'end' early in the loop with "end -= 1" would ensure
correctness while continuing to use the same convention for structs crash_mem & resource.

Thanks
Hari

^ permalink raw reply

* Re: [PATCH v3 05/12] powerpc/drmem: make lmb walk a bit more flexible
From: Hari Bathini @ 2020-07-16 21:09 UTC (permalink / raw)
  To: Thiago Jung Bauermann
  Cc: Pingfan Liu, Petr Tesarik, Nayna Jain, Kexec-ml,
	Mahesh J Salgaonkar, Mimi Zohar, lkml, linuxppc-dev, Sourabh Jain,
	Andrew Morton, Dave Young, Vivek Goyal, Eric Biederman
In-Reply-To: <871rld8mic.fsf@morokweng.localdomain>



On 15/07/20 9:20 am, Thiago Jung Bauermann wrote:
> 
> Hari Bathini <hbathini@linux.ibm.com> writes:
> 
>> @@ -534,7 +537,7 @@ static int __init early_init_dt_scan_memory_ppc(unsigned long node,
>>  #ifdef CONFIG_PPC_PSERIES
>>  	if (depth == 1 &&
>>  	    strcmp(uname, "ibm,dynamic-reconfiguration-memory") == 0) {
>> -		walk_drmem_lmbs_early(node, early_init_drmem_lmb);
>> +		walk_drmem_lmbs_early(node, NULL, early_init_drmem_lmb);
> 
> walk_drmem_lmbs_early() can now fail. Should this failure be propagated
> as a return value of early_init_dt_scan_memory_ppc()?
  
> 
>>  		return 0;
>>  	}
>>  #endif
> <snip>
> 
>> @@ -787,7 +790,7 @@ static int __init parse_numa_properties(void)
>>  	 */
>>  	memory = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
>>  	if (memory) {
>> -		walk_drmem_lmbs(memory, numa_setup_drmem_lmb);
>> +		walk_drmem_lmbs(memory, NULL, numa_setup_drmem_lmb);
> 
> Similarly here. Now that this call can fail, should
> parse_numa_properties() handle or propagate the failure?

They would still not fail unless the callbacks early_init_drmem_lmb() & numa_setup_drmem_lmb()
are updated to have failure scenarios. Also, these call sites always ignored failure scenarios
even before walk_drmem_lmbs() was introduced. So, I prefer to keep them the way they are?

Thanks
Hari

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox