* Re: [PATCH v2 7/9] trace_uprobe/sdt: Fix multiple update of same reference counter
From: Ravi Bangoria @ 2018-04-11 4:28 UTC (permalink / raw)
To: Oleg Nesterov
Cc: mhiramat, peterz, srikar, rostedt, acme, ananth, akpm,
alexander.shishkin, alexis.berlemont, corbet, dan.j.williams,
jolsa, kan.liang, kjlx, kstewart, linux-doc, linux-kernel,
linux-mm, milian.wolff, mingo, namhyung, naveen.n.rao, pc, tglx,
yao.jin, fengguang.wu, jglisse, Ravi Bangoria
In-Reply-To: <20180410110633.GA29063@redhat.com>
Hi Oleg,
On 04/10/2018 04:36 PM, Oleg Nesterov wrote:
> Hi Ravi,
>
> On 04/10, Ravi Bangoria wrote:
>>> and what if __mmu_notifier_register() fails simply because signal_pending() == T?
>>> see mm_take_all_locks().
>>>
>>> at first glance this all look suspicious and sub-optimal,
>> Yes. I should have added checks for failure cases.
>> Will fix them in v3.
> And what can you do if it fails? Nothing except report the problem. But
> signal_pending() is not the unlikely or error condition, it should not
> cause the tracing errors.
...
> Plus mm_take_all_locks() is very heavy... BTW, uprobe_mmap_callback() is
> called unconditionally. Whatever it does, can we at least move it after
> the no_uprobe_events() check? Can't we also check MMF_HAS_UPROBES?
Sure, I'll move it after these conditions.
> Either way, I do not feel that mmu_notifier is the right tool... Did you
> consider the uprobe_clear_state() hook we already have?
Ah! This is really a good idea. We don't need mmu_notifier then.
Thanks for suggestion,
Ravi
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH v3 2/2] mm: remove odd HAVE_PTE_SPECIAL
From: Laurent Dufour @ 2018-04-11 8:03 UTC (permalink / raw)
To: linux-kernel, linux-mm, linuxppc-dev, x86, linux-doc,
linux-snps-arc, linux-arm-kernel, linux-riscv, linux-s390,
linux-sh, sparclinux, Jerome Glisse, mhocko, aneesh.kumar, akpm,
mpe, benh, paulus, Jonathan Corbet, Catalin Marinas, Will Deacon,
Yoshinori Sato, Rich Felker, David S . Miller, Thomas Gleixner,
Ingo Molnar, Vineet Gupta, Palmer Dabbelt, Albert Ou,
Martin Schwidefsky, Heiko Carstens, David Rientjes, Robin Murphy
In-Reply-To: <1523433816-14460-1-git-send-email-ldufour@linux.vnet.ibm.com>
Remove the additional define HAVE_PTE_SPECIAL and rely directly on
CONFIG_ARCH_HAS_PTE_SPECIAL.
There is no functional change introduced by this patch
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
mm/memory.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 96910c625daa..7f7dc7b2a341 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -817,17 +817,12 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
* PFNMAP mappings in order to support COWable mappings.
*
*/
-#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
-# define HAVE_PTE_SPECIAL 1
-#else
-# define HAVE_PTE_SPECIAL 0
-#endif
struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte, bool with_public_device)
{
unsigned long pfn = pte_pfn(pte);
- if (HAVE_PTE_SPECIAL) {
+ if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
if (likely(!pte_special(pte)))
goto check_pfn;
if (vma->vm_ops && vma->vm_ops->find_special_page)
@@ -862,7 +857,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
return NULL;
}
- /* !HAVE_PTE_SPECIAL case follows: */
+ /* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
if (vma->vm_flags & VM_MIXEDMAP) {
@@ -881,7 +876,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
if (is_zero_pfn(pfn))
return NULL;
-check_pfn:
+
+check_pfn: __maybe_unused
if (unlikely(pfn > highest_memmap_pfn)) {
print_bad_pte(vma, addr, pte, NULL);
return NULL;
@@ -891,7 +887,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
* NOTE! We still have PageReserved() pages in the page tables.
* eg. VDSO mappings can cause them to exist.
*/
-out:
+out: __maybe_unused
return pfn_to_page(pfn);
}
@@ -904,7 +900,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
/*
* There is no pmd_special() but there may be special pmds, e.g.
* in a direct-access (dax) mapping, so let's just replicate the
- * !HAVE_PTE_SPECIAL case from vm_normal_page() here.
+ * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
*/
if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
if (vma->vm_flags & VM_MIXEDMAP) {
@@ -1933,7 +1929,8 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
* than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP
* without pte special, it would there be refcounted as a normal page.
*/
- if (!HAVE_PTE_SPECIAL && !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
+ !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
struct page *page;
/*
--
2.7.4
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH v3 0/2] move __HAVE_ARCH_PTE_SPECIAL in Kconfig
From: Laurent Dufour @ 2018-04-11 8:03 UTC (permalink / raw)
To: linux-kernel, linux-mm, linuxppc-dev, x86, linux-doc,
linux-snps-arc, linux-arm-kernel, linux-riscv, linux-s390,
linux-sh, sparclinux, Jerome Glisse, mhocko, aneesh.kumar, akpm,
mpe, benh, paulus, Jonathan Corbet, Catalin Marinas, Will Deacon,
Yoshinori Sato, Rich Felker, David S . Miller, Thomas Gleixner,
Ingo Molnar, Vineet Gupta, Palmer Dabbelt, Albert Ou,
Martin Schwidefsky, Heiko Carstens, David Rientjes, Robin Murphy
The per architecture __HAVE_ARCH_PTE_SPECIAL is defined statically in the
per architecture header files. This doesn't allow to make other
configuration dependent on it.
The first patch of this series is replacing __HAVE_ARCH_PTE_SPECIAL by
CONFIG_ARCH_HAS_PTE_SPECIAL defined into the Kconfig files,
setting it automatically when architectures was already setting it in
header file.
The second patch is removing the odd define HAVE_PTE_SPECIAL which is a
duplicate of CONFIG_ARCH_HAS_PTE_SPECIAL.
There is no functional change introduced by this series.
--
Changes since v2:
* remove __HAVE_ARCH_PTE_SPECIAL in arch/riscv/include/asm/pgtable-bits.h
* use IS_ENABLED() instead of #ifdef blocks in patch 2
Laurent Dufour (2):
mm: introduce ARCH_HAS_PTE_SPECIAL
mm: remove odd HAVE_PTE_SPECIAL
.../features/vm/pte_special/arch-support.txt | 2 +-
arch/arc/Kconfig | 1 +
arch/arc/include/asm/pgtable.h | 2 --
arch/arm/Kconfig | 1 +
arch/arm/include/asm/pgtable-3level.h | 1 -
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable.h | 2 --
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 3 ---
arch/powerpc/include/asm/pte-common.h | 3 ---
arch/riscv/Kconfig | 1 +
arch/riscv/include/asm/pgtable-bits.h | 3 ---
arch/s390/Kconfig | 1 +
arch/s390/include/asm/pgtable.h | 1 -
arch/sh/Kconfig | 1 +
arch/sh/include/asm/pgtable.h | 2 --
arch/sparc/Kconfig | 1 +
arch/sparc/include/asm/pgtable_64.h | 3 ---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable_types.h | 1 -
include/linux/pfn_t.h | 4 ++--
mm/Kconfig | 3 +++
mm/gup.c | 4 ++--
mm/memory.c | 19 ++++++++-----------
24 files changed, 25 insertions(+), 37 deletions(-)
--
2.7.4
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH v3 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL
From: Laurent Dufour @ 2018-04-11 8:03 UTC (permalink / raw)
To: linux-kernel, linux-mm, linuxppc-dev, x86, linux-doc,
linux-snps-arc, linux-arm-kernel, linux-riscv, linux-s390,
linux-sh, sparclinux, Jerome Glisse, mhocko, aneesh.kumar, akpm,
mpe, benh, paulus, Jonathan Corbet, Catalin Marinas, Will Deacon,
Yoshinori Sato, Rich Felker, David S . Miller, Thomas Gleixner,
Ingo Molnar, Vineet Gupta, Palmer Dabbelt, Albert Ou,
Martin Schwidefsky, Heiko Carstens, David Rientjes, Robin Murphy
In-Reply-To: <1523433816-14460-1-git-send-email-ldufour@linux.vnet.ibm.com>
Currently the PTE special supports is turned on in per architecture header
files. Most of the time, it is defined in arch/*/include/asm/pgtable.h
depending or not on some other per architecture static definition.
This patch introduce a new configuration variable to manage this directly
in the Kconfig files. It would later replace __HAVE_ARCH_PTE_SPECIAL.
Here notes for some architecture where the definition of
__HAVE_ARCH_PTE_SPECIAL is not obvious:
arm
__HAVE_ARCH_PTE_SPECIAL which is currently defined in
arch/arm/include/asm/pgtable-3level.h which is included by
arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.
powerpc
__HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
- arch/powerpc/include/asm/book3s/64/pgtable.h
- arch/powerpc/include/asm/pte-common.h
The first one is included if (PPC_BOOK3S & PPC64) while the second is
included in all the other cases.
So select ARCH_HAS_PTE_SPECIAL all the time.
sparc:
__HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
defined(__arch64__) which are defined through the compiler in
sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
So select ARCH_HAS_PTE_SPECIAL if SPARC64
There is no functional change introduced by this patch.
Suggested-by: Jerome Glisse <jglisse@redhat>
Reviewed-by: Jerome Glisse <jglisse@redhat>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
---
Documentation/features/vm/pte_special/arch-support.txt | 2 +-
arch/arc/Kconfig | 1 +
arch/arc/include/asm/pgtable.h | 2 --
arch/arm/Kconfig | 1 +
arch/arm/include/asm/pgtable-3level.h | 1 -
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable.h | 2 --
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/book3s/64/pgtable.h | 3 ---
arch/powerpc/include/asm/pte-common.h | 3 ---
arch/riscv/Kconfig | 1 +
arch/riscv/include/asm/pgtable-bits.h | 3 ---
arch/s390/Kconfig | 1 +
arch/s390/include/asm/pgtable.h | 1 -
arch/sh/Kconfig | 1 +
arch/sh/include/asm/pgtable.h | 2 --
arch/sparc/Kconfig | 1 +
arch/sparc/include/asm/pgtable_64.h | 3 ---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable_types.h | 1 -
include/linux/pfn_t.h | 4 ++--
mm/Kconfig | 3 +++
mm/gup.c | 4 ++--
mm/memory.c | 2 +-
24 files changed, 18 insertions(+), 27 deletions(-)
diff --git a/Documentation/features/vm/pte_special/arch-support.txt b/Documentation/features/vm/pte_special/arch-support.txt
index 055004f467d2..cd05924ea875 100644
--- a/Documentation/features/vm/pte_special/arch-support.txt
+++ b/Documentation/features/vm/pte_special/arch-support.txt
@@ -1,6 +1,6 @@
#
# Feature name: pte_special
-# Kconfig: __HAVE_ARCH_PTE_SPECIAL
+# Kconfig: ARCH_HAS_PTE_SPECIAL
# description: arch supports the pte_special()/pte_mkspecial() VM APIs
#
-----------------------
diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index d76bf4a83740..8516e2b0239a 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -44,6 +44,7 @@ config ARC
select HAVE_GENERIC_DMA_COHERENT
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_LZMA
+ select ARCH_HAS_PTE_SPECIAL
config MIGHT_HAVE_PCI
bool
diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h
index 08fe33830d4b..8ec5599a0957 100644
--- a/arch/arc/include/asm/pgtable.h
+++ b/arch/arc/include/asm/pgtable.h
@@ -320,8 +320,6 @@ PTE_BIT_FUNC(mkexec, |= (_PAGE_EXECUTE));
PTE_BIT_FUNC(mkspecial, |= (_PAGE_SPECIAL));
PTE_BIT_FUNC(mkhuge, |= (_PAGE_HW_SZ));
-#define __HAVE_ARCH_PTE_SPECIAL
-
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
return __pte((pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot));
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index a7f8e7f4b88f..c088c851b235 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -8,6 +8,7 @@ config ARM
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
+ select ARCH_HAS_PTE_SPECIAL if ARM_LPAE
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 2a4836087358..6d50a11d7793 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -219,7 +219,6 @@ static inline pte_t pte_mkspecial(pte_t pte)
pte_val(pte) |= L_PTE_SPECIAL;
return pte;
}
-#define __HAVE_ARCH_PTE_SPECIAL
#define pmd_write(pmd) (pmd_isclear((pmd), L_PMD_SECT_RDONLY))
#define pmd_dirty(pmd) (pmd_isset((pmd), L_PMD_SECT_DIRTY))
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index eb2cf4938f6d..9a3f1b1ab50c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -17,6 +17,7 @@ config ARM64
select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
select ARCH_HAS_KCOV
select ARCH_HAS_MEMBARRIER_SYNC_CORE
+ select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7e2c27e63cd8..b96c8a186908 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -306,8 +306,6 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
#define HPAGE_MASK (~(HPAGE_SIZE - 1))
#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
-#define __HAVE_ARCH_PTE_SPECIAL
-
static inline pte_t pgd_pte(pgd_t pgd)
{
return __pte(pgd_val(pgd));
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c32a181a7cbb..f7415fe25c07 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -141,6 +141,7 @@ config PPC
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_API if PPC64
+ select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE
select ARCH_HAS_SG_CHAIN
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 47b5ffc8715d..b3ac8948b257 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -319,9 +319,6 @@ extern unsigned long pci_io_base;
/* Advertise special mapping type for AGP */
#define HAVE_PAGE_AGP
-/* Advertise support for _PAGE_SPECIAL */
-#define __HAVE_ARCH_PTE_SPECIAL
-
#ifndef __ASSEMBLY__
/*
diff --git a/arch/powerpc/include/asm/pte-common.h b/arch/powerpc/include/asm/pte-common.h
index c4a72c7a8c83..03dfddb1f49a 100644
--- a/arch/powerpc/include/asm/pte-common.h
+++ b/arch/powerpc/include/asm/pte-common.h
@@ -216,9 +216,6 @@ static inline bool pte_user(pte_t pte)
#define PAGE_AGP (PAGE_KERNEL_NC)
#define HAVE_PAGE_AGP
-/* Advertise support for _PAGE_SPECIAL */
-#define __HAVE_ARCH_PTE_SPECIAL
-
#ifndef _PAGE_READ
/* if not defined, we should not find _PAGE_WRITE too */
#define _PAGE_READ 0
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 23d8acca5c90..92d331ea11bd 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -34,6 +34,7 @@ config RISCV
select THREAD_INFO_IN_TASK
select RISCV_TIMER
select GENERIC_IRQ_MULTI_HANDLER
+ select ARCH_HAS_PTE_SPECIAL
config MMU
def_bool y
diff --git a/arch/riscv/include/asm/pgtable-bits.h b/arch/riscv/include/asm/pgtable-bits.h
index 997ddbb1d370..2fa2942be221 100644
--- a/arch/riscv/include/asm/pgtable-bits.h
+++ b/arch/riscv/include/asm/pgtable-bits.h
@@ -42,7 +42,4 @@
_PAGE_WRITE | _PAGE_EXEC | \
_PAGE_USER | _PAGE_GLOBAL))
-/* Advertise support for _PAGE_SPECIAL */
-#define __HAVE_ARCH_PTE_SPECIAL
-
#endif /* _ASM_RISCV_PGTABLE_BITS_H */
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 32a0d5b958bf..5f1f4997e7e9 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -72,6 +72,7 @@ config S390
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
select ARCH_HAS_KCOV
+ select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2d24d33bf188..9809694e1389 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -171,7 +171,6 @@ static inline int is_module_addr(void *addr)
#define _PAGE_WRITE 0x020 /* SW pte write bit */
#define _PAGE_SPECIAL 0x040 /* SW associated with special page */
#define _PAGE_UNUSED 0x080 /* SW bit for pgste usage state */
-#define __HAVE_ARCH_PTE_SPECIAL
#ifdef CONFIG_MEM_SOFT_DIRTY
#define _PAGE_SOFT_DIRTY 0x002 /* SW pte soft dirty bit */
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 97fe29316476..a6c75b6806d2 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -50,6 +50,7 @@ config SUPERH
select HAVE_ARCH_AUDITSYSCALL
select HAVE_FUTEX_CMPXCHG if FUTEX
select HAVE_NMI
+ select ARCH_HAS_PTE_SPECIAL
help
The SuperH is a RISC processor targeted for use in embedded systems
and consumer electronics; it was also used in the Sega Dreamcast
diff --git a/arch/sh/include/asm/pgtable.h b/arch/sh/include/asm/pgtable.h
index 89c513a982fc..f6abfe2bca93 100644
--- a/arch/sh/include/asm/pgtable.h
+++ b/arch/sh/include/asm/pgtable.h
@@ -156,8 +156,6 @@ extern void page_table_range_init(unsigned long start, unsigned long end,
#define HAVE_ARCH_UNMAPPED_AREA
#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
-#define __HAVE_ARCH_PTE_SPECIAL
-
#include <asm-generic/pgtable.h>
#endif /* __ASM_SH_PGTABLE_H */
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 8767e45f1b2b..6b5a4f05dcb2 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -86,6 +86,7 @@ config SPARC64
select ARCH_USE_QUEUED_SPINLOCKS
select GENERIC_TIME_VSYSCALL
select ARCH_CLOCKSOURCE_DATA
+ select ARCH_HAS_PTE_SPECIAL
config ARCH_DEFCONFIG
string
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 44d6ac47e035..1393a8ac596b 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -117,9 +117,6 @@ bool kern_addr_valid(unsigned long addr);
#define _PAGE_PMD_HUGE _AC(0x0100000000000000,UL) /* Huge page */
#define _PAGE_PUD_HUGE _PAGE_PMD_HUGE
-/* Advertise support for _PAGE_SPECIAL */
-#define __HAVE_ARCH_PTE_SPECIAL
-
/* SUN4U pte bits... */
#define _PAGE_SZ4MB_4U _AC(0x6000000000000000,UL) /* 4MB Page */
#define _PAGE_SZ512K_4U _AC(0x4000000000000000,UL) /* 512K Page */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d8983df5a2bc..2399c8ce02be 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -57,6 +57,7 @@ config X86
select ARCH_HAS_KCOV if X86_64
select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_PMEM_API if X86_64
+ select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_REFCOUNT
select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64
select ARCH_HAS_SET_MEMORY
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 1e5a40673953..99fff853c944 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -65,7 +65,6 @@
#define _PAGE_PKEY_BIT2 (_AT(pteval_t, 0))
#define _PAGE_PKEY_BIT3 (_AT(pteval_t, 0))
#endif
-#define __HAVE_ARCH_PTE_SPECIAL
#define _PAGE_PKEY_MASK (_PAGE_PKEY_BIT0 | \
_PAGE_PKEY_BIT1 | \
diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index a03c2642a87c..21713dc14ce2 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -122,7 +122,7 @@ pud_t pud_mkdevmap(pud_t pud);
#endif
#endif /* __HAVE_ARCH_PTE_DEVMAP */
-#ifdef __HAVE_ARCH_PTE_SPECIAL
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
static inline bool pfn_t_special(pfn_t pfn)
{
return (pfn.val & PFN_SPECIAL) == PFN_SPECIAL;
@@ -132,5 +132,5 @@ static inline bool pfn_t_special(pfn_t pfn)
{
return false;
}
-#endif /* __HAVE_ARCH_PTE_SPECIAL */
+#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
#endif /* _LINUX_PFN_T_H_ */
diff --git a/mm/Kconfig b/mm/Kconfig
index d5004d82a1d6..1ea3e4a6a123 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -752,3 +752,6 @@ config GUP_BENCHMARK
performance of get_user_pages_fast().
See tools/testing/selftests/vm/gup_benchmark.c
+
+config ARCH_HAS_PTE_SPECIAL
+ bool
diff --git a/mm/gup.c b/mm/gup.c
index 76af4cfeaf68..d19bfbadd32a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1351,7 +1351,7 @@ static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages)
}
}
-#ifdef __HAVE_ARCH_PTE_SPECIAL
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
{
@@ -1427,7 +1427,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
{
return 0;
}
-#endif /* __HAVE_ARCH_PTE_SPECIAL */
+#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
#if defined(__HAVE_ARCH_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
static int __gup_device_huge(unsigned long pfn, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index a1f990e33e38..96910c625daa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -817,7 +817,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
* PFNMAP mappings in order to support COWable mappings.
*
*/
-#ifdef __HAVE_ARCH_PTE_SPECIAL
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
# define HAVE_PTE_SPECIAL 1
#else
# define HAVE_PTE_SPECIAL 0
--
2.7.4
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* Re: [PATCH v3 2/2] mm: remove odd HAVE_PTE_SPECIAL
From: Michal Hocko @ 2018-04-11 8:33 UTC (permalink / raw)
To: Laurent Dufour
Cc: linux-kernel, linux-mm, linuxppc-dev, x86, linux-doc,
linux-snps-arc, linux-arm-kernel, linux-riscv, linux-s390,
linux-sh, sparclinux, Jerome Glisse, aneesh.kumar, akpm, mpe,
benh, paulus, Jonathan Corbet, Catalin Marinas, Will Deacon,
Yoshinori Sato, Rich Felker, David S . Miller, Thomas Gleixner,
Ingo Molnar, Vineet Gupta, Palmer Dabbelt, Albert Ou,
Martin Schwidefsky, Heiko Carstens, David Rientjes, Robin Murphy
In-Reply-To: <1523433816-14460-3-git-send-email-ldufour@linux.vnet.ibm.com>
On Wed 11-04-18 10:03:36, Laurent Dufour wrote:
> @@ -881,7 +876,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>
> if (is_zero_pfn(pfn))
> return NULL;
> -check_pfn:
> +
> +check_pfn: __maybe_unused
> if (unlikely(pfn > highest_memmap_pfn)) {
> print_bad_pte(vma, addr, pte, NULL);
> return NULL;
> @@ -891,7 +887,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> * NOTE! We still have PageReserved() pages in the page tables.
> * eg. VDSO mappings can cause them to exist.
> */
> -out:
> +out: __maybe_unused
> return pfn_to_page(pfn);
Why do we need this ugliness all of the sudden?
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL
From: Michal Hocko @ 2018-04-11 8:34 UTC (permalink / raw)
To: Laurent Dufour
Cc: linux-kernel, linux-mm, linuxppc-dev, x86, linux-doc,
linux-snps-arc, linux-arm-kernel, linux-riscv, linux-s390,
linux-sh, sparclinux, Jerome Glisse, aneesh.kumar, akpm, mpe,
benh, paulus, Jonathan Corbet, Catalin Marinas, Will Deacon,
Yoshinori Sato, Rich Felker, David S . Miller, Thomas Gleixner,
Ingo Molnar, Vineet Gupta, Palmer Dabbelt, Albert Ou,
Martin Schwidefsky, Heiko Carstens, David Rientjes, Robin Murphy
In-Reply-To: <1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.com>
On Wed 11-04-18 10:03:35, Laurent Dufour wrote:
> Currently the PTE special supports is turned on in per architecture header
> files. Most of the time, it is defined in arch/*/include/asm/pgtable.h
> depending or not on some other per architecture static definition.
>
> This patch introduce a new configuration variable to manage this directly
> in the Kconfig files. It would later replace __HAVE_ARCH_PTE_SPECIAL.
>
> Here notes for some architecture where the definition of
> __HAVE_ARCH_PTE_SPECIAL is not obvious:
>
> arm
> __HAVE_ARCH_PTE_SPECIAL which is currently defined in
> arch/arm/include/asm/pgtable-3level.h which is included by
> arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
> So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.
>
> powerpc
> __HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
> - arch/powerpc/include/asm/book3s/64/pgtable.h
> - arch/powerpc/include/asm/pte-common.h
> The first one is included if (PPC_BOOK3S & PPC64) while the second is
> included in all the other cases.
> So select ARCH_HAS_PTE_SPECIAL all the time.
>
> sparc:
> __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
> defined(__arch64__) which are defined through the compiler in
> sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
> So select ARCH_HAS_PTE_SPECIAL if SPARC64
>
> There is no functional change introduced by this patch.
>
> Suggested-by: Jerome Glisse <jglisse@redhat>
> Reviewed-by: Jerome Glisse <jglisse@redhat>
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Looks good to me. I have checked x86 and the generic code and it looks
good to me. Anyway arch maintainers really have to double check this.
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 2/2] mm: remove odd HAVE_PTE_SPECIAL
From: Laurent Dufour @ 2018-04-11 8:41 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, linuxppc-dev, x86, linux-doc,
linux-snps-arc, linux-arm-kernel, linux-riscv, linux-s390,
linux-sh, sparclinux, Jerome Glisse, aneesh.kumar, akpm, mpe,
benh, paulus, Jonathan Corbet, Catalin Marinas, Will Deacon,
Yoshinori Sato, Rich Felker, David S . Miller, Thomas Gleixner,
Ingo Molnar, Vineet Gupta, Palmer Dabbelt, Albert Ou,
Martin Schwidefsky, Heiko Carstens, David Rientjes, Robin Murphy
In-Reply-To: <20180411083315.GA23400@dhcp22.suse.cz>
On 11/04/2018 10:33, Michal Hocko wrote:
> On Wed 11-04-18 10:03:36, Laurent Dufour wrote:
>> @@ -881,7 +876,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>>
>> if (is_zero_pfn(pfn))
>> return NULL;
>> -check_pfn:
>> +
>> +check_pfn: __maybe_unused
>> if (unlikely(pfn > highest_memmap_pfn)) {
>> print_bad_pte(vma, addr, pte, NULL);
>> return NULL;
>> @@ -891,7 +887,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>> * NOTE! We still have PageReserved() pages in the page tables.
>> * eg. VDSO mappings can cause them to exist.
>> */
>> -out:
>> +out: __maybe_unused
>> return pfn_to_page(pfn);
>
> Why do we need this ugliness all of the sudden?
Indeed the compiler doesn't complaint but in theory it should since these
labels are not used depending on CONFIG_ARCH_HAS_PTE_SPECIAL.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 2/2] mm: remove odd HAVE_PTE_SPECIAL
From: Michal Hocko @ 2018-04-11 8:49 UTC (permalink / raw)
To: Laurent Dufour
Cc: linux-kernel, linux-mm, linuxppc-dev, x86, linux-doc,
linux-snps-arc, linux-arm-kernel, linux-riscv, linux-s390,
linux-sh, sparclinux, Jerome Glisse, aneesh.kumar, akpm, mpe,
benh, paulus, Jonathan Corbet, Catalin Marinas, Will Deacon,
Yoshinori Sato, Rich Felker, David S . Miller, Thomas Gleixner,
Ingo Molnar, Vineet Gupta, Palmer Dabbelt, Albert Ou,
Martin Schwidefsky, Heiko Carstens, David Rientjes, Robin Murphy
In-Reply-To: <5bd1bb46-8f71-e6db-7fb7-43d023a37f58@linux.vnet.ibm.com>
On Wed 11-04-18 10:41:23, Laurent Dufour wrote:
> On 11/04/2018 10:33, Michal Hocko wrote:
> > On Wed 11-04-18 10:03:36, Laurent Dufour wrote:
> >> @@ -881,7 +876,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> >>
> >> if (is_zero_pfn(pfn))
> >> return NULL;
> >> -check_pfn:
> >> +
> >> +check_pfn: __maybe_unused
> >> if (unlikely(pfn > highest_memmap_pfn)) {
> >> print_bad_pte(vma, addr, pte, NULL);
> >> return NULL;
> >> @@ -891,7 +887,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> >> * NOTE! We still have PageReserved() pages in the page tables.
> >> * eg. VDSO mappings can cause them to exist.
> >> */
> >> -out:
> >> +out: __maybe_unused
> >> return pfn_to_page(pfn);
> >
> > Why do we need this ugliness all of the sudden?
> Indeed the compiler doesn't complaint but in theory it should since these
> labels are not used depending on CONFIG_ARCH_HAS_PTE_SPECIAL.
Well, such a warning would be quite pointless so I would rather not make
the code ugly. The value of unused label is quite questionable to start
with...
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 2/2] mm: remove odd HAVE_PTE_SPECIAL
From: Christophe LEROY @ 2018-04-11 8:58 UTC (permalink / raw)
To: Laurent Dufour, linux-kernel, linux-mm, linuxppc-dev, x86,
linux-doc, linux-snps-arc, linux-arm-kernel, linux-riscv,
linux-s390, linux-sh, sparclinux, Jerome Glisse, mhocko,
aneesh.kumar, akpm, mpe, benh, paulus, Jonathan Corbet,
Catalin Marinas, Will Deacon, Yoshinori Sato, Rich Felker,
David S . Miller, Thomas Gleixner, Ingo Molnar, Vineet Gupta,
Palmer Dabbelt, Albert Ou, Martin Schwidefsky, Heiko Carstens,
David Rientjes, Robin Murphy
In-Reply-To: <1523433816-14460-3-git-send-email-ldufour@linux.vnet.ibm.com>
Le 11/04/2018 à 10:03, Laurent Dufour a écrit :
> Remove the additional define HAVE_PTE_SPECIAL and rely directly on
> CONFIG_ARCH_HAS_PTE_SPECIAL.
>
> There is no functional change introduced by this patch
>
> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> ---
> mm/memory.c | 19 ++++++++-----------
> 1 file changed, 8 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 96910c625daa..7f7dc7b2a341 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -817,17 +817,12 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> * PFNMAP mappings in order to support COWable mappings.
> *
> */
> -#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> -# define HAVE_PTE_SPECIAL 1
> -#else
> -# define HAVE_PTE_SPECIAL 0
> -#endif
> struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte, bool with_public_device)
> {
> unsigned long pfn = pte_pfn(pte);
>
> - if (HAVE_PTE_SPECIAL) {
> + if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
> if (likely(!pte_special(pte)))
> goto check_pfn;
> if (vma->vm_ops && vma->vm_ops->find_special_page)
> @@ -862,7 +857,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> return NULL;
> }
>
> - /* !HAVE_PTE_SPECIAL case follows: */
> + /* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
>
> if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> if (vma->vm_flags & VM_MIXEDMAP) {
> @@ -881,7 +876,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>
> if (is_zero_pfn(pfn))
> return NULL;
> -check_pfn:
> +
> +check_pfn: __maybe_unused
See below
> if (unlikely(pfn > highest_memmap_pfn)) {
> print_bad_pte(vma, addr, pte, NULL);
> return NULL;
> @@ -891,7 +887,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> * NOTE! We still have PageReserved() pages in the page tables.
> * eg. VDSO mappings can cause them to exist.
> */
> -out:
> +out: __maybe_unused
Why do you need that change ?
There is no reason for the compiler to complain. It would complain if
the goto was within a #ifdef, but all the purpose of using IS_ENABLED()
is to allow the compiler to properly handle all possible cases. That's
all the force of IS_ENABLED() compared to ifdefs, and that the reason
why they are plebicited, ref Linux Codying style for a detailed explanation.
Christophe
> return pfn_to_page(pfn);
> }
>
> @@ -904,7 +900,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> /*
> * There is no pmd_special() but there may be special pmds, e.g.
> * in a direct-access (dax) mapping, so let's just replicate the
> - * !HAVE_PTE_SPECIAL case from vm_normal_page() here.
> + * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
> */
> if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> if (vma->vm_flags & VM_MIXEDMAP) {
> @@ -1933,7 +1929,8 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> * than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP
> * without pte special, it would there be refcounted as a normal page.
> */
> - if (!HAVE_PTE_SPECIAL && !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
> + if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
> + !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
> struct page *page;
>
> /*
>
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 2/2] mm: remove odd HAVE_PTE_SPECIAL
From: Christophe LEROY @ 2018-04-11 8:59 UTC (permalink / raw)
To: Laurent Dufour, Michal Hocko
Cc: Rich Felker, linux-doc, Catalin Marinas, Palmer Dabbelt,
Will Deacon, x86, linux-mm, paulus, sparclinux, linux-riscv,
linux-s390, Yoshinori Sato, Jonathan Corbet, linux-sh,
Ingo Molnar, linux-snps-arc, Heiko Carstens, David Rientjes,
Robin Murphy, Jerome Glisse, Thomas Gleixner, linux-arm-kernel,
Vineet Gupta, linux-kernel, aneesh.kumar, Albert Ou,
Martin Schwidefsky, akpm, linuxppc-dev, David S . Miller
In-Reply-To: <5bd1bb46-8f71-e6db-7fb7-43d023a37f58@linux.vnet.ibm.com>
Le 11/04/2018 à 10:41, Laurent Dufour a écrit :
> On 11/04/2018 10:33, Michal Hocko wrote:
>> On Wed 11-04-18 10:03:36, Laurent Dufour wrote:
>>> @@ -881,7 +876,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>>>
>>> if (is_zero_pfn(pfn))
>>> return NULL;
>>> -check_pfn:
>>> +
>>> +check_pfn: __maybe_unused
>>> if (unlikely(pfn > highest_memmap_pfn)) {
>>> print_bad_pte(vma, addr, pte, NULL);
>>> return NULL;
>>> @@ -891,7 +887,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>>> * NOTE! We still have PageReserved() pages in the page tables.
>>> * eg. VDSO mappings can cause them to exist.
>>> */
>>> -out:
>>> +out: __maybe_unused
>>> return pfn_to_page(pfn);
>>
>> Why do we need this ugliness all of the sudden?
> Indeed the compiler doesn't complaint but in theory it should since these
> labels are not used depending on CONFIG_ARCH_HAS_PTE_SPECIAL.
Why should it complain ?
Regards
Christophe
>
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 2/2] mm: remove odd HAVE_PTE_SPECIAL
From: Laurent Dufour @ 2018-04-11 9:03 UTC (permalink / raw)
To: Christophe LEROY, linux-kernel, linux-mm, linuxppc-dev, x86,
linux-doc, linux-snps-arc, linux-arm-kernel, linux-riscv,
linux-s390, linux-sh, sparclinux, Jerome Glisse, mhocko,
aneesh.kumar, akpm, mpe, benh, paulus, Jonathan Corbet,
Catalin Marinas, Will Deacon, Yoshinori Sato, Rich Felker,
David S . Miller, Thomas Gleixner, Ingo Molnar, Vineet Gupta,
Palmer Dabbelt, Albert Ou, Martin Schwidefsky, Heiko Carstens,
David Rientjes, Robin Murphy
In-Reply-To: <de6ee514-8b7e-24d0-a7ee-a8887e8b0ae9@c-s.fr>
On 11/04/2018 10:58, Christophe LEROY wrote:
>
>
> Le 11/04/2018 à 10:03, Laurent Dufour a écrit :
>> Remove the additional define HAVE_PTE_SPECIAL and rely directly on
>> CONFIG_ARCH_HAS_PTE_SPECIAL.
>>
>> There is no functional change introduced by this patch
>>
>> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>> ---
>> mm/memory.c | 19 ++++++++-----------
>> 1 file changed, 8 insertions(+), 11 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 96910c625daa..7f7dc7b2a341 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -817,17 +817,12 @@ static void print_bad_pte(struct vm_area_struct *vma,
>> unsigned long addr,
>> * PFNMAP mappings in order to support COWable mappings.
>> *
>> */
>> -#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>> -# define HAVE_PTE_SPECIAL 1
>> -#else
>> -# define HAVE_PTE_SPECIAL 0
>> -#endif
>> struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>> pte_t pte, bool with_public_device)
>> {
>> unsigned long pfn = pte_pfn(pte);
>> - if (HAVE_PTE_SPECIAL) {
>> + if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
>> if (likely(!pte_special(pte)))
>> goto check_pfn;
>> if (vma->vm_ops && vma->vm_ops->find_special_page)
>> @@ -862,7 +857,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma,
>> unsigned long addr,
>> return NULL;
>> }
>> - /* !HAVE_PTE_SPECIAL case follows: */
>> + /* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
>> if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
>> if (vma->vm_flags & VM_MIXEDMAP) {
>> @@ -881,7 +876,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma,
>> unsigned long addr,
>> if (is_zero_pfn(pfn))
>> return NULL;
>> -check_pfn:
>> +
>> +check_pfn: __maybe_unused
>
> See below
>
>> if (unlikely(pfn > highest_memmap_pfn)) {
>> print_bad_pte(vma, addr, pte, NULL);
>> return NULL;
>> @@ -891,7 +887,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma,
>> unsigned long addr,
>> * NOTE! We still have PageReserved() pages in the page tables.
>> * eg. VDSO mappings can cause them to exist.
>> */
>> -out:
>> +out: __maybe_unused
>
> Why do you need that change ?
>
> There is no reason for the compiler to complain. It would complain if the goto
> was within a #ifdef, but all the purpose of using IS_ENABLED() is to allow the
> compiler to properly handle all possible cases. That's all the force of
> IS_ENABLED() compared to ifdefs, and that the reason why they are plebicited,
> ref Linux Codying style for a detailed explanation.
Fair enough.
Should I submit a v4 just to remove these so ugly __maybe_unused ?
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 2/2] mm: remove odd HAVE_PTE_SPECIAL
From: Christophe LEROY @ 2018-04-11 9:09 UTC (permalink / raw)
To: Laurent Dufour, linux-kernel, linux-mm, linuxppc-dev, x86,
linux-doc, linux-snps-arc, linux-arm-kernel, linux-riscv,
linux-s390, linux-sh, sparclinux, Jerome Glisse, mhocko,
aneesh.kumar, akpm, mpe, benh, paulus, Jonathan Corbet,
Catalin Marinas, Will Deacon, Yoshinori Sato, Rich Felker,
David S . Miller, Thomas Gleixner, Ingo Molnar, Vineet Gupta,
Palmer Dabbelt, Albert Ou, Martin Schwidefsky, Heiko Carstens,
David Rientjes, Robin Murphy
In-Reply-To: <93ed4fe4-dd1e-51be-948b-d53b16de21c5@linux.vnet.ibm.com>
Le 11/04/2018 à 11:03, Laurent Dufour a écrit :
>
>
> On 11/04/2018 10:58, Christophe LEROY wrote:
>>
>>
>> Le 11/04/2018 à 10:03, Laurent Dufour a écrit :
>>> Remove the additional define HAVE_PTE_SPECIAL and rely directly on
>>> CONFIG_ARCH_HAS_PTE_SPECIAL.
>>>
>>> There is no functional change introduced by this patch
>>>
>>> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>>> ---
>>> mm/memory.c | 19 ++++++++-----------
>>> 1 file changed, 8 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 96910c625daa..7f7dc7b2a341 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -817,17 +817,12 @@ static void print_bad_pte(struct vm_area_struct *vma,
>>> unsigned long addr,
>>> * PFNMAP mappings in order to support COWable mappings.
>>> *
>>> */
>>> -#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>>> -# define HAVE_PTE_SPECIAL 1
>>> -#else
>>> -# define HAVE_PTE_SPECIAL 0
>>> -#endif
>>> struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>>> pte_t pte, bool with_public_device)
>>> {
>>> unsigned long pfn = pte_pfn(pte);
>>> - if (HAVE_PTE_SPECIAL) {
>>> + if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
>>> if (likely(!pte_special(pte)))
>>> goto check_pfn;
>>> if (vma->vm_ops && vma->vm_ops->find_special_page)
>>> @@ -862,7 +857,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma,
>>> unsigned long addr,
>>> return NULL;
>>> }
>>> - /* !HAVE_PTE_SPECIAL case follows: */
>>> + /* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
>>> if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
>>> if (vma->vm_flags & VM_MIXEDMAP) {
>>> @@ -881,7 +876,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma,
>>> unsigned long addr,
>>> if (is_zero_pfn(pfn))
>>> return NULL;
>>> -check_pfn:
>>> +
>>> +check_pfn: __maybe_unused
>>
>> See below
>>
>>> if (unlikely(pfn > highest_memmap_pfn)) {
>>> print_bad_pte(vma, addr, pte, NULL);
>>> return NULL;
>>> @@ -891,7 +887,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma,
>>> unsigned long addr,
>>> * NOTE! We still have PageReserved() pages in the page tables.
>>> * eg. VDSO mappings can cause them to exist.
>>> */
>>> -out:
>>> +out: __maybe_unused
>>
>> Why do you need that change ?
>>
>> There is no reason for the compiler to complain. It would complain if the goto
>> was within a #ifdef, but all the purpose of using IS_ENABLED() is to allow the
>> compiler to properly handle all possible cases. That's all the force of
>> IS_ENABLED() compared to ifdefs, and that the reason why they are plebicited,
>> ref Linux Codying style for a detailed explanation.
>
> Fair enough.
>
> Should I submit a v4 just to remove these so ugly __maybe_unused ?
>
Most likely, unless the mm maintainer agrees to remove them by himself
when applying your patch ?
Christophe
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [RFC bpf-next v2 8/8] bpf: add documentation for eBPF helpers (58-64)
From: Jesper Dangaard Brouer @ 2018-04-11 10:17 UTC (permalink / raw)
To: Quentin Monnet
Cc: brouer, daniel, ast, netdev, oss-drivers, linux-doc, linux-man,
John Fastabend
In-Reply-To: <20180410144157.4831-9-quentin.monnet@netronome.com>
On Tue, 10 Apr 2018 15:41:57 +0100
Quentin Monnet <quentin.monnet@netronome.com> wrote:
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 7343af4196c8..db090ad03626 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1250,6 +1250,51 @@ union bpf_attr {
> * Return
> * 0 on success, or a negative error in case of failure.
> *
> + * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
> + * Description
> + * Redirect the packet to the endpoint referenced by *map* at
> + * index *key*. Depending on its type, his *map* can contain
> + * references to net devices (for forwarding packets through other
> + * ports), or to CPUs (for redirecting XDP frames to another CPU;
> + * but this is not fully implemented as of this writing).
Stating that CPUMAP redirect "is not fully implemented" is confusing.
The issue is that CPUMAP only works for "native" XDP.
What about saying:
"[...] or to CPUs (for redirecting XDP frames to another CPU;
but this is only implemented for native XDP as of this writing)"
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 2/2] mm: remove odd HAVE_PTE_SPECIAL
From: Laurent Dufour @ 2018-04-11 10:32 UTC (permalink / raw)
To: Christophe LEROY, linux-kernel, linux-mm, linuxppc-dev, x86,
linux-doc, linux-snps-arc, linux-arm-kernel, linux-riscv,
linux-s390, linux-sh, sparclinux, Jerome Glisse, mhocko,
aneesh.kumar, akpm, mpe, benh, paulus, Jonathan Corbet,
Catalin Marinas, Will Deacon, Yoshinori Sato, Rich Felker,
David S . Miller, Thomas Gleixner, Ingo Molnar, Vineet Gupta,
Palmer Dabbelt, Albert Ou, Martin Schwidefsky, Heiko Carstens,
David Rientjes, Robin Murphy
In-Reply-To: <278a5212-b962-9a3a-cc86-76cac744afab@c-s.fr>
On 11/04/2018 11:09, Christophe LEROY wrote:
>
>
> Le 11/04/2018 à 11:03, Laurent Dufour a écrit :
>>
>>
>> On 11/04/2018 10:58, Christophe LEROY wrote:
>>>
>>>
>>> Le 11/04/2018 à 10:03, Laurent Dufour a écrit :
>>>> Remove the additional define HAVE_PTE_SPECIAL and rely directly on
>>>> CONFIG_ARCH_HAS_PTE_SPECIAL.
>>>>
>>>> There is no functional change introduced by this patch
>>>>
>>>> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>>>> ---
>>>> mm/memory.c | 19 ++++++++-----------
>>>> 1 file changed, 8 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 96910c625daa..7f7dc7b2a341 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -817,17 +817,12 @@ static void print_bad_pte(struct vm_area_struct *vma,
>>>> unsigned long addr,
>>>> * PFNMAP mappings in order to support COWable mappings.
>>>> *
>>>> */
>>>> -#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>>>> -# define HAVE_PTE_SPECIAL 1
>>>> -#else
>>>> -# define HAVE_PTE_SPECIAL 0
>>>> -#endif
>>>> struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long
>>>> addr,
>>>> pte_t pte, bool with_public_device)
>>>> {
>>>> unsigned long pfn = pte_pfn(pte);
>>>> - if (HAVE_PTE_SPECIAL) {
>>>> + if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
>>>> if (likely(!pte_special(pte)))
>>>> goto check_pfn;
>>>> if (vma->vm_ops && vma->vm_ops->find_special_page)
>>>> @@ -862,7 +857,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma,
>>>> unsigned long addr,
>>>> return NULL;
>>>> }
>>>> - /* !HAVE_PTE_SPECIAL case follows: */
>>>> + /* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
>>>> if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
>>>> if (vma->vm_flags & VM_MIXEDMAP) {
>>>> @@ -881,7 +876,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma,
>>>> unsigned long addr,
>>>> if (is_zero_pfn(pfn))
>>>> return NULL;
>>>> -check_pfn:
>>>> +
>>>> +check_pfn: __maybe_unused
>>>
>>> See below
>>>
>>>> if (unlikely(pfn > highest_memmap_pfn)) {
>>>> print_bad_pte(vma, addr, pte, NULL);
>>>> return NULL;
>>>> @@ -891,7 +887,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma,
>>>> unsigned long addr,
>>>> * NOTE! We still have PageReserved() pages in the page tables.
>>>> * eg. VDSO mappings can cause them to exist.
>>>> */
>>>> -out:
>>>> +out: __maybe_unused
>>>
>>> Why do you need that change ?
>>>
>>> There is no reason for the compiler to complain. It would complain if the goto
>>> was within a #ifdef, but all the purpose of using IS_ENABLED() is to allow the
>>> compiler to properly handle all possible cases. That's all the force of
>>> IS_ENABLED() compared to ifdefs, and that the reason why they are plebicited,
>>> ref Linux Codying style for a detailed explanation.
>>
>> Fair enough.
>>
>> Should I submit a v4 just to remove these so ugly __maybe_unused ?
>>
>
> Most likely, unless the mm maintainer agrees to remove them by himself when
> applying your patch ?
That was my point.
Andrew, should I send a v4 or could you wipe the 2 __maybe_unsued when applying
the patch ?
Thanks,
Laurent.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 2/2] mm: remove odd HAVE_PTE_SPECIAL
From: Michal Hocko @ 2018-04-11 11:09 UTC (permalink / raw)
To: Laurent Dufour
Cc: Christophe LEROY, linux-kernel, linux-mm, linuxppc-dev, x86,
linux-doc, linux-snps-arc, linux-arm-kernel, linux-riscv,
linux-s390, linux-sh, sparclinux, Jerome Glisse, aneesh.kumar,
akpm, mpe, benh, paulus, Jonathan Corbet, Catalin Marinas,
Will Deacon, Yoshinori Sato, Rich Felker, David S . Miller,
Thomas Gleixner, Ingo Molnar, Vineet Gupta, Palmer Dabbelt,
Albert Ou, Martin Schwidefsky, Heiko Carstens, David Rientjes,
Robin Murphy
In-Reply-To: <32655c37-91cb-17aa-58e7-74254e2673a0@linux.vnet.ibm.com>
On Wed 11-04-18 12:32:07, Laurent Dufour wrote:
[...]
> Andrew, should I send a v4 or could you wipe the 2 __maybe_unsued when applying
> the patch ?
A follow $patch-fix should be better rather than post this again and
spam people with more emails.
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 05/10] ARM: dts: aspeed: peci: Add PECI node
From: Joel Stanley @ 2018-04-11 11:52 UTC (permalink / raw)
To: Jae Hyun Yoo
Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
Jean Delvare, Julia Cartwright, Miguel Ojeda, Milton Miller II,
Pavel Machek, Randy Dunlap, Stef van Os, Sumeet R Pawnikar,
Vernon Mauery, Linux Kernel Mailing List, linux-doc, devicetree,
linux-hwmon, Linux ARM, OpenBMC Maillist
In-Reply-To: <20180410183212.16787-6-jae.hyun.yoo@linux.intel.com>
On 11 April 2018 at 04:02, Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com> wrote:
> This commit adds PECI bus/adapter node of AST24xx/AST25xx into
> aspeed-g4 and aspeed-g5.
>
The patches to the device trees get merged by the ASPEED maintainer
(me). Once you have the bindings reviewed you can send the patches to
me and the linux-aspeed list (I've got a pending patch to maintainers
that will ensure get_maintainers.pl does the right thing as far as
email addresses go).
I'd suggest dropping it from your series and re-sending once the
bindings and driver are reviewed.
Cheers,
Joel
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 04/10] Documentations: dt-bindings: Add a document of PECI adapter driver for Aspeed AST24xx/25xx SoCs
From: Joel Stanley @ 2018-04-11 11:52 UTC (permalink / raw)
To: Jae Hyun Yoo, Rob Herring, linux-aspeed, Ryan Chen
Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
Jean Delvare, Julia Cartwright, Miguel Ojeda, Milton Miller II,
Pavel Machek, Randy Dunlap, Stef van Os, Sumeet R Pawnikar,
Vernon Mauery, Linux Kernel Mailing List, linux-doc, devicetree,
linux-hwmon, Linux ARM, OpenBMC Maillist
In-Reply-To: <20180410183212.16787-5-jae.hyun.yoo@linux.intel.com>
On 11 April 2018 at 04:02, Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com> wrote:
> This commit adds a dt-bindings document of PECI adapter driver for Aspeed
We try to capitalise ASPEED.
> AST24xx/25xx SoCs.
> ---
> .../devicetree/bindings/peci/peci-aspeed.txt | 60 ++++++++++++++++++++++
> 1 file changed, 60 insertions(+)
> create mode 100644 Documentation/devicetree/bindings/peci/peci-aspeed.txt
>
> diff --git a/Documentation/devicetree/bindings/peci/peci-aspeed.txt b/Documentation/devicetree/bindings/peci/peci-aspeed.txt
> new file mode 100644
> index 000000000000..4598bb8c20fa
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/peci/peci-aspeed.txt
> @@ -0,0 +1,60 @@
> +Device tree configuration for PECI buses on the AST24XX and AST25XX SoCs.
> +
> +Required properties:
> +- compatible : Should be "aspeed,ast2400-peci" or "aspeed,ast2500-peci"
> + - aspeed,ast2400-peci: Aspeed AST2400 family PECI
> + controller
> + - aspeed,ast2500-peci: Aspeed AST2500 family PECI
> + controller
> +- reg : Should contain PECI controller registers location and
> + length.
> +- #address-cells : Should be <1>.
> +- #size-cells : Should be <0>.
> +- interrupts : Should contain PECI controller interrupt.
> +- clocks : Should contain clock source for PECI controller.
> + Should reference clkin.
Are you sure that this is driven by clkin? Most peripherals on the
Aspeed are attached to the apb, so should reference that clock.
> +- clock_frequency : Should contain the operation frequency of PECI controller
> + in units of Hz.
> + 187500 ~ 24000000
Can you explain why you need both the parent clock and this frequency
to be specified?
> +
> +Optional properties:
> +- msg-timing-nego : Message timing negotiation period. This value will
Perhaps msg-timing-period? Or just msg-timing?
> + determine the period of message timing negotiation to be
> + issued by PECI controller. The unit of the programmed
> + value is four times of PECI clock period.
> + 0 ~ 255 (default: 1)
> +- addr-timing-nego : Address timing negotiation period. This value will
> + determine the period of address timing negotiation to be
> + issued by PECI controller. The unit of the programmed
> + value is four times of PECI clock period.
> + 0 ~ 255 (default: 1)
> +- rd-sampling-point : Read sampling point selection. The whole period of a bit
> + time will be divided into 16 time frames. This value will
> + determine the time frame in which the controller will
> + sample PECI signal for data read back. Usually in the
> + middle of a bit time is the best.
> + 0 ~ 15 (default: 8)
> +- cmd_timeout_ms : Command timeout in units of ms.
> + 1 ~ 60000 (default: 1000)
> +
> +Example:
> + peci: peci@1e78b000 {
> + compatible = "simple-bus";
> + #address-cells = <1>;
> + #size-cells = <1>;
> + ranges = <0x0 0x1e78b000 0x60>;
> +
> + peci0: peci-bus@0 {
> + compatible = "aspeed,ast2500-peci";
> + reg = <0x0 0x60>;
> + #address-cells = <1>;
> + #size-cells = <0>;
> + interrupts = <15>;
> + clocks = <&clk_clkin>;
> + clock-frequency = <24000000>;
> + msg-timing-nego = <1>;
> + addr-timing-nego = <1>;
> + rd-sampling-point = <8>;
> + cmd-timeout-ms = <1000>;
> + };
> + };
> --
> 2.16.2
>
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 01/10] Documentations: dt-bindings: Add documents of generic PECI bus, adapter and client drivers
From: Joel Stanley @ 2018-04-11 11:52 UTC (permalink / raw)
To: Jae Hyun Yoo, Rob Herring
Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
Jean Delvare, Julia Cartwright, Miguel Ojeda, Milton Miller II,
Pavel Machek, Randy Dunlap, Stef van Os, Sumeet R Pawnikar,
Vernon Mauery, Linux Kernel Mailing List, linux-doc, devicetree,
linux-hwmon, Linux ARM, OpenBMC Maillist
In-Reply-To: <20180410183212.16787-2-jae.hyun.yoo@linux.intel.com>
Hi Jae,
On 11 April 2018 at 04:02, Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com> wrote:
> This commit adds documents of generic PECI bus, adapter and client drivers.
>
> Signed-off-by: Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>
> Reviewed-by: Haiyue Wang <haiyue.wang@linux.intel.com>
> Reviewed-by: James Feist <james.feist@linux.intel.com>
> Reviewed-by: Vernon Mauery <vernon.mauery@linux.intel.com>
> Cc: Alan Cox <alan@linux.intel.com>
> Cc: Andrew Jeffery <andrew@aj.id.au>
> Cc: Andrew Lunn <andrew@lunn.ch>
> Cc: Andy Shevchenko <andriy.shevchenko@intel.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Guenter Roeck <linux@roeck-us.net>
> Cc: Jason M Biils <jason.m.bills@linux.intel.com>
> Cc: Jean Delvare <jdelvare@suse.com>
> Cc: Joel Stanley <joel@jms.id.au>
> Cc: Julia Cartwright <juliac@eso.teric.us>
> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
> Cc: Milton Miller II <miltonm@us.ibm.com>
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
> Cc: Sumeet R Pawnikar <sumeet.r.pawnikar@intel.com>
That's a hefty cc list. I can't see Rob Herring though, and he's
usually the person who you need to convince to get your bindings
accepted.
I recommend using ./scripts/get_maintainers.pl to build your CC list,
and then add others you think are relevant.
I'm not sure what the guidelines are for generic bindings, so I'll
defer to Rob for this patch.
Cheers,
Joel
> ---
> .../devicetree/bindings/peci/peci-adapter.txt | 23 ++++++++++++++++++++
> .../devicetree/bindings/peci/peci-bus.txt | 15 +++++++++++++
> .../devicetree/bindings/peci/peci-client.txt | 25 ++++++++++++++++++++++
> 3 files changed, 63 insertions(+)
> create mode 100644 Documentation/devicetree/bindings/peci/peci-adapter.txt
> create mode 100644 Documentation/devicetree/bindings/peci/peci-bus.txt
> create mode 100644 Documentation/devicetree/bindings/peci/peci-client.txt
>
> diff --git a/Documentation/devicetree/bindings/peci/peci-adapter.txt b/Documentation/devicetree/bindings/peci/peci-adapter.txt
> new file mode 100644
> index 000000000000..9221374f6b11
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/peci/peci-adapter.txt
> @@ -0,0 +1,23 @@
> +Generic device tree configuration for PECI adapters.
> +
> +Required properties:
> +- compatible : Should contain hardware specific definition strings that can
> + match an adapter driver implementation.
> +- reg : Should contain PECI controller registers location and length.
> +- #address-cells : Should be <1>.
> +- #size-cells : Should be <0>.
> +
> +Example:
> + peci: peci@10000000 {
> + compatible = "simple-bus";
> + #address-cells = <1>;
> + #size-cells = <1>;
> + ranges = <0x0 0x10000000 0x1000>;
> +
> + peci0: peci-bus@0 {
> + compatible = "soc,soc-peci";
> + reg = <0x0 0x1000>;
> + #address-cells = <1>;
> + #size-cells = <0>;
> + };
> + };
> diff --git a/Documentation/devicetree/bindings/peci/peci-bus.txt b/Documentation/devicetree/bindings/peci/peci-bus.txt
> new file mode 100644
> index 000000000000..90bcc791ccb0
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/peci/peci-bus.txt
> @@ -0,0 +1,15 @@
> +Generic device tree configuration for PECI buses.
> +
> +Required properties:
> +- compatible : Should be "simple-bus".
> +- #address-cells : Should be <1>.
> +- #size-cells : Should be <1>.
> +- ranges : Should contain PECI controller registers ranges.
> +
> +Example:
> + peci: peci@10000000 {
> + compatible = "simple-bus";
> + #address-cells = <1>;
> + #size-cells = <1>;
> + ranges = <0x0 0x10000000 0x1000>;
> + };
> diff --git a/Documentation/devicetree/bindings/peci/peci-client.txt b/Documentation/devicetree/bindings/peci/peci-client.txt
> new file mode 100644
> index 000000000000..8e2bfd8532f6
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/peci/peci-client.txt
> @@ -0,0 +1,25 @@
> +Generic device tree configuration for PECI clients.
> +
> +Required properties:
> +- compatible : Should contain target device specific definition strings that can
> + match a client driver implementation.
> +- reg : Should contain address of a client CPU. Address range of CPU
> + clients is starting from 0x30 based on PECI specification.
> + <0x30> .. <0x37> (depends on the PECI_OFFSET_MAX definition)
> +
> +Example:
> + peci-bus@0 {
> + #address-cells = <1>;
> + #size-cells = <0>;
> + < more properties >
> +
> + function@cpu0 {
> + compatible = "device,function";
> + reg = <0x30>;
> + };
> +
> + function@cpu1 {
> + compatible = "device,function";
> + reg = <0x31>;
> + };
> + };
> --
> 2.16.2
>
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 06/10] drivers/peci: Add a PECI adapter driver for Aspeed AST24xx/AST25xx
From: Joel Stanley @ 2018-04-11 11:51 UTC (permalink / raw)
To: Jae Hyun Yoo, Ryan Chen
Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
Jean Delvare, Julia Cartwright, Miguel Ojeda, Milton Miller II,
Pavel Machek, Randy Dunlap, Stef van Os, Sumeet R Pawnikar,
Vernon Mauery, Linux Kernel Mailing List, linux-doc, devicetree,
linux-hwmon, Linux ARM, OpenBMC Maillist
In-Reply-To: <20180410183212.16787-7-jae.hyun.yoo@linux.intel.com>
Hello Jae,
On 11 April 2018 at 04:02, Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com> wrote:
> This commit adds PECI adapter driver implementation for Aspeed
> AST24xx/AST25xx.
The driver is looking good!
It looks like you've done some kind of review that we weren't allowed
to see, which is a double edged sword - I might be asking about things
that you've already spoken about with someone else.
I'm only just learning about PECI, but I do have some general comments below.
> ---
> drivers/peci/Kconfig | 28 +++
> drivers/peci/Makefile | 3 +
> drivers/peci/peci-aspeed.c | 504 +++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 535 insertions(+)
> create mode 100644 drivers/peci/peci-aspeed.c
>
> diff --git a/drivers/peci/Kconfig b/drivers/peci/Kconfig
> index 1fbc13f9e6c2..0e33420365de 100644
> --- a/drivers/peci/Kconfig
> +++ b/drivers/peci/Kconfig
> @@ -14,4 +14,32 @@ config PECI
> processors and chipset components to external monitoring or control
> devices.
>
> + If you want PECI support, you should say Y here and also to the
> + specific driver for your bus adapter(s) below.
> +
> +if PECI
> +
> +#
> +# PECI hardware bus configuration
> +#
> +
> +menu "PECI Hardware Bus support"
> +
> +config PECI_ASPEED
> + tristate "Aspeed AST24xx/AST25xx PECI support"
I think just saying ASPEED PECI support is enough. That way if the
next ASPEED SoC happens to have PECI we don't need to update all of
the help text :)
> + select REGMAP_MMIO
> + depends on OF
> + depends on ARCH_ASPEED || COMPILE_TEST
> + help
> + Say Y here if you want support for the Platform Environment Control
> + Interface (PECI) bus adapter driver on the Aspeed AST24XX and AST25XX
> + SoCs.
> +
> + This support is also available as a module. If so, the module
> + will be called peci-aspeed.
> +
> +endmenu
> +
> +endif # PECI
> +
> endmenu
> diff --git a/drivers/peci/Makefile b/drivers/peci/Makefile
> index 9e8615e0d3ff..886285e69765 100644
> --- a/drivers/peci/Makefile
> +++ b/drivers/peci/Makefile
> @@ -4,3 +4,6 @@
>
> # Core functionality
> obj-$(CONFIG_PECI) += peci-core.o
> +
> +# Hardware specific bus drivers
> +obj-$(CONFIG_PECI_ASPEED) += peci-aspeed.o
> diff --git a/drivers/peci/peci-aspeed.c b/drivers/peci/peci-aspeed.c
> new file mode 100644
> index 000000000000..be2a1f327eb1
> --- /dev/null
> +++ b/drivers/peci/peci-aspeed.c
> @@ -0,0 +1,504 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (C) 2012-2017 ASPEED Technology Inc.
> +// Copyright (c) 2018 Intel Corporation
> +
> +#include <linux/clk.h>
> +#include <linux/delay.h>
> +#include <linux/interrupt.h>
> +#include <linux/jiffies.h>
> +#include <linux/module.h>
> +#include <linux/of.h>
> +#include <linux/peci.h>
> +#include <linux/platform_device.h>
> +#include <linux/regmap.h>
> +
> +#define DUMP_DEBUG 0
> +
> +/* Aspeed PECI Registers */
> +#define AST_PECI_CTRL 0x00
Nit: we use ASPEED instead of AST in the upstream kernel to distingush
from the aspeed sdk drivers. If you feel strongly about this then I
won't insist you change.
> +#define AST_PECI_TIMING 0x04
> +#define AST_PECI_CMD 0x08
> +#define AST_PECI_CMD_CTRL 0x0c
> +#define AST_PECI_EXP_FCS 0x10
> +#define AST_PECI_CAP_FCS 0x14
> +#define AST_PECI_INT_CTRL 0x18
> +#define AST_PECI_INT_STS 0x1c
> +#define AST_PECI_W_DATA0 0x20
> +#define AST_PECI_W_DATA1 0x24
> +#define AST_PECI_W_DATA2 0x28
> +#define AST_PECI_W_DATA3 0x2c
> +#define AST_PECI_R_DATA0 0x30
> +#define AST_PECI_R_DATA1 0x34
> +#define AST_PECI_R_DATA2 0x38
> +#define AST_PECI_R_DATA3 0x3c
> +#define AST_PECI_W_DATA4 0x40
> +#define AST_PECI_W_DATA5 0x44
> +#define AST_PECI_W_DATA6 0x48
> +#define AST_PECI_W_DATA7 0x4c
> +#define AST_PECI_R_DATA4 0x50
> +#define AST_PECI_R_DATA5 0x54
> +#define AST_PECI_R_DATA6 0x58
> +#define AST_PECI_R_DATA7 0x5c
> +
> +/* AST_PECI_CTRL - 0x00 : Control Register */
> +#define PECI_CTRL_SAMPLING_MASK GENMASK(19, 16)
> +#define PECI_CTRL_SAMPLING(x) (((x) << 16) & PECI_CTRL_SAMPLING_MASK)
> +#define PECI_CTRL_SAMPLING_GET(x) (((x) & PECI_CTRL_SAMPLING_MASK) >> 16)
> +#define PECI_CTRL_READ_MODE_MASK GENMASK(13, 12)
> +#define PECI_CTRL_READ_MODE(x) (((x) << 12) & PECI_CTRL_READ_MODE_MASK)
> +#define PECI_CTRL_READ_MODE_GET(x) (((x) & PECI_CTRL_READ_MODE_MASK) >> 12)
> +#define PECI_CTRL_READ_MODE_COUNT BIT(12)
> +#define PECI_CTRL_READ_MODE_DBG BIT(13)
> +#define PECI_CTRL_CLK_SOURCE_MASK BIT(11)
> +#define PECI_CTRL_CLK_SOURCE(x) (((x) << 11) & PECI_CTRL_CLK_SOURCE_MASK)
> +#define PECI_CTRL_CLK_SOURCE_GET(x) (((x) & PECI_CTRL_CLK_SOURCE_MASK) >> 11)
> +#define PECI_CTRL_CLK_DIV_MASK GENMASK(10, 8)
> +#define PECI_CTRL_CLK_DIV(x) (((x) << 8) & PECI_CTRL_CLK_DIV_MASK)
> +#define PECI_CTRL_CLK_DIV_GET(x) (((x) & PECI_CTRL_CLK_DIV_MASK) >> 8)
> +#define PECI_CTRL_INVERT_OUT BIT(7)
> +#define PECI_CTRL_INVERT_IN BIT(6)
> +#define PECI_CTRL_BUS_CONTENT_EN BIT(5)
> +#define PECI_CTRL_PECI_EN BIT(4)
> +#define PECI_CTRL_PECI_CLK_EN BIT(0)
I know these come from the ASPEED sdk driver. Do we need them all?
> +
> +/* AST_PECI_TIMING - 0x04 : Timing Negotiation Register */
> +#define PECI_TIMING_MESSAGE_MASK GENMASK(15, 8)
> +#define PECI_TIMING_MESSAGE(x) (((x) << 8) & PECI_TIMING_MESSAGE_MASK)
> +#define PECI_TIMING_MESSAGE_GET(x) (((x) & PECI_TIMING_MESSAGE_MASK) >> 8)
> +#define PECI_TIMING_ADDRESS_MASK GENMASK(7, 0)
> +#define PECI_TIMING_ADDRESS(x) ((x) & PECI_TIMING_ADDRESS_MASK)
> +#define PECI_TIMING_ADDRESS_GET(x) ((x) & PECI_TIMING_ADDRESS_MASK)
> +
> +/* AST_PECI_CMD - 0x08 : Command Register */
> +#define PECI_CMD_PIN_MON BIT(31)
> +#define PECI_CMD_STS_MASK GENMASK(27, 24)
> +#define PECI_CMD_STS_GET(x) (((x) & PECI_CMD_STS_MASK) >> 24)
> +#define PECI_CMD_FIRE BIT(0)
> +
> +/* AST_PECI_LEN - 0x0C : Read/Write Length Register */
> +#define PECI_AW_FCS_EN BIT(31)
> +#define PECI_READ_LEN_MASK GENMASK(23, 16)
> +#define PECI_READ_LEN(x) (((x) << 16) & PECI_READ_LEN_MASK)
> +#define PECI_WRITE_LEN_MASK GENMASK(15, 8)
> +#define PECI_WRITE_LEN(x) (((x) << 8) & PECI_WRITE_LEN_MASK)
> +#define PECI_TAGET_ADDR_MASK GENMASK(7, 0)
> +#define PECI_TAGET_ADDR(x) ((x) & PECI_TAGET_ADDR_MASK)
> +
> +/* AST_PECI_EXP_FCS - 0x10 : Expected FCS Data Register */
> +#define PECI_EXPECT_READ_FCS_MASK GENMASK(23, 16)
> +#define PECI_EXPECT_READ_FCS_GET(x) (((x) & PECI_EXPECT_READ_FCS_MASK) >> 16)
> +#define PECI_EXPECT_AW_FCS_AUTO_MASK GENMASK(15, 8)
> +#define PECI_EXPECT_AW_FCS_AUTO_GET(x) (((x) & PECI_EXPECT_AW_FCS_AUTO_MASK) \
> + >> 8)
> +#define PECI_EXPECT_WRITE_FCS_MASK GENMASK(7, 0)
> +#define PECI_EXPECT_WRITE_FCS_GET(x) ((x) & PECI_EXPECT_WRITE_FCS_MASK)
> +
> +/* AST_PECI_CAP_FCS - 0x14 : Captured FCS Data Register */
> +#define PECI_CAPTURE_READ_FCS_MASK GENMASK(23, 16)
> +#define PECI_CAPTURE_READ_FCS_GET(x) (((x) & PECI_CAPTURE_READ_FCS_MASK) >> 16)
> +#define PECI_CAPTURE_WRITE_FCS_MASK GENMASK(7, 0)
> +#define PECI_CAPTURE_WRITE_FCS_GET(x) ((x) & PECI_CAPTURE_WRITE_FCS_MASK)
> +
> +/* AST_PECI_INT_CTRL/STS - 0x18/0x1c : Interrupt Register */
> +#define PECI_INT_TIMING_RESULT_MASK GENMASK(31, 30)
> +#define PECI_INT_TIMEOUT BIT(4)
> +#define PECI_INT_CONNECT BIT(3)
> +#define PECI_INT_W_FCS_BAD BIT(2)
> +#define PECI_INT_W_FCS_ABORT BIT(1)
> +#define PECI_INT_CMD_DONE BIT(0)
> +
> +struct aspeed_peci {
> + struct peci_adapter adaper;
> + struct device *dev;
> + struct regmap *regmap;
> + int irq;
> + struct completion xfer_complete;
> + u32 status;
> + u32 cmd_timeout_ms;
> +};
> +
> +#define PECI_INT_MASK (PECI_INT_TIMEOUT | PECI_INT_CONNECT | \
> + PECI_INT_W_FCS_BAD | PECI_INT_W_FCS_ABORT | \
> + PECI_INT_CMD_DONE)
> +
> +#define PECI_IDLE_CHECK_TIMEOUT_MS 50
> +#define PECI_IDLE_CHECK_INTERVAL_MS 10
> +
> +#define PECI_RD_SAMPLING_POINT_DEFAULT 8
> +#define PECI_RD_SAMPLING_POINT_MAX 15
> +#define PECI_CLK_DIV_DEFAULT 0
> +#define PECI_CLK_DIV_MAX 7
> +#define PECI_MSG_TIMING_NEGO_DEFAULT 1
> +#define PECI_MSG_TIMING_NEGO_MAX 255
> +#define PECI_ADDR_TIMING_NEGO_DEFAULT 1
> +#define PECI_ADDR_TIMING_NEGO_MAX 255
> +#define PECI_CMD_TIMEOUT_MS_DEFAULT 1000
> +#define PECI_CMD_TIMEOUT_MS_MAX 60000
> +
> +static int aspeed_peci_xfer_native(struct aspeed_peci *priv,
> + struct peci_xfer_msg *msg)
> +{
> + long err, timeout = msecs_to_jiffies(priv->cmd_timeout_ms);
> + u32 peci_head, peci_state, rx_data, cmd_sts;
> + ktime_t start, end;
> + s64 elapsed_ms;
> + int i, rc = 0;
> + uint reg;
> +
> + start = ktime_get();
> +
> + /* Check command sts and bus idle state */
> + while (!regmap_read(priv->regmap, AST_PECI_CMD, &cmd_sts) &&
> + (cmd_sts & (PECI_CMD_STS_MASK | PECI_CMD_PIN_MON))) {
> + end = ktime_get();
> + elapsed_ms = ktime_to_ms(ktime_sub(end, start));
> + if (elapsed_ms >= PECI_IDLE_CHECK_TIMEOUT_MS) {
> + dev_dbg(priv->dev, "Timeout waiting for idle state!\n");
> + return -ETIMEDOUT;
> + }
> +
> + usleep_range(PECI_IDLE_CHECK_INTERVAL_MS * 1000,
> + (PECI_IDLE_CHECK_INTERVAL_MS * 1000) + 1000);
> + };
Could the above use regmap_read_poll_timeout instead?
> +
> + reinit_completion(&priv->xfer_complete);
> +
> + peci_head = PECI_TAGET_ADDR(msg->addr) |
> + PECI_WRITE_LEN(msg->tx_len) |
> + PECI_READ_LEN(msg->rx_len);
> +
> + rc = regmap_write(priv->regmap, AST_PECI_CMD_CTRL, peci_head);
> + if (rc)
> + return rc;
> +
> + for (i = 0; i < msg->tx_len; i += 4) {
> + reg = i < 16 ? AST_PECI_W_DATA0 + i % 16 :
> + AST_PECI_W_DATA4 + i % 16;
> + rc = regmap_write(priv->regmap, reg,
> + (msg->tx_buf[i + 3] << 24) |
> + (msg->tx_buf[i + 2] << 16) |
> + (msg->tx_buf[i + 1] << 8) |
> + msg->tx_buf[i + 0]);
That looks like an endian swap. Can we do something like this?
regmap_write(map, reg, cpu_to_be32p((void *)msg->tx_buff))
> + if (rc)
> + return rc;
> + }
> +
> + dev_dbg(priv->dev, "HEAD : 0x%08x\n", peci_head);
> +#if DUMP_DEBUG
Having #defines is frowned upon. I think print_hex_dump_debug will do
what you want here.
> + print_hex_dump(KERN_DEBUG, "TX : ", DUMP_PREFIX_NONE, 16, 1,
> + msg->tx_buf, msg->tx_len, true);
> +#endif
> +
> + rc = regmap_write(priv->regmap, AST_PECI_CMD, PECI_CMD_FIRE);
> + if (rc)
> + return rc;
> +
> + err = wait_for_completion_interruptible_timeout(&priv->xfer_complete,
> + timeout);
> +
> + dev_dbg(priv->dev, "INT_STS : 0x%08x\n", priv->status);
> + if (!regmap_read(priv->regmap, AST_PECI_CMD, &peci_state))
> + dev_dbg(priv->dev, "PECI_STATE : 0x%lx\n",
> + PECI_CMD_STS_GET(peci_state));
> + else
> + dev_dbg(priv->dev, "PECI_STATE : read error\n");
> +
> + rc = regmap_write(priv->regmap, AST_PECI_CMD, 0);
> + if (rc)
> + return rc;
> +
> + if (err <= 0 || !(priv->status & PECI_INT_CMD_DONE)) {
> + if (err < 0) { /* -ERESTARTSYS */
> + return (int)err;
> + } else if (err == 0) {
> + dev_dbg(priv->dev, "Timeout waiting for a response!\n");
> + return -ETIMEDOUT;
> + }
> +
> + dev_dbg(priv->dev, "No valid response!\n");
> + return -EIO;
> + }
> +
> + for (i = 0; i < msg->rx_len; i++) {
> + u8 byte_offset = i % 4;
> +
> + if (byte_offset == 0) {
> + reg = i < 16 ? AST_PECI_R_DATA0 + i % 16 :
> + AST_PECI_R_DATA4 + i % 16;
I find this hard to read. Use a few more lines to make it clear what
your code is doing.
Actually, the entire for loop is cryptic. I understand what it's doing
now. Can you rework it to make it more readable? You follow a similar
pattern above in the write case.
> + rc = regmap_read(priv->regmap, reg, &rx_data);
> + if (rc)
> + return rc;
> + }
> +
> + msg->rx_buf[i] = (u8)(rx_data >> (byte_offset << 3))
> + }
> +
> +#if DUMP_DEBUG
> + print_hex_dump(KERN_DEBUG, "RX : ", DUMP_PREFIX_NONE, 16, 1,
> + msg->rx_buf, msg->rx_len, true);
> +#endif
> + if (!regmap_read(priv->regmap, AST_PECI_CMD, &peci_state))
> + dev_dbg(priv->dev, "PECI_STATE : 0x%lx\n",
> + PECI_CMD_STS_GET(peci_state));
> + else
> + dev_dbg(priv->dev, "PECI_STATE : read error\n");
Given the regmap_read is always going to be a memory read on the
aspeed, I can't think of a situation where the read will fail.
On that note, is there a reason you are using regmap and not just
accessing the hardware directly? regmap imposes a number of pointer
lookups and tests each time you do a read or write.
> + dev_dbg(priv->dev, "------------------------\n");
> +
> + return rc;
> +}
> +
> +static irqreturn_t aspeed_peci_irq_handler(int irq, void *arg)
> +{
> + struct aspeed_peci *priv = arg;
> + u32 status_ack = 0;
> +
> + if (regmap_read(priv->regmap, AST_PECI_INT_STS, &priv->status))
> + return IRQ_NONE;
Again, a memory mapped read won't fail. How about we check that the
regmap is working once in your _probe() function, and assume it will
continue working from there (or remove the regmap abstraction all
together).
> +
> + /* Be noted that multiple interrupt bits can be set at the same time */
> + if (priv->status & PECI_INT_TIMEOUT) {
> + dev_dbg(priv->dev, "PECI_INT_TIMEOUT\n");
> + status_ack |= PECI_INT_TIMEOUT;
> + }
> +
> + if (priv->status & PECI_INT_CONNECT) {
> + dev_dbg(priv->dev, "PECI_INT_CONNECT\n");
> + status_ack |= PECI_INT_CONNECT;
> + }
> +
> + if (priv->status & PECI_INT_W_FCS_BAD) {
> + dev_dbg(priv->dev, "PECI_INT_W_FCS_BAD\n");
> + status_ack |= PECI_INT_W_FCS_BAD;
> + }
> +
> + if (priv->status & PECI_INT_W_FCS_ABORT) {
> + dev_dbg(priv->dev, "PECI_INT_W_FCS_ABORT\n");
> + status_ack |= PECI_INT_W_FCS_ABORT;
> + }
All of this code is for debugging only. Do you want to put it behind
some kind of conditional?
> +
> + /**
> + * All commands should be ended up with a PECI_INT_CMD_DONE bit set
> + * even in an error case.
> + */
> + if (priv->status & PECI_INT_CMD_DONE) {
> + dev_dbg(priv->dev, "PECI_INT_CMD_DONE\n");
> + status_ack |= PECI_INT_CMD_DONE;
> + complete(&priv->xfer_complete);
> + }
> +
> + if (regmap_write(priv->regmap, AST_PECI_INT_STS, status_ack))
> + return IRQ_NONE;
> +
> + return IRQ_HANDLED;
> +}
> +
> +static int aspeed_peci_init_ctrl(struct aspeed_peci *priv)
> +{
> + u32 msg_timing_nego, addr_timing_nego, rd_sampling_point;
> + u32 clk_freq, clk_divisor, clk_div_val = 0;
> + struct clk *clkin;
> + int ret;
> +
> + clkin = devm_clk_get(priv->dev, NULL);
> + if (IS_ERR(clkin)) {
> + dev_err(priv->dev, "Failed to get clk source.\n");
> + return PTR_ERR(clkin);
> + }
> +
> + ret = of_property_read_u32(priv->dev->of_node, "clock-frequency",
> + &clk_freq);
> + if (ret < 0) {
> + dev_err(priv->dev,
> + "Could not read clock-frequency property.\n");
> + return ret;
> + }
> +
> + clk_divisor = clk_get_rate(clkin) / clk_freq;
> + devm_clk_put(priv->dev, clkin);
> +
> + while ((clk_divisor >> 1) && (clk_div_val < PECI_CLK_DIV_MAX))
> + clk_div_val++;
We have a framework for doing clocks in the kernel. Would it make
sense to write a driver for this clock and add it to
drivers/clk/clk-aspeed.c?
> +
> + ret = of_property_read_u32(priv->dev->of_node, "msg-timing-nego",
> + &msg_timing_nego);
> + if (ret || msg_timing_nego > PECI_MSG_TIMING_NEGO_MAX) {
> + dev_warn(priv->dev,
> + "Invalid msg-timing-nego : %u, Use default : %u\n",
> + msg_timing_nego, PECI_MSG_TIMING_NEGO_DEFAULT);
The property is optional so I suggest we don't print a message if it's
not present. We certainly don't want to print a message saying
"invalid".
The same comment applies to the other optional properties below.
> + msg_timing_nego = PECI_MSG_TIMING_NEGO_DEFAULT;
> + }
> +
> + ret = of_property_read_u32(priv->dev->of_node, "addr-timing-nego",
> + &addr_timing_nego);
> + if (ret || addr_timing_nego > PECI_ADDR_TIMING_NEGO_MAX) {
> + dev_warn(priv->dev,
> + "Invalid addr-timing-nego : %u, Use default : %u\n",
> + addr_timing_nego, PECI_ADDR_TIMING_NEGO_DEFAULT);
> + addr_timing_nego = PECI_ADDR_TIMING_NEGO_DEFAULT;
> + }
> +
> + ret = of_property_read_u32(priv->dev->of_node, "rd-sampling-point",
> + &rd_sampling_point);
> + if (ret || rd_sampling_point > PECI_RD_SAMPLING_POINT_MAX) {
> + dev_warn(priv->dev,
> + "Invalid rd-sampling-point : %u. Use default : %u\n",
> + rd_sampling_point,
> + PECI_RD_SAMPLING_POINT_DEFAULT);
> + rd_sampling_point = PECI_RD_SAMPLING_POINT_DEFAULT;
> + }
> +
> + ret = of_property_read_u32(priv->dev->of_node, "cmd-timeout-ms",
> + &priv->cmd_timeout_ms);
> + if (ret || priv->cmd_timeout_ms > PECI_CMD_TIMEOUT_MS_MAX ||
> + priv->cmd_timeout_ms == 0) {
> + dev_warn(priv->dev,
> + "Invalid cmd-timeout-ms : %u. Use default : %u\n",
> + priv->cmd_timeout_ms,
> + PECI_CMD_TIMEOUT_MS_DEFAULT);
> + priv->cmd_timeout_ms = PECI_CMD_TIMEOUT_MS_DEFAULT;
> + }
> +
> + ret = regmap_write(priv->regmap, AST_PECI_CTRL,
> + PECI_CTRL_CLK_DIV(PECI_CLK_DIV_DEFAULT) |
> + PECI_CTRL_PECI_CLK_EN);
> + if (ret)
> + return ret;
> +
> + usleep_range(1000, 5000);
Can we probe in parallel? If not, putting a sleep in the _probe will
hold up the rest of drivers from being able to do anything, and hold
up boot.
If you decide that you do need to probe here, please add a comment.
(This is the wait for the clock to be stable?)
> +
> + /**
> + * Timing negotiation period setting.
> + * The unit of the programmed value is 4 times of PECI clock period.
> + */
> + ret = regmap_write(priv->regmap, AST_PECI_TIMING,
> + PECI_TIMING_MESSAGE(msg_timing_nego) |
> + PECI_TIMING_ADDRESS(addr_timing_nego));
> + if (ret)
> + return ret;
> +
> + /* Clear interrupts */
> + ret = regmap_write(priv->regmap, AST_PECI_INT_STS, PECI_INT_MASK);
> + if (ret)
> + return ret;
> +
> + /* Enable interrupts */
> + ret = regmap_write(priv->regmap, AST_PECI_INT_CTRL, PECI_INT_MASK);
> + if (ret)
> + return ret;
> +
> + /* Read sampling point and clock speed setting */
> + ret = regmap_write(priv->regmap, AST_PECI_CTRL,
> + PECI_CTRL_SAMPLING(rd_sampling_point) |
> + PECI_CTRL_CLK_DIV(clk_div_val) |
> + PECI_CTRL_PECI_EN | PECI_CTRL_PECI_CLK_EN);
> + if (ret)
> + return ret;
> +
> + return 0;
> +}
> +
> +static const struct regmap_config aspeed_peci_regmap_config = {
> + .reg_bits = 32,
> + .val_bits = 32,
> + .reg_stride = 4,
> + .max_register = AST_PECI_R_DATA7,
> + .val_format_endian = REGMAP_ENDIAN_LITTLE,
> + .fast_io = true,
> +};
> +
> +static int aspeed_peci_xfer(struct peci_adapter *adaper,
> + struct peci_xfer_msg *msg)
> +{
> + struct aspeed_peci *priv = peci_get_adapdata(adaper);
> +
> + return aspeed_peci_xfer_native(priv, msg);
> +}
> +
> +static int aspeed_peci_probe(struct platform_device *pdev)
> +{
> + struct aspeed_peci *priv;
> + struct resource *res;
> + void __iomem *base;
> + int ret = 0;
> +
> + priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
> + if (!priv)
> + return -ENOMEM;
> +
> + dev_set_drvdata(&pdev->dev, priv);
> + priv->dev = &pdev->dev;
> +
> + res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> + base = devm_ioremap_resource(&pdev->dev, res);
> + if (IS_ERR(base))
> + return PTR_ERR(base);
> +
> + priv->regmap = devm_regmap_init_mmio(&pdev->dev, base,
> + &aspeed_peci_regmap_config);
> + if (IS_ERR(priv->regmap))
> + return PTR_ERR(priv->regmap);
> +
> + priv->irq = platform_get_irq(pdev, 0);
> + if (!priv->irq)
> + return -ENODEV;
> +
> + ret = devm_request_irq(&pdev->dev, priv->irq, aspeed_peci_irq_handler,
> + IRQF_SHARED,
This interrupt is only for the peci device. Why is it marked as shared?
> + "peci-aspeed-irq",
> + priv);
> + if (ret < 0)
> + return ret;
> +
> + init_completion(&priv->xfer_complete);
> +
> + priv->adaper.dev.parent = priv->dev;
> + priv->adaper.dev.of_node = of_node_get(dev_of_node(priv->dev));
> + strlcpy(priv->adaper.name, pdev->name, sizeof(priv->adaper.name));
> + priv->adaper.xfer = aspeed_peci_xfer;
> + peci_set_adapdata(&priv->adaper, priv);
> +
> + ret = aspeed_peci_init_ctrl(priv);
> + if (ret < 0)
> + return ret;
> +
> + ret = peci_add_adapter(&priv->adaper);
> + if (ret < 0)
> + return ret;
> +
> + dev_info(&pdev->dev, "peci bus %d registered, irq %d\n",
> + priv->adaper.nr, priv->irq);
> +
> + return 0;
> +}
> +
> +static int aspeed_peci_remove(struct platform_device *pdev)
> +{
> + struct aspeed_peci *priv = dev_get_drvdata(&pdev->dev);
> +
> + peci_del_adapter(&priv->adaper);
> + of_node_put(priv->adaper.dev.of_node);
> +
> + return 0;
> +}
> +
> +static const struct of_device_id aspeed_peci_of_table[] = {
> + { .compatible = "aspeed,ast2400-peci", },
> + { .compatible = "aspeed,ast2500-peci", },
> + { }
> +};
> +MODULE_DEVICE_TABLE(of, aspeed_peci_of_table);
> +
> +static struct platform_driver aspeed_peci_driver = {
> + .probe = aspeed_peci_probe,
> + .remove = aspeed_peci_remove,
> + .driver = {
> + .name = "peci-aspeed",
> + .of_match_table = of_match_ptr(aspeed_peci_of_table),
> + },
> +};
> +module_platform_driver(aspeed_peci_driver);
> +
> +MODULE_AUTHOR("Ryan Chen <ryan_chen@aspeedtech.com>");
> +MODULE_AUTHOR("Jae Hyun Yoo <jae.hyun.yoo@linux.intel.com>");
> +MODULE_DESCRIPTION("Aspeed PECI driver");
> +MODULE_LICENSE("GPL v2");
> --
> 2.16.2
>
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [RFC tip/locking/lockdep v6 04/20] lockdep: Redefine LOCK_*_STATE* bits
From: Boqun Feng @ 2018-04-11 13:50 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Andrea Parri, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, open list:DOCUMENTATION
In-Reply-To: <20180411135110.9217-1-boqun.feng@gmail.com>
There are three types of lock acquisitions: write, non-recursive read
and recursive read, among which write locks and non-recursive read locks
have no difference from a viewpoint for deadlock detections, because a
write acquisition of the corresponding lock on an independent CPU or
task makes a non-recursive read lock act as a write lock in the sense of
deadlock. So we could treat them as the same type (named as
"non-recursive lock") in lockdep.
As in the irq lock inversion detection (safe->unsafe deadlock
detection), we used to differ write lock with read lock (non-recursive
and recursive ones), such a classification could be improved as
non-recursive read lock behaves the same as write lock, so this patch
redefines the meanings of LOCK_{USED_IN, ENABLED}_STATE*.
old:
LOCK_* : stands for write lock
LOCK_*_READ: stands for read lock(non-recursive and recursive)
new:
LOCK_* : stands for non-recursive(write lock and non-recursive
read lock)
LOCK_*_RR: stands for recursive read lock
Such a change is needed for a future improvement on recursive read
related irq inversion deadlock detection.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
Documentation/locking/lockdep-design.txt | 6 +++---
kernel/locking/lockdep.c | 28 ++++++++++++++--------------
kernel/locking/lockdep_internals.h | 16 ++++++++--------
kernel/locking/lockdep_proc.c | 12 ++++++------
4 files changed, 31 insertions(+), 31 deletions(-)
diff --git a/Documentation/locking/lockdep-design.txt b/Documentation/locking/lockdep-design.txt
index 6bb9e90e2c4f..53ede30ce16d 100644
--- a/Documentation/locking/lockdep-design.txt
+++ b/Documentation/locking/lockdep-design.txt
@@ -30,9 +30,9 @@ State
The validator tracks lock-class usage history into 4n + 1 separate state bits:
- 'ever held in STATE context'
-- 'ever held as readlock in STATE context'
+- 'ever held as recursive readlock in STATE context'
- 'ever held with STATE enabled'
-- 'ever held as readlock with STATE enabled'
+- 'ever held as recurisve readlock with STATE enabled'
Where STATE can be either one of (kernel/locking/lockdep_states.h)
- hardirq
@@ -51,7 +51,7 @@ locking error messages, inside curlies. A contrived example:
(&sio_locks[i].lock){-.-...}, at: [<c02867fd>] mutex_lock+0x21/0x24
-The bit position indicates STATE, STATE-read, for each of the states listed
+The bit position indicates STATE, STATE-RR, for each of the states listed
above, and the character displayed in each indicates:
'.' acquired while irqs disabled and not in irq context
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index f39a071ef0a8..14af2327b52a 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -448,10 +448,10 @@ DEFINE_PER_CPU(struct lockdep_stats, lockdep_stats);
*/
#define __USAGE(__STATE) \
- [LOCK_USED_IN_##__STATE] = "IN-"__stringify(__STATE)"-W", \
- [LOCK_ENABLED_##__STATE] = __stringify(__STATE)"-ON-W", \
- [LOCK_USED_IN_##__STATE##_READ] = "IN-"__stringify(__STATE)"-R",\
- [LOCK_ENABLED_##__STATE##_READ] = __stringify(__STATE)"-ON-R",
+ [LOCK_USED_IN_##__STATE] = "IN-"__stringify(__STATE), \
+ [LOCK_ENABLED_##__STATE] = __stringify(__STATE)"-ON", \
+ [LOCK_USED_IN_##__STATE##_RR] = "IN-"__stringify(__STATE)"-RR", \
+ [LOCK_ENABLED_##__STATE##_RR] = __stringify(__STATE)"-ON-RR",
static const char *usage_str[] =
{
@@ -492,7 +492,7 @@ void get_usage_chars(struct lock_class *class, char usage[LOCK_USAGE_CHARS])
#define LOCKDEP_STATE(__STATE) \
usage[i++] = get_usage_char(class, LOCK_USED_IN_##__STATE); \
- usage[i++] = get_usage_char(class, LOCK_USED_IN_##__STATE##_READ);
+ usage[i++] = get_usage_char(class, LOCK_USED_IN_##__STATE##_RR);
#include "lockdep_states.h"
#undef LOCKDEP_STATE
@@ -1645,7 +1645,7 @@ static const char *state_names[] = {
static const char *state_rnames[] = {
#define LOCKDEP_STATE(__STATE) \
- __stringify(__STATE)"-READ",
+ __stringify(__STATE)"-RR",
#include "lockdep_states.h"
#undef LOCKDEP_STATE
};
@@ -3039,14 +3039,14 @@ static int mark_irqflags(struct task_struct *curr, struct held_lock *hlock)
* mark the lock as used in these contexts:
*/
if (!hlock->trylock) {
- if (hlock->read) {
+ if (hlock->read == 2) {
if (curr->hardirq_context)
if (!mark_lock(curr, hlock,
- LOCK_USED_IN_HARDIRQ_READ))
+ LOCK_USED_IN_HARDIRQ_RR))
return 0;
if (curr->softirq_context)
if (!mark_lock(curr, hlock,
- LOCK_USED_IN_SOFTIRQ_READ))
+ LOCK_USED_IN_SOFTIRQ_RR))
return 0;
} else {
if (curr->hardirq_context)
@@ -3058,13 +3058,13 @@ static int mark_irqflags(struct task_struct *curr, struct held_lock *hlock)
}
}
if (!hlock->hardirqs_off) {
- if (hlock->read) {
+ if (hlock->read == 2) {
if (!mark_lock(curr, hlock,
- LOCK_ENABLED_HARDIRQ_READ))
+ LOCK_ENABLED_HARDIRQ_RR))
return 0;
if (curr->softirqs_enabled)
if (!mark_lock(curr, hlock,
- LOCK_ENABLED_SOFTIRQ_READ))
+ LOCK_ENABLED_SOFTIRQ_RR))
return 0;
} else {
if (!mark_lock(curr, hlock,
@@ -3170,9 +3170,9 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
switch (new_bit) {
#define LOCKDEP_STATE(__STATE) \
case LOCK_USED_IN_##__STATE: \
- case LOCK_USED_IN_##__STATE##_READ: \
+ case LOCK_USED_IN_##__STATE##_RR: \
case LOCK_ENABLED_##__STATE: \
- case LOCK_ENABLED_##__STATE##_READ:
+ case LOCK_ENABLED_##__STATE##_RR:
#include "lockdep_states.h"
#undef LOCKDEP_STATE
ret = mark_lock_irq(curr, this, new_bit);
diff --git a/kernel/locking/lockdep_internals.h b/kernel/locking/lockdep_internals.h
index d459d624ba2a..93002d337936 100644
--- a/kernel/locking/lockdep_internals.h
+++ b/kernel/locking/lockdep_internals.h
@@ -13,9 +13,9 @@
enum lock_usage_bit {
#define LOCKDEP_STATE(__STATE) \
LOCK_USED_IN_##__STATE, \
- LOCK_USED_IN_##__STATE##_READ, \
+ LOCK_USED_IN_##__STATE##_RR, \
LOCK_ENABLED_##__STATE, \
- LOCK_ENABLED_##__STATE##_READ,
+ LOCK_ENABLED_##__STATE##_RR,
#include "lockdep_states.h"
#undef LOCKDEP_STATE
LOCK_USED,
@@ -30,9 +30,9 @@ enum lock_usage_bit {
enum {
#define LOCKDEP_STATE(__STATE) \
__LOCKF(USED_IN_##__STATE) \
- __LOCKF(USED_IN_##__STATE##_READ) \
+ __LOCKF(USED_IN_##__STATE##_RR) \
__LOCKF(ENABLED_##__STATE) \
- __LOCKF(ENABLED_##__STATE##_READ)
+ __LOCKF(ENABLED_##__STATE##_RR)
#include "lockdep_states.h"
#undef LOCKDEP_STATE
__LOCKF(USED)
@@ -41,10 +41,10 @@ enum {
#define LOCKF_ENABLED_IRQ (LOCKF_ENABLED_HARDIRQ | LOCKF_ENABLED_SOFTIRQ)
#define LOCKF_USED_IN_IRQ (LOCKF_USED_IN_HARDIRQ | LOCKF_USED_IN_SOFTIRQ)
-#define LOCKF_ENABLED_IRQ_READ \
- (LOCKF_ENABLED_HARDIRQ_READ | LOCKF_ENABLED_SOFTIRQ_READ)
-#define LOCKF_USED_IN_IRQ_READ \
- (LOCKF_USED_IN_HARDIRQ_READ | LOCKF_USED_IN_SOFTIRQ_READ)
+#define LOCKF_ENABLED_IRQ_RR \
+ (LOCKF_ENABLED_HARDIRQ_RR | LOCKF_ENABLED_SOFTIRQ_RR)
+#define LOCKF_USED_IN_IRQ_RR \
+ (LOCKF_USED_IN_HARDIRQ_RR | LOCKF_USED_IN_SOFTIRQ_RR)
/*
* CONFIG_LOCKDEP_SMALL is defined for sparc. Sparc requires .text,
diff --git a/kernel/locking/lockdep_proc.c b/kernel/locking/lockdep_proc.c
index ad69bbc9bd28..630a6bc6e24c 100644
--- a/kernel/locking/lockdep_proc.c
+++ b/kernel/locking/lockdep_proc.c
@@ -252,17 +252,17 @@ static int lockdep_stats_show(struct seq_file *m, void *v)
nr_hardirq_safe++;
if (class->usage_mask & LOCKF_ENABLED_HARDIRQ)
nr_hardirq_unsafe++;
- if (class->usage_mask & LOCKF_USED_IN_IRQ_READ)
+ if (class->usage_mask & LOCKF_USED_IN_IRQ_RR)
nr_irq_read_safe++;
- if (class->usage_mask & LOCKF_ENABLED_IRQ_READ)
+ if (class->usage_mask & LOCKF_ENABLED_IRQ_RR)
nr_irq_read_unsafe++;
- if (class->usage_mask & LOCKF_USED_IN_SOFTIRQ_READ)
+ if (class->usage_mask & LOCKF_USED_IN_SOFTIRQ_RR)
nr_softirq_read_safe++;
- if (class->usage_mask & LOCKF_ENABLED_SOFTIRQ_READ)
+ if (class->usage_mask & LOCKF_ENABLED_SOFTIRQ_RR)
nr_softirq_read_unsafe++;
- if (class->usage_mask & LOCKF_USED_IN_HARDIRQ_READ)
+ if (class->usage_mask & LOCKF_USED_IN_HARDIRQ_RR)
nr_hardirq_read_safe++;
- if (class->usage_mask & LOCKF_ENABLED_HARDIRQ_READ)
+ if (class->usage_mask & LOCKF_ENABLED_HARDIRQ_RR)
nr_hardirq_read_unsafe++;
#ifdef CONFIG_PROVE_LOCKING
--
2.16.2
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [RFC tip/locking/lockdep v6 01/20] lockdep/Documention: Recursive read lock detection reasoning
From: Boqun Feng @ 2018-04-11 13:50 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Andrea Parri, Paul E. McKenney,
Boqun Feng, Jonathan Corbet, open list:DOCUMENTATION
In-Reply-To: <20180411135110.9217-1-boqun.feng@gmail.com>
This patch add the documentation piece for the reasoning of deadlock
detection related to recursive read lock. The following sections are
added:
* Explain what is a recursive read lock, and what deadlock cases
they could introduce.
* Introduce the notations for different types of dependencies, and
the definition of strong paths.
* Proof for a closed strong path is both sufficient and necessary
for deadlock detections with recursive read locks involved. The
proof could also explain why we call the path "strong"
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
Documentation/locking/lockdep-design.txt | 178 +++++++++++++++++++++++++++++++
1 file changed, 178 insertions(+)
diff --git a/Documentation/locking/lockdep-design.txt b/Documentation/locking/lockdep-design.txt
index 9de1c158d44c..6bb9e90e2c4f 100644
--- a/Documentation/locking/lockdep-design.txt
+++ b/Documentation/locking/lockdep-design.txt
@@ -284,3 +284,181 @@ Run the command and save the output, then compare against the output from
a later run of this command to identify the leakers. This same output
can also help you find situations where runtime lock initialization has
been omitted.
+
+Recursive read locks:
+---------------------
+
+Lockdep now is equipped with deadlock detection for recursive read locks.
+
+Recursive read locks, as their name indicates, are the locks able to be
+acquired recursively. Unlike non-recursive read locks, recursive read locks
+only get blocked by current write lock *holders* other than write lock
+*waiters*, for example:
+
+ TASK A: TASK B:
+
+ read_lock(X);
+
+ write_lock(X);
+
+ read_lock(X);
+
+is not a deadlock for recursive read locks, as while the task B is waiting for
+the lock X, the second read_lock() doesn't need to wait because it's a recursive
+read lock. However if the read_lock() is non-recursive read lock, then the above
+case is a deadlock, because even if the write_lock() in TASK B can not get the
+lock, but it can block the second read_lock() in TASK A.
+
+Note that a lock can be a write lock (exclusive lock), a non-recursive read
+lock (non-recursive shared lock) or a recursive read lock (recursive shared
+lock), depending on the lock operations used to acquire it (more specifically,
+the value of the 'read' parameter for lock_acquire()). In other words, a single
+lock instance has three types of acquisition depending on the acquisition
+functions: exclusive, non-recursive read, and recursive read.
+
+To be concise, we call that write locks and non-recursive read locks as
+"non-recursive" locks and recursive read locks as "recursive" locks.
+
+Recursive locks don't block each other, while non-recursive locks do (this is
+even true for two non-recursive read locks). A non-recursive lock can block the
+corresponding recursive lock, and vice versa.
+
+A deadlock case with recursive locks involved is as follow:
+
+ TASK A: TASK B:
+
+ read_lock(X);
+ read_lock(Y);
+ write_lock(Y);
+ write_lock(X);
+
+Task A is waiting for task B to read_unlock() Y and task B is waiting for task
+A to read_unlock() X.
+
+Dependency types and strong dependency paths:
+---------------------------------------------
+In order to detect deadlocks as above, lockdep needs to track different dependencies.
+There are 4 categories for dependency edges in the lockdep graph:
+
+1) -(NN)->: non-recursive to non-recursive dependency. "X -(NN)-> Y" means
+ X -> Y and both X and Y are non-recursive locks.
+
+2) -(RN)->: recursive to non-recursive dependency. "X -(RN)-> Y" means
+ X -> Y and X is recursive read lock and Y is non-recursive lock.
+
+3) -(NR)->: non-recursive to recursive dependency, "X -(NR)-> Y" means
+ X -> Y and X is non-recursive lock and Y is recursive lock.
+
+4) -(RR)->: recursive to recursive dependency, "X -(RR)-> Y" means
+ X -> Y and both X and Y are recursive locks.
+
+Note that given two locks, they may have multiple dependencies between them, for example:
+
+ TASK A:
+
+ read_lock(X);
+ write_lock(Y);
+ ...
+
+ TASK B:
+
+ write_lock(X);
+ write_lock(Y);
+
+, we have both X -(RN)-> Y and X -(NN)-> Y in the dependency graph.
+
+We use -(*N)-> for edges that is either -(RN)-> or -(NN)->, the similar for -(N*)->,
+-(*R)-> and -(R*)->
+
+A "path" is a series of conjunct dependency edges in the graph. And we define a
+"strong" path, which indicates the strong dependency throughout each dependency
+in the path, as the path that doesn't have two conjunct edges (dependencies) as
+-(*R)-> and -(R*)->. In other words, a "strong" path is a path from a lock
+walking to another through the lock dependencies, and if X -> Y -> Z in the
+path (where X, Y, Z are locks), if the walk from X to Y is through a -(NR)-> or
+-(RR)-> dependency, the walk from Y to Z must not be through a -(RN)-> or
+-(RR)-> dependency, otherwise it's not a strong path.
+
+We will see why the path is called "strong" in next section.
+
+Recursive Read Deadlock Detection:
+----------------------------------
+
+We now prove two things:
+
+Lemma 1:
+
+If there is a closed strong path (i.e. a strong cirle), then there is a
+combination of locking sequences that causes deadlock. I.e. a strong circle is
+sufficient for deadlock detection.
+
+Lemma 2:
+
+If there is no closed strong path (i.e. strong cirle), then there is no
+combination of locking sequences that could cause deadlock. I.e. strong
+circles are necessary for deadlock detection.
+
+With these two Lemmas, we can easily say a closed strong path is both sufficient
+and necessary for deadlocks, therefore a closed strong path is equivalent to
+deadlock possibility. As a closed strong path stands for a dependency chain that
+could cause deadlocks, so we call it "strong", considering there are dependency
+circles that won't cause deadlocks.
+
+Proof for sufficiency (Lemma 1):
+
+Let's say we have a strong cirlce:
+
+ L1 -> L2 ... -> Ln -> L1
+
+, which means we have dependencies:
+
+ L1 -> L2
+ L2 -> L3
+ ...
+ Ln-1 -> Ln
+ Ln -> L1
+
+We now can construct a combination of locking sequences that cause deadlock:
+
+Firstly let's make one CPU/task get the L1 in L1 -> L2, and then another get
+the L2 in L2 -> L3, and so on. After this, all of the Lx in Lx -> Lx+1 are
+held by different CPU/tasks.
+
+And then because we have L1 -> L2, so the holder of L1 is going to acquire L2
+in L1 -> L2, however since L2 is already held by another CPU/task, plus L1 ->
+L2 and L2 -> L3 are not *R and R* (the definition of strong), therefore the
+holder of L1 can not get L2, it has to wait L2's holder to release.
+
+Moreover, we can have a similar conclusion for L2's holder: it has to wait L3's
+holder to release, and so on. We now can proof that Lx's holder has to wait for
+Lx+1's holder to release, and note that Ln+1 is L1, so we have a circular
+waiting scenario and nobody can get progress, therefore a deadlock.
+
+Proof for necessary (Lemma 2):
+
+Lemma 2 is equivalent to: If there is a deadlock scenario, then there must be a
+strong circle in the dependency graph.
+
+According to Wikipedia[1], if there is a deadlock, then there must be a circular
+waiting scenario, means there are N CPU/tasks, where CPU/task P1 is waiting for
+a lock held by P2, and P2 is waiting for a lock held by P3, ... and Pn is waiting
+for a lock held by P1. Let's name the lock Px is waiting as Lx, so since P1 is waiting
+for L1 and holding Ln, so we will have Ln -> L1 in the dependency graph. Similarly,
+we have L1 -> L2, L2 -> L3, ..., Ln-1 -> Ln in the dependency graph, which means we
+have a circle:
+
+ Ln -> L1 -> L2 -> ... -> Ln
+
+, and now let's prove the circle is strong:
+
+For a lock Lx, Px contributes the dependency Lx-1 -> Lx and Px+1 contributes
+the dependency Lx -> Lx+1, and since Px is waiting for Px+1 to release Lx,
+so Lx can not be both recursive in Lx -> Lx+1 and Lx-1 -> Lx, because recursive
+locks don't block each other, therefore Lx-1 -> Lx and Lx -> Lx+1 can not be a
+-(*R)-> -(R*)-> pair, and this is true for any lock in the circle, therefore,
+the circle is strong.
+
+References:
+-----------
+[1]: https://en.wikipedia.org/wiki/Deadlock
+[2]: Shibu, K. (2009). Intro To Embedded Systems (1st ed.). Tata McGraw-Hill
--
2.16.2
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [RFC bpf-next v2 1/8] bpf: add script and prepare bpf.h for new helpers documentation
From: Quentin Monnet @ 2018-04-11 15:41 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180410181620.axzpwh2iawg2kgh3@ast-mbp.dhcp.thefacebook.com>
2018-04-10 11:16 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Tue, Apr 10, 2018 at 03:41:50PM +0100, Quentin Monnet wrote:
>> Remove previous "overview" of eBPF helpers from user bpf.h header.
>> Replace it by a comment explaining how to process the new documentation
>> (to come in following patches) with a Python script to produce RST, then
>> man page documentation.
>>
>> Also add the aforementioned Python script under scripts/. It is used to
>> process include/uapi/linux/bpf.h and to extract helper descriptions, to
>> turn it into a RST document that can further be processed with rst2man
>> to produce a man page. The script takes one "--filename <path/to/file>"
>> option. If the script is launched from scripts/ in the kernel root
>> directory, it should be able to find the location of the header to
>> parse, and "--filename <path/to/file>" is then optional. If it cannot
>> find the file, then the option becomes mandatory. RST-formatted
>> documentation is printed to standard output.
>>
>> Typical workflow for producing the final man page would be:
>>
>> $ ./scripts/bpf_helpers_doc.py \
>> --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
>> $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
>> $ man /tmp/bpf-helpers.7
>>
>> Note that the tool kernel-doc cannot be used to document eBPF helpers,
>> whose signatures are not available directly in the header files
>> (pre-processor directives are used to produce them at the beginning of
>> the compilation process).
>>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>> include/uapi/linux/bpf.h | 406 ++------------------------------------------
>> scripts/bpf_helpers_doc.py | 414 +++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 430 insertions(+), 390 deletions(-)
>> create mode 100755 scripts/bpf_helpers_doc.py
>>
[...]
>> diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
>> new file mode 100755
>> index 000000000000..3a15ba3f0a83
>> --- /dev/null
>> +++ b/scripts/bpf_helpers_doc.py
>> @@ -0,0 +1,414 @@
>> +#!/usr/bin/python3
>> +#
>> +# Copyright (C) 2018 Netronome Systems, Inc.
>> +#
>> +# This software is licensed under the GNU General License Version 2,
>> +# June 1991 as shown in the file COPYING in the top-level directory of this
>> +# source tree.
>
> please use SPDX instead.
>
Same as for bpftool, our layers remain a bit cautious about it. I'd be
happy to change it for SPDX as a follow-up when I get the green light.
>> +#
>> +# THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS"
>> +# WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
>> +# BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
>> +# FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE
>> +# OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
>> +# THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
>> +
[...]
>> +class PrinterRST(Printer):
>> + """
>> + A printer for dumping collected information about helpers as a ReStructured
>> + Text page compatible with the rst2man program, which can be used to
>> + generate a manual page for the helpers.
>> + @helpers: array of Helper objects to print to standard output
>> + """
>> + def print_header(self):
>> + header = '''\
>> +.. Copyright (C) 2018 Netronome Systems, Inc.
>
> I think would be good to capture copyrights of all authors that added
> the helpers being documented. Since a lot of text was copied from commit
> logs it's only fair to preserve the copyrights.
> Such man page file is automatically generated by the python script
> and script itself is copyrighted by Netronome. That's fine, but the text
> of man page is not netronome only.
> I'm not sure what would be the solution. May be something like:
> "
> Copyright (C) All BPF authors and contributors from 2011 to present
> See git log include/uapi/linux/bpf.h for details
> "
> ?
Seems fair indeed. I do not have a better suggestion myself, so I will
stick to yours.
Out of curiosity, why 2011 for the year? I thought you introduced eBPF
in the kernel in 2014 (bd4cf0ed331a), and I do not believe helpers have
any link with cBPF?
>> +..
>> +.. %%%LICENSE_START(VERBATIM)
>> +.. Permission is granted to make and distribute verbatim copies of this
>> +.. manual provided the copyright notice and this permission notice are
>> +.. preserved on all copies.
>> +..
>> +.. Permission is granted to copy and distribute modified versions of this
>> +.. manual under the conditions for verbatim copying, provided that the
>> +.. entire resulting derived work is distributed under the terms of a
>> +.. permission notice identical to this one.
>> +..
>> +.. Since the Linux kernel and libraries are constantly changing, this
>> +.. manual page may be incorrect or out-of-date. The author(s) assume no
>> +.. responsibility for errors or omissions, or for damages resulting from
>> +.. the use of the information contained herein. The author(s) may not
>> +.. have taken the same level of care in the production of this manual,
>> +.. which is licensed free of charge, as they might when working
>> +.. professionally.
>> +..
>> +.. Formatted or processed versions of this manual, if unaccompanied by
>> +.. the source, must acknowledge the copyright and authors of this work.
>> +.. %%%LICENSE_END
>> +..
>> +.. Please do not edit this file. It was generated from the documentation
>> +.. located in file include/uapi/linux/bpf.h of the Linux kernel sources
>> +.. (helpers description), and from scripts/bpf_helpers_doc.py in the same
>> +.. repository (header and footer).
>> +
>> +===========
>> +BPF-HELPERS
>> +===========
>> +-------------------------------------------------------------------------------
>> +list of eBPF helper functions
>> +-------------------------------------------------------------------------------
>> +
>> +:Manual section: 7
>> +
>> +DESCRIPTION
>> +===========
>> +
>> +The extended Berkeley Packet Filter (eBPF) subsystem consists in programs
>> +written in a pseudo-assembly language, then attached to one of the several
>> +kernel hooks and run in reaction of specific events. This framework differs
>> +from the older, "classic" BPF (or "cBPF") in several aspects, one of them being
>> +the ability to call special functions (or "helpers") from within a program. For
>> +security reasons, these functions are restricted to a white-list of helpers
>> +defined in the kernel.
>
> 'for security reasons' sounds a bit odd. May be 'for safety reasons' ?
> Or drop that part.
I'll drop it and keep "These functions are restricted to a white-list of
helpers defined in the kernel."
>> +
>> +These helpers are used by eBPF programs to interact with the system, or with
>> +the context in which they work. For instance, they can be used to print
>> +debugging messages, to get the time since the system was booted, to interact
>> +with eBPF maps, or to manipulate network packets metadata. Since there are
>
> s/packets metadata/packets/
Ok.
>> +several eBPF program types, and that they do not run in the same context, each
>> +program type can only call a subset of those helpers.
>> +
>> +Due to eBPF conventions, a helper can not have more than five arguments.
>> +
>> +This document is an attempt to list and document the helpers available to eBPF
>> +developers. They are sorted by chronological order (the oldest helpers in the
>> +kernel at the top).
>> +
>> +HELPERS
>> +=======
>> +'''
>> + print(header)
>> +
>> + def print_footer(self):
>> + footer = '''
>> +NOTES
>> +=====
>> +
>> +On the performance side, eBPF programs move to the stack all arguments to pass
>> +to the helpers, and call directly into the compiled helper functions without
>
> "move to the stack all arguments" ?! I'm not sure what you're trying to say.
> The arguments stay in registers for the call.
>
>> +requiring any foreign-function interface. As a result, calling helpers
>> +introduce very little overhead.
>
> not true. it's zero overhead. Literally. Very little is not the same as zero.
Not the same indeed :). I will fix with "no overhead".
I'm not too sure either what I meant when I wrote the thing about moving
arguments to the stack... In fact this "NOTES" section is short and not
really relevant. I will probably delete it and add a line in page header
about helpers being called with no overhead.
>> +
>> +EXAMPLES
>> +========
>> +
>> +Example usage for most of the eBPF helpers listed in this manual page are
>> +available within the Linux kernel sources, at the following locations:
>> +
>> +* *samples/bpf/*
>> +* *tools/testing/selftests/bpf/*
>> +
>> +IMPLEMENTATION
>> +==============
>> +
>> +This manual page is an effort to document the existing eBPF helper functions.
>> +But as of this writing, the BPF sub-system is under heavy development. New eBPF
>> +program or map types are added, along with new helper functions. Some helpers
>> +are occasionally made available for additional program types. So in spite of
>> +the efforts of the community, this page might not be up-to-date. If you want to
>> +check by yourself what helper functions exist in your kernel, or what types of
>> +programs they can support, here are some files among the kernel tree that you
>> +may be interested in:
>> +
>> +* *include/uapi/linux/bpf.h* contains the full list of all helper functions.
>> +* *net/core/filter.c* contains the definition of most network-related helper
>> + functions, and the list of program types from which they can be used.
>> +* *kernel/trace/bpf_trace.c* is the equivalent for most tracing program-related
>> + helpers.
>> +* *kernel/bpf/verifier.c* contains the functions used to check that valid types
>> + of eBPF maps are used with a given helper function.
>> +* *kernel/bpf/* directory contains other files in which additional helpers are
>> + defined (for cgroups, sockmaps, etc.).
>> +
>> +Compatibility between helper functions and program types can generally be found
>> +in the files where helper functions are defined. Look for the **struct
>> +bpf_func_proto** objects and for functions returning them: these functions
>> +contain a list of helpers that a given program type can call. Note that the
>> +**default:** label of the **switch ... case** used to filter helpers can call
>> +other functions, themselves allowing access to additional helpers. The
>> +requirement for GPL license is also in those **struct bpf_func_proto**.
>
> I think here would be good to add that most networking helpers are non-GPL
> because they operate on packets which are abstract bytes on the wire,
> whereas most tracing helpers are GPL, since they inspect the guts of
> the linux kernel which is GPL itself.
> That's the main reason why adding extra 'gpl=yes/no' for each helper
> description is redundant.
>
I removed information from the page header about licensing since v1, I
may reintroduce some of it and tell about the difference between
networking and tracing programs, as you suggest.
I understand this difference and I see that specifying GPL requirement
for each individual helper is redundant. And yet I still believe that
for newcomers, it remains easier to have the indication for the specific
helper they are reading about in the man page rather than to take a
guess ("Is this helper for networking only?"). But I do not intend to
add it back to this set anyway, so let's keep this for future
discussions :).
Thanks for the review!
Quentin
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [RFC bpf-next v2 2/8] bpf: add documentation for eBPF helpers (01-11)
From: Quentin Monnet @ 2018-04-11 15:42 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180410175605.2wqhaqx34a4o3gdi@ast-mbp.dhcp.thefacebook.com>
2018-04-10 10:56 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Tue, Apr 10, 2018 at 03:41:51PM +0100, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions, all
>> written by Alexei:
>>
>> - bpf_map_lookup_elem()
>> - bpf_map_update_elem()
>> - bpf_map_delete_elem()
>> - bpf_probe_read()
>> - bpf_ktime_get_ns()
>> - bpf_trace_printk()
>> - bpf_skb_store_bytes()
>> - bpf_l3_csum_replace()
>> - bpf_l4_csum_replace()
>> - bpf_tail_call()
>> - bpf_clone_redirect()
>>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>> include/uapi/linux/bpf.h | 199 +++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 199 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 45f77f01e672..2bc653a3a20f 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -381,6 +381,205 @@ union bpf_attr {
>> * intentional, removing them would break paragraphs for rst2man.
>> *
>> * Start of BPF helper function descriptions:
>> + *
>> + * void *bpf_map_lookup_elem(struct bpf_map *map, void *key)
>> + * Description
>> + * Perform a lookup in *map* for an entry associated to *key*.
>> + * Return
>> + * Map value associated to *key*, or **NULL** if no entry was
>> + * found.
>> + *
>> + * int bpf_map_update_elem(struct bpf_map *map, void *key, void *value, u64 flags)
>> + * Description
>> + * Add or update the value of the entry associated to *key* in
>> + * *map* with *value*. *flags* is one of:
>> + *
>> + * **BPF_NOEXIST**
>> + * The entry for *key* must not exist in the map.
>> + * **BPF_EXIST**
>> + * The entry for *key* must already exist in the map.
>> + * **BPF_ANY**
>> + * No condition on the existence of the entry for *key*.
>> + *
>> + * These flags are only useful for maps of type
>> + * **BPF_MAP_TYPE_HASH**. For all other map types, **BPF_ANY**
>> + * should be used.
>
> I think that's not entirely accurate.
> The flags work as expected for all other map types as well
> and for lru map, sockmap, map in map the flags have practical use cases.
>
Ok, I missed that. I have to go back and check how the flags are used
for those maps. I will cook up something cleaner for the next version of
the set.
>> + * Return
>> + * 0 on success, or a negative error in case of failure.
>> + *
[...]
>> + *
>> + * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
>> + * Description
>> + * This helper is a "printk()-like" facility for debugging. It
>> + * prints a message defined by format *fmt* (of size *fmt_size*)
>> + * to file *\/sys/kernel/debug/tracing/trace* from DebugFS, if
>> + * available. It can take up to three additional **u64**
>> + * arguments (as an eBPF helpers, the total number of arguments is
>> + * limited to five). Each time the helper is called, it appends a
>> + * line that looks like the following:
>> + *
>> + * ::
>> + *
>> + * telnet-470 [001] .N.. 419421.045894: 0x00000001: BPF command: 2
>> + *
>> + * In the above:
>> + *
>> + * * ``telnet`` is the name of the current task.
>> + * * ``470`` is the PID of the current task.
>> + * * ``001`` is the CPU number on which the task is
>> + * running.
>> + * * In ``.N..``, each character refers to a set of
>> + * options (whether irqs are enabled, scheduling
>> + * options, whether hard/softirqs are running, level of
>> + * preempt_disabled respectively). **N** means that
>> + * **TIF_NEED_RESCHED** and **PREEMPT_NEED_RESCHED**
>> + * are set.
>> + * * ``419421.045894`` is a timestamp.
>> + * * ``0x00000001`` is a fake value used by BPF for the
>> + * instruction pointer register.
>> + * * ``BPF command: 2`` is the message formatted with
>> + * *fmt*.
>
> the above depends on how trace_pipe was configured. It's a default
> configuration for many, but would be good to explain this a bit better.
>
I did not know about that. Would you have a pointer about how to
configure trace_pipe, please?
>> + *
>> + * The conversion specifiers supported by *fmt* are similar, but
>> + * more limited than for printk(). They are **%d**, **%i**,
>> + * **%u**, **%x**, **%ld**, **%li**, **%lu**, **%lx**, **%lld**,
>> + * **%lli**, **%llu**, **%llx**, **%p**, **%s**. No modifier (size
>> + * of field, padding with zeroes, etc.) is available, and the
>> + * helper will silently fail if it encounters an unknown
>> + * specifier.
>
> This is not true. bpf_trace_printk will return -EINVAL for unknown specifier.
>
Correct, sorry about that. I never check the return value of
bpf_trace_printk(), and it's hard to realise it failed without resorting
to another bpf_trace_printk() :). I'll fix it, what about:
"No modifier (size of field, padding with zeroes, etc.) is available,
and the helper will return **-EINVAL** (but print nothing) if it
encounters an unknown specifier."
(I would like to keep the "print nothing" idea, at the beginning I spent
some time myself trying to figure out why my bpf_trace_prink() seemed to
be never called--I was simply trying to print with "%#x".)
>> + *
>> + * Also, note that **bpf_trace_printk**\ () is slow, and should
>> + * only be used for debugging purposes. For passing values to user
>> + * space, perf events should be preferred.
>
> please mention the giant dmesg warning that people will definitely
> notice when they try to use this helper.
This is a good idea, I will mention it.
>> + * Return
>> + * The number of bytes written to the buffer, or a negative error
>> + * in case of failure.
>> + *
[...]
>> + * int bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)
>> + * Description
>> + * This special helper is used to trigger a "tail call", or in
>> + * other words, to jump into another eBPF program. The contents of
>> + * eBPF registers and stack are not modified, the new program
>> + * "inherits" them from the caller. This mechanism allows for
>
> "inherits" is a technically correct, but misleading statement,
> since callee program cannot access caller's registers and stack.
>
I can replace this sentence by:
"The same stack frame is used (but values on stack and in registers for
the caller are not accessible to the callee)."
>> + * program chaining, either for raising the maximum number of
>> + * available eBPF instructions, or to execute given programs in
>> + * conditional blocks. For security reasons, there is an upper
>> + * limit to the number of successive tail calls that can be
>> + * performed.
>> + *
>> + * Upon call of this helper, the program attempts to jump into a
>> + * program referenced at index *index* in *prog_array_map*, a
>> + * special map of type **BPF_MAP_TYPE_PROG_ARRAY**, and passes
>> + * *ctx*, a pointer to the context.
>> + *
>> + * If the call succeeds, the kernel immediately runs the first
>> + * instruction of the new program. This is not a function call,
>> + * and it never goes back to the previous program. If the call
>> + * fails, then the helper has no effect, and the caller continues
>> + * to run its own instructions. A call can fail if the destination
>> + * program for the jump does not exist (i.e. *index* is superior
>> + * to the number of entries in *prog_array_map*), or if the
>> + * maximum number of tail calls has been reached for this chain of
>> + * programs. This limit is defined in the kernel by the macro
>> + * **MAX_TAIL_CALL_CNT** (not accessible to user space), which
>> + * is currently set to 32.
>> + * Return
>> + * 0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags)
>> + * Description
>> + * Clone and redirect the packet associated to *skb* to another
>> + * net device of index *ifindex*. The only flag supported for now
>> + * is **BPF_F_INGRESS**, which indicates the packet is to be
>> + * redirected to the ingress interface instead of (by default)
>> + * egress.
>
> imo the above sentence is prone to misinterpretation.
> Can you rephrase it to say that both redirect to ingress and redirect to egress
> are supported and flag is used to indicate which path to take ?
>
I could replace with the following:
"Clone and redirect the packet associated to *skb* to another net device
of index *ifindex*. Both ingress and egress interfaces can be used for
redirection. The **BPF_F_INGRESS** value in *flags* is used to make the
distinction (ingress path is selected if the flag is present, egress
path otherwise). This is the only flag supported for now."
I think I wrote similar things about other helpers using BPF_F_INGRESS
flag, I will also update them accordingly.
>> + *
>> + * A call to this helper is susceptible to change data from the
>> + * packet. Therefore, at load time, all checks on pointers
>> + * previously done by the verifier are invalidated and must be
>> + * performed again.
>> + * Return
>> + * 0 on success, or a negative error in case of failure.
>> */
>> #define __BPF_FUNC_MAPPER(FN) \
>> FN(unspec), \
>> --
>> 2.14.1
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [RFC bpf-next v2 3/8] bpf: add documentation for eBPF helpers (12-22)
From: Quentin Monnet @ 2018-04-11 15:43 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180410224259.v5p2t2dc5s27mw3z@ast-mbp.dhcp.thefacebook.com>
2018-04-10 15:43 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Tue, Apr 10, 2018 at 03:41:52PM +0100, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions, all
>> writter by Alexei:
>>
>> - bpf_get_current_pid_tgid()
>> - bpf_get_current_uid_gid()
>> - bpf_get_current_comm()
>> - bpf_skb_vlan_push()
>> - bpf_skb_vlan_pop()
>> - bpf_skb_get_tunnel_key()
>> - bpf_skb_set_tunnel_key()
>> - bpf_redirect()
>> - bpf_perf_event_output()
>> - bpf_get_stackid()
>> - bpf_get_current_task()
>>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>> include/uapi/linux/bpf.h | 237 +++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 237 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 2bc653a3a20f..f3ea8824efbc 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -580,6 +580,243 @@ union bpf_attr {
>> * performed again.
>> * Return
>> * 0 on success, or a negative error in case of failure.
>> + *
>> + * u64 bpf_get_current_pid_tgid(void)
>> + * Return
>> + * A 64-bit integer containing the current tgid and pid, and
>> + * created as such:
>> + * *current_task*\ **->tgid << 32 \|**
>> + * *current_task*\ **->pid**.
>> + *
>> + * u64 bpf_get_current_uid_gid(void)
>> + * Return
>> + * A 64-bit integer containing the current GID and UID, and
>> + * created as such: *current_gid* **<< 32 \|** *current_uid*.
>> + *
>> + * int bpf_get_current_comm(char *buf, u32 size_of_buf)
>> + * Description
>> + * Copy the **comm** attribute of the current task into *buf* of
>> + * *size_of_buf*. The **comm** attribute contains the name of
>> + * the executable (excluding the path) for the current task. The
>> + * *size_of_buf* must be strictly positive. On success, the
>
> that reminds me that we probably should relax it to ARG_CONST_SIZE_OR_ZERO.
> The programs won't be passing an actual zero into it, but it helps
> a lot to tell verifier that zero is also valid, since programs
> become much simpler.
>
Ok. No change to helper description for now, we will update here when
your patch lands.
>> + * helper makes sure that the *buf* is NUL-terminated. On failure,
>> + * it is filled with zeroes.
>> + * Return
>> + * 0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
>> + * Description
>> + * Push a *vlan_tci* (VLAN tag control information) of protocol
>> + * *vlan_proto* to the packet associated to *skb*, then update
>> + * the checksum. Note that if *vlan_proto* is different from
>> + * **ETH_P_8021Q** and **ETH_P_8021AD**, it is considered to
>> + * be **ETH_P_8021Q**.
>> + *
>> + * A call to this helper is susceptible to change data from the
>> + * packet. Therefore, at load time, all checks on pointers
>> + * previously done by the verifier are invalidated and must be
>> + * performed again.
>> + * Return
>> + * 0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_vlan_pop(struct sk_buff *skb)
>> + * Description
>> + * Pop a VLAN header from the packet associated to *skb*.
>> + *
>> + * A call to this helper is susceptible to change data from the
>> + * packet. Therefore, at load time, all checks on pointers
>> + * previously done by the verifier are invalidated and must be
>> + * performed again.
>> + * Return
>> + * 0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
>> + * Description
>> + * Get tunnel metadata. This helper takes a pointer *key* to an
>> + * empty **struct bpf_tunnel_key** of **size**, that will be
>> + * filled with tunnel metadata for the packet associated to *skb*.
>> + * The *flags* can be set to **BPF_F_TUNINFO_IPV6**, which
>> + * indicates that the tunnel is based on IPv6 protocol instead of
>> + * IPv4.
>> + *
>> + * This is typically used on the receive path to perform a lookup
>> + * or a packet redirection based on the value of *key*:
>
> above is correct, but feels a bit cryptic.
> May be give more concrete example for particular tunneling protocol like gre
> and say that tunnel_key.remote_ip[46] is essential part of the encap and
> bpf prog will make decisions based on the contents of the encap header
> where bpf_tunnel_key is a single structure that generalizes parameters of
> various tunneling protocols into one struct.
>
I will try to do this.
>> + *
>> + * ::
>> + *
>> + * struct bpf_tunnel_key key = {};
>> + * bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
>> + * lookup or redirect based on key ...
>> + *
>> + * Return
>> + * 0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_set_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
>> + * Description
>> + * Populate tunnel metadata for packet associated to *skb.* The
>> + * tunnel metadata is set to the contents of *key*, of *size*. The
>> + * *flags* can be set to a combination of the following values:
>> + *
>> + * **BPF_F_TUNINFO_IPV6**
>> + * Indicate that the tunnel is based on IPv6 protocol
>> + * instead of IPv4.
>> + * **BPF_F_ZERO_CSUM_TX**
>> + * For IPv4 packets, add a flag to tunnel metadata
>> + * indicating that checksum computation should be skipped
>> + * and checksum set to zeroes.
>> + * **BPF_F_DONT_FRAGMENT**
>> + * Add a flag to tunnel metadata indicating that the
>> + * packet should not be fragmented.
>> + * **BPF_F_SEQ_NUMBER**
>> + * Add a flag to tunnel metadata indicating that a
>> + * sequence number should be added to tunnel header before
>> + * sending the packet. This flag was added for GRE
>> + * encapsulation, but might be used with other protocols
>> + * as well in the future.
>> + *
>> + * Here is a typical usage on the transmit path:
>> + *
>> + * ::
>> + *
>> + * struct bpf_tunnel_key key;
>> + * populate key ...
>> + * bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0);
>> + * bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);
>> + *
>> + * Return
>> + * 0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_redirect(u32 ifindex, u64 flags)
>> + * Description
>> + * Redirect the packet to another net device of index *ifindex*.
>> + * This helper is somewhat similar to **bpf_clone_redirect**\
>> + * (), except that the packet is not cloned, which provides
>> + * increased performance.
>> + *
>> + * For hooks other than XDP, *flags* can be set to
>> + * **BPF_F_INGRESS**, which indicates the packet is to be
>> + * redirected to the ingress interface instead of (by default)
>> + * egress. Currently, XDP does not support any flag.
>> + * Return
>> + * For XDP, the helper returns **XDP_REDIRECT** on success or
>> + * **XDP_ABORT** on error. For other program types, the values
>> + * are **TC_ACT_REDIRECT** on success or **TC_ACT_SHOT** on
>> + * error.
>> + *
>> + * int bpf_perf_event_output(struct pt_reg *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
>> + * Description
>> + * Write perf raw sample into a perf event held by *map* of type
>
> I'd say:
> Write raw *data* blob into special bpf perf event held by ...
>
Yes it sounds better, I will follow the suggestion.
>> + * **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf event must
>> + * have the following attributes: **PERF_SAMPLE_RAW** as
>> + * **sample_type**, **PERF_TYPE_SOFTWARE** as **type**, and
>> + * **PERF_COUNT_SW_BPF_OUTPUT** as **config**.
>> + *
>> + * The *flags* are used to indicate the index in *map* for which
>> + * the value must be put, masked with **BPF_F_INDEX_MASK**.
>> + * Alternatively, *flags* can be set to **BPF_F_CURRENT_CPU**
>> + * to indicate that the index of the current CPU core should be
>> + * used.
>> + *
>> + * The value to write, of *size*, is passed through eBPF stack and
>> + * pointed by *data*.
>> + *
>> + * The context of the program *ctx* needs also be passed to the
>> + * helper, and will get interpreted as a pointer to a **struct
>> + * pt_reg**.
>
> Not quite correct.
> Initially bpf_perf_event_output() was only used with 'struct pt_reg *ctx',
> but then later it was generalized for all other tracing prog types,
> for clsact and even for XDP.
> So 'ctx' can be any of the context used by these program types.
>
Right, I suppose I only looked at bpf_perf_event_output_tp() for this
one :(. I can simply trim it to:
"The context of the program *ctx* needs also be passed to the helper."
>> + *
>> + * On user space, a program willing to read the values needs to
>> + * call **perf_event_open**\ () on the perf event (either for
>> + * one or for all CPUs) and to store the file descriptor into the
>> + * *map*. This must be done before the eBPF program can send data
>> + * into it. An example is available in file
>> + * *samples/bpf/trace_output_user.c* in the Linux kernel source
>> + * tree (the eBPF program counterpart is in
>> + * *samples/bpf/trace_output_kern.c*). It looks like the
>> + * following snippet:
>> + *
>> + * ::
>> + *
>> + * volatile struct perf_event_mmap_page *header;
>> + * struct perf_event_attr attr = {
>> + * .sample_type = PERF_SAMPLE_RAW,
>> + * .type = PERF_TYPE_SOFTWARE,
>> + * .config = PERF_COUNT_SW_BPF_OUTPUT,
>> + * };
>> + * int page_size;
>> + * int mmap_size;
>> + * int key = 0;
>> + * int pmu_fd;
>> + * void *base;
>> + *
>> + * if (load_bpf_file(filename))
>> + * return -1;
>> + *
>> + * pmu_fd = sys_perf_event_open(&attr,
>> + * -1, // pid
>> + * 0, // cpu
>> + * -1, // group_fd
>> + * 0);
>> + *
>> + * assert(pmu_fd >= 0);
>> + * assert(bpf_map_update_elem(map_fd[0], &key,
>> + * &pmu_fd, BPF_ANY) == 0);
>> + * assert(ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0) == 0);
>> + *
>> + * page_size = getpagesize();
>> + * mmap_size = page_size * (page_cnt + 1);
>> + *
>> + * base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
>> + * MAP_SHARED, fd, 0);
>> + * if (base == MAP_FAILED)
>> + * return -1;
>> + *
>> + * header = base;
>
> I think that is too much for the man page, especially above is far from
> complete example.
>
Yeah, I was unsure about keeping it. I will remove the snippet.
>> + *
>> + * **bpf_perf_event_output**\ () achieves better performance
>> + * than **bpf_trace_printk**\ () for sharing data with user
>> + * space, and is much better suitable for streaming data from eBPF
>> + * programs.
>> + * Return
>> + * 0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_get_stackid(struct pt_reg *ctx, struct bpf_map *map, u64 flags)
>> + * Description
>> + * Walk a user or a kernel stack and return its id. To achieve
>> + * this, the helper needs *ctx*, which is a pointer to the context
>> + * on which the tracing program is executed, and a pointer to a
>> + * *map* of type **BPF_MAP_TYPE_STACK_TRACE**.
>> + *
>> + * The last argument, *flags*, holds the number of stack frames to
>> + * skip (from 0 to 255), masked with
>> + * **BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
>> + * a combination of the following flags:
>> + *
>> + * **BPF_F_USER_STACK**
>> + * Collect a user space stack instead of a kernel stack.
>> + * **BPF_F_FAST_STACK_CMP**
>> + * Compare stacks by hash only.
>> + * **BPF_F_REUSE_STACKID**
>> + * If two different stacks hash into the same *stackid*,
>> + * discard the old one.
>
> we have an annoying bug here that we will be sending a patch to fix soon,
> since right now there is no way for the program to know that stackid
> got replaced.
>
Understood. Same as for bpf_get_current_comm(), I will leave the
description untouched until the patch lands.
>> + *
>> + * The stack id retrieved is a 32 bit long integer handle which
>> + * can be further combined with other data (including other stack
>> + * ids) and used as a key into maps. This can be useful for
>> + * generating a variety of graphs (such as flame graphs or off-cpu
>> + * graphs).
>> + *
>> + * For walking a stack, this helper is an improvement over
>> + * **bpf_probe_read**\ (), which can be used with unrolled loops
>> + * but is not efficient and consumes a lot of eBPF instructions.
>> + * Instead, **bpf_get_stackid**\ () can collect up to
>> + * **PERF_MAX_STACK_DEPTH** both kernel and user frames.
>
> PERF_MAX_STACK_DEPTH is now controlled by sysctl knob.
> Would be good to mention that this limit can and should be increased
> for profiling long user stacks like java.
>
Good idea, I will add it.
Thanks a lot Alexei for the thorough reviews!
Quentin
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Quentin Monnet @ 2018-04-11 15:44 UTC (permalink / raw)
To: Yonghong Song, daniel, ast
Cc: netdev, oss-drivers, linux-doc, linux-man, Lawrence Brakmo,
Josef Bacik, Andrey Ignatov
In-Reply-To: <cc54b41e-3f2f-e87f-042f-842c96308626@fb.com>
2018-04-10 09:58 UTC-0700 ~ Yonghong Song <yhs@fb.com>
> On 4/10/18 7:41 AM, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions:
>>
>> Helpers from Lawrence:
>> - bpf_setsockopt()
>> - bpf_getsockopt()
>> - bpf_sock_ops_cb_flags_set()
>>
>> Helpers from Yonghong:
>> - bpf_perf_event_read_value()
>> - bpf_perf_prog_read_value()
>>
>> Helper from Josef:
>> - bpf_override_return()
>>
>> Helper from Andrey:
>> - bpf_bind()
>>
>> Cc: Lawrence Brakmo <brakmo@fb.com>
>> Cc: Yonghong Song <yhs@fb.com>
>> Cc: Josef Bacik <jbacik@fb.com>
>> Cc: Andrey Ignatov <rdna@fb.com>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>> include/uapi/linux/bpf.h | 184
>> +++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 184 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 15d9ccafebbe..7343af4196c8 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
[...]
>> @@ -1255,6 +1277,168 @@ union bpf_attr {
>> * performed again.
>> * Return
>> * 0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_perf_event_read_value(struct bpf_map *map, u64 flags,
>> struct bpf_perf_event_value *buf, u32 buf_size)
>> + * Description
>> + * Read the value of a perf event counter, and store it into
>> *buf*
>> + * of size *buf_size*. This helper relies on a *map* of type
>> + * **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of the perf
>> + * event counter is selected at the creation of the *map*. The
>
> The nature of the perf event counter is selected when *map* is updated
> with perf_event fd's.
>
Thanks, I will fix it.
>> + * *map* is an array whose size is the number of available CPU
>> + * cores, and each cell contains a value relative to one
>> core. The
>
> It is confusing to mix core/cpu here. Maybe just use perf_event
> convention, always using cpu?
>
Right, I'll remove occurrences of "core".
>> + * value to retrieve is indicated by *flags*, that contains the
>> + * index of the core to look up, masked with
>> + * **BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to
>> + * **BPF_F_CURRENT_CPU** to indicate that the value for the
>> + * current CPU core should be retrieved.
>> + *
>> + * This helper behaves in a way close to
>> + * **bpf_perf_event_read**\ () helper, save that instead of
>> + * just returning the value observed, it fills the *buf*
>> + * structure. This allows for additional data to be
>> retrieved: in
>> + * particular, the enabled and running times (in *buf*\
>> + * **->enabled** and *buf*\ **->running**, respectively) are
>> + * copied.
>> + *
>> + * These values are interesting, because hardware PMU
>> (Performance
>> + * Monitoring Unit) counters are limited resources. When
>> there are
>> + * more PMU based perf events opened than available counters,
>> + * kernel will multiplex these events so each event gets certain
>> + * percentage (but not all) of the PMU time. In case that
>> + * multiplexing happens, the number of samples or counter value
>> + * will not reflect the case compared to when no multiplexing
>> + * occurs. This makes comparison between different runs
>> difficult.
>> + * Typically, the counter value should be normalized before
>> + * comparing to other experiments. The usual normalization is
>> done
>> + * as follows.
>> + *
>> + * ::
>> + *
>> + * normalized_counter = counter * t_enabled / t_running
>> + *
>> + * Where t_enabled is the time enabled for event and
>> t_running is
>> + * the time running for event since last normalization. The
>> + * enabled and running times are accumulated since the perf
>> event
>> + * open. To achieve scaling factor between two invocations of an
>> + * eBPF program, users can can use CPU id as the key (which is
>> + * typical for perf array usage model) to remember the previous
>> + * value and do the calculation inside the eBPF program.
>> + * Return
>> + * 0 on success, or a negative error in case of failure.
>> + *
[...]
Thanks Yonghong for the review!
I have a favor to ask of you. I got a bounce for Kaixu Xia's email
address, and I don't know what alternative email address I could use. I
CC-ed to have a review for helper bpf_perf_event_read() (in patch 6 of
this series), which is rather close to bpf_perf_event_read_value().
Would you mind having a look at that one too, please? The description is
not long.
Quentin
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox