* [RFC PATCH v2] powerpc/xmon: restrict when kernel is locked down
From: Christopher M. Riedl @ 2019-05-24 12:38 UTC (permalink / raw)
To: linuxppc-dev, kernel-hardening; +Cc: Christopher M. Riedl, ajd, mjg59, dja
Xmon should be either fully or partially disabled depending on the
kernel lockdown state.
Put xmon into read-only mode for lockdown=integrity and completely
disable xmon when lockdown=confidentiality. Xmon checks the lockdown
state and takes appropriate action:
(1) during xmon_setup to prevent early xmon'ing
(2) when triggered via sysrq
(3) when toggled via debugfs
(4) when triggered via a previously enabled breakpoint
The following lockdown state transitions are handled:
(1) lockdown=none -> lockdown=integrity
clear all breakpoints, set xmon read-only mode
(2) lockdown=none -> lockdown=confidentiality
clear all breakpoints, prevent re-entry into xmon
(3) lockdown=integrity -> lockdown=confidentiality
prevent re-entry into xmon
Suggested-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Christopher M. Riedl <cmr@informatik.wtf>
---
Applies on top of this series:
https://patchwork.kernel.org/cover/10884631/
I've done some limited testing of the scenarios mentioned in the commit
message on a single-CPU QEMU config.
v1->v2:
Fix subject line
Submit to linuxppc-dev and kernel-hardening
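For reviewers, the transitions listed in the commit message can be modelled
roughly in userspace as below. This is a sketch only: the enum, struct, and
function names are invented for illustration, and it assumes that
kernel_is_locked_down() reports true whenever the current lockdown level is
at or above the level asked about.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the lockdown levels named in the patch. */
enum lockdown_level {
	LOCKDOWN_NONE,
	LOCKDOWN_INTEGRITY,
	LOCKDOWN_CONFIDENTIALITY,
};

/* Illustrative model of the state touched by xmon_check_lockdown(). */
struct xmon_model {
	bool on;         /* models xmon_on */
	bool read_only;  /* models xmon_is_ro */
	bool disabled;   /* models the static "lockdown" flag */
	int breakpoints; /* number of breakpoints currently set */
};

/* Returns true when the caller must not enter xmon. */
static bool model_check_lockdown(struct xmon_model *x, enum lockdown_level level)
{
	assert(x);

	/* Assumed semantics: "locked down" means level >= the queried level. */
	if (!x->disabled && level >= LOCKDOWN_CONFIDENTIALITY) {
		x->disabled = true;
		x->on = false;
	}

	if (!x->read_only && level >= LOCKDOWN_INTEGRITY) {
		x->read_only = true;
		x->breakpoints = 0; /* models clear_all_bpt() */
	}

	return x->disabled;
}
```

In this model a jump straight from none to confidentiality takes both
branches, which matches transition (2): breakpoints are cleared and re-entry
is refused.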
arch/powerpc/xmon/xmon.c | 56 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 55 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 3e7be19aa208..8c4a5a0c28f0 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -191,6 +191,9 @@ static void dump_tlb_44x(void);
static void dump_tlb_book3e(void);
#endif
+static void clear_all_bpt(void);
+static void xmon_init(int);
+
#ifdef CONFIG_PPC64
#define REG "%.16lx"
#else
@@ -291,6 +294,39 @@ Commands:\n\
zh halt\n"
;
+#ifdef CONFIG_LOCK_DOWN_KERNEL
+static bool xmon_check_lockdown(void)
+{
+ static bool lockdown = false;
+
+ if (!lockdown) {
+ lockdown = kernel_is_locked_down("Using xmon",
+ LOCKDOWN_CONFIDENTIALITY);
+ if (lockdown) {
+ printf("xmon: Disabled by strict kernel lockdown\n");
+ xmon_on = 0;
+ xmon_init(0);
+ }
+ }
+
+ if (!xmon_is_ro) {
+ xmon_is_ro = kernel_is_locked_down("Using xmon write-access",
+ LOCKDOWN_INTEGRITY);
+ if (xmon_is_ro) {
+ printf("xmon: Read-only due to kernel lockdown\n");
+ clear_all_bpt();
+ }
+ }
+
+ return lockdown;
+}
+#else
+static inline bool xmon_check_lockdown(void)
+{
+ return false;
+}
+#endif /* CONFIG_LOCK_DOWN_KERNEL */
+
static struct pt_regs *xmon_regs;
static inline void sync(void)
@@ -708,6 +744,9 @@ static int xmon_bpt(struct pt_regs *regs)
struct bpt *bp;
unsigned long offset;
+ if (xmon_check_lockdown())
+ return 0;
+
if ((regs->msr & (MSR_IR|MSR_PR|MSR_64BIT)) != (MSR_IR|MSR_64BIT))
return 0;
@@ -739,6 +778,9 @@ static int xmon_sstep(struct pt_regs *regs)
static int xmon_break_match(struct pt_regs *regs)
{
+ if (xmon_check_lockdown())
+ return 0;
+
if ((regs->msr & (MSR_IR|MSR_PR|MSR_64BIT)) != (MSR_IR|MSR_64BIT))
return 0;
if (dabr.enabled == 0)
@@ -749,6 +791,9 @@ static int xmon_break_match(struct pt_regs *regs)
static int xmon_iabr_match(struct pt_regs *regs)
{
+ if (xmon_check_lockdown())
+ return 0;
+
if ((regs->msr & (MSR_IR|MSR_PR|MSR_64BIT)) != (MSR_IR|MSR_64BIT))
return 0;
if (iabr == NULL)
@@ -3742,6 +3787,9 @@ static void xmon_init(int enable)
#ifdef CONFIG_MAGIC_SYSRQ
static void sysrq_handle_xmon(int key)
{
+ if (xmon_check_lockdown())
+ return;
+
/* ensure xmon is enabled */
xmon_init(1);
debugger(get_irq_regs());
@@ -3763,7 +3811,6 @@ static int __init setup_xmon_sysrq(void)
device_initcall(setup_xmon_sysrq);
#endif /* CONFIG_MAGIC_SYSRQ */
-#ifdef CONFIG_DEBUG_FS
static void clear_all_bpt(void)
{
int i;
@@ -3785,8 +3832,12 @@ static void clear_all_bpt(void)
printf("xmon: All breakpoints cleared\n");
}
+#ifdef CONFIG_DEBUG_FS
static int xmon_dbgfs_set(void *data, u64 val)
{
+ if (xmon_check_lockdown())
+ return 0;
+
xmon_on = !!val;
xmon_init(xmon_on);
@@ -3845,6 +3896,9 @@ early_param("xmon", early_parse_xmon);
void __init xmon_setup(void)
{
+ if (xmon_check_lockdown())
+ return;
+
if (xmon_on)
xmon_init(1);
if (xmon_early)
--
2.21.0
* Re: [BISECTED] kexec regression on PowerBook G4
From: Aaro Koskinen @ 2019-05-24 13:29 UTC (permalink / raw)
To: Christophe Leroy; +Cc: linuxppc-dev
In-Reply-To: <969271d1-0943-42e6-8992-77b20e305e48@c-s.fr>
Hi,
On Fri, May 24, 2019 at 09:40:30AM +0200, Christophe Leroy wrote:
> On 24/05/2019 at 09:36, Aaro Koskinen wrote:
> >On Fri, May 24, 2019 at 08:08:36AM +0200, Christophe Leroy wrote:
> >>>On 24/05/2019 at 00:23, Aaro Koskinen wrote:
> >>>>Unfortunately still no luck... The crash is pretty much the same with
> >>>>both
> >>>>changes.
> >>>
> >>>Right. In fact change_page_attr() does nothing because this part of RAM is
> >>>mapped by DBATs so v_block_mapped() returns not NULL.
> >>>
> >>>So, we have to set an IBAT for this area. I'll try and send you a new
> >>>patch for that before noon (CET).
> >>>
> >>
> >>patch sent out. In the patch I have also added a printk to print the buffer
> >>address, so if the problem still occurs, we'll know if the problem is really
> >>at the address of the buffer or if we are wrong from the beginning.
> >
> >Reboot code buffer at ef0c3000
> >Bye!
> >BUG: Unable to handle kernel instruction fetch
> >Faulting instruction address: 0xef0c3000
> >
>
> Oops, I forgot to call update_bats() after setibat().
>
> Can you add it and retry ?
Thanks, that was it, now it finally works!
A.
* Re: [BISECTED] kexec regression on PowerBook G4
From: Christophe Leroy @ 2019-05-24 13:35 UTC (permalink / raw)
To: Aaro Koskinen; +Cc: linuxppc-dev
In-Reply-To: <20190524132907.GE5234@darkstar.musicnaut.iki.fi>
On 24/05/2019 at 15:29, Aaro Koskinen wrote:
> Hi,
>
> On Fri, May 24, 2019 at 09:40:30AM +0200, Christophe Leroy wrote:
>> On 24/05/2019 at 09:36, Aaro Koskinen wrote:
>>> On Fri, May 24, 2019 at 08:08:36AM +0200, Christophe Leroy wrote:
>>>>> On 24/05/2019 at 00:23, Aaro Koskinen wrote:
>>>>>> Unfortunately still no luck... The crash is pretty much the same with
>>>>>> both
>>>>>> changes.
>>>>>
>>>>> Right. In fact change_page_attr() does nothing because this part of RAM is
>>>>> mapped by DBATs so v_block_mapped() returns not NULL.
>>>>>
>>>>> So, we have to set an IBAT for this area. I'll try and send you a new
>>>>> patch for that before noon (CET).
>>>>>
>>>>
>>>> patch sent out. In the patch I have also added a printk to print the buffer
>>>> address, so if the problem still occurs, we'll know if the problem is really
>>>> at the address of the buffer or if we are wrong from the beginning.
>>>
>>> Reboot code buffer at ef0c3000
>>> Bye!
>>> BUG: Unable to handle kernel instruction fetch
>>> Faulting instruction address: 0xef0c3000
>>>
>>
>> Oops, I forgot to call update_bats() after setibat().
>>
>> Can you add it and retry ?
>
> Thanks, that was it, now it finally works!
>
Thanks for reporting the issue and testing.
I'll work on a clean fix patch at the beginning of June.
Christophe
* Re: [RFC PATCH v2] powerpc: fix kexec failure on book3s/32
From: Christophe Leroy @ 2019-05-24 13:38 UTC (permalink / raw)
To: Aaro Koskinen; +Cc: linuxppc-dev, linux-kernel
In-Reply-To: <8164abbe117d8353bb88132d7cfa8bc26a60ca66.1558677767.git.christophe.leroy@c-s.fr>
On 24/05/2019 at 08:05, Christophe Leroy wrote:
> Fixes: 63b2bc619565 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX")
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
> ---
> arch/powerpc/kernel/machine_kexec_32.c | 8 ++++++++
> arch/powerpc/mm/book3s32/mmu.c | 7 +++++--
> arch/powerpc/mm/mmu_decl.h | 2 ++
> 3 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/machine_kexec_32.c b/arch/powerpc/kernel/machine_kexec_32.c
> index affe5dcce7f4..83e61a8f8468 100644
> --- a/arch/powerpc/kernel/machine_kexec_32.c
> +++ b/arch/powerpc/kernel/machine_kexec_32.c
> @@ -15,6 +15,7 @@
> #include <asm/cacheflush.h>
> #include <asm/hw_irq.h>
> #include <asm/io.h>
> +#include <mm/mmu_decl.h>
>
> typedef void (*relocate_new_kernel_t)(
> unsigned long indirection_page,
> @@ -35,6 +36,8 @@ void default_machine_kexec(struct kimage *image)
> unsigned long page_list;
> unsigned long reboot_code_buffer, reboot_code_buffer_phys;
> relocate_new_kernel_t rnk;
> + unsigned long bat_size = 128 << 10;
> + unsigned long bat_mask = ~(bat_size - 1);
>
> /* Interrupts aren't acceptable while we reboot */
> local_irq_disable();
> @@ -54,6 +57,11 @@ void default_machine_kexec(struct kimage *image)
> memcpy((void *)reboot_code_buffer, relocate_new_kernel,
> relocate_new_kernel_size);
>
> + printk(KERN_INFO "Reboot code buffer at %lx\n", reboot_code_buffer);
> + mtsrin(mfsrin(reboot_code_buffer) & ~SR_NX, reboot_code_buffer);
> + setibat(7, reboot_code_buffer & bat_mask, reboot_code_buffer_phys & bat_mask,
> + bat_size, PAGE_KERNEL_TEXT);
A call to update_bats() has to be added here after setibat().
Christophe
> +
> flush_icache_range(reboot_code_buffer,
> reboot_code_buffer + KEXEC_CONTROL_PAGE_SIZE);
> printk(KERN_INFO "Bye!\n");
> diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c
> index fc073cb2c517..7124700edb0f 100644
> --- a/arch/powerpc/mm/book3s32/mmu.c
> +++ b/arch/powerpc/mm/book3s32/mmu.c
> @@ -124,8 +124,8 @@ static unsigned int block_size(unsigned long base, unsigned long top)
> * of 2 between 128k and 256M.
> * Only for 603+ ...
> */
> -static void setibat(int index, unsigned long virt, phys_addr_t phys,
> - unsigned int size, pgprot_t prot)
> +void setibat(int index, unsigned long virt, phys_addr_t phys,
> + unsigned int size, pgprot_t prot)
> {
> unsigned int bl = (size >> 17) - 1;
> int wimgxpp;
> @@ -197,6 +197,9 @@ void mmu_mark_initmem_nx(void)
> if (cpu_has_feature(CPU_FTR_601))
> return;
>
> + if (IS_ENABLED(CONFIG_KEXEC))
> + nb--;
> +
> for (i = 0; i < nb - 1 && base < top && top - base > (128 << 10);) {
> size = block_size(base, top);
> setibat(i++, PAGE_OFFSET + base, base, size, PAGE_KERNEL_TEXT);
> diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
> index 7bac0aa2026a..478584d50cf2 100644
> --- a/arch/powerpc/mm/mmu_decl.h
> +++ b/arch/powerpc/mm/mmu_decl.h
> @@ -103,6 +103,8 @@ void print_system_hash_info(void);
> extern void mapin_ram(void);
> extern void setbat(int index, unsigned long virt, phys_addr_t phys,
> unsigned int size, pgprot_t prot);
> +void setibat(int index, unsigned long virt, phys_addr_t phys,
> + unsigned int size, pgprot_t prot);
>
> extern int __map_without_bats;
> extern unsigned int rtas_data, rtas_size;
>
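As a worked check of the BAT arithmetic in the quoted patch, the masking and
the block-length computation can be reproduced in plain C. This is only a
userspace sketch: the helper names are hypothetical, and 32-bit addresses are
assumed since the target is book3s/32.

```c
#include <assert.h>
#include <stdint.h>

#define BAT_MIN_SIZE (UINT32_C(128) << 10) /* 128 KiB, as bat_size in the patch */

/* Align an address down to a power-of-two block size (addr & bat_mask). */
static uint32_t bat_align_down(uint32_t addr, uint32_t size)
{
	assert(size >= BAT_MIN_SIZE && (size & (size - 1)) == 0);
	return addr & ~(size - 1);
}

/* Block-length value as computed in setibat(): (size >> 17) - 1. */
static uint32_t bat_block_length(uint32_t size)
{
	assert(size >= BAT_MIN_SIZE && (size & (size - 1)) == 0);
	return (size >> 17) - 1;
}
```

With the buffer address from the report, bat_align_down(0xef0c3000,
BAT_MIN_SIZE) gives 0xef0c0000, so the 128 KiB block covers the faulting
address 0xef0c3000. The block-length value is 0 for 128 KiB and 0x7ff for
256 MiB.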
* [PATCH v2] mm: add account_locked_vm utility function
From: Daniel Jordan @ 2019-05-24 17:50 UTC (permalink / raw)
To: akpm
Cc: Mark Rutland, Davidlohr Bueso, kvm, Alan Tull,
Alexey Kardashevskiy, linux-fpga, linux-kernel, kvm-ppc,
Daniel Jordan, linux-mm, Alex Williamson, Jason Gunthorpe,
Moritz Fischer, Steve Sistare, Christoph Lameter, linuxppc-dev,
Wu Hao
In-Reply-To: <de375582-2c35-8e8a-4737-c816052a8e58@ozlabs.ru>
locked_vm accounting is done roughly the same way in five places, so
unify them in a helper. Standardize the debug prints, which vary
slightly, but include the helper's caller to disambiguate between
callsites.
Error codes stay the same, so user-visible behavior does too. The one
exception is that the -EPERM case in tce_account_locked_vm is removed
because Alexey has never seen it triggered.
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Alan Tull <atull@kernel.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Moritz Fischer <mdf@kernel.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Wu Hao <hao.wu@intel.com>
Cc: linux-mm@kvack.org
Cc: kvm@vger.kernel.org
Cc: kvm-ppc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-fpga@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
Against v5.2-rc1.
v2:
- applied review comments from Alexey
- added _RET_IP_ to debug print to disambiguate callers
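For readers who want the accounting rule without the kernel context, the
core of __account_locked_vm() can be modelled in userspace roughly as
follows. This is a sketch under stated assumptions: struct mm_model and
model_account_locked_vm() are hypothetical names, the limit is passed in
pages rather than read from RLIMIT_MEMLOCK, locking and the debug print are
omitted, and an assert() stands in for WARN_ON_ONCE().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the one mm_struct field the helper touches. */
struct mm_model {
	unsigned long locked_vm; /* pages currently accounted as locked */
};

/*
 * Account @pages against @mm. On increment, fail with -ENOMEM if the new
 * total would exceed @limit_pages, unless @bypass_rlim is set.
 */
static int model_account_locked_vm(struct mm_model *mm, unsigned long pages,
				   bool inc, unsigned long limit_pages,
				   bool bypass_rlim)
{
	assert(mm);

	if (inc) {
		if (!bypass_rlim && mm->locked_vm + pages > limit_pages)
			return -ENOMEM;
		mm->locked_vm += pages;
	} else {
		/* The patch warns here instead of aborting. */
		assert(pages <= mm->locked_vm);
		mm->locked_vm -= pages;
	}

	return 0;
}
```

A failed increment leaves locked_vm untouched, so callers only need to undo
accounting that actually succeeded.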
arch/powerpc/kvm/book3s_64_vio.c | 44 +++--------------------
arch/powerpc/mm/book3s64/iommu_api.c | 41 +++------------------
drivers/fpga/dfl-afu-dma-region.c | 53 +++------------------------
drivers/vfio/vfio_iommu_spapr_tce.c | 54 +++-------------------------
drivers/vfio/vfio_iommu_type1.c | 17 ++-------
include/linux/mm.h | 19 ++++++++++
mm/util.c | 46 ++++++++++++++++++++++++
7 files changed, 84 insertions(+), 190 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 66270e07449a..768b645c7edf 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -30,6 +30,7 @@
#include <linux/anon_inodes.h>
#include <linux/iommu.h>
#include <linux/file.h>
+#include <linux/mm.h>
#include <asm/kvm_ppc.h>
#include <asm/kvm_book3s.h>
@@ -56,43 +57,6 @@ static unsigned long kvmppc_stt_pages(unsigned long tce_pages)
return tce_pages + ALIGN(stt_bytes, PAGE_SIZE) / PAGE_SIZE;
}
-static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
-{
- long ret = 0;
-
- if (!current || !current->mm)
- return ret; /* process exited */
-
- down_write(&current->mm->mmap_sem);
-
- if (inc) {
- unsigned long locked, lock_limit;
-
- locked = current->mm->locked_vm + stt_pages;
- lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
- if (locked > lock_limit && !capable(CAP_IPC_LOCK))
- ret = -ENOMEM;
- else
- current->mm->locked_vm += stt_pages;
- } else {
- if (WARN_ON_ONCE(stt_pages > current->mm->locked_vm))
- stt_pages = current->mm->locked_vm;
-
- current->mm->locked_vm -= stt_pages;
- }
-
- pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%ld %ld/%ld%s\n", current->pid,
- inc ? '+' : '-',
- stt_pages << PAGE_SHIFT,
- current->mm->locked_vm << PAGE_SHIFT,
- rlimit(RLIMIT_MEMLOCK),
- ret ? " - exceeded" : "");
-
- up_write(&current->mm->mmap_sem);
-
- return ret;
-}
-
static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
{
struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
@@ -302,7 +266,7 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
kvm_put_kvm(stt->kvm);
- kvmppc_account_memlimit(
+ account_locked_vm(current->mm,
kvmppc_stt_pages(kvmppc_tce_pages(stt->size)), false);
call_rcu(&stt->rcu, release_spapr_tce_table);
@@ -327,7 +291,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
return -EINVAL;
npages = kvmppc_tce_pages(size);
- ret = kvmppc_account_memlimit(kvmppc_stt_pages(npages), true);
+ ret = account_locked_vm(current->mm, kvmppc_stt_pages(npages), true);
if (ret)
return ret;
@@ -373,7 +337,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
kfree(stt);
fail_acct:
- kvmppc_account_memlimit(kvmppc_stt_pages(npages), false);
+ account_locked_vm(current->mm, kvmppc_stt_pages(npages), false);
return ret;
}
diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index 5c521f3924a5..18d22eec0ebd 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -19,6 +19,7 @@
#include <linux/hugetlb.h>
#include <linux/swap.h>
#include <linux/sizes.h>
+#include <linux/mm.h>
#include <asm/mmu_context.h>
#include <asm/pte-walk.h>
#include <linux/mm_inline.h>
@@ -51,40 +52,6 @@ struct mm_iommu_table_group_mem_t {
u64 dev_hpa; /* Device memory base address */
};
-static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
- unsigned long npages, bool incr)
-{
- long ret = 0, locked, lock_limit;
-
- if (!npages)
- return 0;
-
- down_write(&mm->mmap_sem);
-
- if (incr) {
- locked = mm->locked_vm + npages;
- lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
- if (locked > lock_limit && !capable(CAP_IPC_LOCK))
- ret = -ENOMEM;
- else
- mm->locked_vm += npages;
- } else {
- if (WARN_ON_ONCE(npages > mm->locked_vm))
- npages = mm->locked_vm;
- mm->locked_vm -= npages;
- }
-
- pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
- current ? current->pid : 0,
- incr ? '+' : '-',
- npages << PAGE_SHIFT,
- mm->locked_vm << PAGE_SHIFT,
- rlimit(RLIMIT_MEMLOCK));
- up_write(&mm->mmap_sem);
-
- return ret;
-}
-
bool mm_iommu_preregistered(struct mm_struct *mm)
{
return !list_empty(&mm->context.iommu_group_mem_list);
@@ -101,7 +68,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
unsigned long entry, chunk;
if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
- ret = mm_iommu_adjust_locked_vm(mm, entries, true);
+ ret = account_locked_vm(mm, entries, true);
if (ret)
return ret;
@@ -216,7 +183,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
kfree(mem);
unlock_exit:
- mm_iommu_adjust_locked_vm(mm, locked_entries, false);
+ account_locked_vm(mm, locked_entries, false);
return ret;
}
@@ -316,7 +283,7 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
unlock_exit:
mutex_unlock(&mem_list_mutex);
- mm_iommu_adjust_locked_vm(mm, unlock_entries, false);
+ account_locked_vm(mm, unlock_entries, false);
return ret;
}
diff --git a/drivers/fpga/dfl-afu-dma-region.c b/drivers/fpga/dfl-afu-dma-region.c
index c438722bf4e1..0a532c602d8f 100644
--- a/drivers/fpga/dfl-afu-dma-region.c
+++ b/drivers/fpga/dfl-afu-dma-region.c
@@ -12,6 +12,7 @@
#include <linux/dma-mapping.h>
#include <linux/sched/signal.h>
#include <linux/uaccess.h>
+#include <linux/mm.h>
#include "dfl-afu.h"
@@ -31,52 +32,6 @@ void afu_dma_region_init(struct dfl_feature_platform_data *pdata)
afu->dma_regions = RB_ROOT;
}
-/**
- * afu_dma_adjust_locked_vm - adjust locked memory
- * @dev: port device
- * @npages: number of pages
- * @incr: increase or decrease locked memory
- *
- * Increase or decrease the locked memory size with npages input.
- *
- * Return 0 on success.
- * Return -ENOMEM if locked memory size is over the limit and no CAP_IPC_LOCK.
- */
-static int afu_dma_adjust_locked_vm(struct device *dev, long npages, bool incr)
-{
- unsigned long locked, lock_limit;
- int ret = 0;
-
- /* the task is exiting. */
- if (!current->mm)
- return 0;
-
- down_write(&current->mm->mmap_sem);
-
- if (incr) {
- locked = current->mm->locked_vm + npages;
- lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
- if (locked > lock_limit && !capable(CAP_IPC_LOCK))
- ret = -ENOMEM;
- else
- current->mm->locked_vm += npages;
- } else {
- if (WARN_ON_ONCE(npages > current->mm->locked_vm))
- npages = current->mm->locked_vm;
- current->mm->locked_vm -= npages;
- }
-
- dev_dbg(dev, "[%d] RLIMIT_MEMLOCK %c%ld %ld/%ld%s\n", current->pid,
- incr ? '+' : '-', npages << PAGE_SHIFT,
- current->mm->locked_vm << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK),
- ret ? "- exceeded" : "");
-
- up_write(&current->mm->mmap_sem);
-
- return ret;
-}
-
/**
* afu_dma_pin_pages - pin pages of given dma memory region
* @pdata: feature device platform data
@@ -92,7 +47,7 @@ static int afu_dma_pin_pages(struct dfl_feature_platform_data *pdata,
struct device *dev = &pdata->dev->dev;
int ret, pinned;
- ret = afu_dma_adjust_locked_vm(dev, npages, true);
+ ret = account_locked_vm(current->mm, npages, true);
if (ret)
return ret;
@@ -121,7 +76,7 @@ static int afu_dma_pin_pages(struct dfl_feature_platform_data *pdata,
free_pages:
kfree(region->pages);
unlock_vm:
- afu_dma_adjust_locked_vm(dev, npages, false);
+ account_locked_vm(current->mm, npages, false);
return ret;
}
@@ -141,7 +96,7 @@ static void afu_dma_unpin_pages(struct dfl_feature_platform_data *pdata,
put_all_pages(region->pages, npages);
kfree(region->pages);
- afu_dma_adjust_locked_vm(dev, npages, false);
+ account_locked_vm(current->mm, npages, false);
dev_dbg(dev, "%ld pages unpinned\n", npages);
}
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 40ddc0c5f677..d06e8e291924 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -22,6 +22,7 @@
#include <linux/vmalloc.h>
#include <linux/sched/mm.h>
#include <linux/sched/signal.h>
+#include <linux/mm.h>
#include <asm/iommu.h>
#include <asm/tce.h>
@@ -34,51 +35,6 @@
static void tce_iommu_detach_group(void *iommu_data,
struct iommu_group *iommu_group);
-static long try_increment_locked_vm(struct mm_struct *mm, long npages)
-{
- long ret = 0, locked, lock_limit;
-
- if (WARN_ON_ONCE(!mm))
- return -EPERM;
-
- if (!npages)
- return 0;
-
- down_write(&mm->mmap_sem);
- locked = mm->locked_vm + npages;
- lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
- if (locked > lock_limit && !capable(CAP_IPC_LOCK))
- ret = -ENOMEM;
- else
- mm->locked_vm += npages;
-
- pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
- npages << PAGE_SHIFT,
- mm->locked_vm << PAGE_SHIFT,
- rlimit(RLIMIT_MEMLOCK),
- ret ? " - exceeded" : "");
-
- up_write(&mm->mmap_sem);
-
- return ret;
-}
-
-static void decrement_locked_vm(struct mm_struct *mm, long npages)
-{
- if (!mm || !npages)
- return;
-
- down_write(&mm->mmap_sem);
- if (WARN_ON_ONCE(npages > mm->locked_vm))
- npages = mm->locked_vm;
- mm->locked_vm -= npages;
- pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
- npages << PAGE_SHIFT,
- mm->locked_vm << PAGE_SHIFT,
- rlimit(RLIMIT_MEMLOCK));
- up_write(&mm->mmap_sem);
-}
-
/*
* VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
*
@@ -336,7 +292,7 @@ static int tce_iommu_enable(struct tce_container *container)
return ret;
locked = table_group->tce32_size >> PAGE_SHIFT;
- ret = try_increment_locked_vm(container->mm, locked);
+ ret = account_locked_vm(container->mm, locked, true);
if (ret)
return ret;
@@ -355,7 +311,7 @@ static void tce_iommu_disable(struct tce_container *container)
container->enabled = false;
BUG_ON(!container->mm);
- decrement_locked_vm(container->mm, container->locked_pages);
+ account_locked_vm(container->mm, container->locked_pages, false);
}
static void *tce_iommu_open(unsigned long arg)
@@ -659,7 +615,7 @@ static long tce_iommu_create_table(struct tce_container *container,
if (!table_size)
return -EINVAL;
- ret = try_increment_locked_vm(container->mm, table_size >> PAGE_SHIFT);
+ ret = account_locked_vm(container->mm, table_size >> PAGE_SHIFT, true);
if (ret)
return ret;
@@ -678,7 +634,7 @@ static void tce_iommu_free_table(struct tce_container *container,
unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
iommu_tce_table_put(tbl);
- decrement_locked_vm(container->mm, pages);
+ account_locked_vm(container->mm, pages, false);
}
static long tce_iommu_create_window(struct tce_container *container,
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3ddc375e7063..bf449ace1676 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -275,21 +275,8 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
ret = down_write_killable(&mm->mmap_sem);
if (!ret) {
- if (npage > 0) {
- if (!dma->lock_cap) {
- unsigned long limit;
-
- limit = task_rlimit(dma->task,
- RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
- if (mm->locked_vm + npage > limit)
- ret = -ENOMEM;
- }
- }
-
- if (!ret)
- mm->locked_vm += npage;
-
+ ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
+ dma->lock_cap);
up_write(&mm->mmap_sem);
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e8834ac32b7..72c1034d2ec7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1564,6 +1564,25 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
+int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
+ struct task_struct *task, bool bypass_rlim);
+
+static inline int account_locked_vm(struct mm_struct *mm, unsigned long pages,
+ bool inc)
+{
+ int ret;
+
+ if (pages == 0 || !mm)
+ return 0;
+
+ down_write(&mm->mmap_sem);
+ ret = __account_locked_vm(mm, pages, inc, current,
+ capable(CAP_IPC_LOCK));
+ up_write(&mm->mmap_sem);
+
+ return ret;
+}
+
/* Container for pinned pfns / pages */
struct frame_vector {
unsigned int nr_allocated; /* Number of frames we have space for */
diff --git a/mm/util.c b/mm/util.c
index e2e4f8c3fa12..bd3bdf16a084 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -6,6 +6,7 @@
#include <linux/err.h>
#include <linux/sched.h>
#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
#include <linux/sched/task_stack.h>
#include <linux/security.h>
#include <linux/swap.h>
@@ -346,6 +347,51 @@ int __weak get_user_pages_fast(unsigned long start,
}
EXPORT_SYMBOL_GPL(get_user_pages_fast);
+/**
+ * __account_locked_vm - account locked pages to an mm's locked_vm
+ * @mm: mm to account against, may be NULL
+ * @pages: number of pages to account
+ * @inc: %true if @pages should be considered positive, %false if not
+ * @task: task used to check RLIMIT_MEMLOCK
+ * @bypass_rlim: %true if checking RLIMIT_MEMLOCK should be skipped
+ *
+ * Assumes @task and @mm are valid (i.e. at least one reference on each), and
+ * that mmap_sem is held as writer.
+ *
+ * Return:
+ * * 0 on success
+ * * 0 if @mm is NULL (can happen for example if the task is exiting)
+ * * -ENOMEM if RLIMIT_MEMLOCK would be exceeded.
+ */
+int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
+ struct task_struct *task, bool bypass_rlim)
+{
+ unsigned long locked_vm, limit;
+ int ret = 0;
+
+ locked_vm = mm->locked_vm;
+ if (inc) {
+ if (!bypass_rlim) {
+ limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+ if (locked_vm + pages > limit)
+ ret = -ENOMEM;
+ }
+ if (!ret)
+ mm->locked_vm = locked_vm + pages;
+ } else {
+ WARN_ON_ONCE(pages > locked_vm);
+ mm->locked_vm = locked_vm - pages;
+ }
+
+ pr_debug("%s: [%d] caller %ps %c%lu %lu/%lu%s\n", __func__, task->pid,
+ (void *)_RET_IP_, (inc) ? '+' : '-', pages << PAGE_SHIFT,
+ locked_vm << PAGE_SHIFT, task_rlimit(task, RLIMIT_MEMLOCK),
+ ret ? " - exceeded" : "");
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__account_locked_vm);
+
unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long pgoff)
base-commit: a188339ca5a396acc588e5851ed7e19f66b0ebd9
--
2.21.0
* [PATCH RFC 1/5] powerpc: Use regular rcu_dereference_raw API
From: Joel Fernandes (Google) @ 2019-05-24 23:49 UTC (permalink / raw)
To: linux-kernel
Cc: rcu, Jonathan Corbet, linux-doc, Lai Jiangshan, Josh Triplett,
kvm-ppc, Miguel Ojeda, Ingo Molnar, Mathieu Desnoyers,
Steven Rostedt, Joel Fernandes (Google), Paul E. McKenney,
linuxppc-dev
In-Reply-To: <20190524234933.5133-1-joel@joelfernandes.org>
rcu_dereference_raw already does not do any tracing. There is no need to
use the _notrace variant of it and this series removes that API, so let us
use the regular variant here.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
arch/powerpc/include/asm/kvm_book3s_64.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 21b1ed5df888..c15c9bbf0206 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -546,7 +546,7 @@ static inline void note_hpte_modification(struct kvm *kvm,
*/
static inline struct kvm_memslots *kvm_memslots_raw(struct kvm *kvm)
{
- return rcu_dereference_raw_notrace(kvm->memslots[0]);
+ return rcu_dereference_raw(kvm->memslots[0]);
}
extern void kvmppc_mmu_debugfs_init(struct kvm *kvm);
--
2.22.0.rc1.257.g3120a18244-goog
* [PATCH RFC 0/5] Remove some notrace RCU APIs
From: Joel Fernandes (Google) @ 2019-05-24 23:49 UTC (permalink / raw)
To: linux-kernel
Cc: rcu, Jonathan Corbet, linux-doc, Lai Jiangshan, Josh Triplett,
kvm-ppc, Miguel Ojeda, Ingo Molnar, Mathieu Desnoyers,
Steven Rostedt, Joel Fernandes (Google), Paul E. McKenney,
linuxppc-dev
The series removes users of the following APIs, and the APIs themselves, since
the regular non-_notrace variants don't do any tracing anyway.
* hlist_for_each_entry_rcu_notrace
* rcu_dereference_raw_notrace
Joel Fernandes (Google) (5):
powerpc: Use regular rcu_dereference_raw API
trace: Use regular rcu_dereference_raw API
hashtable: Use the regular hlist_for_each_entry_rcu API
rculist: Remove hlist_for_each_entry_rcu_notrace since no users
rcu: Remove rcu_dereference_raw_notrace since no users
.clang-format | 1 -
.../RCU/Design/Requirements/Requirements.html | 6 +++---
arch/powerpc/include/asm/kvm_book3s_64.h | 2 +-
include/linux/hashtable.h | 2 +-
include/linux/rculist.h | 20 -------------------
include/linux/rcupdate.h | 9 ---------
kernel/trace/ftrace.c | 4 ++--
kernel/trace/ftrace_internal.h | 8 ++++----
kernel/trace/trace.c | 4 ++--
9 files changed, 13 insertions(+), 43 deletions(-)
--
2.22.0.rc1.257.g3120a18244-goog
* [PATCH RFC 2/5] trace: Use regular rcu_dereference_raw API
From: Joel Fernandes (Google) @ 2019-05-24 23:49 UTC (permalink / raw)
To: linux-kernel
Cc: rcu, Jonathan Corbet, linux-doc, Lai Jiangshan, Josh Triplett,
kvm-ppc, Miguel Ojeda, Ingo Molnar, Mathieu Desnoyers,
Steven Rostedt, Joel Fernandes (Google), Paul E. McKenney,
linuxppc-dev
In-Reply-To: <20190524234933.5133-1-joel@joelfernandes.org>
rcu_dereference_raw already does not do any tracing. There is no need to
use the _notrace variant of it and this series removes that API, so let us
use the regular variant here.
While at it, also replace the only user of
hlist_for_each_entry_rcu_notrace (which indirectly uses the
rcu_dereference_raw_notrace API) with hlist_for_each_entry_rcu which
also does not do any tracing.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
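For context on the ftrace_internal.h hunk, the do/while macro pair it touches
can be imitated in userspace as below. This is a sketch only: struct
op_model, list_end, and the macro names are hypothetical, and a volatile
load stands in for rcu_dereference_raw(), which is an assumption made purely
for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ftrace_ops. */
struct op_model {
	struct op_model *next;
	int calls;
};

/* Sentinel terminating the list; its own next pointer is NULL. */
static struct op_model list_end;

/* Volatile load standing in for rcu_dereference_raw(). */
#define load_ptr(p) (*(struct op_model * volatile *)&(p))

/* Opens the loop: load the list head, then start a do-block. */
#define do_for_each_op(op, list) \
	op = load_ptr(list);     \
	do

/* Closes the loop: advance, stop on NULL or on the sentinel. */
#define while_for_each_op(op) \
	while (((op) = load_ptr((op)->next)) && (op) != &list_end)

/* Visit every entry once and return how many were visited. */
static int visit_all(struct op_model *list)
{
	struct op_model *op;
	int visited = 0;

	assert(list);

	do_for_each_op(op, list) {
		op->calls++;
		visited++;
	} while_for_each_op(op);

	return visited;
}
```

Because the body sits in a do-block, the first entry is always visited, even
when the list holds only the sentinel.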
kernel/trace/ftrace.c | 4 ++--
kernel/trace/ftrace_internal.h | 8 ++++----
kernel/trace/trace.c | 4 ++--
3 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index b920358dd8f7..f7d5f0ee69de 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -706,7 +706,7 @@ ftrace_find_profiled_func(struct ftrace_profile_stat *stat, unsigned long ip)
if (hlist_empty(hhd))
return NULL;
- hlist_for_each_entry_rcu_notrace(rec, hhd, node) {
+ hlist_for_each_entry_rcu(rec, hhd, node) {
if (rec->ip == ip)
return rec;
}
@@ -1135,7 +1135,7 @@ __ftrace_lookup_ip(struct ftrace_hash *hash, unsigned long ip)
key = ftrace_hash_key(hash, ip);
hhd = &hash->buckets[key];
- hlist_for_each_entry_rcu_notrace(entry, hhd, hlist) {
+ hlist_for_each_entry_rcu(entry, hhd, hlist) {
if (entry->ip == ip)
return entry;
}
diff --git a/kernel/trace/ftrace_internal.h b/kernel/trace/ftrace_internal.h
index 0515a2096f90..e3530a284f46 100644
--- a/kernel/trace/ftrace_internal.h
+++ b/kernel/trace/ftrace_internal.h
@@ -6,22 +6,22 @@
/*
* Traverse the ftrace_global_list, invoking all entries. The reason that we
- * can use rcu_dereference_raw_notrace() is that elements removed from this list
+ * can use rcu_dereference_raw() is that elements removed from this list
* are simply leaked, so there is no need to interact with a grace-period
- * mechanism. The rcu_dereference_raw_notrace() calls are needed to handle
+ * mechanism. The rcu_dereference_raw() calls are needed to handle
* concurrent insertions into the ftrace_global_list.
*
* Silly Alpha and silly pointer-speculation compiler optimizations!
*/
#define do_for_each_ftrace_op(op, list) \
- op = rcu_dereference_raw_notrace(list); \
+ op = rcu_dereference_raw(list); \
do
/*
* Optimized for just a single item in the list (as that is the normal case).
*/
#define while_for_each_ftrace_op(op) \
- while (likely(op = rcu_dereference_raw_notrace((op)->next)) && \
+ while (likely(op = rcu_dereference_raw((op)->next)) && \
unlikely((op) != &ftrace_list_end))
extern struct ftrace_ops __rcu *ftrace_ops_list;
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ec439999f387..cb8d696d9cde 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2638,10 +2638,10 @@ static void ftrace_exports(struct ring_buffer_event *event)
preempt_disable_notrace();
- export = rcu_dereference_raw_notrace(ftrace_exports_list);
+ export = rcu_dereference_raw(ftrace_exports_list);
while (export) {
trace_process_export(export, event);
- export = rcu_dereference_raw_notrace(export->next);
+ export = rcu_dereference_raw(export->next);
}
preempt_enable_notrace();
--
2.22.0.rc1.257.g3120a18244-goog
^ permalink raw reply related
* [PATCH RFC 3/5] hashtable: Use the regular hlist_for_each_entry_rcu API
From: Joel Fernandes (Google) @ 2019-05-24 23:49 UTC (permalink / raw)
To: linux-kernel
Cc: rcu, Jonathan Corbet, linux-doc, Lai Jiangshan, Josh Triplett,
kvm-ppc, Miguel Ojeda, Ingo Molnar, Mathieu Desnoyers,
Steven Rostedt, Joel Fernandes (Google), Paul E. McKenney,
linuxppc-dev
In-Reply-To: <20190524234933.5133-1-joel@joelfernandes.org>
hlist_for_each_entry_rcu already does not do any tracing. This series
removes the notrace variant of it, so let us just use the regular API.
In a future patch, we can also remove the
hash_for_each_possible_rcu_notrace API that this patch touches.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
include/linux/hashtable.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index 417d2c4bc60d..47fa7b673c1b 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -189,7 +189,7 @@ static inline void hash_del_rcu(struct hlist_node *node)
* not do any RCU debugging or tracing.
*/
#define hash_for_each_possible_rcu_notrace(name, obj, member, key) \
- hlist_for_each_entry_rcu_notrace(obj, \
+ hlist_for_each_entry_rcu(obj, \
&name[hash_min(key, HASH_BITS(name))], member)
/**
--
2.22.0.rc1.257.g3120a18244-goog
^ permalink raw reply related
* [PATCH RFC 4/5] rculist: Remove hlist_for_each_entry_rcu_notrace since no users
From: Joel Fernandes (Google) @ 2019-05-24 23:49 UTC (permalink / raw)
To: linux-kernel
Cc: rcu, Jonathan Corbet, linux-doc, Lai Jiangshan, Josh Triplett,
kvm-ppc, Miguel Ojeda, Ingo Molnar, Mathieu Desnoyers,
Steven Rostedt, Joel Fernandes (Google), Paul E. McKenney,
linuxppc-dev
In-Reply-To: <20190524234933.5133-1-joel@joelfernandes.org>
The series removes all users of the API and with this patch, the API
itself.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
.clang-format | 1 -
include/linux/rculist.h | 20 --------------------
2 files changed, 21 deletions(-)
diff --git a/.clang-format b/.clang-format
index 2ffd69afc1a8..aa935923f5cb 100644
--- a/.clang-format
+++ b/.clang-format
@@ -287,7 +287,6 @@ ForEachMacros:
- 'hlist_for_each_entry_from_rcu'
- 'hlist_for_each_entry_rcu'
- 'hlist_for_each_entry_rcu_bh'
- - 'hlist_for_each_entry_rcu_notrace'
- 'hlist_for_each_entry_safe'
- '__hlist_for_each_rcu'
- 'hlist_for_each_safe'
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index e91ec9ddcd30..0d3d77cf4f07 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -628,26 +628,6 @@ static inline void hlist_add_behind_rcu(struct hlist_node *n,
pos = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(\
&(pos)->member)), typeof(*(pos)), member))
-/**
- * hlist_for_each_entry_rcu_notrace - iterate over rcu list of given type (for tracing)
- * @pos: the type * to use as a loop cursor.
- * @head: the head for your list.
- * @member: the name of the hlist_node within the struct.
- *
- * This list-traversal primitive may safely run concurrently with
- * the _rcu list-mutation primitives such as hlist_add_head_rcu()
- * as long as the traversal is guarded by rcu_read_lock().
- *
- * This is the same as hlist_for_each_entry_rcu() except that it does
- * not do any RCU debugging or tracing.
- */
-#define hlist_for_each_entry_rcu_notrace(pos, head, member) \
- for (pos = hlist_entry_safe (rcu_dereference_raw_notrace(hlist_first_rcu(head)),\
- typeof(*(pos)), member); \
- pos; \
- pos = hlist_entry_safe(rcu_dereference_raw_notrace(hlist_next_rcu(\
- &(pos)->member)), typeof(*(pos)), member))
-
/**
* hlist_for_each_entry_rcu_bh - iterate over rcu list of given type
* @pos: the type * to use as a loop cursor.
--
2.22.0.rc1.257.g3120a18244-goog
^ permalink raw reply related
* [PATCH RFC 5/5] rcu: Remove rcu_dereference_raw_notrace since no users
From: Joel Fernandes (Google) @ 2019-05-24 23:49 UTC (permalink / raw)
To: linux-kernel
Cc: rcu, Jonathan Corbet, linux-doc, Lai Jiangshan, Josh Triplett,
kvm-ppc, Miguel Ojeda, Ingo Molnar, Mathieu Desnoyers,
Steven Rostedt, Joel Fernandes (Google), Paul E. McKenney,
linuxppc-dev
In-Reply-To: <20190524234933.5133-1-joel@joelfernandes.org>
The series removes all users of the API and with this patch, the API
itself. Also fix documentation.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
Documentation/RCU/Design/Requirements/Requirements.html | 6 +++---
include/linux/rcupdate.h | 9 ---------
2 files changed, 3 insertions(+), 12 deletions(-)
diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 5a9238a2883c..9727278893e6 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2512,9 +2512,9 @@ disabled across the entire RCU read-side critical section.
<p>
It is possible to use tracing on RCU code, but tracing itself
uses RCU.
-For this reason, <tt>rcu_dereference_raw_notrace()</tt>
-is provided for use by tracing, which avoids the destructive
-recursion that could otherwise ensue.
+This is the other reason for using, <tt>rcu_dereference_raw()</tt>,
+for use by tracing, which avoids the destructive recursion that could
+otherwise ensue.
This API is also used by virtualization in some architectures,
where RCU readers execute in environments in which tracing
cannot be used.
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 922bb6848813..f917a27fc115 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -472,15 +472,6 @@ static inline void rcu_preempt_sleep_check(void) { }
__rcu_dereference_check((p), (c) || rcu_read_lock_sched_held(), \
__rcu)
-/*
- * The tracing infrastructure traces RCU (we want that), but unfortunately
- * some of the RCU checks causes tracing to lock up the system.
- *
- * The no-tracing version of rcu_dereference_raw() must not call
- * rcu_read_lock_held().
- */
-#define rcu_dereference_raw_notrace(p) __rcu_dereference_check((p), 1, __rcu)
-
/**
* rcu_dereference_protected() - fetch RCU pointer when updates prevented
* @p: The pointer to read, prior to dereferencing
--
2.22.0.rc1.257.g3120a18244-goog
^ permalink raw reply related
* Re: [PATCH 2/2] powerpc/perf: Fix mmcra corruption by bhrb_filter
From: Michael Ellerman @ 2019-05-25 0:54 UTC (permalink / raw)
To: Ravi Bangoria, peterz, jolsa, maddy
Cc: Ravi Bangoria, linuxppc-dev, linux-kernel, acme
In-Reply-To: <20190511024217.4013-2-ravi.bangoria@linux.ibm.com>
On Sat, 2019-05-11 at 02:42:17 UTC, Ravi Bangoria wrote:
> Consider a scenario where user creates two events:
>
> 1st event:
> attr.sample_type |= PERF_SAMPLE_BRANCH_STACK;
> attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY;
> fd = perf_event_open(attr, 0, 1, -1, 0);
>
> This sets cpuhw->bhrb_filter to 0 and returns valid fd.
>
> 2nd event:
> attr.sample_type |= PERF_SAMPLE_BRANCH_STACK;
> attr.branch_sample_type = PERF_SAMPLE_BRANCH_CALL;
> fd = perf_event_open(attr, 0, 1, -1, 0);
>
> It overrides cpuhw->bhrb_filter to -1 and returns with error.
>
> Now if power_pmu_enable() gets called by any path other than
> power_pmu_add(), ppmu->config_bhrb(-1) will set mmcra to -1.
>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
> Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Applied to powerpc fixes, thanks.
https://git.kernel.org/powerpc/c/3202e35ec1c8fc19cea24253ff83edf7
cheers
^ permalink raw reply
* Re: [PATCH] powerpc/powernv: Return for invalid IMC domain
From: Michael Ellerman @ 2019-05-25 0:54 UTC (permalink / raw)
To: Anju T Sudhakar; +Cc: pavsubra, maddy, linuxppc-dev, anju
In-Reply-To: <20190520085753.19670-1-anju@linux.vnet.ibm.com>
On Mon, 2019-05-20 at 08:57:53 UTC, Anju T Sudhakar wrote:
> Currently init_imc_pmu() can be failed either because
> an IMC unit with invalid domain(i.e an IMC node not
> supported by the kernel) is attempted a pmu-registration
> or something went wrong while registering a valid IMC unit.
> In both the cases kernel provides a 'Registration failed'
> error message.
>
> Example:
> Log message, when trace-imc node is not supported by the kernel, and the
> skiboot supports trace-imc node.
>
> So for kernel, trace-imc node is now an unknown domain.
>
> [ 1.731870] nest_phb5_imc performance monitor hardware support registered
> [ 1.731944] nest_powerbus0_imc performance monitor hardware support registered
> [ 1.734458] thread_imc performance monitor hardware support registered
> [ 1.734460] IMC Unknown Device type
> [ 1.734462] IMC PMU (null) Register failed
> [ 1.734558] nest_xlink0_imc performance monitor hardware support registered
> [ 1.734614] nest_xlink1_imc performance monitor hardware support registered
> [ 1.734670] nest_xlink2_imc performance monitor hardware support registered
> [ 1.747043] Initialise system trusted keyrings
> [ 1.747054] Key type blacklist registered
>
>
> To avoid ambiguity on the error message, return for invalid domain
> before attempting a pmu registration.
>
> Fixes: 8f95faaac56c1 (`powerpc/powernv: Detect and create IMC device`)
> Reported-by: Pavaman Subramaniyam <pavsubra@in.ibm.com>
> Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
> Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Applied to powerpc fixes, thanks.
https://git.kernel.org/powerpc/c/b59bd3527fe3c1939340df558d7f9d56
cheers
^ permalink raw reply
* Re: [PATCH] powerpc: Fix loading of kernel + initramfs with kexec_file_load()
From: Michael Ellerman @ 2019-05-25 0:54 UTC (permalink / raw)
To: Thiago Jung Bauermann, linuxppc-dev
Cc: AKASHI Takahiro, Thiago Jung Bauermann, kexec, linux-kernel,
Mimi Zohar
In-Reply-To: <20190522220158.18479-1-bauerman@linux.ibm.com>
On Wed, 2019-05-22 at 22:01:58 UTC, Thiago Jung Bauermann wrote:
> Commit b6664ba42f14 ("s390, kexec_file: drop arch_kexec_mem_walk()")
> changed kexec_add_buffer() to skip searching for a memory location if
> kexec_buf.mem is already set, and use the address that is there.
>
> In powerpc code we reuse a kexec_buf variable for loading both the kernel
> and the initramfs by resetting some of the fields between those uses, but
> not mem. This causes kexec_add_buffer() to try to load the kernel at the
> same address where initramfs will be loaded, which is naturally rejected:
>
> # kexec -s -l --initrd initramfs vmlinuz
> kexec_file_load failed: Invalid argument
>
> Setting the mem field before every call to kexec_add_buffer() fixes this
> regression.
>
> Fixes: b6664ba42f14 ("s390, kexec_file: drop arch_kexec_mem_walk()")
> Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
> Reviewed-by: Dave Young <dyoung@redhat.com>
Applied to powerpc fixes, thanks.
https://git.kernel.org/powerpc/c/8b909e3548706cbebc0a676067b81aad
cheers
^ permalink raw reply
* Re: [PATCH RFC 0/5] Remove some notrace RCU APIs
From: Steven Rostedt @ 2019-05-25 3:24 UTC (permalink / raw)
To: Joel Fernandes (Google)
Cc: rcu, Jonathan Corbet, linux-doc, Lai Jiangshan, linux-kernel,
kvm-ppc, Josh Triplett, Miguel Ojeda, Ingo Molnar,
Mathieu Desnoyers, Paul E. McKenney, linuxppc-dev
In-Reply-To: <20190524234933.5133-1-joel@joelfernandes.org>
On Fri, 24 May 2019 19:49:28 -0400
"Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
> The series removes users of the following APIs, and the APIs themselves, since
> the regular non - _notrace variants don't do any tracing anyway.
> * hlist_for_each_entry_rcu_notrace
> * rcu_dereference_raw_notrace
>
I guess the difference between the _raw_notrace and just _raw variants
is that _notrace ones do a rcu_check_sparse(). Don't we want to keep
that check?
-- Steve
^ permalink raw reply
* Re: [PATCH RFC 0/5] Remove some notrace RCU APIs
From: Joel Fernandes @ 2019-05-25 8:14 UTC (permalink / raw)
To: Steven Rostedt
Cc: rcu, Jonathan Corbet, linux-doc, Lai Jiangshan, linux-kernel,
kvm-ppc, Josh Triplett, Miguel Ojeda, Ingo Molnar,
Mathieu Desnoyers, Paul E. McKenney, linuxppc-dev
In-Reply-To: <20190524232458.4bcf4eb4@gandalf.local.home>
On Fri, May 24, 2019 at 11:24:58PM -0400, Steven Rostedt wrote:
> On Fri, 24 May 2019 19:49:28 -0400
> "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
>
> > The series removes users of the following APIs, and the APIs themselves, since
> > the regular non - _notrace variants don't do any tracing anyway.
> > * hlist_for_each_entry_rcu_notrace
> > * rcu_dereference_raw_notrace
> >
>
> I guess the difference between the _raw_notrace and just _raw variants
> is that _notrace ones do a rcu_check_sparse(). Don't we want to keep
> that check?
This is true.
Since the users of _raw_notrace are very few, is it worth keeping this API
just for sparse checking? The API naming is also confusing. I was expecting
_raw_notrace to do fewer checks than _raw, instead of more. Honestly, I just
want to nuke _raw_notrace as done in this series and later we can introduce a
sparse checking version of _raw if need-be. The other option could be to
always do sparse checking for _raw however that used to be the case and got
changed in http://lists.infradead.org/pipermail/linux-afs/2016-July/001016.html
thanks a lot,
- Joel
>
> -- Steve
^ permalink raw reply
* Re: [PATCH RFC 0/5] Remove some notrace RCU APIs
From: Steven Rostedt @ 2019-05-25 11:08 UTC (permalink / raw)
To: Joel Fernandes
Cc: rcu, Jonathan Corbet, linux-doc, Lai Jiangshan, linux-kernel,
kvm-ppc, Josh Triplett, Miguel Ojeda, Ingo Molnar,
Mathieu Desnoyers, Paul E. McKenney, linuxppc-dev
In-Reply-To: <20190525081444.GC197789@google.com>
On Sat, 25 May 2019 04:14:44 -0400
Joel Fernandes <joel@joelfernandes.org> wrote:
> > I guess the difference between the _raw_notrace and just _raw variants
> > is that _notrace ones do a rcu_check_sparse(). Don't we want to keep
> > that check?
>
> This is true.
>
> Since the users of _raw_notrace are very few, is it worth keeping this API
> just for sparse checking? The API naming is also confusing. I was expecting
> _raw_notrace to do fewer checks than _raw, instead of more. Honestly, I just
> want to nuke _raw_notrace as done in this series and later we can introduce a
> sparse checking version of _raw if need-be. The other option could be to
> always do sparse checking for _raw however that used to be the case and got
> changed in http://lists.infradead.org/pipermail/linux-afs/2016-July/001016.html
What if we just rename _raw to _raw_nocheck, and _raw_notrace to _raw ?
-- Steve
^ permalink raw reply
* Re: [PATCH RFC 0/5] Remove some notrace RCU APIs
From: Joel Fernandes @ 2019-05-25 14:19 UTC (permalink / raw)
To: Steven Rostedt
Cc: rcu, Jonathan Corbet, linux-doc, Lai Jiangshan, linux-kernel,
kvm-ppc, Josh Triplett, Miguel Ojeda, Ingo Molnar,
Mathieu Desnoyers, Paul E. McKenney, linuxppc-dev
In-Reply-To: <20190525070826.16f76ee7@gandalf.local.home>
On Sat, May 25, 2019 at 07:08:26AM -0400, Steven Rostedt wrote:
> On Sat, 25 May 2019 04:14:44 -0400
> Joel Fernandes <joel@joelfernandes.org> wrote:
>
> > > I guess the difference between the _raw_notrace and just _raw variants
> > > is that _notrace ones do a rcu_check_sparse(). Don't we want to keep
> > > that check?
> >
> > This is true.
> >
> > Since the users of _raw_notrace are very few, is it worth keeping this API
> > just for sparse checking? The API naming is also confusing. I was expecting
> > _raw_notrace to do fewer checks than _raw, instead of more. Honestly, I just
> > want to nuke _raw_notrace as done in this series and later we can introduce a
> > sparse checking version of _raw if need-be. The other option could be to
> > always do sparse checking for _raw however that used to be the case and got
> > changed in http://lists.infradead.org/pipermail/linux-afs/2016-July/001016.html
>
> What if we just rename _raw to _raw_nocheck, and _raw_notrace to _raw ?
That would also mean changing 160 usages of _raw to _raw_nocheck in the
kernel :-/.
The tracing usage of _raw_notrace is only like 2 or 3 users. Can we just call
rcu_check_sparse directly in the calling code for those and eliminate the APIs?
I wonder what Paul thinks about the matter as well.
thanks, Steven!
^ permalink raw reply
* Re: [PATCH RFC 0/5] Remove some notrace RCU APIs
From: Paul E. McKenney @ 2019-05-25 15:50 UTC (permalink / raw)
To: Joel Fernandes
Cc: rcu, Jonathan Corbet, linux-doc, Lai Jiangshan, Josh Triplett,
Steven Rostedt, linux-kernel, Miguel Ojeda, Ingo Molnar,
Mathieu Desnoyers, kvm-ppc, linuxppc-dev
In-Reply-To: <20190525141954.GA176647@google.com>
On Sat, May 25, 2019 at 10:19:54AM -0400, Joel Fernandes wrote:
> On Sat, May 25, 2019 at 07:08:26AM -0400, Steven Rostedt wrote:
> > On Sat, 25 May 2019 04:14:44 -0400
> > Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > > > I guess the difference between the _raw_notrace and just _raw variants
> > > > is that _notrace ones do a rcu_check_sparse(). Don't we want to keep
> > > > that check?
> > >
> > > This is true.
> > >
> > > Since the users of _raw_notrace are very few, is it worth keeping this API
> > > just for sparse checking? The API naming is also confusing. I was expecting
> > > _raw_notrace to do fewer checks than _raw, instead of more. Honestly, I just
> > > want to nuke _raw_notrace as done in this series and later we can introduce a
> > > sparse checking version of _raw if need-be. The other option could be to
> > > always do sparse checking for _raw however that used to be the case and got
> > > changed in http://lists.infradead.org/pipermail/linux-afs/2016-July/001016.html
> >
> > What if we just rename _raw to _raw_nocheck, and _raw_notrace to _raw ?
>
> That would also mean changing 160 usages of _raw to _raw_nocheck in the
> kernel :-/.
>
> The tracing usage of _raw_notrace is only like 2 or 3 users. Can we just call
> rcu_check_sparse directly in the calling code for those and eliminate the APIs?
>
> I wonder what Paul thinks about the matter as well.
My thought is that it is likely that a goodly number of the current uses
of _raw should really be some form of _check, with lockdep expressions
spelled out. Not that working out what exactly those lockdep expressions
should be is necessarily a trivial undertaking. ;-)
That aside, if we are going to change the name of an API that is
used 160 places throughout the tree, we would need to have a pretty
good justification. Without such a justification, it will just look
like pointless churn to the various developers and maintainers on the
receiving end of the patches.
Thanx, Paul
> thanks, Steven!
>
^ permalink raw reply
* Re: [PATCH RFC 0/5] Remove some notrace RCU APIs
From: Joel Fernandes @ 2019-05-25 18:14 UTC (permalink / raw)
To: Paul E. McKenney
Cc: rcu, Jonathan Corbet, linux-doc, Lai Jiangshan, Josh Triplett,
Steven Rostedt, linux-kernel, Miguel Ojeda, Ingo Molnar,
Mathieu Desnoyers, kvm-ppc, linuxppc-dev
In-Reply-To: <20190525155035.GE28207@linux.ibm.com>
On Sat, May 25, 2019 at 08:50:35AM -0700, Paul E. McKenney wrote:
> On Sat, May 25, 2019 at 10:19:54AM -0400, Joel Fernandes wrote:
> > On Sat, May 25, 2019 at 07:08:26AM -0400, Steven Rostedt wrote:
> > > On Sat, 25 May 2019 04:14:44 -0400
> > > Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > > > I guess the difference between the _raw_notrace and just _raw variants
> > > > > is that _notrace ones do a rcu_check_sparse(). Don't we want to keep
> > > > > that check?
> > > >
> > > > This is true.
> > > >
> > > > Since the users of _raw_notrace are very few, is it worth keeping this API
> > > > just for sparse checking? The API naming is also confusing. I was expecting
> > > > _raw_notrace to do fewer checks than _raw, instead of more. Honestly, I just
> > > > want to nuke _raw_notrace as done in this series and later we can introduce a
> > > > sparse checking version of _raw if need-be. The other option could be to
> > > > always do sparse checking for _raw however that used to be the case and got
> > > > changed in http://lists.infradead.org/pipermail/linux-afs/2016-July/001016.html
> > >
> > > What if we just rename _raw to _raw_nocheck, and _raw_notrace to _raw ?
> >
> > That would also mean changing 160 usages of _raw to _raw_nocheck in the
> > kernel :-/.
> >
> > The tracing usage of _raw_notrace is only like 2 or 3 users. Can we just call
> > rcu_check_sparse directly in the calling code for those and eliminate the APIs?
> >
> > I wonder what Paul thinks about the matter as well.
>
> My thought is that it is likely that a goodly number of the current uses
> of _raw should really be some form of _check, with lockdep expressions
> spelled out. Not that working out what exactly those lockdep expressions
> should be is necessarily a trivial undertaking. ;-)
Yes, currently where I am a bit stuck is that rcu_dereference_raw()
cannot possibly know what SRCU domain it is under, so lockdep cannot check if
an SRCU lock is held without the user also passing along the SRCU domain. I
am trying to change lockdep to see if it can check if *any* srcu domain lock
is held (regardless of which one) and complain if none are. This is at least
better than no check at all.
However, I think it gets tricky for mutexes. If you have something like:
mutex_lock(some_mutex);
p = rcu_dereference_raw(gp);
mutex_unlock(some_mutex);
This might be a perfectly valid invocation of _raw, however my checks (patch
is still cooking) trigger a lockdep warning because _raw cannot know that this
is Ok. lockdep thinks it is not in a reader section. This then gets into the
territory of a new rcu_dereference_raw_protected(gp, assert_held(some_mutex))
which sucks because it's yet another API. To circumvent this issue, can we
just have callers of rcu_dereference_raw ensure that they call
rcu_read_lock() if they are protecting dereferences by a mutex? That would
make things a lot easier and also may be Ok since rcu_read_lock is quite
cheap.
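
The trade-off described above can be modelled in plain userspace C. This is
only a toy sketch, not kernel code: the toy_* names and the boolean flags
standing in for lockdep state are made up for illustration. It shows why a
check that only knows about read-side sections warns on the mutex-protected
case, while a variant that takes a caller-supplied condition (in the spirit
of the kernel's existing rcu_dereference_check()) does not.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for lockdep state. These are NOT kernel APIs. */
static bool toy_rcu_read_held;	/* "inside rcu_read_lock()" */
static bool toy_mutex_held;	/* "some_mutex is held" */
static int toy_warnings;	/* number of simulated lockdep splats */

/* Return p, counting a warning when the protection condition is false. */
static void *toy_deref_check(void *p, bool cond)
{
	if (!cond)
		toy_warnings++;
	return p;
}

/* Models a _raw variant that can only check for a read-side section. */
static void *toy_deref_reader_only(void *p)
{
	return toy_deref_check(p, toy_rcu_read_held);
}

/* Models a variant where the caller spells out what protects the pointer. */
static void *toy_deref_protected(void *p, bool caller_cond)
{
	return toy_deref_check(p, caller_cond || toy_rcu_read_held);
}
```

With toy_mutex_held set and toy_rcu_read_held clear, toy_deref_reader_only()
bumps toy_warnings even though the access is safe, whereas
toy_deref_protected(p, toy_mutex_held) stays quiet. The cost is the one noted
above: every caller has to state its protection condition.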
> That aside, if we are going to change the name of an API that is
> used 160 places throughout the tree, we would need to have a pretty
> good justification. Without such a justification, it will just look
> like pointless churn to the various developers and maintainers on the
> receiving end of the patches.
Actually, the API name change is not something I want to do, it is Steven's
suggestion. My suggestion is let us just delete _raw_notrace and just use the
_raw API for tracing, since _raw doesn't do any tracing anyway. Steve pointed
out that _raw_notrace does sparse checking unlike _raw, but I think that isn't an
issue since _raw doesn't do such checking at the moment anyway. (if possible
check my cover letter again for details/motivation of this series).
thanks!
- Joel
> Thanx, Paul
>
> > thanks, Steven!
> >
>
^ permalink raw reply
* Re: [PATCH RFC 0/5] Remove some notrace RCU APIs
From: Joel Fernandes @ 2019-05-25 18:18 UTC (permalink / raw)
To: Paul E. McKenney
Cc: rcu, Jonathan Corbet, linux-doc, Lai Jiangshan, Josh Triplett,
Steven Rostedt, linux-kernel, Miguel Ojeda, Ingo Molnar,
Mathieu Desnoyers, kvm-ppc, linuxppc-dev
In-Reply-To: <20190525181407.GA220326@google.com>
On Sat, May 25, 2019 at 02:14:07PM -0400, Joel Fernandes wrote:
[snip]
> > That aside, if we are going to change the name of an API that is
> > used 160 places throughout the tree, we would need to have a pretty
> > good justification. Without such a justification, it will just look
> > like pointless churn to the various developers and maintainers on the
> > receiving end of the patches.
>
> Actually, the API name change is not something I want to do, it is Steven's
> suggestion. My suggestion is let us just delete _raw_notrace and just use the
> _raw API for tracing, since _raw doesn't do any tracing anyway. Steve pointed
> out that _raw_notrace does sparse checking unlike _raw, but I think that isn't an
> issue since _raw doesn't do such checking at the moment anyway. (if possible
> check my cover letter again for details/motivation of this series).
Come to think of it, if we/I succeed in adding lockdep checking in _raw, then
we can just keep the current APIs and not delete anything. And we can have
_raw_notrace skip the lockdep checks. The sparse check question would still
be an open one though, since _raw doesn't do sparse checks at the moment
unlike _raw_notrace as Steve pointed out.
Thanks,
- Joel
^ permalink raw reply
* Re: [PATCH v2] mm: add account_locked_vm utility function
From: Andrew Morton @ 2019-05-25 21:51 UTC (permalink / raw)
To: Daniel Jordan
Cc: Mark Rutland, Davidlohr Bueso, kvm, Alan Tull,
Alexey Kardashevskiy, linux-fpga, linux-kernel, kvm-ppc, linux-mm,
Alex Williamson, Jason Gunthorpe, Moritz Fischer, Steve Sistare,
Christoph Lameter, linuxppc-dev, Wu Hao
In-Reply-To: <20190524175045.26897-1-daniel.m.jordan@oracle.com>
On Fri, 24 May 2019 13:50:45 -0400 Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> locked_vm accounting is done roughly the same way in five places, so
> unify them in a helper. Standardize the debug prints, which vary
> slightly, but include the helper's caller to disambiguate between
> callsites.
>
> Error codes stay the same, so user-visible behavior does too. The one
> exception is that the -EPERM case in tce_account_locked_vm is removed
> because Alexey has never seen it triggered.
>
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1564,6 +1564,25 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> int get_user_pages_fast(unsigned long start, int nr_pages,
> unsigned int gup_flags, struct page **pages);
>
> +int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
> + struct task_struct *task, bool bypass_rlim);
> +
> +static inline int account_locked_vm(struct mm_struct *mm, unsigned long pages,
> + bool inc)
> +{
> + int ret;
> +
> + if (pages == 0 || !mm)
> + return 0;
> +
> + down_write(&mm->mmap_sem);
> + ret = __account_locked_vm(mm, pages, inc, current,
> + capable(CAP_IPC_LOCK));
> + up_write(&mm->mmap_sem);
> +
> + return ret;
> +}
That's quite a mouthful for an inlined function. How about uninlining
the whole thing and fiddling drivers/vfio/vfio_iommu_type1.c to suit.
I wonder why it does down_write_killable and whether it really needs
to...
^ permalink raw reply
* [PATCH] dlpar: Fix a missing-check bug in dlpar_parse_cc_property()
From: Gen Zhang @ 2019-05-26 2:42 UTC (permalink / raw)
To: benh, paulus; +Cc: nfont, linuxppc-dev, linux-kernel
In dlpar_parse_cc_property(), 'prop->name' is allocated by kstrdup().
kstrdup() may return NULL, so its result should be checked and the error handled.
And prop should be freed if 'prop->name' is NULL.
Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
---
diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
index 1795804..c852024 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -61,6 +61,10 @@ static struct property *dlpar_parse_cc_property(struct cc_workarea *ccwa)
name = (char *)ccwa + be32_to_cpu(ccwa->name_offset);
prop->name = kstrdup(name, GFP_KERNEL);
+ if (!prop->name) {
+ dlpar_free_cc_property(prop);
+ return NULL;
+ }
prop->length = be32_to_cpu(ccwa->prop_length);
value = (char *)ccwa + be32_to_cpu(ccwa->prop_offset);
---
^ permalink raw reply related
* Re: [PATCH RFC 4/5] rculist: Remove hlist_for_each_entry_rcu_notrace since no users
From: Miguel Ojeda @ 2019-05-26 16:20 UTC (permalink / raw)
To: Joel Fernandes (Google)
Cc: rcu, Jonathan Corbet, Linux Doc Mailing List, Lai Jiangshan,
linux-kernel, kvm-ppc, Josh Triplett, Ingo Molnar,
Mathieu Desnoyers, Steven Rostedt, Paul E. McKenney, linuxppc-dev
In-Reply-To: <20190524234933.5133-5-joel@joelfernandes.org>
On Sat, May 25, 2019 at 1:50 AM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> The series removes all users of the API and with this patch, the API
> itself.
>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
> .clang-format | 1 -
Ack for clang-format, and thanks for removing it there too! :-)
Cheers,
Miguel
^ permalink raw reply
* Re: [PATCH v2 1/2] open: add close_range()
From: Szabolcs Nagy @ 2019-05-26 20:20 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-ia64, linux-sh, oleg, dhowells, linux-kselftest, sparclinux,
shuah, linux-arch, linux-s390, miklos, x86, torvalds, linux-mips,
linux-xtensa, tkjos, arnd, jannh, linux-m68k, viro, tglx, ldv,
linux-arm-kernel, fweimer, linux-parisc, linux-api, linux-kernel,
linux-alpha, linux-fsdevel, linuxppc-dev
In-Reply-To: <20190523154747.15162-2-christian@brauner.io>
* Christian Brauner <christian@brauner.io> [2019-05-23 17:47:46 +0200]:
> This adds the close_range() syscall. It allows to efficiently close a range
> of file descriptors up to all file descriptors of a calling task.
>
> The syscall came up in a recent discussion around the new mount API and
> making new file descriptor types cloexec by default. During this
> discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> syscall in this manner has been requested by various people over time.
>
> First, it helps to close all file descriptors of an exec()ing task. This
> can be done safely via (quoting Al's example from [1] verbatim):
>
> /* that exec is sensitive */
> unshare(CLONE_FILES);
> /* we don't want anything past stderr here */
> close_range(3, ~0U);
> execve(....);
this does not work in a hosted c implementation unless the libc
guarantees not to use libc internal fds (e.g. in execve).
(the libc cannot easily abstract fds, so the syscall abi layer
fd semantics is necessarily visible to user code.)
i think this is a new constraint for userspace runtimes.
(not entirely unreasonable though)
> The code snippet above is one way of working around the problem that file
> descriptors are not cloexec by default. This is aggravated by the fact that
> we can't just switch them over without massively regressing userspace. For
> a whole class of programs having an in-kernel method of closing all file
> descriptors is very helpful (e.g. daemons, service managers, programming
> language standard libraries, container managers etc.).
was cloexec_range(a,b) considered?
> (Please note, unshare(CLONE_FILES) should only be needed if the calling
> task is multi-threaded and shares the file descriptor table with another
> thread in which case two threads could race with one thread allocating
> file descriptors and the other one closing them via close_range(). For the
> general case close_range() before the execve() is sufficient.)
assuming there is no unblocked signal handler that may open fds.
a syscall that tramples on fds not owned by the caller is ugly
(not generally safe to use and may break things if it gets used),
i don't have a better solution for fd leaks or missing cloexec,
but i think it needs more analysis how it can be used.
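
for reference, what userspace has to do today without such a syscall can be
sketched roughly as below. this is a minimal illustration only: toy_close_range
and toy_fd_is_open are hypothetical helper names, not the proposed interface,
and the loop shares the concerns raised above (it closes fds it does not own
and races with anything else that opens fds concurrently).

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the proposed syscall: close every descriptor in
 * [first, last] one at a time. Returns the number of descriptors that were
 * actually open and got closed.
 */
static int toy_close_range(int first, int last)
{
	int closed = 0;

	if (first < 0 || first > last)
		return 0;

	for (int fd = first; ; fd++) {
		/* A failure (typically EBADF) means the slot was not open. */
		if (close(fd) == 0)
			closed++;
		/* Test before incrementing so fd never overflows. */
		if (fd == last)
			break;
	}
	return closed;
}

/* Returns 1 if fd refers to an open descriptor, 0 otherwise. */
static int toy_fd_is_open(int fd)
{
	return fcntl(fd, F_GETFD) != -1;
}
```

the loop costs one close() call per slot in the range, open or not, which is
the overhead a single in-kernel call would avoid for large ranges.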
^ permalink raw reply