* [RFC PATCH 01/19] mm: Introduce vm_account
[not found] <cover.f52b9eb2792bccb8a9ecd6bc95055705cfe2ae03.1674538665.git-series.apopple@nvidia.com>
@ 2023-01-24 5:42 ` Alistair Popple
2023-01-24 6:29 ` Christoph Hellwig
` (2 more replies)
2023-01-24 5:42 ` [RFC PATCH 04/19] infiniband/umem: Convert to use vm_account Alistair Popple
` (3 subsequent siblings)
4 siblings, 3 replies; 20+ messages in thread
From: Alistair Popple @ 2023-01-24 5:42 UTC (permalink / raw)
To: linux-mm, cgroups
Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
daniel, Alistair Popple, linuxppc-dev, linux-fpga, linux-rdma,
virtualization, kvm, netdev, io-uring, bpf, rds-devel,
linux-kselftest
Kernel drivers that pin pages should account these pages against
either user->locked_vm or mm->pinned_vm and fail the pinning if
RLIMIT_MEMLOCK is exceeded and CAP_IPC_LOCK isn't held.
Currently drivers open-code this accounting and use various methods to
update the atomic variables and check against the limits, leading to
various bugs and inconsistencies. To fix this, introduce a standard
interface for charging pinned and locked memory. As this involves
taking references on kernel objects such as mm_struct or user_struct
we introduce a new vm_account struct to hold these references. Several
helper functions are then introduced to grab references and check
limits.
As the way these limits are charged and enforced is visible to
userspace we need to be careful not to break existing applications by
charging to different counters. As a result the vm_account functions
support accounting to different counters as required.
A future change will extend this to also account against a cgroup for
pinned pages.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-fpga@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: io-uring@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: bpf@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Cc: linux-kselftest@vger.kernel.org
---
include/linux/mm_types.h | 87 ++++++++++++++++++++++++++++++++++++++++-
mm/util.c | 89 +++++++++++++++++++++++++++++++++++++++++-
2 files changed, 176 insertions(+)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9757067..7de2168 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1085,4 +1085,91 @@ enum fault_flag {
typedef unsigned int __bitwise zap_flags_t;
+/**
+ * enum vm_account_flags - Determine how pinned/locked memory is accounted.
+ * @VM_ACCOUNT_TASK: Account pinned memory to mm->pinned_vm.
+ * @VM_ACCOUNT_BYPASS: Don't enforce rlimit on any charges.
+ * @VM_ACCOUNT_USER: Account locked memory to user->locked_vm.
+ *
+ * Determines which statistic pinned/locked memory is accounted
+ * against. All limits will be enforced against RLIMIT_MEMLOCK and the
+ * pins cgroup if CONFIG_CGROUP_PINS is enabled.
+ *
+ * New drivers should use VM_ACCOUNT_TASK. VM_ACCOUNT_USER is used by
+ * pre-existing drivers to maintain existing accounting against
+ * user->locked_vm rather than mm->pinned_vm.
+ *
+ * VM_ACCOUNT_BYPASS may also be specified to bypass rlimit
+ * checks. Typically this is used to cache CAP_IPC_LOCK from when a
+ * driver is first initialised. Note that this does not bypass cgroup
+ * limit checks.
+ */
+enum vm_account_flags {
+ VM_ACCOUNT_TASK = 0,
+ VM_ACCOUNT_BYPASS = 1,
+ VM_ACCOUNT_USER = 2,
+};
+
+struct vm_account {
+ struct task_struct *task;
+ union {
+ struct mm_struct *mm;
+ struct user_struct *user;
+ } a;
+ enum vm_account_flags flags;
+};
+
+/**
+ * vm_account_init - Initialise a new struct vm_account.
+ * @vm_account: pointer to uninitialised vm_account.
+ * @task: task to charge against.
+ * @user: user to charge against. Must be non-NULL for VM_ACCOUNT_USER.
+ * @flags: flags to use when charging to vm_account.
+ *
+ * Initialise a new uninitialised struct vm_account. Takes references
+ * on the task/mm/user/cgroup as required, although callers must ensure
+ * any references passed in remain valid for the duration of this
+ * call.
+ */
+void vm_account_init(struct vm_account *vm_account, struct task_struct *task,
+ struct user_struct *user, enum vm_account_flags flags);
+/**
+ * vm_account_init_current - Initialise a new struct vm_account.
+ * @vm_account: pointer to uninitialised vm_account.
+ *
+ * Helper to initialise a vm_account for the common case of charging
+ * with VM_ACCOUNT_TASK against current.
+ */
+void vm_account_init_current(struct vm_account *vm_account);
+
+/**
+ * vm_account_release - Release references held by a struct vm_account.
+ * @vm_account: pointer to initialised vm_account.
+ *
+ * Drop any object references obtained by vm_account_init(). The
+ * vm_account must not be used after calling this unless reinitialised
+ * with vm_account_init().
+ */
+void vm_account_release(struct vm_account *vm_account);
+
+/**
+ * vm_account_pinned - Charge pinned or locked memory to the vm_account.
+ * @vm_account: pointer to an initialised vm_account.
+ * @npages: number of pages to charge.
+ *
+ * Return: 0 on success, -ENOMEM if a limit would be exceeded.
+ *
+ * Note: All pages must be explicitly uncharged with
+ * vm_unaccount_pinned() prior to releasing the vm_account with
+ * vm_account_release().
+ */
+int vm_account_pinned(struct vm_account *vm_account, unsigned long npages);
+
+/**
+ * vm_unaccount_pinned - Uncharge pinned or locked memory from the vm_account.
+ * @vm_account: pointer to an initialised vm_account.
+ * @npages: number of pages to uncharge.
+ */
+void vm_unaccount_pinned(struct vm_account *vm_account, unsigned long npages);
+
#endif /* _LINUX_MM_TYPES_H */
diff --git a/mm/util.c b/mm/util.c
index b56c92f..af40b1e 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -430,6 +430,95 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
}
#endif
+void vm_account_init(struct vm_account *vm_account, struct task_struct *task,
+ struct user_struct *user, enum vm_account_flags flags)
+{
+ vm_account->task = get_task_struct(task);
+
+ if (flags & VM_ACCOUNT_USER) {
+ vm_account->a.user = get_uid(user);
+ } else {
+ mmgrab(task->mm);
+ vm_account->a.mm = task->mm;
+ }
+
+ vm_account->flags = flags;
+}
+EXPORT_SYMBOL_GPL(vm_account_init);
+
+void vm_account_init_current(struct vm_account *vm_account)
+{
+ vm_account_init(vm_account, current, NULL, VM_ACCOUNT_TASK);
+}
+EXPORT_SYMBOL_GPL(vm_account_init_current);
+
+void vm_account_release(struct vm_account *vm_account)
+{
+ put_task_struct(vm_account->task);
+ if (vm_account->flags & VM_ACCOUNT_USER)
+ free_uid(vm_account->a.user);
+ else
+ mmdrop(vm_account->a.mm);
+}
+EXPORT_SYMBOL_GPL(vm_account_release);
+
+/*
+ * Charge pages with an atomic compare and swap. Returns -ENOMEM on
+ * failure, 1 on success and 0 for retry.
+ */
+static int vm_account_cmpxchg(struct vm_account *vm_account,
+ unsigned long npages, unsigned long lock_limit)
+{
+ u64 cur_pages, new_pages;
+
+ if (vm_account->flags & VM_ACCOUNT_USER)
+ cur_pages = atomic_long_read(&vm_account->a.user->locked_vm);
+ else
+ cur_pages = atomic64_read(&vm_account->a.mm->pinned_vm);
+
+ new_pages = cur_pages + npages;
+ if (lock_limit != RLIM_INFINITY && new_pages > lock_limit)
+ return -ENOMEM;
+
+ if (vm_account->flags & VM_ACCOUNT_USER) {
+ return atomic_long_cmpxchg(&vm_account->a.user->locked_vm,
+ cur_pages, new_pages) == cur_pages;
+ } else {
+ return atomic64_cmpxchg(&vm_account->a.mm->pinned_vm,
+ cur_pages, new_pages) == cur_pages;
+ }
+}
+
+int vm_account_pinned(struct vm_account *vm_account, unsigned long npages)
+{
+ unsigned long lock_limit = RLIM_INFINITY;
+ int ret;
+
+ if (!(vm_account->flags & VM_ACCOUNT_BYPASS) && !capable(CAP_IPC_LOCK))
+ lock_limit = task_rlimit(vm_account->task,
+ RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+ while (true) {
+ ret = vm_account_cmpxchg(vm_account, npages, lock_limit);
+ if (ret > 0)
+ break;
+ else if (ret < 0)
+ return ret;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vm_account_pinned);
+
+void vm_unaccount_pinned(struct vm_account *vm_account, unsigned long npages)
+{
+ if (vm_account->flags & VM_ACCOUNT_USER)
+ atomic_long_sub(npages, &vm_account->a.user->locked_vm);
+ else
+ atomic64_sub(npages, &vm_account->a.mm->pinned_vm);
+}
+EXPORT_SYMBOL_GPL(vm_unaccount_pinned);
+
/**
* __account_locked_vm - account locked pages to an mm's locked_vm
* @mm: mm to account against
--
git-series 0.9.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 01/19] mm: Introduce vm_account
2023-01-24 5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
@ 2023-01-24 6:29 ` Christoph Hellwig
2023-01-24 14:32 ` Jason Gunthorpe
2023-01-31 14:00 ` David Hildenbrand
2 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2023-01-24 6:29 UTC (permalink / raw)
To: Alistair Popple
Cc: linux-mm, cgroups, linux-kernel, jgg, jhubbard, tjmercier, hannes,
surenb, mkoutny, daniel, linuxppc-dev, linux-fpga, linux-rdma,
virtualization, kvm, netdev, io-uring, bpf, rds-devel,
linux-kselftest
> +/**
> + * vm_account_init - Initialise a new struct vm_account.
> + * @vm_account: pointer to uninitialised vm_account.
> + * @task: task to charge against.
> + * @user: user to charge against. Must be non-NULL for VM_ACCOUNT_USER.
> + * @flags: flags to use when charging to vm_account.
> + *
> + * Initialise a new uninitialised struct vm_account. Takes references
> + * on the task/mm/user/cgroup as required although callers must ensure
> + * any references passed in remain valid for the duration of this
> + * call.
> + */
> +void vm_account_init(struct vm_account *vm_account, struct task_struct *task,
> + struct user_struct *user, enum vm_account_flags flags);
kerneldoc comments are supposed to be next to the implementation, and
not the declaration in the header.
* Re: [RFC PATCH 01/19] mm: Introduce vm_account
2023-01-24 5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
2023-01-24 6:29 ` Christoph Hellwig
@ 2023-01-24 14:32 ` Jason Gunthorpe
2023-01-30 11:36 ` Alistair Popple
2023-01-31 14:00 ` David Hildenbrand
2 siblings, 1 reply; 20+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 14:32 UTC (permalink / raw)
To: Alistair Popple
Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
surenb, mkoutny, daniel, linuxppc-dev, linux-fpga, linux-rdma,
virtualization, kvm, netdev, io-uring, bpf, rds-devel,
linux-kselftest
On Tue, Jan 24, 2023 at 04:42:30PM +1100, Alistair Popple wrote:
> +/**
> + * enum vm_account_flags - Determine how pinned/locked memory is accounted.
> + * @VM_ACCOUNT_TASK: Account pinned memory to mm->pinned_vm.
> + * @VM_ACCOUNT_BYPASS: Don't enforce rlimit on any charges.
> + * @VM_ACCOUNT_USER: Account locked memory to user->locked_vm.
> + *
> + * Determines which statistic pinned/locked memory is accounted
> + * against. All limits will be enforced against RLIMIT_MEMLOCK and the
> + * pins cgroup if CONFIG_CGROUP_PINS is enabled.
> + *
> + * New drivers should use VM_ACCOUNT_TASK. VM_ACCOUNT_USER is used by
> + * pre-existing drivers to maintain existing accounting against
> + * user->locked_vm rather than mm->pinned_vm.
I thought the guidance was the opposite of this, it is the newer
places in the kernel that are using VM_ACCOUNT_USER?
I haven't got to the rest of the patches yet, but isn't there also a
mm->pinned_vm vs mm->locked_vm variation in the current drivers as
well?
> +void vm_account_init_current(struct vm_account *vm_account)
> +{
> + vm_account_init(vm_account, current, NULL, VM_ACCOUNT_TASK);
> +}
> +EXPORT_SYMBOL_GPL(vm_account_init_current);
This can probably just be a static inline
You might consider putting all this in some new vm_account.h - given
how rarely it is used? Compile times and all
Jason
* Re: [RFC PATCH 01/19] mm: Introduce vm_account
2023-01-24 14:32 ` Jason Gunthorpe
@ 2023-01-30 11:36 ` Alistair Popple
0 siblings, 0 replies; 20+ messages in thread
From: Alistair Popple @ 2023-01-30 11:36 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
surenb, mkoutny, daniel, linuxppc-dev, linux-fpga, linux-rdma,
virtualization, kvm, netdev, io-uring, bpf, rds-devel,
linux-kselftest
Jason Gunthorpe <jgg@nvidia.com> writes:
> On Tue, Jan 24, 2023 at 04:42:30PM +1100, Alistair Popple wrote:
>> +/**
>> + * enum vm_account_flags - Determine how pinned/locked memory is accounted.
>> + * @VM_ACCOUNT_TASK: Account pinned memory to mm->pinned_vm.
>> + * @VM_ACCOUNT_BYPASS: Don't enforce rlimit on any charges.
>> + * @VM_ACCOUNT_USER: Account locked memory to user->locked_vm.
>> + *
>> + * Determines which statistic pinned/locked memory is accounted
>> + * against. All limits will be enforced against RLIMIT_MEMLOCK and the
>> + * pins cgroup if CONFIG_CGROUP_PINS is enabled.
>> + *
>> + * New drivers should use VM_ACCOUNT_TASK. VM_ACCOUNT_USER is used by
>> + * pre-existing drivers to maintain existing accounting against
>> + * user->locked_vm rather than mm->pinned_vm.
>
> I thought the guidance was the opposite of this, it is the newer
> places in the kernel that are using VM_ACCOUNT_USER?
I'd just assumed mm->pinned_vm was preferred because that's what most
drivers use. user->locked_vm does seem more sensible though as at least
it's possible to meaningfully enforce some overall limit. Will switch
the flags/comment around to suggest new users use VM_ACCOUNT_USER.
> I haven't got to the rest of the patches yet, but isn't there also a
> mm->pinned_vm vs mm->locked_vm variation in the current drivers as
> well?
>
>> +void vm_account_init_current(struct vm_account *vm_account)
>> +{
>> + vm_account_init(vm_account, current, NULL, VM_ACCOUNT_TASK);
>> +}
>> +EXPORT_SYMBOL_GPL(vm_account_init_current);
>
> This can probably just be a static inline
>
> You might consider putting all this in some new vm_account.h - given
> how rarely it is used? Compile times and all
Works for me.
> Jason
* Re: [RFC PATCH 01/19] mm: Introduce vm_account
2023-01-24 5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
2023-01-24 6:29 ` Christoph Hellwig
2023-01-24 14:32 ` Jason Gunthorpe
@ 2023-01-31 14:00 ` David Hildenbrand
2 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2023-01-31 14:00 UTC (permalink / raw)
To: Alistair Popple, linux-mm, cgroups
Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
daniel, linuxppc-dev, linux-fpga, linux-rdma, virtualization, kvm,
netdev, io-uring, bpf, rds-devel, linux-kselftest
On 24.01.23 06:42, Alistair Popple wrote:
> Kernel drivers that pin pages should account these pages against
> either user->locked_vm or mm->pinned_vm and fail the pinning if
> RLIMIT_MEMLOCK is exceeded and CAP_IPC_LOCK isn't held.
>
> Currently drivers open-code this accounting and use various methods to
> update the atomic variables and check against the limits leading to
> various bugs and inconsistencies. To fix this introduce a standard
> interface for charging pinned and locked memory. As this involves
> taking references on kernel objects such as mm_struct or user_struct
> we introduce a new vm_account struct to hold these references. Several
> helper functions are then introduced to grab references and check
> limits.
>
> As the way these limits are charged and enforced is visible to
> userspace we need to be careful not to break existing applications by
> charging to different counters. As a result the vm_account functions
> support accounting to different counters as required.
>
> A future change will extend this to also account against a cgroup for
> pinned pages.
The term "vm_account" is misleading, no? VM_ACCOUNT is for accounting
towards the commit limit ....
--
Thanks,
David / dhildenb
* [RFC PATCH 04/19] infiniband/umem: Convert to use vm_account
[not found] <cover.f52b9eb2792bccb8a9ecd6bc95055705cfe2ae03.1674538665.git-series.apopple@nvidia.com>
2023-01-24 5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
@ 2023-01-24 5:42 ` Alistair Popple
2023-01-24 5:42 ` [RFC PATCH 05/19] RMDA/siw: " Alistair Popple
` (2 subsequent siblings)
4 siblings, 0 replies; 20+ messages in thread
From: Alistair Popple @ 2023-01-24 5:42 UTC (permalink / raw)
To: linux-mm, cgroups
Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
daniel, Alistair Popple, Jason Gunthorpe, Leon Romanovsky,
linux-rdma
Converts the infiniband core umem code to use the vm_account structure
so that pinned pages can be charged to the correct cgroup with
account_pinned_vm().
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
drivers/infiniband/core/umem.c | 16 ++++++----------
drivers/infiniband/core/umem_odp.c | 6 ++++++
include/rdma/ib_umem.h | 1 +
3 files changed, 13 insertions(+), 10 deletions(-)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 755a9c5..479b7f0 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -149,8 +149,6 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
{
struct ib_umem *umem;
struct page **page_list;
- unsigned long lock_limit;
- unsigned long new_pinned;
unsigned long cur_base;
unsigned long dma_attr = 0;
struct mm_struct *mm;
@@ -186,6 +184,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
umem->writable = ib_access_writable(access);
umem->owning_mm = mm = current->mm;
mmgrab(mm);
+ vm_account_init_current(&umem->vm_account);
page_list = (struct page **) __get_free_page(GFP_KERNEL);
if (!page_list) {
@@ -199,11 +198,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
goto out;
}
- lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
- new_pinned = atomic64_add_return(npages, &mm->pinned_vm);
- if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
- atomic64_sub(npages, &mm->pinned_vm);
+ if (vm_account_pinned(&umem->vm_account, npages)) {
ret = -ENOMEM;
goto out;
}
@@ -248,12 +243,13 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
umem_release:
__ib_umem_release(device, umem, 0);
- atomic64_sub(ib_umem_num_pages(umem), &mm->pinned_vm);
+ vm_unaccount_pinned(&umem->vm_account, ib_umem_num_pages(umem));
out:
free_page((unsigned long) page_list);
umem_kfree:
if (ret) {
mmdrop(umem->owning_mm);
+ vm_account_release(&umem->vm_account);
kfree(umem);
}
return ret ? ERR_PTR(ret) : umem;
@@ -275,8 +271,8 @@ void ib_umem_release(struct ib_umem *umem)
__ib_umem_release(umem->ibdev, umem, 1);
- atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
- mmdrop(umem->owning_mm);
+ vm_unaccount_pinned(&umem->vm_account, ib_umem_num_pages(umem));
+ vm_account_release(&umem->vm_account);
kfree(umem);
}
EXPORT_SYMBOL(ib_umem_release);
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e9fa22d..4fbca3e 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -130,6 +130,7 @@ struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_device *device,
umem->ibdev = device;
umem->writable = ib_access_writable(access);
umem->owning_mm = current->mm;
+ vm_account_init_current(&umem->vm_account);
umem_odp->is_implicit_odp = 1;
umem_odp->page_shift = PAGE_SHIFT;
@@ -137,6 +138,7 @@ struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_device *device,
ret = ib_init_umem_odp(umem_odp, NULL);
if (ret) {
put_pid(umem_odp->tgid);
+ vm_account_release(&umem->vm_account);
kfree(umem_odp);
return ERR_PTR(ret);
}
@@ -179,6 +181,7 @@ ib_umem_odp_alloc_child(struct ib_umem_odp *root, unsigned long addr,
umem->address = addr;
umem->writable = root->umem.writable;
umem->owning_mm = root->umem.owning_mm;
+ umem->vm_account = root->umem.vm_account;
odp_data->page_shift = PAGE_SHIFT;
odp_data->notifier.ops = ops;
@@ -239,6 +242,7 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_device *device,
umem_odp->umem.address = addr;
umem_odp->umem.writable = ib_access_writable(access);
umem_odp->umem.owning_mm = current->mm;
+ vm_account_init_current(&umem_odp->umem.vm_account);
umem_odp->notifier.ops = ops;
umem_odp->page_shift = PAGE_SHIFT;
@@ -255,6 +259,7 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_device *device,
err_put_pid:
put_pid(umem_odp->tgid);
+ vm_account_release(&umem_odp->umem.vm_account);
kfree(umem_odp);
return ERR_PTR(ret);
}
@@ -278,6 +283,7 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
kvfree(umem_odp->pfn_list);
}
put_pid(umem_odp->tgid);
+ vm_account_release(&umem_odp->umem.vm_account);
kfree(umem_odp);
}
EXPORT_SYMBOL(ib_umem_odp_release);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 92a673c..de51406 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -19,6 +19,7 @@ struct dma_buf_attach_ops;
struct ib_umem {
struct ib_device *ibdev;
struct mm_struct *owning_mm;
+ struct vm_account vm_account;
u64 iova;
size_t length;
unsigned long address;
--
git-series 0.9.1
* [RFC PATCH 05/19] RMDA/siw: Convert to use vm_account
[not found] <cover.f52b9eb2792bccb8a9ecd6bc95055705cfe2ae03.1674538665.git-series.apopple@nvidia.com>
2023-01-24 5:42 ` [RFC PATCH 01/19] mm: Introduce vm_account Alistair Popple
2023-01-24 5:42 ` [RFC PATCH 04/19] infiniband/umem: Convert to use vm_account Alistair Popple
@ 2023-01-24 5:42 ` Alistair Popple
2023-01-24 14:37 ` Jason Gunthorpe
2023-01-24 5:42 ` [RFC PATCH 06/19] RDMA/usnic: convert " Alistair Popple
2023-01-24 5:42 ` [RFC PATCH 10/19] net: skb: Switch to using vm_account Alistair Popple
4 siblings, 1 reply; 20+ messages in thread
From: Alistair Popple @ 2023-01-24 5:42 UTC (permalink / raw)
To: linux-mm, cgroups
Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
daniel, Alistair Popple, Bernard Metzler, Jason Gunthorpe,
Leon Romanovsky, linux-rdma
Convert to using a vm_account structure to account pinned memory to
both the mm and the pins cgroup.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Bernard Metzler <bmt@zurich.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
drivers/infiniband/sw/siw/siw.h | 2 +-
drivers/infiniband/sw/siw/siw_mem.c | 20 ++++++--------------
drivers/infiniband/sw/siw/siw_verbs.c | 15 ---------------
3 files changed, 7 insertions(+), 30 deletions(-)
diff --git a/drivers/infiniband/sw/siw/siw.h b/drivers/infiniband/sw/siw/siw.h
index 2f3a9cd..0c4a3ec 100644
--- a/drivers/infiniband/sw/siw/siw.h
+++ b/drivers/infiniband/sw/siw/siw.h
@@ -124,7 +124,7 @@ struct siw_umem {
int num_pages;
bool writable;
u64 fp_addr; /* First page base address */
- struct mm_struct *owning_mm;
+ struct vm_account vm_account;
};
struct siw_pble {
diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
index b2b33dd..9c53fc3 100644
--- a/drivers/infiniband/sw/siw/siw_mem.c
+++ b/drivers/infiniband/sw/siw/siw_mem.c
@@ -68,7 +68,6 @@ static void siw_free_plist(struct siw_page_chunk *chunk, int num_pages,
void siw_umem_release(struct siw_umem *umem, bool dirty)
{
- struct mm_struct *mm_s = umem->owning_mm;
int i, num_pages = umem->num_pages;
for (i = 0; num_pages; i++) {
@@ -79,9 +78,9 @@ void siw_umem_release(struct siw_umem *umem, bool dirty)
kfree(umem->page_chunk[i].plist);
num_pages -= to_free;
}
- atomic64_sub(umem->num_pages, &mm_s->pinned_vm);
+ vm_unaccount_pinned(&umem->vm_account, umem->num_pages);
+ vm_account_release(&umem->vm_account);
- mmdrop(mm_s);
kfree(umem->page_chunk);
kfree(umem);
}
@@ -365,9 +364,7 @@ struct siw_pbl *siw_pbl_alloc(u32 num_buf)
struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
{
struct siw_umem *umem;
- struct mm_struct *mm_s;
u64 first_page_va;
- unsigned long mlock_limit;
unsigned int foll_flags = FOLL_LONGTERM;
int num_pages, num_chunks, i, rv = 0;
@@ -385,20 +382,16 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
if (!umem)
return ERR_PTR(-ENOMEM);
- mm_s = current->mm;
- umem->owning_mm = mm_s;
umem->writable = writable;
- mmgrab(mm_s);
+ vm_account_init_current(&umem->vm_account);
if (writable)
foll_flags |= FOLL_WRITE;
- mmap_read_lock(mm_s);
+ mmap_read_lock(current->mm);
- mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
- if (num_pages + atomic64_read(&mm_s->pinned_vm) > mlock_limit) {
+ if (vm_account_pinned(&umem->vm_account, num_pages)) {
rv = -ENOMEM;
goto out_sem_up;
}
@@ -429,7 +422,6 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
goto out_sem_up;
umem->num_pages += rv;
- atomic64_add(rv, &mm_s->pinned_vm);
first_page_va += rv * PAGE_SIZE;
nents -= rv;
got += rv;
@@ -437,7 +429,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
num_pages -= got;
}
out_sem_up:
- mmap_read_unlock(mm_s);
+ mmap_read_unlock(current->mm);
if (rv > 0)
return umem;
diff --git a/drivers/infiniband/sw/siw/siw_verbs.c b/drivers/infiniband/sw/siw/siw_verbs.c
index 906fde1..8fab009 100644
--- a/drivers/infiniband/sw/siw/siw_verbs.c
+++ b/drivers/infiniband/sw/siw/siw_verbs.c
@@ -1321,8 +1321,6 @@ struct ib_mr *siw_reg_user_mr(struct ib_pd *pd, u64 start, u64 len,
struct siw_umem *umem = NULL;
struct siw_ureq_reg_mr ureq;
struct siw_device *sdev = to_siw_dev(pd->device);
-
- unsigned long mem_limit = rlimit(RLIMIT_MEMLOCK);
int rv;
siw_dbg_pd(pd, "start: 0x%pK, va: 0x%pK, len: %llu\n",
@@ -1338,19 +1336,6 @@ struct ib_mr *siw_reg_user_mr(struct ib_pd *pd, u64 start, u64 len,
rv = -EINVAL;
goto err_out;
}
- if (mem_limit != RLIM_INFINITY) {
- unsigned long num_pages =
- (PAGE_ALIGN(len + (start & ~PAGE_MASK))) >> PAGE_SHIFT;
- mem_limit >>= PAGE_SHIFT;
-
- if (num_pages > mem_limit - current->mm->locked_vm) {
- siw_dbg_pd(pd, "pages req %lu, max %lu, lock %lu\n",
- num_pages, mem_limit,
- current->mm->locked_vm);
- rv = -ENOMEM;
- goto err_out;
- }
- }
umem = siw_umem_get(start, len, ib_access_writable(rights));
if (IS_ERR(umem)) {
rv = PTR_ERR(umem);
--
git-series 0.9.1
* Re: [RFC PATCH 05/19] RMDA/siw: Convert to use vm_account
2023-01-24 5:42 ` [RFC PATCH 05/19] RMDA/siw: " Alistair Popple
@ 2023-01-24 14:37 ` Jason Gunthorpe
2023-01-24 15:22 ` Bernard Metzler
2023-01-24 15:56 ` Bernard Metzler
0 siblings, 2 replies; 20+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 14:37 UTC (permalink / raw)
To: Alistair Popple
Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
surenb, mkoutny, daniel, Bernard Metzler, Leon Romanovsky,
linux-rdma
On Tue, Jan 24, 2023 at 04:42:34PM +1100, Alistair Popple wrote:
> @@ -385,20 +382,16 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
> if (!umem)
> return ERR_PTR(-ENOMEM);
>
> - mm_s = current->mm;
> - umem->owning_mm = mm_s;
> umem->writable = writable;
>
> - mmgrab(mm_s);
> + vm_account_init_current(&umem->vm_account);
>
> if (writable)
> foll_flags |= FOLL_WRITE;
>
> - mmap_read_lock(mm_s);
> + mmap_read_lock(current->mm);
>
> - mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> -
> - if (num_pages + atomic64_read(&mm_s->pinned_vm) > mlock_limit) {
> + if (vm_account_pinned(&umem->vm_account, num_pages)) {
> rv = -ENOMEM;
> goto out_sem_up;
> }
> @@ -429,7 +422,6 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
> goto out_sem_up;
>
> umem->num_pages += rv;
> - atomic64_add(rv, &mm_s->pinned_vm);
Also fixes the race bug
Jason
* RE: [RFC PATCH 05/19] RMDA/siw: Convert to use vm_account
2023-01-24 14:37 ` Jason Gunthorpe
@ 2023-01-24 15:22 ` Bernard Metzler
2023-01-24 15:56 ` Bernard Metzler
1 sibling, 0 replies; 20+ messages in thread
From: Bernard Metzler @ 2023-01-24 15:22 UTC (permalink / raw)
To: Jason Gunthorpe, Alistair Popple
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org, jhubbard@nvidia.com,
tjmercier@google.com, hannes@cmpxchg.org, surenb@google.com,
mkoutny@suse.com, daniel@ffwll.ch, Leon Romanovsky,
linux-rdma@vger.kernel.org
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, 24 January 2023 15:37
> To: Alistair Popple <apopple@nvidia.com>
> Cc: linux-mm@kvack.org; cgroups@vger.kernel.org; linux-
> kernel@vger.kernel.org; jhubbard@nvidia.com; tjmercier@google.com;
> hannes@cmpxchg.org; surenb@google.com; mkoutny@suse.com; daniel@ffwll.ch;
> Bernard Metzler <BMT@zurich.ibm.com>; Leon Romanovsky <leon@kernel.org>;
> linux-rdma@vger.kernel.org
> Subject: [EXTERNAL] Re: [RFC PATCH 05/19] RMDA/siw: Convert to use
> vm_account
>
> On Tue, Jan 24, 2023 at 04:42:34PM +1100, Alistair Popple wrote:
>
> > @@ -385,20 +382,16 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
> bool writable)
> > if (!umem)
> > return ERR_PTR(-ENOMEM);
> >
> > - mm_s = current->mm;
> > - umem->owning_mm = mm_s;
> > umem->writable = writable;
> >
> > - mmgrab(mm_s);
> > + vm_account_init_current(&umem->vm_account);
> >
> > if (writable)
> > foll_flags |= FOLL_WRITE;
> >
> > - mmap_read_lock(mm_s);
> > + mmap_read_lock(current->mm);
> >
> > - mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > -
> > - if (num_pages + atomic64_read(&mm_s->pinned_vm) > mlock_limit) {
> > + if (vm_account_pinned(&umem->vm_account, num_pages)) {
> > rv = -ENOMEM;
> > goto out_sem_up;
> > }
> > @@ -429,7 +422,6 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
> bool writable)
> > goto out_sem_up;
> >
> > umem->num_pages += rv;
> > - atomic64_add(rv, &mm_s->pinned_vm);
>
> Also fixes the race bug
>
True! Should have used atomic64_add_return() in the first place...
Bernard.
* RE: [RFC PATCH 05/19] RMDA/siw: Convert to use vm_account
2023-01-24 14:37 ` Jason Gunthorpe
2023-01-24 15:22 ` Bernard Metzler
@ 2023-01-24 15:56 ` Bernard Metzler
2023-01-30 11:34 ` Alistair Popple
1 sibling, 1 reply; 20+ messages in thread
From: Bernard Metzler @ 2023-01-24 15:56 UTC (permalink / raw)
To: Jason Gunthorpe, Alistair Popple
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org, jhubbard@nvidia.com,
tjmercier@google.com, hannes@cmpxchg.org, surenb@google.com,
mkoutny@suse.com, daniel@ffwll.ch, Leon Romanovsky,
linux-rdma@vger.kernel.org
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, 24 January 2023 15:37
> To: Alistair Popple <apopple@nvidia.com>
> Cc: linux-mm@kvack.org; cgroups@vger.kernel.org; linux-
> kernel@vger.kernel.org; jhubbard@nvidia.com; tjmercier@google.com;
> hannes@cmpxchg.org; surenb@google.com; mkoutny@suse.com; daniel@ffwll.ch;
> Bernard Metzler <BMT@zurich.ibm.com>; Leon Romanovsky <leon@kernel.org>;
> linux-rdma@vger.kernel.org
> Subject: [EXTERNAL] Re: [RFC PATCH 05/19] RMDA/siw: Convert to use
> vm_account
>
> On Tue, Jan 24, 2023 at 04:42:34PM +1100, Alistair Popple wrote:
>
> > @@ -385,20 +382,16 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
> bool writable)
> > if (!umem)
> > return ERR_PTR(-ENOMEM);
> >
> > - mm_s = current->mm;
> > - umem->owning_mm = mm_s;
> > umem->writable = writable;
> >
> > - mmgrab(mm_s);
> > + vm_account_init_current(&umem->vm_account);
> >
> > if (writable)
> > foll_flags |= FOLL_WRITE;
> >
> > - mmap_read_lock(mm_s);
> > + mmap_read_lock(current->mm);
> >
> > - mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > -
> > - if (num_pages + atomic64_read(&mm_s->pinned_vm) > mlock_limit) {
> > + if (vm_account_pinned(&umem->vm_account, num_pages)) {
> > rv = -ENOMEM;
> > goto out_sem_up;
> > }
> > @@ -429,7 +422,6 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
> bool writable)
> > goto out_sem_up;
> >
> > umem->num_pages += rv;
> > - atomic64_add(rv, &mm_s->pinned_vm);
>
> Also fixes the race bug
But introduces another one. In that loop, umem->num_pages keeps the
number of pages currently pinned, not the target number. The current
patch uses that umem->num_pages to call vm_unaccount_pinned() in
siw_umem_release(). Bailing out before all pages are pinned would
mess up that accounting during release. Maybe introduce another
parameter to siw_umem_release(), or better have another umem member
'umem->num_pages_accounted' for correct accounting during release.
Bernard.
>
> Jason
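[Editor's note] Bernard's suggestion of a separate `num_pages_accounted` field can be made concrete with a small userspace model. Everything here (the helper names, the limit, the struct layout) is an illustrative stand-in for the kernel's vm_account helpers, not the real API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_long pinned_vm;       /* models mm->pinned_vm */
static const long mlock_limit = 8;  /* models RLIMIT_MEMLOCK, in pages */

struct umem {
    long num_pages;            /* pages actually pinned so far */
    long num_pages_accounted;  /* pages charged against the limit */
};

/* Models vm_account_pinned(): charge the full target up front and
 * fail if the limit would be exceeded. */
static bool account_pinned(struct umem *u, long npages)
{
    if (atomic_fetch_add(&pinned_vm, npages) + npages > mlock_limit) {
        atomic_fetch_sub(&pinned_vm, npages);
        return false;
    }
    u->num_pages_accounted += npages;
    return true;
}

/* Models siw_umem_release(): uncharge what was *accounted*, not what
 * was pinned, so bailing out of the pin loop early cannot corrupt
 * the counter. */
static void umem_release(struct umem *u)
{
    atomic_fetch_sub(&pinned_vm, u->num_pages_accounted);
    u->num_pages_accounted = 0;
}
```

Even when the pin loop stops early (so `num_pages < num_pages_accounted`), release undoes exactly the amount that was charged, which is the invariant Bernard is after.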
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 05/19] RMDA/siw: Convert to use vm_account
2023-01-24 15:56 ` Bernard Metzler
@ 2023-01-30 11:34 ` Alistair Popple
2023-01-30 13:27 ` Bernard Metzler
0 siblings, 1 reply; 20+ messages in thread
From: Alistair Popple @ 2023-01-30 11:34 UTC (permalink / raw)
To: Bernard Metzler
Cc: Jason Gunthorpe, linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org, jhubbard@nvidia.com,
tjmercier@google.com, hannes@cmpxchg.org, surenb@google.com,
mkoutny@suse.com, daniel@ffwll.ch, Leon Romanovsky,
linux-rdma@vger.kernel.org
Bernard Metzler <BMT@zurich.ibm.com> writes:
>> -----Original Message-----
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Tuesday, 24 January 2023 15:37
>> To: Alistair Popple <apopple@nvidia.com>
>> Cc: linux-mm@kvack.org; cgroups@vger.kernel.org; linux-
>> kernel@vger.kernel.org; jhubbard@nvidia.com; tjmercier@google.com;
>> hannes@cmpxchg.org; surenb@google.com; mkoutny@suse.com; daniel@ffwll.ch;
>> Bernard Metzler <BMT@zurich.ibm.com>; Leon Romanovsky <leon@kernel.org>;
>> linux-rdma@vger.kernel.org
>> Subject: [EXTERNAL] Re: [RFC PATCH 05/19] RMDA/siw: Convert to use
>> vm_account
>>
>> On Tue, Jan 24, 2023 at 04:42:34PM +1100, Alistair Popple wrote:
>>
>> > @@ -385,20 +382,16 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
>> bool writable)
>> > if (!umem)
>> > return ERR_PTR(-ENOMEM);
>> >
>> > - mm_s = current->mm;
>> > - umem->owning_mm = mm_s;
>> > umem->writable = writable;
>> >
>> > - mmgrab(mm_s);
>> > + vm_account_init_current(&umem->vm_account);
>> >
>> > if (writable)
>> > foll_flags |= FOLL_WRITE;
>> >
>> > - mmap_read_lock(mm_s);
>> > + mmap_read_lock(current->mm);
>> >
>> > - mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> > -
>> > - if (num_pages + atomic64_read(&mm_s->pinned_vm) > mlock_limit) {
>> > + if (vm_account_pinned(&umem->vm_account, num_pages)) {
>> > rv = -ENOMEM;
>> > goto out_sem_up;
>> > }
>> > @@ -429,7 +422,6 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
>> bool writable)
>> > goto out_sem_up;
>> >
>> > umem->num_pages += rv;
>> > - atomic64_add(rv, &mm_s->pinned_vm);
>>
>> Also fixes the race bug
>
> But introduces another one. In that loop, umem->num_pages keeps the
> number of pages currently pinned, not the target number. The current
> patch uses that umem->num_pages to call vm_unaccount_pinned() in
> siw_umem_release(). Bailing out before all pages are pinned would
> mess up that accounting during release. Maybe introduce another
> parameter to siw_umem_release(), or better have another umem member
> 'umem->num_pages_accounted' for correct accounting during release.
Yes, I see the problem, thanks for pointing it out. Will fix for the next
version.
> Bernard.
>>
>> Jason
^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: Re: [RFC PATCH 05/19] RMDA/siw: Convert to use vm_account
2023-01-30 11:34 ` Alistair Popple
@ 2023-01-30 13:27 ` Bernard Metzler
0 siblings, 0 replies; 20+ messages in thread
From: Bernard Metzler @ 2023-01-30 13:27 UTC (permalink / raw)
To: Alistair Popple
Cc: Jason Gunthorpe, linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org, jhubbard@nvidia.com,
tjmercier@google.com, hannes@cmpxchg.org, surenb@google.com,
mkoutny@suse.com, daniel@ffwll.ch, Leon Romanovsky,
linux-rdma@vger.kernel.org
> -----Original Message-----
> From: Alistair Popple <apopple@nvidia.com>
> Sent: Monday, 30 January 2023 12:35
> To: Bernard Metzler <BMT@zurich.ibm.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; linux-mm@kvack.org;
> cgroups@vger.kernel.org; linux-kernel@vger.kernel.org; jhubbard@nvidia.com;
> tjmercier@google.com; hannes@cmpxchg.org; surenb@google.com;
> mkoutny@suse.com; daniel@ffwll.ch; Leon Romanovsky <leon@kernel.org>;
> linux-rdma@vger.kernel.org
> Subject: [EXTERNAL] Re: [RFC PATCH 05/19] RMDA/siw: Convert to use
> vm_account
>
>
> Bernard Metzler <BMT@zurich.ibm.com> writes:
>
> >> -----Original Message-----
> >> From: Jason Gunthorpe <jgg@nvidia.com>
> >> Sent: Tuesday, 24 January 2023 15:37
> >> To: Alistair Popple <apopple@nvidia.com>
> >> Cc: linux-mm@kvack.org; cgroups@vger.kernel.org; linux-
> >> kernel@vger.kernel.org; jhubbard@nvidia.com; tjmercier@google.com;
> >> hannes@cmpxchg.org; surenb@google.com; mkoutny@suse.com;
> daniel@ffwll.ch;
> >> Bernard Metzler <BMT@zurich.ibm.com>; Leon Romanovsky <leon@kernel.org>;
> >> linux-rdma@vger.kernel.org
> >> Subject: [EXTERNAL] Re: [RFC PATCH 05/19] RMDA/siw: Convert to use
> >> vm_account
> >>
> >> On Tue, Jan 24, 2023 at 04:42:34PM +1100, Alistair Popple wrote:
> >>
> >> > @@ -385,20 +382,16 @@ struct siw_umem *siw_umem_get(u64 start, u64
> len,
> >> bool writable)
> >> > if (!umem)
> >> > return ERR_PTR(-ENOMEM);
> >> >
> >> > - mm_s = current->mm;
> >> > - umem->owning_mm = mm_s;
> >> > umem->writable = writable;
> >> >
> >> > - mmgrab(mm_s);
> >> > + vm_account_init_current(&umem->vm_account);
> >> >
> >> > if (writable)
> >> > foll_flags |= FOLL_WRITE;
> >> >
> >> > - mmap_read_lock(mm_s);
> >> > + mmap_read_lock(current->mm);
> >> >
> >> > - mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >> > -
> >> > - if (num_pages + atomic64_read(&mm_s->pinned_vm) > mlock_limit) {
> >> > + if (vm_account_pinned(&umem->vm_account, num_pages)) {
> >> > rv = -ENOMEM;
> >> > goto out_sem_up;
> >> > }
> >> > @@ -429,7 +422,6 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
> >> bool writable)
> >> > goto out_sem_up;
> >> >
> >> > umem->num_pages += rv;
> >> > - atomic64_add(rv, &mm_s->pinned_vm);
> >>
> >> Also fixes the race bug
> >
> > But introduces another one. In that loop, umem->num_pages keeps the
> > number of pages currently pinned, not the target number. The current
> > patch uses that umem->num_pages to call vm_unaccount_pinned() in
> > siw_umem_release(). Bailing out before all pages are pinned would
> > mess up that accounting during release. Maybe introduce another
> > parameter to siw_umem_release(), or better have another umem member
> > 'umem->num_pages_accounted' for correct accounting during release.
>
> Yes, I see the problem, thanks for pointing it out. Will fix for the next
> version.
Thank you! Let me send a patch against the original code that
checks whether not all pages were pinned and fixes the counter
accordingly. Maybe you can go from there?
Thank you,
Bernard.
^ permalink raw reply [flat|nested] 20+ messages in thread
* [RFC PATCH 06/19] RDMA/usnic: convert to use vm_account
[not found] <cover.f52b9eb2792bccb8a9ecd6bc95055705cfe2ae03.1674538665.git-series.apopple@nvidia.com>
` (2 preceding siblings ...)
2023-01-24 5:42 ` [RFC PATCH 05/19] RMDA/siw: " Alistair Popple
@ 2023-01-24 5:42 ` Alistair Popple
2023-01-24 14:41 ` Jason Gunthorpe
2023-01-24 5:42 ` [RFC PATCH 10/19] net: skb: Switch to using vm_account Alistair Popple
4 siblings, 1 reply; 20+ messages in thread
From: Alistair Popple @ 2023-01-24 5:42 UTC (permalink / raw)
To: linux-mm, cgroups
Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
daniel, Alistair Popple, Christian Benvenuti, Nelson Escobar,
Jason Gunthorpe, Leon Romanovsky, linux-rdma
Convert to using a vm_account structure to account pinned memory to
both the mm and the pins cgroup.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Christian Benvenuti <benve@cisco.com>
Cc: Nelson Escobar <neescoba@cisco.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
drivers/infiniband/hw/usnic/usnic_uiom.c | 13 +++++--------
drivers/infiniband/hw/usnic/usnic_uiom.h | 1 +
2 files changed, 6 insertions(+), 8 deletions(-)
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index c301b3b..250276e 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -89,8 +89,6 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
struct page **page_list;
struct scatterlist *sg;
struct usnic_uiom_chunk *chunk;
- unsigned long locked;
- unsigned long lock_limit;
unsigned long cur_base;
unsigned long npages;
int ret;
@@ -123,10 +121,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
uiomr->owning_mm = mm = current->mm;
mmap_read_lock(mm);
- locked = atomic64_add_return(npages, &current->mm->pinned_vm);
- lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
- if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+ vm_account_init_current(&uiomr->vm_account);
+ if (vm_account_pinned(&uiomr->vm_account, npages)) {
ret = -ENOMEM;
goto out;
}
@@ -178,7 +174,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
out:
if (ret < 0) {
usnic_uiom_put_pages(chunk_list, 0);
- atomic64_sub(npages, &current->mm->pinned_vm);
+ vm_unaccount_pinned(&uiomr->vm_account, npages);
+ vm_account_release(&uiomr->vm_account);
} else
mmgrab(uiomr->owning_mm);
@@ -430,7 +427,7 @@ void usnic_uiom_reg_release(struct usnic_uiom_reg *uiomr)
{
__usnic_uiom_reg_release(uiomr->pd, uiomr, 1);
- atomic64_sub(usnic_uiom_num_pages(uiomr), &uiomr->owning_mm->pinned_vm);
+ vm_unaccount_pinned(&uiomr->vm_account, usnic_uiom_num_pages(uiomr));
__usnic_uiom_release_tail(uiomr);
}
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.h b/drivers/infiniband/hw/usnic/usnic_uiom.h
index 5a9acf9..5c296a7 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.h
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.h
@@ -72,6 +72,7 @@ struct usnic_uiom_reg {
struct list_head chunk_list;
struct work_struct work;
struct mm_struct *owning_mm;
+ struct vm_account vm_account;
};
struct usnic_uiom_chunk {
--
git-series 0.9.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 06/19] RDMA/usnic: convert to use vm_account
2023-01-24 5:42 ` [RFC PATCH 06/19] RDMA/usnic: convert " Alistair Popple
@ 2023-01-24 14:41 ` Jason Gunthorpe
2023-01-30 11:10 ` Alistair Popple
0 siblings, 1 reply; 20+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 14:41 UTC (permalink / raw)
To: Alistair Popple
Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
surenb, mkoutny, daniel, Christian Benvenuti, Nelson Escobar,
Leon Romanovsky, linux-rdma
On Tue, Jan 24, 2023 at 04:42:35PM +1100, Alistair Popple wrote:
> diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
> index c301b3b..250276e 100644
> --- a/drivers/infiniband/hw/usnic/usnic_uiom.c
> +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
> @@ -89,8 +89,6 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
> struct page **page_list;
> struct scatterlist *sg;
> struct usnic_uiom_chunk *chunk;
> - unsigned long locked;
> - unsigned long lock_limit;
> unsigned long cur_base;
> unsigned long npages;
> int ret;
> @@ -123,10 +121,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
> uiomr->owning_mm = mm = current->mm;
> mmap_read_lock(mm);
>
> - locked = atomic64_add_return(npages, &current->mm->pinned_vm);
> - lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> -
> - if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> + vm_account_init_current(&uiomr->vm_account);
> + if (vm_account_pinned(&uiomr->vm_account, npages)) {
> ret = -ENOMEM;
> goto out;
> }
Is this error handling right? This driver tried to avoid the race by
using atomic64_add_return() but it means that the out label undoes the add:
> @@ -178,7 +174,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
> out:
> if (ret < 0) {
> usnic_uiom_put_pages(chunk_list, 0);
> - atomic64_sub(npages, &current->mm->pinned_vm);
Here
> + vm_unaccount_pinned(&uiomr->vm_account, npages);
> + vm_account_release(&uiomr->vm_account);
But with the new API we shouldn't call vm_unaccount_pinned() if
vm_account_pinned() doesn't succeed?
Jason
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 06/19] RDMA/usnic: convert to use vm_account
2023-01-24 14:41 ` Jason Gunthorpe
@ 2023-01-30 11:10 ` Alistair Popple
0 siblings, 0 replies; 20+ messages in thread
From: Alistair Popple @ 2023-01-30 11:10 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
surenb, mkoutny, daniel, Christian Benvenuti, Nelson Escobar,
Leon Romanovsky, linux-rdma
Jason Gunthorpe <jgg@nvidia.com> writes:
> On Tue, Jan 24, 2023 at 04:42:35PM +1100, Alistair Popple wrote:
>> diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
>> index c301b3b..250276e 100644
>> --- a/drivers/infiniband/hw/usnic/usnic_uiom.c
>> +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
>> @@ -89,8 +89,6 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
>> struct page **page_list;
>> struct scatterlist *sg;
>> struct usnic_uiom_chunk *chunk;
>> - unsigned long locked;
>> - unsigned long lock_limit;
>> unsigned long cur_base;
>> unsigned long npages;
>> int ret;
>> @@ -123,10 +121,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
>> uiomr->owning_mm = mm = current->mm;
>> mmap_read_lock(mm);
>>
>> - locked = atomic64_add_return(npages, &current->mm->pinned_vm);
>> - lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> -
>> - if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
>> + vm_account_init_current(&uiomr->vm_account);
>> + if (vm_account_pinned(&uiomr->vm_account, npages)) {
>> ret = -ENOMEM;
>> goto out;
>> }
>
> Is this error handling right? This driver tried to avoid the race by
> using atomic64_add_return() but it means that the out label undoes the add:
>
>
>> @@ -178,7 +174,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
>> out:
>> if (ret < 0) {
>> usnic_uiom_put_pages(chunk_list, 0);
>> - atomic64_sub(npages, &current->mm->pinned_vm);
>
> Here
>
>> + vm_unaccount_pinned(&uiomr->vm_account, npages);
>> + vm_account_release(&uiomr->vm_account);
>
> But with the new API we shouldn't call vm_unaccount_pinned() if
> vm_account_pinned() doesn't succeed?
Good point. Will add the following fix:
@@ -123,6 +123,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
vm_account_init_current(&uiomr->vm_account);
if (vm_account_pinned(&uiomr->vm_account, npages)) {
+ npages = 0;
ret = -ENOMEM;
goto out;
}
>
> Jason
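[Editor's note] The effect of the `npages = 0` fix on the shared error path can be modeled in userspace. The helper names mirror the patch but are illustrative stand-ins, not the kernel implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_long pinned_vm;      /* models mm->pinned_vm */
static const long lock_limit = 4;  /* models RLIMIT_MEMLOCK, in pages */

/* Models vm_account_pinned(): on failure nothing stays charged. */
static bool vm_account_pinned(long npages)
{
    if (atomic_fetch_add(&pinned_vm, npages) + npages > lock_limit) {
        atomic_fetch_sub(&pinned_vm, npages);
        return false;
    }
    return true;
}

static void vm_unaccount_pinned(long npages)
{
    atomic_fetch_sub(&pinned_vm, npages);
}

/* Models usnic_uiom_get_pages() with the fix applied. */
static int get_pages(long want)
{
    long npages = want;
    int ret = 0;

    if (!vm_account_pinned(npages)) {
        npages = 0;  /* the fix: nothing was charged, so undo nothing */
        ret = -1;
        goto out;
    }
    /* ... pinning work that may also fail would go here ... */
out:
    if (ret < 0)
        vm_unaccount_pinned(npages);  /* safe: npages is 0 on a failed charge */
    return ret;
}
```

Without the `npages = 0` line, the failed-charge path would subtract pages that were never added, driving the counter negative.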
^ permalink raw reply [flat|nested] 20+ messages in thread
* [RFC PATCH 10/19] net: skb: Switch to using vm_account
[not found] <cover.f52b9eb2792bccb8a9ecd6bc95055705cfe2ae03.1674538665.git-series.apopple@nvidia.com>
` (3 preceding siblings ...)
2023-01-24 5:42 ` [RFC PATCH 06/19] RDMA/usnic: convert " Alistair Popple
@ 2023-01-24 5:42 ` Alistair Popple
2023-01-24 14:51 ` Jason Gunthorpe
4 siblings, 1 reply; 20+ messages in thread
From: Alistair Popple @ 2023-01-24 5:42 UTC (permalink / raw)
To: linux-mm, cgroups
Cc: linux-kernel, jgg, jhubbard, tjmercier, hannes, surenb, mkoutny,
daniel, Alistair Popple, netdev, linux-rdma, rds-devel
Switch to using vm_account to charge pinned pages. This will allow a
future change to charge the pinned pages to a cgroup to limit the
overall number of pinned pages in the system.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: rds-devel@oss.oracle.com
---
include/linux/skbuff.h | 6 ++---
include/net/sock.h | 2 ++-
net/core/skbuff.c | 47 +++++++++++++++----------------------------
net/rds/message.c | 9 +++++---
4 files changed, 28 insertions(+), 36 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4c84924..c956405 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -554,7 +554,6 @@ struct ubuf_info_msgzc {
};
struct mmpin {
- struct user_struct *user;
unsigned int num_pg;
} mmp;
};
@@ -563,8 +562,9 @@ struct ubuf_info_msgzc {
#define uarg_to_msgzc(ubuf_ptr) container_of((ubuf_ptr), struct ubuf_info_msgzc, \
ubuf)
-int mm_account_pinned_pages(struct mmpin *mmp, size_t size);
-void mm_unaccount_pinned_pages(struct mmpin *mmp);
+int mm_account_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp,
+ size_t size);
+void mm_unaccount_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp);
/* This data is invariant across clones and lives at
* the end of the header data, ie. at skb->end.
diff --git a/include/net/sock.h b/include/net/sock.h
index dcd72e6..bc3a868 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -334,6 +334,7 @@ struct sk_filter;
* @sk_security: used by security modules
* @sk_mark: generic packet mark
* @sk_cgrp_data: cgroup data for this cgroup
+ * @sk_vm_account: data for pinned memory accounting
* @sk_memcg: this socket's memory cgroup association
* @sk_write_pending: a write to stream socket waits to start
* @sk_state_change: callback to indicate change in the state of the sock
@@ -523,6 +524,7 @@ struct sock {
void *sk_security;
#endif
struct sock_cgroup_data sk_cgrp_data;
+ struct vm_account sk_vm_account;
struct mem_cgroup *sk_memcg;
void (*sk_state_change)(struct sock *sk);
void (*sk_data_ready)(struct sock *sk);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4a0eb55..bed3fc9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1309,42 +1309,25 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
}
EXPORT_SYMBOL_GPL(skb_morph);
-int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
+int mm_account_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp,
+ size_t size)
{
- unsigned long max_pg, num_pg, new_pg, old_pg;
- struct user_struct *user;
-
- if (capable(CAP_IPC_LOCK) || !size)
- return 0;
+ unsigned int num_pg;
num_pg = (size >> PAGE_SHIFT) + 2; /* worst case */
- max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
- user = mmp->user ? : current_user();
+ if (vm_account_pinned(vm_account, num_pg))
+ return -ENOBUFS;
- old_pg = atomic_long_read(&user->locked_vm);
- do {
- new_pg = old_pg + num_pg;
- if (new_pg > max_pg)
- return -ENOBUFS;
- } while (!atomic_long_try_cmpxchg(&user->locked_vm, &old_pg, new_pg));
-
- if (!mmp->user) {
- mmp->user = get_uid(user);
- mmp->num_pg = num_pg;
- } else {
- mmp->num_pg += num_pg;
- }
+ mmp->num_pg += num_pg;
return 0;
}
EXPORT_SYMBOL_GPL(mm_account_pinned_pages);
-void mm_unaccount_pinned_pages(struct mmpin *mmp)
+void mm_unaccount_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp)
{
- if (mmp->user) {
- atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
- free_uid(mmp->user);
- }
+ vm_unaccount_pinned(vm_account, mmp->num_pg);
+ vm_account_release(vm_account);
}
EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);
@@ -1361,9 +1344,12 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
uarg = (void *)skb->cb;
- uarg->mmp.user = NULL;
+ uarg->mmp.num_pg = 0;
+ vm_account_init(&sk->sk_vm_account, current,
+ current_user(), VM_ACCOUNT_USER);
- if (mm_account_pinned_pages(&uarg->mmp, size)) {
+ if (mm_account_pinned_pages(&sk->sk_vm_account, &uarg->mmp, size)) {
+ vm_account_release(&sk->sk_vm_account);
kfree_skb(skb);
return NULL;
}
@@ -1416,7 +1402,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
next = (u32)atomic_read(&sk->sk_zckey);
if ((u32)(uarg_zc->id + uarg_zc->len) == next) {
- if (mm_account_pinned_pages(&uarg_zc->mmp, size))
+ if (mm_account_pinned_pages(&sk->sk_vm_account,
+ &uarg_zc->mmp, size))
return NULL;
uarg_zc->len++;
uarg_zc->bytelen = bytelen;
@@ -1466,7 +1453,7 @@ static void __msg_zerocopy_callback(struct ubuf_info_msgzc *uarg)
u32 lo, hi;
u16 len;
- mm_unaccount_pinned_pages(&uarg->mmp);
+ mm_unaccount_pinned_pages(&sk->sk_vm_account, &uarg->mmp);
/* if !len, there was only 1 call, and it was aborted
* so do not queue a completion notification
diff --git a/net/rds/message.c b/net/rds/message.c
index b47e4f0..2138a70 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -99,7 +99,7 @@ static void rds_rm_zerocopy_callback(struct rds_sock *rs,
struct list_head *head;
unsigned long flags;
- mm_unaccount_pinned_pages(&znotif->z_mmp);
+ mm_unaccount_pinned_pages(&rs->rs_sk.sk_vm_account, &znotif->z_mmp);
q = &rs->rs_zcookie_queue;
spin_lock_irqsave(&q->lock, flags);
head = &q->zcookie_head;
@@ -367,6 +367,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
int ret = 0;
int length = iov_iter_count(from);
struct rds_msg_zcopy_info *info;
+ struct vm_account *vm_account = &rm->m_rs->rs_sk.sk_vm_account;
rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
@@ -380,7 +381,9 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
return -ENOMEM;
INIT_LIST_HEAD(&info->rs_zcookie_next);
rm->data.op_mmp_znotifier = &info->znotif;
- if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
+ vm_account_init(vm_account, current, current_user(), VM_ACCOUNT_USER);
+ if (mm_account_pinned_pages(vm_account,
+ &rm->data.op_mmp_znotifier->z_mmp,
length)) {
ret = -ENOMEM;
goto err;
@@ -399,7 +402,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
for (i = 0; i < rm->data.op_nents; i++)
put_page(sg_page(&rm->data.op_sg[i]));
mmp = &rm->data.op_mmp_znotifier->z_mmp;
- mm_unaccount_pinned_pages(mmp);
+ mm_unaccount_pinned_pages(vm_account, mmp);
ret = -EFAULT;
goto err;
}
--
git-series 0.9.1
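[Editor's note] The `num_pg = (size >> PAGE_SHIFT) + 2` line kept by this patch is a worst-case estimate: a buffer whose start and end are both page-unaligned touches up to two pages more than `size >> PAGE_SHIFT`. A small userspace check of that bound (the PAGE_SHIFT value is illustrative):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Exact number of pages spanned by [addr, addr + size). */
static unsigned long pages_spanned(unsigned long addr, unsigned long size)
{
    unsigned long first = addr >> PAGE_SHIFT;
    unsigned long last  = (addr + size - 1) >> PAGE_SHIFT;
    return last - first + 1;
}

/* The skbuff estimate: charge for the worst case up front. */
static unsigned long pages_worst_case(unsigned long size)
{
    return (size >> PAGE_SHIFT) + 2;
}
```

Over-charging is the safe direction for a limit check, and the same `num_pg` is later uncharged, so the books stay balanced.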
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 10/19] net: skb: Switch to using vm_account
2023-01-24 5:42 ` [RFC PATCH 10/19] net: skb: Switch to using vm_account Alistair Popple
@ 2023-01-24 14:51 ` Jason Gunthorpe
2023-01-30 11:17 ` Alistair Popple
0 siblings, 1 reply; 20+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 14:51 UTC (permalink / raw)
To: Alistair Popple
Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
surenb, mkoutny, daniel, netdev, linux-rdma, rds-devel
On Tue, Jan 24, 2023 at 04:42:39PM +1100, Alistair Popple wrote:
> diff --git a/include/net/sock.h b/include/net/sock.h
> index dcd72e6..bc3a868 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -334,6 +334,7 @@ struct sk_filter;
> * @sk_security: used by security modules
> * @sk_mark: generic packet mark
> * @sk_cgrp_data: cgroup data for this cgroup
> + * @sk_vm_account: data for pinned memory accounting
> * @sk_memcg: this socket's memory cgroup association
> * @sk_write_pending: a write to stream socket waits to start
> * @sk_state_change: callback to indicate change in the state of the sock
> @@ -523,6 +524,7 @@ struct sock {
> void *sk_security;
> #endif
> struct sock_cgroup_data sk_cgrp_data;
> + struct vm_account sk_vm_account;
> struct mem_cgroup *sk_memcg;
> void (*sk_state_change)(struct sock *sk);
> void (*sk_data_ready)(struct sock *sk);
I'm not sure this makes sense in a sock - each sock can be shared with
different processes.
> diff --git a/net/rds/message.c b/net/rds/message.c
> index b47e4f0..2138a70 100644
> --- a/net/rds/message.c
> +++ b/net/rds/message.c
> @@ -99,7 +99,7 @@ static void rds_rm_zerocopy_callback(struct rds_sock *rs,
> struct list_head *head;
> unsigned long flags;
>
> - mm_unaccount_pinned_pages(&znotif->z_mmp);
> + mm_unaccount_pinned_pages(&rs->rs_sk.sk_vm_account, &znotif->z_mmp);
> q = &rs->rs_zcookie_queue;
> spin_lock_irqsave(&q->lock, flags);
> head = &q->zcookie_head;
> @@ -367,6 +367,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
> int ret = 0;
> int length = iov_iter_count(from);
> struct rds_msg_zcopy_info *info;
> + struct vm_account *vm_account = &rm->m_rs->rs_sk.sk_vm_account;
>
> rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
>
> @@ -380,7 +381,9 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
> return -ENOMEM;
> INIT_LIST_HEAD(&info->rs_zcookie_next);
> rm->data.op_mmp_znotifier = &info->znotif;
> - if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
> + vm_account_init(vm_account, current, current_user(), VM_ACCOUNT_USER);
> + if (mm_account_pinned_pages(vm_account,
> + &rm->data.op_mmp_znotifier->z_mmp,
> length)) {
> ret = -ENOMEM;
> goto err;
> @@ -399,7 +402,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
> for (i = 0; i < rm->data.op_nents; i++)
> put_page(sg_page(&rm->data.op_sg[i]));
> mmp = &rm->data.op_mmp_znotifier->z_mmp;
> - mm_unaccount_pinned_pages(mmp);
> + mm_unaccount_pinned_pages(vm_account, mmp);
> ret = -EFAULT;
> goto err;
> }
I wonder if RDS should just not be doing accounting? Usually things
related to iov_iter are short term and we don't account for them.
But then I don't really know how RDS works, Santos?
Regardless, maybe the vm_account should be stored in the
rds_msg_zcopy_info ?
Jason
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 10/19] net: skb: Switch to using vm_account
2023-01-24 14:51 ` Jason Gunthorpe
@ 2023-01-30 11:17 ` Alistair Popple
2023-02-06 4:36 ` Alistair Popple
0 siblings, 1 reply; 20+ messages in thread
From: Alistair Popple @ 2023-01-30 11:17 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
surenb, mkoutny, daniel, netdev, linux-rdma, rds-devel
Jason Gunthorpe <jgg@nvidia.com> writes:
> On Tue, Jan 24, 2023 at 04:42:39PM +1100, Alistair Popple wrote:
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index dcd72e6..bc3a868 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -334,6 +334,7 @@ struct sk_filter;
>> * @sk_security: used by security modules
>> * @sk_mark: generic packet mark
>> * @sk_cgrp_data: cgroup data for this cgroup
>> + * @sk_vm_account: data for pinned memory accounting
>> * @sk_memcg: this socket's memory cgroup association
>> * @sk_write_pending: a write to stream socket waits to start
>> * @sk_state_change: callback to indicate change in the state of the sock
>> @@ -523,6 +524,7 @@ struct sock {
>> void *sk_security;
>> #endif
>> struct sock_cgroup_data sk_cgrp_data;
>> + struct vm_account sk_vm_account;
>> struct mem_cgroup *sk_memcg;
>> void (*sk_state_change)(struct sock *sk);
>> void (*sk_data_ready)(struct sock *sk);
>
> I'm not sure this makes sense in a sock - each sock can be shared with
> different processes.
TBH it didn't feel right to me either so was hoping for some
feedback. Will try your suggestion below.
>> diff --git a/net/rds/message.c b/net/rds/message.c
>> index b47e4f0..2138a70 100644
>> --- a/net/rds/message.c
>> +++ b/net/rds/message.c
>> @@ -99,7 +99,7 @@ static void rds_rm_zerocopy_callback(struct rds_sock *rs,
>> struct list_head *head;
>> unsigned long flags;
>>
>> - mm_unaccount_pinned_pages(&znotif->z_mmp);
>> + mm_unaccount_pinned_pages(&rs->rs_sk.sk_vm_account, &znotif->z_mmp);
>> q = &rs->rs_zcookie_queue;
>> spin_lock_irqsave(&q->lock, flags);
>> head = &q->zcookie_head;
>> @@ -367,6 +367,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>> int ret = 0;
>> int length = iov_iter_count(from);
>> struct rds_msg_zcopy_info *info;
>> + struct vm_account *vm_account = &rm->m_rs->rs_sk.sk_vm_account;
>>
>> rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
>>
>> @@ -380,7 +381,9 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>> return -ENOMEM;
>> INIT_LIST_HEAD(&info->rs_zcookie_next);
>> rm->data.op_mmp_znotifier = &info->znotif;
>> - if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
>> + vm_account_init(vm_account, current, current_user(), VM_ACCOUNT_USER);
>> + if (mm_account_pinned_pages(vm_account,
>> + &rm->data.op_mmp_znotifier->z_mmp,
>> length)) {
>> ret = -ENOMEM;
>> goto err;
>> @@ -399,7 +402,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>> for (i = 0; i < rm->data.op_nents; i++)
>> put_page(sg_page(&rm->data.op_sg[i]));
>> mmp = &rm->data.op_mmp_znotifier->z_mmp;
>> - mm_unaccount_pinned_pages(mmp);
>> + mm_unaccount_pinned_pages(vm_account, mmp);
>> ret = -EFAULT;
>> goto err;
>> }
>
> I wonder if RDS should just not be doing accounting? Usually things
> related to iov_iter are short term and we don't account for them.
Yeah, I couldn't easily figure out why these were accounted for in the
first place either.
> But then I don't really know how RDS works, Santos?
>
> Regardless, maybe the vm_account should be stored in the
> rds_msg_zcopy_info ?
On first glance that looks like a better spot. Thanks for the
idea.
> Jason
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 10/19] net: skb: Switch to using vm_account
2023-01-30 11:17 ` Alistair Popple
@ 2023-02-06 4:36 ` Alistair Popple
2023-02-06 13:14 ` Jason Gunthorpe
0 siblings, 1 reply; 20+ messages in thread
From: Alistair Popple @ 2023-02-06 4:36 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
surenb, mkoutny, daniel, netdev, linux-rdma, rds-devel
Alistair Popple <apopple@nvidia.com> writes:
> Jason Gunthorpe <jgg@nvidia.com> writes:
>
>> On Tue, Jan 24, 2023 at 04:42:39PM +1100, Alistair Popple wrote:
>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>> index dcd72e6..bc3a868 100644
>>> --- a/include/net/sock.h
>>> +++ b/include/net/sock.h
>>> @@ -334,6 +334,7 @@ struct sk_filter;
>>> * @sk_security: used by security modules
>>> * @sk_mark: generic packet mark
>>> * @sk_cgrp_data: cgroup data for this cgroup
>>> + * @sk_vm_account: data for pinned memory accounting
>>> * @sk_memcg: this socket's memory cgroup association
>>> * @sk_write_pending: a write to stream socket waits to start
>>> * @sk_state_change: callback to indicate change in the state of the sock
>>> @@ -523,6 +524,7 @@ struct sock {
>>> void *sk_security;
>>> #endif
>>> struct sock_cgroup_data sk_cgrp_data;
>>> + struct vm_account sk_vm_account;
>>> struct mem_cgroup *sk_memcg;
>>> void (*sk_state_change)(struct sock *sk);
>>> void (*sk_data_ready)(struct sock *sk);
>>
>> I'm not sure this makes sense in a sock - each sock can be shared with
>> different processes.
>
> TBH it didn't feel right to me either so was hoping for some
> feedback. Will try your suggestion below.
>
>>> diff --git a/net/rds/message.c b/net/rds/message.c
>>> index b47e4f0..2138a70 100644
>>> --- a/net/rds/message.c
>>> +++ b/net/rds/message.c
>>> @@ -99,7 +99,7 @@ static void rds_rm_zerocopy_callback(struct rds_sock *rs,
>>> struct list_head *head;
>>> unsigned long flags;
>>>
>>> - mm_unaccount_pinned_pages(&znotif->z_mmp);
>>> + mm_unaccount_pinned_pages(&rs->rs_sk.sk_vm_account, &znotif->z_mmp);
>>> q = &rs->rs_zcookie_queue;
>>> spin_lock_irqsave(&q->lock, flags);
>>> head = &q->zcookie_head;
>>> @@ -367,6 +367,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>> int ret = 0;
>>> int length = iov_iter_count(from);
>>> struct rds_msg_zcopy_info *info;
>>> + struct vm_account *vm_account = &rm->m_rs->rs_sk.sk_vm_account;
>>>
>>> rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
>>>
>>> @@ -380,7 +381,9 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>> return -ENOMEM;
>>> INIT_LIST_HEAD(&info->rs_zcookie_next);
>>> rm->data.op_mmp_znotifier = &info->znotif;
>>> - if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
>>> + vm_account_init(vm_account, current, current_user(), VM_ACCOUNT_USER);
>>> + if (mm_account_pinned_pages(vm_account,
>>> + &rm->data.op_mmp_znotifier->z_mmp,
>>> length)) {
>>> ret = -ENOMEM;
>>> goto err;
>>> @@ -399,7 +402,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>>> for (i = 0; i < rm->data.op_nents; i++)
>>> put_page(sg_page(&rm->data.op_sg[i]));
>>> mmp = &rm->data.op_mmp_znotifier->z_mmp;
>>> - mm_unaccount_pinned_pages(mmp);
>>> + mm_unaccount_pinned_pages(vm_account, mmp);
>>> ret = -EFAULT;
>>> goto err;
>>> }
>>
>> I wonder if RDS should just not be doing accounting? Usually things
>> related to iov_iter are short term and we don't account for them.
>
> Yeah, I couldn't easily figure out why these were accounted for in the
> first place either.
>
>> But then I don't really know how RDS works, Santos?
>>
>> Regardless, maybe the vm_account should be stored in the
>> rds_msg_zcopy_info ?
>
> On first glance that looks like a better spot. Thanks for the
> idea.
That works fine for RDS but not for skbuff. We still need a vm_account
in the struct sock or somewhere else for that. For example in
msg_zerocopy_realloc() we only have a struct ubuf_info_msgzc
available. We can't add a struct vm_account field to that because
ultimately it is stored in struct sk_buff->cb[] which is not large
enough to contain ubuf_info_msgzc + vm_account.
I'm not terribly familiar with kernel networking code though, so happy
to hear other suggestions.
>> Jason
* Re: [RFC PATCH 10/19] net: skb: Switch to using vm_account
2023-02-06 4:36 ` Alistair Popple
@ 2023-02-06 13:14 ` Jason Gunthorpe
0 siblings, 0 replies; 20+ messages in thread
From: Jason Gunthorpe @ 2023-02-06 13:14 UTC (permalink / raw)
To: Alistair Popple
Cc: linux-mm, cgroups, linux-kernel, jhubbard, tjmercier, hannes,
surenb, mkoutny, daniel, netdev, linux-rdma, rds-devel
On Mon, Feb 06, 2023 at 03:36:49PM +1100, Alistair Popple wrote:
> >> But then I don't really know how RDS works, Santos?
> >>
> >> Regardless, maybe the vm_account should be stored in the
> >> rds_msg_zcopy_info ?
> >
> > On first glance that looks like a better spot. Thanks for the
> > idea.
>
> That works fine for RDS but not for skbuff.
I would definitely put the RDS stuff like that..
> We still need a vm_account in the struct sock or somewhere else for
> that. For example in msg_zerocopy_realloc() we only have a struct
> ubuf_info_msgzc available. We can't add a struct vm_account field to
> that because ultimately it is stored in struct sk_buff->cb[] which
> is not large enough to contain ubuf_info_msgzc + vm_account.
Well, AFAICT this is using iov_iter to get the pages and in general
iov_iter - eg as used for O_DIRECT - doesn't charge anything.
If this does somehow allow userspace to hold a pin on a page for a
long time then it is already technically wrong because it doesn't use
FOLL_LONGTERM.
Arguably FOLL_LONGTERM should be the key precondition to require
accounting.
So I wonder if it should just be deleted?
Jason