* [RFC PATCH] mm: Add helpers for locked_vm
@ 2014-07-30 9:28 Alexey Kardashevskiy
2014-07-30 10:12 ` Peter Zijlstra
2014-07-30 10:31 ` Davidlohr Bueso
0 siblings, 2 replies; 7+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-30 9:28 UTC
To: Alexey Kardashevskiy
Cc: Benjamin Herrenschmidt, Andrew Morton, Kirill A . Shutemov,
Rik van Riel, Peter Zijlstra, Mel Gorman, Johannes Weiner,
Andrea Arcangeli, Sasha Levin, Wanpeng Li, Vlastimil Babka,
Jörn Engel, Paul E . McKenney, Davidlohr Bueso, linux-mm,
linux-kernel, Alex Williamson, Alexander Graf, Michael Ellerman
This adds 2 helpers to change the locked_vm counter:
- try_increment_locked_vm - may fail if new locked_vm value will be greater
than the RLIMIT_MEMLOCK limit;
- decrement_locked_vm.
These will be used by drivers capable of locking memory by userspace
request. For example, VFIO can use it to check if it can lock DMA memory
or PPC-KVM can use it to check if it can lock memory for TCE tables.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
include/linux/mm.h | 3 +++
mm/mlock.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e03dd29..1cb219d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2113,5 +2113,8 @@ void __init setup_nr_node_ids(void);
static inline void setup_nr_node_ids(void) {}
#endif
+extern long try_increment_locked_vm(long npages);
+extern void decrement_locked_vm(long npages);
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/mm/mlock.c b/mm/mlock.c
index b1eb536..39e4b55 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -864,3 +864,52 @@ void user_shm_unlock(size_t size, struct user_struct *user)
spin_unlock(&shmlock_user_lock);
free_uid(user);
}
+
+/**
+ * try_increment_locked_vm() - checks if new locked_vm value is going to
+ * be less than RLIMIT_MEMLOCK and increments it by npages if it is.
+ *
+ * @npages: the number of pages to add to locked_vm.
+ *
+ * Returns 0 if succeeded or negative value if failed.
+ */
+long try_increment_locked_vm(long npages)
+{
+	long ret = 0, locked, lock_limit;
+
+	if (!current || !current->mm)
+		return -ESRCH; /* process exited */
+
+	down_write(&current->mm->mmap_sem);
+	locked = current->mm->locked_vm + npages;
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+		pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
+				rlimit(RLIMIT_MEMLOCK));
+		ret = -ENOMEM;
+	} else {
+		current->mm->locked_vm += npages;
+	}
+	up_write(&current->mm->mmap_sem);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(try_increment_locked_vm);
+
+/**
+ * decrement_locked_vm() - decrements the current task's locked_vm counter.
+ *
+ * @npages: the number to decrement by.
+ */
+void decrement_locked_vm(long npages)
+{
+	if (!current || !current->mm)
+		return; /* process exited */
+
+	down_write(&current->mm->mmap_sem);
+	if (npages > current->mm->locked_vm)
+		npages = current->mm->locked_vm;
+	current->mm->locked_vm -= npages;
+	up_write(&current->mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(decrement_locked_vm);
--
2.0.0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [RFC PATCH] mm: Add helpers for locked_vm
2014-07-30 9:28 [RFC PATCH] mm: Add helpers for locked_vm Alexey Kardashevskiy
@ 2014-07-30 10:12 ` Peter Zijlstra
2014-07-30 10:31 ` Davidlohr Bueso
1 sibling, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2014-07-30 10:12 UTC
To: Alexey Kardashevskiy
Cc: Benjamin Herrenschmidt, Andrew Morton, Kirill A . Shutemov,
Rik van Riel, Mel Gorman, Johannes Weiner, Andrea Arcangeli,
Sasha Levin, Wanpeng Li, Vlastimil Babka, Jörn Engel,
Paul E . McKenney, Davidlohr Bueso, linux-mm, linux-kernel,
Alex Williamson, Alexander Graf, Michael Ellerman
On Wed, Jul 30, 2014 at 07:28:13PM +1000, Alexey Kardashevskiy wrote:
> This adds 2 helpers to change the locked_vm counter:
> - try_increment_locked_vm - may fail if new locked_vm value will be greater
> than the RLIMIT_MEMLOCK limit;
> - decrement_locked_vm.
>
> These will be used by drivers capable of locking memory by userspace
> request. For example, VFIO can use it to check if it can lock DMA memory
> or PPC-KVM can use it to check if it can lock memory for TCE tables.
Urgh no.. have a look here:
lkml.kernel.org/r/20140526145605.016140154@infradead.org
Someone needs to go help with the IB stuff, I got properly lost in
there.
* Re: [RFC PATCH] mm: Add helpers for locked_vm
2014-07-30 9:28 [RFC PATCH] mm: Add helpers for locked_vm Alexey Kardashevskiy
2014-07-30 10:12 ` Peter Zijlstra
@ 2014-07-30 10:31 ` Davidlohr Bueso
2014-07-30 12:30 ` Alexey Kardashevskiy
2014-08-01 10:04 ` Benjamin Herrenschmidt
1 sibling, 2 replies; 7+ messages in thread
From: Davidlohr Bueso @ 2014-07-30 10:31 UTC
To: Alexey Kardashevskiy
Cc: Benjamin Herrenschmidt, Andrew Morton, Kirill A . Shutemov,
Rik van Riel, Peter Zijlstra, Mel Gorman, Johannes Weiner,
Andrea Arcangeli, Sasha Levin, Wanpeng Li, Vlastimil Babka,
Jörn Engel, Paul E . McKenney, linux-mm, linux-kernel,
Alex Williamson, Alexander Graf, Michael Ellerman
On Wed, 2014-07-30 at 19:28 +1000, Alexey Kardashevskiy wrote:
> This adds 2 helpers to change the locked_vm counter:
> - try_increment_locked_vm - may fail if new locked_vm value will be greater
> than the RLIMIT_MEMLOCK limit;
> - decrement_locked_vm.
>
> These will be used by drivers capable of locking memory by userspace
> request. For example, VFIO can use it to check if it can lock DMA memory
> or PPC-KVM can use it to check if it can lock memory for TCE tables.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> include/linux/mm.h | 3 +++
> mm/mlock.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 52 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e03dd29..1cb219d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2113,5 +2113,8 @@ void __init setup_nr_node_ids(void);
> static inline void setup_nr_node_ids(void) {}
> #endif
>
> +extern long try_increment_locked_vm(long npages);
> +extern void decrement_locked_vm(long npages);
> +
> #endif /* __KERNEL__ */
> #endif /* _LINUX_MM_H */
> diff --git a/mm/mlock.c b/mm/mlock.c
> index b1eb536..39e4b55 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -864,3 +864,52 @@ void user_shm_unlock(size_t size, struct user_struct *user)
> spin_unlock(&shmlock_user_lock);
> free_uid(user);
> }
> +
> +/**
> + * try_increment_locked_vm() - checks if new locked_vm value is going to
> + * be less than RLIMIT_MEMLOCK and increments it by npages if it is.
> + *
> + * @npages: the number of pages to add to locked_vm.
> + *
> + * Returns 0 if succeeded or negative value if failed.
> + */
> +long try_increment_locked_vm(long npages)
mlock calls work at an address granularity...
> +{
> + long ret = 0, locked, lock_limit;
> +
> + if (!current || !current->mm)
> + return -ESRCH; /* process exited */
It doesn't strike me that this is the place for this. It would seem that
it would be the caller's responsibility to make sure of this (and not
sure how !current can happen...).
> +
> + down_write(&current->mm->mmap_sem);
> + locked = current->mm->locked_vm + npages;
> + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
nit: please set locked and lock_limit before taking the mmap_sem.
> + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> + pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
> + rlimit(RLIMIT_MEMLOCK));
> + ret = -ENOMEM;
> + } else {
It would be nicer to have it the other way around, leave the #else for
ENOMEM. It reads better, imho.
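A sketch of the suggested restructuring, written as a userspace model so that it compiles on its own: the success path comes first and the else branch carries the -ENOMEM failure. `struct counter_model`, `try_increment_model()`, and the `privileged` flag (standing in for `capable(CAP_IPC_LOCK)`) are assumptions made for this example only, and the mmap_sem locking is left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the fields the helper touches. */
struct counter_model {
	long locked_vm;   /* pages currently accounted as locked */
	long lock_limit;  /* RLIMIT_MEMLOCK expressed in pages */
	bool privileged;  /* models capable(CAP_IPC_LOCK) */
};

/* Same decision as the patch, with the branches swapped. */
static long try_increment_model(struct counter_model *mm, long npages)
{
	long ret = 0;
	long locked = mm->locked_vm + npages;

	if (locked <= mm->lock_limit || mm->privileged) {
		/* Success path first: commit the new value. */
		mm->locked_vm = locked;
	} else {
		/* Failure path: the counter is left untouched. */
		ret = -ENOMEM;
	}

	return ret;
}
```

The condition is the logical negation of the one in the patch, so the behaviour is meant to be unchanged; only the reading order differs.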
> + current->mm->locked_vm += npages;
More importantly just setting locked_vm is not enough. You'll need to
call do_mlock() here (again, addr granularity ;). This also applies to
your decrement_locked_vm().
Thanks,
Davidlohr
* Re: [RFC PATCH] mm: Add helpers for locked_vm
2014-07-30 10:31 ` Davidlohr Bueso
@ 2014-07-30 12:30 ` Alexey Kardashevskiy
2014-07-30 12:47 ` Peter Zijlstra
2014-08-01 10:04 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 7+ messages in thread
From: Alexey Kardashevskiy @ 2014-07-30 12:30 UTC
To: Davidlohr Bueso
Cc: Benjamin Herrenschmidt, Andrew Morton, Kirill A . Shutemov,
Rik van Riel, Peter Zijlstra, Mel Gorman, Johannes Weiner,
Andrea Arcangeli, Sasha Levin, Wanpeng Li, Vlastimil Babka,
Jörn Engel, Paul E . McKenney, linux-mm, linux-kernel,
Alex Williamson, Alexander Graf, Michael Ellerman
On 07/30/2014 08:31 PM, Davidlohr Bueso wrote:
> On Wed, 2014-07-30 at 19:28 +1000, Alexey Kardashevskiy wrote:
>> This adds 2 helpers to change the locked_vm counter:
>> - try_increment_locked_vm - may fail if new locked_vm value will be greater
>> than the RLIMIT_MEMLOCK limit;
>> - decrement_locked_vm.
>>
>> These will be used by drivers capable of locking memory by userspace
>> request. For example, VFIO can use it to check if it can lock DMA memory
>> or PPC-KVM can use it to check if it can lock memory for TCE tables.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> include/linux/mm.h | 3 +++
>> mm/mlock.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 52 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index e03dd29..1cb219d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2113,5 +2113,8 @@ void __init setup_nr_node_ids(void);
>> static inline void setup_nr_node_ids(void) {}
>> #endif
>>
>> +extern long try_increment_locked_vm(long npages);
>> +extern void decrement_locked_vm(long npages);
>> +
>> #endif /* __KERNEL__ */
>> #endif /* _LINUX_MM_H */
>> diff --git a/mm/mlock.c b/mm/mlock.c
>> index b1eb536..39e4b55 100644
>> --- a/mm/mlock.c
>> +++ b/mm/mlock.c
>> @@ -864,3 +864,52 @@ void user_shm_unlock(size_t size, struct user_struct *user)
>> spin_unlock(&shmlock_user_lock);
>> free_uid(user);
>> }
>> +
>> +/**
>> + * try_increment_locked_vm() - checks if new locked_vm value is going to
>> + * be less than RLIMIT_MEMLOCK and increments it by npages if it is.
>> + *
>> + * @npages: the number of pages to add to locked_vm.
>> + *
>> + * Returns 0 if succeeded or negative value if failed.
>> + */
>> +long try_increment_locked_vm(long npages)
>
> mlock calls work at an address granularity...
>
>> +{
>> + long ret = 0, locked, lock_limit;
>> +
>> + if (!current || !current->mm)
>> + return -ESRCH; /* process exited */
>
> It doesn't strike me that this is the place for this. It would seem that
> it would be the caller's responsibility to make sure of this (and not
> sure how !current can happen...).
>
>> +
>> + down_write(&current->mm->mmap_sem);
>> + locked = current->mm->locked_vm + npages;
>> + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>
> nit: please set locked and lock_limit before taking the mmap_sem.
>
>> + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
>> + pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
>> + rlimit(RLIMIT_MEMLOCK));
>> + ret = -ENOMEM;
>> + } else {
>
> It would be nicer to have it the other way around, leave the #else for
> ENOMEM. It reads better, imho.
>
>> + current->mm->locked_vm += npages;
>
> More importantly just setting locked_vm is not enough. You'll need to
> call do_mlock() here (again, addr granularity ;). This also applies to
> your decrement_locked_vm().
Uff. Bad commit log :(
No, this is not my intention here. Here I only want to increment the counter.
The whole problem is this: with VFIO (PCI passthrough), a guest that gets a
real PCI device wants to use some of its RAM for DMA, so we need to pin that
memory. The PPC64-pseries specifics are:
1. only part of guest RAM can be used for DMA, the so-called "window", and we
do not know in advance which part of guest RAM has to be pinned; the window
is never guaranteed to have a specific size such as "whole guest RAM", and even
if we wanted to pin the entire guest RAM, we could not, because we do not
know the guest's RAM size if it is not KVM;
2. we could do this counting and locking in real time, but this is not
possible from real mode (MMU off) and would slow things down.
So the trick is that we do not let the guest (QEMU in full emulation or KVM,
it does not matter here) use VFIO at all if it cannot increment the locked_vm
counter in advance. No locking needs to be done at the moment the guest
starts.
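
A minimal userspace sketch of the scheme described above, assuming a simplified model: the whole DMA window is charged against the limit once, before the guest may use VFIO, and the mapping fast path then does no accounting at all. `struct mm_model`, `window_enable()`, `window_map()`, and `window_disable()` are hypothetical names for illustration; they are not kernel or VFIO APIs, and the CAP_IPC_LOCK override is omitted.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the accounting state of one process. */
struct mm_model {
	long locked_vm;     /* pages currently accounted as locked */
	long lock_limit;    /* RLIMIT_MEMLOCK expressed in pages */
	bool window_on;     /* set once the window has been pre-accounted */
	long window_pages;  /* size of the pre-accounted window, in pages */
};

/* Charge the whole window up front; refuse to enable it if over the limit. */
static int window_enable(struct mm_model *mm, long window_pages)
{
	if (mm->locked_vm + window_pages > mm->lock_limit)
		return -1;  /* the guest may not use the window at all */

	mm->locked_vm += window_pages;
	mm->window_pages = window_pages;
	mm->window_on = true;
	return 0;
}

/* Fast path: no counter update, only a bounds check against the window. */
static int window_map(const struct mm_model *mm, long page_index)
{
	if (!mm->window_on || page_index < 0 || page_index >= mm->window_pages)
		return -1;
	return 0;
}

/* Release the charge when the window is torn down. */
static void window_disable(struct mm_model *mm)
{
	if (!mm->window_on)
		return;

	mm->locked_vm -= mm->window_pages;
	mm->window_pages = 0;
	mm->window_on = false;
}
```

In this model the counter reflects the maximum that can be mapped through the window, not the pages that are mapped at any given moment.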
--
Alexey
* Re: [RFC PATCH] mm: Add helpers for locked_vm
2014-07-30 12:30 ` Alexey Kardashevskiy
@ 2014-07-30 12:47 ` Peter Zijlstra
2014-08-01 10:08 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2014-07-30 12:47 UTC
To: Alexey Kardashevskiy
Cc: Davidlohr Bueso, Benjamin Herrenschmidt, Andrew Morton,
Kirill A . Shutemov, Rik van Riel, Mel Gorman, Johannes Weiner,
Andrea Arcangeli, Sasha Levin, Wanpeng Li, Vlastimil Babka,
Jörn Engel, Paul E . McKenney, linux-mm, linux-kernel,
Alex Williamson, Alexander Graf, Michael Ellerman
On Wed, Jul 30, 2014 at 10:30:48PM +1000, Alexey Kardashevskiy wrote:
>
> No, this is not my intention here. Here I only want to increment the counter.
Full and hard nack on that. It should always be tied to actual pages, we
should not detach this and make it 'a number'.
* Re: [RFC PATCH] mm: Add helpers for locked_vm
2014-07-30 12:47 ` Peter Zijlstra
@ 2014-08-01 10:08 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 7+ messages in thread
From: Benjamin Herrenschmidt @ 2014-08-01 10:08 UTC
To: Peter Zijlstra
Cc: Alexey Kardashevskiy, Davidlohr Bueso, Andrew Morton,
Kirill A . Shutemov, Rik van Riel, Mel Gorman, Johannes Weiner,
Andrea Arcangeli, Sasha Levin, Wanpeng Li, Vlastimil Babka,
Jörn Engel, Paul E . McKenney, linux-mm, linux-kernel,
Alex Williamson, Alexander Graf, Michael Ellerman
On Wed, 2014-07-30 at 14:47 +0200, Peter Zijlstra wrote:
> On Wed, Jul 30, 2014 at 10:30:48PM +1000, Alexey Kardashevskiy wrote:
> >
> > No, this is not my intention here. Here I only want to increment the counter.
>
> Full and hard nack on that. It should always be tied to actual pages, we
> should not detach this and make it 'a number'.
But this is the only way. We *cannot* go through the whole per-page
locking logic every time the guest puts a translation into the IOMMU;
this would completely kill guest performance for pass-through devices.
Worse, for performance reasons, because populating the IOMMU is a hypercall,
we want to do it in "real mode" (a special MMU-off environment) where we
cannot rely on most normal kernel services such as normal locks, vmalloc
space isn't accessible, etc.
So we don't have a choice. Either we let guests randomly pin arbitrary
amounts of system memory, or we have a way to predictively account for
the maximum that *can* be mapped/pinned in the iommu table to enable
the fast path.
Another problem with the mlock logic is that it doesn't refcount how
many times a page has been locked, while the guest can map a given page
multiple times in the IOMMU.
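
To illustrate the refcounting point, here is a small userspace sketch, assuming a hypothetical per-page pin count kept next to the accounting counter. The counter changes only on the first pin and the last unpin of a page, so mapping the same page several times is charged once. `struct pin_model`, `pin_page()`, and `unpin_page()` are made-up names for this example and are not existing kernel functions.

```c
#define NPAGES 8

/* Hypothetical model: per-page pin counts plus one accounting counter. */
struct pin_model {
	unsigned int pincount[NPAGES];  /* how many times each page is mapped */
	long locked_vm;                 /* number of distinct pinned pages */
};

/* Charge the page only on the 0 -> 1 transition of its pin count. */
static void pin_page(struct pin_model *m, unsigned int pfn)
{
	if (pfn >= NPAGES)
		return;
	if (m->pincount[pfn]++ == 0)
		m->locked_vm++;
}

/* Uncharge only on the 1 -> 0 transition; ignore unbalanced unpins. */
static void unpin_page(struct pin_model *m, unsigned int pfn)
{
	if (pfn >= NPAGES || m->pincount[pfn] == 0)
		return;
	if (--m->pincount[pfn] == 0)
		m->locked_vm--;
}
```

Without the per-page count, a second mapping of the same page would either be charged twice or the first unmap would drop the charge while the page is still mapped.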
Ben.
* Re: [RFC PATCH] mm: Add helpers for locked_vm
2014-07-30 10:31 ` Davidlohr Bueso
2014-07-30 12:30 ` Alexey Kardashevskiy
@ 2014-08-01 10:04 ` Benjamin Herrenschmidt
1 sibling, 0 replies; 7+ messages in thread
From: Benjamin Herrenschmidt @ 2014-08-01 10:04 UTC
To: Davidlohr Bueso
Cc: Alexey Kardashevskiy, Andrew Morton, Kirill A . Shutemov,
Rik van Riel, Peter Zijlstra, Mel Gorman, Johannes Weiner,
Andrea Arcangeli, Sasha Levin, Wanpeng Li, Vlastimil Babka,
Jörn Engel, Paul E . McKenney, linux-mm, linux-kernel,
Alex Williamson, Alexander Graf, Michael Ellerman
On Wed, 2014-07-30 at 03:31 -0700, Davidlohr Bueso wrote:
> It doesn't strike me that this is the place for this. It would seem that
> it would be the caller's responsibility to make sure of this (and not
> sure how !current can happen...).
>
> > +
> > + down_write(&current->mm->mmap_sem);
> > + locked = current->mm->locked_vm + npages;
> > + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>
> nit: please set locked and lock_limit before taking the mmap_sem.
Won't it be racy to read current->mm->locked_vm without the sem ?
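
The concern can be shown with a deterministic userspace model, assuming two callers that each sample the counter before the lock is taken, as the quoted nit suggests. Both snapshots pass the limit check, and applying both increments pushes the counter past the limit. All names here are hypothetical, and the lock itself is only implied by the order of the calls.

```c
#include <stdbool.h>

/* Hypothetical shared state normally protected by mmap_sem. */
struct shared_counter {
	long locked_vm;
	long lock_limit;
};

/* Step 1, done outside the lock: compute the would-be new value. */
static long snapshot_new_value(const struct shared_counter *c, long npages)
{
	return c->locked_vm + npages;
}

/* Step 2, done under the lock: check the stale snapshot, then add. */
static bool commit_with_stale_check(struct shared_counter *c,
				    long stale_locked, long npages)
{
	if (stale_locked > c->lock_limit)
		return false;
	c->locked_vm += npages;
	return true;
}

/* Safe variant: read, check, and update inside one critical section. */
static bool commit_with_fresh_check(struct shared_counter *c, long npages)
{
	if (c->locked_vm + npages > c->lock_limit)
		return false;
	c->locked_vm += npages;
	return true;
}
```

With a limit of 10 pages and two callers asking for 8 pages each, taking both snapshots first lets both commits succeed and leaves the counter at 16, while the fresh check rejects the second caller.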
> > + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> > + pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
> > + rlimit(RLIMIT_MEMLOCK));
> > + ret = -ENOMEM;
> > + } else {
>
> It would be nicer to have it the other way around, leave the #else for
> ENOMEM. It reads better, imho.
>
> > + current->mm->locked_vm += npages;
>
> More importantly just setting locked_vm is not enough. You'll need to
> call do_mlock() here (again, addr granularity ;). This also applies to
> your decrement_locked_vm().
Do we need to actually do mlock? Basically this is VFIO doing
get_user_pages on a pile of guest/user memory; we are trying to account
for it, but I don't think we need the whole mlock business on top of it.
Also, address granularity cannot work. We basically predictively account
how much the guest can lock, but we won't know how much it actually
locks until it actually does DMA mappings, which is a fairly fast path.
In some cases, I think (Alexey, correct me if I'm wrong), we are trying
to account for kernel memory allocated on behalf of the guest, which is
not necessarily mapped as normal VMAs, it's mostly a way to prevent
a stray KVM/qemu guest from causing the kernel to allocate a ton of
pinned memory by accounting it as part of the locked memory limits.
Ben.
> Thanks,
> Davidlohr
Thread overview: 7+ messages
2014-07-30 9:28 [RFC PATCH] mm: Add helpers for locked_vm Alexey Kardashevskiy
2014-07-30 10:12 ` Peter Zijlstra
2014-07-30 10:31 ` Davidlohr Bueso
2014-07-30 12:30 ` Alexey Kardashevskiy
2014-07-30 12:47 ` Peter Zijlstra
2014-08-01 10:08 ` Benjamin Herrenschmidt
2014-08-01 10:04 ` Benjamin Herrenschmidt