From: Stefan Roesch <shr@devkernel.io>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com, linux-mm@kvack.org, riel@surriel.com,
	mhocko@suse.com, david@redhat.com,
	linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
	akpm@linux-foundation.org
Subject: Re: [PATCH v3 1/3] mm: add new api to enable ksm per process
Date: Wed, 08 Mar 2023 14:16:36 -0800
Message-ID: <qvqwbkl2zxui.fsf@dev0134.prn3.facebook.com>
In-Reply-To: <20230308164746.GA473363@cmpxchg.org>


Johannes Weiner <hannes@cmpxchg.org> writes:

> On Thu, Feb 23, 2023 at 08:39:58PM -0800, Stefan Roesch wrote:
>> This adds a new prctl API to enable and disable KSM on a per-process
>> basis instead of only on a per-VMA basis (with madvise).
>>
>> 1) Introduce new MMF_VM_MERGE_ANY flag
>>
>> This introduces the new flag MMF_VM_MERGE_ANY. When this flag is set,
>> kernel samepage merging (KSM) gets enabled for all VMAs of a process.
>>
>> 2) add flag to __ksm_enter
>>
>> This change adds the flag parameter to __ksm_enter(). This allows
>> distinguishing whether KSM was enabled via prctl or madvise.
>>
>> 3) add flag to __ksm_exit call
>>
>> This adds the flag parameter to the __ksm_exit() call. This allows
>> distinguishing whether the call comes from a prctl or madvise invocation.
>>
>> 4) invoke madvise for all vmas in scan_get_next_rmap_item
>>
>> If the new flag MMF_VM_MERGE_ANY has been set for a process, iterate
>> over all of its VMAs and enable KSM where possible. For the VMAs that
>> can be KSM-enabled, this is only done once.
>>
>> 5) support disabling of ksm for a process
>>
>> This adds the ability to disable KSM for a process if KSM has been
>> enabled for it.
>>
>> 6) add new prctl option to get and set ksm for a process
>>
>> This adds two new options to the prctl system call:
>> - enable KSM for all VMAs of a process (if the VMAs support it).
>> - query whether KSM has been enabled for a process.
>>
>> Signed-off-by: Stefan Roesch <shr@devkernel.io>
>
> Hey Stefan, thanks for merging the patches into one. I found it much
> easier to review.
>
> Overall this looks straightforward to me. A few comments below:
>
>> @@ -2659,6 +2660,34 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>  	case PR_SET_VMA:
>>  		error = prctl_set_vma(arg2, arg3, arg4, arg5);
>>  		break;
>> +#ifdef CONFIG_KSM
>> +	case PR_SET_MEMORY_MERGE:
>> +		if (!capable(CAP_SYS_RESOURCE))
>> +			return -EPERM;
>> +
>> +		if (arg2) {
>> +			if (mmap_write_lock_killable(me->mm))
>> +				return -EINTR;
>> +
>> +			if (test_bit(MMF_VM_MERGEABLE, &me->mm->flags))
>> +				error = -EINVAL;
>
> So if the workload has already madvised specific VMAs the
> process-enablement will fail. Why is that? Shouldn't it be possible to
> override a local decision from an outside context that has more
> perspective on both sharing opportunities and security aspects?
>
> If there is a good reason for it, the -EINVAL should be addressed in
> the manpage. And maybe add a comment here as well.
>

This makes sense; I'll remove the check above.
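
Concretely, I'm thinking of something along these lines for the next
version (rough sketch, untested; it assumes an mm that already has
MMF_VM_MERGEABLE set is already registered with ksmd and therefore only
needs the new flag):

	case PR_SET_MEMORY_MERGE:
		if (!capable(CAP_SYS_RESOURCE))
			return -EPERM;

		if (arg2) {
			if (mmap_write_lock_killable(me->mm))
				return -EINTR;

			if (!test_bit(MMF_VM_MERGE_ANY, &me->mm->flags)) {
				if (test_bit(MMF_VM_MERGEABLE, &me->mm->flags))
					/* already tracked by ksmd, just widen the scope */
					set_bit(MMF_VM_MERGE_ANY, &me->mm->flags);
				else
					error = __ksm_enter(me->mm, MMF_VM_MERGE_ANY);
			}
			mmap_write_unlock(me->mm);
		} else {
			__ksm_exit(me->mm, MMF_VM_MERGE_ANY);
		}
		break;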

>> +			else if (!test_bit(MMF_VM_MERGE_ANY, &me->mm->flags))
>> +				error = __ksm_enter(me->mm, MMF_VM_MERGE_ANY);
>> +			mmap_write_unlock(me->mm);
>> +		} else {
>> +			__ksm_exit(me->mm, MMF_VM_MERGE_ANY);
>> +		}
>> +		break;
>> +	case PR_GET_MEMORY_MERGE:
>> +		if (!capable(CAP_SYS_RESOURCE))
>> +			return -EPERM;
>> +
>> +		if (arg2 || arg3 || arg4 || arg5)
>> +			return -EINVAL;
>> +
>> +		error = !!test_bit(MMF_VM_MERGE_ANY, &me->mm->flags);
>> +		break;
>> +#endif
>>  	default:
>>  		error = -EINVAL;
>>  		break;
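
For reference, from userspace the new interface would then be used roughly
like this (illustrative sketch, assuming headers with the new PR_*
definitions from this series; both options require CAP_SYS_RESOURCE per the
hunk above):

	#include <stdio.h>
	#include <sys/prctl.h>
	#include <linux/prctl.h>

	int main(void)
	{
		/* enable KSM for all current and future VMAs of this process */
		if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
			perror("PR_SET_MEMORY_MERGE");

		/* query the per-process setting: returns 0 or 1, -1 on error */
		printf("merge enabled: %d\n", prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));

		/* disable again; __ksm_exit() unmerges the vmas */
		if (prctl(PR_SET_MEMORY_MERGE, 0, 0, 0, 0))
			perror("PR_SET_MEMORY_MERGE");

		return 0;
	}
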
>> diff --git a/mm/ksm.c b/mm/ksm.c
>> index 56808e3bfd19..23d6944f78ad 100644
>> --- a/mm/ksm.c
>> +++ b/mm/ksm.c
>> @@ -1063,6 +1063,7 @@ static int unmerge_and_remove_all_rmap_items(void)
>>
>>  			mm_slot_free(mm_slot_cache, mm_slot);
>>  			clear_bit(MMF_VM_MERGEABLE, &mm->flags);
>> +			clear_bit(MMF_VM_MERGE_ANY, &mm->flags);
>>  			mmdrop(mm);
>>  		} else
>>  			spin_unlock(&ksm_mmlist_lock);
>> @@ -2329,6 +2330,17 @@ static struct ksm_rmap_item *get_next_rmap_item(struct ksm_mm_slot *mm_slot,
>>  	return rmap_item;
>>  }
>>
>> +static bool vma_ksm_mergeable(struct vm_area_struct *vma)
>> +{
>> +	if (vma->vm_flags & VM_MERGEABLE)
>> +		return true;
>> +
>> +	if (test_bit(MMF_VM_MERGE_ANY, &vma->vm_mm->flags))
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>>  static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
>>  {
>>  	struct mm_struct *mm;
>> @@ -2405,8 +2417,20 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
>>  		goto no_vmas;
>>
>>  	for_each_vma(vmi, vma) {
>> -		if (!(vma->vm_flags & VM_MERGEABLE))
>> +		if (!vma_ksm_mergeable(vma))
>>  			continue;
>> +		if (!(vma->vm_flags & VM_MERGEABLE)) {
>
> IMO, the helper obscures the interaction between the vma flag and the
> per-process flag here. How about:
>
> 		if (!(vma->vm_flags & VM_MERGEABLE)) {
> 			if (!test_bit(MMF_VM_MERGE_ANY, &vma->vm_mm->flags))
> 				continue;
>
> 			/*
> 			 * With per-process merging enabled, have the MM scan
> 			 * enroll any existing and new VMAs on the fly.
> 			 */
> 			ksm_madvise();
> 		}
>
>> +			unsigned long flags = vma->vm_flags;
>> +
>> +			/* madvise failed, use next vma */
>> +			if (ksm_madvise(vma, vma->vm_start, vma->vm_end, MADV_MERGEABLE, &flags))
>> +				continue;
>> +			/* vma not supported as mergeable */
>> +			if (!(flags & VM_MERGEABLE))
>> +				continue;
>> +
>> +			vm_flags_set(vma, VM_MERGEABLE);
>
> I don't understand the local flags. Can't it pass &vma->vm_flags to
> ksm_madvise()? It'll set VM_MERGEABLE on success. And you know it
> wasn't set before because the whole thing is inside the !set
> branch. The return value doesn't seem super useful, it's only the flag
> setting that matters:
>
> 			ksm_madvise(vma, vma->vm_start, vma->vm_end, MADV_MERGEABLE, &vma->vm_flags);
> 			/* madvise can fail, and will skip special vmas (pfnmaps and such) */
> 			if (!(vma->vm_flags & VM_MERGEABLE))
> 				continue;
>

vm_flags is defined as const, so I cannot pass it into the function
directly; that is why I'm using a local variable for it.
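
To illustrate the constraint (sketch): passing &vma->vm_flags would trip
over the const qualifier on the field, since writes are supposed to go
through the vm_flags_set()/vm_flags_clear() helpers:

	/* not allowed: vm_flags is const-qualified in struct vm_area_struct */
	ksm_madvise(vma, vma->vm_start, vma->vm_end, MADV_MERGEABLE, &vma->vm_flags);

	/* instead: collect the result in a local and write it back */
	unsigned long flags = vma->vm_flags;

	if (!ksm_madvise(vma, vma->vm_start, vma->vm_end, MADV_MERGEABLE, &flags) &&
	    (flags & VM_MERGEABLE))
		vm_flags_set(vma, VM_MERGEABLE);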

>> +		}
>>  		if (ksm_scan.address < vma->vm_start)
>>  			ksm_scan.address = vma->vm_start;
>>  		if (!vma->anon_vma)
>> @@ -2491,6 +2515,7 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
>>
>>  		mm_slot_free(mm_slot_cache, mm_slot);
>>  		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
>> +		clear_bit(MMF_VM_MERGE_ANY, &mm->flags);
>>  		mmap_read_unlock(mm);
>>  		mmdrop(mm);
>>  	} else {
>
>> @@ -2664,12 +2690,39 @@ int __ksm_enter(struct mm_struct *mm)
>>  	return 0;
>>  }
>>
>> -void __ksm_exit(struct mm_struct *mm)
>> +static void unmerge_vmas(struct mm_struct *mm)
>> +{
>> +	struct vm_area_struct *vma;
>> +	struct vma_iterator vmi;
>> +
>> +	vma_iter_init(&vmi, mm, 0);
>> +
>> +	mmap_read_lock(mm);
>> +	for_each_vma(vmi, vma) {
>> +		if (vma->vm_flags & VM_MERGEABLE) {
>> +			unsigned long flags = vma->vm_flags;
>> +
>> +			if (ksm_madvise(vma, vma->vm_start, vma->vm_end, MADV_UNMERGEABLE, &flags))
>> +				continue;
>> +
>> +			vm_flags_clear(vma, VM_MERGEABLE);
>
> ksm_madvise() tests and clears VM_MERGEABLE, so AFAICS
>
> 	for_each_vma(vmi, vma)
> 		ksm_madvise();
>
> should do it...
>

This is the same problem. vma->vm_flags is defined as const.

>> +		if (vma->vm_flags & VM_MERGEABLE) {

This check will be removed.
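
With that check gone, the loop body would look roughly like this (sketch,
keeping the local flags copy for the const reason above):

	for_each_vma(vmi, vma) {
		unsigned long flags = vma->vm_flags;

		/* ksm_madvise() is a no-op for vmas without VM_MERGEABLE */
		if (ksm_madvise(vma, vma->vm_start, vma->vm_end, MADV_UNMERGEABLE, &flags))
			continue;

		vm_flags_clear(vma, VM_MERGEABLE);
	}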

>> +		}
>> +	}
>> +	mmap_read_unlock(mm);
>> +}
>> +
>> +void __ksm_exit(struct mm_struct *mm, int flag)
>>  {
>>  	struct ksm_mm_slot *mm_slot;
>>  	struct mm_slot *slot;
>>  	int easy_to_free = 0;
>>
>> +	if (!(current->flags & PF_EXITING) && flag == MMF_VM_MERGE_ANY &&
>> +		test_bit(MMF_VM_MERGE_ANY, &mm->flags)) {
>> +		clear_bit(MMF_VM_MERGE_ANY, &mm->flags);
>> +		unmerge_vmas(mm);
>
> ...and then it's short enough to just open-code it here and drop the
> unmerge_vmas() helper.
