Re: [PATCH V2] mm: add a new vector based madvise syscall

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Andrew Morton <akpm@linux-foundation.org>
To: Shaohua Li <shli@fb.com>
Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, je@fb.com,
	Kernel-team@fb.com, Rik van Riel <riel@redhat.com>,
	Mel Gorman <mgorman@suse.de>, Hugh Dickins <hughd@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Andi Kleen <andi@firstfloor.org>,
	Minchan Kim <minchan@kernel.org>
Subject: Re: [PATCH V2] mm: add a new vector based madvise syscall
Date: Tue, 1 Dec 2015 15:59:26 -0800	[thread overview]
Message-ID: <20151201155926.6291e5e541c49d453079849b@linux-foundation.org> (raw)
In-Reply-To: <c25b90749f9212359a085125f6403f4c148dfde0.1447098139.git.shli@fb.com>

On Mon, 9 Nov 2015 11:44:54 -0800 Shaohua Li <shli@fb.com> wrote:

> In jemalloc, a free(3) doesn't immediately free the memory to OS even
> the memory is page aligned/size, and hope the memory can be reused soon.
> Later the virtual address becomes fragmented, and more and more free
> memory are aggregated. If the free memory size is large, jemalloc uses
> madvise(DONT_NEED) to actually free the memory back to OS.
> 
> The madvise has significantly overhead paritcularly because of TLB
> flush. jemalloc does madvise for several virtual address space ranges
> one time. Instead of calling madvise for each of the ranges, we
> introduce a new syscall to purge memory for several ranges one time. In
> this way, we can merge several TLB flush for the ranges to one big TLB
> flush. This also reduce mmap_sem locking and kernel/userspace switching.
> 
> I'm running a simple memory allocation benchmark. 32 threads do random
> malloc/free/realloc. Corresponding jemalloc patch to utilize this API is
> attached.
> Without patch:
> real    0m18.923s
> user    1m11.819s
> sys     7m44.626s
> each cpu gets around 3000K/s TLB flush interrupt. Perf shows TLB flush
> is hotest functions. mmap_sem read locking (because of page fault) is
> also heavy.
> 
> with patch:
> real    0m15.026s
> user    0m48.548s
> sys     6m41.153s
> each cpu gets around 140k/s TLB flush interrupt. TLB flush isn't hot at
> all. mmap_sem read locking (still because of page fault) becomes the
> sole hot spot.
> 
> Another test malloc a bunch of memory in 48 threads, then all threads
> free the memory. I measure the time of the memory free.
> Without patch: 34.332s
> With patch:    17.429s
> 
> Current implementation only supports MADV_DONTNEED. Should be trival to
> support MADV_FREE if necessary later.

I'd like to see a full description of the proposed userspace interface:
arguments, data structures, return values, etc.  A propotype manpage,
basically.

I'd also like to see an analysis of which other userspace allocators
will benefit from this.  glibc? tcmalloc?

>
> ...
>
> +/*
> + * The vector madvise(). Like madvise except running for a vector of virtual
> + * address ranges
> + */
> +SYSCALL_DEFINE3(madvisev, const struct iovec __user *, uvector,
> +	unsigned long, nr_segs, int, behavior)
> +{
> +	struct iovec iovstack[UIO_FASTIOV];
> +	struct iovec *iov = NULL;
> +	unsigned long start, end = 0;
> +	int unmapped_error = 0;
> +	size_t len;
> +	struct mmu_gather tlb;
> +	int error;
> +	int i;
> +
> +	if (behavior != MADV_DONTNEED)
> +		return -EINVAL;
> +
> +	error = rw_copy_check_uvector(CHECK_IOVEC_ONLY, uvector, nr_segs,
> +			UIO_FASTIOV, iovstack, &iov);
> +	if (error <= 0)
> +		goto out;
> +	/* Make sure address in ascend order */
> +	sort(iov, nr_segs, sizeof(struct iovec), iov_cmp_func, NULL);

Do we really need to sort the addresses?  That's something which can be
done in userspace and we can easily add a check-for-sortedness to the
below loop.

It depends on whether userspace can easily generate a sorted array.  If
basically all userspace will always need to run sort() then it doesn't
matter much whether it's done in the kernel or in userspace.  But if
*some* userspace can naturally generate its array in sorted form then
neither userspace nor the kernel needs to run sort() and we should take
this out.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

     prev parent reply	other threads:[~2015-12-01 23:59 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-11-09 19:44 [PATCH V2] mm: add a new vector based madvise syscall Shaohua Li
2015-12-01 23:59 ` Andrew Morton [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20151201155926.6291e5e541c49d453079849b@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=Kernel-team@fb.com \
    --cc=aarcange@redhat.com \
    --cc=andi@firstfloor.org \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=je@fb.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=minchan@kernel.org \
    --cc=riel@redhat.com \
    --cc=shli@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).