From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andy Lutomirski
Message-ID: <53B4B833.9010508@mit.edu>
Date: Wed, 02 Jul 2014 18:56:03 -0700
MIME-Version: 1.0
References: <1404319816-30229-1-git-send-email-aarcange@redhat.com>
 <1404319816-30229-9-git-send-email-aarcange@redhat.com>
In-Reply-To: <1404319816-30229-9-git-send-email-aarcange@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH 08/10] userfaultfd: add new syscall to
 provide memory externalization
To: Andrea Arcangeli, qemu-devel@nongnu.org, kvm@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Taras Glek, Robert Love, Dave Hansen, Jan Kara, Minchan Kim,
 Mel Gorman, Linux API, Hugh Dickins, "Dr. David Alan Gilbert",
 "Huangpeng (Peter)", Neil Brown, Dmitry Adamushko, Johannes Weiner,
 KOSAKI Motohiro, Mike Hommey, Andrew Morton, Michel Lespinasse,
 Android Kernel Team, Keith Packard, Isaku Yamahata

On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> Once an userfaultfd is created MADV_USERFAULT regions talks through
> the userfaultfd protocol with the thread responsible for doing the
> memory externalization of the process.
>
> The protocol starts by userland writing the requested/preferred
> USERFAULT_PROTOCOL version into the userfault fd (64bit write), if
> kernel knows it, it will ack it by allowing userland to read 64bit
> from the userfault fd that will contain the same 64bit
> USERFAULT_PROTOCOL version that userland asked. Otherwise userfault
> will read __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and it
> will have to try again by writing an older protocol version if
> suitable for its usage too, and read it back again until it stops
> reading -1ULL. After that the userfaultfd protocol starts.
>
> The protocol consists in the userfault fd reads 64bit in size
> providing userland the fault addresses. After a userfault address has
> been read and the fault is resolved by userland, the application must
> write back 128bits in the form of [ start, end ] range (64bit each)
> that will tell the kernel such a range has been mapped. Multiple read
> userfaults can be resolved in a single range write. poll() can be used
> to know when there are new userfaults to read (POLLIN) and when there
> are threads waiting a wakeup through a range write (POLLOUT).
>
> Signed-off-by: Andrea Arcangeli

> +#ifdef CONFIG_PROC_FS
> +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
> +{
> +	struct userfaultfd_ctx *ctx = f->private_data;
> +	int ret;
> +	wait_queue_t *wq;
> +	struct userfaultfd_wait_queue *uwq;
> +	unsigned long pending = 0, total = 0;
> +
> +	spin_lock(&ctx->fault_wqh.lock);
> +	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> +		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> +		if (uwq->pending)
> +			pending++;
> +		total++;
> +	}
> +	spin_unlock(&ctx->fault_wqh.lock);
> +
> +	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);

This should show the protocol version, too.
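
For reference, the version handshake described in the quoted commit
message could be sketched in userland C roughly as follows. This only
illustrates the text of this RFC and is not the API of any released
kernel. The names uffd_transport, negotiate_protocol and fake_kernel
are hypothetical; the two callbacks stand in for the 64-bit write() and
read() on the userfault fd, and the fake kernel exists only so the loop
can be exercised without the patch applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* -1ULL, as named in the quoted commit message. */
#define USERFAULTFD_UNKNOWN_PROTOCOL UINT64_MAX

/* Hypothetical transport: stands in for 64-bit write()/read() on the fd. */
struct uffd_transport {
	int (*write_u64)(void *ctx, uint64_t value);
	int (*read_u64)(void *ctx, uint64_t *value);
	void *ctx;
};

/*
 * Offer protocol versions in order of preference (newest first).
 * Returns the version the other side acked, or
 * USERFAULTFD_UNKNOWN_PROTOCOL if none was accepted or I/O failed.
 */
static uint64_t negotiate_protocol(const struct uffd_transport *t,
				   const uint64_t *versions, size_t count)
{
	size_t i;

	assert(t != NULL && versions != NULL);
	for (i = 0; i < count; i++) {
		uint64_t ack;

		if (t->write_u64(t->ctx, versions[i]) != 0)
			break;
		if (t->read_u64(t->ctx, &ack) != 0)
			break;
		if (ack == versions[i])
			return ack;	/* acked: protocol proper starts now */
		/* Read back -1ULL: version unknown, retry with an older one. */
	}
	return USERFAULTFD_UNKNOWN_PROTOCOL;
}

/* Hypothetical stand-in for the kernel side; it knows one version only. */
struct fake_kernel {
	uint64_t known;
	uint64_t last_written;
};

static int fake_write(void *ctx, uint64_t value)
{
	((struct fake_kernel *)ctx)->last_written = value;
	return 0;
}

static int fake_read(void *ctx, uint64_t *value)
{
	struct fake_kernel *k = (struct fake_kernel *)ctx;

	*value = (k->last_written == k->known) ? k->known
					       : USERFAULTFD_UNKNOWN_PROTOCOL;
	return 0;
}
```

A caller would fill a uffd_transport with fake_write/fake_read (or real
fd-backed callbacks) and pass its preferred versions, newest first.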
> +
> +SYSCALL_DEFINE1(userfaultfd, int, flags)
> +{
> +	int fd, error;
> +	struct file *file;

This looks like it can't be used more than once in a process. That will
be unfortunate for libraries. Would it be feasible to either have
userfaultfd claim a range of addresses or for a vma to be explicitly
associated with a userfaultfd? (In the latter case, giant PROT_NONE
MAP_NORESERVE mappings could be used.)
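
To make the last suggestion concrete, here is a minimal sketch of the
kind of mapping meant: a large anonymous region that reserves address
space only, commits no memory, and faults on any access. The mmap()
flags are standard Linux; reserve_region and release_region are
hypothetical helper names, and how such a vma would be tied to a
particular userfaultfd is exactly the open question above, so it is not
shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Reserve len bytes of address space without committing memory.
 * PROT_NONE makes every access fault; MAP_NORESERVE asks the kernel
 * not to reserve swap space for the range.
 * Returns NULL on failure.
 */
static void *reserve_region(size_t len)
{
	void *p = mmap(NULL, len, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Give the reservation back; returns 0 on success, -1 on error. */
static int release_region(void *addr, size_t len)
{
	assert(addr != NULL);
	return munmap(addr, len);
}
```

A library could call reserve_region() once per arena it wants to
manage, which is what makes a per-vma association attractive compared
with a single per-process userfaultfd.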