public inbox for linux-rdma@vger.kernel.org
From: "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: "Michael R. Hines"
	<mrhines-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>,
	Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>,
	Hal Rosenstock
	<hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
	Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>,
	"linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	LKML <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A@public.gmane.org
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested
Date: Thu, 21 Mar 2013 13:30:47 +0200	[thread overview]
Message-ID: <20130321113047.GA31599@redhat.com> (raw)
In-Reply-To: <20130321093230.GF28328-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Thu, Mar 21, 2013 at 11:32:30AM +0200, Michael S. Tsirkin wrote:
> On Wed, Mar 20, 2013 at 11:55:54PM -0700, Roland Dreier wrote:
> > On Wed, Mar 20, 2013 at 11:18 PM, Michael S. Tsirkin <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > core/umem.c seems to get the arguments to get_user_pages
> > > in the reverse order: it sets writeable flag and
> > > breaks COW for MAP_SHARED if and only if hardware needs to
> > > write the page.
> > >
> > > This breaks memory overcommit for users such as KVM:
> > > each time we try to register a page to send it to remote, this
> > > breaks COW.  It seems that for applications that only have
> > > REMOTE_READ permission, there is no reason to break COW at all.
> > 
> > I proposed a similar (but not exactly the same, see below) patch a
> > while ago: https://lkml.org/lkml/2012/1/26/7 but read the thread,
> > especially https://lkml.org/lkml/2012/2/6/265
> > 
> > I think this change will break the case where userspace tries to
> > register an MR with read-only permission, but intends locally through
> > the CPU to write to the memory.  If the memory registration is done
> > while the memory is mapped read-only but has VM_MAYWRITE, then
> > userspace gets into trouble when COW happens.  In the case you're
> > describing (although I'm not sure where in KVM we're talking about
> > using RDMA), what happens if you register memory with only REMOTE_READ
> > and then COW is triggered because of a local write?  (I'm assuming you
> > don't want remote access to continue to get the old contents of the
> > page)
> 
> I read that, and the above. It looks like everyone was doing tricks
> like "register page, then modify it, then let remote read it"
> and for some reason assumed it's ok to write into page locally from CPU
> even if LOCAL_WRITE is not set.  I don't see why applications don't set
> LOCAL_WRITE if they are going to write to memory locally, but assuming
> they don't, we can't just break them.
> 
> So what we need is a new "no I really do not intend to write into this
> memory" flag that avoids doing tricks in the kernel and treats the
> page normally, just pins it so hardware can read it.
> 
> 
> > I have to confess that I still haven't had a chance to implement the
> > proposed FOLL_FOLLOW solution to all of this.
> 
> See a much easier to implement proposal at the bottom.
> 
> > > If the page that is COW has lots of copies, this makes the user process
> > > quickly exceed the cgroups memory limit.  This makes RDMA mostly useless
> > > for virtualization, thus the stable tag.
> > 
> > The actual problem description here is a bit too terse for me to
> > understand.  How do we end up with lots of copies of a COW page?
> 
> Reading the links above, rdma breaks COW intentionally.
> 
> Imagine a page with lots of instances in the process page map.
> For example a zero page, but not only that: we rely on KSM heavily
> to deduplicate pages for multiple VMs.
> There are gigabytes of these in each of the multiple VMs
> running on a host.
> 
> What we are using RDMA for is VM migration, so we are careful not to change
> this memory: when we do allow memory to change we are careful
> to track what was changed, reregister and resend the data.
> 
> But at the moment, each time we register a virtual address referencing
> this page, infiniband assumes we might want to change the page so it
> does get_user_pages with writeable flag, forcing a copy.
> The amount of used RAM explodes.
> 
> >  Why
> > is RDMA registering the memory any more  special than having everyone
> > who maps this page actually writing to it and triggering COW?
> > 
> > >                 ret = get_user_pages(current, current->mm, cur_base,
> > >                                      min_t(unsigned long, npages,
> > >                                            PAGE_SIZE / sizeof (struct page *)),
> > > -                                    1, !umem->writable, page_list, vma_list);
> > > +                                    !umem->writable, 1, page_list, vma_list);
> > 
> > The first two parameters in this line being changed are "write" and "force".
> > 
> > I think if we do change this, then we need to pass umem->writable (as
> > opposed to !umem->writable) for the "write" parameter.
> 
> Ugh. Sure enough. Let's agree on the direction before I respin the
> patch though.
> 
> > Not sure
> > whether "force" makes sense or not.
> > 
> >  - R.
> 
> If you don't write to read-only mappings you don't need force, but
> it seems harmless for read-only gup. Still, no need to change
> what's not broken.
> 
> Please comment on the below (completely untested, and needs userspace
> patch too, but just to give you the idea)
> 
> --->
> 
> rdma: add IB_ACCESS_APP_READONLY 

Or we can call it IB_ACCESS_GIFT - this is a bit like SPLICE_F_GIFT
semantics.



> At the moment any attempt to register memory for RDMA breaks
> COW, which hurts hosts overcommitted for memory.
> But if the application knows it won't write into the MR after
> registration, we can save (sometimes a lot of) memory
> by telling the kernel not to bother breaking COW for us.
> 
> If the application does change memory registered with this flag, it can
> re-register afterwards, and resend the data on the wire.
> 
> Signed-off-by: Michael S. Tsirkin <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> 
> ---
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 5929598..635b57a 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -152,7 +152,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  		ret = get_user_pages(current, current->mm, cur_base,
>  				     min_t(unsigned long, npages,
>  					   PAGE_SIZE / sizeof (struct page *)),
> -				     !umem->writable, 1, page_list, vma_list);
> +				     umem->writable ||
> +				     !(access & IB_ACCESS_APP_READONLY),
> +				     !umem->writable, page_list, vma_list);
>  
>  		if (ret < 0)
>  			goto out;
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 98cc4b2..3a3ba1b 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -871,7 +871,8 @@ enum ib_access_flags {
>  	IB_ACCESS_REMOTE_READ	= (1<<2),
>  	IB_ACCESS_REMOTE_ATOMIC	= (1<<3),
>  	IB_ACCESS_MW_BIND	= (1<<4),
> -	IB_ZERO_BASED		= (1<<5)
> +	IB_ZERO_BASED		= (1<<5),
> +	IB_ACCESS_APP_READONLY	= (1<<6) /* User promises not to change the data */
>  };
>  
>  struct ib_phys_buf {
> 
> -- 
> MST
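
To make the effect of the two changed get_user_pages() arguments easier to
check, here is a minimal user-space sketch of the same boolean logic. It is
not kernel code: gup_write_arg() and gup_force_arg() are hypothetical helper
names invented for illustration, and how umem->writable is derived from the
access flags is outside this sketch, so it is simply passed in as a bool.

```c
#include <stdbool.h>

/* Bit value taken from the enum in the patch; the flag is only a proposal. */
#define IB_ACCESS_APP_READONLY (1 << 6)

/*
 * Hypothetical helper: models the "write" argument of get_user_pages()
 * as the patch computes it. write == true is the case that breaks COW.
 */
static bool gup_write_arg(bool umem_writable, int access)
{
	return umem_writable || !(access & IB_ACCESS_APP_READONLY);
}

/*
 * Hypothetical helper: models the "force" argument, which the patch
 * sets only when the umem is not writable.
 */
static bool gup_force_arg(bool umem_writable)
{
	return !umem_writable;
}
```

Under these assumptions, only a non-writable umem registered with the new
flag ends up with write == 0, that is, a plain read-only pin that leaves COW
alone. Every other combination still asks for a writable page.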
