qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Roland Dreier <roland@kernel.org>
Cc: qemu-devel@nongnu.org,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	Yishai Hadas <yishaih@mellanox.com>,
	LKML <linux-kernel@vger.kernel.org>,
	"Michael R. Hines" <mrhines@linux.vnet.ibm.com>,
	Hal Rosenstock <hal.rosenstock@gmail.com>,
	Sean Hefty <sean.hefty@intel.com>,
	Christoph Lameter <cl@linux.com>
Subject: Re: [Qemu-devel] [PATCH] rdma: don't make pages writeable if not requiested
Date: Thu, 21 Mar 2013 13:30:47 +0200	[thread overview]
Message-ID: <20130321113047.GA31599@redhat.com> (raw)
In-Reply-To: <20130321093230.GF28328@redhat.com>

On Thu, Mar 21, 2013 at 11:32:30AM +0200, Michael S. Tsirkin wrote:
> On Wed, Mar 20, 2013 at 11:55:54PM -0700, Roland Dreier wrote:
> > On Wed, Mar 20, 2013 at 11:18 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > core/umem.c seems to get the arguments to get_user_pages
> > > in the reverse order: it sets writeable flag and
> > > breaks COW for MAP_SHARED if and only if hardware needs to
> > > write the page.
> > >
> > > This breaks memory overcommit for users such as KVM:
> > > each time we try to register a page to send it to remote, this
> > > breaks COW.  It seems that for applications that only have
> > > REMOTE_READ permission, there is no reason to break COW at all.
> > 
> > I proposed a similar (but not exactly the same, see below) patch a
> > while ago: https://lkml.org/lkml/2012/1/26/7 but read the thread,
> > especially https://lkml.org/lkml/2012/2/6/265
> > 
> > I think this change will break the case where userspace tries to
> > register an MR with read-only permission, but intends locally through
> > the CPU to write to the memory.  If the memory registration is done
> > while the memory is mapped read-only but has VM_MAYWRITE, then
> > userspace gets into trouble when COW happens.  In the case you're
> > describing (although I'm not sure where in KVM we're talking about
> > using RDMA), what happens if you register memory with only REMOTE_READ
> > and then COW is triggered because of a local write?  (I'm assuming you
> > don't want remote access to continue to get the old contents of the
> > page)
> 
> I read that, and the above. It looks like everyone was doing tricks
> like "register page, then modify it, then let remote read it"
> and for some reason assumed it's ok to write into page locally from CPU
> even if LOCAL_WRITE is not set.  I don't see why don't applications set
> LOCAL_WRITE if they are going to write to memory locally, but assuming
> they don't, we can't just break them.
> 
> So what we need is a new "no I really do not intend to write into this
> memory" flag that avoids doing tricks in the kernel and treats the
> page normally, just pins it so hardware can read it.
> 
> 
> > I have to confess that I still haven't had a chance to implement the
> > proposed FOLL_FOLLOW solution to all of this.
> 
> See a much easier to implement proposal at the bottom.
> 
> > > If the page that is COW has lots of copies, this makes the user process
> > > quickly exceed the cgroups memory limit.  This makes RDMA mostly useless
> > > for virtualization, thus the stable tag.
> > 
> > The actual problem description here is a bit too terse for me to
> > understand.  How do we end up with lots of copies of a COW page?
> 
> Reading the links above, rdma breaks COW intentionally.
> 
> Imagine a page with lots of instances in the process page map.
> For example a zero page, but not only that: we rely on KSM heavily
> to deduplicate pages for multiple VMs.
> There are gigabytes of these in each of the multiple VMs
> running on a host.
> 
> What we are using RDMA for is VM migration so we careful not to change
> this memory: when we do allow memory to change we are careful
> to track what was changed, reregister and resend the data.
> 
> But at the moment, each time we register a virtual address referencing
> this page, infiniband assumes we might want to change the page so it
> does get_user_pages with writeable flag, forcing a copy.
> Amount of used RAM explodes.
> 
> >  Why
> > is RDMA registering the memory any more  special than having everyone
> > who maps this page actually writing to it and triggering COW?
> > 
> > >                 ret = get_user_pages(current, current->mm, cur_base,
> > >                                      min_t(unsigned long, npages,
> > >                                            PAGE_SIZE / sizeof (struct page *)),
> > > -                                    1, !umem->writable, page_list, vma_list);
> > > +                                    !umem->writable, 1, page_list, vma_list);
> > 
> > The first two parameters in this line being changed are "write" and "force".
> > 
> > I think if we do change this, then we need to pass umem->writable (as
> > opposed to !umem->writable) for the "write" parameter.
> 
> Ugh. Sure enough. Let's agree on the direction before I respin the
> patch though.
> 
> > Not sure
> > whether "force" makes sense or not.
> > 
> >  - R.
> 
> If you don't force write on read-only mappings you don't, but
> it seems harmless for read-only gup. Still, no need to change
> what's not broken.
> 
> Please comment on the below (completely untested, and needs userspace
> patch too, but just to give you the idea)
> 
> --->
> 
> rdma: add IB_ACCESS_APP_READONLY 

Or we can call it IB_ACCESS_GIFT - this is a bit like SPLICE_F_GIFT
semantics.



> At the moment any attempt to register memory for RDMA breaks
> COW, which hurts hosts overcomitted for memory.
> But if the application knows it won't write into the MR after
> registration, we can save (sometimes a lot of) memory
> by telling the kernel not to bother breaking COW for us.
> 
> If the application does change memory registered with this flag, it can
> re-register afterwards, and resend the data on the wire.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ---
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 5929598..635b57a 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -152,7 +152,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  		ret = get_user_pages(current, current->mm, cur_base,
>  				     min_t(unsigned long, npages,
>  					   PAGE_SIZE / sizeof (struct page *)),
> -				     !umem->writable, 1, page_list, vma_list);
> +				     umem->writable ||
> +				     !(access & IB_ACCESS_APP_READONLY),
> +				     !umem->writable, page_list, vma_list);
>  
>  		if (ret < 0)
>  			goto out;
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 98cc4b2..3a3ba1b 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -871,7 +871,8 @@ enum ib_access_flags {
>  	IB_ACCESS_REMOTE_READ	= (1<<2),
>  	IB_ACCESS_REMOTE_ATOMIC	= (1<<3),
>  	IB_ACCESS_MW_BIND	= (1<<4),
> -	IB_ZERO_BASED		= (1<<5)
> +	IB_ZERO_BASED		= (1<<5),
> +	IB_ACCESS_APP_READONLY	= (1<<6) /* User promises not to change the data */
>  };
>  
>  struct ib_phys_buf {
> 
> -- 
> MST

  reply	other threads:[~2013-03-21 11:30 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-03-21  6:18 [Qemu-devel] [PATCH] rdma: don't make pages writeable if not requiested Michael S. Tsirkin
2013-03-21  6:55 ` Roland Dreier
2013-03-21  7:03   ` Michael S. Tsirkin
2013-03-21  7:15     ` Roland Dreier
2013-03-21  8:51       ` Michael S. Tsirkin
2013-03-21  9:13         ` Roland Dreier
2013-03-21  9:39           ` Michael S. Tsirkin
2013-03-21 17:11             ` Jason Gunthorpe
2013-03-21 17:15               ` Michael S. Tsirkin
2013-03-21 17:21                 ` Jason Gunthorpe
2013-03-21 17:42                   ` Michael S. Tsirkin
2013-03-21 17:57                     ` Jason Gunthorpe
2013-03-21 18:03                       ` Michael S. Tsirkin
2013-03-21 18:16               ` Michael S. Tsirkin
2013-03-21 18:41                 ` Jason Gunthorpe
2013-03-21 19:15                   ` Michael S. Tsirkin
2013-03-21 20:09                     ` Jason Gunthorpe
2013-03-21  9:32   ` Michael S. Tsirkin
2013-03-21 11:30     ` Michael S. Tsirkin [this message]
2013-03-21 12:23 ` Michael R. Hines
2013-03-21 12:32   ` Michael S. Tsirkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130321113047.GA31599@redhat.com \
    --to=mst@redhat.com \
    --cc=cl@linux.com \
    --cc=hal.rosenstock@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=mrhines@linux.vnet.ibm.com \
    --cc=qemu-devel@nongnu.org \
    --cc=roland@kernel.org \
    --cc=sean.hefty@intel.com \
    --cc=yishaih@mellanox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).