From: Seth Jennings <sjenning@linux.vnet.ibm.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Nitin Gupta <ngupta@vflare.org>, Minchan Kim <minchan@kernel.org>,
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
Dan Magenheimer <dan.magenheimer@oracle.com>,
Robert Jennings <rcj@linux.vnet.ibm.com>,
Jenifer Hopper <jhopper@us.ibm.com>, Mel Gorman <mgorman@suse.de>,
Johannes Weiner <jweiner@redhat.com>,
Rik van Riel <riel@redhat.com>,
Larry Woodman <lwoodman@redhat.com>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Dave Hansen <dave@sr71.net>, Joe Perches <joe@perches.com>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Cody P Schafer <cody@linux.vnet.ibm.com>,
Hugh Dickens <hughd@google.com>,
Paul Mackerras <paulus@samba.org>,
Heesub Shin <heesub.shin@samsung.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
devel@driverdev.osuosl.org
Subject: Re: [PATCHv12 2/4] zbud: add to mm/
Date: Wed, 29 May 2013 15:42:36 -0500 [thread overview]
Message-ID: <20130529204236.GD428@cerebellum> (raw)
In-Reply-To: <20130529113434.b2ced4cc1e66c7a0a520d908@linux-foundation.org>
On Wed, May 29, 2013 at 11:34:34AM -0700, Andrew Morton wrote:
> On Wed, 29 May 2013 10:45:00 -0500 Seth Jennings <sjenning@linux.vnet.ibm.com> wrote:
>
> > > > +struct zbud_page {
> > > > + union {
> > > > + struct page page;
> > > > + struct {
> > > > + unsigned long donotuse;
> > > > + u16 first_chunks;
> > > > + u16 last_chunks;
> > > > + struct list_head buddy;
> > > > + struct list_head lru;
> > > > + };
> > > > + };
> > > > +};
> > >
> > > Whoa. So zbud scribbles on existing pageframes?
> >
> > Yes.
> >
> > >
> > > Please tell us about this, in some detail. How is it done and why is
> > > this necessary?
> > >
> > > Presumably the pageframe must be restored at some stage, so this code
> > > has to be kept in sync with external unrelated changes to core MM?
> >
> > Yes, this is done in free_zbud_page().
> >
> > >
> > > Why was it implemented in this fashion rather than going into the main
> > > `struct page' definition and adding the appropriate unionised fields?
> >
> > Yes, modifying the struct page is the cleaner way. I thought that adding more
> > convolution to struct page would create more friction on the path to getting
> > this merged. Plus overlaying the struct page was the approach used by zsmalloc
> > and so I was thinking more along these lines.
>
> I'd be interested in seeing what the modifications to struct page look
> like. It really is the better way.
I'll do it then.
>
> > If you'd rather add the zbud fields directly into unions in struct page,
> > I'm ok with that if you are.
> >
> > Of course, this doesn't avoid having to reset the fields for the page allocator
> > before we free them. Even slub/slob reset the mapcount before calling
> > __free_page(), for example.
> >
> > >
> > > I worry about any code which independently looks at the pageframe
> > > tables and expects to find page struts there. One example is probably
> > > memory_failure() but there are probably others.
>
> ^^ this, please. It could be kinda fatal.
I'll look into this.
The expected behavior is that memory_failure() should handle zbud pages in the
same way that it handles in-use slub/slab/slob pages and return -EBUSY.
>
> > > >
> > > > ...
> > > >
> > > > +int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
> > > > + unsigned long *handle)
> > > > +{
> > > > + int chunks, i, freechunks;
> > > > + struct zbud_page *zbpage = NULL;
> > > > + enum buddy bud;
> > > > + struct page *page;
> > > > +
> > > > + if (size <= 0 || gfp & __GFP_HIGHMEM)
> > > > + return -EINVAL;
> > > > + if (size > PAGE_SIZE)
> > > > + return -E2BIG;
> > >
> > > Means "Argument list too long" and isn't appropriate here.
> >
> > Ok, I need a return value other than -EINVAL to convey to the user that the
> > allocation is larger than what the allocator can hold. I don't see an existing
> > errno that would be more suited for that. Do you have a suggestion?
>
> ENOMEM perhaps. That's also somewhat misleading, but I guess there's
> precedent for ENOMEM meaning "allocation too large" as well as "out
> of memory".
Works for me.
>
> > > > +int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
> > > > +{
> > > > + int i, ret, freechunks;
> > > > + struct zbud_page *zbpage;
> > > > + unsigned long first_handle = 0, last_handle = 0;
> > > > +
> > > > + spin_lock(&pool->lock);
> > > > + if (!pool->ops || !pool->ops->evict || list_empty(&pool->lru) ||
> > > > + retries == 0) {
> > > > + spin_unlock(&pool->lock);
> > > > + return -EINVAL;
> > > > + }
> > > > + for (i = 0; i < retries; i++) {
> > > > + zbpage = list_tail_entry(&pool->lru, struct zbud_page, lru);
> > > > + list_del(&zbpage->lru);
> > > > + list_del(&zbpage->buddy);
> > > > + /* Protect zbpage against free */
> > >
> > > Against free by who? What other code paths can access this page at
> > > this time?
> >
> > zbud has no way of serializing with the user (zswap) to prevent it calling
> > zbud_free() during zbud reclaim. To prevent the zbud page from being freed
> > while reclaim is operating on it, we set the reclaim flag in the struct page.
> > zbud_free() checks this flag and, if set, only sets the chunk length of the
> > allocation to 0, but does not actually free the zbud page. That is left to
> > this reclaim path.
>
> Sounds strange. Page refcounting is a well-established protocol and
> works well in other places?
Yes, refcounting seemed like overkill for this situation since the refcount
will only ever be 1 or 2 (2 if under reclaim) which basically reduces it to a
boolean. I'm also not sure if there is room left in the struct page for a
refcount with all the existing zbud metadata.
However, if you really don't like this, I can look at doing it via refcounts.
Seth
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Seth Jennings <sjenning@linux.vnet.ibm.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Nitin Gupta <ngupta@vflare.org>, Minchan Kim <minchan@kernel.org>,
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
Dan Magenheimer <dan.magenheimer@oracle.com>,
Robert Jennings <rcj@linux.vnet.ibm.com>,
Jenifer Hopper <jhopper@us.ibm.com>, Mel Gorman <mgorman@suse.de>,
Johannes Weiner <jweiner@redhat.com>,
Rik van Riel <riel@redhat.com>,
Larry Woodman <lwoodman@redhat.com>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Dave Hansen <dave@sr71.net>, Joe Perches <joe@perches.com>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Cody P Schafer <cody@linux.vnet.ibm.com>,
Hugh Dickens <hughd@google.com>,
Paul Mackerras <paulus@samba.org>,
Heesub Shin <heesub.shin@samsung.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
devel@driverdev.osuosl.org
Subject: Re: [PATCHv12 2/4] zbud: add to mm/
Date: Wed, 29 May 2013 15:42:36 -0500 [thread overview]
Message-ID: <20130529204236.GD428@cerebellum> (raw)
In-Reply-To: <20130529113434.b2ced4cc1e66c7a0a520d908@linux-foundation.org>
On Wed, May 29, 2013 at 11:34:34AM -0700, Andrew Morton wrote:
> On Wed, 29 May 2013 10:45:00 -0500 Seth Jennings <sjenning@linux.vnet.ibm.com> wrote:
>
> > > > +struct zbud_page {
> > > > + union {
> > > > + struct page page;
> > > > + struct {
> > > > + unsigned long donotuse;
> > > > + u16 first_chunks;
> > > > + u16 last_chunks;
> > > > + struct list_head buddy;
> > > > + struct list_head lru;
> > > > + };
> > > > + };
> > > > +};
> > >
> > > Whoa. So zbud scribbles on existing pageframes?
> >
> > Yes.
> >
> > >
> > > Please tell us about this, in some detail. How is it done and why is
> > > this necessary?
> > >
> > > Presumably the pageframe must be restored at some stage, so this code
> > > has to be kept in sync with external unrelated changes to core MM?
> >
> > Yes, this is done in free_zbud_page().
> >
> > >
> > > Why was it implemented in this fashion rather than going into the main
> > > `struct page' definition and adding the appropriate unionised fields?
> >
> > Yes, modifying the struct page is the cleaner way. I thought that adding more
> > convolution to struct page would create more friction on the path to getting
> > this merged. Plus overlaying the struct page was the approach used by zsmalloc
> > and so I was thinking more along these lines.
>
> I'd be interested in seeing what the modifications to struct page look
> like. It really is the better way.
I'll do it then.
>
> > If you'd rather add the zbud fields directly into unions in struct page,
> > I'm ok with that if you are.
> >
> > Of course, this doesn't avoid having to reset the fields for the page allocator
> > before we free them. Even slub/slob reset the mapcount before calling
> > __free_page(), for example.
> >
> > >
> > > I worry about any code which independently looks at the pageframe
> > > tables and expects to find page struts there. One example is probably
> > > memory_failure() but there are probably others.
>
> ^^ this, please. It could be kinda fatal.
I'll look into this.
The expected behavior is that memory_failure() should handle zbud pages in the
same way that it handles in-use slub/slab/slob pages and return -EBUSY.
>
> > > >
> > > > ...
> > > >
> > > > +int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
> > > > + unsigned long *handle)
> > > > +{
> > > > + int chunks, i, freechunks;
> > > > + struct zbud_page *zbpage = NULL;
> > > > + enum buddy bud;
> > > > + struct page *page;
> > > > +
> > > > + if (size <= 0 || gfp & __GFP_HIGHMEM)
> > > > + return -EINVAL;
> > > > + if (size > PAGE_SIZE)
> > > > + return -E2BIG;
> > >
> > > Means "Argument list too long" and isn't appropriate here.
> >
> > Ok, I need a return value other than -EINVAL to convey to the user that the
> > allocation is larger than what the allocator can hold. I don't see an existing
> > errno that would be more suited for that. Do you have a suggestion?
>
> ENOMEM perhaps. That's also somewhat misleading, but I guess there's
> precedent for ENOMEM meaning "allocation too large" as well as "out
> of memory".
Works for me.
>
> > > > +int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
> > > > +{
> > > > + int i, ret, freechunks;
> > > > + struct zbud_page *zbpage;
> > > > + unsigned long first_handle = 0, last_handle = 0;
> > > > +
> > > > + spin_lock(&pool->lock);
> > > > + if (!pool->ops || !pool->ops->evict || list_empty(&pool->lru) ||
> > > > + retries == 0) {
> > > > + spin_unlock(&pool->lock);
> > > > + return -EINVAL;
> > > > + }
> > > > + for (i = 0; i < retries; i++) {
> > > > + zbpage = list_tail_entry(&pool->lru, struct zbud_page, lru);
> > > > + list_del(&zbpage->lru);
> > > > + list_del(&zbpage->buddy);
> > > > + /* Protect zbpage against free */
> > >
> > > Against free by who? What other code paths can access this page at
> > > this time?
> >
> > zbud has no way of serializing with the user (zswap) to prevent it calling
> > zbud_free() during zbud reclaim. To prevent the zbud page from being freed
> > while reclaim is operating on it, we set the reclaim flag in the struct page.
> > zbud_free() checks this flag and, if set, only sets the chunk length of the
> > allocation to 0, but does not actually free the zbud page. That is left to
> > this reclaim path.
>
> Sounds strange. Page refcounting is a well-established protocol and
> works well in other places?
Yes, refcounting seemed like overkill for this situation since the refcount
will only ever be 1 or 2 (2 if under reclaim) which basically reduces it to a
boolean. I'm also not sure if there is room left in the struct page for a
refcount with all the existing zbud metadata.
However, if you really don't like this, I can look at doing it via refcounts.
Seth
next prev parent reply other threads:[~2013-05-29 20:42 UTC|newest]
Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-05-20 16:26 [PATCHv12 0/4] zswap: compressed swap caching Seth Jennings
2013-05-20 16:26 ` Seth Jennings
2013-05-20 16:26 ` [PATCHv12 1/4] debugfs: add get/set for atomic types Seth Jennings
2013-05-20 16:26 ` Seth Jennings
2013-05-29 17:42 ` Konrad Rzeszutek Wilk
2013-05-29 17:42 ` Konrad Rzeszutek Wilk
2013-05-20 16:26 ` [PATCHv12 2/4] zbud: add to mm/ Seth Jennings
2013-05-20 16:26 ` Seth Jennings
2013-05-21 3:37 ` Bob Liu
2013-05-21 3:37 ` Bob Liu
2013-05-28 21:59 ` Andrew Morton
2013-05-28 21:59 ` Andrew Morton
2013-05-29 15:45 ` Seth Jennings
2013-05-29 15:45 ` Seth Jennings
2013-05-29 18:34 ` Andrew Morton
2013-05-29 18:34 ` Andrew Morton
2013-05-29 20:42 ` Seth Jennings [this message]
2013-05-29 20:42 ` Seth Jennings
2013-05-29 20:48 ` Andrew Morton
2013-05-29 20:48 ` Andrew Morton
2013-05-29 21:09 ` Dan Magenheimer
2013-05-29 21:09 ` Dan Magenheimer
2013-05-29 21:29 ` Andrew Morton
2013-05-29 21:29 ` Andrew Morton
2013-05-30 17:43 ` Seth Jennings
2013-05-30 17:43 ` Seth Jennings
2013-05-30 21:20 ` Seth Jennings
2013-05-30 21:20 ` Seth Jennings
2013-05-31 1:48 ` Bob Liu
2013-05-31 1:48 ` Bob Liu
2013-06-03 13:48 ` Konrad Rzeszutek Wilk
2013-06-03 13:48 ` Konrad Rzeszutek Wilk
2013-05-29 20:45 ` Seth Jennings
2013-05-29 20:45 ` Seth Jennings
2013-05-20 16:26 ` [PATCHv12 3/4] zswap: " Seth Jennings
2013-05-20 16:26 ` Seth Jennings
2013-05-21 3:31 ` Bob Liu
2013-05-21 3:31 ` Bob Liu
2013-05-28 21:59 ` Andrew Morton
2013-05-28 21:59 ` Andrew Morton
2013-05-29 14:57 ` Seth Jennings
2013-05-29 14:57 ` Seth Jennings
2013-05-29 18:29 ` Andrew Morton
2013-05-29 18:29 ` Andrew Morton
2013-05-29 19:50 ` Seth Jennings
2013-05-29 19:50 ` Seth Jennings
2013-05-29 19:57 ` Andrew Morton
2013-05-29 19:57 ` Andrew Morton
2013-05-29 21:08 ` Seth Jennings
2013-05-29 21:08 ` Seth Jennings
2013-05-29 21:16 ` Andrew Morton
2013-05-29 21:16 ` Andrew Morton
2013-05-20 16:26 ` [PATCHv12 4/4] zswap: add documentation Seth Jennings
2013-05-20 16:26 ` Seth Jennings
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130529204236.GD428@cerebellum \
--to=sjenning@linux.vnet.ibm.com \
--cc=akpm@linux-foundation.org \
--cc=benh@kernel.crashing.org \
--cc=cody@linux.vnet.ibm.com \
--cc=dan.magenheimer@oracle.com \
--cc=dave@sr71.net \
--cc=devel@driverdev.osuosl.org \
--cc=gregkh@linuxfoundation.org \
--cc=heesub.shin@samsung.com \
--cc=hughd@google.com \
--cc=iamjoonsoo.kim@lge.com \
--cc=jhopper@us.ibm.com \
--cc=joe@perches.com \
--cc=jweiner@redhat.com \
--cc=konrad.wilk@oracle.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lwoodman@redhat.com \
--cc=mgorman@suse.de \
--cc=minchan@kernel.org \
--cc=ngupta@vflare.org \
--cc=paulus@samba.org \
--cc=rcj@linux.vnet.ibm.com \
--cc=riel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.