Linux virtualization list
 help / color / mirror / Atom feed
* Re: [virtio-dev] virtio-vsock live migration
From: Michael S. Tsirkin @ 2016-03-15 16:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-dev, Claudio Imbrenda, Christian Borntraeger,
	Matt Benjamin, virtualization, Christoffer Dall
In-Reply-To: <20160315151529.GB26263@stefanha-x1.localdomain>

On Tue, Mar 15, 2016 at 03:15:29PM +0000, Stefan Hajnoczi wrote:
> On Mon, Mar 14, 2016 at 01:13:24PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Mar 03, 2016 at 03:37:37PM +0000, Stefan Hajnoczi wrote:
> > > Michael pointed out that the virtio-vsock draft specification does not
> > > address live migration and in fact currently precludes migration.
> > > 
> > > Migration is fundamental so the device specification at least mustn't
> > > preclude it.  Having brainstormed migration with Matthew Benjamin and
> > > Michael Tsirkin, I am now summarizing the approach that I want to
> > > include in the next draft specification.
> > > 
> > > Feedback and comments welcome!  In the meantime I will implement this in
> > > code and update the draft specification.
> > 
> > Most of the issue seems to be a consequence of using a 4 byte CID.
> > 
> > I think the right thing to do is just to teach guests
> > about 64 bit CIDs.
> > 
> > For now, can we drop guest CID from guest to host communication completely,
> > making CID only host-visible? Maybe leave the space
> > in the packet so we can add CID there later.
> > It seems that in theory this will allow changing CID
> > during migration, transparently to the guest.
> > 
> > Guest visible CID is required for guest to guest communication -
> > but IIUC that is not currently supported.
> > Maybe that can be made conditional on 64 bit addressing.
> > Alternatively, it seems much easier to accept that these channels get broken
> > across migration.
> 
> I reached the conclusion that channels break across migration because:
> 
> 1. 32-bit CIDs are in sockaddr_vm and we'd break AF_VSOCK ABI by
>    changing it to 64-bit.  Application code would be specific
>    virtio-vsock and wouldn't work with other AF_VSOCK transports that
>    use the 32-bit sockaddr_vm struct.

You don't have to repeat the IPv6 mistake.  Make all 32 bit CIDs
64 bit CIDs by padding with 0s, then 64 bit apps can use
any CID.

Old 32 bit CID applications will not be able to use the extended
addresses, but hardcoding bugs
does not seem sane.


> 2. Dropping guest CIDs from the protocol breaks network protocols that
>    send addresses.

Stick it in config space if you really have to.
But why do you need it on each packet?

>  NFS and netperf are the first two protocols I looked
>    at and both transmit address information across the connection...


Does netperf really attempt to get local IP
and then send that inline within the connection?


-- 
MST

^ permalink raw reply

* Re: [virtio-dev] virtio-vsock live migration
From: Stefan Hajnoczi @ 2016-03-15 15:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Claudio Imbrenda, Christian Borntraeger,
	Matt Benjamin, virtualization, Christoffer Dall
In-Reply-To: <20160314130150-mutt-send-email-mst@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 1946 bytes --]

On Mon, Mar 14, 2016 at 01:13:24PM +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 03, 2016 at 03:37:37PM +0000, Stefan Hajnoczi wrote:
> > Michael pointed out that the virtio-vsock draft specification does not
> > address live migration and in fact currently precludes migration.
> > 
> > Migration is fundamental so the device specification at least mustn't
> > preclude it.  Having brainstormed migration with Matthew Benjamin and
> > Michael Tsirkin, I am now summarizing the approach that I want to
> > include in the next draft specification.
> > 
> > Feedback and comments welcome!  In the meantime I will implement this in
> > code and update the draft specification.
> 
> Most of the issue seems to be a consequence of using a 4 byte CID.
> 
> I think the right thing to do is just to teach guests
> about 64 bit CIDs.
> 
> For now, can we drop guest CID from guest to host communication completely,
> making CID only host-visible? Maybe leave the space
> in the packet so we can add CID there later.
> It seems that in theory this will allow changing CID
> during migration, transparently to the guest.
> 
> Guest visible CID is required for guest to guest communication -
> but IIUC that is not currently supported.
> Maybe that can be made conditional on 64 bit addressing.
> Alternatively, it seems much easier to accept that these channels get broken
> across migration.

I reached the conclusion that channels break across migration because:

1. 32-bit CIDs are in sockaddr_vm and we'd break AF_VSOCK ABI by
   changing it to 64-bit.  Application code would be specific
   virtio-vsock and wouldn't work with other AF_VSOCK transports that
   use the 32-bit sockaddr_vm struct.

2. Dropping guest CIDs from the protocol breaks network protocols that
   send addresses.  NFS and netperf are the first two protocols I looked
   at and both transmit address information across the connection...

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: virtio-vsock live migration
From: Stefan Hajnoczi @ 2016-03-15 15:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, Claudio Imbrenda, Christian Borntraeger,
	Matt Benjamin, virtualization, Christoffer Dall
In-Reply-To: <20160311014147-mutt-send-email-mst@redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 6191 bytes --]

On Fri, Mar 11, 2016 at 01:56:05AM +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 03, 2016 at 03:37:37PM +0000, Stefan Hajnoczi wrote:
> > Michael pointed out that the virtio-vsock draft specification does not
> > address live migration and in fact currently precludes migration.
> > 
> > Migration is fundamental so the device specification at least mustn't
> > preclude it.  Having brainstormed migration with Matthew Benjamin and
> > Michael Tsirkin, I am now summarizing the approach that I want to
> > include in the next draft specification.
> > 
> > Feedback and comments welcome!  In the meantime I will implement this in
> > code and update the draft specification.
> > 
> > 1. Requirements
> > 
> > Virtio-vsock is a new AF_VSOCK transport.  As such, it should provide at
> > least the same guarantees as the existing AF_VSOCK VMCI transport.  This
> > is for consistency and to allow code reuse across any AF_VSOCK
> > transport.
> > 
> > Virtio-vsock aims to replace virtio-serial by providing the same
> > guest/host communication ability but with sockets API semantics that are
> > more popular and convenient for application developers.  Therefore
> > virtio-vsock migration should provide at least the same level of
> > migration functionality as virtio-serial.
> > 
> > Ideally it should be possible to migrate applications using AF_VSOCK
> > together with the virtual machine so that guest<->host communication is
> > interrupted.  Neither AF_VSOCK VMCI nor virtio-serial support this
> > today.
> 
> I'm not sure why do you say this about virtio serial.
> It appears that if host pre-connected to destination
> qemu before migration, backend reconnects transparently
> on destination.

You are right, virtio-serial supports keeping active ports open across
migration (as well as closing active ports across migration).  In
virtio-vsock the equivalent would be setsockopt() CRIU-style socket
migration which is not implemented today.

> > 2. Basic disruptive migration flow
> > 
> > When the virtual machine migrates from the source host to the
> > destination host, the guest's CID may change.  The CID namespace is
> > host-wide
> 
> 
> BTW, I think CIDs would have to become per network namespace.

Yes, I agree.

> > so other hosts may have CID collisions and allocate a new CID
> > for incoming migration VMs.
> 
> I guess all this is so that guest can retrieve its CID and
> send it to host using some side-channel?

Yes.

> > The device notifies the guest that the CID has changed.  Guest sockets
> > are affected as follows:
> > 
> >  * Established connections are reset (ECONNRESET) and the guest
> >    application will have to reconnect.
> > 
> >  * Listen sockets remain open.  The only thing to note is that
> >    connections from the host are now made to the new CID.  This means
> >    the local address of the listen socket is automatically updated to
> >    the new CID.
> > 
> >  * Sockets in other states are unchanged.
> > 
> > Applications must handle disruptive migration by reconnecting if
> > necessary after ECONNRESET.
> > 
> > 3. Checkpoint/restore for seamless migration
> > 
> > Applications that wish to communicate across live migration can do so
> > but this requires extra application-specific checkpoint/restore code.
> > 
> > This is similar to the approach taken by the CRIU project where
> > getsockopt()/setsockopt() is used to migrate socket state.  The
> > difference is that the application process is not automatically migrated
> > from the source host to the destination host.  Therefore, the
> > application needs to migrate its own state somehow.
> > 
> > The flow is as follows:
> > 
> > The application on the source host must quiesce (stop sending/receiving)
> > and use getsockopt() to extract socket state information from the host
> > kernel.
> > 
> > A new instance of the application is started on the destination host and
> > given the state so it can restore the connection.  The setsockopt()
> > syscall is used to restore socket state information.
> > 
> > The guest is given a list of <host_old_cid, host_new_cid, host_port,
> > guest_port> tuples for established connections that must not be reset
> > when the guest CID update notification is received.  These connections
> > will carry on as if nothing changed.
> > 
> > Note that the connection's remote address is updated from host_old_cid
> > to host_new_cid.  This allows remapping of CIDs (if necessary).
> > Typically this will be unused because the host always has well-known CID
> > 2.  In a guest<->guest scenario it may be used to remap CIDs.
> > 
> > 
> > For the time being I am focussing on the basic disruptive migration flow
> > only.  Checkpoint/restore can be added with a feature bit in the future.
> > It is a lot more complex and I'm not sure whether there will be any
> > users yet.
> > 
> > Stefan
> 
> This makes some things harder. For example, imagine a guest
> reboot mixed with migration. We don't know why did the connection
> die, so we'll retry connections until - when?
> 
> Could you please describe some user of vsock and show how
> it recovers from destructive migration?

qemu-guest-agent runs inside the guest with an AF_VSOCK listen socket.

libvirt arbitrates the qemu-guest-agent connection and provides an API
for applications to send commands.

When an application sends a command, libvirt checks if the connection to
qemu-guest-agent is established.  If there is no connection libvirt will
attempt to connect.

The command is sent to qemu-guest-agent and the response is handed back
to the guest application.  libvirt arbitrates access so commands from
multiple applications are serialized.

Live migration resets the established connection between
qemu-guest-agent and the source host's libvirt daemon.  When an
application issues the next qemu-guest-agent command the libvirt daemon
on the destination host notices there is no established connection yet
and starts a new one.

Libvirt refuses to send qemu-guest-agent commands while live migration
is in progress.

Stefan

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* RE: [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-15 11:11 UTC (permalink / raw)
  To: Michael S. Tsirkin, Dr. David Alan Gilbert
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, simhan@hpe.com,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	jitendra.kolhe@hpe.com, linux-mm@kvack.org,
	mohan_parthasarathy@hpe.com, Amit Shah, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160315121613-mutt-send-email-mst@redhat.com>

> On Mon, Mar 14, 2016 at 05:03:34PM +0000, Dr. David Alan Gilbert wrote:
> > * Li, Liang Z (liang.z.li@intel.com) wrote:
> > > >
> > > > Hi,
> > > >   I'm just catching back up on this thread; so without reference
> > > > to any particular previous mail in the thread.
> > > >
> > > >   1) How many of the free pages do we tell the host about?
> > > >      Your main change is telling the host about all the
> > > >      free pages.
> > >
> > > Yes, all the guest's free pages.
> > >
> > > >      If we tell the host about all the free pages, then we might
> > > >      end up needing to allocate more pages and update the host
> > > >      with pages we now want to use; that would have to wait for the
> > > >      host to acknowledge that use of these pages, since if we don't
> > > >      wait for it then it might have skipped migrating a page we
> > > >      just started using (I don't understand how your series solves that).
> > > >      So the guest probably needs to keep some free pages - how many?
> > >
> > > Actually, there is no need to care about whether the free pages will be
> used by the host.
> > > We only care about some of the free pages we get reused by the guest,
> right?
> > >
> > > The dirty page logging can be used to solve this, starting the dirty
> > > page logging before getting the free pages informant from guest.
> > > Even some of the free pages are modified by the guest during the
> > > process of getting the free pages information, these modified pages will
> be traced by the dirty page logging mechanism. So in the following
> migration_bitmap_sync() function.
> > > The pages in the free pages bitmap, but latter was modified, will be
> > > reset to dirty. We won't omit any dirtied pages.
> > >
> > > So, guest doesn't need to keep any free pages.
> >
> > OK, yes, that works; so we do:
> >   * enable dirty logging
> >   * ask guest for free pages
> >   * initialise the migration bitmap as everything-free
> >   * then later we do the normal sync-dirty bitmap stuff and it all just works.
> >
> > That's nice and simple.
> 
> This works once, sure. But there's an issue is that you have to defer migration
> until you get the free page list, and this only works once. So you end up with
> heuristics about how long to wait.
> 
> Instead I propose:
> 
> - mark all pages dirty as we do now.
> 
> - at start of migration, start tracking dirty
>   pages in kvm, and tell guest to start tracking free pages
> 
> we can now introduce any kind of delay, for example wait for ack from guest,
> or do whatever else, or even just start migrating pages
> 
> - repeatedly:
> 	- get list of free pages from guest
> 	- clear them in migration bitmap
> 	- get dirty list from kvm
> 
> - at end of migration, stop tracking writes in kvm,
>   and tell guest to stop tracking free pages

I had thought of filtering out the free pages in each migration bitmap synchronization. 
The advantage is we can skip process as many free pages as possible. Not just once.
The disadvantage is that we should change the current memory management code to track the free pages,
instead of traversing the free page list to construct the free pages bitmap, to reduce the overhead to get the free pages bitmap.
I am not sure the if the Kernel people would like it.

If keeping the traversing mechanism, because of the overhead, maybe it's not worth to filter out the free pages repeatedly.

Liang

^ permalink raw reply

* Re: [RFC qemu 0/4] A PV solution for live migration optimization
From: Michael S. Tsirkin @ 2016-03-15 10:29 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, simhan@hpe.com,
	Li, Liang Z, linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	jitendra.kolhe@hpe.com, linux-mm@kvack.org,
	mohan_parthasarathy@hpe.com, Amit Shah, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160314170334.GK2234@work-vm>

On Mon, Mar 14, 2016 at 05:03:34PM +0000, Dr. David Alan Gilbert wrote:
> * Li, Liang Z (liang.z.li@intel.com) wrote:
> > > 
> > > Hi,
> > >   I'm just catching back up on this thread; so without reference to any
> > > particular previous mail in the thread.
> > > 
> > >   1) How many of the free pages do we tell the host about?
> > >      Your main change is telling the host about all the
> > >      free pages.
> > 
> > Yes, all the guest's free pages.
> > 
> > >      If we tell the host about all the free pages, then we might
> > >      end up needing to allocate more pages and update the host
> > >      with pages we now want to use; that would have to wait for the
> > >      host to acknowledge that use of these pages, since if we don't
> > >      wait for it then it might have skipped migrating a page we
> > >      just started using (I don't understand how your series solves that).
> > >      So the guest probably needs to keep some free pages - how many?
> > 
> > Actually, there is no need to care about whether the free pages will be used by the host.
> > We only care about some of the free pages we get reused by the guest, right?
> > 
> > The dirty page logging can be used to solve this, starting the dirty page logging before getting
> > the free pages informant from guest. Even some of the free pages are modified by the guest
> > during the process of getting the free pages information, these modified pages will be traced
> > by the dirty page logging mechanism. So in the following migration_bitmap_sync() function.
> > The pages in the free pages bitmap, but latter was modified, will be reset to dirty. We won't
> > omit any dirtied pages.
> > 
> > So, guest doesn't need to keep any free pages.
> 
> OK, yes, that works; so we do:
>   * enable dirty logging
>   * ask guest for free pages
>   * initialise the migration bitmap as everything-free
>   * then later we do the normal sync-dirty bitmap stuff and it all just works.
> 
> That's nice and simple.

This works once, sure. But there's an issue is that you have
to defer migration until you get the free page list,
and this only works once. So you end up with heuristics
about how long to wait.

Instead I propose:

- mark all pages dirty as we do now.

- at start of migration, start tracking dirty
  pages in kvm, and tell guest to start tracking free pages

we can now introduce any kind of delay, for
example wait for ack from guest, or do whatever else,
or even just start migrating pages

- repeatedly:
	- get list of free pages from guest
	- clear them in migration bitmap
	- get dirty list from kvm

- at end of migration, stop tracking writes in kvm,
  and tell guest to stop tracking free pages



> > >   2) Clearing out caches
> > >      Does it make sense to clean caches?  They're apparently useful data
> > >      so if we clean them it's likely to slow the guest down; I guess
> > >      they're also likely to be fairly static data - so at least fairly
> > >      easy to migrate.
> > >      The answer here partially depends on what you want from your migration;
> > >      if you're after the fastest possible migration time it might make
> > >      sense to clean the caches and avoid migrating them; but that might
> > >      be at the cost of more disruption to the guest - there's a trade off
> > >      somewhere and it's not clear to me how you set that depending on your
> > >      guest/network/reqirements.
> > > 
> > 
> > Yes, clean the caches is an option.  Let the users decide using it or not.
> > 
> > >   3) Why is ballooning slow?
> > >      You've got a figure of 5s to balloon on an 8GB VM - but an
> > >      8GB VM isn't huge; so I worry about how long it would take
> > >      on a big VM.   We need to understand why it's slow
> > >        * is it due to the guest shuffling pages around?
> > >        * is it due to the virtio-balloon protocol sending one page
> > >          at a time?
> > >          + Do balloon pages normally clump in physical memory
> > >             - i.e. would a 'large balloon' message help
> > >             - or do we need a bitmap because it tends not to clump?
> > > 
> > 
> > I didn't do a comprehensive test. But I found most of the time spending
> > on allocating the pages and sending the PFNs to guest, I don't know that's
> > the most time consuming operation, allocating the pages or sending the PFNs.
> 
> It might be a good idea to analyse it a bit more to convince people where
> the problem is.
> 
> > >        * is it due to the madvise on the host?
> > >          If we were using the normal balloon messages, then we
> > >          could, during migration, just route those to the migration
> > >          code rather than bothering with the madvise.
> > >          If they're clumping together we could just turn that into
> > >          one big madvise; if they're not then would we benefit from
> > >          a call that lets us madvise lots of areas?
> > > 
> > 
> > My test showed madvise() is not the main reason for the long time, only taken
> > 10% of the total  inflating balloon operation time.
> > Big madvise can more or less improve the performance.
> 
> OK; 10% of the total is still pretty big even for your 8GB VM.
> 
> > >   4) Speeding up the migration of those free pages
> > >     You're using the bitmap to avoid migrating those free pages; HPe's
> > >     patchset is reconstructing a bitmap from the balloon data;  OK, so
> > >     this all makes sense to avoid migrating them - I'd also been thinking
> > >     of using pagemap to spot zero pages that would help find other zero'd
> > >     pages, but perhaps ballooned is enough?
> > > 
> > Could you describe your ideal with more details?
> 
> At the moment the migration code spends a fair amount of time checking if a page
> is zero; I was thinking perhaps the qemu could just open /proc/self/pagemap
> and check if the page was mapped; that would seem cheap if we're checking big
> ranges; and that would find all the balloon pages.
> 
> > >   5) Second-migrate
> > >     Given a VM where you've done all those tricks on, what happens when
> > >     you migrate it a second time?   I guess you're aiming for the guest
> > >     to update it's bitmap;  HPe's solution is to migrate it's balloon
> > >     bitmap along with the migration data.
> > 
> > Nothing is special in the second migration, QEMU will request the guest for free pages
> > Information, and the guest will traverse it's current free page list to construct a
> > new free page bitmap and send it to QEMU. Just like in the first migration.
> 
> Right.
> 
> Dave
> 
> > Liang
> > > 
> > > Dave
> > > 
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply

* Re: [PATCH v1 19/19] zram: use __GFP_MOVABLE for memory allocation
From: Sergey Senozhatsky @ 2016-03-15  6:56 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim, koct9i,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1457681423-26664-20-git-send-email-minchan@kernel.org>

On (03/11/16 16:30), Minchan Kim wrote:
[..]
> init
> Node 0, zone      DMA    208    120     51     41     11      0      0      0      0      0      0
> Node 0, zone    DMA32  16380  13777   9184   3805    789     54      3      0      0      0      0
> compaction
> Node 0, zone      DMA    132     82     40     39     16      2      1      0      0      0      0
> Node 0, zone    DMA32   5219   5526   4969   3455   1831    677    139     15      0      0      0
> 
> new:
> 
> init
> Node 0, zone      DMA    379    115     97     19      2      0      0      0      0      0      0
> Node 0, zone    DMA32  18891  16774  10862   3947    637     21      0      0      0      0      0
> compaction  1
> Node 0, zone      DMA    214     66     87     29     10      3      0      0      0      0      0
> Node 0, zone    DMA32   1612   3139   3154   2469   1745    990    384     94      7      0      0
> 
> As you can see, compaction made so many high-order pages. Yay!
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss

^ permalink raw reply

* Re: [PATCH v1 11/19] zsmalloc: squeeze freelist into page->mapping
From: Minchan Kim @ 2016-03-15  6:51 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim, koct9i,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <20160315064053.GF1464@swordfish>

On Tue, Mar 15, 2016 at 03:40:53PM +0900, Sergey Senozhatsky wrote:
> On (03/11/16 16:30), Minchan Kim wrote:
> > -static void *location_to_obj(struct page *page, unsigned long obj_idx)
> > +static void objidx_to_page_and_ofs(struct size_class *class,
> > +				struct page *first_page,
> > +				unsigned long obj_idx,
> > +				struct page **obj_page,
> > +				unsigned long *ofs_in_page)
> 
> this looks big; 5 params, function "returning" both page and offset...
> any chance to split it in two steps, perhaps?

Yes, it's rather ugly but I don't have a good idea.
Feel free to suggest if you have a better idea.

> 
> besides, it is more intuitive (at least to me) when 'offset'
> shortened to 'offt', not 'ofs'.

Indeed. I will change it to get_page_and_offset instead of
abbreviation if we cannot refactor it more.

> 
> 	-ss
> 
> >  {
> > -	unsigned long obj;
> > +	int i;
> > +	unsigned long ofs;
> > +	struct page *cursor;
> > +	int nr_page;
> >  
> > -	if (!page) {
> > -		VM_BUG_ON(obj_idx);
> > -		return NULL;
> > -	}
> > +	ofs = obj_idx * class->size;
> > +	cursor = first_page;
> > +	nr_page = ofs >> PAGE_SHIFT;
> >  
> > -	obj = page_to_pfn(page) << OBJ_INDEX_BITS;
> > -	obj |= ((obj_idx) & OBJ_INDEX_MASK);
> > -	obj <<= OBJ_TAG_BITS;
> > +	*ofs_in_page = ofs & ~PAGE_MASK;
> > +
> > +	for (i = 0; i < nr_page; i++)
> > +		cursor = get_next_page(cursor);
> >  
> > -	return (void *)obj;
> > +	*obj_page = cursor;
> >  }

^ permalink raw reply

* Re: [PATCH v1 09/19] zsmalloc: keep max_object in size_class
From: Minchan Kim @ 2016-03-15  6:41 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim, koct9i,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <20160315062824.GE1464@swordfish>

On Tue, Mar 15, 2016 at 03:28:24PM +0900, Sergey Senozhatsky wrote:
> On (03/11/16 16:30), Minchan Kim wrote:
> > Every zspage in a size_class has same number of max objects so
> > we could move it to a size_class.
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  mm/zsmalloc.c | 29 ++++++++++++++---------------
> >  1 file changed, 14 insertions(+), 15 deletions(-)
> > 
> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> > index b4fb11831acb..ca663c82c1fc 100644
> > --- a/mm/zsmalloc.c
> > +++ b/mm/zsmalloc.c
> > @@ -32,8 +32,6 @@
> >   *	page->freelist: points to the first free object in zspage.
> >   *		Free objects are linked together using in-place
> >   *		metadata.
> > - *	page->objects: maximum number of objects we can store in this
> > - *		zspage (class->zspage_order * PAGE_SIZE / class->size)
> >   *	page->lru: links together first pages of various zspages.
> >   *		Basically forming list of zspages in a fullness group.
> >   *	page->mapping: class index and fullness group of the zspage
> > @@ -211,6 +209,7 @@ struct size_class {
> >  	 * of ZS_ALIGN.
> >  	 */
> >  	int size;
> > +	int objs_per_zspage;
> >  	unsigned int index;
> 
> struct page ->objects "comes for free". now we don't use it, instead
> every size_class grows by 4 bytes? is there any reason for this?

It is union with _mapcount and it is used by checking non-lru movable
page in this patchset.
> 
> 	-ss

^ permalink raw reply

* Re: [PATCH v1 11/19] zsmalloc: squeeze freelist into page->mapping
From: Sergey Senozhatsky @ 2016-03-15  6:40 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim, koct9i,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1457681423-26664-12-git-send-email-minchan@kernel.org>

On (03/11/16 16:30), Minchan Kim wrote:
> -static void *location_to_obj(struct page *page, unsigned long obj_idx)
> +static void objidx_to_page_and_ofs(struct size_class *class,
> +				struct page *first_page,
> +				unsigned long obj_idx,
> +				struct page **obj_page,
> +				unsigned long *ofs_in_page)

this looks big; 5 params, function "returning" both page and offset...
any chance to split it in two steps, perhaps?

besides, it is more intuitive (at least to me) when 'offset'
shortened to 'offt', not 'ofs'.

	-ss

>  {
> -	unsigned long obj;
> +	int i;
> +	unsigned long ofs;
> +	struct page *cursor;
> +	int nr_page;
>  
> -	if (!page) {
> -		VM_BUG_ON(obj_idx);
> -		return NULL;
> -	}
> +	ofs = obj_idx * class->size;
> +	cursor = first_page;
> +	nr_page = ofs >> PAGE_SHIFT;
>  
> -	obj = page_to_pfn(page) << OBJ_INDEX_BITS;
> -	obj |= ((obj_idx) & OBJ_INDEX_MASK);
> -	obj <<= OBJ_TAG_BITS;
> +	*ofs_in_page = ofs & ~PAGE_MASK;
> +
> +	for (i = 0; i < nr_page; i++)
> +		cursor = get_next_page(cursor);
>  
> -	return (void *)obj;
> +	*obj_page = cursor;
>  }

^ permalink raw reply

* Re: [PATCH v1 09/19] zsmalloc: keep max_object in size_class
From: Sergey Senozhatsky @ 2016-03-15  6:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim, koct9i,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1457681423-26664-10-git-send-email-minchan@kernel.org>

On (03/11/16 16:30), Minchan Kim wrote:
> Every zspage in a size_class has same number of max objects so
> we could move it to a size_class.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  mm/zsmalloc.c | 29 ++++++++++++++---------------
>  1 file changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index b4fb11831acb..ca663c82c1fc 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -32,8 +32,6 @@
>   *	page->freelist: points to the first free object in zspage.
>   *		Free objects are linked together using in-place
>   *		metadata.
> - *	page->objects: maximum number of objects we can store in this
> - *		zspage (class->zspage_order * PAGE_SIZE / class->size)
>   *	page->lru: links together first pages of various zspages.
>   *		Basically forming list of zspages in a fullness group.
>   *	page->mapping: class index and fullness group of the zspage
> @@ -211,6 +209,7 @@ struct size_class {
>  	 * of ZS_ALIGN.
>  	 */
>  	int size;
> +	int objs_per_zspage;
>  	unsigned int index;

struct page ->objects "comes for free". now we don't use it, instead
every size_class grows by 4 bytes? is there any reason for this?

	-ss

^ permalink raw reply

* Re: [PATCH v1 08/19] zsmalloc: remove unused pool param in obj_free
From: Sergey Senozhatsky @ 2016-03-15  6:21 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim, koct9i,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1457681423-26664-9-git-send-email-minchan@kernel.org>

On (03/11/16 16:30), Minchan Kim wrote:
> Let's remove unused pool param in obj_free
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss

^ permalink raw reply

* Re: [PATCH v1 07/19] zsmalloc: reordering function parameter
From: Sergey Senozhatsky @ 2016-03-15  6:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim, koct9i,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1457681423-26664-8-git-send-email-minchan@kernel.org>

On (03/11/16 16:30), Minchan Kim wrote:
> This patch cleans up function parameter ordering to order
> higher data structure first.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss

^ permalink raw reply

* Re: [PATCH v1 06/19] zsmalloc: clean up many BUG_ON
From: Sergey Senozhatsky @ 2016-03-15  6:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim, koct9i,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1457681423-26664-7-git-send-email-minchan@kernel.org>

On (03/11/16 16:30), Minchan Kim wrote:
> There are many BUG_ON in zsmalloc.c which is not recommened so
> change them as alternatives.
> 
> Normal rule is as follows:
> 
> 1. avoid BUG_ON if possible. Instead, use VM_BUG_ON or VM_BUG_ON_PAGE
> 2. use VM_BUG_ON_PAGE if we need to see struct page's fields
> 3. use those assertion in primitive functions so higher functions
> can rely on the assertion in the primitive function.
> 4. Don't use assertion if following instruction can trigger Oops
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss

^ permalink raw reply

* Re: [PATCH v1 05/19] zsmalloc: use first_page rather than page
From: Sergey Senozhatsky @ 2016-03-15  6:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim, koct9i,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1457681423-26664-6-git-send-email-minchan@kernel.org>

On (03/11/16 16:30), Minchan Kim wrote:
> This patch cleans up function parameter "struct page".
> Many functions of zsmalloc expects that page paramter is "first_page"
> so use "first_page" rather than "page" for code readability.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss

^ permalink raw reply

* RE: [RFC qemu 0/4] A PV solution for live migration optimization
From: Li, Liang Z @ 2016-03-15  3:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
	simhan@hpe.com, linux-kernel@vger.kernel.org,
	qemu-devel@nongnu.org, jitendra.kolhe@hpe.com, linux-mm@kvack.org,
	mohan_parthasarathy@hpe.com, Amit Shah, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <20160314170334.GK2234@work-vm>

> > > Hi,
> > >   I'm just catching back up on this thread; so without reference to
> > > any particular previous mail in the thread.
> > >
> > >   1) How many of the free pages do we tell the host about?
> > >      Your main change is telling the host about all the
> > >      free pages.
> >
> > Yes, all the guest's free pages.
> >
> > >      If we tell the host about all the free pages, then we might
> > >      end up needing to allocate more pages and update the host
> > >      with pages we now want to use; that would have to wait for the
> > >      host to acknowledge that use of these pages, since if we don't
> > >      wait for it then it might have skipped migrating a page we
> > >      just started using (I don't understand how your series solves that).
> > >      So the guest probably needs to keep some free pages - how many?
> >
> > Actually, there is no need to care about whether the free pages will be
> used by the host.
> > We only care about some of the free pages we get reused by the guest,
> right?
> >
> > The dirty page logging can be used to solve this, starting the dirty
> > page logging before getting the free pages informant from guest. Even
> > some of the free pages are modified by the guest during the process of
> > getting the free pages information, these modified pages will be traced by
> the dirty page logging mechanism. So in the following
> migration_bitmap_sync() function.
> > The pages in the free pages bitmap, but latter was modified, will be
> > reset to dirty. We won't omit any dirtied pages.
> >
> > So, guest doesn't need to keep any free pages.
> 
> OK, yes, that works; so we do:
>   * enable dirty logging
>   * ask guest for free pages
>   * initialise the migration bitmap as everything-free
>   * then later we do the normal sync-dirty bitmap stuff and it all just works.
> 
> That's nice and simple.
> 
> > >   2) Clearing out caches
> > >      Does it make sense to clean caches?  They're apparently useful data
> > >      so if we clean them it's likely to slow the guest down; I guess
> > >      they're also likely to be fairly static data - so at least fairly
> > >      easy to migrate.
> > >      The answer here partially depends on what you want from your
> migration;
> > >      if you're after the fastest possible migration time it might make
> > >      sense to clean the caches and avoid migrating them; but that might
> > >      be at the cost of more disruption to the guest - there's a trade off
> > >      somewhere and it's not clear to me how you set that depending on
> your
> > >      guest/network/reqirements.
> > >
> >
> > Yes, clean the caches is an option.  Let the users decide using it or not.
> >
> > >   3) Why is ballooning slow?
> > >      You've got a figure of 5s to balloon on an 8GB VM - but an
> > >      8GB VM isn't huge; so I worry about how long it would take
> > >      on a big VM.   We need to understand why it's slow
> > >        * is it due to the guest shuffling pages around?
> > >        * is it due to the virtio-balloon protocol sending one page
> > >          at a time?
> > >          + Do balloon pages normally clump in physical memory
> > >             - i.e. would a 'large balloon' message help
> > >             - or do we need a bitmap because it tends not to clump?
> > >
> >
> > I didn't do a comprehensive test. But I found most of the time
> > spending on allocating the pages and sending the PFNs to guest, I
> > don't know that's the most time consuming operation, allocating the pages
> or sending the PFNs.
> 
> It might be a good idea to analyse it a bit more to convince people where the
> problem is.
> 

Yes, I will try to measure the time spending on different parts.

> > >        * is it due to the madvise on the host?
> > >          If we were using the normal balloon messages, then we
> > >          could, during migration, just route those to the migration
> > >          code rather than bothering with the madvise.
> > >          If they're clumping together we could just turn that into
> > >          one big madvise; if they're not then would we benefit from
> > >          a call that lets us madvise lots of areas?
> > >
> >
> > My test showed madvise() is not the main reason for the long time,
> > only taken 10% of the total  inflating balloon operation time.
> > Big madvise can more or less improve the performance.
> 
> OK; 10% of the total is still pretty big even for your 8GB VM.
> 
> > >   4) Speeding up the migration of those free pages
> > >     You're using the bitmap to avoid migrating those free pages; HPe's
> > >     patchset is reconstructing a bitmap from the balloon data;  OK, so
> > >     this all makes sense to avoid migrating them - I'd also been thinking
> > >     of using pagemap to spot zero pages that would help find other zero'd
> > >     pages, but perhaps ballooned is enough?
> > >
> > Could you describe your ideal with more details?
> 
> At the moment the migration code spends a fair amount of time checking if a
> page is zero; I was thinking perhaps the qemu could just open
> /proc/self/pagemap and check if the page was mapped; that would seem
> cheap if we're checking big ranges; and that would find all the balloon pages.
> 

Even if virtio-balloon is not enabled, it can be used to find the pages that never used
by guest.

> > >   5) Second-migrate
> > >     Given a VM where you've done all those tricks on, what happens when
> > >     you migrate it a second time?   I guess you're aiming for the guest
> > >     to update it's bitmap;  HPe's solution is to migrate it's balloon
> > >     bitmap along with the migration data.
> >
> > Nothing is special in the second migration, QEMU will request the
> > guest for free pages Information, and the guest will traverse it's
> > current free page list to construct a new free page bitmap and send it to
> QEMU. Just like in the first migration.
> 
> Right.
> 
> Dave
> 
> > Liang
> > >
> > > Dave
> > >
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply

* Re: [PATCH V4 0/3] basic busy polling support for vhost_net
From: Jason Wang @ 2016-03-15  3:10 UTC (permalink / raw)
  To: Michael Rapoport, Greg Kurz
  Cc: yang.zhang.wz, kvm, mst, netdev, linux-kernel, virtualization,
	borntraeger
In-Reply-To: <201603100648.u2A6mSTl020833@d06av07.portsmouth.uk.ibm.com>



On 03/10/2016 02:48 PM, Michael Rapoport wrote:
> Hi Greg,
>
>> > Greg Kurz <gkurz@linux.vnet.ibm.com> wrote on 03/09/2016 09:26:45 PM:
>>> > > On Fri,  4 Mar 2016 06:24:50 -0500
>>> > > Jason Wang <jasowang@redhat.com> wrote:
>> > 
>>> > > This series tries to add basic busy polling for vhost net. The idea is
>>> > > simple: at the end of tx/rx processing, busy polling for new tx added
>>> > > descriptor and rx receive socket for a while. The maximum number of
>>> > > time (in us) could be spent on busy polling was specified ioctl.
>>> > > 
>>> > > Test A were done through:
>>> > > 
>>> > > - 50 us as busy loop timeout
>>> > > - Netperf 2.6
>>> > > - Two machines with back to back connected mlx4
>> > 
>> > Hi Jason,
>> > 
>> > Could this also improve performance if both VMs are
>> > on the same host system ?
> I've experimented a little with Jason's patches and guest-to-guest netperf 
> when both guests were on the same host, and I saw improvements for that 
> case.
>  

Good to know this, I haven't tested this before but from the codes, it
should work for VM2VM case too.

Thanks a lot for the testing.

^ permalink raw reply

* Re: [PATCH v1 01/19] mm: use put_page to free page instead of putback_lru_page
From: Minchan Kim @ 2016-03-15  1:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: jlayton, Rik van Riel, aquini, rknize, Sergey Senozhatsky,
	Hugh Dickins, linux-kernel, virtualization, bfields, linux-mm,
	Gioh Kim, Mel Gorman, Andrew Morton, Naoya Horiguchi, Joonsoo Kim,
	koct9i
In-Reply-To: <56E67AE1.60700@suse.cz>

On Mon, Mar 14, 2016 at 09:48:33AM +0100, Vlastimil Babka wrote:
> On 03/11/2016 08:30 AM, Minchan Kim wrote:
> >Procedure of page migration is as follows:
> >
> >First of all, it should isolate a page from LRU and try to
> >migrate the page. If it is successful, it releases the page
> >for freeing. Otherwise, it should put the page back to LRU
> >list.
> >
> >For LRU pages, we have used putback_lru_page for both freeing
> >and putback to LRU list. It's okay because put_page is aware of
> >LRU list so if it releases last refcount of the page, it removes
> >the page from LRU list. However, It makes unnecessary operations
> >(e.g., lru_cache_add, pagevec and flags operations.
> 
> Yeah, and compaction (perhaps also other migration users) has to
> drain the lru pvec... Getting rid of this stuff is worth even by
> itself.

Good note. Although we cannot remove lru pvec draining completely,
at least, this patch removes a case which should drain pvec for
returning freed page to buddy.

Thanks for the notice.

> 
> >It would be
> >not significant but no worth to do) and harder to support new
> >non-lru page migration because put_page isn't aware of non-lru
> >page's data structure.
> >
> >To solve the problem, we can add new hook in put_page with
> >PageMovable flags check but it can increase overhead in
> >hot path and needs new locking scheme to stabilize the flag check
> >with put_page.
> >
> >So, this patch cleans it up to divide two semantic(ie, put and putback).
> >If migration is successful, use put_page instead of putback_lru_page and
> >use putback_lru_page only on failure. That makes code more readable
> >and doesn't add overhead in put_page.
> 
> I had an idea of checking for count==1 in putback_lru_page() which
> would take the put_page() shortcut from there. But maybe it can't be
> done nicely without races.

I thought about it and we might do it via page_freeze_refs but
what I want at this moment is to separte two semantic put and putback.
;-)

> 
> >Cc: Vlastimil Babka <vbabka@suse.cz>
> >Cc: Mel Gorman <mgorman@suse.de>
> >Cc: Hugh Dickins <hughd@google.com>
> >Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Note in -next/after 4.6-rc1 this will need some rebasing though.

Thanks for the review!

^ permalink raw reply

* Re: [RFC qemu 0/4] A PV solution for live migration optimization
From: Dr. David Alan Gilbert @ 2016-03-14 17:03 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
	simhan@hpe.com, linux-kernel@vger.kernel.org,
	qemu-devel@nongnu.org, jitendra.kolhe@hpe.com, linux-mm@kvack.org,
	mohan_parthasarathy@hpe.com, Amit Shah, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, rth@twiddle.net
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E0414B118@shsmsx102.ccr.corp.intel.com>

* Li, Liang Z (liang.z.li@intel.com) wrote:
> > 
> > Hi,
> >   I'm just catching back up on this thread; so without reference to any
> > particular previous mail in the thread.
> > 
> >   1) How many of the free pages do we tell the host about?
> >      Your main change is telling the host about all the
> >      free pages.
> 
> Yes, all the guest's free pages.
> 
> >      If we tell the host about all the free pages, then we might
> >      end up needing to allocate more pages and update the host
> >      with pages we now want to use; that would have to wait for the
> >      host to acknowledge that use of these pages, since if we don't
> >      wait for it then it might have skipped migrating a page we
> >      just started using (I don't understand how your series solves that).
> >      So the guest probably needs to keep some free pages - how many?
> 
> Actually, there is no need to care about whether the free pages will be used by the host.
> We only care about some of the free pages we get reused by the guest, right?
> 
> The dirty page logging can be used to solve this, starting the dirty page logging before getting
> the free pages informant from guest. Even some of the free pages are modified by the guest
> during the process of getting the free pages information, these modified pages will be traced
> by the dirty page logging mechanism. So in the following migration_bitmap_sync() function.
> The pages in the free pages bitmap, but latter was modified, will be reset to dirty. We won't
> omit any dirtied pages.
> 
> So, guest doesn't need to keep any free pages.

OK, yes, that works; so we do:
  * enable dirty logging
  * ask guest for free pages
  * initialise the migration bitmap as everything-free
  * then later we do the normal sync-dirty bitmap stuff and it all just works.

That's nice and simple.

> >   2) Clearing out caches
> >      Does it make sense to clean caches?  They're apparently useful data
> >      so if we clean them it's likely to slow the guest down; I guess
> >      they're also likely to be fairly static data - so at least fairly
> >      easy to migrate.
> >      The answer here partially depends on what you want from your migration;
> >      if you're after the fastest possible migration time it might make
> >      sense to clean the caches and avoid migrating them; but that might
> >      be at the cost of more disruption to the guest - there's a trade off
> >      somewhere and it's not clear to me how you set that depending on your
> >      guest/network/reqirements.
> > 
> 
> Yes, clean the caches is an option.  Let the users decide using it or not.
> 
> >   3) Why is ballooning slow?
> >      You've got a figure of 5s to balloon on an 8GB VM - but an
> >      8GB VM isn't huge; so I worry about how long it would take
> >      on a big VM.   We need to understand why it's slow
> >        * is it due to the guest shuffling pages around?
> >        * is it due to the virtio-balloon protocol sending one page
> >          at a time?
> >          + Do balloon pages normally clump in physical memory
> >             - i.e. would a 'large balloon' message help
> >             - or do we need a bitmap because it tends not to clump?
> > 
> 
> I didn't do a comprehensive test. But I found most of the time spending
> on allocating the pages and sending the PFNs to guest, I don't know that's
> the most time consuming operation, allocating the pages or sending the PFNs.

It might be a good idea to analyse it a bit more to convince people where
the problem is.

> >        * is it due to the madvise on the host?
> >          If we were using the normal balloon messages, then we
> >          could, during migration, just route those to the migration
> >          code rather than bothering with the madvise.
> >          If they're clumping together we could just turn that into
> >          one big madvise; if they're not then would we benefit from
> >          a call that lets us madvise lots of areas?
> > 
> 
> My test showed madvise() is not the main reason for the long time, only taken
> 10% of the total  inflating balloon operation time.
> Big madvise can more or less improve the performance.

OK; 10% of the total is still pretty big even for your 8GB VM.

> >   4) Speeding up the migration of those free pages
> >     You're using the bitmap to avoid migrating those free pages; HPe's
> >     patchset is reconstructing a bitmap from the balloon data;  OK, so
> >     this all makes sense to avoid migrating them - I'd also been thinking
> >     of using pagemap to spot zero pages that would help find other zero'd
> >     pages, but perhaps ballooned is enough?
> > 
> Could you describe your ideal with more details?

At the moment the migration code spends a fair amount of time checking if a page
is zero; I was thinking perhaps the qemu could just open /proc/self/pagemap
and check if the page was mapped; that would seem cheap if we're checking big
ranges; and that would find all the balloon pages.

> >   5) Second-migrate
> >     Given a VM where you've done all those tricks on, what happens when
> >     you migrate it a second time?   I guess you're aiming for the guest
> >     to update it's bitmap;  HPe's solution is to migrate it's balloon
> >     bitmap along with the migration data.
> 
> Nothing is special in the second migration, QEMU will request the guest for free pages
> Information, and the guest will traverse it's current free page list to construct a
> new free page bitmap and send it to QEMU. Just like in the first migration.

Right.

Dave

> Liang
> > 
> > Dave
> > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply

* Re: [virtio-dev] virtio-vsock live migration
From: Michael S. Tsirkin @ 2016-03-14 11:13 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-dev, Claudio Imbrenda, Christian Borntraeger,
	Matt Benjamin, virtualization, Christoffer Dall
In-Reply-To: <20160303153737.GA19780@stefanha-x1.localdomain>

On Thu, Mar 03, 2016 at 03:37:37PM +0000, Stefan Hajnoczi wrote:
> Michael pointed out that the virtio-vsock draft specification does not
> address live migration and in fact currently precludes migration.
> 
> Migration is fundamental so the device specification at least mustn't
> preclude it.  Having brainstormed migration with Matthew Benjamin and
> Michael Tsirkin, I am now summarizing the approach that I want to
> include in the next draft specification.
> 
> Feedback and comments welcome!  In the meantime I will implement this in
> code and update the draft specification.

Most of the issue seems to be a consequence of using a 4 byte CID.

I think the right thing to do is just to teach guests
about 64 bit CIDs.

For now, can we drop guest CID from guest to host communication completely,
making CID only host-visible? Maybe leave the space
in the packet so we can add CID there later.
It seems that in theory this will allow changing CID
during migration, transparently to the guest.

Guest visible CID is required for guest to guest communication -
but IIUC that is not currently supported.
Maybe that can be made conditional on 64 bit addressing.
Alternatively, it seems much easier to accept that these channels get broken
across migration.


> 1. Requirements
> 
> Virtio-vsock is a new AF_VSOCK transport.  As such, it should provide at
> least the same guarantees as the existing AF_VSOCK VMCI transport.  This
> is for consistency and to allow code reuse across any AF_VSOCK
> transport.
> 
> Virtio-vsock aims to replace virtio-serial by providing the same
> guest/host communication ability but with sockets API semantics that are
> more popular and convenient for application developers.  Therefore
> virtio-vsock migration should provide at least the same level of
> migration functionality as virtio-serial.
> 
> Ideally it should be possible to migrate applications using AF_VSOCK
> together with the virtual machine so that guest<->host communication is
> interrupted.  Neither AF_VSOCK VMCI nor virtio-serial support this
> today.
> 
> 2. Basic disruptive migration flow
> 
> When the virtual machine migrates from the source host to the
> destination host, the guest's CID may change.  The CID namespace is
> host-wide so other hosts may have CID collisions and allocate a new CID
> for incoming migration VMs.
> 
> The device notifies the guest that the CID has changed.  Guest sockets
> are affected as follows:
> 
>  * Established connections are reset (ECONNRESET) and the guest
>    application will have to reconnect.
> 
>  * Listen sockets remain open.  The only thing to note is that
>    connections from the host are now made to the new CID.  This means
>    the local address of the listen socket is automatically updated to
>    the new CID.
> 
>  * Sockets in other states are unchanged.
> 
> Applications must handle disruptive migration by reconnecting if
> necessary after ECONNRESET.
> 
> 3. Checkpoint/restore for seamless migration
> 
> Applications that wish to communicate across live migration can do so
> but this requires extra application-specific checkpoint/restore code.
> 
> This is similar to the approach taken by the CRIU project where
> getsockopt()/setsockopt() is used to migrate socket state.  The
> difference is that the application process is not automatically migrated
> from the source host to the destination host.  Therefore, the
> application needs to migrate its own state somehow.
> 
> The flow is as follows:
> 
> The application on the source host must quiesce (stop sending/receiving)
> and use getsockopt() to extract socket state information from the host
> kernel.
> 
> A new instance of the application is started on the destination host and
> given the state so it can restore the connection.  The setsockopt()
> syscall is used to restore socket state information.
> 
> The guest is given a list of <host_old_cid, host_new_cid, host_port,
> guest_port> tuples for established connections that must not be reset
> when the guest CID update notification is received.  These connections
> will carry on as if nothing changed.
> 
> Note that the connection's remote address is updated from host_old_cid
> to host_new_cid.  This allows remapping of CIDs (if necessary).
> Typically this will be unused because the host always has well-known CID
> 2.  In a guest<->guest scenario it may be used to remap CIDs.
> 
> 
> For the time being I am focussing on the basic disruptive migration flow
> only.  Checkpoint/restore can be added with a feature bit in the future.
> It is a lot more complex and I'm not sure whether there will be any
> users yet.
> 
> Stefan

^ permalink raw reply

* Re: [PATCH v1 01/19] mm: use put_page to free page instead of putback_lru_page
From: Vlastimil Babka @ 2016-03-14  8:48 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim,
	Mel Gorman, jlayton, Naoya Horiguchi, Joonsoo Kim, koct9i
In-Reply-To: <1457681423-26664-2-git-send-email-minchan@kernel.org>

On 03/11/2016 08:30 AM, Minchan Kim wrote:
> Procedure of page migration is as follows:
>
> First of all, it should isolate a page from LRU and try to
> migrate the page. If it is successful, it releases the page
> for freeing. Otherwise, it should put the page back to LRU
> list.
>
> For LRU pages, we have used putback_lru_page for both freeing
> and putback to LRU list. It's okay because put_page is aware of
> LRU list so if it releases last refcount of the page, it removes
> the page from LRU list. However, It makes unnecessary operations
> (e.g., lru_cache_add, pagevec and flags operations.

Yeah, and compaction (perhaps also other migration users) has to drain 
the lru pvec... Getting rid of this stuff is worth even by itself.

> It would be
> not significant but no worth to do) and harder to support new
> non-lru page migration because put_page isn't aware of non-lru
> page's data structure.
>
> To solve the problem, we can add new hook in put_page with
> PageMovable flags check but it can increase overhead in
> hot path and needs new locking scheme to stabilize the flag check
> with put_page.
>
> So, this patch cleans it up to divide two semantic(ie, put and putback).
> If migration is successful, use put_page instead of putback_lru_page and
> use putback_lru_page only on failure. That makes code more readable
> and doesn't add overhead in put_page.

I had an idea of checking for count==1 in putback_lru_page() which would 
take the put_page() shortcut from there. But maybe it can't be done 
nicely without races.

> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Note in -next/after 4.6-rc1 this will need some rebasing though.

^ permalink raw reply

* Re: [PATCH v1 13/19] zsmalloc: factor page chain functionality out
From: Minchan Kim @ 2016-03-14  4:58 UTC (permalink / raw)
  To: xuyiping
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim, koct9i,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <56E38870.5090408@hisilicon.com>

On Sat, Mar 12, 2016 at 11:09:36AM +0800, xuyiping wrote:
> 
> 
> On 2016/3/11 15:30, Minchan Kim wrote:
> >For migration, we need to create sub-page chain of zspage
> >dynamically so this patch factors it out from alloc_zspage.
> >
> >As a minor refactoring, it makes OBJ_ALLOCATED_TAG assign
> >more clear in obj_malloc(it could be another patch but it's
> >trivial so I want to put together in this patch).
> >
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> >  mm/zsmalloc.c | 78 ++++++++++++++++++++++++++++++++++-------------------------
> >  1 file changed, 45 insertions(+), 33 deletions(-)
> >
> >diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> >index bfc6a048afac..f86f8aaeb902 100644
> >--- a/mm/zsmalloc.c
> >+++ b/mm/zsmalloc.c
> >@@ -977,7 +977,9 @@ static void init_zspage(struct size_class *class, struct page *first_page)
> >  	unsigned long off = 0;
> >  	struct page *page = first_page;
> >
> >-	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >+	first_page->freelist = NULL;
> >+	INIT_LIST_HEAD(&first_page->lru);
> >+	set_zspage_inuse(first_page, 0);
> >
> >  	while (page) {
> >  		struct page *next_page;
> >@@ -1022,13 +1024,44 @@ static void init_zspage(struct size_class *class, struct page *first_page)
> >  	set_freeobj(first_page, 0);
> >  }
> >
> >+static void create_page_chain(struct page *pages[], int nr_pages)
> >+{
> >+	int i;
> >+	struct page *page;
> >+	struct page *prev_page = NULL;
> >+	struct page *first_page = NULL;
> >+
> >+	for (i = 0; i < nr_pages; i++) {
> >+		page = pages[i];
> >+
> >+		INIT_LIST_HEAD(&page->lru);
> >+		if (i == 0) {
> >+			SetPagePrivate(page);
> >+			set_page_private(page, 0);
> >+			first_page = page;
> >+		}
> >+
> >+		if (i == 1)
> >+			set_page_private(first_page, (unsigned long)page);
> >+		if (i >= 1)
> >+			set_page_private(page, (unsigned long)first_page);
> >+		if (i >= 2)
> >+			list_add(&page->lru, &prev_page->lru);
> >+		if (i == nr_pages - 1)
> >+			SetPagePrivate2(page);
> >+
> >+		prev_page = page;
> >+	}
> >+}
> >+
> >  /*
> >   * Allocate a zspage for the given size class
> >   */
> >  static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
> >  {
> >-	int i, error;
> >+	int i;
> >  	struct page *first_page = NULL, *uninitialized_var(prev_page);
> >+	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE];
> >
> >  	/*
> >  	 * Allocate individual pages and link them together as:
> >@@ -1041,43 +1074,23 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
> 
> 	*uninitialized_var(prev_page) in alloc_zspage is not in use more.

True.
It says why we should avoid uninitialized_var if possible.
If we didn't use uninitialized_var, compiler could warn about it
when I did build test.

Thanks.

^ permalink raw reply

* Re: [PATCH v1 09/19] zsmalloc: keep max_object in size_class
From: Minchan Kim @ 2016-03-14  4:55 UTC (permalink / raw)
  To: xuyiping
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim, koct9i,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <56E37490.7060606@hisilicon.com>

On Sat, Mar 12, 2016 at 09:44:48AM +0800, xuyiping wrote:
> 
> 
> On 2016/3/11 15:30, Minchan Kim wrote:
> >Every zspage in a size_class has same number of max objects so
> >we could move it to a size_class.
> >
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> >  mm/zsmalloc.c | 29 ++++++++++++++---------------
> >  1 file changed, 14 insertions(+), 15 deletions(-)
> >
> >diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> >index b4fb11831acb..ca663c82c1fc 100644
> >--- a/mm/zsmalloc.c
> >+++ b/mm/zsmalloc.c
> >@@ -32,8 +32,6 @@
> >   *	page->freelist: points to the first free object in zspage.
> >   *		Free objects are linked together using in-place
> >   *		metadata.
> >- *	page->objects: maximum number of objects we can store in this
> >- *		zspage (class->zspage_order * PAGE_SIZE / class->size)
> >   *	page->lru: links together first pages of various zspages.
> >   *		Basically forming list of zspages in a fullness group.
> >   *	page->mapping: class index and fullness group of the zspage
> >@@ -211,6 +209,7 @@ struct size_class {
> >  	 * of ZS_ALIGN.
> >  	 */
> >  	int size;
> >+	int objs_per_zspage;
> >  	unsigned int index;
> >
> >  	struct zs_size_stat stats;
> >@@ -622,21 +621,22 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
> >   * the pool (not yet implemented). This function returns fullness
> >   * status of the given page.
> >   */
> >-static enum fullness_group get_fullness_group(struct page *first_page)
> >+static enum fullness_group get_fullness_group(struct size_class *class,
> >+						struct page *first_page)
> >  {
> >-	int inuse, max_objects;
> >+	int inuse, objs_per_zspage;
> >  	enum fullness_group fg;
> >
> >  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >
> >  	inuse = first_page->inuse;
> >-	max_objects = first_page->objects;
> >+	objs_per_zspage = class->objs_per_zspage;
> >
> >  	if (inuse == 0)
> >  		fg = ZS_EMPTY;
> >-	else if (inuse == max_objects)
> >+	else if (inuse == objs_per_zspage)
> >  		fg = ZS_FULL;
> >-	else if (inuse <= 3 * max_objects / fullness_threshold_frac)
> >+	else if (inuse <= 3 * objs_per_zspage / fullness_threshold_frac)
> >  		fg = ZS_ALMOST_EMPTY;
> >  	else
> >  		fg = ZS_ALMOST_FULL;
> >@@ -723,7 +723,7 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
> >  	enum fullness_group currfg, newfg;
> >
> >  	get_zspage_mapping(first_page, &class_idx, &currfg);
> >-	newfg = get_fullness_group(first_page);
> >+	newfg = get_fullness_group(class, first_page);
> >  	if (newfg == currfg)
> >  		goto out;
> >
> >@@ -1003,9 +1003,6 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
> >  	init_zspage(class, first_page);
> >
> >  	first_page->freelist = location_to_obj(first_page, 0);
> >-	/* Maximum number of objects we can store in this zspage */
> >-	first_page->objects = class->pages_per_zspage * PAGE_SIZE / class->size;
> >-
> >  	error = 0; /* Success */
> >
> >  cleanup:
> >@@ -1235,11 +1232,11 @@ static bool can_merge(struct size_class *prev, int size, int pages_per_zspage)
> >  	return true;
> >  }
> >
> >-static bool zspage_full(struct page *first_page)
> >+static bool zspage_full(struct size_class *class, struct page *first_page)
> >  {
> >  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >
> >-	return first_page->inuse == first_page->objects;
> >+	return first_page->inuse == class->objs_per_zspage;
> >  }
> >
> >  unsigned long zs_get_total_pages(struct zs_pool *pool)
> >@@ -1625,7 +1622,7 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
> >  		}
> >
> >  		/* Stop if there is no more space */
> >-		if (zspage_full(d_page)) {
> >+		if (zspage_full(class, d_page)) {
> >  			unpin_tag(handle);
> >  			ret = -ENOMEM;
> >  			break;
> >@@ -1684,7 +1681,7 @@ static enum fullness_group putback_zspage(struct zs_pool *pool,
> >  {
> >  	enum fullness_group fullness;
> >
> >-	fullness = get_fullness_group(first_page);
> >+	fullness = get_fullness_group(class, first_page);
> >  	insert_zspage(class, fullness, first_page);
> >  	set_zspage_mapping(first_page, class->index, fullness);
> >
> >@@ -1933,6 +1930,8 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
> >  		class->size = size;
> >  		class->index = i;
> >  		class->pages_per_zspage = pages_per_zspage;
> >+		class->objs_per_zspage = class->pages_per_zspage *
> >+						PAGE_SIZE / class->size;
> >  		if (pages_per_zspage == 1 &&
> >  			get_maxobj_per_zspage(size, pages_per_zspage) == 1)
> >  			class->huge = true;
> 
> 		computes the "objs_per_zspage" twice here.
> 
> 		class->objs_per_zspage = get_maxobj_per_zspage(size,
> 						pages_per_zspage);
> 		if (pages_per_zspage == 1 && class->objs_per_zspage ==1)
> 			class->huge = true;

Yeb. I will do.

Thanks.

^ permalink raw reply

* Re: [PATCH v1 13/19] zsmalloc: factor page chain functionality out
From: xuyiping @ 2016-03-12  3:09 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim, koct9i,
	jlayton, Joonsoo Kim, Vlastimil Babka, Mel Gorman
In-Reply-To: <1457681423-26664-14-git-send-email-minchan@kernel.org>



On 2016/3/11 15:30, Minchan Kim wrote:
> For migration, we need to create sub-page chain of zspage
> dynamically so this patch factors it out from alloc_zspage.
>
> As a minor refactoring, it makes OBJ_ALLOCATED_TAG assign
> more clear in obj_malloc(it could be another patch but it's
> trivial so I want to put together in this patch).
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>   mm/zsmalloc.c | 78 ++++++++++++++++++++++++++++++++++-------------------------
>   1 file changed, 45 insertions(+), 33 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index bfc6a048afac..f86f8aaeb902 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -977,7 +977,9 @@ static void init_zspage(struct size_class *class, struct page *first_page)
>   	unsigned long off = 0;
>   	struct page *page = first_page;
>
> -	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +	first_page->freelist = NULL;
> +	INIT_LIST_HEAD(&first_page->lru);
> +	set_zspage_inuse(first_page, 0);
>
>   	while (page) {
>   		struct page *next_page;
> @@ -1022,13 +1024,44 @@ static void init_zspage(struct size_class *class, struct page *first_page)
>   	set_freeobj(first_page, 0);
>   }
>
> +static void create_page_chain(struct page *pages[], int nr_pages)
> +{
> +	int i;
> +	struct page *page;
> +	struct page *prev_page = NULL;
> +	struct page *first_page = NULL;
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		page = pages[i];
> +
> +		INIT_LIST_HEAD(&page->lru);
> +		if (i == 0) {
> +			SetPagePrivate(page);
> +			set_page_private(page, 0);
> +			first_page = page;
> +		}
> +
> +		if (i == 1)
> +			set_page_private(first_page, (unsigned long)page);
> +		if (i >= 1)
> +			set_page_private(page, (unsigned long)first_page);
> +		if (i >= 2)
> +			list_add(&page->lru, &prev_page->lru);
> +		if (i == nr_pages - 1)
> +			SetPagePrivate2(page);
> +
> +		prev_page = page;
> +	}
> +}
> +
>   /*
>    * Allocate a zspage for the given size class
>    */
>   static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
>   {
> -	int i, error;
> +	int i;
>   	struct page *first_page = NULL, *uninitialized_var(prev_page);
> +	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE];
>
>   	/*
>   	 * Allocate individual pages and link them together as:
> @@ -1041,43 +1074,23 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)

	*uninitialized_var(prev_page) in alloc_zspage is not in use more.

>   	 * (i.e. no other sub-page has this flag set) and PG_private_2 to
>   	 * identify the last page.
>   	 */
> -	error = -ENOMEM;
>   	for (i = 0; i < class->pages_per_zspage; i++) {
>   		struct page *page;
>
>   		page = alloc_page(flags);
> -		if (!page)
> -			goto cleanup;
> -
> -		INIT_LIST_HEAD(&page->lru);
> -		if (i == 0) {	/* first page */
> -			page->freelist = NULL;
> -			SetPagePrivate(page);
> -			set_page_private(page, 0);
> -			first_page = page;
> -			set_zspage_inuse(page, 0);
> +		if (!page) {
> +			while (--i >= 0)
> +				__free_page(pages[i]);
> +			return NULL;
>   		}
> -		if (i == 1)
> -			set_page_private(first_page, (unsigned long)page);
> -		if (i >= 1)
> -			set_page_private(page, (unsigned long)first_page);
> -		if (i >= 2)
> -			list_add(&page->lru, &prev_page->lru);
> -		if (i == class->pages_per_zspage - 1)	/* last page */
> -			SetPagePrivate2(page);
> -		prev_page = page;
> +
> +		pages[i] = page;
>   	}
>
> +	create_page_chain(pages, class->pages_per_zspage);
> +	first_page = pages[0];
>   	init_zspage(class, first_page);
>
> -	error = 0; /* Success */
> -
> -cleanup:
> -	if (unlikely(error) && first_page) {
> -		free_zspage(first_page);
> -		first_page = NULL;
> -	}
> -
>   	return first_page;
>   }
>
> @@ -1419,7 +1432,6 @@ static unsigned long obj_malloc(struct size_class *class,
>   	unsigned long m_offset;
>   	void *vaddr;
>
> -	handle |= OBJ_ALLOCATED_TAG;
>   	obj = get_freeobj(first_page);
>   	objidx_to_page_and_ofs(class, first_page, obj,
>   				&m_page, &m_offset);
> @@ -1429,10 +1441,10 @@ static unsigned long obj_malloc(struct size_class *class,
>   	set_freeobj(first_page, link->next >> OBJ_ALLOCATED_TAG);
>   	if (!class->huge)
>   		/* record handle in the header of allocated chunk */
> -		link->handle = handle;
> +		link->handle = handle | OBJ_ALLOCATED_TAG;
>   	else
>   		/* record handle in first_page->private */
> -		set_page_private(first_page, handle);
> +		set_page_private(first_page, handle | OBJ_ALLOCATED_TAG);
>   	kunmap_atomic(vaddr);
>   	mod_zspage_inuse(first_page, 1);
>   	zs_stat_inc(class, OBJ_USED, 1);
>

^ permalink raw reply

* Re: [PATCH v1 09/19] zsmalloc: keep max_object in size_class
From: xuyiping @ 2016-03-12  1:44 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim, koct9i,
	jlayton, Joonsoo Kim, Vlastimil Babka, Mel Gorman
In-Reply-To: <1457681423-26664-10-git-send-email-minchan@kernel.org>



On 2016/3/11 15:30, Minchan Kim wrote:
> Every zspage in a size_class has same number of max objects so
> we could move it to a size_class.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>   mm/zsmalloc.c | 29 ++++++++++++++---------------
>   1 file changed, 14 insertions(+), 15 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index b4fb11831acb..ca663c82c1fc 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -32,8 +32,6 @@
>    *	page->freelist: points to the first free object in zspage.
>    *		Free objects are linked together using in-place
>    *		metadata.
> - *	page->objects: maximum number of objects we can store in this
> - *		zspage (class->zspage_order * PAGE_SIZE / class->size)
>    *	page->lru: links together first pages of various zspages.
>    *		Basically forming list of zspages in a fullness group.
>    *	page->mapping: class index and fullness group of the zspage
> @@ -211,6 +209,7 @@ struct size_class {
>   	 * of ZS_ALIGN.
>   	 */
>   	int size;
> +	int objs_per_zspage;
>   	unsigned int index;
>
>   	struct zs_size_stat stats;
> @@ -622,21 +621,22 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
>    * the pool (not yet implemented). This function returns fullness
>    * status of the given page.
>    */
> -static enum fullness_group get_fullness_group(struct page *first_page)
> +static enum fullness_group get_fullness_group(struct size_class *class,
> +						struct page *first_page)
>   {
> -	int inuse, max_objects;
> +	int inuse, objs_per_zspage;
>   	enum fullness_group fg;
>
>   	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>
>   	inuse = first_page->inuse;
> -	max_objects = first_page->objects;
> +	objs_per_zspage = class->objs_per_zspage;
>
>   	if (inuse == 0)
>   		fg = ZS_EMPTY;
> -	else if (inuse == max_objects)
> +	else if (inuse == objs_per_zspage)
>   		fg = ZS_FULL;
> -	else if (inuse <= 3 * max_objects / fullness_threshold_frac)
> +	else if (inuse <= 3 * objs_per_zspage / fullness_threshold_frac)
>   		fg = ZS_ALMOST_EMPTY;
>   	else
>   		fg = ZS_ALMOST_FULL;
> @@ -723,7 +723,7 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
>   	enum fullness_group currfg, newfg;
>
>   	get_zspage_mapping(first_page, &class_idx, &currfg);
> -	newfg = get_fullness_group(first_page);
> +	newfg = get_fullness_group(class, first_page);
>   	if (newfg == currfg)
>   		goto out;
>
> @@ -1003,9 +1003,6 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
>   	init_zspage(class, first_page);
>
>   	first_page->freelist = location_to_obj(first_page, 0);
> -	/* Maximum number of objects we can store in this zspage */
> -	first_page->objects = class->pages_per_zspage * PAGE_SIZE / class->size;
> -
>   	error = 0; /* Success */
>
>   cleanup:
> @@ -1235,11 +1232,11 @@ static bool can_merge(struct size_class *prev, int size, int pages_per_zspage)
>   	return true;
>   }
>
> -static bool zspage_full(struct page *first_page)
> +static bool zspage_full(struct size_class *class, struct page *first_page)
>   {
>   	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>
> -	return first_page->inuse == first_page->objects;
> +	return first_page->inuse == class->objs_per_zspage;
>   }
>
>   unsigned long zs_get_total_pages(struct zs_pool *pool)
> @@ -1625,7 +1622,7 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
>   		}
>
>   		/* Stop if there is no more space */
> -		if (zspage_full(d_page)) {
> +		if (zspage_full(class, d_page)) {
>   			unpin_tag(handle);
>   			ret = -ENOMEM;
>   			break;
> @@ -1684,7 +1681,7 @@ static enum fullness_group putback_zspage(struct zs_pool *pool,
>   {
>   	enum fullness_group fullness;
>
> -	fullness = get_fullness_group(first_page);
> +	fullness = get_fullness_group(class, first_page);
>   	insert_zspage(class, fullness, first_page);
>   	set_zspage_mapping(first_page, class->index, fullness);
>
> @@ -1933,6 +1930,8 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
>   		class->size = size;
>   		class->index = i;
>   		class->pages_per_zspage = pages_per_zspage;
> +		class->objs_per_zspage = class->pages_per_zspage *
> +						PAGE_SIZE / class->size;
>   		if (pages_per_zspage == 1 &&
>   			get_maxobj_per_zspage(size, pages_per_zspage) == 1)
>   			class->huge = true;

		computes the "objs_per_zspage" twice here.

		class->objs_per_zspage = get_maxobj_per_zspage(size,
						pages_per_zspage);
		if (pages_per_zspage == 1 && class->objs_per_zspage ==1)
			class->huge = true;

>

^ permalink raw reply

* Re: [PATCH v1 03/19] fs/anon_inodes: new interface to create new inode
From: Gioh Kim @ 2016-03-11 14:24 UTC (permalink / raw)
  To: Al Viro, Minchan Kim
  Cc: Rik van Riel, aquini, rknize, Sergey Senozhatsky, Hugh Dickins,
	linux-kernel, virtualization, bfields, linux-mm, Gioh Kim, koct9i,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <20160311080503.GR17997@ZenIV.linux.org.uk>



On 11.03.2016 09:05, Al Viro wrote:
> On Fri, Mar 11, 2016 at 04:30:07PM +0900, Minchan Kim wrote:
>> From: Gioh Kim <gurugio@hanmail.net>
>>
>> The anon_inodes has already complete interfaces to create manage
>> many anonymous inodes but don't have interface to get
>> new inode. Other sub-modules can create anonymous inode
>> without creating and mounting it's own pseudo filesystem.
> IMO that's a bad idea.  In case of aio "creating and mounting" takes this:
> static struct dentry *aio_mount(struct file_system_type *fs_type,
>                                  int flags, const char *dev_name, void *data)
> {
>          static const struct dentry_operations ops = {
>                  .d_dname        = simple_dname,
>          };
>          return mount_pseudo(fs_type, "aio:", NULL, &ops, AIO_RING_MAGIC);
> }
> and
>          static struct file_system_type aio_fs = {
>                  .name           = "aio",
>                  .mount          = aio_mount,
>                  .kill_sb        = kill_anon_super,
>          };
>          aio_mnt = kern_mount(&aio_fs);
>
> All of 12 lines.  Your export is not much shorter.  To quote old mail on
> the same topic:
I know what aio_setup() does. It can be a solution.
But I thought creating anon_inode_new() is simpler than several drivers 
create its own pseudo filesystem.
Creating a filesystem requires memory allocation and locking some lists 
even though it is pseudo.

Could you inform me if there is a reason we should avoid creating 
anonymous inode?

>
>> Note that anon_inodes.c reason to exist was "it's for situations where
>> all context lives on struct file and we don't need separate inode for
>> them".  Going from that to "it happens to contain a handy function for inode
>> allocation"...


-- 
Best regards,
Gioh Kim

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox