From: Avi Kivity <avi@redhat.com>
To: Paul Mackerras <paulus@samba.org>
Cc: Alexander Graf <agraf@suse.de>,
kvm@vger.kernel.org, kvm-ppc@vger.kernel.org,
David Gibson <david@gibson.dropbear.id.au>,
Juan Quintela <quintela@redhat.com>,
Orit Wasserman <owasserm@redhat.com>,
Anthony Liguori <anthony@codemonkey.ws>
Subject: Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
Date: Tue, 16 Oct 2012 10:06:58 +0000
Message-ID: <507D31C2.1060800@redhat.com>
In-Reply-To: <20121016040152.GQ1218@drongo>
On 10/16/2012 06:01 AM, Paul Mackerras wrote:
> A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor. Reads on
> this fd return the contents of the HPT (hashed page table), writes
> create and/or remove entries in the HPT. There is a new capability,
> KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl. The ioctl
> takes an argument structure with the index of the first HPT entry to
> read out and a set of flags. The flags indicate whether the user is
> intending to read or write the HPT, and whether to return all entries
> or only the "bolted" entries (those with the bolted bit, 0x10, set in
> the first doubleword).
>
> This is intended for use in implementing qemu's savevm/loadvm and for
> live migration. Therefore, on reads, the first pass returns information
> about all HPTEs (or all bolted HPTEs). When the first pass reaches the
> end of the HPT, it returns from the read. Subsequent reads only return
> information about HPTEs that have changed since they were last read.
> A read that finds no changed HPTEs in the HPT following where the last
> read finished will return 0 bytes.
Copying people with interest in migration.
> +4.78 KVM_PPC_GET_HTAB_FD
> +
> +Capability: KVM_CAP_PPC_HTAB_FD
> +Architectures: powerpc
> +Type: vm ioctl
> +Parameters: Pointer to struct kvm_get_htab_fd (in)
> +Returns: file descriptor number (>= 0) on success, -1 on error
> +
> +This returns a file descriptor that can be used either to read out the
> +entries in the guest's hashed page table (HPT), or to write entries to
> +initialize the HPT. The returned fd can only be written to if the
> +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
> +can only be read if that bit is clear. The argument struct looks like
> +this:
> +
> +/* For KVM_PPC_GET_HTAB_FD */
> +struct kvm_get_htab_fd {
> + __u64 flags;
> + __u64 start_index;
> +};
> +
> +/* Values for kvm_get_htab_fd.flags */
> +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1)
> +#define KVM_GET_HTAB_WRITE ((__u64)0x2)
> +
> +The `start_index' field gives the index in the HPT of the entry at
> +which to start reading. It is ignored when writing.
> +
> +Reads on the fd will initially supply information about all
> +"interesting" HPT entries. Interesting entries are those with the
> +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
> +all entries. When the end of the HPT is reached, the read() will
> +return.
What happens if the read buffer is smaller than the HPT size?
What happens if the read buffer size is not a multiple of entry size?
Does/should the fd support O_NONBLOCK and poll (i.e., waiting for an
entry to change)?
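To make the first two questions concrete, here is a minimal sketch (not
taken from the patch) of how a userspace consumer might walk a buffer
filled by read() on this fd. It assumes the stream layout described in
the quoted documentation: an 8-byte header followed by n_valid entries
of 16 bytes each, in host byte order. The names htab_header,
htab_summary and htab_scan are hypothetical, and stopping at a trailing
partial chunk is a choice made for this sketch, not something the patch
specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HPTE_SIZE 16u /* bytes per valid HPT entry in the stream */

/* Local mirror of struct kvm_get_htab_header as quoted in the patch. */
struct htab_header {
    uint32_t index;     /* HPT index of the first entry in this chunk */
    uint16_t n_valid;   /* valid entries that follow the header */
    uint16_t n_invalid; /* invalid entries implied after the valid ones */
};

_Static_assert(sizeof(struct htab_header) == 8, "header must be 8 bytes");

/* Hypothetical result type for this sketch. */
struct htab_summary {
    size_t consumed;    /* bytes covered by complete chunks */
    size_t chunks;      /* number of complete chunks seen */
    uint64_t n_valid;   /* total valid entries in those chunks */
    uint64_t n_invalid; /* total invalid entries in those chunks */
};

/*
 * Walk `len` bytes of stream data. Only complete chunks (header plus all
 * n_valid entries) are counted; a truncated tail is left unconsumed so the
 * caller can decide what to do with it.
 */
struct htab_summary htab_scan(const uint8_t *buf, size_t len)
{
    struct htab_summary s = { 0, 0, 0, 0 };
    size_t off = 0;

    assert(buf != NULL || len == 0);

    while (len - off >= sizeof(struct htab_header)) {
        struct htab_header hdr;
        size_t body;

        /* memcpy avoids alignment assumptions about the buffer. */
        memcpy(&hdr, buf + off, sizeof(hdr));
        body = (size_t)hdr.n_valid * HPTE_SIZE;

        if (len - off - sizeof(hdr) < body)
            break; /* entries for this chunk are truncated */

        s.chunks++;
        s.n_valid += hdr.n_valid;
        s.n_invalid += hdr.n_invalid;
        off += sizeof(hdr) + body;
    }

    s.consumed = off;
    return s;
}
```

If `consumed` comes back smaller than the number of bytes read, the
buffer ended in the middle of a chunk, which is exactly the case the
questions above ask the kernel side to define.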
> If read() is called again on the fd, it will start again from
> +the beginning of the HPT, but will only return HPT entries that have
> +changed since they were last read.
> +
> +Data read or written is structured as a header (8 bytes) followed by a
> +series of valid HPT entries (16 bytes) each. The header indicates how
> +many valid HPT entries there are and how many invalid entries follow
> +the valid entries. The invalid entries are not represented explicitly
> +in the stream. The header format is:
> +
> +struct kvm_get_htab_header {
> + __u32 index;
> + __u16 n_valid;
> + __u16 n_invalid;
> +};
This structure forces the kernel to return entries sequentially. Will
this block changes to the data structure in the future? Or is the
hardware spec sufficiently strict that such changes are not realistic?
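For illustration, a minimal sketch of how one chunk in this format
could be assembled, assuming the layout from the quoted documentation
(8-byte header, then n_valid entries of 16 bytes each, host byte
order). The names htab_header and htab_put_chunk are hypothetical and
do not come from the patch. The index field lets each chunk start
anywhere in the HPT, but the entries inside a chunk are implicitly
consecutive, which is the sequential constraint in question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HPTE_SIZE 16u /* bytes per valid HPT entry in the stream */

/* Local mirror of struct kvm_get_htab_header as quoted in the patch. */
struct htab_header {
    uint32_t index;     /* HPT index of the first entry in this chunk */
    uint16_t n_valid;   /* valid entries that follow the header */
    uint16_t n_invalid; /* invalid entries implied after the valid ones */
};

_Static_assert(sizeof(struct htab_header) == 8, "header must be 8 bytes");

/*
 * Serialize one chunk into `out`: the header, then `n_valid` entries copied
 * from `entries`. The `n_invalid` entries are described only by the header
 * and take no space in the stream.
 *
 * Returns the number of bytes written, or 0 if `cap` is too small.
 */
size_t htab_put_chunk(uint8_t *out, size_t cap, uint32_t index,
                      const uint8_t *entries, uint16_t n_valid,
                      uint16_t n_invalid)
{
    struct htab_header hdr = { index, n_valid, n_invalid };
    size_t body = (size_t)n_valid * HPTE_SIZE;
    size_t need = sizeof(hdr) + body;

    assert(out != NULL);
    assert(entries != NULL || n_valid == 0);

    if (cap < need)
        return 0;

    memcpy(out, &hdr, sizeof(hdr));
    if (body)
        memcpy(out + sizeof(hdr), entries, body);

    return need;
}
```

A chunk with n_valid == 0 and a non-zero n_invalid is just the 8-byte
header, so a long run of invalid entries costs almost nothing to
describe.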
> +
> +Writes to the fd create HPT entries starting at the index given in the
> +header; first `n_valid' valid entries with contents from the data
> +written, then `n_invalid' invalid entries, invalidating any previously
> +valid entries found.
This scheme is a clever, original, and very interesting approach to live
migration. That doesn't necessarily mean a NAK; we should see if it
makes sense for other migration APIs as well (we currently have
difficulties migrating very large/wide guests).
What is the typical number of entries in the HPT? Do you have estimates
of the change rate?
Suppose new hardware arrives that supports nesting HPTs, so that kvm is
no longer synchronously aware of the guest HPT (similar to how NPT/EPT
made kvm unaware of guest virtual->physical translations on x86). How
will we deal with that? (But I guess this will be a
non-guest-transparent and non-userspace-transparent change, unlike
NPT/EPT, so a userspace ABI addition will be needed anyway.)
--
error compiling committee.c: too many arguments to function