From: Don Slutz <dslutz@verizon.com>
To: Jan Beulich <JBeulich@suse.com>, Don Slutz <dslutz@verizon.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>, Tim Deegan <tim@xen.org>,
Kevin Tian <kevin.tian@intel.com>, Keir Fraser <keir@xen.org>,
Ian Campbell <ian.campbell@citrix.com>,
Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
George Dunlap <George.Dunlap@eu.citrix.com>,
Andrew Cooper <andrew.cooper3@citrix.com>,
Ian Jackson <ian.jackson@eu.citrix.com>,
xen-devel@lists.xen.org, Eddie Dong <eddie.dong@intel.com>,
Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>,
Boris Ostrovsky <boris.ostrovsky@oracle.com>
Subject: Re: [PATCH for-4.5 v8 4/7] xen: Add vmware_port support
Date: Mon, 26 Jan 2015 10:58:51 -0500 [thread overview]
Message-ID: <54C6643B.1@terremark.com> (raw)
In-Reply-To: <54C0C39F0200007800057F73@mail.emea.novell.com>
On 01/22/15 03:32, Jan Beulich wrote:
>>>> On 21.01.15 at 18:52, <dslutz@verizon.com> wrote:
>> On 01/16/15 05:09, Jan Beulich wrote:
>>>>>> On 03.10.14 at 00:40, <dslutz@verizon.com> wrote:
>>>> This is a new domain_create() flag, DOMCRF_vmware_port. It is
>>>> passed to domctl as XEN_DOMCTL_CDF_vmware_port.
>>> Can you explain why a HVM param isn't suitable here?
>>>
>> The issue is that you need this flag during construct_vmcb() and
>> construct_vmcs(). While Intel has vmx_update_exception_bitmap()
>> AMD does not. So when HVM param's are setup and/or changed there
>> currently is no way to adjust AMD's exception bitmap.
>>
>> So this is the simpler way.
> But the less desirable one from a design/consistency perspective.
> Unless other maintainers disagree, I'd like to see this changed.
Ok, but will wait some time to see if "Unless other maintainers disagree"
>>>> This is both a more complete support then in currently provided by
>>>> QEMU and/or KVM and less. The missing part requires QEMU changes
>>>> and has been left out until the QEMU patches are accepted upstream.
>>> I vaguely recall the question having been asked before, but I can't
>>> find it to the answer to it: If qemu has support for this, why can't
>>> you build on that rather than adding everything in the hypervisor?
>>>
>> The v10 version of this patch set (which is waiting for an adjusted
>> QEMU (the released 2.2.0 is one) does use QEMU for more VMware port
>> support. The issues are:
> Was there a newer version of these posted than the v8 I looked at?
> If so, I must have overlooked the posting (as otherwise I would of
> course have commented on the newer version).
>
The newer version was being worked on, but had not been posted (and had no
changes to this patch). Since it was never posted, I will just continue
getting v9
(instead of v10) into shape to post.
>> 1) QEMU needs access to parts of CPU registers to handle VMware port.
>> 2) You need to allow ring 3 access to this 1 I/O port.
>> 3) There is more state in xen that would need to also be sent to
>> QEMU if all support is moved to QEMU.
> Understood.
>
>>>> @@ -2111,6 +2112,31 @@ svm_vmexit_do_vmsave(struct vmcb_struct *vmcb,
>>>> return;
>>>> }
>>>>
>>>> +static void svm_vmexit_gp_intercept(struct cpu_user_regs *regs,
>>>> + struct vcpu *v)
>>>> +{
>>>> + struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
>>>> + /*
>>>> + * Just use 15 for the instruction length; vmport_gp_check will
>>>> + * adjust it. This is because
>>>> + * __get_instruction_length_from_list() has issues, and may
>>>> + * require a double read of the instruction bytes. At some
>>>> + * point a new routine could be added that is based on the code
>>>> + * in vmport_gp_check with extensions to make it more general.
>>>> + * Since that routine is the only user of this code this can be
>>>> + * done later.
>>>> + */
>>>> + unsigned long inst_len = 15;
>>> Surely this can be unsigned int?
>> The code is smaller this way. In vmx_vmexit_gp_intercept():
>>
>> unsigned long inst_len;
>> ...
>> __vmread(VM_EXIT_INSTRUCTION_LEN, &inst_len);
>> ...
>> rc = vmport_gp_check(regs, v, &inst_len, inst_addr,
>> ...
>>
>> So changing the argument to vmport_gp_check() to "unsigned int" would
>> add code there.
> So be it then. Generic code shouldn't use odd types just because of
> vendor specific code needs it, unless this makes things _a lot_ more
> complicated.
>
Ok. Since It looks like I will not be using get_instruction_length() I will
change this to "unsigned int" (or should I use "unsigned short" or
"unsigned byte"?).
>>>> + int rc = X86EMUL_OKAY;
>>>> +
>>>> + if ( regs->_eax == BDOOR_MAGIC )
>>> With this, is handling other than 32-bit in/out really meaningful/
>>> correct?
>>>
>> Yes. Harder to use, but since VMware allows it, I allow it also.
> But then a comment explaining the non-architectural (from an
> instruction set perspective) behavior is the minimum that's
> needed for future readers (and reviewers) to understand this.
Ok will add.
>>>> + case BDOOR_CMD_GETHWVERSION:
>>>> + /* vmware_hw */
>>>> + regs->_eax = 0;get_instruction_length
>>>> + if ( is_hvm_vcpu(curr) )
>>> Since you can't get here for PV, I can't see what you need this
>>> conditional for.
>>>
>> Since I was not 100% sure, I was being safe. Would converting
>> this to be a "debug=y" check be ok?
> ASSERT() would indeed be the right vehicle.
>
Will do.
>>>> + {
>>>> + struct hvm_domain *hd = &d->arch.hvm_domain;
>>>> +
>>>> + regs->_eax = hd->params[HVM_PARAM_VMWARE_HW];
>>>> + }
>>>> + if ( !regs->_eax )
>>>> + regs->_eax = 4; /* Act like version 4 */
>>> Why version 4?
>> That is the 1st version that VMware was more consistent in the handling
>> of the "VMware hardware version". Any value between 1 and 6 would be
>> ok. This should only happen in strange configs.
> Please make the comment say so then.
>
Will do.
>>>> + /* hostUsecs */
>>>> + regs->_ebx =value / 1000000ULL;
>>>> + /* maxTimeLag */
>>>> + regs->_ecx = 1000000;
>>>> + break;
>>> Perhaps this should share code with BDOOR_CMD_GETTIME; I have
>>> to admit though that I can't make any sense of why the latter one
>>> has a FULL suffix when it returns _less_ information.
>>>
>> Sharing of code is not simple.
>> Since I did not pick the names, VMware did.
>>
>> Bug found. The full returns data in si & dx.
>> will fix. And also makes sharing more complex then not.
> Of course if the current code is incomplete, sharing makes less sense
> once completed.
>
>>>> + unsigned char bytes[MAX_INST_LEN];
>>>> + unsigned int fetch_len;
>>>> + int frc;
>>>> +
>>>> + /* in or out are limited to 32bits */
>>>> + if ( byte_cnt > 4 )
>>>> + byte_cnt = 4;
>>>> +
>>>> + /*
>>>> + * Fetch up to the next page break; we'll fetch from the
>>>> + * next page later if we have to.
>>>> + */
>>>> + fetch_len = min_t(unsigned int, *inst_len,
>>>> + PAGE_SIZE - (inst_addr & ~PAGE_MASK));
>>>> + frc = hvm_fetch_from_guest_virt_nofault(bytes, inst_addr, fetch_len,
>>>> + PFEC_page_present);
>>>> + if ( frc != HVMCOPY_okay )
>>>> + {
>>>> + gdprintk(XENLOG_WARNING,
>>>> + "Bad instruction fetch at %#lx (frc=%d il=%lu fl=%u)\n",
>>>> + (unsigned long) inst_addr, frc, *inst_len, fetch_len);
>>> Pointless cast. But the value of log messages like this one is
>>> questionable anyway.
>>>
>> Will drop cast. I am not sure it is possible to get here. The best I
>> have come up with is to change the GDT entry for CS to fault, then do
>> this instruction. Not sure it would fault, and clearly is an attempt
>> to break in.
>>
>> I do know that if Xen is running under VMware (Why anyone would do
>> this?) this is possible.
>>
>> With all this, should I just drop this message (or make it debug=y
>> only)?
> Yes - dropping would be preferred by me, but I'd accept a debug=y
> only one too.
Ok, will drop.
>>>> +
>>>> + /* Only adjust byte_cnt 1 time */
>>>> + if ( bytes[0] == 0x66 ) /* operand size prefix */
>>>> + {
>>>> + if ( byte_cnt == 4 )
>>>> + byte_cnt = 2;
>>>> + else
>>>> + byte_cnt = 4;
>>>> + }
>>> Iirc REX.W set following 0x66 cancels the effect of the latter. Another
>>> thing x86emul would be taking care of for you if you used it.
>> I did not know this. Most of my testing was done without any check
>> for prefix(s). I.E. (Open) VMware Tools only uses the inl. I do
>> not know of anybody using 16bit segments and VMware tools.
> But this isn't the perspective to take when adding code to the
> hypervisor - you should always consider what a (perhaps
> malicious) guest could do.
Ok, but my read of this statement does not help decide which way
to go. I see several options:
1) Only allow #GP to work for 0xed (inl (%dx),%eax).
Pros: No attack surface for malicious guest.
No need for get_instruction_length().
No need for Intel to confirm the necessary hardware
behaviour as being architectural.
Cons: There may exist user apps. that work on VMware and
not on xen (16bit segments, realmode, vm86, etc).
2) Only allow #GP to work for all 4 I/O instructions without prefix.
Pros: No attack surface for malicious guest.
No need for get_instruction_length().
No need for Intel to confirm the necessary hardware
behaviour as being architectural.
Cons: There may exist user apps. that work on VMware and
not on xen (16bit segments, realmode, vm86, etc).
3) Only allow zero or one 0x66 prefix and 0xed (inl (%dx),%eax).
Pros: No attack surface for malicious guest.
No need for get_instruction_length().
No need for Intel to confirm the necessary hardware
behaviour as being architectural.
Cons: There may exist user apps. that work on VMware and
not on xen (using too many prefixes, using other
opcodes).
4) Only allow zero or one 0x66 prefix and all 4 I/O instructions.
Pros: No attack surface for malicious guest.
No need for get_instruction_length().
No need for Intel to confirm the necessary hardware
behaviour as being architectural.
Cons: There may exist user apps. that work on VMware and
not on xen (using too many prefixes).
5) Only allow zero to 14 0x66 prefix and 0xed (inl (%dx),%eax).
Pros: No attack surface for malicious guest.
Cons: There may exist user apps. that work on VMware and
not on xen (using unneeded prefixes, using other
opcodes).
5a: Would be cleaner with get_instruction_length() on Intel,
but would need for Intel to confirm the necessary hardware
behaviour as being architectural.
5b: Always pass in MAX_INST_LEN.
6) Only allow zero to 14 0x66 prefix and all 4 I/O instructions.
Pros: No attack surface for malicious guest.
Cons: There may exist user apps. that work on VMware and
not on xen (using unneeded prefixes).
6a: Would be cleaner with get_instruction_length() on Intel,
but would need for Intel to confirm the necessary hardware
behaviour as being architectural.
6b: Always pass in MAX_INST_LEN.
7) Add complete prefix handling, and all 4 I/O instructions
Pros: Limited attack surface for malicious guest (the handling
of all prefixes greatly increases the complexity of the
code).
Cons: Lots of added code.
7a: Would be cleaner with get_instruction_length() on Intel,
but would need for Intel to confirm the necessary hardware
behaviour as being architectural.
7b: Always pass in MAX_INST_LEN.
8) Use hvm_emulate_one().
Pros: shares code, reduces new code.
Cons: Adds a lot of attack surface for malicious guest.
I had picked #6, you asked for #8, but I read your answer as do not
do #8.
I would be happy to go with any of the 8 ways (or a way I did not list
above),
just need to know which one to focus on.
>>>> +static void vmx_vmexit_gp_intercept(struct cpu_user_regs *regs,
>>>> + struct vcpu *v)
>>>> +{
>>>> + unsigned long exit_qualification;
>>>> + unsigned long inst_len;
>>>> + unsigned long inst_addr = vmx_rip2pointer(regs, v);
>>>> + unsigned long ecode;
>>>> + int rc;
>>>> +#ifndef NDEBUG
>>>> + unsigned long orig_inst_len;
>>>> + unsigned long vector;
>>>> +
>>>> + __vmread(VM_EXIT_INTR_INFO, &vector);
>>>> + BUG_ON(!(vector & INTR_INFO_VALID_MASK));
>>>> + BUG_ON(!(vector & INTR_INFO_DELIVER_CODE_MASK));
>>>> +#endif
>>> If you use ASSERT() instead of BUG_ON(), I think you can avoid most
>>> of this preprocessor conditional.
>>>
>> I do not see how. vector only exists in "debug=y". So yes if using
>> ASSERT() I could move the #endif up 2 lines, but that does not
>> look better to me.
> I don't follow - ASSERT() is intentionally coded in a way such that
> variables used only by it don't cause compiler warnings. And the
> optimizer ought to be able to eliminate the then unnecessary
> __vmread().
>
I am more use to explicit conditional code and to not depend on the
compilers
to correctly optimize the code. Will change.
Since the most likely case is that I will stop using
get_instruction_length()
(__vmread(VM_EXIT_INSTRUCTION_LEN,...)). This drops the need for
orig_inst_len
also.
>>>> + __vmread(EXIT_QUALIFICATION, &exit_qualification);
>>>> + __vmread(VM_EXIT_INSTRUCTION_LEN, &inst_len);
>>> get_instruction_length(). But is it architecturally defined that
>>> #GP intercept vmexits actually set this field?
>>>
>> I could not find a clear statement.
> That's the point of my comment.
>
>> My reading of (directly out of
>> "Intel® 64 and IA-32 Architectures
>> Software Developer’s Manual
>> Volume 3 (3A, 3B & 3C):
>> [...]
>> to me says that yes, this field is set on a #GP exit on an IN. But the
>> #GP case is not called out by name.
> And it is not any of the cases mentioned.
>
>> My read is that a #GP fault is a "VM Exits Unconditionally" based on the
>> setting of the exception bit mask.
> Right, but it's not exactly an instruction based exit. Unless Intel
> confirms that your extending of the manual says is correct, I'd
> rather recommend against relying on unspecified behavior. If
> any CPU model ends up behaving differently, this might be
> rather hard to diagnose I'm afraid.
>
>> So not using get_instruction_length() does avoid a possible BUG_ON()
>> if I am wrong.
>>
>> So there are 3 options here:
>> 1) Add an ASSERT() like the BUG_ON() in get_instruction_length()
>> 2) Switch to using get_instruction_length()
>> 3) Switch to using MAX_INST_LEN.
>>
>> Let me know which way to go.
> As said above - use get_instruction_length() if Intel confirms the
> necessary hardware behavior as being architectural. If they
> don't, 3) looks like the only viable option.
So what is the procedure to getting "Intel confirms the necessary hardware
behaviour as being architectural"?
>>>> @@ -2182,6 +2183,8 @@ int nvmx_n2_vmexit_handler(struct cpu_user_regs *regs,
>>>> if ( v->fpu_dirtied )
>>>> nvcpu->nv_vmexit_pending = 1;
>>>> }
>>>> + else if ( vector == TRAP_gp_fault )
>>>> + nvcpu->nv_vmexit_pending = 1;
>>> Doesn't that mean an unconditional vmexit even if the L1 hypervisor
>>> didn't ask for such?
>> I might. I have not done any testing here for the nested VMX case.
>> I could just drop this for now and deside what to do for this code later.
> If dropping the code is safe without also forbidding the combination
> of nested and VMware emulation.
Will have to do a lot more testing to know. At the time I started the
coding
it was still considered experimental. Looks like for 4.6 I will need it
to be
fully unit tested.
-Don Slutz
> Jan
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
next prev parent reply other threads:[~2015-01-26 15:58 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-10-02 21:30 [PATCH for-4.5 v7 0/7] Xen VMware tools support Don Slutz
2014-10-02 21:30 ` [PATCH for-4.5 v7 1/7] xen: Add support for VMware cpuid leaves Don Slutz
2015-01-15 16:42 ` Jan Beulich
2015-01-15 21:00 ` Don Slutz
2015-01-16 7:57 ` Jan Beulich
2015-01-16 19:21 ` Don Slutz
2014-10-02 21:30 ` [PATCH for-4.5 v7 2/7] tools: Add vmware_hw support Don Slutz
2014-10-02 22:21 ` Andrew Cooper
2014-10-02 22:56 ` [PATCH for-4.5 v8 " Don Slutz
2014-10-02 21:30 ` [PATCH for-4.5 v7 3/7] vmware: Add VMware provided include files Don Slutz
2015-01-15 16:46 ` Jan Beulich
2015-01-15 21:36 ` Don Slutz
2014-10-02 21:30 ` [PATCH for-4.5 v7 4/7] xen: Add vmware_port support Don Slutz
2014-10-02 21:58 ` Don Slutz
2014-10-02 22:40 ` [PATCH for-4.5 v8 " Don Slutz
2015-01-16 10:09 ` Jan Beulich
2015-01-21 17:52 ` Don Slutz
2015-01-22 8:32 ` Jan Beulich
2015-01-26 15:58 ` Don Slutz [this message]
2015-01-26 16:46 ` Jan Beulich
2015-01-26 20:19 ` Don Slutz
2015-01-27 7:58 ` Jan Beulich
2015-01-28 8:19 ` Jan Beulich
2015-01-28 22:47 ` Don Slutz
2015-01-29 0:32 ` Don Slutz
2015-02-10 19:30 ` [PATCH " Don Slutz
2015-02-11 7:56 ` Jan Beulich
2015-02-11 17:04 ` Andrew Cooper
2015-02-17 7:45 ` Jan Beulich
2014-10-02 21:30 ` [PATCH for-4.5 v7 5/7] tools: " Don Slutz
2014-10-02 21:30 ` [PATCH for-4.5 v7 6/7] Add xentrace to vmware_port Don Slutz
2014-10-02 21:30 ` [OPTIONAL][PATCH for-4.5 v7 7/7] Add xen-hvm-param Don Slutz
2014-10-16 8:12 ` [PATCH for-4.5 v7 0/7] Xen VMware tools support Jan Beulich
2014-10-16 12:10 ` Don Slutz
2014-10-16 12:17 ` Ian Jackson
2014-10-16 12:22 ` Jan Beulich
2014-10-16 12:58 ` Don Slutz
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=54C6643B.1@terremark.com \
--to=dslutz@verizon.com \
--cc=George.Dunlap@eu.citrix.com \
--cc=JBeulich@suse.com \
--cc=andrew.cooper3@citrix.com \
--cc=boris.ostrovsky@oracle.com \
--cc=eddie.dong@intel.com \
--cc=ian.campbell@citrix.com \
--cc=ian.jackson@eu.citrix.com \
--cc=jun.nakajima@intel.com \
--cc=keir@xen.org \
--cc=kevin.tian@intel.com \
--cc=stefano.stabellini@eu.citrix.com \
--cc=suravee.suthikulpanit@amd.com \
--cc=tim@xen.org \
--cc=xen-devel@lists.xen.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.