From: Lars Kurth <lars.kurth@citrix.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
"msw@amazon.com" <msw@amazon.com>,
"aliguori@amazon.com" <aliguori@amazon.com>,
Antony Messerli <amesserl@rackspace.com>,
Rick Harris <rick.harris@rackspace.com>,
Paul Voccio <paul.voccio@rackspace.com>,
Steven Wilson <steven.wilson@rackspace.com>,
Major Hayden <major.hayden@rackspace.com>,
Josh Kearney <josh.kearney@rackspace.com>,
"jinsong.liu@alibaba-inc.com" <jinsong.liu@alibaba-inc.com>,
"xiantao.zxt@alibaba-inc.com" <xiantao.zxt@alibaba-inc.com>,
"boris.ostrovsky@oracle.com" <boris.ostrovsky@oracle.com>,
Daniel Kiper <daniel.kiper@oracle.com>,
Elena Ufimtseva <elena.ufimtseva@oracle.com>,
"bob.liu@oracle.com" <bob.liu@oracle.com>,
"hanweidong@huawei.com" <hanweidong@huawei.com>,
"peter.huangpeng@huawei.com" <peter.huangpeng@huawei.com>,
"fanhenglong@huawei.com" <fanhenglong@huawei.com>,
liuyingdong@huawei.com
Cc: "konrad@darnok.org" <konrad@darnok.org>
Subject: Re: [RFC v2] xSplice design
Date: Tue, 19 May 2015 19:13:03 +0000
Message-ID: <D180D933.1C0D0%lars.kurth@citrix.com>
In-Reply-To: <20150515194440.GA24313@l.oracle.com>
Adding Don Slutz as he requested to be added
Lars
On 15/05/2015 20:44, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com>
wrote:
>Hey!
>
>During the Xen Hacka^H^H^H^HProject Summit? we chatted about live-patching
>the hypervisor. We sketched out how it could be done, and brainstormed
>some of the problems.
>
>I took that and wrote a design - which is very much RFC. The design is
>laid out in two sections - the format of the ELF payload - and then the
>hypercalls to act on it.
>
>Hypercall preemption has caused a couple of XSAs so I've baked the need
>for that in the design so we hopefully won't have an XSA for this code.
>
>There are two big *TODO* in the design which I had hoped to get done
>before sending this out - however I am going on vacation for two weeks
>so I figured it would be better to send this off for folks to mull now
>than to have it languish.
>
>Please feel free to add more folks on the CC list.
>
>Enjoy!
>
>
># xSplice Design v1 (EXTERNAL RFC v2)
>
>## Rationale
>
>A mechanism is required to binary-patch the running hypervisor with new
>opcodes, primarily as a result of security updates.
>
>This document describes the design of the API that would allow us to
>upload binary patches to the hypervisor.
>
>## Glossary
>
> * splice - patch in the binary code with new opcodes.
> * trampoline - a jump to a new instruction.
> * payload - telemetries of the old code along with the binary blob of the
>   new function (if needed).
> * reloc - telemetries contained in the payload to construct a proper
>   trampoline.
>
>## Multiple ways to patch
>
>The mechanism needs to be flexible to patch the hypervisor in multiple ways
>and be as simple as possible. The compiled code is contiguous in memory with
>no gaps - so we have no luxury of 'moving' existing code and must either
>insert a trampoline to the new code to be executed - or only modify in-place
>the code if there is sufficient space. The placement of new code has to be
>done by the hypervisor and the virtual address for the new code is allocated
>dynamically.
>
>This implies that the hypervisor must compute the new offsets when splicing
>in the new trampoline code. Where the trampoline is added (inside
>the function we are patching or just the callers?) is also important.
>
>To lessen the amount of code in the hypervisor, the consumer of the API
>is responsible for identifying which mechanism to employ and how many
>locations to patch. Combinations of modifying in-place code, adding
>trampolines, etc. have to be supported. The API should allow reading and
>writing any memory within the hypervisor virtual address space.
>
>We must also have a mechanism to query what has been applied and a mechanism
>to revert it if needed.
>
>We must also have a mechanism to provide: a copy of the old code - so that
>the hypervisor can verify it against the code in memory; the new code;
>and the symbol name of the function to be patched, or an offset from the
>symbol, or a virtual address.
>
>The complications that this design will encounter are explained later
>in this document.
>
>## Patching code
>
>The first mechanism to patch that comes to mind is in-place replacement.
>That is, replace the affected code with new code. Unfortunately x86
>instructions are variable-length, which places limits on how much space we
>have available to replace the instructions.
>
>The second mechanism is by replacing the call or jump to the
>old function with the address of the new function.
>
>A third mechanism is to add a jump to the new function at the
>start of the old function.
>
>### Example of trampoline and in-place splicing
>
>As an example we will assume the hypervisor does not have XSA-132 (see
>*domctl/sysctl: don't leak hypervisor stack to toolstacks*
>4ff3449f0e9d175ceb9551d3f2aecb59273f639d) and we would like to binary patch
>the hypervisor with it. The original code looks like so:
>
><pre>
> 48 89 e0 mov %rsp,%rax
> 48 25 00 80 ff ff and $0xffffffffffff8000,%rax
></pre>
>
>while the new patched hypervisor would be:
>
><pre>
> 48 c7 45 b8 00 00 00 00 movq $0x0,-0x48(%rbp)
> 48 c7 45 c0 00 00 00 00 movq $0x0,-0x40(%rbp)
> 48 c7 45 c8 00 00 00 00 movq $0x0,-0x38(%rbp)
> 48 89 e0 mov %rsp,%rax
> 48 25 00 80 ff ff and $0xffffffffffff8000,%rax
></pre>
>
>This is inside `arch_do_domctl`. This new change adds 24 extra bytes of
>code (three 8-byte `movq` instructions) which alters all the offsets inside
>the function. To alter these offsets and add the extra 24 bytes of code we
>might not have enough space in .text to squeeze this in.
>
>As such we could simplify this problem by only patching the site
>which calls arch_do_domctl:
>
><pre>
><do_domctl>:
> e8 4b b1 05 00 callq ffff82d08015fbb9 <arch_do_domctl>
></pre>
>
>with a new address for where the new `arch_do_domctl` would be (this
>area would be allocated dynamically).
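The offset computation behind this call-site patching can be sketched in C. This is only an illustration under stated assumptions, not part of the design: the helper names `call_target` and `call_retarget` are hypothetical, and the call-site address mentioned in the note below is derived from the displacement in the listing (target minus displacement minus 5) rather than taken from the original disassembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: decode the target of a 5-byte x86-64 `call rel32`
 * (opcode e8) located at virtual address `site`. */
static uint64_t call_target(uint64_t site, const uint8_t insn[5])
{
    uint32_t raw;

    assert(insn[0] == 0xe8);
    /* rel32 is stored little-endian right after the opcode byte. */
    raw = (uint32_t)insn[1] | ((uint32_t)insn[2] << 8) |
          ((uint32_t)insn[3] << 16) | ((uint32_t)insn[4] << 24);
    /* The displacement is relative to the end of the instruction. */
    return site + 5 + (uint64_t)(int64_t)(int32_t)raw;
}

/* Hypothetical helper: rewrite the call at `site` so it reaches
 * `new_target`. Returns 0 on success, -1 if the new target is outside
 * the signed 32-bit displacement range. */
static int call_retarget(uint64_t site, uint64_t new_target, uint8_t insn[5])
{
    int64_t disp = (int64_t)(new_target - (site + 5));
    uint32_t raw;

    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;

    /* Truncation keeps the low 32 bits, i.e. the two's complement rel32. */
    raw = (uint32_t)(new_target - (site + 5));
    insn[0] = 0xe8;
    insn[1] = (uint8_t)raw;
    insn[2] = (uint8_t)(raw >> 8);
    insn[3] = (uint8_t)(raw >> 16);
    insn[4] = (uint8_t)(raw >> 24);
    return 0;
}
```

With the bytes `e8 4b b1 05 00` and an assumed call site of ffff82d080104a69, `call_target` yields ffff82d08015fbb9, the address of `arch_do_domctl` in the listing.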
>
>Astute readers will wonder what we need to do if we were to patch
>`do_domctl` - which is not called directly by the hypervisor but on behalf
>of the guests via the `compat_hypercall_table` and `hypercall_table`.
>Patching the offset in `hypercall_table` for `do_domctl`
>(ffff82d080103079 <do_domctl>:)
><pre>
>
> ffff82d08024d490: 79 30
> ffff82d08024d492: 10 80 d0 82 ff ff
>
></pre>
>with the address of the new `do_domctl` is possible. The other
>place where it is used is in `hvm_hypercall64_table` which would need
>to be patched in a similar way. This would require an in-place splicing
>of the new virtual address of `do_domctl`.
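As a rough sketch of the in-place table splice, assuming an 8-byte little-endian slot as in the dump above: the helper names below are hypothetical, and the sketch deliberately ignores the page-permission changes that a table living in `.rodata` would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 64-bit little-endian value from a function-pointer table slot. */
static uint64_t get_le64(const uint8_t slot[8])
{
    uint64_t v = 0;
    int i;

    for (i = 7; i >= 0; i--)
        v = (v << 8) | slot[i];
    return v;
}

/* Write a 64-bit value into a slot as little-endian bytes. */
static void put_le64(uint8_t slot[8], uint64_t v)
{
    int i;

    for (i = 0; i < 8; i++)
        slot[i] = (uint8_t)(v >> (8 * i));
}

/* Hypothetical helper: splice `new_addr` into the slot, but only if the
 * slot still holds `old_addr`. Returns 0 on success, -1 on mismatch. */
static int splice_table_slot(uint8_t slot[8], uint64_t old_addr,
                             uint64_t new_addr)
{
    assert(slot != NULL);
    if (get_le64(slot) != old_addr)
        return -1;
    put_le64(slot, new_addr);
    return 0;
}
```

Checking the old value first mirrors the requirement that the old contents be verified against memory before anything is changed.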
>
>In summary this example patched the callers of the affected function by
> * allocating memory for the new code to live in,
> * changing the call target in all the functions which called the old
>   code (computing the new offset, patching the callq with a new callq),
> * changing the function pointer tables with the new virtual address of
>   the function (splicing in the new virtual address). Since this table
>   resides in the .rodata section we would need to temporarily change the
>   page table permissions during this part.
>
>
>However it has severe drawbacks - the safety checks which have to make sure
>the function is not on the stack - must also check every caller. For some
>patches this could mean, if there were a sufficiently large number of
>callers, that we would never be able to apply the update.
>
>### Example of different trampoline patching.
>
>An alternative mechanism exists where we can insert a trampoline in the
>existing function to be patched to jump directly to the new code. This
>lessens the locations to be patched to one but it puts pressure on the
>CPU branching logic (I-cache, but it is just one unconditional jump).
>
>For this example we will assume that the hypervisor has not been compiled
>with fe2e079f642effb3d24a6e1a7096ef26e691d93e (XSA-125: *pre-fill structures
>for certain HYPERVISOR_xen_version sub-ops*) which mem-sets a structure
>in the `xen_version` hypercall. This function is not called **anywhere** in
>the hypervisor (it is called by the guest) but referenced in the
>`compat_hypercall_table` and `hypercall_table` (and indirectly called
>from that). Patching the offset in `hypercall_table` for the old
>`do_xen_version` (ffff82d080112f9e <do_xen_version>)
>
><pre>
> ffff82d08024b270 <hypercall_table>
> ...
> ffff82d08024b2f8: 9e 2f 11 80 d0 82 ff ff
>
></pre>
>with the address of the new `do_xen_version` is possible. The other
>place where it is used is in `hvm_hypercall64_table` which would need
>to be patched in a similar way. This would require an in-place splicing
>of the new virtual address of `do_xen_version`.
>
>An alternative solution would be to insert a trampoline in the
>old `do_xen_version` function to directly jump to the new
>`do_xen_version`.
>
><pre>
> ffff82d080112f9e <do_xen_version>:
> ffff82d080112f9e: 48 c7 c0 da ff ff ff mov $0xffffffffffffffda,%rax
> ffff82d080112fa5: 83 ff 09 cmp $0x9,%edi
> ffff82d080112fa8: 0f 87 24 05 00 00 ja ffff82d0801134d2 <do_xen_version+0x534>
></pre>
>
>with:
>
><pre>
> ffff82d080112f9e <do_xen_version>:
> ffff82d080112f9e: e9 XX YY ZZ QQ jmpq [new do_xen_version]
></pre>
>
>which would lessen the amount of patching to just one location.
>
>In summary this example patched the affected function to jump to the
>new replacement function which required:
> * allocating memory for the new code to live in,
> * inserting a trampoline with the new offset in the old function to point
>   to the new function.
> * Optionally we can insert in the old function a trampoline jump to a
>   function providing a BUG_ON to catch errant code.
>
>The disadvantage of this is that the unconditional jump incurs a small
>I-cache penalty. However the simplicity of the patching and of the safety
>checks makes this a worthwhile option.
>
>### Security
>
>With this method we can re-write the hypervisor - and as such we **MUST** be
>diligent in only allowing certain guests to perform this operation.
>
>Furthermore with SecureBoot or tboot, we **MUST** also verify the signature
>of the payload to be certain it came from a trusted source.
>
>As such the hypercall **MUST** support an XSM policy to limit which
>guests are allowed to invoke it. If the system is booted with signature
>checking, signature checking of the payload will be enforced.
>
>## Payload format
>
>The payload **MUST** contain enough data to allow us to apply the update
>and also safely reverse it. As such we **MUST** know:
>
> * What the old code is expected to be. We **MUST** verify it against the
>   runtime code.
> * The locations in memory to be patched. This can be determined dynamically
>   via symbols or via virtual addresses.
> * The new code to be used.
> * Signature to verify the payload.
>
>This binary format can be constructed using a custom binary format but
>there are severe disadvantages to that:
>
> * The format might need to be changed and we need a mechanism to
>   accommodate that.
> * It has to be platform agnostic.
> * It has to be easily constructed using existing tools.
>
>As such having the payload in an ELF file is the sensible way. We would be
>carrying the various sets of structures (and data) in the ELF sections under
>different names and with definitions. The prefix for the ELF section name
>would always be: *.xsplice_*
>
>Note that every structure has padding. This is added so that the hypervisor
>can re-use those fields as it sees fit.
>
>There are five groups of *.xsplice_* sections:
>
> * `.xsplice_symbols` and `.xsplice_str`. The array of symbols to be
>   referenced during the update. This can contain the symbols (functions)
>   that will be patched, or the list of symbols (functions) to be checked
>   pre-patching which may not be on the stack.
>
> * `.xsplice_reloc` and `.xsplice_reloc_howto`. How to properly construct
>   trampolines for a patch. We can have multiple locations for which we
>   need to insert a trampoline for a payload and each location might require
>   a different way of handling it. This would naturally reference the `.text`
>   section and its proper offset. The `.xsplice_reloc` is not directly
>   concerned with patches but rather is an ELF relocation - describing the
>   target of a relocation and how that is performed. They're also used where
>   the new code references the running code.
>
> * `.xsplice_sections`. The safety data for the old code and new code.
>   This contains an array of symbols (pointing to `.xsplice_symbols` and
>   to `.text`) which are to be used during safety and dependency checking.
>
> * `.xsplice_patches`. The description of the new functions to be patched
>   in (size, type, pointer to code, etc.).
>
> * `.xsplice_change`. The structure that ties all of this together and
>   defines the payload.
>
>Additionally the ELF file would contain:
>
> * `.text` section for the new and old code (function).
> * `.rela.text` relocation data for the `.text` (both new and old).
> * `.rela.xsplice_patches` relocation data for `.xsplice_patches` (such as
>   offset to the `.text`, `.xsplice_symbols`, or `.xsplice_reloc` section).
> * `.bss` section for the new code (function).
> * `.data` and `.data.read_mostly` section for the new and old code
>   (function).
> * `.rodata` section for the new and old code (function).
>
>In short the *.xsplice_* sections represent various structures and the
>ELF provides the mechanism to glue it all together when loaded in memory.
>
>Note that a lot of these ideas are borrowed from kSplice which is
>available at: https://github.com/jirislaby/ksplice
>
>For ELF understanding the best starting point is the OSDev Wiki
>(http://wiki.osdev.org/ELF). Furthermore the ELF specification is
>at http://www.skyfree.org/linux/references/ELF_Format.pdf and
>at Oracle's web site:
>http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter6-46512.html#scrolltoc
>
>### ASCII art of the ELF structures
>
>*TODO*: Include an ASCII art of how the sections are tied together.
>
>### xsplice_symbols
>
>The section contains an array of structures, each of which outlines the name
>of the symbol to be patched (or checked against). The structure is
>as follows:
>
><pre>
>struct xsplice_symbol {
> const char *name; /* The ELF name of the symbol. */
> const char *label; /* A unique xSplice name for the symbol. */
> uint8_t pad[16]; /* Must be zero. */
>};
></pre>
>The structures may be in the section in any order and in any amount
>(duplicate entries are permitted).
>
>Both `name` and `label` would be pointing to entries in `.xsplice_str`.
>
>The `label` is used for diagnostic purposes - such as including the
>name and the offset.
>
>### xsplice_reloc and xsplice_reloc_howto
>
>The section contains an array of structures that outline the different
>locations (and howto) for which a trampoline is to be inserted.
>
>The howto defines the change in detail. It contains the type,
>whether the relocation is relative, the size of the relocation,
>the bitmask for which parts of the instruction or data are to be replaced,
>the amount the final relocation is shifted by (to drop unwanted data), and
>whether the replacement should be interpreted as a signed value.
>
>The structure is as follows:
>
><pre>
>#define XSPLICE_HOWTO_RELOC_INLINE 0 /* Inline replacement. */
>#define XSPLICE_HOWTO_RELOC_PATCH 1 /* Add trampoline. */
>#define XSPLICE_HOWTO_RELOC_DATA 2 /* __DATE__ type change. */
>#define XSPLICE_HOWTO_RELOC_TIME 3 /* __TIME__ type change. */
>#define XSPLICE_HOWTO_BUG 4 /* BUG_ON being replaced. */
>#define XSPLICE_HOWTO_EXTABLE 5 /* exception_table change. */
>#define XSPLICE_HOWTO_SYMBOL 6 /* change in symbol table. */
>
>#define XSPLICE_HOWTO_FLAG_PC_REL 0x00000001 /* Is PC relative. */
>#define XSPLICE_HOWTO_FLAG_SIGN 0x00000002 /* Should the new value
>                                              be treated as signed value. */
>
>struct xsplice_reloc_howto {
>    uint32_t type; /* XSPLICE_HOWTO_* */
>    uint32_t flag; /* XSPLICE_HOWTO_FLAG_* */
>    uint32_t size; /* Size, in bytes, of the item to be relocated. */
>    uint32_t r_shift; /* The value the final relocation is shifted
>                         right by; used to drop unwanted data from the
>                         relocation. */
>    uint64_t mask; /* Bitmask for which parts of the instruction or
>                      data are replaced with the relocated value. */
>    uint8_t pad[8]; /* Must be zero. */
>};
>
></pre>
>
>This structure is used in:
>
><pre>
>struct xsplice_reloc {
>    uint64_t addr; /* The address of the relocation (if known). */
>    struct xsplice_symbol *symbol; /* Symbol for this relocation. */
>    struct xsplice_reloc_howto *howto; /* Pointer to the above structure. */
>    uint64_t isns_added; /* ELF addend resulting from quirks of the
>                            instruction one of whose operands is the
>                            relocation. For example, this is -4 on x86
>                            pc-relative jumps. */
>    uint64_t isns_target; /* Rest of the ELF addend. This is equal to
>                             the offset against the symbol that the
>                             relocation refers to. */
>    uint8_t pad[8]; /* Must be zero. */
>};
></pre>
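One possible reading of how the howto fields combine when a relocation is applied is sketched below. The formula (symbol + both addends, minus the field address when PC-relative, then shift and mask) is an assumption modelled on common ELF relocation handling, not something the text above specifies. `struct reloc_view` and `apply_reloc` are hypothetical names, the symbol is assumed to be already resolved, and the SIGN flag is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XSPLICE_HOWTO_FLAG_PC_REL 0x00000001

/* Hypothetical, flattened view of one relocation after symbol resolution. */
struct reloc_view {
    uint64_t addr;        /* Address of the field being relocated. */
    uint64_t sym_addr;    /* Resolved address of the symbol. */
    uint64_t isns_added;  /* Instruction addend (two's complement). */
    uint64_t isns_target; /* Offset against the symbol. */
    uint32_t flag;        /* XSPLICE_HOWTO_FLAG_* */
    uint32_t size;        /* Field size in bytes (1..8). */
    uint32_t r_shift;     /* Right shift applied to the final value. */
    uint64_t mask;        /* Bits of the field that get replaced. */
};

/* Apply the relocation to the little-endian field at `field`.
 * Returns 0 on success, -1 on an unsupported description. */
static int apply_reloc(const struct reloc_view *r, uint8_t *field)
{
    uint64_t val, old = 0;
    uint32_t i;

    assert(r != NULL && field != NULL);
    if (r->size == 0 || r->size > 8 || r->r_shift > 63)
        return -1;

    /* Assumed formula: S + A, minus P for PC-relative relocations. */
    val = r->sym_addr + r->isns_target + r->isns_added;
    if (r->flag & XSPLICE_HOWTO_FLAG_PC_REL)
        val -= r->addr;
    val >>= r->r_shift; /* Logical shift only; SIGN is not modelled. */

    /* Merge the masked value into the existing field contents. */
    for (i = 0; i < r->size; i++)
        old |= (uint64_t)field[i] << (8 * i);
    val = (old & ~r->mask) | (val & r->mask);
    for (i = 0; i < r->size; i++)
        field[i] = (uint8_t)(val >> (8 * i));
    return 0;
}
```

For a `call` at address 0x1000 whose rel32 field sits at 0x1001, a symbol at 0x2000 and an instruction addend of -4 give the displacement 0xffb, which lands on 0x2000 when added to the next-instruction address 0x1005.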
>
>### xsplice_sections
>
>The structure defined in this section is used to verify that it is safe
>to update with the new changes. It can contain safety data on the old code
>and what kind of matching we are to expect.
>
>It also can contain safety data of what to check when about to patch.
>That is, whether the contents of any of the addresses (either provided or
>resolved when the payload is loaded by referencing the symbols) match
>what we expect them to be.
>
>As such the flags can be or-ed together:
>
><pre>
>#define XSPLICE_SECTION_TEXT 0x00000001 /* Section is in .text */
>#define XSPLICE_SECTION_RODATA 0x00000002 /* Section is in .rodata */
>#define XSPLICE_SECTION_DATA 0x00000004 /* Section is in .data */
>#define XSPLICE_SECTION_STRING 0x00000008 /* Section is in .str */
>#define XSPLICE_SECTION_ALTINSTRUCTIONS 0x00000010 /* Section has .altinstructions. */
>#define XSPLICE_SECTION_TEXT_INPLACE 0x00000200 /* Change is in place. */
>
>#define XSPLICE_SECTION_MATCH_EXACT 0x00000400 /* Must match exactly. */
>#define XSPLICE_SECTION_NO_STACKCHECK 0x00000800 /* Do not check the stack. */
>
>struct xsplice_section {
>    struct xsplice_symbol *symbol; /* The symbol associated with this change. */
>    uint64_t address; /* The address of the section (if known). */
>    uint64_t size; /* The size of the section. */
>    uint64_t flags; /* Various XSPLICE_SECTION_* flags. */
>    uint8_t pad[16]; /* Must be zero. */
>};
>
></pre>
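A minimal sketch of how the or-ed flags might drive the pre-patch comparison of old code against memory. The helper `check_old_code` and the `SECTION_KNOWN_FLAGS` mask are hypothetical, and what happens when *XSPLICE_SECTION_MATCH_EXACT* is not set is left open by the text; this sketch simply skips the byte comparison in that case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XSPLICE_SECTION_TEXT            0x00000001
#define XSPLICE_SECTION_RODATA          0x00000002
#define XSPLICE_SECTION_DATA            0x00000004
#define XSPLICE_SECTION_STRING          0x00000008
#define XSPLICE_SECTION_ALTINSTRUCTIONS 0x00000010
#define XSPLICE_SECTION_TEXT_INPLACE    0x00000200
#define XSPLICE_SECTION_MATCH_EXACT     0x00000400
#define XSPLICE_SECTION_NO_STACKCHECK   0x00000800

/* Hypothetical: every flag bit this sketch understands. */
#define SECTION_KNOWN_FLAGS \
    (XSPLICE_SECTION_TEXT | XSPLICE_SECTION_RODATA | XSPLICE_SECTION_DATA | \
     XSPLICE_SECTION_STRING | XSPLICE_SECTION_ALTINSTRUCTIONS | \
     XSPLICE_SECTION_TEXT_INPLACE | XSPLICE_SECTION_MATCH_EXACT | \
     XSPLICE_SECTION_NO_STACKCHECK)

/* Hypothetical helper: decide whether the bytes currently in memory are
 * acceptable for a section with the given flags.
 * Returns 0 if the check passes, -1 otherwise. */
static int check_old_code(uint64_t flags, const void *runtime,
                          const void *expected, size_t size)
{
    assert(runtime != NULL && expected != NULL);

    /* Reject flag bits we do not know how to honour. */
    if (flags & ~(uint64_t)SECTION_KNOWN_FLAGS)
        return -1;

    /* With MATCH_EXACT the old code must match byte for byte. */
    if ((flags & XSPLICE_SECTION_MATCH_EXACT) &&
        memcmp(runtime, expected, size) != 0)
        return -1;

    return 0;
}
```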
>
>### xsplice_patches
>
>Within this section we have an array of structures defining the new code
>(patch).
>
>This structure consists of a pointer to the new code (which in ELF ends up
>pointing to an offset in the `.text` or `.data` section); the type of patch:
>inline - either text or data, or requesting a trampoline; and the size of
>the patch.
>
>The structure is as follows:
>
><pre>
>#define XSPLICE_PATCH_INLINE_TEXT 0
>#define XSPLICE_PATCH_INLINE_DATA 1
>#define XSPLICE_PATCH_RELOC_TEXT 2
>
>struct xsplice_patch {
> uint32_t type; /* XSPLICE_PATCH_* .*/
> uint32_t size; /* Size of patch. */
> uint64_t addr; /* The address of the new code (or data). */
> void *content; /* The bytes to be installed. */
> uint8_t pad[16]; /* Must be zero. */
>};
>
></pre>
>
>### xsplice_code
>
>The structure embedded within this section ties it all together.
>It has the name of the patch, and pointers to all the above
>mentioned structures (the start and end addresses).
>
>The structure is as follows:
>
><pre>
>struct xsplice_code {
> const char *name; /* A sensible name for the patch. Up to 40 characters. */
> struct xsplice_reloc *relocs, *relocs_end; /* How to patch it */
> struct xsplice_section *sections, *sections_end; /* Safety data */
> struct xsplice_patch *patches, *patches_end; /* Patch code & data */
> uint8_t pad[32]; /* Must be zero. */
>};
></pre>
>
>There should only be one such structure in the section.
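Because the tying structure carries start and end pointers rather than counts, a consumer would walk each array as a half-open range. The sketch below does this for the patch array only; `total_patch_bytes` is a hypothetical helper, and rejecting unknown types or non-zero padding is an assumption drawn from the "Must be zero" comments rather than a stated rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XSPLICE_PATCH_INLINE_TEXT 0
#define XSPLICE_PATCH_INLINE_DATA 1
#define XSPLICE_PATCH_RELOC_TEXT  2

struct xsplice_patch {
    uint32_t type;   /* XSPLICE_PATCH_* */
    uint32_t size;   /* Size of patch. */
    uint64_t addr;   /* The address of the new code (or data). */
    void *content;   /* The bytes to be installed. */
    uint8_t pad[16]; /* Must be zero. */
};

/* Hypothetical helper: walk [patches, patches_end) and return the total
 * number of bytes to install, or -1 if any entry looks malformed. */
static int64_t total_patch_bytes(const struct xsplice_patch *patches,
                                 const struct xsplice_patch *patches_end)
{
    const struct xsplice_patch *p;
    int64_t total = 0;
    size_t i;

    assert(patches != NULL && patches_end != NULL && patches <= patches_end);

    for (p = patches; p < patches_end; p++) {
        if (p->type > XSPLICE_PATCH_RELOC_TEXT)
            return -1;
        for (i = 0; i < sizeof(p->pad); i++)
            if (p->pad[i] != 0)
                return -1;
        total += p->size;
    }
    return total;
}
```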
>
>### Example
>
>*TODO*: Include an objdump of what the ELF would look like for the XSA
>mentioned earlier.
>
>## Signature checking requirements.
>
>The signature checking requires that the layout of the data in memory
>**MUST** be the same for the signature to be verified. This means that the
>payload data layout in ELF format **MUST** match what the hypervisor would be
>expecting such that it can properly do signature verification.
>
>The signature is based on all of the payloads contiguously laid out
>in memory. The signature is to be appended at the end of the ELF payload
>prefixed with the string '~Module signature appended~\n', followed by
>a signature header then followed by the signature, key identifier, and
>signer's name.
>
>Specifically the signature header would be:
>
><pre>
>#define PKEY_ALGO_DSA 0
>#define PKEY_ALGO_RSA 1
>
>#define PKEY_ID_PGP 0 /* OpenPGP generated key ID */
>#define PKEY_ID_X509 1 /* X.509 arbitrary subjectKeyIdentifier */
>
>#define HASH_ALGO_MD4 0
>#define HASH_ALGO_MD5 1
>#define HASH_ALGO_SHA1 2
>#define HASH_ALGO_RIPE_MD_160 3
>#define HASH_ALGO_SHA256 4
>#define HASH_ALGO_SHA384 5
>#define HASH_ALGO_SHA512 6
>#define HASH_ALGO_SHA224 7
>#define HASH_ALGO_RIPE_MD_128 8
>#define HASH_ALGO_RIPE_MD_256 9
>#define HASH_ALGO_RIPE_MD_320 10
>#define HASH_ALGO_WP_256 11
>#define HASH_ALGO_WP_384 12
>#define HASH_ALGO_WP_512 13
>#define HASH_ALGO_TGR_128 14
>#define HASH_ALGO_TGR_160 15
>#define HASH_ALGO_TGR_192 16
>
>
>struct elf_payload_signature {
> u8 algo; /* Public-key crypto algorithm PKEY_ALGO_*. */
> u8 hash; /* Digest algorithm: HASH_ALGO_*. */
> u8 id_type; /* Key identifier type PKEY_ID*. */
> u8 signer_len; /* Length of signer's name */
> u8 key_id_len; /* Length of key identifier */
> u8 __pad[3];
> __be32 sig_len; /* Length of signature data */
>};
>
></pre>
>(Note that this has been borrowed from the Linux module signature code.)
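As an illustration of consuming the header above, the following sketch parses it from raw bytes and computes how many bytes of signer name, key identifier and signature it announces. It assumes the structure occupies 12 bytes with no compiler-inserted padding (five `u8` fields, three pad bytes, one big-endian 32-bit length); `struct sig_header_view` and `parse_sig_header` are hypothetical names, and no cryptographic verification is attempted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PKEY_ALGO_RSA      1
#define PKEY_ID_X509       1
#define HASH_ALGO_SHA256   4
#define HASH_ALGO_TGR_192 16
#define SIG_HEADER_BYTES  12 /* 5 x u8 + 3 pad bytes + one __be32 */

/* Hypothetical host-endian view of struct elf_payload_signature. */
struct sig_header_view {
    uint8_t algo;
    uint8_t hash;
    uint8_t id_type;
    uint8_t signer_len;
    uint8_t key_id_len;
    uint32_t sig_len;
};

/* Parse the header at `buf` and report the number of bytes occupied by
 * signer name + key id + signature. Returns 0 on success, -1 on error. */
static int parse_sig_header(const uint8_t *buf, size_t len,
                            struct sig_header_view *out,
                            uint64_t *trailer_len)
{
    assert(buf != NULL && out != NULL && trailer_len != NULL);
    if (len < SIG_HEADER_BYTES)
        return -1;

    out->algo = buf[0];
    out->hash = buf[1];
    out->id_type = buf[2];
    out->signer_len = buf[3];
    out->key_id_len = buf[4];
    /* buf[5..7] is padding; sig_len is big-endian at offset 8. */
    out->sig_len = ((uint32_t)buf[8] << 24) | ((uint32_t)buf[9] << 16) |
                   ((uint32_t)buf[10] << 8) | (uint32_t)buf[11];

    /* Reject values outside the ranges enumerated above. */
    if (out->algo > PKEY_ALGO_RSA || out->hash > HASH_ALGO_TGR_192 ||
        out->id_type > PKEY_ID_X509)
        return -1;

    *trailer_len = (uint64_t)out->signer_len + out->key_id_len + out->sig_len;
    return 0;
}
```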
>
>
>## Hypercalls
>
>We will employ the sub operations of the system management hypercall (sysctl).
>There are to be four sub-operations:
>
> * upload the payloads.
> * listing of payloads summary uploaded and their state.
> * getting a particular payload summary and its state.
> * command to apply, delete, or revert the payload.
>
>The patching is asynchronous therefore the caller is responsible
>to verify that it has been applied properly by retrieving the summary of it
>and verifying that there are no error codes associated with the payload.
>
>We **MUST** make it asynchronous due to the nature of patching: it requires
>every physical CPU to be in lock-step with each other. The patching mechanism
>while an implementation detail, is not a short operation and as such
>the design **MUST** assume it will be a long-running operation.
>
>Furthermore it is possible to have multiple different payloads for the same
>function. As such a unique id has to be visible to allow proper manipulation.
>
>The hypercall is part of the `xen_sysctl`. The top level structure contains
>one uint32_t to determine the sub-operations:
>
><pre>
>struct xen_sysctl_xsplice_op {
> uint32_t cmd;
> union {
> ... see below ...
> } u;
>};
>
></pre>
>while the rest of the hypercall specific structures are part of this
>structure.
>
>
>### XEN_SYSCTL_XSPLICE_UPLOAD (0)
>
>Upload a payload to the hypervisor. The payload is verified and if there
>are any issues the proper return code will be returned. The payload is
>not applied at this time - that is controlled by *XEN_SYSCTL_XSPLICE_ACTION*.
>
>The caller provides:
>
> * `id` unique id.
> * `size` the size of the ELF payload.
> * `payload` the virtual address of where the ELF payload is.
>
>The return value is zero if the payload was successfully uploaded and the
>signature was verified. Otherwise an EXX return value is provided.
>Duplicate `id` are not supported.
>
>The `payload` is the ELF payload as mentioned in the `Payload format` section.
>
>The structure is as follows:
>
><pre>
>struct xen_sysctl_xsplice_upload {
> char id[40]; /* IN, name of the patch. */
> uint64_t size; /* IN, size of the ELF file. */
> XEN_GUEST_HANDLE_64(uint8) payload; /* ELF file. */
>};
></pre>
>
>### XEN_SYSCTL_XSPLICE_GET (1)
>
>Retrieve a summary of a specific payload. The caller provides:
>
> * `id` the unique id.
> * `status` *MUST* be set to zero.
> * `rc` *MUST* be set to zero.
>
>The `summary` structure contains a summary of the payload which includes:
>
> * `id` the unique id.
> * `status` - whether it has been:
>   1. *XSPLICE_STATUS_LOADED* (0) has been loaded.
>   2. *XSPLICE_STATUS_PROGRESS* (1) acting on the
>      **XEN_SYSCTL_XSPLICE_ACTION** command.
>   3. *XSPLICE_STATUS_CHECKED* (2) the ELF payload safety checks passed.
>   4. *XSPLICE_STATUS_APPLIED* (3) loaded, checked, and applied.
>   5. *XSPLICE_STATUS_REVERTED* (4) loaded, checked, applied and then also
>      reverted.
>   6. *XSPLICE_STATUS_IN_ERROR* (5) loaded and in a failed state. Consult
>      `rc` for details.
> * `rc` - its error state if any.
>
>The structure is as follows:
>
><pre>
>#define XSPLICE_STATUS_LOADED 0
>#define XSPLICE_STATUS_PROGRESS 1
>#define XSPLICE_STATUS_CHECKED 2
>#define XSPLICE_STATUS_APPLIED 3
>#define XSPLICE_STATUS_REVERTED 4
>#define XSPLICE_STATUS_IN_ERROR 5
>
>struct xen_sysctl_xsplice_summary {
> char id[40]; /* IN/OUT, name of the patch. */
> uint32_t status; /* OUT */
> int32_t rc; /* OUT */
>};
></pre>
>
>### XEN_SYSCTL_XSPLICE_LIST (2)
>
>Retrieve an array of abbreviated summaries of the payloads that are loaded
>in the hypervisor.
>
>The caller provides:
>
> * `idx` index iterator. Initially it *MUST* be zero.
> * `count` the max number of entries to populate.
> * `summary` virtual address of where to write payload summaries.
>
>The hypercall returns zero on success and updates the `idx` (index) iterator
>with the number of payloads returned, `count` to the number of remaining
>payloads, and `summary` with a number of payload summaries.
>
>If the hypercall returns E2BIG the `count` is too big and should be lowered.
>
>Note that due to the asynchronous nature of hypercalls the domain might have
>added or removed payloads, making this information stale. It is
>the responsibility of the domain to provide proper accounting.
>
>The `summary` structure contains a summary of the payload which includes:
>
> * `id` unique id.
> * `status` - whether it has been:
>   1. *XSPLICE_STATUS_LOADED* (0) has been loaded.
>   2. *XSPLICE_STATUS_PROGRESS* (1) acting on the
>      **XEN_SYSCTL_XSPLICE_ACTION** command.
>   3. *XSPLICE_STATUS_CHECKED* (2) the payload `old` and `addr` match with
>      the hypervisor.
>   4. *XSPLICE_STATUS_APPLIED* (3) loaded, checked, and applied.
>   5. *XSPLICE_STATUS_REVERTED* (4) loaded, checked, applied and then also
>      reverted.
>   6. *XSPLICE_STATUS_IN_ERROR* (5) loaded and in a failed state. Consult
>      `rc` for details.
> * `rc` - its error state if any.
>
>The structure is as follows:
>
><pre>
>struct xen_sysctl_xsplice_list {
> uint32_t idx; /* IN/OUT */
> uint32_t count; /* IN/OUT */
> XEN_GUEST_HANDLE_64(xen_sysctl_xsplice_summary) summary; /* OUT */
>};
>
>struct xen_sysctl_xsplice_summary {
> char id[40]; /* OUT, name of the patch. */
> uint32_t status; /* OUT */
> int32_t rc; /* OUT */
>};
>
></pre>
>### XEN_SYSCTL_XSPLICE_ACTION (3)
>
>Perform an operation on the payload structure referenced by the `id` field.
>The operation request is asynchronous and the status should be retrieved
>by using either the **XEN_SYSCTL_XSPLICE_GET** or
>**XEN_SYSCTL_XSPLICE_LIST** hypercall.
>
>The caller provides:
>
> * `id` the unique id.
> * `cmd` the command requested:
>   1. *XSPLICE_ACTION_CHECK* (0) check that the payload will apply properly.
>   2. *XSPLICE_ACTION_UNLOAD* (1) unload the payload.
>   3. *XSPLICE_ACTION_REVERT* (2) revert the payload.
>   4. *XSPLICE_ACTION_APPLY* (3) apply the payload.
>
>
>The return value will be zero unless the provided fields are incorrect.
>
>The structure is as follows:
>
><pre>
>#define XSPLICE_ACTION_CHECK 0
>#define XSPLICE_ACTION_UNLOAD 1
>#define XSPLICE_ACTION_REVERT 2
>#define XSPLICE_ACTION_APPLY 3
>
>struct xen_sysctl_xsplice_action {
> char id[40]; /* IN, name of the patch. */
> uint32_t cmd; /* IN */
>};
>
></pre>
>
>## Sequence of events.
>
>The normal sequence of events is to:
>
> 1. *XEN_SYSCTL_XSPLICE_UPLOAD* to upload the payload. If there are
>    errors *STOP* here.
> 2. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in
>    *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_LOADED* go to next
>    step.
> 3. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_CHECK* command to
>    verify that the payload can be successfully applied.
> 4. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in
>    *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_CHECKED* go to next
>    step.
> 5. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_APPLY* to apply the
>    patch.
> 6. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in
>    *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_APPLIED* exit with
>    success.
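The six steps can be sketched as a small caller-side driver. Everything here beyond the status and action constants is an assumption made for illustration: `struct xsplice_ops` stands in for whatever toolstack wrappers would issue the sysctl sub-operations, the poll limit is arbitrary, and `struct fake_hv` with its `fake_*` functions is a toy in-memory stand-in that exists only so the sequence can be exercised without a hypervisor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XSPLICE_STATUS_LOADED   0
#define XSPLICE_STATUS_PROGRESS 1
#define XSPLICE_STATUS_CHECKED  2
#define XSPLICE_STATUS_APPLIED  3
#define XSPLICE_ACTION_CHECK    0
#define XSPLICE_ACTION_APPLY    3

/* Hypothetical wrappers around the sysctl sub-operations. */
struct xsplice_ops {
    int (*upload)(void *ctx, const char *id, const void *elf, uint64_t size);
    int (*get)(void *ctx, const char *id, uint32_t *status, int32_t *rc);
    int (*action)(void *ctx, const char *id, uint32_t cmd);
};

/* Poll GET until `want` is reached; give up after `max_polls` tries. */
static int wait_for(const struct xsplice_ops *ops, void *ctx, const char *id,
                    uint32_t want, unsigned int max_polls)
{
    unsigned int i;

    for (i = 0; i < max_polls; i++) {
        uint32_t status = 0; /* MUST be zero on entry. */
        int32_t rc = 0;      /* MUST be zero on entry. */

        if (ops->get(ctx, id, &status, &rc) != 0 || rc != 0)
            return -1;
        if (status == want)
            return 0;
        if (status != XSPLICE_STATUS_PROGRESS)
            return -1; /* e.g. XSPLICE_STATUS_IN_ERROR */
    }
    return -1;
}

/* Steps 1-6 of the normal sequence. Returns 0 once the patch is applied. */
static int xsplice_load_and_apply(const struct xsplice_ops *ops, void *ctx,
                                  const char *id, const void *elf,
                                  uint64_t size, unsigned int max_polls)
{
    assert(ops != NULL && id != NULL);
    if (ops->upload(ctx, id, elf, size) != 0)
        return -1;
    if (wait_for(ops, ctx, id, XSPLICE_STATUS_LOADED, max_polls) != 0)
        return -1;
    if (ops->action(ctx, id, XSPLICE_ACTION_CHECK) != 0)
        return -1;
    if (wait_for(ops, ctx, id, XSPLICE_STATUS_CHECKED, max_polls) != 0)
        return -1;
    if (ops->action(ctx, id, XSPLICE_ACTION_APPLY) != 0)
        return -1;
    return wait_for(ops, ctx, id, XSPLICE_STATUS_APPLIED, max_polls);
}

/* Toy stand-in: each request completes on the second GET poll. */
struct fake_hv { uint32_t status, next; unsigned int busy; };

static int fake_upload(void *ctx, const char *id, const void *elf, uint64_t size)
{
    struct fake_hv *hv = ctx;

    (void)id;
    if (elf == NULL || size == 0)
        return -1;
    hv->status = XSPLICE_STATUS_PROGRESS;
    hv->next = XSPLICE_STATUS_LOADED;
    hv->busy = 2;
    return 0;
}

static int fake_get(void *ctx, const char *id, uint32_t *status, int32_t *rc)
{
    struct fake_hv *hv = ctx;

    (void)id;
    if (hv->busy != 0 && --hv->busy == 0)
        hv->status = hv->next;
    *status = hv->status;
    *rc = 0;
    return 0;
}

static int fake_action(void *ctx, const char *id, uint32_t cmd)
{
    struct fake_hv *hv = ctx;

    (void)id;
    if (cmd == XSPLICE_ACTION_CHECK && hv->status == XSPLICE_STATUS_LOADED)
        hv->next = XSPLICE_STATUS_CHECKED;
    else if (cmd == XSPLICE_ACTION_APPLY && hv->status == XSPLICE_STATUS_CHECKED)
        hv->next = XSPLICE_STATUS_APPLIED;
    else
        return -1;
    hv->status = XSPLICE_STATUS_PROGRESS;
    hv->busy = 2;
    return 0;
}
```

A real caller would likely sleep between polls instead of spinning, and would decide separately whether to unload a payload that fails its check.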
>
>
>## Addendum
>
>Implementation quirks should not be discussed in a design document.
>
>However these observations can provide aid when developing against this
>document.
>
>
>### Alternative assembler
>
>Alternative assembler is a mechanism to use different instructions depending
>on what the CPU supports. This is done by providing multiple streams of code
>that can be patched in - or if the CPU does not support it - padded with
>`nop` operations. The alternative assembler macros cause the compiler to
>expand the code to place the most generic code in place and emit a special
>ELF .section header to tag this location. During run-time the hypervisor
>can leave the areas alone or patch them with better suited opcodes.
>
>As we might be patching the alternative assembler sections as well - by
>providing new better suited op-codes or perhaps with nops - we need to
>also re-run the alternative assembler patching after we have done our
>patching.
>
>Also when we are doing safety checks the code we are checking might be
>utilizing alternative assembler. As such we should relax our checks to
>accommodate that.
>
>### .rodata sections
>
>The patching might require strings to be updated as well. As such we must
>also be able to patch the strings as needed. This sounds simple - but the
>compiler has a habit of coalescing strings that are the same - which means
>if we in-place alter the strings - other users will be inadvertently
>affected as well.
>
>This is also where pointers to functions live - and we may need to patch
>this as well.
>
>To guard against that we must be prepared to do patching similar to
>trampoline patching or in-line depending on the flavour. If we can
>do in-line patching we would need to:
>
> * alter `.rodata` to be writeable.
> * inline patch.
> * alter `.rodata` to be read-only.
>
>If we are doing trampoline patching we would need to:
>
> * allocate a new memory location for the string.
> * update all locations which use this string to use the offset to the
>   new string.
> * mark the region RO when we are done.
>
>### .bss sections
>
>Patching writable data is not suitable as it is unclear what should be done
>depending on the current state of data. As such it should not be attempted.
>
>
>### Patching code which is on the stack.
>
>We should not patch the code which is on the stack. That can lead
>to corruption.
>
>### Trampoline (e9 opcode)
>
>The e9 opcode used for jmpq uses a 32-bit signed displacement. That means
>the new code must be placed within 2GB of virtual address space of the old
>code. That should not be a problem since the Xen hypervisor has
>a very small footprint.
>
>However if we need - we can always add two trampolines. One at the 2GB
>limit that jumps to the next trampoline.
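The 2GB constraint can be made concrete with a short sketch that checks reachability before emitting the 5-byte jump. The helper names `jmp_reachable` and `make_trampoline` are hypothetical, and the sketch only builds the bytes in a buffer; writing them into live `.text` safely is a separate problem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRAMPOLINE_LEN 5 /* e9 opcode + rel32 */

/* Hypothetical helper: can a single e9 jump placed at `from` reach `to`?
 * The displacement is measured from the end of the jump instruction. */
static int jmp_reachable(uint64_t from, uint64_t to)
{
    int64_t disp = (int64_t)(to - (from + TRAMPOLINE_LEN));

    return disp >= INT32_MIN && disp <= INT32_MAX;
}

/* Hypothetical helper: emit `jmp rel32` for a jump from `from` to `to`.
 * Returns 0 on success, -1 if a second trampoline would be needed. */
static int make_trampoline(uint64_t from, uint64_t to,
                           uint8_t out[TRAMPOLINE_LEN])
{
    uint32_t raw;

    assert(out != NULL);
    if (!jmp_reachable(from, to))
        return -1;

    /* Truncation keeps the low 32 bits, i.e. the two's complement rel32. */
    raw = (uint32_t)(to - (from + TRAMPOLINE_LEN));
    out[0] = 0xe9;
    out[1] = (uint8_t)raw;
    out[2] = (uint8_t)(raw >> 8);
    out[3] = (uint8_t)(raw >> 16);
    out[4] = (uint8_t)(raw >> 24);
    return 0;
}
```

When `make_trampoline` fails, the two-trampoline approach would place an intermediate jump within range of the old function and chain from there.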
>
>### Time rendezvous code instead of stop_machine for patching
>
>The hypervisor's time rendezvous code runs synchronously across all CPUs
>every second. Using stop_machine to patch can stall the time rendezvous
>code and result in an NMI. As such having the patching be done at the tail
>of the rendezvous code should avoid this problem.
>
>### Security
>
>Only the privileged domain should be allowed to do this operation.