From: Laszlo Ersek <lersek@redhat.com>
To: Peter Maydell <peter.maydell@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Andrew Jones <drjones@redhat.com>,
	QEMU Developers <qemu-devel@nongnu.org>,
	Alexander Graf <agraf@suse.de>,
	"Richard W.M. Jones" <rjones@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg
Date: Wed, 17 Dec 2014 08:13:56 +0100
Message-ID: <54912D34.5020809@redhat.com>
In-Reply-To: <CAFEAcA_82G44Tv9C50uoRS-EKtzeyKBfuCFt4LD-0xyFoySmpA@mail.gmail.com>

On 12/16/14 21:41, Peter Maydell wrote:
> On 16 December 2014 at 19:00, Laszlo Ersek <lersek@redhat.com> wrote:
>> The root of this question is what each of
>>
>> enum device_endian {
>>     DEVICE_NATIVE_ENDIAN,
>>     DEVICE_BIG_ENDIAN,
>>     DEVICE_LITTLE_ENDIAN,
>> };
>>
>> means.
> 
> An opening remark: endianness is a horribly confusing topic and
> support of more than one endianness is even worse. I may have made
> some inadvertent errors in this reply; if you think so please
> let me know and I'll have another stab at it.
> 
> That said: the device_endian options indicate what a device's
> internal idea of its endianness is. This is most relevant if
> a device accepts accesses at wider than byte width
> (for instance, if you can read a 32-bit value from
> address offset 0 and also an 8-bit value from offset 3 then
> how do those values line up? If you read a 32-bit value then
> which way round is it compared to what the device's io read/write
> function return?).
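> 
> (As an illustration, with made-up numbers: suppose a 32-bit read at
> offset 0 returns 0x11223344. For a DEVICE_BIG_ENDIAN device, an
> 8-bit read at offset 3 then yields 0x44, because BE places the least
> significant byte at the highest address; for DEVICE_LITTLE_ENDIAN
> the same read yields 0x11.)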
> 
> NATIVE_ENDIAN means "same order as the CPU's main data bus's
> natural representation". (Note that this is not necessarily
> the same as "the endianness the CPU currently has"; on ARM
> you can flip the CPU between LE and BE at runtime, which
> is basically inserting a byte-swizzling step between data
> accesses and the CPU's data bus, which is always LE for ARMv7+.)
> 
> Note that RAM accessible to a KVM guest is always NATIVE_ENDIAN
> (because the guest vcpu reads it directly with the h/w cpu).
> 
>> Consider the following call tree, which implements the splitting /
>> combining of an MMIO read:
>>
>> memory_region_dispatch_read() [memory.c]
>>   memory_region_dispatch_read1()
>>     access_with_adjusted_size()
>>       memory_region_big_endian()
>>       for each byte in the wide access:
>>         memory_region_read_accessor()
>>           fw_cfg_data_mem_read() [hw/nvram/fw_cfg.c]
>>             fw_cfg_read()
>>   adjust_endianness()
>>     memory_region_wrong_endianness()
>>     bswapXX()
>>
>> The function access_with_adjusted_size() always iterates over the MMIO
>> address range in incrementing address order. However, it can calculate
>> the shift count for memory_region_read_accessor() in two ways.
>>
>> When memory_region_big_endian() returns "true", the shift count
>> decreases as the MMIO address increases.
>>
>> When memory_region_big_endian() returns "false", the shift count
>> increases as the MMIO address increases.
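>>
>> In outline (simplified from the access_with_adjusted_size() loop in
>> memory.c; not the verbatim code):
>>
>>     if (memory_region_big_endian(mr)) {
>>         for (i = 0; i < size; i += access_size) {
>>             access(mr, addr + i, value, access_size,
>>                    (size - access_size - i) * 8, access_mask);
>>         }
>>     } else {
>>         for (i = 0; i < size; i += access_size) {
>>             access(mr, addr + i, value, access_size,
>>                    i * 8, access_mask);
>>         }
>>     }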
> 
> Yep, because this is how the device has told us that it thinks
> of accesses as being put together.
> 
> The column in your table "host value" is the 16 bit value read from
> the device, ie what we have decided (based on device_endian) that
> it would have returned us if it had supported a 16 bit read directly
> itself. BE devices compose 16 bit values with the high byte first,
> LE devices with the low byte first, and native-endian devices
> in the same order as guest-endianness.
> 
>> In memory_region_read_accessor(), the shift count is used for a logical
>> (ie. numeric) bitwise left shift (<<). That operator works with numeric
>> values and hides (ie. accounts for) host endianness.
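>>
>> For reference, a simplified form of the accessor (tracing omitted):
>>
>>     static void memory_region_read_accessor(MemoryRegion *mr, hwaddr addr,
>>                                             uint64_t *value, unsigned size,
>>                                             unsigned shift, uint64_t mask)
>>     {
>>         uint64_t tmp;
>>
>>         tmp = mr->ops->read(mr->opaque, addr, size);
>>         *value |= (tmp & mask) << shift;
>>     }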
>>
>> Let's consider
>> - an array of two bytes, [0xaa, 0xbb],
>> - a uint16_t access made from the guest,
>> - and all twelve possible cases.
>>
>> In the table below, the "host", "guest" and "device_endian" columns are
>> input variables. The columns to the right are calculated / derived
>> values. The arrows above the columns show the data dependencies.
>>
>> After memory_region_dispatch_read1() constructs the value that is
>> visible in the "host value" column, it is passed to adjust_endianness().
>> If memory_region_wrong_endianness() returns "true", then the host value
>> is byte-swapped. The result is then passed to the guest.
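>>
>> (memory_region_wrong_endianness() itself boils down to:
>>
>>     static bool memory_region_wrong_endianness(MemoryRegion *mr)
>>     {
>>     #ifdef TARGET_WORDS_BIGENDIAN
>>         return mr->ops->endianness == DEVICE_LITTLE_ENDIAN;
>>     #else
>>         return mr->ops->endianness == DEVICE_BIG_ENDIAN;
>>     #endif
>>     }
>>
>> that is, it returns "true" exactly when the device's declared
>> endianness is the opposite of the target's.)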
>>
>>               +---------------------------------------------------------------------------------------------------------------+----------+
>>               |                                                                                                               |          |
>>             +-----------+-------------------------------------------------------------------------+                           |          |
>>             | |         |                                                                         |                           |          |
>>       +----------------------------------------------------------+---------+                      |                           |          |
>>       |     | |         |                                        |         |                      |                           |          |
>>       |   +-----------+-------------------+  +----------------+  |         |  +------------------------+-------------------+  |          |
>>       |   | | |       | |                 |  |                |  |         |  |                   |    |                   |  |          |
>>       |   | | |       | |                 v  |                v  |         v  |                   v    |                   v  |          v
>>  #  host  guest  device_endian  memory_region_big_endian()  host value  host repr.    memory_region_wrong_endianness()  guest repr.   guest value
>> --  ----  -----  -------------  --------------------------  ----------  ------------  --------------------------------  ------------  -----------
>>  1  LE    LE     native         0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>>  2  LE    LE     BE             1                           0xaabb      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xbbaa
>>  3  LE    LE     LE             0                           0xbbaa      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xbbaa
>>
>>  4  LE    BE     native         1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>>  5  LE    BE     BE             1                           0xaabb      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xbbaa
>>  6  LE    BE     LE             0                           0xbbaa      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xbbaa
>>
>>  7  BE    LE     native         0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>>  8  BE    LE     BE             1                           0xaabb      [0xaa, 0xbb]  1                                 [0xbb, 0xaa]  0xaabb
>>  9  BE    LE     LE             0                           0xbbaa      [0xbb, 0xaa]  0                                 [0xbb, 0xaa]  0xaabb
>>
>> 10  BE    BE     native         1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
>> 11  BE    BE     BE             1                           0xaabb      [0xaa, 0xbb]  0                                 [0xaa, 0xbb]  0xaabb
>> 12  BE    BE     LE             0                           0xbbaa      [0xbb, 0xaa]  1                                 [0xaa, 0xbb]  0xaabb
> 
> The column you have labelled 'guest repr' here is the returned data
> from io_mem_read, whose API contract is "write the data from the
> device into this host C uint16_t (or whatever) such that it is the
> value returned by the device read as a native host value". It's
> not related to the guest order at all.
> 
> So for instance, io_mem_read() is called by cpu_physical_memory_rw(),
> which passes it a local variable "val". So now "val" has the
> "guest repr" column's bytes in it, and (as a host C variable) the
> value:
> 1,2,3 : 0xbbaa
> 4,5,6 : 0xaabb
> 7,8,9 : 0xbbaa
> 10,11,12 : 0xaabb
> 
> As you can see, this depends on the "guest endianness" (which is
> kind of the endianness of the bus): a BE guest 16 bit access to
> this device would return the 16 bit value 0xaabb, and an LE guest
> 0xbbaa, and we have exactly those values in our host C variable.
> If this is TCG, then we'll copy that 16 bit host value into the
> CPUState struct field corresponding to the destination guest
> register as-is. (TCG CPUState fields are always in host-C-order.)
> 
> However, to pass them up to KVM we have to put them into a
> buffer in RAM as per the KVM_EXIT_MMIO ABI. So:
> cpu_physical_memory_rw() calls stw_p(buf, val) [which is "store
> in target CPU endianness"], so now buf has the bytes:
> 1,2,3 : [0xaa, 0xbb]
> 4,5,6 : [0xaa, 0xbb]
> 7,8,9 : [0xaa, 0xbb]
> 10,11,12 : [0xaa, 0xbb]
> 
> ...which is the same for every case.
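> 
> (stw_p() is roughly, from the cpu-all.h family of helpers:
> 
>     static inline void stw_p(void *ptr, uint16_t v)
>     {
>     #if defined(TARGET_WORDS_BIGENDIAN)
>         stw_be_p(ptr, v);
>     #else
>         stw_le_p(ptr, v);
>     #endif
>     }
> 
> so the swap it may apply exactly cancels the guest-endianness
> dependence of "val" noted above.)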
> 
> This buffer is (for KVM) the run->mmio.data buffer, whose semantics
> are "the value as it would appear if the VCPU performed a load or store
> of the appropriate width directly to the byte array". Which is what we
> want -- your device has two bytes in order 0xaa, 0xbb, and we did
> a 16 bit load in the guest CPU, so we should get the same answer as if
> we did a 16 bit load of RAM containing 0xaa, 0xbb. That will be
> 0xaabb if the VCPU is big-endian, and 0xbbaa if it is not.
> 
> I think the fact that all of these things come out to the same
> set of bytes in the mmio.data buffer is the indication that all
> QEMU's byte swapping is correct.
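> 
> (For reference, the buffer lives in struct kvm_run, <linux/kvm.h>:
> 
>     /* KVM_EXIT_MMIO */
>     struct {
>         __u64 phys_addr;
>         __u8  data[8];
>         __u32 len;
>         __u8  is_write;
>     } mmio;
> 
> and the "as if the VCPU performed a load or store directly on the
> byte array" semantics come from the KVM API documentation.)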
> 
>> Looking at the two rightmost columns, we must conclude:
>>
>> (a) The "device_endian" field has absolutely no significance wrt. what
>>     the guest sees. In each triplet of cases, when we go from
>>     DEVICE_NATIVE_ENDIAN to DEVICE_BIG_ENDIAN to DEVICE_LITTLE_ENDIAN,
>>     the guest sees the exact same value.
>>
>>     I don't understand this result (it makes me doubt device_endian
>>     makes any sense). What did I do wrong?
> 
> I think it's because you defined your device as only supporting
> byte accesses that you didn't see any difference between the
> various device_endian settings. If a device presents itself as
> just an array of bytes then it doesn't have an endianness, really.
> 
> As Paolo says, if you make your device support wider accesses
> directly and build up the value yourself then you'll see that
> setting the device_endian to LE vs BE vs native does change
> the values you see in the guest (and that you'll need to set it
> to LE and interpret the words in the guest as LE to get the
> right behaviour).
> 
>>     This means that this interface is *value preserving*, not
>>     representation preserving. In other words: when host and guest
>>     endiannesses are identical, the *array* is transferred okay (the
>>     guest array matches the initial host array [0xaa, 0xbb]). When guest
>>     and host differ in endianness, the guest will see an incorrect
>>     *array*.
> 
> Think of a device which supports only byte accesses as being
> like a lump of RAM. A big-endian guest CPU which reads 32 bits
> at a time will get different values in registers to an LE guest
> which does the same.
> 
> *However*, if both CPUs just do "read 32 bits; write 32 bits to
> RAM" (ie a kind of memcpy but with the source being the mmio
> register rather than some other bit of RAM) then you should get
> a bytewise-correct copy of the data in RAM.
> 
> So I *think* that would be a viable approach: have your QEMU
> device code as it is now, and just make sure that if the guest
> is doing wider-than-byte accesses it does them as
>   do {
>      load word from mmio register;
>      store word to RAM;
>   } while (count);
> 
> and doesn't try to interpret the byte order of the values while
> they're in the CPU register in the middle of the loop.
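> 
> In C, that loop would look something like this (guest-side sketch,
> made-up names, assuming a 32-bit data register):
> 
>     void copy_from_fw_cfg(volatile uint32_t *mmio_data, uint32_t *dst,
>                           size_t words)
>     {
>         while (words--) {
>             *dst++ = *mmio_data;  /* no interpretation in between */
>         }
>     }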
> 
> Paolo's suggestion would work too, if you prefer that.
> 
>> I apologize for wasting everyone's time, but I think both results are
>> very non-intuitive.
> 
> Everything around endianness handling is non-intuitive --
> it's the nature of the problem, I'm afraid. (Some of this is
> also QEMU's fault for not having good documentation of the
> semantics of each of the layers involved in memory accesses.
> I have on my todo list the vague idea of trying to write these
> all up as a blog post.)

Thanks for taking the time to write this up. My analysis must have
missed at least two important things then:
- the device's read/write function needs to consider address & size, and
  return values that match host byte order. fw_cfg doesn't conform ATM.
- there's one more layer, outside the call tree I examined, that can
  perform endianness conversion.

I'll try to implement Paolo's suggestion (ie. support wide accesses in
fw_cfg internally, not relying on splitting/combining by memory.c).
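
Something like the following, I imagine (an untested sketch, assuming
the region is registered with a raised max_access_size and
DEVICE_LITTLE_ENDIAN; fw_cfg_read() is the existing one-byte helper):

    static uint64_t fw_cfg_data_mem_read(void *opaque, hwaddr addr,
                                         unsigned size)
    {
        FWCfgState *s = opaque;
        uint64_t value = 0;
        unsigned i;

        /* Compose the value LSB-first; combined with
         * DEVICE_LITTLE_ENDIAN, this hands the fw_cfg byte stream to
         * the guest in order, regardless of guest endianness. */
        for (i = 0; i < size; i++) {
            value |= (uint64_t)fw_cfg_read(s) << (i * 8);
        }
        return value;
    }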

Thanks
Laszlo

Thread overview: 37+ messages
2014-12-12 15:58 [Qemu-devel] [PATCH v4 0/8] fw_cfg, bootorder, and UEFI+'-kernel' on arm/virt Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 1/8] fw_cfg: max access size and region size are the same for MMIO data reg Laszlo Ersek
2014-12-16 13:48   ` Andrew Jones
2014-12-16 19:00     ` Laszlo Ersek
2014-12-16 19:49       ` Paolo Bonzini
2014-12-16 20:06         ` Laszlo Ersek
2014-12-16 20:17           ` Laszlo Ersek
2014-12-16 21:47             ` Paolo Bonzini
2014-12-17  4:52               ` Laszlo Ersek
2014-12-16 20:40           ` Paolo Bonzini
2014-12-16 21:47             ` Peter Maydell
2014-12-17  5:06             ` Laszlo Ersek
2014-12-17  9:23               ` Paolo Bonzini
2014-12-17  9:31                 ` Alexander Graf
2014-12-16 20:41       ` Peter Maydell
2014-12-17  7:13         ` Laszlo Ersek [this message]
2014-12-17  8:28           ` Alexander Graf
2014-12-17  8:40             ` Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 2/8] fw_cfg: generalize overlap check for combining control and data I/O ports Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 3/8] fw_cfg: introduce the "data_memwidth" property Laszlo Ersek
2014-12-16 12:06   ` Alexander Graf
2014-12-16 12:42     ` Laszlo Ersek
2014-12-16 16:59       ` Laszlo Ersek
2014-12-16 17:10         ` Peter Maydell
2014-12-16 17:20           ` Alexander Graf
2014-12-16 18:52             ` Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 4/8] fw_cfg: expose the "data_memwidth" prop with fw_cfg_init_data_memwidth() Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 5/8] arm: add fw_cfg to "virt" board Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 6/8] hw/loader: split out load_image_gzipped_buffer() Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 7/8] hw/arm: pass pristine kernel image to guest firmware over fw_cfg Laszlo Ersek
2014-12-16 12:15   ` Alexander Graf
2014-12-16 12:18     ` Peter Maydell
2014-12-16 12:20       ` Alexander Graf
2014-12-16 12:25         ` Peter Maydell
2014-12-16 12:42           ` Richard W.M. Jones
2014-12-16 12:44             ` Laszlo Ersek
2014-12-12 15:58 ` [Qemu-devel] [PATCH v4 8/8] hw/arm/virt: enable passing of EFI-stubbed kernel to guest UEFI firmware Laszlo Ersek
