From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:60485)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <c.pinto@virtualopensystems.com>) id 1Zj82k-00071w-F6
	for qemu-devel@nongnu.org; Mon, 05 Oct 2015 11:51:17 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <c.pinto@virtualopensystems.com>) id 1Zj82e-0005kM-Gt
	for qemu-devel@nongnu.org; Mon, 05 Oct 2015 11:51:13 -0400
Received: from mail-wi0-f181.google.com ([209.85.212.181]:33052)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <c.pinto@virtualopensystems.com>) id 1Zj82e-0005k6-6j
	for qemu-devel@nongnu.org; Mon, 05 Oct 2015 11:51:08 -0400
Received: by wiclk2 with SMTP id lk2so127372833wic.0
	for <qemu-devel@nongnu.org>; Mon, 05 Oct 2015 08:51:07 -0700 (PDT)
From: Christian Pinto <c.pinto@virtualopensystems.com>
References: <1443535059-26010-1-git-send-email-c.pinto@virtualopensystems.com>
	<CAPokK=pmTsavj6xS6Pd7SXusEhKmwqBNE8BKT2jxh_7WFYzvKA@mail.gmail.com>
Message-ID: <56129C63.1090401@virtualopensystems.com>
Date: Mon, 5 Oct 2015 17:50:59 +0200
MIME-Version: 1.0
In-Reply-To: <CAPokK=pmTsavj6xS6Pd7SXusEhKmwqBNE8BKT2jxh_7WFYzvKA@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC PATCH 0/8] Towards an Heterogeneous QEMU
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Peter Crosthwaite <crosthwaitepeter@gmail.com>, "mar.krzeminski" <mar.krzeminski@gmail.com>, Peter Maydell <peter.maydell@linaro.org>, Edgar Iglesias <edgar.iglesias@xilinx.com>
Cc: Jani.Kokkonen@huawei.com, tech@virtualopensystems.com, Claudio.Fontana@huawei.com, "qemu-devel@nongnu.org Developers" <qemu-devel@nongnu.org>

Hello Peter,

Thanks for your comments.

On 01/10/2015 18:26, Peter Crosthwaite wrote:
> On Tue, Sep 29, 2015 at 6:57 AM, Christian Pinto
> <c.pinto@virtualopensystems.com>  wrote:
>> Hi all,
>>
>> This RFC patch-series introduces the set of changes enabling the
>> architectural elements to model the architecture presented in a previous RFC
>> letter: "[Qemu-devel][RFC] Towards an Heterogeneous QEMU".
>>
>> To recap the goal of such RFC:
>>
>> The idea is to enhance the current architecture of QEMU to enable the modeling
>> of a state-of-the-art SoC with an AMP processing style, where different
>> processing units share the same system memory and communicate through shared
>> memory and inter-processor interrupts.
> This might have a lot in common with a similar inter-qemu
> communication solution effort at Xilinx. Edgar talks about it at KVM
> forum:
>
> https://www.youtube.com/watch?v=L5zG5Aukfek
>
> Around 18:30 mark. I think it might be lower level than your proposal,
> remote-port is designed to export the raw hardware interfaces (busses
> and pins) between QEMU and some other system, another QEMU being the
> common use case.
Thanks for pointing this out. Indeed, what Edgar presented has a lot
in common with our proposal, but it targets a different scenario,
where the low-level modeling of the various hardware components is taken
into account.
The goal of my proposal, on the other hand, is to enable a set of tools
for high-level early prototyping of systems with a heterogeneous
set of cores, so as to model a platform that does not exist in reality
but that the user wants to experiment with.
As an example, I can envision a programming-model researcher wanting to
explore a heterogeneous system based on an X86 CPU and a multi-core ARM
accelerator sharing memory, in order to build a new programming paradigm
on top of it. Such a user would not need the specific details of the
hardware, nor all the various devices available in a real SoC, but only
an abstract model encapsulating the main features needed for their
research.

So, to link to your next comment as well, there is no actual SoC/hardware
targeted by this work.
>> An example is a multi-core ARM CPU
>> working alongside with two Cortex-M micro controllers.
>>
> Marcin is doing something with A9+M3. It sounds like he already has a
> lot working (latest emails were on some finer points). What is the
> board/SoC in question here (if you are able to share)?
>
>>  From the user point of view there is usually an operating system booting on
>> the Master processor (e.g. Linux) at platform startup, while the other
>> processors are used to offload the Master one from some computation or to deal
>> with real-time interfaces.
> I feel like this is architecting hardware based on common software use
> cases, rather than directly modelling the SoC in question. Can we
> model the hardware (e.g. the devices that are used for rpmesg and IPIs
> etc.) as regular devices, as it is in-SoC? That means AMP is just
> another guest?
This is a set of extensions focusing more on the communication channel
between the processors than on a full SoC model. With this patch series
each of the AMP processors is a different "guest".
>> It is the Master OS that triggers the boot of the
>> Slave processors, and provides them also the binary code to execute (e.g.
>> RTOS, binary firmware) by placing it into a pre-defined memory area that is
>> accessible to the Slaves. Usually the memory for the Slaves is carved out from
>> the Master OS during boot. Once a Slave is booted the two processors can
>> communicate through queues in shared memory and inter-processor interrupts
>> (IPIs). In Linux, it is the remoteproc/rpmsg framework that enables the
>> control (boot/shutdown) of Slave processors, and also to establish a
>> communication channel based on virtio queues.
>>
>> Currently, QEMU is not able to model such an architecture mainly because only
>> a single processor can be emulated at one time,
> SMP does work already. MTTCG will remove the one-run-at-a-time
> limitation. Multi-arch will allow you to mix multiple CPU
> architectures (e.g. PPC + ARM in same QEMU). But multiple
> heterogeneous ARMs should already just work, and there is already an
> in-tree precedent with the xlnx-zynqmp SoC. That SoC has 4xA53 and
> 2xR5 (all ARM).
Since Multi-arch is not yet available, this proposal makes it possible to
experiment with heterogeneous processors at a high level of abstraction,
even beyond ARM + ARM (e.g. X86 + ARM), using off-the-shelf QEMU.

One thing I want to add is that all the solutions mentioned in this
discussion (Multi-arch, Xilinx's patches, and our proposal) could coexist
from the code point of view, and none would prevent the others from
being used.
> Multiple system address spaces and CPUs have different views of the
> address space is another common snag on this effort, and is discussed
> on a recent thread between myself and Marcin.
Yes, I have seen the discussion, but it was mostly dealing with a single
QEMU instance modeling all the cores. Here the different address spaces
are enforced by multiple QEMU instances.
>> and the OS binary image needs
>> to be placed in memory at model startup.
>>
> I don't see what this limitation is exactly. Can you explain more? I
> do see a need to work on the ARM bootloader for AMP flows, it is a
> pure SMP bootloader that assumes total control.
The problem, as I see it, is that when we launch QEMU a binary needs to
be provided and placed in memory in order to be executed. In this patch
series the slave does not have any main memory allocated when it is first
launched. The information about memory (fd + offset for mmap) is sent
only later, when the boot is triggered. This is also safe, since the
slave will be waiting in the incoming state, and thus no corruption or
errors can happen before the boot is triggered.
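
To make the "fd + offset for mmap" step more concrete, here is a rough
sketch in Python. It is only an illustration of the general mechanism
(passing a file descriptor over a UNIX domain socket with SCM_RIGHTS and
mapping a window of it), not code from the series: the message layout
(two 64-bit integers) and the names send_region/recv_region are
assumptions made up for this example, and a temporary file stands in for
the hugetlbfs backing file.

```python
import array
import mmap
import os
import socket
import struct
import tempfile

PAGE = mmap.ALLOCATIONGRANULARITY   # mmap offsets must be a multiple of this
HEADER = "=QQ"                      # hypothetical message: offset, size


def send_region(sock, fd, offset, size):
    """Master side: send the backing fd together with (offset, size)."""
    payload = struct.pack(HEADER, offset, size)
    fds = array.array("i", [fd])
    sock.sendmsg([payload], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_region(sock):
    """Slave side: receive fd + (offset, size) and map only that window."""
    fds = array.array("i")
    data, ancdata, _flags, _addr = sock.recvmsg(
        struct.calcsize(HEADER), socket.CMSG_LEN(fds.itemsize))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            usable = len(cdata) - (len(cdata) % fds.itemsize)
            fds.frombytes(cdata[:usable])
    offset, size = struct.unpack(HEADER, data)
    fd = fds[0]
    try:
        return mmap.mmap(fd, size, flags=mmap.MAP_SHARED, offset=offset)
    finally:
        os.close(fd)    # the mapping stays valid after the fd is closed


def demo():
    """Single-process demo: master writes, slave sees it in its window."""
    master_sock, slave_sock = socket.socketpair(socket.AF_UNIX,
                                                socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as backing:
        backing.truncate(2 * PAGE)          # "whole" memory: two pages
        master_view = mmap.mmap(backing.fileno(), 2 * PAGE,
                                flags=mmap.MAP_SHARED)
        # The second page plays the role of the memory carved out for the slave.
        send_region(master_sock, backing.fileno(), PAGE, PAGE)
        slave_view = recv_region(slave_sock)
        master_view[PAGE:PAGE + 4] = b"boot"
        seen = bytes(slave_view[:4])
        slave_view.close()
        master_view.close()
    master_sock.close()
    slave_sock.close()
    return seen
```

Until recv_region() returns, the receiving side has no mapping at all,
which is the point of keeping the slave in the incoming state until the
boot is triggered.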
> Can this effort be a bootloader overhaul? Two things:
>
> 1: The bootloader needs to be repeatable
> 2: The bootloaders need to be targetable (to certain CPUs or clusters)
Well, in this series the bootloader for the master is different from the
one for the slave. My idea is that the master, besides the
firmware/kernel image, will also copy a bootloader for the slave.
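
As a sketch of what "copying" could look like from the master side, the
snippet below places a bootloader, a kernel image and a DTB back to back
in a buffer standing in for the slave memory. The layout, the 4 KiB
alignment and the function names are assumptions invented for this
example; the series does not define them.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def place_images(region, images, alignment=0x1000):
    """Copy each (name, blob) into region; return a name -> offset map.

    region is any writable buffer (a bytearray here, an mmap of the
    carved-out memory in practice). The layout is purely hypothetical.
    """
    layout = {}
    cursor = 0
    for name, blob in images:
        cursor = align_up(cursor, alignment)
        end = cursor + len(blob)
        if end > len(region):
            raise ValueError(f"{name} does not fit in the slave region")
        region[cursor:end] = blob
        layout[name] = cursor
        cursor = end
    return layout


# Usage with dummy blobs in place of real images.
slave_region = bytearray(0x4000)
layout = place_images(slave_region, [
    ("bootloader", b"\x01" * 16),
    ("kernel", b"\x02" * 0x1800),
    ("dtb", b"\x03" * 32),
])
```

The returned offsets are what the master would then have to communicate
to the slave bootloader, by whatever convention the two sides agree on.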
>> This patch series adds a set of modules and introduces minimal changes to the
>> current QEMU code-base to implement what described above, with master and slave
>> implemented as two different instances of QEMU. The aim of this work is to
>> enable application and runtime programmers to test their AMP applications, or
>> their new inter-SoC communication protocol.
>>
>> The main changes are depicted in the following diagram and involve:
>>      - A new multi-client socket implementation that allows multiple instances of
>>        QEMU to attach to the same socket, with only one acting as a master.
>>      - A new memory backend, the shared memory backend, based on
>>        the file memory backend. Such new backend enables, on the master side,
>>        to allocate the whole memory as shareable (e.g. /dev/shm, or hugetlbfs).
>>        On the slave side it enables the startup of QEMU without any main memory
>>        allocated. Then the slave goes in a waiting state, the same used in the
>>        case of an incoming migration, and a callback is registered on a
>>        multi-client socket shared with the master.
>>        The waiting state ends when the master sends to the slave the file
>>        descriptor and offset to mmap and use as memory.
> This is useful in its own right and came up in the Xilinx implementation.
It is also mentioned in the video you pointed to, where the MicroBlaze
cores are instantiated as foreign QEMU instances.
Is the code publicly available? There was a question about that in the
video, but I couldn't catch the answer.
>>      - A new inter-processor interrupt hardware distribution module, that is used
>>        also to trigger the boot of slave processors. Such module uses a pair of
>>        eventfd for each master-slave couple to trigger interrupts between the
>>        instances. No slave-to-slave interrupts are envisioned by the current
>>        implementation.
> Wouldn't that just be a software interrupt in the local QEMU instance?
Since in this proposal there will be multiple instances of QEMU running
at the same time, eventfds are used to signal the event (interrupt)
between the different processes. So writing to a register of the IDM
will raise an interrupt in a remote QEMU instance using eventfd. Does
this answer your question?
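
For readers not familiar with eventfd, the following toy model shows the
idea: one eventfd per direction, a write on one side and a read on the
other. It is a Python illustration only (Linux-specific), not the IDM
code; the IdmModel class and its "kick register" are hypothetical names,
and both ends live in one process here, whereas in the series they are
in different QEMU processes that exchange the descriptors over the
socket.

```python
import ctypes
import os
import struct


def make_eventfd():
    """Create a blocking eventfd counter starting at 0 (Linux only)."""
    if hasattr(os, "eventfd"):              # available from Python 3.10
        return os.eventfd(0)
    libc = ctypes.CDLL(None, use_errno=True)
    fd = libc.eventfd(0, 0)
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd


class IdmModel:
    """Toy stand-in for one side of the IDM (hypothetical interface)."""

    def __init__(self, tx_fd, rx_fd):
        self.tx_fd = tx_fd      # written to interrupt the remote side
        self.rx_fd = rx_fd      # read to collect interrupts from the remote

    def write_kick_register(self):
        # An eventfd write adds the 8-byte value to the kernel counter.
        os.write(self.tx_fd, struct.pack("=Q", 1))

    def wait_for_interrupt(self):
        # Blocks until the counter is non-zero, then returns it and resets it.
        (count,) = struct.unpack("=Q", os.read(self.rx_fd, 8))
        return count


# One eventfd per direction for a single master-slave couple.
master_to_slave = make_eventfd()
slave_to_master = make_eventfd()
master = IdmModel(tx_fd=master_to_slave, rx_fd=slave_to_master)
slave = IdmModel(tx_fd=slave_to_master, rx_fd=master_to_slave)

master.write_kick_register()
master.write_kick_register()
kicks_seen_by_slave = slave.wait_for_interrupt()    # two writes coalesce

slave.write_kick_register()
kicks_seen_by_master = master.wait_for_interrupt()

os.close(master_to_slave)
os.close(slave_to_master)
```

Note that several writes issued before a read are coalesced into one
counter value, so the receiver learns how many kicks arrived but not
when each one happened.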
>> The multi client-socket is used for the master to trigger
>>        the boot of a slave, and also for each master-slave couple to exchange the
>>        eventfd file descriptors. The IDM device can be instantiated either as a
>>        PCI or sysbus device.
>>
> So if everything is in one QEMU, IPIs can be implemented with just a
> regular interrupt controller (which has a software set).
As said, there are multiple instances of QEMU running at the same time,
and each of them will see the IDM in its memory map.
Even though the IDM instances are physically distinct, because of the
multiple processes, all together they act as a single block (e.g., a
light version of a mailbox).
>>                             Memory
>>                             (e.g. hugetlbfs)
>>
>> +------------------+       +--------------+            +------------------+
>> |                  |       |              |            |                  |
>> |   QEMU MASTER    |       |   Master     |            |   QEMU SLAVE     |
>> |                  |       |   Memory     |            |                  |
>> | +------+  +------+-+     |              |          +-+------+  +------+ |
>> | |      |  |SHMEM   |     |              |          |SHMEM   |  |      | |
>> | | VCPU |  |Backend +----->              |    +----->Backend |  | VCPU | |
>> | |      |  |        |     |              |    | +--->        |  |      | |
>> | +--^---+  +------+-+     |              |    | |   +-+------+  +--^---+ |
>> |    |             |       |              |    | |     |            |     |
>> |    +--+          |       |              |    | |     |        +---+     |
>> |       | IRQ      |       | +----------+ |    | |     |    IRQ |         |
>> |       |          |       | |          | |    | |     |        |         |
>> |  +----+----+     |       | | Slave    <------+ |     |   +----+---+     |
>> +--+  IDM    +-----+       | | Memory   | |      |     +---+ IDM    +-----+
>>     +-^----^--+             | |          | |      |         +-^---^--+
>>       |    |                | +----------+ |      |           |   |
>>       |    |                +--------------+      |           |   |
>>       |    |                                      |           |   |
>>       |    +--------------------------------------+-----------+   |
>>       |   UNIX Domain Socket(send mem fd + offset, trigger boot)  |
>>       |                                                           |
>>       +-----------------------------------------------------------+
>>                                eventfd
>>
> So the slave can only see a subset of the masters memory? Is the
> masters memory just the full system memory and the master is doing
> IOMMU setup for the slave pre-boot? Or is it a hard feature of the
> physical SoC?
Yes, slaves can only see the memory that has been reserved for them. This
is ensured by carving out the memory from the master kernel and providing
the offset of that memory to the slave. Each slave will have its own
memory map, and will see the memory at the address defined in the
machine model.
There is no IOMMU modeled, but it is not a hard feature either, since it
is decided at run-time.
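
The numbers used in the demo can be reproduced with a few lines of
arithmetic. The sketch below is in Python and rests on two assumptions
that are not stated in this thread: that RAM on the ARM virt machine
starts at 0x40000000, and that guest RAM maps linearly onto the backing
file, so that the file offset equals the distance from the RAM base.

```python
MiB = 1 << 20
VIRT_RAM_BASE = 0x40000000  # assumption: RAM base of the ARM virt machine


def carve_out(total_ram, master_ram, ram_base=VIRT_RAM_BASE):
    """Return where the slave window sits, given how much RAM the master keeps.

    Assumes guest RAM maps linearly onto the backing file, so the offset
    handed to the slave equals the amount of RAM kept by the master.
    """
    if master_ram > total_ram:
        raise ValueError("master cannot keep more RAM than is allocated")
    file_offset = master_ram
    return {
        "master_phys": ram_base + file_offset,  # as seen by the master
        "file_offset": file_offset,             # offset for mmap on the slave
        "size": total_ram - master_ram,
    }


# Demo values: 1G allocated in total, master booted with mem=512M.
window = carve_out(1024 * MiB, 512 * MiB)
```

With these values the window starts at 0x60000000 in the master view,
which matches the address used on the master kernel command line; the
slave then maps the same bytes at whatever address its own machine model
defines.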
>> The whole code can be checked out from:
>> https://git.virtualopensystems.com/dev/qemu-het.git
>> branch:
>> qemu-het-rfc-v1
>>
>> Patches apply to the current QEMU master branch
>>
>> =========
>> Demo
>> =========
>>
>> This patch series comes in the form of a demo to better understand how the
>> changes introduced can be exploited.
>> At the current status the demo can be executed using an ARM target for both
>> master and slave.
>>
>> The demo shows how a master QEMU instance carves out the memory for a slave,
>> copies inside linux kernel image and device tree blob and finally triggers the
>> boot.
>>
> These processes must have underlying hardware implementation, is the
> master using a system controller to implement the slave boot? (setting
> reset and entry points via registers?). How hard are they to model as
> regular devs?
>
In this series the system controller is the IDM device, which through a
set of registers puts the master in "control" of each of the slaves. The
IDM device is already seen as a regular device by each of the QEMU
instances involved.
>> How to reproduce the demo:
>>
>> In order to reproduce the demo a couple more extra elements need to be
>> downloaded and compiled.
>>
>> Binary loader
>> Loads the slave firmware (kernel) binary into memory and triggers the boot
>> https://git.virtualopensystems.com/dev/qemu-het-tools.git
>> branch:
>> load-bin-boot
>> To compile: just type "make"
>>
>> Slave kernel
>> Compile a linux kernel image (zImage) for the virt machine model.
>>
>> IDM test kernel module
>> Needed to trigger the boot of a slave
>> https://git.virtualopensystems.com/dev/qemu-het-tools.git
>> branch:
>> IDM-kernel-module
>> To compile: KDIR=kernel_path ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- make
>>
>> Slave DTB
>> https://git.virtualopensystems.com/dev/qemu-het-tools.git
>> branch:
>> slave-dtb
>>
>> Copy binary loader, IDM kernel module, zImage and dtb inside the disk
>> image or ramdisk of the master instance.
>>
>> Run the demo:
>>
>> run the master instance
>>
>> ./arm-softmmu/qemu-system-arm \
>>      -kernel zImage \
>>      -M virt -cpu cortex-a15 \
>>      -drive if=none,file=disk.img,cache=writeback,id=foo1 \
>>      -device virtio-blk-device,drive=foo1 \
>>      -object multi-socket-backend,id=foo,listen,path=ms_socket \
>>      -object memory-backend-shared,id=mem,size=1G,mem-path=/mnt/hugetlbfs,chardev=foo,master=on,prealloc=on  \
>>      -device idm_ipi,master=true,memdev=mem,socket=foo \
>>      -numa node,memdev=mem -m 1G \
>>      -append "root=/dev/vda rw console=ttyAMA0 mem=512M memmap=512M\$0x60000000" \
>>      -nographic
>>
>> run the slave instance
>>
>> ./arm-softmmu/qemu-system-arm\
>>      -M virt -cpu cortex-a15 -machine slave=on \
>>      -drive if=none,file=disk.img,cache=writeback,id=foo1 \
>>      -device virtio-blk-device,drive=foo1 \
>>      -object multi-socket-backend,id=foo,path=ms_socket \
>>      -object memory-backend-shared,id=mem,size=512M,mem-path=/mnt/hugetlbfs,chardev=foo,master=off \
>>      -device idm_ipi,master=false,memdev=mem,socket=foo \
>>      -incoming "shared:mem" -numa node,memdev=mem -m 512M \
>>      -nographic
>>
>>
>> For simplicity, use a disk image for the slave instead of a ramdisk.
>>
>> As visible from the kernel boot arguments, the master is booted with mem=512M
>> so that one half of the whole memory allocated is not used by the master and
>> reserved for the slave. Such memory starts for the virt platform from
>> address 0x60000000.
>>
>> Once the master is booted the image of the kernel and DTB can be copied in the
>> memory carved out for the slave.
>>
>> In the master console
>>
>> probe the IDM kernel module:
>>
>> $ insmod idm_test_mod.ko
>>
>> run the application that copies the binaries into memory and triggers the boot:
>>
>> $ ./load_bin_app 1 ./zImage ./slave.dtb
>>
>>
>> On the slave console the linux kernel boot should be visible.
>>
>> The present demo is intended only as a demonstration to see the patch-set at
>> work. In the near future, boot triggering, memory carveout and binary copy might
>> be implemented in a remoteproc driver coupled with a RPMSG driver for
>> communication between master and slave instance.
>>
> So are these drivers the same ones as run on the real hardware? is
> there value in the fact that the real IPI mechanisms are replaced with
> virt ones?
As for the first question, since there is no specific target hardware,
the drivers are generic too, and thus not meant to run on a real SoC.
The drivers shown for this demo are an example of how the patch series
could be used, and do not prevent users from implementing their own
drivers based on their own communication protocol.

Thanks,

Christian
> Regards,
> Peter
>
>> This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
>>
>> Baptiste Reynal (3):
>>    backend: multi-socket
>>    backend: shared memory backend
>>    migration: add shared migration type
>>
>> Christian Pinto (5):
>>    hw/misc: IDM Device
>>    hw/arm: sysbus-fdt
>>    qemu: slave machine flag
>>    hw/arm: boot
>>    qemu: numa
>>
>>   backends/Makefile.objs             |   4 +-
>>   backends/hostmem-shared.c          | 203 ++++++++++++++++++
>>   backends/multi-socket.c            | 353 +++++++++++++++++++++++++++++++
>>   default-configs/arm-softmmu.mak    |   1 +
>>   default-configs/i386-softmmu.mak   |   1 +
>>   default-configs/x86_64-softmmu.mak |   1 +
>>   hw/arm/boot.c                      |  13 ++
>>   hw/arm/sysbus-fdt.c                |  60 ++++++
>>   hw/core/machine.c                  |  27 +++
>>   hw/misc/Makefile.objs              |   2 +
>>   hw/misc/idm.c                      | 416 +++++++++++++++++++++++++++++++++++++
>>   include/hw/boards.h                |   2 +
>>   include/hw/misc/idm.h              | 119 +++++++++++
>>   include/migration/migration.h      |   2 +
>>   include/qemu/multi-socket.h        | 124 +++++++++++
>>   include/sysemu/hostmem-shared.h    |  61 ++++++
>>   migration/Makefile.objs            |   2 +-
>>   migration/migration.c              |   2 +
>>   migration/shared.c                 |  32 +++
>>   numa.c                             |  17 +-
>>   qemu-options.hx                    |   5 +-
>>   util/qemu-config.c                 |   5 +
>>   22 files changed, 1448 insertions(+), 4 deletions(-)
>>   create mode 100644 backends/hostmem-shared.c
>>   create mode 100644 backends/multi-socket.c
>>   create mode 100644 hw/misc/idm.c
>>   create mode 100644 include/hw/misc/idm.h
>>   create mode 100644 include/qemu/multi-socket.h
>>   create mode 100644 include/sysemu/hostmem-shared.h
>>   create mode 100644 migration/shared.c
>>
>> --
>> 1.9.1
>>
>>