From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([209.51.188.92]:54378)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <jjherne@linux.ibm.com>) id 1gw6qD-0003D5-H0
	for qemu-devel@nongnu.org; Tue, 19 Feb 2019 09:57:51 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <jjherne@linux.ibm.com>) id 1gw6q4-0007xY-5B
	for qemu-devel@nongnu.org; Tue, 19 Feb 2019 09:57:46 -0500
Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:37394
	helo=mx0a-001b2d01.pphosted.com)
	by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32)
	(Exim 4.71) (envelope-from <jjherne@linux.ibm.com>)
	id 1gw6q1-0007qz-SP
	for qemu-devel@nongnu.org; Tue, 19 Feb 2019 09:57:39 -0500
Received: from pps.filterd (m0098414.ppops.net [127.0.0.1])
	by mx0b-001b2d01.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id
	x1JEtPfd133455
	for <qemu-devel@nongnu.org>; Tue, 19 Feb 2019 09:57:27 -0500
Received: from e17.ny.us.ibm.com (e17.ny.us.ibm.com [129.33.205.207])
	by mx0b-001b2d01.pphosted.com with ESMTP id 2qrhprr2jn-1
	(version=TLSv1.2 cipher=AES256-GCM-SHA384 bits=256 verify=NOT)
	for <qemu-devel@nongnu.org>; Tue, 19 Feb 2019 09:57:26 -0500
Received: from localhost
	by e17.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only!
	Violators will be prosecuted
	for <qemu-devel@nongnu.org> from <jjherne@linux.ibm.com>;
	Tue, 19 Feb 2019 14:57:26 -0000
Reply-To: jjherne@linux.ibm.com
References: <1548768562-20007-1-git-send-email-jjherne@linux.ibm.com>
	<1548768562-20007-16-git-send-email-jjherne@linux.ibm.com>
	<20190204130238.120099ff.cohuck@redhat.com>
From: "Jason J. Herne" <jjherne@linux.ibm.com>
Date: Tue, 19 Feb 2019 09:57:20 -0500
MIME-Version: 1.0
In-Reply-To: <20190204130238.120099ff.cohuck@redhat.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Message-Id: <8e93e4be-7f7f-f827-aebd-53cb6fd7107e@linux.ibm.com>
Subject: Re: [Qemu-devel] [PATCH 15/15] s390-bios: Support booting from real
 dasd device
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Cornelia Huck <cohuck@redhat.com>
Cc: qemu-devel@nongnu.org, qemu-s390x@nongnu.org, pasic@linux.ibm.com, alifm@linux.ibm.com, borntraeger@de.ibm.com

On 2/4/19 7:02 AM, Cornelia Huck wrote:
> On Tue, 29 Jan 2019 08:29:22 -0500
> "Jason J. Herne" <jjherne@linux.ibm.com> wrote:
> 
>> Allows guest to boot from a vfio configured real dasd device.
>>
>> Signed-off-by: Jason J. Herne <jjherne@linux.ibm.com>
>> ---
>>   docs/devel/s390-dasd-ipl.txt | 132 +++++++++++++++++++++++
>>   pc-bios/s390-ccw/Makefile    |   2 +-
>>   pc-bios/s390-ccw/dasd-ipl.c  | 249 +++++++++++++++++++++++++++++++++++++++++++
>>   pc-bios/s390-ccw/dasd-ipl.h  |  16 +++
>>   pc-bios/s390-ccw/main.c      |   4 +
>>   pc-bios/s390-ccw/s390-arch.h |  13 +++
>>   6 files changed, 415 insertions(+), 1 deletion(-)
>>   create mode 100644 docs/devel/s390-dasd-ipl.txt
>>   create mode 100644 pc-bios/s390-ccw/dasd-ipl.c
>>   create mode 100644 pc-bios/s390-ccw/dasd-ipl.h
>>
>> diff --git a/docs/devel/s390-dasd-ipl.txt b/docs/devel/s390-dasd-ipl.txt
>> new file mode 100644
>> index 0000000..84ec7b8
>> --- /dev/null
>> +++ b/docs/devel/s390-dasd-ipl.txt
>> @@ -0,0 +1,132 @@
>> +*****************************
>> +***** s390 hardware IPL *****
>> +*****************************
>> +
>> +The s390 hardware IPL process consists of the following steps.
>> +
>> +1. A READ IPL ccw is constructed in memory location 0x0.
>> +    This ccw, by definition, reads the IPL1 record which is located on the disk
>> +    at cylinder 0 track 0 record 1. Note that the chain flag is on in this ccw
>> +    so when it is complete another ccw will be fetched and executed from memory
>> +    location 0x08.
>> +
>> +2. Execute the Read IPL ccw at 0x00, thereby reading IPL1 data into 0x00.
>> +    IPL1 data is 24 bytes in length and consists of the following pieces of
>> +    information: [psw][read ccw][tic ccw]. When the machine executes the Read
>> +    IPL ccw it reads the 24 bytes of IPL1 into memory starting at location
>> +    0x0. Then the ccw program at 0x08, which consists of a read ccw and a tic
>> +    ccw, is automatically executed because of the chain flag from the original
>> +    READ IPL ccw. The read ccw will read the IPL2 data into memory and the TIC
>> +    (Transfer In Channel) will transfer control to the channel
>> +    program contained in the IPL2 data. The TIC channel command is the
>> +    equivalent of a branch/jump/goto instruction for channel programs.
>> +    NOTE: The ccws in IPL1 are defined by the architecture to be format 0.
>> +
>> +3. Execute IPL2.
>> +    The TIC ccw instruction at the end of the IPL1 channel program will begin
>> +    the execution of the IPL2 channel program. IPL2 is stage-2 of the boot
>> +    process and will contain a larger channel program than IPL1. The point of
>> +    IPL2 is to find and load either the operating system or a small program that
>> +    loads the operating system from disk. At the end of this step all or some of
>> +    the real operating system is loaded into memory and we are ready to hand
>> +    control over to the guest operating system. At this point the guest
>> +    operating system is entirely responsible for loading any more data it might
>> +    need to function. NOTE: The IPL2 channel program might read data into memory
>> +    location 0 thereby overwriting the IPL1 psw and channel program. This is ok
>> +    as long as the data placed in location 0 contains a psw whose instruction
>> +    address points to the guest operating system code to execute at the end of
>> +    the IPL/boot process.
>> +    NOTE: The ccws in IPL2 are defined by the architecture to be format 0.
>> +
>> +4. Start executing the guest operating system.
>> +    The psw that was loaded into memory location 0 as part of the IPL process
>> +    should contain the needed flags for the operating system we have loaded. The
>> +    psw's instruction address will point to the location in memory where we want
>> +    to start executing the operating system. This psw is loaded (via LPSW
>> +    instruction) causing control to be passed to the operating system code.
>> +
>> +In a non-virtualized environment this process, handled entirely by the hardware,
>> +is kicked off by the user initiating a "Load" procedure from the hardware
>> +management console. This "Load" procedure crafts a special "Read IPL" ccw in
>> +memory location 0x0 that reads IPL1. It then executes this ccw thereby kicking
>> +off the reading of IPL1 data. Since the channel program from IPL1 will be
>> +written immediately after the special "Read IPL" ccw, the IPL1 channel program
>> +will be executed immediately (the special read ccw has the chaining bit turned
>> +on). The TIC at the end of the IPL1 channel program will cause the IPL2 channel
>> +program to be executed automatically. After this sequence completes the "Load"
>> +procedure then loads the psw from 0x0.
> 
> Nice summary!
> 
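To make the IPL1 layout above concrete, here is a minimal sketch of the 24-byte record and of a "Read IPL" ccw in format 0. All names (sketch_ccw0, sketch_ipl1, sketch_make_read_ipl, sketch_ipl2_entry) are made up for illustration and are not taken from the patch. The byte count of 24 and the use of only the chain-command flag are assumptions based on the description above.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CMD_READ_IPL 0x02  /* Read IPL command code */
#define SKETCH_CMD_TIC      0x08  /* Transfer In Channel */
#define SKETCH_FLAG_CC      0x40  /* chain-command flag */

/* Format-0 ccw kept as raw big-endian bytes (host-endian independent). */
struct sketch_ccw0 {
    uint8_t cmd;       /* command code */
    uint8_t addr[3];   /* 24-bit data address */
    uint8_t flags;     /* CD, CC, SLI, ... */
    uint8_t reserved;
    uint8_t count[2];  /* 16-bit byte count */
};

/* IPL1 record: [psw][read ccw][tic ccw], 8 bytes each. */
struct sketch_ipl1 {
    uint8_t psw[8];
    struct sketch_ccw0 read_ccw;
    struct sketch_ccw0 tic_ccw;
};

/* Build a Read IPL ccw that reads 24 bytes to address 0. */
static struct sketch_ccw0 sketch_make_read_ipl(int chain)
{
    struct sketch_ccw0 ccw = {0};

    ccw.cmd = SKETCH_CMD_READ_IPL;
    ccw.flags = chain ? SKETCH_FLAG_CC : 0;
    ccw.count[1] = 24;  /* count[0] stays 0, so the count is 24 */
    return ccw;
}

/* Extract the 24-bit data address of a format-0 ccw. */
static uint32_t sketch_ccw0_addr(const struct sketch_ccw0 *ccw)
{
    return ((uint32_t)ccw->addr[0] << 16) |
           ((uint32_t)ccw->addr[1] << 8) |
           (uint32_t)ccw->addr[2];
}

/* The TIC target in IPL1 is where the IPL2 channel program starts. */
static uint32_t sketch_ipl2_entry(const struct sketch_ipl1 *ipl1)
{
    /* In format 0 only the low four bits identify a TIC. */
    assert((ipl1->tic_ccw.cmd & 0x0f) == SKETCH_CMD_TIC);
    return sketch_ccw0_addr(&ipl1->tic_ccw);
}
```

With chaining off, sketch_make_read_ipl(0) gives the variant used in the QEMU procedure further down, and sketch_ipl2_entry() corresponds to grabbing the TIC target from IPL1.
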
>> +
>> +*****************************************
>> +***** How this all pertains to Qemu *****
> 
> s/Qemu/QEMU/
> 
> (also below)
> 

Fixed.

>> +*****************************************
>> +
>> +In theory we should merely have to do the following to IPL/boot a guest
>> +operating system from a DASD device:
>> +
>> +1. Place a "Read IPL" ccw into memory location 0x0 with chaining bit on.
>> +2. Execute channel program at 0x0.
>> +3. LPSW 0x0.
>> +
>> +However, our emulation of the machine's channel program logic is missing one key
>> +feature that is required for this process to work: non-prefetch of ccw data.
>> +
>> +When we start a channel program we pass the channel subsystem parameters via an
>> +ORB (Operation Request Block). One of those parameters is a prefetch bit. If the
>> +bit is on then Qemu is allowed to read the entire channel program from guest
>> +memory before it starts executing it. This means that any channel commands that
>> +read additional channel commands will not work as expected because the newly
>> +read commands will only exist in guest memory and NOT within Qemu's channel
>> +subsystem memory. Qemu's channel subsystem's implementation currently requires
> 
> But isn't that the vfio-ccw backend, rather than the channel subsystem
> implementation?
> 

Yep, you're right. I'll clarify this.

>> +this bit to be on for all channel programs. This is a problem because the IPL
>> +process consists of transferring control from the "Read IPL" ccw immediately to
>> +the IPL1 channel program that was read by "Read IPL".
>> +
>> +Not being able to turn off prefetch will also prevent the TIC at the end of the
>> +IPL1 channel program from transferring control to the IPL2 channel program.
>> +
>> +Lastly, in some cases (the zipl bootloader for example) the IPL2 program also
>> +transfers control to another channel program segment immediately after reading it
>> +from the disk. So we need to be able to handle this case.
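
To illustrate why forced prefetch breaks a channel program that reads its own continuation, here is a toy model. It is not the vfio-ccw or css code, and every name in it (toy_ccw, toy_run, toy_init) is hypothetical. The executor either fetches each command from live memory, or copies the whole program up front and runs from the copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_SLOTS 4

enum toy_op { TOY_END, TOY_WRITE_MARK, TOY_MARK };

struct toy_ccw {
    enum toy_op op;
    int target;   /* slot overwritten by TOY_WRITE_MARK */
    bool chain;   /* continue with the next slot when set */
};

/* Slot 0 "reads" a new command into slot 1 and chains to it. */
static void toy_init(struct toy_ccw mem[TOY_SLOTS])
{
    memset(mem, 0, sizeof(struct toy_ccw) * TOY_SLOTS);
    mem[0].op = TOY_WRITE_MARK;
    mem[0].target = 1;
    mem[0].chain = true;
}

/* Returns how many TOY_MARK commands were actually executed. */
static int toy_run(struct toy_ccw *mem, bool prefetch)
{
    struct toy_ccw snapshot[TOY_SLOTS];
    const struct toy_ccw *prog = mem;
    int marks = 0;

    if (prefetch) {
        /* The whole program is copied before execution starts. */
        memcpy(snapshot, mem, sizeof(snapshot));
        prog = snapshot;
    }

    for (int i = 0; i < TOY_SLOTS; i++) {
        struct toy_ccw cur = prog[i];

        if (cur.op == TOY_WRITE_MARK) {
            assert(cur.target >= 0 && cur.target < TOY_SLOTS);
            /* Data always lands in live memory, never in the copy. */
            mem[cur.target].op = TOY_MARK;
            mem[cur.target].chain = false;
        } else if (cur.op == TOY_MARK) {
            marks++;
        }
        if (!cur.chain) {
            break;
        }
    }
    return marks;
}
```

Without prefetch the freshly written command in slot 1 is executed. With prefetch the stale copy of slot 1 is used and the new command is never seen. Starting the program a second time takes a new copy, which is the same idea as breaking the chain and issuing another start subchannel.
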
>> +
>> +**************************
>> +***** What Qemu does *****
>> +**************************
>> +
>> +Since we are forced to live with prefetch we cannot use the very simple IPL
>> +procedure we defined in the preceding section. So we compensate by doing the
>> +following.
>> +
>> +1. Place "Read IPL" ccw into memory location 0x0, but turn off chaining bit.
>> +2. Execute "Read IPL" at 0x0.
>> +
>> +   So now IPL1's psw is at 0x0 and IPL1's channel program is at 0x08.
>> +
>> +3. Write a custom channel program that will seek to the IPL2 record and then
>> +   execute the READ and TIC ccws from IPL1.  Normally the seek is not required
>> +   because after reading the IPL1 record the disk is automatically positioned
>> +   to read the very next record which will be IPL2. But since we are not reading
>> +   both IPL1 and IPL2 as part of the same channel program we must manually set
>> +   the position.
>> +
>> +4. Grab the target address of the TIC instruction from the IPL1 channel program.
>> +   This address is where the IPL2 channel program starts.
>> +
>> +   Now IPL2 is loaded into memory somewhere, and we know the address.
>> +
>> +5. Execute the IPL2 channel program at the address obtained in step #4.
>> +
>> +   Because this channel program can be dynamic, we must use a special algorithm
>> +   that detects a READ immediately followed by a TIC and breaks the ccw chain
>> +   by turning off the chain bit in the READ ccw. When control is returned from
>> +   the kernel/hardware to the Qemu bios code we immediately issue another start
>> +   subchannel to execute the remaining TIC instruction. This causes the entire
>> +   channel program (starting from the TIC) and all needed data to be refetched
>> +   thereby stepping around the limitation that would otherwise prevent this
>> +   channel program from executing properly.
>> +
>> +   Now the operating system code is loaded somewhere in guest memory and the psw
>> +   in memory location 0x0 will point to entry code for the guest operating
>> +   system.
>> +
>> +6. LPSW 0x0.
>> +   LPSW transfers control to the guest operating system and we're done.
> 
> Also a good explanation of the procedure here!
> 
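As a rough illustration of the "READ immediately followed by a TIC" detection described above, here is a simplified sketch. It is not the dynamic_cp_fixup() from the patch. It only walks a linear array of format-0 ccws, does not follow TICs, ignores data chaining, and treats any command code whose low two bits are 10 as a read, which is an assumption made for this sketch. All sk_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_CMD_TIC 0x08  /* Transfer In Channel */
#define SK_FLAG_CC 0x40  /* chain-command flag */

/* Format-0 ccw as raw bytes. */
struct sk_ccw0 {
    uint8_t cmd;
    uint8_t addr[3];
    uint8_t flags;
    uint8_t reserved;
    uint8_t count[2];
};

static bool sk_is_tic(const struct sk_ccw0 *ccw)
{
    /* In format 0 only the low four bits identify a TIC. */
    return (ccw->cmd & 0x0f) == SK_CMD_TIC;
}

static bool sk_is_read(const struct sk_ccw0 *ccw)
{
    /* Assumption: read-type commands have low bits 10. */
    return (ccw->cmd & 0x03) == 0x02;
}

/*
 * Look for a chained READ directly followed by a TIC. If found, clear the
 * READ's chain flag so the program stops after the READ, report the index
 * of the TIC in *tic_index and return true. Otherwise return false and
 * leave the program untouched.
 */
static bool sk_break_read_tic(struct sk_ccw0 *prog, size_t n,
                              size_t *tic_index)
{
    assert(prog != NULL && tic_index != NULL);

    for (size_t i = 0; i + 1 < n; i++) {
        if (!(prog[i].flags & SK_FLAG_CC)) {
            break;  /* the chain ends here */
        }
        if (sk_is_read(&prog[i]) && sk_is_tic(&prog[i + 1])) {
            prog[i].flags &= (uint8_t)~SK_FLAG_CC;
            *tic_index = i + 1;
            return true;
        }
    }
    return false;
}
```

A caller would run the program up to the broken chain, then start a new channel program at the reported TIC, and repeat until sk_break_read_tic() returns false.
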
> (...)
> 
>> +static int run_dynamic_ccw_program(SubChannelId schid, uint32_t cpa)
>> +{
>> +    bool has_next;
>> +    uint32_t next_cpa = 0;
>> +    int rc;
>> +
>> +    do {
>> +        has_next = dynamic_cp_fixup(cpa, &next_cpa);
>> +
>> +        print_int("executing ccw chain at ", cpa);
> 
> Do you want to keep the unconditional print here? Or make it a
> debug_print_int, and maybe an unconditional print on error?
> 

Personally, I like having this here unconditionally. If things hang up or go wrong, this 
lets us know whether it happened before or after we jumped into actual guest OS code. I know 
I could make it debug only, but having it all the time means better first-failure data capture.

-- 
-- Jason J. Herne (jjherne@linux.ibm.com)