All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Alex Bennée" <alex.bennee@linaro.org>
To: "Cédric Le Goater" <clg@kaod.org>
Cc: "Peter Delevoryas" <pdel@fb.com>,
	"Peter Maydell" <peter.maydell@linaro.org>,
	"Andrew Jeffery" <andrew@aj.id.au>,
	"Joel Stanley" <joel@jms.id.au>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"berrange@redhat.com" <berrange@redhat.com>,
	"eduardo@habkost.net" <eduardo@habkost.net>,
	"marcel.apfelbaum@gmail.com" <marcel.apfelbaum@gmail.com>,
	"richard.henderson@linaro.org" <richard.henderson@linaro.org>,
	"Philippe Mathieu-Daudé" <f4bug@amsat.org>,
	"ani@anisinha.ca" <ani@anisinha.ca>,
	"Cameron Esfahani via" <qemu-devel@nongnu.org>,
	qemu-arm <qemu-arm@nongnu.org>
Subject: Re: [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public
Date: Thu, 30 Jun 2022 10:43:31 +0100	[thread overview]
Message-ID: <87a69uh45w.fsf@linaro.org> (raw)
In-Reply-To: <a35ac443-c93e-fb2b-6da9-2a3dc16f7d8c@kaod.org>


Cédric Le Goater <clg@kaod.org> writes:

> On 6/29/22 20:24, Alex Bennée wrote:
>> Cédric Le Goater <clg@kaod.org> writes:
>> 
>>> On 6/29/22 16:14, Alex Bennée wrote:
>>>> Cédric Le Goater <clg@kaod.org> writes:
>>>>
>>>>> On 6/24/22 18:50, Cédric Le Goater wrote:
>>>>>> On 6/23/22 20:43, Peter Delevoryas wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Jun 23, 2022, at 8:09 AM, Cédric Le Goater <clg@kaod.org> wrote:
>>>>>>>>
>>>>>>>> On 6/23/22 12:26, Peter Delevoryas wrote:
>>>>>>>>> Signed-off-by: Peter Delevoryas <pdel@fb.com>
>>>>>>>>
>>>>>>>> Let's start simple without flash support. We should be able to
>>>>>>>> load FW blobs in each CPU address space using loader devices.
>>>>>>>
>>>>>>> Actually, I was unable to do this, perhaps because the fb OpenBMC
>>>>>>> boot sequence is a little weird. I specifically _needed_ to have
>>>>>>> a flash device which maps the firmware in at 0x2000_0000, because
>>>>>>> the fb OpenBMC U-Boot SPL jumps to that address to start executing
>>>>>>> from flash? I think this is also why fb OpenBMC machines can be so slow.
>>>>>>>
>>>>>>> $ ./build/qemu-system-arm -machine fby35 \
>>>>>>>        -device loader,file=fby35.mtd,addr=0,cpu-num=0 -nographic \
>>>>>>>        -d int -drive file=fby35.mtd,format=raw,if=mtd
>>>>>> Ideally we should be booting from the flash device directly using
>>>>>> the machine option '-M ast2600-evb,execute-in-place=true' like HW
>>>>>> does. Instructions are fetched using SPI transfers. But the amount
>>>>>> of code generated is tremendous.
>>>> Yeah because there is a potential race when reading from HW so we
>>>> throw
>>>> away TB's after executing them because we have no way of knowing if it
>>>> has changed under our feet. See 873d64ac30 (accel/tcg: re-factor non-RAM
>>>> execution code) which cleaned up this handling.
>>>>
>>>>>> See some profiling below for a
>>>>>> run which barely reaches DRAM training in U-Boot.
>>>>>
>>>>> Some more profiling on both ast2500 and ast2600 machines shows :
>>>>>
>>>>>
>>>>> * ast2600-evb,execute-in-place=true :
>>>>>
>>>>> Type               Object  Call site                Wait Time (s)         Count  Average (us)
>>>>> ---------------------------------------------------------------------------------------------
>>>>> BQL mutex  0x564dc03922e0  accel/tcg/cputlb.c:1365       14.21443
>>>>> 32909927          0.43
>>>> This is unavoidable as a HW access needs the BQL held so we will go
>>>> through this cycle every executed instruction.
>>>> Did I miss why the flash contents are not mapped into the physical
>>>> address space? Isn't that how it appear to the processor?
>>>
>>>
>>> There are two modes :
>>>            if (ASPEED_MACHINE(machine)->mmio_exec) {
>>>              memory_region_init_alias(boot_rom, NULL, "aspeed.boot_rom",
>>>                                       &fl->mmio, 0, size);
>>>              memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>>>                                          boot_rom);
>>>          } else {
>>>              memory_region_init_rom(boot_rom, NULL, "aspeed.boot_rom",
>>>                                     size, &error_abort);
>>>              memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
>>>                                          boot_rom);
>>>              write_boot_rom(drive0, FIRMWARE_ADDR, size, &error_abort);
>>>          }
>>>
>>> The default boot mode uses the ROM. No issue.
>>>
>>> The "execute-in-place=true" option creates an alias on the region of
>>> the flash contents and each instruction is then fetched from the flash
>>> drive with SPI transactions.

I wonder if we were keeping the code before?

>>>
>>> With old FW images, using an older U-boot, the machine boots in a couple
>>> of seconds. See the profiling below for a witherspoon-bmc machine using
>>> U-Boot 2016.07.
>>>
>>>    qemu-system-arm -M witherspoon-bmc,execute-in-place=true  -drive file=./flash-witherspoon-bmc,format=raw,if=mtd -drive file=./flash-witherspoon-bmc2,format=raw,if=mtd -nographic -nodefaults -snapshot -serial mon:stdio -enable-sync-profile
>>>    ...
>>>    U-Boot 2016.07-00040-g8425e96e2e27-dirty (Jun 24 2022 - 23:21:57 +0200)
>>>              Watchdog enabled
>>>    DRAM:  496 MiB
>>>    Flash: 32 MiB
>>>    In:    serial
>>>    Out:   serial
>>>    Err:   serial
>>>    Net:
>>>    (qemu) info sync-profile
>>>    Type               Object  Call site                Wait Time (s)         Count  Average (us)
>>>    ---------------------------------------------------------------------------------------------
>>>    BQL mutex  0x56189610b2e0  accel/tcg/cputlb.c:1365        0.25311      12346237          0.02
>>>    condvar    0x5618970cf220  softmmu/cpus.c:423             0.05506             2      27530.78
>>>    BQL mutex  0x56189610b2e0  util/rcu.c:269                 0.04709             2      23544.26
>>>    condvar    0x561896d0fc78  util/thread-pool.c:90          0.01340            83        161.47
>>>    condvar    0x56189610b240  softmmu/cpus.c:571             0.00005             1         54.93
>>>    condvar    0x56189610b280  softmmu/cpus.c:642             0.00003             1         32.88
>>>    BQL mutex  0x56189610b2e0  util/main-loop.c:318           0.00003            34          0.76
>>>    mutex      0x561896eade00  tcg/region.c:204               0.00002           995          0.02
>>>    rec_mutex  [           2]  util/async.c:682               0.00002           493          0.03
>>>    mutex      [           2]  chardev/char.c:118             0.00001           404          0.03
>>>    ---------------------------------------------------------------------------------------------
>>>
>>>
>>> However, with recent U-boots, it takes quite a while to reach DRAM training.
>>> Close to a minute. See the profiling below for an ast2500-evb machine using
>>> U-Boot 2019.04.
>>>
>>>   qemu-system-arm -M ast2500-evb,execute-in-place=true  -net nic,macaddr=C0:FF:EE:00:00:03,netdev=net0  -drive file=./flash-ast2500-evb,format=raw,if=mtd  -nographic -nodefaults -snapshot -serial mon:stdio  -enable-sync-profile
>>>    qemu-system-arm: warning: Aspeed iBT has no chardev backend
>>>    qemu-system-arm: warning: nic ftgmac100.1 has no peer
>>>    QEMU 7.0.50 monitor - type 'help' for more information
>>>       U-Boot 2019.04-00080-g6ca27db3f97b-dirty (Jun 24 2022 - 23:22:03
>>>      +0200)
>>>       SOC : AST2500-A1
>>>    RST : Power On
>>>    LPC Mode : SIO:Enable : SuperIO-2e
>>>    Eth : MAC0: RGMII, , MAC1: RGMII,
>>>    Model: AST2500 EVB
>>>    DRAM:  448 MiB (capacity:512 MiB, VGA:64 MiB, ECC:off)
>>>    MMC:   sdhci_slot0@100: 0, sdhci_slot1@200: 1
>>>    Loading Environment from SPI Flash... SF: Detected mx25l25635e with page size 256 Bytes, erase size 64 KiB, total 32 MiB
>>>    *** Warning - bad CRC, using default environment
>>>       In:    serial@1e784000
>>>    Out:   serial@1e784000
>>>    Err:   serial@1e784000
>>>    Net:   eth0: ethernet@1e660000
>>>    Warning: ethernet@1e680000 (eth1) using random MAC address - 4a:e5:9a:4a:c7:c5
>>>    , eth1: ethernet@1e680000
>>>    Hit any key to stop autoboot:  2
>>>    (qemu) info sync-profile
>>>    Type               Object  Call site                Wait Time (s)         Count  Average (us)
>>>    ---------------------------------------------------------------------------------------------
>>>    condvar    0x561f10c9ef88  util/thread-pool.c:90         10.01196            28     357570.00
>>>    BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1365        0.29496      14248621          0.02
>>>    condvar    0x561f110325a0  softmmu/cpus.c:423             0.02231             2      11152.57
>>>    BQL mutex  0x561f102362e0  util/rcu.c:269                 0.01447             4       3618.60
>>>    condvar    0x561f10236240  softmmu/cpus.c:571             0.00010             1        102.19
>>>    mutex      0x561f10e9f1c0  tcg/region.c:204               0.00007          3052          0.02
>>>    mutex      [           2]  chardev/char.c:118             0.00003          1486          0.02
>>>    condvar    0x561f10236280  softmmu/cpus.c:642             0.00003             1         29.38
>>>    BQL mutex  0x561f102362e0  accel/tcg/cputlb.c:1426        0.00002           973          0.02
>>>    BQL mutex  0x561f102362e0  util/main-loop.c:318           0.00001            34          0.41
>>>    ---------------------------------------------------------------------------------------------
>>>    Something in the layout of the FW is making a big difference. One
>>> that could be relevant is that the recent versions are using a device
>>> tree.
>>>
>>> There might be no good solution to this issue but I fail to analyze
>>> it correctly. Is there a way to collect information on the usage of
>>> Translation Blocks ?
>> You could expand the data we collect in tb_tree_stats and expose it
>> via
>> info jit.
>
> The "fast" run, U-Boot 2016.07, gives :
>
>
>   Translation buffer state:
>   gen code size       254880371/1073736704
>   TB count            1089
>   TB avg target size  16 max=356 bytes
>   TB avg host size    278 bytes (expansion ratio: 17.2)
>   cross page TB count 0 (0%)
>   direct jump count   501 (46%) (2 jumps=372 34%)
>   TB hash buckets     1025/8192 (12.51% head buckets used)
>   TB hash occupancy   3.32% avg chain occ. Histogram: [0.0,7.5)%|█  ▁  ▁  ▁|[67.5,75.0]%
>   TB hash avg chain   1.000 buckets. Histogram: 1|█|1
>      Statistics:
>   TB flush count      0
>   TB invalidate count 0
>   TLB full flushes    0
>   TLB partial flushes 2
>   TLB elided flushes  2338
>   JIT cycles          2221788409 (0.926 s at 2.4 GHz)
>   translated TBs      738520 (aborted=0 0.0%)
>   avg ops/TB          15.7 max=459
>   deleted ops/TB      2.72
>   avg temps/TB        32.89 max=88
>   avg host code/TB    113.7
>   avg search data/TB  5.2
>   cycles/op           192.0
>   cycles/in byte      748.7
>   cycles/out byte     26.4
>   cycles/search byte     582.8
>     gen_interm time   57.6%
>     gen_code time     42.4%
>   optim./code time    19.4%
>   liveness/code time  26.1%
>   cpu_restore count   0
>     avg cycles        0.0
>   and the "slow", U-Boot 2019.04 :
>
> Translation buffer state:
> gen code size       368603795/1073736704
> TB count            3052
> TB avg target size  16 max=360 bytes
> TB avg host size    293 bytes (expansion ratio: 17.6)
> cross page TB count 0 (0%)
> direct jump count   1431 (46%) (2 jumps=1104 36%)
> TB hash buckets     2559/8192 (31.24% head buckets used)
> TB hash occupancy   9.31% avg chain occ. Histogram: [0,10)%|█ ▃  ▁ ▁ ▁|[90,100]%
> TB hash avg chain   1.000 buckets. Histogram: 1|█|1
>
> Statistics:
> TB flush count      3

Something triggers 3 complete TB flushes in the slow run (3x1000 ~ 3000
TBs). This can be the TB overflowing but:

fast: gen code size       254880371/1073736704
slow: gen code size       368603795/1073736704

doesn't look like we are getting close to overflowing. Must be something
else.

> TB invalidate count 0
> TLB full flushes    0
> TLB partial flushes 3
> TLB elided flushes  2367
> JIT cycles          26479044772 (11.033 s at 2.4 GHz)
> translated TBs      10552169 (aborted=0 0.0%)
> avg ops/TB          15.0 max=464
> deleted ops/TB      2.44
> avg temps/TB        32.43 max=89
> avg host code/TB    99.0
> avg search data/TB  5.0
> cycles/op           167.7
> cycles/in byte      626.8
> cycles/out byte     25.4
> cycles/search byte     499.4
>   gen_interm time   50.4%
>   gen_code time     49.6%
> optim./code time    19.5%
> liveness/code time  27.7%
> cpu_restore count   0
>   avg cycles        0.0
>
>
> A lot more TBs.
>   C.


-- 
Alex Bennée

  reply	other threads:[~2022-06-30  9:49 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20220623102617.2164175-1-pdel@fb.com>
     [not found] ` <20220623102617.2164175-3-pdel@fb.com>
2022-06-23 12:11   ` [PATCH 02/14] sysbus: Remove sysbus_address_space Peter Maydell
     [not found] ` <20220623102617.2164175-5-pdel@fb.com>
2022-06-23 12:15   ` [PATCH 04/14] sysbus: Add sysbus_mmio_map_in Peter Maydell
2022-06-23 18:29     ` Peter Delevoryas
2022-06-23 18:29       ` Peter Delevoryas
     [not found] ` <20220623102617.2164175-9-pdel@fb.com>
2022-06-23 12:57   ` [PATCH 08/14] aspeed: Replace direct get_system_memory() calls Peter Maydell
2022-06-23 15:39     ` Cédric Le Goater
2022-06-23 18:45       ` Peter Delevoryas
2022-06-23 18:45         ` Peter Delevoryas
     [not found] ` <20220623102617.2164175-13-pdel@fb.com>
2022-06-23 15:09   ` [PATCH 12/14] aspeed: Make aspeed_board_init_flashes public Cédric Le Goater
2022-06-23 18:43     ` Peter Delevoryas
2022-06-23 18:43       ` Peter Delevoryas
2022-06-24 16:50       ` Cédric Le Goater
2022-06-29  9:11         ` Cédric Le Goater
2022-06-29 14:14           ` Alex Bennée
2022-06-29 15:54             ` Cédric Le Goater
2022-06-29 18:24               ` Alex Bennée
2022-06-30  8:49                 ` Cédric Le Goater
2022-06-30  9:43                   ` Alex Bennée [this message]
2022-07-05 12:35                     ` Cédric Le Goater

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87a69uh45w.fsf@linaro.org \
    --to=alex.bennee@linaro.org \
    --cc=andrew@aj.id.au \
    --cc=ani@anisinha.ca \
    --cc=berrange@redhat.com \
    --cc=clg@kaod.org \
    --cc=eduardo@habkost.net \
    --cc=f4bug@amsat.org \
    --cc=joel@jms.id.au \
    --cc=marcel.apfelbaum@gmail.com \
    --cc=pbonzini@redhat.com \
    --cc=pdel@fb.com \
    --cc=peter.maydell@linaro.org \
    --cc=qemu-arm@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=richard.henderson@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.