From: Oleksii Kurochko <oleksii.kurochko@gmail.com>
To: Jan Beulich <jbeulich@suse.com>
Cc: "Alistair Francis" <alistair.francis@wdc.com>,
"Bob Eshleman" <bobbyeshleman@gmail.com>,
"Connor Davis" <connojdavis@gmail.com>,
"Andrew Cooper" <andrew.cooper3@citrix.com>,
"Anthony PERARD" <anthony.perard@vates.tech>,
"Michal Orzel" <michal.orzel@amd.com>,
"Julien Grall" <julien@xen.org>,
"Roger Pau Monné" <roger.pau@citrix.com>,
"Stefano Stabellini" <sstabellini@kernel.org>,
xen-devel@lists.xenproject.org
Subject: Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings
Date: Thu, 17 Jul 2025 18:37:45 +0200 [thread overview]
Message-ID: <e2227002-e38c-41e1-8bea-7585138ec5ba@gmail.com> (raw)
In-Reply-To: <9be8eeb4-281e-4b9b-9ea7-04ff738dc4db@suse.com>
[-- Attachment #1: Type: text/plain, Size: 10752 bytes --]
On 7/2/25 11:25 AM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> Add support for down large memory mappings ("superpages") in the RISC-V
>> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
>> can be inserted into lower levels of the page table hierarchy.
>>
>> To implement that the following is done:
>> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>> smaller page table entries down to the target level, preserving original
>> permissions and attributes.
>> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>> entries at lower levels within a superpage-mapped region.
>>
>> This implementation is based on the ARM code, with modifications to the part
>> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
>> not require BBM, so there is no need to invalidate the PTE and flush the
>> TLB before updating it with the newly created, split page table.
> But some flushing is going to be necessary. As long as you only ever do
> global flushes, the one after the individual PTE modification (within the
> split table) will do (if BBM isn't required, see below), but once you move
> to more fine-grained flushing, that's not going to be enough anymore. Not
> sure it's a good idea to leave such a pitfall.
I think that I don't fully understand what is an issue.
>
> As to (no need for) BBM: I couldn't find anything to that effect in the
> privileged spec. Can you provide some pointer? What I found instead is e.g.
> this sentence: "To ensure that implicit reads observe writes to the same
> memory locations, an SFENCE.VMA instruction must be executed after the
> writes to flush the relevant cached translations." And this: "Accessing the
> same location using different cacheability attributes may cause loss of
> coherence." (This may not only occur when the same physical address is
> mapped twice at different VAs, but also after the shattering of a superpage
> when the new entry differs in cacheability.)
I also couldn't find that RISC-V spec mandates BBM explicitly as Arm does it.
What I meant that on RISC-V can do:
- Write new PTE
- Flush TLB
While on Arm it is almost always needed to do:
- Write zero to PTE
- Flush TLB
- Write new PTE
For example, the common CoW code path where you copy from a read only page to
a new page, then map that new page as writable just works on RISC-V without
extra considerations and on Arm it requires BBM.
It seems to me that BBM is mandated for Arm only because that TLB is shared
among cores, so there is no any guarantee that no prior entry for same VA
remains in TLB. In case of RISC-V's TLB isn't shared and after a flush it is
guaranteed that no prior entry for the same VA remains in the TLB.
But in the same time it could be cases, I guess, where BBM will be needed for
RISC-V too. Even the case of CoW mentioned above will need some kind of BBM,
but nothing that'll the CPU misbehave by doing CoW without BBM on RISC-V.
>
>> Additionally, the page table walk logic has been adjusted, as ARM uses the
>> opposite walk order compared to RISC-V.
> I think you used some similar wording already in an earlier patch. I find
> this confusing: Walk order is, aiui, the same. It's merely the numbering
> of levels that is the opposite way round, isn't it?
Yes, the numbering of levels is different and I counted that as a different
walk order. If it is too confusing, I will reword it and use numbering of levels.
>
>> Signed-off-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
>> ---
>> Changes in V2:
>> - New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
>> functionality" which was splitted to smaller.
>> - Update the commit above the cycle which creates new page table as
>> RISC-V travserse page tables in an opposite to ARM order.
>> - RISC-V doesn't require BBM so there is no needed for invalidating
>> and TLB flushing before updating PTE.
>> ---
>> xen/arch/riscv/p2m.c | 102 ++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 101 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
>> index 87dd636b80..79c4473f1f 100644
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -743,6 +743,77 @@ static void p2m_free_entry(struct p2m_domain *p2m,
>> p2m_free_page(p2m->domain, pg);
>> }
>>
>> +static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>> + unsigned int level, unsigned int target,
>> + const unsigned int *offsets)
>> +{
>> + struct page_info *page;
>> + unsigned int i;
>> + pte_t pte, *table;
>> + bool rv = true;
>> +
>> + /* Convenience aliases */
>> + mfn_t mfn = pte_get_mfn(*entry);
>> + unsigned int next_level = level - 1;
>> + unsigned int level_order = XEN_PT_LEVEL_ORDER(next_level);
>> +
>> + /*
>> + * This should only be called with target != level and the entry is
>> + * a superpage.
>> + */
>> + ASSERT(level > target);
>> + ASSERT(p2me_is_superpage(p2m, *entry, level));
>> +
>> + page = p2m_alloc_page(p2m->domain);
>> + if ( !page )
>> + return false;
>> +
>> + page_list_add(page, &p2m->pages);
> Is there a reason this list maintenance isn't done in p2m_alloc_page()?
No there is no any reason, I will move that inside p2m_alloc_page().
>
>> + table = __map_domain_page(page);
>> +
>> + /*
>> + * We are either splitting a second level 1G page into 512 first level
>> + * 2M pages, or a first level 2M page into 512 zero level 4K pages.
>> + */
>> + for ( i = 0; i < XEN_PT_ENTRIES; i++ )
>> + {
>> + pte_t *new_entry = table + i;
>> +
>> + /*
>> + * Use the content of the superpage entry and override
>> + * the necessary fields. So the correct permission are kept.
>> + */
>> + pte = *entry;
>> + pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
> While okay as long as you only permit superpages up to 1G, this is another
> trap for someone to fall into: Imo i would better be unsigned long right
> away, considering that RISC-V permits large pages at all levels.
>
>> + write_pte(new_entry, pte);
>> + }
>> +
>> + /*
>> + * Shatter superpage in the page to the level we want to make the
>> + * changes.
>> + * This is done outside the loop to avoid checking the offset to
>> + * know whether the entry should be shattered for every entry.
>> + */
>> + if ( next_level != target )
>> + rv = p2m_split_superpage(p2m, table + offsets[next_level],
>> + level - 1, target, offsets);
> I don't understand the comment: Under what conditions would every entry
> need (further) shattering? And where's that happening? Or is this merely
> a word ordering issue in the sentence, and "for every entry" wants
> moving ahead? (In that case I'm unconvinced this is in need of commenting
> upon.)
It is wording question. It should be something like:
+ /*
+ * Shatter superpage in the page to the level we want to make the
+ * changes.
+ * This is done outside the loop to avoid checking the offset for every
+ * entry (of new page table) to know whether the entry should be shattered.
+ */
>
>> + /* TODO: why it is necessary to have clean here? Not somewhere in the caller */
>> + if ( p2m->clean_pte )
>> + clean_dcache_va_range(table, PAGE_SIZE);
>> +
>> + unmap_domain_page(table);
> Again likely not something that wants taking care of right away, but there
> again is an inefficiency here: The caller almost certainly wants to map
> the same page again, to update the one entry that caused the request to
> shatter the page.
I'll add TODO.
>
>> + /*
>> + * Even if we failed, we should install the newly allocated PTE
>> + * entry. The caller will be in charge to free the sub-tree.
>> + */
>> + p2m_write_pte(entry, page_to_p2m_table(p2m, page), p2m->clean_pte);
> Why would it be wrong to free the page right here, vacating the entry at
> the same time (or leaving just that to the caller)? (IOW - if this is an
> implementation decision of yours, I think the word "should" would want
> dropping.) After all, the caller invoking p2m_free_entry() on the thus
> split PTE is less efficient (needs to iterate over all entries) than on
> the original one (where it's just a single superpage).
I think that I didn't get your idea.
>
>> @@ -806,7 +877,36 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
>> */
>> if ( level > target )
> This condition is likely too strong, unless you actually mean to also
> split a superpage if it really wouldn't need splitting (new entry written
> still fitting with the superpage mapping, i.e. suitable MFN and same
> attributes).
I am not really sure that I fully understand.
My understanding is if level != target then the splitting is needed, I am
not really get the part "split a superpage if it really wouldn't need splitting".
>
>> {
>> - panic("Shattering isn't implemented\n");
>> + /* We need to split the original page. */
>> + pte_t split_pte = *entry;
>> +
>> + ASSERT(p2me_is_superpage(p2m, *entry, level));
>> +
>> + if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
>> + {
>> + /* Free the allocated sub-tree */
>> + p2m_free_entry(p2m, split_pte, level);
>> +
>> + rc = -ENOMEM;
>> + goto out;
>> + }
>> +
>> + p2m_write_pte(entry, split_pte, p2m->clean_pte);
>> +
>> + /* Then move to the level we want to make real changes */
>> + for ( ; level < target; level++ )
> Don't you mean to move downwards here? At which point I wonder: Did you test
> this code?
No as the test for this case was missed. I will add one.
>
>> + {
>> + rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
>> +
>> + /*
>> + * The entry should be found and either be a table
>> + * or a superpage if level 0 is not targeted
>> + */
>> + ASSERT(rc == GUEST_TABLE_NORMAL ||
>> + (rc == GUEST_TABLE_SUPER_PAGE && target > 0));
>> + }
> This, too, is inefficient (but likely good enough as a starting point): You walk
> tables twice - first when splitting, and then again when finding the target level.
>
> Considering the enclosing if(), this also again is a do/while() candidate.
I will add TODO to make that part more efficient. And based on your reply regarding
statement inside if(), I'll likely to use do/while().
Thanks.
~ Oleksii
[-- Attachment #2: Type: text/html, Size: 13870 bytes --]
next prev parent reply other threads:[~2025-07-17 16:38 UTC|newest]
Thread overview: 161+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
2025-06-10 13:05 ` [PATCH v2 01/17] xen/riscv: implement sbi_remote_hfence_gvma() Oleksii Kurochko
2025-06-18 15:15 ` Jan Beulich
2025-06-23 14:31 ` Oleksii Kurochko
2025-06-23 14:39 ` Jan Beulich
2025-06-23 14:45 ` Oleksii Kurochko
2025-06-24 10:33 ` Oleksii Kurochko
2025-06-24 10:48 ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 02/17] xen/riscv: introduce sbi_remote_hfence_gvma_vmid() Oleksii Kurochko
2025-06-18 15:20 ` Jan Beulich
2025-06-23 14:38 ` Oleksii Kurochko
2025-06-10 13:05 ` [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement Oleksii Kurochko
2025-06-18 15:46 ` Jan Beulich
2025-06-24 9:46 ` Oleksii Kurochko
2025-06-24 10:44 ` Jan Beulich
2025-06-24 13:47 ` Oleksii Kurochko
2025-06-24 14:01 ` Jan Beulich
2025-06-24 15:32 ` Oleksii Kurochko
2025-06-26 10:05 ` Oleksii Kurochko
2025-06-26 10:41 ` Jan Beulich
2025-06-26 11:34 ` Oleksii Kurochko
2025-06-26 11:43 ` Juergen Gross
2025-06-26 12:05 ` Oleksii Kurochko
2025-06-26 12:17 ` Teddy Astie
2025-06-26 12:37 ` Jan Beulich
2025-06-26 12:16 ` Jan Beulich
2025-06-26 12:25 ` Oleksii Kurochko
2025-06-10 13:05 ` [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests Oleksii Kurochko
2025-06-18 15:53 ` Jan Beulich
2025-06-25 14:48 ` Oleksii Kurochko
2025-06-25 14:55 ` Jan Beulich
2025-07-01 13:04 ` Jan Beulich
2025-07-02 10:30 ` Oleksii Kurochko
2025-07-02 10:34 ` Jan Beulich
2025-07-02 11:17 ` Oleksii Kurochko
2025-07-02 11:48 ` Oleksii Kurochko
2025-07-02 11:56 ` Jan Beulich
2025-07-02 12:34 ` Oleksii Kurochko
2025-07-02 12:49 ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 05/17] xen/riscv: introduce things necessary for p2m initialization Oleksii Kurochko
2025-06-18 16:08 ` Jan Beulich
2025-06-25 15:31 ` Oleksii Kurochko
2025-06-25 15:53 ` Jan Beulich
2025-06-26 8:40 ` Oleksii Kurochko
2025-06-26 11:01 ` Jan Beulich
2025-06-26 11:55 ` Oleksii Kurochko
2025-06-10 13:05 ` [PATCH v2 06/17] xen/riscv: add root page table allocation Oleksii Kurochko
2025-06-30 15:22 ` Jan Beulich
2025-06-30 16:18 ` Oleksii Kurochko
2025-07-01 6:29 ` Jan Beulich
2025-07-01 9:44 ` Oleksii Kurochko
2025-07-01 10:27 ` Jan Beulich
2025-07-01 14:02 ` Oleksii Kurochko
2025-07-01 14:28 ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 07/17] xen/riscv: introduce pte_{set,get}_mfn() Oleksii Kurochko
2025-06-26 14:57 ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification Oleksii Kurochko
2025-06-26 14:59 ` Jan Beulich
2025-06-30 14:33 ` Oleksii Kurochko
2025-06-30 14:38 ` Oleksii Kurochko
2025-06-30 14:45 ` Jan Beulich
2025-06-30 15:27 ` Oleksii Kurochko
2025-06-30 15:50 ` Jan Beulich
2025-07-02 10:13 ` Oleksii Kurochko
2025-07-02 10:36 ` Jan Beulich
2025-06-30 14:42 ` Jan Beulich
2025-06-30 15:13 ` Oleksii Kurochko
2025-06-30 15:27 ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 09/17] xen/riscv: introduce page_set_xenheap_gfn() Oleksii Kurochko
2025-06-30 15:48 ` Jan Beulich
2025-07-02 15:59 ` Oleksii Kurochko
2025-07-03 5:59 ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 10/17] xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs Oleksii Kurochko
2025-06-30 15:59 ` Jan Beulich
2025-07-03 11:02 ` Oleksii Kurochko
2025-07-03 11:33 ` Jan Beulich
2025-07-03 11:54 ` Oleksii Kurochko
2025-07-03 13:09 ` Jan Beulich
2025-07-03 13:28 ` Oleksii Kurochko
2025-07-03 13:34 ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry() Oleksii Kurochko
2025-07-01 13:49 ` Jan Beulich
2025-07-04 15:01 ` Oleksii Kurochko
2025-07-07 7:20 ` Jan Beulich
2025-07-07 11:46 ` Oleksii Kurochko
2025-07-07 12:53 ` Jan Beulich
2025-07-07 15:00 ` Oleksii Kurochko
2025-07-07 15:15 ` Jan Beulich
2025-07-07 16:10 ` Oleksii Kurochko
2025-07-08 7:10 ` Jan Beulich
2025-07-08 9:01 ` Oleksii Kurochko
2025-07-08 10:37 ` Oleksii Kurochko
2025-07-08 12:45 ` Jan Beulich
2025-07-08 15:42 ` Oleksii Kurochko
2025-07-08 16:04 ` Jan Beulich
2025-07-09 8:24 ` Oleksii Kurochko
2025-07-09 8:41 ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 12/17] xen/riscv: Implement p2m_free_entry() and related helpers Oleksii Kurochko
2025-07-01 14:23 ` Jan Beulich
2025-07-11 15:56 ` Oleksii Kurochko
2025-07-14 7:15 ` Jan Beulich
2025-07-14 16:01 ` Oleksii Kurochko
2025-07-14 16:17 ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration Oleksii Kurochko
2025-07-01 15:08 ` Jan Beulich
2025-07-15 14:47 ` Oleksii Kurochko
2025-07-16 11:31 ` Jan Beulich
2025-07-16 16:07 ` Oleksii Kurochko
2025-07-16 16:18 ` Jan Beulich
2025-07-17 8:56 ` Oleksii Kurochko
2025-07-17 10:25 ` Jan Beulich
2025-07-18 9:52 ` Oleksii Kurochko
2025-07-21 12:18 ` Jan Beulich
2025-07-22 10:41 ` Oleksii Kurochko
2025-07-22 11:34 ` Oleksii Kurochko
2025-07-22 12:00 ` Jan Beulich
2025-07-22 14:25 ` Oleksii Kurochko
2025-07-22 14:35 ` Jan Beulich
2025-07-22 16:07 ` Oleksii Kurochko
2025-07-23 9:46 ` Jan Beulich
2025-07-28 8:52 ` Oleksii Kurochko
2025-07-28 9:09 ` Jan Beulich
2025-07-28 11:37 ` Oleksii Kurochko
2025-07-28 11:49 ` Jan Beulich
2025-07-22 11:54 ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 14/17] xen/riscv: implement p2m_next_level() Oleksii Kurochko
2025-07-02 8:35 ` Jan Beulich
2025-07-16 11:32 ` Oleksii Kurochko
2025-07-16 11:43 ` Jan Beulich
2025-07-16 15:53 ` Oleksii Kurochko
2025-07-16 16:12 ` Jan Beulich
2025-07-17 9:42 ` Oleksii Kurochko
2025-07-17 10:37 ` Jan Beulich
2025-07-18 11:19 ` Oleksii Kurochko
2025-07-21 13:14 ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings Oleksii Kurochko
2025-07-02 9:25 ` Jan Beulich
2025-07-17 16:37 ` Oleksii Kurochko [this message]
2025-07-21 13:34 ` Jan Beulich
2025-07-22 14:57 ` Oleksii Kurochko
2025-07-22 16:02 ` Jan Beulich
2025-07-23 19:51 ` Oleksii Kurochko
2025-07-24 7:58 ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers Oleksii Kurochko
2025-07-02 10:09 ` Jan Beulich
2025-07-02 10:28 ` Jan Beulich
2025-07-18 14:37 ` Oleksii Kurochko
2025-07-21 13:39 ` Jan Beulich
2025-07-22 12:03 ` Oleksii Kurochko
2025-07-22 12:05 ` Jan Beulich
2025-07-29 13:47 ` Oleksii Kurochko
2025-07-29 14:48 ` Jan Beulich
2025-07-02 12:52 ` Orzel, Michal
2025-07-18 14:49 ` Oleksii Kurochko
2025-07-21 13:42 ` Jan Beulich
2025-07-22 13:38 ` Oleksii Kurochko
2025-07-21 13:53 ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 17/17] xen/riscv: add support of page lookup by GFN Oleksii Kurochko
2025-07-02 11:44 ` Jan Beulich
2025-07-21 9:43 ` Oleksii Kurochko
2025-07-21 14:06 ` Jan Beulich
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e2227002-e38c-41e1-8bea-7585138ec5ba@gmail.com \
--to=oleksii.kurochko@gmail.com \
--cc=alistair.francis@wdc.com \
--cc=andrew.cooper3@citrix.com \
--cc=anthony.perard@vates.tech \
--cc=bobbyeshleman@gmail.com \
--cc=connojdavis@gmail.com \
--cc=jbeulich@suse.com \
--cc=julien@xen.org \
--cc=michal.orzel@amd.com \
--cc=roger.pau@citrix.com \
--cc=sstabellini@kernel.org \
--cc=xen-devel@lists.xenproject.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.