qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Richard Henderson <richard.henderson@linaro.org>, qemu-devel@nongnu.org
Cc: Florian Weimer <fweimer@redhat.com>,
	Thomas Huth <thuth@redhat.com>, Cornelia Huck <cohuck@redhat.com>,
	Stefano Brivio <sbrivio@redhat.com>,
	qemu-s390x@nongnu.org, Richard Henderson <rth@twiddle.net>
Subject: Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()
Date: Thu, 22 Aug 2019 09:01:26 +0200	[thread overview]
Message-ID: <eb4696dc-f95d-e934-3ff4-24364c161109@redhat.com> (raw)
In-Reply-To: <a8b1436d-a634-d767-8861-460c50acecb2@linaro.org>

On 22.08.19 00:31, Richard Henderson wrote:
> On 8/21/19 2:33 PM, David Hildenbrand wrote:
>>> NOTDIRTY cannot fault at all.  The associated rcu critical section is ugly
>>> enough to make me not want to do anything except continue to go through the
>>> regular MMIO path.
>>>
>>> In any case, so long as we eliminate *access* faults by probing the page table,
>>> then falling back to the byte-by-byte loop is, AFAICS, sufficient to implement
>>> the instructions correctly.
>>
>> "In any case, so long as we eliminate *access* faults by probing the
>> page table" - that's what I'm doing in this patch (and even more correct
>> in the prototype patch I shared), no? (besides the watchpoint madness below)
> 
> Correct.
> 
> My main objection to your current patch is that you perform the access checks
> within MVC, and then do some more tlb lookups in fast_memmove.

Right, however I think splitting up both steps is nicer (especially if
we mix and match memmove and memset in MVCLE later). Maybe combining a
tuned-up probe_write() with something similar (but different to)
access_prepare().

General changes:

1. Check watchpoints in probe_write()

2. Make probe_write() bail out if it would cross pages() - caller has to
assure it falls into a single page (TARGET_PAGE_SIZE ?).

3. Return a pointer from probe_write()
-> If NULL is returned, reads/writes have to go via memops, e.g., single
byte access

4. Make probe_write() return addresses of TLB entries that are already
invalid again (-> LAP)


What I propose for s390x:

1. access_prepare(): Collect the pointers using probe_write() - maximum
is two pages. If we get NULL, there is no fault but we have to fallback
to read/write using memops.

2. access_memset() / access_set() / access_move() ... pass the access
information we collected. Prototype I shared can be heavily simplified -
don't fall back to paddrs but instead do single-byte access, don't allow
flexible array of pages, etc.

3. First convert MVC to the new access mechanism, later fix + convert
the others.

> 
> I think that fast_memmove is where the access checks should live.  That allows
> incremental improvement to combine access checks + host address lookup, which
> cannot currently be done in one step with existing interfaces.
> 
> I guess you would still want access checks within MVC for the case in which you
> must fall back to byte-by-byte because of destructive overlap.

That's why I think introducing a new set of helpers makes sense. Once
all users of fast_memmove() etc are gone, we can then throw them away.

> 
>> "falling back to the byte-by-byte loop is, AFAICS, sufficient"
>>
>> I don't think this is sufficient. E.g., LAP protected pages
>> (PAGE_WRITE_INV which immediately requires a new MMU walk on the next
>> access) will trigger a new MMU walk on every byte access (that's why I
>> chose to pre-translate in my prototype).
> 
> LAP protected pages is exactly why probe_write should return the host address,
> so that we can do the access check + host address lookup in one step.
> 

Yes, this should also be done in tlb_vaddr_to_host() - but as we'll be
converting to a probe_write() thingy, it might not be required to handle
the special "TLB_INVALID_MASK set after tlb_fill()" case.

> But in the meantime...
> 
>> If another CPU modified the
>> page tables in between, we could suddenly get a fault - although we
>> checked early. What am I missing?
> 
> You're concerned with a bare write to the page table by cpu B, while cpu A is
> executing, and before cpu B issues the cross-cpu tlb flush?

Not bare writes, these are invalid. Take IPTE for example. Set the
invalid bit, then do a sync CPU flush.

If we re-translate a vaddr after probe_write() - which would currently
happen in case of LAP - then we could suddenly run into the invalid bit,
triggering a fault. The sync CPU flush still wasn't processed, but we
see the "invalid" bit early because we decided to re-translate after
already modifying memory. This re-translation really has to be avoided.

> 
> The tlb victim cache should prevent having to re-read a tlb entry from memory,
> at least for MVC.  The unlimited size we currently support for MVCL and MVCLE
> could act weird, but would be fixed by limiting the execution as discussed.

Yes, MVCL/MVCLE need a major overhaul. But good to know that a second
tlb_fill() will not result in the other TLB entry to suddenly get
invalid and require a new MMU translation (-> victim cache). That's
another concern I had.

> 
> Honestly, the os has to make sure that the page remains valid until after the
> flush completes, otherwise it's an os bug.  The cross-cpu tlb flush happens via
> async_run_on_cpu, and of course never occurs while we are executing a TB.  Yet
> another reason to limit the amount of work any one instruction does.  ;-)

In my example, the page would still be valid, however we the invalid bit
has already been set. The next MMU translation would choke on that and
trigger a fault.

I can only see this (re-translation) happening for LAP right now. And
LAP pages (lowcore) should be always mapped, so I don't think this is a
very important case to handle.

> 
> 
>> I see that we use BP_STOP_BEFORE_ACCESS for PER (Program Event
>> Recording) on s390x. I don't think that's correct. We want to get
>> notified after the values were changed.
>>
>> "A storage-alteration event occurs whenever a CPU,
>> by using a logical or virtual address, makes a store
>> access without an access exception to the storage
>> area designated by control registers 10 and 11. ..."
>>
>> "For a PER instruction-fetching nullification event, the
>> unit of operation is nullified. For other PER events,
>> the unit of operation is completed"
>>
>> Oh man, why is everything I take a look at broken.
> 
> Heh.
> 
>>> In the latter case, if the instruction has had any side effects prior to the
>>> longjmp, they will be re-done when we re-start the current instruction.
>>>
>>> To me this seems like a rather large bug in our implementation of watchpoints,
>>> as it only really works properly for simple load/store/load-op-store type
>>> instructions.  Anything that works on many addresses and doesn't delay side
>>> effects until all accesses are complete will Do The Wrong Thing.
>>>
>>> The fix, AFAICS, is for probe_write to call check_watchpoint(), so that we
>>> take the debug exit early.
>>
>> Indeed. I see what you mean now. (I was ignoring the "before access"
>> because I was assuming we don't need it on s390x)
>>
>> probe_write() would have to check for all BP_STOP_BEFORE_ACCESS watchpoints.
> 
> !BP_STOP_BEFORE_ACCESS watchpoints exit to the main loop as well, so that it
> can restart and then single-step the current instruction.
> 
> We need it the check in probe_write for all cases.

Thanks for the clarification!

> 
>> Yes, that's what I mean, TARGET_PAGE_SIZE, but eventually crossing a
>> page boundary. The longer I stare at the MVCL code, the more broken it
>> is. There are more nice things buried in the PoP. MVCL does not detect
>> access exceptions beyond the next 2k. So we have to limit it there
>> differently.
> 
> That language is indeed odd.
> 
> The only reading of that paragraph that makes sense to me is that the hardware
> *must* interrupt MVCL after every 2k bytes processed.  The idea that the user
> can magically write to a read-only page simply by providing length = 2MB and
> page that is initially writable is dumb.  I cannot imagine that is a correct
> reading.
> 
> Getting clarification from an IBM engineer on that would be good; otherwise I
> would just ignore that and proceed as if all access checks are performed.
> 
>> So what I understand is that
>>
>> - we should handle watchpoints in probe_write()
>> - not bypass IO memory (especially NOTDIRTY). We cannot always relay on
>>   getting access to a host page.
> 
> Correct.
> 
> 
> r~
> 


-- 

Thanks,

David / dhildenb


  parent reply	other threads:[~2019-08-22  7:03 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-21  9:22 [Qemu-devel] [PATCH v1 0/4] s390x/tcg: MOVE (MVC): Fault-safe handling David Hildenbrand
2019-08-21  9:22 ` [Qemu-devel] [PATCH v1 1/4] s390x/tcg: Use guest_addr_valid() instead of h2g_valid() in probe_write_access() David Hildenbrand
2019-08-21  9:22 ` [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access() David Hildenbrand
2019-08-21 17:26   ` Richard Henderson
2019-08-21 17:37     ` David Hildenbrand
2019-08-21 19:19       ` Richard Henderson
2019-08-21 19:36         ` David Hildenbrand
2019-08-21 20:38           ` Richard Henderson
2019-08-21 21:33             ` David Hildenbrand
2019-08-21 22:31               ` Richard Henderson
2019-08-21 22:43                 ` Richard Henderson
2019-08-22  6:42                   ` David Hildenbrand
2019-08-22  7:01                 ` David Hildenbrand [this message]
2019-08-26  9:31                   ` David Hildenbrand
2019-08-21 18:48     ` David Hildenbrand
2019-08-21  9:22 ` [Qemu-devel] [PATCH v1 3/4] s390x/tcg: MOVE (MVC): Increment the length once David Hildenbrand
2019-08-21 15:47   ` Richard Henderson
2019-08-21  9:22 ` [Qemu-devel] [PATCH v1 4/4] s390x/tcg: MOVE (MVC): Fault-safe handling David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=eb4696dc-f95d-e934-3ff4-24364c161109@redhat.com \
    --to=david@redhat.com \
    --cc=cohuck@redhat.com \
    --cc=fweimer@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-s390x@nongnu.org \
    --cc=richard.henderson@linaro.org \
    --cc=rth@twiddle.net \
    --cc=sbrivio@redhat.com \
    --cc=thuth@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).