From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E7B773D6A for ; Sat, 7 Sep 2024 00:53:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725670385; cv=none; b=Mui0f+1jzrmz5OEyZVbnOtL3jCWj3J/Mo92E0kLYbpzHqPKkULLPmrSlTDBROB6e5XKtqT7wPVASnEUfGNHXi+QMNo1nrI70RfHFMkAlmBkCzTx03j0+c7nohm6ToJCkLYB0SpPSRI14eC4ptYAAKypFihJVm01xj+KtZIhQtFQ= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725670385; c=relaxed/simple; bh=TFKe5FEoxPpWJXUbnXP7EQsXSllTlOcF0yXGY+oNEu4=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=J+X9ZtKtMFHB3CpOz2dD0pXYnYUw53cbGLUrCMGjhb0s/ANf5WrZ/vnDBzs0KREe6pLrGSxwJoK8OmOcTxnV/NS+7YcNMNNaZ60zDFeCdd4DfTH0yAE3Aee9XY9CkSTPR+vNsl+cm+OvyYPQCBUFE+RcIx1VJdj49z/Jyu4ttGU= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=OTmt21ci; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="OTmt21ci" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-e1ce98dcafaso5694326276.0 for ; Fri, 06 Sep 2024 17:53:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1725670383; x=1726275183; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=cvDg3DfHKLXZUdcNlHwYYdDUD8sFXAigpGLjYaFm+7Q=; b=OTmt21ci8IrhHrvadVdxfDTJ+QaiUhIsZaADOF1J8qQ+MPbDJEbwplvLzrpdsRvjyu kPM5dTigLPLOFysnrmqjaPx+VPNAFRBTjxJ6MYRR7bYrcrJeymDOSV+qWvAJVpuRSRFn 1iUwIGbisMwE9iDvLwNM+EEjvI3sFrodI+r+eXgAugfyjyXX6/rkwei94ZJiPz45sfmS fe20k0BBZDxRTChqwY69AQ6gjf2mLwP+MheU9OqYt10/wrKmElhIlT9Zs8DPNurISIxV PcGA4EH2CUjV/s38EDlPJT5b3nPDmlV+u/9SKyCqyWm7Isqm0Fadl2op+3D865uPONp3 j68w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725670383; x=1726275183; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=cvDg3DfHKLXZUdcNlHwYYdDUD8sFXAigpGLjYaFm+7Q=; b=HE0oRKk4IPnY8+r39JXORIEvF7XtNAWpVvYdTxnCslOGos132/P602WNUUiu3gbza/ T/26Y9ckIIFPmcc4DHMIV1Ec9aQmv+izy8DnNNIzGuLdJounLD6UIH+iCPhWn9ukJvon 3/GhZmApsmh8xDEfXzaqgiBfSP5rjYOyGsHwisq89z0vD1dUIwpcBRlyTSF7bwICgCL8 GHLja5tKKaZ9+b+BuZrZ5MNZ/wN4Xzn3XrjGmGEenMg+7Wyh30SLE78PhqTUrSgVunQo C+FahoTD+2JGcJY89kYobwXmjRXoV4JI0GEB0zhvdHn7DkGM1Vr24sMb26ine44m9bZ3 Y7AA== X-Forwarded-Encrypted: i=1; AJvYcCXd0pKjH6qRyBE0VI2BrAfjApIINuMCYmy2hslvRJhYh0vij7OFGBVlyurGlYH4dER3e5mp0839tEtbcbs=@vger.kernel.org X-Gm-Message-State: AOJu0YzJ7vKpBA2wzD/zwT4U3iaon2/9pp/6vQWYfFk2XeDHZ2DGu89A o7krSrUJJ3KeZ6Oxer5dDpw3Ptbvm1TmF2c79j4sqt1+xsJaEBVdjUQdqxlYgYHdfO35zO29kcB V9Q== X-Google-Smtp-Source: AGHT+IGsrtQk6Vz/G0pLur+KioHIKoI49GqvwNCfvT5Rl6lo6ANEQKKE3nF1KjMCp7XQuepxjDSam58ag94= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a25:6845:0:b0:e1c:ea19:98f with SMTP id 3f1490d57ef6-e1d34a4dc8dmr62489276.8.1725670382387; Fri, 06 Sep 2024 17:53:02 -0700 (PDT) Date: Sat, 7 Sep 2024 00:53:00 +0000 In-Reply-To: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240809194335.1726916-1-seanjc@google.com> <20240809194335.1726916-10-seanjc@google.com> Message-ID: Subject: Re: [PATCH 09/22] KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ) From: Sean Christopherson To: James Houghton Cc: Paolo Bonzini , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Oliver Upton , Marc Zyngier , Peter Xu , James Houghton Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable On Fri, Sep 06, 2024, James Houghton wrote: > On Fri, Aug 9, 2024 at 12:43=E2=80=AFPM Sean Christopherson wrote: > > > > Add two phases to mmu_stress_test to verify that KVM correctly handles > > guest memory that was writable, and then made read-only in the primary = MMU, > > and then made writable again. > > > > Add bonus coverage for x86 to verify that all of guest memory was marke= d > > read-only. Making forward progress (without making memory writable) > > requires arch specific code to skip over the faulting instruction, but = the > > test can at least verify each vCPU's starting page was made read-only. > > > > Signed-off-by: Sean Christopherson >=20 > Writing off-list because I just have some smaller questions that I > don't want to bother the list with. Pulling everyone and the lists back in :-) IMO, no question is too small for kvm@, and lkml@ is gigantic firehose that= 's 99% archival and 1% list, at best. Odds are very, very good that if you have a question, however trivial or small, then someone else has the exact same qu= estion, or _will_ have the question in the future. I strongly prefer that all questions, review, feedback, etc. happen on list= , even if the questions/feedback may seem trivial or noisy. The only exception is= if information can't/shouldn't be made public, e.g. because of an embargo, NDA= , security implications, etc. > For the next version, feel free to add: >=20 > Reviewed-by: James Houghton >=20 > All of the selftest patches look fine to me. >=20 > > --- > > tools/testing/selftests/kvm/mmu_stress_test.c | 87 ++++++++++++++++++- > > 1 file changed, 84 insertions(+), 3 deletions(-) > > > > diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/test= ing/selftests/kvm/mmu_stress_test.c > > index 50c3a17418c4..98f9a4660269 100644 > > --- a/tools/testing/selftests/kvm/mmu_stress_test.c > > +++ b/tools/testing/selftests/kvm/mmu_stress_test.c > > @@ -16,6 +16,8 @@ > > #include "guest_modes.h" > > #include "processor.h" > > > > +static bool mprotect_ro_done; > > + > > static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t = stride) > > { > > uint64_t gpa; > > @@ -31,6 +33,33 @@ static void guest_code(uint64_t start_gpa, uint64_t = end_gpa, uint64_t stride) > > *((volatile uint64_t *)gpa); > > GUEST_SYNC(2); > > > > + /* > > + * Write to the region while mprotect(PROT_READ) is underway. = Keep > > + * looping until the memory is guaranteed to be read-only, othe= rwise > > + * vCPUs may complete their writes and advance to the next stag= e > > + * prematurely. > > + */ > > + do { > > + for (gpa =3D start_gpa; gpa < end_gpa; gpa +=3D stride) > > +#ifdef __x86_64__ > > + asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (= gpa) : "memory"); >=20 > Ok so this appears to be a `mov BYTE PTR [rax + 0x0], 0x0`, where %rax = =3D gpa. :) >=20 > Does '0xc6,0x0,0x0' also work? It seems like that translates to `mov > BYTE PTR [rax], 0x0`. (just curious, no need to change it) LOL, yes, but as evidenced by the trailing comment, my intent was to genera= te "mov rax, [rax]", not "movb $0, [rax]". I suspect I was too lazy to consul= t the SDM to recall the correct opcode and simply copied an instruction from some= random disassembly output without looking too closely at the output. asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RA= X, [RAX] */ > And I take it you wrote it out like this (instead of using mnemonics) > so that you could guarantee that IP + 4 would be the right way to skip > forwards. Does it make sense to leave a comment about that?=20 Yes and yes. > The translation from mnemonic --> bytes won't change... Heh, this is x86, by no means is that guaranteed. E.g. see the above, wher= e the same mnemonic can be represented multiple ways. > so could you just write the proper assembly? (not a request, just curiou= s) In practice, probably. But the rules for inline assembly are, at best, fuz= zy. So long as the correct instruction is generated, the assembler has quite a = bit of freedom. E.g. similar to above, "mov %rax,(%rax)" can (should) be encoded as: 48 89 00 but can also be encoded as 48 89 40 00 Now, it's _extremely_ unlikely a compiler will actually generate the latter= , but it's perfectly legal to do so. E.g. with gcc-13, this mov %rax, 0x0(%rax) generates 48 89 00 even though a more literal interpretation would be 48 89 40 00 So yeah, while the hand-coded opcode is gross and annoying, given that a fa= ilure due to the "wrong" instruction being generated would be painful and time co= nsuming to debug, hand-coding is worth avoiding the risk and potential pain if the = compiler decides to be mean :-) > A comment that 0x40 corresponds to %rax and that "a" also corresponds > to %rax would have been helpful for me. :) Eh, I get what you're saying, but giving a play-by-play of the encoding isn= 't really all that reasonable because _so_ much information needs to be convey= ed to capture the entire picture, and some things are essentially table stakes wh= en it comes to x86 kernel programming. E.g. 0x40 doesn't simply mean "(%rax)", it's a full ModR/M that defines the addressing mode, which in turn depends on the operating mode (64-bit). And "a" isn't just %rax; it's specifically an input register constraint, e.= g. is distinctly different than: asm volatile(".byte 0x48,0x89,0x0" : "+a"(gpa) :: "memory"); /* mov %rax,= (%rax) */ even though in this specific scenario they generate the same code. And with the correct "48 89 00", understanding the full encoding requires d= escribing REX prefixes, which are a mess unto themselves. So, a trailing comment (with the correct mnemonic) is all I'm willing to do= , even though I 100% agree that it puts a decent sized load on the reader. There'= s just _too_ much information to communicate to the reader, at least for x86. > > +#else > > + vcpu_arch_put_guest(*((volatile uint64_t *)gpa)= , gpa); > > +#endif > > + } while (!READ_ONCE(mprotect_ro_done)); > > + > > + /* > > + * Only x86 can explicitly sync, as other architectures will be= stuck > > + * on the write fault. >=20 > It would also have been a little clearer if the comment also said how > this is just because the test has been written to increment for the PC > upon getting these write faults *for x86 only*. IDK something like > "For x86, the test will adjust the PC for each write fault, allowing > the above loop to complete. Other architectures will get stuck, so the > #3 sync step is skipped." Ya. Untested, but how about this? diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/= selftests/kvm/mmu_stress_test.c index 2d66c2724336..29acb22ea387 100644 --- a/tools/testing/selftests/kvm/mmu_stress_test.c +++ b/tools/testing/selftests/kvm/mmu_stress_test.c @@ -38,11 +38,18 @@ static void guest_code(uint64_t start_gpa, uint64_t end= _gpa, uint64_t stride) * looping until the memory is guaranteed to be read-only, otherwis= e * vCPUs may complete their writes and advance to the next stage * prematurely. + * + * For architectures that support skipping the faulting instruction= , + * generate the store via inline assembly to ensure the exact lengt= h + * of the instruction is known and stable (vcpu_arch_put_guest() on + * fixed-length architectures should work, but the cost of paranoia + * is low in this case). For x86, hand-code the exact opcode so th= at + * there is no room for variability in the generated instruction. */ do { for (gpa =3D start_gpa; gpa < end_gpa; gpa +=3D stride) #ifdef __x86_64__ - asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa)= : "memory"); /* MOV RAX, [RAX] */ + asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "= memory"); /* mov %rax, (%rax) */ #elif defined(__aarch64__) asm volatile("str %0, [%0]" :: "r" (gpa) : "memory"= ); #else @@ -163,7 +170,7 @@ static void *vcpu_worker(void *data) TEST_ASSERT_EQ(errno, EFAULT); #ifdef __x86_64__ WRITE_ONCE(vcpu->run->kvm_dirty_regs, KVM_SYNC_X86_REGS); - vcpu->run->s.regs.regs.rip +=3D 4; + vcpu->run->s.regs.regs.rip +=3D 3; #endif #ifdef __aarch64__ vcpu_set_reg(vcpu, ARM64_CORE_REG(regs.pc),