From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:48837) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Xol4w-0006p9-BB for qemu-devel@nongnu.org; Wed, 12 Nov 2014 22:28:16 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Xol4r-00022V-3i for qemu-devel@nongnu.org; Wed, 12 Nov 2014 22:28:14 -0500 Received: from ozlabs.org ([2401:3900:2:1::2]:51129) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Xol4q-000223-Eo for qemu-devel@nongnu.org; Wed, 12 Nov 2014 22:28:09 -0500 Date: Thu, 13 Nov 2014 14:23:06 +1100 From: David Gibson Message-ID: <20141113032306.GF7291@voom.fritz.box> References: <1412358473-31398-1-git-send-email-dgilbert@redhat.com> <1412358473-31398-45-git-send-email-dgilbert@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="jkO+KyKz7TfD21mV" Content-Disposition: inline In-Reply-To: <1412358473-31398-45-git-send-email-dgilbert@redhat.com> Subject: Re: [Qemu-devel] [PATCH v4 44/47] Postcopy; Handle userfault requests List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: "Dr. David Alan Gilbert (git)" Cc: aarcange@redhat.com, yamahata@private.email.ne.jp, lilei@linux.vnet.ibm.com, quintela@redhat.com, cristian.klein@cs.umu.se, qemu-devel@nongnu.org, amit.shah@redhat.com, yanghy@cn.fujitsu.com --jkO+KyKz7TfD21mV Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Fri, Oct 03, 2014 at 06:47:50PM +0100, Dr. David Alan Gilbert (git) wrot= e: > From: "Dr. David Alan Gilbert" >=20 > userfaultfd is a Linux syscall that gives an fd that receives a stream > of notifications of accesses to pages marked as MADV_USERFAULT, and > allows the program to acknowledge those stalls and tell the accessing > thread to carry on. >=20 > Signed-off-by: Dr. David Alan Gilbert [snip] > /* > + * Tell the kernel that we've now got some memory it previously asked fo= r. > + * Note: We're not allowed to ack a page which wasn't requested. > + */ > +static int ack_userfault(MigrationIncomingState *mis, void *start, size_= t len) > +{ > + uint64_t tmp[2]; > + > + /* > + * Kernel wants the range that's now safe to access > + * Note it always takes 64bit values, even on a 32bit host. > + */ > + tmp[0] =3D (uint64_t)(uintptr_t)start; > + tmp[1] =3D (uint64_t)(uintptr_t)start + (uint64_t)len; > + > + if (write(mis->userfault_fd, tmp, 16) !=3D 16) { > + int e =3D errno; Is an EOF (i.e. write() returns 0) ever possible here? If so errno may not have a meaningful value. > + if (e =3D=3D ENOENT) { > + /* Kernel said it wasn't waiting - one case where this can > + * happen is where two threads triggered the userfault > + * and we receive the page and ack it just after we received > + * the 2nd request and that ends up deciding it should ack it > + * We could optimise it out, but it's rare. > + */ > + /*fprintf(stderr, "ack_userfault: %p/%zx ENOENT\n", start, l= en); */ > + return 0; > + } > + error_report("postcopy_ram: Failed to notify kernel for %p/%zx (= %d)", > + start, len, e); > + return -errno; > + } > + > + return 0; > +} > + > +/* > * Handle faults detected by the USERFAULT markings > */ > static void *postcopy_ram_fault_thread(void *opaque) > { > MigrationIncomingState *mis =3D (MigrationIncomingState *)opaque; > + void *hostaddr; > + int ret; > + size_t hostpagesize =3D getpagesize(); > + RAMBlock *rb =3D NULL; > + RAMBlock *last_rb =3D NULL; /* last RAMBlock we sent part of */ > =20 > - fprintf(stderr, "postcopy_ram_fault_thread\n"); > - /* TODO: In later patch */ > + DPRINTF("%s", __func__); > qemu_sem_post(&mis->fault_thread_sem); > - while (1) { > - /* TODO: In later patch */ > - } > + while (true) { > + PostcopyPMIState old_state, tmp_state; > + ram_addr_t rb_offset; > + ram_addr_t in_raspace; > + unsigned long bitmap_index; > + struct pollfd pfd[2]; > + > + /* > + * We're mainly waiting for the kernel to give us a faulting HVA, > + * however we can be told to quit via userfault_quit_fd which is > + * an eventfd > + */ > + pfd[0].fd =3D mis->userfault_fd; > + pfd[0].events =3D POLLIN; > + pfd[0].revents =3D 0; > + pfd[1].fd =3D mis->userfault_quit_fd; > + pfd[1].events =3D POLLIN; /* Waiting for eventfd to go positive = */ > + pfd[1].revents =3D 0; > + > + if (poll(pfd, 2, -1 /* Wait forever */) =3D=3D -1) { > + perror("userfault poll"); > + break; > + } > =20 > + if (pfd[1].revents) { > + DPRINTF("%s got quit event", __func__); > + break; I don't see any cleanup path in the userfault thread. So wouldn't it be simpler to just pthread_cancel() it rather than using an extra fd for quit notifications. > + } > + > + ret =3D read(mis->userfault_fd, &hostaddr, sizeof(hostaddr)); > + if (ret !=3D sizeof(hostaddr)) { > + if (ret < 0) { > + perror("Failed to read full userfault hostaddr"); > + break; > + } else { > + error_report("%s: Read %d bytes from userfaultfd expecte= d %zd", > + __func__, ret, sizeof(hostaddr)); > + break; /* Lost alignment, don't know what we'd read next= */ > + } > + } > + > + rb =3D qemu_ram_block_from_host(hostaddr, true, &in_raspace, &rb= _offset, > + &bitmap_index); > + if (!rb) { > + error_report("postcopy_ram_fault_thread: Fault outside guest= : %p", > + hostaddr); > + break; > + } > + > + DPRINTF("%s: Request for HVA=3D%p index=3D%lx rb=3D%s offset=3D%= zx", > + __func__, hostaddr, bitmap_index, qemu_ram_get_idstr(rb), > + rb_offset); > + > + tmp_state =3D postcopy_pmi_get_state(mis, bitmap_index); > + do { > + old_state =3D tmp_state; > + > + switch (old_state) { > + case POSTCOPY_PMI_REQUESTED: > + /* Do nothing - it's already requested */ > + break; > + > + case POSTCOPY_PMI_RECEIVED: > + /* Already arrived - no state change, just kick the kern= el */ > + DPRINTF("postcopy_ram_fault_thread: notify pre of %p", > + hostaddr); > + if (ack_userfault(mis, > + (void *)((uintptr_t)hostaddr > + & ~(hostpagesize - 1)), > + hostpagesize)) { > + assert(0); > + } > + break; > + > + case POSTCOPY_PMI_MISSING: > + > + tmp_state =3D postcopy_pmi_change_state(mis, bitmap_inde= x, > + old_state, POSTCOPY_PMI_REQUE= STED); > + if (tmp_state =3D=3D POSTCOPY_PMI_MISSING) { > + /* > + * Send the request to the source - we want to reque= st one > + * of our host page sizes (which is >=3D TPS) > + */ > + if (rb !=3D last_rb) { > + last_rb =3D rb; > + migrate_send_rp_reqpages(mis, qemu_ram_get_idstr= (rb), > + rb_offset, hostpagesize= ); > + } else { > + /* Save some space */ > + migrate_send_rp_reqpages(mis, NULL, > + rb_offset, hostpagesize= ); > + } > + } > + break; > + } > + } while (tmp_state !=3D old_state); > + } > + DPRINTF("%s: exit", __func__); > return NULL; > } > =20 > int postcopy_ram_enable_notify(MigrationIncomingState *mis) > { > - /* Create the fault handler thread and wait for it to be ready */ > + uint64_t tmp64; > + > + /* Open the fd for the kernel to give us userfaults */ > + mis->userfault_fd =3D syscall(__NR_userfaultfd, O_CLOEXEC); > + if (mis->userfault_fd =3D=3D -1) { > + perror("Failed to open userfault fd"); > + return -1; > + } > + > + /* > + * Version handshake, we send it the version we want and expect to g= et the > + * same back. > + */ > + tmp64 =3D USERFAULTFD_PROTOCOL; > + if (write(mis->userfault_fd, &tmp64, sizeof(tmp64)) !=3D sizeof(tmp6= 4)) { > + perror("Writing userfaultfd version"); > + close(mis->userfault_fd); > + return -1; > + } > + if (read(mis->userfault_fd, &tmp64, sizeof(tmp64)) !=3D sizeof(tmp64= )) { > + perror("Reading userfaultfd version"); > + close(mis->userfault_fd); > + return -1; > + } > + if (tmp64 !=3D USERFAULTFD_PROTOCOL) { > + error_report("Mismatched userfaultfd version, expected %zx, got = %zx", > + (size_t)USERFAULTFD_PROTOCOL, (size_t)tmp64); > + close(mis->userfault_fd); > + return -1; > + } > + > + /* Now an eventfd we use to tell the fault-thread to quit */ > + mis->userfault_quit_fd =3D eventfd(0, EFD_CLOEXEC); > + if (mis->userfault_quit_fd =3D=3D -1) { > + perror("Opening userfault_quit_fd"); > + close(mis->userfault_fd); > + return -1; > + } > + > qemu_sem_init(&mis->fault_thread_sem, 0); > qemu_thread_create(&mis->fault_thread, "postcopy/fault", > postcopy_ram_fault_thread, mis, QEMU_THREAD_JOINA= BLE); > qemu_sem_wait(&mis->fault_thread_sem); > + mis->have_fault_thread =3D true; > =20 > /* Mark so that we get notified of accesses to unwritten areas */ > if (qemu_ram_foreach_block(postcopy_ram_sensitise_area, mis)) { > return -1; > } > =20 > + DPRINTF("postcopy_ram_enable_notify: Sensitised"); > + > return 0; > } > =20 > @@ -612,11 +814,12 @@ int postcopy_place_page(MigrationIncomingState *mis= , void *host, void *from, > =20 > if (syscall(__NR_remap_anon_pages, host, from, hps, 0) !=3D > getpagesize()) { > + int e =3D errno; > perror("remap_anon_pages in postcopy_place_page"); > fprintf(stderr, "host: %p from: %p pmi=3D%d\n", host, from, > postcopy_pmi_get_state(mis, bitmap_offset)); > =20 > - return -errno; > + return -e; Unrelated change, should probably be folded into the patch which added this code. > } > =20 > tmp_state =3D postcopy_pmi_get_state(mis, bitmap_offset); > @@ -629,7 +832,10 @@ int postcopy_place_page(MigrationIncomingState *mis,= void *host, void *from, > =20 > =20 > if (old_state =3D=3D POSTCOPY_PMI_REQUESTED) { > - /* TODO: Notify kernel */ > + /* Send the kernel the host address that should now be accessibl= e */ > + DPRINTF("%s: Notifying kernel bitmap_offset=3D0x%lx host=3D%p", > + __func__, bitmap_offset, host); > + return ack_userfault(mis, host, hps); > } > =20 > return 0; --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --jkO+KyKz7TfD21mV Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAEBAgAGBQJUZCQaAAoJEGw4ysog2bOSRAwQANDStAxYYTp+NltEL6vBB329 g9fv4dV55LFU0hsK9s8oZRPiEqrdfmlhiRJy4lnwbKP13RvVNdWZo9Qlr1TT2qit XGqV9e7hXDcRvo80q/i6rbSzYrjLrQ4mEKgR7ND2KFZUO6qG8d0l7qPYVv06Yanf u3vFjcOaoGEH0mxeUA1dbTDj+c+NH5vj7Iix/aat3HCaQIx4kN8cnXCrI75e6wek aRnXEVyQqPSkGuACWbrIORiSj6e2dsgS1ZTlcy8i9KQE3aGDiIZLCduulLVUSUQR vD6i9PTZZQ0v+eJyMEpbvbGCA5hBydEovaKLzI2mahPrRSbwZn9z/7Q8O8ndEfjm Tp5c8dJLN9SVwIVn/XPXZN4rlNBlMAHnxdtylniNuLMkPg41erfMASC6H/ngQGtA h7fesoaVOgoH4jfRooU0KNKEzR7rpHY/Ju2UB1uSFdWKdkdWwKZGzMiYi/cyh4LA R4dqjCA0+3046kyVmklfsteGrFBiFib6mT6k3jNEcMhTI58+wLTWSB0pM7sW9tA2 9/lhpembEn/quQj7tZ5xfKboWQx0FFxRCkdTzCJ5MxpLfYM1JIY58Iddfutjlb9g Ix9Q7ID8P7E8H+m9YSzsID44/6hbJaxA8gROhJxy4z7zujqf5QtU9C6tW6kWoHJm WDXqqIquzkUZUWgXc4CF =SSnX -----END PGP SIGNATURE----- --jkO+KyKz7TfD21mV--