From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([140.186.70.92]:54171) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1R6JPs-00040K-Ny for qemu-devel@nongnu.org; Wed, 21 Sep 2011 05:48:34 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1R6JPj-0007CT-LE for qemu-devel@nongnu.org; Wed, 21 Sep 2011 05:48:32 -0400 Received: from mail-vw0-f46.google.com ([209.85.212.46]:34926) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1R6JPj-0007CD-2a for qemu-devel@nongnu.org; Wed, 21 Sep 2011 05:48:23 -0400 Received: by vws1 with SMTP id 1so2466749vws.33 for ; Wed, 21 Sep 2011 02:48:22 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20110912091408.GA3465@stefanha-thinkpad.localdomain> References: <1315628610-28222-1-git-send-email-ronniesahlberg@gmail.com> <1315628610-28222-2-git-send-email-ronniesahlberg@gmail.com> <20110912091408.GA3465@stefanha-thinkpad.localdomain> Date: Wed, 21 Sep 2011 19:48:22 +1000 Message-ID: From: ronnie sahlberg Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [PATCH] This patch adds a new block driver : iSCSI List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Stefan Hajnoczi Cc: kwolf@redhat.com, fujita.tomonori@lab.ntt.co.jp, qemu-devel@nongnu.org, hch@lst.de Stefan, Thanks for your review. I feel that iscsi would be beneficial for several usecases. I have implemented all changes you pointed out, except two, and resent an updated patch to the list. Please see below. regards ronnie sahlberg On Mon, Sep 12, 2011 at 7:14 PM, Stefan Hajnoczi wrote: > On Sat, Sep 10, 2011 at 02:23:30PM +1000, Ronnie Sahlberg wrote: > > Looking good. =A0I think this is worth merging because it does offer > benefits over host iSCSI. > >> +static void >> +iscsi_abort_task_cb(struct iscsi_context *iscsi, int status, void *comm= and_data, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0void *private_data) >> +{ >> +} >> + >> +static void >> +iscsi_aio_cancel(BlockDriverAIOCB *blockacb) >> +{ >> + =A0 =A0IscsiAIOCB *acb =3D (IscsiAIOCB *)blockacb; >> + =A0 =A0IscsiLun *iscsilun =3D acb->iscsilun; >> + >> + =A0 =A0acb->status =3D -ECANCELED; >> + =A0 =A0acb->common.cb(acb->common.opaque, acb->status); >> + =A0 =A0acb->canceled =3D 1; >> + >> + =A0 =A0iscsi_task_mgmt_abort_task_async(iscsilun->iscsi, acb->task, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 iscsi_abort_task_cb, NULL); >> +} > > The asynchronous abort task call looks odd. =A0If a caller allocates a > buffer and issues a read request, then we need to make sure that the > request is really aborted by the time .bdrv_aio_cancel() returns. > > If I understand the code correctly, iscsi_aio_cancel() returns > immediately but the read request will still be in progress. =A0That means > the caller could now free their data buffer and the read request will > overwrite that unallocated memory. You are right. I have added a new function to libiscsi that will also release the task structure from libiscsi as well. So we should no longer have a window where we might have a cancelled task in flight writing the data to an already released buffer once the cancelled data buffers start arriving on the socket. > >> +static void >> +iscsi_aio_write10_cb(struct iscsi_context *iscsi, int status, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 void *command_data, void *opaq= ue) >> +{ >> + =A0 =A0IscsiAIOCB *acb =3D opaque; >> + >> + =A0 =A0trace_iscsi_aio_write10_cb(iscsi, status, acb, acb->canceled); >> + >> + =A0 =A0if (acb->buf !=3D NULL) { >> + =A0 =A0 =A0 =A0free(acb->buf); >> + =A0 =A0} > > Please just use g_free(acb->buf). =A0g_free(NULL) is defined as a nop so > the check isn't necessary. =A0Also, this code uses free(3) when it should= e. > use g_free(3). > Done. >> + >> + =A0 =A0if (acb->canceled !=3D 0) { >> + =A0 =A0 =A0 =A0qemu_aio_release(acb); >> + =A0 =A0 =A0 =A0scsi_free_scsi_task(acb->task); >> + =A0 =A0 =A0 =A0acb->task =3D NULL; >> + =A0 =A0 =A0 =A0return; >> + =A0 =A0} >> + >> + =A0 =A0acb->status =3D 0; >> + =A0 =A0if (status < 0) { >> + =A0 =A0 =A0 =A0error_report("Failed to write10 data to iSCSI lun. %s", >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 iscsi_get_error(iscsi)); >> + =A0 =A0 =A0 =A0acb->status =3D -EIO; >> + =A0 =A0} >> + >> + =A0 =A0iscsi_schedule_bh(iscsi_readv_writev_bh_cb, acb); >> + =A0 =A0scsi_free_scsi_task(acb->task); >> + =A0 =A0acb->task =3D NULL; >> +} >> + >> +static int64_t sector_qemu2lun(int64_t sector, IscsiLun *iscsilun) >> +{ >> + =A0 =A0return sector * BDRV_SECTOR_SIZE / iscsilun->block_size; >> +} >> + >> +static BlockDriverAIOCB * >> +iscsi_aio_writev(BlockDriverState *bs, int64_t sector_num, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 QEMUIOVector *qiov, int nb_sectors, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 BlockDriverCompletionFunc *cb, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 void *opaque) >> +{ >> + =A0 =A0IscsiLun *iscsilun =3D bs->opaque; >> + =A0 =A0struct iscsi_context *iscsi =3D iscsilun->iscsi; >> + =A0 =A0IscsiAIOCB *acb; >> + =A0 =A0size_t size; >> + =A0 =A0int fua =3D 0; >> + >> + =A0 =A0/* set FUA on writes when cache mode is write through */ >> + =A0 =A0if (!(bs->open_flags & (BDRV_O_CACHE_WB|BDRV_O_NOCACHE))) { >> + =A0 =A0 =A0 =A0fua =3D 1; >> + =A0 =A0} > > FUA needs to reflect the guest semantics - does this disk have an > enabled write cache? =A0When bs->open_flags has BDRV_O_CACHE_WB, then the > guest knows it needs to send flushes because there is a write cache: > > if (!(bs->open_flags & BDRV_O_CACHE_WB)) { > =A0 =A0fua =3D 1; > } > > BDRV_O_NOCACHE is just for local files and sets the O_DIRECT hint. =A0It > doesn't affect the cache semantics that the guest sees. > Done. When I first started building the patch a while ago, both flags were needed. I missed when the second flag became obsolete upstream. >> +/* >> + * We support iscsi url's on the form >> + * iscsi://[%@][:]// >> + */ >> +static int iscsi_open(BlockDriverState *bs, const char *filename, int f= lags) >> +{ >> + =A0 =A0IscsiLun *iscsilun =3D bs->opaque; >> + =A0 =A0struct iscsi_context *iscsi =3D NULL; >> + =A0 =A0struct iscsi_url *iscsi_url =3D NULL; >> + =A0 =A0struct IscsiTask task; >> + =A0 =A0int ret; >> + >> + =A0 =A0if ((BDRV_SECTOR_SIZE % 512) !=3D 0) { >> + =A0 =A0 =A0 =A0error_report("iSCSI: Invalid BDRV_SECTOR_SIZE. " >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 "BDRV_SECTOR_SIZE(%lld) is not= a multiple " >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 "of 512", BDRV_SECTOR_SIZE); >> + =A0 =A0 =A0 =A0return -EINVAL; >> + =A0 =A0} > > Another way of saying this is: > QEMU_BUILD_BUG_ON(BDRV_SECTOR_SIZE % 512 !=3D 0); > > The advantage is that the build fails instead of waiting until iscsi is > used at runtime until the failure is detected. > > What will happen if BDRV_SECTOR_SIZE is not a multiple of 512? > I did not do this change. If blocksize is not multiple of 512, the iscsi backend will just refuse to work since having a read-modify-write cycle across the network would be extremely costly. I dont think iscsi should cause the entire build to fail. While it is difficult to imagine a different blocksize, it is not impossible I guess. If someone comes up with a non-512-multiple blocksize and a good reason for one, I dont want to be the guy that broke the build. >> + >> + =A0 =A0memset(iscsilun, 0, sizeof(IscsiLun)); >> + >> + =A0 =A0/* Should really append the KVM name after the ':' here */ >> + =A0 =A0iscsi =3D iscsi_create_context("iqn.2008-11.org.linux-kvm:"); >> + =A0 =A0if (iscsi =3D=3D NULL) { >> + =A0 =A0 =A0 =A0error_report("iSCSI: Failed to create iSCSI context."); >> + =A0 =A0 =A0 =A0ret =3D -ENOMEM; >> + =A0 =A0 =A0 =A0goto failed; >> + =A0 =A0} >> + >> + =A0 =A0iscsi_url =3D iscsi_parse_full_url(iscsi, filename); >> + =A0 =A0if (iscsi_url =3D=3D NULL) { >> + =A0 =A0 =A0 =A0error_report("Failed to parse URL : %s %s", filename, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 iscsi_get_error(iscsi)); >> + =A0 =A0 =A0 =A0ret =3D -ENOMEM; > > -EINVAL? Done. > >> +static BlockDriver bdrv_iscsi =3D { >> + =A0 =A0.format_name =A0 =A0 =3D "iscsi", >> + =A0 =A0.protocol_name =A0 =3D "iscsi", >> + >> + =A0 =A0.instance_size =A0 =3D sizeof(IscsiLun), >> + =A0 =A0.bdrv_file_open =A0=3D iscsi_open, >> + =A0 =A0.bdrv_close =A0 =A0 =A0=3D iscsi_close, >> + >> + =A0 =A0.bdrv_getlength =A0=3D iscsi_getlength, >> + >> + =A0 =A0.bdrv_aio_readv =A0=3D iscsi_aio_readv, >> + =A0 =A0.bdrv_aio_writev =3D iscsi_aio_writev, >> + =A0 =A0.bdrv_aio_flush =A0=3D iscsi_aio_flush, > > block.c does not emulate .bdrv_flush() using your .bdrv_aio_flush() > implementation. =A0I have sent a patch to add this emulation. =A0This wil= l > turn bdrv_flush(iscsi_bs) into an actual operation instead of a nop :). I did not implement this since I understood from the discussion that a patch is imminent and it is required for existing backend(s?) anyway. > >> diff --git a/trace-events b/trace-events >> index 3fdd60f..4e461e5 100644 >> --- a/trace-events >> +++ b/trace-events >> @@ -494,3 +494,10 @@ escc_sunkbd_event_in(int ch) "Untranslated keycode = %2.2x" >> =A0escc_sunkbd_event_out(int ch) "Translated keycode %2.2x" >> =A0escc_kbd_command(int val) "Command %d" >> =A0escc_sunmouse_event(int dx, int dy, int buttons_state) "dx=3D%d dy=3D= %d buttons=3D%01x" >> + >> +# block/iscsi.c >> +disable iscsi_aio_write10_cb(void *iscsi, int status, void *acb, int ca= nceled) "iscsi %p status %d acb %p canceled %d" >> +disable iscsi_aio_writev(void *iscsi, int64_t sector_num, int nb_sector= s, void *opaque, void *acb) "iscsi %p sector_num %"PRId64" nb_sectors %d op= aque %p acb %p" >> +disable iscsi_aio_read10_cb(void *iscsi, int status, void *acb, int can= celed) "iscsi %p status %d acb %p canceled %d" >> +disable iscsi_aio_readv(void *iscsi, int64_t sector_num, int nb_sectors= , void *opaque, void *acb) "iscsi %p sector_num %"PRId64" nb_sectors %d opa= que %p acb %p" > > It is no longer necessary to prefix trace event definitions with > "disable". Done. > > The stderr and simple backends now start up with all events disabled by > default. =A0The set of events can be set using the -trace > events=3D option or on the QEMU monitor using set trace-even= t > on. > > Stefan >