From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([140.186.70.92]:59140) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1R3yYo-0001M3-Gm for qemu-devel@nongnu.org; Wed, 14 Sep 2011 19:08:07 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1R3yYm-000099-UL for qemu-devel@nongnu.org; Wed, 14 Sep 2011 19:08:06 -0400 Received: from mail-qw0-f45.google.com ([209.85.216.45]:49336) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1R3yYm-00008y-Kp for qemu-devel@nongnu.org; Wed, 14 Sep 2011 19:08:04 -0400 Received: by qwg2 with SMTP id 2so550999qwg.4 for ; Wed, 14 Sep 2011 16:08:03 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20110912091408.GA3465@stefanha-thinkpad.localdomain> References: <1315628610-28222-1-git-send-email-ronniesahlberg@gmail.com> <1315628610-28222-2-git-send-email-ronniesahlberg@gmail.com> <20110912091408.GA3465@stefanha-thinkpad.localdomain> Date: Thu, 15 Sep 2011 09:08:03 +1000 Message-ID: From: ronnie sahlberg Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [PATCH] This patch adds a new block driver : iSCSI List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Stefan Hajnoczi Cc: kwolf@redhat.com, fujita.tomonori@lab.ntt.co.jp, qemu-devel@nongnu.org, hch@lst.de Stefan, Thanks for your review. I will do the changes you recommend and send an updated patch over the week= end. Could you advice the best path for handling the .bdrv_flush and the blocksize questions? I think it is reasonable to just not support iscsi at all for blocksize that is not multiple of 512 bytes since a read-modify-write cycle across a network would probably be prohibitively expensive. .bdrv_flush() I can easily add a synchronous implementation of this unless your patch is expected to be merged in the near future. regards ronnie sahlberg On Mon, Sep 12, 2011 at 7:14 PM, Stefan Hajnoczi wrote: > On Sat, Sep 10, 2011 at 02:23:30PM +1000, Ronnie Sahlberg wrote: > > Looking good. =A0I think this is worth merging because it does offer > benefits over host iSCSI. > >> +static void >> +iscsi_abort_task_cb(struct iscsi_context *iscsi, int status, void *comm= and_data, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0void *private_data) >> +{ >> +} >> + >> +static void >> +iscsi_aio_cancel(BlockDriverAIOCB *blockacb) >> +{ >> + =A0 =A0IscsiAIOCB *acb =3D (IscsiAIOCB *)blockacb; >> + =A0 =A0IscsiLun *iscsilun =3D acb->iscsilun; >> + >> + =A0 =A0acb->status =3D -ECANCELED; >> + =A0 =A0acb->common.cb(acb->common.opaque, acb->status); >> + =A0 =A0acb->canceled =3D 1; >> + >> + =A0 =A0iscsi_task_mgmt_abort_task_async(iscsilun->iscsi, acb->task, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 iscsi_abort_task_cb, NULL); >> +} > > The asynchronous abort task call looks odd. =A0If a caller allocates a > buffer and issues a read request, then we need to make sure that the > request is really aborted by the time .bdrv_aio_cancel() returns. > > If I understand the code correctly, iscsi_aio_cancel() returns > immediately but the read request will still be in progress. =A0That means > the caller could now free their data buffer and the read request will > overwrite that unallocated memory. > Right. I will need to remove the read pdu from the local initiator side too so that if the read is in flight and we receive data for it, that this data is discarded and not written into the possibly no longer valid buffers. I will do this change in the next version of the patch. >> +static void >> +iscsi_aio_write10_cb(struct iscsi_context *iscsi, int status, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 void *command_data, void *opaq= ue) >> +{ >> + =A0 =A0IscsiAIOCB *acb =3D opaque; >> + >> + =A0 =A0trace_iscsi_aio_write10_cb(iscsi, status, acb, acb->canceled); >> + >> + =A0 =A0if (acb->buf !=3D NULL) { >> + =A0 =A0 =A0 =A0free(acb->buf); >> + =A0 =A0} > > Please just use g_free(acb->buf). =A0g_free(NULL) is defined as a nop so > the check isn't necessary. =A0Also, this code uses free(3) when it should > use g_free(3). Will do. > >> + >> + =A0 =A0if (acb->canceled !=3D 0) { >> + =A0 =A0 =A0 =A0qemu_aio_release(acb); >> + =A0 =A0 =A0 =A0scsi_free_scsi_task(acb->task); >> + =A0 =A0 =A0 =A0acb->task =3D NULL; >> + =A0 =A0 =A0 =A0return; >> + =A0 =A0} >> + >> + =A0 =A0acb->status =3D 0; >> + =A0 =A0if (status < 0) { >> + =A0 =A0 =A0 =A0error_report("Failed to write10 data to iSCSI lun. %s", >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 iscsi_get_error(iscsi)); >> + =A0 =A0 =A0 =A0acb->status =3D -EIO; >> + =A0 =A0} >> + >> + =A0 =A0iscsi_schedule_bh(iscsi_readv_writev_bh_cb, acb); >> + =A0 =A0scsi_free_scsi_task(acb->task); >> + =A0 =A0acb->task =3D NULL; >> +} >> + >> +static int64_t sector_qemu2lun(int64_t sector, IscsiLun *iscsilun) >> +{ >> + =A0 =A0return sector * BDRV_SECTOR_SIZE / iscsilun->block_size; >> +} >> + >> +static BlockDriverAIOCB * >> +iscsi_aio_writev(BlockDriverState *bs, int64_t sector_num, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 QEMUIOVector *qiov, int nb_sectors, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 BlockDriverCompletionFunc *cb, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 void *opaque) >> +{ >> + =A0 =A0IscsiLun *iscsilun =3D bs->opaque; >> + =A0 =A0struct iscsi_context *iscsi =3D iscsilun->iscsi; >> + =A0 =A0IscsiAIOCB *acb; >> + =A0 =A0size_t size; >> + =A0 =A0int fua =3D 0; >> + >> + =A0 =A0/* set FUA on writes when cache mode is write through */ >> + =A0 =A0if (!(bs->open_flags & (BDRV_O_CACHE_WB|BDRV_O_NOCACHE))) { >> + =A0 =A0 =A0 =A0fua =3D 1; >> + =A0 =A0} > > FUA needs to reflect the guest semantics - does this disk have an > enabled write cache? =A0When bs->open_flags has BDRV_O_CACHE_WB, then the > guest knows it needs to send flushes because there is a write cache: > > if (!(bs->open_flags & BDRV_O_CACHE_WB)) { > =A0 =A0fua =3D 1; > } > > BDRV_O_NOCACHE is just for local files and sets the O_DIRECT hint. =A0It > doesn't affect the cache semantics that the guest sees. Thanks. I'll do this change. > >> +/* >> + * We support iscsi url's on the form >> + * iscsi://[%@][:]// >> + */ >> +static int iscsi_open(BlockDriverState *bs, const char *filename, int f= lags) >> +{ >> + =A0 =A0IscsiLun *iscsilun =3D bs->opaque; >> + =A0 =A0struct iscsi_context *iscsi =3D NULL; >> + =A0 =A0struct iscsi_url *iscsi_url =3D NULL; >> + =A0 =A0struct IscsiTask task; >> + =A0 =A0int ret; >> + >> + =A0 =A0if ((BDRV_SECTOR_SIZE % 512) !=3D 0) { >> + =A0 =A0 =A0 =A0error_report("iSCSI: Invalid BDRV_SECTOR_SIZE. " >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 "BDRV_SECTOR_SIZE(%lld) is not= a multiple " >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 "of 512", BDRV_SECTOR_SIZE); >> + =A0 =A0 =A0 =A0return -EINVAL; >> + =A0 =A0} > > Another way of saying this is: > QEMU_BUILD_BUG_ON(BDRV_SECTOR_SIZE % 512 !=3D 0); > > The advantage is that the build fails instead of waiting until iscsi is > used at runtime until the failure is detected. I can add this. But is it better to fail the whole build than to return an error at runtime if/when iscsi is used? > > What will happen if BDRV_SECTOR_SIZE is not a multiple of 512? iscsi will not work. I dont think it is worth implementing a read-modify-write cycle for iscsi. Performance would be so poor for this case that I think it is better to just not support it at all. > >> + >> + =A0 =A0memset(iscsilun, 0, sizeof(IscsiLun)); >> + >> + =A0 =A0/* Should really append the KVM name after the ':' here */ >> + =A0 =A0iscsi =3D iscsi_create_context("iqn.2008-11.org.linux-kvm:"); >> + =A0 =A0if (iscsi =3D=3D NULL) { >> + =A0 =A0 =A0 =A0error_report("iSCSI: Failed to create iSCSI context."); >> + =A0 =A0 =A0 =A0ret =3D -ENOMEM; >> + =A0 =A0 =A0 =A0goto failed; >> + =A0 =A0} >> + >> + =A0 =A0iscsi_url =3D iscsi_parse_full_url(iscsi, filename); >> + =A0 =A0if (iscsi_url =3D=3D NULL) { >> + =A0 =A0 =A0 =A0error_report("Failed to parse URL : %s %s", filename, >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 iscsi_get_error(iscsi)); >> + =A0 =A0 =A0 =A0ret =3D -ENOMEM; > > -EINVAL? Will do. > >> +static BlockDriver bdrv_iscsi =3D { >> + =A0 =A0.format_name =A0 =A0 =3D "iscsi", >> + =A0 =A0.protocol_name =A0 =3D "iscsi", >> + >> + =A0 =A0.instance_size =A0 =3D sizeof(IscsiLun), >> + =A0 =A0.bdrv_file_open =A0=3D iscsi_open, >> + =A0 =A0.bdrv_close =A0 =A0 =A0=3D iscsi_close, >> + >> + =A0 =A0.bdrv_getlength =A0=3D iscsi_getlength, >> + >> + =A0 =A0.bdrv_aio_readv =A0=3D iscsi_aio_readv, >> + =A0 =A0.bdrv_aio_writev =3D iscsi_aio_writev, >> + =A0 =A0.bdrv_aio_flush =A0=3D iscsi_aio_flush, > > block.c does not emulate .bdrv_flush() using your .bdrv_aio_flush() > implementation. =A0I have sent a patch to add this emulation. =A0This wil= l > turn bdrv_flush(iscsi_bs) into an actual operation instead of a nop :). Should I add a synchronous .bdrv_flush ? That is very easy for me to add. Or should I just wait for your patch to be merged ? > >> diff --git a/trace-events b/trace-events >> index 3fdd60f..4e461e5 100644 >> --- a/trace-events >> +++ b/trace-events >> @@ -494,3 +494,10 @@ escc_sunkbd_event_in(int ch) "Untranslated keycode = %2.2x" >> =A0escc_sunkbd_event_out(int ch) "Translated keycode %2.2x" >> =A0escc_kbd_command(int val) "Command %d" >> =A0escc_sunmouse_event(int dx, int dy, int buttons_state) "dx=3D%d dy=3D= %d buttons=3D%01x" >> + >> +# block/iscsi.c >> +disable iscsi_aio_write10_cb(void *iscsi, int status, void *acb, int ca= nceled) "iscsi %p status %d acb %p canceled %d" >> +disable iscsi_aio_writev(void *iscsi, int64_t sector_num, int nb_sector= s, void *opaque, void *acb) "iscsi %p sector_num %"PRId64" nb_sectors %d op= aque %p acb %p" >> +disable iscsi_aio_read10_cb(void *iscsi, int status, void *acb, int can= celed) "iscsi %p status %d acb %p canceled %d" >> +disable iscsi_aio_readv(void *iscsi, int64_t sector_num, int nb_sectors= , void *opaque, void *acb) "iscsi %p sector_num %"PRId64" nb_sectors %d opa= que %p acb %p" > > It is no longer necessary to prefix trace event definitions with > "disable". Will fix this. > > The stderr and simple backends now start up with all events disabled by > default. =A0The set of events can be set using the -trace > events=3D option or on the QEMU monitor using set trace-even= t > on. > > Stefan >