* [Qemu-devel] [PATCH v2 0/2] virtio-scsi: Optimizing request allocation

From: Fam Zheng @ 2014-09-15 5:23 UTC
To: qemu-devel; +Cc: Paolo Bonzini

v2: Slight improvements according to Paolo's comments. 4k -> 48k.

Zeroing is relatively expensive since we have big request structures:
VirtQueueElement (>48k!) and sense_buf (256 bytes) are the two points to
look at. This visibly reduces the overhead of request handling when
testing with the unmerged "null" driver and virtio-scsi dataplane.

Before, the issue is very obvious with perf top:

perf top -G -p `pidof qemu-system-x86_64`
-----------------------------------------
+  16.50%  libc-2.17.so        [.] __memset_sse2
+   2.28%  libc-2.17.so        [.] _int_malloc
+   2.25%  [vdso]              [.] 0x0000000000000cd1
+   2.02%  [kernel]            [k] _raw_spin_lock_irqsave
+   1.97%  libpthread-2.17.so  [.] pthread_mutex_lock
+   1.87%  libpthread-2.17.so  [.] pthread_mutex_unlock
+   1.81%  [kernel]            [k] fget_light
+   1.70%  libc-2.17.so        [.] malloc

After, the high __memset_sse2 and _int_malloc are gone:

perf top -G -p `pidof qemu-system-x86_64`
-----------------------------------------
+   4.20%  [kernel]            [k] vcpu_enter_guest
+   3.97%  [kernel]            [k] vmx_vcpu_run
+   2.63%  [kernel]            [k] _raw_spin_lock_irqsave
+   1.72%  [kernel]            [k] native_read_msr_safe
+   1.65%  [kernel]            [k] __srcu_read_lock
+   1.64%  [kernel]            [k] _raw_spin_unlock_irqrestore
+   1.57%  [vdso]              [.] 0x00000000000008d8
+   1.49%  libc-2.17.so        [.] _int_malloc
+   1.29%  libpthread-2.17.so  [.] pthread_mutex_unlock
+   1.26%  [kernel]            [k] native_write_msr_safe

See the commit message of patch 2 for some fio test data.
Thanks,
Fam

Fam Zheng (2):
  scsi: Optimize scsi_req_alloc
  virtio-scsi: Optimize virtio_scsi_init_req

 hw/scsi/scsi-bus.c     |  8 +++++---
 hw/scsi/virtio-scsi.c  | 23 +++++++++++++++++------
 include/hw/scsi/scsi.h | 21 ++++++++++++++-------
 3 files changed, 36 insertions(+), 16 deletions(-)

-- 
1.9.3
* [Qemu-devel] [PATCH v2 1/2] scsi: Optimize scsi_req_alloc

From: Fam Zheng @ 2014-09-15 5:23 UTC
To: qemu-devel; +Cc: Paolo Bonzini

Zeroing the sense buffer for each SCSI request is not efficient; we can
just leave it uninitialized, because sense_len is set to 0.

Move the implicitly zeroed fields to the end of the structure and use a
partial memset. The explicitly initialized fields (set by scsi_req_alloc
or scsi_req_new) are moved to the beginning of the structure, before the
sense buffer, so that the memset skips them.

Also change g_malloc0 to g_slice_alloc.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 hw/scsi/scsi-bus.c     |  8 +++++---
 include/hw/scsi/scsi.h | 21 ++++++++++++++-------
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 954c607..af293b5 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -551,8 +551,11 @@ SCSIRequest *scsi_req_alloc(const SCSIReqOps *reqops, SCSIDevice *d,
     SCSIRequest *req;
     SCSIBus *bus = scsi_bus_from_device(d);
     BusState *qbus = BUS(bus);
+    const int memset_off = offsetof(SCSIRequest, sense)
+                           + sizeof(req->sense);
 
-    req = g_malloc0(reqops->size);
+    req = g_slice_alloc(reqops->size);
+    memset((uint8_t *)req + memset_off, 0, reqops->size - memset_off);
     req->refcount = 1;
     req->bus = bus;
     req->dev = d;
@@ -560,7 +563,6 @@ SCSIRequest *scsi_req_alloc(const SCSIReqOps *reqops, SCSIDevice *d,
     req->lun = lun;
     req->hba_private = hba_private;
     req->status = -1;
-    req->sense_len = 0;
     req->ops = reqops;
     object_ref(OBJECT(d));
     object_ref(OBJECT(qbus->parent));
@@ -1603,7 +1605,7 @@ void scsi_req_unref(SCSIRequest *req)
         }
         object_unref(OBJECT(req->dev));
         object_unref(OBJECT(qbus->parent));
-        g_free(req);
+        g_slice_free1(req->ops->size, req);
     }
 }
 
diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 2e3a8f9..2a7d4e5 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -50,17 +50,24 @@ struct SCSIRequest {
     uint32_t          tag;
     uint32_t          lun;
     uint32_t          status;
+    void              *hba_private;
     size_t            resid;
+
+    /* Note:
+     * - fields before sense are initialized by scsi_req_alloc;
+     * - sense[] is uninitialized;
+     * - fields after sense are memset to 0 by scsi_req_alloc.
+     */
+    uint8_t           sense[SCSI_SENSE_BUF_SIZE];
+    uint32_t          sense_len;
+    bool              enqueued;
+    bool              io_canceled;
+    bool              retry;
+    bool              dma_started;
     SCSICommand       cmd;
     BlockDriverAIOCB  *aiocb;
     QEMUSGList        *sg;
-    bool              dma_started;
-    uint8_t           sense[SCSI_SENSE_BUF_SIZE];
-    uint32_t          sense_len;
-    bool              enqueued;
-    bool              io_canceled;
-    bool              retry;
-    void              *hba_private;
     QTAILQ_ENTRY(SCSIRequest) next;
 };
-- 
1.9.3
* Re: [Qemu-devel] [PATCH v2 1/2] scsi: Optimize scsi_req_alloc

From: Paolo Bonzini @ 2014-09-15 10:15 UTC
To: Fam Zheng, qemu-devel

On 15/09/2014 07:23, Fam Zheng wrote:
> +    uint8_t           sense[SCSI_SENSE_BUF_SIZE];
> +    uint32_t          sense_len;
> +    bool              enqueued;
> +    bool              io_canceled;
> +    bool              retry;
> +    bool              dma_started;
>      SCSICommand       cmd;
>      BlockDriverAIOCB  *aiocb;
>      QEMUSGList        *sg;
> -    bool              dma_started;
> -    uint8_t           sense[SCSI_SENSE_BUF_SIZE];
> -    uint32_t          sense_len;
> -    bool              enqueued;
> -    bool              io_canceled;
> -    bool              retry;
> -    void              *hba_private;
>      QTAILQ_ENTRY(SCSIRequest) next;

Similar to v1, cmd is filled in by scsi_req_new and can be moved to the
uninitialized area.

Paolo
* [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Optimize virtio_scsi_init_req

From: Fam Zheng @ 2014-09-15 5:23 UTC
To: qemu-devel; +Cc: Paolo Bonzini

The VirtQueueElement is a very big structure (>48k!). Since it will be
initialized by virtqueue_pop, we can save the expensive zeroing here.

This saves a few microseconds per request in my test:

[fio-test]    rw      bs    iodepth   jobs    bw     iops    latency
--------------------------------------------------------------------
Before        read    4k    1         1       110    28269   34
After         read    4k    1         1       131    33745   28
virtio-blk    read    4k    1         1       217    55673   16

(bw is in MB/s and latency in microseconds; the virtio-blk row is
included for comparison.)

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 hw/scsi/virtio-scsi.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 86aba88..7bf03c4 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -24,12 +24,19 @@
 typedef struct VirtIOSCSIReq {
     VirtIOSCSI *dev;
     VirtQueue *vq;
-    VirtQueueElement elem;
     QEMUSGList qsgl;
+    QEMUIOVector resp_iov;
+
+    /* Note:
+     * - fields before elem are initialized by virtio_scsi_init_req;
+     * - elem is uninitialized at the time of allocation;
+     * - fields after elem are zeroed by virtio_scsi_init_req.
+     */
+    VirtQueueElement elem;
     SCSIRequest *sreq;
     size_t resp_size;
     enum SCSIXferMode mode;
-    QEMUIOVector resp_iov;
     union {
         VirtIOSCSICmdResp cmd;
         VirtIOSCSICtrlTMFResp tmf;
@@ -68,23 +75,27 @@ static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
 static VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
 {
     VirtIOSCSIReq *req;
-    VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
-
-    req = g_malloc0(sizeof(*req) + vs->cdb_size);
+    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
+    const size_t zero_skip = offsetof(VirtIOSCSIReq, elem)
+                             + sizeof(VirtQueueElement);
 
+    req = g_slice_alloc(sizeof(*req) + vs->cdb_size);
     req->vq = vq;
     req->dev = s;
     req->sreq = NULL;
     qemu_sglist_init(&req->qsgl, DEVICE(s), 8, &address_space_memory);
     qemu_iovec_init(&req->resp_iov, 1);
+    memset((uint8_t *)req + zero_skip, 0, sizeof(*req) - zero_skip);
     return req;
 }
 
 static void virtio_scsi_free_req(VirtIOSCSIReq *req)
 {
+    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)req->dev;
+
     qemu_iovec_destroy(&req->resp_iov);
     qemu_sglist_destroy(&req->qsgl);
-    g_free(req);
+    g_slice_free1(sizeof(*req) + vs->cdb_size, req);
 }
 
 static void virtio_scsi_complete_req(VirtIOSCSIReq *req)
-- 
1.9.3
* Re: [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Optimize virtio_scsi_init_req

From: Paolo Bonzini @ 2014-09-15 10:17 UTC
To: Fam Zheng, qemu-devel

On 15/09/2014 07:23, Fam Zheng wrote:
>      SCSIRequest *sreq;
>      size_t resp_size;
>      enum SCSIXferMode mode;
> -    QEMUIOVector resp_iov;
>      union {
>          VirtIOSCSICmdResp cmd;
>          VirtIOSCSICtrlTMFResp tmf;
> @@ -68,23 +75,27 @@ static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
>  static VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
>  {
>      VirtIOSCSIReq *req;
> -    VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
> -
> -    req = g_malloc0(sizeof(*req) + vs->cdb_size);
> +    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
> +    const size_t zero_skip = offsetof(VirtIOSCSIReq, elem)
> +                             + sizeof(VirtQueueElement);
>
> +    req = g_slice_alloc(sizeof(*req) + vs->cdb_size);

Looks good, but why do you need to zero the union?  You only need to
zero sreq, resp_size and mode, don't you (and at this point, memset
becomes superfluous)?

Paolo
* Re: [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Optimize virtio_scsi_init_req

From: Fam Zheng @ 2014-09-16 7:16 UTC
To: Paolo Bonzini; +Cc: qemu-devel

On Mon, 09/15 12:17, Paolo Bonzini wrote:
> On 15/09/2014 07:23, Fam Zheng wrote:
> >      SCSIRequest *sreq;
> >      size_t resp_size;
> >      enum SCSIXferMode mode;
> > -    QEMUIOVector resp_iov;
> >      union {
> >          VirtIOSCSICmdResp cmd;
> >          VirtIOSCSICtrlTMFResp tmf;
> > @@ -68,23 +75,27 @@ static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
> >  static VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
> >  {
> >      VirtIOSCSIReq *req;
> > -    VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
> > -
> > -    req = g_malloc0(sizeof(*req) + vs->cdb_size);
> > +    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
> > +    const size_t zero_skip = offsetof(VirtIOSCSIReq, elem)
> > +                             + sizeof(VirtQueueElement);
> >
> > +    req = g_slice_alloc(sizeof(*req) + vs->cdb_size);
>
> Looks good, but why do you need to zero the union?  You only need to
> zero sreq, resp_size and mode, don't you (and at this point, memset
> becomes superfluous)?

The structures in the union are not zeroed by the caller, and leaving
them uninitialized breaks virtio-scsi in my test.

FWIW, I will remove the "req->sreq = NULL;" two lines below in v3. At
this point, tuning these small fields is a subtle optimization compared
to the arrays; I say let's simply keep the memset so that adding more
fields in the future is also safe.

Fam
* Re: [Qemu-devel] [PATCH v2 2/2] virtio-scsi: Optimize virtio_scsi_init_req

From: Paolo Bonzini @ 2014-09-16 8:15 UTC
To: Fam Zheng; +Cc: qemu-devel

On 16/09/2014 09:16, Fam Zheng wrote:
> On Mon, 09/15 12:17, Paolo Bonzini wrote:
>> On 15/09/2014 07:23, Fam Zheng wrote:
>>>      SCSIRequest *sreq;
>>>      size_t resp_size;
>>>      enum SCSIXferMode mode;
>>> -    QEMUIOVector resp_iov;
>>>      union {
>>>          VirtIOSCSICmdResp cmd;
>>>          VirtIOSCSICtrlTMFResp tmf;
>>> @@ -68,23 +75,27 @@ static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
>>>  static VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
>>>  {
>>>      VirtIOSCSIReq *req;
>>> -    VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
>>> -
>>> -    req = g_malloc0(sizeof(*req) + vs->cdb_size);
>>> +    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
>>> +    const size_t zero_skip = offsetof(VirtIOSCSIReq, elem)
>>> +                             + sizeof(VirtQueueElement);
>>>
>>> +    req = g_slice_alloc(sizeof(*req) + vs->cdb_size);
>>
>> Looks good, but why do you need to zero the union?  You only need to
>> zero sreq, resp_size and mode, don't you (and at this point, memset
>> becomes superfluous)?
>
> The structures in the union are not zeroed by the caller, and leaving
> them uninitialized breaks virtio-scsi in my test.
>
> FWIW, I will remove the "req->sreq = NULL;" two lines below in v3. At
> this point, tuning these small fields is a subtle optimization compared
> to the arrays; I say let's simply keep the memset so that adding more
> fields in the future is also safe.

Perhaps the response fields have to be zeroed?  The request shouldn't
need it.  It can be done separately though---the VirtQueueElement is
the big one that we have to fix.

Paolo