From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:49646) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1c6LZ5-0004yB-7M for qemu-devel@nongnu.org; Mon, 14 Nov 2016 13:01:09 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1c6LZ2-0007XZ-SO for qemu-devel@nongnu.org; Mon, 14 Nov 2016 13:01:07 -0500
Received: from mail-it0-x242.google.com ([2607:f8b0:4001:c0b::242]:36608) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1c6LZ2-0007XI-LB for qemu-devel@nongnu.org; Mon, 14 Nov 2016 13:01:04 -0500
Received: by mail-it0-x242.google.com with SMTP id n68so15700588itn.3 for ; Mon, 14 Nov 2016 10:01:03 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: 
References: <1475035789-685-1-git-send-email-ashish.mittal@veritas.com> <20160928214510.GA2837@stefanha-x1.localdomain>
From: ashish mittal 
Date: Mon, 14 Nov 2016 10:01:00 -0800
Message-ID: 
Content-Type: text/plain; charset=UTF-8
Subject: Re: [Qemu-devel] [PATCH v7 RFC] block/vxhs: Initial commit to add Veritas HyperScale VxHS block device support
List-Id: 
List-Unsubscribe: , 
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: , 
To: Stefan Hajnoczi 
Cc: qemu-devel , Paolo Bonzini , Kevin Wolf , Markus Armbruster , "Daniel P. Berrange" , Jeff Cody , Fam Zheng , Ashish Mittal , Ketan Nilangekar , Abhijit Dey , Venkatesha.Mg@veritas.com, Buddhi.Madhav@veritas.com

Will look into these ASAP.

On Mon, Nov 14, 2016 at 7:05 AM, Stefan Hajnoczi wrote:
> On Wed, Sep 28, 2016 at 10:45 PM, Stefan Hajnoczi wrote:
>> On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
>>
>> Review of .bdrv_open() and .bdrv_aio_writev() code paths.
>>
>> The big issues I see in this driver and libqnio:
>>
>> 1. Showstoppers like broken .bdrv_open() and leaking memory on every
>>    reply message.
>> 2. Insecure due to missing input validation (network packets and
>>    configuration) and incorrect string handling.
>> 3. Not fully asynchronous so QEMU and the guest may hang.
>>
>> Please think about the whole codebase and not just the lines I've
>> pointed out in this review when fixing these sorts of issues. There
>> may be similar instances of these bugs elsewhere and it's important
>> that they are fixed so that this can be merged.
>
> Ping?
>
> You didn't respond to the comments I raised. The libqnio buffer
> overflows and other issues from this email are still present.
>
> I put a lot of time into reviewing this patch series and libqnio. If
> you want to get reviews please address feedback before sending a new
> patch series.
>
>>
>>> +/*
>>> + * Structure per vDisk maintained for state
>>> + */
>>> +typedef struct BDRVVXHSState {
>>> +    int fds[2];
>>> +    int64_t vdisk_size;
>>> +    int64_t vdisk_blocks;
>>> +    int64_t vdisk_flags;
>>> +    int vdisk_aio_count;
>>> +    int event_reader_pos;
>>> +    VXHSAIOCB *qnio_event_acb;
>>> +    void *qnio_ctx;
>>> +    QemuSpin vdisk_lock; /* Lock to protect BDRVVXHSState */
>>> +    QemuSpin vdisk_acb_lock; /* Protects ACB */
>>
>> These comments are insufficient for documenting locking. Not all
>> fields are actually protected by these locks. Please order fields
>> according to lock coverage:
>>
>>     typedef struct VXHSAIOCB {
>>         ...
>>
>>         /* Protected by BDRVVXHSState->vdisk_acb_lock */
>>         int segments;
>>         ...
>>     };
>>
>>     typedef struct BDRVVXHSState {
>>         ...
>>
>>         /* Protected by vdisk_lock */
>>         QemuSpin vdisk_lock;
>>         int vdisk_aio_count;
>>         QSIMPLEQ_HEAD(aio_retryq, VXHSAIOCB) vdisk_aio_retryq;
>>         ...
>>     }
>>
>>> +static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx)
>>> +{
>>> +    /*
>>> +     * Close vDisk device
>>> +     */
>>> +    if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) {
>>> +        iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd);
>>
>> libqnio comment:
>> Why does iio_devclose() take an unused cfd argument? Perhaps it can
>> be dropped.
>>
>>> +        s->vdisk_hostinfo[idx].vdisk_rfd = -1;
>>> +    }
>>> +
>>> +    /*
>>> +     * Close QNIO channel against cached channel-fd
>>> +     */
>>> +    if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) {
>>> +        iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd);
>>
>> libqnio comment:
>> Why does iio_devclose() take an int32_t cfd argument but iio_close()
>> takes a uint32_t cfd argument?
>>
>>> +        s->vdisk_hostinfo[idx].qnio_cfd = -1;
>>> +    }
>>> +}
>>> +
>>> +static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr,
>>> +                              int *rfd, const char *file_name)
>>> +{
>>> +    /*
>>> +     * Open qnio channel to storage agent if not opened before.
>>> +     */
>>> +    if (*cfd < 0) {
>>> +        *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0);
>>
>> libqnio comments:
>>
>> 1.
>> There is a buffer overflow in qnio_create_channel(). strncpy() is
>> used incorrectly, so a long hostname or port (both can be 99
>> characters long) will overflow channel->name[] (64 characters) or
>> channel->port[] (8 characters).
>>
>>     strncpy(channel->name, hostname, strlen(hostname) + 1);
>>     strncpy(channel->port, port, strlen(port) + 1);
>>
>> The third argument must be the size of the *destination* buffer, not
>> the source buffer. Also note that strncpy() doesn't NUL-terminate the
>> destination string, so you must do that manually to ensure there is a
>> NUL byte at the end of the buffer.
>>
>> 2.
>> channel is leaked in the "Failed to open single connection" error
>> case in qnio_create_channel().
>>
>> 3.
>> If host is longer than 63 characters then the ioapi_ctx->channels and
>> qnio_ctx->channels maps will use different keys due to string
>> truncation in qnio_create_channel(). This means "Channel already
>> exists" in qnio_create_channel() and possibly other things will not
>> work as expected.
>>
>>> +        if (*cfd < 0) {
>>> +            trace_vxhs_qnio_iio_open(of_vsa_addr);
>>> +            return -ENODEV;
>>> +        }
>>> +    }
>>> +
>>> +    /*
>>> +     * Open vdisk device
>>> +     */
>>> +    *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0);
>>
>> libqnio comment:
>> Buffer overflow in iio_devopen() since chandev[128] is not large
>> enough to hold channel[100] + " " + devpath[arbitrary length] chars:
>>
>>     sprintf(chandev, "%s %s", channel, devpath);
>>
>>> +
>>> +    if (*rfd < 0) {
>>> +        if (*cfd >= 0) {
>>
>> This check is always true. Otherwise the return -ENODEV would have
>> been taken above. The if statement isn't necessary.
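>>
>> To be concrete about the strncpy() misuse in comment 1, the usual
>> safe pattern looks roughly like this (a sketch only; I'm assuming the
>> existing 64-byte name[] and 8-byte port[] fields, and that
>> qnio_create_channel() can fail by returning NULL):
>>
>>     /* Validate first, then copy using the *destination* size */
>>     if (strlen(hostname) >= sizeof(channel->name) ||
>>         strlen(port) >= sizeof(channel->port)) {
>>         return NULL; /* reject instead of silently truncating */
>>     }
>>     strncpy(channel->name, hostname, sizeof(channel->name) - 1);
>>     channel->name[sizeof(channel->name) - 1] = '\0';
>>     strncpy(channel->port, port, sizeof(channel->port) - 1);
>>     channel->port[sizeof(channel->port) - 1] = '\0';
>>
>> Rejecting over-long hostnames outright would also avoid the
>> mismatched map keys described in comment 3. The sprintf() into
>> chandev[] above has the same shape of problem; snprintf(chandev,
>> sizeof(chandev), "%s %s", channel, devpath) plus a truncation check
>> is the minimal fix.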
>>
>>> +static void vxhs_check_failover_status(int res, void *ctx)
>>> +{
>>> +    BDRVVXHSState *s = ctx;
>>> +
>>> +    if (res == 0) {
>>> +        /* found failover target */
>>> +        s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx;
>>> +        s->vdisk_ask_failover_idx = 0;
>>> +        trace_vxhs_check_failover_status(
>>> +                   s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
>>> +                   s->vdisk_guid);
>>> +        qemu_spin_lock(&s->vdisk_lock);
>>> +        OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s);
>>> +        qemu_spin_unlock(&s->vdisk_lock);
>>> +        vxhs_handle_queued_ios(s);
>>> +    } else {
>>> +        /* keep looking */
>>> +        trace_vxhs_check_failover_status_retry(s->vdisk_guid);
>>> +        s->vdisk_ask_failover_idx++;
>>> +        if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) {
>>> +            /* pause and cycle through list again */
>>> +            sleep(QNIO_CONNECT_RETRY_SECS);
>>
>> This code is called from a QEMU thread via vxhs_aio_rw(). It is not
>> permitted to call sleep() since it will freeze QEMU and probably the
>> guest.
>>
>> If you need a timer you can use QEMU's timer APIs. See
>> aio_timer_new(), timer_new_ns(), timer_mod(), timer_del(),
>> timer_free().
>>
>>> +            s->vdisk_ask_failover_idx = 0;
>>> +        }
>>> +        res = vxhs_switch_storage_agent(s);
>>> +    }
>>> +}
>>> +
>>> +static int vxhs_failover_io(BDRVVXHSState *s)
>>> +{
>>> +    int res = 0;
>>> +
>>> +    trace_vxhs_failover_io(s->vdisk_guid);
>>> +
>>> +    s->vdisk_ask_failover_idx = 0;
>>> +    res = vxhs_switch_storage_agent(s);
>>> +
>>> +    return res;
>>> +}
>>> +
>>> +static void vxhs_iio_callback(int32_t rfd, uint32_t reason, void *ctx,
>>> +                              uint32_t error, uint32_t opcode)
>>
>> This function is doing too much. In particular, the failover code
>> should run in the AioContext since it's complex. Don't do failover
>> here because this function is outside the AioContext lock. Do it
>> from the AioContext using a QEMUBH like block/rbd.c.
>>
>>> +static int32_t
>>> +vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov,
>>> +                     uint64_t offset, void *ctx, uint32_t flags)
>>> +{
>>> +    struct iovec cur;
>>> +    uint64_t cur_offset = 0;
>>> +    uint64_t cur_write_len = 0;
>>> +    int segcount = 0;
>>> +    int ret = 0;
>>> +    int i, nsio = 0;
>>> +    int iovcnt = qiov->niov;
>>> +    struct iovec *iov = qiov->iov;
>>> +
>>> +    errno = 0;
>>> +    cur.iov_base = 0;
>>> +    cur.iov_len = 0;
>>> +
>>> +    ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags);
>>
>> libqnio comments:
>>
>> 1.
>> There are blocking connect(2) and getaddrinfo(3) calls in
>> iio_writev() so this may hang for arbitrary amounts of time. This is
>> not permitted in .bdrv_aio_readv()/.bdrv_aio_writev(). Please make
>> qnio actually asynchronous.
>>
>> 2.
>> Where does client_callback() free reply? It looks like every reply
>> message causes a memory leak!
>>
>> 3.
>> Buffer overflow in iio_writev() since device[128] cannot fit the
>> device string generated from the vdisk_guid.
>>
>> 4.
>> Buffer overflow in iio_writev() due to
>> strncpy(msg->hinfo.target,device,strlen(device)) where device[128]
>> is larger than target[64]. Also note the previous comments about
>> strncpy() usage.
>>
>> 5.
>> I don't see any endianness handling or portable alignment of struct
>> fields in the network protocol code. Binary network protocols need
>> to take care of these issues for portability. This means libqnio
>> compiled for different architectures will not work. Do you plan to
>> support any other architectures besides x86?
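>>
>> To be concrete about comment 5: every fixed-width field should be
>> packed and unpacked in an explicit wire byte order on both sides
>> instead of copying structs onto the socket. A minimal sketch
>> (hypothetical helpers, not existing libqnio code):
>>
>>     #include <stdint.h>
>>
>>     /* Write a 32-bit field in big-endian (network) byte order */
>>     static inline void put_be32(uint8_t *buf, uint32_t val)
>>     {
>>         buf[0] = val >> 24;
>>         buf[1] = val >> 16;
>>         buf[2] = val >> 8;
>>         buf[3] = val;
>>     }
>>
>>     /* Read a 32-bit field in big-endian (network) byte order */
>>     static inline uint32_t get_be32(const uint8_t *buf)
>>     {
>>         return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
>>                ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
>>     }
>>
>> Byte-wise accessors like these also sidestep struct alignment and
>> padding differences between compilers and architectures.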
>>
>> 6.
>> The networking code doesn't look robust: kvset uses assert() on
>> input from the network so the other side of the connection could
>> cause SIGABRT (coredump), the client uses the msg pointer as the
>> cookie for the response packet so the server can easily crash the
>> client by sending a bogus cookie value, etc. Even on the client side
>> these things are troublesome but on a server they are guaranteed
>> security issues. I didn't look into it deeply. Please audit the
>> code.
>>
>>> +static int vxhs_qemu_init(QDict *options, BDRVVXHSState *s,
>>> +                          int *cfd, int *rfd, Error **errp)
>>> +{
>>> +    QDict *backing_options = NULL;
>>> +    QemuOpts *opts, *tcp_opts;
>>> +    const char *vxhs_filename;
>>> +    char *of_vsa_addr = NULL;
>>> +    Error *local_err = NULL;
>>> +    const char *vdisk_id_opt;
>>> +    char *file_name = NULL;
>>> +    size_t num_servers = 0;
>>> +    char *str = NULL;
>>> +    int ret = 0;
>>> +    int i;
>>> +
>>> +    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
>>> +    qemu_opts_absorb_qdict(opts, options, &local_err);
>>> +    if (local_err) {
>>> +        error_propagate(errp, local_err);
>>> +        ret = -EINVAL;
>>> +        goto out;
>>> +    }
>>> +
>>> +    vxhs_filename = qemu_opt_get(opts, VXHS_OPT_FILENAME);
>>> +    if (vxhs_filename) {
>>> +        trace_vxhs_qemu_init_filename(vxhs_filename);
>>> +    }
>>> +
>>> +    vdisk_id_opt = qemu_opt_get(opts, VXHS_OPT_VDISK_ID);
>>> +    if (!vdisk_id_opt) {
>>> +        error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_VDISK_ID);
>>> +        ret = -EINVAL;
>>> +        goto out;
>>> +    }
>>> +    s->vdisk_guid = g_strdup(vdisk_id_opt);
>>> +    trace_vxhs_qemu_init_vdisk(vdisk_id_opt);
>>> +
>>> +    num_servers = qdict_array_entries(options, VXHS_OPT_SERVER);
>>> +    if (num_servers < 1) {
>>> +        error_setg(&local_err, QERR_MISSING_PARAMETER, "server");
>>> +        ret = -EINVAL;
>>> +        goto out;
>>> +    } else if (num_servers > VXHS_MAX_HOSTS) {
>>> +        error_setg(&local_err, QERR_INVALID_PARAMETER, "server");
>>> +        error_append_hint(errp, "Maximum %d servers allowed.\n",
>>> +                          VXHS_MAX_HOSTS);
>>> +        ret = -EINVAL;
>>> +        goto out;
>>> +    }
>>> +    trace_vxhs_qemu_init_numservers(num_servers);
>>> +
>>> +    for (i = 0; i < num_servers; i++) {
>>> +        str = g_strdup_printf(VXHS_OPT_SERVER"%d.", i);
>>> +        qdict_extract_subqdict(options, &backing_options, str);
>>> +
>>> +        /* Create opts info from runtime_tcp_opts list */
>>> +        tcp_opts = qemu_opts_create(&runtime_tcp_opts, NULL, 0, &error_abort);
>>> +        qemu_opts_absorb_qdict(tcp_opts, backing_options, &local_err);
>>> +        if (local_err) {
>>> +            qdict_del(backing_options, str);
>>
>> backing_options is leaked and there's no need to delete the str key.
>>
>>> +            qemu_opts_del(tcp_opts);
>>> +            g_free(str);
>>> +            ret = -EINVAL;
>>> +            goto out;
>>> +        }
>>> +
>>> +        s->vdisk_hostinfo[i].hostip = g_strdup(qemu_opt_get(tcp_opts,
>>> +                                                            VXHS_OPT_HOST));
>>> +        s->vdisk_hostinfo[i].port = g_ascii_strtoll(qemu_opt_get(tcp_opts,
>>> +                                                                 VXHS_OPT_PORT),
>>> +                                                    NULL, 0);
>>
>> This will segfault if the port option was missing.
>>
>>> +
>>> +        s->vdisk_hostinfo[i].qnio_cfd = -1;
>>> +        s->vdisk_hostinfo[i].vdisk_rfd = -1;
>>> +        trace_vxhs_qemu_init(s->vdisk_hostinfo[i].hostip,
>>> +                             s->vdisk_hostinfo[i].port);
>>
>> It's not safe to use the %s format specifier for a trace event with
>> a NULL value. In the case where hostip is NULL this could crash on
>> some systems.
>>
>>> +
>>> +        qdict_del(backing_options, str);
>>> +        qemu_opts_del(tcp_opts);
>>> +        g_free(str);
>>> +    }
>>
>> backing_options is leaked.
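>>
>> For example, each loop iteration could validate the port before
>> converting it and release the subqdict when done. A sketch against
>> the QDict/QemuOpts APIs this patch already uses (QDECREF() drops the
>> dict reference):
>>
>>     const char *port_str = qemu_opt_get(tcp_opts, VXHS_OPT_PORT);
>>     if (!port_str) {
>>         error_setg(&local_err, QERR_MISSING_PARAMETER, VXHS_OPT_PORT);
>>         qemu_opts_del(tcp_opts);
>>         QDECREF(backing_options);
>>         backing_options = NULL;
>>         g_free(str);
>>         ret = -EINVAL;
>>         goto out;
>>     }
>>     s->vdisk_hostinfo[i].port = g_ascii_strtoll(port_str, NULL, 0);
>>
>>     /* ... and at the bottom of every iteration: */
>>     qemu_opts_del(tcp_opts);
>>     QDECREF(backing_options);
>>     backing_options = NULL;
>>     g_free(str);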
>>
>>> +
>>> +    s->vdisk_nhosts = i;
>>> +    s->vdisk_cur_host_idx = 0;
>>> +    file_name = g_strdup_printf("%s%s", vdisk_prefix, s->vdisk_guid);
>>> +    of_vsa_addr = g_strdup_printf("of://%s:%d",
>>> +                s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
>>> +                s->vdisk_hostinfo[s->vdisk_cur_host_idx].port);
>>
>> Can we get here with num_servers == 0? In that case this would
>> access uninitialized memory. I guess num_servers == 0 does not make
>> sense and there should be an error case for it.
>>
>>> +
>>> +    /*
>>> +     * .bdrv_open() and .bdrv_create() run under the QEMU global mutex.
>>> +     */
>>> +    if (global_qnio_ctx == NULL) {
>>> +        global_qnio_ctx = vxhs_setup_qnio();
>>
>> libqnio comment:
>> The client epoll thread should mask all signals (like
>> qemu_thread_create()). Otherwise it may receive signals that it
>> cannot deal with.
>>
>>> +        if (global_qnio_ctx == NULL) {
>>> +            error_setg(&local_err, "Failed vxhs_setup_qnio");
>>> +            ret = -EINVAL;
>>> +            goto out;
>>> +        }
>>> +    }
>>> +
>>> +    ret = vxhs_qnio_iio_open(cfd, of_vsa_addr, rfd, file_name);
>>> +    if (!ret) {
>>> +        error_setg(&local_err, "Failed qnio_iio_open");
>>> +        ret = -EIO;
>>> +    }
>>
>> The return value of vxhs_qnio_iio_open() is 0 for success or -errno
>> for error, so this check is inverted: it reports failure on success
>> and lets real errors through.
>>
>> I guess you never ran this code! The block driver won't even open
>> successfully.
>>
>>> +
>>> +out:
>>> +    g_free(file_name);
>>> +    g_free(of_vsa_addr);
>>> +    qemu_opts_del(opts);
>>> +
>>> +    if (ret < 0) {
>>> +        for (i = 0; i < num_servers; i++) {
>>> +            g_free(s->vdisk_hostinfo[i].hostip);
>>> +        }
>>> +        g_free(s->vdisk_guid);
>>> +        s->vdisk_guid = NULL;
>>> +        errno = -ret;
>>
>> There is no need to set errno here. The return value already
>> contains the error and the caller doesn't look at errno.
>>
>>> +    }
>>> +    error_propagate(errp, local_err);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static int vxhs_open(BlockDriverState *bs, QDict *options,
>>> +                     int bdrv_flags, Error **errp)
>>> +{
>>> +    BDRVVXHSState *s = bs->opaque;
>>> +    AioContext *aio_context;
>>> +    int qemu_qnio_cfd = -1;
>>> +    int device_opened = 0;
>>> +    int qemu_rfd = -1;
>>> +    int ret = 0;
>>> +    int i;
>>> +
>>> +    ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp);
>>> +    if (ret < 0) {
>>> +        trace_vxhs_open_fail(ret);
>>> +        return ret;
>>> +    }
>>> +
>>> +    device_opened = 1;
>>> +    s->qnio_ctx = global_qnio_ctx;
>>> +    s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd;
>>> +    s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd;
>>> +    s->vdisk_size = 0;
>>> +    QSIMPLEQ_INIT(&s->vdisk_aio_retryq);
>>> +
>>> +    /*
>>> +     * Create a pipe for communicating between two threads in different
>>> +     * context. Set handler for read event, which gets triggered when
>>> +     * IO completion is done by non-QEMU context.
>>> +     */
>>> +    ret = qemu_pipe(s->fds);
>>> +    if (ret < 0) {
>>> +        trace_vxhs_open_epipe('.');
>>> +        ret = -errno;
>>> +        goto errout;
>>
>> This leaks s->vdisk_guid, s->vdisk_hostinfo[i].hostip, etc.
>> bdrv_close() will not be called so this function must do cleanup
>> itself.
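>>
>> Since bdrv_close() won't run after a failed .bdrv_open(), the errout
>> path needs to tear everything down itself; roughly (a sketch reusing
>> this patch's own fields and helpers):
>>
>>     errout:
>>         for (i = 0; i < s->vdisk_nhosts; i++) {
>>             vxhs_qnio_iio_close(s, i);
>>             g_free(s->vdisk_hostinfo[i].hostip);
>>         }
>>         g_free(s->vdisk_guid);
>>         s->vdisk_guid = NULL;
>>         trace_vxhs_open_fail(ret);
>>         return ret;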
>>
>>> +    }
>>> +    fcntl(s->fds[VDISK_FD_READ], F_SETFL, O_NONBLOCK);
>>> +
>>> +    aio_context = bdrv_get_aio_context(bs);
>>> +    aio_set_fd_handler(aio_context, s->fds[VDISK_FD_READ],
>>> +                       false, vxhs_aio_event_reader, NULL, s);
>>> +
>>> +    /*
>>> +     * Initialize the spin-locks.
>>> +     */
>>> +    qemu_spin_init(&s->vdisk_lock);
>>> +    qemu_spin_init(&s->vdisk_acb_lock);
>>> +
>>> +    return 0;
>>> +
>>> +errout:
>>> +    /*
>>> +     * Close remote vDisk device if it was opened earlier
>>> +     */
>>> +    if (device_opened) {
>>
>> This is always true. The device_opened variable can be removed.
>>
>>> +/*
>>> + * This allocates QEMU-VXHS callback for each IO
>>> + * and is passed to QNIO. When QNIO completes the work,
>>> + * it will be passed back through the callback.
>>> + */
>>> +static BlockAIOCB *vxhs_aio_rw(BlockDriverState *bs,
>>> +                               int64_t sector_num, QEMUIOVector *qiov,
>>> +                               int nb_sectors,
>>> +                               BlockCompletionFunc *cb,
>>> +                               void *opaque, int iodir)
>>> +{
>>> +    VXHSAIOCB *acb = NULL;
>>> +    BDRVVXHSState *s = bs->opaque;
>>> +    size_t size;
>>> +    uint64_t offset;
>>> +    int iio_flags = 0;
>>> +    int ret = 0;
>>> +    void *qnio_ctx = s->qnio_ctx;
>>> +    uint32_t rfd = s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd;
>>> +
>>> +    offset = sector_num * BDRV_SECTOR_SIZE;
>>> +    size = nb_sectors * BDRV_SECTOR_SIZE;
>>> +
>>> +    acb = qemu_aio_get(&vxhs_aiocb_info, bs, cb, opaque);
>>> +    /*
>>> +     * Setup or initialize VXHSAIOCB.
>>> +     * Every single field should be initialized since
>>> +     * acb will be picked up from the slab without
>>> +     * initializing with zero.
>>> +     */
>>> +    acb->io_offset = offset;
>>> +    acb->size = size;
>>> +    acb->ret = 0;
>>> +    acb->flags = 0;
>>> +    acb->aio_done = VXHS_IO_INPROGRESS;
>>> +    acb->segments = 0;
>>> +    acb->buffer = 0;
>>> +    acb->qiov = qiov;
>>> +    acb->direction = iodir;
>>> +
>>> +    qemu_spin_lock(&s->vdisk_lock);
>>> +    if (OF_VDISK_FAILED(s)) {
>>> +        trace_vxhs_aio_rw(s->vdisk_guid, iodir, size, offset);
>>> +        qemu_spin_unlock(&s->vdisk_lock);
>>> +        goto errout;
>>> +    }
>>> +    if (OF_VDISK_IOFAILOVER_IN_PROGRESS(s)) {
>>> +        QSIMPLEQ_INSERT_TAIL(&s->vdisk_aio_retryq, acb, retry_entry);
>>> +        s->vdisk_aio_retry_qd++;
>>> +        OF_AIOCB_FLAGS_SET_QUEUED(acb);
>>> +        qemu_spin_unlock(&s->vdisk_lock);
>>> +        trace_vxhs_aio_rw_retry(s->vdisk_guid, acb, 1);
>>> +        goto out;
>>> +    }
>>> +    s->vdisk_aio_count++;
>>> +    qemu_spin_unlock(&s->vdisk_lock);
>>> +
>>> +    iio_flags = (IIO_FLAG_DONE | IIO_FLAG_ASYNC);
>>> +
>>> +    switch (iodir) {
>>> +    case VDISK_AIO_WRITE:
>>> +        vxhs_inc_acb_segment_count(acb, 1);
>>> +        ret = vxhs_qnio_iio_writev(qnio_ctx, rfd, qiov,
>>> +                                   offset, (void *)acb, iio_flags);
>>> +        break;
>>> +    case VDISK_AIO_READ:
>>> +        vxhs_inc_acb_segment_count(acb, 1);
>>> +        ret = vxhs_qnio_iio_readv(qnio_ctx, rfd, qiov,
>>> +                                  offset, (void *)acb, iio_flags);
>>> +        break;
>>> +    default:
>>> +        trace_vxhs_aio_rw_invalid(iodir);
>>> +        goto errout;
>>
>> s->vdisk_aio_count must be decremented before returning.
>>
>>> +static void vxhs_close(BlockDriverState *bs)
>>> +{
>>> +    BDRVVXHSState *s = bs->opaque;
>>> +    int i;
>>> +
>>> +    trace_vxhs_close(s->vdisk_guid);
>>> +    close(s->fds[VDISK_FD_READ]);
>>> +    close(s->fds[VDISK_FD_WRITE]);
>>> +
>>> +    /*
>>> +     * Clearing all the event handlers for oflame registered to QEMU
>>> +     */
>>> +    aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ],
>>> +                       false, NULL, NULL, NULL);
>>
>> Please remove the event handler before closing the fd. I don't think
>> it matters in this case but in other scenarios there could be race
>> conditions if another thread opens an fd and the file descriptor
>> number is reused.
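>>
>> In other words (sketch):
>>
>>     /* Unregister the read handler first so the fd number cannot be
>>      * recycled by another thread while it is still registered with
>>      * the AioContext.
>>      */
>>     aio_set_fd_handler(bdrv_get_aio_context(bs), s->fds[VDISK_FD_READ],
>>                        false, NULL, NULL, NULL);
>>     close(s->fds[VDISK_FD_READ]);
>>     close(s->fds[VDISK_FD_WRITE]);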