* [PATCH 21/28] ibnbd_srv: add header shared in ibnbd_server
From: Jack Wang @ 2017-03-24 10:45 UTC
To: linux-block, linux-rdma
Cc: dledford, axboe, hch, mail, Milind.dumbare, yun.wang, Jack Wang,
Kleber Souza, Danil Kipnis, Roman Pen
In-Reply-To: <1490352343-20075-1-git-send-email-jinpu.wangl@profitbricks.com>
From: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Kleber Souza <kleber.souza@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
---
drivers/block/ibnbd_server/ibnbd_srv.h | 115 +++++++++++++++++++++++++++++++++
1 file changed, 115 insertions(+)
create mode 100644 drivers/block/ibnbd_server/ibnbd_srv.h
diff --git a/drivers/block/ibnbd_server/ibnbd_srv.h b/drivers/block/ibnbd_server/ibnbd_srv.h
new file mode 100644
index 0000000..764a31f
--- /dev/null
+++ b/drivers/block/ibnbd_server/ibnbd_srv.h
@@ -0,0 +1,115 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ * Jack Wang <jinpu.wang@profitbricks.com>
+ * Kleber Souza <kleber.souza@profitbricks.com>
+ * Danil Kipnis <danil.kipnis@profitbricks.com>
+ * Roman Pen <roman.penyaev@profitbricks.com>
+ * Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions, and the following disclaimer,
+ * without modification.
+ * 2. Redistributions in binary form must reproduce at minimum a disclaimer
+ * substantially similar to the "NO WARRANTY" disclaimer below
+ * ("Disclaimer") and any redistribution must be conditioned upon
+ * including a substantially similar Disclaimer requirement for further
+ * binary redistribution.
+ * 3. Neither the names of the above-listed copyright holders nor the names
+ * of any contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * NO WARRANTY
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+ * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGES.
+ *
+ */
+
+#ifndef _IBNBD_SRV_H
+#define _IBNBD_SRV_H
+
+#include <linux/types.h>
+#include <linux/idr.h>
+#include <linux/kref.h>
+#include "../ibnbd_inc/ibnbd.h"
+#include "../ibnbd_inc/ibnbd-proto.h"
+#include <rdma/ibtrs.h>
+
+enum sess_state {
+ SESS_STATE_CONNECTED,
+ SESS_STATE_DISCONNECTED
+};
+
+struct ibnbd_srv_session {
+ struct list_head list; /* for the global sess_list */
+ struct ibtrs_session *ibtrs_sess;
+ char str_addr[IBTRS_ADDRLEN];
+ char hostname[MAXHOSTNAMELEN];
+ int queue_depth;
+ enum sess_state state;
+ struct bio_set *sess_bio_set;
+
+ rwlock_t index_lock ____cacheline_aligned;
+ struct idr index_idr;
+ struct mutex lock; /* protects sess_dev_list */
+ struct list_head sess_dev_list; /* list of struct ibnbd_srv_sess_dev */
+ u8 ver; /* IBNBD protocol version */
+};
+
+struct ibnbd_srv_dev {
+ struct list_head list; /* global dev_list */
+
+ struct kobject dev_kobj;
+ struct kobject dev_clients_kobj;
+
+ struct kref kref;
+ char id[NAME_MAX];
+
+ struct mutex lock; /* protects sess_dev_list and open_write_cnt */
+ struct list_head sess_dev_list; /* list of struct ibnbd_srv_sess_dev */
+ int open_write_cnt;
+ enum ibnbd_io_mode mode;
+};
+
+struct ibnbd_srv_sess_dev {
+ struct list_head dev_list; /* for struct ibnbd_srv_dev->sess_dev_list */
+ struct list_head sess_list; /* for struct ibnbd_srv_session->sess_dev_list */
+
+ struct ibnbd_dev *ibnbd_dev;
+ struct ibnbd_srv_session *sess;
+ struct ibnbd_srv_dev *dev;
+ struct kobject kobj;
+ struct completion *sysfs_release_compl;
+
+ u32 device_id;
+ u32 clt_device_id;
+ fmode_t open_flags;
+ struct kref kref;
+ struct completion *destroy_comp;
+ char pathname[NAME_MAX];
+ size_t nsectors;
+ bool is_visible;
+};
+
+int ibnbd_srv_revalidate_dev(struct ibnbd_srv_dev *dev);
+
+#endif
--
2.7.4
* [PATCH 22/28] ibnbd_srv: add main functionality
From: Jack Wang @ 2017-03-24 10:45 UTC
To: linux-block, linux-rdma
Cc: dledford, axboe, hch, mail, Milind.dumbare, yun.wang, Jack Wang,
Kleber Souza, Danil Kipnis, Roman Pen
In-Reply-To: <1490352343-20075-1-git-send-email-jinpu.wangl@profitbricks.com>
From: Jack Wang <jinpu.wang@profitbricks.com>
Process incoming I/O requests from the ibtrs server and hand them down to
the underlying block device.
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Kleber Souza <kleber.souza@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
---
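Note (not part of the commit): ibnbd_srv_get_full_path() joins
dev_search_path and the client-supplied device name with a '/' and then
squeezes runs of slashes in place, so a search path of "/dev/" and a name
of "/sdb" end up as "/dev/sdb". The sketch below shows the same in-place
technique as a standalone userspace function. It is an illustration rather
than the driver code: collapse_slashes is a hypothetical name, and it
tracks a separate read and write cursor instead of the a/b pointer pair
used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Collapse every run of consecutive '/' characters in @path into a
 * single '/', rewriting the buffer in place. The result is never longer
 * than the input, so no extra storage is needed.
 */
void collapse_slashes(char *path)
{
	char *dst = path;	/* next byte to write */
	const char *src = path;	/* next byte to read */

	assert(path != NULL);

	while (*src != '\0') {
		/* Drop a '/' that directly follows an already written '/'. */
		if (*src == '/' && dst != path && dst[-1] == '/') {
			src++;
			continue;
		}
		*dst++ = *src++;
	}
	*dst = '\0';
}
```

Because the write cursor never overtakes the read cursor, the function can
run directly on the buffer that snprintf() filled.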
drivers/block/ibnbd_server/ibnbd_srv.c | 1074 ++++++++++++++++++++++++++++++++
1 file changed, 1074 insertions(+)
create mode 100644 drivers/block/ibnbd_server/ibnbd_srv.c
diff --git a/drivers/block/ibnbd_server/ibnbd_srv.c b/drivers/block/ibnbd_server/ibnbd_srv.c
new file mode 100644
index 0000000..13832b6
--- /dev/null
+++ b/drivers/block/ibnbd_server/ibnbd_srv.c
@@ -0,0 +1,1074 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ * Jack Wang <jinpu.wang@profitbricks.com>
+ * Kleber Souza <kleber.souza@profitbricks.com>
+ * Danil Kipnis <danil.kipnis@profitbricks.com>
+ * Roman Pen <roman.penyaev@profitbricks.com>
+ * Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions, and the following disclaimer,
+ * without modification.
+ * 2. Redistributions in binary form must reproduce at minimum a disclaimer
+ * substantially similar to the "NO WARRANTY" disclaimer below
+ * ("Disclaimer") and any redistribution must be conditioned upon
+ * including a substantially similar Disclaimer requirement for further
+ * binary redistribution.
+ * 3. Neither the names of the above-listed copyright holders nor the names
+ * of any contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * NO WARRANTY
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+ * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGES.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/blkdev.h>
+#include <linux/idr.h>
+#include <rdma/ibtrs.h>
+#include "../ibnbd_inc/ibnbd-proto.h"
+#include <rdma/ibtrs_srv.h>
+#include "../ibnbd_inc/ibnbd.h"
+#include "ibnbd_srv.h"
+#include "ibnbd_srv_log.h"
+#include "ibnbd_srv_sysfs.h"
+#include "ibnbd_dev.h"
+
+MODULE_AUTHOR("ibnbd@profitbricks.com");
+MODULE_VERSION(__stringify(IBNBD_VER));
+MODULE_DESCRIPTION("InfiniBand Network Block Device Server");
+MODULE_LICENSE("GPL");
+
+#define DEFAULT_DEV_SEARCH_PATH "/"
+
+static char dev_search_path[PATH_MAX] = DEFAULT_DEV_SEARCH_PATH;
+
+static int dev_search_path_set(const char *val, const struct kernel_param *kp)
+{
+ char *dup;
+
+ if (strlen(val) >= sizeof(dev_search_path))
+ return -EINVAL;
+
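+ dup = kstrdup(val, GFP_KERNEL);
+ if (!dup)
+ return -ENOMEM;
+
+ /* Strip one trailing newline; an empty string has nothing to strip. */
+ if (dup[0] != '\0' && dup[strlen(dup) - 1] == '\n')
+ dup[strlen(dup) - 1] = '\0';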
+
+ strlcpy(dev_search_path, dup, sizeof(dev_search_path));
+
+ kfree(dup);
+ INFO_NP("dev_search_path changed to '%s'\n", dev_search_path);
+
+ return 0;
+}
+
+static struct kparam_string dev_search_path_kparam_str = {
+ .maxlen = sizeof(dev_search_path),
+ .string = dev_search_path
+};
+
+static const struct kernel_param_ops dev_search_path_ops = {
+ .set = dev_search_path_set,
+ .get = param_get_string,
+};
+
+module_param_cb(dev_search_path, &dev_search_path_ops,
+ &dev_search_path_kparam_str, 0444);
+MODULE_PARM_DESC(dev_search_path, "Sets the device_search_path."
+ " When a device is mapped this path is prepended to the"
+ " device_path from the map_device operation."
+ " (default: " DEFAULT_DEV_SEARCH_PATH ")");
+
+static int def_io_mode = IBNBD_BLOCKIO;
+module_param(def_io_mode, int, 0444);
+MODULE_PARM_DESC(def_io_mode, "By default, export devices in"
+ " blockio(" __stringify(_IBNBD_BLOCKIO) ") or"
+ " fileio(" __stringify(_IBNBD_FILEIO) ") mode."
+ " (default: " __stringify(_IBNBD_BLOCKIO) " (blockio))");
+
+static DEFINE_MUTEX(sess_lock);
+static DEFINE_SPINLOCK(dev_lock);
+
+static LIST_HEAD(sess_list);
+static LIST_HEAD(dev_list);
+
+
+struct ibnbd_io_private {
+ struct ibtrs_ops_id *id;
+ struct ibnbd_srv_sess_dev *sess_dev;
+};
+
+static struct ibtrs_srv_ops ibnbd_srv_ops;
+
+static void ibnbd_sess_dev_release(struct kref *kref)
+{
+ struct ibnbd_srv_sess_dev *sess_dev;
+
+ sess_dev = container_of(kref, struct ibnbd_srv_sess_dev, kref);
+ complete(sess_dev->destroy_comp);
+}
+
+static inline void ibnbd_put_sess_dev(struct ibnbd_srv_sess_dev *sess_dev)
+{
+ kref_put(&sess_dev->kref, ibnbd_sess_dev_release);
+}
+
+static void ibnbd_endio(void *priv, int error)
+{
+ int ret;
+ struct ibnbd_io_private *ibnbd_priv = priv;
+ struct ibnbd_srv_sess_dev *sess_dev = ibnbd_priv->sess_dev;
+
+ ibnbd_put_sess_dev(sess_dev);
+
+ ret = ibtrs_srv_resp_rdma(ibnbd_priv->id, error);
+ if (unlikely(ret))
+ ERR_RL(sess_dev, "Sending I/O response failed, errno: %d\n",
+ ret);
+
+ kfree(priv);
+}
+
+static struct ibnbd_srv_sess_dev *
+ibnbd_get_sess_dev(int dev_id, struct ibnbd_srv_session *srv_sess)
+{
+ struct ibnbd_srv_sess_dev *sess_dev;
+ int ret = 0;
+
+ read_lock(&srv_sess->index_lock);
+ sess_dev = idr_find(&srv_sess->index_idr, dev_id);
+ if (likely(sess_dev))
+ ret = kref_get_unless_zero(&sess_dev->kref);
+ read_unlock(&srv_sess->index_lock);
+
+ if (unlikely(!sess_dev || !ret))
+ return ERR_PTR(-ENXIO);
+
+ return sess_dev;
+}
+
+static int process_rdma(struct ibtrs_session *sess,
+ struct ibnbd_srv_session *srv_sess,
+ struct ibtrs_ops_id *id, void *data, u32 len)
+{
+ struct ibnbd_io_private *priv;
+ struct ibnbd_srv_sess_dev *sess_dev;
+ struct ibnbd_msg_io *msg;
+ size_t data_len;
+ int err;
+ u32 dev_id;
+
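+ /* The buffer must at least hold the trailing ibnbd message. */
+ if (unlikely(len < sizeof(*msg)))
+ return -EINVAL;
+
+ priv = kmalloc(sizeof(*priv), GFP_KERNEL);
+ if (unlikely(!priv))
+ return -ENOMEM;
+
+ data_len = len - sizeof(*msg);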
+ /* ibnbd message is after disk data */
+ msg = (struct ibnbd_msg_io *)(data + data_len);
+
+ dev_id = msg->device_id;
+
+ sess_dev = ibnbd_get_sess_dev(dev_id, srv_sess);
+ if (unlikely(IS_ERR(sess_dev))) {
+ ERR_NP_RL("Got I/O request from client %s for unknown device id"
+ " %d\n", srv_sess->str_addr, dev_id);
+ err = -ENOTCONN;
+ goto err;
+ }
+
+ priv->sess_dev = sess_dev;
+ priv->id = id;
+
+ err = ibnbd_dev_submit_io(sess_dev->ibnbd_dev, msg->sector, data,
+ data_len, msg->bi_size, msg->rw, priv);
+ if (unlikely(err)) {
+ ERR(sess_dev, "Submitting I/O to device failed, errno: %d\n",
+ err);
+ goto sess_dev_put;
+ }
+
+ return 0;
+
+sess_dev_put:
+ ibnbd_put_sess_dev(sess_dev);
+err:
+ kfree(priv);
+ return err;
+}
+
+static void destroy_device(struct ibnbd_srv_dev *dev)
+{
+ WARN(!list_empty(&dev->sess_dev_list),
+ "Device %s is being destroyed but still in use!\n",
+ dev->id);
+
+ spin_lock(&dev_lock);
+ list_del(&dev->list);
+ spin_unlock(&dev_lock);
+
+ if (dev->dev_kobj.state_in_sysfs)
+ /*
+ * Destroy kobj only if it was really created.
+ * The following call should be sync, because
+ * we free the memory afterwards.
+ */
+ ibnbd_srv_destroy_dev_sysfs(dev);
+
+ kfree(dev);
+}
+
+static void destroy_device_cb(struct kref *kref)
+{
+ struct ibnbd_srv_dev *dev;
+
+ dev = container_of(kref, struct ibnbd_srv_dev, kref);
+
+ destroy_device(dev);
+}
+
+static void ibnbd_put_srv_dev(struct ibnbd_srv_dev *dev)
+{
+ kref_put(&dev->kref, destroy_device_cb);
+}
+
+static void ibnbd_destroy_sess_dev(struct ibnbd_srv_sess_dev *sess_dev,
+ bool locked)
+{
+ struct completion dc;
+
+ write_lock(&sess_dev->sess->index_lock);
+ idr_remove(&sess_dev->sess->index_idr, sess_dev->device_id);
+ write_unlock(&sess_dev->sess->index_lock);
+
+ init_completion(&dc);
+ sess_dev->destroy_comp = &dc;
+ ibnbd_put_sess_dev(sess_dev);
+ wait_for_completion(&dc);
+
+ ibnbd_dev_close(sess_dev->ibnbd_dev);
+ if (!locked)
+ mutex_lock(&sess_dev->sess->lock);
+ list_del(&sess_dev->sess_list);
+ if (!locked)
+ mutex_unlock(&sess_dev->sess->lock);
+
+ mutex_lock(&sess_dev->dev->lock);
+ list_del(&sess_dev->dev_list);
+ if (sess_dev->open_flags & FMODE_WRITE)
+ sess_dev->dev->open_write_cnt--;
+ mutex_unlock(&sess_dev->dev->lock);
+
+ ibnbd_put_srv_dev(sess_dev->dev);
+
+ INFO(sess_dev, "Device closed\n");
+ kfree(sess_dev);
+}
+
+static void destroy_sess(struct ibnbd_srv_session *srv_sess)
+{
+ struct ibnbd_srv_sess_dev *sess_dev, *tmp;
+
+ srv_sess->state = SESS_STATE_DISCONNECTED;
+
+ if (list_empty(&srv_sess->sess_dev_list))
+ goto out;
+
+ mutex_lock(&srv_sess->lock);
+ list_for_each_entry_safe(sess_dev, tmp, &srv_sess->sess_dev_list,
+ sess_list) {
+ ibnbd_srv_destroy_dev_client_sysfs(sess_dev);
+ ibnbd_destroy_sess_dev(sess_dev, true);
+ }
+ mutex_unlock(&srv_sess->lock);
+
+out:
+ idr_destroy(&srv_sess->index_idr);
+ bioset_free(srv_sess->sess_bio_set);
+
+ INFO_NP("IBTRS Session to %s disconnected\n", srv_sess->str_addr);
+
+ mutex_lock(&sess_lock);
+ list_del(&srv_sess->list);
+ mutex_unlock(&sess_lock);
+
+ kfree(srv_sess);
+}
+
+static int create_sess(struct ibtrs_session *sess)
+{
+ struct ibnbd_srv_session *srv_sess;
+
+ srv_sess = kzalloc(sizeof(*srv_sess), GFP_KERNEL);
+ if (!srv_sess) {
+ ERR_NP("Allocating srv_session for client %s failed\n",
+ ibtrs_srv_get_sess_addr(sess));
+ return -ENOMEM;
+ }
+ srv_sess->queue_depth = ibtrs_srv_get_sess_qdepth(sess);
+ srv_sess->sess_bio_set = bioset_create(srv_sess->queue_depth, 0);
+ if (!srv_sess->sess_bio_set) {
+ ERR_NP("Allocating bio set for client %s failed\n",
+ ibtrs_srv_get_sess_addr(sess));
+ kfree(srv_sess);
+ return -ENOMEM;
+ }
+
+ idr_init(&srv_sess->index_idr);
+ rwlock_init(&srv_sess->index_lock);
+ INIT_LIST_HEAD(&srv_sess->sess_dev_list);
+ mutex_init(&srv_sess->lock);
+ srv_sess->state = SESS_STATE_CONNECTED;
+ mutex_lock(&sess_lock);
+ list_add(&srv_sess->list, &sess_list);
+ mutex_unlock(&sess_lock);
+
+ srv_sess->ibtrs_sess = sess;
+ strlcpy(srv_sess->str_addr, ibtrs_srv_get_sess_addr(sess),
+ sizeof(srv_sess->str_addr));
+
+ ibtrs_srv_set_sess_priv(sess, srv_sess);
+
+ return 0;
+}
+
+static int ibnbd_srv_sess_ev(struct ibtrs_session *sess,
+ enum ibtrs_srv_sess_ev ev, void *priv)
+{
+ struct ibnbd_srv_session *srv_sess = priv;
+
+ switch (ev) {
+ case IBTRS_SRV_SESS_EV_CONNECTED:
+ INFO_NP("IBTRS session to %s established\n",
+ ibtrs_srv_get_sess_addr(sess));
+ return create_sess(sess);
+
+ case IBTRS_SRV_SESS_EV_DISCONNECTING:
+ if (WARN_ON(!priv ||
+ srv_sess->state != SESS_STATE_CONNECTED))
+ return -EINVAL;
+
+ INFO_NP("IBTRS Session to %s will be disconnected.\n",
+ srv_sess->str_addr);
+ srv_sess->state = SESS_STATE_DISCONNECTED;
+
+ return 0;
+
+ case IBTRS_SRV_SESS_EV_DISCONNECTED:
+ if (WARN_ON(!priv))
+ return -EINVAL;
+
+ destroy_sess(srv_sess);
+ return 0;
+
+ default:
+ /* priv may be NULL here, so take the address from the session */
+ WRN_NP("Received unknown IBTRS session event %d from session"
+ " %s\n", ev, ibtrs_srv_get_sess_addr(sess));
+ return -EINVAL;
+ }
+}
+
+static int ibnbd_srv_rdma_ev(struct ibtrs_session *sess, void *priv,
+ struct ibtrs_ops_id *id, enum ibtrs_srv_rdma_ev ev,
+ void *data, size_t len)
+{
+ struct ibnbd_srv_session *srv_sess = priv;
+
+ if (unlikely(WARN_ON(!srv_sess) ||
+ srv_sess->state == SESS_STATE_DISCONNECTED))
+ return -ENODEV;
+
+ switch (ev) {
+ case IBTRS_SRV_RDMA_EV_RECV:
+ case IBTRS_SRV_RDMA_EV_WRITE_REQ:
+ return process_rdma(sess, srv_sess, id, data, len);
+
+ default:
+ WRN_NP("Received unexpected RDMA event %d from session %s\n",
+ ev, srv_sess->str_addr);
+ return -EINVAL;
+ }
+}
+
+static struct ibnbd_srv_sess_dev
+*ibnbd_sess_dev_alloc(struct ibnbd_srv_session *srv_sess)
+{
+ struct ibnbd_srv_sess_dev *sess_dev;
+ int error;
+
+ sess_dev = kzalloc(sizeof(*sess_dev), GFP_KERNEL);
+ if (!sess_dev)
+ return ERR_PTR(-ENOMEM);
+
+ idr_preload(GFP_KERNEL);
+ write_lock(&srv_sess->index_lock);
+
+ error = idr_alloc(&srv_sess->index_idr, sess_dev, 0, -1, GFP_NOWAIT);
+ if (error < 0) {
+ WRN_NP("Allocating idr failed, errno: %d\n", error);
+ goto out_unlock;
+ }
+
+ sess_dev->device_id = error;
+ error = 0;
+
+out_unlock:
+ write_unlock(&srv_sess->index_lock);
+ idr_preload_end();
+ if (error) {
+ kfree(sess_dev);
+ return ERR_PTR(error);
+ }
+
+ return sess_dev;
+}
+
+static struct ibnbd_srv_dev *ibnbd_srv_init_srv_dev(const char *id,
+ enum ibnbd_io_mode mode)
+{
+ struct ibnbd_srv_dev *dev;
+
+ dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+ if (!dev)
+ return ERR_PTR(-ENOMEM);
+
+ strlcpy(dev->id, id, sizeof(dev->id));
+ dev->mode = mode;
+ kref_init(&dev->kref);
+ INIT_LIST_HEAD(&dev->sess_dev_list);
+ mutex_init(&dev->lock);
+
+ return dev;
+}
+
+static struct ibnbd_srv_dev *
+ibnbd_srv_find_or_add_srv_dev(struct ibnbd_srv_dev *new_dev)
+{
+ struct ibnbd_srv_dev *dev;
+
+ spin_lock(&dev_lock);
+ list_for_each_entry(dev, &dev_list, list) {
+ if (!strncmp(dev->id, new_dev->id, sizeof(dev->id))) {
+ if (!kref_get_unless_zero(&dev->kref))
+ /*
+ * We lost the race, device is almost dead.
+ * Continue traversing to find a valid one.
+ */
+ continue;
+ spin_unlock(&dev_lock);
+ return dev;
+ }
+ }
+ list_add(&new_dev->list, &dev_list);
+ spin_unlock(&dev_lock);
+
+ return new_dev;
+}
+
+static int ibnbd_srv_check_update_open_perm(struct ibnbd_srv_dev *srv_dev,
+ struct ibnbd_srv_session *srv_sess,
+ enum ibnbd_io_mode io_mode,
+ enum ibnbd_access_mode access_mode)
+{
+ int ret = -EPERM;
+
+ mutex_lock(&srv_dev->lock);
+
+ if (srv_dev->mode != io_mode) {
+ ERR_NP("Mapping device '%s' for client %s in %s mode forbidden,"
+ " device is already mapped from other client(s) in"
+ " %s mode\n", srv_dev->id, srv_sess->str_addr,
+ ibnbd_io_mode_str(io_mode),
+ ibnbd_io_mode_str(srv_dev->mode));
+ goto out;
+ }
+
+ switch (access_mode) {
+ case IBNBD_ACCESS_RO:
+ ret = 0;
+ break;
+ case IBNBD_ACCESS_RW:
+ if (srv_dev->open_write_cnt == 0) {
+ srv_dev->open_write_cnt++;
+ ret = 0;
+ } else {
+ ERR_NP("Mapping device '%s' for client %s with"
+ " RW permissions failed. Device already opened"
+ " as 'RW' by %d client(s) in %s mode.\n",
+ srv_dev->id, srv_sess->str_addr,
+ srv_dev->open_write_cnt,
+ ibnbd_io_mode_str(srv_dev->mode));
+ }
+ break;
+ case IBNBD_ACCESS_MIGRATION:
+ if (srv_dev->open_write_cnt < 2) {
+ srv_dev->open_write_cnt++;
+ ret = 0;
+ } else {
+ ERR_NP("Mapping device '%s' for client %s with"
+ " migration permissions failed. Device already"
+ " opened as 'RW' by %d client(s) in %s mode.\n",
+ srv_dev->id, srv_sess->str_addr,
+ srv_dev->open_write_cnt,
+ ibnbd_io_mode_str(srv_dev->mode));
+ }
+ break;
+ default:
+ ERR_NP("Received mapping request for device '%s' from client %s"
+ " with invalid access mode: %d\n", srv_dev->id,
+ srv_sess->str_addr, access_mode);
+ ret = -EINVAL;
+ }
+
+out:
+ mutex_unlock(&srv_dev->lock);
+
+ return ret;
+}
+
+static struct ibnbd_srv_dev *
+ibnbd_srv_get_or_create_srv_dev(struct ibnbd_dev *ibnbd_dev,
+ struct ibnbd_srv_session *srv_sess,
+ enum ibnbd_io_mode io_mode,
+ enum ibnbd_access_mode access_mode)
+{
+ int ret;
+ struct ibnbd_srv_dev *new_dev, *dev;
+ const char *dev_name = ibnbd_dev_get_name(ibnbd_dev);
+
+ new_dev = ibnbd_srv_init_srv_dev(dev_name, io_mode);
+ if (IS_ERR(new_dev))
+ return new_dev;
+
+ dev = ibnbd_srv_find_or_add_srv_dev(new_dev);
+ if (dev != new_dev)
+ kfree(new_dev);
+
+ ret = ibnbd_srv_check_update_open_perm(dev, srv_sess, io_mode,
+ access_mode);
+ if (ret) {
+ ibnbd_put_srv_dev(dev);
+ return ERR_PTR(ret);
+ }
+
+ return dev;
+}
+
+static inline void
+ibnbd_srv_fill_msg_open_rsp_header(struct ibnbd_msg_open_rsp *rsp,
+ u32 clt_device_id)
+{
+ rsp->hdr.type = IBNBD_MSG_OPEN_RSP;
+ rsp->clt_device_id = clt_device_id;
+}
+
+static void ibnbd_srv_fill_msg_open_rsp(struct ibnbd_msg_open_rsp *rsp,
+ u32 device_id, u32 clt_device_id,
+ size_t nsectors,
+ const struct ibnbd_dev *ibnbd_dev)
+{
+ struct block_device *bdev;
+
+ ibnbd_srv_fill_msg_open_rsp_header(rsp, clt_device_id);
+
+ rsp->result = 0;
+ rsp->device_id = device_id;
+ rsp->nsectors = nsectors;
+ rsp->logical_block_size =
+ ibnbd_dev_get_logical_bsize(ibnbd_dev);
+ rsp->physical_block_size = ibnbd_dev_get_phys_bsize(ibnbd_dev);
+ rsp->max_segments = ibnbd_dev_get_max_segs(ibnbd_dev);
+ rsp->max_hw_sectors = ibnbd_dev_get_max_hw_sects(ibnbd_dev);
+ rsp->max_write_same_sectors =
+ ibnbd_dev_get_max_write_same_sects(ibnbd_dev);
+
+ rsp->max_discard_sectors =
+ ibnbd_dev_get_max_discard_sects(ibnbd_dev);
+ rsp->discard_zeroes_data =
+ ibnbd_dev_get_discard_zeroes_data(ibnbd_dev);
+ rsp->discard_granularity =
+ ibnbd_dev_get_discard_granularity(ibnbd_dev);
+
+ rsp->discard_alignment = ibnbd_dev_get_discard_alignment(ibnbd_dev);
+ rsp->secure_discard = ibnbd_dev_get_secure_discard(ibnbd_dev);
+
+ bdev = ibnbd_dev_get_bdev(ibnbd_dev);
+ rsp->rotational = !blk_queue_nonrot(bdev_get_queue(bdev));
+ rsp->io_mode = ibnbd_dev->mode;
+
+ DEB("nsectors = %llu, logical_block_size = %d, "
+ "physical_block_size = %d, max_segments = %d, "
+ "max_hw_sectors = %d, max_write_same_sects = %d, "
+ "max_discard_sectors = %d, rotational = %d, io_mode = %d\n",
+ rsp->nsectors, rsp->logical_block_size, rsp->physical_block_size,
+ rsp->max_segments, rsp->max_hw_sectors, rsp->max_write_same_sectors,
+ rsp->max_discard_sectors, rsp->rotational, rsp->io_mode);
+}
+
+static struct ibnbd_srv_sess_dev *
+ibnbd_srv_create_set_sess_dev(struct ibnbd_srv_session *srv_sess,
+ const struct ibnbd_msg_open *open_msg,
+ struct ibnbd_dev *ibnbd_dev, fmode_t open_flags,
+ struct ibnbd_srv_dev *srv_dev)
+{
+ struct ibnbd_srv_sess_dev *sdev = ibnbd_sess_dev_alloc(srv_sess);
+
+ if (IS_ERR(sdev))
+ return sdev;
+
+ kref_init(&sdev->kref);
+
+ strlcpy(sdev->pathname, open_msg->dev_name, sizeof(sdev->pathname));
+
+ sdev->ibnbd_dev = ibnbd_dev;
+ sdev->sess = srv_sess;
+ sdev->dev = srv_dev;
+ sdev->open_flags = open_flags;
+ sdev->clt_device_id = open_msg->clt_device_id;
+
+ return sdev;
+}
+
+static char *ibnbd_srv_get_full_path(const char *dev_name)
+{
+ char *full_path;
+ char *a, *b;
+
+ full_path = kmalloc(PATH_MAX, GFP_KERNEL);
+ if (!full_path)
+ return ERR_PTR(-ENOMEM);
+
+ snprintf(full_path, PATH_MAX, "%s/%s", dev_search_path, dev_name);
+
+ /* eliminate duplicated slashes */
+ a = strchr(full_path, '/');
+ b = a;
+ while (*b != '\0') {
+ if (*b == '/' && *a == '/') {
+ b++;
+ } else {
+ a++;
+ *a = *b;
+ b++;
+ }
+ }
+ a++;
+ *a = '\0';
+
+ return full_path;
+}
+
+static void process_msg_sess_info(struct ibtrs_session *s,
+ struct ibnbd_srv_session *srv_sess,
+ const void *msg, size_t len)
+{
+ int err;
+ const struct ibnbd_msg_sess_info *sess_info_msg = msg;
+ struct ibnbd_msg_sess_info_rsp rsp;
+ struct kvec vec = {
+ .iov_base = &rsp,
+ .iov_len = sizeof(rsp)
+ };
+
+ if (srv_sess->hostname[0] == '\0')
+ strlcpy(srv_sess->hostname, ibtrs_srv_get_sess_hostname(s),
+ sizeof(srv_sess->hostname));
+
+ srv_sess->ver = min_t(u8, sess_info_msg->ver, IBNBD_VERSION);
+ DEB("Session to %s (%s) using protocol version %d (client version: %d,"
+ " server version: %d)\n", srv_sess->str_addr, srv_sess->hostname,
+ srv_sess->ver, sess_info_msg->ver, IBNBD_VERSION);
+
+ rsp.hdr.type = IBNBD_MSG_SESS_INFO_RSP;
+ rsp.ver = srv_sess->ver;
+
+ err = ibtrs_srv_send(s, &vec, 1);
+ if (unlikely(err))
+ ERR_NP("Failed to send session info response to client"
+ " %s (%s)\n", srv_sess->str_addr, srv_sess->hostname);
+}
+
+static void process_msg_open(struct ibtrs_session *s,
+ struct ibnbd_srv_session *srv_sess,
+ const void *msg, size_t len)
+{
+ int ret;
+ struct ibnbd_srv_dev *srv_dev;
+ struct ibnbd_srv_sess_dev *srv_sess_dev;
+ const struct ibnbd_msg_open *open_msg = msg;
+ fmode_t open_flags;
+ char *full_path;
+ struct ibnbd_dev *ibnbd_dev;
+ enum ibnbd_io_mode io_mode;
+ struct ibnbd_msg_open_rsp rsp;
+ struct kvec vec = {
+ .iov_base = &rsp,
+ .iov_len = sizeof(rsp)
+ };
+
+ DEB("Open message received: client='%s' path='%s' access_mode=%d"
+ " io_mode=%d\n", srv_sess->str_addr, open_msg->dev_name,
+ open_msg->access_mode, open_msg->io_mode);
+ open_flags = FMODE_READ;
+ if (open_msg->access_mode != IBNBD_ACCESS_RO)
+ open_flags |= FMODE_WRITE;
+
+ /* + 1 accounts for the '/' inserted between the two parts */
+ if ((strlen(dev_search_path) + strlen(open_msg->dev_name) + 1)
+ >= PATH_MAX) {
+ ERR_NP("Opening device for client %s failed, device path too"
+ " long. '%s/%s' is longer than PATH_MAX (%d)\n",
+ srv_sess->str_addr, dev_search_path, open_msg->dev_name,
+ PATH_MAX);
+ ret = -EINVAL;
+ goto reject;
+ }
+ full_path = ibnbd_srv_get_full_path(open_msg->dev_name);
+ if (IS_ERR(full_path)) {
+ ret = PTR_ERR(full_path);
+ ERR_NP("Opening device '%s' for client %s failed,"
+ " failed to get device full path, errno: %d\n",
+ open_msg->dev_name, srv_sess->str_addr, ret);
+ goto reject;
+ }
+
+ if (open_msg->io_mode == IBNBD_BLOCKIO)
+ io_mode = IBNBD_BLOCKIO;
+ else if (open_msg->io_mode == IBNBD_FILEIO)
+ io_mode = IBNBD_FILEIO;
+ else
+ io_mode = def_io_mode;
+
+ ibnbd_dev = ibnbd_dev_open(full_path, open_flags, io_mode,
+ srv_sess->sess_bio_set, ibnbd_endio);
+ if (IS_ERR(ibnbd_dev)) {
+ ERR_NP("Opening device '%s' for client %s failed,"
+ " failed to open the block device, errno:"
+ " %ld\n", full_path, srv_sess->str_addr,
+ PTR_ERR(ibnbd_dev));
+ ret = PTR_ERR(ibnbd_dev);
+ goto free_path;
+ }
+
+ srv_dev = ibnbd_srv_get_or_create_srv_dev(ibnbd_dev, srv_sess, io_mode,
+ open_msg->access_mode);
+ if (IS_ERR(srv_dev)) {
+ ERR_NP("Opening device '%s' for client %s failed,"
+ " creating srv_dev failed, errno: %ld\n", full_path,
+ srv_sess->str_addr, PTR_ERR(srv_dev));
+ ret = PTR_ERR(srv_dev);
+ goto ibnbd_dev_close;
+ }
+
+ srv_sess_dev = ibnbd_srv_create_set_sess_dev(srv_sess, open_msg,
+ ibnbd_dev, open_flags,
+ srv_dev);
+ if (IS_ERR(srv_sess_dev)) {
+ ERR_NP("Opening device '%s' for client %s failed,"
+ " creating sess_dev failed, errno: %ld\n", full_path,
+ srv_sess->str_addr, PTR_ERR(srv_sess_dev));
+ ret = PTR_ERR(srv_sess_dev);
+ goto srv_dev_put;
+ }
+
+ /* Create the srv_dev sysfs files if they haven't been created yet. The
+ * reason to delay the creation is not to create the sysfs files before
+ * we are sure the device can be opened.
+ */
+ mutex_lock(&srv_dev->lock);
+ if (!srv_dev->dev_kobj.state_in_sysfs) {
+ ret = ibnbd_srv_create_dev_sysfs(srv_dev,
+ ibnbd_dev_get_bdev(ibnbd_dev),
+ ibnbd_dev_get_name(ibnbd_dev));
+ if (ret) {
+ mutex_unlock(&srv_dev->lock);
+ ERR(srv_sess_dev, "Opening device failed, failed to"
+ " create device sysfs files, errno: %d\n", ret);
+ goto free_srv_sess_dev;
+ }
+ }
+
+ ret = ibnbd_srv_create_dev_client_sysfs(srv_sess_dev);
+ if (ret) {
+ mutex_unlock(&srv_dev->lock);
+ ERR(srv_sess_dev, "Opening device failed, failed to create"
+ " dev client sysfs files, errno: %d\n", ret);
+ goto free_srv_sess_dev;
+ }
+
+ list_add(&srv_sess_dev->dev_list, &srv_dev->sess_dev_list);
+ mutex_unlock(&srv_dev->lock);
+
+ mutex_lock(&srv_sess->lock);
+ list_add(&srv_sess_dev->sess_list, &srv_sess->sess_dev_list);
+ mutex_unlock(&srv_sess->lock);
+
+ srv_sess_dev->nsectors = ibnbd_dev_get_capacity(ibnbd_dev);
+
+ ibnbd_srv_fill_msg_open_rsp(&rsp, srv_sess_dev->device_id,
+ open_msg->clt_device_id,
+ srv_sess_dev->nsectors, ibnbd_dev);
+
+ if (unlikely(srv_sess->state == SESS_STATE_DISCONNECTED)) {
+ ret = -ENODEV;
+ ERR(srv_sess_dev, "Opening device failed, session"
+ " is disconnected, errno: %d\n", ret);
+ goto remove_srv_sess_dev;
+ }
+
+ ret = ibtrs_srv_send(s, &vec, 1);
+ if (unlikely(ret)) {
+ ERR(srv_sess_dev, "Opening device failed, sending open"
+ " response msg failed, errno: %d\n", ret);
+ goto remove_srv_sess_dev;
+ }
+ srv_sess_dev->is_visible = true;
+ INFO(srv_sess_dev, "Opened device '%s' in %s mode\n",
+ srv_dev->id, ibnbd_io_mode_str(io_mode));
+
+ kfree(full_path);
+ return;
+
+remove_srv_sess_dev:
+ ibnbd_srv_destroy_dev_client_sysfs(srv_sess_dev);
+ mutex_lock(&srv_sess->lock);
+ list_del(&srv_sess_dev->sess_list);
+ mutex_unlock(&srv_sess->lock);
+
+ mutex_lock(&srv_dev->lock);
+ list_del(&srv_sess_dev->dev_list);
+ mutex_unlock(&srv_dev->lock);
+free_srv_sess_dev:
+ write_lock(&srv_sess->index_lock);
+ idr_remove(&srv_sess->index_idr, srv_sess_dev->device_id);
+ write_unlock(&srv_sess->index_lock);
+ kfree(srv_sess_dev);
+srv_dev_put:
+ if (open_msg->access_mode != IBNBD_ACCESS_RO) {
+ mutex_lock(&srv_dev->lock);
+ srv_dev->open_write_cnt--;
+ mutex_unlock(&srv_dev->lock);
+ }
+ ibnbd_put_srv_dev(srv_dev);
+ibnbd_dev_close:
+ ibnbd_dev_close(ibnbd_dev);
+free_path:
+ kfree(full_path);
+reject:
+ DEB("Sending negative response to client %s for device '%s': %d\n",
+ srv_sess->str_addr, open_msg->dev_name, ret);
+ ibnbd_srv_fill_msg_open_rsp_header(&rsp, open_msg->clt_device_id);
+ rsp.result = ret;
+ if (unlikely(srv_sess->state == SESS_STATE_DISCONNECTED))
+ return;
+ ret = ibtrs_srv_send(s, &vec, 1);
+ if (ret)
+ ERR_NP("Rejecting mapping request of device '%s' from client %s"
+ " failed, errno: %d\n", open_msg->dev_name,
+ srv_sess->str_addr, ret);
+}
+
+static int send_msg_close_rsp(struct ibtrs_session *sess, u32 clt_device_id)
+{
+ struct ibnbd_msg_close_rsp msg;
+ struct kvec vec = {
+ .iov_base = &msg,
+ .iov_len = sizeof(msg)
+ };
+
+ msg.hdr.type = IBNBD_MSG_CLOSE_RSP;
+ msg.clt_device_id = clt_device_id;
+
+ return ibtrs_srv_send(sess, &vec, 1);
+}
+
+static void process_msg_close(struct ibtrs_session *s,
+ struct ibnbd_srv_session *srv_sess,
+ const void *msg, size_t len)
+{
+ const struct ibnbd_msg_close *close_msg = msg;
+ struct ibnbd_srv_sess_dev *sess_dev;
+ u32 dev_id;
+
+ dev_id = close_msg->device_id;
+
+ sess_dev = ibnbd_get_sess_dev(dev_id, srv_sess);
+ if (likely(!IS_ERR(sess_dev))) {
+ u32 clt_device_id = sess_dev->clt_device_id;
+
+ ibnbd_srv_destroy_dev_client_sysfs(sess_dev);
+ ibnbd_put_sess_dev(sess_dev);
+ ibnbd_destroy_sess_dev(sess_dev, false);
+ send_msg_close_rsp(s, clt_device_id);
+ } else {
+ ERR_NP("Destroying device id %d from client %s failed,"
+ " device not open\n", dev_id, srv_sess->str_addr);
+ }
+}
+
+static void ibnbd_srv_recv(struct ibtrs_session *sess, void *priv,
+ const void *msg, size_t len)
+{
+ struct ibnbd_msg_hdr *hdr;
+ struct ibnbd_srv_session *srv_sess;
+
+ hdr = (struct ibnbd_msg_hdr *)msg;
+ srv_sess = priv;
+
+ if (unlikely(WARN_ON(!srv_sess)))
+ return;
+ if (unlikely(WARN_ON(!hdr) || ibnbd_validate_message(msg, len)))
+ return;
+
+ print_hex_dump_debug("", DUMP_PREFIX_OFFSET, 8, 1, msg, len, true);
+
+ switch (hdr->type) {
+ case IBNBD_MSG_SESS_INFO:
+ process_msg_sess_info(sess, srv_sess, msg, len);
+ break;
+ case IBNBD_MSG_OPEN:
+ process_msg_open(sess, srv_sess, msg, len);
+ break;
+ case IBNBD_MSG_CLOSE:
+ process_msg_close(sess, srv_sess, msg, len);
+ break;
+ default:
+ WRN_NP("Message with unexpected type %d received from client"
+ " %s\n", hdr->type, srv_sess->str_addr);
+ break;
+ }
+}
+
+static int ibnbd_srv_revalidate_sess_dev(struct ibnbd_srv_sess_dev *sess_dev)
+{
+ int ret;
+ size_t nsectors;
+ struct ibnbd_msg_revalidate msg;
+ struct kvec vec = {
+ .iov_base = &msg,
+ .iov_len = sizeof(msg)
+ };
+
+ nsectors = ibnbd_dev_get_capacity(sess_dev->ibnbd_dev);
+
+ msg.hdr.type = IBNBD_MSG_REVAL;
+ msg.clt_device_id = sess_dev->clt_device_id;
+ msg.nsectors = nsectors;
+
+ if (unlikely(sess_dev->sess->state == SESS_STATE_DISCONNECTED))
+ return -ENODEV;
+
+ if (!sess_dev->is_visible) {
+ INFO(sess_dev, "revalidate device failed, wait for sending "
+ "open reply first\n");
+ return -EAGAIN;
+ }
+
+ ret = ibtrs_srv_send(sess_dev->sess->ibtrs_sess, &vec, 1);
+ if (unlikely(ret)) {
+ ERR(sess_dev, "revalidate: Sending new device size"
+ " to client failed, errno: %d\n", ret);
+ } else {
+ INFO(sess_dev, "notified client about device size change"
+ " (old nsectors: %lu, new nsectors: %lu)\n",
+ sess_dev->nsectors, nsectors);
+ sess_dev->nsectors = nsectors;
+ }
+
+ return ret;
+}
+
+int ibnbd_srv_revalidate_dev(struct ibnbd_srv_dev *dev)
+{
+ struct ibnbd_srv_sess_dev *sess_dev;
+ int ret = 0;
+
+ mutex_lock(&dev->lock);
+ list_for_each_entry(sess_dev, &dev->sess_dev_list, dev_list)
+ ret += ibnbd_srv_revalidate_sess_dev(sess_dev);
+ mutex_unlock(&dev->lock);
+
+ if (ret)
+ return -EIO;
+
+ return 0;
+}
+
+static int __init ibnbd_srv_init_module(void)
+{
+ int err;
+
+ INFO_NP("Loading module ibnbd_server, version: %s (dev_search_path: "
+ "'%s', def_io_mode: '%s')\n", __stringify(IBNBD_VER),
+ dev_search_path, ibnbd_io_mode_str(def_io_mode));
+
+ ibnbd_srv_ops.owner = THIS_MODULE;
+ ibnbd_srv_ops.recv = ibnbd_srv_recv;
+ ibnbd_srv_ops.rdma_ev = ibnbd_srv_rdma_ev;
+ ibnbd_srv_ops.sess_ev = ibnbd_srv_sess_ev;
+
+ err = ibtrs_srv_register(&ibnbd_srv_ops);
+ if (err) {
+ ERR_NP("Failed to load module, IBTRS registration failed,"
+ " errno: %d\n", err);
+ goto out;
+ }
+
+ err = ibnbd_dev_init();
+ if (err) {
+ ERR_NP("Failed to load module, init device resources failed,"
+ " errno: %d\n", err);
+ goto unreg;
+ }
+
+ err = ibnbd_srv_create_sysfs_files();
+ if (err) {
+ ERR_NP("Failed to load module, create sysfs files failed,"
+ " errno: %d\n", err);
+ goto dev_destroy;
+ }
+
+ return 0;
+
+dev_destroy:
+ ibnbd_dev_destroy();
+unreg:
+ ibtrs_srv_unregister(&ibnbd_srv_ops);
+out:
+ return err;
+}
+
+static void __exit ibnbd_srv_cleanup_module(void)
+{
+ INFO_NP("Unloading module\n");
+ ibtrs_srv_unregister(&ibnbd_srv_ops);
+ WARN_ON(!list_empty(&sess_list));
+ ibnbd_srv_destroy_sysfs_files();
+ ibnbd_dev_destroy();
+ INFO_NP("Module unloaded\n");
+}
+
+module_init(ibnbd_srv_init_module);
+module_exit(ibnbd_srv_cleanup_module);
--
2.7.4
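
The module init function in the patch above tears down its setup steps with goto labels, in reverse order of setup. A minimal user-space sketch of that pattern follows. It assumes three hypothetical stages; the stage_up()/stage_down() helpers, the demo_* names and the fail_at knob are illustrative only and are not IBNBD or kernel APIs.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the three setup stages of the module. */
enum stage { STAGE_TRANSPORT, STAGE_DEV, STAGE_SYSFS, STAGE_COUNT };

static int active[STAGE_COUNT];	/* 1 while a stage is set up */

/* Bring a stage up; fail on purpose when it equals fail_at. */
static int stage_up(enum stage s, int fail_at)
{
	if ((int)s == fail_at)
		return -1;
	active[s] = 1;
	return 0;
}

/* Tear a stage down; it must have been set up before. */
static void stage_down(enum stage s)
{
	assert(active[s]);
	active[s] = 0;
}

static int demo_active_count(void)
{
	int i, n = 0;

	for (i = 0; i < STAGE_COUNT; i++)
		n += active[i];
	return n;
}

/* Each failure jumps to the label that undoes only what already succeeded. */
static int demo_init(int fail_at)
{
	int err;

	err = stage_up(STAGE_TRANSPORT, fail_at);
	if (err)
		goto out;
	err = stage_up(STAGE_DEV, fail_at);
	if (err)
		goto unreg;
	err = stage_up(STAGE_SYSFS, fail_at);
	if (err)
		goto dev_destroy;
	return 0;

dev_destroy:
	stage_down(STAGE_DEV);
unreg:
	stage_down(STAGE_TRANSPORT);
out:
	return err;
}

/* Full teardown in reverse order of setup. */
static void demo_exit(void)
{
	stage_down(STAGE_SYSFS);
	stage_down(STAGE_DEV);
	stage_down(STAGE_TRANSPORT);
}
```

Calling demo_init(STAGE_SYSFS) in this sketch leaves no stage active, which is the property the labels dev_destroy, unreg and out are meant to provide in the patch.
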
* [PATCH 23/28] ibnbd_srv: add abstraction for submit IO to file or block device
From: Jack Wang @ 2017-03-24 10:45 UTC (permalink / raw)
To: linux-block, linux-rdma
Cc: dledford, axboe, hch, mail, Milind.dumbare, yun.wang, Jack Wang,
Kleber Souza, Danil Kipnis, Roman Pen
In-Reply-To: <1490352343-20075-1-git-send-email-jinpu.wangl@profitbricks.com>
From: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Kleber Souza <kleber.souza@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
---
drivers/block/ibnbd_server/ibnbd_dev.c | 436 +++++++++++++++++++++++++++++++++
drivers/block/ibnbd_server/ibnbd_dev.h | 149 +++++++++++
2 files changed, 585 insertions(+)
create mode 100644 drivers/block/ibnbd_server/ibnbd_dev.c
create mode 100644 drivers/block/ibnbd_server/ibnbd_dev.h
diff --git a/drivers/block/ibnbd_server/ibnbd_dev.c b/drivers/block/ibnbd_server/ibnbd_dev.c
new file mode 100644
index 0000000..5f6b453
--- /dev/null
+++ b/drivers/block/ibnbd_server/ibnbd_dev.c
@@ -0,0 +1,436 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ * Jack Wang <jinpu.wang@profitbricks.com>
+ * Kleber Souza <kleber.souza@profitbricks.com>
+ * Danil Kipnis <danil.kipnis@profitbricks.com>
+ * Roman Pen <roman.penyaev@profitbricks.com>
+ * Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions, and the following disclaimer,
+ * without modification.
+ * 2. Redistributions in binary form must reproduce at minimum a disclaimer
+ * substantially similar to the "NO WARRANTY" disclaimer below
+ * ("Disclaimer") and any redistribution must be conditioned upon
+ * including a substantially similar Disclaimer requirement for further
+ * binary redistribution.
+ * 3. Neither the names of the above-listed copyright holders nor the names
+ * of any contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * NO WARRANTY
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+ * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGES.
+ *
+ */
+
+#include "ibnbd_dev.h"
+#include "ibnbd_srv_log.h"
+
+#define IBNBD_DEV_MAX_FILEIO_ACTIVE_WORKERS 0
+
+struct ibnbd_dev_file_io_work {
+ struct ibnbd_dev *dev;
+ void *priv;
+
+ sector_t sector;
+ void *data;
+ size_t len;
+ size_t bi_size;
+ enum ibnbd_io_flags flags;
+
+ struct work_struct work;
+};
+
+struct ibnbd_dev_blk_io {
+ struct ibnbd_dev *dev;
+ void *priv;
+};
+
+static struct workqueue_struct *fileio_wq;
+
+int ibnbd_dev_init(void)
+{
+ fileio_wq = alloc_workqueue("%s", WQ_UNBOUND,
+ IBNBD_DEV_MAX_FILEIO_ACTIVE_WORKERS,
+ "ibnbd_server_fileio_wq");
+ if (!fileio_wq)
+ return -ENOMEM;
+
+ return 0;
+}
+
+void ibnbd_dev_destroy(void)
+{
+ destroy_workqueue(fileio_wq);
+}
+
+static inline struct block_device *ibnbd_dev_open_bdev(const char *path,
+ fmode_t flags)
+{
+ return blkdev_get_by_path(path, flags, THIS_MODULE);
+}
+
+static int ibnbd_dev_blk_open(struct ibnbd_dev *dev, const char *path,
+ fmode_t flags)
+{
+ dev->bdev = ibnbd_dev_open_bdev(path, flags);
+ return PTR_ERR_OR_ZERO(dev->bdev);
+}
+
+static int ibnbd_dev_vfs_open(struct ibnbd_dev *dev, const char *path,
+ fmode_t flags)
+{
+ int oflags = O_DSYNC; /* enable write-through */
+
+ if (flags & FMODE_WRITE)
+ oflags |= O_RDWR;
+ else if (flags & FMODE_READ)
+ oflags |= O_RDONLY;
+ else
+ return -EINVAL;
+
+ dev->file = filp_open(path, oflags, 0);
+ return PTR_ERR_OR_ZERO(dev->file);
+}
+
+struct ibnbd_dev *ibnbd_dev_open(const char *path, fmode_t flags,
+ enum ibnbd_io_mode mode, struct bio_set *bs,
+ ibnbd_dev_io_fn io_cb)
+{
+ struct ibnbd_dev *dev;
+ int ret;
+
+ dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+ if (!dev)
+ return ERR_PTR(-ENOMEM);
+
+ if (mode == IBNBD_BLOCKIO) {
+ dev->blk_open_flags = flags;
+ ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
+ if (ret)
+ goto err;
+ } else if (mode == IBNBD_FILEIO) {
+ dev->blk_open_flags = FMODE_READ;
+ ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
+ if (ret)
+ goto err;
+
+ ret = ibnbd_dev_vfs_open(dev, path, flags);
+ if (ret)
+ goto blk_put;
+ }
+
+ /* keep dev->blk_open_flags as passed to blkdev_get, blkdev_put needs it */
+ dev->mode = mode;
+ dev->io_cb = io_cb;
+ bdevname(dev->bdev, dev->name);
+ dev->ibd_bio_set = bs;
+
+ return dev;
+
+blk_put:
+ blkdev_put(dev->bdev, dev->blk_open_flags);
+err:
+ kfree(dev);
+ return ERR_PTR(ret);
+}
+
+void ibnbd_dev_close(struct ibnbd_dev *dev)
+{
+ flush_workqueue(fileio_wq);
+ blkdev_put(dev->bdev, dev->blk_open_flags);
+ if (dev->mode == IBNBD_FILEIO)
+ filp_close(dev->file, NULL);
+ kfree(dev);
+}
+
+static void ibnbd_dev_bi_end_io(struct bio *bio)
+{
+ struct ibnbd_dev_blk_io *io = bio->bi_private;
+
+ int error = bio->bi_error;
+
+ io->dev->io_cb(io->priv, error);
+
+ bio_put(bio);
+ kfree(io);
+}
+
+static void bio_map_kern_endio(struct bio *bio)
+{
+ bio_put(bio);
+}
+
+/**
+ * ibnbd_bio_map_kern - map kernel address into bio
+ * @q: the struct request_queue for the bio
+ * @data: pointer to buffer to map
+ * @bs: bio_set to use.
+ * @len: length in bytes
+ * @gfp_mask: allocation flags for bio allocation
+ *
+ * Map the kernel address into a bio suitable for io to a block
+ * device. Returns an error pointer in case of error.
+ */
+static struct bio *ibnbd_bio_map_kern(struct request_queue *q, void *data,
+ struct bio_set *bs,
+ unsigned int len, gfp_t gfp_mask)
+{
+ unsigned long kaddr = (unsigned long)data;
+ unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ unsigned long start = kaddr >> PAGE_SHIFT;
+ const int nr_pages = end - start;
+ int offset, i;
+ struct bio *bio;
+
+ bio = bio_alloc_bioset(gfp_mask, nr_pages, bs);
+ if (!bio)
+ return ERR_PTR(-ENOMEM);
+
+ offset = offset_in_page(kaddr);
+ for (i = 0; i < nr_pages; i++) {
+ unsigned int bytes = PAGE_SIZE - offset;
+
+ if (len <= 0)
+ break;
+
+ if (bytes > len)
+ bytes = len;
+
+ if (bio_add_pc_page(q, bio, virt_to_page(data), bytes,
+ offset) < bytes) {
+ /* we don't support partial mappings */
+ bio_put(bio);
+ return ERR_PTR(-EINVAL);
+ }
+
+ data += bytes;
+ len -= bytes;
+ offset = 0;
+ }
+
+ bio->bi_end_io = bio_map_kern_endio;
+ return bio;
+}
+
+static int ibnbd_dev_blk_submit_io(struct ibnbd_dev *dev, sector_t sector,
+ void *data, size_t len, u32 bi_size,
+ enum ibnbd_io_flags flags, void *priv)
+{
+ struct request_queue *q = bdev_get_queue(dev->bdev);
+ struct ibnbd_dev_blk_io *io;
+ struct bio *bio;
+
+ /* check if the buffer is suitable for bdev */
+ if (unlikely(WARN_ON(!blk_rq_aligned(q, (unsigned long)data, len))))
+ return -EINVAL;
+
+ /* Generate bio with pages pointing to the rdma buffer */
+ bio = ibnbd_bio_map_kern(q, data, dev->ibd_bio_set, len, GFP_KERNEL);
+ if (unlikely(IS_ERR(bio)))
+ return PTR_ERR(bio);
+
+ io = kmalloc(sizeof(*io), GFP_KERNEL);
+ if (unlikely(!io)) {
+ bio_put(bio);
+ return -ENOMEM;
+ }
+
+ io->dev = dev;
+ io->priv = priv;
+
+ bio->bi_end_io = ibnbd_dev_bi_end_io;
+ bio->bi_bdev = dev->bdev;
+ bio->bi_private = io;
+ bio->bi_opf = ibnbd_io_flags_to_bi_rw(flags);
+ bio->bi_iter.bi_sector = sector;
+ bio->bi_iter.bi_size = bi_size;
+
+ submit_bio(bio);
+
+ return 0;
+}
+
+static int ibnbd_dev_file_handle_flush(struct ibnbd_dev_file_io_work *w,
+ loff_t start)
+{
+ int ret;
+ loff_t end;
+ int len = w->bi_size;
+
+ if (len)
+ end = start + len - 1;
+ else
+ end = LLONG_MAX;
+
+ ret = vfs_fsync_range(w->dev->file, start, end, 1);
+ if (unlikely(ret))
+ INFO_NP_RL("I/O FLUSH failed on %s, vfs_sync errno: %d\n",
+ w->dev->name, ret);
+ return ret;
+}
+
+static int ibnbd_dev_file_handle_fua(struct ibnbd_dev_file_io_work *w,
+ loff_t start)
+{
+ int ret;
+ loff_t end;
+ int len = w->bi_size;
+
+ if (len)
+ end = start + len - 1;
+ else
+ end = LLONG_MAX;
+
+ ret = vfs_fsync_range(w->dev->file, start, end, 1);
+ if (unlikely(ret))
+ INFO_NP_RL("I/O FUA failed on %s, vfs_sync errno: %d\n",
+ w->dev->name, ret);
+ return ret;
+}
+
+static int ibnbd_dev_file_handle_write_same(struct ibnbd_dev_file_io_work *w)
+{
+ int i;
+
+ if (unlikely(WARN_ON(w->bi_size % w->len)))
+ return -EINVAL;
+
+ for (i = 1; i < w->bi_size / w->len; i++)
+ memcpy(w->data + i * w->len, w->data, w->len);
+
+ return 0;
+}
+
+static void ibnbd_dev_file_submit_io_worker(struct work_struct *w)
+{
+ struct ibnbd_dev_file_io_work *dev_work;
+ loff_t off;
+ int ret = 0;
+ int len;
+ struct file *f;
+
+ dev_work = container_of(w, struct ibnbd_dev_file_io_work, work);
+ off = dev_work->sector * ibnbd_dev_get_logical_bsize(dev_work->dev);
+ f = dev_work->dev->file;
+ len = dev_work->bi_size;
+
+ if (dev_work->flags & IBNBD_RW_REQ_FLUSH) {
+ ret = ibnbd_dev_file_handle_flush(dev_work, off);
+ if (unlikely(ret))
+ goto out;
+ }
+
+ if (dev_work->flags & IBNBD_RW_REQ_WRITE_SAME) {
+ ret = ibnbd_dev_file_handle_write_same(dev_work);
+ if (unlikely(ret))
+ goto out;
+ }
+
+ /* TODO Implement support for DIRECT */
+ if (dev_work->bi_size) {
+ if (dev_work->flags & IBNBD_RW_REQ_WRITE)
+ ret = kernel_write(f, dev_work->data, dev_work->bi_size,
+ off);
+ else
+ ret = kernel_read(f, off, dev_work->data,
+ dev_work->bi_size);
+
+ if (unlikely(ret < 0)) {
+ goto out;
+ } else if (unlikely(ret != dev_work->bi_size)) {
+ /* TODO implement support for partial completions */
+ ret = -EIO;
+ goto out;
+ } else {
+ ret = 0;
+ }
+ }
+
+ if (dev_work->flags & IBNBD_RW_REQ_FUA)
+ ret = ibnbd_dev_file_handle_fua(dev_work, off);
+out:
+ dev_work->dev->io_cb(dev_work->priv, ret);
+ kfree(dev_work);
+}
+
+static inline bool ibnbd_dev_file_io_flags_supported(enum ibnbd_io_flags flags)
+{
+ flags &= ~IBNBD_RW_REQ_WRITE;
+ flags &= ~IBNBD_RW_REQ_SYNC;
+ flags &= ~IBNBD_RW_REQ_FUA;
+ flags &= ~IBNBD_RW_REQ_FLUSH;
+ flags &= ~IBNBD_RW_REQ_WRITE_SAME;
+
+ return (!flags);
+}
+
+static int ibnbd_dev_file_submit_io(struct ibnbd_dev *dev, sector_t sector,
+ void *data, size_t len, size_t bi_size,
+ enum ibnbd_io_flags flags, void *priv)
+{
+ struct ibnbd_dev_file_io_work *w;
+
+ if (!ibnbd_dev_file_io_flags_supported(flags)) {
+ INFO_NP_RL("Unsupported I/O flags: 0x%x on device %s\n", flags,
+ dev->name);
+ return -ENOTSUPP;
+ }
+
+ w = kmalloc(sizeof(*w), GFP_KERNEL);
+ if (!w)
+ return -ENOMEM;
+
+ w->dev = dev;
+ w->priv = priv;
+ w->sector = sector;
+ w->data = data;
+ w->len = len;
+ w->bi_size = bi_size;
+ w->flags = flags;
+ INIT_WORK(&w->work, ibnbd_dev_file_submit_io_worker);
+
+ if (unlikely(!queue_work(fileio_wq, &w->work))) {
+ kfree(w);
+ return -EEXIST;
+ }
+
+ return 0;
+}
+
+int ibnbd_dev_submit_io(struct ibnbd_dev *dev, sector_t sector, void *data,
+ size_t len, u32 bi_size, enum ibnbd_io_flags flags,
+ void *priv)
+{
+ if (dev->mode == IBNBD_FILEIO)
+ return ibnbd_dev_file_submit_io(dev, sector, data, len, bi_size,
+ flags, priv);
+ else if (dev->mode == IBNBD_BLOCKIO)
+ return ibnbd_dev_blk_submit_io(dev, sector, data, len, bi_size,
+ flags, priv);
+
+ WRN_NP("Submitting I/O to %s failed, dev->mode contains invalid "
+ "value: '%d', memory corrupted?\n", dev->name, dev->mode);
+ return -EINVAL;
+}
diff --git a/drivers/block/ibnbd_server/ibnbd_dev.h b/drivers/block/ibnbd_server/ibnbd_dev.h
new file mode 100644
index 0000000..7c73d64
--- /dev/null
+++ b/drivers/block/ibnbd_server/ibnbd_dev.h
@@ -0,0 +1,149 @@
+#ifndef _IBNBD_DEV_H
+#define _IBNBD_DEV_H
+
+#include <linux/fs.h>
+#include "../ibnbd_inc/ibnbd-proto.h"
+
+typedef void ibnbd_dev_io_fn(void *priv, int error);
+
+struct ibnbd_dev {
+ struct block_device *bdev;
+ struct bio_set *ibd_bio_set;
+ struct file *file;
+ fmode_t blk_open_flags;
+ enum ibnbd_io_mode mode;
+ char name[BDEVNAME_SIZE];
+ ibnbd_dev_io_fn *io_cb;
+};
+
+
+/** ibnbd_dev_init() - Initialize ibnbd_dev
+ *
+ * This function initializes the ibnbd-dev component.
+ * It has to be called once before ibnbd_dev_open() is used.
+ */
+int ibnbd_dev_init(void);
+
+/** ibnbd_dev_destroy() - Destroy ibnbd_dev
+ *
+ * This function destroys the ibnbd-dev component.
+ * It has to be called after the last device has been closed.
+ */
+void ibnbd_dev_destroy(void);
+
+/**
+ * ibnbd_dev_open() - Open a device
+ * @flags: open flags
+ * @mode: open via VFS or block layer
+ * @bs: bio_set to use during block I/O
+ * @io_cb: callback invoked when an I/O has finished
+ */
+struct ibnbd_dev *ibnbd_dev_open(const char *path, fmode_t flags,
+ enum ibnbd_io_mode mode, struct bio_set *bs,
+ ibnbd_dev_io_fn io_cb);
+
+/**
+ * ibnbd_dev_close() - Close a device
+ */
+void ibnbd_dev_close(struct ibnbd_dev *dev);
+
+static inline size_t ibnbd_dev_get_capacity(const struct ibnbd_dev *dev)
+{
+ return get_capacity(dev->bdev->bd_disk);
+}
+
+static inline int ibnbd_dev_get_logical_bsize(const struct ibnbd_dev *dev)
+{
+ return bdev_logical_block_size(dev->bdev);
+}
+
+static inline int ibnbd_dev_get_phys_bsize(const struct ibnbd_dev *dev)
+{
+ return bdev_physical_block_size(dev->bdev);
+}
+
+static inline int ibnbd_dev_get_max_segs(const struct ibnbd_dev *dev)
+{
+ return queue_max_segments(bdev_get_queue(dev->bdev));
+}
+
+static inline int ibnbd_dev_get_max_hw_sects(const struct ibnbd_dev *dev)
+{
+ return queue_max_hw_sectors(bdev_get_queue(dev->bdev));
+}
+
+static inline int
+ibnbd_dev_get_max_write_same_sects(const struct ibnbd_dev *dev)
+{
+ return bdev_write_same(dev->bdev);
+}
+
+static inline int ibnbd_dev_get_secure_discard(const struct ibnbd_dev *dev)
+{
+ if (dev->mode == IBNBD_BLOCKIO)
+ return blk_queue_secure_erase(bdev_get_queue(dev->bdev));
+ return 0;
+}
+
+static inline int ibnbd_dev_get_max_discard_sects(const struct ibnbd_dev *dev)
+{
+ if (!blk_queue_discard(bdev_get_queue(dev->bdev)))
+ return 0;
+
+ if (dev->mode == IBNBD_BLOCKIO)
+ return blk_queue_get_max_sectors(bdev_get_queue(dev->bdev),
+ REQ_OP_DISCARD);
+ return 0;
+}
+
+static inline int ibnbd_dev_get_discard_zeroes_data(const struct ibnbd_dev *dev)
+{
+ if (dev->mode == IBNBD_BLOCKIO)
+ return bdev_get_queue(dev->bdev)->limits.discard_zeroes_data;
+ return 0;
+}
+
+static inline int ibnbd_dev_get_discard_granularity(const struct ibnbd_dev *dev)
+{
+ if (dev->mode == IBNBD_BLOCKIO)
+ return bdev_get_queue(dev->bdev)->limits.discard_granularity;
+ return 0;
+}
+
+static inline int ibnbd_dev_get_discard_alignment(const struct ibnbd_dev *dev)
+{
+ if (dev->mode == IBNBD_BLOCKIO)
+ return bdev_get_queue(dev->bdev)->limits.discard_alignment;
+ return 0;
+}
+
+
+/**
+ * ibnbd_dev_get_name() - Return the device name
+ * returns: Device name, up to %BDEVNAME_SIZE bytes long
+ */
+static inline const char *ibnbd_dev_get_name(const struct ibnbd_dev *dev)
+{
+ return dev->name;
+}
+
+static inline struct block_device *
+ibnbd_dev_get_bdev(const struct ibnbd_dev *dev)
+{
+ return dev->bdev;
+}
+
+
+/**
+ * ibnbd_dev_submit_io() - Submit an I/O to the disk
+ * @dev: device to which the I/O is submitted
+ * @sector: address to read/write data to
+ * @data: I/O data to write or buffer to read I/O data into
+ * @len: length of @data
+ * @bi_size: Amount of data that will be read/written
+ * @priv: private data passed to the io_cb callback of @dev
+ */
+int ibnbd_dev_submit_io(struct ibnbd_dev *dev, sector_t sector, void *data,
+ size_t len, u32 bi_size, enum ibnbd_io_flags flags,
+ void *priv);
+#endif
--
2.7.4
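
ibnbd_bio_map_kern() in the patch above first counts the pages a kernel buffer spans and then adds one segment per page, where only the first segment may start at a non-zero page offset. The arithmetic can be checked in user space with the sketch below. It assumes a 4 KiB page; DEMO_PAGE_SHIFT, demo_nr_pages() and demo_split() are illustrative names and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define DEMO_PAGE_SIZE	((uintptr_t)1 << DEMO_PAGE_SHIFT)

/* Number of pages touched by the byte range [addr, addr + len). */
static size_t demo_nr_pages(uintptr_t addr, size_t len)
{
	uintptr_t end = (addr + len + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
	uintptr_t start = addr >> DEMO_PAGE_SHIFT;

	return (size_t)(end - start);
}

/*
 * Split the range into per-page byte counts, the way the mapping loop
 * walks the buffer. Returns the number of chunks written to chunks[].
 */
static size_t demo_split(uintptr_t addr, size_t len, size_t *chunks,
			 size_t max_chunks)
{
	size_t offset = (size_t)(addr & (DEMO_PAGE_SIZE - 1));
	size_t n = 0;

	assert(chunks != NULL);

	while (len > 0 && n < max_chunks) {
		size_t bytes = (size_t)DEMO_PAGE_SIZE - offset;

		if (bytes > len)
			bytes = len;
		chunks[n++] = bytes;
		len -= bytes;
		offset = 0;	/* every later chunk starts on a page boundary */
	}
	return n;
}
```

For example, a 4096-byte buffer that starts 2048 bytes into a page spans two pages and is split into two 2048-byte chunks.
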
* [PATCH 24/28] ibnbd_srv: add log helpers
From: Jack Wang @ 2017-03-24 10:45 UTC (permalink / raw)
To: linux-block, linux-rdma
Cc: dledford, axboe, hch, mail, Milind.dumbare, yun.wang, Jack Wang,
Kleber Souza, Danil Kipnis
In-Reply-To: <1490352343-20075-1-git-send-email-jinpu.wangl@profitbricks.com>
From: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Kleber Souza <kleber.souza@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
---
drivers/block/ibnbd_server/ibnbd_srv_log.h | 69 ++++++++++++++++++++++++++++++
1 file changed, 69 insertions(+)
create mode 100644 drivers/block/ibnbd_server/ibnbd_srv_log.h
diff --git a/drivers/block/ibnbd_server/ibnbd_srv_log.h b/drivers/block/ibnbd_server/ibnbd_srv_log.h
new file mode 100644
index 0000000..9217804
--- /dev/null
+++ b/drivers/block/ibnbd_server/ibnbd_srv_log.h
@@ -0,0 +1,69 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ * Jack Wang <jinpu.wang@profitbricks.com>
+ * Kleber Souza <kleber.souza@profitbricks.com>
+ * Danil Kipnis <danil.kipnis@profitbricks.com>
+ * Roman Pen <roman.penyaev@profitbricks.com>
+ * Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions, and the following disclaimer,
+ * without modification.
+ * 2. Redistributions in binary form must reproduce at minimum a disclaimer
+ * substantially similar to the "NO WARRANTY" disclaimer below
+ * ("Disclaimer") and any redistribution must be conditioned upon
+ * including a substantially similar Disclaimer requirement for further
+ * binary redistribution.
+ * 3. Neither the names of the above-listed copyright holders nor the names
+ * of any contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * NO WARRANTY
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+ * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGES.
+ *
+ */
+
+#ifndef __IBNBD_SRV_LOG_H__
+#define __IBNBD_SRV_LOG_H__
+
+#include "../ibnbd_inc/log.h"
+
+#define ERR(dev, fmt, ...) pr_err("ibnbd L%d <%s@%s> ERR: " fmt, \
+ __LINE__, dev->pathname, ibnbd_prefix(dev),\
+ ##__VA_ARGS__)
+#define ERR_RL(dev, fmt, ...) pr_err_ratelimited("ibnbd L%d <%s@%s> ERR: " fmt,\
+ __LINE__, dev->pathname, ibnbd_prefix(dev),\
+ ##__VA_ARGS__)
+#define WRN(dev, fmt, ...) pr_warn("ibnbd L%d <%s@%s> WARN: " fmt,\
+ __LINE__, dev->pathname, ibnbd_prefix(dev),\
+ ##__VA_ARGS__)
+#define WRN_RL(dev, fmt, ...) pr_warn_ratelimited("ibnbd L%d <%s@%s> WARN: " \
+ fmt, __LINE__, dev->pathname, ibnbd_prefix(dev),\
+ ##__VA_ARGS__)
+#define INFO(dev, fmt, ...) pr_info("ibnbd <%s@%s>: " \
+ fmt, dev->pathname, ibnbd_prefix(dev), ##__VA_ARGS__)
+#define INFO_RL(dev, fmt, ...) pr_info_ratelimited("ibnbd <%s@%s>: " \
+ fmt, dev->pathname, ibnbd_prefix(dev), ##__VA_ARGS__)
+
+#endif /*__IBNBD_SRV_LOG_H__*/
--
2.7.4
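
The log helpers in the patch above paste a fixed prefix onto the caller's format string and forward the remaining arguments. The sketch below shows the same macro shape in user space, writing into a buffer with snprintf() instead of calling pr_err(). It relies on the GNU `##__VA_ARGS__` extension, as the patch does. struct demo_dev, its client field (standing in for whatever ibnbd_prefix() returns) and the DEMO_ERR and demo_log_open_failure names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical device descriptor; only the two strings used in the prefix. */
struct demo_dev {
	const char *pathname;
	const char *client;
};

/*
 * String literal concatenation joins the prefix and the caller's fmt at
 * compile time; __LINE__ expands at the place where the macro is used.
 * The macro arguments are parenthesized so expressions are safe to pass.
 */
#define DEMO_ERR(buf, size, dev, fmt, ...)				\
	snprintf((buf), (size), "ibnbd L%d <%s@%s> ERR: " fmt,		\
		 __LINE__, (dev)->pathname, (dev)->client, ##__VA_ARGS__)

/* Format one sample message; returns 0 if it fit into buf, -1 otherwise. */
static int demo_log_open_failure(char *buf, size_t size,
				 const struct demo_dev *dev, int err)
{
	int n;

	assert(buf != NULL && size > 0);
	n = DEMO_ERR(buf, size, dev, "open failed, errno: %d\n", err);
	return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

A caller can inspect the result with strstr() or strncmp() from string.h, for instance to check that the message starts with "ibnbd L".
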
* [PATCH 25/28] ibnbd_srv: add sysfs interface
From: Jack Wang @ 2017-03-24 10:45 UTC (permalink / raw)
To: linux-block, linux-rdma
Cc: dledford, axboe, hch, mail, Milind.dumbare, yun.wang, Jack Wang,
Kleber Souza, Danil Kipnis, Roman Pen
In-Reply-To: <1490352343-20075-1-git-send-email-jinpu.wangl@profitbricks.com>
From: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Kleber Souza <kleber.souza@profitbricks.com>
Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
---
drivers/block/ibnbd_server/ibnbd_srv_sysfs.c | 317 +++++++++++++++++++++++++++
drivers/block/ibnbd_server/ibnbd_srv_sysfs.h | 64 ++++++
2 files changed, 381 insertions(+)
create mode 100644 drivers/block/ibnbd_server/ibnbd_srv_sysfs.c
create mode 100644 drivers/block/ibnbd_server/ibnbd_srv_sysfs.h
diff --git a/drivers/block/ibnbd_server/ibnbd_srv_sysfs.c b/drivers/block/ibnbd_server/ibnbd_srv_sysfs.c
new file mode 100644
index 0000000..8774abe
--- /dev/null
+++ b/drivers/block/ibnbd_server/ibnbd_srv_sysfs.c
@@ -0,0 +1,317 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler <mail@fholler.de>
+ * Jack Wang <jinpu.wang@profitbricks.com>
+ * Kleber Souza <kleber.souza@profitbricks.com>
+ * Danil Kipnis <danil.kipnis@profitbricks.com>
+ * Roman Pen <roman.penyaev@profitbricks.com>
+ * Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions, and the following disclaimer,
+ * without modification.
+ * 2. Redistributions in binary form must reproduce at minimum a disclaimer
+ * substantially similar to the "NO WARRANTY" disclaimer below
+ * ("Disclaimer") and any redistribution must be conditioned upon
+ * including a substantially similar Disclaimer requirement for further
+ * binary redistribution.
+ * 3. Neither the names of the above-listed copyright holders nor the names
+ * of any contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * NO WARRANTY
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+ * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGES.
+ *
+ */
+
+#include <uapi/linux/limits.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/stat.h>
+#include <linux/genhd.h>
+#include <linux/list.h>
+
+#include "../ibnbd_inc/ibnbd.h"
+#include "ibnbd_srv.h"
+#include "ibnbd_srv_log.h"
+#include "ibnbd_srv_sysfs.h"
+
+static struct kobject *ibnbd_srv_kobj;
+static struct kobject *ibnbd_srv_devices_kobj;
+#define IBNBD_SYSFS_DIR "ibnbd"
+static char ibnbd_sysfs_dir[64] = IBNBD_SYSFS_DIR;
+
+static ssize_t ibnbd_srv_revalidate_dev_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *page)
+{
+ return scnprintf(page, PAGE_SIZE,
+ "Usage: echo 1 > %s\n", attr->attr.name);
+}
+
+static ssize_t ibnbd_srv_revalidate_dev_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int ret;
+ struct ibnbd_srv_dev *dev = container_of(kobj, struct ibnbd_srv_dev,
+ dev_kobj);
+
+ if (!sysfs_streq(buf, "1")) {
+ ERR_NP("%s: invalid value: '%s'\n", attr->attr.name, buf);
+ return -EINVAL;
+ }
+ ret = ibnbd_srv_revalidate_dev(dev);
+ if (ret)
+ return ret;
+
+ return count;
+}
+
+static struct kobj_attribute ibnbd_srv_revalidate_dev_attr =
+ __ATTR(revalidate,
+ 0644,
+ ibnbd_srv_revalidate_dev_show,
+ ibnbd_srv_revalidate_dev_store);
+
+static struct attribute *ibnbd_srv_default_dev_attrs[] = {
+ &ibnbd_srv_revalidate_dev_attr.attr,
+ NULL,
+};
+
+static struct attribute_group ibnbd_srv_default_dev_attr_group = {
+ .attrs = ibnbd_srv_default_dev_attrs,
+};
+
+static ssize_t ibnbd_srv_attr_show(struct kobject *kobj, struct attribute *attr,
+ char *page)
+{
+ struct kobj_attribute *kattr;
+ int ret = -EIO;
+
+ kattr = container_of(attr, struct kobj_attribute, attr);
+ if (kattr->show)
+ ret = kattr->show(kobj, kattr, page);
+ return ret;
+}
+
+static ssize_t ibnbd_srv_attr_store(struct kobject *kobj,
+ struct attribute *attr,
+ const char *page, size_t length)
+{
+ struct kobj_attribute *kattr;
+ int ret = -EIO;
+
+ kattr = container_of(attr, struct kobj_attribute, attr);
+ if (kattr->store)
+ ret = kattr->store(kobj, kattr, page, length);
+ return ret;
+}
+
+static const struct sysfs_ops ibnbd_srv_sysfs_ops = {
+ .show = ibnbd_srv_attr_show,
+ .store = ibnbd_srv_attr_store,
+};
+
+static struct kobj_type ibnbd_srv_dev_ktype = {
+ .sysfs_ops = &ibnbd_srv_sysfs_ops,
+};
+
+static struct kobj_type ibnbd_srv_dev_clients_ktype = {
+ .sysfs_ops = &ibnbd_srv_sysfs_ops,
+};
+
+int ibnbd_srv_create_dev_sysfs(struct ibnbd_srv_dev *dev,
+ struct block_device *bdev,
+ const char *dir_name)
+{
+ struct kobject *bdev_kobj;
+ int ret;
+
+ ret = kobject_init_and_add(&dev->dev_kobj, &ibnbd_srv_dev_ktype,
+ ibnbd_srv_devices_kobj, dir_name);
+ if (ret)
+ return ret;
+
+ ret = kobject_init_and_add(&dev->dev_clients_kobj,
+ &ibnbd_srv_dev_clients_ktype,
+ &dev->dev_kobj, "clients");
+ if (ret)
+ goto err;
+
+ ret = sysfs_create_group(&dev->dev_kobj,
+ &ibnbd_srv_default_dev_attr_group);
+ if (ret)
+ goto err2;
+
+ bdev_kobj = &disk_to_dev(bdev->bd_disk)->kobj;
+ ret = sysfs_create_link(&dev->dev_kobj, bdev_kobj, "block_dev");
+ if (ret)
+ goto err3;
+
+ return 0;
+
+err3:
+ sysfs_remove_group(&dev->dev_kobj,
+ &ibnbd_srv_default_dev_attr_group);
+err2:
+ kobject_del(&dev->dev_clients_kobj);
+ kobject_put(&dev->dev_clients_kobj);
+err:
+ kobject_del(&dev->dev_kobj);
+ kobject_put(&dev->dev_kobj);
+ return ret;
+}
+
+void ibnbd_srv_destroy_dev_sysfs(struct ibnbd_srv_dev *dev)
+{
+ sysfs_remove_link(&dev->dev_kobj, "block_dev");
+ sysfs_remove_group(&dev->dev_kobj, &ibnbd_srv_default_dev_attr_group);
+ kobject_del(&dev->dev_clients_kobj);
+ kobject_put(&dev->dev_clients_kobj);
+ kobject_del(&dev->dev_kobj);
+ kobject_put(&dev->dev_kobj);
+}
+
+static ssize_t ibnbd_srv_dev_client_ro_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *page)
+{
+ struct ibnbd_srv_sess_dev *sess_dev;
+
+ sess_dev = container_of(kobj, struct ibnbd_srv_sess_dev, kobj);
+
+ return scnprintf(page, PAGE_SIZE, "%s\n",
+ (sess_dev->open_flags & FMODE_WRITE) ? "0" : "1");
+}
+
+static struct kobj_attribute ibnbd_srv_dev_client_ro_attr =
+ __ATTR(read_only, 0444,
+ ibnbd_srv_dev_client_ro_show,
+ NULL);
+
+static ssize_t ibnbd_srv_dev_client_mapping_path_show(
+ struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *page)
+{
+ struct ibnbd_srv_sess_dev *sess_dev;
+
+ sess_dev = container_of(kobj, struct ibnbd_srv_sess_dev, kobj);
+
+ return scnprintf(page, PAGE_SIZE, "%s\n", sess_dev->pathname);
+}
+
+static struct kobj_attribute ibnbd_srv_dev_client_mapping_path_attr =
+ __ATTR(mapping_path, 0444,
+ ibnbd_srv_dev_client_mapping_path_show,
+ NULL);
+
+static struct attribute *ibnbd_srv_default_dev_clients_attrs[] = {
+ &ibnbd_srv_dev_client_ro_attr.attr,
+ &ibnbd_srv_dev_client_mapping_path_attr.attr,
+ NULL,
+};
+
+static struct attribute_group ibnbd_srv_default_dev_client_attr_group = {
+ .attrs = ibnbd_srv_default_dev_clients_attrs,
+};
+
+void ibnbd_srv_destroy_dev_client_sysfs(struct ibnbd_srv_sess_dev *sess_dev)
+{
+ struct completion sysfs_compl;
+
+ sysfs_remove_group(&sess_dev->kobj,
+ &ibnbd_srv_default_dev_client_attr_group);
+
+ init_completion(&sysfs_compl);
+ sess_dev->sysfs_release_compl = &sysfs_compl;
+ kobject_del(&sess_dev->kobj);
+ kobject_put(&sess_dev->kobj);
+ wait_for_completion(&sysfs_compl);
+}
+
+static void ibnbd_srv_sess_dev_release(struct kobject *kobj)
+{
+ struct ibnbd_srv_sess_dev *sess_dev;
+
+ sess_dev = container_of(kobj, struct ibnbd_srv_sess_dev, kobj);
+ if (sess_dev->sysfs_release_compl)
+ complete_all(sess_dev->sysfs_release_compl);
+}
+
+static struct kobj_type ibnbd_srv_sess_dev_ktype = {
+ .sysfs_ops = &ibnbd_srv_sysfs_ops,
+ .release = ibnbd_srv_sess_dev_release,
+};
+
+int ibnbd_srv_create_dev_client_sysfs(struct ibnbd_srv_sess_dev *sess_dev)
+{
+ int ret;
+
+ ret = kobject_init_and_add(&sess_dev->kobj, &ibnbd_srv_sess_dev_ktype,
+ &sess_dev->dev->dev_clients_kobj, "%s",
+ sess_dev->sess->str_addr);
+ if (ret)
+ return ret;
+
+ ret = sysfs_create_group(&sess_dev->kobj,
+ &ibnbd_srv_default_dev_client_attr_group);
+ if (ret)
+ goto err;
+
+ return 0;
+
+err:
+ kobject_del(&sess_dev->kobj);
+ kobject_put(&sess_dev->kobj);
+ return ret;
+}
+
+int ibnbd_srv_create_sysfs_files(void)
+{
+ int err;
+
+ ibnbd_srv_kobj = kobject_create_and_add(ibnbd_sysfs_dir, kernel_kobj);
+ if (!ibnbd_srv_kobj)
+ return -ENOMEM;
+
+ ibnbd_srv_devices_kobj = kobject_create_and_add("devices",
+ ibnbd_srv_kobj);
+ if (!ibnbd_srv_devices_kobj) {
+ err = -ENOMEM;
+ goto err;
+ }
+
+ return 0;
+
+err:
+ kobject_put(ibnbd_srv_kobj);
+ return err;
+}
+
+void ibnbd_srv_destroy_sysfs_files(void)
+{
+ kobject_put(ibnbd_srv_devices_kobj);
+ kobject_put(ibnbd_srv_kobj);
+}
diff --git a/drivers/block/ibnbd_server/ibnbd_srv_sysfs.h b/drivers/block/ibnbd_server/ibnbd_srv_sysfs.h
new file mode 100644
index 0000000..1df232a
--- /dev/null
+++ b/drivers/block/ibnbd_server/ibnbd_srv_sysfs.h
@@ -0,0 +1,64 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler < mail@fholler.de>
+ * Jack Wang <jinpu.wang@profitbricks.com>
+ * Kleber Souza <kleber.souza@profitbricks.com>
+ * Danil Kipnis <danil.kipnis@profitbricks.com>
+ * Roman Pen <roman.penyaev@profitbricks.com>
+ * Milind Dumbare <Milind.dumbare@gmail.com>
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions, and the following disclaimer,
+ * without modification.
+ * 2. Redistributions in binary form must reproduce at minimum a disclaimer
+ * substantially similar to the "NO WARRANTY" disclaimer below
+ * ("Disclaimer") and any redistribution must be conditioned upon
+ * including a substantially similar Disclaimer requirement for further
+ * binary redistribution.
+ * 3. Neither the names of the above-listed copyright holders nor the names
+ * of any contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * NO WARRANTY
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+ * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGES.
+ *
+ */
+
+#ifndef _IBNBD_SRV_SYSFS_H
+#define _IBNBD_SRV_SYSFS_H
+
+int ibnbd_srv_create_dev_sysfs(struct ibnbd_srv_dev *dev,
+ struct block_device *bdev,
+ const char *dir_name);
+
+void ibnbd_srv_destroy_dev_sysfs(struct ibnbd_srv_dev *dev);
+
+int ibnbd_srv_create_dev_client_sysfs(struct ibnbd_srv_sess_dev *sess_dev);
+
+void ibnbd_srv_destroy_dev_client_sysfs(struct ibnbd_srv_sess_dev *sess_dev);
+
+int ibnbd_srv_create_sysfs_files(void);
+
+void ibnbd_srv_destroy_sysfs_files(void);
+
+#endif
--
2.7.4
^ permalink raw reply related
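The error path of ibnbd_srv_create_dev_sysfs() in the patch above undoes the setup steps in the reverse order in which they were performed. A minimal userspace sketch of that staged-unwind idiom follows. The names `struct demo_dev`, `demo_step()` and `demo_create()` are hypothetical and exist only for this illustration; the booleans merely stand in for the kobjects, attribute group and symlink that the driver manages.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the sysfs objects set up per device. */
struct demo_dev {
	bool kobj_added;	/* models dev_kobj */
	bool clients_added;	/* models dev_clients_kobj */
	bool group_created;	/* models the attribute group */
	bool link_created;	/* models the "block_dev" symlink */
};

/* Pretend setup step: fails with -ENOMEM when 'fail' is true. */
static int demo_step(bool *done, bool fail)
{
	assert(done);
	if (fail)
		return -ENOMEM;
	*done = true;
	return 0;
}

/*
 * Set up four resources in order. 'fail_at' (1..4) selects the step that
 * fails, 0 means no failure. On error, everything set up so far is undone
 * in reverse order, mirroring the err3/err2/err labels of the driver.
 */
static int demo_create(struct demo_dev *dev, int fail_at)
{
	int ret;

	ret = demo_step(&dev->kobj_added, fail_at == 1);
	if (ret)
		return ret;
	ret = demo_step(&dev->clients_added, fail_at == 2);
	if (ret)
		goto err;
	ret = demo_step(&dev->group_created, fail_at == 3);
	if (ret)
		goto err2;
	ret = demo_step(&dev->link_created, fail_at == 4);
	if (ret)
		goto err3;
	return 0;

err3:
	dev->group_created = false;
err2:
	dev->clients_added = false;
err:
	dev->kobj_added = false;
	return ret;
}
```

Each label undoes exactly one step and falls through to the labels for the earlier steps, so a new setup step only needs one new label at the top of the unwind chain.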
* [PATCH 26/28] ibnbd_srv: add Makefile and Kconfig
From: Jack Wang @ 2017-03-24 10:45 UTC (permalink / raw)
To: linux-block, linux-rdma
Cc: dledford, axboe, hch, mail, Milind.dumbare, yun.wang, Jack Wang
In-Reply-To: <1490352343-20075-1-git-send-email-jinpu.wangl@profitbricks.com>
From: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
drivers/block/Kconfig | 1 +
drivers/block/Makefile | 1 +
drivers/block/ibnbd_server/Kconfig | 16 ++++++++++++++++
drivers/block/ibnbd_server/Makefile | 3 +++
4 files changed, 21 insertions(+)
create mode 100644 drivers/block/ibnbd_server/Kconfig
create mode 100644 drivers/block/ibnbd_server/Makefile
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index c309e57..e4823c4 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -276,6 +276,7 @@ config BLK_DEV_CRYPTOLOOP
source "drivers/block/drbd/Kconfig"
source "drivers/block/ibnbd_client/Kconfig"
+source "drivers/block/ibnbd_server/Kconfig"
config BLK_DEV_NBD
tristate "Network block device support"
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 7da1813..cd20888 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -35,6 +35,7 @@ obj-$(CONFIG_BLK_DEV_HD) += hd.o
obj-$(CONFIG_XEN_BLKDEV_FRONTEND) += xen-blkfront.o
obj-$(CONFIG_XEN_BLKDEV_BACKEND) += xen-blkback/
obj-$(CONFIG_BLK_DEV_IBNBD_CLT) += ibnbd_client/
+obj-$(CONFIG_BLK_DEV_IBNBD_SRV) += ibnbd_server/
obj-$(CONFIG_BLK_DEV_DRBD) += drbd/
obj-$(CONFIG_BLK_DEV_RBD) += rbd.o
obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX) += mtip32xx/
diff --git a/drivers/block/ibnbd_server/Kconfig b/drivers/block/ibnbd_server/Kconfig
new file mode 100644
index 0000000..943e1b2
--- /dev/null
+++ b/drivers/block/ibnbd_server/Kconfig
@@ -0,0 +1,16 @@
+config BLK_DEV_IBNBD_SRV
+ tristate "Network block device over Infiniband server support"
+ depends on INFINIBAND_IBTRS_SRV
+ ---help---
+ Saying Y here will allow your computer to be a server for network
+ block devices over Infiniband, i.e. it will be able to export its
+ block devices to clients (which can mount file systems on them etc.).
+ Communication between client and server works over Infiniband
+ networking, but to the client program this is hidden:
+ it looks like a regular local file access to a block device
+ special file such as /dev/ibnbd0.
+
+ To compile this driver as a module, choose M here: the
+ module will be called ibnbd_server.
+
+ If unsure, say N.
diff --git a/drivers/block/ibnbd_server/Makefile b/drivers/block/ibnbd_server/Makefile
new file mode 100644
index 0000000..e66860f
--- /dev/null
+++ b/drivers/block/ibnbd_server/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_BLK_DEV_IBNBD_SRV) += ibnbd_server.o
+ibnbd_server-objs := ibnbd_srv.o ibnbd_srv_sysfs.o ibnbd_dev.o \
+ ../ibnbd_lib/ibnbd.o ../ibnbd_lib/ibnbd-proto.o
--
2.7.4
^ permalink raw reply related
* [PATCH 27/28] ibnbd: add doc for how to use ibnbd and sysfs interface
From: Jack Wang @ 2017-03-24 10:45 UTC (permalink / raw)
To: linux-block, linux-rdma
Cc: dledford, axboe, hch, mail, Milind.dumbare, yun.wang, Jack Wang
In-Reply-To: <1490352343-20075-1-git-send-email-jinpu.wangl@profitbricks.com>
From: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
Documentation/IBNBD.txt | 284 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 284 insertions(+)
create mode 100644 Documentation/IBNBD.txt
diff --git a/Documentation/IBNBD.txt b/Documentation/IBNBD.txt
new file mode 100644
index 0000000..f7f490a
--- /dev/null
+++ b/Documentation/IBNBD.txt
@@ -0,0 +1,284 @@
+Infiniband Network Block Device (IBNBD)
+=======================================
+
+Introduction
+------------
+
+IBNBD (InfiniBand Network Block Device) is a pair of kernel modules (client and
+server) that allows clients to access a remote storage device on the server
+via an InfiniBand network.
+Mapped storage devices are transparent to the client and behave like any other
+regular storage device.
+
+The data transport between client and server over the InfiniBand network
+is performed by the IBTRS (InfiniBand Transport) kernel modules.
+
+The administration of these modules is done via sysfs. A command-line tool
+(ibnbd-cli) is also available for a more user-friendly experience.
+
+Requirements
+------------
+ - IBTRS kernel modules (available as git-submodule)
+
+Quick Start
+-----------
+Server:
+ # insmod ibtrs/ibtrs_server/ibtrs_server.ko
+ # insmod ibnbd_server/ibnbd_server.ko
+
+Client:
+ # insmod ibtrs/ibtrs_client/ibtrs_client.ko
+ # insmod ibnbd_client/ibnbd_client.ko
+ # echo "server=<SERVER-ADDRESS> device_path=<DEV-PATH-ON-SERVER>" > /sys/kernel/ibnbd/map_device
+
+The block device <DEV-PATH-ON-SERVER> will become available on the client as
+/dev/ibnbd<NR>. It can be used like a local block device.
+
+Client Userspace Interface
+--------------------------
+This chapter describes only the most important files of the userspace interface.
+Full documentation can be found in the Architecture Documentation.
+
+All sysfs files that are not read-only return usage information when they
+are read.
+
+example:
+ $ cat /sys/kernel/ibnbd/map_device
+
+
+/sys/kernel/ibnbd/ entries
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+map_device (RW)
+^^^^^^^^^^^^^^^
+To map a volume on the client, information about the device has to be written
+to:
+ /sys/kernel/ibnbd/map_device
+
+The format of the input is:
+ "server=<server-address> device_path=<relative-path-to-device-on-server>
+ [access_mode=<ro|rw|migration>] [input_mode=<mq|rq>]
+ [io_mode=<fileio|blockio>]"
+
+Server Parameter
+++++++++++++++++
+A server address has to be in one of the following formats:
+ - ip:<IPv6>
+ - ip:<IPv4>
+ - gid:<GID>
+
+device_path Parameter
++++++++++++++++++++++
+A device can be mapped by specifying its relative path to the configured
+dev_search_path on the server side.
+The ibnbd_server prepends the configured dev_search_path to the device_path
+passed in the map operation and tries to open a block device with the
+path dev_search_path/device_path.
+On success, a /dev/ibnbd<NR> device file, a /sys/block/ibnbd/ibnbd<NR>/
+directory and an entry in /sys/kernel/ibnbd/devices will be created.
+
+access_mode Parameter
++++++++++++++++++++++
+The access_mode parameter specifies if the device is to be mapped as read-only
+or read-write. The "migration" access mode has the same effect as "rw" and
+should be used during a VM migration scenario by the client where the VM is
+being migrated to.
+If not specified, 'rw' is used.
+
+input_mode Parameter
+++++++++++++++++++++
+The input_mode parameter specifies the internal I/O processing mode of the
+network block device on the client.
+If not specified, 'mq' mode is used.
+
+io_mode Parameter
++++++++++++++++++
+The io_mode parameter specifies if the device on the server will be opened as
+block device (blockio) or as file (fileio).
+When the device is opened as file, the VFS page cache is used for read I/O
+operations, write I/O operations bypass the page cache and go directly to disk
+(except meta updates, like file access time).
+When the device is opened as block device, the block device is accessed
+directly, no VFS page cache is used.
+If not specified, 'fileio' mode is used.
+
+Exit Codes
+++++++++++
+If the device is already mapped it will fail with EEXIST. If the input has an
+invalid format it will return EINVAL. If the device path cannot be found on the
+server, it will fail with ENOENT.
+
+Examples
+++++++++
+ # echo "server=ip:10.50.100.64 device_path=/dev/ram1 input_mode=mq" > /sys/kernel/ibnbd/map_device
+ # echo "server=ip:10.50.100.64 device_path=3F2504E0-4F89-41D3-9A0C-0305E82C3301" > /sys/kernel/ibnbd/map_device
+
+Finding device file after mapping
++++++++++++++++++++++++++++++++++
+After mapping, the device file can be found by:
+1.) The symlink /sys/kernel/ibnbd/devices/<device_id> points to
+ /sys/block/<dev-name>.
+ The last part of the symlink destination is the same as the device name.
+ By extracting the last part of the path, the path to the device
+ /dev/<dev-name> can be built.
+2.) /dev/block/$(cat /sys/kernel/ibnbd/devices/<device_id>/dev)
+
+How to find the <device_id> of the device is described in the next section
+(devices/ directory).
+
+devices/ (DIRECTORY)
+^^^^^^^^^^^^^^^^^^^^
+For each device mapped on the client a new symbolic link is created as
+/sys/kernel/ibnbd/devices/<device_id>, which points to the block device created
+by ibnbd (/sys/block/ibnbd<NR>/). The <device_id> of each device is created as
+follows:
+
+- If the 'device_path' provided during mapping contains slashes ("/"), they are
+ replaced by exclamation marks ("!") and the result is used as the <device_id>.
+ Otherwise, the <device_id> will be the same as the 'device_path' provided.
+
+
+Examples
+++++++++
+ /sys/kernel/ibnbd/devices/3F2504E0-4F89-41D3-9A0C-0305E82C3301 -> /sys/block/ibnbd1/
+ /sys/kernel/ibnbd/devices/!dev!ram1 -> /sys/block/ibnbd0/
+
+
+/sys/block/ibnbd<NR>/ibnbd/ entries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+unmap_device (RW)
+^^^^^^^^^^^^^^^^^
+To unmap a volume, 'normal' or 'force' has to be written to:
+ /sys/block/ibnbd<NR>/ibnbd/unmap_device
+
+When 'normal' is used, the operation will fail with EBUSY if any process is
+using the device.
+When 'force' is used, the device is unmapped even if it is in use.
+All I/Os that are in progress will fail. It can happen that the device
+file (/dev/ibnbdx) still exists after the unmapping. The kernel
+could not remove the file because it was in use, but it is marked as unused.
+The device file will be freed when no process refers to it.
+
+In a subsequent IBNBD mapping the remote device can be reused, but
+ibnbd may generate a different device file for it.
+
+Examples
+++++++++
+ # echo "normal" > /sys/block/ibnbd0/ibnbd/unmap_device
+
+state (RO)
+^^^^^^^^^^
+The file contains the current state of the block device. The state file returns
+'open' when the device is successfully mapped from the server and accepting I/O
+requests. When the connection to the server gets disconnected in case of an
+error (e.g. link failure), the state file returns 'closed' and all I/O requests
+will fail with -EIO.
+
+session (RO)
+^^^^^^^^^^^^
+IBNBD uses an IBTRS session to transport the data between client and server.
+The file 'session' contains the address of the server that was used to
+establish the IBTRS session.
+It's the same address that was passed as server parameter to the map_device
+file.
+
+mapping_path (RO)
+^^^^^^^^^^^^^^^^^
+Contains the path that was passed as device_path to the map_device operation.
+
+/sys/kernel/ibtrs/sessions/ entries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The connections to the servers are created and destroyed on demand. When the
+first device is mapped from a server, an IBTRS connection will be created with
+this server and the following directory will be created:
+
+/sys/kernel/ibtrs/sessions/<server-address>/
+
+If the connection establishment fails, detailed error information can be found
+in the kernel log (dmesg).
+
+When the last device is unmapped from a server, the connection will be closed
+and the directory will be deleted.
+
+
+Server Userspace Interface
+--------------------------
+
+/sys/kernel/ibnbd/ entries
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+/sys/kernel/ibnbd/devices/ entries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+When a client maps a device, a directory entry with the name of the block
+device is created under /sys/kernel/ibnbd/devices/. If the device path provided
+by the client is a symbolic link to a block device, the target block device name
+is used instead of the mapping path name.
+
+block_dev
+^^^^^^^^^
+block_dev is a symlink to the sysfs entry of the exported device
+
+Examples
+++++++++
+ block_dev -> ../../../../devices/virtual/block/nullb1
+
+revalidate
+^^^^^^^^^^
+When the size of an exported block device changes on the server, the clients
+have to be notified so they can resize the mapped device.
+
+Notification of the clients about a device change is triggered by writing '1'
+to the revalidate file.
+
+Examples
+++++++++
+ # echo 1 > /sys/kernel/ibnbd/devices/nullb1/revalidate
+
+/sys/kernel/ibnbd/devices/<device_name>/clients entries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+When the device is mapped from a client, the following directory will be
+created:
+
+/sys/kernel/ibnbd/devices/<device_name>/clients/<client-address> entries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+When the device is unmapped, the directory will be removed.
+
+read_only
+^^^^^^^^^
+Contains '1' if device is mapped read-only, otherwise '0'.
+
+mapping_path
+^^^^^^^^^^^^
+Contains the relative device path provided by the user during mapping.
+
+
+IBNBD-Server Module Parameters
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+dev_search_path
+^^^^^^^^^^^^^^^
+When a device is mapped from the client, the server generates the path to the
+block device on the server side by concatenating dev_search_path and the
+device_path that was specified in the map_device operation.
+
+The format of the input is
+ path ::= Absolute linux path name,
+ Max. length depends on PATH_MAX define (usually 4095 chars)
+
+The default dev_search_path is: "/".
+
+Example
++++++++
+
+Configured dev_search_path on server is: /dev/storage/
+client maps device by::
+ # echo "server=ip:10.50.100.64 device_path=3F2504E0-4F89-41D3-9A0C-0305E82C3301" > /sys/kernel/ibnbd/map_device
+
+The server tries to open a block device with the path:
+ /dev/storage/3F2504E0-4F89-41D3-9A0C-0305E82C3301
+
+
+Contact
+-------
+Mailing list: ibnbd@profitbricks.com
--
2.7.4
^ permalink raw reply related
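Two string transformations are described in the documentation patch above: the client derives the sysfs <device_id> by replacing every "/" in device_path with "!", and the server builds the path it opens by prepending dev_search_path to device_path. The following userspace sketch illustrates both. The function names `demo_device_id()` and `demo_server_path()` are hypothetical, and the separator handling (one "/" inserted unless dev_search_path already ends with one) is an assumption, since the document does not show how the driver joins the two parts.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs <device_id> from a device_path: every '/' becomes '!'.
 * Returns 0 on success, -1 if 'dst' is too small.
 */
static int demo_device_id(char *dst, size_t len, const char *device_path)
{
	size_t i, n;

	assert(dst && device_path);
	n = strlen(device_path);
	if (n + 1 > len)
		return -1;
	for (i = 0; i < n; i++)
		dst[i] = device_path[i] == '/' ? '!' : device_path[i];
	dst[n] = '\0';
	return 0;
}

/*
 * Join dev_search_path and device_path. Assumption of this sketch: a
 * single '/' is inserted unless search_path already ends with one.
 * Returns 0 on success, -1 if the result does not fit into 'dst'.
 */
static int demo_server_path(char *dst, size_t len, const char *search_path,
			    const char *device_path)
{
	size_t n;
	int ret;

	assert(dst && search_path && device_path);
	n = strlen(search_path);
	ret = snprintf(dst, len, "%s%s%s", search_path,
		       (n > 0 && search_path[n - 1] == '/') ? "" : "/",
		       device_path);
	return (ret < 0 || (size_t)ret >= len) ? -1 : 0;
}
```

With the values used in the documentation, "/dev/ram1" maps to the device id "!dev!ram1", and a dev_search_path of "/dev/storage/" combined with the UUID-style device_path yields "/dev/storage/3F2504E0-4F89-41D3-9A0C-0305E82C3301".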
* [PATCH 28/28] MAINTAINERS: Add maintainer for IBNBD/IBTRS
From: Jack Wang @ 2017-03-24 10:45 UTC (permalink / raw)
To: linux-block, linux-rdma
Cc: dledford, axboe, hch, mail, Milind.dumbare, yun.wang, Jack Wang
In-Reply-To: <1490352343-20075-1-git-send-email-jinpu.wangl@profitbricks.com>
From: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
MAINTAINERS | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index c776906..12a528a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6263,6 +6263,20 @@ IBM ServeRAID RAID DRIVER
S: Orphan
F: drivers/scsi/ips.*
+IBTRS TRANSPORT DRIVERS
+M: Jack Wang <jinpu.wang@profitbricks.com>
+L: linux-rdma@vger.kernel.org
+S: Maintained
+F: include/linux/ibtrs*.h
+F: drivers/infiniband/ulp/ibtrs*
+
+IBNBD BLOCK DRIVERS
+M: Jack Wang <jinpu.wang@profitbricks.com>
+L: linux-rdma@vger.kernel.org
+S: Maintained
+F: Documentation/IBNBD.txt
+F: drivers/block/ibnbd*
+
ICH LPC AND GPIO DRIVER
M: Peter Tyser <ptyser@xes-inc.com>
S: Maintained
--
2.7.4
^ permalink raw reply related
* Re: [PATCH 5/8] nowait aio: return on congested block device
From: Goldwyn Rodrigues @ 2017-03-24 11:32 UTC (permalink / raw)
To: Jens Axboe, linux-fsdevel
Cc: jack, hch, linux-block, linux-btrfs, linux-ext4, linux-xfs, sagi,
avi, linux-api, willy, Goldwyn Rodrigues
In-Reply-To: <eee4683d-9f44-434f-b97f-b0b24c7b3dab@kernel.dk>
On 03/16/2017 09:33 AM, Jens Axboe wrote:
> On 03/15/2017 03:51 PM, Goldwyn Rodrigues wrote:
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 0eeb99e..2e5cba2 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2014,7 +2019,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>> do {
>> struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>
>> - if (likely(blk_queue_enter(q, false) == 0)) {
>> + if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>> struct bio_list hold;
>> struct bio_list lower, same;
>>
>> @@ -2040,7 +2045,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>> bio_list_merge(&bio_list_on_stack, &same);
>> bio_list_merge(&bio_list_on_stack, &hold);
>> } else {
>> - bio_io_error(bio);
>> + if (unlikely(bio_flagged(bio, BIO_NOWAIT)))
>> + bio_wouldblock_error(bio);
>> + else
>> + bio_io_error(bio);
>
> This doesn't look right. What if the queue is dying, and BIO_NOWAIT just
> happened to be set?
>
Yes, I need to add a condition here to check for blk_queue_dying(). Thanks.
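A rough sketch of the decision with that extra check, modeled in userspace: `demo_enter_failed_errno()` is a hypothetical helper, not code from the series, and it assumes that bio_wouldblock_error() reports -EAGAIN while bio_io_error() reports -EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model: which error a bio gets when blk_queue_enter() fails.
 * A dying queue is a hard error even for a NOWAIT bio; only a live but
 * currently unavailable queue yields -EAGAIN.
 */
static int demo_enter_failed_errno(bool bio_nowait, bool queue_dying)
{
	if (bio_nowait && !queue_dying)
		return -EAGAIN;	/* would block: the caller may retry */
	return -EIO;		/* dying queue or blocking bio: fail the I/O */
}
```

The point of the ordering is that a NOWAIT submitter never sees a retryable error for a queue that will not come back.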
> And you're missing wbt_wait() as well as a blocking point. Ditto in
> blk-mq.
wbt_wait() does not apply to WRITE_ODIRECT
>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 159187a..942ce8c 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1518,6 +1518,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>> rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
>> if (unlikely(!rq)) {
>> __wbt_done(q->rq_wb, wb_acct);
>> + if (bio && bio_flagged(bio, BIO_NOWAIT))
>> + bio_wouldblock_error(bio);
>> return BLK_QC_T_NONE;
>> }
>>
>
> This seems a little fragile now, since not both paths free the bio.
>
Direct I/O should free the bios in bio_dio_complete(). I am not sure why
it would not free bio here originally, but IIRC, this path is for
bio==NULL only. So, with this patch we would get a rq==NULL here and
hence the bio_wouldblock_error() call.
--
Goldwyn
^ permalink raw reply
* Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)
From: Johannes Thumshirn @ 2017-03-24 12:15 UTC (permalink / raw)
To: Jack Wang
Cc: linux-block, linux-rdma, dledford, axboe, hch, mail,
Milind.dumbare, yun.wang
In-Reply-To: <1490352343-20075-1-git-send-email-jinpu.wangl@profitbricks.com>
On Fri, Mar 24, 2017 at 11:45:15AM +0100, Jack Wang wrote:
> From: Jack Wang <jinpu.wang@profitbricks.com>
>
> This series introduces IBNBD/IBTRS kernel modules.
>
> IBNBD (InfiniBand network block device) allows for an RDMA transfer of block IO
> over InfiniBand network. The driver presents itself as a block device on client
> side and transmits the block requests in a zero-copy fashion to the server-side
> via InfiniBand. The server part of the driver converts the incoming buffers back
> into BIOs and hands them down to the underlying block device. As soon as IO
> responses come back from the drive, they are being transmitted back to the
> client.
>
> We design and implement this solution based on our need for Cloud Computing,
> the key features are:
> - High throughput and low latency due to:
> 1) Only two rdma messages per IO
> 2) Simplified client side server memory management
> 3) Eliminated SCSI sublayer
> - Simple configuration and handling
> 1) Server side is completely passive: volumes do not need to be
> explicitly exported
> 2) Only IB port GID and device path needed on client side to map
> a block device
> 3) A device can be remapped automatically i.e. after storage
> reboot
> - Pinning of IO-related processing to the CPU of the producer
>
> For usage please refer to Documentation/IBNBD.txt in later patch.
> My colleague Danil Kpnis presents IBNBD in Vault-2017 about our design/feature/
> tradeoff/performance:
>
> http://events.linuxfoundation.org/sites/events/files/slides/IBNBD-Vault-2017.pdf
>
Hi Jack,
Sorry to ask (I haven't attended the Vault presentation) but why can't you use
NVMe over Fabrics in your environment? From what I see in your presentation
and cover letter, it provides all you need and is in fact a standard that Linux
and Windows have already implemented.
Thanks,
Johannes
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply
* Re: [PATCH 01/28] ibtrs: add header shared between ibtrs_client and ibtrs_server
From: Johannes Thumshirn @ 2017-03-24 12:35 UTC (permalink / raw)
To: Jack Wang
Cc: linux-block, linux-rdma, dledford, axboe, hch, mail,
Milind.dumbare, yun.wang, Kleber Souza, Danil Kipnis, Roman Pen
In-Reply-To: <1490352343-20075-2-git-send-email-jinpu.wangl@profitbricks.com>
On Fri, Mar 24, 2017 at 11:45:16AM +0100, Jack Wang wrote:
> From: Jack Wang <jinpu.wang@profitbricks.com>
>
> Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
> Signed-off-by: Kleber Souza <kleber.souza@profitbricks.com>
> Signed-off-by: Danil Kipnis <danil.kipnis@profitbricks.com>
> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> ---
[...]
> +
> +#define XX(a) case (a): return #a
Please no macros with return in them, and XX isn't very descriptive
either.
[...]
> +static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
> +{
> + switch (opcode) {
> + XX(IB_WC_SEND);
> + XX(IB_WC_RDMA_WRITE);
> + XX(IB_WC_RDMA_READ);
> + XX(IB_WC_COMP_SWAP);
> + XX(IB_WC_FETCH_ADD);
> + /* recv-side); inbound completion */
> + XX(IB_WC_RECV);
> + XX(IB_WC_RECV_RDMA_WITH_IMM);
> + default: return "IB_WC_OPCODE_UNKNOWN";
> + }
> +}
How about:
#include <linux/stringify.h>

static const struct {
	const char *name;
	enum ib_wc_opcode opcode;
} ib_wc_opcode_table[] = {
	{ __stringify(IB_WC_SEND), IB_WC_SEND },
	{ __stringify(IB_WC_RDMA_WRITE), IB_WC_RDMA_WRITE },
	{ __stringify(IB_WC_RDMA_READ), IB_WC_RDMA_READ },
	{ __stringify(IB_WC_COMP_SWAP), IB_WC_COMP_SWAP },
	{ __stringify(IB_WC_FETCH_ADD), IB_WC_FETCH_ADD },
	/* recv-side: inbound completion */
	{ __stringify(IB_WC_RECV), IB_WC_RECV },
	{ __stringify(IB_WC_RECV_RDMA_WITH_IMM), IB_WC_RECV_RDMA_WITH_IMM },
};

static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
{
	int i;

	/* Linear scan; the table is small and bounded by ARRAY_SIZE(). */
	for (i = 0; i < ARRAY_SIZE(ib_wc_opcode_table); i++)
		if (ib_wc_opcode_table[i].opcode == opcode)
			return ib_wc_opcode_table[i].name;

	return "IB_WC_OPCODE_UNKNOWN";
}
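A variation on the same idea, shown here only as a userspace sketch: instead of scanning a table of (name, opcode) pairs, the names can be stored in a sparse array indexed directly by the opcode using designated initializers. Note that this is a different technique from the linear search above. `enum demo_wc_opcode` and all `demo_`/`DEMO_` names are hypothetical stand-ins so that the example compiles without kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a few values of an opcode enum. */
enum demo_wc_opcode {
	DEMO_WC_SEND,
	DEMO_WC_RDMA_WRITE,
	DEMO_WC_RDMA_READ,
	DEMO_WC_RECV = 1 << 7,
};

/* Expands to a designated initializer: [DEMO_WC_SEND] = "DEMO_WC_SEND". */
#define DEMO_NAME(op) [op] = #op

/* Sparse table indexed directly by opcode; unlisted slots stay NULL. */
static const char * const demo_wc_opcode_names[] = {
	DEMO_NAME(DEMO_WC_SEND),
	DEMO_NAME(DEMO_WC_RDMA_WRITE),
	DEMO_NAME(DEMO_WC_RDMA_READ),
	DEMO_NAME(DEMO_WC_RECV),
};

static const char *demo_wc_opcode_str(unsigned int opcode)
{
	size_t n = sizeof(demo_wc_opcode_names) /
		   sizeof(demo_wc_opcode_names[0]);

	/* Bounds check first, then reject the NULL gaps of the sparse table. */
	if (opcode < n && demo_wc_opcode_names[opcode])
		return demo_wc_opcode_names[opcode];
	return "DEMO_WC_OPCODE_UNKNOWN";
}
```

The lookup is constant time and the macro contains no return statement, at the cost of a table that is as large as the highest opcode value.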
[...]
> +/**
> + * struct ibtrs_msg_hdr - Common header of all IBTRS messages
> + * @type: Message type, valid values see: enum ibtrs_msg_types
> + * @tsize: Total size of transferred data
> + *
> + * Don't move the first 8 padding bytes! It's a workaround for a kernel bug.
> + * See IBNBD-610 for details
What about resolving the kernel bug instead of making workarounds?
> + *
> + * DO NOT CHANGE!
> + */
> +struct ibtrs_msg_hdr {
> + u8 __padding1;
> + u8 type;
> + u16 __padding2;
> + u32 tsize;
> +};
[...]
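If the layout of the header quoted above really must not change, one option is to let the compiler enforce it rather than rely on a "DO NOT CHANGE" comment. The sketch below is a userspace mirror of the struct with C11 static assertions; `struct demo_ibtrs_msg_hdr` is a hypothetical copy using fixed-width types, and the expected offsets are simply those implied by the field order in the quoted definition. In kernel code BUILD_BUG_ON() would serve the same purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the quoted header, using fixed-width types. */
struct demo_ibtrs_msg_hdr {
	uint8_t  padding1;
	uint8_t  type;
	uint16_t padding2;
	uint32_t tsize;
};

/* Fail the build if anyone reorders or resizes the fields. */
static_assert(sizeof(struct demo_ibtrs_msg_hdr) == 8,
	      "header must stay 8 bytes");
static_assert(offsetof(struct demo_ibtrs_msg_hdr, type) == 1,
	      "type must stay at byte offset 1");
static_assert(offsetof(struct demo_ibtrs_msg_hdr, tsize) == 4,
	      "tsize must stay at byte offset 4");
```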
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply
* [PATCH v2 0/4] block: misc changes
From: Ming Lei @ 2017-03-24 12:36 UTC (permalink / raw)
To: Jens Axboe, linux-block, Christoph Hellwig
Cc: Bart Van Assche, Hannes Reinecke, Ming Lei
Hi,
The 1st patch adds comments on blk-mq races with the timeout handler.
The other 3 patches improve the handling of a dying queue:
- the 2nd one adds a barrier in blk_queue_enter() to
avoid a hang caused by reordered reads
- the 3rd and 4th patches block new I/O from entering the queue
after the queue is set as dying
V1:
- add comments on races related with timeout handler
- add Tested-by & Reviewed-by tag
thanks,
Ming
Ming Lei (4):
blk-mq: comment on races related with timeout handler
block: add a read barrier in blk_queue_enter()
block: rename blk_mq_freeze_queue_start()
block: block new I/O just after queue is set as dying
block/blk-core.c | 12 ++++++++++++
block/blk-mq.c | 32 +++++++++++++++++++++++++++-----
drivers/block/mtip32xx/mtip32xx.c | 2 +-
drivers/nvme/host/core.c | 2 +-
include/linux/blk-mq.h | 2 +-
5 files changed, 42 insertions(+), 8 deletions(-)
--
2.9.3
^ permalink raw reply
* [PATCH v2 1/4] blk-mq: comment on races related with timeout handler
From: Ming Lei @ 2017-03-24 12:36 UTC (permalink / raw)
To: Jens Axboe, linux-block, Christoph Hellwig
Cc: Bart Van Assche, Hannes Reinecke, Ming Lei
In-Reply-To: <20170324123621.5227-1-tom.leiming@gmail.com>
This patch adds comments on two races related to the
timeout handler:
- requeue from queue busy vs. timeout
- rq free & reallocation vs. timeout
Both the races themselves and the current solutions aren't
explicit enough, so add comments on them.
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
block/blk-mq.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c212b9644a9f..b36f0481ba0e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -523,6 +523,15 @@ void blk_mq_start_request(struct request *rq)
}
EXPORT_SYMBOL(blk_mq_start_request);
+/*
+ * When we reach here because queue is busy, REQ_ATOM_COMPLETE
+ * flag isn't set yet, so there may be race with timeout handler,
+ * but given rq->deadline is just set in .queue_rq() under
+ * this situation, the race won't be possible in reality because
+ * rq->timeout should be set as big enough to cover the window
+ * between blk_mq_start_request() called from .queue_rq() and
+ * clearing REQ_ATOM_STARTED here.
+ */
static void __blk_mq_requeue_request(struct request *rq)
{
struct request_queue *q = rq->q;
@@ -696,6 +705,19 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
return;
+ /*
+ * The rq being checked may have been freed and reallocated
+ * out already here, we avoid this race by checking rq->deadline
+ * and REQ_ATOM_COMPLETE flag together:
+ *
+ * - if rq->deadline is observed as new value because of
+ * reusing, the rq won't be timed out because of timing.
+ * - if rq->deadline is observed as previous value,
+ * REQ_ATOM_COMPLETE flag won't be cleared in reuse path
+ * because we put a barrier between setting rq->deadline
+ * and clearing the flag in blk_mq_start_request(), so
+ * this rq won't be timed out too.
+ */
if (time_after_eq(jiffies, rq->deadline)) {
if (!blk_mark_rq_complete(rq))
blk_mq_rq_timed_out(rq, reserved);
--
2.9.3
^ permalink raw reply related
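The expiry check that the second comment in the patch above annotates combines a wraparound-safe deadline comparison with a "mark complete exactly once" step. A small userspace sketch of that logic follows. `demo_time_after_eq()` is modeled on the kernel's time_after_eq(), while `struct demo_rq` and `demo_check_expired()` are hypothetical simplifications: plain booleans replace the atomic flag bits, so the sketch shows the control flow only, not the memory ordering the comment is about.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Wraparound-safe "a is at or after b" for a free-running tick counter:
 * the unsigned difference is reinterpreted as signed, so the result stays
 * correct while the two values are less than half the counter range apart.
 */
static bool demo_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Hypothetical request state: only what the expiry check needs. */
struct demo_rq {
	unsigned long deadline;
	bool started;	/* models REQ_ATOM_STARTED */
	bool complete;	/* models REQ_ATOM_COMPLETE */
};

/* Returns true if this call claimed the request as timed out. */
static bool demo_check_expired(struct demo_rq *rq, unsigned long now)
{
	assert(rq);
	if (!rq->started)
		return false;
	if (!demo_time_after_eq(now, rq->deadline))
		return false;
	if (rq->complete)	/* completed elsewhere: the timeout lost */
		return false;
	rq->complete = true;	/* the kernel does this test-and-set atomically */
	return true;
}
```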
* [PATCH v2 2/4] block: add a read barrier in blk_queue_enter()
From: Ming Lei @ 2017-03-24 12:36 UTC (permalink / raw)
To: Jens Axboe, linux-block, Christoph Hellwig
Cc: Bart Van Assche, Hannes Reinecke, Ming Lei
In-Reply-To: <20170324123621.5227-1-tom.leiming@gmail.com>
Without the barrier, reading DEAD flag of .q_usage_counter
and reading .mq_freeze_depth may be reordered, then the
following wait_event_interruptible() may never return.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
block/blk-core.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/block/blk-core.c b/block/blk-core.c
index ad388d5e309a..44eed17319c0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -669,6 +669,14 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
if (nowait)
return -EBUSY;
+ /*
+ * read pair of barrier in blk_mq_freeze_queue_start(),
+ * we need to order reading DEAD flag of .q_usage_counter
+ * and reading .mq_freeze_depth, otherwise the following
+ * wait may never return if the two read are reordered.
+ */
+ smp_rmb();
+
ret = wait_event_interruptible(q->mq_freeze_wq,
!atomic_read(&q->mq_freeze_depth) ||
blk_queue_dying(q));
--
2.9.3
^ permalink raw reply related
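The read ordering that patch 2/4 above asks for can be modeled in userspace with C11 atomics. This is only a sketch under stated assumptions: `struct demo_queue`, `demo_freeze_start()` and `demo_enter_must_wait()` are hypothetical names, the release fence on the freeze side stands in for whatever ordering the kernel's freeze path provides, and the acquire fence plays the role of smp_rmb() (it is somewhat stronger than a pure read barrier).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the two fields involved. */
struct demo_queue {
	atomic_int  freeze_depth;	/* models q->mq_freeze_depth */
	atomic_bool ref_dead;		/* models the DEAD flag of q_usage_counter */
};

/* Freeze side: publish freeze_depth before marking the ref dead. */
static void demo_freeze_start(struct demo_queue *q)
{
	assert(q);
	if (atomic_fetch_add_explicit(&q->freeze_depth, 1,
				      memory_order_relaxed) == 0) {
		atomic_thread_fence(memory_order_release);
		atomic_store_explicit(&q->ref_dead, true,
				      memory_order_relaxed);
	}
}

/*
 * Enter side: returns true if the caller has to wait for an unfreeze.
 * Once ref_dead has been observed as true, the fence guarantees that the
 * later read of freeze_depth cannot return the pre-freeze value.
 */
static bool demo_enter_must_wait(struct demo_queue *q)
{
	assert(q);
	if (!atomic_load_explicit(&q->ref_dead, memory_order_relaxed))
		return false;	/* ref still live: enter without waiting */
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&q->freeze_depth,
				    memory_order_relaxed) != 0;
}
```

Without the fence on the enter side, the two relaxed loads could be satisfied in either order, which is the reordering the commit message describes.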
* [PATCH v2 3/4] block: rename blk_mq_freeze_queue_start()
From: Ming Lei @ 2017-03-24 12:36 UTC (permalink / raw)
To: Jens Axboe, linux-block, Christoph Hellwig
Cc: Bart Van Assche, Hannes Reinecke, Ming Lei
In-Reply-To: <20170324123621.5227-1-tom.leiming@gmail.com>
Because .q_usage_counter is used by both the legacy and
mq paths, we need to block new I/O in blk_queue_enter()
once the queue becomes dead.
So rename blk_mq_freeze_queue_start() to blk_freeze_queue_start()
so that the function can be used in both paths.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
block/blk-core.c | 2 +-
block/blk-mq.c | 10 +++++-----
drivers/block/mtip32xx/mtip32xx.c | 2 +-
drivers/nvme/host/core.c | 2 +-
include/linux/blk-mq.h | 2 +-
5 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 44eed17319c0..5901133d105f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -670,7 +670,7 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
return -EBUSY;
/*
- * read pair of barrier in blk_mq_freeze_queue_start(),
+ * read pair of barrier in blk_freeze_queue_start(),
* we need to order reading DEAD flag of .q_usage_counter
* and reading .mq_freeze_depth, otherwise the following
* wait may never return if the two read are reordered.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b36f0481ba0e..5370b4f750ff 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -68,7 +68,7 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw);
}
-void blk_mq_freeze_queue_start(struct request_queue *q)
+void blk_freeze_queue_start(struct request_queue *q)
{
int freeze_depth;
@@ -78,7 +78,7 @@ void blk_mq_freeze_queue_start(struct request_queue *q)
blk_mq_run_hw_queues(q, false);
}
}
-EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
+EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
void blk_mq_freeze_queue_wait(struct request_queue *q)
{
@@ -108,7 +108,7 @@ void blk_freeze_queue(struct request_queue *q)
* no blk_unfreeze_queue(), and blk_freeze_queue() is not
* exported to drivers as the only user for unfreeze is blk_mq.
*/
- blk_mq_freeze_queue_start(q);
+ blk_freeze_queue_start(q);
blk_mq_freeze_queue_wait(q);
}
@@ -746,7 +746,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
* percpu_ref_tryget directly, because we need to be able to
* obtain a reference even in the short window between the queue
* starting to freeze, by dropping the first reference in
- * blk_mq_freeze_queue_start, and the moment the last request is
+ * blk_freeze_queue_start, and the moment the last request is
* consumed, marked by the instant q_usage_counter reaches
* zero.
*/
@@ -2376,7 +2376,7 @@ static void blk_mq_queue_reinit_work(void)
* take place in parallel.
*/
list_for_each_entry(q, &all_q_list, all_q_node)
- blk_mq_freeze_queue_start(q);
+ blk_freeze_queue_start(q);
list_for_each_entry(q, &all_q_list, all_q_node)
blk_mq_freeze_queue_wait(q);
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index f96ab717534c..c96c35ab39df 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -4162,7 +4162,7 @@ static int mtip_block_remove(struct driver_data *dd)
dev_info(&dd->pdev->dev, "device %s surprise removal\n",
dd->disk->disk_name);
- blk_mq_freeze_queue_start(dd->queue);
+ blk_freeze_queue_start(dd->queue);
blk_mq_stop_hw_queues(dd->queue);
blk_mq_tagset_busy_iter(&dd->tags, mtip_no_dev_cleanup, dd);
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9b3b57fef446..4a6d7f408769 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2386,7 +2386,7 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
mutex_lock(&ctrl->namespaces_mutex);
list_for_each_entry(ns, &ctrl->namespaces, list)
- blk_mq_freeze_queue_start(ns->queue);
+ blk_freeze_queue_start(ns->queue);
mutex_unlock(&ctrl->namespaces_mutex);
}
EXPORT_SYMBOL_GPL(nvme_start_freeze);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 5b3e201c8d4f..ea2e9dcd3aef 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -243,7 +243,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
busy_tag_iter_fn *fn, void *priv);
void blk_mq_freeze_queue(struct request_queue *q);
void blk_mq_unfreeze_queue(struct request_queue *q);
-void blk_mq_freeze_queue_start(struct request_queue *q);
+void blk_freeze_queue_start(struct request_queue *q);
void blk_mq_freeze_queue_wait(struct request_queue *q);
int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
unsigned long timeout);
--
2.9.3
^ permalink raw reply related
* [PATCH v2 4/4] block: block new I/O just after queue is set as dying
From: Ming Lei @ 2017-03-24 12:36 UTC (permalink / raw)
To: Jens Axboe, linux-block, Christoph Hellwig
Cc: Bart Van Assche, Hannes Reinecke, Ming Lei, Tejun Heo
In-Reply-To: <20170324123621.5227-1-tom.leiming@gmail.com>
Before commit 780db2071a (blk-mq: decouble blk-mq freezing
from generic bypassing), the dying flag was checked before
entering the queue. Tejun converted that check into a check of
.mq_freeze_depth, assuming that the counter is increased just after
the dying flag is set. Unfortunately we don't do that in blk_set_queue_dying().
This patch calls blk_freeze_queue_start() in blk_set_queue_dying(),
so that new I/O is blocked once the queue is set as dying.
Given that blk_set_queue_dying() is always called in the remove path
of a block device, and the queue will be cleaned up later, we don't
need to worry about undoing the counter.
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Tejun Heo <tj@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
block/blk-core.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 5901133d105f..f0dd9b0054ed 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -500,6 +500,9 @@ void blk_set_queue_dying(struct request_queue *q)
queue_flag_set(QUEUE_FLAG_DYING, q);
spin_unlock_irq(q->queue_lock);
+ /* block new I/O coming */
+ blk_freeze_queue_start(q);
+
if (q->mq_ops)
blk_mq_wake_waiters(q);
else {
@@ -672,8 +675,9 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
/*
* read pair of barrier in blk_freeze_queue_start(),
* we need to order reading DEAD flag of .q_usage_counter
- * and reading .mq_freeze_depth, otherwise the following
- * wait may never return if the two read are reordered.
+ * and reading .mq_freeze_depth or dying flag, otherwise
+ * the following wait may never return if the two read
+ * are reordered.
*/
smp_rmb();
--
2.9.3
^ permalink raw reply related
* Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)
From: Jinpu Wang @ 2017-03-24 12:46 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-block, linux-rdma@vger.kernel.org, Doug Ledford, Jens Axboe,
hch, Fabian Holler, Milind Dumbare, Michael Wang
In-Reply-To: <20170324121526.GF3571@linux-x5ow.site>
On Fri, Mar 24, 2017 at 1:15 PM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> On Fri, Mar 24, 2017 at 11:45:15AM +0100, Jack Wang wrote:
>> From: Jack Wang <jinpu.wang@profitbricks.com>
>>
>> This series introduces IBNBD/IBTRS kernel modules.
>>
>> IBNBD (InfiniBand network block device) allows for an RDMA transfer of block IO
>> over InfiniBand network. The driver presents itself as a block device on client
>> side and transmits the block requests in a zero-copy fashion to the server-side
>> via InfiniBand. The server part of the driver converts the incoming buffers back
>> into BIOs and hands them down to the underlying block device. As soon as IO
>> responses come back from the drive, they are being transmitted back to the
>> client.
>>
>> We design and implement this solution based on our need for Cloud Computing,
>> the key features are:
>> - High throughput and low latency due to:
>> 1) Only two rdma messages per IO
>> 2) Simplified client side server memory management
>> 3) Eliminated SCSI sublayer
>> - Simple configuration and handling
>> 1) Server side is completely passive: volumes do not need to be
>> explicitly exported
>> 2) Only IB port GID and device path needed on client side to map
>> a block device
>> 3) A device can be remapped automatically i.e. after storage
>> reboot
>> - Pinning of IO-related processing to the CPU of the producer
>>
>> For usage please refer to Documentation/IBNBD.txt in later patch.
>> My colleague Danil Kipnis presents IBNBD in Vault-2017 about our design/feature/
>> tradeoff/performance:
>>
>> http://events.linuxfoundation.org/sites/events/files/slides/IBNBD-Vault-2017.pdf
>>
>
> Hi Jack,
>
> Sorry to ask (I haven't attended the Vault presentation) but why can't you use
> NVMe over Fabrics in your environment? From what I see in your presentation
> and cover letter, it provides all you need and is in fact a standard Linux and
> Windows already have implemented.
>
> Thanks,
> Johannes
> --
> Johannes Thumshirn Storage
> jthumshirn@suse.de +49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Hi Johannes,
Our IBNBD project was started 3 years ago based on our need for Cloud
Computing; NVMeOF is a bit younger.
- IBNBD is one of our components, part of our software defined storage solution.
- As I listed in features, IBNBD has its own features.
We're planning to look more into NVMeOF, but it's not a replacement for IBNBD.
Thanks,
--
Jack Wang
Linux Kernel Developer
ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin
Tel: +49 30 577 008 042
Fax: +49 30 577 008 299
Email: jinpu.wang@profitbricks.com
URL: https://www.profitbricks.de
Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss
^ permalink raw reply
* Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)
From: Johannes Thumshirn @ 2017-03-24 12:48 UTC (permalink / raw)
To: Jinpu Wang
Cc: linux-block, linux-rdma@vger.kernel.org, Doug Ledford, Jens Axboe,
hch, Fabian Holler, Milind Dumbare, Michael Wang
In-Reply-To: <CAMGffE=CitFGj11NhFKPL2MNiOVVyb-ggRe-MhewcobGY0-u5A@mail.gmail.com>
On Fri, Mar 24, 2017 at 01:46:02PM +0100, Jinpu Wang wrote:
> Hi Johannes,
>
> Our IBNBD project was started 3 years ago based on our need for Cloud
> Computing, NVMeOF is a bit younger.
> - IBNBD is one of our components, part of our software defined storage solution.
> - As I listed in features, IBNBD has its own features
>
> We're planning to look more into NVMeOF, but it's not a replacement for IBNBD.
Ok thanks for the clarification.
Byte,
Johannes
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply
* Re: [PATCH 01/28] ibtrs: add header shared between ibtrs_client and ibtrs_server
From: Jinpu Wang @ 2017-03-24 12:54 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-block, linux-rdma@vger.kernel.org, Doug Ledford, Jens Axboe,
hch, Fabian Holler, Milind Dumbare, Michael Wang, Kleber Souza,
Danil Kipnis, Roman Pen
In-Reply-To: <20170324123546.GG3571@linux-x5ow.site>
>> +
>> +#define XX(a) case (a): return #a
>
> please no macros with return in them, and XX isn't very descriptive
> either.
>
> [...]
>
>> +static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
>> +{
>> + switch (opcode) {
>> + XX(IB_WC_SEND);
>> + XX(IB_WC_RDMA_WRITE);
>> + XX(IB_WC_RDMA_READ);
>> + XX(IB_WC_COMP_SWAP);
>> + XX(IB_WC_FETCH_ADD);
>> + /* recv-side); inbound completion */
>> + XX(IB_WC_RECV);
>> + XX(IB_WC_RECV_RDMA_WITH_IMM);
>> + default: return "IB_WC_OPCODE_UNKNOWN";
>> + }
>> +}
>
> How about:
>
> struct {
> char *name;
> enum ib_wc_opcode opcode;
> } ib_wc_opcode_table[] = {
> { __stringify(IB_WC_SEND), IB_WC_SEND },
> { __stringify(IB_WC_RDMA_WRITE), IB_WC_RDMA_WRITE },
> { __stringify(IB_WC_RDMA_READ), IB_WC_RDMA_READ },
> { __stringify(IB_WC_COMP_SWAP), IB_WC_COMP_SWAP },
> { __stringify(IB_WC_FETCH_ADD), IB_WC_FETCH_ADD },
> { __stringify(IB_WC_RECV), IB_WC_RECV },
> { __stringify(IB_WC_RECV_RDMA_WITH_IMM), IB_WC_RECV_RDMA_WITH_IMM },
> { NULL, 0 },
> };
>
> static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
> {
> int i;
>
> for (i = 0; i < ARRAY_SIZE(ib_wc_opcode_table); i++)
> if (ib_wc_opcode_table[i].opcode == opcode)
> return ib_wc_opcode_table[i].name;
>
> return "IB_WC_OPCODE_UNKNOWN";
> }
>
Looks nice, might be better to put it into ib_verbs.h?
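
For reference, the table-driven lookup suggested above can be sketched as a small self-contained userspace program. Everything prefixed `demo_`/`DEMO_` is hypothetical: the enum is a stand-in for `enum ib_wc_opcode` with illustrative values, and the local stringification macro stands in for the kernel's `__stringify()` helper from `<linux/stringify.h>`. Because the loop is bounded by the array size, this sketch leaves out the NULL sentinel entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for enum ib_wc_opcode; values are illustrative. */
enum demo_wc_opcode {
	DEMO_WC_SEND,
	DEMO_WC_RDMA_WRITE,
	DEMO_WC_RDMA_READ,
	DEMO_WC_RECV = 128,
};

/* Local helpers; the kernel provides __stringify() and ARRAY_SIZE(). */
#define DEMO_STRINGIFY(x)	#x
#define DEMO_ENTRY(op)		{ DEMO_STRINGIFY(op), op }
#define DEMO_ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* One row per opcode: printable name plus the value it describes. */
static const struct {
	const char *name;
	enum demo_wc_opcode opcode;
} demo_wc_opcode_table[] = {
	DEMO_ENTRY(DEMO_WC_SEND),
	DEMO_ENTRY(DEMO_WC_RDMA_WRITE),
	DEMO_ENTRY(DEMO_WC_RDMA_READ),
	DEMO_ENTRY(DEMO_WC_RECV),
};

/* Linear search; falls back to a fixed string for unknown opcodes. */
static const char *demo_wc_opcode_str(enum demo_wc_opcode opcode)
{
	size_t i;

	for (i = 0; i < DEMO_ARRAY_SIZE(demo_wc_opcode_table); i++)
		if (demo_wc_opcode_table[i].opcode == opcode)
			return demo_wc_opcode_table[i].name;

	return "DEMO_WC_OPCODE_UNKNOWN";
}
```

Compared with the `XX()` macro, no control flow is hidden inside a macro, at the cost of a linear search over a handful of entries.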
>
> [...]
>
>> +/**
>> + * struct ibtrs_msg_hdr - Common header of all IBTRS messages
>> + * @type: Message type, valid values see: enum ibtrs_msg_types
>> + * @tsize: Total size of transferred data
>> + *
>> + * Don't move the first 8 padding bytes! It's a workaround for a kernel bug.
>> + * See IBNBD-610 for details
>
> What about resolving the kernel bug instead of making workarounds?
I tried to send a patch upstream, but it was rejected by Sean.
http://www.spinics.net/lists/linux-rdma/msg22381.html
>
>> + *
>> + * DO NOT CHANGE!
>> + */
>> +struct ibtrs_msg_hdr {
>> + u8 __padding1;
>> + u8 type;
>> + u16 __padding2;
>> + u32 tsize;
>> +};
>
> [...]
>
> --
> Johannes Thumshirn Storage
> jthumshirn@suse.de +49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Thanks for the review, Johannes.
--
Jack Wang
Linux Kernel Developer
ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin
Tel: +49 30 577 008 042
Fax: +49 30 577 008 299
Email: jinpu.wang@profitbricks.com
URL: https://www.profitbricks.de
Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss
^ permalink raw reply
* Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)
From: Bart Van Assche @ 2017-03-24 13:31 UTC (permalink / raw)
To: jthumshirn@suse.de, jinpu.wang@profitbricks.com
Cc: linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
mail@fholler.de, yun.wang@profitbricks.com, hch@lst.de,
axboe@kernel.dk, Milind.dumbare@gmail.com, dledford@redhat.com
In-Reply-To: <CAMGffE=CitFGj11NhFKPL2MNiOVVyb-ggRe-MhewcobGY0-u5A@mail.gmail.com>
On Fri, 2017-03-24 at 13:46 +0100, Jinpu Wang wrote:
> Our IBNBD project was started 3 years ago based on our need for Cloud
> Computing, NVMeOF is a bit younger.
> - IBNBD is one of our components, part of our software defined storage solution.
> - As I listed in features, IBNBD has its own features
>
> We're planning to look more into NVMeOF, but it's not a replacement for IBNBD.
Hello Jack, Danil and Roman,
Thanks for having taken the time to open source this work and to travel to
Boston to present this work at the Vault conference. However, my
understanding of IBNBD is that this driver has several shortcomings neither
NVMeOF nor iSER nor SRP have:
* Doesn't scale in terms of number of CPUs submitting I/O. The graphs shown
during the Vault talk clearly illustrate this. This is probably the result
of sharing a data structure across all client CPUs, maybe the bitmap that
tracks which parts of the target buffer space are in use.
* Supports IB but none of the other RDMA transports (RoCE / iWARP).
We also need performance numbers that compare IBNBD against SRP and/or
NVMeOF with memory registration disabled to see whether and how much faster
IBNBD is compared to these two protocols.
The fact that IBNBD only needs two messages per I/O is an advantage it has
today over SRP but not over NVMeOF nor over iSER. The upstream initiator
drivers for the latter two protocols already support inline data.
Another question I have is whether integration with multipathd is supported?
If multipathd tries to run scsi_id against an IBNBD client device that will
fail.
Thanks,
Bart.
^ permalink raw reply
* [GIT PULL] Block fixes for 4.11-rc
From: Jens Axboe @ 2017-03-24 14:12 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Hi Linus,
A few fixes for the current series that should go into -rc4. This pull
request contains:
- A fix for a potential corruption of un-started requests from Ming.
- A blk-stat fix from Omar, ensuring we flush the stat batch before
checking nr_samples.
- A set of fixes from Sagi for the nvmeof family.
Please pull!
git://git.kernel.dk/linux-block.git for-linus
----------------------------------------------------------------
Jens Axboe (1):
Merge branch 'nvme-4.11-rc' of git://git.infradead.org/nvme into for-linus
Ming Lei (1):
blk-mq: don't complete un-started request in timeout handler
Omar Sandoval (1):
blk-stat: fix blk_stat_sum() if all samples are batched
Sagi Grimberg (5):
nvme-loop: fix a possible use-after-free when destroying the admin queue
nvmet: confirm sq percpu has scheduled and switched to atomic
nvmet-rdma: Fix a possible uninitialized variable dereference
nvme-rdma: handle cpu unplug when re-establishing the controller
nvme-loop: handle cpu unplug when re-establishing the controller
block/blk-mq.c | 11 +-----
block/blk-stat.c | 4 +-
drivers/nvme/host/rdma.c | 28 +++++++-------
drivers/nvme/target/core.c | 11 +++++-
drivers/nvme/target/loop.c | 90 +++++++++++++++++++++++++--------------------
drivers/nvme/target/nvmet.h | 1 +
drivers/nvme/target/rdma.c | 8 ++--
7 files changed, 82 insertions(+), 71 deletions(-)
--
Jens Axboe
^ permalink raw reply
* RE: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)
From: Steve Wise @ 2017-03-24 14:20 UTC (permalink / raw)
To: 'Jack Wang', linux-block, linux-rdma
Cc: dledford, axboe, hch, mail, Milind.dumbare, yun.wang
In-Reply-To: <1490352343-20075-1-git-send-email-jinpu.wangl@profitbricks.com>
>
> From: Jack Wang <jinpu.wang@profitbricks.com>
>
> This series introduces IBNBD/IBTRS kernel modules.
>
> IBNBD (InfiniBand network block device) allows for an RDMA transfer of block IO
> over InfiniBand network. The driver presents itself as a block device on client
> side and transmits the block requests in a zero-copy fashion to the server-side
> via InfiniBand. The server part of the driver converts the incoming buffers back
> into BIOs and hands them down to the underlying block device. As soon as IO
> responses come back from the drive, they are being transmitted back to the
> client.
Hey Jack, why is this IB specific? Can it work over iWARP transports as well?
Steve.
^ permalink raw reply
* Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)
From: Jinpu Wang @ 2017-03-24 14:24 UTC (permalink / raw)
To: Bart Van Assche
Cc: jthumshirn@suse.de, linux-block@vger.kernel.org,
linux-rdma@vger.kernel.org, mail@fholler.de,
yun.wang@profitbricks.com, hch@lst.de, axboe@kernel.dk,
Milind.dumbare@gmail.com, dledford@redhat.com, Danil Kipnis,
Roman Penyaev
In-Reply-To: <1490362271.2516.4.camel@sandisk.com>
On Fri, Mar 24, 2017 at 2:31 PM, Bart Van Assche
<Bart.VanAssche@sandisk.com> wrote:
> On Fri, 2017-03-24 at 13:46 +0100, Jinpu Wang wrote:
>> Our IBNBD project was started 3 years ago based on our need for Cloud
>> Computing, NVMeOF is a bit younger.
>> - IBNBD is one of our components, part of our software defined storage solution.
>> - As I listed in features, IBNBD has its own features
>>
>> We're planning to look more into NVMeOF, but it's not a replacement for IBNBD.
>
> Hello Jack, Danil and Roman,
>
> Thanks for having taken the time to open source this work and to travel to
> Boston to present this work at the Vault conference. However, my
> understanding of IBNBD is that this driver has several shortcomings neither
> NVMeOF nor iSER nor SRP have:
> * Doesn't scale in terms of number of CPUs submitting I/O. The graphs shown
> during the Vault talk clearly illustrate this. This is probably the result
> of sharing a data structure across all client CPUs, maybe the bitmap that
> tracks which parts of the target buffer space are in use.
> * Supports IB but none of the other RDMA transports (RoCE / iWARP).
>
> We also need performance numbers that compare IBNBD against SRP and/or
> NVMeOF with memory registration disabled to see whether and how much faster
> IBNBD is compared to these two protocols.
>
> The fact that IBNBD only needs two messages per I/O is an advantage it has
> today over SRP but not over NVMeOF nor over iSER. The upstream initiator
> drivers for the latter two protocols already support inline data.
>
> Another question I have is whether integration with multipathd is supported?
> If multipathd tries to run scsi_id against an IBNBD client device that will
> fail.
>
> Thanks,
>
> Bart.
Hello Bart,
Thanks for your comments. As usual for an in-house driver, it mainly covers
the needs of ProfitBricks, and we have only tested it in our hardware
environment. We only use IB, not RoCE/iWARP. The idea behind open-sourcing
it is to:
- Present our design/implementation/tradeoffs; others might be interested.
- Attract more attention from developers/testers, so we can improve
the project better and faster.
We will gather performance data comparing IBNBD with NVMeOF for the next submission.
Multipath is not supported; we're using APM for failover (patch from
Mellanox developers).
Thanks,
--
Jack Wang
Linux Kernel Developer
ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin
Tel: +49 30 577 008 042
Fax: +49 30 577 008 299
Email: jinpu.wang@profitbricks.com
URL: https://www.profitbricks.de
Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss
^ permalink raw reply
* Re: [PATCH 01/28] ibtrs: add header shared between ibtrs_client and ibtrs_server
From: Johannes Thumshirn @ 2017-03-24 14:31 UTC (permalink / raw)
To: Jinpu Wang
Cc: linux-block, linux-rdma@vger.kernel.org, Doug Ledford, Jens Axboe,
hch, Fabian Holler, Milind Dumbare, Michael Wang, Kleber Souza,
Danil Kipnis, Roman Pen
In-Reply-To: <CAMGffEn7Q+Tchaj4RXV1zMk0MzHqGRv=0W5Vd1G_-GvZaG8tPA@mail.gmail.com>
On Fri, Mar 24, 2017 at 01:54:04PM +0100, Jinpu Wang wrote:
> >> +
> >> +#define XX(a) case (a): return #a
> >
> > please no macros with return in them, and XX isn't very descriptive
> > either.
> >
> > [...]
> >
> >> +static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
> >> +{
> >> + switch (opcode) {
> >> + XX(IB_WC_SEND);
> >> + XX(IB_WC_RDMA_WRITE);
> >> + XX(IB_WC_RDMA_READ);
> >> + XX(IB_WC_COMP_SWAP);
> >> + XX(IB_WC_FETCH_ADD);
> >> + /* recv-side); inbound completion */
> >> + XX(IB_WC_RECV);
> >> + XX(IB_WC_RECV_RDMA_WITH_IMM);
> >> + default: return "IB_WC_OPCODE_UNKNOWN";
> >> + }
> >> +}
> >
> > How about:
> >
> > struct {
> > char *name;
> > enum ib_wc_opcode opcode;
> > } ib_wc_opcode_table[] = {
> > { __stringify(IB_WC_SEND), IB_WC_SEND },
> > { __stringify(IB_WC_RDMA_WRITE), IB_WC_RDMA_WRITE },
> > { __stringify(IB_WC_RDMA_READ), IB_WC_RDMA_READ },
> > { __stringify(IB_WC_COMP_SWAP), IB_WC_COMP_SWAP },
> > { __stringify(IB_WC_FETCH_ADD), IB_WC_FETCH_ADD },
> > { __stringify(IB_WC_RECV), IB_WC_RECV },
> > { __stringify(IB_WC_RECV_RDMA_WITH_IMM), IB_WC_RECV_RDMA_WITH_IMM },
> > { NULL, 0 },
> > };
> >
> > static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
> > {
> > int i;
> >
> > for (i = 0; i < ARRAY_SIZE(ib_wc_opcode_table); i++)
> > if (ib_wc_opcode_table[i].opcode == opcode)
> > return ib_wc_opcode_table[i].name;
> >
> > return "IB_WC_OPCODE_UNKNOWN";
> > }
> >
> Looks nice, might be better to put it into ib_verbs.h?
Probably yes, as are your kvec functions for lib/iov_iter.c
[...]
> > What about resolving the kernel bug instead of making workarounds?
> I tried to send a patch upstream, but it was rejected by Sean.
> http://www.spinics.net/lists/linux-rdma/msg22381.html
>
I don't see a NACK in this thread.
From http://www.spinics.net/lists/linux-rdma/msg22410.html:
"The port space (which maps to the service ID) needs to be included as part of
the check that determines the format of the private data, and not simply the
address family."
After such a statement I would have expected to see a v2 of the patch with above
comment addressed.
Byte,
Johannes
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply
* Re: [PATCH 01/28] ibtrs: add header shared between ibtrs_client and ibtrs_server
From: Jinpu Wang @ 2017-03-24 14:35 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-block, linux-rdma@vger.kernel.org, Doug Ledford, Jens Axboe,
hch, Fabian Holler, Milind Dumbare, Michael Wang, Kleber Souza,
Danil Kipnis, Roman Pen
In-Reply-To: <20170324143127.GI3571@linux-x5ow.site>
On Fri, Mar 24, 2017 at 3:31 PM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> On Fri, Mar 24, 2017 at 01:54:04PM +0100, Jinpu Wang wrote:
>> >> +
>> >> +#define XX(a) case (a): return #a
>> >
>> > please no macros with return in them, and XX isn't very descriptive
>> > either.
>> >
>> > [...]
>> >
>> >> +static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
>> >> +{
>> >> + switch (opcode) {
>> >> + XX(IB_WC_SEND);
>> >> + XX(IB_WC_RDMA_WRITE);
>> >> + XX(IB_WC_RDMA_READ);
>> >> + XX(IB_WC_COMP_SWAP);
>> >> + XX(IB_WC_FETCH_ADD);
>> >> + /* recv-side); inbound completion */
>> >> + XX(IB_WC_RECV);
>> >> + XX(IB_WC_RECV_RDMA_WITH_IMM);
>> >> + default: return "IB_WC_OPCODE_UNKNOWN";
>> >> + }
>> >> +}
>> >
>> > How about:
>> >
>> > struct {
>> > char *name;
>> > enum ib_wc_opcode opcode;
>> > } ib_wc_opcode_table[] = {
>> > { __stringify(IB_WC_SEND), IB_WC_SEND },
>> > { __stringify(IB_WC_RDMA_WRITE), IB_WC_RDMA_WRITE },
>> > { __stringify(IB_WC_RDMA_READ), IB_WC_RDMA_READ },
>> > { __stringify(IB_WC_COMP_SWAP), IB_WC_COMP_SWAP },
>> > { __stringify(IB_WC_FETCH_ADD), IB_WC_FETCH_ADD },
>> > { __stringify(IB_WC_RECV), IB_WC_RECV },
>> > { __stringify(IB_WC_RECV_RDMA_WITH_IMM), IB_WC_RECV_RDMA_WITH_IMM },
>> > { NULL, 0 },
>> > };
>> >
>> > static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
>> > {
>> > int i;
>> >
>> > for (i = 0; i < ARRAY_SIZE(ib_wc_opcode_table); i++)
>> > if (ib_wc_opcode_table[i].opcode == opcode)
>> > return ib_wc_opcode_table[i].name;
>> >
>> > return "IB_WC_OPCODE_UNKNOWN";
>> > }
>> >
>> Looks nice, might be better to put it into ib_verbs.h?
>
> Probably yes, as are your kvec functions for lib/iov_iter.c
Thanks, will do in next round!
>
> [...]
>
>> > What about resolving the kernel bug instead of making workarounds?
>> I tried to send a patch upstream, but it was rejected by Sean.
>> http://www.spinics.net/lists/linux-rdma/msg22381.html
>>
>
> I don't see a NACK in this thread.
>
> From http://www.spinics.net/lists/linux-rdma/msg22410.html:
> "The port space (which maps to the service ID) needs to be included as part of
> the check that determines the format of the private data, and not simply the
> address family."
>
> After such a statement I would have expected to see a v2 of the patch with above
> comment addressed.
I might have been busy with other stuff at that time; I will check again and
revisit the bug.
>
> Byte,
> Johannes
> --
> Johannes Thumshirn Storage
> jthumshirn@suse.de +49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Regards,
--
Jack Wang
Linux Kernel Developer
ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin
Tel: +49 30 577 008 042
Fax: +49 30 577 008 299
Email: jinpu.wang@profitbricks.com
URL: https://www.profitbricks.de
Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss
^ permalink raw reply