From: Md Haris Iqbal
To: linux-block@vger.kernel.org, linux-rdma@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, axboe@kernel.dk, bvanassche@acm.org, hch@lst.de, jgg@ziepe.ca, leon@kernel.org, jinpu.wang@ionos.com, Md Haris Iqbal, Jia Li
Subject: [PATCH 01/13] RDMA/rmr: add public and private headers
Date: Tue, 5 May 2026 09:46:13 +0200
Message-ID: <20260505074644.195453-2-haris.iqbal@ionos.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20260505074644.195453-1-haris.iqbal@ionos.com>
References: <20260505074644.195453-1-haris.iqbal@ionos.com>
X-Mailing-List: linux-rdma@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Reliable Multicast over RTRS (RMR) is an RDMA ULP that provides
active-active block-level replication on top of the RTRS transport. It
guarantees delivery of an I/O to a group of storage nodes and handles
resynchronization of data between storage nodes without involving the
compute client.
Add the public interface header (rmr.h) used by upper-layer consumers,
plus the private headers shared between the client and server modules:

  rmr-proto.h  wire protocol definitions
  rmr-pool.h   pool data structures
  rmr-map.h    dirty-map data structures
  rmr-req.h    server-side request lifecycle helpers
  rmr-clt.h    client-side structs and function declarations
  rmr-srv.h    server-side structs and function declarations, including
               the IO store interface (struct rmr_srv_store_ops)
               implemented by an upper-layer consumer

No code is compiled by this patch on its own; the modules are wired into
the build in a later patch in this series.

Signed-off-by: Md Haris Iqbal
Signed-off-by: Jia Li
---
 drivers/infiniband/ulp/rmr/rmr-clt.h   | 291 ++++++++++++++++++
 drivers/infiniband/ulp/rmr/rmr-map.h   | 246 +++++++++++++++
 drivers/infiniband/ulp/rmr/rmr-pool.h  | 400 +++++++++++++++++++++++++
 drivers/infiniband/ulp/rmr/rmr-proto.h | 273 +++++++++++++++++
 drivers/infiniband/ulp/rmr/rmr-req.h   |  65 ++++
 drivers/infiniband/ulp/rmr/rmr-srv.h   | 219 ++++++++++++++
 drivers/infiniband/ulp/rmr/rmr.h       | 229 ++++++++++++++
 7 files changed, 1723 insertions(+)
 create mode 100644 drivers/infiniband/ulp/rmr/rmr-clt.h
 create mode 100644 drivers/infiniband/ulp/rmr/rmr-map.h
 create mode 100644 drivers/infiniband/ulp/rmr/rmr-pool.h
 create mode 100644 drivers/infiniband/ulp/rmr/rmr-proto.h
 create mode 100644 drivers/infiniband/ulp/rmr/rmr-req.h
 create mode 100644 drivers/infiniband/ulp/rmr/rmr-srv.h
 create mode 100644 drivers/infiniband/ulp/rmr/rmr.h

diff --git a/drivers/infiniband/ulp/rmr/rmr-clt.h b/drivers/infiniband/ulp/rmr/rmr-clt.h
new file mode 100644
index 000000000000..c50651efe4a3
--- /dev/null
+++ b/drivers/infiniband/ulp/rmr/rmr-clt.h
@@ -0,0 +1,291 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Reliable multicast over RTRS (RMR)
+ *
+ * Copyright (c) 2026 IONOS SE
+ */
+
+#ifndef RMR_CLT_H
+#define RMR_CLT_H
+
+#include
+#include "rmr-pool.h"
+
+#define RECONNECT_DELAY 30
+#define MAX_RECONNECTS
-1 +#define RTRS_LINK_NAME "rtrs" + +#define RMR_MAP_CLEAN_DELAY_MS 5000 +#define RMR_RECOVER_INTERVAL_MS 3000 + +enum rmr_clt_sess_state { + RMR_CLT_SESS_DISCONNECTED = 1, + RMR_CLT_SESS_CONNECTED, +}; + +struct rmr_clt_sess { + char sessname[NAME_MAX]; + struct kobject kobj; + struct mutex lock; + struct rtrs_clt_sess *rtrs; + bool rtrs_ready; + /* server this session is connected to */ + int queue_depth; + u32 max_io_size; + u32 max_segments; + struct list_head pool_sess_list; + struct list_head g_list; + struct kref kref; + enum rmr_clt_sess_state state; +}; + +/* + * NB: If you change here, make sure the changes are in sync with + * pool_sess state machine routine i.e. pool_sess_change_state(). + */ +enum rmr_clt_pool_sess_state { + RMR_CLT_POOL_SESS_CREATED = 1, // No IO, No dirty map addition, Yes cmd msgs + RMR_CLT_POOL_SESS_NORMAL, // Yes IO, No dirty map addition, Yes cmd msgs + RMR_CLT_POOL_SESS_FAILED, // No IO, Yes dirty map addition, No cmd msgs + RMR_CLT_POOL_SESS_RECONNECTING, // No IO, Yes, dirty map addition, Yes cmd msgs + // But not with an updated map + + RMR_CLT_POOL_SESS_REMOVING // No IO, No dirty map addition, Yes cmd msgs + // Getting removed from pool +}; + +struct rmr_clt_pool_sess { + char sessname[NAME_MAX]; + struct rmr_pool *pool; + struct kobject kobj; + u8 member_id; /* refers to the pool id on the */ + struct kobject sess_kobj; + struct list_head entry; /* for pool->sess_list */ + struct list_head clt_sess_entry; /* for clt_sess->pool_sess_list */ + struct rmr_clt_sess *clt_sess; + atomic_t state; /* rmr_clt_pool_sess_state */ + u8 ver; /* protocol version */ + u8 pool_id; /* refers to the pool id on the */ + bool maintenance_mode; /* If the pool is in maintenance mode or not */ + bool was_last_authoritative; /* last NORMAL sess before it went FAILED; + * carries complete dirty maps for all members */ +}; + +struct rmr_clt_stats { + struct kobject kobj_stats; + atomic_t read_retries; +}; + +/* + * State descriptions: + * 
RMR_CLT_POOL_STATE_JOINED: An rmr_clt_pool which has one or more legs (rmr_clt_pool_sess) + * added to it. This means the pool has joined into pools from + * storage nodes + * + * RMR_CLT_POOL_STATE_IN_USE: An rmr_clt_pool which is in use by an upper layer client. This + * is usually done by calling rmr_clt_open + * + * Note: When adding a new state, + * remember to add an entry in the function rmr_get_clt_pool_state_name() + */ +enum rmr_clt_pool_state { + RMR_CLT_POOL_STATE_JOINED = 0, + RMR_CLT_POOL_STATE_IN_USE, + // RMR_CLT_POOL_STATE_DEGRADED, uncomment and use + // RMR_CLT_POOL_STATE_DIRTY, + RMR_CLT_POOL_STATE_MAX, +}; + +struct rmr_clt_pool { + struct rmr_pool *pool; + refcount_t refcount; + unsigned long state; + struct mutex clt_pool_lock; + + size_t queue_depth; + + struct rmr_clt_stats stats; + struct kobject stats_kobj; + + void *priv; /* provided by user */ + rmr_clt_ev_fn *link_ev; /* deliver events to user */ + + atomic_t io_freeze; + wait_queue_head_t map_update_wq; + struct mutex io_freeze_lock; + + struct workqueue_struct *recover_wq; + struct delayed_work recover_dwork; + + /* use sessions round robbin to read */ + struct rmr_clt_pool_sess __rcu *__percpu *pcpu_sess; +}; + +struct rmr_iu_comp { + wait_queue_head_t wait; + int errno; +}; + +/** + * rmr_iu - reserves resources needed to do an I/O op on pool + */ +struct rmr_iu { + struct rmr_pool *pool; + unsigned int mem_id; + struct list_head sess_list; /* list of per-session tags */ + u8 num_sessions; + refcount_t ref; /* lifetime refcount */ + struct rmr_msg_io msg; + int errno; + atomic_t succeeded; + refcount_t refcount; + rmr_conf_fn *conf; + void *priv; + /* for retry of failed reads */ + struct work_struct work; + struct scatterlist *sg; + unsigned int sg_cnt; +}; + +struct rmr_clt_sess_iu { + void *buf; /* for session messages */ + struct rtrs_permit *permit; + struct rmr_clt_pool_sess *pool_sess; + int errno; + union { + /* for session messages only */ + struct scatterlist sg; + /* for 
tag->sess_list of io messages*/ + struct list_head entry; + }; + + /* for session messages only */ + struct work_struct work; + + /* for io requests */ + struct rmr_iu *rmr_iu; + unsigned int mem_id; + + /* for command messages */ + struct rmr_clt_cmd_unit *rmr_cmd_unit; + + /* for session messages only */ + struct rmr_iu_comp comp; + atomic_t refcount; +}; + +struct rmr_clt_iu_comp { + wait_queue_head_t wait; + int errno; +}; + +struct rmr_clt_cmd_unit { + struct rmr_pool *pool; + struct rmr_clt_pool *clt_pool; + + struct list_head sess_list; + int num_sessions; + + int failed_state; + int errno; + atomic_t succeeded; + refcount_t refcount; + + rmr_conf_fn *conf; + void *priv; +}; + +/* rmr-clt.c */ +struct rmr_pool *rmr_clt_create_pool(const char *name); +void rmr_put_clt_pool(struct rmr_clt_pool *clt_pool); + +void rmr_clt_change_pool_state(struct rmr_clt_pool *rmr_clt_pool, + enum rmr_clt_pool_state new_state, bool set); +int rmr_clt_remove_pool_from_sysfs(struct rmr_pool *pool, + const struct attribute *sysfs_self); +struct rmr_clt_sess *find_and_get_or_create_clt_sess(char *sessname, + struct rtrs_addr *paths, + size_t path_cnt); +struct rmr_clt_pool_sess *rmr_clt_add_pool_sess(struct rmr_pool *pool, + struct rmr_clt_sess *clt_sess, bool create); +void rmr_clt_sess_put(struct rmr_clt_sess *sess); +void rmr_clt_del_pool_sess(struct rmr_clt_pool_sess *sess); +void rmr_clt_destroy_pool_sess(struct rmr_clt_pool_sess *sess, bool delete); + +const char *rmr_clt_sess_state_str(enum rmr_clt_pool_sess_state state); +void resend_join_pool(struct rmr_clt_sess *sess); +int rmr_clt_reconnect_sess(struct rmr_clt_sess *sess, + const struct rtrs_addr *paths, + size_t path_cnt); +int rmr_clt_start_last_io_update(struct rmr_pool *pool); +int rmr_clt_set_pool_sess_mm(struct rmr_clt_pool_sess *pool_sess); +int rmr_clt_enable_sess(struct rmr_clt_pool_sess *sess); + +int rmr_clt_send_map_update(struct rmr_pool *pool, struct rmr_iu *iu); + +int rmr_clt_pool_send_all(struct rmr_pool 
*pool, struct rmr_msg_pool_cmd *msg); +int rmr_clt_send_cmd_with_data(struct rmr_pool *pool, struct rmr_clt_pool_sess *pool_sess, + struct rmr_msg_pool_cmd *msg, + void *buf, unsigned int buflen); +int rmr_clt_map_add_id(struct rmr_pool *pool, int stg_id, rmr_id_t id); +void rmr_clt_init_cmd(struct rmr_pool *pool, struct rmr_msg_pool_cmd *msg); +int rmr_clt_pool_send_cmd(struct rmr_clt_pool_sess *sess, struct rmr_msg_pool_cmd *msg, bool wait); +int rmr_clt_del_stor_from_pool(struct rmr_clt_pool_sess *pool_sess, bool delete); +bool rmr_clt_sess_is_sync(struct rmr_clt_pool_sess *sess); +int send_msg_leave_pool(struct rmr_clt_pool_sess *pool_sess, bool delete, bool wait); +void rmr_clt_free_pool_sess(struct rmr_clt_pool_sess *pool_sess); +int rmr_clt_send_map(struct rmr_pool *map_src_pool, struct rmr_pool *clt_pool, + const struct rmr_msg_map_send_cmd *map_send_cmd, rmr_map_filter filter); +int rmr_clt_test_map(struct rmr_pool *src_pool, struct rmr_pool *dst_pool); +int rmr_clt_send_cmd_with_data_all(struct rmr_pool *pool, struct rmr_msg_pool_cmd *msg, + void *buf, unsigned int buflen); +int rmr_clt_pool_send_md_all(struct rmr_pool *src_pool, struct rmr_pool *clt_pool); +int rmr_clt_pool_send_cmd_all(struct rmr_pool *pool, enum rmr_msg_cmd_type cmd_type); +void recover_work(struct work_struct *work); + +int rmr_clt_pool_member_synced(struct rmr_pool *pool, u8 member_id); + +bool pool_sess_change_state(struct rmr_clt_pool_sess *pool_sess, + enum rmr_clt_pool_sess_state newstate); + +void rmr_clt_pool_io_freeze(struct rmr_clt_pool *clt_pool); +void rmr_clt_pool_io_unfreeze(struct rmr_clt_pool *clt_pool); +void rmr_clt_pool_io_wait_complete(struct rmr_clt_pool *clt_pool); +int rmr_clt_pool_try_enable(struct rmr_pool *pool); +int send_msg_enable_pool(struct rmr_clt_pool_sess *pool_sess, bool enable); + +void rmr_get_iu(struct rmr_iu *iu); +void rmr_put_iu(struct rmr_iu *iu); +void rmr_msg_put_iu(struct rmr_clt_pool_sess *pool_sess, + struct rmr_clt_sess_iu *sess_iu); 
+void wake_up_iu_comp(struct rmr_clt_sess_iu *sess_iu); +void msg_conf(void *priv, int errno); + +/* rmr-map-mgmt.c */ +void send_map_check(struct rmr_clt_pool_sess *pool_sess); +void send_store_check(struct rmr_clt_pool_sess *pool_sess); +int send_map_get_version(struct rmr_clt_pool_sess *pool_sess, u64 *ver); +int send_discard(struct rmr_clt_pool_sess *pool_sess, u8 cmd_type, u8 member_id); +int rmr_clt_handle_map_check_rsp(struct rmr_clt_pool_sess *pool_sess, + struct rmr_msg_pool_cmd_rsp *rsp); +int rmr_clt_handle_store_check_rsp(struct rmr_clt_pool_sess *pool_sess, + struct rmr_msg_pool_cmd_rsp *rsp); +int rmr_clt_read_map(struct rmr_pool *pool); +int rmr_clt_spread_map(struct rmr_pool *pool, struct rmr_clt_pool_sess *pool_sess_chosen, + bool enable, bool skip_normal); +int rmr_clt_unset_pool_sess_mm(struct rmr_clt_pool_sess *pool_sess); +void sched_map_add(struct work_struct *work); +void msg_pool_cmd_map_content_conf(struct work_struct *work); + +/* rmr-clt-sysfs.c */ +int rmr_clt_create_sysfs_files(void); +void rmr_clt_destroy_sysfs_files(void); +void rmr_clt_destroy_pool_sysfs_files(struct rmr_pool *pool, + const struct attribute *sysfs_self); +int rmr_clt_create_clt_sess_sysfs_files(struct rmr_clt_sess *clt_sess); +void rmr_clt_destroy_clt_sess_sysfs_files(struct rmr_clt_sess *clt_sess); + +int rmr_clt_reset_read_retries(struct rmr_clt_stats *stats, bool enable); +ssize_t rmr_clt_stats_read_retries_to_str(struct rmr_clt_stats *stats, char *page); + +#endif /* RMR_CLT_H */ diff --git a/drivers/infiniband/ulp/rmr/rmr-map.h b/drivers/infiniband/ulp/rmr/rmr-map.h new file mode 100644 index 000000000000..76ef6506421f --- /dev/null +++ b/drivers/infiniband/ulp/rmr/rmr-map.h @@ -0,0 +1,246 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Reliable multicast over RTRS (RMR) + * + * Copyright (c) 2026 IONOS SE + */ + +#ifndef RMR_MAP_H +#define RMR_MAP_H + +#include +#include + +#include "rmr.h" + +/** + * The dirty map buffer is used to track dirty 
chunks through bits.
+ * The position of the bit denotes the chunk number it tracks.
+ *
+ * Bitmap structure
+ * ----------------
+ * The dirty bitmap is stored in a two-level, tree-like structure.
+ * The basic unit of storage is the memory page; pages act as the nodes of this
+ * structure.
+ * The first-level pages (FLP) store the addresses of the second-level pages.
+ * There can be a total of 256 first-level pages.
+ * The second-level pages (SLP, also the leaf nodes/pages) store the bitmap.
+ *
+ * A first-level page has to store the addresses of second-level pages. Since an
+ * address is 8B (default/max) long, at most 512 page addresses fit in one
+ * first-level page. This determines the maximum number of leaf pages a pool can
+ * have, which is [(# of FLPs) * (PAGE_SIZE / address size)] = 256 * 512 = 131072.
+ * With the above, the space available for the bitmap is
+ * 131072 * 4KB (PAGE_SIZE) = 512MB.
+ *
+ * A chunk is the smallest unit of data that is tracked for being dirty. A chunk
+ * is considered dirty/unsynced even if only a single byte in it is
+ * dirty/unsynced.
+ * To track a chunk, a single byte (1B) is used. The least significant bit
+ * signifies whether the chunk is dirty (set) or not. The other bits can be used
+ * for other purposes (for example, filters). The maximum number of chunks RMR
+ * can manage is therefore 512MB / 1B = 536870912.
+ * This number is fixed, as the calculations show, and hence the maximum amount
+ * of metadata RMR can allocate and use is fixed.
+ *
+ * The user-configurable part is the chunk size. Its range is 128KB-1MB, and it
+ * has to be a power of 2.
+ * The chunk size decides the maximum mapped size of an RMR pool.
+ * For example, with a 1MB chunk size and the maximum number of chunks RMR can
+ * allocate and handle (536870912, see above), the maximum mapped size would be
+ * 536870912 * 1MB = 512TB.
+ * The following table shows the relation between chunk size and maximum mapped size:
+ * Chunk size   Maximum mapped size
+ * 128KB        64TB
+ * 256KB        128TB
+ * 512KB        256TB
+ * 1MB          512TB
+ *
+ * Calculating the chunk number
+ * ----------------------------
+ * Some key points:
+ * 1) The Linux kernel has a fixed sector size of 512 bytes (a bitshift of 9).
+ * 2) The mapped_size provided and stored in the rmr_pool structure is in sectors.
+ * 3) The chunk_size provided and stored in the rmr_pool structure is in bytes.
+ * 4) The code calculates and stores chunk_size_shift in the rmr_pool structure
+ *    for fast calculation.
+ * 5) The IO offset given to RMR (through the function rmr_clt_request) is in bytes.
+ *
+ * --
+ * With the above points, let's take a sample scenario with mapped_size 1GB and
+ * chunk_size 128KB. The numbers would then be:
+ *
+ * no_of_chunks = (mapped_size / chunk_size)
+ * no_of_chunks = 8192
+ *
+ * chunk_size = 131072
+ * chunk_size_shift = 17
+ *
+ * dirty_map buffer size (in BYTES) = (no_of_chunks / bits in a byte)
+ * dirty_map buffer size (in BYTES) = 1024
+ *
+ * --
+ * Let's do a sample calculation of chunk_no from the offset and length of an IO.
+ *
+ * For offset 30801920 and length 4096:
+ *
+ * chunk_no = (offset >> chunk_size_shift)
+ * chunk_no = 235
+ */
+
+#define RMR_KEY_SHIFT 32
+
+// Each chunk requires 1B of metadata
+#define PER_CHUNK_MD 1
+#define PER_CHUNK_MD_LOG2 ilog2(PER_CHUNK_MD)
+
+#define GET_CHUNK_NUMBER(offset, shift) ((offset) >> (shift))
+#define GET_FOLLOWING_CHUNKS(offset_len, shift, start) ((((offset_len) - 1) >> (shift)) - (start) + 1)
+
+#define CHUNK_TO_OFFSET(chunk_no, shift) ((chunk_no) << (shift))
+
+// The element type stored in FLP
+typedef unsigned long el_flp;
+
+enum {
+	CHUNK_DIRTY_BIT = 0,
+	CHUNK_FILTER_BIT,
+};
+
+enum {
+	MAX_NO_OF_FLP = 256,
+	NO_OF_SLP_PER_FLP = (PAGE_SIZE >> ilog2(sizeof(void *))),
+	NO_OF_SLP_PER_FLP_LOG2 = ilog2(NO_OF_SLP_PER_FLP),
+	MAX_NO_OF_SLP = (MAX_NO_OF_FLP * NO_OF_SLP_PER_FLP),
+
+	NO_OF_CHUNKS_PER_PAGE = (PAGE_SIZE >> PER_CHUNK_MD_LOG2),
+	// Chunks data is stored only in SLP
+	MAX_NO_OF_CHUNKS = (MAX_NO_OF_SLP * NO_OF_CHUNKS_PER_PAGE),
+
+	CHUNKS_PER_SLP = (PAGE_SIZE >> PER_CHUNK_MD_LOG2),
+	CHUNKS_PER_SLP_LOG2 = ilog2(CHUNKS_PER_SLP),
+	CHUNKS_PER_FLP = (CHUNKS_PER_SLP * NO_OF_SLP_PER_FLP),
+	CHUNKS_PER_FLP_LOG2 = ilog2(CHUNKS_PER_FLP),
+};
+
+typedef enum {
+	MAP_NO_FILTER = 0,
+	MAP_ENTRY_UNSYNCED
+} rmr_map_filter;
+
+enum rmr_map_state {
+	RMR_MAP_STATE_NO_CHECK = 0,
+	RMR_MAP_STATE_CHECKING,
+	// do we have some other useful states ?
+};
+
+struct rmr_dirty_id_map {
+	u8 member_id;
+	struct xarray rmr_id_map;
+	unsigned long ts;
+	atomic_t check_state;
+
+	/*
+	 * Use of this field is restricted to forming a linked list
+	 * during mass deletion. Since this map sits on an RCU list
+	 * (maps in rmr_pool), we cannot reuse that list or change any
+	 * data until the RCU grace period completes. So we use this
+	 * next pointer during mass deletion to build a list, so that
+	 * we don't have to wait and restart the search for every
+	 * individual map deletion. Refer to destroy_clt_pool().
+	 */
+	struct rmr_dirty_id_map *next;
+
+	u64 no_of_chunks;
+	u64 no_of_flp;
+	u64 no_of_slp_in_last_flp;
+	u64 no_of_chunk_in_last_slp;
+	u64 total_slp;
+	u8 *bitmap_filter;
+	void *dirty_bitmap[MAX_NO_OF_FLP];
+};
+
+struct rmr_map_entry {
+	atomic_t sync_cnt;
+	struct llist_head wait_list;
+};
+
+/*
+ * The header of the bitmap buffer.
+ */ +struct rmr_map_cbuf_hdr { + u64 version; + u8 member_id; + + u64 no_of_chunks; + u64 no_of_flp; + u64 no_of_slp_in_last_flp; + u64 no_of_chunk_in_last_slp; + u64 total_slp; +} __packed; + +static inline unsigned long rmr_id_to_key(rmr_id_t id) +{ + unsigned long res; + + // highest bits for id.a, the rest are for id.b; + res = ((id.a << RMR_KEY_SHIFT) | id.b); + return res; +} + +static inline u64 key_to_a(unsigned long key) +{ + return key >> RMR_KEY_SHIFT; +} + +static inline u64 key_to_b(unsigned long key) +{ + return key & ((1ULL << RMR_KEY_SHIFT) - 1); +} + +void rmr_map_update_page_params(struct rmr_dirty_id_map *map); +struct rmr_dirty_id_map *rmr_map_create(struct rmr_pool *pool, u8 member_id); +void rmr_map_destroy(struct rmr_dirty_id_map *map); +void rmr_map_calc_chunk(struct rmr_pool *pool, size_t offset, size_t length, rmr_id_t *id); +void rmr_map_set_dirty(struct rmr_dirty_id_map *map, rmr_id_t id, u8 filter); +void rmr_map_set_dirty_all(struct rmr_dirty_id_map *map, u8 filter); +struct rmr_map_entry *rmr_map_unset_dirty(struct rmr_dirty_id_map *map, rmr_id_t id, u8 filter); +bool rmr_map_check_dirty(struct rmr_dirty_id_map *map, rmr_id_t id); +struct rmr_map_entry *rmr_map_get_dirty_entry(struct rmr_dirty_id_map *map, rmr_id_t id); +void rmr_map_clear_filter_all(struct rmr_dirty_id_map *map, u8 filter); +void rmr_map_unset_dirty_all(struct rmr_dirty_id_map *map); +bool rmr_map_empty(struct rmr_dirty_id_map *map); + +void rmr_map_bitwise_or_buf(void *dst_buf, void *src_buf, u32 buf_size); +int rmr_map_create_entries(struct rmr_dirty_id_map *map); + +void rmr_map_hexdump_bitmap_buf(u8 member_id, void *buf, u32 buf_size); +void rmr_map_slps_to_buf(struct rmr_dirty_id_map *map, u64 slp_idx, u64 no_of_slp, u8 *buf); +u64 rmr_map_buf_to_slps(struct rmr_dirty_id_map *map, u8 *buf, u32 buf_size, u64 slp_idx, + bool test); +void rmr_map_dump_bitmap(struct rmr_dirty_id_map *map); +int rmr_map_summary_format(struct rmr_pool *pool, char *buf, size_t 
buf_size); +void rmr_map_bidump_bitmap_buf(void *buf, u8 member_id, u32 buf_size); + +static inline void map_entry_get_sync(struct rmr_map_entry *entry) +{ + atomic_inc(&entry->sync_cnt); + pr_debug("after get ref for entry %p, sync cnt %d\n", + entry, atomic_read(&entry->sync_cnt)); +} + +static inline int map_entry_put_sync(struct rmr_map_entry *entry) +{ + pr_debug("before dec_and_test for entry %p, sync cnt %d\n", + entry, atomic_read(&entry->sync_cnt)); + return atomic_dec_and_test(&entry->sync_cnt); +} + +static inline void rmr_maplist_destroy(struct rmr_dirty_id_map *maplist) +{ + struct rmr_dirty_id_map *mp; + + while (maplist != NULL) { + mp = maplist; + maplist = maplist->next; + rmr_map_destroy(mp); + } +} +#endif /* RMR_MAP_H */ diff --git a/drivers/infiniband/ulp/rmr/rmr-pool.h b/drivers/infiniband/ulp/rmr/rmr-pool.h new file mode 100644 index 000000000000..3cb7d3ae84b9 --- /dev/null +++ b/drivers/infiniband/ulp/rmr/rmr-pool.h @@ -0,0 +1,400 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Reliable multicast over RTRS (RMR) + * + * Copyright (c) 2026 IONOS SE + */ + +#ifndef RMR_POOL_H +#define RMR_POOL_H + +#include /* for NAME_MAX */ +#include +#include +#include /* for jhash() */ +#include /* for round_up */ +#include "rmr.h" +#include "rmr-map.h" + +#define RMR_POOL_MD_MAGIC 0xDEADBEEF +#define XA_TRUE ((void *)1UL) +#define XA_FALSE ((void *)2UL) + +extern struct kmem_cache *rmr_map_entry_cachep; +/* + * enum srv_sync_thread_state + */ +enum srv_sync_thread_state { + SYNC_THREAD_REQ_STOP, /* 0 */ + SYNC_THREAD_STOPPED, + SYNC_THREAD_RUNNING, + SYNC_THREAD_WAIT, +}; + +enum srv_map_update_state { + MAP_UPDATE_STATE_DISABLED, + MAP_UPDATE_STATE_READY, + MAP_UPDATE_STATE_DONE, +}; + +/* The srv pool specific structure */ +struct rmr_srv_md { + u64 map_ver; + u64 mapped_size; /* server store size in sectors */ + u8 member_id; + u8 srv_pool_state; /* server pool state */ + u8 store_state; /* state of io_store */ + u8 map_update_state; + bool 
discard_entries; +}; + +/* Shared by each pool */ +struct rmr_pool_md { + char poolname[NAME_MAX]; + u64 magic; + u32 group_id; + u32 chunk_size; /* rmr client */ + u64 mapped_size; /* client view of store size */ + u32 queue_depth; + u64 map_ver; + struct rmr_srv_md srv_md[RMR_POOL_MAX_SESS]; +} __packed; + +struct rmr_pool { + char poolname[NAME_MAX]; + u32 group_id; /* jhash() on poolname */ + struct kobject kobj; + struct kobject sessions_kobj; + struct list_head entry; /* for global pool_list */ + + struct list_head sess_list; /* list of sessions */ + struct mutex sess_lock; /* protect list of sessions */ + struct srcu_struct sess_list_srcu; + + void *priv; + u64 mapped_size; + u32 chunk_size; + u8 chunk_size_shift; + u64 no_of_chunks; + + struct percpu_ref ids_inflight_ref; + struct completion complete_done; + struct completion confirm_done; + + struct completion discard_done; /* for sync client pool */ + /* Set when waiting for response of discard request */ + atomic_t discard_waiting; + + u8 maps_cnt; + struct mutex maps_lock; + struct rmr_dirty_id_map __rcu + *maps[RMR_POOL_MAX_SESS]; + /* All member ids of the storage nodes */ + struct xarray stg_members; + u64 map_ver; + atomic_t normal_count; /* number of pool sessions currently in NORMAL state */ + struct srcu_struct map_srcu; + + struct rmr_pool_md pool_md; + + bool is_clt; + bool sync; +}; + +/** + * rmr_pool_find_md - find the index of the srv_md with the provided key in the pool_md + * + * @pool_md: the pool_md to search + * @key: the member_id of the server pool to search for + * @empty_slot: the empty slot is required by caller or not + * + * Description: + * Find the index of the srv_md with the matched key. If there is no such a key and the empty + * slot is not required, return -1. + * + * Return: + * >= 0, the index of the key in the pool_md. 
Return the index of an empty slot when the key + * is not found and the empty_slot flag is true + * -1 if the key is not found and empty_slot is false, or the pool_md doesn't exist + */ +static inline int rmr_pool_find_md(struct rmr_pool_md *pool_md, u8 key, bool empty_slot) +{ + int i; + int empty_i = -1; + + if (!pool_md) + return -1; + + for (i = 0; i < RMR_POOL_MAX_SESS; i++) { + if (!pool_md->srv_md[i].member_id) + empty_i = i; + + if (pool_md->srv_md[i].member_id == key) + return i; + } + + if (empty_slot) + return empty_i; + return -1; +} + +/** + * rmr_pool_md_check_discard - check the discard_entries flag of the srv_md + * + * @pool: the pool to check pool_md + * @member_id: the member_id of the srv_md to check + * + * Description: + * Check if the pool has received the discards from the server pool with the provided + * member_id. + * + * Return: + * 1 (true) if the pool has received the discards, + * 0 (false) if the pool has not received the discards, + * <0 if the pool has no info of the server pool + */ +static inline int rmr_pool_md_check_discard(struct rmr_pool *pool, u8 member_id) +{ + int md_i = rmr_pool_find_md(&pool->pool_md, member_id, false); + + if (md_i < 0) { + pr_err("Failed to find md for member_id %u\n", member_id); + return -EINVAL; + } + + /* If the flag is set, this pool has received the discards. */ + return pool->pool_md.srv_md[md_i].discard_entries; +} + +#define RMR_MAP_FORMAT_VER 1 +/* + * Get the first most significant bit of map_ver. If it is one, then the store of that storage node + * is being replaced. 
+ */ +#define RMR_STORE_IS_REPLACE(map_ver) (map_ver >> 63 & 1ULL) +#define RMR_STORE_GET_VER(map_ver) (map_ver & ~(1ULL << 63)) +#define RMR_STORE_SET_REPLACE(map_ver) (map_ver |= 1ULL << 63) +#define RMR_STORE_UNSET_REPLACE(map_ver) (map_ver &= ~(1ULL << 63)) +#define RTRS_IO_LIMIT 102400 +//#define RTRS_IO_LIMIT 40 //for tests only + +/* + * TODO: + * We currently do not have mapped_size while creating dirty maps, + * which means we cannot calculate no_of_chunks, hence cannot allocate bitmap + * So, as a workaround, we allocate max size bitmap, + * and to reduce that allocation, we cap max mapped_size. + * + * 1GB max mapped size for now. + * (Size mentioned in number of sectors, just like nr_sects) + */ +#define RMR_MAX_MAPPED_SIZE 2097152 + +/* The header structure of rmr pool metadata will not over this limit. */ +#define RMR_MD_SIZE PAGE_SIZE +#define RMR_MD_SIZE_SECTORS (PAGE_SIZE / SECTOR_SIZE) +#define RMR_MAP_BUF_HDR_SIZE PAGE_SIZE +#define RMR_SRV_MD_SIZE (sizeof(struct rmr_srv_md) * RMR_POOL_MAX_SESS) +#define RMR_CLT_MD_SIZE (sizeof(struct rmr_pool_md) - RMR_SRV_MD_SIZE) +#define RMR_SECTOR_SIZE 512 +#define RMR_INT_ROUND_UP(x, y) (((x) + (y) - 1) / (y)) +#define RMR_ROUND_UP(x) round_up(x, RMR_SECTOR_SIZE) + +#define RMR_SRV_MAX_QDEPTH 512 + +/* last_io region starts right after the pool_md header page */ +#define RMR_LAST_IO_OFFSET RMR_MD_SIZE + +static inline u64 rmr_last_io_len(u32 queue_depth) +{ + return RMR_ROUND_UP((u64)queue_depth * sizeof(rmr_id_t)); +} + +static inline u64 rmr_bitmap_offset(u32 queue_depth) +{ + return RMR_LAST_IO_OFFSET + rmr_last_io_len(queue_depth); +} + +static inline u64 rmr_per_map_bitmap_size(u64 no_of_chunks) +{ + return DIV_ROUND_UP(no_of_chunks, CHUNKS_PER_SLP) * PAGE_SIZE; +} + +static inline u64 rmr_bitmap_len(u64 no_of_chunks) +{ + return RMR_POOL_MAX_SESS * rmr_per_map_bitmap_size(no_of_chunks); +} + +struct rmr_map_buf_hdr { + u64 version; + u64 member_id; + + /* + * dst_slp_idx: SLP index in the local dirty 
map buffer, + * from where to write the received dirty map buffer + */ + u64 dst_slp_idx; + u32 buf_size; + + /* + * slp_idx: Only used for MAP_READ, + * to let the client know where to resume from in the next iteration + */ + u64 map_idx; + u64 slp_idx; +} __packed; + +extern struct list_head pool_list; +extern struct mutex pool_mutex; + +const char *rmr_get_cmd_name(enum rmr_msg_cmd_type cmd); + +struct rmr_pool *rmr_create_pool(const char *poolname, void *priv); +void free_pool(struct rmr_pool *pool); + +struct rmr_pool *rmr_find_pool_by_group_id(u32 group_id); +struct rmr_pool *rmr_find_pool(const char *poolname); +int rmr_pool_maps_to_buf(struct rmr_pool *pool, u8 *map_idx, u64 *slp_idx, + void *buf, size_t buflen, rmr_map_filter filter); +int rmr_pool_save_map(struct rmr_pool *pool, void *buf, size_t buflen, + bool test_only); + +static inline void rmr_pool_update_no_of_chunk(struct rmr_pool *pool) +{ + u64 calc_no_of_chunks = 0, old_no_of_chunks = pool->no_of_chunks; + + /* + * In include/linux/types.h + * + * "Linux always considers sectors to be 512 (SECTOR_SHIFT==9) bytes long independently + * of the devices real block size." + * + * mapped_size is saved in sectors. + */ + if (pool->mapped_size) { + calc_no_of_chunks = (pool->mapped_size >> (pool->chunk_size_shift - 9)); + + if (pool->chunk_size && + (pool->mapped_size << 9) % pool->chunk_size) + calc_no_of_chunks += 1; + } + + if (calc_no_of_chunks != pool->no_of_chunks) { + pool->no_of_chunks = calc_no_of_chunks; + pr_info("%s: %s: no_of_chunks updated from %llu to %llu\n", + __func__, pool->poolname, old_no_of_chunks, pool->no_of_chunks); + } +} + +/* + * rmr_pool_maps_append - Append a map to the dense maps array + * @pool: pool + * @map: map to add + * + * Context: Caller must hold maps_lock.
+ */ +static inline void rmr_pool_maps_append(struct rmr_pool *pool, + struct rmr_dirty_id_map *map) +{ + rcu_assign_pointer(pool->maps[pool->maps_cnt], map); + pool->maps_cnt++; +} + +/* + * rmr_pool_maps_swap_remove - Remove map at index @i using swap-with-last + * @pool: pool + * @i: index of the map in the map array to remove + * @map: the map being removed + * + * Description: + * Maintains the dense invariant: pool->maps[0:maps_cnt] has no NULL gaps. + * + * Context: Caller must hold maps_lock. + */ +static inline void rmr_pool_maps_swap_remove(struct rmr_pool *pool, u8 i, + struct rmr_dirty_id_map *map) +{ + u8 last = pool->maps_cnt - 1; + + if (i != last) + rcu_assign_pointer(pool->maps[i], rcu_dereference_protected(pool->maps[last], + lockdep_is_held(&pool->maps_lock))); + + rcu_assign_pointer(pool->maps[last], NULL); + pool->maps_cnt--; +} + +static inline struct rmr_dirty_id_map *rmr_pool_find_map(struct rmr_pool *pool, u8 member_id) +{ + int i; + struct rmr_dirty_id_map *map; + struct rmr_dirty_id_map *res = NULL; + + rcu_read_lock(); + for (i = 0; i < pool->maps_cnt; i++) { + map = rcu_dereference(pool->maps[i]); + + if (WARN_ON(!map) || map->member_id != member_id) + continue; + + res = map; + break; + } + rcu_read_unlock(); + + return res; +} + +static inline int rmr_pool_remove_map(struct rmr_pool *pool, u8 member_id) +{ + int i; + struct rmr_dirty_id_map *mp; + struct rmr_dirty_id_map *map = NULL; + + pr_info("%s: pool %s is removing map for member_id %d\n", + __func__, pool->poolname, member_id); + + mutex_lock(&pool->maps_lock); + for (i = 0; i < pool->maps_cnt; i++) { + mp = rcu_dereference_protected(pool->maps[i], + lockdep_is_held(&pool->maps_lock)); + if (WARN_ON(!mp)) + continue; + if (mp->member_id == member_id) { + map = mp; + break; + } + } + + if (!map) { + mutex_unlock(&pool->maps_lock); + pr_err("%s: pool %s cannot find map for member_id %d\n", + __func__, pool->poolname, member_id); + return -EINVAL; + } + + /* Dirty map entries are 
also removed since the map no longer exists. */ + rmr_map_unset_dirty_all(map); + + rmr_pool_maps_swap_remove(pool, i, map); + synchronize_srcu(&pool->map_srcu); + + mutex_unlock(&pool->maps_lock); + + /* Free up the memory */ + rmr_map_destroy(map); + + return 0; +} + + +bool rmr_pool_change_state(struct rmr_pool *pool, enum rmr_pool_state new_state); + +void rmr_pool_confirm_inflight_ref(struct percpu_ref *ref); + +static inline u32 rmr_pool_hash(const char *poolname) +{ + return jhash(poolname, strlen(poolname), 0); +} + +#endif /* RMR_POOL_H */ diff --git a/drivers/infiniband/ulp/rmr/rmr-proto.h b/drivers/infiniband/ulp/rmr/rmr-proto.h new file mode 100644 index 000000000000..02c20ed76bef --- /dev/null +++ b/drivers/infiniband/ulp/rmr/rmr-proto.h @@ -0,0 +1,273 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Reliable multicast over RTRS (RMR) + * + * Copyright (c) 2026 IONOS SE + */ + +#ifndef RMR_PROTO_H +#define RMR_PROTO_H + +#define RMR_PROTO_VER_MAJOR 0 +#define RMR_PROTO_VER_MINOR 1 + +#define RMR_PROTO_VER_STRING __stringify(RMR_PROTO_VER_MAJOR) "." \ + __stringify(RMR_PROTO_VER_MINOR) + +#ifndef RMR_VER_STRING +#define RMR_VER_STRING __stringify(RMR_PROTO_VER_MAJOR) "." 
\ + __stringify(RMR_PROTO_VER_MINOR) +#endif + +/* TODO: should be configurable */ +#define RTRS_PORT 1234 + +#define RMR_POOL_MAX_SESS 4 + +/** + * enum rmr_msg_type - RMR message types + * @RMR_MSG_CMD: Command message from client to server + * @RMR_MSG_CMD_RSP: Command response from server to client + * @RMR_MSG_IO: IO (read/write) request on an object + * @RMR_MSG_MD: Metadata transfer + * @RMR_MSG_MAP_CLEAR: Clear an entry in the dirty map + * @RMR_MSG_MAP_ADD: Add an entry to the dirty map + */ +enum rmr_msg_type { + RMR_MSG_CMD, + RMR_MSG_CMD_RSP, + RMR_MSG_IO, + RMR_MSG_MD, + RMR_MSG_MAP_CLEAR, + RMR_MSG_MAP_ADD, +}; + +/** + * struct rmr_msg_hdr - header of RMR messages + * @group_id: jhash() of the pool name + * @type: Message type, for valid values see enum rmr_msg_type + * @__padding: reserved, must be zero + */ +struct rmr_msg_hdr { + __le32 group_id; /* poolname jhash() */ + __le16 type; + __le16 __padding; +}; + +/** + * struct rmr_msg_io - message for object I/O read/write + * @hdr: message header; the pool is identified by @hdr.group_id + * @id_a: first 64 bits of the object id + * @id_b: second 64 bits of the object id + * @offset: offset from where to read/write + * @length: number of bytes for I/O read/write + * @flags: bitmask, valid values are defined in enum rmr_io_flags + * @prio: priority of the I/O + */ +struct rmr_msg_io { + struct rmr_msg_hdr hdr; + __le64 id_a; + __le64 id_b; + + __le32 offset; + __le32 length; + __le32 flags; + __le16 prio; + + __le32 mem_id; + __le64 map_ver; + u8 failed_id[RMR_POOL_MAX_SESS]; + u8 failed_cnt; + + u8 member_id; + u8 sync; + u8 __padding[19]; /* TODO: recheck the padding size */ +}; + +struct rmr_pool_member_info { + u8 no_of_stor; + + struct per_mem_info { + u8 member_id; + u8 c_dirty; + } p_mem_info[RMR_POOL_MAX_SESS]; +}; + +/** + * enum rmr_msg_cmd_type - RMR command types + * @RMR_CMD_MAP_READY: Get ready to receive a map + * @RMR_CMD_MAP_SEND: Send a map to a certain node + * @RMR_CMD_MAP_DONE: Confirm map receipt + * + * When adding a command, + * make sure to add it to the function rmr_get_cmd_name.
+ */ +enum rmr_msg_cmd_type { + RMR_CMD_MAP_READY, /* 0 */ + RMR_CMD_MAP_SEND, + RMR_CMD_SEND_MAP_BUF, + RMR_CMD_MAP_BUF_DONE, + RMR_CMD_MAP_DONE, + RMR_CMD_MAP_DISABLE, + RMR_CMD_READ_MAP_BUF, + RMR_CMD_MAP_CHECK, + RMR_CMD_LAST_IO_TO_MAP, + RMR_CMD_STORE_CHECK, + RMR_CMD_MAP_TEST, + /* sends the metadata of a non-sync rmr client to the server */ + RMR_CMD_SEND_MD_BUF, + /* sends a discard message to the node */ + RMR_CMD_SEND_DISCARD, + /* sends an md_update message to the node; the node sends its srv_md back */ + RMR_CMD_MD_SEND, + + RMR_CMD_MAP_GET_VER, /* 14 */ + RMR_CMD_MAP_SET_VER, + RMR_CMD_DISCARD_CLEAR_FLAG, + + /* + * Add map-related commands above this line. + */ + RMR_MAP_CMD_MAX, + + RMR_CMD_POOL_INFO, /* 18 */ + RMR_CMD_JOIN_POOL, + + RMR_CMD_REJOIN_POOL, + + RMR_CMD_LEAVE_POOL, + RMR_CMD_ENABLE_POOL, /* 22 */ + + RMR_CMD_USER, + + /* + * Add pool-related commands above this line. + */ + RMR_POOL_CMD_MAX, +}; + +struct rmr_msg_map_send_cmd { + u8 receiver_member_id; +}; + +struct rmr_msg_map_buf_cmd { + u64 version; + u8 map_idx; + u64 slp_idx; +}; + +struct rmr_msg_map_buf_done_cmd { + u64 map_version; +}; + +struct rmr_msg_map_done_cmd { + u8 enable; +}; + +struct rmr_msg_send_md_buf_cmd { + u8 sync; /* whether the pool is in sync */ + u8 sender_id; + u8 receiver_id; + u64 flags; +}; + +struct rmr_msg_send_discard_cmd { + u8 member_id; /* the storage node that discards all data */ +}; + +struct rmr_msg_md_send_cmd { + u64 src_mapped_size; /* the pool mapped size on the sending side */ + u8 sender_id; + u8 leader_id; + u8 read_full_md; /* 1 = return full pool_md; 0 = own entry only */ +}; + +struct rmr_msg_pool_info_cmd { + u8 member_id; + u8 operation; /* add/remove */ + u8 mode; /* For add -> create/assemble.
For remove -> delete/disassemble */ + u8 dirty; /* Valid only when operation=ADD and mode=CREATE */ +}; + +enum rmr_pool_info_op { + RMR_POOL_INFO_OP_ADD = 0, + RMR_POOL_INFO_OP_REMOVE, +}; + +enum rmr_pool_info_mode { + RMR_POOL_INFO_MODE_CREATE = 0, + RMR_POOL_INFO_MODE_ASSEMBLE, + RMR_POOL_INFO_MODE_DELETE, + RMR_POOL_INFO_MODE_DISASSEMBLE, +}; + +struct rmr_msg_set_map_ver_cmd { + u8 map_ver; /* the map version to set */ +}; + +struct rmr_msg_join_pool_cmd { + u64 queue_depth; + u32 chunk_size; + struct rmr_pool_member_info mem_info; + u8 dirty; + u8 create; + u8 rejoin; +}; + +struct rmr_msg_leave_pool_cmd { + u8 member_id; + u8 delete; +}; + +struct rmr_msg_enable_pool_cmd { + u32 enable; +}; + +struct rmr_msg_user_cmd { + size_t usr_len; +}; + +struct rmr_msg_join_pool_cmd_rsp { + u64 mapped_size; + u32 chunk_size; +}; + +struct rmr_msg_pool_cmd { + struct rmr_msg_hdr hdr; + u8 ver; + u8 cmd_type; + u8 sync; + u8 rsvd[1]; + s8 pool_name[NAME_MAX]; + union { + struct rmr_msg_map_send_cmd map_send_cmd; + struct rmr_msg_map_buf_cmd map_buf_cmd; + struct rmr_msg_map_buf_done_cmd map_buf_done_cmd; + struct rmr_msg_map_done_cmd map_done_cmd; + + struct rmr_msg_send_md_buf_cmd send_md_buf_cmd; + struct rmr_msg_send_discard_cmd send_discard_cmd; + struct rmr_msg_md_send_cmd md_send_cmd; + + struct rmr_msg_pool_info_cmd pool_info_cmd; + + struct rmr_msg_set_map_ver_cmd set_map_ver_cmd; + + struct rmr_msg_join_pool_cmd join_pool_cmd; + + struct rmr_msg_leave_pool_cmd leave_pool_cmd; + struct rmr_msg_enable_pool_cmd enable_pool_cmd; + + struct rmr_msg_user_cmd user_cmd; + }; +}; + +struct rmr_msg_pool_cmd_rsp { + struct rmr_msg_hdr hdr; + enum rmr_msg_cmd_type cmd_type; + u8 err; + u8 ver; + u8 member_id; + union { + struct rmr_msg_join_pool_cmd_rsp join_pool_cmd_rsp; + u64 value; + }; +}; + +#endif /* RMR_PROTO_H */ diff --git a/drivers/infiniband/ulp/rmr/rmr-req.h b/drivers/infiniband/ulp/rmr/rmr-req.h new file mode 100644 index 000000000000..8f15b36fe480 --- 
/dev/null +++ b/drivers/infiniband/ulp/rmr/rmr-req.h @@ -0,0 +1,65 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Reliable multicast over RTRS (RMR) + * + * Copyright (c) 2026 IONOS SE + */ + +#ifndef RMR_REQ_H +#define RMR_REQ_H + +#include "rmr-pool.h" + +struct rmr_srv_req { + struct rmr_srv_pool *srv_pool; + rmr_id_t id; + + u32 offset; + u32 length; + u32 flags; + u16 prio; + + u32 mem_id; + struct rtrs_srv_op *rtrs_op; + struct rmr_srv_io_store *store; + void *data; + u32 datalen; /* TODO: clarify how this differs from @length */ + void (*endreq)(struct rmr_srv_req *, int err); + struct work_struct work; + int err; + u8 failed_cnt; + u8 failed_srv_id[RMR_POOL_MAX_SESS]; + u64 map_ver; + void *priv; + struct llist_node node; + bool from_sync; + struct scatterlist sg; + struct rmr_iu *iu; + struct rmr_srv_req *parent; + bool sync; + struct rcu_head rcu; +}; + +struct rmr_srv_req *rmr_srv_req_create(const struct rmr_msg_io *msg, + struct rmr_srv_pool *srv_pool, + struct rtrs_srv_op *rtrs_op, + void *data, u32 datalen, + void (*endreq)(struct rmr_srv_req *, int)); +struct rmr_srv_req *rmr_srv_md_req_create(struct rmr_srv_pool *srv_pool, + struct rtrs_srv_op *rtrs_op, void *data, + u32 offset, u32 len, unsigned long flags, + void (*endreq)(struct rmr_srv_req *, int)); +void rmr_req_submit(struct rmr_srv_req *req); +void rmr_md_req_submit(struct rmr_srv_req *req); +void rmr_srv_req_resp(struct rmr_srv_req *req, int err); +void rmr_srv_md_req_resp(struct rmr_srv_req *req, int err); +int rmr_srv_sync_chunk_id(struct rmr_srv_pool *srv_pool, struct rmr_map_entry *entry, + rmr_id_t id, bool from_sync); + +void rmr_process_wait_list(struct rmr_map_entry *entry, int err); + +struct rmr_map_entry_info { + rmr_id_t id; + u8 srv_id; +}; +#endif /* RMR_REQ_H */ diff --git a/drivers/infiniband/ulp/rmr/rmr-srv.h b/drivers/infiniband/ulp/rmr/rmr-srv.h new file mode 100644 index 000000000000..a84586aa78bd --- /dev/null +++ b/drivers/infiniband/ulp/rmr/rmr-srv.h @@ -0,0
+1,219 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Reliable multicast over RTRS (RMR) + * + * Copyright (c) 2026 IONOS SE + */ + +#ifndef RMR_SRV_H +#define RMR_SRV_H + +/* rmr-srv-sysfs.c */ + +#include +#include +#include +#include +#include + +#include "rmr-pool.h" + +/* + * IO store interface implemented by an upper-layer consumer of rmr-server. + * All consumer-specific types are passed as void * so RMR remains + * independent of any particular client. + */ +struct rmr_srv_store_ops { + int (*submit_req)(void *device, void *data, u32 offset, u32 length, + unsigned long flags, u16 prio, void *priv); + int (*submit_md_req)(void *device, void *data, u32 offset, u32 length, + unsigned long flags, void *priv); + int (*submit_cmd)(void *device, const void *usr_buf, int usr_len, + void *data, int datalen); + bool (*io_allowed)(void *store_priv); + int (*get_params)(void *device); +}; + +#define DEFAULT_SYNC_QUEUE_DEPTH 32 +#define RMR_SRV_CHECK_MAPS_INTERVAL_MS 3000 +#define RMR_SRV_MD_SYNC_INTERVAL_MS 500 +#define RMR_SRV_DISCARD_TIMEOUT_MS 500 + +/* Bit indices for srv_pool->md_dirty — used with set_bit / test_and_clear_bit */ +enum rmr_srv_md_dirty_bit { + MD_DIRTY_POOL, /* pool_md fields changed */ + MD_DIRTY_MAPS, /* map bitmap changed */ + MD_DIRTY_LAST_IO, /* last_io updated */ +}; + +extern struct kmem_cache *rmr_req_cachep; +extern struct kmem_cache *rmr_map_entry_cachep; + +enum rmr_srv_register_disk_mode { + RMR_SRV_DISK_CREATE, /* Fresh store, new pool */ + RMR_SRV_DISK_ADD, /* Rejoin an existing pool */ + RMR_SRV_DISK_REPLACE, /* Replace an existing store */ +}; + +/* + * When adding state, remember to add an entry in the function rmr_get_srv_pool_state_name() + */ +enum rmr_srv_pool_state { + RMR_SRV_POOL_STATE_EMPTY, + RMR_SRV_POOL_STATE_REGISTERED, + RMR_SRV_POOL_STATE_CREATED, + RMR_SRV_POOL_STATE_NORMAL, + RMR_SRV_POOL_STATE_NO_IO, +}; + +struct rmr_srv_pool { + u8 member_id; + refcount_t refcount; + atomic_t state; + bool 
maintenance_mode; + + struct rmr_pool *pool; + + /* Sync thread */ + struct task_struct *th_tsk; + atomic_t thread_state; + atomic_t in_flight_sync_reqs; + + struct rmr_srv_io_store *io_store; + struct mutex srv_pool_lock; + atomic_t store_state; + + bool marked_create; + bool marked_delete; + + unsigned long md_dirty; /* bitmask of dirty regions */ + unsigned long map_update_state; + /* The internal client pool assigned to this server pool. */ + struct rmr_pool *clt; + size_t queue_depth; + rmr_id_t *last_io; + /* + * Each storage node keeps an array of queue_depth entries to track the + * IOs of the last round. last_io_idx holds chunk indexes as a copy of + * srv_pool->last_io so that it can be written back to / read from the + * backing store as needed. + */ + rmr_id_t *last_io_idx; + + u32 max_sync_io_size; + struct workqueue_struct *clean_wq; + struct delayed_work clean_dwork; + + struct workqueue_struct *md_sync_wq; + struct delayed_work md_sync_dwork; + struct delayed_work last_io_sync_dwork; +}; + +/** + * rmr_srv_mark_pool_md_dirty() - Set MD_DIRTY_POOL and schedule delayed sync + * @srv_pool: Server pool with changed pool_md fields + */ +static inline void rmr_srv_mark_pool_md_dirty(struct rmr_srv_pool *srv_pool) +{ + set_bit(MD_DIRTY_POOL, &srv_pool->md_dirty); + mod_delayed_work(srv_pool->md_sync_wq, &srv_pool->md_sync_dwork, + msecs_to_jiffies(RMR_SRV_MD_SYNC_INTERVAL_MS)); +} + +struct rmr_srv_sess { + struct list_head pool_sess_list; + struct rtrs_srv_sess *rtrs; + struct kobject kobj; + char sessname[NAME_MAX]; + struct mutex lock; + u8 ver; + struct xarray pools; + struct list_head g_list_entry; +}; + +struct rmr_srv_pool_sess { + struct list_head pool_entry; /* for pool->sess_list */ + struct list_head srv_sess_entry; + struct rmr_srv_pool *srv_pool; + struct kobject kobj; + char sessname[NAME_MAX]; + struct rmr_srv_sess *srv_sess; + bool sync; +}; + +struct rmr_srv_io_store { + struct rmr_srv_store_ops *ops; + void *priv; +}; + +struct
rmr_cmd_work_info { + struct work_struct cmd_work; + struct rmr_pool *pool; + struct rmr_srv_sess *sess; + struct rtrs_srv_sess *rtrs; + const struct rmr_msg_pool_cmd *cmd_msg; + struct rmr_msg_pool_cmd_rsp *rsp; + struct rtrs_srv_op *rtrs_op; + void *data; + size_t datalen; +}; + +void rmr_put_srv_pool(struct rmr_srv_pool *srv_pool); +struct rmr_srv_pool *rmr_create_srv_pool(char *poolname, u32 member_id); +void rmr_srv_pool_update_params(struct rmr_pool *pool); +int rmr_srv_read_md(struct rmr_pool *pool, struct rtrs_srv_op *rtrs_op, u32 offset, u32 len, + struct rmr_pool_md *pool_md_page); +int rmr_srv_send_md_update(struct rmr_pool *pool); +int rmr_srv_check_params(struct rmr_srv_pool *srv_pool); +void rmr_srv_mark_maps_dirty(struct rmr_srv_pool *srv_pool); + +/* rmr-srv-md.c */ +struct rmr_srv_req; /* forward decl for endreq prototype */ + +bool rmr_get_srv_pool(struct rmr_srv_pool *srv_pool); +void rmr_srv_endreq(struct rmr_srv_req *req, int err); + +int process_md_io(struct rmr_pool *pool, struct rtrs_srv_op *rtrs_op, + u32 offset, u32 len, unsigned long flags, void *buf); +void rmr_srv_md_maps_sync(struct rmr_pool *pool); +void rmr_srv_flush_pool_md(struct rmr_srv_pool *srv_pool); +void rmr_srv_md_sync(struct work_struct *work); +int rmr_srv_md_process_buf(struct rmr_pool *pool, void *buf, bool sync); +int rmr_srv_refresh_md(struct rmr_srv_pool *srv_pool); + +/* rmr-srv-sysfs.c */ + +int rmr_srv_create_sysfs_files(void); +void rmr_srv_destroy_sysfs_files(void); +void rmr_srv_destroy_pool_sysfs_files(struct rmr_pool *pool, + const struct attribute *sysfs_self); +int rmr_srv_sysfs_add_sess(struct rmr_pool *pool, + struct rmr_srv_pool_sess *pool_sess); +void rmr_srv_sysfs_del_sess(struct rmr_srv_pool_sess *pool_sess); + +void rmr_srv_free_sync_permits(struct rmr_pool *pool); +void rmr_srv_destroy_pool(struct rmr_pool *pool); +int rmr_srv_remove_clt_pool(struct rmr_srv_pool *srv_pool); + +void rmr_srv_stop_sync_and_go_offline(struct rmr_pool *pool); + +int 
rmr_srv_get_sync_permit(struct rmr_srv_pool *srv_pool); +void rmr_srv_put_sync_permit(struct rmr_srv_pool *srv_pool); + +int rmr_srv_sync_thread_start(struct rmr_srv_pool *srv_pool); +int rmr_srv_sync_thread_stop(struct rmr_srv_pool *srv_pool); + +void rmr_srv_sync_req_failed(struct rmr_srv_pool *srv_pool); + +int rmr_srv_query(struct rmr_pool *pool, u64 mapped_size, struct rmr_attrs *attr); +/* register/unregister rmr-srv */ +struct rmr_pool *rmr_srv_register(char *poolname, struct rmr_srv_store_ops *ops, void *priv, + u64 mapped_size, enum rmr_srv_register_disk_mode mode); +void rmr_srv_unregister(char *poolname, bool delete); + +int rmr_srv_pool_cmd_with_rsp(struct rmr_pool *pool, rmr_conf_fn *conf, void *priv, + const struct kvec *usr_vec, size_t nr, void *buf, int buf_len, + size_t size); +int rmr_srv_discard_id(struct rmr_pool *pool, u64 offset, u64 length, u8 member_id, bool sync); +void rmr_srv_replace_store(struct rmr_pool *pool); + +#endif /* RMR_SRV_H */ diff --git a/drivers/infiniband/ulp/rmr/rmr.h b/drivers/infiniband/ulp/rmr/rmr.h new file mode 100644 index 000000000000..72d591ccc047 --- /dev/null +++ b/drivers/infiniband/ulp/rmr/rmr.h @@ -0,0 +1,229 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Reliable multicast over RTRS (RMR) + * + * Copyright (c) 2026 IONOS SE + */ + +#ifndef RMR_H +#define RMR_H + +#include +#include +#include + +#include "rmr-proto.h" +struct rmr_pool; + +typedef void (rmr_conf_fn)(void *priv, int errno); +enum rmr_wait_type { + NO_WAIT = RTRS_PERMIT_NOWAIT, + WAIT = RTRS_PERMIT_WAIT +}; + +/* + * RMR client API + */ + +/** + * rmr_clt_put_pool() - inverse of rmr_clt_open(); decrements the refcount + * and frees the pool when it reaches 0. + * @pool: Pool handle + */ +void rmr_clt_put_pool(struct rmr_pool *pool); + +/** + * enum rmr_clt_link_ev - Events about connectivity state of a client + * @RMR_CLT_LINK_EV_RECONNECTED: Client was reconnected. + * @RMR_CLT_LINK_EV_DISCONNECTED: Client was disconnected.
+ */ +enum rmr_clt_link_ev { + RMR_CLT_LINK_EV_RECONNECTED, + RMR_CLT_LINK_EV_DISCONNECTED, +}; + +typedef void (rmr_clt_ev_fn)(void *priv, enum rmr_clt_link_ev ev); +/** + * rmr_clt_open() - Opens a pool from the RMR client + * @priv: User supplied private data; passed back to @link_ev and + * retrievable via rmr_clt_get_priv() + * @link_ev: Event notification callback for connection state changes; + * invoked with @priv and the occurred event + * @poolname: name of the pool + * + * Only one user can open a pool at a time. + * However, administrative operations remain possible. + * + * Return: a valid pointer on success, otherwise an ERR_PTR(). + */ +struct rmr_pool *rmr_clt_open(void *priv, rmr_clt_ev_fn *link_ev, const char *poolname); + +/** + * rmr_clt_get_priv() - returns the priv data that was provided to rmr_clt_open() + * @pool: Pool handle + */ +void *rmr_clt_get_priv(struct rmr_pool *pool); + +/** + * rmr_clt_close() - Closes a pool + * @pool: Pool handle; freed on return + */ +void rmr_clt_close(struct rmr_pool *pool); + +#define RMR_OP_BITS 8 +#define RMR_OP_MASK ((1 << RMR_OP_BITS) - 1) + +/** + * enum rmr_io_flags - RMR request types from rq_flag_bits + * @RMR_OP_READ: read object + * @RMR_OP_WRITE: write object + * @RMR_OP_DISCARD: remove object + * @RMR_OP_SYNCREQ: sync request + * @RMR_OP_WRITE_ZEROES: write zeroes + * @RMR_OP_FLUSH: flush object + * @RMR_OP_MD_READ: read metadata of rmr + * @RMR_OP_MD_WRITE: write metadata of rmr + */ +enum rmr_io_flags { + /* Operations */ + RMR_OP_READ = 0, + RMR_OP_WRITE = 1, + RMR_OP_DISCARD = 2, + RMR_OP_SYNCREQ = 3, + RMR_OP_WRITE_ZEROES = 4, + RMR_OP_FLUSH = 5, + /* Add metadata related operations below this.
*/ + RMR_OP_MD_READ = 6, + RMR_OP_MD_WRITE = 7, + + /* Flags */ + RMR_F_SYNC = 1 << (RMR_OP_BITS + 0), /* 0x100, 0b0100000000 */ + RMR_F_FUA = 1 << (RMR_OP_BITS + 1), /* 0x200, 0b1000000000 */ +}; + +static inline u32 rmr_op(u32 flag) +{ + return flag & RMR_OP_MASK; +} + +static inline u32 rmr_flags(u32 flag) +{ + return flag & ~RMR_OP_MASK; +} + +/* + * Holds the 128-bit block_id (a.k.a. object_id). + */ +typedef struct { + u64 a; + u64 b; +} rmr_id_t; + +struct rmr_iu; + +/** + * rmr_clt_get_iu() - allocates an iu for a future RDMA operation + * @pool: Current pool + * @flag: READ/WRITE/REMOVE + * @wait: WAIT/NO_WAIT + * + * Description: + * Allocates an iu for the following RDMA operation. The iu is used + * to preallocate all resources and to propagate memory pressure + * up early. + */ +struct rmr_iu *rmr_clt_get_iu(struct rmr_pool *pool, + enum rmr_io_flags flag, + enum rmr_wait_type wait); + +/** + * rmr_clt_put_iu() - puts an allocated iu + * @pool: Current pool + * @iu: Iu to be freed + * + * Context: + * Any context + */ +void rmr_clt_put_iu(struct rmr_pool *pool, struct rmr_iu *iu); + +/** + * rmr_clt_request() - Request data transfer to/from server via RDMA. + * @pool: The Pool + * @iu: Iu allocated by a previous rmr_clt_get_iu() call. + * @offset: offset inside the object to read/write + * @length: length of data starting from offset + * @flag: READ/WRITE/REMOVE + * @prio: priority of IO + * @priv: User provided data, passed back with the corresponding + * @conf confirmation. + * @conf: callback function to be called as confirmation + * @sg: Pages to be sent/received to/from server. + * @sg_cnt: Number of elements in @sg + * + * Return: + * 0: Success + * -EAGAIN: Currently there are no resources to execute the request. + * Retry later. + * <0: Error + * + * On flag=READ the rtrs client will request a data transfer from server to client.
+ * The data that the server will respond with will be stored in @sg when + * the user confirmation function is called. + * On flag=WRITE the rtrs client will RDMA-write the data in @sg to the server side. + */ +int rmr_clt_request(struct rmr_pool *pool, struct rmr_iu *iu, + size_t offset, size_t length, enum rmr_io_flags flag, unsigned short prio, + void *priv, rmr_conf_fn *conf, struct scatterlist *sg, unsigned int sg_cnt); + +int rmr_clt_cmd_with_rsp(struct rmr_pool *pool, rmr_conf_fn *conf, void *priv, + const struct kvec *usr_vec, size_t nr, void *buf, int buf_len, + size_t size); + + +/** + * struct rmr_attrs - RMR pool attributes + */ +struct rmr_attrs { + u32 queue_depth; + u32 max_io_size; + u32 chunk_size; + u32 max_segments; + u64 rmr_md_size; /* in sectors */ + u8 sync; + struct kobject *pool_kobj; +}; + +/** + * rmr_clt_query() - queries RMR pool attributes + * @pool: Pool handle + * @attr: attributes are stored here on success + * + * Return: + * 0 on success + * -EINVAL if there is no session in the pool + */ +int rmr_clt_query(struct rmr_pool *pool, struct rmr_attrs *attr); + +typedef enum { + RMR_MAP_ADD, + RMR_MAP_REMOVE, +} rmr_map_cmd; + +#define RMR_STORE_ID_BITS 32 +#define RMR_STORE_ID_OFFSET (64 - RMR_STORE_ID_BITS) + +#define RMR_CHUNK_BITS 32 +#define RMR_CHUNK_OFFSET 0 + +enum rmr_pool_state { + RMR_POOL_STATE_CREATED = 0, + RMR_POOL_STATE_JOINED, + RMR_POOL_STATE_ONLINE, + /* maybe we will use this later */ + RMR_POOL_STATE_DEGRADED, + RMR_POOL_STATE_SYNCING, +}; + +#endif -- 2.43.0