From: Christoph Hellwig <hch@infradead.org>
To: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
Cc: Jens Axboe <axboe@kernel.dk>,
drbd-dev@lists.linbit.com, linux-kernel@vger.kernel.org,
Lars Ellenberg <lars.ellenberg@linbit.com>,
Philipp Reisner <philipp.reisner@linbit.com>,
linux-block@vger.kernel.org,
Joel Colledge <joel.colledge@linbit.com>,
linux-rdma@vger.kernel.org, Jason Gunthorpe <jgg@ziepe.ca>,
Leon Romanovsky <leon@kernel.org>
Subject: Re: [PATCH 06/20] drbd: add RDMA transport implementation
Date: Tue, 7 Apr 2026 22:42:55 -0700
Message-ID: <adXq36pbGLXMZc2r@infradead.org>
In-Reply-To: <20260327223820.2244227-7-christoph.boehmwalder@linbit.com>
You really need to CC the RDMA mailing list when adding new RDMA
code. I'll try to review the bits I still remember, but you also
need a maintainer ACK.
> +#ifndef SENDER_COMPACTS_BVECS
> +/* My benchmarking shows a limit of 30 MB/s
> + * with the current implementation of this idea.
> + * cpu bound, perf top shows mainly get_page/put_page.
> + * Without this, using the plain send_page,
> + * I achieve > 400 MB/s on the same system.
> + * => disable for now, improve later.
> + */
> +#define SENDER_COMPACTS_BVECS 0
> +#endif
Nothing explains what "this idea" is. And as a rule we do not add
dead code anyway.
> +/* Nearly all data transfer uses the send/receive semantics. No need to
Please use the normal kernel comment style:
/*
* Blah, blah, blah.
*/
> +int allocation_size;
> +/* module_param(allocation_size, int, 0664);
> + MODULE_PARM_DESC(allocation_size, "Allocation size for receive buffers (page size of peer)");
> +
> + That needs to be implemented in dtr_create_rx_desc() and in dtr_recv() and dtr_recv_pages() */
> +
> +/* If no recvbuf_size or sendbuf_size is configured use 1M plus two pages for the DATA_STREAM */
> +/* Actually it is not a buffer, but the number of tx_descs or rx_descs we allow,
> + very comparable to the socket sendbuf and recvbuf sizes */
Please don't add random unused code.
Also, while the kernel coding style has an exception to the line
length limit for selected lines where it improves readability, that
by definition can't apply to block comments.
> +/* Assuming that a singe 4k write should be at the highest scatterd over 8
> + pages. I.e. has no parts smaller than 512 bytes.
> + Arbitrary assumption. It seems that Mellanox hardware can do up to 29
> + ppc64 page size might be 64k */
> +#if (PAGE_SIZE / 512) > 28
> +# define DTR_MAX_TX_SGES 28
> +#else
> +# define DTR_MAX_TX_SGES (PAGE_SIZE / 512)
> +#endif
All this looks like complete bollocks. We've had multi-page bvecs for
years, so the page size should not apply to the I/O path.
> +union dtr_immediate {
> + struct {
> +#if defined(__LITTLE_ENDIAN_BITFIELD)
> + unsigned int sequence:SEQUENCE_BITS;
> + unsigned int stream:2;
> +#elif defined(__BIG_ENDIAN_BITFIELD)
> + unsigned int stream:2;
> + unsigned int sequence:SEQUENCE_BITS;
> +#else
> +# error "this endianness is not supported"
> +#endif
> + };
> + unsigned int i;
> +};
Bitfields for on-the-wire structures are an anti-pattern. Please
use proper masking and shifting like just about everyone else in modern
code.
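Something along these lines (a userspace sketch; SEQUENCE_BITS is assumed
to be 30 so the fields fill 32 bits, and the helper names are hypothetical)
gives a wire layout that is independent of host endianness:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a mask-and-shift replacement for the bitfield union:
 * pack a 30-bit sequence number and a 2-bit stream id into one
 * 32-bit immediate with an explicitly defined layout.
 */
#define SEQUENCE_BITS		30
#define DTR_IMM_SEQ_MASK	((1u << SEQUENCE_BITS) - 1)
#define DTR_IMM_STREAM_SHIFT	SEQUENCE_BITS

static inline uint32_t dtr_imm_pack(uint32_t sequence, uint32_t stream)
{
	return (sequence & DTR_IMM_SEQ_MASK) |
	       ((stream & 3u) << DTR_IMM_STREAM_SHIFT);
}

static inline uint32_t dtr_imm_sequence(uint32_t imm)
{
	return imm & DTR_IMM_SEQ_MASK;
}

static inline uint32_t dtr_imm_stream(uint32_t imm)
{
	return imm >> DTR_IMM_STREAM_SHIFT;
}
```

With this the value that goes on the wire is the same no matter which
endianness the host has; only the byte-order conversion (cpu_to_be32 or
similar) remains, in one well-defined place.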
> +static int dtr_init(struct drbd_transport *transport);
> +static void dtr_free(struct drbd_transport *transport, enum drbd_tr_free_op);
> +static int dtr_prepare_connect(struct drbd_transport *transport);
> +static int dtr_connect(struct drbd_transport *transport);
> +static void dtr_finish_connect(struct drbd_transport *transport);
The code structure here is totally messed up if you need all these
dozens of forward declarations. Please reshuffle it so that you only
need the ones that are actually required.
> +static struct drbd_transport_class rdma_transport_class = {
> + .name = "rdma",
> + .instance_size = sizeof(struct dtr_transport),
> + .path_instance_size = sizeof(struct dtr_path),
> + .listener_instance_size = sizeof(struct dtr_listener),
> + .ops = (struct drbd_transport_ops) {
Nested struct initializers don't need casts.
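I.e., a plain brace initializer for the nested member is enough (userspace
sketch; the struct types here are hypothetical stand-ins):

```c
#include <assert.h>

/* A nested struct member can be initialized with a plain designated
 * brace initializer; the (struct drbd_transport_ops) compound-literal
 * cast in the quoted code is unnecessary noise.
 */
struct ops {
	int (*send)(void);
};

struct transport_class {
	const char *name;
	struct ops ops;
};

static int dummy_send(void)
{
	return 42;
}

static struct transport_class rdma_class = {
	.name	= "rdma",
	.ops	= {
		.send = dummy_send,	/* no cast needed here */
	},
};
```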
> +static bool atomic_inc_if_below(atomic_t *v, int limit)
> +{
> + int old, cur;
> +
> + cur = atomic_read(v);
> + do {
> + old = cur;
> + if (old >= limit)
> + return false;
> +
> + cur = atomic_cmpxchg(v, old, old + 1);
> + } while (cur != old);
> +
> + return true;
> +}
Don't add your own atomics; please talk to the atomics maintainers
instead of adding hacks like this.
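For reference, the compare-exchange loop the quoted code open-codes already
has an idiomatic spelling via atomic_try_cmpxchg() in the kernel; a generic
"increment if below a limit" helper would belong in the shared atomics
headers. A sketch of the loop shape, written with C11 atomics so it
compiles in userspace:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Compare-exchange loop: increment *v only if it is below limit.
 * atomic_compare_exchange_weak() updates cur with the observed value
 * on failure, so the recheck against limit uses the fresh value --
 * the same contract as the kernel's atomic_try_cmpxchg().
 */
static bool inc_if_below(atomic_int *v, int limit)
{
	int cur = atomic_load(v);

	do {
		if (cur >= limit)
			return false;
	} while (!atomic_compare_exchange_weak(v, &cur, cur + 1));

	return true;
}
```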
> + err = -ENOMEM;
> + tx_desc = kzalloc(sizeof(*tx_desc) + sizeof(struct ib_sge), gfp_mask);
> + if (!tx_desc)
> + goto out_put;
This is supposed to use the new flex kmalloc family of functions
(as much as I hate them).
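That is: declare the trailing SGE list as a flexible array member and size
the allocation with struct_size() instead of open-coded sizeof arithmetic.
A userspace sketch of the pattern (struct layouts are hypothetical
stand-ins, and the macro here omits the overflow checking the kernel's
struct_size() adds):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the structs in the patch. */
struct ib_sge {
	unsigned long long addr;
	unsigned int length;
	unsigned int lkey;
};

struct dtr_tx_desc {
	void *data;
	int nr_sges;
	struct ib_sge sge[];	/* flexible array member */
};

/* Userspace equivalent of struct_size(p, member, n): offset of the
 * flexible member plus n elements.  The kernel macro also checks for
 * multiplication overflow, which this sketch does not.
 */
#define STRUCT_SIZE(p, member, n) \
	(offsetof(__typeof__(*(p)), member) + sizeof((p)->member[0]) * (n))

static struct dtr_tx_desc *alloc_tx_desc(int nr_sges)
{
	struct dtr_tx_desc *tx_desc;

	tx_desc = calloc(1, STRUCT_SIZE(tx_desc, sge, nr_sges));
	if (tx_desc)
		tx_desc->nr_sges = nr_sges;
	return tx_desc;
}
```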
> + send_buffer = kmalloc(size, gfp_mask);
> + if (!send_buffer) {
> + kfree(tx_desc);
> + goto out_put;
> + }
> + memcpy(send_buffer, buf, size);
and this is a kmemdup.
> + if (err) {
> + kfree(tx_desc);
> + kfree(send_buffer);
> + goto out_put;
> + }
And please use proper gotos to unwind.
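I.e., one error label per acquired resource, unwound in reverse order,
with the kmalloc-plus-memcpy collapsed into kmemdup(). A userspace sketch
of the shape (function and label names are hypothetical; malloc/free stand
in for the kernel allocators, and the real code would hand the buffers off
instead of freeing them on success):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for kmemdup(): allocate and copy in one step. */
static void *memdup(const void *src, size_t size)
{
	void *p = malloc(size);

	if (p)
		memcpy(p, src, size);
	return p;
}

static int send_one(const void *buf, size_t size)
{
	void *tx_desc, *send_buffer;
	int err = -ENOMEM;

	tx_desc = calloc(1, 64);		/* stand-in tx descriptor */
	if (!tx_desc)
		goto out;

	send_buffer = memdup(buf, size);	/* kmemdup() in the kernel */
	if (!send_buffer)
		goto out_free_desc;

	err = 0;				/* stand-in for the actual post/send */

	free(send_buffer);	/* sketch only; real code hands this off */
out_free_desc:
	free(tx_desc);
out:
	return err;
}
```

Each failure point then jumps to exactly the label that releases what has
been acquired so far, instead of repeating the kfree() calls inline.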
> +static int dtr_recv(struct drbd_transport *transport, enum drbd_stream stream, void **buf, size_t size, int flags)
> +{
> + struct dtr_transport *rdma_transport;
> + int err;
> +
> + if (!transport)
> + return -ECONNRESET;
How can this be NULL?
> +static int dtr_path_prepare(struct dtr_path *path, struct dtr_cm *cm, bool active)
> +{
> + struct dtr_cm *cm2;
> + int i, err;
> +
> + cm2 = cmpxchg(&path->cm, NULL, cm); // RCU xchg
What's that comment supposed to mean? If this is meant to be an
rcu_replace_pointer(), use that. If not, the comment looks really odd.
> + err = dtr_cm_alloc_rdma_res(cm);
> +
> + return err;
You can return the value directly here.
Giving up for now. I think this needs a sweep for basic sanity and
a review from the RDMA maintainers before we can go into more detail.