* [PATCH 1/3] tcp: export symbol tcp_set_congestion_control
2022-03-11 3:01 [PATCH 0/3] NVMe/TCP: support specifying the congestion-control Mingbao Sun
@ 2022-03-11 3:01 ` Mingbao Sun
2022-03-11 7:13 ` Christoph Hellwig
2022-03-11 3:01 ` [PATCH 2/3] nvme-tcp: support specifying the congestion-control Mingbao Sun
2022-03-11 3:01 ` [PATCH 3/3] nvmet-tcp: " Mingbao Sun
2 siblings, 1 reply; 8+ messages in thread
From: Mingbao Sun @ 2022-03-11 3:01 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, linux-nvme, linux-kernel, Eric Dumazet,
David S . Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
netdev
Cc: sunmingbao, tyler.sun, ping.gan, yanxiu.cai, libin.zhang, ao.sun
From: Mingbao Sun <tyler.sun@dell.com>
The congestion-control algorithm can have a noticeable impact on the
performance of TCP-based communications. This is of course also true
for NVMe/TCP in the kernel.
Different congestion controls (e.g., cubic, dctcp) suit different
scenarios. Choosing an appropriate one can improve performance
considerably, while a poor choice can degrade it badly.
So, to achieve good performance across different network environments,
NVMe/TCP should support specifying which congestion control to use.
This means NVMe/TCP (an in-kernel user) needs to set the congestion
control on its TCP sockets.
Since the kernel API 'kernel_setsockopt' has been removed, and since
'tcp_set_congestion_control' is the function that actually does this
job underneath, it makes sense to export it.
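To illustrate the intended usage, here is a minimal sketch (not part of
this patch) of how an in-kernel user could drive the exported function;
the wrapper name is made up for the example, and the socket lock is
taken since the function operates on a locked socket:
static int example_set_sock_cc(struct socket *sock, const char *ca_name)
{
        int ret;

        /* tcp_set_congestion_control() must be called on a locked sock */
        lock_sock(sock->sk);
        /* load=true: allow autoloading the algorithm's module;
         * cap_net_admin=true: treat this kernel caller as privileged
         */
        ret = tcp_set_congestion_control(sock->sk, ca_name, true, true);
        release_sock(sock->sk);

        return ret;
}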
Signed-off-by: Mingbao Sun <tyler.sun@dell.com>
---
net/ipv4/tcp_cong.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index db5831e6c136..5d77f3e7278e 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -383,6 +383,7 @@ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load,
rcu_read_unlock();
return err;
}
+EXPORT_SYMBOL_GPL(tcp_set_congestion_control);
/* Slow start is used when congestion window is no greater than the slow start
* threshold. We base on RFC2581 and also handle stretch ACKs properly.
--
2.26.2
* Re: [PATCH 1/3] tcp: export symbol tcp_set_congestion_control
2022-03-11 3:01 ` [PATCH 1/3] tcp: export symbol tcp_set_congestion_control Mingbao Sun
@ 2022-03-11 7:13 ` Christoph Hellwig
2022-03-11 8:39 ` Mingbao Sun
0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2022-03-11 7:13 UTC (permalink / raw)
To: Mingbao Sun
Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, linux-nvme, linux-kernel, Eric Dumazet,
David S . Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
netdev, tyler.sun, ping.gan, yanxiu.cai, libin.zhang, ao.sun
Maybe add a kerneldoc comment now that this is an exported API?
Otherwise this looks fine to me:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 1/3] tcp: export symbol tcp_set_congestion_control
2022-03-11 7:13 ` Christoph Hellwig
@ 2022-03-11 8:39 ` Mingbao Sun
0 siblings, 0 replies; 8+ messages in thread
From: Mingbao Sun @ 2022-03-11 8:39 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
linux-nvme, linux-kernel, Eric Dumazet, David S . Miller,
Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev, tyler.sun,
ping.gan, yanxiu.cai, libin.zhang, ao.sun
On Fri, 11 Mar 2022 08:13:19 +0100
Christoph Hellwig <hch@lst.de> wrote:
> Maybe add a kerneldoc comment now that this is an exported API?
>
> Otherwise this looks fine to me:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Accepted. Will add it in the next version, and will also note that the
function must be called on a locked sock. Something along the lines of
the sketch below is what I have in mind.
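(Wording is only a sketch, not the final text:)
/**
 * tcp_set_congestion_control - set the congestion control of a socket
 * @sk: socket to configure; the caller must hold the socket lock
 * @name: name of the congestion-control algorithm (e.g. "cubic", "dctcp")
 * @load: if true, try to load the algorithm's module when it is not built in
 * @cap_net_admin: whether the caller is treated as having CAP_NET_ADMIN
 *
 * Must be called on a locked sock. Returns 0 on success or a negative errno.
 */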
* [PATCH 2/3] nvme-tcp: support specifying the congestion-control
2022-03-11 3:01 [PATCH 0/3] NVMe/TCP: support specifying the congestion-control Mingbao Sun
2022-03-11 3:01 ` [PATCH 1/3] tcp: export symbol tcp_set_congestion_control Mingbao Sun
@ 2022-03-11 3:01 ` Mingbao Sun
2022-03-11 7:15 ` Christoph Hellwig
2022-03-11 3:01 ` [PATCH 3/3] nvmet-tcp: " Mingbao Sun
2 siblings, 1 reply; 8+ messages in thread
From: Mingbao Sun @ 2022-03-11 3:01 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, linux-nvme, linux-kernel, Eric Dumazet,
David S . Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
netdev
Cc: sunmingbao, tyler.sun, ping.gan, yanxiu.cai, libin.zhang, ao.sun
From: Mingbao Sun <tyler.sun@dell.com>
The congestion-control algorithm can have a noticeable impact on the
performance of TCP-based communications. This is of course also true
for NVMe/TCP.
Different congestion controls (e.g., cubic, dctcp) suit different
scenarios. Choosing an appropriate one can improve performance
considerably, while a poor choice can degrade it badly.
The congestion control used by NVMe/TCP can be changed globally by
writing '/proc/sys/net/ipv4/tcp_congestion_control', but that also
changes the default for every future TCP socket that has not been
explicitly assigned a congestion control, potentially affecting the
performance of those sockets as well.
So it makes sense for NVMe/TCP to support specifying its congestion
control explicitly. This commit addresses the host side.
Implementation approach:
A new option named 'tcp_congestion' is added to the fabrics opt_tokens
so that the 'nvme connect' command can pass in the congestion control
chosen by the user.
Later, in nvme_tcp_alloc_queue(), the specified congestion control is
applied to the corresponding host-side sockets.
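For illustration only (not part of this patch), a userspace sketch of
how such an options string could reach nvmf_parse_options() through the
fabrics character device; the address, NQN and the choice of dctcp are
placeholders, and in practice nvme-cli would construct the string:
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* placeholder connect string; only 'tcp_congestion=' is new here */
        const char *opts =
                "transport=tcp,traddr=192.168.0.10,trsvcid=4420,"
                "nqn=nqn.2022-03.org.example:subsys1,tcp_congestion=dctcp";
        int fd = open("/dev/nvme-fabrics", O_RDWR);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, opts, strlen(opts)) < 0)
                perror("write");
        close(fd);
        return 0;
}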
Signed-off-by: Mingbao Sun <tyler.sun@dell.com>
---
drivers/nvme/host/fabrics.c | 12 ++++++++++++
drivers/nvme/host/fabrics.h | 2 ++
drivers/nvme/host/tcp.c | 15 ++++++++++++++-
3 files changed, 28 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index ee79a6d639b4..79d5f0dbafd3 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -548,6 +548,7 @@ static const match_table_t opt_tokens = {
{ NVMF_OPT_TOS, "tos=%d" },
{ NVMF_OPT_FAIL_FAST_TMO, "fast_io_fail_tmo=%d" },
{ NVMF_OPT_DISCOVERY, "discovery" },
+ { NVMF_OPT_TCP_CONGESTION, "tcp_congestion=%s" },
{ NVMF_OPT_ERR, NULL }
};
@@ -829,6 +830,16 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
case NVMF_OPT_DISCOVERY:
opts->discovery_nqn = true;
break;
+ case NVMF_OPT_TCP_CONGESTION:
+ p = match_strdup(args);
+ if (!p) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ kfree(opts->tcp_congestion);
+ opts->tcp_congestion = p;
+ break;
default:
pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
p);
@@ -947,6 +958,7 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts)
kfree(opts->subsysnqn);
kfree(opts->host_traddr);
kfree(opts->host_iface);
+ kfree(opts->tcp_congestion);
kfree(opts);
}
EXPORT_SYMBOL_GPL(nvmf_free_options);
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index c3203ff1c654..25fdc169949d 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -68,6 +68,7 @@ enum {
NVMF_OPT_FAIL_FAST_TMO = 1 << 20,
NVMF_OPT_HOST_IFACE = 1 << 21,
NVMF_OPT_DISCOVERY = 1 << 22,
+ NVMF_OPT_TCP_CONGESTION = 1 << 23,
};
/**
@@ -117,6 +118,7 @@ struct nvmf_ctrl_options {
unsigned int nr_io_queues;
unsigned int reconnect_delay;
bool discovery_nqn;
+ const char *tcp_congestion;
bool duplicate_connect;
unsigned int kato;
struct nvmf_host *host;
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 10fc45d95b86..f2a6df35374a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1487,6 +1487,18 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
if (nctrl->opts->tos >= 0)
ip_sock_set_tos(queue->sock->sk, nctrl->opts->tos);
+ if (nctrl->opts->mask & NVMF_OPT_TCP_CONGESTION) {
+ ret = tcp_set_congestion_control(queue->sock->sk,
+ nctrl->opts->tcp_congestion,
+ true, true);
+ if (ret) {
+ dev_err(nctrl->device,
+ "failed to set TCP congestion to %s: %d\n",
+ nctrl->opts->tcp_congestion, ret);
+ goto err_sock;
+ }
+ }
+
/* Set 10 seconds timeout for icresp recvmsg */
queue->sock->sk->sk_rcvtimeo = 10 * HZ;
@@ -2650,7 +2662,8 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST |
NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES |
- NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE,
+ NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE |
+ NVMF_OPT_TCP_CONGESTION,
.create_ctrl = nvme_tcp_create_ctrl,
};
--
2.26.2
* Re: [PATCH 2/3] nvme-tcp: support specifying the congestion-control
2022-03-11 3:01 ` [PATCH 2/3] nvme-tcp: support specifying the congestion-control Mingbao Sun
@ 2022-03-11 7:15 ` Christoph Hellwig
2022-03-11 8:47 ` Mingbao Sun
0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2022-03-11 7:15 UTC (permalink / raw)
To: Mingbao Sun
Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, linux-nvme, linux-kernel, Eric Dumazet,
David S . Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
netdev, tyler.sun, ping.gan, yanxiu.cai, libin.zhang, ao.sun
On Fri, Mar 11, 2022 at 11:01:12AM +0800, Mingbao Sun wrote:
> + case NVMF_OPT_TCP_CONGESTION:
> + p = match_strdup(args);
> + if (!p) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + kfree(opts->tcp_congestion);
> + opts->tcp_congestion = p;
We'll need to check that the string is no longer than TCP_CA_NAME_MAX
somewhere.
>
> + if (nctrl->opts->mask & NVMF_OPT_TCP_CONGESTION) {
> + ret = tcp_set_congestion_control(queue->sock->sk,
> + nctrl->opts->tcp_congestion,
> + true, true);
This needs to be called under lock_sock() protection. Maybe also
add an assert to tcp_set_congestion_control to enforce that.
* Re: [PATCH 2/3] nvme-tcp: support specifying the congestion-control
2022-03-11 7:15 ` Christoph Hellwig
@ 2022-03-11 8:47 ` Mingbao Sun
0 siblings, 0 replies; 8+ messages in thread
From: Mingbao Sun @ 2022-03-11 8:47 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
linux-nvme, linux-kernel, Eric Dumazet, David S . Miller,
Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski, netdev, tyler.sun,
ping.gan, yanxiu.cai, libin.zhang, ao.sun
On Fri, 11 Mar 2022 08:15:18 +0100
Christoph Hellwig <hch@lst.de> wrote:
> On Fri, Mar 11, 2022 at 11:01:12AM +0800, Mingbao Sun wrote:
> > + case NVMF_OPT_TCP_CONGESTION:
> > + p = match_strdup(args);
> > + if (!p) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + kfree(opts->tcp_congestion);
> > + opts->tcp_congestion = p;
>
> We'll need to check that the string is no longer than TCP_CA_NAME_MAX
> somewhere.
>
Accepted. Will do that in the next version, roughly along the lines of
the sketch below. The same check will also be applied on the target
side.
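(A sketch only; the exact placement and error handling may differ in
the next version.)
        case NVMF_OPT_TCP_CONGESTION:
                p = match_strdup(args);
                if (!p) {
                        ret = -ENOMEM;
                        goto out;
                }
                /* reject names the TCP stack would refuse anyway */
                if (strlen(p) >= TCP_CA_NAME_MAX) {
                        kfree(p);
                        ret = -EINVAL;
                        goto out;
                }
                kfree(opts->tcp_congestion);
                opts->tcp_congestion = p;
                break;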
> >
> > + if (nctrl->opts->mask & NVMF_OPT_TCP_CONGESTION) {
> > + ret = tcp_set_congestion_control(queue->sock->sk,
> > + nctrl->opts->tcp_congestion,
> > + true, true);
>
> This needs to be called under lock_sock() protection. Maybe also
> add an assert to tcp_set_congestion_control to enforce that.
Accepted. Will handle it in the next version, and the same will be
applied on the target side. Many thanks for the reminder.
As for the assertion, I could not find a clearly conventional way to
do that. Would you have a suggestion? For instance, would something
like the sketch below be acceptable:
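(A sketch only, assuming the assert would sit near the top of
tcp_set_congestion_control():)
        /* WARNs under lockdep when the caller does not hold the socket lock */
        sock_owned_by_me(sk);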
* [PATCH 3/3] nvmet-tcp: support specifying the congestion-control
2022-03-11 3:01 [PATCH 0/3] NVMe/TCP: support specifying the congestion-control Mingbao Sun
2022-03-11 3:01 ` [PATCH 1/3] tcp: export symbol tcp_set_congestion_control Mingbao Sun
2022-03-11 3:01 ` [PATCH 2/3] nvme-tcp: support specifying the congestion-control Mingbao Sun
@ 2022-03-11 3:01 ` Mingbao Sun
2 siblings, 0 replies; 8+ messages in thread
From: Mingbao Sun @ 2022-03-11 3:01 UTC (permalink / raw)
To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, linux-nvme, linux-kernel, Eric Dumazet,
David S . Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski,
netdev
Cc: sunmingbao, tyler.sun, ping.gan, yanxiu.cai, libin.zhang, ao.sun
From: Mingbao Sun <tyler.sun@dell.com>
The congestion-control algorithm can have a noticeable impact on the
performance of TCP-based communications. This is of course also true
for NVMe/TCP.
Different congestion controls (e.g., cubic, dctcp) suit different
scenarios. Choosing an appropriate one can improve performance
considerably, while a poor choice can degrade it badly.
The congestion control used by NVMe/TCP can be changed globally by
writing '/proc/sys/net/ipv4/tcp_congestion_control', but that also
changes the default for every future TCP socket that has not been
explicitly assigned a congestion control, potentially affecting the
performance of those sockets as well.
So it makes sense for NVMe/TCP to support specifying its congestion
control explicitly. This commit addresses the target side.
Implementation approach:
A new configfs attribute is created so that the user can specify the
congestion control of each nvmet port:
'/sys/kernel/config/nvmet/ports/X/tcp_congestion'
Later, in nvmet_tcp_add_port(), the specified congestion control is
applied to the listening socket of the nvmet port.
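For illustration only (not part of this patch), the equivalent of
'echo dctcp > /sys/kernel/config/nvmet/ports/1/tcp_congestion' as a
small userspace sketch; the port number and algorithm are placeholders,
and the attribute must be written before the port is enabled:
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/sys/kernel/config/nvmet/ports/1/tcp_congestion",
                      O_WRONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, "dctcp\n", 6) < 0)
                perror("write");
        close(fd);
        return 0;
}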
Signed-off-by: Mingbao Sun <tyler.sun@dell.com>
---
drivers/nvme/target/configfs.c | 37 ++++++++++++++++++++++++++++++++++
drivers/nvme/target/nvmet.h | 1 +
drivers/nvme/target/tcp.c | 11 ++++++++++
3 files changed, 49 insertions(+)
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index 091a0ca16361..644e89bb0ee9 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -222,6 +222,41 @@ static ssize_t nvmet_addr_trsvcid_store(struct config_item *item,
CONFIGFS_ATTR(nvmet_, addr_trsvcid);
+static ssize_t nvmet_tcp_congestion_show(struct config_item *item,
+ char *page)
+{
+ struct nvmet_port *port = to_nvmet_port(item);
+
+ return snprintf(page, PAGE_SIZE, "%s\n",
+ port->tcp_congestion ? port->tcp_congestion : "");
+}
+
+static ssize_t nvmet_tcp_congestion_store(struct config_item *item,
+ const char *page, size_t count)
+{
+ struct nvmet_port *port = to_nvmet_port(item);
+ int len;
+ char *buf;
+
+ len = strcspn(page, "\n");
+ if (!len)
+ return -EINVAL;
+
+ if (nvmet_is_port_enabled(port, __func__))
+ return -EACCES;
+
+ buf = kmemdup_nul(page, len, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ kfree(port->tcp_congestion);
+ port->tcp_congestion = buf;
+
+ return count;
+}
+
+CONFIGFS_ATTR(nvmet_, tcp_congestion);
+
static ssize_t nvmet_param_inline_data_size_show(struct config_item *item,
char *page)
{
@@ -1597,6 +1632,7 @@ static void nvmet_port_release(struct config_item *item)
list_del(&port->global_entry);
kfree(port->ana_state);
+ kfree(port->tcp_congestion);
kfree(port);
}
@@ -1605,6 +1641,7 @@ static struct configfs_attribute *nvmet_port_attrs[] = {
&nvmet_attr_addr_treq,
&nvmet_attr_addr_traddr,
&nvmet_attr_addr_trsvcid,
+ &nvmet_attr_tcp_congestion,
&nvmet_attr_addr_trtype,
&nvmet_attr_param_inline_data_size,
#ifdef CONFIG_BLK_DEV_INTEGRITY
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 69637bf8f8e1..76a57c4c3456 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -145,6 +145,7 @@ struct nvmet_port {
struct config_group ana_groups_group;
struct nvmet_ana_group ana_default_group;
enum nvme_ana_state *ana_state;
+ const char *tcp_congestion;
void *priv;
bool enabled;
int inline_data_size;
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 83ca577f72be..489c46e396b9 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1741,6 +1741,17 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
if (so_priority > 0)
sock_set_priority(port->sock->sk, so_priority);
+ if (nport->tcp_congestion) {
+ ret = tcp_set_congestion_control(port->sock->sk,
+ nport->tcp_congestion,
+ true, true);
+ if (ret) {
+ pr_err("failed to set port socket's congestion to %s: %d\n",
+ nport->tcp_congestion, ret);
+ goto err_sock;
+ }
+ }
+
ret = kernel_bind(port->sock, (struct sockaddr *)&port->addr,
sizeof(port->addr));
if (ret) {
--
2.26.2