* [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages
@ 2024-05-30 14:24 Ofir Gal
2024-05-30 14:24 ` [PATCH v2 1/4] net: introduce helper sendpages_ok() Ofir Gal
` (6 more replies)
0 siblings, 7 replies; 30+ messages in thread
From: Ofir Gal @ 2024-05-30 14:24 UTC (permalink / raw)
To: davem, linux-block, linux-nvme, netdev, ceph-devel
Cc: dhowells, edumazet, pabeni, kbusch, axboe, hch, sagi,
philipp.reisner, lars.ellenberg, christoph.boehmwalder, idryomov,
xiubli
skb_splice_from_iter() warns on !sendpage_ok(), which results in nvme-tcp
data transfer failure. This warning leads to hanging IO.
nvme-tcp uses sendpage_ok() to check the first page of an iterator in
order to decide whether to disable MSG_SPLICE_PAGES. The iterator can
represent a list of contiguous pages.
When MSG_SPLICE_PAGES is enabled, skb_splice_from_iter() is used; it
requires all pages in the iterator to be sendable, and checks each page
with sendpage_ok().
nvme_tcp_try_send_data() might allow MSG_SPLICE_PAGES when the first
page is sendable but the following ones are not. skb_splice_from_iter()
will attempt to send all the pages in the iterator; when it reaches an
unsendable page, the IO will hang.
The patch introduces a helper, sendpages_ok(), which returns true only
if all the contiguous pages are sendable.
Drivers that want to send contiguous pages with MSG_SPLICE_PAGES may use
this helper to check whether the page list is OK. If the helper does not
return true, the driver should remove the MSG_SPLICE_PAGES flag.
The bug is reproducible; to reproduce it we need nvme-over-tcp
controllers with an optimal IO size bigger than PAGE_SIZE. Creating a
raid with a bitmap over those devices reproduces the bug.
To simulate a large optimal IO size you can use dm-stripe with a
single device.
A script that reproduces the issue on top of brd devices using
dm-stripe is attached below.
I have added 3 prints to test my theory: one in nvme_tcp_try_send_data()
and two in skb_splice_from_iter(), the first before the sendpage_ok()
check and the second on !sendpage_ok(), after the warning.
...
nvme_tcp: sendpage_ok, page: 0x654eccd7 (pfn: 120755), len: 262144, offset: 0
skbuff: before sendpage_ok - i: 0. page: 0x654eccd7 (pfn: 120755)
skbuff: before sendpage_ok - i: 1. page: 0x1666a4da (pfn: 120756)
skbuff: before sendpage_ok - i: 2. page: 0x54f9f140 (pfn: 120757)
WARNING: at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450
skbuff: !sendpage_ok - page: 0x54f9f140 (pfn: 120757). is_slab: 1, page_count: 1
...
stack trace:
...
WARNING: at net/core/skbuff.c:6848 skb_splice_from_iter+0x141/0x450
Workqueue: nvme_tcp_wq nvme_tcp_io_work
Call Trace:
? show_regs+0x6a/0x80
? skb_splice_from_iter+0x141/0x450
? __warn+0x8d/0x130
? skb_splice_from_iter+0x141/0x450
? report_bug+0x18c/0x1a0
? handle_bug+0x40/0x70
? exc_invalid_op+0x19/0x70
? asm_exc_invalid_op+0x1b/0x20
? skb_splice_from_iter+0x141/0x450
tcp_sendmsg_locked+0x39e/0xee0
? _prb_read_valid+0x216/0x290
tcp_sendmsg+0x2d/0x50
inet_sendmsg+0x43/0x80
sock_sendmsg+0x102/0x130
? vprintk_default+0x1d/0x30
? vprintk+0x3c/0x70
? _printk+0x58/0x80
nvme_tcp_try_send_data+0x17d/0x530
nvme_tcp_try_send+0x1b7/0x300
nvme_tcp_io_work+0x3c/0xc0
process_one_work+0x22e/0x420
worker_thread+0x50/0x3f0
? __pfx_worker_thread+0x10/0x10
kthread+0xd6/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x3c/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
...
---
Changelog:
v2, fix typo in patch subject
Ofir Gal (4):
net: introduce helper sendpages_ok()
nvme-tcp: use sendpages_ok() instead of sendpage_ok()
drbd: use sendpages_ok() instead of sendpage_ok()
libceph: use sendpages_ok() instead of sendpage_ok()
drivers/block/drbd/drbd_main.c | 2 +-
drivers/nvme/host/tcp.c | 2 +-
include/linux/net.h | 20 ++++++++++++++++++++
net/ceph/messenger_v1.c | 2 +-
net/ceph/messenger_v2.c | 2 +-
5 files changed, 24 insertions(+), 4 deletions(-)
reproduce.sh | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 114 insertions(+)
create mode 100755 reproduce.sh
diff --git a/reproduce.sh b/reproduce.sh
new file mode 100755
index 000000000..8ae226b18
--- /dev/null
+++ b/reproduce.sh
@@ -0,0 +1,114 @@
+#!/usr/bin/env sh
+# SPDX-License-Identifier: MIT
+
+set -e
+
+load_modules() {
+ modprobe nvme
+ modprobe nvme-tcp
+ modprobe nvmet
+ modprobe nvmet-tcp
+}
+
+setup_ns() {
+ local dev=$1
+ local num=$2
+ local port=$3
+ ls $dev > /dev/null
+
+ mkdir -p /sys/kernel/config/nvmet/subsystems/$num
+ cd /sys/kernel/config/nvmet/subsystems/$num
+ echo 1 > attr_allow_any_host
+
+ mkdir -p namespaces/$num
+ cd namespaces/$num/
+ echo $dev > device_path
+ echo 1 > enable
+
+ ln -s /sys/kernel/config/nvmet/subsystems/$num \
+ /sys/kernel/config/nvmet/ports/$port/subsystems/
+}
+
+setup_port() {
+ local num=$1
+
+ mkdir -p /sys/kernel/config/nvmet/ports/$num
+ cd /sys/kernel/config/nvmet/ports/$num
+ echo "127.0.0.1" > addr_traddr
+ echo tcp > addr_trtype
+ echo 8009 > addr_trsvcid
+ echo ipv4 > addr_adrfam
+}
+
+setup_big_opt_io() {
+ local dev=$1
+ local name=$2
+
+ # Change optimal IO size by creating dm stripe
+ dmsetup create $name --table \
+ "0 `blockdev --getsz $dev` striped 1 512 $dev 0"
+}
+
+setup_targets() {
+ # Setup ram devices instead of using real nvme devices
+ modprobe brd rd_size=1048576 rd_nr=2 # 1GiB
+
+ setup_big_opt_io /dev/ram0 ram0_big_opt_io
+ setup_big_opt_io /dev/ram1 ram1_big_opt_io
+
+ setup_port 1
+ setup_ns /dev/mapper/ram0_big_opt_io 1 1
+ setup_ns /dev/mapper/ram1_big_opt_io 2 1
+}
+
+setup_initiators() {
+ nvme connect -t tcp -n 1 -a 127.0.0.1 -s 8009
+ nvme connect -t tcp -n 2 -a 127.0.0.1 -s 8009
+}
+
+reproduce_warn() {
+ local devs=$@
+
+ # Hangs here
+ mdadm --create /dev/md/test_md --level=1 --bitmap=internal \
+ --bitmap-chunk=1024K --assume-clean --run --raid-devices=2 $devs
+}
+
+echo "###################################
+
+The script creates 2 nvme initiators in order to reproduce the bug.
+The script doesn't know which controllers it created; choose the new nvme
+controllers when asked.
+
+###################################
+
+Press enter to continue.
+"
+
+read tmp
+
+echo "# Creating 2 nvme controllers for the reproduction. current nvme devices:"
+lsblk -s | grep nvme || true
+echo "---------------------------------
+"
+
+load_modules
+setup_targets
+setup_initiators
+
+sleep 0.1 # Wait for the new nvme ctrls to show up
+
+echo "# Created 2 nvme devices. nvme devices list:"
+
+lsblk -s | grep nvme
+echo "---------------------------------
+"
+
+echo "# Enter the new nvme devices on separate lines. Both should be 1G in size."
+read dev1
+read dev2
+
+ls /dev/$dev1 > /dev/null
+ls /dev/$dev2 > /dev/null
+
+reproduce_warn /dev/$dev1 /dev/$dev2
--
2.34.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v2 1/4] net: introduce helper sendpages_ok()
2024-05-30 14:24 [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages Ofir Gal
@ 2024-05-30 14:24 ` Ofir Gal
2024-05-31 7:32 ` Christoph Hellwig
2024-05-31 8:51 ` Sagi Grimberg
2024-05-30 14:24 ` [PATCH v2 2/4] nvme-tcp: use sendpages_ok() instead of sendpage_ok() Ofir Gal
` (5 subsequent siblings)
6 siblings, 2 replies; 30+ messages in thread
From: Ofir Gal @ 2024-05-30 14:24 UTC (permalink / raw)
To: davem, linux-block, linux-nvme, netdev, ceph-devel
Cc: dhowells, edumazet, pabeni, kbusch, axboe, hch, sagi,
philipp.reisner, lars.ellenberg, christoph.boehmwalder, idryomov,
xiubli
Network drivers are using sendpage_ok() to check the first page of an
iterator in order to disable MSG_SPLICE_PAGES. The iterator can
represent a list of contiguous pages.
When MSG_SPLICE_PAGES is enabled, skb_splice_from_iter() is used; it
requires all pages in the iterator to be sendable. Therefore it needs
to check that each page is sendable.
The patch introduces a helper, sendpages_ok(), which returns true if all
the contiguous pages are sendable.
Drivers that want to send contiguous pages with MSG_SPLICE_PAGES may use
this helper to check whether the page list is OK. If the helper does not
return true, the driver should remove the MSG_SPLICE_PAGES flag.
Signed-off-by: Ofir Gal <ofir.gal@volumez.com>
---
include/linux/net.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/include/linux/net.h b/include/linux/net.h
index 688320b79fcc..b33bdc3e2031 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -322,6 +322,26 @@ static inline bool sendpage_ok(struct page *page)
return !PageSlab(page) && page_count(page) >= 1;
}
+/*
+ * Check sendpage_ok on contiguous pages.
+ */
+static inline bool sendpages_ok(struct page *page, size_t len, size_t offset)
+{
+ unsigned int pagecount;
+ size_t page_offset;
+ int k;
+
+ page = page + offset / PAGE_SIZE;
+ page_offset = offset % PAGE_SIZE;
+ pagecount = DIV_ROUND_UP(len + page_offset, PAGE_SIZE);
+
+ for (k = 0; k < pagecount; k++)
+ if (!sendpage_ok(page + k))
+ return false;
+
+ return true;
+}
+
int kernel_sendmsg(struct socket *sock, struct msghdr *msg, struct kvec *vec,
size_t num, size_t len);
int kernel_sendmsg_locked(struct sock *sk, struct msghdr *msg,
--
2.34.1
* [PATCH v2 2/4] nvme-tcp: use sendpages_ok() instead of sendpage_ok()
2024-05-30 14:24 [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages Ofir Gal
2024-05-30 14:24 ` [PATCH v2 1/4] net: introduce helper sendpages_ok() Ofir Gal
@ 2024-05-30 14:24 ` Ofir Gal
2024-05-31 7:32 ` Christoph Hellwig
2024-05-30 14:24 ` [PATCH v2 3/4] drbd: " Ofir Gal
` (4 subsequent siblings)
6 siblings, 1 reply; 30+ messages in thread
From: Ofir Gal @ 2024-05-30 14:24 UTC (permalink / raw)
To: davem, linux-block, linux-nvme, netdev, ceph-devel
Cc: dhowells, edumazet, pabeni, kbusch, axboe, hch, sagi
Currently nvme_tcp_try_send_data() uses sendpage_ok() in order to disable
MSG_SPLICE_PAGES. It checks only the first page of the iterator, but the
iterator may represent multiple contiguous pages.
MSG_SPLICE_PAGES enables skb_splice_from_iter() which checks all the
pages it sends with sendpage_ok().
When nvme_tcp_try_send_data() sends an iterator whose first page is
sendable but one of the other pages isn't, skb_splice_from_iter() warns
and aborts the data transfer.
Using the new helper sendpages_ok() in order to disable MSG_SPLICE_PAGES
solves the issue.
Signed-off-by: Ofir Gal <ofir.gal@volumez.com>
---
drivers/nvme/host/tcp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 8b5e4327fe83..9f0fd14cbcb7 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1051,7 +1051,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
else
msg.msg_flags |= MSG_MORE;
- if (!sendpage_ok(page))
+ if (!sendpages_ok(page, len, offset))
msg.msg_flags &= ~MSG_SPLICE_PAGES;
bvec_set_page(&bvec, page, len, offset);
--
2.34.1
* [PATCH v2 3/4] drbd: use sendpages_ok() instead of sendpage_ok()
2024-05-30 14:24 [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages Ofir Gal
2024-05-30 14:24 ` [PATCH v2 1/4] net: introduce helper sendpages_ok() Ofir Gal
2024-05-30 14:24 ` [PATCH v2 2/4] nvme-tcp: use sendpages_ok() instead of sendpage_ok() Ofir Gal
@ 2024-05-30 14:24 ` Ofir Gal
2024-06-04 14:43 ` Christoph Böhmwalder
2024-05-30 14:24 ` [PATCH v2 4/4] libceph: " Ofir Gal
` (3 subsequent siblings)
6 siblings, 1 reply; 30+ messages in thread
From: Ofir Gal @ 2024-05-30 14:24 UTC (permalink / raw)
To: davem, linux-block, linux-nvme, netdev, ceph-devel
Cc: dhowells, edumazet, pabeni, philipp.reisner, lars.ellenberg,
christoph.boehmwalder
Currently _drbd_send_page() uses sendpage_ok() in order to enable
MSG_SPLICE_PAGES. It checks only the first page of the iterator, but the
iterator may represent multiple contiguous pages.
MSG_SPLICE_PAGES enables skb_splice_from_iter() which checks all the
pages it sends with sendpage_ok().
When _drbd_send_page() sends an iterator whose first page is sendable
but one of the other pages isn't, skb_splice_from_iter() warns and
aborts the data transfer.
Using the new helper sendpages_ok() in order to enable MSG_SPLICE_PAGES
solves the issue.
Signed-off-by: Ofir Gal <ofir.gal@volumez.com>
---
drivers/block/drbd/drbd_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 113b441d4d36..a5dbbf6cce23 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1550,7 +1550,7 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa
* put_page(); and would cause either a VM_BUG directly, or
* __page_cache_release a page that would actually still be referenced
* by someone, leading to some obscure delayed Oops somewhere else. */
- if (!drbd_disable_sendpage && sendpage_ok(page))
+ if (!drbd_disable_sendpage && sendpages_ok(page, len, offset))
msg.msg_flags |= MSG_NOSIGNAL | MSG_SPLICE_PAGES;
drbd_update_congested(peer_device->connection);
--
2.34.1
* [PATCH v2 4/4] libceph: use sendpages_ok() instead of sendpage_ok()
2024-05-30 14:24 [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages Ofir Gal
` (2 preceding siblings ...)
2024-05-30 14:24 ` [PATCH v2 3/4] drbd: " Ofir Gal
@ 2024-05-30 14:24 ` Ofir Gal
2024-05-31 7:32 ` [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages Christoph Hellwig
` (2 subsequent siblings)
6 siblings, 0 replies; 30+ messages in thread
From: Ofir Gal @ 2024-05-30 14:24 UTC (permalink / raw)
To: davem, linux-block, linux-nvme, netdev, ceph-devel
Cc: dhowells, edumazet, pabeni, idryomov, xiubli
Currently ceph_tcp_sendpage() and do_try_sendpage() use sendpage_ok() in
order to enable MSG_SPLICE_PAGES. They check only the first page of the
iterator, but the iterator may represent multiple contiguous pages.
MSG_SPLICE_PAGES enables skb_splice_from_iter() which checks all the
pages it sends with sendpage_ok().
When ceph_tcp_sendpage() or do_try_sendpage() sends an iterator whose
first page is sendable but one of the other pages isn't,
skb_splice_from_iter() warns and aborts the data transfer.
Using the new helper sendpages_ok() in order to enable MSG_SPLICE_PAGES
solves the issue.
Signed-off-by: Ofir Gal <ofir.gal@volumez.com>
---
net/ceph/messenger_v1.c | 2 +-
net/ceph/messenger_v2.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
index 0cb61c76b9b8..a6788f284cd7 100644
--- a/net/ceph/messenger_v1.c
+++ b/net/ceph/messenger_v1.c
@@ -94,7 +94,7 @@ static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
* coalescing neighboring slab objects into a single frag which
* triggers one of hardened usercopy checks.
*/
- if (sendpage_ok(page))
+ if (sendpages_ok(page, size, offset))
msg.msg_flags |= MSG_SPLICE_PAGES;
bvec_set_page(&bvec, page, size, offset);
diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
index bd608ffa0627..27f8f6c8eb60 100644
--- a/net/ceph/messenger_v2.c
+++ b/net/ceph/messenger_v2.c
@@ -165,7 +165,7 @@ static int do_try_sendpage(struct socket *sock, struct iov_iter *it)
* coalescing neighboring slab objects into a single frag
* which triggers one of hardened usercopy checks.
*/
- if (sendpage_ok(bv.bv_page))
+ if (sendpages_ok(bv.bv_page, bv.bv_len, bv.bv_offset))
msg.msg_flags |= MSG_SPLICE_PAGES;
else
msg.msg_flags &= ~MSG_SPLICE_PAGES;
--
2.34.1
* Re: [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages
2024-05-30 14:24 [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages Ofir Gal
` (3 preceding siblings ...)
2024-05-30 14:24 ` [PATCH v2 4/4] libceph: " Ofir Gal
@ 2024-05-31 7:32 ` Christoph Hellwig
2024-06-01 22:36 ` Jakub Kicinski
2024-06-01 22:34 ` Jakub Kicinski
2024-06-03 7:24 ` Hannes Reinecke
6 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2024-05-31 7:32 UTC (permalink / raw)
To: Ofir Gal
Cc: davem, linux-block, linux-nvme, netdev, ceph-devel, dhowells,
edumazet, pabeni, kbusch, axboe, hch, sagi, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
I still find it highly annoying that we can't have a helper that
simply does the right thing for callers, but I guess this is the
best thing we can get without a change of mind from the networking
maintainers.
* Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()
2024-05-30 14:24 ` [PATCH v2 1/4] net: introduce helper sendpages_ok() Ofir Gal
@ 2024-05-31 7:32 ` Christoph Hellwig
2024-05-31 8:51 ` Sagi Grimberg
1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-05-31 7:32 UTC (permalink / raw)
To: Ofir Gal
Cc: davem, linux-block, linux-nvme, netdev, ceph-devel, dhowells,
edumazet, pabeni, kbusch, axboe, hch, sagi, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v2 2/4] nvme-tcp: use sendpages_ok() instead of sendpage_ok()
2024-05-30 14:24 ` [PATCH v2 2/4] nvme-tcp: use sendpages_ok() instead of sendpage_ok() Ofir Gal
@ 2024-05-31 7:32 ` Christoph Hellwig
0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-05-31 7:32 UTC (permalink / raw)
To: Ofir Gal
Cc: davem, linux-block, linux-nvme, netdev, ceph-devel, dhowells,
edumazet, pabeni, kbusch, axboe, hch, sagi
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()
2024-05-30 14:24 ` [PATCH v2 1/4] net: introduce helper sendpages_ok() Ofir Gal
2024-05-31 7:32 ` Christoph Hellwig
@ 2024-05-31 8:51 ` Sagi Grimberg
2024-06-03 12:35 ` Ofir Gal
1 sibling, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2024-05-31 8:51 UTC (permalink / raw)
To: Ofir Gal, davem, linux-block, linux-nvme, netdev, ceph-devel
Cc: dhowells, edumazet, pabeni, kbusch, axboe, hch, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
On 30/05/2024 17:24, Ofir Gal wrote:
> Network drivers are using sendpage_ok() to check the first page of an
> iterator in order to disable MSG_SPLICE_PAGES. The iterator can
> represent list of contiguous pages.
>
> When MSG_SPLICE_PAGES is enabled skb_splice_from_iter() is being used,
> it requires all pages in the iterator to be sendable. Therefore it needs
> to check that each page is sendable.
>
> The patch introduces a helper sendpages_ok(), it returns true if all the
> contiguous pages are sendable.
>
> Drivers who want to send contiguous pages with MSG_SPLICE_PAGES may use
> this helper to check whether the page list is OK. If the helper does not
> return true, the driver should remove MSG_SPLICE_PAGES flag.
>
> Signed-off-by: Ofir Gal <ofir.gal@volumez.com>
> ---
> include/linux/net.h | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/include/linux/net.h b/include/linux/net.h
> index 688320b79fcc..b33bdc3e2031 100644
> --- a/include/linux/net.h
> +++ b/include/linux/net.h
> @@ -322,6 +322,26 @@ static inline bool sendpage_ok(struct page *page)
> return !PageSlab(page) && page_count(page) >= 1;
> }
>
> +/*
> + * Check sendpage_ok on contiguous pages.
> + */
> +static inline bool sendpages_ok(struct page *page, size_t len, size_t offset)
> +{
> + unsigned int pagecount;
> + size_t page_offset;
> + int k;
> +
> + page = page + offset / PAGE_SIZE;
> + page_offset = offset % PAGE_SIZE;
Let's not modify the input page variable.
p = page + (offset >> PAGE_SHIFT);
poffset = offset & ~PAGE_MASK;
> + pagecount = DIV_ROUND_UP(len + page_offset, PAGE_SIZE);
> +
> + for (k = 0; k < pagecount; k++)
> + if (!sendpage_ok(page + k))
> + return false;
perhaps instead of doing a costly DIV_ROUND_UP for every network send we
can do:
count = 0;
while (count < len) {
    if (!sendpage_ok(p))
        return false;
    p++;
    count += PAGE_SIZE;
}
And we can lose page_offset.
It can be done in a number of ways, but we should be able to do it
without the DIV_ROUND_UP...
I still don't understand how a page in the middle of a contiguous range ends
up coming from the slab while others don't.
Ofir, can you please check which condition in sendpage_ok actually fails?
* Re: [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages
2024-05-30 14:24 [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages Ofir Gal
` (4 preceding siblings ...)
2024-05-31 7:32 ` [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages Christoph Hellwig
@ 2024-06-01 22:34 ` Jakub Kicinski
2024-06-02 7:48 ` Sagi Grimberg
2024-06-03 9:07 ` Hannes Reinecke
2024-06-03 7:24 ` Hannes Reinecke
6 siblings, 2 replies; 30+ messages in thread
From: Jakub Kicinski @ 2024-06-01 22:34 UTC (permalink / raw)
To: Ofir Gal
Cc: davem, linux-block, linux-nvme, netdev, ceph-devel, dhowells,
edumazet, pabeni, kbusch, axboe, hch, sagi, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
On Thu, 30 May 2024 17:24:10 +0300 Ofir Gal wrote:
> skbuff: before sendpage_ok - i: 0. page: 0x654eccd7 (pfn: 120755)
> skbuff: before sendpage_ok - i: 1. page: 0x1666a4da (pfn: 120756)
> skbuff: before sendpage_ok - i: 2. page: 0x54f9f140 (pfn: 120757)
noob question, how do you get 3 contiguous pages, the third of which
is slab? is_slab doesn't mean what I think it does, or we got extremely
lucky with kmalloc?
* Re: [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages
2024-05-31 7:32 ` [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages Christoph Hellwig
@ 2024-06-01 22:36 ` Jakub Kicinski
2024-06-04 4:30 ` Christoph Hellwig
0 siblings, 1 reply; 30+ messages in thread
From: Jakub Kicinski @ 2024-06-01 22:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Ofir Gal, davem, linux-block, linux-nvme, netdev, ceph-devel,
dhowells, edumazet, pabeni, kbusch, axboe, sagi, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
On Fri, 31 May 2024 09:32:14 +0200 Christoph Hellwig wrote:
> I still find it hightly annoying that we can't have a helper that
> simply does the right thing for callers, but I guess this is the
> best thing we can get without a change of mind from the networking
> maintainers..
Change mind about what? Did I miss a discussion?
There are no Fixes tags here, but the sendpage conversions are recent
work from David Howells, the interface is hardly set in stone..
* Re: [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages
2024-06-01 22:34 ` Jakub Kicinski
@ 2024-06-02 7:48 ` Sagi Grimberg
2024-06-03 9:07 ` Hannes Reinecke
1 sibling, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2024-06-02 7:48 UTC (permalink / raw)
To: Jakub Kicinski, Ofir Gal
Cc: davem, linux-block, linux-nvme, netdev, ceph-devel, dhowells,
edumazet, pabeni, kbusch, axboe, hch, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
On 02/06/2024 1:34, Jakub Kicinski wrote:
> On Thu, 30 May 2024 17:24:10 +0300 Ofir Gal wrote:
>> skbuff: before sendpage_ok - i: 0. page: 0x654eccd7 (pfn: 120755)
>> skbuff: before sendpage_ok - i: 1. page: 0x1666a4da (pfn: 120756)
>> skbuff: before sendpage_ok - i: 2. page: 0x54f9f140 (pfn: 120757)
> noob question, how do you get 3 contiguous pages, the third of which
> is slab? is_slab doesn't mean what I think it does, or we got extremely
> lucky with kmalloc?
>
The contig range according to the trace is 256K; the third page was just
the first !ok page it encountered.
I asked the same thing. nvme-tcp gets a bio and sets up its own iov_iter
on the bio bvec for sending it over the wire. The test that reproduces this
creates an raid1 md device which probably has at least some effect into how
we got this buffer.
With the recent multipage bvecs work from Ming, nvme-tcp bvec entries
will often point to contiguous ranges that are > PAGE_SIZE. I didn't
look into the implementation of skb_splice_from_iter, but I think it's
not very efficient to extract a contiguous range in a PAGE_SIZE granular
vector...
* Re: [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages
2024-05-30 14:24 [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages Ofir Gal
` (5 preceding siblings ...)
2024-06-01 22:34 ` Jakub Kicinski
@ 2024-06-03 7:24 ` Hannes Reinecke
2024-06-03 12:49 ` Ofir Gal
6 siblings, 1 reply; 30+ messages in thread
From: Hannes Reinecke @ 2024-06-03 7:24 UTC (permalink / raw)
To: Ofir Gal, davem, linux-block, linux-nvme, netdev, ceph-devel
Cc: dhowells, edumazet, pabeni, kbusch, axboe, hch, sagi,
philipp.reisner, lars.ellenberg, christoph.boehmwalder, idryomov,
xiubli
On 5/30/24 16:24, Ofir Gal wrote:
> skb_splice_from_iter() warns on !sendpage_ok() which results in nvme-tcp
> data transfer failure. This warning leads to hanging IO.
>
> [...]
>
> +reproduce_warn /dev/$dev1 /dev/$dev2
Can you convert that into a blktest script?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages
2024-06-01 22:34 ` Jakub Kicinski
2024-06-02 7:48 ` Sagi Grimberg
@ 2024-06-03 9:07 ` Hannes Reinecke
2024-06-03 12:46 ` Ofir Gal
1 sibling, 1 reply; 30+ messages in thread
From: Hannes Reinecke @ 2024-06-03 9:07 UTC (permalink / raw)
To: Jakub Kicinski, Ofir Gal
Cc: davem, linux-block, linux-nvme, netdev, ceph-devel, dhowells,
edumazet, pabeni, kbusch, axboe, hch, sagi, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
On 6/2/24 00:34, Jakub Kicinski wrote:
> On Thu, 30 May 2024 17:24:10 +0300 Ofir Gal wrote:
>> skbuff: before sendpage_ok - i: 0. page: 0x654eccd7 (pfn: 120755)
>> skbuff: before sendpage_ok - i: 1. page: 0x1666a4da (pfn: 120756)
>> skbuff: before sendpage_ok - i: 2. page: 0x54f9f140 (pfn: 120757)
>
> noob question, how do you get 3 contiguous pages, the third of which
> is slab? is_slab doesn't mean what I think it does, or we got extremely
> lucky with kmalloc?
>
I guess it's not slab which triggered; the actual code is:
static inline bool sendpage_ok(struct page *page)
{
return !PageSlab(page) && page_count(page) >= 1;
}
My bet is on 'page_count()' triggering.
Cheers,
Hannes
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()
2024-05-31 8:51 ` Sagi Grimberg
@ 2024-06-03 12:35 ` Ofir Gal
2024-06-03 21:27 ` Sagi Grimberg
0 siblings, 1 reply; 30+ messages in thread
From: Ofir Gal @ 2024-06-03 12:35 UTC (permalink / raw)
To: Sagi Grimberg, davem, linux-block, linux-nvme, netdev, ceph-devel
Cc: dhowells, edumazet, pabeni, kbusch, axboe, hch, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
On 31/05/2024 11:51, Sagi Grimberg wrote:
>
>
> On 30/05/2024 17:24, Ofir Gal wrote:
>> Network drivers are using sendpage_ok() to check the first page of an
>> iterator in order to disable MSG_SPLICE_PAGES. The iterator can
>> represent list of contiguous pages.
>>
>> When MSG_SPLICE_PAGES is enabled skb_splice_from_iter() is being used,
>> it requires all pages in the iterator to be sendable. Therefore it needs
>> to check that each page is sendable.
>>
>> The patch introduces a helper sendpages_ok(), it returns true if all the
>> contiguous pages are sendable.
>>
>> Drivers who want to send contiguous pages with MSG_SPLICE_PAGES may use
>> this helper to check whether the page list is OK. If the helper does not
>> return true, the driver should remove MSG_SPLICE_PAGES flag.
>>
>> Signed-off-by: Ofir Gal <ofir.gal@volumez.com>
>> ---
>> include/linux/net.h | 20 ++++++++++++++++++++
>> 1 file changed, 20 insertions(+)
>>
>> diff --git a/include/linux/net.h b/include/linux/net.h
>> index 688320b79fcc..b33bdc3e2031 100644
>> --- a/include/linux/net.h
>> +++ b/include/linux/net.h
>> @@ -322,6 +322,26 @@ static inline bool sendpage_ok(struct page *page)
>> return !PageSlab(page) && page_count(page) >= 1;
>> }
>> +/*
>> + * Check sendpage_ok on contiguous pages.
>> + */
>> +static inline bool sendpages_ok(struct page *page, size_t len, size_t offset)
>> +{
>> + unsigned int pagecount;
>> + size_t page_offset;
>> + int k;
>> +
>> + page = page + offset / PAGE_SIZE;
>> + page_offset = offset % PAGE_SIZE;
>
> lets not modify the input page variable.
>
> p = page + (offset >> PAGE_SHIFT);
> poffset = offset & ~PAGE_MASK;
Ok, will be applied in the next patch set.
>> + pagecount = DIV_ROUND_UP(len + page_offset, PAGE_SIZE);
>> +
>> + for (k = 0; k < pagecount; k++)
>> + if (!sendpage_ok(page + k))
>> + return false;
>
> perhaps instead of doing a costly DIV_ROUND_UP for every network send we can do:
>
> count = 0;
> while (count < len) {
> if (!sendpage_ok(p))
> return false;
> p++;
> count += PAGE_SIZE;
> }
>
> And we can lose page_offset.
>
> It can be done in a number of ways, but we should be able to do it
> without the DIV_ROUND_UP...
Ok, will be applied in the next patch set.
> I still don't understand how a page in the middle of a contiguous range ends
> up coming from the slab while others don't.
I haven't investigated the origin of the IO
yet. I suspect the first 2 pages are the superblocks of the raid
(mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap.
> Ofir, can you please check which condition in sendpage_ok actually fails?
It failed because the page is a slab page; its page count is 1. Sorry for
not clarifying this earlier.
"skbuff: !sendpage_ok - page: 0x54f9f140 (pfn: 120757). is_slab: 1, page_count: 1"
^
The print I used:
pr_info(
	"!sendpage_ok - page: 0x%p (pfn: %lx). is_slab: %u, page_count: %u\n",
	(void *)page,
	page_to_pfn(page),
	!!PageSlab(page),
	page_count(page)
);
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages
2024-06-03 9:07 ` Hannes Reinecke
@ 2024-06-03 12:46 ` Ofir Gal
0 siblings, 0 replies; 30+ messages in thread
From: Ofir Gal @ 2024-06-03 12:46 UTC (permalink / raw)
To: Hannes Reinecke, Jakub Kicinski
Cc: davem, linux-block, linux-nvme, netdev, ceph-devel, dhowells,
edumazet, pabeni, kbusch, axboe, hch, sagi, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
On 03/06/2024 12:07, Hannes Reinecke wrote:
> On 6/2/24 00:34, Jakub Kicinski wrote:
>> On Thu, 30 May 2024 17:24:10 +0300 Ofir Gal wrote:
>>> skbuff: before sendpage_ok - i: 0. page: 0x654eccd7 (pfn: 120755)
>>> skbuff: before sendpage_ok - i: 1. page: 0x1666a4da (pfn: 120756)
>>> skbuff: before sendpage_ok - i: 2. page: 0x54f9f140 (pfn: 120757)
>>
>> noob question, how do you get 3 contiguous pages, the third of which
>> is slab? is_slab doesn't mean what I think it does, or we got extremely
>> lucky with kmalloc?
>>
> I guess it's not slab which triggered; the actual code is:
>
> static inline bool sendpage_ok(struct page *page)
> {
> return !PageSlab(page) && page_count(page) >= 1;
> }
>
> My bet is on 'page_count()' triggering.
It failed because the page is a slab page; its page count is 1. Sorry for
not clarifying this earlier.
"skbuff: !sendpage_ok - page: 0x54f9f140 (pfn: 120757). is_slab: 1, page_count: 1"
^
The print I used:
pr_info(
	"!sendpage_ok - page: 0x%p (pfn: %lx). is_slab: %u, page_count: %u\n",
	(void *)page,
	page_to_pfn(page),
	!!PageSlab(page),
	page_count(page)
);
Regarding the origin of the IO, I haven't investigated it yet. I suspect
the first 2 pages are the superblocks of the raid (mdp_superblock_1 and
bitmap_super_s) and the rest of the IO is the bitmap.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages
2024-06-03 7:24 ` Hannes Reinecke
@ 2024-06-03 12:49 ` Ofir Gal
0 siblings, 0 replies; 30+ messages in thread
From: Ofir Gal @ 2024-06-03 12:49 UTC (permalink / raw)
To: Hannes Reinecke, davem, linux-block, linux-nvme, netdev,
ceph-devel
Cc: dhowells, edumazet, pabeni, kbusch, axboe, hch, sagi,
philipp.reisner, lars.ellenberg, christoph.boehmwalder, idryomov,
xiubli
On 03/06/2024 10:24, Hannes Reinecke wrote:
> On 5/30/24 16:24, Ofir Gal wrote:
>> skb_splice_from_iter() warns on !sendpage_ok() which results in nvme-tcp
>> data transfer failure. This warning leads to hanging IO.
>>
>> nvme-tcp using sendpage_ok() to check the first page of an iterator in
>> order to disable MSG_SPLICE_PAGES. The iterator can represent a list of
>> contiguous pages.
>>
>> When MSG_SPLICE_PAGES is enabled skb_splice_from_iter() is being used,
>> it requires all pages in the iterator to be sendable.
>> skb_splice_from_iter() checks each page with sendpage_ok().
>>
>> nvme_tcp_try_send_data() might allow MSG_SPLICE_PAGES when the first
>> page is sendable, but the next one are not. skb_splice_from_iter() will
>> attempt to send all the pages in the iterator. When reaching an
>> unsendable page the IO will hang.
>>
>> The patch introduces a helper sendpages_ok(), it returns true if all the
>> continuous pages are sendable.
>>
>> Drivers who want to send contiguous pages with MSG_SPLICE_PAGES may use
>> this helper to check whether the page list is OK. If the helper does not
>> return true, the driver should remove MSG_SPLICE_PAGES flag.
>>
>>
>> The bug is reproducible, in order to reproduce we need nvme-over-tcp
>> controllers with optimal IO size bigger than PAGE_SIZE. Creating a raid
>> with bitmap over those devices reproduces the bug.
>>
>> In order to simulate large optimal IO size you can use dm-stripe with a
>> single device.
>> Script to reproduce the issue on top of brd devices using dm-stripe is
>> attached below.
>>
>>
>> I have added 3 prints to test my theory. One in nvme_tcp_try_send_data()
>> and two others in skb_splice_from_iter() the first before sendpage_ok()
>> and the second on !sendpage_ok(), after the warning.
>> ...
>> nvme_tcp: sendpage_ok, page: 0x654eccd7 (pfn: 120755), len: 262144, offset: 0
>> skbuff: before sendpage_ok - i: 0. page: 0x654eccd7 (pfn: 120755)
>> skbuff: before sendpage_ok - i: 1. page: 0x1666a4da (pfn: 120756)
>> skbuff: before sendpage_ok - i: 2. page: 0x54f9f140 (pfn: 120757)
>> WARNING: at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450
>> skbuff: !sendpage_ok - page: 0x54f9f140 (pfn: 120757). is_slab: 1, page_count: 1
>> ...
>>
>>
>> stack trace:
>> ...
>> WARNING: at net/core/skbuff.c:6848 skb_splice_from_iter+0x141/0x450
>> Workqueue: nvme_tcp_wq nvme_tcp_io_work
>> Call Trace:
>> ? show_regs+0x6a/0x80
>> ? skb_splice_from_iter+0x141/0x450
>> ? __warn+0x8d/0x130
>> ? skb_splice_from_iter+0x141/0x450
>> ? report_bug+0x18c/0x1a0
>> ? handle_bug+0x40/0x70
>> ? exc_invalid_op+0x19/0x70
>> ? asm_exc_invalid_op+0x1b/0x20
>> ? skb_splice_from_iter+0x141/0x450
>> tcp_sendmsg_locked+0x39e/0xee0
>> ? _prb_read_valid+0x216/0x290
>> tcp_sendmsg+0x2d/0x50
>> inet_sendmsg+0x43/0x80
>> sock_sendmsg+0x102/0x130
>> ? vprintk_default+0x1d/0x30
>> ? vprintk+0x3c/0x70
>> ? _printk+0x58/0x80
>> nvme_tcp_try_send_data+0x17d/0x530
>> nvme_tcp_try_send+0x1b7/0x300
>> nvme_tcp_io_work+0x3c/0xc0
>> process_one_work+0x22e/0x420
>> worker_thread+0x50/0x3f0
>> ? __pfx_worker_thread+0x10/0x10
>> kthread+0xd6/0x100
>> ? __pfx_kthread+0x10/0x10
>> ret_from_fork+0x3c/0x60
>> ? __pfx_kthread+0x10/0x10
>> ret_from_fork_asm+0x1b/0x30
>> ...
>>
>> ---
>> Changelog:
>> v2, fix typo in patch subject
>>
>> Ofir Gal (4):
>> net: introduce helper sendpages_ok()
>> nvme-tcp: use sendpages_ok() instead of sendpage_ok()
>> drbd: use sendpages_ok() to instead of sendpage_ok()
>> libceph: use sendpages_ok() to instead of sendpage_ok()
>>
>> drivers/block/drbd/drbd_main.c | 2 +-
>> drivers/nvme/host/tcp.c | 2 +-
>> include/linux/net.h | 20 ++++++++++++++++++++
>> net/ceph/messenger_v1.c | 2 +-
>> net/ceph/messenger_v2.c | 2 +-
>> 5 files changed, 24 insertions(+), 4 deletions(-)
>>
>> reproduce.sh | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 114 insertions(+)
>> create mode 100755 reproduce.sh
>>
>> diff --git a/reproduce.sh b/reproduce.sh
>> new file mode 100755
>> index 000000000..8ae226b18
>> --- /dev/null
>> +++ b/reproduce.sh
>> @@ -0,0 +1,114 @@
>> +#!/usr/bin/env sh
>> +# SPDX-License-Identifier: MIT
>> +
>> +set -e
>> +
>> +load_modules() {
>> + modprobe nvme
>> + modprobe nvme-tcp
>> + modprobe nvmet
>> + modprobe nvmet-tcp
>> +}
>> +
>> +setup_ns() {
>> + local dev=$1
>> + local num=$2
>> + local port=$3
>> + ls $dev > /dev/null
>> +
>> + mkdir -p /sys/kernel/config/nvmet/subsystems/$num
>> + cd /sys/kernel/config/nvmet/subsystems/$num
>> + echo 1 > attr_allow_any_host
>> +
>> + mkdir -p namespaces/$num
>> + cd namespaces/$num/
>> + echo $dev > device_path
>> + echo 1 > enable
>> +
>> + ln -s /sys/kernel/config/nvmet/subsystems/$num \
>> + /sys/kernel/config/nvmet/ports/$port/subsystems/
>> +}
>> +
>> +setup_port() {
>> + local num=$1
>> +
>> + mkdir -p /sys/kernel/config/nvmet/ports/$num
>> + cd /sys/kernel/config/nvmet/ports/$num
>> + echo "127.0.0.1" > addr_traddr
>> + echo tcp > addr_trtype
>> + echo 8009 > addr_trsvcid
>> + echo ipv4 > addr_adrfam
>> +}
>> +
>> +setup_big_opt_io() {
>> + local dev=$1
>> + local name=$2
>> +
>> + # Change optimal IO size by creating dm stripe
>> + dmsetup create $name --table \
>> + "0 `blockdev --getsz $dev` striped 1 512 $dev 0"
>> +}
>> +
>> +setup_targets() {
>> + # Setup ram devices instead of using real nvme devices
>> + modprobe brd rd_size=1048576 rd_nr=2 # 1GiB
>> +
>> + setup_big_opt_io /dev/ram0 ram0_big_opt_io
>> + setup_big_opt_io /dev/ram1 ram1_big_opt_io
>> +
>> + setup_port 1
>> + setup_ns /dev/mapper/ram0_big_opt_io 1 1
>> + setup_ns /dev/mapper/ram1_big_opt_io 2 1
>> +}
>> +
>> +setup_initiators() {
>> + nvme connect -t tcp -n 1 -a 127.0.0.1 -s 8009
>> + nvme connect -t tcp -n 2 -a 127.0.0.1 -s 8009
>> +}
>> +
>> +reproduce_warn() {
>> + local devs=$@
>> +
>> + # Hangs here
>> + mdadm --create /dev/md/test_md --level=1 --bitmap=internal \
>> + --bitmap-chunk=1024K --assume-clean --run --raid-devices=2 $devs
>> +}
>> +
>> +echo "###################################
>> +
>> +The script creates 2 nvme initiators in order to reproduce the bug.
>> +The script doesn't know which controllers it created, choose the new nvme
>> +controllers when asked.
>> +
>> +###################################
>> +
>> +Press enter to continue.
>> +"
>> +
>> +read tmp
>> +
>> +echo "# Creating 2 nvme controllers for the reproduction. current nvme devices:"
>> +lsblk -s | grep nvme || true
>> +echo "---------------------------------
>> +"
>> +
>> +load_modules
>> +setup_targets
>> +setup_initiators
>> +
>> +sleep 0.1 # Wait for the new nvme ctrls to show up
>> +
>> +echo "# Created 2 nvme devices. nvme devices list:"
>> +
>> +lsblk -s | grep nvme
>> +echo "---------------------------------
>> +"
>> +
>> +echo "# Insert the new nvme devices as separated lines. both should be with size of 1G"
>> +read dev1
>> +read dev2
>> +
>> +ls /dev/$dev1 > /dev/null
>> +ls /dev/$dev2 > /dev/null
>> +
>> +reproduce_warn /dev/$dev1 /dev/$dev2
>
> Can you convert that into a blktest script?
Yes, will do.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()
2024-06-03 12:35 ` Ofir Gal
@ 2024-06-03 21:27 ` Sagi Grimberg
2024-06-04 4:27 ` Christoph Hellwig
0 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2024-06-03 21:27 UTC (permalink / raw)
To: Ofir Gal, davem, linux-block, linux-nvme, netdev, ceph-devel
Cc: dhowells, edumazet, pabeni, kbusch, axboe, hch, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
>> I still don't understand how a page in the middle of a contiguous range ends
>> up coming from the slab while others don't.
> I haven't investigated the origin of the IO
> yet. I suspect the first 2 pages are the superblocks of the raid
> (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap.
Well, if these indeed are different origins and just *happen* to be a
mixture of slab-originated pages and non-slab pages combined together in
a single bvec entry of a bio, I'd suspect that it would be more
beneficial to split the bvec (essentially not allow bio_add_page() to
append the page to the tail bvec) depending on a queue limit (similar to
how we handle sg gaps).
>
>> Ofir, can you please check which condition in sendpage_ok actually fails?
> It failed because the page has slab, page count is 1. Sorry for not
> clarifying this.
>
> "skbuff: !sendpage_ok - page: 0x54f9f140 (pfn: 120757). is_slab: 1, page_count: 1"
> ^
> The print I used:
> pr_info(
> "!sendpage_ok - page: 0x%p (pfn: %lx). is_slab: %u, page_count: %u\n",
> (void *)page,
> page_to_pfn(page),
> page_address(page),
> !!PageSlab(page),
> page_count(page)
> );
>
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()
2024-06-03 21:27 ` Sagi Grimberg
@ 2024-06-04 4:27 ` Christoph Hellwig
2024-06-04 8:24 ` Sagi Grimberg
0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2024-06-04 4:27 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Ofir Gal, davem, linux-block, linux-nvme, netdev, ceph-devel,
dhowells, edumazet, pabeni, kbusch, axboe, hch, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote:
>>> I still don't understand how a page in the middle of a contiguous range ends
>>> up coming from the slab while others don't.
>> I haven't investigate the origin of the IO
>> yet. I suspect the first 2 pages are the superblocks of the raid
>> (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap.
>
> Well, if these indeed are different origins and just *happen* to be a
> mixture
> of slab originated pages and non-slab pages combined together in a single
> bio of a bvec entry,
> I'd suspect that it would be more beneficial to split the bvec (essentially
> not allow bio_add_page
> to append the page to tail bvec depending on a queue limit (similar to how
> we handle sg gaps).
So you want to add a PageSlab check to bvec_try_merge_page? That sounds
fairly expensive..
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages
2024-06-01 22:36 ` Jakub Kicinski
@ 2024-06-04 4:30 ` Christoph Hellwig
2024-06-04 14:42 ` Jakub Kicinski
0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2024-06-04 4:30 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Christoph Hellwig, Ofir Gal, davem, linux-block, linux-nvme,
netdev, ceph-devel, dhowells, edumazet, pabeni, kbusch, axboe,
sagi, philipp.reisner, lars.ellenberg, christoph.boehmwalder,
idryomov, xiubli
On Sat, Jun 01, 2024 at 03:36:10PM -0700, Jakub Kicinski wrote:
> On Fri, 31 May 2024 09:32:14 +0200 Christoph Hellwig wrote:
> > I still find it hightly annoying that we can't have a helper that
> > simply does the right thing for callers, but I guess this is the
> > best thing we can get without a change of mind from the networking
> > maintainers..
>
> Change mind about what? Did I miss a discussion?
Having a net helper to just send some memory in the most efficient
way and leave it up to the networking code to decide if it wants
to use sendpage or sendmsg internally as needed instead of burdening
the caller.
This happened before sendpage_ok was added, so probably in 2020
and most involved Dave and not you.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()
2024-06-04 4:27 ` Christoph Hellwig
@ 2024-06-04 8:24 ` Sagi Grimberg
2024-06-04 13:01 ` Sagi Grimberg
0 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2024-06-04 8:24 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Ofir Gal, davem, linux-block, linux-nvme, netdev, ceph-devel,
dhowells, edumazet, pabeni, kbusch, axboe, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
On 04/06/2024 7:27, Christoph Hellwig wrote:
> On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote:
>>>> I still don't understand how a page in the middle of a contiguous range ends
>>>> up coming from the slab while others don't.
>>> I haven't investigate the origin of the IO
>>> yet. I suspect the first 2 pages are the superblocks of the raid
>>> (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap.
>> Well, if these indeed are different origins and just *happen* to be a
>> mixture
>> of slab originated pages and non-slab pages combined together in a single
>> bio of a bvec entry,
>> I'd suspect that it would be more beneficial to split the bvec (essentially
>> not allow bio_add_page
>> to append the page to tail bvec depending on a queue limit (similar to how
>> we handle sg gaps).
> So you want to add a PageSlab check to bvec_try_merge_page? That sounds
> fairly expensive..
>
The check needs to happen somewhere, apparently, and given that it will
be gated by a queue flag, only request queues that actually need it will
suffer, but they will suffer anyway...
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()
2024-06-04 8:24 ` Sagi Grimberg
@ 2024-06-04 13:01 ` Sagi Grimberg
2024-06-06 12:57 ` Ofir Gal
0 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2024-06-04 13:01 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Ofir Gal, davem, linux-block, linux-nvme, netdev, ceph-devel,
dhowells, edumazet, pabeni, kbusch, axboe, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
On 04/06/2024 11:24, Sagi Grimberg wrote:
>
>
> On 04/06/2024 7:27, Christoph Hellwig wrote:
>> On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote:
>>>>> I still don't understand how a page in the middle of a contiguous
>>>>> range ends
>>>>> up coming from the slab while others don't.
>>>> I haven't investigate the origin of the IO
>>>> yet. I suspect the first 2 pages are the superblocks of the raid
>>>> (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the
>>>> bitmap.
>>> Well, if these indeed are different origins and just *happen* to be a
>>> mixture
>>> of slab originated pages and non-slab pages combined together in a
>>> single
>>> bio of a bvec entry,
>>> I'd suspect that it would be more beneficial to split the bvec
>>> (essentially
>>> not allow bio_add_page
>>> to append the page to tail bvec depending on a queue limit (similar
>>> to how
>>> we handle sg gaps).
>> So you want to add a PageSlab check to bvec_try_merge_page? That sounds
>> fairly expensive..
>>
>
> The check needs to happen somewhere apparently, and given that it will
> be gated by a queue flag
> only request queues that actually needed will suffer, but they will
> suffer anyways...
Something like the untested patch below:
--
diff --git a/block/bio.c b/block/bio.c
index 53f608028c78..e55a4184c0e6 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -18,6 +18,7 @@
 #include <linux/highmem.h>
 #include <linux/blk-crypto.h>
 #include <linux/xarray.h>
+#include <linux/net.h>
 
 #include <trace/events/block.h>
 #include "blk.h"
@@ -960,6 +961,9 @@ bool bvec_try_merge_hw_page(struct request_queue *q, struct bio_vec *bv,
 		return false;
 	if (len > queue_max_segment_size(q) - bv->bv_len)
 		return false;
+	if (q->limits.splice_pages &&
+	    sendpage_ok(bv->bv_page) ^ sendpage_ok(page))
+		return false;
 
 	return bvec_try_merge_page(bv, page, len, offset, same_page);
 }
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a7e820840cf7..82e2719acb9c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1937,6 +1937,7 @@ static void nvme_set_ctrl_limits(struct nvme_ctrl *ctrl,
 	lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;
 	lim->max_segment_size = UINT_MAX;
 	lim->dma_alignment = 3;
+	lim->splice_pages = ctrl->splice_pages;
 }
 
 static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 3f3e26849b61..d9818330e236 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -398,6 +398,7 @@ struct nvme_ctrl {
 
 	enum nvme_ctrl_type cntrltype;
 	enum nvme_dctype dctype;
+	bool splice_pages;
 };
 
 static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 02076b8cb4d8..618b8f20206a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2146,6 +2146,12 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
 	if (error)
 		goto out_stop_queue;
 
+	/*
+	 * we want to prevent contig pages with conflicting splice-ability
+	 * with respect to the network transmission
+	 */
+	ctrl->splice_pages = true;
+
 	nvme_unquiesce_admin_queue(ctrl);
 
 	error = nvme_init_ctrl_finish(ctrl, false);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 69c4f113db42..ec657ddad2a4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -331,6 +331,12 @@ struct queue_limits {
 	 * due to possible offsets.
 	 */
 	unsigned int dma_alignment;
+
+	/*
+	 * Drivers that use MSG_SPLICE_PAGES to send the bvec over the
+	 * network will need either bvec entry contig pages spliceable or not.
+	 */
+	bool splice_pages;
 };
 
 typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
--
What I now see is that we will check PageSlab twice (bvec last index and
append page), and skb_splice_from_iter() checks it again... How many
times do we check this :)
Would be great if the network stack could just check it once and fall
back to page copy...
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages
2024-06-04 4:30 ` Christoph Hellwig
@ 2024-06-04 14:42 ` Jakub Kicinski
2024-06-05 7:27 ` Christoph Hellwig
0 siblings, 1 reply; 30+ messages in thread
From: Jakub Kicinski @ 2024-06-04 14:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Ofir Gal, davem, linux-block, linux-nvme, netdev, ceph-devel,
dhowells, edumazet, pabeni, kbusch, axboe, sagi, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
On Tue, 4 Jun 2024 06:30:41 +0200 Christoph Hellwig wrote:
> On Sat, Jun 01, 2024 at 03:36:10PM -0700, Jakub Kicinski wrote:
> > On Fri, 31 May 2024 09:32:14 +0200 Christoph Hellwig wrote:
> > > I still find it hightly annoying that we can't have a helper that
> > > simply does the right thing for callers, but I guess this is the
> > > best thing we can get without a change of mind from the networking
> > > maintainers..
> >
> > Change mind about what? Did I miss a discussion?
>
> Having a net helper to just send some memory in the most efficient
> way and leave it up to the networking code to decide if it wants
> to use sendpage or sendmsg internally as needed instead of burdening
> the caller.
I'd guess the thinking was that if we push back the callers would
switch the relevant allocations to be page-backed. But we can add
a comment above the helper that says "you'd be better off using
page frags and calling sendmsg(MSG_SPLICE_PAGES) directly".
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 3/4] drbd: use sendpages_ok() instead of sendpage_ok()
2024-05-30 14:24 ` [PATCH v2 3/4] drbd: " Ofir Gal
@ 2024-06-04 14:43 ` Christoph Böhmwalder
0 siblings, 0 replies; 30+ messages in thread
From: Christoph Böhmwalder @ 2024-06-04 14:43 UTC (permalink / raw)
To: Ofir Gal, davem, linux-block, linux-nvme, netdev, ceph-devel
Cc: dhowells, edumazet, pabeni, philipp.reisner, lars.ellenberg
On 30.05.24 16:24, Ofir Gal wrote:
> Currently _drbd_send_page() uses sendpage_ok() in order to enable
> MSG_SPLICE_PAGES; it checks the first page of the iterator, which
> may represent contiguous pages.
>
> MSG_SPLICE_PAGES enables skb_splice_from_iter() which checks all the
> pages it sends with sendpage_ok().
>
> When _drbd_send_page() sends an iterator whose first page is
> sendable but one of the other pages isn't, skb_splice_from_iter() warns
> and aborts the data transfer.
>
> Using the new helper sendpages_ok() in order to enable MSG_SPLICE_PAGES
> solves the issue.
>
> Signed-off-by: Ofir Gal <ofir.gal@volumez.com>
> ---
> drivers/block/drbd/drbd_main.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
> index 113b441d4d36..a5dbbf6cce23 100644
> --- a/drivers/block/drbd/drbd_main.c
> +++ b/drivers/block/drbd/drbd_main.c
> @@ -1550,7 +1550,7 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa
> * put_page(); and would cause either a VM_BUG directly, or
> * __page_cache_release a page that would actually still be referenced
> * by someone, leading to some obscure delayed Oops somewhere else. */
> - if (!drbd_disable_sendpage && sendpage_ok(page))
> + if (!drbd_disable_sendpage && sendpages_ok(page, len, offset))
> msg.msg_flags |= MSG_NOSIGNAL | MSG_SPLICE_PAGES;
>
> drbd_update_congested(peer_device->connection);
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
--
Christoph Böhmwalder
LINBIT | Keeping the Digital World Running
DRBD HA — Disaster Recovery — Software defined Storage
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages
2024-06-04 14:42 ` Jakub Kicinski
@ 2024-06-05 7:27 ` Christoph Hellwig
0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2024-06-05 7:27 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Christoph Hellwig, Ofir Gal, davem, linux-block, linux-nvme,
netdev, ceph-devel, dhowells, edumazet, pabeni, kbusch, axboe,
sagi, philipp.reisner, lars.ellenberg, christoph.boehmwalder,
idryomov, xiubli
On Tue, Jun 04, 2024 at 07:42:24AM -0700, Jakub Kicinski wrote:
> I'd guess the thinking was that if we push back the callers would
> switch the relevant allocations to be page-backed. But we can add
> a comment above the helper that says "you'd be better off using
> page frags and calling sendmsg(MSG_SPLICE_PAGES) directly".
That's just not how network block devices or file systems work; they
don't control the allocations and need to take what gets fed to them.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()
2024-06-04 13:01 ` Sagi Grimberg
@ 2024-06-06 12:57 ` Ofir Gal
2024-06-06 13:08 ` Christoph Hellwig
0 siblings, 1 reply; 30+ messages in thread
From: Ofir Gal @ 2024-06-06 12:57 UTC (permalink / raw)
To: Sagi Grimberg, Christoph Hellwig
Cc: davem, linux-block, linux-nvme, netdev, ceph-devel, dhowells,
edumazet, pabeni, kbusch, axboe, philipp.reisner, lars.ellenberg,
christoph.boehmwalder, idryomov, xiubli
On 04/06/2024 16:01, Sagi Grimberg wrote:
>
>
> On 04/06/2024 11:24, Sagi Grimberg wrote:
>>
>>
>> On 04/06/2024 7:27, Christoph Hellwig wrote:
>>> On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote:
>>>>>> I still don't understand how a page in the middle of a contiguous range ends
>>>>>> up coming from the slab while others don't.
>>>>> I haven't investigate the origin of the IO
>>>>> yet. I suspect the first 2 pages are the superblocks of the raid
>>>>> (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap.
>>>> Well, if these indeed are different origins and just *happen* to be a
>>>> mixture
>>>> of slab originated pages and non-slab pages combined together in a single
>>>> bio of a bvec entry,
>>>> I'd suspect that it would be more beneficial to split the bvec (essentially
>>>> not allow bio_add_page
>>>> to append the page to tail bvec depending on a queue limit (similar to how
>>>> we handle sg gaps).
I have investigated the origin of the IO. It's a bug in the md-bitmap
code: it's a single IO sent by __write_sb_page(), which rounds the IO
size up to the optimal IO size without checking whether the final size
exceeds the number of pages it allocated.
The slab pages aren't allocated by md-bitmap; they are pages that
happen to come after the allocated pages. I'm sending a patch to the md
subsystem asap.
I have added some logs to test the theory:
...
md: created bitmap (1 pages) for device md127
__write_sb_page before md_super_write. offset: 16, size: 262144. pfn: 0x53ee
### __write_sb_page before md_super_write. logging pages ###
pfn: 0x53ee, slab: 0
pfn: 0x53ef, slab: 1
pfn: 0x53f0, slab: 0
pfn: 0x53f1, slab: 0
pfn: 0x53f2, slab: 0
pfn: 0x53f3, slab: 1
...
nvme_tcp: sendpage_ok - pfn: 0x53ee, len: 262144, offset: 0
skbuff: before sendpage_ok() - pfn: 0x53ee
skbuff: before sendpage_ok() - pfn: 0x53ef
WARNING at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450
skbuff: !sendpage_ok - pfn: 0x53ef. is_slab: 1, page_count: 1
...
There is only 1 page allocated for the bitmap, but __write_sb_page()
tries to write 64 pages.
>>> So you want to add a PageSlab check to bvec_try_merge_page? That sounds
>>> fairly expensive..
>>>
>>
>> The check needs to happen somewhere apparently, and given that it will be gated by a queue flag,
>> only request queues that actually need it will suffer, but they will suffer anyway...
> What I now see is that we will check PageSlab twice (bvec last index and append page)
> and skb_splice_from_iter checks it again... How many times will we check this :)
>
> Would be great if the network stack can just check it once and fallback to page copy...
Nevertheless, I think a check in the network stack or when merging bios
is still necessary.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()
2024-06-06 12:57 ` Ofir Gal
@ 2024-06-06 13:08 ` Christoph Hellwig
2024-06-06 13:18 ` Ofir Gal
0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2024-06-06 13:08 UTC (permalink / raw)
To: Ofir Gal
Cc: Sagi Grimberg, Christoph Hellwig, davem, linux-block, linux-nvme,
netdev, ceph-devel, dhowells, edumazet, pabeni, kbusch, axboe,
philipp.reisner, lars.ellenberg, christoph.boehmwalder, idryomov,
xiubli
On Thu, Jun 06, 2024 at 03:57:25PM +0300, Ofir Gal wrote:
> The slab pages aren't allocated by the md-bitmap; they are pages that
> happen to be after the allocated pages. I'm applying a patch to the md
> subsystem asap.
Similar cases could happen by other means as well. E.g. if you write
to the last blocks in an XFS allocation group while also writing something
to the superblock copy at the beginning of the next one and the two writes
get merged. Probably not easy to reproduce but entirely possible, just as
a lot of other scenarios could happen due to merges.
* Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()
2024-06-06 13:08 ` Christoph Hellwig
@ 2024-06-06 13:18 ` Ofir Gal
2024-06-06 13:52 ` Christoph Hellwig
0 siblings, 1 reply; 30+ messages in thread
From: Ofir Gal @ 2024-06-06 13:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sagi Grimberg, davem, linux-block, linux-nvme, netdev, ceph-devel,
dhowells, edumazet, pabeni, kbusch, axboe, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
On 06/06/2024 16:08, Christoph Hellwig wrote:
> On Thu, Jun 06, 2024 at 03:57:25PM +0300, Ofir Gal wrote:
>> The slab pages aren't allocated by the md-bitmap; they are pages that
>> happen to be after the allocated pages. I'm applying a patch to the md
>> subsystem asap.
> Similar cases could happen by other means as well. E.g. if you write
> to the last blocks in an XFS allocation group while also writing something
> to the superblock copy at the beginning of the next one and the two writes
> get merged. Probably not easy to reproduce but entirely possible, just as
> a lot of other scenarios could happen due to merges.
I agree. Do we want to proceed with my patch or would we prefer a more
efficient solution?
* Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()
2024-06-06 13:18 ` Ofir Gal
@ 2024-06-06 13:52 ` Christoph Hellwig
2024-06-06 15:42 ` Ofir Gal
0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2024-06-06 13:52 UTC (permalink / raw)
To: Ofir Gal
Cc: Christoph Hellwig, Sagi Grimberg, davem, linux-block, linux-nvme,
netdev, ceph-devel, dhowells, edumazet, pabeni, kbusch, axboe,
philipp.reisner, lars.ellenberg, christoph.boehmwalder, idryomov,
xiubli
On Thu, Jun 06, 2024 at 04:18:29PM +0300, Ofir Gal wrote:
> I agree. Do we want to proceed with my patch or would we prefer a more
> efficient solution?
We'll need something for sure. I don't like the idea of hacking
special cases that call into the networking code into the core block
layer, so I think it'll have to be something more like your original
patch.
* Re: [PATCH v2 1/4] net: introduce helper sendpages_ok()
2024-06-06 13:52 ` Christoph Hellwig
@ 2024-06-06 15:42 ` Ofir Gal
0 siblings, 0 replies; 30+ messages in thread
From: Ofir Gal @ 2024-06-06 15:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sagi Grimberg, davem, linux-block, linux-nvme, netdev, ceph-devel,
dhowells, edumazet, pabeni, kbusch, axboe, philipp.reisner,
lars.ellenberg, christoph.boehmwalder, idryomov, xiubli
On 06/06/2024 16:52, Christoph Hellwig wrote:
> On Thu, Jun 06, 2024 at 04:18:29PM +0300, Ofir Gal wrote:
>> I agree. Do we want to proceed with my patch or would we prefer a more
>> efficient solution?
> We'll need something for sure. I don't like the idea of hacking
> special cases that call into the networking code into the core block
> layer, so I think it'll have to be something more like your original
> patch.
Ok, I'm sending v3 in case we need it.
end of thread, other threads:[~2024-06-06 15:42 UTC | newest]
Thread overview: 30+ messages
2024-05-30 14:24 [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages Ofir Gal
2024-05-30 14:24 ` [PATCH v2 1/4] net: introduce helper sendpages_ok() Ofir Gal
2024-05-31 7:32 ` Christoph Hellwig
2024-05-31 8:51 ` Sagi Grimberg
2024-06-03 12:35 ` Ofir Gal
2024-06-03 21:27 ` Sagi Grimberg
2024-06-04 4:27 ` Christoph Hellwig
2024-06-04 8:24 ` Sagi Grimberg
2024-06-04 13:01 ` Sagi Grimberg
2024-06-06 12:57 ` Ofir Gal
2024-06-06 13:08 ` Christoph Hellwig
2024-06-06 13:18 ` Ofir Gal
2024-06-06 13:52 ` Christoph Hellwig
2024-06-06 15:42 ` Ofir Gal
2024-05-30 14:24 ` [PATCH v2 2/4] nvme-tcp: use sendpages_ok() instead of sendpage_ok() Ofir Gal
2024-05-31 7:32 ` Christoph Hellwig
2024-05-30 14:24 ` [PATCH v2 3/4] drbd: " Ofir Gal
2024-06-04 14:43 ` Christoph Böhmwalder
2024-05-30 14:24 ` [PATCH v2 4/4] libceph: " Ofir Gal
2024-05-31 7:32 ` [PATCH v2 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages Christoph Hellwig
2024-06-01 22:36 ` Jakub Kicinski
2024-06-04 4:30 ` Christoph Hellwig
2024-06-04 14:42 ` Jakub Kicinski
2024-06-05 7:27 ` Christoph Hellwig
2024-06-01 22:34 ` Jakub Kicinski
2024-06-02 7:48 ` Sagi Grimberg
2024-06-03 9:07 ` Hannes Reinecke
2024-06-03 12:46 ` Ofir Gal
2024-06-03 7:24 ` Hannes Reinecke
2024-06-03 12:49 ` Ofir Gal