Read operation gets EOF return when there is multi-client read/write after linux 5.16-rc1

ceph-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Read operation gets EOF return when there is multi-client read/write after linux 5.16-rc1
@ 2024-01-24  3:25 Frank Hsiao 蕭法宣
  2024-02-16  4:24 ` 回覆: " Frank Hsiao 蕭法宣
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Frank Hsiao 蕭法宣 @ 2024-01-24  3:25 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

When multiple ceph kernel clients perform read/write on the same file, the read
operation(ceph_sync_read) returns EOF(ret = 0) even though the file has been
written by another client.

My envs use Ceph quincy(v17.2.6) and mount cephfs by ceph kernel client. For the
client side, I use Samba(v4.18.8) to export the folder as smb share and test it
with smbtorture. The test case is smb2.rw.rw1 with the following failure
message:

test: samba4.smb2.rw.rw1 
Checking data integrity over 10 ops 
read failed(NT_STATUS_END_OF_FILE) 
failure: samba4.smb2.rw.rw1 [ 
Exception: read 0, expected 440 
]

After some testing, I figured out that the failure only happens when I have
linux kernel version>=5.16-rc1, specifically after commit
c3d8e0b5de487a7c462781745bc17694a4266696. Kernel logs as below(on 5.16-rc1):

[Wed Jan 10 09:44:56 2024] [153221] ceph_read_iter:1559: ceph:  aio_sync_read
00000000789dccee 100000010ef.fffffffffffffffe 0~440 got cap refs on Fr 
[Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:852: ceph:  sync_read on file 
00000000d9e861fb 0~440 
[Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:913: ceph:  sync_read 0~440 got 440 i_size 0 
[Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:966: ceph:  sync_read result 0 retry_op 2 

...  

[Wed Jan 10 09:44:57 2024] [153221] ceph_read_iter:1559: ceph:  aio_sync_read
00000000789dccee 100000010ef.fffffffffffffffe 0~440 got cap refs on Fr 
[Wed Jan 10 09:44:57 2024] [153221] ceph_sync_read:852: ceph:  sync_read on file
00000000d9e861fb 0~0

The logs indicate that: 
1. ceph_sync_read may read data but i_size is obsolete in simultaneous rw situation 
2. The commit in 5.16-rc1 cap ret to i_size and set retry_op = CHECK_EOF 
3. When retrying, ceph_sync_read gets len=0 since iov count has modified in 
copy_page_to_iter 
4. ceph_read_iter return 0

I'm not sure if my understanding is correct. As a reference, here is my simple
patch and I need more comments. The purpose of the patch is to prevent
sync read handler from doing copy page when ret > i_size.

Thanks.

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 220a41831b46..5897f52ee998 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -926,6 +926,9 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,

                idx = 0;
                left = ret > 0 ? ret : 0;
+               if (left > i_size) {
+                       left = i_size;
+               }
                while (left > 0) {
                        size_t len, copied;
                        page_off = off & ~PAGE_MASK;
@@ -952,7 +955,7 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
                        break;
        }

-       if (off > iocb->ki_pos) {
+       if (off > iocb->ki_pos || i_size == 0) {
                if (off >= i_size) {
                        *retry_op = CHECK_EOF;
                        ret = i_size - iocb->ki_pos;

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* 回覆: Read operation gets EOF return when there is multi-client read/write after linux 5.16-rc1
  2024-01-24  3:25 Read operation gets EOF return when there is multi-client read/write after linux 5.16-rc1 Frank Hsiao 蕭法宣
@ 2024-02-16  4:24 ` Frank Hsiao 蕭法宣
  2024-02-19  2:08   ` Xiubo Li
  2024-02-20  8:31 ` Xiubo Li
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 6+ messages in thread
From: Frank Hsiao 蕭法宣 @ 2024-02-16  4:24 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org
  Cc: jlayton@kernel.org, idryomov@gmail.com, xiubli@redhat.com

Hi, it is a friendly ping, thanks.

________________________________________
寄件者: Frank Hsiao 蕭法宣 <frankhsiao@qnap.com>
寄件日期: 2024年1月24日 上午 11:25
收件者: ceph-devel@vger.kernel.org
主旨: Read operation gets EOF return when there is multi-client read/write after linux 5.16-rc1

When multiple ceph kernel clients perform read/write on the same file, the read
operation(ceph_sync_read) returns EOF(ret = 0) even though the file has been
written by another client.

My envs use Ceph quincy(v17.2.6) and mount cephfs by ceph kernel client. For the
client side, I use Samba(v4.18.8) to export the folder as smb share and test it
with smbtorture. The test case is smb2.rw.rw1 with the following failure
message:

test: samba4.smb2.rw.rw1
Checking data integrity over 10 ops
read failed(NT_STATUS_END_OF_FILE)
failure: samba4.smb2.rw.rw1 [
Exception: read 0, expected 440
]

After some testing, I figured out that the failure only happens when I have
linux kernel version>=5.16-rc1, specifically after commit
c3d8e0b5de487a7c462781745bc17694a4266696. Kernel logs as below(on 5.16-rc1):

[Wed Jan 10 09:44:56 2024] [153221] ceph_read_iter:1559: ceph:  aio_sync_read
00000000789dccee 100000010ef.fffffffffffffffe 0~440 got cap refs on Fr
[Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:852: ceph:  sync_read on file
00000000d9e861fb 0~440
[Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:913: ceph:  sync_read 0~440 got 440 i_size 0
[Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:966: ceph:  sync_read result 0 retry_op 2

...

[Wed Jan 10 09:44:57 2024] [153221] ceph_read_iter:1559: ceph:  aio_sync_read
00000000789dccee 100000010ef.fffffffffffffffe 0~440 got cap refs on Fr
[Wed Jan 10 09:44:57 2024] [153221] ceph_sync_read:852: ceph:  sync_read on file
00000000d9e861fb 0~0

The logs indicate that:
1. ceph_sync_read may read data but i_size is obsolete in simultaneous rw situation
2. The commit in 5.16-rc1 cap ret to i_size and set retry_op = CHECK_EOF
3. When retrying, ceph_sync_read gets len=0 since iov count has modified in
copy_page_to_iter
4. ceph_read_iter return 0

I'm not sure if my understanding is correct. As a reference, here is my simple
patch and I need more comments. The purpose of the patch is to prevent
sync read handler from doing copy page when ret > i_size.

Thanks.

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 220a41831b46..5897f52ee998 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -926,6 +926,9 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,

                idx = 0;
                left = ret > 0 ? ret : 0;
+               if (left > i_size) {
+                       left = i_size;
+               }
                while (left > 0) {
                        size_t len, copied;
                        page_off = off & ~PAGE_MASK;
@@ -952,7 +955,7 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
                        break;
        }

-       if (off > iocb->ki_pos) {
+       if (off > iocb->ki_pos || i_size == 0) {
                if (off >= i_size) {
                        *retry_op = CHECK_EOF;
                        ret = i_size - iocb->ki_pos;

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: 回覆: Read operation gets EOF return when there is multi-client read/write after linux 5.16-rc1
  2024-02-16  4:24 ` 回覆: " Frank Hsiao 蕭法宣
@ 2024-02-19  2:08   ` Xiubo Li
  0 siblings, 0 replies; 6+ messages in thread
From: Xiubo Li @ 2024-02-19  2:08 UTC (permalink / raw)
  To: Frank Hsiao 蕭法宣, ceph-devel@vger.kernel.org
  Cc: jlayton@kernel.org, idryomov@gmail.com

Hi

Just back from PTO and holiday.

I will check it today or tomorrow.

Thanks

- Xiubo

On 2/16/24 12:24, Frank Hsiao 蕭法宣 wrote:
> Hi, it is a friendly ping, thanks.
>
> ________________________________________
> 寄件者: Frank Hsiao 蕭法宣 <frankhsiao@qnap.com>
> 寄件日期: 2024年1月24日 上午 11:25
> 收件者: ceph-devel@vger.kernel.org
> 主旨: Read operation gets EOF return when there is multi-client read/write after linux 5.16-rc1
>
> When multiple ceph kernel clients perform read/write on the same file, the read
> operation(ceph_sync_read) returns EOF(ret = 0) even though the file has been
> written by another client.
>
> My envs use Ceph quincy(v17.2.6) and mount cephfs by ceph kernel client. For the
> client side, I use Samba(v4.18.8) to export the folder as smb share and test it
> with smbtorture. The test case is smb2.rw.rw1 with the following failure
> message:
>
> test: samba4.smb2.rw.rw1
> Checking data integrity over 10 ops
> read failed(NT_STATUS_END_OF_FILE)
> failure: samba4.smb2.rw.rw1 [
> Exception: read 0, expected 440
> ]
>
> After some testing, I figured out that the failure only happens when I have
> linux kernel version>=5.16-rc1, specifically after commit
> c3d8e0b5de487a7c462781745bc17694a4266696. Kernel logs as below(on 5.16-rc1):
>
>
> [Wed Jan 10 09:44:56 2024] [153221] ceph_read_iter:1559: ceph:  aio_sync_read
> 00000000789dccee 100000010ef.fffffffffffffffe 0~440 got cap refs on Fr
> [Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:852: ceph:  sync_read on file
> 00000000d9e861fb 0~440
> [Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:913: ceph:  sync_read 0~440 got 440 i_size 0
> [Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:966: ceph:  sync_read result 0 retry_op 2
>
> ...
>
> [Wed Jan 10 09:44:57 2024] [153221] ceph_read_iter:1559: ceph:  aio_sync_read
> 00000000789dccee 100000010ef.fffffffffffffffe 0~440 got cap refs on Fr
> [Wed Jan 10 09:44:57 2024] [153221] ceph_sync_read:852: ceph:  sync_read on file
> 00000000d9e861fb 0~0
>
>
> The logs indicate that:
> 1. ceph_sync_read may read data but i_size is obsolete in simultaneous rw situation
> 2. The commit in 5.16-rc1 cap ret to i_size and set retry_op = CHECK_EOF
> 3. When retrying, ceph_sync_read gets len=0 since iov count has modified in
> copy_page_to_iter
> 4. ceph_read_iter return 0
>
> I'm not sure if my understanding is correct. As a reference, here is my simple
> patch and I need more comments. The purpose of the patch is to prevent
> sync read handler from doing copy page when ret > i_size.
>
> Thanks.
>
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 220a41831b46..5897f52ee998 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -926,6 +926,9 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
>
>                  idx = 0;
>                  left = ret > 0 ? ret : 0;
> +               if (left > i_size) {
> +                       left = i_size;
> +               }
>                  while (left > 0) {
>                          size_t len, copied;
>                          page_off = off & ~PAGE_MASK;
> @@ -952,7 +955,7 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
>                          break;
>          }
>
> -       if (off > iocb->ki_pos) {
> +       if (off > iocb->ki_pos || i_size == 0) {
>                  if (off >= i_size) {
>                          *retry_op = CHECK_EOF;
>                          ret = i_size - iocb->ki_pos;


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Read operation gets EOF return when there is multi-client read/write after linux 5.16-rc1
  2024-01-24  3:25 Read operation gets EOF return when there is multi-client read/write after linux 5.16-rc1 Frank Hsiao 蕭法宣
  2024-02-16  4:24 ` 回覆: " Frank Hsiao 蕭法宣
@ 2024-02-20  8:31 ` Xiubo Li
  2024-02-21  1:41 ` Xiubo Li
       [not found] ` <6f953a75-23cc-4d41-bbc5-2ca0a839f6d3@redhat.com>
  3 siblings, 0 replies; 6+ messages in thread
From: Xiubo Li @ 2024-02-20  8:31 UTC (permalink / raw)
  To: Frank Hsiao 蕭法宣, ceph-devel@vger.kernel.org


On 1/24/24 11:25, Frank Hsiao 蕭法宣 wrote:
> When multiple ceph kernel clients perform read/write on the same file, the read
> operation(ceph_sync_read) returns EOF(ret = 0) even though the file has been
> written by another client.
>
> My envs use Ceph quincy(v17.2.6) and mount cephfs by ceph kernel client. For the
> client side, I use Samba(v4.18.8) to export the folder as smb share and test it
> with smbtorture. The test case is smb2.rw.rw1 with the following failure
> message:
>
> test: samba4.smb2.rw.rw1
> Checking data integrity over 10 ops
> read failed(NT_STATUS_END_OF_FILE)
> failure: samba4.smb2.rw.rw1 [
> Exception: read 0, expected 440
> ]
>
> After some testing, I figured out that the failure only happens when I have
> linux kernel version>=5.16-rc1, specifically after commit
> c3d8e0b5de487a7c462781745bc17694a4266696. Kernel logs as below(on 5.16-rc1):
>
>
> [Wed Jan 10 09:44:56 2024] [153221] ceph_read_iter:1559: ceph:  aio_sync_read
> 00000000789dccee 100000010ef.fffffffffffffffe 0~440 got cap refs on Fr
> [Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:852: ceph:  sync_read on file
> 00000000d9e861fb 0~440
> [Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:913: ceph:  sync_read 0~440 got 440 i_size 0
> [Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:966: ceph:  sync_read result 0 retry_op 2
>
> ...
>
> [Wed Jan 10 09:44:57 2024] [153221] ceph_read_iter:1559: ceph:  aio_sync_read
> 00000000789dccee 100000010ef.fffffffffffffffe 0~440 got cap refs on Fr
> [Wed Jan 10 09:44:57 2024] [153221] ceph_sync_read:852: ceph:  sync_read on file
> 00000000d9e861fb 0~0

Locally I tried to reproduce it but failed, the following is my logs:

<7>[ 3524.212187] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
ceph_read_iter: sync 00000000b6ea7952 10000000000.fffffffffffffffe 
0~8192 got cap refs on Fr
<7>[ 3524.212194] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
ceph_sync_read: on file 00000000a06cac30 0~2000
<7>[ 3524.212201] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
__ceph_sync_read: on inode 00000000b6ea7952 10000000000.fffffffffffffffe 
0~2000
<7>[ 3524.212209] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
__ceph_sync_read: orig 0~8192 reading 0~8192
<7>[ 3524.213008] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
__ceph_sync_read: 0~8192 got 97 i_size 90
<7>[ 3524.213100] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
__ceph_sync_read: result 90 retry_op 2
<7>[ 3524.213107] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
ceph_read_iter: 00000000b6ea7952 10000000000.fffffffffffffffe dropping 
cap refs on Fr = 90
...

<7>[ 3524.213151] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
__ceph_do_getattr: inode 00000000b6ea7952 10000000000.fffffffffffffffe 
mask Fsr mode 0100644
<7>[ 3524.213159] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
__ceph_caps_issued: 00000000b6ea7952 10000000000.fffffffffffffffe cap 
0000000067201935 issued pAsLsXsFr
<7>[ 3524.213193] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
ceph_mdsc_do_request: do_request on 0000000008917ba4
<7>[ 3524.213201] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
ceph_mdsc_submit_request: submit_request on 0000000008917ba4 for inode 
0000000000000000
...

<7>[ 3524.345996] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
ceph_fill_inode: 00000000b6ea7952 10000000000.fffffffffffffffe mode 
0100644 uid.gid 1000.1000
<7>[ 3524.346004] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
ceph_fill_file_size: size 90 -> 97
...

<7>[ 3524.346235] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
__ceph_do_getattr: result=0
<7>[ 3524.346236] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
ceph_check_caps:  mds0 cap 0000000067201935 used - issued pAsLsXsFr 
implemented pAsLsXsFscr revoking Fsc
<7>[ 3524.346241] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
ceph_read_iter: hit hole, ppos 90 < size 97, reading more
...
<7>[ 3524.346307] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
__ceph_sync_read: on inode 00000000b6ea7952 10000000000.fffffffffffffffe 
5a~1f9f
<7>[ 3524.346313] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
__ceph_sync_read: orig 90~8095 reading 90~8095
...

<7>[ 3524.370645] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
__ceph_sync_read: 90~8095 got 7 i_size 97
<7>[ 3524.370776] ceph: [49172eb5-aa13-4408-956b-e10a6d02b0f6 4228] 
__ceph_sync_read: result 7 retry_op 2

As you can see that the first time it will read only 90 bytes, actually 
the file size is 97. Then the 'ceph_read_iter()' will call 
'__ceph_do_getattr()' to get the real i_size from MDS, and finally it 
will retry to read more and read the last 7 bytes.

Locally my test succeeded.

Could you upload more detail debug logs ? Have you seen the 
'ceph_read_iter()' was retried ?

Thanks

- Xiubo




>
> The logs indicate that:
> 1. ceph_sync_read may read data but i_size is obsolete in simultaneous rw situation
> 2. The commit in 5.16-rc1 cap ret to i_size and set retry_op = CHECK_EOF
> 3. When retrying, ceph_sync_read gets len=0 since iov count has modified in
> copy_page_to_iter
> 4. ceph_read_iter return 0
>
> I'm not sure if my understanding is correct. As a reference, here is my simple
> patch and I need more comments. The purpose of the patch is to prevent
> sync read handler from doing copy page when ret > i_size.
>
> Thanks.
>
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 220a41831b46..5897f52ee998 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -926,6 +926,9 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
>
>                  idx = 0;
>                  left = ret > 0 ? ret : 0;
> +               if (left > i_size) {
> +                       left = i_size;
> +               }
>                  while (left > 0) {
>                          size_t len, copied;
>                          page_off = off & ~PAGE_MASK;
> @@ -952,7 +955,7 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
>                          break;
>          }
>
> -       if (off > iocb->ki_pos) {
> +       if (off > iocb->ki_pos || i_size == 0) {
>                  if (off >= i_size) {
>                          *retry_op = CHECK_EOF;
>                          ret = i_size - iocb->ki_pos;


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Read operation gets EOF return when there is multi-client read/write after linux 5.16-rc1
  2024-01-24  3:25 Read operation gets EOF return when there is multi-client read/write after linux 5.16-rc1 Frank Hsiao 蕭法宣
  2024-02-16  4:24 ` 回覆: " Frank Hsiao 蕭法宣
  2024-02-20  8:31 ` Xiubo Li
@ 2024-02-21  1:41 ` Xiubo Li
       [not found] ` <6f953a75-23cc-4d41-bbc5-2ca0a839f6d3@redhat.com>
  3 siblings, 0 replies; 6+ messages in thread
From: Xiubo Li @ 2024-02-21  1:41 UTC (permalink / raw)
  To: Frank Hsiao 蕭法宣, ceph-devel@vger.kernel.org

Hi Rrank,

Thanks for your debug logs. I just improved your patch, could you have a 
try ?

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 06fdb3d05095..75bd46d0291b 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1194,7 +1194,7 @@ ssize_t __ceph_sync_read(struct inode *inode, 
loff_t *ki_pos,
                 }

                 idx = 0;
-               left = ret > 0 ? ret : 0;
+               left = ret > 0 ? umin(ret, i_size) : 0;
                 while (left > 0) {
                         size_t plen, copied;

@@ -1223,15 +1223,13 @@ ssize_t __ceph_sync_read(struct inode *inode, 
loff_t *ki_pos,
         }

         if (ret > 0) {
-               if (off > *ki_pos) {
-                       if (off >= i_size) {
-                               *retry_op = CHECK_EOF;
-                               ret = i_size - *ki_pos;
-                               *ki_pos = i_size;
-                       } else {
-                               ret = off - *ki_pos;
-                               *ki_pos = off;
-                       }
+               if (off >= i_size) {
+                       *retry_op = CHECK_EOF;
+                       ret = i_size - *ki_pos;
+                       *ki_pos = i_size;
+               } else {
+                       ret = off - *ki_pos;
+                       *ki_pos = off;
                 }

                 if (last_objver)


- Xiubo


On 1/24/24 11:25, Frank Hsiao 蕭法宣 wrote:
> When multiple ceph kernel clients perform read/write on the same file, the read
> operation(ceph_sync_read) returns EOF(ret = 0) even though the file has been
> written by another client.
>
> My envs use Ceph quincy(v17.2.6) and mount cephfs by ceph kernel client. For the
> client side, I use Samba(v4.18.8) to export the folder as smb share and test it
> with smbtorture. The test case is smb2.rw.rw1 with the following failure
> message:
>
> test: samba4.smb2.rw.rw1
> Checking data integrity over 10 ops
> read failed(NT_STATUS_END_OF_FILE)
> failure: samba4.smb2.rw.rw1 [
> Exception: read 0, expected 440
> ]
>
> After some testing, I figured out that the failure only happens when I have
> linux kernel version>=5.16-rc1, specifically after commit
> c3d8e0b5de487a7c462781745bc17694a4266696. Kernel logs as below(on 5.16-rc1):
>
>
> [Wed Jan 10 09:44:56 2024] [153221] ceph_read_iter:1559: ceph:  aio_sync_read
> 00000000789dccee 100000010ef.fffffffffffffffe 0~440 got cap refs on Fr
> [Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:852: ceph:  sync_read on file
> 00000000d9e861fb 0~440
> [Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:913: ceph:  sync_read 0~440 got 440 i_size 0
> [Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:966: ceph:  sync_read result 0 retry_op 2
>
> ...
>
> [Wed Jan 10 09:44:57 2024] [153221] ceph_read_iter:1559: ceph:  aio_sync_read
> 00000000789dccee 100000010ef.fffffffffffffffe 0~440 got cap refs on Fr
> [Wed Jan 10 09:44:57 2024] [153221] ceph_sync_read:852: ceph:  sync_read on file
> 00000000d9e861fb 0~0
>
>
> The logs indicate that:
> 1. ceph_sync_read may read data but i_size is obsolete in simultaneous rw situation
> 2. The commit in 5.16-rc1 cap ret to i_size and set retry_op = CHECK_EOF
> 3. When retrying, ceph_sync_read gets len=0 since iov count has modified in
> copy_page_to_iter
> 4. ceph_read_iter return 0
>
> I'm not sure if my understanding is correct. As a reference, here is my simple
> patch and I need more comments. The purpose of the patch is to prevent
> sync read handler from doing copy page when ret > i_size.
>
> Thanks.
>
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 220a41831b46..5897f52ee998 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -926,6 +926,9 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
>
>                  idx = 0;
>                  left = ret > 0 ? ret : 0;
> +               if (left > i_size) {
> +                       left = i_size;
> +               }
>                  while (left > 0) {
>                          size_t len, copied;
>                          page_off = off & ~PAGE_MASK;
> @@ -952,7 +955,7 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
>                          break;
>          }
>
> -       if (off > iocb->ki_pos) {
> +       if (off > iocb->ki_pos || i_size == 0) {
>                  if (off >= i_size) {
>                          *retry_op = CHECK_EOF;
>                          ret = i_size - iocb->ki_pos;


^ permalink raw reply related	[flat|nested] 6+ messages in thread

[parent not found: <6f953a75-23cc-4d41-bbc5-2ca0a839f6d3@redhat.com>]

* Re: Read operation gets EOF return when there is multi-client read/write after linux 5.16-rc1
       [not found] ` <6f953a75-23cc-4d41-bbc5-2ca0a839f6d3@redhat.com>
@ 2024-02-21 17:12   ` Gregory Farnum
  0 siblings, 0 replies; 6+ messages in thread
From: Gregory Farnum @ 2024-02-21 17:12 UTC (permalink / raw)
  To: Xiubo Li
  Cc: Frank Hsiao 蕭法宣, ceph-devel@vger.kernel.org,
	Jeff Layton, Venky Shankar, Patrick Donnelly

On Mon, Feb 19, 2024 at 9:09 PM Xiubo Li <xiubli@redhat.com> wrote:
>
>
> On 1/24/24 11:25, Frank Hsiao 蕭法宣 wrote:
>
> When multiple ceph kernel clients perform read/write on the same file, the read
> operation(ceph_sync_read) returns EOF(ret = 0) even though the file has been
> written by another client.
>
> My envs use Ceph quincy(v17.2.6) and mount cephfs by ceph kernel client. For the
> client side, I use Samba(v4.18.8) to export the folder as smb share and test it
> with smbtorture. The test case is smb2.rw.rw1 with the following failure
> message:
>
> test: samba4.smb2.rw.rw1
> Checking data integrity over 10 ops
> read failed(NT_STATUS_END_OF_FILE)
> failure: samba4.smb2.rw.rw1 [
> Exception: read 0, expected 440
> ]
>
> After some testing, I figured out that the failure only happens when I have
> linux kernel version>=5.16-rc1, specifically after commit
> c3d8e0b5de487a7c462781745bc17694a4266696. Kernel logs as below(on 5.16-rc1):
>
>
> [Wed Jan 10 09:44:56 2024] [153221] ceph_read_iter:1559: ceph:  aio_sync_read
> 00000000789dccee 100000010ef.fffffffffffffffe 0~440 got cap refs on Fr
> [Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:852: ceph:  sync_read on file
> 00000000d9e861fb 0~440
> [Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:913: ceph:  sync_read 0~440 got 440 i_size 0
> [Wed Jan 10 09:44:56 2024] [153221] ceph_sync_read:966: ceph:  sync_read result 0 retry_op 2
>
> ...
>
> [Wed Jan 10 09:44:57 2024] [153221] ceph_read_iter:1559: ceph:  aio_sync_read
> 00000000789dccee 100000010ef.fffffffffffffffe 0~440 got cap refs on Fr
> [Wed Jan 10 09:44:57 2024] [153221] ceph_sync_read:852: ceph:  sync_read on file
> 00000000d9e861fb 0~0
>
>
> The logs indicate that:
> 1. ceph_sync_read may read data but i_size is obsolete in simultaneous rw situation
> 2. The commit in 5.16-rc1 cap ret to i_size and set retry_op = CHECK_EOF
> 3. When retrying, ceph_sync_read gets len=0 since iov count has modified in
> copy_page_to_iter
> 4. ceph_read_iter return 0
>
> Yeah, in simultaneious read and write case, the CInode's filelock will be in MIX mode. When one client append new contents to the file it will increase the 'i_size' metadata to MDS, and then the MDS will broadcast the new metadatas to all the other clients. But in your case the reader client didn't get it yet before reading and was still using the old metadata.
>
> And the cephfs itself won't guarantee that the reader clients could get latest contents and metadata immediately. There will be a delay between clients. It should the apps themself to guarantee this.
>
> Before commit c3d8e0b5de487a7c462781745bc17694a4266696 it won't check the local 'i_size' and just read contents beyond of the file end boundary from Rados. Even we allow it like this for the contents, but the metadatas still have the same issue. So this sounds incorrect to me.
>
> Greg, Venky, Patrick, Jeff
>
> What do you think ?

I'm a little worried about this phrasing. In a MIX mode, clients can
only have Frw caps (and the lack of Fs means, don't cache file size
locally); the MDS may broadcast file size updates but I don't think
clients are allowed to rely on that. So from looking at the userspace
client, what should be happening in the described scenario is, the
client extends the file size with the MDS, then does its
file-extending write to the OSD, then the write is reported as
complete. And the second client gets the read request, at which point
it would have to check the file size from the MDS, then see the larger
size, then do the OSD read.
...and actually, when the MDS gets a setattr for the size, it grabs an
xlock locally, which may mean it recalls the Frw caps while it makes
that change? I haven't had to dig into the guts of this in a while, so
maybe the local file size is reliable. I guess Fs also covers mtime,
and that's why we don't issue it in MIX mode? Ugh.

Anyway, it is definitely not up to the applications to try and provide
file consistency as that is why we have filesystems at all. :) We
definitely do guarantee that readers get contents and metadata which
reflect any completed writes, as of the time the read was initiated —
this is a basic requirement of POSIX. (Of course, if the write isn't
completed when the write starts, we can do whatever we want.)

I'm not generally up on the kernel implementation, so any patches
which may have broken it, and the correct fix, are beyond me.

But having said all that, layering samba on top of the kernel client
is going to have wildly unpredictable behavior — the kernel client can
maintain all the consistency it wants, but with Samba sitting on top
caching things then the real file state is going to be completely
uncontrolled. I don't see how you're going to get predictable
outcomes, and am surprised any kernel client cephfs change could
reliably influence this, although the referenced kernel patch
certainly does change what we return to reads when there is
simultaneous IO. What's the change from i_size_read(inode) -> i_size
do — is that correctly covered by cap state? Or is the problem here
that smbtorture is running on nodes going to multiple sambas that are
speaking to the kernel cephfs client and getting fairly random results
back due to that? (I could see eg smb1 does a truncate+write which
gets flushed to OSD, then smb2 does a read before the cephfs write is
finished which happens to pick up the new data while the size updates
are happening.)
-Greg

> I'm not sure if my understanding is correct. As a reference, here is my simple
> patch and I need more comments. The purpose of the patch is to prevent
> sync read handler from doing copy page when ret > i_size.
>
> Thanks.
>
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 220a41831b46..5897f52ee998 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -926,6 +926,9 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
>
>                 idx = 0;
>                 left = ret > 0 ? ret : 0;
> +               if (left > i_size) {
> +                       left = i_size;
> +               }
>                 while (left > 0) {
>                         size_t len, copied;
>                         page_off = off & ~PAGE_MASK;
> @@ -952,7 +955,7 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
>                         break;
>         }
>
> -       if (off > iocb->ki_pos) {
> +       if (off > iocb->ki_pos || i_size == 0) {
>
> This seems incorrect to me.
>
> For example, what if the existing file size is 1024 bytes and then the writer client tries to append 1024 bytes to the file, then the file new size will be 2048. And then the reader clients' file i_size could still be 1024.
>
> Thanks
>
> - Xiubo
>
>                 if (off >= i_size) {
>                         *retry_op = CHECK_EOF;
>                         ret = i_size - iocb->ki_pos;


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-02-21 17:12 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-24  3:25 Read operation gets EOF return when there is multi-client read/write after linux 5.16-rc1 Frank Hsiao 蕭法宣
2024-02-16  4:24 ` 回覆: " Frank Hsiao 蕭法宣
2024-02-19  2:08   ` Xiubo Li
2024-02-20  8:31 ` Xiubo Li
2024-02-21  1:41 ` Xiubo Li
     [not found] ` <6f953a75-23cc-4d41-bbc5-2ca0a839f6d3@redhat.com>
2024-02-21 17:12   ` Gregory Farnum

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).