CEPH filesystem development
* bad crc in data caused by a race condition in write_partial_message_data
@ 2017-01-15  7:45 caifeng.zhu
  2017-01-15 16:26 ` Alex Elder
  2017-01-15 17:01 ` Ilya Dryomov
  0 siblings, 2 replies; 7+ messages in thread
From: caifeng.zhu @ 2017-01-15  7:45 UTC (permalink / raw)
  To: ceph-devel

Hi, all

Let's look at the problem first. We have a lot of 'bad crc in data'
warnings at the OSDs, like those below:
    2017-01-14 23:25:54.671599 7f67201b3700  0 bad crc in data 1480547403 != exp 3751318843
    2017-01-14 23:25:54.681146 7f67201b3700  0 bad crc in data 3044715775 != exp 3018112170
    2017-01-14 23:25:54.681822 7f67201b3700  0 bad crc in data 2815383560 != exp 1455746011
    2017-01-14 23:25:54.686106 7f67205da700  0 bad crc in data 1781929234 != exp 498105391
    2017-01-14 23:25:54.688092 7f67205da700  0 bad crc in data 1845054835 != exp 3337474350
    2017-01-14 23:25:54.693225 7f67205da700  0 bad crc in data 1518733907 != exp 3781627678
    2017-01-14 23:25:54.755653 7f6724115700  0 bad crc in data 1173337243 != exp 3759627242
    ...
This problem occurs when we are testing (with fio) an NFS client whose NFS server is
built on an XFS + RBD combination. The bad effect is that the OSD closes the
connection on the crc error and drops all reply messages sent through that connection,
but the kernel rbd client holds the requests and waits for the already-dropped
replies, which will never come. A deadlock occurs.

After some analysis, we suspect write_partial_message_data has a race condition.
(The code below is taken from upstream on GitHub.)
    1562		page = ceph_msg_data_next(cursor, &page_offset, &length,
    1563					  &last_piece);
    1564		ret = ceph_tcp_sendpage(con->sock, page, page_offset,
    1565					length, !last_piece);
    ...
    1572		if (do_datacrc && cursor->need_crc)
    1573			crc = ceph_crc32c_page(crc, page, page_offset, length);
At lines 1564 ~ 1572, a worker thread of the libceph workqueue may send the page out over TCP
and compute the CRC. But simultaneously, at the VFS/XFS level, another thread may be
writing to the file range cached by the page being sent out. If the page send and the crc
computation are interleaved with such a data write, the receiving OSD will complain of a bad CRC.
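
To make the interleaving concrete, here is a small userspace demonstration we wrote
(illustration only, not kernel code; the simple checksum stands in for crc32c). One
thread keeps overwriting a buffer, the way a writer redirties a page cache page,
while the main thread checksums the buffer on both sides of a simulated send:

    /*
     * Illustration of the suspected race: checksum a shared buffer
     * before and after a simulated "send" while another thread keeps
     * overwriting it.  Any mismatch means the buffer changed mid-send,
     * which is exactly what makes the receiver's CRC check fail.
     * Build: gcc -O2 -pthread race_demo.c -o race_demo
     */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static unsigned char page[4096];
    static volatile int stop;

    /* Simple stand-in checksum; the kernel uses crc32c. */
    static uint32_t sum(const unsigned char *p, size_t n)
    {
        uint32_t s = 0;
        while (n--)
            s = s * 31 + *p++;
        return s;
    }

    static void *writer(void *arg)
    {
        unsigned char v = 0;
        while (!stop)
            memset(page, v++, sizeof(page));  /* concurrent file write */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        int i, mismatches = 0;

        pthread_create(&t, NULL, writer, NULL);
        for (i = 0; i < 1000; i++) {
            uint32_t before = sum(page, sizeof(page));  /* crc at send time */
            usleep(100);                                /* simulated sendpage */
            uint32_t after = sum(page, sizeof(page));   /* crc recomputed */
            if (before != after)
                mismatches++;
        }
        stop = 1;
        pthread_join(t, NULL);
        printf("%d/1000 checksum mismatches\n", mismatches);
        return 0;
    }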

To verify our suspicion, we added the debug patch below:
(The code below is based on our kernel version.)
@@ -1527,9 +1527,14 @@ static int write_partial_message_data(st
                bool last_piece;
                bool need_crc;
                int ret;
+               u32 crc2 = 0;

                page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
                                                        &last_piece);
+
+               if (do_datacrc && cursor->need_crc)
+                       crc2 = ceph_crc32c_page(crc, page, page_offset, length);
+
                ret = ceph_tcp_sendpage(con->sock, page, page_offset,
                                      length, last_piece);
                if (ret <= 0) {
@@ -1538,8 +1543,12 @@ static int write_partial_message_data(st

                        return ret;
                }
-               if (do_datacrc && cursor->need_crc)
+               if (do_datacrc && cursor->need_crc) {
                        crc = ceph_crc32c_page(crc, page, page_offset, length);
+                       if (crc2 != crc)
+                               pr_warn("tampered page %p: "
+                                   "before 0x%x, current 0x%x\n", page, crc2, crc);
+               }
                need_crc = ceph_msg_data_advance(&msg->cursor, (size_t)ret);
        }
And we get the warning messages below:
    [Sun Jan 15 14:11:29 2017] libceph: tampered page ffffea002a8fb140: before 0x1aa3b794, current 0x5fe707d6
    [Sun Jan 15 14:11:29 2017] libceph: tampered page ffffea002122f680: before 0xec71a744, current 0x9b7d382f
    [Sun Jan 15 14:11:29 2017] libceph: tampered page ffffea002372b740: before 0xa1849173, current 0x888a93c1
    [Sun Jan 15 14:11:30 2017] libceph: tampered page ffffea0027a5a500: before 0x6fcb56ac, current 0x1b9aeced

A possible solution: if crc checking is enabled, copy the page and use the copy
for both sending and the crc computation. Is that OK?
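
Roughly, we have in mind something like the untested sketch below. The bounce page is
our own addition (e.g. one page allocated per connection with alloc_page()), and
highmem pages would need kmap() instead of page_address():

    /* Untested sketch of the copy idea: snapshot the (possibly still
     * changing) source page into a preallocated bounce page, so that
     * the TCP send and the CRC operate on the same stable bytes. */
    page = ceph_msg_data_next(cursor, &page_offset, &length, &last_piece);

    if (do_datacrc && cursor->need_crc) {
        memcpy(page_address(bounce) + page_offset,
               page_address(page) + page_offset, length);
        page = bounce;  /* send and checksum the snapshot */
    }

    ret = ceph_tcp_sendpage(con->sock, page, page_offset,
                            length, !last_piece);
    ...
    if (do_datacrc && cursor->need_crc)
        crc = ceph_crc32c_page(crc, page, page_offset, length);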

Best Regards.




* Re: bad crc in data caused by a race condition in write_partial_message_data
  2017-01-15  7:45 bad crc in data caused by a race condition in write_partial_message_data caifeng.zhu
@ 2017-01-15 16:26 ` Alex Elder
  2017-01-16  3:09   ` caifeng.zhu
  2017-01-15 17:01 ` Ilya Dryomov
  1 sibling, 1 reply; 7+ messages in thread
From: Alex Elder @ 2017-01-15 16:26 UTC (permalink / raw)
  To: caifeng.zhu, ceph-devel

On 01/15/2017 01:45 AM, caifeng.zhu@uniswdc.com wrote:
> Hi, all
> 
> Let's look at the problem first. We have a lot of 'bad crc in data'
> warnings at the OSDs, like those below:
>     2017-01-14 23:25:54.671599 7f67201b3700  0 bad crc in data 1480547403 != exp 3751318843
>     2017-01-14 23:25:54.681146 7f67201b3700  0 bad crc in data 3044715775 != exp 3018112170
>     2017-01-14 23:25:54.681822 7f67201b3700  0 bad crc in data 2815383560 != exp 1455746011
>     2017-01-14 23:25:54.686106 7f67205da700  0 bad crc in data 1781929234 != exp 498105391
>     2017-01-14 23:25:54.688092 7f67205da700  0 bad crc in data 1845054835 != exp 3337474350
>     2017-01-14 23:25:54.693225 7f67205da700  0 bad crc in data 1518733907 != exp 3781627678
>     2017-01-14 23:25:54.755653 7f6724115700  0 bad crc in data 1173337243 != exp 3759627242
>     ...
> This problem occurs when we are testing (with fio) an NFS client whose NFS server is
> built on an XFS + RBD combination. The bad effect is that the OSD closes the
> connection on the crc error and drops all reply messages sent through that connection,
> but the kernel rbd client holds the requests and waits for the already-dropped
> replies, which will never come. A deadlock occurs.

The first problem is the reports of bad CRCs.  And the OSD reporting
this for messages sent by the RBD kernel client makes sense, given
what you say below.

Your statement that a deadlock occurs after that doesn't sound right.
Are these services (OSDs, RBD client, NFS client) running on different
machines?

> After some analysis, we suspect write_partial_message_data has a race condition.
> (The code below is taken from upstream on GitHub.)
>     1562		page = ceph_msg_data_next(cursor, &page_offset, &length,
>     1563					  &last_piece);
>     1564		ret = ceph_tcp_sendpage(con->sock, page, page_offset,
>     1565					length, !last_piece);
>     ...
>     1572		if (do_datacrc && cursor->need_crc)
>     1573			crc = ceph_crc32c_page(crc, page, page_offset, length);
> At lines 1564 ~ 1572, a worker thread of the libceph workqueue may send the page out over TCP
> and compute the CRC. But simultaneously, at the VFS/XFS level, another thread may be
> writing to the file range cached by the page being sent out. If the page send and the crc
> computation are interleaved with such a data write, the receiving OSD will complain of a bad CRC.

This should not be happening.  Data supplied to the Ceph messenger
for sending can no longer be subject to modification.

> To verify our suspicion, we added the debug patch below:

Your patch does indeed seem to show that a page is modified
during the call to ceph_tcp_sendpage().  But again, that should
not be happening.

I hope I'm not missing something obvious here...

> (The code below is based on our kernel version.)
> @@ -1527,9 +1527,14 @@ static int write_partial_message_data(st
>                 bool last_piece;
>                 bool need_crc;
>                 int ret;
> +               u32 crc2 = 0;
> 
>                 page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
>                                                         &last_piece);
> +
> +               if (do_datacrc && cursor->need_crc)
> +                       crc2 = ceph_crc32c_page(crc, page, page_offset, length);
> +
>                 ret = ceph_tcp_sendpage(con->sock, page, page_offset,
>                                       length, last_piece);
>                 if (ret <= 0) {
> @@ -1538,8 +1543,12 @@ static int write_partial_message_data(st
> 
>                         return ret;
>                 }
> -               if (do_datacrc && cursor->need_crc)
> +               if (do_datacrc && cursor->need_crc) {
>                         crc = ceph_crc32c_page(crc, page, page_offset, length);
> +                       if (crc2 != crc)
> +                               pr_warn("tampered page %p: "
> +                                   "before 0x%x, current 0x%x\n", page, crc2, crc);
> +               }
>                 need_crc = ceph_msg_data_advance(&msg->cursor, (size_t)ret);
>         }
> And we get the warning messages below:
>     [Sun Jan 15 14:11:29 2017] libceph: tampered page ffffea002a8fb140: before 0x1aa3b794, current 0x5fe707d6
>     [Sun Jan 15 14:11:29 2017] libceph: tampered page ffffea002122f680: before 0xec71a744, current 0x9b7d382f
>     [Sun Jan 15 14:11:29 2017] libceph: tampered page ffffea002372b740: before 0xa1849173, current 0x888a93c1
>     [Sun Jan 15 14:11:30 2017] libceph: tampered page ffffea0027a5a500: before 0x6fcb56ac, current 0x1b9aeced
> 
> A possible solution: if crc checking is enabled, copy the page and use the copy
> for both sending and the crc computation. Is that OK?

Sending a copied page would ensure the page you're sending doesn't
change.  But if someone else is modifying the original, there's no
way of knowing whether the copied page is really intact either.

Sending a copy would treat a symptom and not the underlying problem.
So I'd say "no, that's not OK."

					-Alex

> 
> Best Regards.
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



* Re: bad crc in data caused by a race condition in write_partial_message_data
  2017-01-15  7:45 bad crc in data caused by a race condition in write_partial_message_data caifeng.zhu
  2017-01-15 16:26 ` Alex Elder
@ 2017-01-15 17:01 ` Ilya Dryomov
  2017-01-16  3:24   ` caifeng.zhu
  1 sibling, 1 reply; 7+ messages in thread
From: Ilya Dryomov @ 2017-01-15 17:01 UTC (permalink / raw)
  To: caifeng.zhu; +Cc: Ceph Development

On Sun, Jan 15, 2017 at 8:45 AM,  <caifeng.zhu@uniswdc.com> wrote:
> Hi, all
>
> Let's look at the problem first. We have a lot of 'bad crc in data'
> warnings at the OSDs, like those below:
>     2017-01-14 23:25:54.671599 7f67201b3700  0 bad crc in data 1480547403 != exp 3751318843
>     2017-01-14 23:25:54.681146 7f67201b3700  0 bad crc in data 3044715775 != exp 3018112170
>     2017-01-14 23:25:54.681822 7f67201b3700  0 bad crc in data 2815383560 != exp 1455746011
>     2017-01-14 23:25:54.686106 7f67205da700  0 bad crc in data 1781929234 != exp 498105391
>     2017-01-14 23:25:54.688092 7f67205da700  0 bad crc in data 1845054835 != exp 3337474350
>     2017-01-14 23:25:54.693225 7f67205da700  0 bad crc in data 1518733907 != exp 3781627678
>     2017-01-14 23:25:54.755653 7f6724115700  0 bad crc in data 1173337243 != exp 3759627242
>     ...
> This problem occurs when we are testing (with fio) an NFS client whose NFS server is
> built on an XFS + RBD combination. The bad effect is that the OSD closes the
> connection on the crc error and drops all reply messages sent through that connection,
> but the kernel rbd client holds the requests and waits for the already-dropped
> replies, which will never come. A deadlock occurs.
>
> After some analysis, we suspect write_partial_message_data has a race condition.
> (The code below is taken from upstream on GitHub.)
>     1562                page = ceph_msg_data_next(cursor, &page_offset, &length,
>     1563                                          &last_piece);
>     1564                ret = ceph_tcp_sendpage(con->sock, page, page_offset,
>     1565                                        length, !last_piece);
>     ...
>     1572                if (do_datacrc && cursor->need_crc)
>     1573                        crc = ceph_crc32c_page(crc, page, page_offset, length);
> At lines 1564 ~ 1572, a worker thread of the libceph workqueue may send the page out over TCP
> and compute the CRC. But simultaneously, at the VFS/XFS level, another thread may be
> writing to the file range cached by the page being sent out. If the page send and the crc
> computation are interleaved with such a data write, the receiving OSD will complain of a bad CRC.
>
> To verify our suspicion, we added the debug patch below:
> (The code below is based on our kernel version.)

... which is based on?  This should be fixed in 4.3+ and all recent stable
kernels.

Thanks,

                Ilya


* Re: bad crc in data caused by a race condition in write_partial_message_data
  2017-01-15 16:26 ` Alex Elder
@ 2017-01-16  3:09   ` caifeng.zhu
  0 siblings, 0 replies; 7+ messages in thread
From: caifeng.zhu @ 2017-01-16  3:09 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

On Sun, Jan 15, 2017 at 10:26:38AM -0600, Alex Elder wrote:
> On 01/15/2017 01:45 AM, caifeng.zhu@uniswdc.com wrote:
> > Hi, all
> > 
> > Let's look at the problem first. We have a lot of 'bad crc in data'
> > warnings at the OSDs, like those below:
> >     2017-01-14 23:25:54.671599 7f67201b3700  0 bad crc in data 1480547403 != exp 3751318843
> >     2017-01-14 23:25:54.681146 7f67201b3700  0 bad crc in data 3044715775 != exp 3018112170
> >     2017-01-14 23:25:54.681822 7f67201b3700  0 bad crc in data 2815383560 != exp 1455746011
> >     2017-01-14 23:25:54.686106 7f67205da700  0 bad crc in data 1781929234 != exp 498105391
> >     2017-01-14 23:25:54.688092 7f67205da700  0 bad crc in data 1845054835 != exp 3337474350
> >     2017-01-14 23:25:54.693225 7f67205da700  0 bad crc in data 1518733907 != exp 3781627678
> >     2017-01-14 23:25:54.755653 7f6724115700  0 bad crc in data 1173337243 != exp 3759627242
> >     ...
> > This problem occurs when we are testing (with fio) an NFS client whose NFS server is
> > built on an XFS + RBD combination. The bad effect is that the OSD closes the
> > connection on the crc error and drops all reply messages sent through that connection,
> > but the kernel rbd client holds the requests and waits for the already-dropped
> > replies, which will never come. A deadlock occurs.
> 
> The first problem is the reports of bad CRCs.  And the OSD reporting
> this for messages sent by the RBD kernel client makes sense, given
> what you say below.
> 
> Your statement that a deadlock occurs after that doesn't sound right.
> Are these services (OSDs, RBD client, NFS client) running on different
> machines?
> 

Yes, all these services are running on different machines.

> > After some analysis, we suspect write_partial_message_data has a race condition.
> > (The code below is taken from upstream on GitHub.)
> >     1562		page = ceph_msg_data_next(cursor, &page_offset, &length,
> >     1563					  &last_piece);
> >     1564		ret = ceph_tcp_sendpage(con->sock, page, page_offset,
> >     1565					length, !last_piece);
> >     ...
> >     1572		if (do_datacrc && cursor->need_crc)
> >     1573			crc = ceph_crc32c_page(crc, page, page_offset, length);
> > At lines 1564 ~ 1572, a worker thread of the libceph workqueue may send the page out over TCP
> > and compute the CRC. But simultaneously, at the VFS/XFS level, another thread may be
> > writing to the file range cached by the page being sent out. If the page send and the crc
> > computation are interleaved with such a data write, the receiving OSD will complain of a bad CRC.
> 
> This should not be happening.  Data supplied to the Ceph messenger
> for sending can no longer be subject to modification.
> 

I don't understand why pages handed to the ceph messenger can NOT be modified.
As far as I know, pages caching file data are unlocked before the bio completes
and can be grabbed for further writing.
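
Unless the backing device explicitly asks for stable pages, generic writeback does not
make a writer wait on a page under writeback. If I read the mm code correctly, the
gate is roughly this (paraphrased from a 3.x/4.x tree):

    /* mm/page-writeback.c, paraphrased: a writer about to redirty a
     * page waits for writeback only if the backing device requires
     * stable pages; otherwise it may modify the page while it is
     * still in flight. */
    void wait_for_stable_page(struct page *page)
    {
        if (bdi_cap_stable_pages_required(inode_to_bdi(page->mapping->host)))
            wait_on_page_writeback(page);
    }

So, as far as I can see, nothing stops a writer from redirtying the page while the
messenger is still sending and checksumming it.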

Could you please explain why?

> > To verify our suspicion, we added the debug patch below:
> 
> Your patch does indeed seem to show that a page is modified
> during the call to ceph_tcp_sendpage().  But again, that should
> not be happening.
> 
> I hope I'm not missing something obvious here...
> 
> > (The code below is based on our kernel version.)
> > @@ -1527,9 +1527,14 @@ static int write_partial_message_data(st
> >                 bool last_piece;
> >                 bool need_crc;
> >                 int ret;
> > +               u32 crc2 = 0;
> > 
> >                 page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
> >                                                         &last_piece);
> > +
> > +               if (do_datacrc && cursor->need_crc)
> > +                       crc2 = ceph_crc32c_page(crc, page, page_offset, length);
> > +
> >                 ret = ceph_tcp_sendpage(con->sock, page, page_offset,
> >                                       length, last_piece);
> >                 if (ret <= 0) {
> > @@ -1538,8 +1543,12 @@ static int write_partial_message_data(st
> > 
> >                         return ret;
> >                 }
> > -               if (do_datacrc && cursor->need_crc)
> > +               if (do_datacrc && cursor->need_crc) {
> >                         crc = ceph_crc32c_page(crc, page, page_offset, length);
> > +                       if (crc2 != crc)
> > +                               pr_warn("tampered page %p: "
> > +                                   "before 0x%x, current 0x%x\n", page, crc2, crc);
> > +               }
> >                 need_crc = ceph_msg_data_advance(&msg->cursor, (size_t)ret);
> >         }
> > And we get the warning messages below:
> >     [Sun Jan 15 14:11:29 2017] libceph: tampered page ffffea002a8fb140: before 0x1aa3b794, current 0x5fe707d6
> >     [Sun Jan 15 14:11:29 2017] libceph: tampered page ffffea002122f680: before 0xec71a744, current 0x9b7d382f
> >     [Sun Jan 15 14:11:29 2017] libceph: tampered page ffffea002372b740: before 0xa1849173, current 0x888a93c1
> >     [Sun Jan 15 14:11:30 2017] libceph: tampered page ffffea0027a5a500: before 0x6fcb56ac, current 0x1b9aeced
> > 
> > A possible solution: if crc checking is enabled, copy the page and use the copy
> > for both sending and the crc computation. Is that OK?
> 
> Sending a copied page would ensure the page you're sending doesn't
> change.  But if someone else is modifying the original, there's no
> way of knowing whether the copied page is really intact either.
> 
> Sending a copy would treat a symptom and not the underlying problem.
> So I'd say "no, that's not OK."

If pages handed to the ceph messenger should not be modified, then a page copy
would only hide the real problem, not solve it.

> 
> 					-Alex
> 
> > 
> > Best Regards.
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




* Re: bad crc in data caused by a race condition in write_partial_message_data
  2017-01-15 17:01 ` Ilya Dryomov
@ 2017-01-16  3:24   ` caifeng.zhu
  2017-01-16  8:42     ` Ilya Dryomov
  0 siblings, 1 reply; 7+ messages in thread
From: caifeng.zhu @ 2017-01-16  3:24 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: caifeng.zhu, Ceph Development

On Sun, Jan 15, 2017 at 06:01:05PM +0100, Ilya Dryomov wrote:
> On Sun, Jan 15, 2017 at 8:45 AM,  <caifeng.zhu@uniswdc.com> wrote:
> > Hi, all
> >
> > Let's look at the problem first. We have a lot of 'bad crc in data'
> > warnings at the OSDs, like those below:
> >     2017-01-14 23:25:54.671599 7f67201b3700  0 bad crc in data 1480547403 != exp 3751318843
> >     2017-01-14 23:25:54.681146 7f67201b3700  0 bad crc in data 3044715775 != exp 3018112170
> >     2017-01-14 23:25:54.681822 7f67201b3700  0 bad crc in data 2815383560 != exp 1455746011
> >     2017-01-14 23:25:54.686106 7f67205da700  0 bad crc in data 1781929234 != exp 498105391
> >     2017-01-14 23:25:54.688092 7f67205da700  0 bad crc in data 1845054835 != exp 3337474350
> >     2017-01-14 23:25:54.693225 7f67205da700  0 bad crc in data 1518733907 != exp 3781627678
> >     2017-01-14 23:25:54.755653 7f6724115700  0 bad crc in data 1173337243 != exp 3759627242
> >     ...
> > This problem occurs when we are testing (with fio) an NFS client whose NFS server is
> > built on an XFS + RBD combination. The bad effect is that the OSD closes the
> > connection on the crc error and drops all reply messages sent through that connection,
> > but the kernel rbd client holds the requests and waits for the already-dropped
> > replies, which will never come. A deadlock occurs.
> >
> > After some analysis, we suspect write_partial_message_data has a race condition.
> > (The code below is taken from upstream on GitHub.)
> >     1562                page = ceph_msg_data_next(cursor, &page_offset, &length,
> >     1563                                          &last_piece);
> >     1564                ret = ceph_tcp_sendpage(con->sock, page, page_offset,
> >     1565                                        length, !last_piece);
> >     ...
> >     1572                if (do_datacrc && cursor->need_crc)
> >     1573                        crc = ceph_crc32c_page(crc, page, page_offset, length);
> > At lines 1564 ~ 1572, a worker thread of the libceph workqueue may send the page out over TCP
> > and compute the CRC. But simultaneously, at the VFS/XFS level, another thread may be
> > writing to the file range cached by the page being sent out. If the page send and the crc
> > computation are interleaved with such a data write, the receiving OSD will complain of a bad CRC.
> >
> > To verify our suspicion, we added the debug patch below:
> > (The code below is based on our kernel version.)
> 
> ... which is based on?  This should be fixed in 4.3+ and all recent stable
> kernels.
> 

We are using CentOS 7.1, with the kernel reporting
kernel.osrelease = 3.10.0-229.14.1.el7.1.x86_64.
With the patches CentOS has backported, the ceph kernel client is roughly at the ~4.0 level.

Is there any info or doc about the fixes in 4.3+?
Thanks in advance.

> Thanks,
> 
>                 Ilya
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




* Re: bad crc in data caused by a race condition in write_partial_message_data
  2017-01-16  3:24   ` caifeng.zhu
@ 2017-01-16  8:42     ` Ilya Dryomov
  2017-01-16 10:09       ` caifeng.zhu
  0 siblings, 1 reply; 7+ messages in thread
From: Ilya Dryomov @ 2017-01-16  8:42 UTC (permalink / raw)
  To: caifeng.zhu; +Cc: Ceph Development

On Mon, Jan 16, 2017 at 4:24 AM,  <caifeng.zhu@uniswdc.com> wrote:
> On Sun, Jan 15, 2017 at 06:01:05PM +0100, Ilya Dryomov wrote:
>> On Sun, Jan 15, 2017 at 8:45 AM,  <caifeng.zhu@uniswdc.com> wrote:
>> > Hi, all
>> >
>> > Let's look at the problem first. We have a lot of 'bad crc in data'
>> > warnings at the OSDs, like those below:
>> >     2017-01-14 23:25:54.671599 7f67201b3700  0 bad crc in data 1480547403 != exp 3751318843
>> >     2017-01-14 23:25:54.681146 7f67201b3700  0 bad crc in data 3044715775 != exp 3018112170
>> >     2017-01-14 23:25:54.681822 7f67201b3700  0 bad crc in data 2815383560 != exp 1455746011
>> >     2017-01-14 23:25:54.686106 7f67205da700  0 bad crc in data 1781929234 != exp 498105391
>> >     2017-01-14 23:25:54.688092 7f67205da700  0 bad crc in data 1845054835 != exp 3337474350
>> >     2017-01-14 23:25:54.693225 7f67205da700  0 bad crc in data 1518733907 != exp 3781627678
>> >     2017-01-14 23:25:54.755653 7f6724115700  0 bad crc in data 1173337243 != exp 3759627242
>> >     ...
>> > This problem occurs when we are testing (with fio) an NFS client whose NFS server is
>> > built on an XFS + RBD combination. The bad effect is that the OSD closes the
>> > connection on the crc error and drops all reply messages sent through that connection,
>> > but the kernel rbd client holds the requests and waits for the already-dropped
>> > replies, which will never come. A deadlock occurs.
>> >
>> > After some analysis, we suspect write_partial_message_data has a race condition.
>> > (The code below is taken from upstream on GitHub.)
>> >     1562                page = ceph_msg_data_next(cursor, &page_offset, &length,
>> >     1563                                          &last_piece);
>> >     1564                ret = ceph_tcp_sendpage(con->sock, page, page_offset,
>> >     1565                                        length, !last_piece);
>> >     ...
>> >     1572                if (do_datacrc && cursor->need_crc)
>> >     1573                        crc = ceph_crc32c_page(crc, page, page_offset, length);
>> > At lines 1564 ~ 1572, a worker thread of the libceph workqueue may send the page out over TCP
>> > and compute the CRC. But simultaneously, at the VFS/XFS level, another thread may be
>> > writing to the file range cached by the page being sent out. If the page send and the crc
>> > computation are interleaved with such a data write, the receiving OSD will complain of a bad CRC.
>> >
>> > To verify our suspicion, we added the debug patch below:
>> > (The code below is based on our kernel version.)
>>
>> ... which is based on?  This should be fixed in 4.3+ and all recent stable
>> kernels.
>>
>
> We are using CentOS 7.1, with the kernel reporting
> kernel.osrelease = 3.10.0-229.14.1.el7.1.x86_64.
> With the patches CentOS has backported, the ceph kernel client is roughly at the ~4.0 level.

No, it's not ~4.0.  A lot of important fixes are missing from that
kernel and I'd strongly encourage you to upgrade to the 7.3 kernel.

>
> Is there any info or doc about the fixes in 4.3+?

This is the fix, trivial to cherry-pick and try out:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bae818ee1577c27356093901a0ea48f672eda514
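
The gist, paraphrasing from memory (see the commit for the authoritative diff): make
rbd require stable pages whenever data CRCs are enabled, so a page under writeback
cannot be redirtied while the messenger is sending and checksumming it. Writers then
block in wait_for_stable_page() until writeback completes. Schematically:

    /* Paraphrased sketch, not the literal diff: if the messenger will
     * checksum data pages (CRCs not disabled), tell the block layer
     * that pages must stay stable while under writeback. */
    if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
        q->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;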

Thanks,

                Ilya


* Re: bad crc in data caused by a race condition in write_partial_message_data
  2017-01-16  8:42     ` Ilya Dryomov
@ 2017-01-16 10:09       ` caifeng.zhu
  0 siblings, 0 replies; 7+ messages in thread
From: caifeng.zhu @ 2017-01-16 10:09 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Ceph Development

On Mon, Jan 16, 2017 at 09:42:25AM +0100, Ilya Dryomov wrote:
> On Mon, Jan 16, 2017 at 4:24 AM,  <caifeng.zhu@uniswdc.com> wrote:
> > On Sun, Jan 15, 2017 at 06:01:05PM +0100, Ilya Dryomov wrote:
> >> On Sun, Jan 15, 2017 at 8:45 AM,  <caifeng.zhu@uniswdc.com> wrote:
> >> > Hi, all
> >> >
> >> > Let's look at the problem first. We have a lot of 'bad crc in data'
> >> > warnings at the OSDs, like those below:
> >> >     2017-01-14 23:25:54.671599 7f67201b3700  0 bad crc in data 1480547403 != exp 3751318843
> >> >     2017-01-14 23:25:54.681146 7f67201b3700  0 bad crc in data 3044715775 != exp 3018112170
> >> >     2017-01-14 23:25:54.681822 7f67201b3700  0 bad crc in data 2815383560 != exp 1455746011
> >> >     2017-01-14 23:25:54.686106 7f67205da700  0 bad crc in data 1781929234 != exp 498105391
> >> >     2017-01-14 23:25:54.688092 7f67205da700  0 bad crc in data 1845054835 != exp 3337474350
> >> >     2017-01-14 23:25:54.693225 7f67205da700  0 bad crc in data 1518733907 != exp 3781627678
> >> >     2017-01-14 23:25:54.755653 7f6724115700  0 bad crc in data 1173337243 != exp 3759627242
> >> >     ...
> >> > This problem occurs when we are testing (with fio) an NFS client whose NFS server is
> >> > built on an XFS + RBD combination. The bad effect is that the OSD closes the
> >> > connection on the crc error and drops all reply messages sent through that connection,
> >> > but the kernel rbd client holds the requests and waits for the already-dropped
> >> > replies, which will never come. A deadlock occurs.
> >> >
> >> > After some analysis, we suspect write_partial_message_data has a race condition.
> >> > (The code below is taken from upstream on GitHub.)
> >> >     1562                page = ceph_msg_data_next(cursor, &page_offset, &length,
> >> >     1563                                          &last_piece);
> >> >     1564                ret = ceph_tcp_sendpage(con->sock, page, page_offset,
> >> >     1565                                        length, !last_piece);
> >> >     ...
> >> >     1572                if (do_datacrc && cursor->need_crc)
> >> >     1573                        crc = ceph_crc32c_page(crc, page, page_offset, length);
> >> > At lines 1564 ~ 1572, a worker thread of the libceph workqueue may send the page out over TCP
> >> > and compute the CRC. But simultaneously, at the VFS/XFS level, another thread may be
> >> > writing to the file range cached by the page being sent out. If the page send and the crc
> >> > computation are interleaved with such a data write, the receiving OSD will complain of a bad CRC.
> >> >
> >> > To verify our suspicion, we added the debug patch below:
> >> > (The code below is based on our kernel version.)
> >>
> >> ... which is based on?  This should be fixed in 4.3+ and all recent stable
> >> kernels.
> >>
> >
> > We are using CentOS 7.1, with the kernel reporting
> > kernel.osrelease = 3.10.0-229.14.1.el7.1.x86_64.
> > With the patches CentOS has backported, the ceph kernel client is roughly at the ~4.0 level.
> 
> No, it's not ~4.0.  A lot of important fixes are missing from that
> kernel and I'd strongly encourage you to upgrade to the 7.3 kernel.
> 

Thanks for your suggestion. We'll try it.

> >
> > Is there any info or doc about the fixes in 4.3+?
> 
> This is the fix, trivial to cherry-pick and try out:
> 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bae818ee1577c27356093901a0ea48f672eda514
> 

This patch makes sense to me. It is the elegant solution,
much (much ...) better than my proposal. Thanks for your help!

> Thanks,
> 
>                 Ilya
> 




end of thread

Thread overview: 7+ messages
2017-01-15  7:45 bad crc in data caused by a race condition in write_partial_message_data caifeng.zhu
2017-01-15 16:26 ` Alex Elder
2017-01-16  3:09   ` caifeng.zhu
2017-01-15 17:01 ` Ilya Dryomov
2017-01-16  3:24   ` caifeng.zhu
2017-01-16  8:42     ` Ilya Dryomov
2017-01-16 10:09       ` caifeng.zhu
