Ceph data consistency

* Ceph data consistency
@ 2014-12-30  8:21 Paweł Sadowski
  2014-12-30 12:40 ` Vijayendra Shamanna
  0 siblings, 1 reply; 6+ messages in thread
From: Paweł Sadowski @ 2014-12-30  8:21 UTC (permalink / raw)
  To: ceph-devel

Hi,

From time to time our Ceph cluster ends up with inconsistent PGs
(found by deep-scrub). We have some problems with disks, SATA cables and
an LSI controller that cause occasional I/O errors (but that's not the
point in this case).

When an I/O error occurs on the OSD journal partition, everything works
as it should: the OSD crashes, and that's OK because Ceph will handle it.

But when an I/O error occurs on the OSD data partition during a journal
flush, the OSD continues to work. After calling *writev* (in
buffer::list::write_fd) the OSD does check the return code of that call,
but it does NOT verify that the write actually reached the disk (the data
is still only in memory and there is no fsync). So the OSD thinks the
data has been stored on disk, while it might be discarded (during sync
the dirty page will be reclaimed and you'll see "lost page write due to
I/O error" in dmesg).
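
To make the gap concrete, here is a minimal sketch of a write followed by
an explicit fsync. It is not Ceph code: it uses Python's os.writev and
os.fsync, which are thin wrappers over the same system calls, and
write_durably is a hypothetical helper name. A successful writev only
means the data reached the page cache; a writeback failure can only show
up at the later fsync.

```python
import os
import tempfile


def write_durably(fd, buffers):
    """Write buffers with writev, then fsync so writeback errors surface.

    Hypothetical helper for illustration only; not part of Ceph.
    """
    expected = sum(len(b) for b in buffers)
    # Success here only means the data was copied into the page cache.
    written = os.writev(fd, buffers)
    if written != expected:
        raise OSError(f"short write: {written} of {expected} bytes")
    # fsync raises OSError (for example EIO) if flushing to the device failed.
    os.fsync(fd)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "object")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            print(write_durably(fd, [b"hello ", b"world"]))
        finally:
            os.close(fd)
```

Calling fsync after every write is expensive, which is presumably why a
journaling store would batch it, but the batched sync then has to report
failures for the scheme to be safe.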

Since there is no checksumming of data, I just wanted to make sure that
this is by design. Maybe there is a way to tell the OSD to call fsync
after write so that the data stays consistent?

-- 
Cheers,
PS

* RE: Ceph data consistency
  2014-12-30  8:21 Ceph data consistency Paweł Sadowski
@ 2014-12-30 12:40 ` Vijayendra Shamanna
  2014-12-30 13:40   ` Paweł Sadowski
  0 siblings, 1 reply; 6+ messages in thread
From: Vijayendra Shamanna @ 2014-12-30 12:40 UTC (permalink / raw)
  To: ceph@sadziu.pl, ceph-devel@vger.kernel.org

Hi,

There is a sync thread (sync_entry in FileStore.cc) which runs
periodically and executes sync_filesystem() to ensure that the data is
consistent. The journal entries are trimmed only after a successful
sync_filesystem() call.
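
The rule described here (trim the journal only after a successful sync)
can be sketched as a toy model. This is not FileStore code: ToyJournal,
commit_cycle and the sync_fn callable are hypothetical names, and sync_fn
is assumed to return 0 on success and a negative errno on failure.

```python
class ToyJournal:
    """Hypothetical stand-in for a write-ahead journal; not Ceph's journal."""

    def __init__(self):
        self.entries = []  # list of (sequence number, payload)

    def append(self, seq, payload):
        self.entries.append((seq, payload))

    def trim(self, committed_seq):
        # Drop every entry known to be durable in the data store.
        self.entries = [(s, p) for s, p in self.entries if s > committed_seq]


def commit_cycle(journal, applied_seq, sync_fn):
    """Run one sync pass; trim the journal only if sync_fn reports success."""
    rc = sync_fn()
    if rc != 0:
        # Keep the journal entries so the writes can still be replayed.
        return rc
    journal.trim(applied_seq)
    return 0


if __name__ == "__main__":
    journal = ToyJournal()
    journal.append(1, b"a")
    journal.append(2, b"b")
    print(commit_cycle(journal, 2, lambda: -5), len(journal.entries))
    print(commit_cycle(journal, 2, lambda: 0), len(journal.entries))
```

The guarantee holds only as long as sync_fn reports failures honestly; a
sync function that always returns 0 lets the trim go ahead regardless.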

Thanks
Viju

* Re: Ceph data consistency
  2014-12-30 12:40 ` Vijayendra Shamanna
@ 2014-12-30 13:40   ` Paweł Sadowski
       [not found]     ` <CALurOm0wGEJ5MSrscUvVi_J3fyDevGbT3A11291qTLYTZejr_w@mail.gmail.com>
  0 siblings, 1 reply; 6+ messages in thread
From: Paweł Sadowski @ 2014-12-30 13:40 UTC (permalink / raw)
  To: Vijayendra Shamanna, ceph-devel@vger.kernel.org

On 12/30/2014 01:40 PM, Vijayendra Shamanna wrote:
> Hi,
>
> There is a sync thread (sync_entry in FileStore.cc) which triggers periodically and executes sync_filesystem() to ensure that the data is consistent. The journal entries are trimmed only after a successful sync_filesystem() call

sync_filesystem() always returns zero, so the journal gets trimmed
anyway. Executing sync()/syncfs() with dirty data in disk buffers will
then result in data loss ("lost page write due to I/O error").

I was doing some experiments simulating disk errors using the Device
Mapper "error" target. In this setup the OSD kept writing to the broken
disk without crashing. Every 5 seconds (filestore_max_sync_interval) the
kernel logs that some data was discarded due to an I/O error.
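
The exact setup used for this experiment is not given above. As one
possible way to build such a device, the sketch below generates a dmsetup
table that maps a device linearly except for one sector range handed to
the "error" target. The function name, the device path and the sector
counts are all assumptions made for illustration.

```python
def dm_table_with_bad_range(device, total_sectors, bad_start, bad_len):
    """Return dmsetup table lines for `device` with one failing sector range.

    Each line is "<start> <length> <target> [args]"; the "linear" target
    takes "<device> <offset>", the "error" target takes no arguments.
    """
    if bad_start < 0 or bad_len <= 0 or bad_start + bad_len > total_sectors:
        raise ValueError("bad range must lie inside the device")
    lines = []
    if bad_start > 0:
        lines.append(f"0 {bad_start} linear {device} 0")
    lines.append(f"{bad_start} {bad_len} error")
    tail_start = bad_start + bad_len
    if tail_start < total_sectors:
        tail_len = total_sectors - tail_start
        lines.append(f"{tail_start} {tail_len} linear {device} {tail_start}")
    return "\n".join(lines)


if __name__ == "__main__":
    # /dev/sdX and the sizes are placeholders, not values from this thread.
    print(dm_table_with_bad_range("/dev/sdX", 1000, 100, 50))
```

The resulting text can be fed to "dmsetup create <name>" on standard
input; a table consisting of a single "error" line makes every I/O to
the mapped device fail.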


-- 
PS

* RE: Ceph data consistency
       [not found]     ` <CALurOm0wGEJ5MSrscUvVi_J3fyDevGbT3A11291qTLYTZejr_w@mail.gmail.com>
@ 2015-01-07  1:41       ` Ma, Jianpeng
  2015-01-07  2:18         ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Ma, Jianpeng @ 2015-01-07  1:41 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph@sadziu.pl, Vijayendra.Shamanna@sandisk.com,
	ceph-devel@vger.kernel.org

> ---------- Forwarded message ----------
> From: Paweł Sadowski <ceph@sadziu.pl>
> Date: 2014-12-30 21:40 GMT+08:00
> Subject: Re: Ceph data consistency
> To: Vijayendra Shamanna <Vijayendra.Shamanna@sandisk.com>,
> "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
> 
> 
> On 12/30/2014 01:40 PM, Vijayendra Shamanna wrote:
> > Hi,
> >
> > There is a sync thread (sync_entry in FileStore.cc) which triggers
> > periodically and executes sync_filesystem() to ensure that the data is
> > consistent. The journal entries are trimmed only after a successful
> > sync_filesystem() call
> 
> sync_filesystem() always returns zero and journal will be trimmed.
> Executing sync()/syncfs() with dirty data in disk buffers will result in data loss
> ("lost page write due to I/O error").
> 
Hi sage:

From the git log, I see that sync_filesystem() originally returned the
result of syncfs(). That changed in commit 808c644248e486f44:
    Improve use of syncfs.
    Test syncfs return value and fallback to btrfs sync and then sync.
The author hoped that if syncfs() hit an error, sync() could resolve it.
Because sync() doesn't return a result, the function now only returns zero.
But which errors can be handled this way? AFAIK, none.
I suggest returning the result of syncfs() directly.
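
A rough sketch of the two behaviours, modelled in Python rather than
taken from the Ceph source: the function names are hypothetical, the
btrfs fallback step is left out, and _libc_syncfs assumes a Linux libc
that exports syncfs(). Both functions accept the sync primitives as
parameters so the difference can be shown without a failing disk.

```python
import ctypes
import errno
import os


def _libc_syncfs(fd):
    """Call syncfs(2) through libc; return 0 or a negative errno (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.syncfs(fd) == 0:
        return 0
    return -ctypes.get_errno()


def sync_filesystem_with_fallback(fd, syncfs=_libc_syncfs, sync=os.sync):
    """Behaviour as described in this thread: fall back to sync(), return 0."""
    if syncfs(fd) == 0:
        return 0
    # sync(2) has no return value, so the earlier failure is lost here.
    sync()
    return 0


def sync_filesystem_propagating(fd, syncfs=_libc_syncfs):
    """Proposed behaviour: hand whatever syncfs() reports back to the caller."""
    return syncfs(fd)


if __name__ == "__main__":
    def failing_syncfs(fd):
        return -errno.EIO  # simulated failure, no real device involved

    print(sync_filesystem_with_fallback(0, syncfs=failing_syncfs, sync=lambda: None))
    print(sync_filesystem_propagating(0, syncfs=failing_syncfs))
```

With the second variant, a caller that trims the journal only on a zero
return value has a chance to notice the failure instead of discarding
entries that never reached the disk.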

Jianpeng Ma
Thanks!



* RE: Ceph data consistency
  2015-01-07  1:41       ` Ma, Jianpeng
@ 2015-01-07  2:18         ` Sage Weil
  2015-01-07  5:59           ` Ma, Jianpeng
  0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2015-01-07  2:18 UTC (permalink / raw)
  To: Ma, Jianpeng
  Cc: ceph@sadziu.pl, Vijayendra.Shamanna@sandisk.com,
	ceph-devel@vger.kernel.org

On Wed, 7 Jan 2015, Ma, Jianpeng wrote:
> Hi Sage,
>
> From the git log, I see that sync_filesystem() originally returned the
> result of syncfs(). That changed in commit 808c644248e486f44:
>     Improve use of syncfs.
>     Test syncfs return value and fallback to btrfs sync and then sync.
> The author hoped that if syncfs() hit an error, sync() could resolve it.
> Because sync() doesn't return a result, the function now only returns zero.
> But which errors can be handled this way? AFAIK, none.
> I suggest returning the result of syncfs() directly.

Yeah, that sounds right!

sage


* RE: Ceph data consistency
  2015-01-07  2:18         ` Sage Weil
@ 2015-01-07  5:59           ` Ma, Jianpeng
  0 siblings, 0 replies; 6+ messages in thread
From: Ma, Jianpeng @ 2015-01-07  5:59 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph@sadziu.pl, Vijayendra.Shamanna@sandisk.com,
	ceph-devel@vger.kernel.org

Hi Sage,
   Pull request is https://github.com/ceph/ceph/pull/3305.

Thanks!
Jianpeng Ma
