public inbox for linux-xfs@vger.kernel.org
* Fatigue for XFS
@ 2014-05-05 19:49 Andrey Korolyov
  2014-05-05 20:36 ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Andrey Korolyov @ 2014-05-05 19:49 UTC (permalink / raw)
  To: ceph-users@lists.ceph.com; +Cc: xfs@oss.sgi.com

Hello,

We are currently investigating an issue which may be related either to
Ceph itself or to XFS - any help is much appreciated.

First, the picture: a relatively old cluster with two years of uptime
and ten months since the filesystem was recreated on every OSD. One of
the daemons started to flap approximately once per day for a couple of
weeks, with no external reason (bandwidth/IOPS/host issues). It looks
almost the same every time - the OSD suddenly stops serving requests
for a short period, gets kicked out by peer reports, then returns a
couple of seconds later. Of course, a small but sensitive number of
requests is delayed by 15-30 seconds twice, which is bad for us. The
only thing which correlates with this kick is a peak of I/O - not too
large, not even saturating the underlying disk, but unique in the
cluster and clearly visible. There are also at least two occurrences
*without* a correlated iowait peak.

I have two theories: either we're touching some sector on disk which
is about to be marked as dead but is not yet shown in the SMART
statistics, or (as I believe) some kind of XFS fatigue, which seems
more likely in this case, since a near-bad sector should be touched
more frequently and, in my experience, the related impact would leave
traces in dmesg/SMART. I would like to ask if anyone has had a similar
experience, or can suggest a way to poke the existing filesystem. If
no suggestions appear, I'll probably reformat the disk and, if the
problem remains after a refill, replace it - but I think less
destructive actions can be tried first.
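
For reference, the kind of non-destructive checks I have in mind
before reformatting look roughly like this (/dev/sdX is a placeholder;
the helper only prints the plan, since smartctl needs root and a full
read scan takes hours):

```shell
# Sketch of pre-reformat disk checks; prints the commands instead of
# running them. /dev/sdX is a placeholder device name.
disk_check_plan() {
  dev=${1:-/dev/sdX}
  cat <<EOF
smartctl -A $dev                            # watch Current_Pending_Sector
smartctl -t long $dev                       # surface self-test; see -l selftest
dd if=$dev of=/dev/null bs=1M iflag=direct  # full read scan; errors hit dmesg
EOF
}
disk_check_plan /dev/sdX
```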

XFS is running on 3.10 with almost default create and mount options;
the Ceph version is the latest Cuttlefish (this rack should be
upgraded, I know).

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Fatigue for XFS
  2014-05-05 19:49 Fatigue for XFS Andrey Korolyov
@ 2014-05-05 20:36 ` Dave Chinner
  2014-05-05 20:59   ` Andrey Korolyov
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2014-05-05 20:36 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-users@lists.ceph.com, xfs@oss.sgi.com

On Mon, May 05, 2014 at 11:49:05PM +0400, Andrey Korolyov wrote:
> Hello,
> 
> We are currently investigating an issue which may be related either to
> Ceph itself or to XFS - any help is much appreciated.
> 
> First, the picture: a relatively old cluster with two years of uptime
> and ten months since the filesystem was recreated on every OSD. One of
> the daemons started to flap approximately once per day for a couple of
> weeks, with no external reason (bandwidth/IOPS/host issues). It looks
> almost the same every time - the OSD suddenly stops serving requests
> for a short period, gets kicked out by peer reports, then returns a
> couple of seconds later. Of course, a small but sensitive number of
> requests is delayed by 15-30 seconds twice, which is bad for us. The
> only thing which correlates with this kick is a peak of I/O - not too
> large, not even saturating the underlying disk, but unique in the
> cluster and clearly visible. There are also at least two occurrences
> *without* a correlated iowait peak.

So, actual numbers and traces are the only things that will tell us
what is happening during these events. See here:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

If it happens at almost the same time every day, then I'd be looking
at the crontabs to find what starts up at about that time. The output
of top will also probably tell you what process is running. iotop
might be instructive, and blktrace almost certainly will be....
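
A rough capture plan would look something like this - the device name,
output directory and the 60 second trace window are all placeholders,
and the helper just prints the commands, since blktrace needs root:

```shell
# Sketch: evidence to gather around the next flap. Prints the plan
# rather than running it; $dev and $out are placeholders.
flap_capture_plan() {
  dev=${1:-sda}; out=${2:-/tmp/osd-flap}
  cat <<EOF
mkdir -p $out
crontab -l > $out/crontab.txt          # anything scheduled at flap time?
iostat -x 1 > $out/iostat.txt &        # per-device utilisation and latency
vmstat 1 > $out/vmstat.txt &           # run queue / context switches
iotop -b -o -d 1 > $out/iotop.txt &    # which process issues the IO
blktrace -d /dev/$dev -w 60 -D $out    # 60s block trace during the event
EOF
}
flap_capture_plan sda /tmp/osd-flap
```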

> I have two theories: either we're touching some sector on disk which
> is about to be marked as dead but is not yet shown in the SMART statistics, or

Doubt it - SMART doesn't cause OS visible IO dispatch spikes.

> (as I believe) some kind of XFS fatigue, which seems more likely in
> this case, since a near-bad sector should be touched more frequently
> and, in my experience, the related impact would leave traces in dmesg/SMART.

I doubt that, too, because XFS doesn't have anything that is
triggered on a daily basis inside it. Maybe you've got xfs_fsr set
up on a cron job, though...

> I would like to ask if anyone has had a similar experience, or can
> suggest a way to poke the existing filesystem. If no suggestions
> appear, I'll probably reformat the disk and, if the problem remains
> after a refill, replace it - but I think less destructive actions can
> be tried first.

Yeah, monitoring and determining the process that is issuing the IO
is what you need to find first.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Fatigue for XFS
  2014-05-05 20:36 ` Dave Chinner
@ 2014-05-05 20:59   ` Andrey Korolyov
  2014-05-05 21:23     ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Andrey Korolyov @ 2014-05-05 20:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ceph-users@lists.ceph.com, xfs@oss.sgi.com

On Tue, May 6, 2014 at 12:36 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, May 05, 2014 at 11:49:05PM +0400, Andrey Korolyov wrote:
>> Hello,
>>
>> We are currently investigating an issue which may be related either
>> to Ceph itself or to XFS - any help is much appreciated.
>>
>> First, the picture: a relatively old cluster with two years of
>> uptime and ten months since the filesystem was recreated on every
>> OSD. One of the daemons started to flap approximately once per day
>> for a couple of weeks, with no external reason (bandwidth/IOPS/host
>> issues). It looks almost the same every time - the OSD suddenly
>> stops serving requests for a short period, gets kicked out by peer
>> reports, then returns a couple of seconds later. Of course, a small
>> but sensitive number of requests is delayed by 15-30 seconds twice,
>> which is bad for us. The only thing which correlates with this kick
>> is a peak of I/O - not too large, not even saturating the underlying
>> disk, but unique in the cluster and clearly visible. There are also
>> at least two occurrences *without* a correlated iowait peak.
>
> So, actual numbers and traces are the only things that will tell us
> what is happening during these events. See here:
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> If it happens at almost the same time every day, then I'd be looking
> at the crontabs to find what starts up at about that time. The output
> of top will also probably tell you what process is running. iotop
> might be instructive, and blktrace almost certainly will be....
>
>> I have two theories: either we're touching some sector on disk which
>> is about to be marked as dead but is not yet shown in the SMART statistics, or
>
> Doubt it - SMART doesn't cause OS visible IO dispatch spikes.
>
>> (as I believe) some kind of XFS fatigue, which seems more likely in
>> this case, since a near-bad sector should be touched more frequently
>> and, in my experience, the related impact would leave traces in dmesg/SMART.
>
> I doubt that, too, because XFS doesn't have anything that is
> triggered on a daily basis inside it. Maybe you've got xfs_fsr set
> up on a cron job, though...
>
>> I would like to ask if anyone has had a similar experience, or can
>> suggest a way to poke the existing filesystem. If no suggestions
>> appear, I'll probably reformat the disk and, if the problem remains
>> after a refill, replace it - but I think less destructive actions
>> can be tried first.
>
> Yeah, monitoring and determining the process that is issuing the IO
> is what you need to find first.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

Thanks Dave,

there is definitely no cron job set for a specific time (though most
of the lockups happened within a relatively small interval which
correlates with Ceph snapshot operations). In at least one case no
Ceph snapshot operations (including delayed removal) happened, and in
at least two cases no I/O peak was observed. We have observed and
eliminated weird lockups related to Open vSwitch behaviour before - we
combine storage and compute nodes, so quirks in the OVS datapath
caused very interesting and weird system-wide lockups on (supposedly)
a spinlock - and we saw 'pure' Ceph lockups on XFS with 3.4-3.7
kernels; all of those were correlated with a very high context-switch
peak.

The current issue seemingly has nothing to do with spinlock-like bugs
or a plain hardware problem. We even rebooted the problematic node to
check whether the memory allocator might be stuck at the border of a
specific NUMA node, with no result, though the first reappearance of
the bug was then delayed by some days. Disabling lazy allocation by
specifying allocsize did nothing either. It may look like I am
insisting that this is an XFS bug, whereas a Ceph bug is actually more
likely given its far more complicated logic and operational behaviour,
but the persistence on a specific node across relaunches of the Ceph
storage daemon suggests the bug is tied to an unlucky byte sequence
more than anything else. If it finally turns out to be a Ceph bug, it
will ruin our expectations built over two years of close experience
with this product, and if it is an XFS bug, we haven't seen anything
like it before, though we had a pretty good collection of XFS-related
lockups on earlier kernels.

So, my understanding is that we are hitting either a very rare
memory-allocator bug in the XFS case or an age-related Ceph issue -
both very unlikely to exist - but I cannot imagine anything else. If
it helps, I can collect a series of perf events during the next
occurrence, or exact iostat output (my graphs can only say that I/O
was not choked completely when the peak appeared).
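
For instance, I could keep a timestamped trail of iostat output with a
small helper like this, so the exact event window can be matched
against the moment the OSD was kicked out (the log path in the comment
is just an example):

```shell
# stamp: prefix each input line with a UTC timestamp, e.g.
#   iostat -x 1 | stamp >> /var/log/iostat-trail.txt
# so the trail can later be correlated with the OSD flap time.
stamp() {
  while IFS= read -r line; do
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$line"
  done
}
```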

* Re: Fatigue for XFS
  2014-05-05 20:59   ` Andrey Korolyov
@ 2014-05-05 21:23     ` Dave Chinner
  2014-05-29 12:04       ` Andrey Korolyov
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2014-05-05 21:23 UTC (permalink / raw)
  To: Andrey Korolyov; +Cc: ceph-users@lists.ceph.com, xfs@oss.sgi.com

On Tue, May 06, 2014 at 12:59:27AM +0400, Andrey Korolyov wrote:
> On Tue, May 6, 2014 at 12:36 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, May 05, 2014 at 11:49:05PM +0400, Andrey Korolyov wrote:
> >> Hello,
> >>
> >> We are currently investigating an issue which may be related either
> >> to Ceph itself or to XFS - any help is much appreciated.
> >>
> >> First, the picture: a relatively old cluster with two years of
> >> uptime and ten months since the filesystem was recreated on every
> >> OSD. One of the daemons started to flap approximately once per day
> >> for a couple of weeks, with no external reason (bandwidth/IOPS/host
> >> issues). It looks almost the same every time - the OSD suddenly
> >> stops serving requests for a short period, gets kicked out by peer
> >> reports, then returns a couple of seconds later. Of course, a small
> >> but sensitive number of requests is delayed by 15-30 seconds twice,
> >> which is bad for us. The only thing which correlates with this kick
> >> is a peak of I/O - not too large, not even saturating the underlying
> >> disk, but unique in the cluster and clearly visible. There are also
> >> at least two occurrences *without* a correlated iowait peak.
> >
> > So, actual numbers and traces are the only things that will tell us
> > what is happening during these events. See here:
> >
> > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> >
> > If it happens at almost the same time every day, then I'd be looking
> > at the crontabs to find what starts up at about that time. The
> > output of top will also probably tell you what process is running.
> > iotop might be instructive, and blktrace almost certainly will be....
> >
> >> I have two theories: either we're touching some sector on disk which
> >> is about to be marked as dead but is not yet shown in the SMART statistics, or
> >
> > Doubt it - SMART doesn't cause OS visible IO dispatch spikes.
> >
> >> (as I believe) some kind of XFS fatigue, which seems more likely in
> >> this case, since a near-bad sector should be touched more frequently
> >> and, in my experience, the related impact would leave traces in dmesg/SMART.
> >
> > I doubt that, too, because XFS doesn't have anything that is
> > triggered on a daily basis inside it. Maybe you've got xfs_fsr set
> > up on a cron job, though...
> >
> >> I would like to ask if anyone has had a similar experience, or can
> >> suggest a way to poke the existing filesystem. If no suggestions
> >> appear, I'll probably reformat the disk and, if the problem remains
> >> after a refill, replace it - but I think less destructive actions
> >> can be tried first.
> >
> > Yeah, monitoring and determining the process that is issuing the IO
> > is what you need to find first.
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> 
> Thanks Dave,
> 
> there is definitely no cron job set for a specific time (though most
> of the lockups happened within a relatively small interval which
> correlates with Ceph snapshot operations).

OK.

FWIW, Ceph snapshots on XFS may not be immediately costly in terms
of IO - they can be extremely costly after one is taken when the
files in the snapshot are next written to. If you are snapshotting
files that are currently being written to, then that's likely to
cause immediate IO issues...

> In at least one case no Ceph snapshot operations (including delayed
> removal) happened, and in at least two cases no I/O peak was observed.
> We have observed and eliminated weird lockups related to Open vSwitch
> behaviour before - we combine storage and compute nodes, so quirks in
> the OVS datapath caused very interesting and weird system-wide lockups
> on (supposedly) a spinlock - and we saw 'pure' Ceph lockups on XFS
> with 3.4-3.7 kernels; all of those were correlated with a very high
> context-switch peak.

Until we determine what is triggering the IO, the application isn't
really a concern.

> The current issue seemingly has nothing to do with spinlock-like bugs
> or a plain hardware problem. We even rebooted the problematic node to
> check whether the memory allocator might be stuck at the border of a
> specific NUMA node, with no result, though the first reappearance of
> the bug was then delayed by some days. Disabling lazy allocation by
> specifying allocsize did nothing either. It may look like I am
> insisting that this is an XFS bug, whereas a Ceph bug is actually more
> likely given its far more complicated logic and operational behaviour,
> but the persistence on a specific node across relaunches of the Ceph
> storage daemon suggests the bug is tied to an unlucky byte sequence
> more than anything else. If it finally turns out to be a Ceph bug, it
> will ruin our expectations built over two years of close experience
> with this product, and if it is an XFS bug, we haven't seen anything
> like it before, though we had a pretty good collection of XFS-related
> lockups on earlier kernels.

Long experience with triaging storage performance issues has taught
me to ignore what anyone *thinks* is the cause of the problem; I
rely on the data that is gathered to tell me what the problem is. I
find that hard data has a nasty habit of busting assumptions,
expectations, speculations and hypotheses... :)

> If it helps, I can collect a series of perf events during the next
> occurrence, or exact iostat output (my graphs can only say that I/O
> was not choked completely when the peak appeared).

Before delving into perf events, we need to know what we are looking
for. That's what things like iostat, vmstat, top, blktrace, etc will
tell us - where to point the microscope.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Fatigue for XFS
  2014-05-05 21:23     ` Dave Chinner
@ 2014-05-29 12:04       ` Andrey Korolyov
  0 siblings, 0 replies; 5+ messages in thread
From: Andrey Korolyov @ 2014-05-29 12:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: ceph-users@lists.ceph.com, xfs@oss.sgi.com

On 05/06/2014 01:23 AM, Dave Chinner wrote:
> On Tue, May 06, 2014 at 12:59:27AM +0400, Andrey Korolyov wrote:
>> On Tue, May 6, 2014 at 12:36 AM, Dave Chinner <david@fromorbit.com> wrote:
>>> On Mon, May 05, 2014 at 11:49:05PM +0400, Andrey Korolyov wrote:
>>>> Hello,
>>>>
>>>> We are currently investigating an issue which may be related
>>>> either to Ceph itself or to XFS - any help is much appreciated.
>>>>
>>>> First, the picture: a relatively old cluster with two years of
>>>> uptime and ten months since the filesystem was recreated on every
>>>> OSD. One of the daemons started to flap approximately once per day
>>>> for a couple of weeks, with no external reason (bandwidth/IOPS/host
>>>> issues). It looks almost the same every time - the OSD suddenly
>>>> stops serving requests for a short period, gets kicked out by peer
>>>> reports, then returns a couple of seconds later. Of course, a small
>>>> but sensitive number of requests is delayed by 15-30 seconds twice,
>>>> which is bad for us. The only thing which correlates with this kick
>>>> is a peak of I/O - not too large, not even saturating the
>>>> underlying disk, but unique in the cluster and clearly visible.
>>>> There are also at least two occurrences *without* a correlated
>>>> iowait peak.
>>>
>>> So, actual numbers and traces are the only things that will tell us
>>> what is happening during these events. See here:
>>>
>>> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>>>
>>> If it happens at almost the same time every day, then I'd be looking
>>> at the crontabs to find what starts up at about that time. The
>>> output of top will also probably tell you what process is running.
>>> iotop might be instructive, and blktrace almost certainly will be....
>>>
>>>> I have two theories: either we're touching some sector on disk which
>>>> is about to be marked as dead but is not yet shown in the SMART statistics, or
>>>
>>> Doubt it - SMART doesn't cause OS visible IO dispatch spikes.
>>>
>>>> (as I believe) some kind of XFS fatigue, which seems more likely in
>>>> this case, since a near-bad sector should be touched more frequently
>>>> and, in my experience, the related impact would leave traces in dmesg/SMART.
>>>
>>> I doubt that, too, because XFS doesn't have anything that is
>>> triggered on a daily basis inside it. Maybe you've got xfs_fsr set
>>> up on a cron job, though...
>>>
>>>> I would like to ask if anyone has had a similar experience, or can
>>>> suggest a way to poke the existing filesystem. If no suggestions
>>>> appear, I'll probably reformat the disk and, if the problem remains
>>>> after a refill, replace it - but I think less destructive actions
>>>> can be tried first.
>>>
>>> Yeah, monitoring and determining the process that is issuing the IO
>>> is what you need to find first.
>>>
>>> Cheers,
>>>
>>> Dave.
>>> --
>>> Dave Chinner
>>> david@fromorbit.com
>>
>> Thanks Dave,
>>
>> there is definitely no cron job set for a specific time (though most
>> of the lockups happened within a relatively small interval which
>> correlates with Ceph snapshot operations).
> 
> OK.
> 
> FWIW, Ceph snapshots on XFS may not be immediately costly in terms
> of IO - they can be extremely costly after one is taken when the
> files in the snapshot are next written to. If you are snapshotting
> files that are currently being written to, then that's likely to
> cause immediate IO issues...
> 
>> In at least one case no Ceph snapshot operations (including delayed
>> removal) happened, and in at least two cases no I/O peak was
>> observed. We have observed and eliminated weird lockups related to
>> Open vSwitch behaviour before - we combine storage and compute
>> nodes, so quirks in the OVS datapath caused very interesting and
>> weird system-wide lockups on (supposedly) a spinlock - and we saw
>> 'pure' Ceph lockups on XFS with 3.4-3.7 kernels; all of those were
>> correlated with a very high context-switch peak.
> 
> Until we determine what is triggering the IO, the application isn't
> really a concern.
> 
>> The current issue seemingly has nothing to do with spinlock-like
>> bugs or a plain hardware problem. We even rebooted the problematic
>> node to check whether the memory allocator might be stuck at the
>> border of a specific NUMA node, with no result, though the first
>> reappearance of the bug was then delayed by some days. Disabling
>> lazy allocation by specifying allocsize did nothing either. It may
>> look like I am insisting that this is an XFS bug, whereas a Ceph bug
>> is actually more likely given its far more complicated logic and
>> operational behaviour, but the persistence on a specific node across
>> relaunches of the Ceph storage daemon suggests the bug is tied to an
>> unlucky byte sequence more than anything else. If it finally turns
>> out to be a Ceph bug, it will ruin our expectations built over two
>> years of close experience with this product, and if it is an XFS
>> bug, we haven't seen anything like it before, though we had a pretty
>> good collection of XFS-related lockups on earlier kernels.
> 
> Long experience with triaging storage performance issues has taught
> me to ignore what anyone *thinks* is the cause of the problem; I
> rely on the data that is gathered to tell me what the problem is. I
> find that hard data has a nasty habit of busting assumptions,
> expectations, speculations and hypotheses... :)
> 
>> If it helps, I can collect a series of perf events during the next
>> occurrence, or exact iostat output (my graphs can only say that I/O
>> was not choked completely when the peak appeared).
> 
> Before delving into perf events, we need to know what we are looking
> for. That's what things like iostat, vmstat, top, blktrace, etc will
> tell us - where to point the microscope.
> 
> Cheers,
> 
> Dave.
> 

Thanks,

after a long and adventurous investigation we found that the effect
was most probably caused by the crossing tails of multiple background
snapshot deletions in Ceph, so this had nothing to do with XFS -
though the behaviour was very strange, and because of the very large
time intervals involved we were not able to see the correlation
between those events earlier. Background snapshot removal in Ceph
produces a kind of 'spike' at the end of the process, so if one
deletes a couple of snapshots each holding a similar amount of
committed bytes, their removals will fire their spikes almost
synchronously at the end, causing one or more OSD daemons to choke.
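
As a workaround we now space the deletions out, along these lines -
the snapshot names and the gap are of course site-specific, and the
real `rbd snap rm` call is commented out in this sketch:

```shell
# Sketch: remove snapshots one at a time with a pause between them, so
# the end-of-deletion spikes cannot line up. Snapshot names below are
# placeholders.
stagger_snap_rm() {
  gap=$1; shift
  for snap in "$@"; do
    echo "removing $snap"
    # rbd snap rm "$snap"   # the real call; commented out in the sketch
    sleep "$gap"
  done
}
# example: stagger_snap_rm 600 rbd/vm1@daily rbd/vm2@daily
```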

end of thread, other threads:[~2014-05-29 12:05 UTC | newest]
