* Can extremely high load cause disks to be kicked? @ 2012-05-31 8:31 Andy Smith 2012-06-01 1:31 ` Stan Hoeppner 2012-06-04 4:13 ` NeilBrown 0 siblings, 2 replies; 19+ messages in thread From: Andy Smith @ 2012-05-31 8:31 UTC (permalink / raw) To: linux-raid Hello, Last night a virtual machine on one of my servers was a victim of DDoS. Given that the machine is routing packets to the VM, the extremely high packets per second basically overwhelmed the CPU and caused a lot of "BUG: soft lockup - CPU#0 stuck for XXs!" spew in the logs. So far nothing unusual for that type of event. However, a few minutes in, I/O errors started to be generated which caused three of the four disks in the raid10 to be kicked. Here's an excerpt: May 30 18:24:49 blahblah kernel: [36534478.879311] BUG: soft lockup - CPU#0 stuck for 86s! [swapper:0] May 30 18:24:49 blahblah kernel: [36534478.879311] Modules linked in: bridge ipv6 ipt_LOG xt_limit ipt_REJECT ipt_ULOG xt_multiport xt_tcpudp iptable_filter ip_tables x_tables ext2 fuse loop pcspkr i2c_i801 i2c_core container button evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod raid10 raid1 md_mod ata_generic libata dock ide_pci_generic sd_mod it8213 ide_core e1000e mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal processor fan thermal_sys [last unloaded: scsi_wait_scan] May 30 18:24:49 blahblah kernel: [36534478.879311] CPU 0: May 30 18:24:49 blahblah kernel: [36534478.879311] Modules linked in: bridge ipv6 ipt_LOG xt_limit ipt_REJECT ipt_ULOG xt_multiport xt_tcpudp iptable_filter ip_tables x_tables ext2 fuse loop pcspkr i2c_i801 i2c_core container button evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod raid10 raid1 md_mod ata_generic libata dock ide_pci_generic sd_mod it8213 ide_core e1000e mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal processor fan thermal_sys [last unloaded: scsi_wait_scan] May 30 18:24:49 blahblah kernel: [36534478.879311] Pid: 0, comm: swapper Not tainted 2.6.26-2-xen-amd64 #1 May 30 18:24:49 blahblah kernel: [36534478.879311] RIP: e030:[<ffffffff802083aa>] [<ffffffff802083aa>] May 30 18:24:49 blahblah kernel: [36534478.879311] RSP: e02b:ffffffff80553f10 EFLAGS: 00000246 May 30 18:24:49 blahblah kernel: [36534478.879311] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff802083aa May 30 18:24:49 blahblah kernel: [36534478.879311] RDX: ffffffff80553f28 RSI: 0000000000000000 RDI: 0000000000000001 May 30 18:24:49 blahblah kernel: [36534478.879311] RBP: 0000000000631918 R08: ffffffff805cbc38 R09: ffff880001bc7ee0 May 30 18:24:49 blahblah kernel: [36534478.879311] R10: 0000000000631918 R11: 0000000000000246 R12: ffffffffffffffff May 30 18:24:49 blahblah kernel: [36534478.879311] R13: ffffffff8057c580 R14: ffffffff8057d1c0 R15: 0000000000000000 May 30 18:24:49 blahblah kernel: [36534478.879311] FS: 00007f65b193a6e0(0000) GS:ffffffff8053a000(0000) knlGS:0000000000000000 May 30 18:24:49 blahblah kernel: [36534478.879311] CS: e033 DS: 0000 ES: 0000 May 30 18:24:49 blahblah kernel: [36534478.879311] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 May 30 18:24:49 blahblah kernel: [36534478.879311] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 May 30 18:24:49 blahblah kernel: [36534478.879311] May 30 18:24:49 blahblah kernel: [36534478.879311] Call Trace: May 30 18:24:49 blahblah kernel: [36534478.879311] [<ffffffff8020e79d>] ? xen_safe_halt+0x90/0xa6 May 30 18:24:49 blahblah kernel: [36534478.879311] [<ffffffff8020a0ce>] ? 
xen_idle+0x2e/0x66 May 30 18:24:49 blahblah kernel: [36534478.879311] [<ffffffff80209d49>] ? cpu_idle+0x97/0xb9 May 30 18:24:49 blahblah kernel: [36534478.879311] May 30 18:24:59 blahblah kernel: [36534488.966594] mptscsih: ioc0: attempting task abort! (sc=ffff880039047480) May 30 18:24:59 blahblah kernel: [36534488.966810] sd 0:0:1:0: [sdb] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00 May 30 18:24:59 blahblah kernel: [36534488.967163] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880039047480) May 30 18:24:59 blahblah kernel: [36534488.970208] mptscsih: ioc0: attempting task abort! (sc=ffff8800348286c0) May 30 18:24:59 blahblah kernel: [36534488.970519] sd 0:0:2:0: [sdc] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00 May 30 18:24:59 blahblah kernel: [36534488.971033] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8800348286c0) May 30 18:24:59 blahblah kernel: [36534488.974146] mptscsih: ioc0: attempting target reset! (sc=ffff880039047e80) May 30 18:24:59 blahblah kernel: [36534488.974466] sd 0:0:0:0: [sda] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00 May 30 18:25:00 blahblah kernel: [36534489.490138] mptscsih: ioc0: target reset: SUCCESS (sc=ffff880039047e80) May 30 18:25:00 blahblah kernel: [36534489.493027] mptscsih: ioc0: attempting target reset! (sc=ffff880034828080) May 30 18:25:00 blahblah kernel: [36534489.493027] sd 0:0:3:0: [sdd] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00 May 30 18:25:00 blahblah kernel: [36534490.003961] mptscsih: ioc0: target reset: SUCCESS (sc=ffff880034828080) May 30 18:25:00 blahblah kernel: [36534490.010870] end_request: I/O error, dev sdd, sector 581022718 May 30 18:25:00 blahblah kernel: [36534490.010870] md: super_written gets error=-5, uptodate=0 May 30 18:25:00 blahblah kernel: [36534490.010870] raid10: Disk failure on sdd5, disabling device. May 30 18:25:00 blahblah kernel: [36534490.010870] raid10: Operation continuing on 3 devices. May 30 18:25:00 blahblah kernel: [36534490.016887] end_request: I/O error, dev sda, sector 581022718 May 30 18:25:00 blahblah kernel: [36534490.017058] md: super_written gets error=-5, uptodate=0 May 30 18:25:00 blahblah kernel: [36534490.017212] raid10: Disk failure on sda5, disabling device. May 30 18:25:00 blahblah kernel: [36534490.017213] raid10: Operation continuing on 2 devices. May 30 18:25:00 blahblah kernel: [36534490.017562] end_request: I/O error, dev sdb, sector 581022718 May 30 18:25:00 blahblah kernel: [36534490.017730] md: super_written gets error=-5, uptodate=0 May 30 18:25:00 blahblah kernel: [36534490.017884] raid10: Disk failure on sdb5, disabling device. May 30 18:25:00 blahblah kernel: [36534490.017885] raid10: Operation continuing on 1 devices. May 30 18:25:00 blahblah kernel: [36534490.021015] end_request: I/O error, dev sdc, sector 581022718 May 30 18:25:00 blahblah kernel: [36534490.021015] md: super_written gets error=-5, uptodate=0 At this point the host was extremely upset. sd[abcd]5 were in use in /dev/md3, but there were three other mdadm arrays using the same disks and they were okay, so I wasn't suspecting actual hardware failure as far as the disks went. I used --add to add the devices back into md3, but they were added as spares. I was stumped for a little while, then I decided to --stop md3 and --create it again with --assume-clean. I got the device order wrong the first few times but eventually I got there. I then triggered a 'repair' at sync_action, and once that had finished I started fscking things. 
There was a bit of corruption but on the whole it seems to have been survivable. Now, is this sort of behaviour expected when under incredible load? Or is it indicative of a bug somewhere in kernel, mpt driver, or even flaky SAS controller/disks? Controller: LSISAS1068E B3, FwRev=011a0000h Motherboard: Supermicro X7DCL-3 Disks: 4x SEAGATE ST9300603SS Version: 0006 While I'm familiar with the occasional big DDoS causing extreme CPU load, hung tasks, CPU soft lockups etc., I've never had it kick disks before. But I only have this one server with SAS and mdadm whereas all the others are SATA and 3ware with BBU. Root cause of failure aside, could I have made recovery easier? Was there a better way than --create --assume-clean? If I had done a --create with sdc5 (the device that stayed in the array) and the other device with the closest event count, plus two "missing", could I have expected less corruption when on 'repair'? Cheers, Andy ^ permalink raw reply [flat|nested] 19+ messages in thread
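The --stop / --create --assume-clean path described above corresponds roughly to the commands below, using the RAID10 geometry that appears later in the thread (64K chunk, 2 near-copies, device slot order sdd5/sda5/sdc5/sdb5). Getting the device order, chunk size or metadata version wrong scrambles data, so treat this as a sketch of the idea rather than a recipe:

    # Stop the broken array (its members had been turned into spares by --add)
    mdadm --stop /dev/md3

    # Re-create the array in place without resyncing; geometry and device order
    # are taken from the mdstat output quoted later in this thread
    # (metadata version must also match what the array was originally built with)
    mdadm --create /dev/md3 --assume-clean --level=10 --layout=n2 \
          --chunk=64 --raid-devices=4 /dev/sdd5 /dev/sda5 /dev/sdc5 /dev/sdb5

    # Make the mirror copies consistent again, then fsck the filesystems on top
    echo repair > /sys/block/md3/md/sync_action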
* Re: Can extremely high load cause disks to be kicked?
  2012-05-31 8:31 Can extremely high load cause disks to be kicked? Andy Smith
@ 2012-06-01 1:31 ` Stan Hoeppner
  2012-06-01 3:15 ` Igor M Podlesny
  2012-06-01 19:25 ` Andy Smith
  2012-06-04 4:13 ` NeilBrown
  1 sibling, 2 replies; 19+ messages in thread

From: Stan Hoeppner @ 2012-06-01 1:31 UTC (permalink / raw)
To: linux-raid

On 5/31/2012 3:31 AM, Andy Smith wrote:
> Now, is this sort of behaviour expected when under incredible load?
> Or is it indicative of a bug somewhere in kernel, mpt driver, or
> even flaky SAS controller/disks?

It is expected that people know what RAID is and how it is supposed to be used. RAID is to be used for protecting data in the event of a disk failure and secondarily to increase performance. That is not how you seem to be using RAID.

BTW, I can't fully discern from your log snippets...are you running md RAID inside of virtual machines or only on the host hypervisor? If the former, problems like this are expected and normal, which is why it is recommended to NEVER run md RAID inside a VM.

> Controller: LSISAS1068E B3, FwRev=011a0000h
> Motherboard: Supermicro X7DCL-3
> Disks: 4x SEAGATE ST9300603SS Version: 0006
>
> While I'm familiar with the occasional big DDoS causing extreme CPU
> load, hung tasks, CPU soft lockups etc., I've never had it kick
> disks before.

The md RAID driver didn't kick disks. It kicked partitions, as this is what you built your many arrays with.

> But I only have this one server with SAS and mdadm
> whereas all the others are SATA and 3ware with BBU.

Fancy that.

> Root cause of failure aside, could I have made recovery easier? Was
> there a better way than --create --assume-clean?
>
> If I had done a --create with sdc5 (the device that stayed in the
> array) and the other device with the closest event count, plus two
> "missing", could I have expected less corruption when on 'repair'?

You could probably expect it to be more reliable if you used RAID as it's meant to be used, which in this case would be a single RAID10 array using none, or only one partition per disk, instead of creating 4 or 5 different md RAID arrays from 4-5 partitions on each disk. This is simply silly, and it's dangerous if doing so inside VMs.

md RAID is not meant to be used as a thin provisioning tool, which is what you seem to have attempted here, and is almost certainly the root cause of your problem. I highly recommend creating a single md RAID array and using proper thin provisioning tools/methods.

Or slap a real RAID card into this host as you have the others. The LSI (3ware) cards allow the creation of multiple virtual drives per array, each being exposed as a different SCSI LUN in Linux, which provides simple and effective thin provisioning. This is much simpler and more reliable than doing it all with kernel drivers, daemons, and filesystem tricks (sparse files mounted as filesystems and the like).

There are a number of scenarios where md RAID is better than hardware RAID and vice versa. Yours is a case where hardware RAID is superior, as no matter the host CPU load, drives won't get kicked offline as a result, as they're under the control of a dedicated IO processor (same for SAN RAID).

-- 
Stan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-01 1:31 ` Stan Hoeppner @ 2012-06-01 3:15 ` Igor M Podlesny 2012-06-01 14:12 ` Stan Hoeppner 2012-06-01 19:25 ` Andy Smith 1 sibling, 1 reply; 19+ messages in thread From: Igor M Podlesny @ 2012-06-01 3:15 UTC (permalink / raw) To: stan; +Cc: linux-raid On 1 June 2012 09:31, Stan Hoeppner <stan@hardwarefreak.com> wrote: […] > You could probably expect it to be more reliable if you used RAID as > it's meant to be used, which in this case would be a single RAID10 array > using none, or only one partition per disk, instead of creating 4 or 5 > different md RAID arrays from 4-5 partitions on each disk. This is > simply silly, and it's dangerous if doing so inside VMs. — How do you know those RAIDs are inside VMs? -- -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-01 3:15 ` Igor M Podlesny @ 2012-06-01 14:12 ` Stan Hoeppner 2012-06-01 15:19 ` Igor M Podlesny 0 siblings, 1 reply; 19+ messages in thread From: Stan Hoeppner @ 2012-06-01 14:12 UTC (permalink / raw) To: Igor M Podlesny; +Cc: linux-raid On 5/31/2012 10:15 PM, Igor M Podlesny wrote: > On 1 June 2012 09:31, Stan Hoeppner <stan@hardwarefreak.com> wrote: > […] >> You could probably expect it to be more reliable if you used RAID as >> it's meant to be used, which in this case would be a single RAID10 array >> using none, or only one partition per disk, instead of creating 4 or 5 >> different md RAID arrays from 4-5 partitions on each disk. This is >> simply silly, and it's dangerous if doing so inside VMs. > > — How do you know those RAIDs are inside VMs? Those who speak English as a first language likely understood my use of "if". Had I used "when" instead, that would have implied certainty of knowledge. "If" conveys a possibility, a hypothetical. For the English challenged, maybe reversing the sentence is more comprehensible: "If doing so inside VMs it is dangerous." -- Stan -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-01 14:12 ` Stan Hoeppner @ 2012-06-01 15:19 ` Igor M Podlesny 2012-06-02 4:45 ` Stan Hoeppner 0 siblings, 1 reply; 19+ messages in thread From: Igor M Podlesny @ 2012-06-01 15:19 UTC (permalink / raw) To: stan; +Cc: linux-raid On 1 June 2012 22:12, Stan Hoeppner <stan@hardwarefreak.com> wrote: > On 5/31/2012 10:15 PM, Igor M Podlesny wrote: >> On 1 June 2012 09:31, Stan Hoeppner <stan@hardwarefreak.com> wrote: >> […] >>> You could probably expect it to be more reliable if you used RAID as >>> it's meant to be used, which in this case would be a single RAID10 array >>> using none, or only one partition per disk, instead of creating 4 or 5 >>> different md RAID arrays from 4-5 partitions on each disk. This is >>> simply silly, and it's dangerous if doing so inside VMs. >> >> — How do you know those RAIDs are inside VMs? > > Those who speak English as a first language likely understood my use of So you don't. Well, lemme remind some words of wisdom to you: "Assumption is the mother of all f*ckups". (Feel free to reverse it as you like), Stan. -- -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-01 15:19 ` Igor M Podlesny @ 2012-06-02 4:45 ` Stan Hoeppner 2012-06-02 7:57 ` Igor M Podlesny 0 siblings, 1 reply; 19+ messages in thread From: Stan Hoeppner @ 2012-06-02 4:45 UTC (permalink / raw) To: Igor M Podlesny; +Cc: linux-raid On 6/1/2012 10:19 AM, Igor M Podlesny wrote: > On 1 June 2012 22:12, Stan Hoeppner <stan@hardwarefreak.com> wrote: >> On 5/31/2012 10:15 PM, Igor M Podlesny wrote: >>> On 1 June 2012 09:31, Stan Hoeppner <stan@hardwarefreak.com> wrote: >>> […] >>>> You could probably expect it to be more reliable if you used RAID as >>>> it's meant to be used, which in this case would be a single RAID10 array >>>> using none, or only one partition per disk, instead of creating 4 or 5 >>>> different md RAID arrays from 4-5 partitions on each disk. This is >>>> simply silly, and it's dangerous if doing so inside VMs. >>> >>> — How do you know those RAIDs are inside VMs? >> >> Those who speak English as a first language likely understood my use of > > So you don't. Well, lemme remind some words of wisdom to you: > "Assumption is the mother of all f*ckups". (Feel free to reverse it as > you like), Stan. Igor, you simply misunderstood what I stated. I explained what likely caused you to misunderstand. I wasn't trying to insult you, or anyone else who is not a native English speaker. There's no reason for animosity over a simple misinterpretation of language syntax. Be proud that you speak and write/read English very well. I, on the other hand, cannot read/write/speak any Cyrillic language, though I can recognize the characters of the alphabet as Cyrillic. In fact, I don't know any languages other than English, but for some basic phrases in Spanish. If I had to call for a doctor in anything other than English I'd be in big trouble. You're a better linguist than me, speaking at least two languages. You simply made a minor error in this case. Can we be friends and move on, or at least not be enemies? Regards, -- Stan -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-02 4:45 ` Stan Hoeppner @ 2012-06-02 7:57 ` Igor M Podlesny 2012-06-02 9:16 ` Stan Hoeppner 0 siblings, 1 reply; 19+ messages in thread From: Igor M Podlesny @ 2012-06-02 7:57 UTC (permalink / raw) To: stan; +Cc: linux-raid On 2 June 2012 12:45, Stan Hoeppner <stan@hardwarefreak.com> wrote: > On 6/1/2012 10:19 AM, Igor M Podlesny wrote: […] > simply made a minor error in this case. Can we be friends and move on, > or at least not be enemies? > Surely we can! I can only add though that had you answered OP's question not so harsh, it would be easier for me (and for others as well, I guess) to recognize good intentions beneath. -- -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-02 7:57 ` Igor M Podlesny @ 2012-06-02 9:16 ` Stan Hoeppner 0 siblings, 0 replies; 19+ messages in thread From: Stan Hoeppner @ 2012-06-02 9:16 UTC (permalink / raw) To: Igor M Podlesny; +Cc: linux-raid On 6/2/2012 2:57 AM, Igor M Podlesny wrote: > On 2 June 2012 12:45, Stan Hoeppner <stan@hardwarefreak.com> wrote: >> On 6/1/2012 10:19 AM, Igor M Podlesny wrote: > […] >> simply made a minor error in this case. Can we be friends and move on, >> or at least not be enemies? >> > Surely we can! > > I can only add though that had you answered OP's question not so > harsh, it would be easier for me (and for others as well, I guess) to > recognize good intentions beneath. I tend to be very direct, very blunt. That's just me. I'll never be worth a damn as a politician, priest, or a funeral director, that's for sure. ;) And if I didn't have good intentions, I wouldn't be contributing on this list and a half dozen others. ;) -- Stan -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-01 1:31 ` Stan Hoeppner 2012-06-01 3:15 ` Igor M Podlesny @ 2012-06-01 19:25 ` Andy Smith 2012-06-02 5:47 ` Stan Hoeppner 1 sibling, 1 reply; 19+ messages in thread From: Andy Smith @ 2012-06-01 19:25 UTC (permalink / raw) To: linux-raid Hi Stan, On Thu, May 31, 2012 at 08:31:49PM -0500, Stan Hoeppner wrote: > On 5/31/2012 3:31 AM, Andy Smith wrote: > > Now, is this sort of behaviour expected when under incredible load? > > Or is it indicative of a bug somewhere in kernel, mpt driver, or > > even flaky SAS controller/disks? > > It is expected that people know what RAID is and how it is supposed to > be used. RAID is to be used for protecting data in the event of a disk > failure and secondarily to increase performance. That is not how you > seem to be using RAID. Just to clarify, this was the hypervisor host. The VMs on it don't use RAID themselves as that would indeed be silly. > There are a number of scenarios where md RAID is better than hardware > RAID and vice versa. Yours is a case where hardware RAID is superior, > as no matter the host CPU load, drives won't get kicked offline as a > result, as they're under the control of a dedicated IO processor (same > for SAN RAID). Fair enough, thanks. Cheers, Andy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-01 19:25 ` Andy Smith @ 2012-06-02 5:47 ` Stan Hoeppner 2012-06-03 3:30 ` Andy Smith 0 siblings, 1 reply; 19+ messages in thread From: Stan Hoeppner @ 2012-06-02 5:47 UTC (permalink / raw) To: linux-raid On 6/1/2012 2:25 PM, Andy Smith wrote: > Hi Stan, > > On Thu, May 31, 2012 at 08:31:49PM -0500, Stan Hoeppner wrote: >> On 5/31/2012 3:31 AM, Andy Smith wrote: >>> Now, is this sort of behaviour expected when under incredible load? >>> Or is it indicative of a bug somewhere in kernel, mpt driver, or >>> even flaky SAS controller/disks? >> >> It is expected that people know what RAID is and how it is supposed to >> be used. RAID is to be used for protecting data in the event of a disk >> failure and secondarily to increase performance. That is not how you >> seem to be using RAID. > > Just to clarify, this was the hypervisor host. The VMs on it don't > use RAID themselves as that would indeed be silly. Cool. I only mentioned this as I've seen it in the wild more than once. >> There are a number of scenarios where md RAID is better than hardware >> RAID and vice versa. Yours is a case where hardware RAID is superior, >> as no matter the host CPU load, drives won't get kicked offline as a >> result, as they're under the control of a dedicated IO processor (same >> for SAN RAID). > > Fair enough, thanks. You could still use md RAID in your scenario. But instead of having multiple md arrays built of disk partitions and passing each array up to a VM guest, the proper way to do this thin provisioning is to create one md array and then create partitions on top. Then pass a partition to a guest. This method eliminates many potential problems with your current setup, such as elevator behavior causing excessive head seeks on the drives. This is even more critical if some of your md arrays are parity (5/6). You mentioned a single RAID 10 array. If you're indeed running multiple arrays of multiple RAID levels (parity and non parity) on the 5 partitions on each disk, and each VM is doing even medium to small amount of IO concurrently, you'll be head thrashing pretty quickly. Then, when you have a DDOS and lots of log writes/etc, you'll be instantly seek bound, and likely start seeing SCSI timeouts, as the drive head actuators simply can't move quickly enough to satisfy requests before the timeout period. This may be what caused your RAID 10 partitions to be kicked. Not enough info to verify at this point. The same situation can occur on a single OS bare metal host when the storage system isn't designed to handle the IOPS load. Consider a maildir mailbox server with an average load of 2000 random R/W IOPS. The _minimum_ you could get by with here would be 16x 15k disks in RAID 10. 8 * 300 seeks/s/drive = 2400 IOPS peak actuator performance. If we were to put that workload on a RAID 10 array of only 4x 15k drives we'd have 2 * 300 seeks/s/drive = 600 IOPS peak, less than 1/3rd the actual load. I've never tried this so I don't know if you'd have md dropping drives due to SCSI timeouts, but you'd sure have serious problems nonetheless. -- Stan ^ permalink raw reply [flat|nested] 19+ messages in thread
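A minimal sketch of the single-array-then-partition layout described above, assuming a partitionable md device (or LVM on top, as md3 already uses) and placeholder sizes:

    # One RAID10 across a single large partition per disk
    mdadm --create /dev/md0 --level=10 --layout=n2 --chunk=64 \
          --raid-devices=4 /dev/sd[abcd]1

    # Carve the array itself into slices for the hypervisor and guests,
    # instead of building a separate md array per slice
    parted /dev/md0 mklabel gpt
    parted /dev/md0 mkpart boot 1MiB 513MiB
    parted /dev/md0 mkpart root 513MiB 10GiB
    parted /dev/md0 mkpart guests 10GiB 100%
    # the slices then appear as /dev/md0p1, /dev/md0p2, ...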
* Re: Can extremely high load cause disks to be kicked? 2012-06-02 5:47 ` Stan Hoeppner @ 2012-06-03 3:30 ` Andy Smith 2012-06-03 4:05 ` Igor M Podlesny 2012-06-03 6:49 ` Stan Hoeppner 0 siblings, 2 replies; 19+ messages in thread From: Andy Smith @ 2012-06-03 3:30 UTC (permalink / raw) To: linux-raid Hi Stan, On Sat, Jun 02, 2012 at 12:47:00AM -0500, Stan Hoeppner wrote: > You could still use md RAID in your scenario. But instead of having > multiple md arrays built of disk partitions and passing each array up to > a VM guest, the proper way to do this thin provisioning is to create one > md array and then create partitions on top. Then pass a partition to a > guest. I probably didn't give enough details. On this particular host there are four md arrays: md0 is mounted as /boot md1 is used as swap md2 is mounted as / md3 is an LVM PV for "everything else" Some further filesystems on the hypervisor host come from LVs in md3 (/usr, /var and so on). VM guests get their block devices from LVs in md3. But we can ignore the presence of VMs for now since my concern is only with the hypervisor host itself at this point. Even though it was a guest that was attacked, the hypervisor still had to route the traffic through its userspace and its CPU got overwhelmed by the high packets-per-second. You made a point that multiple mdadm arrays should not be used, though in this situation I can't see how that would have helped me; would I not just have got I/O errors on the single md device that everything was running from, causing instant crash? Although an instant crash might have been preferable from the point of view of enabling less corruption, I suppose. > The same situation can occur on a single OS bare metal host when the > storage system isn't designed to handle the IOPS load. Unfortunately my monitoring of IOPS for this host cut out during the attack and later problems so all I have for that period is a blank graph, but I don't think the IOPS requirement would actually have been that high. The CPU was certainly overwhelmed, but my main concern is that I am never going to be able to design a system that will cope with routing DDoS traffic in userspace. I am OK with the hypervisor machine being completely hammered and keeling over until the traffic is blocked upstream on real routers. Not so happy about the hypervisor machine kicking devices out of arrays and ending up with corrupted filesystems though. I haven't experienced that before. I don't think md is to blame as such because the logs show I/O errors on all devices so no wonder it kicked them out. I was just wondering if it was down to bad hardware, bad hardware _choice_, a bug in the driver, some bug in the kernel or what. Also still wondering if what I did to recover was the best way to go or if I could have made it easier on myself. Cheers, Andy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-03 3:30 ` Andy Smith @ 2012-06-03 4:05 ` Igor M Podlesny 2012-06-03 22:05 ` Andy Smith 2012-06-03 6:49 ` Stan Hoeppner 1 sibling, 1 reply; 19+ messages in thread From: Igor M Podlesny @ 2012-06-03 4:05 UTC (permalink / raw) To: linux-raid On 3 June 2012 11:30, Andy Smith <andy@strugglers.net> wrote: > I probably didn't give enough details. On this particular host there > are four md arrays: > > md0 is mounted as /boot > md1 is used as swap > md2 is mounted as / > md3 is an LVM PV for "everything else" […] Kinda off-topic question — why having separate MDs for swap and root? I mean LVM's lv would be just fine for them. -- -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-03 4:05 ` Igor M Podlesny @ 2012-06-03 22:05 ` Andy Smith 2012-06-04 1:55 ` Stan Hoeppner 0 siblings, 1 reply; 19+ messages in thread From: Andy Smith @ 2012-06-03 22:05 UTC (permalink / raw) To: linux-raid Hi Igor, On Sun, Jun 03, 2012 at 12:05:31PM +0800, Igor M Podlesny wrote: > On 3 June 2012 11:30, Andy Smith <andy@strugglers.net> wrote: > > I probably didn't give enough details. On this particular host there > > are four md arrays: > > > > md0 is mounted as /boot > > md1 is used as swap > > md2 is mounted as / > > md3 is an LVM PV for "everything else" > […] > > Kinda off-topic question — why having separate MDs for swap and > root? I mean LVM's lv would be just fine for them. I know, but I'm kind of old-fashioned and have some (probably unwarranted) distrust of having the really essential bits of the system in LVM if they don't really need to be. The system needs swap and I'm unlikely to want to change the size of that swap, but if I do I can just add more swap from inside LVM. / is also unlikely to need to be resized ever, so I keep that outside of LVM too. My position on this is probably going to have to change given the current "get rid of /usr" plans that seem to be gaining traction in the Linux community. A beefier / would probably justify being in LVM to me. But really it is all just personal taste isn't it.. Cheers, Andy -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked?
  2012-06-03 22:05 ` Andy Smith
@ 2012-06-04 1:55 ` Stan Hoeppner
  2012-06-04 9:06 ` Igor M Podlesny
  0 siblings, 1 reply; 19+ messages in thread

From: Stan Hoeppner @ 2012-06-04 1:55 UTC (permalink / raw)
To: linux-raid

On 6/3/2012 5:05 PM, Andy Smith wrote:
> Hi Igor,
>
> On Sun, Jun 03, 2012 at 12:05:31PM +0800, Igor M Podlesny wrote:
>> On 3 June 2012 11:30, Andy Smith <andy@strugglers.net> wrote:
>>> I probably didn't give enough details. On this particular host there
>>> are four md arrays:
>>>
>>> md0 is mounted as /boot
>>> md1 is used as swap
>>> md2 is mounted as /
>>> md3 is an LVM PV for "everything else"
>> […]
>>
>> Kinda off-topic question — why having separate MDs for swap and
>> root? I mean LVM's lv would be just fine for them.
>
> I know, but I'm kind of old-fashioned and have some (probably
> unwarranted) distrust of having the really essential bits of the
> system in LVM if they don't really need to be.

I wouldn't say unwarranted. I've seen instances on this and other lists where an md array blew chunks, was repaired and everything reported good, only to have LVM devices corrupted beyond repair. Or both md and LVM report good after repair, but the XFS superblock is hosed beyond repair. The fewer layers in the storage stack, the better. KISS principle, "Just because you can doesn't mean you should", etc, etc.

> The system needs swap and I'm unlikely to want to change the size of
> that swap, but if I do I can just add more swap from inside LVM. / is
> also unlikely to need to be resized ever, so I keep that outside of
> LVM too.
>
> My position on this is probably going to have to change given the
> current "get rid of /usr" plans that seem to be gaining traction in
> the Linux community. A beefier / would probably justify being in
> LVM to me.

I guess I'm more old fashioned. I'd never put the root filesystem on an LVM volume. My preferences are directly on a single disk partition, or hardware RAID1 partition, depending on the server/application area.

> But really it is all just personal taste isn't it..

To an extent yes. Separating different portions of the UNIX filesystem tree on different partitions/devices has a long tradition and usually good reasoning behind it. However, as I mentioned, spreading these across multiple md arrays that share the same physical disks is not a good idea. A single md array with multiple partitions, with the filesystem tree, swap, and data directories spread over these partitions, is preferable.

-- 
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-04 1:55 ` Stan Hoeppner @ 2012-06-04 9:06 ` Igor M Podlesny 0 siblings, 0 replies; 19+ messages in thread From: Igor M Podlesny @ 2012-06-04 9:06 UTC (permalink / raw) To: stan; +Cc: linux-raid On 4 June 2012 09:55, Stan Hoeppner <stan@hardwarefreak.com> wrote: > On 6/3/2012 5:05 PM, Andy Smith wrote: […] > >> Kinda off-topic question — why having separate MDs for swap and > >> root? I mean LVM's lv would be just fine for them. > > > > I know, but I'm kind of old-fashioned and have some (probably > > unwarranted) distrust of having the really essential bits of the > > system in LVM if they don't really need to be. > > I wouldn't say unwarranted. I've seen instances on this and other lists > where an md array blew chunks, was repaired and everything reported > good, only to have LVM devices corrupted beyond repair. Or both md and Sounds like urban myths to me. LVM's dev-mapper here is nothing more as merely block look-up/translation table. -- -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-03 3:30 ` Andy Smith 2012-06-03 4:05 ` Igor M Podlesny @ 2012-06-03 6:49 ` Stan Hoeppner 2012-06-04 0:02 ` Andy Smith 1 sibling, 1 reply; 19+ messages in thread From: Stan Hoeppner @ 2012-06-03 6:49 UTC (permalink / raw) To: linux-raid On 6/2/2012 10:30 PM, Andy Smith wrote: > Hi Stan, Hay Andy, > On Sat, Jun 02, 2012 at 12:47:00AM -0500, Stan Hoeppner wrote: >> You could still use md RAID in your scenario. But instead of having >> multiple md arrays built of disk partitions and passing each array up to >> a VM guest, the proper way to do this thin provisioning is to create one >> md array and then create partitions on top. Then pass a partition to a >> guest. > > I probably didn't give enough details. On this particular host there > are four md arrays: Yeah, I made some incorrect assumptions about how you were using your arrays. > md0 is mounted as /boot > md1 is used as swap > md2 is mounted as / What's the RAID level of each of these and how many partitions in each? > md3 is an LVM PV for "everything else" This one is the RAID 10 correct? > Some further filesystems on the hypervisor host come from LVs in md3 > (/usr, /var and so on). > > VM guests get their block devices from LVs in md3. But we can ignore > the presence of VMs for now since my concern is only with the > hypervisor host itself at this point. > > Even though it was a guest that was attacked, the hypervisor still > had to route the traffic through its userspace and its CPU got > overwhelmed by the high packets-per-second. How many cores in this machine and what frequency? > You made a point that multiple mdadm arrays should not be used, > though in this situation I can't see how that would have helped me; As I mentioned, the disks would have likely been seeking less, but I can't say for sure with the available data. If the disks were seek saturated this could explain why md3 component partitions were kicked. > would I not just have got I/O errors on the single md device that > everything was running from, causing instant crash? Can you post the SCSI/ATA and md errors? > Although an instant crash might have been preferable from the point of > view of enabling less corruption, I suppose. > >> The same situation can occur on a single OS bare metal host when the >> storage system isn't designed to handle the IOPS load. > > Unfortunately my monitoring of IOPS for this host cut out during the > attack and later problems so all I have for that period is a blank > graph, but I don't think the IOPS requirement would actually have > been that high. What was actually being written to md3 during this attack? Just logging, or something else? What was the exact nature of the DDOS attack? What service was it targeting? I assume this wasn't simply a ping flood. > The CPU was certainly overwhelmed, but my main > concern is that I am never going to be able to design a system that > will cope with routing DDoS traffic in userspace. Assuming the network data rate of the attack was less than 1000 Mb/s, most any machine with two or more 2GHz+ cores and sufficient RAM should easily be able to handle this type of thing without falling over. Can you provide system hardware details? > I am OK with the hypervisor machine being completely hammered and > keeling over until the traffic is blocked upstream on real routers. > Not so happy about the hypervisor machine kicking devices out of > arrays and ending up with corrupted filesystems though. I haven't > experienced that before. 
Well, you also hadn't experienced an attack like this before. Correct? > I don't think md is to blame as such because the logs show I/O > errors on all devices so no wonder it kicked them out. I was just > wondering if it was down to bad hardware, bad hardware _choice_, a > bug in the driver, some bug in the kernel or what. This is impossible to determine with the information provided so far. md isn't to blame, but the way you're using it may have played a role in the partitions being kicked. Consider this. If the hypervisor was writing heavily to logs, and if the hypervisor went into heavy swap during the attack, and the partitions in md3 that were kicked reside on disks where the swap array and/or / arrays exist, this would tend to bolster my theory regarding seek starvation causing the timeouts and kicks. If this isn't the case, and you simply had high load hammering md3 and the underlying disks, with little IO on the other arrays, then the problem likely lay elsewhere, possibly with the disks themselves, the HBA, or even the system board. Or, if this is truly a single CPU/core machine, the core was pegged, and the hypervisor kernel scheduler wasn't giving enough time to md threads, this may also explain the timeouts, though with any remotely recent kernel this 'shouldn't' happen under load. If the problem was indeed simply a hammered single core CPU, I'd suggest swapping it for a multi-core model. This will eliminate the possibility of md threads starving for cycles, though I wouldn't think this alone would cause 30 second BIO timeouts, and thus devices being kicked. It's far more likely the drives were seek saturated, which would explain the 30 second BIO timeouts. > Also still wondering if what I did to recover was the best way to go > or if I could have made it easier on myself. I don't really have any input on this aspect, except to say that if you got all your data recovered that's the important part. If you spent twice as long as you needed to I wouldn't sweat that at all. I'd put all my concentration on the root cause analysis. -- Stan ^ permalink raw reply [flat|nested] 19+ messages in thread
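If this happens again, one way to gather the missing evidence for or against the seek-saturation theory, assuming the sysstat tools are installed, is to capture per-device latency and utilisation during the event:

    # Extended per-device stats, 1-second samples: await climbing toward the
    # driver's command timeout and %util pinned near 100 on all four members
    # would point at seek saturation rather than a controller fault
    iostat -x 1

    # md's view of the arrays at the same moment
    cat /proc/mdstat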
* Re: Can extremely high load cause disks to be kicked? 2012-06-03 6:49 ` Stan Hoeppner @ 2012-06-04 0:02 ` Andy Smith 2012-06-04 6:58 ` Stan Hoeppner 0 siblings, 1 reply; 19+ messages in thread From: Andy Smith @ 2012-06-04 0:02 UTC (permalink / raw) To: linux-raid Hi Stan, Thanks for your detailed reply. On Sun, Jun 03, 2012 at 01:49:23AM -0500, Stan Hoeppner wrote: > On 6/2/2012 10:30 PM, Andy Smith wrote: > > md0 is mounted as /boot > > md1 is used as swap > > md2 is mounted as / > > What's the RAID level of each of these and how many partitions in each? $ cat /proc/mdstat Personalities : [raid1] [raid10] md3 : active raid10 sdd5[0] sdb5[3] sdc5[2] sda5[1] 581022592 blocks 64K chunks 2 near-copies [4/4] [UUUU] md2 : active raid10 sdd3[0] sdb3[3] sdc3[2] sda3[1] 1959680 blocks 64K chunks 2 near-copies [4/4] [UUUU] md1 : active raid10 sdd2[0] sdb2[3] sdc2[2] sda2[1] 1959680 blocks 64K chunks 2 near-copies [4/4] [UUUU] md0 : active raid1 sdd1[0] sdb1[3] sdc1[2] sda1[1] 489856 blocks [4/4] [UUUU] unused devices: <none> > > Even though it was a guest that was attacked, the hypervisor still > > had to route the traffic through its userspace and its CPU got > > overwhelmed by the high packets-per-second. > > How many cores in this machine and what frequency? It is a single quad core Xeon L5420 @ 2.50GHz. > > would I not just have got I/O errors on the single md device that > > everything was running from, causing instant crash? > > Can you post the SCSI/ATA and md errors? Sure. I posted an excerpt in the first email but here is a fuller example: Actually it's pretty big so I've put it at http://paste.ubuntu.com/1022219/ At this point the logs stop because /var is an LV out of md3. I'm moving remote syslog servers further up my priority list... If there hadn't been a DDoS attack at the exact same time then I'd have considered this purely hardware failure due to the way that "mptscsih: ioc0: attempting task abort! (sc=ffff8800352d80c0)" is the absolute first thing of interest. But the timing is too coincidental. It's also been fine since, including a quite IO-intensive backup job and yesterday's "first Sunday of the month" sync_action. > > Unfortunately my monitoring of IOPS for this host cut out during the > > attack and later problems so all I have for that period is a blank > > graph, but I don't think the IOPS requirement would actually have > > been that high. > > What was actually being written to md3 during this attack? Just > logging, or something else? All the VMs would have been doing their normal writing of course, but on the hypervisor host /usr and /var come from md3. From the logs, the main thing it seems to be having problems with is dm-1 which is the /usr LV. > What was the exact nature of the DDOS attack? What service was it > targeting? I assume this wasn't simply a ping flood. It was a UDP short packet (78 bytes) multiple source single destination flood, ~300Mbit/s but the killer was the ~600kpps. Less than 10Mbit/s made it through to the VM it was targeting. > > The CPU was certainly overwhelmed, but my main concern is that I > > am never going to be able to design a system that will cope with > > routing DDoS traffic in userspace. > > Assuming the network data rate of the attack was less than 1000 Mb/s, > most any machine with two or more 2GHz+ cores and sufficient RAM should > easily be able to handle this type of thing without falling over. 
I really don't think it is easy to spec a decent VM host that can also route hundreds of thousands of packets per sec to guests, without a large budget. I am OK with the host giving up, I just don't want it to corrupt its storage. I mean I'm sure it can be done, but the budget probably doesn't allow it and temporary problems for all VMs on the host are acceptable in this case; filesystem corruption isn't. > Can you provide system hardware details? My original email: > > Controller: LSISAS1068E B3, FwRev=011a0000h > > Motherboard: Supermicro X7DCL-3 > > Disks: 4x SEAGATE ST9300603SS Version: 0006 Network: e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k2 The hypervisor itself has access to only 1GB RAM (the rest dedicated to guest VMs) which may be rather low; I could look at boosting that. The other thing is that the hypervisor and all VM guests share the same four CPU cores. It may be prudent to dedicate one CPU core to the hypervisor and then let the guests share the other three. Any other hardware details that might be relevant? > > I am OK with the hypervisor machine being completely hammered and > > keeling over until the traffic is blocked upstream on real routers. > > Not so happy about the hypervisor machine kicking devices out of > > arrays and ending up with corrupted filesystems though. I haven't > > experienced that before. > > Well, you also hadn't experienced an attack like this before. Correct? No, they happen from time to time and I find that a pretty big one will cripple the host but I haven't yet seen that cause storage problems. But as I say, all the other servers are 3ware+BBU. To be honest I do find I get a better price/performance spot out of 3ware setup; this host represented an experiment with 10kRPM SAS drives and software RAID a few years ago and whilst the performance is decent, the much higher cost per GB of the SAS drives ultimately makes this uneconomical for me, as I do have to provide a certain storage capacity as well. So I haven't gone with md RAID for this use case for a couple of years now and am unlikely to do so in the future anyway. I do still need to work out what to do with this particular server though. > Consider this. If the hypervisor was writing heavily to logs, and if > the hypervisor went into heavy swap during the attack, and the > partitions in md3 that were kicked reside on disks where the swap array > and/or / arrays exist, this would tend to bolster my theory regarding > seek starvation causing the timeouts and kicks. I have a feeling it was trying to swap a lot from the repeated mentions of "swapper" in the logs. The swap partition is md1 which doesn't feature in any of the logs, but perhaps what is being logged there is the swapper kernel thread being unable to do anything because of extreme CPU starvation. > Or, if this is truly a single CPU/core machine, the core was pegged, and > the hypervisor kernel scheduler wasn't giving enough time to md threads, > this may also explain the timeouts, though with any remotely recent > kernel this 'shouldn't' happen under load. Admittedly this is an old Debian lenny server running 2.6.26-2-xen kernel. Pushing ahead the timescale of clearing VMs off of it and doing an upgrade would probably be a good idea. Although it isn't exactly the hardware setup I would like it's been a pretty good machine for a few years now so I would rather not junk it, if I can reassure myself that this won't happen again. 
> > Also still wondering if what I did to recover was the best way to go > > or if I could have made it easier on myself. > > I don't really have any input on this aspect, except to say that if you > got all your data recovered that's the important part. If you spent > twice as long as you needed to I wouldn't sweat that at all. I'd put > all my concentration on the root cause analysis. Sure, though this was a completely new experience for me so if anyone has any tips for better recovery then that will help should I ever face anything like it again. I do use md RAID in quite a few places (just not much for this use). Notably, the server was only recoverable because I had backups of /usr and /var. There were some essential files corrupted but being able to get them from backups saved having to do a rebuild. Actually corruption of VM block devices was quite minimal -- many of them just had to replay the journal and clean up a handful of orphaned inodes. Apparently a couple of them did lose a small number of files but these VMs are all run by different admins and I don't have great insight into it. Anything I could have done to lessen that would have been useful. They're meant to have backups but meh, human nature... Cheers, Andy ^ permalink raw reply [flat|nested] 19+ messages in thread
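For reference, the core-dedication idea mentioned above maps onto Xen roughly as follows; the option names should be checked against the Xen release actually in use, so this is an assumption-laden sketch:

    Xen hypervisor boot line (the xen.gz entry, not the dom0 kernel line):
        dom0_max_vcpus=1 dom0_vcpus_pin

    Per-guest config, to keep guests off the core dom0 is pinned to:
        cpus = "1-3"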
* Re: Can extremely high load cause disks to be kicked? 2012-06-04 0:02 ` Andy Smith @ 2012-06-04 6:58 ` Stan Hoeppner 0 siblings, 0 replies; 19+ messages in thread From: Stan Hoeppner @ 2012-06-04 6:58 UTC (permalink / raw) To: linux-raid On 6/3/2012 7:02 PM, Andy Smith wrote: > Hi Stan, > > Thanks for your detailed reply. Welcome. More detailed reply and question follow. > $ cat /proc/mdstat > Personalities : [raid1] [raid10] > md3 : active raid10 sdd5[0] sdb5[3] sdc5[2] sda5[1] > 581022592 blocks 64K chunks 2 near-copies [4/4] [UUUU] > > md2 : active raid10 sdd3[0] sdb3[3] sdc3[2] sda3[1] > 1959680 blocks 64K chunks 2 near-copies [4/4] [UUUU] > > md1 : active raid10 sdd2[0] sdb2[3] sdc2[2] sda2[1] > 1959680 blocks 64K chunks 2 near-copies [4/4] [UUUU] > > md0 : active raid1 sdd1[0] sdb1[3] sdc1[2] sda1[1] > 489856 blocks [4/4] [UUUU] > > unused devices: <none> You laid it out nicely, but again, not good practice to have this many md arrays on the same disks. > It is a single quad core Xeon L5420 @ 2.50GHz. Ok, now that's interesting. > Actually it's pretty big so I've put it at > http://paste.ubuntu.com/1022219/ Ok, found the source of the mptscsi problem, not hardware failure. Note you're using: Pid: 0, comm: swapper Not tainted 2.6.26-2-xen-amd64 #1 2.6.26 is so old its bones are turning to dust. Note this Red Hat bug, and the fact this is an upstream LSI driver problem: https://bugzilla.redhat.com/show_bug.cgi?id=483424 This driver issue is well over 3 years old and has since been fixed. It exists in isolation of your DDOS attack. And it is probably not the cause of the partitions being kicked. Most likely cause is CPU soft lockup for 60, 90, 120 seconds. To fix both of these, upgrade to a kernel that's not rotting in the grave, something in the 3.x.x stable series is best, and the most recent version of Xen. Better yet, switch to KVM as it's in mainline and better supported. > At this point the logs stop because /var is an LV out of md3. I'm > moving remote syslog servers further up my priority list... Yes, good idea. Even in absence of this mess. > If there hadn't been a DDoS attack at the exact same time then > I'd have considered this purely hardware failure due to the way that > "mptscsih: ioc0: attempting task abort! (sc=ffff8800352d80c0)" is > the absolute first thing of interest. But the timing is too > coincidental. Not hardware failure, note above LSI driver bug. > It's also been fine since, including a quite IO-intensive backup job > and yesterday's "first Sunday of the month" sync_action. Check your logs for these mptscsi errors for the past month. You should see some even with zero load on the system, according to the bug report. > All the VMs would have been doing their normal writing of course, > but on the hypervisor host /usr and /var come from md3. From the I wouldn't have laid it out like that. I'm sure you're thinking the same thing about now. > logs, the main thing it seems to be having problems with is dm-1 > which is the /usr LV. What was the hypervisor kernel writing to /usr during the attack? Seems to me it should have been writing nothing there, but just to /var/log. >> What was the exact nature of the DDOS attack? What service was it >> targeting? I assume this wasn't simply a ping flood. > > It was a UDP short packet (78 bytes) multiple source single > destination flood, ~300Mbit/s but the killer was the ~600kpps. Less > than 10Mbit/s made it through to the VM it was targeting. What, precisely, was it targeting? 
What service was listening on the UDP port that was attacked? You still haven't named the UDP port. Anything important? Can you simply disable that service? If it's only used legitimately by local hosts, drop packets originating outside the local network.

>>> The CPU was certainly overwhelmed, but my main concern is that I
>>> am never going to be able to design a system that will cope with
>>> routing DDoS traffic in userspace.

Why would you want to? Kill it before it reaches that user space.

> I really don't think it is easy to spec a decent VM host that can
> also route hundreds of thousands of packets per sec to guests,
> without a large budget.

Again, why would you even attempt this? Kill it upstream before it reaches the physical network interface.

> I am OK with the host giving up, I just
> don't want it to corrupt its storage.

The solution, again, isn't beefing up the VM host machine, but hardening the kernel against DDOS, and more importantly, getting the upstream FW squared away so it drops all these DDOS packets on the floor.

> I mean I'm sure it can be done, but the budget probably doesn't
> allow it and temporary problems for all VMs on the host are
> acceptable in this case; filesystem corruption isn't.

There are many articles easily found via Google explaining how to harden a Linux machine against DDOS attacks. None of them say anything about beefing up the hardware in an effort to ABSORB the attack. What you need is a better upstream firewall, or an intelligent firewall running in the hypervisor kernel, or both.

> The other thing is that the hypervisor and all VM guests share the
> same four CPU cores. It may be prudent to dedicate one CPU core to
> the hypervisor and then let the guests share the other three.

If you do that the host may simply fail more quickly, but still with damage.

> No, they happen from time to time and I find that a pretty big one
> will cripple the host but I haven't yet seen that cause storage
> problems. But as I say, all the other servers are 3ware+BBU.

Why are DDOS attacks frequent there? Are you a VPS provider?

> To be honest I do find I get a better price/performance spot out of
> 3ware setup; this host represented an experiment with 10kRPM SAS
> drives and software RAID a few years ago and whilst the performance
> is decent, the much higher cost per GB of the SAS drives ultimately
> makes this uneconomical for me, as I do have to provide a certain
> storage capacity as well.

But for your current investment in HBA RAID boards, and with limited knowledge of your operation, it's sounding like you may be a good candidate for a low cost midrange iSCSI SAN array. Use a single 20GB Intel SSD to boot the hypervisor and serve swap, etc, then mount iSCSI LUNs for each VM with the hypervisor software initiator. Yields much more flexibility in storage provisioning than HBA RAID w/local disks.

You should be able to acquire a single controller Nexsan SATABoy w/2GB BBWC, dual GbE iSCSI ports (and dual 4Gb FC ports), and 14x 2TB 7.2k SATA drives for less than $15k. Using RAID6 easily gets you near line rate throughput (370MB/s), and gives you 24TB of space to slice into 256 virtual drives. You'd want to do multipathing, so you would do up to 254 virtual drives, exporting each one as a LUN on each iSCSI port. You can size each virtual drive per the needs of the VM. Creating and deleting virtual drives is a few mouse clicks in the excellent web gui. This is the cheapest array in Nexsan's lineup.
The next step up the Nexsan ladder gets you a single controller E18 with 18x 3TB drives in 2U for a little over $20K. A single RAID6 yields 48TB to slice up. The E18 allows expansion with a 60 drive 4U enclosure, up to 234TB total. You're probably looking at the $50K neighborhood for the expansion chassis loaded w/3TB drives. > So I haven't gone with md RAID for this use case for a couple of > years now and am unlikely to do so in the future anyway. I do still > need to work out what to do with this particular server though. Slap in a 3ware 9750-4i and write cache battery and redeploy. http://www.newegg.com/Product/Product.aspx?Item=N82E16816116109 http://www.newegg.com/Product/Product.aspx?Item=N82E16816118118 W/512MB BBWC and those 10K SAS drives it will perform very well, better than md and the current HBA WRT writes. > I have a feeling it was trying to swap a lot from the repeated > mentions of "swapper" in the logs. The swap partition is md1 which > doesn't feature in any of the logs, but perhaps what is being logged > there is the swapper kernel thread being unable to do anything > because of extreme CPU starvation. Makes sense. > Admittedly this is an old Debian lenny server running 2.6.26-2-xen > kernel. Pushing ahead the timescale of clearing VMs off of it and > doing an upgrade would probably be a good idea. Heheh, yeah, as I mentioned here way up above. This Lenny kernel was afflicted by the aforementioned LSI bug. And there were plenty of other issues with 2.6.26. > Although it isn't exactly the hardware setup I would like it's been > a pretty good machine for a few years now so I would rather not junk > it, if I can reassure myself that this won't happen again. You will probably suffer again if DDOSed. And as I mentioned the solution has little to do with (lack of) machine horsepower. A modern kernel will help out considerably. But, you probably still need to configure it against DDOS. Lots of articles/guides available @Google search. Without knowing all the technicals of your attack, I can't say for sure what kernel options would help. A relatively simple solution you may want to consider is implementing a daemon to monitor interface traffic in the hypervisor. If/when the packet/bit rate clearly shows signs of [D]DOS, run a script that executes ifdown to take that interface offline and email you an alert through the management interface (I assume you have both). You may want to automatically bring the interface back up after 5 minutes or so. If DDOS traffic still exists, down the interface again. Rinse repeat until the storm passes. Your VMs will be non functional during an attack anyway so there's no harm in downing the interface. And doing so should save your bacon. > Sure, though this was a completely new experience for me so if > anyone has any tips for better recovery then that will help should I > ever face anything like it again. I do use md RAID in quite a few > places (just not much for this use). > > Notably, the server was only recoverable because I had backups of > /usr and /var. There were some essential files corrupted but being > able to get them from backups saved having to do a rebuild. This is refreshing. Too many times folks post that they MUST get their md array or XFS filesystem back because they have no backup. Neil, and some of the XFS devs, save the day sometimes, but just as often md and XFS superblocks are completely corrupted or overwritten with zeros, making recovery impossible, and very unhappy users. 
> Actually corruption of VM block devices was quite minimal -- many of > them just had to replay the journal and clean up a handful of > orphaned inodes. Apparently a couple of them did lose a small number > of files but these VMs are all run by different admins and I don't > have great insight into it. I take it no two partitions in the same mirror pair were kicked, downing the array? I don't recall that you stated you actually lost the md3 array. Sorry if you did. Long thread, lots of words to get lost. > Anything I could have done to lessen that would have been useful. > They're meant to have backups but meh, human nature... Well, yea, but it still sucks to do restores, and backups aren't always bulletproof. At this point I think your best bet is automated ifdown/ifup. It's simple, effective, should be somewhat cheap to implement. It can all be done in a daemonized perl script. Just don't ask me to write it. ;) If I had the skill I'd have already done it. Hope I've been of at least some help to you Andy, given you some decent ideas, if not solid solutions. I must say this is the most unique troubleshooting thread I can recall on linux-raid. -- Stan ^ permalink raw reply [flat|nested] 19+ messages in thread
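A minimal sketch of the monitor described above, assuming a Debian-style ifdown/ifup and a guest-facing interface named eth0 (both placeholders), could be a simple shell loop over the kernel's packet counters rather than a full perl daemon:

    #!/bin/sh
    # Down the guest-facing interface when inbound pps exceeds a threshold,
    # bring it back a few minutes later; repeat until the storm passes.
    IF=eth0            # guest-facing interface (placeholder)
    LIMIT=200000       # packets per second treated as an attack (placeholder)

    while true; do
        a=$(cat /sys/class/net/$IF/statistics/rx_packets)
        sleep 1
        b=$(cat /sys/class/net/$IF/statistics/rx_packets)
        pps=$((b - a))
        if [ "$pps" -gt "$LIMIT" ]; then
            logger "DDoS suspected on $IF (${pps} pps), downing interface"
            ifdown "$IF"
            sleep 300
            ifup "$IF"
        fi
    done

Alerting via the management interface could be bolted onto the logger line; the thresholds and hold-down time are the tuning knobs.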
* Re: Can extremely high load cause disks to be kicked? 2012-05-31 8:31 Can extremely high load cause disks to be kicked? Andy Smith 2012-06-01 1:31 ` Stan Hoeppner @ 2012-06-04 4:13 ` NeilBrown 1 sibling, 0 replies; 19+ messages in thread From: NeilBrown @ 2012-06-04 4:13 UTC (permalink / raw) To: Andy Smith; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 1229 bytes --] On Thu, 31 May 2012 08:31:58 +0000 Andy Smith <andy@strugglers.net> wrote: > Now, is this sort of behaviour expected when under incredible load? > Or is it indicative of a bug somewhere in kernel, mpt driver, or > even flaky SAS controller/disks? This sort of high load would not affect md, except to slow it down. My guess is that the real bug is in the mpt driver, but as I know nothing about the mpt driver, you should treat that guess with a few kilos of NaCl. > > Root cause of failure aside, could I have made recovery easier? Was > there a better way than --create --assume-clean? The mis-step was to try to add the devices back to the array. A newer mdadm would refuse to let you do this because of the destructive effect. The correct step would have been to stop the array and re-assemble it, with --force. Once you had turned the devices to spares with --add, --create --assume-clean was the correct fix. > > If I had done a --create with sdc5 (the device that stayed in the > array) and the other device with the closest event count, plus two > "missing", could I have expected less corruption when on 'repair'? Possibly. You certainly wouldn't expect more. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
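With the failed members left untouched (no --add), the assemble-with-force path described here would have been roughly the following, using the device names from earlier in the thread (sketch only):

    mdadm --stop /dev/md3
    mdadm --assemble --force /dev/md3 /dev/sda5 /dev/sdb5 /dev/sdc5 /dev/sdd5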