* Can extremely high load cause disks to be kicked? @ 2012-05-31 8:31 Andy Smith 2012-06-01 1:31 ` Stan Hoeppner 2012-06-04 4:13 ` NeilBrown 0 siblings, 2 replies; 19+ messages in thread From: Andy Smith @ 2012-05-31 8:31 UTC (permalink / raw) To: linux-raid Hello, Last night a virtual machine on one of my servers was a victim of DDoS. Given that the machine is routing packets to the VM, the extremely high packets per second basically overwhelmed the CPU and caused a lot of "BUG: soft lockup - CPU#0 stuck for XXs!" spew in the logs. So far nothing unusual for that type of event. However, a few minutes in, I/O errors started to be generated which caused three of the four disks in the raid10 to be kicked. Here's an excerpt: May 30 18:24:49 blahblah kernel: [36534478.879311] BUG: soft lockup - CPU#0 stuck for 86s! [swapper:0] May 30 18:24:49 blahblah kernel: [36534478.879311] Modules linked in: bridge ipv6 ipt_LOG xt_limit ipt_REJECT ipt_ULOG xt_multiport xt_tcpudp iptable_filter ip_tables x_tables ext2 fuse loop pcspkr i2c_i801 i2c_core container button evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod raid10 raid1 md_mod ata_generic libata dock ide_pci_generic sd_mod it8213 ide_core e1000e mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal processor fan thermal_sys [last unloaded: scsi_wait_scan] May 30 18:24:49 blahblah kernel: [36534478.879311] CPU 0: May 30 18:24:49 blahblah kernel: [36534478.879311] Modules linked in: bridge ipv6 ipt_LOG xt_limit ipt_REJECT ipt_ULOG xt_multiport xt_tcpudp iptable_filter ip_tables x_tables ext2 fuse loop pcspkr i2c_i801 i2c_core container button evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod raid10 raid1 md_mod ata_generic libata dock ide_pci_generic sd_mod it8213 ide_core e1000e mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal processor fan thermal_sys [last unloaded: scsi_wait_scan] May 30 18:24:49 blahblah kernel: [36534478.879311] Pid: 0, comm: swapper Not tainted 2.6.26-2-xen-amd64 #1 May 30 18:24:49 blahblah kernel: [36534478.879311] RIP: e030:[<ffffffff802083aa>] [<ffffffff802083aa>] May 30 18:24:49 blahblah kernel: [36534478.879311] RSP: e02b:ffffffff80553f10 EFLAGS: 00000246 May 30 18:24:49 blahblah kernel: [36534478.879311] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff802083aa May 30 18:24:49 blahblah kernel: [36534478.879311] RDX: ffffffff80553f28 RSI: 0000000000000000 RDI: 0000000000000001 May 30 18:24:49 blahblah kernel: [36534478.879311] RBP: 0000000000631918 R08: ffffffff805cbc38 R09: ffff880001bc7ee0 May 30 18:24:49 blahblah kernel: [36534478.879311] R10: 0000000000631918 R11: 0000000000000246 R12: ffffffffffffffff May 30 18:24:49 blahblah kernel: [36534478.879311] R13: ffffffff8057c580 R14: ffffffff8057d1c0 R15: 0000000000000000 May 30 18:24:49 blahblah kernel: [36534478.879311] FS: 00007f65b193a6e0(0000) GS:ffffffff8053a000(0000) knlGS:0000000000000000 May 30 18:24:49 blahblah kernel: [36534478.879311] CS: e033 DS: 0000 ES: 0000 May 30 18:24:49 blahblah kernel: [36534478.879311] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 May 30 18:24:49 blahblah kernel: [36534478.879311] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 May 30 18:24:49 blahblah kernel: [36534478.879311] May 30 18:24:49 blahblah kernel: [36534478.879311] Call Trace: May 30 18:24:49 blahblah kernel: [36534478.879311] [<ffffffff8020e79d>] ? xen_safe_halt+0x90/0xa6 May 30 18:24:49 blahblah kernel: [36534478.879311] [<ffffffff8020a0ce>] ? 
xen_idle+0x2e/0x66 May 30 18:24:49 blahblah kernel: [36534478.879311] [<ffffffff80209d49>] ? cpu_idle+0x97/0xb9 May 30 18:24:49 blahblah kernel: [36534478.879311] May 30 18:24:59 blahblah kernel: [36534488.966594] mptscsih: ioc0: attempting task abort! (sc=ffff880039047480) May 30 18:24:59 blahblah kernel: [36534488.966810] sd 0:0:1:0: [sdb] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00 May 30 18:24:59 blahblah kernel: [36534488.967163] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880039047480) May 30 18:24:59 blahblah kernel: [36534488.970208] mptscsih: ioc0: attempting task abort! (sc=ffff8800348286c0) May 30 18:24:59 blahblah kernel: [36534488.970519] sd 0:0:2:0: [sdc] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00 May 30 18:24:59 blahblah kernel: [36534488.971033] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8800348286c0) May 30 18:24:59 blahblah kernel: [36534488.974146] mptscsih: ioc0: attempting target reset! (sc=ffff880039047e80) May 30 18:24:59 blahblah kernel: [36534488.974466] sd 0:0:0:0: [sda] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00 May 30 18:25:00 blahblah kernel: [36534489.490138] mptscsih: ioc0: target reset: SUCCESS (sc=ffff880039047e80) May 30 18:25:00 blahblah kernel: [36534489.493027] mptscsih: ioc0: attempting target reset! (sc=ffff880034828080) May 30 18:25:00 blahblah kernel: [36534489.493027] sd 0:0:3:0: [sdd] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00 May 30 18:25:00 blahblah kernel: [36534490.003961] mptscsih: ioc0: target reset: SUCCESS (sc=ffff880034828080) May 30 18:25:00 blahblah kernel: [36534490.010870] end_request: I/O error, dev sdd, sector 581022718 May 30 18:25:00 blahblah kernel: [36534490.010870] md: super_written gets error=-5, uptodate=0 May 30 18:25:00 blahblah kernel: [36534490.010870] raid10: Disk failure on sdd5, disabling device. May 30 18:25:00 blahblah kernel: [36534490.010870] raid10: Operation continuing on 3 devices. May 30 18:25:00 blahblah kernel: [36534490.016887] end_request: I/O error, dev sda, sector 581022718 May 30 18:25:00 blahblah kernel: [36534490.017058] md: super_written gets error=-5, uptodate=0 May 30 18:25:00 blahblah kernel: [36534490.017212] raid10: Disk failure on sda5, disabling device. May 30 18:25:00 blahblah kernel: [36534490.017213] raid10: Operation continuing on 2 devices. May 30 18:25:00 blahblah kernel: [36534490.017562] end_request: I/O error, dev sdb, sector 581022718 May 30 18:25:00 blahblah kernel: [36534490.017730] md: super_written gets error=-5, uptodate=0 May 30 18:25:00 blahblah kernel: [36534490.017884] raid10: Disk failure on sdb5, disabling device. May 30 18:25:00 blahblah kernel: [36534490.017885] raid10: Operation continuing on 1 devices. May 30 18:25:00 blahblah kernel: [36534490.021015] end_request: I/O error, dev sdc, sector 581022718 May 30 18:25:00 blahblah kernel: [36534490.021015] md: super_written gets error=-5, uptodate=0 At this point the host was extremely upset. sd[abcd]5 were in use in /dev/md3, but there were three other mdadm arrays using the same disks and they were okay, so I wasn't suspecting actual hardware failure as far as the disks went. I used --add to add the devices back into md3, but they were added as spares. I was stumped for a little while, then I decided to --stop md3 and --create it again with --assume-clean. I got the device order wrong the first few times but eventually I got there. I then triggered a 'repair' at sync_action, and once that had finished I started fscking things. 
There was a bit of corruption but on the whole it seems to have been survivable. Now, is this sort of behaviour expected when under incredible load? Or is it indicative of a bug somewhere in kernel, mpt driver, or even flaky SAS controller/disks? Controller: LSISAS1068E B3, FwRev=011a0000h Motherboard: Supermicro X7DCL-3 Disks: 4x SEAGATE ST9300603SS Version: 0006 While I'm familiar with the occasional big DDoS causing extreme CPU load, hung tasks, CPU soft lockups etc., I've never had it kick disks before. But I only have this one server with SAS and mdadm whereas all the others are SATA and 3ware with BBU. Root cause of failure aside, could I have made recovery easier? Was there a better way than --create --assume-clean? If I had done a --create with sdc5 (the device that stayed in the array) and the other device with the closest event count, plus two "missing", could I have expected less corruption when on 'repair'? Cheers, Andy ^ permalink raw reply [flat|nested] 19+ messages in thread
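The --stop / --create --assume-clean path described above corresponds roughly to the commands below, using the RAID10 geometry that appears later in the thread (64K chunk, 2 near-copies, device slot order sdd5/sda5/sdc5/sdb5). Getting the device order, chunk size or metadata version wrong scrambles data, so treat this as a sketch of the idea rather than a recipe:

    # Stop the broken array (its members had been turned into spares by --add)
    mdadm --stop /dev/md3

    # Re-create the array in place without resyncing; geometry and device order
    # are taken from the mdstat output quoted later in this thread
    # (metadata version must also match what the array was originally built with)
    mdadm --create /dev/md3 --assume-clean --level=10 --layout=n2 \
          --chunk=64 --raid-devices=4 /dev/sdd5 /dev/sda5 /dev/sdc5 /dev/sdb5

    # Make the mirror copies consistent again, then fsck the filesystems on top
    echo repair > /sys/block/md3/md/sync_action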
* Re: Can extremely high load cause disks to be kicked?
  2012-05-31 8:31 Can extremely high load cause disks to be kicked? Andy Smith
@ 2012-06-01 1:31 ` Stan Hoeppner
  2012-06-01 3:15 ` Igor M Podlesny
  2012-06-01 19:25 ` Andy Smith
  2012-06-04 4:13 ` NeilBrown
  1 sibling, 2 replies; 19+ messages in thread

From: Stan Hoeppner @ 2012-06-01 1:31 UTC (permalink / raw)
To: linux-raid

On 5/31/2012 3:31 AM, Andy Smith wrote:
> Now, is this sort of behaviour expected when under incredible load?
> Or is it indicative of a bug somewhere in kernel, mpt driver, or
> even flaky SAS controller/disks?

It is expected that people know what RAID is and how it is supposed to be used. RAID is to be used for protecting data in the event of a disk failure and secondarily to increase performance. That is not how you seem to be using RAID.

BTW, I can't fully discern from your log snippets...are you running md RAID inside of virtual machines or only on the host hypervisor? If the former, problems like this are expected and normal, which is why it is recommended to NEVER run md RAID inside a VM.

> Controller: LSISAS1068E B3, FwRev=011a0000h
> Motherboard: Supermicro X7DCL-3
> Disks: 4x SEAGATE ST9300603SS Version: 0006
>
> While I'm familiar with the occasional big DDoS causing extreme CPU
> load, hung tasks, CPU soft lockups etc., I've never had it kick
> disks before.

The md RAID driver didn't kick disks. It kicked partitions, as this is what you built your many arrays with.

> But I only have this one server with SAS and mdadm
> whereas all the others are SATA and 3ware with BBU.

Fancy that.

> Root cause of failure aside, could I have made recovery easier? Was
> there a better way than --create --assume-clean?
>
> If I had done a --create with sdc5 (the device that stayed in the
> array) and the other device with the closest event count, plus two
> "missing", could I have expected less corruption when on 'repair'?

You could probably expect it to be more reliable if you used RAID as it's meant to be used, which in this case would be a single RAID10 array using none, or only one partition per disk, instead of creating 4 or 5 different md RAID arrays from 4-5 partitions on each disk. This is simply silly, and it's dangerous if doing so inside VMs.

md RAID is not meant to be used as a thin provisioning tool, which is what you seem to have attempted here, and is almost certainly the root cause of your problem. I highly recommend creating a single md RAID array and using proper thin provisioning tools/methods.

Or slap a real RAID card into this host as you have the others. The LSI (3ware) cards allow the creation of multiple virtual drives per array, each being exposed as a different SCSI LUN in Linux, which provides simple and effective thin provisioning. This is much simpler and more reliable than doing it all with kernel drivers, daemons, and filesystem tricks (sparse files mounted as filesystems and the like).

There are a number of scenarios where md RAID is better than hardware RAID and vice versa. Yours is a case where hardware RAID is superior, as no matter the host CPU load, drives won't get kicked offline as a result, as they're under the control of a dedicated IO processor (same for SAN RAID).

-- 
Stan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-01 1:31 ` Stan Hoeppner @ 2012-06-01 3:15 ` Igor M Podlesny 2012-06-01 14:12 ` Stan Hoeppner 2012-06-01 19:25 ` Andy Smith 1 sibling, 1 reply; 19+ messages in thread From: Igor M Podlesny @ 2012-06-01 3:15 UTC (permalink / raw) To: stan; +Cc: linux-raid On 1 June 2012 09:31, Stan Hoeppner <stan@hardwarefreak.com> wrote: […] > You could probably expect it to be more reliable if you used RAID as > it's meant to be used, which in this case would be a single RAID10 array > using none, or only one partition per disk, instead of creating 4 or 5 > different md RAID arrays from 4-5 partitions on each disk. This is > simply silly, and it's dangerous if doing so inside VMs. — How do you know those RAIDs are inside VMs? -- -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-01 3:15 ` Igor M Podlesny @ 2012-06-01 14:12 ` Stan Hoeppner 2012-06-01 15:19 ` Igor M Podlesny 0 siblings, 1 reply; 19+ messages in thread From: Stan Hoeppner @ 2012-06-01 14:12 UTC (permalink / raw) To: Igor M Podlesny; +Cc: linux-raid On 5/31/2012 10:15 PM, Igor M Podlesny wrote: > On 1 June 2012 09:31, Stan Hoeppner <stan@hardwarefreak.com> wrote: > […] >> You could probably expect it to be more reliable if you used RAID as >> it's meant to be used, which in this case would be a single RAID10 array >> using none, or only one partition per disk, instead of creating 4 or 5 >> different md RAID arrays from 4-5 partitions on each disk. This is >> simply silly, and it's dangerous if doing so inside VMs. > > — How do you know those RAIDs are inside VMs? Those who speak English as a first language likely understood my use of "if". Had I used "when" instead, that would have implied certainty of knowledge. "If" conveys a possibility, a hypothetical. For the English challenged, maybe reversing the sentence is more comprehensible: "If doing so inside VMs it is dangerous." -- Stan -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-01 14:12 ` Stan Hoeppner @ 2012-06-01 15:19 ` Igor M Podlesny 2012-06-02 4:45 ` Stan Hoeppner 0 siblings, 1 reply; 19+ messages in thread From: Igor M Podlesny @ 2012-06-01 15:19 UTC (permalink / raw) To: stan; +Cc: linux-raid On 1 June 2012 22:12, Stan Hoeppner <stan@hardwarefreak.com> wrote: > On 5/31/2012 10:15 PM, Igor M Podlesny wrote: >> On 1 June 2012 09:31, Stan Hoeppner <stan@hardwarefreak.com> wrote: >> […] >>> You could probably expect it to be more reliable if you used RAID as >>> it's meant to be used, which in this case would be a single RAID10 array >>> using none, or only one partition per disk, instead of creating 4 or 5 >>> different md RAID arrays from 4-5 partitions on each disk. This is >>> simply silly, and it's dangerous if doing so inside VMs. >> >> — How do you know those RAIDs are inside VMs? > > Those who speak English as a first language likely understood my use of So you don't. Well, lemme remind some words of wisdom to you: "Assumption is the mother of all f*ckups". (Feel free to reverse it as you like), Stan. -- -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-01 15:19 ` Igor M Podlesny @ 2012-06-02 4:45 ` Stan Hoeppner 2012-06-02 7:57 ` Igor M Podlesny 0 siblings, 1 reply; 19+ messages in thread From: Stan Hoeppner @ 2012-06-02 4:45 UTC (permalink / raw) To: Igor M Podlesny; +Cc: linux-raid On 6/1/2012 10:19 AM, Igor M Podlesny wrote: > On 1 June 2012 22:12, Stan Hoeppner <stan@hardwarefreak.com> wrote: >> On 5/31/2012 10:15 PM, Igor M Podlesny wrote: >>> On 1 June 2012 09:31, Stan Hoeppner <stan@hardwarefreak.com> wrote: >>> […] >>>> You could probably expect it to be more reliable if you used RAID as >>>> it's meant to be used, which in this case would be a single RAID10 array >>>> using none, or only one partition per disk, instead of creating 4 or 5 >>>> different md RAID arrays from 4-5 partitions on each disk. This is >>>> simply silly, and it's dangerous if doing so inside VMs. >>> >>> — How do you know those RAIDs are inside VMs? >> >> Those who speak English as a first language likely understood my use of > > So you don't. Well, lemme remind some words of wisdom to you: > "Assumption is the mother of all f*ckups". (Feel free to reverse it as > you like), Stan. Igor, you simply misunderstood what I stated. I explained what likely caused you to misunderstand. I wasn't trying to insult you, or anyone else who is not a native English speaker. There's no reason for animosity over a simple misinterpretation of language syntax. Be proud that you speak and write/read English very well. I, on the other hand, cannot read/write/speak any Cyrillic language, though I can recognize the characters of the alphabet as Cyrillic. In fact, I don't know any languages other than English, but for some basic phrases in Spanish. If I had to call for a doctor in anything other than English I'd be in big trouble. You're a better linguist than me, speaking at least two languages. You simply made a minor error in this case. Can we be friends and move on, or at least not be enemies? Regards, -- Stan -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-02 4:45 ` Stan Hoeppner @ 2012-06-02 7:57 ` Igor M Podlesny 2012-06-02 9:16 ` Stan Hoeppner 0 siblings, 1 reply; 19+ messages in thread From: Igor M Podlesny @ 2012-06-02 7:57 UTC (permalink / raw) To: stan; +Cc: linux-raid On 2 June 2012 12:45, Stan Hoeppner <stan@hardwarefreak.com> wrote: > On 6/1/2012 10:19 AM, Igor M Podlesny wrote: […] > simply made a minor error in this case. Can we be friends and move on, > or at least not be enemies? > Surely we can! I can only add though that had you answered OP's question not so harsh, it would be easier for me (and for others as well, I guess) to recognize good intentions beneath. -- -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-02 7:57 ` Igor M Podlesny @ 2012-06-02 9:16 ` Stan Hoeppner 0 siblings, 0 replies; 19+ messages in thread From: Stan Hoeppner @ 2012-06-02 9:16 UTC (permalink / raw) To: Igor M Podlesny; +Cc: linux-raid On 6/2/2012 2:57 AM, Igor M Podlesny wrote: > On 2 June 2012 12:45, Stan Hoeppner <stan@hardwarefreak.com> wrote: >> On 6/1/2012 10:19 AM, Igor M Podlesny wrote: > […] >> simply made a minor error in this case. Can we be friends and move on, >> or at least not be enemies? >> > Surely we can! > > I can only add though that had you answered OP's question not so > harsh, it would be easier for me (and for others as well, I guess) to > recognize good intentions beneath. I tend to be very direct, very blunt. That's just me. I'll never be worth a damn as a politician, priest, or a funeral director, that's for sure. ;) And if I didn't have good intentions, I wouldn't be contributing on this list and a half dozen others. ;) -- Stan -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-01 1:31 ` Stan Hoeppner 2012-06-01 3:15 ` Igor M Podlesny @ 2012-06-01 19:25 ` Andy Smith 2012-06-02 5:47 ` Stan Hoeppner 1 sibling, 1 reply; 19+ messages in thread From: Andy Smith @ 2012-06-01 19:25 UTC (permalink / raw) To: linux-raid Hi Stan, On Thu, May 31, 2012 at 08:31:49PM -0500, Stan Hoeppner wrote: > On 5/31/2012 3:31 AM, Andy Smith wrote: > > Now, is this sort of behaviour expected when under incredible load? > > Or is it indicative of a bug somewhere in kernel, mpt driver, or > > even flaky SAS controller/disks? > > It is expected that people know what RAID is and how it is supposed to > be used. RAID is to be used for protecting data in the event of a disk > failure and secondarily to increase performance. That is not how you > seem to be using RAID. Just to clarify, this was the hypervisor host. The VMs on it don't use RAID themselves as that would indeed be silly. > There are a number of scenarios where md RAID is better than hardware > RAID and vice versa. Yours is a case where hardware RAID is superior, > as no matter the host CPU load, drives won't get kicked offline as a > result, as they're under the control of a dedicated IO processor (same > for SAN RAID). Fair enough, thanks. Cheers, Andy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-01 19:25 ` Andy Smith @ 2012-06-02 5:47 ` Stan Hoeppner 2012-06-03 3:30 ` Andy Smith 0 siblings, 1 reply; 19+ messages in thread From: Stan Hoeppner @ 2012-06-02 5:47 UTC (permalink / raw) To: linux-raid On 6/1/2012 2:25 PM, Andy Smith wrote: > Hi Stan, > > On Thu, May 31, 2012 at 08:31:49PM -0500, Stan Hoeppner wrote: >> On 5/31/2012 3:31 AM, Andy Smith wrote: >>> Now, is this sort of behaviour expected when under incredible load? >>> Or is it indicative of a bug somewhere in kernel, mpt driver, or >>> even flaky SAS controller/disks? >> >> It is expected that people know what RAID is and how it is supposed to >> be used. RAID is to be used for protecting data in the event of a disk >> failure and secondarily to increase performance. That is not how you >> seem to be using RAID. > > Just to clarify, this was the hypervisor host. The VMs on it don't > use RAID themselves as that would indeed be silly. Cool. I only mentioned this as I've seen it in the wild more than once. >> There are a number of scenarios where md RAID is better than hardware >> RAID and vice versa. Yours is a case where hardware RAID is superior, >> as no matter the host CPU load, drives won't get kicked offline as a >> result, as they're under the control of a dedicated IO processor (same >> for SAN RAID). > > Fair enough, thanks. You could still use md RAID in your scenario. But instead of having multiple md arrays built of disk partitions and passing each array up to a VM guest, the proper way to do this thin provisioning is to create one md array and then create partitions on top. Then pass a partition to a guest. This method eliminates many potential problems with your current setup, such as elevator behavior causing excessive head seeks on the drives. This is even more critical if some of your md arrays are parity (5/6). You mentioned a single RAID 10 array. If you're indeed running multiple arrays of multiple RAID levels (parity and non parity) on the 5 partitions on each disk, and each VM is doing even medium to small amount of IO concurrently, you'll be head thrashing pretty quickly. Then, when you have a DDOS and lots of log writes/etc, you'll be instantly seek bound, and likely start seeing SCSI timeouts, as the drive head actuators simply can't move quickly enough to satisfy requests before the timeout period. This may be what caused your RAID 10 partitions to be kicked. Not enough info to verify at this point. The same situation can occur on a single OS bare metal host when the storage system isn't designed to handle the IOPS load. Consider a maildir mailbox server with an average load of 2000 random R/W IOPS. The _minimum_ you could get by with here would be 16x 15k disks in RAID 10. 8 * 300 seeks/s/drive = 2400 IOPS peak actuator performance. If we were to put that workload on a RAID 10 array of only 4x 15k drives we'd have 2 * 300 seeks/s/drive = 600 IOPS peak, less than 1/3rd the actual load. I've never tried this so I don't know if you'd have md dropping drives due to SCSI timeouts, but you'd sure have serious problems nonetheless. -- Stan ^ permalink raw reply [flat|nested] 19+ messages in thread
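A minimal sketch of the single-array-then-partition layout described above, assuming a partitionable md device (or LVM on top, as md3 already uses) and placeholder sizes:

    # One RAID10 across a single large partition per disk
    mdadm --create /dev/md0 --level=10 --layout=n2 --chunk=64 \
          --raid-devices=4 /dev/sd[abcd]1

    # Carve the array itself into slices for the hypervisor and guests,
    # instead of building a separate md array per slice
    parted /dev/md0 mklabel gpt
    parted /dev/md0 mkpart boot 1MiB 513MiB
    parted /dev/md0 mkpart root 513MiB 10GiB
    parted /dev/md0 mkpart guests 10GiB 100%
    # the slices then appear as /dev/md0p1, /dev/md0p2, ...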
* Re: Can extremely high load cause disks to be kicked? 2012-06-02 5:47 ` Stan Hoeppner @ 2012-06-03 3:30 ` Andy Smith 2012-06-03 4:05 ` Igor M Podlesny 2012-06-03 6:49 ` Stan Hoeppner 0 siblings, 2 replies; 19+ messages in thread From: Andy Smith @ 2012-06-03 3:30 UTC (permalink / raw) To: linux-raid Hi Stan, On Sat, Jun 02, 2012 at 12:47:00AM -0500, Stan Hoeppner wrote: > You could still use md RAID in your scenario. But instead of having > multiple md arrays built of disk partitions and passing each array up to > a VM guest, the proper way to do this thin provisioning is to create one > md array and then create partitions on top. Then pass a partition to a > guest. I probably didn't give enough details. On this particular host there are four md arrays: md0 is mounted as /boot md1 is used as swap md2 is mounted as / md3 is an LVM PV for "everything else" Some further filesystems on the hypervisor host come from LVs in md3 (/usr, /var and so on). VM guests get their block devices from LVs in md3. But we can ignore the presence of VMs for now since my concern is only with the hypervisor host itself at this point. Even though it was a guest that was attacked, the hypervisor still had to route the traffic through its userspace and its CPU got overwhelmed by the high packets-per-second. You made a point that multiple mdadm arrays should not be used, though in this situation I can't see how that would have helped me; would I not just have got I/O errors on the single md device that everything was running from, causing instant crash? Although an instant crash might have been preferable from the point of view of enabling less corruption, I suppose. > The same situation can occur on a single OS bare metal host when the > storage system isn't designed to handle the IOPS load. Unfortunately my monitoring of IOPS for this host cut out during the attack and later problems so all I have for that period is a blank graph, but I don't think the IOPS requirement would actually have been that high. The CPU was certainly overwhelmed, but my main concern is that I am never going to be able to design a system that will cope with routing DDoS traffic in userspace. I am OK with the hypervisor machine being completely hammered and keeling over until the traffic is blocked upstream on real routers. Not so happy about the hypervisor machine kicking devices out of arrays and ending up with corrupted filesystems though. I haven't experienced that before. I don't think md is to blame as such because the logs show I/O errors on all devices so no wonder it kicked them out. I was just wondering if it was down to bad hardware, bad hardware _choice_, a bug in the driver, some bug in the kernel or what. Also still wondering if what I did to recover was the best way to go or if I could have made it easier on myself. Cheers, Andy ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-03 3:30 ` Andy Smith @ 2012-06-03 4:05 ` Igor M Podlesny 2012-06-03 22:05 ` Andy Smith 2012-06-03 6:49 ` Stan Hoeppner 1 sibling, 1 reply; 19+ messages in thread From: Igor M Podlesny @ 2012-06-03 4:05 UTC (permalink / raw) To: linux-raid On 3 June 2012 11:30, Andy Smith <andy@strugglers.net> wrote: > I probably didn't give enough details. On this particular host there > are four md arrays: > > md0 is mounted as /boot > md1 is used as swap > md2 is mounted as / > md3 is an LVM PV for "everything else" […] Kinda off-topic question — why having separate MDs for swap and root? I mean LVM's lv would be just fine for them. -- -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-03 4:05 ` Igor M Podlesny @ 2012-06-03 22:05 ` Andy Smith 2012-06-04 1:55 ` Stan Hoeppner 0 siblings, 1 reply; 19+ messages in thread From: Andy Smith @ 2012-06-03 22:05 UTC (permalink / raw) To: linux-raid Hi Igor, On Sun, Jun 03, 2012 at 12:05:31PM +0800, Igor M Podlesny wrote: > On 3 June 2012 11:30, Andy Smith <andy@strugglers.net> wrote: > > I probably didn't give enough details. On this particular host there > > are four md arrays: > > > > md0 is mounted as /boot > > md1 is used as swap > > md2 is mounted as / > > md3 is an LVM PV for "everything else" > […] > > Kinda off-topic question — why having separate MDs for swap and > root? I mean LVM's lv would be just fine for them. I know, but I'm kind of old-fashioned and have some (probably unwarranted) distrust of having the really essential bits of the system in LVM if they don't really need to be. The system needs swap and I'm unlikely to want to change the size of that swap, but if I do I can just add more swap from inside LVM. / is also unlikely to need to be resized ever, so I keep that outside of LVM too. My position on this is probably going to have to change given the current "get rid of /usr" plans that seem to be gaining traction in the Linux community. A beefier / would probably justify being in LVM to me. But really it is all just personal taste isn't it.. Cheers, Andy -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked?
  2012-06-03 22:05 ` Andy Smith
@ 2012-06-04 1:55 ` Stan Hoeppner
  2012-06-04 9:06 ` Igor M Podlesny
  0 siblings, 1 reply; 19+ messages in thread

From: Stan Hoeppner @ 2012-06-04 1:55 UTC (permalink / raw)
To: linux-raid

On 6/3/2012 5:05 PM, Andy Smith wrote:
> Hi Igor,
>
> On Sun, Jun 03, 2012 at 12:05:31PM +0800, Igor M Podlesny wrote:
>> On 3 June 2012 11:30, Andy Smith <andy@strugglers.net> wrote:
>>> I probably didn't give enough details. On this particular host there
>>> are four md arrays:
>>>
>>> md0 is mounted as /boot
>>> md1 is used as swap
>>> md2 is mounted as /
>>> md3 is an LVM PV for "everything else"
>> […]
>>
>> Kinda off-topic question — why having separate MDs for swap and
>> root? I mean LVM's lv would be just fine for them.
>
> I know, but I'm kind of old-fashioned and have some (probably
> unwarranted) distrust of having the really essential bits of the
> system in LVM if they don't really need to be.

I wouldn't say unwarranted. I've seen instances on this and other lists where an md array blew chunks, was repaired and everything reported good, only to have LVM devices corrupted beyond repair. Or both md and LVM report good after repair, but the XFS superblock is hosed beyond repair. The fewer layers in the storage stack, the better. KISS principle, "Just because you can doesn't mean you should", etc, etc.

> The system needs swap and I'm unlikely to want to change the size of
> that swap, but if I do I can just add more swap from inside LVM. / is
> also unlikely to need to be resized ever, so I keep that outside of
> LVM too.
>
> My position on this is probably going to have to change given the
> current "get rid of /usr" plans that seem to be gaining traction in
> the Linux community. A beefier / would probably justify being in
> LVM to me.

I guess I'm more old fashioned. I'd never put the root filesystem on an LVM volume. My preferences are directly on a single disk partition, or hardware RAID1 partition, depending on the server/application area.

> But really it is all just personal taste isn't it..

To an extent yes. Separating different portions of the UNIX filesystem tree on different partitions/devices has a long tradition and usually good reasoning behind it. However, as I mentioned, spreading these across multiple md arrays that share the same physical disks is not a good idea. A single md array with multiple partitions, with the filesystem tree, swap, and data directories spread over these partitions, is preferable.

-- 
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-04 1:55 ` Stan Hoeppner @ 2012-06-04 9:06 ` Igor M Podlesny 0 siblings, 0 replies; 19+ messages in thread From: Igor M Podlesny @ 2012-06-04 9:06 UTC (permalink / raw) To: stan; +Cc: linux-raid On 4 June 2012 09:55, Stan Hoeppner <stan@hardwarefreak.com> wrote: > On 6/3/2012 5:05 PM, Andy Smith wrote: […] > >> Kinda off-topic question — why having separate MDs for swap and > >> root? I mean LVM's lv would be just fine for them. > > > > I know, but I'm kind of old-fashioned and have some (probably > > unwarranted) distrust of having the really essential bits of the > > system in LVM if they don't really need to be. > > I wouldn't say unwarranted. I've seen instances on this and other lists > where an md array blew chunks, was repaired and everything reported > good, only to have LVM devices corrupted beyond repair. Or both md and Sounds like urban myths to me. LVM's dev-mapper here is nothing more as merely block look-up/translation table. -- -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Can extremely high load cause disks to be kicked? 2012-06-03 3:30 ` Andy Smith 2012-06-03 4:05 ` Igor M Podlesny @ 2012-06-03 6:49 ` Stan Hoeppner 2012-06-04 0:02 ` Andy Smith 1 sibling, 1 reply; 19+ messages in thread From: Stan Hoeppner @ 2012-06-03 6:49 UTC (permalink / raw) To: linux-raid On 6/2/2012 10:30 PM, Andy Smith wrote: > Hi Stan, Hay Andy, > On Sat, Jun 02, 2012 at 12:47:00AM -0500, Stan Hoeppner wrote: >> You could still use md RAID in your scenario. But instead of having >> multiple md arrays built of disk partitions and passing each array up to >> a VM guest, the proper way to do this thin provisioning is to create one >> md array and then create partitions on top. Then pass a partition to a >> guest. > > I probably didn't give enough details. On this particular host there > are four md arrays: Yeah, I made some incorrect assumptions about how you were using your arrays. > md0 is mounted as /boot > md1 is used as swap > md2 is mounted as / What's the RAID level of each of these and how many partitions in each? > md3 is an LVM PV for "everything else" This one is the RAID 10 correct? > Some further filesystems on the hypervisor host come from LVs in md3 > (/usr, /var and so on). > > VM guests get their block devices from LVs in md3. But we can ignore > the presence of VMs for now since my concern is only with the > hypervisor host itself at this point. > > Even though it was a guest that was attacked, the hypervisor still > had to route the traffic through its userspace and its CPU got > overwhelmed by the high packets-per-second. How many cores in this machine and what frequency? > You made a point that multiple mdadm arrays should not be used, > though in this situation I can't see how that would have helped me; As I mentioned, the disks would have likely been seeking less, but I can't say for sure with the available data. If the disks were seek saturated this could explain why md3 component partitions were kicked. > would I not just have got I/O errors on the single md device that > everything was running from, causing instant crash? Can you post the SCSI/ATA and md errors? > Although an instant crash might have been preferable from the point of > view of enabling less corruption, I suppose. > >> The same situation can occur on a single OS bare metal host when the >> storage system isn't designed to handle the IOPS load. > > Unfortunately my monitoring of IOPS for this host cut out during the > attack and later problems so all I have for that period is a blank > graph, but I don't think the IOPS requirement would actually have > been that high. What was actually being written to md3 during this attack? Just logging, or something else? What was the exact nature of the DDOS attack? What service was it targeting? I assume this wasn't simply a ping flood. > The CPU was certainly overwhelmed, but my main > concern is that I am never going to be able to design a system that > will cope with routing DDoS traffic in userspace. Assuming the network data rate of the attack was less than 1000 Mb/s, most any machine with two or more 2GHz+ cores and sufficient RAM should easily be able to handle this type of thing without falling over. Can you provide system hardware details? > I am OK with the hypervisor machine being completely hammered and > keeling over until the traffic is blocked upstream on real routers. > Not so happy about the hypervisor machine kicking devices out of > arrays and ending up with corrupted filesystems though. I haven't > experienced that before. 
Well, you also hadn't experienced an attack like this before. Correct? > I don't think md is to blame as such because the logs show I/O > errors on all devices so no wonder it kicked them out. I was just > wondering if it was down to bad hardware, bad hardware _choice_, a > bug in the driver, some bug in the kernel or what. This is impossible to determine with the information provided so far. md isn't to blame, but the way you're using it may have played a role in the partitions being kicked. Consider this. If the hypervisor was writing heavily to logs, and if the hypervisor went into heavy swap during the attack, and the partitions in md3 that were kicked reside on disks where the swap array and/or / arrays exist, this would tend to bolster my theory regarding seek starvation causing the timeouts and kicks. If this isn't the case, and you simply had high load hammering md3 and the underlying disks, with little IO on the other arrays, then the problem likely lay elsewhere, possibly with the disks themselves, the HBA, or even the system board. Or, if this is truly a single CPU/core machine, the core was pegged, and the hypervisor kernel scheduler wasn't giving enough time to md threads, this may also explain the timeouts, though with any remotely recent kernel this 'shouldn't' happen under load. If the problem was indeed simply a hammered single core CPU, I'd suggest swapping it for a multi-core model. This will eliminate the possibility of md threads starving for cycles, though I wouldn't think this alone would cause 30 second BIO timeouts, and thus devices being kicked. It's far more likely the drives were seek saturated, which would explain the 30 second BIO timeouts. > Also still wondering if what I did to recover was the best way to go > or if I could have made it easier on myself. I don't really have any input on this aspect, except to say that if you got all your data recovered that's the important part. If you spent twice as long as you needed to I wouldn't sweat that at all. I'd put all my concentration on the root cause analysis. -- Stan ^ permalink raw reply [flat|nested] 19+ messages in thread
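If this happens again, one way to gather the missing evidence for or against the seek-saturation theory, assuming the sysstat tools are installed, is to capture per-device latency and utilisation during the event:

    # Extended per-device stats, 1-second samples: await climbing toward the
    # driver's command timeout and %util pinned near 100 on all four members
    # would point at seek saturation rather than a controller fault
    iostat -x 1

    # md's view of the arrays at the same moment
    cat /proc/mdstat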
* Re: Can extremely high load cause disks to be kicked? 2012-06-03 6:49 ` Stan Hoeppner @ 2012-06-04 0:02 ` Andy Smith 2012-06-04 6:58 ` Stan Hoeppner 0 siblings, 1 reply; 19+ messages in thread From: Andy Smith @ 2012-06-04 0:02 UTC (permalink / raw) To: linux-raid Hi Stan, Thanks for your detailed reply. On Sun, Jun 03, 2012 at 01:49:23AM -0500, Stan Hoeppner wrote: > On 6/2/2012 10:30 PM, Andy Smith wrote: > > md0 is mounted as /boot > > md1 is used as swap > > md2 is mounted as / > > What's the RAID level of each of these and how many partitions in each? $ cat /proc/mdstat Personalities : [raid1] [raid10] md3 : active raid10 sdd5[0] sdb5[3] sdc5[2] sda5[1] 581022592 blocks 64K chunks 2 near-copies [4/4] [UUUU] md2 : active raid10 sdd3[0] sdb3[3] sdc3[2] sda3[1] 1959680 blocks 64K chunks 2 near-copies [4/4] [UUUU] md1 : active raid10 sdd2[0] sdb2[3] sdc2[2] sda2[1] 1959680 blocks 64K chunks 2 near-copies [4/4] [UUUU] md0 : active raid1 sdd1[0] sdb1[3] sdc1[2] sda1[1] 489856 blocks [4/4] [UUUU] unused devices: <none> > > Even though it was a guest that was attacked, the hypervisor still > > had to route the traffic through its userspace and its CPU got > > overwhelmed by the high packets-per-second. > > How many cores in this machine and what frequency? It is a single quad core Xeon L5420 @ 2.50GHz. > > would I not just have got I/O errors on the single md device that > > everything was running from, causing instant crash? > > Can you post the SCSI/ATA and md errors? Sure. I posted an excerpt in the first email but here is a fuller example: Actually it's pretty big so I've put it at http://paste.ubuntu.com/1022219/ At this point the logs stop because /var is an LV out of md3. I'm moving remote syslog servers further up my priority list... If there hadn't been a DDoS attack at the exact same time then I'd have considered this purely hardware failure due to the way that "mptscsih: ioc0: attempting task abort! (sc=ffff8800352d80c0)" is the absolute first thing of interest. But the timing is too coincidental. It's also been fine since, including a quite IO-intensive backup job and yesterday's "first Sunday of the month" sync_action. > > Unfortunately my monitoring of IOPS for this host cut out during the > > attack and later problems so all I have for that period is a blank > > graph, but I don't think the IOPS requirement would actually have > > been that high. > > What was actually being written to md3 during this attack? Just > logging, or something else? All the VMs would have been doing their normal writing of course, but on the hypervisor host /usr and /var come from md3. From the logs, the main thing it seems to be having problems with is dm-1 which is the /usr LV. > What was the exact nature of the DDOS attack? What service was it > targeting? I assume this wasn't simply a ping flood. It was a UDP short packet (78 bytes) multiple source single destination flood, ~300Mbit/s but the killer was the ~600kpps. Less than 10Mbit/s made it through to the VM it was targeting. > > The CPU was certainly overwhelmed, but my main concern is that I > > am never going to be able to design a system that will cope with > > routing DDoS traffic in userspace. > > Assuming the network data rate of the attack was less than 1000 Mb/s, > most any machine with two or more 2GHz+ cores and sufficient RAM should > easily be able to handle this type of thing without falling over. 
I really don't think it is easy to spec a decent VM host that can also route hundreds of thousands of packets per sec to guests, without a large budget. I am OK with the host giving up, I just don't want it to corrupt its storage. I mean I'm sure it can be done, but the budget probably doesn't allow it and temporary problems for all VMs on the host are acceptable in this case; filesystem corruption isn't. > Can you provide system hardware details? My original email: > > Controller: LSISAS1068E B3, FwRev=011a0000h > > Motherboard: Supermicro X7DCL-3 > > Disks: 4x SEAGATE ST9300603SS Version: 0006 Network: e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k2 The hypervisor itself has access to only 1GB RAM (the rest dedicated to guest VMs) which may be rather low; I could look at boosting that. The other thing is that the hypervisor and all VM guests share the same four CPU cores. It may be prudent to dedicate one CPU core to the hypervisor and then let the guests share the other three. Any other hardware details that might be relevant? > > I am OK with the hypervisor machine being completely hammered and > > keeling over until the traffic is blocked upstream on real routers. > > Not so happy about the hypervisor machine kicking devices out of > > arrays and ending up with corrupted filesystems though. I haven't > > experienced that before. > > Well, you also hadn't experienced an attack like this before. Correct? No, they happen from time to time and I find that a pretty big one will cripple the host but I haven't yet seen that cause storage problems. But as I say, all the other servers are 3ware+BBU. To be honest I do find I get a better price/performance spot out of 3ware setup; this host represented an experiment with 10kRPM SAS drives and software RAID a few years ago and whilst the performance is decent, the much higher cost per GB of the SAS drives ultimately makes this uneconomical for me, as I do have to provide a certain storage capacity as well. So I haven't gone with md RAID for this use case for a couple of years now and am unlikely to do so in the future anyway. I do still need to work out what to do with this particular server though. > Consider this. If the hypervisor was writing heavily to logs, and if > the hypervisor went into heavy swap during the attack, and the > partitions in md3 that were kicked reside on disks where the swap array > and/or / arrays exist, this would tend to bolster my theory regarding > seek starvation causing the timeouts and kicks. I have a feeling it was trying to swap a lot from the repeated mentions of "swapper" in the logs. The swap partition is md1 which doesn't feature in any of the logs, but perhaps what is being logged there is the swapper kernel thread being unable to do anything because of extreme CPU starvation. > Or, if this is truly a single CPU/core machine, the core was pegged, and > the hypervisor kernel scheduler wasn't giving enough time to md threads, > this may also explain the timeouts, though with any remotely recent > kernel this 'shouldn't' happen under load. Admittedly this is an old Debian lenny server running 2.6.26-2-xen kernel. Pushing ahead the timescale of clearing VMs off of it and doing an upgrade would probably be a good idea. Although it isn't exactly the hardware setup I would like it's been a pretty good machine for a few years now so I would rather not junk it, if I can reassure myself that this won't happen again. 
> > Also still wondering if what I did to recover was the best way to go > > or if I could have made it easier on myself. > > I don't really have any input on this aspect, except to say that if you > got all your data recovered that's the important part. If you spent > twice as long as you needed to I wouldn't sweat that at all. I'd put > all my concentration on the root cause analysis. Sure, though this was a completely new experience for me so if anyone has any tips for better recovery then that will help should I ever face anything like it again. I do use md RAID in quite a few places (just not much for this use). Notably, the server was only recoverable because I had backups of /usr and /var. There were some essential files corrupted but being able to get them from backups saved having to do a rebuild. Actually corruption of VM block devices was quite minimal -- many of them just had to replay the journal and clean up a handful of orphaned inodes. Apparently a couple of them did lose a small number of files but these VMs are all run by different admins and I don't have great insight into it. Anything I could have done to lessen that would have been useful. They're meant to have backups but meh, human nature... Cheers, Andy ^ permalink raw reply [flat|nested] 19+ messages in thread
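For reference, the core-dedication idea mentioned above maps onto Xen roughly as follows; the option names should be checked against the Xen release actually in use, so this is an assumption-laden sketch:

    Xen hypervisor boot line (the xen.gz entry, not the dom0 kernel line):
        dom0_max_vcpus=1 dom0_vcpus_pin

    Per-guest config, to keep guests off the core dom0 is pinned to:
        cpus = "1-3"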
* Re: Can extremely high load cause disks to be kicked? 2012-06-04 0:02 ` Andy Smith @ 2012-06-04 6:58 ` Stan Hoeppner 0 siblings, 0 replies; 19+ messages in thread From: Stan Hoeppner @ 2012-06-04 6:58 UTC (permalink / raw) To: linux-raid On 6/3/2012 7:02 PM, Andy Smith wrote: > Hi Stan, > > Thanks for your detailed reply. Welcome. More detailed reply and question follow. > $ cat /proc/mdstat > Personalities : [raid1] [raid10] > md3 : active raid10 sdd5[0] sdb5[3] sdc5[2] sda5[1] > 581022592 blocks 64K chunks 2 near-copies [4/4] [UUUU] > > md2 : active raid10 sdd3[0] sdb3[3] sdc3[2] sda3[1] > 1959680 blocks 64K chunks 2 near-copies [4/4] [UUUU] > > md1 : active raid10 sdd2[0] sdb2[3] sdc2[2] sda2[1] > 1959680 blocks 64K chunks 2 near-copies [4/4] [UUUU] > > md0 : active raid1 sdd1[0] sdb1[3] sdc1[2] sda1[1] > 489856 blocks [4/4] [UUUU] > > unused devices: <none> You laid it out nicely, but again, not good practice to have this many md arrays on the same disks. > It is a single quad core Xeon L5420 @ 2.50GHz. Ok, now that's interesting. > Actually it's pretty big so I've put it at > http://paste.ubuntu.com/1022219/ Ok, found the source of the mptscsi problem, not hardware failure. Note you're using: Pid: 0, comm: swapper Not tainted 2.6.26-2-xen-amd64 #1 2.6.26 is so old its bones are turning to dust. Note this Red Hat bug, and the fact this is an upstream LSI driver problem: https://bugzilla.redhat.com/show_bug.cgi?id=483424 This driver issue is well over 3 years old and has since been fixed. It exists in isolation of your DDOS attack. And it is probably not the cause of the partitions being kicked. Most likely cause is CPU soft lockup for 60, 90, 120 seconds. To fix both of these, upgrade to a kernel that's not rotting in the grave, something in the 3.x.x stable series is best, and the most recent version of Xen. Better yet, switch to KVM as it's in mainline and better supported. > At this point the logs stop because /var is an LV out of md3. I'm > moving remote syslog servers further up my priority list... Yes, good idea. Even in absence of this mess. > If there hadn't been a DDoS attack at the exact same time then > I'd have considered this purely hardware failure due to the way that > "mptscsih: ioc0: attempting task abort! (sc=ffff8800352d80c0)" is > the absolute first thing of interest. But the timing is too > coincidental. Not hardware failure, note above LSI driver bug. > It's also been fine since, including a quite IO-intensive backup job > and yesterday's "first Sunday of the month" sync_action. Check your logs for these mptscsi errors for the past month. You should see some even with zero load on the system, according to the bug report. > All the VMs would have been doing their normal writing of course, > but on the hypervisor host /usr and /var come from md3. From the I wouldn't have laid it out like that. I'm sure you're thinking the same thing about now. > logs, the main thing it seems to be having problems with is dm-1 > which is the /usr LV. What was the hypervisor kernel writing to /usr during the attack? Seems to me it should have been writing nothing there, but just to /var/log. >> What was the exact nature of the DDOS attack? What service was it >> targeting? I assume this wasn't simply a ping flood. > > It was a UDP short packet (78 bytes) multiple source single > destination flood, ~300Mbit/s but the killer was the ~600kpps. Less > than 10Mbit/s made it through to the VM it was targeting. What, precisely, was it targeting? 
What service was listening on the UDP port that was attacked? You still haven't named the UDP port. Anything important? Can you simply disable that service? If it's only used legitimately by local hosts, drop packets originating outside the local network.

>>> The CPU was certainly overwhelmed, but my main concern is that I
>>> am never going to be able to design a system that will cope with
>>> routing DDoS traffic in userspace.

Why would you want to? Kill it before it reaches that user space.

> I really don't think it is easy to spec a decent VM host that can
> also route hundreds of thousands of packets per sec to guests,
> without a large budget.

Again, why would you even attempt this? Kill it upstream before it reaches the physical network interface.

> I am OK with the host giving up, I just
> don't want it to corrupt its storage.

The solution, again, isn't beefing up the VM host machine, but hardening the kernel against DDOS, and more importantly, getting the upstream FW squared away so it drops all these DDOS packets on the floor.

> I mean I'm sure it can be done, but the budget probably doesn't
> allow it and temporary problems for all VMs on the host are
> acceptable in this case; filesystem corruption isn't.

There are many articles easily found via Google explaining how to harden a Linux machine against DDOS attacks. None of them say anything about beefing up the hardware in an effort to ABSORB the attack. What you need is a better upstream firewall, or an intelligent firewall running in the hypervisor kernel, or both.

> The other thing is that the hypervisor and all VM guests share the
> same four CPU cores. It may be prudent to dedicate one CPU core to
> the hypervisor and then let the guests share the other three.

If you do that the host may simply fail more quickly, but still with damage.

> No, they happen from time to time and I find that a pretty big one
> will cripple the host but I haven't yet seen that cause storage
> problems. But as I say, all the other servers are 3ware+BBU.

Why are DDOS attacks frequent there? Are you a VPS provider?

> To be honest I do find I get a better price/performance spot out of
> 3ware setup; this host represented an experiment with 10kRPM SAS
> drives and software RAID a few years ago and whilst the performance
> is decent, the much higher cost per GB of the SAS drives ultimately
> makes this uneconomical for me, as I do have to provide a certain
> storage capacity as well.

But for your current investment in HBA RAID boards, and with limited knowledge of your operation, it's sounding like you may be a good candidate for a low cost midrange iSCSI SAN array. Use a single 20GB Intel SSD to boot the hypervisor and serve swap, etc, then mount iSCSI LUNs for each VM with the hypervisor software initiator. Yields much more flexibility in storage provisioning than HBA RAID w/local disks.

You should be able to acquire a single controller Nexsan SATABoy w/2GB BBWC, dual GbE iSCSI ports (and dual 4Gb FC ports), and 14x 2TB 7.2k SATA drives for less than $15k. Using RAID6 easily gets you near line rate throughput (370MB/s), and gives you 24TB of space to slice into 256 virtual drives. You'd want to do multipathing, so you would do up to 254 virtual drives, exporting each one as a LUN on each iSCSI port. You can size each virtual drive per the needs of the VM. Creating and deleting virtual drives is a few mouse clicks in the excellent web gui. This is the cheapest array in Nexsan's lineup.
The next step up the Nexsan ladder gets you a single controller E18 with 18x 3TB drives in 2U for a little over $20K. A single RAID6 yields 48TB to slice up. The E18 allows expansion with a 60 drive 4U enclosure, up to 234TB total. You're probably looking at the $50K neighborhood for the expansion chassis loaded w/3TB drives. > So I haven't gone with md RAID for this use case for a couple of > years now and am unlikely to do so in the future anyway. I do still > need to work out what to do with this particular server though. Slap in a 3ware 9750-4i and write cache battery and redeploy. http://www.newegg.com/Product/Product.aspx?Item=N82E16816116109 http://www.newegg.com/Product/Product.aspx?Item=N82E16816118118 W/512MB BBWC and those 10K SAS drives it will perform very well, better than md and the current HBA WRT writes. > I have a feeling it was trying to swap a lot from the repeated > mentions of "swapper" in the logs. The swap partition is md1 which > doesn't feature in any of the logs, but perhaps what is being logged > there is the swapper kernel thread being unable to do anything > because of extreme CPU starvation. Makes sense. > Admittedly this is an old Debian lenny server running 2.6.26-2-xen > kernel. Pushing ahead the timescale of clearing VMs off of it and > doing an upgrade would probably be a good idea. Heheh, yeah, as I mentioned here way up above. This Lenny kernel was afflicted by the aforementioned LSI bug. And there were plenty of other issues with 2.6.26. > Although it isn't exactly the hardware setup I would like it's been > a pretty good machine for a few years now so I would rather not junk > it, if I can reassure myself that this won't happen again. You will probably suffer again if DDOSed. And as I mentioned the solution has little to do with (lack of) machine horsepower. A modern kernel will help out considerably. But, you probably still need to configure it against DDOS. Lots of articles/guides available @Google search. Without knowing all the technicals of your attack, I can't say for sure what kernel options would help. A relatively simple solution you may want to consider is implementing a daemon to monitor interface traffic in the hypervisor. If/when the packet/bit rate clearly shows signs of [D]DOS, run a script that executes ifdown to take that interface offline and email you an alert through the management interface (I assume you have both). You may want to automatically bring the interface back up after 5 minutes or so. If DDOS traffic still exists, down the interface again. Rinse repeat until the storm passes. Your VMs will be non functional during an attack anyway so there's no harm in downing the interface. And doing so should save your bacon. > Sure, though this was a completely new experience for me so if > anyone has any tips for better recovery then that will help should I > ever face anything like it again. I do use md RAID in quite a few > places (just not much for this use). > > Notably, the server was only recoverable because I had backups of > /usr and /var. There were some essential files corrupted but being > able to get them from backups saved having to do a rebuild. This is refreshing. Too many times folks post that they MUST get their md array or XFS filesystem back because they have no backup. Neil, and some of the XFS devs, save the day sometimes, but just as often md and XFS superblocks are completely corrupted or overwritten with zeros, making recovery impossible, and very unhappy users. 
> Actually corruption of VM block devices was quite minimal -- many of > them just had to replay the journal and clean up a handful of > orphaned inodes. Apparently a couple of them did lose a small number > of files but these VMs are all run by different admins and I don't > have great insight into it. I take it no two partitions in the same mirror pair were kicked, downing the array? I don't recall that you stated you actually lost the md3 array. Sorry if you did. Long thread, lots of words to get lost. > Anything I could have done to lessen that would have been useful. > They're meant to have backups but meh, human nature... Well, yea, but it still sucks to do restores, and backups aren't always bulletproof. At this point I think your best bet is automated ifdown/ifup. It's simple, effective, should be somewhat cheap to implement. It can all be done in a daemonized perl script. Just don't ask me to write it. ;) If I had the skill I'd have already done it. Hope I've been of at least some help to you Andy, given you some decent ideas, if not solid solutions. I must say this is the most unique troubleshooting thread I can recall on linux-raid. -- Stan ^ permalink raw reply [flat|nested] 19+ messages in thread
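A minimal sketch of the monitor described above, assuming a Debian-style ifdown/ifup and a guest-facing interface named eth0 (both placeholders), could be a simple shell loop over the kernel's packet counters rather than a full perl daemon:

    #!/bin/sh
    # Down the guest-facing interface when inbound pps exceeds a threshold,
    # bring it back a few minutes later; repeat until the storm passes.
    IF=eth0            # guest-facing interface (placeholder)
    LIMIT=200000       # packets per second treated as an attack (placeholder)

    while true; do
        a=$(cat /sys/class/net/$IF/statistics/rx_packets)
        sleep 1
        b=$(cat /sys/class/net/$IF/statistics/rx_packets)
        pps=$((b - a))
        if [ "$pps" -gt "$LIMIT" ]; then
            logger "DDoS suspected on $IF (${pps} pps), downing interface"
            ifdown "$IF"
            sleep 300
            ifup "$IF"
        fi
    done

Alerting via the management interface could be bolted onto the logger line; the thresholds and hold-down time are the tuning knobs.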
* Re: Can extremely high load cause disks to be kicked? 2012-05-31 8:31 Can extremely high load cause disks to be kicked? Andy Smith 2012-06-01 1:31 ` Stan Hoeppner @ 2012-06-04 4:13 ` NeilBrown 1 sibling, 0 replies; 19+ messages in thread From: NeilBrown @ 2012-06-04 4:13 UTC (permalink / raw) To: Andy Smith; +Cc: linux-raid [-- Attachment #1: Type: text/plain, Size: 1229 bytes --] On Thu, 31 May 2012 08:31:58 +0000 Andy Smith <andy@strugglers.net> wrote: > Now, is this sort of behaviour expected when under incredible load? > Or is it indicative of a bug somewhere in kernel, mpt driver, or > even flaky SAS controller/disks? This sort of high load would not affect md, except to slow it down. My guess is that the real bug is in the mpt driver, but as I know nothing about the mpt driver, you should treat that guess with a few kilos of NaCl. > > Root cause of failure aside, could I have made recovery easier? Was > there a better way than --create --assume-clean? The mis-step was to try to add the devices back to the array. A newer mdadm would refuse to let you do this because of the destructive effect. The correct step would have been to stop the array and re-assemble it, with --force. Once you had turned the devices to spares with --add, --create --assume-clean was the correct fix. > > If I had done a --create with sdc5 (the device that stayed in the > array) and the other device with the closest event count, plus two > "missing", could I have expected less corruption when on 'repair'? Possibly. You certainly wouldn't expect more. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
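With the failed members left untouched (no --add), the assemble-with-force path described here would have been roughly the following, using the device names from earlier in the thread (sketch only):

    mdadm --stop /dev/md3
    mdadm --assemble --force /dev/md3 /dev/sda5 /dev/sdb5 /dev/sdc5 /dev/sdd5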