* performance with guests running 2.4 kernels (specifically RHEL3)
@ 2008-04-16  0:15 David S. Ahern
  2008-04-16  8:46 ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread

From: David S. Ahern @ 2008-04-16 0:15 UTC (permalink / raw)
To: kvm-devel

I have been looking at RHEL3 based guests lately, and to say the least the
performance is horrible. Rather than write a long tome on what I've done and
observed, I'd like to find out if anyone has some insights or known problem
areas running 2.4 guests. The short of it is that % system time spikes from
time to time (e.g., on exec of a new process such as running /bin/true).

I do not see the problem running RHEL3 on ESX, and an equivalent VM running
RHEL4 runs fine. That suggests that the 2.4 kernel is doing something in a way
that is not handled efficiently by kvm.

Can someone shed some light on it?

thanks,

david

-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
Don't miss this year's exciting event. There's still time to save $100.
Use priority code J8TL2D2.
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-16  0:15 performance with guests running 2.4 kernels (specifically RHEL3) David S. Ahern
@ 2008-04-16  8:46 ` Avi Kivity
  2008-04-17 21:12   ` David S. Ahern
  0 siblings, 1 reply; 73+ messages in thread

From: Avi Kivity @ 2008-04-16 8:46 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> I have been looking at RHEL3 based guests lately, and to say the least the
> performance is horrible. Rather than write a long tome on what I've done and
> observed, I'd like to find out if anyone has some insights or known problem
> areas running 2.4 guests. The short of it is that % system time spikes from time
> to time (e.g., on exec of a new process such as running /bin/true).
>
> I do not see the problem running RHEL3 on ESX, and an equivalent VM running
> RHEL4 runs fine. That suggests that the 2.4 kernel is doing something in a way
> that is not handled efficiently by kvm.
>
> Can someone shed some light on it?
>

It's not something that I test regularly. If you're running a 32-bit
kernel, I'd suspect kmap(), or perhaps false positives from the fork
detector.

kvmtrace will probably give enough info to tell exactly what's going on;
'kvm_stat -1' while the badness is happening may also help.

--
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-16  8:46 ` Avi Kivity
@ 2008-04-17 21:12   ` David S. Ahern
  2008-04-18  7:57     ` Avi Kivity
  2008-04-23  8:03     ` Avi Kivity
  0 siblings, 2 replies; 73+ messages in thread

From: David S. Ahern @ 2008-04-17 21:12 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel

kvm_stat -1 is practically impossible to time correctly to get a good snippet.

kvmtrace is a fascinating tool. I captured trace data that encompassed one
intense period where the VM appeared to freeze (no terminal response for a few
seconds).

After converting to text I examined an arbitrary section in time (how do you
correlate tsc to unix epoch?), and it shows vcpu0 hammered with interrupts and
vcpu1 hammered with page faults. (I put the representative data below; I can
send the binary or text files if you really want to see them.) All told, over
about a 10-12 second time period the trace text files contain 8426221 lines and
2051344 of them are PAGE_FAULTs (that's 24% of the text lines, which seems
really high).
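The line-share arithmetic above can be reproduced with standard tools; a rough
sketch (the trace file name is a placeholder for whatever the kvmtrace text
conversion produced):

```shell
# Count what fraction of the formatted kvmtrace text is PAGE_FAULT records.
total=$(wc -l < trace.txt)
faults=$(grep -c 'PAGE_FAULT' trace.txt)
awk -v f="$faults" -v t="$total" 'BEGIN { printf "%d/%d = %.1f%%\n", f, t, 100 * f / t }'
```

For the figures quoted above, 2051344/8426221 comes out to roughly 24.3%.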
david

---------------------------------
vcpu0 data:

0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400020536 (+ 1712) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400096784 (+ 76248) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400098576 (+ 1792) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400114528 (+ 15952) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400116328 (+ 1800) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400137216 (+ 20888) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400138840 (+ 1624) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400209344 (+ 70504) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400211056 (+ 1712) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400226312 (+ 15256) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400228040 (+ 1728) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400248688 (+ 20648) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]

vcpu1 data:

9968400002032 (+ 3808) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c016127f ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
9968400005448 (+ 3416) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400009832 (+ 4384) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x0000000b, virt = 0x00000000 fffb6f88 ]
9968400071584 (+ 61752) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400075608 (+ 4024) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
9968400083528 (+ 7920) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400087288 (+ 3760) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
9968400097312 (+ 10024) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400103064 (+ 5752) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160f9c ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
9968400116624 (+ 13560) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400120424 (+ 3800) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160fa1 ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
9968400123856 (+ 3432) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400128208 (+ 4352) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160dab ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000009, virt = 0x00000000 fffb6d28 ]
9968400183848 (+ 55640) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400188232 (+ 4384) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160e4d ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
9968400196160 (+ 7928) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400199928 (+ 3768) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160e54 ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
9968400209864 (+ 9936) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400214984 (+ 5120) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160f9c ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
9968400228232 (+ 13248) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400232000 (+ 3768) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160fa1 ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
9968400235424 (+ 3424) VMENTRY vcpu = 0x00000000 pid = 0x000011ea
9968400239816 (+ 4392) VMEXIT vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160dab ]
0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000009, virt = 0x00000000 fffb6d30 ]

Avi Kivity wrote:
> David S. Ahern wrote:
>> I have been looking at RHEL3 based guests lately, and to say the least
>> the performance is horrible. Rather than write a long tome on what I've
>> done and observed, I'd like to find out if anyone has some insights or
>> known problem areas running 2.4 guests. The short of it is that % system
>> time spikes from time to time (e.g., on exec of a new process such as
>> running /bin/true).
>>
>> I do not see the problem running RHEL3 on ESX, and an equivalent VM
>> running RHEL4 runs fine. That suggests that the 2.4 kernel is doing
>> something in a way that is not handled efficiently by kvm.
>>
>> Can someone shed some light on it?
>>
> It's not something that I test regularly. If you're running a 32-bit
> kernel, I'd suspect kmap(), or perhaps false positives from the fork
> detector.
>
> kvmtrace will probably give enough info to tell exactly what's going on;
> 'kvm_stat -1' while the badness is happening may also help.
>

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-17 21:12   ` David S. Ahern
@ 2008-04-18  7:57     ` Avi Kivity
  2008-04-21  4:31       ` David S. Ahern
  1 sibling, 1 reply; 73+ messages in thread

From: Avi Kivity @ 2008-04-18 7:57 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> kvm_stat -1 is practically impossible to time correctly to get a good snippet.
>
> kvmtrace is a fascinating tool. I captured trace data that encompassed one
> intense period where the VM appeared to freeze (no terminal response for a few
> seconds).
>
> After converting to text I examined an arbitrary section in time (how do you
> correlate tsc to unix epoch?), and it shows vcpu0 hammered with interrupts and
> vcpu1 hammered with page faults. (I put the representative data below; I can
> send the binary or text files if you really want to see them.) All told, over
> about a 10-12 second time period the trace text files contain 8426221 lines and
> 2051344 of them are PAGE_FAULTs (that's 24% of the text lines, which seems
> really high).
>
> david
>
> vcpu1 data:
>
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000009, virt = 0x00000000 fffb6d28 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000009, virt = 0x00000000 fffb6d30 ]
>

The pattern here is c0009db4, c0009db0, fffb6xxx, c0009db0: setting a pte at
c0009db0, accessing the page mapped by the pte, then unmapping the pte. Note
that c0009db0 (bits 3:11) == 0x1b6 == fffb6xxx (bits 12:20). That's a
kmap_atomic() + access + kunmap_atomic() sequence. The expensive accesses
(~50K cycles) seem to be the ones at fffb6xxx. Now these shouldn't show up at
all -- kvm_mmu_pte_write() ought to have set up the ptes correctly.

Can you add a trace at mmu_guess_page_from_pte_write(), right before "if
(is_present_pte(gpte))"? I'm interested in gpa and gpte. Also a trace at
kvm_mmu_pte_write(), where it sets flooded = 1 (hmm, try to increase the 3 to
4 in the line right above that, maybe the fork detector is misfiring).
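The bit correspondence can be checked mechanically; a quick sketch in bash,
using the addresses from the trace above:

```shell
pte_va=0xc0009db0    # guest-virtual address of the pte being written
kmap_va=0xfffb6f88   # one of the faulting fffb6xxx accesses

# With 8-byte (PAE-format) ptes, a pte's slot in its page table is its
# offset within the page divided by 8, i.e. bits 3..11 of its address.
pte_slot=$(( (pte_va & 0xfff) >> 3 ))

# The page that slot maps is selected by bits 12..20 of the mapped
# virtual address (its index within the table's 2 MiB window).
page_index=$(( (kmap_va >> 12) & 0x1ff ))

printf 'pte slot 0x%x, mapped page index 0x%x\n' "$pte_slot" "$page_index"
# prints: pte slot 0x1b6, mapped page index 0x1b6
```

Both indices come out to 0x1b6, which is what ties the c0009dbx pte writes to
the fffb6xxx faults.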
---------------------------------
vcpu0 data:

0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400020536 (+ 1712) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400096784 (+ 76248) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400098576 (+ 1792) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400114528 (+ 15952) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400116328 (+ 1800) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400137216 (+ 20888) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400138840 (+ 1624) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400209344 (+ 70504) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400211056 (+ 1712) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400226312 (+ 15256) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
0 (+ 0) INTR vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400228040 (+ 1728) VMENTRY vcpu = 0x00000001 pid = 0x000011ea
9968400248688 (+ 20648) VMEXIT vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]

Those are probably IPIs due to the kmaps above.

>
> Avi Kivity wrote:
>> David S. Ahern wrote:
>>> I have been looking at RHEL3 based guests lately, and to say the least
>>> the performance is horrible. Rather than write a long tome on what I've
>>> done and observed, I'd like to find out if anyone has some insights or
>>> known problem areas running 2.4 guests. The short of it is that % system
>>> time spikes from time to time (e.g., on exec of a new process such as
>>> running /bin/true).
>>>
>>> I do not see the problem running RHEL3 on ESX, and an equivalent VM
>>> running RHEL4 runs fine. That suggests that the 2.4 kernel is doing
>>> something in a way that is not handled efficiently by kvm.
>>>
>>> Can someone shed some light on it?
>>>
>> It's not something that I test regularly. If you're running a 32-bit
>> kernel, I'd suspect kmap(), or perhaps false positives from the fork
>> detector.
>>
>> kvmtrace will probably give enough info to tell exactly what's going on;
>> 'kvm_stat -1' while the badness is happening may also help.
>>

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-18  7:57     ` Avi Kivity
@ 2008-04-21  4:31       ` David S. Ahern
  2008-04-21  9:19         ` Avi Kivity
  0 siblings, 1 reply; 73+ messages in thread

From: David S. Ahern @ 2008-04-21 4:31 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel

I added the traces and captured data over another apparent lockup of the
guest. This seems to be representative of the sequence (pid/vcpu removed).

(+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+3632) VMENTRY
(+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61c8 ]
(+ 54928) VMENTRY
(+4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41c5d363 ]
(+8432) VMENTRY
(+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+ 13832) VMENTRY

(+5768) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+3712) VMENTRY
(+4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61d0 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000 3d55d047 ]
(+ 65216) VMENTRY
(+4232) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 3d598363 ]
(+8640) VMENTRY
(+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+ 14160) VMENTRY

I can forward a more complete time snippet if you'd like. vcpu0 + corresponding
vcpu1 files have 85000 total lines and compressed the files total ~500k.

I did not see the FLOODED trace come out during this sample though I did bump
the count from 3 to 4 as you suggested.

Correlating rip addresses to the 2.4 kernel:

c0160d00-c0161290 = page_referenced

It looks like the event is kscand running through the pages. I suspected this
some time ago, and tried tweaking the kscand_work_percent sysctl variable. It
appeared to lower the peak of the spikes, but maybe I imagined it. I believe
lowering that value makes kscand wake up more often but do less work (page
scanning) each time it is awakened.

david

Avi Kivity wrote:
> Can you add a trace at mmu_guess_page_from_pte_write(), right before "if
> (is_present_pte(gpte))"? I'm interested in gpa and gpte. Also a trace
> at kvm_mmu_pte_write(), where it sets flooded = 1 (hmm, try to increase
> the 3 to 4 in the line right above that, maybe the fork detector is
> misfiring).

^ permalink raw reply	[flat|nested] 73+ messages in thread
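The rip-to-symbol correlation described above can be scripted against the
guest kernel's System.map; a bash sketch (the map file name, and having copied
it out of the guest, are assumptions):

```shell
# Map a trace rip back to the symbol that contains it. System.map is
# sorted by address, so remember the last symbol that starts at or
# below the rip and stop at the first one above it.
rip=$(( 0xc0160f9c ))   # one of the VMEXIT rips from the trace
sym=unknown
while read -r addr type name; do
    [ $(( 0x$addr )) -le "$rip" ] || break
    sym=$name
done < System.map
echo "$sym"
```

For the c0160xxx rips in the trace this should land in page_referenced, per
the c0160d00-c0161290 range quoted above.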
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-21  4:31       ` David S. Ahern
@ 2008-04-21  9:19         ` Avi Kivity
  2008-04-21 17:07           ` David S. Ahern
  2008-04-22 20:23           ` David S. Ahern
  0 siblings, 2 replies; 73+ messages in thread

From: Avi Kivity @ 2008-04-21 9:19 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> I added the traces and captured data over another apparent lockup of the guest.
> This seems to be representative of the sequence (pid/vcpu removed).
>
> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+3632) VMENTRY
> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61c8 ]
> (+ 54928) VMENTRY
>

Can you oprofile the host to see where the 54K cycles are spent?

> (+4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41c5d363 ]
> (+8432) VMENTRY
> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+ 13832) VMENTRY
>
> (+5768) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+3712) VMENTRY
> (+4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61d0 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000 3d55d047 ]
>

This indeed has the accessed bit clear.
> (+ 65216) VMENTRY
> (+4232) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 3d598363 ]
>

This has the accessed bit set and the user bit clear, and the pte pointing at
the previous pte_write gpa. Looks like a kmap_atomic().

> (+8640) VMENTRY
> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+ 14160) VMENTRY
>
> I can forward a more complete time snippet if you'd like. vcpu0 + corresponding
> vcpu1 files have 85000 total lines and compressed the files total ~500k.
>
> I did not see the FLOODED trace come out during this sample though I did bump
> the count from 3 to 4 as you suggested.
>

Bumping the count was supposed to remove the flooding...

> Correlating rip addresses to the 2.4 kernel:
>
> c0160d00-c0161290 = page_referenced
>
> It looks like the event is kscand running through the pages. I suspected this
> some time ago, and tried tweaking the kscand_work_percent sysctl variable. It
> appeared to lower the peak of the spikes, but maybe I imagined it. I believe
> lowering that value makes kscand wake up more often but do less work (page
> scanning) each time it is awakened.
>

What does 'top' in the guest show (perhaps sorted by total cpu time rather
than instantaneous usage)?

What host kernel are you running? How many host cpus?

--
error compiling committee.c: too many arguments to function
^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-21  9:19         ` Avi Kivity
@ 2008-04-21 17:07           ` David S. Ahern
  2008-04-22 20:23           ` David S. Ahern
  1 sibling, 0 replies; 73+ messages in thread

From: David S. Ahern @ 2008-04-21 17:07 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel

host: 2.6.25-rc8, x86_64, kvm-66
      1 dual-core Xeon(R) CPU 3050 @ 2.13GHz
      6 GB RAM

(This behavior also occurs on a larger server with 2 dual-core Xeon(R) CPU
5140 @ 2.33GHz, 4 GB RAM. Same kernel and kvm versions.)

guest: RHEL3 U8 (2.4.21-47.ELsmp), 2 vcpus, 2 GB RAM

As usual, waited for a guest "event" -- high system time in guest which
appears to lock it up. Following the event, kscand was the top CPU user
(cumulative time) in the guest. During the event, 2 qemu threads are pegging
the host CPU at 100%.

Top samples from oprofile (oprofile was started after the freeze start and
stopped when guest response returned):

samples  %        image name    app name   symbol name
171716   35.1350  kvm-intel.ko  kvm_intel  vmx_vcpu_run
45836     9.3786  vmlinux       vmlinux    copy_user_generic_string
39417     8.0652  kvm.ko        kvm        kvm_read_guest_atomic
23604     4.8296  vmlinux       vmlinux    add_preempt_count
22878     4.6811  vmlinux       vmlinux    __smp_call_function_mask
16143     3.3030  kvm.ko        kvm        gfn_to_hva
14648     2.9971  vmlinux       vmlinux    sub_preempt_count
14589     2.9851  kvm.ko        kvm        __gfn_to_memslot
11666     2.3870  kvm.ko        kvm        unalias_gfn
10834     2.2168  kvm.ko        kvm        kvm_mmu_zap_page
10532     2.1550  kvm.ko        kvm        paging64_prefetch_page
6285      1.2860  kvm-intel.ko  kvm_intel  handle_exception
6066      1.2412  kvm.ko        kvm        kvm_arch_vcpu_ioctl_run
4741      0.9701  kvm.ko        kvm        kvm_add_trace
3801      0.7777  vmlinux       vmlinux    __copy_from_user_inatomic
3592      0.7350  vmlinux       vmlinux    follow_page
3326      0.6805  kvm.ko        kvm        mmu_memory_cache_alloc
3317      0.6787  kvm-intel.ko  kvm_intel  kvm_handle_exit
2971      0.6079  kvm.ko        kvm        paging64_page_fault
2777      0.5682  kvm.ko        kvm        paging64_walk_addr
2294      0.4694  kvm.ko        kvm        kvm_mmu_pte_write
2278      0.4661  kvm.ko        kvm        kvm_flush_remote_tlbs
2266      0.4636  kvm-intel.ko  kvm_intel  vmcs_writel
2086      0.4268  kvm.ko        kvm        mmu_set_spte
2041      0.4176  kvm.ko        kvm        kvm_read_guest
1615      0.3304  vmlinux       vmlinux    free_hot_cold_page

david

Avi Kivity wrote:
> David S. Ahern wrote:
>> I added the traces and captured data over another apparent lockup of
>> the guest.
>> This seems to be representative of the sequence (pid/vcpu removed).
>>
>> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>> (+3632) VMENTRY
>> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61c8 ]
>> (+ 54928) VMENTRY
>>
> Can you oprofile the host to see where the 54K cycles are spent?
>
>> (+4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41c5d363 ]
>> (+8432) VMENTRY
>> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
>> (+ 13832) VMENTRY
>>
>> (+5768) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>> (+3712) VMENTRY
>> (+4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61d0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000 3d55d047 ]
>>
> This indeed has the accessed bit clear.
>
>> (+8640) VMENTRY
>> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
>> (+ 14160) VMENTRY
>>
>> I can forward a more complete time snippet if you'd like. vcpu0 + corresponding
>> vcpu1 files have 85000 total lines and compressed the files total ~500k.
>>
>> I did not see the FLOODED trace come out during this sample though I did bump
>> the count from 3 to 4 as you suggested.
>>
> Bumping the count was supposed to remove the flooding...
>
>> Correlating rip addresses to the 2.4 kernel:
>>
>> c0160d00-c0161290 = page_referenced
>>
>> It looks like the event is kscand running through the pages. I suspected this
>> some time ago, and tried tweaking the kscand_work_percent sysctl variable. It
>> appeared to lower the peak of the spikes, but maybe I imagined it. I believe
>> lowering that value makes kscand wake up more often but do less work (page
>> scanning) each time it is awakened.
>>
> What does 'top' in the guest show (perhaps sorted by total cpu time
> rather than instantaneous usage)?
>
> What host kernel are you running? How many host cpus?
>
^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-21  9:19         ` Avi Kivity
  2008-04-21 17:07           ` David S. Ahern
@ 2008-04-22 20:23           ` David S. Ahern
  2008-04-23  8:04             ` Avi Kivity
  1 sibling, 1 reply; 73+ messages in thread

From: David S. Ahern @ 2008-04-22 20:23 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel

I added tracers to kvm_mmu_page_fault() that include collecting tsc cycles:

1. before vcpu->arch.mmu.page_fault()
2. after vcpu->arch.mmu.page_fault()
3. after mmu_topup_memory_caches()
4. after emulate_instruction()

So the deltas in the trace reports show:
- cycles required for arch.mmu.page_fault() (tracer 2)
- cycles required for mmu_topup_memory_caches() (tracer 3)
- cycles required for emulate_instruction() (tracer 4)

I captured trace data for ~5 seconds during one of the usual events (again
this time it was due to kscand in the guest). I ran the formatted trace data
through an awk script to summarize:

TSC cycles          tracer2   tracer3   tracer4
     0 -  10,000:    295067    213251    115873
10,001 -  25,000:      7682      1004     98336
25,001 -  50,000:       201        15        36
50,001 - 100,000:    100655         0        10
       > 100,000:       117         0        15

This means vcpu->arch.mmu.page_fault() was called 403,722 times in the roughly
5-second interval: 295,067 times it took < 10,000 cycles, but 100,772 times it
took longer than 50,000 cycles. The page_fault function getting run is
paging64_page_fault.

mmu_topup_memory_caches() and emulate_instruction() were both run 214,270
times, most of them relatively quickly.

Note: I bumped the scheduling priority of the qemu threads to RR 1 so that few
host processes could interrupt it.

david

Avi Kivity wrote:
> David S. Ahern wrote:
>> I added the traces and captured data over another apparent lockup of
>> the guest.
>> This seems to be representative of the sequence (pid/vcpu removed).
>>
>> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>> (+3632) VMENTRY
>> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61c8 ]
>> (+ 54928) VMENTRY
>>
> Can you oprofile the host to see where the 54K cycles are spent?
>
>> (+4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41c5d363 ]
>> (+8432) VMENTRY
>> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
>> (+ 13832) VMENTRY
>>
>> (+5768) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>> (+3712) VMENTRY
>> (+4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61d0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000 3d55d047 ]
>>
> This indeed has the accessed bit clear.
>
>> (+ 65216) VMENTRY
>> (+4232) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 3d598363 ]
>>
> This has the accessed bit set and the user bit clear, and the pte
> pointing at the previous pte_write gpa. Looks like a kmap_atomic().
>
>> (+8640) VMENTRY
>> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
>> (+ 14160) VMENTRY
>>
>> I can forward a more complete time snippet if you'd like. vcpu0 + corresponding
>> vcpu1 files have 85000 total lines and compressed the files total ~500k.
>>
>> I did not see the FLOODED trace come out during this sample though I did bump
>> the count from 3 to 4 as you suggested.
>>
> Bumping the count was supposed to remove the flooding...
>
>> Correlating rip addresses to the 2.4 kernel:
>>
>> c0160d00-c0161290 = page_referenced
>>
>> It looks like the event is kscand running through the pages. I suspected this
>> some time ago, and tried tweaking the kscand_work_percent sysctl variable. It
>> appeared to lower the peak of the spikes, but maybe I imagined it. I believe
>> lowering that value makes kscand wake up more often but do less work (page
>> scanning) each time it is awakened.
>>
> What does 'top' in the guest show (perhaps sorted by total cpu time
> rather than instantaneous usage)?
>
> What host kernel are you running? How many host cpus?
>

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-22 20:23 ` David S. Ahern @ 2008-04-23 8:04 ` Avi Kivity 2008-04-23 15:23 ` David S. Ahern 0 siblings, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-04-23 8:04 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm-devel David S. Ahern wrote: > I added tracers to kvm_mmu_page_fault() that include collecting tsc cycles: > > 1. before vcpu->arch.mmu.page_fault() > 2. after vcpu->arch.mmu.page_fault() > 3. after mmu_topup_memory_caches() > 4. after emulate_instruction() > > So the delta in the trace reports show: > - cycles required for arch.mmu.page_fault (tracer 2) > - cycles required for mmu_topup_memory_caches(tracer 3) > - cycles required for emulate_instruction() (tracer 4) > > I captured trace data for ~5-seconds during one of the usual events (again this > time it was due to kscand in the guest). I ran the formatted trace data through > an awk script to summarize: > > TSC cycles tracer2 tracer3 tracer4 > 0 - 10,000: 295067 213251 115873 > 10,001 - 25,000: 7682 1004 98336 > 25,001 - 50,000: 201 15 36 > 50,001 - 100,000: 100655 0 10 > > 100,000: 117 0 15 > > This means vcpu->arch.mmu.page_fault() was called 403,722 times in the roughyl > 5-second interval: 295,067 times it took < 10,000 cycles, but 100,772 times it > took longer than 50,000 cycles. The page_fault function getting run is > paging64_page_fault. > > This does look like the fork detector. Once in every four faults, it triggers and the fault becomes slow. 100K floods == 100K page tables == 200GB of virtual memory, which seems excessive. Is this running a forked load like apache, with many processes? How much memory is on the guest, and is there any memory pressure? > mmu_topup_memory_caches() and emulate_instruction() were both run 214,270 times, > most of them relatively quickly. > b > Note: I bumped the scheduling priority of the qemu threads to RR 1 so that few > host processes could interrupt it. 
> > david > > > Avi Kivity wrote: > >> David S. Ahern wrote: >> >>> I added the traces and captured data over another apparent lockup of >>> the guest. >>> This seems to be representative of the sequence (pid/vcpu removed). >>> >>> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 >>> c016127c ] >>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 >>> c0009db4 ] >>> (+3632) VMENTRY >>> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 >>> c016104a ] >>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 >>> fffb61c8 ] >>> (+ 54928) VMENTRY >>> >>> >> Can you oprofile the host to see where the 54K cycles are spent? >> >> >>> (+4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 >>> c01610e7 ] >>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 >>> c0009db4 ] >>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 >>> 41c5d363 ] >>> (+8432) VMENTRY >>> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 >>> c01610ee ] >>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 >>> c0009db0 ] >>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 >>> 00000000 ] >>> (+ 13832) VMENTRY >>> >>> >>> (+5768) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 >>> c016127c ] >>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 >>> c0009db4 ] >>> (+3712) VMENTRY >>> (+4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 >>> c016104a ] >>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 >>> fffb61d0 ] >>> (+ 0) PTE_WRITE [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000 >>> 3d55d047 ] >>> >>> >> This indeed has the accessed bit clear. 
>> >> >>> (+ 65216) VMENTRY >>> (+4232) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 >>> c01610e7 ] >>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 >>> c0009db4 ] >>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 >>> 3d598363 ] >>> >>> >> This has the accessed bit set and the user bit clear, and the pte >> pointing at the previous pte_write gpa. Looks like a kmap_atomic(). >> >> >>> (+8640) VMENTRY >>> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 >>> c01610ee ] >>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 >>> c0009db0 ] >>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 >>> 00000000 ] >>> (+ 14160) VMENTRY >>> >>> I can forward a more complete time snippet if you'd like. vcpu0 + >>> corresponding >>> vcpu1 files have 85000 total lines and compressed the files total ~500k. >>> >>> I did not see the FLOODED trace come out during this sample though I >>> did bump >>> the count from 3 to 4 as you suggested. >>> >>> >>> >>> >> Bumping the count was supposed to remove the flooding... >> >> >>> Correlating rip addresses to the 2.4 kernel: >>> >>> c0160d00-c0161290 = page_referenced >>> >>> It looks like the event is kscand running through the pages. I >>> suspected this >>> some time ago, and tried tweaking the kscand_work_percent sysctl >>> variable. It >>> appeared to lower the peak of the spikes, but maybe I imagined it. I >>> believe >>> lowering that value makes kscand wake up more often but do less work >>> (page >>> scanning) each time it is awakened. >>> >>> >>> >> What does 'top' in the guest show (perhaps sorted by total cpu time >> rather than instantaneous usage)? >> >> What host kernel are you running? How many host cpus? >> >> -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. 
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-23 8:04 ` Avi Kivity @ 2008-04-23 15:23 ` David S. Ahern 2008-04-23 15:53 ` Avi Kivity 2008-04-25 17:33 ` David S. Ahern 0 siblings, 2 replies; 73+ messages in thread From: David S. Ahern @ 2008-04-23 15:23 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm-devel >> Avi Kivity wrote: >> >>> David S. Ahern wrote: >>> >>>> I added the traces and captured data over another apparent lockup of >>>> the guest. >>>> This seems to be representative of the sequence (pid/vcpu removed). >>>> >>>> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 >>>> c016127c ] >>>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 >>>> c0009db4 ] >>>> (+3632) VMENTRY >>>> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 >>>> c016104a ] >>>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 >>>> fffb61c8 ] >>>> (+ 54928) VMENTRY >>>> >>> Can you oprofile the host to see where the 54K cycles are spent? >>> >>> I've continued drilling down with the tracers to answer that question. I have done runs with tracers in paging64_page_fault and it showed the overhead is with the fetch() function. On my last run the tracers are in paging64_fetch() as follows: 1. after is_present_pte() check 2. before kvm_mmu_get_page() 3. after kvm_mmu_get_page() 4. after if (!metaphysical) {} The delta between 2 and 3 shows the cycles to run kvm_mmu_get_page(). The delta between 3 and 4 shows the cycles to run kvm_read_guest_atomic(), if it is run. Tracer1 dumps vcpu->arch.last_pt_write_count (a carryover from when the new tracers were in paging64_page_fault); tracer2 dumps the level, metaphysical and access variables; tracer5 dumps value in shadow_ent. 
A representative trace sample is: (+ 4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ] (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ] (+ 2664) PAGE_FAULT1 [ write_count = 0 ] (+ 472) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ] (+ 50416) PAGE_FAULT3 (+ 472) PAGE_FAULT4 (+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x80000000 9276d043 ] (+ 1528) VMENTRY (+ 4992) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ] (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ] (+ 2296) PAGE_FAULT1 [ write_count = 0 ] (+ 816) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809041 ] (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 4176d363 ] (+ 6424) VMENTRY (+ 3864) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ] (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ] (+ 2496) PAGE_FAULT1 [ write_count = 1 ] (+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809041 ] (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ] (+ 10248) VMENTRY (+ 4744) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ] (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ] (+ 2408) PAGE_FAULT1 [ write_count = 2 ] (+ 760) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809043 ] (+ 1240) VMENTRY (+ 4624) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ] (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ] (+ 2512) PAGE_FAULT1 [ write_count = 0 ] (+ 496) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ] (+ 48664) PAGE_FAULT3 (+ 472) PAGE_FAULT4 (+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x80000000 9272d043 ] (+ 1576) VMENTRY So basically every 4th trip through the fetch function it runs kvm_mmu_get_page(). A summary of the entire trace file shows this function rarely executes in less than 50,000 cycles. Also, vcpu->arch.last_pt_write_count is always 0 when the high cycles are hit. 
More tidbits: - The hugepage option seems to have no effect -- the system spikes and overhead occurs with and without the hugepage option (above data is with it). - As the guest runs for hours, the intensity of the spikes drop though they still occur regularly and kscand continues to be the primary suspect. qemu's RSS tends to the guests memory allotment of 2GB. Internally guest memory usage runs at ~1GB page cache, 57M buffers, 24M swap, ~800MB for processes. - I have looked at process creation and do not see a strong correlation between system time spikes and number of new processes. So far the only correlations seem to be kscand and amount of memory used. ie., stock RHEL3 with few processes shows tiny spikes whereas my tests with 90+ processes using about 800M plus a continually updating page cache (ie., moderate IO levels) the spikes are strong and last for seconds. - Time runs really fast in the guest, gaining several minutes in 24-hours. I'll download your kvm_stat update and give it a try. When I started this investigation I was using Christian's kvmstat script which dumped stats to a file. Plots of that data did not show a strong correlation with guest system time. david ------------------------------------------------------------------------- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-23 15:23 ` David S. Ahern @ 2008-04-23 15:53 ` Avi Kivity 2008-04-23 16:39 ` David S. Ahern 2008-04-25 17:33 ` David S. Ahern 1 sibling, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-04-23 15:53 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm-devel David S. Ahern wrote: > I've continued drilling down with the tracers to answer that question. I have > done runs with tracers in paging64_page_fault and it showed the overhead is with > the fetch() function. On my last run the tracers are in paging64_fetch() as follows: > > 1. after is_present_pte() check > 2. before kvm_mmu_get_page() > 3. after kvm_mmu_get_page() > 4. after if (!metaphysical) {} > > The delta between 2 and 3 shows the cycles to run kvm_mmu_get_page(). The delta > between 3 and 4 shows the cycles to run kvm_read_guest_atomic(), if it is run. > Tracer1 dumps vcpu->arch.last_pt_write_count (a carryover from when the new > tracers were in paging64_page_fault); tracer2 dumps the level, metaphysical and > access variables; tracer5 dumps value in shadow_ent. 
> > A representative trace sample is: > > (+ 4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ] > (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ] > (+ 2664) PAGE_FAULT1 [ write_count = 0 ] > (+ 472) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ] > (+ 50416) PAGE_FAULT3 > (+ 472) PAGE_FAULT4 > (+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x80000000 9276d043 ] > (+ 1528) VMENTRY > (+ 4992) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ] > (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ] > (+ 2296) PAGE_FAULT1 [ write_count = 0 ] > (+ 816) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809041 ] > (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 4176d363 ] > (+ 6424) VMENTRY > (+ 3864) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ] > (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ] > (+ 2496) PAGE_FAULT1 [ write_count = 1 ] > (+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809041 ] > (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ] > (+ 10248) VMENTRY > (+ 4744) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ] > (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ] > (+ 2408) PAGE_FAULT1 [ write_count = 2 ] > (+ 760) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809043 ] > (+ 1240) VMENTRY > (+ 4624) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ] > (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ] > (+ 2512) PAGE_FAULT1 [ write_count = 0 ] > (+ 496) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ] > (+ 48664) PAGE_FAULT3 > (+ 472) PAGE_FAULT4 > (+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x80000000 9272d043 ] > (+ 1576) VMENTRY > > So basically every 4th trip through the fetch function it runs > kvm_mmu_get_page(). A summary of the entire trace file shows this function > rarely executes in less than 50,000 cycles. 
Also, vcpu->arch.last_pt_write_count > is always 0 when the high cycles are hit. > > Ah! The flood detector is not seeing the access through the kmap_atomic() pte, because that access has gone through the emulator. last_updated_pte_accessed(vcpu) will never return true. Can you verify that last_updated_pte_accessed(vcpu) indeed always returns false? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ------------------------------------------------------------------------- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-23 15:53 ` Avi Kivity @ 2008-04-23 16:39 ` David S. Ahern 2008-04-24 17:25 ` David S. Ahern 2008-04-26 6:20 ` Avi Kivity 0 siblings, 2 replies; 73+ messages in thread From: David S. Ahern @ 2008-04-23 16:39 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm-devel Avi Kivity wrote: > > Ah! The flood detector is not seeing the access through the > kmap_atomic() pte, because that access has gone through the emulator. > last_updated_pte_accessed(vcpu) will never return true. > > Can you verify that last_updated_pte_accessed(vcpu) indeed always > returns false? > It returns both true and false. I added a tracer to kvm_mmu_pte_write() to dump the rc of last_updated_pte_accessed(vcpu). ie., pte_access = last_updated_pte_accessed(vcpu); KVMTRACE_1D(PTE_ACCESS, vcpu, (u32) pte_access, handler); A sample: (+ 4488) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ] (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ] (+ 2480) PAGE_FAULT1 [ write_count = 0 ] (+ 424) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ] (+ 51672) PAGE_FAULT3 (+ 472) PAGE_FAULT4 (+ 704) PAGE_FAULT5 [ shadow_ent = 0x80000001 2dfb5043 ] (+ 1496) VMENTRY (+ 4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ] (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ] (+ 2352) PAGE_FAULT1 [ write_count = 0 ] (+ 728) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ] (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41fb5363 ] (+ 0) PTE_ACCESS [ pte_access = 1 ] (+ 6864) VMENTRY (+ 3896) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ] (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ] (+ 2376) PAGE_FAULT1 [ write_count = 1 ] (+ 720) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ] (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ] (+ 0) PTE_ACCESS [ pte_access = 0 ] (+ 12344) VMENTRY (+ 4688) VMEXIT [ 
exitcode = 0x00000000, rip = 0x00000000 c016127c ] (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ] (+ 2416) PAGE_FAULT1 [ write_count = 2 ] (+ 792) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409043 ] (+ 1128) VMENTRY (+ 4512) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ] (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ] (+ 2448) PAGE_FAULT1 [ write_count = 0 ] (+ 448) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ] (+ 51520) PAGE_FAULT3 (+ 432) PAGE_FAULT4 (+ 696) PAGE_FAULT5 [ shadow_ent = 0x80000001 2df5a043 ] (+ 1480) VMENTRY david ------------------------------------------------------------------------- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-23 16:39 ` David S. Ahern @ 2008-04-24 17:25 ` David S. Ahern 2008-04-26 6:43 ` Avi Kivity 2008-04-26 6:20 ` Avi Kivity 1 sibling, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-04-24 17:25 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm-devel What is the rip (GUEST_RIP) value in the VMEXIT trace output? Is that the current instruction pointer for the guest? I take it the virt in the PAGE_FAULT trace output is the virtual address the guest was referencing when the page fault occurred. What I don't understand (one of many things really) is what the 0xfffb63b0 corresponds to in the guest. Any ideas? Also, the expensive page fault occurs on errorcode = 0x0000000b (PAGE_FAULT trace data). What does the 4th bit in 0xb mean? bit 0 set means PFERR_PRESENT_MASK is set, and bit 1 means PT_WRITABLE_MASK. What is bit 3? david David S. Ahern wrote: > > Avi Kivity wrote: >> Ah! The flood detector is not seeing the access through the >> kmap_atomic() pte, because that access has gone through the emulator. >> last_updated_pte_accessed(vcpu) will never return true. >> >> Can you verify that last_updated_pte_accessed(vcpu) indeed always >> returns false? >> > > It returns both true and false. I added a tracer to kvm_mmu_pte_write() to dump > the rc of last_updated_pte_accessed(vcpu). 
ie., > pte_access = last_updated_pte_accessed(vcpu); > KVMTRACE_1D(PTE_ACCESS, vcpu, (u32) pte_access, handler); > > A sample: > > (+ 4488) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ] > (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ] > (+ 2480) PAGE_FAULT1 [ write_count = 0 ] > (+ 424) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ] > (+ 51672) PAGE_FAULT3 > (+ 472) PAGE_FAULT4 > (+ 704) PAGE_FAULT5 [ shadow_ent = 0x80000001 2dfb5043 ] > (+ 1496) VMENTRY > (+ 4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ] > (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ] > (+ 2352) PAGE_FAULT1 [ write_count = 0 ] > (+ 728) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ] > (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41fb5363 ] > (+ 0) PTE_ACCESS [ pte_access = 1 ] > (+ 6864) VMENTRY > (+ 3896) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ] > (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ] > (+ 2376) PAGE_FAULT1 [ write_count = 1 ] > (+ 720) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ] > (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ] > (+ 0) PTE_ACCESS [ pte_access = 0 ] > (+ 12344) VMENTRY > (+ 4688) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ] > (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ] > (+ 2416) PAGE_FAULT1 [ write_count = 2 ] > (+ 792) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409043 ] > (+ 1128) VMENTRY > (+ 4512) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ] > (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ] > (+ 2448) PAGE_FAULT1 [ write_count = 0 ] > (+ 448) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ] > (+ 51520) PAGE_FAULT3 > (+ 432) PAGE_FAULT4 > (+ 696) PAGE_FAULT5 [ shadow_ent = 0x80000001 2df5a043 ] > (+ 1480) VMENTRY > > > david > 
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-24 17:25 ` David S. Ahern @ 2008-04-26 6:43 ` Avi Kivity 0 siblings, 0 replies; 73+ messages in thread From: Avi Kivity @ 2008-04-26 6:43 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm-devel David S. Ahern wrote: > What is the rip (GUEST_RIP) value in the VMEXIT trace output? Is that the > current instruction pointer for the guest? > > Yes. > I take it the virt in the PAGE_FAULT trace output is the virtual address the > guest was referencing when the page fault occurred. What I don't understand (one > of many things really) is what the 0xfffb63b0 corresponds to in the guest. Any > ideas? > > I'm pretty sure it is the kmap_atomic() pte. The guest wants to update a pte (call it pte1), which is in HIGHMEM, so it doesn't have a permanent mapping for it. It calls kmap_atomic() which sets up another pte (pte2, two writes), and then accesses pte1 through pte2. > Also, the expensive page fault occurs on errorcode = 0x0000000b (PAGE_FAULT > trace data). What does the 4th bit in 0xb mean? bit 0 set means > PFERR_PRESENT_MASK is set, and bit 1 means PT_WRITABLE_MASK. What is bit 3? > Bit 3 is the reserved bit, which means the shadow pte has an illegal bit combination. kvm sets up vmx to forward non-present page faults (bit 0 clear) directly to the guest, so it needs some other pattern to get a trapping fault. IOW, there are two types of non-present shadow ptes in kvm: trapping ones (where we don't know what the guest pte looks like) and non-trapping ones (where we know the guest pte is not present, so we forward the fault directly to the guest). The first type is encoded with the reserved bit and present bit set, the second with both of them clear. You can disable this trickery using the bypass_guest_pf module parameter. It should be useful to try it; we'll see the forwarded faults as well. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. 
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-23 16:39 ` David S. Ahern 2008-04-24 17:25 ` David S. Ahern @ 2008-04-26 6:20 ` Avi Kivity 1 sibling, 0 replies; 73+ messages in thread From: Avi Kivity @ 2008-04-26 6:20 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm-devel David S. Ahern wrote: > Avi Kivity wrote: > >> Ah! The flood detector is not seeing the access through the >> kmap_atomic() pte, because that access has gone through the emulator. >> last_updated_pte_accessed(vcpu) will never return true. >> >> Can you verify that last_updated_pte_accessed(vcpu) indeed always >> returns false? >> >> > > It returns both true and false. I added a tracer to kvm_mmu_pte_write() to dump > the rc of last_updated_pte_accessed(vcpu). ie., > pte_access = last_updated_pte_accessed(vcpu); > KVMTRACE_1D(PTE_ACCESS, vcpu, (u32) pte_access, handler); > > A sample: > > (+ 4488) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ] > (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ] > (+ 2480) PAGE_FAULT1 [ write_count = 0 ] > (+ 424) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ] > (+ 51672) PAGE_FAULT3 > (+ 472) PAGE_FAULT4 > (+ 704) PAGE_FAULT5 [ shadow_ent = 0x80000001 2dfb5043 ] > (+ 1496) VMENTRY > (+ 4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ] > (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ] > (+ 2352) PAGE_FAULT1 [ write_count = 0 ] > (+ 728) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ] > (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41fb5363 ] > (+ 0) PTE_ACCESS [ pte_access = 1 ] > (+ 6864) VMENTRY > (+ 3896) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ] > (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ] > (+ 2376) PAGE_FAULT1 [ write_count = 1 ] > (+ 720) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ] > (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ] > (+ 0) 
PTE_ACCESS [ pte_access = 0 ] > (+ 12344) VMENTRY > (+ 4688) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ] > (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ] > (+ 2416) PAGE_FAULT1 [ write_count = 2 ] > (+ 792) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409043 ] > (+ 1128) VMENTRY > (+ 4512) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ] > (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ] > (+ 2448) PAGE_FAULT1 [ write_count = 0 ] > (+ 448) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ] > (+ 51520) PAGE_FAULT3 > (+ 432) PAGE_FAULT4 > (+ 696) PAGE_FAULT5 [ shadow_ent = 0x80000001 2df5a043 ] > (+ 1480) VMENTRY > > Strange... there should be at least two pte_access = 0 traces in there before flooding can occur, according to my reading of the code. The counter needs to go up to 3 somehow. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ------------------------------------------------------------------------- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-23 15:23 ` David S. Ahern 2008-04-23 15:53 ` Avi Kivity @ 2008-04-25 17:33 ` David S. Ahern 2008-04-26 6:45 ` Avi Kivity 2008-04-28 18:15 ` Marcelo Tosatti 1 sibling, 2 replies; 73+ messages in thread From: David S. Ahern @ 2008-04-25 17:33 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm-devel David S. Ahern wrote: > Avi Kivity wrote: > >> David S. Ahern wrote: >> >>> I added the traces and captured data over another apparent lockup of >>> the guest. >>> This seems to be representative of the sequence (pid/vcpu removed). >>> >>> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 >>> c016127c ] >>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 >>> c0009db4 ] >>> (+3632) VMENTRY >>> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 >>> c016104a ] >>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 >>> fffb61c8 ] >>> (+ 54928) VMENTRY >>> >> Can you oprofile the host to see where the 54K cycles are spent? >> Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page(): for (i = 0; i < PT64_ENT_PER_PAGE; ++i) { gpa_t pte_gpa = gfn_to_gpa(sp->gfn); pte_gpa += (i+offset) * sizeof(pt_element_t); r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt, sizeof(pt_element_t)); if (r || is_present_pte(pt)) sp->spt[i] = shadow_trap_nonpresent_pte; else sp->spt[i] = shadow_notrap_nonpresent_pte; } This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per loop. This function gets run >20,000/sec during some of the kscand loops. david ------------------------------------------------------------------------- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-25 17:33 ` David S. Ahern @ 2008-04-26 6:45 ` Avi Kivity 2008-04-28 18:15 ` Marcelo Tosatti 1 sibling, 0 replies; 73+ messages in thread From: Avi Kivity @ 2008-04-26 6:45 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm-devel David S. Ahern wrote: > David S. Ahern wrote: > >> Avi Kivity wrote: >> >> >>> David S. Ahern wrote: >>> >>> >>>> I added the traces and captured data over another apparent lockup of >>>> the guest. >>>> This seems to be representative of the sequence (pid/vcpu removed). >>>> >>>> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 >>>> c016127c ] >>>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 >>>> c0009db4 ] >>>> (+3632) VMENTRY >>>> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 >>>> c016104a ] >>>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 >>>> fffb61c8 ] >>>> (+ 54928) VMENTRY >>>> >>>> >>> Can you oprofile the host to see where the 54K cycles are spent? >>> >>> > > Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page(): > > for (i = 0; i < PT64_ENT_PER_PAGE; ++i) { > gpa_t pte_gpa = gfn_to_gpa(sp->gfn); > pte_gpa += (i+offset) * sizeof(pt_element_t); > > r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt, > sizeof(pt_element_t)); > if (r || is_present_pte(pt)) > sp->spt[i] = shadow_trap_nonpresent_pte; > else > sp->spt[i] = shadow_notrap_nonpresent_pte; > } > > This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per > loop. > > This function gets run >20,000/sec during some of the kscand loops. > > We really ought to optimize it. That's second order however. The real fix is making sure it isn't called so often. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. 
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-25 17:33 ` David S. Ahern 2008-04-26 6:45 ` Avi Kivity @ 2008-04-28 18:15 ` Marcelo Tosatti 2008-04-28 23:45 ` David S. Ahern 1 sibling, 1 reply; 73+ messages in thread From: Marcelo Tosatti @ 2008-04-28 18:15 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm-devel, Avi Kivity On Fri, Apr 25, 2008 at 11:33:18AM -0600, David S. Ahern wrote: > Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page(): > > for (i = 0; i < PT64_ENT_PER_PAGE; ++i) { > gpa_t pte_gpa = gfn_to_gpa(sp->gfn); > pte_gpa += (i+offset) * sizeof(pt_element_t); > > r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt, > sizeof(pt_element_t)); > if (r || is_present_pte(pt)) > sp->spt[i] = shadow_trap_nonpresent_pte; > else > sp->spt[i] = shadow_notrap_nonpresent_pte; > } > > This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per > loop. > > This function gets run >20,000/sec during some of the kscand loops. Hi David, Do you see the mmu_recycled counter increase? ------------------------------------------------------------------------- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-28 18:15 ` Marcelo Tosatti @ 2008-04-28 23:45 ` David S. Ahern 2008-04-30 4:18 ` David S. Ahern 0 siblings, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-04-28 23:45 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: kvm-devel, Avi Kivity Hi Marcelo: mmu_recycled is always 0 for this guest -- even after almost 4 hours of uptime. Here is a kvm_stat sample where guest time was very high and qemu had 2 processors at 100% on the host. I removed counters where both columns have 0 value for brevity. exits 45937979 758051 fpu_reload 1416831 87 halt_exits 112911 0 halt_wakeup 31771 0 host_state_reload 2068602 263 insn_emulation 21601480 365493 io_exits 1827374 2705 irq_exits 8934818 285196 mmio_exits 421674 147 mmu_cache_miss 4817689 93680 mmu_flooded 4815273 93680 mmu_pde_zapped 51344 0 mmu_prefetch 4817625 93680 mmu_pte_updated 14803298 270104 mmu_pte_write 19859863 363785 mmu_shadow_zapped 4832106 93679 pf_fixed 32184355 468398 pf_guest 264138 0 remote_tlb_flush 10697762 280522 tlb_flush 10301338 176424 (NOTE: This is for a *5* second sample interval instead of 1 to allow me to capture the data). Here's a sample when the guest is "well-behaved" (system time <10%, though ): exits 51502194 97453 fpu_reload 1421736 227 halt_exits 138361 1927 halt_wakeup 33047 117 host_state_reload 2110190 3740 insn_emulation 24367441 47260 io_exits 1874075 2576 irq_exits 10224702 13333 mmio_exits 435154 1726 mmu_cache_miss 5414097 11258 mmu_flooded 5411548 11243 mmu_pde_zapped 52851 44 mmu_prefetch 5414031 11258 mmu_pte_updated 16854686 29901 mmu_pte_write 22526765 42285 mmu_shadow_zapped 5430025 11313 pf_fixed 36144578 67666 pf_guest 282794 430 remote_tlb_flush 12126268 14619 tlb_flush 11753162 21460 There is definitely a strong correlation between the mmu counters and high system times in the guest. 
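The kvm_stat deltas above are over 5-second windows; normalizing them to per-second rates (a quick sketch using the figures from the high-system-time sample above) shows the prefetch counter running close to the >20,000/sec call rate reported for paging64_prefetch_page():

```python
# Normalize the 5-second kvm_stat deltas from the "bad" sample above
# to per-second rates.
INTERVAL = 5  # seconds per kvm_stat sample in this capture
bad_sample = {
    "exits":          758051,
    "mmu_cache_miss":  93680,
    "mmu_prefetch":    93680,
    "pf_fixed":       468398,
}
rates = {name: delta // INTERVAL for name, delta in bad_sample.items()}
print(rates["mmu_prefetch"])  # 18736 prefetch calls/sec
```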
I am still trying to find out what in the guest is stimulating it when running on RHEL3; I do not see this same behavior for an equivalent setup running on RHEL4. By the way I added an mmu_prefetch stat in prefetch_page() to count the number of times the for() loop is hit with PTTYPE == 64; ie., number of times paging64_prefetch_page() is invoked. (I wanted an explicit counter for this loop, though the info seems to duplicate other entries.) That counter is listed above. As I mentioned in a prior post when kscand kicks in the change in mmu_prefetch counter is at 20,000+/sec, with each trip through that function taking 45k+ cycles. kscand is an instigator shortly after boot, however, kscand is *not* the culprit once the system has been up for 30-45 minutes. I have started instrumenting the RHEL3U8 kernel and for the load I am running kscand does not walk the active lists very often once the system is up. So, to dig deeper on what in the guest is stimulating the mmu I collected kvmtrace data for about a 2 minute time interval which caught about a 30-second period where guest system time was steady in the 25-30% range. Summarizing the number of times a RIP appears in an VMEXIT shows the following high runners: count RIP RHEL3-symbol 82549 0xc0140e42 follow_page [kernel] c0140d90 offset b2 42532 0xc0144760 handle_mm_fault [kernel] c01446d0 offset 90 36826 0xc013da4a futex_wait [kernel] c013d870 offset 1da 29987 0xc0145cd0 zap_pte_range [kernel] c0145c10 offset c0 27451 0xc0144018 do_no_page [kernel] c0143e20 offset 1f8 (halt entry removed the list since that is the ideal scenario for an exit). So the RIP correlates to follow_page() for a large percentage of the VMEXITs. I wrote an awk script to summarize (histogram style) the TSC cycles between VMEXIT and VMENTRY for an address. For the first rip, 0xc0140e42, 82,271 times (ie., almost 100% of the time) the trace shows a delta between 50k and 100k cycles between the VMEXIT and the subsequent VMENTRY. 
Similarly for the second one, 0xc0144760, 42403 times (again almost 100% of the occurrences) the trace shows a delta between 50k and 100k cycles between VMEXIT and VMENTRY. These seems to correlate with the prefetch_page function in kvm, though I am not 100% positive on that. I am now investigating the kernel paths leading to those functions. Any insights would definitely be appreciated. david Marcelo Tosatti wrote: > On Fri, Apr 25, 2008 at 11:33:18AM -0600, David S. Ahern wrote: >> Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page(): >> >> for (i = 0; i < PT64_ENT_PER_PAGE; ++i) { >> gpa_t pte_gpa = gfn_to_gpa(sp->gfn); >> pte_gpa += (i+offset) * sizeof(pt_element_t); >> >> r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt, >> sizeof(pt_element_t)); >> if (r || is_present_pte(pt)) >> sp->spt[i] = shadow_trap_nonpresent_pte; >> else >> sp->spt[i] = shadow_notrap_nonpresent_pte; >> } >> >> This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per >> loop. >> >> This function gets run >20,000/sec during some of the kscand loops. > > Hi David, > > Do you see the mmu_recycled counter increase? > ------------------------------------------------------------------------- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ^ permalink raw reply [flat|nested] 73+ messages in thread
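The awk script itself is not shown in the thread; a rough Python equivalent of the summarization it describes (a sketch — it assumes the kvmtrace output has already been reduced to one `(rip, cycle-delta)` pair per VMEXIT→VMENTRY transition; the actual field extraction depends on the kvmtrace record format, and the bucket bounds here are chosen to match the 50k–100k range discussed above):

```python
from collections import Counter, defaultdict

def summarize(records, bounds=(10_000, 50_000, 100_000, 500_000)):
    """records: iterable of (rip, delta_cycles) pairs, one per
    VMEXIT -> VMENTRY transition.  For each RIP, count how many deltas
    fall under each bucket; a bucket label is the first upper bound
    the delta falls below."""
    hist = defaultdict(Counter)
    for rip, delta in records:
        for bound in bounds:
            if delta < bound:
                hist[rip][f"<{bound}"] += 1
                break
        else:
            hist[rip][f">={bounds[-1]}"] += 1
    return hist

# e.g. two exits at follow_page()'s RIP in the 50k-100k range, one fast exit
h = summarize([(0xc0140e42, 54928), (0xc0140e42, 61200), (0xc0144760, 3632)])
print(h[0xc0140e42]["<100000"])  # 2
```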
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-28 23:45 ` David S. Ahern @ 2008-04-30 4:18 ` David S. Ahern 2008-04-30 9:55 ` Avi Kivity 0 siblings, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-04-30 4:18 UTC (permalink / raw) To: Marcelo Tosatti, Avi Kivity; +Cc: kvm-devel Another tidbit for you guys as I make my way through various permutations: I installed the RHEL3 hugemem kernel and the guest behavior is *much* better. System time still has some regular hiccups that are higher than xen and esx (e.g., 1 minute samples out of 5 show system time between 10 and 15%), but overall guest behavior is good with the hugemem kernel. One side effect I've noticed is that I cannot restart the RHEL3 guest running the hugemem kernel in successive attempts. The guest has 2 vcpus and qemu shows one thread at 100% cpu. If I recall correctly kvm_stat shows a large amount of tlb_flushes (like millions in a 5-second sample). The scenario is: 1. start guest running hugemem kernel, 2. shutdown, 3. restart guest. During 3. it hangs, but at random points. Removing kvm/kvm-intel has no effect - guest still hangs on the restart. Rebooting the host clears the problem. Alternatively, during the hang on a restart I can kill the guest, and then on restart choose the normal, 32-bit smp kernel and the guest boots just fine. At this point I can shutdown the guest and restart with the hugemem kernel and it boots just fine. david David S. Ahern wrote: > Hi Marcelo: > > mmu_recycled is always 0 for this guest -- even after almost 4 hours of uptime. > > Here is a kvm_stat sample where guest time was very high and qemu had 2 > processors at 100% on the host. I removed counters where both columns have 0 > value for brevity. 
> > exits 45937979 758051 > fpu_reload 1416831 87 > halt_exits 112911 0 > halt_wakeup 31771 0 > host_state_reload 2068602 263 > insn_emulation 21601480 365493 > io_exits 1827374 2705 > irq_exits 8934818 285196 > mmio_exits 421674 147 > mmu_cache_miss 4817689 93680 > mmu_flooded 4815273 93680 > mmu_pde_zapped 51344 0 > mmu_prefetch 4817625 93680 > mmu_pte_updated 14803298 270104 > mmu_pte_write 19859863 363785 > mmu_shadow_zapped 4832106 93679 > pf_fixed 32184355 468398 > pf_guest 264138 0 > remote_tlb_flush 10697762 280522 > tlb_flush 10301338 176424 > > (NOTE: This is for a *5* second sample interval instead of 1 to allow me to > capture the data). > > Here's a sample when the guest is "well-behaved" (system time <10%, though ): > exits 51502194 97453 > fpu_reload 1421736 227 > halt_exits 138361 1927 > halt_wakeup 33047 117 > host_state_reload 2110190 3740 > insn_emulation 24367441 47260 > io_exits 1874075 2576 > irq_exits 10224702 13333 > mmio_exits 435154 1726 > mmu_cache_miss 5414097 11258 > mmu_flooded 5411548 11243 > mmu_pde_zapped 52851 44 > mmu_prefetch 5414031 11258 > mmu_pte_updated 16854686 29901 > mmu_pte_write 22526765 42285 > mmu_shadow_zapped 5430025 11313 > pf_fixed 36144578 67666 > pf_guest 282794 430 > remote_tlb_flush 12126268 14619 > tlb_flush 11753162 21460 > > > There is definitely a strong correlation between the mmu counters and high > system times in the guest. I am still trying to find out what in the guest is > stimulating it when running on RHEL3; I do not see this same behavior for an > equivalent setup running on RHEL4. > > By the way I added an mmu_prefetch stat in prefetch_page() to count the number > of times the for() loop is hit with PTTYPE == 64; ie., number of times > paging64_prefetch_page() is invoked. (I wanted an explicit counter for this > loop, though the info seems to duplicate other entries.) That counter is listed > above. 
As I mentioned in a prior post when kscand kicks in the change in > mmu_prefetch counter is at 20,000+/sec, with each trip through that function > taking 45k+ cycles. > > kscand is an instigator shortly after boot, however, kscand is *not* the culprit > once the system has been up for 30-45 minutes. I have started instrumenting the > RHEL3U8 kernel and for the load I am running kscand does not walk the active > lists very often once the system is up. > > So, to dig deeper on what in the guest is stimulating the mmu I collected > kvmtrace data for about a 2 minute time interval which caught about a 30-second > period where guest system time was steady in the 25-30% range. Summarizing the > number of times a RIP appears in an VMEXIT shows the following high runners: > > count RIP RHEL3-symbol > 82549 0xc0140e42 follow_page [kernel] c0140d90 offset b2 > 42532 0xc0144760 handle_mm_fault [kernel] c01446d0 offset 90 > 36826 0xc013da4a futex_wait [kernel] c013d870 offset 1da > 29987 0xc0145cd0 zap_pte_range [kernel] c0145c10 offset c0 > 27451 0xc0144018 do_no_page [kernel] c0143e20 offset 1f8 > > (halt entry removed the list since that is the ideal scenario for an exit). > > So the RIP correlates to follow_page() for a large percentage of the VMEXITs. > > I wrote an awk script to summarize (histogram style) the TSC cycles between > VMEXIT and VMENTRY for an address. For the first rip, 0xc0140e42, 82,271 times > (ie., almost 100% of the time) the trace shows a delta between 50k and 100k > cycles between the VMEXIT and the subsequent VMENTRY. Similarly for the second > one, 0xc0144760, 42403 times (again almost 100% of the occurrences) the trace > shows a delta between 50k and 100k cycles between VMEXIT and VMENTRY. These > seems to correlate with the prefetch_page function in kvm, though I am not 100% > positive on that. > > I am now investigating the kernel paths leading to those functions. Any insights > would definitely be appreciated. 
> > david > > > Marcelo Tosatti wrote: >> On Fri, Apr 25, 2008 at 11:33:18AM -0600, David S. Ahern wrote: >>> Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page(): >>> >>> for (i = 0; i < PT64_ENT_PER_PAGE; ++i) { >>> gpa_t pte_gpa = gfn_to_gpa(sp->gfn); >>> pte_gpa += (i+offset) * sizeof(pt_element_t); >>> >>> r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt, >>> sizeof(pt_element_t)); >>> if (r || is_present_pte(pt)) >>> sp->spt[i] = shadow_trap_nonpresent_pte; >>> else >>> sp->spt[i] = shadow_notrap_nonpresent_pte; >>> } >>> >>> This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per >>> loop. >>> >>> This function gets run >20,000/sec during some of the kscand loops. >> Hi David, >> >> Do you see the mmu_recycled counter increase? >> > ------------------------------------------------------------------------- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-30 4:18 ` David S. Ahern @ 2008-04-30 9:55 ` Avi Kivity 2008-04-30 13:39 ` David S. Ahern 0 siblings, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-04-30 9:55 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti David S. Ahern wrote: > Another tidbit for you guys as I make my way through various permutations: > I installed the RHEL3 hugemem kernel and the guest behavior is *much* better. > System time still has some regular hiccups that are higher than xen and esx > (e.g., 1 minute samples out of 5 show system time between 10 and 15%), but > overall guest behavior is good with the hugemem kernel. > > Wait, the amount of info here is overwhelming. Let's stick with the current kernel (32-bit, HIGHMEM4G, right?) Did you get any traces with bypass_guest_pf=0? That may show more info. -- Any sufficiently difficult bug is indistinguishable from a feature. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-30 9:55 ` Avi Kivity @ 2008-04-30 13:39 ` David S. Ahern 2008-04-30 13:49 ` Avi Kivity 2008-04-30 13:56 ` Daniel P. Berrange 0 siblings, 2 replies; 73+ messages in thread From: David S. Ahern @ 2008-04-30 13:39 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm-devel, Marcelo Tosatti Avi Kivity wrote: > David S. Ahern wrote: >> Another tidbit for you guys as I make my way through various >> permutations: >> I installed the RHEL3 hugemem kernel and the guest behavior is *much* >> better. >> System time still has some regular hiccups that are higher than xen >> and esx >> (e.g., 1 minute samples out of 5 show system time between 10 and 15%), >> but >> overall guest behavior is good with the hugemem kernel. >> >> > > Wait, the amount of info here is overwhelming. Let's stick with the > current kernel (32-bit, HIGHMEM4G, right?) > > Did you get any traces with bypass_guest_pf=0? That may show more info. > My preference is to stick with the "standard", 32-bit RHEL3 kernel in the guest. My point in the last email was that the hugemem kernel shows a remarkable difference (it uses 3-levels of page tables right?). I was hoping that would ring a bell with someone. Adding bypass_guest_pf=0 did not improve the situation. Did you want anything particular with that setting -- like a RIP summary or a summary of exit-entry cycles? david ------------------------------------------------------------------------- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-30 13:39 ` David S. Ahern @ 2008-04-30 13:49 ` Avi Kivity 2008-05-11 12:32 ` Avi Kivity 2008-04-30 13:56 ` Daniel P. Berrange 1 sibling, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-04-30 13:49 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti David S. Ahern wrote: > Avi Kivity wrote: > >> David S. Ahern wrote: >> >>> Another tidbit for you guys as I make my way through various >>> permutations: >>> I installed the RHEL3 hugemem kernel and the guest behavior is *much* >>> better. >>> System time still has some regular hiccups that are higher than xen >>> and esx >>> (e.g., 1 minute samples out of 5 show system time between 10 and 15%), >>> but >>> overall guest behavior is good with the hugemem kernel. >>> >>> >>> >> Wait, the amount of info here is overwhelming. Let's stick with the >> current kernel (32-bit, HIGHMEM4G, right?) >> >> Did you get any traces with bypass_guest_pf=0? That may show more info. >> >> > > My preference is to stick with the "standard", 32-bit RHEL3 kernel in the guest. > Me too. I would like to see all reasonable guests supported well, without performance issues, and not have to tell the user which kernel to use. > My point in the last email was that the hugemem kernel shows a remarkable > difference (it uses 3-levels of page tables right?). I was hoping that would > ring a bell with someone. > From the traces I saw I think the standard kernel is pae as well. Can you verify? I think it's CONFIG_HIGHMEM4G (instead of CONFIG_HIGHMEM64G) but that option may be different for such an old kernel. > Adding bypass_guest_pf=0 did not improve the situation. Did you want anything > particular with that setting -- like a RIP summary or a summary of exit-entry > cycles? > I asked for this thinking bypass_guest_pf may help show more information. But thinking a bit more, it will not. I think I do know what the problem is. I will try it out. 
Is there a free clone (like centos) available somewhere? -- Any sufficiently difficult bug is indistinguishable from a feature. ^ permalink raw reply [flat|nested] 73+ messages in thread
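Avi's question — whether the standard kernel is a PAE build — can be answered from the guest's kernel build config. A minimal sketch (assuming the distro installs the config as /boot/config-<version>, as RHEL does; in mainline terms CONFIG_HIGHMEM64G=y implies a PAE build while CONFIG_HIGHMEM4G=y does not, though as Avi notes the option names may differ on a kernel this old):

```python
import re

def highmem_mode(config_text):
    """Return the CONFIG_HIGHMEM* option enabled ('=y') in a kernel
    build config, or None if none of them is set."""
    for opt in ("CONFIG_HIGHMEM64G", "CONFIG_HIGHMEM4G", "CONFIG_NOHIGHMEM"):
        if re.search(rf"^{opt}=y$", config_text, re.M):
            return opt
    return None

# In a real guest: open(f"/boot/config-{os.uname().release}").read()
sample = "# CONFIG_NOHIGHMEM is not set\nCONFIG_HIGHMEM4G=y\n"
print(highmem_mode(sample))  # CONFIG_HIGHMEM4G
```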
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-04-30 13:49 ` Avi Kivity @ 2008-05-11 12:32 ` Avi Kivity 2008-05-11 13:36 ` Avi Kivity 0 siblings, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-05-11 12:32 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti [-- Attachment #1: Type: text/plain, Size: 602 bytes --] Avi Kivity wrote: > > I asked fo this thinking bypass_guest_pf may help show more > information. But thinking a bit more, it will not. > > I think I do know what the problem is. I will try it out. Is there a > free clone (like centos) available somewhere? This patch tracks down emulated accesses to speculated ptes and marks them as accessed, preventing the flooding on centos-3.1. Unfortunately it also causes a host oops midway through the boot process. I believe the oops is merely exposed by the patch, not caused by it. -- error compiling committee.c: too many arguments to function [-- Attachment #2: prevent-kscand-flooding.patch --] [-- Type: text/x-patch, Size: 2435 bytes --] diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 3d769c3..8c1e7f3 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1127,8 +1127,10 @@ unshadowed: else kvm_release_pfn_clean(pfn); } - if (!ptwrite || !*ptwrite) + if (speculative) { vcpu->arch.last_pte_updated = shadow_pte; + vcpu->arch.last_pte_gfn = gfn; + } } static void nonpaging_new_cr3(struct kvm_vcpu *vcpu) @@ -1674,6 +1676,17 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, vcpu->arch.update_pte.pfn = pfn; } +static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn) +{ + u64 *spte = vcpu->arch.last_pte_updated; + + if (spte + && vcpu->arch.last_pte_gfn == gfn + && shadow_accessed_mask + && !(*spte & shadow_accessed_mask)) + set_bit(PT_ACCESSED_SHIFT, spte); +} + void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes) { @@ -1697,13 +1710,14 @@ void kvm_mmu_pte_write(struct kvm_vcpu 
*vcpu, gpa_t gpa, pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes); mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes); spin_lock(&vcpu->kvm->mmu_lock); + kvm_mmu_access_page(vcpu, gfn); kvm_mmu_free_some_pages(vcpu); ++vcpu->kvm->stat.mmu_pte_write; kvm_mmu_audit(vcpu, "pre pte write"); if (gfn == vcpu->arch.last_pt_write_gfn && !last_updated_pte_accessed(vcpu)) { ++vcpu->arch.last_pt_write_count; - if (vcpu->arch.last_pt_write_count >= 3) + if (vcpu->arch.last_pt_write_count >= 4) flooded = 1; } else { vcpu->arch.last_pt_write_gfn = gfn; diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 1730757..258e5d5 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -15,7 +15,8 @@ #define PT_USER_MASK (1ULL << 2) #define PT_PWT_MASK (1ULL << 3) #define PT_PCD_MASK (1ULL << 4) -#define PT_ACCESSED_MASK (1ULL << 5) +#define PT_ACCESSED_SHIFT 5 +#define PT_ACCESSED_MASK (1ULL << PT_ACCESSED_SHIFT) #define PT_DIRTY_MASK (1ULL << 6) #define PT_PAGE_SIZE_MASK (1ULL << 7) #define PT_PAT_MASK (1ULL << 7) diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h index 1d8cd01..0bdb392 100644 --- a/include/asm-x86/kvm_host.h +++ b/include/asm-x86/kvm_host.h @@ -242,6 +242,7 @@ struct kvm_vcpu_arch { gfn_t last_pt_write_gfn; int last_pt_write_count; u64 *last_pte_updated; + gfn_t last_pte_gfn; struct { gfn_t gfn; /* presumed gfn during guest pte update */ [-- Attachment #3: Type: text/plain, Size: 320 bytes --] ------------------------------------------------------------------------- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone [-- Attachment #4: Type: text/plain, Size: 158 bytes --] _______________________________________________ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel ^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-11 12:32 ` Avi Kivity @ 2008-05-11 13:36 ` Avi Kivity 2008-05-13 3:49 ` David S. Ahern 0 siblings, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-05-11 13:36 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti [-- Attachment #1: Type: text/plain, Size: 706 bytes --] Avi Kivity wrote: > Avi Kivity wrote: >> >> I asked fo this thinking bypass_guest_pf may help show more >> information. But thinking a bit more, it will not. >> >> I think I do know what the problem is. I will try it out. Is there >> a free clone (like centos) available somewhere? > > This patch tracks down emulated accesses to speculated ptes and marks > them as accessed, preventing the flooding on centos-3.1. > Unfortunately it also causes a host oops midway through the boot process. > > I believe the oops is merely exposed by the patch, not caused by it. > It was caused by the patch, please try the updated one attached. 
-- error compiling committee.c: too many arguments to function [-- Attachment #2: prevent-kscand-flooding.patch --] [-- Type: text/x-patch, Size: 2473 bytes --] diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 3d769c3..012e8ad 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1127,8 +1127,10 @@ unshadowed: else kvm_release_pfn_clean(pfn); } - if (!ptwrite || !*ptwrite) + if (speculative) { vcpu->arch.last_pte_updated = shadow_pte; + vcpu->arch.last_pte_gfn = gfn; + } } static void nonpaging_new_cr3(struct kvm_vcpu *vcpu) @@ -1674,6 +1676,18 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, vcpu->arch.update_pte.pfn = pfn; } +static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn) +{ + u64 *spte = vcpu->arch.last_pte_updated; + + if (spte + && vcpu->arch.last_pte_gfn == gfn + && shadow_accessed_mask + && !(*spte & shadow_accessed_mask) + && is_shadow_present_pte(*spte)) + set_bit(PT_ACCESSED_SHIFT, spte); +} + void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes) { @@ -1697,13 +1711,14 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes); mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes); spin_lock(&vcpu->kvm->mmu_lock); + kvm_mmu_access_page(vcpu, gfn); kvm_mmu_free_some_pages(vcpu); ++vcpu->kvm->stat.mmu_pte_write; kvm_mmu_audit(vcpu, "pre pte write"); if (gfn == vcpu->arch.last_pt_write_gfn && !last_updated_pte_accessed(vcpu)) { ++vcpu->arch.last_pt_write_count; - if (vcpu->arch.last_pt_write_count >= 3) + if (vcpu->arch.last_pt_write_count >= 5) flooded = 1; } else { vcpu->arch.last_pt_write_gfn = gfn; diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 1730757..258e5d5 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -15,7 +15,8 @@ #define PT_USER_MASK (1ULL << 2) #define PT_PWT_MASK (1ULL << 3) #define PT_PCD_MASK (1ULL << 4) -#define PT_ACCESSED_MASK (1ULL << 5) +#define 
PT_ACCESSED_SHIFT 5 +#define PT_ACCESSED_MASK (1ULL << PT_ACCESSED_SHIFT) #define PT_DIRTY_MASK (1ULL << 6) #define PT_PAGE_SIZE_MASK (1ULL << 7) #define PT_PAT_MASK (1ULL << 7) diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h index 1d8cd01..0bdb392 100644 --- a/include/asm-x86/kvm_host.h +++ b/include/asm-x86/kvm_host.h @@ -242,6 +242,7 @@ struct kvm_vcpu_arch { gfn_t last_pt_write_gfn; int last_pt_write_count; u64 *last_pte_updated; + gfn_t last_pte_gfn; struct { gfn_t gfn; /* presumed gfn during guest pte update */ [-- Attachment #3: Type: text/plain, Size: 320 bytes --] ------------------------------------------------------------------------- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone [-- Attachment #4: Type: text/plain, Size: 158 bytes --] _______________________________________________ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel ^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-11 13:36 ` Avi Kivity @ 2008-05-13 3:49 ` David S. Ahern 2008-05-13 7:25 ` Avi Kivity 0 siblings, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-05-13 3:49 UTC (permalink / raw) To: Avi Kivity, kvm-devel That does the trick with kscand. Do you have recommendations for clock source settings? For example in my test case for this patch the guest gained 73 seconds (ahead of real time) after only 3 hours, 5 min of uptime. thanks, david Avi Kivity wrote: > Avi Kivity wrote: >> Avi Kivity wrote: >>> >>> I asked fo this thinking bypass_guest_pf may help show more >>> information. But thinking a bit more, it will not. >>> >>> I think I do know what the problem is. I will try it out. Is there >>> a free clone (like centos) available somewhere? >> >> This patch tracks down emulated accesses to speculated ptes and marks >> them as accessed, preventing the flooding on centos-3.1. >> Unfortunately it also causes a host oops midway through the boot process. >> >> I believe the oops is merely exposed by the patch, not caused by it. >> > > It was caused by the patch, please try the updated one attached. > > > ------------------------------------------------------------------------ > > ------------------------------------------------------------------------- > This SF.net email is sponsored by the 2008 JavaOne(SM) Conference > Don't miss this year's exciting event. There's still time to save $100. > Use priority code J8TL2D2. > http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone > > > ------------------------------------------------------------------------ > > _______________________________________________ > kvm-devel mailing list > kvm-devel@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/kvm-devel ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. 
Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-13 3:49 ` David S. Ahern @ 2008-05-13 7:25 ` Avi Kivity 2008-05-14 20:35 ` David S. Ahern 0 siblings, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-05-13 7:25 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm-devel David S. Ahern wrote: > That does the trick with kscand. > > Not so fast... the patch updates the flood count to 5. Can you check if a lower value still works? Also, whether updating the flood count to 5 (without the rest of the patch) works? Unconditionally bumping the flood count to 5 will likely cause a performance regression on other guests. While I was able to see excessive flooding, I couldn't reproduce your kscand problem. Running /bin/true always returned immediately for me. > Do you have recommendations for clock source settings? For example in my > test case for this patch the guest gained 73 seconds (ahead of real > time) after only 3 hours, 5 min of uptime. > The kernel is trying to correlate tsc and pit, which isn't going to work. Try disabling the tsc: set edx.bit4=0 for cpuid.eax=1 in qemu-kvm-x86.c do_cpuid_ent(). -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 73+ messages in thread
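The flag Avi refers to is the TSC feature bit, bit 4 of EDX in CPUID leaf 1. Shown as a plain bit operation (a sketch of the masking only, not actual qemu-kvm do_cpuid_ent() code; the sample EDX value is an arbitrary illustration):

```python
CPUID_1_EDX_TSC = 1 << 4  # TSC feature flag in CPUID.EAX=1:EDX

def hide_tsc(edx):
    """Clear the TSC feature bit so the guest kernel stops trying to
    correlate the TSC with the PIT and falls back to the PIT alone."""
    return edx & ~CPUID_1_EDX_TSC

print(hex(hide_tsc(0xBFEBFBFF)))  # 0xbfebfbef
```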
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-13 7:25 ` Avi Kivity @ 2008-05-14 20:35 ` David S. Ahern 2008-05-15 10:53 ` Avi Kivity 0 siblings, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-05-14 20:35 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm-devel Avi Kivity wrote: > Not so fast... the patch updates the flood count to 5. Can you check > if a lower value still works? Also, whether updating the flood count to > 5 (without the rest of the patch) works? > > Unconditionally bumping the flood count to 5 will likely cause a > performance regression on other guests. I put the flood count back to 3, and the RHEL3 guest performance is even better. > > While I was able to see excessive flooding, I couldn't reproduce your > kscand problem. Running /bin/true always returned immediately for me. Running /bin/true was a poor attempt on my part at finding a simplistic, minimal re-create. The use case I am investigating has over 500 processes/threads with a base memory consumption around 1GB. I was finding it nearly impossible to come up with a generic re-create of the problem for you to use in your investigations on CentOS. Thanks for the patch. david ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-14 20:35 ` David S. Ahern @ 2008-05-15 10:53 ` Avi Kivity 2008-05-17 4:31 ` David S. Ahern 0 siblings, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-05-15 10:53 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm-devel David S. Ahern wrote: > Avi Kivity wrote: > >> Not so fast... the patch updates the flood count to 5. Can you check >> if a lower value still works? Also, whether updating the flood count to >> 5 (without the rest of the patch) works? >> >> Unconditionally bumping the flood count to 5 will likely cause a >> performance regression on other guests. >> > > I put the flood count back to 3, and the RHEL3 guest performance is even > better. > > Okay, I committed the patch without the flood count == 5. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-15 10:53 ` Avi Kivity @ 2008-05-17 4:31 ` David S. Ahern [not found] ` <482FCEE1.5040306@qumranet.com> 0 siblings, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-05-17 4:31 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm-devel [-- Attachment #1: Type: text/plain, Size: 1092 bytes --] Avi Kivity wrote: > > Okay, I committed the patch without the flood count == 5. > I've continued testing the RHEL3 guests with the flood count at 3, and I am right back to where I started. With the patch and the flood count at 3, I had 2 runs totaling around 24 hours that looked really good. Now, I am back to square one. I guess the short of it is that I am not sure if the patch resolves this issue or not. If you want to back it out, I can continue to apply it on my end as I continue testing. A snapshot of kvm_stat -f 'mmu*' -l is attached for two test runs with the patch (line wrap is horrible inline). I will work on creating an app that will stimulate kscand activity similar to what I am seeing. Also, in a prior e-mail I mentioned guest time advancing rapidly. I've noticed that with the -no-kvm-pit option the guest time is much better and typically stays within 3 seconds or so of the host, even through the high kscand activity which is one instance of when I've noticed time jumps with the kernel pit. Yes, this result has been repeatable through 6 or so runs. 
:-)

david

[-- Attachment #2: kvm-stats-rhel3 --]
[-- Type: text/plain, Size: 4102 bytes --]

kvm-68 with Avi's patch and flood threshold at 3:

 mmio_exit  mmu_cache  mmu_flood  mmu_pde_z  mmu_pte_u  mmu_pte_w  mmu_recyc  mmu_shado
       175        880        880          0       1832       2714          0        880
        35        868        868          0       1782       2650          0        868
        91       8522       8520        131      29179      38651          0       8722
        28        991        992          0       2314       3312          0        992
        91        796        796          0       1648       2445          0        796
        81       1944       1943          0       7241       9213          0       1943
        98       4149       4148         31      11975      16196          0       4214
        41       3379       3380          0       9710      13100          0       3378
        42      17729      17730          0      48415      66152          0      17729

guest has an apparent lockup at this point and when it unfreezes kscand
cpu time jumps on the order of the time command line response was frozen
(on the order of 30 seconds or more)

        14      18634      18633          0      48286      66921          0      18634
        21      18607      18607          0      48395      67001          0      18607
        91      17991      17991          0      50039      68040          0      17991
         7      17919      17920          0      53731      71650          0      17919
         7      18060      18060          0      53539      71599          0      18060
        21      17755      17755          0      52714      70469          0      17755

-----------------------

with Avi's patch and flood threshold at 5.

 mmio_exit  mmu_cache  mmu_flood  mmu_pde_z  mmu_pte_u  mmu_pte_w  mmu_recyc  mmu_shado
       147        604        602         42      21299      21957          0        660
       112        163        167         23       7567       7759          0        170
       105          0          1          2       3378       3381          0          1
        14          4          4          0       9685       9689          0          4
       137        628        623         43      21557      22255          0        682
        42          0          2          4       5834       5840          0          2
        91         14         16          0      25741      25757          0         16
        28         58         55          0      23571      23626          0         55
        84        627        624         45      32588      33268          0        685
       132          9         13          1      12162      12177          0         13
        91          0          1          0       3422       3423          0          1
        35          1          1          0       4624       4625          0          1
       102        237        244          0      12257      12504          0        242
        19        401        387         46      20643      21088          0        449
        26          3          4          1     127252     127261          0          4

guest has an apparent lockup at this point and when it unfreezes kscand
cpu time jumps on the order of the time command line response was frozen
(on the order of 30 seconds or more)

        21          0          0          0     182651     182651          0          0
        14          0          0          0     182524     182523          0          0
       178          4          5          4     170752     170759          0          5
        35          0          0          0     181471     181473          0          0
        21          0          0          0     182263     182263          0          0
        14          0          0          0     182493     182494          0          0
        21          0          0          0     182489     182488          0          0
        91          0          0          0     182203     182204          0          0
        35          0          0          0     182378     182377          0          0
[-- Attachment #4: Type: text/plain, Size: 158 bytes --] _______________________________________________ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel ^ permalink raw reply [flat|nested] 73+ messages in thread
[parent not found: <482FCEE1.5040306@qumranet.com>]
[parent not found: <4830F90A.1020809@cisco.com>]
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) [not found] ` <4830F90A.1020809@cisco.com> @ 2008-05-19 4:14 ` David S. Ahern 2008-05-19 14:27 ` Avi Kivity 0 siblings, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-05-19 4:14 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm [resend to new list]. David S. Ahern wrote: > I was just digging through the sysstat history files, and I was not > imagining it: I did have an excellent overnight run on 5/13-5/14 with > your patch and the standard RHEL3U8 smp kernel in the guest. I have no > idea why I cannot get anywhere close to that again. I have updated quite > a few variables since then (such as going from 2.6.25-rc8 to 2.6.25.3 > kernel in the host), but backing them out (i.e., resetting the test to > my recollection of all the details of 5/14) has not helped. baffling and > frustrating. > > more in-line below. > > > Avi Kivity wrote: >> David S. Ahern wrote: >>> Avi Kivity wrote: >>> >>>> Okay, I committed the patch without the flood count == 5. >>>> >>>> >>> I've continued testing the RHEL3 guests with the flood count at 3, and I >>> am right back to where I started. With the patch and the flood count at >>> 3, I had 2 runs totaling around 24 hours that looked really good. Now, I >>> am back to square one. I guess the short of it is that I am not sure if >>> the patch resolves this issue or not. >>> >>> >> What about with the flood count at 5? Does it reliably improve >> performance? >> > > [dsa] No. I saw the same problem with the flood count at 5. The > attachment in the last email shows kvm_stat data during a kscand event. > The data was collected with the patch you posted. With the flood count > at 3 the mmu cache/flood counters are in the 18,000/sec and pte updates > at ~50,000/sec and writes at 70,000/sec. With the flood count at 5 > mmu_cache/flood drops to 0 and pte updates and writes both hit > 180,000+/second. In both cases these last for 30 seconds or more. 
I only > included data for the onset as it's pretty flat during the kscand activity. > >>> Also, in a prior e-mail I mentioned guest time advancing rapidly. I've >>> noticed that with the -no-kvm-pit option the guest time is much better >>> and typically stays within 3 seconds or so of the host, even through the >>> high kscand activity which is one instance of when I've noticed time >>> jumps with the kernel pit. Yes, this result has been repeatable through >>> 6 or so runs. :-) >>> >> Strange. The in-kernel PIT was supposed to improve accuracy. >> > > [dsa] I started a run with the RHEL4 guest 8 hours ago and it is showing > the same kind of success. With the in-kernel PIT, time in the guest > advanced ~120 seconds over real time after just 2 days of up time. With > the userspace PIT, time in the guest is behind real time by only 1 > second after 8 hours of uptime. Note that I am running the RHEL4.6 > kernel recompiled with HZ at 250 instead of the usual 1000. > > david > ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-19 4:14 ` [kvm-devel] " David S. Ahern @ 2008-05-19 14:27 ` Avi Kivity 2008-05-19 16:25 ` David S. Ahern 2008-05-20 14:19 ` Avi Kivity 0 siblings, 2 replies; 73+ messages in thread From: Avi Kivity @ 2008-05-19 14:27 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm David S. Ahern wrote: >> [dsa] No. I saw the same problem with the flood count at 5. The >> attachment in the last email shows kvm_stat data during a kscand event. >> The data was collected with the patch you posted. With the flood count >> at 3 the mmu cache/flood counters are in the 18,000/sec and pte updates >> at ~50,000/sec and writes at 70,000/sec. With the flood count at 5 >> mmu_cache/flood drops to 0 and pte updates and writes both hit >> 180,000+/second. In both cases these last for 30 seconds or more. I only >> included data for the onset as it's pretty flat during the kscand activity. >> It makes sense. We removed a flooding false positive, and introduced a false negative.

The guest access sequence is:

- point kmap pte at page table
- use the new pte to access the page table

Prior to the patch, the mmu didn't see the 'use' part, so it concluded the kmap pte would be better off unshadowed. This shows up as a high flood count. After the patch, this no longer happens, so the sequence can repeat for long periods. However the pte that is the result of the 'use' part is never accessed, so it should be detected as flooded! But our flood detection mechanism looks at one page at a time (per vcpu), while there are two pages involved here.

There are (at least) three options available:

- detect and special-case this scenario
- change the flood detector to be per page table instead of per vcpu
- change the flood detector to look at a list of recently used page tables instead of the last page table

I'm having a hard time trying to pick between the second and third options. 
-- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-19 14:27 ` Avi Kivity @ 2008-05-19 16:25 ` David S. Ahern 2008-05-19 17:04 ` Avi Kivity 2008-05-20 14:19 ` Avi Kivity 1 sibling, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-05-19 16:25 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm Does the fact that the hugemem kernel works just fine have any bearing on your options? Or rather, is there something unique about the way kscand works in the hugemem kernel that its performance is ok? I mentioned last month (so without your first patch) that running the hugemem kernel showed a remarkable improvement in performance compared to the standard smp kernel. Over the weekend I ran a test with your first patch and with the flood detector at 3 (I have not run a case with the detector at 5) and performance with the hugemem was even better in the sense that 1-minute averages of guest system time show no noticeable spikes. In an earlier post I showed a diff in the config files for the standard SMP and hugemem kernels. See: http://article.gmane.org/gmane.comp.emulators.kvm.devel/16944/ david Avi Kivity wrote: > David S. Ahern wrote: >>> [dsa] No. I saw the same problem with the flood count at 5. The >>> attachment in the last email shows kvm_stat data during a kscand event. >>> The data was collected with the patch you posted. With the flood count >>> at 3 the mmu cache/flood counters are in the 18,000/sec and pte updates >>> at ~50,000/sec and writes at 70,000/sec. With the flood count at 5 >>> mmu_cache/flood drops to 0 and pte updates and writes both hit >>> 180,000+/second. In both cases these last for 30 seconds or more. I only >>> included data for the onset as it's pretty flat during the kscand >>> activity. >>> > > It makes sense. We removed a flooding false positive, and introduced a > false negative. 
> > The guest access sequence is: > - point kmap pte at page table > - use the new pte to access the page table > > Prior to the patch, the mmu didn't see the 'use' part, so it concluded > the kmap pte would be better off unshadowed. This shows up as a high > flood count. > > After the patch, this no longer happens, so the sequence can repreat for > long periods. However the pte that is the result of the 'use' part is > never accessed, so it should be detected as flooded! But our flood > detection mechanism looks at one page at a time (per vcpu), while there > are two pages involved here. > > There are (at least) three options available: > - detect and special-case this scenario > - change the flood detector to be per page table instead of per vcpu > - change the flood detector to look at a list of recently used page > tables instead of the last page table > > I'm having a hard time trying to pick between the second and third options. > ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-19 16:25 ` David S. Ahern @ 2008-05-19 17:04 ` Avi Kivity 0 siblings, 0 replies; 73+ messages in thread From: Avi Kivity @ 2008-05-19 17:04 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm David S. Ahern wrote: > Does the fact that the hugemem kernel works just fine have any bearing > on your options? Or rather, is there something unique about the way > kscand works in the hugemem kernel that its performance is ok? > > Yes. If your guest has < 4GB of memory, then all of it is lowmem in the hugemem kernel, and the two-step process for modifying a pte is short-circuited into just one step, and everything works fine. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-19 14:27 ` Avi Kivity 2008-05-19 16:25 ` David S. Ahern @ 2008-05-20 14:19 ` Avi Kivity 2008-05-20 14:34 ` Avi Kivity 2008-05-22 22:08 ` David S. Ahern 1 sibling, 2 replies; 73+ messages in thread From: Avi Kivity @ 2008-05-20 14:19 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm [-- Attachment #1: Type: text/plain, Size: 798 bytes --] Avi Kivity wrote: > > There are (at least) three options available: > - detect and special-case this scenario > - change the flood detector to be per page table instead of per vcpu > - change the flood detector to look at a list of recently used page > tables instead of the last page table > > I'm having a hard time trying to pick between the second and third > options. > The answer turns out to be "yes", so here's a patch that adds a pte access history table for each shadowed guest page-table. Let me know if it helps. Benchmarking a variety of workloads on all guests supported by kvm is left as an exercise for the reader, but I suspect the patch will either improve things all around, or can be modified to do so. 
-- error compiling committee.c: too many arguments to function

[-- Attachment #2: per-page-pte-history.patch --]
[-- Type: text/x-patch, Size: 4637 bytes --]

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 154727d..1a3d01a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1130,7 +1130,8 @@ unshadowed:
 	if (speculative) {
 		vcpu->arch.last_pte_updated = shadow_pte;
 		vcpu->arch.last_pte_gfn = gfn;
-	}
+	} else
+		page_header(__pa(shadow_pte))->pte_history_len = 0;
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -1616,13 +1617,6 @@ static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, u64 old, u64 new)
 	kvm_mmu_flush_tlb(vcpu);
 }
 
-static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
-{
-	u64 *spte = vcpu->arch.last_pte_updated;
-
-	return !!(spte && (*spte & shadow_accessed_mask));
-}
-
 static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 					  const u8 *new, int bytes)
 {
@@ -1679,13 +1673,49 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	u64 *spte = vcpu->arch.last_pte_updated;
+	struct kvm_mmu_page *page;
+
+	if (spte && vcpu->arch.last_pte_gfn == gfn) {
+		page = page_header(__pa(spte));
+		page->pte_history_len = 0;
+		pgprintk("clearing page history, gfn %x ent %lx\n",
+			 page->gfn, spte - page->spt);
+	}
+}
+
+static bool kvm_mmu_page_flooded(struct kvm_mmu_page *page)
+{
+	int i, j, ent, len;
 
-	if (spte
-	    && vcpu->arch.last_pte_gfn == gfn
-	    && shadow_accessed_mask
-	    && !(*spte & shadow_accessed_mask)
-	    && is_shadow_present_pte(*spte))
-		set_bit(PT_ACCESSED_SHIFT, spte);
+	len = page->pte_history_len;
+	for (i = len; i != 0; --i) {
+		ent = page->pte_history[i - 1];
+		if (test_bit(PT_ACCESSED_SHIFT, &page->spt[ent])) {
+			for (j = i; j < len; ++j)
+				page->pte_history[j-i] = page->pte_history[j];
+			page->pte_history_len = len - i;
+			return false;
+		}
+	}
+	if (page->pte_history_len < KVM_MAX_PTE_HISTORY)
+		return false;
+	return true;
+}
+
+static void kvm_mmu_log_pte_history(struct kvm_mmu_page *page, u64 *spte)
+{
+	int i;
+	unsigned ent = spte - page->spt;
+
+	if (page->pte_history_len > 0
+	    && page->pte_history[page->pte_history_len - 1] == ent)
+		return;
+	if (page->pte_history_len == KVM_MAX_PTE_HISTORY) {
+		for (i = 1; i < KVM_MAX_PTE_HISTORY; ++i)
+			page->pte_history[i-1] = page->pte_history[i];
+		--page->pte_history_len;
+	}
+	page->pte_history[page->pte_history_len++] = ent;
 }
 
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -1704,7 +1734,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	unsigned misaligned;
 	unsigned quadrant;
 	int level;
-	int flooded = 0;
 	int npte;
 	int r;
@@ -1715,16 +1744,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	kvm_mmu_free_some_pages(vcpu);
 	++vcpu->kvm->stat.mmu_pte_write;
 	kvm_mmu_audit(vcpu, "pre pte write");
-	if (gfn == vcpu->arch.last_pt_write_gfn
-	    && !last_updated_pte_accessed(vcpu)) {
-		++vcpu->arch.last_pt_write_count;
-		if (vcpu->arch.last_pt_write_count >= 3)
-			flooded = 1;
-	} else {
-		vcpu->arch.last_pt_write_gfn = gfn;
-		vcpu->arch.last_pt_write_count = 1;
-		vcpu->arch.last_pte_updated = NULL;
-	}
 	index = kvm_page_table_hashfn(gfn);
 	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
 	hlist_for_each_entry_safe(sp, node, n, bucket, hash_link) {
@@ -1733,7 +1752,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		pte_size = sp->role.glevels == PT32_ROOT_LEVEL ? 4 : 8;
 		misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
 		misaligned |= bytes < 4;
-		if (misaligned || flooded) {
+		if (misaligned || kvm_mmu_page_flooded(sp)) {
 			/*
 			 * Misaligned accesses are too much trouble to fix
 			 * up; also, they usually indicate a page is not used
@@ -1785,6 +1804,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		mmu_pte_write_zap_pte(vcpu, sp, spte);
 		if (new)
 			mmu_pte_write_new_pte(vcpu, sp, spte, new);
+		kvm_mmu_log_pte_history(sp, spte);
 		mmu_pte_write_flush_tlb(vcpu, entry, *spte);
 		++spte;
 	}
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index a71f3aa..cbe550e 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -78,6 +78,7 @@
 #define KVM_MIN_FREE_MMU_PAGES 5
 #define KVM_REFILL_PAGES 25
 #define KVM_MAX_CPUID_ENTRIES 40
+#define KVM_MAX_PTE_HISTORY 4
 
 extern spinlock_t kvm_lock;
 extern struct list_head vm_list;
@@ -189,6 +190,9 @@ struct kvm_mmu_page {
 		u64 *parent_pte;	   /* !multimapped */
 		struct hlist_head parent_ptes; /* multimapped, kvm_pte_chain */
 	};
+
+	u16 pte_history_len;
+	u16 pte_history[KVM_MAX_PTE_HISTORY];
 };
 
 /*

^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-20 14:19 ` Avi Kivity @ 2008-05-20 14:34 ` Avi Kivity 2008-05-22 22:08 ` David S. Ahern 1 sibling, 0 replies; 73+ messages in thread From: Avi Kivity @ 2008-05-20 14:34 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm Avi Kivity wrote: > > The answer turns out to be "yes", so here's a patch that adds a pte > access history table for each shadowed guest page-table. Let me know > if it helps. Benchmarking a variety of workloads on all guests > supported by kvm is left as an exercise for the reader, but I suspect > the patch will either improve things all around, or can be modified to > do so. > btw, the patch applies on top of kvm HEAD (which includes the previous patch). -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-20 14:19 ` Avi Kivity 2008-05-20 14:34 ` Avi Kivity @ 2008-05-22 22:08 ` David S. Ahern 2008-05-28 10:51 ` Avi Kivity 1 sibling, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-05-22 22:08 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm [-- Attachment #1: Type: text/plain, Size: 1968 bytes --] The short answer is that I am still seeing large system time hiccups in the guests due to kscand in the guest scanning its active lists. I do see better response for a KVM_MAX_PTE_HISTORY of 3 than with 4. (For completeness I also tried a history of 2, but it performed worse than 3, which is no surprise given the meaning of it.) I have been able to scratch out a simplistic program that stimulates kscand activity similar to what is going on in my real guest (see attached). The program requests a memory allocation, initializes it (to get it backed) and then in a loop sweeps through the memory in chunks, similar to a program using parts of its memory here and there but eventually accessing all of it. Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is using a fair amount of highmem. Start a couple of instances of the attached. For example, I've been using these 2: memuser 768M 120 5 300 memuser 384M 300 10 600 Together these instances take up 1GB of RAM and once initialized consume very little CPU. On kvm they make kscand and kswapd go nuts every 5-15 minutes. For comparison, I do not see the same behavior for an identical setup running on esx 3.5. david Avi Kivity wrote: > Avi Kivity wrote: >> >> There are (at least) three options available: >> - detect and special-case this scenario >> - change the flood detector to be per page table instead of per vcpu >> - change the flood detector to look at a list of recently used page >> tables instead of the last page table >> >> I'm having a hard time trying to pick between the second and third >> options. 
>> > > The answer turns out to be "yes", so here's a patch that adds a pte > access history table for each shadowed guest page-table. Let me know if > it helps. Benchmarking a variety of workloads on all guests supported > by kvm is left as an exercise for the reader, but I suspect the patch > will either improve things all around, or can be modified to do so. >

[-- Attachment #2: memuser.c --]
[-- Type: text/x-csrc, Size: 2621 bytes --]

/* simple program to malloc memory, initialize it, and
 * then repetitively use it to keep it active.
 */
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <libgen.h>

/* goal is to sweep memory every T1 sec by accessing a
 * percentage at a time and sleeping T2 sec in between accesses.
 * Once all the memory has been accessed, sleep for T3 sec
 * before starting the cycle over.
 */
#define T1 180
#define T2 5
#define T3 300

const char *timestamp(void);

void usage(const char *prog)
{
	fprintf(stderr, "\nusage: %s memlen{M|K} [t1 t2 t3]\n", prog);
}

int main(int argc, char *argv[])
{
	int len;
	char *endp;
	int factor, endp_len;
	int start, incr;
	int t1 = T1, t2 = T2, t3 = T3;
	char *mem;
	char c = 0;

	if (argc < 2) {
		usage(basename(argv[0]));
		return 1;
	}

	/*
	 * determine memory to request
	 */
	len = (int) strtol(argv[1], &endp, 0);
	factor = 1;
	endp_len = strlen(endp);
	if ((endp_len == 1) && ((*endp == 'M') || (*endp == 'm')))
		factor = 1024 * 1024;
	else if ((endp_len == 1) && ((*endp == 'K') || (*endp == 'k')))
		factor = 1024;
	else if (endp_len) {
		fprintf(stderr, "invalid memory len.\n");
		return 1;
	}
	len *= factor;
	if (len == 0) {
		fprintf(stdout, "memory len is 0.\n");
		return 1;
	}

	/*
	 * convert times if given
	 */
	if (argc > 2) {
		if (argc < 5) {
			usage(basename(argv[0]));
			return 1;
		}
		t1 = atoi(argv[2]);
		t2 = atoi(argv[3]);
		t3 = atoi(argv[4]);
	}

	/*
	 * amount of memory to sweep at one time
	 */
	if (t1 && t2)
		incr = len / t1 * t2;
	else
		incr = len;

	mem = (char *) malloc(len);
	if (mem == NULL) {
		fprintf(stderr, "malloc failed\n");
		return 1;
	}

	printf("memory allocated. initializing to 0\n");
	memset(mem, 0, len);

	start = 0;
	printf("%s starting memory update.\n", timestamp());
	while (1) {
		c++;
		if (c == 0x7f)
			c = 0;
		memset(mem + start, c, incr);
		start += incr;
		if ((start >= len) || ((start + incr) >= len)) {
			printf("%s scan complete. sleeping %d\n",
			       timestamp(), t3);
			start = 0;
			sleep(t3);
			printf("%s starting memory update.\n", timestamp());
		} else if (t2)
			sleep(t2);
	}

	return 0;
}

const char *timestamp(void)
{
	static char date[64];
	struct timeval now;
	struct tm ltime;

	memset(date, 0, sizeof(date));
	if (gettimeofday(&now, NULL) == 0) {
		if (localtime_r(&now.tv_sec, &ltime))
			strftime(date, sizeof(date), "%m/%d %H:%M:%S", &ltime);
	}

	if (strlen(date) == 0)
		strcpy(date, "unknown");

	return date;
}

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-22 22:08 ` David S. Ahern @ 2008-05-28 10:51 ` Avi Kivity 2008-05-28 14:13 ` David S. Ahern 2008-05-29 16:42 ` David S. Ahern 0 siblings, 2 replies; 73+ messages in thread From: Avi Kivity @ 2008-05-28 10:51 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm David S. Ahern wrote: > The short answer is that I am still see large system time hiccups in the > guests due to kscand in the guest scanning its active lists. I do see > better response for a KVM_MAX_PTE_HISTORY of 3 than with 4. (For > completeness I also tried a history of 2, but it performed worse than 3 > which is no surprise given the meaning of it.) > > > I have been able to scratch out a simplistic program that stimulates > kscand activity similar to what is going on in my real guest (see > attached). The program requests a memory allocation, initializes it (to > get it backed) and then in a loop sweeps through the memory in chunks > similar to a program using parts of its memory here and there but > eventually accessing all of it. > > Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is > using a fair amount of highmem. Start a couple of instances of the > attached. For example, I've been using these 2: > > memuser 768M 120 5 300 > memuser 384M 300 10 600 > > Together these instances take up a 1GB of RAM and once initialized > consume very little CPU. On kvm they make kscand and kswapd go nuts > every 5-15 minutes. For comparison, I do not see the same behavior for > an identical setup running on esx 3.5. > I haven't been able to reproduce this: > [root@localhost root]# ps -elf | grep -E 'memuser|kscand' > 1 S root 7 1 1 75 0 - 0 schedu 10:07 ? 
> 00:00:26 [kscand]
> 0 S root      1464     1  1  75   0 - 196986 schedu 10:20 pts/0
> 00:00:21 ./memuser 768M 120 5 300
> 0 S root      1465     1  0  75   0 -  98683 schedu 10:20 pts/0
> 00:00:10 ./memuser 384M 300 10 600
> 0 S root      2148  1293  0  75   0 -    922 pipe_w 10:48 pts/0
> 00:00:00 grep -E memuser|kscand

The workload has been running for about half an hour, and kswapd cpu usage doesn't seem significant. This is a 2GB guest running with my patch ported to kvm.git HEAD. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 10:51 ` Avi Kivity @ 2008-05-28 14:13 ` David S. Ahern 2008-05-28 14:35 ` Avi Kivity 2008-05-28 14:48 ` Andrea Arcangeli 2008-05-29 16:42 ` David S. Ahern 1 sibling, 2 replies; 73+ messages in thread From: David S. Ahern @ 2008-05-28 14:13 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm Weird. Could it be something about the hosts? I have been running these tests on a DL320G5 with a Xeon 3050 CPU, 2.13 GHz. Host OS is Fedora 8 with the 2.6.25.3 kernel. I'll rebuild kvm-69 with your latest patch and try the test programs again. david Avi Kivity wrote: > David S. Ahern wrote: >> The short answer is that I am still see large system time hiccups in the >> guests due to kscand in the guest scanning its active lists. I do see >> better response for a KVM_MAX_PTE_HISTORY of 3 than with 4. (For >> completeness I also tried a history of 2, but it performed worse than 3 >> which is no surprise given the meaning of it.) >> >> >> I have been able to scratch out a simplistic program that stimulates >> kscand activity similar to what is going on in my real guest (see >> attached). The program requests a memory allocation, initializes it (to >> get it backed) and then in a loop sweeps through the memory in chunks >> similar to a program using parts of its memory here and there but >> eventually accessing all of it. >> >> Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is >> using a fair amount of highmem. Start a couple of instances of the >> attached. For example, I've been using these 2: >> >> memuser 768M 120 5 300 >> memuser 384M 300 10 600 >> >> Together these instances take up a 1GB of RAM and once initialized >> consume very little CPU. On kvm they make kscand and kswapd go nuts >> every 5-15 minutes. For comparison, I do not see the same behavior for >> an identical setup running on esx 3.5. 
>> > > I haven't been able to reproduce this: > >> [root@localhost root]# ps -elf | grep -E 'memuser|kscand' >> 1 S root 7 1 1 75 0 - 0 schedu 10:07 ? >> 00:00:26 [kscand] >> 0 S root 1464 1 1 75 0 - 196986 schedu 10:20 pts/0 >> 00:00:21 ./memuser 768M 120 5 300 >> 0 S root 1465 1 0 75 0 - 98683 schedu 10:20 pts/0 >> 00:00:10 ./memuser 384M 300 10 600 >> 0 S root 2148 1293 0 75 0 - 922 pipe_w 10:48 pts/0 >> 00:00:00 grep -E memuser|kscand > > The workload has been running for about half an hour, and kswapd cpu > usage doesn't seem significant. This is a 2GB guest running with my > patch ported to kvm.git HEAD. Guest is has 2G of memory. > > ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 14:13 ` David S. Ahern @ 2008-05-28 14:35 ` Avi Kivity 2008-05-28 19:49 ` David S. Ahern 2008-05-28 14:48 ` Andrea Arcangeli 1 sibling, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-05-28 14:35 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm David S. Ahern wrote: > Weird. Could it be something about the hosts? > > I have been running these tests on a DL320G5 with a Xeon 3050 CPU, 2.13 > GHz. Host OS is Fedora 8 with the 2.6.25.3 kernel. > > I'll rebuild kvm-69 with your latest patch and try the test programs again. > I've pushed it into kvm.git, branch name per-page-pte-tracking. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 14:35 ` Avi Kivity @ 2008-05-28 19:49 ` David S. Ahern 2008-05-29 6:37 ` Avi Kivity 0 siblings, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-05-28 19:49 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm I have a clone of the kvm repository, but evidently I am not running the right magic to see the changes in the per-page-pte-tracking branch. I ran the following:

    git clone git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
    git branch per-page-pte-tracking

    [dsa@daahern-lx kvm]$ git branch
      master
    * per-page-pte-tracking

But arch/x86/kvm/mmu.c does not show the changes for the per-page-pte-history.patch. What am I not doing correctly here? david Avi Kivity wrote: > David S. Ahern wrote: >> Weird. Could it be something about the hosts? >> >> I have been running these tests on a DL320G5 with a Xeon 3050 CPU, 2.13 >> GHz. Host OS is Fedora 8 with the 2.6.25.3 kernel. >> >> I'll rebuild kvm-69 with your latest patch and try the test programs >> again. >> > > I've pushed it into kvm.git, branch name per-page-pte-tracking. > ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 19:49 ` David S. Ahern @ 2008-05-29 6:37 ` Avi Kivity 0 siblings, 0 replies; 73+ messages in thread From: Avi Kivity @ 2008-05-29 6:37 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm David S. Ahern wrote: > I have a clone of the kvm repository, but evidently not running the > right magic to see the changes in the per-page-pte-tracking branch. I > ran the following: > > git clone git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git > git branch per-page-pte-tracking > > [dsa@daahern-lx kvm]$ git branch > master > * per-page-pte-tracking > > But arch/x86/kvm/mmu.c does not show the changes for the > per-page-pte-history.patch. > > What I am not doing correctly here? > > 'git branch' creates a new branch. Try the following git fetch origin git checkout origin/per-page-pte-tracking If that doesn't work (old git) try git fetch git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git per-page-pte-tracking:refs/heads/per-page-pte-tracking git checkout per-page-pte-tracking -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 14:13 ` David S. Ahern 2008-05-28 14:35 ` Avi Kivity @ 2008-05-28 14:48 ` Andrea Arcangeli 2008-05-28 14:57 ` Avi Kivity 2008-05-28 15:37 ` Avi Kivity 1 sibling, 2 replies; 73+ messages in thread From: Andrea Arcangeli @ 2008-05-28 14:48 UTC (permalink / raw) To: David S. Ahern; +Cc: Avi Kivity, kvm On Wed, May 28, 2008 at 08:13:44AM -0600, David S. Ahern wrote: > Weird. Could it be something about the hosts? Note that the VM itself will never make use of kmap. The VM is "data" agnostic. The VM never has any idea of the data contained in the pages. kmap/kmap_atomic/kunmap_atomic are only needed to access _data_. Only I/O (if not using DMA, or because of bounce buffers) and page faults triggered in user process context, or other operations again done from user process context, will call into kmap or kmap_atomic. And if KVM is inefficient in handling kmap/kmap_atomic, that will lead to the user process running slower, and in turn generating less pressure on the guest and host VM, if anything. The guest will run slower than it should if KVM isn't optimized for the workload, but it shouldn't alter any VM kernel thread CPU usage; only the CPU usage of the guest process context and the host system time in the qemu task should go up, nothing else. This is again because the VM will never care about the data contents and it'll never invoke kmap/kmap_atomic. So I never found a relation between the reported symptom of VM kernel threads going weird and KVM's handling of kmap ptes. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 14:48 ` Andrea Arcangeli @ 2008-05-28 14:57 ` Avi Kivity 2008-05-28 15:39 ` David S. Ahern 2008-05-28 15:58 ` Andrea Arcangeli 2008-05-28 15:37 ` Avi Kivity 1 sibling, 2 replies; 73+ messages in thread From: Avi Kivity @ 2008-05-28 14:57 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: David S. Ahern, kvm Andrea Arcangeli wrote: > On Wed, May 28, 2008 at 08:13:44AM -0600, David S. Ahern wrote: > >> Weird. Could it be something about the hosts? >> > > Note that the VM itself will never make use of kmap. The VM is "data" > agonistic. The VM has never any idea with the data contained by the > pages. kmap/kmap_atomic/kunmap_atomic are only need to access _data_. > > What about CONFIG_HIGHPTE? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 14:57 ` Avi Kivity @ 2008-05-28 15:39 ` David S. Ahern 2008-05-29 11:49 ` Avi Kivity 2008-05-29 12:10 ` Avi Kivity 2008-05-28 15:58 ` Andrea Arcangeli 1 sibling, 2 replies; 73+ messages in thread From: David S. Ahern @ 2008-05-28 15:39 UTC (permalink / raw) To: Avi Kivity, Andrea Arcangeli; +Cc: kvm I've been instrumenting the guest kernel as well. It's the scanning of the active lists that triggers a lot of calls to paging64_prefetch_page, and, as you guys know, correlates with the number of direct pages in the list. Earlier in this thread I traced the kvm cycles to paging64_prefetch_page(). See http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg16332.html In the guest I started capturing scans (kscand() loop) that took longer than a jiffy. Here's an example for 1 trip through the active lists, both anonymous and cache: active_anon_scan: HighMem, age 4, count[age] 41863 -> 30194, direct 36234, dj 225 active_anon_scan: HighMem, age 3, count[age] 1772 -> 1450, direct 1249, dj 3 active_anon_scan: HighMem, age 0, count[age] 104078 -> 101685, direct 84829, dj 848 active_cache_scan: HighMem, age 12, count[age] 3397 -> 2640, direct 889, dj 19 active_cache_scan: HighMem, age 8, count[age] 6105 -> 5884, direct 988, dj 24 active_cache_scan: HighMem, age 4, count[age] 18923 -> 18400, direct 11141, dj 37 active_cache_scan: HighMem, age 0, count[age] 14283 -> 14283, direct 69, dj 1 An explanation of the line (using the first one): it's a scan of the anonymous list, age bucket of 4. Before the scan loop the bucket had 41863 pages and after the loop the bucket had 30194. Of the pages in the bucket, 36234 were direct pages (i.e., PageDirect(page) was non-zero), and for this bucket 225 jiffies passed while running scan_active_list(). On the host side the total times (sum of the dj's/100) in the output above directly match the kvm_stat output: spikes in pte_writes/updates. 
Tracing the RHEL3 code, I believe linux-2.4.21-rmap.patch is the patch that brought in the code that is run during the active list scans for direct pages. In and of itself each trip through the while loop in scan_active_list does not take a lot of time, but when run, say, 84,829 times (see age 0 above) the cumulative time is high: 8.48 seconds per the example above. I'll pull down the git branch and give it a spin. david Avi Kivity wrote: > Andrea Arcangeli wrote: >> On Wed, May 28, 2008 at 08:13:44AM -0600, David S. Ahern wrote: >> >>> Weird. Could it be something about the hosts? >>> >> >> Note that the VM itself will never make use of kmap. The VM is "data" >> agonistic. The VM has never any idea with the data contained by the >> pages. kmap/kmap_atomic/kunmap_atomic are only need to access _data_. >> >> > > What about CONFIG_HIGHPTE? > > > ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 15:39 ` David S. Ahern @ 2008-05-29 11:49 ` Avi Kivity 2008-05-29 12:10 ` Avi Kivity 1 sibling, 0 replies; 73+ messages in thread From: Avi Kivity @ 2008-05-29 11:49 UTC (permalink / raw) To: David S. Ahern; +Cc: Andrea Arcangeli, kvm David S. Ahern wrote: > I've been instrumenting the guest kernel as well. It's the scanning of > the active lists that triggers a lot of calls to paging64_prefetch_page, > and, as you guys know, correlates with the number of direct pages in the > list. Earlier in this thread I traced the kvm cycles to > paging64_prefetch_page(). See > I optimized this function a bit, hopefully it will relieve some of the pain. We still need to reduce the number of times it is called. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 15:39 ` David S. Ahern 2008-05-29 11:49 ` Avi Kivity @ 2008-05-29 12:10 ` Avi Kivity 2008-05-29 13:49 ` David S. Ahern 1 sibling, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-05-29 12:10 UTC (permalink / raw) To: David S. Ahern; +Cc: Andrea Arcangeli, kvm David S. Ahern wrote: > I've been instrumenting the guest kernel as well. It's the scanning of > the active lists that triggers a lot of calls to paging64_prefetch_page, > and, as you guys know, correlates with the number of direct pages in the > list. Earlier in this thread I traced the kvm cycles to > paging64_prefetch_page(). See > > http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg16332.html > > In the guest I started capturing scans (kscand() loop) that took longer > than a jiffie. Here's an example for 1 trip through the active lists, > both anonymous and cache: > > active_anon_scan: HighMem, age 4, count[age] 41863 -> 30194, direct > 36234, dj 225 > > HZ=512, so half a second. 41K pages in 0.5s -> 80K pages/sec. Considering we have _at_least_ two emulations per page, this is almost reasonable. > active_anon_scan: HighMem, age 3, count[age] 1772 -> 1450, direct 1249, dj 3 > > active_anon_scan: HighMem, age 0, count[age] 104078 -> 101685, direct > 84829, dj 848 > Here we scanned 100K pages in ~2 seconds. 50K pages/sec, not too good. > I'll pull down the git branch and give it a spin. > I've rebased it again to include the prefetch_page optimization. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-29 12:10 ` Avi Kivity @ 2008-05-29 13:49 ` David S. Ahern 2008-05-29 14:08 ` Avi Kivity 0 siblings, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-05-29 13:49 UTC (permalink / raw) To: Avi Kivity; +Cc: Andrea Arcangeli, kvm This is 2.4/RHEL3, so HZ=100. 848 jiffies = 8.48 seconds -- and that's just the one age bucket and this is just one example pulled randomly (well after boot). During that time kscand does get scheduled out, but ultimately guest time is at 100% during the scans. david Avi Kivity wrote: > David S. Ahern wrote: >> I've been instrumenting the guest kernel as well. It's the scanning of >> the active lists that triggers a lot of calls to paging64_prefetch_page, >> and, as you guys know, correlates with the number of direct pages in the >> list. Earlier in this thread I traced the kvm cycles to >> paging64_prefetch_page(). See >> >> http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg16332.html >> >> In the guest I started capturing scans (kscand() loop) that took longer >> than a jiffie. Here's an example for 1 trip through the active lists, >> both anonymous and cache: >> >> active_anon_scan: HighMem, age 4, count[age] 41863 -> 30194, direct >> 36234, dj 225 >> >> > > HZ=512, so half a second. > > 41K pages in 0.5s -> 80K pages/sec. Considering we have _at_least_ two > emulations per page, this is almost reasonable. > >> active_anon_scan: HighMem, age 3, count[age] 1772 -> 1450, direct >> 1249, dj 3 >> >> active_anon_scan: HighMem, age 0, count[age] 104078 -> 101685, direct >> 84829, dj 848 >> > > Here we scanned 100K pages in ~2 seconds. 50K pages/sec, not too good. > >> I'll pull down the git branch and give it a spin. >> > > I've rebased it again to include the prefetch_page optimization. > ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-29 13:49 ` David S. Ahern @ 2008-05-29 14:08 ` Avi Kivity 0 siblings, 0 replies; 73+ messages in thread From: Avi Kivity @ 2008-05-29 14:08 UTC (permalink / raw) To: David S. Ahern; +Cc: Andrea Arcangeli, kvm David S. Ahern wrote: > This is 2.4/RHEL3, so HZ=100. 848 jiffies = 8.48 seconds -- and that's > just the one age bucket and this is just one example pulled randomly > (well after boot). During that time kscand does get scheduled out, but > ultimately guest time is at 100% during the scans. > > Er, yes. Don't know where that CONFIG_HZ=512 came from in the centos config files: That's pretty bad, then. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 14:57 ` Avi Kivity 2008-05-28 15:39 ` David S. Ahern @ 2008-05-28 15:58 ` Andrea Arcangeli 1 sibling, 0 replies; 73+ messages in thread From: Andrea Arcangeli @ 2008-05-28 15:58 UTC (permalink / raw) To: Avi Kivity; +Cc: David S. Ahern, kvm On Wed, May 28, 2008 at 05:57:21PM +0300, Avi Kivity wrote: > What about CONFIG_HIGHPTE? Ah yes sorry! Official 2.4 has no highpte capability but surely RH backported highpte to 2.4 so that would explain the cpu time spent in kswapd _guest_ context. If highpte is the problem and you've troubles reproducing, I recommend running some dozen of those in background on the 2.4 VM that has the ZERO_PAGE support immediately after boot. This will ensure there will be tons of pagetables in highmemory. This should allocate purely pagetables and allow for a worst case of highpte. Check with /proc/meminfo that the pagetable number goes up of a few megabytes for each one of those tasks. Then just try to allocate some real ram (not zeropage) and if there's a problem with highptes it should be possible to reproduce it with so many highptes allocated in the system. Guest VM size should be 2G, you don't really need more than 2G to reproduce by using the below ZERO_PAGE trick. #include <unistd.h> #include <stdlib.h> #include <string.h> int main() { char *p1, *p2; p1 = malloc(512*1024*1024); p2 = malloc(512*1024*1024); if (memcmp(p1, p2, 512*1024*1024)) *(char *)0 = 0; pause(); return 0; } ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 14:48 ` Andrea Arcangeli 2008-05-28 14:57 ` Avi Kivity @ 2008-05-28 15:37 ` Avi Kivity 2008-05-28 15:43 ` David S. Ahern 1 sibling, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-05-28 15:37 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: David S. Ahern, kvm Andrea Arcangeli wrote: > > So I never found a relation to the symptom reported of VM kernel > threads going weird, with KVM optimal handling of kmap ptes. > The problem is this code: static int scan_active_list(struct zone_struct * zone, int age, struct list_head * list) { struct list_head *page_lru , *next; struct page * page; int over_rsslimit; /* Take the lock while messing with the list... */ lru_lock(zone); list_for_each_safe(page_lru, next, list) { page = list_entry(page_lru, struct page, lru); pte_chain_lock(page); if (page_referenced(page, &over_rsslimit) && !over_rsslimit) age_page_up_nolock(page, age); pte_chain_unlock(page); } lru_unlock(zone); return 0; } If the pages in the list are in the same order as in the ptes (which is very likely), then we have the following access pattern - set up kmap to point at pte - test_and_clear_bit(pte) - kunmap From kvm's point of view this looks like - several accesses to set up the kmap - if these accesses trigger flooding, we will have to tear down the shadow for this page, only to set it up again soon - an access to the pte (emulated) - if this access _doesn't_ trigger flooding, we will have 512 unneeded emulations. The pte is worthless anyway since the accessed bit is clear (so we can't set up a shadow pte for it) - this bug was fixed - an access to tear down the kmap [btw, am I reading this right? the entire list is scanned each time? if you have 1G of active HIGHMEM, that's a quarter of a million pages, which would take at least a second no matter what we do. 
VMware can probably special-case kmaps, but we can't] -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 15:37 ` Avi Kivity @ 2008-05-28 15:43 ` David S. Ahern 2008-05-28 17:04 ` Andrea Arcangeli 0 siblings, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-05-28 15:43 UTC (permalink / raw) To: Avi Kivity; +Cc: Andrea Arcangeli, kvm This is the code in the RHEL3.8 kernel: static int scan_active_list(struct zone_struct * zone, int age, struct list_head * list, int count) { struct list_head *page_lru , *next; struct page * page; int over_rsslimit; count = count * kscand_work_percent / 100; /* Take the lock while messing with the list... */ lru_lock(zone); while (count-- > 0 && !list_empty(list)) { page = list_entry(list->prev, struct page, lru); pte_chain_lock(page); if (page_referenced(page, &over_rsslimit) && !over_rsslimit && check_mapping_inuse(page)) age_page_up_nolock(page, age); else { list_del(&page->lru); list_add(&page->lru, list); } pte_chain_unlock(page); } lru_unlock(zone); return 0; } My previous email shows examples of the number of pages in the list and the scanning that happens. david Avi Kivity wrote: > Andrea Arcangeli wrote: >> >> So I never found a relation to the symptom reported of VM kernel >> threads going weird, with KVM optimal handling of kmap ptes. >> > > > The problem is this code: > > static int scan_active_list(struct zone_struct * zone, int age, > struct list_head * list) > { > struct list_head *page_lru , *next; > struct page * page; > int over_rsslimit; > > /* Take the lock while messing with the list... 
*/ > lru_lock(zone); > list_for_each_safe(page_lru, next, list) { > page = list_entry(page_lru, struct page, lru); > pte_chain_lock(page); > if (page_referenced(page, &over_rsslimit) && !over_rsslimit) > age_page_up_nolock(page, age); > pte_chain_unlock(page); > } > lru_unlock(zone); > return 0; > } > > If the pages in the list are in the same order as in the ptes (which is > very likely), then we have the following access pattern > > - set up kmap to point at pte > - test_and_clear_bit(pte) > - kunmap > > From kvm's point of view this looks like > > - several accesses to set up the kmap > - if these accesses trigger flooding, we will have to tear down the > shadow for this page, only to set it up again soon > - an access to the pte (emulted) > - if this access _doesn't_ trigger flooding, we will have 512 unneeded > emulations. The pte is worthless anyway since the accessed bit is clear > (so we can't set up a shadow pte for it) > - this bug was fixed > - an access to tear down the kmap > > [btw, am I reading this right? the entire list is scanned each time? > > if you have 1G of active HIGHMEM, that's a quarter of a million pages, > which would take at least a second no matter what we do. VMware can > probably special-case kmaps, but we can't] > ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 15:43 ` David S. Ahern @ 2008-05-28 17:04 ` Andrea Arcangeli 2008-05-28 17:24 ` David S. Ahern 2008-05-29 10:01 ` Avi Kivity 0 siblings, 2 replies; 73+ messages in thread From: Andrea Arcangeli @ 2008-05-28 17:04 UTC (permalink / raw) To: David S. Ahern; +Cc: Avi Kivity, kvm On Wed, May 28, 2008 at 09:43:09AM -0600, David S. Ahern wrote: > This is the code in the RHEL3.8 kernel: > > static int scan_active_list(struct zone_struct * zone, int age, > struct list_head * list, int count) > { > struct list_head *page_lru , *next; > struct page * page; > int over_rsslimit; > > count = count * kscand_work_percent / 100; > /* Take the lock while messing with the list... */ > lru_lock(zone); > while (count-- > 0 && !list_empty(list)) { > page = list_entry(list->prev, struct page, lru); > pte_chain_lock(page); > if (page_referenced(page, &over_rsslimit) > && !over_rsslimit > && check_mapping_inuse(page)) > age_page_up_nolock(page, age); > else { > list_del(&page->lru); > list_add(&page->lru, list); > } > pte_chain_unlock(page); > } > lru_unlock(zone); > return 0; > } > > My previous email shows examples of the number of pages in the list and > the scanning that happens. This code looks better than the one below, as a limit was introduced and the whole list isn't scanned anymore, if you decrease kscand_work_percent (I assume it's a sysctl even if it's missing the sysctl_ prefix) to say 1, you can limit damages. Did you try it? > Avi Kivity wrote: > > Andrea Arcangeli wrote: > >> > >> So I never found a relation to the symptom reported of VM kernel > >> threads going weird, with KVM optimal handling of kmap ptes. 
> >> > > > > > > The problem is this code: > > > > static int scan_active_list(struct zone_struct * zone, int age, > > struct list_head * list) > > { > > struct list_head *page_lru , *next; > > struct page * page; > > int over_rsslimit; > > > > /* Take the lock while messing with the list... */ > > lru_lock(zone); > > list_for_each_safe(page_lru, next, list) { > > page = list_entry(page_lru, struct page, lru); > > pte_chain_lock(page); > > if (page_referenced(page, &over_rsslimit) && !over_rsslimit) > > age_page_up_nolock(page, age); > > pte_chain_unlock(page); > > } > > lru_unlock(zone); > > return 0; > > } > > > If the pages in the list are in the same order as in the ptes (which is > > very likely), then we have the following access pattern Yes it is likely. > > - set up kmap to point at pte > > - test_and_clear_bit(pte) > > - kunmap > > > > From kvm's point of view this looks like > > > > - several accesses to set up the kmap Hmm, the kmap establishment takes a single guest operation in the fixmap area. That's a single write to the pte, to write a pte_t 8/4 byte large region (PAE/non-PAE). The same pte_t is then cleared and flushed out of the tlb with a cpu-local invlpg during kunmap_atomic. I count 1 write here so far. > > - if these accesses trigger flooding, we will have to tear down the > > shadow for this page, only to set it up again soon So the shadow mapping the fixmap area would be tear down by the flooding. Or is the shadow corresponding to the real user pte pointed by the fixmap, that is unshadowed by the flooding, or both/all? > > - an access to the pte (emulted) Here I count the second write and this isn't done on the fixmap area like the first write above, but this is a write to the real user pte, pointed by the fixmap. So if this is emulated it means the shadow of the user pte pointing to the real data page is still active. > > - if this access _doesn't_ trigger flooding, we will have 512 unneeded > > emulations. 
The pte is worthless anyway since the accessed bit is clear > > (so we can't set up a shadow pte for it) > > - this bug was fixed You mean the accessed bit on fixmap pte used by kmap? Or the user pte pointed by the fixmap pte? > > - an access to tear down the kmap Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that matters). > > [btw, am I reading this right? the entire list is scanned each time? If the list parameter isn't a local LIST_HEAD on the stack but the global one it's a full scan each time. I guess it's the global list looking at the new code at the top that has a kswapd_scan_limit sysctl. > > if you have 1G of active HIGHMEM, that's a quarter of a million pages, > > which would take at least a second no matter what we do. VMware can > > probably special-case kmaps, but we can't] Perhaps they've a list per-age bucket or similar but still I doubt this works well on host either... I guess the virtualization overhead is exacerbating the inefficiency. Perhaps killall -STOP kscand is good enough fix ;). This seem to only push the age up, to be functional the age has to go down and I guess the go-down is done by other threads so stopping kscand may not hurt. I think what we should aim for is to quickly reach this condition: 1) always keep the fixmap/kmap pte_t shadowed and emulate the kmap/kunmap access so the test_and_clear_young done on the user pte doesn't require to re-establish the spte representing the fixmap virtual address. If we don't emulate fixmap we'll have to re-establish the spte during the write to the user pte, and tear it down again during kunmap_atomic. So there's not much doubt fixmap access emulation is worth it. 
2) get rid of the user pte shadow mapping pointing to the user data, so the test_and_clear of the young bitflag on the user pte will not be emulated and it'll run at full CPU speed through the shadow pte mapping corresponding to the fixmap virtual address. The kscand pattern is the same as running mprotect on a 32bit 2.6 kernel, so it sounds worth optimizing for it, even if kscand may be unfixable without killall -STOP kscand or VM fixes to the guest. However I'm not sure about point 2 in the light of mprotect. With mprotect the guest virtual addresses mapped by the guest user ptes will be used. It's not like kscand, which may write forever to the user ptes without ever using the guest virtual addresses that they're mapping. So we'd better be sure that by unshadowing and optimizing kscand we're not hurting mprotect or other pte mangling operations in 2.6 that will likely keep accessing the guest virtual addresses mapped by the user ptes previously modified. Hope this makes sense; I'm not sure I understand this completely. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 17:04 ` Andrea Arcangeli @ 2008-05-28 17:24 ` David S. Ahern 2008-05-29 10:01 ` Avi Kivity 1 sibling, 0 replies; 73+ messages in thread From: David S. Ahern @ 2008-05-28 17:24 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Avi Kivity, kvm Yes, I've tried changing kscand_work_percent (values of 50 and 30). Basically it makes kscand wake more often (ie.,MIN_AGING_INTERVAL declines in proportion) put do less work each trip through the lists. I have not seen a noticeable change in guest behavior. david Andrea Arcangeli wrote: > On Wed, May 28, 2008 at 09:43:09AM -0600, David S. Ahern wrote: >> This is the code in the RHEL3.8 kernel: >> >> static int scan_active_list(struct zone_struct * zone, int age, >> struct list_head * list, int count) >> { >> struct list_head *page_lru , *next; >> struct page * page; >> int over_rsslimit; >> >> count = count * kscand_work_percent / 100; >> /* Take the lock while messing with the list... */ >> lru_lock(zone); >> while (count-- > 0 && !list_empty(list)) { >> page = list_entry(list->prev, struct page, lru); >> pte_chain_lock(page); >> if (page_referenced(page, &over_rsslimit) >> && !over_rsslimit >> && check_mapping_inuse(page)) >> age_page_up_nolock(page, age); >> else { >> list_del(&page->lru); >> list_add(&page->lru, list); >> } >> pte_chain_unlock(page); >> } >> lru_unlock(zone); >> return 0; >> } >> >> My previous email shows examples of the number of pages in the list and >> the scanning that happens. > > This code looks better than the one below, as a limit was introduced > and the whole list isn't scanned anymore, if you decrease > kscand_work_percent (I assume it's a sysctl even if it's missing the > sysctl_ prefix) to say 1, you can limit damages. Did you try it? 
> >> Avi Kivity wrote: >>> Andrea Arcangeli wrote: >>>> So I never found a relation to the symptom reported of VM kernel >>>> threads going weird, with KVM optimal handling of kmap ptes. >>>> >>> >>> The problem is this code: >>> >>> static int scan_active_list(struct zone_struct * zone, int age, >>> struct list_head * list) >>> { >>> struct list_head *page_lru , *next; >>> struct page * page; >>> int over_rsslimit; >>> >>> /* Take the lock while messing with the list... */ >>> lru_lock(zone); >>> list_for_each_safe(page_lru, next, list) { >>> page = list_entry(page_lru, struct page, lru); >>> pte_chain_lock(page); >>> if (page_referenced(page, &over_rsslimit) && !over_rsslimit) >>> age_page_up_nolock(page, age); >>> pte_chain_unlock(page); >>> } >>> lru_unlock(zone); >>> return 0; >>> } >>> If the pages in the list are in the same order as in the ptes (which is >>> very likely), then we have the following access pattern > > Yes it is likely. > >>> - set up kmap to point at pte >>> - test_and_clear_bit(pte) >>> - kunmap >>> >>> From kvm's point of view this looks like >>> >>> - several accesses to set up the kmap > > Hmm, the kmap establishment takes a single guest operation in the > fixmap area. That's a single write to the pte, to write a pte_t 8/4 > byte large region (PAE/non-PAE). The same pte_t is then cleared and > flushed out of the tlb with a cpu-local invlpg during kunmap_atomic. > > I count 1 write here so far. > >>> - if these accesses trigger flooding, we will have to tear down the >>> shadow for this page, only to set it up again soon > > So the shadow mapping the fixmap area would be tear down by the > flooding. > > Or is the shadow corresponding to the real user pte pointed by the > fixmap, that is unshadowed by the flooding, or both/all? > >>> - an access to the pte (emulted) > > Here I count the second write and this isn't done on the fixmap area > like the first write above, but this is a write to the real user pte, > pointed by the fixmap. 
So if this is emulated it means the shadow of > the user pte pointing to the real data page is still active. > >>> - if this access _doesn't_ trigger flooding, we will have 512 unneeded >>> emulations. The pte is worthless anyway since the accessed bit is clear >>> (so we can't set up a shadow pte for it) >>> - this bug was fixed > > You mean the accessed bit on fixmap pte used by kmap? Or the user pte > pointed by the fixmap pte? > >>> - an access to tear down the kmap > > Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that > matters). > >>> [btw, am I reading this right? the entire list is scanned each time? > > If the list parameter isn't a local LIST_HEAD on the stack but the > global one it's a full scan each time. I guess it's the global list > looking at the new code at the top that has a kswapd_scan_limit > sysctl. > >>> if you have 1G of active HIGHMEM, that's a quarter of a million pages, >>> which would take at least a second no matter what we do. VMware can >>> probably special-case kmaps, but we can't] > > Perhaps they've a list per-age bucket or similar but still I doubt > this works well on host either... I guess the virtualization overhead > is exacerbating the inefficiency. Perhaps killall -STOP kscand is good > enough fix ;). This seem to only push the age up, to be functional the > age has to go down and I guess the go-down is done by other threads so > stopping kscand may not hurt. > > I think what we should aim for is to quickly reach this condition: > > 1) always keep the fixmap/kmap pte_t shadowed and emulate the > kmap/kunmap access so the test_and_clear_young done on the user pte > doesn't require to re-establish the spte representing the fixmap > virtual address. If we don't emulate fixmap we'll have to > re-establish the spte during the write to the user pte, and > tear it down again during kunmap_atomic. So there's not much doubt > fixmap access emulation is worth it. 
> > 2) get rid of the user pte shadow mapping pointing to the user data so > the test_and_clear of the young bitflag on the user pte will not be > emulated and it'll run at full CPU speed through the shadow pte > mapping corresponding to the fixmap virtual address > > kscand pattern is the same as running mprotect on a 32bit 2.6 > kernel so it sounds worth optimizing for it, even if kscand may be > unfixable without killall -STOP kscand or VM fixes to guest. > > However I'm not sure about point 2 at the light of mprotect. With > mprotect the guest virutal addresses mapped by the guest user ptes > will be used. It's not like kscand that may write forever to the user > ptes without ever using the guest virtual addresses that they're > mapping. So we better be sure that by unshadowing and optimizing > kscand we're not hurting mprotect or other pte mangling operations in > 2.6 that will likely keep accessing the guest virtual addresses mapped > by the user ptes previously modified. > > Hope this makes any sense, I'm not sure to understand this completely. > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 17:04 ` Andrea Arcangeli 2008-05-28 17:24 ` David S. Ahern @ 2008-05-29 10:01 ` Avi Kivity 2008-05-29 14:27 ` Andrea Arcangeli 1 sibling, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-05-29 10:01 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: David S. Ahern, kvm

Andrea Arcangeli wrote:
>
>>> - set up kmap to point at pte
>>> - test_and_clear_bit(pte)
>>> - kunmap
>>>
>>> From kvm's point of view this looks like
>>>
>>> - several accesses to set up the kmap
>>>
>
> Hmm, the kmap establishment takes a single guest operation in the
> fixmap area. That's a single write to the pte, to write a pte_t 8/4
> byte large region (PAE/non-PAE). The same pte_t is then cleared and
> flushed out of the tlb with a cpu-local invlpg during kunmap_atomic.
>
> I count 1 write here so far.
>

No, two:

static inline void set_pte(pte_t *ptep, pte_t pte)
{
        ptep->pte_high = pte.pte_high;
        smp_wmb();
        ptep->pte_low = pte.pte_low;
}

>>> - if these accesses trigger flooding, we will have to tear down the
>>> shadow for this page, only to set it up again soon
>>>
>
> So the shadow mapping the fixmap area would be torn down by the
> flooding.
>

Before we started patching this, yes.

> Or is it the shadow corresponding to the real user pte pointed by the
> fixmap that is unshadowed by the flooding, or both/all?
>

After we started patching this, no, but with per-page-pte-history, yes (correctly).

>>> - an access to the pte (emulated)
>>>
>
> Here I count the second write, and this isn't done on the fixmap area
> like the first write above; this is a write to the real user pte,
> pointed by the fixmap. So if this is emulated it means the shadow of
> the user pte pointing to the real data page is still active.
>

Right. But if we are scanning a page table linearly, it should be unshadowed.

>
>>> - if this access _doesn't_ trigger flooding, we will have 512 unneeded
>>> emulations.
The pte is worthless anyway since the accessed bit is clear >>> (so we can't set up a shadow pte for it) >>> - this bug was fixed >>> > > You mean the accessed bit on fixmap pte used by kmap? Or the user pte > pointed by the fixmap pte? > The user pte. After guest code runs test_and_clear_bit(accessed_bit, ptep), we can't shadow that pte (all shadowed ptes must have the accessed bit set in the corresponding guest pte, similar to how a tlb entry can only exist if the accessed bit is set). > >>> - an access to tear down the kmap >>> > > Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that > matters). > Looking at the code, that only happens if CONFIG_HIGHMEM_DEBUG is set. > I think what we should aim for is to quickly reach this condition: > > 1) always keep the fixmap/kmap pte_t shadowed and emulate the > kmap/kunmap access so the test_and_clear_young done on the user pte > doesn't require to re-establish the spte representing the fixmap > virtual address. If we don't emulate fixmap we'll have to > re-establish the spte during the write to the user pte, and > tear it down again during kunmap_atomic. So there's not much doubt > fixmap access emulation is worth it. > That is what is done by current HEAD. 418c6952ba9fd379059ed325ea5a3efe904fb7fd is responsible. Note that there is an alternative: allow the kmap pte to be unshadowed, and instead emulate the access through that pte (i.e. emulate the btc instruction). I don't think it's worth it though because it hurts other users of the fixmap page. > 2) get rid of the user pte shadow mapping pointing to the user data so > the test_and_clear of the young bitflag on the user pte will not be > emulated and it'll run at full CPU speed through the shadow pte > mapping corresponding to the fixmap virtual address > That's what per-page-pte-history is supposed to do. The first few accesses are emulated, the next will be native. It's still not full speed as the kmap setup has to be emulated (twice). 
One possible optimization is that if we see the first part of the kmap instantiation, we emulate a few more instructions before returning to the guest. Xen does this IIRC. > kscand pattern is the same as running mprotect on a 32bit 2.6 > kernel so it sounds worth optimizing for it, even if kscand may be > unfixable without killall -STOP kscand or VM fixes to guest. > > I'm no longer sure the access pattern is sequential, since I see kmap_atomic() will not recreate the pte if its value has not changed (unless HIGHMEM_DEBUG). -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-29 10:01 ` Avi Kivity @ 2008-05-29 14:27 ` Andrea Arcangeli 2008-05-29 15:11 ` David S. Ahern 2008-05-29 15:16 ` Avi Kivity 0 siblings, 2 replies; 73+ messages in thread From: Andrea Arcangeli @ 2008-05-29 14:27 UTC (permalink / raw) To: Avi Kivity; +Cc: David S. Ahern, kvm

On Thu, May 29, 2008 at 01:01:06PM +0300, Avi Kivity wrote:
> No, two:
>
> static inline void set_pte(pte_t *ptep, pte_t pte)
> {
>         ptep->pte_high = pte.pte_high;
>         smp_wmb();
>         ptep->pte_low = pte.pte_low;
> }

Right, that can be 2 or 1 depending on PAE/non-PAE; other 2.4
enterprise distros with pte-highmem ship non-PAE kernels by default.

>>>> - if these accesses trigger flooding, we will have to tear down the
>>>> shadow for this page, only to set it up again soon
>>>>
>>
>> So the shadow mapping the fixmap area would be torn down by the
>> flooding.
>>
>
> Before we started patching this, yes.

Ok, so now the one/two writes to the guest fixmap virt address are
emulated and the spte isn't torn down.

>
>> Or is it the shadow corresponding to the real user pte pointed by the
>> fixmap that is unshadowed by the flooding, or both/all?
>>
>
> After we started patching this, no, but with per-page-pte-history, yes
> (correctly).

So with per-page-pte-history the shadow representing the guest user
pte that is being modified by page_referenced is unshadowed.

>>>> - an access to the pte (emulated)
>>>>
>>
>> Here I count the second write, and this isn't done on the fixmap area
>> like the first write above; this is a write to the real user pte,
>> pointed by the fixmap. So if this is emulated it means the shadow of
>> the user pte pointing to the real data page is still active.
>>
>
> Right. But if we are scanning a page table linearly, it should be
> unshadowed.

I think we're often not scanning page tables linearly with pte_chains,
but those should still be unshadowed.
mmaps won't always bring memory in linear order; memory isn't always
initialized by memset or paged in with contiguous virtual accesses.
So while the assumption that following the active list will sometimes
return guest ptes that map contiguous guest virtual addresses is
valid, it only accounts for a small percentage of the active list. It
largely depends on the userland apps. Furthermore, even if the active
lru initially points to linear ptes, the list is then split into age
buckets depending on the access patterns at runtime, and that further
fragments the linearity of the virtual addresses of the kmapped ptes.

BTW, one thing we didn't account for in the previous email is that
there can be more than one guest user pte modified by page_referenced,
if it's not a direct page. And non-direct pages surely won't provide
linear scans; in fact for non-direct pages the most common case is
that the pte_t will point to the same virtual address but on a
different pgd_t * (and in turn a different pmd_t).

>>>> - if this access _doesn't_ trigger flooding, we will have 512 unneeded
>>>> emulations. The pte is worthless anyway since the accessed bit is clear
>>>> (so we can't set up a shadow pte for it)
>>>> - this bug was fixed
>>>>
>>
>> You mean the accessed bit on the fixmap pte used by kmap? Or the user pte
>> pointed by the fixmap pte?
>>
>
> The user pte. After guest code runs test_and_clear_bit(accessed_bit,
> ptep), we can't shadow that pte (all shadowed ptes must have the accessed
> bit set in the corresponding guest pte, similar to how a tlb entry can only
> exist if the accessed bit is set).

Is this a software invariant to ensure that we'll refresh the accessed
bit on the user pte too?
I assume this is needed because otherwise, if we clear the accessed bit
on the shadow pte and we clear it on the user pte, when the shadow is
mapped in the TLB again the accessed bit will be set on the shadow in
hardware, but not on the user pte, because the accessed bit is set on
the spte without a kvm page fault.

So this means kscand, by clearing the accessed bitflag on them, should
automatically unshadow all user ptes pointed by the fixmap pte.

So a second test_and_clear_bit on the same user pte will run through
the fixmap pte established by kmap_atomic without traps.

So this means when the user program runs again, it'll find the user pte
unshadowed and it'll have to re-instantiate the shadow ptes with a kvm
page fault, which has the primary objective of marking the user pte
accessed again (to notify the next kscand pass that the data page
pointed by the user pte was used meanwhile).

If I understand correctly, the establishment of the shadow pte
corresponding to the user pte will have to wrprotect the spte
corresponding to the fixmap pte, because we need to intercept
modifications to shadowed guest ptes, and the spte corresponding to the
fixmap guest pte is now pointing to a shadowed guest pte after the
program returns to running.

Then when kscand runs again, for the pages that have been faulted in
by the user program, we'll trap the test_and_clear_bit happening
through the readonly spte corresponding to the fixmap guest pte, we'll
unshadow the spte of the guest user pte again, and we'll mark the
spte corresponding to the fixmap pte as read-write again, because the
test_and_clear_bit tells us that we have to unshadow instead of
emulating.
2.4 without HIGHMEM_DEBUG sets the pte and invlpg in kmap_atomic and does nothing in kunmap_atomic. 2.6 sets the pte in kmap_atomic, and clears it+invlpg in kunmap_atomic. >> I think what we should aim for is to quickly reach this condition: >> >> 1) always keep the fixmap/kmap pte_t shadowed and emulate the >> kmap/kunmap access so the test_and_clear_young done on the user pte >> doesn't require to re-establish the spte representing the fixmap >> virtual address. If we don't emulate fixmap we'll have to >> re-establish the spte during the write to the user pte, and >> tear it down again during kunmap_atomic. So there's not much doubt >> fixmap access emulation is worth it. >> > > That is what is done by current HEAD. > 418c6952ba9fd379059ed325ea5a3efe904fb7fd is responsible. Cool! > > Note that there is an alternative: allow the kmap pte to be unshadowed, and > instead emulate the access through that pte (i.e. emulate the btc > instruction). I don't think it's worth it though because it hurts other > users of the fixmap page. >> 2) get rid of the user pte shadow mapping pointing to the user data so >> the test_and_clear of the young bitflag on the user pte will not be >> emulated and it'll run at full CPU speed through the shadow pte >> mapping corresponding to the fixmap virtual address >> > > That's what per-page-pte-history is supposed to do. The first few accesses > are emulated, the next will be native. Why not to go native immediately when we notice a test_and_clear of the accessed bit? First the ptes won't be in contiguous virtual address order, so if the flooding of the sptes corresponding to the guest user pte depends on the gpa of the guest user ptes being contiguous it won't work well. But more importantly we've found a test_and_clear_bit of the accessed bitflag, so we should unshadow the user pte that is being marked "old" immediately without need to detect any flooding. > It's still not full speed as the kmap setup has to be emulated (twice). 
Agreed, the 1/2/3 emulations on writes to the fixmap area during
kmap_atomic (1/2 for non-PAE/PAE and 1 further pte_clear on 2.6 or 2.4
debug-highmem) seem unavoidable.

But the test_and_clear_bit wrprotect fault (when the guest user pte
is shadowed) should just unshadow the guest user pte, mark the spte
representing the fixmap pte as writeable, and return immediately to
guest mode to actually run test_and_clear_bit natively without writing
it through emulation.

Noticing the test_and_clear_bit also requires a bit of instruction
"detection", but once we've detected it from the eip address, we don't
have to write anything to the guest.

But I guess I'm missing something...

> One possible optimization is that if we see the first part of the kmap
> instantiation, we emulate a few more instructions before returning to the
> guest. Xen does this IIRC.

Surely this would avoid 1 wrprotect fault per kmap_atomic, but I'm not
sure if 32bit PAE is important enough to do this. Most 32bit enterprise
kernels I've worked with aren't compiled with PAE; only the one called
bigsmp is.

Also on 2.6, we could get the same benefit by making 2.6 at least as
optimal as 2.4 by never clearing the fixmap pte and by doing invlpg
only after setting it to a new value. Xen can't optimize that write in
kunmap_atomic.

2.6 has debug enabled by default for no good reason. So that would be
the first optimization to do, as it saves a few cycles per
kunmap_atomic on the host too.

> I'm no longer sure the access pattern is sequential, since I see
> kmap_atomic() will not recreate the pte if its value has not changed
> (unless HIGHMEM_DEBUG).

Hmm, kmap_atomic always writes a new value to the fixmap pte, even if
it was mapping the same user pte as before.
static inline void *kmap_atomic(struct page *page, enum km_type type)
{
        enum fixed_addresses idx;
        unsigned long vaddr;

        if (page < highmem_start_page)
                return page_address(page);

        idx = type + KM_TYPE_NR*smp_processor_id();
        vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
#if HIGHMEM_DEBUG
        if (!pte_none(*(kmap_pte-idx)))
                out_of_line_bug();
#endif
        set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
        __flush_tlb_one(vaddr);

        return (void*) vaddr;
}

In 2.6 it does too, because it does the debug pte_clear in
kunmap_atomic. In theory even the host could do pte_same() and avoid
an invlpg if it didn't change, but I'm unsure how frequently we remap
the same page; the pte loops like mprotect will map the 4k-large pte
and loop over it once it's mapped by the fixmap virtual address. So
frequent repetitions of remapping the same page with kmap_atomic
sound unlikely.

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-29 14:27 ` Andrea Arcangeli @ 2008-05-29 15:11 ` David S. Ahern 2008-05-29 15:16 ` Avi Kivity 1 sibling, 0 replies; 73+ messages in thread From: David S. Ahern @ 2008-05-29 15:11 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Avi Kivity, kvm Andrea Arcangeli wrote: > On Thu, May 29, 2008 at 01:01:06PM +0300, Avi Kivity wrote: >> No, two: >> >> static inline void set_pte(pte_t *ptep, pte_t pte) >> { >> ptep->pte_high = pte.pte_high; >> smp_wmb(); >> ptep->pte_low = pte.pte_low; >> } > > Right, that can be 2 or 1 depending on PAE non-PAE, other 2.4 > enterprise distro with pte-highmem ships non-PAE kernels by default. RHEL3U8 has CONFIG_X86_PAE set. <snipped> >>>>> - an access to tear down the kmap >>>>> >>> Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that >>> matters). >>> >> Looking at the code, that only happens if CONFIG_HIGHMEM_DEBUG is set. > > 2.4 yes. 2.6 is will do similar to CONFIG_HIGHMEM_DEBUG. > > 2.4 without HIGHMEM_DEBUG sets the pte and invlpg in kmap_atomic and > does nothing in kunmap_atomic. > > 2.6 sets the pte in kmap_atomic, and clears it+invlpg in kunmap_atomic. CONFIG_DEBUG_HIGHMEM is set. <snipped> >> One possible optimization is that if we see the first part of the kmap >> instantiation, we emulate a few more instructions before returning to the >> guest. Xen does this IIRC. > > Surely this would avoid 1 wrprotect fault per kmap_atomic, but I'm not > sure if 32bit PAE is that important to do this. Most 32bit enterprise > kernels I worked aren't compiled with PAE, only one called bigsmp is. RHEL3 has a hugemem kernel which basically just enables the 4G/4G split. My guest with the hugemem kernel runs much better than the standard smp kernel. 
If you care to download it the RHEL3U8 kernel source is posted here: ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/3AS/en/os/SRPMS/kernel-2.4.21-47.EL.src.rpm Red Hat does heavily patch kernels, so they will be dramatically different than the kernel.org kernel with the same number. david ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-29 14:27 ` Andrea Arcangeli 2008-05-29 15:11 ` David S. Ahern @ 2008-05-29 15:16 ` Avi Kivity 2008-05-30 13:12 ` Andrea Arcangeli 1 sibling, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-05-29 15:16 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: David S. Ahern, kvm

Andrea Arcangeli wrote:
>>> Here I count the second write, and this isn't done on the fixmap area
>>> like the first write above; this is a write to the real user pte,
>>> pointed by the fixmap. So if this is emulated it means the shadow of
>>> the user pte pointing to the real data page is still active.
>>>
>>
>> Right. But if we are scanning a page table linearly, it should be
>> unshadowed.
>>
>
> I think we're often not scanning page tables linearly with pte_chains,
> but those should still be unshadowed. mmaps won't always bring
> memory in linear order; memory isn't always initialized by memset
> or paged in with contiguous virtual accesses.
>

I guess we aren't scanning the page table linearly, since with the
linear-scan test case I can't reproduce the problem.

> So while the assumption that following the active list will sometimes
> return guest ptes that map contiguous guest virtual addresses is valid,
> it only accounts for a small percentage of the active list. It largely
> depends on the userland apps. Furthermore, even if the active lru is
> initially pointing to linear ptes, the list is then split into age
> buckets depending on the access patterns at runtime, and that further
> fragments the linearity of the virtual addresses of the kmapped ptes.
>
> BTW, one thing we didn't account for in the previous email is that there
> can be more than one guest user pte modified by page_referenced, if
> it's not a direct page.
And non direct pages surely won't provide > linear scans, infact for non linear pages the most common is that the > pte_t will point to the same virtual address but on a different > pgd_t * (and in turn on a different pmd_t). > > Since the pte tracking is per-page, it won't be affected by shared pages. >>> You mean the accessed bit on fixmap pte used by kmap? Or the user pte >>> pointed by the fixmap pte? >>> >>> >> The user pte. After guest code runs test_and_clear_bit(accessed_bit, >> ptep), we can't shadow that pte (all shadowed ptes must have the accessed >> bit set in the corresponding guest pte, similar to how a tlb entry can only >> exist if the accessed bit is set). >> > > Is this software invariant to ensure that we'll refresh the accessed > bit on the user pte too? > > Yes. We need a fault in order to set the guest accessed bit. > So this means kscand by clearing the accessed bitflag on them, should > automatically unshadowing all user ptes pointed by the fixmap pte. > > So a secnd test_and_clear_bit on the same user pte will run through > the fixmap pte established by kmap_atomic without traps. > > So this means when the user program run again, it'll find the user pte > unshadowed and it'll have to re-instantiate the shadow ptes with a kvm > page fault, that has the primary objective of marking the user pte > accessed again (to notify the next kscand pass that the data page > pointed by the user pte was used meanwhile). > > If I understand correctly, the establishment of the shadow pte > corresponding to the user pte, will have to mark wrprotect the spte > corresponding to the fixmap pte because we need to intercept > modifications to shadowed guest ptes and the spte corresponding to the > fixmap guest pte is now pointing to a shadowed guest pte after the > program returns running. 
> > Then when kscand runs again, for the pages that have been faulted in > by the user program, we'll trap the test_and_clear_bit happening > through the readonly spte corresponding to the fixmap guest pte, and > we'll unshadow the spte of the guest user pte again and we'll mark the > spte corresponding to the fixmap pte as read-write again, because of > the test_and_clear_bit tells us that we've to unshadow instead of > emulating. > Yes. >>> 2) get rid of the user pte shadow mapping pointing to the user data so >>> the test_and_clear of the young bitflag on the user pte will not be >>> emulated and it'll run at full CPU speed through the shadow pte >>> mapping corresponding to the fixmap virtual address >>> >>> >> That's what per-page-pte-history is supposed to do. The first few accesses >> are emulated, the next will be native. >> > > Why not to go native immediately when we notice a test_and_clear of > the accessed bit? First the ptes won't be in contiguous virtual > address order, so if the flooding of the sptes corresponding to the > guest user pte depends on the gpa of the guest user ptes being > contiguous it won't work well. But more importantly we've found a > test_and_clear_bit of the accessed bitflag, so we should unshadow the > user pte that is being marked "old" immediately without need to detect > any flooding. > Unshadowing a page is expensive, both in immediate cost, and in future cost of reshadowing the page and taking faults. It's worthwhile to be sure the guest really doesn't want it as a page table. >> It's still not full speed as the kmap setup has to be emulated (twice). >> > > Agreed, the 1/2/3 emulations on writes to the fixmap area during > kmap_atomic (1/2 for non-PAE/PAE and 1 further pte_clear on 2.6 or 2.4 > debug-highmem) seems unavoidable. 
>
> But the test_and_clear_bit wrprotect fault (when the guest user pte
> is shadowed) should just unshadow the guest user pte, mark the spte
> representing the fixmap pte as writeable, and return immediately to
> guest mode to actually run test_and_clear_bit natively without writing
> it through emulation.
>
> Noticing the test_and_clear_bit also requires a bit of instruction
> "detection", but once we've detected it from the eip address, we don't
> have to write anything to the guest.
>
> But I guess I'm missing something...
>

If the pages are not scanned linearly, then unshadowing may not help.
Let's see: 1G of highmem is 250,000 pages, mapped by 500 page tables.

Well, then after 4000 scans we ought to have unshadowed everything. So I
guess per-page-pte-history is broken, can't explain it otherwise.

>> One possible optimization is that if we see the first part of the kmap
>> instantiation, we emulate a few more instructions before returning to the
>> guest. Xen does this IIRC.
>>
>
> Surely this would avoid 1 wrprotect fault per kmap_atomic, but I'm not
> sure if 32bit PAE is important enough to do this. Most 32bit enterprise
> kernels I've worked with aren't compiled with PAE; only the one called
> bigsmp is.
>

Well, seems RHEL 3.8 smp is PAE.

> Also on 2.6, we could get the same benefit by making 2.6 at least as
> optimal as 2.4 by never clearing the fixmap pte and by doing invlpg
> only after setting it to a new value. Xen can't optimize that write in
> kunmap_atomic.
>
> 2.6 has debug enabled by default for no good reason. So that would be
> the first optimization to do, as it saves a few cycles per
> kunmap_atomic on the host too.
>

Yes, it's probably a small win on native as well.

>> I'm no longer sure the access pattern is sequential, since I see
>> kmap_atomic() will not recreate the pte if its value has not changed
>> (unless HIGHMEM_DEBUG).
>>
>
> Hmm, kmap_atomic always writes a new value to the fixmap pte, even if
> it was mapping the same user pte as before.
>
> static inline void *kmap_atomic(struct page *page, enum km_type type)
> {
>         enum fixed_addresses idx;
>         unsigned long vaddr;
>
>         if (page < highmem_start_page)
>                 return page_address(page);
>
>         idx = type + KM_TYPE_NR*smp_processor_id();
>         vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
> #if HIGHMEM_DEBUG
>         if (!pte_none(*(kmap_pte-idx)))
>                 out_of_line_bug();
> #endif
>         set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
>         __flush_tlb_one(vaddr);
>
>         return (void*) vaddr;
> }
>

The centos 3.8 sources have

static inline void *__kmap_atomic(struct page *page, enum km_type type)
{
        enum fixed_addresses idx;
        unsigned long vaddr;

        idx = type + KM_TYPE_NR*smp_processor_id();
        vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
#if HIGHMEM_DEBUG
        if (!pte_none(*(kmap_pte-idx)))
                out_of_line_bug();
#else
        /*
         * Performance optimization - do not flush if the new
         * pte is the same as the old one:
         */
        if (pte_val(*(kmap_pte-idx)) == pte_val(mk_pte(page, kmap_prot)))
                return (void *) vaddr;
#endif
        set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
        __flush_tlb_one(vaddr);

        return (void*) vaddr;
}

(linux-2.4.21-47.EL)

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-29 15:16 ` Avi Kivity @ 2008-05-30 13:12 ` Andrea Arcangeli 2008-05-31 7:39 ` Avi Kivity 0 siblings, 1 reply; 73+ messages in thread From: Andrea Arcangeli @ 2008-05-30 13:12 UTC (permalink / raw) To: Avi Kivity; +Cc: David S. Ahern, kvm

On Thu, May 29, 2008 at 06:16:55PM +0300, Avi Kivity wrote:
> Yes. We need a fault in order to set the guest accessed bit.

So what I'm missing now is how the spte corresponding to the user pte
that is under test_and_clear to clear the accessed bit will not be
zapped immediately. If we don't zap it immediately, how do we set the
accessed bit again on the user pte, when the user program returns to
running and uses that shadow pte to access the program data after the
kscand pass? Or am I missing something?

> Unshadowing a page is expensive, both in immediate cost, and in future cost
> of reshadowing the page and taking faults. It's worthwhile to be sure the
> guest really doesn't want it as a page table.

Ok, that makes sense, but can we defer the unshadowing while still
emulating the accessed bit correctly on the user pte?

> If the pages are not scanned linearly, then unshadowing may not help.

It should help the second time kscand runs: for the user ptes that
aren't shadowed anymore, the second pass won't require any emulation
for test_and_clear_bit because the spte of the fixmap area will be
read-write. The bug that passes the anonymous pages number instead of
the cache number will lead to many more test_and_clear than needed,
and not all user ptes may be used in between two different kscand passes.

> Let's see: 1G of highmem is 250,000 pages, mapped by 500 page tables.

There are likely 1500 ptes in highmem. (ram isn't the most important factor)

> Well, then after 4000 scans we ought to have unshadowed everything. So I
> guess per-page-pte-history is broken, can't explain it otherwise.
Yes, we should have unshadowed all user ptes after 4000 scans, and then
the test_and_clear shouldn't require any more emulation; there will be
only 3 emulations for each kmap_atomic/kunmap_atomic.

> static inline void *__kmap_atomic(struct page *page, enum km_type type)
> {
>         enum fixed_addresses idx;
>         unsigned long vaddr;
>
>         idx = type + KM_TYPE_NR*smp_processor_id();
>         vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
> #if HIGHMEM_DEBUG
>         if (!pte_none(*(kmap_pte-idx)))
>                 out_of_line_bug();
> #else
>         /*
>          * Performance optimization - do not flush if the new
>          * pte is the same as the old one:
>          */
>         if (pte_val(*(kmap_pte-idx)) == pte_val(mk_pte(page, kmap_prot)))
>                 return (void *) vaddr;
> #endif
>         set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
>         __flush_tlb_one(vaddr);
>
>         return (void*) vaddr;
> }

It's weird they optimized this if they enabled CONFIG_HIGHMEM_DEBUG.
Anyway it doesn't make a whole lot of difference as it's an unlikely
condition.

> (linux-2.4.21-47.EL)

Downloaded it now. I think it should be clear by now that we're trying
to be bug-compatible with the host here, and optimizing for 2.6 kmaps.

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-30 13:12 ` Andrea Arcangeli @ 2008-05-31 7:39 ` Avi Kivity 0 siblings, 0 replies; 73+ messages in thread From: Avi Kivity @ 2008-05-31 7:39 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: David S. Ahern, kvm Andrea Arcangeli wrote: > On Thu, May 29, 2008 at 06:16:55PM +0300, Avi Kivity wrote: > >> Yes. We need a fault in order to set the guest accessed bit. >> > > So what I'm missing now is how the spte corresponding to the user pte > that is under test_and_clear to clear the accessed bit, will not the > zapped immediately. If we don't zap it immediately, how do we set the > accessed bit again on the user pte, when the user program returned > running and used that shadow pte to access the program data after the > kscand pass? > > The spte is zapped unconditionally in kvm_mmu_pte_write(), and not re-established in mmu_pte_write_new_pte() due to the missing accessed bit. The question is whether to tear down the shadow page it is contained in, or not. > Or am I missing something? > > >> Unshadowing a page is expensive, both in immediate cost, and in future cost >> of reshadowing the page and taking faults. It's worthwhile to be sure the >> guest really doesn't want it as a page table. >> > > Ok that makes sense, but can we defer the unshadowing while still > emulating the accessed bit correctly on the user pte? > > We do, unless there's a bad bug somewhere. >> If the pages are not scanned linearly, then unshadowing may not help. >> > > It should help the second time kscand runs, for the user ptes that > aren't shadowed anymore, the second pass won't require any emulation > for test_and_bit because the spte of the fixmap area will be > read-write. The bug that passes the anonymous pages number instead of > the cache number will lead to many more test_and_clear than needed, > and not all user ptes may be used in between two different kscand passes. 
>

We still need 3 emulations per pte to set the fixmap entry.
Unshadowing saves one emulation on the pte itself.

>> Let's see: 1G of highmem is 250,000 pages, mapped by 500 page tables.
>>
>
> There are likely 1500 ptes in highmem. (ram isn't the most important factor)
>

I use 'pte' in the Intel manual sense (page table entry), not the
Linux sense (page table). I mentioned these numbers to see the worst
case behavior.

Non-highmem:
- with unshadow: O(500) accesses to unshadow the page tables, then native speed
- without unshadow: O(250000) accesses to modify the ptes

Highmem:
- with unshadow: O(250000) accesses to update the fixmap entry
- without unshadow: O(250000) accesses to update the fixmap entry and to modify the ptes

>> Well, then after 4000 scans we ought to have unshadowed everything. So I
>> guess per-page-pte-history is broken, can't explain it otherwise.
>>
>
> Yes, we should have unshadowed all user ptes after 4000 scans, and then
> the test_and_clear shouldn't require any more emulation; there will be
> only 3 emulations for each kmap_atomic/kunmap_atomic.
>

So we save 25%. It's still bad even if everything is working correctly.

>
> I think it should be clear by now that we're trying to be
> bug-compatible with the host here, and optimizing for 2.6 kmaps.
>

Don't understand. I'm guessing esx gets its good performance by
special-casing something. For example, they can keep the fixmap page
never shadowed, always emulate accesses through the fixmap page, and
recompile instructions that go through fixmap to issue a hypercall.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-28 10:51 ` Avi Kivity 2008-05-28 14:13 ` David S. Ahern @ 2008-05-29 16:42 ` David S. Ahern 2008-05-31 8:16 ` Avi Kivity 1 sibling, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-05-29 16:42 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm [-- Attachment #1: Type: text/plain, Size: 5838 bytes --] Avi Kivity wrote: > David S. Ahern wrote: >> The short answer is that I am still seeing large system time hiccups in the >> guests due to kscand in the guest scanning its active lists. I do see >> better response for a KVM_MAX_PTE_HISTORY of 3 than with 4. (For >> completeness I also tried a history of 2, but it performed worse than 3, >> which is no surprise given the meaning of it.) >> >> >> I have been able to scratch out a simplistic program that stimulates >> kscand activity similar to what is going on in my real guest (see >> attached). The program requests a memory allocation, initializes it (to >> get it backed) and then in a loop sweeps through the memory in chunks, >> similar to a program using parts of its memory here and there but >> eventually accessing all of it. >> >> Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is >> using a fair amount of highmem. Start a couple of instances of the >> attached. For example, I've been using these 2:
>>
>> memuser 768M 120 5 300
>> memuser 384M 300 10 600
>>
>> Together these instances take up 1GB of RAM and once initialized >> consume very little CPU. On kvm they make kscand and kswapd go nuts >> every 5-15 minutes. For comparison, I do not see the same behavior for >> an identical setup running on esx 3.5. >> > > I haven't been able to reproduce this: > >> [root@localhost root]# ps -elf | grep -E 'memuser|kscand' >> 1 S root 7 1 1 75 0 - 0 schedu 10:07 ? 
>> 00:00:26 [kscand] >> 0 S root 1464 1 1 75 0 - 196986 schedu 10:20 pts/0 >> 00:00:21 ./memuser 768M 120 5 300 >> 0 S root 1465 1 0 75 0 - 98683 schedu 10:20 pts/0 >> 00:00:10 ./memuser 384M 300 10 600 >> 0 S root 2148 1293 0 75 0 - 922 pipe_w 10:48 pts/0 >> 00:00:00 grep -E memuser|kscand > > The workload has been running for about half an hour, and kswapd cpu > usage doesn't seem significant. This is a 2GB guest running with my > patch ported to kvm.git HEAD. Guest has 2GB of memory. >

I'm running on the per-page-pte-tracking branch, and I am still seeing it. I doubt you want to sit and watch the screen for an hour, so install sysstat if it isn't already, change the sample rate to 1 minute (/etc/cron.d/sysstat), let the server run for a few hours and then run 'sar -u'. You'll see something like this:

10:12:11 AM LINUX RESTART

10:13:03 AM CPU %user %nice %system %iowait %idle
10:14:01 AM all 0.08 0.00 2.08 0.35 97.49
10:15:03 AM all 0.05 0.00 0.79 0.04 99.12
10:15:59 AM all 0.15 0.00 1.52 0.06 98.27
10:17:01 AM all 0.04 0.00 0.69 0.04 99.23
10:17:59 AM all 0.01 0.00 0.39 0.00 99.60
10:18:59 AM all 0.00 0.00 0.12 0.02 99.87
10:20:02 AM all 0.18 0.00 14.62 0.09 85.10
10:21:01 AM all 0.71 0.00 26.35 0.01 72.94
10:22:02 AM all 0.67 0.00 10.61 0.00 88.72
10:22:59 AM all 0.14 0.00 1.80 0.00 98.06
10:24:03 AM all 0.13 0.00 0.50 0.00 99.37
10:24:59 AM all 0.09 0.00 11.46 0.00 88.45
10:26:03 AM all 0.16 0.00 0.69 0.03 99.12
10:26:59 AM all 0.14 0.00 10.01 0.02 89.83
10:28:03 AM all 0.57 0.00 2.20 0.03 97.20
Average: all 0.21 0.00 5.55 0.05 94.20

Every one of those jumps in %system time directly correlates to kscand activity. Without the memuser programs running the guest %system time is <1%. The point of this silly memuser program is just to use high memory -- let it age, then make it active again, sit idle, repeat. If you run kvm_stat with -l in the host you'll see the jump in pte writes/updates. 
An intern here added a timestamp to the kvm_stat output for me which helps to directly correlate guest/host data. I also ran my real guest on the branch. Performance at boot through the first 15 minutes was much better, but I'm still seeing recurring hits every 5 minutes when kscand kicks in. Here's the data from the guest for the first one, which happened after 15 minutes of uptime:

active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct 24845, dj 59
active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct 40868, dj 103
active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct 45805, dj 1212

The kvm_stat data for this time period is attached due to line lengths. Also, I forgot to mention this before, but there is a bug in the kscand code in the RHEL3U8 kernel. When it scans the cache list it uses the count from the anonymous list:

if (need_active_cache_scan(zone)) {
	for (age = MAX_AGE-1; age >= 0; age--) {
		scan_active_list(zone, age,
			&zone->active_cache_list[age],
			zone->active_anon_count[age]);
			^^^^^^^^^^^^^^^^^
		if (current->need_resched)
			schedule();
	}
}

When the anonymous count is higher it is scanning the cache list repeatedly. An example of that was captured here:

active_cache_scan: HighMem, age 7, count[age] 222 -> 179, count anon 111967, direct 626, dj 3

count anon is active_anon_count[age], which at this moment was 111,967. There were only 222 entries in the cache list, but the count value passed to scan_active_list was 111,967. When the cache list has a lot of direct pages, that causes a larger hit on kvm than needed. That said, I have to live with the bug in the guest. 
david

[-- Attachment #2: kvm_stat.kscand --]
[-- Type: text/plain, Size: 2650 bytes --]

kvm-69/kvm_stat -f 'mmu*|pf*' -l:

mmio_exit mmu_cache mmu_flood mmu_pde_z mmu_pte_u mmu_pte_w mmu_recyc mmu_shado pf_fixed pf_guest
182 18 18 0 5664 5682 0 18 5720 21
211 59 59 0 7040 7105 0 59 7348 99
81 0 48 0 45861 45909 0 48 45910 1
209 683 814 0 178527 179405 0 814 181410 9
67 111 320 0 175602 175922 0 320 177202 0
28 0 29 0 181365 181394 0 29 181394 0
7 0 22 0 181834 181856 0 22 181855 0
35 0 14 0 180129 180143 0 14 180144 0
7 0 10 0 179141 179151 0 10 179150 0
35 0 3 0 181359 181361 0 3 181362 0
7 0 4 0 181565 181570 0 4 181570 0
21 0 3 0 181435 181437 0 3 181437 0
21 0 4 0 181281 181286 0 4 181285 0
21 0 3 0 179444 179447 0 3 179448 0
91 0 61 0 179841 179902 0 61 179902 0
7 0 247 0 176628 176875 0 247 176874 0
313 478 133 1 100486 100604 0 133 126690 80
162 21 18 0 6361 6379 0 18 6584 5
294 40 23 21 9144 9188 0 25 9544 45
143 5 1 0 5026 5027 0 1 5502 1

The above corresponds to the following from the guest:

active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct 24845, dj 59
active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct 40868, dj 103
active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct 45805, dj 1212

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-29 16:42 ` David S. Ahern @ 2008-05-31 8:16 ` Avi Kivity 2008-06-02 16:42 ` David S. Ahern 0 siblings, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-05-31 8:16 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm David S. Ahern wrote: >> I haven't been able to reproduce this: >> >> >>> [root@localhost root]# ps -elf | grep -E 'memuser|kscand' >>> 1 S root 7 1 1 75 0 - 0 schedu 10:07 ? >>> 00:00:26 [kscand] >>> 0 S root 1464 1 1 75 0 - 196986 schedu 10:20 pts/0 >>> 00:00:21 ./memuser 768M 120 5 300 >>> 0 S root 1465 1 0 75 0 - 98683 schedu 10:20 pts/0 >>> 00:00:10 ./memuser 384M 300 10 600 >>> 0 S root 2148 1293 0 75 0 - 922 pipe_w 10:48 pts/0 >>> 00:00:00 grep -E memuser|kscand >>> >> The workload has been running for about half an hour, and kswapd cpu >> usage doesn't seem significant. This is a 2GB guest running with my >> patch ported to kvm.git HEAD. Guest is has 2G of memory. >> >> > > I'm running on the per-page-pte-tracking branch, and I am still seeing it. > > I doubt you want to sit and watch the screen for an hour, so install sysstat if not already, change the sample rate to 1 minute (/etc/cron.d/sysstat), let the server run for a few hours and then run 'sar -u'. 
You'll see something like this: > > 10:12:11 AM LINUX RESTART > > 10:13:03 AM CPU %user %nice %system %iowait %idle > 10:14:01 AM all 0.08 0.00 2.08 0.35 97.49 > 10:15:03 AM all 0.05 0.00 0.79 0.04 99.12 > 10:15:59 AM all 0.15 0.00 1.52 0.06 98.27 > 10:17:01 AM all 0.04 0.00 0.69 0.04 99.23 > 10:17:59 AM all 0.01 0.00 0.39 0.00 99.60 > 10:18:59 AM all 0.00 0.00 0.12 0.02 99.87 > 10:20:02 AM all 0.18 0.00 14.62 0.09 85.10 > 10:21:01 AM all 0.71 0.00 26.35 0.01 72.94 > 10:22:02 AM all 0.67 0.00 10.61 0.00 88.72 > 10:22:59 AM all 0.14 0.00 1.80 0.00 98.06 > 10:24:03 AM all 0.13 0.00 0.50 0.00 99.37 > 10:24:59 AM all 0.09 0.00 11.46 0.00 88.45 > 10:26:03 AM all 0.16 0.00 0.69 0.03 99.12 > 10:26:59 AM all 0.14 0.00 10.01 0.02 89.83 > 10:28:03 AM all 0.57 0.00 2.20 0.03 97.20 > Average: all 0.21 0.00 5.55 0.05 94.20 > > > every one of those jumps in %system time directly correlates to kscand activity. Without the memuser programs running the guest %system time is <1%. The point of this silly memuser program is just to use high memory -- let it age, then make it active again, sit idle, repeat. If you run kvm_stat with -l in the host you'll see the jump in pte writes/updates. An intern here added a timestamp to the kvm_stat output for me which helps to directly correlate guest/host data. > > > I also ran my real guest on the branch. Performance at boot through the first 15 minutes was much better, but I'm still seeing recurring hits every 5 minutes when kscand kicks in. Here's the data from the guest for the first one which happened after 15 minutes of uptime: > > active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct 24845, dj 59 > > active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct 40868, dj 103 > > active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct 45805, dj 1212 > > We touched 90,000 ptes in 12 seconds. That's 8,000 ptes per second. Yet we see 180,000 page faults per second in the trace. Oh! 
Only 45K pages were direct, so the other 45K were shared, with perhaps many ptes. We should count ptes, not pages. Can you modify page_referenced() to count the number of ptes mapped (1 for direct pages, nr_chains for indirect pages) and print the total deltas in active_anon_scan? > The kvm_stat data for this time period is attached due to line lengths. > > > Also, I forgot to mention this before, but there is a bug in the kscand code in the RHEL3U8 kernel. When it scans the cache list it uses the count from the anonymous list: > >
> if (need_active_cache_scan(zone)) {
> 	for (age = MAX_AGE-1; age >= 0; age--) {
> 		scan_active_list(zone, age,
> 			&zone->active_cache_list[age],
> 			zone->active_anon_count[age]);
> 			^^^^^^^^^^^^^^^^^
> 		if (current->need_resched)
> 			schedule();
> 	}
> }
>
> When the anonymous count is higher it is scanning the cache list repeatedly. An example of that was captured here: > > active_cache_scan: HighMem, age 7, count[age] 222 -> 179, count anon 111967, direct 626, dj 3 > > count anon is active_anon_count[age] which at this moment was 111,967. There were only 222 entries in the cache list, but the count value passed to scan_active_list was 111,967. When the cache list has a lot of direct pages, that causes a larger hit on kvm than needed. That said, I have to live with the bug in the guest. > For debugging, can you fix it? It certainly has a large impact. Perhaps it is fixed in an update kernel. There's a 2.4.21-50.EL in the centos 3.8 update repos. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-05-31 8:16 ` Avi Kivity @ 2008-06-02 16:42 ` David S. Ahern 2008-06-05 8:37 ` Avi Kivity 0 siblings, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-06-02 16:42 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm Avi Kivity wrote: > David S. Ahern wrote: >>> I haven't been able to reproduce this: >>> >>> >>>> [root@localhost root]# ps -elf | grep -E 'memuser|kscand' >>>> 1 S root 7 1 1 75 0 - 0 schedu 10:07 ? >>>> 00:00:26 [kscand] >>>> 0 S root 1464 1 1 75 0 - 196986 schedu 10:20 pts/0 >>>> 00:00:21 ./memuser 768M 120 5 300 >>>> 0 S root 1465 1 0 75 0 - 98683 schedu 10:20 pts/0 >>>> 00:00:10 ./memuser 384M 300 10 600 >>>> 0 S root 2148 1293 0 75 0 - 922 pipe_w 10:48 pts/0 >>>> 00:00:00 grep -E memuser|kscand >>>> >>> The workload has been running for about half an hour, and kswapd cpu >>> usage doesn't seem significant. This is a 2GB guest running with my >>> patch ported to kvm.git HEAD. Guest is has 2G of memory. >>> >>> >> >> I'm running on the per-page-pte-tracking branch, and I am still seeing >> it. >> I doubt you want to sit and watch the screen for an hour, so install >> sysstat if not already, change the sample rate to 1 minute >> (/etc/cron.d/sysstat), let the server run for a few hours and then run >> 'sar -u'. 
You'll see something like this: >> >> 10:12:11 AM LINUX RESTART >> >> 10:13:03 AM CPU %user %nice %system %iowait %idle >> 10:14:01 AM all 0.08 0.00 2.08 0.35 97.49 >> 10:15:03 AM all 0.05 0.00 0.79 0.04 99.12 >> 10:15:59 AM all 0.15 0.00 1.52 0.06 98.27 >> 10:17:01 AM all 0.04 0.00 0.69 0.04 99.23 >> 10:17:59 AM all 0.01 0.00 0.39 0.00 99.60 >> 10:18:59 AM all 0.00 0.00 0.12 0.02 99.87 >> 10:20:02 AM all 0.18 0.00 14.62 0.09 85.10 >> 10:21:01 AM all 0.71 0.00 26.35 0.01 72.94 >> 10:22:02 AM all 0.67 0.00 10.61 0.00 88.72 >> 10:22:59 AM all 0.14 0.00 1.80 0.00 98.06 >> 10:24:03 AM all 0.13 0.00 0.50 0.00 99.37 >> 10:24:59 AM all 0.09 0.00 11.46 0.00 88.45 >> 10:26:03 AM all 0.16 0.00 0.69 0.03 99.12 >> 10:26:59 AM all 0.14 0.00 10.01 0.02 89.83 >> 10:28:03 AM all 0.57 0.00 2.20 0.03 97.20 >> Average: all 0.21 0.00 5.55 0.05 94.20 >> >> >> every one of those jumps in %system time directly correlates to kscand >> activity. Without the memuser programs running the guest %system time >> is <1%. The point of this silly memuser program is just to use high >> memory -- let it age, then make it active again, sit idle, repeat. If >> you run kvm_stat with -l in the host you'll see the jump in pte >> writes/updates. An intern here added a timestamp to the kvm_stat >> output for me which helps to directly correlate guest/host data. >> >> >> I also ran my real guest on the branch. Performance at boot through >> the first 15 minutes was much better, but I'm still seeing recurring >> hits every 5 minutes when kscand kicks in. Here's the data from the >> guest for the first one which happened after 15 minutes of uptime: >> >> active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct >> 24845, dj 59 >> >> active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct >> 40868, dj 103 >> >> active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct >> 45805, dj 1212 >> >> > > We touched 90,000 ptes in 12 seconds. That's 8,000 ptes per second. 
> Yet we see 180,000 page faults per second in the trace. > > Oh! Only 45K pages were direct, so the other 45K were shared, with > perhaps many ptes. We should count ptes, not pages. > > Can you modify page_referenced() to count the number of ptes mapped (1 > for direct pages, nr_chains for indirect pages) and print the total > deltas in active_anon_scan? >

Here you go. I've shortened the line lengths to get them to squeeze into 80 columns:

anon_scan, all HighMem zone, 187,910 active pages at loop start:
count[12] 21462 -> 230, direct 20469, chains 3479, dj 58
count[11] 1338 -> 1162, direct 227, chains 26144, dj 59
count[8] 29397 -> 5410, direct 26115, chains 27617, dj 117
count[4] 35804 -> 25556, direct 31508, chains 82929, dj 256
count[3] 2738 -> 2207, direct 2680, chains 58, dj 7
count[0] 92580 -> 89509, direct 75024, chains 262834, dj 726
(age number is the index in [])

cache_scan, all HighMem zone, 48,298 active pages at loop start:
count[12] 3642 -> 2982, direct 499, chains 20022, dj 44
count[8] 11254 -> 11187, direct 7189, chains 9854, dj 37
count[4] 15709 -> 15702, direct 5071, chains 9388, dj 31
(with anon_cache_count bug fixed)

If you sum the direct pages and the chains count for each row and convert dj into dt (divide by HZ = 100), you get:

( 20469 + 3479 ) / 0.58 = 41289
( 227 + 26144 ) / 0.59 = 44696
( 26115 + 27617 ) / 1.17 = 45924
( 31508 + 82929 ) / 2.56 = 44701
( 2680 + 58 ) / 0.07 = 39114
( 75024 + 262834 ) / 7.26 = 46536
( 499 + 20022 ) / 0.44 = 46638
( 7189 + 9854 ) / 0.37 = 46062
( 5071 + 9388 ) / 0.31 = 46641

At 4 pte writes per direct page or chain entry, that comes to ~187,000/sec, which is close to the total collected by kvm_stat (data width shrunk to fit in e-mail; hope this is readable still):

|---------- mmu_ ----------|----- pf_ -----|
cache flood pde_z pte_u pte_w shado fixed guest
267 271 95 21455 21842 285 22840 165
66 88 0 12102 12224 88 12458 0
2042 2133 0 178146 180515 2133 188089 387
1053 1212 0 187067 188485 1212 193011 8
4771 4811 88 185129 190998 4825 207490 448
910 824 7 183066 184050 824 195836 12
707 785 0 176381 177300 785 180350 6
1167 1144 0 189618 191014 1144 195902 10
4238 4193 87 188381 193590 4206 207030 465
1448 1400 7 187786 189509 1400 198688 21
982 971 0 187880 189076 971 198405 2
1165 1208 0 190007 191503 1208 195746 13
1106 1146 0 189144 190550 1146 195143 0
4767 4788 96 185802 191704 4802 206362 477
1388 1431 0 187387 188991 1431 195115 3
584 551 0 77176 77802 551 84829 10
12 7 0 3601 3609 7 13497 4
243 153 91 31085 31333 167 35059 879
21 18 6 3130 3155 18 3827 2
21 4 1 4665 4670 4 6825 9

>> The kvm_stat data for this time period is attached due to line lengths. >> >> Also, I forgot to mention this before, but there is a bug in the kscand code in the RHEL3U8 kernel. When it scans the cache list it uses the count from the anonymous list: >>
>> if (need_active_cache_scan(zone)) {
>> 	for (age = MAX_AGE-1; age >= 0; age--) {
>> 		scan_active_list(zone, age,
>> 			&zone->active_cache_list[age],
>> 			zone->active_anon_count[age]);
>> 			^^^^^^^^^^^^^^^^^
>> 		if (current->need_resched)
>> 			schedule();
>> 	}
>> }
>>
>> When the anonymous count is higher it is scanning the cache list repeatedly. An example of that was captured here: >>
>> active_cache_scan: HighMem, age 7, count[age] 222 -> 179, count anon 111967, direct 626, dj 3
>>
>> count anon is active_anon_count[age] which at this moment was 111,967. There were only 222 entries in the cache list, but the count value passed to scan_active_list was 111,967. When the cache list has a lot of direct pages, that causes a larger hit on kvm than needed. That said, I have to live with the bug in the guest. >> > > For debugging, can you fix it? It certainly has a large impact. >

yes, I have run a few tests with it fixed to get a ballpark on the impact. The fix is included in the numbers above. > > >> Perhaps it is fixed in an update kernel. There's a 2.4.21-50.EL in the >> centos 3.8 update repos. 
> ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-06-02 16:42 ` David S. Ahern @ 2008-06-05 8:37 ` Avi Kivity 2008-06-05 16:20 ` David S. Ahern 0 siblings, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-06-05 8:37 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm David S. Ahern wrote: >> Oh! Only 45K pages were direct, so the other 45K were shared, with >> perhaps many ptes. We shoud count ptes, not pages. >> >> Can you modify page_referenced() to count the numbers of ptes mapped (1 >> for direct pages, nr_chains for indirect pages) and print the total >> deltas in active_anon_scan? >> >> > > Here you go. I've shortened the line lengths to get them to squeeze into > 80 columns: > > anon_scan, all HighMem zone, 187,910 active pages at loop start: > count[12] 21462 -> 230, direct 20469, chains 3479, dj 58 > count[11] 1338 -> 1162, direct 227, chains 26144, dj 59 > count[8] 29397 -> 5410, direct 26115, chains 27617, dj 117 > count[4] 35804 -> 25556, direct 31508, chains 82929, dj 256 > count[3] 2738 -> 2207, direct 2680, chains 58, dj 7 > count[0] 92580 -> 89509, direct 75024, chains 262834, dj 726 > (age number is the index in []) > > Where do all those ptes come from? that's 180K pages (most of highmem), but with 550K ptes. The memuser workload doesn't use fork(), so there shouldn't be any indirect ptes. We might try to unshadow the fixmap page; that means we don't have to do 4 fixmap pte accesses per pte scanned. The kernel uses two methods for clearing the accessed bit: For direct pages: if (pte_young(*pte) && ptep_test_and_clear_young(pte)) referenced++; (two accesses) For indirect pages: if (ptep_test_and_clear_young(pte)) referenced++; (one access) Which have to be emulated if we don't shadow the fixmap. With the data above, that translates to 700K emulations with your numbers above, vs 2200K emulations, a 3X improvement. 
I'm not sure it will be sufficient given that we're reducing a 10-second kscand scan into a 3-second scan. > If you sum the direct pages and the chains count for each row, convert > dj into dt (divided by HZ = 100) you get: > > ( 20469 + 3479 ) / 0.58 = 41289 > ( 227 + 26144 ) / 0.59 = 44696 > ( 26115 + 27617 ) / 1.17 = 45924 > ( 31508 + 82929 ) / 2.56 = 44701 > ( 2680 + 58 ) / 0.07 = 39114 > ( 75024 + 262834 ) / 7.26 = 46536 > ( 499 + 20022 ) / 0.44 = 46638 > ( 7189 + 9854 ) / 0.37 = 46062 > ( 5071 + 9388 ) / 0.31 = 46641 > > For 4 pte writes per direct page or chain entry comes to ~187,000/sec > which is close to the total collected by kvm_stat (data width shrunk to > fit in e-mail; hope this is readable still): > > > |---------- mmu_ ----------|----- pf_ -----| > cache flood pde_z pte_u pte_w shado fixed guest > 267 271 95 21455 21842 285 22840 165 > 66 88 0 12102 12224 88 12458 0 > 2042 2133 0 178146 180515 2133 188089 387 > 1053 1212 0 187067 188485 1212 193011 8 > 4771 4811 88 185129 190998 4825 207490 448 > 910 824 7 183066 184050 824 195836 12 > 707 785 0 176381 177300 785 180350 6 > 1167 1144 0 189618 191014 1144 195902 10 > 4238 4193 87 188381 193590 4206 207030 465 > 1448 1400 7 187786 189509 1400 198688 21 > 982 971 0 187880 189076 971 198405 2 > 1165 1208 0 190007 191503 1208 195746 13 > 1106 1146 0 189144 190550 1146 195143 0 > 4767 4788 96 185802 191704 4802 206362 477 > 1388 1431 0 187387 188991 1431 195115 3 > 584 551 0 77176 77802 551 84829 10 > 12 7 0 3601 3609 7 13497 4 > 243 153 91 31085 31333 167 35059 879 > 21 18 6 3130 3155 18 3827 2 > 21 4 1 4665 4670 4 6825 9 > > >>> The kvm_stat data for this time period is attached due to line lengths. >>> >>> >>> Also, I forgot to mention this before, but there is a bug in the >>> kscand code in the RHEL3U8 kernel. 
When it scans the cache list it >>> uses the count from the anonymous list: >>> >>> if (need_active_cache_scan(zone)) { >>> for (age = MAX_AGE-1; age >= 0; age--) { >>> scan_active_list(zone, age, >>> &zone->active_cache_list[age], >>> zone->active_anon_count[age]); >>> ^^^^^^^^^^^^^^^^^ >>> if (current->need_resched) >>> schedule(); >>> } >>> } >>> >>> When the anonymous count is higher it is scanning the cache list >>> repeatedly. An example of that was captured here: >>> >>> active_cache_scan: HighMem, age 7, count[age] 222 -> 179, count anon >>> 111967, direct 626, dj 3 >>> >>> count anon is active_anon_count[age] which at this moment was 111,967. >>> There were only 222 entries in the cache list, but the count value >>> passed to scan_active_list was 111,967. When the cache list has a lot >>> of direct pages, that causes a larger hit on kvm than needed. That >>> said, I have to live with the bug in the guest. >>> >>> >> For debugging, can you fix it? It certainly has a large impact. >> >> > yes, I have run a few tests with it fixed to get a ballpark on the > impact. The fix is included in the number above. > > >> Perhaps it is fixed in an update kernel. There's a 2.4.21-50.EL in the >> centos 3.8 update repos. >> >> It seems to have been fixed there. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-06-05 8:37 ` Avi Kivity @ 2008-06-05 16:20 ` David S. Ahern 2008-06-06 16:40 ` Avi Kivity 0 siblings, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-06-05 16:20 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm Avi Kivity wrote: > David S. Ahern wrote: >>> Oh! Only 45K pages were direct, so the other 45K were shared, with >>> perhaps many ptes. We shoud count ptes, not pages. >>> >>> Can you modify page_referenced() to count the numbers of ptes mapped (1 >>> for direct pages, nr_chains for indirect pages) and print the total >>> deltas in active_anon_scan? >>> >>> >> >> Here you go. I've shortened the line lengths to get them to squeeze into >> 80 columns: >> >> anon_scan, all HighMem zone, 187,910 active pages at loop start: >> count[12] 21462 -> 230, direct 20469, chains 3479, dj 58 >> count[11] 1338 -> 1162, direct 227, chains 26144, dj 59 >> count[8] 29397 -> 5410, direct 26115, chains 27617, dj 117 >> count[4] 35804 -> 25556, direct 31508, chains 82929, dj 256 >> count[3] 2738 -> 2207, direct 2680, chains 58, dj 7 >> count[0] 92580 -> 89509, direct 75024, chains 262834, dj 726 >> (age number is the index in []) >> >> > > Where do all those ptes come from? that's 180K pages (most of highmem), > but with 550K ptes. > > The memuser workload doesn't use fork(), so there shouldn't be any > indirect ptes. > > We might try to unshadow the fixmap page; that means we don't have to do > 4 fixmap pte accesses per pte scanned. > > The kernel uses two methods for clearing the accessed bit: > > For direct pages: > > if (pte_young(*pte) && ptep_test_and_clear_young(pte)) > referenced++; > > (two accesses) > > For indirect pages: > > if (ptep_test_and_clear_young(pte)) > referenced++; > > (one access) > > Which have to be emulated if we don't shadow the fixmap. 
With the data > above, that translates to 700K emulations with your numbers above, vs > 2200K emulations, a 3X improvement. I'm not sure it will be sufficient > given that we're reducing a 10-second kscand scan into a 3-second scan. > A 3-second scan is much better; in comparison to where kvm was when I started this e-mail thread (as high as 30 seconds for a scan), it's a 10-fold improvement. I gave a shot at implementing your suggestion, but evidently I am still not understanding the shadow implementation. Can you suggest a patch to try this out? david ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-06-05 16:20 ` David S. Ahern @ 2008-06-06 16:40 ` Avi Kivity 2008-06-19 4:20 ` David S. Ahern 0 siblings, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-06-06 16:40 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm David S. Ahern wrote: > I gave a shot at implementing your suggestion, but evidently I am still > not understanding the shadow implementation. Can you suggest a patch to > try this out? > We can have a hacking session in kvm forum. Bring a guest on your laptop. It isn't going to be easy to both fix the problem and also not introduce a regression somewhere else. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-06-06 16:40 ` Avi Kivity @ 2008-06-19 4:20 ` David S. Ahern 2008-06-22 6:34 ` Avi Kivity 0 siblings, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-06-19 4:20 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm Avi: We did not get a chance to do this at the Forum. I'd be interested in whatever options you have for reducing the scan time further (e.g., try to get scan time down to 1-2 seconds). thanks, david Avi Kivity wrote: > David S. Ahern wrote: >> I gave a shot at implementing your suggestion, but evidently I am still >> not understanding the shadow implementation. Can you suggest a patch to >> try this out? >> > > We can have a hacking session in kvm forum. Bring a guest on your laptop. > > It isn't going to be easy to both fix the problem and also not introduce > a regression somewhere else. > ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-06-19 4:20 ` David S. Ahern @ 2008-06-22 6:34 ` Avi Kivity 2008-06-23 14:09 ` David S. Ahern 0 siblings, 1 reply; 73+ messages in thread From: Avi Kivity @ 2008-06-22 6:34 UTC (permalink / raw) To: David S. Ahern; +Cc: kvm [-- Attachment #1: Type: text/plain, Size: 1135 bytes --] David S. Ahern wrote: > Avi: > > We did not get a chance to do this at the Forum. I'd be interested in > whatever options you have for reducing the scan time further (e.g., try > to get scan time down to 1-2 seconds). > > I'm unlikely to get time to do this properly for at least a week, as this will be quite difficult and I'm already horribly backlogged. However there's an alternative option, modifying the source and getting it upstreamed, as I think RHEL 3 is still maintained. The attached patch (untested) should give a 3X boost for kmap_atomics, by folding the two accesses to set the pte into one, and by dropping the access that clears the pte. Unfortunately it breaks the ABI, since external modules will inline the original kmap_atomic() which expects the pte to be cleared. This can be worked around by allocating new fixmap slots for kmap_atomic with the new behavior, and keeping the old slots with the old behavior, but we should first see if the maintainers are open to performance optimizations targeting kvm. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. 
[-- Attachment #2: faster-2.4-kmap_atomic.patch --] [-- Type: text/x-patch, Size: 1057 bytes --] --- include/asm-i386/atomic_kmap.h.orig 2007-06-12 00:24:29.000000000 +0300 +++ include/asm-i386/atomic_kmap.h 2008-06-22 09:23:26.000000000 +0300 @@ -51,18 +51,13 @@ static inline void *__kmap_atomic(struct idx = type + KM_TYPE_NR*smp_processor_id(); vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); -#if HIGHMEM_DEBUG - if (!pte_none(*(kmap_pte-idx))) - out_of_line_bug(); -#else /* * Performance optimization - do not flush if the new * pte is the same as the old one: */ if (pte_val(*(kmap_pte-idx)) == pte_val(mk_pte(page, kmap_prot))) return (void *) vaddr; -#endif - set_pte(kmap_pte-idx, mk_pte(page, kmap_prot)); + set_pte_atomic(kmap_pte-idx, mk_pte(page, kmap_prot)); __flush_tlb_one(vaddr); return (void*) vaddr; @@ -77,12 +72,6 @@ static inline void __kunmap_atomic(void if (vaddr != __fix_to_virt(FIX_KMAP_BEGIN+idx)) out_of_line_bug(); - /* - * force other mappings to Oops if they'll try to access - * this pte without first remap it - */ - pte_clear(kmap_pte-idx); - __flush_tlb_one(vaddr); #endif } ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3) 2008-06-22 6:34 ` Avi Kivity @ 2008-06-23 14:09 ` David S. Ahern 2008-06-25 9:51 ` Avi Kivity 0 siblings, 1 reply; 73+ messages in thread From: David S. Ahern @ 2008-06-23 14:09 UTC (permalink / raw) To: Avi Kivity; +Cc: kvm Avi Kivity wrote: > David S. Ahern wrote: >> Avi: >> >> We did not get a chance to do this at the Forum. I'd be interested in >> whatever options you have for reducing the scan time further (e.g., try >> to get scan time down to 1-2 seconds). >> >> > > I'm unlikely to get time to do this properly for at least a week, as > this will be quite difficult and I'm already horribly backlogged. > However there's an alternative option, modifying the source and getting > it upstreamed, as I think RHEL 3 is still maintained. > > The attached patch (untested) should give a 3X boost for kmap_atomics, > by folding the two accesses to set the pte into one, and by dropping the > access that clears the pte. Unfortunately it breaks the ABI, since > external modules will inline the original kmap_atomic() which expects > the pte to be cleared. > > This can be worked around by allocating new fixmap slots for kmap_atomic > with the new behavior, and keeping the old slots with the old behavior, > but we should first see if the maintainers are open to performance > optimizations targeting kvm. > RHEL3 is in Maintenance mode (for an explanation see http://www.redhat.com/security/updates/errata/) which means performance enhancement patches will not make it in. Also, I'm going to be out of the office for a couple of weeks in July, so I will need to put this aside until mid-August or so. I'll reevaluate options then. david ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
  2008-06-23 14:09 ` David S. Ahern
@ 2008-06-25 9:51 ` Avi Kivity
  0 siblings, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-06-25 9:51 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm

David S. Ahern wrote:
>
> RHEL3 is in Maintenance mode (for an explanation see
> http://www.redhat.com/security/updates/errata/) which means performance
> enhancement patches will not make it in.
>

Scratch that idea, then.

> Also, I'm going to be out of the office for a couple of weeks in July,
> so I will need to put this aside until mid-August or so. I'll reevaluate
> options then.
>

One thing I'm looking at is implementing out-of-sync like Xen, which
looks like it will obsolete the entire emulate vs flood thing at the
cost of making unshadowing a little more expensive and consuming more
memory.

See http://thread.gmane.org/gmane.comp.emulators.xen.devel/52557 (and
58, 59, 60).

--
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-30 13:39 ` David S. Ahern
  2008-04-30 13:49 ` Avi Kivity
@ 2008-04-30 13:56 ` Daniel P. Berrange
  2008-04-30 14:23 ` David S. Ahern
  1 sibling, 1 reply; 73+ messages in thread
From: Daniel P. Berrange @ 2008-04-30 13:56 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel, Marcelo Tosatti, Avi Kivity

On Wed, Apr 30, 2008 at 07:39:53AM -0600, David S. Ahern wrote:
> Avi Kivity wrote:
> > David S. Ahern wrote:
> >> Another tidbit for you guys as I make my way through various
> >> permutations:
> >> I installed the RHEL3 hugemem kernel and the guest behavior is *much*
> >> better.
> >> System time still has some regular hiccups that are higher than xen
> >> and esx
> >> (e.g., 1 minute samples out of 5 show system time between 10 and 15%),
> >> but
> >> overall guest behavior is good with the hugemem kernel.
> >>
> >
> > Wait, the amount of info here is overwhelming. Let's stick with the
> > current kernel (32-bit, HIGHMEM4G, right?)
> >
> > Did you get any traces with bypass_guest_pf=0? That may show more info.
> >
>
> My preference is to stick with the "standard", 32-bit RHEL3 kernel in the guest.
> My point in the last email was that the hugemem kernel shows a remarkable
> difference (it uses 3-levels of page tables right?). I was hoping that would
> ring a bell with someone.

IIRC, the RHEL-3 hugemem kernel is using the 4g/4g split patches which
give userspace and kernelspace their own independent pagetables

http://lwn.net/Articles/39925/
http://lwn.net/Articles/39283/

Dan.
--
|: Red Hat, Engineering, Boston -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-30 13:56 ` Daniel P. Berrange
@ 2008-04-30 14:23 ` David S. Ahern
  0 siblings, 0 replies; 73+ messages in thread
From: David S. Ahern @ 2008-04-30 14:23 UTC (permalink / raw)
To: Daniel P. Berrange, Avi Kivity; +Cc: kvm-devel, Marcelo Tosatti

Yes, the 4G/4G patch and the 64G options are both enabled for the hugemem
kernel:

CONFIG_HIGHMEM64G=y
CONFIG_X86_4G=y

Differences between the "standard" kernel and the hugemem kernel:

# diff config-2.4.21-47.ELsmp config-2.4.21-47.ELhugemem
2157,2158c2157,2158
< CONFIG_M686=y
< # CONFIG_MPENTIUMIII is not set
---
> # CONFIG_M686 is not set
> CONFIG_MPENTIUMIII=y
2169c2169
< CONFIG_X86_PGE=y
---
> # CONFIG_X86_PGE is not set
2193c2193
< # CONFIG_X86_4G is not set
---
> CONFIG_X86_4G=y
2365,2366c2365
< CONFIG_M686=y
< CONFIG_X86_PGE=y
---
> CONFIG_MPENTIUMIII=y
2369,2372d2367
< # CONFIG_MXT is not set
< CONFIG_HOTPLUG_PCI=y
< CONFIG_HOTPLUG_PCI_COMPAQ=m
< CONFIG_HOTPLUG_PCI_IBM=m
2373a2369
> CONFIG_X86_4G=y
2377,2379d2372
< # CONFIG_EWRK3 is not set
< CONFIG_UNIX98_PTY_COUNT=2048
< CONFIG_HZ=512
2382a2376,2383
> # CONFIG_MXT is not set
> CONFIG_HOTPLUG_PCI=y
> CONFIG_HOTPLUG_PCI_COMPAQ=m
> CONFIG_HOTPLUG_PCI_IBM=m
> # CONFIG_EWRK3 is not set
> CONFIG_UNIX98_PTY_COUNT=2048
> CONFIG_DEBUG_BUGVERBOSE=y
> # CONFIG_PNPBIOS is not set

Avi:

Centos releases: http://isoredirect.centos.org/centos/3/isos/i386/

I am running RHEL3.8 which I do not see listed.

Also, I'll need to work on a stock install and try to capture some kind of
workload that exhibits the problem. It will be a couple of days.

david

Daniel P. Berrange wrote:
> On Wed, Apr 30, 2008 at 07:39:53AM -0600, David S. Ahern wrote:
>> Avi Kivity wrote:
>>> David S. Ahern wrote:
>>>> Another tidbit for you guys as I make my way through various
>>>> permutations:
>>>> I installed the RHEL3 hugemem kernel and the guest behavior is *much*
>>>> better.
>>>> System time still has some regular hiccups that are higher than xen
>>>> and esx
>>>> (e.g., 1 minute samples out of 5 show system time between 10 and 15%),
>>>> but
>>>> overall guest behavior is good with the hugemem kernel.
>>>>
>>> Wait, the amount of info here is overwhelming. Let's stick with the
>>> current kernel (32-bit, HIGHMEM4G, right?)
>>>
>>> Did you get any traces with bypass_guest_pf=0? That may show more info.
>>>
>> My preference is to stick with the "standard", 32-bit RHEL3 kernel in the guest.
>> My point in the last email was that the hugemem kernel shows a remarkable
>> difference (it uses 3-levels of page tables right?). I was hoping that would
>> ring a bell with someone.
>
> IIRC, the RHEL-3 hugemem kernel is using the 4g/4g split patches which
> give userspace and kernelspace their own independant pagetables
>
> http://lwn.net/Articles/39925/
> http://lwn.net/Articles/39283/
>
> Dan.

^ permalink raw reply	[flat|nested] 73+ messages in thread
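[Editorial note: raw `diff` output over two kernel configs, as above, is hard to read because the same option can move between line numbers. A short script can instead compare the configs option-by-option. This is a sketch — the config text is inlined for illustration; in practice you would read the two files from /boot.]

```python
# Compare two kernel .config files by option name rather than by line.
# "# CONFIG_FOO is not set" is treated as the value 'n', matching
# kernel config conventions.
import re

def parse_config(text):
    opts = {}
    for line in text.splitlines():
        m = re.match(r'# (CONFIG_\w+) is not set$', line)
        if m:
            opts[m.group(1)] = 'n'
            continue
        m = re.match(r'(CONFIG_\w+)=(.+)$', line)
        if m:
            opts[m.group(1)] = m.group(2)
    return opts

# Inlined fragments standing in for config-2.4.21-47.ELsmp and
# config-2.4.21-47.ELhugemem:
smp  = parse_config("CONFIG_M686=y\n"
                    "# CONFIG_X86_4G is not set\n"
                    "CONFIG_X86_PGE=y\n")
huge = parse_config("# CONFIG_M686 is not set\n"
                    "CONFIG_X86_4G=y\n"
                    "# CONFIG_X86_PGE is not set\n")

# Options whose value differs, as (smp_value, hugemem_value) pairs:
diff = {k: (smp.get(k, 'n'), huge.get(k, 'n'))
        for k in smp.keys() | huge.keys()
        if smp.get(k, 'n') != huge.get(k, 'n')}
print(sorted(diff))  # -> ['CONFIG_M686', 'CONFIG_X86_4G', 'CONFIG_X86_PGE']
```

On the real files this immediately surfaces the two options that matter for the behavior difference discussed in the thread, CONFIG_X86_4G and CONFIG_X86_PGE, without the line-number noise.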
* Re: performance with guests running 2.4 kernels (specifically RHEL3)
  2008-04-17 21:12 ` David S. Ahern
  2008-04-18 7:57 ` Avi Kivity
@ 2008-04-23 8:03 ` Avi Kivity
  1 sibling, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2008-04-23 8:03 UTC (permalink / raw)
To: David S. Ahern; +Cc: kvm-devel

David S. Ahern wrote:
> kvm_stat -1 is practically impossible to time correctly to get a good snippet.
>
> I've added a --log option to get vmstat-like output.

I've also added --fields to select which fields are of interest, to avoid
the need for 280-column displays. That's now pushed to kvm-userspace.git.

Example:

./kvm_stat -f 'mmu.*|pf.*|remote.*' -l

--
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.

^ permalink raw reply	[flat|nested] 73+ messages in thread
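[Editorial note: the --fields selection described in the message above boils down to anchored regex matching on counter names. A minimal sketch of that filtering step, mirroring `kvm_stat -f 'mmu.*|pf.*|remote.*'` — the counter names and values here are illustrative, not real kvm_stat output or its actual implementation.]

```python
# Filter a set of stat counters by a user-supplied regex, the way the
# kvm_stat -f/--fields option is described as working. re.match anchors
# at the start of the name, so 'pf.*' selects pf_fixed but not io_exits.
import re

stats = {
    'exits': 184467, 'halt_exits': 210, 'io_exits': 1490,
    'pf_fixed': 98123, 'pf_guest': 455,
    'mmu_shadow_zapped': 5021, 'mmu_flooded': 312,
    'remote_tlb_flush': 77,
}

def select(stats, pattern):
    rx = re.compile(pattern)
    return {name: value for name, value in stats.items() if rx.match(name)}

fields = select(stats, r'mmu.*|pf.*|remote.*')
print(sorted(fields))
# -> ['mmu_flooded', 'mmu_shadow_zapped', 'pf_fixed', 'pf_guest',
#     'remote_tlb_flush']
```

Anchored matching is the important design detail: 'halt_exits' contains "exits" but is only selected by a pattern that matches from the first character, which keeps field groups like `mmu.*` and `pf.*` unambiguous.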
end of thread, other threads:[~2008-06-25 9:51 UTC | newest]
Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-04-16 0:15 performance with guests running 2.4 kernels (specifically RHEL3) David S. Ahern
2008-04-16 8:46 ` Avi Kivity
2008-04-17 21:12 ` David S. Ahern
2008-04-18 7:57 ` Avi Kivity
2008-04-21 4:31 ` David S. Ahern
2008-04-21 9:19 ` Avi Kivity
2008-04-21 17:07 ` David S. Ahern
2008-04-22 20:23 ` David S. Ahern
2008-04-23 8:04 ` Avi Kivity
2008-04-23 15:23 ` David S. Ahern
2008-04-23 15:53 ` Avi Kivity
2008-04-23 16:39 ` David S. Ahern
2008-04-24 17:25 ` David S. Ahern
2008-04-26 6:43 ` Avi Kivity
2008-04-26 6:20 ` Avi Kivity
2008-04-25 17:33 ` David S. Ahern
2008-04-26 6:45 ` Avi Kivity
2008-04-28 18:15 ` Marcelo Tosatti
2008-04-28 23:45 ` David S. Ahern
2008-04-30 4:18 ` David S. Ahern
2008-04-30 9:55 ` Avi Kivity
2008-04-30 13:39 ` David S. Ahern
2008-04-30 13:49 ` Avi Kivity
2008-05-11 12:32 ` Avi Kivity
2008-05-11 13:36 ` Avi Kivity
2008-05-13 3:49 ` David S. Ahern
2008-05-13 7:25 ` Avi Kivity
2008-05-14 20:35 ` David S. Ahern
2008-05-15 10:53 ` Avi Kivity
2008-05-17 4:31 ` David S. Ahern
[not found] ` <482FCEE1.5040306@qumranet.com>
[not found] ` <4830F90A.1020809@cisco.com>
2008-05-19 4:14 ` [kvm-devel] " David S. Ahern
2008-05-19 14:27 ` Avi Kivity
2008-05-19 16:25 ` David S. Ahern
2008-05-19 17:04 ` Avi Kivity
2008-05-20 14:19 ` Avi Kivity
2008-05-20 14:34 ` Avi Kivity
2008-05-22 22:08 ` David S. Ahern
2008-05-28 10:51 ` Avi Kivity
2008-05-28 14:13 ` David S. Ahern
2008-05-28 14:35 ` Avi Kivity
2008-05-28 19:49 ` David S. Ahern
2008-05-29 6:37 ` Avi Kivity
2008-05-28 14:48 ` Andrea Arcangeli
2008-05-28 14:57 ` Avi Kivity
2008-05-28 15:39 ` David S. Ahern
2008-05-29 11:49 ` Avi Kivity
2008-05-29 12:10 ` Avi Kivity
2008-05-29 13:49 ` David S. Ahern
2008-05-29 14:08 ` Avi Kivity
2008-05-28 15:58 ` Andrea Arcangeli
2008-05-28 15:37 ` Avi Kivity
2008-05-28 15:43 ` David S. Ahern
2008-05-28 17:04 ` Andrea Arcangeli
2008-05-28 17:24 ` David S. Ahern
2008-05-29 10:01 ` Avi Kivity
2008-05-29 14:27 ` Andrea Arcangeli
2008-05-29 15:11 ` David S. Ahern
2008-05-29 15:16 ` Avi Kivity
2008-05-30 13:12 ` Andrea Arcangeli
2008-05-31 7:39 ` Avi Kivity
2008-05-29 16:42 ` David S. Ahern
2008-05-31 8:16 ` Avi Kivity
2008-06-02 16:42 ` David S. Ahern
2008-06-05 8:37 ` Avi Kivity
2008-06-05 16:20 ` David S. Ahern
2008-06-06 16:40 ` Avi Kivity
2008-06-19 4:20 ` David S. Ahern
2008-06-22 6:34 ` Avi Kivity
2008-06-23 14:09 ` David S. Ahern
2008-06-25 9:51 ` Avi Kivity
2008-04-30 13:56 ` Daniel P. Berrange
2008-04-30 14:23 ` David S. Ahern
2008-04-23 8:03 ` Avi Kivity
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox