Avi Kivity wrote:
> David S. Ahern wrote:
>> The short answer is that I am still seeing large system time hiccups in
>> the guests due to kscand in the guest scanning its active lists. I do
>> see better response for a KVM_MAX_PTE_HISTORY of 3 than with 4. (For
>> completeness I also tried a history of 2, but it performed worse than
>> 3, which is no surprise given its meaning.)
>>
>> I have been able to scratch out a simplistic program that stimulates
>> kscand activity similar to what is going on in my real guest (see
>> attached). The program requests a memory allocation, initializes it (to
>> get it backed) and then in a loop sweeps through the memory in chunks,
>> similar to a program using parts of its memory here and there but
>> eventually accessing all of it.
>>
>> Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is
>> using a fair amount of highmem. Start a couple of instances of the
>> attached program. For example, I've been using these two:
>>
>>     memuser 768M 120 5 300
>>     memuser 384M 300 10 600
>>
>> Together these instances take up 1GB of RAM and, once initialized,
>> consume very little CPU. On kvm they make kscand and kswapd go nuts
>> every 5-15 minutes. For comparison, I do not see the same behavior for
>> an identical setup running on esx 3.5.
>
> I haven't been able to reproduce this:
>
>> [root@localhost root]# ps -elf | grep -E 'memuser|kscand'
>> 1 S root      7    1  1  75  0 -      0 schedu 10:07 ?     00:00:26 [kscand]
>> 0 S root   1464    1  1  75  0 - 196986 schedu 10:20 pts/0 00:00:21 ./memuser 768M 120 5 300
>> 0 S root   1465    1  0  75  0 -  98683 schedu 10:20 pts/0 00:00:10 ./memuser 384M 300 10 600
>> 0 S root   2148 1293  0  75  0 -    922 pipe_w 10:48 pts/0 00:00:00 grep -E memuser|kscand
>
> The workload has been running for about half an hour, and kswapd CPU
> usage doesn't seem significant. This is a 2GB guest running with my
> patch ported to kvm.git HEAD.
I'm running on the per-page-pte-tracking branch, and I am still seeing
it. I doubt you want to sit and watch the screen for an hour, so install
sysstat if it isn't already, change the sample rate to 1 minute
(/etc/cron.d/sysstat), let the server run for a few hours and then run
'sar -u'. You'll see something like this:

10:12:11 AM       LINUX RESTART

10:13:03 AM  CPU  %user  %nice  %system  %iowait  %idle
10:14:01 AM  all   0.08   0.00     2.08     0.35  97.49
10:15:03 AM  all   0.05   0.00     0.79     0.04  99.12
10:15:59 AM  all   0.15   0.00     1.52     0.06  98.27
10:17:01 AM  all   0.04   0.00     0.69     0.04  99.23
10:17:59 AM  all   0.01   0.00     0.39     0.00  99.60
10:18:59 AM  all   0.00   0.00     0.12     0.02  99.87
10:20:02 AM  all   0.18   0.00    14.62     0.09  85.10
10:21:01 AM  all   0.71   0.00    26.35     0.01  72.94
10:22:02 AM  all   0.67   0.00    10.61     0.00  88.72
10:22:59 AM  all   0.14   0.00     1.80     0.00  98.06
10:24:03 AM  all   0.13   0.00     0.50     0.00  99.37
10:24:59 AM  all   0.09   0.00    11.46     0.00  88.45
10:26:03 AM  all   0.16   0.00     0.69     0.03  99.12
10:26:59 AM  all   0.14   0.00    10.01     0.02  89.83
10:28:03 AM  all   0.57   0.00     2.20     0.03  97.20
Average:     all   0.21   0.00     5.55     0.05  94.20

Every one of those jumps in %system time correlates directly with
kscand activity. Without the memuser programs running, the guest
%system time is <1%. The point of this silly memuser program is just to
use high memory -- let it age, then make it active again, sit idle,
repeat.

If you run kvm_stat with -l on the host you'll see the jump in pte
writes/updates. An intern here added a timestamp to the kvm_stat output
for me, which helps to directly correlate guest and host data.

I also ran my real guest on the branch. Performance at boot and through
the first 15 minutes was much better, but I'm still seeing recurring
hits every 5 minutes when kscand kicks in.
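For the record, the sysstat change I mean is just the cron sample interval. On my systems /etc/cron.d/sysstat ends up with a collection line like the one below; the path and the sa1 location are the usual RHEL defaults, so adjust if yours differ:

```
# /etc/cron.d/sysstat -- take a sample every minute instead of the default
* * * * * root /usr/lib/sa/sa1 1 1
```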
Here's the data from the guest for the first one, which happened after
15 minutes of uptime:

active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct 24845, dj 59
active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct 40868, dj 103
active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct 45805, dj 1212

The kvm_stat data for this time period is attached due to line lengths.

Also, I forgot to mention this before, but there is a bug in the kscand
code in the RHEL3U8 kernel. When it scans the cache list it uses the
count from the anonymous list:

	if (need_active_cache_scan(zone)) {
		for (age = MAX_AGE-1; age >= 0; age--) {
			scan_active_list(zone, age,
					 &zone->active_cache_list[age],
					 zone->active_anon_count[age]);
					        ^^^^^^^^^^^^^^^^^
			if (current->need_resched)
				schedule();
		}
	}

When the anonymous count is higher, kscand scans the cache list
repeatedly. An example of that was captured here:

active_cache_scan: HighMem, age 7, count[age] 222 -> 179, count anon 111967, direct 626, dj 3

count anon is active_anon_count[age], which at this moment was 111,967.
There were only 222 entries in the cache list, but the count value
passed to scan_active_list was 111,967. When the cache list has a lot
of direct pages, that causes a larger hit on kvm than needed. That
said, I have to live with the bug in the guest.

david