From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yolkfull Chow <yzhou@redhat.com>
Subject: Re: [KVM-AUTOTEST PATCH] A test patch - Boot VMs until one of them
 becomes unresponsive
Date: Thu, 11 Jun 2009 11:37:16 +0800
Message-ID: <4A307BEC.8060906@redhat.com>
References: <425001110.1660581244634737690.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Uri Lublin <uril@redhat.com>, kvm@vger.kernel.org
To: Michael Goldish <mgoldish@redhat.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx2.redhat.com ([66.187.237.31]:49836 "EHLO mx2.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1762085AbZFKDh0 (ORCPT <rfc822;kvm@vger.kernel.org>);
	Wed, 10 Jun 2009 23:37:26 -0400
Received: from int-mx2.corp.redhat.com (int-mx2.corp.redhat.com [172.16.27.26])
	by mx2.redhat.com (8.13.8/8.13.8) with ESMTP id n5B3bSCU016823
	for <kvm@vger.kernel.org>; Wed, 10 Jun 2009 23:37:28 -0400
In-Reply-To: <425001110.1660581244634737690.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On 06/10/2009 07:52 PM, Michael Goldish wrote:
> ----- "Yolkfull Chow"<yzhou@redhat.com>  wrote:
>
>    
>> On 06/10/2009 06:03 PM, Michael Goldish wrote:
>>      
>>> ----- "Yolkfull Chow"<yzhou@redhat.com>   wrote:
>>>
>>>
>>>        
>>>> On 06/09/2009 05:44 PM, Michael Goldish wrote:
>>>>
>>>>          
>>>>> The test looks pretty nicely written. Comments:
>>>>>
>>>>> 1. Consider making all the cloned VMs use image snapshots:
>>>>>
>>>>> curr_vm = vm1.clone()
>>>>> curr_vm.get_params()["extra_params"] += " -snapshot"
>>>>>
>>>>> I'm not sure it's a good idea to let all VMs use the same disk image.
>>>>> Or maybe you shouldn't add -snapshot yourself, but rather do it in the
>>>>> config file for the first VM, and then all cloned VMs will have
>>>>> -snapshot as well.
>>>>
>>>>          
>>>>>
>>>>>            
>>>> Yes I use 'image_snapshot = yes' in config file.
>>>>
>>>>          
>>>>> 2. Consider changing the message
>>>>> " Booting the %dth guest" % num
>>>>> to
>>>>> "Booting guest #%d" % num
>>>>> (because there's no such thing as 2th and 3th)
>>>>>
>>>>> 3. Consider changing the message
>>>>> "Cannot boot vm anylonger"
>>>>> to
>>>>> "Cannot create VM #%d" % num
>>>>>
>>>>> 4. Why not add curr_vm to vms immediately after cloning it?
>>>>> That way you can kill it in the exception handler later, without having
>>>>> to send it a 'quit' if you can't login ('if not curr_vm_session').
>>      
>>>>>
>>>>>            
>>>> Yes, good idea.
>>>>
>>>>          
>>>>> 5. " %dth guest boots up successfully" % num --> again, 2th and 3th
>>>>> make no sense.
>>>>> Also, I wonder why you add those spaces before every info message.
>>      
>>>>> 6. "%dth guest's session is not responsive" -->    same
>>>>> (maybe use "Guest session #%d is not responsive" % num)
>>>>>
>>>>> 7. "Shut down the %dth guest" -->    same
>>>>> (maybe "Shutting down guest #%d"? or destroying/killing?)
>>>>>
>>>>> 8. Shouldn't we fail the test when we find an unresponsive session?
>>>>> It seems you just display an error message. You can simply replace
>>>>> logging.error( with raise error.TestFail(.
>>>>>
>>>>>
>>>>>            
>>>>
>>>>          
>>>>> 9. Consider using a stricter test than just vm_session.is_responsive().
>>>>> vm_session.is_responsive() just sends ENTER to the sessions and returns
>>>>> True if it gets anything as a result (usually a prompt, or even just a
>>>>> newline echoed back). If the session passes this test it is indeed
>>>>> responsive, so it's a decent test, but maybe you can send some command
>>>>> (user configurable?) and test for some output. I'm really not sure this
>>>>> is important, because I can't imagine a session would respond to a
>>>>> newline but not to other commands, but who knows. Maybe you can send the
>>>>> first VM a user-specified command when the test begins, remember the
>>>>> output, and then send all other VMs the same command and make sure the
>>>>> output is the same.
>>>>>
>>>>>
>>>>>            
>>>> Maybe use 'info status', and send the command 'help' via the session to
>>>> the VMs and compare their output?
>>>>
>>>>          
>>> I'm not sure I understand. What does 'info status' do? We're talking
>>> about an SSH shell, not the monitor. You can do whatever you like, like
>>> 'uname -a', and 'ls /', but you should leave it up to the user to decide,
>>> so he/she can specify different commands for different guests. Linux
>>> commands won't work under Windows, so Linux and Windows must have
>>> different commands in the config file. In the Linux section, under
>>> '- @Linux:' you can add something like:
>>>
>>> stress_boot:
>>>       stress_boot_test_command = uname -a
>>>
>>> and under '- @Windows:':
>>>
>>> stress_boot:
>>>       stress_boot_test_command = ver && vol
>>>
>>> These commands are just naive suggestions. I'm sure someone can think of
>>> much more informative commands.
>>>
>>>        
>> Those are really good suggestions.  Thanks, Michael.  Can I use
>> 'migration_test_command' instead?
>>      
> Not really. Why would you want to use another test's param?
>
> 1. There's no guarantee that 'migration_test_command' is defined
> for your boot stress test. In fact, it is probably only defined for
> migration tests, so you probably won't be able to access it. Try
> params.get('migration_test_command') in your test and you'll probably
> get None.
>
> 2. The user may not want to run migration at all, and then he/she
> will probably not define 'migration_test_command'.
>
> 3. The user might want to use different test commands for migration
> and for the boot stress test.
>
>    
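OK, I will give the test its own parameter. To check I understood, here is
a minimal sketch of what I have in mind. It assumes params behaves like a
dict, and that there is some way to run a command in a session and get its
output back as a string; run_command below is only a hypothetical stand-in
for that, not an existing kvm_utils function. It is written for a current
Python, not the 2.4 interpreter in my logs.

```python
def get_test_command(params, default=None):
    """Return the boot stress test's own command, or `default` if unset."""
    # params is assumed to be dict-like; a missing or empty value is "unset".
    return params.get("stress_boot_test_command") or default


def find_mismatched_guests(run_command, sessions, command):
    """Return the numbers of guests whose output differs from guest #1.

    run_command(session, command) is a hypothetical callable that returns
    the command's output as a string. Guests are numbered from 1.
    """
    if not sessions:
        return []
    # The first guest's output is the reference for all the others.
    reference = run_command(sessions[0], command).strip()
    mismatched = []
    for number, session in enumerate(sessions[1:], start=2):
        if run_command(session, command).strip() != reference:
            mismatched.append(number)
    return mismatched
```

With this, a config that only defines 'migration_test_command' gives None,
so the test can decide to skip the output comparison or fall back to
is_responsive().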
>>>>> 10. I'm not sure you should use the param "kill_vm_gracefully" because
>>>>> that's a postprocessor param (probably not your business). You can just
>>>>> call destroy() in the exception handler with gracefully=False, because
>>>>> if the VMs are non-responsive, I don't expect them to shutdown nicely
>>>>> with an SSH command (that's what gracefully does). Also, we're using
>>>>> -snapshot, so there's no reason to shut them down nicely.
>>>>>
>>>>>
>>>>>            
>>>> Yes,  I agree. :)
>>>>
>>>>          
>>>>> 11. "Total number booted successfully: %d" % (num - 1) --> why not
>>>>> just num?
>>>>> We really have num VMs including the first one.
>>>>> Or you can say: "Total number booted successfully in addition to the
>>>>> first one" but that's much longer.
>>>>>
>>>>>
>>>>>            
>>>> After the first guest has booted I set num = 1, and then do 'num += 1'
>>>> at the start of the while loop (to get a new VM).
>>>> So curr_vm is now vm2 (num is 2). If the second VM fails to boot up,
>>>> the number booted successfully should be (num - 1).
>>>> I will use enumerate(vms), as Uri suggested, to make the numbers easier
>>>> to count.
>>>>
>>>>          
>>> OK, I didn't notice that.
>>>
>>>
>>>        
>>>>> 12. Consider adding a 'max_vms' (or 'threshold') user param to the
>>>>> test. If num reaches 'max_vms', we stop adding VMs and pass the test.
>>>>> Otherwise the test will always fail (which is depressing). If
>>>>> params.get("threshold") is None or "", or in short -- 'if not
>>>>> params.get("threshold")', disable this feature and keep adding VMs
>>>>> forever. The user can enable the feature with:
>>>>> max_vms = 50
>>>>> or disable it with:
>>>>> max_vms =
>>>>>
>>>>>
>>>>>            
>>>> This is a good idea, given the hardware resource limits of the host.
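For the resubmission, a minimal sketch of how I plan to read the limit. It
assumes params is dict-like and that values arrive as strings, as they do
from the config file; the helper names are mine and only for illustration.

```python
def parse_max_vms(params):
    """Return the VM limit as an int, or None when the limit is disabled."""
    value = (params.get("max_vms") or "").strip()
    if not value:
        # None or "" (i.e. "max_vms =") disables the feature.
        return None
    return int(value)


def reached_limit(num_vms, max_vms):
    """True once the number of booted VMs has reached the configured limit."""
    return max_vms is not None and num_vms >= max_vms
```

When reached_limit() returns True the loop would stop adding VMs and the
test would pass; with the limit disabled it never returns True.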
>>>>
>>>>          
>>>>> 13. Why are you catching OSError? If you get OSError it might be a
>>>>> framework bug.
>>>>
>>>>          
>>>>>
>>>>>            
>>>> Because sometimes vm.create() succeeds but the ssh login fails,
>>>> since the running Python cannot allocate physical memory (OSError).
>>>> Adding max_vms could fix this problem, I think.
>>>>
>>>>          
>>> Do you remember exactly where OSError was thrown? Do you happen to have
>>> a backtrace? (I just want to be very sure it's not a bug.)
>>>
>>>        
>> The OSError was thrown while checking that all VMs are responsive, and I
>> got many tracebacks saying "OSError: [Errno 12] Cannot allocate memory".
>> Maybe the last VM happened to be created successfully, but Python could
>> not get physical memory afterwards when checking all the sessions.
>> So can we now catch the OSError and tell the user that max_vms
>> is too large?
>>      
> Sure. I was just worried it might be a framework bug. If it's a legitimate
> memory error -- catch it and fail the test.
>
> If you happen to catch that OSError again, and get a backtrace, I'd like
> to see it if that's possible.
>    
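Michael, here are the backtraces. In each of them the OSError is raised by
os.fork(): in pty.fork() under ssh_login, and in subprocess calls made by
the framework (sysinfo logging and drop_caches):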

...
20090611-064959 
no_boundary.local_stg.RHEL.5.3-server-64.no_ksm.boot_vms.e1000.user.size_1024: 
ERROR: run_once: Test failed: [Errno 12] Cannot allocate memory
20090611-064959 
no_boundary.local_stg.RHEL.5.3-server-64.no_ksm.boot_vms.e1000.user.size_1024: 
DEBUG: run_once: Postprocessing on error...
20090611-065000 
no_boundary.local_stg.RHEL.5.3-server-64.no_ksm.boot_vms.e1000.user.size_1024: 
DEBUG: postprocess_vm: Postprocessing VM 'vm1'...
20090611-065000 
no_boundary.local_stg.RHEL.5.3-server-64.no_ksm.boot_vms.e1000.user.size_1024: 
DEBUG: postprocess_vm: VM object found in environment
20090611-065000 
no_boundary.local_stg.RHEL.5.3-server-64.no_ksm.boot_vms.e1000.user.size_1024: 
DEBUG: send_monitor_cmd: Sending monitor command: screendump 
/kvm-autotest/client/results/default/kvm_runtest_2.[RHEL-Server-5.3-64][None][1024][1][qcow2]<no_boundary.local_stg.RHEL.5.3-server-64.no_ksm.boot_vms.e1000.user.size_1024>/debug/post_vm1.ppm
20090611-065000 
no_boundary.local_stg.RHEL.5.3-server-64.no_ksm.boot_vms.e1000.user.size_1024: 
DEBUG: run_once: Contents of environment: {'vm__vm1': <kvm_vm.VM 
instance at 0x92999a28>}
post-test sysinfo error:
Traceback (most recent call last):
   File "/kvm-autotest/client/common_lib/log.py", line 58, in decorated_func
     fn(*args, **dargs)
   File "/kvm-autotest/client/bin/base_sysinfo.py", line 213, in 
log_after_each_test
     log.run(test_sysinfodir)
   File "/kvm-autotest/client/bin/base_sysinfo.py", line 112, in run
     shell=True, env=env)
   File "/usr/lib64/python2.4/subprocess.py", line 412, in call
     return Popen(*args, **kwargs).wait()
   File "/usr/lib64/python2.4/subprocess.py", line 542, in __init__
     errread, errwrite)
   File "/usr/lib64/python2.4/subprocess.py", line 902, in _execute_child
     self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
2009-06-11 06:50:02,859 Configuring logger for client level
         FAIL    
kvm_runtest_2.[RHEL-Server-5.3-64][None][1024][1][qcow2]<no_boundary.local_stg.RHEL.5.3-server-64.no_ksm.boot_vms.e1000.user.size_1024>    
kvm_runtest_2.[RHEL-Server-5.3-64][None][1024][1][qcow2]<no_boundary.local_stg.RHEL.5.3-server-64.no_ksm.boot_vms.e1000.user.size_1024>    
timestamp=1244717402    localtime=Jun 11 06:50:02    Unhandled OSError: 
[Errno 12] Cannot allocate memory
           Traceback (most recent call last):
             File "/kvm-autotest/client/common_lib/test.py", line 304, 
in _exec
               self.execute(*p_args, **p_dargs)
             File "/kvm-autotest/client/common_lib/test.py", line 187, 
in execute
               self.run_once(*args, **dargs)
             File 
"/kvm-autotest/client/tests/kvm_runtest_2/kvm_runtest_2.py", line 145, 
in run_once
               routine_obj.routine(self, params, env)
             File 
"/kvm-autotest/client/tests/kvm_runtest_2/kvm_tests.py", line 3071, in 
run_boot_vms
               curr_vm_session = kvm_utils.wait_for(curr_vm.ssh_login, 
240, 0, 2)
             File 
"/kvm-autotest/client/tests/kvm_runtest_2/kvm_utils.py", line 797, in 
wait_for
               output = func()
             File "/kvm-autotest/client/tests/kvm_runtest_2/kvm_vm.py", 
line 728, in ssh_login
               session = kvm_utils.ssh(address, port, username, 
password, prompt, timeout)
             File 
"/kvm-autotest/client/tests/kvm_runtest_2/kvm_utils.py", line 553, in ssh
               return remote_login(command, password, prompt, "\n", timeout)
             File 
"/kvm-autotest/client/tests/kvm_runtest_2/kvm_utils.py", line 431, in 
remote_login
               sub = kvm_spawn(command, linesep)
             File 
"/kvm-autotest/client/tests/kvm_runtest_2/kvm_utils.py", line 114, in 
__init__
               (pid, fd) = pty.fork()
             File "/usr/lib64/python2.4/pty.py", line 108, in fork
               pid = os.fork()
           OSError: [Errno 12] Cannot allocate memory
Persistent state variable __group_level now set to 1
     END FAIL    
kvm_runtest_2.[RHEL-Server-5.3-64][None][1024][1][qcow2]<no_boundary.local_stg.RHEL.5.3-server-64.no_ksm.boot_vms.e1000.user.size_1024>    
kvm_runtest_2.[RHEL-Server-5.3-64][None][1024][1][qcow2]<no_boundary.local_stg.RHEL.5.3-server-64.no_ksm.boot_vms.e1000.user.size_1024>    
timestamp=1244717403    localtime=Jun 11 06:50:03
Dropping caches
2009-06-11 06:50:03,409 running: sync
JOB ERROR: Unhandled OSError: [Errno 12] Cannot allocate memory
Traceback (most recent call last):
   File "/kvm-autotest/client/bin/job.py", line 978, in step_engine
     execfile(self.control, global_control_vars, global_control_vars)
   File "/kvm-autotest/client/control", line 1030, in ?
     cfg_to_test("kvm_tests.cfg")
   File "/kvm-autotest/client/control", line 1013, in cfg_to_test
     current_status = job.run_test("kvm_runtest_2", params=dict, 
tag=tagname)
   File "/kvm-autotest/client/bin/job.py", line 44, in wrapped
     utils.drop_caches()
   File "/kvm-autotest/client/bin/base_utils.py", line 638, in drop_caches
     utils.system("sync")
   File "/kvm-autotest/client/common_lib/utils.py", line 510, in system
     stdout_tee=sys.stdout, stderr_tee=sys.stderr).exit_status
   File "/kvm-autotest/client/common_lib/utils.py", line 330, in run
     bg_job = join_bg_jobs(
   File "/kvm-autotest/client/common_lib/utils.py", line 37, in __init__
     stdin=stdin)
   File "/usr/lib64/python2.4/subprocess.py", line 542, in __init__
     errread, errwrite)
   File "/usr/lib64/python2.4/subprocess.py", line 902, in _execute_child
     self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Persistent state variable __group_level now set to 0
END ABORT    ----    ----    timestamp=1244717418    localtime=Jun 11 
06:50:18    Unhandled OSError: [Errno 12] Cannot allocate memory
   Traceback (most recent call last):
     File "/kvm-autotest/client/bin/job.py", line 978, in step_engine
       execfile(self.control, global_control_vars, global_control_vars)
     File "/kvm-autotest/client/control", line 1030, in ?
       cfg_to_test("kvm_tests.cfg")
     File "/kvm-autotest/client/control", line 1013, in cfg_to_test
       current_status = job.run_test("kvm_runtest_2", params=dict, 
tag=tagname)
     File "/kvm-autotest/client/bin/job.py", line 44, in wrapped
       utils.drop_caches()
     File "/kvm-autotest/client/bin/base_utils.py", line 638, in drop_caches
       utils.system("sync")
     File "/kvm-autotest/client/common_lib/utils.py", line 510, in system
       stdout_tee=sys.stdout, stderr_tee=sys.stderr).exit_status
     File "/kvm-autotest/client/common_lib/utils.py", line 330, in run
       bg_job = join_bg_jobs(
     File "/kvm-autotest/client/common_lib/utils.py", line 37, in __init__
       stdin=stdin)
     File "/usr/lib64/python2.4/subprocess.py", line 542, in __init__
       errread, errwrite)
     File "/usr/lib64/python2.4/subprocess.py", line 902, in _execute_child
       self.pid = os.fork()
   OSError: [Errno 12] Cannot allocate memory
[root@dhcp-66-70-9 kvm_runtest_2]#
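
Since the error comes from fork() on the host, I would catch only that case
and fail the test with a clear message. A minimal sketch, with assumptions:
TestFail below is a local stand-in for autotest's error.TestFail so that the
snippet is self-contained, login is any callable that performs the ssh
login, and the syntax is for a current Python, not the 2.4 shown above.

```python
import errno


class TestFail(Exception):
    """Local stand-in for autotest's error.TestFail (an assumption here)."""


def login_or_fail(login, guest_number):
    """Call login(); turn an out-of-memory fork failure into a test failure."""
    try:
        return login()
    except OSError as e:
        if e.errno != errno.ENOMEM:
            # Any other OSError may be a framework bug, so let it through.
            raise
        raise TestFail("Host cannot allocate memory while logging in to "
                       "guest #%d; max_vms is probably too large"
                       % guest_number)
```

This only covers the login path. The later failures in the sysinfo and
drop_caches steps happen inside the framework, outside the test.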
> Thanks,
> Michael
>
>    
>>>>> 14. At the end of the exception handler you should probably re-raise
>>>>> the exception you caught. Otherwise the user won't see the error
>>>>> message. You can simply replace 'break' with 'raise' (no parameters),
>>>>> and it should work, hopefully.
>>>>
>>>>          
>>>>>
>>>>>            
>>>> Yes, I should if I add 'max_vms'.
>>>>
>>>>          
>>> I think you should re-raise anyway. Otherwise, what's the point in
>>> writing error messages such as "raise error.TestFail("Cannot boot vm
>>> anylonger")"? If you don't re-raise, the user won't see the messages.
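Agreed, I will re-raise in all cases. A minimal sketch of the handler as I
understand points 4, 10 and 14 together. The three callables are
hypothetical placeholders for cloning, booting and destroying a VM, not the
real kvm_vm methods, and max_vms is assumed to be an integer here.

```python
def boot_vms(clone_vm, boot_vm, destroy_vm, max_vms):
    """Boot up to max_vms guests; on any failure kill them all and re-raise."""
    vms = []
    try:
        for num in range(1, max_vms + 1):
            vm = clone_vm(num)
            # Track the VM before booting it, so the handler can kill it too.
            vms.append(vm)
            boot_vm(vm)
    except Exception:
        for vm in vms:
            # Forceful kill: the guests run with -snapshot, nothing to save.
            destroy_vm(vm)
        # A bare raise keeps the original exception and its message visible.
        raise
    return vms
```

On success the VMs are returned still running, so shutting them down is
left to the caller or to postprocessing.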
>>>
>>>
>>>        
>>>>> I know these are quite a few comments, but they're all rather minor
>>>>> and the test is well written in my opinion.
>>>>>
>>>>>
>>>>>            
>>>> Thank you. I will make the changes according to your and Uri's
>>>> comments, and will re-submit the patch here later. :)
>>>>
>>>> Thanks and Best Regards,
>>>> Yolkfull
>>>>
>>>>          
>>>>> Thanks,
>>>>> Michael
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Yolkfull Chow"<yzhou@redhat.com>
>>>>> To:kvm@vger.kernel.org
>>>>> Cc: "Uri Lublin"<uril@redhat.com>
>>>>> Sent: Tuesday, June 9, 2009 11:41:54 AM (GMT+0200) Auto-Detected
>>>>> Subject: [KVM-AUTOTEST PATCH] A test patch - Boot VMs until one of
>>>>> them becomes unresponsive
>>>>
>>>>          
>>>>> Hi,
>>>>>
>>>>> This test will boot VMs until one of them becomes unresponsive, and
>>>>> records the maximum number of VMs successfully started.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>            
>>>>
>>>>          
>>
>> -- 
>> Yolkfull
>> Regards,
>>      


-- 
Yolkfull
Regards,