From mboxrd@z Thu Jan  1 00:00:00 1970
From: Keir Fraser
Subject: Re: [PATCH] Fix cpu online/offline bug: mce memory leak.
Date: Tue, 01 Mar 2011 10:25:38 +0000
To: Haitao Shan
Cc: "Liu, Jinsong", "xen-devel@lists.xensource.com"
List-Id: xen-devel@lists.xenproject.org

Well, it is the correct fix, certainly. The new notifier mechanism we
borrowed from Linux should enable us to remove all dynamic allocations from
a CPU's early bringup path. It's just a case of finding them all, as some are
tucked away in cpu/platform-specific areas (like this one).

There are a couple of alterations for Jinsong to make, after which the patch
should be smoke tested again. When he resubmits with the alterations I will
check the patch straight in for the next release candidate (RC7).

 K.

On 01/03/2011 10:21, "Haitao Shan" wrote:

> Ah, yes. I was misled by the patch name itself. :)
> A smart fix, isn't it! I guess the issue is closed. Thanks, Jinsong.
>
> Shan Haitao
>
> 2011/3/1 Keir Fraser
>> On 01/03/2011 10:12, "Haitao Shan" wrote:
>>
>>> Hi, Keir,
>>>
>>> Fixing a memory leak is always good. But I don't agree it fixes the
>>> problem I observed.
>>> Yes, indeed, I observed the issue during a stress test of cpu
>>> offline/online, offlining all cpus except CPU0 one by one and then
>>> bringing them back. In that case, with the memory leak, xmalloc at
>>> onlining could after some time hit a heap page formerly owned by
>>> domains, since usable Xen heap memory is slowly used up. Without the
>>> memory leak, xmalloc at onlining will likely reuse the memory just
>>> freed by offlining, so it won't trigger this assertion.
>>>
>>> But let us consider a typical usage model: you offline all LPs on a
>>> socket so that it can later be removed physically. Some time later, you
>>> bring them back. Between offlining and onlining there could be an
>>> interval as long as a week, with activities such as creating and
>>> killing many guests. How can we ensure that we won't hit this assertion
>>> at that time?
>>>
>>> It is my understanding that this memory leak triggered the issue I
>>> raised, but the issue itself can be triggered in many other ways.
>>> Please do correct me if I am wrong. Thanks!
>>
>> The point is that the patch also moves the xmalloc() calls out of the
>> new CPU's bringup path, and into CPU_UP_PREPARE context. This means the
>> allocations will occur on a different CPU, before the new CPU begins
>> execution, and with irqs enabled. That's why it should fix your crash.
>>
>>  -- Keir
>>
>>> Shan Haitao
>>>
>>> 2011/3/1 Keir Fraser
>>>> Some comments inline below. Overall a good patch to have, but needs
>>>> some minor adjustments!
>>>>
>>>>  -- Keir
>>>>
>>>> On 01/03/2011 09:09, "Liu, Jinsong" wrote:
>>>>
>>>>> Fix cpu online/offline bug: mce memory leak.
>>>>>
>>>>> The current Xen mce logic doesn't free mcabanks, leaking memory each
>>>>> time a cpu goes offline.
>>>>> When repeatedly doing cpu online/offline, this leak slowly shrinks
>>>>> the Xen heap; at some point alloc_heap_pages will call
>>>>> flush_area_mask, which does ASSERT(local_irq_is_enabled()).
>>>>> However, cpu online runs with irqs disabled, so this finally results
>>>>> in a Xen crash.
>>>>>
>>>>> This patch fixes the memory leak, and was tested OK over 110,000
>>>>> rounds of cpu online/offline.
>>>>>
>>>>> Signed-off-by: Liu, Jinsong
>>>>>
>>>>> diff -r 1a364b17d66a xen/arch/x86/cpu/mcheck/mce_intel.c
>>>>> --- a/xen/arch/x86/cpu/mcheck/mce_intel.c Fri Feb 25 01:26:01 2011 +0800
>>>>> +++ b/xen/arch/x86/cpu/mcheck/mce_intel.c Mon Feb 28 19:19:20 2011 +0800
>>>>> @@ -1227,9 +1227,24 @@ static void intel_init_mce(void)
>>>>>      mce_uhandler_num = sizeof(intel_mce_uhandlers)/sizeof(struct
>>>>> mca_error_handler);
>>>>>  }
>>>>>
>>>>> -static int intel_init_mca_banks(void)
>>>>> +static void cpu_mcabank_free(unsigned int cpu)
>>>>>  {
>>>>> -    struct mca_banks *mb1, *mb2, * mb3;
>>>>> +    struct mca_banks *mb1, *mb2, *mb3, *mb4;
>>>>> +
>>>>> +    mb1 = per_cpu(mce_clear_banks, cpu);
>>>>> +    mb2 = per_cpu(no_cmci_banks, cpu);
>>>>> +    mb3 = per_cpu(mce_banks_owned, cpu);
>>>>> +    mb4 = per_cpu(poll_bankmask, cpu);
>>>>> +
>>>>> +    mcabanks_free(mb1);
>>>>> +    mcabanks_free(mb2);
>>>>> +    mcabanks_free(mb3);
>>>>> +    mcabanks_free(mb4);
>>>>> +}
>>>>> +
>>>>> +static void cpu_mcabank_alloc(unsigned int cpu)
>>>>> +{
>>>>> +    struct mca_banks *mb1, *mb2, *mb3;
>>>>>
>>>>>      mb1 = mcabanks_alloc();
>>>>>      mb2 = mcabanks_alloc();
>>>>> @@ -1237,22 +1252,23 @@ static int intel_init_mca_banks(void)
>>>>>      if (!mb1 || !mb2 || !mb3)
>>>>>          goto out;
>>>>>
>>>>> -    __get_cpu_var(mce_clear_banks) = mb1;
>>>>> -    __get_cpu_var(no_cmci_banks) = mb2;
>>>>> -    __get_cpu_var(mce_banks_owned) = mb3;
>>>>> +    per_cpu(mce_clear_banks, cpu) = mb1;
>>>>> +    per_cpu(no_cmci_banks, cpu) = mb2;
>>>>> +    per_cpu(mce_banks_owned, cpu) = mb3;
>>>>> +    return;
>>>>>
>>>>> -    return 0;
>>>>>  out:
>>>>>      mcabanks_free(mb1);
>>>>>      mcabanks_free(mb2);
>>>>>      mcabanks_free(mb3);
>>>>> -    return -ENOMEM;
>>>>>  }
>>>>>
>>>>>  /*
p4/p6 family have similar MCA initialization process */
>>>>>  enum mcheck_type intel_mcheck_init(struct cpuinfo_x86 *c)
>>>>>  {
>>>>> -    if (intel_init_mca_banks())
>>>>> +    if ( !this_cpu(mce_clear_banks) ||
>>>>> +         !this_cpu(no_cmci_banks)   ||
>>>>> +         !this_cpu(mce_banks_owned) )
>>>>>          return mcheck_none;
>>>>>
>>>>>      intel_init_mca(c);
>>>>> @@ -1301,13 +1317,19 @@ static int cpu_callback(
>>>>>  static int cpu_callback(
>>>>>      struct notifier_block *nfb, unsigned long action, void *hcpu)
>>>>>  {
>>>>> +    unsigned int cpu = (unsigned long)hcpu;
>>>>> +
>>>>>      switch ( action )
>>>>>      {
>>>>> +    case CPU_UP_PREPARE:
>>>>> +        cpu_mcabank_alloc(cpu);
>>>>> +        break;
>>>>>      case CPU_DYING:
>>>>>          cpu_mcheck_disable();
>>>>>          break;
>>>>>      case CPU_DEAD:
>>>>>          cpu_mcheck_distribute_cmci();
>>>>> +        cpu_mcabank_free(cpu);
>>>>>          break;
>>>>
>>>> You also need to cpu_mcabank_free(cpu) on CPU_UP_CANCELED. Else
>>>> there's a memory leak on failed CPU bringup.
>>>>
>>>>>      default:
>>>>>          break;
>>>>> @@ -1322,6 +1344,8 @@ static struct notifier_block cpu_nfb = {
>>>>>
>>>>>  static int __init intel_mce_initcall(void)
>>>>>  {
>>>>> +    void *hcpu = (void *)(long)smp_processor_id();
>>>>> +    cpu_callback(&cpu_nfb, CPU_UP_PREPARE, hcpu);
>>>>>      register_cpu_notifier(&cpu_nfb);
>>>>>      return 0;
>>>>>  }
>>>>> diff -r 1a364b17d66a xen/arch/x86/setup.c
>>>>> --- a/xen/arch/x86/setup.c Fri Feb 25 01:26:01 2011 +0800
>>>>> +++ b/xen/arch/x86/setup.c Mon Feb 28 19:19:20 2011 +0800
>>>>> @@ -1203,6 +1203,8 @@ void __init __start_xen(unsigned long mb
>>>>>
>>>>>      arch_init_memory();
>>>>>
>>>>> +    do_presmp_initcalls();
>>>>> +
>>>>>      identify_cpu(&boot_cpu_data);
>>>>>      if ( cpu_has_fxsr )
>>>>>          set_in_cr4(X86_CR4_OSFXSR);
>>>>> @@ -1235,8 +1237,6 @@ void __init __start_xen(unsigned long mb
>>>>>      initialize_keytable();
>>>>>
>>>>>      console_init_postirq();
>>>>> -
>>>>> -    do_presmp_initcalls();
>>>>
>>>> This looks risky, especially so close to the 4.1 release. Instead of
>>>> moving do_presmp_initcalls(), please special-case cpu_mcabank_alloc()
>>>> for CPU0 in intel_mcheck_init(), and remove your direct call to
>>>> cpu_callback(...CPU_UP_PREPARE...) in intel_mce_initcall().
>>>>
>>>>>      for_each_present_cpu ( i )
>>>>>      {
>>>>
>>>>
>>>
>>>
>>
>>
>
>