From mboxrd@z Thu Jan 1 00:00:00 1970 From: Konrad Rzeszutek Wilk Subject: Re: acpidump crashes on some machines Date: Thu, 23 Aug 2012 10:35:12 -0400 Message-ID: <20120823143512.GI26098@phenom.dumpdata.com> References: <20120620145127.GD12787@phenom.dumpdata.com> <4FE32DDA.10204@amd.com> <20120630014825.GA7003@phenom.dumpdata.com> <20120630021936.GA27100@phenom.dumpdata.com> <50114002.2030700@amd.com> <20120817205207.GA3002@phenom.dumpdata.com> <503602A0.9040908@amd.com> <50360465.3090602@citrix.com> <20120823141055.GH26098@phenom.dumpdata.com> <50363FEF.10001@citrix.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Content-Disposition: inline In-Reply-To: <50363FEF.10001@citrix.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: David Vrabel Cc: Andre Przywara , Jeremy Fitzhardinge , xen-devel , Jan Beulich List-Id: xen-devel@lists.xenproject.org On Thu, Aug 23, 2012 at 03:36:31PM +0100, David Vrabel wrote: > On 23/08/12 15:10, Konrad Rzeszutek Wilk wrote: > > On Thu, Aug 23, 2012 at 11:22:29AM +0100, David Vrabel wrote: > >> On 23/08/12 11:14, Andre Przywara wrote: > >>> On 08/17/2012 10:52 PM, Konrad Rzeszutek Wilk wrote: > >>>> On Thu, Jul 26, 2012 at 03:02:58PM +0200, Andre Przywara wrote: > >>>>> On 06/30/2012 04:19 AM, Konrad Rzeszutek Wilk wrote: > >>>>> > >>>>> Konrad, David, > >>>>> > >>>>> back on track for this issue. Thanks for your input, I could do some > >>>>> more debugging (see below for a refresh): > >>>>> > >>>>> It seems like it affects only the first page of the 1:1 mapping. I > >>>>> didn't have an issues with the last PFN or the page behind it (which > >>>>> failed properly). > >>>>> > >>>>> David, thanks for the hint with varying dom0_mem parameter. I > >>>>> thought I already checked this, but I did it once again and it > >>>>> turned out that it is only an issue if dom0_mem is smaller than the > >>>>> ACPI area, which generates a hole in the memory map. So we have > >>>>> (simplified) > >>>>> * 1:1 mapping to 1 MB > >>>>> * normal mapping till dom0_mem > >>>>> * unmapped area till ACPI E820 area > >>>>> * ACPI E820 1:1 mapping > >>>>> > >>>>> As far as I could chase it down the 1:1 mapping itself looks OK, I > >>>>> couldn't find any off-by-one bugs here. So maybe it is code that > >>>>> later on invalidates areas between the normal guest mapping and the > >>>>> ACPI mem? > >>>> > >>>> I think I found it. Can you try this pls [and if you can't find > >>>> early_to_phys.. just use the __set_phys_to call] > >>> > >>> Yes, that works. At least after a quick test on my test box. Both the > >>> test module and acpidump work as expected. If I replace the "<" in your > >>> patch with the original "<=", I get the warning (and due to the > >>> "continue" it also works). > >> > >> Note that the balloon driver could subsequently overwrite the p2m entry. > > > > Hmm, I am not seeing how.. the region that is passed in is right up to > > the PFN (I believe). And I did run with this patch over a couple of days > > with ballooning up and down. But maybe I missed something? > > Hrrm. I was sure I wrote "Note that the balloon driver could > subsequently overwrite the p2m entry /if/ this warning is triggered." > but it seems I did not. :/ > > i.e., if the warning is triggered, the xen_extra_mem region will be > incorrectly sized and the balloon driver will make use of the incorrect > region. Ah, that makes more sense. Yes we would do the overwritting part later on in that scenario.. which makes me wonder - if we did that in the past how come MMIO devices still worked! Some boxes have the gap/MMIO right at the edge of the E820_RAM - perhaps they silently coping with and we just never caught on this fact. > > David > > > Let me prep a patch that adds some more checks in the balloon driver > > just in case we do hit this. > > > >> I don't think it is worth redoing the patch to adjust the region passed > >> to the balloon driver to avoid this though. > >> > >>> I also successfully tested the minimal fix (just replacing <= with <). > >>> I will feed it to the testers here to cover more machines. > >>> > >>> Do you want to keep the warnings in (which exceed 80 characters, btw)? > >> > >> I think we do. > >> > >> David