From mboxrd@z Thu Jan 1 00:00:00 1970
From: prarit (prarit@prarit.com)
Date: Sat, 12 Mar 2005 17:04:54 +0000
Subject: Re: Latest bk kernel does not properly free PCI IO & MEM allocations
Message-Id: <0ID900FO80S63RF0@vms046.mailsrvcs.net>
List-Id:
References: <422F42A9.7050009@sgi.com>
In-Reply-To: <422F42A9.7050009@sgi.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-hotplug@vger.kernel.org

Greg,

Thanks for the reply. I'm at home today and don't have access to my SGI
account. If you want to fwd this to the mailing list please feel free.
If I don't see it Monday morning, I'll fwd it back to the list then.

>> What I meant to say was pci_free_resources calls release_resource, where
>> release_region calls __release_region. __release_region is called a
>> "legacy" function?

> What documentation calls it that?

Hrmm ... I distinctly remember seeing that somewhere. I'll find it
again. I also recall the term "compatibility cruft" used ... in
ioport.h?

> I don't like the wording at all, no one will notice it (trust me I've
> tried stuff like this before...)

Yeah ... I know. I've tried it before too. I've been working in this
space for a long time and it's clear to me that it needs a compile-time
#define DEBUG printk option. Issues within the resource allocation code
are impossible to debug without sticking a large number of printk's
throughout the code.

> But what's really curious, is why no one has hit this before. Nothing
> has changed recently in this area of the kernel. Did this used to work
> before? Does it work just fine without the patch for other drivers?

I haven't gone back through kernels to determine where this "breakage"
occurred, but I do know that it is in the 2.6.9 kernel, as I have been
developing on a RHEL4 platform.

This is how I stumbled across this: As you know, I'm building and
testing an SGI Altix Hotplug Driver.
While testing I started up a few memory stress and IO stress tests
*that did not involve the card I was targeting for the test* and
approximately 5% of the time I hit a NULL pointer oops. At first glance,
I thought the issue was within the sysfs/proc filesystems or
pci_free_resources, as that's where the oopses were -- with more
inspection I realized that didn't make sense and it wasn't the case.

A few things have to happen (in this precise order) for someone to hit
this issue. Note that I have been running on 16, 32, and 64 cpu systems,
so it is very likely that #4 below is due to another CPU.

1. HP slot is disabled via sysfs.
2. PCI driver must call release_regions
3. release_regions kfree's resource structures
4. Context switch/Another CPU: resource area is alloc'd by something else.
5. pci_free_resources is called. Possible oops right here.

I've also stumbled across the case where pci_free_resources tried to
free non-existent regions -- dumping the memory address indicated, in
one case, that I was looking at char data ... I recall that it looked
like I was looking at the word "qla". I dumped /proc/iomem and
/proc/ioports and incurred an oops in /proc.

Suppose #4 above doesn't happen. Step #5 occurs and no one (user or
system) is the wiser. The memory is still intact -- no oops.

Additionally, I've seen people using PCI Hotplug in the field. Typically
when removing a card, sysadmins tend to quiesce the system before
removal. That makes #4 just that much less likely to occur.

:) :) Want the long explanation? :) :)

P.
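
For anyone who wants to see the ordering in steps 1-5 without a hotplug
box, here is a rough user-space sketch. It is not kernel code: every
name in it (sim_resource, sim_alloc, simulate_hot_remove, the RES_DEBUG
switch) is made up for illustration, and the one-slot "allocator" is an
assumption whose only purpose is to make the reuse in step 4
deterministic. The res_dbg macro is one possible shape for the
compile-time debug printk idea mentioned above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build with -DRES_DEBUG to trace each step; silent otherwise. */
#ifndef RES_DEBUG
#define RES_DEBUG 0
#endif
#define res_dbg(...) \
    do { if (RES_DEBUG) fprintf(stderr, __VA_ARGS__); } while (0)

/* Hypothetical stand-in for the kernel's struct resource. */
struct sim_resource {
    unsigned long start;
    unsigned long end;
};

/* One block of "heap": either a resource or someone else's bytes. */
union sim_slot {
    struct sim_resource res;
    char text[sizeof(struct sim_resource)];
};

enum sim_owner { OWNER_NONE, OWNER_PCI_DEV, OWNER_OTHER };
enum sim_outcome { FREE_OK, FREE_STALE_UNNOTICED, FREE_HITS_FOREIGN_DATA };

static union sim_slot slot;
static enum sim_owner owner = OWNER_NONE;

static union sim_slot *sim_alloc(enum sim_owner who)
{
    assert(owner == OWNER_NONE);
    owner = who;
    res_dbg("alloc: slot handed to owner %d\n", (int)who);
    return &slot;
}

static void sim_free(void)
{
    res_dbg("free: slot released by owner %d\n", (int)owner);
    owner = OWNER_NONE;
}

/* Step 5: the bus-level free still holds the pointer from before step 3. */
static enum sim_outcome bus_free_resources(const union sim_slot *stale)
{
    if (owner == OWNER_PCI_DEV)
        return FREE_OK;
    if (owner == OWNER_NONE) {
        res_dbg("second free: slot is free, old contents intact\n");
        return FREE_STALE_UNNOTICED;
    }
    res_dbg("second free: slot now holds \"%.3s\"\n", stale->text);
    return FREE_HITS_FOREIGN_DATA;
}

enum sim_outcome simulate_hot_remove(int reuse_between)
{
    union sim_slot *bar;

    owner = OWNER_NONE;
    memset(&slot, 0, sizeof(slot));

    bar = sim_alloc(OWNER_PCI_DEV);   /* region claimed by the driver */
    bar->res.start = 0x1000;
    bar->res.end = 0x1fff;

    sim_free();                       /* steps 2-3: driver frees it, pointer kept */

    if (reuse_between) {              /* step 4: something else gets the memory */
        union sim_slot *other = sim_alloc(OWNER_OTHER);
        strcpy(other->text, "qla");
    }

    return bus_free_resources(bar);   /* step 5 */
}
```

Calling simulate_hot_remove(0) models the quiesced case where step 4
never happens and returns FREE_STALE_UNNOTICED; simulate_hot_remove(1)
models the reuse in step 4 and returns FREE_HITS_FOREIGN_DATA, which is
where the real code would be walking someone else's data.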