From mboxrd@z Thu Jan 1 00:00:00 1970
From: Russ Anderson
Date: Mon, 10 Mar 2008 21:10:31 +0000
Subject: Re: [PATCH] New way of storing MCA/INIT logs
Message-Id: <20080310211030.GB8678@sgi.com>
List-Id:
References: <47CD8142.7050207@bull.net>
In-Reply-To: <47CD8142.7050207@bull.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: linux-ia64@vger.kernel.org

On Thu, Mar 06, 2008 at 11:24:06AM +0100, Zoltan Menyhart wrote:
> Russ Anderson wrote:
>
> >I have a test case that creates that scenario. With your patch and only
> >one of the MCAs (at most) end up getting logged in
> >/var/log/salinfo/decoded .
>
> Can you describe, please, what your test does and what is the
> expected behavior of the MCA layer?

The test process allocates memory, injects an uncorrectable error,
forks a child, then both processes consume the bad data, with the
effect of two processes going into OS_MCA at the same time.

With the old code a total of four MCA records get logged. (Overkill,
an opportunity for improvement.) Each cpu that went through MCA
logs the error twice, with one of the records being marked recovered
(the two records in each pair are otherwise identical).

With the new code the first MCA is reported as occurring on cpu 0
when it occurred on cpu 1.
I think it is due to this code in arch/ia64/kernel/salinfo.c:

-------------------------------------------------------------
	n = data->cpu_check;
//	printk("CPU %d: %s(): data->cpu_check: %d, data->cpu_event: %016lx\n", smp_processor_id(),
//		__func__, n, data->cpu_event.bits[0]);
// :-)
	if (atomic_read(&ia64_MCA_logs._b_cnt) > 0 || atomic_read(&ia64_INIT_logs._b_cnt) > 0){
//		printk("cpu %d %d %d\n", cpu, atomic_read(&ia64_MCA_logs._b_cnt), atomic_read(&ia64_INIT_logs._b_cnt));
		cpu = any_online_cpu(cpu_online_map);
	} else {
		for (i = 0; i < NR_CPUS; i++) {
			if (cpu_isset(n, data->cpu_event)) {
				if (!cpu_online(n)) {
					cpu_clear(n, data->cpu_event);
					continue;
				}
				cpu = n;
				break;
			}
			if (++n == NR_CPUS)
				n = 0;
		}

		if (cpu == -1)
			goto retry;

		ia64_mlogbuf_dump();

		/* for next read, start checking at next CPU */
		data->cpu_check = cpu;
		if (++data->cpu_check == NR_CPUS)
			data->cpu_check = 0;
	}
	snprintf(cmd, sizeof(cmd), "read %d\n", cpu);
-------------------------------------------------------------

This line

	cpu = any_online_cpu(cpu_online_map);

returns 0, so the MCA gets marked as being on cpu 0 instead
of the actual cpu (cpu 1).

> Another idea: the integration into the salinfo side in not yet quit smooth,
> :-)

Understood.

> it is the polling that fetches the logs one by one. Please leave 3 periods
> for the polling to see all the logs.

-- 
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc          rja@sgi.com
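
To illustrate the difference between the two branches in the excerpt above, here is a minimal user-space sketch. It is not the kernel code: plain bit masks stand in for cpumask_t, NR_CPUS is set to a small illustrative value, and the helper names first_online_cpu() and next_event_cpu() are hypothetical stand-ins for any_online_cpu() and the round-robin loop. The offline-CPU handling is also simplified slightly compared with the loop in the excerpt.

```c
#define NR_CPUS 4	/* assumed small system, for illustration only */

/* Hypothetical stand-in for any_online_cpu(): lowest online CPU, or -1. */
static int first_online_cpu(unsigned long online)
{
	int i;

	for (i = 0; i < NR_CPUS; i++)
		if (online & (1UL << i))
			return i;
	return -1;
}

/*
 * Hypothetical stand-in for the round-robin scan: starting at *cpu_check,
 * return the first online CPU that has a pending event, or -1 if none.
 * Events recorded for offline CPUs are cleared along the way.
 */
static int next_event_cpu(unsigned long *event, unsigned long online,
			  int *cpu_check)
{
	int i, n = *cpu_check, cpu = -1;

	for (i = 0; i < NR_CPUS; i++) {
		if (*event & (1UL << n)) {
			if (online & (1UL << n)) {
				cpu = n;
				break;
			}
			*event &= ~(1UL << n);	/* CPU is offline: drop its event */
		}
		if (++n == NR_CPUS)
			n = 0;
	}
	if (cpu == -1)
		return -1;

	/* for the next call, start checking at the next CPU */
	*cpu_check = (cpu + 1 == NR_CPUS) ? 0 : cpu + 1;
	return cpu;
}
```

With cpus 0 and 1 online (mask 0x3) and an event pending only on cpu 1 (mask 0x2), first_online_cpu() yields 0 while next_event_cpu() yields 1, which matches the mislabeling described above.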