On 2014-11-14 15:59, Meelis Roos wrote:
>>>> The second oops is in blk_mq_map_queue() which is a trivial
>>>> two level cpu lookup. I wonder if there's something odd about
>>>> cpu numbers on these big old sparc systems?
>>>
>>> CPU numbers are sparse - they are determined by hardware slot number and
>>> some models only fill every other mainboard slot, and first slots can be
>>> free. I have first board offline and currently have CPUs numbered
>>> 10,11,14,15 online.
>>>
>>> Here is debug with Jens's patch:
>>
>>> [ 133.971050] CPU 11: synchronized TICK with master CPU (last diff -1 cycles, maxerr 516 cycles)
>>> [ 133.975491] CPU 14: synchronized TICK with master CPU (last diff -3 cycles, maxerr 531 cycles)
>>> [ 133.979943] CPU 15: synchronized TICK with master CPU (last diff -3 cycles, maxerr 531 cycles)
>>> [ 133.980146] Brought up 4 CPUs
>>
>> So this looks like this might be the issue. On a scsi-mq disabled boot,
>> you have 4 CPUs, but how are they numbered?
>
> The numbers are always the same.

I would hope so, my question was really on what CPU numbers you see. But
I guess that 10, 11, 14, and 15?

> But everything seems to be mapped to queue 0?

As it should, scsi-mq only supports a single hw queue for now.

>> We might need Christophs debug patch on top this to fully know...
>
> Applied it too, dmesg is below. Yes it does spam the log a lot, and over
> 9600bps console its' somewhat slow :)
>
> There is another detail to note -this server contains a faulty disk as
> sdc that times out spinup. I left it in the server because it helped to
> pinpoint and fix a previous error in esp scsi driver. This can be a
> factor here too - the error handling details.

It could be. So we have tons of mappings from CPU10 to queue 0, but then
we see this:

> [ 256.236742] cpu: 10
> [ 256.236749] queue: 809119744

and it turns to crap. This is pretty weird.

Try with this debug patch - get rid of the other ones first. It should
reduce your noise level too.

-- 
Jens Axboe