* My position on general ``RAS'' tool support infrastructure
From: Eric W. Biederman @ 2007-09-13 13:21 UTC
  To: pete
  Cc: Jason Wessel, Randy Dunlap, Matt Mackall, Amit Kale,
      Dave Anderson, kdb, jlan, Vivek Goyal, Andrew Morton,
      Kexec Mailing List, linux-kernel

Pete/Piet Delaney <pete@bluelane.com> writes:

> Jason, Eric:
>
> Did you read Keith Owens suggestion on RAS tools from:

Yes.

There is a tension here between generality of support infrastructure,
maintainability of the infrastructure, simplicity of the infrastructure
and reliability of the infrastructure.

The historical Linux perspective is that anything that compromises the
maintainability or the reliability of the kernel without the tools is
unacceptable. There is also a historical perspective that using the
single stepping mode of a debugger to diagnose problems frequently
leads to symptoms being fixed and not the actual problems being fixed.

My initial proposal in this thread was that if kdb wanted to have a
hook point someplace where we are not comfortable adding a hook point,
it could use a break point or some of the tracing infrastructure.
Somehow that suggestion seems to have gotten lost.

On the kexec on panic path the philosophy is that the kernel is broken
and as little as possible should be relied upon. So in general I am
opposed to extra code on that path. I am opposed to general hooks like
notifiers in particular, because they make adding non-paranoid code
much easier and review of the code on a particular call path much
harder.

From what I can tell the philosophy of the kdb code is that the kernel
is mostly ok except for one or two little bugs, so it is reasonable to
rely on lots of kernel infrastructure.

As I understand the problem, the difference in philosophy and
maintenance overhead is why kexec on panic has been merged and why it
has a much larger success rate than previous crash dump
implementations like lkcd. I will note that in some sense it is a
harder approach to implement, as it emphasizes the challenge of drivers
that work starting from a random hardware state, and because it draws
a clear line between the broken kernel and the recovery kernel. But
those things are exactly what encourage things to work well.

I don't mind playing well with others as long as that doesn't
compromise the implementation's reliability and maintainability.

So far it is my opinion that the current kexec on panic implementation
is insufficiently paranoid and touches the hardware and the rest of
the kernel too much. Which explains my rather strong reactions when
people suggest that we trust the broken kernel more.

I don't think this is an insolvable problem but I do think it is a hard
problem that must be solved with delicacy.

I also get irritable that the last time something like this came up I
had to have a several day long conversation with someone about why
they need a patch that has already been rejected because it
compromised the reliability of the implementation, only to discover
they were trying to make kdb and kexec on panic play nice together.

So if someone who is suggesting an implementation can absorb
and understand the requirements of the different groups and come
up with solutions that meet the requirements of the different projects
I think progress can be made. That as far as I know takes talent.

If we wind up with a situation where we have to continually review
unacceptable solutions, the choices are either get negative about it
and reject everything, or give up and let something through. Since I
think giving up in this situation is irresponsible and likely to make
a worse kernel, I am leaning very strongly towards NAK'ing everything
because I have seen so many problematic proposals that did not look
like they were on the path to something reasonable.

Eric

* Re: My position on general ``RAS'' tool support infrastructure
From: Randy Dunlap @ 2007-09-18 1:38 UTC
  To: Eric W. Biederman
  Cc: pete, Jason Wessel, Matt Mackall, Amit Kale, Dave Anderson,
      kdb, jlan, Vivek Goyal, Andrew Morton, Kexec Mailing List,
      linux-kernel

On Thu, 13 Sep 2007 07:21:10 -0600 Eric W. Biederman wrote:

> Pete/Piet Delaney <pete@bluelane.com> writes:
>
> > Jason, Eric:
> >
> > Did you read Keith Owens suggestion on RAS tools from:

Yes. and I re-read it.

There are several things in Keith's email that make sense:

a. all RAS tools should use a common interface
b. it's not the kernel's job to decide which RAS tool runs first

Eric makes some good points too. I'm mostly similar to Eric:
paranoid about trusting software/hardware after a panic (or oops).

So if someone wants to use multiple RAS tools on a panic event,
enabling an admin to set priorities is OK with me, but I'll only
trust the first one that is used, and even that one may have
problems. IOW, I don't see a big need to support multiple RAS
tools at one time. (speaking for myself)

> So if someone who is suggesting an implementation can absorb
> and understand the requirements of the different groups and come
> up with solutions that meet the requirements of the different projects
> I think progress can be made. That as far as I know takes talent.

Ack that.

---
~Randy

* Re: My position on general ``RAS'' tool support infrastructure
From: Vivek Goyal @ 2007-09-18 4:28 UTC
  To: Randy Dunlap
  Cc: Eric W. Biederman, pete, Jason Wessel, Matt Mackall, Amit Kale,
      Dave Anderson, kdb, jlan, Andrew Morton, Kexec Mailing List,
      linux-kernel

On Mon, Sep 17, 2007 at 06:38:53PM -0700, Randy Dunlap wrote:
> On Thu, 13 Sep 2007 07:21:10 -0600 Eric W. Biederman wrote:
>
> > Pete/Piet Delaney <pete@bluelane.com> writes:
> >
> > > Jason, Eric:
> > >
> > > Did you read Keith Owens suggestion on RAS tools from:
>
>
> Yes. and I re-read it.
>
> There are several things in Keith's email that make sense:
>
> a. all RAS tools should use a common interface
> b. it's not the kernel's job to decide which RAS tool runs first
>
>
> Eric makes some good points too. I'm mostly similar to Eric:
> paranoid about trusting software/hardware after a panic (or oops).
>
> So if someone wants to use multiple RAS tools on a panic event,
> enabling an admin to set priorities is OK with me, but I'll only
> trust the first one that is used, and even that one may have
> problems. IOW, I don't see a big need to support multiple RAS
> tools at one time. (speaking for myself)
>

It would be nice to have a kernel debugger co-exist with crash dumping.
I like Eric's idea of the debugger putting a break point on panic().
This would mean that the rest of the post-panic() actions have to be
performed by the second kernel, which can perform those actions much
more reliably.

But this also brings in the additional requirement of passing all the
required context to the second kernel. For example, in the past
somebody wanted to send a message to a remote node that the system
crashed so that a standby can take over. If the same job has to be done
in the second kernel, it requires all the relevant information, like
remote host IP, port etc., to be passed to the second kernel, which I
think makes the job a little harder.

Maybe one can pre-configure these parameters in user space and let the
job be done either from initrd or user space scripts in the second
kernel.

Thanks
Vivek
