From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Kampe
Subject: The costs of logging and not logging
Date: Mon, 21 Nov 2011 08:29:55 -0800
Message-ID: <4ECA7C83.1090702@dreamhost.com>
Sender: ceph-devel-owner@vger.kernel.org
To: ceph-devel@vger.kernel.org

The bugs we most dread are situations that only happen rarely, and
are only detected long after the damage has been done. Given the
business we are in, we will face many of them. We apparently have
such bugs open at this very moment.

In most cases, the primary debugging tools one has are audit and
diagnostic logs ... which WE do not have because they are too
expensive (because they are synchronously written with C++ streams)
to leave enabled all the time.

I think it is a mistake to think of audit and diagnostic logs as a
tool to be turned on when we have a problem to debug. There should
be a basic level of logging that is always enabled (so we will have
data after the first instance of the bug) ... which can be cranked
up from verbose to bombastic when we find a problem that won't yield
to more moderate interrogation:

  (a) after the problem happens is too late to start
      collecting data.

  (b) these logs are gold mines of information for a
      myriad of purposes we cannot yet even imagine.
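
To make the idea concrete, here is a minimal sketch of a logger whose
level can be raised at run time and whose callers never write to the
sink themselves. It is an illustration only, not an existing Ceph
interface: the names AsyncLog, Level, log(), set_level() and demo() are
all hypothetical, and a production version would want preallocated
buffers, timestamps and log rotation. Callers pay for one atomic load
when a message is filtered, and for one short queue insertion when it
is not; a background thread does the actual stream output.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>
#include <utility>

// Hypothetical names throughout; this is not an existing Ceph API.
enum class Level : int { Error = 0, Info = 1, Debug = 2, Trace = 3 };

class AsyncLog {
public:
    AsyncLog(std::ostream& sink, Level initial, std::size_t capacity)
        : sink_(sink), level_(static_cast<int>(initial)),
          capacity_(capacity), writer_([this] { drain(); }) {
        assert(capacity > 0);
    }

    // Flush whatever is still queued, then stop the writer thread.
    ~AsyncLog() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stopping_ = true;
        }
        cv_.notify_one();
        writer_.join();
    }

    // "Crank up" (or down) the verbosity while the process is running.
    void set_level(Level l) {
        level_.store(static_cast<int>(l), std::memory_order_relaxed);
    }

    // Returns false if the message was filtered out or dropped.
    bool log(Level l, std::string msg) {
        // Cheap path: a filtered message costs one atomic load.
        if (static_cast<int>(l) > level_.load(std::memory_order_relaxed))
            return false;
        {
            std::lock_guard<std::mutex> lock(mu_);
            if (queue_.size() >= capacity_) {  // bound the memory cost
                ++dropped_;
                return false;
            }
            queue_.push_back(std::move(msg));
        }
        cv_.notify_one();
        return true;
    }

    std::size_t dropped() const { return dropped_.load(); }

private:
    // Runs on the background thread; the only code that touches sink_.
    void drain() {
        std::unique_lock<std::mutex> lock(mu_);
        for (;;) {
            cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
            if (queue_.empty()) return;  // only reachable when stopping_
            std::deque<std::string> batch;
            batch.swap(queue_);          // take everything queued so far
            lock.unlock();               // do the slow I/O without the lock
            for (const auto& m : batch) sink_ << m << '\n';
            lock.lock();
        }
    }

    std::ostream& sink_;
    std::atomic<int> level_;
    const std::size_t capacity_;
    std::atomic<std::size_t> dropped_{0};
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::string> queue_;
    bool stopping_ = false;
    std::thread writer_;  // declared last so it starts after the rest
};

// Example use: returns everything that reached the sink.
std::string demo() {
    std::ostringstream out;
    {
        AsyncLog log(out, Level::Info, 1024);
        log.log(Level::Error, "disk full");
        log.log(Level::Debug, "chatty detail");  // filtered at Info
        log.set_level(Level::Debug);             // cranked up at run time
        log.log(Level::Debug, "now visible");
    }  // destructor flushes the queue
    return out.str();
}
```

In this sketch a full queue drops the message and counts the drop
rather than blocking the caller; blocking instead is an equally
defensible choice and trades lost data for added latency.
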

This can only be done if the logging mechanism is sufficiently
inexpensive that we are not afraid to use it:

    low execution artifact from the logging operations
    reasonable memory costs for buffering
    small enough on disk that we can keep them for months

Not having such a mechanism is (if I correctly understand) already
hurting us for internal debugging, and will quickly cripple us when
we have customer (i.e. people who cannot diagnose problems for
themselves) problems to debug.

There are many tricks to make logging cheap, and the sizes
acceptable. There are probably a dozen open-source implementations
that already do what we need, and if they don't, something basic can
be built in a two-digit number of hours. The real cost is not in
the mechanism but in adapting existing code to use it. This cost
can be mitigated by making the changes opportunistically ... one
component at a time, as dictated by need/fear. But we cannot make
that change-over until we have a mechanism.

Because the greatest cost is not the mechanism, but the change-over,
we should give more than passing thought to what mechanism to choose
... so that the decision we make remains a good one for the next
few years.

This may be something that we need to do sooner, rather than later.

regards,
    ---mark---