Date: Fri, 17 Apr 2026 18:53:55 -0500
From: Corey Minyard
To: Matt Fleming
Cc: Tony Camuso, openipmi-developer@lists.sourceforge.net,
    linux-kernel@vger.kernel.org, kernel-team@cloudflare.com,
    Matt Fleming
Subject: Re: [PATCH] ipmi: Add timeout to unconditional wait in __get_device_id()
Reply-To: corey@minyard.net
References: <20260415115930.3428942-1-matt@readmodwrite.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Apr 17, 2026 at 11:23:03PM +0100, Matt Fleming wrote:
> On Wed, Apr 15, 2026 at 07:16:53AM -0500, Corey Minyard wrote:
> >
> > The lower level driver should never not return an answer, it is supposed
> > to guarantee that it returns an error if the BMC doesn't respond.
> >
> > So the bug is not here, the bug is elsewhere.  My guess is that there
> > is some new failure mode where a BMC is not working but it responds well
> > enough that it sort of works and fools the driver.  But that's only a
> > guess.
>
> I can now reproduce this pretty reliably by running concurrent
> ipmitool commands (sensor/sel/mc info) + sysfs readers + periodic
> ipmitool mc reset cold. It wedges in a few minutes.

Hmm.  If you are sending cold resets, then the driver is going into
reset maintenance mode and it should be rejecting messages for 30
seconds after you send that command.
You can disable that by changing is_maintenance_mode_cmd() in
ipmi_msghandler.c to always return false.

>
> My working theory is handle_flags() in ipmi_si_intf.c can loop on
> flag-driven commands (e.g. READ_EVENT_MSG_BUFFER) without ever calling
> start_next_msg(), starving waiting_msg indefinitely.
>
> Captured state at wedge:
>
>   si_state=SI_GETTING_EVENTS msg_flags=0x02
>   si_curr cycling cmd=0x35 (READ_EVENT_MSG_BUFFER)
>   si_wait frozen cmd=0x08 (GET_DEVICE_GUID, never promoted)
>
> The cold reset makes the BMC report EVENT_MSG_BUFFER_FULL during
> re-init, which drives the flag loop.

The EVENT_MSG_BUFFER_FULL flag only gets cleared when an unsuccessful
READ_EVENT_MSG_BUFFER command completes.  Getting data from the BMC has
higher priority than sending data to the BMC.

If the BMC continually reports success from READ_EVENT_MSG_BUFFER, then
that would certainly wedge the driver.  But it would have to continually
report success for that command, which would be strange as it's supposed
to error out when the queue is empty.

If it's really something like that, I could also look at adding limits
for those operations.

To debug things like this I often add module_params that let me see
what is going on.  But you can look at the "invalid_events" counter
to see if the data is bogus.  Or there should be an "Event queue full,
discarding incoming events" log coming out once at the beginning of
when this happens.

-corey

>
> Thanks,
> Matt
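
For illustration only, here is a small user-space model of the scenario
discussed above.  It is a sketch, not driver code: every name in it
(sim_result, simulate, bmc_events, read_limit) is made up for this
example and nothing here is taken from ipmi_si_intf.c.  It assumes the
two behaviours described in the message: event reads take priority over
starting a waiting message, and the buffer-full flag clears only when a
read fails.  The read_limit parameter stands in for the kind of limit
mentioned as a possible fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model; not the real driver state machine. */
struct sim_result {
    int  event_reads;     /* flag-driven event reads issued */
    bool waiting_started; /* did the queued (waiting) message ever start? */
};

/*
 * Run the model for at most max_steps driver steps.
 *
 * bmc_events: number of reads the simulated BMC answers successfully
 *             before it reports an error (queue empty); a negative
 *             value models a BMC that reports success forever.
 * read_limit: maximum flag-driven reads before the waiting message is
 *             let through; 0 means no limit.
 */
static struct sim_result simulate(int bmc_events, int read_limit,
                                  int max_steps)
{
    struct sim_result r = { 0, false };
    bool buffer_full = true; /* models the EVENT_MSG_BUFFER_FULL flag */
    int consecutive = 0;

    assert(max_steps >= 0 && read_limit >= 0);

    for (int step = 0; step < max_steps && !r.waiting_started; step++) {
        bool limited = read_limit > 0 && consecutive >= read_limit;

        if (buffer_full && !limited) {
            /* Receiving has priority: issue another event read. */
            r.event_reads++;
            consecutive++;
            bool success = bmc_events < 0 || r.event_reads <= bmc_events;
            if (!success)
                buffer_full = false; /* flag clears only on a failed read */
        } else {
            /* Nothing left to fetch, or limit hit: start waiting message. */
            r.waiting_started = true;
        }
    }
    return r;
}
```

With a well-behaved BMC (say simulate(3, 0, 100)) the failed read clears
the flag and the waiting message starts.  With a BMC that always reports
success and no limit (simulate(-1, 0, 100)) the waiting message never
starts within the step budget, which matches the starvation described
above.  Adding a limit (simulate(-1, 8, 100)) lets it through.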