[PATCH v1] Revert "ipmi:si: Gracefully handle if the BMC is non-functional"

Rafael J. Wysocki posted 1 patch 1 month, 2 weeks ago
drivers/char/ipmi/ipmi_si_intf.c |   29 ++++++-----------------------
1 file changed, 6 insertions(+), 23 deletions(-)
[PATCH v1] Revert "ipmi:si: Gracefully handle if the BMC is non-functional"
Posted by Rafael J. Wysocki 1 month, 2 weeks ago
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Revert commit bc3a9d217755 ("ipmi:si: Gracefully handle if the BMC is
non-functional") that attempted to improve the handling of the cases
in which the BMC was not responsive, but did not succeed.

Instead, it introduced a regression causing AML in ACPI tables that use
IMPI operation regions to block indefinitely on the tx_msg->tx_complete
completion in acpi_ipmi_space_handler(), which may affect ACPI control
methods on any system where IPMI is involved resulting in various types
of breakage that is not straightforward to diagnose.

For example, on the system where the regression was first observed, it
caused sysfs accesses to attributes exposed by the acpi_power_meter
driver to block because they involved AML evaluation which is not
super-easy to connect to IPMI.

This is a nasty and rather urgent problem with no viable fix in sight.

Note that AI was involved in diagnosing it, but didn't help much.

Fixes: bc3a9d217755 ("ipmi:si: Gracefully handle if the BMC is non-functional")
Closes: https://lore.kernel.org/linux-acpi/CAK8fFZ6Vi4xayvdKh-_eLi-nDNMLuEoMsvwEnb33QqnwS7o4BA@mail.gmail.com/
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Tested-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/char/ipmi/ipmi_si_intf.c |   29 ++++++-----------------------
 1 file changed, 6 insertions(+), 23 deletions(-)

--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -53,7 +53,6 @@
 #define SI_TIMEOUT_JIFFIES	(SI_TIMEOUT_TIME_USEC/SI_USEC_PER_JIFFY)
 #define SI_SHORT_TIMEOUT_USEC  250 /* .25ms when the SM request a
 				      short timeout */
-#define SI_TIMEOUT_HOSED	(HZ) /* 1 second when in hosed state. */
 
 enum si_intf_state {
 	SI_NORMAL,
@@ -62,8 +61,7 @@ enum si_intf_state {
 	SI_CLEARING_FLAGS,
 	SI_GETTING_MESSAGES,
 	SI_CHECKING_ENABLES,
-	SI_SETTING_ENABLES,
-	SI_HOSED
+	SI_SETTING_ENABLES
 	/* FIXME - add watchdog stuff. */
 };
 
@@ -754,8 +752,6 @@ static void handle_transaction_done(stru
 		}
 		break;
 	}
-	case SI_HOSED: /* Shouldn't happen. */
-		break;
 	}
 }
 
@@ -770,10 +766,6 @@ static enum si_sm_result smi_event_handl
 	enum si_sm_result si_sm_result;
 
 restart:
-	if (smi_info->si_state == SI_HOSED)
-		/* Just in case, hosed state is only left from the timeout. */
-		return SI_SM_HOSED;
-
 	/*
 	 * There used to be a loop here that waited a little while
 	 * (around 25us) before giving up.  That turned out to be
@@ -797,20 +789,18 @@ restart:
 
 		/*
 		 * Do the before return_hosed_msg, because that
-		 * releases the lock.  We just disable operations for
-		 * a while and retry in hosed state.
+		 * releases the lock.
 		 */
-		smi_info->si_state = SI_HOSED;
+		smi_info->si_state = SI_NORMAL;
 		if (smi_info->curr_msg != NULL) {
 			/*
 			 * If we were handling a user message, format
 			 * a response to send to the upper layer to
 			 * tell it about the error.
 			 */
-			return_hosed_msg(smi_info, IPMI_BUS_ERR);
+			return_hosed_msg(smi_info, IPMI_ERR_UNSPECIFIED);
 		}
-		smi_mod_timer(smi_info, jiffies + SI_TIMEOUT_HOSED);
-		goto out;
+		goto restart;
 	}
 
 	/*
@@ -908,7 +898,7 @@ static void flush_messages(void *send_in
 	 * mode.  This means we are single-threaded, no need for locks.
 	 */
 	result = smi_event_handler(smi_info, 0);
-	while (result != SI_SM_IDLE && result != SI_SM_HOSED) {
+	while (result != SI_SM_IDLE) {
 		udelay(SI_SHORT_TIMEOUT_USEC);
 		result = smi_event_handler(smi_info, SI_SHORT_TIMEOUT_USEC);
 	}
@@ -921,9 +911,6 @@ static int sender(void *send_info, struc
 
 	debug_timestamp(smi_info, "Enqueue");
 
-	if (smi_info->si_state == SI_HOSED)
-		return IPMI_BUS_ERR;
-
 	if (smi_info->run_to_completion) {
 		/*
 		 * If we are running to completion, start it.  Upper
@@ -1104,10 +1091,6 @@ static void smi_timeout(struct timer_lis
 	spin_lock_irqsave(&(smi_info->si_lock), flags);
 	debug_timestamp(smi_info, "Timer");
 
-	if (smi_info->si_state == SI_HOSED)
-		/* Try something to see if the BMC is now operational. */
-		start_get_flags(smi_info);
-
 	jiffies_now = jiffies;
 	time_diff = (((long)jiffies_now - (long)smi_info->last_timeout_jiffies)
 		     * SI_USEC_PER_JIFFY);
Re: [PATCH v1] Revert "ipmi:si: Gracefully handle if the BMC is non-functional"
Posted by Guenter Roeck 1 month, 2 weeks ago
On 2/12/26 05:41, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Revert commit bc3a9d217755 ("ipmi:si: Gracefully handle if the BMC is
> non-functional") that attempted to improve the handling of the cases
> in which the BMC was not responsive, but did not succeed.
> 
> Instead, it introduced a regression causing AML in ACPI tables that use
> IMPI operation regions to block indefinitely on the tx_msg->tx_complete
> completion in acpi_ipmi_space_handler(), which may affect ACPI control
> methods on any system where IPMI is involved resulting in various types
> of breakage that is not straightforward to diagnose.
> 
> For example, on the system where the regression was first observed, it
> caused sysfs accesses to attributes exposed by the acpi_power_meter
> driver to block because they involved AML evaluation which is not
> super-easy to connect to IPMI.
> 
> This is a nasty and rather urgent problem with no viable fix in sight.
> 
> Note that AI was involved in diagnosing it, but didn't help much.
> 
> Fixes: bc3a9d217755 ("ipmi:si: Gracefully handle if the BMC is non-functional")
> Closes: https://lore.kernel.org/linux-acpi/CAK8fFZ6Vi4xayvdKh-_eLi-nDNMLuEoMsvwEnb33QqnwS7o4BA@mail.gmail.com/
> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> Tested-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> Cc: All applicable <stable@vger.kernel.org>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Acked-by: Guenter Roeck <linux@roeck-us.net>