Matt reported that there were issues with the IPMI driver getting wedged
in some cases. It turns out that the BMC was not reporting an error as
it should have (per the spec) when the event queue was empty. The IPMI
driver would then request the next event, and so on, wedging the driver.

The BMC sits on a fuzzy line between a trusted device and a remote and
possibly untrusted device. If you compromise a BMC, you have all sorts
of tools you can use to attack the host: the reset line, interrupts, and
usually access to write the system firmware and possibly devices like
disk drives, serial ports and VGA consoles. So attacking through this
interface would not be the first thing you would do. But it is a
possible attack point.

I'm assuming that the BMC was delivering an empty message when this
happens, so the first patch checks the message length to make sure it's
a valid message. It's a good check no matter what, so it's in whether
that's the issue or not.

The second patch limits the number of events or messages that can be
fetched at a time to 10. This is a good thing to do anyway. If more
messages or events are present, the next flag check should get them. So
it's a more general fix.

I looked at adding the patch Matt suggested, doing a timeout on the
wait, but that introduces some race conditions if the response comes
back late. That will require some more thought.

The timeouts with IPMI can be pretty long; the spec specifies fairly
long timeouts, 5 seconds waiting for the BMC to respond to anything. So
failing an operation can take some time, and reducing the timeouts is
probably a bad idea. No rationale is given in the spec, but I'm guessing
it expects that a BMC in restart can recover within 5 seconds, so it
gives timeouts so the BMC is always available within that time.

The spec gives you the gist that the BMC should always be available on a
system that has one. So the driver (at the beginning) followed that.
Thus the driver tries 10 times for a message before it gives up, giving
50 seconds total failure time for a message. That is not in the spec (I
don't think), so that could be made selectable on a per-message basis.
There are already mechanisms for this available in the APIs; I'll look
at that.

-corey
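The effect of the two patches described above can be modelled in a few lines of user-space Python. This is only a sketch of the logic, not the kernel code: `drain_events`, `read_event`, `MIN_RSP_LEN` and the reason strings are hypothetical names, and the 3-byte minimum (netfn/LUN, command, completion code) is an assumption about what a "valid message" means. Only the cap of 10 comes from the patch description.

```python
MAX_FETCH = 10   # cap taken from the description of the second patch
MIN_RSP_LEN = 3  # assumption: netfn/LUN byte, command byte, completion code


def drain_events(read_event, max_fetch=MAX_FETCH, min_len=MIN_RSP_LEN):
    """Fetch events until the BMC stops, misbehaves, or the cap is reached.

    read_event is a hypothetical callable returning one raw response as bytes.
    Returns (events, reason) where reason is "short", "error" or "capped".
    """
    events = []
    for _ in range(max_fetch):
        rsp = read_event()
        if len(rsp) < min_len:
            # First patch (as modelled here): reject a response that is too short.
            return events, "short"
        if rsp[2] != 0x00:
            # A well-behaved BMC reports an error once the queue is empty.
            return events, "error"
        events.append(rsp)
    # Second patch: stop after max_fetch and leave the rest for the next
    # flag check instead of looping forever.
    return events, "capped"


def stuck_bmc():
    """Model of a BMC that always answers with a successful event read."""
    return bytes([0x1C, 0x35, 0x00]) + bytes(16)


events, reason = drain_events(stuck_bmc)
print(len(events), reason)
```

Without the cap, a BMC modelled like `stuck_bmc` would keep the loop running indefinitely; with it, the loop returns after 10 fetches and control goes back to the caller.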
On Tue, Apr 21, 2026 at 07:42:42AM -0500, Corey Minyard wrote:
> Matt reported that there were issues with the IPMI driver getting wedged
> in some cases. It turns out that the BMC was not reporting an error as
> it should have (per the spec) when the event queue was empty. The IPMI
> driver would then request the next event, and so on, wedging the driver.

Thanks for replying so quickly, Corey. I'll test these out.

One bit of info I pulled out of the stuck machine is that the response
looks properly formed.

I sampled the first 8 entries and they were all identical 19-byte
successful READ_EVENT_MSG_BUFFER responses:

  1c 35 00 55 55 c0 41 a7 00 00 00 00 00 3a ff 00 ff ff ff

So on this machine, the event replies do not look short or malformed;
they look like repeated successful event-buffer reads with the same
payload.

Thanks,
Matt
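For anyone reading the dump above, a minimal decoding sketch follows. It assumes the bytes are laid out the way the Linux driver stores a response: first byte is (netfn << 2) | LUN, second byte is the command, third byte is the completion code, and the rest is payload. `decode_rsp` is a hypothetical helper name, not something from the driver.

```python
def decode_rsp(raw):
    """Split a raw IPMI response into header fields and payload.

    Assumed layout: raw[0] = (netfn << 2) | lun, raw[1] = command,
    raw[2] = completion code, raw[3:] = command-specific data.
    """
    if len(raw) < 3:
        raise ValueError("response too short for netfn/cmd/completion code")
    return {
        "netfn": raw[0] >> 2,
        "lun": raw[0] & 0x03,
        "cmd": raw[1],
        "completion_code": raw[2],
        "data": raw[3:],
    }


# The 19-byte response quoted in the message above.
sample = bytes.fromhex("1c 35 00 55 55 c0 41 a7 00 00 00 00 00 3a ff 00 ff ff ff")
fields = decode_rsp(sample)

# netfn 0x07 is the App response netfn and 0x35 is Read Event Message Buffer.
print(hex(fields["netfn"]), hex(fields["cmd"]),
      hex(fields["completion_code"]), len(fields["data"]))
```

Under that layout the sample decodes to netfn 0x07, command 0x35, completion code 0x00 and 16 bytes of event data, which is consistent with a full-length successful read rather than a short or malformed one.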
> On Apr 22, 2026, at 06:24, Matt Fleming <matt@readmodwrite.com> wrote:
>
> On Tue, Apr 21, 2026 at 07:42:42AM -0500, Corey Minyard wrote:
>> Matt reported that there were issues with the IPMI driver getting wedged
>> in some cases. It turns out that the BMC was not reporting an error as
>> it should have (per the spec) when the event queue was empty. The IPMI
>> driver would then request the next event, and so on, wedging the driver.
>
> Thanks for replying so quickly, Corey. I'll test these out.
>
> One bit of info I pulled out of the stuck machine is that the response
> looks properly formed.
>
> I sampled the first 8 entries and they were all identical 19-byte
> successful READ_EVENT_MSG_BUFFER responses:
>
> 1c 35 00 55 55 c0 41 a7 00 00 00 00 00 3a ff 00 ff ff ff
>

I may know where this data comes from.

During a previous debugging session (where ipmitool v1.8.19 failed on
sensor list due to an underflow in nr_numbers, which has since been
fixed), I noticed this behavior. However, I'm not sure why it is
implemented this way or what exactly this command is intended to do.

If you are running on OpenBMC, it is very likely related to this part,
where a fixed value is always returned (especially if the KCS channel
happens to be configured as 15):

See:
https://github.com/openbmc/phosphor-host-ipmid/blob/master/systemintfcmds.cpp#L35

Jian.

> So on this machine, the event replies do not look short or malformed;
> they look like repeated successful event-buffer reads with the same
> payload.
>
> Thanks,
> Matt