[edk2-devel] RFC: Another solution to the nested interrupt issue

Ni, Ray posted 1 patch 3 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/edk2 tags/patchew/MN6PR11MB82443B73244CEF84481629F08C7A2@MN6PR11MB8244.namprd11.prod.outlook.com
[edk2-devel] RFC: Another solution to the nested interrupt issue
Posted by Ni, Ray 3 months ago
All,
This mail is to bring another approach to solve the stack-overflow due to nested interrupts. Michael solved this problem in OVMF through NestedInterruptTplLib.

I made a draft patch as attached "DxeCore.diff". The patch simply to avoid the interrupt in enable state when TPL is dropped to the interrupted TPL. The interrupt will be enabled later through "IRET".

So, a timer driver has two ways to implement its timer interrupt handler:

1.      Do raise/restore TPL in the TimerInterruptHandler(). But call the APIs in NestedInterruptTplLib.

2.      Do not raise/restore TPL in the TimerInterruptHandler(). So that only DxeCore restores the TPL. And when DxeCore restores the TPL, the interrupt is not enabled when TPL is dropped to the interrupted TPL (as it will be enabled later by "IRET").


Implementing the logic in DxeCore does not prevent the TimerInterruptHandler() from implementing the way #1.
Agree on the draft patch?

My 2nd question is can we set a rule that TimerInterruptHandler() should NOT restore TPL so that way #2 (changing DxeCore) is enough to solve the stack overflow issue due to nested interrupts.

I was aware of the discussion between Laszlo and Michael in end of 2022 but never dig deeply as today into this problem.
I really appreciate the long discussion in the bugzilla (https://bugzilla.tianocore.org/show_bug.cgi?id=4162) and comments in NestedInterruptTplLib. I learned a lot from them and they are quite interesting!

Thanks,
Ray


-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#114369): https://edk2.groups.io/g/devel/message/114369
Mute This Topic: https://groups.io/mt/103950154/1787277
Group Owner: devel+owner@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/leave/3901457/1787277/102458076/xyzzy [importer@patchew.org]
-=-=-=-=-=-=-=-=-=-=-=-


diff --git a/MdeModulePkg/Core/Dxe/Event/Timer.c b/MdeModulePkg/Core/Dxe/Event/Timer.c
index 29e507c67c..8e8b9d6592 100644
--- a/MdeModulePkg/Core/Dxe/Event/Timer.c
+++ b/MdeModulePkg/Core/Dxe/Event/Timer.c
@@ -176,6 +176,11 @@ CoreInitializeTimer (
   ASSERT_EFI_ERROR (Status);
 }
 
+VOID
+CoreRestoreTplDeferEnableInterrupt (
+  IN EFI_TPL  NewTpl
+  );
+
 /**
   Called by the platform code to process a tick.
 
@@ -190,11 +195,9 @@ CoreTimerTick (
   )
 {
   IEVENT  *Event;
+  EFI_TPL  OriginalTpl;
 
-  //
-  // Check runtiem flag in case there are ticks while exiting boot services
-  //
-  CoreAcquireLock (&mEfiSystemTimeLock);
+  OriginalTpl = CoreRaiseTpl (TPL_HIGH_LEVEL);
 
   //
   // Update the system time
@@ -213,7 +216,7 @@ CoreTimerTick (
     }
   }
 
-  CoreReleaseLock (&mEfiSystemTimeLock);
+  CoreRestoreTplDeferEnableInterrupt (OriginalTpl);   //CoreRestoreTpl (OriginalTpl);
 }
 
 /**
diff --git a/MdeModulePkg/Core/Dxe/Event/Tpl.c b/MdeModulePkg/Core/Dxe/Event/Tpl.c
index b33f80573c..8d4b92ef4e 100644
--- a/MdeModulePkg/Core/Dxe/Event/Tpl.c
+++ b/MdeModulePkg/Core/Dxe/Event/Tpl.c
@@ -147,3 +147,55 @@ CoreRestoreTpl (
     CoreSetInterruptState (TRUE);
   }
 }
+
+
+VOID
+CoreRestoreTplDeferEnableInterrupt (
+  IN EFI_TPL  NewTpl
+  )
+{
+  EFI_TPL  OldTpl;
+  EFI_TPL  PendingTpl;
+
+  OldTpl = gEfiCurrentTpl;
+  if (NewTpl > OldTpl) {
+    DEBUG ((DEBUG_ERROR, "FATAL ERROR - RestoreTpl with NewTpl(0x%x) > OldTpl(0x%x)\n", NewTpl, OldTpl));
+    ASSERT (FALSE);
+  }
+
+  ASSERT (VALID_TPL (NewTpl));
+
+  //
+  // If lowering below HIGH_LEVEL, make sure
+  // interrupts are enabled
+  //
+
+  if ((OldTpl >= TPL_HIGH_LEVEL) &&  (NewTpl < TPL_HIGH_LEVEL)) {
+    gEfiCurrentTpl = TPL_HIGH_LEVEL;
+  }
+
+  //
+  // Dispatch any pending events
+  //
+  while (gEventPending != 0) {
+    PendingTpl = (UINTN)HighBitSet64 (gEventPending);
+    if (PendingTpl <= NewTpl) {
+      break;
+    }
+
+    gEfiCurrentTpl = PendingTpl;
+    if (gEfiCurrentTpl < TPL_HIGH_LEVEL) {
+      CoreSetInterruptState (TRUE);
+    }
+
+    CoreDispatchEventNotifies (gEfiCurrentTpl);
+  }
+
+  //
+  // Disable interrupt before restoring to the original TPL.
+  // This is to avoid the nest interrupt happens in the same TPL
+  // which will be endless.
+  //
+  CoreSetInterruptState (FALSE);
+  gEfiCurrentTpl = NewTpl;
+}
\ No newline at end of file
Re: [edk2-devel] RFC: Another solution to the nested interrupt issue
Posted by Michael Brown 3 months ago
On 25/01/2024 07:57, Ni, Ray wrote:
> This mail is to bring another approach to solve the stack-overflow due 
> to nested interrupts. Michael solved this problem in OVMF through 
> NestedInterruptTplLib.
> 
> I made a draft patch as attached “DxeCore.diff”. The patch simply to 
> avoid the interrupt in enable state when TPL is dropped to the 
> interrupted TPL. The interrupt will be enabled later through “IRET”.

I don't disagree with the approach, but it does break the API as per the 
UEFI PI specification (version 1.8 section II-12.10), and so this is not 
something that can just be dropped in as an EDK2 code change.

There are no version number fields or spare parameters (that I can see) 
that could be used within EFI_TIMER_ARCH_PROTOCOL to allow for this 
change of semantics, so this would have to be a breaking change.

I think you'd need to first define an EFI_TIMER_ARCH2_PROTOCOL with 
different semantics for RegisterHandler(), but I'm not a UEFI Forum 
member and I have no idea what the bureaucratic process is for pushing 
through this kind of change.  I suspect we'd end up having to support 
both protocol variants (which means added maintenance burden).

It might be plausible instead to extend EFI_CPU_ARCH_PROTOCOL in a 
backwards-compatible way.  The InterruptType parameter to 
RegisterInterruptHandler() is already handled in a very casual way: the 
UEFI spec defines a limited range of possible types (0-19 for IA32/X64) 
for EFI_EXCEPTION_TYPE, but the code in LocalApicTimerDxe.c already 
treats it as meaning just an IRQ vector number by passing the value 
LOCAL_APIC_TIMER_VECTOR=32.

(Incidentally, the code comments in MdePkg/Include/Protocol/Cpu.h are 
completely wrong - the description of the InterruptType parameter seems 
to have been copied from the description of the State parameter in 
EFI_CPU_GET_INTERRUPT_STATE.)

The extension I'm thinking of could be:

- Clean up the definition of EFI_EXCEPTION_TYPE to clarify that it 
represents a CPU interrupt vector number (if this is how all current 
architectures actually use it).

- Extend the definition of EFI_EXCEPTION_TYPE to include a high bit flag 
that can be ORed on to the exception type, e.g.

   #define EFI_EXCEPTION_TYPE_TPL_HIGH 0x01000000

- The RegisterInterruptHandler() implementation could then treat this 
flag as meaning that the handler should be called via a wrapper that 
does RaiseTPL(TPL_HIGH_LEVEL) before calling the handler, and the 
equivalent of your CoreRestoreTplDeferEnableInterrupt() after the 
handler returns, i.e. that the handler is automatically called at 
TPL_HIGH_LEVEL.

This would not make any breaking change to the definitions of 
EFI_TIMER_ARCH_PROTOCOL or EFI_CPU_ARCH_PROTOCOL, and so would not 
require an ARCH2_PROTOCOL variant to be created.

I haven't tried implementing this to see if it's viable, so I've 
probably missed something crucial.

Personally I would go for moving NestedInterruptTplLib to MdeModulePkg 
since this can be done today, whereas anything involving a spec change 
will take a lot more time, but that's not my call.

Thanks,

Michael



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#114401): https://edk2.groups.io/g/devel/message/114401
Mute This Topic: https://groups.io/mt/103950154/1787277
Group Owner: devel+owner@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [importer@patchew.org]
-=-=-=-=-=-=-=-=-=-=-=-


Re: [edk2-devel] RFC: Another solution to the nested interrupt issue
Posted by Ni, Ray 3 months ago
> On 25/01/2024 07:57, Ni, Ray wrote:
> > This mail is to bring another approach to solve the stack-overflow due
> > to nested interrupts. Michael solved this problem in OVMF through
> > NestedInterruptTplLib.
> >
> > I made a draft patch as attached “DxeCore.diff”. The patch simply to
> > avoid the interrupt in enable state when TPL is dropped to the
> > interrupted TPL. The interrupt will be enabled later through “IRET”.
> 
> I don't disagree with the approach, but it does break the API as per the
> UEFI PI specification (version 1.8 section II-12.10), and so this is not
> something that can just be dropped in as an EDK2 code change.

You think that the TimerInterruptHandler() doesn't raise/restore TPL
which would violate the PI spec as PI spec says " NotifyFunction ... executes at EFI_TPL_HIGH_LEVEL."?

I do not think the PI spec requires TimerInterruptHandler() raises TPL
to HIGH before invoking NotifyFunction. It just means the NotifyFunction
will execute at TPL_HIGH.

If you review HpetTimer driver, it does not raise TPL to HIGH before
invoking NotifyFunction.

And I think implementing the DxeCore changes as attached does not
prevent the TimerInterruptHandler() from calling raise/restore TPL.

So, with the changes done in DxeCore, a timer driver could either
not raise/restore TPL in TimerInterruptHandler(), or it calls
NestedInterruptTplLib if it wants.

I like the idea behind NestedInterruptTplLib. The only concern is
the maintenance effort. I want to avoid the difficulty of debuggability
brought by NestedInterruptTplLib when some timer interrupt related
issues happen in future. 

I am ok with OvmfPkg code using NestedInterruptTplLib.

Thanks,
Ray



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#114405): https://edk2.groups.io/g/devel/message/114405
Mute This Topic: https://groups.io/mt/103950154/1787277
Group Owner: devel+owner@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/leave/3901457/1787277/102458076/xyzzy [importer@patchew.org]
-=-=-=-=-=-=-=-=-=-=-=-


Re: [edk2-devel] RFC: Another solution to the nested interrupt issue
Posted by Michael Brown 3 months ago
On 25/01/2024 13:54, Ni, Ray wrote:
>> I don't disagree with the approach, but it does break the API as per the
>> UEFI PI specification (version 1.8 section II-12.10), and so this is not
>> something that can just be dropped in as an EDK2 code change.
> 
> You think that the TimerInterruptHandler() doesn't raise/restore TPL
> which would violate the PI spec as PI spec says " NotifyFunction ... executes at EFI_TPL_HIGH_LEVEL."?
> 
> I do not think the PI spec requires TimerInterruptHandler() raises TPL
> to HIGH before invoking NotifyFunction. It just means the NotifyFunction
> will execute at TPL_HIGH.

If the caller is not supposed to raise TPL to TPL_HIGH_LEVEL before 
calling NotifyFunction, then the statement "This function executes at 
EFI_TPL_HIGH_LEVEL" in the PI specification is meaningless.  There is no 
other possible interpretation besides "the caller must raise TPL to 
TPL_HIGH_LEVEL before calling this function".

> If you review HpetTimer driver, it does not raise TPL to HIGH before
> invoking NotifyFunction.

That would then be a bug in HpetTimer, which ought to be fixed.  If 
HpetTimer were to be used on a platform where the NotifyFunction 
correctly assumes that it is called at TPL_HIGH_LEVEL and does something 
that would break at a lower level, then this could lead to undefined 
behaviour.

> And I think implementing the DxeCore changes as attached does not
> prevent the TimerInterruptHandler() from calling raise/restore TPL.

No, but a spec-conforming timer interrupt handler could not take 
advantage of the feature, because it would have to raise to 
TPL_HIGH_LEVEL before calling the NotifyFunction.  (Any raise/restore 
within the NotifyFunction would then have no effect.)

> So, with the changes done in DxeCore, a timer driver could either
> not raise/restore TPL in TimerInterruptHandler(), or it calls
> NestedInterruptTplLib if it wants.

As a pure code change, I do agree that it solves the problem and it's a 
much simpler approach.  However, it is a breaking change to the 
specification and I think it would need be handled as such.

The minimal specification change I can think of that would make this 
possible would be to relax the wording on NotifyFunction in the next 
version of the PI specification to say that

* the NotifyFunction can be called at any TPL level

* the NotifyFunction will raise TPL to TPL_HIGH_LEVEL, restore TPL back 
to the original TPL before returning

* the NotifyFunction may re-enable interrupts during its execution, and 
that the caller must be prepared to be re-entered before NotifyFunction 
returns

* the timer interrupt must have been rearmed before calling NotifyFunction

* the NotifyFunction must guarantee that it never reaches a state in 
which the TPL has been restored to the original level with CPU 
interrupts enabled.

This would be backwards compatible with the existing behaviour.  A 
caller written to the current specification would call NotifyFunction at 
TPL_HIGH_LEVEL and so any RaiseTPL/RestoreTPL done within a 
NotifyFunction complying to the new specification would be a no-op anyway.

A caller written to the new specification would have to check the 
supported version of the PI specification (which I assume is available 
in some system configuration table somewhere) to know that it was safe 
to call NotifyFunction without first raising to TPL_HIGH_LEVEL.

This approach would at least avoid the need for an ARCH2_PROTOCOL 
variant, which is potentially lower impact.

Thanks,

Michael



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#114407): https://edk2.groups.io/g/devel/message/114407
Mute This Topic: https://groups.io/mt/103950154/1787277
Group Owner: devel+owner@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [importer@patchew.org]
-=-=-=-=-=-=-=-=-=-=-=-
Re: [edk2-devel] RFC: Another solution to the nested interrupt issue
Posted by Ni, Ray 3 months ago
> That would then be a bug in HpetTimer, which ought to be fixed.  If
> HpetTimer were to be used on a platform where the NotifyFunction
> correctly assumes that it is called at TPL_HIGH_LEVEL and does something
> that would break at a lower level, then this could lead to undefined
> behaviour.

In theory, it could happen that the NotifyFunction may not raise TPL
to HIGH.
In real world, NotifyFunction (CoreTimerTick() in DxeCore) does raise
TPL to HIGH.
I agree binding the NotifyFunction to CoreTimerTick() is not the best.
But it could help to reduce the complexity of the problem.


> As a pure code change, I do agree that it solves the problem and it's a
> much simpler approach.  However, it is a breaking change to the
> specification and I think it would need be handled as such.
> 
> The minimal specification change I can think of that would make this
> possible would be to relax the wording on NotifyFunction in the next
> version of the PI specification to say that
> 
> * the NotifyFunction can be called at any TPL level
> 
> * the NotifyFunction will raise TPL to TPL_HIGH_LEVEL, restore TPL back
> to the original TPL before returning
> 
> * the NotifyFunction may re-enable interrupts during its execution, and
> that the caller must be prepared to be re-entered before NotifyFunction
> returns
> 
> * the timer interrupt must have been rearmed before calling NotifyFunction
> 
> * the NotifyFunction must guarantee that it never reaches a state in
> which the TPL has been restored to the original level with CPU
> interrupts enabled.

I agree with you about the above PI spec clarifications.
Would you like to write a PI spec ECR?

But I do not think the PI spec version stored in the PI system table needs to
reflect whether a DxeCore implementation follows the clarification.

Since the DxeCore::CoreTimerTick() implementation raises TPL to HIGH in the very first version created
in more than 10 years ago, I think it's safe for TimerInterruptHandler() assumes CoreTimerTick() will
raise TPL to HIGH so that TimerInterruptHandler() does not need to raise TPL to HIGH.
(I agree changing the spec version is the most correct way if we review the problem in a very theorical way.)

I really want to keep the UEFI world simple with the bug fixed.
(The cost is assumption on existing DxeCore::CoreTimerTick().)




-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#114409): https://edk2.groups.io/g/devel/message/114409
Mute This Topic: https://groups.io/mt/103950154/1787277
Group Owner: devel+owner@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/leave/3901457/1787277/102458076/xyzzy [importer@patchew.org]
-=-=-=-=-=-=-=-=-=-=-=-


Re: [edk2-devel] RFC: Another solution to the nested interrupt issue
Posted by Michael Brown 3 months ago
On 25/01/2024 15:06, Ni, Ray wrote:
> I agree with you about the above PI spec clarifications.
> Would you like to write a PI spec ECR?

I am not a UEFI Forum member, so I don't think I have standing to do 
this.  (I also don't have the spare time to do this for free, sorry.)

> But I do not think the PI spec version stored in the PI system table needs to
> reflect whether a DxeCore implementation follows the clarification.
> 
> Since the DxeCore::CoreTimerTick() implementation raises TPL to HIGH in the very first version created
> in more than 10 years ago, I think it's safe for TimerInterruptHandler() assumes CoreTimerTick() will
> raise TPL to HIGH so that TimerInterruptHandler() does not need to raise TPL to HIGH.
> (I agree changing the spec version is the most correct way if we review the problem in a very theorical way.)

I don't see any call to RaiseTPL() in the current version of 
CoreTimerTick(), unless this is implicit in the use of CoreAcquireLock()?

> I really want to keep the UEFI world simple with the bug fixed.

I definitely agree with this aim!  :)

> (The cost is assumption on existing DxeCore::CoreTimerTick().)

For me, the problem is that we can't assume that it's the EDK2 
implementation of CoreTimerTick() that is being used.  It's perfectly 
legitimate for an implementation to not use EDK2's DxeCore, and to 
provide a NotifyFunction that *requires* being called with the TPL 
already raised to TPL_HIGH_LEVEL.

If we're not worried about maintaining conformance to the published 
specifications, then there are several very breaking changes I'd like to 
make, starting with the USB I/O protocols.  :)

Thanks,

Michael



-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#114427): https://edk2.groups.io/g/devel/message/114427
Mute This Topic: https://groups.io/mt/103950154/1787277
Group Owner: devel+owner@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [importer@patchew.org]
-=-=-=-=-=-=-=-=-=-=-=-