In RTDS, removing the last eligible pCPU can kill repl_timer.
When a pCPU is later re-added, rt_switch_sched() reinitializes the
timer object, but pending entries may already exist in replq.
Without re-arming from replq head, replenishment can remain inactive
until some unrelated event programs the timer again. This may stall
budget replenishment for non-extratime units.
Fix this by re-arming repl_timer in rt_switch_sched() immediately after
init_timer() when replq is non-empty, using the earliest pending
deadline.
This keeps behavior unchanged when replq is empty.
Signed-off-by: Oleksii Moisieiev <oleksii_moisieiev@epam.com>
---
Changes in v2:
- update commit description, remove unneeded paragraph
xen/common/sched/rt.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/xen/common/sched/rt.c b/xen/common/sched/rt.c
index 7b1f64a779..59021e1110 100644
--- a/xen/common/sched/rt.c
+++ b/xen/common/sched/rt.c
@@ -741,8 +741,17 @@ rt_switch_sched(struct scheduler *new_ops, unsigned int cpu,
if ( prv->repl_timer.status == TIMER_STATUS_invalid ||
prv->repl_timer.status == TIMER_STATUS_killed )
{
+ struct list_head *replq = rt_replq(new_ops);
+
init_timer(&prv->repl_timer, repl_timer_handler, (void *)new_ops, cpu);
dprintk(XENLOG_DEBUG, "RTDS: timer initialized on cpu %u\n", cpu);
+
+ /*
+ * When re-adding CPUs after all RTDS CPUs were removed, replq may
+ * already contain pending replenishment events. Re-arm immediately.
+ */
+ if ( !list_empty(replq) )
+ set_timer(&prv->repl_timer, replq_elem(replq->next)->cur_deadline);
}
sched_idle_unit(cpu)->priv = vdata;
--
2.43.0
base-commit: a7bf8ff218ca05eb3674fdfd2817f6cff471e96a
On 03.04.26 11:29, Oleksii Moisieiev wrote:
> In RTDS, removing the last eligible pCPU can kill repl_timer.
> When a pCPU is later re-added, rt_switch_sched() reinitializes the
> timer object, but pending entries may already exist in replq.

Did you experience this behavior?

I'm asking because I don't see how this could happen. A CPU can't be
removed from a cpupool as long as there are domains in that cpupool.
So how would replq contain entries after the last CPU of the cpupool
has been removed?

But maybe I'm missing something.


Juergen
Hi Juergen,

During our safety certification analysis work, we identified this as a
potential issue. While we haven't encountered this problem in practice
yet, it could occur in the future, so I believe it should be addressed
proactively.

--
Oleksii.


On 10/04/2026 13:14, Jürgen Groß wrote:
> On 03.04.26 11:29, Oleksii Moisieiev wrote:
>> In RTDS, removing the last eligible pCPU can kill repl_timer.
>> When a pCPU is later re-added, rt_switch_sched() reinitializes the
>> timer object, but pending entries may already exist in replq.
>
> Did you experience this behavior?
>
> I'm asking because I don't see how this could happen. A CPU can't be
> removed from a cpupool as long as there are domains in that cpupool.
> So how would replq contain entries after the last CPU of the cpupool
> has been removed?
>
> But maybe I'm missing something.
>
>
> Juergen
On 10.04.26 14:04, Oleksii Moisieiev wrote:
> Hi Juergen,
>
> During our safety certification analysis work, we identified this as a potential
> issue. While we haven't encountered this problem in practice yet, it could occur
> in the future, so I believe it should be addressed proactively.

For being able to occur in future, the handling of removing a cpu from a
cpupool would need to be changed. Considering the refusal to remove the
last cpu from a populated cpupool is on purpose (this avoids leaving a
domain without any cpu to run on), adding the code as you suggest would
just be an addition without any benefit.

It isn't doing any harm (other than adding code without purpose), so I
won't explicitly NAK the patch, but I won't Ack it either.


Juergen

> --
> Oleksii.
>
>
> On 10/04/2026 13:14, Jürgen Groß wrote:
>> On 03.04.26 11:29, Oleksii Moisieiev wrote:
>>> In RTDS, removing the last eligible pCPU can kill repl_timer.
>>> When a pCPU is later re-added, rt_switch_sched() reinitializes the
>>> timer object, but pending entries may already exist in replq.
>>
>> Did you experience this behavior?
>>
>> I'm asking because I don't see how this could happen. A CPU can't be
>> removed from a cpupool as long as there are domains in that cpupool.
>> So how would replq contain entries after the last CPU of the cpupool
>> has been removed?
>>
>> But maybe I'm missing something.
>>
>>
>> Juergen
On 10.04.26 14:13, Juergen Gross wrote:
> On 10.04.26 14:04, Oleksii Moisieiev wrote:
>> Hi Juergen,
>>
>> During our safety certification analysis work, we identified this as a potential
>> issue. While we haven't encountered this problem in practice yet, it could occur
>> in the future, so I believe it should be addressed proactively.
>
> For being able to occur in future, the handling of removing a cpu from a
> cpupool would need to be changed. Considering the refusal to remove the
> last cpu from a populated cpupool is on purpose (this avoids leaving a
> domain without any cpu to run on), adding the code as you suggest would
> just be an addition without any benefit.
>
> It isn't doing any harm (other than adding code without purpose), so I
> won't explicitly NAK the patch, but I won't Ack it either.

One further remark: I would ack the addition of an
ASSERT(list_empty(replq)) instead of the conditional set_timer() call.


Juergen
On 10/04/2026 15:16, Juergen Gross wrote:
> On 10.04.26 14:13, Juergen Gross wrote:
>> On 10.04.26 14:04, Oleksii Moisieiev wrote:
>>> Hi Juergen,
>>>
>>> During our safety certification analysis work, we identified this as
>>> a potential
>>> issue. While we haven't encountered this problem in practice yet, it
>>> could occur
>>> in the future, so I believe it should be addressed proactively.
>>
>> For being able to occur in future, the handling of removing a cpu from a
>> cpupool would need to be changed. Considering the refusal to remove the
>> last cpu from a populated cpupool is on purpose (this avoids leaving a
>> domain without any cpu to run on), adding the code as you suggest would
>> just be an addition without any benefit.
>>
>> It isn't doing any harm (other than adding code without purpose), so I
>> won't explicitly NAK the patch, but I won't Ack it either.
>
> One further remark: I would ack the addition of an
> ASSERT(list_empty(replq))
> instead of the conditional set_timer() call.
>

You're right: with the current cpupool semantics, when the timer is
re-initialized in this path, replq is expected to be empty. In that
case there is nothing to re-arm, and the timer can be programmed later
when a new replenishment event is queued.

Now I see that it would probably be better to update the cpupool logic
to prohibit removing the last pCPU from a cpupool. In that case, this
fix (even with the ASSERT) no longer seems relevant.

I think I'd rather post an update for the cpupool semantics and drop
this patch. Or I can send a v3 with the ASSERT if you think that is
still reasonable.

Thank you for the review!

--
Oleksii

>
> Juergen
On 10.04.26 17:12, Oleksii Moisieiev wrote:
> On 10/04/2026 15:16, Juergen Gross wrote:
>> On 10.04.26 14:13, Juergen Gross wrote:
>>> On 10.04.26 14:04, Oleksii Moisieiev wrote:
>>>> Hi Juergen,
>>>>
>>>> During our safety certification analysis work, we identified this as a
>>>> potential
>>>> issue. While we haven't encountered this problem in practice yet, it could
>>>> occur
>>>> in the future, so I believe it should be addressed proactively.
>>>
>>> For being able to occur in future, the handling of removing a cpu from a
>>> cpupool would need to be changed. Considering the refusal to remove the
>>> last cpu from a populated cpupool is on purpose (this avoids leaving a
>>> domain without any cpu to run on), adding the code as you suggest would
>>> just be an addition without any benefit.
>>>
>>> It isn't doing any harm (other than adding code without purpose), so I
>>> won't explicitly NAK the patch, but I won't Ack it either.
>>
>> One further remark: I would ack the addition of an ASSERT(list_empty(replq))
>> instead of the conditional set_timer() call.
>>
> You're right: with the current cpupool semantics, when the timer is re-
> initialized in this path, replq is expected to be empty. In that case there is
> nothing to re-arm, and the timer can be programmed later when a new
> replenishment event is queued.
>
> Now I see that it would probably be better to update the cpupool logic to
> prohibit removing the last pCPU from a cpupool. In that case, this fix (even
> with the ASSERT) no longer seems relevant.
>
> I think I'd rather post an update for the cpupool semantics and drop this patch.
> Or I can send a v3 with the ASSERT if you think that is still reasonable.

The cpupool semantics already exist as you describe. I wrote it this
way when I introduced cpupools.


Juergen