[PATCH net-next v6 06/12] net: pse-pd: Add support for budget evaluation strategies

Kory Maincent posted 12 patches 11 months, 1 week ago
There is a newer version of this series
[PATCH net-next v6 06/12] net: pse-pd: Add support for budget evaluation strategies
Posted by Kory Maincent 11 months, 1 week ago
From: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>

This patch introduces the ability to configure the PSE PI budget evaluation
strategies. Budget evaluation strategies is utilized by PSE controllers to
determine which ports to turn off first in scenarios such as power budget
exceedance.

The pis_prio_max value is used to define the maximum priority level
supported by the controller. Both the current priority and the maximum
priority are exposed to the user through the pse_ethtool_get_status call.

This patch add support for two mode of budget evaluation strategies.
1. Static Method:

   This method involves distributing power based on PD classification.
   It’s straightforward and stable, the PSE core keeping track of the
   budget and subtracting the power requested by each PD’s class.

   Advantages: Every PD gets its promised power at any time, which
   guarantees reliability.

   Disadvantages: PD classification steps are large, meaning devices
   request much more power than they actually need. As a result, the power
   supply may only operate at, say, 50% capacity, which is inefficient and
   wastes money.

   Priority max value is matching the number of PSE PIs within the PSE.

2. Dynamic Method:

   To address the inefficiencies of the static method, vendors like
   Microchip have introduced dynamic power budgeting, as seen in the
   PD692x0 firmware. This method monitors the current consumption per port
   and subtracts it from the available power budget. When the budget is
   exceeded, lower-priority ports are shut down.

   Advantages: This method optimizes resource utilization, saving costs.

   Disadvantages: Low-priority devices may experience instability.

   Priority max value is set by the PSE controller driver.

For now, budget evaluation methods are not configurable and cannot be
mixed. They are hardcoded in the PSE driver itself, as no current PSE
controller supports both methods.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
---
Change in v6:
- Remove Budget evaluation strategy from ethtool_pse_control_status struct.

Change in v5:
- Save PI previous power allocated in set current limit to be able to
  restore the power allocated in case of error.

Change in v4:
- Remove disconnection policy features.
- Rename port priority to budget evaluation strategy.
- Add kdoc

Change in v3:
- Add disconnection policy.
- Add management of disabled port priority in the interrupt handler.
- Move port prio mode in the power domain instead of the PSE.

Change in v2:
- Rethink the port priority support.
---
 drivers/net/pse-pd/pse_core.c | 624 ++++++++++++++++++++++++++++++++++++++----
 include/linux/pse-pd/pse.h    |  44 +++
 include/uapi/linux/ethtool.h  |  39 ++-
 3 files changed, 660 insertions(+), 47 deletions(-)

diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c
index cb3477e4bd67b..e5b40e684fff5 100644
--- a/drivers/net/pse-pd/pse_core.c
+++ b/drivers/net/pse-pd/pse_core.c
@@ -40,10 +40,13 @@ struct pse_control {
  * struct pse_power_domain - a PSE power domain
  * @id: ID of the power domain
  * @supply: Power supply the Power Domain
+ * @budget_eval_strategy: Current power budget evaluation strategy of the
+ *			  power domain
  */
 struct pse_power_domain {
 	int id;
 	struct regulator *supply;
+	u32 budget_eval_strategy;
 };
 
 static int of_load_single_pse_pi_pairset(struct device_node *node,
@@ -222,6 +225,29 @@ static int of_load_pse_pis(struct pse_controller_dev *pcdev)
 	return ret;
 }
 
+/**
+ * pse_pw_d_is_sw_pw_control - Is power control software managed
+ * @pcdev: a pointer to the PSE controller device
+ * @pw_d: a pointer to the PSE power domain
+ *
+ * Return: true if the power control of the power domain is managed from
+ * the software in the interrupt handler
+ */
+static bool pse_pw_d_is_sw_pw_control(struct pse_controller_dev *pcdev,
+				      struct pse_power_domain *pw_d)
+{
+	if (!pw_d)
+		return false;
+
+	if (pw_d->budget_eval_strategy == ETHTOOL_PSE_BUDGET_EVAL_STRAT_STATIC)
+		return true;
+	if (pw_d->budget_eval_strategy == ETHTOOL_PSE_BUDGET_EVAL_STRAT_DISABLED &&
+	    pcdev->ops->pi_enable && pcdev->irq)
+		return true;
+
+	return false;
+}
+
 static int pse_pi_is_enabled(struct regulator_dev *rdev)
 {
 	struct pse_controller_dev *pcdev = rdev_get_drvdata(rdev);
@@ -235,6 +261,11 @@ static int pse_pi_is_enabled(struct regulator_dev *rdev)
 
 	id = rdev_get_id(rdev);
 	mutex_lock(&pcdev->lock);
+	if (pse_pw_d_is_sw_pw_control(pcdev, pcdev->pi[id].pw_d)) {
+		ret = pcdev->pi[id].admin_state_enabled;
+		goto out;
+	}
+
 	ret = ops->pi_get_admin_state(pcdev, id, &admin_state);
 	if (ret)
 		goto out;
@@ -249,11 +280,260 @@ static int pse_pi_is_enabled(struct regulator_dev *rdev)
 	return ret;
 }
 
+/**
+ * pse_pi_deallocate_pw_budget - Deallocate power budget of the PI
+ * @pi: a pointer to the PSE PI
+ */
+static void pse_pi_deallocate_pw_budget(struct pse_pi *pi)
+{
+	if (!pi->pw_d || !pi->pw_allocated_mW)
+		return;
+
+	regulator_free_power_budget(pi->pw_d->supply, pi->pw_allocated_mW);
+	pi->pw_allocated_mW = 0;
+}
+
+/**
+ * _pse_pi_disable - Call disable operation. Assumes the PSE lock has been
+ *		     acquired.
+ * @pcdev: a pointer to the PSE
+ * @id: index of the PSE control
+ *
+ * Return: 0 on success and failure value on error
+ */
+static int _pse_pi_disable(struct pse_controller_dev *pcdev, int id)
+{
+	const struct pse_controller_ops *ops = pcdev->ops;
+	int ret;
+
+	if (!ops->pi_disable)
+		return -EOPNOTSUPP;
+
+	ret = ops->pi_disable(pcdev, id);
+	if (ret)
+		return ret;
+
+	pse_pi_deallocate_pw_budget(&pcdev->pi[id]);
+
+	return 0;
+}
+
+/**
+ * pse_control_find_phy_by_id - Find PHY attached to the pse control id
+ * @pcdev: a pointer to the PSE
+ * @id: index of the PSE control
+ *
+ * Return: PHY device pointer or NULL
+ */
+static struct phy_device *
+pse_control_find_phy_by_id(struct pse_controller_dev *pcdev, int id)
+{
+	struct pse_control *psec;
+
+	mutex_lock(&pse_list_mutex);
+	list_for_each_entry(psec, &pcdev->pse_control_head, list) {
+		if (psec->id == id) {
+			mutex_unlock(&pse_list_mutex);
+			return psec->attached_phydev;
+		}
+	}
+	mutex_unlock(&pse_list_mutex);
+	return NULL;
+}
+
+/**
+ * pse_disable_pi_pol - Disable a PI on a power budget policy
+ * @pcdev: a pointer to the PSE
+ * @id: index of the PSE PI
+ *
+ * Return: 0 on success and failure value on error
+ */
+static int pse_disable_pi_pol(struct pse_controller_dev *pcdev, int id)
+{
+	unsigned long notifs = ETHTOOL_PSE_EVENT_OVER_BUDGET;
+	struct netlink_ext_ack extack = {};
+	struct phy_device *phydev;
+	int ret;
+
+	dev_dbg(pcdev->dev, "Disabling PI %d to free power budget\n", id);
+
+	NL_SET_ERR_MSG_FMT(&extack,
+			   "Disabling PI %d to free power budget", id);
+
+	ret = _pse_pi_disable(pcdev, id);
+	if (ret) {
+		notifs |= ETHTOOL_PSE_EVENT_SW_PW_CONTROL_ERROR;
+	} else {
+		pcdev->pi[id].admin_state_enabled = 0;
+		pcdev->pi[id]._isr_counter_mismatch = 1;
+	}
+
+	phydev = pse_control_find_phy_by_id(pcdev, id);
+	if (phydev)
+		ethnl_pse_send_ntf(phydev, notifs, &extack);
+
+	return ret;
+}
+
+/**
+ * pse_disable_pi_prio - Disable all PIs of a given priority inside a PSE
+ *			 power domain
+ * @pcdev: a pointer to the PSE
+ * @pw_d: a pointer to the PSE power domain
+ * @prio: priority
+ *
+ * Return: 0 on success and failure value on error
+ */
+static int pse_disable_pi_prio(struct pse_controller_dev *pcdev,
+			       struct pse_power_domain *pw_d,
+			       int prio)
+{
+	int i;
+
+	for (i = 0; i < pcdev->nr_lines; i++) {
+		int ret;
+
+		if (pcdev->pi[i].prio != prio ||
+		    pcdev->pi[i].pw_d != pw_d ||
+		    !pcdev->pi[i].admin_state_enabled)
+			continue;
+
+		ret = pse_disable_pi_pol(pcdev, i);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * pse_pi_allocate_pw_budget_static_prio - Allocate power budget for the PI
+ *					   when the budget eval strategy is
+ *					   static
+ * @pcdev: a pointer to the PSE
+ * @id: index of the PSE control
+ * @pw_req: power requested in mW
+ * @extack: extack for error reporting
+ *
+ * Return: 0 on success and failure value on error
+ */
+static int
+pse_pi_allocate_pw_budget_static_prio(struct pse_controller_dev *pcdev, int id,
+				      int pw_req, struct netlink_ext_ack *extack)
+{
+	struct pse_pi *pi = &pcdev->pi[id];
+	int ret, _prio;
+
+	_prio = pcdev->nr_lines;
+	while (regulator_request_power_budget(pi->pw_d->supply, pw_req) == -ERANGE) {
+		if (_prio <= pi->prio) {
+			NL_SET_ERR_MSG_FMT(extack,
+					   "PI %d: not enough power budget available",
+					   id);
+			return -ERANGE;
+		}
+
+		ret = pse_disable_pi_prio(pcdev, pi->pw_d, _prio);
+		if (ret < 0)
+			return ret;
+
+		_prio--;
+	}
+
+	pi->pw_allocated_mW = pw_req;
+	return 0;
+}
+
+/**
+ * pse_pi_allocate_pw_budget - Allocate power budget for the PI
+ * @pcdev: a pointer to the PSE
+ * @id: index of the PSE control
+ * @pw_req: power requested in mW
+ * @extack: extack for error reporting
+ *
+ * Return: 0 on success and failure value on error
+ */
+static int pse_pi_allocate_pw_budget(struct pse_controller_dev *pcdev, int id,
+				     int pw_req, struct netlink_ext_ack *extack)
+{
+	struct pse_pi *pi = &pcdev->pi[id];
+
+	if (!pi->pw_d)
+		return 0;
+
+	/* ETHTOOL_PSE_BUDGET_EVAL_STRAT_STATIC */
+	if (pi->pw_d->budget_eval_strategy == ETHTOOL_PSE_BUDGET_EVAL_STRAT_STATIC)
+		return pse_pi_allocate_pw_budget_static_prio(pcdev, id, pw_req,
+							     extack);
+
+	return 0;
+}
+
+/**
+ * _pse_pi_enable_sw_pw_ctrl - Enable PSE PI in case of software power control.
+ *			       Assumes the PSE lock has been acquired
+ * @pcdev: a pointer to the PSE
+ * @id: index of the PSE control
+ * @extack: extack for error reporting
+ *
+ * Return: 0 on success and failure value on error
+ */
+static int _pse_pi_enable_sw_pw_ctrl(struct pse_controller_dev *pcdev, int id,
+				     struct netlink_ext_ack *extack)
+{
+	const struct pse_controller_ops *ops = pcdev->ops;
+	struct pse_pi *pi = &pcdev->pi[id];
+	int ret, pw_req;
+
+	if (!ops->pi_get_pw_req) {
+		/* No power allocation management */
+		ret = ops->pi_enable(pcdev, id);
+		if (ret)
+			NL_SET_ERR_MSG_FMT(extack,
+					   "PI %d: enable error %d",
+					   id, ret);
+		return ret;
+	}
+
+	ret = ops->pi_get_pw_req(pcdev, id);
+	if (ret < 0)
+		return ret;
+
+	pw_req = ret;
+
+	/* Compare requested power with port power limit and use the lowest
+	 * one.
+	 */
+	if (ops->pi_get_pw_limit) {
+		ret = ops->pi_get_pw_limit(pcdev, id);
+		if (ret < 0)
+			return ret;
+
+		if (ret < pw_req)
+			pw_req = ret;
+	}
+
+	ret = pse_pi_allocate_pw_budget(pcdev, id, pw_req, extack);
+	if (ret)
+		return ret;
+
+	ret = ops->pi_enable(pcdev, id);
+	if (ret) {
+		pse_pi_deallocate_pw_budget(pi);
+		NL_SET_ERR_MSG_FMT(extack,
+				   "PI %d: enable error %d",
+				   id, ret);
+		return ret;
+	}
+
+	return 0;
+}
+
 static int pse_pi_enable(struct regulator_dev *rdev)
 {
 	struct pse_controller_dev *pcdev = rdev_get_drvdata(rdev);
 	const struct pse_controller_ops *ops;
-	int id, ret;
+	int id, ret = 0;
 
 	ops = pcdev->ops;
 	if (!ops->pi_enable)
@@ -261,6 +541,23 @@ static int pse_pi_enable(struct regulator_dev *rdev)
 
 	id = rdev_get_id(rdev);
 	mutex_lock(&pcdev->lock);
+	if (pse_pw_d_is_sw_pw_control(pcdev, pcdev->pi[id].pw_d)) {
+		/* Manage enabled status by software.
+		 * Real enable process will happen if a port is connected.
+		 */
+		if (pcdev->pi[id].isr_pd_detected) {
+			struct netlink_ext_ack extack;
+
+			ret = _pse_pi_enable_sw_pw_ctrl(pcdev, id, &extack);
+			if (!ret)
+				pcdev->pi[id].admin_state_enabled = 1;
+		} else {
+			pcdev->pi[id].admin_state_enabled = 1;
+		}
+		mutex_unlock(&pcdev->lock);
+		return ret;
+	}
+
 	ret = ops->pi_enable(pcdev, id);
 	if (!ret)
 		pcdev->pi[id].admin_state_enabled = 1;
@@ -272,21 +569,20 @@ static int pse_pi_enable(struct regulator_dev *rdev)
 static int pse_pi_disable(struct regulator_dev *rdev)
 {
 	struct pse_controller_dev *pcdev = rdev_get_drvdata(rdev);
-	const struct pse_controller_ops *ops;
+	struct pse_pi *pi;
 	int id, ret;
 
-	ops = pcdev->ops;
-	if (!ops->pi_disable)
-		return -EOPNOTSUPP;
-
 	id = rdev_get_id(rdev);
+	pi = &pcdev->pi[id];
 	mutex_lock(&pcdev->lock);
-	ret = ops->pi_disable(pcdev, id);
-	if (!ret)
-		pcdev->pi[id].admin_state_enabled = 0;
-	mutex_unlock(&pcdev->lock);
+	ret = _pse_pi_disable(pcdev, id);
+	if (!ret) {
+		pi->admin_state_enabled = 0;
+		pcdev->pi[id]._isr_counter_mismatch = 0;
+	}
 
-	return ret;
+	mutex_unlock(&pcdev->lock);
+	return 0;
 }
 
 static int _pse_pi_get_voltage(struct regulator_dev *rdev)
@@ -542,6 +838,10 @@ static int pse_register_pw_ds(struct pse_controller_dev *pcdev)
 		}
 
 		pw_d->supply = supply;
+		if (pcdev->supp_budget_eval_strategies)
+			pw_d->budget_eval_strategy = pcdev->supp_budget_eval_strategies;
+		else
+			pw_d->budget_eval_strategy = ETHTOOL_PSE_BUDGET_EVAL_STRAT_DISABLED;
 		pcdev->pi[i].pw_d = pw_d;
 	}
 
@@ -699,27 +999,45 @@ static unsigned long pse_to_regulator_notifs(unsigned long notifs)
 }
 
 /**
- * pse_control_find_phy_by_id - Find PHY attached to the pse control id
+ * pse_set_config_isr - Set PSE control config according to the PSE
+ *			notifications
  * @pcdev: a pointer to the PSE
  * @id: index of the PSE control
+ * @notifs: PSE event notifications
+ * @extack: extack for error reporting
  *
- * Return: PHY device pointer or NULL
+ * Return: 0 on success and failure value on error
  */
-static struct phy_device *
-pse_control_find_phy_by_id(struct pse_controller_dev *pcdev, int id)
+static int pse_set_config_isr(struct pse_controller_dev *pcdev, int id,
+			      unsigned long notifs,
+			      struct netlink_ext_ack *extack)
 {
-	struct pse_control *psec;
+	int ret = 0;
 
-	mutex_lock(&pse_list_mutex);
-	list_for_each_entry(psec, &pcdev->pse_control_head, list) {
-		if (psec->id == id) {
-			mutex_unlock(&pse_list_mutex);
-			return psec->attached_phydev;
-		}
+	if (notifs & ETHTOOL_PSE_BUDGET_EVAL_STRAT_DYNAMIC)
+		return 0;
+
+	if ((notifs & ETHTOOL_C33_PSE_EVENT_DISCONNECTION) &&
+	    ((notifs & ETHTOOL_C33_PSE_EVENT_DETECTION) ||
+	     (notifs & ETHTOOL_C33_PSE_EVENT_CLASSIFICATION))) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "PI %d: error, connection and disconnection reported simultaneously",
+				   id);
+		return -EINVAL;
 	}
-	mutex_unlock(&pse_list_mutex);
 
-	return NULL;
+	if (notifs & ETHTOOL_C33_PSE_EVENT_CLASSIFICATION) {
+		pcdev->pi[id].isr_pd_detected = true;
+		if (pcdev->pi[id].admin_state_enabled)
+			ret = _pse_pi_enable_sw_pw_ctrl(pcdev, id, extack);
+	} else if (notifs & ETHTOOL_C33_PSE_EVENT_DISCONNECTION) {
+		if (pcdev->pi[id].admin_state_enabled &&
+		    pcdev->pi[id].isr_pd_detected)
+			ret = _pse_pi_disable(pcdev, id);
+		pcdev->pi[id].isr_pd_detected = false;
+	}
+
+	return ret;
 }
 
 /**
@@ -745,9 +1063,10 @@ static irqreturn_t pse_isr(int irq, void *data)
 	memset(h->notifs, 0, pcdev->nr_lines * sizeof(*h->notifs));
 	mutex_lock(&pcdev->lock);
 	ret = desc->map_event(irq, pcdev, h->notifs, &notifs_mask);
-	mutex_unlock(&pcdev->lock);
-	if (ret || !notifs_mask)
+	if (ret || !notifs_mask) {
+		mutex_unlock(&pcdev->lock);
 		return IRQ_NONE;
+	}
 
 	for_each_set_bit(i, &notifs_mask, pcdev->nr_lines) {
 		struct phy_device *phydev;
@@ -758,6 +1077,12 @@ static irqreturn_t pse_isr(int irq, void *data)
 			continue;
 
 		notifs = h->notifs[i];
+		if (pse_pw_d_is_sw_pw_control(pcdev, pcdev->pi[i].pw_d)) {
+			ret = pse_set_config_isr(pcdev, i, notifs, &extack);
+			if (ret)
+				notifs |= ETHTOOL_PSE_EVENT_SW_PW_CONTROL_ERROR;
+		}
+
 		dev_dbg(h->pcdev->dev,
 			"Sending PSE notification EVT 0x%lx\n", notifs);
 
@@ -769,6 +1094,8 @@ static irqreturn_t pse_isr(int irq, void *data)
 					      NULL);
 	}
 
+	mutex_unlock(&pcdev->lock);
+
 	return IRQ_HANDLED;
 }
 
@@ -863,6 +1190,7 @@ static struct pse_control *
 pse_control_get_internal(struct pse_controller_dev *pcdev, unsigned int index,
 			 struct phy_device *phydev)
 {
+	struct pse_admin_state admin_state = {0};
 	struct pse_control *psec;
 	int ret;
 
@@ -884,6 +1212,23 @@ pse_control_get_internal(struct pse_controller_dev *pcdev, unsigned int index,
 		goto free_psec;
 	}
 
+	if (!pcdev->ops->pi_get_admin_state) {
+		ret = -EOPNOTSUPP;
+		goto free_psec;
+	}
+
+	/* Initialize admin_state_enabled before the regulator_get. This
+	 * aims to have the right value reported in the first is_enabled
+	 * call in case of control managed by software.
+	 */
+	ret = pcdev->ops->pi_get_admin_state(pcdev, index, &admin_state);
+	if (ret)
+		goto free_psec;
+
+	if (admin_state.podl_admin_state == ETHTOOL_PODL_PSE_ADMIN_STATE_ENABLED ||
+	    admin_state.c33_admin_state == ETHTOOL_C33_PSE_ADMIN_STATE_ENABLED)
+		pcdev->pi[index].admin_state_enabled = 1;
+
 	psec->ps = devm_regulator_get_exclusive(pcdev->dev,
 						rdev_get_name(pcdev->pi[index].rdev));
 	if (IS_ERR(psec->ps)) {
@@ -891,12 +1236,6 @@ pse_control_get_internal(struct pse_controller_dev *pcdev, unsigned int index,
 		goto put_module;
 	}
 
-	ret = regulator_is_enabled(psec->ps);
-	if (ret < 0)
-		goto regulator_put;
-
-	pcdev->pi[index].admin_state_enabled = ret;
-
 	psec->pcdev = pcdev;
 	list_add(&psec->list, &pcdev->pse_control_head);
 	psec->id = index;
@@ -905,8 +1244,6 @@ pse_control_get_internal(struct pse_controller_dev *pcdev, unsigned int index,
 
 	return psec;
 
-regulator_put:
-	devm_regulator_put(psec->ps);
 put_module:
 	module_put(pcdev->owner);
 free_psec:
@@ -1017,6 +1354,35 @@ struct pse_control *of_pse_control_get(struct device_node *node,
 }
 EXPORT_SYMBOL_GPL(of_pse_control_get);
 
+/**
+ * pse_get_sw_admin_state - Convert the software admin state to c33 or podl
+ *			    admin state value used in the standard
+ * @psec: PSE control pointer
+ * @admin_state: a pointer to the admin_state structure
+ */
+static void pse_get_sw_admin_state(struct pse_control *psec,
+				   struct pse_admin_state *admin_state)
+{
+	struct pse_pi *pi = &psec->pcdev->pi[psec->id];
+
+	if (pse_has_podl(psec)) {
+		if (pi->admin_state_enabled)
+			admin_state->podl_admin_state =
+				ETHTOOL_PODL_PSE_ADMIN_STATE_ENABLED;
+		else
+			admin_state->podl_admin_state =
+				ETHTOOL_PODL_PSE_ADMIN_STATE_DISABLED;
+	}
+	if (pse_has_c33(psec)) {
+		if (pi->admin_state_enabled)
+			admin_state->c33_admin_state =
+				ETHTOOL_C33_PSE_ADMIN_STATE_ENABLED;
+		else
+			admin_state->c33_admin_state =
+				ETHTOOL_C33_PSE_ADMIN_STATE_DISABLED;
+	}
+}
+
 /**
  * pse_ethtool_get_status - get status of PSE control
  * @psec: PSE control pointer
@@ -1033,19 +1399,46 @@ int pse_ethtool_get_status(struct pse_control *psec,
 	struct pse_pw_status pw_status = {0};
 	const struct pse_controller_ops *ops;
 	struct pse_controller_dev *pcdev;
+	struct pse_pi *pi;
 	int ret;
 
 	pcdev = psec->pcdev;
 	ops = pcdev->ops;
+
+	pi = &pcdev->pi[psec->id];
 	mutex_lock(&pcdev->lock);
-	if (pcdev->pi[psec->id].pw_d)
-		status->pw_d_id = pcdev->pi[psec->id].pw_d->id;
+	if (pi->pw_d) {
+		status->pw_d_id = pi->pw_d->id;
+		if (pse_pw_d_is_sw_pw_control(pcdev, pi->pw_d)) {
+			pse_get_sw_admin_state(psec, &admin_state);
+		} else {
+			ret = ops->pi_get_admin_state(pcdev, psec->id,
+						      &admin_state);
+			if (ret)
+				goto out;
+		}
+		status->podl_admin_state = admin_state.podl_admin_state;
+		status->c33_admin_state = admin_state.c33_admin_state;
 
-	ret = ops->pi_get_admin_state(pcdev, psec->id, &admin_state);
-	if (ret)
-		goto out;
-	status->podl_admin_state = admin_state.podl_admin_state;
-	status->c33_admin_state = admin_state.c33_admin_state;
+		switch (pi->pw_d->budget_eval_strategy) {
+		case ETHTOOL_PSE_BUDGET_EVAL_STRAT_STATIC:
+			status->prio_max = pcdev->nr_lines;
+			status->prio = pi->prio;
+			break;
+		case ETHTOOL_PSE_BUDGET_EVAL_STRAT_DYNAMIC:
+			status->prio_max = pcdev->pis_prio_max;
+			if (ops->pi_get_prio) {
+				ret = ops->pi_get_prio(pcdev, psec->id);
+				if (ret < 0)
+					goto out;
+
+				status->prio = ret;
+			}
+			break;
+		default:
+			break;
+		}
+	}
 
 	ret = ops->pi_get_pw_status(pcdev, psec->id, &pw_status);
 	if (ret)
@@ -1120,11 +1513,15 @@ static int pse_ethtool_c33_set_config(struct pse_control *psec,
 	case ETHTOOL_C33_PSE_ADMIN_STATE_ENABLED:
 		/* We could have mismatch between admin_state_enabled and
 		 * state reported by regulator_is_enabled. This can occur when
-		 * the PI is forcibly turn off by the controller. Call
+		 * the PI is forcibly turn off by the controller or in power
+		 * off case in the interrupt context. Call
 		 * regulator_disable on that case to fix the counters state.
+		 * disable action might be called two times consecutively
+		 * but that is not a real issue.
 		 */
-		if (psec->pcdev->pi[psec->id].admin_state_enabled &&
-		    !regulator_is_enabled(psec->ps)) {
+		if ((psec->pcdev->pi[psec->id].admin_state_enabled &&
+		     !regulator_is_enabled(psec->ps)) ||
+		    psec->pcdev->pi[psec->id]._isr_counter_mismatch) {
 			err = regulator_disable(psec->ps);
 			if (err)
 				break;
@@ -1194,6 +1591,52 @@ int pse_ethtool_set_config(struct pse_control *psec,
 }
 EXPORT_SYMBOL_GPL(pse_ethtool_set_config);
 
+/**
+ * pse_pi_update_pw_budget - Update PSE power budget allocated with new
+ *			     power in mW
+ * @pcdev: a pointer to the PSE controller device
+ * @id: index of the PSE PI
+ * @pw_req: power requested
+ * @extack: extack for reporting useful error messages
+ *
+ * Return: Previous power allocated on success and failure value on error
+ */
+static int pse_pi_update_pw_budget(struct pse_controller_dev *pcdev, int id,
+				   const unsigned int pw_req,
+				   struct netlink_ext_ack *extack)
+{
+	struct pse_pi *pi = &pcdev->pi[id];
+	int previous_pw_allocated;
+	int pw_diff, ret = 0;
+
+	/* We don't want pw_allocated_mW value change in the middle of an
+	 * power budget update
+	 */
+	mutex_lock(&pcdev->lock);
+	previous_pw_allocated = pi->pw_allocated_mW;
+	pw_diff = pw_req - previous_pw_allocated;
+	if (!pw_diff) {
+		goto out;
+	} else if (pw_diff > 0) {
+		ret = regulator_request_power_budget(pi->pw_d->supply, pw_diff);
+		if (ret) {
+			NL_SET_ERR_MSG_FMT(extack,
+					   "PI %d: not enough power budget available",
+					   id);
+			goto out;
+		}
+
+	} else {
+		regulator_free_power_budget(pi->pw_d->supply, -pw_diff);
+	}
+	pi->pw_allocated_mW = pw_req;
+	ret = previous_pw_allocated;
+
+out:
+	mutex_unlock(&pcdev->lock);
+	return ret;
+}
+
 /**
  * pse_ethtool_set_pw_limit - set PSE control power limit
  * @psec: PSE control pointer
@@ -1206,7 +1649,7 @@ int pse_ethtool_set_pw_limit(struct pse_control *psec,
 			     struct netlink_ext_ack *extack,
 			     const unsigned int pw_limit)
 {
-	int uV, uA, ret;
+	int uV, uA, ret, previous_pw_allocated = 0;
 	s64 tmp_64;
 
 	if (pw_limit > MAX_PI_PW)
@@ -1230,10 +1673,99 @@ int pse_ethtool_set_pw_limit(struct pse_control *psec,
 	/* uA = mW * 1000000000 / uV */
 	uA = DIV_ROUND_CLOSEST_ULL(tmp_64, uV);
 
-	return regulator_set_current_limit(psec->ps, 0, uA);
+	/* Update power budget only in software power control case and
+	 * if a Power Device is powered.
+	 */
+	if (pse_pw_d_is_sw_pw_control(psec->pcdev,
+				      psec->pcdev->pi[psec->id].pw_d) &&
+	    psec->pcdev->pi[psec->id].admin_state_enabled &&
+	    psec->pcdev->pi[psec->id].isr_pd_detected) {
+		ret = pse_pi_update_pw_budget(psec->pcdev, psec->id,
+					      pw_limit, extack);
+		if (ret < 0)
+			return ret;
+		previous_pw_allocated = ret;
+	}
+
+	ret = regulator_set_current_limit(psec->ps, 0, uA);
+	if (ret < 0 && previous_pw_allocated) {
+		pse_pi_update_pw_budget(psec->pcdev, psec->id,
+					previous_pw_allocated, extack);
+	}
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(pse_ethtool_set_pw_limit);
 
+/**
+ * pse_ethtool_set_prio - Set PSE PI priority according to the budget
+ *			  evaluation strategy
+ * @psec: PSE control pointer
+ * @extack: extack for reporting useful error messages
+ * @prio: priovity value
+ *
+ * Return: 0 on success and failure value on error
+ */
+int pse_ethtool_set_prio(struct pse_control *psec,
+			 struct netlink_ext_ack *extack,
+			 unsigned int prio)
+{
+	struct pse_controller_dev *pcdev = psec->pcdev;
+	const struct pse_controller_ops *ops;
+	int ret = 0;
+
+	if (!pcdev->pi[psec->id].pw_d) {
+		NL_SET_ERR_MSG(extack, "no power domain attached");
+		return -EOPNOTSUPP;
+	}
+
+	/* We don't want priority change in the middle of an
+	 * enable/disable call or a priority mode change
+	 */
+	mutex_lock(&pcdev->lock);
+	switch (pcdev->pi[psec->id].pw_d->budget_eval_strategy) {
+	case ETHTOOL_PSE_BUDGET_EVAL_STRAT_STATIC:
+		if (prio > pcdev->nr_lines) {
+			NL_SET_ERR_MSG_FMT(extack,
+					   "priority %d exceed priority max %d",
+					   prio, pcdev->nr_lines);
+			ret = -ERANGE;
+			goto out;
+		}
+
+		pcdev->pi[psec->id].prio = prio;
+		break;
+
+	case ETHTOOL_PSE_BUDGET_EVAL_STRAT_DYNAMIC:
+		ops = psec->pcdev->ops;
+		if (!ops->pi_set_prio) {
+			NL_SET_ERR_MSG(extack,
+				       "pse driver does not support setting port priority");
+			ret = -EOPNOTSUPP;
+			goto out;
+		}
+
+		if (prio > pcdev->pis_prio_max) {
+			NL_SET_ERR_MSG_FMT(extack,
+					   "priority %d exceed priority max %d",
+					   prio, pcdev->pis_prio_max);
+			ret = -ERANGE;
+			goto out;
+		}
+
+		ret = ops->pi_set_prio(pcdev, psec->id, prio);
+		break;
+
+	default:
+		ret = -EOPNOTSUPP;
+	}
+
+out:
+	mutex_unlock(&pcdev->lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pse_ethtool_set_prio);
+
 bool pse_has_podl(struct pse_control *psec)
 {
 	return psec->pcdev->types & ETHTOOL_PSE_PODL;
diff --git a/include/linux/pse-pd/pse.h b/include/linux/pse-pd/pse.h
index ffa6cf9a0072f..b47d2d2c807e9 100644
--- a/include/linux/pse-pd/pse.h
+++ b/include/linux/pse-pd/pse.h
@@ -132,6 +132,9 @@ struct pse_pw_limit_ranges {
  *	is in charge of the memory allocation
  * @c33_pw_limit_nb_ranges: number of supported power limit configuration
  *	ranges
+ * @prio_max: max priority allowed for the c33_prio variable value.
+ * @prio: priority of the PSE. Managed by PSE core in case of static budget
+ *	evaluation strategy.
  */
 struct ethtool_pse_control_status {
 	u32 pw_d_id;
@@ -145,6 +148,8 @@ struct ethtool_pse_control_status {
 	u32 c33_avail_pw_limit;
 	struct ethtool_c33_pse_pw_limit_range *c33_pw_limit_ranges;
 	u32 c33_pw_limit_nb_ranges;
+	u32 prio_max;
+	u32 prio;
 };
 
 /**
@@ -168,6 +173,11 @@ struct ethtool_pse_control_status {
  *			    range. The driver is in charge of the memory
  *			    allocation and should return the number of
  *			    ranges.
+ * @pi_get_prio: Get the PSE PI priority.
+ * @pi_set_prio: Configure the PSE PI priority.
+ * @pi_get_pw_req: Get the power requested by a PD before enabling the PSE PI.
+ *		   This is only relevant when an interrupt is registered using
+ *		   devm_pse_irq_helper helper.
  */
 struct pse_controller_ops {
 	int (*setup_pi_matrix)(struct pse_controller_dev *pcdev);
@@ -188,6 +198,10 @@ struct pse_controller_ops {
 			       int id, int max_mW);
 	int (*pi_get_pw_limit_ranges)(struct pse_controller_dev *pcdev, int id,
 				      struct pse_pw_limit_ranges *pw_limit_ranges);
+	int (*pi_get_prio)(struct pse_controller_dev *pcdev, int id);
+	int (*pi_set_prio)(struct pse_controller_dev *pcdev, int id,
+			   unsigned int prio);
+	int (*pi_get_pw_req)(struct pse_controller_dev *pcdev, int id);
 };
 
 struct module;
@@ -223,6 +237,17 @@ struct pse_pi_pairset {
  * @rdev: regulator represented by the PSE PI
  * @admin_state_enabled: PI enabled state
  * @pw_d: Power domain of the PSE PI
+ * @prio: Priority of the PSE PI. Used in static budget evaluation strategy
+ * @isr_pd_detected: PSE PI detection status managed by the interruption
+ *		     handler. This variable is relevant when the power enabled
+ *		     management is managed in software like the static
+ *		     budget evaluation strategy.
+ * @pw_allocated_mW: Power allocated to a PSE PI to manage power budget in
+ *		     static budget evaluation strategy.
+ * @_isr_counter_mismatch: Internal flag used in PSE core in case of a
+ *			   counter mismatch between regulator and PSE API.
+ *			   This is caused by a disable call in the interrupt
+ *			   context handler.
  */
 struct pse_pi {
 	struct pse_pi_pairset pairset[2];
@@ -230,6 +255,10 @@ struct pse_pi {
 	struct regulator_dev *rdev;
 	bool admin_state_enabled;
 	struct pse_power_domain *pw_d;
+	int prio;
+	bool isr_pd_detected;
+	int pw_allocated_mW;
+	bool _isr_counter_mismatch;
 };
 
 /**
@@ -247,6 +276,9 @@ struct pse_pi {
  * @pi: table of PSE PIs described in this controller device
  * @no_of_pse_pi: flag set if the pse_pis devicetree node is not used
  * @irq: PSE interrupt
+ * @pis_prio_max: Maximum value allowed for the PSE PIs priority
+ * @supp_budget_eval_strategies: budget evaluation strategies supported
+ *				 by the PSE
  */
 struct pse_controller_dev {
 	const struct pse_controller_ops *ops;
@@ -261,6 +293,8 @@ struct pse_controller_dev {
 	struct pse_pi *pi;
 	bool no_of_pse_pi;
 	int irq;
+	unsigned int pis_prio_max;
+	u32 supp_budget_eval_strategies;
 };
 
 #if IS_ENABLED(CONFIG_PSE_CONTROLLER)
@@ -285,6 +319,9 @@ int pse_ethtool_set_config(struct pse_control *psec,
 int pse_ethtool_set_pw_limit(struct pse_control *psec,
 			     struct netlink_ext_ack *extack,
 			     const unsigned int pw_limit);
+int pse_ethtool_set_prio(struct pse_control *psec,
+			 struct netlink_ext_ack *extack,
+			 unsigned int prio);
 
 bool pse_has_podl(struct pse_control *psec);
 bool pse_has_c33(struct pse_control *psec);
@@ -322,6 +359,13 @@ static inline int pse_ethtool_set_pw_limit(struct pse_control *psec,
 	return -EOPNOTSUPP;
 }
 
+static inline int pse_ethtool_set_prio(struct pse_control *psec,
+				       struct netlink_ext_ack *extack,
+				       unsigned int prio)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline bool pse_has_podl(struct pse_control *psec)
 {
 	return false;
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 5e99daf42d440..0c6aef700a8ff 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -1006,6 +1006,20 @@ enum ethtool_c33_pse_pw_d_status {
  * enum ethtool_pse_events - event list of the PSE controller.
  * @ETHTOOL_PSE_EVENT_OVER_CURRENT: PSE output current is too high.
  * @ETHTOOL_PSE_EVENT_OVER_TEMP: PSE in over temperature state.
+ * @ETHTOOL_C33_PSE_EVENT_DETECTION: detection process occur on the PSE.
+ *	IEEE 802.3-2022 33.2.5 and 145.2.6 PSE detection of PDs.
+ *	IEEE 802.3-202 30.9.1.1.5 aPSEPowerDetectionStatus.
+ * @ETHTOOL_C33_PSE_EVENT_CLASSIFICATION: classification process occur on
+ *	the PSE. IEEE 802.3-2022 33.2.6 and 145.2.8 classification of PDs and
+ *	mutual identification.
+ *	IEEE 802.3-2022 30.9.1.1.8 aPSEPowerClassification.
+ * @ETHTOOL_C33_PSE_EVENT_DISCONNECTION: PD has been disconnected on the PSE.
+ *	IEEE 802.3-2022 33.3.8 and 145.3.9 PD Maintain Power Signature.
+ *	IEEE 802.3-2022 33.5.1.2.9 MPS Absent.
+ *	IEEE 802.3-2022 30.9.1.1.20 aPSEMPSAbsentCounter.
+ * @ETHTOOL_PSE_EVENT_OVER_BUDGET: PSE turned off due to over budget situation.
+ * @ETHTOOL_PSE_EVENT_SW_PW_CONTROL_ERROR: PSE faced an error managing the
+ *	power control from software.
  *
  * @ETHTOOL_PSE_EVENT_LAST: Last PSE event of the enum.
  */
@@ -1013,8 +1027,31 @@ enum ethtool_c33_pse_pw_d_status {
 enum ethtool_pse_events {
 	ETHTOOL_PSE_EVENT_OVER_CURRENT =	1 << 0,
 	ETHTOOL_PSE_EVENT_OVER_TEMP =		1 << 1,
+	ETHTOOL_C33_PSE_EVENT_DETECTION =	1 << 2,
+	ETHTOOL_C33_PSE_EVENT_CLASSIFICATION =	1 << 3,
+	ETHTOOL_C33_PSE_EVENT_DISCONNECTION =	1 << 4,
+	ETHTOOL_PSE_EVENT_OVER_BUDGET =		1 << 5,
+	ETHTOOL_PSE_EVENT_SW_PW_CONTROL_ERROR =	1 << 6,
 
-	ETHTOOL_PSE_EVENT_LAST = ETHTOOL_PSE_EVENT_OVER_TEMP,
+	ETHTOOL_PSE_EVENT_LAST = ETHTOOL_PSE_EVENT_SW_PW_CONTROL_ERROR,
+};
+
+/**
+ * enum ethtool_pse_budget_eval_strategies - PSE budget evaluation strategies.
+ * @ETHTOOL_PSE_BUDGET_EVAL_STRAT_DISABLED: Budget evaluation strategy disabled.
+ * @ETHTOOL_PSE_BUDGET_EVAL_STRAT_STATIC: PSE static budget evaluation strategy.
+ *	Budget evaluation strategy based on the power requested during PD
+ *	classification. This strategy is managed by the PSE core.
+ * @ETHTOOL_PSE_BUDGET_EVAL_STRAT_DYNAMIC: PSE dynamic budget evaluation
+ *	strategy. Budget evaluation strategy based on the current consumption
+ *	per ports compared to the total	power budget. This mode is managed by
+ *	the PSE controller.
+ */
+
+enum ethtool_pse_budget_eval_strategies {
+	ETHTOOL_PSE_BUDGET_EVAL_STRAT_DISABLED	= 1 << 0,
+	ETHTOOL_PSE_BUDGET_EVAL_STRAT_STATIC	= 1 << 1,
+	ETHTOOL_PSE_BUDGET_EVAL_STRAT_DYNAMIC	= 1 << 2,
 };
 
 /**

-- 
2.34.1

Re: [PATCH net-next v6 06/12] net: pse-pd: Add support for budget evaluation strategies
Posted by Oleksij Rempel 10 months, 4 weeks ago
On Tue, Mar 04, 2025 at 11:18:55AM +0100, Kory Maincent wrote:
> From: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
> +/**
> + * pse_disable_pi_prio - Disable all PIs of a given priority inside a PSE
> + *			 power domain
> + * @pcdev: a pointer to the PSE
> + * @pw_d: a pointer to the PSE power domain
> + * @prio: priority
> + *
> + * Return: 0 on success and failure value on error
> + */
> +static int pse_disable_pi_prio(struct pse_controller_dev *pcdev,
> +			       struct pse_power_domain *pw_d,
> +			       int prio)
> +{
> +	int i;
> +

Should we lock the pi[] array at some level?

> +	for (i = 0; i < pcdev->nr_lines; i++) {
> +		int ret;
> +
> +		if (pcdev->pi[i].prio != prio ||
> +		    pcdev->pi[i].pw_d != pw_d ||
> +		    !pcdev->pi[i].admin_state_enabled)
> +			continue;
> +
> +		ret = pse_disable_pi_pol(pcdev, i);

If the PSE has many lower-priority ports, the same set of ports could be
repeatedly shut down while higher-priority ports keep power
indefinitely.

This could result in a starvation issue, where lower-priority group of
ports may never get a chance to stay enabled, even if power briefly
becomes available.

To fix this, we could:
- Disallow identical priorities in static mode to ensure a clear
shutdown order.
- Modify pse_disable_pi_prio() to track freed power and stop
disabling once enough power is recovered.

static int pse_disable_pi_prio(struct pse_controller_dev *pcdev,
                               struct pse_power_domain *pw_d,
                               int prio, int required_power)
{
    int i, ret;
    int freed_power = 0;

    mutex_lock(&pcdev->lock);

    for (i = 0; i < pcdev->nr_lines; i++) {
        if (pcdev->pi[i].prio != prio ||
            pcdev->pi[i].pw_d != pw_d ||
            !pcdev->pi[i].admin_state_enabled)
            continue;

        ret = pse_disable_pi_pol(pcdev, i);
        if (ret == 0)
            freed_power += pcdev->pi[i].pw_allocated_mW;

        /* Stop once we have freed enough power */
        if (freed_power >= required_power)
            break;
    }

    mutex_unlock(&pcdev->lock);
    return ret;
}

The current approach still introduces an implicit priority based on loop
execution order, but since it's predictable, it can serve as a reasonable
default.

> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}

....

> +/**
> + * pse_ethtool_set_prio - Set PSE PI priority according to the budget
> + *			  evaluation strategy
> + * @psec: PSE control pointer
> + * @extack: extack for reporting useful error messages
> + * @prio: priovity value
> + *
> + * Return: 0 on success and failure value on error
> + */
> +int pse_ethtool_set_prio(struct pse_control *psec,
> +			 struct netlink_ext_ack *extack,
> +			 unsigned int prio)
> +{
> +	struct pse_controller_dev *pcdev = psec->pcdev;
> +	const struct pse_controller_ops *ops;
> +	int ret = 0;
> +
> +	if (!pcdev->pi[psec->id].pw_d) {
> +		NL_SET_ERR_MSG(extack, "no power domain attached");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	/* We don't want priority change in the middle of an
> +	 * enable/disable call or a priority mode change
> +	 */
> +	mutex_lock(&pcdev->lock);
> +	switch (pcdev->pi[psec->id].pw_d->budget_eval_strategy) {
> +	case ETHTOOL_PSE_BUDGET_EVAL_STRAT_STATIC:
> +		if (prio > pcdev->nr_lines) {
> +			NL_SET_ERR_MSG_FMT(extack,
> +					   "priority %d exceed priority max %d",
> +					   prio, pcdev->nr_lines);
> +			ret = -ERANGE;
> +			goto out;
> +		}
> +
> +		pcdev->pi[psec->id].prio = prio;

In case we already out of the budget, we will need to re-evaluate the
prios. New configuration may affect state of ports.

Potentially we may need a bulk interface to assign prios, to speed-up
reconfiguration. But it is not needed right now.

> +		break;
> +
> +	case ETHTOOL_PSE_BUDGET_EVAL_STRAT_DYNAMIC:
> +		ops = psec->pcdev->ops;
> +		if (!ops->pi_set_prio) {
> +			NL_SET_ERR_MSG(extack,
> +				       "pse driver does not support setting port priority");
> +			ret = -EOPNOTSUPP;
> +			goto out;
> +		}
> +
> +		if (prio > pcdev->pis_prio_max) {
> +			NL_SET_ERR_MSG_FMT(extack,
> +					   "priority %d exceed priority max %d",
> +					   prio, pcdev->pis_prio_max);
> +			ret = -ERANGE;
> +			goto out;
> +		}
> +
> +		ret = ops->pi_set_prio(pcdev, psec->id, prio);

Here too, but in case of microchip PSE it will happen in the firmware.
May be add here a comment that currently it is done in firmware and to
be extended for the kernel based implementation.

> +		break;
> +
> +	default:
> +		ret = -EOPNOTSUPP;
> +	}
> +
> +out:
> +	mutex_unlock(&pcdev->lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pse_ethtool_set_prio);
-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
Re: [PATCH net-next v6 06/12] net: pse-pd: Add support for budget evaluation strategies
Posted by Kory Maincent 10 months, 3 weeks ago
On Mon, 17 Mar 2025 13:40:45 +0100
Oleksij Rempel <o.rempel@pengutronix.de> wrote:

> On Tue, Mar 04, 2025 at 11:18:55AM +0100, Kory Maincent wrote:
> > From: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
> > +/**
> > + * pse_disable_pi_prio - Disable all PIs of a given priority inside a PSE
> > + *			 power domain
> > + * @pcdev: a pointer to the PSE
> > + * @pw_d: a pointer to the PSE power domain
> > + * @prio: priority
> > + *
> > + * Return: 0 on success and failure value on error
> > + */
> > +static int pse_disable_pi_prio(struct pse_controller_dev *pcdev,
> > +			       struct pse_power_domain *pw_d,
> > +			       int prio)
> > +{
> > +	int i;
> > +  
> 
> Should we lock the pi[] array at some level?

Lock the pi array?
pcdev is already locked in the irq thread handler. 
 
> > +	for (i = 0; i < pcdev->nr_lines; i++) {
> > +		int ret;
> > +
> > +		if (pcdev->pi[i].prio != prio ||
> > +		    pcdev->pi[i].pw_d != pw_d ||
> > +		    !pcdev->pi[i].admin_state_enabled)
> > +			continue;
> > +
> > +		ret = pse_disable_pi_pol(pcdev, i);  
> 
> If the PSE has many lower-priority ports, the same set of ports could be
> repeatedly shut down while higher-priority ports keep power
> indefinitely.

That is the point of priority. It is not a problem as there is as many priority
value as ports. Disabling all the ports configured with a specific priority is
understandable.

> This could result in a starvation issue, where lower-priority group of
> ports may never get a chance to stay enabled, even if power briefly
> becomes available.
> 
> To fix this, we could:
> - Disallow identical priorities in static mode to ensure a clear
> shutdown order.

We can't do this as this imply priority shutdown method is always used and
can't be disabled.
Having all the ports to the same priority means that the priority strategy is
not used and if new PD is plugged and exceed the budget it simply won't be
powered.

> > +int pse_ethtool_set_prio(struct pse_control *psec,
> > +			 struct netlink_ext_ack *extack,
> > +			 unsigned int prio)
> > +{
> > +	struct pse_controller_dev *pcdev = psec->pcdev;
> > +	const struct pse_controller_ops *ops;
> > +	int ret = 0;
> > +
> > +	if (!pcdev->pi[psec->id].pw_d) {
> > +		NL_SET_ERR_MSG(extack, "no power domain attached");
> > +		return -EOPNOTSUPP;
> > +	}
> > +
> > +	/* We don't want priority change in the middle of an
> > +	 * enable/disable call or a priority mode change
> > +	 */
> > +	mutex_lock(&pcdev->lock);
> > +	switch (pcdev->pi[psec->id].pw_d->budget_eval_strategy) {
> > +	case ETHTOOL_PSE_BUDGET_EVAL_STRAT_STATIC:
> > +		if (prio > pcdev->nr_lines) {
> > +			NL_SET_ERR_MSG_FMT(extack,
> > +					   "priority %d exceed priority
> > max %d",
> > +					   prio, pcdev->nr_lines);
> > +			ret = -ERANGE;
> > +			goto out;
> > +		}
> > +
> > +		pcdev->pi[psec->id].prio = prio;  
> 
> In case we already out of the budget, we will need to re-evaluate the
> prios. New configuration may affect state of ports.
>
> Potentially we may need a bulk interface to assign prios, to speed-up
> reconfiguration. But it is not needed right now.

Oh indeed, I missed that. /o\
I will try to add something but lets keep it not too complex! ^^ Don't add me
more work!! ;)

> > +		break;
> > +
> > +	case ETHTOOL_PSE_BUDGET_EVAL_STRAT_DYNAMIC:
> > +		ops = psec->pcdev->ops;
> > +		if (!ops->pi_set_prio) {
> > +			NL_SET_ERR_MSG(extack,
> > +				       "pse driver does not support
> > setting port priority");
> > +			ret = -EOPNOTSUPP;
> > +			goto out;
> > +		}
> > +
> > +		if (prio > pcdev->pis_prio_max) {
> > +			NL_SET_ERR_MSG_FMT(extack,
> > +					   "priority %d exceed priority
> > max %d",
> > +					   prio, pcdev->pis_prio_max);
> > +			ret = -ERANGE;
> > +			goto out;
> > +		}
> > +
> > +		ret = ops->pi_set_prio(pcdev, psec->id, prio);  
> 
> Here too, but in case of microchip PSE it will happen in the firmware.
> May be add here a comment that currently it is done in firmware and to
> be extended for the kernel based implementation.

Ack, I will add a comment.

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
Re: [PATCH net-next v6 06/12] net: pse-pd: Add support for budget evaluation strategies
Posted by Kory Maincent 10 months, 3 weeks ago
Hello Kyle, Oleksij,

On Thu, 20 Mar 2025 17:35:35 +0100
Kory Maincent <kory.maincent@bootlin.com> wrote:

> On Mon, 17 Mar 2025 13:40:45 +0100
> Oleksij Rempel <o.rempel@pengutronix.de> wrote:
> 
> > On Tue, Mar 04, 2025 at 11:18:55AM +0100, Kory Maincent wrote:  
> > > From: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>

> > > +int pse_ethtool_set_prio(struct pse_control *psec,
> > > +			 struct netlink_ext_ack *extack,
> > > +			 unsigned int prio)
> > > +{
> > > +	struct pse_controller_dev *pcdev = psec->pcdev;
> > > +	const struct pse_controller_ops *ops;
> > > +	int ret = 0;
> > > +
> > > +	if (!pcdev->pi[psec->id].pw_d) {
> > > +		NL_SET_ERR_MSG(extack, "no power domain attached");
> > > +		return -EOPNOTSUPP;
> > > +	}
> > > +
> > > +	/* We don't want priority change in the middle of an
> > > +	 * enable/disable call or a priority mode change
> > > +	 */
> > > +	mutex_lock(&pcdev->lock);
> > > +	switch (pcdev->pi[psec->id].pw_d->budget_eval_strategy) {
> > > +	case ETHTOOL_PSE_BUDGET_EVAL_STRAT_STATIC:
> > > +		if (prio > pcdev->nr_lines) {
> > > +			NL_SET_ERR_MSG_FMT(extack,
> > > +					   "priority %d exceed priority
> > > max %d",
> > > +					   prio, pcdev->nr_lines);
> > > +			ret = -ERANGE;
> > > +			goto out;
> > > +		}
> > > +
> > > +		pcdev->pi[psec->id].prio = prio;    
> > 
> > In case we already out of the budget, we will need to re-evaluate the
> > prios. New configuration may affect state of ports.
> >
> > Potentially we may need a bulk interface to assign prios, to speed-up
> > reconfiguration. But it is not needed right now.  
> 
> Oh indeed, I missed that. /o\
> I will try to add something but lets keep it not too complex! ^^ Don't add me
> more work!! ;)

Small question on PSE core behavior for PoE users.

If we want to enable a port but we can't due to over budget.
Should we :
- Report an error (or not) and save the enable action from userspace. On that
  case, if enough budget is available later due to priority change or port
  disconnected the PSE core will try automatically to re enable the PoE port.
  The port will then be enabled without any action from the user.
- Report an error but do nothing. The user will need to rerun the enable
  command later to try to enable the port again.

How is it currently managed in PoE poprietary userspace tools?

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
Re: [PATCH net-next v6 06/12] net: pse-pd: Add support for budget evaluation strategies
Posted by Kyle Swenson 10 months, 3 weeks ago
Hello Kory,

On Mon, Mar 24, 2025 at 05:39:07PM +0100, Kory Maincent wrote:
> Hello Kyle, Oleksij,
...
> 
> Small question on PSE core behavior for PoE users.
> 
> If we want to enable a port but we can't due to over budget.
> Should we :
> - Report an error (or not) and save the enable action from userspace. On that
>   case, if enough budget is available later due to priority change or port
>   disconnected the PSE core will try automatically to re enable the PoE port.
>   The port will then be enabled without any action from the user.
> - Report an error but do nothing. The user will need to rerun the enable
>   command later to try to enable the port again.
> 
> How is it currently managed in PoE poprietary userspace tools?

So in our implementation, we're using the first option you've presented.
That is, we save the enable action from the user and if we can't power
the device due to insufficient budget remaining, we'll indicate that status to the
user.  If enough power budget becomes available later, we'll power up
the device automatically.

Hope this helps,
Kyle
Re: [PATCH net-next v6 06/12] net: pse-pd: Add support for budget evaluation strategies
Posted by Oleksij Rempel 10 months, 3 weeks ago
Hi,

On Mon, Mar 24, 2025 at 05:33:18PM +0000, Kyle Swenson wrote:
> Hello Kory,
> 
> On Mon, Mar 24, 2025 at 05:39:07PM +0100, Kory Maincent wrote:
> > Hello Kyle, Oleksij,
> ...
> > 
> > Small question on PSE core behavior for PoE users.
> > 
> > If we want to enable a port but we can't due to over budget.
> > Should we :
> > - Report an error (or not) and save the enable action from userspace. On that
> >   case, if enough budget is available later due to priority change or port
> >   disconnected the PSE core will try automatically to re enable the PoE port.
> >   The port will then be enabled without any action from the user.
> > - Report an error but do nothing. The user will need to rerun the enable
> >   command later to try to enable the port again.
> > 
> > How is it currently managed in PoE poprietary userspace tools?
> 
> So in our implementation, we're using the first option you've presented.
> That is, we save the enable action from the user and if we can't power
> the device due to insufficient budget remaining, we'll indicate that status to the
> user.  If enough power budget becomes available later, we'll power up
> the device automatically.

It seems to be similar to administrative UP state - "ip link set dev lan1 up".
I'm ok with this behavior.

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
Re: [PATCH net-next v6 06/12] net: pse-pd: Add support for budget evaluation strategies
Posted by Kory Maincent 10 months, 3 weeks ago
On Tue, 25 Mar 2025 06:34:17 +0100
Oleksij Rempel <o.rempel@pengutronix.de> wrote:

> Hi,
> 
> On Mon, Mar 24, 2025 at 05:33:18PM +0000, Kyle Swenson wrote:
> > Hello Kory,
> > 
> > On Mon, Mar 24, 2025 at 05:39:07PM +0100, Kory Maincent wrote:  
> > > Hello Kyle, Oleksij,  
> > ...  
> > > 
> > > Small question on PSE core behavior for PoE users.
> > > 
> > > If we want to enable a port but we can't due to over budget.
> > > Should we :
> > > - Report an error (or not) and save the enable action from userspace. On
> > > that case, if enough budget is available later due to priority change or
> > > port disconnected the PSE core will try automatically to re enable the
> > > PoE port. The port will then be enabled without any action from the user.
> > > - Report an error but do nothing. The user will need to rerun the enable
> > >   command later to try to enable the port again.
> > > 
> > > How is it currently managed in PoE poprietary userspace tools?  
> > 
> > So in our implementation, we're using the first option you've presented.
> > That is, we save the enable action from the user and if we can't power
> > the device due to insufficient budget remaining, we'll indicate that status
> > to the user.  If enough power budget becomes available later, we'll power up
> > the device automatically.  
> 
> It seems to be similar to administrative UP state - "ip link set dev lan1 up".
> I'm ok with this behavior.

Ack I will go for it then, thank you!

Other question to both of you:
If we configure manually the current limit for a port. Then we plug a Powered
Device and we detect (during the classification) a smaller current limit
supported. Should we change the current limit to the one detected. On that case
we should not let the user set a power limit greater than the one detected after
the PD has been plugged.

What do you think? Could we let a user burn a PD?

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
Re: [PATCH net-next v6 06/12] net: pse-pd: Add support for budget evaluation strategies
Posted by Kyle Swenson 10 months, 3 weeks ago
Hello Kory,

On Tue, Mar 25, 2025 at 04:25:34PM +0100, Kory Maincent wrote:
> On Tue, 25 Mar 2025 06:34:17 +0100
> Oleksij Rempel <o.rempel@pengutronix.de> wrote:
> 
> > Hi,
> > 
> > On Mon, Mar 24, 2025 at 05:33:18PM +0000, Kyle Swenson wrote:
> > > Hello Kory,
> > > 
> > > On Mon, Mar 24, 2025 at 05:39:07PM +0100, Kory Maincent wrote:  
> > > > Hello Kyle, Oleksij,  
> > > ...  
> > > > 
> > > > Small question on PSE core behavior for PoE users.
> > > > 
> > > > If we want to enable a port but we can't due to over budget.
> > > > Should we :
> > > > - Report an error (or not) and save the enable action from userspace. On
> > > > that case, if enough budget is available later due to priority change or
> > > > port disconnected the PSE core will try automatically to re enable the
> > > > PoE port. The port will then be enabled without any action from the user.
> > > > - Report an error but do nothing. The user will need to rerun the enable
> > > >   command later to try to enable the port again.
> > > > 
> > > > How is it currently managed in PoE poprietary userspace tools?  
> > > 
> > > So in our implementation, we're using the first option you've presented.
> > > That is, we save the enable action from the user and if we can't power
> > > the device due to insufficient budget remaining, we'll indicate that status
> > > to the user.  If enough power budget becomes available later, we'll power up
> > > the device automatically.  
> > 
> > It seems to be similar to administrative UP state - "ip link set dev lan1 up".
> > I'm ok with this behavior.
> 
> Ack I will go for it then, thank you!
> 
> Other question to both of you:
> If we configure manually the current limit for a port. Then we plug a Powered
> Device and we detect (during the classification) a smaller current limit
> supported. Should we change the current limit to the one detected. On that case
> we should not let the user set a power limit greater than the one detected after
> the PD has been plugged.

I don't know that we want to prevent the user from setting a higher
current than a device's classification current because that would
prevent the PD and PSE negotiating a higher current via LLDP.

That said, I'm struggling to think of a use-case where the user would be
setting a current limit before a PD is connected, so maybe we can reset
the current limit when the PD is classified to the classification
result, but also allow it to be adjusted after a PD is powered for the
LLDP negotiation case.

In our implementation, don't really let the user specify something like,
"Only class 3 and lower devices on this port" because we've not seen
customers need this.  We have, however, implemented the LLDP negotiation
support after several requests from customers, but this only makes sense
when a PD is powered at it's initial classification result.  The PD can
then request more power (via LLDP) and then we adjust the current limit
assuming the system has budget available for the request.

> 
> What do you think? Could we let a user burn a PD?

This seems like a very rare case, and if the PD is designed such that
it's reliant on the PSE's current limiting ability then seems like it's
just an accident waiting to happen with any PSE.

Very rarely have we seen a device actually pull more current than it's
classification result allows (except for LLDP negotiation). What's more
likely is a dual-channel 802.3bt device is incorrectly classified as a
single-channel 802.3at device; the device pulls more current than
allocated and gets shut off promptly, but no magic smoke escaped.  

Hope this helps!
Kyle
Re: [PATCH net-next v6 06/12] net: pse-pd: Add support for budget evaluation strategies
Posted by Oleksij Rempel 10 months, 2 weeks ago
Hi folks,

On Tue, Mar 25, 2025 at 08:40:54PM +0000, Kyle Swenson wrote:
> Hello Kory,
> 
> On Tue, Mar 25, 2025 at 04:25:34PM +0100, Kory Maincent wrote:
> > On Tue, 25 Mar 2025 06:34:17 +0100
> > Oleksij Rempel <o.rempel@pengutronix.de> wrote:
> > 
> > > Hi,
> > > 
> > > On Mon, Mar 24, 2025 at 05:33:18PM +0000, Kyle Swenson wrote:
> > > > Hello Kory,
> > > > 
> > > > On Mon, Mar 24, 2025 at 05:39:07PM +0100, Kory Maincent wrote:  
> > > > > Hello Kyle, Oleksij,  
> > > > ...  
> > > > > 
> > > > > Small question on PSE core behavior for PoE users.
> > > > > 
> > > > > If we want to enable a port but we can't due to over budget.
> > > > > Should we :
> > > > > - Report an error (or not) and save the enable action from userspace. On
> > > > > that case, if enough budget is available later due to priority change or
> > > > > port disconnected the PSE core will try automatically to re enable the
> > > > > PoE port. The port will then be enabled without any action from the user.
> > > > > - Report an error but do nothing. The user will need to rerun the enable
> > > > >   command later to try to enable the port again.
> > > > > 
> > > > > How is it currently managed in PoE poprietary userspace tools?  
> > > > 
> > > > So in our implementation, we're using the first option you've presented.
> > > > That is, we save the enable action from the user and if we can't power
> > > > the device due to insufficient budget remaining, we'll indicate that status
> > > > to the user.  If enough power budget becomes available later, we'll power up
> > > > the device automatically.  
> > > 
> > > It seems to be similar to administrative UP state - "ip link set dev lan1 up".
> > > I'm ok with this behavior.
> > 
> > Ack I will go for it then, thank you!
> > 
> > Other question to both of you:
> > If we configure manually the current limit for a port. Then we plug a Powered
> > Device and we detect (during the classification) a smaller current limit
> > supported. Should we change the current limit to the one detected. On that case
> > we should not let the user set a power limit greater than the one detected after
> > the PD has been plugged.
> 
> I don't know that we want to prevent the user from setting a higher
> current than a device's classification current because that would
> prevent the PD and PSE negotiating a higher current via LLDP.
> 
> That said, I'm struggling to think of a use-case where the user would be
> setting a current limit before a PD is connected, so maybe we can reset
> the current limit when the PD is classified to the classification
> result, but also allow it to be adjusted after a PD is powered for the
> LLDP negotiation case.
> 
> In our implementation, don't really let the user specify something like,
> "Only class 3 and lower devices on this port" because we've not seen
> customers need this.  We have, however, implemented the LLDP negotiation
> support after several requests from customers, but this only makes sense
> when a PD is powered at it's initial classification result.  The PD can
> then request more power (via LLDP) and then we adjust the current limit
> assuming the system has budget available for the request.
> 
> > 
> > What do you think? Could we let a user burn a PD?
> 
> This seems like a very rare case, and if the PD is designed such that
> it's reliant on the PSE's current limiting ability then seems like it's
> just an accident waiting to happen with any PSE.
> 
> Very rarely have we seen a device actually pull more current than it's
> classification result allows (except for LLDP negotiation). What's more
> likely is a dual-channel 802.3bt device is incorrectly classified as a
> single-channel 802.3at device; the device pulls more current than
> allocated and gets shut off promptly, but no magic smoke escaped.  

Here’s my understanding of the use cases described so far, and a proposal for
how we could handle them in the kernel to avoid conflicts between different
actors.

We have multiple components that may affect power delivery:
- The kernel, which reacts to detection and classification
- The admin, who might want to override or restrict power for policy or safety reasons
- The LLDP daemon, which may request more power dynamically based on what the PD asks for

To avoid races and make things more predictable, I think it's best if each
actor has its own dedicated input.

## Use Cases

### Use Case 1: Classification-based power (default behavior)  
- Kernel detects PD and performs classification
- Power is applied according to classification and hardware limits
- No override used

Steps:
1. Detection runs
2. Classification result obtained (e.g. Class 2 → 7W)
3. Kernel computes:

   effective_limit = min(
       classification_result,
       controller_capability,
       board_limit,
       dynamic_budget
   )

4. Power applied up to `effective_limit`

### Use Case 2: Admin-configured upper bound (non-override)  
- Admin sets a policy limit that restricts all power delivery
- Does not override classification, only bounds it

Steps:
1. Admin sets `ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT = 15000`
2. Detection + classification run normally
3. Kernel computes:

   effective_limit = min(
       classification_result,
       AVAIL_PWR_LIMIT,
       controller_capability,
       board_limit,
       dynamic_budget
   )

4. Classification is respected, but never exceeds admin limit

This value is always included in power computation — even if classification
or LLDP overrides are active.

### Use Case 3: Persistent classification override (admin)  
- Admin sets a persistent limit that overrides classification
- Power is always based on this override

Steps:
1. Admin sets `CLASS_OVERRIDE_PERSISTENT = 25000` (mW)
2. Detection/classification may run, but classification result is ignored
3. Kernel computes:

   effective_limit = min(
       CLASS_OVERRIDE_PERSISTENT,
       AVAIL_PWR_LIMIT,
       controller_capability,
       board_limit,
       dynamic_budget
   )

4. Power applied accordingly
5. Override persists until cleared

### Use Case 4: Temporary classification override (LLDP)  
- LLDP daemon overrides classification for current PD session only
- Cleared automatically on PD disconnect

Steps:
1. PD connects, detection + classification runs (e.g. 7W)
2. LLDP daemon receives PD request for 25000 mW
3. LLDP daemon sets `CLASS_OVERRIDE_TEMPORARY = 25000`
4. Kernel computes:

   effective_limit = min(
       CLASS_OVERRIDE_TEMPORARY,
       AVAIL_PWR_LIMIT,
       controller_capability,
       board_limit,
       dynamic_budget
   )

5. Power is increased for this session
6. On PD disconnect, override is cleared automatically

---

### Use Case 5: Ignore detection and classification (force-on)  
- Admin forces the port on, ignoring detection
- Useful for passive/non-802.3 devices or bring-up

Steps:
1. Admin sets:
   - `DETECTION_IGNORE = true`
   - `CLASS_OVERRIDE_PERSISTENT = 5000`
2. Kernel skips detection and classification
3. Kernel computes:

   effective_limit = min(
       CLASS_OVERRIDE_PERSISTENT,
       AVAIL_PWR_LIMIT,
       controller_capability,
       board_limit,
       dynamic_budget
   )

4. Power is applied immediately

## Proposed kernel UAPI

### SET attributes (configuration input)

| Attribute                                 | Type     | Lifetime              | Owner           | Description |
|-------------------------------------------|----------|------------------------|------------------|-------------|
| `ETHTOOL_A_PSE_CLASS_OVERRIDE_PERSISTENT` | u32 (mW) | Until cleared          | Admin            | Persistent classification override |
| `ETHTOOL_A_PSE_CLASS_OVERRIDE_TEMPORARY`  | u32 (mW) | Cleared on detection failure / PD replug | LLDP daemon / test tool | Temporary override of classification |
| `ETHTOOL_A_PSE_DETECTION_IGNORE`          | bool     | Until cleared          | Admin            | Ignore detection phase |
| `ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT`       | u32 (mW) | Until changed          | Admin            | Static admin-defined max power cap (non-override) |

### GET attributes (status and diagnostics)

| Attribute                                  | Type     | Description |
|--------------------------------------------|----------|-------------|
| `ETHTOOL_A_PSE_EFFECTIVE_PWR_LIMIT`        | u32 (mW) | Final power limit applied by kernel |
| `ETHTOOL_A_PSE_CLASS_OVERRIDE_PERSISTENT`  | u32 (mW) | Current persistent override (if set) |
| `ETHTOOL_A_PSE_CLASS_OVERRIDE_TEMPORARY`   | u32 (mW) | Current temporary override (if active) |
| `ETHTOOL_A_PSE_DETECTION_IGNORE`           | bool     | Current detection ignore state |

### Power Limit Priority

Since we now have multiple sources that can influence how much power is
delivered to a PD, we need to define a clear and deterministic priority
order for all these values. This avoids confusion and ensures that the kernel
behaves consistently, even when different actors (e.g. admin, LLDP daemon,
hardware limits) are active at the same time.

Below is the proposed priority list — values higher in the list take precedence
over those below:

| Priority | Source / Field                          | Description |
|----------|------------------------------------------|-------------|
| 1        | Hardware/board-specific limit         | Maximum allowed by controller or board design (e.g. via device tree or driver constraints) |
| 2        | Dynamic power budget                  | Current system-level or PSE-level power availability (shared with other ports) |
| 3        | `ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT`       | Admin-configured upper bound — applies even when classification or override is used |
| 4        | `ETHTOOL_A_PSE_CLASS_OVERRIDE_TEMPORARY`  | Temporary override, e.g. set by LLDP daemon, cleared on PD disconnect or detection loss |
| 5        | `ETHTOOL_A_PSE_CLASS_OVERRIDE_PERSISTENT` | Admin override that persists until cleared |
| 6        | `ETHTOOL_A_PSE_CLASSIFICATION_RESULT`     | Result of PD classification, used when no override is present |

The effective power limit used by the kernel will always be the minimum of the
values above.

This way, even if the LLDP daemon requests more power, or classification result
is high, power delivery will still be constrained by admin policies, hardware
limits, and current budget.

Best regards,  
Oleksij
-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
Re: [PATCH net-next v6 06/12] net: pse-pd: Add support for budget evaluation strategies
Posted by Kory Maincent 10 months, 2 weeks ago
On Wed, 26 Mar 2025 11:01:19 +0100
Oleksij Rempel <o.rempel@pengutronix.de> wrote:

> Hi folks,
> 
> On Tue, Mar 25, 2025 at 08:40:54PM +0000, Kyle Swenson wrote:
> > Hello Kory,
> > 
> > On Tue, Mar 25, 2025 at 04:25:34PM +0100, Kory Maincent wrote:  
> > > On Tue, 25 Mar 2025 06:34:17 +0100
> > > Oleksij Rempel <o.rempel@pengutronix.de> wrote:
> > >   
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
> > > 
> > > Ack I will go for it then, thank you!
> > > 
> > > Other question to both of you:
> > > If we configure manually the current limit for a port. Then we plug a
> > > Powered Device and we detect (during the classification) a smaller
> > > current limit supported. Should we change the current limit to the one
> > > detected. On that case we should not let the user set a power limit
> > > greater than the one detected after the PD has been plugged.  
> > 
> > I don't know that we want to prevent the user from setting a higher
> > current than a device's classification current because that would
> > prevent the PD and PSE negotiating a higher current via LLDP.
> > 
> > That said, I'm struggling to think of a use-case where the user would be
> > setting a current limit before a PD is connected, so maybe we can reset
> > the current limit when the PD is classified to the classification
> > result, but also allow it to be adjusted after a PD is powered for the
> > LLDP negotiation case.
> > 
> > In our implementation, don't really let the user specify something like,
> > "Only class 3 and lower devices on this port" because we've not seen
> > customers need this.  We have, however, implemented the LLDP negotiation
> > support after several requests from customers, but this only makes sense
> > when a PD is powered at it's initial classification result.  The PD can
> > then request more power (via LLDP) and then we adjust the current limit
> > assuming the system has budget available for the request.
> >   
> > > 
> > > What do you think? Could we let a user burn a PD?  
> > 
> > This seems like a very rare case, and if the PD is designed such that
> > it's reliant on the PSE's current limiting ability then seems like it's
> > just an accident waiting to happen with any PSE.
> > 
> > Very rarely have we seen a device actually pull more current than it's
> > classification result allows (except for LLDP negotiation). What's more
> > likely is a dual-channel 802.3bt device is incorrectly classified as a
> > single-channel 802.3at device; the device pulls more current than
> > allocated and gets shut off promptly, but no magic smoke escaped.    
> 
> Here’s my understanding of the use cases described so far, and a proposal for
> how we could handle them in the kernel to avoid conflicts between different
> actors.
> 
> We have multiple components that may affect power delivery:
> - The kernel, which reacts to detection and classification
> - The admin, who might want to override or restrict power for policy or
> safety reasons
> - The LLDP daemon, which may request more power dynamically based on what the
> PD asks for
> 
> To avoid races and make things more predictable, I think it's best if each
> actor has its own dedicated input.
> 
> ## Use Cases
> 
> ### Use Case 1: Classification-based power (default behavior)  
> - Kernel detects PD and performs classification
> - Power is applied according to classification and hardware limits
> - No override used
> 
> Steps:
> 1. Detection runs
> 2. Classification result obtained (e.g. Class 2 → 7W)
> 3. Kernel computes:
> 
>    effective_limit = min(
>        classification_result,
>        controller_capability,
>        board_limit,
>        dynamic_budget
>    )
> 
> 4. Power applied up to `effective_limit`
> 
> ### Use Case 2: Admin-configured upper bound (non-override)  
> - Admin sets a policy limit that restricts all power delivery
> - Does not override classification, only bounds it
> 
> Steps:
> 1. Admin sets `ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT = 15000`
> 2. Detection + classification run normally
> 3. Kernel computes:
> 
>    effective_limit = min(
>        classification_result,
>        AVAIL_PWR_LIMIT,
>        controller_capability,
>        board_limit,
>        dynamic_budget
>    )
> 
> 4. Classification is respected, but never exceeds admin limit
> 
> This value is always included in power computation — even if classification
> or LLDP overrides are active.
> 
> ### Use Case 3: Persistent classification override (admin)  
> - Admin sets a persistent limit that overrides classification
> - Power is always based on this override
> 
> Steps:
> 1. Admin sets `CLASS_OVERRIDE_PERSISTENT = 25000` (mW)
> 2. Detection/classification may run, but classification result is ignored
> 3. Kernel computes:
> 
>    effective_limit = min(
>        CLASS_OVERRIDE_PERSISTENT,
>        AVAIL_PWR_LIMIT,
>        controller_capability,
>        board_limit,
>        dynamic_budget
>    )
> 
> 4. Power applied accordingly
> 5. Override persists until cleared
> 
> ### Use Case 4: Temporary classification override (LLDP)  
> - LLDP daemon overrides classification for current PD session only
> - Cleared automatically on PD disconnect
> 
> Steps:
> 1. PD connects, detection + classification runs (e.g. 7W)
> 2. LLDP daemon receives PD request for 25000 mW
> 3. LLDP daemon sets `CLASS_OVERRIDE_TEMPORARY = 25000`
> 4. Kernel computes:
> 
>    effective_limit = min(
>        CLASS_OVERRIDE_TEMPORARY,
>        AVAIL_PWR_LIMIT,
>        controller_capability,
>        board_limit,
>        dynamic_budget
>    )
> 
> 5. Power is increased for this session
> 6. On PD disconnect, override is cleared automatically
> 
> ---
> 
> ### Use Case 5: Ignore detection and classification (force-on)  
> - Admin forces the port on, ignoring detection
> - Useful for passive/non-802.3 devices or bring-up
> 
> Steps:
> 1. Admin sets:
>    - `DETECTION_IGNORE = true`
>    - `CLASS_OVERRIDE_PERSISTENT = 5000`
> 2. Kernel skips detection and classification
> 3. Kernel computes:
> 
>    effective_limit = min(
>        CLASS_OVERRIDE_PERSISTENT,
>        AVAIL_PWR_LIMIT,
>        controller_capability,
>        board_limit,
>        dynamic_budget
>    )
> 
> 4. Power is applied immediately
> 
> ## Proposed kernel UAPI
> 
> ### SET attributes (configuration input)
> 
> | Attribute                                 | Type     | Lifetime
>  | Owner           | Description |
> |-------------------------------------------|----------|------------------------|------------------|-------------|
> | `ETHTOOL_A_PSE_CLASS_OVERRIDE_PERSISTENT` | u32 (mW) | Until cleared
>   | Admin            | Persistent classification override | |
> `ETHTOOL_A_PSE_CLASS_OVERRIDE_TEMPORARY`  | u32 (mW) | Cleared on detection
> failure / PD replug | LLDP daemon / test tool | Temporary override of
> classification | | `ETHTOOL_A_PSE_DETECTION_IGNORE`          | bool     |
> Until cleared          | Admin            | Ignore detection phase | |
> `ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT`       | u32 (mW) | Until changed
> | Admin            | Static admin-defined max power cap (non-override) |
> 
> ### GET attributes (status and diagnostics)
> 
> | Attribute                                  | Type     | Description |
> |--------------------------------------------|----------|-------------|
> | `ETHTOOL_A_PSE_EFFECTIVE_PWR_LIMIT`        | u32 (mW) | Final power limit
> applied by kernel | | `ETHTOOL_A_PSE_CLASS_OVERRIDE_PERSISTENT`  | u32 (mW) |
> Current persistent override (if set) | |
> `ETHTOOL_A_PSE_CLASS_OVERRIDE_TEMPORARY`   | u32 (mW) | Current temporary
> override (if active) | | `ETHTOOL_A_PSE_DETECTION_IGNORE`           | bool
>  | Current detection ignore state |
> 
> ### Power Limit Priority
> 
> Since we now have multiple sources that can influence how much power is
> delivered to a PD, we need to define a clear and deterministic priority
> order for all these values. This avoids confusion and ensures that the kernel
> behaves consistently, even when different actors (e.g. admin, LLDP daemon,
> hardware limits) are active at the same time.
> 
> Below is the proposed priority list — values higher in the list take
> precedence over those below:
> 
> | Priority | Source / Field                          | Description |
> |----------|------------------------------------------|-------------|
> | 1        | Hardware/board-specific limit         | Maximum allowed by
> controller or board design (e.g. via device tree or driver constraints) | | 2
>        | Dynamic power budget                  | Current system-level or
> PSE-level power availability (shared with other ports) | | 3        |
> `ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT`       | Admin-configured upper bound —
> applies even when classification or override is used | | 4        |
> `ETHTOOL_A_PSE_CLASS_OVERRIDE_TEMPORARY`  | Temporary override, e.g. set by
> LLDP daemon, cleared on PD disconnect or detection loss | | 5        |
> `ETHTOOL_A_PSE_CLASS_OVERRIDE_PERSISTENT` | Admin override that persists
> until cleared | | 6        | `ETHTOOL_A_PSE_CLASSIFICATION_RESULT`     |
> Result of PD classification, used when no override is present |
> 
> The effective power limit used by the kernel will always be the minimum of the
> values above.
> 
> This way, even if the LLDP daemon requests more power, or classification
> result is high, power delivery will still be constrained by admin policies,
> hardware limits, and current budget.

Kyle thanks for your PoE user side answer!
Oleksij, thanks, as usual you have done an intense brainstorm! ^^
These two replies could be helpful in the future.

I asked this because I found out that the over current event of the TPS23881
was resetting the power limit register of the port. I think I will simply
fix it by reconfiguring the power limit if this event happens.
So it won't change the current behavior where the user setting of power limit
prevail over the power limit detected during classification.

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
Re: [PATCH net-next v6 06/12] net: pse-pd: Add support for budget evaluation strategies
Posted by Jakub Kicinski 11 months, 1 week ago
On Tue, 04 Mar 2025 11:18:55 +0100 Kory Maincent wrote:
> +/**
> + * enum ethtool_pse_budget_eval_strategies - PSE budget evaluation strategies.
> + * @ETHTOOL_PSE_BUDGET_EVAL_STRAT_DISABLED: Budget evaluation strategy disabled.
> + * @ETHTOOL_PSE_BUDGET_EVAL_STRAT_STATIC: PSE static budget evaluation strategy.
> + *	Budget evaluation strategy based on the power requested during PD
> + *	classification. This strategy is managed by the PSE core.
> + * @ETHTOOL_PSE_BUDGET_EVAL_STRAT_DYNAMIC: PSE dynamic budget evaluation
> + *	strategy. Budget evaluation strategy based on the current consumption
> + *	per ports compared to the total	power budget. This mode is managed by
> + *	the PSE controller.
> + */
> +
> +enum ethtool_pse_budget_eval_strategies {
> +	ETHTOOL_PSE_BUDGET_EVAL_STRAT_DISABLED	= 1 << 0,
> +	ETHTOOL_PSE_BUDGET_EVAL_STRAT_STATIC	= 1 << 1,
> +	ETHTOOL_PSE_BUDGET_EVAL_STRAT_DYNAMIC	= 1 << 2,
>  };

Leftover?
-- 
pw-bot: cr
Re: [PATCH net-next v6 06/12] net: pse-pd: Add support for budget evaluation strategies
Posted by Kory Maincent 11 months, 1 week ago
On Thu, 6 Mar 2025 17:46:19 -0800
Jakub Kicinski <kuba@kernel.org> wrote:

> On Tue, 04 Mar 2025 11:18:55 +0100 Kory Maincent wrote:
> > +/**
> > + * enum ethtool_pse_budget_eval_strategies - PSE budget evaluation
> > strategies.
> > + * @ETHTOOL_PSE_BUDGET_EVAL_STRAT_DISABLED: Budget evaluation strategy
> > disabled.
> > + * @ETHTOOL_PSE_BUDGET_EVAL_STRAT_STATIC: PSE static budget evaluation
> > strategy.
> > + *	Budget evaluation strategy based on the power requested during PD
> > + *	classification. This strategy is managed by the PSE core.
> > + * @ETHTOOL_PSE_BUDGET_EVAL_STRAT_DYNAMIC: PSE dynamic budget evaluation
> > + *	strategy. Budget evaluation strategy based on the current
> > consumption
> > + *	per ports compared to the total	power budget. This mode
> > is managed by
> > + *	the PSE controller.
> > + */
> > +
> > +enum ethtool_pse_budget_eval_strategies {
> > +	ETHTOOL_PSE_BUDGET_EVAL_STRAT_DISABLED	= 1 << 0,
> > +	ETHTOOL_PSE_BUDGET_EVAL_STRAT_STATIC	= 1 << 1,
> > +	ETHTOOL_PSE_BUDGET_EVAL_STRAT_DYNAMIC	= 1 << 2,
> >  };  
> 
> Leftover?

We still need these to know the PSE method but they shall not be in the uAPI
for now. I will move it out of it.

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com