[PATCH 3/3] EDAC/igen6: Add polling support

Orange Kao posted 3 patches 2 weeks, 6 days ago
Posted by Orange Kao 2 weeks, 6 days ago
On some PCs with an Intel N100 (PCI device 8086:461c, DID_ADL_N_SKU4),
error interrupts do not work, even with the following BIOS
configuration:

    In-Band ECC Support: Enabled
    In-Band ECC Operation Mode: 2 (make all requests protected and
                                   ignore range checks)
    IBECC Error Injection Control: Inject Correctable Error on insertion
                                   counter
    Error Injection Insertion Count: 251658240 (0xf000000)

Add polling mode support for these machines to ensure that memory error
events are handled.

Signed-off-by: Orange Kao <orange@aiven.io>
---
 drivers/edac/igen6_edac.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/drivers/edac/igen6_edac.c b/drivers/edac/igen6_edac.c
index fa488ba15059..eb783c6b77f1 100644
--- a/drivers/edac/igen6_edac.c
+++ b/drivers/edac/igen6_edac.c
@@ -1170,6 +1170,20 @@ static int igen6_pci_setup(struct pci_dev *pdev, u64 *mchbar)
 	return -ENODEV;
 }
 
+static void igen6_check(struct mem_ctl_info *mci)
+{
+	struct igen6_imc *imc = mci->pvt_info;
+	u64 ecclog;
+
+	/* errsts_clear() isn't NMI-safe. Delay it in the IRQ context */
+	ecclog = ecclog_read_and_clear(imc);
+	if (!ecclog)
+		return;
+
+	if (!ecclog_gen_pool_add(imc->mc, ecclog))
+		irq_work_queue(&ecclog_irq_work);
+}
+
 static int igen6_register_mci(int mc, u64 mchbar, struct pci_dev *pdev)
 {
 	struct edac_mc_layer layers[2];
@@ -1211,6 +1225,8 @@ static int igen6_register_mci(int mc, u64 mchbar, struct pci_dev *pdev)
 	mci->edac_cap = EDAC_FLAG_SECDED;
 	mci->mod_name = EDAC_MOD_STR;
 	mci->dev_name = pci_name(pdev);
+	if (edac_op_state == EDAC_OPSTATE_POLL)
+		mci->edac_check = igen6_check;
 	mci->pvt_info = &igen6_pvt->imc[mc];
 
 	imc = mci->pvt_info;
@@ -1352,6 +1368,10 @@ static void unregister_err_handler(void)
 
 static void opstate_set(struct res_config *cfg)
 {
+	/* Only the polling mode can be set via the module parameter. */
+	if (edac_op_state == EDAC_OPSTATE_POLL)
+		return;
+
 	/* Set the mode according to the configuration data. */
 	if (cfg->machine_check)
 		edac_op_state = EDAC_OPSTATE_INT;
@@ -1483,3 +1503,6 @@ module_exit(igen6_exit);
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Qiuxu Zhuo");
 MODULE_DESCRIPTION("MC Driver for Intel client SoC using In-Band ECC");
+
+module_param(edac_op_state, int, 0444);
+MODULE_PARM_DESC(edac_op_state, "EDAC Error Reporting state: 0=Poll, Others or default=Auto detect");
-- 
2.47.0
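
The polling step that igen6_check() performs in the patch above (read and
clear the error log, return if nothing was logged, otherwise queue the entry
for deferred handling) can be sketched in user space. This is only an
illustrative model, not driver code: struct fake_imc, struct log_queue,
fake_read_and_clear(), queue_add(), poll_once() and QUEUE_LEN are all
hypothetical names, and the simulated register is cleared in full, which is
an assumed simplification of whatever the real ecclog_read_and_clear() does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_LEN 8	/* hypothetical capacity of the pending-log queue */

/* Hypothetical stand-in for the memory controller's error log register. */
struct fake_imc {
	uint64_t ecclog_reg;
};

/* Hypothetical stand-in for the pool of logs awaiting deferred handling. */
struct log_queue {
	uint64_t entries[QUEUE_LEN];
	size_t count;
};

/* Return the logged value and reset the simulated register to zero. */
static uint64_t fake_read_and_clear(struct fake_imc *imc)
{
	uint64_t val = imc->ecclog_reg;

	imc->ecclog_reg = 0;
	return val;
}

/* Return 0 on success, -1 if the queue is full. */
static int queue_add(struct log_queue *q, uint64_t ecclog)
{
	if (q->count == QUEUE_LEN)
		return -1;

	q->entries[q->count++] = ecclog;
	return 0;
}

/*
 * One polling pass. Returns 1 if an error was found and queued (the point
 * at which the patch calls irq_work_queue()), 0 otherwise.
 */
static int poll_once(struct fake_imc *imc, struct log_queue *q)
{
	uint64_t ecclog = fake_read_and_clear(imc);

	if (!ecclog)
		return 0;

	return queue_add(q, ecclog) == 0;
}
```

Calling poll_once() periodically on an idle controller is cheap: it reads
the register, sees zero, and returns without touching the queue.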
Re: [PATCH 3/3] EDAC/igen6: Add polling support
Posted by Borislav Petkov 2 weeks, 5 days ago
On Mon, Nov 04, 2024 at 12:40:54PM +0000, Orange Kao wrote:
> +module_param(edac_op_state, int, 0444);
> +MODULE_PARM_DESC(edac_op_state, "EDAC Error Reporting state: 0=Poll, Others or default=Auto detect");

Why is this module parameter here instead of detecting those broken machines
and enabling polling on them by default and automatically?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
RE: [PATCH 3/3] EDAC/igen6: Add polling support
Posted by Zhuo, Qiuxu 2 weeks, 5 days ago
> From: Borislav Petkov <bp@alien8.de>
> [...]
> On Mon, Nov 04, 2024 at 12:40:54PM +0000, Orange Kao wrote:
> > +module_param(edac_op_state, int, 0444);
> > +MODULE_PARM_DESC(edac_op_state, "EDAC Error Reporting state: 0=Poll, Others or default=Auto detect");
> 
> Why is this module parameter here instead of detecting those broken
> machines and enabling polling on them by default and automatically?

Good suggestion. Thanks, Boris. 

@Orange Kao,
As per Boris' suggestion, please make polling mode the default on those broken
machines, so that userspace does not have to configure it.

1) A small update to your current patch, as shown below for your reference. 

static void opstate_set(struct res_config *cfg, const struct pci_device_id *ent)
{
        /*
         * Quirk: Certain SoCs' error reporting interrupts don't work.
         *        Force polling mode for them to ensure that memory error
         *        events can be handled.
         */
        if (ent->device == DID_ADL_N_SKU4) {
                edac_op_state = EDAC_OPSTATE_POLL;
                return;
        }

        /* Set the mode according to the configuration data. */
        if (cfg->machine_check)
                edac_op_state = EDAC_OPSTATE_INT;
        else
                edac_op_state = EDAC_OPSTATE_NMI;
}

2) The call site is updated accordingly:
      ...
      opstate_set(res_cfg, ent);
      ...

3) Also, the following 2 lines are no longer needed in this patch.
    
     module_param(edac_op_state, int, 0444);
     MODULE_PARM_DESC(edac_op_state, "EDAC Error Reporting state: 0=Poll, Others or default=Auto detect");
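
For reference, the selection logic in 1) can be checked outside the kernel
with a small user-space model. This is a rough sketch under stated
assumptions: the structs are cut-down stand-ins for the kernel types, the
EDAC_OPSTATE_* values are assumed to match include/linux/edac.h (the
0=Poll value is the one given in the parameter description), the device ID
0x461c is taken from the commit message, and mode_for() is a hypothetical
helper that exists only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in values, assumed to mirror include/linux/edac.h. */
enum {
	EDAC_OPSTATE_INVAL = -1,
	EDAC_OPSTATE_POLL = 0,
	EDAC_OPSTATE_NMI = 1,
	EDAC_OPSTATE_INT = 2,
};

/* From the commit message: PCI device 8086:461c. */
#define DID_ADL_N_SKU4 0x461c

/* Cut-down stand-ins for the kernel structures used by opstate_set(). */
struct res_config {
	int machine_check;
};

struct pci_device_id {
	uint32_t vendor;
	uint32_t device;
};

static int edac_op_state = EDAC_OPSTATE_INVAL;

static void opstate_set(const struct res_config *cfg,
			const struct pci_device_id *ent)
{
	/* Quirk: force polling where the error interrupts don't work. */
	if (ent->device == DID_ADL_N_SKU4) {
		edac_op_state = EDAC_OPSTATE_POLL;
		return;
	}

	/* Otherwise set the mode according to the configuration data. */
	if (cfg->machine_check)
		edac_op_state = EDAC_OPSTATE_INT;
	else
		edac_op_state = EDAC_OPSTATE_NMI;
}

/* Hypothetical helper: run opstate_set() and report the chosen mode. */
static int mode_for(uint32_t device, int machine_check)
{
	struct res_config cfg = { .machine_check = machine_check };
	struct pci_device_id ent = { .vendor = 0x8086, .device = device };

	opstate_set(&cfg, &ent);
	return edac_op_state;
}
```

With this model, mode_for(DID_ADL_N_SKU4, ...) selects polling regardless of
machine_check, while any other device ID falls through to INT or NMI.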

Could you try it and resend a new version of this patch?
If you have any questions, please feel free to let me know.
Thanks!

-Qiuxu