[v1] RAS/CEC: Reduce offline page threshold for Intel systems

[PATCH] RAS/CEC: Reduce offline page threshold for Intel systems

Posted by Tony Luck 3 years, 9 months ago

A large scale study of memory errors on Intel systems in data centers
showed that aggressively taking pages with corrected errors offline is
the best strategy of using corrected errors as a predictor of future
uncorrected errors.

It is unknown whether this would help other vendors. There are some
indicators that it would not.

Set the threshold to "2" on Intel systems.

Do-not-apply-without-agreement-from-AMD
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 drivers/ras/cec.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 42f2fc0bc8a9..b1fc193b2036 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -556,6 +556,14 @@ static int __init cec_init(void)
 	if (ce_arr.disabled)
 		return -ENODEV;
 
+	/*
+	 * Intel systems may avoid uncorreectable errors
+	 * if pages with corrected errors are aggresively
+	 * taken offline.
+	 */
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+		action_threshold = 2;
+
 	ce_arr.array = (void *)get_zeroed_page(GFP_KERNEL);
 	if (!ce_arr.array) {
 		pr_err("Error allocating CE array page!\n");
-- 
2.35.3

Re: [PATCH] RAS/CEC: Reduce offline page threshold for Intel systems

Posted by Yazen Ghannam 3 years, 8 months ago

On Fri, Jul 01, 2022 at 12:12:39PM -0700, Tony Luck wrote:
> A large scale study of memory errors on Intel systems in data centers
> showed that aggressively taking pages with corrected errors offline is
> the best strategy of using corrected errors as a predictor of future
> uncorrected errors.
> 
> It is unknown whether this would help other vendors. There are some
> indicators that it would not.
> 
> Set the threshold to "2" on Intel systems.
> 
> Do-not-apply-without-agreement-from-AMD
> Signed-off-by: Tony Luck <tony.luck@intel.com>

Hi Tony,
The guidance from our hardware folks is that this isn't necessary for our
systems. So I think restricting this to Intel systems is okay.

> ---
>  drivers/ras/cec.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
> index 42f2fc0bc8a9..b1fc193b2036 100644
> --- a/drivers/ras/cec.c
> +++ b/drivers/ras/cec.c
> @@ -556,6 +556,14 @@ static int __init cec_init(void)
>  	if (ce_arr.disabled)
>  		return -ENODEV;
>  
> +	/*
> +	 * Intel systems may avoid uncorreectable errors
> +	 * if pages with corrected errors are aggresively
> +	 * taken offline.
> +	 */

s/uncorreectable/uncorrectable/
s/aggresively/aggressively/

> +	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
> +		action_threshold = 2;
> +
>  	ce_arr.array = (void *)get_zeroed_page(GFP_KERNEL);
>  	if (!ce_arr.array) {
>  		pr_err("Error allocating CE array page!\n");
> --

Looks good to me overall.

Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>

Thanks,
Yazen

[PATCH v2] RAS/CEC: Reduce offline page threshold for Intel systems

Posted by Tony Luck 3 years, 8 months ago

A large scale study of memory errors on Intel systems in data centers
showed that aggressively taking pages with corrected errors offline is
the best strategy of using corrected errors as a predictor of future
uncorrected errors.

Set the threshold to "2" on Intel systems. AMD guidance is that this is
not necessary for their systems.

Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---

V2:
	Fix some spelling errors.
	Add note to commit that AMD systems do not need this.
	Add Yazen's Reviewed-by tag.

 drivers/ras/cec.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 42f2fc0bc8a9..321af498ee11 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -556,6 +556,14 @@ static int __init cec_init(void)
 	if (ce_arr.disabled)
 		return -ENODEV;
 
+	/*
+	 * Intel systems may avoid uncorrectable errors
+	 * if pages with corrected errors are aggressively
+	 * taken offline.
+	 */
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+		action_threshold = 2;
+
 	ce_arr.array = (void *)get_zeroed_page(GFP_KERNEL);
 	if (!ce_arr.array) {
 		pr_err("Error allocating CE array page!\n");
-- 
2.35.3
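
For readers unfamiliar with what the threshold controls, the following is a minimal userspace sketch of count-based offlining: each corrected error bumps a per-page counter, and the page is reported for offlining once the counter reaches the threshold. It is an illustration only, not the code in drivers/ras/cec.c. The names `ce_table`, `ce_entry`, `ce_record`, and `MAX_TRACKED` are hypothetical, and the sketch leaves out details such as count decay and sorted storage.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; not the data layout of drivers/ras/cec.c. */
#define MAX_TRACKED 16

struct ce_entry {
	uint64_t pfn;		/* page frame number that reported a corrected error */
	unsigned int count;	/* corrected errors seen on that page so far */
};

struct ce_table {
	struct ce_entry entries[MAX_TRACKED];
	size_t used;		/* number of valid entries */
	unsigned int threshold;	/* offline a page once its count reaches this */
};

/*
 * Record one corrected error for @pfn.
 *
 * Returns true when the page has reached the threshold and should be
 * taken offline; its entry is dropped from the table at that point.
 */
static bool ce_record(struct ce_table *t, uint64_t pfn)
{
	size_t i;

	/* Look for an existing entry for this page. */
	for (i = 0; i < t->used; i++) {
		if (t->entries[i].pfn == pfn)
			break;
	}

	/* Not tracked yet: append a new entry if there is room. */
	if (i == t->used) {
		if (t->used == MAX_TRACKED)
			return false;	/* table full: this sketch ignores the event */
		t->entries[i].pfn = pfn;
		t->entries[i].count = 0;
		t->used++;
	}

	if (++t->entries[i].count < t->threshold)
		return false;

	/* Threshold reached: forget the page and tell the caller to offline it. */
	t->entries[i] = t->entries[--t->used];
	return true;
}
```

With `threshold` set to 2, as the patch does for Intel systems, the first corrected error on a page is only recorded and the second one triggers the offline request. A larger threshold tolerates proportionally more corrected errors per page before acting.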