[v3] x86/e820: Fix handling of subpage regions when calculating nosave ranges

[PATCH v3] x86/e820: Fix handling of subpage regions when calculating nosave ranges

Posted by Myrrh Periwinkle 10 months, 1 week ago

The current implementation of e820__register_nosave_regions suffers from
multiple serious issues:
 - The end of last region is tracked by PFN, causing it to find holes
   that aren't there if two consecutive subpage regions are present
 - The nosave PFN ranges derived from holes are rounded out (instead of
   rounded in) which makes it inconsistent with how explicitly reserved
   regions are handled

Fix this by:
 - Treating reserved regions as if they were holes, to ensure consistent
   handling (rounding out nosave PFN ranges is more correct as the
   kernel does not use partial pages)
 - Tracking the end of the last RAM region by address instead of pages
   to detect holes more precisely

Cc: stable@vger.kernel.org
Fixes: e5540f875404 ("x86/boot/e820: Consolidate 'struct e820_entry *entry' local variable names")
Reported-by: Roberto Ricci <io@r-ricci.it>
Closes: https://lore.kernel.org/all/Z4WFjBVHpndct7br@desktop0a/
Signed-off-by: Myrrh Periwinkle <myrrhperiwinkle@qtmlabs.xyz>
---
The issue of the kernel failing to resume from hibernation after
kexec_load() is used is likely due to kexec-tools passing in a different
e820 memory map from the one provided by system firmware, causing the
e820 consistency check to fail. That issue is not addressed in this
patch and will need to be fixed in kexec-tools instead.

Changes in v3:
- Round out nosave PFN ranges since the kernel does not use partial
  pages
- Link to v2: https://lore.kernel.org/r/20250405-fix-e820-nosave-v2-1-d40dbe457c95@qtmlabs.xyz

Changes in v2:
- Updated author details
- Rewrote commit message
- Link to v1: https://lore.kernel.org/r/20250405-fix-e820-nosave-v1-1-162633199548@qtmlabs.xyz
---
 arch/x86/kernel/e820.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 57120f0749cc3c23844eeb36820705687e08bbf7..9d8dd8deb2a702bd34b961ca4f5eba8a8d9643d0 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -753,22 +753,21 @@ void __init e820__memory_setup_extended(u64 phys_addr, u32 data_len)
 void __init e820__register_nosave_regions(unsigned long limit_pfn)
 {
 	int i;
-	unsigned long pfn = 0;
+	u64 last_addr = 0;
 
 	for (i = 0; i < e820_table->nr_entries; i++) {
 		struct e820_entry *entry = &e820_table->entries[i];
 
-		if (pfn < PFN_UP(entry->addr))
-			register_nosave_region(pfn, PFN_UP(entry->addr));
-
-		pfn = PFN_DOWN(entry->addr + entry->size);
-
 		if (entry->type != E820_TYPE_RAM)
-			register_nosave_region(PFN_UP(entry->addr), pfn);
+			continue;
 
-		if (pfn >= limit_pfn)
-			break;
+		if (last_addr < entry->addr)
+			register_nosave_region(PFN_DOWN(last_addr), PFN_UP(entry->addr));
+
+		last_addr = entry->addr + entry->size;
 	}
+
+	register_nosave_region(PFN_DOWN(last_addr), limit_pfn);
 }
 
 #ifdef CONFIG_ACPI

---
base-commit: a8662bcd2ff152bfbc751cab20f33053d74d0963
change-id: 20250405-fix-e820-nosave-f5cb985a9a61

Best regards,
-- 
Myrrh Periwinkle <myrrhperiwinkle@qtmlabs.xyz>

Re: [PATCH v3] x86/e820: Fix handling of subpage regions when calculating nosave ranges

Posted by Ingo Molnar 10 months, 1 week ago

* Myrrh Periwinkle <myrrhperiwinkle@qtmlabs.xyz> wrote:

> The current implementation of e820__register_nosave_regions suffers from
> multiple serious issues:
>  - The end of last region is tracked by PFN, causing it to find holes
>    that aren't there if two consecutive subpage regions are present
>  - The nosave PFN ranges derived from holes are rounded out (instead of
>    rounded in) which makes it inconsistent with how explicitly reserved
>    regions are handled
> 
> Fix this by:
>  - Treating reserved regions as if they were holes, to ensure consistent
>    handling (rounding out nosave PFN ranges is more correct as the
>    kernel does not use partial pages)
>  - Tracking the end of the last RAM region by address instead of pages
>    to detect holes more precisely
> 
> Cc: stable@vger.kernel.org
> Fixes: e5540f875404 ("x86/boot/e820: Consolidate 'struct e820_entry *entry' local variable names")

So why is this SHA1 indicated as the root cause? AFAICS that commit 
does nothing but cleanups, so it cannot cause such regressions.

Thanks,

	Ingo

Re: [PATCH v3] x86/e820: Fix handling of subpage regions when calculating nosave ranges

Posted by Ingo Molnar 10 months, 1 week ago

* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Myrrh Periwinkle <myrrhperiwinkle@qtmlabs.xyz> wrote:
> 
> > The current implementation of e820__register_nosave_regions suffers from
> > multiple serious issues:
> >  - The end of last region is tracked by PFN, causing it to find holes
> >    that aren't there if two consecutive subpage regions are present
> >  - The nosave PFN ranges derived from holes are rounded out (instead of
> >    rounded in) which makes it inconsistent with how explicitly reserved
> >    regions are handled
> > 
> > Fix this by:
> >  - Treating reserved regions as if they were holes, to ensure consistent
> >    handling (rounding out nosave PFN ranges is more correct as the
> >    kernel does not use partial pages)
> >  - Tracking the end of the last RAM region by address instead of pages
> >    to detect holes more precisely
> > 
> > Cc: stable@vger.kernel.org
> > Fixes: e5540f875404 ("x86/boot/e820: Consolidate 'struct e820_entry *entry' local variable names")
> 
> So why is this SHA1 indicated as the root cause? AFAICS that commit 
> does nothing but cleanups, so it cannot cause such regressions.

BTW.:

 A) "It was the first random commit that seemed related, sry"
 B) "It's a 15 years old bug, but I wanted to indicate a fresh, 8-year old bug to get this into -stable. Busted!"

... are perfectly fine answers in my book. :-)

I'm glad about the fixes, I'm just curious how the Fixes tag came about.

Thanks,

	Ingo

Re: [PATCH v3] x86/e820: Fix handling of subpage regions when calculating nosave ranges

Posted by Myrrh Periwinkle 10 months, 1 week ago

On 4/7/25 01:36, Ingo Molnar wrote:

> * Ingo Molnar<mingo@kernel.org> wrote:
>
>> * Myrrh Periwinkle<myrrhperiwinkle@qtmlabs.xyz> wrote:
>>
>>> The current implementation of e820__register_nosave_regions suffers from
>>> multiple serious issues:
>>>   - The end of last region is tracked by PFN, causing it to find holes
>>>     that aren't there if two consecutive subpage regions are present
>>>   - The nosave PFN ranges derived from holes are rounded out (instead of
>>>     rounded in) which makes it inconsistent with how explicitly reserved
>>>     regions are handled
>>>
>>> Fix this by:
>>>   - Treating reserved regions as if they were holes, to ensure consistent
>>>     handling (rounding out nosave PFN ranges is more correct as the
>>>     kernel does not use partial pages)
>>>   - Tracking the end of the last RAM region by address instead of pages
>>>     to detect holes more precisely
>>>
>>> Cc:stable@vger.kernel.org
>>> Fixes: e5540f875404 ("x86/boot/e820: Consolidate 'struct e820_entry *entry' local variable names")
>> So why is this SHA1 indicated as the root cause? AFAICS that commit
>> does nothing but cleanups, so it cannot cause such regressions.
> BTW.:
>
>   A) "It was the first random commit that seemed related, sry"
>   B) "It's a 15 years old bug, but I wanted to indicate a fresh, 8-year old bug to get this into -stable. Busted!"

You got me :) How did you know that this is a 15 years old bug? 
(although I didn't think the age of the bug a patch fixes would affect 
its chances of getting to -stable)

This specific revision was picked since it's the latest one that this 
patch can be straightforwardly applied to (there is a (trivial) merge 
conflict with -stable, though).

Later, I managed to track the buggy logic back to 1c10070a55a3 ("i386: 
do not restore reserved memory after hibernation"), which I believe is 
the very first occurrence of this bug. If you prefer, I can send a v4 
with a more correct Fixes: tag (or feel free to do so yourself when 
applying this patch).

> ... are perfectly fine answers in my book. :-)
>
> I'm glad about the fixes, I'm just curious how the Fixes tag came about.
>
> Thanks,
>
> 	Ingo

Regards,

Myrrh

Re: [PATCH v3] x86/e820: Fix handling of subpage regions when calculating nosave ranges

Posted by Ingo Molnar 10 months, 1 week ago

* Myrrh Periwinkle <myrrhperiwinkle@qtmlabs.xyz> wrote:

> On 4/7/25 01:36, Ingo Molnar wrote:
> 
> > * Ingo Molnar<mingo@kernel.org> wrote:
> > 
> > > * Myrrh Periwinkle<myrrhperiwinkle@qtmlabs.xyz> wrote:
> > > 
> > > > The current implementation of e820__register_nosave_regions suffers from
> > > > multiple serious issues:
> > > >   - The end of last region is tracked by PFN, causing it to find holes
> > > >     that aren't there if two consecutive subpage regions are present
> > > >   - The nosave PFN ranges derived from holes are rounded out (instead of
> > > >     rounded in) which makes it inconsistent with how explicitly reserved
> > > >     regions are handled
> > > > 
> > > > Fix this by:
> > > >   - Treating reserved regions as if they were holes, to ensure consistent
> > > >     handling (rounding out nosave PFN ranges is more correct as the
> > > >     kernel does not use partial pages)
> > > >   - Tracking the end of the last RAM region by address instead of pages
> > > >     to detect holes more precisely
> > > > 
> > > > Cc:stable@vger.kernel.org
> > > > Fixes: e5540f875404 ("x86/boot/e820: Consolidate 'struct e820_entry *entry' local variable names")
> > > So why is this SHA1 indicated as the root cause? AFAICS that commit
> > > does nothing but cleanups, so it cannot cause such regressions.
> > BTW.:
> > 
> >   A) "It was the first random commit that seemed related, sry"
> >   B) "It's a 15 years old bug, but I wanted to indicate a fresh, 8-year old bug to get this into -stable. Busted!"
> 
> You got me :) How did you know that this is a 15 years old bug?

Call it a 'regression radar' that every kernel maintainer develops 
after their first 20 years or so - each bug has a distinct feeling
to them, and this one felt genuinely *ancient*.

> [...] (although I didn't think the age of the bug a patch fixes would 
> affect its chances of getting to -stable)

Yeah, it doesn't really affect its -stable elibility much once we move 
outside the ~6-12 months window that upstream recognizes as a 
semi-recent regression - it was mostly my lame attempt at deadpan 
humor, trying to play off 15 year old bugs against 8 year old bugs as 
if 8 years old bugs were fresh. Yeah, I know, it's not funny even to me 
anymore, I'm weird that way. ;-)

> This specific revision was picked since it's the latest one that this 
> patch can be straightforwardly applied to (there is a (trivial) merge 
> conflict with -stable, though).

Yeah. So in the x86/urgent commit I've tagged the other commit you 
pinpointed in your followup mail:

    Fixes: e8eff5ac294e ("[PATCH] Make swsusp avoid memory holes and reserved memory regions on x86_64")

Just to give backporters *some* chance at fixing this ancient bug
in older kernels, if they really want to.

Thanks,

	Ingo

Re: [PATCH v3] x86/e820: Fix handling of subpage regions when calculating nosave ranges

Posted by Myrrh Periwinkle 10 months, 1 week ago

On 4/7/25 07:53, Myrrh Periwinkle wrote:

> On 4/7/25 01:36, Ingo Molnar wrote:
>
>> * Ingo Molnar<mingo@kernel.org> wrote:
>>
>>> * Myrrh Periwinkle<myrrhperiwinkle@qtmlabs.xyz> wrote:
>>>
>>>> The current implementation of e820__register_nosave_regions suffers 
>>>> from
>>>> multiple serious issues:
>>>>   - The end of last region is tracked by PFN, causing it to find holes
>>>>     that aren't there if two consecutive subpage regions are present
>>>>   - The nosave PFN ranges derived from holes are rounded out 
>>>> (instead of
>>>>     rounded in) which makes it inconsistent with how explicitly 
>>>> reserved
>>>>     regions are handled
>>>>
>>>> Fix this by:
>>>>   - Treating reserved regions as if they were holes, to ensure 
>>>> consistent
>>>>     handling (rounding out nosave PFN ranges is more correct as the
>>>>     kernel does not use partial pages)
>>>>   - Tracking the end of the last RAM region by address instead of 
>>>> pages
>>>>     to detect holes more precisely
>>>>
>>>> Cc:stable@vger.kernel.org
>>>> Fixes: e5540f875404 ("x86/boot/e820: Consolidate 'struct e820_entry 
>>>> *entry' local variable names")
>>> So why is this SHA1 indicated as the root cause? AFAICS that commit
>>> does nothing but cleanups, so it cannot cause such regressions.
>> BTW.:
>>
>>   A) "It was the first random commit that seemed related, sry"
>>   B) "It's a 15 years old bug, but I wanted to indicate a fresh, 
>> 8-year old bug to get this into -stable. Busted!"
>
> You got me :) How did you know that this is a 15 years old bug? 
> (although I didn't think the age of the bug a patch fixes would affect 
> its chances of getting to -stable)
>
> This specific revision was picked since it's the latest one that this 
> patch can be straightforwardly applied to (there is a (trivial) merge 
> conflict with -stable, though).
>
> Later, I managed to track the buggy logic back to 1c10070a55a3 ("i386: 
> do not restore reserved memory after hibernation"), which I believe is 
> the very first occurrence of this bug. If you prefer, I can send a v4 
> with a more correct Fixes: tag (or feel free to do so yourself when 
> applying this patch).

I did some more digging and it seems like the buggy logic actually 
appeared all the way back in e8eff5ac294e ("[PATCH] Make swsusp avoid 
memory holes and reserved memory regions on x86_64") back when x86_64 
was a separate port, which was copied later by the i386 port in the 
commit I mentioned above, which would make this a 19 year old bug 
instead of 15.

>
>> ... are perfectly fine answers in my book. :-)
>>
>> I'm glad about the fixes, I'm just curious how the Fixes tag came about.
>>
>> Thanks,
>>
>>     Ingo
>
> Regards,
>
> Myrrh
>

[tip: x86/urgent] x86/e820: Fix handling of subpage regions when calculating nosave ranges in e820__register_nosave_regions()

Posted by tip-bot2 for Myrrh Periwinkle 10 months, 1 week ago

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     f2f29da9f0d4367f6ff35e0d9d021257bb53e273
Gitweb:        https://git.kernel.org/tip/f2f29da9f0d4367f6ff35e0d9d021257bb53e273
Author:        Myrrh Periwinkle <myrrhperiwinkle@qtmlabs.xyz>
AuthorDate:    Sun, 06 Apr 2025 11:45:22 +07:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 07 Apr 2025 19:20:08 +02:00

x86/e820: Fix handling of subpage regions when calculating nosave ranges in e820__register_nosave_regions()

While debugging kexec/hibernation hangs and crashes, it turned out that
the current implementation of e820__register_nosave_regions() suffers from
multiple serious issues:

 - The end of last region is tracked by PFN, causing it to find holes
   that aren't there if two consecutive subpage regions are present

 - The nosave PFN ranges derived from holes are rounded out (instead of
   rounded in) which makes it inconsistent with how explicitly reserved
   regions are handled

Fix this by:

 - Treating reserved regions as if they were holes, to ensure consistent
   handling (rounding out nosave PFN ranges is more correct as the
   kernel does not use partial pages)

 - Tracking the end of the last RAM region by address instead of pages
   to detect holes more precisely

These bugs appear to have been introduced about ~18 years ago with the very
first version of e820_mark_nosave_regions(), and its flawed assumptions were
carried forward uninterrupted through various waves of rewrites and renames.

[ mingo: Added Git archeology details, for kicks and giggles. ]

Fixes: e8eff5ac294e ("[PATCH] Make swsusp avoid memory holes and reserved memory regions on x86_64")
Reported-by: Roberto Ricci <io@r-ricci.it>
Tested-by: Roberto Ricci <io@r-ricci.it>
Signed-off-by: Myrrh Periwinkle <myrrhperiwinkle@qtmlabs.xyz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Len Brown <len.brown@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250406-fix-e820-nosave-v3-1-f3787bc1ee1d@qtmlabs.xyz
Closes: https://lore.kernel.org/all/Z4WFjBVHpndct7br@desktop0a/
---
 arch/x86/kernel/e820.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 57120f0..9d8dd8d 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -753,22 +753,21 @@ void __init e820__memory_setup_extended(u64 phys_addr, u32 data_len)
 void __init e820__register_nosave_regions(unsigned long limit_pfn)
 {
 	int i;
-	unsigned long pfn = 0;
+	u64 last_addr = 0;
 
 	for (i = 0; i < e820_table->nr_entries; i++) {
 		struct e820_entry *entry = &e820_table->entries[i];
 
-		if (pfn < PFN_UP(entry->addr))
-			register_nosave_region(pfn, PFN_UP(entry->addr));
-
-		pfn = PFN_DOWN(entry->addr + entry->size);
-
 		if (entry->type != E820_TYPE_RAM)
-			register_nosave_region(PFN_UP(entry->addr), pfn);
+			continue;
 
-		if (pfn >= limit_pfn)
-			break;
+		if (last_addr < entry->addr)
+			register_nosave_region(PFN_DOWN(last_addr), PFN_UP(entry->addr));
+
+		last_addr = entry->addr + entry->size;
 	}
+
+	register_nosave_region(PFN_DOWN(last_addr), limit_pfn);
 }
 
 #ifdef CONFIG_ACPI