[tip: x86/urgent] x86/mm: Fix pti_clone_entry_text() for i386

tip-bot2 for Peter Zijlstra posted 1 patch 1 year, 6 months ago
There is a newer version of this series
arch/x86/mm/pti.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
[tip: x86/urgent] x86/mm: Fix pti_clone_entry_text() for i386
Posted by tip-bot2 for Peter Zijlstra 1 year, 6 months ago
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     49947e7aedfea2573bada0c95b85f6c2363bef9f
Gitweb:        https://git.kernel.org/tip/49947e7aedfea2573bada0c95b85f6c2363bef9f
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Thu, 01 Aug 2024 12:42:25 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 01 Aug 2024 12:48:23 +02:00

x86/mm: Fix pti_clone_entry_text() for i386

While x86_64 has PMD aligned text sections, i386 does not have this
luxery. Notably ALIGN_ENTRY_TEXT_END is empty and _etext has PAGE
alignment.

This means that text on i386 can be page granular at the tail end,
which in turn means that the PTI text clones should consistently
account for this.

Make pti_clone_entry_text() consistent with pti_clone_kernel_text().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/mm/pti.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 48c5032..bfdf5f4 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -496,7 +496,7 @@ static void pti_clone_entry_text(void)
 {
 	pti_clone_pgtable((unsigned long) __entry_text_start,
 			  (unsigned long) __entry_text_end,
-			  PTI_CLONE_PMD);
+			  PTI_LEVEL_KERNEL_IMAGE);
 }
 
 /*
Re: [tip: x86/urgent] x86/mm: Fix pti_clone_entry_text() for i386
Posted by Guenter Roeck 1 year, 6 months ago
Hi Peter,

On Thu, Aug 01, 2024 at 10:55:31AM -0000, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the x86/urgent branch of tip:
> 
> Commit-ID:     49947e7aedfea2573bada0c95b85f6c2363bef9f
> Gitweb:        https://git.kernel.org/tip/49947e7aedfea2573bada0c95b85f6c2363bef9f
> Author:        Peter Zijlstra <peterz@infradead.org>
> AuthorDate:    Thu, 01 Aug 2024 12:42:25 +02:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Thu, 01 Aug 2024 12:48:23 +02:00
> 
> x86/mm: Fix pti_clone_entry_text() for i386
> 
> While x86_64 has PMD aligned text sections, i386 does not have this
> luxery. Notably ALIGN_ENTRY_TEXT_END is empty and _etext has PAGE
> alignment.
> 
> This means that text on i386 can be page granular at the tail end,
> which in turn means that the PTI text clones should consistently
> account for this.
> 
> Make pti_clone_entry_text() consistent with pti_clone_kernel_text().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

With this patch in the tree, some of my qemu tests (those with PAE enabled)
report several WARNING backtraces.

WARNING: CPU: 0 PID: 1 at arch/x86/mm/pti.c:256 pti_clone_pgtable+0x298/0x2dc

WARNING: CPU: 0 PID: 1 at arch/x86/mm/pti.c:394 pti_clone_pgtable+0x29a/0x2dc

The backtraces are repeated multiple times.

Please see

https://kerneltests.org/builders/qemu-x86-master/builds/253/steps/qemubuildcommand/logs/stdio

for complete logs.

Bisect log is attached. Reverting this patch fixes the problem.

Thanks,
Guenter

---
# bad: [de9c2c66ad8e787abec7c9d7eff4f8c3cdd28aed] Linux 6.11-rc2
# good: [defaf1a2113a22b00dfa1abc0fd2014820eaf065] Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect start 'v6.11-rc2' 'defaf1a2113a'
# good: [953f776459a83f00ac940dd67c96d226d7041550] Merge tag 'irq-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 953f776459a83f00ac940dd67c96d226d7041550
# good: [61ca6c78295e242d4b681003112bfcdc54597489] Merge tag 'timers-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 61ca6c78295e242d4b681003112bfcdc54597489
# good: [41e71dbb0e0a0fe214545fe64af031303a08524c] x86/mm: Fix pti_clone_pgtable() alignment assumption
git bisect good 41e71dbb0e0a0fe214545fe64af031303a08524c
# bad: [dd35a0933269c636635b6af89dc6fa1782791e56] x86/uaccess: Zero the 8-byte get_range case on failure on 32-bit
git bisect bad dd35a0933269c636635b6af89dc6fa1782791e56
# bad: [3db03fb4995ef85fc41e86262ead7b4852f4bcf0] x86/mm: Fix pti_clone_entry_text() for i386
git bisect bad 3db03fb4995ef85fc41e86262ead7b4852f4bcf0
# first bad commit: [3db03fb4995ef85fc41e86262ead7b4852f4bcf0] x86/mm: Fix pti_clone_entry_text() for i386
Re: [tip: x86/urgent] x86/mm: Fix pti_clone_entry_text() for i386
Posted by Peter Zijlstra 1 year, 6 months ago
On Mon, Aug 05, 2024 at 09:52:06PM -0700, Guenter Roeck wrote:
> Hi Peter,
> 
> On Thu, Aug 01, 2024 at 10:55:31AM -0000, tip-bot2 for Peter Zijlstra wrote:
> > The following commit has been merged into the x86/urgent branch of tip:
> > 
> > Commit-ID:     49947e7aedfea2573bada0c95b85f6c2363bef9f
> > Gitweb:        https://git.kernel.org/tip/49947e7aedfea2573bada0c95b85f6c2363bef9f
> > Author:        Peter Zijlstra <peterz@infradead.org>
> > AuthorDate:    Thu, 01 Aug 2024 12:42:25 +02:00
> > Committer:     Peter Zijlstra <peterz@infradead.org>
> > CommitterDate: Thu, 01 Aug 2024 12:48:23 +02:00
> > 
> > x86/mm: Fix pti_clone_entry_text() for i386
> > 
> > While x86_64 has PMD aligned text sections, i386 does not have this
> > luxery. Notably ALIGN_ENTRY_TEXT_END is empty and _etext has PAGE
> > alignment.
> > 
> > This means that text on i386 can be page granular at the tail end,
> > which in turn means that the PTI text clones should consistently
> > account for this.
> > 
> > Make pti_clone_entry_text() consistent with pti_clone_kernel_text().
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> With this patch in the tree, some of my qemu tests (those with PAE enabled)
> report several WARNING backtraces.
> 
> WARNING: CPU: 0 PID: 1 at arch/x86/mm/pti.c:256 pti_clone_pgtable+0x298/0x2dc
> 
> WARNING: CPU: 0 PID: 1 at arch/x86/mm/pti.c:394 pti_clone_pgtable+0x29a/0x2dc
> 
> The backtraces are repeated multiple times.
> 
> Please see
> 
> https://kerneltests.org/builders/qemu-x86-master/builds/253/steps/qemubuildcommand/logs/stdio
> 
> for complete logs.

Could you try the below patch? If that don't work, could you provide the
.config, I'm assuming that'll work with the bits I grabbed last time.


---
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index bfdf5f45b137..bec53f127c0d 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -634,14 +634,8 @@ void __init pti_init(void)
 	}
 #endif
 
-	pti_clone_user_shared();
-
 	/* Undo all global bits from the init pagetables in head_64.S: */
 	pti_set_kernel_image_nonglobal();
-	/* Replace some of the global bits just for shared entry text: */
-	pti_clone_entry_text();
-	pti_setup_espfix64();
-	pti_setup_vsyscall();
 }
 
 /*
@@ -659,7 +653,12 @@ void pti_finalize(void)
 	 * We need to clone everything (again) that maps parts of the
 	 * kernel image.
 	 */
+	pti_clone_user_shared();
+
 	pti_clone_entry_text();
+	pti_setup_espfix64();
+	pti_setup_vsyscall();
+
 	pti_clone_kernel_text();
 
 	debug_checkwx_user();
Re: [tip: x86/urgent] x86/mm: Fix pti_clone_entry_text() for i386
Posted by Guenter Roeck 1 year, 6 months ago
On 8/6/24 01:50, Peter Zijlstra wrote:
> On Mon, Aug 05, 2024 at 09:52:06PM -0700, Guenter Roeck wrote:
>> Hi Peter,
>>
>> On Thu, Aug 01, 2024 at 10:55:31AM -0000, tip-bot2 for Peter Zijlstra wrote:
>>> The following commit has been merged into the x86/urgent branch of tip:
>>>
>>> Commit-ID:     49947e7aedfea2573bada0c95b85f6c2363bef9f
>>> Gitweb:        https://git.kernel.org/tip/49947e7aedfea2573bada0c95b85f6c2363bef9f
>>> Author:        Peter Zijlstra <peterz@infradead.org>
>>> AuthorDate:    Thu, 01 Aug 2024 12:42:25 +02:00
>>> Committer:     Peter Zijlstra <peterz@infradead.org>
>>> CommitterDate: Thu, 01 Aug 2024 12:48:23 +02:00
>>>
>>> x86/mm: Fix pti_clone_entry_text() for i386
>>>
>>> While x86_64 has PMD aligned text sections, i386 does not have this
>>> luxery. Notably ALIGN_ENTRY_TEXT_END is empty and _etext has PAGE
>>> alignment.
>>>
>>> This means that text on i386 can be page granular at the tail end,
>>> which in turn means that the PTI text clones should consistently
>>> account for this.
>>>
>>> Make pti_clone_entry_text() consistent with pti_clone_kernel_text().
>>>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>
>> With this patch in the tree, some of my qemu tests (those with PAE enabled)
>> report several WARNING backtraces.
>>
>> WARNING: CPU: 0 PID: 1 at arch/x86/mm/pti.c:256 pti_clone_pgtable+0x298/0x2dc
>>
>> WARNING: CPU: 0 PID: 1 at arch/x86/mm/pti.c:394 pti_clone_pgtable+0x29a/0x2dc
>>
>> The backtraces are repeated multiple times.
>>
>> Please see
>>
>> https://kerneltests.org/builders/qemu-x86-master/builds/253/steps/qemubuildcommand/logs/stdio
>>
>> for complete logs.
> 
> Could you try the below patch? If that don't work, could you provide the
> .config, I'm assuming that'll work with the bits I grabbed last time.
> 

Unfortunately that makes it worse: It causes qemu to quit immediately
without logging anything.

I created http://server.roeck-us.net/qemu/x86-v6.11-rc2/ with all
the relevant information. Please let me know if you need anything else.

Thanks,
Guenter
Re: [tip: x86/urgent] x86/mm: Fix pti_clone_entry_text() for i386
Posted by Peter Zijlstra 1 year, 6 months ago
On Tue, Aug 06, 2024 at 07:25:42AM -0700, Guenter Roeck wrote:

> I created http://server.roeck-us.net/qemu/x86-v6.11-rc2/ with all
> the relevant information. Please let me know if you need anything else.

So I grabbed that config, stuck it in the build dir I used last time and
upgraded gcc-13 from 13.2 ro 13.3. But alas, my build runs successfully
:/

Is there anything else special I missed?
Re: [tip: x86/urgent] x86/mm: Fix pti_clone_entry_text() for i386
Posted by Peter Zijlstra 1 year, 6 months ago
On Tue, Aug 06, 2024 at 04:56:32PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 06, 2024 at 07:25:42AM -0700, Guenter Roeck wrote:
> 
> > I created http://server.roeck-us.net/qemu/x86-v6.11-rc2/ with all
> > the relevant information. Please let me know if you need anything else.
> 
> So I grabbed that config, stuck it in the build dir I used last time and
> upgraded gcc-13 from 13.2 ro 13.3. But alas, my build runs successfully
> :/
> 
> Is there anything else special I missed?

run.sh is not exacrlty the same this time, different CPU model, that
made it go.

OK, lemme poke at this.
Re: [tip: x86/urgent] x86/mm: Fix pti_clone_entry_text() for i386
Posted by Peter Zijlstra 1 year, 6 months ago
On Tue, Aug 06, 2024 at 05:05:15PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 06, 2024 at 04:56:32PM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 06, 2024 at 07:25:42AM -0700, Guenter Roeck wrote:
> > 
> > > I created http://server.roeck-us.net/qemu/x86-v6.11-rc2/ with all
> > > the relevant information. Please let me know if you need anything else.
> > 
> > So I grabbed that config, stuck it in the build dir I used last time and
> > upgraded gcc-13 from 13.2 ro 13.3. But alas, my build runs successfully
> > :/
> > 
> > Is there anything else special I missed?
> 
> run.sh is not exacrlty the same this time, different CPU model, that
> made it go.
> 
> OK, lemme poke at this.

Urgh, so crypto's late_initcall() does user-mode-helper based modprobe
looking for algorithms before we kick off /bin/init :/

This makes things difficult.

Urgh.
Re: [tip: x86/urgent] x86/mm: Fix pti_clone_entry_text() for i386
Posted by Peter Zijlstra 1 year, 6 months ago
On Tue, Aug 06, 2024 at 05:46:53PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 06, 2024 at 05:05:15PM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 06, 2024 at 04:56:32PM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 06, 2024 at 07:25:42AM -0700, Guenter Roeck wrote:
> > > 
> > > > I created http://server.roeck-us.net/qemu/x86-v6.11-rc2/ with all
> > > > the relevant information. Please let me know if you need anything else.
> > > 
> > > So I grabbed that config, stuck it in the build dir I used last time and
> > > upgraded gcc-13 from 13.2 ro 13.3. But alas, my build runs successfully
> > > :/
> > > 
> > > Is there anything else special I missed?
> > 
> > run.sh is not exacrlty the same this time, different CPU model, that
> > made it go.
> > 
> > OK, lemme poke at this.
> 
> Urgh, so crypto's late_initcall() does user-mode-helper based modprobe
> looking for algorithms before we kick off /bin/init :/
> 
> This makes things difficult.
> 
> Urgh.

So the problem is that mark_readonly() splits a code PMD due to NX. Then
the second pti_clone_entry_text() finds a kernel PTE but a user PMD
mapping for the same address (from the early clone) and gets upset.

And we can't run mark_readonly() sooner, because initcall expect stuff
to be RW. But initcalls do modprobe, which runs user crap before we're
done initializing everything.

This is a right mess, and I really don't know what to do.
Re: [tip: x86/urgent] x86/mm: Fix pti_clone_entry_text() for i386
Posted by Guenter Roeck 1 year, 6 months ago
On 8/6/24 08:59, Peter Zijlstra wrote:
> On Tue, Aug 06, 2024 at 05:46:53PM +0200, Peter Zijlstra wrote:
>> On Tue, Aug 06, 2024 at 05:05:15PM +0200, Peter Zijlstra wrote:
>>> On Tue, Aug 06, 2024 at 04:56:32PM +0200, Peter Zijlstra wrote:
>>>> On Tue, Aug 06, 2024 at 07:25:42AM -0700, Guenter Roeck wrote:
>>>>
>>>>> I created http://server.roeck-us.net/qemu/x86-v6.11-rc2/ with all
>>>>> the relevant information. Please let me know if you need anything else.
>>>>
>>>> So I grabbed that config, stuck it in the build dir I used last time and
>>>> upgraded gcc-13 from 13.2 ro 13.3. But alas, my build runs successfully
>>>> :/
>>>>
>>>> Is there anything else special I missed?
>>>
>>> run.sh is not exacrlty the same this time, different CPU model, that
>>> made it go.
>>>
>>> OK, lemme poke at this.
>>
>> Urgh, so crypto's late_initcall() does user-mode-helper based modprobe
>> looking for algorithms before we kick off /bin/init :/
>>
>> This makes things difficult.
>>
>> Urgh.
> 
> So the problem is that mark_readonly() splits a code PMD due to NX. Then
> the second pti_clone_entry_text() finds a kernel PTE but a user PMD
> mapping for the same address (from the early clone) and gets upset.
> 
> And we can't run mark_readonly() sooner, because initcall expect stuff
> to be RW. But initcalls do modprobe, which runs user crap before we're
> done initializing everything.
> 
> This is a right mess, and I really don't know what to do.

And there was me thinking this one should be easy to solve. Oh well.

Maybe Linus has an idea ? I am getting a bit wary to reporting all those
weird problems to him, though.

Guenter
Re: [tip: x86/urgent] x86/mm: Fix pti_clone_entry_text() for i386
Posted by Peter Zijlstra 1 year, 6 months ago
On Tue, Aug 06, 2024 at 09:42:23AM -0700, Guenter Roeck wrote:
> On 8/6/24 08:59, Peter Zijlstra wrote:
> > On Tue, Aug 06, 2024 at 05:46:53PM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 06, 2024 at 05:05:15PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Aug 06, 2024 at 04:56:32PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Aug 06, 2024 at 07:25:42AM -0700, Guenter Roeck wrote:
> > > > > 
> > > > > > I created http://server.roeck-us.net/qemu/x86-v6.11-rc2/ with all
> > > > > > the relevant information. Please let me know if you need anything else.
> > > > > 
> > > > > So I grabbed that config, stuck it in the build dir I used last time and
> > > > > upgraded gcc-13 from 13.2 ro 13.3. But alas, my build runs successfully
> > > > > :/
> > > > > 
> > > > > Is there anything else special I missed?
> > > > 
> > > > run.sh is not exacrlty the same this time, different CPU model, that
> > > > made it go.
> > > > 
> > > > OK, lemme poke at this.
> > > 
> > > Urgh, so crypto's late_initcall() does user-mode-helper based modprobe
> > > looking for algorithms before we kick off /bin/init :/
> > > 
> > > This makes things difficult.
> > > 
> > > Urgh.
> > 
> > So the problem is that mark_readonly() splits a code PMD due to NX. Then
> > the second pti_clone_entry_text() finds a kernel PTE but a user PMD
> > mapping for the same address (from the early clone) and gets upset.
> > 
> > And we can't run mark_readonly() sooner, because initcall expect stuff
> > to be RW. But initcalls do modprobe, which runs user crap before we're
> > done initializing everything.
> > 
> > This is a right mess, and I really don't know what to do.
> 
> And there was me thinking this one should be easy to solve. Oh well.
> 
> Maybe Linus has an idea ? I am getting a bit wary to reporting all those
> weird problems to him, though.

While I had dinner with the family, Thomas cooked up the below, which
seems to make it happy.

It basically splits the PMD on the late copy.

---
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index bfdf5f45b137..b6ebbc9c23b1 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -241,7 +241,7 @@ static pmd_t *pti_user_pagetable_walk_pmd(unsigned long address)
  *
  * Returns a pointer to a PTE on success, or NULL on failure.
  */
-static pte_t *pti_user_pagetable_walk_pte(unsigned long address)
+static pte_t *pti_user_pagetable_walk_pte(unsigned long address, bool late_text)
 {
 	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
 	pmd_t *pmd;
@@ -251,10 +251,15 @@ static pte_t *pti_user_pagetable_walk_pte(unsigned long address)
 	if (!pmd)
 		return NULL;
 
-	/* We can't do anything sensible if we hit a large mapping. */
+	/* Large PMD mapping found */
 	if (pmd_leaf(*pmd)) {
-		WARN_ON(1);
-		return NULL;
+		/* Clear the PMD if we hit a large mapping from the first round */
+		if (late_text) {
+			set_pmd(pmd, __pmd(0));
+		} else {
+			WARN_ON_ONCE(1);
+			return NULL;
+		}
 	}
 
 	if (pmd_none(*pmd)) {
@@ -283,7 +288,7 @@ static void __init pti_setup_vsyscall(void)
 	if (!pte || WARN_ON(level != PG_LEVEL_4K) || pte_none(*pte))
 		return;
 
-	target_pte = pti_user_pagetable_walk_pte(VSYSCALL_ADDR);
+	target_pte = pti_user_pagetable_walk_pte(VSYSCALL_ADDR, false);
 	if (WARN_ON(!target_pte))
 		return;
 
@@ -301,7 +306,7 @@ enum pti_clone_level {
 
 static void
 pti_clone_pgtable(unsigned long start, unsigned long end,
-		  enum pti_clone_level level)
+		  enum pti_clone_level level, bool late_text)
 {
 	unsigned long addr;
 
@@ -390,7 +395,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
 				return;
 
 			/* Allocate PTE in the user page-table */
-			target_pte = pti_user_pagetable_walk_pte(addr);
+			target_pte = pti_user_pagetable_walk_pte(addr, late_text);
 			if (WARN_ON(!target_pte))
 				return;
 
@@ -452,7 +457,7 @@ static void __init pti_clone_user_shared(void)
 		phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
 		pte_t *target_pte;
 
-		target_pte = pti_user_pagetable_walk_pte(va);
+		target_pte = pti_user_pagetable_walk_pte(va, false);
 		if (WARN_ON(!target_pte))
 			return;
 
@@ -475,7 +480,7 @@ static void __init pti_clone_user_shared(void)
 	start = CPU_ENTRY_AREA_BASE;
 	end   = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);
 
-	pti_clone_pgtable(start, end, PTI_CLONE_PMD);
+	pti_clone_pgtable(start, end, PTI_CLONE_PMD, false);
 }
 #endif /* CONFIG_X86_64 */
 
@@ -492,11 +497,11 @@ static void __init pti_setup_espfix64(void)
 /*
  * Clone the populated PMDs of the entry text and force it RO.
  */
-static void pti_clone_entry_text(void)
+static void pti_clone_entry_text(bool late)
 {
 	pti_clone_pgtable((unsigned long) __entry_text_start,
 			  (unsigned long) __entry_text_end,
-			  PTI_LEVEL_KERNEL_IMAGE);
+			  PTI_LEVEL_KERNEL_IMAGE, late);
 }
 
 /*
@@ -571,7 +576,7 @@ static void pti_clone_kernel_text(void)
 	 * pti_set_kernel_image_nonglobal() did to clear the
 	 * global bit.
 	 */
-	pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
+	pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE, false);
 
 	/*
 	 * pti_clone_pgtable() will set the global bit in any PMDs
@@ -639,7 +644,7 @@ void __init pti_init(void)
 	/* Undo all global bits from the init pagetables in head_64.S: */
 	pti_set_kernel_image_nonglobal();
 	/* Replace some of the global bits just for shared entry text: */
-	pti_clone_entry_text();
+	pti_clone_entry_text(false);
 	pti_setup_espfix64();
 	pti_setup_vsyscall();
 }
@@ -659,7 +664,7 @@ void pti_finalize(void)
 	 * We need to clone everything (again) that maps parts of the
 	 * kernel image.
 	 */
-	pti_clone_entry_text();
+	pti_clone_entry_text(true);
 	pti_clone_kernel_text();
 
 	debug_checkwx_user();
Re: [tip: x86/urgent] x86/mm: Fix pti_clone_entry_text() for i386
Posted by Guenter Roeck 1 year, 6 months ago
On 8/6/24 11:48, Peter Zijlstra wrote:
> On Tue, Aug 06, 2024 at 09:42:23AM -0700, Guenter Roeck wrote:
>> On 8/6/24 08:59, Peter Zijlstra wrote:
>>> On Tue, Aug 06, 2024 at 05:46:53PM +0200, Peter Zijlstra wrote:
>>>> On Tue, Aug 06, 2024 at 05:05:15PM +0200, Peter Zijlstra wrote:
>>>>> On Tue, Aug 06, 2024 at 04:56:32PM +0200, Peter Zijlstra wrote:
>>>>>> On Tue, Aug 06, 2024 at 07:25:42AM -0700, Guenter Roeck wrote:
>>>>>>
>>>>>>> I created http://server.roeck-us.net/qemu/x86-v6.11-rc2/ with all
>>>>>>> the relevant information. Please let me know if you need anything else.
>>>>>>
>>>>>> So I grabbed that config, stuck it in the build dir I used last time and
>>>>>> upgraded gcc-13 from 13.2 ro 13.3. But alas, my build runs successfully
>>>>>> :/
>>>>>>
>>>>>> Is there anything else special I missed?
>>>>>
>>>>> run.sh is not exacrlty the same this time, different CPU model, that
>>>>> made it go.
>>>>>
>>>>> OK, lemme poke at this.
>>>>
>>>> Urgh, so crypto's late_initcall() does user-mode-helper based modprobe
>>>> looking for algorithms before we kick off /bin/init :/
>>>>
>>>> This makes things difficult.
>>>>
>>>> Urgh.
>>>
>>> So the problem is that mark_readonly() splits a code PMD due to NX. Then
>>> the second pti_clone_entry_text() finds a kernel PTE but a user PMD
>>> mapping for the same address (from the early clone) and gets upset.
>>>
>>> And we can't run mark_readonly() sooner, because initcall expect stuff
>>> to be RW. But initcalls do modprobe, which runs user crap before we're
>>> done initializing everything.
>>>
>>> This is a right mess, and I really don't know what to do.
>>
>> And there was me thinking this one should be easy to solve. Oh well.
>>
>> Maybe Linus has an idea ? I am getting a bit wary to reporting all those
>> weird problems to him, though.
> 
> While I had dinner with the family, Thomas cooked up the below, which
> seems to make it happy.
> 
> It basically splits the PMD on the late copy.
> 

The patch below passes all my tests.

Tested-by: Guenter Roeck <linux@roeck-us.net>

Thanks!

Guenter

> ---
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index bfdf5f45b137..b6ebbc9c23b1 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -241,7 +241,7 @@ static pmd_t *pti_user_pagetable_walk_pmd(unsigned long address)
>    *
>    * Returns a pointer to a PTE on success, or NULL on failure.
>    */
> -static pte_t *pti_user_pagetable_walk_pte(unsigned long address)
> +static pte_t *pti_user_pagetable_walk_pte(unsigned long address, bool late_text)
>   {
>   	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
>   	pmd_t *pmd;
> @@ -251,10 +251,15 @@ static pte_t *pti_user_pagetable_walk_pte(unsigned long address)
>   	if (!pmd)
>   		return NULL;
>   
> -	/* We can't do anything sensible if we hit a large mapping. */
> +	/* Large PMD mapping found */
>   	if (pmd_leaf(*pmd)) {
> -		WARN_ON(1);
> -		return NULL;
> +		/* Clear the PMD if we hit a large mapping from the first round */
> +		if (late_text) {
> +			set_pmd(pmd, __pmd(0));
> +		} else {
> +			WARN_ON_ONCE(1);
> +			return NULL;
> +		}
>   	}
>   
>   	if (pmd_none(*pmd)) {
> @@ -283,7 +288,7 @@ static void __init pti_setup_vsyscall(void)
>   	if (!pte || WARN_ON(level != PG_LEVEL_4K) || pte_none(*pte))
>   		return;
>   
> -	target_pte = pti_user_pagetable_walk_pte(VSYSCALL_ADDR);
> +	target_pte = pti_user_pagetable_walk_pte(VSYSCALL_ADDR, false);
>   	if (WARN_ON(!target_pte))
>   		return;
>   
> @@ -301,7 +306,7 @@ enum pti_clone_level {
>   
>   static void
>   pti_clone_pgtable(unsigned long start, unsigned long end,
> -		  enum pti_clone_level level)
> +		  enum pti_clone_level level, bool late_text)
>   {
>   	unsigned long addr;
>   
> @@ -390,7 +395,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>   				return;
>   
>   			/* Allocate PTE in the user page-table */
> -			target_pte = pti_user_pagetable_walk_pte(addr);
> +			target_pte = pti_user_pagetable_walk_pte(addr, late_text);
>   			if (WARN_ON(!target_pte))
>   				return;
>   
> @@ -452,7 +457,7 @@ static void __init pti_clone_user_shared(void)
>   		phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
>   		pte_t *target_pte;
>   
> -		target_pte = pti_user_pagetable_walk_pte(va);
> +		target_pte = pti_user_pagetable_walk_pte(va, false);
>   		if (WARN_ON(!target_pte))
>   			return;
>   
> @@ -475,7 +480,7 @@ static void __init pti_clone_user_shared(void)
>   	start = CPU_ENTRY_AREA_BASE;
>   	end   = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);
>   
> -	pti_clone_pgtable(start, end, PTI_CLONE_PMD);
> +	pti_clone_pgtable(start, end, PTI_CLONE_PMD, false);
>   }
>   #endif /* CONFIG_X86_64 */
>   
> @@ -492,11 +497,11 @@ static void __init pti_setup_espfix64(void)
>   /*
>    * Clone the populated PMDs of the entry text and force it RO.
>    */
> -static void pti_clone_entry_text(void)
> +static void pti_clone_entry_text(bool late)
>   {
>   	pti_clone_pgtable((unsigned long) __entry_text_start,
>   			  (unsigned long) __entry_text_end,
> -			  PTI_LEVEL_KERNEL_IMAGE);
> +			  PTI_LEVEL_KERNEL_IMAGE, late);
>   }
>   
>   /*
> @@ -571,7 +576,7 @@ static void pti_clone_kernel_text(void)
>   	 * pti_set_kernel_image_nonglobal() did to clear the
>   	 * global bit.
>   	 */
> -	pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
> +	pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE, false);
>   
>   	/*
>   	 * pti_clone_pgtable() will set the global bit in any PMDs
> @@ -639,7 +644,7 @@ void __init pti_init(void)
>   	/* Undo all global bits from the init pagetables in head_64.S: */
>   	pti_set_kernel_image_nonglobal();
>   	/* Replace some of the global bits just for shared entry text: */
> -	pti_clone_entry_text();
> +	pti_clone_entry_text(false);
>   	pti_setup_espfix64();
>   	pti_setup_vsyscall();
>   }
> @@ -659,7 +664,7 @@ void pti_finalize(void)
>   	 * We need to clone everything (again) that maps parts of the
>   	 * kernel image.
>   	 */
> -	pti_clone_entry_text();
> +	pti_clone_entry_text(true);
>   	pti_clone_kernel_text();
>   
>   	debug_checkwx_user();
Re: [tip: x86/urgent] x86/mm: Fix pti_clone_entry_text() for i386
Posted by Guenter Roeck 1 year, 6 months ago
On 8/6/24 11:48, Peter Zijlstra wrote:
> On Tue, Aug 06, 2024 at 09:42:23AM -0700, Guenter Roeck wrote:
>> On 8/6/24 08:59, Peter Zijlstra wrote:
>>> On Tue, Aug 06, 2024 at 05:46:53PM +0200, Peter Zijlstra wrote:
>>>> On Tue, Aug 06, 2024 at 05:05:15PM +0200, Peter Zijlstra wrote:
>>>>> On Tue, Aug 06, 2024 at 04:56:32PM +0200, Peter Zijlstra wrote:
>>>>>> On Tue, Aug 06, 2024 at 07:25:42AM -0700, Guenter Roeck wrote:
>>>>>>
>>>>>>> I created http://server.roeck-us.net/qemu/x86-v6.11-rc2/ with all
>>>>>>> the relevant information. Please let me know if you need anything else.
>>>>>>
>>>>>> So I grabbed that config, stuck it in the build dir I used last time and
>>>>>> upgraded gcc-13 from 13.2 ro 13.3. But alas, my build runs successfully
>>>>>> :/
>>>>>>
>>>>>> Is there anything else special I missed?
>>>>>
>>>>> run.sh is not exacrlty the same this time, different CPU model, that
>>>>> made it go.
>>>>>
>>>>> OK, lemme poke at this.
>>>>
>>>> Urgh, so crypto's late_initcall() does user-mode-helper based modprobe
>>>> looking for algorithms before we kick off /bin/init :/
>>>>
>>>> This makes things difficult.
>>>>
>>>> Urgh.
>>>
>>> So the problem is that mark_readonly() splits a code PMD due to NX. Then
>>> the second pti_clone_entry_text() finds a kernel PTE but a user PMD
>>> mapping for the same address (from the early clone) and gets upset.
>>>
>>> And we can't run mark_readonly() sooner, because initcall expect stuff
>>> to be RW. But initcalls do modprobe, which runs user crap before we're
>>> done initializing everything.
>>>
>>> This is a right mess, and I really don't know what to do.
>>
>> And there was me thinking this one should be easy to solve. Oh well.
>>
>> Maybe Linus has an idea ? I am getting a bit wary to reporting all those
>> weird problems to him, though.
> 
> While I had dinner with the family, Thomas cooked up the below, which
> seems to make it happy.
> 

I'll give it a try for all emulations (both x86 and x86_64), just to be sure,
but it does indeed fix the problem for the test with the affected CPU/build.

Thanks!

Guenter

> It basically splits the PMD on the late copy.
> 
> ---
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index bfdf5f45b137..b6ebbc9c23b1 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -241,7 +241,7 @@ static pmd_t *pti_user_pagetable_walk_pmd(unsigned long address)
>    *
>    * Returns a pointer to a PTE on success, or NULL on failure.
>    */
> -static pte_t *pti_user_pagetable_walk_pte(unsigned long address)
> +static pte_t *pti_user_pagetable_walk_pte(unsigned long address, bool late_text)
>   {
>   	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
>   	pmd_t *pmd;
> @@ -251,10 +251,15 @@ static pte_t *pti_user_pagetable_walk_pte(unsigned long address)
>   	if (!pmd)
>   		return NULL;
>   
> -	/* We can't do anything sensible if we hit a large mapping. */
> +	/* Large PMD mapping found */
>   	if (pmd_leaf(*pmd)) {
> -		WARN_ON(1);
> -		return NULL;
> +		/* Clear the PMD if we hit a large mapping from the first round */
> +		if (late_text) {
> +			set_pmd(pmd, __pmd(0));
> +		} else {
> +			WARN_ON_ONCE(1);
> +			return NULL;
> +		}
>   	}
>   
>   	if (pmd_none(*pmd)) {
> @@ -283,7 +288,7 @@ static void __init pti_setup_vsyscall(void)
>   	if (!pte || WARN_ON(level != PG_LEVEL_4K) || pte_none(*pte))
>   		return;
>   
> -	target_pte = pti_user_pagetable_walk_pte(VSYSCALL_ADDR);
> +	target_pte = pti_user_pagetable_walk_pte(VSYSCALL_ADDR, false);
>   	if (WARN_ON(!target_pte))
>   		return;
>   
> @@ -301,7 +306,7 @@ enum pti_clone_level {
>   
>   static void
>   pti_clone_pgtable(unsigned long start, unsigned long end,
> -		  enum pti_clone_level level)
> +		  enum pti_clone_level level, bool late_text)
>   {
>   	unsigned long addr;
>   
> @@ -390,7 +395,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>   				return;
>   
>   			/* Allocate PTE in the user page-table */
> -			target_pte = pti_user_pagetable_walk_pte(addr);
> +			target_pte = pti_user_pagetable_walk_pte(addr, late_text);
>   			if (WARN_ON(!target_pte))
>   				return;
>   
> @@ -452,7 +457,7 @@ static void __init pti_clone_user_shared(void)
>   		phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
>   		pte_t *target_pte;
>   
> -		target_pte = pti_user_pagetable_walk_pte(va);
> +		target_pte = pti_user_pagetable_walk_pte(va, false);
>   		if (WARN_ON(!target_pte))
>   			return;
>   
> @@ -475,7 +480,7 @@ static void __init pti_clone_user_shared(void)
>   	start = CPU_ENTRY_AREA_BASE;
>   	end   = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);
>   
> -	pti_clone_pgtable(start, end, PTI_CLONE_PMD);
> +	pti_clone_pgtable(start, end, PTI_CLONE_PMD, false);
>   }
>   #endif /* CONFIG_X86_64 */
>   
> @@ -492,11 +497,11 @@ static void __init pti_setup_espfix64(void)
>   /*
>    * Clone the populated PMDs of the entry text and force it RO.
>    */
> -static void pti_clone_entry_text(void)
> +static void pti_clone_entry_text(bool late)
>   {
>   	pti_clone_pgtable((unsigned long) __entry_text_start,
>   			  (unsigned long) __entry_text_end,
> -			  PTI_LEVEL_KERNEL_IMAGE);
> +			  PTI_LEVEL_KERNEL_IMAGE, late);
>   }
>   
>   /*
> @@ -571,7 +576,7 @@ static void pti_clone_kernel_text(void)
>   	 * pti_set_kernel_image_nonglobal() did to clear the
>   	 * global bit.
>   	 */
> -	pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
> +	pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE, false);
>   
>   	/*
>   	 * pti_clone_pgtable() will set the global bit in any PMDs
> @@ -639,7 +644,7 @@ void __init pti_init(void)
>   	/* Undo all global bits from the init pagetables in head_64.S: */
>   	pti_set_kernel_image_nonglobal();
>   	/* Replace some of the global bits just for shared entry text: */
> -	pti_clone_entry_text();
> +	pti_clone_entry_text(false);
>   	pti_setup_espfix64();
>   	pti_setup_vsyscall();
>   }
> @@ -659,7 +664,7 @@ void pti_finalize(void)
>   	 * We need to clone everything (again) that maps parts of the
>   	 * kernel image.
>   	 */
> -	pti_clone_entry_text();
> +	pti_clone_entry_text(true);
>   	pti_clone_kernel_text();
>   
>   	debug_checkwx_user();
[tip: x86/urgent] x86/mm: Fix PTI for i386 some more
Posted by tip-bot2 for Thomas Gleixner 1 year, 6 months ago
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     c48b5a4cf3125adb679e28ef093f66ff81368d05
Gitweb:        https://git.kernel.org/tip/c48b5a4cf3125adb679e28ef093f66ff81368d05
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 06 Aug 2024 20:48:43 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 07 Aug 2024 15:35:01 +02:00

x86/mm: Fix PTI for i386 some more

So it turns out that we have to do two passes of
pti_clone_entry_text(), once before initcalls, such that device and
late initcalls can use user-mode-helper / modprobe and once after
free_initmem() / mark_readonly().

Now obviously mark_readonly() can cause PMD splits, and
pti_clone_pgtable() doesn't like that much.

Allow the late clone to split PMDs so that pagetables stay in sync.

[peterz: Changelog and comments]
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lkml.kernel.org/r/20240806184843.GX37996@noisy.programming.kicks-ass.net
---
 arch/x86/mm/pti.c | 45 +++++++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index bfdf5f4..851ec8f 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -241,7 +241,7 @@ static pmd_t *pti_user_pagetable_walk_pmd(unsigned long address)
  *
  * Returns a pointer to a PTE on success, or NULL on failure.
  */
-static pte_t *pti_user_pagetable_walk_pte(unsigned long address)
+static pte_t *pti_user_pagetable_walk_pte(unsigned long address, bool late_text)
 {
 	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
 	pmd_t *pmd;
@@ -251,10 +251,15 @@ static pte_t *pti_user_pagetable_walk_pte(unsigned long address)
 	if (!pmd)
 		return NULL;
 
-	/* We can't do anything sensible if we hit a large mapping. */
+	/* Large PMD mapping found */
 	if (pmd_leaf(*pmd)) {
-		WARN_ON(1);
-		return NULL;
+		/* Clear the PMD if we hit a large mapping from the first round */
+		if (late_text) {
+			set_pmd(pmd, __pmd(0));
+		} else {
+			WARN_ON_ONCE(1);
+			return NULL;
+		}
 	}
 
 	if (pmd_none(*pmd)) {
@@ -283,7 +288,7 @@ static void __init pti_setup_vsyscall(void)
 	if (!pte || WARN_ON(level != PG_LEVEL_4K) || pte_none(*pte))
 		return;
 
-	target_pte = pti_user_pagetable_walk_pte(VSYSCALL_ADDR);
+	target_pte = pti_user_pagetable_walk_pte(VSYSCALL_ADDR, false);
 	if (WARN_ON(!target_pte))
 		return;
 
@@ -301,7 +306,7 @@ enum pti_clone_level {
 
 static void
 pti_clone_pgtable(unsigned long start, unsigned long end,
-		  enum pti_clone_level level)
+		  enum pti_clone_level level, bool late_text)
 {
 	unsigned long addr;
 
@@ -390,7 +395,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
 				return;
 
 			/* Allocate PTE in the user page-table */
-			target_pte = pti_user_pagetable_walk_pte(addr);
+			target_pte = pti_user_pagetable_walk_pte(addr, late_text);
 			if (WARN_ON(!target_pte))
 				return;
 
@@ -452,7 +457,7 @@ static void __init pti_clone_user_shared(void)
 		phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
 		pte_t *target_pte;
 
-		target_pte = pti_user_pagetable_walk_pte(va);
+		target_pte = pti_user_pagetable_walk_pte(va, false);
 		if (WARN_ON(!target_pte))
 			return;
 
@@ -475,7 +480,7 @@ static void __init pti_clone_user_shared(void)
 	start = CPU_ENTRY_AREA_BASE;
 	end   = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);
 
-	pti_clone_pgtable(start, end, PTI_CLONE_PMD);
+	pti_clone_pgtable(start, end, PTI_CLONE_PMD, false);
 }
 #endif /* CONFIG_X86_64 */
 
@@ -492,11 +497,11 @@ static void __init pti_setup_espfix64(void)
 /*
  * Clone the populated PMDs of the entry text and force it RO.
  */
-static void pti_clone_entry_text(void)
+static void pti_clone_entry_text(bool late)
 {
 	pti_clone_pgtable((unsigned long) __entry_text_start,
 			  (unsigned long) __entry_text_end,
-			  PTI_LEVEL_KERNEL_IMAGE);
+			  PTI_LEVEL_KERNEL_IMAGE, late);
 }
 
 /*
@@ -571,7 +576,7 @@ static void pti_clone_kernel_text(void)
 	 * pti_set_kernel_image_nonglobal() did to clear the
 	 * global bit.
 	 */
-	pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
+	pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE, false);
 
 	/*
 	 * pti_clone_pgtable() will set the global bit in any PMDs
@@ -638,8 +643,15 @@ void __init pti_init(void)
 
 	/* Undo all global bits from the init pagetables in head_64.S: */
 	pti_set_kernel_image_nonglobal();
+
 	/* Replace some of the global bits just for shared entry text: */
-	pti_clone_entry_text();
+	/*
+	 * This is very early in boot. Device and Late initcalls can do
+	 * modprobe before free_initmem() and mark_readonly(). This
+	 * pti_clone_entry_text() allows those user-mode-helpers to function,
+	 * but notably the text is still RW.
+	 */
+	pti_clone_entry_text(false);
 	pti_setup_espfix64();
 	pti_setup_vsyscall();
 }
@@ -656,10 +668,11 @@ void pti_finalize(void)
 	if (!boot_cpu_has(X86_FEATURE_PTI))
 		return;
 	/*
-	 * We need to clone everything (again) that maps parts of the
-	 * kernel image.
+	 * This is after free_initmem() (all initcalls are done) and we've done
+	 * mark_readonly(). Text is now NX which might've split some PMDs
+	 * relative to the early clone.
 	 */
-	pti_clone_entry_text();
+	pti_clone_entry_text(true);
 	pti_clone_kernel_text();
 
 	debug_checkwx_user();