Hi!

This is basically the 'same' patches as sent earlier by Sebastian here:

https://lkml.kernel.org/r/20230427111937.2745231-1-bigeasy@linutronix.de

I spent a number of days trying to invert rtmutex, only to make a giant mess of
things and finally conceded that this is the least horrible approach.

There's a bunch of naming differences and I added some asserts that should
hopefully avoid things from going sideways without notice. I've also updated
the changelogs to highlight the actual problem. The whole pi_blocked_on
'corruption' is a mere consequence of the more fundamental problem that the
whole PI state recurses.

I've not tested this with the rest of the RT patches stuck on, so very limited
actual testing happened.

If anybody could please confirm stuff still works as advertised, I'll go queue
this mess and we can hopefully forget all about it.
On Tue, Aug 15, 2023 at 01:01:21PM +0200, Peter Zijlstra wrote:
> Hi!
> 
> This is basically the 'same' patches as sent earlier by Sebastian here:
> 
> https://lkml.kernel.org/r/20230427111937.2745231-1-bigeasy@linutronix.de
> 
> I spent a number of days trying to invert rtmutex, only to make a giant mess of
> things and finally conceded that this is the least horrible approach.
> 
> There's a bunch of naming differences and I added some asserts that should
> hopefully avoid things from going sideways without notice. I've also updated
> the changelogs to highlight the actual problem. The whole pi_blocked_on
> 'corruption' is a mere consequence of the more fundamental problem that the
> whole PI state recurses.
> 
> I've not tested this with the rest of the RT patches stuck on, so very limited
> actual testing happened.
> 
> If anybody could please confirm stuff still works as advertised, I'll go queue
> this mess and we can hopefully forget all about it.
> 

N/m - 0day found a problem. Futex-PI trips the rt_mutex_schedule()
assertion for not passing through rt_mutex_pre_schedule().

I'll go try and untangle that...
On 2023-08-15 18:15:57 [+0200], Peter Zijlstra wrote:
> N/m - 0day found a problem. Futex-PI trips the rt_mutex_schedule()
> assertion for not passing through rt_mutex_pre_schedule().
> 
> I'll go try and untangle that...

Is this the same as

| ------------[ cut here ]------------
| WARNING: CPU: 3 PID: 995 at kernel/sched/core.c:7155 rt_mutex_schedule+0x52/0x60
| Modules linked in:
| CPU: 3 PID: 995 Comm: mount Tainted: G W 6.5.0-rc6-rt2+ #228
| Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
| RIP: 0010:rt_mutex_schedule+0x52/0x60
| Code: 00 00 00 e8 b0 80 ff ff 31 ff e8 a9 99 cb 00 bf 01 00 00 00 e8 3f 6d ff ff 48 8b 03 a9 08 00 08 00 75 db 5b e9 6f 82 cc 00 90 <0f> 0b 90 eb c6 66 0f 1f 84 00 00 0
| RSP: 0018:ffffc90001043e28 EFLAGS: 00010246
| RAX: ffff888107ab8040 RBX: 0000000000000202 RCX: 0000000000000000
| RDX: 0000000000000000 RSI: ffffffff82460bfd RDI: 00000000ffffffff
| RBP: ffffffff827e8fe0 R08: 0000000000000001 R09: 0000000000000001
| R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff827e8fe8
| R13: 0000000000000002 R14: 0000000000000000 R15: ffff888107ab8040
| FS: 00007fb36281e840(0000) GS:ffff88817b4c0000(0000) knlGS:0000000000000000
| CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
| CR2: 00007fb3629702a0 CR3: 0000000108210000 CR4: 00000000003506e0
| Call Trace:
|  <TASK>
|  ? __warn+0x90/0x200
|  ? rt_mutex_schedule+0x52/0x60
|  ? report_bug+0x1c5/0x1f0
|  ? handle_bug+0x3b/0x70
|  ? exc_invalid_op+0x13/0x60
|  ? asm_exc_invalid_op+0x16/0x20
|  ? rt_mutex_schedule+0x52/0x60
|  rwbase_write_lock+0xc1/0x230
|  do_lock_mount+0x4c/0x200
|  ? __x86_return_thunk+0x5/0x10
|  path_mount+0x8a7/0xb00
|  __x64_sys_mount+0xf1/0x140
|  do_syscall_64+0x44/0x90
|  entry_SYSCALL_64_after_hwframe+0x74/0xde

Sebastian
On Wed, Aug 16, 2023 at 10:58:26AM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-08-15 18:15:57 [+0200], Peter Zijlstra wrote:
> > N/m - 0day found a problem. Futex-PI trips the rt_mutex_schedule()
> > assertion for not passing through rt_mutex_pre_schedule().
> > 
> > I'll go try and untangle that...
> Is this the same as

Not the same -- this is namespace_lock(), right? That's a regular rwsem
afaict and that *should* be good. Clearly I messed something up.

Thanks!

> | ------------[ cut here ]------------
> | WARNING: CPU: 3 PID: 995 at kernel/sched/core.c:7155 rt_mutex_schedule+0x52/0x60
> | Modules linked in:
> | CPU: 3 PID: 995 Comm: mount Tainted: G W 6.5.0-rc6-rt2+ #228
> | Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> | RIP: 0010:rt_mutex_schedule+0x52/0x60
> | Code: 00 00 00 e8 b0 80 ff ff 31 ff e8 a9 99 cb 00 bf 01 00 00 00 e8 3f 6d ff ff 48 8b 03 a9 08 00 08 00 75 db 5b e9 6f 82 cc 00 90 <0f> 0b 90 eb c6 66 0f 1f 84 00 00 0
> | RSP: 0018:ffffc90001043e28 EFLAGS: 00010246
> | RAX: ffff888107ab8040 RBX: 0000000000000202 RCX: 0000000000000000
> | RDX: 0000000000000000 RSI: ffffffff82460bfd RDI: 00000000ffffffff
> | RBP: ffffffff827e8fe0 R08: 0000000000000001 R09: 0000000000000001
> | R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff827e8fe8
> | R13: 0000000000000002 R14: 0000000000000000 R15: ffff888107ab8040
> | FS: 00007fb36281e840(0000) GS:ffff88817b4c0000(0000) knlGS:0000000000000000
> | CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> | CR2: 00007fb3629702a0 CR3: 0000000108210000 CR4: 00000000003506e0
> | Call Trace:
> |  <TASK>
> |  ? __warn+0x90/0x200
> |  ? rt_mutex_schedule+0x52/0x60
> |  ? report_bug+0x1c5/0x1f0
> |  ? handle_bug+0x3b/0x70
> |  ? exc_invalid_op+0x13/0x60
> |  ? asm_exc_invalid_op+0x16/0x20
> |  ? rt_mutex_schedule+0x52/0x60
> |  rwbase_write_lock+0xc1/0x230
> |  do_lock_mount+0x4c/0x200
> |  ? __x86_return_thunk+0x5/0x10
> |  path_mount+0x8a7/0xb00
> |  __x64_sys_mount+0xf1/0x140
> |  do_syscall_64+0x44/0x90
> |  entry_SYSCALL_64_after_hwframe+0x74/0xde
> 
> Sebastian
On 2023-08-16 11:42:57 [+0200], Peter Zijlstra wrote:
> Not the same -- this is namespace_lock(), right? That's a regular rwsem
> afaict and that *should* be good. Clearly I messed something up.

Most likely. I do see it also from inode_lock() which does down_write().
I see it only to originate from rwbase_write_lock().

> Thanks!

Sebastian
On 2023-08-16 12:19:04 [+0200], To Peter Zijlstra wrote:
> On 2023-08-16 11:42:57 [+0200], Peter Zijlstra wrote:
> > Not the same -- this is namespace_lock(), right? That's a regular rwsem
> > afaict and that *should* be good. Clearly I messed something up.
> 
> Most likely. I do see it also from inode_lock() which does down_write().
> I see it only to originate from rwbase_write_lock().

I've been looking at what you did and what we had.
I'm not sure if your additional debug/assert code figured it out or me
looking at it, but in rwbase_write_lock() for down_write(), we had this
beauty with a comment that you made go away:

| * Take the rtmutex as a first step. For rwsem this will also
| * invoke sched_submit_work() to flush IO and workers.
| */
| if (rwbase_rtmutex_lock_state(rtm, state))

for rw_semaphore we don't have any explicit rwbase_sched_submit_work()
but relied on this one. Now that I look at it again,
rwbase_rtmutex_lock_state() can succeed in the fast path so we don't
flush / invoke rwbase_pre_schedule().
So you rightfully removed the comment as it was misleading but we do
need that rwbase_pre_schedule() thingy before
raw_spin_lock_irqsave(&rtm->wait_lock).

Sebastian
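To make the ordering problem concrete, here is a minimal stand-alone userspace
sketch of it (an illustration only: the helpers below merely model the
"block/IO work has been flushed" state, they are not the kernel
implementations, and the fast_path flag stands in for the rtmutex fast path
succeeding):

#include <stdbool.h>
#include <stdio.h>

static bool sched_work_flushed;		/* models "rt_mutex_pre_schedule() done" */

static void rt_mutex_pre_schedule(void)  { sched_work_flushed = true;  }
static void rt_mutex_post_schedule(void) { sched_work_flushed = false; }

/* models the warning in rt_mutex_schedule() seen in the splat above */
static void rt_mutex_schedule(const char *who)
{
	if (!sched_work_flushed)
		printf("WARN: %s blocks without pre_schedule()\n", who);
}

/* models rwbase_rtmutex_lock_state(): the fast path never flushes */
static void rwbase_rtmutex_lock_state(bool fast_path)
{
	if (fast_path)
		return;				/* lock acquired, nothing flushed */
	rt_mutex_pre_schedule();
	rt_mutex_schedule("rtmutex slowlock");	/* fine: flush was done */
	rt_mutex_post_schedule();		/* ...and is undone again here */
}

/* models rwbase_write_lock(): the writer may still block waiting for readers */
static void rwbase_write_lock(bool rtm_fast_path)
{
	rwbase_rtmutex_lock_state(rtm_fast_path);
	rt_mutex_schedule("writer waiting for readers");
}

int main(void)
{
	rwbase_write_lock(true);	/* fast path taken: warns */
	rwbase_write_lock(false);	/* slow path taken: warns too, flag was reset */
	return 0;
}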
On Wed, Aug 16, 2023 at 03:46:30PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-08-16 12:19:04 [+0200], To Peter Zijlstra wrote:
> > On 2023-08-16 11:42:57 [+0200], Peter Zijlstra wrote:
> > > Not the same -- this is namespace_lock(), right? That's a regular rwsem
> > > afaict and that *should* be good. Clearly I messed something up.
> >
> > Most likely. I do see it also from inode_lock() which does down_write().
> > I see it only to originate from rwbase_write_lock().
>
> I've been looking at what you did and what we had.
> I'm not sure if your additional debug/assert code figured it out or me
> looking at it, but in rwbase_write_lock() for down_write(), we had this
> beauty with a comment that you made go away:
>
> | * Take the rtmutex as a first step. For rwsem this will also
> | * invoke sched_submit_work() to flush IO and workers.
> | */
> | if (rwbase_rtmutex_lock_state(rtm, state))
>
Yeah, I can't quite remember why I killed that comment, I think because
it was either 'obvious' or confusing at the time. Or perhaps I was too
lazy to type... :-)
> for rw_semaphore we don't have any explicit rwbase_sched_submit_work()
> but relied on this one. Now that I look at it again,
> rwbase_rtmutex_lock_state() can succeed in the fast path so we don't
> flush / invoke rwbase_pre_schedule().
> So you rightfully removed the comment as it was misleading but we do
> need that rwbase_pre_schedule() thingy before
> raw_spin_lock_irqsave(&rtm->wait_lock).
Right, it's both the fast-path and the fact that rt_mutex_slowlock()
will also do post_schedule() and reset the flag.
I've ended up with the below, but it is quite horrible.. but let me go
stare at the futex wreckage before trying to clean things up.
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -270,6 +270,7 @@ static int __sched rwbase_write_lock(str
 
 out_unlock:
 	raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
+	rt_mutex_post_schedule();
 	return 0;
 }
 
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1412,8 +1412,30 @@ static inline void __downgrade_write(str
 #define rwbase_restore_current_state()	\
 	__set_current_state(TASK_RUNNING)
 
-#define rwbase_rtmutex_lock_state(rtm, state)	\
-	__rt_mutex_lock(rtm, state)
+/*
+ * Variant of __rt_mutex_lock() that unconditionally does
+ * rt_mutex_pre_schedule() and keeps it on success.
+ */
+static __always_inline int
+rwbase_rtmutex_lock_state(struct rt_mutex_base *lock, unsigned int state)
+{
+	unsigned long flags;
+	int ret;
+
+	rt_mutex_pre_schedule();
+
+	if (likely(rt_mutex_try_acquire(lock)))
+		return 0;
+
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
+	ret = __rt_mutex_slowlock_locked(lock, NULL, state);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+
+	if (ret)
+		rt_mutex_post_schedule();
+
+	return ret;
+}
 
 #define rwbase_rtmutex_slowlock_locked(rtm, state)	\
 	__rt_mutex_slowlock_locked(rtm, NULL, state)
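In the terms of the toy sketch earlier in the thread, this variant arms the
flush inside the lock-state helper itself and leaves it armed on success, so
the caller's out_unlock path (the first hunk above) has to drop it again. A
sketch of that shape, reusing the toy helpers from above (the _v2 names are
made up for illustration):

static int rwbase_rtmutex_lock_state_v2(bool fast_path, bool interrupted)
{
	rt_mutex_pre_schedule();		/* unconditional, fast path included */

	if (fast_path)
		return 0;			/* success: flush stays armed for the caller */

	if (interrupted) {
		rt_mutex_post_schedule();	/* failure: caller never takes the lock */
		return -1;			/* stands in for -EINTR */
	}

	rt_mutex_schedule("rtmutex slowlock");	/* no warning: flush is armed */
	return 0;				/* success: flush stays armed */
}

static int rwbase_write_lock_v2(bool rtm_fast_path)
{
	if (rwbase_rtmutex_lock_state_v2(rtm_fast_path, false))
		return -1;

	rt_mutex_schedule("writer waiting for readers");	/* no warning either */

	rt_mutex_post_schedule();		/* dropped on the out_unlock path */
	return 0;
}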
On 2023-08-16 16:58:18 [+0200], Peter Zijlstra wrote:
> I've ended up with the below, but it is quite horrible.. but let me go
> stare at the futex wreckage before trying to clean things up.
What about
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index f8a194e7ec9e9..b5e881250fec5 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -241,6 +241,8 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 	/* Force readers into slow path */
 	atomic_sub(READER_BIAS, &rwb->readers);
 
+	rt_mutex_pre_schedule();
+
 	raw_spin_lock_irqsave(&rtm->wait_lock, flags);
 	if (__rwbase_write_trylock(rwb))
 		goto out_unlock;
@@ -252,6 +254,7 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 		if (rwbase_signal_pending_state(state, current)) {
 			rwbase_restore_current_state();
 			__rwbase_write_unlock(rwb, 0, flags);
+			rt_mutex_post_schedule();
 			trace_contention_end(rwb, -EINTR);
 			return -EINTR;
 		}
@@ -270,6 +273,7 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 
 out_unlock:
 	raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
+	rt_mutex_post_schedule();
 	return 0;
 }
Sebastian
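In the same toy terms, this placement arms the flush unconditionally before
the writer can block on rtm->wait_lock and drops it on both the -EINTR unwind
and the out_unlock path, while the rtmutex step above it keeps its own
balanced pairing. A sketch, again reusing the helpers from the first sketch
(signal_pending stands in for rwbase_signal_pending_state()):

static int rwbase_write_lock_fixed(bool rtm_fast_path, bool signal_pending)
{
	/* unmodified first step: balanced on its own, may or may not flush */
	rwbase_rtmutex_lock_state(rtm_fast_path);

	/* unconditional, before rtm->wait_lock: also covers the rtmutex fast path */
	rt_mutex_pre_schedule();

	if (signal_pending) {
		rt_mutex_post_schedule();	/* models the __rwbase_write_unlock() unwind */
		return -1;			/* stands in for -EINTR */
	}

	/* models blocking in rwbase_schedule() while waiting for readers */
	rt_mutex_schedule("writer waiting for readers");	/* no warning now */

	rt_mutex_post_schedule();		/* models the out_unlock path */
	return 0;
}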
On Thu, Aug 17, 2023 at 08:59:50AM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-08-16 16:58:18 [+0200], Peter Zijlstra wrote:
> > I've ended up with the below, but it is quite horrible.. but let me go
> > stare at the futex wreckage before trying to clean things up.
>
> What about
Ah, of course, that's much nicer. I got hung up on that
rwbase_rtmutex_lock_state() thing :/
> diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
> index f8a194e7ec9e9..b5e881250fec5 100644
> --- a/kernel/locking/rwbase_rt.c
> +++ b/kernel/locking/rwbase_rt.c
> @@ -241,6 +241,8 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
>  	/* Force readers into slow path */
>  	atomic_sub(READER_BIAS, &rwb->readers);
> 
> +	rt_mutex_pre_schedule();
> +
>  	raw_spin_lock_irqsave(&rtm->wait_lock, flags);
>  	if (__rwbase_write_trylock(rwb))
>  		goto out_unlock;
> @@ -252,6 +254,7 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
>  		if (rwbase_signal_pending_state(state, current)) {
>  			rwbase_restore_current_state();
>  			__rwbase_write_unlock(rwb, 0, flags);
> +			rt_mutex_post_schedule();
>  			trace_contention_end(rwb, -EINTR);
>  			return -EINTR;
>  		}
> @@ -270,6 +273,7 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
>
>  out_unlock:
>  	raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
> +	rt_mutex_post_schedule();
>  	return 0;
>  }
>
> Sebastian
On Wed, Aug 16, 2023 at 04:58:18PM +0200, Peter Zijlstra wrote:

> I've ended up with the below, but it is quite horrible.. but let me go
> stare at the futex wreckage before trying to clean things up.

OK, I think the below covers the simple case, now let's see if I can make
sense of futex_wait_requeue_pi()... :/

---
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1002,6 +1002,12 @@ int futex_lock_pi(u32 __user *uaddr, uns
 		goto no_block;
 	}
 
+	/*
+	 * Must be done before we enqueue the waiter, here is unfortunately
+	 * under the hb lock, but that *should* work.
+	 */
+	rt_mutex_pre_schedule();
+
 	rt_mutex_init_waiter(&rt_waiter);
 
 	/*
@@ -1052,6 +1058,10 @@ int futex_lock_pi(u32 __user *uaddr, uns
 	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
 		ret = 0;
 
+	/*
+	 * Waiter is unqueued.
+	 */
+	rt_mutex_post_schedule();
 no_block:
 	/*
 	 * Fixup the pi_state owner and possibly acquire the lock if we
On 2023-08-16 17:22:31 [+0200], Peter Zijlstra wrote:
> On Wed, Aug 16, 2023 at 04:58:18PM +0200, Peter Zijlstra wrote:
> 
> > I've ended up with the below, but it is quite horrible.. but let me go
> > stare at the futex wreckage before trying to clean things up.
> 
> OK, I think the below covers the simple case, now let's see if I can make
> sense of futex_wait_requeue_pi()... :/
> 
> ---
> --- a/kernel/futex/pi.c
> +++ b/kernel/futex/pi.c
> @@ -1002,6 +1002,12 @@ int futex_lock_pi(u32 __user *uaddr, uns
>  		goto no_block;
>  	}
> 
> +	/*
> +	 * Must be done before we enqueue the waiter, here is unfortunately
> +	 * under the hb lock, but that *should* work.
> +	 */
> +	rt_mutex_pre_schedule();

but this will do sched_submit_work(), which you don't need for futex at
all. It should be a nop (as in nothing will happen) but still.

>  	rt_mutex_init_waiter(&rt_waiter);
> 
>  	/*

Sebastian