From nobody Sun Feb 8 06:54:32 2026
Message-ID: <20240310163955.216982555@goodmis.org>
Date: Sun, 10 Mar 2024 12:32:19 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
 stable@vger.kernel.org, Linus Torvalds, linke li, Rabin Vincent
Subject: [for-linus][PATCH 1/3] ring-buffer: Fix waking up ring buffer readers
References: <20240310163218.425365963@goodmis.org>
From: "Steven Rostedt (Google)"

A task can wait on a ring buffer for when it fills up to a specific
watermark. The writer will check the minimum watermark that waiters are
waiting for, and if the ring buffer is past that, it will wake up all the
waiters.

The waiters are in a wait loop, and will first check if a signal is
pending and then check if the ring buffer is at the desired level where it
should break out of the loop.

If a file that uses a ring buffer closes, and there are threads waiting on
the ring buffer, it needs to wake up those threads. To do this, a
"wait_index" was used. Before entering the wait loop, the waiter will read
the wait_index. On wakeup, it will check if the wait_index is different
than when it entered the loop, and will exit the loop if it is. The waker
only needs to update the wait_index before waking up the waiters.

This had a couple of bugs: one trivial, and one broken by design.

The trivial bug was that the waiter checked the wait_index after the
schedule() call. It had to be checked between the prepare_to_wait() and
the schedule(), which it was not.

The main bug is that the first check to set the default wait_index will
always be outside the prepare_to_wait() and the schedule(). That's because
ring_buffer_wait() doesn't have enough context to know if it should break
out of the loop.

The loop itself is not needed, because all the callers to
ring_buffer_wait() also have their own loops, and the callers have a
better sense of what the context is to decide whether to break out of the
loop or not.

Just have ring_buffer_wait() block once, and if it gets woken up, exit the
function and let the callers decide what to do next.
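The "block once and let the callers loop" design described above can be
sketched as a userspace analogue (a hypothetical illustration using POSIX
condition variables; none of these names are kernel APIs):

```c
/*
 * Hypothetical userspace analogue of the "block once" design: the
 * wait-once helper sleeps at most one time, and the *caller* owns the
 * retry loop. All names here are illustrative, not kernel code.
 */
#include <pthread.h>
#include <stdbool.h>

struct rb_waiter {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	int             fill_percent;	/* how full the "buffer" is */
};

/* Block at most once; report whether the watermark was hit on wakeup. */
static bool rb_wait_once(struct rb_waiter *w, int full)
{
	bool hit;

	pthread_mutex_lock(&w->lock);
	if (w->fill_percent < full)		   /* check under the lock */
		pthread_cond_wait(&w->cond, &w->lock); /* single sleep, no loop */
	hit = w->fill_percent >= full;		   /* re-check after waking */
	pthread_mutex_unlock(&w->lock);
	return hit;
}

/* Caller-side loop: the caller decides whether to keep waiting. */
static void rb_reader(struct rb_waiter *w, int full)
{
	while (!rb_wait_once(w, full))
		;	/* a real caller would also check signals, EOF, ... */
}
```

A writer would update fill_percent under the lock and broadcast the
condition variable; spurious wakeups are harmless because rb_wait_once()
reports the watermark state and the caller simply retries.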
Link: https://lore.kernel.org/all/CAHk-=whs5MdtNjzFkTyaUy=vHi=qwWgPi0JgTe6OYUYMNSRZfg@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.792933613@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu
Cc: Mark Rutland
Cc: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linke li
Cc: Rabin Vincent
Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice")
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/ring_buffer.c | 139 ++++++++++++++++++-------------------
 1 file changed, 68 insertions(+), 71 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 0699027b4f4c..3400f11286e3 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -384,7 +384,6 @@ struct rb_irq_work {
 	struct irq_work			work;
 	wait_queue_head_t		waiters;
 	wait_queue_head_t		full_waiters;
-	long				wait_index;
 	bool				waiters_pending;
 	bool				full_waiters_pending;
 	bool				wakeup_full;
@@ -798,14 +797,40 @@ void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu)
 		rbwork = &cpu_buffer->irq_work;
 	}
 
-	rbwork->wait_index++;
-	/* make sure the waiters see the new index */
-	smp_wmb();
-
 	/* This can be called in any context */
 	irq_work_queue(&rbwork->work);
 }
 
+static bool rb_watermark_hit(struct trace_buffer *buffer, int cpu, int full)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	bool ret = false;
+
+	/* Reads of all CPUs always waits for any data */
+	if (cpu == RING_BUFFER_ALL_CPUS)
+		return !ring_buffer_empty(buffer);
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	if (!ring_buffer_empty_cpu(buffer, cpu)) {
+		unsigned long flags;
+		bool pagebusy;
+
+		if (!full)
+			return true;
+
+		raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+		pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page;
+		ret = !pagebusy && full_hit(buffer, cpu, full);
+
+		if (!cpu_buffer->shortest_full ||
+		    cpu_buffer->shortest_full > full)
+			cpu_buffer->shortest_full = full;
+
+		raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+	}
+	return ret;
+}
+
 /**
  * ring_buffer_wait - wait for input to the ring buffer
  * @buffer: buffer to wait on
@@ -821,7 +846,6 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
 	struct ring_buffer_per_cpu *cpu_buffer;
 	DEFINE_WAIT(wait);
 	struct rb_irq_work *work;
-	long wait_index;
 	int ret = 0;
 
 	/*
@@ -840,81 +864,54 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
 		work = &cpu_buffer->irq_work;
 	}
 
-	wait_index = READ_ONCE(work->wait_index);
-
-	while (true) {
-		if (full)
-			prepare_to_wait(&work->full_waiters, &wait, TASK_INTERRUPTIBLE);
-		else
-			prepare_to_wait(&work->waiters, &wait, TASK_INTERRUPTIBLE);
-
-		/*
-		 * The events can happen in critical sections where
-		 * checking a work queue can cause deadlocks.
-		 * After adding a task to the queue, this flag is set
-		 * only to notify events to try to wake up the queue
-		 * using irq_work.
-		 *
-		 * We don't clear it even if the buffer is no longer
-		 * empty. The flag only causes the next event to run
-		 * irq_work to do the work queue wake up. The worse
-		 * that can happen if we race with !trace_empty() is that
-		 * an event will cause an irq_work to try to wake up
-		 * an empty queue.
-		 *
-		 * There's no reason to protect this flag either, as
-		 * the work queue and irq_work logic will do the necessary
-		 * synchronization for the wake ups. The only thing
-		 * that is necessary is that the wake up happens after
-		 * a task has been queued. It's OK for spurious wake ups.
-		 */
-		if (full)
-			work->full_waiters_pending = true;
-		else
-			work->waiters_pending = true;
-
-		if (signal_pending(current)) {
-			ret = -EINTR;
-			break;
-		}
-
-		if (cpu == RING_BUFFER_ALL_CPUS && !ring_buffer_empty(buffer))
-			break;
-
-		if (cpu != RING_BUFFER_ALL_CPUS &&
-		    !ring_buffer_empty_cpu(buffer, cpu)) {
-			unsigned long flags;
-			bool pagebusy;
-			bool done;
-
-			if (!full)
-				break;
-
-			raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-			pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page;
-			done = !pagebusy && full_hit(buffer, cpu, full);
+	if (full)
+		prepare_to_wait(&work->full_waiters, &wait, TASK_INTERRUPTIBLE);
+	else
+		prepare_to_wait(&work->waiters, &wait, TASK_INTERRUPTIBLE);
 
-			if (!cpu_buffer->shortest_full ||
-			    cpu_buffer->shortest_full > full)
-				cpu_buffer->shortest_full = full;
-			raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
-			if (done)
-				break;
-		}
+	/*
+	 * The events can happen in critical sections where
+	 * checking a work queue can cause deadlocks.
+	 * After adding a task to the queue, this flag is set
+	 * only to notify events to try to wake up the queue
+	 * using irq_work.
+	 *
+	 * We don't clear it even if the buffer is no longer
+	 * empty. The flag only causes the next event to run
+	 * irq_work to do the work queue wake up. The worse
+	 * that can happen if we race with !trace_empty() is that
+	 * an event will cause an irq_work to try to wake up
+	 * an empty queue.
+	 *
+	 * There's no reason to protect this flag either, as
+	 * the work queue and irq_work logic will do the necessary
+	 * synchronization for the wake ups. The only thing
+	 * that is necessary is that the wake up happens after
+	 * a task has been queued. It's OK for spurious wake ups.
+	 */
+	if (full)
+		work->full_waiters_pending = true;
+	else
+		work->waiters_pending = true;
 
-		schedule();
+	if (rb_watermark_hit(buffer, cpu, full))
+		goto out;
 
-		/* Make sure to see the new wait index */
-		smp_rmb();
-		if (wait_index != work->wait_index)
-			break;
+	if (signal_pending(current)) {
+		ret = -EINTR;
+		goto out;
 	}
 
+	schedule();
+ out:
 	if (full)
 		finish_wait(&work->full_waiters, &wait);
 	else
 		finish_wait(&work->waiters, &wait);
 
+	if (!ret && !rb_watermark_hit(buffer, cpu, full) && signal_pending(current))
+		ret = -EINTR;
+
 	return ret;
 }
 
-- 
2.43.0

From nobody Sun Feb 8 06:54:32 2026
Message-ID: <20240310163955.375509349@goodmis.org>
Date: Sun, 10 Mar 2024 12:32:20 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
 stable@vger.kernel.org, Linus Torvalds, linke li, Rabin Vincent
Subject: [for-linus][PATCH 2/3] ring-buffer: Fix resetting of shortest_full
References: <20240310163218.425365963@goodmis.org>

From: "Steven Rostedt (Google)"

The "shortest_full" variable is used to keep track of the waiter that is
waiting for the smallest amount on the ring buffer before being woken up.
When a task waits on the ring buffer, it passes in a "full" value that is
a percentage. 0 means wake up on any data. 1-100 means wake up from 1% to
100% full buffer.

As all waiters are on the same wait queue, the wake up happens for the
waiter with the smallest percentage.

The problem is that the shortest_full on the cpu_buffer that stores the
smallest amount doesn't get reset when all the waiters are woken up. It
does get reset when the ring buffer is reset
(echo > /sys/kernel/tracing/trace). This means that tasks may be woken up
more often than they want to be.

Instead, have the shortest_full field get reset just before waking up all
the tasks. If the tasks wait again, they will update the shortest_full
before sleeping.

Also add locking around setting of shortest_full in the poll logic, and
change "work" to "rbwork" to match the variable name for rb_irq_work
structures that are used in other places.
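The shortest_full bookkeeping and the reset-before-wake fix can be modeled
in plain C (an illustrative, single-threaded userspace model with invented
names; in the kernel the real field lives in ring_buffer_per_cpu and is
protected by reader_lock):

```c
/*
 * Illustrative model of the shortest_full bookkeeping (simplified,
 * lock-free, userspace-only; not the kernel implementation).
 */
#include <stdbool.h>

struct rb_model {
	int shortest_full;	/* 0 = no waiter; else smallest watermark, 1-100 */
};

/* A waiter about to sleep records its watermark if it is the smallest. */
static void rb_record_waiter(struct rb_model *rb, int full)
{
	if (!rb->shortest_full || rb->shortest_full > full)
		rb->shortest_full = full;
}

/* Writer side: wake when the buffer reaches the smallest watermark. */
static bool rb_should_wake(const struct rb_model *rb, int fill_percent)
{
	return rb->shortest_full && fill_percent >= rb->shortest_full;
}

/* The fix: clear shortest_full when waking everyone, so a stale small
 * watermark cannot trigger over-eager wakeups later. Tasks that wait
 * again will record their watermark again before sleeping. */
static void rb_wake_all(struct rb_model *rb)
{
	rb->shortest_full = 0;
}
```

Without the rb_wake_all() reset, a long-gone waiter's small percentage
would keep satisfying rb_should_wake() and wake later waiters too early.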
Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.948914369@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu
Cc: Mark Rutland
Cc: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linke li
Cc: Rabin Vincent
Fixes: 2c2b0a78b3739 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/ring_buffer.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 3400f11286e3..aa332ace108b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -755,8 +755,19 @@ static void rb_wake_up_waiters(struct irq_work *work)
 
 	wake_up_all(&rbwork->waiters);
 	if (rbwork->full_waiters_pending || rbwork->wakeup_full) {
+		/* Only cpu_buffer sets the above flags */
+		struct ring_buffer_per_cpu *cpu_buffer =
+			container_of(rbwork, struct ring_buffer_per_cpu, irq_work);
+
+		/* Called from interrupt context */
+		raw_spin_lock(&cpu_buffer->reader_lock);
 		rbwork->wakeup_full = false;
 		rbwork->full_waiters_pending = false;
+
+		/* Waking up all waiters, they will reset the shortest full */
+		cpu_buffer->shortest_full = 0;
+		raw_spin_unlock(&cpu_buffer->reader_lock);
+
 		wake_up_all(&rbwork->full_waiters);
 	}
 }
@@ -934,28 +945,33 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
 			  struct file *filp, poll_table *poll_table, int full)
 {
 	struct ring_buffer_per_cpu *cpu_buffer;
-	struct rb_irq_work *work;
+	struct rb_irq_work *rbwork;
 
 	if (cpu == RING_BUFFER_ALL_CPUS) {
-		work = &buffer->irq_work;
+		rbwork = &buffer->irq_work;
 		full = 0;
 	} else {
 		if (!cpumask_test_cpu(cpu, buffer->cpumask))
 			return EPOLLERR;
 
 		cpu_buffer = buffer->buffers[cpu];
-		work = &cpu_buffer->irq_work;
+		rbwork = &cpu_buffer->irq_work;
 	}
 
 	if (full) {
-		poll_wait(filp, &work->full_waiters, poll_table);
-		work->full_waiters_pending = true;
+		unsigned long flags;
+
+		poll_wait(filp, &rbwork->full_waiters, poll_table);
+
+		raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+		rbwork->full_waiters_pending = true;
 		if (!cpu_buffer->shortest_full ||
 		    cpu_buffer->shortest_full > full)
 			cpu_buffer->shortest_full = full;
+		raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
 	} else {
-		poll_wait(filp, &work->waiters, poll_table);
-		work->waiters_pending = true;
+		poll_wait(filp, &rbwork->waiters, poll_table);
+		rbwork->waiters_pending = true;
 	}
 
 	/*
-- 
2.43.0

From nobody Sun Feb 8 06:54:32 2026
Message-ID: <20240310163955.534979305@goodmis.org>
Date: Sun, 10 Mar 2024 12:32:21 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
 stable@vger.kernel.org, Linus Torvalds, linke li, Rabin Vincent
Subject: [for-linus][PATCH 3/3] tracing: Use .flush() call to wake up readers
References: <20240310163218.425365963@goodmis.org>

From: "Steven Rostedt (Google)"

The .release() function does not get called until all readers of a file
descriptor are finished.

If a thread is blocked on reading a file descriptor in ring_buffer_wait(),
and another thread closes the file descriptor, it will not wake up the
other thread, as ring_buffer_wake_waiters() is called by .release(), and
that will not get called until the .read() is finished.

The issue originally showed up in trace-cmd, but the readers are actually
other processes with their own file descriptors. So calling close() would
wake up the other tasks because they are blocked on another descriptor
than the one that was closed. But there are other wake-ups that solve that
issue. When a thread is blocked on a read, it can still hang even when
another thread has closed its descriptor.

This is what the .flush() callback is for. Have the .flush() wake up the
readers.
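The .flush()/.release() timing this patch relies on can be sketched with a
small reference-count model (a userspace illustration with invented names;
flush runs on every close() of a file, while release runs only when the
last reference is gone):

```c
/*
 * Userspace model of VFS callback timing: flush runs on every close()
 * of a descriptor, release only when the last reference drops. A reader
 * blocked in .read() therefore must be woken from flush, because
 * release won't run until that read returns. All names are invented.
 */
struct file_model {
	int refcount;		/* open references to the file */
	int wait_index;		/* bumped to kick blocked waiters */
	int flush_calls;
	int release_calls;
};

static void model_flush(struct file_model *f)
{
	f->flush_calls++;
	f->wait_index++;	/* wake blocked readers now, not at release */
}

static void model_close(struct file_model *f)
{
	model_flush(f);			/* .flush(): every close() */
	if (--f->refcount == 0)
		f->release_calls++;	/* .release(): only the last close() */
}
```

The first close() already bumps wait_index, so a reader blocked on another
reference of the same file gets kicked long before release ever runs.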
Link: https://lore.kernel.org/linux-trace-kernel/20240308202432.107909457@goodmis.org

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu
Cc: Mark Rutland
Cc: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linke li
Cc: Rabin Vincent
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file")
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/trace.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index d16b95ca58a7..c9c898307348 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8393,6 +8393,20 @@ tracing_buffers_read(struct file *filp, char __user *ubuf,
 	return size;
 }
 
+static int tracing_buffers_flush(struct file *file, fl_owner_t id)
+{
+	struct ftrace_buffer_info *info = file->private_data;
+	struct trace_iterator *iter = &info->iter;
+
+	iter->wait_index++;
+	/* Make sure the waiters see the new wait_index */
+	smp_wmb();
+
+	ring_buffer_wake_waiters(iter->array_buffer->buffer, iter->cpu_file);
+
+	return 0;
+}
+
 static int tracing_buffers_release(struct inode *inode, struct file *file)
 {
 	struct ftrace_buffer_info *info = file->private_data;
@@ -8404,12 +8418,6 @@ static int tracing_buffers_release(struct inode *inode, struct file *file)
 
 	__trace_array_put(iter->tr);
 
-	iter->wait_index++;
-	/* Make sure the waiters see the new wait_index */
-	smp_wmb();
-
-	ring_buffer_wake_waiters(iter->array_buffer->buffer, iter->cpu_file);
-
 	if (info->spare)
 		ring_buffer_free_read_page(iter->array_buffer->buffer,
 					   info->spare_cpu, info->spare);
@@ -8625,6 +8633,7 @@ static const struct file_operations tracing_buffers_fops = {
 	.read		= tracing_buffers_read,
 	.poll		= tracing_buffers_poll,
 	.release	= tracing_buffers_release,
+	.flush		= tracing_buffers_flush,
 	.splice_read	= tracing_buffers_splice_read,
 	.unlocked_ioctl	= tracing_buffers_ioctl,
 	.llseek		= no_llseek,
-- 
2.43.0