Message-ID: <20231229211338.768648820@goodmis.org>
Date: Fri, 29 Dec 2023 16:13:15 -0500
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
    Jiri Olsa, stable@vger.kernel.org
Subject: [for-linus][PATCH 1/3] ring-buffer: Fix wake ups when buffer_percent is set to 100
References: <20231229211314.081907608@goodmis.org>

From: "Steven Rostedt (Google)"

The tracefs file "buffer_percent" allows user space to set a watermark
for how much of the tracing ring buffer must be filled before a blocked
reader is woken up:

  0   - wake up as soon as any data is in the buffer
  1   - wait for 1% of the sub buffers to be filled
  50  - wait for half of the sub buffers to be filled with data
  100 - do not wake the waiter until the ring buffer is completely full

Unfortunately the test for being full was:

	dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
	return (dirty * 100) > (full * nr_pages);

where "full" is the value of "buffer_percent". There are two issues with
the above when full == 100:

 1. "dirty * 100 > 100 * nr_pages" will never be true. That is, the
    check is basically saying that if the user sets buffer_percent to
    100, more pages need to be dirty than exist in the ring buffer!

 2. The sub-buffer that the writer is on is never considered dirty, as
    dirty pages are only those that are full. When the writer moves to a
    new sub-buffer, it clears the contents of that sub-buffer. So even
    if the check were ">=", it still could not match, because the most
    pages that can ever be counted as "dirty" is nr_pages - 1.

To fix this, add one to dirty and use ">=" in the compare.
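To see the arithmetic, here is a minimal user-space sketch of the old and
new checks (not the kernel code; the values are made up). With full == 100
and at most nr_pages - 1 dirty sub-buffers, only the fixed check can ever
wake the waiter:

	/* Stand-alone sketch of the watermark check before and after the fix. */
	#include <stdio.h>
	#include <stdbool.h>

	/* Old check: with full == 100 this can never be true */
	static bool full_hit_old(int dirty, int full, int nr_pages)
	{
		return (dirty * 100) > (full * nr_pages);
	}

	/* Fixed check: count the writer's sub-buffer and compare with ">=" */
	static bool full_hit_new(int dirty, int full, int nr_pages)
	{
		return ((dirty + 1) * 100) >= (full * nr_pages);
	}

	int main(void)
	{
		int nr_pages = 8;		/* sub-buffers in the ring buffer */
		int dirty = nr_pages - 1;	/* the most that can ever be dirty */

		printf("old: %d new: %d\n",
		       full_hit_old(dirty, 100, nr_pages),	/* 0: waiter never wakes */
		       full_hit_new(dirty, 100, nr_pages));	/* 1: waiter wakes at 100 */
		return 0;
	}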
Link: https://lore.kernel.org/linux-trace-kernel/20231226125902.4a057f1d@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland
Cc: Mathieu Desnoyers
Acked-by: Masami Hiramatsu (Google)
Fixes: 03329f9939781 ("tracing: Add tracefs file buffer_percentage")
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/ring_buffer.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 83eab547f1d1..32c0dd2fd1c3 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -881,9 +881,14 @@ static __always_inline bool full_hit(struct trace_buffer *buffer, int cpu, int f
 	if (!nr_pages || !full)
 		return true;
 
-	dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
+	/*
+	 * Add one as dirty will never equal nr_pages, as the sub-buffer
+	 * that the writer is on is not counted as dirty.
+	 * This is needed if "buffer_percent" is set to 100.
+	 */
+	dirty = ring_buffer_nr_dirty_pages(buffer, cpu) + 1;
 
-	return (dirty * 100) > (full * nr_pages);
+	return (dirty * 100) >= (full * nr_pages);
 }
 
 /*
-- 
2.42.0


Message-ID: <20231229211338.928136124@goodmis.org>
Date: Fri, 29 Dec 2023 16:13:16 -0500
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
    Jiri Olsa, stable@vger.kernel.org
Subject: [for-linus][PATCH 2/3] tracing: Fix blocked reader of snapshot buffer
References: <20231229211314.081907608@goodmis.org>

From: "Steven Rostedt (Google)"

If an application blocks on the snapshot or snapshot_raw files, expecting
to be woken up when a snapshot occurs, it will not happen. Or it may
happen with an unexpected result.

That result is that the application will be reading the main buffer
instead of the snapshot buffer. That is because when the snapshot occurs,
the main and snapshot buffers are swapped, but the reader's descriptor
still points to the buffer it originally opened.

This is fine for readers of the main buffer, as they may be blocked
waiting for a watermark to be hit, and when a snapshot occurs, the data
the main readers want is now in the snapshot buffer.

But waiters on the snapshot buffer are waiting for an event to trigger
the snapshot so that they can quickly consume and save it before the
next snapshot occurs. To do this, they need to read the new snapshot
buffer, not the old one that is now receiving new data.

Also, it does not make sense to apply the "buffer_percent" watermark to
the snapshot buffer, as the snapshot buffer is static and receives its
data all at once.
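For reference, a rough user-space sketch of the affected use case (the
tracefs path is an assumption and error handling is omitted): a reader
blocks on a per-CPU snapshot_raw file and expects to be woken up, and to
see the snapshot data, once a snapshot swaps the buffers.

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		char page[4096];
		ssize_t r;
		/* Assumed tracefs mount point and per-CPU snapshot_raw file */
		int fd = open("/sys/kernel/tracing/per_cpu/cpu0/snapshot_raw", O_RDONLY);

		if (fd < 0)
			return 1;

		/* Blocks until a snapshot occurs; before this fix the wake up
		 * could be missed, or the read could land on what is now the
		 * main buffer instead of the snapshot buffer.
		 */
		while ((r = read(fd, page, sizeof(page))) > 0)
			write(STDOUT_FILENO, page, r);

		close(fd);
		return 0;
	}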
Link: https://lore.kernel.org/linux-trace-kernel/20231228095149.77f5b45d@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers
Cc: Mark Rutland
Acked-by: Masami Hiramatsu (Google)
Fixes: debdd57f5145f ("tracing: Make a snapshot feature available from userspace")
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/ring_buffer.c |  3 ++-
 kernel/trace/trace.c       | 20 +++++++++++++++++---
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 32c0dd2fd1c3..9286f88fcd32 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -949,7 +949,8 @@ void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu)
 	/* make sure the waiters see the new index */
 	smp_wmb();
 
-	rb_wake_up_waiters(&rbwork->work);
+	/* This can be called in any context */
+	irq_work_queue(&rbwork->work);
 }
 
 /**
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 199df497db07..a0defe156b57 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1894,6 +1894,9 @@ update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu,
 	__update_max_tr(tr, tsk, cpu);
 
 	arch_spin_unlock(&tr->max_lock);
+
+	/* Any waiters on the old snapshot buffer need to wake up */
+	ring_buffer_wake_waiters(tr->array_buffer.buffer, RING_BUFFER_ALL_CPUS);
 }
 
 /**
@@ -1945,12 +1948,23 @@ update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu)
 
 static int wait_on_pipe(struct trace_iterator *iter, int full)
 {
+	int ret;
+
 	/* Iterators are static, they should be filled or empty */
 	if (trace_buffer_iter(iter, iter->cpu_file))
 		return 0;
 
-	return ring_buffer_wait(iter->array_buffer->buffer, iter->cpu_file,
-				full);
+	ret = ring_buffer_wait(iter->array_buffer->buffer, iter->cpu_file, full);
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+	/*
+	 * Make sure this is still the snapshot buffer, as if a snapshot were
+	 * to happen, this would now be the main buffer.
+	 */
+	if (iter->snapshot)
+		iter->array_buffer = &iter->tr->max_buffer;
+#endif
+	return ret;
 }
 
 #ifdef CONFIG_FTRACE_STARTUP_TEST
@@ -8517,7 +8531,7 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
 
 		wait_index = READ_ONCE(iter->wait_index);
 
-		ret = wait_on_pipe(iter, iter->tr->buffer_percent);
+		ret = wait_on_pipe(iter, iter->snapshot ? 0 : iter->tr->buffer_percent);
 		if (ret)
 			goto out;
 
-- 
2.42.0

Message-ID: <20231229211339.088802381@goodmis.org>
Date: Fri, 29 Dec 2023 16:13:17 -0500
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
    Jiri Olsa, stable@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann
Subject: [for-linus][PATCH 3/3] ftrace: Fix modification of direct_function hash while in use
References: <20231229211314.081907608@goodmis.org>

From: "Steven Rostedt (Google)"

Masami Hiramatsu reported a memory leak in register_ftrace_direct() that
occurs when the number of new entries added is large enough to cause two
allocations in the loop:

	for (i = 0; i < size; i++) {
		hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
			new = ftrace_add_rec_direct(entry->ip, addr, &free_hash);
			if (!new)
				goto out_remove;
			entry->direct = addr;
		}
	}

where ftrace_add_rec_direct() has:

	if (ftrace_hash_empty(direct_functions) ||
	    direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
		struct ftrace_hash *new_hash;
		int size = ftrace_hash_empty(direct_functions) ? 0 :
			direct_functions->count + 1;

		if (size < 32)
			size = 32;

		new_hash = dup_hash(direct_functions, size);
		if (!new_hash)
			return NULL;

		*free_hash = direct_functions;
		direct_functions = new_hash;
	}

The "*free_hash = direct_functions;" can happen twice, losing the
previous allocation of direct_functions.

But this also exposed a more serious bug. The modification of
direct_functions above is not safe. As direct_functions can be
referenced at any time to find what direct caller to call, the window
between:

	new_hash = dup_hash(direct_functions, size);

and

	direct_functions = new_hash;

can race with another CPU (or even this one, if it gets interrupted),
and during that window the entries being moved to the new hash cannot be
found. That is because "dup_hash()" is misnamed: it is really a
"move_hash()" that moves the entries from the old hash to the new one.

Even if that were changed, this code is still not proper, as
direct_functions should not be updated until the end. That is the best
way to handle function reference changes, and it is how other parts of
ftrace handle this. The following is done:

 1. Change add_hash_entry() to return the entry it created and inserted
    into the hash, and not just return success or not.

 2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove
    the former.

 3. Allocate a "new_hash" at the start that is made for holding both the
    new hash entries as well as the existing entries in direct_functions.

 4. Copy (not move) the direct_function entries over to the new_hash.

 5. Copy the entries of the added hash to the new_hash.

 6. If everything succeeds, then use rcu_assign_pointer() to update
    direct_functions with the new_hash.

This simplifies the code and fixes both the memory leak as well as the
race condition mentioned above.
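In short, the fix follows the usual copy-then-publish pattern. The sketch
below is condensed from the diff that follows (error paths and locking
details trimmed), so readers walking direct_functions under rcu_tasks only
ever see the old hash or the fully built new one:

	/* Build the replacement hash off to the side ... */
	new_hash = alloc_ftrace_hash(fls(size));
	/* ... copy the existing direct_functions entries, then add the new ones ... */

	/* ... and only then publish it */
	free_hash = direct_functions;
	rcu_assign_pointer(direct_functions, new_hash);
	new_hash = NULL;

	/* After dropping direct_mutex: wait out readers still on the old hash */
	if (free_hash && free_hash != EMPTY_HASH) {
		synchronize_rcu_tasks();
		free_ftrace_hash(free_hash);
	}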
Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/
Link: https://lore.kernel.org/linux-trace-kernel/20231229115134.08dd5174@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu
Cc: Mark Rutland
Cc: Mathieu Desnoyers
Cc: Jiri Olsa
Cc: Alexei Starovoitov
Cc: Daniel Borkmann
Fixes: 763e34e74bb7d ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/ftrace.c | 100 ++++++++++++++++++++----------------------
 1 file changed, 47 insertions(+), 53 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 8de8bec5f366..b01ae7d36021 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1183,18 +1183,19 @@ static void __add_hash_entry(struct ftrace_hash *hash,
 	hash->count++;
 }
 
-static int add_hash_entry(struct ftrace_hash *hash, unsigned long ip)
+static struct ftrace_func_entry *
+add_hash_entry(struct ftrace_hash *hash, unsigned long ip)
 {
 	struct ftrace_func_entry *entry;
 
 	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
 	if (!entry)
-		return -ENOMEM;
+		return NULL;
 
 	entry->ip = ip;
 	__add_hash_entry(hash, entry);
 
-	return 0;
+	return entry;
 }
 
 static void
@@ -1349,7 +1350,6 @@ alloc_and_copy_ftrace_hash(int size_bits, struct ftrace_hash *hash)
 	struct ftrace_func_entry *entry;
 	struct ftrace_hash *new_hash;
 	int size;
-	int ret;
 	int i;
 
 	new_hash = alloc_ftrace_hash(size_bits);
@@ -1366,8 +1366,7 @@ alloc_and_copy_ftrace_hash(int size_bits, struct ftrace_hash *hash)
 	size = 1 << hash->size_bits;
 	for (i = 0; i < size; i++) {
 		hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
-			ret = add_hash_entry(new_hash, entry->ip);
-			if (ret < 0)
+			if (add_hash_entry(new_hash, entry->ip) == NULL)
 				goto free_hash;
 		}
 	}
@@ -2536,7 +2535,7 @@ ftrace_find_unique_ops(struct dyn_ftrace *rec)
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
 /* Protected by rcu_tasks for reading, and direct_mutex for writing */
-static struct ftrace_hash *direct_functions = EMPTY_HASH;
+static struct ftrace_hash __rcu *direct_functions = EMPTY_HASH;
 static DEFINE_MUTEX(direct_mutex);
 int ftrace_direct_func_count;
 
@@ -2555,39 +2554,6 @@ unsigned long ftrace_find_rec_direct(unsigned long ip)
 	return entry->direct;
 }
 
-static struct ftrace_func_entry*
-ftrace_add_rec_direct(unsigned long ip, unsigned long addr,
-		      struct ftrace_hash **free_hash)
-{
-	struct ftrace_func_entry *entry;
-
-	if (ftrace_hash_empty(direct_functions) ||
-	    direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
-		struct ftrace_hash *new_hash;
-		int size = ftrace_hash_empty(direct_functions) ? 0 :
-			direct_functions->count + 1;
-
-		if (size < 32)
-			size = 32;
-
-		new_hash = dup_hash(direct_functions, size);
-		if (!new_hash)
-			return NULL;
-
-		*free_hash = direct_functions;
-		direct_functions = new_hash;
-	}
-
-	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
-	if (!entry)
-		return NULL;
-
-	entry->ip = ip;
-	entry->direct = addr;
-	__add_hash_entry(direct_functions, entry);
-	return entry;
-}
-
 static void call_direct_funcs(unsigned long ip, unsigned long pip,
 			      struct ftrace_ops *ops, struct ftrace_regs *fregs)
 {
@@ -4223,8 +4189,8 @@ enter_record(struct ftrace_hash *hash, struct dyn_ftrace *rec, int clear_filter)
 		/* Do nothing if it exists */
 		if (entry)
 			return 0;
-
-		ret = add_hash_entry(hash, rec->ip);
+		if (add_hash_entry(hash, rec->ip) == NULL)
+			ret = -ENOMEM;
 	}
 	return ret;
 }
@@ -5266,7 +5232,8 @@ __ftrace_match_addr(struct ftrace_hash *hash, unsigned long ip, int remove)
 		return 0;
 	}
 
-	return add_hash_entry(hash, ip);
+	entry = add_hash_entry(hash, ip);
+	return entry ? 0 : -ENOMEM;
 }
 
 static int
@@ -5410,7 +5377,7 @@ static void remove_direct_functions_hash(struct ftrace_hash *hash, unsigned long
  */
 int register_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
 {
-	struct ftrace_hash *hash, *free_hash = NULL;
+	struct ftrace_hash *hash, *new_hash = NULL, *free_hash = NULL;
 	struct ftrace_func_entry *entry, *new;
 	int err = -EBUSY, size, i;
 
@@ -5436,17 +5403,44 @@ int register_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
 		}
 	}
 
-	/* ... and insert them to direct_functions hash. */
 	err = -ENOMEM;
+
+	/* Make a copy hash to place the new and the old entries in */
+	size = hash->count + direct_functions->count;
+	if (size > 32)
+		size = 32;
+	new_hash = alloc_ftrace_hash(fls(size));
+	if (!new_hash)
+		goto out_unlock;
+
+	/* Now copy over the existing direct entries */
+	size = 1 << direct_functions->size_bits;
+	for (i = 0; i < size; i++) {
+		hlist_for_each_entry(entry, &direct_functions->buckets[i], hlist) {
+			new = add_hash_entry(new_hash, entry->ip);
+			if (!new)
+				goto out_unlock;
+			new->direct = entry->direct;
+		}
+	}
+
+	/* ... and add the new entries */
+	size = 1 << hash->size_bits;
 	for (i = 0; i < size; i++) {
 		hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
-			new = ftrace_add_rec_direct(entry->ip, addr, &free_hash);
+			new = add_hash_entry(new_hash, entry->ip);
 			if (!new)
-				goto out_remove;
+				goto out_unlock;
+			/* Update both the copy and the hash entry */
+			new->direct = addr;
 			entry->direct = addr;
 		}
 	}
 
+	free_hash = direct_functions;
+	rcu_assign_pointer(direct_functions, new_hash);
+	new_hash = NULL;
+
 	ops->func = call_direct_funcs;
 	ops->flags = MULTI_FLAGS;
 	ops->trampoline = FTRACE_REGS_ADDR;
@@ -5454,17 +5448,17 @@ int register_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
 
 	err = register_ftrace_function_nolock(ops);
 
- out_remove:
-	if (err)
-		remove_direct_functions_hash(hash, addr);
-
 out_unlock:
 	mutex_unlock(&direct_mutex);
 
-	if (free_hash) {
+	if (free_hash && free_hash != EMPTY_HASH) {
 		synchronize_rcu_tasks();
 		free_ftrace_hash(free_hash);
 	}
+
+	if (new_hash)
+		free_ftrace_hash(new_hash);
+
 	return err;
 }
 EXPORT_SYMBOL_GPL(register_ftrace_direct);
@@ -6309,7 +6303,7 @@ ftrace_graph_set_hash(struct ftrace_hash *hash, char *buffer)
 
 			if (entry)
 				continue;
-			if (add_hash_entry(hash, rec->ip) < 0)
+			if (add_hash_entry(hash, rec->ip) == NULL)
 				goto out;
 		} else {
 			if (entry) {
-- 
2.42.0