From nobody Mon Nov 25 11:36:36 2024
Message-ID: <20241028061942.865941501@goodmis.org>
User-Agent: quilt/0.68
Date: Mon, 28 Oct 2024 02:18:22 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Thomas Gleixner
Subject: [PATCH v3 1/2] fgraph: Free ret_stacks when graph tracing is done
References: <20241028061821.009891807@goodmis.org>
MIME-Version: 1.0

From: Steven Rostedt

Since function graph tracing was added to the kernel, it has needed a shadow
stack for every process in order to hijack the return address and replace it
with its own trampoline, so that it can trace when the function exits. The
first time function graph is used, it allocates PAGE_SIZE for each task on
the system (including idle tasks). But because these stacks may still be in
use long after tracing is done, they were never freed (except when a task
exits). That means any task that never exits (including kernel tasks) would
keep these shadow stacks allocated even when they were no longer needed.

The race that needs to be avoided involves traced functions that sleep for
long periods of time (i.e. poll()). If such a function is traced, its
original return address is saved on the shadow stack, which means the shadow
stack cannot be freed until the task is no longer using it.

Luckily, it is easy to know when a task is done with its shadow stack: after
function graph is disabled, the shadow stack will never grow, and once the
last element is popped off of it, nothing will use it again.

When function graph is done and the last user unregisters, all the tasks in
the system can be examined, and if a task's shadow stack depth
(curr_ret_depth) shows the stack is empty, its stack can be freed. But since
there are no memory barriers with the CPUs doing the tracing, the stacks
first have to be moved to a linked list, and only after a call to
synchronize_rcu_tasks_trace() can they be freed.

As the shadow stack is not going to grow anymore, the end of the shadow
stack can be used to store a structure that holds the list_head for the
linked list as well as a pointer to the task. This delays the freeing until
all the shadow stacks to be freed have been added to the linked list and
synchronize_rcu_tasks_trace() has finished.

Note, tasks that are still using their shadow stack will not have them
freed.
They will stay until the task exits, or until another instance of function
graph is registered and unregistered while the shadow stack is no longer
being used.

Reviewed-by: Masami Hiramatsu (Google)
Signed-off-by: Steven Rostedt (Google)
---
Changes since v2: https://lore.kernel.org/20241028060118.796879197@goodmis.org

- Use guard(mutex) for ftrace_lock in ftrace_graph_init_task()
  The added mutex had ignored the error paths that returned early.

 kernel/trace/fgraph.c | 114 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 99 insertions(+), 15 deletions(-)

diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 001abf376c0c..f7df6c14d5f9 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -1144,6 +1144,7 @@ void ftrace_graph_init_task(struct task_struct *t)
 	t->curr_ret_stack = 0;
 	t->curr_ret_depth = -1;
 
+	guard(mutex)(&ftrace_lock);
 	if (ftrace_graph_active) {
 		unsigned long *ret_stack;
 
@@ -1292,19 +1293,106 @@ static void ftrace_graph_disable_direct(bool disable_branch)
 	fgraph_direct_gops = &fgraph_stub;
 }
 
-/* The cpu_boot init_task->ret_stack will never be freed */
-static int fgraph_cpu_init(unsigned int cpu)
+static void __fgraph_cpu_init(unsigned int cpu)
 {
 	if (!idle_task(cpu)->ret_stack)
 		ftrace_graph_init_idle_task(idle_task(cpu), cpu);
+}
+
+static int fgraph_cpu_init(unsigned int cpu)
+{
+	if (ftrace_graph_active)
+		__fgraph_cpu_init(cpu);
+	return 0;
+}
+
+struct ret_stack_free_data {
+	struct list_head	list;
+	struct task_struct	*task;
+};
+
+static void remove_ret_stack(struct task_struct *t, struct list_head *head, int list_index)
+{
+	struct ret_stack_free_data *free_data;
+
+	/* If the ret_stack is still in use, skip this */
+	if (t->curr_ret_depth >= 0)
+		return;
+
+	free_data = (struct ret_stack_free_data *)(t->ret_stack + list_index);
+	list_add(&free_data->list, head);
+	free_data->task = t;
+}
+
+static void free_ret_stacks(void)
+{
+	struct ret_stack_free_data *free_data, *n;
+	struct task_struct *g, *t;
+	LIST_HEAD(stacks);
+	int list_index;
+	int list_sz;
+	int cpu;
+
+	/* Calculate the size in longs to hold ret_stack_free_data */
+	list_sz = DIV_ROUND_UP(sizeof(struct ret_stack_free_data), sizeof(long));
+
+	/*
+	 * We do not want to race with __ftrace_return_to_handler() where this
+	 * CPU can see the update to curr_ret_depth going to zero before it
+	 * actually does. As tracing is disabled, the ret_stack is not going
+	 * to be used anymore and there will be no more callbacks. Use
+	 * the top of the stack as the link list pointer to attach this
+	 * ret_stack to @head. Then at the end, run an RCU trace synchronization
+	 * which will guarantee that there are no more uses of the ret_stacks
+	 * and they can all be freed.
+	 */
+	list_index = SHADOW_STACK_MAX_OFFSET - list_sz;
+
+	read_lock(&tasklist_lock);
+	for_each_process_thread(g, t) {
+		if (t->ret_stack)
+			remove_ret_stack(t, &stacks, list_index);
+	}
+	read_unlock(&tasklist_lock);
+
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		t = idle_task(cpu);
+		if (t->ret_stack)
+			remove_ret_stack(t, &stacks, list_index);
+	}
+	cpus_read_unlock();
+
+	/* Make sure nothing is using the ret_stacks anymore */
+	synchronize_rcu_tasks_trace();
+
+	list_for_each_entry_safe(free_data, n, &stacks, list) {
+		unsigned long *stack = free_data->task->ret_stack;
+
+		free_data->task->ret_stack = NULL;
+		kmem_cache_free(fgraph_stack_cachep, stack);
+	}
+}
+
+static __init int fgraph_init(void)
+{
+	int ret;
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "fgraph:online",
+				fgraph_cpu_init, NULL);
+	if (ret < 0) {
+		pr_warn("fgraph: Error to init cpu hotplug support\n");
+		return ret;
+	}
 	return 0;
 }
+core_initcall(fgraph_init)
 
 int register_ftrace_graph(struct fgraph_ops *gops)
 {
-	static bool fgraph_initialized;
 	int command = 0;
-	int ret = 0;
+	int cpu;
+	int ret;
 	int i = -1;
 
 	guard(mutex)(&ftrace_lock);
@@ -1317,17 +1405,6 @@ int register_ftrace_graph(struct fgraph_ops *gops)
 		return -ENOMEM;
 	}
 
-	if (!fgraph_initialized) {
-		ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "fgraph:online",
-					fgraph_cpu_init, NULL);
-		if (ret < 0) {
-			pr_warn("fgraph: Error to init cpu hotplug support\n");
-			return ret;
-		}
-		fgraph_initialized = true;
-		ret = 0;
-	}
-
 	if (!fgraph_array[0]) {
 		/* The array must always have real data on it */
 		for (i = 0; i < FGRAPH_ARRAY_SIZE; i++)
@@ -1342,6 +1419,12 @@ int register_ftrace_graph(struct fgraph_ops *gops)
 
 	ftrace_graph_active++;
 
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		__fgraph_cpu_init(cpu);
+	}
+	cpus_read_unlock();
+
 	if (ftrace_graph_active == 2)
 		ftrace_graph_disable_direct(true);
 
@@ -1412,6 +1495,7 @@ void unregister_ftrace_graph(struct fgraph_ops *gops)
 		ftrace_graph_entry = ftrace_graph_entry_stub;
 		unregister_pm_notifier(&ftrace_suspend_notifier);
 		unregister_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL);
+		free_ret_stacks();
 	}
 out:
 	gops->saved_func = NULL;
-- 
2.45.2

From nobody Mon Nov 25 11:36:36 2024
Message-ID: <20241028061943.027149869@goodmis.org>
User-Agent: quilt/0.68
Date: Mon, 28 Oct 2024 02:18:23 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Thomas Gleixner
Subject: [PATCH v3 2/2] fgraph: Free ret_stack when task is done with it
References: <20241028061821.009891807@goodmis.org>
MIME-Version: 1.0

From: Steven Rostedt

The shadow stack used for function graph is only freed when function graph
is done, and then only for those tasks that are no longer using it. That is
because a function that sleeps for a long time (like poll()) could be
traced, and its original return address on the stack replaced with a pointer
to a trampoline, with that return address saved on the shadow stack. The
shadow stack cannot be freed until the function returns and no more return
addresses are stored on it.

Add a static_branch test in the return part of the function graph code,
called after the return address on the shadow stack is popped. If the shadow
stack is then empty, queue an irq_work which in turn queues a work queue
item that runs the shadow stack freeing code again.
This will clean up all the shadow stacks that were not removed when function
graph ended but are no longer being used.

Reviewed-by: Masami Hiramatsu (Google)
Signed-off-by: Steven Rostedt (Google)
---
Changes since v2: https://lore.kernel.org/20241028060118.956474816@goodmis.org

- Fixed the previous patch ;-)

 kernel/trace/fgraph.c | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index f7df6c14d5f9..afc20a9616c7 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -174,6 +174,11 @@ int ftrace_graph_active;
 
 static struct kmem_cache *fgraph_stack_cachep;
 
+DEFINE_STATIC_KEY_FALSE(fgraph_ret_stack_cleanup);
+static struct workqueue_struct *fgraph_ret_stack_wq;
+static struct work_struct fgraph_ret_stack_work;
+static struct irq_work fgraph_ret_stack_irq_work;
+
 static struct fgraph_ops *fgraph_array[FGRAPH_ARRAY_SIZE];
 static unsigned long fgraph_array_bitmask;
 
@@ -849,8 +854,15 @@ static unsigned long __ftrace_return_to_handler(struct fgraph_ret_regs *ret_regs
 	 */
 	barrier();
 	current->curr_ret_stack = offset - FGRAPH_FRAME_OFFSET;
-	current->curr_ret_depth--;
+
+	/*
+	 * If function graph is done and this task is no longer using ret_stack
+	 * then start the work to free it.
+	 */
+	if (static_branch_unlikely(&fgraph_ret_stack_cleanup) && current->curr_ret_depth < 0)
+		irq_work_queue(&fgraph_ret_stack_irq_work);
+
 	return ret;
 }
 
@@ -1374,6 +1386,21 @@ static void free_ret_stacks(void)
 	}
 }
 
+static void fgraph_ret_stack_work_func(struct work_struct *work)
+{
+	mutex_lock(&ftrace_lock);
+	if (!ftrace_graph_active)
+		free_ret_stacks();
+	mutex_unlock(&ftrace_lock);
+}
+
+static void fgraph_ret_stack_irq_func(struct irq_work *iwork)
+{
+	if (unlikely(!fgraph_ret_stack_wq))
+		return;
+	queue_work(fgraph_ret_stack_wq, &fgraph_ret_stack_work);
+}
+
 static __init int fgraph_init(void)
 {
 	int ret;
@@ -1384,6 +1411,12 @@ static __init int fgraph_init(void)
 		pr_warn("fgraph: Error to init cpu hotplug support\n");
 		return ret;
 	}
+	fgraph_ret_stack_wq = alloc_workqueue("fgraph_ret_stack_wq",
+					      WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
+	WARN_ON(!fgraph_ret_stack_wq);
+
+	INIT_WORK(&fgraph_ret_stack_work, fgraph_ret_stack_work_func);
+	init_irq_work(&fgraph_ret_stack_irq_work, fgraph_ret_stack_irq_func);
 	return 0;
 }
 core_initcall(fgraph_init)
@@ -1429,6 +1462,7 @@ int register_ftrace_graph(struct fgraph_ops *gops)
 		ftrace_graph_disable_direct(true);
 
 	if (ftrace_graph_active == 1) {
+		static_branch_disable(&fgraph_ret_stack_cleanup);
 		ftrace_graph_enable_direct(false, gops);
 		register_pm_notifier(&ftrace_suspend_notifier);
 		ret = start_graph_tracing();
@@ -1495,6 +1529,7 @@ void unregister_ftrace_graph(struct fgraph_ops *gops)
 		ftrace_graph_entry = ftrace_graph_entry_stub;
 		unregister_pm_notifier(&ftrace_suspend_notifier);
 		unregister_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL);
+		static_branch_enable(&fgraph_ret_stack_cleanup);
 		free_ret_stacks();
 	}
 out:
-- 
2.45.2