From nobody Fri Dec 19 09:43:54 2025
Date: Mon, 18 Dec 2023 15:46:18 +0000
In-Reply-To: <20231218151451.944907-1-vdonnefort@google.com>
References: <20231218151451.944907-1-vdonnefort@google.com>
Message-ID: <20231218154618.954997-1-vdonnefort@google.com>
Subject: [PATCH v7 0/2] ring-buffer: Rename sub-buffer into buffer page
From: Vincent Donnefort
To: rostedt@goodmis.org, mhiramat@kernel.org, linux-kernel@vger.kernel.org,
 linux-trace-kernel@vger.kernel.org
Cc: kernel-team@android.com, Vincent Donnefort

A previous series introduced the ability to change the ring-buffer page
size.
It also introduced the concept of a sub-buffer, that is, a contiguous
virtual memory space which can now be bigger than the system page size
(4K on most systems). But behind the scenes this is really just a page
with an order > 0, and a struct buffer_page (often referred to as
"bpage") already exists for it. We therefore have an unnecessary
duplicate: subbuffer == bpage.

Remove all references to sub-buffer and replace them with either bpage
or ring_buffer_page.

Signed-off-by: Vincent Donnefort

---

I forgot this patch when sending the v7 :-(

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 8571d84d129b..4cd1d89b4ac6 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -203,26 +203,26 @@ of ftrace. Here is a list of some of the key files:
 
 	This displays the total combined size of all the trace buffers.
 
-  buffer_subbuf_size_kb:
-
-	This sets or displays the sub buffer size. The ring buffer is broken up
-	into several same size "sub buffers". An event can not be bigger than
-	the size of the sub buffer. Normally, the sub buffer is the size of the
-	architecture's page (4K on x86). The sub buffer also contains meta data
-	at the start which also limits the size of an event. That means when
-	the sub buffer is a page size, no event can be larger than the page
-	size minus the sub buffer meta data.
-
-	Note, the buffer_subbuf_size_kb is a way for the user to specify the
-	minimum size of the subbuffer. The kernel may make it bigger due to the
-	implementation details, or simply fail the operation if the kernel can
-	not handle the request.
-
-	Changing the sub buffer size allows for events to be larger than the
-	page size.
-
-	Note: When changing the sub-buffer size, tracing is stopped and any
-	data in the ring buffer and the snapshot buffer will be discarded.
+  buffer_page_size_kb:
+
+	This sets or displays the ring-buffer page size. The ring buffer is
+	broken up into several same size "buffer pages". An event can not be
+	bigger than the size of the buffer page. Normally, the buffer page is
+	the size of the architecture's page (4K on x86). The buffer page also
+	contains meta data at the start which also limits the size of an event.
+	That means when the buffer page is a system page size, no event can be
+	larger than the system page size minus the buffer page meta data.
+
+	Note, the buffer_page_size_kb is a way for the user to specify the
+	minimum size for each buffer page. The kernel may make it bigger due to
+	the implementation details, or simply fail the operation if the kernel
+	can not handle the request.
+
+	Changing the ring-buffer page size allows for events to be larger than
+	the system page size.
+
+	Note: When changing the buffer page size, tracing is stopped and any
+	data in the ring buffer and the snapshot buffer will be discarded.
=20 free_buffer: =20 diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index fa802db216f9..929ed54dd651 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -207,9 +207,9 @@ struct trace_seq; int ring_buffer_print_entry_header(struct trace_seq *s); int ring_buffer_print_page_header(struct trace_buffer *buffer, struct trac= e_seq *s); =20 -int ring_buffer_subbuf_order_get(struct trace_buffer *buffer); -int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order); -int ring_buffer_subbuf_size_get(struct trace_buffer *buffer); +int ring_buffer_page_order_get(struct trace_buffer *buffer); +int ring_buffer_page_order_set(struct trace_buffer *buffer, int order); +int ring_buffer_page_size_get(struct trace_buffer *buffer); =20 enum ring_buffer_flags { RB_FL_OVERWRITE =3D 1 << 0, diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 9b95297339b6..f95ad0f5be1b 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -511,8 +511,8 @@ struct trace_buffer { struct rb_irq_work irq_work; bool time_stamp_abs; =20 - unsigned int subbuf_size; - unsigned int subbuf_order; + unsigned int bpage_size; + unsigned int bpage_order; unsigned int max_data_size; }; =20 @@ -555,7 +555,7 @@ int ring_buffer_print_page_header(struct trace_buffer *= buffer, struct trace_seq trace_seq_printf(s, "\tfield: char data;\t" "offset:%u;\tsize:%u;\tsigned:%u;\n", (unsigned int)offsetof(typeof(field), data), - (unsigned int)buffer->subbuf_size, + (unsigned int)buffer->bpage_size, (unsigned int)is_signed_type(char)); =20 return !trace_seq_has_overflowed(s); @@ -1488,11 +1488,11 @@ static int __rb_allocate_pages(struct ring_buffer_p= er_cpu *cpu_buffer, list_add(&bpage->list, pages); =20 page =3D alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags, - cpu_buffer->buffer->subbuf_order); + cpu_buffer->buffer->bpage_order); if (!page) goto free_pages; bpage->page =3D page_address(page); - bpage->order =3D cpu_buffer->buffer->subbuf_order; + bpage->order =3D cpu_buffer->buffer->bpage_order; rb_init_page(bpage->page); =20 if (user_thread && fatal_signal_pending(current)) @@ -1572,7 +1572,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, l= ong nr_pages, int cpu) =20 cpu_buffer->reader_page =3D bpage; =20 - page =3D alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, cpu_buffer->buffe= r->subbuf_order); + page =3D alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, cpu_buffer->buffe= r->bpage_order); if (!page) goto fail_free_reader; bpage->page =3D page_address(page); @@ -1656,13 +1656,13 @@ struct trace_buffer *__ring_buffer_alloc(unsigned l= ong size, unsigned flags, goto fail_free_buffer; =20 /* Default buffer page size - one system page */ - buffer->subbuf_order =3D 0; - buffer->subbuf_size =3D PAGE_SIZE - BUF_PAGE_HDR_SIZE; + buffer->bpage_order =3D 0; + buffer->bpage_size =3D PAGE_SIZE - BUF_PAGE_HDR_SIZE; =20 /* Max payload is buffer page size - header (8bytes) */ - buffer->max_data_size =3D buffer->subbuf_size - (sizeof(u32) * 2); + buffer->max_data_size =3D buffer->bpage_size - (sizeof(u32) * 2); =20 - nr_pages =3D DIV_ROUND_UP(size, buffer->subbuf_size); + nr_pages =3D DIV_ROUND_UP(size, buffer->bpage_size); buffer->flags =3D flags; buffer->clock =3D trace_clock_local; buffer->reader_lock_key =3D key; @@ -1981,7 +1981,7 @@ static void update_pages_handler(struct work_struct *= work) * @size: the new size. * @cpu_id: the cpu buffer to resize * - * Minimum size is 2 * buffer->subbuf_size. + * Minimum size is 2 * buffer->bpage_size. 
* * Returns 0 on success and < 0 on failure. */ @@ -2003,7 +2003,7 @@ int ring_buffer_resize(struct trace_buffer *buffer, u= nsigned long size, !cpumask_test_cpu(cpu_id, buffer->cpumask)) return 0; =20 - nr_pages =3D DIV_ROUND_UP(size, buffer->subbuf_size); + nr_pages =3D DIV_ROUND_UP(size, buffer->bpage_size); =20 /* we need a minimum of two pages */ if (nr_pages < 2) @@ -2483,7 +2483,7 @@ static inline void rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer, unsigned long tail, struct rb_event_info *info) { - unsigned long bsize =3D READ_ONCE(cpu_buffer->buffer->subbuf_size); + unsigned long bsize =3D READ_ONCE(cpu_buffer->buffer->bpage_size); struct buffer_page *tail_page =3D info->tail_page; struct ring_buffer_event *event; unsigned long length =3D info->length; @@ -3426,7 +3426,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buf= fer, tail =3D write - info->length; =20 /* See if we shot pass the end of this buffer page */ - if (unlikely(write > cpu_buffer->buffer->subbuf_size)) { + if (unlikely(write > cpu_buffer->buffer->bpage_size)) { check_buffer(cpu_buffer, info, CHECK_FULL_PAGE); return rb_move_tail(cpu_buffer, tail, info); } @@ -4355,7 +4355,7 @@ static struct buffer_page * rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer) { struct buffer_page *reader =3D NULL; - unsigned long bsize =3D READ_ONCE(cpu_buffer->buffer->subbuf_size); + unsigned long bsize =3D READ_ONCE(cpu_buffer->buffer->bpage_size); unsigned long overwrite; unsigned long flags; int nr_loops =3D 0; @@ -4935,7 +4935,7 @@ ring_buffer_read_prepare(struct trace_buffer *buffer,= int cpu, gfp_t flags) return NULL; =20 /* Holds the entire event: data and meta data */ - iter->event_size =3D buffer->subbuf_size; + iter->event_size =3D buffer->bpage_size; iter->event =3D kmalloc(iter->event_size, flags); if (!iter->event) { kfree(iter); @@ -5054,14 +5054,14 @@ unsigned long ring_buffer_size(struct trace_buffer = *buffer, int cpu) { /* * Earlier, this method returned - * buffer->subbuf_size * buffer->nr_pages + * buffer->bpage_size * buffer->nr_pages * Since the nr_pages field is now removed, we have converted this to * return the per cpu buffer value. 
*/ if (!cpumask_test_cpu(cpu, buffer->cpumask)) return 0; =20 - return buffer->subbuf_size * buffer->buffers[cpu]->nr_pages; + return buffer->bpage_size * buffer->buffers[cpu]->nr_pages; } EXPORT_SYMBOL_GPL(ring_buffer_size); =20 @@ -5350,7 +5350,7 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_= a, if (cpu_buffer_a->nr_pages !=3D cpu_buffer_b->nr_pages) goto out; =20 - if (buffer_a->subbuf_order !=3D buffer_b->subbuf_order) + if (buffer_a->bpage_order !=3D buffer_b->bpage_order) goto out; =20 ret =3D -EAGAIN; @@ -5439,7 +5439,7 @@ ring_buffer_alloc_read_page(struct trace_buffer *buff= er, int cpu) if (!bpage) return ERR_PTR(-ENOMEM); =20 - bpage->order =3D buffer->subbuf_order; + bpage->order =3D buffer->bpage_order; cpu_buffer =3D buffer->buffers[cpu]; local_irq_save(flags); arch_spin_lock(&cpu_buffer->lock); @@ -5456,7 +5456,7 @@ ring_buffer_alloc_read_page(struct trace_buffer *buff= er, int cpu) goto out; =20 page =3D alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY, - cpu_buffer->buffer->subbuf_order); + cpu_buffer->buffer->bpage_order); if (!page) { kfree(bpage); return ERR_PTR(-ENOMEM); @@ -5494,10 +5494,10 @@ void ring_buffer_free_read_page(struct trace_buffer= *buffer, int cpu, =20 /* * If the page is still in use someplace else, or order of the page - * is different from the subbuffer order of the buffer - + * is different from the bpage order of the buffer - * we can't reuse it */ - if (page_ref_count(page) > 1 || data_page->order !=3D buffer->subbuf_orde= r) + if (page_ref_count(page) > 1 || data_page->order !=3D buffer->bpage_order) goto out; =20 local_irq_save(flags); @@ -5580,7 +5580,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer, =20 if (!data_page || !data_page->data) goto out; - if (data_page->order !=3D buffer->subbuf_order) + if (data_page->order !=3D buffer->bpage_order) goto out; =20 bpage =3D data_page->data; @@ -5703,7 +5703,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer, /* If there is room at the end of the page to save the * missed events, then record it there. */ - if (buffer->subbuf_size - commit >=3D sizeof(missed_events)) { + if (buffer->bpage_size - commit >=3D sizeof(missed_events)) { memcpy(&bpage->data[commit], &missed_events, sizeof(missed_events)); local_add(RB_MISSED_STORED, &bpage->commit); @@ -5715,8 +5715,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer, /* * This page may be off to user land. Zero it out here. */ - if (commit < buffer->subbuf_size) - memset(&bpage->data[commit], 0, buffer->subbuf_size - commit); + if (commit < buffer->bpage_size) + memset(&bpage->data[commit], 0, buffer->bpage_size - commit); =20 out_unlock: raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); @@ -5739,19 +5739,19 @@ void *ring_buffer_read_page_data(struct buffer_data= _read_page *page) EXPORT_SYMBOL_GPL(ring_buffer_read_page_data); =20 /** - * ring_buffer_subbuf_size_get - get size of the sub buffer. + * ring_buffer_page_size_get - get size of the sub buffer. * @buffer: the buffer to get the sub buffer size from * * Returns size of the sub buffer, in bytes. */ -int ring_buffer_subbuf_size_get(struct trace_buffer *buffer) +int ring_buffer_page_size_get(struct trace_buffer *buffer) { - return buffer->subbuf_size + BUF_PAGE_HDR_SIZE; + return buffer->bpage_size + BUF_PAGE_HDR_SIZE; } -EXPORT_SYMBOL_GPL(ring_buffer_subbuf_size_get); +EXPORT_SYMBOL_GPL(ring_buffer_page_size_get); =20 /** - * ring_buffer_subbuf_order_get - get order of system sub pages in one buf= fer page. 
+ * ring_buffer_page_order_get - get order of system sub pages in one buffe= r page. * @buffer: The ring_buffer to get the system sub page order from * * By default, one ring buffer sub page equals to one system page. This pa= rameter @@ -5762,17 +5762,17 @@ EXPORT_SYMBOL_GPL(ring_buffer_subbuf_size_get); * 0 means the sub buffer size is 1 system page and so forth. * In case of an error < 0 is returned. */ -int ring_buffer_subbuf_order_get(struct trace_buffer *buffer) +int ring_buffer_page_order_get(struct trace_buffer *buffer) { if (!buffer) return -EINVAL; =20 - return buffer->subbuf_order; + return buffer->bpage_order; } -EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_get); +EXPORT_SYMBOL_GPL(ring_buffer_page_order_get); =20 /** - * ring_buffer_subbuf_order_set - set the size of ring buffer sub page. + * ring_buffer_page_order_set - set the size of ring buffer sub page. * @buffer: The ring_buffer to set the new page size. * @order: Order of the system pages in one sub buffer page * @@ -5787,7 +5787,7 @@ EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_get); * * Returns 0 on success or < 0 in case of an error. */ -int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order) +int ring_buffer_page_order_set(struct trace_buffer *buffer, int order) { struct ring_buffer_per_cpu *cpu_buffer; struct buffer_page *bpage, *tmp; @@ -5800,15 +5800,15 @@ int ring_buffer_subbuf_order_set(struct trace_buffe= r *buffer, int order) if (!buffer || order < 0) return -EINVAL; =20 - if (buffer->subbuf_order =3D=3D order) + if (buffer->bpage_order =3D=3D order) return 0; =20 psize =3D (1 << order) * PAGE_SIZE; if (psize <=3D BUF_PAGE_HDR_SIZE) return -EINVAL; =20 - old_order =3D buffer->subbuf_order; - old_size =3D buffer->subbuf_size; + old_order =3D buffer->bpage_order; + old_size =3D buffer->bpage_size; =20 /* prevent another thread from changing buffer sizes */ mutex_lock(&buffer->mutex); @@ -5817,8 +5817,8 @@ int ring_buffer_subbuf_order_set(struct trace_buffer = *buffer, int order) /* Make sure all commits have finished */ synchronize_rcu(); =20 - buffer->subbuf_order =3D order; - buffer->subbuf_size =3D psize - BUF_PAGE_HDR_SIZE; + buffer->bpage_order =3D order; + buffer->bpage_size =3D psize - BUF_PAGE_HDR_SIZE; =20 /* Make sure all new buffers are allocated, before deleting the old ones = */ for_each_buffer_cpu(buffer, cpu) { @@ -5830,7 +5830,7 @@ int ring_buffer_subbuf_order_set(struct trace_buffer = *buffer, int order) =20 /* Update the number of pages to match the new size */ nr_pages =3D old_size * buffer->buffers[cpu]->nr_pages; - nr_pages =3D DIV_ROUND_UP(nr_pages, buffer->subbuf_size); + nr_pages =3D DIV_ROUND_UP(nr_pages, buffer->bpage_size); =20 /* we need a minimum of two pages */ if (nr_pages < 2) @@ -5907,8 +5907,8 @@ int ring_buffer_subbuf_order_set(struct trace_buffer = *buffer, int order) return 0; =20 error: - buffer->subbuf_order =3D old_order; - buffer->subbuf_size =3D old_size; + buffer->bpage_order =3D old_order; + buffer->bpage_size =3D old_size; =20 atomic_dec(&buffer->record_disabled); mutex_unlock(&buffer->mutex); @@ -5927,7 +5927,7 @@ int ring_buffer_subbuf_order_set(struct trace_buffer = *buffer, int order) =20 return err; } -EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set); +EXPORT_SYMBOL_GPL(ring_buffer_page_order_set); =20 /* * We only allocate new buffers, never free them if the CPU goes down. 
diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffe= r_benchmark.c index 008187ebd7fe..b58ced8f4626 100644 --- a/kernel/trace/ring_buffer_benchmark.c +++ b/kernel/trace/ring_buffer_benchmark.c @@ -118,7 +118,7 @@ static enum event_status read_page(int cpu) if (IS_ERR(bpage)) return EVENT_DROPPED; =20 - page_size =3D ring_buffer_subbuf_size_get(buffer); + page_size =3D ring_buffer_page_size_get(buffer); ret =3D ring_buffer_read_page(buffer, bpage, page_size, cpu, 1); if (ret >=3D 0) { rpage =3D ring_buffer_read_page_data(bpage); diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index b35c85edbb49..c17dd849e6f1 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1269,8 +1269,8 @@ int tracing_alloc_snapshot_instance(struct trace_arra= y *tr) if (!tr->allocated_snapshot) { =20 /* Make the snapshot buffer have the same order as main buffer */ - order =3D ring_buffer_subbuf_order_get(tr->array_buffer.buffer); - ret =3D ring_buffer_subbuf_order_set(tr->max_buffer.buffer, order); + order =3D ring_buffer_page_order_get(tr->array_buffer.buffer); + ret =3D ring_buffer_page_order_set(tr->max_buffer.buffer, order); if (ret < 0) return ret; =20 @@ -1293,7 +1293,7 @@ static void free_snapshot(struct trace_array *tr) * The max_tr ring buffer has some state (e.g. ring->clock) and * we want preserve it. */ - ring_buffer_subbuf_order_set(tr->max_buffer.buffer, 0); + ring_buffer_page_order_set(tr->max_buffer.buffer, 0); ring_buffer_resize(tr->max_buffer.buffer, 1, RING_BUFFER_ALL_CPUS); set_buffer_entries(&tr->max_buffer, 1); tracing_reset_online_cpus(&tr->max_buffer); @@ -8315,7 +8315,7 @@ tracing_buffers_read(struct file *filp, char __user *= ubuf, return -EBUSY; #endif =20 - page_size =3D ring_buffer_subbuf_size_get(iter->array_buffer->buffer); + page_size =3D ring_buffer_page_size_get(iter->array_buffer->buffer); =20 /* Make sure the spare matches the current sub buffer size */ if (info->spare) { @@ -8492,7 +8492,7 @@ tracing_buffers_splice_read(struct file *file, loff_t= *ppos, return -EBUSY; #endif =20 - page_size =3D ring_buffer_subbuf_size_get(iter->array_buffer->buffer); + page_size =3D ring_buffer_page_size_get(iter->array_buffer->buffer); if (*ppos & (page_size - 1)) return -EINVAL; =20 @@ -9391,7 +9391,7 @@ static const struct file_operations buffer_percent_fo= ps =3D { }; =20 static ssize_t -buffer_subbuf_size_read(struct file *filp, char __user *ubuf, size_t cnt, = loff_t *ppos) +buffer_page_size_read(struct file *filp, char __user *ubuf, size_t cnt, lo= ff_t *ppos) { struct trace_array *tr =3D filp->private_data; size_t size; @@ -9399,7 +9399,7 @@ buffer_subbuf_size_read(struct file *filp, char __use= r *ubuf, size_t cnt, loff_t int order; int r; =20 - order =3D ring_buffer_subbuf_order_get(tr->array_buffer.buffer); + order =3D ring_buffer_page_order_get(tr->array_buffer.buffer); size =3D (PAGE_SIZE << order) / 1024; =20 r =3D sprintf(buf, "%zd\n", size); @@ -9408,8 +9408,8 @@ buffer_subbuf_size_read(struct file *filp, char __use= r *ubuf, size_t cnt, loff_t } =20 static ssize_t -buffer_subbuf_size_write(struct file *filp, const char __user *ubuf, - size_t cnt, loff_t *ppos) +buffer_page_size_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) { struct trace_array *tr =3D filp->private_data; unsigned long val; @@ -9434,11 +9434,11 @@ buffer_subbuf_size_write(struct file *filp, const c= har __user *ubuf, /* Do not allow tracing while changing the order of the ring buffer */ tracing_stop_tr(tr); =20 - old_order =3D 
ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
+	old_order = ring_buffer_page_order_get(tr->array_buffer.buffer);
 	if (old_order == order)
 		goto out;
 
-	ret = ring_buffer_subbuf_order_set(tr->array_buffer.buffer, order);
+	ret = ring_buffer_page_order_set(tr->array_buffer.buffer, order);
 	if (ret)
 		goto out;
 
@@ -9447,10 +9447,10 @@ buffer_subbuf_size_write(struct file *filp, const char __user *ubuf,
 	if (!tr->allocated_snapshot)
 		goto out_max;
 
-	ret = ring_buffer_subbuf_order_set(tr->max_buffer.buffer, order);
+	ret = ring_buffer_page_order_set(tr->max_buffer.buffer, order);
 	if (ret) {
 		/* Put back the old order */
-		cnt = ring_buffer_subbuf_order_set(tr->array_buffer.buffer, old_order);
+		cnt = ring_buffer_page_order_set(tr->array_buffer.buffer, old_order);
 		if (WARN_ON_ONCE(cnt)) {
 			/*
 			 * AARGH! We are left with different orders!
@@ -9479,10 +9479,10 @@ buffer_subbuf_size_write(struct file *filp, const char __user *ubuf,
 	return cnt;
 }
 
-static const struct file_operations buffer_subbuf_size_fops = {
+static const struct file_operations buffer_page_size_fops = {
 	.open		= tracing_open_generic_tr,
-	.read		= buffer_subbuf_size_read,
-	.write		= buffer_subbuf_size_write,
+	.read		= buffer_page_size_read,
+	.write		= buffer_page_size_write,
 	.release	= tracing_release_generic_tr,
 	.llseek		= default_llseek,
 };
@@ -9953,8 +9953,8 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
 	trace_create_file("buffer_percent", TRACE_MODE_WRITE, d_tracer,
 			tr, &buffer_percent_fops);
 
-	trace_create_file("buffer_subbuf_size_kb", TRACE_MODE_WRITE, d_tracer,
-			tr, &buffer_subbuf_size_fops);
+	trace_create_file("buffer_page_size_kb", TRACE_MODE_WRITE, d_tracer,
+			tr, &buffer_page_size_fops);
 
 	create_trace_options_dir(tr);
 
-- 
2.43.0.472.g3155946c3a-goog

From nobody Fri Dec 19 09:43:54 2025
Date: Mon, 18 Dec 2023 15:14:50 +0000
In-Reply-To: <20231218151451.944907-1-vdonnefort@google.com>
References: <20231218151451.944907-1-vdonnefort@google.com>
Message-ID: <20231218151451.944907-2-vdonnefort@google.com>
Subject: [PATCH v7 1/2] ring-buffer: Introducing ring-buffer mapping functions
From: Vincent Donnefort
To: rostedt@goodmis.org, mhiramat@kernel.org, linux-kernel@vger.kernel.org,
 linux-trace-kernel@vger.kernel.org
Cc: kernel-team@android.com, Vincent Donnefort

In preparation for allowing user-space to map a ring-buffer, add a set
of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each buffer page of the ring-buffer; currently the
  pages are identified only through their in-kernel VA.

  A meta-page, which holds the ring-buffer statistics and a
  description of the current reader.

The linear mapping exposes the meta-page, and each bpage of the
ring-buffer, ordered by their unique ID, assigned during the first
mapping.

Once mapped, no bpage can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping the buffer pages.
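For reference, here is a minimal sketch of how an in-kernel caller could
chain these entry points. It is purely illustrative: patch 2/2 wires
them into the trace_pipe_raw mmap path, and the function name below is
hypothetical, not part of this series.

#include <linux/errno.h>
#include <linux/ring_buffer.h>

/* Illustration only: map @cpu, resolve the page backing @pgoff, then unmap. */
static int example_map_one_page(struct trace_buffer *buffer, int cpu,
				unsigned long pgoff)
{
	struct page *page;
	int err;

	err = ring_buffer_map(buffer, cpu);	/* pins bpages, allocates the meta-page */
	if (err)
		return err;

	/* pgoff 0 is the meta-page, pgoff >= 1 the bpages, ordered by their ID */
	page = ring_buffer_map_fault(buffer, cpu, pgoff);
	if (!page)
		err = -EINVAL;
	else
		/* Make the reader bpage point at unread data for the mapper */
		err = ring_buffer_map_get_reader(buffer, cpu);

	ring_buffer_unmap(buffer, cpu);		/* drops the per-CPU mapping count */
	return err;
}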
Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index 929ed54dd651..e77a2685fe52 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include =20 +#include + struct trace_buffer; struct ring_buffer_iter; =20 @@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist= _node *node); #define trace_rb_cpu_prepare NULL #endif =20 +int ring_buffer_map(struct trace_buffer *buffer, int cpu); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu, + unsigned long pgoff); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mma= p.h new file mode 100644 index 000000000000..9536f0b7c094 --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,29 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_TRACE_MMAP_H_ +#define _UAPI_TRACE_MMAP_H_ + +#include + +struct trace_buffer_meta { + unsigned long entries; + unsigned long overrun; + unsigned long read; + + unsigned long bpages_touched; + unsigned long bpages_lost; + unsigned long bpages_read; + + struct { + unsigned long lost_events; /* Events lost at the time of the reader swap= */ + __u32 id; /* Reader bpage ID from 0 to nr_bpages - 1 */ + __u32 read; /* Number of bytes read on the reader bpage */ + } reader; + + __u32 bpage_size; /* Size of each buffer page including the header */ + __u32 nr_bpages; /* Number of buffer pages in the ring-buffer */ + + __u32 meta_page_size; /* Size of the meta-page */ + __u32 meta_struct_len; /* Len of this struct */ +}; + +#endif /* _UAPI_TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index f95ad0f5be1b..2a6307af9c6c 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -335,6 +335,7 @@ struct buffer_page { local_t write; /* index for next write */ unsigned read; /* index for next read */ local_t entries; /* entries on this page */ + u32 id; /* ID for external mapping */ unsigned long real_end; /* real end of data */ unsigned order; /* order of the page */ struct buffer_data_page *page; /* Actual data page */ @@ -483,6 +484,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + int mapped; + struct mutex mapping_lock; + unsigned long *bpage_ids; /* ID to addr */ + struct trace_buffer_meta *meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ long nr_pages_to_update; struct list_head new_pages; /* new pages to add */ @@ -760,6 +767,22 @@ static __always_inline bool full_hit(struct trace_buff= er *buffer, int cpu, int f return (dirty * 100) > (full * nr_pages); } =20 +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer) +{ + if (unlikely(READ_ONCE(cpu_buffer->mapped))) { + /* Ensure the meta_page is ready */ + smp_rmb(); + WRITE_ONCE(cpu_buffer->meta_page->entries, + local_read(&cpu_buffer->entries)); + WRITE_ONCE(cpu_buffer->meta_page->overrun, + local_read(&cpu_buffer->overrun)); + WRITE_ONCE(cpu_buffer->meta_page->bpages_touched, + local_read(&cpu_buffer->pages_touched)); + WRITE_ONCE(cpu_buffer->meta_page->bpages_lost, + local_read(&cpu_buffer->pages_lost)); + } +} + /* * rb_wake_up_waiters - wake up tasks waiting for ring buffer input * @@ -769,6 +792,10 @@ static __always_inline bool 
full_hit(struct trace_buff= er *buffer, int cpu, int f static void rb_wake_up_waiters(struct irq_work *work) { struct rb_irq_work *rbwork =3D container_of(work, struct rb_irq_work, wor= k); + struct ring_buffer_per_cpu *cpu_buffer =3D + container_of(rbwork, struct ring_buffer_per_cpu, irq_work); + + rb_update_meta_page(cpu_buffer); =20 wake_up_all(&rbwork->waiters); if (rbwork->full_waiters_pending || rbwork->wakeup_full) { @@ -1562,6 +1589,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, l= ong nr_pages, int cpu) init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters); init_waitqueue_head(&cpu_buffer->irq_work.waiters); init_waitqueue_head(&cpu_buffer->irq_work.full_waiters); + mutex_init(&cpu_buffer->mapping_lock); =20 bpage =3D kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()), GFP_KERNEL, cpu_to_node(cpu)); @@ -4474,6 +4502,14 @@ rb_get_reader_page(struct ring_buffer_per_cpu *cpu_b= uffer) cpu_buffer->last_overrun =3D overwrite; } =20 + if (cpu_buffer->mapped) { + WRITE_ONCE(cpu_buffer->meta_page->reader.read, 0); + WRITE_ONCE(cpu_buffer->meta_page->reader.id, reader->id); + WRITE_ONCE(cpu_buffer->meta_page->reader.lost_events, cpu_buffer->lost_e= vents); + WRITE_ONCE(cpu_buffer->meta_page->bpages_read, + local_read(&cpu_buffer->pages_read)); + } + goto again; =20 out: @@ -4541,6 +4577,12 @@ static void rb_advance_reader(struct ring_buffer_per= _cpu *cpu_buffer) length =3D rb_event_length(event); cpu_buffer->reader_page->read +=3D length; cpu_buffer->read_bytes +=3D length; + if (cpu_buffer->mapped) { + WRITE_ONCE(cpu_buffer->meta_page->reader.read, + cpu_buffer->reader_page->read); + WRITE_ONCE(cpu_buffer->meta_page->read, + cpu_buffer->read); + } } =20 static void rb_advance_iter(struct ring_buffer_iter *iter) @@ -5088,6 +5130,19 @@ static void rb_clear_buffer_page(struct buffer_page = *page) page->read =3D 0; } =20 +static void rb_reset_meta_page(struct ring_buffer_per_cpu *cpu_buffer) +{ + struct trace_buffer_meta *meta =3D cpu_buffer->meta_page; + + WRITE_ONCE(meta->entries, 0); + WRITE_ONCE(meta->overrun, 0); + WRITE_ONCE(meta->read, cpu_buffer->read); + WRITE_ONCE(meta->bpages_touched, 0); + WRITE_ONCE(meta->bpages_lost, 0); + WRITE_ONCE(meta->bpages_read, local_read(&cpu_buffer->pages_read)); + WRITE_ONCE(meta->reader.read, cpu_buffer->reader_page->read); +} + static void rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer) { @@ -5132,6 +5187,9 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer) cpu_buffer->lost_events =3D 0; cpu_buffer->last_overrun =3D 0; =20 + if (cpu_buffer->mapped) + rb_reset_meta_page(cpu_buffer); + rb_head_page_activate(cpu_buffer); cpu_buffer->pages_removed =3D 0; } @@ -5346,6 +5404,11 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer= _a, cpu_buffer_a =3D buffer_a->buffers[cpu]; cpu_buffer_b =3D buffer_b->buffers[cpu]; =20 + if (READ_ONCE(cpu_buffer_a->mapped) || READ_ONCE(cpu_buffer_b->mapped)) { + ret =3D -EBUSY; + goto out; + } + /* At least make sure the two buffers are somewhat the same */ if (cpu_buffer_a->nr_pages !=3D cpu_buffer_b->nr_pages) goto out; @@ -5609,7 +5672,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer, * Otherwise, we can simply swap the page with the one passed in. 
*/ if (read || (len < (commit - read)) || - cpu_buffer->reader_page =3D=3D cpu_buffer->commit_page) { + cpu_buffer->reader_page =3D=3D cpu_buffer->commit_page || + cpu_buffer->mapped) { struct buffer_data_page *rpage =3D cpu_buffer->reader_page->page; unsigned int rpos =3D read; unsigned int pos =3D 0; @@ -5828,6 +5892,11 @@ int ring_buffer_page_order_set(struct trace_buffer *= buffer, int order) =20 cpu_buffer =3D buffer->buffers[cpu]; =20 + if (cpu_buffer->mapped) { + err =3D -EBUSY; + goto error; + } + /* Update the number of pages to match the new size */ nr_pages =3D old_size * buffer->buffers[cpu]->nr_pages; nr_pages =3D DIV_ROUND_UP(nr_pages, buffer->bpage_size); @@ -5929,6 +5998,308 @@ int ring_buffer_page_order_set(struct trace_buffer = *buffer, int order) } EXPORT_SYMBOL_GPL(ring_buffer_page_order_set); =20 +#define bpage_subpage(sub_off, start) \ + virt_to_page((void *)(start + (sub_off << PAGE_SHIFT))) + +#define foreach_bpage_subpage(sub_off, bpage_order, start, page) \ + for (sub_off =3D 0, page =3D bpage_subpage(0, start); \ + sub_off < (1 << bpage_order); \ + sub_off++, page =3D bpage_subpage(sub_off, start)) + +static inline void bpage_map_prepare(unsigned long start, int order) +{ + struct page *page; + int subpage_off; + + /* + * When allocating order > 0 pages, only the first struct page has a + * refcount > 1. Increasing the refcount here ensures the none of the + * struct page composing the bpage is freeed when the mapping is closed. + */ + foreach_bpage_subpage(subpage_off, order, start, page) + page_ref_inc(page); +} + +static inline void bpage_unmap(unsigned long start, int order) +{ + struct page *page; + int subpage_off; + + foreach_bpage_subpage(subpage_off, order, start, page) { + page_ref_dec(page); + page->mapping =3D NULL; + } +} + +static void rb_free_bpage_ids(struct ring_buffer_per_cpu *cpu_buffer) +{ + int sub_id; + + for (sub_id =3D 0; sub_id < cpu_buffer->nr_pages + 1; sub_id++) + bpage_unmap(cpu_buffer->bpage_ids[sub_id], + cpu_buffer->buffer->bpage_order); + + kfree(cpu_buffer->bpage_ids); + cpu_buffer->bpage_ids =3D NULL; +} + +static int rb_alloc_meta_page(struct ring_buffer_per_cpu *cpu_buffer) +{ + if (cpu_buffer->meta_page) + return 0; + + cpu_buffer->meta_page =3D page_to_virt(alloc_page(GFP_USER)); + if (!cpu_buffer->meta_page) + return -ENOMEM; + + return 0; +} + +static void rb_free_meta_page(struct ring_buffer_per_cpu *cpu_buffer) +{ + unsigned long addr =3D (unsigned long)cpu_buffer->meta_page; + + virt_to_page((void *)addr)->mapping =3D NULL; + free_page(addr); + cpu_buffer->meta_page =3D NULL; +} + +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer, + unsigned long *bpage_ids) +{ + struct trace_buffer_meta *meta =3D cpu_buffer->meta_page; + unsigned int nr_bpages =3D cpu_buffer->nr_pages + 1; + struct buffer_page *first_page, *bpage; + int id =3D 0; + + bpage_ids[id] =3D (unsigned long)cpu_buffer->reader_page->page; + bpage_map_prepare(bpage_ids[id], cpu_buffer->buffer->bpage_order); + cpu_buffer->reader_page->id =3D id++; + + first_page =3D bpage =3D rb_set_head_page(cpu_buffer); + do { + + if (id >=3D nr_bpages) { + WARN_ON(1); + break; + } + + bpage_ids[id] =3D (unsigned long)bpage->page; + bpage->id =3D id; + bpage_map_prepare(bpage_ids[id], cpu_buffer->buffer->bpage_order); + + rb_inc_page(&bpage); + id++; + } while (bpage !=3D first_page); + + /* install page ID to kern VA translation */ + cpu_buffer->bpage_ids =3D bpage_ids; + + meta->meta_page_size =3D PAGE_SIZE; + meta->meta_struct_len =3D sizeof(*meta); 
+ meta->nr_bpages =3D nr_bpages; + meta->bpage_size =3D cpu_buffer->buffer->bpage_size + BUF_PAGE_HDR_SIZE; + meta->reader.id =3D cpu_buffer->reader_page->id; + rb_reset_meta_page(cpu_buffer); +} + +static inline struct ring_buffer_per_cpu * +rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu) +{ + struct ring_buffer_per_cpu *cpu_buffer; + + if (!cpumask_test_cpu(cpu, buffer->cpumask)) + return ERR_PTR(-EINVAL); + + cpu_buffer =3D buffer->buffers[cpu]; + + mutex_lock(&cpu_buffer->mapping_lock); + + if (!cpu_buffer->mapped) { + mutex_unlock(&cpu_buffer->mapping_lock); + return ERR_PTR(-ENODEV); + } + + return cpu_buffer; +} + +static inline void rb_put_mapped_buffer(struct ring_buffer_per_cpu *cpu_bu= ffer) +{ + mutex_unlock(&cpu_buffer->mapping_lock); +} + +int ring_buffer_map(struct trace_buffer *buffer, int cpu) +{ + struct ring_buffer_per_cpu *cpu_buffer; + unsigned long flags, *bpage_ids; + int err =3D 0; + + if (!cpumask_test_cpu(cpu, buffer->cpumask)) + return -EINVAL; + + cpu_buffer =3D buffer->buffers[cpu]; + + mutex_lock(&cpu_buffer->mapping_lock); + + if (cpu_buffer->mapped) { + WRITE_ONCE(cpu_buffer->mapped, cpu_buffer->mapped + 1); + goto unlock; + } + + /* prevent another thread from changing buffer sizes */ + mutex_lock(&buffer->mutex); + + err =3D rb_alloc_meta_page(cpu_buffer); + if (err) + goto unlock; + + /* bpage_ids include the reader while nr_pages does not */ + bpage_ids =3D kzalloc(sizeof(*bpage_ids) * (cpu_buffer->nr_pages + 1), + GFP_KERNEL); + if (!bpage_ids) { + rb_free_meta_page(cpu_buffer); + err =3D -ENOMEM; + goto unlock; + } + + atomic_inc(&cpu_buffer->resize_disabled); + + /* + * Lock all readers to block any page swap until the page IDs are + * assigned. + */ + raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); + + rb_setup_ids_meta_page(cpu_buffer, bpage_ids); + /* + * Ensure rb_update_meta() will observe the meta-page before + * cpu_buffer->mapped. + */ + smp_wmb(); + WRITE_ONCE(cpu_buffer->mapped, 1); + + /* Init meta_page values unless the writer did it already */ + cmpxchg(&cpu_buffer->meta_page->entries, 0, + local_read(&cpu_buffer->entries)); + cmpxchg(&cpu_buffer->meta_page->overrun, 0, + local_read(&cpu_buffer->overrun)); + cmpxchg(&cpu_buffer->meta_page->bpages_touched, 0, + local_read(&cpu_buffer->pages_touched)); + cmpxchg(&cpu_buffer->meta_page->bpages_lost, 0, + local_read(&cpu_buffer->pages_lost)); + + raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); +unlock: + mutex_unlock(&buffer->mutex); + mutex_unlock(&cpu_buffer->mapping_lock); + + return err; +} + +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu) +{ + struct ring_buffer_per_cpu *cpu_buffer; + int err =3D 0; + + if (!cpumask_test_cpu(cpu, buffer->cpumask)) + return -EINVAL; + + cpu_buffer =3D buffer->buffers[cpu]; + + mutex_lock(&cpu_buffer->mapping_lock); + + if (!cpu_buffer->mapped) { + err =3D -ENODEV; + goto unlock; + } + + WRITE_ONCE(cpu_buffer->mapped, cpu_buffer->mapped - 1); + if (!cpu_buffer->mapped) { + /* Wait for the writer and readers to observe !mapped */ + synchronize_rcu(); + + rb_free_bpage_ids(cpu_buffer); + rb_free_meta_page(cpu_buffer); + atomic_dec(&cpu_buffer->resize_disabled); + } + +unlock: + mutex_unlock(&cpu_buffer->mapping_lock); + + return err; +} + +/* + * +--------------+ pgoff =3D=3D 0 + * | meta page | + * +--------------+ pgoff =3D=3D 1 + * | bpage 0 | + * +--------------+ pgoff =3D=3D 1 + (1 << bpage_order) + * | bpage 1 | + * ... 
+ */
+struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
+				   unsigned long pgoff)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
+	unsigned long bpage_id, bpage_offset, addr;
+	struct page *page;
+
+	if (!pgoff)
+		return virt_to_page((void *)cpu_buffer->meta_page);
+
+	pgoff--;
+
+	bpage_id = pgoff >> buffer->bpage_order;
+	if (bpage_id > cpu_buffer->nr_pages)
+		return NULL;
+
+	bpage_offset = pgoff & ((1UL << buffer->bpage_order) - 1);
+	addr = cpu_buffer->bpage_ids[bpage_id] + (bpage_offset * PAGE_SIZE);
+	page = virt_to_page((void *)addr);
+
+	return page;
+}
+
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long reader_size;
+	unsigned long flags;
+
+	cpu_buffer = rb_get_mapped_buffer(buffer, cpu);
+	if (IS_ERR(cpu_buffer))
+		return (int)PTR_ERR(cpu_buffer);
+
+	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+consume:
+	if (rb_per_cpu_empty(cpu_buffer))
+		goto out;
+
+	reader_size = rb_page_size(cpu_buffer->reader_page);
+
+	/*
+	 * There are data to be read on the current reader page, we can
+	 * return to the caller. But before that, we assume the latter will read
+	 * everything. Let's update the kernel reader accordingly.
+	 */
+	if (cpu_buffer->reader_page->read < reader_size) {
+		while (cpu_buffer->reader_page->read < reader_size)
+			rb_advance_reader(cpu_buffer);
+		goto out;
+	}
+
+	if (WARN_ON(!rb_get_reader_page(cpu_buffer)))
+		goto out;
+
+	goto consume;
+out:
+	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+	rb_put_mapped_buffer(cpu_buffer);
+
+	return 0;
+}
+
 /*
  * We only allocate new buffers, never free them if the CPU goes down.
  * If we were to free the buffer, then the user would lose any trace that was in
-- 
2.43.0.472.g3155946c3a-goog

From nobody Fri Dec 19 09:43:54 2025
Date: Mon, 18 Dec 2023 15:14:51 +0000
In-Reply-To: <20231218151451.944907-1-vdonnefort@google.com>
References: <20231218151451.944907-1-vdonnefort@google.com>
Message-ID: <20231218151451.944907-3-vdonnefort@google.com>
Subject: [PATCH v7 2/2] tracing: Allow user-space mapping of the ring-buffer
From: Vincent Donnefort
To: rostedt@goodmis.org, mhiramat@kernel.org, linux-kernel@vger.kernel.org,
 linux-trace-kernel@vger.kernel.org
Cc: kernel-team@android.com, Vincent Donnefort

Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution to that problem is to let user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each bpage
constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * buffer-page ID 0
  * buffer-page ID 1
    ...

It is therefore easy to translate a buffer page ID into an offset in
the mapping:

  reader_id = meta->reader.id;
  reader_offset = meta->meta_page_size + reader_id * meta->bpage_size;

When new data is available, the mapper must call the newly introduced
ioctl TRACE_MMAP_IOCTL_GET_READER. It updates the meta-page reader ID
to point to the next reader bpage containing unread data.
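To make the flow concrete, here is a minimal user-space sketch. It is
only an illustration: the tracefs path is an assumption, error handling
is trimmed, and only the meta-page fields and the ioctl above come from
this series.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <linux/trace_mmap.h>

int main(void)
{
	/* Assumed tracefs location; any per-CPU trace_pipe_raw file works */
	int fd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw",
		      O_RDONLY | O_NONBLOCK);
	struct trace_buffer_meta *meta;
	unsigned long data_len, reader_offset;
	void *base;

	if (fd < 0)
		return 1;

	/* Map the meta-page alone first, to learn the geometry of the buffer */
	meta = mmap(NULL, getpagesize(), PROT_READ, MAP_SHARED, fd, 0);
	if (meta == MAP_FAILED)
		return 1;

	data_len = (unsigned long)meta->nr_bpages * meta->bpage_size;

	/* Then map the meta-page plus every bpage in a single run */
	base = mmap(NULL, meta->meta_page_size + data_len, PROT_READ, MAP_SHARED,
		    fd, 0);
	if (base == MAP_FAILED)
		return 1;

	/* Ask the kernel to put unread data on the reader bpage */
	if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
		return 1;

	reader_offset = meta->meta_page_size +
			(unsigned long)meta->reader.id * meta->bpage_size;
	printf("reader bpage at +0x%lx, %u bytes already read\n",
	       reader_offset, meta->reader.read);

	munmap(base, meta->meta_page_size + data_len);
	munmap(meta, getpagesize());
	close(fd);

	return 0;
}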
Signed-off-by: Vincent Donnefort diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mma= p.h index 9536f0b7c094..e44563cf5ede 100644 --- a/include/uapi/linux/trace_mmap.h +++ b/include/uapi/linux/trace_mmap.h @@ -26,4 +26,6 @@ struct trace_buffer_meta { __u32 meta_struct_len; /* Len of this struct */ }; =20 +#define TRACE_MMAP_IOCTL_GET_READER _IO('T', 0x1) + #endif /* _UAPI_TRACE_MMAP_H_ */ diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index c17dd849e6f1..0a4927e56315 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -8590,15 +8590,31 @@ tracing_buffers_splice_read(struct file *file, loff= _t *ppos, return ret; } =20 -/* An ioctl call with cmd 0 to the ring buffer file will wake up all waite= rs */ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, uns= igned long arg) { struct ftrace_buffer_info *info =3D file->private_data; struct trace_iterator *iter =3D &info->iter; + int err; =20 - if (cmd) - return -ENOIOCTLCMD; + if (cmd =3D=3D TRACE_MMAP_IOCTL_GET_READER) { + if (!(file->f_flags & O_NONBLOCK)) { + err =3D ring_buffer_wait(iter->array_buffer->buffer, + iter->cpu_file, + iter->tr->buffer_percent); + if (err) + return err; + } =20 + return ring_buffer_map_get_reader(iter->array_buffer->buffer, + iter->cpu_file); + } else if (cmd) { + return -ENOTTY; + } + + /* + * An ioctl call with cmd 0 to the ring buffer file will wake up all + * waiters + */ mutex_lock(&trace_types_lock); =20 iter->wait_index++; @@ -8611,6 +8627,62 @@ static long tracing_buffers_ioctl(struct file *file,= unsigned int cmd, unsigned return 0; } =20 +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf) +{ + struct ftrace_buffer_info *info =3D vmf->vma->vm_file->private_data; + struct trace_iterator *iter =3D &info->iter; + vm_fault_t ret =3D VM_FAULT_SIGBUS; + struct page *page; + + page =3D ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file, + vmf->pgoff); + if (!page) + return ret; + + get_page(page); + vmf->page =3D page; + vmf->page->mapping =3D vmf->vma->vm_file->f_mapping; + vmf->page->index =3D vmf->pgoff; + + return 0; +} + +static void tracing_buffers_mmap_close(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info =3D vma->vm_file->private_data; + struct trace_iterator *iter =3D &info->iter; + + ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file); +} + +static void tracing_buffers_mmap_open(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info =3D vma->vm_file->private_data; + struct trace_iterator *iter =3D &info->iter; + + WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file)); +} + +static const struct vm_operations_struct tracing_buffers_vmops =3D { + .open =3D tracing_buffers_mmap_open, + .close =3D tracing_buffers_mmap_close, + .fault =3D tracing_buffers_mmap_fault, +}; + +static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *= vma) +{ + struct ftrace_buffer_info *info =3D filp->private_data; + struct trace_iterator *iter =3D &info->iter; + + if (vma->vm_flags & VM_WRITE) + return -EPERM; + + vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE); + vma->vm_ops =3D &tracing_buffers_vmops; + + return ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file); +} + static const struct file_operations tracing_buffers_fops =3D { .open =3D tracing_buffers_open, .read =3D tracing_buffers_read, @@ -8619,6 +8691,7 @@ static const struct file_operations tracing_buffers_f= ops =3D { .splice_read =3D tracing_buffers_splice_read, .unlocked_ioctl =3D 
tracing_buffers_ioctl,
 	.llseek		= no_llseek,
+	.mmap		= tracing_buffers_mmap,
 };
 
 static ssize_t
-- 
2.43.0.472.g3155946c3a-goog
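For completeness, a hypothetical helper showing what the mapped
meta-page exposes once the pieces above are in place (the field names
come from include/uapi/linux/trace_mmap.h introduced in patch 1/2; the
helper itself is not part of the series):

#include <stdio.h>
#include <linux/trace_mmap.h>

/* Hypothetical helper: dump the statistics published through the meta-page. */
static void print_meta_stats(const struct trace_buffer_meta *meta)
{
	printf("entries: %lu overrun: %lu read: %lu\n",
	       meta->entries, meta->overrun, meta->read);
	printf("bpages touched: %lu lost: %lu read: %lu\n",
	       meta->bpages_touched, meta->bpages_lost, meta->bpages_read);
	printf("reader: id=%u read=%u lost_events=%lu\n",
	       meta->reader.id, meta->reader.read, meta->reader.lost_events);
}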