From: Vincent Donnefort <vdonnefort@google.com>
To: rostedt@goodmis.org, mhiramat@kernel.org,
 linux-trace-kernel@vger.kernel.org, maz@kernel.org, oliver.upton@linux.dev
Cc: kvmarm@lists.linux.dev, will@kernel.org, qperret@google.com,
 kernel-team@android.com, linux-kernel@vger.kernel.org,
 Vincent Donnefort <vdonnefort@google.com>
Subject: [PATCH 02/13] ring-buffer: Introducing ring-buffer writer
Date: Wed, 11 Sep 2024 10:30:18 +0100
Message-ID: <20240911093029.3279154-3-vdonnefort@google.com>
In-Reply-To: <20240911093029.3279154-1-vdonnefort@google.com>
References: <20240911093029.3279154-1-vdonnefort@google.com>

A ring-buffer writer is an entity outside of the kernel (most likely a
firmware or a hypervisor) capable of writing events into a ring-buffer
that follows the same format as the tracefs ring-buffer.

To set up the ring-buffer on the kernel side, a description of the
pages (struct trace_page_desc) is necessary, and a callback
(get_reader_page) must also be provided. The callback is invoked
whenever the reader is done with the current reader page and the
writer must install a new one.

The writer is expected to keep the meta-page updated.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

---
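For context, a rough sketch of the intended kernel-side usage (not part
of this patch). All my_* names and the hypercall helpers below are
hypothetical placeholders; only struct ring_buffer_writer and
ring_buffer_reader() come from this series:

	#include <linux/ring_buffer.h>

	/*
	 * Hypothetical driver glue: the writer (say, a hypervisor) owns the
	 * data pages described by my_pdesc and keeps each meta-page updated;
	 * the kernel side only ever reads.
	 */
	extern struct trace_page_desc *my_pdesc;	/* shared with the writer */
	extern int my_hvc_swap_reader_page(int cpu);	/* placeholder hypercall */
	extern int my_hvc_reset(int cpu);		/* placeholder hypercall */

	static struct ring_buffer_writer my_writer = {
		.get_reader_page	= my_hvc_swap_reader_page,
		.reset			= my_hvc_reset,
	};

	static struct trace_buffer *my_trace_buffer;

	static int my_writer_init(void)
	{
		my_writer.pdesc = my_pdesc;

		/* Read-only view over the writer's pages; the size argument is 0 */
		my_trace_buffer = ring_buffer_reader(&my_writer);

		return my_trace_buffer ? 0 : -ENOMEM;
	}

Reading then goes through the usual paths (ring_buffer_read_page() and
friends), which this patch teaches to consult the writer instead of the
kernel-side head page.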
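The struct trace_page_desc blob is a flat array of variable-stride
entries, hence the __first_rb_page_desc()/__next_rb_page_desc()
helpers. A minimal sketch of building such a descriptor, assuming
kzalloc-backed storage, the same number of pages for every CPU, and
that meta_va/page_va[] get filled in once the pages exist:

	#include <linux/ring_buffer.h>
	#include <linux/slab.h>

	static struct trace_page_desc *my_alloc_page_desc(int nr_cpus, int nr_page_va)
	{
		struct trace_page_desc *trace_pdesc;
		struct rb_page_desc *pdesc;
		size_t len;
		int cpu;

		/* One variable-sized rb_page_desc per CPU, packed back to back */
		len = struct_size(pdesc, page_va, nr_page_va);
		trace_pdesc = kzalloc(struct_size(trace_pdesc, __data, nr_cpus * len),
				      GFP_KERNEL);
		if (!trace_pdesc)
			return NULL;

		trace_pdesc->nr_cpus = nr_cpus;

		pdesc = __first_rb_page_desc(trace_pdesc);
		for (cpu = 0; cpu < nr_cpus; cpu++) {
			pdesc->cpu = cpu;
			pdesc->nr_page_va = nr_page_va;	/* excludes the meta page */
			pdesc = __next_rb_page_desc(pdesc);
		}

		return trace_pdesc;
	}

With a constant nr_page_va, rb_page_desc() below can jump straight to
the right entry, and only falls back to the linear search when CPUs are
missing from the descriptor.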
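Nothing on the kernel side runs when the external writer commits an
event, so something must periodically call ring_buffer_poll_writer() to
import the meta-page counters and wake up waiting readers. A sketch
using delayed work and an arbitrary 100ms period (again, my_* names are
hypothetical):

	#include <linux/ring_buffer.h>
	#include <linux/workqueue.h>

	extern struct trace_buffer *my_trace_buffer;

	static void my_poll_fn(struct work_struct *work);
	static DECLARE_DELAYED_WORK(my_poll_work, my_poll_fn);

	static void my_poll_fn(struct work_struct *work)
	{
		/* Refresh entries/overrun/... and wake readers if needed */
		ring_buffer_poll_writer(my_trace_buffer, RING_BUFFER_ALL_CPUS);

		schedule_delayed_work(&my_poll_work, msecs_to_jiffies(100));
	}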
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fd35d4ec12e1..d78a33b3c96e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -83,21 +83,24 @@ u64 ring_buffer_event_time_stamp(struct trace_buffer *buffer,
 void ring_buffer_discard_commit(struct trace_buffer *buffer,
 				struct ring_buffer_event *event);
 
+struct ring_buffer_writer;
+
 /*
  * size is in bytes for each per CPU buffer.
  */
 struct trace_buffer *
-__ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *key);
+__ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *key,
+		    struct ring_buffer_writer *writer);
 
 /*
  * Because the ring buffer is generic, if other users of the ring buffer get
  * traced by ftrace, it can produce lockdep warnings. We need to keep each
  * ring buffer's lock class separate.
  */
-#define ring_buffer_alloc(size, flags)			\
-({							\
-	static struct lock_class_key __key;		\
-	__ring_buffer_alloc((size), (flags), &__key);	\
+#define ring_buffer_alloc(size, flags)				\
+({								\
+	static struct lock_class_key __key;			\
+	__ring_buffer_alloc((size), (flags), &__key, NULL);	\
 })
 
 typedef bool (*ring_buffer_cond_fn)(void *data);
@@ -228,4 +231,54 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
 		    struct vm_area_struct *vma);
 int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
 int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
+
+#define meta_pages_lost(__meta) \
+	((__meta)->Reserved1)
+#define meta_pages_touched(__meta) \
+	((__meta)->Reserved2)
+
+struct rb_page_desc {
+	int		cpu;
+	int		nr_page_va;	/* exclude the meta page */
+	unsigned long	meta_va;
+	unsigned long	page_va[];
+};
+
+struct trace_page_desc {
+	int	nr_cpus;
+	char	__data[];	/* list of rb_page_desc */
+};
+
+static inline
+struct rb_page_desc *__next_rb_page_desc(struct rb_page_desc *pdesc)
+{
+	size_t len = struct_size(pdesc, page_va, pdesc->nr_page_va);
+
+	return (struct rb_page_desc *)((void *)pdesc + len);
+}
+
+static inline
+struct rb_page_desc *__first_rb_page_desc(struct trace_page_desc *trace_pdesc)
+{
+	return (struct rb_page_desc *)(&trace_pdesc->__data[0]);
+}
+
+#define for_each_rb_page_desc(__pdesc, __cpu, __trace_pdesc)		\
+	for (__pdesc = __first_rb_page_desc(__trace_pdesc), __cpu = 0;	\
+	     __cpu < (__trace_pdesc)->nr_cpus;				\
+	     __cpu++, __pdesc = __next_rb_page_desc(__pdesc))
+
+struct ring_buffer_writer {
+	struct trace_page_desc	*pdesc;
+	int			(*get_reader_page)(int cpu);
+	int			(*reset)(int cpu);
+};
+
+int ring_buffer_poll_writer(struct trace_buffer *buffer, int cpu);
+
+#define ring_buffer_reader(writer)				\
+({								\
+	static struct lock_class_key __key;			\
+	__ring_buffer_alloc(0, RB_FL_OVERWRITE, &__key, writer);\
+})
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7abe671effbf..b05b7a95e3f1 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -495,6 +495,8 @@ struct ring_buffer_per_cpu {
 	unsigned long			*subbuf_ids;	/* ID to subbuf VA */
 	struct trace_buffer_meta	*meta_page;
 
+	struct ring_buffer_writer	*writer;
+
 	/* ring buffer pages to update, > 0 to add, < 0 to remove */
 	long				nr_pages_to_update;
 	struct list_head		new_pages; /* new pages to add */
@@ -517,6 +519,8 @@ struct trace_buffer {
 
 	struct ring_buffer_per_cpu	**buffers;
 
+	struct ring_buffer_writer	*writer;
+
 	struct hlist_node		node;
 	u64				(*clock)(void);
 
@@ -1555,6 +1559,42 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 	return -ENOMEM;
 }
 
+static struct rb_page_desc *rb_page_desc(struct trace_page_desc *trace_pdesc,
+					 int cpu)
+{
+	struct rb_page_desc *pdesc;
+	size_t len;
+	int i;
+
+	if (!trace_pdesc)
+		return NULL;
+
+	if (cpu >= trace_pdesc->nr_cpus)
+		return NULL;
+
+	pdesc = __first_rb_page_desc(trace_pdesc);
+	len = struct_size(pdesc, page_va, pdesc->nr_page_va);
+	pdesc = (void *)pdesc + len * cpu;
+
+	if (pdesc->cpu == cpu)
+		return pdesc;
+
+	/* Missing CPUs, fall back to a linear search */
+
+	for_each_rb_page_desc(pdesc, i, trace_pdesc) {
+		if (pdesc->cpu == cpu)
+			return pdesc;
+	}
+
+	return NULL;
+}
+
+static void *rb_page_desc_page(struct rb_page_desc *pdesc, int page_id)
+{
+	return page_id >= pdesc->nr_page_va ?
+	       NULL : (void *)pdesc->page_va[page_id];
+}
+
+
 static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 			     unsigned long nr_pages)
 {
@@ -1614,6 +1654,31 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
 
 	cpu_buffer->reader_page = bpage;
 
+	if (buffer->writer) {
+		struct rb_page_desc *pdesc = rb_page_desc(buffer->writer->pdesc, cpu);
+
+		if (!pdesc)
+			goto fail_free_reader;
+
+		cpu_buffer->writer = buffer->writer;
+		cpu_buffer->meta_page = (struct trace_buffer_meta *)(void *)pdesc->meta_va;
+		cpu_buffer->subbuf_ids = pdesc->page_va;
+		cpu_buffer->nr_pages = pdesc->nr_page_va - 1;
+		atomic_inc(&cpu_buffer->record_disabled);
+		atomic_inc(&cpu_buffer->resize_disabled);
+
+		bpage->page = rb_page_desc_page(pdesc,
+						cpu_buffer->meta_page->reader.id);
+		if (!bpage->page)
+			goto fail_free_reader;
+		/*
+		 * The meta-page can only describe which of the ring-buffer pages
+		 * is the reader. There is no need to init the rest of the
+		 * ring-buffer.
+		 */
+		return cpu_buffer;
+	}
+
 	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | __GFP_ZERO,
 				cpu_buffer->buffer->subbuf_order);
 	if (!page)
@@ -1651,6 +1716,10 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
 
 	irq_work_sync(&cpu_buffer->irq_work.work);
 
+	/* ring_buffers with writer set do not own the data pages */
+	if (cpu_buffer->writer)
+		cpu_buffer->reader_page->page = NULL;
+
 	free_buffer_page(cpu_buffer->reader_page);
 
 	if (head) {
@@ -1681,7 +1750,8 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
  * drop data when the tail hits the head.
  */
 struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
-					 struct lock_class_key *key)
+					 struct lock_class_key *key,
+					 struct ring_buffer_writer *writer)
 {
 	struct trace_buffer *buffer;
 	long nr_pages;
@@ -1709,6 +1779,11 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
 	buffer->flags = flags;
 	buffer->clock = trace_clock_local;
 	buffer->reader_lock_key = key;
+	if (writer) {
+		buffer->writer = writer;
+		/* Writing is external and never done by the kernel */
+		atomic_inc(&buffer->record_disabled);
+	}
 
 	init_irq_work(&buffer->irq_work.work, rb_wake_up_waiters);
 	init_waitqueue_head(&buffer->irq_work.waiters);
@@ -4456,8 +4531,54 @@ rb_update_iter_read_stamp(struct ring_buffer_iter *iter,
 	}
 }
 
+static bool rb_read_writer_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	local_set(&cpu_buffer->entries, READ_ONCE(cpu_buffer->meta_page->entries));
+	local_set(&cpu_buffer->overrun, READ_ONCE(cpu_buffer->meta_page->overrun));
+	local_set(&cpu_buffer->pages_touched, READ_ONCE(meta_pages_touched(cpu_buffer->meta_page)));
+	local_set(&cpu_buffer->pages_lost, READ_ONCE(meta_pages_lost(cpu_buffer->meta_page)));
+	/*
+	 * No need to get the "read" field, it can be tracked here as any
+	 * reader will have to go through a ring_buffer_per_cpu.
+	 */
+
+	return rb_num_of_entries(cpu_buffer);
+}
+
 static struct buffer_page *
-rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
+__rb_get_reader_page_from_writer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	u32 prev_reader;
+
+	if (!rb_read_writer_meta_page(cpu_buffer))
+		return NULL;
+
+	/* More to read on the reader page */
+	if (cpu_buffer->reader_page->read < rb_page_size(cpu_buffer->reader_page))
+		return cpu_buffer->reader_page;
+
+	prev_reader = cpu_buffer->meta_page->reader.id;
+
+	WARN_ON(cpu_buffer->writer->get_reader_page(cpu_buffer->cpu));
+	/* nr_pages doesn't include the reader page */
+	if (cpu_buffer->meta_page->reader.id > cpu_buffer->nr_pages) {
+		WARN_ON(1);
+		return NULL;
+	}
+
+	cpu_buffer->reader_page->page =
+		(void *)cpu_buffer->subbuf_ids[cpu_buffer->meta_page->reader.id];
+	cpu_buffer->reader_page->read = 0;
+	cpu_buffer->read_stamp = cpu_buffer->reader_page->page->time_stamp;
+	cpu_buffer->lost_events = cpu_buffer->meta_page->reader.lost_events;
+
+	WARN_ON(prev_reader == cpu_buffer->meta_page->reader.id);
+
+	return cpu_buffer->reader_page;
+}
+
+static struct buffer_page *
+__rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
 {
 	struct buffer_page *reader = NULL;
 	unsigned long bsize = READ_ONCE(cpu_buffer->buffer->subbuf_size);
@@ -4624,6 +4745,13 @@ rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
 	return reader;
 }
 
+static struct buffer_page *
+rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->writer ? __rb_get_reader_page_from_writer(cpu_buffer) :
+				    __rb_get_reader_page(cpu_buffer);
+}
+
 static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
 {
 	struct ring_buffer_event *event;
@@ -5028,7 +5156,7 @@ ring_buffer_read_prepare(struct trace_buffer *buffer, int cpu, gfp_t flags)
 	struct ring_buffer_per_cpu *cpu_buffer;
 	struct ring_buffer_iter *iter;
 
-	if (!cpumask_test_cpu(cpu, buffer->cpumask))
+	if (!cpumask_test_cpu(cpu, buffer->cpumask) || buffer->writer)
 		return NULL;
 
 	iter = kzalloc(sizeof(*iter), flags);
@@ -5198,6 +5326,22 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
 {
 	struct buffer_page *page;
 
+	if (cpu_buffer->writer) {
+		if (!cpu_buffer->writer->reset)
+			return;
+
+		cpu_buffer->writer->reset(cpu_buffer->cpu);
+		rb_read_writer_meta_page(cpu_buffer);
+
+		/* Read-related values, not covered by the meta-page */
+		local_set(&cpu_buffer->pages_read, 0);
+		cpu_buffer->read = 0;
+		cpu_buffer->read_bytes = 0;
+		cpu_buffer->last_overrun = 0;
+
+		return;
+	}
+
 	rb_head_page_deactivate(cpu_buffer);
 
 	cpu_buffer->head_page
@@ -5428,6 +5572,49 @@ bool ring_buffer_empty_cpu(struct trace_buffer *buffer, int cpu)
 }
 EXPORT_SYMBOL_GPL(ring_buffer_empty_cpu);
 
+int ring_buffer_poll_writer(struct trace_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long flags;
+
+	if (cpu != RING_BUFFER_ALL_CPUS) {
+		if (!cpumask_test_cpu(cpu, buffer->cpumask))
+			return -EINVAL;
+
+		cpu_buffer = buffer->buffers[cpu];
+
+		raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+		if (rb_read_writer_meta_page(cpu_buffer))
+			rb_wakeups(buffer, cpu_buffer);
+		raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+
+		return 0;
+	}
+
+	/*
+	 * Make sure all the ring buffers are up to date before we start reading
+	 * them.
+	 */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+
+		raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+		rb_read_writer_meta_page(buffer->buffers[cpu]);
+		raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+	}
+
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+
+		raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+		if (rb_num_of_entries(cpu_buffer))
+			rb_wakeups(buffer, buffer->buffers[cpu]);
+		raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+	}
+
+	return 0;
+}
+
 #ifdef CONFIG_RING_BUFFER_ALLOW_SWAP
 /**
  * ring_buffer_swap_cpu - swap a CPU buffer between two ring buffers
@@ -5679,6 +5866,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	unsigned int commit;
 	unsigned int read;
 	u64 save_timestamp;
+	bool force_memcpy;
 	int ret = -1;
 
 	if (!cpumask_test_cpu(cpu, buffer->cpumask))
@@ -5716,6 +5904,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	/* Check if any events were dropped */
 	missed_events = cpu_buffer->lost_events;
 
+	force_memcpy = cpu_buffer->mapped || cpu_buffer->writer;
+
 	/*
 	 * If this page has been partially read or
 	 * if len is not big enough to read the rest of the page or
@@ -5725,7 +5915,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	 */
 	if (read || (len < (commit - read)) ||
 	    cpu_buffer->reader_page == cpu_buffer->commit_page ||
-	    cpu_buffer->mapped) {
+	    force_memcpy) {
 		struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
 		unsigned int rpos = read;
 		unsigned int pos = 0;
@@ -6278,7 +6468,7 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
 	unsigned long flags, *subbuf_ids;
 	int err = 0;
 
-	if (!cpumask_test_cpu(cpu, buffer->cpumask))
+	if (!cpumask_test_cpu(cpu, buffer->cpumask) || buffer->writer)
 		return -EINVAL;
 
 	cpu_buffer = buffer->buffers[cpu];
-- 
2.46.0.598.g6f2099f65c-goog