Motivation: improve observability in production by providing subsystems with
a logger that keeps up with their verbose unstructured logs and aggregates
logs at the process context level, akin to userspace TLS.
Binary LOGging (BLOG) introduces a task-local logging context: each context
owns a single 512 KiB fragment that cycles through “ready → in use → queued for
readers → reset → ready” without re-entering the allocator. Writers copy the
raw parameters they already have; readers format them later when the log is
inspected.
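The "copy raw parameters now, format later" split can be illustrated with a
minimal userspace sketch. This is not the BLOG code: every `demo_` name, the
record layout, and the format table are assumptions made up for illustration.
The writer stores only a call-site index and the raw 64-bit arguments; the
reader looks up the format string and does the string formatting.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MAX_ARGS 3

/* Hypothetical record header: which call site logged, and how many args. */
struct demo_rec {
	uint32_t site_id;	/* index into demo_formats[] */
	uint32_t nargs;		/* number of 64-bit arguments that follow */
};

/* Format strings live in a table; the writer only stores their index. */
static const char *const demo_formats[] = {
	"open ino=%llu flags=%llu",
	"write ino=%llu off=%llu len=%llu",
};
#define DEMO_NSITES (sizeof(demo_formats) / sizeof(demo_formats[0]))

/*
 * Writer: copy the raw arguments into buf at offset off. No string
 * formatting happens here. Returns bytes written, or 0 if the record
 * does not fit.
 */
static size_t demo_write(uint8_t *buf, size_t cap, size_t off,
			 uint32_t site_id, const uint64_t *args, uint32_t nargs)
{
	struct demo_rec rec = { .site_id = site_id, .nargs = nargs };
	size_t need = sizeof(rec) + (size_t)nargs * sizeof(uint64_t);

	assert(site_id < DEMO_NSITES && nargs <= DEMO_MAX_ARGS);
	if (off > cap || need > cap - off)
		return 0;
	memcpy(buf + off, &rec, sizeof(rec));
	memcpy(buf + off + sizeof(rec), args, (size_t)nargs * sizeof(uint64_t));
	return need;
}

/*
 * Reader: decode the record at offset off and format it into out.
 * Returns bytes consumed, or 0 if the record header is invalid.
 */
static size_t demo_read(const uint8_t *buf, size_t off, char *out, size_t outsz)
{
	struct demo_rec rec;
	uint64_t raw[DEMO_MAX_ARGS] = { 0 };

	memcpy(&rec, buf + off, sizeof(rec));
	if (rec.site_id >= DEMO_NSITES || rec.nargs > DEMO_MAX_ARGS)
		return 0;
	memcpy(raw, buf + off + sizeof(rec), rec.nargs * sizeof(uint64_t));
	/* Arguments beyond what the format consumes are ignored by printf. */
	snprintf(out, outsz, demo_formats[rec.site_id],
		 (unsigned long long)raw[0],
		 (unsigned long long)raw[1],
		 (unsigned long long)raw[2]);
	return sizeof(rec) + rec.nargs * sizeof(uint64_t);
}
```

In this sketch the hot path is two `memcpy()` calls; the cost of `snprintf()`
is paid only when somebody reads the log.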
BLOG borrows ideas from ftrace (capture binary data now, format later) but
unlike ftrace there is no global ring. Each module registers its own logger,
manages its own buffers, and keeps the state small enough for production use.
To host the per-module pointers we extend `struct task_struct` with one
additional `void *`, in line with other task extensions already in the kernel.
Each module keeps independent batches: `alloc_batch` for contexts with
refcount 0 and `log_batch` for contexts that have been filled and are waiting
for readers. The batching layer and buffer management were migrated from the
existing Ceph SAN logging code, so the behaviour is battle-tested; we simply
made the buffer inline so every composite stays within a single 512 KiB
allocation.
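The two-batch recycling described above can be sketched in userspace C as
follows. This is an illustration under stated assumptions, not the series'
implementation: the `demo_` types, the explicit state enum, and the singly
linked lists are hypothetical, and the locking a real implementation would
need is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states matching "ready -> in use -> queued -> reset -> ready". */
enum demo_state { DEMO_READY, DEMO_IN_USE, DEMO_QUEUED };

struct demo_ctx {
	enum demo_state state;
	size_t used;		/* bytes logged into this context */
	struct demo_ctx *next;
};

/* A batch is modelled here as a simple LIFO list of contexts. */
struct demo_batch {
	struct demo_ctx *head;
	size_t count;
};

struct demo_logger {
	struct demo_batch alloc_batch;	/* contexts ready for reuse */
	struct demo_batch log_batch;	/* filled contexts awaiting readers */
};

static void demo_batch_put(struct demo_batch *b, struct demo_ctx *c)
{
	c->next = b->head;
	b->head = c;
	b->count++;
}

static struct demo_ctx *demo_batch_get(struct demo_batch *b)
{
	struct demo_ctx *c = b->head;

	if (!c)
		return NULL;
	b->head = c->next;
	c->next = NULL;
	b->count--;
	return c;
}

/* ready -> in use: take a context from alloc_batch, no allocator call. */
static struct demo_ctx *demo_acquire(struct demo_logger *l)
{
	struct demo_ctx *c = demo_batch_get(&l->alloc_batch);

	if (c) {
		assert(c->state == DEMO_READY);
		c->state = DEMO_IN_USE;
	}
	return c;
}

/* in use -> queued for readers. */
static void demo_submit(struct demo_logger *l, struct demo_ctx *c)
{
	assert(c->state == DEMO_IN_USE);
	c->state = DEMO_QUEUED;
	demo_batch_put(&l->log_batch, c);
}

/* queued -> reset -> ready: returns 1 if a context was drained, else 0. */
static int demo_drain_one(struct demo_logger *l)
{
	struct demo_ctx *c = demo_batch_get(&l->log_batch);

	if (!c)
		return 0;
	/* A reader would consume c->used bytes here before the reset. */
	c->used = 0;
	c->state = DEMO_READY;
	demo_batch_put(&l->alloc_batch, c);
	return 1;
}
```

A caller would seed `alloc_batch` once with preallocated contexts; after that
every transition only moves a context between the two lists.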
The patch series lands the BLOG library first, then wires the task lifecycle,
and finally switches Ceph’s `bout*` logging macros to BLOG so we exercise the
new path.
Patch summary:
1. sched, fork: wire BLOG contexts into task lifecycle
- Adds `struct blog_tls_ctx *blog_contexts[BLOG_MAX_MODULES]` to
`struct task_struct`.
- Fork/exit paths initialise and recycle contexts automatically.
2. lib: introduce BLOG (Binary LOGging) subsystem
- Adds `lib/blog/` sources and headers under `include/linux/blog/`.
- Each composite (`struct blog_tls_pagefrag`) consists of the TLS
metadata, the pagefrag state, and an inline buffer sized at
`BLOG_PAGEFRAG_SIZE - sizeof(struct blog_tls_pagefrag)`.
3. ceph: add BLOG scaffolding
- Introduces `include/linux/ceph/ceph_blog.h` and `fs/ceph/blog_client.c`.
- Ceph registers a logger and maintains a client-ID map for the reader
callback.
4. ceph: add BLOG debugfs support
- Adds `fs/ceph/blog_debugfs.c` so filled contexts can be drained.
5. ceph: activate BLOG logging
- Switches `bout*` macros to BLOG, making Ceph the first consumer.
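The inline-buffer sizing mentioned under patch 2 can be sketched as below. The
struct fields are stand-ins and every `demo_` name is hypothetical; only the
"fixed allocation size minus header size" arithmetic is taken from the
description above, and `malloc()` stands in for whatever allocator the series
uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_PAGEFRAG_SIZE (512u * 1024u)	/* one 512 KiB allocation */

/* Hypothetical composite: metadata, fragment state, then the inline buffer. */
struct demo_composite {
	unsigned long task_id;	/* stand-in for the TLS metadata */
	size_t head;		/* stand-in for the pagefrag state */
	unsigned char buf[];	/* inline buffer: the rest of the allocation */
};

/* Usable bytes = allocation size minus the header that shares it. */
#define DEMO_INLINE_BYTES (DEMO_PAGEFRAG_SIZE - sizeof(struct demo_composite))

/* One allocation covers both the header and the buffer. */
static struct demo_composite *demo_composite_alloc(void)
{
	struct demo_composite *c = malloc(DEMO_PAGEFRAG_SIZE);

	if (c) {
		c->task_id = 0;
		c->head = 0;
	}
	return c;
}

/* Reserve len bytes from the inline buffer; NULL when it does not fit. */
static void *demo_reserve(struct demo_composite *c, size_t len)
{
	void *p;

	if (len > DEMO_INLINE_BYTES - c->head)
		return NULL;
	p = c->buf + c->head;
	c->head += len;
	return p;
}
```

Keeping header and buffer in one allocation means a full composite never
exceeds `DEMO_PAGEFRAG_SIZE`, which is the property the description relies on.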
With these patches, Ceph now writes its verbose logging to task-local buffers
managed by BLOG, and the infrastructure is ready for other subsystems that need
allocation-free, module-owned logging.
Alex Markuze (5):
sched, fork: Wire BLOG contexts into task lifecycle
lib: Introduce BLOG (Binary LOGging) subsystem
ceph: Add BLOG scaffolding
ceph: Add BLOG debugfs support
ceph: Activate BLOG logging
fs/ceph/Makefile | 2 +
fs/ceph/addr.c | 130 ++---
fs/ceph/blog_client.c | 244 +++++++++
fs/ceph/blog_debugfs.c | 361 +++++++++++++
fs/ceph/caps.c | 242 ++++-----
fs/ceph/crypto.c | 18 +-
fs/ceph/debugfs.c | 33 +-
fs/ceph/dir.c | 88 ++--
fs/ceph/export.c | 20 +-
fs/ceph/file.c | 130 ++---
fs/ceph/inode.c | 182 +++----
fs/ceph/ioctl.c | 6 +-
fs/ceph/locks.c | 22 +-
fs/ceph/mds_client.c | 278 +++++-----
fs/ceph/mdsmap.c | 8 +-
fs/ceph/quota.c | 2 +-
fs/ceph/snap.c | 66 +--
fs/ceph/super.c | 82 +--
fs/ceph/xattr.c | 42 +-
include/linux/blog/blog.h | 515 +++++++++++++++++++
include/linux/blog/blog_batch.h | 54 ++
include/linux/blog/blog_des.h | 46 ++
include/linux/blog/blog_module.h | 329 ++++++++++++
include/linux/blog/blog_pagefrag.h | 33 ++
include/linux/blog/blog_ser.h | 275 ++++++++++
include/linux/ceph/ceph_blog.h | 124 +++++
include/linux/ceph/ceph_debug.h | 6 +-
include/linux/sched.h | 7 +
kernel/fork.c | 37 ++
lib/Kconfig | 2 +
lib/Makefile | 2 +
lib/blog/Kconfig | 56 +++
lib/blog/Makefile | 15 +
lib/blog/blog_batch.c | 311 ++++++++++++
lib/blog/blog_core.c | 772 ++++++++++++++++++++++++++++
lib/blog/blog_des.c | 385 ++++++++++++++
lib/blog/blog_module.c | 781 +++++++++++++++++++++++++++++
lib/blog/blog_pagefrag.c | 124 +++++
38 files changed, 5163 insertions(+), 667 deletions(-)
create mode 100644 fs/ceph/blog_client.c
create mode 100644 fs/ceph/blog_debugfs.c
create mode 100644 include/linux/blog/blog.h
create mode 100644 include/linux/blog/blog_batch.h
create mode 100644 include/linux/blog/blog_des.h
create mode 100644 include/linux/blog/blog_module.h
create mode 100644 include/linux/blog/blog_pagefrag.h
create mode 100644 include/linux/blog/blog_ser.h
create mode 100644 include/linux/ceph/ceph_blog.h
create mode 100644 lib/blog/Kconfig
create mode 100644 lib/blog/Makefile
create mode 100644 lib/blog/blog_batch.c
create mode 100644 lib/blog/blog_core.c
create mode 100644 lib/blog/blog_des.c
create mode 100644 lib/blog/blog_module.c
create mode 100644 lib/blog/blog_pagefrag.c
--
2.34.1
On Fri, 24 Oct 2025 08:42:54 +0000
Alex Markuze <amarkuze@redhat.com> wrote:

> Motivation: improve observability in production by providing subsystemsawith
> a logger that keeps up with their verbouse unstructured logs and aggregating
> logs at the process context level, akin to userspace TLS.
>

I still don't understand the motivation behind this.

What exactly is this doing that the current tracing infrastructure can't do?

-- Steve
First of all, Ftrace is for debugging and development; you won't see
components or kernel modules run in production with ftrace enabled.
The main motivation is to have verbose logging that is usable for
production systems.
The second improvement is that the logs have a struct task hook which
facilitates better logging association between the kernel log and the
user process.
It's especially handy when debugging FS systems.

Specifically we had several bugs reported from the field that we could
not make progress on without additional logs.

Re: MM folks, apologies for including unrelated people, the only
change is the addition of a field in struct task.

On Fri, Oct 24, 2025 at 8:52 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 24 Oct 2025 08:42:54 +0000
> Alex Markuze <amarkuze@redhat.com> wrote:
>
> > Motivation: improve observability in production by providing subsystemsawith
> > a logger that keeps up with their verbouse unstructured logs and aggregating
> > logs at the process context level, akin to userspace TLS.
> >
>
> I still don't understand the motivation behind this.
>
> What exactly is this doing that the current tracing infrastructure can't do?
>
> -- Steve
>
On Sat, 25 Oct 2025 13:50:39 +0300
Alex Markuze <amarkuze@redhat.com> wrote:

> First of all, Ftrace is for debugging and development; you won't see
> components or kernel modules run in production with ftrace enabled.
> The main motivation is to have verbose logging that is usable for
> production systems.

That is totally untrue. Several production environments use ftrace. We
have it enabled and used in Chromebooks and in Android. Google servers
also have it enabled.

> The second improvement is that the logs have a struct task hook which
> facilitates better logging association between the kernel log and the
> user process.
> It's especially handy when debugging FS systems.

So this is for use with debugging too?

>
> Specifically we had several bugs reported from the field that we could
> not make progress on without additional logs.

This still doesn't answer my question about not using ftrace. Heck,
when I worked for Red Hat, we used ftrace to debug production
environments. Did that change?

-- Steve
Please correct me if I am wrong, I was not aware that ftrace is used
by any kernel component as the default unstructured logger.
This is the point of BLog, having a low impact unstructured logger,
it's not always possible or easy to provide a debug kernel where
ftrace is both enabled and used for dumping logs.
Having an always-on binary logger facilitates better debuggability.
When anything happens, a client with BLog has the option to send a
large log file with their report.
An additional benefit is that each logging buffer is attached to the
associated tasks and the whole module has its own separate cyclical
log buffer.

On Sat, Oct 25, 2025 at 5:59 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Sat, 25 Oct 2025 13:50:39 +0300
> Alex Markuze <amarkuze@redhat.com> wrote:
>
> > First of all, Ftrace is for debugging and development; you won't see
> > components or kernel modules run in production with ftrace enabled.
> > The main motivation is to have verbose logging that is usable for
> > production systems.
>
> That is totally untrue. Several production environments use ftrace. We
> have it enabled and used in Chromebooks and in Android. Google servers
> also have it enabled.
>
>
> > The second improvement is that the logs have a struct task hook which
> > facilitates better logging association between the kernel log and the
> > user process.
> > It's especially handy when debugging FS systems.
>
> So this is for use with debugging too?
>
> >
> > Specifically we had several bugs reported from the field that we could
> > not make progress on without additional logs.
>
> This still doesn't answer my question about not using ftrace. Heck,
> when I worked for Red Hat, we used ftrace to debug production
> environments. Did that change?
>
> -- Steve
>
On Sat, 25 Oct 2025 20:54:00 +0300
Alex Markuze <amarkuze@redhat.com> wrote:

> Please correct me if I am wrong, I was not aware that ftrace is used
> by any kernel component as the default unstructured logger.
> This is the point of BLog, having a low impact unstructured logger,
> it's not always possible or easy to provide a debug kernel where
> ftarce is both enabled and used for dumping logs.
> Having an always-on binary logger facilitates better debuggability.
> When anything happens, a client with BLog has the option to send a
> large log file with their report.
> An additional benefit is that each logging buffer is attached to the
> associated tasks and the whole module has its own separate cyclical
> log buffer.

This looks like a very specific solution trying to be a little more
generic. The more generic a solution becomes, the more "bloated" it
becomes as well. That's the nature of tracers and loggers.

Ftrace was designed to be very generic, and yes, it can be more bloated
because of that. But it is also designed to be tuned down to be a highly
efficient tracer. One that can be used in a production environment.
Sure, if you enable every event, it will cause a noticeable overhead,
but ftrace is designed to surgically pick which events should be enabled
or not, keeping the overhead within the noise.

Ftrace is more of an "infrastructure" than a tool. It provides access to
trace almost every function, but you can use that same code to implement
live kernel patching or BPF hooks to functions. The trace event and
tracepoints are part of ftrace, and are used for things other than
tracing.

Perhaps it may be more suitable to make BLOG use the tracefs interface,
than to create an entirely new interface based on debugfs (which BTW, a
lot of production systems do not enable debugfs which is why ftrace uses
its own tracefs that does not depend on it).

-- Steve
On 24.10.25 10:42, Alex Markuze wrote:
> Motivation: improve observability in production by providing subsystemsawith
> a logger that keeps up with their verbouse unstructured logs and aggregating
> logs at the process context level, akin to userspace TLS.
>
> Binary LOGging (BLOG) introduces a task-local logging context: each context
> owns a single 512 KiB fragment that cycles through “ready → in use → queued for
> readers → reset → ready” without re-entering the allocator. Writers copy the
> raw parameters they already have; readers format them later when the log is
> inspected.
>
> BLOG borrows ideas from ftrace (captureabinary data now, format later) but
> unlike ftrace there is no global ring. Each module registers its own logger,
> manages its own buffers, and keeps the state small enough for production use.
>
> To host the per-module pointers we extend `struct task_struct` with one
> additional `void *`, in line with other task extensions already in the kernel.
> Each module keeps independent batches: `alloc_batch` for contexts with
> refcount 0 and `log_batch` for contexts that have been filled and are waiting
> for readers. The batching layer and buffer management were migrated from the
> existing Ceph SAN logging code, so the behaviour is battle-tested; we simply
> made the buffer inline so every composite stays within a single 512 KiB
> allocation.
>
> The patch series lands the BLOG library first, then wires the task lifecycle,
> and finally switches Ceph’s `bout*` logging macros to BLOG so we exercise the
> new path.
>
> Patch summary:
> 1. sched, fork: wire BLOG contexts into task lifecycle
> - Adds `struct blog_tls_ctx *blog_contexts[BLOG_MAX_MODULES]` to
> `struct task_struct`.
> - Fork/exit paths initialise and recycle contexts automatically.
>
> 2. lib: introduce BLOG (Binary LOGging) subsystem
> - Adds `lib/blog/` sources and headers under `include/linux/blog/`.
> - Each composite (`struct blog_tls_pagefrag`) consists of the TLS
> metadata, the pagefrag state, and an inline buffer sized at
> `BLOG_PAGEFRAG_SIZE - sizeof(struct blog_tls_pagefrag)`.
>
> 3. ceph: add BLOG scaffolding
> - Introduces `include/linux/ceph/ceph_blog.h` and `fs/ceph/blog_client.c`.
> - Ceph registers a logger and maintains a client-ID map for the reader
> callback.
>
> 4. ceph: add BLOG debugfs support
> - Adds `fs/ceph/blog_debugfs.c` so filled contexts can be drained.
>
> 5. ceph: activate BLOG logging
> - Switches `bout*` macros to BLOG, making Ceph the first consumer.

Hi!

You CCed plenty of MM folks, and I am sure most of them observe "this
doesn't seem to touch any core-mm files" and wonder "what's hiding in
there that requires a pair of MM eyes".

Is there anything specific we should be looking at (and if so, in which
patch)?

--
Cheers

David / dhildenb
On Fri, 2025-10-24 at 08:42 +0000, Alex Markuze wrote:

It probably makes sense to propose this as a topic for the LSF/MM/BPF
conference, because it may not be easy to convince people. As far as I
can see, the motivation doesn't contain enough explanation of the
benefits, benchmarking results, or a comparison with already existing
infrastructure. A clear explanation of these points could be a good step
towards convincing people to try and adopt the new infrastructure.

> Motivation: improve observability in production by providing subsystemsawith

"subsystemsawith" -> subsystem with?

> a logger that keeps up with their verbouse unstructured logs and aggregating
> logs at the process context level, akin to userspace TLS.
>
> Binary LOGging (BLOG) introduces a task-local logging context: each context
> owns a single 512 KiB fragment that cycles through “ready → in use → queued for

Why exactly 512 KiB? Could it be increased or decreased? Are there any
tuning parameters for the infrastructure? Could the infrastructure "eat"
all of memory if we have a lot of tasks/cores? Is there any danger of
introducing system crashes because of the BLOG subsystem's memory
requirements?

I assume that BLOG's 512 KiB fragment works as a circular buffer. Am I
right here? If so, how long can the recorded history of operations be?
Could new records overwrite information that is needed for issue
analysis?

> readers → reset → ready” without re-entering the allocator. Writers copy the
> raw parameters they already have; readers format them later when the log is
> inspected.
>
> BLOG borrows ideas from ftrace (captureabinary data now, format later) but

"captureabinary" -> capture a binary?

> unlike ftrace there is no global ring. Each module registers its own logger,
> manages its own buffers, and keeps the state small enough for production use.
>
> To host the per-module pointers we extend `struct task_struct` with one
> additional `void *`, in line with other task extensions already in the kernel.
> Each module keeps independent batches: `alloc_batch` for contexts with
> refcount 0 and `log_batch` for contexts that have been filled and are waiting
> for readers. The batching layer and buffer management were migrated from the
> existing Ceph SAN logging code, so the behaviour is battle-tested; we simply

I don't completely follow what you mean by Ceph SAN logging code. Maybe
it makes sense to share a link to it?

> made the buffer inline so every composite stays within a single 512 KiB
> allocation.
>
> The patch series lands the BLOG library first, then wires the task lifecycle,
> and finally switches Ceph’s `bout*` logging macros to BLOG so we exercise the

What do you mean by Ceph’s `bout*` logging macros? Do you mean 'dout*'
here?

Thanks,
Slava.

> new path.
>
> Patch summary:
> 1. sched, fork: wire BLOG contexts into task lifecycle
> - Adds `struct blog_tls_ctx *blog_contexts[BLOG_MAX_MODULES]` to
> `struct task_struct`.
> - Fork/exit paths initialise and recycle contexts automatically.
>
> 2. lib: introduce BLOG (Binary LOGging) subsystem
> - Adds `lib/blog/` sources and headers under `include/linux/blog/`.
> - Each composite (`struct blog_tls_pagefrag`) consists of the TLS
> metadata, the pagefrag state, and an inline buffer sized at
> `BLOG_PAGEFRAG_SIZE - sizeof(struct blog_tls_pagefrag)`.
>
> 3. ceph: add BLOG scaffolding
> - Introduces `include/linux/ceph/ceph_blog.h` and `fs/ceph/blog_client.c`.
> - Ceph registers a logger and maintains a client-ID map for the reader
> callback.
>
> 4. ceph: add BLOG debugfs support
> - Adds `fs/ceph/blog_debugfs.c` so filled contexts can be drained.
>
> 5. ceph: activate BLOG logging
> - Switches `bout*` macros to BLOG, making Ceph the first consumer.
>
> With these patches, Ceph now writes its verbose logging to task-local buffers
> managed by BLOG, and the infrastructure is ready for other subsystems that need
> allocation-free, module-owned logging.
>
> Alex Markuze (5):
> sched, fork: Wire BLOG contexts into task lifecycle
> lib: Introduce BLOG (Binary LOGging) subsystem
> ceph: Add BLOG scaffolding
> ceph: Add BLOG debugfs support
> ceph: Activate BLOG logging
>
> fs/ceph/Makefile | 2 +
> fs/ceph/addr.c | 130 ++---
> fs/ceph/blog_client.c | 244 +++++++++
> fs/ceph/blog_debugfs.c | 361 +++++++++++++
> fs/ceph/caps.c | 242 ++++-----
> fs/ceph/crypto.c | 18 +-
> fs/ceph/debugfs.c | 33 +-
> fs/ceph/dir.c | 88 ++--
> fs/ceph/export.c | 20 +-
> fs/ceph/file.c | 130 ++---
> fs/ceph/inode.c | 182 +++----
> fs/ceph/ioctl.c | 6 +-
> fs/ceph/locks.c | 22 +-
> fs/ceph/mds_client.c | 278 +++++-----
> fs/ceph/mdsmap.c | 8 +-
> fs/ceph/quota.c | 2 +-
> fs/ceph/snap.c | 66 +--
> fs/ceph/super.c | 82 +--
> fs/ceph/xattr.c | 42 +-
> include/linux/blog/blog.h | 515 +++++++++++++++++++
> include/linux/blog/blog_batch.h | 54 ++
> include/linux/blog/blog_des.h | 46 ++
> include/linux/blog/blog_module.h | 329 ++++++++++++
> include/linux/blog/blog_pagefrag.h | 33 ++
> include/linux/blog/blog_ser.h | 275 ++++++++++
> include/linux/ceph/ceph_blog.h | 124 +++++
> include/linux/ceph/ceph_debug.h | 6 +-
> include/linux/sched.h | 7 +
> kernel/fork.c | 37 ++
> lib/Kconfig | 2 +
> lib/Makefile | 2 +
> lib/blog/Kconfig | 56 +++
> lib/blog/Makefile | 15 +
> lib/blog/blog_batch.c | 311 ++++++++++++
> lib/blog/blog_core.c | 772 ++++++++++++++++++++++++++++
> lib/blog/blog_des.c | 385 ++++++++++++++
> lib/blog/blog_module.c | 781 +++++++++++++++++++++++++++++
> lib/blog/blog_pagefrag.c | 124 +++++
> 38 files changed, 5163 insertions(+), 667 deletions(-)
> create mode 100644 fs/ceph/blog_client.c
> create mode 100644 fs/ceph/blog_debugfs.c
> create mode 100644 include/linux/blog/blog.h
> create mode 100644 include/linux/blog/blog_batch.h
> create mode 100644 include/linux/blog/blog_des.h
> create mode 100644 include/linux/blog/blog_module.h
> create mode 100644 include/linux/blog/blog_pagefrag.h
> create mode 100644 include/linux/blog/blog_ser.h
> create mode 100644 include/linux/ceph/ceph_blog.h
> create mode 100644 lib/blog/Kconfig
> create mode 100644 lib/blog/Makefile
> create mode 100644 lib/blog/blog_batch.c
> create mode 100644 lib/blog/blog_core.c
> create mode 100644 lib/blog/blog_des.c
> create mode 100644 lib/blog/blog_module.c
> create mode 100644 lib/blog/blog_pagefrag.c