Now that I have learned that the memory passed back from reserve_mem is part
of the memory managed by the buddy allocator, is merely "reserved", and is
already virtually mapped, the tracing code can simply use phys_to_virt() on
the physical address that is returned to get the virtual mapping for that
memory! (Thanks Mike!)

That makes things much easier, especially since it means that the memory
returned by reserve_mem is no different from the memory retrieved by
page_alloc(). This allows that memory to be memory mapped to user space
no differently than the normal buffer is mapped.

This new series does the following:

- Enforce that the memory mapping is page aligned (both the address and the
  size). If not, it errors out.

- Use phys_to_virt() to get the virtual memory from the addresses returned
  by reserve_mem. Also use free_reserved_area() to give the memory back to
  the buddy allocator when it is freed.

- Treat the buffer allocated via memmap differently. It still needs to
  be virtually mapped (phys_to_virt() cannot be used) and it must not be
  freed nor memory mapped to user space. A new flag is added when a buffer
  is created this way to prevent it from ever being memory mapped to user
  space, and the ref count is upped so that it can never be freed.

- Use vmap_page_range() instead of using kmalloc_array() to create an array
  of struct pages for vmap().

- Use flush_kernel_vmap_range() instead of flush_dcache_folio().

- Allow the reserve_mem persistent ring buffer to be memory mapped.
  There is now no difference in how the memory is mapped to user space;
  only the accounting of which pages are mapped where is updated, as
  the meta data differs between the two.

Note, the first 4 patches make the code a bit more correct, especially
since vunmap() does not give the buffer back to the buddy allocator.
I will be looking to get the first 4 patches into this merge window.

The last patch, which enables memory mapping the persistent ring buffer to
user space, can wait until 6.16.

Changes since v2: https://lore.kernel.org/all/20250331143426.947281958@goodmis.org/

- Basically a full rewrite once I found out that you can get the virtual
  address of the memory returned by reserve_mem via phys_to_virt()!

Steven Rostedt (5):
      tracing: Enforce the persistent ring buffer to be page aligned
      tracing: Have reserve_mem use phys_to_virt() and separate from memmap buffer
      tracing: Use vmap_page_range() to map memmap ring buffer
      ring-buffer: Use flush_kernel_vmap_range() over flush_dcache_folio()
      ring-buffer: Allow reserve_mem persistent ring buffers to be mmapped

----
 Documentation/admin-guide/kernel-parameters.txt |  2 +
 Documentation/trace/debugging.rst               |  2 +
 kernel/trace/ring_buffer.c                      | 54 ++++++++++++++++--
 kernel/trace/trace.c                            | 75 ++++++++++++++++---------
 kernel/trace/trace.h                            |  1 +
 5 files changed, 102 insertions(+), 32 deletions(-)
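
For context on what the last patch enables, here is a rough user-space sketch
of mapping a per-CPU buffer of a boot mapped instance, no differently than a
normal instance. It is only illustrative: it assumes a command line along the
lines of "reserve_mem=20M:4096:trace trace_instance=boot_mapped@trace" and
uses the existing trace_mmap.h UAPI (struct trace_buffer_meta and
TRACE_MMAP_IOCTL_GET_READER); the instance name and path below are examples
and are not introduced by this series.

/*
 * Rough sketch (not part of this series): map the reader meta page of a
 * boot mapped instance from user space.
 * Assumed command line:
 *     reserve_mem=20M:4096:trace trace_instance=boot_mapped@trace
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <linux/trace_mmap.h>  /* struct trace_buffer_meta, TRACE_MMAP_IOCTL_GET_READER */

int main(void)
{
    const char *path =
        "/sys/kernel/tracing/instances/boot_mapped/per_cpu/cpu0/trace_pipe_raw";
    struct trace_buffer_meta *meta;
    long psize = sysconf(_SC_PAGESIZE);
    int fd;

    fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* The first page of the mapping is the meta page shared with the kernel */
    meta = mmap(NULL, psize, PROT_READ, MAP_SHARED, fd, 0);
    if (meta == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Ask the kernel to swap in a new reader sub-buffer, then look at the meta data */
    if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
        perror("ioctl");

    printf("reader id: %u, entries: %llu\n",
           meta->reader.id, (unsigned long long)meta->entries);

    munmap(meta, psize);
    close(fd);
    return 0;
}

Today the mmap() above fails with -ENODEV for a boot mapped instance; with the
last patch applied it succeeds just like it does for a normal instance.
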

Subject: tracing: Enforce the persistent ring buffer to be page aligned

From: Steven Rostedt <rostedt@goodmis.org>

Enforce that the address and the size of the memory used by the persistent
ring buffer are page aligned. Also update the documentation to reflect this
requirement.

Link: https://lore.kernel.org/all/CAHk-=whUOfVucfJRt7E0AH+GV41ELmS4wJqxHDnui6Giddfkzw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 Documentation/admin-guide/kernel-parameters.txt |  2 ++
 Documentation/trace/debugging.rst               |  2 ++
 kernel/trace/trace.c                            | 12 ++++++++++++
 3 files changed, 16 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index XXXXXXX..XXXXXXX 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -XXX,XX +XXX,XX @@
         This is just one of many ways that can clear memory. Make sure your system
         keeps the content of memory across reboots before relying on this option.

+        NB: Both the mapped address and size must be page aligned for the architecture.
+
         See also Documentation/trace/debugging.rst


diff --git a/Documentation/trace/debugging.rst b/Documentation/trace/debugging.rst
index XXXXXXX..XXXXXXX 100644
--- a/Documentation/trace/debugging.rst
+++ b/Documentation/trace/debugging.rst
@@ -XXX,XX +XXX,XX @@ kernel, so only the same kernel is guaranteed to work if the mapping is
 preserved. Switching to a different kernel version may find a different
 layout and mark the buffer as invalid.

+NB: Both the mapped address and size must be page aligned for the architecture.
+
 Using trace_printk() in the boot instance
 -----------------------------------------
 By default, the content of trace_printk() goes into the top level tracing
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index XXXXXXX..XXXXXXX 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -XXX,XX +XXX,XX @@ __init static void enable_instances(void)
         }

         if (start) {
+            /* Start and size must be page aligned */
+            if (start & ~PAGE_MASK) {
+                pr_warn("Tracing: mapping start addr %lx is not page aligned\n",
+                        (unsigned long)start);
+                continue;
+            }
+            if (size & ~PAGE_MASK) {
+                pr_warn("Tracing: mapping size %lx is not page aligned\n",
+                        (unsigned long)size);
+                continue;
+            }
+
             addr = map_pages(start, size);
             if (addr) {
                 pr_info("Tracing: mapped boot instance %s at physical memory %pa of size 0x%lx\n",
-- 
2.47.2

Subject: tracing: Have reserve_mem use phys_to_virt() and separate from memmap buffer

From: Steven Rostedt <rostedt@goodmis.org>

The reserve_mem kernel command line option may pass back a physical
address, but the memory is still part of the normal memory just like
using memblock_reserve() would be. This means that the physical memory
returned by the reserve_mem command line option can be converted directly
to virtual memory by simply using phys_to_virt().

When freeing the buffer allocated by reserve_mem, use free_reserved_area().

Because the persistent ring buffer can also be allocated via the memmap
option, which *is* different from normal memory as it cannot be added back
to the buddy system, it must be treated differently. It still needs to be
virtually mapped to have access to it. It also cannot be freed nor can it
ever be memory mapped to user space.

Create a new trace_array flag called TRACE_ARRAY_FL_MEMMAP which gets set
if the buffer is created by the memmap option, and this will prevent the
buffer from being memory mapped by user space.

Also increment the ref count for memmap'ed buffers so that they can never
be freed.

Link: https://lore.kernel.org/all/Z-wFszhJ_9o4dc8O@kernel.org/

Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 28 ++++++++++++++++++++++------
 kernel/trace/trace.h |  1 +
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index XXXXXXX..XXXXXXX 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -XXX,XX +XXX,XX @@ static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
     struct trace_iterator *iter = &info->iter;
     int ret = 0;

+    /* A memmap'ed buffer is not supported for user space mmap */
+    if (iter->tr->flags & TRACE_ARRAY_FL_MEMMAP)
+        return -ENODEV;
+
     /* Currently the boot mapped buffer is not supported for mmap */
     if (iter->tr->flags & TRACE_ARRAY_FL_BOOT)
         return -ENODEV;
@@ -XXX,XX +XXX,XX @@ static void free_trace_buffers(struct trace_array *tr)
     free_trace_buffer(&tr->max_buffer);
 #endif

-    if (tr->range_addr_start)
-        vunmap((void *)tr->range_addr_start);
+    if (tr->range_addr_start) {
+        void *start = (void *)tr->range_addr_start;
+        void *end = start + tr->range_addr_size;
+
+        free_reserved_area(start, end, 0, tr->range_name);
+    }
 }

 static void init_trace_flags_index(struct trace_array *tr)
@@ -XXX,XX +XXX,XX @@ static inline void do_allocate_snapshot(const char *name) { }
 __init static void enable_instances(void)
 {
     struct trace_array *tr;
+    bool memmap_area = false;
     char *curr_str;
     char *name;
     char *str;
@@ -XXX,XX +XXX,XX @@ __init static void enable_instances(void)
                     name);
                 continue;
             }
+            memmap_area = true;
         } else if (tok) {
             if (!reserve_mem_find_by_name(tok, &start, &size)) {
                 start = 0;
@@ -XXX,XX +XXX,XX @@ __init static void enable_instances(void)
                 continue;
             }

-            addr = map_pages(start, size);
+            if (memmap_area)
+                addr = map_pages(start, size);
+            else
+                addr = (unsigned long)phys_to_virt(start);
             if (addr) {
                 pr_info("Tracing: mapped boot instance %s at physical memory %pa of size 0x%lx\n",
                     name, &start, (unsigned long)size);
@@ -XXX,XX +XXX,XX @@ __init static void enable_instances(void)
         update_printk_trace(tr);

         /*
-         * If start is set, then this is a mapped buffer, and
-         * cannot be deleted by user space, so keep the reference
-         * to it.
+         * memmap'd buffers can not be freed.
          */
+        if (memmap_area) {
+            tr->flags |= TRACE_ARRAY_FL_MEMMAP;
+            tr->ref++;
+        }
+
         if (start) {
             tr->flags |= TRACE_ARRAY_FL_BOOT | TRACE_ARRAY_FL_LAST_BOOT;
             tr->range_name = no_free_ptr(rname);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index XXXXXXX..XXXXXXX 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -XXX,XX +XXX,XX @@ enum {
     TRACE_ARRAY_FL_BOOT        = BIT(1),
     TRACE_ARRAY_FL_LAST_BOOT   = BIT(2),
     TRACE_ARRAY_FL_MOD_INIT    = BIT(3),
+    TRACE_ARRAY_FL_MEMMAP      = BIT(4),
 };

 #ifdef CONFIG_MODULES
-- 
2.47.2
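
To see the core idea of the patch above in isolation, here is a minimal,
illustrative kernel-side sketch (not part of the series) of the
phys_to_virt()/free_reserved_area() pairing it switches to. The "trace"
label is only an example reserve_mem name.

/*
 * Illustrative sketch only: reserve_mem memory is still covered by the
 * kernel's linear mapping, so the physical range it reports can be used
 * directly and later handed back to the buddy allocator.
 */
#include <linux/errno.h>
#include <linux/mm.h>   /* reserve_mem_find_by_name(), free_reserved_area() */
#include <linux/io.h>   /* phys_to_virt() */

static int __init reserved_buffer_example(void)
{
    phys_addr_t start, size;
    void *vaddr;

    if (!reserve_mem_find_by_name("trace", &start, &size))
        return -ENODEV;

    vaddr = phys_to_virt(start);    /* no vmap() needed */

    /* ... use the buffer at vaddr, e.g. hand it to the ring buffer ... */

    /* When done, give the pages back to the buddy allocator: */
    free_reserved_area(vaddr, vaddr + size, 0, "trace");

    return 0;
}
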

Subject: tracing: Use vmap_page_range() to map memmap ring buffer

From: Steven Rostedt <rostedt@goodmis.org>

The code to map the physical memory retrieved by memmap currently
allocates an array of pages to cover the physical memory and then calls
vmap() to map it to a virtual address. Instead of using this temporary
array of struct page descriptors, simply use vmap_page_range(), which can
directly map the contiguous physical memory to a virtual address.

Link: https://lore.kernel.org/all/CAHk-=whUOfVucfJRt7E0AH+GV41ELmS4wJqxHDnui6Giddfkzw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index XXXXXXX..XXXXXXX 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -XXX,XX +XXX,XX @@
 #include <linux/irq_work.h>
 #include <linux/workqueue.h>
 #include <linux/sort.h>
+#include <linux/io.h> /* vmap_page_range() */

 #include <asm/setup.h> /* COMMAND_LINE_SIZE */

@@ -XXX,XX +XXX,XX @@ static int instance_mkdir(const char *name)
     return ret;
 }

-static u64 map_pages(u64 start, u64 size)
+static u64 map_pages(unsigned long start, unsigned long size)
 {
-    struct page **pages;
-    phys_addr_t page_start;
-    unsigned int page_count;
-    unsigned int i;
-    void *vaddr;
+    unsigned long vmap_start, vmap_end;
+    struct vm_struct *area;
+    int ret;

-    page_count = DIV_ROUND_UP(size, PAGE_SIZE);
+    area = get_vm_area(size, VM_IOREMAP);
+    if (!area)
+        return 0;

-    page_start = start;
-    pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL);
-    if (!pages)
-        return 0;
+    vmap_start = (unsigned long) area->addr;
+    vmap_end = vmap_start + size;

-    for (i = 0; i < page_count; i++) {
-        phys_addr_t addr = page_start + i * PAGE_SIZE;
-        pages[i] = pfn_to_page(addr >> PAGE_SHIFT);
-    }
-    vaddr = vmap(pages, page_count, VM_MAP, PAGE_KERNEL);
-    kfree(pages);
+    ret = vmap_page_range(vmap_start, vmap_end,
+                          start, pgprot_nx(PAGE_KERNEL));
+    if (ret < 0) {
+        free_vm_area(area);
+        return 0;
+    }

-    return (u64)(unsigned long)vaddr;
+    return (u64)vmap_start;
 }

 /**
-- 
2.47.2

Subject: ring-buffer: Use flush_kernel_vmap_range() over flush_dcache_folio()

From: Steven Rostedt <rostedt@goodmis.org>

Some architectures do not have data cache coherency between user and
kernel space. For these architectures, the cache needs to be flushed on
both the kernel and user addresses so that user space can see the updates
the kernel has made.

Instead of using flush_dcache_folio() and playing with virt_to_folio()
within the call to that function, use flush_kernel_vmap_range() which
takes the virtual address and does the work for those architectures that
need it.

Link: https://lore.kernel.org/all/CAG48ez3w0my4Rwttbc5tEbNsme6tc0mrSN95thjXUFaJ3aQ6SA@mail.gmail.com/

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index XXXXXXX..XXXXXXX 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -XXX,XX +XXX,XX @@ static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
     meta->read = cpu_buffer->read;

     /* Some archs do not have data cache coherency between kernel and user-space */
-    flush_dcache_folio(virt_to_folio(cpu_buffer->meta_page));
+    flush_kernel_vmap_range(cpu_buffer->meta_page, PAGE_SIZE);
 }

 static void
@@ -XXX,XX +XXX,XX @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)

  out:
     /* Some archs do not have data cache coherency between kernel and user-space */
-    flush_dcache_folio(virt_to_folio(cpu_buffer->reader_page->page));
+    flush_kernel_vmap_range(cpu_buffer->reader_page->page,
+                            buffer->subbuf_size + BUF_PAGE_HDR_SIZE);

     rb_update_meta_page(cpu_buffer);

-- 
2.47.2

Subject: ring-buffer: Allow reserve_mem persistent ring buffers to be mmapped

From: Steven Rostedt <rostedt@goodmis.org>

When the persistent ring buffer is created from the memory returned by
reserve_mem there is nothing prohibiting it from being memory mapped to
user space. The memory is the same as the pages allocated by alloc_page().

The way the memory is managed by the ring buffer code is slightly
different though and needs to be addressed.

The persistent memory uses the page->id for its own purpose whereas the
user mmap buffer currently uses that for the subbuf array mapped to user
space. If the buffer is a persistent buffer, use the page index into that
buffer as the identifier instead of the page->id.

That is, the page->id for a persistent buffer represents the order of the
buffer page in the linked list. ->id == 0 means it is the reader page.
When a reader page is swapped, the new reader page's ->id gets zero, and
the old reader page gets the ->id of the page that it swapped with.

[...]

A new rb_page_id() helper function is used to get and set the id depending
on whether the page is a normal memory allocated buffer or a physical
memory mapped buffer.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 49 ++++++++++++++++++++++++++++++++----
 kernel/trace/trace.c       |  4 ----
 2 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index XXXXXXX..XXXXXXX 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -XXX,XX +XXX,XX @@ static void rb_clear_buffer_page(struct buffer_page *page)
     page->read = 0;
 }

+/*
+ * When the buffer is memory mapped to user space, each sub buffer
+ * has a unique id that is used by the meta data to tell the user
+ * where the current reader page is.
+ *
+ * For a normal allocated ring buffer, the id is saved in the buffer page
+ * id field, and updated via this function.
+ *
+ * But for a fixed memory mapped buffer, the id is already assigned for
+ * fixed memory ordering in the memory layout and can not be used. Instead
+ * the index of where the page lies in the memory layout is used.
+ *
+ * For the normal pages, set the buffer page id with the passed in @id
+ * value and return that.
+ *
+ * For fixed memory mapped pages, get the page index in the memory layout
+ * and return that as the id.
+ */
+static int rb_page_id(struct ring_buffer_per_cpu *cpu_buffer,
+                      struct buffer_page *bpage, int id)
+{
[...]
+                                 cpu_buffer->reader_page->id);
+
     meta->reader.lost_events = cpu_buffer->lost_events;

     meta->entries = local_read(&cpu_buffer->entries);
@@ -XXX,XX +XXX,XX @@ static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
     struct trace_buffer_meta *meta = cpu_buffer->meta_page;
     unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
     struct buffer_page *first_subbuf, *subbuf;
+    int cnt = 0;
[...]
+    WARN_ON(cnt != nr_subbufs);
+
     /* install subbuf ID to kern VA translation */
     cpu_buffer->subbuf_ids = subbuf_ids;

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index XXXXXXX..XXXXXXX 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -XXX,XX +XXX,XX @@ static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
     if (iter->tr->flags & TRACE_ARRAY_FL_MEMMAP)
         return -ENODEV;

-    /* Currently the boot mapped buffer is not supported for mmap */
-    if (iter->tr->flags & TRACE_ARRAY_FL_BOOT)
-        return -ENODEV;
-
     ret = get_snapshot_map(iter->tr);
     if (ret)
         return ret;
-- 
2.47.2