[PATCH v4 00/10] tracing: introducing eventfs

Ajay Kaher posted 10 patches 2 years, 6 months ago
There is a newer version of this series
fs/tracefs/Makefile                           |   1 +
fs/tracefs/event_inode.c                      | 711 ++++++++++++++++++
fs/tracefs/inode.c                            | 124 ++-
fs/tracefs/internal.h                         |  25 +
include/linux/trace_events.h                  |   1 +
include/linux/tracefs.h                       |  32 +
kernel/trace/trace.h                          |   2 +-
kernel/trace/trace_events.c                   |  78 +-
.../ftrace/test.d/kprobe/kprobe_args_char.tc  |   4 +-
.../test.d/kprobe/kprobe_args_string.tc       |   4 +-
10 files changed, 930 insertions(+), 52 deletions(-)
create mode 100644 fs/tracefs/event_inode.c
create mode 100644 fs/tracefs/internal.h
[PATCH v4 00/10] tracing: introducing eventfs
Posted by Ajay Kaher 2 years, 6 months ago
Events Tracing infrastructure contains lot of files, directories
(internally in terms of inodes, dentries). And ends up by consuming
memory in MBs. We can have multiple events of Events Tracing, which
further requires more memory.

Instead of creating inodes/dentries, eventfs could keep meta-data and
skip the creation of inodes/dentries. As and when require, eventfs will
create the inodes/dentries only for required files/directories.
Also eventfs would delete the inodes/dentries once no more requires
but preserve the meta data.

Tracing events took ~9MB, with this approach it took ~4.5MB
for ~10K files/dir.

v3:
Patch 3,4,5,7,9:
         removed all the eventfs_rwsem code and replaced it with an srcu
         lock for the readers, and a mutex to synchronize the writers of
         the list.

Patch 2: moved 'tracefs_inode' and 'get_tracefs()' to v4 03/10

Patch 3: moved the struct eventfs_file and eventfs_inode into event_inode.c
         as it really should not be exposed to all users.

Patch 5: added a recursion check to eventfs_remove_rec() as it is really
         dangerous to have unchecked recursion in the kernel (we do have
         a fixed size stack).

         have the free use srcu callbacks. After the srcu grace periods
         are done, it adds the eventfs_file onto a llist (lockless link
         list) and wakes up a work queue. Then the work queue does the
         freeing (this needs to be done in task/workqueue context, as
         srcu callbacks are done in softirq context).

Patch 6: renamed:
         eventfs_create_file() -> create_file()
         eventfs_create_dir() -> create_dir()

v2:
Patch 01: new patch:'Require all trace events to have a TRACE_SYSTEM'
Patch 02: moved from v1 1/9
Patch 03: moved from v1 2/9
          As suggested by Zheng Yejian, introduced eventfs_prepare_ef()
          helper function to add files or directories to eventfs
          fix WARNING reported by kernel test robot in v1 8/9
Patch 04: moved from v1 3/9
          used eventfs_prepare_ef() to add files
          fix WARNING reported by kernel test robot in v1 8/9
Patch 05: moved from v1 4/9
          fix compiling warning reported by kernel test robot in v1 4/9
Patch 06: moved from v1 5/9
Patch 07: moved from v1 6/9
Patch 08: moved from v1 7/9
Patch 09: moved from v1 8/9
          rebased because of v3 01/10
Patch 10: moved from v1 9/9

v1:
Patch 1: add header file
Patch 2: resolved kernel test robot issues
         protecting eventfs lists using nested eventfs_rwsem
Patch 3: protecting eventfs lists using nested eventfs_rwsem
Patch 4: improve events cleanup code to fix crashes
Patch 5: resolved kernel test robot issues
         removed d_instantiate_anon() calls
Patch 6: resolved kernel test robot issues
         fix kprobe test in eventfs_root_lookup()
         protecting eventfs lists using nested eventfs_rwsem
Patch 7: remove header file
Patch 8: pass eventfs_rwsem as argument to eventfs functions
         called eventfs_remove_events_dir() instead of tracefs_remove()
         from event_trace_del_tracer()
Patch 9: new patch to fix kprobe test case

Ajay Kaher (9):
  tracefs: Rename some tracefs function
  eventfs: Implement eventfs dir creation functions
  eventfs: Implement eventfs file add functions
  eventfs: Implement eventfs file, directory remove function
  eventfs: Implement functions to create eventfs files and directories
  eventfs: Implement eventfs lookup, read, open functions
  eventfs: Implement tracefs_inode_cache
  eventfs: Move tracing/events to eventfs
  test: ftrace: Fix kprobe test for eventfs

Steven Rostedt (Google) (1):
  tracing: Require all trace events to have a TRACE_SYSTEM

 fs/tracefs/Makefile                           |   1 +
 fs/tracefs/event_inode.c                      | 711 ++++++++++++++++++
 fs/tracefs/inode.c                            | 124 ++-
 fs/tracefs/internal.h                         |  25 +
 include/linux/trace_events.h                  |   1 +
 include/linux/tracefs.h                       |  32 +
 kernel/trace/trace.h                          |   2 +-
 kernel/trace/trace_events.c                   |  78 +-
 .../ftrace/test.d/kprobe/kprobe_args_char.tc  |   4 +-
 .../test.d/kprobe/kprobe_args_string.tc       |   4 +-
 10 files changed, 930 insertions(+), 52 deletions(-)
 create mode 100644 fs/tracefs/event_inode.c
 create mode 100644 fs/tracefs/internal.h

-- 
2.39.0
Re: [PATCH v4 00/10] tracing: introducing eventfs
Posted by Steven Rostedt 2 years, 6 months ago
On Thu, 13 Jul 2023 17:03:14 +0530
Ajay Kaher <akaher@vmware.com> wrote:

> Events Tracing infrastructure contains lot of files, directories
> (internally in terms of inodes, dentries). And ends up by consuming
> memory in MBs. We can have multiple events of Events Tracing, which
> further requires more memory.
> 
> Instead of creating inodes/dentries, eventfs could keep meta-data and
> skip the creation of inodes/dentries. As and when require, eventfs will
> create the inodes/dentries only for required files/directories.
> Also eventfs would delete the inodes/dentries once no more requires
> but preserve the meta data.
> 
> Tracing events took ~9MB, with this approach it took ~4.5MB
> for ~10K files/dir.

I think we are very close to getting this in for the next merge window. I
ran several tests and so far it's holding up!

I made a bunch of nits for this series, but nothing major. Mostly fixing up
change logs and comments, as well as some naming conventions and
reorganizing the series a little bit.

Anyway, I'm hoping that v5 will be ready to go into linux-next.

Thanks a lot Ajay for working on this!

-- Steve


> 
> v3:
> Patch 3,4,5,7,9:
>          removed all the eventfs_rwsem code and replaced it with an srcu
>          lock for the readers, and a mutex to synchronize the writers of
>          the list.
> 
> Patch 2: moved 'tracefs_inode' and 'get_tracefs()' to v4 03/10
> 
> Patch 3: moved the struct eventfs_file and eventfs_inode into event_inode.c
>          as it really should not be exposed to all users.
> 
> Patch 5: added a recursion check to eventfs_remove_rec() as it is really
>          dangerous to have unchecked recursion in the kernel (we do have
>          a fixed size stack).
> 
>          have the free use srcu callbacks. After the srcu grace periods
>          are done, it adds the eventfs_file onto a llist (lockless link
>          list) and wakes up a work queue. Then the work queue does the
>          freeing (this needs to be done in task/workqueue context, as
>          srcu callbacks are done in softirq context).
> 
> Patch 6: renamed:
>          eventfs_create_file() -> create_file()
>          eventfs_create_dir() -> create_dir()
> 
> v2:
> Patch 01: new patch:'Require all trace events to have a TRACE_SYSTEM'
> Patch 02: moved from v1 1/9
> Patch 03: moved from v1 2/9
>           As suggested by Zheng Yejian, introduced eventfs_prepare_ef()
>           helper function to add files or directories to eventfs
>           fix WARNING reported by kernel test robot in v1 8/9
> Patch 04: moved from v1 3/9
>           used eventfs_prepare_ef() to add files
>           fix WARNING reported by kernel test robot in v1 8/9
> Patch 05: moved from v1 4/9
>           fix compiling warning reported by kernel test robot in v1 4/9
> Patch 06: moved from v1 5/9
> Patch 07: moved from v1 6/9
> Patch 08: moved from v1 7/9
> Patch 09: moved from v1 8/9
>           rebased because of v3 01/10
> Patch 10: moved from v1 9/9
> 
> v1:
> Patch 1: add header file
> Patch 2: resolved kernel test robot issues
>          protecting eventfs lists using nested eventfs_rwsem
> Patch 3: protecting eventfs lists using nested eventfs_rwsem
> Patch 4: improve events cleanup code to fix crashes
> Patch 5: resolved kernel test robot issues
>          removed d_instantiate_anon() calls
> Patch 6: resolved kernel test robot issues
>          fix kprobe test in eventfs_root_lookup()
>          protecting eventfs lists using nested eventfs_rwsem
> Patch 7: remove header file
> Patch 8: pass eventfs_rwsem as argument to eventfs functions
>          called eventfs_remove_events_dir() instead of tracefs_remove()
>          from event_trace_del_tracer()
> Patch 9: new patch to fix kprobe test case
> 
> Ajay Kaher (9):
>   tracefs: Rename some tracefs function
>   eventfs: Implement eventfs dir creation functions
>   eventfs: Implement eventfs file add functions
>   eventfs: Implement eventfs file, directory remove function
>   eventfs: Implement functions to create eventfs files and directories
>   eventfs: Implement eventfs lookup, read, open functions
>   eventfs: Implement tracefs_inode_cache
>   eventfs: Move tracing/events to eventfs
>   test: ftrace: Fix kprobe test for eventfs
> 
> Steven Rostedt (Google) (1):
>   tracing: Require all trace events to have a TRACE_SYSTEM
> 
>  fs/tracefs/Makefile                           |   1 +
>  fs/tracefs/event_inode.c                      | 711 ++++++++++++++++++
>  fs/tracefs/inode.c                            | 124 ++-
>  fs/tracefs/internal.h                         |  25 +
>  include/linux/trace_events.h                  |   1 +
>  include/linux/tracefs.h                       |  32 +
>  kernel/trace/trace.h                          |   2 +-
>  kernel/trace/trace_events.c                   |  78 +-
>  .../ftrace/test.d/kprobe/kprobe_args_char.tc  |   4 +-
>  .../test.d/kprobe/kprobe_args_string.tc       |   4 +-
>  10 files changed, 930 insertions(+), 52 deletions(-)
>  create mode 100644 fs/tracefs/event_inode.c
>  create mode 100644 fs/tracefs/internal.h
>
Re: [PATCH v4 00/10] tracing: introducing eventfs
Posted by Ajay Kaher 2 years, 6 months ago

> On 15-Jul-2023, at 4:28 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> !! External Email
> 
> On Thu, 13 Jul 2023 17:03:14 +0530
> Ajay Kaher <akaher@vmware.com> wrote:
> 
>> Events Tracing infrastructure contains lot of files, directories
>> (internally in terms of inodes, dentries). And ends up by consuming
>> memory in MBs. We can have multiple events of Events Tracing, which
>> further requires more memory.
>> 
>> Instead of creating inodes/dentries, eventfs could keep meta-data and
>> skip the creation of inodes/dentries. As and when require, eventfs will
>> create the inodes/dentries only for required files/directories.
>> Also eventfs would delete the inodes/dentries once no more requires
>> but preserve the meta data.
>> 
>> Tracing events took ~9MB, with this approach it took ~4.5MB
>> for ~10K files/dir.
> 
> I think we are very close to getting this in for the next merge window. I
> ran several tests and so far it's holding up!
> 
> I made a bunch of nits for this series, but nothing major. Mostly fixing up
> change logs and comments, as well as some naming conventions and
> reorganizing the series a little bit.
> 
> Anyway, I'm hoping that v5 will be ready to go into linux-next.
> 
> Thanks a lot Ajay for working on this!


Thanks Steve, hopefully I will fix all the pending nits in v5.
Here is the checkpatch.pl report:


./scripts/checkpatch.pl v4/*
--------------------------
v4/0000-cover-letter.patch
--------------------------
total: 0 errors, 0 warnings, 0 lines checked

v4/0000-cover-letter.patch has no obvious style problems and is ready for submission.
------------------------------------------------------------------
v4/0001-tracing-Require-all-trace-events-to-have-a-TRACE_SYS.patch
------------------------------------------------------------------
total: 0 errors, 0 warnings, 22 lines checked

v4/0001-tracing-Require-all-trace-events-to-have-a-TRACE_SYS.patch has no obvious style problems and is ready for submission.
--------------------------------------------------
v4/0002-tracefs-Rename-some-tracefs-function.patch
--------------------------------------------------
total: 0 errors, 0 warnings, 71 lines checked

v4/0002-tracefs-Rename-some-tracefs-function.patch has no obvious style problems and is ready for submission.
--------------------------------------------------------------
v4/0003-eventfs-Implement-eventfs-dir-creation-functions.patch
--------------------------------------------------------------
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#52:
new file mode 100644

WARNING: Symbolic permissions 'S_IRWXU | S_IRUGO | S_IXUGO' are not preferred. Consider using octal permissions '0755'.
#194: FILE: fs/tracefs/event_inode.c:138:
+	inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;

WARNING: Symbolic permissions 'S_IRWXU | S_IRUGO | S_IXUGO' are not preferred. Consider using octal permissions '0755'.
#229: FILE: fs/tracefs/event_inode.c:173:
+		S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,

WARNING: Symbolic permissions 'S_IRWXU | S_IRUGO | S_IXUGO' are not preferred. Consider using octal permissions '0755'.
#261: FILE: fs/tracefs/event_inode.c:205:
+		S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,

total: 0 errors, 4 warnings, 297 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

v4/0003-eventfs-Implement-eventfs-dir-creation-functions.patch has style problems, please review.
----------------------------------------------------------
v4/0004-eventfs-Implement-eventfs-file-add-functions.patch
----------------------------------------------------------
total: 0 errors, 0 warnings, 101 lines checked

v4/0004-eventfs-Implement-eventfs-file-add-functions.patch has no obvious style problems and is ready for submission.
------------------------------------------------------------------
v4/0005-eventfs-Implement-eventfs-file-directory-remove-func.patch
------------------------------------------------------------------
total: 0 errors, 0 warnings, 129 lines checked

v4/0005-eventfs-Implement-eventfs-file-directory-remove-func.patch has no obvious style problems and is ready for submission.
------------------------------------------------------------------
v4/0006-eventfs-Implement-functions-to-create-eventfs-files-.patch
------------------------------------------------------------------
total: 0 errors, 0 warnings, 182 lines checked

v4/0006-eventfs-Implement-functions-to-create-eventfs-files-.patch has no obvious style problems and is ready for submission.
------------------------------------------------------------------
v4/0007-eventfs-Implement-eventfs-lookup-read-open-functions.patch
------------------------------------------------------------------
total: 0 errors, 0 warnings, 224 lines checked

v4/0007-eventfs-Implement-eventfs-lookup-read-open-functions.patch has no obvious style problems and is ready for submission.
---------------------------------------------------
v4/0008-eventfs-Implement-tracefs_inode_cache.patch
---------------------------------------------------
total: 0 errors, 0 warnings, 68 lines checked

v4/0008-eventfs-Implement-tracefs_inode_cache.patch has no obvious style problems and is ready for submission.
----------------------------------------------------
v4/0009-eventfs-Move-tracing-events-to-eventfs.patch
----------------------------------------------------
total: 0 errors, 0 warnings, 241 lines checked

v4/0009-eventfs-Move-tracing-events-to-eventfs.patch has no obvious style problems and is ready for submission.
-----------------------------------------------------
v4/0010-test-ftrace-Fix-kprobe-test-for-eventfs.patch
-----------------------------------------------------
total: 0 errors, 0 warnings, 32 lines checked

v4/0010-test-ftrace-Fix-kprobe-test-for-eventfs.patch has no obvious style problems and is ready for submission.

-Ajay
Re: [PATCH v4 00/10] tracing: introducing eventfs
Posted by Steven Rostedt 2 years, 6 months ago
On Sun, 16 Jul 2023 17:32:35 +0000
Ajay Kaher <akaher@vmware.com> wrote:

> Thanks Steve, hopefully I will fix all the pending nits in v5.
> Here is the checkpatch.pl report:

Hold off on v5. I hit the following on v4:

[  220.170527] BUG: unable to handle page fault for address: fffffffffffffff0
[  220.172792] #PF: supervisor read access in kernel mode
[  220.174618] #PF: error_code(0x0000) - not-present page
[  220.176516] PGD 13104d067 P4D 13104d067 PUD 13104f067 PMD 0
[  220.178559] Oops: 0000 [#1] PREEMPT SMP PTI
[  220.180087] CPU: 3 PID: 35 Comm: kworker/u8:1 Not tainted 6.5.0-rc1-test-00021-gdd6e7af33766-dirty #15
[  220.183441] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[  220.186629] Workqueue: events_unbound eventfs_workfn
[  220.188286] RIP: 0010:eventfs_set_ef_status_free+0x17/0x40
[  220.190091] Code: 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 8b 47 18 48 8b 40 30 48 83 f8 10 74 1b <f6> 40 f0 02 74 15 48 8b 47 78 48 85 c0 74 0c c6 40 5a 00 48 c7 40
[  220.195360] RSP: 0018:ffffa731c0147e20 EFLAGS: 00010287
[  220.196802] RAX: 0000000000000000 RBX: ffff97ca512ca000 RCX: 0000000000000000
[  220.198703] RDX: 0000000000000001 RSI: ffff97ca52d18010 RDI: ffff97ca512ca000
[  220.200540] RBP: ffff97ca52cb3780 R08: 0000000000000064 R09: 00000000802a0022
[  220.202324] R10: 0000000000039e80 R11: ffff97cabffd5000 R12: ffff97ca512ca058
[  220.204012] R13: ffff97ca52cb3780 R14: ffff97ca40153705 R15: ffffffffad5c1848
[  220.205685] FS:  0000000000000000(0000) GS:ffff97cabbd80000(0000) knlGS:0000000000000000
[  220.207476] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  220.208764] CR2: fffffffffffffff0 CR3: 000000010a01a001 CR4: 0000000000170ee0
[  220.210342] Call Trace:
[  220.210879]  <TASK>
[  220.211359]  ? __die+0x23/0x70
[  220.212036]  ? page_fault_oops+0xa4/0x180
[  220.212904]  ? exc_page_fault+0xf6/0x190
[  220.213738]  ? asm_exc_page_fault+0x26/0x30
[  220.214586]  ? eventfs_set_ef_status_free+0x17/0x40
[  220.216081]  tracefs_dentry_iput+0x39/0x50
[  220.217370]  __dentry_kill+0xdc/0x170
[  220.218581]  dput+0x142/0x310
[  220.219647]  eventfs_workfn+0x42/0x70
[  220.220805]  process_one_work+0x1e2/0x3e0
[  220.222031]  worker_thread+0x1da/0x390
[  220.223204]  ? __pfx_worker_thread+0x10/0x10
[  220.224476]  kthread+0xf7/0x130
[  220.225543]  ? __pfx_kthread+0x10/0x10
[  220.226735]  ret_from_fork+0x2c/0x50
[  220.227898]  </TASK>
[  220.228792] Modules linked in:
[  220.229860] CR2: fffffffffffffff0
[  220.230960] ---[ end trace 0000000000000000 ]---


I think I know the issue, and looking to see if I can fix it.

-- Steve
Re: [PATCH v4 00/10] tracing: introducing eventfs
Posted by Ajay Kaher 2 years, 6 months ago

> On 18-Jul-2023, at 7:10 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> !! External Email
> 
> On Sun, 16 Jul 2023 17:32:35 +0000
> Ajay Kaher <akaher@vmware.com> wrote:
> 
>> Thanks Steve, hopefully I will fix all the pending nits in v5.
>> Here is the checkpatch.pl report:
> 
> Hold off on v5. I hit the following on v4:

OK.

> 
> [  220.170527] BUG: unable to handle page fault for address: fffffffffffffff0
> [  220.172792] #PF: supervisor read access in kernel mode
> [  220.174618] #PF: error_code(0x0000) - not-present page
> [  220.176516] PGD 13104d067 P4D 13104d067 PUD 13104f067 PMD 0
> [  220.178559] Oops: 0000 [#1] PREEMPT SMP PTI
> [  220.180087] CPU: 3 PID: 35 Comm: kworker/u8:1 Not tainted 6.5.0-rc1-test-00021-gdd6e7af33766-dirty #15
> [  220.183441] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [  220.186629] Workqueue: events_unbound eventfs_workfn
> [  220.188286] RIP: 0010:eventfs_set_ef_status_free+0x17/0x40
> [  220.190091] Code: 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 8b 47 18 48 8b 40 30 48 83 f8 10 74 1b <f6> 40 f0 02 74 15 48 8b 47 78 48 85 c0 74 0c c6 40 5a 00 48 c7 40
> [  220.195360] RSP: 0018:ffffa731c0147e20 EFLAGS: 00010287
> [  220.196802] RAX: 0000000000000000 RBX: ffff97ca512ca000 RCX: 0000000000000000
> [  220.198703] RDX: 0000000000000001 RSI: ffff97ca52d18010 RDI: ffff97ca512ca000
> [  220.200540] RBP: ffff97ca52cb3780 R08: 0000000000000064 R09: 00000000802a0022
> [  220.202324] R10: 0000000000039e80 R11: ffff97cabffd5000 R12: ffff97ca512ca058
> [  220.204012] R13: ffff97ca52cb3780 R14: ffff97ca40153705 R15: ffffffffad5c1848
> [  220.205685] FS:  0000000000000000(0000) GS:ffff97cabbd80000(0000) knlGS:0000000000000000
> [  220.207476] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  220.208764] CR2: fffffffffffffff0 CR3: 000000010a01a001 CR4: 0000000000170ee0
> [  220.210342] Call Trace:
> [  220.210879]  <TASK>
> [  220.211359]  ? __die+0x23/0x70
> [  220.212036]  ? page_fault_oops+0xa4/0x180
> [  220.212904]  ? exc_page_fault+0xf6/0x190
> [  220.213738]  ? asm_exc_page_fault+0x26/0x30
> [  220.214586]  ? eventfs_set_ef_status_free+0x17/0x40
> [  220.216081]  tracefs_dentry_iput+0x39/0x50
> [  220.217370]  __dentry_kill+0xdc/0x170
> [  220.218581]  dput+0x142/0x310
> [  220.219647]  eventfs_workfn+0x42/0x70
> [  220.220805]  process_one_work+0x1e2/0x3e0
> [  220.222031]  worker_thread+0x1da/0x390
> [  220.223204]  ? __pfx_worker_thread+0x10/0x10
> [  220.224476]  kthread+0xf7/0x130
> [  220.225543]  ? __pfx_kthread+0x10/0x10
> [  220.226735]  ret_from_fork+0x2c/0x50
> [  220.227898]  </TASK>
> [  220.228792] Modules linked in:
> [  220.229860] CR2: fffffffffffffff0
> [  220.230960] ---[ end trace 0000000000000000 ]---
> 
> 
> I think I know the issue, and looking to see if I can fix it.


- Is it also reproducible on v3?
- Is it manually reproducible or reproducible using any specific script?

Let me know if I can help.

-Ajay
Re: [PATCH v4 00/10] tracing: introducing eventfs
Posted by Steven Rostedt 2 years, 6 months ago
On Wed, 19 Jul 2023 10:25:28 +0000
Ajay Kaher <akaher@vmware.com> wrote:

> - Is it also reproducible on v3?
> - Is it manually reproducible or reproducible using any specific script?
> 
> Let me know if I can help.

Just tried it against v3, and it gave me the splat that I originally had
and starting to fix, which now gives me another splat. I'll spend a couple
more days on it and start sharing code and seeing if we can work together
on this.

Here's the reproducer (of both v3 splat and the bug I'm hitting now).

 ~# echo 'p:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events
 ~# ls /sys/kernel/debug/tracing/events/kprobes/sock_getattr/
 ~# echo '-:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events

v3 gives me (and my updates too)

======================================================
 WARNING: possible circular locking dependency detected
 6.5.0-rc1-test+ #576 Not tainted
 ------------------------------------------------------
 trace-cmd/840 is trying to acquire lock:
 ffff8881007e5de0 (&sb->s_type->i_mutex_key#5){++++}-{3:3}, at: dcache_dir_open_wrapper+0xc1/0x1b0
 
 but task is already holding lock:
 ffff888103ad7e70 (eventfs_rwsem/1){.+.+}-{3:3}, at: dcache_dir_open_wrapper+0x6f/0x1b0
 
 which lock already depends on the new lock.

 
 the existing dependency chain (in reverse order) is:
 
 -> #1 (eventfs_rwsem/1){.+.+}-{3:3}:
        down_read_nested+0x41/0x180
        eventfs_root_lookup+0x42/0x120
        __lookup_slow+0xff/0x1b0
        walk_component+0xdb/0x150
        path_lookupat+0x67/0x1a0
        filename_lookup+0xe4/0x1f0
        vfs_statx+0x9e/0x180
        vfs_fstatat+0x51/0x70
        __do_sys_newfstatat+0x3f/0x80
        do_syscall_64+0x3a/0xc0
        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
 
 -> #0 (&sb->s_type->i_mutex_key#5){++++}-{3:3}:
        __lock_acquire+0x165d/0x2390
        lock_acquire+0xd4/0x2d0
        down_write+0x3b/0xd0
        dcache_dir_open_wrapper+0xc1/0x1b0
        do_dentry_open+0x20c/0x510
        path_openat+0x7ad/0xc60
        do_filp_open+0xaf/0x160
        do_sys_openat2+0xab/0xe0
        __x64_sys_openat+0x6a/0xa0
        do_syscall_64+0x3a/0xc0
        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
 
 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   rlock(eventfs_rwsem/1);
                                lock(&sb->s_type->i_mutex_key#5);
                                lock(eventfs_rwsem/1);
   lock(&sb->s_type->i_mutex_key#5);
 
  *** DEADLOCK ***

 1 lock held by trace-cmd/840:
  #0: ffff888103ad7e70 (eventfs_rwsem/1){.+.+}-{3:3}, at: dcache_dir_open_wrapper+0x6f/0x1b0
 
 stack backtrace:
 CPU: 7 PID: 840 Comm: trace-cmd Not tainted 6.5.0-rc1-test+ #576
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x57/0x90
  check_noncircular+0x14b/0x160
  __lock_acquire+0x165d/0x2390
  lock_acquire+0xd4/0x2d0
  ? dcache_dir_open_wrapper+0xc1/0x1b0
  down_write+0x3b/0xd0
  ? dcache_dir_open_wrapper+0xc1/0x1b0
  dcache_dir_open_wrapper+0xc1/0x1b0
  ? __pfx_dcache_dir_open_wrapper+0x10/0x10
  do_dentry_open+0x20c/0x510
  path_openat+0x7ad/0xc60
  do_filp_open+0xaf/0x160
  do_sys_openat2+0xab/0xe0
  __x64_sys_openat+0x6a/0xa0
  do_syscall_64+0x3a/0xc0
  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
 RIP: 0033:0x7f1743267e41
 Code: 44 24 18 31 c0 41 83 e2 40 75 3e 89 f0 25 00 00 41 00 3d 00 00 41 00 74 30 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 3f 48 8b 54 24 18 64 48 2b 14 25 28 00 00 00
 RSP: 002b:00007ffec10ff5d0 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1743267e41
 RDX: 0000000000090800 RSI: 00007ffec10ffdb0 RDI: 00000000ffffff9c
 RBP: 00007ffec10ffda0 R08: 00007ffec11003e0 R09: 0000000000000040
 R10: 0000000000000000 R11: 0000000000000287 R12: 00007ffec11003e0
 R13: 0000000000000040 R14: 0000000000000000 R15: 00007ffec110034b
  </TASK>


I moved the code around a bit, and it appears that kprobes is getting
dput() more than once.

I moved the d_invalidate() and dput() into the workqueue function, and on
kprobes going away, d_invalidate() frees it, and dput() is now corrupted.

Still investigating. The VFS layer is a magic box that needs the right
wizard hat to deal with, but I unfortunately am waiting on back order to
retrieve that specific hat :-p

-- Steve
Re: [PATCH v4 00/10] tracing: introducing eventfs
Posted by Ajay Kaher 2 years, 6 months ago

> On 19-Jul-2023, at 7:53 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> !! External Email
> 
> On Wed, 19 Jul 2023 10:25:28 +0000
> Ajay Kaher <akaher@vmware.com> wrote:
> 
>> - Is it also reproducible on v3?
>> - Is it manually reproducible or reproducible using any specific script?
>> 
>> Let me know if I can help.
> 
> Just tried it against v3, and it gave me the splat that I originally had
> and starting to fix, which now gives me another splat. I'll spend a couple
> more days on it and start sharing code and seeing if we can work together
> on this.
> 
> Here's the reproducer (of both v3 splat and the bug I'm hitting now).
> 
> ~# echo 'p:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events
> ~# ls /sys/kernel/debug/tracing/events/kprobes/sock_getattr/
> ~# echo '-:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events

I tried above steps on v4 but couldn’t reproduce:

root@photon-6 [ ~/sdb/linux ]# echo 'p:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events
root@photon-6 [ ~/sdb/linux ]# ls /sys/kernel/debug/tracing/events/kprobes/sock_getattr/
enable  filter  format  id  trigger
root@photon-6 [ ~/sdb/linux ]# echo '-:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events
-bash: echo: write error: No such file or directory

I have doubt on call_srcu(), it may first end the grace period for parent then for child. If this is true then free_list
will have unordered list and could cause problem.
 

> 
> v3 gives me (and my updates too)
> 
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.5.0-rc1-test+ #576 Not tainted
> ------------------------------------------------------
> trace-cmd/840 is trying to acquire lock:
> ffff8881007e5de0 (&sb->s_type->i_mutex_key#5){++++}-{3:3}, at: dcache_dir_open_wrapper+0xc1/0x1b0
> 
> but task is already holding lock:
> ffff888103ad7e70 (eventfs_rwsem/1){.+.+}-{3:3}, at: dcache_dir_open_wrapper+0x6f/0x1b0
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (eventfs_rwsem/1){.+.+}-{3:3}:
>        down_read_nested+0x41/0x180
>        eventfs_root_lookup+0x42/0x120
>        __lookup_slow+0xff/0x1b0
>        walk_component+0xdb/0x150
>        path_lookupat+0x67/0x1a0
>        filename_lookup+0xe4/0x1f0
>        vfs_statx+0x9e/0x180
>        vfs_fstatat+0x51/0x70
>        __do_sys_newfstatat+0x3f/0x80
>        do_syscall_64+0x3a/0xc0
>        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> 
> -> #0 (&sb->s_type->i_mutex_key#5){++++}-{3:3}:
>        __lock_acquire+0x165d/0x2390
>        lock_acquire+0xd4/0x2d0
>        down_write+0x3b/0xd0
>        dcache_dir_open_wrapper+0xc1/0x1b0
>        do_dentry_open+0x20c/0x510
>        path_openat+0x7ad/0xc60
>        do_filp_open+0xaf/0x160
>        do_sys_openat2+0xab/0xe0
>        __x64_sys_openat+0x6a/0xa0
>        do_syscall_64+0x3a/0xc0
>        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   rlock(eventfs_rwsem/1);
>                                lock(&sb->s_type->i_mutex_key#5);
>                                lock(eventfs_rwsem/1);
>   lock(&sb->s_type->i_mutex_key#5);
> 
>  *** DEADLOCK ***
> 
> 1 lock held by trace-cmd/840:
>  #0: ffff888103ad7e70 (eventfs_rwsem/1){.+.+}-{3:3}, at: dcache_dir_open_wrapper+0x6f/0x1b0
> 
> stack backtrace:
> CPU: 7 PID: 840 Comm: trace-cmd Not tainted 6.5.0-rc1-test+ #576
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x57/0x90
>  check_noncircular+0x14b/0x160
>  __lock_acquire+0x165d/0x2390
>  lock_acquire+0xd4/0x2d0
>  ? dcache_dir_open_wrapper+0xc1/0x1b0
>  down_write+0x3b/0xd0
>  ? dcache_dir_open_wrapper+0xc1/0x1b0
>  dcache_dir_open_wrapper+0xc1/0x1b0
>  ? __pfx_dcache_dir_open_wrapper+0x10/0x10
>  do_dentry_open+0x20c/0x510
>  path_openat+0x7ad/0xc60
>  do_filp_open+0xaf/0x160
>  do_sys_openat2+0xab/0xe0
>  __x64_sys_openat+0x6a/0xa0
>  do_syscall_64+0x3a/0xc0
>  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> RIP: 0033:0x7f1743267e41
> Code: 44 24 18 31 c0 41 83 e2 40 75 3e 89 f0 25 00 00 41 00 3d 00 00 41 00 74 30 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 3f 48 8b 54 24 18 64 48 2b 14 25 28 00 00 00
> RSP: 002b:00007ffec10ff5d0 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1743267e41
> RDX: 0000000000090800 RSI: 00007ffec10ffdb0 RDI: 00000000ffffff9c
> RBP: 00007ffec10ffda0 R08: 00007ffec11003e0 R09: 0000000000000040
> R10: 0000000000000000 R11: 0000000000000287 R12: 00007ffec11003e0
> R13: 0000000000000040 R14: 0000000000000000 R15: 00007ffec110034b
>  </TASK>
> 

This is expected from v3 (just ignore as of now), if eventfs_set_ef_status_free crash not
reproduced on v3 then it’s v4 issue.

-Ajay

> 
> I moved the code around a bit, and it appears that kprobes is getting
> dput() more than once.
> 
> I moved the d_invalidate() and dput() into the workqueue function, and on
> kprobes going away, d_invalidate() frees it, and dput() is now corrupted.
> 
> Still investigating. The VFS layer is a magic box that needs the right
> wizard hat to deal with, but I unfortunately am waiting on back order to
> retrieve that specific hat :-p
> 
> -- Steve
> 
> !! External Email: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender.

Re: [PATCH v4 00/10] tracing: introducing eventfs
Posted by Steven Rostedt 2 years, 6 months ago
On Wed, 19 Jul 2023 18:37:12 +0000
Ajay Kaher <akaher@vmware.com> wrote:

> > Here's the reproducer (of both v3 splat and the bug I'm hitting now).
> > 
> > ~# echo 'p:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events
> > ~# ls /sys/kernel/debug/tracing/events/kprobes/sock_getattr/
> > ~# echo '-:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events  
> 
> I tried above steps on v4 but couldn’t reproduce:
> 
> root@photon-6 [ ~/sdb/linux ]# echo 'p:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events
> root@photon-6 [ ~/sdb/linux ]# ls /sys/kernel/debug/tracing/events/kprobes/sock_getattr/
> enable  filter  format  id  trigger
> root@photon-6 [ ~/sdb/linux ]# echo '-:sock_getattr 0xffffffff9b55cef0 sk=%di' > /sys/kernel/tracing/kprobe_events
> -bash: echo: write error: No such file or directory
> 
> I have doubt on call_srcu(), it may first end the grace period for parent then for child. If this is true then free_list
> will have unordered list and could cause problem.

I modified the srcu portion a bit. Will post soon. I think I got something
working.

I'm having doubt that the dput()s were needed in the eventfs_remove_rec(),
as the d_invalidate() appears to be enough. I'm still testing.

>  
> 
> > 
> > v3 gives me (and my updates too)
> > 
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 6.5.0-rc1-test+ #576 Not tainted
> > ------------------------------------------------------
> > trace-cmd/840 is trying to acquire lock:
> > ffff8881007e5de0 (&sb->s_type->i_mutex_key#5){++++}-{3:3}, at: dcache_dir_open_wrapper+0xc1/0x1b0
> > 
> > but task is already holding lock:
> > ffff888103ad7e70 (eventfs_rwsem/1){.+.+}-{3:3}, at: dcache_dir_open_wrapper+0x6f/0x1b0
> > 
> > which lock already depends on the new lock.
> > 
> > 
> > the existing dependency chain (in reverse order) is:
> >   
> > -> #1 (eventfs_rwsem/1){.+.+}-{3:3}:  
> >        down_read_nested+0x41/0x180
> >        eventfs_root_lookup+0x42/0x120
> >        __lookup_slow+0xff/0x1b0
> >        walk_component+0xdb/0x150
> >        path_lookupat+0x67/0x1a0
> >        filename_lookup+0xe4/0x1f0
> >        vfs_statx+0x9e/0x180
> >        vfs_fstatat+0x51/0x70
> >        __do_sys_newfstatat+0x3f/0x80
> >        do_syscall_64+0x3a/0xc0
> >        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> >   
> > -> #0 (&sb->s_type->i_mutex_key#5){++++}-{3:3}:  
> >        __lock_acquire+0x165d/0x2390
> >        lock_acquire+0xd4/0x2d0
> >        down_write+0x3b/0xd0
> >        dcache_dir_open_wrapper+0xc1/0x1b0
> >        do_dentry_open+0x20c/0x510
> >        path_openat+0x7ad/0xc60
> >        do_filp_open+0xaf/0x160
> >        do_sys_openat2+0xab/0xe0
> >        __x64_sys_openat+0x6a/0xa0
> >        do_syscall_64+0x3a/0xc0
> >        entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > 
> > other info that might help us debug this:
> > 
> >  Possible unsafe locking scenario:
> > 
> >        CPU0                    CPU1
> >        ----                    ----
> >   rlock(eventfs_rwsem/1);
> >                                lock(&sb->s_type->i_mutex_key#5);
> >                                lock(eventfs_rwsem/1);
> >   lock(&sb->s_type->i_mutex_key#5);
> > 
> >  *** DEADLOCK ***
> > 
> > 1 lock held by trace-cmd/840:
> >  #0: ffff888103ad7e70 (eventfs_rwsem/1){.+.+}-{3:3}, at: dcache_dir_open_wrapper+0x6f/0x1b0
> > 
> > stack backtrace:
> > CPU: 7 PID: 840 Comm: trace-cmd Not tainted 6.5.0-rc1-test+ #576
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > Call Trace:
> >  <TASK>
> >  dump_stack_lvl+0x57/0x90
> >  check_noncircular+0x14b/0x160
> >  __lock_acquire+0x165d/0x2390
> >  lock_acquire+0xd4/0x2d0
> >  ? dcache_dir_open_wrapper+0xc1/0x1b0
> >  down_write+0x3b/0xd0
> >  ? dcache_dir_open_wrapper+0xc1/0x1b0
> >  dcache_dir_open_wrapper+0xc1/0x1b0
> >  ? __pfx_dcache_dir_open_wrapper+0x10/0x10
> >  do_dentry_open+0x20c/0x510
> >  path_openat+0x7ad/0xc60
> >  do_filp_open+0xaf/0x160
> >  do_sys_openat2+0xab/0xe0
> >  __x64_sys_openat+0x6a/0xa0
> >  do_syscall_64+0x3a/0xc0
> >  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > RIP: 0033:0x7f1743267e41
> > Code: 44 24 18 31 c0 41 83 e2 40 75 3e 89 f0 25 00 00 41 00 3d 00 00 41 00 74 30 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 3f 48 8b 54 24 18 64 48 2b 14 25 28 00 00 00
> > RSP: 002b:00007ffec10ff5d0 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
> > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1743267e41
> > RDX: 0000000000090800 RSI: 00007ffec10ffdb0 RDI: 00000000ffffff9c
> > RBP: 00007ffec10ffda0 R08: 00007ffec11003e0 R09: 0000000000000040
> > R10: 0000000000000000 R11: 0000000000000287 R12: 00007ffec11003e0
> > R13: 0000000000000040 R14: 0000000000000000 R15: 00007ffec110034b
> >  </TASK>
> >   
> 
> This is expected from v3 (just ignore as of now), if eventfs_set_ef_status_free crash not
> reproduced on v3 then it’s v4 issue.

The issue comes from fixing the above ;-)

-- Steve
Re: [PATCH v4 00/10] tracing: introducing eventfs
Posted by Steven Rostedt 2 years, 6 months ago
[ Resending because claws-mail is messing with the Cc again. It doesn't like quotes :-p ]

On Wed, 19 Jul 2023 14:40:46 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > I have doubt on call_srcu(), it may first end the grace period for parent then for child. If this is true then free_list
> > will have unordered list and could cause problem.  
> 
> I modified the srcu portion a bit. Will post soon. I think I got something
> working.
> 
> I'm having doubt that the dput()s were needed in the eventfs_remove_rec(),
> as the d_invalidate() appears to be enough. I'm still testing.

OK, I got this working and it appears to pass all my tests. I actually got
it working Wednesday night, but I tried a different approach on Thursday
that got rid of the evenfs_file and only used eventfs_inodes and makes the
files more dynamic. There's still a couple of corner cases that are not
working with this approach (the dentry counters are getting out of sync).
This should not stop this from going in. I'll continue working on that
approach for the next merge cycle. But for now, here's the patch to this
series that works.

I'm going to post it here, and then reply to it with comments about my
changes.

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 4db048250cdb..2718de1533e6 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -36,16 +36,36 @@ struct eventfs_file {
 	const struct file_operations	*fop;
 	const struct inode_operations	*iop;
 	union {
+		struct list_head	del_list;
 		struct rcu_head		rcu;
-		struct llist_node	llist;  /* For freeing after RCU */
+		unsigned long		is_freed; /* Freed if one of the above is set */
 	};
 	void				*data;
 	umode_t				mode;
-	bool				created;
+	unsigned int			flags;
 };
 
 static DEFINE_MUTEX(eventfs_mutex);
 DEFINE_STATIC_SRCU(eventfs_srcu);
+
+static struct dentry *eventfs_root_lookup(struct inode *dir,
+					  struct dentry *dentry,
+					  unsigned int flags);
+static int dcache_dir_open_wrapper(struct inode *inode, struct file *file);
+static int eventfs_release(struct inode *inode, struct file *file);
+
+static const struct inode_operations eventfs_root_dir_inode_operations = {
+	.lookup		= eventfs_root_lookup,
+};
+
+static const struct file_operations eventfs_file_operations = {
+	.open           = dcache_dir_open_wrapper,
+	.read		= generic_read_dir,
+	.iterate_shared	= dcache_readdir,
+	.llseek		= generic_file_llseek,
+	.release        = eventfs_release,
+};
+
 /**
  * create_file - create a file in the tracefs filesystem
  * @name: the name of the file to create.
@@ -123,17 +143,12 @@ static struct dentry *create_file(const char *name, umode_t mode,
  * If tracefs is not enabled in the kernel, the value -%ENODEV will be
  * returned.
  */
-static struct dentry *create_dir(const char *name, umode_t mode,
-				 struct dentry *parent, void *data,
-				 const struct file_operations *fop,
-				 const struct inode_operations *iop)
+static struct dentry *create_dir(const char *name, struct dentry *parent, void *data)
 {
 	struct tracefs_inode *ti;
 	struct dentry *dentry;
 	struct inode *inode;
 
-	WARN_ON(!S_ISDIR(mode));
-
 	dentry = eventfs_start_creating(name, parent);
 	if (IS_ERR(dentry))
 		return dentry;
@@ -142,9 +157,9 @@ static struct dentry *create_dir(const char *name, umode_t mode,
 	if (unlikely(!inode))
 		return eventfs_failed_creating(dentry);
 
-	inode->i_mode = mode;
-	inode->i_op = iop;
-	inode->i_fop = fop;
+	inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
+	inode->i_op = &eventfs_root_dir_inode_operations;
+	inode->i_fop = &eventfs_file_operations;
 	inode->i_private = data;
 
 	ti = get_tracefs(inode);
@@ -169,15 +184,27 @@ void eventfs_set_ef_status_free(struct dentry *dentry)
 	struct tracefs_inode *ti_parent;
 	struct eventfs_file *ef;
 
+	mutex_lock(&eventfs_mutex);
 	ti_parent = get_tracefs(dentry->d_parent->d_inode);
 	if (!ti_parent || !(ti_parent->flags & TRACEFS_EVENT_INODE))
-		return;
+		goto out;
 
 	ef = dentry->d_fsdata;
 	if (!ef)
-		return;
-	ef->created = false;
+		goto out;
+	/*
+	 * If ef was freed, then the LSB bit is set for d_fsdata.
+	 * But this should not happen, as it should still have a
+	 * ref count that prevents it. Warn in case it does.
+	 */
+	if (WARN_ON_ONCE((unsigned long)ef & 1))
+		goto out;
+
+	dentry->d_fsdata = NULL;
+
 	ef->dentry = NULL;
+ out:
+	mutex_unlock(&eventfs_mutex);
 }
 
 /**
@@ -202,6 +229,79 @@ static void eventfs_post_create_dir(struct eventfs_file *ef)
 	ti->private = ef->ei;
 }
 
+static struct dentry *
+create_dentry(struct eventfs_file *ef, struct dentry *parent, bool lookup)
+{
+	bool invalidate = false;
+	struct dentry *dentry;
+
+	mutex_lock(&eventfs_mutex);
+	if (ef->is_freed) {
+		mutex_unlock(&eventfs_mutex);
+		return NULL;
+	}
+	if (ef->dentry) {
+		dentry = ef->dentry;
+		/* On dir open, up the ref count */
+		if (!lookup)
+			dget(dentry);
+		mutex_unlock(&eventfs_mutex);
+		return dentry;
+	}
+	mutex_unlock(&eventfs_mutex);
+
+	if (!lookup)
+		inode_lock(parent->d_inode);
+
+	if (ef->ei)
+		dentry = create_dir(ef->name, parent, ef->data);
+	else
+		dentry = create_file(ef->name, ef->mode, parent,
+				     ef->data, ef->fop);
+
+	if (!lookup)
+		inode_unlock(parent->d_inode);
+
+	mutex_lock(&eventfs_mutex);
+	if (IS_ERR_OR_NULL(dentry)) {
+		/* If the ef was already updated get it */
+		dentry = ef->dentry;
+		if (dentry && !lookup)
+			dget(dentry);
+		mutex_unlock(&eventfs_mutex);
+		return dentry;
+	}
+
+	if (!ef->dentry && !ef->is_freed) {
+		ef->dentry = dentry;
+		if (ef->ei)
+			eventfs_post_create_dir(ef);
+		dentry->d_fsdata = ef;
+	} else {
+		/* A race here, should try again (unless freed) */
+		invalidate = true;
+	}
+	mutex_unlock(&eventfs_mutex);
+	if (invalidate)
+		d_invalidate(dentry);
+
+	if (lookup || invalidate)
+		dput(dentry);
+
+	return invalidate ? NULL : dentry;
+}
+
+static bool match_event_file(struct eventfs_file *ef, const char *name)
+{
+	bool ret;
+
+	mutex_lock(&eventfs_mutex);
+	ret = !ef->is_freed && strcmp(ef->name, name) == 0;
+	mutex_unlock(&eventfs_mutex);
+
+	return ret;
+}
+
 /**
  * eventfs_root_lookup - lookup routine to create file/dir
  * @dir: directory in which lookup to be done
@@ -211,7 +311,6 @@ static void eventfs_post_create_dir(struct eventfs_file *ef)
  * Used to create dynamic file/dir with-in @dir, search with-in ei
  * list, if @dentry found go ahead and create the file/dir
  */
-
 static struct dentry *eventfs_root_lookup(struct inode *dir,
 					  struct dentry *dentry,
 					  unsigned int flags)
@@ -230,30 +329,10 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 	idx = srcu_read_lock(&eventfs_srcu);
 	list_for_each_entry_srcu(ef, &ei->e_top_files, list,
 				 srcu_read_lock_held(&eventfs_srcu)) {
-		if (strcmp(ef->name, dentry->d_name.name))
+		if (!match_event_file(ef, dentry->d_name.name))
 			continue;
 		ret = simple_lookup(dir, dentry, flags);
-		if (ef->created)
-			continue;
-		mutex_lock(&eventfs_mutex);
-		ef->created = true;
-		if (ef->ei)
-			ef->dentry = create_dir(ef->name, ef->mode, ef->d_parent,
-						ef->data, ef->fop, ef->iop);
-		else
-			ef->dentry = create_file(ef->name, ef->mode, ef->d_parent,
-						 ef->data, ef->fop);
-
-		if (IS_ERR_OR_NULL(ef->dentry)) {
-			ef->created = false;
-			mutex_unlock(&eventfs_mutex);
-		} else {
-			if (ef->ei)
-				eventfs_post_create_dir(ef);
-			ef->dentry->d_fsdata = ef;
-			mutex_unlock(&eventfs_mutex);
-			dput(ef->dentry);
-		}
+		create_dentry(ef, ef->d_parent, true);
 		break;
 	}
 	srcu_read_unlock(&eventfs_srcu, idx);
@@ -270,6 +349,7 @@ static int eventfs_release(struct inode *inode, struct file *file)
 	struct tracefs_inode *ti;
 	struct eventfs_inode *ei;
 	struct eventfs_file *ef;
+	struct dentry *dentry;
 	int idx;
 
 	ti = get_tracefs(inode);
@@ -280,8 +360,11 @@ static int eventfs_release(struct inode *inode, struct file *file)
 	idx = srcu_read_lock(&eventfs_srcu);
 	list_for_each_entry_srcu(ef, &ei->e_top_files, list,
 				 srcu_read_lock_held(&eventfs_srcu)) {
-		if (ef->created)
-			dput(ef->dentry);
+		mutex_lock(&eventfs_mutex);
+		dentry = ef->dentry;
+		mutex_unlock(&eventfs_mutex);
+		if (dentry)
+			dput(dentry);
 	}
 	srcu_read_unlock(&eventfs_srcu, idx);
 	return dcache_dir_close(inode, file);
@@ -312,47 +395,12 @@ static int dcache_dir_open_wrapper(struct inode *inode, struct file *file)
 	ei = ti->private;
 	idx = srcu_read_lock(&eventfs_srcu);
 	list_for_each_entry_rcu(ef, &ei->e_top_files, list) {
-		if (ef->created) {
-			dget(ef->dentry);
-			continue;
-		}
-		mutex_lock(&eventfs_mutex);
-		ef->created = true;
-
-		inode_lock(dentry->d_inode);
-		if (ef->ei)
-			ef->dentry = create_dir(ef->name, ef->mode, dentry,
-						ef->data, ef->fop, ef->iop);
-		else
-			ef->dentry = create_file(ef->name, ef->mode, dentry,
-						 ef->data, ef->fop);
-		inode_unlock(dentry->d_inode);
-
-		if (IS_ERR_OR_NULL(ef->dentry)) {
-			ef->created = false;
-		} else {
-			if (ef->ei)
-				eventfs_post_create_dir(ef);
-			ef->dentry->d_fsdata = ef;
-		}
-		mutex_unlock(&eventfs_mutex);
+		create_dentry(ef, dentry, false);
 	}
 	srcu_read_unlock(&eventfs_srcu, idx);
 	return dcache_dir_open(inode, file);
 }
 
-static const struct file_operations eventfs_file_operations = {
-	.open           = dcache_dir_open_wrapper,
-	.read		= generic_read_dir,
-	.iterate_shared	= dcache_readdir,
-	.llseek		= generic_file_llseek,
-	.release        = eventfs_release,
-};
-
-static const struct inode_operations eventfs_root_dir_inode_operations = {
-	.lookup		= eventfs_root_lookup,
-};
-
 /**
  * eventfs_prepare_ef - helper function to prepare eventfs_file
  * @name: the name of the file/directory to create.
@@ -470,11 +518,7 @@ struct eventfs_file *eventfs_add_subsystem_dir(const char *name,
 	ti_parent = get_tracefs(parent->d_inode);
 	ei_parent = ti_parent->private;
 
-	ef = eventfs_prepare_ef(name,
-		S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,
-		&eventfs_file_operations,
-		&eventfs_root_dir_inode_operations, NULL);
-
+	ef = eventfs_prepare_ef(name, S_IFDIR, NULL, NULL, NULL);
 	if (IS_ERR(ef))
 		return ef;
 
@@ -502,11 +546,7 @@ struct eventfs_file *eventfs_add_dir(const char *name,
 	if (!ef_parent)
 		return ERR_PTR(-EINVAL);
 
-	ef = eventfs_prepare_ef(name,
-		S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,
-		&eventfs_file_operations,
-		&eventfs_root_dir_inode_operations, NULL);
-
+	ef = eventfs_prepare_ef(name, S_IFDIR, NULL, NULL, NULL);
 	if (IS_ERR(ef))
 		return ef;
 
@@ -601,37 +641,15 @@ int eventfs_add_file(const char *name, umode_t mode,
 	return 0;
 }
 
-static LLIST_HEAD(free_list);
-
-static void eventfs_workfn(struct work_struct *work)
-{
-	struct eventfs_file *ef, *tmp;
-	struct llist_node *llnode;
-
-	llnode = llist_del_all(&free_list);
-	llist_for_each_entry_safe(ef, tmp, llnode, llist) {
-		if (ef->created && ef->dentry)
-			dput(ef->dentry);
-		kfree(ef->name);
-		kfree(ef->ei);
-		kfree(ef);
-	}
-}
-
-DECLARE_WORK(eventfs_work, eventfs_workfn);
-
 static void free_ef(struct rcu_head *head)
 {
 	struct eventfs_file *ef = container_of(head, struct eventfs_file, rcu);
 
-	if (!llist_add(&ef->llist, &free_list))
-		return;
-
-	queue_work(system_unbound_wq, &eventfs_work);
+	kfree(ef->name);
+	kfree(ef->ei);
+	kfree(ef);
 }
 
-
-
 /**
  * eventfs_remove_rec - remove eventfs dir or file from list
  * @ef: eventfs_file to be removed.
@@ -639,7 +657,7 @@ static void free_ef(struct rcu_head *head)
  * This function recursively remove eventfs_file which
  * contains info of file or dir.
  */
-static void eventfs_remove_rec(struct eventfs_file *ef, int level)
+static void eventfs_remove_rec(struct eventfs_file *ef, struct list_head *head, int level)
 {
 	struct eventfs_file *ef_child;
 
@@ -659,15 +677,12 @@ static void eventfs_remove_rec(struct eventfs_file *ef, int level)
 		/* search for nested folders or files */
 		list_for_each_entry_srcu(ef_child, &ef->ei->e_top_files, list,
 					 lockdep_is_held(&eventfs_mutex)) {
-			eventfs_remove_rec(ef_child, level + 1);
+			eventfs_remove_rec(ef_child, head, level + 1);
 		}
 	}
 
-	if (ef->created && ef->dentry)
-		d_invalidate(ef->dentry);
-
 	list_del_rcu(&ef->list);
-	call_srcu(&eventfs_srcu, &ef->rcu, free_ef);
+	list_add_tail(&ef->del_list, head);
 }
 
 /**
@@ -678,12 +693,62 @@ static void eventfs_remove_rec(struct eventfs_file *ef, int level)
  */
 void eventfs_remove(struct eventfs_file *ef)
 {
+	struct eventfs_file *tmp;
+	LIST_HEAD(ef_del_list);
+	struct dentry *dentry_list = NULL;
+	struct dentry *dentry;
+
 	if (!ef)
 		return;
 
 	mutex_lock(&eventfs_mutex);
-	eventfs_remove_rec(ef, 0);
+	eventfs_remove_rec(ef, &ef_del_list, 0);
+
+	list_for_each_entry_safe(ef, tmp, &ef_del_list, del_list) {
+		if (ef->dentry) {
+			unsigned long ptr = (unsigned long)dentry_list;
+
+			/* Keep the dentry from being freed yet */
+			dget(ef->dentry);
+
+			/*
+			 * Paranoid: The dget() above should prevent the dentry
+			 * from being freed and calling eventfs_set_ef_status_free().
+			 * But just in case, set the link list LSB pointer to 1
+			 * and have eventfs_set_ef_status_free() check that to
+			 * make sure that if it does happen, it will not think
+			 * the d_fsdata is an event_file.
+			 *
+			 * For this to work, no event_file should be allocated
+			 * on a odd space, as the ef should always be allocated
+			 * to be at least word aligned. Check for that too.
+			 */
+			WARN_ON_ONCE(ptr & 1);
+
+			ef->dentry->d_fsdata = (void *)(ptr | 1);
+			dentry_list = ef->dentry;
+			ef->dentry = NULL;
+		}
+		call_srcu(&eventfs_srcu, &ef->rcu, free_ef);
+	}
 	mutex_unlock(&eventfs_mutex);
+
+	while (dentry_list) {
+		unsigned long ptr;
+
+		dentry = dentry_list;
+		ptr = (unsigned long)dentry->d_fsdata & ~1UL;
+		dentry_list = (struct dentry *)ptr;
+		dentry->d_fsdata = NULL;
+		d_invalidate(dentry);
+		mutex_lock(&eventfs_mutex);
+		/* dentry should now have at least a single reference */
+		WARN_ONCE((int)d_count(dentry) < 1,
+			  "dentry %px less than one reference (%d) after invalidate\n",
+			  dentry, d_count(dentry));
+		mutex_unlock(&eventfs_mutex);
+		dput(dentry);
+	}
 }
 
 /**
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index c443a0c32a8c..1b880b5cd29d 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -22,4 +22,6 @@ struct dentry *tracefs_end_creating(struct dentry *dentry);
 struct dentry *tracefs_failed_creating(struct dentry *dentry);
 struct inode *tracefs_get_inode(struct super_block *sb);
 
+void eventfs_set_ef_status_free(struct dentry *dentry);
+
 #endif /* _TRACEFS_INTERNAL_H */
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 4d30b0cafc5f..47c1b4d21735 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -51,8 +51,6 @@ void eventfs_remove(struct eventfs_file *ef);
 
 void eventfs_remove_events_dir(struct dentry *dentry);
 
-void eventfs_set_ef_status_free(struct dentry *dentry);
-
 struct dentry *tracefs_create_file(const char *name, umode_t mode,
 				   struct dentry *parent, void *data,
 				   const struct file_operations *fops);
Re: [PATCH v4 00/10] tracing: introducing eventfs
Posted by Steven Rostedt 2 years, 6 months ago
On Fri, 21 Jul 2023 09:18:43 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> OK, I got this working and it appears to pass all my tests. I actually got
> it working Wednesday night, but I tried a different approach on Thursday
> that got rid of the evenfs_file and only used eventfs_inodes and makes the
> files more dynamic. There's still a couple of corner cases that are not
> working with this approach (the dentry counters are getting out of sync).
> This should not stop this from going in. I'll continue working on that
> approach for the next merge cycle. But for now, here's the patch to this
> series that works.

Just figured out my bug with my new design. It was in the eventfs_release()
code, where I have a loop that does the dput on the children, but I wasn't
dput(child), I was dput(parent) in that loop!!

Anyway, for this merge window I much rather have this code in, and for the
next merge window I'll add this update.

I can post the new design too, but first let's focus on this.

-- Steve