[RFC PATCH v4 0/2] vfs: add O_CREAT|O_DIRECTORY to open*(2)

Jori Koolstra posted 2 patches 6 days, 11 hours ago
Posted by Jori Koolstra 6 days, 11 hours ago
This series implements new semantics for the O_CREAT|O_DIRECTORY flag
combination for open*(2): perform a mkdir and open the resulting
directory, returning an fd that pins it (which mkdir(2) does not do).

Feedback from Christian Brauner and Aleksa Sarai on the v2 RFC of this
patch was to not introduce a new syscall (mkdirat2) but to implement
this functionality as O_CREAT|O_DIRECTORY in open*(2). v3 had some very
silly bugs that syzbot alerted me to, so here is v4...

Three comments from me upfront:

- This patch simply rejects O_CREAT|O_DIRECTORY with EINVAL for
  filesystems that define atomic_open(). I figure it is better to
  allow or disallow it on a per-filesystem basis, so feedback per
  filesystem on the appropriate course of action on receiving
  O_CREAT|O_DIRECTORY would be very welcome.

- If we create a regular file with mknod(2), security_path_mknod() is
  called before creation and security_path_post_mknod() after it. If we
  create a regular file using O_CREAT (this is also the case pre-patch),
  only security_path_mknod() is called. Is this the correct behaviour?

- open_last_lookups() locks the parent inode like this:

		inode_lock(dir->d_inode);

  should this perhaps be

		inode_lock_nested(dir->d_inode, I_MUTEX_PARENT);

  to stay consistent with the start_dirop() path that is used by
  filename_create(), for instance in mknod(2)? I get that we are only
  locking one inode here at most, so it does not really matter, but
  as it stands one regular-file create path sets the lockdep subclass
  and the other does not.

Jori Koolstra (2):
  vfs: add O_CREAT|O_DIRECTORY to open*(2)
  selftest: add tests for open*(O_CREAT|O_DIRECTORY)

 fs/namei.c                                    | 180 +++++++++++-----
 fs/open.c                                     |  25 ++-
 include/uapi/asm-generic/fcntl.h              |   2 +
 .../testing/selftests/filesystems/.gitignore  |   1 +
 tools/testing/selftests/filesystems/Makefile  |   4 +-
 tools/testing/selftests/filesystems/fclog.c   |   1 +
 .../filesystems/open_o_creat_o_dir.c          | 200 ++++++++++++++++++
 7 files changed, 342 insertions(+), 71 deletions(-)
 create mode 100644 tools/testing/selftests/filesystems/open_o_creat_o_dir.c

-- 
2.54.0
[syzbot ci] Re: vfs: add O_CREAT|O_DIRECTORY to open*(2)
Posted by syzbot ci 5 days, 20 hours ago
syzbot ci has tested the following series

[v4] vfs: add O_CREAT|O_DIRECTORY to open*(2)
https://lore.kernel.org/all/20260518165237.2084042-1-jkoolstra@xs4all.nl
* [RFC PATCH v4 1/2] vfs: add O_CREAT|O_DIRECTORY to open*(2)
* [RFC PATCH v4 2/2] selftest: add tests for open*(O_CREAT|O_DIRECTORY)

and found the following issues:
* KASAN: slab-out-of-bounds Read in ovl_dir_release
* general protection fault in path_openat

Full report is available here:
https://ci.syzbot.org/series/0d511b6b-6434-45cd-bbf3-51fe9d916e99

***

KASAN: slab-out-of-bounds Read in ovl_dir_release

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      5200f5f493f79f14bbdc349e402a40dfb32f23c8
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/024b677f-50e7-4ef9-ae98-2652f5098bfd/config
syz repro: https://ci.syzbot.org/findings/a7087360-137c-41f5-ae13-db4d551fe142/syz_repro

==================================================================
BUG: KASAN: slab-out-of-bounds in ovl_dir_release+0x228/0x2a0 fs/overlayfs/readdir.c:1033
Read of size 8 at addr ffff88816cd2c818 by task syz.0.17/5813

CPU: 0 UID: 0 PID: 5813 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_address_description+0x55/0x1e0 mm/kasan/report.c:378
 print_report+0x58/0x70 mm/kasan/report.c:482
 kasan_report+0x117/0x150 mm/kasan/report.c:595
 ovl_dir_release+0x228/0x2a0 fs/overlayfs/readdir.c:1033
 __fput+0x44f/0xa60 fs/file_table.c:510
 task_work_run+0x1d9/0x270 kernel/task_work.c:233
 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
 __exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
 exit_to_user_mode_loop+0xf3/0x4d0 kernel/entry/common.c:98
 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:207 [inline]
 syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:230 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:318 [inline]
 do_syscall_64+0x33e/0xf80 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fe1a159ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffe25af07f8 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4
RAX: 0000000000000000 RBX: 00007ffe25af08e0 RCX: 00007fe1a159ce59
RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003
RBP: 0000000000012e15 R08: 0000000000000001 R09: 0000000000000000
R10: 0000001b32a20000 R11: 0000000000000246 R12: 00007ffe25af0920
R13: 00007fe1a1815fac R14: 0000000000012e53 R15: 00007fe1a1815fa0
 </TASK>

Allocated by task 5814:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
 poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
 kasan_kmalloc include/linux/kasan.h:263 [inline]
 __kmalloc_cache_noprof+0x31c/0x660 mm/slub.c:5419
 kmalloc_noprof include/linux/slab.h:950 [inline]
 kzalloc_noprof include/linux/slab.h:1188 [inline]
 ovl_file_alloc+0x4f/0x90 fs/overlayfs/file.c:99
 ovl_create_tmpfile fs/overlayfs/dir.c:1399 [inline]
 ovl_tmpfile+0x3fc/0x7d0 fs/overlayfs/dir.c:1448
 vfs_tmpfile+0x3ff/0x890 fs/namei.c:4794
 do_tmpfile+0xd3/0x240 fs/namei.c:4859
 path_openat+0x33c7/0x3b40 fs/namei.c:4893
 do_file_open+0x23e/0x4a0 fs/namei.c:4931
 do_sys_openat2+0x113/0x200 fs/open.c:1367
 do_sys_open fs/open.c:1373 [inline]
 __do_sys_open fs/open.c:1381 [inline]
 __se_sys_open fs/open.c:1377 [inline]
 __x64_sys_open+0x11e/0x150 fs/open.c:1377
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x15f/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The buggy address belongs to the object at ffff88816cd2c800
 which belongs to the cache kmalloc-16 of size 16
The buggy address is located 8 bytes to the right of
 allocated 16-byte region [ffff88816cd2c800, ffff88816cd2c810)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88816cd2c840 pfn:0x16cd2c
flags: 0x57ff00000000200(workingset|node=1|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 057ff00000000200 ffff888100041640 ffff888160400408 ffff888160400408
raw: ffff88816cd2c840 0000000800800042 00000000f5000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x252800(GFP_NOWAIT|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE), pid 5741, tgid 5741 (syz-executor), ts 77437404410, free_ts 77431264562
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x231/0x280 mm/page_alloc.c:1858
 prep_new_page mm/page_alloc.c:1866 [inline]
 get_page_from_freelist+0x24ba/0x2540 mm/page_alloc.c:3946
 __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5226
 alloc_slab_page mm/slub.c:3278 [inline]
 allocate_slab+0x77/0x660 mm/slub.c:3467
 new_slab mm/slub.c:3525 [inline]
 ___slab_alloc+0x154/0x6c0 mm/slub.c:4444
 __slab_alloc_node mm/slub.c:4510 [inline]
 slab_alloc_node mm/slub.c:4886 [inline]
 __do_kmalloc_node mm/slub.c:5294 [inline]
 __kvmalloc_node_noprof+0x34d/0x8a0 mm/slub.c:6832
 xt_jumpstack_alloc net/netfilter/x_tables.c:1449 [inline]
 do_replace_table+0x191/0x620 net/netfilter/x_tables.c:1486
 xt_register_table+0x269/0x960 net/netfilter/x_tables.c:1596
 ip6t_register_table+0x16b/0x330 net/ipv6/netfilter/ip6_tables.c:1754
 ip6table_raw_table_init+0x54/0x80 net/ipv6/netfilter/ip6table_raw.c:48
 xt_find_table_lock+0x30c/0x3f0 net/netfilter/x_tables.c:1353
 xt_request_find_table_lock+0x26/0x100 net/netfilter/x_tables.c:1378
 get_info net/ipv6/netfilter/ip6_tables.c:979 [inline]
 do_ip6t_get_ctl+0x716/0x1230 net/ipv6/netfilter/ip6_tables.c:1668
 nf_getsockopt+0x26e/0x290 net/netfilter/nf_sockopt.c:116
 ipv6_getsockopt+0x1fd/0x2b0 net/ipv6/ipv6_sockglue.c:1464
 do_sock_getsockopt+0x51d/0x7e0 net/socket.c:2487
page last free pid 15 tgid 15 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 __free_pages_prepare mm/page_alloc.c:1402 [inline]
 __free_frozen_pages+0xbc7/0xd30 mm/page_alloc.c:2943
 rcu_do_batch kernel/rcu/tree.c:2617 [inline]
 rcu_core+0x7cd/0x1070 kernel/rcu/tree.c:2869
 handle_softirqs+0x22a/0x840 kernel/softirq.c:622
 run_ksoftirqd+0x36/0x60 kernel/softirq.c:1076
 smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

Memory state around the buggy address:
 ffff88816cd2c700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88816cd2c780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88816cd2c800: 00 00 fc fc 00 00 fc fc fc fc fc fc fc fc fc fc
                            ^
 ffff88816cd2c880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88816cd2c900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
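
The "8 bytes to the right" line and the caret in the memory-state dump
above can be cross-checked with a little arithmetic. The sketch below is
only an illustration; the helper names are made up, the addresses are
copied from the report, and the 8-byte granule assumes generic KASAN,
where one shadow byte covers 8 bytes of memory.

```c
#include <stdint.h>

/* Generic KASAN: one shadow byte describes 8 bytes of memory. */
#define KASAN_GRANULE 8ULL

/* Offset of the faulting access from the start of the object. */
static uint64_t offset_in_object(uint64_t bad_addr, uint64_t obj_start)
{
	return bad_addr - obj_start;
}

/* Distance of the faulting access past the end of the allocation. */
static uint64_t bytes_past_end(uint64_t bad_addr, uint64_t obj_start,
			       uint64_t obj_size)
{
	return bad_addr - (obj_start + obj_size);
}

/* Index of the shadow byte within a dump row that starts at row_base. */
static uint64_t shadow_index_in_row(uint64_t bad_addr, uint64_t row_base)
{
	return (bad_addr - row_base) / KASAN_GRANULE;
}
```

With bad_addr = 0xffff88816cd2c818, obj_start = row_base =
0xffff88816cd2c800 and obj_size = 16, the access is at offset 0x18 (24)
from the object start, 8 bytes past its end, and falls on shadow byte 3
of the marked row, which is the "fc" (redzone) byte under the caret.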


***

general protection fault in path_openat

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      5200f5f493f79f14bbdc349e402a40dfb32f23c8
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/024b677f-50e7-4ef9-ae98-2652f5098bfd/config
syz repro: https://ci.syzbot.org/findings/4b3b0fe3-064c-43e2-b887-b3a52d87a16a/syz_repro

BTRFS info (device loop1): enabling ssd optimizations
BTRFS info (device loop1): turning on async discard
BTRFS info (device loop1): enabling free space tree
BTRFS info (device loop1): use zstd compression, level 3
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 1 UID: 0 PID: 5849 Comm: syz.1.18 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__d_entry_type include/linux/dcache.h:429 [inline]
RIP: 0010:d_can_lookup include/linux/dcache.h:444 [inline]
RIP: 0010:do_open fs/namei.c:4726 [inline]
RIP: 0010:path_openat+0x2e66/0x3b40 fs/namei.c:4902
Code: e8 8f 49 7f ff eb 62 48 8b 44 24 78 42 80 3c 20 00 48 8b 5c 24 68 74 08 48 89 df e8 44 87 ea ff 4c 8b 3b 4c 89 f8 48 c1 e8 03 <42> 0f b6 04 20 84 c0 0f 85 cb 09 00 00 41 bc 00 00 38 00 45 23 27
RSP: 0018:ffffc9000405f960 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffc9000405fc28 RCX: 0000000000000000
RDX: ffff88810fe50000 RSI: 0000000000000002 RDI: 0000000000000000
RBP: ffffc9000405fbb0 R08: ffff8881b949469b R09: 1ffff110372928d3
R10: dffffc0000000000 R11: ffffed10372928d4 R12: dffffc0000000000
R13: 1ffff1102eb65c88 R14: 000000000015d0c0 R15: 0000000000000001
FS:  00007f074937f6c0(0000) GS:ffff8882a928a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000c2113360000 CR3: 0000000103f3a000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 do_file_open+0x23e/0x4a0 fs/namei.c:4931
 do_sys_openat2+0x113/0x200 fs/open.c:1367
 do_sys_open fs/open.c:1373 [inline]
 __do_sys_openat fs/open.c:1389 [inline]
 __se_sys_openat fs/open.c:1384 [inline]
 __x64_sys_openat+0x138/0x170 fs/open.c:1384
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x15f/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f074859ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f074937f028 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 00007f0748815fa0 RCX: 00007f074859ce59
RDX: 00000000001dd0c0 RSI: 0000200000000240 RDI: ffffffffffffff9c
RBP: 00007f0748632d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f0748816038 R14: 00007f0748815fa0 R15: 00007fff34fee548
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:__d_entry_type include/linux/dcache.h:429 [inline]
RIP: 0010:d_can_lookup include/linux/dcache.h:444 [inline]
RIP: 0010:do_open fs/namei.c:4726 [inline]
RIP: 0010:path_openat+0x2e66/0x3b40 fs/namei.c:4902
Code: e8 8f 49 7f ff eb 62 48 8b 44 24 78 42 80 3c 20 00 48 8b 5c 24 68 74 08 48 89 df e8 44 87 ea ff 4c 8b 3b 4c 89 f8 48 c1 e8 03 <42> 0f b6 04 20 84 c0 0f 85 cb 09 00 00 41 bc 00 00 38 00 45 23 27
RSP: 0018:ffffc9000405f960 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffc9000405fc28 RCX: 0000000000000000
RDX: ffff88810fe50000 RSI: 0000000000000002 RDI: 0000000000000000
RBP: ffffc9000405fbb0 R08: ffff8881b949469b R09: 1ffff110372928d3
R10: dffffc0000000000 R11: ffffed10372928d4 R12: dffffc0000000000
R13: 1ffff1102eb65c88 R14: 000000000015d0c0 R15: 0000000000000001
FS:  00007f074937f6c0(0000) GS:ffff8882a928a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1c7a2f73b0 CR3: 0000000103f3a000 CR4: 00000000000006f0
----------------
Code disassembly (best guess):
   0:	e8 8f 49 7f ff       	call   0xff7f4994
   5:	eb 62                	jmp    0x69
   7:	48 8b 44 24 78       	mov    0x78(%rsp),%rax
   c:	42 80 3c 20 00       	cmpb   $0x0,(%rax,%r12,1)
  11:	48 8b 5c 24 68       	mov    0x68(%rsp),%rbx
  16:	74 08                	je     0x20
  18:	48 89 df             	mov    %rbx,%rdi
  1b:	e8 44 87 ea ff       	call   0xffea8764
  20:	4c 8b 3b             	mov    (%rbx),%r15
  23:	4c 89 f8             	mov    %r15,%rax
  26:	48 c1 e8 03          	shr    $0x3,%rax
* 2a:	42 0f b6 04 20       	movzbl (%rax,%r12,1),%eax <-- trapping instruction
  2f:	84 c0                	test   %al,%al
  31:	0f 85 cb 09 00 00    	jne    0xa02
  37:	41 bc 00 00 38 00    	mov    $0x380000,%r12d
  3d:	45 23 27             	and    (%r15),%r12d
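
The trapping instruction is the KASAN shadow-byte load that precedes the
real access, which is why the oops reports the non-canonical address
0xdffffc0000000000 rather than a small pointer. The sketch below
re-creates that address computation in userspace as an illustration. The
helper names are made up, and the offset and shift are the x86-64
generic-KASAN values (the offset is also visible in R12 in the dump).

```c
#include <stdint.h>

/* x86-64 generic KASAN parameters; the offset matches R12 in the dump. */
#define KASAN_SHADOW_SCALE_SHIFT 3
#define KASAN_SHADOW_OFFSET      0xdffffc0000000000ULL

/* Address of the shadow byte that describes addr. */
static uint64_t mem_to_shadow(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}

/* First address of the 8-byte range described by a shadow byte. */
static uint64_t shadow_to_mem(uint64_t shadow)
{
	return (shadow - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT;
}
```

Feeding in the value loaded into R15 (0x1) gives a shadow address of
0xdffffc0000000000, and mapping that back gives the range starting at 0
and ending at 7, matching the "null-ptr-deref in range
[0x0000000000000000-0x0000000000000007]" line.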


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.