Hi, all
This series aims to add DCE-based DSE (dead syscalls elimination) support.
It is the first revision of the RFC patchset [1]. The whole series consists
of three parts; this is Part1.
Part1 adds basic DCE-based DSE support.
Part2 will further eliminate the unused syscalls forcibly kept by the
exception tables.
Part3 will add DSE test support with nolibc-test.c.
Changes from RFC patchset [1]:
- The DCE support [2] for RISC-V has been merged [3]
- The "nolibc: Record used syscalls in their own sections" [4] will be
delayed to Part3
- Add debug support for DCE
- CONFIG_USED_SYSCALLS now also accepts a file that stores the used syscalls
- Only symbolic syscall names are accepted now; integer syscall numbers are no longer supported
- Works with the newly added riscv syscall prefix: __riscv_
- Further trims the syscall tables by removing the trailing invalid parts
The nolibc-test based initrd runs well on a riscv64 kernel image with dead
syscalls eliminated:
$ nm build/riscv64/virt/linux/v6.6-rc2/vmlinux | grep "T __riscv_sys" | grep -v sys_ni_syscall | wc -l
48
These options should be enabled:
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION_DEBUG=y
CONFIG_TRIM_UNUSED_SYSCALLS=y
CONFIG_USED_SYSCALLS="sys_dup sys_dup3 sys_ioctl sys_mknodat sys_mkdirat sys_unlinkat sys_symlinkat sys_linkat sys_mount sys_chdir sys_chroot sys_fchmodat sys_fchownat sys_openat sys_close sys_pipe2 sys_getdents64 sys_lseek sys_read sys_write sys_pselect6 sys_ppoll sys_exit sys_sched_yield sys_kill sys_reboot sys_getpgid sys_prctl sys_gettimeofday sys_getpid sys_getppid sys_getuid sys_geteuid sys_brk sys_munmap sys_clone sys_execve sys_mmap sys_wait4 sys_statx"
The syscalls actually used:
$ echo "sys_dup sys_dup3 sys_ioctl sys_mknodat sys_mkdirat sys_unlinkat sys_symlinkat sys_linkat sys_mount sys_chdir sys_chroot sys_fchmodat sys_fchownat sys_openat sys_close sys_pipe2 sys_getdents64 sys_lseek sys_read sys_write sys_pselect6 sys_ppoll sys_exit sys_sched_yield sys_kill sys_reboot sys_getpgid sys_prctl sys_gettimeofday sys_getpid sys_getppid sys_getuid sys_geteuid sys_brk sys_munmap sys_clone sys_execve sys_mmap sys_wait4 sys_statx" | tr ' ' '\n' | wc -l
40
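The gap between the 48 syscall symbols left in vmlinux and the 40 configured ones can be listed explicitly. The following is only a minimal sketch, not part of the series: the helper name `leftover_syscalls` and the sample `nm` lines (addresses and symbols) are made up for illustration, and it assumes the usual three-column `nm` output filtered the same way as the `grep` pipeline above.

```python
def leftover_syscalls(nm_output: str, used: list[str], prefix: str = "__riscv_") -> list[str]:
    """Return syscalls still present as global text symbols but not listed as used."""
    kept = set()
    for line in nm_output.splitlines():
        fields = line.split()
        # nm prints "<address> <type> <name>"; only global text symbols ("T") count.
        if len(fields) != 3 or fields[1] != "T":
            continue
        name = fields[2]
        # Same filter as: grep "T __riscv_sys" | grep -v sys_ni_syscall
        if name.startswith(prefix + "sys") and "sys_ni_syscall" not in name:
            kept.add(name[len(prefix):])
    return sorted(kept - set(used))


# Hypothetical nm output, for illustration only.
SAMPLE_NM = """\
ffffffff80001000 T __riscv_sys_write
ffffffff80001100 T __riscv_sys_exit
ffffffff80001200 T __riscv_sys_rt_sigreturn
ffffffff80001300 t __riscv_sys_local_helper
ffffffff80001400 T __riscv_sys_ni_syscall
"""

if __name__ == "__main__":
    print(leftover_syscalls(SAMPLE_NM, ["sys_write", "sys_exit"]))
```

Feeding it the real `nm vmlinux` output and the CONFIG_USED_SYSCALLS list should name the syscalls that are still kept, which Part2 is meant to address.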
Thanks to Yuan Tan, who has researched and verified the elimination of
the unused syscalls forcibly kept by the exception tables; both the section
group and the section link order attributes of ld work. Part2 will be sent
out soon to remove another 8 unused syscalls, so that eventually we
are able to run an infinite-loop application on a kernel image without
syscalls.
Best Regards,
Zhangjin Wu
---
[1]: https://lore.kernel.org/lkml/cover.1676594211.git.falcon@tinylab.org/
[2]: https://lore.kernel.org/lkml/234017be6d06ef84844583230542e31068fa3685.1676594211.git.falcon@tinylab.org/
[3]: https://lore.kernel.org/lkml/CAFP8O3+41QFVyNTVJ2iZYkB0tqnvdLTAoGShgGy-qPP1PHjBEw@mail.gmail.com/
[4]: https://lore.kernel.org/lkml/cbcbfbb37cabfd9aed6088c75515e4ea86006cff.1676594211.git.falcon@tinylab.org/
Zhangjin Wu (7):
DCE: add debug support
DCE/DSE: add unused syscalls elimination configure support
DCE/DSE: Add a new scripts/Makefile.syscalls
DCE/DSE: mips: add HAVE_TRIM_UNUSED_SYSCALLS support
DCE/DSE: riscv: move syscall tables to syscalls/
DCE/DSE: riscv: add HAVE_TRIM_UNUSED_SYSCALLS support
DCE/DSE: riscv: trim syscall tables
Makefile | 3 +
arch/mips/Kconfig | 1 +
arch/mips/kernel/syscalls/Makefile | 23 ++++++-
arch/riscv/Kconfig | 1 +
arch/riscv/include/asm/unistd.h | 2 +
arch/riscv/kernel/Makefile | 7 +-
arch/riscv/kernel/syscalls/Makefile | 69 +++++++++++++++++++
.../{ => syscalls}/compat_syscall_table.c | 4 +-
.../kernel/{ => syscalls}/syscall_table.c | 4 +-
init/Kconfig | 49 +++++++++++++
scripts/Makefile.syscalls | 29 ++++++++
11 files changed, 182 insertions(+), 10 deletions(-)
create mode 100644 arch/riscv/kernel/syscalls/Makefile
rename arch/riscv/kernel/{ => syscalls}/compat_syscall_table.c (82%)
rename arch/riscv/kernel/{ => syscalls}/syscall_table.c (83%)
create mode 100644 scripts/Makefile.syscalls
--
2.25.1
I didn't test DSE with explicit KEEP() in the previous mail, so I will make up for it now.

This test result is about DEAD_CODE_DATA_ELIMINATION (DCE) and dead syscalls elimination (DSE). It's based on config [1] and a simple hello.c initramfs. We set CONFIG_SYSCALLS_USED="sys_write sys_exit sys_reboot", which are the syscalls hello.c uses to print "Hello", then exit and shut down qemu.

| | syscalls remaining | vmlinux size (bytes) | vmlinux after strip (bytes) |
| ------------------------------------------------------------ | -------------- | ---------------- | ------------------- |
| disable DCE | 236 | 2559632 | 1963400 |
| enable DCE | 208 | 2037384 (-20.4%) | 1485776 (-24.3%) |
| enable DCE and DSE with explicit KEEP() of exception table | 17 | 1899208 (-25.8%) | 1387272 (-29.3%) |
| enable DCE and DSE without KEEP() (by SHF_GROUP method) | 3 | 1856640 (-27.6%) | 1354424 (-31.0%) |
| enable DCE and DSE without KEEP() (by SHF_LINK_ORDER method) | 3 | 1856664 (-27.6%) | 1354424 (-31.0%) |

It shows that dead syscalls elimination can save about 7% more space on top of DCE. Although dropping KEEP() saves only about 2% more space, it reduces the attack surface and eliminates the misuse of KEEP(). It ensures that every orphan section is not orphaned anymore.

[1]: https://pastebin.com/KG4fd7aT
I don't know why linux-kernel@vger.kernel.org rejects my email sent from Thunderbird, so I am resending this mail with git send-email.

Here is a test result about DEAD_CODE_DATA_ELIMINATION (DCE) and dead syscalls elimination (DSE). It's based on config [1] and a simple hello.c initramfs. In the DSE test, we set CONFIG_SYSCALLS_USED="sys_write sys_exit sys_reboot", which are the syscalls hello.c uses to print "Hello", then exit and shut down qemu.

| | syscalls remaining | vmlinux size (bytes) | vmlinux after strip (bytes) |
| ---------------------------------- | -------------- | ---------------- | ------------------- |
| disable DCE | 236 | 2559632 | 1963400 |
| enable DCE | 208 | 2037384 (-20.4%) | 1485776 (-24.3%) |
| enable DCE and DSE (SHF_GROUP) | 3 | 1856640 (-27.6%) | 1354424 (-31.0%) |
| enable DCE and DSE (SHF_LINK_ORDER) | 3 | 1856664 (-27.6%) | 1354424 (-31.0%) |

It shows that dead syscalls elimination can save about 7% more space on top of DCE.

[1]: https://pastebin.com/KG4fd7aT
On Tue, Sep 26, 2023, at 00:33, Zhangjin Wu wrote:
>
> This series aims to add DCE based DSE support, here is the first
> revision of the RFC patchset [1], the whole series includes three parts,
> here is the Part1.
>
> This Part1 adds basic DCE based DSE support.
>
> Part2 will further eliminate the unused syscalls forcely kept by the
> exception tables.
>
> Part3 will add DSE test support with nolibc-test.c.
I missed the RFC version, but I think this is a useful thing to
have overall, though it will probably need to go through a couple
of revisions and rewrites, mostly to ensure we are not adding
complexity that gets in the way of other improvements I would
like to see to the syscall entry handling.
It would be nice to include some size numbers here for at least
one practical use case. If you have a defconfig for a shipping
product with a small kernel, what is the 'size -B' output you
see comparing with and without DCE, and with DCE+DSE?
There is generally not much work going into micro-optimizing
the size of the kernel image any more, for a number of reasons,
but if you are able to show that this is a noticeable improvement,
we should be able to find a way to do it. Geert is doing statistics
about size bloat over time, and anything that undoes a couple
of years worth of bloat would clearly be significant here.
Another alternative would be to resume the work done by Nicolas
Pitre, who added Kconfig symbols for controlling groups of
system calls. Since we already have a number of those compile
time options, adding more of them should generally be
less controversial and more consistent, while bringing most
of the same benefits.
Arnd
On Tue, Sep 26, 2023, at 09:14, Arnd Bergmann wrote:
> On Tue, Sep 26, 2023, at 00:33, Zhangjin Wu wrote:
>
> It would be nice to include some size numbers here for at least
> one practical use case. If you have a defconfig for a shipping
> product with a small kernel, what is the 'size -B' output you
> see comparing with and without DCE and, and with DCE+DSE?
To follow up on this myself, for a very rough baseline,
I tried a riscv tinyconfig build with and without
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION (this is currently
not supported on arm, so I did not try it there), and
then another build with simply *all* system calls stubbed
out by hacking asm/syscall-wrapper.h:
$ size build/tmp/vmlinux-*
    text    data    bss      dec    hex filename
  754772  220016  71841  1046629  ff865 vmlinux-tinyconfig
  717500  223368  71841  1012709  f73e5 vmlinux-tiny+nosyscalls
  567310  176200  71473   814983  c6f87 vmlinux-tiny+gc-sections
  493278  170752  71433   735463  b38e7 vmlinux-tiny+gc-sections+nosyscalls
10120058 3572756 493701 14186515 d87813 vmlinux-defconfig
 9953934 3529004 491525 13974463 d53bbf vmlinux-defconfig+gc
 9709856 3500600 489221 13699677 d10a5d vmlinux-defconfig+gc+nosyscalls
This would put us at an upper bound of 10% size savings (80kb) for
tinyconfig, which is clearly significant. For defconfig, it's
still 2.0% or 275kb size reduction when all syscalls are dropped.
Arnd
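The percentages quoted in the mail above can be re-derived from the "dec" column of the size output. A minimal sketch of that arithmetic follows; the dictionary keys and the helper name `savings` are chosen here for illustration, and only the byte counts come from the mail.

```python
# "dec" totals (text + data + bss, in bytes) copied from the size output above.
DEC = {
    "tinyconfig": 1046629,
    "tiny+nosyscalls": 1012709,
    "tiny+gc-sections": 814983,
    "tiny+gc-sections+nosyscalls": 735463,
    "defconfig+gc": 13974463,
    "defconfig+gc+nosyscalls": 13699677,
}


def savings(before: str, after: str) -> tuple[int, float]:
    """Return (bytes saved, percent saved relative to the 'before' image)."""
    saved = DEC[before] - DEC[after]
    return saved, 100.0 * saved / DEC[before]


if __name__ == "__main__":
    pairs = [
        ("tinyconfig", "tiny+nosyscalls"),
        ("tiny+gc-sections", "tiny+gc-sections+nosyscalls"),
        ("defconfig+gc", "defconfig+gc+nosyscalls"),
    ]
    for before, after in pairs:
        saved, pct = savings(before, after)
        print(f"{before} -> {after}: {saved} bytes ({pct:.1f}%)")
```

The first pair shows how much smaller the gain from stubbing out syscalls is when the linker is not garbage-collecting sections, which is why the upper bound is quoted for the gc-sections builds.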
On Tue, 26 Sep 2023, Arnd Bergmann wrote:
> On Tue, Sep 26, 2023, at 09:14, Arnd Bergmann wrote:
> > On Tue, Sep 26, 2023, at 00:33, Zhangjin Wu wrote:
> >
> > It would be nice to include some size numbers here for at least
> > one practical use case. If you have a defconfig for a shipping
> > product with a small kernel, what is the 'size -B' output you
> > see comparing with and without DCE and, and with DCE+DSE?
>
> To follow up on this myself, for a very rough baseline,
> I tried a riscv tinyconfig build with and without
> CONFIG_LD_DEAD_CODE_DATA_ELIMINATION (this is currently
> not supported on arm, so I did not try it there), and
> then another build with simply *all* system calls stubbed
> out by hacking asm/syscall-wrapper.h:
>
> $ size build/tmp/vmlinux-*
> text data bss dec hex filename
> 754772 220016 71841 1046629 ff865 vmlinux-tinyconfig
> 717500 223368 71841 1012709 f73e5 vmlinux-tiny+nosyscalls
> 567310 176200 71473 814983 c6f87 vmlinux-tiny+gc-sections
> 493278 170752 71433 735463 b38e7 vmlinux-tiny+gc-sections+nosyscalls
> 10120058 3572756 493701 14186515 d87813 vmlinux-defconfig
> 9953934 3529004 491525 13974463 d53bbf vmlinux-defconfig+gc
> 9709856 3500600 489221 13699677 d10a5d vmlinux-defconfig+gc+nosyscalls
>
> This would put us at an upper bound of 10% size savings (80kb) for
> tinyconfig, which is clearly significant. For defconfig, it's
> still 2.0% or 275kb size reduction when all syscalls are dropped.
I did something similar a while ago. Results included here:
https://lwn.net/Articles/746780/
In my case, stubbing out all syscalls produced a 7.8% reduction which
was somewhat disappointing compared to other techniques. Of course it
all depends on what is your actual goal.
Nicolas
On Tue, Sep 26, 2023, at 22:49, Nicolas Pitre wrote:
> On Tue, 26 Sep 2023, Arnd Bergmann wrote:
>
>> $ size build/tmp/vmlinux-*
>> text data bss dec hex filename
>> 754772 220016 71841 1046629 ff865 vmlinux-tinyconfig
>> 717500 223368 71841 1012709 f73e5 vmlinux-tiny+nosyscalls
>> 567310 176200 71473 814983 c6f87 vmlinux-tiny+gc-sections
>> 493278 170752 71433 735463 b38e7 vmlinux-tiny+gc-sections+nosyscalls
>> 10120058 3572756 493701 14186515 d87813 vmlinux-defconfig
>> 9953934 3529004 491525 13974463 d53bbf vmlinux-defconfig+gc
>> 9709856 3500600 489221 13699677 d10a5d vmlinux-defconfig+gc+nosyscalls
>>
>> This would put us at an upper bound of 10% size savings (80kb) for
>> tinyconfig, which is clearly significant. For defconfig, it's
>> still 2.0% or 275kb size reduction when all syscalls are dropped.
>
> I did something similar a while ago. Results included here:
>
> https://lwn.net/Articles/746780/
>
> In my case, stubbing out all syscalls produced a 7.8% reduction which
> was somewhat disappointing compared to other techniques. Of course it
> all depends on what is your actual goal.
Thanks for the link, I had forgotten about your article.
With all the findings combined, I guess the filtering
at the syscall table level is not all that promising
any more. Going through the list of saved space, I ended up
with 5.7% (47kb) in the best case after I left the 40 syscalls
from the example in this thread.
Removing entire groups of features using normal Kconfig symbols
based on the remaining syscalls that have the largest size
probably gives better results. I can see possible groups
of syscalls that could be disabled under CONFIG_EXPERT,
along with making their underlying infrastructure optional:
- xattr
- ptrace
- adjtimex
- splice/vmsplice/tee
- unshare/setns
- sched_*
After those, one would quickly hit diminishing returns.
Arnd
On Tue, Sep 26, 2023, at 13:24, Arnd Bergmann wrote:
> $ size build/tmp/vmlinux-*
> text data bss dec hex filename
> 754772 220016 71841 1046629 ff865 vmlinux-tinyconfig
> 717500 223368 71841 1012709 f73e5 vmlinux-tiny+nosyscalls
> 567310 176200 71473 814983 c6f87 vmlinux-tiny+gc-sections
> 493278 170752 71433 735463 b38e7 vmlinux-tiny+gc-sections+nosyscalls
> 10120058 3572756 493701 14186515 d87813 vmlinux-defconfig
> 9953934 3529004 491525 13974463 d53bbf vmlinux-defconfig+gc
> 9709856 3500600 489221 13699677 d10a5d vmlinux-defconfig+gc+nosyscalls
>
> This would put us at an upper bound of 10% size savings (80kb) for
> tinyconfig, which is clearly significant. For defconfig, it's
> still 2.0% or 275kb size reduction when all syscalls are dropped.
I did one more test to see which syscalls actually cause bloat
when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is set, in order to drop them
all. I built the above riscv tinyconfig with
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and truncated the syscall
table before and after each syscall to see the size difference.
A lot of syscalls are already conditional, so those show up as
0, 4 or 8 bytes (not sure why they are not always 0). Others
could probably be made to fit within some category that can
be made optional (e.g. xattr or adjtimex). Having a Kconfig
option for those would also let users remove even more code that
is not useful without the syscalls but might be called from
somewhere else in the kernel.
Arnd
syscall size name
-------------------------
0 8 io_setup
1 4 io_destroy
2 8 io_submit
3 4 io_cancel
4 8 io_getevents
5 1496 setxattr
6 28 lsetxattr
7 148 fsetxattr
8 1404 getxattr
9 16 lgetxattr
10 80 fgetxattr
11 276 listxattr
12 16 llistxattr
13 68 flistxattr
14 460 removexattr
15 20 lremovexattr
16 92 fremovexattr
17 240 getcwd
18 4 lookup_dcookie
19 8 eventfd2
20 4 epoll_create1
21 8 epoll_ctl
22 4 epoll_pwait
23 64 dup
24 300 dup3
25 1684 fcntl
26 4 inotify_init1
27 8 inotify_add_watch
29 0 ioctl
28 4 inotify_rm_watch
30 8 ioprio_set
31 4 ioprio_get
32 8 flock
33 456 mknodat
34 192 mkdirat
35 64 unlinkat
36 208 symlinkat
38 0 renameat
37 324 linkat
40 0 mount
39 64 umount2
42 0 nfsservctl
41 708 pivot_root
43 424 statfs
44 132 fstatfs
45 272 truncate
46 216 ftruncate
47 88 fallocate
48 420 faccessat
49 120 chdir
50 112 fchdir
51 120 chroot
52 68 fchmod
53 164 fchmodat
54 184 fchownat
55 136 fchown
56 184 openat
57 204 close
58 4 vhangup
59 648 pipe2
61 0 getdents64
60 4 quotactl
62 148 lseek
63 328 read
64 356 write
65 952 readv
66 252 writev
67 92 pread64
68 92 pwrite64
69 100 preadv
71 0 sendfile
72 0 pselect6
70 100 pwritev
73 132 ppoll
74 4 signalfd4
75 2808 vmsplice
76 1388 splice
77 536 tee
78 424 readlinkat
79 244 fstatat
80 64 fstat
81 296 sync
82 100 fsync
83 20 fdatasync
84 448 sync_file_range
85 8 timerfd_create
86 4 timerfd_settime
87 8 timerfd_gettime
88 300 utimensat
89 4 acct
90 8 capget
91 4 capset
92 24 personality
93 24 exit
94 24 exit_group
95 16 waitid
96 28 set_tid_address
97 608 unshare
98 4 futex
99 8 set_robust_list
100 4 get_robust_list
101 276 nanosleep
103 0 setitimer
102 8 getitimer
104 4 kexec_load
105 8 init_module
107 0 timer_create
108 0 timer_gettime
109 0 timer_getoverrun
110 0 timer_settime
111 0 timer_delete
106 4 delete_module
112 44 clock_settime
113 88 clock_gettime
114 64 clock_getres
115 160 clock_nanosleep
116 8 syslog
117 740 ptrace
118 140 sched_setparam
119 36 sched_setscheduler
120 64 sched_getscheduler
121 88 sched_getparam
122 196 sched_setaffinity
123 180 sched_getaffinity
124 24 sched_yield
125 60 sched_get_priority_max
126 60 sched_get_priority_min
127 164 sched_rr_get_interval
128 12 restart_syscall
129 304 kill
130 212 tkill
131 40 tgkill
132 100 sigaltstack
133 104 rt_sigsuspend
134 396 rt_sigaction
135 180 rt_sigprocmask
136 76 rt_sigpending
137 336 rt_sigtimedwait
139 0 rt_sigreturn
138 120 rt_sigqueueinfo
140 396 setpriority
141 276 getpriority
142 1256 reboot
143 4 setregid
144 8 setgid
145 4 setreuid
146 8 setuid
147 4 setresuid
148 8 getresuid
149 4 setresgid
150 8 getresgid
151 4 setfsuid
152 8 setfsgid
153 152 times
154 252 setpgid
155 48 getpgid
156 48 getsid
157 140 setsid
158 8 getgroups
159 4 setgroups
160 172 uname
161 132 sethostname
162 136 setdomainname
163 156 getrlimit
164 52 setrlimit
165 88 getrusage
167 0 prctl
168 0 getcpu
169 0 gettimeofday
170 0 settimeofday
166 24 umask
171 1514 adjtimex
172 20 getpid
173 20 getppid
174 4 getuid
175 4 geteuid
176 4 getgid
177 4 getegid
178 20 gettid
179 276 sysinfo
180 4 mq_open
181 8 mq_unlink
182 4 mq_timedsend
183 8 mq_timedreceive
184 4 mq_notify
185 8 mq_getsetattr
186 4 msgget
187 8 msgctl
188 4 msgrcv
189 8 msgsnd
190 4 semget
191 8 semctl
192 4 semtimedop
193 8 semop
194 4 shmget
195 8 shmctl
196 4 shmat
197 8 shmdt
198 4 socket
199 8 socketpair
200 4 bind
201 8 listen
202 4 accept
203 8 connect
204 4 getsockname
205 8 getpeername
206 4 sendto
207 8 recvfrom
208 4 setsockopt
209 8 getsockopt
210 4 shutdown
211 8 sendmsg
212 4 recvmsg
213 460 readahead
214 2872 brk
215 288 munmap
216 4268 mremap
217 4 add_key
218 8 request_key
219 4 keyctl
220 100 clone
221 724 execve
222 2504 mmap
223 8 fadvise64
224 4 swapon
225 8 swapoff
226 2180 mprotect
227 320 msync
228 1140 mlock
229 84 munlock
230 304 mlockall
231 52 munlockall
232 828 mincore
233 4 madvise
234 324 remap_file_pages
235 4 mbind
236 8 get_mempolicy
237 4 set_mempolicy
238 8 migrate_pages
239 4 move_pages
240 132 rt_tgsigqueueinfo
241 8 perf_event_open
242 4 accept4
244 0 arch_specific_syscall
243 8 recvmmsg
260 100 wait4
261 252 prlimit64
262 8 fanotify_init
263 4 fanotify_mark
264 8 name_to_handle_at
266 0 clock_adjtime
265 4 open_by_handle_at
267 120 syncfs
268 624 setns
269 4 sendmmsg
270 8 process_vm_readv
271 4 process_vm_writev
272 8 kcmp
274 0 sched_setattr
273 4 finit_module
275 208 sched_getattr
276 2364 renameat2
277 4 seccomp
278 124 getrandom
279 4 memfd_create
280 8 bpf
281 52 execveat
282 4 userfaultfd
283 8 membarrier
284 40 mlock2
285 708 copy_file_range
286 32 preadv2
287 32 pwritev2
288 8 pkey_mprotect
289 4 pkey_alloc
290 8 pkey_free
291 356 statx
292 4 io_pgetevents
424 244 pidfd_send_signal
425 8 io_uring_setup
426 4 io_uring_enter
427 8 io_uring_register
428 368 open_tree
429 404 move_mount
430 556 fsopen
431 1056 fsconfig
432 484 fsmount
433 220 fspick
434 124 pidfd_open
435 516 clone3
436 240 close_range
437 120 openat2
438 304 pidfd_getfd
439 12 faccessat2
440 8 process_madvise
441 4 epoll_pwait2
442 1088 mount_setattr
443 8 quotactl_fd
444 4 landlock_create_ruleset
445 8 landlock_add_rule
446 4 landlock_restrict_self
447 8 memfd_secret
448 240 process_mrelease
449 4 futex_waitv
450 8 set_mempolicy_home_node
451 4 cachestat
452 28 fchmodat2
454 4 futex_wake
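To get a feel for how much the candidate Kconfig groups mentioned in this thread (xattr, ptrace, adjtimex, splice/vmsplice/tee, unshare/setns, sched_*) account for, the per-syscall sizes above can be summed per group. This is only a sketch: the group membership is an assumption made here by matching syscall names, the byte counts are copied from the table, and the totals cover only what the table measures, not any underlying infrastructure that a Kconfig option might also remove.

```python
# Sizes in bytes, copied from the per-syscall table above (selected rows only).
SIZES = {
    "setxattr": 1496, "lsetxattr": 28, "fsetxattr": 148,
    "getxattr": 1404, "lgetxattr": 16, "fgetxattr": 80,
    "listxattr": 276, "llistxattr": 16, "flistxattr": 68,
    "removexattr": 460, "lremovexattr": 20, "fremovexattr": 92,
    "ptrace": 740,
    "adjtimex": 1514,
    "vmsplice": 2808, "splice": 1388, "tee": 536,
    "unshare": 608, "setns": 624,
    "sched_setparam": 140, "sched_setscheduler": 36,
    "sched_getscheduler": 64, "sched_getparam": 88,
    "sched_setaffinity": 196, "sched_getaffinity": 180,
    "sched_yield": 24, "sched_get_priority_max": 60,
    "sched_get_priority_min": 60, "sched_rr_get_interval": 164,
    "sched_setattr": 0, "sched_getattr": 208,
}

# Assumed grouping: each syscall is assigned to a group by its name.
GROUPS = {
    "xattr": [n for n in SIZES if n.endswith("xattr")],
    "ptrace": ["ptrace"],
    "adjtimex": ["adjtimex"],
    "splice/vmsplice/tee": ["splice", "vmsplice", "tee"],
    "unshare/setns": ["unshare", "setns"],
    "sched_*": [n for n in SIZES if n.startswith("sched_")],
}


def group_totals(sizes: dict[str, int], groups: dict[str, list[str]]) -> dict[str, int]:
    """Sum the measured size of every syscall in each group."""
    return {group: sum(sizes[name] for name in names) for group, names in groups.items()}


if __name__ == "__main__":
    totals = group_totals(SIZES, GROUPS)
    for group, total in sorted(totals.items(), key=lambda item: -item[1]):
        print(f"{group:22s} {total:6d} bytes")
    print(f"{'all groups':22s} {sum(totals.values()):6d} bytes")
```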