Hi Jon,
This series contains several improvements for kernel-doc.
Most of the patches came from v4 of this series:
https://lore.kernel.org/linux-doc/cover.1769867953.git.mchehab+huawei@kernel.org/
But I dropped the unit tests part from this series. I'll be
submitting it as a separate series.
The rationale is that, when I converted kernel-doc from Perl,
the goal was to produce a bug-compatible version.

As anyone who has worked with kernel-doc before is aware, using regex
to handle C input is not great. Instead, we need something closer to
how C statements and declarations are handled.
Yet, to avoid breaking docs, I avoided touching the regex-based
algorithms inside it, with one exception: the struct_group logic was
using very complex regexes that are incompatible with Python's
internal "re" module.

So, I came up with a different approach: NestedMatch. The logic inside
it is meant to properly handle brackets, square brackets and
parentheses, which is closer to what a C lexical parser does. At that
time, I added a TODO about the need to extend that.
The first part of this series does exactly that: it extends
NestedMatch to parse comma-separated arguments, respecting brackets
and parentheses.

It then adds an "alias" for it in the CFunction class. With that,
specifying functions/macros to be handled becomes much easier.
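For illustration only (this is not the actual NestedMatch code in
tools/lib/python/kdoc/kdoc_re.py, just a sketch of the idea), the
nesting-aware argument splitting boils down to:

```python
def split_args(text):
    """Split a comma-separated C argument list, honoring (), [] and {}.

    A simplified sketch: the real code also has to deal with strings,
    character literals and comments.
    """
    args, cur, depth = [], [], 0
    for ch in text:
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        if ch == "," and depth == 0:
            # Only a top-level comma separates arguments
            args.append("".join(cur).strip())
            cur = []
        else:
            cur.append(ch)
    if cur:
        args.append("".join(cur).strip())
    return args

print(split_args("a, f(b, c), { .x = 1, .y = 2 }"))
# ['a', 'f(b, c)', '{ .x = 1, .y = 2 }']
```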
With such infra in place, it moves the transform functions to a
separate file, hopefully making them easier to maintain. As a side
effect, it also makes it easier for other projects to use kernel-doc
(I tested it on QEMU).
Then, it adds support for newer kref annotations.
The remaining patches in this series improve the man page output,
making the generated pages more compatible with other man pages.
---
I wrote several unit tests to check kernel-doc behavior. I intend to
submit them on top of this series later on.
Regards,
Mauro
Mauro Carvalho Chehab (36):
docs: kdoc_re: add support for groups()
docs: kdoc_re: don't go past the end of a line
docs: kdoc_parser: move var transformers to the beginning
docs: kdoc_parser: don't mangle with function defines
docs: kdoc_parser: add functions support for NestedMatch
docs: kdoc_parser: use NestedMatch to handle __attribute__ on
    functions
docs: kdoc_parser: fix variable regexes to work with size_t
docs: kdoc_parser: fix the default_value logic for variables
docs: kdoc_parser: add some debug for variable parsing
docs: kdoc_parser: don't exclude defaults from prototype
docs: kdoc_parser: fix parser to support multi-word types
docs: kdoc_parser: add support for LIST_HEAD
docs: kdoc_re: properly handle strings and escape chars on it
docs: kdoc_re: better show KernRe() at documentation
docs: kdoc_re: don't recompile NestedMatch regex every time
docs: kdoc_re: Change NestedMatch args replacement to \0
docs: kdoc_re: make NestedMatch use KernRe
docs: kdoc_re: add support on NestedMatch for argument replacement
docs: kdoc_parser: better handle struct_group macros
docs: kdoc_re: fix a parse bug on struct page_pool_params
docs: kdoc_re: add a helper class to declare C function matches
docs: kdoc_parser: use the new CFunction class
docs: kdoc_parser: minimize differences with struct_group_tagged
docs: kdoc_parser: move transform lists to a separate file
docs: kdoc_re: don't remove the trailing ";" with NestedMatch
docs: kdoc_re: prevent adding whitespaces on sub replacements
docs: xforms_lists.py: use CFunction to handle all function macros
docs: kdoc_files: allow the caller to use a different xforms class
docs: kdoc_re: Fix NestedMatch.sub() which causes PDF builds to break
docs: kdoc_files: document KernelFiles() ABI
docs: kdoc_output: add optional args to ManOutput class
docs: sphinx-build-wrapper: better handle troff .TH markups
docs: kdoc_output: use a more standard order for .TH on man pages
docs: sphinx-build-wrapper: don't allow "/" on file names
docs: kdoc_output: describe the class init parameters
docs: kdoc_output: pick a better default for modulename
Randy Dunlap (2):
docs: kdoc_parser: ignore context analysis and lock attributes
docs: kdoc_parser: handle struct member macro
    VIRTIO_DECLARE_FEATURES(name)
Documentation/tools/kdoc_parser.rst | 8 +
tools/docs/kernel-doc | 1 -
tools/docs/sphinx-build-wrapper | 9 +-
tools/lib/python/kdoc/kdoc_files.py | 54 +++++-
tools/lib/python/kdoc/kdoc_output.py | 73 ++++++--
tools/lib/python/kdoc/kdoc_parser.py | 183 ++++---------------
tools/lib/python/kdoc/kdoc_re.py | 242 ++++++++++++++++++++------
tools/lib/python/kdoc/xforms_lists.py | 109 ++++++++++++
8 files changed, 451 insertions(+), 228 deletions(-)
create mode 100644 tools/lib/python/kdoc/xforms_lists.py
--
2.52.0
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:

> Hi Jon,
>
> This series contains several improvements for kernel-doc.
>
> Most of the patches came from v4 of this series:
> https://lore.kernel.org/linux-doc/cover.1769867953.git.mchehab+huawei@kernel.org/

So I will freely confess to having lost the plot with this stuff; I'm
now trying to get back up to speed. But, before I dig into this big
series, can you say whether you think it's ready, or whether there's
another one on the horizon that I should wait for?

Thanks,

jon
On Mon, 23 Feb 2026 14:58:53 -0700
Jonathan Corbet <corbet@lwn.net> wrote:

> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
>
> > Hi Jon,
> >
> > This series contains several improvements for kernel-doc.
> >
> > Most of the patches came from v4 of this series:
> > https://lore.kernel.org/linux-doc/cover.1769867953.git.mchehab+huawei@kernel.org/
>
> So I will freely confess to having lost the plot with this stuff; I'm
> now trying to get back up to speed.

Yeah, I kinda figured it out ;-)

> But, before I dig into this big
> series, can you say whether you think it's ready, or whether there's
> another one on the horizon that I should wait for?

There are more things underway, but I need some time to reorganize
the patchset... currently, there are 60+ patches on my pile.

So, instead of merging this patchset, I'll be sending you a smaller
series with the basic stuff, in a way that it would be easier to
review. My plan is to send patches along this week in smaller chunks,
after checking the differences before/after in terms of man/rst/error
output.

--
Thanks,
Mauro
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:

> So, instead of merging this patchset, I'll be sending you
> a smaller series with the basic stuff, in a way that it would
> be easier to review. My plan is to send patches along this week
> on smaller chunks, and after checking the differences before/after,
> in terms of man/rst/error output.

OK... *whew* ... that sounds like a better way to proceed :)

Thanks,

jon
On 2/18/26 2:12 AM, Mauro Carvalho Chehab wrote:
> Hi Jon,
>
> This series contains several improvements for kernel-doc.
>
> Most of the patches came from v4 of this series:
> https://lore.kernel.org/linux-doc/cover.1769867953.git.mchehab+huawei@kernel.org/
>

Mauro,

Is this series available as a git tree/branch?

Or what is the base for applying this series?

Thanks.

--
~Randy
On 2/20/26 5:24 PM, Randy Dunlap wrote:
>
>
> On 2/18/26 2:12 AM, Mauro Carvalho Chehab wrote:
>> Hi Jon,
>>
>> This series contain several improvements for kernel-doc.
>>
>> Most of the patches came from v4 of this series:
>> https://lore.kernel.org/linux-doc/cover.1769867953.git.mchehab+huawei@kernel.org/
>>
>
> Mauro,
> Is this series available as a git tree/branch?
>
> Or what is the base for applying this series?
I applied the series to linux-next-20260220. It applies cleanly
except for one gotcha (using 'patch'):

In patch 25, in the commit description, I had to change the
example before/after diff to have leading "//" ('patch' was
treating those lines as part of the diff).
I am still seeing kernel-doc warnings being duplicated.
It seems like there is a patch for that, but it's not applied yet
and not part of this series...?
The results on linux-next-20260220 look good.
I do have one issue with a test file that I had sent to you (Mauro)
earlier: kdoc-nested.c

In struct super_struct, the fields of nested struct tlv are not
described, but there is no warning about that.

Likewise for the fields of the nested structs header, gen_descr,
and data.
Does this series address when /* private: */ is turned off at the
end of a struct/union? If so, I don't see it working.
See struct nla_policy for where the final struct member should be
public.
kdoc-nested.c test file is attached.
thanks.
--
~Randy
On Wed, 18 Feb 2026, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> As anyone who has worked with kernel-doc before is aware, using regex
> to handle C input is not great. Instead, we need something closer to
> how C statements and declarations are handled.
>
> Yet, to avoid breaking docs, I avoided touching the regex-based
> algorithms inside it, with one exception: the struct_group logic was
> using very complex regexes that are incompatible with Python's
> internal "re" module.
>
> So, I came up with a different approach: NestedMatch. The logic inside
> it is meant to properly handle brackets, square brackets and
> parentheses, which is closer to what a C lexical parser does. At that
> time, I added a TODO about the need to extend that.

There's always the question, if you're putting a lot of effort into
making kernel-doc closer to an actual C parser, why not put all that
effort into using and adapting to, you know, an actual C parser?

BR,
Jani.

--
Jani Nikula, Intel
On Mon, 23 Feb 2026 15:47:00 +0200
Jani Nikula <jani.nikula@linux.intel.com> wrote:
> On Wed, 18 Feb 2026, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > As anyone who has worked with kernel-doc before is aware, using regex
> > to handle C input is not great. Instead, we need something closer to
> > how C statements and declarations are handled.
> >
> > Yet, to avoid breaking docs, I avoided touching the regex-based
> > algorithms inside it, with one exception: the struct_group logic was
> > using very complex regexes that are incompatible with Python's
> > internal "re" module.
> >
> > So, I came up with a different approach: NestedMatch. The logic inside
> > it is meant to properly handle brackets, square brackets and
> > parentheses, which is closer to what a C lexical parser does. At that
> > time, I added a TODO about the need to extend that.
>
> There's always the question, if you're putting a lot of effort into
> making kernel-doc closer to an actual C parser, why not put all that
> effort into using and adapting to, you know, an actual C parser?
Playing with this idea, it is not that hard to write an actual C
parser - or at least a tokenizer. There is already an example
at:
https://docs.python.org/3/library/re.html
I did a quick implementation, and it seems to be able to do its job:
$ ./tokenizer.py ./include/net/netlink.h
1: 0 COMMENT '/* SPDX-License-Identifier: GPL-2.0 */'
2: 0 CPP '#ifndef'
2: 8 ID '__NET_NETLINK_H'
3: 0 CPP '#define'
3: 8 ID '__NET_NETLINK_H'
5: 0 CPP '#include'
5: 9 OP '<'
5: 10 ID 'linux'
5: 15 OP '/'
5: 16 ID 'types'
5: 21 PUNC '.'
5: 22 ID 'h'
5: 23 OP '>'
6: 0 CPP '#include'
6: 9 OP '<'
6: 10 ID 'linux'
6: 15 OP '/'
6: 16 ID 'netlink'
6: 23 PUNC '.'
6: 24 ID 'h'
6: 25 OP '>'
7: 0 CPP '#include'
7: 9 OP '<'
7: 10 ID 'linux'
7: 15 OP '/'
7: 16 ID 'jiffies'
7: 23 PUNC '.'
7: 24 ID 'h'
7: 25 OP '>'
8: 0 CPP '#include'
8: 9 OP '<'
8: 10 ID 'linux'
8: 15 OP '/'
8: 16 ID 'in6'
...
12: 1 COMMENT '/**\n * Standard attribute types to specify validation policy\n */'
13: 0 ENUM 'enum'
13: 5 PUNC '{'
14: 1 ID 'NLA_UNSPEC'
14: 11 PUNC ','
15: 1 ID 'NLA_U8'
15: 7 PUNC ','
16: 1 ID 'NLA_U16'
16: 8 PUNC ','
17: 1 ID 'NLA_U32'
17: 8 PUNC ','
18: 1 ID 'NLA_U64'
18: 8 PUNC ','
19: 1 ID 'NLA_STRING'
19: 11 PUNC ','
20: 1 ID 'NLA_FLAG'
...
41: 0 STRUCT 'struct'
41: 7 ID 'netlink_range_validation'
41: 32 PUNC '{'
42: 1 ID 'u64'
42: 5 ID 'min'
42: 8 PUNC ','
42: 10 ID 'max'
42: 13 PUNC ';'
43: 0 PUNC '}'
43: 1 PUNC ';'
45: 0 STRUCT 'struct'
45: 7 ID 'netlink_range_validation_signed'
45: 39 PUNC '{'
46: 1 ID 's64'
46: 5 ID 'min'
46: 8 PUNC ','
46: 10 ID 'max'
46: 13 PUNC ';'
47: 0 PUNC '}'
47: 1 PUNC ';'
49: 0 ENUM 'enum'
49: 5 ID 'nla_policy_validation'
49: 27 PUNC '{'
50: 1 ID 'NLA_VALIDATE_NONE'
50: 18 PUNC ','
51: 1 ID 'NLA_VALIDATE_RANGE'
51: 19 PUNC ','
52: 1 ID 'NLA_VALIDATE_RANGE_WARN_TOO_LONG'
52: 33 PUNC ','
53: 1 ID 'NLA_VALIDATE_MIN'
53: 17 PUNC ','
54: 1 ID 'NLA_VALIDATE_MAX'
54: 17 PUNC ','
55: 1 ID 'NLA_VALIDATE_MASK'
55: 18 PUNC ','
56: 1 ID 'NLA_VALIDATE_RANGE_PTR'
56: 23 PUNC ','
57: 1 ID 'NLA_VALIDATE_FUNCTION'
57: 22 PUNC ','
58: 0 PUNC '}'
58: 1 PUNC ';'
It sounds doable to use it, and, at least in this example, it
properly picked the IDs.

On the other hand, using it would require lots of changes to
kernel-doc. So, I guess I'll add a tokenizer to kernel-doc, but
we should likely start using it gradually.

Maybe starting with NestedSearch and with the public/private
comment handling (which is currently half-broken).
As a reference, the above was generated with the code below,
which was based on the Python re documentation.
Comments?
---
One side note: right now, we're not using typing in kernel-doc,
nor really following a proper coding style.

I wanted to use it during the conversion, and place consts in
uppercase, as this is currently the best practice, but doing so
while converting from Perl was very annoying. So, I opted to
make things simpler. Now that we have it coded, perhaps it is
time to define a coding style and apply it to kernel-doc.
--
Thanks,
Mauro
#!/usr/bin/env python3

import sys
import re


class Token():
    def __init__(self, type, value, line, column):
        self.type = type
        self.value = value
        self.line = line
        self.column = column


class CTokenizer():
    C_KEYWORDS = {
        "struct", "union", "enum",
    }

    TOKEN_LIST = [
        ("COMMENT",  r"//[^\n]*|/\*[\s\S]*?\*/"),

        ("STRING",   r'"(?:\\.|[^"\\])*"'),
        ("CHAR",     r"'(?:\\.|[^'\\])'"),

        ("NUMBER",   r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
                     r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),

        ("ID",       r"[A-Za-z_][A-Za-z0-9_]*"),

        ("OP",       r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
                     r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),

        ("PUNC",     r"[;,\.\[\]\(\)\{\}]"),

        ("CPP",      r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)"),

        ("HASH",     r"#"),

        ("NEWLINE",  r"\n"),

        ("SKIP",     r"[\s]+"),

        ("MISMATCH", r"."),
    ]

    def __init__(self):
        re_tokens = []

        for name, pattern in self.TOKEN_LIST:
            re_tokens.append(f"(?P<{name}>{pattern})")

        self.re_scanner = re.compile("|".join(re_tokens),
                                     re.MULTILINE | re.DOTALL)

    def tokenize(self, code):
        # Handle continuation lines
        code = re.sub(r"\\\n", "", code)

        line_num = 1
        line_start = 0

        for match in self.re_scanner.finditer(code):
            kind = match.lastgroup
            value = match.group()
            column = match.start() - line_start

            if kind == "NEWLINE":
                line_start = match.end()
                line_num += 1
                continue

            if kind in {"SKIP"}:
                continue

            if kind == "MISMATCH":
                raise RuntimeError(f"Unexpected character {value!r} on line {line_num}")

            if kind == "ID" and value in self.C_KEYWORDS:
                kind = value.upper()

            # For all other tokens we keep the raw string value
            yield Token(kind, value, line_num, column)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: python {sys.argv[0]} <fname>")
        sys.exit(1)

    fname = sys.argv[1]

    try:
        with open(fname, 'r', encoding='utf-8') as file:
            sample = file.read()
    except FileNotFoundError:
        print(f"Error: The file '{fname}' was not found.")
        sys.exit(1)
    except Exception as e:
        print(f"An error occurred while reading the file: {str(e)}")
        sys.exit(1)

    print(f"Tokens from {fname}:")

    for tok in CTokenizer().tokenize(sample):
        print(f"{tok.line:3d}:{tok.column:3d} {tok.type:12} {tok.value!r}")
On Tue, 03 Mar 2026, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> On Mon, 23 Feb 2026 15:47:00 +0200
> Jani Nikula <jani.nikula@linux.intel.com> wrote:
>
>> There's always the question, if you're putting a lot of effort into
>> making kernel-doc closer to an actual C parser, why not put all that
>> effort into using and adapting to, you know, an actual C parser?
>
> Playing with this idea, it is not that hard to write an actual C
> parser - or at least a tokenizer.

Just for the record, I suggested using an existing parser, not going
all NIH and writing your own.

BR,
Jani.

--
Jani Nikula, Intel
> -----Original Message-----
> From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Sent: Tuesday, March 3, 2026 3:53 PM
> To: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Lobakin, Aleksander <aleksander.lobakin@intel.com>; Jonathan
> Corbet <corbet@lwn.net>; Kees Cook <kees@kernel.org>; Mauro Carvalho
> Chehab <mchehab@kernel.org>; intel-wired-lan@lists.osuosl.org; linux-
> doc@vger.kernel.org; linux-hardening@vger.kernel.org; linux-
> kernel@vger.kernel.org; netdev@vger.kernel.org; Gustavo A. R. Silva
> <gustavoars@kernel.org>; Loktionov, Aleksandr
> <aleksandr.loktionov@intel.com>; Randy Dunlap <rdunlap@infradead.org>;
> Shuah Khan <skhan@linuxfoundation.org>
> Subject: Re: [PATCH 00/38] docs: several improvements to kernel-doc
>
> On Mon, 23 Feb 2026 15:47:00 +0200
> Jani Nikula <jani.nikula@linux.intel.com> wrote:
>
> > On Wed, 18 Feb 2026, Mauro Carvalho Chehab
> > <mchehab+huawei@kernel.org> wrote:
> > > As anyone who has worked with kernel-doc before is aware, using
> > > regex to handle C input is not great. Instead, we need something
> > > closer to how C statements and declarations are handled.
> > >
> > > Yet, to avoid breaking docs, I avoided touching the regex-based
> > > algorithms inside it, with one exception: the struct_group logic
> > > was using very complex regexes that are incompatible with Python's
> > > internal "re" module.
> > >
> > > So, I came up with a different approach: NestedMatch. The logic
> > > inside it is meant to properly handle brackets, square brackets
> > > and parentheses, which is closer to what a C lexical parser does.
> > > At that time, I added a TODO about the need to extend that.
> >
> > There's always the question, if you're putting a lot of effort into
> > making kernel-doc closer to an actual C parser, why not put all that
> > effort into using and adapting to, you know, an actual C parser?
>
> Playing with this idea, it is not that hard to write an actual C
> parser - or at least a tokenizer. There is already an example at:
>
> https://docs.python.org/3/library/re.html
>
> I did a quick implementation, and it seems to be able to do its job:
>
> ...
As a hobby C compiler writer, I must say that you need to implement
a C preprocessor first, because the C preprocessor influences/changes
the syntax.

In your tokenizer I see right away that any line which begins with '#'
must be treated as a C preprocessor command without further tokenizing.

But the real pain is C preprocessor substitutions, IMHO.
On Tue, 3 Mar 2026 15:12:30 +0000
"Loktionov, Aleksandr" <aleksandr.loktionov@intel.com> wrote:
> > -----Original Message-----
> > From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > Sent: Tuesday, March 3, 2026 3:53 PM
> > To: Jani Nikula <jani.nikula@linux.intel.com>
> > Cc: Lobakin, Aleksander <aleksander.lobakin@intel.com>; Jonathan
> > Corbet <corbet@lwn.net>; Kees Cook <kees@kernel.org>; Mauro Carvalho
> > Chehab <mchehab@kernel.org>; intel-wired-lan@lists.osuosl.org; linux-
> > doc@vger.kernel.org; linux-hardening@vger.kernel.org; linux-
> > kernel@vger.kernel.org; netdev@vger.kernel.org; Gustavo A. R. Silva
> > <gustavoars@kernel.org>; Loktionov, Aleksandr
> > <aleksandr.loktionov@intel.com>; Randy Dunlap <rdunlap@infradead.org>;
> > Shuah Khan <skhan@linuxfoundation.org>
> > Subject: Re: [PATCH 00/38] docs: several improvements to kernel-doc
> >
> > On Mon, 23 Feb 2026 15:47:00 +0200
> > Jani Nikula <jani.nikula@linux.intel.com> wrote:
> >
> > > There's always the question, if you're putting a lot of effort into
> > > making kernel-doc closer to an actual C parser, why not put all that
> > > effort into using and adapting to, you know, an actual C parser?
> >
> > Playing with this idea, it is not that hard to write an actual C
> > parser - or at least a tokenizer. There is already an example of it
> > at:
> >
> > https://docs.python.org/3/library/re.html
> >
> > I did a quick implementation, and it seems to be able to do its job:
...
>
> As a hobby C compiler writer, I must say that you need to implement a C preprocessor first, because the C preprocessor influences/changes the syntax.
> In your tokenizer I see right away that any line which begins with '#' must be treated as a C preprocessor command without further tokenizing.
Yeah, we may need to implement a C preprocessor parser in the future,
but this will require handling #include, which could be somewhat
complex. It is also tricky to handle conditional preprocessor macros,
as kernel-doc would either require a file with at least some defines
or would have to guess how to evaluate them to produce the right
documentation, as ifdefs interfere with C macros.
For now, I want to solve some specific problems:
- fix the trim_private_members() function, which is meant to handle
  /* private: */ and /* public: */ comments, as it currently has
  bugs when used on nested structs/unions, related to where the
  "private" scope ends;

- properly parse nested structs/unions and properly pick nested
  identifiers;

- detect and replace function arguments when macros with multiple
  arguments are used in the same prototype.
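To illustrate the first problem, the scope tracking that the private
handling needs can be sketched as below. This is my own sketch, not the
kernel-doc code, and member detection is deliberately naive: the point
is only that a visibility marker must be tied to the brace depth where
it appears, so leaving a nested struct restores the outer visibility.

```python
import re

def public_members(body):
    """Yield (depth, line) for member lines that are publicly visible.

    A /* private: */ marker hides members until either a /* public: */
    marker appears or the scope where it was seen is closed.
    """
    private_at = None          # brace depth where /* private: */ was seen
    depth = 0
    for line in body.splitlines():
        if re.search(r"/\*\s*private:", line):
            private_at = depth
        elif re.search(r"/\*\s*public:", line):
            private_at = None
        elif private_at is None and line.strip():
            yield (depth, line.strip())
        depth += line.count("{") - line.count("}")
        if private_at is not None and depth < private_at:
            private_at = None  # left the private scope
```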
Plus, kernel-doc already has a table of transforms to "convert"
the C preprocessor macros that affect documentation into something
that will work.

So, I'm considering starting simple for now, ignoring cpp and
addressing the existing issues.
> But the real pain is C preprocessor substitutions, IMHO.
Agreed. For now, we're using a transforms list inside kernel-doc for
that purpose. So, those macros are manually "evaluated" there, like:
(KernRe(r'DEFINE_DMA_UNMAP_ADDR\s*\(' + struct_args_pattern + r'\)', re.S), r'dma_addr_t \1'),
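With plain re.sub instead of KernRe (and a simplified \w+ in place of
struct_args_pattern), the behavior of such a transform looks like:

```python
import re

proto = "DEFINE_DMA_UNMAP_ADDR(addr);"

# Rewrite the macro invocation into the declaration it stands for.
expanded = re.sub(r"DEFINE_DMA_UNMAP_ADDR\s*\((\w+)\)",
                  r"dma_addr_t \1", proto)
print(expanded)  # dma_addr_t addr;
```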
This works fine in trivial cases, where the argument is just an ID,
but there are cases where we use macros like here:
struct page_pool_params {
	struct_group_tagged(page_pool_params_fast, fast,
		unsigned int order;
		unsigned int pool_size;
		int nid;
		struct device *dev;
		struct napi_struct *napi;
		enum dma_data_direction dma_dir;
		unsigned int max_len;
		unsigned int offset;
	);
	struct_group_tagged(page_pool_params_slow, slow,
		struct net_device *netdev;
		unsigned int queue_idx;
		unsigned int flags;
/* private: used by test code only */
		void (*init_callback)(netmem_ref netmem, void *arg);
		void *init_arg;
	);
};
To handle it, I'm thinking of using something like this(*):

	(CFunction('struct_group_tagged'), r'struct \1 { \3 } \2;')

E.g. teaching kernel-doc that, when:

	struct_group_tagged(a, b, c)

is used, it should convert it into:

	struct a { c } b;

which is basically what this macro does. In other words, hardcoding
kernel-doc with some rules to handle the cases where CPP macros need
to be evaluated. As there aren't many cases where such macros affect
documentation (in lots of cases, just dropping the macros is enough),
such an approach kinda works.
(*) I already wrote a patch for it, but as Jani pointed out, perhaps
    using a tokenizer will make the logic simpler and easier to
    understand/maintain.
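The expansion rule can be prototyped with depth counting (my own
illustration of the intended behavior, not the actual patch; it
handles a single, non-nested invocation and assumes the tag and name
are plain identifiers):

```python
import re

def expand_struct_group_tagged(code):
    """Rewrite struct_group_tagged(TAG, NAME, MEMBERS); into
    struct TAG { MEMBERS } NAME;

    The closing parenthesis is found by depth counting, so MEMBERS
    may itself contain parentheses (e.g. function pointer members).
    """
    m = re.search(r"struct_group_tagged\s*\(", code)
    if not m:
        return code
    start = m.end()
    depth = 1
    for i in range(start, len(code)):
        if code[i] == "(":
            depth += 1
        elif code[i] == ")":
            depth -= 1
            if depth == 0:
                break
    # TAG and NAME are simple IDs, so splitting the first two commas is safe
    tag, name, members = code[start:i].split(",", 2)
    repl = f"struct {tag.strip()} {{ {members.strip()} }} {name.strip()};"
    # Also swallow the trailing ";" of the macro invocation
    rest = code[i + 1:].lstrip()
    rest = rest[1:] if rest.startswith(";") else rest
    return code[:m.start()] + repl + rest
```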
--
Thanks,
Mauro
Jani Nikula <jani.nikula@linux.intel.com> writes:

> There's always the question, if you're putting a lot of effort into
> making kernel-doc closer to an actual C parser, why not put all that
> effort into using and adapting to, you know, an actual C parser?

Not speaking to the current effort but ... in the past, when I have
contemplated this (using, say, tree-sitter), the real problem is that
those parsers simply strip out the comments. Kerneldoc without comments
... doesn't work very well. If there were a parser without those
problems, and which could be made to do the right thing with all of our
weird macro usage, it would certainly be worth considering.

jon
On Mon, 23 Feb 2026, Jonathan Corbet <corbet@lwn.net> wrote:
> Jani Nikula <jani.nikula@linux.intel.com> writes:
>
>> There's always the question, if you're putting a lot of effort into
>> making kernel-doc closer to an actual C parser, why not put all that
>> effort into using and adapting to, you know, an actual C parser?
>
> Not speaking to the current effort but ... in the past, when I have
> contemplated this (using, say, tree-sitter), the real problem is that
> those parsers simply strip out the comments. Kerneldoc without comments
> ... doesn't work very well. If there were a parser without those
> problems, and which could be made to do the right thing with all of our
> weird macro usage, it would certainly be worth considering.

I think e.g. libclang and its Python bindings can be made to work. The
main problems with that are passing proper compiler options (because
it'll need to include stuff to know about types etc. because it is a
proper parser), preprocessing everything is going to take time, you need
to invest a bunch into it to know how slow exactly compared to the
current thing and whether it's prohibitive, and it introduces an extra
dependency.

So yeah, there are definitely tradeoffs there. But it's not like this
constant patching of kernel-doc is exactly burden free either. I don't
know, is it just me, but I'd like to think as a profession we'd be past
writing ad hoc C parsers by now.

BR,
Jani.

--
Jani Nikula, Intel
On Wed, 04 Mar 2026 12:07:45 +0200
Jani Nikula <jani.nikula@linux.intel.com> wrote:
> On Mon, 23 Feb 2026, Jonathan Corbet <corbet@lwn.net> wrote:
> > Jani Nikula <jani.nikula@linux.intel.com> writes:
> >
> >> There's always the question, if you're putting a lot of effort into
> >> making kernel-doc closer to an actual C parser, why not put all that
> >> effort into using and adapting to, you know, an actual C parser?
> >
> > Not speaking to the current effort but ... in the past, when I have
> > contemplated this (using, say, tree-sitter), the real problem is that
> > those parsers simply strip out the comments. Kerneldoc without comments
> > ... doesn't work very well. If there were a parser without those
> > problems, and which could be made to do the right thing with all of our
> > weird macro usage, it would certainly be worth considering.
>
> I think e.g. libclang and its Python bindings can be made to work. The
> main problems with that are passing proper compiler options (because
> it'll need to include stuff to know about types etc. because it is a
> proper parser), preprocessing everything is going to take time, you need
> to invest a bunch into it to know how slow exactly compared to the
> current thing and whether it's prohibitive, and it introduces an extra
> dependency.
>
> So yeah, there are definitely tradeoffs there. But it's not like this
> constant patching of kernel-doc is exactly burden free either.
In my tests with a simple C tokenizer:

https://lore.kernel.org/linux-doc/cover.1773326442.git.mchehab+huawei@kernel.org/

the tokenizer works fine and doesn't slow things down much: it
increases the time to parse the entire kernel tree from 37s to 47s
for man page generation, but it should not change the htmldocs time
much, as currently only ~4 seconds are needed to read and parse the
files pointed to by Documentation kernel-doc tags.
The code can still be cleaned up, as there are still some things
hardcoded in the various dump_* functions that could be better
implemented (*).

The advantage of the approach I'm using is that it allows migrating
gradually to rely on the tokenized code, as the work can be done
incrementally.

(*) for instance, __attribute__ and a couple of other macros are parsed
twice in the dump_struct() logic, in different places.
> I don't
> know, is it just me, but I'd like to think as a profession we'd be past
> writing ad hoc C parsers by now.
Probably not, but I don't think we need a full C parser, as kernel-doc
just needs to understand data types (enum, struct, typedef, union,
vars) and function/macro prototypes.

For that purpose, a tokenizer should be enough.
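To make the "a tokenizer should be enough" point concrete, here is a
hypothetical minimal sketch of such a tokenizer in Python. This is not
the actual code from the series; the regex, token kinds, and the
tokenize() name are all made up for illustration:

```python
# Hypothetical sketch of a minimal C tokenizer for declarations.
# Not the actual kdoc code; names and token classes are illustrative.
import re

# Order matters: multi-character operators must come before the
# single-character class, or "<<" would match as two "<" tokens.
C_TOKEN = re.compile(r"""
    (?P<ident>   [A-Za-z_]\w* )
  | (?P<number>  0[xX][0-9a-fA-F]+ | \d+ )
  | (?P<string>  "(?:\\.|[^"\\])*" )
  | (?P<punct>   \.\.\.|->|<<=?|>>=?|[-+*/%&|^!<>=]=?|[{}()\[\];,.:?~] )
  | (?P<space>   \s+ )
""", re.VERBOSE)

def tokenize(code):
    """Split a C declaration into (kind, text) tokens, skipping spaces."""
    tokens = []
    pos = 0
    while pos < len(code):
        m = C_TOKEN.match(code, pos)
        if not m:
            raise ValueError(f"unexpected character at {pos}: {code[pos]!r}")
        if m.lastgroup != "space":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("int foo(struct bar *b);"))
```

A tokenizer like this never needs to resolve types or run the
preprocessor; prototype-level decisions (where parameters start and end,
which identifiers are macros) can be made on the token stream.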
Now, there is the code that currently lives at:

https://github.com/mchehab/linux/blob/tokenizer-v5/tools/lib/python/kdoc/xforms_lists.py
which contains a list of C/gcc/clang keywords that will
be ignored, like:
__attribute__
static
extern
inline
Together with a sanitized version of the kernel macros it needs
to handle or ignore:
DECLARE_BITMAP
DECLARE_HASHTABLE
__acquires
__init
__exit
struct_group
...
Once we finish cleaning up kdoc_parser.py to rely only
on it for prototype transformations, this will be the only file
that requires changes when new macros start affecting
kernel-doc.
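To illustrate the idea of keeping all such knowledge in one table, a
transform list could look roughly like this. This is a hypothetical
sketch only; the real xforms_lists.py in the series differs, and the
IGNORE_KEYWORDS/MACRO_XFORMS/sanitize() names are made up:

```python
# Illustrative sketch of a single-table macro transform list.
# Not the actual tools/lib/python/kdoc/xforms_lists.py contents.
import re

# Keywords and annotation macros that carry no documentation value.
IGNORE_KEYWORDS = {"__attribute__", "static", "extern", "inline",
                   "__init", "__exit", "__acquires"}

# Macros that need rewriting rather than dropping (hypothetical rule:
# DECLARE_BITMAP(name, bits) documents as an unsigned long array).
MACRO_XFORMS = [
    (re.compile(r"DECLARE_BITMAP\s*\(\s*(\w+)\s*,\s*[^)]*\)"),
     r"unsigned long \1[]"),
]

def sanitize(proto):
    """Apply macro rewrites, then drop ignored keywords."""
    for pattern, repl in MACRO_XFORMS:
        proto = pattern.sub(repl, proto)
    words = [w for w in proto.split() if w not in IGNORE_KEYWORDS]
    return " ".join(words)

print(sanitize("static inline int foo(void)"))   # -> "int foo(void)"
```

Concentrating the transforms in one table means adding support for a new
macro is a one-line change, instead of touching the parser itself.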
As this is complex, and may require manual adjustments, it
is probably better not to try to auto-generate the xforms list
at runtime. A better approach is, IMO, to have C preprocessor
code to help periodically update it, using a target like:

    make kdoc-xforms

that would use either cpp or clang to generate a patch to
update the xforms_lists content after new macros that
affect docs generation are added.
--
Thanks,
Mauro
Jani Nikula <jani.nikula@linux.intel.com> writes:

> So yeah, there are definitely tradeoffs there. But it's not like this
> constant patching of kernel-doc is exactly burden free either. I don't
> know, is it just me, but I'd like to think as a profession we'd be past
> writing ad hoc C parsers by now.

I don't think that having a "real" parser is going to free us from the
need to patch kernel-doc. The kernel uses a weird form of C, and
kernel-doc is expected to evolve as our dialect of the language does.
It *might* make that patching job easier -- that is to be seen -- but it
won't make it go away.

Thanks,

jon
On Wed, 04 Mar 2026 12:07:45 +0200
Jani Nikula <jani.nikula@linux.intel.com> wrote:

> On Mon, 23 Feb 2026, Jonathan Corbet <corbet@lwn.net> wrote:
> > Jani Nikula <jani.nikula@linux.intel.com> writes:
> >
> >> There's always the question, if you're putting a lot of effort into
> >> making kernel-doc closer to an actual C parser, why not put all that
> >> effort into using and adapting to, you know, an actual C parser?
> >
> > Not speaking to the current effort but ... in the past, when I have
> > contemplated this (using, say, tree-sitter), the real problem is that
> > those parsers simply strip out the comments. Kerneldoc without comments
> > ... doesn't work very well. If there were a parser without those
> > problems, and which could be made to do the right thing with all of our
> > weird macro usage, it would certainly be worth considering.
>
> I think e.g. libclang and its Python bindings can be made to work. The
> main problems with that are passing proper compiler options (because
> it'll need to include stuff to know about types etc. because it is a
> proper parser), preprocessing everything is going to take time, you need
> to invest a bunch into it to know how slow exactly compared to the
> current thing and whether it's prohibitive, and it introduces an extra
> dependency.

It is not just that. Assume we're parsing something like this:

    static __always_inline int _raw_read_trylock(rwlock_t *lock)
        __cond_acquires_shared(true, lock);

using cpp (or libclang). We would need to define/undefine 3 symbols:

    #if defined(WARN_CONTEXT_ANALYSIS) && !defined(__CHECKER__) && !defined(__GENKSYMS__)

(in this particular case, the default is OK, but in others, it may not
be). This is by far more complex than just writing logic that converts
the above into:

    static int _raw_read_trylock(rwlock_t *lock);

which is the current kernel-doc approach.
- Using a C preprocessor, we might get a very big prototype - and even
  have arch-specific defines affecting it, as some includes may be
  inside arch/*/include. So, we would need a kernel-doc ".config" file
  with a set of defines that can be hard to maintain.

> So yeah, there are definitely tradeoffs there. But it's not like this
> constant patching of kernel-doc is exactly burden free either. I don't
> know, is it just me, but I'd like to think as a profession we'd be past
> writing ad hoc C parsers by now.

I'd say that the binding logic and the ".config" kernel-doc defines will
be complex to maintain. Maybe more complex than patching kernel-doc plus
a simple C parser, like the one in my test.

> > On Mon, 23 Feb 2026 15:47:00 +0200
> > Jani Nikula <jani.nikula@linux.intel.com> wrote:
> >> There's always the question, if you're putting a lot of effort into
> >> making kernel-doc closer to an actual C parser, why not put all that
> >> effort into using and adapting to, you know, an actual C parser?
> >
> > Playing with this idea, it is not that hard to write an actual C
> > parser - or at least a tokenizer.
>
> Just for the record, I suggested using an existing parser, not going all
> NIH and writing your own.

I know, but I suspect that a simple tokenizer similar to my example
might do the job without any major impact. But yeah, tests are needed.

--
Thanks,
Mauro
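The annotation-stripping transform described in the previous message
(turning the __cond_acquires_shared() prototype into a plain one,
rather than running it through cpp/libclang) could be sketched roughly
like this in Python. The STRIP_MACROS list and the function name are
illustrative, not the actual kernel-doc code:

```python
# Hedged sketch of the "strip annotations instead of preprocessing"
# approach. Macro names below are examples, not an exhaustive list.
import re

# Annotation macros (with or without arguments) to drop from prototypes.
STRIP_MACROS = ["__always_inline", "__cond_acquires_shared", "__must_hold"]

def strip_annotations(proto):
    for macro in STRIP_MACROS:
        # Remove "macro(args)" forms first, then bare "macro".
        proto = re.sub(rf"\b{macro}\s*\([^)]*\)", "", proto)
        proto = re.sub(rf"\b{macro}\b", "", proto)
    proto = re.sub(r"\s+", " ", proto)          # collapse whitespace
    return re.sub(r"\s*;", ";", proto).strip()  # no space before ';'

proto = ("static __always_inline int _raw_read_trylock(rwlock_t *lock) "
         "__cond_acquires_shared(true, lock);")
print(strip_annotations(proto))
# -> static int _raw_read_trylock(rwlock_t *lock);
```

No symbol definitions or include paths are needed: the transform is
purely textual, which is exactly why it stays cheap compared to a real
preprocessor run.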
On Mon, 23 Feb 2026 08:02:11 -0700
Jonathan Corbet <corbet@lwn.net> wrote:

> Jani Nikula <jani.nikula@linux.intel.com> writes:
>
> > There's always the question, if you're putting a lot of effort into
> > making kernel-doc closer to an actual C parser, why not put all that
> > effort into using and adapting to, you know, an actual C parser?
>
> Not speaking to the current effort but ... in the past, when I have
> contemplated this (using, say, tree-sitter), the real problem is that
> those parsers simply strip out the comments. Kerneldoc without comments
> ... doesn't work very well. If there were a parser without those
> problems, and which could be made to do the right thing with all of our
> weird macro usage, it would certainly be worth considering.

A parser is only needed for statement prototypes. There, stripping
comments (after we parse public/private) should be OK. Yet, we want a
Python library to do the parsing, using it only for the things we want
to be parsed.

Assuming we have something like that, we'll still need to teach the
parser about the macro transforms, as those are very Linux specific.

Maybe something like:

    https://github.com/eliben/pycparser

would help (I didn't test it, nor check whether it does what we want).

There is an additional problem: this would add an extra dependency for
the kernel build itself, because kernel-doc can run at kernel build
time.

--
Thanks,
Mauro