> -----Original Message-----
> From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Sent: Tuesday, March 3, 2026 3:53 PM
> To: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Lobakin, Aleksander <aleksander.lobakin@intel.com>; Jonathan
> Corbet <corbet@lwn.net>; Kees Cook <kees@kernel.org>; Mauro Carvalho
> Chehab <mchehab@kernel.org>; intel-wired-lan@lists.osuosl.org; linux-
> doc@vger.kernel.org; linux-hardening@vger.kernel.org; linux-
> kernel@vger.kernel.org; netdev@vger.kernel.org; Gustavo A. R. Silva
> <gustavoars@kernel.org>; Loktionov, Aleksandr
> <aleksandr.loktionov@intel.com>; Randy Dunlap <rdunlap@infradead.org>;
> Shuah Khan <skhan@linuxfoundation.org>
> Subject: Re: [PATCH 00/38] docs: several improvements to kernel-doc
>
> On Mon, 23 Feb 2026 15:47:00 +0200
> Jani Nikula <jani.nikula@linux.intel.com> wrote:
>
> > On Wed, 18 Feb 2026, Mauro Carvalho Chehab
> > <mchehab+huawei@kernel.org> wrote:
> > > As anyone who has worked with kernel-doc before is aware, using
> > > regexes to handle C input is not great. Instead, we need something
> > > closer to how C statements and declarations are handled.
> > >
> > > Yet, to avoid breaking docs, I avoided touching the regex-based
> > > algorithms inside it, with one exception: the struct_group logic
> > > was using very complex regexes that are incompatible with Python's
> > > internal "re" module.
> > >
> > > So, I came up with a different approach: NestedMatch. The logic
> > > inside it is meant to properly handle braces, square brackets and
> > > parentheses, which is closer to what a C lexical parser does. At
> > > that time, I added a TODO about the need to extend that.
> >
> > There's always the question, if you're putting a lot of effort into
> > making kernel-doc closer to an actual C parser, why not put all that
> > effort into using and adapting to, you know, an actual C parser?
>
> Playing with this idea, it is not that hard to write an actual C
> parser - or at least a tokenizer. There is already an example of it
> at:
>
> https://docs.python.org/3/library/re.html
>
> I did a quick implementation, and it seems to be able to do its job:
>
> $ ./tokenizer.py ./include/net/netlink.h
> 1: 0 COMMENT '/* SPDX-License-Identifier: GPL-2.0 */'
> 2: 0 CPP '#ifndef'
> 2: 8 ID '__NET_NETLINK_H'
> 3: 0 CPP '#define'
> 3: 8 ID '__NET_NETLINK_H'
> 5: 0 CPP '#include'
> 5: 9 OP '<'
> 5: 10 ID 'linux'
> 5: 15 OP '/'
> 5: 16 ID 'types'
> 5: 21 PUNC '.'
> 5: 22 ID 'h'
> 5: 23 OP '>'
> 6: 0 CPP '#include'
> 6: 9 OP '<'
> 6: 10 ID 'linux'
> 6: 15 OP '/'
> 6: 16 ID 'netlink'
> 6: 23 PUNC '.'
> 6: 24 ID 'h'
> 6: 25 OP '>'
> 7: 0 CPP '#include'
> 7: 9 OP '<'
> 7: 10 ID 'linux'
> 7: 15 OP '/'
> 7: 16 ID 'jiffies'
> 7: 23 PUNC '.'
> 7: 24 ID 'h'
> 7: 25 OP '>'
> 8: 0 CPP '#include'
> 8: 9 OP '<'
> 8: 10 ID 'linux'
> 8: 15 OP '/'
> 8: 16 ID 'in6'
> ...
> 12: 1 COMMENT '/**\n * Standard attribute types to specify validation policy\n */'
> 13: 0 ENUM 'enum'
> 13: 5 PUNC '{'
> 14: 1 ID 'NLA_UNSPEC'
> 14: 11 PUNC ','
> 15: 1 ID 'NLA_U8'
> 15: 7 PUNC ','
> 16: 1 ID 'NLA_U16'
> 16: 8 PUNC ','
> 17: 1 ID 'NLA_U32'
> 17: 8 PUNC ','
> 18: 1 ID 'NLA_U64'
> 18: 8 PUNC ','
> 19: 1 ID 'NLA_STRING'
> 19: 11 PUNC ','
> 20: 1 ID 'NLA_FLAG'
> ...
> 41: 0 STRUCT 'struct'
> 41: 7 ID 'netlink_range_validation'
> 41: 32 PUNC '{'
> 42: 1 ID 'u64'
> 42: 5 ID 'min'
> 42: 8 PUNC ','
> 42: 10 ID 'max'
> 42: 13 PUNC ';'
> 43: 0 PUNC '}'
> 43: 1 PUNC ';'
> 45: 0 STRUCT 'struct'
> 45: 7 ID 'netlink_range_validation_signed'
> 45: 39 PUNC '{'
> 46: 1 ID 's64'
> 46: 5 ID 'min'
> 46: 8 PUNC ','
> 46: 10 ID 'max'
> 46: 13 PUNC ';'
> 47: 0 PUNC '}'
> 47: 1 PUNC ';'
> 49: 0 ENUM 'enum'
> 49: 5 ID 'nla_policy_validation'
> 49: 27 PUNC '{'
> 50: 1 ID 'NLA_VALIDATE_NONE'
> 50: 18 PUNC ','
> 51: 1 ID 'NLA_VALIDATE_RANGE'
> 51: 19 PUNC ','
> 52: 1 ID 'NLA_VALIDATE_RANGE_WARN_TOO_LONG'
> 52: 33 PUNC ','
> 53: 1 ID 'NLA_VALIDATE_MIN'
> 53: 17 PUNC ','
> 54: 1 ID 'NLA_VALIDATE_MAX'
> 54: 17 PUNC ','
> 55: 1 ID 'NLA_VALIDATE_MASK'
> 55: 18 PUNC ','
> 56: 1 ID 'NLA_VALIDATE_RANGE_PTR'
> 56: 23 PUNC ','
> 57: 1 ID 'NLA_VALIDATE_FUNCTION'
> 57: 22 PUNC ','
> 58: 0 PUNC '}'
> 58: 1 PUNC ';'
>
> It sounds doable to use it and, at least in this example, it
> properly picked the IDs.
>
> On the other hand, using it would require lots of changes to
> kernel-doc. So, I guess I'll add a tokenizer to kernel-doc, but we
> should likely start using it gradually.
>
> Maybe starting with NestedMatch and with public/private comment
> handling (which is currently half-broken).
>
> As a reference, the above was generated with the code below, which was
> based on the Python re documentation.
>
> Comments?
>
> ---
>
> One side note: right now, we're not using typing in kernel-doc, nor
> really following a proper coding style.
>
> I wanted to use typing during the conversion, and to place consts in
> uppercase, as this is currently the best practice, but doing so
> while converting from Perl was very annoying. So, I opted to make
> things simpler. Now that we have it coded, perhaps it is time to
> define a coding style and apply it to kernel-doc.
>
> --
> Thanks,
> Mauro
>
> #!/usr/bin/env python3
>
> import sys
> import re
>
> class Token():
>     def __init__(self, type, value, line, column):
>         self.type = type
>         self.value = value
>         self.line = line
>         self.column = column
>
> class CTokenizer():
>     C_KEYWORDS = {
>         "struct", "union", "enum",
>     }
>
>     TOKEN_LIST = [
>         ("COMMENT", r"//[^\n]*|/\*[\s\S]*?\*/"),
>
>         ("STRING", r'"(?:\\.|[^"\\])*"'),
>         ("CHAR", r"'(?:\\.|[^'\\])'"),
>
>         ("NUMBER", r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
>                    r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
>
>         ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
>
>         ("OP", r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>"
>                r"|\+=|\-=|\*=|/=|%=|&=|\|=|\^="
>                r"|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),
>
>         ("PUNC", r"[;,\.\[\]\(\)\{\}]"),
>
>         ("CPP", r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)"),
>
>         ("HASH", r"#"),
>
>         ("NEWLINE", r"\n"),
>
>         ("SKIP", r"[\s]+"),
>
>         ("MISMATCH", r"."),
>     ]
>
>     def __init__(self):
>         re_tokens = []
>
>         for name, pattern in self.TOKEN_LIST:
>             re_tokens.append(f"(?P<{name}>{pattern})")
>
>         self.re_scanner = re.compile("|".join(re_tokens),
>                                      re.MULTILINE | re.DOTALL)
>
>     def tokenize(self, code):
>         # Handle continuation lines
>         code = re.sub(r"\\\n", "", code)
>
>         line_num = 1
>         line_start = 0
>
>         for match in self.re_scanner.finditer(code):
>             kind = match.lastgroup
>             value = match.group()
>             column = match.start() - line_start
>
>             if kind == "NEWLINE":
>                 line_start = match.end()
>                 line_num += 1
>                 continue
>
>             if kind in {"SKIP"}:
>                 continue
>
>             if kind == "MISMATCH":
>                 raise RuntimeError(f"Unexpected character {value!r} "
>                                    f"on line {line_num}")
>
>             if kind == "ID" and value in self.C_KEYWORDS:
>                 kind = value.upper()
>
>             # For all other tokens we keep the raw string value
>             yield Token(kind, value, line_num, column)
>
> if __name__ == "__main__":
>     if len(sys.argv) != 2:
>         print(f"Usage: python {sys.argv[0]} <fname>")
>         sys.exit(1)
>
>     fname = sys.argv[1]
>
>     try:
>         with open(fname, 'r', encoding='utf-8') as file:
>             sample = file.read()
>     except FileNotFoundError:
>         print(f"Error: The file '{fname}' was not found.")
>         sys.exit(1)
>     except Exception as e:
>         print(f"An error occurred while reading the file: {str(e)}")
>         sys.exit(1)
>
>     print(f"Tokens from {fname}:")
>
>     for tok in CTokenizer().tokenize(sample):
>         print(f"{tok.line:3d}:{tok.column:3d} {tok.type:12} {tok.value!r}")
As a hobby C compiler writer, I must say that you need to implement the
C preprocessor first, because the preprocessor influences/changes the
syntax.
In your tokenizer I see right away that any line beginning with '#' must
be treated as a C preprocessor command, without tokenizing it further.
But the real pain is C preprocessor substitutions, IMHO.
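A minimal sketch of that whole-line idea (an illustration only, not code from the tokenizer above): once "\\\n" continuation lines are joined, a preprocessor directive occupies exactly one line, so a single pattern can capture the entire line as one CPP token instead of tokenizing its body:

```python
import re

# Illustration only (not code from the thread): after continuation
# lines are joined, each directive fits on one line, so we can swallow
# it whole rather than tokenizing its body.
CPP_LINE = re.compile(r"^[ \t]*#[^\n]*", re.MULTILINE)

code = "#define NLA_F_NESTED (1 << 15)\nstruct nlattr;\n"
print(CPP_LINE.findall(code))   # → ['#define NLA_F_NESTED (1 << 15)']
```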
On Tue, 3 Mar 2026 15:12:30 +0000
"Loktionov, Aleksandr" <aleksandr.loktionov@intel.com> wrote:
> >
> > On Mon, 23 Feb 2026 15:47:00 +0200
> > Jani Nikula <jani.nikula@linux.intel.com> wrote:
> >
> > > There's always the question, if you're putting a lot of effort into
> > > making kernel-doc closer to an actual C parser, why not put all that
> > > effort into using and adapting to, you know, an actual C parser?
> >
> > Playing with this idea, it is not that hard to write an actual C
> > parser - or at least a tokenizer. There is already an example of it
> > at:
> >
> > https://docs.python.org/3/library/re.html
> >
> > I did a quick implementation, and it seems to be able to do its job:
...
>
> As a hobby C compiler writer, I must say that you need to implement
> the C preprocessor first, because the preprocessor influences/changes
> the syntax.
> In your tokenizer I see right away that any line beginning with '#'
> must be treated as a C preprocessor command, without tokenizing it
> further.
Yeah, we may need to implement a C preprocessor parser in the future,
but this will require handling #include, which could be somewhat
complex. It is also tricky to handle conditional preprocessor macros,
as kernel-doc would either require a file with at least some defines,
or it would have to guess how to evaluate them to produce the right
documentation, as ifdefs interfere with C macros.
For now, I want to solve some specific problems:

- fix the trim_private_members() function, which is meant to handle
  /* private: */ and /* public: */ comments, as it currently has bugs
  when used on nested structs/unions, related to where the "private"
  scope finishes;
- properly parse nested structs/unions and properly pick nested
  identifiers;
- detect and replace function arguments when macros with multiple
  arguments are used in the same prototype.
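As a hedged sketch of the first item (an illustration, not the actual trim_private_members() code): with a token stream, brace depth tells exactly where a /* private: */ scope ends, so a marker inside a nested struct/union only hides the members of that block:

```python
import re

# Hedged sketch (illustration, not kernel-doc's trim_private_members()):
# track brace depth over a token stream; a /* private: */ scope ends
# either at a /* public: */ marker at the same depth or when the block
# that contained it closes.
TOKEN = re.compile(r"/\*.*?\*/|[{}]|[^\s{}]+", re.S)

def strip_private(code):
    depth, private_at, kept = 0, None, []
    for tok in TOKEN.findall(code):
        if tok == "{":
            depth += 1
        elif tok == "}":
            # closing the block that contained /* private: */ ends the scope
            if private_at is not None and depth == private_at:
                private_at = None
            depth -= 1
        elif tok.startswith("/*"):
            if "private:" in tok:
                private_at = depth
            elif "public:" in tok and depth == private_at:
                private_at = None
        elif private_at is None:
            kept.append(tok)
    return kept

code = ("struct s { int a; /* private: */ int b; "
        "struct { int c; } n; /* public: */ int d; };")
print(strip_private(code))   # → ['struct', 's', 'int', 'a;', 'int', 'd;', ';']
```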
Plus, kernel-doc already has a table of transforms to "convert" the C
preprocessor macros that affect documentation into something that will
work.

So, I'm considering starting simple: ignoring cpp for now and
addressing the existing issues.
> But the real pain is C preprocessor substitutions, IMHO.
Agreed. For now, we're using a transforms list inside kernel-doc for
that purpose. Those macros are manually "evaluated" there, like:
(KernRe(r'DEFINE_DMA_UNMAP_ADDR\s*\(' + struct_args_pattern + r'\)', re.S), r'dma_addr_t \1'),
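A hedged illustration of how such a transform entry behaves (KernRe and struct_args_pattern are kernel-doc internals; a plain re.sub with a simplified argument pattern stands in for them here):

```python
import re

# Illustration only: a simplified stand-in for the kernel-doc
# transforms entry above.  The regex captures the macro argument and
# the replacement rewrites the member as a plain dma_addr_t field.
rule = (re.compile(r"DEFINE_DMA_UNMAP_ADDR\s*\(\s*(\w+)\s*\)"),
        r"dma_addr_t \1")

src = "struct ring { DEFINE_DMA_UNMAP_ADDR(addr); };"
print(rule[0].sub(rule[1], src))   # → struct ring { dma_addr_t addr; };
```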
This works fine in trivial cases, where the argument is just an ID,
but there are cases where we use macros, like here:
struct page_pool_params {
struct_group_tagged(page_pool_params_fast, fast,
unsigned int order;
unsigned int pool_size;
int nid;
struct device *dev;
struct napi_struct *napi;
enum dma_data_direction dma_dir;
unsigned int max_len;
unsigned int offset;
);
struct_group_tagged(page_pool_params_slow, slow,
struct net_device *netdev;
unsigned int queue_idx;
unsigned int flags;
/* private: used by test code only */
void (*init_callback)(netmem_ref netmem, void *arg);
void *init_arg;
);
};
To handle it, I'm thinking of using something like this (*):

    (CFunction('struct_group_tagged'), r'struct \1 { \3 } \2;')

E.g. teaching kernel-doc that, when:

    struct_group_tagged(a, b, c)

is used, it should convert it into:

    struct a { c } b;

which is basically what this macro does. In other words, hardcoding
kernel-doc with some rules to handle the cases where CPP macros need
to be evaluated. As there aren't many cases where such macros affect
documentation (in lots of cases, just dropping the macro is enough),
such an approach kinda works.
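The expansion above could be sketched as follows (a hedged illustration; the actual CFunction implementation may differ). Only the first two top-level commas split the arguments into tag and name; deeper commas, e.g. inside function-pointer members, stay in the member list:

```python
import re

# Hedged illustration of the proposed rule: expand
# struct_group_tagged(TAG, NAME, MEMBERS) into "struct TAG { MEMBERS } NAME;".
def expand_struct_group_tagged(code):
    out, i = [], 0
    while True:
        m = re.search(r"struct_group_tagged\s*\(", code[i:])
        if not m:
            out.append(code[i:])
            break
        out.append(code[i:i + m.start()])
        j = i + m.end()              # first char after '('
        depth, args, last = 1, [], j
        while depth:
            c = code[j]
            if c == "(":
                depth += 1
            elif c == ")":
                depth -= 1
            elif c == "," and depth == 1 and len(args) < 2:
                # only the first two top-level commas split arguments
                args.append(code[last:j].strip())
                last = j + 1
            j += 1
        args.append(code[last:j - 1].strip())
        tag, name, members = args
        out.append(f"struct {tag} {{ {members} }} {name};")
        i = j
        if i < len(code) and code[i] == ";":
            i += 1                   # the macro's own ';' is now redundant
    return "".join(out)

src = "struct_group_tagged(fast, f, int a; void (*cb)(int x, int y););"
print(expand_struct_group_tagged(src))
# → struct fast { int a; void (*cb)(int x, int y); } f;
```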
(*) I already wrote a patch for it but, as Jani pointed out, perhaps
    using a tokenizer will make the logic simpler and easier to
    understand and maintain.
--
Thanks,
Mauro