Hi Jon,
This series contains several improvements for kernel-doc.
Most of the patches came from v4 of this series:
https://lore.kernel.org/linux-doc/cover.1769867953.git.mchehab+huawei@kernel.org/
But I dropped the unit tests part from this series. I'll be
submitting it as a separate series.
The rationale is that, when I converted kernel-doc from Perl,
the goal was to produce a bug-compatible version.

As anyone who has worked with kernel-doc before is aware, using regex
to handle C input is not great. Instead, we need something closer to
how C statements and declarations are handled.
Yet, to avoid breaking docs, I avoided touching the regex-based
algorithms inside it, with one exception: the struct_group logic was
using very complex regexes that are incompatible with Python's
internal "re" module.

So, I came up with a different approach: NestedMatch. The logic inside
it is meant to properly handle brackets, square brackets and
parentheses, which is closer to what a C lexical parser does. At that
time, I added a TODO about the need to extend that.
The first part of this series does exactly that: it extends
NestedMatch to parse comma-separated arguments, respecting brackets
and parentheses.

It then adds an "alias" for it in the CFunction class. With that,
specifying functions/macros to be handled becomes much easier.
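For illustration only (this is not the actual NestedMatch code in
tools/lib/python/kdoc/kdoc_re.py, just a sketch of the idea), the
nesting-aware argument splitting boils down to:

```python
def split_args(text):
    """Split a comma-separated C argument list, honoring (), [] and {}.

    A simplified sketch: the real code also has to deal with strings,
    character literals and comments.
    """
    args, cur, depth = [], [], 0
    for ch in text:
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        if ch == "," and depth == 0:
            # Only a top-level comma separates arguments
            args.append("".join(cur).strip())
            cur = []
        else:
            cur.append(ch)
    if cur:
        args.append("".join(cur).strip())
    return args

print(split_args("a, f(b, c), { .x = 1, .y = 2 }"))
# ['a', 'f(b, c)', '{ .x = 1, .y = 2 }']
```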
With such infra in place, it moves the transform functions to a
separate file, hopefully making them easier to maintain. As a side
effect, it also makes it easier for other projects to use kernel-doc
(I tested it on QEMU).
Then, it adds support for newer kref annotations.
The remaining patches in this series improve the man page output,
making the generated pages more compatible with other man pages.
---
I wrote several unit tests to check kernel-doc behavior. I intend to
submit them on top of this series later on.
Regards,
Mauro
Mauro Carvalho Chehab (36):
docs: kdoc_re: add support for groups()
docs: kdoc_re: don't go past the end of a line
docs: kdoc_parser: move var transformers to the beginning
docs: kdoc_parser: don't mangle with function defines
docs: kdoc_parser: add functions support for NestedMatch
docs: kdoc_parser: use NestedMatch to handle __attribute__ on
    functions
docs: kdoc_parser: fix variable regexes to work with size_t
docs: kdoc_parser: fix the default_value logic for variables
docs: kdoc_parser: add some debug for variable parsing
docs: kdoc_parser: don't exclude defaults from prototype
docs: kdoc_parser: fix parser to support multi-word types
docs: kdoc_parser: add support for LIST_HEAD
docs: kdoc_re: properly handle strings and escape chars on it
docs: kdoc_re: better show KernRe() at documentation
docs: kdoc_re: don't recompile NestedMatch regex every time
docs: kdoc_re: Change NestedMatch args replacement to \0
docs: kdoc_re: make NestedMatch use KernRe
docs: kdoc_re: add support on NestedMatch for argument replacement
docs: kdoc_parser: better handle struct_group macros
docs: kdoc_re: fix a parse bug on struct page_pool_params
docs: kdoc_re: add a helper class to declare C function matches
docs: kdoc_parser: use the new CFunction class
docs: kdoc_parser: minimize differences with struct_group_tagged
docs: kdoc_parser: move transform lists to a separate file
docs: kdoc_re: don't remove the trailing ";" with NestedMatch
docs: kdoc_re: prevent adding whitespaces on sub replacements
docs: xforms_lists.py: use CFunction to handle all function macros
docs: kdoc_files: allow the caller to use a different xforms class
docs: kdoc_re: Fix NestedMatch.sub() which causes PDF builds to break
docs: kdoc_files: document KernelFiles() ABI
docs: kdoc_output: add optional args to ManOutput class
docs: sphinx-build-wrapper: better handle troff .TH markups
docs: kdoc_output: use a more standard order for .TH on man pages
docs: sphinx-build-wrapper: don't allow "/" on file names
docs: kdoc_output: describe the class init parameters
docs: kdoc_output: pick a better default for modulename
Randy Dunlap (2):
docs: kdoc_parser: ignore context analysis and lock attributes
docs: kdoc_parser: handle struct member macro
    VIRTIO_DECLARE_FEATURES(name)
Documentation/tools/kdoc_parser.rst | 8 +
tools/docs/kernel-doc | 1 -
tools/docs/sphinx-build-wrapper | 9 +-
tools/lib/python/kdoc/kdoc_files.py | 54 +++++-
tools/lib/python/kdoc/kdoc_output.py | 73 ++++++--
tools/lib/python/kdoc/kdoc_parser.py | 183 ++++---------------
tools/lib/python/kdoc/kdoc_re.py | 242 ++++++++++++++++++++------
tools/lib/python/kdoc/xforms_lists.py | 109 ++++++++++++
8 files changed, 451 insertions(+), 228 deletions(-)
create mode 100644 tools/lib/python/kdoc/xforms_lists.py
--
2.52.0
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:

> Hi Jon,
>
> This series contains several improvements for kernel-doc.
>
> Most of the patches came from v4 of this series:
> https://lore.kernel.org/linux-doc/cover.1769867953.git.mchehab+huawei@kernel.org/

So I will freely confess to having lost the plot with this stuff; I'm
now trying to get back up to speed. But, before I dig into this big
series, can you say whether you think it's ready, or whether there's
another one on the horizon that I should wait for?

Thanks,

jon
On Mon, 23 Feb 2026 14:58:53 -0700
Jonathan Corbet <corbet@lwn.net> wrote:

> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
>
> > Hi Jon,
> >
> > This series contains several improvements for kernel-doc.
> >
> > Most of the patches came from v4 of this series:
> > https://lore.kernel.org/linux-doc/cover.1769867953.git.mchehab+huawei@kernel.org/
>
> So I will freely confess to having lost the plot with this stuff; I'm
> now trying to get back up to speed.

Yeah, I kinda figured it out ;-)

> But, before I dig into this big
> series, can you say whether you think it's ready, or whether there's
> another one on the horizon that I should wait for?

There are more things underway, but I need some time to reorganize
the patchset... currently, there are 60+ patches on my pile.

So, instead of merging this patchset, I'll be sending you a smaller
series with the basic stuff, in a way that it would be easier to
review. My plan is to send patches along this week in smaller chunks,
after checking the differences before/after in terms of man/rst/error
output.

--
Thanks,
Mauro
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:

> So, instead of merging this patchset, I'll be sending you
> a smaller series with the basic stuff, in a way that it would
> be easier to review. My plan is to send patches along this week
> on smaller chunks, and after checking the differences before/after,
> in terms of man/rst/error output.

OK... *whew* ... that sounds like a better way to proceed :)

Thanks,

jon
On 2/18/26 2:12 AM, Mauro Carvalho Chehab wrote:
> Hi Jon,
>
> This series contains several improvements for kernel-doc.
>
> Most of the patches came from v4 of this series:
> https://lore.kernel.org/linux-doc/cover.1769867953.git.mchehab+huawei@kernel.org/
>

Mauro,

Is this series available as a git tree/branch?

Or what is the base for applying this series?

Thanks.

--
~Randy
On 2/20/26 5:24 PM, Randy Dunlap wrote:
>
>
> On 2/18/26 2:12 AM, Mauro Carvalho Chehab wrote:
>> Hi Jon,
>>
>> This series contain several improvements for kernel-doc.
>>
>> Most of the patches came from v4 of this series:
>> https://lore.kernel.org/linux-doc/cover.1769867953.git.mchehab+huawei@kernel.org/
>>
>
> Mauro,
> Is this series available as a git tree/branch?
>
> Or what is the base for applying this series?
I applied the series to linux-next-20260220. It applies cleanly
except for one gotcha (using 'patch'):

In patch 25, in the commit description, I had to change the
example before/after diff to have leading "//" ('patch' was
treating those lines as part of the diff).
I am still seeing kernel-doc warnings being duplicated.
It seems like there is a patch for that, but it's not applied yet
and not part of this series...?
The results on linux-next-20260220 look good.
I do have one issue with a test file that I had sent to you (Mauro)
earlier: kdoc-nested.c

In struct super_struct, the fields of nested struct tlv are not
described, but there is no warning about that.

Likewise for the fields of the nested structs header, gen_descr,
and data.
Does this series address when /* private: */ is turned off at the
end of a struct/union? If so, I don't see it working.
See struct nla_policy for where the final struct member should be
public.
kdoc-nested.c test file is attached.
thanks.
--
~Randy
On Wed, 18 Feb 2026, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> As anyone who has worked with kernel-doc before is aware, using regex
> to handle C input is not great. Instead, we need something closer to
> how C statements and declarations are handled.
>
> Yet, to avoid breaking docs, I avoided touching the regex-based
> algorithms inside it, with one exception: the struct_group logic was
> using very complex regexes that are incompatible with Python's
> internal "re" module.
>
> So, I came up with a different approach: NestedMatch. The logic inside
> it is meant to properly handle brackets, square brackets and
> parentheses, which is closer to what a C lexical parser does. At that
> time, I added a TODO about the need to extend that.

There's always the question, if you're putting a lot of effort into
making kernel-doc closer to an actual C parser, why not put all that
effort into using and adapting to, you know, an actual C parser?

BR,
Jani.

--
Jani Nikula, Intel
On Mon, 23 Feb 2026 15:47:00 +0200
Jani Nikula <jani.nikula@linux.intel.com> wrote:
> On Wed, 18 Feb 2026, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > As anyone who has worked with kernel-doc before is aware, using regex
> > to handle C input is not great. Instead, we need something closer to
> > how C statements and declarations are handled.
> >
> > Yet, to avoid breaking docs, I avoided touching the regex-based
> > algorithms inside it, with one exception: the struct_group logic was
> > using very complex regexes that are incompatible with Python's
> > internal "re" module.
> >
> > So, I came up with a different approach: NestedMatch. The logic inside
> > it is meant to properly handle brackets, square brackets and
> > parentheses, which is closer to what a C lexical parser does. At that
> > time, I added a TODO about the need to extend that.
>
> There's always the question, if you're putting a lot of effort into
> making kernel-doc closer to an actual C parser, why not put all that
> effort into using and adapting to, you know, an actual C parser?
Playing with this idea, it is not that hard to write an actual C
parser - or at least a tokenizer. There is already an example
at:
https://docs.python.org/3/library/re.html
I did a quick implementation, and it seems to be able to do its job:
$ ./tokenizer.py ./include/net/netlink.h
1: 0 COMMENT '/* SPDX-License-Identifier: GPL-2.0 */'
2: 0 CPP '#ifndef'
2: 8 ID '__NET_NETLINK_H'
3: 0 CPP '#define'
3: 8 ID '__NET_NETLINK_H'
5: 0 CPP '#include'
5: 9 OP '<'
5: 10 ID 'linux'
5: 15 OP '/'
5: 16 ID 'types'
5: 21 PUNC '.'
5: 22 ID 'h'
5: 23 OP '>'
6: 0 CPP '#include'
6: 9 OP '<'
6: 10 ID 'linux'
6: 15 OP '/'
6: 16 ID 'netlink'
6: 23 PUNC '.'
6: 24 ID 'h'
6: 25 OP '>'
7: 0 CPP '#include'
7: 9 OP '<'
7: 10 ID 'linux'
7: 15 OP '/'
7: 16 ID 'jiffies'
7: 23 PUNC '.'
7: 24 ID 'h'
7: 25 OP '>'
8: 0 CPP '#include'
8: 9 OP '<'
8: 10 ID 'linux'
8: 15 OP '/'
8: 16 ID 'in6'
...
12: 1 COMMENT '/**\n * Standard attribute types to specify validation policy\n */'
13: 0 ENUM 'enum'
13: 5 PUNC '{'
14: 1 ID 'NLA_UNSPEC'
14: 11 PUNC ','
15: 1 ID 'NLA_U8'
15: 7 PUNC ','
16: 1 ID 'NLA_U16'
16: 8 PUNC ','
17: 1 ID 'NLA_U32'
17: 8 PUNC ','
18: 1 ID 'NLA_U64'
18: 8 PUNC ','
19: 1 ID 'NLA_STRING'
19: 11 PUNC ','
20: 1 ID 'NLA_FLAG'
...
41: 0 STRUCT 'struct'
41: 7 ID 'netlink_range_validation'
41: 32 PUNC '{'
42: 1 ID 'u64'
42: 5 ID 'min'
42: 8 PUNC ','
42: 10 ID 'max'
42: 13 PUNC ';'
43: 0 PUNC '}'
43: 1 PUNC ';'
45: 0 STRUCT 'struct'
45: 7 ID 'netlink_range_validation_signed'
45: 39 PUNC '{'
46: 1 ID 's64'
46: 5 ID 'min'
46: 8 PUNC ','
46: 10 ID 'max'
46: 13 PUNC ';'
47: 0 PUNC '}'
47: 1 PUNC ';'
49: 0 ENUM 'enum'
49: 5 ID 'nla_policy_validation'
49: 27 PUNC '{'
50: 1 ID 'NLA_VALIDATE_NONE'
50: 18 PUNC ','
51: 1 ID 'NLA_VALIDATE_RANGE'
51: 19 PUNC ','
52: 1 ID 'NLA_VALIDATE_RANGE_WARN_TOO_LONG'
52: 33 PUNC ','
53: 1 ID 'NLA_VALIDATE_MIN'
53: 17 PUNC ','
54: 1 ID 'NLA_VALIDATE_MAX'
54: 17 PUNC ','
55: 1 ID 'NLA_VALIDATE_MASK'
55: 18 PUNC ','
56: 1 ID 'NLA_VALIDATE_RANGE_PTR'
56: 23 PUNC ','
57: 1 ID 'NLA_VALIDATE_FUNCTION'
57: 22 PUNC ','
58: 0 PUNC '}'
58: 1 PUNC ';'
It sounds doable to use it, and, at least in this example, it
properly picked the IDs.

On the other hand, using it would require lots of changes to
kernel-doc. So, I guess I'll add a tokenizer to kernel-doc, but
we should likely start using it gradually.

Maybe starting with NestedSearch and with the public/private
comment handling (which is currently half-broken).
As a reference, the above was generated with the code below,
which was based on the Python re documentation.
Comments?
---
One side note: right now, we're not using typing in kernel-doc,
nor really following a proper coding style.

I wanted to use it during the conversion, and place consts in
uppercase, as this is currently the best practice, but doing so
while converting from Perl was very annoying. So, I opted to
make things simpler. Now that we have it coded, perhaps it is
time to define a coding style and apply it to kernel-doc.
--
Thanks,
Mauro
#!/usr/bin/env python3

import sys
import re


class Token():
    def __init__(self, type, value, line, column):
        self.type = type
        self.value = value
        self.line = line
        self.column = column


class CTokenizer():
    C_KEYWORDS = {
        "struct", "union", "enum",
    }

    TOKEN_LIST = [
        ("COMMENT",  r"//[^\n]*|/\*[\s\S]*?\*/"),

        ("STRING",   r'"(?:\\.|[^"\\])*"'),
        ("CHAR",     r"'(?:\\.|[^'\\])'"),

        ("NUMBER",   r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
                     r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),

        ("ID",       r"[A-Za-z_][A-Za-z0-9_]*"),

        ("OP",       r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
                     r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),

        ("PUNC",     r"[;,\.\[\]\(\)\{\}]"),

        ("CPP",      r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)"),

        ("HASH",     r"#"),

        ("NEWLINE",  r"\n"),

        ("SKIP",     r"[\s]+"),

        ("MISMATCH", r"."),
    ]

    def __init__(self):
        re_tokens = []

        for name, pattern in self.TOKEN_LIST:
            re_tokens.append(f"(?P<{name}>{pattern})")

        self.re_scanner = re.compile("|".join(re_tokens),
                                     re.MULTILINE | re.DOTALL)

    def tokenize(self, code):
        # Handle continuation lines
        code = re.sub(r"\\\n", "", code)

        line_num = 1
        line_start = 0

        for match in self.re_scanner.finditer(code):
            kind = match.lastgroup
            value = match.group()
            column = match.start() - line_start

            if kind == "NEWLINE":
                line_start = match.end()
                line_num += 1
                continue

            if kind in {"SKIP"}:
                continue

            if kind == "MISMATCH":
                raise RuntimeError(f"Unexpected character {value!r} on line {line_num}")

            if kind == "ID" and value in self.C_KEYWORDS:
                kind = value.upper()

            # For all other tokens we keep the raw string value
            yield Token(kind, value, line_num, column)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: python {sys.argv[0]} <fname>")
        sys.exit(1)

    fname = sys.argv[1]

    try:
        with open(fname, 'r', encoding='utf-8') as file:
            sample = file.read()
    except FileNotFoundError:
        print(f"Error: The file '{fname}' was not found.")
        sys.exit(1)
    except Exception as e:
        print(f"An error occurred while reading the file: {str(e)}")
        sys.exit(1)

    print(f"Tokens from {fname}:")

    for tok in CTokenizer().tokenize(sample):
        print(f"{tok.line:3d}:{tok.column:3d} {tok.type:12} {tok.value!r}")
On Tue, 03 Mar 2026, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> On Mon, 23 Feb 2026 15:47:00 +0200
> Jani Nikula <jani.nikula@linux.intel.com> wrote:
>
>> There's always the question, if you're putting a lot of effort into
>> making kernel-doc closer to an actual C parser, why not put all that
>> effort into using and adapting to, you know, an actual C parser?
>
> Playing with this idea, it is not that hard to write an actual C
> parser - or at least a tokenizer.

Just for the record, I suggested using an existing parser, not going
all NIH and writing your own.

BR,
Jani.

--
Jani Nikula, Intel
> -----Original Message-----
> From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Sent: Tuesday, March 3, 2026 3:53 PM
> To: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Lobakin, Aleksander <aleksander.lobakin@intel.com>; Jonathan
> Corbet <corbet@lwn.net>; Kees Cook <kees@kernel.org>; Mauro Carvalho
> Chehab <mchehab@kernel.org>; intel-wired-lan@lists.osuosl.org; linux-
> doc@vger.kernel.org; linux-hardening@vger.kernel.org; linux-
> kernel@vger.kernel.org; netdev@vger.kernel.org; Gustavo A. R. Silva
> <gustavoars@kernel.org>; Loktionov, Aleksandr
> <aleksandr.loktionov@intel.com>; Randy Dunlap <rdunlap@infradead.org>;
> Shuah Khan <skhan@linuxfoundation.org>
> Subject: Re: [PATCH 00/38] docs: several improvements to kernel-doc
>
> On Mon, 23 Feb 2026 15:47:00 +0200
> Jani Nikula <jani.nikula@linux.intel.com> wrote:
>
> > On Wed, 18 Feb 2026, Mauro Carvalho Chehab
> > <mchehab+huawei@kernel.org> wrote:
> > > As anyone who has worked with kernel-doc before is aware, using
> > > regex to handle C input is not great. Instead, we need something
> > > closer to how C statements and declarations are handled.
> > >
> > > Yet, to avoid breaking docs, I avoided touching the regex-based
> > > algorithms inside it, with one exception: the struct_group logic
> > > was using very complex regexes that are incompatible with Python's
> > > internal "re" module.
> > >
> > > So, I came up with a different approach: NestedMatch. The logic
> > > inside it is meant to properly handle brackets, square brackets
> > > and parentheses, which is closer to what a C lexical parser does.
> > > At that time, I added a TODO about the need to extend that.
> >
> > There's always the question, if you're putting a lot of effort into
> > making kernel-doc closer to an actual C parser, why not put all that
> > effort into using and adapting to, you know, an actual C parser?
>
> Playing with this idea, it is not that hard to write an actual C
> parser - or at least a tokenizer. There is already an example at:
>
> https://docs.python.org/3/library/re.html
>
> I did a quick implementation, and it seems to be able to do its job:
>
> ...
As a hobby C compiler writer, I must say that you need to implement
a C preprocessor first, because the C preprocessor influences/changes
the syntax.

In your tokenizer I see right away that any line which begins with '#'
must be treated as a C preprocessor command without further tokenizing.

But the real pain is C preprocessor substitutions, IMHO.
On Tue, 3 Mar 2026 15:12:30 +0000
"Loktionov, Aleksandr" <aleksandr.loktionov@intel.com> wrote:
> > -----Original Message-----
> > From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > Sent: Tuesday, March 3, 2026 3:53 PM
> > To: Jani Nikula <jani.nikula@linux.intel.com>
> > Cc: Lobakin, Aleksander <aleksander.lobakin@intel.com>; Jonathan
> > Corbet <corbet@lwn.net>; Kees Cook <kees@kernel.org>; Mauro Carvalho
> > Chehab <mchehab@kernel.org>; intel-wired-lan@lists.osuosl.org; linux-
> > doc@vger.kernel.org; linux-hardening@vger.kernel.org; linux-
> > kernel@vger.kernel.org; netdev@vger.kernel.org; Gustavo A. R. Silva
> > <gustavoars@kernel.org>; Loktionov, Aleksandr
> > <aleksandr.loktionov@intel.com>; Randy Dunlap <rdunlap@infradead.org>;
> > Shuah Khan <skhan@linuxfoundation.org>
> > Subject: Re: [PATCH 00/38] docs: several improvements to kernel-doc
> >
> > On Mon, 23 Feb 2026 15:47:00 +0200
> > Jani Nikula <jani.nikula@linux.intel.com> wrote:
> >
> > > There's always the question, if you're putting a lot of effort into
> > > making kernel-doc closer to an actual C parser, why not put all that
> > > effort into using and adapting to, you know, an actual C parser?
> >
> > Playing with this idea, it is not that hard to write an actual C
> > parser - or at least a tokenizer. There is already an example of it
> > at:
> >
> > https://docs.python.org/3/library/re.html
> >
> > I did a quick implementation, and it seems to be able to do its job:
...
>
> As a hobby C compiler writer, I must say that you need to implement a C preprocessor first, because the C preprocessor influences/changes the syntax.
> In your tokenizer I see right away that any line which begins with '#' must be treated as a C preprocessor command without further tokenizing.
Yeah, we may need to implement a C preprocessor parser in the future,
but this will require handling #include, which could be somewhat
complex. It is also tricky to handle conditional preprocessor macros,
as kernel-doc would either require a file with at least some defines
or would have to guess how to evaluate them to produce the right
documentation, as ifdefs interfere with C macros.
For now, I want to solve some specific problems:
- fix the trim_private_members() function, which is meant to handle
  /* private: */ and /* public: */ comments, as it currently has
  bugs when used on nested structs/unions, related to where the
  "private" scope ends;

- properly parse nested structs/unions and properly pick nested
  identifiers;

- detect and replace function arguments when macros with multiple
  arguments are used in the same prototype.
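To illustrate the first problem, the scope tracking that the private
handling needs can be sketched as below. This is my own sketch, not the
kernel-doc code, and member detection is deliberately naive: the point
is only that a visibility marker must be tied to the brace depth where
it appears, so leaving a nested struct restores the outer visibility.

```python
import re

def public_members(body):
    """Yield (depth, line) for member lines that are publicly visible.

    A /* private: */ marker hides members until either a /* public: */
    marker appears or the scope where it was seen is closed.
    """
    private_at = None          # brace depth where /* private: */ was seen
    depth = 0
    for line in body.splitlines():
        if re.search(r"/\*\s*private:", line):
            private_at = depth
        elif re.search(r"/\*\s*public:", line):
            private_at = None
        elif private_at is None and line.strip():
            yield (depth, line.strip())
        depth += line.count("{") - line.count("}")
        if private_at is not None and depth < private_at:
            private_at = None  # left the private scope
```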
Plus, kernel-doc already has a table of transforms to "convert"
the C preprocessor macros that affect documentation into something
that will work.

So, I'm considering starting simple for now, ignoring cpp and
addressing the existing issues.
> But the real pain is C preprocessor substitutions, IMHO.
Agreed. For now, we're using a transforms list inside kernel-doc for
that purpose. So, those macros are manually "evaluated" there, like:
(KernRe(r'DEFINE_DMA_UNMAP_ADDR\s*\(' + struct_args_pattern + r'\)', re.S), r'dma_addr_t \1'),
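With plain re.sub instead of KernRe (and a simplified \w+ in place of
struct_args_pattern), the behavior of such a transform looks like:

```python
import re

proto = "DEFINE_DMA_UNMAP_ADDR(addr);"

# Rewrite the macro invocation into the declaration it stands for.
expanded = re.sub(r"DEFINE_DMA_UNMAP_ADDR\s*\((\w+)\)",
                  r"dma_addr_t \1", proto)
print(expanded)  # dma_addr_t addr;
```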
This works fine in trivial cases, where the argument is just an ID,
but there are cases where we use macros like here:
struct page_pool_params {
	struct_group_tagged(page_pool_params_fast, fast,
		unsigned int order;
		unsigned int pool_size;
		int nid;
		struct device *dev;
		struct napi_struct *napi;
		enum dma_data_direction dma_dir;
		unsigned int max_len;
		unsigned int offset;
	);
	struct_group_tagged(page_pool_params_slow, slow,
		struct net_device *netdev;
		unsigned int queue_idx;
		unsigned int flags;
/* private: used by test code only */
		void (*init_callback)(netmem_ref netmem, void *arg);
		void *init_arg;
	);
};
To handle it, I'm thinking of using something like this(*):

	(CFunction('struct_group_tagged'), r'struct \1 { \3 } \2;')

E.g. teaching kernel-doc that, when:

	struct_group_tagged(a, b, c)

is used, it should convert it into:

	struct a { c } b;

which is basically what this macro does. In other words, hardcoding
kernel-doc with some rules to handle the cases where CPP macros need
to be evaluated. As there aren't many cases where such macros affect
documentation (in lots of cases, just dropping the macros is enough),
such an approach kinda works.
(*) I already wrote a patch for it, but as Jani pointed out, perhaps
    using a tokenizer will make the logic simpler and easier to
    understand/maintain.
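The expansion rule can be prototyped with depth counting (my own
illustration of the intended behavior, not the actual patch; it
handles a single, non-nested invocation and assumes the tag and name
are plain identifiers):

```python
import re

def expand_struct_group_tagged(code):
    """Rewrite struct_group_tagged(TAG, NAME, MEMBERS); into
    struct TAG { MEMBERS } NAME;

    The closing parenthesis is found by depth counting, so MEMBERS
    may itself contain parentheses (e.g. function pointer members).
    """
    m = re.search(r"struct_group_tagged\s*\(", code)
    if not m:
        return code
    start = m.end()
    depth = 1
    for i in range(start, len(code)):
        if code[i] == "(":
            depth += 1
        elif code[i] == ")":
            depth -= 1
            if depth == 0:
                break
    # TAG and NAME are simple IDs, so splitting the first two commas is safe
    tag, name, members = code[start:i].split(",", 2)
    repl = f"struct {tag.strip()} {{ {members.strip()} }} {name.strip()};"
    # Also swallow the trailing ";" of the macro invocation
    rest = code[i + 1:].lstrip()
    rest = rest[1:] if rest.startswith(";") else rest
    return code[:m.start()] + repl + rest
```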
--
Thanks,
Mauro
Jani Nikula <jani.nikula@linux.intel.com> writes:

> There's always the question, if you're putting a lot of effort into
> making kernel-doc closer to an actual C parser, why not put all that
> effort into using and adapting to, you know, an actual C parser?

Not speaking to the current effort but ... in the past, when I have
contemplated this (using, say, tree-sitter), the real problem is that
those parsers simply strip out the comments. Kerneldoc without comments
... doesn't work very well. If there were a parser without those
problems, and which could be made to do the right thing with all of our
weird macro usage, it would certainly be worth considering.

jon
On Mon, 23 Feb 2026, Jonathan Corbet <corbet@lwn.net> wrote:
> Jani Nikula <jani.nikula@linux.intel.com> writes:
>
>> There's always the question, if you're putting a lot of effort into
>> making kernel-doc closer to an actual C parser, why not put all that
>> effort into using and adapting to, you know, an actual C parser?
>
> Not speaking to the current effort but ... in the past, when I have
> contemplated this (using, say, tree-sitter), the real problem is that
> those parsers simply strip out the comments. Kerneldoc without comments
> ... doesn't work very well. If there were a parser without those
> problems, and which could be made to do the right thing with all of our
> weird macro usage, it would certainly be worth considering.

I think e.g. libclang and its Python bindings can be made to work. The
main problems with that are passing proper compiler options (because
it'll need to include stuff to know about types etc. because it is a
proper parser), preprocessing everything is going to take time, you need
to invest a bunch into it to know how slow exactly compared to the
current thing and whether it's prohibitive, and it introduces an extra
dependency.

So yeah, there are definitely tradeoffs there. But it's not like this
constant patching of kernel-doc is exactly burden free either. I don't
know, is it just me, but I'd like to think as a profession we'd be past
writing ad hoc C parsers by now.

BR,
Jani.

--
Jani Nikula, Intel
On Wed, 04 Mar 2026 12:07:45 +0200
Jani Nikula <jani.nikula@linux.intel.com> wrote:
> On Mon, 23 Feb 2026, Jonathan Corbet <corbet@lwn.net> wrote:
> > Jani Nikula <jani.nikula@linux.intel.com> writes:
> >
> >> There's always the question, if you're putting a lot of effort into
> >> making kernel-doc closer to an actual C parser, why not put all that
> >> effort into using and adapting to, you know, an actual C parser?
> >
> > Not speaking to the current effort but ... in the past, when I have
> > contemplated this (using, say, tree-sitter), the real problem is that
> > those parsers simply strip out the comments. Kerneldoc without comments
> > ... doesn't work very well. If there were a parser without those
> > problems, and which could be made to do the right thing with all of our
> > weird macro usage, it would certainly be worth considering.
>
> I think e.g. libclang and its Python bindings can be made to work. The
> main problems with that are passing proper compiler options (because
> it'll need to include stuff to know about types etc. because it is a
> proper parser), preprocessing everything is going to take time, you need
> to invest a bunch into it to know how slow exactly compared to the
> current thing and whether it's prohibitive, and it introduces an extra
> dependency.
>
> So yeah, there are definitely tradeoffs there. But it's not like this
> constant patching of kernel-doc is exactly burden free either.
In my tests with a simple C tokenizer:

https://lore.kernel.org/linux-doc/cover.1773326442.git.mchehab+huawei@kernel.org/

the tokenizer works fine and doesn't slow things down much: it
increases the time to parse the entire kernel tree from 37s to 47s
for man page generation, but it should not change the htmldocs time
much, as currently only ~4 seconds are needed to read and parse the
files pointed to by Documentation kernel-doc tags.
The code can still be cleaned up, as there are still some things
hardcoded in the various dump_* functions that could be better
implemented (*).

The advantage of the approach I'm using is that it allows migrating
gradually to rely on the tokenized code, as the work can be done
incrementally.

(*) for instance, __attribute__ and a couple of other macros are parsed
twice in the dump_struct() logic, in different places.
> I don't
> know, is it just me, but I'd like to think as a profession we'd be past
> writing ad hoc C parsers by now.
Probably not, but I don't think we need a full C parser, as kernel-doc
just needs to understand data types (enum, struct, typedef, union,
vars) and function/macro prototypes.

For that purpose, a tokenizer should be enough.
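To make the "a tokenizer should be enough" point concrete, here is a
hypothetical minimal sketch of such a tokenizer in Python. This is not
the actual code from the series; the regex, token kinds, and the
tokenize() name are all made up for illustration:

```python
# Hypothetical sketch of a minimal C tokenizer for declarations.
# Not the actual kdoc code; names and token classes are illustrative.
import re

# Order matters: multi-character operators must come before the
# single-character class, or "<<" would match as two "<" tokens.
C_TOKEN = re.compile(r"""
    (?P<ident>   [A-Za-z_]\w* )
  | (?P<number>  0[xX][0-9a-fA-F]+ | \d+ )
  | (?P<string>  "(?:\\.|[^"\\])*" )
  | (?P<punct>   \.\.\.|->|<<=?|>>=?|[-+*/%&|^!<>=]=?|[{}()\[\];,.:?~] )
  | (?P<space>   \s+ )
""", re.VERBOSE)

def tokenize(code):
    """Split a C declaration into (kind, text) tokens, skipping spaces."""
    tokens = []
    pos = 0
    while pos < len(code):
        m = C_TOKEN.match(code, pos)
        if not m:
            raise ValueError(f"unexpected character at {pos}: {code[pos]!r}")
        if m.lastgroup != "space":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("int foo(struct bar *b);"))
```

A tokenizer like this never needs to resolve types or run the
preprocessor; prototype-level decisions (where parameters start and end,
which identifiers are macros) can be made on the token stream.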
Now, there is the code that currently lives at:

https://github.com/mchehab/linux/blob/tokenizer-v5/tools/lib/python/kdoc/xforms_lists.py
which contains a list of C/gcc/clang keywords that will
be ignored, like:
__attribute__
static
extern
inline
Together with a sanitized version of the kernel macros it needs
to handle or ignore:
DECLARE_BITMAP
DECLARE_HASHTABLE
__acquires
__init
__exit
struct_group
...
Once we finish cleaning up kdoc_parser.py to rely only
on it for prototype transformations, this will be the only file
that requires changes when new macros start affecting
kernel-doc.
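To illustrate the idea of keeping all such knowledge in one table, a
transform list could look roughly like this. This is a hypothetical
sketch only; the real xforms_lists.py in the series differs, and the
IGNORE_KEYWORDS/MACRO_XFORMS/sanitize() names are made up:

```python
# Illustrative sketch of a single-table macro transform list.
# Not the actual tools/lib/python/kdoc/xforms_lists.py contents.
import re

# Keywords and annotation macros that carry no documentation value.
IGNORE_KEYWORDS = {"__attribute__", "static", "extern", "inline",
                   "__init", "__exit", "__acquires"}

# Macros that need rewriting rather than dropping (hypothetical rule:
# DECLARE_BITMAP(name, bits) documents as an unsigned long array).
MACRO_XFORMS = [
    (re.compile(r"DECLARE_BITMAP\s*\(\s*(\w+)\s*,\s*[^)]*\)"),
     r"unsigned long \1[]"),
]

def sanitize(proto):
    """Apply macro rewrites, then drop ignored keywords."""
    for pattern, repl in MACRO_XFORMS:
        proto = pattern.sub(repl, proto)
    words = [w for w in proto.split() if w not in IGNORE_KEYWORDS]
    return " ".join(words)

print(sanitize("static inline int foo(void)"))   # -> "int foo(void)"
```

Concentrating the transforms in one table means adding support for a new
macro is a one-line change, instead of touching the parser itself.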
As this is complex, and may require manual adjustments, it
is probably better not to try to auto-generate the xforms list
at runtime. A better approach is, IMO, to have C preprocessor
code to help periodically update it, using a target like:

    make kdoc-xforms

that would use either cpp or clang to generate a patch to
update the xforms_lists content after new macros that
affect docs generation are added.
--
Thanks,
Mauro
Jani Nikula <jani.nikula@linux.intel.com> writes:

> So yeah, there are definitely tradeoffs there. But it's not like this
> constant patching of kernel-doc is exactly burden free either. I don't
> know, is it just me, but I'd like to think as a profession we'd be past
> writing ad hoc C parsers by now.

I don't think that having a "real" parser is going to free us from the
need to patch kernel-doc. The kernel uses a weird form of C, and
kernel-doc is expected to evolve as our dialect of the language does.
It *might* make that patching job easier -- that is to be seen -- but it
won't make it go away.

Thanks,

jon
On Wed, 04 Mar 2026 12:07:45 +0200
Jani Nikula <jani.nikula@linux.intel.com> wrote:

> On Mon, 23 Feb 2026, Jonathan Corbet <corbet@lwn.net> wrote:
> > Jani Nikula <jani.nikula@linux.intel.com> writes:
> >
> >> There's always the question, if you're putting a lot of effort into
> >> making kernel-doc closer to an actual C parser, why not put all that
> >> effort into using and adapting to, you know, an actual C parser?
> >
> > Not speaking to the current effort but ... in the past, when I have
> > contemplated this (using, say, tree-sitter), the real problem is that
> > those parsers simply strip out the comments. Kerneldoc without comments
> > ... doesn't work very well. If there were a parser without those
> > problems, and which could be made to do the right thing with all of our
> > weird macro usage, it would certainly be worth considering.
>
> I think e.g. libclang and its Python bindings can be made to work. The
> main problems with that are passing proper compiler options (because
> it'll need to include stuff to know about types etc. because it is a
> proper parser), preprocessing everything is going to take time, you need
> to invest a bunch into it to know how slow exactly compared to the
> current thing and whether it's prohibitive, and it introduces an extra
> dependency.

It is not just that. Assume we're parsing something like this:

    static __always_inline int _raw_read_trylock(rwlock_t *lock)
        __cond_acquires_shared(true, lock);

using cpp (or libclang). We would need to define/undefine 3 symbols:

    #if defined(WARN_CONTEXT_ANALYSIS) && !defined(__CHECKER__) && !defined(__GENKSYMS__)

(in this particular case, the default is OK, but in others, it may not
be). This is by far more complex than just writing logic that converts
the above into:

    static int _raw_read_trylock(rwlock_t *lock);

which is the current kernel-doc approach.
- Using a C preprocessor, we might get a very big prototype - and even
  have arch-specific defines affecting it, as some includes may be
  inside arch/*/include. So, we would need a kernel-doc ".config" file
  with a set of defines that can be hard to maintain.

> So yeah, there are definitely tradeoffs there. But it's not like this
> constant patching of kernel-doc is exactly burden free either. I don't
> know, is it just me, but I'd like to think as a profession we'd be past
> writing ad hoc C parsers by now.

I'd say that the binding logic and the ".config" kernel-doc defines will
be complex to maintain. Maybe more complex than patching kernel-doc plus
a simple C parser, like the one in my test.

> > On Mon, 23 Feb 2026 15:47:00 +0200
> > Jani Nikula <jani.nikula@linux.intel.com> wrote:
> >> There's always the question, if you're putting a lot of effort into
> >> making kernel-doc closer to an actual C parser, why not put all that
> >> effort into using and adapting to, you know, an actual C parser?
> >
> > Playing with this idea, it is not that hard to write an actual C
> > parser - or at least a tokenizer.
>
> Just for the record, I suggested using an existing parser, not going all
> NIH and writing your own.

I know, but I suspect that a simple tokenizer similar to my example
might do the job without any major impact. But yeah, tests are needed.

--
Thanks,
Mauro
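The annotation-stripping transform described in the previous message
(turning the __cond_acquires_shared() prototype into a plain one,
rather than running it through cpp/libclang) could be sketched roughly
like this in Python. The STRIP_MACROS list and the function name are
illustrative, not the actual kernel-doc code:

```python
# Hedged sketch of the "strip annotations instead of preprocessing"
# approach. Macro names below are examples, not an exhaustive list.
import re

# Annotation macros (with or without arguments) to drop from prototypes.
STRIP_MACROS = ["__always_inline", "__cond_acquires_shared", "__must_hold"]

def strip_annotations(proto):
    for macro in STRIP_MACROS:
        # Remove "macro(args)" forms first, then bare "macro".
        proto = re.sub(rf"\b{macro}\s*\([^)]*\)", "", proto)
        proto = re.sub(rf"\b{macro}\b", "", proto)
    proto = re.sub(r"\s+", " ", proto)          # collapse whitespace
    return re.sub(r"\s*;", ";", proto).strip()  # no space before ';'

proto = ("static __always_inline int _raw_read_trylock(rwlock_t *lock) "
         "__cond_acquires_shared(true, lock);")
print(strip_annotations(proto))
# -> static int _raw_read_trylock(rwlock_t *lock);
```

No symbol definitions or include paths are needed: the transform is
purely textual, which is exactly why it stays cheap compared to a real
preprocessor run.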
On Mon, 23 Feb 2026 08:02:11 -0700
Jonathan Corbet <corbet@lwn.net> wrote:

> Jani Nikula <jani.nikula@linux.intel.com> writes:
>
> > There's always the question, if you're putting a lot of effort into
> > making kernel-doc closer to an actual C parser, why not put all that
> > effort into using and adapting to, you know, an actual C parser?
>
> Not speaking to the current effort but ... in the past, when I have
> contemplated this (using, say, tree-sitter), the real problem is that
> those parsers simply strip out the comments. Kerneldoc without comments
> ... doesn't work very well. If there were a parser without those
> problems, and which could be made to do the right thing with all of our
> weird macro usage, it would certainly be worth considering.

A parser is only needed for statement prototypes. There, stripping
comments (after we parse public/private) should be OK. Yet, we want a
Python library to do the parsing, using it only for the things we want
to be parsed.

Assuming we have something like that, we'll still need to teach the
parser about the macro transforms, as those are very Linux specific.

Maybe something like:

    https://github.com/eliben/pycparser

would help (I didn't test it, nor check whether it does what we want).

There is an additional problem: this would add an extra dependency for
the kernel build itself, because kernel-doc can run at kernel build
time.

--
Thanks,
Mauro