[PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms

Mauro Carvalho Chehab posted 28 patches 3 weeks, 4 days ago
[PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms
Posted by Mauro Carvalho Chehab 3 weeks, 4 days ago
Hi Jon,

Sorry for respinning this one so quickly. It turns out that v1 had some
bugs causing it to fail in several cases. I opted to add extra
patches at the end; this way, it integrates better with kdoc_re.
As part of that, c_lex now outputs the file name when reporting
errors. In that regard, only more serious errors raise an
exception; those are meant to indicate problems in kernel-doc
itself. Parsing errors now use the same warning approach
as kdoc_parser.
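
The warning-vs-exception split can be sketched roughly like this (an
illustrative sketch only; the real hook in c_lex.py is tokenizer_set_log,
but the exact signatures and names below are hypothetical):

```python
import logging

# Hypothetical sketch of the idea; the real c_lex.py code differs.
_logger = logging.getLogger("kdoc")

def tokenizer_set_log(log):
    """Let kdoc_parser inject its logger so both emit warnings the same way."""
    global _logger
    _logger = log

def report(fname, msg, fatal=False):
    # Serious errors indicate kernel-doc bugs and raise an exception;
    # ordinary parse issues become warnings prefixed with the file name.
    if fatal:
        raise ValueError(f"{fname}: {msg}")
    _logger.warning("%s: %s", fname, msg)
```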

I also added a filter to the CTokenizer __str__() logic so the
string conversion drops some odd whitespace and unneeded
";" characters from the output.

Finally, v2 addresses the undefined behavior around private: comment
propagation.

This patch series changes how the kdoc parser handles macro replacements.

Instead of heavily relying on regular expressions that can sometimes
be very complex, it uses a C lexical tokenizer. This ensures that
BEGIN/END blocks on functions and structs are properly handled,
even when nested.
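
A minimal sketch of why this matters (illustrative only, not the series'
CTokenizer/CMatch code): tracking the BEGIN/END depth makes it trivial to
extract a balanced block, which a single regular expression cannot do for
arbitrary nesting:

```python
import re

def body_of_first_block(src: str) -> str:
    """Return the contents of the first balanced {...} block, even when
    inner blocks are nested."""
    depth = 0
    start = None
    for m in re.finditer(r"[{}]", src):
        if m.group() == "{":
            if depth == 0:
                start = m.end()   # body starts right after the outer "{"
            depth += 1
        else:
            depth -= 1
            if depth == 0:
                return src[start:m.start()]
    raise ValueError("unbalanced braces")

src = "struct s { int a; struct { int b; } inner; };"
print(body_of_first_block(src).strip())
```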

Comparing the output before/after the patch series, both man pages and
rst show only:
    - whitespace differences;
    - struct_group macros are now shown as inner anonymous structs,
      as they should be.

Also, I didn't notice any relevant change in the documentation build
time. In that regard, right now, every time a CMatch replacement
rule takes place, it does:

    for each transform:
    - tokenizes the source code;
    - handles CMatch;
    - converts tokens back to a string.

A possible optimization would be to do, instead:

    - tokenizes the source code once;
    - for each transform, handles CMatch;
    - converts tokens back to a string.

For now, I opted not to do it, because:

    - too many changes in a single step;
    - docs build time is taking ~3:30 minutes, which is
      about the same time it took before the changes;
    - there is a very dirty hack inside function_xforms:
         (KernRe(r"_noprof"), ""). This is meant to change
      function prototypes instead of function arguments.

So, if that's OK with you, I would prefer to merge this one first. We can later
optimize kdoc_parser to avoid multiple token <-> string conversions.
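
The two pipeline shapes above can be sketched like this (hypothetical
helper names; the real transforms live in xforms_lists.py and the
stand-ins below are deliberately trivial):

```python
# Sketch of the two orderings; these stand-ins are illustrative only.

def tokenize(src):
    return src.split()          # stand-in for CTokenizer

def detokenize(tokens):
    return " ".join(tokens)     # stand-in for CTokenizer.__str__()

def apply_cmatch(tokens, xform):
    return [xform(t) for t in tokens]   # stand-in for a CMatch rule

def run_per_transform(src, xforms):
    # Current approach: tokenize and detokenize once per transform.
    for xf in xforms:
        src = detokenize(apply_cmatch(tokenize(src), xf))
    return src

def run_tokenize_once(src, xforms):
    # Possible optimization: one tokenize pass, one detokenize pass.
    tokens = tokenize(src)
    for xf in xforms:
        tokens = apply_cmatch(tokens, xf)
    return detokenize(tokens)

xforms = [str.upper, lambda t: t.rstrip(";")]
print(run_per_transform("int a ;", xforms))
print(run_tokenize_once("int a ;", xforms))
```

Both orderings produce the same result; the second just avoids the
repeated token <-> string round trips.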

-

One important aspect of this series is that it introduces unittests
for kernel-doc. I used them a lot during the development of this series,
to ensure that the changes I was making produced the expected
results. The tests are in two separate files that can be executed directly.

Alternatively, there is a run.py script that runs all of them (and
any other Python script named tools/unittests/test_*.py):

  $ tools/unittests/run.py
  test_cmatch:
      TestSearch:
          test_search_acquires_multiple:                               OK
          test_search_acquires_nested_paren:                           OK
          test_search_acquires_simple:                                 OK
          test_search_must_hold:                                       OK
          test_search_must_hold_shared:                                OK
          test_search_no_false_positive:                               OK
          test_search_no_function:                                     OK
          test_search_no_macro_remains:                                OK
      TestSubMultipleMacros:
          test_acquires_multiple:                                      OK
          test_acquires_nested_paren:                                  OK
          test_acquires_simple:                                        OK
          test_mixed_macros:                                           OK
          test_must_hold:                                              OK
          test_must_hold_shared:                                       OK
          test_no_false_positive:                                      OK
          test_no_function:                                            OK
          test_no_macro_remains:                                       OK
      TestSubSimple:
          test_rise_early_greedy:                                      OK
          test_rise_multiple_greedy:                                   OK
          test_strip_multiple_acquires:                                OK
          test_sub_count_parameter:                                    OK
          test_sub_mixed_placeholders:                                 OK
          test_sub_multiple_placeholders:                              OK
          test_sub_no_placeholder:                                     OK
          test_sub_single_placeholder:                                 OK
          test_sub_with_capture:                                       OK
          test_sub_zero_placeholder:                                   OK
      TestSubWithLocalXforms:
          test_functions_with_acquires_and_releases:                   OK
          test_raw_struct_group:                                       OK
          test_raw_struct_group_tagged:                                OK
          test_struct_group:                                           OK
          test_struct_group_attr:                                      OK
          test_struct_group_tagged_with_private:                       OK
          test_struct_kcov:                                            OK
          test_vars_stackdepot:                                        OK
  
  test_tokenizer:
      TestPublicPrivate:
          test_balanced_inner_private:                                 OK
          test_balanced_non_greddy_private:                            OK
          test_balanced_private:                                       OK
          test_no private:                                             OK
          test_unbalanced_inner_private:                               OK
          test_unbalanced_private:                                     OK
          test_unbalanced_struct_group_tagged_with_private:            OK
          test_unbalanced_two_struct_group_tagged_first_with_private:  OK
          test_unbalanced_without_end_of_line:                         OK
      TestTokenizer:
          test_basic_tokens:                                           OK
          test_depth_counters:                                         OK
          test_mismatch_error:                                         OK
  
  
  Ran 47 tests

PS: This series contains the contents of the previous /8 series:
    https://lore.kernel.org/linux-doc/cover.1773074166.git.mchehab+huawei@kernel.org/

---

v2:
  - Added 8 more patches fixing several bugs and modifying unittests
    accordingly:
    - don't raise exceptions when not needed;
    - don't report errors about a missing END if there's no BEGIN
      in the last replacement string;
    - document private scope propagation;
    - some changes to unittests to reflect the current status;
    - addition of two unittests to check the error-raising logic in c_lex.

Mauro Carvalho Chehab (28):
  docs: python: add helpers to run unit tests
  unittests: add a testbench to check public/private kdoc comments
  docs: kdoc: don't add broken comments inside prototypes
  docs: kdoc: properly handle empty enum arguments
  docs: kdoc_re: add a C tokenizer
  docs: kdoc: use tokenizer to handle comments on structs
  docs: kdoc: move C Tokenizer to c_lex module
  unittests: test_private: modify it to use CTokenizer directly
  unittests: test_tokenizer: check if the tokenizer works
  unittests: add a runner to execute all unittests
  docs: kdoc: create a CMatch to match nested C blocks
  tools: unittests: add tests for CMatch
  docs: c_lex: properly implement a sub() method for CMatch
  unittests: test_cmatch: add tests for sub()
  docs: kdoc: replace NestedMatch with CMatch
  docs: kdoc_re: get rid of NestedMatch class
  docs: xforms_lists: handle struct_group directly
  docs: xforms_lists: better evaluate struct_group macros
  docs: c_lex: add support to work with pure name ids
  docs: xforms_lists: use CMatch for all identifiers
  docs: c_lex: add "@" operator
  docs: c_lex: don't exclude an extra token
  docs: c_lex: setup a logger to report tokenizer issues
  docs: unittests: add and adjust tests to check for errors
  docs: c_lex: better handle BEGIN/END at search
  docs: kernel-doc.rst: document private: scope propagation
  docs: c_lex: produce a cleaner str() representation
  unittests: test_cmatch: remove weird stuff from expected results

 Documentation/doc-guide/kernel-doc.rst |   6 +
 Documentation/tools/python.rst         |   2 +
 Documentation/tools/unittest.rst       |  24 +
 tools/lib/python/kdoc/c_lex.py         | 645 +++++++++++++++++++
 tools/lib/python/kdoc/kdoc_parser.py   |  29 +-
 tools/lib/python/kdoc/kdoc_re.py       | 201 ------
 tools/lib/python/kdoc/xforms_lists.py  | 209 +++----
 tools/lib/python/unittest_helper.py    | 353 +++++++++++
 tools/unittests/run.py                 |  17 +
 tools/unittests/test_cmatch.py         | 821 +++++++++++++++++++++++++
 tools/unittests/test_tokenizer.py      | 462 ++++++++++++++
 11 files changed, 2434 insertions(+), 335 deletions(-)
 create mode 100644 Documentation/tools/unittest.rst
 create mode 100644 tools/lib/python/kdoc/c_lex.py
 create mode 100755 tools/lib/python/unittest_helper.py
 create mode 100755 tools/unittests/run.py
 create mode 100755 tools/unittests/test_cmatch.py
 create mode 100755 tools/unittests/test_tokenizer.py

-- 
2.52.0
Re: [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms
Posted by Mauro Carvalho Chehab 3 weeks, 4 days ago
Hi Jon,

On Thu, 12 Mar 2026 15:54:20 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Also, I didn't notice any relevant change on the documentation build
> time. 

After running more tests, I noticed an issue after applying this changeset:

https://lore.kernel.org/linux-doc/2b957decdb6cedab4268f71a166c25b7abdb9a61.1773326442.git.mchehab+huawei@kernel.org/

Basically, a broken kernel-doc like this:

	
	/**
	 * enum dmub_abm_ace_curve_type - ACE curve type.
	 */
	enum dmub_abm_ace_curve_type {
	        /**
	         * ACE curve as defined by the SW layer.
	         */
	        ABM_ACE_CURVE_TYPE__SW = 0,
	        /**
	         * ACE curve as defined by the SW to HW translation interface layer.
	         */
	        ABM_ACE_CURVE_TYPE__SW_IF = 1,
	};

where the inline markups don't have a "@symbol", doesn't parse well. If
you run the current kernel-doc, it produces:

	.. c:enum:: dmub_abm_ace_curve_type

	  ACE curve type.

	.. container:: kernelindent

	    **Constants**

	    ``*/ ABM_ACE_CURVE_TYPE__SW = 0``
	      *undescribed*


	    `` */ ABM_ACE_CURVE_TYPE__SW_IF = 1``
	      *undescribed*

That's because kernel-doc currently drops the "/**" line. My fix patch
above addresses it, but inline comments still confuse enum/struct
detection. To avoid that, we need to strip comments earlier in
dump_struct and dump_enum:

	https://lore.kernel.org/linux-doc/d112804ace83e0ad8496f687977596bb7f091560.1773390831.git.mchehab+huawei@kernel.org/T/#u

After such fix, the output is now:

	.. c:enum:: dmub_abm_ace_curve_type

	  ACE curve type.

	.. container:: kernelindent

	    **Constants**

	    ``ABM_ACE_CURVE_TYPE__SW``
	      *undescribed*


	    ``ABM_ACE_CURVE_TYPE__SW_IF``
	      *undescribed*

which is the expected result when there are no proper inline
kernel-doc markups.

Due to this issue, I ended up adding a 29/28 patch to this series.
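
The idea behind the second fix can be sketched like this (a hypothetical
helper, not the actual dump_struct/dump_enum code):

```python
import re

# Matches // comments and /* ... */ blocks, including /** kernel-doc ones.
# re.DOTALL lets the block pattern span multiple lines.
RE_C_COMMENTS = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)

def strip_comments(proto: str) -> str:
    """Drop C comments so they cannot confuse enum/struct detection."""
    return RE_C_COMMENTS.sub(" ", proto)

src = "enum e { /** doc */ A = 0, /* plain */ B = 1, };"
print(strip_comments(src))
```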

> With that regards, right now, every time a CMatch replacement
> rule takes in place, it does:
> 
>     for each transform:
>     - tokenizes the source code;
>     - handle CMatch;
>     - convert tokens back to a string.
> 
> A possible optimization would be to do, instead:
> 
>     - tokenizes source code;
>     - for each transform handle CMatch;
>     - convert tokens back to a string.
> 
> For now, I opted not do do it, because:
> 
>     - too much changes on a single row;
>     - docs build time is taking ~3:30 minutes, which is
>       about the same time it ws taken before the changes;
>     - there is a very dirty hack inside function_xforms:
>          (KernRe(r"_noprof"), ""). This is meant to change
>       function prototypes instead of function arguments.
> 
> So, if ok for you, I would prefer to merge this one first. We can later
> optimize kdoc_parser to avoid multiple token <-> string conversions.

I did that optimization and it worked fine, so I ended up adding
a 30/28 patch at the end. With that, running kernel-doc before/after
the entire series shows no significant performance change.

	# Current approach
	$ time ./scripts/kernel-doc . -man >original 2>&1

	real    0m37.344s
	user    0m36.447s
	sys     0m0.712s

	# Tokenizer running multiple times (patch 29)
	$ time ./scripts/kernel-doc . -man >before 2>&1

	real    1m32.427s
	user    1m25.377s
	sys     0m1.293s

	# After optimization (patch 30)
	$ time ./scripts/kernel-doc . -man >after 2>&1

	real    0m47.094s
	user    0m46.106s
	sys     0m0.751s

That's 10 seconds slower than before when parsing everything, which affects
make mandocs, but the time spent in the kernel-doc parser during
make htmldocs is minimal: it is about ~4 seconds(*):

	$  run_kdoc.py -none 2>/dev/null
	Checking what files are currently used on documentation...
	Running kernel-doc

	Elapsed time: 0:00:04.348008

(*) the slowest logic when building docs with Sphinx is inside its
    RST parser code.

See the enclosed script for how I measured the parsing time for
existing ".. kernel-doc::" markups inside Documentation.


Thanks,
Mauro

---

This is the run_kdoc.py script I'm using here to pick the same files
as make htmldocs does:

#!/usr/bin/env python3

import os
import re
import subprocess
import sys

from datetime import datetime
from glob import glob

print("Checking what files are currently used on documentation...")

kdoc_files = set()
re_kernel_doc = re.compile(r"^\.\.\s+kernel-doc::\s*(\S+)")

for fname in glob(os.path.join(".", "**"), recursive=True):
    if os.path.isfile(fname) and fname.endswith(".rst"):
        with open(fname, "r", encoding="utf-8") as in_fp:
            data = in_fp.read()

        for line in data.split("\n"):
            match = re_kernel_doc.match(line)
            if match:
                if os.path.isfile(match.group(1)):
                    kdoc_files.add(match.group(1))

if not kdoc_files:
    sys.exit("Directory doesn't contain kernel-doc tags")

cmd = [ "./tools/docs/kernel-doc" ]
cmd += sys.argv[1:]
cmd += sorted(kdoc_files)

print("Running kernel-doc")

start_time = datetime.now()

try:
    result = subprocess.run(cmd, check=True)
except subprocess.CalledProcessError as e:
    print(f"kernel-doc failed: {repr(e)}")

elapsed = datetime.now() - start_time
print(f"\nElapsed time: {elapsed}")
Re: [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms
Posted by Jonathan Corbet 2 weeks, 6 days ago
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:

> Sorry for respamming this one too quick. It ends that v1 had some
> bugs causing it to fail on several cases. I opted to add extra
> patches in the end. This way, it better integrates with kdoc_re.
> As part of it, now c_lex will output file name when reporting
> errors. With that regards, only more serious errors will raise
> an exception. They are meant to indicate problems at kernel-doc
> itself. Parsing errors are now using the same warning approach
> as kdoc_parser.
>
> I also added a filter at Ctokenizer __str__() logic for the
> string convertion to drop some weirdness whitespaces and uneeded
> ";" characters at the output.
>
> Finally, v2 address the undefined behavior about private: comment
> propagation.
>
> This patch series change how kdoc parser handles macro replacements.

So I have at least glanced at the whole series now; other than the few
things I pointed out, I don't find a whole lot to complain about.  I do
worry about adding another 2000 lines to kernel-doc, even if more than
half of them are tests.  But hopefully it leads to a better and more
maintainable system.

We're starting to get late enough in the cycle that I'm a bit leery of
applying this work for 7.1.  What was your thinking on timing?

Thanks,

jon
Re: [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms
Posted by Mauro Carvalho Chehab 2 weeks, 6 days ago
On Tue, 17 Mar 2026 11:12:50 -0600
Jonathan Corbet <corbet@lwn.net> wrote:

> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> 
> > Sorry for respamming this one too quick. It ends that v1 had some
> > bugs causing it to fail on several cases. I opted to add extra
> > patches in the end. This way, it better integrates with kdoc_re.
> > As part of it, now c_lex will output file name when reporting
> > errors. With that regards, only more serious errors will raise
> > an exception. They are meant to indicate problems at kernel-doc
> > itself. Parsing errors are now using the same warning approach
> > as kdoc_parser.
> >
> > I also added a filter at Ctokenizer __str__() logic for the
> > string convertion to drop some weirdness whitespaces and uneeded
> > ";" characters at the output.
> >
> > Finally, v2 address the undefined behavior about private: comment
> > propagation.
> >
> > This patch series change how kdoc parser handles macro replacements.  
> 
> I do worry about adding another 2000 lines to kernel-doc, even if more than
> half of them are tests.  But hopefully it leads to a better and more
> maintainable system.

Net change due to the parser itself was ~650 lines of code, excluding
unittests.

Yet, at least for me, the code looks a lot better with:

        (CMatch("VIRTIO_DECLARE_FEATURES"), r"union { u64 \1; u64 \1_array[VIRTIO_FEATURES_U64S]; }"),
	...
        (CMatch("struct_group"), r"struct { \2+ };"),
        (CMatch("struct_group_attr"), r"struct { \3+ };"),
        (CMatch("struct_group_tagged"), r"struct { \3+ };"),
        (CMatch("__struct_group"), r"struct { \4+ };"),

and other similar stuff than with the previous approach, which used
very complex regular expressions and/or handled it in two
steps. IMO this should be a lot easier to maintain as well.

Also, the unittests will hopefully help to detect regressions
and to test new stuff there without hidden bugs.

> We're starting to get late enough in the cycle that I'm a bit leery of
> applying this work for 7.1.  What was your thinking on timing?

There is something I want to change, but I'm not sure if it will
be in time: getting rid of the ugly code in:

	- rewrite_struct_members
	- create_parameter_list
	- split_struct_proto

I started doing some changes in that regard, but I'm unlikely to
have time for 7.1.

I do have a pile of patches sitting here to be rebased.

Among them, there are unittests for the KernelDoc class.
IMO, it is worth rebasing at least some of them in time for this
merge window. The ones with unittests are independent (or
might eventually require minimal changes). I'd like to have
at least those merged for 7.1.

Among them, there are several tests written by Randy with
regard to some parsing issues in kernel-doc. We should at
least merge the ones that already pass after the tokenizer ;-)


Thanks,
Mauro
Re: [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms
Posted by Mauro Carvalho Chehab 2 weeks, 6 days ago
On Tue, 17 Mar 2026 11:12:50 -0600
Jonathan Corbet <corbet@lwn.net> wrote:

> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> 
> > Sorry for respamming this one too quick. It ends that v1 had some
> > bugs causing it to fail on several cases. I opted to add extra
> > patches in the end. This way, it better integrates with kdoc_re.
> > As part of it, now c_lex will output file name when reporting
> > errors. With that regards, only more serious errors will raise
> > an exception. They are meant to indicate problems at kernel-doc
> > itself. Parsing errors are now using the same warning approach
> > as kdoc_parser.
> >
> > I also added a filter at Ctokenizer __str__() logic for the
> > string convertion to drop some weirdness whitespaces and uneeded
> > ";" characters at the output.
> >
> > Finally, v2 address the undefined behavior about private: comment
> > propagation.
> >
> > This patch series change how kdoc parser handles macro replacements.  
> 
> So I have at least glanced at the whole series now; other than the few
> things I pointed out, I don't find a whole lot to complain about.  I do
> worry about adding another 2000 lines to kernel-doc, even if more than
> half of them are tests.  But hopefully it leads to a better and more
> maintainable system.
> 
> We're starting to get late enough in the cycle that I'm a bit leery of
> applying this work for 7.1.  What was your thinking on timing?

I'm now sending a v3. It basically addresses your points, which
reduced the series to 22 patches.

I'm including the diff between the two versions here, as it may help
in checking what changed. I'll also document the main changes in
patch 00/22.

-- 
Thanks,
Mauro

diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index 95c4dd5afe77..b6d58bd470a9 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -50,7 +50,7 @@ class CToken():
     STRING = 1      #: A string, including quotation marks.
     CHAR = 2        #: A character, including apostophes.
     NUMBER = 3      #: A number.
-    PUNC = 4        #: A puntuation mark: ``;`` / ``,`` / ``.``.
+    PUNC = 4        #: A puntuation mark: / ``,`` / ``.``.
     BEGIN = 5       #: A begin character: ``{`` / ``[`` / ``(``.
     END = 6         #: A end character: ``}`` / ``]`` / ``)``.
     CPP = 7         #: A preprocessor macro.
@@ -62,8 +62,9 @@ class CToken():
     TYPEDEF = 13    #: A ``typedef`` keyword.
     NAME = 14       #: A name. Can be an ID or a type.
     SPACE = 15      #: Any space characters, including new lines
+    ENDSTMT = 16    #: End of an statement (``;``).
 
-    BACKREF = 16  #: Not a valid C sequence, but used at sub regex patterns.
+    BACKREF = 17    #: Not a valid C sequence, but used at sub regex patterns.
 
     MISMATCH = 255  #: an error indicator: should never happen in practice.
 
@@ -104,37 +105,42 @@ class CToken():
 
         return f"CToken(CToken.{name}, {value}, {self.pos}, {self.level})"
 
-#: Tokens to parse C code.
-TOKEN_LIST = [
+#: Regexes to parse C code, transforming it into tokens.
+RE_SCANNER_LIST = [
+    #
+    # Note that \s\S is different than .*, as it also catches \n
+    #
     (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
 
     (CToken.STRING,  r'"(?:\\.|[^"\\])*"'),
     (CToken.CHAR,    r"'(?:\\.|[^'\\])'"),
 
-    (CToken.NUMBER,  r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
-                     r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
+    (CToken.NUMBER,  r"0[xX][\da-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
+                     r"\d+(?:\.\d*)?(?:[eE][+-]?\d+)?[fFlL]*"),
 
-    (CToken.PUNC,    r"[;,\.]"),
+    (CToken.ENDSTMT, r"(?:\s+;|;)"),
+
+    (CToken.PUNC,    r"[,\.]"),
 
     (CToken.BEGIN,   r"[\[\(\{]"),
 
     (CToken.END,     r"[\]\)\}]"),
 
-    (CToken.CPP,     r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)\b"),
+    (CToken.CPP,     r"#\s*(?:define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)\b"),
 
     (CToken.HASH,    r"#"),
 
     (CToken.OP,      r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
-                     r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:|\@"),
+                     r"|&=|\|=|\^=|[=\+\-\*/%<>&\|\^~!\?\:]"),
 
     (CToken.STRUCT,  r"\bstruct\b"),
     (CToken.UNION,   r"\bunion\b"),
     (CToken.ENUM,    r"\benum\b"),
-    (CToken.TYPEDEF, r"\bkinddef\b"),
+    (CToken.TYPEDEF, r"\btypedef\b"),
 
-    (CToken.NAME,    r"[A-Za-z_][A-Za-z0-9_]*"),
+    (CToken.NAME,    r"[A-Za-z_]\w*"),
 
-    (CToken.SPACE,   r"[\s]+"),
+    (CToken.SPACE,   r"\s+"),
 
     (CToken.BACKREF, r"\\\d+"),
 
@@ -142,7 +148,7 @@ TOKEN_LIST = [
 ]
 
 def fill_re_scanner(token_list):
-    """Ancillary routine to convert TOKEN_LIST into a finditer regex"""
+    """Ancillary routine to convert RE_SCANNER_LIST into a finditer regex"""
     re_tokens = []
 
     for kind, pattern in token_list:
@@ -157,7 +163,8 @@ RE_CONT = KernRe(r"\\\n")
 RE_COMMENT_START = KernRe(r'/\*\s*')
 
 #: tokenizer regex. Will be filled at the first CTokenizer usage.
-RE_SCANNER = fill_re_scanner(TOKEN_LIST)
+RE_SCANNER = fill_re_scanner(RE_SCANNER_LIST)
+
 
 class CTokenizer():
     """
@@ -170,10 +177,39 @@ class CTokenizer():
     # This class is inspired and follows the basic concepts of:
     #   https://docs.python.org/3/library/re.html#writing-a-tokenizer
 
+    def __init__(self, source=None, log=None):
+        """
+        Create a regular expression to handle RE_SCANNER_LIST.
+
+        While I generally don't like using regex group naming via:
+            (?P<name>...)
+
+        in this particular case, it makes sense, as we can pick the name
+        when matching a code via RE_SCANNER.
+        """
+
+        self.tokens = []
+
+        if not source:
+            return
+
+        if isinstance(source, list):
+            self.tokens = source
+            return
+
+        #
+        # While we could just use _tokenize directly via interator,
+        # As we'll need to use the tokenizer several times inside kernel-doc
+        # to handle macro transforms, cache the results on a list, as
+        # re-using it is cheaper than having to parse everytime.
+        #
+        for tok in self._tokenize(source):
+            self.tokens.append(tok)
+
     def _tokenize(self, source):
         """
-        Interactor that parses ``source``, splitting it into tokens, as defined
-        at ``self.TOKEN_LIST``.
+        Iterator that parses ``source``, splitting it into tokens, as defined
+        at ``self.RE_SCANNER_LIST``.
 
         The interactor returns a CToken class object.
         """
@@ -214,29 +250,6 @@ class CTokenizer():
             yield CToken(kind, value, pos,
                          brace_level, paren_level, bracket_level)
 
-    def __init__(self, source=None, log=None):
-        """
-        Create a regular expression to handle TOKEN_LIST.
-
-        While I generally don't like using regex group naming via:
-            (?P<name>...)
-
-        in this particular case, it makes sense, as we can pick the name
-        when matching a code via RE_SCANNER.
-        """
-
-        self.tokens = []
-
-        if not source:
-            return
-
-        if isinstance(source, list):
-            self.tokens = source
-            return
-
-        for tok in self._tokenize(source):
-            self.tokens.append(tok)
-
     def __str__(self):
         out=""
         show_stack = [True]
@@ -278,18 +291,10 @@ class CTokenizer():
 
                 # Do some cleanups before ";"
 
-                if (tok.kind == CToken.SPACE and
-                    next_tok.kind == CToken.PUNC and
-                    next_tok.value == ";"):
-
+                if tok.kind == CToken.SPACE and next_tok.kind == CToken.ENDSTMT:
                     continue
 
-                if (tok.kind == CToken.PUNC and
-                    next_tok.kind == CToken.PUNC and
-                    tok.value == ";" and
-                    next_tok.kind == CToken.PUNC and
-                    next_tok.value == ";"):
-
+                if tok.kind == CToken.ENDSTMT and next_tok.kind == tok.kind:
                     continue
 
             out += str(tok.value)
@@ -368,9 +373,13 @@ class CTokenArgs:
 
                 if tok.kind == CToken.BEGIN:
                     inner_level += 1
-                    continue
 
-                if tok.kind == CToken.END:
+                    #
+                    # Discard first begin
+                    #
+                    if not groups_list[0]:
+                        continue
+                elif tok.kind == CToken.END:
                     inner_level -= 1
                     if inner_level < 0:
                         break
@@ -414,7 +423,7 @@ class CTokenArgs:
                 if inner_level < 0:
                     break
 
-            if tok.kind == CToken.PUNC and delim == tok.value:
+            if tok.kind in [CToken.PUNC, CToken.ENDSTMT] and delim == tok.value:
                 pos += 1
                 if self.greedy and pos > self.max_group:
                     pos -= 1
@@ -458,6 +467,7 @@ class CTokenArgs:
 
         return new.tokens
 
+
 class CMatch:
     """
     Finding nested delimiters is hard with regular expressions. It is
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index 3b99740ebed3..f6c4ee3b18c9 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -13,9 +13,8 @@ import sys
 import re
 from pprint import pformat
 
+from kdoc.c_lex import CTokenizer, tokenizer_set_log
 from kdoc.kdoc_re import KernRe
-from kdoc.c_lex import tokenizer_set_log
-from kdoc.c_lex import CTokenizer
 from kdoc.kdoc_item import KdocItem
 
 #
diff --git a/tools/unittests/test_tokenizer.py b/tools/unittests/test_tokenizer.py
index 6a0bd49df72e..5634b4a7283e 100755
--- a/tools/unittests/test_tokenizer.py
+++ b/tools/unittests/test_tokenizer.py
@@ -76,13 +76,13 @@ TESTS_TOKENIZER = {
         "expected": [
             CToken(CToken.NAME, "int"),
             CToken(CToken.NAME, "a"),
-            CToken(CToken.PUNC, ";"),
+            CToken(CToken.ENDSTMT, ";"),
             CToken(CToken.COMMENT, "// comment"),
             CToken(CToken.NAME, "float"),
             CToken(CToken.NAME, "b"),
             CToken(CToken.OP, "="),
             CToken(CToken.NUMBER, "1.23"),
-            CToken(CToken.PUNC, ";"),
+            CToken(CToken.ENDSTMT, ";"),
         ],
     },
 
@@ -103,7 +103,7 @@ TESTS_TOKENIZER = {
             CToken(CToken.BEGIN, "[", brace_level=1, bracket_level=1),
             CToken(CToken.NUMBER, "10", brace_level=1, bracket_level=1),
             CToken(CToken.END, "]", brace_level=1),
-            CToken(CToken.PUNC, ";", brace_level=1),
+            CToken(CToken.ENDSTMT, ";", brace_level=1),
             CToken(CToken.NAME, "func", brace_level=1),
             CToken(CToken.BEGIN, "(", brace_level=1, paren_level=1),
             CToken(CToken.NAME, "a", brace_level=1, paren_level=1),
@@ -117,7 +117,7 @@ TESTS_TOKENIZER = {
             CToken(CToken.NAME, "c", brace_level=1, paren_level=2),
             CToken(CToken.END, ")", brace_level=1, paren_level=1),
             CToken(CToken.END, ")", brace_level=1),
-            CToken(CToken.PUNC, ";", brace_level=1),
+            CToken(CToken.ENDSTMT, ";", brace_level=1),
             CToken(CToken.END, "}"),
         ],
     },