[PATCH 08/18] docs: kdoc_parser: fix parser to support multi-word types

Mauro Carvalho Chehab posted 18 patches 1 month, 1 week ago
[PATCH 08/18] docs: kdoc_parser: fix parser to support multi-word types
Posted by Mauro Carvalho Chehab 1 month, 1 week ago
The regular expression currently expects a single word for the
type, but it may be something like  "struct foo".

Add support for it.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
 tools/lib/python/kdoc/kdoc_parser.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index 39ff27d421eb..22a820d33dc8 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -1018,14 +1018,14 @@ class KernelDoc:
 
         default_val = None
 
-        r= KernRe(OPTIONAL_VAR_ATTR + r"[\w_]*\s+(?:\*+)?([\w_]+)\s*[\d\]\[]*\s*(=.*)?")
+        r= KernRe(OPTIONAL_VAR_ATTR + r"\s*[\w_\s]*\s+(?:\*+)?([\w_]+)\s*[\d\]\[]*\s*(=.*)?")
         if r.match(proto):
             if not declaration_name:
                 declaration_name = r.group(1)
 
             default_val = r.group(2)
         else:
-            r= KernRe(OPTIONAL_VAR_ATTR + r"(?:[\w_]*)?\s+(?:\*+)?(?:[\w_]+)\s*[\d\]\[]*\s*(=.*)?")
+            r= KernRe(OPTIONAL_VAR_ATTR + r"(?:[\w_\s]*)?\s+(?:\*+)?(?:[\w_]+)\s*[\d\]\[]*\s*(=.*)?")
 
             if r.match(proto):
                 default_val = r.group(1)
-- 
2.52.0
Re: [PATCH 08/18] docs: kdoc_parser: fix parser to support multi-word types
Posted by Jonathan Corbet 1 month ago
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:

> The regular expression currently expects a single word for the
> type, but it may be something like  "struct foo".
>
> Add support for it.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Acked-by: Randy Dunlap <rdunlap@infradead.org>
> Tested-by: Randy Dunlap <rdunlap@infradead.org>
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
>  tools/lib/python/kdoc/kdoc_parser.py | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
> index 39ff27d421eb..22a820d33dc8 100644
> --- a/tools/lib/python/kdoc/kdoc_parser.py
> +++ b/tools/lib/python/kdoc/kdoc_parser.py
> @@ -1018,14 +1018,14 @@ class KernelDoc:
>  
>          default_val = None
>  
> -        r= KernRe(OPTIONAL_VAR_ATTR + r"[\w_]*\s+(?:\*+)?([\w_]+)\s*[\d\]\[]*\s*(=.*)?")
> +        r= KernRe(OPTIONAL_VAR_ATTR + r"\s*[\w_\s]*\s+(?:\*+)?([\w_]+)\s*[\d\]\[]*\s*(=.*)?")

Just for future reference...I *really* think that the code is improved
by breaking up and commenting gnarly regexes like this.  They are really
unreadable in this form.  (And yes, I know the code has been full of
these forever, but we can always try to make it better :)

Anyway, just grumbling.

Thanks,

jon
Re: [PATCH 08/18] docs: kdoc_parser: fix parser to support multi-word types
Posted by Mauro Carvalho Chehab 1 month ago
On Tue, 03 Mar 2026 10:34:48 -0700
Jonathan Corbet <corbet@lwn.net> wrote:

> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> 
> > The regular expression currently expects a single word for the
> > type, but it may be something like  "struct foo".
> >
> > Add support for it.
> >
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > Acked-by: Randy Dunlap <rdunlap@infradead.org>
> > Tested-by: Randy Dunlap <rdunlap@infradead.org>
> > Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> > ---
> >  tools/lib/python/kdoc/kdoc_parser.py | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
> > index 39ff27d421eb..22a820d33dc8 100644
> > --- a/tools/lib/python/kdoc/kdoc_parser.py
> > +++ b/tools/lib/python/kdoc/kdoc_parser.py
> > @@ -1018,14 +1018,14 @@ class KernelDoc:
> >  
> >          default_val = None
> >  
> > -        r= KernRe(OPTIONAL_VAR_ATTR + r"[\w_]*\s+(?:\*+)?([\w_]+)\s*[\d\]\[]*\s*(=.*)?")
> > +        r= KernRe(OPTIONAL_VAR_ATTR + r"\s*[\w_\s]*\s+(?:\*+)?([\w_]+)\s*[\d\]\[]*\s*(=.*)?")  
> 
> Just for future reference...I *really* think that the code is improved
> by breaking up and commenting gnarly regexes like this.  They are really
> unreadable in this form.  (And yes, I know the code has been full of
> these forever, but we can always try to make it better :)

Heh, you're right: this could be better.

> Anyway, just grumbling.

Heh, if we start using a code like the tokenizer I'm experimenting
here:

	https://lore.kernel.org/linux-doc/20260303155310.5235b367@localhost/

we could probably get rid of regexes in the future, using instead
a loop that would be picking "ID" tokens, e.g. basically we would
have something similar to this completely untested code snippet:

	self.tokenizer = CTokenizer()

	...

	ids = []
	get_default = False

	while kind, value in self.tokenizer(proto):
		if kind == "ID":
			ids.append(value)

		if kind == "OP" and value == "=":
			get_default = True
			break

	if get_default:
		while kind, value in self.tokenizer(proto):
			if kind in ["CHAR", "STRING", "NUMBER"]:
				default_val = value
				break

	declaration_name = ids[-1]
	

Thanks,
Mauro
Re: [PATCH 08/18] docs: kdoc_parser: fix parser to support multi-word types
Posted by Jonathan Corbet 1 month ago
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:

> Heh, if we start using a code like the tokenizer I'm experimenting
> here:
>
> 	https://lore.kernel.org/linux-doc/20260303155310.5235b367@localhost/
>
> we could probably get rid of regexes in the future, using instead
> a loop that would be picking "ID" tokens, e.g. basically we would
> have something similar to this completely untested code snippet:

...which has some appeal, but I will confess to being leery of another
massive kernel-doc thrash; it would be nice if things settled for a bit.

One can always dream :)

jon
Re: [PATCH 08/18] docs: kdoc_parser: fix parser to support multi-word types
Posted by Mauro Carvalho Chehab 1 month ago
On Tue, 03 Mar 2026 13:24:55 -0700
Jonathan Corbet <corbet@lwn.net> wrote:

> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> 
> > Heh, if we start using a code like the tokenizer I'm experimenting
> > here:
> >
> > 	https://lore.kernel.org/linux-doc/20260303155310.5235b367@localhost/
> >
> > we could probably get rid of regexes in the future, using instead
> > a loop that would be picking "ID" tokens, e.g. basically we would
> > have something similar to this completely untested code snippet:  
> 
> ...which has some appeal, but I will confess to being leery of another
> massive kernel-doc thrash; it would be nice if things settled for a bit.

Yeah, I feel your pain. The idea is not to simply rewrite the entire
kernel-doc. Just to use it at the places that have known hard to solve
bugs.

> One can always dream :)

The thing is that there are some issues at kernel-doc that can only be
solved with a better parser, and a plain regex logic won't fix, even
with really complex expressions.

I'm not talking about variable handling like on this specific patch.
For it, the current pure regex approach works fine, at least for the 
cases we already mapped.

However, kernel-doc, even after this series, do a crap job on
several places:

1. Macros used to build structs and function prototypes.

   During the conversion I wrote a half-baked NestedMatch class to be 
   able of properly handling struct_group*(), which is the best
   example of the involved complexity. It works, but it requires parsing
   the code twice.

   Also, It probably will fail with nested struct_group;

2. Nested structs.

   Current logic just transform them on an un-nested struct-like
   pseudo-code to re-use the structs regex-based parser;

3. Nested struct identifiers handling.

   Spec says that if one has:

	struct {
		struct {
			int foo;
		} a;
	} b;

   kernel-doc should document "foo"  as "a.foo", but this is not
   always the case, due to bugs at the parser. So, on some places,
   you need to use "foo"; on others, "a.foo".

4. Public/private handling.

   Code almost works, but when it finds a unmatched private inside
   a nested struct, it will can hide close brackets. This prevents
   fixing (2).

5. Comments strip.

   Code kinda works if you don't touch it, but when trying to solve
   the previous issues, I ended discovering some hidden problems
   related to the way it does.

   (That's basically when I ended opting to try a different approach:
   too much changes to try to live with a plain regex approach,
   plus all stuff needed for NestedMatch to do a better job)

6. Proper whitespace and alignment at the output.

   The current way we parse things mean that little changes end
   mangling with whitespaces, line breaks and/or indentation.

   Perhaps we could use some token-based formatter for man pages
   to properly handle open/close brackets.

   For rst output, we're relying at the C domain to handle it for
   us. Still, perhaps a tokenizer-based approach can just add a
   single whitespace everywhere, which would help us to check
   before/after differences on kernel-doc changes.

I attempted fixing this at the /38 patch series and afterwards
(good news that that we have 18 less patches after you merged
this one), but my current pending patch pile has stlll +40 patches
to address issues and add unit tests.

Among them, I have changes:

- adding support for "\1", "\2", "\3"... group matches at
  NestedMatch sub. It works. The code itself is small, but
  very complex;

- writing a different logic to address comment stripping;

- write a logic to pick struct IDs on nested cases that will
  be using an approach similar to NestedMatch.

- some of those new logic are recursive, which makes them 
  more complex to be understood and tested.

Such approach works, but, as Jani pointed, this ends adding lots
of complexity, and the main reason is that we're acquainted to
use regexes - or perhaps too afraid to handle it on a different
way. Also, they ended introducing extra bugs.

I had to confess that I also a little reticent on trying to use
a tokenizer, as I was afraid that this would require extra libraries 
to have something similar to what flex/bison would be doing. Then,
I realized that perhaps the already internal libraries might just
have what we needed. So, after some research, looking at:

	https://docs.python.org/3/library/re.html#writing-a-tokenizer

and doing some (so far simple) tests, I'm starting to think about
modifying my pending patches to use a code similar to it.

Note that, at this point, I didn't try yet to use the tokenizer,
so for now this is still mostly a brainstorm.

I intend to try to use the tokenizer to handle comments e.g. touching 
the logic at trim_private_members() and see how it behaves. This is
self-contained, checking the results would likely be easy, and I don't
expect big surprises.

Depending on such tests, I may try to modify NestedMatch and my patches
using it to use the same approach.

So, if all ends well, the changes will so far be confined to the code
used by dump_struct().

Let's see.

Thanks,
Mauro