From nobody Wed Apr  8 12:41:18 2026
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org
 [10.30.226.201])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id D82DA3F8DEB;
	Tue, 17 Mar 2026 18:09:49 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=10.30.226.201
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1773770989; cv=none;
 b=fxrre951u5rqmo0gpO3mWtBi6/kQf02QVwCHR+m9Al3WYLiVeP7HwK3/g8OOtiz+Fti8mKfPV3uL2Z2QkzraS74pRYDge41DO6BqK73upWK9rX2dhZ3ND6guQXKBbgUiQUyE56+2sgN+QbOIykq58f3ya9pzwvU8zkfAxT+iSIU=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1773770989; c=relaxed/simple;
	bh=LUcFzGUiEZD4bIprs6ndA9LAE1KebeM/BqrpXPUV0iY=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version:Content-Type;
 b=QG4G1ksoVSRUD/bwvDtjMiJAVmGr5QdG/azEgBErvwHh61lETdsnAu1GePoGo21/E4YqPtIiEeZt1NdP6JYJ6FRjPRzNLxMgpMfXk7Rqz50ZXxnhI+QtWOBg+yJxrtSy/w0YUsBBQVhrbw8VdJgF8YKiK8HTpKO3GZpBQK7iOis=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org
 header.b=oL3beYMF; arc=none smtp.client-ip=10.30.226.201
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org
 header.b="oL3beYMF"
Received: by smtp.kernel.org (Postfix) with ESMTPSA id 94438C2BCB1;
	Tue, 17 Mar 2026 18:09:49 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org;
	s=k20201202; t=1773770989;
	bh=LUcFzGUiEZD4bIprs6ndA9LAE1KebeM/BqrpXPUV0iY=;
	h=From:To:Cc:Subject:Date:In-Reply-To:References:From;
	b=oL3beYMFmzSXQZspVzBMqlm4aB81VpvnHsOx0FQEn5NLSlIbPiDX4X/rEFDnAr6zh
	 uSy68UX27KW2xxaZ8AOG4zI6ah2EU9FZOiDMmCtOZWGUwI7k9Mp9l8SHqLUbKrF13t
	 lfU4fnxfPiBapM8+sN52VaAeE6kSsHp9WnG8lZ7VdxYfSQ7UupnuYT9TkjgazcGo4b
	 Ntg+/W7azhNpbYacjrDG5WhVAAF6mW2R5xmF79SCJ7/ROwoKsNgMprtLODX50iZpSO
	 wo7Rpo8N1TaL6yjjxGRqOACg7FQJb1HZnXAsStpgx/8W7bzrmevGfVu732wdkgjLQw
	 6cCjwRmza6HFg==
Received: from mchehab by mail.kernel.org with local (Exim 4.99.1)
	(envelope-from <mchehab+huawei@kernel.org>)
	id 1w2YrT-0000000H5Sz-309C;
	Tue, 17 Mar 2026 19:09:47 +0100
From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: Jonathan Corbet <corbet@lwn.net>,
	Linux Doc Mailing List <linux-doc@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>,
	linux-hardening@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v3 12/22] docs: c_lex: properly implement a sub() method for
 CMatch
Date: Tue, 17 Mar 2026 19:09:32 +0100
Message-ID: 
 <dbc45b86db18783289d94cfdbba4b72792c47929.1773770483.git.mchehab+huawei@kernel.org>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <cover.1773770483.git.mchehab+huawei@kernel.org>
References: <cover.1773770483.git.mchehab+huawei@kernel.org>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Sender: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

Implement a sub() method that does what is expected of it, parsing
backref arguments such as \0, \1, \2, ...

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
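Note (not part of the commit): the backref semantics described above can
be sketched with plain strings. This is only an illustration under
simplifying assumptions: it does not use CTokenizer or CMatch, it works
on characters instead of tokens, it only handles "(" with "," as the
separator, and the helper names split_args(), expand() and sub_call()
are hypothetical, not names from c_lex.py.

```python
import re


def split_args(body, max_parts=0):
    """Split body on top-level commas; nested (), [] and {} are kept."""
    parts, depth, start = [], 0, 0
    for pos, ch in enumerate(body):
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        elif ch == "," and depth == 0:
            # Greedy mode: the last part keeps the remaining commas
            if max_parts and len(parts) == max_parts - 1:
                break
            parts.append(body[start:pos].strip())
            start = pos + 1
    parts.append(body[start:].strip())
    return parts


def expand(template, body):
    """Expand \\0, \\1, ... and a greedy highest backref such as \\2+."""
    refs = re.findall(r"\\(\d+)(\+?)", template)
    greedy = [int(n) for n, plus in refs if plus]
    highest = max((int(n) for n, _ in refs), default=0)
    if greedy and set(greedy) != {highest}:
        raise ValueError("only the highest backref may be greedy")
    # Group 0 is the whole body with the outer delimiters stripped
    groups = [body.strip()] + split_args(body, highest if greedy else 0)
    if highest >= len(groups):
        raise ValueError(f"backref {highest} has no matching argument")
    return re.sub(r"\\(\d+)\+?", lambda m: groups[int(m.group(1))],
                  template)


def sub_call(name, template, source):
    """Replace every balanced name(...) in source using template."""
    out, pos = [], 0
    pattern = re.compile(r"\b" + re.escape(name) + r"\s*\(")
    while True:
        m = pattern.search(source, pos)
        if not m:
            break
        depth, end = 1, m.end()
        while end < len(source) and depth:
            depth += {"(": 1, ")": -1}.get(source[end], 0)
            end += 1
        if depth:
            break  # unpaired delimiter: leave the rest untouched
        out.append(source[pos:m.start()])
        out.append(expand(template, source[m.end():end - 1]))
        pos = end
    out.append(source[pos:])
    return "".join(out)


print(sub_call("FOO", r"bar(\2, \1)", "x = FOO(a, g(b, c)) + 1;"))
print(sub_call("struct_group", r"\2+",
               "struct_group(hdr, u8 a; u16 b, c;)"))
```

In the second call the "+" on \2 makes the last group keep the comma in
"u16 b, c;", which is the struct_group ``MEMBERS...`` case that
motivates the greedy flag in CTokenArgs.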
 tools/lib/python/kdoc/c_lex.py | 272 +++++++++++++++++++++++++++++++--
 1 file changed, 259 insertions(+), 13 deletions(-)

diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index 5da472734ff7..20e50ff0ecd5 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -16,6 +16,8 @@ Other errors are logged via log instance.
 import logging
 import re
=20
+from copy import copy
+
 from .kdoc_re import KernRe
=20
 log =3D logging.getLogger(__name__)
@@ -284,6 +286,172 @@ class CTokenizer():
         return out
=20
=20
+class CTokenArgs:
+    """
+    Ancillary class to help use backrefs from sub matches.
+
+    If the highest backref ends with a "+", the logic will be
+    greedy, absorbing all remaining delimiters.
+
+    Needed to parse struct_group macros, which end with ``MEMBERS...``.
+    """
+    def __init__(self, sub_str):
+        self.sub_groups =3D set()
+        self.max_group =3D -1
+        self.greedy =3D None
+
+        for m in KernRe(r'\\(\d+)([+]?)').finditer(sub_str):
+            group =3D int(m.group(1))
+            if m.group(2) =3D=3D "+":
+                if self.greedy and self.greedy !=3D group:
+                    raise ValueError("There are multiple greedy patterns!")
+                self.greedy =3D group
+
+            self.sub_groups.add(group)
+            self.max_group =3D max(self.max_group, group)
+
+        if self.greedy:
+            if self.greedy !=3D self.max_group:
+                raise ValueError("Greedy pattern is not the last one!")
+
+            sub_str =3D KernRe(r'(\\\d+)[+]').sub(r"\1", sub_str)
+
+        self.sub_str =3D sub_str
+        self.sub_tokeninzer =3D CTokenizer(sub_str)
+
+    def groups(self, new_tokenizer):
+        """
+        Create replacement arguments for backrefs like:
+
+        ``\0``, ``\1``, ``\2``, ...``\n``
+
+        It also accepts a ``+`` character on the highest backref. When
+        used, the delimiters after it are ignored, making it greedy.
+
+        The logic is smart enough to only go up to the maximum required
+        argument, even if there are more.
+
+        If there is a backref for an argument above the limit, it will
+        raise an exception. Note that, in C, square brackets have no
+        separators inside them, so trying to use ``\1``..``\n`` with
+        brackets is also an error.
+        """
+
+        level =3D (0, 0, 0)
+
+        if self.max_group < 0:
+            return level, []
+
+        tokens =3D new_tokenizer.tokens
+
+        #
+        # Fill \0 with the full token contents
+        #
+        groups_list =3D [ [] ]
+
+        if 0 in self.sub_groups:
+            inner_level =3D 0
+
+            for i in range(0, len(tokens)):
+                tok =3D tokens[i]
+
+                if tok.kind =3D=3D CToken.BEGIN:
+                    inner_level +=3D 1
+
+                    #
+                    # Discard first begin
+                    #
+                    if not groups_list[0]:
+                        continue
+                elif tok.kind =3D=3D CToken.END:
+                    inner_level -=3D 1
+                    if inner_level < 0:
+                        break
+
+                if inner_level:
+                    groups_list[0].append(tok)
+
+        if not self.max_group:
+            return level, groups_list
+
+        delim =3D None
+
+        #
+        # Ignore everything before BEGIN. The value of begin gives the
+        # delimiter to be used for the matches
+        #
+        for i in range(0, len(tokens)):
+            tok =3D tokens[i]
+            if tok.kind =3D=3D CToken.BEGIN:
+                if tok.value =3D=3D "{":
+                    delim =3D ";"
+                elif tok.value =3D=3D "(":
+                    delim =3D ","
+                else:
+                    log.error(fr"Can't handle \1..\n on {self.sub_str}")
+
+                level =3D tok.level
+                break
+
+        pos =3D 1
+        groups_list.append([])
+
+        inner_level =3D 0
+        for i in range(i + 1, len(tokens)):
+            tok =3D tokens[i]
+
+            if tok.kind =3D=3D CToken.BEGIN:
+                inner_level +=3D 1
+            if tok.kind =3D=3D CToken.END:
+                inner_level -=3D 1
+                if inner_level < 0:
+                    break
+
+            if tok.kind in [CToken.PUNC, CToken.ENDSTMT] and delim =3D=3D =
tok.value:
+                pos +=3D 1
+                if self.greedy and pos > self.max_group:
+                    pos -=3D 1
+                else:
+                    groups_list.append([])
+
+                    if pos > self.max_group:
+                        break
+
+                    continue
+
+            groups_list[pos].append(tok)
+
+        if pos < self.max_group:
+            log.error(fr"{self.sub_str} groups are up to {pos} instead of =
{self.max_group}")
+
+        return level, groups_list
+
+    def tokens(self, new_tokenizer):
+        level, groups =3D self.groups(new_tokenizer)
+
+        new =3D CTokenizer()
+
+        for tok in self.sub_tokeninzer.tokens:
+            if tok.kind =3D=3D CToken.BACKREF:
+                group =3D int(tok.value[1:])
+
+                for group_tok in groups[group]:
+                    new_tok =3D copy(group_tok)
+
+                    new_level =3D [0, 0, 0]
+
+                    for i in range(0, len(level)):
+                        new_level[i] =3D new_tok.level[i] + level[i]
+
+                    new_tok.level =3D tuple(new_level)
+
+                    new.tokens +=3D [ new_tok ]
+            else:
+                new.tokens +=3D [ tok ]
+
+        return new.tokens
+
+
 class CMatch:
     """
     Finding nested delimiters is hard with regular expressions. It is
@@ -309,10 +477,10 @@ class CMatch:
     will ignore the search string.
     """
=20
-    # TODO: add a sub method
=20
-    def __init__(self, regex):
-        self.regex =3D KernRe(regex)
+    def __init__(self, regex, delim=3D"("):
+        self.regex =3D KernRe("^" + regex + r"\b")
+        self.start_delim =3D delim
=20
     def _search(self, tokenizer):
         """
@@ -335,7 +503,6 @@ class CMatch:
         """
=20
         start =3D None
-        offset =3D -1
         started =3D False
=20
         import sys
@@ -351,15 +518,24 @@ class CMatch:
=20
                 continue
=20
-            if not started and tok.kind =3D=3D CToken.BEGIN:
-                started =3D True
-                continue
+            if not started:
+                if tok.kind =3D=3D CToken.SPACE:
+                    continue
+
+                if tok.kind =3D=3D CToken.BEGIN and tok.value =3D=3D self.=
start_delim:
+                    started =3D True
+                    continue
+
+                # Name only token without BEGIN/END
+                if i > start:
+                    i -=3D 1
+                yield start, i
+                start =3D None
=20
             if tok.kind =3D=3D CToken.END and tok.level =3D=3D stack[-1][1=
]:
                 start, level =3D stack.pop()
-                offset =3D i
=20
-                yield CTokenizer(tokenizer.tokens[start:offset + 1])
+                yield start, i
                 start =3D None
=20
         #
@@ -367,9 +543,12 @@ class CMatch:
         # This is meant to solve cases where the caller logic might be
         # picking an incomplete block.
         #
-        if start and offset < 0:
-            print("WARNING: can't find an end", file=3Dsys.stderr)
-            yield CTokenizer(tokenizer.tokens[start:])
+        if start and stack:
+            if started:
+                s =3D str(tokenizer)
+                log.warning(f"can't find a final end at {s}")
+
+            yield start, len(tokenizer.tokens)
=20
     def search(self, source):
         """
@@ -386,8 +565,75 @@ class CMatch:
             tokenizer =3D CTokenizer(source)
             is_token =3D False
=20
-        for new_tokenizer in self._search(tokenizer):
+        for start, end in self._search(tokenizer):
+            new_tokenizer =3D CTokenizer(tokenizer.tokens[start:end + 1])
+
             if is_token:
                 yield new_tokenizer
             else:
                 yield str(new_tokenizer)
+
+    def sub(self, sub_str, source, count=3D0):
+        """
+        This is similar to re.sub:
+
+        It matches a regex that is followed by a delimiter,
+        replacing occurrences only if all delimiters are paired.
+
+        If the sub argument contains::
+
+            r'\0'
+
+        it works just like re: the matched paired data is placed there,
+        with the delimiters stripped.
+
+        If count is different from zero, it will replace at most count
+        items.
+        """
+        if isinstance(source, CTokenizer):
+            is_token =3D True
+            tokenizer =3D source
+        else:
+            is_token =3D False
+            tokenizer =3D CTokenizer(source)
+
+        # Detect if sub_str contains sub arguments
+
+        args_match =3D CTokenArgs(sub_str)
+
+        new_tokenizer =3D CTokenizer()
+        pos =3D 0
+        n =3D 0
+
+        #
+        # NOTE: the code below doesn't consider overlapping matches.
+        # We may need to add some extra unit tests to check whether those
+        # would cause problems. When replacing by "", this should not
+        # be a problem, but other transformations could be problematic.
+        #
+        for start, end in self._search(tokenizer):
+            new_tokenizer.tokens +=3D tokenizer.tokens[pos:start]
+
+            new =3D CTokenizer(tokenizer.tokens[start:end + 1])
+
+            new_tokenizer.tokens +=3D args_match.tokens(new)
+
+            pos =3D end + 1
+
+            n +=3D 1
+            if count and n >=3D count:
+                break
+
+        new_tokenizer.tokens +=3D tokenizer.tokens[pos:]
+
+        if not is_token:
+            return str(new_tokenizer)
+
+        return new_tokenizer
+
+    def __repr__(self):
+        """
+        Returns a displayable version of the class init.
+        """
+
+        return f'CMatch("{self.regex.regex.pattern}")'
--=20
2.52.0