From: Mauro Carvalho Chehab
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v3 10/22] docs: kdoc: create a CMatch to match nested C blocks
Date: Tue, 17 Mar 2026 19:09:30 +0100

The NestedMatch code is complex, and it will become even more complex if
we add support for arguments to it. Now that we have a tokenizer, we can
use a simpler solution that is easier to understand. Yet, to improve
performance, it is better to make it consume previously tokenized code,
which changes its ABI.

So, reimplement NestedMatch as a new CMatch class on top of CTokenizer.
Once that is done, NestedMatch can be dropped.
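For reference, the stack-based delimiter pairing that CMatch relies on can be sketched in a few lines of plain Python. This is an illustrative standalone snippet, not code from the patch: `find_paired` is a hypothetical name, and it walks characters rather than CTokenizer tokens:

```python
# Sketch of the stack-counting idea behind CMatch: push on an open
# delimiter, pop on a close, and report a complete block when the
# stack empties again.  (Illustration only; the real class works on
# CTokenizer tokens and supports three delimiter types.)
def find_paired(text, open_ch="(", close_ch=")"):
    """Yield (start, end) index spans of balanced outermost blocks."""
    stack = []
    for i, ch in enumerate(text):
        if ch == open_ch:
            stack.append(i)
        elif ch == close_ch and stack:
            start = stack.pop()
            if not stack:          # outermost pair just closed
                yield (start, i)

print(list(find_paired("f(a, g(b), h(c))")))   # [(1, 15)]
```

As in the patch, a close delimiter with no matching open is silently ignored rather than raising, since unbalanced input would not compile anyway.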
Signed-off-by: Mauro Carvalho Chehab
---
 tools/lib/python/kdoc/c_lex.py | 121 ++++++++++++++++++++++++++++++---
 1 file changed, 111 insertions(+), 10 deletions(-)

diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index 9d726f821f3f..5da472734ff7 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -273,20 +273,121 @@ class CTokenizer():
 
             # Do some cleanups before ";"
 
-            if (tok.kind == CToken.SPACE and
-                next_tok.kind == CToken.PUNC and
-                next_tok.value == ";"):
-
+            if tok.kind == CToken.SPACE and next_tok.kind == CToken.ENDSTMT:
                 continue
 
-            if (tok.kind == CToken.PUNC and
-                next_tok.kind == CToken.PUNC and
-                tok.value == ";" and
-                next_tok.kind == CToken.PUNC and
-                next_tok.value == ";"):
-
+            if tok.kind == CToken.ENDSTMT and next_tok.kind == tok.kind:
                 continue
 
             out += str(tok.value)
 
         return out
+
+
+class CMatch:
+    """
+    Finding nested delimiters is hard with regular expressions. It is
+    even harder in Python with its standard re module, as several
+    advanced regular expression features are missing.
+
+    This is the case for this pattern::
+
+        '\\bSTRUCT_GROUP(\\(((?:(?>[^)(]+)|(?1))*)\\))[^;]*;'
+
+    which is used to properly match the open/close parentheses of the
+    STRUCT_GROUP() search string.
+
+    Add a class that counts pairs of delimiters, using it to match and
+    replace nested expressions.
+
+    The original approach was suggested by:
+
+    https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
+
+    although I re-implemented it to make it more generic and to match 3
+    types of delimiters. The logic checks whether delimiters are paired;
+    if they are not, it will ignore the search string.
+    """
+
+    # TODO: add a sub method
+
+    def __init__(self, regex):
+        self.regex = KernRe(regex)
+
+    def _search(self, tokenizer):
+        """
+        Find paired blocks for a regex that ends with a delimiter.
+
+        The suggestion of using finditer to match pairs came from:
+        https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
+        but I ended up using a different implementation, to handle all
+        three types of delimiters and to seek an initial regular
+        expression.
+
+        The algorithm looks for open/close paired delimiters and places
+        them on a stack, yielding the start/stop position of each match
+        when the stack is emptied.
+
+        The algorithm should work fine for properly paired input, but
+        will silently ignore end delimiters that precede a start
+        delimiter. This should be OK for the kernel-doc parser, as
+        unpaired delimiters would cause compilation errors, so we don't
+        need to raise exceptions to cover such issues.
+        """
+
+        start = None
+        offset = -1
+        started = False
+
+        import sys
+
+        stack = []
+
+        for i, tok in enumerate(tokenizer.tokens):
+            if start is None:
+                if tok.kind == CToken.NAME and self.regex.match(tok.value):
+                    start = i
+                    stack.append((start, tok.level))
+                    started = False
+
+                continue
+
+            if not started and tok.kind == CToken.BEGIN:
+                started = True
+                continue
+
+            if tok.kind == CToken.END and tok.level == stack[-1][1]:
+                start, level = stack.pop()
+                offset = i
+
+                yield CTokenizer(tokenizer.tokens[start:offset + 1])
+                start = None
+
+        #
+        # If an END zeroing levels is not there, return the remaining
+        # tokens. This is meant to solve cases where the caller logic
+        # might be picking an incomplete block.
+        #
+        if start is not None and offset < 0:
+            print("WARNING: can't find an end", file=sys.stderr)
+            yield CTokenizer(tokenizer.tokens[start:])
+
+    def search(self, source):
+        """
+        This is similar to re.search: it matches a regex that is
+        followed by a delimiter, yielding occurrences only if all
+        delimiters are paired.
+        """
+
+        if isinstance(source, CTokenizer):
+            tokenizer = source
+            is_token = True
+        else:
+            tokenizer = CTokenizer(source)
+            is_token = False
+
+        for new_tokenizer in self._search(tokenizer):
+            if is_token:
+                yield new_tokenizer
+            else:
+                yield str(new_tokenizer)
-- 
2.52.0
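The STRUCT_GROUP() pattern that CMatch is meant to replace can also be emulated standalone, which may help when reviewing the intent of the class: anchor on a name regex, then scan forward counting parentheses. A sketch under stated assumptions (`match_call` is an illustrative name; it scans characters instead of CTokenizer tokens and ignores strings and comments, which the real tokenizer handles):

```python
import re

def match_call(name_re, text):
    """Yield each 'NAME( ... )' span whose parentheses balance out.

    Sketch only: character-based, assumes balanced input, and does
    not skip string literals or comments like the real tokenizer.
    """
    for m in re.finditer(name_re, text):
        depth = 0
        for i in range(m.end(), len(text)):
            if text[i] == "(":
                depth += 1
            elif text[i] == ")" and depth > 0:
                depth -= 1
                if depth == 0:          # outermost close found
                    yield text[m.start():i + 1]
                    break

src = "struct foo { STRUCT_GROUP(int a; int b;); int c; };"
print(list(match_call(r"\bSTRUCT_GROUP\b", src)))
# ['STRUCT_GROUP(int a; int b;)']
```

This mirrors what the recursive-regex pattern quoted in the CMatch docstring does in PCRE, without needing the `(?>...)` and `(?1)` constructs that Python's re module lacks.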