docs: kdoc: create a CMatch to match nested C blocks

The NestedMatch code is complex, and will become even more complex
if we add support for arguments there.

Now that we have a tokenizer, we can use a better solution that
is easier to understand.

Yet, to improve performance, it is better to make it use
previously tokenized code, changing its API.

So, reimplement NestedMatch using the CTokenizer class. Once that
is done, we can drop NestedMatch.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <fa818ea164216b17520b588e3f12b81499b76dd7.1773770483.git.mchehab+huawei@kernel.org>

Authored by Mauro Carvalho Chehab and committed by Jonathan Corbet 3 months ago (f1cf9f7c bd167a21)

+111 -10

1 changed file

tools/lib/python/kdoc/c_lex.py

@@ -273,20 +273,121 @@

                 # Do some cleanups before ";"

-                if (tok.kind == CToken.SPACE and
-                    next_tok.kind == CToken.PUNC and
-                    next_tok.value == ";"):
-
+                if tok.kind == CToken.SPACE and next_tok.kind == CToken.ENDSTMT:
                     continue

-                if (tok.kind == CToken.PUNC and
-                    next_tok.kind == CToken.PUNC and
-                    tok.value == ";" and
-                    next_tok.kind == CToken.PUNC and
-                    next_tok.value == ";"):
-
+                if tok.kind == CToken.ENDSTMT and next_tok.kind == tok.kind:
                     continue

             out += str(tok.value)

         return out
+
+
+class CMatch:
+    """
+    Finding nested delimiters is hard with regular expressions. It is
+    even harder on Python with its normal re module, as there are several
+    advanced regular expressions that are missing.
+
+    This is the case of this pattern::
+
+            '\\bSTRUCT_GROUP(\\(((?:(?>[^)(]+)|(?1))*)\\))[^;]*;'
+
+    which is used to properly match open/close parentheses of the
+    string search STRUCT_GROUP(),
+
+    Add a class that counts pairs of delimiters, using it to match and
+    replace nested expressions.
+
+    The original approach was suggested by:
+
+        https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
+
+    Although I re-implemented it to make it more generic and match 3 types
+    of delimiters. The logic checks if delimiters are paired. If not, it
+    will ignore the search string.
+    """
+
+    # TODO: add a sub method
+
+    def __init__(self, regex):
+        self.regex = KernRe(regex)
+
+    def _search(self, tokenizer):
+        """
+        Finds paired blocks for a regex that ends with a delimiter.
+
+        The suggestion of using finditer to match pairs came from:
+        https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
+        but I ended using a different implementation to align all three types
+        of delimiters and seek for an initial regular expression.
+
+        The algorithm seeks for open/close paired delimiters and places them
+        into a stack, yielding a start/stop position of each match when the
+        stack is zeroed.
+
+        The algorithm should work fine for properly paired lines, but will
+        silently ignore end delimiters that precede a start delimiter.
+        This should be OK for kernel-doc parser, as unaligned delimiters
+        would cause compilation errors. So, we don't need to raise exceptions
+        to cover such issues.
+        """
+
+        start = None
+        offset = -1
+        started = False
+
+        import sys
+
+        stack = []
+
+        for i, tok in enumerate(tokenizer.tokens):
+            if start is None:
+                if tok.kind == CToken.NAME and self.regex.match(tok.value):
+                    start = i
+                    stack.append((start, tok.level))
+                    started = False
+
+                continue
+
+            if not started and tok.kind == CToken.BEGIN:
+                started = True
+                continue
+
+            if tok.kind == CToken.END and tok.level == stack[-1][1]:
+                start, level = stack.pop()
+                offset = i
+
+                yield CTokenizer(tokenizer.tokens[start:offset + 1])
+                start = None
+
+        #
+        # If an END zeroing levels is not there, return remaining stuff
+        # This is meant to solve cases where the caller logic might be
+        # picking an incomplete block.
+        #
+        if start and offset < 0:
+            print("WARNING: can't find an end", file=sys.stderr)
+            yield CTokenizer(tokenizer.tokens[start:])
+
+    def search(self, source):
+        """
+        This is similar to re.search:
+
+        It matches a regex that it is followed by a delimiter,
+        returning occurrences only if all delimiters are paired.
+        """
+
+        if isinstance(source, CTokenizer):
+            tokenizer = source
+            is_token = True
+        else:
+            tokenizer = CTokenizer(source)
+            is_token = False
+
+        for new_tokenizer in self._search(tokenizer):
+            if is_token:
+                yield new_tokenizer
+            else:
+                yield str(new_tokenizer)
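
The delimiter-counting idea described in the CMatch docstrings can be illustrated outside the kernel tree. What follows is only a minimal sketch, not the kdoc implementation: it scans raw characters instead of CTokenizer tokens, so it would be fooled by delimiters inside strings or comments, and the names find_nested, OPEN, CLOSE and PAIRS are hypothetical helpers invented for this example. As in the docstring, closing delimiters that precede an opening one are silently ignored.

```python
import re

# Hypothetical stand-ins for illustration; not the kernel-doc classes.
OPEN = "([{"
CLOSE = ")]}"
PAIRS = dict(zip(CLOSE, OPEN))  # maps each closer to its opener


def find_nested(pattern, source):
    """Yield each substring that starts at a match of `pattern` and runs
    through the delimiter closing the first block opened after it."""
    regex = re.compile(pattern)
    pos = 0
    while True:
        match = regex.search(source, pos)
        if match is None:
            return

        stack = []
        end = None
        for i in range(match.end(), len(source)):
            ch = source[i]
            if ch in OPEN:
                stack.append(ch)
            elif ch in CLOSE:
                # Ignore a closer that does not pair with the latest opener.
                if not stack or stack[-1] != PAIRS[ch]:
                    continue
                stack.pop()
                if not stack:
                    # Stack is empty again: the block is complete.
                    end = i + 1
                    break

        if end is None:
            return  # unbalanced input: no complete block to report

        yield source[match.start():end]
        pos = end


code = "struct foo { STRUCT_GROUP(int a; struct { int b; } c;); int d; };"
blocks = list(find_nested(r"\bSTRUCT_GROUP", code))
```

Here blocks holds the single string "STRUCT_GROUP(int a; struct { int b; } c;)", including the nested braces. Working on tokens, as the commit does, avoids the string and comment pitfalls of this character-level version.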