pattern: Improve handling of max_origins_per_page parameter

Instead of fully consuming the get_origins_from_page generator into
a list and then truncating it, consume the generator origin by
origin and abort processing once the maximum number of origins per
page is reached.

Some non-trivial listers, like the cgit one, perform costly
processing (an HTTP request, for instance) for each origin in a page.
Not consuming the full generator in one go avoids triggering that
processing for origins that would be discarded anyway.
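
The difference between the two approaches can be illustrated with a minimal, self-contained sketch. It is not the lister code itself: the `fetched` list, the `collect_origins` helper and the example.org URLs are hypothetical stand-ins, and the "costly processing" is simulated by recording each origin the generator has to produce.

```python
from typing import Iterator, List, Optional

# Records each simulated costly per-origin operation (hypothetical stand-in
# for something like an HTTP request made by a real lister).
fetched: List[int] = []


def get_origins_from_page(page: range) -> Iterator[str]:
    """Hypothetical generator: one costly step per origin yielded."""
    for repo_id in page:
        fetched.append(repo_id)
        yield f"https://example.org/repo/{repo_id}"


def collect_origins(page: range, max_origins_per_page: Optional[int]) -> List[str]:
    """Consume the generator origin by origin, stopping once the cap is reached."""
    origins: List[str] = []
    for origin in get_origins_from_page(page):
        origins.append(origin)
        if max_origins_per_page and len(origins) == max_origins_per_page:
            break  # the generator is not advanced past this point
    return origins


page = range(100)

# Eager approach: the whole generator runs before the list is truncated.
eager = list(get_origins_from_page(page))[:3]
eager_cost = len(fetched)

fetched.clear()

# Lazy approach: only as many costly steps as origins actually kept.
lazy = collect_origins(page, max_origins_per_page=3)
lazy_cost = len(fetched)

print(eager == lazy, eager_cost, lazy_cost)  # True 100 3
```

Both approaches return the same three origins, but the eager one pays for all 100 simulated operations. `itertools.islice` would give a similar effect more compactly; an explicit loop makes it straightforward to log when the cap is hit.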
Antoine Lambert 2023-03-21 16:46:59 +01:00
parent 45bbc29a52
commit 35871896b2


@@ -182,17 +182,20 @@ class Lister(Generic[StateType, PageType]):
         try:
             for page in self.get_pages():
                 full_stats.pages += 1
-                origins = list(self.get_origins_from_page(page))
-                if (
-                    self.max_origins_per_page
-                    and len(origins) > self.max_origins_per_page
-                ):
-                    logger.info(
-                        "Max origins per page set, truncated %s page results down to %s",
-                        len(origins),
-                        self.max_origins_per_page,
-                    )
-                    origins = origins[: self.max_origins_per_page]
+                origins = []
+                for origin in self.get_origins_from_page(page):
+                    origins.append(origin)
+                    if (
+                        self.max_origins_per_page
+                        and len(origins) == self.max_origins_per_page
+                    ):
+                        logger.info(
+                            "Max origins per page set to %s and reached, "
+                            "aborting page processing",
+                            self.max_origins_per_page,
+                        )
+                        break
                 if not self.enable_origins:
                     logger.info(
                         "Disabling origins before sending them to the scheduler"