pattern: Ensure accurate origin counts returned by run method
Previously, the run method returned the total count of ListedOrigin objects sent to the scheduler database. However, some listers can send multiple ListedOrigin objects for a given origin URL during the listing process, for instance when an origin appears on multiple pages (e.g. gogs listing) or when the listing gathers multiple versions of an origin spread across multiple pages (e.g. maven listing). This change ensures an accurate count of listed origins by maintaining a set of the origin URLs associated with the sent ListedOrigin objects.
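The counting approach described above can be illustrated with a minimal, standalone sketch. It is not the lister's actual implementation: the `Origin` class and the `count_listing` helper are hypothetical stand-ins, and the only assumption is that each listed origin exposes a `url`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Origin:
    """Hypothetical stand-in for ListedOrigin; only the URL matters here."""

    url: str


def count_listing(pages: Iterable[List[Origin]]) -> Tuple[int, int]:
    """Return (number of pages, number of distinct origin URLs)."""
    seen_urls: Set[str] = set()
    page_count = 0
    for page in pages:
        page_count += 1
        # A real lister would send the page's origins to the scheduler here.
        # Recording URLs in a set makes repeated origins count only once.
        seen_urls.update(origin.url for origin in page)
    return page_count, len(seen_urls)


# The same origin appearing on two pages is counted as one origin.
same_origin_twice = [[Origin("https://example.org/user/project")] for _ in range(2)]
pages, origins = count_listing(same_origin_twice)
```

Counting the objects sent would report two origins for this input, whereas the set-based count reports one, which matches what the test added in this commit expects.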
parent 3928fc9ee9
commit 8d85b2e4e8
4 changed files with 32 additions and 9 deletions
@@ -198,3 +198,20 @@ def test_stateless_run(swh_scheduler):
    # And that all origins are stored
    check_listed_origins(swh_scheduler, lister, stored_lister)


class ListerWithSameOriginInMultiplePages(RunnableStatelessLister):
    def get_pages(self) -> Iterator[PageType]:
        for _ in range(2):
            yield [{"url": "https://example.org/user/project"}]


def test_listed_origins_count(swh_scheduler):
    lister = ListerWithSameOriginInMultiplePages(
        scheduler=swh_scheduler, url="https://example.org", instance="example.org"
    )

    run_result = lister.run()

    assert run_result.pages == 2
    assert run_result.origins == 1