Add end to end tests for the SitemapInjector#52

Merged
lfoppiano merged 7 commits into cc from feature/test-sitemap-hreflang on May 22, 2026

@lfoppiano lfoppiano commented Apr 16, 2026

@sebastian-nagel I've managed, not without problems, to get an end-to-end test running around the SitemapInjector. However, this is a draft PR and the tests are still failing; see my question below.

It should simulate the process via protocol-file and put Nutch in a position to run the SitemapInjector on any sitemap, load the URLs into the CrawlDB, and verify the result.

I've created two tests, one for the KPMG sitemap and one for your sitemap.xml example.
I'm having trouble understanding how to assert the number of URLs, e.g. what is counted by Hadoop:

2026-04-16 22:01:02,576 INFO o.a.n.c.SitemapInjector [Thread-43] Found 1474 URLs in file:/Users/lfoppiano/development/projects/cc/nutch/src/testresources/sitemaps/sitemap.example.1.xml
2026-04-16 22:01:02,588 INFO o.a.n.c.SitemapInjector [Thread-43] Injected total 4630 URLs for file:/Users/lfoppiano/development/projects/cc/nutch/src/testresources/sitemaps/sitemap.example.1.xml
2026-04-16 22:01:02,812 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:     3156  sitemap_extension_localized_link
2026-04-16 22:01:02,812 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:        1  sitemap_type_xml
2026-04-16 22:01:02,812 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:        1  sitemaps_processed
2026-04-16 22:01:02,812 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:     4630  urls_from_sitemaps_injected
2026-04-16 22:01:04,665 INFO o.a.n.c.SitemapInjector [Thread-160] Found 866 URLs in file:/Users/lfoppiano/development/projects/cc/nutch/src/testresources/sitemaps/sitemap.example.2.xml
2026-04-16 22:01:04,668 INFO o.a.n.c.SitemapInjector [Thread-160] Injected total 2598 URLs for file:/Users/lfoppiano/development/projects/cc/nutch/src/testresources/sitemaps/sitemap.example.2.xml
2026-04-16 22:01:05,357 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:     1732  sitemap_extension_localized_link
2026-04-16 22:01:05,358 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:        1  sitemap_type_xml
2026-04-16 22:01:05,358 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:        1  sitemaps_processed
2026-04-16 22:01:05,358 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:     2598  urls_from_sitemaps_injected

and what accumulates in the crawl db, when I check with:

[...]
        SitemapInjector sitemapInjector = new SitemapInjector();
        sitemapInjector.setConf(conf);
        sitemapInjector.inject(crawldbPath, urlPath);

        List<String> injected = readCrawldb();

I did not find any way to manually count the expected URLs and compare them with injected.size(), but I'm surely missing something here.
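One way to get the expected numbers by hand is to count them straight from the sitemap file. The sketch below is plain JDK code, not part of this PR or of Nutch. It assumes the injector emits one record per `<loc>` and one per `xhtml:link` `href` (the logs above are consistent with that: 1474 + 3156 = 4630 and 866 + 1732 = 2598) and that the CrawlDB keeps one entry per distinct URL. The class name, the `count` helper, and the inline sample sitemap are all hypothetical, and the real injector may additionally normalize or filter URLs, which this sketch ignores.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ExpectedSitemapUrls {

    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical two-entry sitemap, not one of the PR's test resources.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
      + "<urlset xmlns=\"" + SITEMAP_NS + "\" xmlns:xhtml=\"" + XHTML_NS + "\">"
      + "<url><loc>https://example.com/en/</loc>"
      + "<xhtml:link rel=\"alternate\" hreflang=\"de\" href=\"https://example.com/de/\"/></url>"
      + "<url><loc>https://example.com/de/</loc>"
      + "<xhtml:link rel=\"alternate\" hreflang=\"fr\" href=\"https://example.com/fr/\"/></url>"
      + "</urlset>";

    /** Returns {number of loc elements, number of xhtml:link elements, number of distinct URLs}. */
    static int[] count(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match elements by namespace URI
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            Set<String> distinct = new LinkedHashSet<>();

            NodeList locs = doc.getElementsByTagNameNS(SITEMAP_NS, "loc");
            for (int i = 0; i < locs.getLength(); i++) {
                distinct.add(locs.item(i).getTextContent().trim());
            }

            NodeList links = doc.getElementsByTagNameNS(XHTML_NS, "link");
            for (int i = 0; i < links.getLength(); i++) {
                distinct.add(((Element) links.item(i)).getAttribute("href").trim());
            }

            return new int[] {locs.getLength(), links.getLength(), distinct.size()};
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sitemap", e);
        }
    }

    public static void main(String[] args) {
        int[] c = count(SAMPLE);
        System.out.println("loc entries:       " + c[0]);
        System.out.println("localized links:   " + c[1]);
        System.out.println("emitted (assumed): " + (c[0] + c[1]));
        System.out.println("distinct URLs:     " + c[2]);
    }
}
```

Under these assumptions, the first two numbers would be compared against the job counters and the last one against `injected.size()`. For the hypothetical sample this gives 2 `<loc>` entries, 2 localized links, and 3 distinct URLs.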

@sebastian-nagel sebastian-nagel left a comment

> count manually the expected URLs

That's another argument for replacing the test resources with smaller sample sitemaps with only 1-2 entries per sitemap.

The job counters are accessible inside the method inject(...) from the sitemapJob variable via sitemapJob.getCounters().

Comment thread src/testresources/sitemaps/sitemap.example.1.txt Outdated
Comment thread src/testresources/sitemaps/sitemap.example.1.xml
lfoppiano commented Apr 20, 2026

Author

@sebastian-nagel thanks for the feedback. Having made-up sitemap.xml files would be helpful, at least at first, and would solve several of the problems you highlighted above.

I have one more question. I'm not sure this is an assumption I should rely on, but as I read it, the counter should be the final number of URLs that are injected (or does this depend on the content of the crawldb?):

2026-04-20 15:33:04,049 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:        2  sitemap_extension_localized_link
2026-04-20 15:33:04,049 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:        1  sitemap_type_xml
2026-04-20 15:33:04,049 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:        1  sitemaps_processed
2026-04-20 15:33:04,049 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:        4  urls_from_sitemaps_injected

For example, here there are 3 injected URLs; 2 are overlapping and they are merged. Shouldn't the counter reflect that?
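One possible reading, which nothing in this thread confirms, is that `urls_from_sitemaps_injected` is incremented once per emitted record, before records with the same URL are merged into a single CrawlDB entry. The plain-Java sketch below only illustrates that general pattern. It does not use Nutch or Hadoop, and the class name, method names, and URLs are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class EmitVersusMerge {

    /** What a per-record counter would report: every emitted URL counts, duplicates included. */
    static long emittedCount(List<String> emitted) {
        return emitted.size();
    }

    /** Groups emitted URLs by key, as a reduce step would; the value is the number of merged records. */
    static Map<String, Integer> mergeByUrl(List<String> emitted) {
        Map<String, Integer> merged = new TreeMap<>();
        for (String url : emitted) {
            merged.merge(url, 1, Integer::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical records: two <loc> entries plus two localized links, one of them a duplicate.
        List<String> emitted = Arrays.asList(
            "https://example.com/en/",
            "https://example.com/de/",
            "https://example.com/de/",
            "https://example.com/fr/");

        System.out.println("counter (emitted records): " + emittedCount(emitted));
        System.out.println("entries after merge:       " + mergeByUrl(emitted).size());
    }
}
```

If the counter does work this way, a test would assert the counter against the number of emitted records (4 in this sketch) and `injected.size()` against the number of distinct URLs (3 in this sketch).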

@lfoppiano lfoppiano marked this pull request as ready for review April 20, 2026 14:59

@sebastian-nagel sebastian-nagel left a comment

Thanks, @lfoppiano! Looks good to me.

@lfoppiano lfoppiano merged commit 29c642c into cc May 22, 2026
1 check passed
@lfoppiano lfoppiano deleted the feature/test-sitemap-hreflang branch May 22, 2026 18:27

2 participants