SitemapInjector: extract and inject localized links#51

Merged
sebastian-nagel merged 3 commits into cc from sitemap-localized-links
Apr 7, 2026

@sebastian-nagel

Extract and inject localized links (crawler-commons sitemap link attributes).

  • Refactor and split method injectURLs to allow iterating over sitemap URL and localized links.
  • Add counter sitemap_extension_localized_link to count the number of localized links injected.
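
The refactoring described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Entry` class, the `collectUrls` and `count` helpers, and the `sitemap_url` counter name are all hypothetical, and a plain `Map` stands in for the Hadoop job counters. Only the counter name `sitemap_extension_localized_link` is taken from the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LocalizedLinkSketch {

    /** Hypothetical stand-in for a parsed sitemap entry: the URL plus its localized alternates. */
    public static final class Entry {
        final String loc;
        final List<String> localizedLinks;

        public Entry(String loc, List<String> localizedLinks) {
            this.loc = loc;
            this.localizedLinks = localizedLinks;
        }
    }

    /** Hypothetical helper: increment a named counter (the Map stands in for job counters). */
    static void count(Map<String, Long> counters, String name) {
        counters.merge(name, 1L, Long::sum);
    }

    /** Visit every sitemap URL and then each of its localized links, counting both kinds. */
    public static List<String> collectUrls(List<Entry> entries, Map<String, Long> counters) {
        List<String> injected = new ArrayList<>();
        for (Entry entry : entries) {
            injected.add(entry.loc);
            count(counters, "sitemap_url"); // hypothetical counter name
            for (String link : entry.localizedLinks) {
                injected.add(link);
                count(counters, "sitemap_extension_localized_link"); // name from the PR description
            }
        }
        return injected;
    }

    public static void main(String[] args) {
        Map<String, Long> counters = new LinkedHashMap<>();
        List<Entry> entries = List.of(new Entry("https://example.com/en/",
                List.of("https://example.com/de/", "https://example.com/fr/")));
        System.out.println(collectUrls(entries, counters));
        System.out.println(counters);
    }
}
```

Splitting the per-URL step from the loop over entries is what lets the same injection path handle both the sitemap URL and its alternates; URL filtering and normalization are left out of the sketch.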

Minor changes:

  • Update Javadoc
  • Log sitemap processing timeout

@lfoppiano

lfoppiano commented Apr 6, 2026

@sebastian-nagel is there a sitemap.xml with hreflang that can be used for testing?

EDIT: as usual, after I ask the question, I find the answer: https://kpmg.com/tr/tr/sitemap.xml

@lfoppiano

@sebastian-nagel I did not find anything wrong, but I did not manage to test it; it seems that this part is only invoked via Hadoop. I tried to see if I could create some tests. For what it's worth, I have a segment containing the sitemap.xml with hreflang from the previous comment.

@sebastian-nagel

Author

I've tested on a local copy of https://www.i-run.be/electronique/sitemap.xml and

$> echo http://localhost/nutch/sitemap/extensions/sitemap-i-run-be-electronique.xml >/tmp/seed_sitemaps.txt

$> nutch org.apache.nutch.crawl.SitemapInjector -Ddb.injector.sitemap.check-cross-submits=false -Dhttp.filter.ipaddress.exclude= /tmp/crawldb/ /tmp/seed_sitemaps.txt
...
2026-04-07 09:11:34,254 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:     1732  sitemap_extension_localized_link
2026-04-07 09:11:34,254 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:        1  sitemap_type_xml
2026-04-07 09:11:34,254 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:        1  sitemaps_processed
2026-04-07 09:11:34,254 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:     2598  urls_from_sitemaps_in
...

But for the kpmg sitemap it does not work - only the "normal" sitemap URLs are injected, not the localized ones. Need to look into it. But it could also be that the sitemap is broken.

@lfoppiano

@sebastian-nagel I've got several errors on the URLs:

2026-04-07 10:15:13,189 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:       48  sitemap_failed_to_fetch_timeout
2026-04-07 10:15:13,189 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:    19520  sitemap_rejected_by_url_filters
2026-04-07 10:15:13,189 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:      165  sitemap_skipped_too_many_failures_per_host
2026-04-07 10:15:13,189 INFO o.a.n.c.SitemapInjector [main] SitemapInjector: finished fetching and processing sitemaps, elapsed: 3996
2026-04-07 10:15:13,190 WARN o.a.n.c.SitemapInjector [main] No URLs found in sitemaps, skipping step 2 merging URLs into CrawlDb

I tried to run the class directly but it did not work. It seems there is a problem when parsing that sitemap, although the two sitemaps look similar. I did not find any obvious difference.

@sebastian-nagel

Author

But for the kpmg sitemap it does not work

This is addressed in crawler-commons/crawler-commons#572.

@sebastian-nagel sebastian-nagel merged commit 31c2ba8 into cc Apr 7, 2026
1 check passed
@sebastian-nagel

Author

@lfoppiano, thanks for testing and for uncovering the issue with the XHTML namespace URI.
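
To illustrate why a namespace URI can matter here, the sketch below parses a tiny inline sitemap with the JDK's namespace-aware DOM parser and collects `xhtml:link` alternates. It is not the crawler-commons parser and does not reproduce the actual problem fixed in crawler-commons#572. The class name, the `localizedLinks` and `sample` helpers, the example.com URLs, and the variant `https://` namespace URI are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class HreflangNamespaceSketch {

    /** The XHTML namespace URI conventionally bound to the xhtml: prefix in sitemaps. */
    public static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Collect href values of link elements (rel=alternate, with hreflang) in the given namespace. */
    public static List<String> localizedLinks(String sitemapXml, String linkNamespace) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for getElementsByTagNameNS to match
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));
        NodeList links = doc.getElementsByTagNameNS(linkNamespace, "link");
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < links.getLength(); i++) {
            Element link = (Element) links.item(i);
            if ("alternate".equals(link.getAttribute("rel")) && link.hasAttribute("hreflang")) {
                hrefs.add(link.getAttribute("href"));
            }
        }
        return hrefs;
    }

    /** Hypothetical sample sitemap; the namespace bound to the xhtml: prefix is a parameter. */
    public static String sample(String xhtmlNamespace) {
        return "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\""
                + " xmlns:xhtml=\"" + xhtmlNamespace + "\">"
                + "<url><loc>https://example.com/en/</loc>"
                + "<xhtml:link rel=\"alternate\" hreflang=\"de\" href=\"https://example.com/de/\"/>"
                + "<xhtml:link rel=\"alternate\" hreflang=\"en\" href=\"https://example.com/en/\"/>"
                + "</url></urlset>";
    }

    public static void main(String[] args) throws Exception {
        // Expected namespace: both alternates are found.
        System.out.println(localizedLinks(sample(XHTML_NS), XHTML_NS));
        // Assumed variant URI (https instead of http): strict matching finds nothing.
        System.out.println(localizedLinks(sample("https://www.w3.org/1999/xhtml"), XHTML_NS));
    }
}
```

Because namespace matching is an exact string comparison on the URI, two sitemaps that look the same apart from the `xmlns:xhtml` declaration can yield different sets of localized links.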

@sebastian-nagel sebastian-nagel deleted the sitemap-localized-links branch April 7, 2026 20:31