Add end-to-end tests for the SitemapInjector #52
Conversation
sebastian-nagel
left a comment
count manually the expected URLs
That's another argument for replacing the test resources with smaller sample sitemaps that have only 1-2 entries per sitemap.
The job counters are accessible inside the method inject(...) from the sitemapJob variable via sitemapJob.getCounters().
@sebastian-nagel thanks for the feedback. I've made the changes and have one more question. I'm not sure whether this is an assumption I should rely on, but as I read it, the counter should be the final number of URLs that are injected (or does this depend on the content of the crawldb?). For example, here there are 3 injected URLs, 2 of which overlap and are merged. Shouldn't the counter reflect that?
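To make the question concrete, here is a minimal, library-free sketch of the two numbers being compared. It assumes (this is not confirmed anywhere in the thread) that a per-record counter is incremented once for every URL emitted, while the crawldb keeps one entry per distinct URL after merging. The class name `InjectCountSketch` and the example URLs are hypothetical and do not reflect how SitemapInjector is implemented.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class InjectCountSketch {

    /** What a counter incremented once per emitted record would report (assumption). */
    static int emittedCount(List<String> urls) {
        return urls.size();
    }

    /** Number of entries left once records sharing the same URL key are merged. */
    static int mergedCount(List<String> urls) {
        Set<String> distinct = new LinkedHashSet<>(urls);
        return distinct.size();
    }

    public static void main(String[] args) {
        // Hypothetical input: 3 injected URLs, 2 of which overlap.
        List<String> urls = List.of(
            "https://example.com/a",
            "https://example.com/a",
            "https://example.com/b");
        System.out.println("emitted=" + emittedCount(urls)
            + " merged=" + mergedCount(urls));
    }
}
```

Under that assumption the two numbers differ whenever the input contains duplicates, so a test has to decide which of the two it asserts on.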
sebastian-nagel
left a comment
Thanks, @lfoppiano! Looks good to me.
@sebastian-nagel I've managed, not without problems, to get an end-to-end test running around the SitemapInjector. However, this is a draft PR and the tests are still failing; see my question below. The test should simulate the process via protocol-file and put Nutch in a position to run the SitemapInjector on any sitemap, load the URLs into the CrawlDB, and verify them.
I've created two tests, one for the KPMG and one for your sitemap.xml example.
I'm having trouble understanding how to assert the number of URLs, e.g. what is counted by Hadoop:
and what is accumulated in the crawl db, when I check in
    [...]
    SitemapInjector sitemapInjector = new SitemapInjector();
    sitemapInjector.setConf(conf);
    sitemapInjector.inject(crawldbPath, urlPath);
    List<String> injected = readCrawldb();
I did not find any way to count manually the expected URLs and compare them with injected.size(), but I'm surely missing something here.
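One possible way to derive the expected number independently of Nutch is to parse the sitemap test resource with the JDK's XML parser and count the distinct `<loc>` values found under `<url>` elements. The sketch below is only an illustration: the class name `ExpectedSitemapUrls`, the helper `distinctLocs` and the inline sample XML are hypothetical, and it ignores any URL normalization or filtering that may be applied during injection, so the result may not match the crawldb exactly.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ExpectedSitemapUrls {

    // Namespace defined by the sitemaps.org protocol.
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    /** Returns the distinct <loc> values that sit directly under a <url> element. */
    static Set<String> distinctLocs(String sitemapXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));
            NodeList locs = doc.getElementsByTagNameNS(SITEMAP_NS, "loc");
            Set<String> urls = new LinkedHashSet<>();
            for (int i = 0; i < locs.getLength(); i++) {
                Node loc = locs.item(i);
                Node parent = loc.getParentNode();
                // Skip <loc> entries of a sitemap index (<sitemap><loc>...).
                if (parent != null && "url".equals(parent.getLocalName())) {
                    urls.add(loc.getTextContent().trim());
                }
            }
            return urls;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse sitemap XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: three <url> entries, two with the same <loc>.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<urlset xmlns=\"" + SITEMAP_NS + "\">"
            + "<url><loc>https://example.com/a</loc></url>"
            + "<url><loc>https://example.com/a</loc></url>"
            + "<url><loc>https://example.com/b</loc></url>"
            + "</urlset>";
        System.out.println("expected distinct URLs: " + distinctLocs(xml).size());
    }
}
```

In a test, the size of the returned set could then be compared with `injected.size()`, provided the sitemap resource is small enough to check by eye as well.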