Allow caller to retrieve the URL used for the request #56
Ok. If I have done it correctly, below are the different records before and after this change. One of the hops is skipped, so this might be something to fix. Before this fix: After this fix:
sebastian-nagel
left a comment
Thanks, @lfoppiano! Looks generally good.
Two major points to consider:
- eventually, drop the raw URL, which would reduce the number of modified classes significantly
- might move some methods to the interface
See inline comments for details.
Thanks! I wasn't expecting this PR to be so close to completion. I've made all the modifications. I still haven't figured out why the robots.txt is not visited, as described in comment #56 (comment), though.
Following up on my previous comment: after setting the log level to DEBUG, I got the following. main/master this branch: I cannot figure out what is going on and why the robots.txt is not visited in this branch.
I cannot see any differences, the robots.txt is visited in both variants:
Sorry, the difference is in the resulting CDX / WARC file, where the record is not present in our branch. master/main/cc: this branch: I probably need to debug it 😅
I think I found why in this branch we don't have There is a cache hit when fetching the robotsRules at FetcherThread:436: fit = {FetchItem@6480} This reaches HttpRobotRulesParser.getRobotRuleSet(...): In master, the cache is not hit, because the resolved URL is I still don't know what makes this latest visit to
This PR (one part of #54) covers NUTCH-3173 for okhttp-protocol and attempts to solve the problem in a generic way.
We add a new method to the Response.java interface contract:
getRawUrl(), which returns the URL that was initially provided by the caller. getUrl() would return the actual URL used for the request.
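
To illustrate the distinction between the two accessors, here is a minimal sketch. It is not the actual Nutch Response interface: `UrlAwareResponse`, `SimpleResponse`, `RawUrlDemo` and `fetch` are hypothetical names, the URLs are plain strings for brevity, and `URI.toASCIIString()` merely stands in for whatever rewriting the HTTP client applies before sending the request.

```java
import java.net.URI;

// Hypothetical, simplified stand-in for the Response contract described above.
// It is NOT the real Nutch interface; only the two accessor names come from the PR.
interface UrlAwareResponse {
    /** URL exactly as the caller supplied it. */
    String getRawUrl();

    /** URL actually used for the request (may differ from the raw one). */
    String getUrl();
}

// Immutable holder for both forms of the URL.
final class SimpleResponse implements UrlAwareResponse {
    private final String rawUrl;
    private final String requestUrl;

    SimpleResponse(String rawUrl, String requestUrl) {
        this.rawUrl = rawUrl;
        this.requestUrl = requestUrl;
    }

    @Override
    public String getRawUrl() {
        return rawUrl;
    }

    @Override
    public String getUrl() {
        return requestUrl;
    }
}

final class RawUrlDemo {
    // Assumption: the client re-encodes the URL before sending it.
    // URI.toASCIIString() percent-encodes non-ASCII characters and is used
    // here only as a stand-in for that step; no network request is made.
    static UrlAwareResponse fetch(String rawUrl) {
        String requestUrl = URI.create(rawUrl).toASCIIString();
        return new SimpleResponse(rawUrl, requestUrl);
    }

    public static void main(String[] args) {
        UrlAwareResponse response = fetch("https://example.org/caf\u00e9");
        System.out.println("raw:     " + response.getRawUrl());
        System.out.println("request: " + response.getUrl());
    }
}
```

With both values available, a caller can key its records on the URL it originally asked for while still logging or archiving the URL that was actually requested.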