Fix caching and hash collisions in _fast_make_domain_id #74
Open
sebastian-nagel wants to merge 1 commit into
- do not overwrite local variable "domain" used as key for cache
- avoid hash collisions for pure domains (host without subdomain) by using a 64-bit hash value on domain.suffix
- avoid hash collisions inside large domains (a.k.a. public suffixes, e.g., deviantart.com, wordpress.org): replace two-part hash value (32-bit subdomain + 32-bit domain.suffix) by 64-bit hash on subdomain.domain.suffix
- add notice that _fast_make_domain_id is not compatible with make_domain_id
In the 2016 host-level webgraph, 501,252 host nodes collide with one or more (up to 5) other nodes, i.e., they share the same ID because of a hash collision. The collisions are counted via:
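The command used for this count is not reproduced in this text. As a rough stand-in, here is a minimal Python sketch of one way to count colliding nodes, assuming the vertex list is available as (hostname, ID) pairs with unique hostnames. The function name `count_colliding_nodes` and the third sample host with its ID are hypothetical; the first two pairs are the example quoted in this description.

```python
from collections import Counter


def count_colliding_nodes(host_ids):
    """Count host nodes whose ID is shared with at least one other host.

    host_ids: iterable of (hostname, node_id) pairs, hostnames unique.
    """
    per_id = Counter(node_id for _, node_id in host_ids)
    # Every host sitting on an ID that occurs more than once is a colliding node.
    return sum(count for count in per_id.values() if count > 1)


sample = [
    ("myfunfan.com", 1831416544),
    ("adobe.com", 1831416544),
    ("example.org", 42),  # hypothetical host and ID, no collision
]
colliding = count_colliding_nodes(sample)
```

With this definition, two hosts sharing one ID count as two colliding nodes, which matches the "nodes collide with one or more other nodes" wording above.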
Most of the collisions affect hosts without a subdomain, which are hashed with only 32 bits: e.g., "myfunfan.com" and "adobe.com" both get the ID "1831416544". Using a 64-bit hash for these hosts reduces the collisions significantly (now about 1000).
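To get a feel for why 32 bits are too few, the usual birthday approximation can be evaluated directly. This is only a back-of-the-envelope sketch: it assumes a uniformly distributed hash, and the key count of 10 million is an illustrative figure, not the number of hosts in the webgraph.

```python
def expected_colliding_pairs(n_keys: int, hash_bits: int) -> float:
    """Expected number of key pairs sharing a hash value (uniform hash assumed)."""
    pairs = n_keys * (n_keys - 1) // 2  # number of distinct key pairs
    return pairs / 2 ** hash_bits       # each pair collides with probability 2^-bits


n = 10_000_000  # illustrative key count, not taken from the webgraph
pairs_32 = expected_colliding_pairs(n, 32)
pairs_64 = expected_colliding_pairs(n, 64)
print(f"32-bit: {pairs_32:.1f} expected colliding pairs")
print(f"64-bit: {pairs_64:.2e} expected colliding pairs")
```

Under these assumptions the 32-bit case yields thousands of colliding pairs, while the 64-bit case stays far below one.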
However, if a composed hash is used (32 bits each for subdomain and domain.suffix), there are still many collisions within domains that act as public suffixes, which often have millions of subdomains. The numbers of collisions are:
This can be fixed by using a 64-bit hash on the whole string subdomain.domain.suffix. For the 2016 host lists, I did not see any collisions.