Skip to content

Fix caching and hash collisions in _fast_make_domain_id#74

Open
sebastian-nagel wants to merge 1 commit into
commonsearch:masterfrom
sebastian-nagel:fast-make-domain-id-hash-collisions
Open

Fix caching and hash collisions in _fast_make_domain_id#74
sebastian-nagel wants to merge 1 commit into
commonsearch:masterfrom
sebastian-nagel:fast-make-domain-id-hash-collisions

Conversation

@sebastian-nagel

Copy link
Copy Markdown
  1. do not overwrite local variable "domain" used as key for cache
  2. avoid hash collisions for pure domains (host without subdomain) by using a 64-bit hash value on domain.suffix
  3. avoid hash collisions inside large domains (aka. public suffixes, e.g., deviantart.com, wordpress.org): replace two-part hash value (32-bit subdomain + 32-bit domain.suffix) by 64-bit hash on subdomain.domain.suffix
  4. add notice that _fast_make_domain_id is not compatible with make_domain_id

In the 2016 host-level webgraph 501,252 host nodes collide with one or more (up to 5) other nodes, i.e., the share the same ID caused by a hash collision. The collisions are counted via

   zcat vertices.txt.gz | cut -d' ' -f1 | sort | uniq -d

Most of the collisions affect hosts without subdomain hashed only with 32-bit: e.g., "myfunfan.com" and "adobe.com" both with id "1831416544"). By using a 64-bit hash for these, the collisions are significantly reduced (now about 1000).

However, if a composed hash is used (32-bit per subdomain and domain.suffix) there are still many collisions for domains also known as public suffixes often with millions of subdomains. The number of collisions are:

   1083 deviantart.com
    566 wordpress.com
    184 tumblr.com
    124 xbjiaju.com
     94 polyvore.com
     82 livejournal.com

This can be fixed by using a 64-bit hash on the whole string subdomain.domain.suffix. For the 2016 host lists, I did not see any collisions.

- do not overwrite local variable "domain" used as key for cache
- avoid hash collisions for pure domains (host without subdomain)
  by using a 64-bit hash value on domain.suffix
- avoid hash collisions inside large domains (aka. public suffixes,
  e.g., deviantart.com, wordpress.org): replace two-part hash value
  (32-bit subdomain + 32-bit domain.suffix) by 64-bit hash on
  subdomain.domain.suffix
- add notice that _fast_make_domain_id is not compatible with make_domain_id
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant