Improve G page distinction between footer and results

Links in the Whoogle footer that route to Google pages by default were
already being removed, but the same check also removed results that
routed to similar pages, making them inaccessible. This was due to the
removal of the '/url' endpoint that Google uses for each result.

To fix this, the link is now parsed so that only its domain (netloc) is
checked against the disallowed G page list. Results are delivered in a
relative "/url?q=<domain>" format, even for pages on Google's own
products, so their parsed domain is empty and never matches the list.
Footer links are formatted as "<product>.google.com", so they still
match. As a result, footer links are removed and result links are kept
and parsed correctly.

Fixes #747
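
The distinction described in the message above can be sketched in isolation. This is a minimal illustration, not the project's code: the list contents, the sample hrefs, and the helper names `matches_old` and `matches_new` are assumptions made for the example, and footer hrefs are assumed to carry a scheme.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the project's disallowed list.
unsupported_g_pages = ['support.google.com', 'accounts.google.com']


def matches_old(href: str) -> bool:
    """Old check: substring match against the whole href."""
    return any(url in href for url in unsupported_g_pages)


def matches_new(href: str) -> bool:
    """New check: substring match against the parsed domain only."""
    link_netloc = urlparse(href).netloc
    return any(url in link_netloc for url in unsupported_g_pages)


# Assumed sample links: an absolute footer link and a relative result link.
footer_href = 'https://support.google.com/websearch'
result_href = '/url?q=https://support.google.com/websearch/answer/1'

# A relative href has no netloc, so the result link no longer matches.
print(urlparse(result_href).netloc == '')
print(matches_old(result_href), matches_new(result_href))
print(matches_old(footer_href), matches_new(footer_href))
```

Under these assumptions the old check flags both links, while the new check flags only the footer link.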
main
Ben Busby 2022-05-16 09:53:48 -06:00
parent f5d599e7d2
commit fb600d6fc8
No known key found for this signature in database
GPG Key ID: B9B7231E01D924A1
1 changed file with 5 additions and 1 deletion

@@ -410,8 +410,10 @@ class Filter:
             None (the tag is updated directly)
         """
+        link_netloc = urlparse.urlparse(link['href']).netloc
         # Remove any elements that direct to unsupported Google pages
-        if any(url in link['href'] for url in unsupported_g_pages):
+        if any(url in link_netloc for url in unsupported_g_pages):
             # FIXME: The "Shopping" tab requires further filtering (see #136)
             # Temporarily removing all links to that tab for now.
             parent = link.parent
@@ -431,6 +433,8 @@ class Filter:
             # Internal google links (i.e. mail, maps, etc) should still
             # be forwarded to Google
             link['href'] = 'https://google.com' + q
+        elif link['href'].startswith('/url'):
+            link['href'] = q
         elif q.startswith('https://accounts.google.com'):
             # Remove Sign-in link
             link.decompose()
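
The second hunk replaces a "/url" href with its destination. The sketch below shows one way that rewrite could look on plain strings. It rests on assumptions the diff does not confirm: that `q` holds the value of the `q` query parameter, and that the branch before the new `elif` tests whether `q` starts with "/". The helper `resolve_href` is hypothetical and is not part of the `Filter` class.

```python
from urllib.parse import parse_qs, urlparse


def resolve_href(href: str) -> str:
    """Hypothetical helper: return the href a link should end up with."""
    # Assumption: q is the first value of the "q" query parameter.
    q_values = parse_qs(urlparse(href).query).get('q', [])
    q = q_values[0] if q_values else ''

    if q.startswith('/'):
        # Assumed preceding branch: internal Google paths are forwarded.
        return 'https://google.com' + q
    if href.startswith('/url'):
        # New branch from the diff: unwrap the result's real destination.
        return q
    return href


print(resolve_href('/url?q=https://maps.google.com/maps&sa=U'))
print(resolve_href('/url?q=/maps'))
print(resolve_href('https://example.com'))
```

The sign-in branch is left out here because it removes the tag rather than rewriting the href.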