whoogle-search/test/test_results.py


from bs4 import BeautifulSoup
from app.filter import Filter
from app.utils.session_utils import generate_user_keys
from datetime import datetime
from dateutil.parser import *
import json
import os


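# Note: these tests assume a pytest `client` fixture (e.g. defined in a
# conftest.py for this test package) that wraps the app's Flask test client.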
def get_search_results(data):
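    """Runs a raw search response through the app's Filter and returns the
    list of result divs found under the page's main div.
    """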
    secret_key = generate_user_keys()
    soup = Filter(user_keys=secret_key).clean(BeautifulSoup(data, 'html.parser'))

    main_divs = soup.find('div', {'id': 'main'})
    assert len(main_divs) > 1

    result_divs = []
    for div in main_divs:
        # Result divs should only have 1 inner div
        if (len(list(div.children)) != 1
                or not div.findChild()
                or 'div' not in div.findChild().name):
            continue

        result_divs.append(div)

    return result_divs


def test_get_results(client):
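    """Checks that a GET search request returns a full page of result divs."""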
    rv = client.get('/search?q=test')
    assert rv._status_code == 200

    # Depending on the search, there can be more
    # than 10 result divs
    assert len(get_search_results(rv.data)) >= 10
    assert len(get_search_results(rv.data)) <= 15


def test_post_results(client):
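    """Checks that a POST search request returns a full page of result divs."""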
    rv = client.post('/search', data=dict(q='test'))
    assert rv._status_code == 200

    # Depending on the search, there can be more
    # than 10 result divs
    assert len(get_search_results(rv.data)) >= 10
    assert len(get_search_results(rv.data)) <= 15


# TODO: Unit test the site alt method instead -- the results returned
# are too unreliable for this test in particular.
# def test_site_alts(client):
#     rv = client.post('/search', data=dict(q='twitter official account'))
#     assert b'twitter.com/Twitter' in rv.data
#     client.post('/config', data=dict(alts=True))
#     assert json.loads(client.get('/config').data)['alts']
#     rv = client.post('/search', data=dict(q='twitter official account'))
#     assert b'twitter.com/Twitter' not in rv.data
#     assert b'nitter.net/Twitter' in rv.data


def test_recent_results(client):
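    """Checks that time-filtered searches only return recent results."""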
    times = {
        'past year': 365,
        'past month': 31,
        'past week': 7
    }

    for time, num_days in times.items():
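        # The ':past <time>' suffix appended to the query is what these tests
        # use to request time-filtered results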
        rv = client.post('/search', data=dict(q='test :' + time))
        result_divs = get_search_results(rv.data)

        current_date = datetime.now()
        for div in [_ for _ in result_divs if _.find('span')]:
            date_span = div.find('span').decode_contents()
            if not date_span or len(date_span) > 15 or len(date_span) < 7:
                continue

            try:
                date = parse(date_span)
                # Date can have a little bit of wiggle room
                assert (current_date - date).days <= (num_days + 5)
            except ParserError:
                pass