URL no longer working due to cookie consent page #62

Open
murnanedaniel opened this issue Jul 4, 2023 · 3 comments
Comments

@murnanedaniel

Running

from gnews import GNews

google_news = GNews(period="1d")
results = google_news.get_news("russia")

gives results such as

[{'title': "Russia Is Gaining Influence in Africa, at West's Expense - Foreign Policy",
  'description': "Russia Is Gaining Influence in Africa, at West's Expense  Foreign Policy",
  'published date': 'Sat, 18 Mar 2023 07:00:00 GMT',
  'url': 'https://consent.google.com/m?continue=https://news.google.com/rss/articles/CBMiYmh0dHBzOi8vZm9yZWlnbnBvbGljeS5jb20vMjAyMy8wMy8xOC9ydXNzaWFuLW1lcmNlbmFyaWVzLWFyZS1wdXNoaW5nLWZyYW5jZS1vdXQtb2YtY2VudHJhbC1hZnJpY2Ev0gEA?oc%3D5&gl=DK&m=0&pc=n&cm=2&hl=en-US&src=1',
  'publisher': {'href': 'https://foreignpolicy.com/',
   'title': 'Foreign Policy'}},
...

The URL now points to a cookie consent screen instead of the article:

[image: screenshot of the cookie consent page]

Is there a way to consent to the cookies somehow?
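One thing visible in the result above is that the consent URL carries the original Google News link in its `continue` query parameter. As a rough sketch (the helper name `unwrap_consent_url` is made up for illustration and is not part of GNews), the standard library can pull that link back out. Whether requesting the unwrapped link then skips the consent page is not established here and may depend on region and cookies.

```python
from urllib.parse import parse_qs, urlsplit

CONSENT_HOST = "consent.google.com"


def unwrap_consent_url(url: str) -> str:
    """Return the `continue` target of a consent URL, or the URL unchanged.

    Hypothetical helper for illustration; not part of the GNews API.
    """
    parts = urlsplit(url)
    if parts.hostname != CONSENT_HOST:
        return url
    targets = parse_qs(parts.query).get("continue")
    return targets[0] if targets else url


# URL copied from the result shown in this issue.
consent_url = (
    "https://consent.google.com/m?continue=https://news.google.com/rss/articles/"
    "CBMiYmh0dHBzOi8vZm9yZWlnbnBvbGljeS5jb20vMjAyMy8wMy8xOC9ydXNzaWFuLW1l"
    "cmNlbmFyaWVzLWFyZS1wdXNoaW5nLWZyYW5jZS1vdXQtb2YtY2VudHJhbC1hZnJpY2Ev0gEA"
    "?oc%3D5&gl=DK&m=0&pc=n&cm=2&hl=en-US&src=1"
)

unwrapped = unwrap_consent_url(consent_url)
print(unwrapped)
```

The result is still a `news.google.com/rss/articles/...` link rather than the publisher's address, so it only removes the consent wrapper.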

@themetalleg

This problem is already described in #53, but I can't get the fix there to work.

@ranahaani
Owner

ranahaani commented Aug 2, 2023

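import requests

# get_news() returns a list of dicts, so pick one result and follow its redirects.
orig_url = requests.get(results[0]['url']).url

Can you try that?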

@izdrail

izdrail commented Nov 22, 2023

I've done something like this, in case it helps anyone. I found the answer on Stack Overflow:



# Ref: https://stackoverflow.com/a/59023463/
import base64
import functools
import re

_ENCODED_URL_PREFIX = "https://news.google.com/rss/articles/"
_ENCODED_URL_PREFIX_WITH_CONSENT = "https://consent.google.com/m?continue=https://news.google.com/rss/articles/"
_ENCODED_URL_PREFIXES = (_ENCODED_URL_PREFIX_WITH_CONSENT, _ENCODED_URL_PREFIX)
# One pattern that accepts either prefix, so the consent form is no longer overwritten.
_ENCODED_URL_RE = re.compile(
    fr"^(?:{re.escape(_ENCODED_URL_PREFIX_WITH_CONSENT)}|{re.escape(_ENCODED_URL_PREFIX)})(?P<encoded_url>[^?]+)"
)
_DECODED_URL_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')


@functools.lru_cache(2048)
def _decode_google_news_url(url: str) -> str:
    match = _ENCODED_URL_RE.match(url)
    if match is None:
        return url
    encoded_text = match.group("encoded_url")
    encoded_text += "==="  # Fix incorrect padding. Ref: https://stackoverflow.com/a/49459036/
    decoded_text = base64.urlsafe_b64decode(encoded_text)

    match = _DECODED_URL_RE.match(decoded_text)
    if match is None:
        return url  # Payload is not in the expected layout; leave the URL unchanged.
    return match.group("primary_url").decode()


def decode_google_news_url(url: str) -> str:
    # Decode both plain and consent-wrapped Google News URLs; pass anything else through.
    return _decode_google_news_url(url) if url.startswith(_ENCODED_URL_PREFIXES) else url
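
To see why the regular expression above looks for `\x08\x13"` at the start and `\xd2\x01` after the URL, it can help to decode the article ID from the result in this issue by hand. The sketch below is only an inspection aid: the name `inspect_payload` is hypothetical, and the layout (three header bytes, a one-byte length, the URL, then a trailer) is an assumption read off this single example, not a documented format. A one-byte length would only cover URLs shorter than 128 bytes.

```python
import base64

# Article ID copied from the consent URL in the original report.
ARTICLE_ID = (
    "CBMiYmh0dHBzOi8vZm9yZWlnbnBvbGljeS5jb20vMjAyMy8wMy8xOC9ydXNzaWFuLW1l"
    "cmNlbmFyaWVzLWFyZS1wdXNoaW5nLWZyYW5jZS1vdXQtb2YtY2VudHJhbC1hZnJpY2Ev0gEA"
)


def inspect_payload(article_id: str) -> dict:
    """Split a decoded article ID into the parts assumed above (hypothetical helper)."""
    # Pad to a multiple of four characters so the base64 decoder accepts the input.
    padded = article_id + "=" * (-len(article_id) % 4)
    payload = base64.urlsafe_b64decode(padded)
    header, length, rest = payload[:3], payload[3], payload[4:]
    return {
        "header": header,
        "length": length,
        "url": rest[:length].decode(),
        "trailer": rest[length:],
    }


parts = inspect_payload(ARTICLE_ID)
for name, value in parts.items():
    print(f"{name}: {value!r}")
```

For this ID the decoded URL sits on the publisher's site, which matches the `publisher` field in the original result.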
