Prevent specifying video as short due to web-scraping abuse #330

Open
Benjamin-Loison opened this issue Nov 30, 2024 · 9 comments
Labels: bug, medium priority, quick

Comments

@Benjamin-Loison
Owner

As requested on Discord.

Related to #11.

Benjamin-Loison added the bug, medium priority, and quick labels on Nov 30, 2024
Benjamin-Loison self-assigned this on Nov 30, 2024
@Benjamin-Loison
Owner Author

grep -ri 'isRedirection('
Output:
common.php:    function isRedirection($url)
videos.php:                'available' => !isRedirection("https://www.youtube.com/shorts/$id")

@Benjamin-Loison
Owner Author

grep -r 'getHeadersFromOpts'
Output:
common.php:    function getHeadersFromOpts($url, $opts)
common.php:		$http_response_header = getHeadersFromOpts($url, $opts);

@Benjamin-Loison
Owner Author

grep -ri 'redirect' --exclude-dir={.git,tools}
Output:
common.php:    function getRedirection($url)
common.php:    function getJSONFromHTML($url, $opts = [], $scriptVariable = '', $prefix = 'var ', $forceLanguage = false, $verifiesChannelRedirection = false)
common.php:        if($verifiesChannelRedirection)
common.php:            $redirectedToChannelIdPath = 'onResponseReceivedActions/0/navigateAction/endpoint/browseEndpoint/browseId';
common.php:            if(doesPathExist($json, $redirectedToChannelIdPath))
common.php:                $redirectedToChannelId = getValue($json, $redirectedToChannelIdPath);
common.php:                $url = preg_replace('/[\w\-_]{24}/', $redirectedToChannelId, $url);
common.php:                // Does a redirection of redirection for a channel exist?
common.php:                return getJSONFromHTML($url, $opts, $scriptVariable, $prefix, $forceLanguage, $verifiesChannelRedirection);
common.php:                if (str_starts_with($url, 'https://www.youtube.com/redirect?')) {
channels.php:            $result = getJSONFromHTML("https://www.youtube.com/channel/$id", forceLanguage: true, verifiesChannelRedirection: true);
channels.php:            $result = getJSONFromHTML("https://www.youtube.com/channel/$id", forceLanguage: true, verifiesChannelRedirection: true);
channels.php:                $result = getJSONFromHTML("https://www.youtube.com/channel/$id/shorts", forceLanguage: true, verifiesChannelRedirection: true);
channels.php:                $result = getJSONFromHTML("https://www.youtube.com/channel/$id/community", forceLanguage: true, verifiesChannelRedirection: true);
channels.php:                $result = getJSONFromHTML("https://www.youtube.com/channel/$id/channels", forceLanguage: true, verifiesChannelRedirection: true);
channels.php:            $result = getJSONFromHTML("https://www.youtube.com/channel/$id/about", forceLanguage: true, verifiesChannelRedirection: true);
channels.php:            $result = getJSONFromHTML("https://www.youtube.com/channel/$id", forceLanguage: true, verifiesChannelRedirection: true);
channels.php:            $result = getJSONFromHTML("https://www.youtube.com/channel/$id", verifiesChannelRedirection: true);
channels.php:                $result = getJSONFromHTML("https://www.youtube.com/channel/$id/playlists", forceLanguage: true, verifiesChannelRedirection: true);
.htaccess:Redirect /matrix https://matrix.to/#/#youtube-operational-api:matrix.org
.htaccess:Redirect /discord https://discord.gg/pDzafhGWzf
.htaccess:Redirect /code https://github.com/Benjamin-Loison/YouTube-operational-API
.htaccess:Redirect /host-your-own-instance https://github.com/Benjamin-Loison/YouTube-operational-API/blob/main/README.md#install-your-own-instance-of-the-api
.htaccess:Redirect /issues https://github.com/Benjamin-Loison/YouTube-operational-API/issues
videos.php:                'available' => getRedirection("https://www.youtube.com/shorts/$id")

@Benjamin-Loison
Owner Author

die(json_encode($http_response_header));

@Benjamin-Loison
Owner Author

Benjamin-Loison commented Nov 30, 2024

Bash script:
VIDEO_IDS=(
    ydPkyvWtmg4
    NiXD4xVJM5Y
)
for videoId in "${VIDEO_IDS[@]}"
do
    echo -n "$videoId: "
    curl -s "http://localhost/YouTube-operational-API/videos?part=short&id=$videoId" | jq '.items[0].short.available'
done
Output:
ydPkyvWtmg4: true
NiXD4xVJM5Y: false

@Benjamin-Loison
Owner Author

Benjamin-Loison commented Nov 30, 2024

Recall that https://www.youtube.com/shorts/VIDEO_ID redirects to https://www.youtube.com/watch?v=VIDEO_ID only if the video is not a short.

| Is detected as spamming \ Video type | Video | Short |
|---|---|---|
| False | Redirected to https://www.youtube.com/watch?v=VIDEO_ID | Not redirected |
| True | Redirected to an arbitrary URL | Redirected to an arbitrary URL |

We want available to be:

| Is detected as spamming \ Video type | Video | Short |
|---|---|---|
| False | false | true |
| True | die | die |
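
The two tables above can be read as a small decision rule on the redirection target. A minimal Bash sketch, assuming the target is compared by exact string match; the function name `expected_short_available` is hypothetical and not part of the repository:

```bash
#!/usr/bin/env bash

# Hypothetical helper, for illustration only.
# $1: video id
# $2: redirection target of https://www.youtube.com/shorts/$1 (empty if not redirected)
# Prints the wanted outcome: true, false or die.
expected_short_available() {
    local video_id="$1" location="$2"
    if [ -z "$location" ]; then
        # Not redirected: the video is a short.
        echo true
    elif [ "$location" = "https://www.youtube.com/watch?v=$video_id" ]; then
        # Redirected to the watch page of the same video: not a short.
        echo false
    else
        # Redirected anywhere else: treat as detected as spamming.
        echo die
    fi
}

expected_short_available ydPkyvWtmg4 ''
expected_short_available NiXD4xVJM5Y 'https://www.youtube.com/watch?v=NiXD4xVJM5Y'
expected_short_available NiXD4xVJM5Y 'https://example.com/'
```

With these inputs the three calls print `true`, `false` and `die`, matching the three distinct cells of the second table.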

@Benjamin-Loison
Owner Author

So the point is rather to verify that the redirection URL, if any, is the expected one. The web-scraping detection could therefore be refined to not rely on HTTP status codes alone.
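
Checking the target rather than the status code requires reading the `Location` response header. A rough Bash sketch of that step, assuming raw response headers on standard input; `get_location_header` is a hypothetical name used only for illustration:

```bash
#!/usr/bin/env bash

# Hypothetical helper, for illustration only.
# Reads raw HTTP response headers on stdin and prints the value of the
# first Location header (header names are case-insensitive), or nothing
# if there is none.
get_location_header() {
    tr -d '\r' | awk 'tolower($0) ~ /^location:/ {
        sub(/^[^:]*:[ \t]*/, "")  # drop the header name and leading blanks
        print
        exit
    }'
}

# Possible usage against the live site (needs network access):
#   curl -sI "https://www.youtube.com/shorts/$videoId" | get_location_header
printf 'HTTP/2 303\r\nlocation: https://www.youtube.com/watch?v=NiXD4xVJM5Y\r\n\r\n' | get_location_header
```

The printed value can then be compared with the expected https://www.youtube.com/watch?v=VIDEO_ID target instead of only testing for a 303 code.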

@Benjamin-Loison
Owner Author

diff:
diff --git a/common.php b/common.php
index ef9d2f4..67d771a 100644
--- a/common.php
+++ b/common.php
@@ -84,7 +84,7 @@
         return [$result, $http_response_header];
     }
 
-    function isRedirection($url)
+    function getRedirection($url)
     {
         $opts = [
             'http' => [
@@ -97,7 +97,10 @@
         if (in_array($code, HTTP_CODES_DETECTED_AS_SENDING_UNUSUAL_TRAFFIC)) {
             detectedAsSendingUnusualTraffic();
         }
-        return $code == 303;
+        if ($code == 303) {
+            return $http_response_header['Location'];
+        }
+        return null;
     }
 
     function getRemote($url, $opts = [], $verifyTrafficIfForbidden = true)
diff --git a/videos.php b/videos.php
index 4328c1c..c8b3e53 100644
--- a/videos.php
+++ b/videos.php
@@ -171,7 +172,7 @@
 
         if ($options['short']) {
             $short = [
-                'available' => !isRedirection("https://www.youtube.com/shorts/$id")
+                'available' => getRedirection("https://www.youtube.com/shorts/$id") === null,
             ];
             $item['short'] = $short;
         }

@Benjamin-Loison
Owner Author

Fix #330: Change `isRedirection` to `getRedirection`
