Contents of type "Inhaltsseite" won't get crawled #50

Open
Geronymos opened this issue Jan 1, 2022 · 6 comments

@Geronymos

My analysis course uses an "Inhaltsseite" (content page; the icon looks like a laptop showing a diagram) to provide the lecture script (which is updated regularly) as well as the exercise sheets and their solutions.

Unfortunately I can't download them with PFERD. I tried the command line, a config file that downloads the whole course, and the explicit URL, but nothing works.
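
For reference, the config-file attempt looked roughly like the sketch below. The layout follows PFERD's config format as far as I remember it; the exact option names should be checked against PFERD's CONFIG.md, the username is a placeholder, and the target is the course URL from the log further down.

[auth:ilias]
type = simple
username = my-kit-username

[crawl:ilias]
type = kit-ilias-web
auth = auth:ilias
target = https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
output_dir = .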

When executing pferd kit-ilias-web [url] . it just says:

Loading auth:ilias
Loading crawl:ilias

Running crawl:ilias
Crawled     '.'


Report for crawl:ilias
  Nothing changed

And the folder stays empty.

Is this a misconfiguration on my end or is this type of structure not implemented yet?

@I-Al-Istannen
Collaborator

Is this a misconfiguration on my end or is this type of structure not implemented yet?

I'd guess the latter. Could you pass the --explain switch as the first parameter to pferd (before the kit-ilias-web)? Then PFERD should try to explain itself; maybe it will tell you that it has no idea what's happening.
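
With the command from above, that would be

pferd --explain kit-ilias-web [url] .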

@Geronymos
Author

Here is the output with the --explain flag:

Loading config
  CLI command specified, loading config from its arguments
  Creating config for command 'kit-ilias-web'
Deciding which crawlers to run
  No crawlers specified on CLI
  Running crawlers specified in config
Loading auth:ilias
Loading crawl:ilias

Running crawl:ilias
Loading cookies
  Sharing cookies
  '/home/me/documents/uni/ilias/ana_blatt/.cookies' has newest mtime so far
  Loading cookies from '/home/me/documents/uni/ilias/ana_blatt/.cookies'
Creating base directory at '/home/me/documents/uni/ilias/ana_blatt'
Loading previous report from '/home/me/documents/uni/ilias/ana_blatt/.report'
  Loaded report successfully
Inferred crawl target: URL https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
Decision: Crawl '.'
  Final result: '.'
  Answer: Yes
Parsing root HTML page
  URL: https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
  Page is a normal folder, searching for elements
Crawled     '.'
Decision: Clean up files
  No warnings or errors occurred during this run
  Answer: Yes
Storing report to '/home/me/documents/uni/ilias/ana_blatt/.report'
  Stored report successfully
Total amount of HTTP requests: 1
Saving cookies
  Saving cookies to '/home/me/documents/uni/ilias/ana_blatt/.cookies'

Report for crawl:ilias
  Nothing changed

@I-Al-Istannen
Collaborator

Yeah, so it apparently did not recognize anything useful. I will have a look at it, but not before the ILIAS 7 migration in a few days, if that's alright with you. That migration will probably absolutely slaughter the HTML parser anyway :P

@I-Al-Istannen
Collaborator

Could you have a look at what https://github.com/Garmelon/PFERD/releases/tag/v3.3.0 produces, @Geronymos?

@Geronymos
Author

Even though PFERD 3.3 can download all regular content again (thank you for that!), it unfortunately still downloads nothing for this type of element. It does, however, recognize that the page is a content page (see the explain log below).

As I see it, "Inhaltsseite" might be an option for the lecturer to write plain HTML. So maybe it could be handled like an "external link": download the page itself as plain text and also download the links found within the page.
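
As a rough illustration of that idea (not PFERD code; treating every anchor as a candidate download is certainly too naive for real use):

# Sketch only: collect candidate download links from a content page's HTML.
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def find_candidate_downloads(page_html: str, page_url: str) -> list[str]:
    soup = BeautifulSoup(page_html, "html.parser")
    candidates = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative links against the content page's own URL.
        candidates.append(urljoin(page_url, anchor["href"]))
    return candidates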

Explain log:
Loading config
  CLI command specified, loading config from its arguments
  Creating config for command 'kit-ilias-web'
Deciding which crawlers to run
  No crawlers specified on CLI
  Running crawlers specified in config
Loading auth:ilias
Loading crawl:ilias

Running crawl:ilias
Loading cookies
  Sharing cookies
  '/home/me/documents/uni/ilias/ana_blatt/.cookies' has newest mtime so far
  Loading cookies from '/home/me/documents/uni/ilias/ana_blatt/.cookies'
Creating base directory at '/home/me/documents/uni/ilias/ana_blatt'
Loading previous report from '/home/me/documents/uni/ilias/ana_blatt/.report'
  Loaded report successfully
Inferred crawl target: URL https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
Decision: Crawl '.'
  Final result: '.'
  Answer: Yes
Parsing root HTML page
  URL: https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
  Page is a content page, searching for elements
Crawled     '.'
Decision: Clean up files
  No warnings or errors occurred during this run
  Answer: Yes
Storing report to '/home/me/documents/uni/ilias/ana_blatt/.report'
  Stored report successfully
Total amount of HTTP requests: 1
Saving cookies
  Saving cookies to '/home/me/documents/uni/ilias/ana_blatt/.cookies'

Report for crawl:ilias
  Nothing changed

@I-Al-Istannen
Collaborator

The "content page" has a "file" feature which I added support for. I thought they were nice enough to use it but they are not...

I don't really want to crawl random pages linked from the content page - that could lead to weird network requests, errors when the remote file is behind authentication, and so on. I was about to suggest writing a dedicated crawler type for the math page, but they don't even link the files there... So I guess I will have to find a compromise here.

  1. I could do a HEAD request to find out the content type reported by the remote server, store the item as an "external link" file if it is text/html, and otherwise download it. But that would cause an additional network request for each item - even if it is already present locally. (A rough sketch of this option follows below the list.)

  2. Slightly less fancy, I could just use the name of the link and perform the same check. That would let me handle it in a single request and skip items that are already present locally, but the file extension will be off.

  3. As a third option, I could just download everything as-is; you might then end up with downloaded HTML files if the page links to things which cannot be downloaded directly.

All of these will lead to errors if there are links to files behind authentication.
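
To make option 1 concrete, here is a minimal sketch of the HEAD-based check (illustration only, not PFERD's actual code; the function name is made up and error handling is omitted):

import aiohttp

async def classify_link(session: aiohttp.ClientSession, url: str) -> str:
    # One extra request per linked item, even if it already exists locally.
    async with session.head(url, allow_redirects=True) as response:
        content_type = response.headers.get("Content-Type", "")
    # HTML pages would be stored as "external link" files,
    # everything else would be downloaded directly.
    return "external-link" if content_type.startswith("text/html") else "download"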
