Contents of type "Inhaltsseite" won't get crawled #50

Open
Geronymos opened this issue Jan 1, 2022 · 6 comments

@Geronymos

My analysis course uses an "Inhaltsseite" (content page; the icon looks like a laptop showing a diagram) to provide the lecture script (which is updated regularly) as well as the exercise sheets and their solutions.

Unfortunately I can't download them with PFERD. I tried the command line, a config file that downloads the whole course, and the explicit URL, but nothing works.
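
For reference, the config-file attempt looked roughly like the sketch below. The layout follows PFERD's config format as far as I remember it; the exact option names should be checked against PFERD's CONFIG.md, the username is a placeholder, and the target is the course URL from the log further down.

[auth:ilias]
type = simple
username = my-kit-username

[crawl:ilias]
type = kit-ilias-web
auth = auth:ilias
target = https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
output_dir = .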

When executing pferd kit-ilias-web [url] . it just says:

Loading auth:ilias
Loading crawl:ilias

Running crawl:ilias
Crawled     '.'


Report for crawl:ilias
  Nothing changed

And the folder stays empty.

Is this a misconfiguration on my end or is this type of structure not implemented yet?

@I-Al-Istannen
Collaborator

Is this a misconfiguration on my end or is this type of structure not implemented yet?

I'd guess the latter. Could you pass the --explain switch as the first parameter to pferd (before the kit-ilias-web)? Then PFERD should try to explain itself; maybe it will tell you that it has no idea what's happening.
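
With the command from above, that would be

pferd --explain kit-ilias-web [url] .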

@Geronymos
Author

Here is the output with the --explain flag:

Loading config
  CLI command specified, loading config from its arguments
  Creating config for command 'kit-ilias-web'
Deciding which crawlers to run
  No crawlers specified on CLI
  Running crawlers specified in config
Loading auth:ilias
Loading crawl:ilias

Running crawl:ilias
Loading cookies
  Sharing cookies
  '/home/me/documents/uni/ilias/ana_blatt/.cookies' has newest mtime so far
  Loading cookies from '/home/me/documents/uni/ilias/ana_blatt/.cookies'
Creating base directory at '/home/me/documents/uni/ilias/ana_blatt'
Loading previous report from '/home/me/documents/uni/ilias/ana_blatt/.report'
  Loaded report successfully
Inferred crawl target: URL https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
Decision: Crawl '.'
  Final result: '.'
  Answer: Yes
Parsing root HTML page
  URL: https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
  Page is a normal folder, searching for elements
Crawled     '.'
Decision: Clean up files
  No warnings or errors occurred during this run
  Answer: Yes
Storing report to '/home/me/documents/uni/ilias/ana_blatt/.report'
  Stored report successfully
Total amount of HTTP requests: 1
Saving cookies
  Saving cookies to '/home/me/documents/uni/ilias/ana_blatt/.cookies'

Report for crawl:ilias
  Nothing changed

@I-Al-Istannen
Collaborator

Yeah, so it apparently did not recognize anything useful. I will have a look at it, but not before the ILIAS 7 migration in a few days, if that's alright with you. That migration will probably absolutely slaughter the HTML parser anyway :P

@I-Al-Istannen
Collaborator

Could you have a look at what https://github.com/Garmelon/PFERD/releases/tag/v3.3.0 produces, @Geronymos?

@Geronymos
Author

Even though PFERD 3.3 can download all regular content again (thank you for that!), it unfortunately still downloads nothing for this type of element. It does, however, recognize that the page is a content page (see the explain log below).

As I see it, "Inhaltsseite" might be an option for the lecturer to write plain HTML. So maybe it could be handled like an "external link": download the page itself as plain text and also download the links found within the page.
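
As a rough illustration of that idea (not PFERD code; treating every anchor as a candidate download is certainly too naive for real use):

# Sketch only: collect candidate download links from a content page's HTML.
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def find_candidate_downloads(page_html: str, page_url: str) -> list[str]:
    soup = BeautifulSoup(page_html, "html.parser")
    candidates = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative links against the content page's own URL.
        candidates.append(urljoin(page_url, anchor["href"]))
    return candidates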

Explain log:
Loading config
  CLI command specified, loading config from its arguments
  Creating config for command 'kit-ilias-web'
Deciding which crawlers to run
  No crawlers specified on CLI
  Running crawlers specified in config
Loading auth:ilias
Loading crawl:ilias

Running crawl:ilias
Loading cookies
  Sharing cookies
  '/home/me/documents/uni/ilias/ana_blatt/.cookies' has newest mtime so far
  Loading cookies from '/home/me/documents/uni/ilias/ana_blatt/.cookies'
Creating base directory at '/home/me/documents/uni/ilias/ana_blatt'
Loading previous report from '/home/me/documents/uni/ilias/ana_blatt/.report'
  Loaded report successfully
Inferred crawl target: URL https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
Decision: Crawl '.'
  Final result: '.'
  Answer: Yes
Parsing root HTML page
  URL: https://ilias.studium.kit.edu/goto.php?target=copa_1649818&client_id=produktiv
  Page is a content page, searching for elements
Crawled     '.'
Decision: Clean up files
  No warnings or errors occurred during this run
  Answer: Yes
Storing report to '/home/me/documents/uni/ilias/ana_blatt/.report'
  Stored report successfully
Total amount of HTTP requests: 1
Saving cookies
  Saving cookies to '/home/me/documents/uni/ilias/ana_blatt/.cookies'

Report for crawl:ilias
  Nothing changed

@I-Al-Istannen
Collaborator

The "content page" has a "file" feature which I added support for. I thought they were nice enough to use it but they are not...

I don't really want to crawl random pages linked from the content page - that could lead to weird network requests, errors when the remote file is behind authentication, and so on. I was about to suggest writing a dedicated crawler type for the math page, but they don't even link the files there... So I guess I will have to find a compromise here.

  1. I could do a HEAD request to find out the content type reported by the remote server, store the item as an "external link" file if it is text/html, and otherwise download it. But that would cause an additional network request for each item - even if it is already present locally. (A rough sketch of this option follows below the list.)

  2. Slightly less fancy, I could just use the name of the link and perform the same check. That would let me handle it in a single request and skip items that are already present locally, but the file extension will be off.

  3. As a third option, I could just download everything as-is; you might then end up with downloaded HTML files if the page links to things which cannot be downloaded directly.

All of these will lead to errors if there are links to files behind authentication.
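
To make option 1 concrete, here is a minimal sketch of the HEAD-based check (illustration only, not PFERD's actual code; the function name is made up and error handling is omitted):

import aiohttp

async def classify_link(session: aiohttp.ClientSession, url: str) -> str:
    # One extra request per linked item, even if it already exists locally.
    async with session.head(url, allow_redirects=True) as response:
        content_type = response.headers.get("Content-Type", "")
    # HTML pages would be stored as "external link" files,
    # everything else would be downloaded directly.
    return "external-link" if content_type.startswith("text/html") else "download"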
