Add normalise_url parameter to Crawly.Middlewares.UniqueRequest #295

Open
wants to merge 1 commit into
base: master
from

Conversation

adonig
Contributor

@adonig adonig commented Apr 29, 2024

This parameter lets you customize the normalization behavior by supplying a unary normalization function.

@oltarasenko
Collaborator

Sorry, I don't understand the idea behind this change. Could you please explain?

@adonig
Contributor Author

adonig commented Apr 29, 2024

It lets you use a different normalization function without having to replace the whole UniqueRequest middleware. For instance, some search engines treat URLs with and without a trailing slash as identical, while others do not. Some go even further and treat /index.html the same way. Some lowercase the host part of the URL, and some even sort the query string parameters alphanumerically or resolve relative paths. For example, I use this normalization function:

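  # Returns a normalized URL string, or nil for anything that is not an absolute http(s) URL.
  def normalize_url(url) do
    parsed = URI.parse(url)

    if parsed.scheme in ["http", "https"] and parsed.host do
      # Rebuild the URI from selected fields only; userinfo, port and fragment are dropped.
      %URI{
        scheme: parsed.scheme,
        host: String.downcase(parsed.host),
        path: parsed.path,
        query: parsed.query
      }
      |> URI.to_string()
    else
      nil
    end
  end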
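
To illustrate the stricter variants mentioned above (ignoring a trailing slash, sorting query parameters), here is a minimal sketch of how such a function might look. The module name `NormalizeSketch` and its private helpers are hypothetical and not part of Crawly or this PR; it only uses functions from Elixir's standard `URI`, `String` and `Enum` modules.

```elixir
defmodule NormalizeSketch do
  # Hypothetical example module; not part of Crawly or this PR.

  # Returns a normalized URL string, or nil for non-http(s) input.
  def normalize_url(url) when is_binary(url) do
    parsed = URI.parse(url)

    if parsed.scheme in ["http", "https"] and is_binary(parsed.host) do
      %URI{
        scheme: parsed.scheme,
        host: String.downcase(parsed.host),
        path: normalize_path(parsed.path),
        query: sort_query(parsed.query)
      }
      |> URI.to_string()
    else
      nil
    end
  end

  # Treat "/a/" and "/a" as the same path; a missing or empty path becomes "/".
  defp normalize_path(nil), do: "/"

  defp normalize_path(path) do
    case String.trim_trailing(path, "/") do
      "" -> "/"
      trimmed -> trimmed
    end
  end

  # Sort query parameters so that "?b=2&a=1" and "?a=1&b=2" compare equal.
  # URI.query_decoder/1 keeps duplicate keys, unlike URI.decode_query/1.
  defp sort_query(nil), do: nil

  defp sort_query(query) do
    query
    |> URI.query_decoder()
    |> Enum.sort()
    |> URI.encode_query()
  end
end
```

Whether a trailing slash or parameter order should be ignored depends on the site being crawled, which is the reason for making the function configurable.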

@adonig
Contributor Author

adonig commented Apr 30, 2024

Section 6 of RFC 3986 goes a bit deeper into the topic of URL normalization.

@adonig
Contributor Author

adonig commented May 2, 2024

I found out that Erlang comes with an RFC 3986-compliant URL normalization function: :uri_string.normalize/1
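
As a rough sketch of how that function could be used as a unary normalizer from Elixir: the wrapper module `RfcNormalizeSketch` below is hypothetical, and mapping errors to nil is an assumption chosen to mirror the function shown earlier, not something the PR prescribes.

```elixir
defmodule RfcNormalizeSketch do
  # Hypothetical wrapper; not part of Crawly or this PR.

  # Delegates to Erlang's :uri_string.normalize/1, which returns the
  # normalized URI (a binary when given a binary) or an error tuple.
  def normalize_url(url) when is_binary(url) do
    case :uri_string.normalize(url) do
      normalized when is_binary(normalized) -> normalized
      {:error, _reason, _detail} -> nil
    end
  end
end
```

For example, `RfcNormalizeSketch.normalize_url("http://localhost:80")` drops the default port and adds the root path, and dot segments such as `./` and `../` are resolved.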

I believe it's still a good idea to let people provide their own implementation, because some might want to go beyond what the RFC specifies, as Cloudflare or Kaspersky do, for example.
