Skip to content

Latest commit

 

History

History
134 lines (97 loc) · 7.63 KB

File metadata and controls

134 lines (97 loc) · 7.63 KB

Mitigation Strategies Analysis for gocolly/colly

  • Description:

    1. Create a list of realistic User-Agent strings (browsers, operating systems, versions).
    2. Directly using Colly: Use colly.RandomUserAgent() to automatically select a random User-Agent from a predefined list (within Colly's internal resources). This is the simplest approach.
    3. Alternatively (for more control): Maintain your own list of User-Agents. Before each request (or batch of requests), randomly select one from your list. Set the User-Agent header using c.UserAgent = selectedUserAgent.
  • Threats Mitigated:

    • Detection and Blocking (High Severity): Websites block based on User-Agent.
    • Rate Limiting (Medium Severity): Some sites apply stricter rate limits to known bot User-Agents.
  • Impact:

    • Detection and Blocking: High Impact (makes the scraper appear more like real users).
    • Rate Limiting: Medium Impact (may help avoid stricter rate limits).
  • Currently Implemented:

    • Implemented in initialization.go using a custom list and c.UserAgent.
  • Missing Implementation:

    • The rotation should be done more frequently (per request or small batch). Move the logic into the main scraping loop in scraper.go. Consider using colly.RandomUserAgent() for simplicity if its built-in list is sufficient.
  • Description:

    1. Obtain a list of proxy servers (IP and port).
    2. Directly using Colly: Implement a colly.ProxyFunc. This is a function that Colly calls before each request to get the proxy URL.
    3. Inside the ProxyFunc, select a proxy from your list (or use your proxy service's logic).
    4. Return the proxy URL as a string (e.g., "http://proxy_ip:proxy_port"). Use http.ProxyURL to help format the URL.
  • Threats Mitigated:

    • Detection and Blocking (High Severity): Masks your real IP address.
    • Rate Limiting (High Severity): Distributes requests across multiple IPs.
    • Geo-Blocking (Medium Severity): Access content from different regions.
  • Impact:

    • Detection and Blocking: Very High Impact.
    • Rate Limiting: Very High Impact.
    • Geo-Blocking: Medium Impact.
  • Currently Implemented:

    • No proxy implementation.
  • Missing Implementation:

    • Full implementation needed: proxy provider integration, colly.ProxyFunc implementation in initialization.go or proxy.go.
  • Description:

    1. Directly using Colly: Use colly.LimitRule to define request timing.
    2. Set the Delay property to specify a base delay between requests (e.g., 5 * time.Second).
    3. Set the RandomDelay property to add a random additional delay (e.g., 2 * time.Second).
    4. Apply the rule to the collector: c.Limit(&colly.LimitRule{...}).
  • Threats Mitigated:

    • Detection and Blocking (Medium Severity): Mimics human browsing.
    • Rate Limiting (High Severity): Stays within acceptable request rates.
    • Unintentional DoS on Target (High Severity): Reduces server load.
  • Impact:

    • Detection and Blocking: Moderate Impact.
    • Rate Limiting: High Impact.
    • Unintentional DoS: High Impact.
  • Currently Implemented:

    • LimitRule with Delay in initialization.go.
  • Missing Implementation:

    • Add RandomDelay to the existing LimitRule. Tune delays based on the target website. Consider a configuration file for domain-specific delays.
  • Description:

    1. Directly using Colly: Colly respects robots.txt by default.
    2. Ensure you do not set c.IgnoreRobotsTxt = true. This is a passive mitigation – simply avoid disabling the default behavior.
  • Threats Mitigated:

    • Legal and Ethical Issues (High Severity): Avoids violating terms of service.
    • Detection and Blocking (Low Severity): Some sites block scrapers that ignore robots.txt.
  • Impact:

    • Legal and Ethical Issues: High Impact.
    • Detection and Blocking: Low Impact.
  • Currently Implemented:

    • Colly's default behavior is in place (no code disabling it).
  • Missing Implementation:

    • No active monitoring of changes to the target's robots.txt.
  • Description:

    1. Directly using Colly: Use colly.LimitRule to set the Parallelism option.
    2. Parallelism limits the number of concurrent requests.
    3. Start with a low value (e.g., 2 or 3) and adjust as needed.
    4. Apply the rule: c.Limit(&colly.LimitRule{Parallelism: ...}).
  • Threats Mitigated:

    • Resource Exhaustion (Your System) (High Severity): Prevents overwhelming your system.
    • Unintentional DoS on Target (Medium Severity): Indirectly helps avoid overloading the target.
  • Impact:

    • Resource Exhaustion: High Impact.
    • Unintentional DoS: Medium Impact.
  • Currently Implemented:

    • LimitRule with Parallelism: 4 in initialization.go.
  • Missing Implementation:

    • Tune the concurrency limit based on system resources and target website tolerance.
  • Description:

    1. Directly using Colly: Set c.Async = true to enable asynchronous request handling. This allows Colly to manage multiple requests concurrently, potentially improving performance.
    2. Crucially: When using Async, you must use c.Wait() after starting the scraping process to ensure all asynchronous tasks are completed before your program exits.
    3. Also Crucially: Asynchronous mode makes it even more important to use colly.LimitRule to control Parallelism and prevent overwhelming your system or the target server.
  • Threats Mitigated:

    • Resource Exhaustion (Your System) (High Severity - if misused): While Async can improve efficiency, improper use can increase the risk of resource exhaustion. Careful management of concurrency is essential.
    • Unintentional DoS on Target (Medium Severity - if misused): Similar to resource exhaustion, uncontrolled asynchronous requests can increase the risk of overloading the target.
  • Impact:

    • Resource Exhaustion: Can be negative if not used with strict LimitRule settings.
    • Unintentional DoS: Can be negative if not used with strict LimitRule settings.
  • Currently Implemented:

    • Not implemented.
  • Missing Implementation:

    • If performance becomes a bottleneck, consider implementing c.Async = true, but only in conjunction with very careful tuning of Parallelism and Delay/RandomDelay in a colly.LimitRule. Thorough testing and monitoring are essential if using asynchronous mode.