Fixed royal road watermark issue #2538

Merged 3 commits into dipu-bd:dev on Jan 2, 2025

Conversation

dipu-bd
Owner

@dipu-bd dipu-bd commented Jan 2, 2025

No description provided.

@dipu-bd dipu-bd merged commit 2997161 into dipu-bd:dev Jan 2, 2025
8 checks passed
dipu-bd added a commit that referenced this pull request Jan 2, 2025
* Fixed royal road watermark issue. #2531

* Hacky solution for removing only watermarks without false positives

* fix format

---------

Co-authored-by: zGadli <[email protected]>
@xp3xp3

xp3xp3 commented Jan 4, 2025

If someone stumbles across this PR, here's a script that removes the current watermarks from EPUBs:

# pip install ebooklib beautifulsoup4 lxml
import os
from ebooklib import epub
from bs4 import BeautifulSoup

# Phrases are matched case-insensitively as substrings of a <div>'s text.
watermark_contains_phrase_set = {
    "is not rightfully on Amazon",
    "Royal Road",
    "report the violation.",
    "; please report.",
    "author's consent",
    ". Report it.",
    "is not meant to be on Amazon;",
    ". Please report it.",
    "report the infringement.",
    "without the author's approval.",
    "this story on Amazon",
    "if you see it on Amazon",
    "This story has been",
    "should you find it on Amazon",
    "support the author",
    "support the creat",
    "Ensure the author gets",
    ". Report sightings.",
    "without the author's",
    "authors get the support they deserve.",
    "author's preferred platform",
    "is published on a different ",
    "author. Report any sightings."
}

def is_lowest_level(tag):
    """Check if the tag is a <div> with no child elements."""
    return tag.name == "div" and not any(child.name for child in tag.children)

def filter_epub(input_path, output_path):
    # Load the EPUB
    book = epub.read_epub(input_path)
    
    for item in book.items:
        if item.media_type == "application/xhtml+xml":
            try:
                soup = BeautifulSoup(item.get_content(), "xml")

                # Find and remove <div> blocks at the lowest level containing watermark phrases
                for div in soup.find_all(is_lowest_level):
                    text = div.get_text(strip=True)
                    if text and any(phrase.lower() in text.lower() for phrase in watermark_contains_phrase_set):
                        print(f"Removed {text}")
                        div.decompose()

                # Ensure content is valid before updating
                if soup.body and soup.body.get_text(strip=True):
                    item.set_content(soup.encode('utf-8'))
                else:
                    print(f"Warning: Skipping empty content for {item.file_name}")
            except Exception as e:
                print(f"Error processing {item.file_name}: {e}")
        else:
            print(f"Skipping non-HTML item: {item.file_name} (type: {item.media_type})")

    # Save the modified EPUB
    try:
        epub.write_epub(output_path, book)
        print(f"Filtered EPUB saved as: {output_path}")
    except Exception as e:
        print(f"Failed to save EPUB: {e}")

if __name__ == "__main__":
    # Get all files in the current directory
    current_directory = os.getcwd()
    
    # Loop through all files and process those with .epub extension
    for filename in os.listdir(current_directory):
        if filename.lower().endswith(".epub") and not filename.lower().endswith("_filtered.epub"):
            input_path = os.path.join(current_directory, filename)
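            # Build the output name from the stem so ".EPUB" or a ".epub" mid-name cannot make it equal the input path
            output_path = os.path.join(current_directory, os.path.splitext(filename)[0] + "_filtered.epub")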
            print(f"Processing: {filename}")
            filter_epub(input_path, output_path)
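
Because the script matches plain substrings, it can help to see which phrases fire on a given piece of text before running it on a whole library. The following is a minimal, stdlib-only sketch of that check. `matching_phrases` is a hypothetical helper, not part of the script above, and the phrase subset and the story line are made up for illustration.

```python
# Hypothetical helper (not part of the script above): list which phrases fire on a text.
def matching_phrases(text, phrases):
    """Return the phrases found in text, case-insensitively, sorted for stable output."""
    lowered = text.lower()
    return sorted(p for p in phrases if p.lower() in lowered)


# A small subset of the phrase set from the script above.
phrases = {"Royal Road", "this story on Amazon", "report the violation."}

# One real watermark sample and one invented line of ordinary story text.
watermark = "If you spot this story on Amazon, know that it has been stolen. Report the violation."
story_line = "The caravan followed the royal road north."

print(matching_phrases(watermark, phrases))
print(matching_phrases(story_line, phrases))
```

The second call shows that a broad phrase such as "Royal Road" also matches ordinary prose. In the script this is limited by only checking `<div>` elements with no child elements, but a story that puts such text in a bare `<div>` could still lose it.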

Unfortunately, this PR misses a bunch of watermarks. Here's a sample of unique watermarks from ~1k pages:

A case of theft: this story is not rightfully on Amazon; if you spot it, report the violation.
Ensure your favorite authors get the support they deserve. Read this novel on the original website.
If you discover this narrative on Amazon, be aware that it has been unlawfully taken from Royal Road. Please report it.
If you encounter this narrative on Amazon, note that it's taken without the author's consent. Report it.
If you spot this story on Amazon, know that it has been stolen. Report the violation.
If you stumble upon this narrative on Amazon, be aware that it has been stolen from Royal Road. Please report it.
If you stumble upon this tale on Amazon, it's taken without the author's consent. Report it.
Unauthorized content usage: if you discover this narrative on Amazon, report the violation.
Unauthorized duplication: this narrative has been taken without consent. Report sightings.
Unauthorized tale usage: if you spot this story on Amazon, report the violation.
Unauthorized usage: this tale is on Amazon without the author's consent. Report any sightings.
Unauthorized use of content: if you find this story on Amazon, report the violation.
Love this story? Find the genuine version on the author's preferred platform and support their work!
Royal Road is the home of this novel. Visit there to read the original and support the author.
Taken from Royal Road, this narrative should be reported if found on Amazon.
This content has been unlawfully taken from Royal Road; report any instances of this story if found elsewhere.
This story has been unlawfully obtained without the author's consent. Report any appearances on Amazon.
This tale has been unlawfully lifted from Royal Road; report any instances of this story if found elsewhere.
This book was originally published on Royal Road. Check it out there for the real experience.
This content has been misappropriated from Royal Road; report any instances of this story if found elsewhere.
This narrative has been illicitly obtained; should you discover it on Amazon, report the violation.
You could be reading stolen content. Head to Royal Road for the genuine story.
You might be reading a stolen copy. Visit Royal Road for the authentic version.
Support the creativity of authors by visiting Royal Road for this novel and more.
Support creative writers by reading their stories on Royal Road, not stolen versions.
The author's content has been appropriated; report any instances of this story on Amazon.
This text was taken from Royal Road. Help the author by reading the original version there.
This novel is published on a different platform. Support the original author by finding the official source.
Unlawfully taken from Royal Road, this story should be reported if seen on Amazon.
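
A sample list like the one above can double as a quick regression check for any phrase list: run every sample through the matcher and print the ones nothing catches. Below is a rough sketch of that idea. `uncovered` is a hypothetical helper, and the phrase list here is deliberately cut down to three entries so that one sample slips through.

```python
# Hypothetical coverage check: which sample watermarks does a phrase list miss?
def uncovered(samples, phrases):
    """Return the samples that contain none of the phrases (case-insensitive)."""
    lowered_phrases = [p.lower() for p in phrases]
    return [s for s in samples if not any(p in s.lower() for p in lowered_phrases)]


# Deliberately reduced phrase list (an assumption for the demo, not the full set).
phrases = ["Royal Road", "report the violation.", "author's consent"]

# Three of the sample watermarks listed above.
samples = [
    "If you spot this story on Amazon, know that it has been stolen. Report the violation.",
    "You could be reading stolen content. Head to Royal Road for the genuine story.",
    "Unauthorized duplication: this narrative has been taken without consent. Report sightings.",
]

for missed in uncovered(samples, phrases):
    print(f"Not covered: {missed}")
```

With this reduced list only the third sample is reported, because none of the three phrases occurs in it. Feeding in the full sample list and the real phrase set would show which entries still need a phrase.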

@dipu-bd
Owner Author

dipu-bd commented Jan 7, 2025

I'll check it out

@dipu-bd dipu-bd mentioned this pull request Jan 7, 2025
3 participants