Job data validation with Indeed website #229

Open
samshipengs opened this issue Dec 23, 2024 · 3 comments

@samshipengs

samshipengs commented Dec 23, 2024

I'm trying to validate the Indeed data I got from JobSpy against the job listings I see directly on the Indeed website, given the same search params, such as:

  • radius 50km
  • same location
  • same job title
  • within last 24hrs

  1. I'm seeing some records whose date_posted is, for example, 10 days old (see attachment), yet when I go to Indeed and search within the last 24hrs, the same listing shows up. Shouldn't that listing have a date_posted of either Dec 23 (today) or Dec 22, instead of Dec 18?
Screenshot 2024-12-23 at 12 25 07 PM
  2. In general, is looking listings up on the Indeed website a robust way to validate the data we get from JobSpy, or might there be discrepancies due to non-obvious things?

  3. I took a quick look at the code and found this line:

f'location: {{where: "{self.scraper_input.location}", radius: {self.scraper_input.distance}, radiusUnit: MILES}}'

I'm searching the ca website, which defaults to km for the radius, so for validation purposes I suppose I should set the unit to something like KMS instead of MILES?

thanks!

@cullenwatson
Collaborator

  1. The date shown on the Indeed website is the date the listing was made available on Indeed. It could have been available earlier on other job boards, so we're using "datePublished", not "dateOnIndeed".
  2. Yes, you can check against the website to validate. I believe that's the only way.
  3. Or just convert: radius = km * 0.621371
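To illustrate point 1, here is a minimal sketch of how one might separate scraped rows whose date_posted falls inside the search window from rows that predate it. It assumes the rows have already been turned into plain dicts with a `date_posted` value of type `datetime.date`; the helper name `split_by_window` and the sample rows are hypothetical and not part of JobSpy.

```python
from datetime import date, datetime, timedelta


def split_by_window(records, now, hours_old):
    """Split records into (inside, older) relative to the search window."""
    # Earliest calendar date that still counts as inside the window.
    cutoff = (now - timedelta(hours=hours_old)).date()
    inside, older = [], []
    for record in records:
        if record["date_posted"] >= cutoff:
            inside.append(record)
        else:
            older.append(record)
    return inside, older


# Illustrative rows only, not real scrape output.
records = [
    {"title": "Job A", "date_posted": date(2024, 12, 23)},
    {"title": "Job B", "date_posted": date(2024, 12, 18)},
]
inside, older = split_by_window(records, datetime(2024, 12, 23, 12, 0), hours_old=24)
print([r["title"] for r in older])
```

Rows that land in `older` are not necessarily wrong: if the published date is earlier than the date the listing reached Indeed, they can still legitimately appear in a "last 24hrs" search on the website.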

@samshipengs
Author

samshipengs commented Jan 6, 2025

Just tried it. It turns out ScraperInput declares the field as distance: int | None = None, so passing 50km converted to miles as a float doesn't work, but forcing it to an int should be fine as long as the loss of precision is tolerable. @cullenwatson
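A small sketch of that conversion, assuming the only constraint is that the value must be a whole number of miles. The helper names below are hypothetical. Note that `round` picks the nearest mile while `int(...)` truncates, so the two can differ by one mile for some inputs.

```python
KM_PER_MILE = 1.609344  # exact definition of the international mile


def km_to_whole_miles(km: float) -> int:
    """Convert kilometres to the nearest whole number of miles."""
    return round(km / KM_PER_MILE)


def rounding_error_km(km: float) -> float:
    """Difference (in km) between the rounded radius and the requested one."""
    return km_to_whole_miles(km) * KM_PER_MILE - km


miles = km_to_whole_miles(50)
print(miles, rounding_error_km(50))
```

For a 50km radius this gives 31 miles, which is roughly 0.11km short of the requested radius, so listings right at the edge of the search area could differ between the website and the scrape.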

@samshipengs
Author

samshipengs commented Jan 6, 2025

Just a follow-up on this. If we look at a job search for VP of finance on indeed.ca with the last 7 days filter and Ontario as the location

image

it shows 3 jobs in total.

But if we do the same search (or at least what I think is the same search) with jobspy:

from jobspy import scrape_jobs

data = scrape_jobs(
    site_name="indeed",
    search_term="VP of finance",
    is_remote=False,
    location="Ontario",
    results_wanted=200,
    hours_old=24 * 7,  # last 7 days
    country_indeed="Canada",
    verbose=True,
    distance=int(50 / 1.60934),  # 50km expressed as whole miles (31)
)

it returns more jobs:

image

but 2 of the 3 jobs I see in the browser are missing from the result:
Believeco not found
                        title  company date_posted
15  Vice President of Finance  Symtech  2024-12-30
ERP Buddies not found

Even if I extend hours_old to 24 * 30, the two jobs are still missing. (My understanding from the earlier replies is that the hours_old filter might be applied to the published date, which can be further back than the date posted on Indeed.)

@cullenwatson Any idea what's causing this, or am I missing something?
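For reference, a check like the one above can be sketched as follows. It assumes the scraped rows are available as plain dicts with a `company` key; the helper `find_company` is hypothetical, and apart from the Symtech row quoted above the sample data is made up for illustration. Matching is case-insensitive and by substring, so small differences in how a company name is written do not cause false "not found" results.

```python
def find_company(records, name):
    """Return records whose company contains `name`, ignoring case."""
    needle = name.casefold()
    # `or ""` guards against rows where the company is missing (None).
    return [r for r in records if needle in (r.get("company") or "").casefold()]


records = [
    {"title": "Vice President of Finance", "company": "Symtech", "date_posted": "2024-12-30"},
    # Hypothetical row with no company, to show that it is skipped safely.
    {"title": "Finance Director", "company": None, "date_posted": "2024-12-29"},
]

for company in ["Believeco", "Symtech", "ERP Buddies"]:
    matches = find_company(records, company)
    print(company, "found" if matches else "not found")
```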

cullenwatson reopened this Jan 6, 2025