climate-news-db

The climate-news-db has two goals:

create a dataset of climate change newspaper articles for NLP researchers,
provide a web application for users to view climate change news.

Use

Crawling URLs

Pulls urls.jsonl from S3, then crawls the articles into articles/{newspaper}.jsonl and into the database:

$ make crawl

Regenerate Database

Takes the records from articles/{newspaper}.jsonl and saves them into the database:

$ make regen-db

This is useful when you want to re-create the database without scraping articles.
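What a rebuild like this involves can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: it assumes a SQLite database, a hypothetical `articles` table, and that every article record carries a `url` key. The function name `regen_db` is likewise made up for the example.

```python
import json
import sqlite3
from pathlib import Path


def regen_db(articles_dir: Path, conn: sqlite3.Connection) -> int:
    """Rebuild a hypothetical `articles` table from articles/{newspaper}.jsonl files.

    Returns the number of rows inserted.
    """
    # Start from an empty table so the result depends only on the JSONL files.
    conn.execute("DROP TABLE IF EXISTS articles")
    conn.execute(
        "CREATE TABLE articles (url TEXT PRIMARY KEY, newspaper TEXT, raw TEXT)"
    )
    inserted = 0
    for path in sorted(articles_dir.glob("*.jsonl")):
        newspaper = path.stem  # articles/{newspaper}.jsonl -> {newspaper}
        with path.open(encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                # Assumption: each record has a "url" key; repeated URLs are ignored.
                cur = conn.execute(
                    "INSERT OR IGNORE INTO articles VALUES (?, ?, ?)",
                    (record["url"], newspaper, json.dumps(record)),
                )
                inserted += cur.rowcount
    conn.commit()
    return inserted
```

Calling `regen_db(Path("articles"), sqlite3.connect("example.db"))` would rebuild the table from whatever is already on disk, without making any network requests.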

Interactive Search for Getting URLs

Requires Go and Gum:

$ ./scripts/search-cli.sh

Data Artifacts

Lineage

graph LR
  1(urls.jsonl) -->|make crawl| 2(articles.jsonl)
  2(articles.jsonl) -->|make crawl, make regen-db| 3(database)

urls.jsonl

{"url": "https://www.chinadaily.com.cn/a/202302/21/WS63f4aea4a31057c47ebb004e.html", "search_time_utc": "2023-03-20T00:05:02.998560"}
{"url": "https://www.chinadaily.com.cn/a/202301/19/WS63c8a4a8a31057c47ebaa8e4.html", "search_time_utc": "2023-03-20T00:05:02.998560"}

Append-only storage of raw newspaper URLs. Created by a daily Google search for each newspaper with the keywords "climate change" and "climate crisis". This file contains many duplicates.
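Because the file is append-only and the same URL can turn up in several daily searches, anything reading it will likely want to deduplicate first. How the project does this is not shown here. The following is a minimal sketch, assuming one JSON object per line with a `url` key as in the sample above. The helper name `unique_urls` is hypothetical, and the third sample record (a repeat with a later `search_time_utc`) is invented for illustration.

```python
import json
from typing import Iterable, Iterator


def unique_urls(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the first record seen for each URL, preserving file order."""
    seen: set[str] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record["url"] not in seen:
            seen.add(record["url"])
            yield record


# The first two records come from the sample above; the third is a made-up repeat.
sample = [
    '{"url": "https://www.chinadaily.com.cn/a/202302/21/WS63f4aea4a31057c47ebb004e.html", "search_time_utc": "2023-03-20T00:05:02.998560"}',
    '{"url": "https://www.chinadaily.com.cn/a/202301/19/WS63c8a4a8a31057c47ebaa8e4.html", "search_time_utc": "2023-03-20T00:05:02.998560"}',
    '{"url": "https://www.chinadaily.com.cn/a/202302/21/WS63f4aea4a31057c47ebb004e.html", "search_time_utc": "2023-03-21T00:05:02.998560"}',
]

records = list(unique_urls(sample))
print(len(records))  # 2
```

To run it against the real file, pass an open file handle instead of the list, for example `unique_urls(open("urls.jsonl", encoding="utf-8"))`. Keeping the first occurrence preserves the earliest `search_time_utc` for each URL.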

Infra

Webapp

Deployed as a Fly.io app:

$ make deploy

AWS Infra

Deployed with AWS CDK:

$ make aws-infra

