Skip to content

ADGEfficiency/climate-news-db

Folders and files

NameName
Last commit message
Last commit date

Latest commit

0340d88 · Jun 3, 2024
Sep 13, 2023
Dec 30, 2023
Sep 13, 2023
Jun 3, 2024
Sep 13, 2023
Jun 3, 2024
Sep 13, 2023
Dec 30, 2023
Sep 13, 2023
Sep 13, 2023
Sep 13, 2023
Dec 30, 2023
Jun 3, 2024
Sep 13, 2023
Sep 13, 2023
Sep 13, 2023
Sep 13, 2023
Apr 29, 2022
Sep 13, 2023
Sep 13, 2023

Repository files navigation

climate-news-db

The climate-news-db has two goals:

  1. create a dataset of climate change newspaper articles for NLP researchers,
  2. provide a web application for users to view climate change news.

Use

Crawling URLs

Pulls urls.jsonl from S3 and crawls articles into articles/{newspaper}.jsonl and into database:

$ make crawl

Regenerate Database

Take urls from articles/{newspaper}.jsonl and saves into database:

$ make regen-db

This is useful when you want to re-create the database without scraping articles.

Interactive Search for Getting URLs

Requires Go + Gum

$ ./scripts/search-cli.sh

Data Artifacts

Lineage

Loading
graph LR
  1(urls.jsonl) -->|make crawl| 2(articles.jsonl)
  2(articles.jsonl) -->|make crawl, make regen-db| 3(database)

urls.jsonl

{"url": "https://www.chinadaily.com.cn/a/202302/21/WS63f4aea4a31057c47ebb004e.html", "search_time_utc": "2023-03-20T00:05:02.998560"}
{"url": "https://www.chinadaily.com.cn/a/202301/19/WS63c8a4a8a31057c47ebaa8e4.html", "search_time_utc": "2023-03-20T00:05:02.998560"}

Append only storage of raw newspaper urls. Created by a daily Google search for each newspaper with the keywords climate change and climate crisis. This file contains many duplicates.

Infra

Webapp

Deployed as a Fly.IO app:

$ make deploy

AWS Infra

Deployed with AWS CDK:

$ make aws-infra