
Merge pull request #499 from tattle-made/development
merge dev to main
aatmanvaidya authored Jan 13, 2025
2 parents 0a12594 + d867266 commit ba0e008
Showing 39 changed files with 1,468 additions and 317 deletions.
6 changes: 6 additions & 0 deletions .github/workflows/pr-security.yml
@@ -37,6 +37,12 @@ jobs:
src: "."
continue-on-error: false

- name: Validate required fields in pyproject.toml
run: |
pip install tomli
python -m scripts.validate_toml_files
# - name: Run Trivy vulnerability scanner in repo mode
# uses: aquasecurity/trivy-action@fd25fed6972e341ff0007ddb61f77e88103953c2 # v0.21.0
# with:
52 changes: 52 additions & 0 deletions .github/workflows/pr-tests.yml
@@ -0,0 +1,52 @@
name: Run tests on PR

permissions:
contents: read

on:
pull_request:
branches:
- main
- development
- hotfix
types:
- opened
- synchronize
- reopened
- ready_for_review

jobs:
test:
if: github.event.pull_request.draft == false
name: Run tests
runs-on: ubuntu-latest

steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0
token: ${{ secrets.GITHUB_TOKEN }}

- name: Setup Python version
uses: actions/setup-python@v5
with:
python-version: "3.11"

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install .
- name: Run all Feluda Unit Tests
run: |
echo "Running all Feluda Unit Tests folder..."
for test_file in $(find tests/feluda_unit_tests -type f -name "test_*.py"); do
echo "############# Running file: $test_file #############"
python -m unittest $test_file
if [ $? -ne 0 ]; then
echo "Tests in $test_file failed"
exit 1
fi
echo "Run Successful"
done
13 changes: 10 additions & 3 deletions .pre-commit-config.yaml
@@ -20,6 +20,13 @@ repos:
# Run the linter.
- id: ruff
stages: [pre-commit]
# args: [--fix]
# Run the formatter.
# - id: ruff-format
args: [--fix]
# run ruff specifically for sorting imports.
- id: ruff
name: ruff-import-sort
stages: [pre-commit]
args: ["--select", "I", "--fix"]
# format code using ruff
- id: ruff
name: ruff-format
stages: [pre-commit]
199 changes: 2 additions & 197 deletions README.md
@@ -23,201 +23,6 @@ When we built Feluda, we were focusing on the unique challenges of social media


## Contributing
Please create a new Discussion [here](https://github.com/tattle-made/tattle-api/discussions) describing what you'd like to do, and we'll follow up.
You can find instructions on contributing on the [Wiki](https://github.com/tattle-made/feluda/wiki).

## Setup for Developing Locally

1. Set environment variables by replacing the credentials in `/src/api/.env-template` with your own credentials, then rename the file to `development.env`.
(For production, update the RabbitMQ and Elasticsearch hosts and credentials in the `.env` files.)

For development, replace the following in `development.env`:
- Replace the value of `MQ_USERNAME` with the value of `RABBITMQ_DEFAULT_USER` from `docker-compose.yml`
- Replace the value of `MQ_PASSWORD` with the value of `RABBITMQ_DEFAULT_PASS` from `docker-compose.yml`
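
A minimal sketch of the resulting `development.env` (the values below are placeholders; use whatever credentials your `docker-compose.yml` actually defines):

```
MQ_USERNAME=guest   # value of RABBITMQ_DEFAULT_USER in docker-compose.yml
MQ_PASSWORD=guest   # value of RABBITMQ_DEFAULT_PASS in docker-compose.yml
```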

2. Install packages for local development. (These are also installed automatically by `docker compose up`.)

```
# Install locally in venv
$ cd src/api/
$ pip install --require-hashes --no-deps -r requirements.txt
```


3. Run `docker-compose up`. This will bring up the following containers:

Elasticsearch: Used to store searchable representations of multilingual text, images and videos.

RabbitMQ: Used as a job queue to queue up long indexing jobs.

Search Indexer: A RabbitMQ consumer that receives any new jobs that are added to the queue and processes them.

Search Server: A public REST API to index new media and provide additional public APIs to interact with this service.

The first time you run `docker-compose up` it will take several minutes for all services to come up. It's usually instantaneous after that, as long as you don't make changes to the Dockerfile associated with each service.
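
To check the state of the containers at any point, you can use standard Docker Compose commands (nothing Feluda-specific here):

```
$ docker-compose ps
```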

4. To verify that every service is up, visit the following URLs:

Elasticsearch: http://localhost:9200

RabbitMQ UI: http://localhost:15672
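
A quick check from the terminal (a sketch, assuming the default ports above; the RabbitMQ management API expects the credentials set via `RABBITMQ_DEFAULT_USER` / `RABBITMQ_DEFAULT_PASS` in `docker-compose.yml`):

```
$ curl http://localhost:9200
$ curl -u <RABBITMQ_DEFAULT_USER>:<RABBITMQ_DEFAULT_PASS> http://localhost:15672/api/overview
```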

5. Install required operators. Each operator has to be installed separately:

```
# Install locally in venv
$ cd src/api/core/operators/
$ pip install --require-hashes --no-deps -r image_vec_rep_resnet_requirements.txt
$ pip install --require-hashes --no-deps -r vid_vec_rep_resnet_requirements.txt
# ... (and so on for the other operators)

# Build the operator docker images
$ cd src/api/
$ docker build -t image-operator -f Dockerfile.image_vec_rep_resnet .
$ docker build -t video-operator -f Dockerfile.vid_vec_rep_resnet .

# Run the docker images
$ docker run image-operator
$ docker run video-operator
```


6. Then, in a new terminal, start the server with:

```
$ cd src/api
$ docker exec -it feluda_api python server.py
```

7. Verify that the server is running by opening: http://localhost:7000


#### Server endpoints

http://localhost:7000/media : Receives image URLs / video URLs / text documents via POST requests and sends them to a RabbitMQ job queue. This queue is consumed by `receive.py` and the processed data is indexed into the appropriate Elasticsearch index. This endpoint is designed for fault-tolerant bulk indexing.

http://localhost:7000/upload_image : Receives an image URL via a POST request and indexes it in the Elasticsearch image index.

http://localhost:7000/upload_video : Receives a video URL via a POST request and indexes it in the Elasticsearch video index.

http://localhost:7000/upload_text : Receives a text document via a POST request and indexes it in the Elasticsearch text index.

The `/upload_image`, `/upload_video` and `/upload_text` endpoints index data directly (bypassing RabbitMQ) and are suitable for development / testing. Indices are defined and accessed according to the names specified in `.env` and the mappings specified in `indices.py`.

http://localhost:7000/search : Receives a query image / video / text and returns the top 10 matches found in the Elasticsearch index in descending order.
Note: A text search returns two sets of matches: `simple_text_matches` and `text_vector_matches`. The former is useful for same-language search and the latter for multilingual search.
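
For example, a direct indexing call might look like this (a sketch only: the JSON field name is an assumption; check the request handlers in `server.py` for the actual payload schema):

```
$ curl -X POST http://localhost:7000/upload_image \
    -H "Content-Type: application/json" \
    -d '{"image_url": "https://example.com/sample.jpg"}'
```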


#### Bulk indexing

Bulk indexing scripts for the data collected by various Tattle services should be located in the service's repository, such as [this one](https://github.com/tattle-made/sharechat-scraper/blob/development/workers/indexer/tattlesearch_indexer.py), and triggered as required. This makes the data searchable via this search API.
The indexing status of each record can be updated via a [reporter](https://github.com/tattle-made/sharechat-scraper/blob/development/workers/reporter/tattlesearch_reporter.py).
While the former fetches data from the service's MongoDB and sends it to the API via HTTP requests, the latter is a RabbitMQ consumer that consumes reports generated by `receive.py` and adds them to the DB.


#### Updating Packages

1. Update packages in `src/api/requirements.in` or the operator-specific requirements file:
`src/api/core/operators/<operator>_requirements.in`
2. Use `pip-compile` to generate `requirements.txt`

Note:

- Use a custom `tmp` directory to avoid memory issues.
- If an operator resolves to a higher version than feluda core's `requirements.txt` allows, manually edit `<operator>_requirements.txt` to the compatible version, then run `pip install` (see the verification sketch after the code block below). If it runs without errors, the package version is valid for the operator.

```bash
$ cd src/
$ pip install --upgrade pip-tools
$ TMPDIR=<temp_dir> pip-compile --verbose --allow-unsafe --generate-hashes --emit-index-url --emit-find-links requirements.in

# Updating operators
$ cd src/core/operators/
# The link for torch is required since PyPI only hosts the GPU version of torch packages.
$ TMPDIR=<temp_dir> pip-compile --verbose --allow-unsafe --generate-hashes --emit-index-url --emit-find-links --find-links https://download.pytorch.org/whl/torch_stable.html vid_vec_rep_resnet_requirements.in
$ TMPDIR=<temp_dir> pip-compile --verbose --allow-unsafe --generate-hashes --emit-index-url --emit-find-links --find-links https://download.pytorch.org/whl/torch_stable.html audio_vec_embedding_requirements.in
```
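
To verify a manually pinned operator version (as mentioned in the note above), install the edited file. A hand-edited version will no longer match the recorded hashes, so regenerate them with `pip-compile` or temporarily drop `--require-hashes` while testing:

```bash
$ cd src/api/core/operators/
$ pip install -r <operator>_requirements.txt
```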

#### Modify generated `requirements.txt` for platform specific torch packages

NOTE: Update the commands below to match the Python version of the docker image.

```bash
# Download the package to find its hash. If the package was previously downloaded
# without the hash, pip prints an error message containing the hash value; use that hash.

$ pip download --no-deps --require-hashes --python-version 311 --implementation cp --abi cp311 --platform linux_x86_64 --find-links https://download.pytorch.org/whl/torch_stable.html torch==2.2.0+cpu
$ pip download --no-deps --require-hashes --python-version 311 --implementation cp --abi cp311 --platform linux_x86_64 --find-links https://download.pytorch.org/whl/torch_stable.html torchvision==0.17.0+cpu
$ pip download --no-deps --require-hashes --python-version 311 --implementation cp --abi cp311 --platform manylinux2014_aarch64 --find-links https://download.pytorch.org/whl/cpu torch==2.2.0
$ pip download --no-deps --require-hashes --python-version 311 --implementation cp --abi cp311 --platform manylinux2014_aarch64 --find-links https://download.pytorch.org/whl/cpu torchvision==0.17.0
```
Replace the torch package lines in `requirements.txt` with the following (depending upon the hash values generated above):

```bash
# For arm64 architecture
--find-links https://download.pytorch.org/whl/cpu
torch==2.2.0; platform_machine=='aarch64' \
--hash=sha256:9328e3c1ce628a281d2707526b4d1080eae7c4afab4f81cea75bde1f9441dc78
# via
# -r vid_vec_rep_resnet_requirements.in
# torchvision
torchvision==0.17.0; platform_machine=='aarch64' \
--hash=sha256:3d2e9552d72e4037f2db6f7d97989a2e2f95763aa1861963a3faf521bb1610c4 \
# via -r vid_vec_rep_resnet_requirements.in

# For amd64 architecture
--find-links https://download.pytorch.org/whl/torch_stable.html
torch==2.2.0+cpu; platform_machine=='x86_64' \
--hash=sha256:15a657038eea92ac5db6ab97b30bd4b5345741b49553b2a7e552e80001297124 \
--hash=sha256:15e05748815545b6eb99196c0219822b210a5eff0dc194997a283534b8c98d7c \
--hash=sha256:2a8ff4440c1f024ad7982018c378470d2ae0a72f2bc269a22b1a677e09bdd3b1 \
--hash=sha256:4ddaf3393f5123da4a83a53f98fb9c9c64c53d0061da3c7243f982cdfe9eb888 \
--hash=sha256:58194066e594cd8aff27ddb746399d040900cc0e8a331d67ea98499777fa4d31 \
--hash=sha256:5b40dc66914c02d564365f991ec9a6b18cbaa586610e3b160ef559b2ce18c6c8 \
--hash=sha256:5f907293f5a58619c1c520380f17641f76400a136474a4b1a66c363d2563fe5e \
--hash=sha256:8258824bec0181e01a086aef5809016116a97626af2dcbf932d4e0b192d9c1b8 \
--hash=sha256:d053976a4f9ca3bace6e4191e0b6e0bcffbc29f70d419b14d01228b371335467 \
--hash=sha256:f72e7ce8010aa8797665ff6c4c1d259c28f3a51f332762d9de77f8a20277817f
# via
# -r vid_vec_rep_resnet_requirements.in
# torchvision
torchvision==0.17.0+cpu; platform_machine=='x86_64' \
--hash=sha256:00e88e9483e52f99fc61a73941b6ef0b59d031930276fc220ee8973170f305ff \
--hash=sha256:04e72249add0e5a0fc3d06a876833651e77eb6c3c3f9276e70d9bd67804c8549 \
--hash=sha256:39d3b3a80c63d18594e81829fdbd6108512dff98fa17156c7bec59133a0c1173 \
--hash=sha256:55660c67bd8d5b777984655116b75070c73d37ce64175a8120cb59010039fd7f \
--hash=sha256:569ebc5f47bb765ae73cd380ace01ddcb074c67df05d7f15f5ddd0fa3062881a \
--hash=sha256:701d7fcfdd8ed206dcb71774190152f0a2d6c999ad7cee277fc5a71a943ae64d \
--hash=sha256:b683d52753c5579a5b0250d7976deada17deab646071da289bd598d1af4877e0 \
--hash=sha256:bb787aab6daf2d72600c14cd7c3c11459701dc5fac07e790e0335777e20b39df \
--hash=sha256:da83b8a14d1b0579b1119e24272b0c7bf3e9ad14297bca87184d02c12d210501 \
--hash=sha256:eb1e9d061c528c8bb40436d445599ca05fa997701ac395db3aaec5cb7660b6ee
# via -r vid_vec_rep_resnet_requirements.in
```



#### Updating specific packages in `requirements.txt`

This is useful for updating specific dependencies, e.g. when `pip-audit` flags a package.

```bash
$ TMPDIR=<temp_dir> pip-compile --verbose --allow-unsafe --generate-hashes --find-links https://download.pytorch.org/whl/torch_stable.html --upgrade-package <package>==<version> --upgrade-package <package>

```

### Running Tests

To run a single test file, use the following command:

```bash
python -m unittest <FILE_NAME>.py
```

To run all the tests in a specific folder, run:

```bash
python -m unittest discover -s project_directory -p "test_*.py"
```
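
For example, to run the same suite that the PR workflow above runs:

```bash
python -m unittest discover -s tests/feluda_unit_tests -p "test_*.py"
```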

Read full test documentation [here](https://github.com/tattle-made/feluda/wiki/Running-Tests).

----
v: 0.0.8
#### Documentation for Setting up Feluda for Local Development - [Link to the Wiki](https://github.com/tattle-made/feluda/wiki/Setup-Feluda-Locally-for-Development)
2 changes: 1 addition & 1 deletion docker-compose.yml
@@ -3,7 +3,7 @@ version: "3.5"
services:
store:
container_name: es
image: docker.elastic.co/elasticsearch/elasticsearch@sha256:ec72548cf833e58578d8ff40df44346a49480b2c88a4e73a91e1b85ec7ef0d44 # docker.elastic.co/elasticsearch/elasticsearch:8.12.0
image: docker.elastic.co/elasticsearch/elasticsearch:8.16.0
volumes:
- ./.docker/es/data:/var/lib/elasticsearch/data
ports:
