diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index b4416b77..8520fa6a 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -114,6 +114,8 @@ title: Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU - local: structured_generation_vision_language_models title: Structured Generation from Images or Documents Using Vision Language Models + - local: fine_tuning_granite_vision_sft_trl + title: Fine-tuning Granite Vision with TRL - title: Search Recipes isExpanded: false diff --git a/notebooks/en/fine_tuning_granite_vision_sft_trl.ipynb b/notebooks/en/fine_tuning_granite_vision_sft_trl.ipynb new file mode 100644 index 00000000..1bfbfbf7 --- /dev/null +++ b/notebooks/en/fine_tuning_granite_vision_sft_trl.ipynb @@ -0,0 +1,1187 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "vKadZFQ2IdJb" + }, + "source": [ + "# Fine-tuning Granite Vision 3.1 2B with TRL\n", + "\n", + "_Authored by: [Eli Schwartz](https://huggingface.co/elischwartz)_\n", + "\n", + "Adapted from [Sergio Paniego](https://github.com/sergiopaniego)'s [Notebook](https://huggingface.co/learn/cookbook/en/fine_tuning_smol_vlm_sft_trl)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JATmSI8mcyW2" + }, + "source": [ + "This recipe will enable you to fine-tune [IBM's Granite Vision 3.1 2B Model](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview).\n", + "It is a lightweight yet capable model trained by fine-tuning a [Granite language model](https://huggingface.co/ibm-granite/granite-3.1-2b-instruct) with both image and text modalities.\n", + "We will be using the Hugging Face ecosystem, leveraging the powerful [Transformer Reinforcement Learning library (TRL)](https://huggingface.co/docs/trl/index). 
This step-by-step guide will enable you to fine-tune Granite Vision for your specific tasks, even on consumer GPUs.\n", + "\n", + "### 🌟 Model & Dataset Overview\n", + "\n", + "In this notebook, we will fine-tune and evaluate the **[Granite Vision](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview)** model using the **[Geometric Perception](https://huggingface.co/datasets/euclid-multimodal/Geoperception)** dataset, which contains tasks that the model wasn't initially trained for. Granite Vision is a highly performant and memory-efficient model, making it ideal for fine-tuning on new tasks. The **Geometric Perception** dataset provides images of various geometric diagrams, compiled from high-school textbooks, paired with question-answer pairs.\n", + "\n", + "\n", + "This notebook was tested using an A100 GPU." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gSHmDKNFoqjC" + }, + "source": [ + "## 1. Install Dependencies\n", + "\n", + "Let’s start by installing the essential libraries we’ll need for fine-tuning! 
🚀\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "GCMhPmFdIGSb" + }, + "outputs": [], + "source": [ + "!pip install -q git+https://github.com/huggingface/transformers.git\n", + "!pip install -U -q trl datasets bitsandbytes peft accelerate\n", + "# Tested with transformers==4.49.0.dev0, trl==0.14.0, datasets==3.2.0, bitsandbytes==0.45.2, peft==0.14.0, accelerate==1.3.0" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "x6fAqSnKDtKg", + "outputId": "bb20c090-9769-4566-a0ef-b11522b4f62d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "FlashAttention is not installed\n" + ] + } + ], + "source": [ + "!pip install -q flash-attn --no-build-isolation\n", + "\n", + "try:\n", + " from flash_attn.flash_attention import FlashAttention\n", + " print(\"FlashAttention is installed\")\n", + " USE_FLASH_ATTENTION = True\n", + "except ImportError:\n", + " print(\"FlashAttention is not installed\")\n", + " USE_FLASH_ATTENTION = False" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g9QXwbJ7ovM5" + }, + "source": [ + "## 2. Load Dataset 📁\n", + "\n", + "We’ll load the **[Geometric Perception](https://huggingface.co/datasets/euclid-multimodal/Geoperception)** dataset, which provides images of various geometric diagrams, compiled from popular high-school textbooks, paired with question-answer pairs." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LBWqnXkcTN-s" + }, + "source": [ + "We’ll use the original system prompt used during the model training." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "IBvKAlXhI46X" + }, + "outputs": [], + "source": [ + "system_message = \"A chat between a curious user and an artificial intelligence assistant. 
The assistant gives helpful, detailed, and polite answers to the user's questions.\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IQ2zuDeGThNe" + }, + "source": [ + "For educational purposes, we’ll only train and evaluate on the Line Length Comparison task, specified in the \"predicate\" field of the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QFe_A78aIwK8" + }, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "dataset_id = \"euclid-multimodal/Geoperception\"\n", + "dataset = load_dataset(dataset_id)\n", + "dataset_LineComparison = dataset['train'].filter(lambda x: x['predicate'] == 'LineComparison')\n", + "train_test = dataset_LineComparison.train_test_split(test_size=0.5, seed=42)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "I0X4ZajcPV10" + }, + "source": [ + "Let’s take a look at the dataset structure. It includes an image, a question, an answer, and a \"predicate\" field, which we used to filter the dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "o2UKZj15jGwv", + "outputId": "6b4ff10f-9ddd-4f22-8254-d2d8ccf41d72" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "DatasetDict({\n", + " train: Dataset({\n", + " features: ['id', 'question', 'answer', 'predicate', 'image'],\n", + " num_rows: 697\n", + " })\n", + " test: Dataset({\n", + " features: ['id', 'question', 'answer', 'predicate', 'image'],\n", + " num_rows: 697\n", + " })\n", + "})" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_test" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z8UpL5ZxTakm" + }, + "source": [ + "We’ll format the dataset into a chatbot structure, with the system message, image, user query, and answer for each interaction.\n", + "\n", + "💡For more tips on using this model for inference, check out the [Model Card](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XG8NuzqjjbgI" + }, + "outputs": [], + "source": [ + "def format_data(sample):\n", + " return [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": [\n", + " {\n", + " \"type\": \"text\",\n", + " \"text\": system_message\n", + " }\n", + " ],\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": [\n", + " {\n", + " \"type\": \"image\",\n", + " \"image\": sample[\"image\"],\n", + " },\n", + " {\n", + " \"type\": \"text\",\n", + " \"text\": sample['question'],\n", + " }\n", + " ],\n", + " },\n", + " {\n", + " \"role\": \"assistant\",\n", + " \"content\": [\n", + " {\n", + " \"type\": \"text\",\n", + " \"text\": sample[\"answer\"]\n", + " }\n", + " ],\n", + " },\n", + " ]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1edUqNGWTtjA" + }, + "source": [ + "Now, let’s format the data using the chatbot structure. This will set up the interactions for the model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oSHNqk0dkxii" + }, + "outputs": [], + "source": [ + "train_dataset = [format_data(x) for x in train_test['train']]\n", + "test_dataset = [format_data(x) for x in train_test['test']]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[{'role': 'system',\n", + " 'content': [{'type': 'text',\n", + " 'text': \"A chat between a curious user and an artificial intelligence assistant. 
The assistant gives helpful, detailed, and polite answers to the user's questions.\"}]},\n", + " {'role': 'user',\n", + " 'content': [{'type': 'image',\n", + " 'image': },\n", + " {'type': 'text', 'text': 'Which line is longer, AC or BA?'}]},\n", + " {'role': 'assistant', 'content': [{'type': 'text', 'text': 'BA'}]}]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_dataset[200]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YY1Y_KDtoycB" + }, + "source": [ + "## 3. Load Model and Check Performance! 🤔\n", + "\n", + "Now that we’ve loaded the dataset, it’s time to load [IBM's Granite Vision Model](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview), a 2B parameter Vision Language Model (VLM) that offers state-of-the-art (SOTA) performance while being efficient in terms of memory usage.\n", + "\n", + "For a broader comparison of state-of-the-art VLMs, explore the [WildVision Arena](https://huggingface.co/spaces/WildVision/vision-arena) and the [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard), where you can find the best-performing models across various benchmarks.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PCJhM6tCw4lq" + }, + "outputs": [], + "source": [ + "import torch\n", + "from transformers import AutoModelForVision2Seq, AutoProcessor\n", + "\n", + "model_id = \"ibm-granite/granite-vision-3.1-2b-preview\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2HobU2iPUDWL" + }, + "source": [ + "Next, we’ll load the model and the processor to prepare for inference." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true, + "id": "awtjIq86JfFF", + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "3c063d29572546a1956d01504803d329", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Loading checkpoint shards: 0%| | 0/2 [00:00},\n", + " {'type': 'text', 'text': 'Which line is longer, AC or BD?'}]},\n", + " {'role': 'assistant', 'content': [{'type': 'text', 'text': 'BD'}]}]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_idx = 20\n", + "sample = test_dataset[test_idx]\n", + "sample" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3IK2HOMuRtY_" + }, + "source": [ + "Now, let’s take a look at the image corresponding to the sample. Can you answer the query based on the visual information?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QavnLzjJUbxf" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sample[1]['content'][0]['image']" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gpLfsCUtUW6I" + }, + "source": [ + "Let’s create a method that takes the model, processor, and sample as inputs to generate the model's answer. This will allow us to streamline the inference process and easily evaluate the VLM's performance." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_MoRTjFcE8qD" + }, + "outputs": [], + "source": [ + "def generate_text_from_sample(model, processor, sample, max_new_tokens=100, device=\"cuda\"):\n", + " # Prepare the text input by applying the chat template\n", + " text_input = processor.apply_chat_template(\n", + " sample[:2], # Use the sample without the assistant response\n", + " add_generation_prompt=True\n", + " )\n", + "\n", + " image_inputs = []\n", + " image = sample[1]['content'][0]['image']\n", + " if image.mode != 'RGB':\n", + " image = image.convert('RGB')\n", + " image_inputs.append([image])\n", + "\n", + " # Prepare the inputs for the model\n", + " model_inputs = processor(\n", + " #text=[text_input],\n", + " text=text_input,\n", + " images=image_inputs,\n", + " return_tensors=\"pt\",\n", + " ).to(device) # Move inputs to the specified device\n", + "\n", + " # Generate text with the model\n", + " generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)\n", + "\n", + " # Trim the generated ids to remove the input ids\n", + " trimmed_generated_ids = [\n", + " out_ids[len(in_ids):] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)\n", + " ]\n", + "\n", + " # Decode the output text\n", + " output_text = processor.batch_decode(\n", + " trimmed_generated_ids,\n", + " skip_special_tokens=True,\n", + " clean_up_tokenization_spaces=False\n", + " )\n", + "\n", + " return output_text[0] # Return the first decoded output text" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + }, + "id": "5UeNiMJC_uCk", + "outputId": "1e31833c-9484-464f-e6a0-e736b46ada65" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'The length of line segment AC is not explicitly provided in the given information, so it cannot be determined which line is longer between AC and BD.'" + ] + }, + "execution_count": 
14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output = generate_text_from_sample(model, processor, sample)\n", + "output" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ysh0e9DRUfF-" + }, + "source": [ + "It seems the model is unable to compare line lengths when they are not explicitly specified. To improve its performance, we can fine-tune the model on relevant data so that it learns the task and gives more accurate responses." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Sw3b76rawti6" + }, + "source": [ + "**Remove Model and Clean GPU**\n", + "\n", + "Before we proceed with training the model in the next section, let's clear the current variables and clean the GPU to free up resources.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dxkXZuUkvy8j" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "GPU allocated memory: 0.01 GB\n", + "GPU reserved memory: 0.02 GB\n" + ] + } + ], + "source": [ + "import gc\n", + "import time\n", + "\n", + "def clear_memory():\n", + " # Delete variables if they exist in the current global scope\n", + " if 'inputs' in globals(): del globals()['inputs']\n", + " if 'model' in globals(): del globals()['model']\n", + " if 'processor' in globals(): del globals()['processor']\n", + " if 'trainer' in globals(): del globals()['trainer']\n", + " if 'peft_model' in globals(): del globals()['peft_model']\n", + " if 'bnb_config' in globals(): del globals()['bnb_config']\n", + " time.sleep(2)\n", + "\n", + " # Garbage collection and clearing CUDA memory\n", + " gc.collect()\n", + " time.sleep(2)\n", + " torch.cuda.empty_cache()\n", + " torch.cuda.synchronize()\n", + " time.sleep(2)\n", + " gc.collect()\n", + " time.sleep(2)\n", + "\n", + " print(f\"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB\")\n", + " print(f\"GPU reserved 
memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB\")\n", + "\n", + "clear_memory()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YIZOIVEzQqNg" + }, + "source": [ + "## 4. Fine-Tune the Model using TRL\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yIrR9gP2z90z" + }, + "source": [ + "### 4.1 Load the Quantized Model for Training ⚙️\n", + "\n", + "Next, we’ll load the quantized model using [bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/index). If you want to learn more about quantization, check out [this blog post](https://huggingface.co/blog/merve/quantization) or [this one](https://www.maartengrootendorst.com/blog/quantization/).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zm_bJRrXsESg" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "332bc9732faf47199551056e64772a82", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Loading checkpoint shards: 0%| | 0/2 [00:00\", return_tensors=\"pt\")['input_ids'][0]\n", + " eos_token = processor.tokenizer(\"<|end_of_text|>\", return_tensors=\"pt\")['input_ids'][0]\n", + "\n", + " for i in range(batch[\"input_ids\"].shape[0]):\n", + " apply_loss = False\n", + " for j in range(batch[\"input_ids\"].shape[1]):\n", + " if not apply_loss:\n", + " labels[i][j] = -100\n", + " if ((j>=len(assistant_tokens)+1) and\n", + " torch.all(batch[\"input_ids\"][i][j+1-len(assistant_tokens):j+1]==assistant_tokens)):\n", + " apply_loss = True\n", + " if batch[\"input_ids\"][i][j]==eos_token:\n", + " apply_loss = False\n", + "\n", + " batch[\"labels\"] = labels\n", + "\n", + " return batch" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "skbpTuJlV8qN" + }, + "source": [ + "Now, we will define the [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer), which is a wrapper around the 
[transformers.Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class and inherits its attributes and methods. This class simplifies the fine-tuning process by properly initializing the [PeftModel](https://huggingface.co/docs/peft/v0.6.0/package_reference/peft_model) when a [PeftConfig](https://huggingface.co/docs/peft/v0.6.0/en/package_reference/config#peft.PeftConfig) object is provided. By using `SFTTrainer`, we can efficiently manage the training workflow and ensure a smooth fine-tuning experience for our Vision Language Model.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "k_jk-U7ULYtA" + }, + "outputs": [], + "source": [ + "from trl import SFTTrainer\n", + "\n", + "trainer = SFTTrainer(\n", + " model=model,\n", + " args=training_args,\n", + " train_dataset=train_dataset,\n", + " data_collator=collate_fn,\n", + " peft_config=peft_config,\n", + " tokenizer=processor.tokenizer,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NlDsh4WvWCx0" + }, + "source": [ + "Time to Train the Model! 🎉" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "p1rgMTBDLboO" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + " [44/44 13:38, Epoch 1/1]\n", + "<table>\n", + " <thead>\n", + " <tr><th>Step</th><th>Training Loss</th></tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr><td>10</td><td>1.081700</td></tr>\n", + " <tr><td>20</td><td>0.847900</td></tr>\n", + " <tr><td>30</td><td>0.540300</td></tr>\n", + " <tr><td>40</td><td>0.431300</td></tr>\n", + " </tbody>\n", + "</table>" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "TrainOutput(global_step=44, training_loss=0.695373223586516, metrics={'train_runtime': 838.1529, 'train_samples_per_second': 0.832, 'train_steps_per_second': 0.052, 'total_flos': 4.321343420750362e+16, 'train_loss': 0.695373223586516, 'epoch': 1.0})" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "trainer.train()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w6CykSCtX-Xa" + }, + "source": [ + "Let's save the results 💾" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tE8usZw0lgrL" + }, + "outputs": [], + "source": [ + "trainer.save_model(training_args.output_dir)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6yx_sGW42dN3" + }, + "source": [ + "## 5. Testing the Fine-Tuned Model 🔍\n", + "\n", + "Now that our Vision Language Model (VLM) is fine-tuned, it's time to evaluate its performance! In this section, we'll test the model on an example from the held-out test split of the Geometric Perception dataset to assess how accurately it answers line length comparison questions. Let's dive into the results and see how well it performs! 
🚀" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i0KEPu6qYKqn" + }, + "source": [ + "Let's clean up the GPU memory to ensure optimal performance 🧹" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Ttx6EK8Uy8t0", + "outputId": "e671dad3-bddc-4206-dcb9-78a55cc242c0" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "GPU allocated memory: 0.02 GB\n", + "GPU reserved memory: 0.19 GB\n" + ] + } + ], + "source": [ + "clear_memory()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HwCTPHsfujn2" + }, + "source": [ + "We will reload the base model using the same pipeline as before." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EFqTNUud2lA7" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "b20fdd2afe3449f2ab57cc4c9c009f5a", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Loading checkpoint shards: 0%| | 0/2 [00:00},\n", + " {'type': 'text', 'text': 'Which line is longer, AC or BD?'}]},\n", + " {'role': 'assistant', 'content': [{'type': 'text', 'text': 'BD'}]}]" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_idx = 20\n", + "sample = test_dataset[test_idx]\n", + "sample[1:]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ATuQ6ZS6eirO" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sample[1]['content'][0]['image']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 36 + }, + "id": "9yHJMKHNWcMc", + "outputId": "fef11c29-f8ed-4301-9f16-e4b963587988" + }, + 
"outputs": [ + { + "data": { + "text/plain": [ + "'BD'" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output = generate_text_from_sample(model, processor, sample)\n", + "output" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NUr6jmnAIlh1" + }, + "source": [ + "#### 🎉✨ The model has successfully learned to respond to the queries as specified in the dataset. We've achieved our goal! 🎉✨" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "machine_shape": "hm", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/notebooks/en/index.md b/notebooks/en/index.md index 058c052f..b31dd3e3 100644 --- a/notebooks/en/index.md +++ b/notebooks/en/index.md @@ -7,11 +7,11 @@ applications and solving various machine learning tasks using open-source tools Check out the recently added notebooks: +- [Fine-tuning Granite Vision 3.1 2B with TRL](fine_tuning_granite_vision_sft_trl) - [Post training an LLM for reasoning with GRPO in TRL](fine_tuning_llm_grpo_trl) - [Evaluating AI Search Engines with `judges` - the open-source library for LLM-as-a-judge evaluators](llm_judge_evaluating_ai_search_engines_with_judges_library) - [Structured Generation from Images or Documents Using Vision Language Models](structured_generation_vision_language_models) - [Vector Search on Hugging Face with the Hub as Backend](vector_search_with_hub_as_backend) -- [Multi-Agent Order Management System with MongoDB](mongodb_smolagents_multi_micro_agents) You can also check out the notebooks in the cookbook's 
[GitHub repo](https://github.com/huggingface/cookbook).