
Talk with Your Images Using Gemini

Enhance Image Analysis with AI-Powered Conversational Insights Using Gemini

Introduction

Have you ever wanted to extract meaningful insights from an image just by talking to it? Thanks to advancements in AI, you can now analyze and interact with your images using Google's Gemini AI. Whether it’s extracting text, identifying objects, or understanding complex visual elements, Gemini makes it easier than ever to engage with images in a conversational way.

Why Use AI for Image Analysis?

Traditionally, analyzing an image required complex computer vision techniques, but AI models like Gemini simplify the process by offering:

  • Automated Image Interpretation – Extracts text, objects, and contextual insights.
  • Conversational Responses – Allows you to interact with your images naturally.
  • Scalability – Works efficiently across multiple images with ease.

By encoding images into base64 and passing them to Gemini, we can leverage these capabilities seamlessly.

Setting Up the Environment

Before we dive into the code, ensure you have the necessary dependencies installed:

pip install --upgrade --quiet google-genai

Additionally, you need access to Google Cloud's Vertex AI platform, with the Gemini API enabled in your project.
Encoding Images to Base64

To send images to Gemini, we first convert them to base64, a text-safe encoding of the raw image bytes that is easy to pass around. Here's how you can do it:

import base64
from pathlib import Path

def encode_image_to_base64(image_path: str) -> str | None:
    """Read an image file from disk and return its base64-encoded contents."""
    image_file = Path(image_path)
    if not image_file.is_file():
        print(f"Error: File not found - {image_path}")
        return None
    try:
        # Read the raw bytes and encode them as a UTF-8 base64 string.
        return base64.b64encode(image_file.read_bytes()).decode("utf-8")
    except Exception as e:
        print(f"Error encoding image: {e}")
        return None
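
For a quick sanity check, you can call it directly (using the same file name as the complete script below):

encoded = encode_image_to_base64("download.jpg")
if encoded:
    print(f"Encoded image into {len(encoded)} base64 characters.")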

Getting AI Responses from Gemini

Once we have the base64-encoded image, we can pass it to Gemini for analysis.

Note: You need to set up the Google Cloud CLI on your system and authenticate before running the code.
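
Assuming you use Application Default Credentials, a typical one-time authentication flow from the terminal looks like:

gcloud auth application-default login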

Code:

import base64

from google import genai
from google.genai.types import Part

# Use the project ID and location of your Vertex AI project from the Google Cloud console
PROJECT_ID = "your-project-id"
LOCATION = "your-location"
MODEL_ID = "gemini-model-id"

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

def get_gemini_response(base64_image: str) -> str | None:
    prompt = (
        "You are an expert in analyzing images. Please extract all information from the provided image.\n"
        "Response Format: Simple Text, no markdown; no bullet points."
    )
    try:
        response = client.models.generate_content(
            model=MODEL_ID,
            contents=[
                # Decode the base64 string back to raw bytes; adjust mime_type
                # if your image is not a JPEG.
                Part.from_bytes(data=base64.b64decode(base64_image), mime_type="image/jpeg"),
                prompt,
            ],
        )
        return response.text if response else None
    except Exception as e:
        print(f"Error getting AI response: {e}")
        return None
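
Because contents accepts both an image part and plain text, you can turn this into a real conversation by replacing the fixed prompt with your own question. Here is a minimal sketch (the ask_about_image helper is an illustrative addition, reusing the client defined above):

def ask_about_image(base64_image: str, question: str) -> str | None:
    """Ask a free-form question about a base64-encoded JPEG image."""
    try:
        response = client.models.generate_content(
            model=MODEL_ID,
            contents=[
                Part.from_bytes(data=base64.b64decode(base64_image), mime_type="image/jpeg"),
                question,
            ],
        )
        return response.text if response else None
    except Exception as e:
        print(f"Error getting AI response: {e}")
        return None

# Hypothetical usage:
# answer = ask_about_image(encoded, "What colors stand out in this photo?")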

Running the Complete Script

Now, let’s put everything together and test our AI-powered image analysis:

def main():
    image_path = "download.jpg"
    base64_image = encode_image_to_base64(image_path)
    
    if base64_image:
        response = get_gemini_response(base64_image)
        if response:
            print("Response received:\n", response)
        else:
            print("Failed to get a response from Gemini.")
    else:
        print("Failed to encode the image.")


if __name__ == "__main__":
    main()

Test Example

For testing, let's assume we provide an image of a cat as input. The output response from Gemini could be:


Response received:
Here's what I can tell about the image: It's a close-up shot of a tabby cat. The cat has a brown and black striped coat, and its eyes appear to be a shade of green or yellow. The background is a dark solid color.

Conclusion

With just a few lines of Python code, you can now talk to your images and extract valuable insights using Google's Gemini AI. Whether you're analyzing historical documents, identifying objects, or automating workflows, this technique opens up a world of possibilities.

Next Steps

  • Try using different images and observe the responses (see the batch sketch after this list).
  • Experiment with different prompts for varied insights.
  • Integrate Gemini’s image analysis with chatbots or automation tools.
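
For the first point, a minimal batch sketch (reusing the functions above; the file names are placeholders) might look like:

for path in ["photo1.jpg", "photo2.jpg", "photo3.jpg"]:
    encoded = encode_image_to_base64(path)
    if encoded:
        print(f"--- {path} ---")
        print(get_gemini_response(encoded) or "No response.")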

Got any cool ideas for using Gemini with images? Share them in the comments! 🚀

Code

https://github.com/saswatsamal/talkwithphotowithgemini/

Acknowledgments

This is a project built during the Vertex sprints held by Google's ML Developer Programs team. Thanks to the MLDP team for their generous support in providing GCP credits to help facilitate this project.
