Vision analysis automatically examines property photos, enabling powerful visual search that goes beyond listing descriptions. Instead of relying solely on what agents write in MLS descriptions, buyers can search based on what they actually see in photos.
Traditional property search has a fundamental limitation: it can only find what’s written in the listing description. If an agent forgets to mention “granite countertops” or “mountain views,” that property won’t show up in searches—even though those features are clearly visible in the photos.
Vision analysis solves this by examining the photos themselves and automatically detecting features, styles, materials, and room types.
Every property photo is automatically analyzed to identify:
- **Room Type:** Kitchen, bedroom, bathroom, living room, exterior, garage, etc.
- **Features:** Countertops, appliances, flooring, fixtures, built-ins
- **Materials:** Granite, hardwood, tile, stainless steel, quartz, marble
- **Style & Condition:** Modern, traditional, farmhouse, craftsman, updated, original
The process happens in several stages:
When a property syncs from MLS, all photos are downloaded and stored locally. Each photo gets a unique ID and is linked to the listing.
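A minimal sketch of that ingestion step, assuming a simple record type (the field names and the `ingest_photos` helper are illustrative, not the actual schema):

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class PhotoRecord:
    """One locally stored MLS photo, linked to its listing."""
    listing_id: str
    local_path: str
    # Unique ID assigned when the photo is downloaded.
    photo_id: str = field(default_factory=lambda: uuid.uuid4().hex)

def ingest_photos(listing_id: str, paths: list[str]) -> list[PhotoRecord]:
    """Create one record per downloaded photo for a listing."""
    return [PhotoRecord(listing_id, p) for p in paths]

photos = ingest_photos("MLS-1042", ["photos/kitchen.jpg", "photos/bath.jpg"])
```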
Each photo is processed through two different AI models:
MiniCPM-V generates natural language descriptions:
“Modern kitchen with white cabinets, granite countertops, stainless steel appliances, and pendant lighting over a large island. Hardwood floors and subway tile backsplash.”
SigLIP generates visual embeddings (mathematical representations of what the photo looks like). These embeddings enable “find me photos that look like this” searches.
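The two-model step can be sketched as follows. The model calls are stubbed out here: `describe_photo` and `embed_photo` are placeholders for the local MiniCPM-V and SigLIP invocations, not real API calls.

```python
def describe_photo(image_path: str) -> str:
    """Stand-in for MiniCPM-V: returns a natural language caption.
    (The real system runs the model locally.)"""
    return "Modern kitchen with granite countertops and stainless steel appliances."

def embed_photo(image_path: str) -> list[float]:
    """Stand-in for SigLIP: returns a visual embedding vector.
    Real embeddings have hundreds of dimensions."""
    return [0.12, -0.48, 0.33]

def analyze(image_path: str) -> dict:
    """Run both models on one photo and bundle the results."""
    return {
        "path": image_path,
        "description": describe_photo(image_path),
        "embedding": embed_photo(image_path),
    }

result = analyze("photos/kitchen.jpg")
```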
The natural language descriptions are parsed to extract structured features:
| Category | Extracted Features |
|---|---|
| Room Type | kitchen, bedroom, bathroom, living_room, exterior, garage |
| Materials | granite, hardwood, tile, stainless_steel, quartz, marble |
| Features | island, pantry, fireplace, pool, deck, walk_in_closet |
| Style | modern, traditional, farmhouse, craftsman, contemporary |
| Appliances | gas_range, double_oven, dishwasher, refrigerator |
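A minimal sketch of the parsing step: scan a generated description for known vocabulary and normalize matches to snake_case tags. The vocabulary below uses only the example terms from the table above, not the full lists:

```python
VOCAB = {
    "materials": ["granite", "hardwood", "tile", "stainless steel", "quartz", "marble"],
    "features": ["island", "pantry", "fireplace", "pool", "deck", "walk-in closet"],
    "style": ["modern", "traditional", "farmhouse", "craftsman", "contemporary"],
}

def extract_features(description: str) -> dict[str, list[str]]:
    """Return matched terms per category, normalized to snake_case tags."""
    text = description.lower()
    found = {}
    for category, terms in VOCAB.items():
        hits = [t.replace(" ", "_").replace("-", "_") for t in terms if t in text]
        if hits:
            found[category] = hits
    return found

desc = ("Modern kitchen with white cabinets, granite countertops, "
        "stainless steel appliances, and pendant lighting over a large island.")
tags = extract_features(desc)
```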
Both the natural language descriptions and visual embeddings are indexed for fast search.
Vision analysis enables three types of search:
**Text search:** searches the detailed descriptions automatically generated for each photo.
Example query: “granite countertops and stainless steel appliances”
This finds all photos whose descriptions mention those features—even if the listing description doesn’t.
**Visual similarity search:** finds photos that look alike using visual embeddings. This searches directly against what's in the image, not the text describing it.
Example query: “modern white kitchen”
The system finds kitchens that visually resemble modern white kitchens—similar color palettes, layout, style—regardless of how they’re described.
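Visual similarity comes down to comparing embedding vectors, typically by cosine similarity. A toy sketch with 2-D vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 2-D gallery; the query embedding stands in for "modern white kitchen".
gallery = {
    "photo_a": [0.9, 0.1],   # visually close to the query
    "photo_b": [0.0, 1.0],   # visually unrelated
}
query = [1.0, 0.0]
best = max(gallery, key=lambda pid: cosine_similarity(query, gallery[pid]))
```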
**Hybrid search:** combines text and image search for the best results, returning properties ranked by both textual relevance and visual similarity.
Example query: “open concept living room with vaulted ceilings”
Results are scored on both textual relevance and visual similarity.
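One common way to combine the two signals is a weighted sum; the 50/50 weighting below is illustrative, not the production formula:

```python
def hybrid_score(text_score: float, visual_score: float,
                 text_weight: float = 0.5) -> float:
    """Blend text relevance with visual similarity.
    The 50/50 weighting is an illustrative default."""
    return text_weight * text_score + (1 - text_weight) * visual_score

results = [
    {"id": "prop_a", "text": 0.9, "visual": 0.2},
    {"id": "prop_b", "text": 0.5, "visual": 0.9},
]
# Rank by the blended score, best first.
ranked = sorted(results,
                key=lambda r: hybrid_score(r["text"], r["visual"]),
                reverse=True)
```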
Vision analysis means you can search for what you actually want to see, not just what's written.
You’re no longer limited by how thoroughly agents wrote their descriptions.
Vision analysis also helps your listings get discovered: buyers can find features shown in your photos even when the written description omits them. This rewards good photography and complete photo galleries.
Photos are automatically categorized by room type, making it easy to browse and filter a listing's gallery by room.
Each photo gets detailed auto-generated tags:
| Room Type | Example Tags |
|---|---|
| Kitchen | island, pantry, breakfast bar, gas range, subway tile backsplash |
| Bathroom | soaking tub, walk-in shower, double vanity, heated floors |
| Living Room | vaulted ceilings, fireplace, built-in shelving, hardwood floors |
| Exterior | covered patio, pool, mountain views, landscaped yard, 3-car garage |
These tags enable precise filtering and searching.
Vision analysis is only as good as your photos. The system analyzes every photo, but prominent features in early photos are weighted higher, so structure your gallery to lead with the rooms and features you most want discovered.
Vision analysis tracks which features attract buyer attention in search.
Use this data to understand what buyers value and adjust future listings accordingly.
Vision analysis respects privacy: all processing runs locally, and photos are never sent to external services.
Photos are analyzed once during initial sync. Results are cached for fast search performance.
Vision analysis is powerful but not perfect; detections can occasionally be wrong or incomplete.
The system provides confidence scores for detected features. Low-confidence detections are filtered out to maintain accuracy.
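Confidence filtering can be sketched as a simple threshold; the 0.7 cutoff below is an assumption for illustration, not the system's actual value:

```python
CONFIDENCE_THRESHOLD = 0.7  # illustrative cutoff, not the production value

def keep_confident(detections: list[tuple[str, float]]) -> list[str]:
    """Drop low-confidence feature detections before they reach the index."""
    return [feature for feature, confidence in detections
            if confidence >= CONFIDENCE_THRESHOLD]

kept = keep_confident([("granite", 0.95), ("marble", 0.40), ("quartz", 0.71)])
```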
| Model | Purpose | Provider |
|---|---|---|
| MiniCPM-V | Natural language photo descriptions | OpenBMB |
| SigLIP | Visual embeddings for similarity search | Google Research |
Both models run locally via Ollama—no photos are sent to external APIs.
Each analyzed photo requires a stored description, a stored embedding, and one-time model processing. For 10,000 properties with 20 photos each, that's 200,000 photos to analyze and index.
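The arithmetic behind that example, with an assumed embedding size (the 768-dimension float32 vectors are illustrative, not the system's actual figures):

```python
# Back-of-envelope index sizing. Embedding dimension and float width
# are assumptions for illustration, not the system's actual figures.
properties = 10_000
photos_per_property = 20
embedding_dim = 768      # assumed embedding size
bytes_per_float = 4      # float32

total_photos = properties * photos_per_property
embedding_bytes = total_photos * embedding_dim * bytes_per_float
print(f"{total_photos:,} photos, "
      f"{embedding_bytes / 1024**2:.0f} MiB of raw embeddings")
```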
We’re exploring additional vision capabilities for future releases.