Understanding Vision Analysis

Vision analysis automatically examines property photos, enabling powerful visual search that goes beyond listing descriptions. Instead of relying solely on what agents write in MLS descriptions, buyers can search based on what they actually see in photos.

Traditional property search has a fundamental limitation: it can only find what’s written in the listing description. If an agent forgets to mention “granite countertops” or “mountain views,” that property won’t show up in searches—even though those features are clearly visible in the photos.

Vision analysis solves this by examining the photos themselves and automatically detecting features, styles, materials, and room types.

Every property photo is automatically analyzed to identify:

Room Type

Kitchen, bedroom, bathroom, living room, exterior, garage, etc.

Features

Countertops, appliances, flooring, fixtures, built-ins

Materials

Granite, hardwood, tile, stainless steel, quartz, marble

Style & Condition

Modern, traditional, farmhouse, craftsman, updated, original

The process happens in four stages: photo ingestion, AI analysis, feature extraction, and indexing.

When a property syncs from MLS, all photos are downloaded and stored locally. Each photo gets a unique ID and is linked to the listing.
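Concretely, the per-photo record created at sync time might look like the following sketch. The class and field names are hypothetical, not the system's actual schema:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class PropertyPhoto:
    """One downloaded MLS photo, linked back to its listing (illustrative schema)."""
    listing_id: str     # the MLS listing this photo belongs to
    local_path: str     # where the downloaded image is stored locally
    photo_id: str = field(default_factory=lambda: uuid.uuid4().hex)  # unique ID

photo = PropertyPhoto(listing_id="MLS-12345", local_path="photos/MLS-12345/01.jpg")
```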

Each photo is processed through two different AI models:

MiniCPM-V generates natural language descriptions:

“Modern kitchen with white cabinets, granite countertops, stainless steel appliances, and pendant lighting over a large island. Hardwood floors and subway tile backsplash.”

SigLIP generates visual embeddings (mathematical representations of what the photo looks like). These embeddings enable “find me photos that look like this” searches.
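A "find me photos that look like this" search reduces to ranking candidate embeddings by similarity to the query embedding. A minimal pure-Python sketch, using toy 4-dimensional vectors in place of SigLIP's 512-dimensional embeddings (all values are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Similarity between two embedding vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings standing in for SigLIP's 512-dimensional output.
query_photo = [0.9, 0.1, 0.0, 0.4]
candidates = {
    "photo_a": [0.8, 0.2, 0.1, 0.5],  # visually close to the query
    "photo_b": [0.0, 0.9, 0.8, 0.1],  # visually different
}

# Rank candidates by how similar they look to the query photo.
ranked = sorted(candidates,
                key=lambda p: cosine_similarity(query_photo, candidates[p]),
                reverse=True)
```

In production this lookup would run against a vector index rather than a linear scan, but the ranking principle is the same.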

The natural language descriptions are parsed to extract structured features:

Category     Extracted Features
-----------  ---------------------------------------------------------
Room Type    kitchen, bedroom, bathroom, living_room, exterior, garage
Materials    granite, hardwood, tile, stainless_steel, quartz, marble
Features     island, pantry, fireplace, pool, deck, walk_in_closet
Style        modern, traditional, farmhouse, craftsman, contemporary
Appliances   gas_range, double_oven, dishwasher, refrigerator
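This parsing step can be sketched as keyword extraction over a per-category vocabulary. The vocabulary below is a hypothetical subset; the real system's term lists may differ:

```python
# Hypothetical vocabulary per category (illustrative subset).
FEATURE_VOCAB = {
    "room_type": ["kitchen", "bedroom", "bathroom", "living room", "exterior", "garage"],
    "materials": ["granite", "hardwood", "tile", "stainless steel", "quartz", "marble"],
    "features":  ["island", "pantry", "fireplace", "pool", "deck", "walk-in closet"],
    "style":     ["modern", "traditional", "farmhouse", "craftsman", "contemporary"],
}

def extract_features(description: str) -> dict:
    """Scan a generated photo description for known feature keywords."""
    text = description.lower()
    return {
        category: [term.replace(" ", "_").replace("-", "_")
                   for term in vocab if term in text]
        for category, vocab in FEATURE_VOCAB.items()
    }

desc = ("Modern kitchen with white cabinets, granite countertops, "
        "stainless steel appliances, and pendant lighting over a large island.")
features = extract_features(desc)
```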

Both the natural language descriptions and visual embeddings are indexed for fast search:

  • Text search uses full-text indexing on descriptions
  • Visual search uses vector similarity on embeddings
  • Multi-modal search combines both for best results
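The text side of that indexing can be sketched as a tiny inverted index (a stand-in for a real full-text engine; the vector side would use an approximate-nearest-neighbor index instead):

```python
from collections import defaultdict

# Minimal inverted index over photo descriptions: token -> set of photo IDs.
index = defaultdict(set)
descriptions = {
    "p1": "modern kitchen with granite countertops and stainless steel appliances",
    "p2": "cozy living room with fireplace and hardwood floors",
}
for photo_id, text in descriptions.items():
    for token in text.split():
        index[token].add(photo_id)

def search(query: str) -> set:
    """Return photo IDs whose descriptions contain every query token."""
    tokens = query.lower().split()
    return set.intersection(*(index[t] for t in tokens)) if tokens else set()
```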

Vision analysis enables three types of search:

Text Search

Search using the automatically generated photo descriptions: the system matches your query against the detailed text generated for each photo.

Example query: “granite countertops and stainless steel appliances”

This finds all photos whose descriptions mention those features—even if the listing description doesn’t.

Visual Search

Find photos that look similar using visual embeddings. This searches directly against what’s in the image, not the text describing it.

Example query: “modern white kitchen”

The system finds kitchens that visually resemble modern white kitchens—similar color palettes, layout, style—regardless of how they’re described.

Multi-Modal Search

Combines text and visual search for the best results, returning properties ranked by both textual relevance and visual similarity.

Example query: “open concept living room with vaulted ceilings”

Results are scored on both:

  • How well the description matches the query
  • How visually similar the photos are to typical open-concept vaulted living rooms
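One common way to combine the two signals is a weighted sum of normalized scores. The 50/50 weighting below is illustrative, not the system's actual formula:

```python
def hybrid_score(text_score: float, visual_score: float, alpha: float = 0.5) -> float:
    """Blend textual relevance with visual similarity.

    Assumes both scores are normalized to [0, 1]; the 50/50 default
    weighting (alpha) is an illustrative assumption.
    """
    return alpha * text_score + (1 - alpha) * visual_score

# A strong description match can outrank a somewhat better-looking photo:
strong_text = hybrid_score(text_score=0.9, visual_score=0.5)
strong_visual = hybrid_score(text_score=0.4, visual_score=0.8)
```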

Vision analysis means you can search for what you actually want to see, not just what’s written:

  • “Show me kitchens with white cabinets” — finds them even if description says “light cabinetry”
  • “Find properties with mountain views” — detects views in photos even if not mentioned
  • “Houses with updated bathrooms” — identifies modern fixtures and finishes visually

You’re no longer limited by how thoroughly agents wrote their descriptions.

Vision analysis helps your listings get discovered:

  • Your photos do the talking—features don’t need to be in the description to be searchable
  • Buyers can find your listing through visual searches
  • Properties with great photos perform better in search results

This rewards good photography and complete photo galleries.

Photos are automatically categorized by room type, making it easy to:

  • View all kitchen photos across listings
  • Compare backyard and pool images
  • Find master bedroom suites
  • See garage and workshop spaces

Each photo gets detailed auto-generated tags:

Room Type    Example Tags
-----------  ------------------------------------------------------------------
Kitchen      island, pantry, breakfast bar, gas range, subway tile backsplash
Bathroom     soaking tub, walk-in shower, double vanity, heated floors
Living Room  vaulted ceilings, fireplace, built-in shelving, hardwood floors
Exterior     covered patio, pool, mountain views, landscaped yard, 3-car garage

These tags enable precise filtering and searching.

Vision analysis is only as good as your photos. To maximize discoverability:

  1. Take high-quality photos — Well-lit, clear images get better analysis
  2. Show key features — Include close-ups of granite, hardwood, fixtures
  3. Capture room context — Wide shots show layout and style
  4. Highlight unique features — Pool, views, custom built-ins deserve their own photos

The system analyzes all photos, but prominent features in early photos get weighted higher. Structure your photo gallery:

  1. Hero photo — Best feature or exterior curb appeal
  2. Living spaces — Great room, kitchen, dining
  3. Bedrooms — Master suite first
  4. Bathrooms — Master bath, then guest baths
  5. Special features — Pool, views, bonus rooms
  6. Exterior/yard — Backyard, landscaping, garage

Vision analysis tracks which features attract attention:

  • Dwell time — How long buyers view each photo
  • Zoom interactions — Photos buyers zoom into for detail
  • Feature patterns — Which visual features correlate with engagement

Use this data to understand what buyers value and adjust future listings accordingly.

Vision analysis respects privacy:

  • Only MLS property listing photos are analyzed
  • No personal photos or documents are processed
  • All processing follows MLS data usage guidelines
  • Descriptions are stored locally, not sent to external services after processing

Photos are analyzed once during initial sync. Results are cached for fast search performance.

Vision analysis is powerful but not perfect.

It performs well at:

  • Identifying obvious features (pools, fireplaces, granite counters)
  • Detecting room types (kitchen vs. bathroom vs. bedroom)
  • Recognizing materials and finishes
  • Understanding general style (modern vs. traditional)

It struggles with:

  • Subtle details in low-quality photos
  • Features obscured by furniture or staging
  • Exact measurements (it can’t measure square footage from photos)
  • Condition assessment (it can’t definitively say “newly renovated”)
  • Features not visible in photos (buried utilities, structural elements)
The system provides confidence scores for detected features. Low-confidence detections are filtered out to maintain accuracy.
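The filtering step amounts to a simple threshold. The 0.7 cutoff and the detections below are illustrative, not the system's actual values:

```python
# Hypothetical detections with confidence scores; the threshold is illustrative.
CONFIDENCE_THRESHOLD = 0.7

detections = [
    {"feature": "granite_countertops", "confidence": 0.94},
    {"feature": "marble_backsplash",   "confidence": 0.41},  # too uncertain to keep
    {"feature": "kitchen_island",      "confidence": 0.88},
]

# Keep only features the model is reasonably sure about.
kept = [d["feature"] for d in detections if d["confidence"] >= CONFIDENCE_THRESHOLD]
```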

Model      Purpose                                  Provider
---------  ---------------------------------------  ---------------
MiniCPM-V  Natural language photo descriptions      OpenBMB
SigLIP     Visual embeddings for similarity search  Google Research

Both models run locally via Ollama—no photos are sent to external APIs.

  • Per photo: ~2-3 seconds on GPU, ~10-15 seconds on CPU
  • Full property (20 photos): ~1 minute on GPU, ~5 minutes on CPU
  • Batch indexing: Parallelized across available cores

Each analyzed photo requires:

  • Original image: ~500KB - 2MB (JPEG)
  • Description text: ~200-500 bytes
  • Visual embedding: 512 floats × 4 bytes = 2KB

For 10,000 properties with 20 photos each:

  • Total photos: 200,000
  • Storage needed: ~200GB images + ~400MB embeddings
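Those figures follow from simple arithmetic; the ~1MB average image size used here is an assumed midpoint of the 500KB–2MB range:

```python
# Back-of-envelope storage check for the figures above (sizes are estimates).
photos = 10_000 * 20              # 10,000 properties x 20 photos each
avg_image_bytes = 1_000_000       # assumed ~1MB midpoint of the 500KB-2MB range
embedding_bytes = 512 * 4         # 512 float32 values per SigLIP embedding

image_gb = photos * avg_image_bytes / 1e9      # ~200GB of images
embedding_mb = photos * embedding_bytes / 1e6  # ~410MB, i.e. the "~400MB" above
```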

We’re exploring additional vision capabilities:

  • Floor plan analysis — Automatically detect room layouts
  • Virtual staging — Show how empty rooms could look furnished
  • Style matching — “Find homes with similar style to this one”
  • Feature comparison — Side-by-side visual comparisons
  • Trend detection — “Most popular kitchen styles this year”