
Self-Improving Small Object Grounding in LVLMs

Large vision-language models (LVLMs) localize large objects well but often miss small ones. We show that their internal attention already signals which candidate boxes are accurate, and we use that signal to select better boxes with no fine-tuning.

Tianze Yang  ·  Yucheng Shi  ·  Ruitong Sun  ·  Ninghao Liu  ·  Jin Sun
University of Georgia

Can the internal attention of an LVLM tell us which small-object boxes are reliable — without any fine-tuning?

We give an affirmative answer. Attention structure in LVLMs encodes grounding quality: a lightweight IoU regressor trained solely on attention maps predicts box quality well (Pearson r > 0.67). This regressor powers ACS-Learned, the learned variant of our Attention-based Candidate Selection (ACS) framework, which picks the best box among multiple sampled candidates. By analyzing what the regressor learns, we reveal which transformer layers and heads matter most and derive ACS-Free: a training-free selector that ranks candidates by attention entropy on those discriminative heads — no learned component at inference. On COCO and Objects365, ACS delivers up to 19% self-improvement on small-object localization, with ACS-Free ranking best among all training-free methods.

Figure: LVLMs nail large objects (IoU 0.98) but stumble on small ones. The gap is large, and our selection turns a greedy IoU of 0.49 into 0.89.
The problem

Small objects break LVLM localization

State-of-the-art LVLMs such as Qwen2.5-VL and InternVL-3.5 can output bounding boxes directly, no detection head required. Yet performance collapses on small objects — exactly the distant pedestrians, signs, and tools that matter most for safety-critical perception.

Prior fixes add fine-tuning, external detectors, or hand-tuned decoding — all of which cost compute or change the model. We ask whether the model's own internal signals are enough.

Method

Attention-based Candidate Selection

Sample several candidate boxes, read out each one's attention maps, and let attention decide which box to trust. Two selectors emerge — one learned, one entirely training-free.

Figure: ACS pipeline. Left: we train an Attention-IoU regressor to predict box quality from attention alone. Right: at inference, ACS-Learned scores candidates with the regressor, while ACS-Free ranks them by entropy on the discriminative heads the regressor revealed.
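As a rough illustration of the training side of the pipeline, the sketch below fits a small regressor that maps per-candidate attention features to IoU and reports the Pearson correlation on held-out candidates. Everything here is an assumption made for illustration: the source does not describe the regressor's architecture, so closed-form ridge regression stands in for it, the feature vectors are synthetic, and `fit_ridge`, `predict_iou`, and `pearson_r` are hypothetical helper names.

```python
import numpy as np

def add_bias(features):
    """Append a constant column so the model can learn an intercept."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_ridge(features, ious, l2=1.0):
    """Closed-form ridge regression (the bias is regularized too, for simplicity)."""
    X = add_bias(features)
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ ious)

def predict_iou(weights, features):
    """Predicted IoU for each row of attention features."""
    return add_bias(features) @ weights

def pearson_r(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic stand-in data: in practice each row would be derived from the
# attention maps of one sampled box, and the target would be its true IoU.
rng = np.random.default_rng(0)
n, d = 400, 16
features = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
signal = features @ true_w / np.sqrt(d)
ious = np.clip(0.5 + 0.15 * signal + 0.05 * rng.normal(size=n), 0.0, 1.0)

train, held_out = slice(0, 300), slice(300, n)
weights = fit_ridge(features[train], ious[train])
preds = predict_iou(weights, features[held_out])
r_test = pearson_r(preds, ious[held_out])
print(f"held-out Pearson r = {r_test:.2f}")
```

The correlation printed here reflects only the synthetic data and says nothing about the reported r > 0.67, which comes from real attention maps.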
Learned

ACS-Learned

A lightweight regressor scores every sampled box directly from its attention maps and selects the highest-quality candidate. It confirms that attention carries a genuine localization-quality signal.

attention maps → IoU score → argmax
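The selection step can be sketched in a few lines, assuming each candidate box already comes with a feature vector summarizing its attention maps. The function `select_by_score`, the toy scoring function, and the example boxes are hypothetical; the actual regressor and feature layout are not specified here.

```python
import numpy as np

def select_by_score(candidates, attention_features, score_fn):
    """Score every candidate with score_fn and return the argmax box plus all scores."""
    scores = np.asarray([score_fn(f) for f in attention_features], dtype=float)
    best = int(np.argmax(scores))
    return candidates[best], scores

# Hypothetical candidates as (x1, y1, x2, y2) boxes with made-up attention features.
candidates = [(10, 12, 30, 40), (11, 13, 28, 37), (50, 60, 90, 99)]
attention_features = [
    np.array([0.2, 0.1]),
    np.array([0.9, 0.7]),
    np.array([0.4, 0.3]),
]

# Stand-in for the trained regressor: any callable from features to a predicted IoU.
def toy_regressor(features):
    return float(features.mean())

best_box, scores = select_by_score(candidates, attention_features, toy_regressor)
print(best_box)
```

Because the selector only needs a callable that returns a score, the same function works with any regressor that predicts IoU from attention features.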
Training-free

ACS-Free

Gradient and entropy analysis of the trained regressor reveals a handful of discriminative heads. ACS-Free ranks candidates by attention entropy on those heads alone — no learned component at inference.

low entropy on key heads → better box
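A minimal sketch of the entropy rule follows, under several assumptions that are not from the source: attention for each candidate is given as an array of shape (heads, image tokens), heads are identified by a single flat index rather than a (layer, head) pair, and entropies are averaged across the selected heads. `attention_entropy` and `rank_by_entropy` are hypothetical names.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy over the last axis after renormalizing to a distribution."""
    p = attn / (attn.sum(axis=-1, keepdims=True) + eps)
    return -(p * np.log(p + eps)).sum(axis=-1)

def rank_by_entropy(candidate_attn, key_heads):
    """Rank candidates from lowest to highest mean entropy on the chosen heads.

    candidate_attn: array of shape (n_candidates, n_heads, n_image_tokens).
    key_heads: indices of the heads assumed to be discriminative.
    """
    ent = attention_entropy(candidate_attn[:, key_heads, :])  # (n_candidates, n_key_heads)
    mean_ent = ent.mean(axis=1)
    order = np.argsort(mean_ent)  # lowest entropy first
    return order, mean_ent

# Toy example: candidate 0 has diffuse attention, candidate 1 is concentrated
# on heads 0 and 2, which we pretend are the discriminative ones.
uniform = np.full(4, 0.25)
peaked = np.array([0.97, 0.01, 0.01, 0.01])
candidate_attn = np.stack([
    np.stack([uniform, uniform, uniform]),
    np.stack([peaked, uniform, peaked]),
])

order, mean_ent = rank_by_entropy(candidate_attn, key_heads=[0, 2])
print("selected candidate:", int(order[0]))
```

Averaging over heads is only one possible aggregation; a weighted sum or a per-layer average would fit the same interface.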
Why it works

Concentrated attention → accurate boxes

Inside localization-critical layers, accurate predictions produce concentrated, low-entropy attention over the right region, while poor predictions scatter attention. High-IoU boxes consistently show lower entropy than low- or zero-IoU boxes.

That single, interpretable signal — entropy on a few discriminative heads — is what ACS-Free exploits, turning a mechanistic finding into a deployable selection rule.
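One standard way to write this signal down, assuming Shannon entropy over normalized image-token attention (the exact definition is not spelled out here, and all symbols below are illustrative): let $a_i^{(l,h)}$ be the attention that head $h$ in layer $l$ assigns to image token $i$ out of $N$ tokens, and let $S$ be the set of discriminative heads.

```latex
p_i^{(l,h)} = \frac{a_i^{(l,h)}}{\sum_{j=1}^{N} a_j^{(l,h)}},
\qquad
H^{(l,h)} = -\sum_{i=1}^{N} p_i^{(l,h)} \log p_i^{(l,h)},
\qquad
0 \le H^{(l,h)} \le \log N .

\hat{b} = \arg\min_{b \in \mathcal{B}} \; \frac{1}{|S|} \sum_{(l,h) \in S} H_b^{(l,h)}
```

The lower bound is reached when all attention falls on a single token and the upper bound when attention is uniform, which matches the concentrated versus scattered picture described above. Here $\mathcal{B}$ denotes the sampled candidate boxes and $H_b^{(l,h)}$ the entropy measured for candidate $b$.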

Figure: Attention entropy across localization-critical layers. High-IoU predictions stay low-entropy, while low- and zero-IoU predictions spread out.
Results

Consistent gains, no fine-tuning

Across COCO and Objects365, on two different LVLM families, ACS improves small-object grounding over greedy decoding — and ACS-Free leads all training-free baselines.

Headline numbers (gains are relative to greedy decoding):
+19.4%  ·  ACS-Learned on InternVL-3.5  ·  COCO Acc@0.5
+16.4%  ·  ACS-Learned on InternVL-3.5  ·  Objects365 Acc@0.5
+8.2%  ·  ACS-Free (training-free)  ·  InternVL-3.5, COCO
r > 0.67  ·  Attention-only IoU regressor  ·  Pearson correlation

Method                      Qwen2.5-VL-7B           InternVL-3.5-8B
                            COCO    Objects365      COCO    Objects365
Greedy decoding             61.4    43.0            49.1    21.9
Best sampling baseline      60.9    41.7            53.8    23.2
ACS-Free (training-free)    63.4    43.0            53.1    24.8
ACS-Learned                 65.3    45.1            58.6    25.5

Small-object localization Acc@0.5 (%). ACS-Learned is best in every column. Baselines select from N=10 sampled responses. Full tables and ablations are in the paper.
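For readers unfamiliar with the metric, the sketch below computes IoU for axis-aligned boxes and the share of predictions whose IoU with the ground truth reaches 0.5. It assumes boxes are given as (x1, y1, x2, y2) and that a prediction counts as correct when IoU is at least the threshold; the paper's exact evaluation protocol may differ, and `box_iou` and `acc_at_iou` are hypothetical names.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(preds, gts, thresh=0.5):
    """Percentage of predictions whose IoU with the paired ground truth is >= thresh."""
    hits = sum(box_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)

# Toy example: one exact match and one poorly overlapping prediction.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 5, 15, 15)]
print(acc_at_iou(preds, gts))  # 50.0
```

In the toy example the second pair overlaps in a 5 by 5 region out of a union of 175, so its IoU is about 0.14 and it counts as a miss.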

Qualitative Results

Each column is a small object — top: Greedy decoding, bottom: our ACS selection. Yellow = ground truth, red = prediction; ACS recovers tight boxes where greedy drifts.

Vase: Greedy decoding vs. ours (ACS selection) on a small vase
Spoon: Greedy decoding vs. ours (ACS selection) on a small spoon
Person: Greedy decoding vs. ours (ACS selection) on a small person
Citation

BibTeX

@article{yang2025acs,
  title   = {Self-Improving Small Object Grounding in LVLMs},
  author  = {Yang, Tianze and Shi, Yucheng and Sun, Ruitong and Liu, Ninghao and Sun, Jin},
  year    = {2025},
  note    = {University of Georgia}
}