Extreme Amodal Face Detection

Australian National University · University of Oslo

Abstract

Extreme amodal detection is the task of inferring the 2D location of objects that are not fully visible in the input image but are visible within an expanded field-of-view. This differs from amodal detection, where the object is partially visible within the input image, but is occluded. In this paper, we consider the sub-problem of face detection, since this class provides motivating applications involving safety and privacy, but do not tailor our method specifically to this class. Existing approaches rely on image sequences so that missing detections may be interpolated from surrounding frames or make use of generative models to sample possible completions. In contrast, we consider the single-image task and propose a more efficient, sample-free approach that makes use of the contextual cues from the image to infer the presence of unseen faces. We design a heatmap-based extreme amodal object detector that addresses the problem of efficiently predicting a lot (the out-of-frame region) from a little (the image) with a selective coarse-to-fine decoder. Our method establishes strong results for this new task, even outperforming less efficient generative approaches.

Extreme Amodal Detection

Given an image \( \mathbf{x} \in \mathbb{R}^{H \times W \times 3} \), extreme amodal detection predicts the location of objects within a centrally-expanded region of size \( KH \times KW \), where \( K \) denotes the expansion factor. To predict objects within this larger region, we consider two output types, commonly associated with the tasks of detection and localization.

For the detection task, a set of \( N \) objects \( o_i = (c_i, b_i) \) is predicted, where \( c_i \) is the object class and \( b_i = (x_i, y_i, w_i, h_i) \) is the bounding box, represented by its center coordinates, width, and height.

For the localization task, a heatmap \( \mathbf{h} \in [0,1]^{KH \times KW \times C} \) is predicted, where \( C \) is the number of classes. The value at coordinate \( (i, j, c) \), denoted \( \hat{h}^c_{i,j} \), indicates the probability that an object of class \( c \) is located at that pixel.
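For concreteness, here is a minimal sketch of these two output representations, using hypothetical sizes (a 256×256 input, K = 2, and the single face class):

import numpy as np

# Hypothetical sizes: a 256x256 input with expansion factor K = 2 gives a
# 512x512 prediction region; C = 1 since only faces are considered.
H, W, K, C = 256, 256, 2, 1

# Detection output: N objects (class, box), with boxes as (cx, cy, w, h)
# expressed in expanded-region coordinates. One illustrative face:
detections = [(0, (300.0, 180.0, 40.0, 52.0))]

# Localization output: a per-pixel, per-class probability heatmap over the
# expanded region.
heatmap = np.zeros((K * H, K * W, C), dtype=np.float32)
print(heatmap.shape)  # (512, 512, 1)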

As motivated in the introduction, we consider a single class in this paper: human faces.

Figure: Illustration of the extreme amodal detection task.

As shown in the figure, the difficulty of detecting extreme amodal faces depends on whether the image contains direct visual evidence of a face that lies wholly or partially outside it. We classify faces as follows:

  1. Inside: faces that are entirely within the image.
  2. Truncated: faces that are partially within the image.
  3. Outside: faces that are entirely outside the image, either:
    a. with direct visual evidence, such as a visible body in the image (denoted +); or
    b. without direct visual evidence (denoted −), where indirect cues such as eye gaze and semantic co-occurrence must be considered.

The EXAFace Dataset

We introduce the Extreme Amodal Face (EXAFace) dataset, derived from the MS COCO object detection dataset. First, we used RetinaFace to pseudo-label the many unlabeled faces in COCO, discarding detections with confidence below 0.9, which results in 2.4× more face labels. We then applied a randomized cropping strategy that retains bounding boxes both inside and outside each crop.

For an image with height \( H \) and width \( W \), the cropping procedure is:

  1. Sample crop height from \([0.3H, 0.6H]\) and aspect ratio from \([0.5, 2.0]\), producing crop size \( H' \times W' \).
  2. Sample crop center \( (x, y) \) from \([0.5W', W - 0.5W'] \times [0.5H', H - 0.5H']\).
  3. Crop the image based on the sampled center and size.
  4. Discard boxes that are not fully inside the expanded crop region of size \( KH' \times KW' \) centered on the crop.
  5. Update each box center \( (x_b, y_b) \rightarrow (x_b - x + 0.5KW', y_b - y + 0.5KH') \).
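A minimal sketch of this procedure follows, assuming boxes given as (cx, cy, w, h) in pixel coordinates; the function and argument names are illustrative and do not correspond to the released code:

import numpy as np

def sample_extreme_amodal_crop(image, boxes, K=2, rng=None):
    # `image`: HxWx3 array; `boxes`: (N, 4) array of (cx, cy, w, h) face boxes.
    if rng is None:
        rng = np.random.default_rng()
    H, W = image.shape[:2]

    # 1. Sample crop height and aspect ratio, giving the crop size H' x W'.
    Hc = rng.uniform(0.3 * H, 0.6 * H)
    Wc = min(Hc * rng.uniform(0.5, 2.0), W)

    # 2. Sample the crop center so the crop lies inside the image.
    x = rng.uniform(0.5 * Wc, W - 0.5 * Wc)
    y = rng.uniform(0.5 * Hc, H - 0.5 * Hc)

    # 3. Crop the image around the sampled center.
    x0, y0 = int(round(x - 0.5 * Wc)), int(round(y - 0.5 * Hc))
    crop = image[y0:y0 + int(Hc), x0:x0 + int(Wc)]

    # 4. Keep only boxes fully inside the expanded (K*Hc x K*Wc) region
    #    centered on the crop center.
    cx, cy, w, h = boxes.T
    inside = (np.abs(cx - x) + 0.5 * w <= 0.5 * K * Wc) & \
             (np.abs(cy - y) + 0.5 * h <= 0.5 * K * Hc)
    kept = boxes[inside].copy()

    # 5. Shift box centers into expanded-crop coordinates.
    kept[:, 0] += 0.5 * K * Wc - x
    kept[:, 1] += 0.5 * K * Hc - y
    return crop, kept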

This process is repeated four times per image to generate diverse examples. Image and box statistics are summarized below.

Count (×10³, %)  | Inside    | Truncated | Outside + | Outside −
Boxes (train)    | 116 (24%) | 74 (15%)  | 66 (13%)  | 235 (48%)
Boxes (test)     | 5.0 (24%) | 3.0 (14%) | 2.0 (12%) | 11 (50%)
Images (train)   | 30 (17%)  | 37 (20%)  | 32 (17%)  | 83 (46%)
Images (test)    | 1.0 (16%) | 1.5 (20%) | 1.0 (17%) | 3.5 (47%)

Sample counts (×10³) and percentages (in parentheses) are shown for bounding boxes and images. Categories include: inside faces, truncated faces, outside faces with body evidence (+), and outside faces without body evidence (−). An image is assigned a category based on the hardest face it contains.

Extreme Amodal Face Detector

Our detector (Figure 1) comprises: (i) a convolutional feature extractor \( f_{\text{feat}} \), (ii) a transformer encoder–decoder that transfers information between in-image and out-of-image tokens using rotary positional encodings (RoPE), and (iii) two detection heads for in-image and out-of-image predictions.

Given an input image \( \mathbf{x} \), the computation is

\[ \begin{aligned} \mathbf{y}_{\text{in}} &= f_{\text{feat}}(\mathbf{x}) \\ \mathbf{z}_{\text{in}} &= f_{\text{enc}}(\mathbf{y}_{\text{in}}, \mathbf{p}_{\text{in}}), \quad \mathbf{p}_{\text{in}}=\phi(\mathcal{C}_{\text{in}}) \\ \mathbf{y}_{\text{out}} &= f_{\text{dec}}(\mathbf{z}_{\text{in}}, \mathbf{p}), \quad \mathbf{p}=\phi(\mathcal{C}) \\ (o_{\text{in}}, \mathbf{h}_{\text{in}}) &= g_{\text{in}}(\mathbf{y}_{\text{in}}), \\ (o_{\text{out}}, \mathbf{h}_{\text{out}}) &= g_{\text{out}}(\mathbf{y}_{\text{out}}). \end{aligned} \]
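The following PyTorch sketch mirrors the structure of these equations under simplifying assumptions: standard transformer stacks stand in for \( f_{\text{enc}} \), \( f_{\text{dec}} \), and the selective coarse-to-fine decoder described below, learned additive positional embeddings replace RoPE, and the heads \( g_{\text{in}} \), \( g_{\text{out}} \) are single linear layers. It illustrates the data flow only, not the actual implementation.

import torch
import torch.nn as nn

class ExtremeAmodalDetectorSketch(nn.Module):
    # Stand-in modules: f_feat is a patchify convolution, f_enc/f_dec are
    # standard transformer stacks, and g_in/g_out emit one heatmap logit and
    # four box parameters per token.
    def __init__(self, dim=256, heads=8, K=2, patch=16, img=256):
        super().__init__()
        self.n_in = (img // patch) ** 2            # in-image tokens
        n_all = (K * img // patch) ** 2            # tokens on the expanded grid
        self.f_feat = nn.Conv2d(3, dim, patch, stride=patch)
        self.p = nn.Parameter(0.02 * torch.randn(n_all, dim))  # phi(C), simplified
        self.f_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.f_dec = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.g_in = nn.Linear(dim, 1 + 4)
        self.g_out = nn.Linear(dim, 1 + 4)

    def forward(self, x):
        B = x.shape[0]
        y_in = self.f_feat(x).flatten(2).transpose(1, 2)   # (B, n_in, dim)
        # For brevity, the first n_in positional slots stand in for the
        # in-image coordinates C_in; the remainder stand in for C_out.
        p_in = self.p[:self.n_in].expand(B, -1, -1)
        p_out = self.p[self.n_in:].expand(B, -1, -1)
        z_in = self.f_enc(y_in + p_in)                      # f_enc
        y_out = self.f_dec(p_out, z_in)                     # f_dec
        return self.g_in(z_in), self.g_out(y_out)           # g_in, g_out

model = ExtremeAmodalDetectorSketch()
preds_in, preds_out = model(torch.randn(1, 3, 256, 256))
print(preds_in.shape, preds_out.shape)  # (1, 256, 5) and (1, 768, 5)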
Figure 1: Overview of the detector and the selective coarse-to-fine (C2F) decoder. The encoder shares context from in-image tokens to the expanded region; the decoder queries and refines candidate regions efficiently.

Selective Coarse-to-Fine (C2F) Decoder

Sharing information from the image to the expanded field-of-view is challenging due to (a) computational cost—at equal resolution, the extended region contains \( (K^2\!-\!1) \) times more tokens—and (b) object sparsity—only a small fraction of patches contain faces. We address this with a selective C2F strategy: query the extended region at coarse resolution, then refine only high-scoring regions.

Let \( \mathcal{S} \) denote a sequence of coarse-to-fine scales and \( \mathcal{C}_{\text{out}} \) the out-of-image coordinates. Coarse positional encodings are formed by average pooling the expanded-grid encodings:

\[ \mathbf{p}^{s_i}_{\text{out}} \;=\; \big\{\, \mathrm{avgpool}(\mathbf{p},\, s_i)(u,v) \;\big|\; (u,v)\in \mathcal{C}_{\text{out}} \,\big\}, \quad s_i \in \mathcal{S}. \]
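A small sketch of this pooling step, assuming K = 2 (so the image occupies the central block of the expanded grid) and illustrative sizes:

import torch
import torch.nn.functional as F

G, d, s = 32, 256, 4          # expanded grid size, channels, coarse scale s_i
p = torch.randn(d, G, G)      # phi(C) on the full expanded grid

# Average-pool the encodings down to the coarse scale, then keep only the
# coarse cells that lie outside the central (in-image) block.
p_coarse = F.avg_pool2d(p, kernel_size=s)   # (d, G/s, G/s)
g = G // s
yy, xx = torch.meshgrid(torch.arange(g), torch.arange(g), indexing="ij")
in_image = (yy >= g // 4) & (yy < 3 * g // 4) & (xx >= g // 4) & (xx < 3 * g // 4)
p_out_coarse = p_coarse[:, ~in_image]       # (d, number of out-of-image cells)
print(p_out_coarse.shape)                   # torch.Size([256, 48])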

Initialization sets \( \mathbf{x}^{s_1}_{\text{out}} \leftarrow \mathbf{p}^{s_1}_{\text{out}} \). A two-layer decoder block \( f_{\text{decblk}} \) attends from coarse out-of-image queries to in-image keys/values (with RoPE), followed by a scoring network \( f_{\text{score}} \) that selects the top-\( \mu^{s_i}\% \) tokens to refine at the next scale:

\[ \begin{aligned} \mathbf{y}^{s_i}_{\text{out}} &= f_{\text{decblk}}\!\big(\mathbf{x}^{s_i}_{\text{out}},\, \mathbf{p}^{s_i}_{\text{out}},\, \mathbf{z}_{\text{in}},\, \mathbf{p}_{\text{in}}\big), \\ \mathbf{x}^{s_{i+1}}_{\text{out}} &= f_{\text{score}}\!\big(\mathbf{y}^{s_i}_{\text{out}},\, \mu^{s_i}\big). \end{aligned} \]
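The selection step amounts to keeping the top-scoring fraction of coarse tokens; the sketch below uses a mean-pooled score as a stand-in for the learned scoring head of \( f_{\text{score}} \), and shows only the selection (in the actual decoder the kept tokens are then refined at the next, finer scale):

import torch

def select_top_mu(y_out, mu):
    # y_out: (B, N, d) coarse out-of-image tokens; keep the top mu percent.
    B, N, d = y_out.shape
    scores = y_out.mean(dim=-1)              # stand-in for f_score's scoring head
    k = max(1, int(N * mu / 100))
    idx = scores.topk(k, dim=1).indices      # (B, k)
    return torch.gather(y_out, 1, idx.unsqueeze(-1).expand(-1, -1, d))

tokens = torch.randn(2, 48, 256)
print(select_top_mu(tokens, mu=25).shape)    # torch.Size([2, 12, 256])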

Multi-scale outputs are aggregated by summing upsampled features:

\[ \mathbf{y}_{\text{out}} \;=\; \sum_{i=1}^{|\mathcal{S}|} \uparrow\!\big(\mathbf{y}^{s_i}_{\text{out}}\big). \]
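This aggregation corresponds to upsampling each scale's features to the finest resolution and summing. The sketch below uses dense per-scale maps for simplicity, whereas in the method only the selected regions are populated at finer scales:

import torch
import torch.nn.functional as F

# Hypothetical per-scale out-of-image feature maps, coarsest to finest,
# all with d = 256 channels.
feats = [torch.randn(1, 256, 8, 8),
         torch.randn(1, 256, 16, 16),
         torch.randn(1, 256, 32, 32)]

# Upsample every scale to the finest grid and sum.
y_out = sum(F.interpolate(f, size=(32, 32), mode="nearest") for f in feats)
print(y_out.shape)   # torch.Size([1, 256, 32, 32])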

Main Results

Method       | AP↑   | AP_t↑ | AP_o↑ | AP_o+↑ | AP_o−↑ | MAE↓  | MAE_t↓ | MAE_o↓ | MAE_o+↓ | MAE_o−↓ | mIoU_o↑ | AR_o↑ | SE_o↓ | CE_o↓
Uniform      | --    | --    | --    | --     | --     | --    | --     | --     | --      | --      | 8.80    | 51.71 | 100   | 100
Oracle-GT    | 100   | 100   | 100   | 100    | 100    | 0.00  | 0.00   | 0.00   | 0.00    | 0.00    | 100     | 100   | 58.68 | 58.68
Oracle-YOLOH | 44.79 | 61.70 | 36.34 | 49.83  | 22.85  | 7.55  | 2.07   | 10.65  | 2.54    | 13.60   | 28.63   | 44.56 | 91.96 | 78.74
YOLOH        | 10.20 | 30.60 | 0.01  | 0.01   | 0.001  | 17.37 | 2.78   | 26.11  | 6.87    | 33.11   | 17.23   | 19.01 | 96.90 | 94.01
Pix2Gestalt  | 11.30 | 33.43 | 0.24  | 0.48   | 0.001  | 17.38 | 2.83   | 26.10  | 6.63    | 33.18   | 17.75   | 20.25 | 96.54 | 93.31
Outpaint     | 4.93  | 11.54 | 1.62  | 2.47   | 0.76   | 14.69 | 2.07   | 21.94  | 3.48    | 28.67   | 20.53   | 25.03 | 96.41 | 90.18
Ours         | 23.07 | 66.69 | 1.26  | 2.17   | 0.34   | 17.83 | 2.01   | 27.43  | 4.53    | 35.77   | 18.70   | 27.17 | 93.99 | 88.16

Extreme amodal detection performance on the test set of our MS COCO-based dataset. We report average precision (AP), mean absolute error (MAE), mean IoU (mIoU), average recall (AR), self-entropy (SE), and cross-entropy (CE). Subscripts t, o, o+, and o− denote the truncated, outside, outside-with-evidence, and outside-without-evidence subsets, respectively. Different metrics are most meaningful for different subsets: detection metrics such as AP are appropriate for evaluating truncated faces, since the realization of the conditional distribution (our "ground truth") is very close to the true distribution near the image. Further from the image, this realization no longer captures all modes of the true distribution, so AR, CE, and SE are more meaningful measures of performance in that regime.

Figure: Qualitative results.

The final row shows samples from the ground-truth conditional distributions. Our model effectively leverages contextual cues—such as nearby people (example 1), objects like a skateboard (example 2), or partial body evidence (example 4)—to infer completely unseen faces. In example 1, the model correctly extends predictions to the left, where a partial person is visible, but not to the right, demonstrating awareness of scene context and typical human height. Example 3 further shows generalization beyond annotated ground truth. Compared to our model, Pix2Gestalt struggles without large visible body parts, while the outpainting pipeline can infer outside faces but yields noisier and less consistent results.

BibTeX

@misc{song2025extremeamodalfacedetection,
      title={Extreme Amodal Face Detection}, 
      author={Changlin Song and Yunzhong Hou and Michael Randall Barnes and Rahul Shome and Dylan Campbell},
      year={2025},
      eprint={2510.06791},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.06791}, 
}