Model card: Person Segmentation (v0.11)

Model details

  • Model date: July 25, 2022

  • Model version: 0.11

  • License: refer to the Lightship terms of service.

Technical specifications

The person segmentation model returns a floating point value from 0 to 1 for each pixel indicating the probability of that pixel being part of a person. This value is then thresholded to return a boolean mask of presence/absence of “person” at each pixel.
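The thresholding step described above can be sketched as follows. This is a minimal illustration with NumPy, not the ARDK API: the probability buffer, its shape, and the 0.5 cut-off are all assumptions for the example.

```python
import numpy as np

# Hypothetical per-pixel probabilities returned by the model (values
# in [0, 1]); how this buffer is retrieved from the ARDK is not shown.
probabilities = np.array([
    [0.10, 0.85, 0.92],
    [0.05, 0.60, 0.88],
])

# Threshold the probabilities to a boolean "person" mask.
# 0.5 is an illustrative cut-off, not a documented ARDK default.
THRESHOLD = 0.5
mask = probabilities > THRESHOLD  # True = pixel classified as "person"
```

A `True` entry in `mask` marks a pixel predicted to belong to a person; the rest of the document refers to this boolean mask.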

Intended use

Intended use cases

  • General semantic segmentation of people for augmented reality applications accessed through the Lightship ARDK.

  • Querying the presence or absence of a person at any specified pixel in the camera feed.

  • Using the semantic mask for “person” to enable screen space effects.


Out-of-scope use cases

This model does not provide the capability to:

  • Segment individual people (instance segmentation)

  • Track individuals

  • Identify or recognize individuals


Factors

The following factors apply to all semantic segmentation provided in the Lightship ARDK, including person segmentation:

  • Scale: objects / classes may not be segmented if they are very far away from the camera.

  • Lighting: extreme light conditions may affect the overall performance.

  • Viewpoint: extreme camera views not seen during training may lead to missed detections or class confusion.

  • Occlusion: objects / classes may not be segmented if they are covered by other objects.

  • Motion blur: fast camera or object motion may degrade the performance of the model.

  • Flicker: predictions are made frame by frame and no temporal smoothing or context is applied; this may lead to a ‘jittering’ effect between predictions of temporally adjacent frames.
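Because predictions are made frame by frame with no temporal context, applications that need temporal stability can apply their own smoothing on the client side. A minimal sketch of an exponential moving average over successive probability maps follows; the smoothing factor is an illustrative choice, and this mitigation is not something the ARDK provides:

```python
import numpy as np

def smooth(prev_ema, current_probs, alpha=0.3):
    """Exponential moving average of per-frame probability maps.

    alpha controls responsiveness: higher values track the newest
    frame more closely, lower values suppress frame-to-frame jitter.
    """
    if prev_ema is None:
        return current_probs
    return alpha * current_probs + (1.0 - alpha) * prev_ema

# Simulated probabilities for one pixel over three jittery frames.
frames = [np.array([0.9]), np.array([0.4]), np.array([0.9])]
ema = None
for probs in frames:
    ema = smooth(ema, probs)
```

Thresholding the smoothed probabilities rather than the raw per-frame values trades a small amount of latency for a steadier mask.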

For person segmentation specifically, and based on known problems with computer vision technology, we identify the following potentially relevant subgroups:

  • Geographical region

  • Skin tone

  • Gender

  • Body posture: certain body configurations may be harder to predict due to appearing less often in the training corpus.

  • Other: age, fashion style, accessories, body alterations

Fairness evaluation

At Niantic we strive for our technology to be inclusive and fair by following strict equality and fairness practices when building, evaluating, and deploying our models. We define person segmentation fairness as follows: a model makes fair predictions if it performs equally on images that depict a variety of the identified subgroups. The evaluation results focus on measuring the performance of the person segmentation channel on the first three main subgroups (geographical region, skin tone and gender).

Instrumentation and dataset details

Our benchmark dataset comprises 5650 images captured around the world using the back camera of a smartphone, with the following specifications:

  • Only one person per image is depicted.

  • Both indoor and outdoor environments.

  • Captured with a variety of devices.

  • No occlusions.

  • Full body within the frame of the image in a variety of poses.

Images are labeled with the following attributes:

  • Geographical region: based on the UN geoscheme, with the European subregions merged into a single region and Micronesia, Polynesia, and Melanesia merged into one:

    • Northern Africa

    • Eastern Africa

    • Middle Africa

    • Southern Africa

    • Western Africa

    • Caribbean

    • Central America

    • South America

    • Northern America

    • Central Asia

    • Eastern Asia

    • South Eastern Asia

    • Southern Asia

    • Western Asia

    • Europe

    • Australia and New Zealand

    • Melanesia, Micronesia and Polynesia

  • Skin tone: following the Fitzpatrick scale, images are annotated from subgroup 1 to 6. Skin tone is annotated by the person depicted in the image, so it is a self-reported value.

  • Gender: images are annotated with self-reported gender.


The standard metric used to evaluate a segmentation model is the Intersection over Union (IoU). It is computed as follows:

IoU = true_positives / (true_positives + false_positives + false_negatives)

Reported IoUs are averages (mean IoU or mIoU) over images belonging to the referenced subgroup unless stated otherwise.
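The IoU formula above can be computed directly from a predicted and a ground-truth boolean mask. The masks below are toy examples for illustration:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two boolean masks:
    TP / (TP + FP + FN), equivalently |pred & gt| / |pred | gt|."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union if union > 0 else 1.0

pred = np.array([[True, True], [False, False]])
gt   = np.array([[True, False], [True, False]])
score = iou(pred, gt)  # 1 TP, 1 FP, 1 FN -> 1/3
```

Averaging `iou` over the images of a subgroup gives that subgroup's mIoU as reported in the tables below.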

Fairness criteria

A model is considered to be making unfair predictions if its performance (mIoU) for a particular subgroup deviates from the average across all subgroups by three standard deviations or more.
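This criterion can be checked mechanically. The per-subgroup scores below are made up for illustration and are not evaluation results:

```python
# Hypothetical per-subgroup mIoU values (percent); not real results.
subgroup_miou = {"A": 84.0, "B": 83.5, "C": 84.5, "D": 83.0}

scores = list(subgroup_miou.values())
mean = sum(scores) / len(scores)
# Population standard deviation across subgroups.
std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5

# A subgroup is flagged if its mIoU is 3 or more stdevs from the mean.
flagged = {name: s for name, s in subgroup_miou.items()
           if abs(s - mean) >= 3 * std}
```

With these toy numbers no subgroup is flagged, since every deviation from the mean is well under three standard deviations.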


Geographical evaluation results

Average performance across all 17 regions is 83.85% with a standard deviation of 2.06%. All region subgroups yield a performance in the range of [80.06%, 87.07%]. The maximum difference between the mean and the worst performing region is 3.79%, within our fairness criterion threshold of 3 stdevs (3 x 2.06% = 6.18%).

Region mIoU stdev Number of images
Northern Africa 85.37% 12.41% 301
Eastern Africa 83.61% 14.82% 336
Middle Africa 84.57% 14.83% 322
Southern Africa 83.15% 15.62% 368
Western Africa 80.81% 18.50% 364
Caribbean 84.52% 13.95% 412
Central America 85.14% 11.68% 415
South America 83.30% 16.19% 397
Northern America 80.06% 18.48% 335
Central Asia 87.07% 10.81% 229
Eastern Asia 86.06% 12.06% 346
South Eastern Asia 81.47% 14.83% 333
Southern Asia 83.64% 15.32% 353
Western Asia 85.94% 13.37% 370
Europe 86.26% 11.87% 320
Australia and New Zealand 82.34% 14.84% 374
Melanesia, Micronesia and Polynesia 82.10% 21.57% 75
Average (across all images) 83.86% 14.89% 5650
Average (across regions) 83.85% 2.06% -
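The two "Average" rows above differ because one averages over all images and the other over subgroup means; with unequal subgroup sizes these generally diverge. A small sketch with made-up per-image IoUs:

```python
# Hypothetical per-image IoUs grouped by region (not real data).
per_region = {
    "Region X": [0.90, 0.80, 0.85, 0.95],  # 4 images
    "Region Y": [0.70, 0.80],              # 2 images
}

# Average across all images: every image weighted equally.
all_ious = [v for ious in per_region.values() for v in ious]
avg_over_images = sum(all_ious) / len(all_ious)

# Average across regions: every region weighted equally.
region_means = [sum(ious) / len(ious) for ious in per_region.values()]
avg_over_regions = sum(region_means) / len(region_means)
```

Here the image-weighted average is pulled toward the larger Region X, while the region-weighted average treats both regions equally.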

Skin tone evaluation results

Average performance across all six skin tones is 83.84% with a standard deviation of 1.26%. All skin tone subgroups yield a performance in the range of [81.72%, 85.45%]. The maximum difference between the mean and the worst performing skin tone subgroup is 2.13%, within our fairness criterion threshold of 3 stdevs (3 x 1.26% = 3.78%).

Skin tone (Fitzpatrick scale) mIoU stdev Number of images
1 85.45% 10.87% 247
2 84.48% 13.81% 1919
3 84.14% 14.20% 1463
4 83.28% 15.57% 457
5 84.02% 14.70% 706
6 81.72% 18.19% 858
Average (across all images) 83.86% 14.83% 5650
Average (across skin tones) 83.85% 1.26% -

Gender evaluation results

Average performance of all evaluated gender subgroups is 83.76% with a range of [82.58%, 84.93%]. The difference between the average and the worst performing gender subgroup is 1.18%, within our fairness criterion threshold of 3 stdevs (3 x 1.66% = 4.98%).

Perceived gender mIoU stdev Number of images
Female 82.58% 15.98% 2585
Male 84.93% 13.70% 3065
Average (across all images) 83.86% 14.83% 5650
Average (across genders) 83.76% 1.18% -

Ethical Considerations

  • Privacy: the model was trained and evaluated on images that may depict humans. All images used were either captured with consent or anonymized when captured in the public domain. When the model is used in the ARDK, inference runs only on-device and images are never transferred off the user's device.

  • Human life: this model is designed for entertainment purposes within an augmented reality application. It is not intended to be used for making decisions critical to human life.

  • Bias: training datasets have not been audited for diversity and may contain biases not surfaced by our benchmarks.

Caveats and Recommendations

  • Our annotated dataset contains only binary genders, recorded as male/female. Further data is needed to evaluate across a broader spectrum of genders.

  • An ideal skin tone evaluation dataset would additionally include camera details and more environmental details such as lighting and humidity. Furthermore, the Fitzpatrick scale has limitations, as it does not represent the full spectrum of human skin tones.