What is panoptic segmentation and how does it work?

Feb 8, 2022

We rarely notice how much our minds and senses do when we take in the world around us. Our brains scan, sort, and navigate our surroundings in real time, which is precisely what computer vision aims to replicate. Artificial intelligence is steadily getting better at mimicking these human capabilities. Until it gets there, let’s take a closer look at what panoptic segmentation entails, address common misconceptions about similar image segmentation techniques, and understand the role panoptic segmentation plays in today’s world and the near future of AI.

Understanding panoptic segmentation

Panoptic segmentation is relatively simple to grasp if you have prior knowledge of image segmentation. What does ‘panoptic’ mean exactly? The word itself implies a full, comprehensive view of something, which is quite accurate for this image segmentation task. Essentially, panoptic segmentation combines two notable types of image segmentation: instance segmentation and semantic segmentation. To define panoptic segmentation accurately, we need to look at how it differs from, and what it shares with, both semantic segmentation and instance segmentation.

Semantic vs. instance segmentation

In short, semantic segmentation assigns a class label to every pixel of an image. We end up with a color-coded output that marks the boundaries of the regions whose pixels share the same class label. For example, any pixel that belongs to a cat will be marked as ‘cat’. Semantic segmentation covers and classifies all pixels of the image, as opposed to instance segmentation, which concentrates only on the pixels that belong to object instances (more on that in a moment).
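As a minimal sketch of what a semantic segmentation output looks like, the toy label map below stores one class ID per pixel. The class names, IDs, and the 4×6 grid are illustrative assumptions, not the output of any real model.

```python
# Toy 4x6 "image": each entry is the class ID assigned to that pixel.
# The class names and IDs are illustrative assumptions.
CLASS_NAMES = {0: "sky", 1: "road", 2: "car"}

semantic_map = [
    [0, 0, 0, 0, 0, 0],
    [0, 2, 2, 0, 2, 2],
    [1, 2, 2, 1, 2, 2],
    [1, 1, 1, 1, 1, 1],
]


def pixels_per_class(label_map):
    """Count how many pixels carry each class label."""
    counts = {}
    for row in label_map:
        for class_id in row:
            counts[class_id] = counts.get(class_id, 0) + 1
    return counts


counts = pixels_per_class(semantic_map)
for class_id, n in sorted(counts.items()):
    print(f"{CLASS_NAMES[class_id]}: {n} pixels")
```

Every pixel receives a label, but the two car regions share class ID 2, so nothing in this output tells one car from the other.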

With instance segmentation, objects of the same class are not lumped together; each one is highlighted as a separate instance. For example, if we have an image of a road with several cars, each car is labeled as a ‘car’ but marked as an individual instance because it is a distinct object, regardless of its color, brand, etc. It’s important to keep in mind that each instance mask views the image in terms of 1s and 0s, where 1 marks a pixel that belongs to the instance and 0 marks a pixel that does not. That is why you’ll commonly see a black background with masks of varying colors highlighting the separate instances in the final output.
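The 1s-and-0s view can be sketched with one binary mask per detected object. The two ‘car’ masks below are hypothetical and hand-written for a 4×6 image; a real model would predict them.

```python
# Hypothetical instance segmentation output for a 4x6 image: one binary mask
# per detected object. 1 = pixel belongs to the instance, 0 = it does not.
instances = [
    {
        "class": "car",
        "mask": [
            [0, 0, 0, 0, 0, 0],
            [0, 1, 1, 0, 0, 0],
            [0, 1, 1, 0, 0, 0],
            [0, 0, 0, 0, 0, 0],
        ],
    },
    {
        "class": "car",
        "mask": [
            [0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 1, 1],
            [0, 0, 0, 0, 1, 1],
            [0, 0, 0, 0, 0, 0],
        ],
    },
]


def mask_area(mask):
    """Number of pixels set to 1 in a binary mask."""
    return sum(sum(row) for row in mask)


def covered_pixels(instances):
    """Count pixels claimed by at least one instance mask."""
    height = len(instances[0]["mask"])
    width = len(instances[0]["mask"][0])
    return sum(
        1
        for y in range(height)
        for x in range(width)
        if any(inst["mask"][y][x] for inst in instances)
    )


areas = [mask_area(inst["mask"]) for inst in instances]
covered = covered_pixels(instances)
print(areas)    # one area per car instance
print(covered)  # pixels covered by any instance
```

Both masks carry the same class, yet they stay separate objects. The pixels that no mask covers simply receive no label, which is the black background mentioned above.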

Stuff & things in panoptic segmentation

Now, circling back to panoptic segmentation, it becomes easier to see how it combines the core purposes of both semantic and instance segmentation. Each pixel in an image is assigned a class label and is also identified by the instance it belongs to. That means each pixel in the image is assigned two values simultaneously: a label and an instance number. You’ll find that the first instance of a particular class is marked as ‘0’, the second as ‘1’, and so on, depending on how many countable instances are visible.

One of the elements that makes panoptic segmentation stand out from the other image segmentation techniques in machine learning is that it represents the whole scene by including both ‘stuff’ and ‘things’ in the output. ‘Things’ are countable objects such as people, vehicles, plants, and so on, while ‘stuff’ refers to elements that are difficult to count, such as roads, the sky, and other backgrounds. Semantic segmentation is typically the task used for stuff, while instance segmentation focuses on things; panoptic segmentation doesn’t exclude one or the other.
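The two values per pixel can be sketched by pairing a class map with an instance map. This is a toy example under stated assumptions: the class IDs, the `-1` marker for "no instance", and the choice to give stuff pixels an instance value of `None` are illustrative conventions, not a standard.

```python
CLASS_NAMES = {0: "sky", 1: "road", 2: "car"}
THING_IDS = {2}  # countable classes; every other class is treated as stuff

# Class ID per pixel (what semantic segmentation provides).
semantic_map = [
    [0, 0, 0, 0, 0, 0],
    [0, 2, 2, 0, 2, 2],
    [1, 2, 2, 1, 2, 2],
    [1, 1, 1, 1, 1, 1],
]

# Instance number per pixel (what instance segmentation provides);
# -1 marks pixels where no instance was detected.
instance_map = [
    [-1, -1, -1, -1, -1, -1],
    [-1, 0, 0, -1, 1, 1],
    [-1, 0, 0, -1, 1, 1],
    [-1, -1, -1, -1, -1, -1],
]


def to_panoptic(semantic_map, instance_map, thing_ids):
    """Pair each pixel's class with an instance number.

    Stuff pixels get None as their instance, since stuff is not counted.
    """
    panoptic = []
    for sem_row, inst_row in zip(semantic_map, instance_map):
        row = []
        for class_id, inst in zip(sem_row, inst_row):
            if class_id in thing_ids and inst >= 0:
                row.append((class_id, inst))
            else:
                row.append((class_id, None))
        panoptic.append(row)
    return panoptic


panoptic = to_panoptic(semantic_map, instance_map, THING_IDS)

# Distinct (class, instance) pairs are the segments of the scene.
segments = sorted(
    {pixel for row in panoptic for pixel in row},
    key=lambda s: (s[0], -1 if s[1] is None else s[1]),
)
for class_id, inst in segments:
    label = CLASS_NAMES[class_id]
    print(label if inst is None else f"{label} #{inst}")
```

The scene resolves into four segments: two stuff regions (sky and road) and two numbered car instances, with no pixel left unlabeled.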

How it works

In order to achieve a comprehensive output, specialists in the field have found that the most effective approach is to combine the results of separate semantic segmentation and instance segmentation tasks within a single network. That is easier said than done, as the actual architecture is much more complex. Let’s take a look at a couple of the most notable panoptic segmentation architectures.

EfficientPS architecture

Illustration of the EfficientPS architecture. Image credits: Robot Learning Lab.

The common drawback of early panoptic segmentation models was that the predictions of the instance segmentation and semantic segmentation networks were combined only during the post-processing phase. That led to an array of shortcomings, such as discrepancies between the two networks’ predictions and computational overhead. EfficientPS aims to address those challenges with an architecture that consists of a shared backbone, a two-way Feature Pyramid Network (FPN), instance and semantic heads, and a panoptic fusion module. Thanks to the two-way FPN and the dedicated heads, including a semantic head built from three modules that capture fine and contextual features, information loss is minimized. Finally, the panoptic fusion module combines the outputs of the semantic and instance heads to produce the panoptic segmentation output.
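To make the post-processing drawback concrete, here is a simplified heuristic merge of the kind early pipelines relied on. It is not the EfficientPS panoptic fusion module; the function name `fuse`, the `score` field, and the use of `None` for unresolved pixels are assumptions made for illustration.

```python
VOID = None  # pixel that neither branch could resolve


def fuse(semantic_map, instances, thing_ids):
    """Merge semantic labels and instance masks into (class, instance) pairs."""
    height, width = len(semantic_map), len(semantic_map[0])
    panoptic = [[VOID] * width for _ in range(height)]
    next_index = {}  # per-class instance counter

    # 1. Paste instances, most confident first; earlier ones win overlaps.
    for inst in sorted(instances, key=lambda i: i["score"], reverse=True):
        index = next_index.get(inst["class_id"], 0)
        placed = False
        for y in range(height):
            for x in range(width):
                if inst["mask"][y][x] and panoptic[y][x] is VOID:
                    panoptic[y][x] = (inst["class_id"], index)
                    placed = True
        if placed:
            next_index[inst["class_id"]] = index + 1

    # 2. Fill the remaining pixels with stuff labels from the semantic map.
    #    Thing-class pixels that no instance claimed stay VOID.
    for y in range(height):
        for x in range(width):
            if panoptic[y][x] is VOID and semantic_map[y][x] not in thing_ids:
                panoptic[y][x] = (semantic_map[y][x], None)
    return panoptic


# Toy inputs: 0 = sky, 1 = road, 2 = car (illustrative IDs).
semantic_map = [
    [0, 0, 0, 0],
    [2, 2, 2, 2],
    [1, 1, 1, 1],
]
instances = [
    {"class_id": 2, "score": 0.6,
     "mask": [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]},
    {"class_id": 2, "score": 0.9,
     "mask": [[0, 0, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]]},
]

result = fuse(semantic_map, instances, thing_ids={2})
for row in result:
    print(row)
```

In this toy case the two masks overlap on one pixel, which goes to the more confident instance, and the last pixel of the middle row stays unresolved because the semantic branch says ‘car’ while no instance mask covers it. Disagreements like that are what fusing the two heads inside the network is meant to reduce.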

Attention-guided unified network for panoptic segmentation

Attention-Guided Unified Network structure. Image credits: arxiv.org

In the image above, we can see a visual representation of the next proposed network. Here, an FPN backbone shares features with three parallel branches: the foreground, background, and RPN (Region Proposal Network) branches. As before, the goal of the architecture is to generate an accurate output with minimal to no discrepancies or information loss between ‘things’ and ‘stuff’, which would otherwise be predicted by separate models. The Attention-Guided Unified Network approaches that challenge with a framework in which the foreground (FG, instance-level) and background (BG, semantic-level) branches are segmented simultaneously. Two attention sources, the proposal attention module and the mask attention module, are added to improve the accuracy of the final output.

Panoptic segmentation datasets

In order to build and deploy a successful system that utilizes the advantages panoptic segmentation has to offer, you will need accurate training data. Thankfully, there are easily accessible public datasets for machine learning that you can use instead of collecting and annotating data from the ground up. A few of our recommendations are:

COCO — Short for Common Objects in Context, the COCO dataset provides image annotations for upwards of 1.5 million object instances. No need to manually annotate the objects that appear frequently.


Cityscapes — A dataset of annotated urban street scenes. It includes 10 ‘things’ categories and 20 ‘stuff’ categories, covering everything from pedestrians to buildings.


PASTIS — This is an excellent dataset for applications of AI in agriculture. It comprises 2,433 satellite image patches with panoptic annotations for each pixel.

Don’t forget to search independently for others depending on your specific needs, since there are dozens of public datasets available online.
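For COCO specifically, the panoptic annotations are, to the best of our knowledge, distributed as PNG images in which each pixel’s color encodes a segment ID (R + 256·G + 256²·B), with a JSON `segments_info` list mapping each ID to a category. The sketch below assumes that layout. The sample IDs and categories are made up, so check the dataset’s format documentation before relying on it.

```python
def rgb_to_id(rgb):
    """Decode an (R, G, B) pixel into a segment ID."""
    r, g, b = rgb
    return r + 256 * g + 256 * 256 * b


def id_to_rgb(segment_id):
    """Encode a segment ID back into an (R, G, B) pixel."""
    r = segment_id % 256
    g = (segment_id // 256) % 256
    b = (segment_id // (256 * 256)) % 256
    return (r, g, b)


# Hypothetical annotation entries, trimmed to the two fields used here.
segments_info = [
    {"id": 257, "category_id": 3},
    {"id": 65536, "category_id": 119},
]
category_of = {seg["id"]: seg["category_id"] for seg in segments_info}

pixel = (1, 1, 0)               # color read from the annotation PNG
segment_id = rgb_to_id(pixel)   # 1 + 256 * 1 + 65536 * 0
print(segment_id, category_of[segment_id])
```

Looking up the decoded ID in `segments_info` gives the pixel’s class, while the ID itself distinguishes one segment from another, which mirrors the label-plus-instance pairing described earlier.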

Use cases and applications

By providing a comprehensive and detailed view of images and real-time video, panoptic segmentation has improved accuracy across computer vision applications. Here are only a handful of use cases where it plays a major role in pushing innovation forward.

  • Self-driving vehicles — Panoptic segmentation is crucial for establishing safety and efficiency in autonomous vehicles. The AI system needs to distinguish between other vehicles, pedestrians, and road signs all at once and in real time in order to assess the situation and make quick decisions. All of that is executed with the help of appropriate hardware, such as cameras and LiDAR sensors.
  • Medical imaging — Visualizing cell nuclei is a task that requires precision, especially when diagnosing diseases like cancer. During screening, it is often difficult to accurately detect cells that overlap and vary in appearance. Semantic segmentation models were commonly used but showed gaps and inaccuracies in the case of overlapping cells. Panoptic segmentation, specifically with deep learning, has been reported to outperform the previous approaches.
  • Smart cities — Computer vision and AI play a vital role in constructing smart cities. With the help of state-of-the-art systems, cities can monitor, manage, and optimize all spheres, from utilities to waste management, security, healthcare, education, roads, and much more. Panoptic segmentation offers an accurate and efficient model for smart cities to rely on. Think of the importance panoptic segmentation has for autonomous vehicles and expand it across an entire city.

Key takeaways

Panoptic segmentation is essentially a marriage of instance segmentation and semantic segmentation that provides us with a clear and detailed output covering the entire scene of an image or real-time video. The final output is not limited to only ‘things’ or only ‘stuff’. Instead, each pixel in the image is assigned a label and a corresponding instance ID to provide us with the full picture of the input. It helps improve accuracy when building both ML- and DL-based algorithms for computer vision applications. Could a new task emerge one day that advances image segmentation further? Perhaps, but panoptic segmentation is a major step forward for today’s computer vision tasks and opens many avenues for innovation.
