The field of artificial intelligence is moving toward creating general systems, and one important property of such a system is the ability to reason over multiple modalities. One of the biggest challenges in applying AI to real-world problems is that the information required to solve most of them is distributed across data from different modalities. We humans are very efficient at combining these various forms of information (visual, audio, or text) from various sources to make informed decisions. Developing an intelligent system that matches human decision making is an AI-complete problem. One approach toward that goal is to reason over a combination of our understanding of individual knowledge bases and make decisions accordingly.

Recent advances in deep learning have demonstrated excellent performance on vision problems such as image classification, object recognition, and scene classification, and on natural language tasks such as semantic understanding. Visual Question Answering (VQA) systems leverage knowledge from both vision and natural language to build a multi-modal knowledge base and generate output. Such a system takes as input an image and a free-form, open-ended, natural language question about that image, and produces a natural language answer as output.

At Insight I worked on creating a VQA system for radiology data. This page describes the project and the concepts used in it. The content is divided into three major sections:

1) What? 2) Why? and 3) How?

The data used to build the project is described in the Dataset section. The presentation I used for the demonstration is embedded at the end of the page.

The project code is accessible through the following GitHub link:

What is RJ?

Radiologist Jr. (RJ) is a visual question answering system I developed while working as an AI fellow at Insight Data Science. In radiology, with its wealth of image data and reports, RJ could assist radiologists in faster reporting and also benefit trainees who have questions about, say, the size of a mass or the presence of a fracture.

RJ: Overview
Source: Jason J. Lau, Soumya Gayen, Asma Ben Abacha & Dina Demner-Fushman, "Data Descriptor: A dataset of clinically generated visual questions and answers about radiology images"

Why RJ?

According to the Clinical Radiology: UK workforce census 2018 report, approximately 31.5 million images were generated across the UK by radiology departments alone. To meet this demand, ~4,000 radiologists spent about 3 million hours and were only able to report 24.4 million images. This results in longer diagnosis times and decreased patient-doctor interaction. RJ helps reduce the time radiologists have to spend looking at an image by providing answers to the questions required for reporting and better diagnosis. Moreover, trainee radiologists often know what questions to ask but are unsure about the answers. RJ can help them with these questions and aid their learning.

Time radiologists spend on a single image. Source:

How was RJ developed?

One of the interesting problems in machine learning is obtaining a useful representation of the data. The ability to learn such representations with little or no supervision is a key challenge in applying artificial intelligence. In this case, in order to generate answers, the model not only needs to understand the semantics of the question but has to do so in the context of an image. The idea of using the co-occurrence of patterns in data was famously summarized by the linguist John Rupert Firth (Firth 1957) as "you shall know a word by the company it keeps".

Let's try to understand this intuitively. The idea is to convert the data into a vector, often termed an embedding vector. For example, take an image as the raw input. How these vectors are obtained is discussed in the following paragraphs; for now, assume there is an oracle that, given an image, outputs an embedding vector. This embedding vector is a feature representation of the input image: the image is no longer a 3-channel width x height picture but a long vector of numbers. If I create embedding vectors for all the images in my training set, certain features will co-occur for certain images, and in the latent space of arbitrary dimension (dimension = length of the feature vector) those images will likely be neighbors.

To make this more concrete, the dataset used to develop RJ contains both CT and X-ray images. Intuitively, according to the explanation above, an X-ray image should be closer to other X-ray images than to a CT image. In the figure below I visualized the images by plotting a scatter plot of just the first two features of their embedding vectors, and voila! They indeed flock together.

Scatter plot of the images in the dataset using the first two embedding features
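The "flocking" can also be checked numerically rather than visually. A minimal sketch with synthetic embeddings standing in for the real ones (the cluster centres, spread, and dimensionality are assumptions for illustration, not values from the project):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding vectors: in the real project these come from the
# convolutional autoencoder; here we fake two clusters, one per modality.
xray = rng.normal(loc=0.0, scale=0.5, size=(20, 64))   # 20 "X-ray" embeddings
ct = rng.normal(loc=3.0, scale=0.5, size=(20, 64))     # 20 "CT" embeddings

embeddings = np.vstack([xray, ct])
modality = ["xray"] * 20 + ["ct"] * 20

def nearest_neighbor(idx):
    """Index of the closest other embedding (Euclidean distance)."""
    d = np.linalg.norm(embeddings - embeddings[idx], axis=1)
    d[idx] = np.inf                      # exclude the query point itself
    return int(np.argmin(d))

# Every image's nearest neighbor should share its modality.
same = sum(modality[i] == modality[nearest_neighbor(i)] for i in range(40))
print(same)  # 40: all neighbors share the query's modality
```

If images of the same modality really do cluster in the latent space, each image's nearest neighbor will be of the same modality, which is exactly what the count confirms here.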

Since we only need a good representation of the image without associating it with any label, obtaining these vectors is a problem of unsupervised representation learning. Three main kinds of embeddings are involved in the development of RJ: 1) image embeddings, 2) question text embeddings, and 3) a joint image-and-text embedding. These embedding vectors are then used for further decision making.

Now that we know what embeddings are, let's dive into how they are obtained.

Convolution Autoencoders: Image Embeddings

In a supervised learning task we try to learn a function that takes data x as input and maps it to a label y. When creating an embedding, we do not want a function defined in terms of labels but one that is the best representation of the data x itself. An autoencoder neural network is an unsupervised learning algorithm that applies back-propagation, setting the target values equal to the inputs. Essentially, an autoencoder learns the function f(x) ≈ x.

A Convolutional Autoencoder (CAE) is a type of autoencoder that trains convolutional layers to learn a new representation of the data. It can be seen as two networks, an encoder and a decoder, joined together, and the loss used for training is calculated as L(decoder(encoder(x)), x).
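The loss L(decoder(encoder(x)), x) can be made concrete with a toy autoencoder in NumPy. This is a sketch of the idea only: it uses two fully connected linear layers rather than the convolutional architecture RJ uses, and the data, sizes, and learning rate are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples of dimension 16, compressed to a 4-d latent code.
X = rng.normal(size=(100, 16))

W_enc = rng.normal(scale=0.1, size=(16, 4))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(4, 16))   # decoder weights

lr = 0.05
losses = []
for _ in range(300):
    z = X @ W_enc                 # encoder: latent representation
    X_hat = z @ W_dec             # decoder: reconstruction
    err = X_hat - X               # the target IS the input
    losses.append(float(np.mean(err ** 2)))   # L(decoder(encoder(x)), x)

    # Backprop through the two linear layers and update by gradient descent.
    grad_dec = z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(losses[0], losses[-1])  # reconstruction error falls as training proceeds
```

The key point the sketch shows: no labels appear anywhere; the network is trained purely to reproduce its own input through the low-dimensional bottleneck z.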

Convolution Auto Encoder

Here the encoder network performs convolution operations on the image to create the latent representation, while the decoder network takes this latent representation and performs transposed convolution operations to recreate the original image. The vector we are interested in is this latent representation.

The convolutional autoencoder architecture for RJ is the same as described in the paper Pre-Training CNNs Using Convolutional Autoencoders, shown in the figure below.

CAE Architecture

The same structure is used for both the encoder and decoder networks, except that the convolution layers of the encoder are replaced by transposed convolution layers in the decoder.
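The shape relationship between a convolution and its transpose, which lets the decoder undo the encoder's downsampling, is easy to verify in one dimension. A minimal NumPy sketch (the kernel, stride, and signal length are illustrative assumptions):

```python
import numpy as np

def conv1d(x, k, stride=2):
    """Valid 1-D convolution: output length (len(x) - len(k)) // stride + 1."""
    out_len = (len(x) - len(k)) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + len(k)], k)
                     for i in range(out_len)])

def conv1d_transposed(y, k, stride=2):
    """Transposed 1-D convolution: output length (len(y) - 1) * stride + len(k)."""
    out = np.zeros((len(y) - 1) * stride + len(k))
    for i, v in enumerate(y):
        out[i * stride : i * stride + len(k)] += v * k   # scatter-add the kernel
    return out

x = np.arange(8.0)              # a "signal" of length 8
k = np.array([1.0, 0.5])        # a 2-tap kernel

y = conv1d(x, k)                # encoder step: length 8 -> 4
x_up = conv1d_transposed(y, k)  # decoder step: length 4 -> 8
print(len(y), len(x_up))        # 4 8
```

Note that the transposed convolution restores the original shape, not the original values; the decoder has to learn kernel weights that make the reconstruction close to the input.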

LSTM Autoencoder: Question Embeddings

The previous section covered image embeddings; this section explains how text embeddings are obtained from the data. The same autoencoding concept used for images can be applied to text, except that the encoder and decoder now use LSTM layers instead of convolutional layers.
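Before any LSTM layer can process a question, the text must be turned into fixed-length sequences of integer token ids. A minimal sketch of that preprocessing step (the example questions, word-level vocabulary, and maximum length are assumptions, not details from the project):

```python
# Example questions of the kind found in a radiology VQA dataset.
questions = [
    "is there a fracture",
    "what is the size of the mass",
]

# Build a word-level vocabulary; id 0 is reserved for padding.
vocab = {"<pad>": 0}
for q in questions:
    for word in q.split():
        vocab.setdefault(word, len(vocab))

MAX_LEN = 8

def encode(question):
    """Map words to ids, then truncate or right-pad to MAX_LEN."""
    ids = [vocab[w] for w in question.split()]
    return ids[:MAX_LEN] + [0] * (MAX_LEN - len(ids))

batch = [encode(q) for q in questions]
print(batch[0])  # [1, 2, 3, 4, 0, 0, 0, 0]
```

The resulting equal-length id sequences are what the LSTM encoder consumes, typically after an embedding-lookup layer maps each id to a dense vector.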

Visual Question Answering Classifier:

The embeddings obtained from images and questions can then be combined using various techniques: concatenation, element-wise sum or product, bilinear pooling, attention models, Bayesian models, etc. The combined vector can then be fed into either a classifier model or a generative model. For RJ, I used a combination of the two simplest approaches: the features are concatenated and used in a classification problem where each class is the single most important word of the answer sentence. The figure below shows the complete model used in RJ.
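Concatenation followed by a classifier is simple enough to sketch end to end. The dimensions and the single random softmax layer below are illustrative assumptions; in the real model the classifier weights are learned jointly with the rest of the network.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM, TXT_DIM, N_CLASSES = 64, 32, 10

# Stand-ins for the two embeddings; in RJ these come from the autoencoders.
img_emb = rng.normal(size=IMG_DIM)
q_emb = rng.normal(size=TXT_DIM)

# Fusion by concatenation: the simplest way to combine the two modalities.
joint = np.concatenate([img_emb, q_emb])          # shape (96,)

# A single softmax layer over answer-word classes.
W = rng.normal(scale=0.1, size=(IMG_DIM + TXT_DIM, N_CLASSES))
b = np.zeros(N_CLASSES)

logits = joint @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax over answer words

print(joint.shape, probs.shape)
```

The predicted answer word is then simply the class with the highest probability; richer fusion schemes such as bilinear pooling or attention replace the concatenation step while the classifier head stays the same.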

VQA model used in RJ


Data Set:

The dataset used to develop RJ is an open dataset available through the paper "Data Descriptor: A dataset of clinically generated visual questions and answers about radiology images," published in Nature Scientific Data. There are about 315 images in the dataset, with 3,515 question-and-answer pairs associated with them.

RJ in Action: