We know that Google’s image analysis is good. Open Google Photos and type the word ‘dog’, and every photo in your library in which a dog appears will show up. Type in tree, pizza or car and the same thing happens. Google is also able to identify people, although for privacy reasons simply typing in a name does not work as well.
For years there has also been reverse image search, where we can upload a photo and see images similar to the one we are looking for. But that is just the tip of the iceberg of what Google is able to identify and analyze in an image.
Computers describing images, the latest in AI
To train its algorithms, Google has Open Images, a huge database of annotated images that it uses to “practice” its neural networks: in total, almost 60 million image-level labels spread across some 20,000 categories.
Its first version arrived in 2016, and this February Open Images V6 was launched, a new version that adds a new layer of analysis to better understand what is seen in each photo. Initially there were descriptions such as “a red car” or “a guitar”, but over time the database has been improved so that the algorithm can work with deeper levels of information.
A total of 9 million images carry annotations, with some 36 million labels at various levels. In this sixth version of Open Images, the Google image set has expanded those annotations and introduces what the company calls “localized narratives”: a new form of contextual image information combining text and a synchronized voice recording that describes the photo while a cursor moves over it.
Google currently has more than 500,000 images with these new narratives: a base large enough to train its algorithm so that, in the future, it can produce similar descriptions for other images. The video below shows how this new technology works.
The voice begins by describing the center of the image: the colors, the clothing and the objects the subject is carrying. These localized narratives are generated by annotators who describe the image aloud while marking the position of each object with a cursor gesture. The aim is to give Google’s AI a database it can work with constantly, with a clear idea of where each object is located within an image.
Structuring the objects in the image through voice, text and a cursor
Previously, when the Google algorithm worked with photos of “a dog” it knew where the dog was located, since it knows how to tell the animal apart from the sky or the ground. But if we talk about the “dog’s ear” or the “orange hat”, the exact position is harder to pin down, and there was no clear structure recording where each thing sits. With these localized narratives, Google has a tool to tell its algorithm more precisely what each thing is.
To make the descriptions as accessible and structured as possible, annotators manually transcribe their descriptions and relate them to different colors. This allows them to generate “zones” within the image and to attach a text to each object in a photo. Google therefore has a photo, a voice describing it, a transcript and a mouse trace: several elements that, synchronized, give Google a fairly accurate description of what an image contains.
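One way these “zones” could be derived, sketched below under my own assumptions rather than Google’s published method: take the cursor positions recorded while a phrase was being spoken and collapse them into a rough bounding box for the object that phrase names.

```python
def region_for_span(points, start, end):
    """Rough bounding box (xmin, ymin, xmax, ymax) for one phrase.

    points: iterable of (x, y, t) cursor samples, coords in 0..1,
            t in seconds on the same clock as the voice recording.
    start, end: the time window during which the phrase was spoken.
    Returns None if no cursor sample falls inside the window.
    """
    window = [(x, y) for (x, y, t) in points if start <= t <= end]
    if not window:
        return None
    xs = [x for x, _ in window]
    ys = [y for _, y in window]
    return (min(xs), min(ys), max(xs), max(ys))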
One of the limitations of existing descriptions or categories is that it is difficult to generate a direct link between vision and language. The subtitle can be very specific, but Google did not have any tool to specify which object was referred to in each word. Now with the combination of these localized narratives Google finally has a starting point to further refine its understanding of images.
With the latest version of Open Images, the Google database works with concepts as specific as “a man skateboarding”, “a man and a woman helping each other, jumping and laughing” or “a dog catching a frisbee”. Because this difference is very relevant to offer the most appropriate image in each moment.
According to Open Images V6 data, more than 2.5 million annotations are included of humans performing independent actions such as “jumping”, “smiling” or “throwing themselves on the ground”. So that when the algorithm is asked about what is in an image, it can not only tell us if there is a man or an animal, but it can also give details as specific as the action it is performing, if its jacket is big, what color are the shoes or any other type of details.
While the database that Google AI works on is created to work only with humans for now, the more accurate the data processed by Google AI, the better the results obtained by the algorithm in searching and classifying images according to their content.