Engineering / Computer Vision

Audio Vision

A prototype and publication for a new generation of navigational aids for the visually impaired using image-to-sound translation.


As humans, we have five senses, each serving its own unique purpose; they are not interchangeable. For example, it is very hard to imagine tasting music, seeing a smell, or hearing an image. But what if we could somehow translate the data from one sense and present it to another? What if we could, for example, hear an image? How would it sound, and what applications could this have?

This initial thought experiment led me to the idea of creating a sound-based navigational aid for people with visual impairment, one that translates the depth information of a 3D camera into sound waves of different frequencies. The project eventually turned into my bachelor thesis, for which I built and programmed a prototype that was tested in different scenarios. The paper is linked below and contains more detailed technical information, as well as the exact results of the survey.

Computer vision has become a very powerful tool: it allows us to create detailed 3D maps of an area and to characterize objects and people through image recognition. Pairing this with sound-generating algorithms can create an intuitive auditory reflection of a 3D space. This could not only be a new way to experience the world around you, but also a very powerful aid for people with visual impairment.

Obviously, this raises the question of how to make an intuitive soundscape. What, for example, should a shelf sound like, or a pizza? Maybe it would be more intuitive to categorize such objects as "furniture" and "food"? Or maybe we should just try to convey their basic shape?

As an engineer, you always try the simplest solution to a problem first and then add functionality and complexity step by step. So my first prototype conveys only the rough physical shape of the surrounding world.

To do this, I use the image from a depth camera. To further simplify the problem, the resolution is reduced by calculating the mean value of each cluster of pixels. A sine wave is then generated for each cluster on the left and right audio channels. The stereo effect conveys the horizontal position of a cluster, the pitch is altered according to its vertical position, and the closer a cluster gets, the louder its wave becomes.
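The mapping can be sketched in a few lines of NumPy. This is a minimal illustration of the principle rather than the thesis code: the function name, the cluster grid, and the frequency and depth ranges are all assumptions chosen for the example.

```python
import numpy as np

def depth_to_stereo(depth, clusters=(4, 4), duration=0.5, sr=44100,
                    f_min=220.0, f_max=880.0, max_depth=4.0):
    """Map a depth image (meters) to a stereo audio buffer.

    Each pixel cluster is reduced to its mean depth. A sine wave per
    cluster encodes: vertical position as pitch, horizontal position
    as stereo panning, and proximity as loudness.
    """
    rows, cols = clusters
    h, w = depth.shape
    # Downsample: mean depth per cluster of pixels.
    means = depth[:h - h % rows, :w - w % cols] \
        .reshape(rows, h // rows, cols, w // cols).mean(axis=(1, 3))

    t = np.arange(int(duration * sr)) / sr
    left = np.zeros_like(t)
    right = np.zeros_like(t)
    for r in range(rows):
        # Higher rows in the image get a higher pitch.
        freq = f_max - (f_max - f_min) * r / max(rows - 1, 1)
        for c in range(cols):
            # Closer clusters are louder (amplitude clipped to [0, 1]).
            amp = np.clip(1.0 - means[r, c] / max_depth, 0.0, 1.0)
            pan = c / max(cols - 1, 1)  # 0 = far left, 1 = far right
            wave = amp * np.sin(2 * np.pi * freq * t)
            left += (1.0 - pan) * wave
            right += pan * wave
    # Normalize the mix so it never clips.
    peak = max(np.abs(left).max(), np.abs(right).max(), 1e-9)
    return np.stack([left / peak, right / peak], axis=1)
```

Called repeatedly on successive depth frames, this produces a continuously updated soundscape; the real prototype additionally has to deal with sensor noise and smooth transitions between frames.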

The prototype was built from a Microsoft Kinect sensor, a laptop running custom code, and a pair of glasses with downward-facing Bluetooth speakers, so that surrounding sounds are not blocked. In a test environment the prototype showed very promising results, demonstrating the feasibility of a system based on this principle.

Credits

Thanks to the Automation and Control Institute (ACIN) at the Technical University of Vienna for their help.