Multimodal Neurons in Artificial Neural Networks (2021)


We are deeply grateful to Sandhini Agarwal, Daniela Amodei, Dario Amodei, Tom Brown, Jeff Clune, Steve Dowling, Gretchen Krueger, Brice Menard, Reiichiro Nakano, Aditya Ramesh, Pranav Shyam, Ilya Sutskever and Martin Wattenberg.

Author Contributions

Gabriel Goh: Research lead. Gabriel Goh first discovered multimodal neurons, sketched out the project direction and paper outline, and did much of the conceptual and engineering work that allowed the team to investigate the models in a scalable way. This included developing tools for understanding how concepts are built up and decomposed (applied to emotion neurons), developing zero-shot neuron search (which made neurons easily discoverable), and working with Michael Petrov on porting CLIP to Microscope. He subsequently developed faceted feature visualization and text feature visualization.
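Zero-shot neuron search can be sketched as ranking neurons by how well their activations track a text query. The function below is an illustrative sketch, not the project's implementation; it assumes precomputed neuron activations and CLIP-style image and text embeddings, and every name in it is hypothetical:

```python
import numpy as np

def zero_shot_neuron_search(acts, img_emb, txt_emb, top_k=5):
    """Rank neurons by how well their activations track a text query.

    acts:    (n_images, n_neurons) neuron activations for a set of images.
    img_emb: (n_images, d) image embeddings, assumed L2-normalized.
    txt_emb: (d,) text embedding of the query, assumed L2-normalized.
    Returns indices of the top_k neurons whose activations correlate
    most strongly with image-text similarity.
    """
    sims = img_emb @ txt_emb                 # (n_images,) query relevance
    # Pearson correlation between each neuron's activations and sims.
    a = acts - acts.mean(axis=0)
    s = sims - sims.mean()
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(s) + 1e-8
    corr = (a.T @ s) / denom
    return np.argsort(-corr)[:top_k]
```

In practice the embeddings would come from CLIP's encoders; here the point is only the ranking step, which turns a free-text query into an ordering over neurons.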

Chris Olah: Worked with Gabe on the overall framing of the article, actively mentored each member of the team through their work, providing both high- and low-level contributions to their sections, and contributed to the text of much of the article, setting its stylistic tone. He worked with Gabe to understand the relevant neuroscience literature. Additionally, he wrote the sections on region neurons and developed diversity feature visualization, which Gabe used to create faceted feature visualization.


Alec Radford: Developed CLIP. First observed that CLIP was learning to read. Advised Gabriel Goh on project direction on a weekly basis. Upon the discovery that CLIP was using text to classify images, proposed typographical adversarial attacks as a promising research direction.

Shan Carter: Worked on the initial investigation of CLIP with Gabriel Goh. Created multimodal activation atlases to understand the space and geometry of multimodal representations, and neuron atlases, which helped inform the arrangement and display of neurons. Provided much useful advice on the visual presentation of ideas, and helped with many aspects of visual design.

Michael Petrov: Worked on the initial investigation of multimodal neurons by implementing and scaling dataset examples. Discovered, with Gabriel Goh, the original “Spider-Man” multimodal neuron in the dataset examples, along with many more multimodal neurons. Contributed substantially to the engineering of Microscope, both early on and at the end, including helping Gabriel Goh with the difficult technical challenges of porting Microscope to a different backend.


Chelsea Voss†: Investigated the typographic attack phenomenon, both via linear probes and zero-shot, confirming that the attacks were indeed real and effective. Proposed and successfully found “in-the-wild” attacks in the zero-shot classifier, and subsequently wrote the section on typographic attacks. Upon completion of this part of the project, investigated the responses of neurons to rendered text of dictionary words. Also assisted with the organization of neurons into neuron cards.
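The zero-shot setting for these attacks can be illustrated with the cosine-similarity classifier at the heart of CLIP's zero-shot mode. The sketch below uses synthetic embeddings; in practice the vectors come from CLIP's image and text encoders, and a typographic attack pastes rendered text (e.g. the name of the wrong class) onto the object so the image embedding drifts toward that class's caption. The function name and temperature value are assumptions for illustration:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, temperature=0.01):
    """CLIP-style zero-shot classification: softmax over cosine
    similarities between one image embedding and per-class text
    embeddings. A typographic attack succeeds when pasted text shifts
    the image embedding enough to flip the argmax."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True)
    logits = class_text_embs @ image_emb / temperature
    probs = np.exp(logits - logits.max())   # stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs
```

Nothing in the classifier itself inspects pixels; it only compares embeddings, which is why text rendered inside the image can dominate the prediction.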

Nick Cammarata†: Drew the connection between multimodal neurons in neural networks and multimodal neurons in the brain, which became the overall framing of the article. Created the conditional probability plots (regional, Trump, mental health), labeling more than 1,500 images; discovered that negative pre-ReLU activations are often interpretable; and discovered that neurons sometimes exhibit a distinct regime change between medium and strong activations. Wrote the identity and emotion sections, building off Gabriel’s discovery of emotion neurons and discovering that “complex” emotions can be broken down into simpler ones. Edited the overall text of the article and built infrastructure allowing the team to collaborate in Markdown with embeddable components.
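The conditional probability plots can be sketched as a simple binning computation over human-labeled images: for each activation bin of a neuron, estimate the fraction of images carrying each label. This is an illustrative reconstruction, not the original analysis code; the function name and binning choices are assumptions:

```python
import numpy as np

def conditional_probability_by_activation(acts, labels, n_bins=10):
    """Estimate P(label | activation bin) for one neuron.

    acts:   (n_images,) activations of one neuron (e.g. pre-ReLU, so
            negative regimes can be examined too).
    labels: (n_images,) integer category label per image.
    Returns (bin_edges, probs) where probs[b, c] is the fraction of
    images in activation bin b that carry label c."""
    bin_edges = np.linspace(acts.min(), acts.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(acts, bin_edges) - 1, 0, n_bins - 1)
    n_classes = int(labels.max()) + 1
    probs = np.zeros((n_bins, n_classes))
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():   # leave empty bins at zero
            probs[b] = np.bincount(labels[mask],
                                   minlength=n_classes) / mask.sum()
    return bin_edges, probs
```

Plotting each column of `probs` against the bin centers gives a curve like the ones in the article: how the probability of each concept changes as the neuron fires more (or less) strongly.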

Ludwig Schubert: Helped with general infrastructure.


† equal contributors

Discussion and Review

Review 1 - Anonymous
Review 2 - Anonymous
Review 3 - Anonymous


  1. Invariant visual representation by single neurons in the human brain[PDF]
    Quiroga, R.Q., Reddy, L., Kreiman, G., Koch, C. and Fried, I., 2005. Nature, Vol 435(7045), pp. 1102--1107. Nature Publishing Group.
  2. Explicit encoding of multimodal percepts by single neurons in the human brain
    Quiroga, R.Q., Kraskov, A., Koch, C. and Fried, I., 2009. Current Biology, Vol 19(15), pp. 1308--1313. Elsevier.
  3. Learning Transferable Visual Models From Natural Language Supervision[link]
    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G. and Sutskever, I., 2021.
  4. Deep Residual Learning for Image Recognition[PDF]
    He, K., Zhang, X., Ren, S. and Sun, J., 2015. CoRR, Vol abs/1512.03385.
  5. Attention is all you need
    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. and Polosukhin, I., 2017. Advances in neural information processing systems, pp. 5998--6008.
  6. Improved deep metric learning with multi-class n-pair loss objective
    Sohn, K., 2016. Advances in neural information processing systems, pp. 1857--1865.
  7. Contrastive multiview coding
    Tian, Y., Krishnan, D. and Isola, P., 2019. arXiv preprint arXiv:1906.05849.
  8. Linear algebraic structure of word senses, with applications to polysemy
    Arora, S., Li, Y., Liang, Y., Ma, T. and Risteski, A., 2018. Transactions of the Association for Computational Linguistics, Vol 6, pp. 483--495. MIT Press.
  9. Visualizing and understanding recurrent networks[PDF]
    Karpathy, A., Johnson, J. and Fei-Fei, L., 2015. arXiv preprint arXiv:1506.02078.
  10. Object detectors emerge in deep scene cnns[PDF]
    Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. and Torralba, A., 2014. arXiv preprint arXiv:1412.6856.
  11. Network Dissection: Quantifying Interpretability of Deep Visual Representations[PDF]
    Bau, D., Zhou, B., Khosla, A., Oliva, A. and Torralba, A., 2017. Computer Vision and Pattern Recognition.
  12. Zoom In: An Introduction to Circuits
    Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M. and Carter, S., 2020. Distill, Vol 5(3), pp. e00024--001.
  13. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks[PDF]
    Nguyen, A., Yosinski, J. and Clune, J., 2016. arXiv preprint arXiv:1602.03616.
  14. Sparse but not ‘grandmother-cell’ coding in the medial temporal lobe
    Quiroga, R.Q., Kreiman, G., Koch, C. and Fried, I., 2008. Trends in cognitive sciences, Vol 12(3), pp. 87--91. Elsevier.
  15. Concept cells: the building blocks of declarative memory functions
    Quiroga, R.Q., 2012. Nature Reviews Neuroscience, Vol 13(8), pp. 587--597. Nature Publishing Group.
  16. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements
    Barrett, L.F., Adolphs, R., Marsella, S., Martinez, A.M. and Pollak, S.D., 2019. Psychological science in the public interest, Vol 20(1), pp. 1--68. Sage Publications Sage CA: Los Angeles, CA.
  17. Geographical evaluation of word embeddings[PDF]
    Konkol, M., Brychcín, T., Nykl, M. and Hercig, T., 2017. Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 224--232.
  18. Using Artificial Intelligence to Augment Human Intelligence[link]
    Carter, S. and Nielsen, M., 2017. Distill. DOI: 10.23915/distill.00009
  19. Visualizing Representations: Deep Learning and Human Beings[link]
    Olah, C., 2015.
  20. Natural language processing (almost) from scratch
    Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K. and Kuksa, P., 2011. Journal of Machine Learning Research, Vol 12, pp. 2493--2537.
  21. Linguistic regularities in continuous space word representations
    Mikolov, T., Yih, W. and Zweig, G., 2013. Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, pp. 746--751.
  22. Man is to computer programmer as woman is to homemaker? debiasing word embeddings
    Bolukbasi, T., Chang, K., Zou, J.Y., Saligrama, V. and Kalai, A.T., 2016. Advances in neural information processing systems, pp. 4349--4357.
  23. Intriguing properties of neural networks[PDF]
    Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R., 2013. arXiv preprint arXiv:1312.6199.
  24. Visualizing higher-layer features of a deep network[PDF]
    Erhan, D., Bengio, Y., Courville, A. and Vincent, P., 2009. University of Montreal, Vol 1341, pp. 3.
  25. Feature Visualization[link]
    Olah, C., Mordvintsev, A. and Schubert, L., 2017. Distill. DOI: 10.23915/distill.00007
  26. How does the brain solve visual object recognition?
    DiCarlo, J.J., Zoccolan, D. and Rust, N.C., 2012. Neuron, Vol 73(3), pp. 415--434. Elsevier.
  27. Imagenet: A large-scale hierarchical image database
    Deng, J., Dong, W., Socher, R., Li, L., Li, K. and Fei-Fei, L., 2009. 2009 IEEE conference on computer vision and pattern recognition, pp. 248--255.
  28. BREEDS: Benchmarks for Subpopulation Shift
    Santurkar, S., Tsipras, D. and Madry, A., 2020. arXiv preprint arXiv:2008.04859.
  29. Global Weighted Average Pooling Bridges Pixel-level Localization and Image-level Classification[PDF]
    Qiu, S., 2018. CoRR, Vol abs/1809.08264.
  30. Separating style and content with bilinear models
    Tenenbaum, J.B. and Freeman, W.T., 2000. Neural computation, Vol 12(6), pp. 1247--1283. MIT Press.
  31. The feeling wheel: A tool for expanding awareness of emotions and increasing spontaneity and intimacy
    Willcox, G., 1982. Transactional Analysis Journal, Vol 12(4), pp. 274--276. SAGE Publications Sage CA: Los Angeles, CA.
  32. Activation atlas
    Carter, S., Armstrong, Z., Schubert, L., Johnson, I. and Olah, C., 2019. Distill, Vol 4(3), pp. e15.
  33. Adversarial Patch[PDF]
    Brown, T., Mané, D., Roy, A., Abadi, M. and Gilmer, J., 2017. arXiv preprint arXiv:1712.09665.
  34. Synthesizing Robust Adversarial Examples[PDF]
    Athalye, A., Engstrom, L., Ilyas, A. and Kwok, K., 2017. arXiv preprint arXiv:1707.07397.
  35. Studies of interference in serial verbal reactions.
    Stroop, J.R., 1935. Journal of experimental psychology, Vol 18(6), pp. 643. Psychological Review Company.
  36. Curve Detectors
    Cammarata, N., Goh, G., Carter, S., Schubert, L., Petrov, M. and Olah, C., 2020. Distill, Vol 5(6), pp. e00024--003.
  37. An overview of early vision in inceptionv1
    Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M. and Carter, S., 2020. Distill, Vol 5(4), pp. e00024--002.
  38. Deep inside convolutional networks: Visualising image classification models and saliency maps[PDF]
    Simonyan, K., Vedaldi, A. and Zisserman, A., 2013. arXiv preprint arXiv:1312.6034.
  39. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images[PDF]
    Nguyen, A., Yosinski, J. and Clune, J., 2015. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427--436. DOI: 10.1109/cvpr.2015.7298640
  40. Inceptionism: Going deeper into neural networks[HTML]
    Mordvintsev, A., Olah, C. and Tyka, M., 2015. Google Research Blog.
  41. Plug & play generative networks: Conditional iterative generation of images in latent space[PDF]
    Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A. and Yosinski, J., 2016. arXiv preprint arXiv:1612.00005.
  42. Sun database: Large-scale scene recognition from abbey to zoo
    Xiao, J., Hays, J., Ehinger, K.A., Oliva, A. and Torralba, A., 2010. 2010 IEEE computer society conference on computer vision and pattern recognition, pp. 3485--3492.
  43. The pascal visual object classes (voc) challenge
    Everingham, M., Van Gool, L., Williams, C.K., Winn, J. and Zisserman, A., 2010. International journal of computer vision, Vol 88(2), pp. 303--338. Springer.
  44. Fairface: Face attribute dataset for balanced race, gender, and age
    Kärkkäinen, K. and Joo, J., 2019. arXiv preprint arXiv:1908.04913.
  45. A style-based generator architecture for generative adversarial networks
    Karras, T., Laine, S. and Aila, T., 2019. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4401--4410.

Updates and Corrections

If you see mistakes or want to suggest changes, please create an issue on GitHub.



Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.


For attribution in academic contexts, please cite this work as

Goh, et al., "Multimodal Neurons in Artificial Neural Networks", Distill, 2021.

BibTeX citation


@article{goh2021multimodal,
  author = {Goh, Gabriel and Cammarata, Nick and Voss, Chelsea and Carter, Shan and Petrov, Michael and Schubert, Ludwig and Radford, Alec and Olah, Chris},
  title = {Multimodal Neurons in Artificial Neural Networks},
  journal = {Distill},
  year = {2021},
  doi = {10.23915/distill.00030}
}


