Source: Google AI Blog
Machine learning crosses over into Nature! Researchers at UC Berkeley and Google published findings in Nature exploring how similarly facial expressions are used around the world. The headline result: about 70% of expression-context associations are shared across world regions.
Do people around the world express a smile, or sadness, in the same way?
It seems intuitive that facial expressions are consistent: whether a person is from Brazil, India, or Canada, their smile on seeing a close friend, or their excitement at a fireworks display, looks essentially the same.
But is this intuition correct? Is the link between facial expressions and the contexts that elicit them really universal across geography? What do people's smiles or frowns tell us about how people in different cultures relate to one another, and where are those cultures similar or different?
Scientists have tried to answer these questions and reveal the extent to which expressions are shared across cultures and geographies, often using survey-based research that leans heavily on local language, moral norms, and values. But such studies do not scale, and they often end up with small samples and inconsistent results.
Studying facial movement patterns provides a more direct understanding of expressive human behavior than survey-based studies.
However, analyzing the actual use of facial expressions in everyday life requires researchers to work through millions of hours of continuous real-world footage, a task that is extremely tedious and requires a great deal of manual work.
In addition, facial expressions and the contexts in which they are displayed are complex and require large samples to draw statistically reliable conclusions.
While existing studies have yielded conflicting answers to the question of whether facial expressions are universal in specific contexts, extending the research with machine learning techniques may provide clearer ones.
The article Sixteen facial expressions occur in similar contexts worldwide, published in Nature in 2021, presents the first large-scale, worldwide study of how facial expressions are actually used in everyday life, analyzed with deep neural networks.
Drawing on 6 million publicly available videos from 144 countries, the paper analyzes the contexts in which people use various facial expressions and shows that rich nuances in facial behavior, including subtle expressions, are used in similar social contexts around the world.
Measuring facial expressions with deep neural networks

Facial expressions are not static. An expression that at first appears to be anger may turn out to be awe, surprise, or confusion; its interpretation depends on how the face moves over time.
Thus, the challenge in building a neural network that understands facial expressions is that it must interpret an expression in its temporal context. Training such a system requires a large, diverse, cross-cultural video dataset with thoroughly annotated expressions.
To build the dataset, annotators manually searched a broad collection of publicly available videos to identify those likely to contain the pre-selected expression categories.
To ensure that videos matched the regions they represent, selection prioritized videos whose original geographic location was available.
Faces in the videos were detected with a deep convolutional neural network similar to the one behind Google's cloud-based face detection API, and tracked through each clip with a traditional optical-flow-based approach.
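The tracking step can be illustrated with a simplified stand-in. The paper links detections across frames with optical flow; the sketch below instead chains per-frame bounding boxes by intersection-over-union (IoU) overlap, a common simpler alternative. All function names and thresholds here are illustrative, not from the paper.

```python
# Illustrative sketch only: the paper tracks faces with optical flow; this
# simpler stand-in links per-frame detections by bounding-box overlap (IoU).

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def link_tracks(frames, threshold=0.5):
    """Greedily chain per-frame detections into face tracks.

    frames: list of per-frame lists of boxes.
    Returns a list of tracks, each a list of (frame_index, box) pairs.
    """
    tracks = []
    for t, boxes in enumerate(frames):
        for box in boxes:
            # Extend the best-overlapping track that ended on the previous frame.
            best = max(
                (tr for tr in tracks if tr[-1][0] == t - 1),
                key=lambda tr: iou(tr[-1][1], box),
                default=None,
            )
            if best is not None and iou(best[-1][1], box) >= threshold:
                best.append((t, box))
            else:
                tracks.append([(t, box)])
    return tracks
```

Each resulting track is one face followed through consecutive frames; a face that disappears and reappears starts a new track.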
Using an interface similar to Google's crowdsourcing platform, annotators flagged any facial expression that appeared during a clip, assigning it to one of 28 categories.
Because the goal was to sample how an average person understands an expression, the annotators were neither instructed nor trained, and no sample expressions or definitions were provided.
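One plausible way to turn such free-form judgments into labels (the paper's exact aggregation scheme is not described here, so this is an assumption) is to record, for each clip, the fraction of raters who selected each category:

```python
from collections import Counter

# Hypothetical aggregation of untrained annotators' judgments: each rater tags
# a clip with any of the 28 expression categories, and the clip's label
# distribution is the fraction of raters choosing each category.
# Category names below are illustrative.

def aggregate_ratings(ratings):
    """ratings: list of per-rater sets of category names.
    Returns {category: fraction of raters who selected it}."""
    counts = Counter(tag for rater in ratings for tag in rater)
    n = len(ratings)
    return {tag: c / n for tag, c in counts.items()}

clip_ratings = [{"amusement"}, {"amusement", "surprise"}, {"joy"}]
label_distribution = aggregate_ratings(clip_ratings)
```

Soft label distributions like this preserve disagreement between raters instead of forcing a single "correct" expression per clip.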
Additional experiments are discussed in the text to evaluate whether the models trained from these annotations are biased.
The face-detection algorithm builds a sequence of locations for each face throughout the video. A pre-trained Inception network then extracts, for each face, the features most salient to its expression.
These features were then fed into a long short-term memory (LSTM) network, a type of recurrent neural network that can model how facial expressions evolve over time while retaining information from earlier frames.
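A minimal sketch of this pipeline, with random weights standing in for trained parameters and illustrative (not the paper's) layer sizes: per-frame feature vectors flow through a hand-rolled LSTM cell, and the final hidden state is scored against the 16 expression classes.

```python
import numpy as np

# Sketch only: per-frame face features (e.g. from a pretrained CNN) are fed to
# an LSTM; its final hidden state is scored against 16 expression classes.
# Sizes are illustrative and weights are random stand-ins for trained ones.

rng = np.random.default_rng(0)

FEAT, HIDDEN, N_CLASSES = 8, 16, 16   # feature dim, LSTM width, expressions

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per LSTM gate, acting on [input; previous hidden state].
W = {g: rng.normal(scale=0.1, size=(HIDDEN, FEAT + HIDDEN)) for g in "ifco"}
b = {g: np.zeros(HIDDEN) for g in "ifco"}
W_out = rng.normal(scale=0.1, size=(N_CLASSES, HIDDEN))

def classify_track(frame_features):
    """Run the LSTM over a (T, FEAT) array of per-frame features;
    return one score per expression class."""
    h = np.zeros(HIDDEN)
    c = np.zeros(HIDDEN)
    for x in frame_features:
        z = np.concatenate([x, h])
        i = sigmoid(W["i"] @ z + b["i"])   # input gate
        f = sigmoid(W["f"] @ z + b["f"])   # forget gate
        o = sigmoid(W["o"] @ z + b["o"])   # output gate
        g = np.tanh(W["c"] @ z + b["c"])   # candidate cell state
        c = f * c + i * g                  # cell state carries the past forward
        h = o * np.tanh(c)
    return W_out @ h

scores = classify_track(rng.normal(size=(30, FEAT)))  # a 30-frame face track
```

The cell state `c` is what lets the model carry information across frames, so a brief grimace early in a clip can still inform the final prediction.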
To ensure that the model makes consistent predictions across demographic groups, the researchers evaluated its fairness on an existing dataset built with similar facial-expression labels, covering the 16 best-recognized expressions.
The model performed consistently across the demographic groups represented in the evaluation dataset, indicating no measurable bias from its training annotations. The model annotated the 16 facial expressions across 1,500 images.
To understand the contexts of facial expressions across millions of videos, the experiments also characterized what surrounded each captured expression in its video. The paper uses neural networks that capture fine-grained content and automatically recognize context.
The first DNN combines textual features associated with the video (title and description) with its actual visual content; this is the video-topic model.
The second DNN relies only on textual features without any visual information (text-topic model).
These models predicted tens of thousands of category labels describing the video, and in this experiment, these models were able to identify hundreds of unique contexts (e.g., weddings, sporting events, or fireworks) to demonstrate the diversity of the analyzed data.
In the first experiment in the paper, the researchers analyzed 3 million public videos taken by cell phones, which are more likely to contain natural expressions.
The facial expressions appearing in the videos were then correlated with contextual annotations from the video-topic model, and the 16 facial expressions were found to have distinct associations with everyday social contexts that were consistent around the world. For example, joyful expressions were more likely to co-occur with pranks, excited expressions with fireworks, and triumphant expressions with sporting events.
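The core statistic of this experiment can be sketched as a correlation between binary indicators: does expression e appear in a video, and does context c describe it? The data below is synthetic (expressions tied to contexts plus label noise), and the function is an illustrative reconstruction, not the paper's code.

```python
import numpy as np

# Hedged sketch of the first experiment's core statistic: across videos, how
# strongly does each expression co-occur with each context? Toy binary
# indicator matrices stand in for the paper's data.

def expression_context_correlations(expr, ctx):
    """expr: (n_videos, n_expressions) 0/1; ctx: (n_videos, n_contexts) 0/1.
    Returns an (n_expressions, n_contexts) matrix of Pearson correlations."""
    e = (expr - expr.mean(0)) / expr.std(0)
    c = (ctx - ctx.mean(0)) / ctx.std(0)
    return (e.T @ c) / expr.shape[0]

rng = np.random.default_rng(1)
ctx = rng.integers(0, 2, size=(1000, 3)).astype(float)  # e.g. wedding, sport, fireworks
expr = ctx.copy()                                       # expressions tied to contexts...
flip = rng.random(expr.shape) < 0.2                     # ...with 20% label noise
expr[flip] = 1 - expr[flip]

assoc = expression_context_correlations(expr, ctx)
# Diagonal entries (matched expression/context pairs) come out strongly positive.
```

Repeating this per world region yields the region-level association matrices that the cross-cultural comparison is built on.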
These results suggest that psychologically relevant context matters more to the use of facial expressions than factors specific to individuals, cultures, or societies.
The second experiment analyzed 3 million videos, this time using contexts annotated by the text-topic model, which sees no visual information. The results confirmed that the findings of the first experiment were not driven by the video-topic model implicitly factoring the facial expressions themselves into its content labels.
In both experiments, the correlations between expressions and contexts held up well across cultures. To quantify exactly how similar these expression-context correlations were across the 12 world regions studied, the researchers computed second-order correlations between each pair of regions: the relationship between the expression-context correlations found in one region and those found in another.
The final conclusion was that 70 percent of the contextual expression associations found in each region were shared worldwide.
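A second-order correlation of this kind can be sketched as follows: summarize each region by its expression-by-context association matrix, then Pearson-correlate flattened matrices between region pairs. The data here is synthetic (a shared worldwide signal plus region-specific noise), with the 16-expression, 12-region shapes taken from the article.

```python
import numpy as np

# Sketch of the "second-order correlation": each region is summarized by its
# expression-by-context association matrix; similarity between two regions is
# the Pearson correlation of those matrices, flattened. Toy data below.

def second_order_correlation(assoc_a, assoc_b):
    """Pearson correlation between two flattened association matrices."""
    return np.corrcoef(assoc_a.ravel(), assoc_b.ravel())[0, 1]

rng = np.random.default_rng(2)
shared = rng.normal(size=(16, 12))                 # worldwide expression-context signal
regions = [shared + 0.5 * rng.normal(size=shared.shape) for _ in range(12)]

# Pairwise similarity across the 12 regions.
sims = [
    second_order_correlation(regions[i], regions[j])
    for i in range(12) for j in range(i + 1, 12)
]
```

Because every toy region mixes the same shared signal with independent noise, all pairwise second-order correlations land well above zero, mirroring the mostly-shared structure the paper reports.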
Machine learning enabled the researchers to analyze millions of videos from around the world, and the evidence supports the hypothesis that facial expressions appear in similar contexts across cultures, at least to a substantial degree.
The findings also leave room for cultural differences: while about 70% of expression-context associations are shared worldwide, the remaining 30% vary between regions. Expression-context associations in neighboring world regions are generally more similar than those in distant regions, suggesting that the geographic spread of human culture may also shape the meaning of facial expressions.
This work shows that machine learning can help us better understand ourselves and identify common elements of human communication across cultures. Tools such as deep neural networks make it possible to bring large, diverse datasets to bear on scientific discovery, giving us greater confidence in statistical findings.
Posted by: CoinYuppie. Reprinted with attribution to: https://coinyuppie.com/cv-in-roll-psychologists-learn-face-recognition-train-6-million-videos-to-distinguish-expressions-around-the-world/