We present a coarse-to-fine method that decomposes the original image description into a skeleton sentence and its attributes, then generates the skeleton sentence and attribute phrases separately. With this decomposition, our method generates more accurate and more novel descriptions than the previous state of the art.
Moreover, our algorithm can generate descriptions of varying length, benefiting from the separate control of the skeleton and attributes. This enables image description generation that better accommodates user preferences.
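As a toy illustration of the decomposition idea (with the caveat that this sketch uses a fixed adjective list as a stand-in for the learned attribute detector, and the function names are hypothetical):

```python
# Stand-in attribute vocabulary; the actual method learns attributes,
# it does not use a fixed word list.
ATTRIBUTES = {"small", "white", "green", "furry"}

def decompose(description):
    """Split a description into a skeleton sentence and its attribute phrases."""
    words = description.split()
    skeleton = [w for w in words if w not in ATTRIBUTES]
    attributes = [w for w in words if w in ATTRIBUTES]
    return " ".join(skeleton), attributes

skeleton, attrs = decompose("a small white dog runs on the green grass")
# skeleton: "a dog runs on the grass"; attrs: ["small", "white", "green"]
```

Generating the skeleton and the attributes separately is what allows the description length to be varied: attributes can be re-attached selectively to the same skeleton.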
For the detailed project description, please visit the project homepage.
Joint Album Curation-Recognition
Continuing from the event-specific image importance project, we attempt to solve both tasks simultaneously: album-wise event recognition and image-wise importance prediction.
We collected an album dataset with both event type labels and image importance labels, refined from the existing CUFED dataset.
We propose a hybrid system consisting of three parts: a Siamese-network-based event-specific image importance predictor, a Convolutional Neural Network (CNN) that recognizes the event type, and a Long Short-Term Memory (LSTM)-based sequence-level event recognizer. We also propose an iterative updating procedure that jointly refines the event type and image importance score predictions.
We experimentally verified that image importance score prediction and event type recognition each improve the performance of the other.
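The shape of the iterative updating procedure can be sketched as follows. This is a minimal NumPy illustration only: random arrays stand in for the outputs of the Siamese, CNN, and LSTM networks, and the update rules are simplified expectation/weighting steps, not the exact equations of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_events = 8, 5

# Stand-ins for learned network outputs (assumption: random values
# replace the real siamese / CNN / LSTM predictions).
event_scores = rng.random((n_images, n_events))          # per-image event evidence
importance_given_event = rng.random((n_images, n_events))  # importance conditioned on event

event_belief = np.full(n_events, 1.0 / n_events)  # start from a uniform event prior
for _ in range(3):  # iterative refinement
    # Importance as an expectation over the current event belief.
    importance = importance_given_event @ event_belief
    # Event belief as an importance-weighted vote over images.
    weights = importance / importance.sum()
    logits = weights @ event_scores
    event_belief = np.exp(logits) / np.exp(logits).sum()
```

Each pass lets the importance estimate sharpen the event belief and vice versa, which is the intuition behind the two tasks helping each other.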
For the ML-CUFED dataset, please visit the project homepage.
Event-specific Image Importance
When creating a photo album of an event, people typically select a few important images to keep or share. Modeling this selection process can assist automatic photo selection and album summarization. In this project, we show that the selection of important images is consistent among different viewers, and that this selection process depends on the event type of the album. We introduce the concept of event-specific image importance and propose a Convolutional Neural Network (CNN)-based method to predict the importance score of each image in an event album, using a novel rank loss function and a progressive training scheme. Results demonstrate that our method significantly outperforms various baseline methods. We also introduce the CUration of Flickr Events Dataset (CUFED) for the study of event-specific image importance. For the dataset, please visit the project homepage.
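To convey the flavor of a rank loss on importance scores, here is a minimal pairwise margin formulation. This is a generic margin-ranking sketch with an illustrative margin value, not the specific loss function proposed in the paper.

```python
def pairwise_rank_loss(score_pos, score_neg, margin=0.5):
    """Margin rank loss: penalize when a less important image scores
    within `margin` of a more important image from the same album."""
    return max(0.0, margin - (score_pos - score_neg))

# Ordering satisfied with room to spare: no loss.
assert pairwise_rank_loss(0.9, 0.2) == 0.0
# Ordering satisfied but within the margin: positive loss.
assert pairwise_rank_loss(0.4, 0.3) > 0.0
```

Training on such pairs pushes the network to rank important images above unimportant ones within each album, rather than to regress absolute scores.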
Urban Tribes Recognition
Recognition of people's social styles is an interesting but relatively unexplored task. Recognizing "style" appears to be quite a different problem from categorization; it is like recognizing a letter's font as opposed to recognizing the letter itself. We solved this problem with features extracted from a deep convolutional network pre-trained on ImageNet (using Caffe). Combining the results from individuals in group pictures and from the group itself, with some fine-tuning of the network, we cut the previous state-of-the-art error almost in half, raising the recognition rate from 46% to 71%. To explore how the networks perform this task, we computed the mutual information between the ImageNet output category activations and the urban tribe categories, and found, for example, that bikers are well-categorized as tobacco shops, and that better-recognized social groups have more highly correlated ImageNet categories. This gives us insight into the features useful for categorizing urban tribes.
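The mutual information analysis can be illustrated on discrete labels, e.g. each image's top ImageNet category versus its urban tribe label. This is a textbook MI computation on toy arrays, not the exact analysis pipeline of the project (which works with continuous category activations).

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete label arrays,
    e.g. top ImageNet category per image vs. its urban-tribe label."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

# Identical labelings share full information; independent ones share none.
labels = [0, 0, 1, 1]
assert mutual_information(labels, labels) > 0
assert np.isclose(mutual_information(labels, [0, 1, 0, 1]), 0.0)
```

A high MI between a tribe label and an ImageNet category (bikers vs. tobacco shops, in the example above) indicates that the pre-trained features already carry signal for that social group.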
Real-time Hand Posture Recognition with Kinect
Hand posture recognition (HPR) is quite a challenging task, due both to the difficulty of detecting and tracking hands with ordinary cameras and to the limitations of traditional manually selected features. We proposed a two-stage HPR system for sign language recognition using a Kinect sensor; I mainly worked on the hand detection and tracking stage. We proposed an effective hand detection and tracking algorithm that incorporates both color and depth information, without requiring a uniform-colored or static background. It handles situations in which hands are very close to other parts of the body or are not the objects nearest to the camera, and it allows for occlusion of hands by faces or other hands. In the second stage, we apply Deep Neural Networks to automatically learn features from hand posture images that are insensitive to movement, scaling, and rotation. The recognition rate on a 36-posture dataset is 98.12%.
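The core idea of fusing the two cues can be sketched as combining a skin-color mask with a depth-band mask. The thresholds and the 2x2 "images" below are purely illustrative; the actual system uses a more elaborate detection and tracking algorithm.

```python
import numpy as np

# Toy per-pixel inputs (assumption: tiny arrays stand in for real frames).
depth = np.array([[0.4, 0.9],
                  [0.5, 2.0]])                 # depth in meters
skin = np.array([[True, True],
                 [False, True]])               # skin-color classifier output

# Keep pixels at a plausible hand distance AND classified as skin.
depth_mask = (depth > 0.3) & (depth < 1.0)     # illustrative distance band
hand_mask = skin & depth_mask
# hand_mask -> [[True, True], [False, False]]
```

Requiring both cues to agree is what lets the method cope with cluttered backgrounds and with hands that are not the nearest object to the camera.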
Supervising: Object Classification using a Turtlebot
During the summer of 2014, I supervised Kevin Xiong and Evan Phibbs in using the Turtlebot, a robot running on ROS, for object recognition.
The robot can be placed anywhere in the room and will move forward until it finds the first object. Once the target object is localized, the Turtlebot circles it, taking pictures from different views, and recognizes the object as a new or existing class. A convolutional neural network is used for feature extraction, and an SVM is used for object recognition and new-object learning.
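The feature-plus-SVM stage can be sketched with scikit-learn. Random, well-separated vectors stand in for the CNN features here; the real system extracts features from the robot's camera images.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Stand-in "CNN features": two separable clusters of 16-d vectors
# (assumption: random vectors replace real network activations).
X = np.vstack([rng.normal(0.0, 0.1, (20, 16)),   # views of object class 0
               rng.normal(1.0, 0.1, (20, 16))])  # views of object class 1
y = np.array([0] * 20 + [1] * 20)

clf = LinearSVC().fit(X, y)  # linear SVM on the feature vectors
```

Because the classifier operates on fixed-length feature vectors, adding a new object class only requires collecting a few views and refitting the SVM, which is what makes new-object learning cheap in this setup.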