Friday, October 21, 2011

Motion Segmentation

For the past few days, I have been playing around with a motion segmentation scheme that breaks user motions down into discrete, atomic "actions". The plan is for these actions to form the "alphabet" of my motion gesture language. Combinations of actions, or individual actions performed in the correct context and mode (remember those from my previous posts?), will lead to interactions with our onscreen interface.


I used the concept of zero crossings to segment motion: motion is broken into actions based on a combination of velocity and acceleration. The basic principle is that an abrupt change in velocity (i.e., high acceleration) signals the end of the previous action and the start of a new one.
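In rough C#, the detection step looks something like the sketch below (names and thresholds are only illustrative, not my exact framework code):

    // Illustrative sketch: a segmentation boundary occurs when a velocity
    // component changes sign (a zero crossing) while the acceleration is
    // large enough to count as "abrupt". Assumes using System; for Math.
    bool IsSegmentBoundary(float prevVelocity, float currVelocity,
                           float acceleration, float accelThreshold)
    {
        bool zeroCrossing = (prevVelocity > 0 && currVelocity <= 0) ||
                            (prevVelocity < 0 && currVelocity >= 0);
        return zeroCrossing && Math.Abs(acceleration) > accelThreshold;
    }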

Thus I used a finite state machine to represent the breakdown of these motions.
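Conceptually, the state machine looks something like this (the state names are placeholders rather than my exact implementation):

    // Conceptual sketch of the segmentation state machine.
    enum MotionState { Idle, Accelerating, Moving, Decelerating }

    MotionState Step(MotionState state, float speed, float accel,
                     float speedThreshold, float accelThreshold)
    {
        switch (state)
        {
            case MotionState.Idle:
                // A burst of acceleration signals the start of a new action.
                return accel > accelThreshold ? MotionState.Accelerating : state;
            case MotionState.Accelerating:
                return speed > speedThreshold ? MotionState.Moving : state;
            case MotionState.Moving:
                // Strong deceleration means the action is winding down.
                return accel < -accelThreshold ? MotionState.Decelerating : state;
            case MotionState.Decelerating:
                // Velocity back near zero: the action is complete.
                return speed < speedThreshold ? MotionState.Idle : state;
            default:
                return MotionState.Idle;
        }
    }

Whenever the machine falls back to Idle, the just-completed motion is pushed onto the queue of previous motions described below.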



Then, after each motion has been completed, I save the attributes of that specific motion and put it in my queue of previous motions, which will then be interpreted.

Here is what I store (so far).
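As a sketch, each completed motion boils down to a small record that gets pushed onto the queue (the field names below are placeholders, not the exact attribute list):

    // Sketch of a per-motion record (field names are placeholders).
    // Assumes using System; and using System.Collections.Generic;
    class MotionRecord
    {
        public JointID Joint;          // which joint performed the motion
        public Vector3 StartPosition;  // where the motion began
        public Vector3 EndPosition;    // where the motion ended
        public Vector3 PeakVelocity;   // largest velocity reached during the motion
        public TimeSpan Duration;      // how long the motion lasted
    }

    Queue<MotionRecord> previousMotions = new Queue<MotionRecord>();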

Finally, here is a trial of my motion segmentation. On the bottom left I keep track of the various states that the motions go through, and every time a motion is completed it is pushed onto the queue and the total number of motions increases. As you can see, it is fairly accurate but fails to pick up very rapid, minute motions; in the grander scheme of things, though, I think it is good enough for our purposes.




Tuesday, October 18, 2011

Smoothing vs No Smoothing

Here is a video comparison of smoothing vs. no smoothing in my framework. I utilized the Kinect's built-in Holt smoothing for the joint positions. I also applied my own Holt smoothing to my calculated velocity and acceleration values.

http://www.youtube.com/watch?v=TUgUnrWrjRw

Monday, October 17, 2011

Working with the low-level motion data.

Please excuse my late posting. I was at Penn for less than 48 hours last week (Fall Break plus a trek to EA's Redwood Shores HQ for my Wharton EA field application project). Expect a few posts in the next few days to make up for it.

As previously mentioned, the Xbox Kinect SDK provides joint positions (in the form of Vector3s).
Until now, I have simply been mapping the relative velocity/position of joints to onscreen actions such as the zoom and pan of 2D objects. However, there are a few problems with this naive approach. First, the raw position points from the Kinect sensor are not perfectly stable. For example, if you hold your hand out straight without any movement, the onscreen object mapped to your hand's position will still jitter because of sensor inaccuracies.
Additionally, we are not able to separate gestures: if an individual accelerates his hand to zoom and then stops, we would expect the onscreen zoom to reflect this behavior; however, this does not happen if we simply map zoom to hand position, since the mapping continues even after our action has ended.

Thus for the next few days, I will add on filtering and motion segmentation features on top of my existing Kinect framework.

For filtering, my main objective is to reduce data jitter and smooth the raw position data so that motion mappings produce smooth onscreen actions. There are two aspects to filtering.
(1) The first is to reduce drift, which is low-frequency noise. Drift is often caused when our subject ever so slightly changes his overall position. The simplest way to reduce drift is to make the subject's root position (which can be the shoulder position in the case of hand/arm motion gestures) the origin of the coordinate frame.
(2) The second component of our jitter comes from high-frequency noise. This is a result of both minute motions of the subject and the slight inaccuracies of the Kinect camera sensors. The best way to fix this problem is to pass the data through a smoothing filter. Luckily, the Kinect SDK comes with its own built-in smoothing function, based on the Holt double exponential smoothing method.
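Turning the built-in filter on is just a matter of setting smoothing parameters on the skeleton engine. Here is a minimal configuration sketch, assuming the beta SDK's TransformSmoothParameters (the values are placeholders to be tuned):

    // Sketch: enable the SDK's built-in Holt smoothing (beta SDK API assumed;
    // the parameter values are placeholders that need tuning).
    nui.SkeletonEngine.TransformSmooth = true;
    nui.SkeletonEngine.SmoothParameters = new TransformSmoothParameters
    {
        Smoothing = 0.5f,           // how heavily to smooth (0 = raw data)
        Correction = 0.5f,          // how quickly to correct back toward the raw data
        Prediction = 0.5f,          // how far ahead to predict joint positions
        JitterRadius = 0.05f,       // radius (meters) within which jitter is clamped
        MaxDeviationRadius = 0.04f  // max distance the smoothed value may deviate from raw
    };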

I will also apply the Holt smoothing method to my velocity and acceleration calculations in the framework.
Finding the right constants for optimal smoothing, and striking a balance between accuracy/responsiveness and minimal jitter, will be an ongoing process throughout my project.
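For reference, here is a bare-bones version of Holt double exponential smoothing applied to a single scalar stream (for example, one component of a joint's velocity); alpha and beta are the constants that need tuning:

    // Minimal sketch of Holt double exponential smoothing for one scalar
    // stream (e.g., the x-component of a joint's velocity).
    class HoltFilter
    {
        private readonly float alpha, beta;
        private float level, trend;
        private bool initialized;

        public HoltFilter(float alpha, float beta)
        {
            this.alpha = alpha;
            this.beta = beta;
        }

        public float Update(float raw)
        {
            if (!initialized)
            {
                level = raw;
                trend = 0f;
                initialized = true;
                return raw;
            }
            float prevLevel = level;
            // Level tracks the smoothed value; trend tracks its rate of change.
            level = alpha * raw + (1 - alpha) * (prevLevel + trend);
            trend = beta * (level - prevLevel) + (1 - beta) * trend;
            return level;  // the smoothed estimate
        }
    }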

My motion segmentation implementation will mainly revolve around the zero crossings of specific joint velocities. The basic idea is that every abrupt change in velocity marks the boundary of a new motion. More to come on this as I begin the implementation....

Friday, October 7, 2011

Elements of Interaction

While I was playing around with moving a shape across the screen using my Kinect framework, I realized that a gesture alone does not make an interaction. When we swipe our hands left to move an object left, our interaction comprises more than just a mapping between hand and object position.
At the root of any interaction, we have the gesture. The gesture is an explicit action performed with the purpose of achieving a specific goal or result. In my interface, a gesture usually revolves around the position, velocity, or acceleration of specific skeletal joints.
However, this gesture information is useless if we do not know what our gesture maps to. This is where the mode of interaction comes into play. The mode of interaction is an explicit mapping between our gestures and the state of the interface. So for our simple example, panning is the mode of interaction.
Finally, the last element of our interaction is the context. The context comprises our relationship with our input environment. Are we sitting or standing? Are we actually facing the screen? What actions/gestures did we perform leading up to this current interaction? Context envelops both our spatial relationship to the environment and the history of our sequence of interactions.

In conclusion:  Gesture + Mode + Context = Interaction.
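One way to express that decomposition in code, with purely illustrative types (not the actual framework classes):

    // Illustrative sketch only -- names and fields are placeholders.
    enum Mode { Pan, Zoom, Rotate }

    class Gesture
    {
        public JointID Joint;         // the joint driving the gesture
        public Vector3 Velocity;
        public Vector3 Acceleration;
    }

    class Context
    {
        public bool IsStanding;               // sitting vs. standing
        public bool IsFacingScreen;           // spatial relationship to the display
        public List<Gesture> RecentGestures;  // history leading up to this moment
    }

    class Interaction
    {
        public Gesture Gesture;
        public Mode Mode;
        public Context Context;
    }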

I am currently working on implementing this interaction scheme in my Kinect framework.

Sunday, October 2, 2011

Setting up for Kinect Development
This past week, I have been focusing on becoming familiar with development using the Kinect SDK. I decided to develop the recognition engine in C# rather than C++. Two main reasons:

1.) C# makes it easier to work with threading, which will be useful when I have to integrate both audio and motion input.

2.) C# has some basic UI elements built in, which allows me to test and prototype faster: I can more easily code simple demos to exercise the recognition engine before I render the final interface in Unity.

After looking through some demo code and documentation, I found out that the Kinect SDK allows for two methods of motion tracking. The first is a polling-based approach, where the Kinect sensors return information at stated intervals. The other option is an event-based approach, where the Kinect sensors return data only when there is actually an event (motion, change in depth, etc.). The event-based approach seems more efficient for my purposes, so I decided to build on top of it.

The first step in linking any code up to the Kinect is to turn on the sensors that you need.
Here we activate the depth sensor, skeletal tracking, and the raw color camera (the latter for testing purposes, so we can see how our real-world actions correlate to onscreen actions).
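Roughly, the initialization looks like the following sketch (assuming the beta SDK's Runtime API):

    // Sketch of sensor initialization (beta Kinect SDK Runtime API assumed).
    Runtime nui = new Runtime();
    nui.Initialize(RuntimeOptions.UseDepthAndPlayerIndex |
                   RuntimeOptions.UseSkeletalTracking |
                   RuntimeOptions.UseColor);

    // Open the depth and color streams; the color stream is only there so we
    // can see how real-world actions correlate to onscreen actions while testing.
    nui.DepthStream.Open(ImageStreamType.Depth, 2,
                         ImageResolution.Resolution320x240, ImageType.DepthAndPlayerIndex);
    nui.VideoStream.Open(ImageStreamType.Video, 2,
                         ImageResolution.Resolution640x480, ImageType.Color);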


Next, we assign event handlers to the sensor events.
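The wiring itself is just a few lines (the depth and color handler names below are placeholders; nui_SkeletonFrameReady is where most of my logic lives):

    // Sketch: hook handlers onto the sensor events (handler names other than
    // nui_SkeletonFrameReady are placeholders).
    nui.SkeletonFrameReady += new EventHandler<SkeletonFrameReadyEventArgs>(nui_SkeletonFrameReady);
    nui.DepthFrameReady += new EventHandler<ImageFrameReadyEventArgs>(nui_DepthFrameReady);
    nui.VideoFrameReady += new EventHandler<ImageFrameReadyEventArgs>(nui_ColorFrameReady);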

Finally, we create an entirely separate thread for our audio input and speech recognition.
Most of my code right now lives in nui_SkeletonFrameReady. This handler is called every time the Kinect senses a change in any of the skeletal joint positions, and my interface reacts accordingly.

Joint Data 
The Kinect skeleton only returns the positions of the joints as Vector3s. I therefore created a wrapper class over these joint positions that calculates both velocity and acceleration data. The velocity and acceleration are updated every time the SkeletonFrame event handler is called. I am looking at how to better smooth and interpolate this data so that gesture motion does not produce "spikey" onscreen behavior.
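A stripped-down sketch of the wrapper idea (assuming a Vector3 type with the usual arithmetic operators; names are illustrative):

    // Simplified sketch: velocity and acceleration are estimated with finite
    // differences each time a new skeleton frame arrives.
    class TrackedJoint
    {
        public Vector3 Position;
        public Vector3 Velocity;
        public Vector3 Acceleration;

        // deltaSeconds is the time elapsed since the previous skeleton frame.
        public void Update(Vector3 newPosition, float deltaSeconds)
        {
            Vector3 newVelocity = (newPosition - Position) / deltaSeconds;
            Acceleration = (newVelocity - Velocity) / deltaSeconds;
            Velocity = newVelocity;
            Position = newPosition;
        }
    }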

Voice Data
Through the Kinect, we can leverage Microsoft's Speech API, which has voice recognition features. We open the Kinect's microphone input as an audio source stream, and this stream is passed to Microsoft's speech recognizer.

Before anything else, it is necessary to generate a grammar: a list of words or phrases that the speech recognizer should be listening for.
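Here is a minimal sketch of that setup, assuming the Microsoft.Speech.Recognition classes (the word list is only an example):

    // Sketch: build a small grammar and load it into the recognizer
    // (sre is the SpeechRecognitionEngine instance; command words are examples).
    var commands = new Choices();
    commands.Add("pan");
    commands.Add("zoom");
    commands.Add("stop");

    var grammarBuilder = new GrammarBuilder(commands);
    var grammar = new Grammar(grammarBuilder);
    sre.LoadGrammar(grammar);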

Much like the skeleton tracking, the speech recognition system uses an event-handler system. The three events that are raised are SpeechRecognized, SpeechHypothesized, and SpeechRecognitionRejected. I still need to do more reading about the exact details of these three events, but I am only responding to SpeechRecognized events for now.

Testing Environment
I made a simple testing interface as I continue to implement gesture recognition features and modes.
All my work right now involves basic manipulation (zoom, pan) of a circle in two dimensions.
I also output some key information, such as the current mode that my interface is in and the sensitivity modifier (discussed in the previous post).

I will go more in-depth about this test environment in my next post and explain some of the new gesture recognition models that I have come up with. Videos too! Look for it tomorrow, once I get Camtasia installed on my dev computer (I need Joe's admin password).

Test interface: