Monday, December 19, 2011

I GOT MY PROJECT TO WORK ON DOGS TOO!!!


WOO!!!!!

Just kidding. Anyways, I've been at work the last few days, adding new interaction features and tightening up my existing gesture commands. There are still some kinks that I have to iron out, as you can see from my video...but check it out (turn on the sound too, I'm narrating!).




[EDIT]
This video just came up as a recommended video next to my YouTube upload.
Best Kinect hack ever?.....Are puppies cute?.......YES





Monday, December 12, 2011

GUI in a NUI

Today, I was working on refining the "GUI" interface on my project. That means working with menu navigation, animations, text and selection.

So far I have two menus:
(1) A mode menu that allows you to select an operation on the spot.
(2) A global menu that allows for adding objects, exiting the app, or making any large-scale changes.
The mode menu is accessed with a swipe of the right hand; the global menu is activated by taking a step back.
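
Conceptually, the step-back trigger just watches the depth of the user's root joint. Here is a minimal sketch of how that could be detected; the class name, threshold and re-baselining scheme are illustrative choices, not the exact logic in my build.

```csharp
// Sketch: detect a "step back" by watching the root joint's Z (depth) value.
// The threshold and the slow re-baselining are illustrative, not tuned values.
public class StepBackDetector
{
    private const float StepBackThreshold = 0.35f; // how far back counts as a "step"
    private float baselineZ = float.NaN;

    // Call once per frame with the current root Z value.
    public bool Update(float rootZ)
    {
        if (float.IsNaN(baselineZ))
        {
            baselineZ = rootZ;        // first frame: remember where the user started
            return false;
        }

        if (rootZ - baselineZ > StepBackThreshold)
        {
            baselineZ = rootZ;        // re-baseline so the menu doesn't re-trigger
            return true;              // user stepped back -> open the global menu
        }

        // Follow small drifts slowly so normal swaying doesn't accumulate.
        baselineZ = 0.98f * baselineZ + 0.02f * rootZ;
        return false;
    }
}
```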

Even though it may seem like eye candy, the right animations and fonts can make a big difference in usability. Part of the idea of the NUI is that you have a very "physical" interface: objects move on screen like they would in real life, since you are actually using your hands. The slide animations I've included add to this feel. Font choice for such an interface is also an interesting decision. I assume that most users will be standing at least 2 feet away (due to the limitations of the Kinect), so most text is kept at a large size. I've used a sans-serif font called Segoe, which gives the interface a cleaner look.


Here is a demo video (still working on improving the constants for optimal gesture recognition).



Also sorry for the jittery video capture. The animations are smooth...trust me :).





Sunday, December 11, 2011

Interaction Feedback and Selection


This last week, I have been playing around with interaction feedback and object selection for my Kinect NUI.
I have used my original concept for gesture feedback. After testing, I realized that having a fourth state (in motion) was excessive and generated too much visual noise. Thus, at the top of the screen, there are three possible colors that can show up:

Blue = initial gesture conditions met, gesture started
Green  = gesture recognized and completed
Red = motion completed, but it does not match any gesture in the list of interactions

For the challenge of multiple objects and object selection, I decided to map the position of the hands on screen and have a trailing line represent the path of each hand. To select an object, place either hand near the object and wiggle it in a small circle. To deselect an object, wiggle your hand at a location away from the selected object. The selected object is signified by the blue glow around it.
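
For the curious, here is a rough sketch of one way the wiggle check can be done, assuming Unity's Vector3 for positions; the window length and distance thresholds are illustrative guesses rather than the tuned values.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Sketch of the "wiggle to select" check: over a short window, the hand must
// stay close to the object while tracing much more path than its net
// displacement (i.e. small circles). All constants are illustrative.
public class WiggleSelector
{
    private readonly Queue<Vector3> recent = new Queue<Vector3>();
    private const int WindowFrames = 30;        // roughly one second of hand positions
    private const float SelectRadius = 0.15f;   // hand must stay this close to the object
    private const float MinPathLength = 0.40f;  // total distance traced in the window...
    private const float MaxNetTravel = 0.10f;   // ...while ending up near where it started

    // Call once per frame with the hand and candidate object positions.
    public bool Update(Vector3 handPos, Vector3 objectPos)
    {
        recent.Enqueue(handPos);
        if (recent.Count > WindowFrames) recent.Dequeue();
        if (recent.Count < WindowFrames) return false;

        if (Vector3.Distance(handPos, objectPos) > SelectRadius) return false;

        float path = 0f;
        Vector3 prev = recent.Peek();
        foreach (Vector3 p in recent)
        {
            path += Vector3.Distance(prev, p);
            prev = p;
        }
        float net = Vector3.Distance(recent.Peek(), handPos);

        // Lots of motion with almost no net displacement reads as a wiggle.
        return path > MinPathLength && net < MaxNetTravel;
    }
}
```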

I also show the menu system which is activated by the horizontal swipe of the right hand.

Demo Video:


Sunday, December 4, 2011

Self-Evaluation

It has been an interesting journey, looking at my proposal and where I am now. I am first going to talk about the parts of the project I felt I did a good job on and the parts that I should have done a better job on.

Here are some things I liked about what I did:

-I liked that I wrote out the definition of interaction (mode + context + gesture) early on. I think it shaped the way I approached the entire problem (using motion segmentation and using 3D space as a modifier) in a way that was different from the existing crop of NUI demos.

-I like that I quickly implemented a rough prototype just using C#. This allowed me to become familiar with the Kinect API and understand the abilities of the Kinect at a very early stage.

-Building a motion-segmentation engine early on: This became the basis of all my advanced motion recognition, and it helped a lot that I developed a structured way of reading my motions, so that once I developed a workflow for reading one specific type of motion, I could take the skeleton of that code and port it to the other motion types.

-Iterating through different approaches to motion recognition. I first implemented a live 1-to-1 mapping. Then I parsed actions after they were completed. Finally, I used a combination of the two approaches. By going through many approaches, I was able to see the strengths and weaknesses of each one, and it was a very interesting exercise.

Here are some things I could have done a better job on:

-Focus more on the interaction designs throughout the entire process. Sometimes I became too focused on the implementation and debugging of my code, and I would then have to play catch-up designing the interactions. My interactions would be more fully fleshed out at this stage if I had kept a continual focus on this.

-Have a clearer strategy for the actual implementation. Honestly, I should have taken Mubbasir's advice and just used our existing Kinect-to-Unity plugin. Instead, I developed my own, ran into many problems, and spent more time working on this than on the actual interface. I ultimately did get all the wrappers to sync up with Unity, but my solution was very buggy and led to memory leaks. In the end, I used the existing plugin because I just didn't want to waste any more time debugging memory leaks. If I had paid closer attention to the feasibility of the implementation at an early stage, this could have been avoided.

-Bring in users to try out the interface and get feedback throughout my work. This is honestly something that I should have done. I might have been afraid to show my work at such an early stage, but with feedback from users, I could have had many new leads for my interaction designs. I realized this during the Beta Review, when I got a lot of great ideas in just a short 20-minute chat with Badler and Joe.

Conclusion
All in all, I felt that this has been a very stimulating project. It has the right mixture of human-computer interaction design and coding to suit my tastes. It has been fun to just take a stab at making a NUI interface. Now I realize the shortcomings of a motion-based NUI, but also some of its strengths. From my work and observations, the NUI is not at all ready to supplant the traditional GUI, but it also brings whole new functionality that has never existed before. I am really glad that I focused on making my motion framework, because I now have a strong base to build other applications on top of the Kinect, and I plan to play around with this in the future.

Beta Review Recap and Plans for next few weeks

Last Friday I showed my current working demo to Joe, Norm and Mubbasir. 
I showed my current interactions of menu/mode selection, zooming, panning and rotation of a 3D object in my Unity-based interface. From the feedback, I realized that I had a good basis of interaction and motion interpretation, but I really needed to tighten the interactions so that they were intuitive and also easy to perform in 3D space. I had started playing around with the use of space as a "value modifier" for my various interactions, and I really need to fully flesh this out in my current interactions. I got some great ideas from Badler and Joe, and I plan to implement them in the following weeks.

User feedback, again, was something that I had a good basis for, but it still needs more iteration. Currently, I have implemented passive feedback using a particle cloud that visualizes the user's hand velocities, plus basic feedback on whether or not the user is in the interaction space. My next step for user feedback will be to implement my original idea of displaying the states of actions and whether they ended up being accepted as gestures.

The good thing is that I am completely done with the base motion framework, so the rest of the changes will be focused on the actual design of the interaction gestures and the flow of interaction.

Here is my timeline for the next few weeks.

12/4 - 12/7: Use Badler's and Joe's input to design a more intuitive rotation gesture. Tighten up the constants for the zoom and pan gestures.

12/7 - 12/9: Implement user feedback of current action states.

12/9 - 12/16: Develop a menu system that will allow users to add different objects to the scene and create new scenes. Allow for a selection system that lets users select different objects for editing.

12/17 - onward: Work on presenting the work in the video.

Monday, November 28, 2011

Update on motion recognition

So last week, I was looking at my two approaches to motion recognition (live updating vs. waiting for a motion to fully complete). I decided that it was best to combine these two approaches and get the best of both worlds (faster reaction time and accuracy).

This is the basic workflow of how I recognize motions:

1. A motion has begun and is in progress
I start tracking this motion. There is a set "learning period" during which the framework records all the data for this motion. Once the "learning period" has passed, the framework makes its best guess as to what kind of gesture is being performed.

2. We have assigned a gesture to this motion
For every frame, we read in the current motion changes and we update the interface. Here we are basically implementing the live update approach.

3. The motion has been completed
Now, we look back at the complete motion data and use the old method of looking at the entire motion sequence. We make any necessary adjustments to our gesture and correct the interface changes if the entire motion does not match our initial guess.

So far, this has worked relatively well. Basically, during the live-update phase we have a constant update to our interface. Once the motion is complete, the interface makes the final changes and quickly moves to the final/correct position.

This is very comparable to momentum scrolling on current touchscreen smartphones and tablets: when you have your finger on the screen, the scrolling maps one-to-one to your finger movement. Once you lift your finger, the screen scrolls based on the final acceleration/velocity of that movement. Similarly, in our approach we basically have a one-to-one mapping during the actual motion; once the motion is completed, we correct the change based on the entire motion.
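
Putting the three phases together, a minimal sketch of the hybrid recognizer could look like the following. The phase names, the frame/gesture types and the length of the "learning period" are placeholders, not the framework's real types.

```csharp
using System.Collections.Generic;

public enum GesturePhase { Learning, LiveUpdate, Done }
public enum GestureType { None, Zoom, Pan, Rotate }
public class MotionFrame { /* joint positions, velocities, timestamps, ... */ }

public abstract class HybridRecognizer
{
    private const int LearningFrames = 10;   // illustrative "learning period"
    private readonly List<MotionFrame> frames = new List<MotionFrame>();
    private GestureType guess = GestureType.None;
    public GesturePhase CurrentPhase { get; private set; }   // starts in Learning

    // 1. The motion has begun: record frames until the learning period passes,
    //    then commit to a best guess.
    public void OnFrame(MotionFrame frame)
    {
        frames.Add(frame);
        if (CurrentPhase == GesturePhase.Learning && frames.Count >= LearningFrames)
        {
            guess = ClassifyPartial(frames);
            CurrentPhase = GesturePhase.LiveUpdate;
        }
        // 2. A gesture has been assigned: update the interface live on every frame.
        else if (CurrentPhase == GesturePhase.LiveUpdate)
        {
            ApplyLiveUpdate(guess, frame);
        }
    }

    // 3. The motion is complete: re-classify from the full data and correct the
    //    interface if the whole motion does not match the initial guess.
    public void OnMotionComplete()
    {
        GestureType final = ClassifyComplete(frames);
        if (final != guess) CorrectInterface(guess, final);
        CurrentPhase = GesturePhase.Done;
    }

    protected abstract GestureType ClassifyPartial(List<MotionFrame> framesSoFar);
    protected abstract GestureType ClassifyComplete(List<MotionFrame> fullMotion);
    protected abstract void ApplyLiveUpdate(GestureType gesture, MotionFrame frame);
    protected abstract void CorrectInterface(GestureType wrongGuess, GestureType correct);
}
```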

Sunday, November 20, 2011

Challenges with Motion Recognition / Motion-Interaction Mapping

After creating the motion recognition framework for a 3D object-editor-like interaction in Unity (zoom, panning and rotation), I have come upon a few interesting choices of approach.

For some brief background, I currently break down all user motions into discrete actions. Each action contains information such as velocity, acceleration, start/end position and duration. I have currently programmed two approaches to interpreting this data.

The first is a live approach. The live approach does not wait for an action to end; rather, it checks the current action being tracked (if any). If the current action satisfies certain rules for an interaction (e.g. a specific start position or duration, or having to occur in unison with another action), then we start changing the state of the interface with each new frame update. Here is a quick example: for a correct rotation gesture, both hands must be above the waist, and the actions of the two hands must start at roughly the same time. If both initial conditions are true, then I start changing the orientation of my object based on the updated positions of the two hands until the action is complete.
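
As a concrete illustration of that initial-condition check (hands above the waist, actions starting together), here is a small sketch; the time threshold and the use of Unity's Vector3/Quaternion types are assumptions for the example.

```csharp
using System;
using UnityEngine;

public class LiveRotationCheck
{
    private const float MaxStartSkew = 0.2f;   // "roughly the same time" (illustrative, seconds)

    // Initial conditions: both hands above the waist, both hand actions started together.
    public bool InitialConditionsMet(Vector3 leftHand, Vector3 rightHand, Vector3 waist,
                                     float leftStartTime, float rightStartTime)
    {
        bool handsAboveWaist = leftHand.y > waist.y && rightHand.y > waist.y;
        bool startedTogether = Math.Abs(leftStartTime - rightStartTime) < MaxStartSkew;
        return handsAboveWaist && startedTogether;
    }

    // While the conditions hold, the object's orientation can simply track the
    // vector between the two hands on every frame update (the live mapping).
    public Quaternion OrientationFromHands(Vector3 leftHand, Vector3 rightHand)
    {
        return Quaternion.LookRotation((rightHand - leftHand).normalized);
    }
}
```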

In the second approach, rather than looking at the initial conditions of the current action, we look through the list of completed actions. Then we analyze the saved data of those actions to see if the sequence of actions satisfies any possible interactions.

There are pluses and minuses to both approaches. In the live approach, the interaction is very responsive and there is barely any downtime between a gesture and a change in the interface; however, we also sacrifice accuracy. This approach depends on our read of the initial state being correct: we must assume that if the user begins a specific motion, he will also end it correctly.

The second approach is more accurate. We can look at the entire sequence of actions to ensure that we match the correct on-screen changes. However, there is lag time. If the user attempts one long motion, we will not be able to process it until it is complete, and thus the user will not see any changes in the interface for a relatively long time.

Any thoughts?


Sunday, November 13, 2011

Motion Recognition Work

Now that I have my base C# framework ported over to Unity, I have begun building more advanced motion recognition to power my beta review object editor/navigator application.

There are multiple motion recognition engines that "look for" specific motions. They all take in the same atomic action data from my motion segmenter. The motion recognizers are also turned on/off based on the current state/context and by other motion recognizers. For example, I implemented a motion recognizer that checks to see if you have moved your hands into the "up" position (up from your sides and pointed in the general direction of the screen). Once this state is reached, my rotation recognizer is activated and will look to see if you rotate your two hands in synchronization, like you would if you were rotating a real-life object.
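
Here is a rough sketch of how recognizers can switch each other on and off. For brevity it passes raw joint positions instead of the full atomic-action record, and the interface and the "up" test are simplified stand-ins for the real engines.

```csharp
using UnityEngine;

// Simplified stand-in: every recognizer sees the same per-frame joint data and
// can be switched on/off by the current context or by other recognizers.
public interface IMotionRecognizer
{
    bool Enabled { get; set; }
    void OnUpdate(Vector3 leftHand, Vector3 rightHand, Vector3 shoulderCenter);
}

public class HandsUpRecognizer : IMotionRecognizer
{
    private readonly IMotionRecognizer rotationRecognizer;
    public bool Enabled { get; set; }

    public HandsUpRecognizer(IMotionRecognizer rotationRecognizer)
    {
        this.rotationRecognizer = rotationRecognizer;
        Enabled = true;
    }

    public void OnUpdate(Vector3 leftHand, Vector3 rightHand, Vector3 shoulderCenter)
    {
        if (!Enabled) return;

        // Illustrative "up" test: both hands raised above the shoulder line.
        bool bothHandsUp = leftHand.y > shoulderCenter.y && rightHand.y > shoulderCenter.y;

        // Once the "up" state is reached, hand control to the rotation recognizer;
        // otherwise keep it switched off.
        rotationRecognizer.Enabled = bothHandsUp;
    }
}
```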

Here is a snapshot of my work in progress. I am rotating a 3D cube in three dimensions (x, y, z axes) based on the motion controls. If you look at the bottom, you will see the two Unity icons and a black bar. The Unity icon (it's a placeholder image for now) shows up if your right/left hand is in the "up" position. The black bar represents the distance between your two hands, which will affect the sensitivity of your rotation (more about that in the next post). The basic idea is to give on-screen feedback at every step of the interaction.

Look for a video demo once I get the rotation kinks worked out.

Sunday, November 6, 2011

Post-Alpha Review Next Steps

The alpha review was great in terms of getting feedback on my current progress. There are two main issues/areas of focus that have been repeatedly emphasized by my reviewers:

1. Gesture recognition is not trivial and can be a challenge.

2. You should define an application/use-case early on and adjust your motion recognition accordingly.

These are two really good points and I definitely agree with them.

I'm currently bringing my existing C# test framework into Unity, which will be my final production code. This has taken some time, but I have finally resolved many issues regarding wrappers, DLLs and Unity's inability to use the Microsoft .NET 4.0 framework (the framework that the Kinect SDK utilizes).
I found this trick online http://www.codeproject.com/KB/dotnet/DllExport.aspx and basically had to play around with lots of compile settings both in C# and C++ for this to work.

Now getting back to addressing my two main challenges, I have decided that my initial use case for the beta review will be a spatial interaction environment for 3D objects. In simple terms, the user should be able to move and rotate an object in 3D space while also being able to change his/her viewpoint. Think of the CIS 277 object editor, but in Unity and with voice and gesture controls.

Probably one of the most important aspects of gesture recognition doesn't actually pertain directly to the recognition of motion; it has to do with user feedback. How does a user know if his gesture has been registered? How does the user even know if the interface is listening for input in the first place?

This user feedback is implemented religiously in most traditional GUI interfaces (the good ones, that is). Hover your mouse over a button: that button will change color/opacity/shape, letting the user know that it is listening. Move your mouse away and the button returns to its previous state, letting the user know that it is no longer listening for your click. Finally, if you click the button, it changes its visual state once more to signify that your action has been successfully recorded.

For successful gesture recognition, this same interaction flow must be replicated in our NUI. If the user does not know that his action is being recorded, he will madly wave back and forth, which will lead to further misinterpretation by our recognition engine. Furthermore, the user must be told whether his gesture/motion has been correctly recognized or whether it has been ignored because our engine is unable to parse it.

Since it does not make sense to have button feedback (the whole point of a NUI is to remove the mouse-pointer paradigm) and pop-up dialog boxes are intrusive, I've decided to use a border highlight that displays the feedback. The response is coded in the color of the border, which then fades away after the feedback is shown.

Initial "recording feedback". The user has stepped into the interaction space.


User action has been recorded and saved. However, it might be part of a longer gesture sequence, so the entire gesture is not complete yet and we are waiting for more motions.



User action or sequence of actions has been recognized, and we have updated the state of the interface to correspond with this.

Current action or current sequence of actions cannot be recognized as a specific command. The sequence/action has been deleted from the queue. Start the action again from the beginning.


My rationale for this:
1. A colored border is non-obtrusive, yet it has enough global visual scale to catch the attention of the user.
2. Differentiating between recorded gestures and actually completed gestures allows the use of gestures that are formed from the build-up of many atomic gestures. As we build up to these complex gestures, it is still good to know that our sequence of actions is being recorded.
3. The initial "on" phase provides feedback to the user that he is in the interaction space and all his motions are currently being watched.

This mode of feedback is inspired by Alan Cooper's concept of modeless feedback from his book on interaction design, About Face.
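
To make the border scheme concrete, here is a minimal Unity-flavored sketch; the particular colors, fade time and class name are illustrative choices rather than final values.

```csharp
using UnityEngine;

public enum FeedbackState { Listening, ActionRecorded, GestureRecognized, GestureRejected }

public class BorderFeedback : MonoBehaviour
{
    public Texture2D borderTexture;          // a frame texture (transparent center) drawn around the screen
    private Color current = Color.clear;
    private const float FadeSeconds = 1.0f;  // illustrative fade time

    public void Show(FeedbackState state)
    {
        switch (state)
        {
            case FeedbackState.Listening:         current = Color.white; break;
            case FeedbackState.ActionRecorded:    current = Color.blue;  break;
            case FeedbackState.GestureRecognized: current = Color.green; break;
            case FeedbackState.GestureRejected:   current = Color.red;   break;
        }
    }

    void Update()
    {
        // The border color fades away after the feedback has been shown.
        current.a = Mathf.Max(0f, current.a - Time.deltaTime / FadeSeconds);
    }

    void OnGUI()
    {
        GUI.color = current;
        GUI.DrawTexture(new Rect(0, 0, Screen.width, Screen.height), borderTexture);
    }
}
```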


Friday, October 21, 2011

Motion Segmentation

The past few days, I have been playing around with a good motion segmentation scheme so that I can break down user motions into discrete, atomic "actions". The plan is for these actions to form the "alphabet" of my motion gesture language. Combinations of actions, or individual actions in the correct context and mode (remember those from my previous posts?), will lead to interactions with our onscreen interface.


I used the concept of zero crossings to segment motion; motion is broken into actions by a combination of velocity and acceleration. The basic principle is that an abrupt change in velocity (high acceleration) signals the end of the previous action and the start of a new one.

Thus I used a finite state machine to represent the breakdown of these motions.
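
The original state diagram isn't reproduced here, but a plausible version of the zero-crossing state machine looks something like this (the state names and thresholds are illustrative guesses):

```csharp
public enum SegmentState { Idle, InMotion }

public class MotionSegmenter
{
    // Illustrative thresholds, in whatever units the framework reports.
    private const float StartSpeed = 0.25f;   // speed needed to begin a new action
    private const float StopAccel  = 3.0f;    // abrupt velocity change that ends an action

    private SegmentState state = SegmentState.Idle;

    // Returns true when an action has just been completed and should be pushed
    // onto the queue of previous motions.
    public bool Update(float speed, float accelMagnitude)
    {
        switch (state)
        {
            case SegmentState.Idle:
                if (speed > StartSpeed) state = SegmentState.InMotion;
                return false;

            case SegmentState.InMotion:
                // An abrupt change in velocity (high acceleration) or a near-stop
                // signals the end of the current action.
                if (accelMagnitude > StopAccel || speed < StartSpeed * 0.5f)
                {
                    state = SegmentState.Idle;
                    return true;
                }
                return false;
        }
        return false;
    }
}
```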



Then, after each motion has been completed, I save the many attributes of that specific motion and put it into my queue of previous motions, which will then be interpreted.

Here is what I store (so far).
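
The screenshot of the record isn't reproduced here, so the class below is a reconstruction from the attributes mentioned in these posts (start/end position, velocity, acceleration, duration); the field names are guesses.

```csharp
// Reconstruction of the per-action record; the field names and the simple
// vector placeholder are my own guesses, not the project's actual class.
public struct Vec3 { public float X, Y, Z; }

public class AtomicAction
{
    public string JointName;         // which joint this action belongs to
    public Vec3   StartPosition;
    public Vec3   EndPosition;
    public Vec3   AverageVelocity;
    public Vec3   PeakAcceleration;
    public float  StartTime;         // when the action began
    public float  Duration;          // how long it lasted
}
```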

Finally, here is a trial of my motion segmentation. On the bottom left, I keep track of the various states that the motions go through, and every time a motion is completed it is pushed onto the queue and the total number of motions increases. As you can see, it's fairly accurate but fails to pick up very rapid, minute motions; in the grander scheme of things, though, I think it is good enough for our purposes.




Tuesday, October 18, 2011

Smoothing vs No Smoothing

Here is a video comparison of smoothing vs. no smoothing in my framework. I utilized the Kinect's built-in Holt smoothing for the joint positions. I also applied my own Holt smoothing to my calculated velocity and acceleration values.

http://www.youtube.com/watch?v=TUgUnrWrjRw

Monday, October 17, 2011

Working with the low-level motion data.

Please excuse my late posting. I was at Penn for less than 48 hours last week (Fall Break + a trek to EA's Redwood Shores HQ for my Wharton EA field application project). Expect a few posts in the next few days to make up for it.

As previously mentioned, the Kinect SDK provides joint location positions (in the form of Vector3's).
Until now, I have simply been mapping the relative velocity/position of joints to onscreen actions such as the zoom and pan of 2D objects. However, there are a few problems with this naive approach. First, the raw position points from the Kinect sensor are not perfectly stable. For example, if you hold your hand out straight without any movement, the onscreen object mapped to your hand's position will jitter based on sensor inaccuracies.
Additionally, we are not able to separate gestures: if an individual accelerates his hand to zoom and then stops, we would expect the onscreen zoom to reflect this behavior; however, this does not occur if we simply map zoom to hand position, since the mapping does not stop, even after our action ends.

Thus for the next few days, I will add on filtering and motion segmentation features on top of my existing Kinect framework.

For filtering, my main objective is to reduce data jitter and smooth the raw position data so that motion mappings produce smooth onscreen actions. There are two aspects to filtering.
(1) The first is to reduce drift, which is low-frequency noise. Drift is often caused when our subject ever so slightly changes his overall position. The simplest way to reduce drift is to make the subject's root position (which can be the shoulder position in the case of hand/arm gestures) the origin of the coordinate frame.
(2) The second component of our jitter is high-frequency noise. This is a result of both minute motions of the subject and the slight inaccuracies of the Kinect camera sensors. The best way to fix this problem is to pass the data through a smoothing filter. Luckily, the Kinect SDK comes with a built-in smoothing function based on the Holt double exponential smoothing method.
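
Turning that on looks roughly like the sketch below, assuming the beta SDK's Runtime/SkeletonEngine API; the parameter values are illustrative starting points, not tuned constants.

```csharp
using Microsoft.Research.Kinect.Nui;

public class SmoothingSetup
{
    // Rough sketch of enabling the SDK's built-in Holt smoothing.
    public Runtime InitializeWithSmoothing()
    {
        Runtime nui = new Runtime();
        nui.Initialize(RuntimeOptions.UseSkeletalTracking);

        nui.SkeletonEngine.TransformSmooth = true;
        nui.SkeletonEngine.SmoothParameters = new TransformSmoothParameters
        {
            Smoothing = 0.75f,           // how aggressively the raw positions are smoothed
            Correction = 0.0f,
            Prediction = 0.0f,
            JitterRadius = 0.05f,        // jitter smaller than this radius is clamped
            MaxDeviationRadius = 0.04f   // cap on how far the smoothed value may lag the raw one
        };
        return nui;
    }
}
```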

I will also apply the Holt smoothing method to my velocity and acceleration calculations in the framework.
Finding the right constants for optimal smoothing, and striking a balance between accuracy/responsiveness and minimal jitter, will be an ongoing process throughout my project.
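
For a single value (say, one component of a hand velocity), the standard Holt update looks like this; alpha and beta are exactly the constants that will need tuning.

```csharp
// Holt double exponential smoothing for a single scalar stream (e.g. one
// component of a joint velocity). Alpha and beta are the constants to tune.
public class HoltFilter
{
    private readonly float alpha;   // smoothing factor for the level (the value itself)
    private readonly float beta;    // smoothing factor for the trend (its rate of change)
    private float level, trend;
    private bool initialized;

    public HoltFilter(float alpha, float beta)
    {
        this.alpha = alpha;
        this.beta = beta;
    }

    public float Update(float raw)
    {
        if (!initialized)
        {
            level = raw;
            trend = 0f;
            initialized = true;
            return raw;
        }

        float previousLevel = level;
        level = alpha * raw + (1f - alpha) * (previousLevel + trend);
        trend = beta * (level - previousLevel) + (1f - beta) * trend;
        return level;   // the smoothed estimate for this frame
    }
}
```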

My motion segmentation implementation will mainly revolve around the zero-crossings of specific joint velocities. The basic idea is that every abrupt change in velocity represents a new motion. More to come on this as I begin the implementation....

Friday, October 7, 2011

Elements of Interaction

While I was playing around with moving a shape across the screen using my Kinect framework, I realized that a gesture alone does not make an interaction. When we swipe our hands left to move an object left, our interaction comprises more than just a mapping between hand position and object position.
At the root of any interaction, we have the gesture. The gesture is an explicit action performed with the purpose of achieving a specific goal or result. In my interface, a gesture usually revolves around the position, velocity or acceleration of specific skeletal joints.
However, this gesture information is useless if we do not know what our gesture maps to. This is where the mode of interaction comes into play. The mode of interaction is an explicit mapping between our gestures and the state of the interface. So for our simple example, panning is the mode of interaction.
Finally, the last element of our interaction is the context. The context comprises our relationship with our input environment. Are we sitting or standing? Are we actually facing the screen? What actions/gestures did we perform leading up to this current interaction? Context envelops our spatial relationship with the environment and the history of our sequence of interactions.

In conclusion:  Gesture + Mode + Context = Interaction.

I am currently working on implementing this interaction scheme in my Kinect framework.

Sunday, October 2, 2011

Setting up for Kinect Development
This past week, I have been focusing on becoming familiar with development using the Kinect SDK. I decided to develop the recognition engine in C# rather than C++. Two main reasons:

1.) C# makes it easier to perform threading, and this will come in useful when having to integrate both audio and motion input.

2.) C# has some basic UI elements built in, so this allows me to test and prototype faster, as I can more easily code simple demos to test the recognition engine before I render the final interface in Unity.

After looking through some demo code and documentation, I found out that the Kinect SDK allows for two methods of motion tracking. The first is a polling-based approach, where the Kinect sensors return information at stated intervals. The other option is an event-based approach, where the Kinect sensors return data only when there is actually an event (motion, change in depth, etc.). The event-based approach seems more efficient for my purposes, and I decided to build on top of it.

The first step in linking up any code to the Kinect is to turn on the sensors that you need.
Here we activate the depth sensor, skeletal tracking and the raw color camera (the last for testing purposes, so we can see how our real-world actions correlate to on-screen actions).
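
In rough terms, the initialization looks like the sketch below (written against the beta SDK's Runtime API; the stream resolutions here are just typical sample values, not necessarily the ones I use):

```csharp
using Microsoft.Research.Kinect.Nui;

public partial class KinectController
{
    private Runtime nui;

    public void StartSensors()
    {
        nui = new Runtime();

        // Depth (with player index), skeletal tracking and the raw color camera.
        nui.Initialize(RuntimeOptions.UseDepthAndPlayerIndex |
                       RuntimeOptions.UseSkeletalTracking |
                       RuntimeOptions.UseColor);

        // Open the image streams so skeletal joints can be overlaid on the video for debugging.
        nui.VideoStream.Open(ImageStreamType.Video, 2,
                             ImageResolution.Resolution640x480, ImageType.Color);
        nui.DepthStream.Open(ImageStreamType.Depth, 2,
                             ImageResolution.Resolution320x240, ImageType.DepthAndPlayerIndex);
    }
}
```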


Next, we assign event handlers to the sensor events.

Finally, we create an entirely new thread for our audio input and speech recognition.
Most of my code right now lives in nui_SkeletonFrameReady. This is called every time the Kinect senses a change in any of the skeletal joint positions, and my interface reacts accordingly.
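
Continuing the sketch above, the event hookup and audio thread might look like this; the handler bodies are stubs and the thread setup is simplified.

```csharp
using System.Threading;
using Microsoft.Research.Kinect.Nui;

public partial class KinectController
{
    public void HookEvents()
    {
        nui.SkeletonFrameReady += nui_SkeletonFrameReady;
        nui.VideoFrameReady    += nui_VideoFrameReady;
        nui.DepthFrameReady    += nui_DepthFrameReady;

        // Speech recognition gets its own thread so it never blocks skeleton updates.
        Thread audioThread = new Thread(RunSpeechPipeline);
        audioThread.Start();
    }

    private void nui_SkeletonFrameReady(object sender, SkeletonFrameReadyEventArgs e)
    {
        // React to the updated joint positions here.
    }

    private void nui_VideoFrameReady(object sender, ImageFrameReadyEventArgs e) { }
    private void nui_DepthFrameReady(object sender, ImageFrameReadyEventArgs e) { }

    private void RunSpeechPipeline()
    {
        // Open the Kinect microphone stream and feed it to the speech recognizer
        // (see the Voice Data section below).
    }
}
```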

Joint Data 
The Kinect skeleton only returns the positions of the joints as Vector3's. Thus I created a wrapper class over these joint positions that calculates both velocity and acceleration data. The velocity and acceleration are updated every time the SkeletonFrame event handler is called. I am looking at how to better smooth and interpolate this data so that gesture motions will not be "spikey".
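
A minimal sketch of such a wrapper is below; plain float components stand in for the SDK's vector type, and the names are mine.

```csharp
// Finite-difference wrapper over a single joint's raw positions.
public class TrackedJoint
{
    public float X, Y, Z;       // latest raw position
    public float Vx, Vy, Vz;    // velocity = change in position / dt
    public float Ax, Ay, Az;    // acceleration = change in velocity / dt
    private bool hasPrevious;

    // Call from the SkeletonFrameReady handler with the new position and the
    // elapsed time since the last frame.
    public void Update(float x, float y, float z, float dt)
    {
        if (hasPrevious && dt > 0f)
        {
            float newVx = (x - X) / dt, newVy = (y - Y) / dt, newVz = (z - Z) / dt;
            Ax = (newVx - Vx) / dt;
            Ay = (newVy - Vy) / dt;
            Az = (newVz - Vz) / dt;
            Vx = newVx; Vy = newVy; Vz = newVz;
        }
        X = x; Y = y; Z = z;
        hasPrevious = true;
    }
}
```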

Voice Data
Through the Kinect, we can leverage Microsoft's Speech API, which has voice recognition features. We open the Kinect's microphone input as an audio source stream, and this stream is passed to Microsoft's speech recognizer.

Before anything else, it is necessary to generate a grammar, a list of words or phrases that the speech recognizer should be looking for.
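
Generating and loading that grammar could look roughly like this, assuming the Microsoft Speech recognition API; the word list here is made up, and the audio format is the usual 16 kHz, 16-bit mono PCM.

```csharp
using Microsoft.Speech.AudioFormat;
using Microsoft.Speech.Recognition;

public class VoiceCommands
{
    private SpeechRecognitionEngine sre;

    public void Start(System.IO.Stream kinectAudioStream)
    {
        sre = new SpeechRecognitionEngine();

        // The grammar: the list of words the recognizer should be listening for.
        Choices commands = new Choices("zoom", "pan", "rotate", "select", "menu");
        sre.LoadGrammar(new Grammar(new GrammarBuilder(commands)));

        sre.SpeechRecognized += OnSpeechRecognized;

        // Feed the Kinect microphone stream to the recognizer (16 kHz, 16-bit mono PCM).
        sre.SetInputToAudioStream(kinectAudioStream,
            new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));
        sre.RecognizeAsync(RecognizeMode.Multiple);
    }

    private void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        // e.Result.Text holds the recognized command word.
    }
}
```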

Much like the skeleton tracking, the speech recognition system uses an event-handler system. The three events that are thrown are SpeechRecognized, SpeechHypothesized and SpeechRecognitionRejected. I still need to do more reading about the exact details of these three events, but I am only responding to SpeechRecognized events for now.

Testing Environment
I made a simple testing interface as I continue to implement gesture recognition features and modes. 
All my work right now involves basic manipulation (zoom, pan) of a circle in two dimensions.
I also output some key information, such as the current mode that my interface is in and the sensitivity modifier (discussed in the previous post).

I will go more in-depth about this test environment in my next post and explain some of the new gesture recognition models that I have come up with. Videos too! Look for it tomorrow, once I get Camtasia installed on my dev computer (I need Joe's admin password).

Test interface:

Friday, September 23, 2011

Spatial State


There is actually a wealth of motion-capture-driven interface tech demos out on the net. One of the most famous examples comes from a TED talk by John Underkoffler (the designer behind the fictional Minority Report interface).

In Underkoffler's interface, all control is performed by gestures and movements of the hands. As Underkoffler navigates through a series of pictures, his hand becomes an extension into the 3D spatial representation of all the photo files. The movement of his hand along the x, y, z axes maps to movement along the same axes inside the virtual space.

Despite the multi-dimensionality of Underkoffler's user interaction, there is one aspect that is strikingly static: the actual root position of the user. As Underkoffler swings his hands up, down, left, right, forward and backward, his feet and the majority of his body remain stationary. If you look carefully at the TED video, Underkoffler has marked (with tape) the exact location where he must stand during all interaction. I believe that disregarding this extra modality is a wasted opportunity in producing a natural user interface.

Thus I propose to use the spatial position of the user's root (body) as a way to represent state, with the ability to modify all ongoing interactions. This is quite a bit of information and might be hard to understand conceptually, so I will give a short example of one way I plan to use the spatial location of the root:
The spatial interaction environment

Imagine that you are manipulating a 3D unit cube using the Kinect. As you move your hands up, down, left and right, your view of the cube pans accordingly. However, what if you want to make very small, detailed pans? This becomes difficult due to the limited accuracy of the Kinect motion capture system. Thus I will implement a system where the closer you are to the Kinect (smaller Z-axis value), the smaller the mapping factor between hand movement and camera panning (large movements in real life lead to smaller movements in the virtual world). The opposite is true when you are further away from the Kinect device: the mapping factor is greater, and thus smaller movements of the hands can lead to large movements in the virtual world. Your spatial position in the interaction environment becomes a modifier to the sensitivity of the hand gesture.
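
A minimal sketch of this distance-based modifier (the distance range and factor bounds are illustrative, not final values):

```csharp
public static class SpatialSensitivity
{
    // Illustrative bounds: tune against the Kinect's actual working range.
    private const float NearZ = 1.0f;        // standing close to the sensor
    private const float FarZ  = 3.5f;        // standing far from the sensor
    private const float MinFactor = 0.2f;    // fine, detailed control when close
    private const float MaxFactor = 2.0f;    // coarse, sweeping control when far

    // Maps the user's root Z distance to a sensitivity multiplier for panning.
    public static float PanFactor(float rootZ)
    {
        float t = (rootZ - NearZ) / (FarZ - NearZ);
        if (t < 0f) t = 0f;
        if (t > 1f) t = 1f;
        return MinFactor + t * (MaxFactor - MinFactor);   // closer => smaller factor
    }
}

// Usage: cameraPan += handDelta * SpatialSensitivity.PanFactor(rootZ);
```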


This is also effective because it is a natural extension of the real world. As you work on more detailed elements of a drawing, you move in closer; however, if you are doing overall "big picture" work, you take a step back so that you maintain the entire perspective.

I think that using this concept of spatial state can lead to even richer interactions, and I will try to incorporate it into other interactions too.

Wednesday, September 21, 2011

Hello all!

Now that I have finished a more complete design document, I have begun jumping into the Microsoft Kinect SDK and taking a look at the capabilities of the Kinect.

Linking the Kinect to a Windows 7 PC is actually a fairly simple process. Once the SDK has been installed, you simply plug the Kinect into the PC through a USB port and the fun begins. The first thing that I looked at was the type of data that the Kinect SDK provides. The Kinect SDK is able to track up to two skeletal models, giving joint locations as vectors in 3D space.
The values are normalized from -1 to 1 on the x, y and z axes. The Kinect also returns a depth map that is stored in a byte array, but I have not yet decided if that data is necessary. The raw video streams are also fairly easy to extract using the SDK, and this will be greatly useful for debugging, as we can overlay the skeletal joints on top and see what the Kinect is recognizing at the moment.

I have also designed the general pipeline of our interface. The code will be divided in a very traditional model-view-controller pattern. Our controller consists of the Kinect device and its accompanying SDK; here, raw motion and voice input is captured and filtered into usable data. The model consists of our recognition engine; the current state of the interface is stored here, and the engine is responsible for changing that state in response to input when necessary. Finally, I will use the Unity engine solely for rendering (the view), and it will connect to my recognition engine through the use of DLLs.



I will be coding a few simple Kinect demos in the next few days, and I am beginning to design the interaction and user experience. Look for all of this in another blog post shortly.

Wednesday, September 14, 2011

My Abstract

First Post!
Here is my abstract:

Human computer interaction has traditionally been limited to the mouse and keyboard; however, with the advent of touch screens and motion capture hardware, there has been a rise in the concept of the “natural user interface” or NUI. The natural user interface is heavily driven by gesture rather than precise movements and clicks of the mouse. This project will explore interactions with visualizations driven by multimodal input (in the form of motion and voice).