With evolving AI technology, human-machine interaction has reached new levels and is no longer limited to the mouse and keyboard. Motivated by the desire to let users provide intuitive human activities as input, this article describes a framework for human activity detection and classification.
The observations used to characterize the gestures are obtained from features extracted from the segmented human image. By selecting a subset of frames from the large set of frames obtained from the video and subtracting consecutive frames, we obtain exactly the body parts that the person is moving. Features are then extracted from this difference image, which contains only the objects of interest, i.e. the moving portions of the image. In the classification stage, we assume the problem to be linear and use simple if-else rules, which give a high recognition rate at low computational complexity.
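The frame-subtraction idea can be sketched as follows. This is a minimal illustration, not the article's exact implementation: two grayscale frames are differenced and thresholded (the threshold value 25 is an assumption) to yield a binary mask of the moving pixels.

```python
import numpy as np

def motion_mask(prev_frame, curr_frame, thresh=25):
    """Absolute difference of two grayscale frames, thresholded to a
    binary mask of the pixels that moved between them."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > thresh).astype(np.uint8)

# Toy example: a 2x2 bright patch "moves" one column to the right.
prev = np.zeros((6, 6), dtype=np.uint8)
curr = np.zeros((6, 6), dtype=np.uint8)
prev[2:4, 1:3] = 200
curr[2:4, 2:4] = 200
mask = motion_mask(prev, curr)
```

Only the leading and trailing edges of the patch appear in the mask; the overlap region, which did not change, is suppressed.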
Within the long video sequences, just a few frames are enough to recognize the parts of the body being used to perform the activity; these are then used to compute features and to make assumptions about what kind of action is performed and with what intention. For example, to walk a person uses the legs, so we can identify the activity of walking when there is a rhythmic movement of the legs along with a displacement of the entire body between frames.
The most general tasks of systems that perform human motion analysis are person detection and tracking, activity classification, and behavior interpretation. With the significant progress made in detection and tracking techniques and systems, human motion and behavior interpretation has naturally narrowed to three steps: tracking, representation, and recognition.
The basic steps to be followed are stated below.
1. Extracting frames. Any available software can be used to extract frames from the captured videos. The videos used here run at 25 frames per second for the first set and 20 frames per second for the second. We do not need to analyze every frame, so we select every tenth frame from each video, which means we process approximately 2 frames per second.
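The subsampling in this step amounts to keeping every tenth decoded frame. A minimal sketch (the helper name `sample_frames` is hypothetical):

```python
def sample_frames(frames, step=10):
    """Keep every `step`-th frame; at 25 fps and step=10 this leaves
    2.5 frames per second, at 20 fps it leaves 2 frames per second."""
    return frames[::step]

frames = list(range(100))     # stand-in for 100 decoded frames (4 s at 25 fps)
kept = sample_frames(frames)  # indices 0, 10, 20, ..., 90
```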
2. Segmenting background and foreground objects from each frame. Preprocessing (morphological) operations are used to smooth the image and remove noise, along with segmenting the human body from the foreground. This step returns the human body silhouette, which may sometimes be distorted, but it is our only reference for tracing the human body in the scene.
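One standard morphological operation for removing speckle noise from a binary silhouette is an opening (erosion followed by dilation). The following is an illustrative pure-NumPy sketch with a fixed 3x3 structuring element; the article does not specify which operations or kernel sizes were used.

```python
import numpy as np

def erode(mask):
    """3x3 binary erosion: a pixel stays white only if its whole
    3x3 neighbourhood is white (border pixels are dropped)."""
    out = np.zeros_like(mask)
    out[1:-1, 1:-1] = (
        mask[:-2, :-2] & mask[:-2, 1:-1] & mask[:-2, 2:] &
        mask[1:-1, :-2] & mask[1:-1, 1:-1] & mask[1:-1, 2:] &
        mask[2:, :-2] & mask[2:, 1:-1] & mask[2:, 2:]
    )
    return out

def dilate(mask):
    """3x3 binary dilation: a pixel becomes white if any neighbour is white."""
    padded = np.pad(mask, 1)
    out = np.zeros_like(mask)
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out |= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out

def opening(mask):
    """Opening removes isolated noise pixels but keeps larger blobs
    such as the body silhouette."""
    return dilate(erode(mask))

# A 4x4 silhouette blob plus a single speckle-noise pixel.
mask = np.zeros((10, 10), dtype=np.uint8)
mask[2:6, 2:6] = 1   # silhouette
mask[8, 8] = 1       # noise
clean = opening(mask)
```

The isolated pixel is erased while the 4x4 blob survives the opening intact.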
3. Starting from an assumed initial size, fitting a rectangle over the most crowded portion of the image, i.e. the region where the most displacement has occurred between frames. These are the places with more white pixels than black, which in our case we ensure belong to the person whose activity is being monitored. The rectangle varies in size across the same sequence of frames and may sometimes miss some parts of the body, but those losses are negligible.
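As a simplified stand-in for the iterative rectangle fitting described above, the sketch below computes the tightest rectangle enclosing all white pixels of the motion mask. This is an assumption: the article's method adapts an initial rectangle rather than taking a global bounding box.

```python
import numpy as np

def bounding_rect(mask):
    """Tightest rectangle around the white pixels of a binary mask.
    Returns (top, left, bottom, right), exclusive at bottom/right,
    or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return ys.min(), xs.min(), ys.max() + 1, xs.max() + 1

mask = np.zeros((12, 12), dtype=np.uint8)
mask[3:9, 4:7] = 1           # upright "person" blob
rect = bounding_rect(mask)   # (top, left, bottom, right)
```

The resulting height/width ratio of the rectangle is one of the features a later classification step can use.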
4. Classifying the sequence of images into one of the activities using a rule-based technique. This poses a considerable problem because we treat the problem as linear when it actually is not, so we must be very careful when defining these rules. Moreover, errors accumulated in the previous steps make it very hard to state perfect, static rules that hold in every case.
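A rule set of this kind might look like the following sketch. The feature names, activity labels, and thresholds are all illustrative assumptions, not the article's tuned rules:

```python
def classify(aspect_ratio, centroid_shift, leg_motion):
    """Hand-written if-else rules over three assumed features:
    aspect_ratio   - height/width of the fitted rectangle,
    centroid_shift - horizontal displacement between sampled frames,
    leg_motion     - fraction of moving pixels in the lower half of
                     the rectangle (thresholds are illustrative)."""
    if aspect_ratio > 1.5 and centroid_shift > 5 and leg_motion > 0.3:
        return "walking"
    if aspect_ratio > 1.5 and centroid_shift <= 5:
        return "standing"
    if aspect_ratio <= 1.0:
        return "sitting"
    return "unknown"

label = classify(2.2, 12, 0.5)  # tall, fast-moving silhouette
```

The fall-through "unknown" label is one way to contain the errors propagated from earlier steps rather than forcing every sequence into an activity class.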
We have described a general technique for activity recognition and applied it to the case of walking recognition. We have illustrated the technique using real-world examples and shown that it robustly recognizes the activity under various complications. It is robust to varying image illumination and contrast, because the method uses only motion information, which is invariant to these. It is also fairly robust to small changes in viewing angle.
For the detection of multiple moving objects, we can detect the pixels in an image sequence whose motion is independent of the background and segment the frames into distinct regions corresponding to different moving objects. Other common methods of segmenting multiple moving objects use color cues, distance from the camera obtained from a range sensor, or selection of objects moving within a certain velocity range.
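Segmenting the motion mask into distinct regions can be done with connected-component labelling; each component is one candidate moving object. A minimal 4-connected flood-fill sketch (the function name is hypothetical; libraries such as SciPy provide an equivalent `ndimage.label`):

```python
from collections import deque

import numpy as np

def label_regions(mask):
    """4-connected component labelling of a binary motion mask.
    Returns (labels, count) where labels[y, x] is 0 for background
    or the 1-based id of the region the pixel belongs to."""
    labels = np.zeros(mask.shape, dtype=int)
    h, w = mask.shape
    count = 0
    for sy, sx in zip(*np.nonzero(mask)):
        if labels[sy, sx]:
            continue                      # already part of a region
        count += 1
        labels[sy, sx] = count
        queue = deque([(sy, sx)])
        while queue:                      # breadth-first flood fill
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not labels[ny, nx]:
                    labels[ny, nx] = count
                    queue.append((ny, nx))
    return labels, count

# Two separated moving blobs -> two regions.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[1:3, 1:3] = 1
mask[5:7, 5:7] = 1
labels, n = label_regions(mask)
```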