Perhaps the most fundamental application of affective computing would be Human-Computer Interaction (HCI) in which the computer is able to detect and track the user's affective states, and make corresponding feedback. The human multi-sensor affect system defines the expectation of multimodal affect analyzer. In this paper, we present our efforts toward audio-visual HCI-related affect recognition. With HCI applications in mind, we take into account some special affective states which indicate users' cognitive/motivational states. Facing the fact that a facial expression is influenced by both an affective state and speech content, we apply a smoothing method to extract the information of the affective state from facial features. In our fusion stage, a voting method is applied to combine audio and visual modalities so that the final affect recognition accuracy is greatly improved. We test our bimodal affect recognition approach on 38 subjects with 11 HCI-related affect states. The extensive experimental results show that the average person-dependent affect recognition accuracy is almost 90% for our bimodal fusion.