多模态人机交互综述
A survey on multi-modal human-computer interaction
2022, Vol. 27, No. 6, Pages: 1956-1987
Print publication date: 2022-06-16
Accepted: 2022-03-30
DOI: 10.11834/jig.220151
陶建华, 巫英才, 喻纯, 翁冬冬, 李冠君, 韩腾, 王运涛, 刘斌. 多模态人机交互综述[J]. 中国图象图形学报, 2022,27(6):1956-1987.
Jianhua Tao, Yingcai Wu, Chun Yu, Dongdong Weng, Guanjun Li, Teng Han, Yuntao Wang, Bin Liu. A survey on multi-modal human-computer interaction[J]. Journal of Image and Graphics, 2022,27(6):1956-1987.
多模态人机交互旨在利用语音、图像、文本、眼动和触觉等多模态信息进行人与计算机之间的信息交换。在生理心理评估、办公教育、军事仿真和医疗康复等领域具有十分广阔的应用前景。本文系统地综述了多模态人机交互的发展现状和新兴方向,深入梳理了大数据可视化交互、基于声场感知的交互、混合现实实物交互、可穿戴交互和人机对话交互的研究进展以及国内外研究进展比较。本文认为拓展新的交互方式、设计高效的各模态交互组合、构建小型化交互设备、跨设备分布式交互、提升开放环境下交互算法的鲁棒性等是多模态人机交互的未来研究趋势。
Benefiting from the development of the Internet of Things, human-computer interaction devices have become ubiquitous in daily life. Human-computer interaction is no longer limited to the input and output modes of a single sensory channel (vision, touch, hearing, smell, or taste). Multi-modal human-computer interaction aims to exchange information between humans and computers using multi-modal information such as speech, images, text, eye movements, and touch. It encompasses both multi-modal information input from human to computer and multi-modal information presentation from computer to human, and it is a comprehensive discipline closely related to cognitive psychology, ergonomics, multimedia technology, and virtual reality technology. At present, multi-modal human-computer interaction is becoming ever more closely intertwined with the research and technologies of the image and graphics field. In the era of big data and artificial intelligence, multi-modal human-computer interaction technology, as the technical carrier connecting humans, machines, and things, is closely tied to developments in image and graphics, artificial intelligence, affective computing, physiological and psychological assessment, Internet big data, office and education, medical rehabilitation, and other fields. Research on multi-modal human-computer interaction first appeared in the 1990s, when a number of works proposed interaction methods combining speech and gesture.

In recent years, the emergence of immersive visualization has provided a new multi-modal interface for human-computer interaction: an immersive environment that integrates visual, auditory, tactile, and other sensory channels. Visualization is an important scientific technology for data analysis and exploration: it converts abstract data into graphical representations and facilitates analytical reasoning through interactive interfaces. Amid today's data explosion, visualization transforms complex big data into easy-to-understand content, improving people's ability to understand and explore data. Traditional interactive interfaces support only flat visual designs, in both their data mapping channels and their data interaction methods, and cannot meet analysis needs in the big data era, where data visualization suffers from limited presentation space, overly abstract data representations, and data occlusion. Immersive visualization offers a broad presentation space for high-dimensional big data and, by integrating multiple sensory channels and modalities, allows users to interact with data naturally and in parallel across channels.
Interaction technology based on sound field perception can be divided into three types according to its working principle: 1) measuring and identifying the acoustic characteristics of a specific space or passage, or the changes in those characteristics caused by user actions; 2) using sound-wave measurements from a microphone array to localize a sound source, where the source may emit a specific carrier audio to improve localization accuracy and robustness; and 3) using machine learning algorithms to recognize sounds from a specific scene, environment, or the human body. Technical solutions include methods based on sound field perception alone as well as sensor-fusion schemes.
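As a minimal illustration of the second principle, the following Python sketch estimates a sound source's direction of arrival from a two-microphone array by cross-correlating the channels to recover the time difference of arrival (TDOA); the sampling rate, array spacing, and test signal are illustrative assumptions, not parameters of any surveyed system.

```python
# Direction-of-arrival estimation from a two-microphone array via TDOA.
# A minimal sketch; all parameters below are illustrative assumptions.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature


def estimate_doa(mic_a: np.ndarray, mic_b: np.ndarray,
                 fs: float, spacing: float) -> float:
    """Return the arrival angle (radians) of a source relative to broadside."""
    # The lag of the cross-correlation peak is the inter-microphone
    # sample delay; dividing by the sampling rate gives the TDOA.
    corr = np.correlate(mic_a, mic_b, mode="full")
    lag = np.argmax(corr) - (len(mic_b) - 1)
    tdoa = lag / fs
    # Far-field geometry: path difference = spacing * sin(angle).
    # Clip in case noise pushes the ratio outside [-1, 1].
    return float(np.arcsin(np.clip(tdoa * SPEED_OF_SOUND / spacing, -1.0, 1.0)))


if __name__ == "__main__":
    fs, spacing, delay = 48_000, 0.1, 5  # Hz, meters, samples
    t = np.arange(2048) / fs
    chirp = np.sin(2 * np.pi * 2_000 * t) * np.hanning(t.size)
    mic_a = np.pad(chirp, (delay, 0))  # reaches microphone A later...
    mic_b = np.pad(chirp, (0, delay))  # ...so the source is closer to B
    angle = estimate_doa(mic_a, mic_b, fs, spacing)
    print(f"estimated direction of arrival: {np.degrees(angle):.1f} degrees")
```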
In a physical (tangible) interaction system, the user interacts with the virtual environment through physical objects that exist in the real environment. In recent years, integrating tangible interaction interfaces into virtual reality and augmented reality has become a mainstream direction in this field, and the concept of "physical mixed reality" has gradually taken shape, which is also the conceptual basis of passive haptics. The haptics of physical interaction fall into three categories: static passive haptics, passive haptics with feedback, and active force haptics. Because active haptic devices are relatively expensive, research on them remains limited, and the main research directions are still static passive haptics and encounter-type haptics. For mixed reality interaction based on passive haptics, research levels across countries and institutions worldwide do not differ greatly, though their emphases vary slightly.

Wearable interaction research mainly covers gesture interaction and touch interaction, chiefly in the form of wristbands, together with skin-electronics technology and interaction design. Gesture input is considered one of the core elements of a "natural human-machine interface" and is well suited to exploring input methods for wearable devices. The key to realizing gesture input lies in sensing technology. In the field of human-computer interaction, sensing technologies for gesture recognition based on infrared light, motion sensors, electromagnetic and capacitive sensing, ultrasound, cameras, and biological signals have been studied in depth. As the natural interface between people and the outside world, the skin has begun to be explored for its role in information interaction, and its applications in several areas have already demonstrated its advantages.
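To make one such sensing pipeline concrete, here is a toy Python sketch of motion-sensor-based gesture recognition on a wearable: overlapping windows of 3-axis accelerometer data are reduced to simple statistical features and classified with an off-the-shelf model. The window length, feature set, and the two gesture classes are assumptions made purely for illustration.

```python
# Toy gesture recognition from a wearable's 3-axis accelerometer.
# Window length, features, and gesture classes are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def window_features(window: np.ndarray) -> np.ndarray:
    """Reduce an (n_samples, 3) accelerometer window to a feature vector."""
    return np.concatenate([
        window.mean(axis=0),                            # per-axis mean
        window.std(axis=0),                             # per-axis spread
        np.abs(np.diff(window, axis=0)).mean(axis=0),   # per-axis jerkiness
    ])


def segment(stream: np.ndarray, win: int = 64, hop: int = 32) -> np.ndarray:
    """Slice a continuous (n, 3) stream into overlapping feature windows."""
    starts = range(0, len(stream) - win + 1, hop)
    return np.stack([window_features(stream[s:s + win]) for s in starts])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic training data: a "flick" gesture is high-variance motion,
    # while "rest" is near-still. A real system would use recorded gestures.
    flick = segment(rng.normal(0, 2.0, (600, 3)))
    rest = segment(rng.normal(0, 0.1, (600, 3)))
    X = np.vstack([flick, rest])
    y = np.array([1] * len(flick) + [0] * len(rest))
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    probe = segment(rng.normal(0, 2.0, (64, 3)))        # one unseen window
    print("predicted class:", clf.predict(probe))       # 1 -> "flick"
```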
The human-computer dialogue interaction process involves multiple modules, including speech recognition, emotion recognition, the dialogue system, and speech synthesis. First, the user's speech is converted into corresponding text and emotion labels by the speech recognition and emotion recognition modules. The dialogue system then interprets what the user said and generates a dialogue response. Finally, the speech synthesis module converts the response into speech with which to reply to the user.
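The four-module loop just described can be wired up as in the following Python sketch, where each stage is a placeholder callable standing in for a real ASR, emotion recognition, dialogue, or TTS model; the names and types here are illustrative assumptions rather than any particular system's API.

```python
# Structural sketch of one human-computer dialogue turn. Each stage is a
# placeholder; in a real system these would be trained models or services.
from dataclasses import dataclass
from typing import Callable


@dataclass
class DialogueTurn:
    text: str          # output of speech recognition
    emotion: str       # label from emotion recognition
    response: str      # dialogue system output
    audio: bytes       # synthesized reply


def run_turn(user_audio: bytes,
             asr: Callable[[bytes], str],
             emo: Callable[[bytes], str],
             dialog: Callable[[str, str], str],
             tts: Callable[[str], bytes]) -> DialogueTurn:
    """One interaction turn: user audio in, synthesized reply out."""
    text = asr(user_audio)               # 1) speech -> text
    emotion = emo(user_audio)            # 2) speech -> emotion label
    response = dialog(text, emotion)     # 3) understand and generate reply
    return DialogueTurn(text, emotion, response, tts(response))


if __name__ == "__main__":
    turn = run_turn(
        b"<pcm frames>",
        asr=lambda a: "turn on the light",
        emo=lambda a: "neutral",
        dialog=lambda t, e: f"Okay, turning on the light. (mood: {e})",
        tts=lambda r: r.encode("utf-8"),  # stand-in for a vocoder
    )
    print(turn.response)
```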
How to effectively integrate information from different modalities in a human-computer interaction system, and thereby improve interaction quality, also deserves attention. Multi-modal fusion methods can be divided into three types: feature-level fusion, decision-level fusion, and hybrid fusion. Feature-level fusion maps the features extracted from multiple modalities into a single feature vector through some transformation and then feeds that vector to a classification model to obtain the final decision. Decision-level fusion combines the decisions obtained from the individual modalities into a final decision. Hybrid fusion employs both feature-level and decision-level fusion.
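The distinction between the first two strategies can be seen in the short Python sketch below, which contrasts feature-level fusion (concatenating per-modality features before one classifier) with decision-level fusion (averaging per-modality posteriors); the synthetic data and logistic-regression models are stand-ins chosen only for illustration, and a hybrid scheme would combine both paths.

```python
# Feature-level vs. decision-level fusion for two modalities,
# e.g. audio and text embeddings. Data and dimensions are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200
audio_feat = rng.normal(size=(n, 16))   # modality 1 features
text_feat = rng.normal(size=(n, 32))    # modality 2 features
y = (audio_feat[:, 0] + text_feat[:, 0] > 0).astype(int)

# Feature-level fusion: concatenate per-modality features into one vector,
# then train a single classifier on the joint representation.
joint = np.hstack([audio_feat, text_feat])
early = LogisticRegression().fit(joint, y)

# Decision-level fusion: train one classifier per modality and combine
# their posteriors (here an unweighted average) into the final decision.
clf_a = LogisticRegression().fit(audio_feat, y)
clf_t = LogisticRegression().fit(text_feat, y)
posterior = (clf_a.predict_proba(audio_feat) + clf_t.predict_proba(text_feat)) / 2
late_pred = posterior.argmax(axis=1)

print("feature-level accuracy:", early.score(joint, y))
print("decision-level accuracy:", (late_pred == y).mean())
```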
This paper systematically reviews the development status and emerging directions of multi-modal human-computer interaction, surveying in depth the research progress in big data visualization interaction, interaction based on sound field perception, mixed reality physical interaction, wearable interaction, and human-computer dialogue interaction, and comparing progress at home and abroad. We believe that expanding new interaction modalities, designing efficient combinations of the various modalities, building miniaturized interaction devices, enabling cross-device distributed interaction, and improving the robustness of interaction algorithms in open environments are the future research directions of multi-modal human-computer interaction.
关键词:多模态人机交互;大数据可视化交互;声场感知交互;实物交互;可穿戴交互;人机对话交互
Keywords: multi-modal human-computer interaction; big data visualization interaction; sound field perception interaction; entity interaction; wearable interaction; human-computer dialogue interaction
Abtahi P, Gonzalez-Franco M, Ofek E and Steed A. 2019a. I'm a giant: walking in large virtual environments at high speed gains//Proceedings of 2019 CHI Conference on Human Factors in Computing Systems. Glasgow, UK: ACM: #522[DOI: 10.1145/3290605.3300752http://dx.doi.org/10.1145/3290605.3300752]
Abtahi P, Landry B, Yang J, Pavone M, Follmer S and Landay J A. 2019b. Beyond the force: using quadcopters to appropriate objects and the environment for haptics in virtual reality//Proceedings of 2019 CHI Conference on Human Factors in Computing Systems. Glasgow, UK: ACM: 1-13[DOI: 10.1145/3290605.3300589http://dx.doi.org/10.1145/3290605.3300589]
Alghofaili R, Sawahata Y, Huang H K, Wang H C, Shiratori T and Yu L F. 2019. Lost in style: gaze-driven adaptive aid for VR navigation//Proceedings of 2019 CHI Conference on Human Factors in Computing Systems. Glasgow, UK: ACM: 1-12[DOI: 10.1145/3290605.3300578http://dx.doi.org/10.1145/3290605.3300578]
Alper B, Hollerer T, Kuchera-Morin J A and Forbes A. 2011. Stereoscopic highlighting: 2D graph visualization on stereo displays. IEEE Transactions on Visualization and Computer Graphics, 17(12): 2325-2333[DOI: 10.1109/TVCG.2011.234]
Amoh J and Odame K. 2015. DeepCough: a deep convolutional neural network in a wearable cough detection system//Proceedings of 2015 IEEE Biomedical Circuits and Systems Conference (BioCAS). Atlanta, USA: IEEE: 1-4[DOI: 10.1109/BioCAS.2015.7348395http://dx.doi.org/10.1109/BioCAS.2015.7348395]
Ando H, Kitahara Y and Hataoka N. 1994. Evaluation of multi-modal interface using spoken language and pointing gesture on interior design system//Proceedings of the 3rd International Conference on Spoken Language Processing. Yokohama, Japan: ISCA: #77
Araujo B, Jota R, Perumal V, Yao J X, Singh K and Wigdor D. 2016. Snake charmer: physically enabling virtual objects//Proceedings of the TEI'16: the 10th International Conference on Tangible, Embedded, and Embodied Interaction. Eindhoven, the Netherlands: ACM: 218-226[DOI: 10.1145/2839462.2839484http://dx.doi.org/10.1145/2839462.2839484]
Arık S Ö, Chrzanowski M, Coates A, Diamos G, Gibiansky A, Kang Y G, Li X, Miller J, Ng A, Raiman J, Sengupta S and Shoeybi M. 2017. Deep voice: real-time neural text-to-speech//Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: PMLR: 195-204
Baevski A, Schneider S and Auli M. 2020a. Vq-wav2vec: self-supervised learning of discrete speech representations[EB/OL]. [2022-01-20].https://arxiv.org/pdf/1910.05453v2.pdfhttps://arxiv.org/pdf/1910.05453v2.pdf
Baevski A, Zhou Y H, Mohamed A and Auli M. 2020b. Wav2vec 2.0: a framework for self-supervised learning of speech representations[EB/OL]. [2022-01-20].https://arxiv.org/pdf/2006.11477.pdfhttps://arxiv.org/pdf/2006.11477.pdf
Bakhshi A, Wong A S W and Chalup S K. 2020. End-to-end speech emotion recognition based on time and frequency information using deep neural networks//Proceedings of the 24th European Conference on Artificial Intelligence. Santiago de Compostela, Spain: IOS Press: 969-975
Baloup M, Pietrzak T and Casiez G. 2019. RayCursor: a 3D pointing facilitation technique based on raycasting//Proceedings of 2019 CHI Conference on Human Factors in Computing Systems. Glasgow, UK: ACM: #101[DOI: 10.1145/3290605.3300331http://dx.doi.org/10.1145/3290605.3300331]
Barbieri F, Ballesteros M, Ronzano F and Saggion H. 2018. Multi-modal emoji prediction//Proceedings of 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. New Orleans, USA: ACL: 679-686[DOI: 10.18653/v1/N18-2107http://dx.doi.org/10.18653/v1/N18-2107]
Bourguet M L. 2003. Designing and prototyping multi-modal commands//Proceedings of the IFIP TC13 International Conference on Human-Computer Interaction. Zurich, Switzerland: IOS Press: 717-720
Büschel W, Chen J, Dachselt R, Drucker S, Dwyer T, Görg C, Isenberg T, Kerren A, North C and Stuerzlinger W. 2018. Interaction for immersive analytics//Marriott K, Schreiber F, Dwyer T, Klein K, Riche N H, Itoh T, Stuerzlinger W and Thomas B H, eds. Immersive Analytics. Cham: Springer: 95-138[DOI: 10.1007/978-3-030-01388-2_4http://dx.doi.org/10.1007/978-3-030-01388-2_4]
Cassell J, Pelachaud C, Badler N, Steedman M, Achorn B, Becket T, Douville B, Prevost S and Stone M. 1994. Animated conversation: rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents//Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques. New York, USA: ACM: 413-420[DOI: 10.1145/192161.192272http://dx.doi.org/10.1145/192161.192272]
Chen C. 2021. Research on Active Fatigue Driving Detection Based on Audio Perception. Tianjin: Tianjin University of Technology
陈超. 2021. 基于音频感知的主动疲劳驾驶检测研究. 天津: 天津理工大学
Chen C J, Wang Z W, Wu J, Wang X T, Guo L Z, Li Y F and Liu S X. 2021. Interactive graph construction for graph-based semi-supervised learning. IEEE Transactions on Visualization and Computer Graphics, 27(9): 3701-3716[DOI: 10.1109/TVCG.2021.3084694]
Chen N X, Watanabe S, Villalba J, Z·elasko P and Dehak N. 2020a. Non-autoregressive transformer for speech recognition. IEEE Signal Processing Letters, 28: 121-125[DOI: 10.1109/LSP.2020.3044547]
Chen X Y, Xu J M and Xu B. 2019. A working memory model for task-oriented dialog response generation//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: ACL: 2687-2693[DOI: 10.18653/v1/P19-1258http://dx.doi.org/10.18653/v1/P19-1258]
Chen Y Q, Qin X, Wang J D, Yu C H and Gao W. 2020b. FedHealth: a federated transfer learning framework for wearable healthcare. IEEE Intelligent Systems, 35(4): 83-93[DOI: 10.1109/MIS.2020.2988604]
Cheng L P, Roumen T, Rantzsch H, Köhler S, Schmidt P, Kovacs R, Jasper J, Kemper J and Baudisch P. 2015. TurkDeck: physical virtual reality based on people//Proceedings of the 28th Annual ACM Symposium on User Interface Software and Technology. Charlotte, USA: ACM: 417-426[DOI: 10.1145/2807442.2807463http://dx.doi.org/10.1145/2807442.2807463]
Chi E A, Salazar J and Kirchhoff K. 2021. Align-refine: non-autoregressive speech recognition via iterative realignment//Proceedings of 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. [s. l.]: Association for Computational Linguistics: 1920-1927[DOI: 10.18653/v1/2021.naacl-main.154http://dx.doi.org/10.18653/v1/2021.naacl-main.154]
Chiu C C and Raffel C. 2018. Monotonic chunkwise attention//Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada: OpenReview. net
Choi I, Culbertson H, Miller M R, Olwal A and Follmer S. 2017. Grabity: a wearable haptic interface for simulating weight and grasping in virtual reality//Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology. Québec City, Canada: ACM: 119-130[DOI: 10.1145/3126594.3126599http://dx.doi.org/10.1145/3126594.3126599]
Choi I, Hawkes E W, Christensen D L, Ploch C J and Follmer S. 2016. Wolverine: a wearable haptic interface for grasping in virtual reality//Proceedings of 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Daejeon, Korea(South): IEEE: 986-993[DOI: 10.1109/IROS.2016.7759169http://dx.doi.org/10.1109/IROS.2016.7759169]
Chu X T, Xie X, Ye S N, Lu H L, Xiao H G, Yuan Z Q, Chen Z T, Zhang H and Wu Y C. 2022. TIVEE: visual exploration and explanation of badminton tactics in immersive visualizations. IEEE Transactions on Visualization and Computer Graphics, 28(1): 118-128[DOI: 10.1109/TVCG.2021.3114861]
Cordeil M, Bach B, Cunningham A, Montoya B, Smith R T, Thomas B H and Dwyer T. 2020. Embodied axes: tangible, actuated interaction for 3d augmented reality data spaces//Proceedings of 2020 CHI Conference on Human Factors in Computing Systems. Honolulu, USA: ACM: 1-12[DOI: 10.1145/3313831.3376613http://dx.doi.org/10.1145/3313831.3376613]
Cui C, Wang W J, Song X M, Huang M L, Xu X S and Nie L Q. 2019. User attention-guided multi-modal dialog systems//Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. Paris, France: ACM: 445-454[DOI: 10.1145/3331184.3331226http://dx.doi.org/10.1145/3331184.3331226]
Dai Y Y, Jin Y, Ma Y, Yang Z X and Yu J J. 2021. Speech emotion recognition based on efficient channel attention. Journal of Signal Processing, 37(10): 1835-1842
戴妍妍, 金赟, 马勇, 杨子秀, 俞佳佳. 2021. 基于高效通道注意力机制的语音情感识别方法. 信号处理, 37(10): 1835-1842[DOI: 10.16798/j.issn.1003-0530.2021.10.006]
Debie E, Rojas R F, Fidock J, Barlow M, Kasmarik K, Anavatti S, Garratt M and Abbass H A. 2021. Multi-modal fusion for objective assessment of cognitive workload: a review. IEEE Transactions on Cybernetics, 51(3): 1542-1555[DOI: 10.1109/TCYB.2019.2939399]
Deng Z K, Weng D, Liang Y X, Bao J, Zheng Y, Schreck T, Xu M L and Wu Y C. 2021. Visual cascade analytics of large-scale spatiotemporal data. IEEE Transactions on Visualization and Computer Graphics: #9397369[DOI: 10.1109/TVCG.2021.3071387http://dx.doi.org/10.1109/TVCG.2021.3071387]
Deng Z K, Weng D, Xie X, Bao J, Zheng Y, Xu M L, Chen W and Wu Y C. 2022. Compass: towards better causal analysis of urban time series. IEEE Transactions on Visualization and Computer Graphics, 28(1): 1051-1061[DOI: 10.1109/TVCG.2021.3114875]
Dragicevic P, Jansen Y and Moere A V. 2021. Data physicalization//Vanderdonckt J, Palanque P and Winckler M, eds. Handbook of Human Computer Interaction. Cham: Springer: 1-51[DOI: 10.1007/978-3-319-27648-9_94-1http://dx.doi.org/10.1007/978-3-319-27648-9_94-1]
Drogemuller A, Cunningham A, Walsh J, Cordeil M, Ross W and Thomas B. 2018. Evaluating navigation techniques for 3D graph visualizations in virtual reality//Proceedings of 2018 International Symposium on Big Data Visual and Immersive Analytics (BDVA). Konstanz, Germany: IEEE: 1-10[DOI: 10.1109/BDVA.2018.8533895http://dx.doi.org/10.1109/BDVA.2018.8533895]
Eric M, Krishnan L, Charette F and Manning C D. 2017. Key-value retrieval networks for task-oriented dialogue//Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. Saarbrücken, Germany: ACL: 37-49[DOI: 10.18653/v1/W17-5506http://dx.doi.org/10.18653/v1/W17-5506]
Eyben F, Wöllmer M and Schuller B. 2010. Opensmile: the munich versatile and fast open-source audio feature extractor//Proceedings of the 18th ACM International Conference on Multimedia. Firenze, Italy: ACM: 1459-1462[DOI: 10.1145/1873951.1874246http://dx.doi.org/10.1145/1873951.1874246]
Fei Z C, Li Z K, Zhang J C, Feng Y and Zhou J. 2021. Towards expressive communication with internet memes: a new multi-modal conversation dataset and benchmark[EB/OL]. [2022-01-20].https://arxiv.org/pdf/2109.01839.pdfhttps://arxiv.org/pdf/2109.01839.pdf
Filho J A W, Freitas C M D S and Nedel L. 2019. Comfortable immersive analytics with the VirtualDesk metaphor. IEEE Computer Graphics and Applications, 39(3): 41-53[DOI: 10.1109/MCG.2019.2898856]
Filho J A W, Stuerzlinger W and Nedel L. 2020. Evaluating an immersive space-time cube geovisualization for intuitive trajectory data exploration. IEEE Transactions on Visualization and Computer Graphics, 26(1): 514-524[DOI: 10.1109/TVCG.2019.2934415]
Franklin K M and Roberts J C. 2003. Pie chart sonification//Proceedings of the 7th International Conference on Information Visualization, 2003. IV 2003. London, UK: IEEE: 4-9[DOI: 10.1109/IV.2003.1217949http://dx.doi.org/10.1109/IV.2003.1217949]
Fu L, Li X X, Wang R Y, Fan L, Zhang Z C, Chen M, Wu Y Z and He X D. 2021. SCaLa: supervised contrastive learning for end-to-end speech recognition. [EB/OL]. [2022-01-20].https://arxiv.org/pdf/2110.04187.pdfhttps://arxiv.org/pdf/2110.04187.pdf
Fujie S, Ejiri Y, Matsusaka Y, Kikuchi H and Kobayashi T. 2003. Recognition of para-linguistic information and its application to spoken dialogue system//Proceedings of 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No. 03EX721). St Thomas, USA: IEEE: 231-236[DOI: 10.1109/ASRU.2003.1318446http://dx.doi.org/10.1109/ASRU.2003.1318446]
Fujie S, Yagi D, Matsusaka Y, Kikuchi H and Kobayashi T. 2004. Spoken dialogue system using prosody as para-linguistic information//Proceedings of the Speech Prosody 2004. Nara, Japan: ISCA
Funk M, Müller F, Fendrich M, Shene M, Kolvenbach M, Dobbertin N, Günther S and Mühlhäuser M. 2019. Assessing the accuracy of point and teleport locomotion with orientation indication for virtual reality using curved trajectories//Proceedings of 2019 CHI Conference on Human Factors in Computing Systems. Glasgow, UK: ACM: #147[DOI: 10.1145/3290605.3300377http://dx.doi.org/10.1145/3290605.3300377]
GannonM, Grossman T and Fitzmaurice G. 2015. Tactum: a skin-centric approach to digital design and fabrication//Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. Seoul, Korea(South): ACM: 1779-1788[DOI: 10.1145/2702123.2702581http://dx.doi.org/10.1145/2702123.2702581]
Gannon M, Grossman T and Fitzmaurice G. 2016. ExoSkin: on-body fabrication//Proceedings of 2016 CHI Conference on Human Factors in Computing Systems. San Jose, USA: ACM: 5996-6007[DOI: 10.1145/2858036.2858576http://dx.doi.org/10.1145/2858036.2858576]
Gong J, Gupta A and Benko H. 2020. Acustico: surface tap detection and localization using wrist-based acoustic TDOA sensing//Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology. New York, USA: ACM: 406-419[DOI: 10.1145/3379337.3415901http://dx.doi.org/10.1145/3379337.3415901]
Gong Y, Chung Y A and Glass J. 2021a. PSLA: improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 3292-3306[DOI: 10.1109/TASLP.2021.3120633]
Gong Y, Chung Y A and Glass J R. 2021b. AST: audio spectrogram transformer//Proceedings of the 22nd Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA: 571-575
Goto M, Itou K and Hayamizu S. 2002. Speech completion: on-demand completion assistance using filled pauses for speech input interfaces//Proceedings of the 7th International Conference on Spoken Language Processing. Denver, USA: ISCA: 1489-1492
Goto M, Kitayama K, Itou K and Kobayashi T. 2004. Speech spotter: on-demand speech recognition in human-human conversation on the telephone or in face-to-face situations//Proceedings of the 8th International Conference on Spoken Language Processing. Jeju Island, Korea(South): ISCA: 1533-1536
Goto M, Omoto Y, Itou K and Kobayashi T. 2003. Speech shift: direct speech-input-mode switching through intentional control of voice pitch//Proceedings of the 8th European Conference on Speech Communication and Technology. Geneva, Switzerland: ISCA: 1201-1204
Groeger D and Steimle J. 2017. ObjectSkin: augmenting everyday objects with hydroprinted touch sensors and displays. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(4): #134[DOI: 10.1145/3161165]
Guo J, Weng D D, Fang H, Zhang Z L, Ping J M, Liu Y and Wang Y T. 2020. Exploring the differences of visual discomfort caused by long-term immersion between virtual environments and physical environments//Proceedings of 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). Atlanta, USA: IEEE: 443-452[DOI: 10.1109/VR46266.2020.00065http://dx.doi.org/10.1109/VR46266.2020.00065]
Guo P C, Boyer F, Chang X K, Hayashi T, Higuchi Y, Inaguma H, Kamo N, Li C D, Garcia-Romero D, Shi J T, Shi J, Watanabe S, Wei K, Zhang W Y and Zhang Y K. 2021. Recent developments on espnet toolkit boosted by conformer//Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 5874-5878[DOI: 10.1109/ICASSP39728.2021.9414858http://dx.doi.org/10.1109/ICASSP39728.2021.9414858]
Gupta A, Irudayaraj A A R and Balakrishnan R. 2017. HapticClench: investigating squeeze sensations using memory alloys//Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology. Québec City, Canada: ACM: 109-117[DOI: 10.1145/3126594.3126598http://dx.doi.org/10.1145/3126594.3126598]
Gupta S, Morris D, Patel S and Tan D. 2012. SoundWave: using the Doppler effect to sense gestures//Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Austin, USA: ACM: 1911-1914[DOI: 10.1145/2207676.2208331http://dx.doi.org/10.1145/2207676.2208331]
Haber J, Baumgärtner T, Takmaz E, Gelderloos L, Bruni E and Fernández R. 2019. The PhotoBook dataset: building common ground through visually-grounded dialogue//Proceedings of the 57th Conference of the Association for Computational Linguistics. Florence, Italy: ACL: 1895-1910
Han T, Hasan K, Nakamura K, Gomez R and Irani P. 2017. SoundCraft: enabling spatial interactions on smartwatches using hand generated acoustics//Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology. Québec City, Canada: ACM: 579-591[DOI: 10.1145/3126594.3126612http://dx.doi.org/10.1145/3126594.3126612]
Han T, Li J N, Hasan K, Nakamura K, Gomez R, Balakrishnan R and Irani P. 2018. PageFlip: leveraging page-flipping gestures for efficient command and value selection on smartwatches//Proceedings of 2018 CHI Conference on Human Factors in Computing Systems. Montreal, Canada: ACM: #529[DOI: 10.1145/3173574.3174103http://dx.doi.org/10.1145/3173574.3174103]
Han W J, Li H F and Han J Q. 2008. Speech emotion recognition with combined short and long term features. Journal of Tsinghua University (Science and Technology), 48(S1): 708-714
韩文静, 李海峰, 韩纪庆. 2008. 基于长短时特征融合的语音情感识别方法. 清华大学学报(自然科学版), 48(S1): 708-714[DOI: 10.16511/j.cnki.qhdxxb.2008.s1.023]
Han W J, Li H F, Ruan H B and Ma L. 2014. Review on speech emotion recognition. Journal of Software, 25(1): 37-50
韩文静, 李海峰, 阮华斌, 马琳. 2014. 语音情感识别研究进展综述. 软件学报, 25(1): 37-50[DOI: 10.13328/j.cnki.jos.004497]
Harada S, Landay J A, Malkin J, Li X and Bilmes J A. 2006. The vocal joystick: evaluation of voice-based cursor control techniques//Proceedings of the 8th International ACM SIGACCESS Conference on Computers and Accessibility. Portland, USA: ACM: 197-204
Harada S, Wobbrock J O, Malkin J, Bilmes J A and Landay J A. 2009. Longitudinal study of people learning to use continuous voice-based cursor control//Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Boston, USA: ACM: 347-356[DOI: 10.1145/1518701.1518757http://dx.doi.org/10.1145/1518701.1518757]
Harrison C, Benko H and Wilson A D. 2011. OmniTouch: wearable multitouch interaction everywhere//Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. Santa Barbara, USA: ACM: 441-450[DOI: 10.1145/2047196.2047255http://dx.doi.org/10.1145/2047196.2047255]
Harrison C and Faste H. 2014. Implications of location and touch for on-body projected interfaces//Proceedings of 2014 Conference on Designing Interactive Systems. Vancouver, Canada: ACM: 543-552[DOI: 10.1145/2598510.2598587http://dx.doi.org/10.1145/2598510.2598587]
Heo S, Hung C, Lee G and Wigdor D. 2018. Thor's hammer: an ungrounded force feedback device utilizing propeller-induced propulsive force//Proceedings of 2018 CHI Conference on Human Factors in Computing Systems. Montreal, Canada: ACM: #525[DOI: 10.1145/3173574.3174099http://dx.doi.org/10.1145/3173574.3174099]
Hershey S, Chaudhuri S, Ellis D P W, Gemmeke J F, Jansen A, Moore R C, Plakal M, Platt D, Saurous R A, Seybold B, Slaney M, Weiss R J and Wilson K. 2017. CNN architectures for large-scale audio classification//Proceedings of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). New Orleans, USA: IEEE: 131-135[DOI: 10.1109/ICASSP.2017.7952132http://dx.doi.org/10.1109/ICASSP.2017.7952132]
Higuchi Y, Inaguma H, Watanabe S, Ogawa T and Kobayashi T. 2021. Improved mask-CTC for non-autoregressive end-to-end ASR//Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 8363-8367[DOI: 10.1109/ICASSP39728.2021.9414198http://dx.doi.org/10.1109/ICASSP39728.2021.9414198]
House B, Malkin J and Bilmes J. 2009. The VoiceBot: a voice controlled robot arm//Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Boston, USA: ACM: 183-192[DOI: 10.1145/1518701.1518731http://dx.doi.org/10.1145/1518701.1518731]
Hsu W N, Sriram A, Baevski A, Likhomanenko T, Xu Q T, Pratap V, Kahn J, Lee A, Collobert R, Synnaeve G and Auli M. 2021. Robust wav2vec 2.0: analyzing domain shift in self-supervised pre-training//Proceedings of the 22nd Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA: 721-725
Hu M. 2015. Exploring new paradigms for accessible 3D printed graphs//Proceedings of the 17th International ACM SIGACCESS Conference on Computers and Accessibility. Lisbon, Portugal: ACM: 365-366[DOI: 10.1145/2700648.2811330http://dx.doi.org/10.1145/2700648.2811330]
Hu Z M, Bulling A, Li S and Wang G P. 2021. FixationNet: forecasting eye fixations in task-oriented virtual environments. IEEE Transactions on Visualization and Computer Graphics, 27(5): 2681-2690[DOI: 10.1109/TVCG.2021.3067779]
Huang C C, Gong W, Fu W L and Feng D Y. 2014. Research of speech emotion recognition based on DBNs. Journal of Computer Research and Development, 51(S1): 75-80
黄晨晨, 巩微, 伏文龙, 冯东煜. 2014. 基于深度信念网络的语音情感识别的研究. 计算机研究与发展, 51(S1): 75-80
Huang D Y, Chan L W, Yang S, Wang F, Liang R H, Yang S N, Hung Y P and Chen B Y. 2016. DigitSpace: designing thumb-to-fingers touch interfaces for one-handed and eyes-free interactions//Proceedings of 2016 CHI Conference on Human Factors in Computing Systems. San Jose, USA: ACM: 1526-1537[DOI: 10.1145/2858036.2858483http://dx.doi.org/10.1145/2858036.2858483]
Huang H Y, Ning C W, Wang P Y, Cheng J H and Cheng L P. 2020a. Haptic-go-round: a surrounding platform for encounter-type haptics in virtual reality experiences//Proceedings of 2020 CHI Conference on Human Factors in Computing Systems. Honolulu, USA: ACM: 1-10[DOI: 10.1145/3313831.3376476http://dx.doi.org/10.1145/3313831.3376476]
Huang M K, Zhang J, Cai M, Zhang Y, Yao J L, You Y B, He Y and Ma Z J. 2020b. Improving RNN transducer with normalized jointer network[EB/OL]. [2022-01-20].https://arxiv.org/pdf/2011.01576.pdfhttps://arxiv.org/pdf/2011.01576.pdf
Huang W Y, Hu W C, Yeung Y T and Chen X. 2020c. Conv-transformer transducer: low latency, low frame rate, streamable end-to-end speech recognition//Proceedings of the 21st Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA: 5001-5005
Hurter C, Riche N H, Drucker S M, Cordeil M, Alligier R and Vuillemot R. 2019. FiberClay: sculpting three dimensional trajectories to reveal structural insights. IEEE Transactions on Visualization and Computer Graphics, 25(1): 704-714[DOI: 10.1109/TVCG.2018.2865191]
Igarashi T and Hughes J F. 2001. Voice as sound: using non-verbal voice input for interactive control//Proceedings of the 14th Annual ACM Symposium on User Interface Software and Technology. Orlando, USA: ACM: 155-156[DOI: 10.1145/502348.502372http://dx.doi.org/10.1145/502348.502372]
Inaguma H, Mimura M and Kawahara T. 2020a. Enhancing monotonic multihead attention for streaming ASR//Proceedings of the 21st Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA: 2137-2141
Inaguma H, Mimura M and Kawahara T. 2020b. CTC-synchronous training for monotonic attention model//Proceedings of the 21st Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA: 571-575
Ion A, Wang E J and Baudisch P. 2015. Skin drag displays: dragging a physical tactor across the user's skin produces a stronger tactile stimulus than vibrotactile//Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. Seoul, Korea(South): ACM: 2501-2504[DOI: 10.1145/2702123.2702459http://dx.doi.org/10.1145/2702123.2702459]
Iravantchi Y, Goel M and Harrison C. 2019. BeamBand: hand gesture sensing with ultrasonic beamforming//Proceedings of 2019 CHI Conference on Human Factors in Computing Systems. Glasgow, UK: ACM: #15[DOI: 10.1145/3290605.3300245http://dx.doi.org/10.1145/3290605.3300245]
Ishii H and Ullmer B. 1997. Tangible bits: towards seamless interfaces between people, bits and atoms//Proceedings of 1997 ACM SIGCHI Conference on HumanFactors in Computing Systems. Atlanta, USA: ACM: 234-241[DOI: 10.1145/258549.258715http://dx.doi.org/10.1145/258549.258715]
Jansen Y, Isenberg P, Dykes J, Carpendale S, Subramanian S and Keefe D F. 2014. Death of the Desktop Envisioning Visualization without Desktop Computing. Retrieved January, 16: 2017
Je S, Rooney B, Chan L W and Bianchi A. 2017. tactoRing: a skin-drag discrete display//Proceedings of 2017 CHI Conference on Human Factors in Computing Systems. Denver, USA: ACM: 3106-3114[DOI: 10.1145/3025453.3025703http://dx.doi.org/10.1145/3025453.3025703]
Jia Y, Zhang Y, Weiss R J, Wang Q, Shen J, Ren F, Chen Z F, Nguyen P, Pang R M, Moreno I L and Wu Y H. 2018. Transfer learning from speaker verification to multispeaker text-to-speech synthesis//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada: Curran Associates Inc. : 4485-4495
Jiang D W, Lei X N, Li W B, Luo N, Hu Y X, Zou W and Li X G. 2019a. Improving transformer-based speech recognition using unsupervised pre-training. [EB/OL]. [2022-01-26].https://arxiv.org/pdf/1910.09932.pdfhttps://arxiv.org/pdf/1910.09932.pdf
Jiang D W, Li W B, Cao M, Zou W and Li X G. 2021. Speech SimCLR: combining contrastive and reconstruction objective for self-supervised speech representation learning//Proceedings of the 22nd Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA: 1544-1548
Jiang H Y, Weng D D, Zhang Z L and Chen F. 2019b. HiFinger: one-handed text entry technique for virtual environments based on touches between fingers. Sensors, 19(14): #3063[DOI: 10.3390/s19143063]
Jin H J, Holz C and Hornbæk K. 2015. Tracko: ad-hoc mobile 3D tracking using Bluetooth low energy and inaudible signals for cross-device interaction//Proceedings of the 28th Annual ACM Symposium on User Interface Software and Technology. Charlotte, USA: ACM: 147-156[DOI: 10.1145/2807442.2807475http://dx.doi.org/10.1145/2807442.2807475]
Jin X C. 2007. A Study on Recognition of Emotions in Speech. Hefei: University of Science and Technology of China
金学成. 2007. 基于语音信号的情感识别研究. 合肥: 中国科学技术大学
Kitayama K, Goto M, Itou K and Kobayashi T. 2003. Speech starter: noise-robust endpoint detection by using filled pauses//Proceedings of the 8th European Conference on Speech Communication and Technology. Geneva, Switzerland: ISCA: 1237-1240
Kobayashi T and Fujie S. 2013. Conversational robots: an approach to conversation protocol issues that utilizes the paralinguistic information available in a robot-human setting. Acoustical Science and Technology, 34(2): 64-72[DOI: 10.1250/ast.34.64]
Kong H K, Zhu W J, Liu Z C and Karahalios K. 2019. Understanding visual cues in visualizations accompanied by audio narrations//Proceedings of 2019 CHI Conference on Human Factors in Computing Systems. Glasgow, UK: ACM: #50[DOI: 10.1145/3290605.3300280http://dx.doi.org/10.1145/3290605.3300280]
Kong Q Q, Cao Y, Iqbal T, Wang Y X, Wang W W and Plumbley M D. 2020. PANNs: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28: 2880-2894[DOI: 10.1109/TASLP.2020.3030497]
Kovacs R, Ofek E, Franco M G, Siu A F, Marwecki S, Holz C and Sinclair M. 2020. Haptic PIVOT: on-demand handhelds in VR//Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology. Minneapolis, USA: ACM: 1046-1059[DOI: 10.1145/3379337.3415854http://dx.doi.org/10.1145/3379337.3415854]
Kraus M, Weiler N, Oelke D, Kehrer J, Keim D A and Fuchs J. 2020. The impact of immersion on cluster identification tasks. IEEE Transactions on Visualization and Computer Graphics, 26(1): 525-535[DOI: 10.1109/TVCG.2019.2934395]
Krekhov A and Krüger J. 2019. Deadeye: a novel preattentive visualization technique based on dichoptic presentation. IEEE Transactions on Visualization and Computer Graphics, 25(1): 936-945[DOI: 10.1109/TVCG.2018.2864498]
Krekhov A, Cmentowski S, Waschk A and Kruger J. 2020. Deadeye visualization revisited: investigation of preattentiveness and applicability in virtual environments. IEEE Transactions on Visualization and Computer Graphics, 26(1): 547-557[DOI: 10.1109/TVCG.2019.2934370]
Kwok T C K, Kiefer P, Schinazi V R, Adams B and Raubal M. 2019. Gaze-guided narratives: adapting audio guide content to gaze in virtual and real environments//Proceedings of 2019 CHI Conference on Human Factors in Computing Systems. Glasgow, UK: ACM: #491[DOI: 10.1145/3290605.3300721http://dx.doi.org/10.1145/3290605.3300721]
Kwon O H, Muelder C, Lee K and Ma K L. 2016. A study of layout, rendering, and interaction methods for immersive graph visualization. IEEE Transactions on Visualization and Computer Graphics, 22(7): 1802-1815[DOI: 10.1109/TVCG.2016.2520921]
Langner R, Satkowski M, Büschel W and Dachselt R. 2021. MARVIS: combining mobile devices and augmented reality for visual data analysis//Proceedings of 2021 CHI Conference on Human Factors in Computing Systems. Yokohama, Japan: ACM: #468[DOI: 10.1145/3411764.3445593http://dx.doi.org/10.1145/3411764.3445593]
Laput G, Xiao R, Chen X, Hudson S E and Harrison C. 2014. Skin buttons: cheap, small, low-powered and clickable fixed-icon laser projectors//Proceedings of the 27th annual ACM symposium on User interface software and technology. Honolulu, USA: ACM: 389-394[DOI: 10.1145/2642918.2647356http://dx.doi.org/10.1145/2642918.2647356]
Lee J and Lee G. 2016. Designing a non-contact wearable tactile display using airflows//Proceedings of the 29th Annual Symposium on User Interface Software and Technology. Tokyo, Japan: ACM: 183-194[DOI: 10.1145/2984511.2984583http://dx.doi.org/10.1145/2984511.2984583]
Lee S P, Cheok A D, James T K S, Debra G P L, Jie C W, Chuang W and Farbiz F. 2006. A mobile pet wearable computer and mixed reality system for human-poultry interaction through the internet. Personal and Ubiquitous Computing, 10(5): 301-317[DOI: 10.1007/s00779-005-0051-6]
Li F, Wu Y, Xie Y D and Yang S. 2021a. A method for detecting respiratory symptoms based on smartphone audio perception in driving environment. CN, CN112309423A
李凡, 吴玥, 解亚东, 杨松. 2021a. 驾驶环境下基于智能手机音频感知的呼吸道症状检测方法. 中国, CN112309423A
Li F, Wu Y, Xie Y D and Yang S. 2021b. A method of detecting car driving speed based on smartphone audio perception. CN, CN112230208A
李凡, 吴玥, 解亚东, 杨松. 2021b. 一种基于智能手机音频感知的汽车行驶速度检测方法. 中国, CN112230208A
Li H Y and Fan L W. 2020.Mapping various large virtual spaces to small real spaces: a novel redirected walking method for immersive VR navigation. IEEE Access, 8: 180210-180221[DOI: 10.1109/ACCESS.2020.3027985]
Li M, Yang B, Levy J, Stolcke A, Rozgic V, Matsoukas S, Papayiannis C, Bone D and Wang C. 2021. Contrastive unsupervised learning for speech emotion recognition//Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 6329-6333[DOI: 10.1109/ICASSP39728.2021.9413910http://dx.doi.org/10.1109/ICASSP39728.2021.9413910]
Li N L, Kim H J, Shen L Y, Tian F, Han T and Yang X D. 2020. HapLinkage: prototyping haptic proxies for virtual hand tools using linkage mechanism//Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology. New York, USA: ACM: 1261-1274[DOI: 10.1145/3379337.3415812http://dx.doi.org/10.1145/3379337.3415812]
Li P C, Song Y, McLoughlin I, Guo W and Dai L R. 2018. An attention pooling based representation learning method for speech emotion recognition//Proceedings of the 19th Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA: 3087-3091
Lien J, Gillian N, Karagozler M E, Amihood P, Schwesig C, Olson E, Raja H and Poupyrev I. 2016. Soli: ubiquitous gesture sensing with millimeter wave radar. ACM Transactions on Graphics, 35(4): #142[DOI: 10.1145/2897824.2925953]
Liu A T, Li S W and Lee H Y. 2021. TERA: self-supervised learning of transformer encoder representation for speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 2351-2366[DOI: 10.1109/TASLP.2021.3095662]
Liu A T, Yang S W, Chi P H, Hsu P C and Lee H Y. 2020. Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders//Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE: 6419-6423[DOI: 10.1109/ICASSP40776.2020.9054458http://dx.doi.org/10.1109/ICASSP40776.2020.9054458]
Lu H, Pan W, Lane N D, Choudhury T and Campbell A T. 2009. SoundSense: scalable sound sensing for people-centric applications on mobile phones//Proceedings of the 7th International Conference on Mobile Systems, Applications, and Services. Kraków, Poland: ACM: 165-178[DOI: 10.1145/1555816.1555834http://dx.doi.org/10.1145/1555816.1555834]
Ma J, Wang C L, Shene C K and Jiang J F. 2014. A graph-based interface for visual analytics of 3D streamlines and pathlines. IEEE Transactions on Visualization and Computer Graphics, 20(8): 1127-1140[DOI: 10.1109/TVCG.2013.236]
Madotto A, Wu C S and Fung P. 2018. Mem2Seq: effectively incorporating knowledge bases into end-to-end task-oriented dialog systems//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne, Australia: ACL: 1468-1478[DOI: 10.18653/v1/P18-1136http://dx.doi.org/10.18653/v1/P18-1136]
Maekawa K. 2004. Production and perception of "paralinguistic" information//Proceedings of the Speech Prosody 2004 Nara. 367-374
Mao W G, He J and Qiu L L. 2016. CAT: high-precision acoustic motion tracking//Proceedings of 22nd Annual International Conference on Mobile Computing and Networking. New York, USA: ACM: 69-81[DOI: 10.1145/2973750.2973755http://dx.doi.org/10.1145/2973750.2973755]
Mao W G, Wang M, Sun W, Qiu L L, Pradhan S and Chen Y C. 2019. RNN-based room scale hand motion tracking//Proceedings of the 25th Annual International Conference on Mobile Computing and Networking. Los Cabos, Mexico: ACM: #38[DOI: 10.1145/3300061.3345439http://dx.doi.org/10.1145/3300061.3345439]
Massie T H and Salisbury J K. 1994. The PHANTOM haptic interface: a device for probing virtual objects//ASME Winter Annual Meeting, Symposium on Haptic Interfaces for Virtual Environment and Teleoperator Systems. Chicago, USA: DSC
McNeely W A. 1993. Robotic graphics: a new approach to force feedback for virtual reality//IEEE Virtual Reality Annual International Symposium. Seattle, USA: IEEE: 336-341[DOI: 10.1109/VRAIS.1993.380761http://dx.doi.org/10.1109/VRAIS.1993.380761]
Munzner T. 2014. Visualization Analysis and Design. CRC Press
Nandakumar R, Iyer V, Tan D and Gollakota S. 2016. FingerIO: using active sonar for fine-grained finger tracking//Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. San Jose, USA: ACM: 1515-1525[DOI: 10.1145/2858036.2858580http://dx.doi.org/10.1145/2858036.2858580]
Olberding S, Wessely M and Steimle J. 2014. PrintScreen: fabricating highly customizable thin-film touch-displays//Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology. Honolulu, USA: ACM: 281-290[DOI: 10.1145/2642918.2647413http://dx.doi.org/10.1145/2642918.2647413]
Ono M, Shizuki B and Tanaka J. 2013. Touch and activate: adding interactivity to existing objects using active acoustic sensing//The 26th Annual ACM Symposium on User Interface Software and Technology. St. Andrews, Scotland: ACM: 31-40[DOI: 10.1145/2501988.2501989http://dx.doi.org/10.1145/2501988.2501989]
Pan Z G, Gao J L, Wang R N, Yuan Q S, Fan R and She L. 2021. Digital twin registration technique of spatial augmented reality for tangible interaction. Journal of Computer-Aided Design and Computer Graphics, 33(5): 655-661
潘志庚, 高嘉利, 王若楠, 袁庆曙, 范然, 佘莉. 2021. 面向实物交互的空间增强现实数字孪生法配准技术. 计算机辅助设计与图形学学报, 33(5): 655-661[DOI: 10.3724/SP.J.1089.2021.18556]
Park J H, Nadeem S, Boorboor S, Marino J and Kaufman A. 2021. CMed: crowd analytics for medical imaging data. IEEE Transactions on Visualization and Computer Graphics, 27(6): 2869-2880[DOI: 10.1109/TVCG.2019.2953026]
Patnaik B, Batch A and Elmqvist N. 2019. Information olfactation: harnessing scent to convey data. IEEE Transactions on Visualization and Computer Graphics, 25(1): 726-736[DOI: 10.1109/TVCG.2018.2865237]
Pavlovic V I, Berry G A and Huang T S. 1997. Integration of audio/visual information for use in human-computer intelligent interaction//Proceedings of International Conference on Image Processing. Santa Barbara, USA: IEEE: 121-124[DOI: 10.1109/ICIP.1997.647399http://dx.doi.org/10.1109/ICIP.1997.647399]
Peng C Y, Shen G B, Zhang Y G, Li Y L and Tan K. 2007. BeepBeep: a high accuracy acoustic ranging system using COTS mobile devices//Proceedings of the 5th International Conference on Embedded Networked Sensor Systems. Sydney, Australia: ACM: 1-14[DOI: 10.1145/1322263.1322265http://dx.doi.org/10.1145/1322263.1322265]
Pepino L, Riera P and Ferrer L. 2021. Emotion recognition from speech using wav2vec 2.0 embeddings//Proceedings of the 22nd Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA: 3400-3404
Prouzeau A, Cordeil M, Robin C, Ens B, Thomas B H and Dwyer T. 2019. Scaptics and highlight-planes: immersive interaction techniquesfor finding occluded features in 3D scatterplots//Proceedings of 2019 CHI Conference on Human Factors in Computing Systems. Glasgow, UK: ACM: #325[DOI: 10.1145/3290605.3300555http://dx.doi.org/10.1145/3290605.3300555]
Qin L B, Liu Y J, Che W X, Wen H Y, Li Y M and Liu T. 2019. Entity-consistent end-to-end task-oriented dialogue system with KB retriever//Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: ACL: 133-142[DOI: 10.18653/v1/D19-1013http://dx.doi.org/10.18653/v1/D19-1013]
Qin L B, Xu X, Che W X, Zhang Y and Liu T. 2020. Dynamic fusion network for multi-domain end-to-end task-oriented dialog//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. [s. l.]: ACL: 6344-6354
Qin Y, Yu C, Li Z H, Zhong M Y, Yan Y K and Shi Y C. 2021. ProxiMic: convenient voice activation via close-to-mic speech detected by a single microphone//Proceedings of 2021 CHI Conference on Human Factors in Computing Systems. Yokohama, Japan: ACM: #8[DOI: 10.1145/3411764.3445687http://dx.doi.org/10.1145/3411764.3445687]
Renner R S, Velichkovsky B M and Helmert J R. 2013. The perception of egocentric distances in virtual environments——A review. ACM Computing Surveys, 46(2): #23[DOI: 10.1145/2543581.2543590]
Röddiger T, Clarke C, Wolffram D, Budde M and Beigl M. 2021. EarRumble: discreet hands-and eyes-free input by voluntary tensor tympani muscle contraction//Proceedings of 2021 CHI Conference on Human Factors in Computing Systems. Yokohama, Japan: ACM: #743[DOI: 10.1145/3411764.3445205http://dx.doi.org/10.1145/3411764.3445205]
Rossi M, Feese S, Amft O, Braune N, Martis S and Tröster G. 2013. AmbientSense: a real-time ambient sound recognition system for smartphones//Proceedings of 2013 IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOM Workshops). San Diego, USA: IEEE: 230-235[DOI: 10.1109/PerComW.2013.6529487http://dx.doi.org/10.1109/PerComW.2013.6529487]
Ruan W J, Sheng Q Z, Yang L, Gu T, Xu P P and Shangguan L F. 2016. AudioGest: enabling fine-grained hand gesture detection by decoding echo signal//Proceedings of 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. Heidelberg, Germany: ACM: 474-485[DOI: 10.1145/2971648.2971736http://dx.doi.org/10.1145/2971648.2971736]
Saakes D, Yeo H S, Noh S T, Han G and Woo W. 2016. Mirror mirror: an on-body t-shirt design system//Proceedings of 2016 CHI Conference on Human Factors in Computing Systems. San Jose, USA: ACM: 6058-6063[DOI: 10.1145/2858036.2858282http://dx.doi.org/10.1145/2858036.2858282]
Sadhu S, He D, Huang C W, Mallidi S H, Wu M H, Rastrow A, Stolcke A, Droppo J and Maas R. 2021. Wav2vec-C: a self-supervised model for speech representation learning//Proceedings of the 22nd Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA: 711-715
Sainath T N, Pang R M, Rybach D, He Y Z, Prabhavalkar R, Li W, Visontai M, Liang Q, Strohman T, Wu Y H, McGraw I and Chiu C C. 2019. Two-pass end-to-end speech recognition[EB/OL]. [2022-01-26].https://arxiv.org/pdf/1908.10992.pdfhttps://arxiv.org/pdf/1908.10992.pdf
Saponas T S, Tan D S, Morris D, Balakrishnan R, Turner J and Landay J A. 2009. Enabling always-available input with muscle-computer interfaces//Proceedings of the 22nd Annual ACM Symposium on User Interface Software and Technology. Victoria, Canada: ACM: 167-176[DOI: 10.1145/1622176.1622208http://dx.doi.org/10.1145/1622176.1622208]
Satt A, Rozenberg S and Hoory R. 2017. Efficient emotion recognition from speech using deep learning on spectrograms//Proceedings of the 18th Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA: 1089-1093
Schneider S, Baevski A, Collobert R and Auli M. 2019. Wav2vec: unsupervised pre-training for speech recognition//Proceedings of the 20th Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA: 3465-3469
Schrapel M, Stadler M L and Rohs M. 2018. Pentelligence: combining pen tip motion and writing sounds for handwritten digit recognition//Proceedings of 2018 CHI Conference on Human Factors in Computing Systems. Montreal, Canada: ACM: #131[DOI: 10.1145/3173574.3173705http://dx.doi.org/10.1145/3173574.3173705]
Shen J, Pang R M, Weiss R J, Schuster M, Jaitly N, Yang Z H, Chen Z F, Zhang Y, Wang Y X, Skerrv-Ryan R, Saurous R A, Agiomvrgiannakis Y and Wu Y H. 2018. Natural TTS synthesis by conditioning wavenet on MEL spectrogram predictions//Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary, Canada: IEEE: 373-376[DOI: 10.1109/ICASSP.2018.8461368http://dx.doi.org/10.1109/ICASSP.2018.8461368]
Shigeyama J, Hashimoto T, Yoshida S, Narumi T, Tanikawa T and Hirose M. 2019. Transcalibur: a weight shifting virtual reality controller for 2D shape rendering based on computational perception model//Proceedings of 2019 CHI Conference on Human Factors in Computing Systems. Glasgow, UK: ACM: #11[DOI: 10.1145/3290605.3300241http://dx.doi.org/10.1145/3290605.3300241]
Sidenmark L, Clarke C, Zhang X S, Phu J and Gellersen H. 2020. Outline pursuits: gaze-assisted selection of occluded objects in virtual reality//Proceedings of 2020 CHI Conference on Human Factors in Computing Systems. Honolulu, USA: ACM: 1-13[DOI: 10.1145/3313831.3376438http://dx.doi.org/10.1145/3313831.3376438]
Siu A F, Sinclair M, Kovacs R, Ofek E, Holz C and Cutrell E. 2020. Virtual reality without vision: a haptic and auditory white cane to navigate complex virtual worlds//Proceedings of 2020 CHI Conference on Human Factors in Computing Systems. Honolulu, USA: ACM Press: 1-13[DOI: 10.1145/3313831.3376353http://dx.doi.org/10.1145/3313831.3376353]
Sotelo J, Mehri S, Kumar K, Santos J F, Kastner K, Courville A C and Bengio Y. 2017. Char2 Wav: end-to-end speech synthesis//Proceedings of the 5th International Conference on Learning Representations. Toulon, France: OpenReview. net
Ssin S Y, Walsh J A, Smith R T, Cunningham A and Thomas B H. 2019. GeoGate: correlating geo-temporal datasets using an augmented reality space-time cube and tangible interactions//Proceedings of 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). Osaka, Japan: IEEE: 210-219[DOI: 10.1109/VR.2019.8797812http://dx.doi.org/10.1109/VR.2019.8797812]
Sun L C, Liu B, Tao J H and Lian Z. 2021. Multi-modal cross-and self-attention network for speech emotion recognition//Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 4275-4279[DOI: 10.1109/ICASSP39728.2021.9414654http://dx.doi.org/10.1109/ICASSP39728.2021.9414654]
Suzuki R, Hedayati H, Zheng C, Bohn J L, Szafir D, Do E Y L, Gross M D and Leithinger D. 2020. RoomShift: room-scale dynamic haptics for VR with furniture-moving swarm robots//Proceedings of 2020 CHI Conference on Human Factors in Computing Systems. Honolulu, USA: ACM: 1-11[DOI: 10.1145/3313831.3376523http://dx.doi.org/10.1145/3313831.3376523]
Thomaz E, Zhang C, Essa I and Abowd G D. 2015. Inferring meal eating activities in real world settings from ambient sounds: a feasibility study. IUI, 2015: 427-431[DOI: 10.1145/2678025.2701405]
Tian Y, Yao H T, Cai M, Liu Y M and Ma Z J. 2021a. Improving RNN transducer modeling for small-footprint keyword spotting//Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 5624-5628[DOI: 10.1109/ICASSP39728.2021.9414339http://dx.doi.org/10.1109/ICASSP39728.2021.9414339]
Tian Z K, Yi J Y, Bai Y, Tao J H, Zhang S and Wen Z Q. 2020. One in a hundred: Select the best predicted sequence from numerous candidates for streaming speech recognition[EB/OL]. [2022-01-26].https://arxiv.org/pdf/2010.14791.pdfhttps://arxiv.org/pdf/2010.14791.pdf
Tian Z K, Yi J Y, Bai Y, Tao J H, Zhang S and Wen Z Q. 2021b. FSR: accelerating the inference process of transducer-based models by applying fast-skip regularization//Proceedings of the 22nd Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA: 4034-4038
Tian Z K, Yi J Y, Tao J H, Bai Y and Wen Z Q. 2019. Self-attention transducers for end-to-end speech recognition//Proceedings of the 20th Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA: 4395-4399
Tzirakis P, Zhang J H and Schuller B W. 2018. End-to-end speech emotion recognition using deep neural networks//Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary, Canada: IEEE: 5089-5093[DOI: 10.1109/ICASSP.2018.8462677http://dx.doi.org/10.1109/ICASSP.2018.8462677]
Usher W, Klacansky P, Federer F, Bremer P T, Knoll A, Yarch J, Angelucci A and Pascucci V. 2018. A virtual reality visualization tool for neuron tracing. IEEE Transactions on Visualization and Computer Graphics, 24(1): 994-1003[DOI: 10.1109/TVCG.2017.2744079]
Valin J M and Skoglund J. 2019. LPCNET: improving neural speech synthesis through linear prediction//Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE: 4384-4289[DOI: 10.1109/ICASSP.2019.8682804http://dx.doi.org/10.1109/ICASSP.2019.8682804]
van den Oord A, Dieleman S, Zen H G, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A W and Kavukcuoglu K. 2016. WaveNet: a generative model for raw audio//Proceedings of the 9th ISCA Speech Synthesis Workshop. Sunnyvale, USA: ISCA: #125
Wagner J, Stuerzlinger W and Nedel L. 2021. Comparing and combining virtual hand and virtual ray pointer interactions for data manipulation in immersive analytics. IEEE Transactions on Visualization and Computer Graphics, 27(5): 2513-2523[DOI: 10.1109/TVCG.2021.3067759]
Wang A R and Gollakota S. 2019. MilliSonic: pushing the limits of acoustic motion tracking//Proceedings of 2019 CHI Conference on Human Factors in Computing Systems. Glasgow, UK: ACM: 1-11[DOI: 10.1145/3290605.3300248http://dx.doi.org/10.1145/3290605.3300248]
Wang C Y, Hsiu M C, Chiu P T, Chang C H, Chan L W, Chen B Y and Chen M Y. 2015. PalmGesture: using palms as gesture interfaces for eyes-free input//Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services. Copenhagen, Denmark: ACM: 217-226[DOI: 10.1145/2785830.2785885http://dx.doi.org/10.1145/2785830.2785885]
Wang C Y, Liu J, Chen Y Y, Liu H B, Xie L, Wang W, He B B and Lu S L. 2018. Multi-touch in the air: device-free finger tracking and gesture recognition via COTS RFID//Proceedings of the IEEE INFOCOM 2018-IEEE Conference on Computer Communications. Honolulu, USA: IEEE: 1691-1699[DOI: 10.1109/INFOCOM.2018.8486346http://dx.doi.org/10.1109/INFOCOM.2018.8486346]
Wang H, Zhang D Q, Wang Y S, Ma J Y, Wang Y X and Li S J. 2017a. RT-Fall: a real-time and contactless fall detection system with commodity WiFi devices. IEEE Transactions on Mobile Computing, 16(2): 511-526[DOI: 10.1109/TMC.2016.2557795]
Wang J D, Chen Y Q, Hao S J, Peng X H and Hu L S. 2019. Deep learning for sensor-based activity recognition: a survey[EB/OL]. [2022-01-26]. https://arxiv.org/pdf/1707.03502.pdf
Wang S, Zhu D, Yu H and Wu Y D. 2020a. Immersive WYSIWYG (what you see is what you get) volume visualization//Proceedings of 2020 IEEE Pacific Visualization Symposium (PacificVis). Tianjin, China: IEEE: 166-170[DOI: 10.1109/PacificVis48177.2020.1001]
Wang T, Tao J H, Fu R B, Yi J Y, Wen Z Q and Zhong R X. 2020b. Spoken content and voice factorization for few-shot speaker adaptation//Proceedings of the 21st Annual Conference of the International Speech Communication Association (virtual event). Shanghai, China: ISCA: 796-800
Wang W, Liu A X and Sun K. 2016. Device-free gesture tracking using acoustic signals: demo//Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking. New York, USA: ACM: 497-498[DOI: 10.1145/2973750.2987385]
Wang Y F, Peng T Q, Lu H H, Wang H R, Xie X, Qu H M and Wu Y C. 2022a. Seek for success: a visualization approach for understanding the dynamics of academic careers. IEEE Transactions on Visualization and Computer Graphics, 28(1): 475-485[DOI: 10.1109/TVCG.2021.3114790]
Wang Y T, Ding J X, Chatterjee I, Parizi F S, Zhuang Y Z, Yan Y K, Patel S and Shi Y C. 2022b. FaceOri: tracking head position and orientation using ultrasonic ranging on earphones//Proceedings of 2022 CHI Conference on Human Factors in Computing Systems (CHI'22). New York, USA: ACM: 1-12
Wang Y W, Lin Y H, Ku P S, Miyatake Y, Mao Y H, Chen P Y, Tseng C M and Chen M Y. 2021. JetController: high-speed ungrounded 3-DoF force feedback controllers using air propulsion jets//Proceedings of 2021 CHI Conference on Human Factors in Computing Systems. Yokohama, Japan: ACM: #124[DOI: 10.1145/3411764.3445549]
Wang Y X, Skerry-Ryan R J, Stanton D, Wu Y H, Weiss R J, Jaitly N, Yang Z H, Xiao Y, Chen Z F, Bengio S, Le Q V, Agiomyrgiannakis Y, Clark R and Saurous R A. 2017b. Tacotron: towards end-to-end speech synthesis//Proceedings of the 18th Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA: 4006-4010
Ward J A, Lukowicz P and Tröster G. 2005. Gesture spotting using wrist worn microphone and 3-axis accelerometer//Proceedings of 2005 Joint Conference on Smart Objects and Ambient Intelligence: Innovative Context-Aware Services: Usages and Technologies. Grenoble, France: ACM: 99-104[DOI: 10.1145/1107548.1107578]
Wei W Z and He Q B. 2018. Research on ultrasound-based gesture recognition device. Machinery and Electronics, 36(5): 54-57, 61[DOI: 10.3969/j.issn.1001-2257.2018.05.012]
Weigel M, Mehta V and Steimle J. 2014. More than touch: understanding how people use skin as an input surface for mobile computing//Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Toronto, Canada: ACM: 179-188[DOI: 10.1145/2556288.2557239]
Weigel M and Steimle J. 2017. DeformWear: deformation input on tiny wearable devices. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(2): #28[DOI: 10.1145/3090093]
Withana A, Groeger D and Steimle J. 2018. Tacttoo: a thin and feel-through tattoo for on-skin tactile output//Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology. Berlin, Germany: ACM: 365-378[DOI: 10.1145/3242587.3242645]
Wu C S, Socher R and Xiong C M. 2019. Global-to-local memory pointer networks for task-oriented dialogue//Proceedings of the 7th International Conference on Learning Representations. New Orleans, USA: OpenReview.net
Xi H W and Kelley A. 2015. Sonification of time-series data sets. Bulletin of the American Physical Society, 60(3)
Xiao R, Cao T, Guo N, Zhuo J, Zhang Y and Harrison C. 2018. LumiWatch: on-arm projected graphics and touch input//Proceedings of 2018 CHI Conference on Human Factors in Computing Systems. Montreal, Canada: ACM: #95[DOI: 10.1145/3173574.3173669]
Xie L, Sheng B, Tan C C, Han H, Li Q and Chen D X. 2010. Efficient tag identification in mobile RFID systems//Proceedings of 2010 IEEE INFOCOM. San Diego, USA: IEEE: 1-9[DOI: 10.1109/INFCOM.2010.5461949]
Xue Y Q, Weng D D, Jiang H Y and Gao Q. 2019. MMRPet: modular mixed reality pet system based on passive props//Proceedings of the 14th Chinese Conference on Image and Graphics Technologies. Beijing, China: Springer: 645-658[DOI: 10.1007/978-981-13-9917-6_61]
Yamamoto R, Song E and Kim J M. 2020. Parallel WaveGAN: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram//Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE: 6199-6203[DOI: 10.1109/ICASSP40776.2020.9053795]
Yan Y K, Yu C, Shi Y T and Xie M X. 2019. PrivateTalk: activating voice input with hand-on-mouth gesture detected by Bluetooth earphones//Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology. New Orleans, USA: ACM: 1013-1020[DOI: 10.1145/3332165.3347950]
Yang D Q, Zhang D Q, Zheng V W and Yu Z Y. 2015. Modeling user activity preference by leveraging user spatial temporal characteristics in LBSNs. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(1): 129-142[DOI: 10.1109/TSMC.2014.2327053]
Yang Y L, Cordeil M, Beyer J, Dwyer T, Marriott K and Pfister H. 2021a. Embodied navigation in immersive abstract data visualization: is overview+detail or zooming better for 3D scatterplots? IEEE Transactions on Visualization and Computer Graphics, 27(2): 1214-1224[DOI: 10.1109/TVCG.2020.3030427]
Yang Y L, Dwyer T, Jenny B, Marriott K, Cordeil M and Chen H H. 2019. Origin-destination flow maps in immersive environments. IEEE Transactions on Visualization and Computer Graphics, 25(1): 693-703[DOI: 10.1109/TVCG.2018.2865192]
Yang Y L, Dwyer T, Marriott K, Jenny B and Goodwin S. 2021b. Tilt map: interactive transitions between choropleth map, prism map and bar chart in immersive environments. IEEE Transactions on Visualization and Computer Graphics, 27(12): 4507-4519[DOI: 10.1109/TVCG.2020.3004137]
Yao L N, Ou J F, Cheng C Y, Steiner H, Wang W, Wang G Y and Ishii H. 2015. bioLogic: natto cells as nanoactuators for shape changing interfaces//Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. Seoul, Korea(South): ACM: 1-10[DOI: 10.1145/2702123.2702611]
Yao Z Y, Wu D, Wang X, Zhang B B, Yu F, Yang C, Peng Z D, Chen C Y, Xie L and Lei X. 2021. WeNet: production oriented streaming and non-streaming end-to-end speech recognition toolkit[EB/OL]. [2022-01-26]. https://arxiv.org/pdf/2102.01547.pdf
Ye S N, Chen Z T, Chu X T, Wang Y F, Fu S W, Shen L J, Zhou K and Wu Y C. 2021. ShuttleSpace: exploring and analyzing movement trajectory in immersive visualization. IEEE Transactions on Visualization and Computer Graphics, 27(2): 860-869[DOI: 10.1109/TVCG.2020.3030392]
Ye S N, Chu X T and Wu Y C. 2021. A survey on immersive visualization. Journal of Computer-Aided Design and Computer Graphics, 33(4): 497-507[DOI: 10.3724/SP.J.1089.2021.18809]
Yeh C F, Mahadeokar J, Kalgaonkar K, Wang Y Q, Le D, Jain M, Schubert K, Fuegen C and Seltzer M L. 2019. Transformer-transducer: end-to-end speech recognition with self-attention[EB/OL]. [2022-01-26]. https://arxiv.org/pdf/1910.12977.pdf
Yi X, Yu C, Xu W J, Bi X J and Shi Y C. 2017. COMPASS: rotational keyboard on non-touch smartwatches//Proceedings of 2017 CHI Conference on Human Factors in Computing Systems. Denver, USA: ACM: 705-715[DOI: 10.1145/3025453.3025454]
Yoon S, Byun S, Dey S and Jung K. 2019. Speech emotion recognition using multi-hop attention mechanism//Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE: 2822-2826[DOI: 10.1109/ICASSP.2019.8683483]
Yoon S, Dey S, Lee H and Jung K. 2020. Attentive modality hopping mechanism for speech emotion recognition//Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE: 3362-3366[DOI: 10.1109/ICASSP40776.2020.9054229]
Yu J J, Jin Y, Ma Y, Jiang F Z and Dai Y Y. 2021. Emotion recognition from raw speech based on Sinc-Transformer model. Journal of Signal Processing, 37(10): 1880-1888[DOI: 10.16798/j.issn.1003-0530.2021.10.011]
Yu T, Jin H M and Nahrstedt K. 2016. WritingHacker: audio based eavesdropping of handwriting via mobile devices//Proceedings of 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. Heidelberg, Germany: ACM: 463-473[DOI: 10.1145/2971648.2971681]
Yun S, Chen Y C, Mao W G and Qiu L L. 2015. Demo: turning a mobile device into a mouse in the air//Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services. Florence, Italy: ACM: #469
Zen H G, Tokuda K and Black A W. 2009. Statistical parametric speech synthesis. Speech Communication, 51(11): 1039-1064[DOI: 10.1016/j.specom.2009.04.004]
Zhang B B, Wu D, Yao Z Y, Wang X, Yu F, Yang C, Guo L Y, Hu Y G, Xie L and Lei X. 2020b. Unified streaming and non-streaming two-pass end-to-end model for speech recognition[EB/OL]. [2022-01-26]. https://arxiv.org/pdf/2012.05481.pdf
Zhang C, Bedri A K, Reyes G, Bercik B, Inan O T, Starner T E and Abowd G D. 2016. TapSkin: recognizing on-skin input for smartwatches//Proceedings of 2016 ACM International Conference on Interactive Surfaces and Spaces. Niagara Falls, Canada: ACM: 13-22[DOI: 10.1145/2992154.2992187]
Zhang C, Waghmare A, Kundra P, Pu Y M, Gilliland S, Ploetz T, Starner T E, Inan O T and Abowd G D. 2017a. FingerSound: recognizing unistroke thumb gestures using a ring. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(3): #120[DOI: 10.1145/3130985]
Zhang C, Xue Q Y, Waghmare A, Jain S, Pu Y M, Hersek S, Lyons K, Cunefare K A, Inan O T and Abowd G D. 2017b. SoundTrak: continuous 3D tracking of a finger using active acoustics. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(2): #30[DOI: 10.1145/3090095]
Zhang C, Xue Q Y, Waghmare A, Meng R C, Jain S, Han Y Z, Li X Y, Cunefare K, Ploetz T, Starner T and Inan O. 2018. FingerPing: recognizing fine-grained hand poses using active acoustic on-body sensing//Proceedings of 2018 CHI Conference on Human Factors in Computing Systems. Montreal, Canada: ACM: #437[DOI: 10.1145/3173574.3174011]
Zhang Q, Lu H, Sak H, Tripathi A, McDermott E, Koo S and Kumar S. 2020a. Transformer transducer: a streamable speech recognition model with transformer encoders and RNN-T loss//Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE: 7829-7833[DOI: 10.1109/ICASSP40776.2020.9053896]
Zhang R X, Wu H W, Li W B, Jiang D W, Zou W and Li X G. 2021. Transformer based unsupervised pre-training for acoustic representation learning//Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 6933-6937[DOI: 10.1109/ICASSP39728.2021.9414996]
Zhang X T, Fang G X, Dai C K, Verlinden J, Wu J, Whiting E and Wang C C L. 2017c. Thermal-comfort design of personalized casts//Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology. Québec City, Canada: ACM: 243-254[DOI: 10.1145/3126594.3126600]
Zhang Y, Wang D X, Wang Z Q, Zhang Y R and Xiao J. 2019. Passive force-feedback gloves with joint-based variable impedance using layer jamming. IEEE Transactions on Haptics, 12(3): 269-280[DOI: 10.1109/TOH.2019.2908636]
Zhang Y, Zhou J H, Laput G and Harrison C. 2016. SkinTrack: using the body as an electrical waveguide for continuous finger tracking on the skin//Proceedings of 2016 CHI Conference on Human Factors in Computing Systems. San Jose, USA: ACM: 1491-1503[DOI: 10.1145/2858036.2858082]
Zhang Y K, Sun S N and Ma L. 2021. Tiny transducer: a highly-efficient speech recognition model on edge devices//Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 6024-6028[DOI: 10.1109/ICASSP39728.2021.9413854http://dx.doi.org/10.1109/ICASSP39728.2021.9413854]
Zhao J H, Lin Y X and Yuan Z Y. 2021. Designing and simulation of electromagnetic force feedback model focusing on virtual interventional surgery. Journal of Computer-Aided Design and Computer Graphics, 33(8): 1254-1263[DOI: 10.3724/SP.J.1089.2021.18703]
Zhao J M, Li R C, Chen S Z and Jin Q. 2018. Multi-modal multi-cultural dimensional continues emotion recognition in dyadic interactions//Proceedings of 2018 on Audio/Visual Emotion Challenge and Workshop. Seoul, Korea(South): ACM: 65-72[DOI: 10.1145/3266302.3266313]
Zhao L, Jiang C H, Zou C R and Wu Z Y. 2004. A study on emotional feature analysis and recognition in speech. Acta Electronica Sinica, 32(4): 606-609[DOI: 10.3321/j.issn:0372-2112.2004.04.018]
Zhao L, Liu Y and Song W T. 2021. Tactile perceptual thresholds of electrovibration in VR. IEEE Transactions on Visualization and Computer Graphics, 27(5): 2618-2626[DOI: 10.1109/TVCG.2021.3067778]
Zhao Y W, Kim L H, Wang Y, Le Goc M and Follmer S. 2017. Robotic assembly of haptic proxy objects for tangible interaction and virtual reality//Proceedings of 2017 ACM International Conference on Interactive Surfaces and Spaces. Brighton, UK: ACM: 82-91[DOI: 10.1145/3132272.3134143]
Zhou F, Duh H B L and Billinghurst M. 2008. Trends in augmented reality tracking, interaction and display: a review of ten years of ISMAR//The 7th IEEE/ACM International Symposium on Mixed and Augmented Reality. Cambridge, UK: IEEE: 193-202[DOI: 10.1109/ISMAR.2008.4637362]
Zhou J H, Zhang Y, Laput G and Harrison C. 2016. AuraSense: enabling expressive around-smartwatch interactions with electric field sensing//The 29th Annual Symposium on User Interface Software and Technology. Tokyo, Japan: ACM: 81-86[DOI: 10.1145/2984511.2984568]
Zhuang Y Z, Wang Y T, Yan Y K, Xu X H and Shi Y C. 2021. ReflecTrack: enabling 3D acoustic position tracking using commodity dual-microphone smartphones//The 34th Annual ACM Symposium on User Interface Software and Technology. New York, USA: ACM: 1050-1062[DOI: 10.1145/3472749.3474805]
Zhuge M C, Gao D H, Fan D P, Jin L B, Chen B, Zhou H M, Qiu M H and Shao L. 2021. Kaleido-BERT: vision-language pre-training on fashion domain//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 12647-12657[DOI: 10.1109/CVPR46437.2021.01246]