An Adaptive Speech Interface for Assistance in Maintenance and Changeover Procedures

Abstract. Machine operators remain important in future production environments and need intuitive and powerful interaction techniques. Many assistance and support applications for machine operators use speech-based interfaces, since these are suitable during manual tasks and when visual attention is occupied elsewhere. Due to developments like demographic change and the growing need for skilled personnel, the skills and capabilities of workers will become increasingly diverse. Speech-based interfaces therefore need to be adaptable to the capabilities, limitations, and preferences of individual operators. This paper addresses this requirement and proposes an adaptive speech interface that supports machine operators during maintenance and changeover procedures. All aspects of the proposed application can be adapted to the requirements of the user. The system uses a process model, instruction templates, a user model, and a model of the input vocabulary to describe the components of the application. This allows a flexible adaptation of the speech interface and the provided instructions to the requirements of individual users and to further use cases.


Introduction
Machine operators remain an important factor in increasingly computerized production environments and require intuitive and powerful interaction techniques to perform their tasks successfully [1]. Operators must carry out several secondary tasks simultaneously, such as refilling supplies, checking quality parameters, or rectifying errors. Moreover, machine operators are responsible for maintenance and changeover procedures. During these tasks, spatial flexibility is required to inspect the machine and remain aware of the environment. The operator often cannot check the stationary human-machine interface or refer to assistance like paper-based manuals. In this regard, speech-based interfaces provide an efficient and location-independent interaction approach that can be used during manual tasks with little visual distraction. Such interfaces are suitable for machine operation in future demanding industrial environments.
Demographic changes yield a diversification of the workforce regarding perceptive and cognitive capabilities. Hearing loss caused by aging, in particular, has serious effects on individual performance, for instance regarding memorization skills [2]. Compensating such limitations could make an assistance system more accessible to elderly people and improve their performance. An adaptation to different skill levels is desirable as well. While a novice may require detailed instructions, an experienced worker may be distracted by unnecessary information. Existing speech interfaces do not provide such adaptive functions and only offer voice input and output.
This paper proposes an adaptive speech interface for assistance in sequential maintenance and changeover procedures. The vocabulary that is used to interact with the application can be adapted. The instructions can also be adapted to the user using a template mechanism. Moreover, the speech synthesis is adapted to the perceptive abilities of the current user. All adaptations are based on human-readable models that can be edited easily. The paper is structured as follows. Section 2 summarizes the state of the art of speech-based applications and derives the addressed research questions. Section 3 provides the basic approach by discussing the context of use and the necessary capabilities for adaptation. Section 4 introduces the models that are the foundation for the adaptive mechanisms. Section 5 describes the technical realization of a prototype. The paper ends with conclusions and further work.

State of the Art
The following section discusses the state of the art of adaptive speech interfaces for the industrial domain. Section 2.1 summarizes applications of speech-based interaction in manufacturing environments. It describes systems that provide assistance during manual tasks and for machine operation. Section 2.2 discusses mechanisms to adapt a user interface to the context and the requirements of the user.

Speech-Based Interaction in Manufacturing
Speech-based interfaces have been a subject of research since the 1990s. They provide mobile interfaces for environments that require manual and visual attention and that preclude other input devices [3]. Users with vision impairments or illiterate users can benefit from speech interaction as well [4]. These opportunities have been pursued by research in the industrial domain and need increasing attention in future production environments [5].
Many speech-based applications target use cases where the operator is involved in manual activities. For example, Scherff evaluated speech-based interfaces for the manual programming of welding robots [6]. While indicating the potential of speech-based interaction during manual tasks, Scherff highlights challenges such as the importance of reliable speech recognition and the need to recall commands instead of recognizing them on a screen. Another application that benefits from speech interaction is commissioning. Commissioning workers are moving continuously and are involved in manual (picking) and visual (searching) tasks. This motivation is supported by empirical studies that show increases in efficiency and picking quality compared to traditional approaches [7,8]. Industrial companies therefore increasingly adopt such pick-by-voice systems for commissioning. Fischer et al. present an application for maintenance engineers that allows storing and retrieving information about the rectification of machine faults [9]. Speech interaction allows using the assistance while carrying out the maintenance procedure.
Speech-based interfaces can also support machine operation. Overmeyer et al. provide a system for the control of automated guided vehicles using a multimodal interface [10]. They apply speech commands and gesture-based interaction to control lifting devices and consider the effect of cognitive workload on speech performance. Majewski and Kacalak propose a similar multimodal system for the control of lifting devices [11].

Adaptive User Interfaces
The key motivations for adapting a user interface are twofold. An interaction that is adapted to the needs of a user can increase the usability and enable more users to operate a complex system. In addition, adaptive user interfaces can improve the situation awareness of users while interacting with the machine, thus ultimately improving their efficiency. Situation awareness is organized in three levels, which involve perception of important data (level 1), comprehension of the current situation (level 2), and prediction of the future status (level 3).
An important distinction is to be made between adaptability, where the adaptation is initiated by the user, and adaptivity, where it is initiated by the system [12]. Both approaches have characteristic advantages and disadvantages. While adaptable systems can follow the preferences of the user, they suffer from increased complexity, since secondary user interfaces for the adaptation of the user interface are necessary [13]. Adaptive systems may induce the feeling of a loss of control over the behavior of the system [14].
Factors that can motivate adaptations are subsumed under the term context. Context denotes information that characterizes the situation of an entity [15]. One aspect of the context is the device class on which the application is running. Different device classes may require adaptations, for instance depending on the screen size. The presented information and interaction possibilities can also be adapted. Different interfaces can be presented based on the organizational role of the user (e.g., worker, shift supervisor) [16]. An automation system can provide appropriate levels of assistance for operators with different skill levels [17]. Such adaptations ensure that users receive support that suits their current situation.
Models fulfill a crucial role in adaptive user interfaces. Domain models like UML describe the information that an application handles and can specify different information for each user group. User models enable a system to "say the 'right' thing at the 'right' time in the 'right' way" using knowledge about the users [18]. Reference frameworks for the creation of adaptive interfaces exist. The Cameleon reference framework decomposes the design process into intermediary representations and models [19]. It thereby provides a structure to create adaptive user interfaces.

Summary
Several speech-based applications for industrial use cases exist. Especially use cases that require manual actions or visual attention, for instance commissioning, have been addressed. Existing applications do not adapt to the user and the situation. However, adaptivity is important to address individual capabilities and limitations. The flexibility of an adaptive system furthermore facilitates its transfer to different use cases.
This paper addresses this limitation and proposes an adaptive speech-based application. The application supports operators during maintenance and changeover tasks. It can adapt to the requirements of users in terms of individual vocabulary, knowledge, constitution, and qualification. The adaptations are specified using models that can be edited easily.

Design of an Adaptive Speech Interface
This section introduces the main ideas for the design of an adaptive speech interface. The first section describes the characteristics of the addressed procedures and characterizes the industrial environment in which the application should be used. Section 3.2 summarizes the adaptive components of the proposed application and how the adaptations are triggered.

Task and Physical Environment
The proposed system supports the user during complex tasks in industrial environments. The characteristics of the task and the physical environment are, besides the user requirements, important influences on the design of an interface. Tasks can be categorized into the following main classes based on common characteristics.
• Procedural tasks include all tasks in which the user has to fulfill an ordered sequence of activities.
• Supervision tasks include all tasks in which the user has to supervise and manage a complex and large system, possibly including multiple machines.
• Extraordinary maintenance tasks have to be performed in response to unpredictable and/or infrequent situations.
The system that is proposed in this paper focuses on procedural tasks (such as changeover procedures) and extraordinary maintenance procedures. These tasks typically combine manual steps (e.g., replacing defective parts), the calibration of components (e.g., setting a temperature), and operations at the user interface (e.g., changing the production program). These procedures typically take between 15 and 20 minutes. The operator is continuously involved in manual operations during these procedures and cannot use paper-based or tablet-based assistance systems. Speech-based interfaces represent a relevant solution in this context.
The physical environment poses constraints on the technological solutions. A representative example of an industrial environment is the bottling industry. This environment is characterized by high temperature and moisture. The machines create a high noise level that requires hearing protection. The operator is constantly involved in manual activities, for instance the refilling of supplies. Therefore, she or he is moving continuously and varies the body posture depending on the task. The operator wears protective equipment (i.e., helmet, gloves, hearing protection, and safety glasses). A speech interface meets the requirements of this context. Modern microphones are able to filter background noise. Speech interaction can be used during manual procedures without requiring manual actions of the user. Furthermore, no additional hardware that could be damaged or distract the operator is required.

Context-Based Adaptation
Adaptive systems are necessary to cope with the increasing gap between the complexity of manufacturing machines and the capabilities of the human workers that operate them [20]. Information about the context such as the characteristics of the user and the technical, physical or organizational environment can be used to adapt an application. The following section introduces the considered contextual information and the supported adaptations.
One part of this context is the constitution and disposition of the users, which the system considers to compensate for potential limitations. The considered user characteristics are age, education, computer experience, and disabilities. Declines in the hearing capabilities of elderly people can, for instance, be addressed by raising the playback volume or slowing down the speed of the speech synthesis. An individual vocabulary is included to support dialects. Another aspect is the knowledge of the operator, which is formalized in mental models that describe individual assumptions about relevant objects and their structure in a given domain [21]. Differences in mental models concern the structuring of the steps into larger units or different names for the tools and components. Three aspects of the interface are adapted based on the context.
• The presentation describes the changeable parameters of speech synthesis (e.g., playback volume, playback speed, or the applied voice). Adaptations of the presentation address individual constitution and disposition.
• The content describes the instructions that the application provides. Fine-grained instructions can address novice operators, whereas more abstract instructions are provided for experienced operators.
• The interface needs to be adaptable to support individual vocabulary. It includes the recognized vocabulary and the mapping between recognized phrases and commands.
With respect to the three levels of situation awareness (see Sect. 2.2), adapting the presentation enhances level 1 (i.e., perception of important data), whereas adapting the content improves levels 2 and 3 of situation awareness, since it becomes easier for the user to understand the current situation and predict what is likely to happen.
Finally, adapting the interface to the user's vocabulary enhances level 2, since it facilitates understanding what is happening. Thus, a speech-based interface that adapts according to these three aspects enhances the user's situation awareness.
The following section introduces the architecture of the application that implements the proposed adaptations. The architecture contains four major components and the models that formalize the contextual information and the derived adaptations.

Model-Based Adaptation of a Speech Interface
An architecture structures the components of the application and the included contextual information (see Fig. 1). Models adapt the application at the four modules (i.e., Speech recognition, Interpretation, Instruction generation, and Speech synthesis). The separation into modules allows the adaptation of parts of the application without affecting the other modules. The responsibilities of each component are described below.
The Speech recognition component is responsible for the transcription of the recognized phrases. It also checks whether the recognized phrase is part of the Vocabulary of this user. The Interpretation component matches a recognized phrase to a command. The Vocabulary describes the vocabulary that is used to interact with the application. New phrases can be added and linked to commands. The Instruction generation component creates a verbal instruction based on a Process model and a matching Template. The Process model provides a representation of the task and the instructions and is the foundation for the generation of instructions. Templates describe how the information that is stored in the Process model is converted into a concrete instruction.
The Speech synthesis component generates the acoustic output based on the User profile and the instruction that it receives. The User profile describes relevant characteristics and previous knowledge of a user. The following sections describe the models.

Input Vocabulary
The Vocabulary model specifies the recognized phrases and their mapping to the functions of the application. The model consists of pre-defined commands that describe the functions of the application (e.g., "next step", "previous step"). One or more verbalizations, meaning phrases that the system recognizes, can be assigned to each command. Table 1 displays the supported commands and exemplary verbalizations. The Vocabulary model allows an easy adaptation of the vocabulary to the requirements of a user (e.g., considering dialects or individual vocabulary).
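For illustration, the mapping between commands and verbalizations can be sketched as follows. This is a minimal sketch: the command names, phrases, and function names are hypothetical, not taken from the paper's model files, and Python is used for brevity although the prototype is implemented in C#.

```python
# Sketch of a Vocabulary model: pre-defined commands mapped to the
# verbalizations (phrases) that the system recognizes for them.
# All names and phrases below are illustrative examples.

VOCABULARY = {
    "NEXT_STEP": ["next step", "continue", "go on"],
    "PREVIOUS_STEP": ["previous step", "go back"],
    "REPEAT": ["repeat", "say that again"],
}

def interpret(phrase):
    """Map a recognized phrase to a command, or None if it is out of vocabulary."""
    phrase = phrase.strip().lower()
    for command, verbalizations in VOCABULARY.items():
        if phrase in verbalizations:
            return command
    return None

def add_verbalization(command, phrase):
    """Adapt the vocabulary by linking a new phrase (e.g., a dialect form) to a command."""
    VOCABULARY.setdefault(command, []).append(phrase.strip().lower())
```

Adding a dialect form, e.g. `add_verbalization("NEXT_STEP", "weiter")`, makes that phrase resolve to the same command, which is how the model accommodates individual vocabulary without changing the application logic.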

Process Model
The Process model describes the supported procedure. It contains a sequential representation of the work steps. The mental model of the user is the foundation for this model. Mental models describe an internal representation of a process. The Process model can therefore contain hierarchies to support individual segmentations. This allows providing instructions on different levels of granularity. A work step (e.g., "Remove component A") can be decomposed into sub-steps (e.g., "Unscrew component A", "Pull out component A"). Each step contains a description of the location of an action, the necessary tools, the aim, and additional information (e.g., safety measures). These components specify the instructions that the assistance system provides. Table 2 lists the components of an instruction that are stored in the Process model.
The creation of the instructions is governed by a template mechanism that aggregates the components into a complete instruction. This mechanism allows providing instructions with a varying level of detail and is described in Sect. 4.3.
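A hierarchical step structure of this kind can be sketched as follows; the field names are assumptions chosen to mirror the instruction components of the Process model, and the sketch is in Python rather than the prototype's C#.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Sketch of a hierarchical Process model step. Field names mirror the
# instruction components (operation, tool, location, object); they are
# illustrative, not the actual model schema.

@dataclass
class Step:
    operation: str                        # operation that is carried out, e.g., "Remove"
    obj: str                              # object that is manipulated
    location: Optional[str] = None        # where the action takes place
    tool: Optional[str] = None            # required tool, if any
    info: Optional[str] = None            # additional information, e.g., safety measures
    substeps: List["Step"] = field(default_factory=list)

def expand(step, detailed):
    """Return the fine-grained sub-steps for novices, or the coarse step itself."""
    if detailed and step.substeps:
        return step.substeps
    return [step]

# A coarse step that experienced operators receive as one instruction,
# while novices receive its two sub-steps.
remove_a = Step("Remove", "component A", substeps=[
    Step("Unscrew", "component A", tool="Wrench"),
    Step("Pull out", "component A"),
])
```

Selecting between `expand(remove_a, True)` and `expand(remove_a, False)` is one way the hierarchy can support individual segmentations of the procedure.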

Templates
The descriptions that are stored in the Process model have to be combined into a complete instruction. This allows adapting the details and content of an instruction to the requirements of individual operators. A set of Templates controls the creation of the instructions and specifies how the components (see Table 2) are combined to form a complete instruction. Experienced operators, for instance, can be provided with less detailed instructions. Table 3 displays exemplary templates and maps them to different levels of experience.
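The template mechanism can be sketched as a simple placeholder substitution; the template strings and component values below are illustrative (cf. Tables 2 and 3), and the sketch uses Python rather than the prototype's C#.

```python
# Sketch of the template mechanism: placeholders in angle brackets are
# filled with the components of a step from the Process model. The
# concrete templates and the example step are illustrative.

TEMPLATES = {
    "beginner": "<operation> <object> <location>. Use <tool>.",
    "expert": "<operation> <object>.",
}

def render(template, components):
    """Fill a template's placeholders with the step's components."""
    text = template
    for name, value in components.items():
        text = text.replace(f"<{name}>", value)
    return text

step = {"operation": "Unscrew", "object": "the labelling station",
        "location": "at the infeed", "tool": "a wrench"}
```

With this step, the beginner template yields "Unscrew the labelling station at the infeed. Use a wrench.", whereas the expert template yields only "Unscrew the labelling station." — the same model data rendered at two levels of detail.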

User Model
The user model describes relevant characteristics of the user for the implementation of a speech interface. It was designed to consider individual differences by means of a user-centered design process. The characteristics comprise individual constitution and disposition, qualification and competence, and adaptive attributes [25]. Constitutional characteristics address static attributes, such as gender or culture. Dispositional characteristics are variable, but not directly influenceable, for instance age or personality. Qualification and competence, on the other hand, are influenceable by the individual itself. Adaptive attributes, like strain or fatigue, are the most "dynamic" of the mentioned characteristics and depend on the current situation [22]. Regarding the described context of use, adaptive attributes describe human reactions towards informatory mental stress whilst working [23]. A systematic approach was designed based on these individual characteristics that influence human performance and are relevant for the implementation of the speech interface. The constitutional and dispositional characteristics contain the attributes age and auditory impairments. There is an ageing effect on hearing loss (presbyacusis) that is caused by physiological changes. This process can start at the age of 20 years [24], but usually occurs at an age of 30 years at frequencies of about 4000 Hz. With increasing age, lower and higher frequencies also become affected. At an age of 50 years, hearing loss causes a raised auditory threshold, which increases drastically with the frequency [25]. Figure 2 depicts the user model and its components. Table 4 summarizes the rules that are derived from the user model and govern the adaptations of the presentation and instructions. The characteristics of the instructions (e.g., structure or complexity) are derived from the qualification and competence. For example, the level of detail of the instructions decreases with increasing work experience. The provided thresholds serve as starting points that are adapted based on feedback from the user. Such feedback is generated by explicit requests of the user for more or less detailed instructions.

Table 2. Components of an instruction in the process model.

Component   Description                           Example
Operation   Operation that is carried out         Unscrew, Check
Tool        Tool that is used for the operation   Wrench
Location    The location of a work step           At the infeed
Object      Object that is manipulated            Labelling station

Table 3. Exemplary templates and mappings to level of experience.

Template                                   Experience
<object> <location> <operation>            Beginner
<object> <operation> <how>. Use <tool>     Beginner
<object> <operation>                       Expert
The capabilities are clustered into groups. The clusters are mapped to rules that describe changes to the parameters of the speech synthesizer. To compensate for declines of perception, the system increases the volume and decreases the playback speed for users above the age of 30 to prevent negative effects of hearing declines on individual performance [2]. Hearing impairments, on the other hand, can be innate and age-independent. In this case, the adaptation of the speech interface has to be adjusted according to the individual impairments of each person. The user needs to be able to control and review the adaptations provided by the system. This should avoid a feeling of a loss of control over the behavior of the system [19]. The user can therefore request changes of the adaptations and override the system. This improves the behavior of the application for this specific user and provides feedback to validate and improve the rulesets.
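Such a ruleset can be sketched as follows. The age threshold of 30 follows the discussion of presbyacusis above; the concrete volume and rate values are illustrative starting points, not the paper's actual parameters, and Python is used instead of the prototype's C#.

```python
# Sketch of presentation-adaptation rules derived from the user model.
# Volume is on a 0-100 scale, rate is a playback-speed factor; the
# concrete values are illustrative defaults, not validated thresholds.

def synthesis_parameters(age, hearing_impaired=False, user_override=None):
    """Derive speech-synthesis volume and rate from the user profile."""
    volume, rate = 70, 1.0              # defaults for unimpaired users under 30
    if age >= 30:
        volume, rate = 85, 0.9          # louder and slower to compensate presbyacusis
    if hearing_impaired:
        volume, rate = 100, 0.8         # innate impairments need per-person tuning
    if user_override:                   # the user can always review and override
        volume = user_override.get("volume", volume)
        rate = user_override.get("rate", rate)
    return volume, rate
```

The `user_override` argument models the control-and-review requirement: an explicit request by the user takes precedence over the rule-derived values and can be logged as feedback for improving the ruleset.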

Implementation of the Speech Interface
The technical architecture should facilitate the application of the system in practical scenarios. Therefore, free software components that can be deployed on multiple platforms were used. Windows was chosen as the operating system due to its distribution in industrial environments and its support of a variety of device classes (e.g., desktop, mobile, or single-board computers). Speech recognition and speech synthesis were implemented using the C# .NET framework. The models are specified using a reusable XML-based format. A specific editor for the authoring and maintenance of the procedures and model files was developed to support the adaptation of the system to different use cases and applications.
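An XML-based model format of this kind could look as follows. The element and attribute names are assumptions for illustration, not the paper's actual schema, and the loading sketch uses Python's standard library rather than the prototype's C# implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML representation of the Vocabulary model. The element
# and attribute names are assumed for illustration only.
VOCABULARY_XML = """
<vocabulary>
  <command name="NEXT_STEP">
    <verbalization>next step</verbalization>
    <verbalization>continue</verbalization>
  </command>
  <command name="PREVIOUS_STEP">
    <verbalization>previous step</verbalization>
  </command>
</vocabulary>
"""

def load_vocabulary(xml_text):
    """Parse the XML model into a command -> phrases mapping."""
    root = ET.fromstring(xml_text)
    return {cmd.get("name"): [v.text for v in cmd.findall("verbalization")]
            for cmd in root.findall("command")}
```

Keeping the models in a human-readable format like this is what allows an editor, or the users themselves, to adapt the vocabulary without touching the application code.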

Conclusion and Further Work
This paper presented an adaptive speech-based application for assistance in maintenance and changeover procedures. The system provides flexible adaptation mechanisms to meet the demands of different user groups regarding the presentation and content of the instructions. The application is based on models that allow an easy adaptation of all parts of the interface to the needs of a specific user. Specific editors support content authoring and the adaptation of the system to different use cases and requirements. An empirical evaluation will be conducted to validate whether the application is usable in real maintenance procedures and whether it outperforms traditional assistance systems. The evaluation could also validate the suggested adaptation rules.
The integration of images or videos could support inexperienced users in the identification of machine parts and provide additional output modalities for users with hearing impairments. The application could also be connected to industrial machines. The worker could then remotely trigger operations that are part of the procedure and receive information about the machine state. The detection of emotional states using speech data is possible as well, offering support if strain is detected in the user's voice [26]. Furthermore, hold-up times can indicate stressful states. They can be detected by comparing the completion time for a step with the expected time or previous trials.