The aim was an objective study of the control of complex systems, relying neither on verbal reports nor on purely subjective interpretations. The most promising focus for such an objective study appeared to be the representational primitives used by humans in the cognitive processes underlying their control decisions. The justification for this comes from considering a key feature of rule-induction algorithms.
Since rule-induction algorithms are noted for their dependence on the suitability of their representation primitives (as noted in §3.2.3), the possibility exists of turning this connection round, and using the effectiveness of rule induction as a measure of merit of the representation. We should note, however, that poor results could arise on many grounds, only one of which is the quality of the representation. If other problems existed (e.g., see below, §6.4.2), they would limit the performance of the rules, and changing the representation might then produce only relatively small effects.
Hence subsidiary aims were to produce a means of preparing the data in accordance with a variety of representations, then to test the performance of a rule-induction program with the prepared data. The relative performance of rules generated following the different representations would then reflect their relative merit, and hence give some lead on the correspondence of varying representations with a supposed inherent structure of the data. In human control of a complex system, this would in turn be evidence for or against the claim that certain concepts were salient features of an operator's ‘mental model’ of the task or the system, irrespective of whether the concepts were verbalisable or not.
Inevitably, the aims of this first experiment could not be defined more closely than this in advance, since it was not at all clear where difficulties would occur, and where progress would be halted.
Since the objective of the research was to investigate human performance of complex tasks, a task had to be built. The chief factor of importance in the design both of task and interface was to provide a source of data suitable for analysis.
The key criteria influencing the design have been discussed above (§5.1). It was generally accepted that the important aspects of a program implementing these design principles were principally that it worked, secondarily that it could be updated as required, and only last and very much least the elegance or finer details of the coding. Discussion in this section therefore concentrates on the important design decisions and how they were implemented.
The simulation program, and all the analysis programs with the exception of the rule-induction program, were written by the author.
The general approach taken in the construction of the program was that of top-down functional decomposition. In the main function, after initialisations, there is a main loop from which are called other functions, which deal with the simulation of each simulated object, the scoring, the interface interaction, and the logging of the actions.
On-line help had the advantage that access to it could be recorded in the same way as the other interactions. For that reason, the main function was designed to cope not only with the simulation interactions, but also with the ancillary interactions, including access to help, that surrounded the individual games.
The total length of code in the simulation programs was around 10000 lines. Of this, the code specifically for replaying accounted for about 1500 lines, help accounted for some 1000 lines, the interface for about 3500 lines and the simulation itself for approximately 4000 lines. These are approximate values, not only because the layout of the code is arbitrary, but also because there was not always a clear separation between the code for the different functions. However, these figures do give a general indication of the relative complexity of the various aspects of the program.
For consistency, a particular length had to be chosen for the time loop, and the choice of length was influenced by two factors: firstly the maximum amount of time that the necessary program steps might take, and secondly the suitability of this time interval from the user's point of view. This second consideration can be further broken down into two points: on the one hand, the refresh rate had to be fast enough to allow the user to have a sense of immediacy in control and feedback, but on the other hand, if the refresh rate was too fast, the user's performance would depend more on exact timing, reintroducing the effects of psycho-motor limits, which were seen as undesirable.
After the initial coding had been done, it was clear that one complete cycle of simulation and interface update could be performed on the chosen system within about 0.2 seconds. However, following the discussion in §5.1.3, it was undesirable to set the refresh interval too close to a typical simple human response time, as this would create the possibility that non-cognitive aspects of response time would become an important factor. Half a second was therefore tried, and no subjects complained about the refresh rate being too slow. This timing was therefore accepted. (On other available systems, the same machine computations could well have taken over 0.5 seconds, and therefore forced an undesired choice of timing.)
It was found that using the same interval of 0.5 seconds for the length of the simulation step caused problems with the cable simulation, and was in any case unnecessary since one simulation step for all the objects took only a small fraction of a second. After trying different values, a simulation interval of 0.1 seconds was fixed on as giving a reasonable balance between accuracy and low computation time. In each half second therefore, five simulation steps for each object are performed, in immediate sequence.
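The nesting of the two intervals can be sketched as follows (a minimal illustration in Python, not the original code; all names are invented):

```python
# Two nested time intervals (hypothetical names, illustrative only).
REFRESH_INTERVAL = 0.5   # seconds between interface updates
SIM_INTERVAL = 0.1       # seconds per simulation step
STEPS_PER_FRAME = int(round(REFRESH_INTERVAL / SIM_INTERVAL))  # = 5

def run_frame(objects, step_fn):
    """Advance every simulated object by one 0.5-second frame,
    performing five 0.1-second sub-steps in immediate sequence."""
    for _ in range(STEPS_PER_FRAME):
        for obj in objects:
            step_fn(obj, SIM_INTERVAL)
```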
The overall task was to identify all suspicious objects in an area of sea-bed, to dispose of the mines, and to return to the starting area. This was done with a ship, a ‘remotely operated vehicle’ (or ROV: a small unmanned submarine), and an umbilical cable that connects the two together. A short way of describing how to perform the task was thus:
find a target;
send the ROV to look at it;
if it is a mine, fly the ROV to the required position, and at some later point, detonate the mine;
repeat until all targets are dealt with;
then return to home.
The task was more closely defined by the scoring system. This was as follows:
It can be seen from this that there were trade-offs between on the one hand speed, with low time penalty, and on the other hand care, to avoid the risk of explosion. This kind of trade-off is common in the control of complex systems, wherever the areas of danger lie close to otherwise desirable paths, and it was generally seen as important to the relevance of the game that such trade-offs were set up. Another way of describing the same aspect of the task is that there may be multiple conflicting goals at any time, and the operator has to find his or her own balance.
The simulation is divided into four parts, one each for the ship, the ROV, the umbilical cable, and the targets. The first three, which are the controllable ones, also have their own sub-displays, only one of which is able to be seen and used at one time. Since the cable potentially affects both ship and ROV, it is done first. This is followed by the ship and the ROV, and finally the targets, which explode if the ship or the ROV has done the wrong thing.
This was the most problematic part of the simulation. Cable models based on finite element analysis exist, but these tend to be computationally very intensive, and are therefore probably inappropriate for a small-scale real-time model such as the one built. Simple models are easy to imagine and implement, such as an elastic cable without water resistance lying in a straight line between the ship and the ROV. The problem with such simple models is that their behaviour is both counter-intuitive and unrealistic. This lack of realism could easily distract the operator from the task towards trying to discover how the cable in fact behaves.
For this task, the author therefore constructed an original model. This model is based on the fiction that the cable can be represented for many purposes by a single point halfway along its length. The elastic forces can be dealt with reasonably in this way, and the motion of the representative point provides a basis for calculation of overall water-resistance. It was not clear how good this model would be, so it was implemented, and tested by manoeuvring the simulation in ways that would discover the model's limits.
The model was then refined a number of times, by introducing factors which appeared to be relevant to the discrepancies between the actual and the desired behaviour. Since intuitive plausibility was more important than technical accuracy, the desired behaviour was that which did not appear counter-intuitive. The author makes no claims about the accuracy of the resultant model, only that it seems to behave in a reasonable and interesting way.
Other problems that had to be tackled included unstable oscillations of the cable in tension. This can occur when the length of the time step is too large for quick changes to be dealt with properly (cf. “As is well known, explicit finite-difference methods for initial value problems are susceptible to numerical instability if too large a time step is taken”). This was solved by insisting that the cable mid-point could not go to the other side of its equilibrium mid-point position in one simulation step.
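The overshoot constraint can be illustrated with a minimal sketch (hypothetical names and a one-dimensional simplification; the real model works on the cable mid-point in three dimensions):

```python
def clamped_midpoint_step(pos, equilibrium, velocity, dt):
    """One simulation step for the cable's representative mid-point
    (a hypothetical one-dimensional simplification of the model
    described in the text).  The new position is clamped so that it
    never crosses the equilibrium position within a single step,
    suppressing the unstable oscillation that a too-large time step
    would otherwise cause."""
    new_pos = pos + velocity * dt
    # If the step would carry the mid-point past equilibrium, stop it there.
    if (pos - equilibrium) * (new_pos - equilibrium) < 0:
        new_pos = equilibrium
    return new_pos
```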
Accuracy was more important here, since in YARD there were numerous experienced mariners who had more finely-tuned ideas about how a ship should behave. The model constructed was based on a mathematical model of an actual vessel design. Since the design of ship was not entirely up to date, the information was not highly sensitive; as a precaution, however, the parameters were altered slightly, without a large effect on the behaviour of the ship. The original model was on paper (not programmed): a detailed model of a ship's behaviour in calm to moderate weather, taking into account all six degrees of freedom: roll, pitch, yaw, surge, sway, and heave. Douglas Blane of YARD simplified the model by cutting out roll, pitch, and heave, and making simplifying assumptions about the rudder; and the author implemented this simplification. The model takes into account wind, waves, and tide, although these were not used in the first experiment other than in a casual exploratory way.
The propeller and rudder controls were modelled as if controlled by servos, with a fixed rate of alteration, so that it took a reasonably realistic amount of time to achieve a given control demand. These parameters were decided on after informal consultation with experienced personnel at YARD.
YARD had a model of a particular ROV implemented as a mock-up simulation, with a scenario of inspecting the legs of oil drilling platforms. This vessel simulation, based on previous research, was slow, ungainly, and asymmetric, and had directable thrusters and a camera that could be tilted and panned. The simulation was too slow for the kind of situation envisaged, with too much unnecessary detail and high fidelity in the hydrodynamics, which would have made it difficult to adapt to the chosen implementation environment. The author therefore implemented a much simpler vessel, with much simpler hydrodynamics, in which the original six degrees of freedom were reduced to four by ignoring (setting to zero at all times) roll and pitch.
A number of additional features were included. The effect of the umbilical cable on the ROV was modelled, and turned out to be an operationally important constraint even before the cable became fully taut. Realism and plausibility were clearly to be enhanced by modelling interaction of the ROV with the sea bed. The author devised a model of sticking in the mud, in which gentle collisions with the bottom could be freed using upwards thrust, while heavier collisions needed the cable to be reeled in to free the ROV. As with the ship, the effect of tide was modelled, but not used in the first experiment. Not modelled in the first experiment was collision of the ROV with the target objects, nor collision of the ROV with the ship.
To maintain player interest and uncertainty, it was decided to have the sea-bed targets randomly positioned in a given area. The precise time of giving the order to start a game gave a random number seed, and pseudo-random numbers generated from this seed gave the number, type and position of the targets. This seed was recorded so that precisely the same set-up could be regenerated for a replay. Randomly varying the type of targets meant that the player did not know whether a target was dangerous or not before observing it at close quarters, and the random number of targets (with a mean value of five, but soon constrained to be at least five) prevented the player from going back to base before checking in all corners of the minefield.
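The seeding scheme might be sketched like this (a Python illustration; the constants, field names, and area units are invented, not taken from the original program):

```python
import random
import time

def generate_minefield(seed=None, area=(1000.0, 1000.0)):
    """Sketch of seeded target generation (invented names and units).
    The seed is taken from the precise start time and returned, so that
    exactly the same set-up can be regenerated for a replay."""
    if seed is None:
        seed = int(time.time())        # the moment the game was started
    rng = random.Random(seed)
    # Random number of targets, constrained to be at least five.
    n = max(5, rng.randint(3, 7))
    targets = [
        {
            "x": rng.uniform(0, area[0]),
            "y": rng.uniform(0, area[1]),
            "is_mine": rng.random() < 0.5,   # type unknown until inspected
        }
        for _ in range(n)
    ]
    return seed, targets
```

Recording the returned seed is what makes an exact replay of the same minefield possible.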
There was no official information on which to base the behaviour of the mines, so the author implemented his own idea of how an acoustically operated mine might work. In the simulation, it is set off by the ship propellers or ROV thrusters being at too high a speed too close to the mine.
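A minimal sketch of such a trigger rule (the thresholds are invented, not taken from the original):

```python
import math

def mine_triggered(mine_pos, source_pos, prop_speed,
                   danger_radius=50.0, safe_speed=0.3):
    """Hypothetical acoustic-mine rule matching the description in the
    text: the mine detonates if propellers or thrusters run too fast
    too close to it.  Both thresholds are illustrative assumptions."""
    dist = math.dist(mine_pos, source_pos)
    return dist < danger_radius and abs(prop_speed) > safe_speed
```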
As discussed above (§5.1.2), the task needed an interface that was on the one hand sufficiently low-level to require both substantial learning, and creation by the operators of higher-level structure above the level of the interface; but on the other hand not so low that the task took too long to learn (as would have been the case with a typical live complex system). Pitching the game fairly near the lowest level of the task described would enable the observation of learning higher levels which were easily understandable. The level of interaction would be confirmed as not too low if the subjects were able to learn the task to a fairly stable state in the allowed time.
In contrast with, for example, the Iris flight simulator, it was decided to conduct all user input through pressing the mouse buttons. It is a natural extension of the mouse terminology (and common, if slightly loose) to refer to the active areas on the screen as buttons, and the action of pressing on the mouse button while the cursor is in a particular active area as a “button-press”. The advantages of this are that firstly all the ‘buttons’ can be labelled with their effect, eliminating the need for the user to memorise codes or consult help screens in the middle of a run. Secondly, immediate visual feedback can be given, which in the case of this interface was done by highlighting the background of a button that had just been pressed.
Also discussed earlier was the goal of minimising the significance of motor skill and psycho-motor limitations. The interaction was therefore designed to rule out the effects of small fractions of a second: only one button-press was taken into account in any one half-second, and button-presses took effect only at the half-second boundaries.
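This input rule can be sketched as follows (a hypothetical class, illustrative only):

```python
class ButtonLatch:
    """Sketch of the input rule described above (invented class):
    only the first button-press in each half-second frame is kept,
    and it takes effect only at the frame boundary."""
    def __init__(self):
        self.pending = None

    def press(self, button):
        # Later presses within the same frame are simply ignored.
        if self.pending is None:
            self.pending = button

    def frame_boundary(self):
        """Called every 0.5 s; returns the action to apply, if any."""
        action, self.pending = self.pending, None
        return action
```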
The ‘Seeheim’ model of user interface management conceptually separates the interactions that deal solely with presentation of information from those that affect the controlled system itself. This was adopted as one of the design principles. This decision having been made, there were four general divisions of the interface:
The next important design decision was whether to have all the information present at once. There were three good reasons why not. Firstly, to put all the information on one screen would produce very small areas of screen, which could not hold many characters of a reasonably sized font, and the effectors would be more difficult and slower to locate than ideal. Secondly, practical complex systems, if their interfaces use VDU screens, tend to need to split the information into a number of different screenfuls. Thirdly, having less than all the information on the screen at once would limit the obvious possibilities for the information being used at any particular time, which would help the analysis of operators' decisions. The sensors relevant to a group of effectors should be displayed along with those effectors.
The highest-level divisions apparent were between the different objects of the simulation: the ship, the ROV, and the umbilical cable. It was therefore decided to divide the screen horizontally into two parts, one of which would show information relevant to the task as a whole, or all the objects of the task, and the other would show sensors and effectors either for the ship, or for the ROV, or for the cable. The resultant appearance of the interface is shown in Figure 6.1, with the ship sub-display showing.
Figure 6.1: The interface in the first sea-searching experiment
For example, let us consider adding a new element to the interface. To do this, at least the following steps need to be considered:
The basic conceptual structure used in the implementation of the interface was a hierarchy of sub-displays, columns, rows and elements. This was reflected in a four-dimensional array of structures, each one of which contained the information relevant to the display element. The positioning of the elements on the screen was taken care of by automatically allocating them equal spaces in their row, which were allocated equal heights in their column. The overhead involved in this was the maintenance of hard-coded arrays of the number of rows in each column, and the number of elements in each row. This was much easier to keep updated than would be the alternative approach of changing the element positions by hand each time the number of elements changed. Columns were sized in a hard-coded fashion, since changes were not anticipated at this level.
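The equal-division layout can be sketched for a single sub-display as follows (a Python illustration with invented screen dimensions; the original used a four-dimensional array of structures in a compiled program):

```python
def layout(columns, screen_w=800, screen_h=600):
    """Sketch of the automatic layout described above (invented units).
    `columns` is a list of columns; each column is a list of rows;
    each row is a list of element names.  Columns get (here: equal)
    widths; rows share a column's height equally, and elements share
    a row's width equally, so adding an element only changes the
    counts, not any hand-maintained positions."""
    rects = {}
    col_w = screen_w / len(columns)
    for ci, rows in enumerate(columns):
        row_h = screen_h / len(rows)
        for ri, elements in enumerate(rows):
            el_w = col_w / len(elements)
            for ei, name in enumerate(elements):
                # (x, y, width, height) of each element's screen area.
                rects[name] = (ci * col_w + ei * el_w, ri * row_h, el_w, row_h)
    return rects
```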
One function was responsible for calling the many functions needed to effect the actions. The rest of the information about the interface elements was kept together hard-coded in a function which was used at initialisation. For the addition of a new element, one therefore needed to do little more than:
Obviously, for such a complex task, a user would initially be in great difficulty if there were no explanation of the game. The provision of help also enabled some useful knowledge to be presented, so that the user did not need to discover it from experience, which would retard learning in general. Help could be fitted into the form of the interface as designed for the game, thus providing it on-line, which had the added advantage that access to it could be monitored, and studied if thought worthwhile. Another feature provided was the ability to replay previous games. Two demonstration examples showing the operation of the ship and the ROV were included, but none showing a complete game by another player, as this was thought likely to influence the style of the beginner.
The text of the help messages was kept in separate files, read in when a particular section of help was requested. This enabled changes in the format of the display without needing changes in the text itself; and changes in the help without recompiling the program. The content of the help for the second experiment, which is slightly different from the first version here, is given in Appendix A.
Logging data was achieved by maintaining an internal array of the actions taken, at the level of individual button-presses. (See below, Figure 6.2.) This table was written out to file at the end of the run. Writing to file during a run could have caused brief interruptions in the continuity of the game.
Replaying of runs selected through the help screen was implemented by reading in the appropriate trace file, and executing the same process as in an ordinary game, with the entries in the trace file being used instead of the player's button-presses. This replaying relies on identical simulation steps being performed and the actions being presented at exactly the right time: since there is no record of the process variables, any small error would quickly accumulate and disrupt the correspondence between conditions and actions. Getting this right was difficult.
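The replay mechanism can be sketched as follows (illustrative Python; the function names are invented):

```python
def replay(trace, n_steps, simulate_step, apply_action):
    """Sketch of replay as described in the text: the same simulation
    steps are executed as in a live game, but actions come from the
    trace (a list of (time_step, action) pairs) instead of the mouse.
    Determinism depends on every action landing on exactly the
    original half-second step."""
    pending = dict(trace)            # time_step -> action
    for step in range(n_steps):
        if step in pending:
            apply_action(pending[step])
        simulate_step()              # five 0.1 s sub-steps in the real program
```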
As it is, the interface is not portable to any other system. This is because of the non-standard nature of:
It would be a substantial task to re-implement the program on a different system.
Unpaid volunteers were requested by word-of-mouth and local posters. The selling pitch was that the simulation was more interesting than, and different from, other computer games. A spirit of competition was fostered firstly by providing a scoreboard as part of the help, secondly by offering a small prize (unspecified at the start) for the highest overall score in a single game at the end of a given period. Four subjects at GH, and two at CXT (other than the author), achieved at least a basic competence allowing them to complete games within a reasonable time-span. We will abbreviate those at GH to R, G, S and M. Those at CXT we will abbreviate to DM and DJ.
As explained above (§6.2.3), the interaction with the simulation was entirely through the mouse. The keyboard itself had no effect, except that during the course of the experiment a facility was introduced so that players could skip short amounts of time (when they knew in advance that they would not want to take any actions) by pressing a number key: 1 skipped about 10 seconds' worth of action, 2 about 20 seconds' worth, and so on. In no case could this increase the score, and these actions were not recorded in the first experiment.
The primary data consisted of time-stamped records of every legal key-press. This was recorded in a format of five blank-separated numerical fields per line, one line representing one key-press (see Figure 6.2). These files will be referred to as ‘action trace files’ or simply ‘trace files’.
00018 2 3 2 4 Both_Props_Full_Ahead
00021 2 3 5 0 Rudder_Hard_Port
00049 2 3 5 2 Rudder_Centre
00053 1 0 8 1 Scale_over_2
00054 1 0 8 1 Scale_over_2
00056 1 0 2 0 Fix_Ship
00058 1 0 2 1 Centre_Ship
00160 2 3 5 0 Rudder_Hard_Port
00195 2 3 5 2 Rudder_Centre
00201 2 3 5 3 Rudder_Gentle_Stbd
00207 2 3 5 2 Rudder_Centre
00223 2 3 2 0 Both_Props_Full_Astn
00356 1 3 1 0 Stop_Return_to_Help

Figure 6.2: A commented example game trace file
There were two types of action trace file. Each game had its key-presses recorded in a separate file, and each session of zero or more games had the ancillary key-presses (starting and stopping the game, reading the help information provided, etc.) recorded in the same format, together, without the key-presses from the games themselves.
The first field of each line gives the number of half-seconds since the beginning of the game, or session. This varies from zero up to several thousand (7200 being equivalent to one hour). The remaining four fields are single digits, representing in turn the sub-display, the column, the row, and the element within that row. These refer to the obvious divisions of the interface screen.
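A parser for this line format might look like this (an illustrative sketch, not part of the original tool chain):

```python
def parse_trace_line(line):
    """Parse one line of an action trace file in the five-field format
    described above.  A sixth, commented field (the action name) may
    be present, as in Figure 6.2."""
    fields = line.split()
    time_step = int(fields[0])                     # half-seconds since start
    subdisplay, column, row, element = (int(f) for f in fields[1:5])
    name = fields[5] if len(fields) > 5 else None  # optional comment field
    return time_step, (subdisplay, column, row, element), name
```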
Including the files due to the author, covering the period Oct 8th to Jan 16th, there were over 300 files from GH and nearly 150 files from CXT. These occupied over 1 Mbyte and over 400 kbyte respectively.
All plausible actions were recorded, which included those actions that were impossible or ineffective at the time, and which caused an audible warning to the player at the time of the game, coinciding with the highlighting of the selected area. However, there were some mouse-button-clicks that occurred while the cursor was in an area that was generally inactive, and these were not recorded. Such button-clicks also caused an audible warning at the time of play, but did not cause any highlighting of the selected area of the screen.
In addition to the action trace files mentioned above, there was one file for each site, in which each game has a one line entry. The fields in this file (called the ‘Runindex’) were:
calm_weather s 17282750 Apr14-00:29 -546 1036 20262
calm_weather s 17283514 Apr14-00:42 982 998 10535
calm_weather s 17338726 Apr14-16:02 5799 2201 571
calm_weather s 17339722 Apr14-16:19 7044 2456 10535
calm_weather s 17340482 Apr14-16:32 6845 2155 20262
calm_weather s 17341776 Apr14-16:53 7781 3219 1940
calm_weather s 17447710 Apr15-22:19 6637 2363 8818

Figure 6.3: Part of a Runindex file
Table 6.1: Subjects' times, scores, and dates (1989–90)
An idea of the amount of data gathered can be obtained from Table 6.1. “No. Runs” is the total number of games played. Only a few of these were false starts, abandoned after a very short time. “Total Time” is the hours and minutes spent on the games themselves, not counting ancillary activities. “Best Time” is the total time spent at the end of the game that gave the best score, recorded in the next column. “Start Date” and “Best Date” allow the calculation of the calendar period from first trying the game to scoring the recorded high score. “Finish Date” represents the date of the last game played before the (arbitrary) cut-off date of 16th January 1990. There were in fact only a few games played after this, with no-one improving their score. The subject with the highest score, G, was also the one who had put in most hours of practice. For comparison, the author, with substantially more practice than any of the subjects, achieved a score of over 9000. A practical ceiling would appear to be around 10000, though this did depend on the random fluctuation of the arrangement of the mines. A score as high as this was only possible (even in theory) on a small proportion of games. Uncertainties in the scoring system are discussed below (§6.4.1).
Since the analysis of the data involves many stages, it might be helpful to review the overall pattern before describing the details. This overall view, from which some details have been omitted for clarity, appears here as Figure 6.4. In this figure, the rectangles represent types of file, and the ovals represent programs for transforming one type of file into another.
Figure 6.4: Simplified data flow during analysis
It was clear from very early on in the analysis that a single action from a human point of view was not necessarily equivalent to a single key-press. In particular, after practice on the task, short sequences of key-presses were frequently apparent. This happened particularly in the case of manoeuvring the ROV. Because there was no simple effector to turn left or right, players had to create their own sequences of actions that performed the function of turning. Even for the same direction of turn, different sequences were performed in different contexts. When going full ahead with both thrusters, a left turn of a few degrees would most probably be executed by selecting half ahead on the left thruster, followed immediately (0.5 seconds later) by restoring the left thruster to full ahead, using the full-ahead key for either the left thruster or both thrusters together. In contrast, when near a mine, gliding along with both thrusters stopped, a left turn would most likely be executed by selecting the starboard thruster half ahead, and then stopped (or both stopped). Other variations were also observed.
With these turning manoeuvres, the time interval between the unbalancing and balancing actions was crucial to determining the magnitude of the effect of the action. Due to the delayed response of the thrusters, leaving an interval of 1 second produced an effect approximately four times the size of the effect produced by an interval of 0.5 seconds. Thus, in situations where the former would be an appropriate action, the latter would not, and vice versa—the two were, in effect, different actions. There were other sequences of actions where time interval and order were not important. For example, when recovering the ROV, the first step was to reel in the cable. This was effected by setting the cable tension to ‘grip’ and the take-in speed to ‘fast’; but the ordering and interval of these two actions was immaterial.
Perhaps these considerations could in principle be derived from the data. However, in this study, they were deliberately introduced, as background knowledge, in the attempt to get the best classification of actions possible in the available time.
The problem then remained to find what compound actions actually occur in different players' task performance. This could be attempted by observation and questioning; but, without any objective measure, it would be difficult to assess how much of the resulting discoveries were artificially produced by the biases and preconceptions of player and experimenter. Thus a prime objective was to devise a program to compile a list of these compound actions given only a quantity of raw data. Programs for learning macro-operators in games or puzzles have been devised, but the methods used and quoted (e.g. in ) do not explicitly deal with time or dynamic systems, and were thus unsuitable here.
The basic program to perform this was called summ (for summary). The program first completed a large table of the frequency of occurrence of each key-interval-key triple, for intervals between 0.5 seconds (coded 0) and 2 seconds (coded 3). Two seconds was judged to be a reasonable maximum between two sub-actions that formed a higher-level action. The program then wrote out a summary of the more commonly occurring sequences. Depending on flags given on the command line, this summary was intended either as a general summary for human reading (see Figure 6.5), or as a list of 4-tuples specifying key-interval-key sequences and single replacement keys (see Figure 6.6).
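The frequency table that summ builds can be sketched as follows (illustrative Python; summ itself was a separate compiled program, and the data structure shown is an assumption):

```python
from collections import Counter

def key_interval_key_counts(trace, max_gap=4):
    """Sketch of the table summ completes: counts (key, interval, key)
    triples for pairs of successive key-presses separated by 1..max_gap
    half-second steps.  The interval code is the gap minus one, so that
    0.5 s is coded 0 and 2 s is coded 3, as described in the text."""
    counts = Counter()
    for (t1, k1), (t2, k2) in zip(trace, trace[1:]):
        gap = t2 - t1
        if 1 <= gap <= max_gap:          # within the 2-second maximum
            counts[(k1, gap - 1, k2)] += 1
    return counts
```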
The summary in Figure 6.5 gives on the first line the key itself (3 3 1 3), its relative frequency (0.050), its absolute frequency (1342), and its name (Port_Ths_Half_Ahd). Each following line gives an interval code, a second key, the frequency of that key-interval-key sequence, the second key's name, and (where one was assigned) the name of the resulting composite action.
3 3 1 3 relative 0.050 freq :1342 Port_Ths_Half_Ahd
0 3 3 1 2 freq : 17 Port_Ths_Stop Slow_Turn_Stbd_0
1 3 3 1 2 freq : 55 Port_Ths_Stop Slow_Turn_Stbd_1
2 3 3 1 2 freq : 38 Port_Ths_Stop Slow_Turn_Stbd_2
3 3 3 1 2 freq : 31 Port_Ths_Stop Slow_Turn_Stbd_3
0 3 3 1 4 freq : 370 Port_Ths_Full_Ahd Diff_Turn_Port_0
1 3 3 1 4 freq : 66 Port_Ths_Full_Ahd Diff_Turn_Port_1
2 3 3 1 4 freq : 11 Port_Ths_Full_Ahd Diff_Turn_Port_2
0 3 3 2 2 freq : 24 Both_Ths_Stop Slow_Turn_Stbd_0
1 3 3 2 2 freq : 70 Both_Ths_Stop Slow_Turn_Stbd_1
2 3 3 2 2 freq : 65 Both_Ths_Stop Slow_Turn_Stbd_2
3 3 3 2 2 freq : 55 Both_Ths_Stop Slow_Turn_Stbd_3
0 3 3 3 1 freq : 228 Stbd_Ths_Half_Astn Pure_Turn_Stbd
1 3 3 3 1 freq : 75 Stbd_Ths_Half_Astn Pure_Turn_Stbd
3 3 3 3 2 freq : 9 Stbd_Ths_Stop

Figure 6.5: A fragment of a summary
3 3 1 3 0 3 3 1 4 0 3 3 0
3 3 1 3 0 3 3 3 1 0 3 0 1
3 3 1 3 1 3 3 1 4 0 3 3 1
3 3 1 3 1 3 3 3 1 0 3 0 1
3 3 1 3 2 3 3 1 4 0 3 3 2
3 3 1 3 2 3 3 3 1 0 3 0 1
3 3 1 3 3 3 3 1 4 0 3 3 3
3 3 1 3 3 3 3 3 1 0 3 0 1

Figure 6.6: A fragment of a replacement chart
One deficiency of this basic summary program was that it could not deal with sequences of more than two actions. This was overcome by using the program iteratively. First a basic replacement chart was made. Next, actions were fed through a filter that made the changes specified in the first chart; this filter program (not shown in Figure 6.4) was named after its effect of changing the actions in the trace file. The output of this filter was then fed in to summ again, giving a second chart, some of whose input keys may have been new composite ones. A C-shell script (named chart) was written to govern this iterative process. In the final version, the first application of summ output only the more common sequences, and two subsequent iterations included progressively less common sequences.
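The action of the replacement filter can be sketched as follows (illustrative Python; collapsing a matched pair into a single composite action at the first key's time is an assumption about the original behaviour):

```python
def apply_chart(actions, chart):
    """Sketch of the filter that rewrites a trace using a replacement
    chart mapping (key1, interval_code, key2) -> composite key.
    `actions` is a list of (time_step, key) pairs in time order;
    matched pairs collapse into one composite action (hypothetical
    simplification of the original program)."""
    out, i = [], 0
    while i < len(actions):
        if i + 1 < len(actions):
            (t1, k1), (t2, k2) = actions[i], actions[i + 1]
            code = t2 - t1 - 1            # 0.5 s gap -> code 0, etc.
            if (k1, code, k2) in chart:
                out.append((t1, chart[(k1, code, k2)]))
                i += 2
                continue
        out.append(actions[i])
        i += 1
    return out
```

Running this filter and then summ again, as the text describes, lets composite keys from one pass become inputs to the next.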
The typology of actions is further discussed below (§6.4.2).
The trace files contained information only about the actions taken, not about the situations. The files therefore had to be expanded before analysis to include full details of the situation. This was done by a modified version of the simulation program, without the graphics or interaction. The input was an action trace file, and the output was a binary file containing all the data on which the displays were based for each half-second step. This expansion increased the file size by a factor of 250–300. (It would thus be quite impractical to store many of these expanded files on disk.)
An expanded file also permitted a more flexible form of replay. Replaying from the trace file was possible, but it was one-way (forwards only) and took considerable time to execute, since all the original mathematics had to be performed again. With an expanded file, on the other hand, one was able to jump to any place in the game, stop, or go backwards. This proved helpful in getting an intuitive feel for what the various subjects were doing, since it allowed an observer to study the circumstances of a particular action in a flexible, easy way.
The expanded file then had its actions modified in accordance with the required replacement chart, by a program named for ‘action representation’. This was done twice, in series, to enable longer sequences to be converted than would be possible with only one application.
As well as the explicit actions, there was also the question of a reasonable human representation of the null action. The original expansion gave null actions for every time step (0.5 seconds) in which there was no explicit action. Since one of the priorities of the simulation was to get away from an over-dependence on critical timing, it seemed unreasonable to class the time steps immediately preceding an action as null—after all, the player may have just been a bit slower than intended or desired. A thinking-time parameter was therefore introduced, imagined to be around 1 to 3 seconds, and for this amount of time around an action no null actions were passed on. This was tried with thinking time extending only before, or both before and after, any action. Another way of describing the purpose of this would be to say that we want null actions to be registered when everything is fine, not in the thick of hectic action. Compare this with Card et al.'s ‘M’ operator (“mentally prepare”) in their Keystroke Level Model, for which the value given is 1.35 seconds.
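The thinking-time filter might be sketched as follows. This is an illustrative Python re-imagining under stated assumptions: half-second steps, a window counted in steps (4 steps = 2 seconds, within the 1 to 3 second range imagined above), and invented names.

```python
# Sketch of the thinking-time filter. One entry per 0.5 s step; None marks a
# step with no key-press. Window sizes are in steps (4 steps = 2 seconds).
def filter_nulls(actions, before=4, after=0):
    """Pass on a null action only when no explicit action is nearby."""
    keep = []
    for i, a in enumerate(actions):
        if a is not None:
            keep.append(a)
        else:
            # Is there an explicit action within the thinking window,
            # i.e. in the next `before` steps or the previous `after` steps?
            near = any(x is not None
                       for x in actions[max(0, i - after): i + before + 1])
            if not near:
                keep.append("NO_KEY")
    return keep
```

With `after` set to a positive value, the window extends on both sides of an action, corresponding to the ‘before and after’ variant mentioned in the text.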
Having thus reached the stage where the representation of actions was altered to what we may suppose is a more human-like form, it remained to decide what to do about the representation of the situations. Whereas the basic representation of actions was unequivocal (discrete key-presses), the representation of situations implied by the interface was not completely clear (see below, §6.4.3). The approach taken was to remain agnostic about the exact information provided, because the aim of the methods investigated was, in any case, to be able to tell when one representation was closer to the human one than another.
Having decided on a representation to test,
the selection of the attribute values in that
representation was done by the program called
sitrep (for situation representation).
The representation was defined by a hand-crafted file, listing
the attributes to be selected (see Figure 6.7).
Figure 6.7: Example representation files (RT01 and RT2)
The three sections of situation attributes are respectively
integer variables, floating-point variables,
and qualitative variables.
The fourth group is a selection of actions (classes) in
the single ‘decision’ attribute, along with their key codes.
Lower-level representations (such as the one marked RT01) contain only attributes that are explicitly present in the unmodified expanded data. Higher-level ones (such as the one marked RT2) include some quantities that are not explicitly present in the original data, and therefore have to be calculated on the spot.
In the example case, rov_off_head is the relative bearing of the closest active target from the ROV. It is calculated from the heading of the ROV and the true bearing of the target from the ROV. Pure_Turn_Stbd is a combination calculated from the two demands in the old representation, as shown in Figures 6.5 and 6.6 above. Pure_Turn_Port is defined similarly.
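The derived quantity rov_off_head can be sketched directly. This is an illustration only; the sign convention (normalising to the range -180 to 180 degrees) is an assumption, not taken from the original program.

```python
# Sketch of the on-the-spot calculation of the derived attribute.
# Assumed convention: positive = target to starboard of the ROV's head.
def rov_off_head(rov_heading, target_bearing):
    """Relative bearing of the target from the ROV's head, in degrees."""
    return (target_bearing - rov_heading + 180) % 360 - 180
```

The modular arithmetic handles the wrap-around at North, so a target just to port of a northerly heading comes out as a small negative angle rather than a bearing near 360.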
In the expanded file, the half-second intervals in which there was no key-press greatly outnumber those in which there was one. Any individual key, therefore, is greatly outnumbered by what might be regarded as null actions (NO_KEY).
It would be unhelpful to include all these as examples
for a rule-induction program, for two reasons:
Once sitrep had output the selected data (now in readable, ‘ascii’ form), it was a straightforward matter to format this for a rule-induction program. This was done by a program called indprep (for induction preparation), which had a command-line flag determining which rule-induction program the data should be prepared for, since there is no standard format.
Other flags determined whether
indprep should output a file
of examples (data) only, attributes (names), or both together.
As an example, the attribute file and example file (containing 30 examples) for subject G, representation RS2 (see Figure 6.9, below), interval 4 (as in Table 6.2 below), are given here as Figure 6.8. In each example, there is one entry for every attribute.
**ATTRIBUTE FILE**
rov_off_head: (FLOAT)
rov_u: (FLOAT)
rov_v: (FLOAT)
rov_target_range: (FLOAT)
rov_height: (FLOAT)
rov_speed: (FLOAT)
sub_display: ship rov umb env;
rov_av_revs_demand: f_astn hf_astn h_astn q_astn stop q_ahd h_ahd hf_ahd f_ahd;
stage: initial searching placing far close final pull_in infringe;
class: Both_Ths_Full_Astn Both_Ths_Half_Astn Both_Ths_Stop Both_Ths_Half_Ahd Both_Ths_Full_Ahd NO_KEY;

**EXAMPLE FILE**
48 7.98 0.00 1000.0 48.0 0.00 ship stop initial NO_KEY;
3 7.97 0.00 418.9 48.0 0.00 ship stop initial NO_KEY;
11 7.94 -0.22 320.3 48.0 0.00 ship stop placing NO_KEY;
30 7.60 -0.11 231.4 48.0 0.00 ship stop placing NO_KEY;
52 4.08 -0.01 150.4 34.5 4.14 rov stop far Both_Ths_Full_Ahd;
15 3.21 -1.08 134.9 34.5 3.39 rov f_ahd far NO_KEY;
15 3.19 -0.27 50.6 30.2 3.68 rov f_ahd far NO_KEY;
21 1.34 -0.72 16.1 12.1 1.53 rov f_ahd final Both_Ths_Stop;
15 -0.13 -0.11 14.1 8.5 0.88 rov stop final Both_Ths_Half_Ahd;
28 0.30 0.26 11.9 3.0 0.56 rov q_ahd final NO_KEY;
-56 0.24 0.56 15.7 2.0 0.61 rov q_ahd final Both_Ths_Half_Ahd;
-8 1.04 0.33 8.7 1.8 1.09 rov h_ahd final Both_Ths_Stop;
-49 0.13 0.25 4.5 2.1 0.29 rov stop final NO_KEY;
-20 -0.12 0.07 6.1 2.6 0.15 rov stop final Both_Ths_Stop;
27 0.06 -0.06 6.9 2.8 0.08 rov stop final Both_Ths_Stop;
-15 -0.15 -0.08 6.8 3.0 0.16 rov stop final Both_Ths_Stop;
56 0.09 -0.24 4.3 3.4 0.26 rov stop final NO_KEY;
-5 -6.69 -1.41 65.7 19.7 7.04 ship stop pull_in NO_KEY;
2 -0.67 1.12 1000.0 36.3 1.31 ship stop searching NO_KEY;
2 -1.44 -5.76 1000.0 37.9 6.86 ship stop searching NO_KEY;
2 0.59 -7.44 1000.0 39.2 7.98 ship stop searching NO_KEY;
-157 0.62 -8.28 487.4 39.8 7.57 ship stop searching NO_KEY;
-169 0.55 -7.37 465.4 40.3 8.36 ship stop searching NO_KEY;
178 0.61 -8.21 463.9 40.9 7.51 ship stop searching NO_KEY;
166 0.58 -7.49 483.1 41.5 8.18 ship stop searching NO_KEY;
-132 -5.27 -6.18 424.4 42.5 7.37 ship stop searching NO_KEY;
-129 -7.05 -4.36 229.7 45.2 7.55 ship stop placing NO_KEY;
-109 -3.64 -2.69 119.6 47.0 4.59 rov stop far Both_Ths_Full_Ahd;
5 1.94 1.50 80.1 47.4 2.46 ship f_ahd infringe NO_KEY;
-14 2.78 0.88 60.1 47.7 2.91 ship f_ahd infringe NO_KEY;

Figure 6.8: An instance of an example and attribute file for the CN2 induction program.
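The CN2 file layout shown in Figure 6.8 could be generated by something like the following sketch. This is a Python re-imagining for illustration; the function name is invented, and the original indprep was a separate program with command-line flags.

```python
# Sketch of formatting selected data for CN2 (names hypothetical).
def write_cn2(attributes, examples):
    """Render attribute and example sections in the shape of Figure 6.8.

    attributes: list of (name, values) pairs; values is None for a FLOAT
    attribute, or a list of the permitted qualitative values.
    """
    lines = ["**ATTRIBUTE FILE**"]
    for name, values in attributes:
        spec = "(FLOAT)" if values is None else " ".join(values) + ";"
        lines.append(f"{name}:{spec}")
    lines.append("**EXAMPLE FILE**")
    for ex in examples:
        # One entry per attribute, terminated by a semicolon.
        lines.append(" ".join(str(v) for v in ex) + ";")
    return "\n".join(lines)
```

A flag-driven front end could then select between this and the formats expected by C4 or ID3, since, as noted, there is no standard format.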
Three rule-induction programs were readily available: C4, ID3 and CN2. (The version of CN2 used was developed by Robin Boswell of the Turing Institute, as part of ESPRIT project 2154, the Machine Learning Toolkit.) C4 is based on the ID3 algorithm and, like ID3, produces output in the form of decision trees. When C4 was tried on larger data sets (a few thousand examples), it was found to be excessively slow, taking several hours to run, and on the largest ones it crashed; it was therefore rejected as a primary tool. Of the other two, CN2 was chosen as the more appropriate, because:
The unordered mode produces if-then rules whose antecedent is a conjunction of tests on any of the attributes. Disjunction (‘or’) is obtained by having a number of rules for the same decision class.
A standard method for generating and testing rules was adopted. This is to take a training set of data, and use the program to generate rules, then to take the training set and unseen test sets, and to evaluate the prediction performance of the rules on these data. This process leads to figures for the effectiveness of the generated rules for each decision class considered, and an overall prediction performance figure, which must be carefully compared with the prediction performance of a default rule before being able to assess its value.
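The comparison against the default rule can be made concrete with a small sketch. This is an illustration of the evaluation arithmetic only, not the actual CN2 testing harness; names are hypothetical.

```python
# Sketch of evaluating rule performance relative to the default rule,
# which always predicts the modal (most frequent) class.
from collections import Counter

def default_accuracy(classes):
    """Accuracy of always predicting the most frequent class."""
    modal, n = Counter(classes).most_common(1)[0]
    return n / len(classes)

def relative_performance(predicted, actual):
    """Percentage-point difference between rule accuracy and the default."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * (hits / len(actual) - default_accuracy(actual))
```

A positive figure means the rules genuinely beat “predict the modal class every time”; raw accuracy alone is misleading when, as here, null actions dominate the examples. The percentage-point differences in Tables 6.2 to 6.4 are of exactly this kind.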
The first comparison of representations was between those given above in Figure 6.7 (see also §B.1). This used CN2 in its ‘unordered’ mode, where the rules produced are independent of each other (their order is immaterial). This was, however, a facility newly added to CN2, and it had yet to be fully tested.
The example was interesting because, although it looked as if the second case performed better, in fact comparing the prediction performance of the rules with that of the default rule reveals that these rules did not score better at predicting human actions than the rule “do nothing all the time”. The default rule assigns all examples to the modal (most frequent) class, which in these cases is NO_KEY. So we obtain a figure for the default rule by summing the actual frequencies of NO_KEY examples and dividing by the total number of examples.
In the unordered case, with the representation RT01, the prediction performance of the rules, even on the training set, was very close to that of the default rule. On the test data, the prediction performance was substantially worse than the default.
Looking at the individual rules (§B.1),
the second rule makes sense in that it is saying that
when the first key-press of a ‘pure turn’ has been
carried out, the corresponding part then needs to be done.
In contrast, the fourth rule is quite implausible.
Simply from considerations of symmetry, we could
call into question a rule where there were symmetrical
conditions but an asymmetrical action.
In this case, the rule must be presumed to have emerged
from a coincidence in the data.
The figures after the rule indicate that it is not very
well supported even in the training data: we might
expect it to be even more poorly supported in test data.
But many of the other rules can be criticised in a similar way.
With the new representation RT2, the overall accuracy figures are not much different from the default-rule figures. But the rules look much better than those for the representation RT01. Firstly, there are fewer of them, which is an advantage. Secondly, most of them make good sense.
After discovering some unresolved issues
with the unordered mode in CN2,
the same data were reworked using the ordered mode
(see Appendix B.2).
This appendix illustrates the kind of rules
obtained using the ordered mode.
Briefly, the test-data results for the representation RT01 are still below the default values. For the representation RT2, the results only just manage to be better than default.
The subject with the longest experience was G, at George House.
His trace files were grouped into four equal calendar time
intervals: 02, 03, 04 and 05, in order from earlier to later.
The calendar divisions are October 19th, October 30th,
November 10th, November 22nd, and December 4th.
The data were fed through the processing described above, using action replacement charts generated for subject G from all his games together.
The 05 games (including the subject's best scoring game)
were used for generating rules.
Rules were generated on three representations of
ROV speed control, named RS0, RS1 and RS2.
(See Figure 6.9.)
Figure 6.9: Three slightly varying representations for ROV speed control (RS0, RS1 and RS2)
These were intended to be progressively more human-like.
The CN2 parameters ‘star’ and ‘threshold’ were both set to 10, and the ordered mode was used.
The rules generated were then tested against the data
from each of the divisions, 02, 03, 04 and 05.
The results are summarised in Table 6.2.
The numbers in the body of the table are the percentage points
difference between the prediction performance of the rules
and the prediction performance of the default rule.
The high scores for the interval 05 are due to the fact that this interval provided the training data.
The default rule generally scored around 60% to 70%,
and the 05 interval absolute scores are over 95%.
Table 6.2: Testing rules for interval 05, subject G, against defaults
There are two trends immediately apparent in this table. One is that RS1 and RS2 perform substantially better than RS0, with RS2 being slightly the better of the two. The other is that whatever rules were induced for the interval 05 were not much in evidence during interval 02, and progressively became more so. This is reassuring in two ways: firstly it suggests that the rules found are not imaginary, or due to random effects; and secondly that these rules are being adopted increasingly as time goes on. This is consistent with a common-sense view of learning.
An alternative way of dividing up the examples
is into sets of similar size.
This was done with subject M, but otherwise the
same procedure was followed as with subject G.
Table 6.3 summarises the results for M.
Table 6.3: Testing rules for interval 0499–0508, subject M, against defaults
The rules were again constructed on the data containing the highest score, which in this case was the 0499–0506 interval. The same general trend is apparent with respect to the representations as above.
The prediction performance of the rules across time again shows a build-up of prediction performance towards the training interval; but now also shows a subsequent decline. This could in principle be due either to a decline in task performance, with the acquired rules not being followed as strictly as before, or due to new rules supplanting the old ones. In this table, the overall accuracy figures have also been included, to show that there is in fact no rise in overall accuracy between the fourth and fifth intervals. The rise in the relative figure is due to a fall in the default rule accuracy, which, in this interval, implies that there were fewer null actions.
G and M were both in George House,
and took an interest in each other's games.
It is perhaps not surprising that similar patterns emerge in their results, and that a representation able to produce, for one of them, rules performing substantially above the default rule should also be able to do so for the other.
This is not so, however, for DM,
one of the Charing Cross Tower subjects.
His results, derived by exactly the same process as
the above results, are summarised in Table 6.4.
Table 6.4: Testing rules for interval 065–1, subject DM, against defaults
This table suggests that the rules do not reflect the actual rules being used by this subject. The first two columns suggest, rather more strongly, that the rules are substantially different from those used at the earlier stages of learning. The pattern for all the representations is similar, which suggests that none of RS0, RS1 or RS2 covers the attributes actually used by subject DM. However there is, if anything, a slight favouring of RS1 over RS2, contrary to the other subjects. Other representations would have to be found if we were to obtain results as satisfactory for DM as for G and M.
A number of issues arose in the previous section that will be further expanded here. This leads on to a review of what was learnt from this experiment, and arguments pointing towards what needed doing next.
The reliability of the total score as a measure of experience was compromised by the random number and distribution of the mines. The number was random to ensure that the search could not be broken off without covering all corners of the minefield, which would enable an unfairly quick return, as well as being unrealistic. In an attempt to counter this problem, the scoring system allotted points for each mine found. However, the scoring was fixed before a great deal of experience had been gained, and it was subsequently discovered that experienced players gained more points by dealing with a mine than they lost through the extra time taken. Hence higher scores could be obtained when there are more mines, and the actual highest score of a player depended not only on their skill, but also on their luck in the allocation of mines.
A further problem with the reliability of the scoring comes from the catastrophic nature of an accidental mine explosion. A subject could be performing very well, but such an explosion would cause the total score to be highly negative. Thus good scores would be mixed in with very bad ones. For these reasons, it was felt that any graph of raw scores over time would be of little value.
The psychological impact of the scoring system is difficult to evaluate, and this will not be attempted. It may be noted that the task would change depending on whether the subjects were instructed to achieve the single highest score, or to achieve the best overall average score, or somewhere in between these two extremes. The emphasis in the experiment was only on achieving the highest single score, and this meant that when a subject accidentally blew up a mine, or did something that led to a long delay, the game was sometimes abandoned at that point, presumably on the grounds that a high score could not be obtained.
Detailed consideration of the task, a priori, suggests several possible types of action that the player might perform. Correct identification of the type or types of action performed is potentially important to any analysis of this kind, since an analysis designed to find certain kinds of action might fail to find other kinds. These could include:
Dealing first with slips, we may note that some of the unintended key-presses have no effect. These can be taken out in the process of analysis. Other slips will contribute noise to the data, with the result that the induced rules will be less accurate and perform worse on prediction.
With knowledge-based processes, we could expect in theory to be able to induce rules if we know all the factors that are taken into account, and the intermediate concepts involved, in the knowledge-based process. This would be comparable with defining the terminology with which to construct an expert system, and would involve defining appropriate higher-level concepts in terms of lower-level ones: there is nothing in general to prevent this being done, but it may require much knowledge or theory about the knowledge-processing mechanisms. We are unlikely to be able to capture much of this level with the relatively straightforward methods that are used in the present study.
Information-seeking actions could be of two types: either actions directly altering the selection or presentation of information, or actions affecting the simulation itself. The information selection actions may reveal something about the information being used or considered at a particular time: however, in the first version of the simulation game, there was still so much information present concurrently (especially graphical) that our knowledge of the player's information usage is advanced only slightly, if at all. This approach to understanding the player had yet to be explored. More difficult to formalise are the actions which may be characterised thus: “give it a nudge to see how much it moves, and then you'll know how much to push it”. If this kind of action were being used, it would tend to obscure rules about how large an action to make in differing circumstances, since the initial nudge might be similar in the different situations, with only the following action differing; and that action not differing on the basis of the static quantities, but on the dynamic response of the thing that was nudged. However these information-seeking actions are dealt with, there are likely to be fewer of them the more practiced the subject is, since the desired quantities will be more likely to be known. For exploratory actions, again, the more practiced the player is, the less likely they are to occur. This reinforces the desirability of concentrating on well-learnt situations.
There is also a philosophical aspect to the question of the nature of actions, i.e., how we are to represent actions in general. This has a large effect on the methods of analysis. Firstly, we could consider an action as directly corresponding to the state that it brought about. An analysis on this basis will work if every situation has corresponding unique control settings appropriate to that situation. For example, if the ship is moving forwards at a reasonable speed, and the desired direction is more than (say) 15 degrees to port, then the desired rudder setting is hard port. This fits into the paradigm of pattern recognition and means-ends analysis: knowing how things ought to be leads to appreciation of the discrepancies between the actual and the desired state, and thence to steps to reduce the difference.
A second approach is to consider all actions as interventions, not necessarily determined by the objective state that is brought about. One can characterise the above example in this way, by adding that if the rudder setting is not hard port, then set it to be so. In this second approach, the dependency of actions on the current control setting is emphasised. It may be that this is more appropriate for serial actions, and explicit rules; while the first may be more appropriate for parallel actions without conscious attention.
Which approach is taken has implications for treating null actions. It is evident that at times, an operator is consciously not intervening, because everything is within the operator's limits of acceptance. If a ‘desired state’ approach is taken, the concept of action has no default: there is always some desired control state; every situation has some appropriate response, even if this does not entail altering the controls. If the correct response is not known, some measure of closeness will give a situation which is similar, and whose action can fill the unknown. With an ‘intervention’ approach, on the other hand, there is a default action of ‘do nothing’, in just those cases where there is no appropriate intervention. This does, however, raise the problem of granularity of actions, in that it is far from clear how many null actions to attribute to any given space of time free from positive actions.
Choosing exclusively one or the other approach seems over-rigid. However, purely for ease of analysis, we may note that one can always express desired-state actions in terms of interventions contingent on the current state of the controls, as well as the outside world; whereas one cannot always express interventions in terms of desired states. For this reason, the analysis in this study is in terms of interventions.
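The contrast between the two framings can be made concrete with a pair of hypothetical rules for the same behaviour. The thresholds and names below are invented for illustration; neither function is taken from the study's rule sets.

```python
# Two framings of the same hypothetical behaviour (illustrative only).

def desired_state_rule(situation):
    """Desired-state framing: every situation maps to desired settings."""
    if situation["off_head"] < -15 and situation["speed"] > 2.0:
        return {"rudder": "hard_port"}
    return {}                       # no change to the desired settings

def intervention_rule(situation, controls):
    """Intervention framing: the action also depends on current settings."""
    if (situation["off_head"] < -15 and situation["speed"] > 2.0
            and controls["rudder"] != "hard_port"):
        return "set_rudder_hard_port"
    return "NO_KEY"                 # explicit default of 'do nothing'
```

Note how the intervention form carries the extra condition on the current rudder setting, and so has a natural null action, whereas the desired-state form always denotes some target configuration. The intervention form can always express a desired-state rule (as here), but not vice versa, which is the asymmetry exploited in the analysis.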
The choice of approach with respect to actions is to some extent a pragmatic rather than a theoretical one. Constructing a complete theory incorporating all these types of actions would be a huge enterprise, encompassing a great deal of cognitive psychology. The choice of rule-governed actions may be justified as a starting point firstly by considering the relevance of regularities to the kinds of applications we are considering (and the relative lesser relevance of other actions); secondly by recognising that knowledge-based actions have been the subject of much investigation, both in AI and in learning systems (e.g., ); and thirdly by discounting the practicality of investigating information-seeking, exploratory and whimsical actions as being a much more difficult place to start.
The information displayed at the interface falls into two sections. The ‘sensor’ section contains only numeric data displayed as numbers, and this clearly defines a set of primitives which we can take as the basic representation of these quantities. But for the ‘graphic’ sections, it was much more difficult to decide what was being displayed. One view might take the content of what is displayed to be the system variables that are used in the construction of the graphical display. However, the inference of other quantities is so immediate and intuitive, that it is difficult to avoid the idea that this information is also being presented in the display.
A simple example concerns the ROV's heading. One numerical sensor gives, in whole degrees, the heading of the ROV in the conventional way (000 to 359 clockwise from North). Another sensor gives the bearing of the closest unknown or unsafe target. There was no explicit offering of the bearing of the closest target relative to the ROV's head, and yet this was obviously going to be a significant quantity, and it was one which was immediately apparent (though in an unscaled form) from the ROV graphic display, as long as the target was within the viewing region. A very close parallel exists with the ship's heading and associated quantities. In the case of the ship, the relative bearing can be immediately seen from the general position indicator.
Thus it is a real difficulty with graphic displays to achieve any degree of objectivity about what information is being provided, and hence, for any higher-level representation, how much information processing is being done by the interface and how much by the human. This uncertainty remains unresolved in the present case, because there is no a priori way of being sure that the information you wish to present has been presented effectively.
One possibility for formalising some graphic information is to focus on significant events, and allow that the display effectively gives a rough idea of time-until-the-event. Of course, this need not be displayed as such, but the combination of perceived distance and motion can easily be seen as giving a time measure. Such time measures have an established history in theories of mariners' actions in collision avoidance. For a discussion of the “RDRR” criterion (Range to Domain over Range Rate) and its use in an intelligent collision avoidance system, see [15, 27, 28, 30]. A slightly simpler concept, “TCPA” (Time to Closest Point of Approach) is also used in many places (e.g., [111, 129]).
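TCPA itself has a simple closed form for straight-line relative motion. The sketch below is standard kinematics, not code from the study or from the cited collision-avoidance systems.

```python
# Time to Closest Point of Approach for straight-line relative motion.
def tcpa(rel_pos, rel_vel):
    """TCPA in the same time units as rel_vel.

    rel_pos: target position relative to own ship, (x, y)
    rel_vel: target velocity relative to own ship, (vx, vy)
    Positive result: closest approach lies ahead; negative: already past.
    """
    v2 = rel_vel[0] ** 2 + rel_vel[1] ** 2
    if v2 == 0.0:
        return float("inf")         # no relative motion: range never changes
    # Minimise |rel_pos + t * rel_vel| over t.
    return -(rel_pos[0] * rel_vel[0] + rel_pos[1] * rel_vel[1]) / v2
```

A graphic display showing position and motion implicitly offers just this sort of quantity, in unscaled perceptual form, which is what makes the ‘time until the event’ reading of such displays plausible.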
Another difficulty with representation arises in connection with the manoeuvring of the ship. The general position indicator (the upper graphic display present at all times) sometimes confronts the player with a pattern of targets that have complex implications. What is the best place to stop the ship, so that the most targets can be dealt with at once, and leaving the ship in an advantageous position to proceed? To come to a decision on this clearly requires an overall view of the disposition of the targets, and since the precise pattern of targets repeats itself extremely rarely, any routinizing of these decisions could not be linked to precise identity of the conditions.
It is plausible to consider this as an example of knowledge-based behaviour, since in the time available people are likely to still be trying out different approaches, and developing ways of categorising arrangements of targets into groups indicating the best action to take. In the author's experience, a considerable amount of conscious thinking goes on in the consideration of where to stop the ship, though this thinking may not be verbal. Alternatively, one could consider it as a visual pattern-matching process. An attempt to analyse this in symbolic terms would inevitably involve many pattern and shape concepts, which would be difficult to derive from data such as is in the present study, because this experiment was not designed to address pattern issues. In the longer term, we might be able to ascertain which attributes were relevant to ship positioning decisions, and we might be able to devise a method of learning how to recognise values of these attributes from the original Cartesian data of the simulation. These questions are difficult enough to constitute independent problems, and since there is little necessary overlap with the present lines of enquiry, issues involving the processing and use of patterns are not followed here.
Reviewing the state of results at the end of the first experiment: