©1990, 1995

Chapter 6: The Sea-Searching Simulation task and first experiment

6.1 Aims

We studied the suitability of various candidate experimental systems, and concluded that a new system needed to be built. The aim of the simulation system was to produce data from human control of a system; this data would then have to be prepared, analysed and explored, with the ultimate aims of, firstly, trying out the method of using data gathered in this way, and secondly, exploring the nature of human control.

An objective study of controlling complex systems was wanted, not relying on verbal reports, or purely subjective interpretations. The most promising focus for such an objective study appeared to be the representational primitives used by humans in the cognitive processes underlying their control decisions. The justification for this comes from considering a key feature of rule-induction algorithms.

Since rule-induction algorithms are noted for their dependence on the suitability of the representation primitives (as we have noted, §3.2.3), the possibility exists of turning this connection round, and using the effectiveness of rule induction as a measure of merit of the representation. We should note, however, that we could fail to get good results on many grounds, only one of which is the quality of the representation. If some other problems exist (e.g., see below, §6.4.2), there would be some limitation on the performance of the rules, and we might get only relatively small effects from changing the representation.

Hence subsidiary aims were to produce a means of preparing the data in accordance with a variety of representations, then to test the performance of a rule-induction program with the prepared data. The relative performance of rules generated following the different representations would then reflect their relative merit, and hence give some lead on the correspondence of varying representations with a supposed inherent structure of the data. In human control of a complex system, this would in turn be evidence for or against the claim that certain concepts were salient features of an operator's ‘mental model’ of the task or the system, irrespective of whether the concepts were verbalisable or not.

Inevitably, the aims of this first experiment could not be defined more closely than this in advance, since it was not at all clear where difficulties would occur, and where progress would be halted.

6.2 The design and implementation of the task and interface

Since the objective of the research was to investigate human performance of complex tasks, a task had to be built. The chief factor of importance in the design both of task and interface was to provide a source of data suitable for analysis.

The key criteria influencing the design have been discussed above (§5.1). It was generally accepted that the important aspects of a program implementing these design principles were principally that it worked, secondarily that it was able to be updated as required, and only last and very much least the elegance or finer details of the coding. Discussion in this section therefore concentrates on the important design decisions and how they were implemented.

The simulation program, and all the analysis programs with the exception of the rule-induction program, were written by the author.

6.2.1 General implementation details

General structure

The general approach taken in the construction of the program was that of top-down functional decomposition. In the main function, after initialisations, there is a main loop from which are called other functions, which deal with the simulation of each simulated object, the scoring, the interface interaction, and the logging of the actions.

On-line help had the advantage that access to it could be recorded in the same way as the other interactions. For that reason, the main function was designed to cope not only with the simulation interactions, but also with the ancillary interactions, including access to help, that surrounded the individual games.

The total length of code in the simulation programs was around 10000 lines. Of this, the code specifically for replaying accounted for about 1500 lines, help accounted for some 1000 lines, the interface for about 3500 lines and the simulation itself for approximately 4000 lines. These are approximate values, not only because the layout of the code is arbitrary, but also because there was not always a clear separation between the code for the different functions. However, these figures do give a general indication of the relative complexity of the various aspects of the program.

Time in the program
The obvious approach to maintaining a simulation in real-time is to do everything that is necessary for one step, then to wait, checking the system real-time clock, until the time comes to do another step. In this program, the simulation steps need to be performed, and the interface display refreshed, both of which would clearly take more than a few milliseconds.

For consistency, a particular length had to be chosen for the time loop, and the choice of length was influenced by two factors: firstly the maximum amount of time that the necessary program steps might take, and secondly the suitability of this time interval from the user's point of view. This second consideration can be further broken down into two points: on the one hand, the refresh rate had to be fast enough to allow the user to have a sense of immediacy in control and feedback, but on the other hand, if the refresh rate was too fast, the user's performance would depend more on exact timing, reintroducing the effects of psycho-motor limits, which were seen as undesirable.

After the initial coding had been done, it was clear that an adequate simulation step and interface refresh could be computed on the chosen system within about 0.2 seconds. However, following the discussion in §5.1.3, it was undesirable to set the refresh interval too close to a typical simple human response time, as this would create the possibility that non-cognitive aspects of response time would be an important factor. Half a second was therefore tried; no subjects complained about the refresh rate being too slow, and this timing was accepted. (On other available systems, the same computations could well have taken over 0.5 seconds, forcing an undesired choice of timing.)

It was found that using the same interval of 0.5 seconds for the length of the simulation step caused problems with the cable simulation, and was in any case unnecessary since one simulation step for all the objects took only a small fraction of a second. After trying different values, a simulation interval of 0.1 seconds was fixed on as giving a reasonable balance between accuracy and low computation time. In each half second therefore, five simulation steps for each object are performed, in immediate sequence.
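In outline, this timing scheme amounts to the following (a minimal Python sketch; the original program was written in C, and the function and constant names here are illustrative):

```python
import time

SIM_DT = 0.1      # simulation step, in seconds
FRAME_DT = 0.5    # display/interaction interval, in seconds
SUBSTEPS = 5      # five simulation steps per half-second frame

def run_frames(n_frames, step_objects, refresh):
    """Run n_frames half-second frames; each frame performs five 0.1 s
    simulation sub-steps in immediate sequence, then refreshes the
    display and waits for the real-time clock to catch up."""
    next_frame = time.monotonic()
    for _ in range(n_frames):
        for _ in range(SUBSTEPS):
            step_objects(SIM_DT)   # advance cable, ship, ROV, targets
        refresh()                  # redraw the interface once per frame
        next_frame += FRAME_DT
        delay = next_frame - time.monotonic()
        if delay > 0:
            time.sleep(delay)      # idle until the next half-second boundary
```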

6.2.2 Description of the task

The task scenario

The overall task was to identify all suspicious objects in an area of sea-bed, to dispose of the mines, and to return to the starting area. This was done with a ship, a ‘remotely operated vehicle’ (or ROV: a small unmanned submarine), and an umbilical cable that connects the two together. A short way of describing how to perform the task was thus:

repeat:
    find a target;
    send the ROV to look at it;
    if it is a mine, fly the ROV to the required position and, at some later point, detonate the mine;
until all targets are dealt with;
then return to home.

The task was more closely defined by the scoring system. This was as follows:

  1. a large bonus was given for completion of the task, which occurs when all inert targets have been identified as such, and all mines destroyed, and the ship has returned to its home area;
  2. a bonus was given for each target correctly identified;
  3. a penalty was imposed for misidentification;
  4. a bonus was given for each mine destroyed;
  5. a penalty was imposed for navigating the ship in an unsafe area, too close to a potentially dangerous target;
  6. one point was taken away for each half-second from start to finish.
All the values were changeable, the idea being that changes in the scoring system could be used to modify the characteristics of the task. In the first experiment, these values were 5000 for completion, 500 for each correct identification, 100 for each misidentification, 500 for each mine disabled, 10 points per half-second spent within 100m of an unsafe target, and a variable damage score if a mine exploded with a vessel within range.
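Assuming that penalties are simply subtracted from bonuses, the scoring scheme can be sketched as follows (illustrative Python; the constants are the first-experiment values quoted above, but the function itself is a reconstruction, not the original C code):

```python
# Scoring constants from the first experiment (§6.2.2); a 'tick' is one half-second.
COMPLETION_BONUS = 5000
IDENT_BONUS = 500
MISIDENT_PENALTY = 100
DISABLE_BONUS = 500
UNSAFE_PENALTY_PER_TICK = 10   # per half-second within 100 m of an unsafe target
TIME_PENALTY_PER_TICK = 1      # per half-second from start to finish

def game_score(completed, idents, misidents, disabled,
               unsafe_ticks, elapsed_ticks, damage=0):
    """Compute the total score for a game from its event counts."""
    score = 0
    if completed:
        score += COMPLETION_BONUS
    score += IDENT_BONUS * idents
    score -= MISIDENT_PENALTY * misidents
    score += DISABLE_BONUS * disabled
    score -= UNSAFE_PENALTY_PER_TICK * unsafe_ticks
    score -= TIME_PENALTY_PER_TICK * elapsed_ticks
    score -= damage                # variable damage if a mine explodes in range
    return score
```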

It can be seen from this that there were trade-offs between speed, on the one hand, with its low time penalty, and care, on the other, to avoid the risk of explosion. This kind of trade-off is common in the control of complex systems, wherever areas of danger lie close to otherwise desirable paths, and it was generally seen as important to the relevance of the game that such trade-offs were set up. Another way of describing the same aspect of the task is that there may be multiple conflicting goals at any time, and the operator has to find his or her own balance.

The simulation of the objects in the task

The simulation is divided into four parts: one each for the ship, the ROV, the umbilical cable, and the targets. The first three, which are the controllable objects, also have their own sub-displays, only one of which can be viewed and used at a time. Since the cable potentially affects both ship and ROV, it is simulated first, followed by the ship and the ROV, and finally the targets, which explode if the ship or the ROV has done the wrong thing.

The umbilical cable simulation

This was the most problematic part of the simulation. Cable models exist based on finite element analysis [74], but these tend to be computationally very intensive, and therefore probably inappropriate for a small-scale real-time model such as the one built. Simple models are easy to imagine and implement, such as an elastic cable without water resistance lying in a straight line between the ship and the ROV. The problem with such simple models is that their behaviour is both counter-intuitive and unrealistic, and this lack of realism could easily distract the operator from the task towards trying to discover how the cable actually behaves.

For this task, the author therefore constructed an original model. This model is based on the fiction that the cable can be represented for many purposes by a single point halfway along its length. The elastic forces can be dealt with reasonably in this way, and the motion of the representative point provides a basis for calculation of overall water-resistance. It was not clear how good this model would be, so it was implemented, and tested by manoeuvring the simulation in ways that would discover the model's limits.

The model was then refined a number of times, by introducing factors which appeared relevant to the discrepancies between the actual and the desired behaviour. Since intuitive plausibility was more important than technical accuracy, the desired behaviour was that which did not appear counter-intuitive. The author makes no claims about the accuracy of the resultant model, only that it seems to behave in a reasonable and interesting way.

Other problems that had to be tackled included unstable oscillations of the cable in tension. This can be due to the length of the time step being too large to enable quick changes to be dealt with properly (cf. [93] “As is well known, explicit finite-difference methods for initial value problems are susceptible to numerical instability if too large a time step is taken”). This was solved by insisting that the cable mid-point could not go to the other side of its equilibrium mid-point position in one simulation step.
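The clamping rule can be illustrated in one dimension (a Python sketch; the spring-and-drag dynamics here are a generic stand-in, and only the clamp itself reflects the fix described in the text):

```python
def step_midpoint(pos, vel, equilibrium, stiffness, drag, dt):
    """Advance the cable's representative mid-point by one explicit
    Euler step, then clamp so that it cannot cross its equilibrium
    position within a single step (the anti-oscillation rule described
    above). The force model is illustrative, not the thesis model."""
    accel = -stiffness * (pos - equilibrium) - drag * vel
    new_vel = vel + accel * dt
    new_pos = pos + new_vel * dt
    # If the step would carry the point past equilibrium, stop it there.
    if (pos - equilibrium) * (new_pos - equilibrium) < 0:
        new_pos, new_vel = equilibrium, 0.0
    return new_pos, new_vel
```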

The ship simulation

Accuracy was more important here, since YARD had numerous experienced mariners with finely-tuned ideas about how a ship should behave. The model constructed was based on a mathematical model of an actual vessel design [8]. Since the ship design was not entirely up to date, the information was not highly sensitive; as a precaution, however, the parameters were altered slightly, without a large effect on the behaviour of the ship. The original model existed on paper (it had not been programmed), being a detailed model of a ship's behaviour in calm to moderate weather, taking into account all six degrees of freedom: roll, pitch, yaw, surge, sway, and heave. Douglas Blane of YARD simplified the model by cutting out roll, pitch, and heave, and making simplifying assumptions about the rudder; the author implemented this simplification. The model takes into account wind, waves, and tide, although these were not used in the first experiment other than in a casual exploratory way.

The propeller and rudder controls were modelled as if controlled by servos, with a fixed rate of alteration, so that it took a reasonably realistic amount of time to achieve a given control demand. These parameters were decided on after informal consultation with experienced personnel at YARD.

The ROV simulation

YARD had a model of a particular ROV (Remotely Operated Vehicle) implemented as a mock-up simulation, with a scenario of inspecting the legs of oil drilling platforms. This vessel simulation, based on previous research [92], was slow, ungainly, and asymmetric, and had directable thrusters and a camera that could be tilted and panned. This simulation was too slow for the kind of situation envisaged, with too much unnecessary detail and high fidelity in the hydrodynamics, which would have made it difficult to adapt to the chosen implementation environment. The author therefore implemented a much simpler vessel, with much simpler hydrodynamics, in which the original six degrees of freedom were reduced to four by ignoring (setting to zero at all times) roll and pitch.

A number of additional features were included. The effect of the umbilical cable on the ROV was modelled, and turned out to be an operationally important constraint even before the cable became fully taut. Realism and plausibility were further enhanced by modelling the interaction of the ROV with the sea bed. The author devised a model of sticking in the mud, whereby the ROV could be freed from gentle collisions with the bottom using upwards thrust, while heavier collisions needed the cable to be reeled in. As with the ship, the effect of tide was modelled, but not used in the first experiment. Collisions of the ROV with the target objects, or with the ship, were not modelled in the first experiment.

The target simulation

To maintain player interest and uncertainty, it was decided to have the sea-bed targets randomly positioned in a given area. The precise time of giving the order to start a game gave a random number seed, and pseudo-random numbers generated from this seed gave the number, type and position of the targets. This seed was recorded so that precisely the same set-up could be regenerated for a replay. Randomly varying the type of targets meant that the player did not know whether a target was dangerous or not before observing it at close quarters, and the random number of targets (with a mean value of five, but soon constrained to be at least five) prevented the player from going back to base before checking in all corners of the minefield.
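The seeded regeneration idea can be sketched as follows (Python; the distributions, field size, and mine probability are illustrative assumptions, since the text specifies only a mean of five targets, later constrained to at least five, with random type and position):

```python
import random

def generate_targets(seed, mean_targets=5, min_targets=5, area=1000.0):
    """Regenerate the target layout for a given recorded seed, as
    needed for replays: the same seed always yields the same number,
    type, and position of targets."""
    rng = random.Random(seed)
    n = max(min_targets, rng.randint(mean_targets - 2, mean_targets + 2))
    return [{"x": rng.uniform(0.0, area),
             "y": rng.uniform(0.0, area),
             "is_mine": rng.random() < 0.5}   # dangerous or inert
            for _ in range(n)]
```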

There was no official information on which to base the behaviour of the mines, so the author implemented his own idea of how an acoustically operated mine might work. In the simulation, it is set off by the ship propellers or ROV thrusters being at too high a speed too close to the mine.

6.2.3 Description of the interface

As discussed above (§5.1.2), the task needed an interface that was, on the one hand, sufficiently low-level to require both substantial learning and the creation by the operators of higher-level structure above the level of the interface; but, on the other hand, not so low-level that the task took too long to learn (as would have been the case with a typical live complex system). Pitching the game fairly near the lowest level of the task described would enable the observation of the learning of easily understandable higher levels. The level of interaction would be confirmed as not too low if the subjects were able to learn the task to a fairly stable state in the allowed time.

Physical method of interaction

In contrast with, for example, the Iris flight simulator, it was decided to conduct all user input through pressing the mouse buttons. It is a natural extension of the mouse terminology (and common, if slightly loose) to refer to the active areas on the screen as buttons, and the action of pressing on the mouse button while the cursor is in a particular active area as a “button-press”. The advantages of this are that firstly all the ‘buttons’ can be labelled with their effect, eliminating the need for the user to memorise codes or consult help screens in the middle of a run. Secondly, immediate visual feedback can be given, which in the case of this interface was done by highlighting the background of a button that had just been pressed.

Also discussed earlier was the goal of minimising the significance of motor skill and psycho-motor limitations. The interaction was therefore designed to rule out effects dependent on small fractions of a second: only one button-press was taken into account in each half-second, and presses took effect only at the half-second boundaries.

Separation of the different functions in the interface

The ‘Seeheim’ model [44] of user interface management conceptually separates the interactions that deal solely with presentation of information from those that affect the controlled system itself. This was adopted as one of the design principles. This decision having been made, there were four general divisions of the interface:

  1. sensors presenting information about the system in numerical or verbal form;
  2. sensors presenting information in a graphic form;
  3. effectors affecting the simulation;
  4. effectors affecting the presentation.
These four divisions were reflected on the screen by having four columns with different coloured backgrounds, one column for each division. So as to maximise the distinction between presentation and simulation effectors, these were situated on opposite sides of the screen.
Division of the interface into sub-displays

The next important design decision was whether to have all the information present at once. There were three good reasons not to. Firstly, to put all the information on one screen would produce very small areas of screen which could not hold many characters of a reasonably sized font, and the effectors would be more difficult and slower to locate than ideal. Secondly, practical complex systems, if their interfaces use VDU screens, tend to need to split the information into a number of different screenfuls. Thirdly, having less than all the information on the screen at once would limit the obvious possibilities for the information being used at any particular time, which would help the analysis of operators' decisions. The sensors relevant to a group of effectors should be displayed along with those effectors.

The highest-level divisions apparent were between the different objects of the simulation: the ship, the ROV, and the umbilical cable. It was therefore decided to divide the screen horizontally into two parts, one of which would show information relevant to the task as a whole, or all the objects of the task, and the other would show sensors and effectors either for the ship, or for the ROV, or for the cable. The resultant appearance of the interface is shown in Figure 6.1, with the ship sub-display showing.

Figure 6.1: The interface in the first sea-searching experiment

Implementation of the interface

Structure and modifiability
One of the main considerations in the design of the interface was modifiability. Thus, the program had to be designed so that it was easy to change the details of any of the interface elements, or add new elements or take them away.

For example, consider what is involved in adding a new element to the interface; the steps required are listed below, after a description of the implementation structure.

The basic conceptual structure used in the implementation of the interface was a hierarchy of sub-displays, columns, rows and elements. This was reflected in a four-dimensional array of structures, each one of which contained the information relevant to the display element. The positioning of the elements on the screen was taken care of by automatically allocating them equal spaces in their row, which were allocated equal heights in their column. The overhead involved in this was the maintenance of hard-coded arrays of the number of rows in each column, and the number of elements in each row. This was much easier to keep updated than would be the alternative approach of changing the element positions by hand each time the number of elements changed. Columns were sized in a hard-coded fashion, since changes were not anticipated at this level.

One function was responsible for calling the many functions needed to effect the actions. The rest of the information about the interface elements was kept together hard-coded in a function which was used at initialisation. For the addition of a new element, one therefore needed to do little more than:

  1. change the number of elements;
  2. allocate a new symbolic name (with a #define);
  3. if the new element was an effector, add a function to implement its effect;
  4. create new entries to the initialisation function, on the model of existing ones.
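The hierarchy and automatic layout described above can be illustrated roughly as follows (a Python sketch; the original used a four-dimensional C array of structures, and column widths were in fact hard-coded rather than equal as here):

```python
def layout(columns, screen_w=1024, screen_h=768):
    """Compute element rectangles for one sub-display's column/row/
    element hierarchy. Rows share their column's height equally, and
    elements share their row's width equally; columns are given equal
    widths here for simplicity. Returns name -> (x, y, w, h)."""
    boxes = {}
    col_w = screen_w // len(columns)
    for ci, rows in enumerate(columns):
        row_h = screen_h // len(rows)          # equal heights in the column
        for ri, elements in enumerate(rows):
            elem_w = col_w // len(elements)    # equal spaces in the row
            for ei, name in enumerate(elements):
                boxes[name] = (ci * col_w + ei * elem_w, ri * row_h,
                               elem_w, row_h)
    return boxes
```

Because positions are derived from the counts alone, adding an element to a row only requires updating the counts and initialisation entries, exactly as in the four steps listed above.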
Physical interaction
The main time loop executed once every half-second. In order to avoid the effect of precisely timed actions, two measures were taken. Firstly, button-presses occurring during a half-second interval were stored until the end of that interval and actually executed during the next interval, simultaneously with the button being highlighted on the screen. Secondly, the interface only accepted one button-press per half-second, and ignored any others coming in the same half-second in which one press had already been stored. Nor were the button-presses queued.
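These two measures can be sketched as follows (illustrative Python; the class and method names are not from the original program):

```python
class ButtonLatch:
    """Accept at most one button-press per half-second frame; the press
    is stored and executed at the start of the next frame. Further
    presses in the same frame are ignored, not queued."""

    def __init__(self):
        self._pending = None

    def press(self, button):
        """Called on a mouse click in an active area."""
        if self._pending is None:     # first press this frame wins
            self._pending = button
            return True               # accepted: button is highlighted
        return False                  # ignored: audible warning only

    def end_of_frame(self):
        """Called every half-second: yield the press to execute now."""
        pending, self._pending = self._pending, None
        return pending
```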
The help screen, etc.

Obviously, for such a complex task, a user would initially be in great difficulty if there were no explanation of the game. The provision of help also enabled some useful knowledge to be presented, so that the user did not need to discover it from experience, which would have retarded learning in general. Help could be fitted into the form of the interface as designed for the game, and was thus provided on-line, with the added advantage that its use could be monitored, and studied if thought worthwhile. Another feature provided was the ability to replay previous games. Two demonstration examples showing the operation of the ship and the ROV were included, but none showing a complete game by another player, as this was thought likely to influence the style of the beginner.

The text of the help messages was kept in separate files, read in when a particular section of help was requested. This enabled changes in the format of the display without changes to the text itself, and changes in the help without recompiling the program. The content of the help for the second experiment, which differs slightly from the first version, is given in Appendix A.

Logging data and replaying runs

Logging data was achieved by maintaining an internal array of the actions taken, at the level of individual button-presses. (See below, Figure 6.2.) This table was written out to file at the end of the run. Writing to file during a run could have caused brief interruptions in the continuity of the game.

Replaying of runs selected through the help screen was implemented by reading in the appropriate trace file, and executing the same process as in an ordinary game, with the entries in the trace file used instead of the player's button-presses. This replaying relies on identical simulation steps being performed and the actions being applied at exactly the right time: since there is no record of the process variables, any small error would quickly accumulate and disrupt the correspondence between conditions and actions. Getting this right was difficult.
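The replay mechanism can be sketched as follows (Python; all names are illustrative stand-ins for the real C functions):

```python
def replay(events, apply_press, step_half_second):
    """Deterministically re-run a game from recorded actions. `events`
    maps half-second tick numbers to recorded button-presses; the
    simulation is stepped identically for every tick, with recorded
    presses applied in place of the player's. Because no process
    variables are logged, any drift would accumulate, so the tick
    alignment must be exact."""
    if not events:
        return
    for tick in range(max(events) + 1):
        if tick in events:
            apply_press(events[tick])  # executed at the frame boundary
        step_half_second()             # five 0.1 s sub-steps inside
```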

6.2.4 Portability

As it is, the interface is not portable to any other system. This is because of the non-standard nature of:

  1. the window system;
  2. the screen coordinate system;
  3. the graphics commands.
However, as far as possible, the non-portable code was kept in one source-code file, which accounts for some 10% of the source code.

It would be a substantial task to re-implement the program on a different system.

6.3 Methods and results

6.3.1 Game organisation

The stand-alone nature of the game, with all the necessary information incorporated on line, allowed a user-directed experimental setting with minimal supervision. Identical versions of the game were used on two sites: George House and Charing Cross Tower. In George House (GH), people were not prevented from watching each other, and indeed this was part of the method of stimulating interest. Hence the amount of observational experience obtained by the subjects before starting the game was unrecorded. At Charing Cross Tower (CXT), the more structured environment meant that the subjects had little or no prior exposure to the game.

6.3.2 Subjects

Unpaid volunteers were recruited by word of mouth and local posters. The selling pitch was that the simulation was more interesting than, and different from, other computer games. A spirit of competition was fostered, firstly, by providing a scoreboard as part of the help, and secondly by offering a small prize (unspecified at the start) for the highest overall score in a single game at the end of a given period. Four subjects at GH, and two at CXT (other than the author), achieved at least a basic competence allowing them to complete games within a reasonable time-span. We will abbreviate those at GH to R, G, S and M, and those at CXT to DM and DJ.

6.3.3 Data collection

As explained above (§6.2.3), the interaction with the simulation was entirely through the mouse. The keyboard itself had no effect, except that during the course of the experiment, a facility was introduced so that players could skip short amounts of time (when they knew in advance that they would not want to take any actions) by pressing a number key. The key 1 skipped about 10 seconds-worth of action, 2 about 20 seconds-worth, and so on. In no case could this increase the score, and these actions were not recorded in the first experiment.

The primary data consisted of time-stamped records of every legal key-press. This was recorded in a format of five blank-separated numerical fields per line, one line representing one key-press (see Figure 6.2). These files will be referred to as ‘action trace files’ or simply ‘trace files’.

00018 2 3 2 4    Both_Props_Full_Ahead
00021 2 3 5 0    Rudder_Hard_Port
00049 2 3 5 2    Rudder_Centre
00053 1 0 8 1    Scale_over_2
00054 1 0 8 1    Scale_over_2
00056 1 0 2 0    Fix_Ship
00058 1 0 2 1    Centre_Ship
00160 2 3 5 0    Rudder_Hard_Port
00195 2 3 5 2    Rudder_Centre
00201 2 3 5 3    Rudder_Gentle_Stbd
00207 2 3 5 2    Rudder_Centre
00223 2 3 2 0    Both_Props_Full_Astn
00356 1 3 1 0    Stop_Return_to_Help
Figure 6.2: A commented example game trace file

There were two types of action trace file. Each game had its key-presses recorded in a separate file, and each session of zero or more games had the ancillary key-presses (starting and stopping the game, reading the help information provided, etc.) recorded in the same format, together, without the key-presses from the games themselves.

The first field of each trace file gives the number of half-seconds since the beginning of the game, or session. This varies from zero up to several thousand (7200 being equivalent to one hour). The remaining four fields are single digits, representing in turn the sub-display, the column, the row, and the element within that row. These refer to the obvious divisions of the interface screen.
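A parser for this format might look as follows (a Python sketch; the field names are descriptive labels, not identifiers from the original analysis programs):

```python
def parse_trace_line(line):
    """Parse one action-trace line: five blank-separated numeric fields
    (half-second tick, sub-display, column, row, element), optionally
    followed by a comment as in Figure 6.2."""
    fields = line.split()
    return {"tick": int(fields[0]),
            "subdisplay": int(fields[1]),
            "column": int(fields[2]),
            "row": int(fields[3]),
            "element": int(fields[4])}
```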

Including the files due to the author, covering the period Oct 8th to Jan 16th, there were over 300 files from GH and nearly 150 files from CXT. These occupied over 1 Mbyte and over 400 kbyte respectively.

All plausible actions were recorded, including those that were impossible or ineffective at the time; these caused an audible warning to the player during the game, coinciding with the highlighting of the selected area. However, mouse-button-clicks that occurred while the cursor was in a generally inactive area were not recorded. Such button-clicks also caused an audible warning at the time of play, but did not cause any highlighting of the selected area of the screen.

In addition to the action trace files mentioned above, there was one file for each site, in which each game has a one line entry. The fields in this file (called the ‘Runindex’) were:

  1. the name of the version of the game;
  2. the player's short name, as entered by the player;
  3. the name (number) of the game record file;
  4. the date and time of the end of the game;
  5. the total score for that game;
  6. the time taken for the game, in half seconds;
  7. a random seed which determines the number and position of the mines, and which differs between most games.
A section of a Runindex is shown in Figure 6.3 (runs by the author later than the experimental period).

calm_weather s           17282750    Apr14-00:29    -546    1036   20262
calm_weather s           17283514    Apr14-00:42     982     998   10535
calm_weather s           17338726    Apr14-16:02    5799    2201     571
calm_weather s           17339722    Apr14-16:19    7044    2456   10535
calm_weather s           17340482    Apr14-16:32    6845    2155   20262
calm_weather s           17341776    Apr14-16:53    7781    3219    1940
calm_weather s           17447710    Apr15-22:19    6637    2363    8818
Figure 6.3: Part of a Runindex file
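A Runindex entry can be parsed in the same way (a Python sketch; the field names are illustrative labels for the seven fields listed above):

```python
def parse_runindex_line(line):
    """Parse one Runindex entry (cf. Figure 6.3): version name, player,
    trace-file number, end date-time, total score, duration in
    half-seconds, and the random seed for the mine layout."""
    version, player, filenum, endtime, score, halfsecs, seed = line.split()
    return {"version": version, "player": player, "file": filenum,
            "end": endtime, "score": int(score),
            "half_seconds": int(halfsecs), "seed": int(seed)}
```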

Table 6.1: Subjects' times, scores, and dates (1989–90)

An idea of the amount of data gathered can be obtained from Table 6.1. “No. Runs” is the total number of games played. Only a few of these were false starts, abandoned after a very short time. “Total Time” is the hours and minutes spent on the games themselves, not counting ancillary activities. “Best Time” is the total time spent by the end of the game that gave the best score, recorded in the next column. “Start Date” and “Best Date” allow the calculation of the calendar period from first trying the game to scoring the recorded high score. “Finish Date” represents the date of the last game played before the (arbitrary) cut-off date of 16th January 1990. Only a few games were played after this, and no-one improved their score. The subject with the highest score, G, was also the one who had put in the most hours of practice. For comparison, the author, with substantially more practice than any of the subjects, achieved a score of over 9000. A practical ceiling would appear to be around 10000, though this depended on the random arrangement of the mines: a score this high was only possible, even in theory, on a small proportion of games. Uncertainties in the scoring system are discussed below (§6.4.1).

6.3.4 Analysis

Since the analysis of the data involves many stages, it might be helpful to review the overall pattern before describing the details. This overall view, from which some details have been omitted for clarity, appears here as Figure 6.4. In this figure, the rectangles represent types of file, and the ovals represent programs for transforming one type of file into another.

Figure 6.4: Simplified data flow during analysis

Analysis of the actions

It was clear from very early on in the analysis that a single action from a human point of view was not necessarily equivalent to a single key-press. In particular, after practice on the task, short sequences of key-presses were frequently apparent. This happened particularly in the case of manoeuvring the ROV. Because there was no simple effector to turn left or right, players had to create their own sequences of actions that performed the function of turning. Even for the same direction of turn, different sequences were performed in different contexts. When going full ahead with both thrusters, a left turn of a few degrees would most probably be executed by selecting half ahead on the left thruster, followed immediately (0.5 seconds later) by restoring the left thruster to full ahead, using the full-ahead key for either the left thruster or both thrusters together. In contrast, when near a mine, gliding along with both thrusters stopped, a left turn would most likely be executed by selecting the starboard thruster half ahead, and then stopped (or both stopped). Other variations were also observed.

With these turning manoeuvres, the time interval between the unbalancing and balancing actions was crucial to determining the magnitude of the effect of the action. Due to the delayed response of the thrusters, leaving an interval of 1 second produced an effect approximately four times the size of the effect produced by an interval of 0.5 seconds. Thus, in situations where the former would be an appropriate action, the latter would not, and vice versa—the two were, in effect, different actions. There were other sequences of actions where time interval and order were not important. For example, when recovering the ROV, the first step was to reel in the cable. This was effected by setting the cable tension to ‘grip’ and the take-in speed to ‘fast’; but the ordering and interval of these two actions was immaterial.

Perhaps these considerations could in principle be derived from the data. However, in this study, they were deliberately introduced, as background knowledge, in the attempt to get the best classification of actions possible in the available time.

The problem then remained to find what compound actions actually occur in different players' task performance. This could be attempted by observation and questioning; but, without any objective measure, it would be difficult to assess how many of the resulting discoveries were artefacts of the biases and preconceptions of player and experimenter. Thus a prime objective was to devise a program to compile a list of these compound actions given only a quantity of raw data. Programs for learning macro-operators in games or puzzles have been devised, but the methods used and quoted (e.g. in [120]) do not explicitly deal with time or dynamic systems, and were thus unsuitable here.

The basic program to perform this was called summ (for summary). The program first completed a large table of the frequency of occurrence of each key-interval-key sequence, for intervals between 0.5 seconds (coded 0) and 2 seconds (coded 3). Two seconds was judged to be a reasonable maximum between two sub-actions that formed a higher-level action. The program then wrote out a summary of the more commonly occurring sequences. Depending on flags given on the command line, this summary was intended either as a general summary for human reading (see Figure 6.5), or as a list of 4-tuples specifying key-interval-key sequences and single replacement keys (see Figure 6.6). The summary in Figure 6.5 gives on the first line:

  1. the code for the former of the pair of keys;
  2. its frequency relative to all the actions in the sample;
  3. its absolute frequency;
  4. and a meaningful label;
and on each of the subsequent lines:
  1. the number of empty half-seconds separating the pair of actions;
  2. the code for the latter of the action pair;
  3. the absolute frequency of the key-interval-key combination;
  4. a label for the latter key;
  5. and finally, if that combination has been recognised and named (by the experimenter), a label for the combination.

3 3 1 3 relative 0.050 freq :1342 Port_Ths_Half_Ahd
            0    3 3 1 2 freq :  17    Port_Ths_Stop        Slow_Turn_Stbd_0
            1    3 3 1 2 freq :  55    Port_Ths_Stop        Slow_Turn_Stbd_1
            2    3 3 1 2 freq :  38    Port_Ths_Stop        Slow_Turn_Stbd_2
            3    3 3 1 2 freq :  31    Port_Ths_Stop        Slow_Turn_Stbd_3
            0    3 3 1 4 freq : 370    Port_Ths_Full_Ahd    Diff_Turn_Port_0
            1    3 3 1 4 freq :  66    Port_Ths_Full_Ahd    Diff_Turn_Port_1
            2    3 3 1 4 freq :  11    Port_Ths_Full_Ahd    Diff_Turn_Port_2
            0    3 3 2 2 freq :  24    Both_Ths_Stop        Slow_Turn_Stbd_0
            1    3 3 2 2 freq :  70    Both_Ths_Stop        Slow_Turn_Stbd_1
            2    3 3 2 2 freq :  65    Both_Ths_Stop        Slow_Turn_Stbd_2
            3    3 3 2 2 freq :  55    Both_Ths_Stop        Slow_Turn_Stbd_3
            0    3 3 3 1 freq : 228    Stbd_Ths_Half_Astn   Pure_Turn_Stbd
            1    3 3 3 1 freq :  75    Stbd_Ths_Half_Astn   Pure_Turn_Stbd
            3    3 3 3 2 freq :   9    Stbd_Ths_Stop       
Figure 6.5: A fragment of a summary

 3 3 1 3      0      3 3 1 4             0 3 3 0
 3 3 1 3      0      3 3 3 1             0 3 0 1
 3 3 1 3      1      3 3 1 4             0 3 3 1
 3 3 1 3      1      3 3 3 1             0 3 0 1
 3 3 1 3      2      3 3 1 4             0 3 3 2
 3 3 1 3      2      3 3 3 1             0 3 0 1
 3 3 1 3      3      3 3 1 4             0 3 3 3
 3 3 1 3      3      3 3 3 1             0 3 0 1
Figure 6.6: A fragment of a replacement chart
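The original summ program is not reproduced here, but its first pass, tabulating key-interval-key frequencies, might be sketched as follows. The trace format (a list of (time, key) events) and the key names are assumptions for illustration only:

```python
from collections import Counter

def key_interval_key_counts(trace, max_gap_steps=3):
    """Count key-interval-key pairs in a (time, key) event trace.

    Times are in seconds; the interval code is the number of empty
    half-second steps between the two key-presses, 0 meaning 0.5 s
    and 3 meaning 2 s, as in the text's description of summ.
    """
    counts = Counter()
    for (t1, k1), (t2, k2) in zip(trace, trace[1:]):
        gap_steps = round((t2 - t1) / 0.5) - 1  # empty half-seconds between presses
        if 0 <= gap_steps <= max_gap_steps:
            counts[(k1, gap_steps, k2)] += 1
    return counts

# Hypothetical trace: a key followed 0.5 s later by another, twice,
# with a long (ignored) gap in between.
trace = [(0.0, '3313'), (0.5, '3314'), (10.0, '3313'), (10.5, '3314')]
print(key_interval_key_counts(trace))
```

A second pass over such a table, keeping only the more frequent entries, would then yield the kind of summary shown in Figure 6.5.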

One deficiency with this basic summary program was that it could not deal with sequences of more than two actions. This was overcome by using the program iteratively. First a basic replacement chart was made. Next, actions were fed through a filter that made the changes specified in the first chart. This program was called actrace (not shown in Figure 6.4) from its effect of changing the actions in the trace file. The output of this filter was then fed into summ again, giving a second chart, some of whose input keys may have been new composite ones. A C-shell script (named chart) was written to govern this iterative process. In the final version of chart, the first application of summ output only the more common sequences, and two subsequent iterations included progressively less common sequences.
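The filtering step performed by actrace can be sketched in a few lines, assuming the trace is a list of (time, key) events and the chart maps (key, interval-code, key) triples to single replacement keys (as in Figure 6.6); the details are illustrative, not the original implementation:

```python
def apply_chart(trace, chart):
    """Replace key-interval-key pairs with composite keys (actrace-style sketch).

    trace: list of (time, key); chart: dict mapping (key1, gap_code, key2)
    to a single replacement key.  The composite action is recorded at the
    time of the first key of the pair.
    """
    out, i = [], 0
    while i < len(trace):
        if i + 1 < len(trace):
            (t1, k1), (t2, k2) = trace[i], trace[i + 1]
            gap = round((t2 - t1) / 0.5) - 1   # empty half-seconds between presses
            if (k1, gap, k2) in chart:
                out.append((t1, chart[(k1, gap, k2)]))
                i += 2
                continue
        out.append(trace[i])
        i += 1
    return out

chart = {('3313', 0, '3314'): '0330'}   # cf. the first row of Figure 6.6
trace = [(0.0, '3313'), (0.5, '3314'), (5.0, '3322')]
print(apply_chart(trace, chart))   # the pair collapses to one composite action
```

Feeding the output back through the frequency-counting step, with a new chart, is what allows composites of more than two key-presses to emerge on later iterations.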

The typology of actions is further discussed below (§6.4.2).

Expanding the trace files

The trace files only had information concerning the actions taken, not about the situations. The files had therefore to be expanded before analysis to include full details of the situation, and this was done by a program called exp, which was a modified version of the simulation program, without the graphics or interaction. The input was an action trace file, and the output was a binary file containing all the data on which the displays were based for each half-second step. This produced an increase in the file size of a factor of 250–300. (Thus it would be quite impractical to store many of these expanded files on disk.)

An expanded file also permitted a more flexible form of replay. Replaying from the trace file was possible, but it was only one-way (forwards), and took considerable time to execute, since all the original mathematics had to be performed. With an expanded file, on the other hand, one was able to jump to any place in the game, stop, or go backwards. This proved to be a help in getting an intuitive feel for what the various subjects were doing. It allowed an observer to study the circumstances of a certain action in a flexible, easy way.

Effecting the action changes

The expanded file then had its actions modified in accordance with the required replacement chart. The program that effected this was called actrep (for action representation). This was done twice, in series, so as to enable longer sequences to be converted than would be possible with only one application.

As well as the explicit actions, there was also the question of a reasonable human representation of the null action. The original expansion gave null actions for every time step (0.5 seconds) where there was no explicit action. Since one of the priorities of the simulation was to get away from an over-dependence on critical timing, it seemed unreasonable to class the time steps immediately preceding an action as null—after all, the player may have just been a bit slower than intended or desired. A thinking-time parameter was therefore introduced, imagined to be around 1 to 3 seconds, and for this amount of time around an action no null actions were passed on. This was tried with thinking time extending only before, or before and after, any action. Another way of describing the purpose of this would be to say that we want null actions to be registered when everything is fine, not in the thick of hectic action. Compare this with Card et al.'s ‘M’ operator (“mentally prepare”) in their Keystroke Level Model [20]. The value for the ‘M’ operator is given as 1.35 seconds.

Selection of the desired attributes

Having thus reached the stage where the representation of actions was altered to what we may suppose is a more human-like form, it was left to decide what to do about the representation of the situations. Whereas the basic representation of actions was unequivocal (discrete key-presses), the representation of situations implied by the interface was not completely clear (see below, §6.4.3). The approach taken was to remain agnostic about the exact information provided, because in any case the aim of the methods investigated was to be able to tell when a representation was closer to the human one.

Having decided on a representation to test, the selection of the attribute values in that representation was done by the program called sitrep (for situation representation). The representation was defined by a hand-crafted file, listing the attributes to be selected (see Figure 6.7).

(RT01)                               (RT2)

2                                    1
rov_degrees                          rov_off_head
rov_target_head                      4
4                                    rov_height
rov_height                           rov_speed
rov_speed                            rov_r
rov_r                                rov_target_range
rov_target_range                     4
4                                    sub_display
sub_display                          rov_av_revs_demand
rov_port_revs_demand                 rov_turn_demand
rov_stbd_revs_demand                 rov_status
3                                     0 3 0 1   Pure_Turn_Stbd
 3 3 1 3   Port_Ths_Half_Ahead        0 3 0 3   Pure_Turn_Port
 3 3 3 3   Stbd_Ths_Half_Ahead        0 0 0 0   NO_KEY
 0 0 0 0   NO_KEY
Figure 6.7: Example representation files

The three sections of situation attributes are respectively integer variables, floating-point variables, and qualitative variables. The fourth group is a selection of actions (classes) in the single ‘decision’ attribute, along with their key codes. Lower-level representations (such as the one marked (RT01)) contain only attributes that are explicitly present in the unmodified expanded data. Higher-level ones (such as the one marked (RT2)) have some quantities that are not explicitly present in the original data, and therefore have to be calculated on the spot.

In the example case, rov_off_head is the relative bearing of the closest active target from the ROV. It is calculated from rov_degrees, the heading of the ROV, and rov_target_head, the true bearing of the target from the ROV. Rov_av_revs_demand and rov_turn_demand are calculated from the two demands in the old representation RT01. Pure_Turn_Stbd is a combination of Port_Ths_Half_Ahead and Stbd_Ths_Half_Astn, as shown in Figures 6.5 and 6.6 above. Pure_Turn_Port is defined similarly.
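The derivation of rov_off_head is a simple bearing subtraction, normalised into a signed range. A sketch follows; the exact range and sign convention used by sitrep are assumptions (here negative values are taken as off to port):

```python
def rov_off_head(rov_degrees, rov_target_head):
    """Relative bearing of the target from the ROV, in (-180, 180].

    rov_degrees: ROV heading (true); rov_target_head: true bearing of
    the target from the ROV.  Both in degrees.
    """
    off = (rov_target_head - rov_degrees) % 360.0
    return off - 360.0 if off > 180.0 else off

# Heading 010, target bearing 350: target is 20 degrees to port.
print(rov_off_head(10.0, 350.0))   # -20.0
```

Derived attributes of this kind had to be calculated on the spot during selection, since they are not present in the expanded data.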

In the expanded file, the half-second intervals in which there was no key-press greatly outnumbered those in which there was one. Any individual key, therefore, is greatly outnumbered by what might be regarded as null actions (NO_KEY). It would be unhelpful to include all these as examples for a rule-induction program, for two reasons:

  1. the program would take a much longer time to execute, possibly making the difference between a practical length of time and an impractical one;
  2. including all of them would not help the program identify rules for the keys in which we are interested.
For these reasons, sitrep also performed the function of cutting out a large proportion of null actions. The precise proportion was controllable via a command-line parameter, or settable to a ‘good guess’ value dependent on the number of non-null keys being investigated.

Evaluating representation primitives using rule induction

After sitrep had output the selected data (now in readable, ‘ascii’ form), it was a straightforward matter to format this for a rule-induction program. This was done by a program called indprep (for induction preparation), which had a command line flag determining which rule-induction program the data should be prepared for, since there is no standard format. Other flags determined whether indprep should output a file of examples (data) only, attributes (names), or both together. As an example, the attribute file and example file (containing 30 examples) for subject G, representation RS2 (see Figure 6.9, below), interval 4 (as in Table 6.2 below), are given here as Figure 6.8. In each example, there is one entry for every attribute, in order.

sub_display:ship rov umb env;
rov_av_revs_demand:f_astn hf_astn h_astn q_astn stop q_ahd h_ahd hf_ahd f_ahd;
stage:initial searching placing far close final pull_in infringe;
class:Both_Ths_Full_Astn Both_Ths_Half_Astn Both_Ths_Stop Both_Ths_Half_Ahd
      Both_Ths_Full_Ahd NO_KEY;

48 7.98 0.00 1000.0 48.0 0.00 ship stop initial NO_KEY;
3 7.97 0.00 418.9 48.0 0.00 ship stop initial NO_KEY;
11 7.94 -0.22 320.3 48.0 0.00 ship stop placing NO_KEY;
30 7.60 -0.11 231.4 48.0 0.00 ship stop placing NO_KEY;
52 4.08 -0.01 150.4 34.5 4.14 rov stop far Both_Ths_Full_Ahd;
15 3.21 -1.08 134.9 34.5 3.39 rov f_ahd far NO_KEY;
15 3.19 -0.27 50.6 30.2 3.68 rov f_ahd far NO_KEY;
21 1.34 -0.72 16.1 12.1 1.53 rov f_ahd final Both_Ths_Stop;
15 -0.13 -0.11 14.1 8.5 0.88 rov stop final Both_Ths_Half_Ahd;
28 0.30 0.26 11.9 3.0 0.56 rov q_ahd final NO_KEY;
-56 0.24 0.56 15.7 2.0 0.61 rov q_ahd final Both_Ths_Half_Ahd;
-8 1.04 0.33 8.7 1.8 1.09 rov h_ahd final Both_Ths_Stop;
-49 0.13 0.25 4.5 2.1 0.29 rov stop final NO_KEY;
-20 -0.12 0.07 6.1 2.6 0.15 rov stop final Both_Ths_Stop;
27 0.06 -0.06 6.9 2.8 0.08 rov stop final Both_Ths_Stop;
-15 -0.15 -0.08 6.8 3.0 0.16 rov stop final Both_Ths_Stop;
56 0.09 -0.24 4.3 3.4 0.26 rov stop final NO_KEY;
-5 -6.69 -1.41 65.7 19.7 7.04 ship stop pull_in NO_KEY;
2 -0.67 1.12 1000.0 36.3 1.31 ship stop searching NO_KEY;
2 -1.44 -5.76 1000.0 37.9 6.86 ship stop searching NO_KEY;
2 0.59 -7.44 1000.0 39.2 7.98 ship stop searching NO_KEY;
-157 0.62 -8.28 487.4 39.8 7.57 ship stop searching NO_KEY;
-169 0.55 -7.37 465.4 40.3 8.36 ship stop searching NO_KEY;
178 0.61 -8.21 463.9 40.9 7.51 ship stop searching NO_KEY;
166 0.58 -7.49 483.1 41.5 8.18 ship stop searching NO_KEY;
-132 -5.27 -6.18 424.4 42.5 7.37 ship stop searching NO_KEY;
-129 -7.05 -4.36 229.7 45.2 7.55 ship stop placing NO_KEY;
-109 -3.64 -2.69 119.6 47.0 4.59 rov stop far Both_Ths_Full_Ahd;
5 1.94 1.50 80.1 47.4 2.46 ship f_ahd infringe NO_KEY;
-14 2.78 0.88 60.1 47.7 2.91 ship f_ahd infringe NO_KEY;
Figure 6.8: An instance of an example and attribute file for the CN2 induction program.

Three rule-induction programs were readily available: C4, ID3 and CN2 [23] (for a description and comparison of some algorithms, see [39]). (The version of CN2 used was developed by Robin Boswell of the Turing Institute, as part of ESPRIT project 2154, the Machine Learning Toolkit.) C4 is based on the ID3 algorithm, and like ID3, produces output in the form of decision trees. When C4 was tried on larger data sets (a few thousand examples) it was found to be excessively slow, taking several hours to run, and on the largest ones it ‘crashed’. It was then decided against as a primary tool. Of the other two, CN2 was chosen as the more appropriate, because

  1. it was designed specifically for ‘noisy’ data, and human actions are rarely noise-free.
  2. it can produce output in the form of if-then rules rather than as a decision tree.
There are two major modes of CN2: ordered and unordered. The ordered mode produces if-then-else rules, where, when the rules are being executed, the search through the rules stops when a match is found. In effect, later rules have as part of their conditions the negation of the conditions of earlier rules. This means that the ordering of the rules is significant, and that the application of any rule cannot be understood out of context. Thus, from the point of view of human comprehensibility, there is little advantage of the ordered mode over a decision tree, as in ID3.

The unordered mode produces if-then rules where the condition is made up of a conjunction of conditions on any of the attributes. Disjunction (‘or’) is produced by having a number of rules all for the same decision class.
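The difference between the two modes of rule execution can be made concrete. In the sketch below, a rule is a (condition, decision) pair; the rules and attribute names are hypothetical, and the voting scheme in the unordered case is a simplification (CN2 actually resolves clashes by summing the class frequency distributions of the matching rules):

```python
from collections import Counter

def predict_ordered(rules, default, example):
    """Ordered (if-then-else) mode: the first matching rule decides."""
    for condition, decision in rules:
        if condition(example):
            return decision
    return default

def predict_unordered(rules, default, example):
    """Unordered mode: all matching rules vote for their class."""
    votes = Counter(decision for condition, decision in rules if condition(example))
    return votes.most_common(1)[0][0] if votes else default

# Hypothetical rules in the style of the experiment's attributes.
rules = [
    (lambda e: e['rov_speed'] > 5.0, 'Both_Ths_Stop'),
    (lambda e: e['stage'] == 'final', 'Both_Ths_Half_Ahd'),
]
print(predict_ordered(rules, 'NO_KEY', {'rov_speed': 6.0, 'stage': 'final'}))
```

In the ordered case the second rule implicitly carries the negation of the first rule's condition, which is why such rules cannot be read out of context.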

A standard method for generating and testing rules was adopted. This is to take a training set of data, and use the program to generate rules, then to take the training set and unseen test sets, and to evaluate the prediction performance of the rules on these data. This process leads to figures for the effectiveness of the generated rules for each decision class considered, and an overall prediction performance figure, which must be carefully compared with the prediction performance of a default rule before its value can be assessed.

The first comparison of representations was between those given above in Figure 6.7 (see also §B.1). This used CN2 in its ‘unordered’ mode, where the rules produced are independent of each other (their order is immaterial). This was, however, a newly added facility in CN2, and it had yet to be fully tested.

The example was interesting: although the second case appeared to perform better, comparing the prediction performance of the rules with that of the default rule reveals that these rules did not score better at predicting human actions than the rule “do nothing all the time”. The default rule is that all examples belong to the modal (most frequent) class, which in these cases is NO_KEY. So we obtain a figure for the default rule by summing the actual frequencies of NO_KEY and dividing by the total number of examples.
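The default-rule figure just described is straightforward to compute; a minimal sketch, with an invented sample of decision classes:

```python
from collections import Counter

def default_rule_accuracy(classes):
    """Accuracy of the default rule: always predict the modal class."""
    modal_class, modal_count = Counter(classes).most_common(1)[0]
    return modal_count / len(classes)

# NO_KEY dominates the experiment's data, so the default rule sets a
# high baseline that induced rules must beat before being of any value.
sample = ['NO_KEY'] * 7 + ['Both_Ths_Stop'] * 2 + ['Both_Ths_Full_Ahd']
print(default_rule_accuracy(sample))   # 0.7
```

A rule set scoring below this baseline, as happened for RT01 on test data, is predicting human actions worse than doing nothing at all.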

In the unordered case, with the representation RT01, the prediction performance of the rules even on the training set was very close to the prediction performance of the default rule. On the test data, the prediction performance was substantially worse than the default. Looking at the individual rules (§B.1), the second rule makes sense in that it is saying that when the first key-press of a ‘pure turn’ has been carried out, the corresponding part then needs to be done. In contrast, the fourth rule is quite implausible. Simply from considerations of symmetry, we could call into question a rule where there were symmetrical conditions but an asymmetrical action. In this case, the rule must be presumed to have emerged from a coincidence in the data. The figures after the rule indicate that it is not very well supported even in the training data: we might expect it to be even more poorly supported in test data. But many of the other rules can be criticised in a similar way.

With the new representation RT2, the overall accuracy figures are not much different from the default rule figures. But the rules look much better than in the representation RT01. Firstly, there are fewer of them, which is an advantage. Secondly, most of them make good sense.

After discovering some unresolved issues with the unordered mode in CN2, the same data were reworked using the ordered mode (see Appendix B.2). This appendix illustrates the kind of rules obtained using the ordered mode. Briefly, the test data results on the representation RT01 are still below the default values. For the representation RT2, the results just manage to be better than default.

Evidence for development of rules, and representational effects

The subject with the longest experience was G, at George House. His trace files were grouped into four equal calendar time intervals: 02, 03, 04 and 05, in order from earlier to later. The calendar divisions are October 19th, October 30th, November 10th, November 22nd, and December 4th. The data was fed through the actrep filter, using action replacement charts generated for the subject G from all his games together. The 05 games (including the subject's best scoring game) were used for generating rules. Rules were generated on three representations of ROV speed control, named RS0, RS1 and RS2. (See Figure 6.9.)

(RS0)                           (RS1)                           (RS2)

2                               1                               1
rov_degrees                     rov_off_head                    rov_off_head
                                3                               5
5                               rov_target_range                rov_u
rov_u                           rov_height                      rov_v
rov_v                           rov_speed                       rov_target_range
rov_target_range                                                rov_height
rov_height                      3                               rov_speed
rov_speed                       sub_display
                                rov_av_revs_demand              3
3                               stage                           sub_display
sub_display                                                     rov_av_revs_demand
rov_port_revs_demand            6                               stage
rov_stbd_revs_demand             3 3 2 0   Both_Ths_Full_Astn
                                 3 3 2 1   Both_Ths_Half_Astn   6
6                                3 3 2 2   Both_Ths_Stop         3 3 2 0   Both_Ths_Full_Astn
 3 3 2 0   Both_Ths_Full_Astn    3 3 2 3   Both_Ths_Half_Ahd     3 3 2 1   Both_Ths_Half_Astn
 3 3 2 1   Both_Ths_Half_Astn    3 3 2 4   Both_Ths_Full_Ahd     3 3 2 2   Both_Ths_Stop
 3 3 2 2   Both_Ths_Stop         0 0 0 0   NO_KEY                3 3 2 3   Both_Ths_Half_Ahd
 3 3 2 3   Both_Ths_Half_Ahd                                     3 3 2 4   Both_Ths_Full_Ahd
 3 3 2 4   Both_Ths_Full_Ahd                                     0 0 0 0   NO_KEY
 0 0 0 0   NO_KEY

Figure 6.9: Three slightly varying representations for ROV speed control

These were intended to be progressively more human-like. The CN2 parameter ‘star’ was set to 10, and ‘threshold’ was set to 10, and ordered mode was used. The rules generated were then tested against the data from each of the divisions, 02, 03, 04 and 05. The results are summarised in Table 6.2. The numbers in the body of the table are the percentage points difference between the prediction performance of the rules and the prediction performance of the default rule. The high scores for the interval 05 are due to the fact that 05 interval provided the training data. The default rule generally scored around 60% to 70%, and the 05 interval absolute scores are over 95%.

Table 6.2: Testing rules for interval 05, subject G, against defaults

There are two trends immediately apparent in this table. One is that RS1 and RS2 perform substantially better than RS0, with RS2 being slightly the better of the two. The other is that whatever rules were induced for the interval 05 were not much in evidence during interval 02, and progressively became more so. This is reassuring in two ways: firstly it suggests that the rules found are not imaginary, or due to random effects; and secondly that these rules are being adopted increasingly as time goes on. This is consistent with a common-sense view of learning.

An alternative way of dividing up the examples is into sets of similar size. This was done with subject M, but otherwise the same procedure was followed as with subject G. Table 6.3 summarises the results for M.

Table 6.3: Testing rules for interval 0499–0508, subject M, against defaults

The rules were again constructed on the data containing the highest score, which in this case was the 0499–0506 interval. The same general trend is apparent with respect to the representations as above.

The prediction performance of the rules across time again shows a build-up of prediction performance towards the training interval; but now also shows a subsequent decline. This could in principle be due either to a decline in task performance, with the acquired rules not being followed as strictly as before, or due to new rules supplanting the old ones. In this table, the overall accuracy figures have also been included, to show that there is in fact no rise in overall accuracy between the fourth and fifth intervals. The rise in the relative figure is due to a fall in the default rule accuracy, which, in this interval, implies that there were fewer null actions.

G and M were both in George House, and took an interest in each other's games. It is perhaps not surprising that similar patterns emerge in their results, and that a representation that was able for one of them to produce rules performing substantially above the default rule, should also be able to do so for the other. This is not so, however, for DM, one of the Charing Cross Tower subjects. His results, derived by exactly the same process as the above results, are summarised in Table 6.4.

Table 6.4: Testing rules for interval 065–1, subject DM, against defaults

This table suggests that the rules do not reflect the actual rules being used by this subject. The first two columns suggest, rather more strongly, that the rules are substantially different from those used at the earlier stages of learning. The pattern for all the representations is similar, and this suggests that none of RS0, RS1 or RS2 cover the attributes actually used by subject DM. However there is, if anything, a slight favouring of RS1 over RS2, contrary to the other subjects. Other representations would have to be found if we were to obtain results as satisfactory for DM as for G and M.

6.4 Further discussion

A number of issues arose in the previous section that will be further expanded here. This leads on to a review of what was learnt from this experiment, and arguments pointing towards what needed doing next.

6.4.1 Uncertainty in scores

The reliability of the total score as a measure of experience was compromised by the random number and distribution of the mines. The number was random to ensure that the search could not be broken off without covering all corners of the minefield, which would enable an unfairly quick return, as well as being unrealistic. In an attempt to counter this problem, the scoring system allotted points for each mine found. However, the scoring was fixed before a great deal of experience had been gained, and it was subsequently discovered that experienced players gained more points by dealing with a mine than they lost through the extra time taken. Hence higher scores could be obtained when there were more mines, and the actual highest score of a player depended not only on their skill, but also on their luck in the allocation of mines.

A further problem with the reliability of the scoring comes from the catastrophic nature of an accidental mine explosion. A subject could be performing very well, but such an explosion would cause the total score to be highly negative. Thus good scores would be mixed in with very bad ones. For these reasons, it was felt that any graph of raw scores over time would be of little value.

The psychological impact of the scoring system is difficult to evaluate, and this will not be attempted. It may be noted that the task would change depending on whether the subjects were instructed to achieve the single highest score, or to achieve the best overall average score, or somewhere in between these two extremes. The emphasis in the experiment was only on achieving the highest single score, and this meant that when a subject accidentally blew up a mine, or did something that led to a long delay, the game was sometimes abandoned at that point, presumably on the grounds that a high score could not be obtained.

6.4.2 Types of action

Detailed consideration of the task, a priori, suggests several possible types of action that the player might perform. Correct identification of the type or types of action performed is potentially important to any analysis of this kind, since an analysis designed to find certain kinds of action might fail to find other kinds. These could include:

  1. slips, i.e. unintended key-presses;
  2. rule-following actions, performed according to learnt rules;
  3. actions arising from knowledge-based processes;
  4. information-seeking actions;
  5. exploratory actions.

Of these, the methods of the current study can only deal with those actions that fall into the second category, i.e., those that follow a rule. Therefore the success of the rule-induction depends, as well as on the quality of the representation, on the extent to which the actions analysed belong to this class, as opposed to one of the other classes.

Dealing first with slips, we may note that some of the unintended key-presses have no effect. These can be taken out in the process of analysis, by actrep. Other slips will contribute noise to the data, with the result that the induced rules will be less accurate and perform worse on prediction.

With knowledge-based processes, we could expect in theory to be able to induce rules if we know all the factors that are taken into account, and the intermediate concepts involved, in the knowledge-based process. This would be comparable with defining the terminology with which to construct an expert system, and would involve defining appropriate higher-level concepts in terms of lower-level ones: there is nothing in general to prevent this being done, but it may require much knowledge or theory about the knowledge-processing mechanisms. We are unlikely to be able to capture much of this level with the relatively straightforward methods that are used in the present study.

Information-seeking actions could be of two types: either actions directly altering the selection or presentation of information, or actions affecting the simulation itself. The information selection actions may reveal something about the information being used or considered at a particular time: however, in the first version of the simulation game, there was still so much information present concurrently (especially graphical) that our knowledge of the player's information usage is advanced only slightly, if at all. This approach to understanding the player had yet to be explored. More difficult to formalise are the actions which may be characterised thus: “give it a nudge to see how much it moves, and then you'll know how much to push it”. If this kind of action were being used, it would tend to obscure rules about how large an action to make in differing circumstances, since the initial nudge might be similar in the different situations, with only the following action differing; and that action not differing on the basis of the static quantities, but on the dynamic response of the thing that was nudged. However these information-seeking actions are dealt with, there are likely to be fewer of them the more practiced the subject is, since the desired quantities will be more likely to be known. For exploratory actions, again, the more practiced the player is, the less likely they are to occur. This reinforces the desirability of concentrating on well-learnt situations.

There is also a philosophical aspect to the question of the nature of actions, i.e., how we are to represent actions in general. This has a large effect on the methods of analysis. Firstly, we could consider an action as directly corresponding to the state that it brought about. An analysis on this basis will work if every situation has corresponding unique control settings appropriate to that situation. For example, if the ship is moving forwards at a reasonable speed, and the desired direction is more than (say) 15 degrees to port, then the desired rudder setting is hard port. This fits into the paradigm of pattern recognition and means-ends analysis: knowing how things ought to be leads to appreciation of the discrepancies between the actual and the desired state, and thence to steps to reduce the difference.

A second approach is to consider all actions as interventions, not necessarily determined by the objective state that is brought about. One can characterise the above example in this way, by adding that if the rudder setting is not hard port, then set it to be so. In this second approach, the dependency of actions on the current control setting is emphasised. It may be that this is more appropriate for serial actions, and explicit rules; while the first may be more appropriate for parallel actions without conscious attention.

Which approach is taken has implications for treating null actions. It is evident that at times an operator is consciously not intervening, because everything is within the operator's limits of acceptance. If a ‘desired state’ approach is taken, the concept of action has no default: there is always some desired control state, and every situation has some appropriate response, even if this does not entail altering the controls. If the correct response is not known, some measure of closeness will yield a similar situation, whose known action can stand in for the unknown one. With an ‘intervention’ approach, on the other hand, there is a default action of ‘do nothing’, in just those cases where there is no appropriate intervention. This does, however, raise the problem of the granularity of actions: it is far from clear how many null actions to attribute to any given stretch of time free from positive actions.

Choosing exclusively one or the other approach seems over-rigid. However, purely for ease of analysis, we may note that one can always express desired-state actions in terms of interventions contingent on the current state of the controls, as well as the outside world; whereas one cannot always express interventions in terms of desired states. For this reason, the analysis in this study is in terms of interventions.
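The asymmetry between the two encodings can be made concrete with a small sketch. This is purely illustrative (the function and label names, and the 15-degree threshold from the earlier example, are assumptions, not part of the simulation's actual vocabulary): the desired-state view maps each situation to a control setting, while the intervention view wraps that mapping so that ‘do nothing’ is the default whenever the controls are already as desired.

```python
from typing import Optional

def desired_rudder(heading_error: float) -> str:
    """Desired-state view: every situation maps onto a control setting.
    Convention (illustrative): positive heading_error means the desired
    course lies more than that many degrees to port."""
    if heading_error > 15:
        return "hard port"
    if heading_error < -15:
        return "hard starboard"
    return "midships"

def intervention(heading_error: float, current_rudder: str) -> Optional[str]:
    """Intervention view: act only when the current setting differs from
    the desired one; 'do nothing' (None) is the default."""
    wanted = desired_rudder(heading_error)
    return wanted if wanted != current_rudder else None
```

Note that `intervention` is defined in terms of `desired_rudder` plus the current control state, whereas no converse construction is possible in general: this is the asymmetry that motivates analysing the data in terms of interventions.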

The choice of approach with respect to actions is to some extent a pragmatic rather than a theoretical one. Constructing a complete theory incorporating all these types of actions would be a huge enterprise, encompassing a great deal of cognitive psychology. The choice of rule-governed actions may be justified as a starting point firstly by considering the relevance of regularities to the kinds of applications we are considering (and the lesser relevance of other actions); secondly by recognising that knowledge-based actions have been the subject of much investigation, both in AI and in learning systems (e.g., [41]); and thirdly by recognising that information-seeking, exploratory and whimsical actions would be a much more difficult place to start.

6.4.3 Evaluating the information provided by an interface

The information displayed at the interface falls into two sections. The ‘sensor’ section contains only numeric data displayed as numbers, and this clearly defines a set of primitives which we can take as the basic representation of these quantities. For the ‘graphic’ section, however, it was much more difficult to decide what was being displayed. One view might take the content of the display to be just the system variables used in its construction. However, the inference of other quantities is so immediate and intuitive that it is difficult to avoid the idea that this information too is being presented by the display.

A simple example concerns the ROV's heading. One numerical sensor gives, in whole degrees, the heading of the ROV in the conventional way (000 to 359 clockwise from North). Another sensor gives the bearing of the closest unknown or unsafe target. There was no explicit offering of the bearing of the closest target relative to the ROV's head, and yet this was obviously going to be a significant quantity, and it was one which was immediately apparent (though in an unscaled form) from the ROV graphic display, as long as the target was within the viewing region. A very close parallel exists with the ship's heading and associated quantities. In the case of the ship, the relative bearing can be immediately seen from the general position indicator.
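The quantity inferred here is just the difference of two displayed numbers, reduced to a signed angle. A minimal sketch (the function name and the sign convention, positive to starboard, are assumptions for illustration; the simulation itself presented only the unscaled graphical equivalent):

```python
def relative_bearing(target_bearing: float, heading: float) -> float:
    """Bearing of the target relative to the vessel's head, in degrees
    in the range (-180, 180]: positive to starboard, negative to port.
    Both inputs are conventional compass bearings, 000-359 clockwise
    from North."""
    rel = (target_bearing - heading) % 360.0
    return rel - 360.0 if rel > 180.0 else rel
```

For example, a target bearing 010 seen from a vessel heading 350 is 20 degrees to starboard; the same target seen from a heading of 030 is 20 degrees to port. The modulo step matters: a naive subtraction would report the first case as -340 degrees.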

Thus, it is a real difficulty with graphic displays to achieve any degree of objectivity about what information is being provided, and hence, for any higher-level representation, how much information processing is being done by the interface and how much by the human. This leaves an uncertainty at the heart of interface design, since there is no a priori way of being sure that the information you wish to present has in fact been presented effectively.

One possibility for formalising some graphic information is to focus on significant events, and allow that the display effectively gives a rough idea of time-until-the-event. Of course, this need not be displayed as such, but the combination of perceived distance and motion can easily be seen as giving a time measure. Such time measures have an established history in theories of mariners' actions in collision avoidance. For a discussion of the “RDRR” criterion (Range to Domain over Range Rate) and its use in an intelligent collision avoidance system, see [15, 27, 28, 30]. A slightly simpler concept, “TCPA” (Time to Closest Point of Approach) is also used in many places (e.g., [111, 129]).
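TCPA has a standard closed form: given the relative position and relative velocity of two objects, the time of closest approach is the (possibly negative) time that minimises the separation. The sketch below is a textbook version of that calculation, not drawn from the cited systems; the RDRR criterion builds further domain-specific terms on top of quantities like these.

```python
def tcpa(dx: float, dy: float, dvx: float, dvy: float) -> float:
    """Time to Closest Point of Approach, given relative position
    (dx, dy) and relative velocity (dvx, dvy) of the other object.
    A negative result means the closest approach is already past."""
    v2 = dvx * dvx + dvy * dvy
    if v2 == 0.0:
        # No relative motion: the range never changes.
        return float("inf")
    # Minimise |(dx, dy) + t*(dvx, dvy)|^2 over t.
    return -(dx * dvx + dy * dvy) / v2
```

For instance, an object 10 units ahead and closing at 2 units per second has a TCPA of 5 seconds; the same object moving away gives -5, signalling that the encounter is over.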

6.4.4 Other difficulties with representation

Another difficulty with representation arises in connection with the manoeuvring of the ship. The general position indicator (the upper graphic display present at all times) sometimes confronts the player with a pattern of targets that has complex implications. What is the best place to stop the ship, so that the most targets can be dealt with at once, while leaving the ship in an advantageous position to proceed? To come to a decision on this clearly requires an overall view of the disposition of the targets, and since the precise pattern of targets repeats itself extremely rarely, any routinizing of these decisions could not be linked to precise identity of the conditions.

It is plausible to consider this as an example of knowledge-based behaviour, since in the time available people are likely still to be trying out different approaches, and developing ways of categorising arrangements of targets into groups indicating the best action to take. In the author's experience, a considerable amount of conscious thinking goes on in the consideration of where to stop the ship, though this thinking may not be verbal. Alternatively, one could consider it as a visual pattern-matching process. An attempt to analyse this in symbolic terms would inevitably involve many pattern and shape concepts, which would be difficult to derive from data such as that gathered in the present study, because the experiment was not designed to address pattern issues. In the longer term, we might be able to ascertain which attributes were relevant to ship positioning decisions, and we might be able to devise a method of learning how to recognise values of these attributes from the original Cartesian data of the simulation. These questions are difficult enough to constitute independent problems, and since there is little necessary overlap with the present lines of enquiry, issues involving the processing and use of patterns are not followed here.

6.4.5 Limited nature of interesting results

Reviewing the state of results at the end of the first experiment:

  1. we had interesting evidence that rule-induction reveals important things about human task performance, particularly about learning and differences in representation.
  2. we had a reasonable method of dealing with the representation of actions (though far from perfect).
  3. we had discovered higher-level representations of turning actions for the ROV which appear to fit humans better than the lowest-level representations.
  4. the studies pointed towards the possibility of cross-comparisons of one player's rules with another's actions, perhaps leading to the ability to distinguish representative examples of different players' actions.
  5. the way was in principle also open to selecting and refining rules and examples iteratively; selecting for the next training set those games where the rules perform best (the most ‘ruly’ games), and selecting those rules that perform best on the best games.
  6. it appeared possible, though extremely laborious, to select attributes for representing situations by introducing them one at a time, and observing the effect on the performance of the rules.
  7. a yet more laborious possibility would be to select landmark values for the data as a whole, mapping the numeric data onto a small number of values for each attribute. The best position of these landmarks could be found by moving them gradually, watching the effect on the performance of the rules.
  8. these last three possibilities would only become practical if some automated tools were produced to help. Some of these ideas therefore will be taken up in the ‘further work’ section (§8.3).
  9. there was no good principled method of generating representations of situations close to those that we might assume people have.
  10. graphically displayed information appeared the hardest to represent, and it was difficult to envisage how to deal with it properly.
  11. the performance of derived rules suggested that we were still a long way from any full representation including all the factors which come into a human's decisions.
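The landmark scheme of item 7 amounts to a simple threshold discretisation: each numeric attribute is mapped onto a small set of symbolic values, and the cut points are then moved gradually while watching rule performance. A minimal sketch of the mapping step (names and example values are illustrative, not taken from the experiment):

```python
def discretise(value: float, landmarks: list, labels: list) -> str:
    """Map a numeric value onto one of a small number of symbolic
    values, using 'landmark' cut points in ascending order.
    len(labels) must be len(landmarks) + 1."""
    for cut, label in zip(landmarks, labels):
        if value < cut:
            return label
    return labels[-1]

# E.g. a hypothetical speed attribute with two landmarks:
#   discretise(speed, [0.5, 3.0], ["stopped", "slow", "fast"])
```

The laborious search described in the text would wrap this in a loop: nudge one landmark, regenerate the training set, re-run the induction, and keep the position that gives the best rule performance.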

6.4.6 Need for further experiments

The most salient need was therefore to overcome the problem of generating better representations of situations, in the absence of automated methods. Three ideas had clear merit.
  1. Cutting out the graphic displays would drastically limit the uncertainty of how the presented information was to be represented. However, removing them altogether might make it far more difficult for the task to be learnt in the first place.
  2. Costing the information, enabling and encouraging players to turn off what they are not using, would give a great deal of help towards knowing what information any player was using at any time, and therefore would help to provide representations capable of supporting the induction of rules which performed better. Graphical information would be priced highly, thus encouraging players to do without it. As soon as they had ‘got the general idea’, they would attempt to find strategies which did not need the graphical information.
  3. Using data from a well-practiced subject would minimise the learning activities performed (knowledge-based behaviour), and if the player had enough practice to be clear about what information was necessary, there might be fewer information-seeking actions that affected the simulation. This implies the maximisation of the time spent by each subject.
Also, having discovered something about the turning of the ROV, we could provide new higher-level controls, which could make the task easier. To compensate for this, weather could be introduced, as was originally planned. These steps would change the task substantially; but since the idea of the task in the first place was only to provide a sufficiently complex and interesting task in the chosen field, this should not be detrimental to the experiment as a whole.
