©1990, 1995

Chapter 7: Sea-Searching Simulation: second experiment

7.1 Experimental methodology and the implementation of changes

The experience from the first experiment, and the directions emerging from it, which have just been summarised, stimulated a number of changes to the task, which will now be discussed.

7.1.1 Costing the information

Costing the information was the main identified potential means of obtaining data about what information the operator was actually using at any time. Duncan & Prætorius [33] report experiments involving the withholding of plant information relevant to diagnosing faults, with the aim of checking verbal reports from the operator about how they performed the diagnosis task. Duncan [32] cites Marshall et al. [76] as originators of the technique of withholding information. But the task in the present study is quite different from the kinds of diagnostic task studied by Duncan and others. In the sea-searching task, if the information was automatically switched off before each time step, the operator would have to ask for all the information needed before making any action, and the task would be unrecognisably different, and slower. On the other hand, if information could only be switched on permanently, and asking for some information resulted in its being present at all times thereafter, information used in situations that had passed would be mixed in with information freshly required, and the experimenter would be in a situation little better than the original one, where all information is shown at all times. The sea-searching task needed the capability to switch the information both on and off, and the operator had to be induced to switch off the information not required at the time. The obvious method of inducing this was to deduct some score for each time interval for each sensor used.

The implementation of this information costing required considerable alteration to the program code. Each sensor had to have a price associated with it, together with a variable indicating whether that sensor was currently on or off. Also, a simple means of turning sensors on or off was needed, and this was done by a mouse-click on the appropriate area. When a sensor was off, the price was displayed instead of the sensor value. To guard against careless misreading, all prices were prefixed by “pr”, whereas all sensor values were simply numerical, without any prefix. The prices of the sensors that were on were summed, and displayed as the information rate, that is, the number of points deducted per half second for the collection of currently selected sensors. Another display in the scoring panel kept a running total of the cost of information from the start of the run.

To get an idea of the magnitude of the information cost, the price of the different sensors can be compared to the time penalty, which remained the same as in the first experiment at 1.0 points per half second. The general position indicator price was set at 6.0 (points per half second), and the other graphic displays at 3.0, since these were the sensors that were most problematic for analysis. The other sensors were generally set at 0.2, except for the relative heading of the ROV, which was set at 0.5, on the grounds that the information used to calculate its value was given by two other separate sensors. Using typical values from the previous experiment, this would mean that if all the sensors were left on for a whole run, an information cost of perhaps 40000 would be accumulated, far more than the total positive scores available. Thus, leaving the graphics on all the time was removed from possible competitive strategies.
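The information rate described above can be sketched in a few lines. This is only an illustration under stated assumptions: the prices are those quoted in the text, but the sensor names, the function names and the use of a dictionary are hypothetical and are not taken from the original program.

```python
# Prices in points per half-second time step, as quoted in the text.
# The sensor names used as keys are hypothetical labels, not the program's own.
TIME_PENALTY = 1.0
PRICES = {
    "general_position_indicator": 6.0,
    "sub_display_graphic": 3.0,
    "rov_relative_heading": 0.5,
}
DEFAULT_PRICE = 0.2  # the other sensors were "generally set at 0.2"


def information_rate(sensors_on):
    """Sum the prices of the sensors that are currently switched on."""
    return sum(PRICES.get(name, DEFAULT_PRICE) for name in sensors_on)


def information_cost(sensors_on, steps):
    """Cost of leaving the same sensors on for `steps` half-second steps."""
    return information_rate(sensors_on) * steps
```

On these figures, a player who leaves both graphic displays on pays 9.0 points per half second for information alone, nine times the time penalty, whereas three ordinary digital sensors cost 0.6.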

Clearly, introducing this information costing was going to change the task considerably. Firstly there was the added load of deciding what information was wanted, and using the mouse to turn the appropriate sensors on and off. But secondly, if, as intended by the experimenter, most of the sensors were turned off, there would no longer be the chance of opportunistic use of information that happened to catch the eye. One intended benefit from this was to constrain the methods used to be more consistent over time: having decided on necessary information, a player would not get the opportunity to notice fortuitously that other information would be useful. Thirdly, assuming the graphic sensors were used much less, the strategies that were found to be appropriate for digital displays might differ from those appropriate for use with graphic displays.

If graphic displays were a hindrance to analysis, one might ask, why not do without them altogether? In this game, as in so many learnt skills, it was recognised that the needs of the learner and the needs of the expert were in all likelihood going to differ. Shorter learning times were better in all ways for the experiment, and it was difficult to imagine a subject learning the task quickly with an interface consisting only of digital data. But even practised subjects get confused sometimes, and if there were no graphic display to fall back on, there would be a risk that they might remain thoroughly disoriented, or possibly even give up the game or be generally discouraged.

Given that the graphic displays would be used at the outset, the scoring system needed careful thought. The decision taken was to start with the full scoring system in place, but to issue instructions to subjects that they were not even to consider the scoring until they felt reasonably competent at performing the main task, that is, without the added task of information management. If the large mounting negative score was felt to be a distraction, the display could be turned off, just like the other information displays. Since the negative score was necessarily going to be larger than in the first experiment, the completion bonus was raised to 20000, so that subjects would feel the satisfaction of scoring positively at an earlier stage.

7.1.2 Rearranging the ROV turn controls

Since the first experiment had analysed the ROV turning, and since it was the most obvious inadequacy in the interface, it was decided to implement better turning effectors for the ROV. To maintain some continuity and comparability with the previous experiment, these effectors were implemented in a way similar to the way they were implemented by humans, and still using the same underlying simulated mechanisms. This meant that, on one button-press, the ROV motors had to be set asymmetrically, and without a further button-press, they had to be brought back into balance at a later time. Thus, the balancing action was implicit in the complete action. This required the ability to store button-presses for execution at a later time, also ensuring that when the time came for execution, it would not be interfered with by other button-presses.
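The idea of storing a button-press for execution at a later time can be illustrated as follows. This is a minimal sketch and not the mechanism of the original program, whose source is not reproduced here: the class `DeferredActions`, the function `press_turn`, the representation of the motors as a (left, right) pair, and the values of `delta` and `delay` are all assumptions made for the example. In particular, the sketch simply restores the motor settings that held before the turn, and does not model the protection against interference from other button-presses.

```python
import heapq


class DeferredActions:
    """Queue of actions to be executed at a later time step (hypothetical helper)."""

    def __init__(self):
        self._heap = []
        self._order = 0  # tie-breaker so equal due times keep insertion order

    def schedule(self, due_step, action):
        heapq.heappush(self._heap, (due_step, self._order, action))
        self._order += 1

    def pop_due(self, step):
        """Return, in order, every stored action whose time has come."""
        due = []
        while self._heap and self._heap[0][0] <= step:
            due.append(heapq.heappop(self._heap)[2])
        return due


def press_turn(queue, motors, step, delta=1, delay=4):
    """Unbalance the motors now and store the balancing action for later.

    `motors` is a (left, right) pair; `delta` and `delay` are illustrative values.
    """
    left, right = motors
    queue.schedule(step + delay, ("set_motors", (left, right)))
    return (left - delta, right + delta)
```

At each time step the simulation would call `pop_due` and apply whatever actions it returns, so that a single button-press yields both the unbalancing and the later rebalancing.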

This was implemented, and worked successfully. The unbalancing of the ROV motors was done in a way that had been found common in the first experiment, and this necessarily depended on the state of the motors at the time. This meant that turning actions and speed control actions were interdependent. The opportunity was taken to rearrange the other ROV controls, so that sensors relevant to each other were closer together.

The resultant appearance of the display is shown in Figure 7.1. The ROV sub-display is shown, with most of the sensors turned off.

Figure 7.1: The interface in the second sea-searching experiment

7.1.3 Introducing weather

Similar changes were required by the desire to introduce weather. In order to have all the relevant information present in the trace files, the actions of setting the weather needed to be recorded in the same way as other actions. But allowing the player to alter the weather would risk tempting them into exploring a whole range of situations, most of which would be irrelevant to their task performance for the still relatively short experiments that were planned (very short compared to a process operator's training). Developing and practising skill for a wide range of weather situations would need substantially more time than envisaged in this experiment.

Setting the weather required the ability to predefine some actions, to be taken and recorded, in a separate file. This was achieved by reading that file into a space shared with the implicit actions described above. Further details are not given here, as that would need reference to the source code. As it happened, the weather facility was not needed, since the subjects did not attain a sufficiently stable and high degree of skill.

7.1.4 New arrangements for subjects

As has been noted, it was believed to be desirable for subjects to spend longer at the task than the subjects of the previous experiment. Either nominal payment, or time taken from working hours, would be highly desirable, and this, in turn, limited the scope of the experiment.

Thirty hours was the chosen target for total experiment duration, of which most would be actually playing the game, and a part, especially at the beginning, would be familiarisation by reading the help provided.

7.1.5 Other changes

The possibility of the General Position Indicator being off highlighted the fact that, in the first experiment, some information was only presented graphically, with no corresponding digital measurement. So as to enable the player to navigate around the area without needing to use the General Position Indicator, North and East sensors were introduced, giving measurements relative to the starting point. The mine area was also moved slightly, so that the boundaries were multiples of 500m north and west from the starting position.

The help given by the help screens was changed to reflect the other changes made. For reference, the contents of the help screens are given in Appendix A.

7.2 Analytic methods and results

7.2.1 Analysis of sensor usage

The modifications to the program meant that, as well as the effectors recorded in the previous experiment, the trace records included all the key-presses having the effect of turning sensors on or off. From these records, knowing that all the sensors were originally off (except the scores, which were all on), one could deduce the state of visibility of all the sensors at any time, and this from only the trace files, without needing to recreate the simulations for that run.

So that we can discuss collections of sensors, we shall here use the term “chord”, by analogy with the musical term, to mean a collection of sensors that are simultaneously available (in the current interface, visibly). For instance, one chord commonly used (later on) by subject AJ was the combination of the three sensors that indicate the ROV height, range from target, and relative heading of target.

Analysis of these chords needed a means of representing and manipulating them on the computer. Since there were fewer than 24 sensors in each of the three sub-displays, it was possible to hold the chords in the form of 32-bit integers, where there was one bit for each of the sensors that could be on or off, three bits showing what sub-display was current (each sub-display having different sensors), a further bit showing whether the graphic display for that sub-display was on or off, and another showing whether the general position indicator was on or off. Here these are written in octal format. The first digit (after ‘ch’) contains the bits indicating the status of the graphic sensors: it is 1 if the sub-display graphic is on, 2 if the general position indicator is on, 3 if both are on, and 0 if both are off. The second digit is 2 for the ship sub-display, 3 for the ROV, and 4 for the cable. The last 8 digits (24 bits) are for the individual sensors, one digit having information on up to 3 sensors (the individual bits of the octal number). A 7 indicates three sensors on; a 3, 5 or 6, two sensors; and a 1, 2 or 4, one sensor on. Thus, as in Figure 7.2, the chord ch0347372742 indicates the ROV sub-display, with graphics off, and 15 sensors on (which was all of them). The chord ch3200000000 indicates the ship sub-display, with both its own graphic display and the general position indicator on, and all other sensors off.
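The octal coding just described can be made concrete with a short decoding routine. This is a sketch written for illustration, not the author's code: the function names and the dictionary returned are assumptions, and only the bit layout (graphic digit, sub-display digit, eight sensor digits) is taken from the text.

```python
SENSOR_MASK = 0o77777777  # low 24 bits: one bit per individual sensor
SUB_DISPLAYS = {2: "ship", 3: "ROV", 4: "cable"}


def parse_chord(code):
    """Decode a chord written as 'ch' (or 'och') followed by 10 octal digits."""
    value = int(code.lstrip("och"), 8)
    graphics = (value >> 27) & 0o7      # first octal digit
    sub_display = (value >> 24) & 0o7   # second octal digit
    sensors = value & SENSOR_MASK       # last eight octal digits
    return {
        "sub_display": SUB_DISPLAYS[sub_display],
        "graphic_on": bool(graphics & 1),
        "position_indicator_on": bool(graphics & 2),
        "sensors": sensors,
        "sensors_on": bin(sensors).count("1"),
    }


def format_chord(value):
    """Write a chord integer back in the 'ch' + 10 octal digits form."""
    return "ch%010o" % value
```

Applied to the two examples in the text, `parse_chord("ch0347372742")` reports the ROV sub-display with both graphics off and 15 sensors on, and `parse_chord("ch3200000000")` reports the ship sub-display with both graphics on and no other sensors on.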

A program called tracechord was written by the author to analyse the chords from the action traces. The action trace files, which were in the same format as in the previous experiment, were passed to tracechord, which kept track of which sensors were visible, identified the chords used, and added up the number of actions performed while each chord was showing. Figures 7.2 and 7.3 show output from tracechord for the two subjects, side by side. Each entry shows the chord code, the number of effector actions performed when this chord was operating, and the proportion of the total represented by this number.
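The kind of tally that tracechord produces can be sketched as below. The trace file format is not given in this chapter, so the event representation here is purely hypothetical: each event is taken to be either `("toggle", bit_mask)`, flipping one sensor bit of the current chord, or `("effector", key)`, an effector key-press. Changes of sub-display are not modelled.

```python
from collections import Counter


def tally_chords(events, start_chord):
    """Count effector actions performed while each chord was showing.

    Returns (chord, count, proportion) triples, most frequent first.
    """
    chord = start_chord
    counts = Counter()
    for kind, arg in events:
        if kind == "toggle":
            chord ^= arg          # switch one sensor on or off
        elif kind == "effector":
            counts[chord] += 1    # credit the action to the visible chord
    total = sum(counts.values())
    return [(c, n, n / total) for c, n in counts.most_common()]
```

Chords during which no effector action occurred never enter the tally in this sketch; in the original analysis such chords were recorded but omitted from the figures.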

             AJ                              MT
 ch0347372742  1331 0.4087       ch0347372742  1374 0.4883
 ch1347372742   885 0.2717       ch3277633303   605 0.2150
 ch3277633303   600 0.1842       ch1347372742   538 0.1912
 ch0400766736   241 0.0740       ch0400766736   161 0.0572
 ch3200000000    44 0.0135       ch2277633303   127 0.0451
 ch2277633303    41 0.0126       ch0400000000     5 0.0018
 ch2200000000    32 0.0098       ch1300000000     2 0.0007
 ch0300000000    30 0.0092       ch2200000000     1 0.0004
 ch1300000000    17 0.0052       ch3200000000     1 0.0004
 ch0400766636     9 0.0028
 ch0200000000     8 0.0025
 ch0400000000     8 0.0025
 ch2347372742     5 0.0015
 ch0277633303     3 0.0009
 ch0343372742     2 0.0006
 ch3347372742     1 0.0003
Figure 7.2: Sensor chord usage at the outset

Figure 7.2 shows figures for the first few hours of each subject's practice. The subjects, as instructed, initially left most of the sensors on while they were in the initial stages of learning the task, and there is not a very wide range of chords that were tried. The chords that were extensively used at this early stage were mainly those where most, if not all, the sensors are on. In Figure 7.3 are corresponding figures for each subject's final few hours, except that the list shown for MT omits further chords with effector frequencies down to 1. For both subjects, there were also many other chords used where there were no effector actions, and these are omitted from the figures. In the late chords, we see that the sensor usage of both subjects has changed greatly from the early pattern, and each is markedly different from the other subject's. The frequently used chords have only a few sensors on, and there are many more different chords used.

             AJ                              MT
 ch0301000140   901 0.2891       ch1303020700   518 0.1902
 ch0200200200   330 0.1059       ch1303020100   235 0.0863
 ch0300000000   324 0.1039       ch0300000600   219 0.0804
 ch0200000000   310 0.0995       ch0300000000   186 0.0683
 ch1301000140   255 0.0818       ch0303020700   149 0.0547
 ch1300000000   205 0.0658       ch0400000000    88 0.0323
 ch0201200200   166 0.0533       ch0205000000    85 0.0312
 ch0400000000   149 0.0478       ch0207200000    80 0.0294
 ch0204000000    92 0.0295       ch0400000002    76 0.0279
 ch0400000002    68 0.0218       ch0201200002    75 0.0275
 ch0207000000    64 0.0205       ch0207200002    64 0.0235
 ch0300000040    60 0.0192       ch0200000000    58 0.0213
 ch0201000000    47 0.0151       ch0201200000    55 0.0202
 ch0206000000    21 0.0067       ch0205200000    54 0.0198
 ch0200200000    19 0.0061       ch0300000700    43 0.0158
 ch0301000100    17 0.0055       ch0301020700    42 0.0154
 ch0205000000    15 0.0048       ch0205200002    39 0.0143
 ch0300000140    10 0.0032       ch0301000100    32 0.0117
 ch0202000000     9 0.0029       ch0300000100    29 0.0106
 ch0201200000     8 0.0026       ch0225200000    28 0.0103
 ch0207200000     7 0.0022       ch0235000000    28 0.0103
 ch0200000200     6 0.0019       ch0301000700    28 0.0103
 ch0300000100     6 0.0019       ch0204000000    27 0.0099
 ch0201000200     5 0.0016       ch0207000000    25 0.0092
 ch1200000000     5 0.0016       ch0215200000    22 0.0081
 ch0207200200     3 0.0010       ch0303020100    18 0.0066
 ch0301000000     3 0.0010       ch2400000000    18 0.0066
 ch0207000200     2 0.0006       ch2400000002    17 0.0062
 ch1206000000     2 0.0006       ch0300020700    15 0.0055
 ch2201200200     2 0.0006       ch1301000100    15 0.0055
 ch0203000000     1 0.0003       ch0225000000    14 0.0051
 ch0206200000     1 0.0003       ch0234000000    14 0.0051
 ch0301000040     1 0.0003       ch0237000000    13 0.0048
 ch1201200200     1 0.0003       ch2205000000    13 0.0048
 ch1204000000     1 0.0003       ch0224000000    12 0.0044
 ch2207200200     1 0.0003       ch0301020100    12 0.0044
                                 ch1303000100    12 0.0044
                                 ch0235400000    11 0.0040
                                 ch0303020600    10 0.0037
                                 ch1300000600    10 0.0037
                                 ch2237000000    10 0.0037
                                 ch0201200202     9 0.0033
                                 ch0303000100     9 0.0033
                                 ch1301020700     9 0.0033
                                 ch0201000000     8 0.0029
                                 ch0227200000     8 0.0029
                                 ch0235200000     8 0.0029
                                 ch0215000000     7 0.0026
                                 ch0221200000     7 0.0026
                                 ch0300010000     7 0.0026
                                 ch2225000000     7 0.0026
                                 ch0203200000     6 0.0022
                                 ch0221000000     6 0.0022
                                 ch0227200002     6 0.0022
                                 ch0300000400     6 0.0022
                                 ch2224000000     6 0.0022
                                 ch3234000000     6 0.0022
                                 ch0207600000     5 0.0018
                                 ch0217000000     5 0.0018
                                 ch0220000000     5 0.0018
Figure 7.3: Sensor chord usage at the end

The sensor usage results are enough to add another dimension to differences between individuals, but they are not of themselves enough to give a predictive model of the players' actions. For this, we must think again about what is necessary for a predictive model, before we are able to integrate these sensor usage results into a coherent model.

7.2.2 The idea of context applied to this analysis

In the previous chapter, we were looking at the attempt to derive rules for various groups of control actions, and recognising that better rules were derivable from some representations of situations and actions than for others. But we did not address the wider issue of making a predictive model of an operator's behaviour of broad enough scope to simulate the performance of the whole task. Considering this issue in greater depth has important repercussions for this analysis.

From the above analysis of sensor usage, it was clear that players were able to perform the task adequately, and with improving scores, using only a small selection of sensors at any one time. A predictive model of an operator using only certain information should ideally use the same information. What information should a predictive model be using, and what rules should it have ‘loaded’, at any particular time?

Let us consider the full spectrum of possible answers to these questions. At one extreme, it would be possible to base a model on all rules being available at once. In order to execute this model, all the relevant information for all of the rules would also have to be available. Besides failing to match the results of this experiment, such a model relies on all the information being monitored at once. This might plausibly model human information processing if actions were few and far between, but not if some of the actions demanded focused attention over time, as is the case in the experimental simulation here.

At the other extreme, it would be possible to base a model on the principle of only one rule being present at once. The information necessary for the execution of that rule would be well-defined and limited, but the difficulty in the model would come from the extensive higher-level rules necessary to decide which rule was the one that was relevant to any particular situation.

Seeing this spectrum of possibilities, the approach taken in this analysis was to explore a range of the middle ground first, since that appeared by far the most plausible. The middle ground assumption is that there are groups of rules that are applicable at the same stage of the task, and those rules share an information environment, in that, although they will not all require all the same information, there will be a considerable overlap. The amount of this information should be such that it is plausible to imagine a human monitoring it, given the workload and constraints of that stage of the task. There should also be some chance of reasonable higher-level rules governing either the transition to a new information environment, or rules which allow the deduction of which one should apply at any time. This ‘package’ of rules and information requirements will here be called a context.

As well as relating to the natural use of the word, the name also serves to distinguish the idea from potentially related previous concepts such as schemas [10], frames [82], scripts [119], and their offspring. The motivation behind these concepts is more to do with long-term memory, general knowledge, and understanding story fragments, which differs from that of the present study. However, the term is used in a similar sense by Fagan et al. [35], when discussing the VM system, although this is not the same sense as in the rest of the MYCIN project.

A context is a particular stage of the task, along with the rules and the information that are actually being used during this stage, for which the chords are evidence. This is in harmony with natural usage such as “in different contexts, the same values imply different actions”. The word “representation”, though it has been used in a variety of looser ways, will now refer to a whole pattern of context-based information use that can be thought of as latently present while a person is performing a task; but the meaning should not be taken to include the lower-level action rules themselves. The higher-level rules for switching between contexts could be seen either as a property of individual contexts, or as a property of the representation as a whole.

Having introduced the idea of context, it should be noted that in principle it could stretch all the way in between the two extremes mentioned above. One could have just a few contexts just below the level of the task as a whole: each context would include a relatively large number of rules, and need a relatively large amount of information, but the transition rules would be less likely to be intricate. On the other hand, there could be many contexts comprising only a few rules, and each context would have a relatively small requirement for information. The higher-level rules for determining context would be correspondingly more complex. Furthermore, in principle there is no reason why a kind of context structure should not be built up in more than one layer: there could be grouping of action rules at the bottom level, and grouping of higher-level rules at levels up to the level of the whole task. This last possibility will not be explored here.

Since this idea of context is about a package of rules, information and higher-level rules, ideally we want to do context analysis in terms of rules as well as information use. However, this has not proved possible yet. An approach to this will be discussed below (§8.3.1), but for the time being we shall use information in our analysis.

7.2.3 Analytic approach

The analysis in terms of contexts ideally needs data about information usage, but what we have at this point is data on sensor usage. The two are not necessarily the same. This could be for a number of reasons.

  1. There could be sensors visible that were not being used. Up to a point, this could be minimised by the player having practised to the stage where the majority of unused sensors could be turned off. However, over short time intervals under time pressure, one could expect some to be left on unused.
  2. The values of the sensors could be remembered while they were turned off, thus possibly playing a part in some decision while not being visible. In discussion, one of the subjects confirmed that this was a conscious strategy in some specific situations. Equally well, it appeared not to be an issue in other situations, so that an analysis purely in terms of memory would not reveal all that was desired.
  3. Information could be deduced from other visible sensors. For example, acceleration can be deduced from observation of speed over time, and time before arriving somewhere can be deduced from speed and distance. Also, the effector settings could be tested without needing the relevant sensor, by pressing on the effector reckoned to be currently set, which would result in the audible beep given in response to an ineffective action. Again, in principle these possibilities were confirmed in discussion.

Though the first of these reasons may not be a great problem, the second and third mean that it is unsatisfactory to use simply the chords themselves as the basis for the analysis. Furthermore, we can see from Figure 7.3 that there are many different chords used, and that this would appear rather too many to correspond with a human division into stages of the task.

The analysis needed to compensate as much as possible for these ways in which information usage was likely to differ from sensor usage. As well as this, in keeping with the original aims of the study, it was desirable for the analysis to be kept as much as possible objective and automatic. A method of grouping the chords together, and a method of allowing for implied information, are described below.

7.2.4 Analysis structure

Figure 7.4: Simplified data flow during analysis of second experiment

An outline of the flow of data in this analysis is given in Figure 7.4, which should be compared with Figure 6.4 above. The first stage of the analysis, shown in outline on the left side of the figure, is to find some context structure and content, and the second stage, down the main axis of the figure, is to use this structure in the induction of rules for the actions.

Finding context structure

There are at least three potential ways of finding context structure. Firstly, we could ask subjects what their perceived context structure is, i.e., how they split up the task into subtasks or substages, what information they use to make decisions in each context, and what rules they use. (See below.) This may or may not correspond with what they actually do. Secondly, we can examine the information that they use, and look for patterns in that usage. This is the main approach taken here. Thirdly, we could derive a context structure from the pattern of applicability of rules. This will be explained and discussed as part of further work, below (§8.3.1).

To obtain a more satisfactory (and possibly more realistic) context structure than that given by the chords alone, the raw chords needed to be grouped in some way. What was needed was more than a simple clustering process, because when the chords are clustered together, we do not wish to take the central or most frequent one as wholly defining the information usage, since other chords may have had extra sensors turned on. From the point of view of rule induction, the important point was not necessarily to find the exact information used in a context, but to find a superset of this. Having a few variables present that were not actually used should not hinder the rule induction to any great extent.

In the following procedure, which the author devised to meet this need, the frequency associated with a chord was the number of effector key-presses in the sample performed while that chord was in use. Starting from the least frequently used chord, each chord in turn was matched with all the other more frequently used chords, to find whether there was another chord within a specified ‘distance’ of the first, and with at least as great a frequency; and if there was, to find the one of those with the greatest frequency. The less frequently used chord would then be absorbed into the more frequently used one. If the frequency of the original chord was greater than zero, the keys used in that chord would be added on to the keys used in the chord to which it was absorbed, to make a superset chord stored separately (with harmonics, using the analogy). A distance of one unit meant that the two chords differed by exactly one non-graphic sensor being on in one chord and off in the other, while if one had a graphic display on that the other had off, this was (arbitrarily) assigned a distance of three units.
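The absorption procedure can be sketched as follows. This is one reading of the description above and not the author's implementation, and it is not claimed to reproduce every group in the figures. Several points are assumptions: chords from different sub-displays are treated as infinitely far apart, frequencies are compared using the totals accumulated so far, "within a specified distance" is read as less than or equal to the threshold, and the function and variable names are invented for the example.

```python
SENSOR_MASK = 0o77777777  # low 24 bits: individual sensors


def popcount(x):
    return bin(x).count("1")


def chord_distance(a, b):
    """One unit per differing sensor, three per differing graphic display."""
    if (a >> 24) & 0o7 != (b >> 24) & 0o7:
        return float("inf")  # assumption: never merge across sub-displays
    diff = a ^ b
    return popcount(diff & SENSOR_MASK) + 3 * popcount((diff >> 27) & 0o3)


def absorb_chords(freqs, threshold):
    """Group chords; returns {base_chord: {"freq": n, "over": overchord}}."""
    groups = {c: {"freq": f, "over": c} for c, f in freqs.items()}
    # Work upwards from the least frequently used chord.
    for chord in sorted(freqs, key=lambda c: (freqs[c], c)):
        own = groups[chord]
        candidates = [
            other for other in groups
            if other != chord
            and groups[other]["freq"] >= own["freq"]
            and chord_distance(chord, other) <= threshold
        ]
        if not candidates:
            continue
        target = max(candidates, key=lambda c: groups[c]["freq"])
        if own["freq"] > 0:
            # Add this chord's sensors to the superset chord of the target.
            groups[target]["over"] |= own["over"]
        groups[target]["freq"] += own["freq"]
        del groups[chord]
    return groups
```

With a threshold of two units, a cable chord with one sensor on and frequency 68 is absorbed into the all-off cable chord with frequency 149, giving a group of frequency 217 whose overchord has that one sensor on. Two chords that differ only in a graphic display are three units apart and so remain separate.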

 ch0301000140   998 0.3202 och0301000140 
 ch0200000000   569 0.1825 och0207200200 
 ch0200200200   537 0.1723 och0207200200 
 ch0300000000   324 0.1039 och0300000000 
 ch1301000140   255 0.0818 och1301000140 
 ch0400000000   217 0.0696 och0400000002 
 ch1300000000   205 0.0658 och1300000000 
 ch1200000000     8 0.0026 och1206000000 
 ch2201200200     3 0.0010 och2207200200 
 ch1201200200     1 0.0003 och1201200200 
Figure 7.5: Result of chord absorption for the later chords of AJ

An example of the result of this process, applied to the later chords of subject AJ already given in Figure 7.3, is shown in Figure 7.5. With the threshold of absorption of a chord set at two units, the number of distinct units reduced from 36 original chords to 10 groups of chords. In the figure, each line has four components. The first three are as before: the ‘base’ chord code of highest frequency in this group (starting with ‘ch’); the frequency (of effector actions with this group of chords); and this frequency expressed as a proportion of the whole. The fourth and last entry on each line (starting with ‘och’ for ‘overchord’) represents the chord made up by including all the sensors that were on in any of the chords in the group. For instance, in the original list, the chord ch0400000000 (cable sub-display with all sensors off) had a frequency of 149. In the list of chord groups, this has absorbed the chord ch0400000002 (cable sub-display with one sensor on) which had a frequency of 68, so that the resulting group of chords has a base chord of ch0400000000, an overchord of och0400000002 and a frequency of 217. The other groups are made up in the same way.

Having thus made an attempt to integrate related chords together into groups, the next step was to make allowance for information that was implied. Each sensor was assessed for its likely implications, and a routine was written to add in these implied quantities to the contexts. The implications were based on just a few basic principles.

  1. In every sub-display, the settings of the effectors would be counted as known, since they would be initially known on setting, and it was also possible to test or confirm the settings by further effector key-presses.
  2. Any sensor would imply its time derivative, if this was a sensible relevant quantity.
  3. The graphic displays would be taken to imply the information that was most obvious. This involved introducing quantities that had no separate sensor of their own. This was the most difficult implication about which to achieve any certainty.

It is difficult to be comprehensive about these implications, so it would be surprising if there were not some omissions, and indeed spurious inclusions. This is a list of the implications that were included in the analysis. The items followed by a star (*) did not have a sensor of their own.

These implications were added to the overchords, which then resulted in chord groups as in Figure 7.6. Comparing this with the previous figure, 7.5, we see how the overchords have been filled out. For instance, for the first chord, ch0301000140, before implications, the overchord has only the same three sensors as the base chord: after implications, the overchord has seven sensors (och0303102142). In addition to this, the implied quantities that had no actual sensor were recorded in another part of the data structure, along with the implications from the general position indicator that referred to information whose digital sensor was in another sub-display. When these chord groups are used, the base chords on the left are used to match a chord for closeness, and the overchords and extra quantities without sensors are used to give what is hoped to be a superset of the information used in any particular context.

 ch0301000140   998 0.3202 och0303102142 
 ch0200000000   569 0.1825 och0207211301 
 ch0200200200   537 0.1723 och0207211301 
 ch0300000000   324 0.1039 och0300102002 
 ch1301000140   255 0.0818 och0303112142 
 ch0400000000   217 0.0696 och0400000116 
 ch1300000000   205 0.0658 och0303112142 
 ch1200000000     8 0.0026 och0206611101 
 ch2201200200     3 0.0010 och0237211301 
 ch1201200200     1 0.0003 och0205611301 
Figure 7.6: Result of implications after chord absorption

Using context structure in the remaining analysis

The second stage of the analysis follows the data down the central path in the diagram (Figure 7.4). Starting with the trace data, the first step is to expand it in the same way as in the previous experiment. There, actrep then dealt with the representation of actions, both null and compound. In this experiment, with higher-level ROV turn controls introduced, the focus moved away from the representation of actions; but null actions still needed attention even if compound actions were going to be ignored.

A modified version of actrep removed key-presses that were ineffectual, and put in a null action wherever there were at least 10 consecutive time steps without any key-presses (that is, 5 seconds). This produced a reasonable number of null actions, such that the number of null actions was at least of the same order as the numbers of any other individual class, but not so many as to far outnumber all the other classes put together.
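The null-action rule can be sketched as follows. This is a hedged illustration, not the modified actrep itself. The function name `insert_null_actions` and the trace format are hypothetical. The sketch assumes one null action per idle run, placed at the step where the run reaches 10 time steps. It leaves out the removal of ineffectual key-presses.

```python
def insert_null_actions(steps, idle_run=10):
    """steps: one entry per time step, a key name or None for no key-press.

    Returns (step_index, action) pairs, adding "null" at the step where an
    idle run reaches `idle_run` steps (an assumed placement convention).
    """
    actions = []
    idle = 0
    for t, key in enumerate(steps):
        if key is None:
            idle += 1
            if idle == idle_run:       # one null per idle run, not per step
                actions.append((t, "null"))
        else:
            idle = 0
            actions.append((t, key))
    return actions

# 12 idle steps produce a null action; 9 idle steps do not.
trace = ["a"] + [None] * 12 + ["b"] + [None] * 9 + ["c"]
print(insert_null_actions(trace))
```

With half-second time steps, `idle_run=10` corresponds to the 5 seconds mentioned above.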

The functions of the previous programs sitrep and indprep (see Figure 6.4) were combined into a new program prepcont (for prepare data according to context), part of which incorporated a definition of a representation in terms of contexts, either output from tracechord or hand-written. The program prepcont then amounted to some 600 lines of source code. Some decision had to be made about which actions to include and which to leave out, as was done previously by sitrep with the representation files. Including all of them would merely clutter up the programs, since there are several actions that were either never or very rarely taken. The subject AJ never used the ship's propellers individually, and therefore the relevant effectors were left out in his case. But they were left in for subject MT, who did use them. The camera angle controls were left out on the grounds that the information on which these actions would be based would be graphical in form, and difficult to formalise. The action of detonating the mines was left out because the button for it is in the top section of the screen, always available, and hence it would not obviously belong to any one of the ordinary contexts. There were 44 remaining keys that were included in the analysis for AJ, 53 for MT, which were responsible for the overwhelming majority of the total key-presses.

In a significant change from the earlier method, the program prepcont output a number of files ready for the rule-induction programs: this is the intended meaning of the fanning out of arrows in Figure 7.4. Each of the defined contexts had separate files, and to facilitate the testing of rules against test data from the same time interval, the data for each interval and context were split into two parts, by putting alternate examples in two files. Thus, for a representation of 10 contexts, 20 files would be generated from however much data was fed in to prepcont at one time. The form of these files was as before (see Figure 6.8).

This separation of alternate examples was reasonable in this case because any action is associated with the situation prevailing at only one time interval, and the rule-induction process does not make any distinction on grounds of the order of the examples—any significance that there might have been is lost in any case. In contrast, when inducing higher-level rules for contexts themselves (see below, §7.2.7), one cannot use the same method of splitting data, because one context covers a sequence of examples, and if one split up the data by assigning alternate ones to alternate data sets, one would effectively have training and test sets that were drawn from the same instances. So in that case, the data needed to be divided sequentially.
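The two ways of dividing data contrasted here can be sketched in Python. The function names are hypothetical and the examples are stand-in integers, not data in the prepcont file format.

```python
def split_alternate(examples):
    """Deal examples alternately into two sets (set 0 and set 1)."""
    return examples[0::2], examples[1::2]

def split_sequential(examples, parts):
    """Cut examples into `parts` consecutive blocks of near-equal size."""
    n = len(examples)
    bounds = [round(i * n / parts) for i in range(parts + 1)]
    return [examples[bounds[i]:bounds[i + 1]] for i in range(parts)]

examples = list(range(11))            # stand-ins for 11 ordered examples
set0, set1 = split_alternate(examples)
blocks = split_sequential(examples, 3)
print(set0, set1)
print(blocks)
```

Alternate dealing keeps the two sets within one example of each other in size, but it places neighbouring examples in different sets. The sequential split keeps neighbours together, which is the property needed when one context spans a run of consecutive examples.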

In order to be more comprehensive than in the previous experiment, it was decided to generate rules for every set of data, and to test those rules against every other set of data within the same context. It proved possible to write a C-shell script to govern this process, using the same implementation of CN2 as previously for the induction. The unordered mode of CN2 was used for this analysis, with the modification that only rules that had the decision class as the class of maximum frequency were to be recorded. The parameters of CN2's operation were given values that had given reasonable results in the first experiment. ‘Star’ was set at 15 (a value also used in tests by the algorithm authors [23]), and the significance threshold at 15.0.

7.2.5 Analysis of data from subject AJ

The subject AJ interacted with the simulation for a nominal 30 hours 31 minutes, including 62 starts, between 18th June and 25th July. The first non-negative score was 9500 after 14h 38m. Progressive maxima were 11362 after 17h 23m, 12089 after 19h 53m, 14332 after 20h 35m, 14477 after 24h 0m, 16990 after 25h 17m, and 17441 after 30h 31m. Interspersed with these high scores were several where, due to infringements or damage penalties, the score was large and negative. The values of these low scores reveal very little other than the fact that a mine exploded, and so a complete table or graph is not given here. As a comparison, indicating the region where scores would stop improving, the author on a good day can score around 20000 on this task, but has never scored as much as 21000.

A potentially serious error was discovered after this subject had completed 8h 39m of practice. When the ship was travelling backwards, there were certain circumstances where it accelerated backwards without power, and well beyond maximum speed. This was rectified by attending to the simulation of the rudder, but this meant that the data from before this time was not easily analysable with the updated versions of the programs. So the analysis we have here does not include the first stages of learning. The remainder of the data was divided up into six intervals. These intervals were intended to be of approximately similar sizes, but preference was given where possible to put the boundaries coincident with the end of a day. The seven boundaries corresponded with practice times of 8h 39m, 12h 23m, 16h 37m, 19h 53m, 23h 10m, 26h 41m, and 30h 31m. These will here be referred to as intervals C to H respectively, as a reminder that the first part of the data is absent. The durations of the intervals were 3h 44m, 4h 14m, 3h 16m, 3h 17m, 3h 31m, and 3h 50m.
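As a check on the arithmetic, the interval durations can be recomputed from the quoted boundaries. The helper names below are hypothetical.

```python
import re

def to_minutes(text):
    """Convert a string such as '8h 39m' to a number of minutes."""
    hours, minutes = re.fullmatch(r"(\d+)h (\d+)m", text).groups()
    return int(hours) * 60 + int(minutes)

def fmt(minutes):
    """Format a number of minutes as 'Hh Mm'."""
    return f"{minutes // 60}h {minutes % 60}m"

boundaries = ["8h 39m", "12h 23m", "16h 37m", "19h 53m",
              "23h 10m", "26h 41m", "30h 31m"]
marks = [to_minutes(b) for b in boundaries]

# Pair each interval label with its start and end boundary.
durations = {label: fmt(end - start)
             for label, start, end in zip("CDEFGH", marks, marks[1:])}
print(durations)
```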

In order to observe the value of using a context-style representation for the analysis, it was desirable to have at least two contrasting analyses. The first analysis follows the minimum context structure compatible with the interface, taking only the contexts defined by the three separate sub-displays of ROV, ship, and cable. This representation was derived by using the tracechord program with trace data from the early interval C, with a very large distance parameter governing chord absorption (20), ensuring that only one chord group for each sub-display would remain.

General ROV context (ch2301000140)
Training set / Test set, examples (exs) in each data set:
933, 864, 611, 535, 643, 675
933, 864, 611, 536, 644, 676

Table 7.1: General ROV context for AJ (of 3 basic)

The first table, 7.1, shows the results of inducing rules for ROV actions, in the general ROV context, with CN2 and testing them on fresh data. The number of examples in each set of data is given below the label for that data set: because the data were dealt out evenly between the two sets for each context, the numbers in set 0 and set 1 differ by at most 1. In the body of the table, the upper figure (e.g., 37.4% in the top left element) gives the overall performance of the rules (generated from the training set) at classifying examples from the test set. The rules generated were used as they were, without any attempt to ‘clean them up’ (despite the fact that it was easy to see opportunities to clean up the rules), so as to reduce the possibility of unaccountable knowledge affecting the analysis. From the data that generate any set of rules, a default rule can also be generated, which is that the class for all examples is the class seen most frequently in the training data. The difference between the performance of the default rule and the induced rules is given as the lower number in each element of the body of the table, where a positive value indicates rules performing better than the default rule, and a negative value that the default rule performs better. As an example, for the top left element, 37.4% is an improvement of 17.4 percentage points over the default rule, and thus the performance of the default rule was 20.0%.
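How one element of such a table is formed can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the predictions of the induced rules are simply passed in as a list (CN2 itself is not reproduced), and the labels are toy values, not data from the study.

```python
from collections import Counter

def default_class(train_labels):
    """Default rule: the class seen most frequently in the training data."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(predicted, actual):
    """Percentage of test examples classified correctly."""
    return 100.0 * sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def table_element(rule_predictions, train_labels, test_labels):
    """Return (overall %, improvement over the default rule in % points)."""
    overall = accuracy(rule_predictions, test_labels)
    default = accuracy([default_class(train_labels)] * len(test_labels),
                       test_labels)
    return overall, overall - default

# Toy labels, chosen only to exercise the functions.
train_labels = ["null", "null", "null", "left", "right"]
test_labels = ["null", "left", "null", "right", "left"]
rule_predictions = ["null", "left", "null", "null", "left"]
print(table_element(rule_predictions, train_labels, test_labels))
```

In the top left element of Table 7.1 the same relation holds: an overall figure of 37.4% and an improvement of 17.4 points correspond to a default-rule performance of 20.0%.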

The main trend to be observed in this table is that the improvement of performance of the rules (over the default rule) generally is near a maximum when the test set is from the same time interval as the training set, and falls off to either side. This suggests that the rules that are induced are ones that change over time, something that could be explained by the subject learning, and his score improving, during the experimental period. However, the overall performance of the rules is far from good. This implies, in terms of the discussion of action types above (§6.4.2), that there were many actions either which fell into a category other than that of established rule-following actions, or for which effective rules could not be induced given only the attributes included, and the characteristics of the induction program. Some ways of attempting to get better performing rules will be addressed in the next section (§7.3).

Another noticeable feature of this set of data is that the figures for test set E1 appear to be slightly depressed from what would be expected on the basis of the above trend. In fact, between sets D and E there was a two-week break, during which the subject suffered accidental injury. The combination of injury and falling out of practice would seem a very plausible explanation in this particular context, where the actions are faster-moving and more time-critical than in the other contexts.

Table 7.2: General ship context for AJ (of 3 basic)

The next table, 7.2, shows similar overall trends. The overall performance percentages are higher than for the ROV context, but this is accounted for by the higher performance of the default rule in each case (this is because there is a greater proportion of null actions in the ship context). The increases of performance over default fall within the same range as for the ROV context. There is somewhat less of a trend of higher performance for training and test data close in time, but instead, there appears to be an increase in the performance of all the rules as the test set is later in time. This could be explained as a general increase in the proportion of rule-governed actions.

Table 7.3: General cable context for AJ (of 3 basic)

The final table of the three in the first group, 7.3, has similar overall performance figures to the ship context, but this time the improvements over the default rule are far more marked. It seems thus likely that the actions in this context are more rule-governed in nature. This is to be expected, given the relative simplicity of the decisions that have to be made in the cable context.

The second analysis given here used contexts derived directly from the data at a finer granularity than the previous ones. The data used in the context derivation was the last set of data from subject AJ, i.e., that called H here. As described above, a chord distance of 2 units led to 10 contexts. Some of the contexts, however, had very small numbers of examples in them, and perhaps could be regarded as fictions created by the analysis process.

Table 7.4: ROV approach context for AJ (of 10)

The three ROV contexts that we shall consider here were termed ROV approach, ROV visual and ROV miscellaneous. This ROV approach context generally applied from after the ROV had been put out, up to where it was close enough to the target for the camera to reveal the nature of the target. It was based around three sensors: the relative heading of the target from the ROV; the range of the target from the ROV; and the height of the ROV above the sea-bed. In Table 7.4 we see much the same trends as for the general ROV context above (Table 7.1). Here the trend indicating shifting rules is somewhat more pronounced than in the general ROV context. In Table 7.5, in contrast, the trend towards shifting rules is distinctly less pronounced. This context, here described as ‘ROV visual’, included the ROV graphic sensor on, so that the subject was relying less on the digital sensors. This context applied immediately after the ROV approach context, and covered the stage where the ROV was manoeuvring very close to a mine. However, looking at the number of examples over time suggests that this context is declining in use, with more actions being taken under the ROV approach context as time goes on. If indeed this context is ‘on the way out’, it is not surprising that there is not much change or development of the rules over the period covered. Conversely, if the ROV approach context is taking on a larger share of the action, it may be that there is further sub-structure within it.

Table 7.5: ROV visual context for AJ (of 10)

The other ROV context is given in Table 7.6. This context is based around no sensors, and tended to occur both immediately as the ROV was put out, and immediately before being pulled in again. It could contain, for example, routine preparatory actions that were not dependent on the situation at all.

Table 7.6: Miscellaneous ROV context for AJ (of 10)

There were also three ship contexts of interest (two others had very few examples). In the ship search context, there were generally no sensors turned on, and from observing replays, it was apparent that brief glances at information were taken, often being turned off before any action was taken. This context was the one relevant to going between targets. Looking at Table 7.7, we see that virtually all the rules induced performed worse at classifying test examples than the default rule. This means that the rules induced could not be accurately showing consistent regularities in the data. The obvious explanation is that there are no good rules, in terms of the attributes associated with this context.

Table 7.7: Ship search context for AJ (of 10)

This is consistent with the view that the searching pattern is the aspect of the task that is most at the knowledge-based level, involving reasoning and planning, rather than simple condition-matching. Subjects were able to discuss at length reasons for or against taking a particular path, both in general, and in particular, when they could see several targets at once and had to decide where to stop the ship most advantageously. A complementary explanation would be that suitable attributes were not provided, in terms of which decisions could be taken. Providing those attributes would involve considerable machine processing, in lieu of the considerable knowledge-based processing that is presumably performed by people.

Table 7.8: Ship positioning context for AJ (of 10)

The ship search context contrasts greatly with the ship positioning context, for which results are given in Table 7.8. This context covered the stage from where a position to stop had been selected, to the time when the ship was stopped and attention moved to the ROV. The sensors centrally involved were the propeller revs and the ship surge speed, with the control demands, headings and target range also amongst the overtones. In this context, the rules perform very well by comparison both with the ship search context and the general ship context, though there are no very clear trends within this good performance.

What is very clear, however, is that this division of context between ship search and ship positioning divides two collections of data that have very different characteristics, and that thus this division is highly relevant to the analysis of this data. This could not have been due to the choice of attributes, since it happened that the same set of attributes had been selected for both contexts.

Table 7.9: Ship with General Position Indicator context for AJ (of 10)

The other ship context included here is the one including the general position indicator (GPI). The GPI was priced heavily to deter its use, and it was never envisaged as being easy to formalise its information content. In Table 7.9, we see that even for period C, the rules induced do not perform very well. The context is then progressively abandoned as time goes on, and the decline of this context roughly matches the growth of the ship search context.

Table 7.10: General cable context for AJ (of 10)

The general cable context derived from period H of AJ's runs (Table 7.10) differed from the previous one (Table 7.3) only in that there were more attributes included in the earlier version's analysis. These extra attributes cannot have been centrally important however, because comparing the tables shows a better performance for the later version with fewer attributes, for the majority of the table elements, including all of the leading diagonal. One plausible hypothesis here is that some of the extra attributes allowed increased precision in the rules, whereas others allowed spurious precision leading to unfounded rules.

7.2.6 Analysis of data from subject MT

The subject MT interacted with the simulation for a nominal 20 hours 53 minutes, including 31 starts, between 17th July and 10th August. The first non-negative score was 7922 after 11h 33m. Progressive maxima were 12337 after 15h 56m, 13971 after 18h 8m, 15147 after 20h 53m. Compared with AJ, MT achieved similar levels of score in a shorter practice time, but did not achieve as high a final score due to spending less time overall in practice.

The data from MT were divided into five intervals, with the boundaries corresponding with practice times of 0h 0m, 4h 8m, 7h 23m, 10h 48m, 16h 22m, and 20h 53m. Thus, the lengths of the five intervals were 4h 8m, 3h 15m, 3h 25m, 5h 34m, and 4h 31m. These are referred to here as intervals A to E respectively.

Initially, the same process was followed for MT as for AJ. Interval E served as the basis for defining 13 contexts, using again a chord ‘distance’ of 2 for allowing the absorption of a chord in a larger group. For each context, rules for each training set were tested against each test set, as before.

Table 7.11: ROV visual context for MT

Table 7.12: ROV direction context for MT

Table 7.13: ROV non-graphic context for MT

The tables of results for the ROV contexts are Table 7.11, Table 7.12, and Table 7.13. For the ROV visual context, we see for interval A a low value for the improvement of performance over the default rule. This implies that at the outset, rules in terms of the attributes selected were not yet established. By the time we get to intervals D and E, however, the induced rules are performing well above the default rule, and we have a situation comparable to AJ's Table 7.4 and Table 7.5. The rules are not performing very well in absolute terms, which suggests that some feature of the actions in these contexts is not taken account of in the analysis.

As we would expect, given the pricing policy on the information, the use of the graphic information declines as time goes on. The context here called ‘ROV direction’ is one of the contexts that is standing in for the ROV visual one at later times. The performance of the D and E rules (Table 7.12) somewhat suggests that the rules in this new context are in the process of development: there are better scores for the training and test sets drawn from the same interval than for training and test sets drawn from different intervals.

When we come to the ROV non-graphic context (Table 7.13), the results look rather erratic. One possibility is that in this context, there are actions that do not depend on the selected attributes, such as preformed sequences of actions. Another possibility is that this context itself is not a natural one, and that there could be a number of disparate contexts within it. We will return to this point shortly.

Table 7.14: Ship search context for MT

Table 7.15: Ship close context for MT

Table 7.16: Ship search with GPI context for MT

Table 7.17: Ship with GPI 2 context for MT

The results in the ship contexts are given in Tables 7.14, 7.15, 7.16, and 7.17. None of these displays any convincing sign that there are rules in terms of the attributes selected. There is no context as rule-governed as AJ's ship positioning context (Table 7.8). What they do show, however, is shifts in patterns of sensor usage, and that this sensor usage differs substantially from that of AJ. In the absence of any signs to differentiate between them, we must say that it is not clear whether these contexts correspond to MT's task structure, or whether they are artefacts of the analysis.

Table 7.18: General cable context for MT

Table 7.19: Cable with GPI context for MT

The cable contexts (Tables 7.18 and 7.19) for MT show much the same picture as for AJ, but in the early intervals (not present in the AJ analysis) MT has the general position indicator on, and the rules induced are not as good during the initial learning, in intervals A and B. The cable operations are extremely simple, and it is not at all surprising that rules can be induced for these.

As well as the 9 contexts described here, there were 5 other contexts in which the number of examples was smaller. These contexts are more tentative than the others and the results based on them of less value.

In order to assess the variability of the figures obtained in the tables, the induction was run again with the 0 and 1 sets of data interchanged: i.e., rules were induced on the 1 data sets and tested on the 0 data sets. Because of the fluctuations between the 0 and 1 sets, in general both the overall accuracy figures and the default accuracy figures differed slightly between the two sets. For comparison, the three tables 7.20, 7.21, and 7.22 give alternative versions of Tables 7.11, 7.13, and 7.18 respectively.

Table 7.20: ROV visual context for MT

Table 7.21: ROV non-graphic context for MT

Table 7.22: General cable context for MT

These tables show a reassuring difference in detail and similarity in structure. Of the three shown here, the context which was least regular and predictable, ROV non-graphic (Tables 7.13 and 7.21), is also the one where the discrepancies between the two versions are greatest, whereas for the other two, where there appears to be more predictability, there is also less discrepancy between the versions. This adds to suspicions that the ROV non-graphic context corresponds less to MT's mental structure than some of the others.

7.2.7 Deriving rules for contexts

Another aspect of the validity of the contexts that were derived in analysis is whether they themselves can be predicted in terms of the variables in the simulation. This would be crucial to the ability to simulate the performance of a human operator, since using the rules from a particular context first requires determining which context applies. At the same time, the question arises whether we can measure in some way the difference between the two subjects' context structures.

To these ends, the final intervals of both subjects were each divided into three sections, and data was prepared with many of the likely variables as attributes, and the contexts that we have used above as class values. The data sets were E1, E2, and E3 from MT's interval E, and H1, H2 and H3 from AJ's interval H. Both context structures were used with data from both subjects: that is, the data was put into both the representation derived from that data, and also what should have been a less well-fitting representation from the other subject. For each of the two representations, rules were induced on each of the 6 sets of data, and these rules were tested on each of the 6 sets.

Table 7.23: Testing representation rules using contexts from AJ

Table 7.24: Testing representation rules using contexts from MT

The results of this analysis are given in Tables 7.23 and 7.24. The first point to be recognised is that the leading diagonal of these tables must be discounted, since the training set and the test set are the same, giving much higher accuracy values.

The next point to consider is the comparison of the figures from the top left and bottom right quadrants (where the training set and test set come from the same subject) and figures from the top right and bottom left quadrants (where the training set and test set cross between the two subjects). On the whole, the figures for the crossed training and test sets are lower than for the homogeneous case. This suggests that the rules for the contexts differ between the two subjects, even in this case where the same context structure is being used to process the original data.
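The quadrant comparison can be sketched as follows. The function name `quadrant_means` and the table layout are hypothetical. The accuracy values in the usage lines are placeholders chosen only to exercise the function. They are not figures from Tables 7.23 or 7.24.

```python
def quadrant_means(table, subject_of):
    """table[train][test] -> accuracy; subject_of maps set name -> subject.

    Returns (mean for same-subject pairs, mean for crossed pairs),
    skipping the leading diagonal where training set == test set.
    """
    same, crossed = [], []
    for train, row in table.items():
        for test, acc in row.items():
            if train == test:
                continue                   # discount the leading diagonal
            if subject_of[train] == subject_of[test]:
                same.append(acc)
            else:
                crossed.append(acc)
    return sum(same) / len(same), sum(crossed) / len(crossed)

# Placeholder values only (not results from the study).
names = ["E1", "E2", "H1", "H2"]
subject_of = {"E1": "MT", "E2": "MT", "H1": "AJ", "H2": "AJ"}
table = {tr: {te: 100.0 if tr == te
              else 60.0 if subject_of[tr] == subject_of[te]
              else 40.0
              for te in names}
         for tr in names}
print(quadrant_means(table, subject_of))
```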

The third point worthy of consideration comes from a comparison between the two tables. Though it is difficult to pick out any very marked differences, it would appear that where the training set is prepared with its proper representation (H with AJ, E with MT), the distinction between the performance of own subject test sets and other subject test sets is more marked. Thus, in Table 7.23, having discounted the leading diagonal element, there is no clear difference between the performance of test sets on the training set E1, which is from MT's data. For the H training sets, however, there is in each case a marked difference between the E and H test set performance. This could be due to an appropriate context structure leading to clearer context selection rules, which in turn lead to clearer distinctions between individuals.

Looking at the table again, one sees that the performance with E2 as test set is consistently lower than for the other E test sets (suggesting an unrecognised cause operating), and this tendency contributes to the effect just described as the third point. But even discounting the E2 test set results, there is still a slight trend in the way described. If one discounts E2 as a reliable test set, then Table 7.24 shows the same pattern as the other table. But the figures supporting this third point are far from conclusive, so more evidence would be needed to establish this as an effect.

As we have already seen that different contexts can differ markedly from each other, it also makes sense to look at the performance of the rules induced for predicting each context separately. This was done for both subjects, and the results are presented in Tables 7.25 and 7.26.

Table 7.25: Accuracy of rules for contexts for AJ, for the last interval

Table 7.26: Accuracy of rules for contexts for MT, for the last interval

These two tables show to what extent rules can be constructed to predict the context itself, from the attributes included in the analysis. It is important to note from these tables that the overall accuracy figures in previous tables do not consistently reflect the predictability of individual context use. For some contexts, their selection would appear to be highly rule-based, e.g., both subjects' ship search contexts. It is interesting that the ship search contexts are so highly predictable, despite the fact (above) that within the context no good predictive rules could be discovered. Other contexts are not so rule-based. This could be due to a number of reasons.

  1. They could be fictitious contexts thrown up by the analysis, having no foundation in human cognition. We have already raised this doubt about MT's ROV non-graphic context.
  2. They may not be selected in a systematic way. For example, one may suspect that the General Position Indicator is used sporadically. This would account for the very low accuracies for the GPI contexts.
  3. The analysis may not have included the attributes on the basis of which they are selected. The general cable contexts, for example, have a low predictability, which might be surprising, given the high predictability of actions within the context. This would be a good candidate to stimulate searching for further attributes to govern the context selection process.

7.2.8 Further analysis of the ROV data

The ROV non-graphic context posed the question of whether this was a fictitious context, or alternatively a real context in which there were no straightforward rules inducible on the basis of the attributes chosen. One test for this was to attempt to make the granularity of the contexts finer. This was done by setting the absorption distance to 1 rather than 2 in the process of construction of the contexts. As a result, more putative contexts were produced, and these then served as the basis for another similar process of rule induction and testing. This revealed little difference from the previous analysis. The same contexts were still dominant, with the same general patterns of results: the other contexts were generally low in examples, and offered no further coherent insight into the context structure. This result also serves to cast doubt on the value of pursuing still finer context divisions.

Perhaps a more challenging question was raised by the ROV visual context for MT, and the ROV visual and ROV approach contexts of AJ. We have here what look like well-defined contexts, yet the overall performance of induced rules is not as high as one might hope for, looking at the performance of rules in other contexts. Why not? One possible reason worth investigating was that during ROV manoeuvring, there are three concurrent tasks: to deal with speed, direction, and height. It may be that these tasks interfere with each other, because the human controller cannot attend to all at once, and that therefore at times more than one action may become appropriate according to simple rules. But the human will only be able to deal with one at a time, and any set of combined rules may predict either, but cannot predict both simultaneously.

A method of testing this is to separate out the control actions for the three different sub-tasks, and see how rules for the actions separately (together with null actions) compare on performance with the rules for all the actions together, which we have already discussed. Because null actions are only included when there is a reasonable thinking break, it is plausible to suppose that this process would avoid the potential clashes, although of course it cannot do anything about the extra fuzziness introduced by the actions having had to be delayed. This analysis of the sub-tasks turns out not to be strongly suggestive of any particular explanation of why the overall accuracy figures for the ROV contexts are not very high. It is given in Appendix C.
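The separation of sub-tasks amounts to filtering the examples before induction, which can be sketched as follows. The function name, the action names and the attribute names are all hypothetical stand-ins, not the identifiers used in the analysis programs.

```python
def subtask_examples(examples, subtask_actions, null_action="null"):
    """Keep the examples whose action belongs to one sub-task, plus nulls.

    examples: list of (attributes, action) pairs.
    """
    keep = set(subtask_actions) | {null_action}
    return [(attrs, action) for attrs, action in examples if action in keep]

# Hypothetical examples mixing speed, direction and height actions.
examples = [
    ({"range": 40}, "speed_up"),
    ({"range": 35}, "turn_left"),
    ({"range": 30}, "null"),
    ({"range": 28}, "rise"),
    ({"range": 25}, "slow_down"),
]
speed_only = subtask_examples(examples, ["speed_up", "slow_down"])
print(speed_only)
```

Rules induced from each filtered set can then be compared with the rules induced from all the actions together.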

In an area so barely explored, it cannot be doubted that there must be other methods of analysis which have not been pursued here: further discussion is in the next section (7.3).

7.2.9 Verbal reports of task performance

At this point we shall turn to verbal reports both as a means of explaining some of the findings here, and of highlighting some of the problems, to be discussed in the next section (§7.3). For both AJ and MT, on the same day as their last trial, the author and the player discussed a replay of their final, highest-scoring run, and this discussion was recorded on audio tape. This replay was a version using the expanded file, with the facility to stop, go slowly or fast, backwards, or skip forwards or backwards.

This study is not primarily a study of verbal data, and therefore far less than a full analysis of the verbal reports is offered here. We will rather pick out certain points that are relevant to the general issues under consideration. The extracts below are quoted as near as possible verbatim, because in most cases the subjects did not (and perhaps could not) give concise accurate accounts of the rules they were using. In the extracts, “I” stands for the author/experimenter.

Distinguishing contexts where there is no difference in sensor usage

One of the potential failings of the method of analysis described is that it will not distinguish contexts that have the same sensor usage. An example of such an undistinguished context is the start, described by both subjects.

MT: The first objective is to try to hit the red square at roughly, to go through the corner axis of the square I'm aiming for that point. I start off by just, er, aiming to go full ahead then I move onto my display screen and switch off the, er; fix the ship in the centre, and reduce the scale by double the amount (whatever it is). This sets up the screen for the right sort of like distance I'm going to be using now when I want to be, er, when I want to use the screen. I would then — that's a reasonable amount of time to actually come back and start turning the ship then: the ship's going a sufficiently amount, er, speed forward, then start turning it to the port, so it's going to actually hit this, aiming at round about 300 degrees to be able to hit it in the right direction.

AJ: Go into full ahead, I want to get as near to the area as possible. I know roughly the direction of the area.

I: So you do that of course without looking at anything.

AJ: Yes; which is a bit unfair; you should change the area each time. Right, now, checking the heading, see how far I've gone, 340, that's fine, I don't have to adjust that.

I: But what wouldn't be fine? But you've gone centre rudder there I notice.

AJ: Centre rudder, yes. 340, that's OK, that's a fine bearing. If it was going towards, er, if it was still about 350 say, I'd want to have a wee bit more port. Yeah, I've just remembered, I want to get the position indicator ready, in case I have to look ...

This could well be regarded as a separate context, since special rules apply that do not apply anywhere else. But the sensor usage is fundamentally the same as for the general ship searching: that is, no sensors on all the time. A more subtle approach would be needed to distinguish this context from other similar ones on the basis of the player's actions and information usage.

High-level concepts in ship searching

We noted above the lack of effective rules coming from the process of rule induction, for ship searching contexts. This is not surprising, given the kind of high-level concepts employed by the subjects describing their searching strategy.

AJ: What's my strategy? I usually just keep going along, um, the bottom half.

I: Yes. So where roughly abouts?

AJ: Say, about the middle of the bottom half.

I: About two and a half squares up from the bottom? 250 metres up from the bottom?

AJ: Three, three I'd say.

I: Yes, 300 metres up from the bottom. So you go along the bottom and back round the top, do you?

AJ: Yes. But usually it never works that way.

MT: The next bit I'm looking for is the easterly direction, and when it's about 600, what I intend to do then is to turn it North, and hit 0 degrees and just go North, bring it round south, north and south, and then back in. It's a pattern to follow through the, er, the maze, the red maze you give us. It adds some rules and directions to where I'm going within there, rather than search aimlessly.

These strategies would be very difficult to discover from the data, and without them we cannot very well make sense of the decisions taken in this context.

Using information from a combination of sensors

The discussions brought out the fact that some sensors were used together, to form a new compound quantity, which seemed more likely to figure in the rules. Here is an example of a quantity that was not included in the analysis, and could therefore be partly responsible for the fact that the rules generated were less than optimal.

MT: ... look at the range. I should have also looked at the height, or the depth.

I: Yes, the height.

MT: And decided on when I want to actually thrust down to, to get there.

I: Do you feel yourself making some sort of intuitive judgement of angle, on the range and height together?

MT: Yes.

I: Right, and when do you — have you formalised that in your head, or is that just a sort of vague idea?

MT: I don't want to be diving too deeply, er, by being too close. The problem is, you end up losing the vehicle underneath you ...

There were other instances in the discussions which could be taken to imply that certain quantities playing a part in operational rules were derived from the quantities displayed, rather than being displayed directly.

Verbal reports of context structure

Both subjects were asked explicitly how they would describe the structure of the task in terms of phases. Subject MT came up with approximately the following outline.

  1. Startup.
  2. Hunting phase. When a mine is found, work out if it's obtainable within the desired path.
  3. Slowing down. Includes consideration of direction for next movement.
  4. Stopping phase.
  5. ROV location (turning).
  6. Approach to the mine. Check what it is.
  7. Slowing down and stopping: fine manoeuvring.
  8. Recover ROV.

There were also a few non-standard situations that were recognised as having separate rules.

Subject AJ's reported outline pattern can be summarised as follows.

  1. General search pattern.
  2. Approach to target.
  3. ROV handling.
  4. Pulling in the ROV and restarting the ship.

Getting stuck in the mud (on the sea-bed) was another obvious separate context, as indeed were other recoveries from mistakes.

The phases mentioned by the two subjects have some similarities with the contexts produced in the analysis described earlier. The number of them is comparable, and some of them can be identified with one of the analysed contexts. However, they are not clearly identical, either with each other, or with the analysed contexts, and this adds doubt to the idea that the context analysis procedure is perfect.

Conscious changes in strategy or tactics

Both subjects reported recent changes in the way they performed the task. An example of what might be called a strategic change was given by MT.

MT: The last 3 turns, probably from 8th August, or the go before then, there's been, er, a conscious switch in the rules that's been used, to generate the strategy used for finding the ships, pointing the ships. The direction to — the rules to go up and down — there's various mines been left around, right on the edge, which I've not been getting, because I've been wandering from one mine to the other, which meant I'd come back to the base,

I: And you wouldn't have finished, yes.

At a more tactical level, AJ reported having just started to use the ‘turn’ effectors, where he had previously exclusively used the ‘kick’ effectors to change direction. He also said that he had recently begun “going by feel”, rather than, presumably, going by conscious rules.

Here is an important point for the experimental methodology. Even after 20 or 30 hours of practice on this task, the players' performance was still in a state of flux, and hence, since stable rules would be easier to discover, a longer period of practice would be better. This is also consistent with the observation, for some of the tables above, that the rules induced for one time interval performed much less well when tested against data from a different interval. Ideally, an experiment such as this should be long enough for the rules to stabilise, which would mean both that subjects did not report recent changes, and that rules induced on one interval performed equally well on neighbouring intervals.
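The second stability criterion can be sketched in code. The following Python fragment is only an illustration, not the analysis program used in this study: it induces a rule set on each interval and tests it on that interval and on the next one, so that a large drop between the two figures would suggest that the rules had not yet stabilised. The inducer shown is a trivial stand-in (it always predicts the commonest action), and the action labels and data are hypothetical; any real inducer with the same calling convention could be passed in instead.

```python
from collections import Counter

def induce_majority(examples):
    """Stand-in inducer (an assumption): always predict the commonest action."""
    commonest = Counter(action for _, action in examples).most_common(1)[0][0]
    return lambda attrs: commonest

def accuracy(predict, examples):
    """Percentage of examples whose action the rule set predicts correctly."""
    hits = sum(predict(attrs) == action for attrs, action in examples)
    return 100.0 * hits / len(examples)

def stability_table(intervals, induce=induce_majority):
    """Induce on each interval; test on that interval and on the next one.

    Each interval is a non-empty list of (attributes, action) pairs.
    Returns a list of (index, own_accuracy, next_accuracy) tuples.
    """
    rows = []
    for i in range(len(intervals) - 1):
        predict = induce(intervals[i])
        own = accuracy(predict, intervals[i])
        following = accuracy(predict, intervals[i + 1])
        rows.append((i, own, following))
    return rows

# Hypothetical data: the player mostly 'kicks' early on, then mostly 'turns'.
intervals = [
    [({}, "kick"), ({}, "kick"), ({}, "null"), ({}, "kick")],
    [({}, "turn"), ({}, "kick"), ({}, "turn"), ({}, "turn")],
]
rows = stability_table(intervals)
```

With these made-up intervals the rule set scores 75% on its own interval and 25% on the following one, the kind of gap that a change of tactics between intervals would be expected to produce.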

An interesting and important additional point is that neither subject reported recently changing their view of the structure of the task, in terms of stages or contexts.

Other points

Another factor seen as leading to change in task performance was the experience of recent problems. MT talked a lot about confidence, how it was lost, and how this affected the tactics. AJ reported not taking a certain action because of recent experience of failure. Changes in tactics for these reasons would also be reflected in more poorly performing rules being induced from intervals including such changes.

But, among the many other interesting facets of the discussions, which are less directly relevant here, there was a clearly apparent difference between many aspects of the way the two subjects performed the task.

7.3 Discussion

7.3.1 Main findings of this experiment

The discussion at the end of the previous chapter highlighted the need for an approach to discovering about human representations of situations. To this end, we have seen the introduction of a concept of context, together with a rudimentary means of deriving contexts within the framework of the information-costing experimental arrangement that was devised expressly for that purpose; and then an analysis in terms of those contexts.

Despite the shortcomings of these methods, which will be discussed later in this section, the context structures derived

Some of the contexts appear to have a comparatively highly rule-based character, and it is easy to relate this to Rasmussen's categories of rule-based and skill-based behaviour. It would be rule-based, in Rasmussen's terms, if the rules were consciously known by the operator, and skill-based if they were not. On the other hand, other contexts do not reveal a highly rule-governed nature through this method of analysis. There are a number of possible explanations for this, but one obvious one is that they correspond to Rasmussen's category of knowledge-based behaviour. Here it is interesting and suggestive to note that in the ship searching contexts, for which good rules could not be derived, the information flow is relatively small, with the sensors kept mostly off, and the number of actions is comparatively low. These are just the conditions one would expect for knowledge-based processing.

It must be emphasised here that the results of these analyses are tentative. The analysis methods have apparently not been tried on this kind of data, and there are no established equivalents of the general statistical methodologies, current in psychology, to support this approach. The results have been presented and discussed largely in terms of the difference in performance between induced rules and default rules, expressed as a simple difference in percentage. However, there are undoubtedly other possible ways of arriving at a measure of ‘how much has been learnt’, and the methods used have been used because they were plausible and gave interesting results. We await a more thoroughly worked out methodology. To the extent to which these results can be considered at all valid, they serve also to support and justify the novel techniques that have been necessary to derive them. There is a great deal more that could be done in the line of analysis in terms of contexts, and this can be seen as a highly valuable outcome of the context principle.
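To make the measure concrete, here is a minimal Python sketch of the difference in percentage between induced rules and a default rule. It assumes that the default rule always predicts the action most frequent in the training data; that assumption, the function names, and the toy action labels are all illustrative rather than taken from the analysis programs used here.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Percentage of examples where the predicted action matches the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("no examples to score")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * hits / len(actual)

def default_action(training_actions):
    """Assumed default rule: always predict the most frequent training action."""
    return Counter(training_actions).most_common(1)[0][0]

def gain_over_default(predicted, actual, training_actions):
    """Induced-rule accuracy minus default-rule accuracy, in percentage points."""
    default = default_action(training_actions)
    default_predictions = [default] * len(actual)
    return accuracy(predicted, actual) - accuracy(default_predictions, actual)

# Hypothetical toy data; the action labels are made up for illustration.
train = ["null", "null", "port", "null", "stbd"]
actual = ["null", "port", "null", "stbd"]
induced = ["null", "port", "null", "null"]
gain = gain_over_default(induced, actual, train)
```

In this toy case the induced predictions are right on three of four test examples (75%) and the default rule on two (50%), giving a gain of 25 percentage points. Other measures of 'how much has been learnt' could be substituted by replacing the last function.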

7.3.2 Justification of results in terms of other work

A context structure is also a means of structuring a task so that it does not grossly exceed the known capabilities of the human information processing system. Card, Moran & Newell's Model Human Processor, which has been discussed above (§, has a useful collection of relevant values of those capabilities. No explicit attention has been paid to making a context structure fit within these boundaries, but it is not difficult to see firstly, that a context structure is a plausible way of breaking down a task so that only a small number of independent quantities need to be monitored at any one time; and secondly, that explicit constraints of this type could be built in to a context analysis process, to ensure that the limits were respected, and thus that a context analysis remained consistent with what is known about human information processing.

This would also be addressing similar issues to those addressed by the idea of Programmable User Models (PUMs), also discussed above (§ A context-based structure could provide a model of the content of task skill, in a form which could be run on an explicitly constrained computational model of a human operator, as envisaged by the PUMs approach.

7.3.3 Problems and direct remedies

Here we will consider problems with each stage of the analysis, from the subjects onwards. These problems invite solutions, which are suggested as well.

Both subjects showed and described recent changes in their methods of performing the task. A longer practice time would be preferable. Based on the experience of these experiments, one could conjecture that perhaps 100 hours of practice would be more appropriate for the level of complexity of the task examined here.

The relatively short practice time meant that the data could be expected to have more anomalies in them than would be the case for later practice: but the data were not ‘cleaned up’ in any way before use. This means that they could have included runs, or parts of runs, when the player was doing something other than the usual task. It would be possible, if laborious, to watch all the runs carefully, and to discard those runs which appeared not to conform to a minimum standard of attempting to perform the task as given. This would run the risk, of course, of selecting the data to fit the theory, but it might also produce an improvement in the clarity of the analysis. Another related open question is whether to filter out actions which preceded disasters (such as setting off a mine), on the grounds that such actions cannot be consistent with a successful overall strategy.

Having chosen the data, attention turns to the analysis, with the construction of contexts and the choice of attributes within each context. The method of finding contexts was not highly developed or principled, and there is no doubt that this could be improved, both for the information-hiding methods employed in the second experiment, and by exploring other methods, which will be discussed below, §8.3.1.

The question of selecting attributes within a context is highly problematic. Seen in one way, this is an endless problem, to be solved only in the ideal case that a full predictive model of behaviour is constructed in terms of the full set of attributes. However, the impossibility of this need not blind us to possibilities of improving the attribute set for any context. This is also linked to the question of whether we have a realistic context structure, since an inappropriate structure could mean that an inhomogeneous mixture of information might be being used. But assuming that there was a good context structure, there are essentially two approaches to improving the set of attributes associated with it. The first is the way which has been taken here, to monitor usage, and to ask the operator what information is being used. More attention could be given to this. The second way is to ascertain which attributes lead to the best induced rules, and this will be taken up later, in §8.3.1.

Having decided on the contexts and attributes to be used, the next important factor in the induction is to optimise the operation of the rule-induction program for the data presented. In the analysis reported here, plausible values were assigned to the parameters of the program, and not altered, so that the analysis would not be confused. There remains the possibility that other values would have given better or clearer results. A natural extension to the work would be to check this.

Another approach to obtaining good rules is not to rely entirely on the rule-induction process, but to attempt some kind of selection or editing of rules. This could be done by eliminating those rules that performed least well on test data; or that could be discounted on a priori grounds such as symmetry, or the use of attributes that should have nothing to do with the action. It is important to recognise that in this experiment, no attention has been paid to the rules themselves, but only to the performance of the rules together. In other words, the chief interest has been the ruliness of the data, rather than the details of the rules. The number of rules is rather larger than one would desire for a model of dynamic task performance, and the rules individually appear more to specify when a given action does not occur, than when it does occur. Hence it is unclear how successful editing rules would be.
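The first kind of editing mentioned, eliminating rules that perform least well on test data, might be sketched as follows. This is a hedged illustration only: the representation of a rule as a (condition, action) pair, the thresholds, and the attribute names `speed` and `heading_error` are all hypothetical, and no such editing was carried out in this experiment.

```python
def rule_accuracy(rule, examples):
    """Return (fraction correct, number of firings) for one rule on test examples.

    A rule is a (condition, action) pair; an example is an (attributes, action) pair.
    The fraction is None if the rule never fires on the examples.
    """
    condition, action = rule
    fired = [actual for attrs, actual in examples if condition(attrs)]
    if not fired:
        return None, 0
    correct = sum(actual == action for actual in fired)
    return correct / len(fired), len(fired)

def prune_rules(rules, examples, min_accuracy=0.5, min_firings=1):
    """Keep only rules that fire often enough and are accurate enough on test data."""
    kept = []
    for rule in rules:
        acc, n = rule_accuracy(rule, examples)
        if acc is not None and n >= min_firings and acc >= min_accuracy:
            kept.append(rule)
    return kept

# Hypothetical rules and test examples, for illustration only.
rules = [
    (lambda a: a["speed"] > 5, "slow"),
    (lambda a: a["heading_error"] > 10, "port"),
]
examples = [
    ({"speed": 6, "heading_error": 0}, "slow"),
    ({"speed": 7, "heading_error": 20}, "slow"),
    ({"speed": 2, "heading_error": 15}, "null"),
]
kept = prune_rules(rules, examples)
```

Here the first rule is right both times it fires and is kept, while the second fires twice and is wrong both times, so it is dropped. Note that scoring rules one at a time in this way ignores any interaction between rules, which matters if the rule set is meant to be applied as a whole.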

Another unexplored possibility is the integration of the analysis of situation representations followed in this experiment, with the analysis of action representations, which was carried further in the previous experiment. It is an open question whether this would improve the effectiveness of the analysis as a whole.

7.3.4 Other possible direct extensions to the study

Other extensions to the work, that do not arise specifically from recognised problems or deficiencies, involve methods to further check the validity or consistency of the results.

Originally envisaged, but not undertaken, was to use the representations derived from particular operators, and implement interfaces where the sub-displays corresponded to the contexts, and the information available in those sub-displays corresponded to the information that was found to be used within that context. It would then be possible to test experimentally how operators performed with interfaces that either corresponded, or not, to their own context structure. This might provide valuable feedback about how closely an individual's representation had been captured.

Related to this, it would be very interesting to train people on the information-costing version of this task, and then put them on a version as in the former experiment, where all the information is simultaneously available. An important question would be, do their rules for performing the task stay as they were, or does the presence of extra information help, or even possibly hinder, them? Having developed a strategy for using information, do they prefer an interface where information can be turned off?

There might be some value in changing the scoring system. For instance, any access to a piece of information could be priced at the appropriate value for a minimum time of a few seconds. Alternatively, a sensor could be set to be disabled a few seconds after a button-press on an enabled sensor. This might make the analysis of information usage easier, by making the system fit more closely with human short-term memory.
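The first suggestion, charging each access for a minimum time, can be illustrated with a short Python sketch. It assumes that time is counted in steps (half-second steps would match the information rate described in §7.1.1), and that each period for which a sensor is on is charged independently; the function name, parameters and figures are hypothetical.

```python
def sensor_cost(on_periods, price_per_step, min_steps=0):
    """Total information cost for one sensor.

    on_periods: list of (start_step, end_step) pairs, end exclusive.
    price_per_step: points deducted per time step while the sensor is on.
    min_steps: minimum number of steps charged for any single access.

    Simplification: each period is charged on its own, so two accesses
    closer together than min_steps are both charged the full minimum.
    """
    total = 0
    for start, end in on_periods:
        if end < start:
            raise ValueError("period ends before it starts")
        charged_steps = max(end - start, min_steps)
        total += charged_steps * price_per_step
    return total

# Hypothetical usage: a brief glance (2 steps) and a longer look (10 steps).
periods = [(0, 2), (10, 20)]
plain_cost = sensor_cost(periods, price_per_step=3)
floored_cost = sensor_cost(periods, price_per_step=3, min_steps=6)
```

With a price of 3 points per step, the two periods cost 36 points under the existing scheme and 48 points with a six-step minimum, since the brief glance is then charged as if it had lasted six steps.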

At some point it might become worthwhile to assess the difference, if any, between results obtained with CN2 (in its different modes), other rule-induction algorithms, and other techniques such as Bayes classifiers.

Another more ambitious way of testing the whole context and rule system is to use them to construct an executable model player, based on the data from one human player. To do this, one would have to first code context selection rules, then, for each context, code a set of rules for that context. In considering rules for contexts, some of the same considerations arise as in the discussion of types of action, above (§6.4.2). One could consider a context to be a function of the system state, with every state having a unique corresponding context. This may, however, be over-idealised for representing a human context structure. In order to implement a model where the context was a function of several variables of the system, those variables would have to be continuously monitored, to check for change of context. If the number of variables to be monitored was in excess of the plausible human monitoring capacity, it might become more realistic to consider context changes as the fundamental method of keeping track of context, with rules for change from one context to others existing alongside the rules for actions within that context. There could then arise considerations such as whether more than one context could coexist, where there was swapping between contexts based on available attention rather than triggering rules. The issues involved in constructing a full executable model of an individual's task performance are extensive, and some of them are taken further in the next chapter.
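A minimal sketch of the simpler of these two designs, in which the context is a function of the system state, is given below in Python. It is an illustration under stated assumptions, not a model that was built in this study: the state variables, context names, action names and rule conditions are all hypothetical, and rules are tried in order with the first match winning.

```python
def select_context(state, context_rules, default="search"):
    """Return the first context whose condition holds for the current state."""
    for condition, context in context_rules:
        if condition(state):
            return context
    return default

def select_action(state, context, action_rules, null_action="null"):
    """Return the first matching action among the rules for the given context."""
    for condition, action in action_rules.get(context, []):
        if condition(state):
            return action
    return null_action

def model_player_step(state, context_rules, action_rules):
    """One decision step: choose the context, then an action within it."""
    context = select_context(state, context_rules)
    return context, select_action(state, context, action_rules)

# Hypothetical rules: one ROV context, with ship searching as the default.
context_rules = [(lambda s: s["rov_deployed"], "rov_approach")]
action_rules = {
    "rov_approach": [(lambda s: s["height"] > 10, "thrust_down")],
    "search": [(lambda s: s["speed"] < 1, "full_ahead")],
}
step_a = model_player_step(
    {"rov_deployed": True, "height": 12, "speed": 0}, context_rules, action_rules)
step_b = model_player_step(
    {"rov_deployed": False, "height": 0, "speed": 4}, context_rules, action_rules)
```

The first call selects the ROV context and the action `thrust_down`; the second falls through to the default search context and, since no rule there matches, returns the null action. The alternative design discussed above would instead keep the current context as part of the model's own state and replace `select_context` with transition rules attached to each context.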

1. The analogy between music and other intellectual systems is taken much further, imaginatively by Hesse [52] and speculatively by Hofstadter [54].
