[Review] John McCarthy's early look at multi-modality robotics

November 15, 2018 · 5 minute read

Paper Review 15-11-2018

A computer with hands, eyes and ears

J. McCarthy, L.D. Earnest, D. R. Reddy and P. J. Vicens, AFIPS ‘68 (Fall, part I) Proceedings of the December 9-11, 1968, Fall Joint Computer Conference, Part I, Pages 329-338

Aims of the paper

To explore the design of a computer system which extends the capabilities of current machines. The authors want to move past the egocentric idea that people are the best at all possible tasks. They aim to build a computer with “eyes, ears and hands” to realise earlier ideas from Shannon, Minsky and McCarthy himself. In this work they present the beginnings of a sensory computer which utilises currently available technology and discuss the successes and limitations of their method. A discussion is then given on three challenges: collecting and understanding data from a TV camera, doing the same for microphones, and the intuitive control of robotic arms.

Paper Summary

This paper covers all aspects of the presented machine (basically a “robot”, but the authors avoid the word), including design considerations and the methodology for each facet of the machine. The machine was built to explore whether current technologies can conceivably support such a system. It consists of a TV camera for sight (“eyes”), several microphones, both a hydraulic and an electric 2-joint robotic gripper, and a CRT display, all connected to a PDP-6 computer, which was an advanced machine for its time. These components were chosen because a major consideration was the use of off-the-shelf components and ease of interfacing. Details and further reasoning are given for each component with respect to its primary purpose. Examples include:

  • The use of both a condenser microphone with a sample rate of 20 kHz and a 3-band filtered microphone, without a pre-processing filter bank, so as to collect the raw audio waveform directly.
  • The choice between an electric and a hydraulic arm. The hydraulic arm was preferable, being much more accurate, but unfortunately less safe, so the electric arm was mostly used instead.
  • The use of dynamic visual input dependent on static vs moving scenes to reduce memory consumption.

Following that, an explanation of the data analysis techniques used for control is presented.

  • Visual: They discuss the use of linguistic models (my interpretation is a bag-of-words-like method) for 3D scene analysis. They also discuss weaknesses of this model, including connectivity and dependence of objects, error recovery and how to handle obscured objects.
  • Audio Commands: The audio system is used to handle voice commands. This involves the problem of audio segmentation, and a discussion is provided on how to compare sounds (Euclidean distance does not work, and heuristics are provided for a different metric) and how to recognise sounds from speech phonemes. This leads to the presentation of a basic control language, the challenges in recognising those sounds and how the system was made somewhat more robust.
  • Robot Arm Control: Few details are given on the physical control of the arm; other work is referenced for kinematics, obstacle avoidance and planning, which the authors describe as areas needing work. In this section, the focus is primarily on developing a control language/grammar for the arm with respect to the world, for example to handle phrases such as “Pick up the large block on the lower left corner”.
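
To make the idea of a control grammar more concrete, here is a minimal Python sketch of how a phrase like the one above could be turned into a structured command. This is my own illustration and not the method from the paper: the vocabulary, the `Command` structure and the `parse_command` function are all assumptions, and it works on text rather than on recognised speech.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from spoken verbs to internal action names.
# The paper's actual control language is not reproduced here.
ACTIONS = {"pick up": "PICK_UP", "put down": "PUT_DOWN"}

# A tiny fixed grammar: <action> the [<size>] block [on the <vertical> <horizontal> corner]
PATTERN = re.compile(
    r"^(?P<action>pick up|put down) the "
    r"(?:(?P<size>small|large) )?"
    r"(?P<obj>block)"
    r"(?: (?:on|in|at) the (?P<vertical>lower|upper) "
    r"(?P<horizontal>left|right) corner)?$"
)


@dataclass
class Command:
    action: str
    size: Optional[str]
    obj: str
    vertical: Optional[str]
    horizontal: Optional[str]


def parse_command(text: str) -> Optional[Command]:
    """Return a Command for phrases in the toy grammar, or None otherwise."""
    # Normalise case, trailing full stop and repeated whitespace.
    normalised = " ".join(text.lower().strip().rstrip(".").split())
    match = PATTERN.match(normalised)
    if match is None:
        return None
    return Command(
        action=ACTIONS[match["action"]],
        size=match["size"],
        obj=match["obj"],
        vertical=match["vertical"],
        horizontal=match["horizontal"],
    )


if __name__ == "__main__":
    print(parse_command("Pick up the large block on the lower left corner"))
```

Even this toy version shows why the grammar is only half the problem: the parser can say that the target is a large block in the lower left corner, but deciding which object in the camera image that refers to is a scene understanding task.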

Paper Review

A worthwhile paper to read to get the perspective of the time from a well-known author in the AI field, John McCarthy. The content shows which technologies and methods were available to the authors at the time and lets the reader see how things have developed since. The work tackles 3 areas that are still major research topics today and does a good job of explaining the challenges faced and the potential future problems with their solutions, although the focus seems to be more on speech recognition.

My main concern with the paper is that, after reading it, I am not entirely sure whether they actually built a functioning version of the system described. They reference many possible techniques, of which one is implemented, but the chosen one is glossed over. No evaluation is given of how the robot performed on any of the tasks they tried to get it to complete, which is slightly disappointing. It leads me to doubt whether the robot actually functioned as the authors intended.


The contents of the paper were very interesting to read, and looking back on the work, it suggested many of the techniques that were used in the decades to come. As mentioned above, the linguistic approach appears to be a proto-bag-of-words technique, and the paper points to edge filters as an option for a time when cameras have more pixels. In the years that followed, much work focused on identifying phonemes for speech recognition, and this paper seems to be one of the earliest to attempt such a method. The mechanical arm control component can perhaps still be seen as an open problem, more in understanding the semantics of a scene than in the arm control itself, and work has only recently begun to see success in this area with the advent of deep labelling networks.

Overall, it seems like an interesting exploratory work that collates the challenges in bringing together a seeing, hearing and acting robot. In this respect, it did meet the aims I outlined above, which is good. The paper itself is fairly well structured and feels more casual than current papers, with McCarthy injecting jokes such as his statement on the hydraulic arm. The level of the paper was good for introductory readers, but I feel it left out some important mathematical details on some of the techniques involved that would have been useful for a more complete understanding of the work. It’s an interesting read to get a perspective on the time, but I don’t particularly recommend reading it as it does not provide insight into much else.