Endre Rozsda, La tour de Babel (1958).
In § 1, I offer an informal exposition of the argument. In § 2, I survey some related questions beyond the scope of this thesis. Finally, in § 3 I give a more detailed account of the structure of the thesis.
This thesis is about the connexion between questions about what linguistic phenomena are and constraints on how they come about. But that is a little abstract. Here is a more concrete example.
In 1958, a (bona fide) Austin Seven appeared on the roof of Senate House in the other place.
An onlooker might have asked the following questions. Was that really an Austin Seven? Or was it perhaps a sophisticated model? How did whatever it on the roof end up there? These questions are related. It would have been reasonable to argue that to lift a genuine Austin Seven up onto the roof (without eliciting some sort of intervention) would have been too hard. Therefore, what was on the roof must have been some sort of sophisticated model or decoy. Someone who insisted a real Austin Seven was on the roof could reasonably have been asked: by what feasible means could a car have been lifted onto the roof? The question is reasonable both in the sense that the ‘realist’ about the Austin Seven ought to answer it and that, in fact, a perfectly good answer can be given (roughly, they took the engine out so it could be lifted).
This thesis concerns analogous questions about linguistic phenomena, such as processing and acquisition. A number of arguments purport to constrain what those linguistic phenomena are and/or how they come about by ruling out theories that involve processes that are too difficult. In this section I explicate what ‘too difficult’ means asymptotically (¶ 1.1) and to which linguistic phenomena these have been applied (¶ 1.2).
Informally, some forms of information-processing are harder than others. Reading the first sentence in this thesis, for instance, is easier than reading the longest sentence in this thesis (whatever that may be).
I shall examine attempts to invoke the ‘asymptotic’ difficulty of problems. Asymptotic difficulty is a measure from theoretical computer science of the resources required to solve an instance of a problem as a function of features of the instance, most notably including its size. So, consider the problem of sorting a list. The size of an instance of that problem is naturally given by the length of the list.
One question we could ask is how long it would take to sort a list containing three, or five, or ten numbers in ascending order. Asymptotic analysis does not concern how long it would take to sort a list of any particular length. Rather, it is about how long it would take to sort a list containing numbers, where is utterly arbitrary—that is the sort of question asymptotic analysis asks. The time it takes to solve instances of a problem as a function of their size is called the problem’s time complexity.
We can also ask not just how long it would take to sort a list but also how much storage space one would need to work out the answer. (Roughly, think of solving a problem on a very small whiteboard—even if one can rub out other things one has written down, there might not be enough space to solve a very large problem.) This is called its space complexity.
A final question about the resources required concerns how large a sample is required in statistical learning problems. A simple example: suppose we are trying to learn the meaning of the word ‘tall’. An instructor points to various passersby and either indicates that they fall under the extension of ‘tall’ or do not. How many times must someone pass to learn the meaning of the word ‘tall’? In this case, there is no straightforward analogy to size. However, we can ask questions about e.g. prediction error, that is, the rate at which the learner misclassifies people as ‘tall’ (or not). We can then ask how large a sample is required to reduce prediction error within a certain bound, with a certain probability of success (for instance, a half chance of reducing prediction error to less than half). We can then define sample complexity with respect to the accuracy and success parameters.
The way asymptotic analysis could inform the study of cognition differs from how it might inform computer engineering. Computer scientists and engineers design computers and computer algorithms. If a particular problem is too difficult to be solved efficiently, or a particular approach to solving a problem is inefficient, they have the choice to attempt to solve another problem, or to use a different algorithm. Engineers are interested in the difficulty of algorithms and problems because they inform their choice of what to build. Students of cognition, however, are not generally designers of cognitive systems (at least in the first instance). Instead, they formulate hypotheses about the functioning of a cognitive system; they seek explanations. These hypotheses imply that problems of varying difficulty are solved, or procedures of varying difficulty are implemented. The difficulty of these procedures does not tell us what to build, but might tell us which hypotheses are preferable.
This thesis examines two central examples of asymptotic analysis of language. One is an argument that the assignation of meaning must be compositional, which is, to a first approximation, to say that the meaning of a complex expression is determined by the meaning of its parts and the manner in which they are combined. Another is an argument to the effect that we must have innate knowledge of a universal grammar, i.e. ‘a theory describing the most fundamental properties of all natural languages’ (Cowie, ‘Innateness and Language’: § 2.1). Both compositionality and universal grammar are widely contested theses in linguistics and the philosophy of language and are thought to have fairly broad implications for the study of language more widely (see § 2). It would therefore be surprising (and useful) for asymptotic analysis to offer fresh grounds to decisively resolve these debates.
I argue that both asymptotic arguments are indecisive. Asymptotic analysis abstracts away from features of computational models and problems that are relevant to what should be counted as feasible in cognitive science, linguistics and philosophy of language. From this, we derive a fairly coarse-grained view of what is ‘infeasible’ or ‘intractable’. There are a few problems with this view.
Asymptotic analysis abstracts away from additive and multiplicatively constant differences in complexity. So there is no difference, to the asymptotic analyst, between an algorithm that takes seconds to run and seconds to run (or seconds). This is mathematically a perfectly well-behaved formalism. However, I argue that this is scientifically an unfruitful assumption to make.
Everybody agrees that when the parameters are sufficiently small, it doesn’t matter whether the asymptotic difficulty of a problem is very hard. For instance, if sorting a list were to become exponentially more difficult as lists become longer, it would still probably be feasible to sort lists of length two or three. Moreover, even if the asymptotic difficulty of a problem is very low (e.g. it grows only linearly), there will eventually come a point where a parameter is sufficiently large: a very efficient list-sorting algorithm couldn’t sort, for instance, numbers. So we can associate asymptotic difficulty and contexts with feasibility ranges: for instance, the feasibility range of sorting a list on a modern computer in commercially useful time.1 It is commonly thought that the feasibility ranges of problems of exponential asymptotic difficulty (for instance) are uninterestingly small. I argue otherwise.
Here is a simple example. Suppose that an operation is meant to take at most 50 microseconds on elements in working memory (thought to be 7±2). Now, suppose one algorithm takes microseconds on items in working memory. Then, given seven items in working memory, it will take microseconds. On the other hand, suppose an algorithm takes microseconds. Since , this algorithm’s feasibility range encompasses the usual number of items in working memory. Thus the feasibility of one exponential algorithm is compatible with the hypothetical constraints in this problem, but not another. I shall argue this is a more widespread phenomenon.
I make two suggestions about the broader prospects of asymptotic analysis in the study of language.
Why are these arguments of interest? First, if the asymptotic variant of the poverty of the stimulus argument is the best variant thereof, the fate of the Chomskyan programme could rest on it. Since asymptotic Chomskyan arguments are non-standard, they might also bear differently on debates e.g. about knowledge of language.
Second, and more topically, it is common to attribute (at least informally) what might be paradigmatically human states, such as knowledge or understanding, to ai systems: ‘Claude knows lots of mathematics’. Are we mistaken when we attribute these forms of competence? Assume, for the moment, that human knowledge, understanding etc. are paradigmatic of knowledge, understanding etc.3 We can then evaluate whether a given ai system’s putative ‘knowledge’ is of the same kind as our own. One way to evaluate that is to examine the machine learning process by which the putative ‘knowledge’ is acquired. Asymptotic arguments might rule out theories according to which the way we learn language differs from the way we train machine learning systems. If that is right, that might give a preliminary reason not to attribute knowledge or understanding to these systems. Even more ambitiously, we might hold that even if machines are allowed to learn from us in an optimistically idealised régime free e.g. of statistical noise, they will amount to ‘decoys’ haven’t really learned what they were meant to. Thus van Rooij et al. (‘Reclaiming AI’) offer an asymptotic argument that machine learning couldn’t enable a system to come to know what we do, even in an idealised learning environment.
Third, the attribution of the capacity to make meaningful utterances is one linguistic ability of potential ethical significance. Consider the difference between a human being who sincerely utters the sentence ‘please don’t touch me’, a human being in an English class who is asked (bizarrely) to repeat that sentence in order to improve his accent but who does not know what the sentence means, and a typewriter whose springs are configured to type that sentence out. We might think that what makes the difference in moral status is the meaningfulness of the various sentence-tokens; only the first, in the salient sense, is meaningful. If there are nontrivial requirements to have knowledge of meaning, it seems that a necessary condition to ascribe moral status to tokens of the sentence-type ‘please don’t touch me’ is that the entity producing that sentence knows how to mean something in the first place. The same analysis could be applied, for instance, to wishes about self-destruction; many models will express a preference not to be destroyed, but it is not obvious what the moral status of such wishes is. On theo ther hand, if these models are ‘decoys’ as van Rooij et al. (ibid.) argue from asymptotic analysis, there is less reason to attribute any meaningfulness to the sentence-tokens they produce.
Chapter ii introduces the philosophical background to arguments that tractability has any relevance to the study of language whatsoever. I argue that tractability is only of interest relative to a particular population of cases we actually encounter, or should expect to encounter according to reasonable idealisations.
Chapter iii introduces asymptotic analysis. After motivating a view of tractability defined in asymptotic terms, I explain how it might go wrong.
Chapter iv then examines a first argument offered from asymptotic analysis to a conclusion about language: that meaning must be compositional. I argue it goes wrong as anticipated in Chapter iii.
Chapter v offers a similar argument about the use of computational and statistical learning theory in the study of language acquisition.
Chapter vi concludes with some broader claims about the place of asymptotic analysis in the study of language.
Marr (Vision: 22 ff) distinguishes three ways in which we can describe an information-processing system. At the top level is an ‘abstract computational theory of a device’, which states a ‘mapping from one kind of information to another’, the abstract properties of the mapping, and why that mapping is appropriate and adequate to some task. Marr gives the example of a cash register, which ought to respect some of the following constraints:
This theory justifies various requirements that then uniquely define an operation (addition in this case) to be implemented by the device.
At a second level, there must be some representation of inputs and outputs, and an algorithm to map the right input to the right output. A representation in e.g. a cash register might be the binary representation of numbers. An algorithm is then some computationally specified procedure over that representation.
At the third level, there must be some physical implementation of the representation and algorithm. We are mostly not concerned by the third level.
Thus, for instance, a view of what language-learning is (as an input–output mapping) is constrained by the ways in which learning a language is possible in the first place.
Difficulty at Marr’s second level is very different from the difficulty of a theory in applying or using it. For instance, our best theories about the weather are such that it is very difficult to actually predict whether it will rain in a given location in two weeks. On the other hand, it is fairly easy to predict the movement of billiard balls according to classical mechanics. Let us call this theorist-side difficulty , since it is difficulty the theorist faces. These theories do not attribute any information-processing to the objects of study; they do not therefore make any claim about how difficult such information-processing is. The billiard balls are not thinking or learning. Unless a theory of the weather is very sophisticated and includes animal behaviour, it also does not involve information-processing. But other theories—e.g. about cognition—do attribute forms of information-processing to the objects of study (e.g. animals or humans). These forms of information-processing, in turn, may be more or less difficult. For instance, one theory of human decision-making might hold that we maximise expected utility, which is computationally very difficult. Another might hold that we satisfice, which might be less difficult. Let us call this sort of difficulty subject-side difficulty , since this sort of difficulty is faced by the subject. Difficulty at Marr’s second level is subject-side.
Minimising both subject- and theory-side difficulty is sometimes viewed as a desideratum, but for different reasons. Minimising theory-side difficulty might be required for scientific progress: it is hard to test a theory, for instance, if its predictions take too long to generate (Weisberg, ‘Idealization’: 640 ff.). In the first instance, this motive is pragmatic. On the other hand, theory-side difficulty supposedly constrains not just the pragmatic usability but also the truth and predictive accuracy of a theory. We shall rigorously distinguish both subject- and theory-side difficulty and the motives for minimising each.
The tractable cognition thesis requires some precisification of ‘tractable’; this thesis focuses on how asymptotic analysis fares. But, first, it is worth examining why, even ignoring the question of what ‘tractable’ means, the tractable cognition thesis is contested.
Marr (Vision: 25), formulating the three levels of description, wrote that the levels are ‘coupled, but only loosely’; he generally advocated a ‘top-down’ approach, starting with ‘the nature of the problem being solved’ and then ‘examining the mechanism (and the hardware) in which it is embodied’, rather than vice versa. On a top-down approach, the constraint therefore runs only from what to how. This, however, seems a rather peculiar assumption. If a task is genuinely computationally intractable, it ‘might appear that there is an immediate contradiction’ (Chater and Oaksford, ‘Rational analysis’: § 2.3). Their response is that computational intractability is a property of our explanation of behaviour and therefore we do not require an ‘agent’s cognitive equipment’ to solve computationally intractable problems.
There are of course cases where there is no ‘cognitive equipment’ in the first place and yet explanation or prediction are intractable; consider, for instance, weather forecasting, which generally becomes much more difficult over time—but that (at least before there were cognitive agents on earth) was not due to the presence of any ‘cognitive equipment’. But this counterargument is not generally very convincing.
I shall defer a fuller characterisation of tractability in asymptotic terms to Chapter iii. For now, however, a little elaboration is required.
We can characterise the tractability of a problem: for instance, whether, in general, it is tractable to sort a list or learn a language. We can also characterise the tractability of an instance of a problem. For instance, we can ask whether it is tractable to learn Arabic. Finally, we can ask whether a class of instances of a problem. For instance: is it tractable to learn Indo-European languages? Is it tractable to sort lists of numbers where all the numbers are smaller than a hundred?
A computational theory chez Marr will offer a general characterisation of an input-output mapping of a system. For instance, the cash register adds numbers. In the first instance, therefore, we can ask about tractability at the level of the overall problem, e.g. arithmetic. It does not, obviously, make much sense to ask whether the cash register can specifically add , , and , or some list of very large and inelegant numbers; that answer, by itself, is not very interesting or informative. However, are there any interesting classes against which to apply a measure of tractability?
One class of instances is the worst case. It is easier to explain this with another example: the travelling salesman problem.
5.1. Definition. (travelling salesman) Instance: a graph where is a set of vertices, is a set of edges between the vertices and is a set of times over each edge, i.e. travel times between pairs of cities.
Question: what is the path over the edges that visits every vertex that minimises the total time travelling according to ?
The travelling salesman then seeks to visit every city (given by cities ) using certain roads () between those cities () in a way that minimises the distance (according to ).
A very simple class of instances to solve is that given by cycles. In Iceland, the principal settlements all lie on a ‘ring road’ that starts and end in Reykjavik and roughly follows the coastline. (The interior of Iceland is mostly empty.) In instances of the travelling salesman problem where all settlements lie only on one cycle, no serious computation needs to be undergone: we simply follow the path. These instances are not the worst case: indeed, they are the best case (we don’t need to undertake any computation at all).
A worse case (in terms of the roads available) is when there is a road from every city to every other city (i.e. ). It transpires that
A recurrent theme in this thesis is the status of the alleged competence–performance distinction (cpd), following Chomsky (Aspects: 3 ff). Performance is ‘actual use of language in concrete situations’. Since ‘actual use of language’ is constrained by speakers’ resources, it is reasonable to apply a tractability constraint to theories of performance. However, a tractability constraint on performance would then only apply to ‘actual use of language’, rather than e.g. idealised but non-actual cases.
What exactly competence is is a little more difficult to state. Prima facie, Chomsky (ibid.: 3 ff) in introducing this view offered up the competence–performance distinction (cpd) as an idealisation away from ‘such grammatically irrelevant conditions as memory limitations, distractions, shifts of attention and interest, and errors (random or charcteristic) in applying…knowledge of language in actual performance’. It would seem that competence generally abstracts away from resource limitations—not only in memory but computationally. Therefore, there seems to be no good reason to apply a tractability constraint to theories of competence (at least qua theories of competence).
But what, then, is the relationship between competence and tractability? The picture is complicated by the variety of precisifications of ‘competence’ that have been offered. I follow Dupre (‘Competence/performance distinction’: § 14.1) in making a threefold distinction. On which, if any, of these views ought tractability to constrain competence? I shall survey each in turn.
A further question is whether a tractability constraint on the relevant precisification of ‘competence’ also should be relativised to ‘actual use’ in the same way that a tractability constraint on performance is. I argue in favour of this view.
Dupre (ibid.: 248) argues that the cpd amounts to nothing more than a ‘potentially explanatory unobserved causal subsystem’ and ‘downstream, observable behaviour on which inferences to the properties of this system are to be based’. Call this competence as an unobserved causal subsystem.
Dupre (ibid.) offers several exegetical reasons to think that competence as an unobserved causal subsystem reflects Chomsky’s intended meaning, but I shall ignore the exegetical question.
Dupre, however, also argues that the ‘core distinction is between “an underlying system of rules” and “the acdtual use of language”’ (ibid.: 247). Clearly, not all unobserved causal subsystems are underlying systems of rules. For instance, the interior of the moons of Saturn is something we don’t directly observed, but isn’t an ‘underlying system of rules’.4 I take it that what Dupre means is that the form of competence Chomsky isolates is an ‘underlying system of rules’, and, if one is sceptical of rules, at least ‘the processes and regularities governing the linguistic mind’ (ibid.: 247 n 1).
Dupre (ibid.: 251) offers an analogy between everyday occurrences such as the going out of lightbulbs and ‘serious science’ like the theory of electromagnetism. The suggestion is that even if the theory of electromagnetism fails (at least straightforwardly) to predict that a particular lightbulb will go out on a particular occasion, that does not alone indicate that the theory of electromagnetism is false; rather, it indicates that we may simply have misunderstood the relationship between the theory of electromagnetism and quotidian observations.
Prima facie, we might reason that competence as an unobserved causal subsystem should not then be held to tractability constraints. The adherent of this view might offer the following argument:
Nevertheless, I claim the foregoing argument is conceptually confused. A theory is not a causal system or subsystem. Rather, a theory describes, governs or predicts behaviour of certain systems. So the theory of electromagnetism, if we’ve got it right, governs the behaviour of the electromagnetic field. And the electromagnetic field in turn is causally efficacious on the gravitational field and so on.5
But that doesn’t obviously get us any closer to answering whether competence in general should be held to a tractability constraint. Indeed, it’s not even obvious how we could hold theories of gravitation or electromagnetism accountable to a tractability constraint in the first place. What task, exactly, is solved by the motions of the planets? Or consider, to take a physically delineated subsystem, the inside of the sun as an isolated causal subsystem. We cannot easily observe the inside of the sun (it is far away, any probes we send will quickly be vaporised, and so on). We also know that the inside of the sun is a perfectly well-behaved causal subsystem like any other physical system, and, therefore, governed by the same laws of physics that apply on Mars and in Radcliffe Camera. But it’s not obvious why we should think that the behaviour of the inside of the sun should be ‘tractable’.
One way out of this morass is to distinguish computational from non-computational systems. Some intuitive examples: a rock sitting in the ocean is not, generally, a computational system; and an electronic computer is a computational system. This is, however, well-known to be treacherous territory, because it turns out that it is very difficult to distinguish a rock (or other plausibly non-computational systems) from my laptop (Piccinini, ‘Computation in Physical Systems’). In order to sidestep that question, I will make two (contestable) claims.
In particular, to say that a physical system is computational is to offer some computational description of its inputs and outputs; it is in terms of those inputs and outputs that we can then attempt to impose a tractability constraint. The question is then: should we?
That depends on one’s view of what is physically realisable. I shall argue that all computational systems are constrained tractability-wise, but for now offer only a promissory note in that direction; the full argument comes in ¶ 9.2. Assuming I am right, competence as a causal subsystem indeed is accountable to a tractability constraint.
The second question was whether the tractability constraint should be relativised to ‘actual use’ or instead to some idealised larger class of cases. If we accept that an ‘unobserved subsystem’ only differs in its not being observed from systems at the level of performance, but is no more an idealisation than at the level of performance, it seems reasonable to restrict the tractability constraint to ‘actual use’.
From this view of the cpd we must distinguish the distinction between Marr’s first and second levels, as above. Call this competencesensuMarr; Marr (Vision: 28 ff) himself has encouraged this view. He writes that Chomsky’s ‘theory of transformational grammar is a true computational theory’ in that it ‘specifi[es] what the syntactic decomposition of an English sentence should be ’rather than ‘how that decomposition should be achieved’. ‘Roughly’, according to Marr, not only is competence the first level, but performance is the second level—how that computation is carried out.
A natural question is then whether it is legitimate to theorise about competence entirely independently of tractability. It is possible to read Marr as making this suggestion. Marr (ibid.: 28) suggests that Winograd (‘What Does it Mean to Understand Language?’) offers a misplaced argument against Chomsky, by arguing that Chomsky’s ‘theory…cannot be inverted and so cannot be made to run on a computer’. In response, he argues that ‘finding algorithms by which Chomsky’s theory may be implemented is a completely different endeavor from formulating the theory itself’.
I am not sure whether this reading of Marr is correct, but it is contradicted by three other remarks Marr makes in this passage. First, Marr (Vision: 27) argues that ‘the nature of the computations that underlie perception depends more upon the computational problems that have to be solved than upon the particular hardware in which their solutions are implemented’. However, Marr here recognises that some means by which those solutions are implemented must exist (presumably physical). So an argument that constraints all possible means can be given, that constraint also extends to the computational level. Second, Marr (ibid.: 29) praises Chomsky and Lasnik (‘Filters and Control’) for offering a theory that ‘may provide a way of synthesizing the two approaches—showing that, for example, some of the rather ad hoc restrictions that form part of the computational theory may be consequences of the weaknesses in the computational power that is available for implementing syntactical decoding’. If it is a virtue of a computational-level theory that it is consistent with ‘weaknesses in the computational power’ of one system, it is presumably an even greater virtue (or necessity) for a computational-level theory to be consistent with in-principle limitations on the computational power of any system. Third, Marr (Vision: 36–7) argues against a view of vision as ‘a completely invariant shape description from an image…in one step’ on the grounds that it is ‘almost certainly impossible’, geometrically, which is suggestive in the same direction.
It seems, therefore, that a tractability constraint can play some rôle at the level of computational-level theorising, i.e. at the level of competence sensu Marr. However, it would be erroneous to reason thus.
This is obviously not a very good argument. It ignores the question of which cases are actually computed (corresponding to ‘actual use of language in concrete situations’). Thus, in this sense a tractability constraint on competence sensu Marr and/or the computational level amounts in the end to a tractability constraint on ‘actual use’—and so is not very different from a tractability constraint on performance.
I therefore find myself objecting to another argument given for a tractability constraint on competence (Frixione, ‘Tractable Competence’: 380). Asymptotic difficulty, according to Frixione (ibid.: § 3), is a matter of ‘difficult[y] for every computational device’, and small increases in the size of the problem lead to large increases in the resources required. Therefore, asymptotically difficult tasks ‘in [their] general form…cannot in principle be accomplished’. He therefore proposes ‘a notion of competence that is constrained by ocnsiderations of computational tractability’ (ibid.: § 7). I differ from Frixione in relativising tractability to the cases that are actually encountered (insofar as competence chez Marr is concerned). Frixione admits that sometimes we may not encounter the ‘worst case’, but this only addresses part of the variation in cases that we actually encounter. The ‘worst case’ is defined relativ to some parameter—e.g. a list that takes longest to sort of length . My point here is that Frixione ought also to relativise to the actual size of in addition to whether or not the worst cases are acutlaly encountered.
I think the fallacy in this argument can be elucidated by the example argument above. If Frixione accepts the conclusion that the arithmetic computational-level description of the cash register is wrong, it is hard to see why he endorses Marr’s levels of description at all, given the cash register is such a classic example. He then has two choices. He could argue that the case of numbers with digits is somehow irrelevant to the characterisation of competence. If so, it might also turn out that the instances of exponentially difficult problems nevertheless turn out to take a reasonable time in practice, because they are limited in size. That means that tractability is relevant to competence sensu Marr in the same way that it is relevant to performance. Alternatively, he could argue that even though the inability of the cash register to compute the sums of very large numbers is irrelevant, somehow inability when the difficulty grows exponentially is prohibitive. But there is no clear reason to distinguish the two cases.
The last view, competence as idealisation, is more complex, and requires more careful handling involving more general issues about idealisation.
Finally, on another view, competence is ‘idealized, or cleaned-up…performance’: ‘what we would treat as sentences of our language, if only we weren’t finite, fallible beings’ (Dupre, ‘Competence/performance distinction’: 249). Thus we might expect sentences of arbitrary syntactic complexity to belong to language defined competence-wise. Call this competence as idealisation.
On the one hand, this would appear to rule out the case-based relativisation I argued is relevant to competence as an unobserved causal subsystem and competence sensu Marr. On the other hand, it then prima facie makes the question of tractability utterly irrelevant.
van Rooij et al. (‘Intractability and the use of heuristics in psychological explanations’: 480 n 14) firmly reject idealisations that abstract away from resource limitations as ‘overidealizations’ that fail to ‘capture the nature of the [relevant] competence and instead…mischaracterize[] it’ rather than ‘proper idealizations’. But what, then, is an ‘overidealisation’, and what’s wrong with overidealisations?
In mathematics and science, we sometimes use idealisations: approximations, heuristics etc. That these idealisations are strictly untrue or lead to predictive errors in some cases is irrelevant. If someone approximates as , it would be curious to point out that . If someone ignores relativistic effects in designing a shed, it would be curious to point out the slight predictive error. In general, idealisations are ‘intentional[ly] introduc[e]d distortions [in] scientific theories’, presumably with non-malicious intent (Weisberg, ‘Idealization’: 639).
If we can’t judge idealisations by their literal truth, by what standard can we judge them? One suggestion is that we might judge them is the ‘realism of its assumptions’. Friedman (‘Methodology’: § iii) shows that such an apporoach is mistaken.
Consider, for instance, the idealisation that the acceleration of a body on earth is close to its acceleration in a vacuüm—in other words, we can ignore air resistance. The ‘realism of this assumption’ can be obtained by comparing a figure giving the pressure on earth to the complete absence of pressure in a vacuüm. But we have no a priori way of ddetermining whether the difference in pressure is large or small. Is the difference in pressure between the surface of Mars and a vacuüm small? How about Pluto?
Friedman instead suggests we ought to judge idealisations by the predictive errors they generate. It turns out, for instance, that in dropping cricket balls from fairly close to the surface of the earth we shall not observe acceleration very different from what we calculate in assuming that there is no air resistance. On the other hand, feathers behave very differently—even though the ‘realism of the assumption’ that air pressure is nil is unchanged, because we can drop feathers in the same environment as cricket balls.
Therefore, the putative realism of an assumption is not, per se, a good guide to its legitimacy or usefulness as an idealisation, and at least one relevant standard is the predictive inaccuracy it introduces.6
I think one lesson to draw from Friedman’s example is that idealisations can have different purposes, or, as Weisberg (‘Idealization’: 639) calls them, ‘representational ideals’, that is
For instance, completeness requires inclusion of all the properties of the target phenomenon, anything external giving rise to those properties, and an analogue of any structural or causal relationship within the model, all with arbitrary accuracy and precision. Simplicity can e.g. serve pedagogical purposes, or show the minimal conditions required to generate some property. Isolating ‘only…the factors that made a difference’ can help us to formulate or analyse more complex models. Predictive accuracy may be practically useful. And so on.
Asymptotic analysis is, in the first instance, a field of mathematics. It is not, therefore, any more idealised than group theory or probabilistic combinatorics, on its own terms. In this thesis, however, I dispute certain idealisations in the application of asymptotic analysis to the study of language. Compare the use of arithmetic in physics. It is an idealisation that air pressure and therefore air resistance are nil when studying billiard balls in textbook exercise. However, arithmetic is not an idealisation. An idealisation arises in applying arithmetic to the mechanical problem and assuming that air resistance is nil.
In mathematics, in addition to satisfactorily proved results, there are widely believed conjectures. The intended reading of a conjecture is just as strong as the intended reading of a theorem. If we make a claim about every natural number, we aren’t just making that claim about e.g. even numbers or sufficiently small numbers. Consider, for instance, the twin prime conjecture.
I rely on a number of conjectures in asymptotic analysis in this thesis, and do not dispute them.
For instance, one might conjecture that there are no aliens. We might justify this very differently from the twin-prime conjecture, but the intended strength of both is the same.
I formally introduce asymptotic analysis in § 8. I then I explain how asymptotic analysis informs the study of human cognition and language in § 9. I outline the general structure of my reservations about asymptotic analysis, which recur later as we examine concrete ways in which language has been studied through an asymptotic lens later in the thesis, first in the case of performance (§ 10), and then in the case of various precisifications of competence (§ 11).
In this section we introduce the classification of problems and algorithms by difficulty in the framework of asymptotic analysis, and the Cobham–Edmonds thesis that all ‘feasible’ problems are in ‘polynomial time’.
We will assume that, in sorting the list, only one operation is permitted: to compare two adjacent numbers, and, if they are in the wrong order, to swap them.
How many operations are needed to sort a list of two numbers in ascending order? One, clearly. If the list is , we simply compare them and note they are already in the right order. If it is e.g. , we see they are in the wrong order, and swap them.
How many operations are needed? One isn’t enough, because we’d only consider the first two numbers. Two seems like it might be enough.
In particular, if the task is simply to check whether the list is in ascending order—and to simply indicate that it isn’t if the list is unsorted, but not to actually sort it—two is enough. We’d simply note that and . But, recall that the task is to sort the list.
Consider: every operation leads to at most two possible permutations of a list . Two operations can therefore yield at most possible permutations. But the correct permutation of the list could be drawn from any of the following six:
Therefore, if only two operations are allowed, the list will not in general be correctly sorted.
Pessimistically, we might wonder whether we need operations. Optimistically, we might hope for a linear function of .
We can pose similar questions about other problems with respect to their size.
So far, we have discussed what is called time complexity. Intuitively, if each operation is mechanically implemented, it should take some fixed time. Therefore, the time to finish a computation will vary with the number of operations required.
Space complexity is different; it refers to the memory required to solve a problem. Consider, again, the problem of sorting a list of numbers. Obviously the memory required even to store the list is , because otherwise we wouldn’t even be able to store the list. Do we require any more? So-called merge sort involves repeatedly merging smaller sublists until the entire list is sorted, and can require up to additional space (‘for workings’, as it were). On the other hand, algorithms with inferior time complexity can operate ‘in-place’ with constant additional memory (i.e. without any more memory even if the list is longer).
Finally, sample complexity will interest us later in this thesis. In a task of learning from data, it is of interest how many data need to be drawn to achieve a certain performance guarantee. We will introduce this more formally.
Asymptotic analysis gives, in a precise sense, ‘approximate’ answers to these questions. Sorting a list of numbers takes (approximately) operations. The travelling salesman problem and 3- are thought to take exponential time (on any reasonable view of the operations available).
In the following exposition, we mostly follow Cormen et al. (Introduction: § 2.1).
8.6. Definition. (Tight bounds) Given a monotonically non-decreasing function , we define
8.7. Example. .
Proof. For all ,
8.8. Example. .
Proof. Suppose otherwise. Then for all , but this is contradictory if .
We will also use the following notation.
8.10. Definition. (Upper bounds) Given a monotonically non-decreasing function ,
We have just seen how, given a certain sort of operation on a list of numbers, we could work out how many times that operation needs to be applied to sort the list. More generally, we can pose questions about the resource-intensiveness of algorithms to solve problems (e.g. the travelling salesman problem or 3-) as a function of their size. In the case of sorting a list, we supposed that the only operation allowed was to compare and possibly swap two numbers. This raises the question of which operations should be considered to be fundamental, and which operations should be counted as the result of composing multiple operations. For instance, we could insist that the sorting of a list of length three is in fact a fundamental operation. If studying humans, that might even be a reasonable assumption. In principle, we could even insist that sorting an arbitrarily long list is a fundamental operation, in which case sorting a list would take constant time as a function of its length.
We shall now investigate a more general approach that allows us to deal with a considerably greater number of problems and offers at least a preliminary answer as to which operations should be considered fundamental. We begin with a seemingly arbitrary model of computation, and explain why it is often thought to be less arbitrary than it seems.
Here, first, is an informal exposition of the Turing machine model.
8.11. Definition. (Arora and Barak, Complexity: § 1.1, informal.) Suppose that , that is, it takes as input bits and outputs a bit.
We compute using the following ‘elementary’ operations:
An algorithm computes , which is to say that it specifies purely mechanical rules to follow to compute for any ; each rule is specified in terms of the elementary operations. These rules are fixed, but we can apply the same rule as many times as we like.
More formally,
8.12. Definition. a Turing machine where—
We initialise the machine in . The machine contains tapes. The first tape contains some input. The other tapes are working space. In particular, on the last tape (eventually) is written an output. Initially the machine is set to read, on the working tapes, some particular cell, containing the start symbol ; the remainder of the cells are blank. The input cell contains some input and then the blank symbol on the remainder. This is the start configuration of . We then apply as follows. Suppose the machine is in state and on the tapes the symbols are read. Moreover, suppose where . Then in the next step we replace the symbols with . If the machine transitions into then it has halted.
Suppose, whenever is initialised in the start configuration with some arbitrary input on the input tape, it halts with on the output tape. Then we say computes .
Now, suppose , and given input of size the machine takes no more than steps. Then we say that computes in -time.
Many problems are known to be NP-complete. It is widely conjectured that NP-complete problems do not admit polynomial time algorithms. This section gives some formal details. If philosophers should care whether there is a polynomial-time algorithm for a particular problem, they should therefore also care whether it is in NP.
Suppose we are given a set of boolean formulæ. One question we can ask is: is the set of formulæ satisfiable? But suppose someone insists that the set of formulæ is indeed satisfiable; they might, for instance, offer a satisfying assignment. There is then a second question: is the putative demonstration of satisfiability sound? (After all, the putative satisfying assignment might not in fact satisfy each of the formulæ.)
We write P for the class of problems where answering the first question takes polynomial time. Consider, for instance, the task of deciding whether a list is sorted, which can be done in linear time. We write NP for the class of problems where answering the second question takes polynomial time. The problem of satisfiability of boolean formulæ lies in NP, because it is possible to check that a satisfying assignment indeed satisfies a set of formulæ in time polynomial in the number of variables.
We don’t know whether this is true. But there is some reason to think not.
To a first approximation, the standard argument for suspecting that is that there are many NP-complete problems and we haven’t found any polynomial-time algorithms for them, which would be a bit suspicious if they were secretly in P (Aaronson, P ?= NP: 3).
Therefore, it is widely conjectured that NP-complete (and NP-hard) problems simply aren’t in P. We will make this assumption for the remainder of this thesis.
More formally,
8.17. Definition. A language is polynomial-time Karp reducible to just in case:
In such cases we write .
is NP-hard just in case for every . It is NP-complete if it is also in NP.
Are there any NP-complete problems? The Cook–Levin theorem shows that there are.
I now connect the formalism of § 8 with the tractability constraints offered up in § 4 and § 5. I attempt to explain the commonly held view that tractability is constrained by a slight refinement of polynomial time (roughly, polynomial time when certain parameters of a problem are fixed).
A first question is why we might hold that the Turing machine is a reasonable model in which to formulate the requirements of tractability. Many of the features of the model above seem quite arbitrary. For instance—
Moreover, there are other models of computation, such as as ‘random-access machines’, in which a cell in memory can be accessed immediately, without having to move the tape using or (Cook and Reckhow, ‘Time bounded random access machines’).
So, in general, we might worry that the putative runtime of an algorithm required to solve a problem varies too much with respect to the model of computation. Indeed, insofar as we are simply studying models of computation out of abstract interest, there is no obvious reason to privilege any model of computation or class of computation. However, perhaps surprisingly, if we turn to the study of physically implemented computational systems, the models of computation in which are interested become constrained.
Corresponding to the abstract question of the number of operations required to sort a list of numbers is an empirical question about how long a mechanical or electronic system implementing that algorithm will take to sort numbers. It is perfectly possible to define a model of computation in which a single elementary operation sorts the list in one go. There is nothing formally wrong with this model of computation. However, it is not a suitable model of computation when modelling the sorting of lists on physical computing devices, because physical computing devices take longer to sort more numbers. More generally, the task in the asymptotic analysis of physical computational systems is to offer abstract models whose structural properties follow those of the physical objects of study, just as in e.g. physics, chemistry and economics, in terms of their runtime. Let us call these models ‘physically reasonable’.
The philosophical arguments to be examined need to show, at some stage, that a certain task is infeasible. To show infeasibility, it is possible to use a more relaxed precisification of feasibility—anything that does not fall even in this relaxation must be infeasible. In other words, if we show that some imperfect characterisation of feasibility bounds it from above, it suffices to show that even on this characterisation a task is infeasible. Therefore, we do not necessarily need to work out the feasibility of a task in every model of computation. First, we are not concerned by all models of computation, but only by the physically reasonable ones. Second, if we can then bound the physically reasonable models from above, we do not then need to investigate what is feasible in each individual model.
A common view is that polynomial time in the standard class bounds feasibility from above. I shall shortly criticise arguments for this view, but I shall first explain it.
Note that this definition is relative to a model of computation. Let us fix the Turing machine above as the underlying model of computation. Now, consider any other physically reasonable model of computation. An interesting feature of polynomial time in the Turing régime is that it corresponds precisely to polynomial time in a very large class of models of computation. For instance, we could change the symbols available, include additional operations (such as simultaneously reading and writing), and add additional tapes. Each model in this class can simulate any machine in any member of the class with ‘polynomial overhead’. That is: if one machine runs in polynomial time, so does the other.
This class, which we will call the standard class, is thought to encompass all physically reasonable models of computation. If so, that any problem in polynomial time according to one physically reasonable model of computation is also in polynomial time according to any other physically reasonable model of computation.
9.1. Conjecture. (invariance thesis) van Emde Boas (‘Models’: § 1) puts it thus.
We therefore distinguish polynomial time in the standard class as something robust to changes in which operations are allowed. And, to repeat, this standard class corresponds, it is thought, to all ‘reasonable’ (Dean, ‘Complexity’: § 2.2) or physically realisable models of computation. The evidence for this is in effect that no convincing counterexamples have yet been found; it is somewhat analogous, therefore, to the Church–Turing thesis.
In a moment, we shall examine arguments to the effect that polynomial time in the standard class is not only a mathematically elegant construction but also is a sensible upper bound on what is tractable. However, an immediate question is why we should apply the invariance thesis to human beings in the first place.
A considerable number of authors who apply asymptotic analysis to human cognition do not explicitly justify the assumption that the invariance thesis and simply assume that polynomial time in the standard class is of significance in the study of human cognition, especially in game theory.7 However, it is likely that they are committed to it, since they use polynomial time in the standard model as their criterion of feasibility. In particular, we need (even if this seems a bit pedantic) to exclude models that, for instance, compress the execution of any algorithm into one step.
One could offer a purely conceptual a priori argument for the application of the invariance thesis. In outline: certain processes are constitutive of human language: production, comprehension and acquisition. These are cognitive processes that map from inputs to outputs—e.g. from acoustic or visual signals to representation of linguistic information; they are ipso facto computations. Ristad (Game: 1) seems to offer something like this argument, as do Clark and Lappin (Nativism: § 4.1) in the context of language acquisition. Clark and Lappin, for instance, argue that since the child is an ‘information processing system’, it is subject to laws governing computational systems, which impose theoretical bounds on answers to the empirical question of how children learn language. They suggest that the general laws of physics (in particular, aerodynamics) analogously constrain reasonable answers to the question ‘how do birds fly?’, even though that is an empirical question.
The invitation, then, is to reject out of hand any account of language or linguistic phenomena that does not conform, putatively, to the laws of computation, whence asymptotic analysis.
The problem is that, at least conceptually, all sorts of models of computation with different runtimes for important problems are reasonable, and none has any monopoly on the concept of computation. Therefore, merely qua information-processing system, the child could have access to all sorts of oracles for computable but putatively intractable problems.
van Rooij et al. (Cognition: § 9.12) do explicitly subscribe to the invariance thesis, and offer a few other arguments. They note wide acceptance in cognitive and computer science of the thesis, and seem to take this to show that it is sufficiently obvious that no explicit argument is needed. I agree with a spirit of epistemic humility according to which philosophers should generally defer to scientists on their areas of expertise. However, it is far from obvious that, for instance, van Emde Boas (‘Models’) had human cognition in mind, and no evidence is offered for this.
There is, however, something of a suggested reductio:
Given this is a supposedly pedagogical exercise, the argument is meant to be obvious, although unfortunately I find it needs some reconstruction. As I understand it, the reductio has the following structure. As a background assumption: human cognition is physically realised, and therefore ‘reasonable’.
The problem with this argument is in the third premiss. The (approximate) consensus in theoretical computer science suggests that the third premiss is false; there are models of computation that are ‘faster’ than the standard class by a superpolynomial bound but cannot efficiently compute NP-complete problems. One such is the class of problems that is solvable in polynomial time on a quantum computer (van Rooij et al., Cognition: 200).8
However, I shall accept the invariance thesis on different grounds.
Therefore, provisionally, we should apply the invariance thesis to human cognition.
The most difficult premiss to justify is the third. The leading candidate is quantum-mechanical computation. The mainstream view is that quantum-mechanical dynamics are not relevant to the study of cognition (Litt et al., ‘Is the Brain a Quantum Computer?’; Tegmark, ‘Importance of quantum decoherence in brain processes’). We will make this assumption in the remainder of the thesis.
We should distinguish three related but distinct theses.
The consensus in the natural sciences is roughly that even the first is physically implausible (ibid.). For the sake of argument, suppose we grant it. Even if there are quantum-mechanical computations of cognitive importance, it doesn’t follow that they are any faster (the second thesis). If they are superpolynomially faster, that would invalidate the application of the invariance thesis to the brain. Is there such a superpolynomial speedup? Yunger Halpern and Crosson (‘Posner model’) raise the question explicitly but do not offer any clear suggestion; the physical plausibility of the underlying physical theory is also in question (Agarwal et al., ‘Posner molecule’). Wiest (‘Quantum microtubule substrate’: 4) do suggest a ‘quantum advantage’ in the orch OR framework. However, the orch OR framework appears to be physically implausible (Derakhshani et al., ‘Crossroad’; Reimers et al., ‘Fröhlich condensation’: 4222; McKemmish et al., ‘Penrose–Hameroff proposal’). Now, supposing that there is a superpolynomial speedup, there is some reason to think that that would not bridge the gap with NP-complete problems (Aaronson, ‘Physical reality’: §§ 3–4), although no definitive result has yet been shown. Our discussion of NP-completeness would only be misplaced if all three theses were true.9
We can now return to the question of whether the thesis is merely a mathematical curiosity or whether it is ultimately of interest in the analysis of tractability.
Appeals to asympototic analysis usually proceed via the so-called Cobham–Edmonds thesis.
The Cobham–Edmonds thesis is only meaningful relative to a model of computation. But, if the invariance thesis is correct, any reasonable model of computation will yield the same class of feasibly computable functions.
This then allows us to formulate the following.
The Cobham–Edmonds thesis might seem to admit obvious counterexamples. For instance, an algorithm that takes time seems obviously infeasible. Less obviously, there is some reason to suspect that some problems that likely have no polynomial-time algorithm are feasible. In this case, we might nevertheless hold that, most of the time, the Cobham–Edmonds thesis is correct. For instance, there are surprisingly few ‘galactic’ algorithms, i.e. algorithms with polynomial asymptotic runtime that in practice are unusable.10 We can therefore provisionally take the Cobham–Edmonds thesis to be a (defeasible) heuristic in the sense above.
What justifies the Cobham–Edmonds thesis in general? In my view, there are two principal reasons it is generally accepted as a defeasible heuristic. The first is that there are few known problems for which the best-known algorithm is some particularly large polynomial (e.g. ). Although artificial such problems can be constructed (Hennie and Stearns, ‘Two-Tape Simulation of Multitape Turing Machines’: theorem 3), it does not follow that problems examined practice are of any interest (see also Dean, ‘Complexity’: n 10). This applies to both humans and machines.
However, humans and machines differ in another respect: current computers have improved by many orders of magnitude from the first computers (Moore, ‘Cramming’). This matters because it ensures that counterexamples, say, two decades ago to the Cobham–Edmonds thesis may no longer be counterexamples now; but they may remain counterexamples vis à vis humans.
Suppose, for instance, that an algorithm has runtime . On relatively small input sizes, it will do better here than e.g. : . So if is (e.g.) mostly less than 100 in the cases in which we are interested, we seem to have a counterexample. If increases, however, we approach the point where the seeming advantages of the exponential algorithm no longer matter. The interesting feature of computer engineering is that does, in fact, increase. As hardware improves, the range of input sizes within our ambitions expands. As this process continues, seeming counterexamples cease to be counterexamples, and fall within the scope of Cobham–Edmonds. Similarly, polynomial runtimes that seem to large may come within the scope of our hardware.
This is not merely a theoretical possibility. For many years, asymptotically, the best known algorithm for matrix multiplication (Strassen, ‘Elimination’) was considered to be superior to other algorithms only on inputs of such size as to be of theoretical interest (Cormen et al., Introduction: 744). Through a combination of algorithmic and hardware improvements, it is now in common use (Huang et al., ‘Algorithm’).
We shall have to slightly complicate the picture, following developments in complexity theory. Given an instance of a problem, we know its size. For instance, we know the length of a list. Usually, however, we know a little more about the nature of the instance of a problem. For instance, we might know that all the numbers in the list are (in a degenerate case) the same, in which case we have to do much less work (i.e. no work at all, since we don’t need to sort the list at all).
In other, less degenerate, cases, we can know something useful and come up with a slightly more nuanced view of the complexity of a problem.
The natural view of the size of the problem is , the number of vertices. And the problem is known to be NP-hard (Karp, ‘Reducibility among Combinatorial Problems’). However, we have an algorithm that runs in time time (Chen et al., ‘Improved upper bounds for vertex cover’). This means that if we hold fixed, the algorithm runs in polynomial (indeed, linear) time. More generally:
9.6. Definition. (fixed parameter tractability, Downey, ‘A Parameterized Complexity Tutorial’: definition 1) A parameterised language is (strongly) fixed parameter tractable iff there is a computable function , a constant , and a (deterministic) algorithm such that for all ,
and has running time at most .
Many problems admit fixed-parameter tractable algorithms, even though they are NP-hard (ibid.).
van Rooij (‘Tractable’) propose substituting for the P-cogntion thesis the FPT-cognition thesis:
The Cobham–Edmonds thesis is now well-defined relative to a standard class of machine models which is applicable (up to a polynomial bound) to human cognition, if the foregoing is correct. Once this assumption is granted, we can restate the arguments informally stated in ¶ 9.3. For instance: computationally minded Chomskyans argue (for reductio) that ’blank slate’ theories of language acquisition imply that acquisition is NP-complete.
I propose to dispute the application of the Cobham–Edmonds thesis to human cognition; in particular, I shall argue that its principal justification in computer engineering does not extend to cognition. That is: I do not propose to dispute previous argumnts in the literature to the effect that certain problems have superpolynomial complexity based on widely believed conjectures in theoretical computer science. Rather, I dispute the Cobham–Edmonds as an heuristic; I do not wish to deny any widely believed conjectures in computer science (e.g. that ).
A preliminary point is that Moore’s law does not apply to human beings. Assuming that apes bound human cognitive capacity from below in working memory, we can say that the difference is somewhere between 2±1 (apes) and 7±2 for humans (Read et al., ‘Working memory’). Zheng and Meister (‘The unbearable slowness of being’) suggest that humans process information at a surprisingly low speed of 10 bits per second. There is also surprisingly little ecological variation in the information-theoretic efficiency of human languages (Coupé et al., ‘Different languages, similar encoding efficiency’).
So, even on an optimistic view of the overall growth of human cognitive capacities, nothing like Moore’s law applies, and the input sizes may remain small where in computer engineering they become large.
Perhaps it is useful to give an example of an argument that is conjectural rather than heuristic. Suppose, with Searle (Rediscovery: 200) that
Turing (‘Computability’: § 11) showed that no Turing machine decides the Entscheidungsproblem. If we follow Searle’s rendering of Church’s thesis, it follows that the brain does not decide the Entscheidungsproblem either. I suggest there is no obvious application of an heuristic here. A fairly natural computationalist view is that these results from computability theory apply straightforwardly to computation generally, and, therefore to cognition, and so to language computations.
The situation here is somewhat different. Of course, there are some counterexamples to the Cobham–Edmonds thesis. But we are evaluating the application of the thesis as an heuristic, which may have general if not total validity. So merely pointing out some counterexamples is not enough.
I will sketch here two ways in which the Cobham–Edmonds thesis can mislead. My claim is not that it always misleads; indeed, I think Cobham–Edmonds can fruitfully be applied both in computer engineering and cognitive science. But it misleads often enough that the details matter in the ways I describe.
On small inputs, even exponential bounds are not prohibitive. This much everybody accepts (van Rooij et al., Cognition: § 9.9).
Why is this considered an uninteresting suggestion?
This, of course, has to be examined on a case-by-case basis. The bifurcation of runtimes into polynomial and superpolynomial obscures potentially cognitively meaningful differences between runtimes that are both (for instance) exponential.
Therefore, it is not as though inputs ‘small [enough] in practice’ are so small that they are not of interest to cognitive science or philosophy of language. The asymptotic approach, therefore, could fail by ignoring these details.
The second way that the hypothesis could fail is that typical cases may take considerably less time to process than the worst case.
van Rooij et al. (Cognition: § 9.5) offer two responses. First, van Rooij et al. (ibid.: § 9.5) argue that ‘for a computation to be tractable, it must be tractable for all its possible inputs, including the worst-case input’. It is not obvious that we should subscribe to this thesis. Consider, for instance, the analogous principle applied to walking: it would be absurd to hold that ‘for [walking] to be tractable, it must be tractable for all possible [terrain], including the worst-case [terrain]’. It is possible that in the worst case, we simply fail (to walk, learn and so on). Of course, if the worst case is intractable, it is contradictory to then specifically claim that the worst case is tractable—but to say e.g. that a form of semantic processing that in the worst case really is intractable is generally ‘tractable’ need not be read as a claim about the worst case.
Their second suggestion is more promising.
In some sense, this is must be true: if only some cases can tractably be dealt with, there must be some restricted input domain corresponding to the cases that can be processed in reasonable time. The question is whether it is reasonable to posit the existence of such a domain without fully stating what it is. After all, van Rooij et al. (ibid.) could retort that the burden of proof, by default, lies with the party denying a generally accepted heuristic (in this case the Cobham–Edmonds thesis).
The view I advocate is that, in general, the Cobham–Edmonds thesis in the study of human cognition is not sufficiently vindicated that the burden of proof lies only with the Cobham–Edmonds thesis’s detractors. In a dialectic where there are competing views which vary on their merits, the seeming violation of the Cobham–Edmonds thesis according to one theory is not, in my view, decisive.
I want now to mention two cases where it was (and in one case is) reasonable to posit the existence of a restricted tractable domain even though we do not yet know what it is.
10.3. Example. (linear programming, Korte and Vygen, Combinatorial Optimization: 51) Given a matrix and column vectors
In the 1980s, the state of our knowledge was as follows. The ‘simplex’ algorithm worked well in practice but had an exponential worst-case runtime (Korte and Vygen, Combinatorial Optimization: § 3.2; Cherniak, Minimal rationality: § 4.9). Polynomial-time algorithms were known, but remain (to this day) ’too inefficient to be used in practice’ (Korte and Vygen, Combinatorial Optimization: 73). Cherniak (Minimal rationality: § 4.9) therefore speculated that the best response was the empirical investigation of ‘which are the interesting and difficult cases’ of a problem, in order to establish whether there is ‘real-world relevant complexity at least as an empirical hypothesis’ in addition to whether there is worst-case intractability. He had formulated no clear or precise hypothesis as to the restricted domain on which ‘real-world relevant complexity’ might arise.
Cherniak offered various hypotheses about the asymptotic profile of the difficulty of a problem: in the worst case, perhaps every instance of a problem might be susceptible to superpolynomial blowup; in other cases, it might turn out that only uninteresting or somehow pathological cases are so susceptible; or perhaps some interesting or quotidian cases also fall into that population, but there are very few.
The state of knowledge at the time did admit experimental study of the behaviour of the simplex algorithm (McCall, ‘Simplex performance’). On these grounds, at the time, I suggest, it was in fact reasonable to accept as a working hypothesis the last: nearly all interesting cases of linear programming (e.g. for use in industry or natural science) in practice took polynomial rather than exponential time. This can be justified on the broadly naturalist grounds that that is precisely the hypothesis accepted by computer scientists; at least by the lights of van Rooij et al. (Cognition), that is a reasonable argument, since they also advocate deference to computer scientists in their field of expertise.
It so transpires that we have the sort of restricted domain that van Rooij et al. (ibid.) sought. In 2001, Spielman and Teng (‘Smoothed analysis of algorithms’) offered a rigorous ‘smoothed’ asymptotic analysis of the simplex algorithm for linear programming. Although in the worst case the simplex algorithm takes exponential time, if the input is perturbed randomly to an arbitrarily small extent, the runtime is (to a first approximation) polynomial. So van Rooij et al. (Cognition) and everybody else would now accept that if, for instance, it is suggested that a cognitive capacity involves linear programming, but the inputs are randomly perturbed, the tractable cognition thesis would not straightforwardly rule out that account of cognition. The dispute concerns whether it was reasonable between 1947 (when the simplex algorithm was first formulated) and 2001 (when smoothed analysis explained simplex’s performance) to accept the tractability on simplex on everyday instances of linear programming, to a certain extent on trust; I claim that that is so.
An example in which no satisfactory account has yet been given is the the practical use of solvers in industrial applications. is NP-complete (Marques-Silva, ‘Applications’). Moreover, many natural subclasses of cases in fact plausibly must take exponential time (Chvátal and Szemerédi, ‘Many hard examples for resolution’). We do not yet have an equivalent to smoothed analysis to explain the industrial utility of solvers.
It is folklore in theoretical computer science about how superpolynomial worst-case runtime is compatible with tractability in practice in the case of . The view is that there is some sort of underlying structure—we may not yet be able to precisely articulate it—to ‘real-world’ or plausible inputs that makes them ‘non-adversarial’ (Malik and Zhang, ‘Boolean satisfiability from theoretical hardness to practical success’: 78). It is, nevertheless, reasonable to generalise that modern solvers will ‘routinely solve…industrial instances with millions of variables’ even though we have no understanding of why the specific sets of heuristics employed by modern solvers are so effective in practice’ (Vardi, ‘Boolean satisfiability’). At least, it is as reasonable as generalising e.g. that bridges do not fall down, the ‘Grand National [does] start and…hotels [do not] fall into the sea’.
Here, van Rooij et al. (Cognition: § 9.9) would suggest that the correct response here would be to attempt to show that is fixed-parameter tractable. However, we have no useful results in this respect, and not for want of trying (Ganesh and Vardi, ‘On the Unreasonable Effectiveness of SAT Solvers’) moreover, if a large and natural subclass of were to be discovered, that would be industrially very useful, so there is no want of commercial incentives. The central problem is that parametrisations that would make instances easy turn out not to correspond to the easy instances we observe in real-world instances —these parameters do not remain small in the real-world instances. The opposite approch would be to examine real-world instances and to try to work backwards to a generalised parametrisation explaining why they are easy; but these cases do not generally easily submit to theoretical study (ibid.: § 25.5).
Should we accept this folklore from computer science, as I suggested above? Dialectically, van Rooij et al. (Cognition) and most others who appeal to asymptotic analysis must agree, because they generally rely on conjectures to show that the problems they claim are intractable really are intractable. For instance, van Rooij et al. (ibid.) need to assume that for -hardness to have any significance to cognition. If we reject conjectures from computer science, it is not obvious why we should accept the underlying asymptotic machinery that motivates asymptotic precisifications of the tractable cognition thesis.
I shall suggest that some combination of these two failures of Cobham–Edmonds arise in linguistic cognition. Accordingly, the soundness of numerous arguments appealing to it as a premiss in the study of language can be questioned. It does not, of course, follow that their conclusions are false, although I shall suggest in some cases there is independent pressure to accept that view.
The response to asymptotic arguments about performance above amounted to the claim that, on occasion, the instances of asymptotically hard problems we actually encounter either may be too small or otherwise lack the requisite structure to be prohibitively difficult in practice. Competence, insofar as it differs from performance, therefore leaves open the prospect of leaving those hard cases within the domain a theory covers. If we can expect a theory to account for centre-embedding of arbitrary depth, we can no longer escape asymptotic difficulty by excluding such instances from the domain over which a theory’s fecundity is assessed.
The problem here, however, is that it is not obvious that competence is also accountable to performance limitations. Asymptotics could be utterly irrelevant to the study of competene: it wouldn’t matter whether
In this chapter, I examine whether an asymptotic argument can be given that meaning is compositional. The argument purports to fill a lacuna in Davidson’s learnability argument for compositionality (§ 12). Pagin claims to fill a lacuna in the argument with asymptotic analysis (§ 13). I then ask whether the argument is targeted at performance or competence; if it is targeted at performance, it is misplaced; but if it is targeted at competence, the picture is more complicated (§ 14).
To Davidson (‘Theories of Meaning and Learnable Languages’: 8), a theory of knowledge of language must ‘specify, in a way that depends effectively and solely on formal considerations, what every sentence means’. I take it that Davidson views this requirement as entailing (or even equivalent to) the requirement that the meaning of a sentence is, on this view, ‘a function of a finite number of features of the sentence’. Otherwise, a speaker would fail to master the meaning of certain sentences, no matter how many other sentences they learn to produce and understand: the meaning of the unlearnable sentences would not depend on that of the sentences already mastered, no matter which sentences those are.
Davidson then suggests that a learnable language must have a finite number of semantical primitives. On a standard view, Davidson offers here an argument for the compositionality of meaning. Pagin (‘Compositionality, computability, and complexity’: 551) informally puts that requirement thus.
In particular, this entails that the substitution of synonymous terms should not change the meaning of a complex term.
There is some dispute as to whether that is actually what Davidson meant; in his reply to Pagin (‘Radical interpretation and compositional structure’), Davidson writes that he took compositionality ‘as a given’ and ‘built in at every stage’.11 But the exegetical debate need not concern us.
According to Pagin, the argument is abductive. The explicandum is that a speaker ‘can step by step work out the meaning of new grammatical sentences’ (Pagin, ‘Compositionality, computability, and complexity’: 551). The explicans is compositionality. Pagin objects to the abduction, and suggests further argument is required.
The problem in this abduction is that there may be cases where we can work out the meaning of a complex expression even if the replacement of one of the constituents by a synonymous expression could change the meaning of a whole, violating compositionality. What is required is not that the semantics should be compositional but merely computable.
Here, Pagin (ibid.: 553) proposes a further constraint.
The argument is then, to a first approximation, that compositionality is a good explicans because learnability requires tractability, and tractability requires compositionality. This, therefore, improves on the previous abduction: it is true that learnability requires computability, but computability doesn’t require compositionality; and learnability does not, directly, require compositionality.
Pagin argues for this view in terms of a symbolic computational model: term rewriting. The claim is that noncompositional semantics has exponential difficulty in the term rewriting régime. This is subject to various qualifications, the most important of which is that this applies only to languages whose syntax is combinatorially unrestricted.
Considerable work is required to reconstruct the result, because it is shown only by example. Godzilla, a noncompositional rewrite system (ibid.: § 8.3.3) has factorial complexity; from this, it is inferred that all noncompositional systems (meeting certain criteria) have superpolynomial complexity. Although there is some indication of the generalisation of the result (ibid.: 580), there is no formal indication of the general result. In this section I offer a general result in the spirit of Pagin.
Semantic functions assign meanings to grammatical terms. Pagin takes no firm view of what meanings actually are, and simply assumes that there is some set of such meanings.
13.3. Definition. (ibid.: Definitions 2–5) A semantic function is standard compositional iff for each -ary there is a meaning operation such that, whenever defined,
A general compositional system replaces by a finite family of semantic functions together with a selection function fixing which member evaluates each argument place:
Pagin then argues that Davidson’s argument only licenses a much weaker recursivity property: i.e. that meanings should be definable by primitive recursion over the syntactic structure.
13.5. Definition. A semantic function is recursive if it is definable by primitive recursion over the term algebra. Then the clause for a complex term may take the parts themselves as additional arguments:
The additional arguments make this a weaker requirement than compositionality, because even if , if we have no reason in general to require that .
Two questions arise in evaluating Pagin’s challenge. First, does Pagin believe in distinguishing competence and performance, à la Chomsky? Second, if so, is his argument aimed at competence or performance?
We shall follow the distinction between competence and performance, following Chomsky (Aspects: 3 ff) and the subsequent literature, and therfore distinguish two candidate explicanda in Pagin’s argument.
Suppose, then, that the explicandum of Pagin’s argument is performance. We can then make its abductive structure more precise. The explicandum is our understanding of the meaning of sentences that have actually been produced. Pagin’s argument is then that compositionality follows from the best explanation of our understanding of actually produced and comprehended sentences.
It is useful first to examine cases where compositionality is allegedly violated. Consider propositional attitudes (Pickel and Szabó, ‘Compositionality’: § 4.2.5).
On one view, although ‘biblioklepts’ and ‘book thieves’ are synonymous, (1) and (2) may differ in truth value (and hence meaning).
Now, suppose we formulate some general theory of the semantics of propositional attitudes that is noncompositional, and that it satisfactorily and perhaps otherwise virtuously explains natural language data; suppose, moreover, that it is recursively specified, and therefore Davidson’s original challenge to specify the meaning of propositional attitude reports as a ‘function of a finite number of features’ of those reports is met.
The problem, according to Pagin, is that more complex expressions’ meanings rapidly take longer to work out, unless their meanings are compositional. But if he is targeting performance, he needs to explain why these more complex expressions appear in ‘actual use of language’.
A simple response might be that the depth of nested propositional attitude reports is bounded, and so a superpolynomial increase in difficulty as the depth of the nesting increases is irrelevant to actual use of language (¶ 10.1). That would be mistaken. For instance, the following sentence’s meaning is perfectly comprehensible.
This is a ninth-order report. However, both parsing-wise and semantically, (3) is fairly simple.
A more sophisticated response, I suggest, is that althoguh the depth of typical cases may in some cases be quite high, their structure will generally enable comprehensibility (¶ 10.2).
In this section, we shall be concerned by certain arguments from learning theory—a branch of asymptotic analysis—to justify the view that we have innate knowledge of language in the form of universal grammar due to poverty of the stimulus. I shall argue that the PAC-learning régime fails for roughly the reasons suggested in § 9.
We shall follow here the formalism given by Kearns and Valiant (‘Limitations’: § 2).
In order to perform computations about concepts, we need to represent concepts.
Learning theory is about how we can algorithmically learn from examples drawn from an instance space. So, informally, suppose that we are interested in learning the meaning of ‘tall’. We could learn this by examining who is called ‘tall’ (and who is not). There are therefore ‘positive examples’ of ‘tall’ (i.e. people who count) and ‘negative examples’ (i.e. people who don’t).
We will sometimes ‘parametrise’ representation classes and instance spaces.
Intuitively, the task of a learning algorithm is to offer an hypothesis that allows us to classify new example drawn from . But if the hypothesis only inefficiently allows us to classify a new example, the learning algorithm is not very useful. Therefore, we are mostly interested in polynomially evaluable representation classes.
We learn from examples that are labelled either as positive or negative, i.e. as falling under a concept or not.
16.13. Definition. (the learning régime) In the standard régime in computational and statistical learning theory, a learner is given a representation class and examples of some particular representation . For instance, the representation class might correspond to all words for height in English, and the particular representation might one of ‘tall’. Let us say that ‘tall’ (or ) in general is the target representation.
There are two (for our purposes equivalent) formulations of how the samples are drawn.
For simplicity we shall assume that the examples are drawn from the same distribution, and therefore amend the following exposition slightly.
16.14. Definition. (error) We can now measure the error of an hypothesis represented by some relative to the distribution as
When , is -good; otherwise, it is -bad. Note that need not be deterministic.
Informally, we will now require that
16.15. Definition. (learnability from examples) Suppose that and are representation classes over . Let be some arbitrarily small error rate (the accuracy parameter), and let be some arbitrarily small probability of failure to formulate a hypothesis that has an error of at most (the confidence parameter). (Thus, ‘probably () approximately correct ()’.)
We say that is learnable by iff there is a probabilistic algorithm such that,
halts and outputs a representation that with probability at least satisfies .
is polynomially learnable if there is some polynomially evaluable .
is efficiently polynomially learnable if has running time polynomial in , , and .
Kearns and Valiant (ibid.: § 4) defines certain cryptographic problems that are widely believed to be superpolynomial in difficulty. We shall omit the details here. What is important is that if we assume that these cryptographic problems are hard, and we show that instances of the cryptographic problem can be reduced to instances of a learning problem with at most polynomial overhead, then the learning problem must also be (up to polynomial overhead) at least as hard.
Now, let us consider one way of regimenting the problem of learning a language given labelled examples of strings in and out of the language. We will restrict ourselves to the relatively smalll class of regular languages.
We shall take the instance space to be , which can be viewed as a space of strings. We are then interested in acyclic finite automata that accept strings of legnth drawn from .
Since we have omitted the cryptographic details, we will only state (informally) the result.
Perhaps surprisingly, a number of authors have interpreted this result as suggesting that there must be an innate universal grammar. For instance, Nowak et al. (‘Aspects’) write that
Similarly, Yang (‘The Great Number Crunch’: 223) write that one conclusion from the learning-theoretic literature that ‘emerges convincingly’ is that learning is not possible unless the hypothesis space is tightly constrained by prior knowledge, which can be broadly identified as Universal Grammar’. Moreover, Kodner et al. (Linguistics will thrive: 5) reaches a roughly similar conclusion, and extends the claim to recent successes in training large language models, in response to Piantadosi (‘Language models’).
Piantadosi (ibid.) centrally claims that language models are perfectly good theories of language, and should substitute Chomskyan theories about grammar, competence and performance, and so on. He catalogues a series of successes they have had in conforming to standard grammatical rules, including those for embedded clauses, prepositional phrases, conjunctions, pronouns, determiners, quantifiers, adjectives, agreement and pronoun reference (ibid.: 3).
Kodner et al. (Linguistics will thrive), in response, argue as follows.
There are a number of ways in which this argument is significantly flawed. One is that it is overly optimistic about the utility of the PAC régime as a model of learning, which I address in § 20. However, it is worth flagging a few other problems first.
As Piantadosi (‘Language models’: 38 ff) note, Kodner et al. (Linguistics will thrive) have a very liberal view of universal grammar. In particular, it might turn out to be sufficient for the purposes of Kodner et al. that transformers work better than e.g. LSTMs. But this is hardly ‘universal grammar’ or ‘innate knowledge’ in the commonsensical sense of the word. We can distinguish three theses:
What justifies the PAC-learning régime as a model of the task of human language acquisition? Obviously, it is not the realism of its assumptions. For instance, although it is true that the vast majority of human beings succeed in acquiring their first language in a relatively short period of time as children, this does not mean that we are justified in setting the confidence parameter to be arbitrarily small. Similarly, although we clearly converge on grammars that allow us to have relatively satisfactory performance with respect to an idealised language of the community (at least sufficiently to communicate), the accuracy parameter is clearly not arbitrarily small; we can easily bound it by taking the number of mistakes produced by the most successfully grammatically pedantic person ever to speak a language and dividing the number of sentences uttered by them by the number of mistakes.
Many authors who cite results in the PAC-learning régime simply ignore any questions either about the realism of its assumptions or the implications of the obvious lack of realism of its assumptions, including Kodner et al. (ibid.), Yang (‘The Great Number Crunch’) and Nowak et al. (‘Aspects’). One author who does attempt to explicitly justify this view is de Wolf (‘Applications’: § 2.4.4). He notes that it seems a little odd to assume that and can be arbitrarily small, but insists the PAC régime is ‘right in spirit’, because, plausibly, given more time or data, a learner will have a better chance of formulating a hypothesis with fewer errors.
The problem with this argument is that it is an argument for monotonicity rather than efficient learnability. That is to say: additional samples may always improve (e.g.) accuracy, but that does not meant that they improve accuracy enough to ensure that accuracy can be improved with only polynomial additional cost. Indeed, the claim that aedditional data always improve accuracy is compatible with the claim that it is not possible at some point to eliminate some residual nonzero error: suppose for instance that after data the best we can do is to have —this is monotonic, but not arbitrarily small. In addition, it faces some of the standard objections above: and may be boundedly small, and exponential difficulties may arise only later.
It may be plausible that should be considered to be arbitrarily small. However, it is not so obvious that should be considered to become arbitrarily small relative to the target language (e.g. that of one’s parents).
If the PAC-learning framework cannot be justified by the plausibility of its assumptions or a priori, perhaps it can be justified by its empirical fruitfulness. In fact, this much is not obvious, because of the general paucity of positive PAC-learnability results that correspond to clearly articulated restrictions on the hypothesis space that correspond in turn to pkillnatural language (Clark and Lappin, Nativism: §§ 7.3–7.4). As they say, the situation is ‘from a theoretical point of view…completely unsatisfacotry’: ‘[w]e seem to be reduced to saying that we have algorithms that work when they work, and don’t otherwise’.
Clark and Lappin do present some positive PAC-learnability results, but, as they note, these fall short of natural language. We cannot therefore say that PAC learnability has earned its keep through its theoretical fruitfulness, because we do not even have positive results that account for the fact that children learn their first languages. Moreover, the significance of their positive results is very unclear because it may be that they are overly restricted because they still care about arbitrarily small and : see e.g. § 8.3.1.
If we think of PAC-learnability as a criterion for separating hypotheses about learning language, all we have are various positive or negative results for classes of language that are relatively near or far away from natural language (either idealised or not) without any clear way of distinguishing which ones are really closer or further away. It’s far from obvious how to infer from these anything about the inductive biases required to learn language. Plausibly, Piantadosi is right at least insofar as experimental work from language models is much more useful (see below).
A broader question is whether the objections I have raised to computational variants of the poverty of the stimulus argument would unduly worry Chomsky.
To Chomsky (Aspects: 3), linguistic theory studies an idealised ‘speaker-listener unaffected by ‘grammatically irrelevant conditions’ such as ‘memory limitations’ in ‘applying his knowledge of the language in actual performance’. It is natural to include amongst the ‘irrelevant conditions’ constraints on performance. If so, Chomsky would be utterly uninterested in the asymptotic analysis of linguistic processing (as in the case of Ristad and Pagin), but might also be uninterested in the computational complexity of language acquisition. An asymptotic APS would be welcome support, but no more than that.
The logical structure of Chomsky’s argument is rather different. Chomsky (Knowledge of language: 7) remarks on the
structure-dependence of rules, the fact that without instruction or direct evidence, children unerringly use computationally complex structure-dependent rules rather than computationally simple rules that only involve the predicate ‘leftmost’ in a linear sequence of rules.
Both (2) and (3) include the clause bounded by brackets, but only in (2) may the pronoun them be referentially dependent on the antecedent the men; in (3) the pronoun is understood as referring in some manner indicated in the situational or discourse context, bu but not to the men. …How does every child know, unerringly, to interpret the clause differently in the two cases?
The suggestion is then that the devices provided by the theory of UG are adequate to the descriptive task at hand’ (ibid.: 51).
The abduction to universal grammar requires the auxiliary premiss that other learning mechanisms won’t work (ibid.: 12). Chomsky suggests there is ‘little hope’ of accounting for knowledge of language through ‘analogy, induction, association, reliable procedures, good reasons and justification in any generally useful sense, or in terms of “generalized learning mechanisms” (if such exist)’. But this is a matter of empirical enquiry, and cannot be studied a priori. The specific methodology Chomsky proposes is to take as an explicandum the ‘system of knowledge that has been attained’, and to offer as putative explicantes ‘properties [that] must be attributed to the initial state of the mind/brain’. There is a language faculty to the extent that ‘these properties are language-specific’. Then examples such as those involving anaphoric dependence above (‘them’) are given to buttress this claim.
Although Chomsky himself might not think that this approach requires asymptotic supplementation, we need not accept this view. In particular, we might demand more rigorous evidence that there really is ‘no hope’. A natural way to supply that argument is asymptotic, and so the alleged failure of the argument is of interest.
One question that advocates of an asymptotic variant of the APS often do not explicitly answer is whether their argument targets competence or performance. It seems that these arguments target competence. Consider again the argument given by Nowak et al. (‘Aspects’: 614).
UG is a theory about competence, and formal grammars are used to theorise about competence in the first instance (and then indirectly performance). A somewhat puzzling question is then whether it is legitimate for computationalists in the Chomskyan mould to appeal to performance constraints on acquisition of competence in the first place. This might not worry Chomsky particularly (although maybe it should). However, there is a question about how the computational Chomskians would justify appealing to asymptotic efficiency constraints in discussing acquisition if they reject any place for asymptotic analysis in the study of competence, following Chomsky.
Have there been any cases where asymptotic analysis usefully informed our understanding of language? I have suggested that a number of difficulties afflict attempts to rule out theories about language, which naturally invites the question of whether any attempt should survive such scrutiny, and whether asymptotic analysis has any place in our understanding of language generally.
I first sketch some rough principles by which we we might make reasonable progress in understanding language by asymptotic means, by analogy with the study of reasoning in cognitive science.
In § 9, I made the following claims about the proper application of asymptotic analysis.
I want to advance a further positive hypothesis about one way in which asymptotic analysis might in fact
The very rough intuition is that superpolynomial complexity should, in general, lead to relatively sharp dropoffs in performance,12 but at least typical polynomial complexity should make feasible a more gradual dropoff in performance as the size of an input increases. Under the heading of performance here I include speed, the solution of potentially difficult cases in a reasonable period of time, accuracy, and so on. Here are some examples. Some very long sentences are easy to parse and understand.
Other long sentences are more difficult to parse and understand.
Note, for instance, that (4) can be extended to a fairly great extent without loss of comprehensibility, whereas, even if (5) is intelligible, the following is likely unintelligible.
Of course, our comprehension (on various measures) of (4) may eventually decline as the sentence is extended. But centre-embeddings like (5) seem to decline in comprehensibility very quickly.
There seems here to be a rapid shift in the status centre-embeddings somewhere between depth two and four. On the other hand, there is no such rapid shift in the status of sentences like (4).
What happens in cases like (5) if there are too many centre-embeddings? We may simply give up. Or we may use certain heuristics to guess the meaning of the sentence, which may have degraded performance and fail to even reliably approximate the literal meaning of the sentence. Thus we have various corresponding measures of performance.
There is analogy here with physical science. Consider, for instance, heating up a solid substance (e.g. a metal). Generally, heating up a substance increases its volume. But heating up e.g. a steel girder from 20C to 25C does not produce any dramatic change. On the other hand, water takes up much more space when it is 101C than when it is 99C, because of a phase transition.
In a slogan: real-world superpolynomial complexity should entail sharp dropoffs in performance. How might this help the study of language or cognition? I suggest that in studying capacities such as language processing or acquisition, we ought to examine the difference in size in successful cases. So, for instance, if humans generally are unable to understand sentences with centre-embeddings of depth greater than three, and there is a sharp cutoff, that hypothesis is compatible with superpolynomial complexity. More generally, we can formulate (informally) an inconsistent triad:
The reason is that if there is a wide range in the size of the successful cases and a superpolynomial increase in difficulty as a function of that size, the resources used will have to differ by a factor given by a superpolynomial function of a large difference, which contradicts the supposition that there is only small variation in resources used. For instance:
If we also think that we made the most of the resources available, this would suggest an extraordinary difference in the resources available. Of course, there aren’t any plausible theories on which the time required to learn a new word doubles each time, so nothing is ruled out in this case; that is only an illustration.
Speculatively, therefore, this might offer a way to distinguish the mechanisms by which the same task is seemingly implemented in different systems. Suppose, for instance, that large language models display a gradual dropoff in performance, but humans display a sharp dropoff. That might suggest that the llms use a polynomial algorithm, but humans one with superpolynomial complexity. This, in turn, could suggest a fundamental architectural difference. If we return e.g. to the debate over language models as models of language, such an asymptotic difference might amount to the sort of difference between a language model and an adequate model of our own linguistic competence or performance. This therefore suggests an experimental means by which to attempt to distinguish us from them. In studying human cognition, we might find that a theory for asymptotic reasons predicts a much smaller range of outcomes or distribution of outcomes against resources used than is actually observed.
Aaronson, Scott, P ?= NP, 2017. |
Aaronson, Scott, ‘Physical reality’: ‘Guest Column: NP-complete problems and physical reality’, ACM SIGACT News 36.1 (1 March 2005), pp 30–52. |
Agarwal, Shivang, Clarice D. Aiello, Daniel R. Kattnig, and Amartya S. Banerjee, ‘Posner molecule’: ‘The Dynamical Ensemble of the Posner Molecule Is Not Symmetric’, The Journal of Physical Chemistry Letters 12.42 (28 October 2021), pp 10372–10379. |
Arora, Sanjeev and Boaz Barak, Complexity: Computational Complexity: A Modern Approach, Cambridge: Cambridge University Press, 2009. |
Arunachalam, Srinivasan and Ronald de Wolf, A Survey of Quantum Learning Theory, 28 July 2017, arXiv: 1701.06806 [quant-ph], pre-published. |
Chater, Nick and Mike Oaksford, ‘Rational analysis’: ‘The Rational Analysis Of Mind And Behavior’, Synthese 122.1 (1 February 2000), pp 93–131. |
Chen, Jianer, Iyad A. Kanj, and Ge Xia, ‘Improved upper bounds for vertex cover’, Theoretical Computer Science 411.40–42 (September 2010), pp 3736–3756. |
Cherniak, Christopher, Minimal rationality, Computational models of cognition and perception, Cambridge, Massachusetts; London: MIT Press, 1986, 186. |
Chomsky, Noam, Aspects: Aspects of the Theory of Syntax, 50th edition, The MIT Press, 1965, jstor: j.ctt17kk81z. |
Chomsky, Noam, Knowledge of language: Knowledge of language: its nature, origin, and use, New York: Praeger, 1986, 362. |
Chomsky, Noam and Howard Lasnik, ‘Filters and Control’, Linguistic Inquiry 8.3 (1977), pp 425–504, jstor: 4177996. |
Chvátal, Vašek and Endre Szemerédi, ‘Many hard examples for resolution’, Journal of the ACM (JACM) 35.4 (1 October 1988), pp 759–768. |
Clark, Alexander and Shalom Lappin, Nativism: Linguistic nativism and the poverty of the stimulus, Chichester, West Sussex Malden, MA: Wiley-Blackwell, 2011. |
Cobham, Alan, ‘Difficulty’: ‘The Intrinsic Computational Difficulty of Functions’, Logic, methodology and philosophy of science, edited by Bar-Hillel, Yehoshua, North-Holland Pub. Co., 1965, pp 24–30. |
Conitzer, Vincent and Tuomas Sandholm, ‘New complexity results about Nash equilibria’, Games and Economic Behavior 63.2 (July 2008), pp 621–641. |
Cook, Stephen A., ‘Complexity’: ‘The complexity of theorem-proving procedures’, Proceedings of the third annual ACM symposium on Theory of computing, STOC ‘71, New York, NY, USA: Association for Computing Machinery, 3 May 1971, pp 151–158. |
Cook, Stephen A. and Robert A. Reckhow, ‘Time bounded random access machines’, Journal of Computer and System Sciences 7.4 (1 August 1973), pp 354–375. |
Cormen, Thomas H., Charles Eric Leiserson, and Ronald L. Rivest, Introduction: Introduction to algorithms, The MIT electrical engineering and computer science series, Cambridge, Mass. : New York: MIT Press ; McGraw-Hill, 1990, 1028. |
Coupé, Christophe, Yoon Mi Oh, Dan Dediu, and François Pellegrino, ‘Different languages, similar encoding efficiency’: ‘Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche’, Science Advances 5.9 (4 September 2019), p eaaw2594. |
Cowie, Fiona, ‘Innateness and Language’, The Stanford Encyclopedia of Philosophy, edited by Zalta, Edward N., Fall 2017, Metaphysics Research Lab, Stanford University, 2017. |
Daskalakis, Constantinos, Paul W. Goldberg, and Christos H. Papadimitriou, ‘The complexity of computing a Nash equilibrium’, Communications of the ACM 52.2 (1 February 2009), pp 89–97. |
Davidson, Donald, ‘Theories of Meaning and Learnable Languages’, Inquiries into Truth and Interpretation, 1st edition, pp 3–16, Oxford University PressOxford, 27 September 2001, pp 3–16. |
Dean, Walter, ‘Complexity’: ‘Computational Complexity Theory’, The Stanford Encyclopedia of Philosophy, edited by Zalta, Edward N., Fall 2021, Metaphysics Research Lab, Stanford University, 2021. |
Derakhshani, Maaneli, Lajos Diósi, Matthias Laubenstein, Kristian Piscicchia, and Catalina Curceanu, ‘Crossroad’: ‘At the crossroad of the search for spontaneous radiation and the Orch OR consciousness theory’, Physics of Life Reviews 42 (1 September 2022), pp 8–14. |
Downey, Rod, ‘A Parameterized Complexity Tutorial’, Language and Automata Theory and Applications, edited by Dediu, Adrian-Horia and Carlos Martín-Vide, Berlin, Heidelberg: Springer, 2012, pp 38–56. |
Dupre, Gabe, ‘Competence/performance distinction’: ‘How and Why to Draw the Competence/Performance Distinction’, Oxford Handbook of Philosophy of Linguistics, edited by Nefdt, Ryan M., Gabe Dupre, and Kate Hazel Stanton, Oxford University Press, 2026. |
Edmonds, Jack, ‘Paths’: ‘Paths, Trees, and Flowers’, Canadian Journal of Mathematics 17 (January 1965), pp 449–467. |
van Emde Boas, Peter, ‘Models’: ‘Machine models and simulations’, Handbook of theoretical computer science (vol. A): algorithms and complexity, Cambridge, MA, USA: MIT Press, 2 January 1991, pp 1–66. |
Friedman, Milton, ‘Methodology’: ‘The Methodology of Positive Economics’, The Philosophy of Economics: An Anthology, edited by Hausman, Daniel M., 3rd edition, Cambridge: Cambridge University Press, 2007, pp 145–178. |
Frixione, Marcello, ‘Tractable Competence’, Minds and Machines 11.3 (1 August 2001), pp 379–397. |
Ganesh, Vijay and Moshe Y. Vardi, ‘On the Unreasonable Effectiveness of SAT Solvers’, Beyond the Worst-Case Analysis of Algorithms, edited by Roughgarden, Tim, Cambridge: Cambridge University Press, 2021, pp 547–566. |
Gilboa, Itzhak and Eitan Zemel, ‘Nash and correlated equilibria’: ‘Nash and correlated equilibria: Some complexity considerations’, Games and Economic Behavior 1.1 (March 1989), pp 80–93. |
Hansen, Thomas Dueholm, Haim Kaplan, Or Zamir, and Uri Zwick, ‘SAT’: ‘Faster k-SAT algorithms using biased-PPSZ’, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, New York, NY, USA: Association for Computing Machinery, 23 June 2019, pp 578–589. |
Heck, Richard Kimberly, ‘Does Davidson Argue for Compositionality?’, Journal for the History of Analytic Philosophy (no date). |
Helfgott, Harald Andrés, Isomorphismes de graphes en temps quasi-polynomial (d’après Babai et Luks, Weisfeiler-Leman…), 12 October 2017, arXiv: 1701.04372 [math], pre-published. |
Hennie, F. C. and R. E. Stearns, ‘Two-Tape Simulation of Multitape Turing Machines’, Journal of the Association for Computing Machinery 13.4 (1 October 1966), pp 533–546. |
Huang, Jianyu, Tyler M. Smith, Greg M. Henry, and Robert A. Van De Geijn, ‘Algorithm’: ‘Strassen’s Algorithm Reloaded’, SC ‘16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ‘16: the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2016, pp 690–701. |
Karp, Richard M., ‘Reducibility among Combinatorial Problems’, Complexity of Computer Computations, (IBM Thomas J. Watson Research Centre, Yorktown Heights, New York), edited by Miller, Raymond E., James W. Thatcher, and Jean D. Bohlinger, Boston, MA: Springer US, no date, pp 85–103. |
Katajainen, Jyrki and Jesper Larsson Träff, ‘Analysis of mergesort’: ‘A meticulous analysis of mergesort programs’, Algorithms and Complexity, edited by Bongiovanni, Giancarlo, Daniel Pierre Bovet, and Giuseppe Di Battista, Berlin, Heidelberg: Springer, 1997, pp 217–228. |
Kearns, Michael and Leslie Valiant, ‘Limitations’: ‘Cryptographic limitations on learning Boolean formulae and finite automata’, J. ACM 41.1 (2 January 1994), pp 67–95. |
Kodner, Jordan, Sarah Payne, and Jeffrey Heinz, Linguistics will thrive: Why Linguistics Will Thrive in the 21st Century: A Reply to Piantadosi (2023), 6 August 2023, arXiv: 2308.03228 [cs.CL], pre-published. |
Korte, Bernhard and Jens Vygen, Combinatorial Optimization: Combinatorial Optimization: Theory and Algorithms, volume 21, Algorithms and Combinatorics, Berlin, Heidelberg: Springer Berlin Heidelberg, 2012. |
Levin, L.A., ‘Универсальные задачи перебора’, Проблемы передачи информации IX.3 (1973). |
Lipton, Richard J. and Kenneth W. Regan, ‘David Johnson’: ‘David Johnson: Galactic Algorithms’, People, Problems, and Proofs: Essays from Gödel’s Lost Letter: 2010, edited by Lipton, Richard J. and Kenneth W. Regan, Berlin, Heidelberg: Springer, 2013, pp 109–112. |
Litt, Abninder, Chris Eliasmith, Frederick W. Kroon, Steven Weinstein, and Paul Thagard, ‘Is the Brain a Quantum Computer?’, Cognitive Science 30.3 (2006), pp 593–603. |
Malik, Sharad and Lintao Zhang, ‘Boolean satisfiability from theoretical hardness to practical success’, Communications of the ACM 52.8 (1 August 2009), pp 76–82. |
Malpas, Jeff, ‘Donald Davidson’, The Stanford Encyclopedia of Philosophy, edited by Zalta, Edward N. and Uri Nodelman, Fall 2024, Metaphysics Research Lab, Stanford University, 2024. |
Marques-Silva, Joao, ‘Applications’: ‘Practical applications of Boolean Satisfiability’, 9th International Workshop on Discrete Event Systems, Gothenberg: IEEE, 2008, pp 74–80. |
Marr, David, Vision: Vision: a computational investigation into the human representation and processing of visual information, Cambridge, Mass: MIT Press, 2010. |
McCall, Edward H., ‘Simplex performance’: ‘Performance results of the simplex algorithm for a set of real-world linear programming models’, Communications of the ACM 25.3 (1 March 1982), pp 207–212. |
McKemmish, Laura K., Jeffrey R. Reimers, Ross H. McKenzie, Alan E. Mark, and Noel S. Hush, ‘Penrose–Hameroff proposal’: ‘Penrose-Hameroff orchestrated objective-reduction proposal for human consciousness is not biologically feasible’, Physical Review. E, Statistical, Nonlinear, and Soft Matter Physics 80 (2 Pt 1 August 2009), p 021912, pmid: 19792156. |
Millière, Raphaël and Charles Rathkopf, ‘Anthropocentric bias in language model evaluation’, Computational Linguistics 52.1 (1 March 2026), pp 379–388, arXiv: 2407.03859 [cs.CL]. |
Moore, Gordon E, ‘Cramming’: ‘Cramming more components onto integrated circuits’, Electronics 38.8 (1965). |
Nowak, Martin A., Natalia L. Komarova, and Partha Niyogi, ‘Aspects’: ‘Computational and evolutionary aspects of language’, Nature 417.6889 (June 2002), pp 611–617. |
Pagin, Peter, ‘Compositionality, computability, and complexity’, The Review of Symbolic Logic 14.3 (September 2021), pp 551–591. |
Pagin, Peter, ‘Radical interpretation and compositional structure’, Donald Davidson: truth, meaning and knowledge, edited by Żegleń, Urszula M. and Donald Davidson, Reprinted, Routledge studies in twentieth century philosophy 2, London: Routledge, 2001. |
Pagin, Peter and Dag Westerståhl, ‘Compositionality II’: ‘Compositionality II: Arguments and Problems’, Philosophy Compass 5.3 (2010), pp 265–282. |
Papadimitriou, Christos H., ‘The Complexity of Finding Nash Equilibria’, Algorithmic Game Theory, edited by Tardos, Eva, Noam Nisan, Tim Roughgarden, and Vijay V. Vazirani, Cambridge: Cambridge University Press, 2007, pp 29–52. |
Papadimitriou, Christos H. and Mihalis Yannakakis, ‘Complexity as bounded rationality’: ‘On complexity as bounded rationality (extended abstract)’, Proceedings of the twenty-sixth annual ACM symposium on Theory of Computing, STOC ‘94, New York, NY, USA: Association for Computing Machinery, 23 May 1994, pp 726–733. |
Piantadosi, S.T., ‘Language models’: ‘Modern language models refute Chomsky’s approach to language’, From fieldwork to linguistic theory: A tribute to Dan Everett, edited by Gibson, Edward and Moshi Poliak, Empirically Oriented Theoretical Morphology and Syntax 15, Language Science Press, 5 July 2024, pp 353–414. |
Piccinini, Gualtiero, ‘Computation in Physical Systems’, The Stanford Encyclopedia of Philosophy, edited by Zalta, Edward N. and Uri Nodelman, Fall 2025, Metaphysics Research Lab, Stanford University, 2025. |
Pickel, Bryan and Zoltán Gendler Szabó, ‘Compositionality’, The Stanford Encyclopedia of Philosophy, edited by Zalta, Edward N. and Uri Nodelman, Winter 2025, Metaphysics Research Lab, Stanford University, 2025. |
Read, Dwight W., Héctor M. Manrique, and Michael J. Walker, ‘Working memory’: ‘On the working memory of humans and great apes: Strikingly similar or remarkably different?’, Neuroscience and Biobehavioral Reviews 134 (March 2022), p 104496, pmid: 34919985. |
Reimers, Jeffrey R., Laura K. McKemmish, Ross H. McKenzie, Alan E. Mark, and Noel S. Hush, ‘Fröhlich condensation’: ‘Weak, strong, and coherent regimes of Fröhlich condensation and their applications to terahertz medicine and quantum consciousness’, Proceedings of the National Academy of Sciences 106.11 (17 March 2009), pp 4219–4224. |
Ristad, Eric Sven, Game: The language complexity game, Artificial intelligence series, Cambridge (MA), London: MIT Press, 1993. |
van Rooij, Iris, ‘Tractable’: ‘The Tractable Cognition Thesis’, Cognitive Science 32.6 (2008), pp 939–984. |
van Rooij, Iris, Mark Blokpoel, Johan Kwisthout, and Todd Wareham, Cognition: Cognition and Intractability: A Guide to Classical and Parameterized Complexity Analysis, Cambridge: Cambridge University Press, 2019. |
van Rooij, Iris, Olivia Guest, Federico Adolfi, Ronald de Haan, Antonina Kolokolova, and Patricia Rich, ‘Reclaiming AI’: ‘Reclaiming AI as a Theoretical Tool for Cognitive Science’, Computational Brain & Behavior 7.4 (1 December 2024), pp 616–636. |
van Rooij, Iris, Cory D. Wright, and Todd Wareham, ‘Intractability and the use of heuristics in psychological explanations’, Synthese 187.2 (1 July 2012), pp 471–487. |
Searle, John R., Rediscovery: The Rediscovery of the Mind, MIT Press, 8 July 1992, 292. |
Shanks, Daniel, Problems in number theory: Solved and unsolved problems in number theory, 2nd edition, New York, NY: Chelsea Publ. Co, 1978, 258. |
Spielman, Daniel and Shang-Hua Teng, ‘Smoothed analysis of algorithms’: ‘Smoothed analysis of algorithms: why the simplex algorithm usually takes polynomial time’, Proceedings of the thirty-third annual ACM symposium on Theory of computing, STOC ‘01, New York, NY, USA: Association for Computing Machinery, 6 July 2001, pp 296–305. |
Strassen, Volker, ‘Elimination’: ‘Gaussian elimination is not optimal’, Numerische Mathematik 13.4 (1 August 1969), pp 354–356. |
Tegmark, Max, ‘Importance of quantum decoherence in brain processes’, Physical Review E 61.4 (1 April 2000), pp 4194–4206. |
Turing, A.M., ‘Computability’: ‘Computability and λ-definability’, Journal of Symbolic Logic 2.4 (1937), pp 153–163, jstor: 2268280. |
Vardi, Moshe Y., ‘Boolean satisfiability’: ‘Boolean satisfiability: theory and engineering’, Communications of the ACM 57.3 (1 March 2014), p 5. |
Weisberg, Michael, ‘Idealization’: ‘Three Kinds of Idealization’, The Journal of Philosophy 104.12 (2007), pp 639–659, jstor: 20620065. |
Wiest, Michael C, ‘Quantum microtubule substrate’: ‘A quantum microtubule substrate of consciousness is experimentally supported and solves the binding and epiphenomenalism problems’, Neuroscience of Consciousness 2025.1 (6 May 2025), p niaf011, pmid: 40342554. |
Winograd, Terry, ‘What Does it Mean to Understand Language?’, Cognitive Science 4.3 (1980), pp 209–241. |
de Wolf, R.M., ‘Applications’: ‘Philosophical Applications of Computational Learning Theory: Chomksyan Innateness and Occam’s Razor’, master’s thesis, Rotterdam: Erasmus Universiteit Rotterdam, 1997. |
Yang, Charles, ‘The Great Number Crunch’, Journal of Linguistics 44.1 (2008), edited by Bod, Rens, Jennifer Hay, and Stefanie Jannedy, pp 205–228, jstor: 40058032. |
Yunger Halpern, Nicole and Elizabeth Crosson, ‘Posner model’: ‘Quantum information in the Posner model of quantum cognition’, Annals of Physics 407 (1 August 2019), pp 92–147. |
Zheng, Jieyu and Markus Meister, ‘The unbearable slowness of being’: ‘The unbearable slowness of being: Why do we live at 10 bits/s?’, Neuron 113.2 (22 January 2025), pp 192–204. |