Today, I would like to share my final paper from the Technical Writing course I took this semester. I intended this paper to be read by a general audience, so don't be discouraged when I say that it's about programming languages.
This paper isn't perfect, and I am by no means an expert, but I feel like it would be better to share this than to discard it.
BTW, not to gloat, but the main author whom I'm quoting thinks I did alright!
Note: I'll try to figure out how to add hyperlinks to the outline and the glossary without Blogger fudging it up, but maybe not right away.
"Excellent job Zengid. I think you actually understood the work quite well. Not easy given the amount of data in that paper!" https://t.co/PzKhWmXZev (Andreas Stefik, @AndreasStefik, December 13, 2016)
Programming Languages Are For Humans
Syntax Design and Its Effect on Intuitiveness
------------------------------------------------------------------------------------------------------------
OUTLINE:
ABSTRACT
INTRODUCTION
Statement of Purpose
Target Audience
Scope
Note on Formatting
BACKGROUND & HISTORY
Machine Time vs Human Time
High-Level Languages
Syntax Designed by Faith
Traditional vs Alternative Syntax
Designing Readability: “Treating Code as an Essay”
REVIEW OF SYNTAX STUDY
Goals
Syntax Survey
Intuitiveness Tests
Goals and Methods
Token Accuracy Maps
Results
CONCLUSION
Works Cited
--------------------------------------------------------------------------------------------------------------------
Abstract
A programming language is an abstract tool used to issue commands to a computer. This is possible because of a compiler, which translates the high-level language into an equivalent binary form that the machine can consume. Most of the effort in Computer Science has gone into optimizing compilers for the machine, and this lack of attention to the human operators is taking a toll on new programming students. Current research by Stefik et al. attempts to empirically study which syntactic elements of modern programming languages present difficulty to students. A successful language, Ruby, is also discussed. Its creator, Yukihiro Matsumoto, believes that humans and their needs are more important than machines.
INTRODUCTION
A programming language is a bridge between humans and machines. These languages are defined with a finite set of symbols, words, and punctuation marks that make up their syntax. This syntax has an underlying meaning, or semantics, that can be understood by a trained human and translated into a machine language which can then be executed by a computer (Scott, 1999, p. 37).
Given that the field of computer science is a relatively new branch of applied mathematics and physics, much of its development has been derived from the mathematical techniques of formulating proofs and statistical analysis of machine performance (Stefik & Siebert, 2013, p. 25). Thus, when designing programming languages, a computer scientist will have a formalized method for analyzing the machine, but until recently there have been few scientific methods for analyzing the productivity of the human programmer. A pioneering study by Stefik and Siebert (2013) has made efforts to change that trend by studying how intuitive several programming languages may be for novice programmers. To measure this, they used empirical methods that have been developed in other scientific fields such as medicine and psychology (p. 25).
Statement of Purpose
The key purpose of this paper is to discuss the nature of programming language syntax and some historical background that has shaped the state of several popular languages. This background will form a foundation that can then be used to discuss research into the intuitiveness of language syntax and suggest why some languages may be more intuitive.
Target Audience
This paper has been written so that, hopefully, anyone can understand the issues discussed, such as why programming language syntax has the form that it does and why novice programmers may struggle more with some languages than with others. Key technical information will also be described so that it can be understood by a general audience.
Scope
This essay is intended to present the significance of the results found by Stefik and Siebert in their study of the intuitiveness of syntax without explaining the fine details of how they conducted their research. The necessary background information for programming languages will be described for a general audience who are not trained in Computer Science.
Note on Formatting
When code is mentioned in the text it will be set in the Courier font and colored in light-blue to distinguish it from the surrounding description.
BACKGROUND & HISTORY
Machine Time vs Human Time
In 2016, computer technology is interwoven into nearly every facet of daily life. Despite this, as the computer scientist Michael Scott (1999) states, “Computer Science is [still] a young discipline” (p. 5). It began in earnest amidst the tensions preceding the Second World War, as some of the first computers were built to decrypt enemy ciphers or to compute the trajectories of artillery shells. These machines calculated at a rate that significantly exceeded the abilities of a single human. Because access to computing machinery was limited, computer-time was regarded as more valuable than the human-time needed to program the computer. The empirical research of the time focused on the validation of algorithms and the efficiency of the machine as it executed those algorithms, without much concern for the humans who programmed the machines (Scott, 1999, p. 3).
High-Level Languages
Eventually, the economics of computer-science research began to shift. As Scott (1999) mentions:
“Increases in computer speed and program size have made it increasingly important to economize on programmer effort, not only in the original construction of programs, but in subsequent program maintenance --enhancements and correction. Labor costs now heavily outweigh the cost of computing hardware” (p. 4).
Now the focus shifted towards the humans that programmed the machines, with the hope of increasing their productivity. A facet of this effort led to automating the translation of a human-oriented language into the rigid instructions that the machine was built to execute. This human-oriented language is known as a high-level programming language because it is composed of English-like statements and math-like expressions with which an educated human would be more familiar than with the awkward machine-level instructions. In 1951 Grace Murray Hopper invented a translation program called a compiler, which translates a high-level programming language into machine instructions (IEEE Computer Society History Committee, 1996). This invention allowed high-level languages to develop independently from the underlying machine hardware, and also allowed programmers “to communicate with machines in terms of abstract concepts rather than forcing them to translate these concepts into machine-compatible form” (Brookshear, 2003, p. 243).
Syntax Designed by Faith
While this advancement offered a significant boost in productivity, the design and development of higher-level languages was done without any empirical testing into the effectiveness of their syntax. After reviewing the literature of programming language design, Stefik and Siebert (2013) cite the claims of two authors:
“First, Markstrum has looked carefully at the historical literature and his investigation reveals that new syntax or features are simply added to languages, usually with no data regarding human users [Markstrum 2010]. Second, Hanenberg argues that the entire discipline of language usability is based on Faith, Hope, and Love [Hanenberg 2010b]. In other words, both Hanenberg and Markstrum have documented that the language design community often does not use evidence at all; relying nearly exclusively on anecdotes” (p. 5).
To counter this trend, Stefik and Siebert designed empirical experiments of their own which borrow from the traditions of medicine and psychology; for instance, they used randomized controlled trials and administered a metaphorical placebo. Their goal was to investigate whether syntax has a measurable effect on the comprehensibility of a programming language for novice programmers. While there have been few evidence-based studies of the effectiveness of particular syntaxes, there remains a prolific trend for modern languages to adopt the syntax of the C programming language.
Traditional vs Alternative Syntax
Perhaps one of the most influential (though not original) syntaxes has come from the C programming language. This is not necessarily because of the syntax itself, but because the language has become the de facto standard for writing operating systems. The C language was invented by Dennis Ritchie in the early 1970s while he and Kenneth Thompson were busy building the UNIX operating system at Bell Labs (Kernighan & Ritchie, 1978, p. ix). C provided them with the right tools for developing programs that worked directly with the underlying hardware of the computer without any unnecessary abstractions that might bloat the size and degrade the speed of execution. Therefore C may be thought of as a minimalistic language where many programmer-conveniences must be built from scratch or done without (Kernighan & Ritchie, 1978, pp. 1-2).
Today C is recognized as a structured programming language, which means that there are special syntactic structures that allow the programmer to declare the order in which the statements within the program will be executed. The most basic structure, sequence, is so obvious that it is easy to overlook: statements are executed in the order that they appear within the program, starting from the top and reading down line by line. The second structure is selection, which can select a block of code to be executed or ignored based on the logical value of a preceding test. Selection is written with the if statement (an example of a Ruby-style if statement can be found in Figure 1). The final control structure is the loop, which allows a block of code to be executed repeatedly depending on the condition of a logical test. These basic structures were not invented in the C language, but the specific syntax for them has since been copied verbatim into many other languages. Languages that have this C-style syntax for their control structures are considered to be the traditional languages within this paper.
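As a rough illustration of the three control structures just described, here is a small sketch in Ruby, the language used for the other examples in this paper. The method name countdown_report and the values are invented for this illustration and do not come from the study.

```ruby
# Illustrative sketch only: the method name and values are assumptions.
def countdown_report(start)
  lines = []

  # Sequence: statements run one after another, top to bottom.
  number = start
  lines << "starting at #{number}"

  # Loop: the block repeats for as long as the test is true.
  while number > 0
    # Selection: one branch is chosen based on a logical test.
    if number.even?
      lines << "#{number} is even"
    else
      lines << "#{number} is odd"
    end
    number = number - 1
  end

  lines << "done"
  lines
end

puts countdown_report(3)
```

Run with a starting value of 3, the loop body executes three times, and each pass selects either the even or the odd branch.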
Another feature of some traditional languages is that they share C's main use case, namely developing systems-level programs which run close to the machine level. This means that, like C, their code must be compiled into machine code before their programs can be executed. Compiled languages typically require the programmer to provide complete definitions of each element within the program in order for the compiler to know how to translate the code into the correct machine instructions. While this can become tedious for the programmer, it results in programs that are more performant and easier to maintain once they become large and complex.
In contrast, there exists an alternative to using a compiler. Stefik and Siebert examined several languages that are known as scripting languages, as they are intended for directing and coordinating the interactions of many small and disparate programs by writing scripts. Scripting languages often depend on an interpreter, which is a running program that acts as an interactive translator. The interpreter executes each line of code immediately, allowing the programmer to interactively build a program the way an artist would build up a painting. The interpreter also takes care of ‘bookkeeping’, such as data-type declarations, providing the programmer with greater flexibility as they develop their scripts.
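To make the ‘bookkeeping’ point concrete, the following Ruby sketch assigns three different kinds of data without declaring a type for any of them; the interpreter works out each type as the program runs. The method name describe_values and the sample values are invented for this illustration.

```ruby
# Illustrative sketch only: names and values are assumptions.
def describe_values
  count = 3            # an Integer, with no type declaration
  greeting = "hello"   # a String
  price = 2.5          # a Float

  # Ask each value which class the interpreter gave it.
  [count, greeting, price].map do |value|
    "#{value.inspect} has class #{value.class}"
  end
end

puts describe_values
```

In a compiled language such as C, each of these variables would need a declared type before it could be used.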
Three of the languages Stefik and Siebert investigated originated as scripting languages: Perl, Python, and Ruby. While these languages inherit some C syntax, they also diverge significantly; for instance, Perl has incredibly terse statements with heavy punctuation, while Ruby and Python favor a more readable and prosaic form. Because of this, some programmers consider Ruby and Python syntax to be like “executable pseudo-code” (Artima Developer, 2003).
Designing Readability: “Treating Code as an Essay”
It is not an accident that Ruby code is easy to read. In an essay, Ruby's inventor, Yukihiro Matsumoto, explains the importance of readable code:
“Most programs are not write-once. They are reworked and rewritten again and again in their lives. [...] During this process, human beings must be able to read and understand the original code; it is therefore more important by far for humans to be able to understand the program than it is for the computer. Computers can, of course, deal with complexity without complaint, but this is not the case for human beings. Unreadable code will reduce most people’s productivity significantly. On the other hand, easily understandable code will increase it. And we see beauty in such code” (Matsumoto, 2007, p. 478).
Thus, as cited earlier in the High-Level Languages section, the cost of programmer labour far outweighs the cost of computing hardware, and Matsumoto has therefore sought to design a language that is optimized for the programmer and not the machine. Because Ruby is an interpreted language, Matsumoto was able to design the Ruby interpreter to do much of the ‘bookkeeping’ that is required for the computer to run the program without making it a mandatory task for the programmer. This allows much of Ruby's syntax to be pruned of unsightly symbols that would detract from its readability (Artima Developer, 2003). Ultimately, to see whether Matsumoto was successful in his design of Ruby, one must look at the results of Stefik and Siebert's scientific experiments.
REVIEW OF SYNTAX STUDY
Goals
Stefik and Siebert sought to perform controlled experiments testing the effectiveness of programming language design for the benefit of students (2013, pp. 1-2). They state that “given the documented attrition rates in introductory programming courses [Beaubouef and Mason 2005], it seems reasonable to assert that novices face significant challenges when initially confronted with a general purpose programming language”. Thus, identifying where “syntactic barriers” may exist gives instructors and language designers the opportunity to make improvements to their coursework and syntaxes respectively (2013, p. 2).
Stefik and Siebert began their research by issuing surveys to students with and without programming experience, asking whether a selection of symbols and keywords matched a corresponding semantic category. The results of the survey were used to influence the design of a pedagogical language they were developing, named Quorum, which will be explained later in this paper.
Syntax Survey
An explanation of every syntactic element that was surveyed is outside the scope of this paper, but one simple example involves two operators that have different meanings yet both use the ‘equals’ sign from mathematics.
The first symbol is = which commonly represents the operation of variable assignment, and it is used to assign some form of data to a name so it can be used later. Variable assignment for an integer value in the Ruby programming language takes the form:
number = 5
where number is the name of the variable. This code statement can be translated into the pseudo-English phrase “the variable number is assigned the integer value of 5”. Notice how ‘is assigned’ is used instead of ‘equals’: the variable number can be re-assigned to another integer value at any time during the execution of the program, which is unlike the mathematical meaning of ‘equals’. When surveyed on the intuitiveness of this syntax, Stefik and Siebert reported that “overall, both programmers and non-programmers rated the symbol = as the most intuitive choice for assignment”, although they conceded that non-programmers rated the alternative keyword, is, at 6.32 on a scale of 1 to 10, while = scored 6.9, as shown in Table 1 below (2013, p. 11).
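The difference between ‘is assigned’ and the mathematical ‘equals’ can be seen in a short Ruby sketch. Only the statement number = 5 comes from the text above; the method name reassign_example and the second value, 12, are made up for illustration.

```ruby
# Illustrative sketch only: the method name and the value 12 are assumptions.
def reassign_example
  number = 5        # "the variable number is assigned the integer value of 5"
  before = number   # remember what number held at this point
  number = 12       # the same name is re-assigned a different integer
  [before, number]
end

before, after = reassign_example
puts "before: #{before}, after: #{after}"
```

In mathematics, a statement that number equals 5 followed by a statement that number equals 12 would be a contradiction; in a program, it is simply two assignments made one after the other.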
Languages that use = for assignment usually use an alternate symbol for equality expressions. The most common operator for equality is the double-equals sign, or ==. It is used to test if two values are logically equivalent; for instance, if a programmer wished to test whether number is a certain value then they could write the Ruby code excerpt in Figure 1.
This code tests whether the variable number is equivalent to the literal value of 5, and if so, the string "number equals 5" will be printed to the screen. If the variable number is not equal to the value 5, then the string "number is not 5" will be printed. Logical equivalence is a very subtle syntactic distinction, and Stefik and Siebert found evidence showing that novice programmers have difficulty distinguishing the assignment operator = from the less intuitive equality operator == (2013, p. 32). One can see this in the non-programmers' high rating of = in Table 2, whereas programmers rated the more common == as more intuitive. Stefik and Siebert took this into consideration while designing the next iteration of their research language Quorum, which will be discussed in the section on Token Accuracy Maps.
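Figure 1 is not reproduced in this text. Going only by the description above, the excerpt might look something like the following sketch. The wrapping method describe is a hypothetical addition so that the test can be run with more than one value, and it returns the strings rather than printing them directly.

```ruby
# A guess at the shape of Figure 1, based only on the prose description.
# The method wrapper `describe` is an assumption added for illustration.
def describe(number)
  if number == 5           # == tests equality; it does not assign
    "number equals 5"
  else
    "number is not 5"
  end
end

puts describe(5)
puts describe(7)
```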
Intuitiveness Tests
Goals and Methods
The most significant results of Stefik and Siebert’s research are from their tests on intuitiveness. They presented novice programmers with snippets of code from several languages (as seen in Figure 2.1 and Figure 2.2) and a brief description of what the code does as a whole (Figure 3). Without teaching them specific details of syntax or any general programming concepts, Stefik and Siebert asked the participants to extrapolate from the code they were shown and to write new code that accomplishes specific task requirements. One such target program description is shown in Figure 3.
Token Accuracy Maps
To score the participants, Stefik and Siebert used a special tool they call a Token Accuracy Map (TAM), as shown in Figure 4 below. The TAM scores show how frequently the token elements (highlighted in gray) were written correctly (a score near 1) or incorrectly (a score near 0) by the participants in their studies. As mentioned in the Syntax Survey section of this paper, novice programmers conflate the meaning of the assignment operator = with the logical equivalence operator ==. Stefik and Siebert found evidence for this using their technique of mapping statistical scores next to the syntactic elements of the target program which the participants were asked to write.
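The details of how Stefik and Siebert computed their scores are outside the scope of this paper, but the basic idea of a per-token score between 0 and 1 can be sketched as follows. Everything here is hypothetical: the method name token_accuracy, the way answers are recorded, and the sample data are all invented for illustration and are not taken from the study.

```ruby
# Hypothetical sketch: each participant's answer is recorded as a hash that
# maps a token of the target program to true (written correctly) or
# false (written incorrectly). The sample data below is made up.
def token_accuracy(responses)
  tokens = responses.flat_map(&:keys).uniq
  tokens.map do |token|
    graded = responses.select { |response| response.key?(token) }
    correct = graded.count { |response| response[token] }
    [token, correct.to_f / graded.size]   # fraction of correct attempts
  end.to_h
end

sample_responses = [
  { "if" => true,  "==" => false, "end" => true },
  { "if" => true,  "==" => false, "end" => true },
  { "if" => true,  "==" => true,  "end" => false },
  { "if" => false, "==" => false, "end" => true }
]

token_accuracy(sample_responses).each do |token, score|
  puts "#{token}: #{score}"
end
```

In this made-up data the == token receives the lowest score, which is the kind of pattern that would point to a syntactic barrier.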
The TAM is interesting because it helped them discover that participants were incorrectly using the assignment operator = within the if statement (2013, p. 32). Stefik and Siebert took this evidence and used it as a motivation for redesigning the Quorum compiler so that using = within an if statement would change the meaning of = to that of the logical-equivalence operator ==. This technique is called operator overloading, which means that the semantics of an operator depend on the context within which it appears. A feature like this creates more work for the language designer, but as Yukihiro Matsumoto explains: “because language users are more common than language implementers, the needs of the latter must give way to those of the former” (2007, p. 480).
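For contrast, Ruby does not reinterpret = inside an if statement, so the novice mistake quietly changes what the program does. The sketch below is Ruby, not Quorum, and the method names are hypothetical.

```ruby
# Illustrative Ruby, not Quorum. The method names are assumptions.
def check_with_assignment(number, target)
  if number = target      # assigns target to number; the result is truthy
    "matched"
  else
    "did not match"
  end
end

def check_with_equality(number, target)
  if number == target     # compares the two values
    "matched"
  else
    "did not match"
  end
end

puts check_with_assignment(3, 5)
puts check_with_equality(3, 5)
```

The first method reports a match even though 3 and 5 differ, because the test assigns instead of comparing. Quorum's redesign, as described above, treats = in that position as a comparison so that the novice's intent is preserved.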
Results
By using the Token Accuracy Mapping technique to score the participants, Stefik and Siebert were able to determine that the languages Ruby, Python, and a version of Quorum were significantly more intuitive than Perl and Java (as shown in Table 3).
CONCLUSION
While the results of Stefik and Siebert’s studies are preliminary, they provide evidence that syntax can have a significant effect on how a language is perceived by a novice programmer. Yukihiro Matsumoto’s intuition for how a programming language should be designed has successfully culminated in Ruby being an intuitive programming language. By using the methods developed by Stefik and Siebert, perhaps future language designers will be able to rely upon scientific methods instead of merely basing their work on Faith, Hope, and Love.
Works Cited
Artima Developer. (2003). The Philosophy of Ruby, A Conversation with Yukihiro Matsumoto. Retrieved from http://www.artima.com/intv/ruby3.html
Brookshear, J. G. (2003). Computer Science: An Overview (11th ed.). Boston, MA: Addison-Wesley.
IEEE Computer Society History Committee. (1996). Timeline of Computing History. Retrieved from https://www.computer.org/cms/Publications/timeline.pdf
Kernighan, B. W., & Ritchie, D. (1978). The C Programming Language (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Matsumoto, Y. (2007). Treating Code as an Essay. In A. Oram & G. Wilson (Eds.), Beautiful Code (pp. 447-481). Sebastopol, CA: O’Reilly Media.
Scott, M. L. (1999). Programming Language Pragmatics (4th ed.). San Francisco, CA: Morgan Kaufmann Publishers.
Stefik, A., & Siebert, S. (2013). An empirical investigation into Programming Language Syntax. ACM Transactions on Computing Education, 13(4), 19:1-40.
Glossary:
- C: A systems-level programming language invented by Dennis Ritchie in the 1970s while developing the Unix operating system with Ken Thompson (see Unix). C is a small language, meaning it gives a programmer the bare essentials needed to control the basic components of a CPU while offering a concise structured programming paradigm (Kernighan & Ritchie, 1978, p. 1).
- Compiler: “A compiler translates the high-level source program into an equivalent target program (typically in machine language) and then goes away” (Scott, 1999, p. 14). Compilers produce efficient code because they can make optimizations based on analysis of the complete code being compiled. Compare Interpreter.
- Interpreter: A program that is the “locus of control” for an interpreted programming language (Scott, 1999, p. 14). The interpreter converts the high-level syntax of a language into machine-executable code one line of code at a time. This allows a programmer to interact with the interpreter to test ideas before committing them to a complete program. Compare Compiler.
- Kernel: The core of an operating system (traditionally written in C). Utility and application programs interact with the kernel via the shell interface. See Shell.
- Operator Overloading: The context of an operator defines its semantic meaning.
- Perl: A scripting language invented by Larry Wall for text processing and tying operating-system programs together.
- Pseudo-code: A generalized ‘code-like’ syntax intended for sketching out the basic forms of algorithms. Instead of using a specific programming language, pseudo-code allows programmers to communicate what their code should be doing, and not how it should ultimately look.
- Python: A general purpose programming language invented by Guido van Rossum in the late 1980s to facilitate the writing of utility software. It was intended to fill the problem domain between C and the Shell.
- Ruby: A general purpose programming language invented by Yukihiro Matsumoto and first released publicly in 1995, which builds upon the models set forth by Perl and Python. Ruby seeks to improve the programmer's experience by increasing code readability and making programming fun with elegant syntax and powerful abstractions.
- Scripting Language: A high-level language that allows for external programs to be controlled and combined (glued, or scripted) together. Although an imprecise term, it generally is associated with languages that offer expressive power and convenient data-abstraction in place of speed. Scripting languages are often interpreted. See Interpreter.
- Shell: A text-based interface for issuing commands to the operating system kernel with a built-in job-control language. See Kernel.
- String: A type of data that contains character symbols, spaces, and/or numbers.
- Structured Programming: Programs are written using imperative sequential statements, selection (if statements), and iteration (loops).
- Syntax: The formal rules defining how a language can be constructed with a finite number of indivisible elements or symbols (primitives).
- Unix: An operating system developed at AT&T's Bell Labs by Ken Thompson and Dennis Ritchie in the 1970s. The philosophy behind the system is to combine many programs that “do one thing well” from the high level of a shell command interface (see Shell). Unix has influenced many contemporary operating systems, most notably the Open-Source operating system kernel created by Linus Torvalds, named Linux.