A Few Things About Programming Languages

2024-12-09

Motivation for writing this article: as a general programmer rather than a programming language (PL) researcher, I collected some basic PL concepts and wrote them down. The article is fairly fragmented, and sections that sit at the same heading level are not necessarily closely related topics.

Scope, Name Binding, and Closure

Free and Bound Variables

In Mathematical Logic

In the expression $\sum_{k=1}^{10} f(n, k)$, n is a free variable and k is a bound variable.

If the value of the whole expression depends on the value of a variable, that variable is free; if the variable can be renamed consistently without changing the meaning of the expression, it is bound. For instance, $\sum_{k=1}^{10} f(n, k) = \sum_{j=1}^{10} f(n, j)$, whereas replacing n with m generally gives a different value.

In Programming Languages

A free variable of a function is a variable that is used inside the function body but is neither one of its local variables nor one of its parameters.
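
As a small illustration of the definition above, CPython records which names a nested function treats as local and which it takes from an enclosing function. The names outer, inner, n, and local are made up for this sketch. Note that CPython's co_freevars only lists variables captured from enclosing functions; global names are also free in the PL sense but are tracked separately.

```python
def outer():
    n = 10                      # lives in outer, not in inner

    def inner(k):               # k is a parameter of inner
        local = k + 1           # local is a local variable of inner
        return n + local        # n is a free variable of inner

    return inner


inner = outer()
code = inner.__code__
print(code.co_varnames)  # ('k', 'local')
print(code.co_freevars)  # ('n',)
```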

Scope and Name Binding

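Name binding associates names with entities (variables, functions, types, and so on). The scope of a binding is the region of the program in which it is visible, and the scope rules determine which binding a given name refers to.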

Static scope (lexical scope): the scope of a variable is a region of the source text (such as a function or a block) that is determined at compile time, and the variable is visible only within that region. This is the rule adopted by most programming languages (C++, Python, Java, ...).

Dynamic scope: the runtime maintains a stack of name bindings, and a name refers to the most recent binding that is still active. A binding stays visible for as long as the code that created it is still executing, including in the functions it calls. For example, if function f calls function g, then g can access all local variables of f while it runs. Emacs Lisp traditionally uses dynamic scope by default; Common Lisp (special variables) and Perl (local) are lexically scoped by default but offer dynamic scope as an option.

In the example below, g(1) evaluates to 1 under static scope rules and to 2 under dynamic scope rules.

x = 0
def f(y):
    return x + y
def g(z):
    x = 1
    return f(z)
g(1)  # evaluates to 1 or 2?
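
Python itself is statically scoped, so running the code above gives 1. To see where the 2 would come from, here is a minimal sketch that simulates dynamic scope with an explicit binding stack. The names env_stack and lookup are hypothetical helpers invented for this illustration, not part of any library.

```python
# Each call pushes a frame of bindings; the bottom frame holds the globals.
env_stack = [{"x": 0}]


def lookup(name):
    # Dynamic scope: search the most recently pushed frame first.
    for frame in reversed(env_stack):
        if name in frame:
            return frame[name]
    raise NameError(name)


def f(y):
    env_stack.append({"y": y})
    try:
        return lookup("x") + lookup("y")
    finally:
        env_stack.pop()


def g(z):
    env_stack.append({"z": z, "x": 1})  # g's local x shadows the global x
    try:
        return f(z)
    finally:
        env_stack.pop()


dynamic_result = g(1)
print(dynamic_result)  # 2
```

Because f looks up x on the stack at run time, it finds the binding made by its caller g rather than the global one.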

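Binding can also be divided into early (static) binding, which is resolved at compile time, and late (dynamic) binding, which is resolved at run time. The choice often goes hand in hand with the scope rules, but not always: C++ uses static scope, yet its virtual method calls are late-bound, because the function to call is selected at run time from the object's dynamic type.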
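
A rough Python counterpart of a late-bound method call is sketched below; Python resolves every method call at run time. The classes Shape and Square are made up for this example.

```python
class Shape:
    def area(self):
        return 0

    def describe(self):
        # Which area() runs is decided at call time from the object's actual class.
        return f"area = {self.area()}"


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


descriptions = [shape.describe() for shape in (Shape(), Square(3))]
print(descriptions)  # ['area = 0', 'area = 9']
```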

Closure

Translated from English Wikipedia: a closure is a technique for implementing statically (lexically) scoped name binding in a language with first-class functions (that is, functions can be stored in variables, passed as arguments, and returned as values). If you mostly use languages such as Python or C++, this definition may at first seem to say nothing, because both use static scope and fully or partially support first-class functions.

I simply understand a closure as a function packaged with its environment. The "environment" consists of the function's free variables, which may be captured by value or by reference, depending on the programming language.

Python Closure Example

def foo():
    x = 3
    def bar():
        print(x)
    x = 5
    return bar  # returns a closure

x = 0
f = foo()
f()   # 5

In the example above, foo returns a closure, the closure function is bar, and the free variable is x.

Anyone familiar with Python, even without knowing about closures, can tell that the printed number will not be 0; that follows from the rules of static scope. But why is it 5 rather than 3? Because a Python closure captures the variable itself (a reference to it), not a snapshot of its value at the time bar is defined, so bar sees the latest assignment to x inside foo. (This is related to the next chapter on value semantics and object semantics.)
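
The sketch below inspects the captured variable of the foo/bar example (with bar returning x instead of printing it), and then shows the same capture-by-reference behavior in a loop, together with the common default-argument idiom for capturing the current value instead. The function make_getters is a name invented for this illustration.

```python
def foo():
    x = 3

    def bar():
        return x

    x = 5
    return bar


f = foo()
print(f.__code__.co_freevars)          # ('x',)
print(f.__closure__[0].cell_contents)  # 5


def make_getters():
    by_reference, by_value = [], []
    for i in range(3):
        by_reference.append(lambda: i)   # looks up i when called
        by_value.append(lambda i=i: i)   # default argument is evaluated now
    return by_reference, by_value


by_reference, by_value = make_getters()
print([g() for g in by_reference])  # [2, 2, 2]
print([g() for g in by_value])      # [0, 1, 2]
```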

C++ Closure Example

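// Fragment: assumed to sit inside a member function of a class that defines some_func().
// Requires <algorithm> and <vector>.
std::vector<int> some_list{ 1, 2, 3, 4, 5 };
int total = 0;
int value = 5;
std::for_each(begin(some_list), end(some_list),
              [&total, value, this](int x) { total += x * value * this->some_func(); });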

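The anonymous function (lambda) here is a closure. The free variable total is captured by reference, so the lambda can modify it; value and this are captured by value. Note that capturing this by value copies only the pointer, so the object itself is still accessed through that pointer.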

Value Semantics and Object Semantics

Value Semantics

Value Semantics: taking C++ as an example, an initialization such as A b = a; creates a new object and calls the copy constructor, whereas an assignment b = a; to an already existing object calls the copy assignment operator. In both cases the member values are copied, and afterwards the copy and the original do not affect each other.

#include <iostream>

class A {
public:
  int value;
  A(int v): value(v) {}
  void print() { std::cout << value << std::endl; }
};
int main() {
  A obj1(1);
  A obj2 = obj1;  // copy constructor: obj2 is an independent copy
  obj1.value = 2;
  obj1.print(); // 2
  obj2.print(); // 1
}

Object Semantics

Object Semantics / Reference Semantics: in languages like Python, Java, C#, and JavaScript, a variable of object type holds a reference (essentially a pointer) to the object, and copying the variable does not copy the object's content. In the following Java example, a and b refer to the same ArrayList object.

import java.util.ArrayList;

public class Main {
    public static void main(String[] args) {
        ArrayList<Integer> a = new ArrayList<>();
        ArrayList<Integer> b = a; // copies the reference, not the list
        a.add(10);
        System.out.println("After modifying 'a', the object is: " + a); // [10]
        System.out.println("Since 'b' references the same object, 'b' is also: " + b); // [10]
        System.out.println(a == b); // true
    }
}

Switching Between Value Semantics and Object Semantics

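Here is an example using Java. In Java, assignment of primitive types uses value semantics, while all other variables hold object references and assignment uses reference semantics. If you want Java objects that behave like values, one option is Project Lombok's @Value annotation, which generates an immutable class: since instances cannot be modified after construction, sharing a reference is indistinguishable from holding a copy.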
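
The same switch can be sketched in Python, which also uses reference semantics for assignment. This is only an analogy to the Java and Lombok discussion above, not an equivalent API: copy.deepcopy gives an independent copy on demand, and a frozen dataclass plays a role loosely similar to an immutable value class. The class Point and the variable names are made up for this example.

```python
import copy
from dataclasses import FrozenInstanceError, dataclass, replace

a = [[1, 2], [3]]
alias = a                  # reference semantics: same object
clone = copy.deepcopy(a)   # value-like behavior: independent copy
a[0].append(99)
print(alias[0])  # [1, 2, 99]
print(clone[0])  # [1, 2]


@dataclass(frozen=True)
class Point:
    x: int
    y: int


p = Point(1, 2)
q = replace(p, x=10)       # "modification" builds a new value
print(p, q)                # Point(x=1, y=2) Point(x=10, y=2)

try:
    p.x = 5
except FrozenInstanceError:
    print("Point instances cannot be modified in place")
```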

Compiler and Runtime System

Distinguishing Programming Languages from Their Implementations

First of all, a programming language is an abstraction and a specification for describing computer programs; it does not prescribe how it must be implemented, whether by compilation or by interpretation. A language can even have several implementations (for example, Python has CPython, PyPy, and others). Therefore, the boundary between "compiled" and "interpreted" languages is not clear-cut.

Based on the common characteristics of existing implementations of programming languages, we can roughly divide them into two categories:

Unmanaged Languages: typically implemented by static (ahead-of-time) compilation to architecture-specific machine code for efficient execution, such as C, C++, and Rust.
Managed Languages: typically implemented by keeping the program in a higher-level, architecture-independent format (such as bytecode) that is executed by a runtime system, such as Java and Python.
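
To make the "architecture-independent format" concrete, the following sketch asks CPython to compile a small expression to bytecode and then lets the runtime execute it. The expression and the file name "<example>" are arbitrary choices for this illustration, and the exact instructions printed by dis differ between Python versions.

```python
import dis

# Compile source text to a code object that holds bytecode, not machine code.
code = compile("a + b * 2", "<example>", "eval")
dis.dis(code)  # prints the bytecode; the listing is version-dependent

# The runtime (the CPython interpreter loop) executes the bytecode.
result = eval(code, {"a": 1, "b": 2})
print(result)  # 5
```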

Discussing JIT in Runtime Systems

Runtime systems involve many topics, such as garbage collection, but here we only mention just-in-time compilation (JIT). A JIT compiler converts source code or bytecode to machine code during execution, so that the machine code can be run directly for better performance. Among common runtime systems, the Java Virtual Machine (for example, HotSpot) uses JIT and PyPy includes a JIT, while CPython has historically not included one (version 3.13 ships an experimental JIT that is disabled by default).

Designing Compilers Using Intermediate Languages

The following approach appears to be a common paradigm for implementing programming languages (specifically, for building compilers):

Define an intermediate language (IL), also called an intermediate representation (IR)
Use compiler front-ends, one per source language, to translate multiple languages into the intermediate language
Use runtime environments to execute the intermediate language, or use compiler back-ends, one per target platform, to translate it into machine code
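
The three steps above can be sketched in miniature. This is a toy under stated assumptions, not how any real compiler is built: the IR is an invented list of stack-machine instructions, the two "languages" are reverse Polish and infix arithmetic restricted to + and *, and all function names are hypothetical.

```python
import ast

# The IR: a list of instructions such as ("push", 1), ("add",), ("mul",).


def frontend_rpn(text):
    """Front-end 1: reverse Polish notation, e.g. '1 2 + 3 *'."""
    ops = {"+": ("add",), "*": ("mul",)}
    return [ops[tok] if tok in ops else ("push", int(tok)) for tok in text.split()]


def frontend_infix(text):
    """Front-end 2: infix arithmetic, parsed with Python's ast module."""
    def lower(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return [("push", node.value)]
        if isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Add, ast.Mult)):
            op = ("add",) if isinstance(node.op, ast.Add) else ("mul",)
            return lower(node.left) + lower(node.right) + [op]
        raise ValueError("unsupported syntax")

    return lower(ast.parse(text, mode="eval").body)


def run(ir):
    """'Runtime' back-end: interpret the IR directly on a stack."""
    stack = []
    for instr in ir:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(left + right if instr[0] == "add" else left * right)
    return stack.pop()


def emit_asm(ir):
    """'Compiler' back-end: emit text for an imaginary stack machine."""
    return "\n".join(" ".join(str(part).upper() for part in instr) for instr in ir)


ir_a = frontend_rpn("1 2 + 3 *")
ir_b = frontend_infix("(1 + 2) * 3")
print(ir_a == ir_b)  # True
print(run(ir_a))     # 9
print(emit_asm(ir_b))
```

Both front-ends produce the same IR, so either back-end works for both source notations; adding a third notation only requires a new front-end.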

Microsoft's CIL and CLR

Microsoft uses the Common Intermediate Language (CIL, also known as MSIL) as a bridge to implement multiple programming languages: languages like C#, F#, VB.NET, and C++/CLI can all be compiled to CIL; CIL can run on the CLR (Common Language Runtime) virtual machine.

Microsoft's implementation of the CLR is the .NET runtime. .NET was not open source in its early days, so the open-source Mono project implemented its own compatible runtime, and Unity, which needed C# for development, adopted Mono. After .NET became open source, Unity announced plans to migrate to the .NET runtime (CoreCLR).

LLVM

LLVM is a compiler toolchain built around a common IR (LLVM IR). It can be used to implement many programming languages on any platform for which LLVM has a back-end: first, a front-end compiles the source language to LLVM IR, then a back-end translates the IR to machine code for the target platform.

LLVM is widely used: the Rust compiler rustc translates Rust programs to low-level LLVM IR, then calls the LLVM back-end to generate target machine code. The Clang compiler, developed as part of the LLVM project, is also a front-end; it translates C, C++, Objective-C, and Objective-C++ to LLVM IR.