Are any Linux commands written in assembly

Programming languages ​​and translators
10 - machine code

Now that the calling convention and the mapping of IR variables to stack slots have made it clear what the data layout for a call frame looks like, the next step is to encode the IR commands as machine commands. To do this, two decisions have to be made: When Command selection a single machine command or a sequence of machine commands is selected for an IR command. In the Register allocation we decide at what point in time a variable lives in its slot and when it can be found in a CPU register. Once we have solved both problems, we can write out the machine instructions together with the appropriate register operands as assembler and we are done with the translation. So actually quite simple, if it weren't for two NP-complete problems.

On the one hand, command selection is not so easy: many processors, especially CISC architectures, have many variants of a single command, all of which differ, to a greater or lesser extent, in their semantics and in their non-functional properties. The AMD64 architecture knows 36 different variants of, all of which are noted in the assembler with the -Opcode. In addition, a 1: 1 mapping is not possible for every IR command, but a 1: N mapping to machine commands is required in order to implement the required functionality. It gets worse because looking at multiple IR commands in an M: N fashion leads to better translation results. All in all, it has been found that optimal instruction selection is an NP-complete problem.

On the other hand, the register allocation is also a problem: The decision as to which variable lives in which register at which point in time has a massive impact on the program's performance. This is due to the fact that accesses to the memory are much, much slower than to registers. See: Your computer is already a distributed system. Why isn't your OS? Access times of stack overflow .. Therefore every transfer between stack slot and register costs a lot Time and we want to have as few slot register transfers as possible during the term. But register allocation is also a complicated problem when trying to keep variables in register across basic block boundaries. For example, if the code of a basic block expects a variable in a certain register, all previous blocks must ensure that the variable is actually loaded in this register. Since the CFG is in general neither linear nor acyclic, we are dealing here with a cyclic feedback of such conditions. All in all, it has been found that an optimal register allocation is an NP-complete problem.

But the whole thing gets worse, since instruction selection and register allocation interact with one another: not every instruction sequence that takes instruction selection into account needs registers the same. And not every machine instruction can be called with every register. There are also commands that change the content of registers as a side effect. All in all: The combination of register allocation and command selection is a really thick board.

And whenever the computer scientist doesn't know what to do next, you think about the simplest method to solve the problem and then think about where you can improve the method.But with such an approach of incremental improvements, it is important that you can easily move into a local one Can maneuver optimally. To avoid this, it is worth formulating an optimal solution, no matter how complex it is. This simplest basic case for machine code generation will help us to understand the problem more precisely before we discuss a suggestion on how we can improve this minimal solution. The minimum solution consists of doing a macro expansion for each IR command and reading all variables from the slot before the sequence and writing them back to it immediately afterwards.

For the Macro expansion we store a fixed macro for each of our 16 IR commands, which we expand when such a command occurs in the IR program. This macro consists of a machine instruction sequence that includes one or more instructions. However, at the points where registers would be used in the operations, the macro contains placeholders (R1, R2) that are filled by the register allocation. In addition, there are a number of instructions as to which IR operand is to be loaded into which placeholder and in which placeholder register the result can be found.

For every occurrence of the IR command in the IR program, we instantiate the macro and connect the instructions at the inputs and outputs with the specific operands. We then pass the instantiation of the macro to the register allocation, which has to ensure that a register is selected for each placeholder, the current variable values ​​end up in the register, and the result ends up in the target operand.

The simplest variant of a register allocation is a spilling register allocator. With this type of register allocation, we always only consider a single instruction: Before the instruction, we load all variables freshly from their permanently assigned stack slot into a register. After the command sequence has been processed, the result is immediately written back to a slot. This direct writing back means that a variable only lives in a register for the duration of a single instruction. There are no problems whatsoever that there can be any inconsistencies across control flows or that we forget to write back any value. Fancy. Chic but inefficient.

inefficiency is precisely the problem of this minimal machine code generation. The potential of the machine is not fully utilized in (at least) three places: On the one hand, we are constantly transferring variable values ​​between memory and register, regardless of whether they are already preloaded in a register. The example alone, in which 3 out of 4 commands were only transfers, makes it clear that a main part of our program is concerned with using memory. It would be much smarter to only load what is not already loaded in a register from memory. If we have paid the price of loading from the store, then we want to have some of it as long as we can.

The second problem is that the rigid macros cannot take advantage of the more complex capabilities of the CPU. With CISC machines in particular, a single command can have very complex semantics. On the slides you can see an address calculation for an array that was naively composed of three macros, and the appropriate IA-32 command that calculates and dereferences in one go. The special features of the processor architecture are therefore not fully exploited.

In a similar direction, but different from it, is the third problem of the minimal solution: The macro expansion ignores the fact that modern CPUs do not execute commands butted one after the other, but that these work with a pipeline, can execute instructions in a different order than specified (out-of- Order) and also have several functional units (superscalar). This means that commands can be carried out "for free" in the slipstream of other commands. In this way, commands can be executed on the ALU in an out-of-order CPU while an earlier command is still waiting for the memory. This problem can be summarized in such a way that the peculiarities of the processorMicroarchitecture are not exploited.

There are methods for improving machine code generation for all of these sub-aspects: On the one hand, when allocating registers, you can consider more than just one instruction at a time, and so on more global register allocation carry out. This greater knowledge makes the allocation more complex, but the result is orders of magnitude better. In order to take advantage of the processor architecture, modern translators have one Peephole optimizerwhich always compares a fixed number of assembly instructions (e.g. 3 consecutive commands) with a sample database and replaces them with more efficient snippets. In order to take advantage of the micro-architectural features, the Instruction selection Rearrange assembly instructions so that the same calculation is still carried out, but as many instructions as possible are hidden in the slipstream of others.

In the following, we will pick out the aspect of register allocation and discuss a register allocator that does not suffer so much from amnesia and that remembers which variables have already been transferred to the register set over the course of a single basic block.

First we have to explain the world view of this allocator fundamentally: This allocator with memory sees the register set as a kind of fast intermediate storage (cache) for the slots of the variables in the memory. The allocator goes through a basic block instruction by instruction, from top to bottom, and in the meantime notes which values ​​are already in the cache and which have to be loaded into the register file. Load commands are then issued for the latter in order to fill the cache. In addition, the allocator does not write the values ​​back from the registers immediately, as was the case with the minimal solution, but only after a delay (lazy spilling). In the words of the course "Basics of Computer Architecture", this allocator uses the register set as a fully associative cache with a write-back strategy. If you have problems with this sentence, go back to your GRA records and see what it means.

The allocator can be used directly from the macro expansion. The slides show what the macro expansion looks like for the -IR command. The expansion has access to the register allocator used. This not only manages the full and free registers, as the name suggests, but also the assignment of registers to variables. Therefore, macro expansion orders that the left () and right () must be loaded into registers. The macro also gives the register allocation information that it intends to modify the register in which the right-hand side is loaded (). The actual macro sequence consists only of the instruction which is output or emitted in the assembler. At the end, the macro sequence communicates that the target variable () can be found in.

To fully understand the integration of macro expansion and register allocator, let's take a look at the API of the register allocator. In a functional hierarchy, the register allocator is completely dependent on the macro expansion. This means that the register allocation never works by itself, but is called at critical points in order to load variables or give them the chance to synchronize with the machine code generation.

With three hooks, which are executed for a function, a basic block or an individual instruction before the machine code is generated, the allocator is given the opportunity to adapt its internal bookkeeping accordingly. There is no flow of information from the allocator to the macro expansion at these points.

The situation is different for the functions, and. Here the allocator receives a concrete work order from the macro expansion to clear a register, to transfer a variable to a register, or to link a result register with an IR variable. The macro expansion must be able to request that a specific register be allocated (,), since some machine instructions expect their operands in certain registers. Without this secondary condition, the allocator is free to choose the register.

Before we get to the functionality of the allocator, let's look at what database the allocator works on. Basic principle to understand IT solutions: Ask what the state (data) is and according to which rules it is processed. The state of our improved Allocators consists of three fields per processor register, which are continuously updated during machine code generation.

The field shows whether this register has already been allocated for the current instruction. Since we have to ensure that the two commands in the macro have different registers, we need this Boolean flag to prevent duplication. These fields are reset to before each instruction in the hook.

The field shows whether a variable is currently "alive" in this register, and if so, which one. At the beginning of a basic block we do not know of any register that it contains the current value of a variable, which is why these fields are initialized as empty.

The field indicates that the stack slot of the variable and the register content are currently not synchronized and only an outdated value can be found in the stack slot. This also means that the value in the register has to be written back to memory at some point. The value of the Boolean flag is only relevant if an IR variable is assigned to the register.

Now let's go through the three important API commands (, and):

First, let's look at what macro expansion can use to request an empty register. This is necessary, for example, when a value is created from "nothing", as is the case, for example, with. The aim of the award is to choose a register in such a way that so little can be found in the further course Follow-up costs as possible. Although our allocator cannot look into the future, we can at least specify an upper bound, which maximum follow-up costs can arise in the further course of the basic block. The allocator calculates a score for each CPU register and outputs the register with the most favorable score; he takes one prioritized allocation in front. For each call of, of course, only those registers that have not yet been assigned come into question ().

The lowest cost is when we publish an empty register. Since no value is displaced from the register set, there can be no costs for reloading the value in the future. Empty registers are therefore assigned with the highest priority.

Next, we issue registers that contain a value but are synchronized with the stack slot. For these registers, costs can only be incurred for reloading the value in the further course if this should be required again by another instruction; there is a maximum of one memory-register transfer. When this register is released, the link between the register and the IR variable is broken ().

The most expensive case occurs when we issue a register that contains a value that has not yet been synchronized (). In this situation there is definitely a cost for the spill operation and we may also have to reload the value in the future. In the worst case, the allocation of this register will result in two register-memory transfers. We should only publish these registers if no other register is free or the macro expansion has explicitly asked for this register ().

At the end of a macro expansion it is often instructed that the result, which is now in one of the allocated registers, should be written to an IR variable. Our allocator goes the way that we do not immediately write the value into the stack slot, but only serve all accesses to the IR variable from the specified register from now on. To do this, we set the field to the corresponding target variable and note that the register content and stack slot do not match. The writing back will then take place at a later point in time.

Our allocator's command is the most exciting part of the API available. Here we use the bookkeeping that we keep on the register contents in order to save load instructions when values ​​are already available in the register "cache". The macro expansion instructs us to load a certain IR variable () into a register. We differentiate between several cases, depending on what the current allocator status looks like:

If the variable to be loaded is not found in any register (), we use a new register to allocate it and we issue a load command that transfers the variable from its stack slot into the register.

It gets exciting when the variable already lives in a register, which is one of the sticking points of the register allocator with memory. If the variable is already loaded and the register is synchronized with the stack slot, we simply output the register (). Only when registers are marked as dirty do we have to differentiate what the macro expansion intends to do with the register: If the expansion only wants to read the register with the variable value, we can simply output dirty registers. If the macro expansion not only wants to have the value of a variable, but also wants to use the register for calculation or for the result, we must first synchronize dirty registers with the corresponding stack slot using a spill command. If we did not write the dirty register back to the variable slot beforehand, the most recent value of the variable would be lost.

In addition to these four cases, there are also a few special cases when the caller has requested a specific register.In this case it can be worthwhile, if a value is already loaded in another register, to swap the contents of both registers () command. If you are interested in these special cases, you can look at them in the exercise translator.

Finally, we have to deal with a few situations that require special treatment by the allocator. The allocator is informed about these situations by the hooks already mentioned.

On the one hand, there is a register allocation that Basic block boundaries exceeds, much more difficult than a block-local allocation. Again we have the problem that only the register assignments that are consistent in all previous nodes are valid across block boundaries. In other words: A variable must be loaded at the end of all previous nodes in the same register in order to be considered loaded before the first instruction of a block. We make it easy for ourselves here and just throw away the entire allocator state before each block.

Another problem are Function calls. Since a function only preserves the calle-saved registers via a function call, we have to invalidate all registers whose content is potentially destroyed with a command. This includes that dirty registers have to be written out beforehand and the coupling between register and IR variable has to be removed. To be on the safe side, we also throw away the entire allocator status at this point and pretend that a new basic block starts after the call.

We also come back to this point with the Alias ​​problem in touch: A reference to a local IR variable becomes a pointer to the corresponding stack slot on the machine code level. Therefore, the command can cause the stack slot of a variable to suddenly contain a more current value than the register. So our cache was invalidated "backwards". In order to deal with this problem, we discard all variables from our bookkeeping for each of them, for which a reference has been created somewhere in the function. In this way we prevent us from considering an obsolete value from a register to be valid.

Despite all these conservative assumptions and the frequent discarding of the allocator status, our improved allocator already has a significant influence on the performance of programs. To make this clear to you, I have translated an iterative Fibonacci once with the minimal solution and once with the improved register allocation. This already results in a huge improvement of 36 percent and our practice translator comes to the region of when it is started without optimizations. However, you can also see that GCC still gets a lot out of such an iterative Fibonacci. These improvements come mainly from register allocation, since the inner loop of the Fibonacci calculation can take place entirely in registers and does not require any memory operations.