229 Lab 3: Fix Branch

CMPUT 229 - Computer Organization and Architecture I

Lab 3: Fix Branch

Creator: Kristen Newbury

Converted to RISC-V by: Zhaoyu Li

Introduction

The goal of this lab is to fix branch instructions in a RISC-V program that has been modified. The modified program is created by taking an original program and inserting some instructions into it. It will be necessary to fix the branch instructions in the modified program so that they will branch to the same target labels that they did in the original program. In order to achieve this you will compare the original program to the modified program to find where the new instructions are located, then you will make appropriate adjustments to the branch instructions in the modified program.

Changes to Targets of Branches

The insertion of additional instructions in a program will change the address of the branch targets. For instance, consider the original program below:

RISC-V Assembly

The code as loaded into the RARS simulator

foo:
    beq  t0 t1 target
    addi t0 t0 1
target:
    addi t1 t1 1

Address	Code	Basic
`0x00400024`	`0x00628463`	`beq x5, x6, 0x00000008`
`0x00400028`	`0x00128293`	`addi x5, x5, 1`
`0x0040002c`	`0x00130313`	`addi x6, x6, 1`

After inserting three instructions, we obtain the modified program shown below:

The code as loaded into the RARS simulator

Address	Code	Basic
`0x00400024`	`0x00628463`	`beq x5, x6, 0x00000008`
`0x00400028`	`0x00128293`	`addi x5, x5, 1`
`0x0040002c`	`0x00128313`	`addi x6, x5, 1`
`0x00400030`	`0x00130393`	`addi x7, x6, 1`
`0x00400034`	`0x00138e13`	`addi x28, x7, 1`
`0x00400038`	`0x00130313`	`addi x6, x6, 1`

In the original program, the initial address of the target is 0x0040002c, but the address of the target label is 0x00400038 in the modified program. In other words, the branch instruction at the address 0x00400024 no longer branches to the same label as it did in the original program. This is because branch instructions in RISC-V specify their targets using a relative offset. Therefore, we need to change this offset in the modified program such that the branch instructions branch to the same target label as it did in the original program.

RISC-V Branch Instructions

There are six different types of branch instructions in RISC-V:

`beq`	Branch EQual
`bge`	Branch Greater than or Equal
`bgeu`	Branch Greater than or Equal Unsigned
`blt`	Branch Less Than
`bltu`	Branch Less Than Unsigned
`bne`	Branch Not Equal

All branch instructions in RISC-V are encoded in the SB Type format:

31	25	24	20	19	15	14	12	11	7	6	0
`imm[12\|10:5]`		`rs2`		`rs1`		`funct3`		`imm[4:1\|11]`		`opcode`

In RISC-V, the relative offset of a branch instruction is encoded in bits 7-11 & 25-31 of the branch instruction. With only 12 bits (13 after shifting as the lowest bit will always be zero) to specify the offset, the range that a branch target can occur in is limited. The introduction of new instructions between the branch and its target may lead to a situation where the branch is no longer possible because the distance between the branch and its target cannot be expressed in only 12 bits.

To simplify this assignment, assume that this case will not occur for any of the programs that are provided as input to your assignment. In other words, the number of insertion made to the program will not be enough to create an impossible branch. Keep in mind that jump instruction may also be affected by the insertion of instructions. However, you DO NOT need to adjust the immediate value of jump instructions in this lab.

Fixing Branch Instructions

In order to determine if a branch instruction was affected by the insertion of instructions to the program, the solution needs to calculate the original targets of each branch instructions in the original program. Only insertions that take place between a branch and its target will affect the location of the proper target. In other words, branch instructions will only need to be fixed if instructions were inserted in between the branch and its target.

In order to calculate a branch's target the solution will need the value contained in the PC. During program execution the PC contains the memory address of the current instruction being executed. In this lab the binary representation of the program is located in the user data segment of memory, therefore we can think of the value of the address that was used to load the binary representation of the instruction into a register as the value that the PC normally contains during program execution. In other words, the PC for an instruction is the address that is used to load the instruction from memory.

To calculate the address of the target for branches, the following algorithm is recommended:

L ← immediate of the branch instruction
A ← L << 1
S ← Sign extend A to 32 bits
T ← S + PC

T is the address of the branch target. Keep in mind that this address will correspond to an address in the data segment, where the program representation has been placed.

Finding Insertion Points

An "insertion point" is the address of the instruction (in the original program) before a block of added instructions. In order to find the insertion points the two programs must be compared. We first load one instruction from each program, then do a simple check to see if the two instructions are equal. If they are not equal then this is the first instruction in a possible block of added instructions. We also keep a counter and increment it for each new instruction in the block of new instructions. As soon as an instruction is found in the modified program that matches the last previously seen instruction in the original program then we have reached the end of the block of added instructions.

Terminologies

This section formally introduces the terminologies used in this lab.

The modified program is obtained by inserting instructions into the original program. Branch instructions in the modified program must be fixed so that they jump to the same label as they did in the original program. In this lab, the fixed program refers to the modified program with fixed branch instructions.

The instructions array of a program is an array of RISC-V instructions in binary representation, where the instructions in the array are the instructions in the program.

The insertion points array is an array of all the insertion points. The array must be sorted in ascending order by the value of the insertion points.

The insertions array is an array of natural numbers where the i'th element in the array is the number of instructions inserted at the i'th insertion point in the insertion points array. Note that each element is stored in 32 bits.

The branches array contains the addresses of all the branch instructions in the original program. Elements in the array must be sorted in ascending order by the addresses of the branch instructions.

The targets array contains the addresses of the target instructions for all the branch instructions in the branches array. The i'th element in the array is the address of the branch target for the i'th branch instruction in the branches array.

All arrays are terminated by the sentinel value 0xFFFFFFFF.

Assignment

In this assignment you will write a RISC-V assembly program that fixes branch instructions in a modified RISC-V program. You must implement all the following functions.

fixBranch

fixBranch
  This function is the entry point of student's solution to this lab. It fixes
  branch instructions in the modified program such that they branch to the same
  target as they did in the original program. Writes the instructions of the
  fixed program to a separate array.

  Arguments:
    a0: Pointer to the instructions array of the original program.
    a1: Pointer to the instructions array of the modified program.
    a2: Pointer to an empty array that is the same size as the instructions array
        of the modified program. This function must write the instructions of the
        fixed program to this array and terminate it with the sentinel value
        0xFFFFFFFF.

  Returns:
    None

findInsertions

findInsertions
  Finds all the insertion points and how many instructions were inserted after
  each insertion point

  Arguments:
    a0: Pointer to the instructions array of the original program.
    a1: Pointer to the instructions array of the modified program.

  Returns:
    a0: Pointer to the insertion points array.
    a1: Pointer to the insertions array.

findBranches

findBranches
  Finds all branches and their respective targets in the original program

  Arguments:
    a0: Pointer to the instructions array of the original program.

  Returns:
    a0: Pointer to the branches array.
    a1: Pointer to the targets array.

Input Guarantees

The binary representation of the RISC-V programs are terminated by the sentinel value 0xFFFFFFFF.
There will be a maximum of 50 instructions inserted to the modified program
There will be a maximum of 25 branches to fix.
All instructions are valid RISC-V instructions.
Neither branches nor jumps will ever be inserted to the original program to create the modified program.
Some pseudo-instructions in RARS correspond to more than one instruction. For example, the la pseudo instruction expands into a two separate instructions. The input will contain no such instructions.
The modified program will not contain inserted instructions before the first instruction of the original program.

Resources

Class slides used for this lab (.pptx) (.pdf).
Marksheet used to mark the lab (.txt)

We also provide two Python programs that will help students in generating their own test cases and checking whether their solution is working as expected.

binDecompiler.py

binDecompiler.py is a decompiler for RISC-V assembly.

Required positional arguments:

Input file format
- b: binary file
- h: hexadecimal instruction representation file input (each instruction as 0x________ on a newline)
Input file path

Optional arguments:

-o OUTPUT. Basename of output file. If provided, binCompiler.py writes its output to two different files: OUTPUT.tsv and OUTPUT.txt.
- OUTPUT.tsv. Each row in the file contains the following information regarding a RISC-V instruction in the input file: Address, Code, and Source. Address is the address of the instruction when loaded into memory. Code is the hexadecimal representation of the RISC-V assembly source code (Source).
- OUTPUT.txt. Each line in the file is a hexadecimal representation of a RISC-V instruction in the input file.
If you run binDecompiler.py without the -o flag, binDecompiler.py will print the content of OUTPUT.tsv and OUTPUT.txt to the terminal, separating them with an empty line.

Usage:

python3 binDecompiler.py {b,h} inputFile [-o OUTPUT]

Example:

python3 binDecompiler.py b original.bin -o original_decomp

# Decompiles the binary file original.bin.
# Outputs the result to two files, original_decomp.tsv and original_decomp.txt

hexToBin.py

hexToBin.py converts hexadecimal numbers in textual format to binary format and writes the result to a file. Expects one hexadecimal number per line.

Optional arguments:

-i INPUT. Path to the input file. If you run hexToBin.py without the -i flag, it will take input from standard input, one hexadecimal number per line.
-o OUTPUT. Path to the output file. If you run hexToBin.py without the -o flag, it will write the output to out.bin

Usage:

python3 hexToBin.py [-i INPUT] [-o OUTPUT]

Example:

python3 hexToBin.py -i lab_out.txt -o lab_out.bin

# Converts the hexadecimal numbers in lab_out.txt to binary format.
# Outputs the result to the file lab_out.bin

The description of the two Python programs above is also in the provided README.

Testing your Lab

Note: To execute RARS on Windows, or if you do not have rars aliased to execute RARS as a Java executable, you need to replace rars with java -jar </path/to/rars>, where </path/to/rars> is the path to RARS on your file system.

To test your lab, you must first generate your input files: the binary files for the original and modified programs. The process for generating test cases is complicated because you must insert new instructions into the binary without modifying the pre-existing branch instructions. You can use any RISC-V .s files to generate the original program as long as it satisfies the constraints in the Input Guarantees section (two sample files are provided in the Tests directory). To generate the binary file, original.bin, for the original program, original.s, you can run the following command:

rars a dump .text Binary original.bin original.s

To create the modified program and generate its binary file, you can follow the steps below:

Decompile the binary file for the original program using the provided binDecompiler.py Python script.
```
python3 binDecompiler.py b original.bin -o original_decomp
          
```
This creates two files: original_decomp.tsv and original_decomp.txt.
To create the modified program, one option is to use a spreadsheet editor, such as Google Spreadsheets or Microsoft Excel, to edit the .tsv file directly. Some IDE's such as VS Code and JetBrains also supports opening .tsv files in a tabular format. If none of the aforementioned tools are available to you, skip to the next step.

Using spreadsheet editors, you can insert rows above or below an instruction to insert instructions before or after it, which can be done by placing hexadecimal representation of valid RISC-V instructions in the empty cells of the 'Code' column. Make sure your modified program satisfies the constraints outline in the Input Guarantees section.
Another option to create the modified program requires opening both the .tsv and the .txt files, preferably in two separate windows that are placed side by side. Now, you can use the information in the .tsv file to assist you in inserting instructions into the .txt file. The steps are similar to using a spreadsheet editor, except that you will be adding new lines.
Generating the binary file for the modified program.
- If you have followed step 2 to create the modified program, simply copy all the instructions in the 'Code' column of the .tsv file and provide it as standard input to hexToBin.py.
```
python3 hexToBin.py -o modified.bin
              
```
- If you have instead followed step 3, you can provide the file name of the modified program as a commandline argument.
```
python3 hexToBin.py -i modified.txt -o modified.bin
              
```
To test your lab, you should run fixBranch.s with RARS and must provide two binary files as program arguments. The first binary file is the original program and the second one is the modified program.
```
rars fixBranch.s pa original.bin modified.bin nc > lab.out
          
```
This command runs your lab solution on original.bin and modified.bin. The provided common.s prints the hexadecimal representation of all the instructions in the fixed program. By disabling the RARS copyright notice with the nc flag and piping the output to lab.out, you can decompile lab.out using binDecompiler.py without modifying the content of lab.out.
```
python3 binDecompiler.py h lab.out -o lab_out_decomp
          
```
Similar to step 1, two files, lab_out_decomp.tsv and lab_out_decomp.txt will be created.

Here is a video demonstrating the general flow of testing your lab. This video is created by screen recording slides 28-33 in the presentation for this lab.

Check My Lab

Link to CheckMyLab

This lab is supported in CheckMyLab. To get started, navigate to the FixBranch lab in CheckMyLab found in the dashboard. From there, students can upload test cases in the My test cases table. Test cases follow the format described below. Additionally, students can upload their fixBranch.s file in the My solutions table, which will then be tested against all other valid test cases.

Uploading a Test Case

To create a test case for CheckMyLab, first create the binary files for your original and modified assembly, then place the files in the same directory:

[NAME].bin
[NAME]Modified.bin

where [NAME] is the name of the test case. Then, zip both files into a single archive:

zip [NAME].zip [NAME].bin [NAME]Modified.bin

The resulting [NAME].zip file can be uploaded to CheckMyLab.

Marking Guide

Assignments too short to be adequately judged for code quality will be given a zero for that portion of the evaluation.

24% for correct insertion points and number of instructions added at each insertion point.
24% for correct addresses of branch instructions and respective branch targets.
32% for correctly fixing the offsets of the branch instructions.
20% for code cleanliness, readability, and comments.

Submission

There is a single file to be submitted for this lab. The file name is fixBranch.s and it should contain only the code for your solution.

Do not add a main label to fixBranch.s.
Do not modify the line .include "common.s".
Keep the file fixBranch.s in the Code folder of the git repository.
Push your repository to GitHub before the deadline. Just committing will not upload your code. Check online to ensure your solution is submitted.