arinak1017

Posted on Oct 6

6502 Assembly: Calculating Code Performance

#6502 #cpu #assembly

Before we start...

If you have just stumbled upon my SPO600 series of blog posts, it has been created to document and share my learnings as I progress through my Software Portability and Optimization college course.

In this post, I will focus on explaining the logic and calculating the execution time and memory usage of a simple 6502 Assembly program. I will also review the code modifications I made to improve the program's performance.

Check out my 1-st post about 6502 Assembly: 6502 Assembly Intro
Find the 6502 emulator I used here: 6502 Emulator
Nice and colourful 6502 reference: Ultimate 6502 Reference
$ = hexadecimal (e. g. $0200 = 0x0200)

Let's take a look at the program

The following code fills the emulator's display with the yellow colour:

    lda #$00    ; set a pointer in memory location $40 to point to $0200
    sta $40     ; ... low byte ($00) goes in address $40
    lda #$02    
    sta $41     ; ... high byte ($02) goes into address $41
    lda #$07    ; colour number
    ldy #$00    ; set index to 0
 loop:  sta ($40),y ; set pixel colour at the address (pointer)+Y
    iny     ; increment index
    bne loop    ; continue until done the page (256 pixels)
    inc $41     ; increment the page
    ldx $41     ; get the current page number
    cpx #$06    ; compare with 6
    bne loop    ; continue until done all pages

At first glance, even despite the extensive comments, the code may appear confusing to someone new to assembly. As for me, it wasn't until I did a line-by-line breakdown while referencing the 6502 documentation that I was able to fully wrap my head around how the program works.

In-depth explanation of the code's logic

Setting the values

The first section of the program is dedicated to setting the values that are going to be used later in the code (think variable initializations, except we are manually loading and putting each value into memory, accumulator or Y register).

lda #$00 ; load the low byte of the address ($00) into the accumulator
sta $40  ; store the low byte in location $40 of the Zero Page
lda #$02 ; load the high byte ($02) into the accumulator    
sta $41  ; store the high byte in the consequent memory location ($41)
lda #$07 ; load the colour number into the accumulator
ldy #$00 ; set index to 0 (loaded into Y register)

Notes:

We are storing the address ($0200), which points to the start of the memory region reserved for the bitmapped display ($0200 - $05FF, = 4 pages of memory). See why it needs two consecutive memory locations to be stored in my previous blog.
You can find information about the colour values by clicking the "Notes" button in the emulator (warning! you will need to do a bit of scrolling).
The value in the Y register will be used to iterate through the pixels within each of the display's memory pages.

Colouring the pages - main functionality

We colour the pages within the loop by traversing through each byte of the 4 pages of memory reserved for the bitmapped display and storing the colour value in each memory location.

 loop:  sta ($40),y  ; store the colour number in the calculated address
        iny       ; increment y
        bne loop  ; continue to colour the page until y overflows

The code below is only executed when the program traverses all of the pixels in the current page and the Y register overflows. It handles switching to the next page once the current one is filled. The program stops when it attempts to switch to page 6, as we are only colouring bitmapped display pages (2, 3, 4, and 5).

        inc $41      ; increment the page number
        ldx $41      ; load current page number into X register
        cpx #$06     ; compare the page number with 6
        bne loop     ; continue to colour until all pages are done

Notes:

The Y register can hold values from 0 to 255 and wraps around (goes to 0) after reaching its capacity, making it perfect for indexing when traversing pages (as memory page size is 256 bytes).
The BNE instruction allows the program to jump to a specified label (in our case loop) if the last operation did not result in 0.

Calculating the run-time

We can calculate the program's execution time by counting the total number of CPU cycles it takes to complete. To do this, we can refer to the 6502 documentation, which specifies the number of cycles for each instruction and addressing mode. Nonetheless, since some instructions are executed multiple times, it's important to analyze the program’s logic to determine how many times the loop executes/how many times the BNE condition is met.

Tip: Use spreadsheets to make your calculations easier.

Instructions	Cycles	Count	Alt-Cycles	Alt-Count	Totals
lda #$00	2	1			2
sta $40	3	1			3
lda #$02	2	1			2
sta $41	3	1			3
lda #$07	2	1			2
ldy #$00	2	1			2
loop:
sta ($40),y	6	1024			6144
iny	2	1024			2048
bne loop	3	1020	2	4	3068
inc $41	5	4			20
ldx $41	3	4			12
cpx #$06	2	4			8
bne loop	3	3	2	1	11

Notes:

STA ($40),Y and INY are executed 1024 times because the bitmapped display has 1024 bytes (256 bytes * 4 pages) that need to be coloured.
Instructions like BNE (branch on not equal) have varying cycle counts depending on whether the branch is taken or not. In our case: 1-st BNE - the branch is not taken 4 times (each time the Y register overflows after filling a page); 2-nd BNE: The branch is not taken once when all pages are filled, and the program reaches its end.

Knowing the total number of cycles for each instruction, we can calculate the execution time by summing up the totals and multiplying the result by the CPU's clock speed.

Result:

Runtime = 11,325 cycles x 1 MHz = 11,325 uS (microseconds)

Runtime = 11,325 uS / 1,000,000 = 0.011325 S (seconds)

Calculating the memory usage

To calculate the memory usage of the program, we need to sum up the size of all the instructions along with the size of each pointer and variable used.

You can find the number of bytes each instruction uses depending on its addressing mode in the 6502 documentation.

In the given program, we are only storing one 16-bit (= 2-byte) address in memory, thus, the total memory usage of the program will be equal to: total instruction size + 2 bytes.

The number of bytes each instruction occupies according to the docs:

Instruction	Bytes
lda #$00	2
sta $40	2
lda #$02	2
sta $41	2
lda #$07	2
ldy #$00	2
loop:
sta ($40),y	2
iny	1
bne loop	2
inc $41	2
ldx $41	2
cpx #$06	2
bne loop	2

Result:

Memory usage = 25 bytes + 2 bytes = 27 bytes

Optimizing Performance

I considered a couple of ways to optimize the code.

My initial option was loading the number of pages to be coloured into the X register: instead of repeatedly loading the value from memory location $41 and comparing it to 6 to track the coloured pages, the value in the X register is decremented each time a page is coloured.

    lda #$00       
    sta $40        
    lda #$02      
    sta $41        

    lda #$07       
    ldx #$04 ; Load the number of pages to be coloured into the X register
    ldy #$00       

loop: 
    sta ($40),y    
    iny            
    bne loop      
    inc $41        
    dex       ; decrementing the value (instead of ldx $41 and cpx #$06)
    bne loop

But! this change only brings a slight improvement in code performance, as the operation that consumes the most cycles is the process of writing the colour data to the display memory.

To achieve a more drastic runtime improvement, I started looking into alternative optimization methods and learned about loop unrolling (an optimization technique that decreases loop overhead by executing multiple iterations of the loop body in a single pass).

However, upon re-examining the code logic, I realized that colouring a certain number of pixels at once might not be the optimal strategy (as it creates a lot of redundant code). So, instead, I went for a more logical solution: colouring the pages concurrently. This approach improves the performance while keeping the code clean and eliminating the need to switch from one page to another (since after 256 iterations each page will be fully coloured).

    lda #$07      ; load the colour number into the accumulator
    ldy #$00      ; set index to 0 (loaded into Y register)
loop:
    sta $0200, y  ; colour a pixel on page 2
    sta $0300, y  ; colour a pixel on page 3
    sta $0400, y  ; colour a pixel on page 4
    sta $0500, y  ; colour a pixel on page 5
    iny           ; go to the next pixel
    bne loop      ; continue until Y overflows

Let's calculate the execution time of the optimized code:

Instruction	Cycles	Cycles Count	Alt-Cycles	Alt-Count	Totals
lda #$07	2	1			2
ldy #$00	2	1			2
loop:
sta $0200, y	5	256			1280
sta $0300, y	5	256			1280
sta $0400, y	5	256			1280
sta $0500, y	5	256			1280
iny	2	256			512
bne loop	3	255	2	1	767

Result:

Runtime = 6,403 cycles x 1 MHz = 6,403 uS

Runtime = 6,403 uS / 1,000,000 = 0.006403 S

6,403 uS vs. 11,325 uS - the optimized version is almost twice as fast!

Afterthoughts

Analyzing the 6502 Assembly code was challenging. As I am new to 6502 Assembly, I had a hard time understanding the nitty-gritty of the code's logic. Nevertheless, my struggles were worthwhile. This experience deepened my understanding of how the 6502 CPU operates, made me more comfortable in writing code in Assembly, and introduced me to the concept of loop unrolling.

Stay tuned! In the next blog post in the Software Optimization Series, I will share how I experimented with the logic of the given Assembly code.

DEV Community

6502 Assembly: Calculating Code Performance

Before we start...

Let's take a look at the program

In-depth explanation of the code's logic

Setting the values

Colouring the pages - main functionality

Calculating the run-time

Calculating the memory usage

Optimizing Performance

Top comments (0)

Read next

Increasing Code Coverage in faker-cxx

Retro Games React App

🎮 Dive into Game Development with Unity: A Beginner's Guide 🚀

Open-Source != Transparency