This is the first post in a pair about TI-BASIC number parsing, aimed at TI-BASIC programmers who may or may not have encountered advanced optimization techniques before.
TI-BASIC programs have a lot of numbers in them. My quick survey of over 1.3 million lines of TI-BASIC code from ticalc found that programs have an average of one numeric literal or constant per line, and many of these numbers are written in ways that will parse quite slowly.
You may have heard that using a color constant like BLUE is substantially faster (nearly 35% faster!) than writing out the digits because parsing the number is slower than looking up the value from a predetermined location in memory. This speed improvement is generally well-known (at least, well-known among people who care about TI-BASIC program speed), but color or mathematical constants are not available for every number in every program. I'm working on a program to automatically optimize TI-BASIC programs. For this, I want an algorithm that, when given any number, determines the fastest way to write that number.
First, we'll design such an algorithm, then fill in the details by building an astonishingly complete understanding of how numbers get parsed without examining a single line of TI's code. In the next post, we'll analyze these results to determine the fastest way to write numbers, and I'll provide some insight into how I implemented this algorithm in my optimizer.
Results (for humans)
The optimal strategy I found is too complicated to be useful for most humans trying to write fast programs. I can give you a few memorable heuristics, though:
- Use a constant if you can. Colors are a free 35% speed improvement, and you can use them anywhere: loop bounds, key codes (ex: DARKGRAY is the left arrow key!), coordinates used in graphics routines, you name it, and it's faster everywhere, even for writing something like ~BLUE instead of ~10.
- Use the |E token. 1000000 parses at about half the speed of 1|E6, and writing |E6 without the 1 is around 11% faster than that. Do not use the 10^^( token for constants. (A short sketch after this list shows the rewrite.)
- Minimize the number of digits you write. Each digit incurs a significant cost.
- A minor point, so I'm not bolding it: digits after a decimal point are parsed faster than digits before a decimal point, but there is a high cost to having a decimal point in your number at all.
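To make the |E tip concrete, here's a minimal Python sketch of the rewrite it suggests. This is purely illustrative (the helper name and the rule of dropping a leading 1 are mine, following the timing claims above), not code from my optimizer:
Code:
def to_exponent_form(literal):
    """Rewrite a round integer literal like '1000000' into |E form.

    Illustrative only: the |E spelling follows this post's notation, and the
    function ignores decimals, signs, and anything that isn't a plain integer.
    """
    stripped = literal.rstrip("0")
    if not stripped.isdigit():
        return literal  # not a plain nonzero integer; leave it alone
    exponent = len(literal) - len(stripped)
    if exponent < 2:
        return literal  # too few trailing zeros to bother
    mantissa = "" if stripped == "1" else stripped  # bare |E6 beats 1|E6
    return mantissa + "|E" + str(exponent)

print(to_exponent_form("1000000"))  # |E6
print(to_exponent_form("2500"))     # 25|E2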
These are not new tips, but they have not seen widespread adoption. I encourage you to integrate these into your programs where applicable but know that you'll probably see better results by reducing variable lookups and branching.
(You may notice that all of the points I noted sharply decrease program legibility. This is the primary motivator for having an optimizer: if you don't need to touch the optimized code, surely you'll take the speed improvement!)
Engineering Considerations
I have the following requirements for my "algorithm":
- It must be straightforward to add or improve number-writing strategies.
- It must be aware of the target version (for example, we do not want to emit color tokens when they cannot be used).
- It must be easy to select whether we optimize for speed, size, or some secret third option (this is well-defined but not essential here)
The first point is the hardest to satisfy: it'd be fastest to have a set of inequalities that dictate when each solution is best, but adding a new strategy would then be a nightmare.
Knowing that the speed of the optimizer itself is not too critical (we're aided immensely by the fact that there is a hard limit on the size of a TI-BASIC program), I've chosen a more explicit approach. For every applicable strategy, I determine the speed and size cost of the provided number and compare those using whatever metric I want.
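As a rough sketch of what that looks like (all names here are illustrative, not my optimizer's actual API): each strategy either offers a candidate writing with its modeled costs or declines, and we pick the best candidate under whichever metric is selected.
Code:
from typing import Callable, NamedTuple, Optional

class Candidate(NamedTuple):
    tokens: str    # how the number would be written
    speed: float   # modeled clock cycles to parse
    size: int      # bytes of program space

# A strategy inspects the value and either proposes a Candidate or returns None
# (e.g., a color-constant strategy only applies when a matching constant exists).
Strategy = Callable[[float], Optional[Candidate]]

def best_writing(value: float, strategies: list[Strategy],
                 metric: Callable[[Candidate], float]) -> Candidate:
    candidates = [c for s in strategies if (c := s(value)) is not None]
    return min(candidates, key=metric)

# Optimize for speed, for size, or for any blend of the two:
#   best_writing(1e6, strategies, metric=lambda c: c.speed)
#   best_writing(1e6, strategies, metric=lambda c: c.size)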
To do this, we need to roughly characterize the size and speed for every strategy; we trade our optimization task for a modeling task.
For this simple routine, with ideal measurements, this is neither too difficult nor too time-consuming (writing this post took substantially longer than characterizing all four different strategies), but it does require a careful eye and a healthy amount of patience.
Science!
Methodology
Taking measurements requires some thought. I am making a few assumptions here to streamline this process, so I owe you some notes on my methodology:
- I care about throughput, and I am using "which strategy takes the fewest clock cycles in CEmu" as a proxy for "which strategy has the highest throughput." Because of this, the exact definition of what counts as a "clock cycle" is unimportant, but know that it's a unit of time: faster code uses fewer clock cycles. This seems innocuous enough, but there are some finer points here. If you recall from my engineering considerations, I only care about the relative speeds of each strategy. Additional factors that uniformly affect every strategy are unimportant. I'm measuring the number of clock cycles between when the parser is ready for its first token of the number and when it is prepared for its first token after the number. This means a breakpoint at ParseInp+$4a for my OS version.
- For consistent results, I turn off interrupts and DMA cycle counting because I expect these costs to be roughly proportional to the amount of time the routine runs.
- I assume that if input "A" is faster than input "B" on my calculator and OS version, it is likely to be faster or at least not substantially slower on other calculators and OS versions. This is not so unreasonable given this code has likely not changed appreciably for a long time, but I put this last because it is perhaps the most important caveat here and one I'd do more work to resolve if I had the time. Confirming or refuting my results on a z80 calculator would be much appreciated, particularly the work of measuring the parameters I list at the end.
I said it above, but I'll say it again: I only care about the relative speeds of each strategy. Additional factors that uniformly affect every strategy are unimportant, but slight variations in timing could be important. We have complete consistency here, so we should be able to account for every clock cycle.
First Strategy: Write all the digits directly in your program.
We'll start exploring the complex problem of modeling the full behavior by examining a reduced case where we only parse one significant figure (i.e., something like 10 or .002, but not 11 or .1234).
First, I'll share a pair of less-than-surprising results, which I tested for completeness' sake:
- Digits 1-9 are processed in the same amount of time.
- Leading zeros are never optimal (01 parses slower than 1).
I timed a bunch of numbers; here are my results (every point on the plot represents a single measurement):
There are some surprises here, but let's examine the expected stuff first:
- We expect that numbers like 1000 would be more expensive to parse than numbers like 10, and that is clearly mirrored in the chart.
- We see an additional cost when the number has a decimal point (.1 is more expensive than 1).
I certainly did not expect to see the following things, though I have satisfying speculative justifications anyway. (Note that some of these are impossible to notice by looking at the plot; you'll have to check my raw data in my next post!)
- We can see pretty clearly that, for example, the change in parse time between .00002 and .0002 is much smaller than the change in parse time between 2000 and 20000. I think this is because the parser needs to do less work for zeros before the first significant figure. For 20000, each zero contributes to the exponent and is written to the mantissa, but for .00002, each zero only contributes to the exponent.
- Specifically with positive exponents, every other zero is slightly more expensive (every even-indexed digit costs the same as every other even-indexed digit, and every odd-indexed digit costs the same as every other odd-indexed digit, but these two costs are not the same). I'm reasonably sure this is because of TI's packed BCD float representation. With two digits per byte, every other digit requires either shifting the digit into the high nibble or moving some pointer to the next byte. It doesn't matter for the model which is which, but it seems like shifting the digit is the more expensive of the two (see the sketch after this list). This doesn't occur in the negative-exponent case, providing some weak verification of my above explanation for why leading zeros are less expensive: they are processed separately.
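If you'd like to picture that packed-BCD explanation, here's a tiny sketch of my guess at the mechanism (this is speculation, not TI's actual code): digits go two to a byte, so even-indexed digits need a nibble shift while odd-indexed digits just fill the low nibble of the byte that's already there.
Code:
def pack_bcd(digits):
    """Pack decimal digits two per byte, high nibble first (speculative sketch)."""
    mantissa = bytearray()
    for index, digit in enumerate(digits):
        if index % 2 == 0:
            mantissa.append(digit << 4)  # extra work: shift into the high nibble
        else:
            mantissa[-1] |= digit        # cheaper: drop into the low nibble
    return mantissa

print(pack_bcd([1, 0, 2, 4]).hex())  # '1024'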
We aren't quite ready to build a general-purpose formula because we're still looking at a reduced example. Let's explore cases where we have more than one significant figure.
This graph is fantastically interesting. There are three "regimes" here: the first regime has numbers with a decimal point before their significant figures (like .0001111), the second regime has numbers with a decimal point in their significant figures (11.11), and the third regime has numbers with a decimal point after their significant figures (i.e. no decimal point at all, like 111100).
We have neat observations to make in every regime:
- In the first regime, the number of digits in a decimal significantly impacts parsing time. Furthermore, looking at the raw data, we observe that every other significant figure is more expensive (more evidence of a shifting cost).
- In the second regime, we see that digits after a decimal point cost less than digits before the decimal point (Look at the "flat" sections of the graph. For the 4-significant-figure case, this flat part corresponds to measurements of inputs like 1.111, 11.11, and 111.1. These flat parts are slightly slanted upward, indicating that each time the decimal point is shifted to the left, the parsing is barely slower). I note that this is distinct from our earlier observation about which zeros contribute to the exponent and which are written to the mantissa.
- In the third regime, we see that 0 costs barely more than other digits. It's incredibly subtle on the graph, but it's not an artifact of the graphing: each additional non-zero significant figure does actually push the time lower. Looking at the data directly, I observe a 62-cycle reduction for every zero digit replaced with a non-zero digit.
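To make these regimes concrete, here's a small illustrative helper (my own bookkeeping, not anything the parser does) that splits a written literal into its significant digits, the exponent of its first significant figure, and its regime:
Code:
def decompose(literal):
    """Split a written literal into (significant digits, exponent, regime).

    Assumes the literal contains at least one nonzero digit.
    """
    if "." in literal:
        int_part, frac_part = literal.split(".")
    else:
        int_part, frac_part = literal, ""
    all_digits = int_part + frac_part
    first_sig = next(i for i, d in enumerate(all_digits) if d != "0")
    digits = [int(d) for d in all_digits[first_sig:]]
    exponent = len(int_part) - 1 - first_sig  # 10's exponent of the first sig fig
    if exponent < 0:
        regime = 1   # decimal point before the significant figures
    elif frac_part:
        regime = 2   # decimal point inside the significant figures
    else:
        regime = 3   # no decimal point at all
    return digits, exponent, regime

print(decompose(".0001111"))  # ([1, 1, 1, 1], -4, 1)
print(decompose("11.11"))     # ([1, 1, 1, 1], 1, 2)
print(decompose("111100"))    # ([1, 1, 1, 1, 0, 0], 5, 3)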
Let's attempt to express these observations more mathematically. The general approach here is to treat every clock cycle as "unknown" and we progressively subtract the things we know until there are no "unknown" clock cycles left. Then, we can add the parts we know to compute the clock cycles for an arbitrary input. This lets us progressively guess the structure of this formula. We can first check our work against all the data in our initial survey, then quickly evaluate how close we are to a full solution by checking how well our model's predictions match reality for new measurements. Furthermore, we will be able to use these "knowns" to express the constants in terms of measurements we made. We'll only consider natural numbers for now, then use our observations about the fractional part to fill in the rest of the picture.
Define TTP(x) as the "time to parse x" in clock cycles, per my measurement procedure in the methodology section.
- Let's start small. There is a tiny additional cost for zeros after the first significant figure. We can compute the zero_sigfig_cost easily with TTP(10) - TTP(11). The only difference between 10 and 11 is that 10 has a 0 in the ones place, and 11 doesn't. Thus, any difference in parsing time between them must be because of the change in digit! Whenever I subtract two measurements, ask yourself, "What is the difference between the two numbers being measured, and how does it relate to the coefficient being computed?"
- From my second unexpected observation, we expect an extra cost to be incurred after every other digit. I call this the "shifting cost," an extra cost above the usual "digit cost." It's easiest to compute this digit cost first (by finding the time to parse an even-indexed digit), then compute the additional shifting cost (subtracting the digit cost from the time to parse an odd-indexed digit). More concretely, we can find digit_cost by subtracting TTP(1) from TTP(11) (there isn't a shift from the first to the second digit) and shifting_cost by computing (TTP(111) - TTP(11)) - digit_cost (there is a shift from the second to the third digit).
- Think: why did I choose to use numbers that do not have zeros? I chose numbers without zeros because we observed that zeros cost more after the first significant digit. If we were using 1000 and 100, there would be an additional difference in the number of zeros, not just the number of digits.
- There is some "base" cost to parsing a number regardless of how many digits that number has. The easiest way to determine this is to measure TTP(1), then subtract the time spent processing the digit itself. Thus, the base_cost is TTP(1) - digit_cost - shifting_cost.
I compute these values using my data:
- digit_cost = TTP(11) - TTP(1) = 1976
- shifting_cost = (TTP(111) - TTP(11)) - digit_cost = 79 (very small!)
- zero_sigfig_cost = TTP(10) - TTP(11) = 62 (very small!)
- base_cost = TTP(1) - digit_cost - shifting_cost = 5020
Let's sum these up for a couple of example inputs to confirm the validity of our assumptions (the short snippet after these two checks repeats the arithmetic):
- 1024: I measure 13144 clock cycles. By inspection, we determine this costs at least base_cost + 4*digit_cost + 1*zero_sigfig_cost clock cycles. The shifting cost is incurred twice, so our final cost is base_cost + 4*digit_cost + 1*zero_sigfig_cost + 2*shifting_cost = 13144 clock cycles, which is exactly my measurement.
- 1000000 (a million): I measure 19540 clock cycles. We determine this costs at least base_cost + 7*digit_cost + 6*zero_sigfig_cost clock cycles. The shifting cost is incurred four times, so our final cost is base_cost + 7*digit_cost + 6*zero_sigfig_cost + 4*shifting_cost = 19540 clock cycles, which is exactly my measurement.
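If you'd rather let Python do that arithmetic, the two checks above boil down to:
Code:
base_cost, digit_cost, zero_sigfig_cost, shifting_cost = 5020, 1976, 62, 79

# 1024: four digits, one zero significant figure, shifts at indices 0 and 2
assert base_cost + 4*digit_cost + 1*zero_sigfig_cost + 2*shifting_cost == 13144

# 1000000: seven digits, six zero significant figures, shifts at indices 0, 2, 4, and 6
assert base_cost + 7*digit_cost + 6*zero_sigfig_cost + 4*shifting_cost == 19540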
This looks great! Isn't it nice when things work out? Now, we consider when the number has an exponent less than zero:
- We can compute the cost of each zero after the decimal point (there's no need to worry about even/odd nonsense for zeros before the first significant figure, as established above) similarly to digit_cost above. The fractional_leading_zero_cost is TTP(.01) - TTP(.1). I observe the shifting cost to be the same as before and occur under the same situations.
- Next is the cost for digits after and including the first significant figure; we compute this fractional_digit_cost with TTP(.11) - TTP(.1).
- We also expect cost just from the decimal point. We can compute the cost of just the decimal point with TTP(.1) - fractional_digit_cost - shifting_cost - base_cost (this formula subtracts all of the components from TTP(.1) that we have accounted for, leaving just the decimal point).
- fractional_leading_zero_cost = TTP(.01) - TTP(.1) = 1422
- fractional_digit_cost = TTP(.11) - TTP(.1) = 1867
- decimal_point_cost = TTP(.1) - base_cost - fractional_digit_cost - shifting_cost = 1123
Notice how the fractional_digit_cost is about 6% less than the regular digit_cost? I wonder if we can use this to optimize numbers with many significant figures.
Results (for machines)
So, putting this all together, we can construct a routine to produce the time to parse any number. For clarity, I'll express this as an iteration over the digits, but it's not too challenging to convert this to a closed form:
Code:
# raw measurements from CEmu
ttp = {
    "1": 7075,
    ".1": 8089,
    "11": 9051,
    "10": 9113,
    ".01": 9511,
    ".11": 9956,
    "111": 11106,
}

# explaining this mess is the point of the post
digit_cost = ttp["11"] - ttp["1"]
fractional_digit_cost = ttp[".11"] - ttp[".1"]
shifting_cost = (ttp["111"] - ttp["11"]) - digit_cost
base_cost = ttp["1"] - digit_cost - shifting_cost
zero_sigfig_cost = ttp["10"] - ttp["11"]
fractional_leading_zero_cost = ttp[".01"] - ttp[".1"]
decimal_point_cost = ttp[".1"] - base_cost - fractional_digit_cost - shifting_cost

def speed_cost(digits, exponent):
    """
    Arguments:
    digits -- Every digit as written in a program, after and including the first nonzero digit. (ex. `1000000` becomes `[1, 0, 0, 0, 0, 0, 0]` and `.01111` becomes `[1, 1, 1, 1]`)
    exponent -- 10's exponent for the first nonzero digit (eg. `6` for `1000000`, `-2` for `.01111`)
    """
    clock_cycles = base_cost
    if exponent < 0:
        # .1 has no leading zeros, .01 has one leading zero, ... 10^-n has n-1 leading zeros
        clock_cycles += fractional_leading_zero_cost * (abs(exponent) - 1)
    if len(digits) > exponent + 1:
        # .1, 1.1, .01 all have decimals
        # 1, 11 do not
        clock_cycles += decimal_point_cost
    for index, digit in enumerate(digits):
        if index > exponent:
            clock_cycles += fractional_digit_cost
        else:
            clock_cycles += digit_cost
        if index % 2 == 0:
            clock_cycles += shifting_cost
        if digit == 0:
            clock_cycles += zero_sigfig_cost
    return clock_cycles

# 19540
print(speed_cost([1, 0, 0, 0, 0, 0, 0], 6))

# 15191
print(speed_cost([1, 1, 1, 1], -2))
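And since I claimed the loop converts to a closed form without much trouble, here is one possible closed form (my sketch, reusing the constants above; the function name and arguments are my own framing, and it needs the count of zero significant figures passed in separately):
Code:
def speed_cost_closed_form(sig_figs, zero_sig_figs, exponent):
    """Closed-form version of speed_cost (a sketch; same model, no loop).

    sig_figs      -- how many digits are written after and including the
                     first nonzero digit (i.e., len(digits) in speed_cost)
    zero_sig_figs -- how many of those digits are 0
    exponent      -- 10's exponent of the first nonzero digit
    """
    clock_cycles = base_cost
    if exponent < 0:
        clock_cycles += fractional_leading_zero_cost * (abs(exponent) - 1)
    if sig_figs > exponent + 1:
        clock_cycles += decimal_point_cost
    integer_digits = min(sig_figs, max(exponent + 1, 0))
    clock_cycles += integer_digits * digit_cost
    clock_cycles += (sig_figs - integer_digits) * fractional_digit_cost
    clock_cycles += ((sig_figs + 1) // 2) * shifting_cost  # even indices 0, 2, 4, ...
    clock_cycles += zero_sig_figs * zero_sigfig_cost
    return clock_cycles

# Matches the loop version above:
print(speed_cost_closed_form(7, 6, 6))   # 19540
print(speed_cost_closed_form(4, 0, -2))  # 15191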
Though I didn't explicitly try to minimize this, the fact that we can completely characterize the parse time of numbers written this way with just seven measurements is lovely (though, of course, to find the patterns, I had to make several dozen measurements). I'm curious if my formula works on the z80 calcs, too; if some adventurous soul is willing to make similar measurements for a z80 model, I'd love to see them.
In the next post, we'll briefly examine two other strategies for writing numbers (we've done all of the most demanding work in this one) and compare all three strategies empirically. Thanks for sticking with me, folks!