So I've run up against the point in work on elliptic curve diffie hellman in which I need to implement point addition and doubling in a manner that will not take 6 hours to complete. Here's the research I've done so far and what, from what I understand, is needed:

Code:
``` Point addition seems to be an operation relative to the slope of the line that links both points, defined: P + Q = R if P or Q are point at infinity (0,0?), R = the other point if P == Q, double instead if Xp == Xq, set R to point at infinity. else: slope = (Yp - Yq) / (Xp - Xq) Xr = slope^2 + Xp - Xq Yr = slope(Xp - Xr) - Yp ```

Point Doubling

Code:
``` P + P = R compute slope as: slope = (3(Xp)^2 + a) / 2(Yp) use slope in the Xr, Yr computations for addition ```

I foresee a few issues trying to implement this:
1. Value overflow. From what I can tell, this doesn't involve modulo some type width, you need to work with exact values. So would this need a few bytes to buffer for overflow?
2. Computational cost for division. The curve I am using is a 224-bit curve, meaning this will be running that many divisions * 2. Isn't that expensive? I read something in the docs about deferring division, ie: maintaining a numerator and a denominator until the end and then computing division then? How does that work exactly? Is there a better way?
3. Fastest algorithm for multiplication? As of now I'm using the double-then-add method for multiplication, but is there a faster way? If so, how does it work.

My existing API uses:

Code:
``` #define ECC_PRV_KEY_SIZE   28 ... #define OVERFLOW_BYTES      4 // a bigint type constrainted to the private key size + 4 bytes for overflow buffering typedef uint8_t BIGINT[ECC_PRV_KEY_SIZE + OVERFLOW_BYTES]; struct Point {    BIGINT x;    BIGINT y; }; struct Curve {    BIGINT polynomial;    BIGINT coeff_a;    BIIGINT coeff_b;    Point G;    BIGINT b_order;    uint8_t cofactor; }; ```

I save all static data as big-endian encoded bytearrays, but when loading the data in for arithmetic, the array is reversed into little endian for more efficient mathing.

What I am seeking here is (1) recommendations for the best way to do the 4 issues I indicated above, as well as to fish for if anyone has, or can write up, fast bigint arithmetic routines supporting up to 32-bytes. Or at least can point to the best resources for implementing it, though I'd be doing it in C which would not have the ideal speed. It would seem we need addition, subtraction, multiplication, division, and squaring.
Timing resistance preferred.

For reference, the full API as it presently stands is:
https://github.com/acagliano/cryptx/blob/dev/src/ecdh.c

And I've taken some stabs at writing the bigint routines I think I can do myself:
Here they are (ps the shifting was written by Zeda. I've done add/sub/mul)
https://github.com/acagliano/cryptx/blob/dev/src/asm/ecdh_ops.asm
Update
With a lot of help from the likes of calc84maniac and jacobly, I have managed to work out bigint addition and multiplication. I'll post the code here if anyone else wants to recommend fixes or optimizations.
I am still missing multiplicative inversion. To that end, I will post the algorithm based on my reference for anyone who may have a fast way to do it.

Code:
``` ; bigint_add(uint8_t *op1, uint8_t *op2); ; hard limit to 32 bytes ; output in op1 ; addition over a galois field of form GF(2^m) is mod 2 or just xor _bigint_add:    ti._frameset0    ld hl, (ix + 3)      ; op2    ld de, (ix + 6)      ; op1    ld b, 32 .loop:    ld a,(de)    ;adc a,(hl)    xor (hl)    ld (de),a    inc hl    inc de    djnz .loop    ld sp, ix    pop ix    ret ; apparently they are the same _bigint_sub := _bigint_add ```

Multiplication

Code:
``` ; bigint_mul(uint8_t *op1, uint8_t *op2) ; multiplication is add then double, then a polynomial reduction _bigint_mul:    ld hl, 32    ti._frameset    lea de, ix - 32      ; stack mem?    ld hl, (ix + 9)      ; op1 (save a copy)    ld bc, 32    ldir            ; ix - 32 = tmp = op1        ; zero out op1    ld de, (ix + 9)      ; op 1    ld a, 0    ld (de), a    inc de    ld hl, (ix + 9)    ld bc, 31    ldir            ; op1 = res = 0        ld hl, (ix + 6)      ; op2 = for bit in bits    ld c, 32 .loop_op2    ld a, (hl)    ld b, 8 .loop_bits_in_byte:    rra    push af       sbc a,a       push bc, hl          ld c,a                    ; add op1 (res) + tmp          ld hl, (ix +9)      ; hl = op1 (dest)          lea de, ix - 32      ; de = tmp (src)          ld b, 32 .loop_add:          ld a, (de)          and a, c          xor a, (hl)          ld (hl), a          inc hl          inc de          djnz .loop_add                 ; now double tmp          lea hl, ix - 32      ; tmp in hl          ld b, 32          or a            ; reset carry .loop_mul2:          inc hl          rl (hl)          djnz .loop_mul2                    ; now xor with polynomial if tmp degree too high          ; this means timing analysis will leak polynomial info          ; however, this is a public spec and therefore not          ; implementation-breaking          bit 1, (ix - 4)      ; polynomial is 233 bits, check 234th bit          jr z, .no_xor_poly          ; xor byte 1 (little-endian encoding)          ld a, (ix - 32 + 1)          xor 2          ld (ix - 32 + 1), a                    ; xor byte 21 (little endian encoding)          ld a, (ix - 32 + 21)          xor 4          ld (ix - 32 + 21), a                    ; xor byte 28 (little endian encoding)          ld a, (ix - 32 + 28)          xor 1          ld (ix - 32 + 28), a           .no_xor_poly       pop hl,bc    pop af    djnz .loop_bits_in_byte    inc hl    dec c    jr nz, .loop_op2    ld sp, ix    pop ix    ret  ```

And lastly:
Algorithm for Mult Inv

Code:
``` _bigint_invert:    ; tmp = op    ; v = polynomial    ; g = 0    ; res = 1    ; while(tmp != 1)    ;    i = degree(tmp) - degree(v)    ;   if( i < 0 )    ;      swap tmp1, v    ;      swap g, res    ;      i = -i    ;   h = left_shift v by i bits    ;   add tmp, h    ;   h = left_shift g by i bits    ;   add res, h         ; return res ```
So I have completed the bigint routine for multiplicative inverse over a finite field of form F2m.

The code is here: https://github.com/acagliano/cryptx/blob/stable/encrypt/encrypt.asm#L6231
There are some issues with the algorithm. For reasons currently unknown to me the algorithm seems to work fine for powers of 2, but loop infinitely for anything that is not an even power of two. Several days of debugging have brought it to its current state from infinitely looping for everything. If anyone is good with this kind of math and wants to take a stab at determining how dumb I am, feel free. Please and thanks.

Register to Join the Conversation
Have your own thoughts to add to this or any other topic? Want to ask a question, offer a suggestion, share your own programs and projects, upload a file to the file archives, get help with calculator and computer programming, or simply chat with like-minded coders and tech and calculator enthusiasts via the site-wide AJAX SAX widget? Registration for a free Cemetech account only takes a minute.

»
» All times are UTC - 5 Hours

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum