Quote:
I tested _uram and _ostk, and kmalloc was only using _ostk.

This is strange because _uram is the default. xD Well, no need to solve every mystery.
Quote:
Now, how do I actually use dma_memset to clear the depth buffer?

You need the buffer to be both 32-aligned and have a size multiple of 32. Also, the size is in bytes.

Code:
GALIGNED(32) static fp depthBuffer[RENDER_WIDTH * RENDER_HEIGHT];
dma_memset(depthBuffer, *(uint32_t *)value, sizeof depthBuffer);

assuming RENDER_WIDTH * RENDER_HEIGHT is a multiple of 8, which it probably is.
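
To make the two constraints concrete, here is a minimal host-side sketch that checks them before a call. The helper names dma_memset_args_ok() and assert_dma_memset_args() are made up for illustration and are not part of gint; the only facts taken from above are the 32-byte alignment and the size being a multiple of 32 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a gint function): returns 1 when the destination
   starts on a 32-byte boundary and the size is a whole number of 32-byte
   blocks, 0 otherwise. */
static int dma_memset_args_ok(const void *dst, size_t size_in_bytes)
{
    int aligned = ((uintptr_t)dst % 32) == 0;
    int whole_blocks = (size_in_bytes % 32) == 0;
    return aligned && whole_blocks;
}

/* Hypothetical debug guard to place right before the real call. */
static void assert_dma_memset_args(const void *dst, size_t size_in_bytes)
{
    assert(dma_memset_args_ok(dst, size_in_bytes));
}
```

With a 4-byte fp, a buffer of N elements passes the size check exactly when N is a multiple of 8.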

Now, there is a caveat, which is that the DMA only has access to the physical address space. Hence the note in dma_memset()'s documentation that you can't use P0 addresses, meaning any address lower than 0x80000000. The only things in P0 are the add-in code and data segment; since you are using the latter, you are in fact using space invisible to the DMA.

That could be a problem since virtualized addresses may map to non-contiguous sections of physical memory, and physical memory is what the DMA works with. This implies that you can't copy code from your add-in with the DMA, for instance. Fortunately, the data segment is mapped as a single block, so your buffer is too and we can find its physical address.

Code:
#include <gint/mmu.h>

void *mmu_translate_uram(void *ptr)
{
    return mmu_uram() + ((uint32_t)ptr - 0x08100000);
}

dma_memset(mmu_translate_uram(depthBuffer), *(uint32_t *)value, sizeof depthBuffer);

I should provide that function in gint, and probably bake that translation into the DMA driver too.

(You don't get that problem with malloc() because gint translates to physical addresses before setting up the arena.)

Code:
dma_memset(mmu_uram() + ((uint32_t)((void*)(depthBuffer)) - 0x08100000), *((uint32_t*)&v), RENDER_WIDTH*RENDER_HEIGHT);

This clears the top quarter of the screen, but if I try to clear anything else, it starts getting weird.
If I just multiply RENDER_WIDTH*RENDER_HEIGHT by four, then some weird things happen. It looks like it doesn't clear some regions, but then it does on the next frame, and some regions are always cleared. I'm talking about small regions (about 10 pixels) that are 16 byte aligned.

So I thought I could clear a quarter of the screen each time, like this:

Code:
for(int i = 0; i < 4; i++){
   dma_memset(mmu_translate_uram(depthBuffer + RENDER_WIDTH*RENDER_HEIGHT*i), *((uint32_t*)&v), RENDER_WIDTH*RENDER_HEIGHT);
}

And this:

Code:
for(int i = 0; i < 4; i++){
   dma_memset(mmu_translate_uram(depthBuffer) + RENDER_WIDTH*RENDER_HEIGHT*i, *((uint32_t*)&v), RENDER_WIDTH*RENDER_HEIGHT);
}


In the first one, it crashes. It just does what it would do if I pressed the reset button, which it also did when the buffer wasn't aligned. Is it supposed to be 32 bit or 32 byte aligned?
In the second, it does the same thing as when trying to clear it all at once.

Any ideas? I'm sure the depth buffer is working well, so it's probably related to DMA.
It is 32-byte aligned. RENDER_WIDTH * RENDER_HEIGHT is indeed just a quarter of the buffer, since everything is measured in bytes.
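
The units probably also explain the difference between your two loops. Adding to an fp pointer moves in elements, so depthBuffer + RENDER_WIDTH*RENDER_HEIGHT*i already leaves the buffer at i = 1, while adding to the void pointer returned by mmu_translate_uram() moves in bytes (a GCC extension). Here is a small sketch of computing quarter offsets in bytes, assuming fp is a 32-bit type; quarter_offset_bytes() and quarter_start() are names I made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t fp; /* assumption: fp is a 32-bit fixed-point type */
static_assert(sizeof(fp) == 4, "this sketch assumes a 4-byte fp");

/* Byte offset of quarter q (0..3) in a buffer of `count` fp elements. */
static size_t quarter_offset_bytes(size_t count, int q)
{
    size_t total_bytes = count * sizeof(fp);
    return (total_bytes / 4) * (size_t)q;
}

/* Start of quarter q. The cast to a byte pointer makes the addition
   count bytes instead of fp elements. */
static void *quarter_start(fp *buffer, size_t count, int q)
{
    return (uint8_t *)buffer + quarter_offset_bytes(count, q);
}
```

For the quarters to stay valid dma_memset() targets, each quarter's size in bytes also has to be a multiple of 32.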

Small inconsistencies on short regions sound like a cache problem. Cache lines are 32 bytes, which is 16 pixels. Recall that the DMA writes to physical memory but ignores all the virtualization logic, which includes the cache.

Could you please try allocating your buffer under a different name and then using it through P2 exclusively (cache disabled)? It should eliminate the cache problem from the worries list for testing. If that's the issue, you could then come back to flush the depth buffer from cache as needed.

Code:
/* Original buffer is in P0 (virtualized userspace) */
GALIGNED(32) static fp _depthBuffer[RENDER_WIDTH * RENDER_HEIGHT];
/* Same in physical memory, with cache enabled */
uint32_t depthBuffer_P1 = (uint32_t)mmu_translate_uram(_depthBuffer);
/* Same in physical memory, with cache disabled */
uint32_t depthBuffer_P2 = (depthBuffer_P1 & 0x1fffffff) | 0xa0000000;
fp *depthBuffer = (void *)depthBuffer_P2;

Sorry for all the back-and-forth here. All of this is tested; in fact dclear(), gint's equivalent of Bdisp_AllClr_VRAM(), is implemented with dma_memset(). And we use it through P2 specifically to avoid cache interference. (In that case we completely ignore the cache because we almost only write to VRAM, so the cache is useless.) But I tend to forget about the details over time.
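
For reference, the P1/P2 conversion used in the snippet above is plain bit masking: both areas are windows onto the same physical memory, and only the top three bits differ. Here is a host-runnable sketch of that arithmetic. The function names are made up, and the masks only apply to P1/P2 addresses, not to P0 addresses, which go through the MMU.

```c
#include <assert.h>
#include <stdint.h>

/* Physical address: keep the low 29 bits of a P1 or P2 address. */
static uint32_t to_physical(uint32_t addr) { return addr & 0x1fffffffu; }

/* Cached alias (P1 starts at 0x80000000). */
static uint32_t to_p1(uint32_t addr) { return to_physical(addr) | 0x80000000u; }

/* Uncached alias (P2 starts at 0xa0000000). */
static uint32_t to_p2(uint32_t addr) { return to_physical(addr) | 0xa0000000u; }

/* Quick check with an arbitrary example address. */
static void check_aliases(void)
{
    assert(to_p2(0x8c200000u) == 0xac200000u);
    assert(to_p1(0xac200000u) == 0x8c200000u);
    assert(to_physical(0xac200000u) == 0x0c200000u);
}
```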
It worked, but now it's slower because the cache is disabled.
Nothing is free huh. xD

The following should purge the depth buffer from the cache, meaning you can run it just before dma_memset() and use P0 or P1 addresses again (use only one of them to avoid duplication).

Code:
void cache_ocbp(void *buffer, size_t size)
{
   /* Purge one 32-byte cache line per iteration */
   uint8_t *line = buffer;
   for(size_t i = 0; i < size / 32; i++) {
      __asm__ volatile("ocbp @%0" :: "r"(line) : "memory");
      line += 32;
   }
}

GALIGNED(32) static fp depthBuffer[RENDER_WIDTH * RENDER_HEIGHT];
fp *depthBuffer_P1 = mmu_translate_uram(depthBuffer);

cache_ocbp(depthBuffer, RENDER_WIDTH * RENDER_HEIGHT * sizeof(fp));
dma_memset(depthBuffer_P1, *(uint32_t *)value, RENDER_WIDTH * RENDER_HEIGHT * sizeof(fp));

As you can see you need to do a fair bit of ocbp, one per 32 bytes. With proper unrolling I believe that would take about 0.2 ms overall, which remains fast (since you're spending 4 RTC ticks on clearing, which is ~30 ms...). Since the buffer is 350 kB but the cache is only 32 kB, we could also probably iterate on the cache contents itself and flush relevant lines.
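
To put numbers on this, here is a small sketch that counts cache lines. The 396x224 resolution is my assumption (it matches the 350 kB figure with a 4-byte fp); the 32-byte line and 32 kB cache sizes are the ones quoted above.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 32u            /* bytes per cache line */
#define CACHE_SIZE      (32u * 1024u)  /* 32 kB cache */

static_assert(CACHE_SIZE % CACHE_LINE_SIZE == 0, "cache holds whole lines");

/* Number of ocbp instructions needed to purge `size` bytes, one per line,
   rounding up if the size is not a whole number of lines. */
static size_t ocbp_count(size_t size)
{
    return (size + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE;
}

/* Number of lines the cache can hold at any one time. */
static size_t cache_line_count(void)
{
    return CACHE_SIZE / CACHE_LINE_SIZE;
}
```

Under these assumptions the buffer spans 11088 lines while the cache holds only 1024, so walking the cache instead of the buffer would touch roughly ten times fewer lines.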

I'm itching to try out the program and play around with these optimizations, so if you're inspired to push these gint sources that'd be fantastic.
That worked, and I pushed the code.

There are still some things to be done:
- Use DMA to draw the grass
- Make multiplayer work

But first I'm going to try using libprof to check how long things are taking.
Here are the measurements:
- Clear the depth buffer: 10.1 ms
- Draw the sky and grass (with DMA): 4.9 ms
- Draw the car: 10.4 ms
- Draw the track and cones: 14.1 ms
- Calling dupdate(): 11.1 ms

In total, this is about 19.8 fps, which is definitely faster than before but it's not a huge improvement.
dupdate() is taking more time than I expected it to, so maybe I could use triple buffering? But where and how can I allocate the other vram?
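
For anyone who wants to redo the arithmetic, the total is just the sum of the stages converted to a frame rate. A minimal sketch follows, with helper names made up for this example:

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the per-stage timings, in milliseconds. */
static double frame_time_ms(const double *stages_ms, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += stages_ms[i];
    return total;
}

/* Frames per second for a given frame time in milliseconds. */
static double fps_from_ms(double frame_ms)
{
    assert(frame_ms > 0.0); /* guard against dividing by zero */
    return 1000.0 / frame_ms;
}
```

Feeding it the five measurements above gives a frame time of about 50.6 ms, which is where the 19.8 fps comes from.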

But then I tried it on an fx-CG20, and it's significantly slower. It decreased from 6 to 4/5 fps!
Maybe I'll have a different version for it which uses prizm and has half the resolution, so that it's at least playable with overclocking.
I tried adding multiplayer by putting all the serial code in one function which is called by gint_world_switch, but then it gets really slow and multiplayer doesn't even work.
I guess it's because the OS isn't receiving anything while gint is active, and I think it might be slow because every time I go back to using gint, it waits for everything to be sent.

So I made two different versions: One for multiplayer and one for singleplayer.
To anyone who has a TI-84 Plus CE and wants to play this:
I added a TI-84 Plus CE version as a branch, but I don't actually have one, so I can't test it.

Here's how you can build it:
- Install the CE C/C++ toolchain (https://ce-programming.github.io/toolchain/static/getting-started.html#getting-started)
- Run "make ce" in the git repository
- The files should be in ce/bin. You might need to copy multiple files, because there's a message about splitting across 2 appvars.

If it crashes, try increasing PIXEL_SIZE in src/rasterizer.h
duartec wrote:
To anyone who has a TI-84 Plus CE and wants to play this:
I added a TI-84 Plus CE version as a branch, but I don't actually have one, so I can't test it.

Here's how you can build it:
- Install the CE C/C++ toolchain (https://ce-programming.github.io/toolchain/static/getting-started.html#getting-started)
- Run "make ce" in the git repository
- The files should be in ce/bin. You might need to copy multiple files, because there's a message about splitting across 2 appvars.

If it crashes, try increasing PIXEL_SIZE in src/rasterizer.h

Thanks, I'll try it.
I get an error when I try to build it. I went to https://github.com/duarteapcoelho/prizm_racing/tree/ce and downloaded and extracted the folder. I tried going into it in the toolchain and running make ce, and I get this error:

Code:
make  -C ce/
make[1]: Entering directory `C:/Users/jakt2/Desktop/CEdev/prizm_racing-ce/ce'
make[1]: *** No targets specified and no makefile found.  Stop.
make[1]: Leaving directory `C:/Users/jakt2/Desktop/CEdev/prizm_racing-ce/ce'
make: *** [ce/bin/DEMO.8xp] Error 2


If I try to run make in the directory, it gives this error:

Code:
make  -C sdl/
make[1]: Entering directory `C:/Users/jakt2/Desktop/CEdev/bin/prizm_racing-ce/sdl'
g++ -Wall -Wextra -DSDL -Og -g  -c -o car.o -MMD ../src/car.cpp -MF "car.d"
'g++' is not recognized as an internal or external command,
operable program or batch file.
make[1]: *** [car.o] Error 1
make[1]: Leaving directory `C:/Users/jakt2/Desktop/CEdev/bin/prizm_racing-ce/sdl'
make: *** [sdl/racing] Error 2
duartec wrote:
The files should be in ce/bin. You might need to copy multiple files, because there's a message about splitting across 2 appvars.

You should try setting COMPRESSED = YES in the makefile.
MateoConLechuga wrote:
You should try setting COMPRESSED = YES in the makefile.

That worked, now it's just one file.

Invalid_Jake wrote:
I get an error when I try to build it.

I added a commit to fix this error, but then you'll probably get another error.
To fix this, remove the SRCDIR line and copy "src" to "ce/src".
I don't know why, but using "../" in SRCDIR doesn't work on Windows.
I had a quick look to try and optimize some stuff, with admittedly limited success. Here is what I found to be a typical frame:

- Sky/Ground: 4.5 ms
- Track: 13.8 ms
- Car: 9.2 ms
- Update: 11.0 ms
- Z-buffer: 9.0 ms
- Others: 5.0 ms
- Total: 52.5 ms (≈ 20 FPS)

Track rendering looked promising, but seems to be write-bottlenecked. I split the useDepth case out of the main loop because it just turns the whole function into a series of horizontal lines that we can render with 32-bit accesses. This had little impact, despite the fact that a full 32-bit fill of the VRAM takes about 6.1 ms. There is something fishy there IMO.

Filling the Z-buffer really shouldn't take as long as rendering the car; this is somewhat absurd. Are you absolutely certain you need 32-bit precision for this? I can't help but feel that 16-bit should be enough.
Lephe wrote:
Are you absolutely certain you need 32-bit precision for this?

I thought I could only use 16-bit precision (not 8-bit because the car has more than 256 triangles), but I just tried it and I can actually use 8-bit precision.
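
One detail to watch with 8-bit entries: the dma_memset() calls earlier in the thread pass a 32-bit pattern, so the clear value presumably has to be repeated in all four bytes. Here is a minimal sketch of building such a pattern; fill_pattern_from_u8() is a made-up name, and the assumption that the pattern is written as-is, 32 bits at a time, comes only from how the calls above look.

```c
#include <assert.h>
#include <stdint.h>

/* Repeat one byte in all four bytes of a 32-bit fill pattern. */
static uint32_t fill_pattern_from_u8(uint8_t value)
{
    return (uint32_t)value * 0x01010101u;
}

/* Quick host-side check of the replication. */
static void check_fill_pattern(void)
{
    assert(fill_pattern_from_u8(0xff) == 0xffffffffu);
    assert(fill_pattern_from_u8(0x7f) == 0x7f7f7f7fu);
}
```

The buffer also shrinks to a quarter of its size in bytes, so the size passed to dma_memset() still needs to be a multiple of 32.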

Now it's running really fast but I haven't made any accurate measurements yet.

The reason why track.render is taking that long might be because it's also rendering the cones. I'll try disabling them to see what happens.
Great! That sounds like a good improvement. For the track I thought the cones would be minor, but from a quick test it turns out they occupy about 6 ms while the track is just 8 ms. Which shows that even rendering fairly small triangles for the cones and car adds up rather quickly...
After some more performance improvements, it's now running at 24 FPS.

I noticed that, if it isn't visible, a small triangle still takes about 20 microseconds to "draw". I improved this for the cones by clipping whole models instead of individual triangles, and by doing back-face culling as soon as possible. Those 20 microseconds are spent multiplying the points by the model and view matrices.
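
As an illustration of what culling early can look like (this is a generic sketch, not necessarily how the game does it), a triangle can be rejected in model space before any of its vertices are transformed. It assumes each triangle has a precomputed normal and that the eye position has been brought into model space once per model; the type and function names are made up.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t x, y, z; } vec3;

static vec3 vec3_sub(vec3 a, vec3 b)
{
    vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

/* Dot product accumulated in 64 bits to avoid overflowing the products. */
static int64_t vec3_dot(vec3 a, vec3 b)
{
    return (int64_t)a.x * b.x + (int64_t)a.y * b.y + (int64_t)a.z * b.z;
}

/* A triangle faces away when its normal does not point towards the eye.
   `vertex` is any vertex of the triangle; everything is in model space. */
static int is_back_facing(vec3 normal, vec3 vertex, vec3 eye)
{
    return vec3_dot(normal, vec3_sub(eye, vertex)) <= 0;
}

/* Quick check: a triangle facing +z is visible from z = 5 but not z = -5. */
static void check_back_facing(void)
{
    vec3 n = { 0, 0, 1 }, v = { 0, 0, 0 };
    vec3 front = { 0, 0, 5 }, behind = { 0, 0, -5 };
    assert(!is_back_facing(n, v, front));
    assert(is_back_facing(n, v, behind));
}
```

The test costs one subtraction and one dot product per triangle, which is much less than pushing three vertices through the model and view matrices.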

But all of these improvements only affect the gint version, which doesn't support multiplayer, as the prizm version seems to be slower because of it.
Do you plan to add serial communication to gint any time soon?
Nice, 24 FPS is quite smooth already. 20 microseconds is small enough that you have little control on the C side (remember a microsecond is just 100 CPU cycles), so I hope that's not going to be your bottleneck...

Quote:
Do you plan to add serial communication to gint any time soon?

I've had quite a bit of demand recently and SlyVTT has worked on the basics. I've got my hands full due to too-many-ideas-and-projects syndrome (plus that other thing, a "PhD" or something) but it's basically #1 on the gint list. Feel free to ping me for updates.
I'll be looking forward to this.

Anyway, thanks for all the help and for making gint!
Since the last two versions at least, when I run the game a second time it freezes on the first frame of gameplay, forcing me to pull a battery. (fx-CG10)
  