Posts Tagged Technical
I learned something today: Out of Memory errors are not actually caused by being out of memory! What a silly I am for expecting error messages to make sense.
So, for everyone preparing to set your RAM on fire, stomp on it and throw it out the window before burying it in a deep dark hole and pouring concrete on top: don’t bother! You RAM is fine. Assuming you haven’t already set it on fire. If you have, your RAM is not fine. And you might want to put that out.
No if you’re out of RAM you’ll get performance issues caused by ‘thrashing’ (when a computer starts relying on the Hard Drive instead of RAM), but you won’t get Out Of Memory exceptions. So, what causes OoM exceptions?
32-bit Windows causes OoM exceptions.
Screw 32-bit Windows.
See, when Microsoft says “you are out of memory”, they actually mean “you are out of consecutive address space”.
You now have several options (it’s like a game!):
– For a “proper” explanation, go read the article I read that taught me all this: http://blogs.msdn.com/b/ericlippert/archive/2009/06/08/out-of-memory-does-not-refer-to-physical-memory.aspx
– For an “improper” explanation, go ahead and use your imagination (I’m sure it’s way more kinky than anything I can come up with).
– For an “incredibly-simplified and probably-wrong” explanation, continue reading…
When an object like a Creature or a Tree is assigned, it gets put in memory and Windows assigns it an address. And by ‘put in memory’, I don’t mean RAM. Forget RAM. RAM is a glorified performance optimisation. The memory we’re talking about here is your hard drive. And as you’re probably aware, your hard drive is huge. You’re not in any danger of running out of that unless you’ve packed the drive to it’s very limit.
Okay, so you’re not in danger of running out of Hard Drive space, and lack of RAM doesn’t cause exceptions, just performance issues. So the next place to look is at that address the object was assigned by Windows when it was put in memory.
This address is where windows get’s it’s 32/64 bit designation. The absurd oversimplification I’m about to make will likely make any computer-science types reading this swallow their hats, but it’s basically a number stating that object X can be found Y bits into the memory stream.
The cause of Out Of Memory exceptions is that “Y-bits into the memory stream”. See, even though the process has the entire hard drive to work with, a 32-bit address format can only store addresses up to ~4GB (and half of that is reserved by the operating system for other things). This is why 32-bit editions of Windows can’t make use of more than 2GB ram: the 32-bit format doesn’t have enough address space to map more than 2GB.
Okay, but surely that’s basically the same as a 2GB memory limit for the purposes of Out Of Memory exceptions?
Nope. See, to assign and map a new object, there needs to be an empty “hole” in address space for it to go into. If there isn’t a hole large enough, an Out Of Memory exception will be thrown. (finally!)
I’ll explain via metaphor. Imagine putting three baskets into your car, then filling in the area in between them with smaller items. Then you realise that you need a large esky instead, so you take out the baskets and try to put the esky in. Even though the three baskets combined took up more space than the esky, and even though they’ve been removed, the esky still won’t fit because the smaller infill items are blocking it.
That’s the problem with address space: if you fragment it with smaller items in between the big ones, the next time you need to allocate something big it’ll throw an Out Of Memory exception, even if you freed up the big ones and have plenty of cumulative space to use. *That* is what is causing the memory issues in Species.
I’m sure a decent Computer Science course would have covered all this but I’m self taught, so I get to learn by the slightly more direct route of sticking my hand in the fire and finding out what happens.
So, on to the usual question: how do we fix this?
Well, the obvious solution is “compile to 64-bit.” A 64-bit address space can map about 4 billion times as much memory (no, literally), so chances are we wouldn’t be seeing fragmentation errors any time soon. Of course, it also means people using 32-bit windows wouldn’t be able to play the game.
I’d rather solve the error at the source and retain backwards compatibility, but “solve” is a relative term. Memory fragmentation on a 32 bit machine isn’t something that can be stopped. But it can be significantly reduced, and I’d prefer to do that for the sake of memory management *before* dropping the metaphorical world-killer asteroid that is a 64-bit compilation.
So, on to the source. What, ultimately, causes memory errors in Species?
I’d done memory profiling in the past, but I hadn’t properly understood what it was telling me until now. They indicated I had two main memory draws: grass vertices, and tree instances. But these two memory hogs always appeared in different graphs: tree instances would appear in allocation, grass vertices in snapshots.
With what I know now, this makes sense. Grass vertices are created when the terrain is built, and held from then on out. This means that the grasses memory cost at any moment in time is huge: after all, it’s storing the position of every potential billboard vertex on the map. Imagine the map completely covered in the dense savanna grass: that’s what the grass memory cost is.
But relative to the hard drive, or even the RAM, it’s not that big: I think somewhere around the 200MB mark? Can’t remember exactly, but regardless, it’s also generally a static cost. The grass grabs up all that memory when the terrain is built and doesn’t release it until the terrain is destroyed. You might be able to force an OoM exception by repeatedly loading/generating worlds, and fragmenting the memory that way. Fixing it isn’t urgent, but probably worth doing: need to ensure the memory is allocated at startup, rather than on terrain-build.
The second cost is the Tree Instances array. This one confused me because it was barely noticable on the moment-to-moment memory, but was invariably a massive cost in terms of amount allocated. This was because the Tree Instances array was designed to grow organically with the vegetation: if the number of trees grew too large for the array, it would allocate a new array with an extra 500 slots for trees and copy the data to that (well okay it actually uses Array.Resize(), but that’s what Array.Resize() does internally).
Of course, frequently allocating a massive new tree array is a great method for fragmenting your address space, which is why large (>2) worlds were invariably hitting an OoM exception during the vegetation generation at startup. Since I didn’t realise this was a problem, the amount allocated in the memory profiler didn’t mean anything to me.
I’ve already applied a few simple optimisations to this method, which makes generating size-3 worlds possible again, but fixing it will be another matter entirely. Ultimately, the vegetation should probably be capped to a set level, but this is tricky because the array is fed directly into the graphics card when drawing the trees. This means that those 500 “excess” trees still incur a GPU cost, even though they’re invisible. In a size 3 world, this cost is enough to cap FPS to 12, even when there are no creatures.
To fix this one I’m going to need to adapt the InstancedModel DrawInstances() method to take the total number of trees as a parameter, and feed that to an overload of the DynamicVertexBuffer. This should allow me to submit the entire oversized array, but then lower the number of instances that are actually drawn each frame.
(Edit) Well that was surprisingly easy. A heap of address space is now set aside for trees, but only the actual trees themselves are sent to the graphics card. Assuming XNA isn’t doing something horrible with them beneath the hood, that should take care of most of the OoM exceptions in the game. Stability improvements ahoy!
Also, I successfully got a post up! Sure it’s like half a week late or something, but I’ll take it. My fragile self-esteem needs all the help it can get.
So, scene setting: I’ve just finished the skinning system. I can render animated and deformed meshes anywhere I want in my environment. I’ve also modeled a few simple animated legs and a torso object.
Now, to be fair, I might be taking a few liberties with the chronology of things here. A lot of the following was worked out in parallel with the skinning system, but I’m presenting it as sequential because it sounds more organised. So don’t mind my temporally-unhinged antics here. I assure you that unless the cooling pump fails on the paradox diffuser the space-time continuum is in absolutely no danger of instantly ceasing to exist. And the chances of that happening are, like, 1 in 10. Or is that 10 to 1? Never mind, nobody cares anyway.
Moving on: there are two possible approaches to take when it comes to laying out a foundation for a class as central to the game as the “Creature” class.
The first is to lay out everything at the start. Take it all from the design document, set up every data structure I need, and then go about tying them together. So, I’d prepare an leg class with some genetically-defined variables for size, width and type, then set aside some ‘combination variables’, such as strength and speed. Then I could use occult magiks to force demonic slaves to write Mathematical Equations to derive the latter from the former, as so: strength = (bicepWidth + tricepWidth * 0.6) * sizeModifier, etc. Or I could do them myself, that works too.
OR… the other option is to wing it. We start by making the creature move at a constant rate, and then we go and set-up the leg rendering and animation code, and then we plumb the size of the legs into speed and then we actually make the size of the legs variable, and then some other things we made up on the spot and then we realise we can make the subjects into ravenous undead flesh-eaters to unleash on an unsuspecting populous and make an absolute killing selling shotguns, and then… you get the idea. Basically making it all up as we go along.
Now, given that both of these approaches are mutually exclusive, in a stroke of certifiable genius I went and enacted both of them at the same time.
… wait, did I say genius? I meant the other thing. Word. Starts with I.
I had a fairly good idea as to how I was going to structure the creature body parts: A torso object, with legs, arms, neck and tail objects attached to that, and head, tailtip, hands and feet, attached to those. I also knew a large quantity of genetic variables I wanted, so I went ahead and put them in so I could tie them all together later. But I also wanted to see it in action as soon as possible, so long before things like “speed” were set up I threw in placeholder values and started on the AI, so that the creature would do more than just stand around looking awesome.
And it worked. Okay, true, I ended up making changes to or deleting the majority of the genetic values, but knowing the basic structure of the class and it’s sub-classes was a great help when it came to prioritising work, and being able to see the effects of the changes I was making kept me enthused.
So, in the fine tradition of starting with the most time consuming thing possible, the very first thing I prioritised was leg and torso visualisation. The creatures aren’t actually held up by their legs in species: instead, I derive a torso-height based on their leg size, and the legs sort of dangle at just the right height to make it look like they’re standing on the ground. Okay, all well and goo: that’ll work fine for horisontally-aligned bipeds, but what about quadrupeds with differently-sized limbs? What about creatures that stand upright like humans, or creatures that drag themselves along with a set of huge legs up the front? Clearly I’d need to do more than just move the torso up.
This was solved by the addition of another, half-derived half-genetic variable called “torsoRotation”. torsoRotation determines the pitch at which the creature’s torso is orientated, from 0 at horisontal to 90 when upright (okay, the value is actually in radians, but who in their right mind says “Pi on 2” when they mean 90 degrees?).
torsoRotation is a genetic variable: creatures can pass it on to their offspring. But if the creature is a quadruped, or if they are overly unbalanced (for example, small legs at the very rear), then torsoRotation will be overwritten with a value derived from a whole bunch of rules depending on their leg sizes and positions. This allows creatures to “stand up” without making it arbitrary or nonsensical: a creature who’s back legs increase in size will automatically pitch forward unless they usually stand upright enough to compensate for it.
Fittingly for a game like Species, this all affects survival in various ways. Any creatures whose body touches the ground suffers a ‘dragging’ energy-loss penalty whenever they move. Upright creatures move more slowly, but use less energy to get around. And so on.
This system of rules, once it was finished, resulted in a large variety of potential body plans, even with nothing more than legs and a torso. And, thanks to the time already spent on the skinning system, I could take randomly generated test renders of them straight away:
Don’t worry everyone, I swapped his paradox machine for a coffee maker and replaced his timecube collection with fluffy dice. The worst he can do now is make a terribly tacky espresso.
Written July 2008
In 3 dimensional graphics, a computer can only render so many triangles before it suffers death by framerate. This is a problem in most games: the artists can never put as much detail into their 3D models as they would like.
With games where you can see and walk to the horison and beyond, this is especially awful. How can a computer render every triangle on a terrain so many kilometers wide, and still get a decent framerate?
The answer is fairly simple: it can’t. But the computer can reduce the number of triangles to be rendered on the graphics card. A distant mountain doesn’t need to have the same level of detail as the ground under the players feet, after all.
This is the point of Level of Detail (LOD) Algorithms: render fewer triangles, but get a visual result very similar to what you would have had you rendered them all. There are many different styles of LOD Algorithm. Species uses a QuadTree.
A Quadtree is a structure where each node is subdivided into 4 children nodes, each child node subdivided into 4 grandchildren nodes, etc. In the context of a terrain, this means that the ‘root’ node would have a low detail mesh, the children would have a higher detail mesh, the grandchildren have higher detail still, etc…
As the camera moves about the terrain, therefore, you can sink deeper into the quadtree and render high detail squares for nearby objects, and not sink anywhere near as far for distant objects, thus rendering them at a much lower detail.
[Present me says: this is where I seamlessly stitched in further detail from a more technically orientated post. You may not be able to follow much of the following without a background in graphics programming and/or wikipedia]
Of course, nothing is perfect. Two same sized quadnode meshes next to each other will fit seamlessly, but imagine a high detail node next to a low detail node. The result will be artifacts known as gaps.
This can be fixed by ‘stitching’, where you either add or remove edge vertices from the nodes so that they match each other. In this case, I removed vertices from the higher detail node, creating a triangle pattern which connects to the lower detail node. In this screenshot, you can see it applied to every edge of every quadnode. In this one, it is applied correctly.
Of course, this isn’t as simple as it sounds: each quadnode is now accompanied by 9 meshes, 1 for no stitching, 4 for a single edge stitched, and another 4 for two stitched edges. Since these are generated at runtime, however, they aren’t a problem. [Present me says: well, they’re less of a problem than they would be if I tried to save them in memory. This is a case of putting up with longer loading times to reduce the memory footprint of the game]
It isn’t only distant objects need to be LODd. Nearby nodes which are not in view because you are looking in the opposite direction need not be rendered at all.
By doing a collision test between the camera’s “Bounding Frustrum” (a 3d shape representing the cameras viewport) and a quadnodes bounding box, we can determine which nodes are outside the view and quickly cut a whole heap of geometry from the render. By combining this with the distance testing, we can stop that from sinking further into off-screen nodes as well!
Multiple Vertex Buffers
One of the things I spent a lot of time on was converting the terrain to run with more than one vertex buffer.
The vertex buffer contains all the vertices read off of the heightmap. In general, it’s best to use a single buffer, because each buffer must be rendered with a separate Draw call. Unfortunately, there is a maximum limit to the number of vertices that can be rendered at once on the graphics card, and going above this limit will force the call to be rendered on the CPU (resulting in death by frame rate).
This limit varies, but on my home computer over a million vertices (a 1024×1024 terrain) is just a bit too much.
So, I implemented a nasty and complex algorithm to separate the terrain into a number of vertex buffers, based on the idea that each quadnode could be entirely contained within a parent vertex buffer if we split it correctly. This (eventually) came out well: it is now possible to split your terrain based on a size value. A 1024×1024 terrain split by 512 vertex buffers will come out as 4, 512×512 vertex buffers, with a fifth for all quadnodes which cover an area greater than 512 (only the root node, in this case).
It’s worth noting that I may have managed this whilst drunk, tired or otherwise incapacitated, as I cannot remember actually coding it, and have no really idea how or why it works. But it does, and seems to be bug free. [Present me says: Yes, that’s right. Apparently, the Ballmer Peak actually exists]
Global Normal Map
This shader was my first attempt at HLSL (High Level Shader Language), and came out brilliantly in my opinion. [Present me says: Hah! Oh you poor naive fool]
Dynamically changing the geometry can have a nasty effect: although the shape of a low detail node may be very similar to that of a high detail node, the shading can result in a fairly large difference in appearance if done per vertex.
To solve this, I took advantage of the power of HLSL, and told the terrain shader to use a global normal texture rather than per vertex normals. The advantage was twofold: My vertex buffers halved in size, and more importantly shading no longer changes between high and low detail quadnodes.
Detail Normal Map
Initially, I didn’t understand the mathematics behind normal mapping, and thought that by simply adding a detail normal map value to my global value in the shader I could create detail normal mapping. The result looked OK, but it wasn’t accurate: the detail normal ‘pulled’ the global normal upwards. When I tested with a high strength value for the detail normal, this was instantly apparent: the entire terrain was shaded as if it was a lot flatter than it truly was.
It wasn’t until after I’d finished the multitexturing that I worked out how to fix this, but when I did the difference was apparent (see below).
Using a single texture for the entire terrain has two problems: the terrain is too big to be covered with a single texture with a high enough resolution, and tiling the same ground texture over the entire terrain is very bland. Therefore, I went for a multitexturing approach that made use of a blend texture.
In short, mutitexturing is using more than one texture on the same object, and a blend texture uses it’s red, green and blue channels to define the amount of each texture to show. In the sample, Blue represents sand, Green – grass, Red – Foliage and Black displays as Rock.
Quickly back on the subject of Detail Normal Mapping: you can see the difference between my original method, and my final method.
“Overdraw” is a term used when the pixel shader renders a distant object, then renders a closer object which completely obscures the distant object. Obviously, rendering the distant object was unnecessary: it didn’t end up being drawn on the screen, and rendering it simply hurt the fill-rate.
The GPU can overcome this if the near object is drawn first: when rendering the distant object, the fact that an object with a smaller Z-Buffer value has already been rendered will be detected, and the rendering process will be skipped.
Therefore, it is to our advantage to render the near quadnode before the far ones. Since a list of Quadnodes to be drawn is built seconds before the quadnodes indices are compiled into the index buffer, it’s a relatively easy fix to add a sort function in between.
Isn’t snarking in italics supposed to be my job?”