Anatomy of a Crash Dump

Dromed crashes a lot. Even an experienced designer can’t avoid pushing the program too far. How well you can recover from a crash depends on the severity of the problem and on whether you have prepared ahead of time. The first thing to do is determine what went wrong.

When a crash occurs, you will see one of three things: Dromed may report the error, Windows may report the error, or the game will simply quit with no dialog box shown. Dromed’s error messages have the title Assertion Failed and three buttons: Yes, No, Cancel. Those messages often have useful information. Some problems are not fatal, and pressing Cancel will allow you to continue. Pressing Yes or No will always quit Dromed. The Windows message box has the title Dromed and a Close button plus other buttons and generic messages. This information is mostly useless and only confuses end-users. Just press Close, and don’t bother sending a report to Microsoft.

But the most useful information is written to the monolog. When a crash occurs while the monolog is active, the log will contain the following lines, followed by a long list of critical information.

Fatal exception occured; dumping crash info to log...

------------------------------------------------------------------------------

A programmer can use this information to determine what caused the crash. This is an example of a crash dump and how to interpret it.

Reading a Crash Dump

DROMED caused an Access Violation in module DROMED.EXE at 0197:006acc63.
Exception handler called in _AppMain(): x:\prj\tech\libsrc\appcore\appcore.cpp.
Error occurred at 10/4/2002 20:49:04.
C:\GAMES\THIEF2\DROMED.EXE, run by etienne.
1 processor(s), type 586.
192 MBytes physical memory.
Read from location ffffffff caused an access violation.

Every crash dump will begin with something like this. The first and last lines are the ones to pay attention to. The exception handler will almost always be in _AppMain, because that’s what handles displaying dialog boxes. The last line tells you why the crash occurred. An access violation is the most common reason. The location will either be a very high number (all numbers are displayed in base-16), or a very low number.

The first line tells you where the crash occurred. This is the most obviously useful piece of information. If the module name is the file name of a custom script you’re using, then you know right away that there is a problem with that script. When the module is DROMED.EXE, the number that follows it points to the part of the program that was active at the time.
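As an illustration, the first line of a dump can be picked apart with a regular expression. This is only a sketch: the pattern is modelled on the example lines in this article, and the names `HEADER` and `parse_crash_header` are made up for the example, not part of Dromed or any tool.

```python
import re

# Pattern modelled on the first line of the dumps shown in this article.
HEADER = re.compile(
    r"^(?P<program>\S+) caused an? (?P<exception>.+?) in module "
    r"(?P<module>\S+) at (?P<segment>[0-9a-fA-F]{4}):(?P<offset>[0-9a-fA-F]{8})\.$"
)

def parse_crash_header(line):
    """Return program, exception, module and address, or None if no match."""
    match = HEADER.match(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    info["offset"] = int(info["offset"], 16)  # the dump prints addresses in base 16
    return info

header = parse_crash_header(
    "DROMED caused an Access Violation in module DROMED.EXE at 0197:006acc63."
)
print(header["module"], header["exception"], hex(header["offset"]))
```

If the module turns out to be one of the OSM files, the script is the first suspect. Otherwise the offset is the number to look up.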

The rest of the crash dump depends on what the error is.

Assertion Failed

When Dromed is able to detect a problem, it will display a dialog box describing what caused the error. The information it displays was intended for the Dromed programmers to use to diagnose the problem. This can sometimes be at least a little bit useful, but more often you are left guessing what to do about it.

To generate a crash dump from the error dialog, press the Yes button. The Cancel button will sometimes cause a crash (and sometimes not), but even when it does, the crash is not guaranteed to occur in a meaningful way. Pressing Yes stops Dromed immediately, and the description that follows applies to that case. (Pressing No just closes Dromed without generating a crash dump.)

There are many things that can trigger the error dialog, and a proper diagnosis depends on what the particular error is. But in the process of creating the dialog box, a large amount of extra data is generated that will appear in the crash dump. The first thing you must do is determine what is actually important, and ignore the rest.

DROMED caused a Breakpoint in module KERNEL32.DLL at 016f:bff768a1.
Exception handler called in _AppMain(): x:\prj\tech\libsrc\appcore\appcore.cpp.
Error occurred at 10/3/2006 19:30:02.
F:\THIEF\THIEF.DROMED\DROMED.EXE, run by Free.
1 processor(s), type 586.
1022 MBytes physical memory.

The error will always be described as a “Breakpoint” in KERNEL32.DLL. It’s the same thing that happens when you use the Dromed command hello_debugger.

Ignore the registers; they all get overwritten by the dialog box. The first interesting value is the fifth entry in the stack:

Stack dump:
0119f3cc: 00610f20 01564f54 00000000 015d7730 000007c4 00724a8c 6e6b6e55 006e776f
                                              ^^^^^^^^

This line is the top of the stack. Everything else on it is unimportant. The value marked with carets was copied from the EBX register before the dialog box was created. This is sometimes useful, and in this case it is an object ID. (But it will probably mean something else depending on the error.)

0119f82c: 0051c33a 00724ac0 00724a8c 0000012d 01564ec0 03c10005 00524897 015d7730
0119f84c: 00000000 0119f89c 004c8585 01564f54 00000000 0119f89c 00048560 00000000

The next part of the stack with relevant data in it is much farther along. In this dump it is the 36th line, which begins 1120 bytes (hexadecimal 460) after the start of the stack dump. The first four values on that line are what Dromed used to create the dialog box. They refer, in order, to the instruction pointer when the error was triggered, the message that is displayed, the name of the source file, and the line number in the source. Other than the first value, this is what you already saw displayed in the dialog. It’s what comes after it that is important. At this point, you need to know more about the particular error, and how Dromed works internally, in order to make an accurate diagnosis.
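The arithmetic can be checked with a short script. This sketch assumes what the example dump shows: eight 32-bit values per line, and a stack dump that starts at 0119f3cc. The helper name `parse_stack_line` is invented for the illustration.

```python
def parse_stack_line(line):
    """Split 'addr: v v v ...' into (address, [values]), all as integers."""
    address, _, rest = line.partition(":")
    return int(address, 16), [int(word, 16) for word in rest.split()]

STACK_START = 0x0119F3CC   # address of the first stack dump line in the example
BYTES_PER_LINE = 8 * 4     # eight 32-bit values per line

address, values = parse_stack_line(
    "0119f82c: 0051c33a 00724ac0 00724a8c 0000012d 01564ec0 03c10005 00524897 015d7730"
)
offset = address - STACK_START
lines_below_first = offset // BYTES_PER_LINE

# The first four values, in the order described above.
eip, message_ptr, file_ptr, line_number = values[:4]
print(f"{offset} bytes, {lines_below_first} lines below the first line")
print(f"assertion raised near {eip:08x}, source line {line_number}")
```

The line number is stored in hexadecimal like everything else, so 0000012d is line 301 of the source file.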

Script Causes a Crash

I needed to diagnose a fan-mission that would crash when played in Windows 98. The author of the mission suspected that a brand-new script module I had written was the cause. As we eventually discovered, it was the older scripts that were at fault.

DROMED caused an Access Violation in module MSVCRT.DLL at 0167:7801042a.
Exception handler called in _AppMain(): x:\prj\tech\libsrc\appcore\appcore.cpp.
Error occurred at 9/10/2006 20:52:32.
C:\GAMES\THIEF2\DROMED.EXE, run by Unknown.
1 processor(s), type 586.
64 MBytes physical memory.
Read from location ffffffff caused an access violation.

This crash occurred outside of Dromed, so I needed to do some hunting to pinpoint the problem. In this case, the error is reported as coming from the system library MSVCRT.DLL. To get back to Dromed, I want to trace back to the point where the program transitioned into this module. I first scroll down to the list of modules.

C:\GAMES\THIEF2\DROMED.EXE, loaded at 0x00400000 - 8265876 bytes - 38dfac00
C:\GAMES\THIEF2\GEN.OSM, loaded at 0x07250000 - 429568 bytes - 38dfa231
C:\GAMES\THIEF2\CONVICT.OSM, loaded at 0x072c0000 - 125952 bytes - 38dfa215
C:\GAMES\THIEF2\DARKDLGS.DLL, loaded at 0x10000000 - 1077760 bytes - 3810bca2
C:\WINDOWS\SYSTEM\D3DIM.DLL, loaded at 0x56660000 - 625936 bytes - 37d6f529
C:\GAMES\THIEF2\TNHSCRIPT.OSM, loaded at 0x68780000 - 413184 bytes - 428262d5
C:\GAMES\THIEF2\SCRIPT-T2.OSM, loaded at 0x6b500000 - 558592 bytes - 43f2db0d
C:\WINDOWS\SYSTEM\DINPUT.DLL, loaded at 0x70000000 - 633104 bytes - 37d6f56a
C:\WINDOWS\SYSTEM\SHLWAPI.DLL, loaded at 0x70bd0000 - 282896 bytes - 3717633e
C:\WINDOWS\SYSTEM\SETUPAPI.DLL, loaded at 0x77ea0000 - 409600 bytes - 3720a219
C:\WINDOWS\SYSTEM\MSVCRT.DLL, loaded at 0x78000000 - 266293 bytes - 36b69d5d

The first three digits of an address are usually sufficient to identify the module. Addresses that begin with 780 are in MSVCRT.DLL. I want to get back to DROMED.EXE, DARKDLGS.DLL, or one of the OSM modules. So I make a note of those numbers, then go back to the top of the dump.
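When the leading digits are ambiguous, the lookup can be done exactly. The sketch below treats each entry as the range from its load address up to load address plus the byte count. That is an assumption: the byte count may be the file size rather than the size in memory, so read the upper bound as approximate. The function names are invented for the example.

```python
import re

MODULE_LINE = re.compile(
    r"^(?P<path>.+), loaded at 0x(?P<base>[0-9a-fA-F]+) - (?P<size>\d+) bytes"
)

def parse_modules(lines):
    """Turn module list lines into (path, start, end) tuples."""
    modules = []
    for line in lines:
        match = MODULE_LINE.match(line)
        if match:
            base = int(match.group("base"), 16)
            modules.append((match.group("path"), base, base + int(match.group("size"))))
    return modules

def find_module(modules, address):
    """Return the path of the module containing address, or None."""
    for path, start, end in modules:
        if start <= address < end:
            return path
    return None

modules = parse_modules([
    r"C:\GAMES\THIEF2\DROMED.EXE, loaded at 0x00400000 - 8265876 bytes - 38dfac00",
    r"C:\GAMES\THIEF2\TNHSCRIPT.OSM, loaded at 0x68780000 - 413184 bytes - 428262d5",
    r"C:\WINDOWS\SYSTEM\MSVCRT.DLL, loaded at 0x78000000 - 266293 bytes - 36b69d5d",
])
print(find_module(modules, 0x7801042A))
```

An address that matches no module, such as a stack address, returns `None`.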

Registers:
EAX=00000070 CS=0167 EIP=7801042a EFLGS=00010202
EBX=00000003 SS=016f ESP=0119f5cc EBP=0119f5cc
ECX=0119f86c DS=016f ESI=687d56c9 FS=0f3f
EDX=00000000 ES=016f EDI=0119f840 GS=0000
Bytes at CS:EIP:
88 02 ff 01 0f b6 c0 83 f8 ff 8b 45 10 75 12 83 

The register EIP holds the current address, which begins with 780. The register ESP points to the top of the stack, which is vital to tracing a program. Very often, the register EBP will contain the previous position of the stack before the current function was entered. But, typical of MSVCRT, this is not the case here. So I have to make educated guesses.

I first notice that ESI has a value that begins with 687. This address is in the module TNHSCRIPT.OSM. That’s my first clue to the culprit, but it doesn’t tell me much because that register doesn’t have any pre-assigned purpose.

Stack dump:
0119f5cc: 0119f854 780104a5 00000070 0119f86c 0119f840 687d61aa fffffffc 00000000
0119f5ec: 780103f2 687d56c8 00000004 0119f86c 0119f840 00000000 00000000 687d56c8

A stack is a block of memory that can be partially filled with data. The top of the stack points to the most recent data written, and all previous data appears afterwards. I can see a few 780 addresses, but also three 687 addresses, further incriminating tnhScript. This is enough for a non-programmer. But as a programmer, I want to know more.
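Picking module addresses out of a stack dump by eye gets tedious on a long dump. A minimal sketch of the same scan follows. It assumes the module ranges are the load address plus the byte count from the module list, and `stack_hits` is a name made up for the example.

```python
def stack_hits(stack_lines, start, end):
    """Return (stack address, value) pairs whose value lies in [start, end)."""
    hits = []
    for line in stack_lines:
        address, _, rest = line.partition(":")
        base = int(address, 16)
        for index, word in enumerate(rest.split()):
            value = int(word, 16)
            if start <= value < end:
                hits.append((base + 4 * index, value))  # each value is 4 bytes
    return hits

STACK = [
    "0119f5cc: 0119f854 780104a5 00000070 0119f86c 0119f840 687d61aa fffffffc 00000000",
    "0119f5ec: 780103f2 687d56c8 00000004 0119f86c 0119f840 00000000 00000000 687d56c8",
]

tnh = stack_hits(STACK, 0x68780000, 0x68780000 + 413184)     # TNHSCRIPT.OSM
msvcrt = stack_hits(STACK, 0x78000000, 0x78000000 + 266293)  # MSVCRT.DLL
for where, value in tnh:
    print(f"{where:08x}: {value:08x}")
```

The scan cannot tell a return address from a data pointer or a stale leftover. It only narrows down where to look.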

Modules can contain code and data. When I checked these addresses against the module file, I found that they are in the data segment. I’d like to have a code address, eventually, but I’ll work with these for now.

687d61aa = "=%d"
687d56c8 = "page"

These strings are at the given addresses. Actually, the first one is %s=%d, but the pointer has skipped over the first part. Now I don’t really need the code address because I know where these strings are used in the script module. (The script OnScreenText is trying to save the current page number.) For further confirmation, I scan the stack for the exact pattern that matches where I suspect the problem is.
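The skipped-over part is plain pointer arithmetic on a zero-terminated string. The sketch below models it, taking from the stack dump in this example that the full format string starts at 687d61a8. The names are invented for the illustration.

```python
FORMAT = b"%s=%d\0"          # the string as it sits in the module's data
FORMAT_ADDRESS = 0x687D61A8  # where the full string starts (from the stack dump)

def string_at(pointer):
    """Read the zero-terminated string that a pointer into FORMAT refers to."""
    offset = pointer - FORMAT_ADDRESS
    return FORMAT[offset:FORMAT.index(b"\0")].decode("ascii")

print(string_at(0x687D61A8))  # the whole format string
print(string_at(0x687D61AA))  # the same string, two characters in
```

A pointer two bytes past the start means the library had already consumed the %s when the crash happened.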

0119f84c: 687d56c8 00000000 0119f88c 78022bc8 0119f86c 687d61aa 0119f8a0 00000000
0119f86c: 00000000 7ffffffe 00000000 00000042 006cc4b8 687d56c8 00000000 00000000
0119f88c: 0119f8bc 687b5363 00000000 687d61a8 687d56c8 00000000 00abe425 00abe424
0119f8ac: 0119fa2c 00000009 00000000 0119f960 0119f8dc 687a34b4 00000118 687d56c8

I know that the first string I’m looking for has the address 687d61a8. This only appears once in the stack, at address 0119f898, followed by the second string and the number 0. This must be the function call. What I really want to see is the two numbers that come before. Right in front of it is another 0, which is a NULL pointer in this context. In front of that is 687b5363, which I verify is the suspected code location. As I said, however, I have enough evidence to know the exact cause. I know that the problem is the NULL pointer, which shouldn’t be used with this version of MSVCRT.
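The same search can be written down as a script. This sketch flattens the four stack lines above into an address-to-value table, finds the format string pointer, and prints the two values before it and the two after. `flatten` is a hypothetical helper, not part of any tool.

```python
STACK_LINES = [
    "0119f84c: 687d56c8 00000000 0119f88c 78022bc8 0119f86c 687d61aa 0119f8a0 00000000",
    "0119f86c: 00000000 7ffffffe 00000000 00000042 006cc4b8 687d56c8 00000000 00000000",
    "0119f88c: 0119f8bc 687b5363 00000000 687d61a8 687d56c8 00000000 00abe425 00abe424",
    "0119f8ac: 0119fa2c 00000009 00000000 0119f960 0119f8dc 687a34b4 00000118 687d56c8",
]

def flatten(stack_lines):
    """Map every stack address to the 32-bit value stored there."""
    memory = {}
    for line in stack_lines:
        address, _, rest = line.partition(":")
        base = int(address, 16)
        for index, word in enumerate(rest.split()):
            memory[base + 4 * index] = int(word, 16)
    return memory

memory = flatten(STACK_LINES)
found = [addr for addr, value in memory.items() if value == 0x687D61A8]

# Two values before the format string pointer, the pointer itself, two after.
frame = [memory[found[0] + 4 * k] for k in range(-2, 3)]
print(f"found at {found[0]:08x}")
print(" ".join(f"{value:08x}" for value in frame))
```

Read in order, the five values are the suspected code address, the NULL pointer, the format string, the string "page", and the number 0.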

dromed/crashdump.txt · Last modified: 2007/06/26 21:40 (external edit)