Chrysalide

There is a common point that all software products share in their development cycles: features are regularly added, code (mostly) works but at a given time some parts of the product need to be rewritten.

The root causes are multiple: refactoring, API update, aso. In the Chrysalide's case scaling was a huge problem that prevents the final user to load large binaries without many RAM available.

The last weeks have seen many improvements about this concern, so here is a few hints to make your own disassembler more memory-friendly!

The following news is based on commit 3d2576f, so you can give this version of Chrysalide a try by running:

git clone http://git.0xdeadc0de.fr/chrysalide.git
cd chrysalide
git checkout 3d2576f

Rely on static allocations

The readdex and readelf plugins are quite verbose and create many comments to attach to rendering lines.

Most of these comments are repeated: "Section file offset" or "Section size in bytes" for each found section for instance.

Or "Number of words of incoming arguments to the method that this code is for" which is allocated for each code item in a Dex file... Guess what happens if there are thousands of code items in an analyzed binary!

Moreover, as these strings are present in the disassembler binary itself, there is no need to copy them into the heap.

So static strings got handled as well as dynamic strings in comments with the commit c177597.

And then a generic parser has been created to centralize operations for creating instructions showing the content of a given format and the associated comments (commits ac75fdb for ELF and later 33880cf for Dex).

Respect the UNIX philosophy: do only one thing

Preloaded instructions and comments were inserted in the final disassembly view using the only available structure storing data to be inserted: symbols.

So there were plenty of symbols and most of them were not really symbols. Here were two problems.

A new structure has thus been created to handle the issue (commit a66f854). Instructions and comments get stacked and popped when needed.

As all data has to be moved from this new preload info manager to the final rendering view, this allows an extra optimization trick: items are never removed from the lists: only a reference is given and the global list is simply freed at the end (commit 13be5aa).

Regular strings from the ELF formats are now loaded in the same way (commit 5fce213).

Bonus time: this new way of parsing formats helped to spot silent bugs:

loading LEB128 raw instructions drove to register a wrong content coverage for this kind of instructions (fix in commit 11e76ce).
ELF magic allows segment and section overlapping and a (raw) instruction can not be cut into two parts (fix in commit 8ee7fc5).

Finally the third supported format Mobicore has been updated to match the new process (commit 198ba09).

Reduce the number of allocations

So here is the situation: symbols are now only used to represent labels and routines.

The symbol interface can now get cleaned as instructions and comments are not handled here any more (commit cb36603).

A few words about routines: there is no need to allocate one object for the routine and one object for the corresponding symbol.

So the commit 4691a43 made routines inherit from symbols.

This simple fact also allows to stop maintaining a routine list in the format handler (commit 1996274). Old accesses to the removed list had to be updated (commits 83a9ca9 and 25aaa3a).

Sharing is caring

The probability that the "r3" register is used more than once is high. And it is the same for the immediate value 0x0 for instance.

So a share system for operands has been built in order to avoid useless allocations.

At first for immediate operands (commit 8e5c841), then for target operands (commit 9c1367e).

And finally the system has been applied to the whole code (commit acd355c).

Here are some statistics of the saved memory for a Dex sample file (numbers are #instances/#size):

GImmOperand: current = 63544 / 3558464 - needed = 954939 / 53476584 (size=56, saved=49918120)
GTargetOperand: current = 58496 / 4679680 - needed = 246946 / 19755680 (size=80, saved=15076000)
GDalvikRegister: current = 66 / 2640 - needed = 497323 / 19892920 (size=40, saved=19890280)
GDalvikArgsOperand: current = 3600 / 201600 - needed = 54371 / 3044776 (size=56, saved=2843176)
GDalvikPoolOperand: current = 27123 / 1518888 - needed = 192330 / 10770480 (size=56, saved=9251592)
GDalvikRegisterOperand: current = 66 / 3696 - needed = 489140 / 27391840 (size=56, saved=27388144)

Store only relevant data

The size of the structure allocated for each GArchInstruction has been once again reduced!

The links between instructions are now using "flat arrays" which only consume memory when needed: items and counters are merged if there are less than two links (commit 6906aa1).

Moreover, only data which may be used at any time can qualify for persistent storing:

hooks only used in the disassembling process are now stored and deleted using the GObject facilities (commit 16f07d7).
reference to the global content which is only used for instruction printing has been replaced by an extra argument in the printing function (commit 90e0f7a).

Kill Memory leaks

Memory leaks are a classic way to lose memory:

the built list of disassembling areas is now properly freed (commit 1c96d0a).
the delayed works are destroyed after processing (commit 3d2576f).
the bitfield implementation do not allocate more memory than necessary anymore (commit 8e76324).

Misc

The test suite has been extended a little bit. All tests are important to avoid regressions, such as an out of bound string section length (commit 81aec19).

Tooltips have been improved when displaying information about immediate operands (commit b16071a) and some characters get properly escaped (commit 9b14115).

A new panel appeared! It shows all the parts of a binary format using a tree view (commit acc7b5f).

Raw instructions containing several characters merge them into strings when possible (commit b0c6ffa).

The commit 9a54829 ensures that a Dex routine is not abstract nor native before adding a symbol into the disassembled code.

And the best for the end: the manual page tells that the isprint() libc function takes an integer as argument. But the underlaying implementation uses this argument as an access index to a static bounded array. I guess "which must have the value of an unsigned char or EOF" is the hint to avoid segfaults for big values... (fixed in the commit aeffe7a).

Posted on May 27, 2017 at 15:42.