Friday, August 3, 2007

Introducing PopTrees (Detecting Message Chains)

I introduce a PopTree as a simple abstraction over CIL and demonstrate how they can be used to find Message Chains.

Detecting Message Chains.

Message chains violate the Law of Dementor[1] by accessing a long series of other object's Methods and Properties. Detecting Message Chains violations in CIL involves understanding the local structure of statements. A large subset of instructions can occur in the context of a passing values to a method call. This means understanding the structure of message chains entails understanding much of the CIL instructions used in passing arguments. In addition, the instructions have to be understood to tell the difference between a sequence of method calls and a chain of method calls.

The problem is that there isn't good support for representing the instructions as more complex statements. Usually this is done ad hoc for the purpose of a person's analysis. Some code in Cecil.FlowAnalysis.ActionFlow has attempted to provide some abstractions over CIL; however, it only implemented a small subset of the instructions. In addition, the design tried prematurely canonicalizing the instruction tree making it difficult to extend to the full instruction set.

Instead, I introduce the concept of a PopTree. A PopTree roughly corresponds to a statement in the original source code. A PopTree is constructed by simulating the instruction stream to push and pop various values on a state machine. An instruction 'owns' another instruction if it pops it's value off the stack. A post-order transversal should closely correspond to the original AST. The hardest part was examining the spec to see how many pops and pushes an given instruction causes.

Algorithm:

public PopTreeList Build( ICollection instructions )
{
foreach (Instruction i in instructions)
{
PopTree p = new PopTree(i);
int pops = InstructionInfo.Pops(i,_machineStack.Count);
if (pops > _machineStack.Count)
{
// can happen with exception blocks whose support is flaky.
//Console.WriteLine("Attempting to pop more than machine has: " + i.OpCode.ToString() + " " + i.Operand);
return null;
}
for (int x = 0; x <>
{
p.Consume(Pop());
}
if (InstructionInfo.Pushes(i))
{
Push(p);
}
else
{
_popTreeList.Add(p);
}
}
return _popTreeList;
}

CODE:
m1( a1() ).m2( a2() ).m3( a3() )

CIL:
ld this ; push [this]
dup ; push [this]
call a1 ; pop [this] -> push [value]
call m1 ; pop [this] pop [value] -> push [obj]
ld this ; push [this]
call a2 ; pop [this] -> push [value2]
call m2 ; pop [obj] pop[value2] -> push [obj2]
ld this ; push [this]
call a3 ; pop [this] -> push [value3]
call m3 ; pop [obj2] pop [value3] -> push [obj3]

POP TREE:

[m3]
[m2] [a3]
[m1] [a2] [this]
[this] [a1] [this]
[this]

As an example, I have written an scanner that can find message chains in source code. This works decently well. However, I have to fix the Cecil.FlowGraph.ControlFlow model so that it can properly handle more advanced control flow -- that way things will work on real-world assemblies.

Another thing that should be easy is to determine all the variables used in the context of an if statement. This should make doing branch coverage and other analysis needing to get predicates easier to perform.

[1] http://en.wikipedia.org/wiki/Law_of_Demeter

No comments: