Category Archives: Idiosyncratic

Structure Aversion

A well-designed system is like a fractal image – it’s highly compositional. One can see finer degrees of structure at any level of detail.

My Idiosyncratic C++ series would not be complete without a discussion of structure. 

Good object-oriented (OO) programmers understand the basic principles of OO analysis.  And many make a respectable effort in decomposing a system into independent, reusable, composable building blocks.  Unfortunately, many such efforts end there, resulting in coarse-grained, poorly-structured designs.  In this post, I’ll explore what a well-structured design does, and does not, look like.

C++ has been described as two languages in one: a low-level language of constructors, destructors, conversion and assignment operators, and the like; and a high-level domain-specific language undergirded by the first. But highly structured code contains N “languages”, and is like a fractal image.  It could also be compared to an objective hierarchy, wherein the levels below answer the question “how?”, and the levels above, the question “what?” or “why?”  A good programmer is adept at shifting levels as necessary, and works hard not to conflate them.

Information Content

Well-factored code is synonymous with highly structured code.  You cannot have one without the other.  An information theorist might say that highly structured code contains more information.  This is true in two senses.  

First, structured code has greater “information density”.  Here, we’re not creating information so much as we are distilling and concentrating it.  The same signal remains, but there is far less noise.  And since every line of code is a liability, the fewer it takes to express an idea, the better.  (Fewer to write, compile, debug, review, test, document, ship, maintain, …)

Second, creating more structure implies defining more abstractions, which is a powerful programming exercise.  Articulating the utility and purpose of every artifact, and formalizing the relationships between them, makes the programmer’s intent as explicit as possible.  Taking the knowledge latent inside the gray matter, and “injecting” it into the system, creates new information.  And as an information theorist might also say, there is “no information without physical representation.”  (OK, it still exists somewhere in the programmer’s head, but it sure isn’t very accessible.)

What’s In a Name?

As new degrees of structure are created, it’s probably fairly apparent that new names are needed (lambdas notwithstanding).  A good test for one’s own understanding of a system is the ability to name each part of it.  I have found that when I labor over the right name for an artifact – be it a namespace, class, method or whatever – it may be a symptom that my design is a bit muddled.  Most likely, it needs more fine-grained structure, permitting each piece to abide by the single responsibility principle.  Taming the naming beast is a good thing.  If I can create the best name for an artifact right now, I’m doing my peers and my future self a favor.  Conversely, if I can’t understand the system today (well enough to name it), how can I hope to do so when I have to maintain or extend it a year from now?

Structure Aversion

That brings us to our idiosyncratic practice – what I’d simply call “structure aversion.”  It’s manifested in many ways, which I’ll describe, but first I want to get down to the (de)motivation for it.  I think it’s most common among beginners, perhaps driven by a distaste for the “overhead” needed to define a new structure.  Some developers seem to think that every class warrants dedicated header and source files, complete field encapsulation, serializability, a host of virtual methods implementing various interfaces, etc.  (The corollary is that having many small, local structures comes to seem prohibitively expensive.)  This may be applicable for some singly-rooted hierarchy languages, especially managed ones.  But C++ inherits C, which embraces lightweight, native structs.  And these are ideal for POD types, tuples, parameter objects, compound returns, local aggregate arrays, and so on.

Let’s look at some examples of structure aversion, similar to what I recently encountered during a refactoring exercise.  The first is a simple case of structuring a function’s parameters and returns, which can lead to a more compact, fluent, functional style.

// Factory: the return value is assigned to an arbitrary bit of
// person data (SSN) - the "lucky winner!"  Also, it's difficult 
// to tell at a glance which params are "in" and which are "out".
int GetPerson(const string & first, const string & last, int & dob);

// Note the asymmetry of GetPerson and OutputPerson, due to the SSN
void OutputPerson(const string & first, const string & last, int dob, int ssn);

void ProcessPerson(const string & first, const string & last)
{
	// Lots of temporaries are needed with structure aversion.
	// Note how they're "viral" - the dob out param temporary
	// necessitates the ssn temporary, which otherwise could 
	// have been passed along inline to OutputPerson.
	int dob;
	int ssn = GetPerson(first, last, dob);
	OutputPerson(first, last, dob, ssn);
}

// ... after refactoring ...

// A few POD structs, local or private to a class or module.
// Note the separation of key fields into a base struct, which
// is subclassed for the non-key fields, providing a tidy way
// to partition a struct based on "in" versus "in/out" usage.
struct PersonName
{
	string First;
	string Last;
};
struct Person : PersonName
{
	int SSN;
	int DOB;
};

Person GetPerson(const PersonName & name);

void OutputPerson(const Person & person);

void ProcessPerson(const string & first, const string & last)
{
	// The functional style may not be to everyone's taste,
	// but it shows the potential for compactness that 
	// is otherwise impossible with structure aversion.
	OutputPerson(GetPerson(PersonName{ first, last }));
}

Structure aversion is “compositional” in a way.  If the pre-factored code above looks bad, it looks worse when we have to deal with collections.  In this case, structure aversion leads to the use of parallel collections of primitives, versus a single collection of structures. This follows naturally from the example above. After all, without a Person abstraction, how would one transmit the details of several people? One might call this “column-centric” versus “row-centric” data access.  And while it may make sense for big data platforms like Google BigQuery, you have to learn to walk before you can fly (or, learn the rules before you break them).  Here’s what I found in my refactoring exercise:

// Without a Person abstraction, people must be collected partwise.
int GetPeople(vector<string> & firstNames, vector<string> & lastNames, vector<int> & dobs, vector<int> & ssns);

// Parallel collections are passed in as out params and 
// processed in parallel.  Lots of opportunity for bugs.
void ProcessPeople()
{
	vector<string> firstNames;
	vector<string> lastNames;
	vector<int> dobs;
	vector<int> ssns;
	auto numPeople = GetPeople(firstNames, lastNames, dobs, ssns);
	for( int i = 0; i < numPeople; i++ )
	{
		OutputPerson(firstNames[i], lastNames[i], dobs[i], ssns[i]);
	}
}

// ... after refactoring ...

// A much simpler factory
vector<Person> GetPeople();

// And its use
void ProcessPeople()
{
	auto people = GetPeople();
	for_each(begin(people), end(people), [](const Person & person){ OutputPerson(person); });
}

In the examples above, I’ve simplified things greatly.  The original code had many more fields (and consequently, collections), and was using MFC collections, which made the parallel iteration even messier.

Factoring Data

In addition to using structures to aggregate runtime data, as above, we can also use them for compile-time constant data.  With this technique, aggregate initialization is used to create a static array of POD structs.  This approach can be used in a number of common scenarios, such as:

  • Render passes
  • Token parsers
  • UI control layout

Let’s take the last one as an example – UI control layout. This is a bit contrived, as these days I’d use a layout engine for new UIs. On the other hand, I find myself refactoring legacy code like this quite often, so I mention it.

enum Color
{
	Red, Green, Blue, Gray
};

struct Button
{
	Button(const string & caption, int color, bool isDefault = false);
};

struct Dialog
{
	void AddControl( Button * button, int left, int top, int width, int height );

	// This kind of structureless layout code is common, 
	// especially when machine-generated.  
	void Layout()
	{
		auto ok = new Button("OK", Red, true);
		AddControl( ok, 10, 400, 80, 20 );
		auto cancel = new Button("Cancel", Green);
		AddControl( cancel, 100, 400, 80, 20 );
		auto apply = new Button("Apply", Blue);
		AddControl( apply, 190, 400, 80, 20 );
		auto help = new Button("Help", Gray);
		AddControl( help, 280, 400, 80, 20 );
	}
};

// ... after refactoring ...

void Dialog::Layout()
{
	// A local POD struct to collect all constant button data
	struct ButtonData
	{
		const char * name;
		Color color;
		int left;
		// we can place "optional" (zero-defaulted) fields at 
		// the end of the struct, relying on the behavior of 
		// aggregate initialization for typical cases.
		bool isDefault;	
	} 
	buttons[] = 
	{
		{"OK", Red, 10, true},
		{"Cancel", Green, 100},
		{"Apply", Blue, 190},
		{"Help", Gray, 280},
	};
	// Now, we loop over the data defining each button
	// The code is slowly morphing from imperative to declarative.
	// Perhaps more importantly, we've "created new information",
	// adding structure to establish the cohesiveness of each
	// pair of statements in the original version.
	for (const auto & buttonData : buttons)
	{
		auto b = new Button(buttonData.name, buttonData.color, buttonData.isDefault);
		AddControl( b, buttonData.left, 400, 80, 20 );
	}
}

// The refactoring above is an improvement, but all the references to
// buttonData are a bit of a code smell (Inappropriate Intimacy).
// ... so, after refactoring again ...

void Dialog::Layout()
{
	struct ButtonData
	{
		const char * name;
		Color color;
		int left;
		bool isDefault;
		// The appeal of this technique - we can drop little
		// bits of logic into the POD to encapsulate and clean
		// up call sites even more.
		void Create(Dialog & dialog)
		{
			auto b = new Button(name, color, isDefault);
			dialog.AddControl(b, left, 400, 80, 20);
		}
	} 
	buttons[] = 
	{
		{"OK", Red, 10, true},
		{"Cancel", Green, 100},
		{"Apply", Blue, 190},
		{"Help", Gray, 280},
	};
	// Now, we've got a much more declarative approach
	for (auto & buttonData : buttons)
	{
		buttonData.Create(*this);
	}
}

Chunky Functions

It’s important to point out that fine-grained structure should exist not just with types, but with functions too.  Simple one-line inline methods may seem gratuitous at first, but they can enhance readability and maintainability of code tremendously.  I recently refactored some XML-generation code that had a very low signal to noise ratio (probably less than 1), due to all the Xerces clutter.  I spent an hour writing a handful of methods for creating elements and attributes, in a literate style (permitting method-chaining).  The number of lines of client code was cut in half and the readability doubled, at least.  This kind of structuring is also helpful for conversions, casts, factories, and many other small sequences of oft-repeated statements and expressions.  It’s also helpful for “object-oriented currying” (as I call methods implemented in terms of overloads).
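
To illustrate, here’s a minimal sketch of the kind of wrapper I mean – a hypothetical fluent element builder, not the actual Xerces code.  A few one-line, chainable methods clean up the call sites, and the overload implemented in terms of another is an example of “object-oriented currying”:

#include <string>
#include <utility>
#include <vector>
using namespace std;

// A tiny element builder.  Each helper is a one-liner returning *this,
// permitting method-chaining at the call site.
class Element
{
public:
	explicit Element(string name) : name_(move(name)) {}

	Element & Attribute(const string & key, const string & value)
	{
		attributes_.emplace_back(key, value);
		return *this;
	}

	// "Object-oriented currying": an overload implemented in terms of another.
	Element & Attribute(const string & key, int value)
	{
		return Attribute(key, to_string(value));
	}

	Element & Text(const string & text)
	{
		text_ = text;
		return *this;
	}

private:
	string name_;
	string text_;
	vector<pair<string, string>> attributes_;
};

// Client code now reads almost like the XML it describes:
//   Element("person").Attribute("ssn", 1234).Text("Jane Doe");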

References

In closing, I’ll point out that this is all well-trod ground, but it bears repeating.  I’ve never regretted jumping into a ruthless refactoring leading to greater structure.  There may be moments when I wonder how long it will take to get the engine block back into the car, but the end always justifies it. Adding new information makes the code more self-describing, and operating at a higher level of abstraction makes it more declarative.  Both of these contribute to making the code clearer.

For further research, I recommend Martin Fowler’s Refactoring: Improving the Design of Existing Code (a bit dated, but still relevant), as well as Andrew Binstock’s Dr. Dobb’s article on fine-grained structures.  The challenge of constraining oneself to 50-line classes is a great way to enforce a discipline that is sure to grow on you. In the meantime, I encourage you to create structure at every opportunity.

Naked Aggression with the Heap

I’ll begin my Idiosyncratic C++ series with a discussion of the heap, its abuses, and its alternatives.

The Heap

The heap does not seem to be well understood among many C++ programmers.   I’m referring to the trade-offs involved with using it, not its underlying implementation.  Heap allocation should be a means of last resort, a necessary evil for getting the job done.   As a general pool of memory available for program use, it is slower and riskier to use than the stack.  And yet, many beginning C++ programmers are enamored with it.  Perhaps it’s a misappropriation of the ‘new’ keyword from managed languages like Java and C# – a conflation of the concepts of object creation and heap allocation.  In any case, it’s often the first tool the programmer reaches for in allocating memory.  And when it is deemed necessary, it is often used both “nakedly” and “aggressively”, leading to unnecessary risks for program safety.

The Stack

By contrast, the stack is the original, automatic, hardware-accelerated memory manager.  It’s impossible to leak stack memory, and difficult to double-delete it (or single-delete it, for that matter).  [The original purpose of the “auto” keyword in C/C++ was to designate an “automatic storage” variable with local block scope, in contrast to the “static” keyword.  But since the language provided “auto” as a default, the keyword found little use and has been re-purposed for type deduction.]  At its foundation, the computer is a stack-based machine.  Functional programmers know this and design their algorithms around it.  But it’s just as true for the object-oriented programmer.  Cultivating a “stack-based discipline” can vastly simplify designs, eliminate whole classes of bugs (heap-based), and improve efficiency and readability.

Aggressive

Many heap allocations are simply not necessary.  Examples include the following; a short sketch of all three follows the list:

  • Locals
    Local variables newed and deleted entirely within a given block scope.   This must certainly be a holdover from managed languages.
  • Members
    Aggregate classes newing members in a constructor and deleting them in a destructor.  Sometimes this is necessary if a member does not have a public constructor (is obtained from a factory), is implementing the pimpl pattern, and so on.  However, I have seen many cases where member data is created on the heap with evidently no conscious thought.
  • Collections
    The (mistaken) belief that every collection must be implemented with a heap-based library template.  The corollary to this is that aggregate arrays are evil.  This is probably a misapplication of the good advice not to allocate raw arrays from the heap, preferring standard collections.  But this assumes that dynamic allocation is even needed in the first place.
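
Here’s a small sketch showing all three of these habits at once (the Mesh and Point types are hypothetical stand-ins):

#include <string>
#include <vector>
using namespace std;

struct Point { double x, y, z; };

// Members: heap-allocated with evidently no conscious thought, dragging in
// a hand-written destructor (and, really, the rest of the rule of three).
class Mesh
{
public:
	Mesh() : name_(new string("mesh")) {}
	~Mesh() { delete name_; }
private:
	string * name_;
};

double SumX()
{
	// Locals: newed and deleted entirely within one block scope.
	Mesh * mesh = new Mesh();

	// Collections: a heap-based container for three compile-time constants.
	vector<Point> * axes = new vector<Point>();
	axes->push_back(Point{ 1, 0, 0 });
	axes->push_back(Point{ 0, 1, 0 });
	axes->push_back(Point{ 0, 0, 1 });

	double sum = 0;
	for (const auto & p : *axes)
		sum += p.x;

	// Every return path (and any exception) must remember both deletes.
	delete axes;
	delete mesh;
	return sum;
}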

Naked

Many heap allocations are made directly from client code.  Calls to new/malloc and delete/free are manually paired, and may be arbitrarily distant in code.  The risk of mismanaging these allocations is high.  Leaks can result from exceptions, or multiple function returns, or simply overlooking a terminal branch.  Double-deletes may result from pointer aliasing, or not having clear responsibility for ownership.

Avoiding the Heap

Now that we’ve identified a few risks of heap use, let’s discuss ways to avoid it.  (A sketch of the stack-based alternatives follows the list.)

  • Locals
    For local variables, the cure is simple – allocate the object directly on the stack, rather than through a pointer.
  • Members
    The same approach should be used for member data.  This does not guarantee the member will live on the stack.  Instead, it ties the allocation policy of the member to its containing class, so that it may live on the stack or on the heap.
  • Collections
    When dealing with simple, static collections, use an aggregate-initialized array.  There are some restrictions to this technique, when dealing with user-defined types.  However, these can usually be worked around, and should improve as the rules for POD types are revised and loosened.  There is virtually no advantage to dynamically allocating a collection of static (compile-time constant) objects, when a fixed array will do.
  • Algorithms
    Some heap allocations may indicate a flawed algorithm.  An example would be the creation or collection of data at a deeper level in the stack than where it is consumed.  In this case, the data must clearly be placed on the heap so it can “swim upstream”, against the call stack current.  The solution is to invert the pattern and use a more “functional” approach, passing  data down the call stack to where it is needed.  This can be generalized to multiple data flows.
  • Recursion
    Similar to the preceding is the use of a recursive approach over a loop-based approach.  This can be especially helpful in avoiding the creation of short-lived heap-based collections, where an element may depend on the computation of prior elements (e.g., Fibonacci).  A map-reduce pattern can then move the reducing logic to the terminal condition of the recursion.
  • alloca
    Finally, consider that “dynamic allocation” != “heap allocation”.  The CRT function alloca enables fast, safe, dynamic allocation on the stack.  This can be very effective in performance-critical situations where many small objects are frequently allocated and freed, such as with 3D mesh processing algorithms.  [Note: there are some challenges to the effective use of alloca and its ilk, _malloca and _freea.  The scoping requirements are tricky, foiling attempts to wrap calls in an RAII pattern.  But the utility of the function outweighs these concerns.]
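
And here is the earlier hypothetical sketch with the heap avoided entirely – the local lives on the stack, the member is held by value, and the collection is a fixed, aggregate-initialized array:

#include <string>
using namespace std;

struct Point { double x, y, z; };

// Members: held by value, so the name lives wherever the containing Mesh lives.
struct Mesh
{
	string name = "mesh";
};

double SumX()
{
	// Locals: allocated directly on the stack.
	Mesh mesh;

	// Collections: a static, aggregate-initialized array of PODs.
	const Point axes[] =
	{
		{ 1, 0, 0 },
		{ 0, 1, 0 },
		{ 0, 0, 1 },
	};

	double sum = 0;
	for (const auto & p : axes)
		sum += p.x;

	// Nothing to delete, nothing to leak.
	return sum;
}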

Holding the Heap at Arm’s Length

In the event the heap cannot be avoided, it should be “treated with modesty”.  Direct calls to new/delete should be avoided if possible, or their exposure minimized.  Smart pointers should be used systematically and universally, to ensure exception-safe unwind behavior.  Locals and members which require heap allocation (factory-based, pimpl) should be wrapped in smart pointers.  (A short sketch of both follows the list.)

  • For shared (aliased) reference-counted smart pointers, std::shared_ptr  should be used.  The std::make_shared template function should be used to allocate std::shared_ptr objects safely and efficiently.
  • For simple, unshared “singleton” smart pointers, either as class members or local variables, std::unique_ptr should be used to manage heap calls in an exception-safe manner.  As with std::make_shared, std::make_unique may be offered in the future for creating std::unique_ptr objects safely.
  • [Note that with C++11, legacy smart pointer templates, such as auto_ptr, are now considered obsolete.]
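
A brief sketch of both (Widget is a hypothetical stand-in):

#include <memory>
using namespace std;

struct Widget { int id; };

void UseSmartPointers()
{
	// Shared, reference-counted ownership: make_shared performs a single
	// allocation for the object and its control block.
	auto shared = make_shared<Widget>();
	auto alias = shared;                    // aliasing is safe; the count tracks it

	// Sole ownership, e.g. for a factory-produced or pimpl member.
	// (Until make_unique arrives, this is the one place 'new' appears.)
	unique_ptr<Widget> sole(new Widget());

	// Both objects are released automatically, even if an exception
	// unwinds the stack before the end of this scope.
}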

In summary, my advice is to use the heap as sparingly and indirectly as possible.  Your peers, testers, and customers will appreciate it.

 

Idiosyncratic C++

Happy St. Patrick’s Day!

Years ago I was in Florida courting a prospective customer.  In one of our initial biz-dev meetings, we were trying to get our arms around the opportunity, and were a bit paralyzed.  Finally, Jack, the customer’s colorful VP of Sales, said “what we need is a dead cat!”.  My partner and I looked at each other blankly.  Jack explained, “A dead cat is a brainstorming technique we use.  Someone puts a lousy idea on the table to get started.  Everyone takes a look at it and says, ‘We don’t know what we want, but we know it’s NOT THAT!’ ”

Sometimes, a good way to understand something is to identify what it is not.  With that technique in mind, I’m going to start a series of posts on idiomatic C++.  In each post, I’ll start with an anti-pattern or two of what Andrei Alexandrescu calls “C+”.  Rather than being idiomatic C++, we might call this style “idiosyncratic C++” (“idiotic C++” is probably too strong).  Many arrive at idiomatic C++ by way of idiosyncratic C++, particularly if they’ve come from other languages like C or Java.  They’ve learned the hard way how to write safe, efficient, maintainable, platform-independent C++ in a conventional style – in short, idiomatically.  Those trench experiences are gold <insert tenuous tie-in with St. Patrick’s Day pot-of-gold image>, and worth exploring and sharing.

My working definition of idiosyncratic C++ is:

Idiosyncratic C++:  Code that a compiler will accept, but a reviewer should not.

In this series, I’ll cover topics such as:

  • Using The Heap
  • Structured Code
  • Eager Evaluation
  • Naming Conventions

I’m open to requests and feedback.  Send them in and stay tuned…