What’s Your Tweet?

If you had to give a single, pithy piece of advice to a junior programmer today, what would it be?

Some possibilities…

  • Test First – TDD, SDD, …
  • Structure Everything
  • Avoid the Heap – use a stack-based discipline
  • Hop Around – gain breadth
  • Stay Put – gain depth
  • Read Voraciously
  • Specialize – Mobile, SOA, Big Data, Concurrency, …
  • Generalize – All of the above
  • Embrace and Extend
  • Don’t Be Evil
  • Learn Idiomatic Style
  • Lawyer Up – C++, C#, Java, Ruby, Haskell…
  • Be Polyglot – A Language Per Year
  • Be Multi-Paradigm – OO, FP, Prototype…
  • Grok TMP, Monads, …
  • Embrace Scrum, Kanban, XP, …
  • Eschew Scrum, Kanban, XP, …
  • Contribute to the Community
  • Share – Teach, Mentor, Speak, Publish

Since good programmers appear to be born and not made, maybe the best advice is to “Know Thyself”. Unless you have, or have a way to develop, an intuition for good programming practice, you may be engaged in a Sisyphean task.

Structure Aversion

A well-designed system is like a fractal image – it’s highly compositional. One can see finer degrees of structure at any level of detail.

My Idiosyncratic C++ series would not be complete without a discussion of structure. 

Good object-oriented (OO) programmers understand the basic principles of OO analysis.  And many make a respectable effort in decomposing a system into independent, reusable, composable building blocks.  Unfortunately, many such efforts end there, resulting in coarse-grained, poorly-structured designs.  In this post, I’ll explore what a well-structured design does, and does not, look like.

C++ has been described as two languages in one: a low-level language of constructors, destructors, conversion and assignment operators, and the like; and a high-level domain-specific language undergirded by the first. But highly structured code contains N “languages”, and is like a fractal image.  It could also be compared to an objective hierarchy, wherein the levels below answer the question “how?”, and the levels above, the question “what?” or “why?”  A good programmer is adept at shifting levels as necessary, and works hard not to conflate them.

Information Content

Well-factored code is synonymous with highly structured code.  You cannot have one without the other.  An information theorist might say that highly structured code contains more information.  This is true in two senses.  

First, structured code has greater “information density”.  Here, we’re not creating information so much as we are distilling and concentrating it.  The same signal remains, but there is far less noise.  And since every line of code is a liability, the fewer it takes to express an idea, the better.  (Fewer to write, compile, debug, review, test, document, ship, maintain, …)

Second, creating more structure implies defining more abstractions, which is a powerful programming exercise.  Articulating the utility and purpose of every artifact, and formalizing the relationships between them, makes the programmer’s intent as explicit as possible.  Taking the knowledge latent inside the gray matter, and “injecting” it into the system creates new information.  And as an information theorist might also say, there is “no information without physical representation.”  (OK, it still exists somewhere in the programmer’s head, but it sure isn’t very accessible.)

What’s In a Name?

As new degrees of structure are created, it’s probably fairly apparent that new names are needed (lambdas notwithstanding).  A good test for one’s own understanding of a system is the ability to name each part of it.  I have found that when I labor over the right name for an artifact – be it a namespace, class, method or whatever – it may be a symptom that my design is a bit muddled.  Most likely, it needs more fine-grained structure, permitting each piece to abide by the single responsibility principle.  Taming the naming beast is a good thing.  If I can create the best name for an artifact right now, I’m doing my peers and my future self a favor.  Conversely, if I can’t understand the system today (well enough to name it), how can I hope to do so when I have to maintain or extend it a year from now?

Structure Aversion

That brings us to our idiosyncratic practice – what I’d simply call “structure aversion.”  It’s manifested in many ways, which I’ll describe, but first I want to get down to the (de)motivation for it.  I think it’s most common among beginners, perhaps driven by a distaste for the “overhead” needed to define a new structure.  Some developers seem to think that every class warrants dedicated header and source files, complete field encapsulation, serializability, a host of virtual methods implementing various interfaces, etc.  (The corollary of this is that having many local structures should be prohibited.)  This may be applicable for some singly-rooted hierarchy languages, especially managed ones.  But C++ inherits from C, which embraces lightweight, native structs.  And these are ideal for POD types, tuples, parameter objects, compound returns, local aggregate arrays, and so on.
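
To make the point about lightweight structs concrete, here is a minimal sketch of a compound return.  The names MinMax and FindMinMax are invented for this illustration and don't come from any code I refactored; the point is only how little ceremony such a struct needs.

```cpp
#include <cassert>
#include <vector>

// A lightweight POD struct: no header of its own, no encapsulation,
// no virtual methods.  (MinMax and FindMinMax are hypothetical names.)
struct MinMax
{
	int Min;
	int Max;
};

// Returns both results together instead of using out params.
// Precondition: values must not be empty.
MinMax FindMinMax(const std::vector<int> & values)
{
	assert(!values.empty());
	MinMax result = { values.front(), values.front() };
	for (int v : values)
	{
		if (v < result.Min) result.Min = v;
		if (v > result.Max) result.Max = v;
	}
	return result;
}
```

A caller can consume the result inline, without declaring temporaries just to receive out params.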

Let’s look at some examples of structure aversion, similar to what I recently encountered during a refactoring exercise.  The first is a simple case of structuring a function’s parameters and returns, which can lead to a more compact, fluent, functional style.

// Factory: the return value is assigned to an arbitrary bit of
// person data (SSN) - the "lucky winner!"  Also, it's difficult 
// to tell at a glance which params are "in" and which are "out".
int GetPerson(const string & first, const string & last, int & dob);

// Note the asymmetry of GetPerson and OutputPerson, due to the SSN
void OutputPerson(const string & first, const string & last, int dob, int ssn);

void ProcessPerson(const string & first, const string & last)
{
	// Lots of temporaries are needed with structure aversion.
	// Note how they're "viral" - the dob out param temporary
	// necessitates the ssn temporary, which otherwise could 
	// have been passed along inline to OutputPerson.
	int dob;
	int ssn = GetPerson(first, last, dob);
	OutputPerson(first, last, dob, ssn);
}

// ... after refactoring ...

// A few POD structs, local or private to a class or module.
// Note the separation of key fields into a base struct, which
// is subclassed for the non-key fields, providing a tidy way
// to partition a struct based on "in" versus "in/out" usage.
struct PersonName
{
	string First;
	string Last;
};
struct Person : PersonName
{
	int SSN;
	int DOB;
};

Person GetPerson(const PersonName & name);

void OutputPerson(const Person & person);

void ProcessPerson(const string & first, const string & last)
{
	// The functional style may not be to everyone's taste,
	// but it shows the potential for compactness that 
	// is otherwise impossible with structure aversion.
	// PersonName is an aggregate with no constructor, so it's
	// brace-initialized.
	OutputPerson(GetPerson(PersonName{first, last}));
}

Structure aversion is “compositional” in a way.  If the pre-factored code above looks bad, it looks worse when we have to deal with collections.  In this case, structure aversion leads to the use of parallel collections of primitives, versus a single collection of structures.  This follows naturally from the example above.  After all, without a Person abstraction, how would one transmit the details of several people?  One might call this “column-centric” versus “row-centric” data access.  And while it may make sense for big data platforms like Google BigQuery, you have to learn to walk before you can fly (or, learn the rules before you break them).  Here’s what I found in my refactoring exercise:

// Without a Person abstraction, people must be collected partwise.
int GetPeople(vector<string> & firstNames, vector<string> & lastNames, vector<int> & dobs, vector<int> & ssns);

// Parallel collections are passed in as out params and 
// processed in parallel.  Lots of opportunity for bugs.
void ProcessPeople()
{
	vector<string> firstNames;
	vector<string> lastNames;
	vector<int> dobs;
	vector<int> ssns;
	auto numPeople = GetPeople(firstNames, lastNames, dobs, ssns);
	for( int i = 0; i < numPeople; i++ )
	{
		OutputPerson(firstNames[i], lastNames[i], dobs[i], ssns[i]);
	}
}

// ... after refactoring ...

// A much simpler factory
vector<Person> GetPeople();

// And its use
void ProcessPeople()
{
	auto people = GetPeople();
	for_each(begin(people), end(people), [&](Person & person){OutputPerson(person);});
}

In the examples above, I’ve simplified things greatly.  The original code had many more fields (and consequently, collections), and was using MFC collections, which made the parallel iteration even messier.
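
One way to see the hazard of parallel collections is to try reordering them.  The sketch below assumes a flattened stand-in for the Person struct above, and SortByLastName is a hypothetical helper, not part of the original code.  With a single vector of structs, a sort is one call and every field travels with its row; with parallel vectors, each collection would have to be permuted identically by hand.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// Simplified stand-in for the Person struct above, flattened so
// this sketch is self-contained.
struct Person
{
	std::string First;
	std::string Last;
	int SSN;
	int DOB;
};

// Hypothetical helper: orders people by last name, then first name.
// Each Person moves as a unit, so the fields can't get out of step.
void SortByLastName(std::vector<Person> & people)
{
	std::sort(begin(people), end(people),
		[](const Person & a, const Person & b)
		{
			return std::tie(a.Last, a.First) < std::tie(b.Last, b.First);
		});
}
```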

Factoring Data

In addition to using structures to aggregate runtime data, as above, they can also be used for compile-time constant data.  With this technique, aggregate initialization is used to create a static array of POD structs.  This approach can be used in a number of common scenarios, such as:

  • Render passes
  • Token parsers
  • UI control layout

Let’s take the last one as an example – UI control layout. This is a bit contrived, as these days I’d use a layout engine for new UIs. On the other hand, I find myself refactoring legacy code like this quite often, so I mention it.

enum Color
{
	Red, Green, Blue, Gray
};

struct Button
{
	Button(const string & caption, int color, bool isDefault = false);
};

struct Dialog
{
	void AddControl( Button * button, int left, int top, int width, int height );

	// This kind of structureless layout code is common, 
	// especially when machine-generated.  
	void Layout()
	{
		auto ok = new Button("OK", Red, true);
		AddControl( ok, 10, 400, 80, 20 );
		auto cancel = new Button("Cancel", Green);
		AddControl( cancel, 100, 400, 80, 20 );
		auto apply = new Button("Apply", Blue);
		AddControl( apply, 190, 400, 80, 20 );
		auto help = new Button("Help", Gray);
		AddControl( help, 280, 400, 80, 20 );
	}
};

// ... after refactoring ...

void Dialog::Layout()
{
	// A local POD struct to collect all constant button data
	struct ButtonData
	{
		const char * name;
		Color color;
		int left;
		// we can place "optional" (zero-defaulted) fields at 
		// the end of the struct, relying on the behavior of 
		// aggregate initialization for typical cases.
		bool isDefault;	
	} 
	buttons[] = 
	{
		{"OK", Red, 10, true},
		{"Cancel", Green, 100},
		{"Apply", Blue, 190},
		{"Help", Gray, 280},
	};
	// Now, we loop over the data defining each button
	// The code is slowly morphing from imperative to declarative.
	// Perhaps more importantly, we've "created new information",
	// adding structure to establish the cohesiveness of each
	// pair of statements in the original version.
	for (auto & buttonData : buttons)
	{
		auto b = new Button(buttonData.name, buttonData.color, buttonData.isDefault);
		AddControl( b, buttonData.left, 400, 80, 20 );
	}
}

// The refactoring above is an improvement, but all the references to
// buttonData are a bit of a code smell (Inappropriate Intimacy).
// ... so, after refactoring again ...

void Dialog::Layout()
{
	struct ButtonData
	{
		const char * name;
		Color color;
		int left;
		bool isDefault;
		// The appeal to this technique - we can drop little
		// bits of logic into the POD to encapsulate and clean
		// up call sites even more.
		void Create(Dialog & dialog)
		{
			auto b = new Button(name, color, isDefault);
			dialog.AddControl(b, left, 400, 80, 20);
		}
	} 
	buttons[] = 
	{
		{"OK", Red, 10, true},
		{"Cancel", Green, 100},
		{"Apply", Blue, 190},
		{"Help", Gray, 280},
	};
	// Now, we've got a much more declarative approach
	for (auto & buttonData : buttons)
	{
		buttonData.Create(*this);
	}
}
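
The same technique carries over to the other scenarios listed earlier.  Here is a minimal sketch of a table-driven token lookup; TokenKind, TokenData and ClassifyToken are names made up for this illustration, not taken from any real parser.

```cpp
#include <cstring>

enum TokenKind
{
	Unknown, Plus, Minus, Arrow
};

// Hypothetical lookup: maps a piece of text to its token kind.
TokenKind ClassifyToken(const char * text)
{
	// A local POD struct holding the constant data for each token.
	static const struct TokenData
	{
		const char * text;
		TokenKind kind;
	}
	tokens[] =
	{
		{"+", Plus},
		{"-", Minus},
		{"->", Arrow},
	};
	// One loop over the table replaces a chain of if/else comparisons.
	for (const auto & token : tokens)
	{
		if (std::strcmp(token.text, text) == 0)
		{
			return token.kind;
		}
	}
	return Unknown;
}
```

Adding a token then means adding a row to the table, rather than another branch of logic.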

Chunky Functions

It’s important to point out that fine-grained structure should exist not just with types, but with functions too.  Simple one-line inline methods may seem gratuitous at first, but they can enhance readability and maintainability of code tremendously.  I recently refactored some XML-generation code that had a very low signal-to-noise ratio (probably less than 1), due to all the Xerces clutter.  I spent an hour writing a handful of methods for creating elements and attributes, in a literate style (permitting method-chaining).  The number of lines of client code was cut in half and the readability doubled, at least.  This kind of structuring is also helpful for conversions, casts, factories, and many other small sequences of oft-repeated statements and expressions.  It’s also helpful for “object-oriented currying” (as I call methods implemented in terms of overloads).
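
As a rough illustration of the literate, method-chaining style (not the Xerces-based code itself), here is a tiny self-contained sketch.  The Element type and its Attr, Child and ToString methods are hypothetical, and escaping of special characters is deliberately left out.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an XML element; not a Xerces type.
// Note: no escaping of special characters is performed.
struct Element
{
	std::string Name;
	std::vector<std::pair<std::string, std::string>> Attrs;
	std::vector<Element> Children;

	explicit Element(const std::string & name) : Name(name) {}

	// Each small method returns *this to permit chaining.
	Element & Attr(const std::string & key, const std::string & value)
	{
		Attrs.emplace_back(key, value);
		return *this;
	}
	Element & Child(const Element & child)
	{
		Children.push_back(child);
		return *this;
	}

	// Serializes the element, its attributes and its children.
	std::string ToString() const
	{
		std::string xml = "<" + Name;
		for (const auto & attr : Attrs)
		{
			xml += " " + attr.first + "=\"" + attr.second + "\"";
		}
		if (Children.empty())
		{
			return xml + "/>";
		}
		xml += ">";
		for (const auto & child : Children)
		{
			xml += child.ToString();
		}
		return xml + "</" + Name + ">";
	}
};
```

With helpers like these, client code can read as a single expression, for example Element("person").Attr("id", "1").Child(Element("name").Attr("first", "Ada")).ToString(), instead of a run of create-and-append statements.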

References

In closing, I’ll point out that this is all well-trod ground, but it bears repeating.  I’ve never regretted jumping into a ruthless refactoring leading to greater structure.  There may be moments when I wonder how long it will take to get the engine block back into the car, but the end always justifies it.  Adding new information makes the code more self-describing, and operating at a higher level of abstraction makes it more declarative.  Both of these contribute to making the code clearer.

For further research, I recommend Martin Fowler’s Refactoring: Improving the Design of Existing Code (a bit dated, but still relevant), as well as Andrew Binstock’s Dr. Dobb’s article on fine-grained structures.  The challenge of constraining oneself to 50-line classes is a great way to enforce a discipline that is sure to grow on you.  In the meantime, I encourage you to create structure at every opportunity.