Monthly Archives: April 2013

Syncing Enums and Arrays

Here is a basic technique that may not be familiar to all…

Often, the need arises to define a symbolic enumeration, along with a “companion” array of complementary data such as string names.  With default compiler-assigned values, the enumeration serves as a convenient index into the companion array, as in the following example:

// Enumeration and companion array, often 
// accompanied by a comment like this:
// WARNING: Must keep ColorEnum and ColorNames in sync!
enum ColorEnum
{
	Red,
	Green,
	Blue
};

static const char * ColorNames[] = 
{ 
	"Red",
	"Green",
	"Blue"
};

void PrintColor(ColorEnum c)
{
	cout << ColorNames[c];
}

The risk of this technique is in maintaining coordination between symbols and slots - both the sequence and the count.  Beyond that, there is violation of the DRY principle ("don't repeat yourself"), creating more opportunities for things to get out of sync.

The solution is to use a pair of macros.  The first, which we'll call sequence, defines an abstract sequence of elements.  The second, which we'll call select, defines the attributes of each element.  Often, the function of the select macro is simply to select one or more entities from the abstract definition in sequence.  But the capability is general and can be used to define any sequence, including enums, static C-style arrays, static const members, parse tables, etc.

Following is a code sample demonstrating the basic idea.

// Define abstract sequence of elements, each of which 
// programmatically "selects" appropriate attributes
#define sequence \
	select( Red, 0xFF0000 ) \
	select( Green, 0x00FF00 ) \
	select( Blue, 0x0000FF )

// Generate symbolic enum: Red, Green, Blue
#undef select
#define select(symbol, value) symbol,
enum ColorEnum { sequence };

// Generate color name sequence: "Red", "Green", "Blue"
// Indexed by ColorEnum above
#undef select
#define select(symbol, value) #symbol,
static const char * ColorNames[] = { sequence };

// Generate color value sequence: 0xFF0000, 0x00FF00, ...
// Indexed by ColorEnum above
#undef select
#define select(symbol, value) value,
static int ColorValues[] = { sequence };

// Generate color struct sequence: {"Red", 0xFF0000}, {"Green", 0x00FF00}, ...
// Indexed by ColorEnum above
// Note the definition of select as a variadic macro, 
// to conveniently pick up all remaining element values.
#undef select
#define select(symbol, ...) { #symbol, __VA_ARGS__ }, 
static struct Color
{
	const char * symbol;
	int value;
}
Colors[] = { sequence };

// Define public static color symbols with numeric values
#undef select
#define select(symbol, value) static const int symbol = value;
struct Colors { sequence };

For a deeper dive into code-gen with the preprocessor, I recommend reading Appendix A, “An Introduction to Preprocessor Metaprogramming,” from the excellent C++ Template Metaprogramming by David Abrahams and Aleksey Gurtovoy.


C++ Extension Methods


Here’s another wish for C++14:  Extension Methods.

I’m a huge fan of these in C#.  In case you’re not familiar, I’ll give you a quick introduction.  An extension method is simply a free function that permits the interface of a class to be extended outside the class definition.  This is achieved through a magical bit of syntactic sugar.  In C#, extension methods are declared static, with the first parameter representing the object to be extended by the method.  Intuitively, the keyword ‘this‘ is used to mark this parameter.  Here is an example:

public static class StringExtensions
{
    public static bool IsNullOrEmpty(this string s)
    {
        return string.IsNullOrEmpty(s);
    }

    public static int ParseInt(this string s)
    {
        return s == null ? 0 : int.Parse(s);
    }
};

void UsingExtensionMethods()
{
    string s = "42";

    // without extension methods
    var isNull = string.IsNullOrEmpty(s);
    var intVal = int.Parse(s);

    // with extension methods
    isNull = s.IsNullOrEmpty();
    intVal = s.ParseInt();

    // the syntactic equivalent, also permitted by the language
    isNull = StringExtensions.IsNullOrEmpty(s);
    intVal = StringExtensions.ParseInt(s);
}

Okay, so there’s a marginal improvement in compactness and readability, so what’s the big deal? I’ll tell you. Extension methods are all about tooling support. Here’s what the editing experience looked like as I typed the code above into my IDE:


Here, all I could remember was that I needed a method on the string class that had the word “null” in it somewhere.  And that’s all I needed to remember:  any modern IDE offers autocomplete, which is a great productivity boost.  It saves typing and it saves remembering.  The really clever thing about extension methods in .NET is that they afford an expando-like capability to strongly-typed compiled languages – something usually associated solely with dynamic languages, like Javascript or Ruby.

Koenig to the Rescue?

Some have argued that C++ does not need this feature, and that Argument Dependent Lookup is sufficient, but this misses the point of extension methods.  When we program in OO languages, the starting point for every new line of code is almost always an object.  We type in an “argument” (a value/reference/pointer to an instance of a type), usually with some help from autocomplete.  Then, we type in a member (method/property/event), again with help from autocomplete.  On the other hand, if I begin by typing the function, everything’s backwards.  Rather than scoping to the argument in question, I am considering everything that is in scope at the moment.  And there is no support the IDE can give me.  According to Anders Hejlsberg, that is precisely the reason LINQ queries buck the SQL approach and begin with the “from” keyword, rather than the “select” keyword – to support Visual Studio’s autocomplete feature (Intellisense).

Subclassing, Anyone?

Others have suggested the tried and true OO solution: subclassing.  However, this technique does not scale at all.  First, it cannot work with built-in primitive types like int.  Second, it doesn’t work for many UDTs (e.g., those with private constructors, non-virtual destructors, etc).  Third, it doesn’t support “optional” (nullable) arguments.  Fourth, it interacts poorly with code maintenance.  Given an existing body of code using a given type, say std::string, if I want to inject a new method foo() on that type, I have to subclass it, creating stringex.  Now, I have to chase all references to std::string through the codebase, replacing them with stringex.  At least, I’d have to do this as deep and far as I’d ever expect to need to call foo (and who can anticipate that?).  All remaining instances would simply be sliced back down to std::string.  Now imagine iterating that process for future maintenance cycles.  Hmmm.  Much has been written on the topic of abusing subclassing, by Herb Sutter, Scott Meyers and others.  In the end, it’s not a viable implementation strategy for extension methods.

Operator, Give Me Long Distance

Interestingly, C++ already comes very close to this capability when defining free binary operator functions, such as ostream & operator<<(ostream & os, const string & s).  In this case, the compiler automatically permits a more natural infix syntax for such operators.  When combined with returning the first argument back out of the operator, chained fluent-style programming can be achieved, like so:  cout << str1 << str2 << str3.  The standard library’s IO manipulators go one step further, defining << operators that take a free function, such as endl, permitting call chains like cout << str1 << endl.

Some (like Boost) have put all of this behavior to use by designating an arbitrary, infrequently-used operator to do the work of invoking arbitrary extension method functors.  In a nutshell, here’s how it looks:

template<typename extension>
auto operator|(string & s, extension method) -> decltype(method(s))
{
	// forward the result of the "extension method" to the caller
	return method(s);
}

function<string(string &)>
Left(int count)
{
	return [=](string & s)
	{
		return s.substr(0, count);
	};
}

void TestExtensionMethod()
{
	string s = "hello world";
	string s2 = s | Left(5);	// "hello"
	cout << s << s2;

	// oops - operator precedence fails - parens needed
	cout << (s | Left(5));
}

It’s testimony to the power of C++, but it quickly breaks down in real-world scenarios.  For example, operator precedence is not natural.  Also, we’re really back where we started.  The IDE doesn’t know we’re invoking a function on the right side of the operator, so it’s silent as the grave.  And without IDE support, there’s no gain in productivity.

The Request

In the end, there doesn’t appear to be a way to achieve extension methods with the language as it stands.  Therefore, I would propose that C++ be extended to permit an alternate calling syntax for any static free function, effectively eliding the first argument and passing it implicitly as the ‘this’ pointer.  The behavior should work closely with ADL rules to determine when and where scoping is necessary.  Of course, this feature would introduce ambiguities which the programmer would need to resolve.  An example that springs to mind immediately would be iterator begin and end methods.  Binding preference should be given to “regular” instance methods, over extension methods.  Unlike C#, I would want to permit calling an extension method from inside another method, again assuming all ambiguities are resolved.  This feature should also fully embrace template support.  Imagine generating families of helper methods on an arbitrary type (either primitive or UDT – it makes no difference).  Now we have something approximating a “mixin”.  Imagine combining that behavior in turn with template-based systems for parsing, serialization, etc.  This would be a powerful feature indeed.

P.S.  After writing this, I stumbled over the D language’s Uniform Function Call Syntax feature.  I think it fits the bill rather nicely!

Template Template Parameter Overloading

As the ISO C++ committee deliberates on what to include in the next iteration of the language, C++14, I’d submit the following for their consideration…

Often, the need arises to decorate one type with another (typically through derivation or some variation on CRTP). Template Template Parameters are a powerful tool for implementing this, as Alexandrescu demonstrates with his scattered and linear hierarchy generation (Modern C++ Design). The problem is that currently, C++ demands a uniform interface for such parameters – they must have identical parameter lists, despite the fact that “regular” templates and run-time functions offer support for both overloading on parameter lists and for defaulting parameter values.  

There does not appear to be a sound reason for this limitation, as all ambiguity would necessarily be resolved at compile time. This is particularly onerous when using third-party templates which make heavy use of parameter defaulting themselves (e.g., collections) as template parameters.  The defaulting nature is not transparently supported by the referencing template.  One possible implementation might be to use variadic templates.  Another would be to use template typedefs (aliases) to “hide” the optional parameters.  These feel more like workarounds.  My preference would be for the language to simply support this capability natively.  C++ may never achieve the homoiconicity of functional languages, with the unification of compile-time and run-time features, but it can take steps to remove arbitrary distinctions.

Code Progressions

I love jazz.  Especially when a musician improvises over the changes – the “changes” being chord progressions in a given section of a song.  One of my favorites is Coltrane’s Giant Steps.  I’m in awe when a soloist can float above complex chord changes without getting lost.  The key, I’m sure, is having the progression memorized.

Here’s an analogy: I’ve found it helpful to memorize several progressions in my programming. This helps ensure that my code is safe, efficient, and loosely coupled, without conscious effort – like a musician soloing over the changes. The following code progressions might be helpful for a programmer new to the art, and seeking to develop an idiomatic approach.  Each sequence begins with safest, most restrictive option, progressing to greater complexity and capability as necessary.

Member Accessibility

  • Private
  • Protected
  • Public

This is the progression every object-oriented programmer begins with. Keeping the surface area of a class as small as possible reduces complexity, test case coverage, and cognitive load for consumers of the class (i.e., it does your peers a favor).

Member Declaration

  • Module Static
  • Class Static Inline
  • Class Static Out-of-line
  • Const Non-Virtual Inline
  • Const Non-Virtual Out-of-line
  • Modifiable Non-Virtual Inline
  • Modifiable Non-Virtual Out-of-line
  • Const Virtual
  • Modifiable Virtual

Each degree in this progression increases cost or complexity, or reduces generality. This one is a little meatier, because it combines several concerns, all related to member declaration.  But we can break it down into “atomic” sub-progressions, each of which factors into the overall progression:

  • Module → Class
  • Static → Member
  • Const → Modifiable
  • Non-Virtual → Virtual
  • Inline → Out-of-line

A private Class Static might be useful with pure header implementations, or when it’s hard to identify a source code module for a member to live in. But generally, Module Static is preferable to Class Static because it introduces zero coupling, not being mentioned in the public interface. One pattern for implementing Module Static is to forward-declare an opaque helper class and wrap it with a smart pointer.

Const methods have a broader domain than Modifiable methods – they can be invoked on const objects, including temporaries bound to const references, as well as on modifiable objects – and so should be preferred.

Inline methods can be handled more efficiently by the compiler, and should be preferred over out-of-line methods, all things being equal. For example, simple get/set accessors which add no coupling should always be implemented inline. In this case, the narrower context also aids readability.

Non-static member methods require the additional pushing/en-registering of the ‘this’ pointer when dispatching the call, and so should only be used when per-instance state is required.

Finally, Virtual methods require a level of indirection (pointer table lookup) when dispatching the call, and should be used only as necessary.

Data Storage

  • Stack Variable
  • Function Parameter
  • Class Member (Aggregation)
  • Heap

The scope of a variable’s existence and visibility should be kept as narrow as possible. This doesn’t just aid readability. For every degree of scope broadening, there is increased cost in managing memory or communicating the data between parts of the program. Whenever possible, a developer should strive for a “stack-based discipline.”  This might entail rewriting an algorithm to be recursive, or simply relocating a variable to the topmost stack frame in which it must be visible. The heap should be considered a means of last resort. It certainly has its place, such as a factory method returning a collection, but it’s relatively slow and adds complexity and risk.

Code Reuse

  • Aggregation (or Private Inheritance)
  • Single Inheritance
  • Single Concrete & Multiple Interface
  • Multiple Concrete

Private Inheritance might be useful if a developer wants to expose a small subset of an inherited interface publicly, in which case a few “using” declarations can come in handy. Otherwise, Aggregation is typically preferred, as it reduces complexity in constructor implementations.

Some languages enforce single concrete inheritance, permitting implementation of multiple abstract interfaces. This design eliminates the possibility of the “dreaded diamond hierarchy”, with multiple instances of the same base. Other languages permit multiple concrete inheritance, and must address multiple bases (typically through virtual inheritance).

Do you have a progression you’d like to add?  Let’s hear it!

Structure Aversion


A well-designed system is like a fractal image – it’s highly compositional. One can see finer degrees of structure at any level of detail.

My Idiosyncratic C++ series would not be complete without a discussion of structure. 

Good object-oriented (OO) programmers understand the basic principles of OO analysis.  And many make a respectable effort in decomposing a system into independent, reusable, composable building blocks.  Unfortunately, many such efforts end there, resulting in coarse-grained, poorly-structured designs.  In this post, I’ll explore what a well-structured design does, and does not, look like.

C++ has been described as two languages in one: a low-level language of constructors, destructors, conversion and assignment operators, and the like; and a high-level domain-specific language undergirded by the first. But highly structured code contains N “languages”, and is like a fractal image.  It could also be compared to an objective hierarchy, wherein the levels below answer the question “how?”, and the levels above, the question “what?” or “why?”  A good programmer is adept at shifting levels as necessary, and works hard not to conflate them.

Information Content

Well-factored code is synonymous with highly structured code.  You cannot have one without the other.  An information theorist might say that highly structured code contains more information.  This is true in two senses.  

First, structured code has greater “information density”.  Here, we’re not creating information so much as we are distilling and concentrating it.  The same signal remains, but there is far less noise.  And since every line of code is a liability, the fewer lines it takes to express an idea, the better.  (Fewer to write, compile, debug, review, test, document, ship, maintain, …)

Second, creating more structure implies defining more abstractions, which is a powerful programming exercise.  Articulating the utility and purpose of every artifact, and formalizing the relationships between them, makes the programmer’s intent as explicit as possible.  Taking the knowledge latent inside the gray matter, and “injecting” it into the system creates new information.  And as an information theorist might also say, there is “no information without physical representation.”  (OK, it still exists somewhere in the programmer’s head, but it sure isn’t very accessible.)

What’s In a Name?

As new degrees of structure are created, it’s probably fairly apparent that new names are needed (lambdas notwithstanding).  A good test for one’s own understanding of a system is the ability to name each part of it.  I have found that when I labor over the right name for an artifact – be it a namespace, class, method or whatever – it may be a symptom that my design is a bit muddled.  Most likely, it needs more fine-grained structure, permitting each piece to abide by the single responsibility principle.  Taming the naming beast is a good thing.  If I can create the best name for an artifact right now, I’m doing my peers and my future self a favor.  Conversely, if I can’t understand the system today (well enough to name it), how can I hope to do so when I have to maintain or extend it a year from now?

Structure Aversion

That brings us to our idiosyncratic practice – what I’d simply call “structure aversion.”  It’s manifested in many ways, which I’ll describe, but first I want to get down to the (de)motivation for it.  I think it’s most common among beginners, perhaps driven by a distaste for the “overhead” needed to define a new structure.  Some developers seem to think that every class warrants dedicated header and source files, complete field encapsulation, serializability, a host of virtual methods implementing various interfaces, etc., etc.  (The corollary of this is that having many local structures should be prohibited.)  This may be applicable for some singly-rooted hierarchy languages, especially managed ones.  But C++ inherits C, which embraces lightweight, native structs.  And these are ideal for POD types, tuples, parameter objects, compound returns, local aggregate arrays, and so on.

Let’s look at some examples of structure aversion, similar to what I recently encountered during a refactoring exercise.  The first is a simple case of structuring a function’s parameters and returns, which can lead to a more compact, fluent, functional style.

// Factory: the return value is assigned to an arbitrary bit of
// person data (SSN) - the "lucky winner!"  Also, it's difficult 
// to tell at a glance which params are "in" and which are "out".
int GetPerson(const string & first, const string & last, int & dob);

// Note the asymmetry of GetPerson and OutputPerson, due to the SSN
void OutputPerson(const string & first, const string & last, int dob, int ssn);

void ProcessPerson(const string & first, const string & last)
{
	// Lots of temporaries are needed with structure aversion.
	// Note how they're "viral" - the dob out param temporary
	// necessitates the ssn temporary, which otherwise could 
	// have been passed along inline to OutputPerson.
	int dob;
	int ssn = GetPerson(first, last, dob);
	OutputPerson(first, last, dob, ssn);
}

// ... after refactoring ...

// A few POD structs, local or private to a class or module.
// Note the separation of key fields into a base struct, which
// is subclassed for the non-key fields, providing a tidy way
// to partition a struct based on "in" versus "in/out" usage.
struct PersonName
{
	string First;
	string Last;
};
struct Person : PersonName
{
	int SSN;
	int DOB;
};

Person GetPerson(const PersonName & name);

void OutputPerson(const Person & person);

void ProcessPerson(const string & first, const string & last)
{
	// The functional style may not be to everyone's taste,
	// but it shows the potential for compactness that 
	// is otherwise impossible with structure aversion.
	OutputPerson(GetPerson(PersonName{ first, last }));
}

Structure aversion is “compositional” in a way.  If the pre-factored code above looks bad, it looks worse when we have to deal with collections.  In this case, structure aversion leads to the use of parallel collections of primitives, versus a single collection of structures. This follows naturally from the example above. After all, without a Person abstraction, how would one transmit the details of several people? One might call this “column-centric” versus “row-centric” data access.  And while it may make sense for big data platforms like Google BigQuery, you have to learn to walk before you can fly (or, learn the rules before you break them).  Here’s what I found in my refactoring exercise:

// Without a Person abstraction, people must be collected partwise.
int GetPeople(vector<string> & firstNames, vector<string> & lastNames, vector<int> & dobs, vector<int> & ssns);

// Parallel collections are passed in as out params and 
// processed in parallel.  Lots of opportunity for bugs.
void ProcessPeople()
{
	vector<string> firstNames;
	vector<string> lastNames;
	vector<int> dobs;
	vector<int> ssns;
	auto numPeople = GetPeople(firstNames, lastNames, dobs, ssns);
	for( int i = 0; i < numPeople; i++ )
	{
		OutputPerson(firstNames[i], lastNames[i], dobs[i], ssns[i]);
	}
}

// ... after refactoring ...

// A much simpler factory
vector<Person> GetPeople();

// And its use
void ProcessPeople()
{
	auto people = GetPeople();
	for_each(begin(people), end(people), [&](Person & person){OutputPerson(person);});
}

In the examples above, I’ve simplified things greatly.  The original code had many more fields (and consequently, collections), and was using MFC collections, which made the parallel iteration even messier.

Factoring Data

In addition to using structures to aggregate runtime data, as above, they can also be used for compile-time constant data.  With this technique, aggregate initialization is used to create a static array of POD structs.  This approach can be used in a number of common scenarios, such as:

  • Render passes
  • Token parsers
  • UI control layout

Let’s take the last one as an example – UI control layout. This is a bit contrived, as these days I’d use a layout engine for new UIs. On the other hand, I find myself refactoring legacy code like this quite often, so I mention it.

enum Color
{
	Red, Green, Blue, Gray
};

struct Button
{
	Button(const string & caption, Color color, bool isDefault = false);
};

struct Dialog
{
	void AddControl( Button * button, int left, int top, int width, int height );

	// This kind of structureless layout code is common, 
	// especially when machine-generated.  
	void Layout()
	{
		auto ok = new Button("OK", Red, true);
		AddControl( ok, 10, 400, 80, 20 );
		auto cancel = new Button("Cancel", Green);
		AddControl( cancel, 100, 400, 80, 20 );
		auto apply = new Button("Apply", Blue);
		AddControl( apply, 190, 400, 80, 20 );
		auto help = new Button("Help", Gray);
		AddControl( help, 280, 400, 80, 20 );
	}
};

// ... after refactoring ...

void Dialog::Layout()
{
	// A local POD struct to collect all constant button data
	struct ButtonData
	{
		const char * name;
		Color color;
		int left;
		// we can place "optional" (zero-defaulted) fields at 
		// the end of the struct, relying on the behavior of 
		// aggregate initialization for typical cases.
		bool isDefault;	
	} 
	buttons[] = 
	{
		{"OK", Red, 10, true},
		{"Cancel", Green, 100},
		{"Apply", Blue, 190},
		{"Help", Gray, 280},
	};
	// Now, we loop over the data defining each button
	// The code is slowly morphing from imperative to declarative.
	// Perhaps more importantly, we've "created new information",
	// adding structure to establish the cohesiveness of each
	// pair of statements in the original version.
	for (auto & buttonData : buttons)
	{
		auto b = new Button(buttonData.name, buttonData.color, buttonData.isDefault);
		AddControl( b, buttonData.left, 400, 80, 20 );
	}
}

// The refactoring above is an improvement, but all the references to
// buttonData are a bit of a code smell (Inappropriate Intimacy).
// ... so, after refactoring again ...

void Dialog::Layout()
{
	struct ButtonData
	{
		const char * name;
		Color color;
		int left;
		bool isDefault;
		// The appeal to this technique - we can drop little
		// bits of logic into the POD to encapsulate and clean
		// up call sites even more.
		void Create(Dialog & dialog)
		{
			auto b = new Button(name, color, isDefault);
			dialog.AddControl(b, left, 400, 80, 20);
		}
	} 
	buttons[] = 
	{
		{"OK", Red, 10, true},
		{"Cancel", Green, 100},
		{"Apply", Blue, 190},
		{"Help", Gray, 280},
	};
	// Now, we've got a much more declarative approach
	for (auto & buttonData : buttons)
	{
		buttonData.Create(*this);
	}
}

Chunky Functions

It’s important to point out that fine-grained structure should exist not just with types, but with functions too.  Simple one-line inline methods may seem gratuitous at first, but they can enhance readability and maintainability of code tremendously.  I recently refactored some XML-generation code that had a very low signal-to-noise ratio (probably less than 1), due to all the Xerces clutter.  I spent an hour writing a handful of methods for creating elements and attributes, in a literate style (permitting method-chaining).  The number of lines of client code was cut in half and the readability doubled, at least.  This kind of structuring is also helpful for conversions, casts, factories, and many other small sequences of oft-repeated statements and expressions.  It’s also helpful for “object-oriented currying” (as I call methods implemented in terms of overloads).

References

In closing, I’ll point out that this is all well-trod ground, but it bears repeating.  I’ve never regretted jumping into a ruthless refactoring leading to greater structure.  There may be moments when I wonder how long it will take to get the engine block back into the car, but the end always justifies it. Adding new information makes the code more self-describing and operating at a higher level of abstraction makes it more declarative.  Both of these contribute to making the code clearer.

For further research, I recommend Martin Fowler’s Refactoring: Improving the Design of Existing Code (a bit dated, but still relevant), as well as Andrew Binstock’s Dr. Dobb’s article on fine-grained structures.  The challenge of constraining oneself to 50-line classes is a great way to enforce a discipline that is sure to grow on you. In the meantime, I encourage you to create structure at every opportunity.