In general, nodelets don't guarantee that your CPU usage will go down.  They often do reduce it, but if message passing is already a small part of your usage, nodelets won't help much.  They will reduce the latency of the message passing, though.
 * What is a good methodology for measuring the overhead of these messages?

 * Is there some way to set different "command names" for the
different nodelets so top or ps can identify which is which?

 * How can I run "profile" on a nodelet or a collection of nodelets?

You profile nodelets the same way you profile any C++ application: with the profiler of your choice.  I tend to use google perftools and/or cachegrind.  There's also gprof and sysprof, and probably many more.

 * Are any enhancements planned for rxgraph to report nodelet
connections clearly?

If not, I'll open an enhancement ticket. The rostopic and rosnode
commands seem to report things correctly, so the right information
must be available somewhere.

That information doesn't exist anywhere at the moment.  As far as the ROS graph is concerned, it's just a single node.

I am guessing that memory allocation for large, high-bandwidth
messages could be a significant factor. Before, I pre-allocated the
messages to avoid memory overhead on every cycle. (But, I suppose that
just pushed the problem down into the publish() implementation.) Now,
I have to allocate a new message and shared_ptr every time.

Don't guess; profile.  Allocation could be a bottleneck, but it's more likely that filling in the data (or std::vector's zero-filling of primitive types on resize) is the problem.
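To make the resize point concrete, here is a rough sketch that compares two ways of filling a freshly allocated message.  FakeImage, make_with_resize, make_with_assign and time_ms are all made-up names for illustration, not part of any ROS API, and a wall-clock loop like this is no substitute for a real profiler run on your actual nodelet.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a message with one large data field.
struct FakeImage {
    std::vector<std::uint8_t> data;
};

// Variant A: resize() on a new vector zero-fills every byte, and the copy
// then overwrites them, so each byte is written twice.
FakeImage make_with_resize(const std::uint8_t* src, std::size_t n) {
    assert(src != nullptr || n == 0);
    FakeImage msg;
    msg.data.resize(n);
    std::copy(src, src + n, msg.data.begin());
    return msg;
}

// Variant B: assign() copy-constructs the bytes directly from the source
// range, so there is no separate zero-filling pass.
FakeImage make_with_assign(const std::uint8_t* src, std::size_t n) {
    assert(src != nullptr || n == 0);
    FakeImage msg;
    msg.data.assign(src, src + n);
    return msg;
}

// Crude wall-clock timer: runs f() `iterations` times and returns the
// total elapsed time in milliseconds.
template <typename F>
double time_ms(F f, int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        f();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_ms on each variant with a buffer the size of your real messages gives a first impression of whether the fill step matters at all; whether the difference shows up in practice depends on message size, compiler and optimization level, which is why the profiler should have the last word.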

 * Should I use the ros_realtime/allocators package in place of
standard C++ new? Are there examples of this I can study?

The allocators package currently only has an aligned allocator, so that won't help.  What might help is a growable (and shrinkable) version of lockfree's ObjectPool, but that doesn't exist yet.  You could probably try using the existing ObjectPool if you're OK with having a fixed-size pool of messages.
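For a feel of what a fixed-size pool of messages looks like, here is a minimal sketch.  It is not lockfree's ObjectPool and does not use its API: FixedPool is a made-up class that uses a plain mutex, so it is not realtime-safe, and the shared_ptr it returns still allocates a small control block on each call.  It only illustrates the idea of reusing pre-built messages and returning them to the pool when the last reference goes away.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical fixed-size pool; every slot starts as a copy of `prototype`.
template <typename T>
class FixedPool {
public:
    FixedPool(std::size_t count, const T& prototype)
        : state_(std::make_shared<State>()) {
        assert(count > 0);
        state_->storage.assign(count, prototype);
        state_->free_list.reserve(count);
        for (T& item : state_->storage) {
            state_->free_list.push_back(&item);
        }
    }

    // Hands out a pooled object, or an empty pointer if the pool is
    // exhausted. The custom deleter puts the object back on the free list
    // instead of deleting it. Objects are NOT reset between uses.
    std::shared_ptr<T> allocate() {
        T* item = nullptr;
        {
            std::lock_guard<std::mutex> lock(state_->mutex);
            if (state_->free_list.empty()) {
                return std::shared_ptr<T>();
            }
            item = state_->free_list.back();
            state_->free_list.pop_back();
        }
        // The deleter keeps the shared state alive, so outstanding
        // messages stay valid even if the pool object is destroyed first.
        std::shared_ptr<State> state = state_;
        return std::shared_ptr<T>(item, [state](T* p) {
            std::lock_guard<std::mutex> lock(state->mutex);
            state->free_list.push_back(p);
        });
    }

    // Number of objects currently free.
    std::size_t available() const {
        std::lock_guard<std::mutex> lock(state_->mutex);
        return state_->free_list.size();
    }

private:
    struct State {
        std::mutex mutex;
        std::vector<T> storage;     // never resized after construction
        std::vector<T*> free_list;  // capacity reserved up front
    };
    std::shared_ptr<State> state_;
};
```

The trade-off is the one mentioned above: when all messages are in flight, allocate() comes back empty and the caller has to decide whether to drop the cycle or fall back to a normal allocation.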
