Karma: C++11 and low level abstraction

C++11

One thing I liked with the LISP, was the possibility to generate code at the runtime. In C++, without some black magic (and LLVM + JIT) is unable to do the same. However, things are getting better with the new standard.

One of the most powerful and useful feature of the new standard are: constant expressions. Constant expressions allows the developer writing expressions that will be evaluated during the compilation. You can see it as an evolved form of inlining.

For example:

constexpr max(int i, int j)
{
    return i > j ? i : j;
}

int main(int, char**)
{
    int m = max(1, 4); // will translate to: int m = 4;
    return 0;
}

The variable m will have its value directly affected to 4 and not comparison will be done at the runtime.

How does it help us?

By greatly improving readability.

Do you remember these kind of macros:

#define GET_ARRAY_SIZE(array) (sizeof(array) / sizeof(*(array)))
#define STATIC_ASSERT(expression) typedef char static_assertion[(expression)?1:-1]

STATIC_ASSERT(GET_ARRAY_SIZE(anArray) > 10);

In C++11, you may prefer:

template<typename T, size_t SIZE>
static constexpr size_t getArraySize(T (&) [SIZE]) { return SIZE; }

static_assert(getArraySize(anArray) > 10, "The array is too small !");

Much more sexy, in my opinion.

Constant expressions can’t be complicated, they can only consist in a return. If C++ had not template, they would have been pointless. However, with templates they allow constructions that were not possible before like in the previous example.

The C++ ABI

The C++ has some requirements we don’t usually care about in software development. For example, GCC adds some code when you create a static variable local to a function, to ensure the atomicity of its accesses.

I won’t explain how to do it, I simply followed the C++ article on OSDEV to set everything up.

I use a C++ function to call all the global variables constructors:

typedef void (*CTOR)();

extern "C"
{
    extern void* start_ctors;
    extern void* end_ctors;
}

void __call_constructors()
{
    size_t nb_ctors = (size_t(&end_ctors) - size_t(&start_ctors))
            / sizeof(void*);

    CTOR* addr = reinterpret_cast<CTOR*>(&start_ctors);

    for (size_t i = 0; i < nb_ctors; ++i, ++addr)
    {
        (*addr)();
    }
}

The __call_constructors function has to be called from the main.

Kernel types

I’m a clumsy person and I easily do mistakes.

However, since I know it, I’m trying to put safeguards wherever I can. That’s why the two first classes I created after having an “Hello world” on my board were: Address and Register.

Address

The Address class holds an address and checks every access we’re doing on it.

Basically, an address has 4 properties:

  • The size of the data at the address;
  • Is it readable;
  • Is it writable;
  • Is it volatile.

With these information, we can start to generate the actual pointer type:

The 4 properties are template values:

template<size_t PointedSizeInBytes = 4, bool Writeable = true, bool Volatile = false, bool Readable = true>
class Address
{
    Address(size_t address)
   : address_(address)
    {}
    // ....
};

Is it readable?

typedef typename Traits::static_if<Readable, T, void>::type is_readable_type;

Is it volatile?

typedef typename Traits::static_if<Volatile, typename Traits::add_volatile<is_readable_type>::type, is_readable_type>::type is_volatile_type;

Is it writable?

typedef typename Traits::static_if<Writeable, is_volatile_type, typename Traits::add_const<is_volatile_type>::type>::type is_writable_type;

And we create our pointer: typedef typename Traits::add_pointer::type type;

When we instantiate an Address instance, we can check the last property, the pointee size:

template <typename T>
Address::Address(T* pointer)
: address_(size_t(pointer))
{
    static_assert(sizeof(T) == PointedSizeInBytes, "Incompatible pointer type");
}

Or, more likely, we can check it when we get out the pointer:

template<typename T>
typename GetPointerType<T>::type Address::getAddress()
{
    static_assert(sizeof(T) == PointedSizeInBytes, "Cannot cast the pointer, invalid size");
    return reinterpret_cast<typename GetPointerType<T>::type>(address_);
}

Register

A processor wouldn’t be as useful without peripheral devices. Devices are designed to interact with the environment. However, they need to be instructed to do so.

To communicate with the devices we need to access their registers. Registers are mapped in the address space, which is specified in the OMAP4460 documentation (remember the UART3?).

Let’s take the UART as an example.

The UART allows reading and writing data from and to another device (in our case, the rs232 connection between my computer and the board). First, it needs to be initialized: the BAUD rate, the parity, and some other things need to be set. But I took the lazy way, and I let U-Boot do it for me (Actually, I didn’t manage to reinitialize the UART device properly :( )

So, we can use it right away!

As we saw, we need to write in an 8bits register to send data, but if we write to much information the device won’t send anymore data (I think the device enter in some error state and needs to be re-initialized). If we take a look at the UART documentation (included in the OMAP4460 pdf), we can see that there is a register that indicates the state of the transmit queue: UART_LSR. The documentation also indicates at which address this register is available: 0×48020014.

Furthermore, it says:

    5 TX_FIFO_E
    Read 0x0: Transmit hold register (TX FIFO) is not empty.
    Read 0x1: Transmit hold register (TX FIFO) is empty.

So, we must wait until we can read a 1 before sending any characters.

Let’s rewrite our puts function:

    void puts(const char* data)
    {
       volatile uint8* UART3_BASE = (volatile uint8*) 0x48020000;
       volatile uint8* UART3_LSR  = (volatile uint8*) 0x48020014;

       while (*data)
       {
          while (((*UART3_LSR) & (1<<5)) == 0);
          *UART3_BASE = *data++;
       }
    }

I don’t know for you, but I easily do mistake when I’m programming with binary mask. (To be entirely honest I did not try the above code again, I took it from my mercurial history).

I rewrote this code using an abstraction for managing registers. Now I can rewrite the above code like this:

    enum LSRRegisterPart
    {
        RX_FIFO_E,
        RX_OE,
        RX_PE,
        RX_FE,
        RX_PI,
        TX_FIFO_E,
        TX_SR_E,
        RX_FIFO_STS
    };
    typedef KLib::Register<unsigned char,
                            UARTBaseAddress
                            , KLib::RegisterPart<RX_FIFO_E, 0>
                            , KLib::RegisterPart<RX_OE, 1>
                            , KLib::RegisterPart<RX_PE, 2>
                            , KLib::RegisterPart<RX_FE, 3>
                            , KLib::RegisterPart<RX_PI, 4>
                            , KLib::RegisterPart<TX_FIFO_E, 5>
                            , KLib::RegisterPart<TX_SR_E, 6>
                            , KLib::RegisterPart<RX_FIFO_STS, 7>>  LSRRegister;

    enum DataRegisterPart { THR = 0, RHR = 0, DLL = 0 };
    typedef KLib::Register<unsigned char,
                            UARTBaseAddress,
                            KLib::RegisterPart<THR, 0, 8>> DataRegister;

    void puts(const char* data)
    {
       static DataRegister data(0x48020000);
       static LSRRegister  lsr(0x48020014);

       while (*data)
       {
           while (lsr.getValue<LSRRegisterPart::TX_FIFO_E, uint8>() == 0);
           data.writeValue(*data++);
       }
    }

How does it work?

A register is defined by an address where we can read or/and write data of a given size. Quite often, a register is also divided in several parts, defined by names, start and end offsets; pretty much like a plain structure has different fields of different size.

The RegisterPart argument defines one of those parts and makes it easy to manipulate the underlying range of bits. Combined with the Address class, this is a solid tool to handle most of the register accesses, including via mask/logical arithmetics.

RegisterPart is defined as follow:

template<size_t ID, size_t OFFSET_START, size_t OFFSET_END = OFFSET_START + 1>
struct RegisterPart
{
    static constexpr size_t RegisterID = ID;
    static_assert(OFFSET_END > OFFSET_START, "OFFSET_START > OFFSET_END");

    template<typename ReturnType, typename T>
    static ReturnType getValue(T reg)
    {
        static_assert(OFFSET_START < getSizeInBits(T()), "OFFSET_START > reg");
        static_assert(OFFSET_END <= (getSizeInBits(T())), "OFFSET_END > reg");
        static_assert(((OFFSET_END - OFFSET_START) / 8) <= sizeof(ReturnType), "Invalid return type");
        return (ReturnType)sliceByte(reg, OFFSET_START, OFFSET_END);
    }

    template <typename T, typename U>
    static void setValue(T reg_addr, U value)
    {
        static_assert(Traits::is_const<typename Traits::remove_pointer<T>::type>::value == false,
                      "The pointer is const, the register is not writable");
        auto old_value = *reg_addr;
        constexpr unsigned int mask = ((1 << (OFFSET_END - OFFSET_START)) - 1) << OFFSET_START;
        *reg_addr = old_value ^ ((old_value ^ (value << OFFSET_START)) & mask);
    }

    template <typename T, typename U>
    static void setSlicedValue(T reg_addr, U value)
    {
        value = sliceByte(value, OFFSET_START, OFFSET_END);
        RegisterPart::setValue(reg_addr, value);
    }

};

The sliceByte function is simply:

template <typename NumericType>
constexpr size_t sliceByte(NumericType value, size_t start, size_t end)
{
    return (value >> start) & ((1 << (end - start)) - 1);
}

The complex bitmask arithmetic in setValue is borrowed this website. I modified it to allow the OFFSET_START shift.

The next step is to put all the RegisterPart together in one Register, with the least possible runtime overhead.

To do so, I decided to separate the Register class in two parts:

  • one for the C++ stuff;
  • one for the actual public interface.

The part that handles “the C++ stuff”, is basically a proxy towards RegisterPart. It is responsible for select the proper RegisterPart and call the desired function.

The Register class is declared as follow:

template<typename RegisterUnderlyingType, typename Address, typename... RegisterValues>
struct Register : private detail::Register<RegisterUnderlyingType, RegisterValues...>
                , private NonCopyable
{
    typedef detail::Register<RegisterUnderlyingType, RegisterValues...> Base;
    typedef typename Address::template GetPointerType<RegisterUnderlyingType>::type PointerType;
    //...

    public:

    Register(Address addr)
    : address_(addr)
    {}

    template <size_t RegisterValueIdx, typename ReturnType>
    ReturnType getValue();

    template <size_t RegisterValueIdx, typename U>
    RegisterHolderRAII setValue(U value);

    template <size_t RegisterValueIdx, typename U>
    RegisterHolderRAII setSlicedValue(U value);

    void writeValue(RegisterUnderlyingType value)
    {
        static_assert(Traits::is_const<typename Traits::remove_pointer<PointerType>::type>::value == false,
                      "The pointer is const, the register is not writable");
        auto reg_addr = address_.template getAddress<RegisterUnderlyingType>();
        *reg_addr = value;
    }
};

The set/get*Value member functions are intentionally left blank.

As I said, the code responsible for selecting the proper register part is in a separated class. Please, welcome detail::Register:

    namespace detail {
    template<typename RegisterUnderlyingType, typename... RegisterValues>
    struct Register : public RegisterValues...
    {
        template <size_t RegisterValueIdx, typename ReturnType, typename T>
        ReturnType getValue(T value)
        {
            //...
        }
    };
    }

I decided to inherit from RegisterValue. The trick, is to be able to cast this in the proper type to select the wanted part.

To address this problem the C++ implicit conversion kicks in! Only a 3 line function can trigger the cast!

    template <size_t idx, size_t S, size_t E>
    static constexpr RegisterPart<idx, S, E>& get(RegisterPart<idx, S, E>& ref)
    { return ref; }

Basically, if you can specify at least 1 template value, the compiler is able to lookup the remaining values. Here, we know idx, so the compiler will retrieve the others for us. The code going into the detail::Register::getValue implementation finally boils down to:

// value is the value we are going to split according to the offset specified in
// the RegisterPart instance we are selecting
return detail::get<RegisterValueIdx>(*this).template getValue<ReturnType>(value);

The others are all very simple:

        template <size_t RegisterValueIdx, typename T, typename U>
        void setValue(T reg, U value)
        {
            detail::get<RegisterValueIdx>(*this).setValue(reg, value);
        }

        template <size_t RegisterValueIdx, typename T, typename U>
        void setSlicedValue(T reg, U value)
        {
                detail::get<RegisterValueIdx>(*this).setSlicedValue(reg, value);
        }

And then we can finish the implementation of the Register class:

    template <size_t RegisterValueIdx, typename ReturnType>
    ReturnType getValue()
    {
        return Base::template getValue<RegisterValueIdx, ReturnType>(*address_.template getAddress<RegisterUnderlyingType>());
    }

    template <size_t RegisterValueIdx, typename U>
    RegisterHolderRAII setValue(U value)
    {
        RegisterHolderRAII ret(*this, address_.template getAddress<RegisterUnderlyingType>());
        ret.template setValue<RegisterValueIdx>(value);
        return ret;
    }

    template <size_t RegisterValueIdx, typename U>
    RegisterHolderRAII setSlicedValue(U value)
    {
        RegisterHolderRAII ret(*this, address_.template getAddress<RegisterUnderlyingType>());
        ret.template setSlicedValue<RegisterValueIdx>(value);
        return ret;
    }

RegisterHolderRAII permits atomic writing: the setValue/setSlicedValue will be cached and written to the register only when RegisterHolderRAII is destroyed.

Few remarks

While the code behind Register is certainly not trivial, it removes a lot of complexity in a lot of other places. I think this is a, more than acceptable, trade-off.

Moreover, the compiler catch one more possible mistake: if you create two RegisterPart with the same id, and try to access it; the compiler will raise an error indicating that there is an ambiguity triggered by the implicit cast.

I did not implement a writeValue in RegisterPart (I don’t need it for now), but for read-only addresses, we will try to dereference a void\* and the compiler will issue an error.

I kept Address and Register in a library aside the kernel, that allowed me to write unit test.

Conclusion

I hope this was clear enough and that you understood why I’ve chosen C++11 to experiment with kernel programming.

The compiler does a really good job on inlining the calls; in the end almost only the actual register write remains in the assembly code.

Of course, that’s not the only performance factor, and I didn’t try to run any benchmark (performances aren’t my current concern). But, I had pretty decent results as far.

The next post will be about setting up the MMU. I’m currently motivating myself to write a page allocator and a memory allocator, but the design isn’t clear in my head yet.

After that, I will try to implement interruptions and post about it too.

Where can I find the code?

I will publish the whole source code a bit later, that’s a bad excuse, but I’ve to clean it up a little bit, most notably the logging part.

Thanks

I would like to thank Louis for the time he takes to review this post!

Little remarks

Since I decided to write my post with markdown, the inlined code will be less sexy. It is easier for me to work on it offline and easier for the reviewers.

Write its own kernel for a PandaBoard ES


Several months ago, I acquired a PandaBoard Es because I wanted to revisit all my low-level’s knowledge, learning about ARM architecture and embedded system.

Late September, I started my kernel and now I would like to share some of my
work and problems I encountered. Hopefully, this will help people who want to
discover the embedded world like me. Thus, here is the first of a long list of
posts about writing your own kernel on an ARM board. I’ll try to write one article
by week.

Karma

I decided to write a kernel named Karma, in C++11. I decided to go with C++
because it’s an interesting language and its last standard dramatically
improved it.

Since I’m stuck with the previous standard at my job, it is also an occasion to
learn and apply the new C++11 concepts.

C++ is also one of the rare languages that allow kernel programming.

One thing I like about the C++ is that it allows doing some heavy type checking
at the compile time, thus you may manipulate some template types without
knowing the whole public interface but you can still make sure that what you
need is implemented. I will cover how I did it thanks to the SFINAE, type traits and
static assertion.

Where did I start?

I have some basics in x86 kernel programming and while there are actually some
similarities with ARM; I lost a lot of time trying to apply what I knew about
x86 against Karma.

On the PandaBoard, the boot sequence is in three steps:

  • First, you have the vendor ROM that will initialize the basic features of the
    board and load the x-loader (provided by Texas Instrument);
  • Then, the second program loader (SPL), U-boot, starts;
  • U-boot looks for a kernel image and boots it.

U-boot also provides a complete command line to interact with the board. For
example, it allows you to send a program or a kernel through the serial port.
There are several topics which cover the minimal set-up to boot a kernel on the
board, thus I won’t talk about it.

The documentation

The documentation is split in three parts: the CPU, the system-on-chip and the
board documentation:

  • The PandaBoard embeds an OMAP4460 SoC which has its own documentation, it can be found here;
  • The CPU documentation (which is the cortex A9 in our case), is available for free but you need to register an account on the ARM website.
  • The board documentation is available on the PandaBoard website.

The workflow:

  • I use a gcc 4.7 to cross-compile for arm;
  • minicom to connect to the PandaBoard.

A basic helloworld

When I started I was afraid I would have to develop my own serial driver to be
able to print anything but, fortunately, UARTs are here! UART stands for Universal Asynchronous Receiver/Transmitter.

UART abstracts the actual data bus, there is only one register where one needs
to write to print something out. Of course an UART needs to be initialized, but
U-boot do it for us.

To write something with the UART you only need to know its address. On the
OMAP4460 you have 4 UARTs, so you need the board documentation to know which
one to use. For the PandaBoard the one wired to the serial port is the third.
At this point, you can look up its address on the OMAP4460: 0×48020000.

U-Boot is initializing the UART (among a lot of others things) for us so it is
usable as it.

Let’s code

I’m not a fan of ARM assembly. I understand it, I can write some of it, but the
lesser I write it the better I feel. However, a little bit of ASM will be
needed: we need to program the entry point in order to branch to our C++ code.

Here we go:

.global kmain
.arm
start:
    .globl _start
    b kmain

The “b” instruction is the equivalent of the x86 “jump”.

Straightforward isn’t it :) .

Now the fun part, the C (++) part:

void puts(const char* data)
{
   volatile uint8* UART3_BASE = (volatile uint8*) 0x48020000;
   while (*data)
   {
      *UART3_BASE = *data++;
   }

   *UART3_BASE = '\r';
   *UART3_BASE = '\n';
}

extern "C" void kmain()
{
   const char* test = "Hello World";
   puts(test);
}

And to finish, the linker script:

ENTRY(_start)

SECTIONS
{
    . = 0x82000040;

    .text :
    {
        *(.text*)
         *(.gnu.linkonce.t*)
    }

    .ctors :
    {
        . = ALIGN(4096);
        start_ctors = .;
        *(.init_array);
        *(.ctors);
        end_ctors = .;
     }

     .dtors :
     {
        . = ALIGN(4096);
         start_dtors = .;
         *(.fini_array);
         *(.dtors);
         end_dtors = .;
      }

    .rodata :
    {
        *(.rodata .rodata*)
        *(.gnu.linkonce.r*)
    }

    .data :
    {
        *(.data)
    }

    .bss :
    {
        *(COMMON*)
        *(.bss)
        *(.gnu.linkonce.b*)
    }

    /DISCARD/ :
    {
        *(.comment)
        *(.eh_frame) /* discard this, unless you are implementing runtime support for C++ exceptions. */
        *(.ARM.exidx*)
    }
}

Now we can compile our “kernel”:

$> arm-linux-gnueabi-g++ -W -Wall -Werror test.s test.cpp -Tlinker.ld -o kernel.elf -nostdlib –nostartfiles

Then we need to only get the “code” and drop the elf stuff:

$> arm-linux-gnueabi-objcopy kernel.elf -O binary kernel.bin –image-base 0×82000040

And the last step is to generate an u-boot kernel image:

$> mkimage -A arm -n Karma -C none -O linux -T kernel -d kernel.bin -a 0×82000000 -e 0×82000040 -n karma uImage

And here you go!

I spent a certain amount of time figuring one bug in my code. I knew that
mkimage added 64 bytes at the beginning of my kernel but I did not modified my
linker script accordingly. That why the linker script indicates 0×82000040
instead of 0×82000000.

At this point if you try to write too much data it won’t work because you
need to take care of the transmit queue. The next post will be about it! I’ll
show how to setup the basic C++ ABI runtime and some of the code I wrote to
ease the register accesses.

That’s it for the first blog post. In the next one, I’ll show how to setup the basic C++ ABI runtime and some of the code I wrote to ease the register accesses.

Vrac



Long time no see. Since I came back to France I did not take the time to post. I have/had some ideas of what to share but nothing big…

Why I don’t like Python even if I use it nearly every days.

The utf-8 support

Python2 has a poor support of utf-8. You can get some Encoding/DecodingError very easily.

In my school, every projects are corrected with a software I developed. It is responsible for gather the student project (on AFS, SVN, HG or GIT location) and make a correction. The problem is that the program entry may received data it can’t handle. When a student program is buggy it can print some unexpected character like some utf-8 characters that is interpreted as a multi-byte character and when the next one does not fit the requirement for a proper decoding, Python throw an exception… The cause is only because Python treat every data as string and it is dynamically typed.

I never tried Python3 but things seem better since a ‘new’ type will be present: the data type.  Here is a better description that I could do:

http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

The static scoping.

Python is slow, it is not a secret but to improve the performances they statically scoped the variable. Here is an example:

>>> def test():
...     i = 2
...     def test2():
...             return i + 1
...     return test2
... 
>>> test()
<function test2 at 0x7f2c7af13a68>
>>> test()()
3
>>> def test():
...     i = 2
...     def test2():
...             i = i + 1
...             return i
...     return test2
... 
>>> test()
<function test2 at 0x7f2c7a9f4d10>
>>> test()()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 4, in test2
UnboundLocalError: local variable 'i' referenced before assignment

When the Python interpreter sees an initialization it creates a table attached to the function to allow a fast symbol resolution. But it does not handle the fact that a variable can come from a parent context like here. So you can’t do a closure in Python :(

A little bit of C++ meta-programming.

I like the meta-reprogramming (even if Lisp is better), but sadly I don’t use it often… The last week I had to do a program which checks if thousands codes are valid against an ISO specs. The problem is that there is no rules, only possibilities (i.e. if at index X I have the character I, at X + 1 I must have A, S, D or F). A very painful task but I was able to add some fun in it by creating a static template tree.
Here is the code:

template<class T>
struct cfi_level
{
        char letter;
        T next_valid;

        bool is_valid(const char* cfi) const
        {
            if (*cfi == letter)
            {
                return next_valid.is_valid(++cfi);
            }
            return false;
        }
};

enum
{
    MAX_LEVEL = 11
};

template<int DEPTH>
struct _cfis
{
        typedef _cfis<DEPTH - 1> ChildClass;

        bool is_valid(const char* cfi) const
        {
            for (int i = 0; i < MAX_LEVEL; ++i)
            {
                if (current_level[i].is_valid(cfi))
                    return true;
            }
            return false;
        }

        cfi_level<ChildClass> current_level[MAX_LEVEL];
};

template<>
struct _cfis<0>
{
      const char* letters;

        bool is_valid(const char* cfi) const
        {
            return !*(cfi + 1) && strchr(letters, *cfi);
        }
};

typedef _cfis<5> Cfis;

// the big static initialization. let as exercise for the reader 
#include "array.cpp" 

static inline bool is_cfi_valid(const char* cfi)
{
    // c comes from array.cpp
    return c.is_valid(cfi);
}

static inline bool is_cfi_valid(const std::string& cfi)
{
    return is_cfi_valid(cfi.c_str());
}

Normalize time series




When you have a big cluster of servers you need to monitor them. However, there is not a lot of possibilities to do this. When a server crashes, it is very important to be able to access to all the metrics collected and plot them. When dotCloud asked me to work on the metric system, I did a lot of researches on the existing technologies.

here are some results.

Collect the metrics is something well known and well done. For example, collectd is very good at this. But the difficult part is the storage and plotting because you need to be able to access to the datapoint and plot the data quickly and it has to be reliable: miss 10mins of metric is acceptable, 1 day no. Collectd is designed to only grab the metrics from the host but there is a plugin to write the data points in rrd files. RRDTool is an old tool, very used, which store the data in a file on a regular filesystem. There are several tools which can draw a graph from the rrd files.

However this method has some problems:

  • rrd files have an high random writes, so when you have too many files, the hard drive is not enough fast;
  • It is not reliable since the data arrives a one point;
  • Even if the data point are sent a many different collectd to be written, the rrd files cannot be merged;
  • rrdtool is one of (or the) the worst code I ever read (give it a try).

So, in other word, it does not scale.

There isn’t even one usable technology to work on big amount of metrics, however, these last years there are a lot of new technologies designed to scale. For example, all the nosql databases (riak, redis, mongodb, hbase…) are heavily distributed.

The only one thing that can easily handle the amount of data we have was HBase. HBase is a column oriented database written in Java belonging to the Hadoop software suite. In the same time we discovered OpenTSDB which is built on top of HBase and is specifically designed to store metrics and display graph from them (was it fate? :) ).

After writing a plugin for collectd, we realized an essential feature was missing when you want to connect several graphs together: the normalization.

When you receive datapoints, you don’t know when they will be collected so when they are plotted on a graph, you can have the feeling that some points are missing. The normalization solves this problem by relocating each point at equal distance from each other (on each minute for example), it can also “guess” what would have been a missing point. I could explain what exactly what it is, but this documentation will be clearer.

When I wanted to look at an algorithm for time series normalization I was not able to found one, and look at the RRDTool code was helpless therefore I wrote a little test program which normalize time series (the method normalize was designed to be included in OpenTSDB code, that why an iterator is retrieved from the array).

import java.text.DecimalFormat;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.Random;

import javax.swing.JFrame;

import org.jfree.chart.*;
import org.jfree.chart.labels.StandardXYToolTipGenerator;
import org.jfree.chart.renderer.xy.*;
import org.jfree.data.time.Second;
import org.jfree.data.time.TimeSeries;
import org.jfree.data.time.TimeSeriesCollection;
import org.jfree.data.time.TimeSeriesDataItem;

public class Main {
  public static void main(String[] args) {
    new Main();
  }

  private Random r = new Random(42);

  @SuppressWarnings("hiding")
  private class Pair {
    long first;
    double second;

    public Pair() {
    }

    public Pair(long f, double s) {
      this.first = f;
      this.second = s;
    }

    public long getFirst() {
      return first;
    }

    public void setFirst(long first) {
      this.first = first;
    }

    public double getSecond() {
      return second;
    }

    public void setSecond(double second) {
      this.second = second;
    }

    public String toString() {
      return "(" + first + ", " + second + ")";
    }
  }

  public ArrayList<Pair> getRandomSerie() {
    ArrayList<Pair> result = new ArrayList<Main.Pair>();

    long start_time = 0;

    for (int i = 0; i < 1000; ++i) {
      result.add(new Pair(start_time, 100 + r.nextGaussian()));
      start_time += r.nextInt(59);
    }

    return result;
  }

  private void addToTimeSeries(ArrayList<Pair> points, TimeSeries ts) {
    Pair last = null;
    for (Pair p : points) {
      try {
        ts.add(new TimeSeriesDataItem(
            new Second(new Date(p.getFirst() * 1000)), p.getSecond()));
        last = p;
      } catch (Exception e) {
      }
    }
  }

  private ArrayList<Pair> normalize(ArrayList<Pair> src) {
    ArrayList<Pair> output = new ArrayList<Main.Pair>();

    long x = src.get(0).getFirst();

    Iterator<Pair> it = src.iterator();

    if (x % 60 != 0)
      x += 60 - x % 60;

    Pair current_point = it.next();
    Pair next_point = it.next();

    while (it.hasNext()) {
      long x0 = current_point.getFirst();
      double y0 = current_point.getSecond();

      long x1 = 0;
      double y1 = 0;

      double amount = 0;
      double interval = 0;

       // if we are on the minute no computation is needed.
      if (x0 == x) {
        amount = y0;
        interval = 1;
      } else {
        // when some data points are missing
        // we can have x0 > x
        interval = Math.abs(x - x0);
        amount = (y0 * interval);

        boolean has_next = true;
        do {

          has_next = false;

          /*
           * we are going to look ahead in order to check if some points match
           * the current time interval. If yes their area will be added to the
           * current.
           */

          x1 = next_point.getFirst();
          y1 = next_point.getSecond();

          if (x1 > x) {
            x1 = x;
          } else if (x1 < x && it.hasNext()) {
            has_next = true;
            next_point = it.next();
          }

          // in most of the case x1 > x0
          // but when x1 is aligned and when some
          // datapoints are missing, we can have
          // x1 < x0
          long _int = Math.abs(x1 - x0);
          interval += _int;
          amount += y1 * _int;

        } while (has_next);
      }

      double y = amount / interval;

      output.add(new Pair(x, y));

      x += 60;

      x0 = next_point.getFirst();
      y0 = next_point.getSecond();

      // handle the case we missed a point,
      // To detect it we just need to know if the next
      // aligned point is before the next data point we
      // got. If yes, we missed a point.

      if (x >= next_point.getFirst() && it.hasNext())
        next_point = it.next();
    }
    return output;
  }

  public Main() {
    ArrayList<Pair> s = getRandomSerie();
    ArrayList<Pair> s3 = normalize(s);

    TimeSeries series = new TimeSeries("Non-Normalized");
    addToTimeSeries(s, series);
    TimeSeries norm_series = new TimeSeries("Normalized");
    addToTimeSeries(s3, norm_series);

    TimeSeriesCollection dataset = new TimeSeriesCollection();
    dataset.addSeries(series);
    dataset.addSeries(norm_series);

    JFreeChart chart = ChartFactory.createTimeSeriesChart("TS", "Time",
        "Value", dataset, true, true, true);

    XYLineAndShapeRenderer r = new XYLineAndShapeRenderer();
    final StandardXYToolTipGenerator g = new StandardXYToolTipGenerator(
        StandardXYToolTipGenerator.DEFAULT_TOOL_TIP_FORMAT,
        new SimpleDateFormat("h:m:s - M/y"), new DecimalFormat("0.00"));

    r.setBaseToolTipGenerator(g);

    chart.getXYPlot().setRenderer(r);
    ChartFrame frame = new ChartFrame("Test", chart);
    frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
    frame.setSize(1600, 900);
    frame.setVisible(true);
  }
}

It uses JFreeChart to display a graph. The patch is ready for OpenTSDB, I’ll submit it Monday.

What every C++ programmer should know, The hard part




Previously, I explained how C++ does to handle the classes and inheritance between them. But, I did not cover how the virtual is handled.

It adds a lot of complexity, C++ is compiled and when a binary is linked against a library they have to speak the same language: they have to share the same ABI. The C++ creators had to find a way to give along the program lifetime metadata about the manipulated classes.

They chose the Virtual Tables.

The Virtual Table

When a C++ program is compiled, the binary embedded some information about the manipulated classes by the program. When a class inherits from an interface, the actual implementation of the method should always be accessible. The Virtual Table (VTable) are generated during the compilation process,they can be seen as array of method pointers.

Let’s take an example:

#include <iostream>

struct Interface
{
        Interface() : i(0x424242) {}
        virtual void test_method() = 0;
        virtual ~Interface(){}
        int i;
};

struct Daughter : public Interface
{
        void test_method()
        {
            std::cout << "This is a call to the method" << std::endl;
            std::cout << "This: " << this << std::endl;
        }
};

int main()
{
    Daughter* d = new Daughter;
    Interface* i = d;

    i->test_method();

    std::cout << sizeof(Daughter) << std::endl;
    std::cout << *((void**)i) << std::endl;
    std::cout << ((void**)i)[1] << std::endl;
}

I recall that all the test have been done on a Linux 64bits.

The size of a Daughter instance is not 8 as we could expect but 16bytes. The memory dump shows that the first field of the class is not the value of i but a strange value and our field come next to it. Our ‘strange’ value is actually a pointer, in fact it is a pointer inside our binary.

nm -C test | grep 400d
0000000000400de0 V vtable for Daughter

I will explain after why there is a difference of some bytes between the two. So this pointer represent the location of the Daughter VTable. We can now check its content.

As I said, a VTable is a kind of array of method pointer.

To get a pointer on it, it is simply:

size_t** vtable = *(size_t***)i;
std::cout << vtable[0] << std::endl;

And if we check the new address printed on the output we can see that it is actually our pointer on method.

nm -C test | grep -E 400c6a
0000000000400c6a W Daughter::test_method()

We can play a little bite more to test deeper:

typedef void (*VtablePtr) (Daughter*);
VtablePtr ptr = (VtablePtr)vtable[0];
ptr(d);

The VTable are determined along the compilation. When the compiler see a virtual method in a class in start to construct a VTable associated to this class. When this class is inherited by another one, it will automatically duplicate and receive a pointer on a VTable for the current parsed class. Each entry of the VTable will be filled when the actual definition of the method is encountered. It is always the last definition which is kept.

The index of the method in the VTable is the same as the apparition order in the source file, that's why it's very important that all the part of a project is compiled with consistent header. It is always embarrassing when the bad method is called in a project without knowing why…

Here is the complete code:

#include <iostream>

struct Interface
{
        Interface() : i(0x424242) {}
        virtual void test_method() = 0;
        virtual ~Interface(){}
        int i;
};

struct Daughter : public Interface
{
        void test_method()
        {
            std::cout << "This is a call to the method" << std::endl;
            std::cout << "This: " << this << std::endl;
        }
};

int main()
{
    Daughter* d = new Daughter;
    Interface* i = d;

    i->test_method();

    std::cout << sizeof(Daughter) << std::endl;
    std::cout << *((void**)i) << std::endl;
    std::cout << ((void**)i)[1] << std::endl;

    size_t** vtable = *(size_t***)i;
    std::cout << vtable[0] << std::endl;

    typedef void (*VtablePtr) (Daughter*);
    VtablePtr ptr = (VtablePtr)vtable[0];
    ptr(d);

}

In conclusion, when virtual appears an instance should be seen like this:

VPTR
Base1
Daughter

And the instance is heavier of sizeof(void*)*nb_of_vptr bytes.

Virtual in multiple inheritance

As usual, we are going to start with a trivial code:

#include <iostream>

struct Mother
{
        virtual void mother()=0;
        virtual ~Mother() {}
        int i;
};

struct Father
{
        virtual void father()=0;
        virtual ~Father() {}
        int j;
};

struct Daughter : public Mother, public Father
{
        void mother()
        { std::cout << "Mother: " << this << std::endl; }

        void father()
        { std::cout << "Father: " << this << std::endl; }

        int k;
};

int main()
{
    Daughter* d = new Daughter;
    Mother* m = d;
    Father* f = d;

    std::cout << "Daughter: " << (void*)d << std::endl;
    std::cout << "Father  : " << (void*)f << std::endl;
    std::cout << sizeof(*d) << std::endl;

    std::cout << *((void**)d) << std::endl;
    std::cout << *((void**)f) << std::endl;
}

As you can note, the two table used are different. When the types are manipulated, this is not always (never?) the concrete type used but the abstract one. With multiple inheritance it can be a Mother or a Father instances, so when a Father is used and the actual implementation is in Daughter, the method should be accessible. That's why there is another VTable pointer.

However, when an instance of type Daughter is used through a Father pointer, Daughter method cannot be called directly. Indeed, the instance pointer needs to be adjusted to match a Daughter instance. To solve this problem, there are the Thunk function.

If we print the first entry of the VTable and if we disassemble the code a this location, we have this:

0000000000400cf4 <non-virtual thunk to Daughter::father()>:
  400cf4:       48 83 ef 10             sub    $0x10,%rdi
  400cf8:       eb 00                   jmp    400cfa <Daughter::father()>

These two instructions perform pointer adjustment by subtracting the size of the Mother class (and then match the Daughter instance). Therefore, if you have multiple inheritance with method you can add some indirection very easily:

  • Get the VTable;
  • Move to the wanted method (apply an offset on the VTable pointer, for example 8 to get the second method);
  • Call the method;
  • Adjust the this pointer;
  • Jump to the actual method definition.

 

Method Pointer

Yes, method pointer have a cost. Contrary to the C where function pointers have no overhead, the C++ had to deal with the difference between:

  • From which instance the method is accessed;
  • Is the method virtual?

The first point require a pointer adjustment. The second point, well, lot of things.

Firstly, the size of a method pointer is 16 bytes (against 8 in C). The method pointer is in three parts:

  1. Offset
  2. Address/index
  3. virtual?

The first one is on 8 bytes, the second on 8 bytes also. The third part is on one byte and is merged with the second one. If the last byte is set then the second part should be seen as an index (the index of the method in the VTable), otherwise it is the address of the method.

Therefore, calling a method pointer require ~ 20 asm instructions (in the worst case):

  1. Get the offset to apply on the instance pointer;
  2. Apply it;
  3. Check if we call a virtual member function;
  4. If yes, subtract 1;
  5. Get the VTable;
  6. Get the method address;
  7. Call the method.

Conclusion

In a next article I'll cover the VTable prefix and the virtual inheritance but there are less common in C++ code. In these two articles I tried to put some light on C++'s internal mechanism. The C++ is a fast language but it can become much less efficient because of complex class relation. I don't say: "don't use virtual and method pointer", I think programmers should be aware of these counterparts.

I think the readability is more important than performances. Yes, you can have a lot of overhead in C++ but it will still be more efficient than a lot of languages. But sometimes you can avoid virtualization. For example, the common ways for a beginner (and sometimes less beginners C++ programmers) to do an abstraction is to define an interface and for the different implementation, define a new class which inherits from this interface.

Sometimes, ok it is the right thing to do, sometimes not. If you are asked to write an abstraction to the filesystem on Linux and Windows if you follow the described way, you'll write an iFS interface, a WindowsFS and a LinuxFS. It'll work well but you can do even better: You can write a WindowsFS and LinuxFS and define a new type FS according to the platform where the code is compiled, on Linux we could imagine something like this:

typedef LinuxFS FS;

With a code like this, you'll avoid some overheard due to the interface. It works well on abstraction of platform specific features but it does not work on data abstraction and you'll need an interface.

Here are some resources:

What every C++ programmer should know, The simple part




Several months ago I did a presentation of the C++ abi under gcc and GNU/Linux. It is very interesting because it can change your way of programming in C++. A C++ can be improved easily by knowing few things.

Before, let’s replace the C++ in history: the C++ was designed and implemented in the late 70′s. Its purpose was to provide a language which is able to run natively and which is more high level than its predecessor, the C. However, allowing multiple inheritance, overriding, interfaces was not an easy thing to design and some counterparts are to be considered/known…

The mangling

On an Unix platform the binaries follow a specification: The ELF. In C, when a function is written a symbol in the binaries is added. The name of the function is the same than the symbol. A binary cannot have the same symbol declared multiple time (except in some cases, with weak symbol but it is not the point of this post).

Note: the nm command can be used to see all the symbols defined in a binary.

In C++, the symbols are still used in the same way except for one thing: they have’nt the same name than the function anymore… They are ‘mangled’. The type, the function qualifier (static, const, volatile) and the namespace are used.
For example, these functions:

void test(int, char)
void test(char, int)

are mangled like this:

_Z4testci
_Z4testic

As you can note, the return type is not used.

This is how the C++ handle the function/method overriding.

Member function Calling

When a member function is called in C++, you can always use the this keyword in order to access to the current instance. But when you actually call the method, you never give it.

Example:

struct Test {
	void test() {}
};

int main() {
  Test t;
  t.test();
}

Let’s disassemble the code:

Test t;
t.test();
lea    -0x1(%rbp),%rax
mov    %rax,%rdi
callq  4008f4 <Test::test()>

According to the standard, the minimal size of a C++ object is 1 byte. At the third line you can see that the address of the instance (it is the last object pushed onto the stack, rbp represents the back of the stack, so the instance can be found at rbp - 1) is stocked into the rax register and afterwards the address is copied into rdi register which is the one used to give the first parameter to a function according to the calling convention x64.

Note: To disassemble the code, I used objdump -CDS. Do not forget to compile your code in debug mode.

Then, as we saw the member functions are simply functions disguised. The member function test of our class Test can be seen like this:
void test(Test*){}.

Some languages like the Perl and the Python decided to explicit this mechanism: it is up to the programmer to give the this argument. In Python it is called self.

For example, in Python the Test could be written:

class Test:
    def test(self):
        pass

This mechanism can lead to some ‘funny’ things in C++. It is totally legal to call a member function
with a NULL instance. With the Test class you can write:
((Test*)NULL)->test();
It will work until you use this to access to the class fields.

Note: The class name is used to mangle the member function name, a class is a kind of namespace. The public/private/protected keyword does not affect the mangling, it is just a ‘compiler’ thing in order to make some static security/semantic checks on your code.

Simple inheritance

In the case of the simplest inheritance case the class are just stacked.
In this code:

#include <iostream>

class Mother
{
  public:
	Mother() : c(0xCCCCCCCC), d(0xDDDDDDDD){}
  private:
	int c, d;
};

class Test : public Mother
{
  public:
	Test() : a(0xAAAAAAAA), b(0xBBBBBBBB){}

  private:
	int a;
	int b;
};

int main()
{
  Test t;

  std::cout << sizeof(t) << std::endl;
  return (0);
}

If you compile+execute it you’ll see the actual size of the class Test and if you dump
the memory at the Test instance you’ll see that the Mother attributes are copied
into the Test class.

The great thing with having the mother class above the daughter is that all the calls to mother member functions or access to attributes does not add any complexity nor pointer arithmetics: a cast from a daughter to a mother pointer do nothing. I will become less true with multiple inheritance.

Multiple inheritance

Inheritance is often view as bad practice or show a misconception in the program. I won’t talk about
these considerations except that for example in the case of shared pointer it is useful (boost::shared_pointer, shared_from_this(), etc.).

Let’s start with a new code.

#include <iostream>

struct Mother
{
        int i;
};

struct Father
{
        int j;
};

struct Daughter: public Mother, public Father
{
        int k;
};

int main()
{
    Daughter d;

    d.i = 0;
    d.j = 1;
    d.k = 3;

    int* tab = reinterpret_cast<int*>(&d);

    for (int i = 0; i < (sizeof(d) / sizeof(*tab)); ++i)
        std::cout << tab[i] << std::endl;
}

This code enlighten the fact that the class are still stacked and in the case of multiple inheritance
we have the organization that look like this:

Mother
Father
Daughter

Actually, to be precise, it should be understand as:

LHS-Inheritance-Tree
Next-Inheritance-Tree
XXX-Inheritance-Tree
RHS-Inheritance-Tree
Daughter

But to keep it simple, I’ll use the first representation.

The problem with the multiple inheritance is that to access to the Father attributes, the pointer of the instance needs to be adjusted!
Add a member function in the Father class, call it through a Daughter pointer and you’ll see an assembly code like this:

// get the this pointer
  400881:       48 8d 45 e0             lea    -0x20(%rbp),%rax
  400885:       48 83 c0 04             add    $0x4,%rax
  400889:       48 89 c7                mov    %rax,%rdi
  40088c:       e8 a7 00 00 00          callq  400938 <Father::func()>

So, ANY access to an inheritance tree part different from the RHS == offset calculation.

Diamond inheritance

Even though I can find some explanations for multiple inheritances, for the diamond one I can’t :) .
There is one trap with the diamond inheritance, you think:

But, actually you have this:

So the common class is duplicated. There are some mechanism to avoid this, we will see them later.
According to the previous part, the organization of the instances:

GrandMother
Mother
GrandMother
Father
Daughter

You’ve just seen the basis of the C++ ABI with the mangling and inheritance.

In the next article I’ll explain the role of the virtual keyword according to the context where it is used.

Come back on the exceptions




In the end, the problem was that I indicated that a member function can throw a reference.
I can easily understand that it is not a good idea but the compiler should at least print a warning.

After talking with a friend, A thing that I said before could be misinterpreted:
I do not say that exception or return value should not be tested ! I said that when I throw an exception, I do not catch it until the last moment. I avoid to nest to much try/catch, not because of some kind of performance issues (that are inexistent) but because sometimes it leads to a code bloated by try/catch which does nothing and with no recovery done from these exceptions. There is a location to treat the exception and until I’m not sure where it is, I wait.

C++ and exception on Mac OS X Lion




I have a mac book pro as laptop. When I upgraded to Lion and installed the new version of XCode, I had a surprise…

I’m currently programming a little language, a LISP dialect to be precise (it can be found here). The interpreter is in C++. When I program in C++ and use exception, I delayed the exception catching the latest possible. I noticed that when the error handling is done too soon, the code resulting have so many try/catch or return code check that it becomes inefficient…

Anyway, in my code I have a class LispException which inherits from std::exception. With g++ when you compile in debug mode, the reason of the exception is printed when the program abort.

Example:

#include <exception>

class myexception:public std::exception
{
 public:
    myexception() throw() {}
    virtual ~ myexception() throw() {}

    const char *what() const throw()
    {
        return "this is a good reason";
    }
};

int main()
{
    myexception ex;
    throw ex;
}

Compile it and execute, you should see this:

$> ./a.out
terminate called after throwing an instance of 'myexception'
what(): this is a good reason
zsh: abort ./a.out

But, now, if you checkout my interpreter at the revision: 7ab204bea16e7b33227b , and if you type:

 (test)

The program will segfault !

The worst part is that when you take a look with GDB at the call stack it is helpless:
$> gdb ./build/klisp
GNU gdb 6.3.50-20050815 (Apple version gdb-1705) (Fri Jul 1 10:50:06 UTC 2011)

(gdb) r
Starting program: /Users/daedric/Documents/workspace/lisp/build/klisp
Reading symbols for shared libraries ++......................... done
(test)

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: 13 at address: 0x0000000000000000
0x00007fff913ec8b1 in std::string::compare ()
(gdb) bt
#0 0x00007fff913ec8b1 in std::string::compare ()
#1 0x0000000100006b43 in std::operator< , std::allocator > (__lhs=@0x8948fc7d89000023, __rhs=@0x100000000) at basic_string.h:2228
#2 0x0000000100006b8d in std::less::operator() (this=0x100001808, __x=@0x8948fc7d89000023, __y=@0x100000000) at stl_function.h:227
#3 0x0000000100006bf9 in std::_Rb_tree >, std::_Select1st > >, std::less, std::allocator > > >::find (this=0x100001808, __k=@0x100000000) at stl_tree.h:1378
#4 0x0000000100006d0d in std::map, std::less, std::allocator > > >::find (this=0x100001808, __x=@0x100000000) at stl_map.h:542
#5 0x0000000100005d1f in Environment::lookupVar (this=0x1000017d0, name=@0x100000000) at Environment.cpp:43
(gdb)

Now, checkout the next revision (and take a look at the diff)
and try again:
$> ./build/klisp
(test)
Exception: Unbound function: test

Weird no ?

I was not able to determine the exact use case to reproduce this bug with a little amount of code, but it can cause a lot of problem when you code. My first guess when I saw this call stack was: I messed up with my vtable/cast.
If someone has an explaination to this problem, please let me know.