Tuesday, November 12, 2013

MapReduce on .NET Steroids

Ok, so everyone's talking about MapReduce, so I'm gonna chime in, .NET Style!

Wikipedia defines "MapReduce" as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

"The actual origins of MapReduce are arguable, but the paper that most cite as the one that started us down this journey is “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat in 2004. This paper described how Google split, processed, and aggregated their data set of mind-boggling size." (O'reilly publishing)

If you're thinking about distributed parallel computing, think again: this blog post is only about the functional programming Map and Reduce model.
I'm going to show you how the basics are done using .NET and LINQ as an amazing functional programming framework. These are the basics we need in order to take the next step towards distributed computing easily.

So what is "Map-Reduce"?

Well, it's basically a data processing model aimed at large datasets, where you need as much horsepower as you can get in order to get results in a reasonable time frame.
If you can split your data source into X parts that have no dependencies among each other, OR if you have a list of data items (which is basically the same thing), you are good to go.

So go where? Let's think of a "simple" problem/example.
Let's say we have a set of newspaper articles and we want to find out the number of times certain key words show up, e.g. "Peace", "War", "Love", "Hate"...

First step, the mapping step: give every intern (assuming you have interns to spend on this stupid project) one newspaper, and tell him: "You! Take this newspaper and write down on a blank piece of paper every key word, each time it shows up!"

Once your eager minions are done, you need to take their mapped data and send it to the next step, the grouping step. You take one of your beloved workers and tell him/her to take new sheets of blank paper: one sheet will be for "Peace", one for "War" and so on. The grouper needs to go through every word found and write it again, but on the appropriately designated sheet.

After this we have sheets of "War", sheets of "Peace", of "Love" and of "Hate".
But we still don't know how many words of each type. (Yes, if you're thinking that our grouper could've already counted while grouping, you are correct, but remember that this is a simple example :) ).

Next and final step, Reduce!
Find a bunch of workers that are bored and give each a sheet from the grouper's sheets.
Tell them to find the data we want from the grouped data, in this simple case, just count the words.

In the end they will give you the result:
"War" - X times
"Peace" - Y times
and so on for the rest of the key words.

What we have done here is essentially MapReduce in all of its blazing parallel distributed glory!!
This is what Google DOES!
They distribute the data sheets to many workers (computers), let them work in parallel, group the results, and reduce the answers with more distributed parallel computing! No need to bother fellow interns! :)

I ran across a cute example on the web for MapReduce, concerning how Facebook tells us which friends we have in common with our own friends.
http://stevekrenzel.com/finding-friends-with-mapreduce
(Facebook probably does something different, but it's a great example)

You'd think that this would be an easy task... you'd think that it's simple using SQL. However, remind yourself that Facebook does not use relational databases for its billion-plus users!!
So it has to be key-value or graph databases, probably combined together :) I would probably use SQL-MapReduce or something of that nature.

So, in order to fully understand the theory behind this paradigm shift in distributed computing, we need to fully grasp the functional programming theory underlying this methodology. (I cannot stress this enough.)

Let's try to accomplish this word counting task in C#.
Remember, we need to Map -> Group -> Reduce.

Well, first off we need some sort of data source. Let's say a list of strings, where each string represents a sheet of words.

And so:

var sheets = new List<string>();
sheets.AddRange(/* some sheets */);

So, we're generally thinking of an enumerable data structure for now.

How would we go about defining our mapper?
A mapper needs to take a sheet of paper and write down each time a key word is encountered.

public static IList<string> Map(string sheet)
{
       // requires using System.Text.RegularExpressions;
       string pattern = "war|peace|love|hate";

       // Write down every occurrence of a key word found in this sheet
       var matches = (from Match m in Regex.Matches(sheet, pattern) select m.Value).ToList();

       return matches;
}

As you can see it's pretty much hard wired to our specific task, not a good programming practice.
Let's deal with that in a moment.

Okay, now we have our data mapped. That's good, but we need to group it somehow, and for any grouping action to occur one needs a key.
In this simple example, it's pretty obvious that we need to group the mapped data by the words we're looking for.

How about:

var groups = sheets.SelectMany(Map).GroupBy(y => y);

That's pretty simple and adequate.
After we have our groups, we want to get the final results we've been waiting for: the number of repetitions of every key word.

So our reducers can work independently; each one can take a group, for instance, and solve our task for it.

public static KeyValuePair<string, int> Reduce(IGrouping<string, string> group)
{
     // Collapse one group of identical words into its final count
     return new KeyValuePair<string, int>(group.Key, group.Count());
}

That's it! All we have to do now is combine it all.

var mapped = sheets.SelectMany(Map);
var groups = mapped.GroupBy(y => y);
var results = groups.Select(Reduce);

Our results look something like:
[love:2]
[war: 4]
and so on...

To simplify things we could just write:

var results = sheets.SelectMany(Map).GroupBy(y => y).Select(Reduce);

Which is exactly the same thing. (GroupBy also has an overload that takes the key and the group's elements and reduces them in one go; that's the overload the generic version below will use.)
Pretty simple, isn't it!
This is map-reduce by functional programming. The map and reduce stages can both run in parallel.

So what's missing? ... Oh yeah, it's really non-generic. Or is it...

Well, if you look at that one-liner just above, the only thing that's not generic is the "sheets" source, but that's easily generalized to any source.

To make things generic, let's put on the table the types we need.

TSource - for our data source element type
TMapper - for our mapping results
TKeySelector - for our grouping key
TResult - for the end result type we want

If we create some sort of generic Map-Reduce method, it will need all of these type parameters to work.
And what are they?

TSource - the element type of our IEnumerable<TSource> (or similar) source
TMapper - part of the mapping function, TSource -> IEnumerable<TMapper>
TKeySelector - part of our key-selecting function, TMapper -> TKeySelector
TResult - returned from the reducing function, (TKeySelector, IEnumerable<TMapper>) -> TResult

And so, written as an extension method, it would look something like:

public static IEnumerable<TResult> MapReduce<TSource, TMapper, TKeySelector, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TMapper>> mapper,
    Func<TMapper, TKeySelector> keySelector,
    Func<TKeySelector, IEnumerable<TMapper>, TResult> reducer)
{
    return source.SelectMany(mapper).GroupBy(keySelector, reducer);
}

Wow, you really have to look at it for a second to realize that it's this simple using LINQ.
Without it, think of the amount of code needed to loop and group...
This is the amazing strength of modern C#: simplicity.

Now just provide the needed delegates and this function will map-reduce any problem.
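
For instance, wiring our word-count pieces into it might look like this (a quick sketch; the sample sheets are made up):

// A quick sketch: the word-count task through the generic MapReduce method.
// "Map" is the regex mapper from earlier; the sample sheets are made-up data.
var sheets = new List<string> { "make love not war", "war and peace" };

var results = sheets.MapReduce(
    s => Map(s),                          // mapper: TSource -> IEnumerable<TMapper>
    word => word,                         // key selector: group by the word itself
    (key, words) => new KeyValuePair<string, int>(key, words.Count()));

foreach (var pair in results)
    Console.WriteLine("{0}: {1}", pair.Key, pair.Value);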

Just to get a taste of distributing the processing, add PLINQ.
You can add .AsParallel() to the source so that the mappers and reducers run in parallel, and voila.
So easy.
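
A minimal sketch of that idea, assuming we just switch the source over to a ParallelQuery (the ParallelMapReduce name is mine):

// Sketch: parallelize the pipeline with PLINQ by going through ParallelEnumerable.
public static IEnumerable<TResult> ParallelMapReduce<TSource, TMapper, TKeySelector, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TMapper>> mapper,
    Func<TMapper, TKeySelector> keySelector,
    Func<TKeySelector, IEnumerable<TMapper>, TResult> reducer)
{
    return source.AsParallel()                    // fan the mappers out across cores
                 .SelectMany(mapper)
                 .GroupBy(keySelector, reducer);  // groups are reduced in parallel too
}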

But wait...

Delegates are so restrictive.
What if our data source were IQueryable... and let's say that it manages its own distributed computing mechanism for selecting and grouping...
For this we need to modify our method just a bit:

// requires using System.Linq.Expressions;
public static IQueryable<TResult> MapReduce<TSource, TMapper, TKeySelector, TResult>(
    this IQueryable<TSource> source,
    Expression<Func<TSource, IEnumerable<TMapper>>> mapper,
    Expression<Func<TMapper, TKeySelector>> keySelector,
    Expression<Func<TKeySelector, IEnumerable<TMapper>, TResult>> reducer)
{
    return source.SelectMany(mapper).GroupBy(keySelector, reducer);
}

Wait... are you serious? That's it? Just wrap each Func in an Expression?
YES!

Now if you have a query provider that supports distributed computing, this method will do the job.

And there is one: http://research.microsoft.com/en-us/projects/dryadlinq/

Unfortunately it's a discontinued project, but interesting nonetheless.

http://hadoopsdk.codeplex.com/ is what Microsoft is focusing on.


Monday, July 29, 2013

How to delete all contents inside a word document bookmark using C# and Open XML

I had the need to mark a certain section in a Word document and then remove it by code if it is not needed in a certain scenario.
My best solution was to use bookmarks, so I had to remove all contents between two bookmarks (or actually inside one bookmark) inside that Word document.

So I marked the section and inserted a bookmark that encapsulated it, and named it appropriately.

This is the code that did the job:


public void DeleteBetween(string bookmarkName)
        {
            var body = m_document.MainDocumentPart.Document.Body;

            //Find the bookmark start element; it's either under the root element or inside a paragraph
            var bookmark = body.Descendants<BookmarkStart>().FirstOrDefault(x => x.Name == bookmarkName);

            //Could not find it... it may be inside a more complex element
            if (bookmark == null)
                throw new ArgumentOutOfRangeException("bookmarkName", "Bookmark not found in file");

            //If the bookmark is inside a paragraph we want to delete whole siblings of that paragraph
            var start = bookmark.Parent is Paragraph ? bookmark.Parent : bookmark;
            var sibling = start.NextSibling();

            //Delete all elements until we reach our bookmark's end tag (guard against a missing end tag)
            while (sibling != null && !(sibling is BookmarkEnd))
            {
                var temp = sibling;
                sibling = sibling.NextSibling();
                temp.Remove();
            }
        }
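
For context, here is roughly how the surrounding code might obtain m_document and call the method. This is a hedged sketch: the file path, the BookmarkCleaner wrapper class, and the bookmark name are all hypothetical.

// Hypothetical usage sketch (assumes the DocumentFormat.OpenXml SDK).
// "BookmarkCleaner" is an illustrative wrapper whose m_document field
// holds the WordprocessingDocument used by DeleteBetween above.
using (var doc = WordprocessingDocument.Open(@"C:\docs\report.docx", true)) // true = editable
{
    var cleaner = new BookmarkCleaner(doc);
    cleaner.DeleteBetween("OptionalSection");   // bookmark name is made up
    doc.MainDocumentPart.Document.Save();       // persist the modified body
}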

Wednesday, March 27, 2013

How to implement a Restricted Boltzmann Machine in C#

If anyone wants to "feel" the difference between Matlab or Python and a language such as C#, I suggest that the first thing they do is try to program basic mathematical fundamentals, such as linear algebra.
C# is a very robust language; however, it does not help you in any way if you want to simplify coding math-oriented subjects.
You basically need to write everything from scratch.

Now, AI is usually built on neural networks, and neural networks usually process data as vectors or matrices.
And if we have matrices and vectors, we usually need outer products, dot products, triangulations, decompositions, etc...

Since Matlab is built for math, and Python has the NumPy library, what would take a few lines of code in those languages would take hundreds if not thousands of lines of code in C#.
Now, yes, I am stating the "obvious": you could say that if I had an isomorphic code library in C#, it would be almost the same.
That is true, to a degree; but even if we had all the same functions, the syntax itself is not suited for algebraic manipulations.
For example:
In Python, if we want to assign a vector into rows 2 and 3 of the first column of a matrix (slices are zero-based and exclusive at the end), we need only write:
X[2:4, 0] = V

Now, no matter what methods you have in C#, CLR syntax does not allow such typing.
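
For contrast, here is the closest plain-C# equivalent, as a quick loop-based sketch (the array names just mirror the Python above):

// The closest plain-C# equivalent of NumPy's X[2:4, 0] = V (illustrative names).
double[][] X = { new double[3], new double[3], new double[3], new double[3] };
double[] V = { 1.0, 2.0 };

for (int i = 2; i < 4; i++)   // rows 2..3 of column 0, like X[2:4, 0]
    X[i][0] = V[i - 2];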
But that's the thing, isn't it? Let's say I want to test a theory right now, and if it's no good I want to move on.
I don't want to spend hours or days building or searching for helper classes just so I can "test" a simple theory.

This is where I take my hat off to Matlab and even Python. Although I think that .NET (C#) is one of the best-engineered languages ever created, I strongly feel that the simplicity and readability of Matlab / Python is a great advantage in rapid development, specifically for research.

A Restricted Boltzmann machine is an interesting unsupervised machine learning algorithm. It is a generative stochastic neural network that can learn a probability distribution over its set of inputs.

Feature extraction really gets interesting when you stack the RBMs one on top of the other creating a Deep Belief Network.
I will not go into the theory of the Boltzmann machine, regular or restricted.

I recommend looking at the original papers by Geoffrey E. Hinton, Yoshua Bengio and others.
Also, E. Chen's post on the subject, with its Python implementation, is very good and intuitive.

A word about arrays in C#:
Arrays of arrays in C# are similar in syntax to C++ and take the form of (e.g.) int[][].
They are called jagged arrays.
In C#, "multidimensional" arrays are typed as int[,], meaning they are indexed in an x,y Cartesian-style coordinate system.
These are much more intuitive and require no further initialization of inner rows.
However, due to a poor CLR implementation they are much slower than jagged arrays.
This means even more code has to be written in order to achieve simple and readable code for building our neural networks.

In my project I built a Matrix class and a Vector class (which I don't really use); actually, I think the Vector class should inherit from the Matrix class.
The Matrix class encapsulates a jagged array and exposes a multidimensional-style indexer for ease of use.
This is the basis for writing clean, short code for our algorithm.
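
A minimal sketch of that idea (a simplified illustration, not the full class from the project):

// Minimal sketch: a matrix wrapping a jagged array behind a 2D-style indexer.
public class Matrix
{
    private readonly double[][] m_data;

    public int Rows { get; private set; }
    public int Cols { get; private set; }

    public Matrix(int rows, int cols)
    {
        Rows = rows;
        Cols = cols;
        m_data = new double[rows][];
        for (int i = 0; i < rows; i++)    // jagged arrays need every row allocated
            m_data[i] = new double[cols];
    }

    // Multidimensional-style access over the faster jagged backing store
    public double this[int row, int col]
    {
        get { return m_data[row][col]; }
        set { m_data[row][col] = value; }
    }
}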

A single RBM is inefficient at extracting a small number of features from a large input source, so stacked RBMs (what are called deep belief networks) provide a more efficient way to spread the feature learning from layer to layer.

In the following example we trained a two-layer RBM: 1024 visible inputs to 50 hidden outputs, and on to 16 hidden outputs in the 2nd layer.
We trained on 100 handwritten digits: 50 epochs for the first layer, and 1500 epochs for the 2nd layer.
Below, the original digit is shown first, and the reconstruction (from 16 boolean features) after it.
It is quite similar to the original but not perfect, since the machine naturally learned from other examples of writing the digit "2".


00000000011111100000000000000000
00000000011111111000000000000000
00000000111111111100000000000000
00000001111111111100000000000000
00000001111111111110000000000000
00000000111111111111100000000000
00000001111111111111100000000000
00000001111111011111100000000000
00000001111110011111100000000000
00000011111110011111100000000000
00000000111111011111110000000000
00000000111110011111100000000000
00000000011110011111100000000000
00000000001100111111100000000000
00000000000000011111100000000000
00000000000000011111100000000000
00000000000000111111100000000000
00000000000000111111000000000000
00000000000000111110000000000000
00000000000000111111000000000000
00000000000001111110000000000000
00000000000001111110000000000000
00000000000011111100000000000000
00000000000011111110000000000000
00000000000011111100000000000000
00000000001111111111000000000000
00000000011111111111111111100000
00000000111111111111111111110000
00000000111111111111111111111000
00000000111111111111111111111100
00000000111111111111111111111100
00000000011111111111111111111000

00000000010001001000000000000000
00000000011011111000001100001000
00000000001111110100000000000000
00010000111111111011100000000000
00000011111111111101000100000000
00000011011101011111111001000000
00000011101110111111111000000000
00000001111000000111111000000000
00000001110010010111100000000000
00000110111000000011011000000000
00000000110101001111110000000000
00000001011000101111100000000000
00100001001010001111000000000000
00000001111000101111111000000000
00000000001000001110101000000000
00000000000010001011110001000000
00000000110010001110100001000000
00000000000100111111000000000000
00000010100000110111011001010000
00000000100001101111010000000000
00000000001000011110011001000000
00000011000101111100000011000000
00000000000000111000000000100000
00000000101101111100000101000000
00010000010111011010000100000000
00000000000111111001100100100000
00000100011111111010111000000010
00000000111111110111111111111100
00000000111111111111011111111100
00000000111111111111101101111000
00000000001101100111101101100000
00000001010110000000010001011000

So now the challenge is to make our computations as efficient and fast as possible, to make this more realistic for applications.

I've added the "daydreaming" functionality.
It requires more study on my part, but it is interesting to see what features pop up when random data is fed into the machine.

Please feel free to view/download or fork my code on GitHub.

Further work to be done:
This is "Simple" example.
Next phase is to add Greedy-Layer Wise Pre-Training, and supervised fine tuning.

Sources and further reading:

"Learning Deep Architectures for AI" By Joshua Bengio
"A Practical Guide to Training Restricted Boltzmann Machines" By Geoffrey Hinton
E.Chen's Python implementation 

Sunday, March 3, 2013

Nested Class Constructor Accessibility

I wanted to share this very nice, small, creative piece of code that I've stumbled upon while looking for different perspectives on the subject.

The problem at hand: let us say you have a "Parent" or "Container" class, and you want this class to be the only one able to instantiate another class.
This is a very real and sometimes serious situation, where access to the constructor is forbidden to everyone and everything except one controlling object.

One "usual" pattern is a factory builder; however, usually you will still be able to either access the constructor or access the factory method from almost everywhere.

There are a few good solutions out there involving interfaces. The only downside is that they require quite a bit of code to be written for such simple functionality.

And to be honest, what are we asking for???
To have a class with a nested class that only the containing class can access!

http://stackoverflow.com/questions/1664793/how-to-restrict-access-to-nested-class-member-to-enclosing-class

Here you can see different implementations that deal with this problem.
Check out Ray Burns' answer there.


public class Journal
{
  private static Func<object, JournalEntry> _newJournalEntry;

  public class JournalEntry
  {
    static JournalEntry()
    {
      _newJournalEntry = value => new JournalEntry(value);
    }
    private JournalEntry(object value)
    {
      ...

Here we can see that in the nested class's static constructor we initialize a delegate that calls the true nested class's private constructor. The delegate is a private static field in the containing class, so it can only be accessed from within the containing class.

Thus, in this example, JournalEntry can only be instantiated by functionality implemented inside the Journal class.
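
To round out the picture, here is a hedged sketch of how Journal might expose creation through that delegate (AddEntry is my own illustrative name, not part of the original answer). One subtlety is that the delegate is only assigned once JournalEntry's static constructor has run, so the factory should force that first:

// Sketch only: a factory method on Journal using the private delegate.
// RuntimeHelpers requires: using System.Runtime.CompilerServices;
public JournalEntry AddEntry(object value)
{
    // JournalEntry's static constructor assigns _newJournalEntry,
    // so make sure it has run before we invoke the delegate.
    if (_newJournalEntry == null)
        RuntimeHelpers.RunClassConstructor(typeof(JournalEntry).TypeHandle);

    return _newJournalEntry(value);
}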

It might seem trivial to some and complex to others; I must admit I never thought of attacking the subject in this fashion, and it is quite creative.
Kudos.

Thursday, February 7, 2013

Shazam's Algorithm

I recently replaced my old BlackBerry with an Android phone.

One of the apps that I downloaded was "Shazam".
If you are not familiar with it, download it now, it's pretty cool.
What it does is very simple: it recognizes songs. Just play a song on the radio and tap Shazam.
It will practically instantly detect the song playing on the radio.

It works so well that I tried to figure out how the heck it works.
I came up with all sorts of algorithms they might have used: neural networks, clustering algorithms, etc...
And then I came across this blog post:

How-shazam-works

Genius.
I thought that they had an AI that clustered and classified similar patterns.
But using a fingerprint algorithm is actually much, much simpler and better.
A fingerprint algorithm doesn't look for "similar" patterns; it looks for an EXACT pattern. Think of it as a hash table: lookup by key is naturally O(1).
So if you can build an enormous hash table whose keys are the hash codes of song fingerprints, voila! You've got an O(1) song recognition system.
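
In C# terms, the core idea is nothing more than a dictionary lookup. Here's a toy sketch; the fingerprint values are made up, and real audio fingerprinting (hash construction, noise tolerance) is of course far more involved:

// Toy sketch: constant-time lookup from a fingerprint hash to a song title.
var songsByFingerprint = new Dictionary<long, string>
{
    { 0x1A2B3C4DL, "Some Song - Some Artist" },        // made-up fingerprint hashes
    { 0x5E6F7A8BL, "Another Song - Another Artist" }
};

long sampleFingerprint = 0x1A2B3C4DL; // in reality, computed from the microphone sample

string match;
if (songsByFingerprint.TryGetValue(sampleFingerprint, out match))
    Console.WriteLine("Found: " + match);
else
    Console.WriteLine("No match");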

Downsides:
You cannot hum a song to it, because it can only recognize exact patterns.
You cannot find similar songs (songs that have a generally similar beat or similar instruments playing), again because it uses an exact-match search algorithm.


Still, it is pretty amazing when you use it.