
PdfPig Version 0.0.5


Today version 0.0.5 of PdfPig was released. This is the first version which includes the ability to create PDF documents in C#.

There aren't many fully open source options around for both reading and writing PDF documents, so the addition of PDF document creation to PdfPig is an exciting next step for the API.

The design of document creation isn't finished yet, and there's more work to be done on currently unsupported use cases such as splitting, merging and editing existing documents, adding non-ASCII text, working with forms and adding images to new documents. However, the functionality in 0.0.5 should be enough for simple use cases, and the open source Apache 2.0 license means that it can be used in commercial software.

You can create a new document using a document builder:

PdfDocumentBuilder builder = new PdfDocumentBuilder();

This creates a completely empty document. To add the first page, we use the imaginatively named AddPage method:

PdfPageBuilder page = builder.AddPage(PageSize.A4);

This supports various page sizes defined by the PageSize enum, such as the North American standard PageSize.Letter. It also allows the choice of portrait (default) or landscape pages.

Once a page builder has been created, text, lines and rectangles can be added to it.


To draw text, a font must be chosen. Version 0.0.5 supports TrueType fonts as well as the 14 default fonts detailed in the PDF Specification. These are called the Standard 14 fonts and, while their use is beginning to be phased out, all PDF readers should still support them.


Sentence Boundary Detection in C#


Sentence Boundary Detection or Segmentation is the task of splitting an input passage of text into individual sentences. Since the period '.' character may be used in numbers, ellipses or names it's not enough to simply split by the period character.
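To make the problem concrete, here is a minimal sketch of the naive approach. The NaiveSplitDemo class and its SplitOnPeriod method are hypothetical names used only for illustration; they are not part of any library mentioned in this post.

```csharp
using System;
using System.Linq;

public static class NaiveSplitDemo
{
    // Naive approach: treat every '.' as a sentence boundary.
    public static string[] SplitOnPeriod(string text)
    {
        return text
            .Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(part => part.Trim())
            .Where(part => part.Length > 0)
            .ToArray();
    }

    public static void Main()
    {
        // Two sentences, but the title "Dr." and the number "3.50" also contain periods.
        var text = "Dr. Smith paid 3.50 for coffee. He left.";

        foreach (var part in SplitOnPeriod(text))
        {
            Console.WriteLine(part);
        }
    }
}
```

The input contains two sentences, but the naive split produces four fragments ("Dr", "Smith paid 3", "50 for coffee" and "He left"), which is why a rule-based or statistical approach is needed.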

When I was researching ways to do this in C# I didn't find much in the way of properly open source libraries. A lot of the libraries I found for other languages referred to the Golden Rule Set (GRS). This set comes from Pragmatic Segmenter, a Ruby gem to segment text based on rules observed from a varied corpus of text.

I find that porting code from other languages helps me understand both how different languages approach the same problems and how other people make architectural decisions and structure their code, so I decided to port it to C#.

This Pragmatic Segmenter port is available to download from NuGet. The public API is similar to that of the Ruby gem; however, the method is static:

var result = Segmenter.Segment("There it is! I found it.");

Assert.Equal(new[] { "There it is!", "I found it." }, result);

There is also support for other languages; the Language enum lists the supported languages:

var result = Segmenter.Segment("Salve Sig.ra Mengoni! Come sta oggi?", Language.Italian);
Assert.Equal(new[] { "Salve Sig.ra Mengoni!", "Come sta oggi?" }, result);

The source code also contains a set of data from various sources, which I used to test my port and to add some behaviour for the sources I was primarily interested in (academic journals). This data can be found here. Hopefully this corpus of annotated sentence boundary data will be useful to people building their own libraries.


Using ConvNetSharp With Feature Based Data


ConvNetSharp, which is descended from ConvNetJS, is a library which enables you to use Neural Networks in .NET without the need to call out to other languages or services.

ConvNetSharp also has GPU support which makes it a good option for training networks.

Since much of the interest (and as a result the guides) around Neural Networks focuses on their utility in image analysis, it's slightly unclear how to apply these libraries to the numeric and categorical features you may be used to using with SVMs or other machine learning methods.

The aim of this blog post is to note how to achieve this.

Let's take the example of some data observed in a scientific experiment. Perhaps we are trying to predict which snails make good racing snails.

Our data set looks like this:

Age   Stalk Height    Shell Diameter    Shell Color   Good Snail?
1     0.52            7.6               Light Brown   No
1.2   0.74            6.75              Brown         Yes
1.16  0.73            7.01              Grey          Yes
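Before rows like these can be fed to a network, they need to become plain numbers. The following is only a sketch of one common encoding, not necessarily the approach this post goes on to use: the numeric columns are copied as they are, the categorical Shell Color column is one-hot encoded, and the Good Snail? column becomes a class index. The SnailEncoding class, its method names and the list of colours (taken from the three sample rows only) are all assumptions made for illustration.

```csharp
using System;

public static class SnailEncoding
{
    // Hypothetical category list, built from the three sample rows only.
    private static readonly string[] ShellColors = { "Light Brown", "Brown", "Grey" };

    // Encodes one row as [age, stalkHeight, shellDiameter, one-hot shell color...].
    public static double[] EncodeFeatures(double age, double stalkHeight, double shellDiameter, string shellColor)
    {
        var index = Array.IndexOf(ShellColors, shellColor);
        if (index < 0)
        {
            throw new ArgumentException("Unknown shell color: " + shellColor, nameof(shellColor));
        }

        // Three numeric features followed by one slot per known color.
        var features = new double[3 + ShellColors.Length];
        features[0] = age;
        features[1] = stalkHeight;
        features[2] = shellDiameter;
        features[3 + index] = 1.0;
        return features;
    }

    // Maps the "Good Snail?" column to a class index: No -> 0, Yes -> 1.
    public static int EncodeLabel(string goodSnail)
    {
        return string.Equals(goodSnail, "Yes", StringComparison.OrdinalIgnoreCase) ? 1 : 0;
    }

    public static void Main()
    {
        // Second row of the table: 1.2, 0.74, 6.75, Brown, Yes.
        var row = EncodeFeatures(1.2, 0.74, 6.75, "Brown");
        Console.WriteLine(string.Join(", ", row));
        Console.WriteLine(EncodeLabel("Yes"));
    }
}
```

With three colours, each row becomes six numbers. In practice the numeric columns would usually also be scaled to a similar range, which this sketch leaves out.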

ConvNetSharp uses the concept of Volumes to deal with input and classification data. A Volume is a 4-dimensional shape containing data.

The 4 dimensions are:
