1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Sentence Boundary Detection in C#


Sentence Boundary Detection or Segmentation is the task of splitting an input passage of text into individual sentences. Since the period '.' character may be used in numbers, ellipses or names it's not enough to simply split by the period character.

When I was researching ways to do this in C# I didn't find much in the way of properly open source libraries. A lot of the libraries I found for other languages referred to the Golden Rule Set (GRS). This set comes from Pragmatic Segmenter, a Ruby gem to segment text based on rules observed from a varied corpus of text.

Since I find porting code from other languages helps me understand both the variations in how different languages approach the same problems and also how other people make architectural decisions and structure their code I decided to port it to C#.

This Pragmatic Segmenter port is available to download from NuGet. The public API is similar to that for the Ruby package however the method is static:

var result = Segmenter.Segment("There it is! I found it.");

Assert.Equal(new[] { "There it is!", "I found it." }, result);

There is also support for other languages, the Language enum gives the supported languages:

var result = Segmenter.Segment("Salve Sig.ra Mengoni! Come sta oggi?", Language.Italian);
Assert.Equal(new[] { "Salve Sig.ra Mengoni!", "Come sta oggi?" }, result);

The source code also contains a set of data from various sources I was using to test my port as well as add some behaviour for the sources I was primarily interested in (academic journals). This data can be found here. Hopefully this corpus of annotated sentence boundary data will be useful to people building their own libraries.


Using ConvNetSharp With Feature Based Data


ConvNetSharp which is descended from ConvNetJs is a library which enables you to use Neural Networks in .NET without the need to call out to other languages or services.

ConvNetSharp also has GPU support which makes it a good option for training networks.

Since much of the interest (and as a result the guides) around Neural Networks focuses on their utility in image analysis, it's slightly unclear how to apply these libraries to numeric and categorical features you may be used to using for SVMs or other machine learning methods.

The aim of this blog post is to note how to acheive this.

Let's take the example of some data observed in a scientific experiment. Perhaps we are trying to predict which snails make good racing snails.

Our data set looks like this:

Age   Stalk Height    Shell Diameter    Shell Color   Good Snail?
1     0.52            7.6               Light Brown   No
1.2   0.74            6.75              Brown         Yes
1.16  0.73            7.01              Grey          Yes

ConvNetSharp uses the concept of Volumes to deal with input and classification data. A Volume is a 4 dimensional shape containing data.

The 4 dimensions are:


Alpha Release of PdfPig


I'm very pleased to finally have reached the first alpha release of PdfPig (NuGet).

PdfPig (GitHub) is a library that reads text content from PDFs in C#. This will help users extract and index text from a PDF file using C#.

The current version of the library provides access to the text and text positions in PDF documents.


The library began as an effort to port PDFBox from Java to C# in order to provide a native open-source solution for reading PDFs with C#. PdfPig is Apache 2.0 licensed and therefore avoids questionably (i.e. not at all) 'open-source' copyleft viral licenses.

I had been using the PDFBox library through IKVM and started the project to investigate the effort required to make the PDFBox work natively with C#.

In order to understand the specification better I rewrote quite a few parts of the code resulting in many more bugs and fewer features than the original code.

As the alpha is (hopefully) used and issues are reported I will refine the initial public API. I can't forsee the API expanding much beyond its current surface area for the first proper release.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17