Projects

Designs for a Computer Algebra System

Posted in Maths & Science, Programming, Projects, Software on January 8th, 2010 by Noldorin – Be the first to comment

Creating a modern computer algebra system from the ground up, as has been a plan of mine for some months now, is no trivial goal, as anyone who has even a vague conceptual understanding of computational algebra must surely know. My efforts in this area, grouped under the title of “the Syracuse Project“, have involved mostly research and large amounts of contemplation so far, but I feel that I have finally managed to formulate enough solid ideas that they are worth presenting in a short article. I was also fortunate enough to find someone on IRC (in #math-software on Freenode) with whom I could discuss and refine many of our mutual ideas. The ideas discussed in this post are the culmination of my own thoughts and much of the knowledge I gained from my conversations with Robert Smith (nickname ‘Quadrescence’) online, which has served as the basis for some of my own investigation and explanations here, in a modified form.

What I am going to focus on here falls mainly under the domain of my Euclid.NET project, which is essentially the core of Archimedes, which one might describe as the “CAS kernel”.  Think of Euclid.NET as the framework for symbolic mathematics upon which Archimedes is built. Its primary purpose is to handle such tasks as expression evaluation, simplification, differentiation, series expansion, and so on. (Bonus points if you can spot the connection between the names of Archimedes and Syracuse.)

TeX.NET, which is the primary parser (expression tree builder) for the project at this moment, has already seen a stable release, and will soon see the second with any luck. As you might guess from the name, it takes input in TeX syntax (well, actually LaTeX, with a few extensions).

To begin, I thought I would tackle one of the tougher aspects of computer algebra systems, namely, simplification. More trivial features such as expression evaluation and differentiation are essential to a CAS, yet are hardly worth an in-depth discussion, at least not for now.

So what is simplification?

Simplification, to state the banal, is concerned with making an expression more “simple”. Unlike many every terms that have been borrowed by mathematics, “simple” or “simplification” truly does not have a more rigorous definition. The problem here largely stems from the fact that the definition of what is simple is somewhat fluid – it depends to a great extent on the context. To illustrate this nature more concretely, here a few examples.

  1. (x+1)^2 or x^2 + 2x + 1?
  2. y^{-3.5} or 1/y^{3.5} or 1/(y^3 \sqrt{y})?
  3. \tan x \sin x or \dfrac{\sin^2 x}{\cos x}?
  4. e^{A+(2x-y)} or e^A\left[\dfrac{e^{2x}}{e^{y}}\right]?
  5. 4 \sqrt{x^3 + 6x^2 + 9x} or 4(x + 3)^2 \sqrt{x} ?

A simple expression treeA visualisation of a very simple expression tree. The equivalent infix expression is (2 + 2) + (2 + 2) + (3 + 3). Note that in general such trees are not restricted to be binary.

I think that anyone who has had sufficient experience using maths and manipulating many formulae and expressions will realise that in some scenarios, one of the given equivalent expressions in 1 to 5 is desired in a certain scenario, while another is desired in a second situation. Hence, any conceivable simplification algorithm cannot be treated as a rigid mechanical process, bur rather must adjust itself depending on the parameters it is given, which hint at the sort of result that is desired.

When humans perform simplification of mathematical expressions, they often use so-called intuition, developed from much prior experience, along with trial and error, to (in most cases) quickly and accurately simplify maths. It is inevitable that even the best mathematicians hit dead ends when trying to simplify complicated expressions. A computer, perhaps unsurprisingly, cannot do any better. Moreover, mathematical simplification is, in my view, one of the few aspects of mathematical methodology that overall better suits a holistic intelligence, rather than the traditional sequential one that is most often associated with maths (theorem proving being a notable exception). In fact, mathematics is not all so logical and all-encompassing as even mathematicians not long ago thought – thanks to the magnificent Incompleteness Theorem proposed by Kurt Godel in the 1930s – this is however a subject for another day.

How can we measure simplicity?

Fortunately, and perhaps surprisingly, the field of evolutionary computation presents a rather handy way of treating the “simplicity” of expressions, that is, a fitness function – in its most abstract sense, something that measures the absolute “fitness” of any given solution for a certain optimisation problem (most commonly genetic algorithms, which lent the term “fitness”). The “fitness” may be thought of qualitatively as the value, worth, or suitability of a particular solution. The solutions, in this case, are of course expression trees.

To begin, it is greatly helpful to reduce the problem by extracting a certain (small) number of parameters from the expression tree, rather than trying to analyse the entire thing holistically. This reduces the parameter space (the set of all possible parameters) dramatically, which is typically highly beneficial in optimisation problems. The fitness function itself is allowed to take any form in general, though we shall see shortly that one or two classes of function in particular are desirable. For now, let us just focus on the set of parameters. After some consideration, I came to the conclusion that any effective parameter space must consist of the following variables:

  • Size of the expression tree (i.e. the number of nodes)
  • Height of the expression tree  (i.e. the number of layers)
  • Width of the expression tree (i.e. the number of nodes in the bottom layer).

These are all of course integer values, and thus the value of the function is restricted to lie within the set of integers. After building up an image that measuring simplicity is a tricky thing, this may seem like a rather straightforward framework; indeed, it is in some ways, thought it is worth noting that the apparent problems arise when we decide what function to use and when it should be applied. The reasons for the choice of these parameters should become apparent soon.

Let us first consider a basic fitness function that simply weights each parameter individually in a linear combination. In other words, the fitness function, F(S, H, W) may be defined as the following, where S, H, and W correspond to the size, height, and width of the expression tree respectively.

F(S, H, W) = aS + bH + cW

The constants a, b, c may be any integer (postive or negative), and should be passed to the simplifier routine rather than predefined, according to the desired output. It should be quite evident that by choosing different magnitudes as well as signs for these constants, each of the three parameters may be independently rewarded or penalised to greater or lesser extents. The question you might then ask is: why not a more complicated function depending on S, H, and W? My answer is a straightforward one: there is no need. A linear combination of terms gives enough control over the desired function that extending to the function to add higher-order or even exponential terms would be quite pointless and arbitrary. I have, however, far from closed my mind on this matter – as I design and test the system progressively, certain discoveries may be made that suggest a slightly different approach.

Indeed, my only other real consideration for a fitness function thus far is one of even more basic form. Suppose, for example, that the fitness function only depended on two things: a) which parameter should be prioritised, b) whether this parameter should be promoted or demoted. The other two parameters would simply be minimised in the case that the first shows no preference between two trees. I have not finalised the implementation of this, but hopefully this brief description should give you an idea.

I now only leave, as an exercise to the reader, the five example expressions given in the previous section, and considered how any desired result (simplified form) can be achieved through the selection of the appropiate fitness function (using either of the two I have just proposed). Of course, feel free to post a comment regarding any queries or findings you have regarding this matter.

An effective algorithm for simplification

So far, I have discussed how simplicity can be measured in absolute terms, and how in this way the most “simple” of a set of solutions (expression trees) can be chosen as the result of the algorithm. What I have not really mentioned, however, is how an algorithm might actually search out the possible solutions. Although the nature of the algorithm is mainly independent of the fitness function  and the evaluation of expression trees, it is helpful to discuss this second so as to give a clearer image to the approach as a whole.

Simplification when done by a computer, as when done by humans, involves at its heart the application of a large number of mathematical rules that transform expressions. To give some examples of a few of the more basic rules:

  • x + 0 \rightarrow x
  • 1 * x \rightarrow x
  • x * x \rightarrow x^2
  • x * (y / x) \rightarrow y
  • x^2 - y^2 \rightarrow (x + y)(x - y)
  • \sin^2 x + \cos^2 x \rightarrow 1

Given a complete set of all simplification rules, we can find a path (or derivation) between any two equivalent expressions. (In theory, this is no issue, since the set is finite, though rather large. The practicality of  all the required rules is another issue that I will not go into here.) Note that the rules are bidirectional; they allow you take a simple expression and sequentially transform it into an arbitrarily complex, yet still identical one. (For example, x \rightarrow x + 0 - 0 + 0 - 0 + 0.)

Assuming (quite reasonably) that this assertion that a “finite number of application of simplification rules can derive any equivalent expression from the original” is valid, we must then consider how the search should proceed. Under the utterly naive brute force approach, the algorithm is clearly non-terminating, but we can do a lot better than this.

The search algorithm is all about compromisation in essence. If we search every possible derived expression (up to a certain size), then it could take an unreasonably length of time to simplify even relatively tiny expressions. On the other hand, we make too many assumptions and cut off many branches in the search tree prematurely, the algorithm may terminate quickly, though not necessarily with the simplest solution (or even anything close). Hence, the idea of using genetic algorithms, albeit initially appealing, is in my view to restrictive to a problem that requires a “perfect” answer in most cases, and should not have any stochastic nature.

My currently planned approach is one that does not differ greatly from a simple brute force evaluation of all the simplification paths. The main improvement is one that falls out rather naturally from using a fitness function. The idea is that each node of the search tree is evaluated by the fitness function upon its creation, and if the fitness is below a certain (specified) threshold, the search from that particular node terminates. For a start, this prevents originally small/simple expression trees being transformed into ones that are absurdly large, while still allowing some limited expansion of expression trees in the hope that they may later be simplified very effectively. Many nonsensical (what we might call counter-intuitive) paths of reduction of the expression would also be eliminated in a similar way. There is also one practical problem of important note here: many disparate simplification paths along the tree do converge to the same solution during a search (some quite quickly), so it would be quite foolish to branch twice from identical nodes of the search tree. Instead, we really want to cache any nodes (expression trees) already visited during the search process, and not compute their descendants (derived children) more than once. A simple hash table (set, in fact) would seem like the most effective way of accomplish this. Creating an efficient and relatively collision-free hash function for an arbitrary tree structure is however no trivial task. I was to get a number of quite sensible and useful responds when I asked the question on StackOverflow.

Apologies if this discussion of “search trees” and “expression trees” and their corresponding nodes has led to some confusion regarding what is what. It is most important to recognise that the node of a search tree is itself an expression tree (what I sometimes call a “solution” to the “simplification problem”). Due to the risk of losing reader interest at the cost of an even longer post, I shall stop my ramblings here, and leave further elaboration of the search algorithm for another post.

The future of the project

What has been discussed so far is largely theoretical, yet I have tried to present it in such a way that the method of implementation is for the most part self-evident. Work on this project will likely proceed slowly in the short-term future, though as it advances, the features will surely solidify. I am hoping that at least by summer there should be some tangible results to these efforts. Regardless, I shall try to give status updates along the way.

Electronic Lecture Notes

Posted in Personal, Projects, Software on November 28th, 2009 by Noldorin – Be the first to comment

Lectures at college and the inseparable hours of studying have been taking up much of my time recently, and in this mindset have made me start thinking about how I could somehow improve that wonderfully tedious and hand-cramping task of taking lectures notes from the board. Ok, so taking lecture notes by hand is far from the biggest inconvenience any student experiences on a regular basis; nonetheless, the idea came upon me that if I could create and store all these notes electronically, I’d be saving myself a good deal of pain (presently and during exam season).

The main challenge of this plan was finding some way to make the input/conversion process for turning what’s written on the blackboard ultimately into some prettily formatted pages, quickly and trivially. Although I’m sure it’s been done by many before, using MS Word or the like immediately struck me as impractical – indeed, worse than pen and paper, in my mind. Hence, my initial thoughts were centered on using a lightweight markup of some sort, such as YAML, Markdown, or perhaps simply plain text. (The text could then be rendered nicely by a program at some later stage.) My plans were to write a small editor application specifically for some format, with facilities such as auto-completion and an in-built previewer to make the note-taking even more efficient.

In the end, my mind was decided when my friend David suggested that I simply use the widespread wiki markup language reST as the primary format. This turns out to be almost perfect, since the Python docutils package contains an advanced and stable reST-to-LaTeX converter – and what could be prettier than LaTeX for displaying notes full of complex formulae. He also kindly offerred to lend me his Eee PC. I daresay this seems like an ideal combination…

Now, I must confess that I have not in fact yet begun to adopt this wonderful approach of taking notes, despite my being quite eager to start. Given that there are only about two weeks remaining before the current courses finish and Christmas holidays begin, I thought that I’d just set up my environment and at best give the plan a “test run”. Beginning in January, however, I very much intend to be using nothing but the Eee PC for notes. I will of course duly report back how the matter turns out, and with any luck, begin a weekly routine of uploading lovely LaTeX PDFs of (not so lovely) physics material!

TeX.NET 0.1.0 released

Posted in Maths & Science, Programming, Projects, Software on October 8th, 2009 by Noldorin – Be the first to comment

Some news to report at last! I’ve just released the first the first beta version of my TeX.NET project over at Launchpad. Both the source (with unit tests) and binaries of what might be called “a parsing and writing library for mathematical expressions written in the TeX format”; programmed, of course, in C# 3.0 (with some rudimentary F# bindings).

This project hasn’t exactly been the focus of my blog posts recently (or at all, for the matter), mainly because I’ve found that working on the thing has taken almost all of my free time over the past month or so.  As I have come to learn, putting effort into promotion of ones ideas and work is not a wise thing to do before they have proper substance!

Since I’m not one to reword any text that I’ve already (quite carefully) written, I shall quote the project summary over at Launchpad:

TeX.NET is a library for the .NET Framework 3.5 that provides lexical analysis and parsing of mathematics written in the TeX format into expression trees, and conversely the writing of expression trees in the same textual format. The purpose of the library is to provide a useful framework for inputing, outputing, and manipulating complex mathematical expressions.

Now, you may of course be curious where I’m taking this project in the grand scheme of things. (Indeed, a library for parsing of mathematical expressions isn’t terribly useful on its own.) Although it’s far from an in-depth explanation of the end goals and my approach, I suggest you check out the Syracuse Project, which is the “container” or “super-project” for TeX.NET, among other related libraries and programs. I can however say that I will be shifting my efforts more towards the Euclid.NET project, now that I have a stable initial version of the requisite parser/writer library. It is bound to be the subject of one of my upcoming posts, anyway.

So, if anyone is inclined to give my TeX.NET library a whirl (or better yet, some further rigorous testing), I would urge gratefully them to do so. Any sort of feedback on this project is welcome, as with all of my endeavours. My aim the moment is mainly to garner a nucleus of interest and see this library applied to great things!

Demo of Euclid.NET Launched

Posted in Programming, Projects, Software on September 20th, 2009 by Noldorin – Be the first to comment

Just a quick note that I’ve recently put up an interactive demo of my Euclid.NET project (symbolic mathematical framework for .NET).  This project is based off my TeX.NET parser, which is currently all of which the demo presently consists. I hope to both expand and document the support for the parser soon, as well as starting some serious work on Euclid.NET specifically (most likely, a symbolic differentiator to begin).

Writing a Parser in C#

Posted in Programming, Projects on September 14th, 2009 by Noldorin – 1 Comment

This is just a quick note about a particularly useful article that I stumbled across in the process of writing my TeX Math parser (to be described in full in an upcoming post). The article Grammars and parsing with C# 2.0 is a wonderful introduction to writing a lexer and parser for any language represented by a context free grammar (more accurately, a parser expression grammar).

The document is written in a refreshingly lucid fashion and what is more there are straightforward examples in C# regularly interspersed. (I am not overly-inclined to give out such praise!) For a topic so mired in confusion and complexity, I though this would be an especially helpful link to share with any C# coders who are interested in entering the field, as I am currently doing myself. The content is something I believe any coder with a solid background in OOP should be able to drive straight into without much difficulty (that isn’t to say you will glean everything from a one-pass read). Of course, it is only a guide for relative beginners (albeit a fairly thorough one) – if anyone happens to have other basic or more advanced texts on the subject of similar quality, it would be great to compile a small collection of selective resources on the subject, and expand this post in the process.

Relocation and Revival!

Posted in Projects, Uncategorized on August 20th, 2009 by Noldorin – Be the first to comment

In the past week I’ve finally gotten round to setting up a website under my own domain name, and with a bit of effort, relocating my entire blog from WordPress.com to my new Noldorin.com domain. Much more on the reasons for this another time; for now, I just want to say that this will be the future location of all posts. (Do kindly update your feeds list with the new address.)

A couple of improvements to the blog format, in particular:

  • It is now using the SyntaxHighlighted Evolved plugin for formatting source code, which looks noticably prettier than the old style. All code snippets in existing points have been updated.
  • I have activated the WP LaTeX plugin for displaying mathematical formulae. All though no existing posts use any proper maths, I surely intend to include some in future ones, and this shall only encourage me.

As I now have unrestricted access to the setup of my blog, I will most probably be playing around with a number of both handy and useless plugins in the coming weeks.

And finally: yes, I will be returning to a regular blogging schedule very shortly! My hiatus has lasted significantly longer than expected, mainly due to my inability to find time in the short break between the end of exams and the start of my work developing a large business application. Well, I am hoping to the get the core of my Noldorin.com website up and running pretty soon, so that will surely be an upcoming topic of my posts. Of course, some updates on the statuses of my projects are long overdue too.

Playlist Synchronisation for Portable Devices

Posted in Personal, Projects, Software on June 11th, 2009 by Noldorin – Be the first to comment

I have recently been attempting to properly set up synchronisation between Windows Media Player and my portable music player (which happens to be my phone). Though I found that the Windows Media Player synchronisation tool does the job pretty well, it does fail in one respect: it cannot copy over playlist (WPL) files. For me, this was a bit of a nuisance, since I rely very much on playlists to categorise my music collection.

The solution for me was to write my own tool that synchronises a given set of playlists with a portable device that is compatible with WMP (Windows Media Player) – as I believe many devices tend to be. The tool works simply by finding the appropiate place on the device to which to copy the playlist files (a known XML descriptor file on the device should specify this), and then copying over these files, with the locations of the media files updated to point to those on the device.

Naturally, my choice of technology with which to write the thing was .NET/C# – this does mean that it’s not a fully standalone application, though it does only consist of a single EXE. However, thanks to a few particularly convenient features of the language/framework (primarily LINQ to XML), the code was largely trivial to write, and the majority of the ~200 lines is in fact error handling.

You can download the program here. As mentioned, it requires the Microsoft .NET Framework 3.5 (SP1) to run, which is not installed on any current version of Windows by default, so it will need to be downloaded and installed firstly if you don’t yet have it. Also, if anyone is curious to see the code, I may be able to upload that at some point.

The tool should be run from the command line, and would seem to be very straightforward to use. (Run the program with no arguments to see the help information.) An example command line to synchronise the playlists in the standard location of your user profile with a portable device on drive F might be:

pps F “C:\Users\username\Music\Playlists”

That’s all it takes. The task should finish within a matter of seconds  and then report some general information about the playlists it found and what it managed to successfully synchronised; else return an error message.

NB: If you’re wondering how the synchroniser matches the media files on the device with those in the playlist, I have a small admission to make. Because the directory structure is not guaranteed to be the same on the device as at the location of the source media, the current version simply matches media items by file name. This works perfectly well for me, though there is clearly a caveat. I am looking for an improvement on this method, and while I have a few ideas, I haven’t finalised my decision yet. Any recommendations by someone more knowledgeable on the subject would be appreciated.

Now, this program was designed primarily for my own use, but I did consciously attempt to make it usable with any WMP-compatible portable device, so hopefully people shouldn’t have any major problems using it.

Finally, it would be nice to hear any feedback regarding this little tool of mine, so please feel free to drop me a message (even if it’s just to say you’re using it). If I hear any suggestion for a worthwhile feature to add (or of course a valid bug report), I will gladly update the program.

Numerical Analysis for .NET

Posted in Personal, Programming, Projects, Software on May 15th, 2009 by Noldorin – Be the first to comment

During my ongoing work on a computational project for university, I recently discovered the need to perform some serious numerical analysis from my C# code.  Unfortunately, I must admit that the .NET world only now seems to be catching up in terms of the free and open source libraries it offers for various tasks, and initially I was disheartened to find that there seemed to be nothing available for doing calculations on large (sparse) matrices. After a fair deal of searching, only a couple of somewhat incomplete and no longer maintained matrix libraries turned up. Being an avid user of StackOverflow, however, I decided that if anyone was aware of some library that could do what I needed, I would most likely find them there.

The result was much better than for what I was even hoping. dnAnalytics is a general-purpose package for numerical analysis in .NET that does almost everything for which I might possibly ask – and from my first impressions, does it very well indeed. This wonderful find is a well-maintained, fully open-source, library with great API documentation (not a wholly unexpected thing, but surprisingly uncommon among so many open source projects). There are several features that stand out as particularly impressive. One undoubtedly is I/O classes for Matlab and delimited files (among other formats). What is more, the library seems to offer both a fully managed version and one that wraps the Intel® Math Kernel Library. I’m not sure how the performance compares between the two (I haven’t yet tried the latter), but it is surely nice to have the pair of options available, quite similarly to how you have alternatives of cryptographic algorithms in the .NET BCL, that is to say, a) a fully managed version, v) a version based on top of the Windows Crypto API, c) a version that uses the CNG (Next Generation) API introduced with Vista. Perhaps what appeals to me the greatest about this library is that the developers have clearly gone to an effort to make it user-friendly, not only with regards to the documentation, but also by adding an interface friendly to F# coders (likely to be a language of choice for future mathematical/scientific programming), and even visual debuggers for Visual Studio (possibly the only library to date I’ve seen include them).

My particular usage of the library requires me to use the linear algebra (specifically, sparse matrix) classes. Although I must point out that the specific algorithm that I was intending to employ for the project was not available (see my later discussion), it did include a host of other ones, primarily focusing on direct and iterative matrix decomposition, which would appear to be quite handy in many circumstances. I haven’t yet had a chance to play with the other areas of the library, but I have noticed that it offers some statistical functions and methods as well as a number of modern pseudo-RNG algorithms such as the Mersenne Twister.

To conclude, I should come back to the point that the most important part of the analysis I require was not (at least directly) contained by the library – finding the eigenvalues or eigendecomposition of large (1000s of rows/columns) matrices, which happens to be in relation to spectral theory, in case you’re curious. Even so, being such a complex field and one fraught with difficulties when it comes to implementation (numerical instability is a huge problem), I was not surprised to find that an implementation of the Arnoldi or Lanczos algorithm was not present. Fortunately, after a bit more searching around (by this point I knew specifically what I was looking for), I came across the ARPACK library, written in the archaic Fortran77 language. It did however seem to be exactly what I was looking for: a set of fast routines to find the eigenvalues of large (either dense or sparse) matrices of various types. After only a small amount of pain messing about with MinGW, I managed to get the code compiled nicely into a DLL. At this point, I am of course perfectly able just to use the P/Invoke capabilities of .NET and do some hackery to integrate the ARPACK stuff with my existing code and dnAnalytics. Yet, I am also inclined to do this whole task properly and basically write a managed wrapper for ARPACK that is tightly conforms with dnAnalytics. I could then perhaps submit these wrapper types (along with a few unit tests?) as a repository patch to the dnAnalytics team in the hope that they’ll take it and add it to the next release. As with most other projects at this time, I will have to see what time permits me, though I would certainly hope to contribute something substantial to what truly is a terrific project that I would love to see expand further.

LINQ to YAML

Posted in Programming, Projects, Software on May 14th, 2009 by Noldorin – Be the first to comment

LINQ to XML is one of the many technologies introduced with the .NET Framework 3.5, and one that is certainly a step forward in terms of usability. It allows querying in both the functional style (using LINQ and lambda expressions) and the more traditional imperative one, meaning that it’s a great tool for concisely working with XML data in any sort of application, and undoubtedly a significant improvement over the old XML DOM that resides in the System.Xml namespace.

In the spirit of LINQ, and with the advent of YAML, I recntly decided it was about time that this new “markup language” were integrated with LINQ. Surprisingly, there does not already exist anything akin to LINQ to YAML out there (though there are a couple of fairly usable implementations of a YAML reader/writer for .NET). This seemed to me like a good chance to potentially create something that might be used by more than the odd .NET developer or two. My plans are to implement a LINQ to YAML provider either from scratch or on top of one of the existing YAML libraries. (Which option I choose will depend on the state of the existing projects, which I haven’t yet investigated properly. I am however suspecting that it might be worthwhile writing my own, since it would a) teach me all the intricacies of YAML, and b) allow me to support the latest version [1.2], which the existing libraries do not.)

Before I launch into an overview of my intended implementation, here is a little bit about YAML itself, for those who aren’t already familiar with it. Although technically YAML isn’t a markup language (after all, the recursive acronym stands for YAML Ain’t Markup Language) – it is rather a serialisation format – it does essentially fulfill the the role that XML  traditionally has, in a variety of common situations. I’m not going to try to sell the format to you right now, but it should suffice to say that you wouldn’t have reached this far in the post if you weren’t already at least intrigued! Without doubt, the format is actively gaining popularity because of it’s ultra-lightweight syntax and suitability for hand editing, perhaps the two points that summarise its advantages over XML.

Anyway, here’s a short example of a YAML document (taken straight from the Wikipedia page), so you can see precisely how pleasant it is to work with (at least for humans).

receipt:     Oz-Ware Purchase Invoice
date:        2007-08-06
customer:
    given:   Dorothy
    family:  Gale

items:
    - part_no:   A4786
      descrip:   Water Bucket (Filled)
      price:     1.47
      quantity:  4

    - part_no:   E1628
      descrip:   High Heeled "Ruby" Slippers
      price:     100.27
      quantity:  1

bill-to:  &id001
    street: |
            123 Tornado Alley
            Suite 16
    city:   East Westville
    state:  KS

ship-to:  *id001

specialDelivery:  >
    Follow the Yellow Brick
    Road to the Emerald City.
    Pay no attention to the
    man behind the curtain.
...

Of course, the great thing about YAML, which is demonstrated clearly by this example, is that you don’t have to have any real knowledge about YAML to understand exactly and immediately what the data represents, and as a bonus it doesn’t hurt your eyes to stare at for too long! Even the referencing syntax should be fairly self evident. (Syntax such as &id00 and *id001 would surely be nothing new to C programmers.)

The semantics as well as the syntax of YAML obviously differ to those of XML greatly, although there is almost always some sort of correspondence between the features and possibilities that the two formats offer. The only notable missing feature when contrasted to XML is attributes, yet their usefulness is questionable anyway.

Right, so now I ought to explain a bit about how I actually plan to design this library. The basic framework will be virtually equivalent to that of LINQ to XML. In other words, the hierarchy will be largely based around an abstract YamlObject (YObject?) class, and will look very much like the one contained within System.Xml.Linq.

Diagram of LINQ to XML class hierarchyLINQ to XML class hierarchy

Though LINQ to YAML must of course accommodate for the unique nature of the format, I would initially aim for minimal difference and only significantly adjust the hierarchy when it is found to be necessary. Classes such as XCData and XDocumentType would not apply at all to YAML, yet there would need to be a place for a YReference or such somewhere in the hierarchy. The referencing aspect of YAML will likely prove to be one of the more interesting challenges; while YAML’s lists, maps (dictionaries), and combinations thereof would seem relatively straightforward with regards to emulation of the LINQ to XML design, references would introduce a substantially novel concept. Some sort of implementation of lazy evaluation followed by concrete referencing should be able to solve the problem, but there’s no way to predict how well this might work in practice at this moment.

What I realised only after deciding to create a LINQ to YAML library is that among LINQ providers, LINQ to XML is somewhat special in that the LINQ aspect of it is built on top of LINQ to Objects (i.e. LINQ using IEnumerable<T> objects), with only a relatively small number of extension methods specific to LINQ to XML. Indeed, most LINQ providers (LINQ to Objects and LINQ to SQL among others) require you to implement the IQueryable and IQueryProvider interfaces to provide complex logic for interpreting and returning the results of expressions, as well as evaluating complex expression trees.  All this means that I can pretty much just design a DOM  to a certain style (i.e. one suited to functional code, like LINQ to XML), and let LINQ to Objects to everything else for me.

As I can’t think of anything more worth mentioning about my project at this time, I shall leave any more specific and complex details to a future post. Still, do by all means feel free to query me about my plans – I would be glad to answer any questions, and even gladder to receive some suggestions as how you think I might design LINQ to YAML, or simply a nod that you might find this useful at some point. I don’t anticipate this project to be a very long one, though I must say that both my work and free-time schedule are likely to be fairly messed up for the next month or two, therefore I’m not going to promise when I’ll get around to my initial release. Whenever it so happens, I will duly post the link to the project page on Launchpad (or wherever I happen to host it).

Strongly-Typed CSV Reader in C#

Posted in Programming, Projects on May 3rd, 2009 by Noldorin – 1 Comment

As part of a project on which I’ve recently started working, I found it necessary to write a class that reads entries from CSV files. Such a simple format, you might think, so why would I bother sharing such trivial code? Indeed, it is a relatively short class, but I thought I’d post it here nonetheless, primarily because I believe its usage promotes a design practice of which I am particularly fond, and I suspect (hope) other people may appreciate as well. There are also a few bits of code that might be considered interesting (and unusual) from a language/design perspective.

When I decided to formalise the logic for reading from CSV files, I firstly thought it would be nice to write something in the spirit of .NET 3.5 – in this case, easily compatible with LINQ, fully generic (strongly-typed), and attribute-oriented (as seems to be the trend in APIs nowadays). Before I launch into any further discussion, here’s the code for the class in full.

using System;
using System.ComponentModel;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Text;

namespace NetworkAnalyser
{
    public class CsvReader<TEntry> : IDisposable where TEntry : struct
    {
        private StreamReader streamReader;
        private FieldTypeInfo[] fieldTypeInfos;

        private bool isDisposed = false;

        public CsvReader(string path)
        {
            streamReader = new StreamReader(path);
            Initialize();
        }

        public CsvReader(Stream stream)
        {
            streamReader = new StreamReader(stream);
            Initialize();
        }

        ~CsvReader()
        {
            Dispose(false);
        }

        public void Dispose()
        {
            Dispose(true);
            GC.SuppressFinalize(this);
        }

        protected virtual void Dispose(bool disposing)
        {
            if (!isDisposed)
            {
                if (disposing)
                {
                    if (streamReader != null)
                        streamReader.Dispose();
                }
            }

            isDisposed = true;
        }

        public IEnumerable<TEntry> ReadAllEntries()
        {
            TEntry? entry;
            while ((entry = ReadEntry()).HasValue)
                yield return entry.Value;
        }

        public TEntry? ReadEntry()
        {
            var line = streamReader.ReadLine();
            if (line == null)
                return null;

            var entry = new TEntry();
            var fields = line.Split(new char[] { ',' }, StringSplitOptions.None);
            FieldTypeInfo fieldTypeInfo;
            object fieldValue;

            for (int i = 0; i < fields.Length; i++)
            {
                fieldTypeInfo = fieldTypeInfos[i];
                fieldValue = fieldTypeInfo.TypeConverter.ConvertFromString(fields[i].Trim());
                fieldTypeInfo.FieldInfo.SetValueDirect(__makeref(entry), fieldValue);
            }

            return entry;
        }

        private void Initialize()
        {
            var entryType = typeof(TEntry);
            fieldTypeInfos = (from fieldInfo in entryType.GetFields(BindingFlags.Instance |
                                  BindingFlags.Public)
                              let fieldTypeConverterAttrib = fieldInfo.GetCustomAttributes(
                                  typeof(TypeConverterAttribute), true).SingleOrDefault()
                                  as TypeConverterAttribute
                              let fieldTypeConverter = (fieldTypeConverterAttrib == null) ? null :
                                  Activator.CreateInstance(Type.GetType(
                                      fieldTypeConverterAttrib.ConverterTypeName))
                                  as TypeConverter
                              select new FieldTypeInfo()
                              {
                                  FieldInfo = fieldInfo,
                                  TypeConverter = fieldTypeConverter ??
                                      TypeDescriptor.GetConverter(fieldInfo.FieldType)
                              }).ToArray();
        }

        private struct FieldTypeInfo
        {
            public FieldInfo FieldInfo;
            public TypeConverter TypeConverter;
        }
    }
}

(Please excuse the utter lack of comments in the code. Most of it is self-explanatory, but admittedly some parts are probably not. I put it together pretty quickly, but I may get around to commenting it some time soon. Some basic error handling might also be nice.)

At this point it may seem rather excessive just to read data from a CSV file, but I hope you’ll agree that it’s worthwhile once you see an example of typical usage.

The first step is to define a structure (struct) that holds each entry in memory. Here we’re going to define one that holds some basic information about a programming language.

public struct LanguageEntry
{
    public string Name;
    public string[] Paradigms;
    public string LatestVersion;
    [TypeConverter(typeof(CustomDateTimeConverter))]
    public DateTime InitialRelease;
    [TypeConverter(typeof(CustomDateTimeConverter))]
    public DateTime LatestRelease;
    public float Popularity;
}

The TypeConverter attributes are completely optional, and are only required when you’re reading some fields that have unusual formats and whose values you would like to convert to something simpler/more accessible (e.g. a string “Jun2002″ to a DateTime object in this case). For any field of a type recognisable by the default type converter, you don’t need to bother, as is shown for the double type. (This actually applies to a very large range of types within the BCL, including System.Drawing.Color, which can be specified in any format that you might use in the property editor of Visual Studio, such as “DarkRed”.)

Finally, here’s a snippet to show how you might actually use the CsvReader<TEntry> class to read from a CSV file. This example reads all entries from the languages.csv file and prints out to the console the names of all functional languages.

using (var languagesReader = new CsvReader<LanguageEntry>("language.csv"))
{
    var languages = from lang in languagesReader.ReadAllEntries()
                    where lang.Paradigms.Contains("Functional")
                    select lang;
    foreach (var lang in languages)
        Console.WriteLine(lang.Name);
}

Hopefully that’s now convinced you that this is the right way to go about reading data entries from files. What this class provides is completely strongly-typed I/O (reading in this case, though it wouldn’t be very hard to create a similar CsvWriter class), and a declarative manner to defining entry types (or records, to use database terminology).

I’m not going to delve too deeply into the implementation of the class, but I think it’s worth highlighting a few specifics. Going back to the code for the class, the first thing to notice is the Initialize method – this is where much of the interesting stuff is happening. To summarise: it loops over all the public fields of the type specified by TEntry, gets the default type converter for the type of each field (or the one given by TypeConverterAttribute, if it exists), and then stores the FieldInfo along with the TypeConverter in a simple struct. The only other noteworthy point is the call to SetValueDirect in the ReadEntry method. This uses a keyword that’s almost wholly unknown (and undocumented!) to C# developers by the name of __makeref (there are other related ones by the names of __reftype and __refvalue) – I was certainly unaware of it before today. The problem that I initially encountered was one of using the SetValue method, which works perfectly well on classes, but presents a unique problem with structs: namely, because they are value-types, and the obj parameter is of type object, the argument must be boxed (wrapped into a reference type) and placed on the heap rather than the stack, meaning that the heap-based copy gets altered, and not the one you passed to the method (which is on the stack)! What the __makeref keyword does is create a TypeReference that directly references the stack-based object and thus allows SetValueDirect to set the field accordingly.

That’s enough explanation, I think. If you still aren’t sure about how it works precisely, then feel free to comment on this post. I’d also be quite happy to hear what anyone thinks of the general design and implementation, too.