Saturday, October 29, 2005
The myth of the stored procedure
I just came across an interesting thread on the Ruby on Rails blog here. I'm not going to touch the arrogance issue because that's too subjective. What I will address are two things:
1 - the need for stored procedure support
2 - the need for xml configuration
As the reader will have no doubt surmised, I don't believe in stored procedures; however, that wasn't always the case. Having been primarily a developer and secondarily a DBA for nearly 12 years now, I've had the opportunity to work on some massive systems and to lead the design and implementation of some rather large ones. Initially I used stored procs heavily, but over time their glaring problems have become rather obvious.
As I see it there are three primary problems with stored procedures:
They don't scale. I can practically hear the outrage at that statement. What are you talking about? Stored procs don't scale? You'd have to be crazy! Before you go huffing off, let me explain. When I talk about scalability I am not referring to machine performance per se. I'm talking about the position that on any system of sufficient size, the most important problem and least scalable resource is the allocation of programmer time and programmers' ability to comprehend the complexity of the system. I'm sure there are exceptions to this rule, but I am talking about the vast majority of systems rather than a few special cases. Ok, so you believe it or you don't; how exactly do stored procedures fit into this picture? It's a generally recognized principle within the software development world that code duplication is bad. Cut-and-paste coding causes problems because logic is duplicated all over the place: when you make one change, you then have to go find every other place where that logic resides and duplicate your change there. All software developers know this, and there are a million ways to avoid the issue in application code. But when it comes to databases, somehow that rule no longer applies? Stored procedures are effectively cut-and-paste coding of SQL. It's not a problem if you only have one or two databases to deal with, just like cut and paste isn't a huge issue if you only have one or two code files. Multiply that by a few hundred or a few thousand, though, and you have a serious problem. If you have ever had to propagate a changed stored procedure to a couple of hundred databases, you know what I'm talking about. Now I'm not saying that stored procedures are evil in all circumstances, but as with all things in computer science, their advantages and disadvantages must be understood and weighed. Generally I would say that building a system on top of stored procedures is a bad idea.
Stored procedures are not portable. This is kind of a no-brainer; typically the database vendors don't even try to argue that their stored procedure syntax is portable to other systems. Usually their argument goes something like "to port this application, all you have to do is re-write the stored procedures." Generally I think that's probably not even true, but let's assume for a minute that it is. Apply that logic to the scenario above: you have to re-write your stored procedure layer and then propagate those changes out to hundreds or thousands of databases. Hard to imagine how people get locked into a database, isn't it? I can tell you from first-hand experience that it's unlikely any significant system will be ported under those circumstances without a very serious reason. So the basic idea is to build your system in a way that lets you easily use another database if necessary (and this includes new versions of the same database, by the way).
Finally, stored procedures don't facilitate reuse. This is related to the first issue, but it isn't identical. In this case I'm not talking about duplication over a large number of databases; rather, I'm talking about leveraging your codebase in different situations, whether within the same database or within a different system. The point is that anything you put in a stored procedure stays in that stored procedure. There really are no good methods for high-level abstraction within stored procedures. Maybe there is some method that applies to some specific database; if so, it's not portable to other databases. Through careful use of abstraction in code, on the other hand, you can reuse your logic in other situations.
I know there are other issues with stored procedures, but for me those are the big ones. Now to XML tags and configuration.
As with stored procedures, XML is one of those things that should be used at the right time under the right circumstances. XML was designed as an abstraction for marking up data; it was never designed to be a programming language. I have had the misfortune of using the XML-heavy frameworks (J2EE, ASP.NET, Zope). XML definitely makes those systems more configurable, but one thing it definitely doesn't do is make development faster. Wading through the morass of XML configuration has never sped me up, but it has always dramatically increased the time to implement.
I'm sure there are situations where XML is useful, but as an entrepreneur I am less interested in those situations and more interested in the situations where I can leverage the power of DRY and sensible defaults to maximize my time. That philosophy more than anything else is what sucked me into Rails. The sheer joy of focusing on the problem and not the damn configuration files cannot be overstated. If you really think XML is the solution, please stay with the XML frameworks. That is one direction I sincerely hope Rails never moves.
Tuesday, October 18, 2005
The joy of windows :/
So I took today off from work as an extra day to spend some time with my wife and recuperate from my trip to Startup School. Of course, I'm typing this message from work. So much for that extra day off.
We do e-discovery here at work, and we use a collection of tools that imo are worst of breed. We use Windows as a file server to house hundreds of millions of image and text files. If you think Windows doesn't scale well to those numbers, well, you're absolutely right. It doesn't; we have all kinds of problems dealing with the size and number of files. Then we use SQL Server as our database backend. Certainly that's a step forward from Access (which they used to use), but it has all kinds of scaling problems for what we are doing, not to mention the horribly designed structures we are torturing it with. Now add to that a Visual Basic program that does some automation to load other programs and extract data, and you have barely a hint of the problems I deal with at work.
The upside is that I spend most of my time working with Python, and I have a really bright team of people that works with me. Our team uses Python to process text and images; we are not responsible for the platform decisions, we just have to work within them.
So one of our systems runs a C# application designed by another team, and the system is just falling apart. SQL Server 32-bit apparently cannot use more than 1.7 GB of RAM without a hotfix, and the hotfix says not to apply it on a server with only 4 GB of RAM. Great. So we bought a quad-Xeon server with 32 GB of RAM from Dell. After some problems we finally got the 64-bit version of Windows installed. Whoo... time to get the 64-bit version of SQL Server installed. HA! That only runs on Itanium servers; ours is some other whacked 64-bit architecture. wtf?!
Ok, time to install the 32-bit version of SQL Server; so far a complete waste of about 6 hours. After we got it installed, we installed Service Pack 4 for SQL Server and the AWE hotfix to allow SQL Server to address more than 2 GB of RAM. Reboot the server (because that's always a good choice with Windows). To test, we set SQL Server to use a fixed amount of RAM rather than dynamically allocating it, and we set it at 10 GB. Now, one might think that the SQL Server process would then start ramping up its memory usage and/or spawn a few processes to take up 10 GB, right? Wrong! It sat there at 128 MB of RAM. We tried several different options, none of which worked. At this point my boss suggested we look at Perfmon, because he heard that's the only place you can see how much memory it's really using. Guess what: none of the SQL Server performance counters are there. Recheck the install; hmm, everything ok there. Reboot the server (because that's a good idea under Windows); crap, they still don't show up. Apparently the 32-bit counters don't work under a 64-bit OS. Unfortunately the Perfmon counters are critical to our performance evaluation; without them we have no idea what SQL Server is doing.
So now we are down to reinstalling a 32 bit version of windows on the 64 bit quad processor server we have. Of course the night is still young.
Hard to imagine why I refuse to use Windows and SQL Server outside of work, eh? If it were my company, open source products would be the only way to go. Or at the very least, if you want to play in the enterprise, you need real enterprise-level products. As for me and my house, we'll use Rails on Linux.
Monday, October 17, 2005
I just got back from Paul Graham's excellent Startup School. What an incredible experience! If you are a would-be entrepreneur, I highly encourage you to attend next year. The depth and quality of the speakers was as impressive as at any conference I've ever been to. I won't bother to rehash the content; if you are interested, check out this tag on del.icio.us: Startup School. The summaries there are fantastic and represent an incredible amount of work. I can't wait for the video and audio to be posted.
On Sunday following the conference, Paul held an open house at Y Combinator for people who wanted to talk to him in more depth than the 5 minutes allowed between speakers. Additionally, several of the Summer Founders were there, including both of the guys from Kiko.com and Aaron Swartz from Infogami.
I had a great opportunity to talk with a lot of really bright people about their ideas. I also had the chance to see several people pitch their ideas to Paul Graham and to other people at the event. It was very instructive to see how their presentations went, and which ideas people liked and which flopped. Overall people were very nice and very helpful to each other, even people who would potentially be competitors in the marketplace. Here are some things I noticed about presenting your ideas:
A Working Demo: This was stressed over and over during Startup School; a working demo is an incredibly powerful and effective way to get your idea across. Why is a demo so important? For one thing, a demo separates the doers from the dreamers. A lot of people have good ideas; not many people will even go to the trouble of putting together a demo. I know dozens of people who want to start a business, but the number who have done anything in that direction can be counted on one hand. In the end, it all boils down to action. A demo also provides a visceral and concrete representation of your idea. Explaining your idea really is a poor substitute for showing it.
People don't need the background: Your presentation shouldn't turn into a teaching lesson on the specifics of your industry or field. I saw someone talk to Paul for almost an hour; he spent a great deal of time trying to explain the background and specifics of his industry. I think this means he hasn't done the next point: his idea isn't really crystallized yet.
Boil your idea down to its core: It's important to really know what you are trying to do: what problem you are trying to solve, and what differentiates you from your competitors. You should be able to summarize this for someone in less than sixty seconds. That doesn't mean there isn't more elaboration to your idea, but it gives someone a high-level context to work within and something very specific to focus on. The tighter and more focused you can make your idea, the more likely you can actually pull it off. It's much easier to build something that is focused on very specific ideas than something that is still vague and nebulous in your own mind.
Know the competitive landscape: Ok, the other person understands your idea; they know generally what you are trying to do. At this point the natural progression seems to be "Have you seen X?", where X is something they see as a competitor. Clearly it's not possible to know of every possible competitor out there, but you need to really do some research and understand what your competitors do and don't do, what you like and don't like about their systems. Ultimately you need to know why your system is better than theirs and why people should use yours instead.
Know what you want to do: This might sound simple, but I'm not talking about having a general idea of how the site will work. I mean, you really need to have a pretty solid idea of how you are going to solve the problems you are planning to solve. How do you do this? Build a demo or a prototype.
I would like to extend a heartfelt thanks to Paul Graham and to Jessica for putting together the Startup School. It was a wonderful opportunity and it was really well done. I would attend again in a heartbeat.
Wednesday, October 12, 2005
File Server update
Remember that whacked-out file server I mentioned in April? It turns out that my random file checker script magically fixes some problem with the server. When the script isn't running we start getting these weird "file not found" messages. For some reason, every time we fire up the script the messages miraculously stop coming. If we stop the script, the errors start happening again anywhere from several hours to several days later, but they always recur. If the script is running, though, they never come back. Very bizarre.
Rooming with strangers
I'm headed to Startup School this weekend. To save on cost, I've decided to split a room with three complete strangers. Financially it works out to a good deal. It turns out that all three of them are from my state, and they live in my area. I was a bit surprised that anyone else from my state would be attending. Funny how things turn out.
I'm getting ready to release version 1 of a project I've been hacking on for a few weeks now. It's a rule-based random name generator (Rubarang). More details to come. My last site, skillfulstudent, has stalled to some degree; the partner I'm working with moved to another city and we rarely see each other any more. We talk on the phone relatively frequently, but it hasn't been enough to keep the project rolling well. Hopefully we'll get some of these kinks worked out and get it rolling again. Either way, I have several other projects in mind, and after I get this first version of Rubarang out, it's on to other things.
Tuesday, July 19, 2005
Dell 2405FPW
Well, I finally broke down and bought a Dell 2405FPW. I found a 35% off coupon on slickdeals.net, and at $779.00 I just couldn't turn it down. I've been jonesing hard for a widescreen LCD, and now it's finally on the way. Now, as I told my wife, I will finally be complete. Or not. :)
Wednesday, July 13, 2005
You can't break just some of the rules.
I found out some weird stuff today at work. I've always been told that an index needs good selectivity to be useful, particularly if it's a non-clustered index. So you wouldn't imagine that an index on a bit field would particularly improve performance. Yet today I saw a query go from two-plus minutes to one second after adding a non-clustered index on a bit field.
This really perturbed me. Why would adding an index to a field with terrible selectivity speed up a query by orders of magnitude?
Before the answer, a little history. The database structure we are working with is a legacy structure designed and created before my time with the company. This structure breaks every single rule and principle of relational design I've ever learned. It is, in all honesty, one of the freakiest database systems I've ever seen. Unfortunately there is a lot of code written on top of the structure and very little political support for rebuilding it.
Just to give you a taste of the pain, the system is designed so that we maintain logically separate databases in the same physical database and they are differentiated by a prefix.
Next, the main table for each database is a massively wide table with all kinds of interdependencies between columns. In order to make full-text searching work better, someone had the bright idea to denormalize the list structure: instead of storing lists in a separate table, we store them inside this table, in a single column, as a semicolon-space-delimited list. On top of that, another column in the same table holds the IDs of all the list items as a comma-delimited list. Of course those two columns are dependent not only on each other but also on several separate tables, and if you want to verify the integrity of any of this data, you have to pull it out of the database and run through it with a program. No referential integrity here.
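To make the fragility concrete, here is a hypothetical sketch (my own names, not our production code) of the kind of out-of-band consistency check this structure forces you to write, since the database itself can't enforce that the item column and the ID column stay in sync:

```python
# Hypothetical sketch: verify that a "; "-delimited item column and a
# comma-delimited ID column describe the same list, using a lookup table
# mapping IDs to item names. The database cannot enforce this itself.
def lists_consistent(items_col, ids_col, lookup):
    items = [s for s in items_col.split("; ") if s]
    ids = [s for s in ids_col.split(",") if s]
    if len(items) != len(ids):
        return False
    # Every positional pair must agree with the ID -> name lookup.
    return all(lookup.get(int(i)) == item for item, i in zip(items, ids))
```

In the real system a check like this has to run as a separate program over an extract of the table, which is exactly the pain described above.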
Ok, so that was a mouthful, and to be honest it is just the tip of the iceberg when it comes to the problems present within this database. Now let's get down to the original problem mentioned at the beginning: why does an index on a bit column speed up queries?
Well, the problem lies in the fact that this superwide table has a lot of variable-length columns. So if I want to run a query like select count(*) from table where bitfield = 1, the server not only has to do a table scan, it also has all kinds of trouble actually moving through the records.
It turns out that creating a non-clustered index on that bit field essentially creates a structure outside the table: a fixed-width layout containing just that field for every record. To move through the records, the server can now simply increment a pointer.
We theorized that we would see the same performance if we extracted just the bit column into a separate table and ran the same queries. Our theory exactly matched reality. So ultimately the solution is to move the data into a fixed-width structure that the server can more readily deal with.
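The effect is easy to simulate outside the database. The sketch below (illustrative data and names, not our actual schema) contrasts counting set bits by scanning wide variable-length rows against scanning a narrow pre-extracted copy of just the bit column, which is essentially what the non-clustered index gives the server:

```python
# Illustrative simulation: count set bits by scanning wide variable-length
# rows vs. scanning a narrow fixed-width copy of just the bit column.
import random

random.seed(0)
rows = [
    {"bitfield": random.randint(0, 1), "payload": "x" * random.randint(10, 500)}
    for _ in range(100_000)
]

# Analogue of the non-clustered index: a compact structure holding only
# the bit values, in row order.
bit_index = [row["bitfield"] for row in rows]

def count_via_table_scan():
    # Must drag through every wide row to read one bit.
    return sum(row["bitfield"] for row in rows)

def count_via_index(ix):
    # Walks a tiny fixed-width structure instead.
    return sum(ix)
```

Both functions return the same count; the point is that the second scan touches a tiny fixed-width structure instead of every wide row, which is why the real query dropped from minutes to a second.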
As my boss said when we finally figured out the solution: when it comes to this system, you can't just break some of the rules. You have to break all of them, and when you break enough of them, you wind up back where you should have been to begin with.
Tuesday, June 28, 2005
I am getting ready to roll out a Rails site with one of the guys that I used to work with.
You can check out the beta version of the site at: http://skillfulstudent.com.
It's been a pretty fun process building this site in rails, and it definitely makes me look forward to building other sites.
Python Unicode Woes
Ok, so I love using Python at work. It's a much more fun and open language than C#, or even worse, Visual Basic (shivers). The one thing that causes pain, though, is working with international strings. The most frustrating aspect is strings that use characters in the 128-255 range. Python's default string handling is effectively 7-bit ASCII (0-127); when it hits a character above 127 during conversion, it pukes and says it can't find a codec for the character. The irritating thing about this is that pretty much every other programming language I've used just deals with these types of strings. C#, VB, Lisp, Ruby, Perl: I've never run into these problems with them.
If that wasn't enough, Python is constantly touting that it deals with Unicode, which is great, but the solution at the present time is lacking. There are some basic and well-known byte order marks for Unicode files. For instance, on a Windows box (most of the world) your UTF-16 text will typically be encoded little-endian, whereas on many *nix systems it will be big-endian. The byte order mark for little-endian is FF FE, meaning the first two bytes of the file will be those values, whereas for big-endian it is FE FF.
To complicate matters a bit, those byte order marks are for UTF-16; there are several other encodings (UTF-8, UTF-32, etc.). The thing is, though, the BOMs are specified, they are published, they are known. Why then does Python insist on making the programmer know the specific encoding when opening the file? I've had to write my own smart wrapper function that reads the different byte order marks and uses the correct encoding in order to read the data appropriately.
Here are the Byte Order Marks:
UTF-8: EF BB BF
UTF-16 Big Endian: FE FF
UTF-16 Little Endian: FF FE
UTF-32 Big Endian: 00 00 FE FF
UTF-32 Little Endian: FF FE 00 00
SCSU: 0E FE FF
UTF-7: 2B 2F 76 and one of the following byte sequences [ 38 | 39 | 2B | 2F | 38 2D ]
UTF-EBCDIC: DD 73 66 73
BOCU-1: FB EE 28
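A minimal version of the BOM-sniffing wrapper I mean might look like this (the function name is my own). One subtlety: the UTF-32 little-endian mark FF FE 00 00 starts with the UTF-16 little-endian mark FF FE, so the longer marks have to be checked first:

```python
import codecs

# Longest marks first, so UTF-32 LE isn't misread as UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8-sig"),      # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def sniff_encoding(data, default="latin-1"):
    """Return the codec implied by a leading byte order mark, else a default."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default
```

A file-opening wrapper then just reads the first few bytes, sniffs the mark, and decodes the file with the detected codec (the "utf-8-sig" codec conveniently strips the BOM on decode).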
You can find more information at this link:
I don't expect a language to be perfect (although I would like it), but what I do expect is that when a solution to an obvious and recurrent problem is apparent, it should be implemented in the language. To me and to many others, this has been a major source of frustration.
Thursday, April 28, 2005
Whacked out file server
So we have a file server at my work that serves images and files for our web site. It houses upward of fifty million images, and recently we have started to have serious problems with the machine. All of a sudden the box becomes unresponsive to network requests for 20 to 50 seconds. Initially we thought it was the virus scanner, but we've gone so far as to completely uninstall it, and the problems persist. Unfortunately the problems are highly erratic and we are having one helluva time trying to narrow them down. I finally ended up writing a Python script that randomly checks for the existence of images, so we can determine whether the machine itself is bogging down during these outages or whether it's related only to network traffic. I would be interested in moving the images to a BSD box with a Samba share to see if we have the same problems. Intermittent server problems are the worst! :(
Wednesday, April 20, 2005
Be careful what you wish for
Well, I've been wishing for harder problems to work on. Generally I don't come across too many problems at work that require a lot of brain-bending effort to solve. Today one of the junior developers (juneys) was working on a problem that he'd spent almost two days on. He talked to me a few times during the day about it, but he didn't seem to be making much headway, so I finally dropped what I was doing and went and helped him. It turns out the problem is a pretty interesting undirected graph problem. I spent maybe an hour working on it with him, at which point I decided this wasn't a good problem for him to tackle, and sent him packing. The joy of being a senior developer. :)
I'm getting ready to build a solution for the problem right now, which is unfortunately going to keep me from spending too much time on the nested sets implementation for Rails. I got some helpful feedback from Glen on the rubyonrails weblog that the blog isn't a support forum for Rails, and that I should head over to the Rails channel or to the mailing list. I did head to the IRC channel, and someone finally suggested I search through the Rails tickets for the nested sets check-in. I did exactly that and found some test cases. In any event this will give me some test code to work with. I am envisioning something a little easier to use and set up than the current incarnation of nested sets, though. As I work through the code, if I decide the current implementation is good, I'll post a walkthrough on how to get it up and running. If nothing else, maybe I'll add an implementation for transitive closure tables. :)
Tuesday, April 19, 2005
Nested sets in rails 0.12.1
Nested sets are one of my very favorite structures: a structure that lets you deal with arbitrarily deeply nested hierarchies. I saw with release 0.12 of Rails that there was a new :acts_as_nested_set keyword in the model layer. I have spent the last two nights trying to get this code to work, and so far I haven't had any luck. Even if the code works as written, I think I'm going to try to get a better implementation in place. Having worked with this structure quite a bit, I'm pretty sure I can add a few things that will make it easier to work with and more powerful overall. Still, I have to say I'm pretty impressed that there is *any* type of support for nested sets. Just one more item in a long list of things that have deeply impressed me about Ruby on Rails.
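For anyone unfamiliar with the technique, the core idea is that each node stores a left and a right number, and a node's descendants are exactly the rows whose numbers fall strictly inside its interval. A tiny sketch (illustrative data and names, not the Rails implementation):

```python
# Minimal nested set illustration: each node carries lft/rgt bounds, and
# descendants are the rows strictly inside the parent's interval.
tree = [
    {"name": "electronics", "lft": 1, "rgt": 8},
    {"name": "televisions", "lft": 2, "rgt": 5},
    {"name": "lcd",         "lft": 3, "rgt": 4},
    {"name": "cameras",     "lft": 6, "rgt": 7},
]

def descendants(nodes, parent):
    """Every node whose (lft, rgt) interval nests inside the parent's."""
    return [n["name"] for n in nodes
            if parent["lft"] < n["lft"] and n["rgt"] < parent["rgt"]]
```

The equivalent SQL is a single range query with no recursion, which is what makes the structure so pleasant for arbitrarily deep hierarchies.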
Thursday, March 24, 2005
Well, I figured I'd start posting to this blog again. I had a different blog at http://mbawulf.blogspot.com while I was considering business school. At this point, I've ruled out attending business school, at least for the short term.
Dave Massey, one of the IE developers, posted this article: http://blogs.msdn.com/dmassy/archive/2005/03/22/400689.aspx
It is primarily a response to the Mozilla post he references there. Personally I didn't find his arguments very compelling. It is interesting to note that Microsoft is no longer singing the tune that IE is "part of the OS". They were making that claim pretty hard when they were being sued by the Justice Department. Now that the lawsuit is over, it's no longer part of the OS?
There was an interesting comment on that page as well. The commenters were talking about bug-free code, and someone said that people aren't willing to pay for truly bug-free code. This argument floored me. What do you mean, people aren't willing to pay for bug-free code? When have we even had that as an option? That type of blanket statement is, in my view, the problem that plagues Microsoft. Perhaps the problem lies in writing all of their code in C++. Maybe the problem is just that they are no longer the small, nimble software company, but the big behemoth. Ultimately, with an attitude that people don't want to pay for bug-free code, open source will continue to eat into their market share. People do want bug-free code; perhaps they don't want to pay more for it, but as long as we can get free software that is more secure and more bug-free than software that costs hundreds of dollars, people will continue to defect to open source.
Friday, January 23, 2004
Phenomenal SQL Library
One of the guys at work sent me a link to a library with 123 T-SQL functions. There are some incredibly useful functions in this library. Some of the function categories are base conversion, combinatorial, algebra, numeric, string, date, comparison, validation, logical, trigonometric, and hyperbolic.
You can find the library here
There are days when you write code that just feels ugly and hard to read. Those times when you know there must be a cleaner, easier way to do this, but you can't for the life of you think of it. It feels as if you were lobotomized against your will and without your knowledge.
Then there are those days when it clicks, and you do think of that cleaner, easier way. When that happens, and you look at the clean lines of logic and the natural flow of your code, you feel an almost mystical sense of peace and tranquility as you bask in the glow of your monitor, knowing that you have produced a thing of beauty.
Imagine being a painter, someone who is relatively good but can still see all the flaws in his paintings; each little imperfection cries out as if it were suffering an injustice. Something that others praise as a masterpiece looks like nothing more than a hodgepodge, a random collection of brush strokes without any cohesive plan. Programming is that way for me; rarely do I produce a piece of code that perfectly matches my own ideal of consistency, elegance and beauty. When I do, though, the feeling is pure elation.
Wednesday, January 21, 2004
Lisp Web Server
A few weeks ago I ran across this page.
It's an interesting study about building web applications using Lisp. This is something I've had a very serious interest in since I started reading Paul Graham's web site. They look to have built a relatively serious dynamic web site using Lisp, with Portable AllegroServe and several other open source Lisp technologies. They also mentioned a web application server they built called KPAX. They had not yet released the source code, but they said they would release it at some point in the future.
Today is that day: they have made the KPAX source code available at this site: KPAX.
I'm excited about this, because the two primary alternatives I have been considering for building web applications are Zope (Python based) and one of the Lisp based frameworks. The problem I've seen until now is the relative immaturity of the Lisp based frameworks, and KPAX is one of the last major pieces that needs to be in place. From the sound of it, the code is in working condition, but they don't plan to work on it any further. I'll talk to the authors and see how they would feel about me starting a project on SourceForge.
Monday, January 19, 2004
The End of Ender's Game
I finally finished the Ender's Game quartet last night. It was an excellent series, filled with interesting characters and interesting ethical dilemmas. The last book came across as much more talky than the previous three, although he warns you about that in the prologue to Speaker for the Dead.
If you've never read Ender's Game, I highly recommend it. It's one of the best books I've read. The last three books are very different from Ender's Game, although they build upon the story and elaborate on the ethical dilemmas in much more detail.
After finishing Children of the Mind, I started reading Angels & Demons by Dan Brown. If you don't know, Dan Brown is the author who wrote The Da Vinci Code. This book is the first in the Robert Langdon series; it's about the Brotherhood of the Illuminati. It's been very interesting so far. I really like how he puts science into the book as well: there's a decent discussion of the X-33 transport plane, and the first part of the book takes place at CERN, so you get some good instruction in particle physics as well.