Archive for June, 2009

How Not to Crowdsource (or really, how not to build an open-submission website)

@ink_slinger linked to a City of Edmonton website today called “Idea Zone”. I was intrigued, so tried to find out what it was.

Go ahead, visit the site. See if you can find out what it’s all about. I can wait…

Did you visit it? What did you learn? Probably nothing, because in order to find out what this site is, you have to register. I don’t know about you, but I like to know what a website is before giving it my vitals. And its registration form is quite detailed. Although the more intimate fields (address and phone number) are not required, they are still asked for, which is quite intimidating.

So rule one in how not to crowdsource is: Require me to register to find out what your site is.
And while we’re at it, rule two is: Ask me for this much detail without buying me dinner first.

Thankfully, Edmonton blogstar Mack Male has a blog post that explains it all. Not a terrible idea, overall. I think the low signups at the ICLEI Congress should have taught them that something was terribly wrong, but that’s a whole other matter.

The purpose of the site (for people who don’t want to click) is to get ideas for how to make Edmonton a better city. I sometimes think too much effort is put into finding ways to make this city better, and far too little into actually doing anything about it, but more input from a broader spectrum probably isn’t a bad thing.

So I register, leaving out the gory details of my location, age, and phone number, as I see no reason for them to have them. They do one thing right here and skip the activation step, so kudos on that. But they waste the benefit of skipping it by not logging me in immediately.

So rule three: Make me activate my account with a link in an email. Then make me log in even though I just gave you my username and password and so can’t possibly be faking who I am.

So I log in. Before I can do anything, I have to agree to this obscure set of rules about how my submission may or may not be used. I don’t really care. By now, I have gone through several forms, been frustrated and limited in what I can do at every turn, and am not really interested in submitting anything at all.

So a final rule before I sum up: Make me agree to rules I don’t care about even though this could have been a simple checkbox in the signup page.

But I go on, because now I’m on a roll and writing a blog post about it. Somehow I just can’t stop myself. I find that, since Mack’s post, there have been 10 new users and *no* new ideas posted. This comes as absolutely no surprise to me. I’m also too tired of signing up to enter an idea now.

I guess the city is using this software because they’re using a version of it internally with a stronger workflow. This is the public-facing version of that software. Well, I’m just going to come out and say it. The public-facing side of it is crap.

When you’re crowdsourcing, your goal should absolutely not be to filter users out early. Early filtering is a super important thing for most sites to do, because they’re happy to filter out all of the rough at the expense of some diamonds. In crowdsourcing, unfortunately, you can’t afford to do this. The entire purpose of this process is for *you* to find the diamonds. That means a bit more work on your part, but it also means a less frustrating user experience.

Not only should users be able to see what the hell this site is without logging in, they should be able to see submitted ideas and even submit their own ideas with either a minimal (username/password) account creation or no account creation at all. It should be moderated and filtered on the level of word-triggers (no one will suggest improving the city with anything to do with penises, for example), but it should be *easy* to submit ideas.
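A word-trigger filter like the one I’m describing can be very small. Here is a minimal sketch in Python. The trigger list, the function name `needs_review`, and the idea of holding matches for a human moderator are all my own assumptions for illustration, not anything Idea Zone actually does.

```python
import re

# Hypothetical trigger list; a real site would maintain its own.
TRIGGER_WORDS = {"penis", "viagra"}

def needs_review(idea_text, triggers=TRIGGER_WORDS):
    """Return True if the submission contains a trigger word.

    Matching is on whole lowercase words only, so plurals and
    creative misspellings would need their own entries.
    """
    words = re.findall(r"[a-z']+", idea_text.lower())
    return any(word in triggers for word in words)
```

Anything that trips the filter goes to a moderator queue instead of being published. Everything else goes straight through, which keeps the barrier to submitting low.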

Championing those ideas, commenting on ideas, there you can increase the barriers. But if your goal is to find new ideas, you must make this process much easier. It’s important to realize that internal and external tools rarely work well from the same package (see for example the dreadful WebCT — great for teachers, terrible for students).

To sum up all the rules in one sentence: Make it hard for users to submit ideas!

Drizzle: The Future of MySQL

Brian Aker, one of the main engineers on MySQL at Sun, has posted a presentation he gave on the project he’s been working on for the last year and a half: Drizzle. I highly recommend that anyone who’s interested in the state of the art of database technology watch it.

To summarize:

  • Scalability: A large part of the effort so far is along the lines of making it so that the system can scale to massive numbers of threads (and processes). They’re removing locks wherever possible and aiming for systems with 100+ cores.
  • UTF-8: This is a hugely important move. Drizzle talks exclusively in UTF-8 and bytestreams. This pushes all the character set insanity out to the client, which is really where it belongs. Unfortunately, this will probably be a stumbling block for some apps that have data that can’t be easily converted.
  • Protobuffer replication streams: Using Google’s Protocol Buffers format to put out replication information makes it really easy to write applications that do things based on the replication stream. With MySQL binlogs, this was a fairly tedious thing to do and resulted in fragile code.
  • Async protocol: This is really useful. A page load should be able to spam the server with a bunch of queries and then fetch results as needed rather than doing them one at a time. This is a big part of taking advantage of higher concurrency and reducing pageload times.
  • Built-in sharding: This is also really useful. I’m not entirely sure what their plan is, because this is the first I’ve heard of it being part of the project, but if done right this will be so valuable. Sites that need to shard often wind up implementing this from scratch. I’ve been involved in doing so myself. It certainly isn’t as scary as a lot of people think it should be, but the fear is palpable among other devs and a solid baseline implementation would raise the state of the art a good deal.
  • Plugins: Plugins are a big part of Drizzle’s re-architecting. The goal seems to be to rebuild from the ground up with as simple a core as possible (the slides say 350k lines of code, as opposed to more than 1M in MySQL 6.0) and push all extra functionality out to plugins. Areas subject to becoming plugins include:
    • Pluggable client protocols: Making it so that the client can talk in an HTTP/REST protocol for simplicity, or any protocol desired.
    • Pluggable logging: Have it log out to syslog, for example. Or to an analysis app that does custom slow query logging, etc.
    • Pluggable authentication: Turn off auth altogether, use the system’s user accounts through PAM (yes, please!), LDAP, HTTP AUTH, or just something custom. This also helps remove locks for scalability apparently.
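
To make the async protocol point concrete, here is a rough sketch in Python of a page load that sends all of its queries up front and collects the results together. This is not Drizzle’s client API. `run_query` is a made-up stand-in that just sleeps to simulate a server round trip, and the SQL strings and table names are invented for the example.

```python
import asyncio

async def run_query(sql, delay):
    """Hypothetical stand-in for a database round trip; `delay` simulates server time."""
    await asyncio.sleep(delay)
    return f"rows for: {sql}"

async def load_page():
    # Issue every query up front, then collect the results together,
    # instead of waiting for each one before sending the next.
    queries = [
        ("SELECT * FROM users WHERE id = 1", 0.05),
        ("SELECT * FROM posts ORDER BY created DESC LIMIT 10", 0.05),
        ("SELECT COUNT(*) FROM comments", 0.05),
    ]
    return await asyncio.gather(*(run_query(sql, d) for sql, d in queries))

results = asyncio.run(load_page())
```

Run one at a time, the three simulated queries would take roughly the sum of their delays. Issued together, the page waits for roughly the slowest one.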



I can’t stress enough how this is the real future of MySQL, far more so than any future versions of mainline MySQL are. As the world of the web moves towards simpler databases like CouchDB, Drizzle is the only way that MySQL will manage to be competitive on the web. Mainline MySQL just keeps getting bigger and heavier, growing towards enterprise use (towards being an Oracle replacement, really), leaving those of us who don’t need or can’t afford those features (not in $, but in response time) out in the cold.

Personally, I think that object/document store databases are the future of databases for the web as a whole, but Drizzle is the future of the particular subset where schemas are still important. And for the time being, it will come out of the gates as the most mature product among next-gen web databases, simply because it inherits MySQL’s architecture.

I’m going to be keeping my eye on Drizzle, and I think other people should too. Brian Aker has a blog and a Twitter account, and Drizzle itself is on Launchpad.

Why you shouldn’t really be using qmail anymore (or how I found a license I hate more than the GPL)

I’ve long been a fan of djb’s method of writing software. Over the years, three of his tools have served me very well: djbdns, daemontools, and qmail.

But djb has a dark side. He has some strange views on filesystem layout (I’m no booster of the linux standard layout, but his views on layout are just plain strange) that I can get over if not work around. More important, though, are his views on licensing. The GPL makes me cringe (another blog post for another time), but djb goes a whole other direction: You can’t modify his code and redistribute it. You can distribute patches, you can distribute his pristine copy, but you cannot and must not distribute an altered version wholesale, in source or in binary.

Which would be fine, if it ever got updated. But it’s been years since the last release, and the world of internet mail (and spam) has changed drastically since then. The big change is backscatter. To a lot of people familiar with email tech, this is nothing at all new. But to qmail, it’s like it’s still 1999.

For the uninitiated, backscatter is when a spammer sends to known-bad addresses using a forged sender address (the envelope sender, which is where bounces go) that points at their real target, a known-good (or plausibly-good) address. The mail to the known-bad address bounces to the known-good one, so the target gets the spam from a mail server that didn’t actually mean to do anything bad. This results in a very bad reputation for the previously innocent mail server.

Think of it like mailing a letter with no stamp and with the recipient’s address written as the return address, so that it gets “returned” to them and you get around paying for postage (note: I have no idea how this works and am not endorsing any form of mail fraud).

The right thing to do, nowadays, is for a mail server to reject an undeliverable email immediately, during the SMTP conversation, with an error (like a 404 code from a web server). Because of a quirk in how qmail is designed, though, it can’t do that. It will accept all mail for all domains it knows about, then send a bounce later if it can’t deliver. Which makes it a prime target for backscatter.
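
The difference between the two behaviours is easy to see in a toy model. This is only a sketch in Python, not how any real mail server is implemented. The mailbox list and function names are made up, and the only real-world details are the SMTP reply codes (250 for accepted, 550 for mailbox unavailable) and the fact that bounces go to the envelope sender.

```python
KNOWN_MAILBOXES = {"alice@example.org"}  # hypothetical local users

def reject_at_smtp(envelope_sender, recipient):
    """Refuse unknown recipients during the SMTP conversation; no bounce is generated."""
    if recipient in KNOWN_MAILBOXES:
        return {"smtp_reply": 250, "bounce_sent_to": None}
    return {"smtp_reply": 550, "bounce_sent_to": None}

def accept_then_bounce(envelope_sender, recipient):
    """Accept everything for the domain, then bounce undeliverable mail to the envelope sender."""
    if recipient in KNOWN_MAILBOXES:
        return {"smtp_reply": 250, "bounce_sent_to": None}
    # The sender address may be forged, so this bounce can land on an innocent third party.
    return {"smtp_reply": 250, "bounce_sent_to": envelope_sender}
```

With a forged sender of `victim@example.com` and a bogus recipient, the first policy just says no to the connecting client. The second sends a bounce, spam attached, to the victim.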

The process for solving this with qmail involves a fairly tedious and possibly risky set of steps. You have to patch your copy of qmail to add scriptable hooks to the e-mail accept phase. You then have to add a script to this system, probably written in bash or Perl, that does all the processing qmail intends to do later on to figure out whether the message is deliverable. If it’s not, the script returns an error code. If it is, the message goes through the normal qmail process.
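
The core of such a hook script looks something like the sketch below. I’ve written it in Python rather than bash or Perl just to keep it readable. Everything here is hypothetical: the domain and mailbox sets stand in for qmail’s real delivery rules, and the exit-status convention is an assumption, since the actual interface depends on which hook patch you apply.

```python
# Hypothetical stand-ins for what qmail would work out later from its own configuration.
LOCAL_DOMAINS = {"example.org"}
LOCAL_USERS = {"alice", "postmaster"}

def is_deliverable(recipient):
    """Return True if `recipient` is a mailbox this server could actually deliver to."""
    local_part, _, domain = recipient.strip().lower().rpartition("@")
    return domain in LOCAL_DOMAINS and local_part in LOCAL_USERS

def hook_exit_status(recipient):
    """Assumed convention: 0 lets the message continue, 1 tells the SMTP front end to reject it."""
    return 0 if is_deliverable(recipient) else 1
```

The fragile part is that `is_deliverable` has to stay in sync with everything qmail itself does at delivery time. Any rule the hook gets wrong either rejects good mail or lets the backscatter through anyway.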

This completely breaks the rather beautiful design of qmail. At this point, you may as well be using postfix, which is less beautiful but actually designed for plugins and has more modern notions about what to do about backscatter anyways.

So from being a djb booster, there’s now only one product of his I recommend: djbdns. Still the simplest, cleanest little DNS server you can run. Daemontools has lost favour to runit, which offers a modernized take on the same design and a less restrictive license.

So postfix seems to be where it’s at for email servers these days. I always felt I had a better understanding of how qmail worked, though. Maybe someday someone will ground-up rewrite it like they did with daemontools/runit.