In the beginning was the 8`(@H

According to Pope Benedict XVI in his lecture on Tuesday (the one the Muslims are upset about).

The quote is on the Vatican website.

The HTML is rubbish. The line in question contains the text “8`(@H”, wrapped in an HTML font element with a face attribute of “WP Greek Century”.

The document is not XHTML – indeed there is no HTML version declaration (no DOCTYPE) of any kind. There is an http-equiv="Content-Type" meta element specifying "charset=iso-8859-1".

In other words, the content is published as 8`(@H, in an 8-bit Latin character set, with a font requested that would display those characters as Greek letters. Since I do not have that font, I get the Latin characters.

The file should have had a charset specified that included the proper Greek characters. I might still not have seen them if I had no suitable font, but I would have been in with more of a chance. Also, the PDF file on the BBC website would probably not have perpetuated the error, since PDF can embed fonts and handle more than Latin characters. Since it was presumably produced from the bad HTML, it dutifully reproduces the 8`(@H.
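For the record, the garbled word is λόγος. Here is a minimal sketch, in Python, of the sane way to publish it – real Greek characters plus a charset that can represent them – alongside the bytes that were actually served (the file name below is made up):

```python
# A minimal sketch: emit real Greek characters with a charset declaration
# that can represent them, instead of Latin-1 bytes dressed up with a
# symbol font. "regensburg.html" is an invented file name.
logos = "\u03bb\u03cc\u03b3\u03bf\u03c2"  # λόγος

html = (
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
    f"<p>In the beginning was the {logos}</p>\n"
)

with open("regensburg.html", "w", encoding="utf-8") as f:
    f.write(html)

# The published page instead contained the Latin-1 bytes below and relied
# on the reader having "WP Greek Century" installed to draw them as Greek.
print("8`(@H".encode("iso-8859-1"))  # b'8`(@H' - the bytes actually served
print(logos.encode("utf-8"))         # b'\xce\xbb\xcf\x8c\xce\xb3\xce\xbf\xcf\x82'
```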

If I were in charge of the Inquisition, the penalty for causing the Vicar of Christ to misquote the first verse of St John’s Gospel would be pretty damn severe, I can tell you.

Update: They’ve fixed it.

Technical Integration

Cory Doctorow asks:

I’ve often wondered why the camera in my pocket — which has a fast processor, a big beautiful screen, and a four-way rocker-switch — doesn’t come with a couple thousand video-games, given its capacious memory.

I can think of several reasons:

The camera state-of-the-art is fast-moving. The extra time it takes to design the game features into a given model will delay it – putting it onto the market against newer designs.

Software reliability. Cameras don’t crash. Games do. A slight tendency to crash would be a huge problem for a camera.

Phones. If you want a single do-anything gadget, it’s more likely to be a phone with a camera in it than a camera with extra features. Buyers of specialist cameras – which aren’t phones or PDAs – are likely to concentrate solely on camera features.

General “integrated device problems” – if one feature goes obsolete, the other is left with an obsolete device hanging off it. If one feature breaks, the other is left with a broken device hanging off it.

These things take time. I remember a long period during which laser printers, photocopiers, faxes and scanners were all made of different combinations of the same functional elements, but multi-functional devices that could fulfil the different roles were not available. They became available once the features of each device reached a plateau – where integrating different functions became a more useful innovation than improving any one function.

There is a pernicious belief that what matters in innovation is ideas. The idea of integrating a DVD player into a television, the idea of a compressed-air-powered toy aeroplane, the idea of selling petfood on the internet. What matters isn’t having the idea, it’s making it work.

All these things will happen when someone invests in making them work. I’m planning to hang on to my antediluvian Nokia 3310 until I can replace it with a model integrating an MP3 player with >20GB of storage. I estimate 2009, including a year for the early-adopter tax to go away.

Elements and Attributes and CSS

VoIP technically sucks: trying to fake a switched circuit connection with packet switching is inherently inefficient.

However, if 99% of the data on the network is well suited to packet switching, putting the rest of the data on the same platform is much more sensible than having a whole separate network just for 1%. I don’t know if voice traffic is as low as 1% of total traffic over the world’s data network, but if it isn’t yet it soon will be. VoIP is therefore the only sensible way to carry voice traffic.

That was just a demonstration.

99% of the web page data you are reading is marked-up text: words you want to read, along with markup describing how different bits of the text should be presented. HTML is a decent format for that, and XHTML is much better – more logical, easier to parse, more extensible.

The other 1% (the <head> element) is document-level metadata – not stuff you’re meant to read. XHTML is a poor format for that, but it’s only 1%, and it’s better to use an inappropriate format than to add a separate format into the same document for 1% of the content. So we put up with <meta name="generator" content="blogger"/> despite its clunkiness.

XML is designed for marked-up-text formats like XHTML. At a pinch, it can be used for other things (like document-level metadata), but it’s fairly crap. So when Tim Bray says:

Today I observe empirically that people who write markup languages like having elements and attributes, and I feel nervous about telling people what they should and shouldn’t like. Also, I have one argument by example that I think is incredibly powerful, a show-stopper: <a href="http://www.w3.org/">the W3C</a>. This just seems like an elegantly simple and expressive way to encode an anchored one-way hyperlink, and I would resent any syntax that forced me to write it differently.

He’s arguing against “use the best general-purpose format for everything”, and in favour of “use a suitable special-purpose format for the job at hand, like XML for marked-up text”.

A special prize to those who noticed that my XHTML <head> example was just plain wrong. 90% of the head of this document is not XML at all – a document with a completely different syntax is embedded in the XML. Blogger and the W3C have decided that XML is so inappropriate that it shouldn’t be used for this data, even at the cost of needing two parsers to parse one document.

To paraphrase Tim Bray, writing body{margin:0px;padding:0px;background:#f6f6f6;color:#000000;font-family:"Trebuchet MS",Trebuchet,Verdana,Sans-Serif;} just seems like an elegantly simple and expressive way to encode complex structured information, and I would resent any syntax that forced me to write about 1K of XML to do the same thing.
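To make that concrete, here is a rough sketch of what that one rule looks like when forced into elements and attributes. The vocabulary – rule, selector, declaration – is invented for the illustration, not anybody’s actual “CSS in XML” schema, and even this toy version turns one line into several times its size:

```python
# Re-encode a single CSS rule as elements and attributes, purely to show
# the bloat. The element and attribute names here are invented.
import xml.etree.ElementTree as ET

rule = ET.Element("rule")
ET.SubElement(rule, "selector").text = "body"

declarations = {
    "margin": "0px",
    "padding": "0px",
    "background": "#f6f6f6",
    "color": "#000000",
    "font-family": '"Trebuchet MS",Trebuchet,Verdana,Sans-Serif',
}
for prop, value in declarations.items():
    ET.SubElement(rule, "declaration", property=prop, value=value)

ET.indent(rule)  # pretty-printing; Python 3.9+
print(ET.tostring(rule, encoding="unicode"))
```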

More XML

(Apologies to those of my readers who aren’t interested in this stuff. I’ve been giving more time and attention to my work of late, and the result is less blogging, and technical stuff being at the top of my mind more than current affairs.)

Very good piece by Jim Waldo of Sun that chimes (in my mind at least) with my piece below. He emphasises the limited scope of what XML is. He doesn’t echo my discussion of whether XML is good; rather, he shoves that aside as irrelevant – the comparison is with ASCII. We don’t spend much time arguing over whether ASCII is a good character set – is 32 really the best place to put a space? Do we really need the “at” sign more than the line-and-two-dots “divide-by” sign? Who cares? The goodness or badness of ASCII isn’t the point, and the badness of XML isn’t really the point either.

The comparison with ASCII is very interesting – Waldo talks about using the classic Unix command-line tools like tr, sort, cut, head and so on, which can be combined to do all sorts of powerful things with line-oriented ASCII data files. XML, apparently, is like that.

Well, yes, I agree with all that. But, just a sec, where are those tools? Where are the tools that will do transforms on arbitrary XML data, and that can be combined to do powerful things? It all seems perfectly logical that they should exist and would be useful, but I’ve never seen any! If I want to perform exactly Waldo’s example – producing a unique list of words from an English document – on a file in XML (say OOWriter’s output), how do I do it? If I want to list all the font sizes used, how do I do that? I can write a 20-30 line program in XSLT or Perl to do what I want, just as Waldo could have written a 20-30 line program in Awk or C to do his job, but I can’t just plug together pre-existing tools as Waldo did on his ASCII file.
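For what it’s worth, here is the sort of 20-30 line one-off program I mean, sketched in Python against OOWriter’s output (an .odt file is a zip archive containing content.xml); the command-line handling is only illustrative:

```python
# Print the unique words in an OpenDocument text file, one per line.
import re
import sys
import zipfile
import xml.etree.ElementTree as ET

def unique_words(odt_path):
    # An .odt file is a zip archive; the document text lives in content.xml.
    with zipfile.ZipFile(odt_path) as z:
        root = ET.fromstring(z.read("content.xml"))
    # itertext() walks every text node, ignoring the ODF element names
    # and namespaces entirely.
    text = " ".join(root.itertext())
    return sorted(set(re.findall(r"[A-Za-z']+", text.lower())))

if __name__ == "__main__":
    for word in unique_words(sys.argv[1]):
        print(word)
```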

There are tools like IE or XMLSpy that can interactively view, navigate, or edit XML data, and there is XSLT, in which you can write programs to do specific transformations for specific XML dialects, but that’s like saying, with Unix ASCII data, you’ve got Emacs and Perl – get on with it! The equivalents of sort, join, head and so on, either as command-line tools for scripting or a standard library for compiling against, are conspicuous by their absence.

The nearest thing I can think of is something called XMLStarlet, but even that looks more like awk than like a collection of simple tools, and in any case it is not widely used. Significantly, one of its more useful features is the ability to convert between XML and the PYX format, a data format that is equivalent to XML but easier to read, edit, and process with software (in other words – superior in every way).

As a complete aside – note that PYX would be slightly horrible for marked-up text: it would look a bit like nroff or something. XML is optimised for web pages at the expense of every other function. That is why it is so bad.
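To give a flavour of PYX, here is a small sketch of the conversion applied to Tim Bray’s anchor example from the previous post; the whitespace and escaping shortcuts are mine, not part of the PYX definition:

```python
# A rough XML -> PYX sketch: one parse event per line, so the result can be
# fed through ordinary line-oriented tools (grep, sort, cut, ...).
import xml.etree.ElementTree as ET

def pyx_events(elem):
    yield "(" + elem.tag
    for name, value in elem.attrib.items():
        yield f"A{name} {value}"
    if elem.text and elem.text.strip():
        yield "-" + elem.text.strip()
    for child in elem:
        yield from pyx_events(child)
        if child.tail and child.tail.strip():
            yield "-" + child.tail.strip()
    yield ")" + elem.tag

root = ET.fromstring('<a href="http://www.w3.org/">the W3C</a>')
print("\n".join(pyx_events(root)))
# (a
# Ahref http://www.w3.org/
# -the W3C
# )a
```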

Maybe I’m impatient. XML 1.0 has been around since 1998, and while that seems like a long time, it may not be long enough. Any process that involves forming new ways for people to do things actually takes a period of time that is independent of Moore’s law, or “internet time”, or whatever. The general-purpose tools for manipulating arbitrary XML data in useful ways may yet arrive.

But I think the tools have been prevented, or at least held up, by the problems of the XML syntax itself. You could write rough-and-ready implementations of most of the Unix text utilities in a few lines of C, and program size and speed are excellent. To write any kind of tool for processing XML, you’ve got to link in a parser. Until recently, that itself would make your program large and slow. The complete source for the GNU textutils is a 2.7MB tgz file, while the source for xerces-c alone is 7.4MB. The libc library containing C’s basic string-handling functions (and much more) is 1.3MB; xerces-c is 4.5MB.

If you have to perform several operations on the data, it is much more efficient to parse the file into a data structure, apply all the transformations to the data, and then stream it back to the file. That efficiency probably doesn’t matter, but efficiency matters to many programmers much more than it should. It takes a serious effort of will to build something that uses such an inefficient method. Most programmers will have been drawn irresistibly to bundling a series of transformations into a single process, using XSLT or a conventional language, rather than making them independent subprocesses. The thought that 99% of their program’s activity is going to be building a data structure from the XML, then throwing it away so it has to be built up again by the next tool, just “feels” wrong, even if you don’t actually know or care whether the whole run will take 5ms or 500.
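For illustration, this is the shape of one hypothetical “xmlutils”-style filter – the name and behaviour are invented here, not an existing tool. It reads XML on stdin, does one small job, and writes XML to stdout so that filters can be chained; almost all of its work is the parse and re-serialisation at either end, which is exactly the overhead described above:

```python
# A hypothetical "xmlutils"-style filter: drop a named attribute everywhere,
# reading XML on stdin and writing XML on stdout so it can be piped.
import sys
import xml.etree.ElementTree as ET

def strip_attribute(tree, name):
    # The actual transformation is a few lines...
    for elem in tree.iter():
        elem.attrib.pop(name, None)
    return tree

if __name__ == "__main__":
    tree = ET.parse(sys.stdin)                  # ...while parsing the input...
    strip_attribute(tree, sys.argv[1])
    tree.write(sys.stdout, encoding="unicode")  # ...and re-serialising it is
                                                # where nearly all the time goes.
```

The intended usage would be something like xmlstripattr style < in.xml | some-other-filter > out.xml (both tool names invented), with each stage re-parsing what the previous one has just serialised.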

In case I haven’t been clear – I think the “xmlutils” tools are needed, I don’t think the efficiency considerations above are good reasons not to make or use them, but I think they might be the cause of the tools’ unfortunate non-existence.

I also don’t see how they can be used as an argument in favour of XML when they don’t exist.

See also: Terence Parr – when not to use XML

XML Sucks

Pain. Once again, I have had to put structured data in a text file. Once again, I have had to decide whether to use a sane, simple format for the data, knocking up a parser for it in half an hour, or whether to use XML, sacrificing simplicity of code and easy editability of data on the altar of standardisation. Once again, I’ve had to accept that sanity is out and XML is in.

The objections to XML seem trivial. It’s verbose – big deal. It has a pointless distinction between “element content” and “attributes”, which adds unnecessary complexity, but not that much unnecessary complexity. It is hideously hard to write a parser for, but who cares? The parsers are written; you just link to one.

The triviality of the objections is put in better context alongside the triviality of the problem which XML solves. XML is a text format for arbitrary hierarchically-structured data. That’s not a difficult problem. I firmly believe that I could invent one in 15 minutes, and implement a parser for it in 30, and that it would be superior in every way to XML (a sketch of the sort of thing I mean follows at the end of this post). If a solution to a difficult problem has trivial flaws, that’s acceptable. If a solution to a trivial problem has trivial flaws, that’s unjustifiable.

And yet XML proliferates. Why? Since the only distinctive thing about it is its sheer badness, that is probably the reason. Here’s the mechanism: there was a clear need for a widely-adopted standard format for arbitrary hierarchically-structured data in text files, and yet, prior to XML, none existed. Plenty of formats did exist, most of them clearly superior to XML, but none had the status of a standard.

Why not? Well, because the problem is so easy. It’s easier to design and implement a suitable format than to find, download and learn the interface to someone else’s. Why use someone else’s library for working with, say, Lisp S-expressions when you could write your own just as easily, and have it customised precisely to your immediate needs? So no widely-used standard emerged.

On the other hand, if you want something like XML, but with a slight variation, you’d have to spend weeks implementing its insanities. It’s not worth it – you’d be better off using Xerces and living with it. Therefore XML is a standard, when nothing else has been.

This is not the “Worse is Better” argument – it’s almost the opposite. The original Richard Gabriel argument is that a simple half-solution will spread widely because of its simplicity, while a full solution will be held back by its complexity. But that only applies to complex problems. In hierarchical data formats, there is no complex “full solution” – the simple solutions are also full. That is why we went so long without one standard. “Worse is Better” is driven by practical functionality over correctness. “Insane is Better” is driven by the (real) need for standardisation over practical functionality, and therefore the baroque drives out the straightforward. Poor design is XML’s unique selling point.
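Purely to illustrate how easy the problem is, here is roughly what a fifteen-minute, S-expression-flavoured design and its half-hour parser might look like. Both the format and the parser are invented for the illustration, nothing more:

```python
# A back-of-an-envelope format: nested nodes written as
# (name key=value ... children...), roughly S-expressions.
import re

TOKEN = re.compile(r"\(|\)|[^\s()]+")

def parse(text):
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        name = tokens[pos]
        pos += 1
        attrs, children = {}, []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())        # nested node
            elif "=" in tokens[pos]:
                key, value = tokens[pos].split("=", 1)
                attrs[key] = value             # attribute
                pos += 1
            else:
                children.append(tokens[pos])   # bare word of text
                pos += 1
        pos += 1  # consume the closing ")"
        return name, attrs, children

    return node()

print(parse("(a href=http://www.w3.org/ (em the) W3C)"))
# ('a', {'href': 'http://www.w3.org/'}, [('em', {}, ['the']), 'W3C'])
```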

Microsoft Bugs

A question at the end of an article on how the Microsoft Xbox security (designed to prevent unauthorised code from being run) was broken:

512 bytes is a very small amount of code (it fits on a single sheet of paper!), compared to the megabytes of code contained in software like Windows, Internet Explorer or Internet Information Server. Three bugs within these 512 bytes compromised the security completely – a bunch of hackers found them within days after first looking at the code. Why hasn’t Microsoft Corp. been able to do the same? Why?

It’s a good question. There are a few plausible explanations:

  1. The design team were aware that the task of making it secure was an impossible one, and put just enough effort in to show willing, or to qualify as an “access control system” for legal purposes.
  2. The design was done in an insane rush, due to last-minute architectural compromises or general managerial incompetence.
  3. One or more of the designers secretly felt that the more the customer could do with the device, the better it would be, and in effect sabotaged a feature which had the purpose of limiting what the customer could do with it.

But my favourite theory is quality control. The biggest obstacle I face as a programmer to producing high quality software is the system of controls intended to make sure the software I produce is of high quality.

The major mechanism is obtaining approvals from people who have a vague idea of what the software is supposed to do, no idea at all of how it is supposed to do it, and little interest in the whole process. Other mechanisms involve using, or avoiding, particular tools or techniques.

What they all have in common is that they require me to subordinate my own engineering choice to someone else’s – quite likely someone who has less knowledge not only of the specific question, but also of the relevant general principles. This extends even to questions of who else to involve: if the bureaucracy says I have to get sign-off from person A, then person A gets to check the product ahead of person B, even if, left to myself, I would choose to ask person B to check it in preference, due to person B’s greater expertise or interest.

The bureaucrats would say it is a question of trust – the checks are in place so that management can take direct responsibility for the quality of the product, rather than just taking my word for it. I do not find this at all offensive; it is a perfectly reasonable thing for them to want. The problem is that it doesn’t work. It is always possible to “go through the motions” of doing the procedures, but there is almost no value in it. Getting it right always takes a mental effort, a positive commitment. I don’t blame them for not trusting me to do it, but they don’t have any choice.

The general ineffectiveness of quality control policy is masked by the usefulness of systematic testing. It is possible for a less-involved person to ask for, and check, tests – particularly regression tests on a new version of a product – and achieve significant quality benefits from doing so. As testing of this kind is usually part of the general battery of ceremonial procedures, the uselessness of all the others is less obvious than it would otherwise be. But there are many failures that this kind of testing doesn’t catch (and whose occurrence over-emphasis on this kind of testing will therefore increase), and practically all security issues are in this category.

I have no knowledge of the quality-control regime at Microsoft: I’m just speculating based on my observation that a ceremony-heavy process can produce bad code of a kind that would be almost inexplicable otherwise. In this case, there are other reasonably plausible explanations, which I already listed.

(via Bruce Schneier)

(See also LowCeremonyMethods)

Computons

The Economist:

Electricity is sold by the kilowatt-hour. Now a researcher has proposed that computing power be sold by the computon

If a 500MW power station could only be built by putting fifty thousand small 10kW generators in racks, with expensive complicated machinery to try to keep as many as possible fueled and running at once, then I don’t think the concept of an electricity grid would ever have caught on. But that’s what a “computing” power station looks like.

There are some slight economies of scale to computer hardware, mainly in management overhead, but compared to the cost of putting your own computer at the other end of a wide area network, they’re negligible.

The amazing thing is that this idea keeps cropping up, year after year, despite the fact that the basic technology just does not exist. Maybe it will one day, although currently it’s moving in the other direction.