[22 August 2008]
I just posted some notes on a paper given at Balisage 2008 by Yu Wu et al. of Intel.
A few thoughts occurred to me in writing up those notes which might merit separate consideration.
How effective could pessimization be?
A key part of the optimistic concurrency algorithm presented by Yu Wu et al. is that the process of chunking the document needs to be quick. So they make some guesses, when chunking, that could later be proven wrong; in that case, the chunk needs to be re-parsed.
I suppose the worse-case scenario here is that a sufficiently lucky and malignant adversary could construct a document in which the context at the end of chunk 1 means that chunk 2 needs to be reparsed, and the reparsing of chunk 2 reveals for the first time that chunk 3 now needs to be reparsed, and so on, so that in the end you end up using n time slices to parse n chunks, instead of n divided by the number of threads.
So there’s an interesting question: how long can we keep this up?
It’s pretty clear that if we know exactly where the pre-scanner will break the chunks, then we can devise an XML document that forces chunk 2 to be reparsed. Can we construct a document in which only the second, correct parse of chunk 2 reveals that chunk 3 now needs to be reparsed (i.e. in which the first parse of chunk 2 makes chunk 3 look OK, and the second one shows that it’s not OK)?
Can we make a document in which every time we reparse a chunk with the correct context, we discover that the next chunk also needs to be reparsed? How much reworking can an omniscient and malevolent XML author cause this algorithm to do? Remember that comments and CDATA sections do not nest; the worst I can figure out off hand is that a comment or CDATA section begins in chunk 1 and doesn’t end until the last chunk.
How many chunks do you want?
The paper says fewer chunks are better than many chunks (to reduce post-processing costs), and that you want at least as many chunks as there are threads (to ensure that all cores can be busy). To simplify the examples I’ve been thinking about, I’ve been imagining that if I have eight threads, I’ll make eight chunks.
But if I’ve read the performance data and charts right, the biggest single reason the Horatian parser is not getting an eight-fold speedup when using eight threads is the need to reparse some chunks, owing to bad guesses about parse context made during the first parse. If we have eight threads and eight chunks, everything is fine for the first pass over the chunks. But if we need to reparse two of the chunks, then it rather looks as if six threads might be sitting idle waiting for the re-parsing to finish.
I wonder: would you get better results if you had shorter chunks, and more of them, to keep more threads busy longer? What you want is enough chunks to ensure that while you are reparsing some chunks, you still have other chunks for the other threads to parse.
As a first approximation, imagine that we have eight threads. Instead of eight chunks, we make fourteen chunks, and give the first eight of them to the eight threads. Let’s say two of them need to be reparsed; the reparsing goes on at the same time that the remaining six threads parse the remaining six chunks. The minimal path through the speculative parsing step remains the time it takes to parse two chunks, but the chunks are somewhat smaller now. The only question is how much additional time the post-processing step will now take, given that it has fourteen and not eight chunks to knit together.
And of course you need to bear in mind that if one chunk in four turns out to need re-parsing, then three or four out of the fourteen chunks are going to need reparsing, not just two. By the time you factor that in, and try to ensure that your last round of parsing doesn’t generate any new re-parse requests, things have gotten more complicated than I can conveniently deal with here (or elsewhere).
Maybe that’s why the Intel paper was so non-committal on the way to choose how many chunks to make in the first place: it can get pretty complicated pretty fast.
Optimization and context independence in schema languages
One of the things that intrigues me about these results is that so much of what people have said needs to be done to schema languages to ensure that validation can be fast has nothing much to do with the speed gains shown by optimistic concurrency.
I thought for a while that this work did benefit from the fact that elements can be validated against XSD types without knowledge of their context (no reference to ancestors or siblings in any assertions, for example), but on reflection I’m not sure this is true: in order to find the right element declaration and type definition to bind an instance element, you need to know (a) the expanded name of the element (which means knowing the in-scope namespaces, which in practice means having looked at all of the ancestors of the element), and (b) the type assigned to the element’s parent (unless this element is itself the validation root). Once you have a type, it’s true that validation is independent of context. But the assignment of a type to an element or attribute does depend, in the normal case, on the context. It’s not clear to me that allowing upward-pointing XPath expressions in assertions or conditional type assignment would make much difference.
To really exploit parallelism in validation, it would seem you want to eliminate the variable binding of expanded names to element declarations and to types.
Back to DTDs plus datatypes, anyone?