SINX format official page

0. What is it?

1. Motivation

2. Idea and syntax

2.1. Encoding

2.2. Syntax (SINX level 0)

2.3. Basic metasyntax recommendations (SINX level 1)

2.4. Is that all?

3. Why SINX and not XML?

4. Library

5. User agreement

6. Example of software

7. Extras (updates of 2011 and later)

0. What is it?

SINX (SINX Is Not XML) is a semi-structured data markup language similar to XML, HTML and a wide variety of lesser-known alternatives. Yes, yet another one. What's the point of it, you may ask. Why do we need yet another markup language?

I don't know about you, but here is an explanation of why it was needed in my case and why I had to make it up.

1. Motivation

It is no big secret that one of the most frequent needs in programming practice is representing, storing and parsing data in some text format. Just a few typical examples:

configuration and control files,

resource tables, manifests of all kinds,

serialization and deserialization of data,

logs and reports intended for automatic processing,

...

Why exactly are text formats the best option for such tasks? First, they are easy to read and edit: a notepad is sufficient for any text format (there are of course files that call for something more substantial than a notepad, but a notepad is still sufficient), while binary formats require a dedicated editor for each one, generally incompatible with the others. That is inconvenient and impractical in many cases. And if we are talking about homebrew formats for your down-to-earth garage needs, you will likely be in dire shortage of editors. Second, binary formats naturally have a relatively rigid structure. If your program evolves and its data structures mutate, making it understand the previous version of the format (or at least writing an upgrade utility) becomes a more expensive task than it is with text data.

You may turn out to need more than one such format. In my case, I came to need three at once: for configuration, for localization and for serialized storage. And what is the most troublesome peculiarity of a text (and not just text, actually...) format? Exactly, it is writing a parser (and, in the more general case, a generator).

Therefore, a developer eventually comes to understand that it's better to have a single generalized format rather than a whole zoo of them, and to build all the lesser formats on top of it. If the base format is good, you will need to write a parser and a generator only once and won't have to poke and tweak them for each particular case of a derived structure.

So, I needed a format satisfying the following requirements:

text based,

capable of representing both structured data, where the tree structure is the primary content, and text with markup, where the text itself is the primary content.

What is your first thought at the sight of these requirements? Of course it's XML. It is tuned exactly for these kinds of applications, it is a widespread industry standard, it is the basis for a multitude of technologies, there are plenty of libraries for it in all languages of the world, and so forth, and so forth...

Just take and use, one would think. But – no. XML was absolutely unacceptable for me, and here are my reasons.

1. It is redundant. It is just redundant beyond any reasonable limit. If you want to comprehend the whole horror of things, take a look at XML-RPC examples, for instance, and try to realize that they are just about passing a couple of parameters. Human readability of XML is a purely formal concept. In real life, it is hard labor to pick the data out of the mash of opening and closing tags, let alone to type such a document by hand.

2. It is complicated. It is complicated beyond any imagination. It pesters you with useless and mutually duplicating extras. Just one thing to mention is the <tag><value>data</value></tag> vs. <tag value="data"/> alternative for representing an essentially identical piece of data!

As a consequence, nearly all so-called XML handling libraries are bloated obscenities that take orders of magnitude more space and files than some complete programs (while you might just have needed a feeble config file for "hello, world"), are often built in non-trivial ways (dragging GNUttish toolchains and Vis$tudios behind them), and the interfaces they provide come down to iteration, of varying manuality, through a mishmash of subnodes and attributes. Seeing the resulting code makes you weep tears of blood. If it still doesn't, look through dusty archives for a description of, say, the GetPrivateProfileString function and its kin (also, take a look at the description of the INI file format found in the same place), then compare that to how access to XML documents (already loaded and parsed, by the way) looks in what is claimed to be one of the simplest XML parsing libraries.

In addition to all these issues, a rare library can boast understanding XML in all its completeness (an example, note the section unambiguously titled 'What it doesn't do?').

Most people consider XML something good just because they use formats based on it via appropriate programs (offices, inkscapes, expression blends etc.) and don't see the homely insides. But we, plain programmers with plain needs, writers rather than readers, have a slightly different list of preferences. So, I can't speak for you, but as for me, XML was obviously the worst solution possible (and a clear overkill in my particular case).

Please don't reproach me for disparaging an international standard, for an addiction to reinventing wheels and all that. Just imagine yourself in my place. You are a plain programmer, you have a plain compiler and a virgin-snow notepad in front of you, and the task is to write a lightweight, slightly configurable console program. All these magnificent international standards, how will they help you if you have to drag home and toilet train a monster of large tonnage that will turn your miserable code into its tiny appendix? That's it. And if you already have the monster at home and trained, you definitely had nothing better to do.

So, XML is out. What simpler alternatives, free of the aforementioned flaws, does the collective intellect offer us?

YAML, JSON, XF, thousands of them?

They are of course better, but unfortunately they are overspecialized. Namely, all of these formats are tailored purely for data, and using them to represent text with loose markup is thankless labor. Also, their syntax is partly redundant and partly too rigid. For example, JSON provides a fixed set of fundamental data types (a string, a number, a bool, null, a dictionary, a vector). In order to parse this format, the program must understand all of them, even if it doesn't need them all. On the other hand, as soon as the program needs a different data type (such as a complex number or a text with markup), the format suddenly ceases to be compact and starts getting in the way. Approximately the same holds true for all the others.

Adapted S-expressions (as in despised LISP).

Very good. I needed something like this, lightweight and structurally flexible, two in one. But still, it's not ideal. First, this format is data oriented as well, and representing text with markup (one of the requirements, I remind you!) is quite clumsy in it. Second, the syntax could still be simpler. I don't know what the author meant by simplicity, but ways of further simplification and reduction of redundancy are obvious from the first lines of the specs.

All these ordeals led me to the idea of the SINX format.

Not that it turned out overly original or revolutionary. It wasn't a priority to concoct a brand new markup language resembling nothing we've seen before (and why should it be?); you can easily spot borrowings from its predecessors in it. (Even the foundation of the format was not built from the ground up: it is borrowed from an ancient Starcraft 1 related utility where I liked the author's approach to representing unprintable characters.) The priorities were as follows:

format fit for both text and structured data, similarly to XML and HTML but free of their heavyweightness,

maximum simplicity and minimum redundancy. If we have text, it must not be lost among syntax elements and should be deformed for syntax needs as little as possible. If we have structured data, syntax elements must be just sufficient to specify the structure and not an ounce more. Classification of data by types, their decomposition into information and metainformation, these are things the user can well do himself if needed (and if it is not needed, it has all the less place in the format). Our task is just to let him pull these data out without unnecessary bureaucracy and questionable conventions.

It's up to you to determine how well the idea is implemented.

2. Idea and syntax

The format is based around the following concept. We have a SINX string – a (finite) sequence of symbols, possibly an empty one. There are two sorts of symbols:

terminal symbol (a single character: a letter, a digit, etc.),

special symbol, described in terms of a name and data. A special symbol's data is also a SINX string, and all the aforementioned considerations apply to it recursively.

This principle is the origin of everything that follows.

2.1. Encoding

The SINX format is fine with any encoding that can represent the characters '<', '>', '%' and provides some whitespace characters. It is desirable but not required to have the characters '.', ':', '=' in it, and even less required to have '-', '#', all hexadecimal digits, 'x' (Latin ex), 'l' and 'g'. UTF-8 or any other 8-bit ASCII-based encoding is recommended. If you are into perversions though, something like UTF-32 or EBCDIC will also do.

An important note: all characters in SINX are taken as is; under no circumstances are they translated or transformed. That is, whitespace runs are not merged and not skipped, tabs and line feeds are not translated into spaces, letters are case sensitive, and so on. Some unavoidable platform-specific dodges are possible, such as translation of CR/LF into a single line feed, but these transforms are assumed to take place at the physical stream reading level, before the data is fed to the SINX parser, so the format doesn't care about them. At the SINX format level, all input characters are considered verbatim. Henceforth, if we speak of 'a sequence' or 'a run' without any extra specification, we assume it to be binary identical to its representation in the input stream.

2.2. Syntax (SINX level 0)

A SINX string is a sequence of elements. The following element types are possible:

1) Plain terminal symbol – any character except '<' and '>'.

Example: aba125 letters are parts of the example 15#tbs.

Plain terminal symbol denotes simply the appropriate character.

2) Verbatim data run:

'<' delimiter '%' sequence-of-characters '%' delimiter '>'

where:

delimiter is a sequence (possibly an empty one) of any characters except '%', '<', '>', '=' and whitespaces; the second delimiter must be binary identical to the first one,

sequence-of-characters is a sequence of any characters allowed in the encoding used that doesn't include the subsequence '%' delimiter '>'.

Examples:

<abc%example%abc>

<%example of run with an empty delimiter%>

<ab%example of run < including >< special characters > % in it%%ab> (last % before %ab is included in the run as well)

<$#!@%example of run with non-alphanumeric characters in the delimiter%$#!@>

<abcdx%this %abcdx is a %abcdx > valid %abcd> example too%abcdx> (the run includes all characters between <abcdx% and %abcdx>: the ending marker of the run is not the delimiter itself, but exactly a sequence of a percent sign, a delimiter and a closing angle bracket)

Verbatim data run is interpreted as sequence of its characters, taken binary verbatim, with exclusion of starting and ending markers.

Example: abc<%def%>ghi is interpreted the same as abcdefghi

Nevertheless, starting and ending markers are considered a part of SINX string that contains them and are included in it if we speak of the SINX string as a whole.

Characters inside a verbatim data run are not interpreted in any special way and are all part of the run regardless of their kind.

<abc% it can <def% even %def> be like this %abc> (<def% and %def> here are just sets of characters included in the run along with all others)

The purpose of a verbatim data run is to express a piece of data that may include characters that would otherwise have a special syntactic meaning.

<%For example, of arithmetical or logical language expressions, such as x = i>11? 100 : 500;%>
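The rule for verbatim data runs can be sketched in a few lines of Python. This is only an illustration, not part of the format: the function name read_verbatim_run is made up here, and Python's str.isspace() stands in for "whitespace of the encoding used", which is an assumption.

```python
from typing import Optional, Tuple


def read_verbatim_run(text: str, start: int = 0) -> Optional[Tuple[str, int]]:
    """Read a verbatim data run that begins at text[start].

    Returns (data, index just past the run), or None if what starts here
    is not a verbatim data run (for example, it is a special).
    """
    if start >= len(text) or text[start] != "<":
        return None
    i = start + 1
    # The delimiter may not contain '%', '<', '>', '=' or whitespace.
    while i < len(text) and text[i] not in "%<>=" and not text[i].isspace():
        i += 1
    if i >= len(text) or text[i] != "%":
        return None
    delimiter = text[start + 1:i]
    closing = "%" + delimiter + ">"  # the ending marker, not just the delimiter
    end = text.find(closing, i + 1)
    if end == -1:
        # An unterminated run is treated as terminated by the end of input.
        return text[i + 1:], len(text)
    return text[i + 1:end], end + len(closing)
```

For instance, read_verbatim_run("<abc%example%abc>") gives the data "example", and the same call on "<a example>" gives None because no '%' follows the name.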

3) Special symbol (or just 'special' for short):

'<' name delimiter SINX-string '>'

where:

name is a sequence (possibly empty) of any characters except '%', '<', '>', '=' and whitespaces,

delimiter is either a sequence of whitespaces (can be empty if followed by '<' or '>') or a single '=' character.

SINX string is special symbol's data and can in turn consist of all aforementioned types of elements (including other special symbols). It can be empty.

Examples:

<a example>

<$!#@ example with weird special name>

<a example with composite data: <xxx%verbatim run%xxx>, <b nested special>>

<a=example with delimiter of '='>

<a
example with delimiter of line feed (assuming it is a whitespace in the encoding used)>

< example of special with an empty name>

<=another one>

<xxx> (example of special with empty data)

<xxx=> (example of special with empty data and delimiter of '=')

<xxx<%yyy%>> (example of special with empty whitespace-type delimiter which is possible due to its data starting with '<')

<> (example of special with empty name and data)

<=> (another one)

< > (another one)

<a =the '=' character is part of data here as the delimiter is a single space>

<a= the delimiter here is '=' while the space after it is part of data>

Characters of the delimiter and opening and closing '<' and '>' are not included in name or in data of the special. Nevertheless, they are considered a part of its containing SINX string and are included in it if we speak of the SINX string as a whole.

Special symbol is interpreted as, well, a special symbol with a given name and data. There are some issues about the name:

if the name starts with '.' and/or ends with ':', these characters are not included in the name, but the parser MUST keep extra flags to remember the fact they were present for this special. Leading '.' of the name is an attribute trait and trailing ':' is a compound trait.

For example, all these specials have the same name:

<a example of special named 'a'>

<.a example of special named 'a' with an attribute trait>

<a: example of special named 'a' with a compound trait>

<.a: example of special named 'a' with both an attribute and a compound trait>

also, the parser MUST keep an extra flag to remember the type of delimiter used in this special, '=' or whitespaces run. Delimiter of '=' is a terminal trait.

Traits have no syntax payload by themselves (apart from the fact that they must be stripped off the name and noted separately by the parser). Their purpose is metasyntactic. The user of a derived format can assign any meaning to them at his own discretion, or ignore them entirely. SINX level 1 (see below) contains some recommendations about interpretation of traits, but these are not requirements and are not enforced at the SINX format level.
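As an illustration of the trait rules, here is a small Python sketch that strips traits off a raw name and records them as flags. The names SpecialName and split_traits are hypothetical, and treating a name that consists of a single '.' as an empty name with an attribute trait is my reading of the rule, not something the text above spells out.

```python
from typing import NamedTuple


class SpecialName(NamedTuple):
    name: str        # the name with trait characters stripped
    attribute: bool  # the raw name started with '.'
    compound: bool   # the raw name ended with ':'
    terminal: bool   # the delimiter was '='


def split_traits(raw_name: str, delimiter: str) -> SpecialName:
    """Separate the traits of a special from its name."""
    attribute = raw_name.startswith(".")
    if attribute:
        raw_name = raw_name[1:]
    compound = raw_name.endswith(":")
    if compound:
        raw_name = raw_name[:-1]
    return SpecialName(raw_name, attribute, compound, delimiter == "=")
```

With this, the raw names a, .a, a: and .a: all yield the name 'a' and differ only in their flags.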

If the encoding used does not provide characters for '.', ':' or '=', all rules that involve them can be ignored.

What is the meaning of the phrase "considered a part of its containing SINX string and are included in it if we speak of it as a whole"? It is very simple. When a SINX string is read symbol by symbol, these syntax characters are not counted as symbols. When the SINX string is taken as a whole (for example, as the data of a special), all of its characters are considered, including the ones with special syntactic meaning.

Example: 1235<xx%abcdef%xx>678<a <%x%>> is a sequence of symbols: '1', '2', '3', '5', 'a', 'b', 'c', 'd', 'e', 'f', '6', '7', '8', 'special named a with data of <%x%>'. At the same time, as a whole SINX-string, it is 1235<xx%abcdef%xx>678<a <%x%>>. The same way, data of the special named a of this sequence is SINX string <%x%>, which is a sequence of a single symbol 'x' from standpoint of per-symbol reading.

That's all about SINX syntax. Here is a grammar if you want one:

any-character ::= any character possible in encoding used

character-not-bracket ::= any character possible in encoding used except '<' and '>'

name-character ::= any character possible in encoding used except '<', '>', '%', '=' and whitespaces

whitespace ::= any whitespace from encoding used

name ::= name-character*

delimiter ::= '=' | whitespace*

(if there is no '=' in the encoding used then delimiter ::= whitespace*)

SINX-string ::= SINX-symbol*

SINX-symbol ::= character-not-bracket | verbatim-data-run | special

verbatim-data-run ::= '<' name '%' any-character* '%' name '>' (2-nd name must be binary identical to first one)

special ::= '<' name delimiter SINX-string '>'

As you can see though, the logic is very simple and a parser can be written without any grammar in mind.

The syntax is designed so that, as far as possible, any sequence of characters is a valid SINX string. A random sequence can still be invalid. However, a syntax error is not an excuse for not returning a result. The following are the possible types of syntax errors, each of which MUST be handled by the parser using one of the methods offered below, with the choice of a particular method left up to the parser implementation:
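To give a feeling of how little logic is involved, here is a minimal level 0 parser sketched in Python. It is an illustration under assumptions, not a reference implementation: the Special class and the function names are made up, a whitespace delimiter is taken to be the whole run of whitespace after the name, str.isspace() stands in for the whitespace of the encoding, a stray '>' is ignored, and unterminated specials and runs are closed implicitly at the end of input. Adjacent text (plain characters and verbatim runs) is merged into one Python string for convenience.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Special:
    name: str
    attribute: bool = False  # the name had a leading '.'
    compound: bool = False   # the name had a trailing ':'
    terminal: bool = False   # the delimiter was '='
    raw: str = ""            # the data taken as a whole SINX string
    items: List["Item"] = field(default_factory=list)  # the data read per symbol


Item = Union[str, Special]


def parse(text: str) -> List[Item]:
    """Parse a whole SINX string into text pieces and specials."""
    items, _ = _parse_string(text, 0, nested=False)
    return items


def _append(items: List[Item], piece: Item) -> None:
    # Merge adjacent text so characters and verbatim runs read as one string.
    if isinstance(piece, str):
        if not piece:
            return
        if items and isinstance(items[-1], str):
            items[-1] += piece
            return
    items.append(piece)


def _parse_string(text: str, pos: int, nested: bool) -> Tuple[List[Item], int]:
    items: List[Item] = []
    while pos < len(text):
        ch = text[pos]
        if ch == ">":
            if nested:
                break        # the enclosing special consumes this '>'
            pos += 1         # stray '>' at top level: ignored
        elif ch == "<":
            piece, pos = _parse_element(text, pos)
            _append(items, piece)
        else:
            _append(items, ch)
            pos += 1
    return items, pos


def _parse_element(text: str, pos: int) -> Tuple[Item, int]:
    i = pos + 1
    while i < len(text) and text[i] not in "<>%=" and not text[i].isspace():
        i += 1
    name = text[pos + 1:i]
    if text[i:i + 1] == "%":                 # verbatim data run
        closing = "%" + name + ">"
        end = text.find(closing, i + 1)
        if end == -1:                        # unterminated: runs to end of input
            return text[i + 1:], len(text)
        return text[i + 1:end], end + len(closing)
    terminal = text[i:i + 1] == "="
    if terminal:
        i += 1
    else:
        while i < len(text) and text[i].isspace():
            i += 1
    items, end = _parse_string(text, i, nested=True)
    attribute = name.startswith(".")
    if attribute:
        name = name[1:]
    compound = name.endswith(":")
    if compound:
        name = name[:-1]
    special = Special(name, attribute, compound, terminal, text[i:end], items)
    return special, min(end + 1, len(text))  # skip the closing '>' if present
```

Parsing 1235<xx%abcdef%xx>678<a <%x%>> with this sketch yields the text 1235abcdef678 followed by a special named a whose raw data is <%x%> and whose per-symbol data is the single character x. The recursion depth follows the nesting depth of specials, so very deeply nested input would need an explicit stack instead.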

1) Character '>' with no corresponding opening '<'. Example: abc>de

Options:

treat it as end of input stream, not inclusive (abc),

ignore it (handle the stream as if the problematic character was missing) (abcde).

2) Unterminated special. Example: abc<de <f><g h

Options:

assume that all specials not terminated explicitly are terminated implicitly in corresponding order as the input stream ends (abc<de <f><g h>>),

ignore opening fragments of unterminated specials including their delimiters (abc<f>h).

3) Unterminated verbatim data run. Example: <abc%def

It MUST be always assumed implicitly terminated as the input stream ends (<abc%def%abc>).

4) A sequence of '<' name (end of input stream) MUST be considered an opening fragment of a special with an empty whitespace-type delimiter and therefore a particular case of error 2).

The parser MAY report syntax errors it found in some additional way but MUST always parse to the end, recovering according to error handling options chosen for the implementation.
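One possible combination of the recovery options can be sketched as a Python function that rewrites a broken input into the valid SINX string it is taken to mean. The function name repair is hypothetical, and the options chosen here (ignore a stray '>', close unterminated specials and runs implicitly) are just one of the allowed combinations.

```python
def repair(text: str) -> str:
    """Return the valid SINX string that a recovering parser would see."""
    out = []
    depth = 0  # number of specials currently open
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == "<":
            i = pos + 1
            while i < len(text) and text[i] not in "<>%=" and not text[i].isspace():
                i += 1
            if text[i:i + 1] == "%":
                # Verbatim data run: copy it whole, closing it if needed.
                closing = "%" + text[pos + 1:i] + ">"
                end = text.find(closing, i + 1)
                if end == -1:
                    out.append(text[pos:] + closing)
                    pos = len(text)
                else:
                    out.append(text[pos:end + len(closing)])
                    pos = end + len(closing)
                continue
            depth += 1              # an opening fragment of a special
            out.append(text[pos:i])
            pos = i
        elif ch == ">":
            if depth > 0:
                depth -= 1
                out.append(ch)
            # otherwise a stray '>': ignored
            pos += 1
        else:
            out.append(ch)
            pos += 1
    out.append(">" * depth)         # close whatever is still open
    return "".join(out)
```

Applied to the examples above, it turns abc>de into abcde, abc<de <f><g h into abc<de <f><g h>> and <abc%def into <abc%def%abc>.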

Description of syntax and syntax error handling options is named "SINX format level 0". It is actually sufficient for practical use. However, there are some typical problems for markup languages that you will likely have to solve for your format. SINX level 1 is provided for most common things you may need, see next chapter for more.

2.3. Basic metasyntax recommendations (SINX level 1)

While SINX level 0 specifies format syntax, SINX level 1 establishes several conventions about interpretation of special symbols from a certain set in your target SINX-based format.

1) Characters specified by numeric code.

Specials with name of '#' run-of-decimal-digits or '#' '0' x|X run-of-hexadecimal-digits MUST be interpreted as a character with corresponding numeric code in encoding used. Examples:

<#33> (ASCII character '!')

<#0x33> (ASCII character '3')

<#0XDeAdBeEf> (character with hexadecimal code of DEADBEEF)

Data of such specials MUST be ignored. E. g., mail<#0x40 gg>to is interpreted identically to mail<#0x40>to and mail@to.

If name of the special starts with '#' but does not form a valid decimal or a hexadecimal number, the special is not translated into character and is considered as a general special with this name.

Additionally, there are two specials that MUST be interpreted as '<' and '>':

<lt> – '<',

<gt> – '>'.

The letters l, g and t can be in any case if the encoding used provides them in different cases.

E. g.: <example here are some quoted specials: <lt><gt> <lT>spec1 content1<Gt> <Lt>spec2 content2<gT>> – is interpreted identically to <example <%here are some quoted specials: <> <spec1 content1> <spec2 content2>%>>

This rule CAN be dropped if the encoding used does not provide characters for hexadecimal digits and characters of '#', 'l', 'g' and 'x'.
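Convention 1 can be sketched as a small Python lookup from the name of a special to the character it stands for. The function name char_for_special is made up, and the sketch assumes the encoding is Unicode, so a code outside the Unicode range (such as DEADBEEF) has no character and the special is left as a general one; another encoding may decide differently.

```python
import re
from typing import Optional

_DECIMAL = re.compile(r"#[0-9]+\Z")
_HEXADECIMAL = re.compile(r"#0[xX][0-9A-Fa-f]+\Z")


def char_for_special(name: str) -> Optional[str]:
    """Return the character a level 1 special stands for, or None."""
    lowered = name.lower()
    if lowered == "lt":
        return "<"
    if lowered == "gt":
        return ">"
    if _HEXADECIMAL.match(name):
        code = int(name[3:], 16)
    elif _DECIMAL.match(name):
        code = int(name[1:], 10)
    else:
        return None  # not a numeric name: a general special
    try:
        return chr(code)
    except (ValueError, OverflowError):
        return None  # assumption: no such code point in Unicode
```

The data of the special never reaches this function, which matches the rule that it is ignored: <#0x40 gg> and <#0x40> both map the name #0x40 to '@'.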

2) Remarks/comments.

Specials with name of '--' MUST be treated as remarks and ignored entirely.

E. g.: a<-- b>cd – identical to acd

Two conventions above are compulsory for conformance to SINX level 1. Additionally, SINX level 1 provides some optional conventions about usage of traits in target user format.

3) Terminal trait ('=' delimiter in a special) SHOULD be applied to a special if its whole data is assumed terminal within the target user format, that is, its format is assumed opaque at SINX level and is to be passed verbatim to its own dedicated parser. Note however that the data is still required to be syntactically valid SINX string.

Depending on conventions of the target user format, SINX specials inside the terminal data SHOULD be either interpreted as characters (according to convention 1 of the current chapter, if they satisfy its requirements), or ignored, or emitted per-character binary verbatim. Verbatim data runs inside the terminal data SHOULD also be either interpreted as the corresponding character sequence or emitted per-character binary verbatim.

Examples:

<url=http://mikle33.narod.ru>

<url=http:<#47><#47>mikle33.<-- remark -->narod.<%ru%>> (depending on target user format conventions, the data can be interpreted in one of many ways: http://mikle33.narod.ru, http:<#47><#47>mikle33.narod.ru, http://mikle33.narod.<%ru%>, http:<#47><#47>mikle33.narod.<%ru%>, http:<#47><#47>mikle33.<-- remark -->narod.<%ru%>, etc.)

The '=' delimiter MAY be omitted if the special has empty data.

4) Attribute trait SHOULD be applied to a special if the target user format interprets them as attributes of their immediate containing special (or of the whole stream if they are not contained anywhere). A special is assumed to contain no more than one immediate child special of the same name with an attribute trait. (That is, specials with an attribute trait are recommended as equivalent to XML/HTML attributes.) If there are several attribute specials with same name, the target user format may consider only the first one or only the last one.

Example:

<error <.locations <line=11><line=22><line=100500>at these lines>An error has been encountered>

5) Compound trait SHOULD be applied to specials whose purpose in the target user format is to store structured data and that are only assumed to have meaningful data inside contained specials (the convention is not recursively applied to these contained specials unless they have compound traits too). All other characters, including comments and character code specifiers (according to conventions 1 and 2 of the current chapter) SHOULD be treated as remarks or formatting helpers and ignored as such.

Example:

<url: <scheme=http> <domain=mikle33> <domain=narod> <domain=ru>>

<url:

(url decomposed into components and expressed as a compound SINX special)

<scheme=http>

<domain=mikle33>

<domain=narod>

<domain=ru>>

Both versions should be interpreted identically to <url: <scheme=http><domain=mikle33><domain=narod><domain=ru>>

Traits can be combined if that fits their role in the target user format. Example:

<table: <.width=100%><.height=50%>

<tr:

<td <b>Cell11</b>> <td <i>Cell12</i>>

>

<tr:

<td <u>Cell21</u>> <td Cell22>

>

<tr:

<td <.colspan=2>Note: <b>, </b>, <i>, </i>, <u> and </u> are not opening/closing tags here, they are individual SINX specials.>

>

>

Terminal and compound traits are obviously incompatible, but it's syntactically possible to have them at the same time. SINX level 1 does not provide any convention for use of such a combination and intentionally leaves it to the target user format.

If the encoding used does not provide a character for some trait and corresponding syntax rule is dropped, the corresponding convention from 3-5 of the current chapter is also dropped.

2.4. Is that all?

Yes, that's all. Note: not "all you have to know for starters", but exactly all. This chapter describes both levels of the SINX format in exhaustive entirety.

3. Why SINX and not XML?

First of all, it makes sense to ask the reverse – why XML and not SINX?

If you are not free to choose the format due to your task specifications, or take part in a project where it is not up to you to decide on everything and where familiarity of the technologies used is important for collaboration, or you are using pre-existing sophisticated tools that already have their own fixed formats, or are otherwise constrained, well, it can't be helped then: use XML and accept my deepest condolences on the sad occasion.

However, if you are doing something on your own and are free to choose tools and technologies, most of your arguments pro XML likely root back to established stereotypes and lemming instinct. These are bad things, you should learn to break them in favor of reason. Here are some considerations for you on why SINX is better for homebrew and ad hoc needs than XML.

+1. SINX is simpler. It possibly isn't worth mentioning separately, but SINX is really much simpler. It is nearly as simple as is possible at all (it could be even simpler if it weren't for some aesthetic and auxiliary considerations). A level 0 parser can be written from scratch in about an hour, and upgrading it to level 1 will take about one more. At the same time SINX provides you with exactly the same capabilities for expressing data and markup as XML does. You probably noticed the pseudo-HTML example in the previous chapter. It gives a good picture of cross-conversion between XML/HTML and SINX. The idea is so primitive and obvious (and can no less obviously be automated) that I shall not explain it in detail, if you don't mind.

Moreover, SINX provides same capabilities in easier and less ambiguous ways. E. g., in XML, you know how much you have to do in both XML document and its reading code in order to convert an attribute to a nested tag or vice versa. In SINX, you simply don't distinguish between nested tags and attributes. Your problem is basically an aesthetic choice of whether to use an attribute trait or not.

+2. SINX is more compact. It isn't just about syntax overhead, which is ~2 times smaller in a SINX string than in a structurally identical XML document. The format also affects the text itself to a lesser degree. Have you ever considered printing a message for a human user to stdout using XML formatting? Unlikely. Common sense suggests that the actual text would be lost among service characters and syntax extras. With SINX, it is possible to do this so that human readability barely suffers compared to plain text. Check it out:

<?xml version='1.0' encoding='UTF-8' ?>

<message file="file.txt" line="14">

This message concerns line &lt;14&gt; of file "file.txt".

</message>

vs:

<message <.file=file.txt> <.line=14>

<% This message concerns line <14> of file "file.txt". %>>

+3. SINX is more versatile. Let's consider an XML piece <tag attr="some-value"/> and a similar SINX string <tag <.attr some-value>>. What can we say about some-value? It is probably some text. But why exactly text? Why can't it happen to be a nested node with its own structure? SINX provides this capability by design: <tag <.attr <.attr-of-attr=100%><component1=xxx><component2=yyy>>> (note that the data of the special is also a SINX string and therefore can be extracted and considered as an independent document). In XML, even if we laboriously worked out a counterpart using nested tags:

<tag><attr>

<attr-of-attr>100%</attr-of-attr>

<component1>xxx</component1>

<component2>yyy</component2>

</attr></tag>

we would end up with contents of <tag> that are generally not even a valid XML document that can be parsed outside the containing document.

This advantage is much more powerful than it may seem. Why? A couple of illustrative examples of modern data formats:

SVG:

...

<path fill="none" stroke="black" d="M 227 239 L 328 90 L 346 250 L 201 124 L 410 150 L 228 238" />

...

CSS:

...

...

These are good illustrations of the farcical situation in so-called XML based formats. Ideally, 'XML based' means that we just take a random XML parser, set it on the document, and our problems are solved. But quite often XML's actual role in a format is limited to outlining a major lump of data that has its own, entirely different format and requires a dedicated parser. But what's so special about these data? Do they have an exotic structure that makes it impossible to represent them as a subtree inside the document's general structure and to parse them uniformly? That may happen of course (javascript fragments, for example), but what we have here is definitely not the case. What we have here is a bunch of nodes with properties, similar to those in the rest of the document. What prevents us from representing the same CSS as:

...

...

Yes, it would take a bit more space, but uniformity of the format would yield advantages in... oh wait, can't CSS elements be nested? And can't they have fields of compound format?? And can't the style be specified both in a dedicated section of a document and in the subject tag's style attribute??? Oh gosh. It really looks like a better idea to make up a dedicated format.

A cheapskate pays twice, that is.

But, if you would use SINX instead of X/HTML, you wouldn't have the problem.

<style ...>

body {

background-color: #111111;

background-image: url("caulk_sucker.jpg");

}

p {

}

h1 {

color: #222222;

background-color: #333333;

}

</style>

...

<h1 style="color: #FFFFFF; background-color: #000000;">want it directly?</h1>

===>

<style:

<body:

<.background-color=#111111>

<.background-image: <.url=caulk_sucker.jpg>>

>

<h1:

<.color=#222222>

<.background-color=#333333>

>

>

<h1 <.style: <.color=#FFFFFF> <.background-color=#000000>>can do directly!>

(Note – the result is nearly the same and still is compliant to SINX.)

Unfortunately, XML/HTML+CSS became widespread in the form they are, and it is too late to cure their genetic defects. But not everything is lost yet for your homebrew formats: you can still choose SINX to base them on and thereby dramatically reduce the chance of having to make up a language within a language.

+4. SINX can even be used for binary data. The rule of binary verbatim interpretation of the input stream is strict and unambiguous, and a verbatim data run is more universal than XML's CDATA and can hold truly arbitrary binary data in one chunk.
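The only thing a writer has to take care of is picking a delimiter whose ending marker does not occur inside the data. A minimal Python sketch of that, with a hypothetical function name wrap_verbatim and an arbitrary choice of growing the delimiter out of 'x' characters:

```python
def wrap_verbatim(data: bytes) -> bytes:
    """Wrap arbitrary bytes into one verbatim data run."""
    delimiter = b""
    # Lengthen the delimiter until the ending marker is absent from the data.
    # The loop terminates: a marker longer than the data cannot occur in it.
    while b"%" + delimiter + b">" in data:
        delimiter += b"x"
    return b"<" + delimiter + b"%" + data + b"%" + delimiter + b">"
```

For example, wrap_verbatim(b"a%>b") produces <x%a%>b%x>, because the empty delimiter would end the run too early. No escaping of the data itself is involved, so the chunk can be read back by a plain search for the ending marker.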

Also, SINX level 0 is perfectly fit for inplace parsing.

An unnumbered bonus. SINX is quite sparing in terms of migration from XML. As said before, cross conversion between XML and SINX is a simple and transparent task, a universal converter can easily be written. SINX even uses so familiar angle brackets. :)

Joking aside, a curious fact: most simple XML/HTML with no jabascripts, CDATA-s and sophisticated comment tags is also a valid SINX string in which each opening and closing tag corresponds to a special.
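As for the cross conversion itself, here is a rough Python sketch of the XML to SINX direction built on the standard xml.etree.ElementTree module. The function names quote and xml_to_sinx are made up, and the mapping is one possible choice: XML attributes become specials with attribute and terminal traits, nested elements become nested specials, and '<' and '>' in text are written as the level 1 specials <lt> and <gt>. It is lossy in small ways: comments and processing instructions are dropped, and whitespace at the very start of an element's text may merge with the whitespace delimiter.

```python
import xml.etree.ElementTree as ET


def quote(text: str) -> str:
    """Write '<' and '>' as the level 1 specials <lt> and <gt>."""
    return "".join({"<": "<lt>", ">": "<gt>"}.get(ch, ch) for ch in text)


def xml_to_sinx(element: ET.Element) -> str:
    """Convert an XML element and everything inside it to a SINX string."""
    # XML attributes become attribute specials with a '=' delimiter.
    body = ["<." + key + "=" + quote(value) + ">"
            for key, value in element.attrib.items()]
    body.append(quote(element.text or ""))
    for child in element:
        body.append(xml_to_sinx(child))
        body.append(quote(child.tail or ""))  # text following the child
    data = "".join(body)
    delimiter = " " if data else ""           # empty data needs no delimiter
    return "<" + element.tag + delimiter + data + ">"


root = ET.fromstring('<message file="file.txt" line="14">hi <b>x</b></message>')
print(xml_to_sinx(root))  # <message <.file=file.txt><.line=14>hi <b x>>
```

The reverse direction is a similar walk over a parsed SINX tree, with the attribute trait deciding what goes into XML attributes.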

Ok, we got it about the advantages. What about drawbacks? Just for fairness, let's consider several typical objections.

–1. "There are many technologies developed for XML, such as XSLT, Xpath and so on, it has dtd/dsd/schema. Your format has none of these. And, if it all is to be made up, the result won't be essentially better. So what's the point?"

Well, how often do you use schemas even in production, let alone in your everyday garage practice? Did you ever need one? What for? To validate a document? You'd think your program is not aware of its own internal data structure and doesn't know what keys and values to check in an input DOM tree without any extra tips.

Arguably, there are some exotic cases where a schema can be of use. But there is a 99% chance that your case is not exotic: that is about the share of XML use cases where you would not need any schemas. An average XML parsing library is not even educated in this science. So, really, what's the point?

I shall be even more brief about the rest of the "many technologies". A major share of them are overspecialized crutches invented with the sole purpose of alleviating XML's clumsiness in certain applications. Most of these "technologies" are likely known to you only by name and are essentially foreign to your everyday needs.

There are actually quite few things that you need from a base text format, and all of these things are provided by SINX in better way than they are in XML (see list of advantages).

–2. "XML has many libraries to be worked with, while your format is only known to one person and a half, and it has no legacy code."

Take these "many libraries" and enjoy manual iteration through DOM trees via polished interfaces. Have fun with updates to your data files as your target format extends or mutates. And, after you realize you are fed up with the fun, read chapter 2 once more. SINX is so simple that writing your own parser for it will take less time from you than comprehension of some XML handling libraries would. And even this loss will pay back as you save much more time later on minor ad hoc issues due to simpler and more reasonable structure of the format.

For a starting idea of how a SINX handling library might look, you can check the example in chapter 4.

–3. "If it is about invention of a custom wheel, why don't I make up my own one instead of your one? It will have all blackjack and whores I need for sure."

Go forth and try. Thinking with your own thinker instead of worshipping authorities is a useful occupation that often yields valuable fruit.

One forum guy considered SINX too complicated and ideologically incorrect altogether. He invented his own wheel right on the spot in order to show the imperfection of my idea and its competitive disadvantage on the wheel market.

And what he invented turned out to be an exact copy of SINX level 0, with just a slightly more verbose syntax for delimiting special symbols.

So, the skeptical guy proved with his own hands that SINX is extremely close to perfect minimalism.

You can try and make up something of your own, but is it really worth it as you have a decent chance to come up with nearly the same solution? :)

4. Library

According to common courtesy, the format description comes with a sample C++ library for reading and writing and an example of its use: sinx.zip (27 kb). The library is not particularly compact or spherical, as compactness and sphericality were not objectives. Besides directly format related things, it includes some extras and conveniences and also uses some STL items (namely, std::map, std::deque and std::string). On one hand, it was meant as a reference SINX parser and generator implementation. On the other hand, the author intended to use it in his own real projects and occasionally had the word "portability" in mind. (Nevertheless, even with all its incompactness, non-sphericality and extras, it is more compact than RapidXml (you can check yourself how many times), the most minimalistic XML library in existence with roughly the same feature set, and it still implements SINX level 1 entirely, with no "unsupported features". Isn't it indicative?)

There is no detailed documentation because: a) I'm lazy :), b) everything an end user needs to be happy is shown in the example and is quite obvious in its simplicity, c) everything is doxygen-commented so you can use the power of doxygen. I shall only provide a brief description of the package and of its basic idea, and some brief instructions on its use.

The package contains several files that you should unzip into one single folder:

sinx_pltfm.h – platform dependent concepts for a parser and some general purpose definitions,

sinx_read_l0.h, sinx_read_l1.h – implementations of level 0 and 1 parsers respectively,

sinx_write.h – implementation of a generator (uses the above items),

sinx.h, sinx.cpp – convenience wrappers for the above items,

test.cpp – main program file of the example,

test.sinx, testpr.sinx, example_gen.sinx – samples of SINX files. The last one is generated by the example (as a generator demo), one before the last is a straightforward conversion from an XML file (a stub C++ Builder project), mostly to show an unnumbered bonus from chapter 3.

No project files, makefiles or other garbage are found in the package. The program is meant to be built by a trivial console command. For C++ Builder 5 and above (most likely it works for lower versions too) it is done by this command:

bcc32 test.cpp sinx.cpp (builds test.exe)

In M$ Vi$ual $tudio 2005 and later (assuming you have already configured the $tudio to react to console commands) –

cl /EHsc test.cpp sinx.cpp

The example is known to compile on gcc from WxDev++ (gcc test.cpp sinx.cpp) but it didn't want to see standard library at linking stage, and I had no time and need to RTFM. Therefore, if you have gcc or some other marginal compiler (or even an operating system, who knows), take examples above as illustration of the idea and figure your case out yourselves.

Actual use of the library won't be described here either; typical use cases can be taken from the same example (test.cpp) and they are quite simple. But some explanations of the internals and the logic of some parts are nevertheless worth giving.

So, some more words about each file in order of their logical dependency.

sinx_pltfm.h. The initial plan was to make the parser absolutely platform independent and parametrized via a concept class where the user would define his own platform specific parts. One such class, SINX0_Platform, is provided (meant as an example) in sinx_pltfm.h. It defines the following essential items: SINX0_Platform::TChar (a character of the encoding used), syntactically essential character codes, dynamic memory (de)allocation functions and the SINX0_Platform::Buffer class that implements a buffer containing the SINX string to be parsed. In this implementation, it uses a memory buffer that contains the string in one piece, but nothing prevents you from rewriting its internals to work with partially buffered data from external storage.

sinx_l0.h (depends on sinx_pltfm.h). Unfortunately, I failed to achieve everything that was planned. This slippery path led me into the land of sophisticated templates where compilers from different vendors began to disagree up to complete incompatibility. The goal was reached, though, for the SINX level 0 parser; file sinx_l0.h is responsible for it. It contains the following classes:

SINX0_String<Platform> – binary verbatim substring of a SINX string. In other words, a fragment of a buffer (Platform::Buffer) treated as a string. It can be compared with other strings (including C-style strings), and substrings can be extracted from it.

SINX0_SuxxElement<Platform>, SINX0_SuxxParser<Platform> – provide serial extraction of elements from a SINX string (represented with SINX0_String<Platform>) in the manner of a SAX-style parser. (The name of the classes reveals my attitude to the methodology of SAX parsing, but, with all its practical flaws, this method is convenient as a foundation.) I guess there is no need to specify which of them implements the parser and which implements the extracted element. An element can be one of the following:

a data run,

a special symbol opening fragment (containing name and traits),

a special symbol closing fragment,

end of string marker.

SINX0_Symbol<Platform>, SINX0_SymbolIterator<Platform> – provide iteration through a SINX string in a way that matches level 0 more directly, that is, in terms of single characters and specials. The first SINX0_Symbol<Platform> can be obtained via assignment from SINX0_String<Platform> (it will be a "virtual" special that has an empty name and the assigned SINX string as its data); its components can then be extracted using SINX0_SymbolIterator<Platform>. The parser follows this strategy of recovery from syntax errors (see the end of 2.2):

'>' character with no matching opening '<' is ignored,

opening fragments of specials that have no matching '>' are ignored.

All the aforementioned classes are templates whose parameter must be a class conforming to the platform concept, that is, a class that implements the same interface as SINX0_Platform from sinx_pltfm.h (which will likely be the one and only class you will ever use in this capacity). Additionally, they are all made in such a way that they don't use dynamic memory allocation, except for the case of assignment from SINX0_String to SINX0_Symbol (one extra allocation), and rely on the buffer object for any data storage. That is, the buffer is assumed to stay alive for no less time than any SINX0_... classes using it (its interface provides reference counting for this purpose). Take this into consideration if you ever come to make your own implementation of a buffer.

If you have such a wish, sinx_l0.h and its riches can be used on their own; they are sufficient to handle SINX level 0. But the interface and application of all this stuff is not too convenient (approximately the same as XML parsing libraries have), so you are unlikely to deal with SINX0_... classes directly.

sinx_l1.h (depends on sinx_l0.h). SINX level 1 parser. I failed to make it customizable (for the reasons mentioned above, it strictly uses SINX0_Platform and SINX0_... classes parametrized with SINX0_Platform only) but it has a more decent interface. Besides SINX level 1 related things, it includes several useful helpers, for convenience and to show some advanced features feasible with SINX. The classes of primary interest here are the following:

SINX1_Symbol – represents a SINX special symbol. It allows you to extract its data as a string or as an integer (if it doesn't contain specials not reducible to characters according to SINX level 1 rules), obtain nested specials (individually by name or in an array), obtain its whole data as a raw SINX string, etc. SINX1_Symbol uses std::string and std::deque (and, internally, std::map) to return data and results, but it does so lazily, so that no dynamic memory allocation is made until you explicitly request data. So SINX1_Symbol can be copied and returned with no fear of exceptions.
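The lifetime rule above (substrings store nothing themselves and keep the buffer alive through reference counting) can be illustrated with a minimal sketch. This is not the library's code: BufferSlice and SharedBuffer are hypothetical names, and std::shared_ptr stands in for whatever reference counting the real Platform::Buffer interface provides.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a platform buffer: owns the whole SINX string.
using SharedBuffer = std::shared_ptr<const std::string>;

// Hypothetical substring view: keeps the buffer alive, copies no characters.
class BufferSlice {
public:
    BufferSlice(SharedBuffer buf, std::size_t pos, std::size_t len)
        : buf_(std::move(buf)), pos_(pos), len_(len) {}

    std::size_t size() const { return len_; }
    char operator[](std::size_t i) const { return (*buf_)[pos_ + i]; }

    // A sub-slice shares the same buffer; only the offsets change.
    BufferSlice sub(std::size_t pos, std::size_t len) const {
        return BufferSlice(buf_, pos_ + pos, len);
    }

    // Comparison with a C-style string, character by character.
    bool equals(const char* s) const {
        std::size_t i = 0;
        for (; i < len_ && s[i] != '\0'; ++i)
            if ((*buf_)[pos_ + i] != s[i]) return false;
        return i == len_ && s[i] == '\0';
    }

    // Number of objects currently sharing the buffer.
    long owners() const { return buf_.use_count(); }

private:
    SharedBuffer buf_;
    std::size_t pos_, len_;
};
```

Every slice holds one reference, so the buffer is released only after the last slice referring to it is gone. A custom buffer implementation would have to give the same guarantee in its own way.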

Syntax error handling follows the same strategy as described above for SINX0_Symbol.

An indicative feature worth a separate mention is extraction of a special from a subtree by path specification (the OpenPathL function). What is it? It is easier to show by an example. In SINX string:

<a:

<x>

'b' special has path of <a <b>> (or, which is the same, <a <b><.index=0>> (index can only be specified by an integer number)),

'c' special has path of <a <c>>,

first 'd' special has path of <a <c <d>>> (or, which is the same, <a <c <d><.index=0>>>)

second 'd' special has path of <a <c <d><.index=1>>>,

third 'd' special has path of <a <c <d><.index=2>>>,

<.x=2> special has path of <a <c <d <x>><.index=1>>>,

...

Well, you got the idea.

The SINX string to search in is the data of the subject SINX1_Symbol, and the path is passed to the subject's OpenPathL (as a raw string); the response returned is the special you looked for. If no special with such "coordinates" is found, the result is an empty special symbol with a special IsEof() flag set to true. This feature is useful if you have some heavily compound SINX file and your program is interested in only a small subtree whose location inside the structure is specified via input data (for example, as a console command parameter).

What is indicative here is that, as you can see, the path to the symbol is also specified in SINX format (and is parsed by the very same parser, as you may easily guess). Such an ad hoc additional wheel, with no extra tools required. Compare the ease and minimal expense with which we got it to XPath, a similar utility for XML. Yes, I know that XPath has many more features, a handier syntax, etc., etc. But just one fact: it has a different syntax incompatible with XML and requires a dedicated parser. Think of it. Even if that "technology" had exactly the same primitive capabilities as ours, it would still have a different syntax incompatible with XML and require a dedicated parser. Our ad hoc "SINXPath" required nothing extra. It could have, but why would it if we could make nice use of improvised means?

What prevented the inventors of XPath from taking a similar way? In theory, nothing did. In practice... I think their reasons are clear.
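The lookup logic behind such paths (descend by name, pick the n-th special among same-named siblings, report "not found" otherwise) can be sketched as follows. This is only an illustration under assumptions: Node, Step, openPath and sampleTree are hypothetical names, the tree is an in-memory stand-in rebuilt from the paths listed above, and the path is given already split into steps, whereas the real OpenPathL takes it as a raw SINX string and returns a special with IsEof() set instead of a null pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory tree; the real library works on the SINX string itself.
struct Node {
    std::string name;
    std::vector<Node> children;
};

// One path step: name of the special and its index among same-named siblings.
struct Step {
    std::string name;
    std::size_t index;
};

// Returns the addressed node, or nullptr when no special has such "coordinates".
const Node* openPath(const Node& root, const std::vector<Step>& path) {
    const Node* cur = &root;
    for (const Step& step : path) {
        const Node* next = nullptr;
        std::size_t seen = 0;  // same-named siblings passed so far
        for (const Node& child : cur->children) {
            if (child.name != step.name) continue;
            if (seen++ == step.index) { next = &child; break; }
        }
        if (!next) return nullptr;
        cur = next;
    }
    return cur;
}

// Tree matching the listed paths: a holds b and c, c holds three d, second d holds x.
Node sampleTree() {
    Node d0{"d", {}}, d1{"d", {Node{"x", {}}}}, d2{"d", {}};
    Node c{"c", {d0, d1, d2}};
    Node a{"a", {Node{"b", {}}, c}};
    return Node{"", {a}};  // "virtual" root special with an empty name
}
```

With this tree, the steps a[0], c[0], d[1], x[0] reach the x special, while asking for a fourth d yields "not found".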

sinx_write.h (depends on sinx_l1.h). Implements data output in SINX level 1 (including possibility of comments, automatic escaping of characters via <#...> specials, following conventions about traits where possible and automatic formatting inside specials that user specifies to be compound).

Idea of use:

User must define his own output stream class, which must be derived from pure virtual class SINX_WriteDocumentBase and define several virtual methods for writing bytes into the stream. Then, an instance of this class is assumed to impersonate an open stream and can be written to via methods of the instance (or, more exactly, methods of its earlier ancestor, SINX_WriteNode).

In order to output plain data (strings and specials with plain string data) methods of SINX_WriteNode class are sufficient. In order to output a special of compound structure, you need: a) open new special by creating a local variable instance of class SINX_WriteSpecialOpen (you can also specify desired traits and formatting tips for symbol opened), b) write its component data (it is done in the same way as SINX_WriteSpecialOpen is derived from SINX_WriteNode too), c) close the special (it is done automatically as SINX_WriteSpecialOpen instance goes out of scope).
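The open/write/close-on-scope-exit idea can be shown with a small sketch. None of the names below belong to the library: Sink, StringSink and SpecialScope are hypothetical stand-ins for the user-defined stream class and for SINX_WriteSpecialOpen. The sketch also assumes the plain "<name data>" shape seen in the path examples above and leaves out traits, escaping and formatting.

```cpp
#include <cassert>
#include <string>

// Hypothetical byte sink, standing in for a user-defined output stream class.
class Sink {
public:
    virtual ~Sink() = default;
    virtual void write(const std::string& bytes) = 0;
};

// Sink that simply collects everything in memory.
class StringSink : public Sink {
public:
    void write(const std::string& bytes) override { text += bytes; }
    std::string text;
};

// Hypothetical scope guard: opens a special on construction, closes it on destruction.
class SpecialScope {
public:
    SpecialScope(Sink& sink, const std::string& name) : sink_(sink) {
        sink_.write("<" + name + " ");
    }
    ~SpecialScope() { sink_.write(">"); }
    SpecialScope(const SpecialScope&) = delete;
    SpecialScope& operator=(const SpecialScope&) = delete;

private:
    Sink& sink_;
};

// Writes special 'a' containing special 'b' with plain data inside.
std::string writeSample() {
    StringSink out;
    {
        SpecialScope a(out, "a");
        {
            SpecialScope b(out, "b");
            out.write("text");
        }  // 'b' is closed here
    }      // 'a' is closed here
    return out.text;
}
```

Since closing happens in the destructor, a special cannot be left open by an early return or an exception, which is presumably the reason for tying it to a local variable.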

sinx.h, sinx.cpp (depends on sinx_l1.h and sinx_write.h) – wrapper for classes and constants of sinx_l1.h and sinx_write.h into unified namespace SINX, definition of some helpers and implementation of wrapper classes for most natural case of reading/writing from/to a file (classes SINX_ReadFileOpen/SINX::ReadFileOpen and SINX_WriteFileOpen/SINX::WriteFileOpen).

Usage of the helpers is quite obvious from their names and parameters. The only not so obvious ones are the SINX::ReadSpecialEnumL<Enum> and SINX::WriteSpecialEnumL<Enum> methods intended for (de)serializing an enum type value (or another integral value with a predefined set of possible values) into/from a special. They are templates whose parameter must be exactly the enum type you want to (de)serialize. You also have to provide an instance of the special class SINX::EnumHelper<Enum> (also a template that must be parametrized with the same enum type) that describes the bijection between values of the set and their string representations. See the example program for construction of such an object.

test.cpp – the actual example. It is a console program. Launched with no parameters, it generates the file example_gen.sinx (it is included in the package as well but you can remove it, it will be rewritten), then reads some data from it. The file is relatively large and structurally complex (an example of a simple and primitive file would be of less interest, wouldn't it?) so don't be put off by the amount of code in the writing sample; there is much of it because it writes much.
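The bijection mentioned above is just a two-way table between enum values and strings. A minimal sketch of the idea follows; EnumNames, Lang and langNames are hypothetical names for illustration and do not reproduce the interface of SINX::EnumHelper (see test.cpp in the package for the real one).

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper, not the library's SINX::EnumHelper: a two-way table.
template <typename Enum>
class EnumNames {
public:
    EnumNames& add(Enum value, const std::string& name) {
        toName_[value] = name;
        toValue_[name] = value;
        return *this;
    }

    // Value to string; returns false if the value has no registered name.
    bool write(Enum value, std::string& out) const {
        auto it = toName_.find(value);
        if (it == toName_.end()) return false;
        out = it->second;
        return true;
    }

    // String to value; returns false if the name is unknown.
    bool read(const std::string& name, Enum& out) const {
        auto it = toValue_.find(name);
        if (it == toValue_.end()) return false;
        out = it->second;
        return true;
    }

private:
    std::map<Enum, std::string> toName_;
    std::map<std::string, Enum> toValue_;
};

// Example enum with an assumed set of string names.
enum class Lang { Ru, En };

EnumNames<Lang> langNames() {
    EnumNames<Lang> names;
    names.add(Lang::Ru, "ru").add(Lang::En, "en");
    return names;
}
```

Reading an unknown name fails instead of producing some default value, so a damaged or outdated file is noticed at the point of deserialization.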

You can also run the program with one parameter, name of a SINX file (see other files with .sinx extensions in the package), it will then dump the file on per-element basis.

That's essentially all. Comprehend and use.

5. User agreement

Our harsh copyracist time requires dropping in a couple of words about this subject as well. Here is the couple of words:

===

User agreement.

Code included in sinx.zip mentioned in this document can be used and modified with no fees and for any purpose, with only the following restriction applied: no copyright or license marks must be added to it. The code is provided as is with no warranty of anything.

Chapter 2 of this document, with exclusion of subchapter 2.4, is description of SINX format. The description can be used with no fees and for any purpose, provided that its text is used with no modifications.

SINX format can be used freely with no fees and for any purpose, provided that following requirements are met:

1. SINX format is considered public domain and must not be a subject of any patent under any circumstances.

2. A new technology or data format that involves use of SINX format in any capacity can only be subject of any patent if the patent only covers that particular technology and format and does not cover any objects besides ones explicitly named in the patent. Usage of SINX format in any patent that covers a generic set of technologies and/or formats is not permitted.

3. Integration of SINX format into existing and patented technology or format (for example, as an alternative low level format) is allowed only under condition that the new element will not be covered by the original patent. If it is not possible to do so without modifying original patent, new patent can be issued for modified format or technology, provided that it satisfies requirements of 2.

By using SINX format and other materials from this web page you thereby agree to terms of this user agreement.

===

The couple of words is not yet legally rigorous enough, yet it outlines main ideas. This section may be subject to extension and more detailed specification in future as needed.

Why is this chapter needed at a homebrew and unknown format home page, you may ask? Isn't it ridiculous in the light of homebrewness and unknownity?

As a homebrew and unknown format, SINX has one more non-obvious advantage: it is patent free. It can be used for any purposes with no fear of situations like this. The need for this chapter is of course questionable; common sense tells us that SINX is unlikely to gain wide fame and supersede unholy XML in the immediate future, or that anyone will ever care about it. But, being a paranoid person, I can't ignore any possible scenarios. It's better to write some ridiculous letters now than to find out some day that it's too late.

6. Example of software

As a proof that I indeed used SINX format for my own needs, here is an example of an application: the game Operation I.T.C.H. (5 mb, .zip) (works on M$ Window$, 2000 and above in theory, checked on XP, 7 and, according to witnesses, in Wine). Finding the places where it is used is left to your curiosity. (Tip: the application is in Russian but includes an English locale; to enable it, change <.lang=ru> to <.lang=en> in locale.cfg.)

7. Extras

My continued use of SINX format results in advent of various handy extras. I will put them out here.

sinxxml.zip (187 kb) – SINX<->XML conversion tool, a console utility. Source code included, run the program for instruction. RapidXML is used for XML handling, therefore there are a couple of minor defects. Conversion from XML generally loses indentation and conversion back to XML yields not a 100% valid XML – you have to add <?xml version="1.0" encoding="utf-8"?> manually (I told you XML handling libraries are flawed!), and you may fail if your SINX consists of several entities at root level as strict XML only allows a single root node (I told you XML format is flawed!:).

sinx_cs.zip (4 kb) – a minimalistic SINX helper library for C# 3+, usable on Window$ Phone (7) SDK (which was actually the primary target). C# documentation comments are used, so you will get it quickly. Class Test at the end of the file is a sort of example of use (strip it before using the library for your own needs). In "public SinxSpecial this[string pName,int i]", you can change to "int i=0" in C# 4+ so that you can use the less verbose xxx["special_name"] instead of xxx["special_name",0].

sinx_php.zip (7 kb) – a SINX helper library for PHP 4/5. No examples, but all functions and classes are heavily documented (even a brief description of the format itself is included) so you won't get lost.
