0. What is it?
SINX (SINX Is Not XML) is a semi-structured data markup language similar
to XML, HTML and a wide variety of lesser-known alternatives. Yes, yet
another one. What's the point, you may ask. Why do we need
yet another markup language?
Don't know about you, but here is an explanation of why it was needed
in my case and why I had to make it up.
1. Motivation
It isn't a big secret that one of the most frequent needs in programming
practice is representing, storing and parsing data in some text
format. Just a few typical examples:
configuration and control files,
resource tables, manifests of all kinds,
serialization and deserialization of data,
logs and reports intended for automatic processing,
...
Why exactly are text formats the best option for such tasks? First,
they are easy to read and edit: a notepad is sufficient for
any text format (there are of course files that call for something
more substantial than a notepad, but a notepad is still sufficient),
while binary formats require a dedicated editor for each one, generally
incompatible with the others. That is inconvenient and impractical in
many cases. And if it is about homebrew formats for your down-to-earth
garage needs, you will likely be in dire shortage of editors. Second,
binary formats naturally have a relatively rigid structure. If your
program evolves and its data structures mutate, making it understand
the previous version of the format (or at least writing an upgrade utility)
becomes a more expensive task than it is with text data.
You may turn out to need more than one such format. In my case,
I came to need three at once: for configuration, for localization
and for serialized storage. And what is the most troublesome peculiarity
of a text (and not just text, actually...) format? Exactly: writing
a parser (and, in the more general case, a generator).
Therefore, a developer eventually comes to understand that it's
better to have a single generalized format rather than a whole zoo
of them, and to build all the lesser formats on top of it. If the base
format is good, you will need to write a parser and a generator only
once and won't have to poke and tweak them for each particular
derived structure.
So, I needed a format satisfying the following requirements:
text based,
capable of representing both structured data, where the tree-like
structure is the primary content, and text with markup, where the text
itself is the primary content.
What is your first thought at the sight of these requirements? Of course
it's XML. It is tuned exactly for these kinds of application, it
is a widespread industry standard, it is the basis for a multitude of
technologies, there are plenty of libraries for it in all the languages
of the world, and so forth, and so forth...
Just take it and use it, one would think. But no. XML was absolutely
unacceptable for me, and here are my reasons.
1. It is redundant. It is just redundant beyond any reasonable limit.
If you want to comprehend the whole horror of it, take a look at
XML-RPC examples, for instance, and try to realize that they are
just about passing a couple of parameters. Human readability of XML
is a purely formal concept. In real life, it is hard labor to pick out
the data in the mash of opening and closing tags, let alone to type
such a document in by hand.
2. It is complicated. It is complicated beyond any imagination. It
pesters you with useless and mutually duplicating extras. Just one thing
to mention is the <tag><value>data</value></tag> vs. <tag
value="data"/> alternative for representing an essentially
identical piece of data!
As a consequence, nearly all of the so-called XML handling libraries are
bloated obscenities that take orders of magnitude more space and files
than some complete programs (while you might just have needed a feeble
config file for "hello, world"), are often built in non-trivial ways
(dragging GNUttish toolchains and Vis$tudios behind them), and the
interfaces they provide come down to iterating, with varying degrees of
manual labor, through a mishmash of subnodes and attributes. Seeing the
resulting code makes you weep tears of blood. If it still doesn't, look
through dusty archives for a description of, say, the
GetPrivateProfileString function and its kin (also, take a look at the
description of the INI file format found at the same place), then compare
it to how XML documents (already loaded and parsed, by the way) are
accessed in what is claimed to be one of the simplest XML parsing libraries.
In addition to all these issues, it is a rare library that can boast of
understanding XML in all its completeness (an example: note the section
unambiguously titled 'What it doesn't do?').
Most people consider XML something good just because they use formats
based on it via appropriate programs (offices, inkscapes, expression
blends etc.) and don't see the homely insides. But we, plain programmers
with plain needs, writers rather than readers, have a slightly different
list of preferences. So, I can't speak for you, but as for me, XML
was obviously the worst solution possible (and a clear overkill in
my particular case).
Please don't reproach me for disparaging an international standard,
an addiction to reinventing wheels and all such. Just imagine yourself
in my place. You are a plain programmer, you have a plain compiler
and a virgin-snow notepad in front of you, and the task is to write
a lightweight, slightly configurable console program. All these magnificent
international standards, how will they help you when you have to drag
home and toilet train a monster of large tonnage that will turn your
miserable code into its tiny appendix? That's it. And, if you already
have the monster at home and trained, you definitely had no better
things to do.
So, XML is out. What easier alternatives, free of the aforementioned
flaws, does the collective intellect offer us?
They are of course better, but unfortunately are overspecialized.
Namely, all of these formats are tailored purely for data, and using
them to represent text with loose markup is a thankless chore. Also,
their syntax is somewhat functionally redundant and somewhat too
rigid. For example, JSON provides exactly 6 fundamental data types
(a string, a number, a bool, a null, a dictionary, a vector). In order to
be able to parse this format, the program must understand all six,
even if it doesn't need all of them. On the other hand, as soon as
the program needs a different data type (such as a complex number
or a text with markup), the format suddenly ceases to be compact
and begins to fall apart. Approximately the same holds true for all
the others.
Very good. I needed something like this, lightweight and structurally
flexible, two in one. But still, it's not ideal. First, this format
is data oriented as well, and representation of text with markup
(one of the requirements, I remind you!) is quite clumsy in it. Second,
the syntax could still be simpler. I don't know what the author meant
by simplicity, but ways of further simplification and reduction
of redundancy are obvious from the very first lines of the spec.
All these ordeals led me to the idea of the SINX format.
Not that it turned out overly original or revolutionary. It wasn't
a priority to concoct a brand new markup language resembling nothing
we've seen before (and why should it be?), and you can easily
spot borrowings from its predecessors in it. (Even the foundation of
the format is not actually built from the ground up, it is borrowed from an
ancient Starcraft 1 related utility where I liked the author's approach
to representation of unprintable characters). The priorities were
as follows:
a format fit for both text and structured data, similarly to XML and
HTML but free of their heavyweightness,
maximum simplicity and minimum redundancy. If we have text, it must
not be lost among syntax elements and should be deformed for syntax needs
as little as possible. If we have structured data, syntax elements
must be just sufficient to specify the structure and not an
ounce more. Classification of data by types, their decomposition
into information and metainformation, these are things the user can
well do on their own if needed (and, if it is not needed, it is all the
more out of place in the format). Our task is just to let them pull these
data out without unnecessary bureaucracy and questionable conventions.
It's up to you to judge how well the idea is implemented.
2. Idea and syntax
The format is based around the following concept. We have a SINX
string: a (finite) sequence of symbols, possibly an empty one.
There are two sorts of symbols:
terminal symbol (a single character: a letter, a digit, etc.),
special symbol, described in terms of a name and data. A special
symbol's data is also a SINX string, and all the aforementioned
considerations apply to it recursively.
This principle is the basis for everything that follows.
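To make the recursion concrete, here is a minimal sketch of how such a string might be held in memory. The names Special, Symbol and count_terminals are made up for this illustration and are not part of the format.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Special:
    """A special symbol: a name plus data, where data is again a SINX string."""
    name: str
    data: List["Symbol"] = field(default_factory=list)


# A symbol is either a terminal (a one-character str) or a Special.
Symbol = Union[str, Special]


def count_terminals(symbols: List[Symbol]) -> int:
    """Count terminal symbols in a SINX string, descending into specials."""
    total = 0
    for symbol in symbols:
        if isinstance(symbol, Special):
            total += count_terminals(symbol.data)
        else:
            total += 1
    return total


# Hand-built model: two terminals followed by one special named 'x'
# whose data holds two more terminals.
example = ["a", "b", Special("x", ["c", "d"])]
```

A SINX string is then just a list, and the data of any special can be handed to the same functions as the whole document.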
2.1. Encoding
The SINX format is fine with any encoding that can represent the characters
'<', '>', '%' and provides some whitespace characters. It is desirable
but not required to have the characters '.', ':', '=' in it, and even
less required to have '-', '#', all hexadecimal digits, 'x' (Latin
ex), 'l' and 'g'. UTF-8 or any other 8-bit ASCII-based encoding is
recommended. If you are into perversions though, something like UTF-32
or EBCDIC will also do.
An important note: all characters in SINX are taken as is; under
no circumstances are they translated or transformed. That is, whitespace
runs are not merged and not skipped, tabs and line feeds are not
translated into spaces, letters are case sensitive, and so on. Some
unavoidable platform specific dodges are possible, such as translation
of CR/LF into a single line feed, but these transforms are assumed
to take place at the physical stream reading level before the data is
fed to the SINX parser, so the format doesn't care about them. At the SINX
format level, all the input characters are considered verbatim. From here
on, if we speak of 'a sequence' or 'a run' without any extra specification,
we assume it is binary identical to its representation in the input
stream.
2.2. Syntax (SINX level 0)
A SINX string is a sequence of elements. The following element types are
possible:
1) Plain terminal symbol – any character except '<' and '>'.
Example: aba125 letters are parts of the example 15#tbs.
Plain terminal symbol denotes simply the appropriate character.
2) Verbatim data run:
'<' delimiter '%' sequence-of-characters '%' delimiter '>'
where:
delimiter is a sequence (possibly an empty one) of any characters
except '%', '<', '>', '=' and whitespaces; the second delimiter must
be binary identical to the first one,
sequence-of-characters is a sequence of any characters allowed
in the encoding used that doesn't include the subsequence '%' delimiter
'>'.
Examples:
<abc%example%abc>
<%example of run with an empty delimiter%>
<ab%example of run < including >< special characters > % in it%%ab>
(last % before %ab is included in the run as well)
<$#!@%example of run with non-alphanumeric characters in the delimiter%$#!@>
<abcdx%this %abcdx is a %abcdx > valid %abcd> example too%abcdx>
(the run includes all characters between <abcdx% and %abcdx>: the ending
marker of the run is not the delimiter itself, but exactly a sequence
of a percent sign, the delimiter and a closing angle bracket)
A verbatim data run is interpreted as the sequence of its characters, taken
binary verbatim, excluding the starting and ending markers.
Example: abc<%def%>ghi is interpreted the same as abcdefghi
Nevertheless, the starting and ending markers are considered a part of
the SINX string that contains them and are included in it if we speak
of the SINX string as a whole.
Characters inside a verbatim data run are not interpreted in any special
way and are all part of the run regardless of their kind.
<abc% it can <def% even %def> be like this %abc> (<def% and
%def> here are just sets of characters included in the run along
with all others)
The purpose of a verbatim data run is to express a piece of data that may
include characters that would otherwise have a special syntax meaning.
<%For example, of arithmetical or logical language expressions,
such as x = i>11? 100 : 500;%>
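The rule for finding the end of a verbatim data run can be sketched in a few lines. This is only an illustration: the function name read_verbatim_run and its return convention are assumptions, and it uses Python's own notion of whitespace in place of "whitespace of the encoding used".

```python
def read_verbatim_run(text, start):
    """Try to read a verbatim data run whose '<' is at text[start].

    Returns (payload, index_after_run), or None if the bracket does not
    open a verbatim run (in which case it opens a special).
    """
    assert text[start] == "<"
    i = start + 1
    # The delimiter is any run of characters other than '%', '<', '>', '='
    # and whitespace.
    while i < len(text) and text[i] not in "%<>=" and not text[i].isspace():
        i += 1
    if i >= len(text) or text[i] != "%":
        return None
    delimiter = text[start + 1:i]
    # The ending marker is the whole sequence '%' delimiter '>',
    # not the delimiter alone.
    closing = "%" + delimiter + ">"
    body_start = i + 1
    end = text.find(closing, body_start)
    if end == -1:
        # An unterminated run is implicitly closed at the end of input.
        return text[body_start:], len(text)
    return text[body_start:end], end + len(closing)
```

Everything between the markers is returned untouched, so brackets and percent signs inside the run need no special treatment.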
3) Special symbol (or just 'special' for short):
'<' name delimiter SINX-string '>'
where:
name is a sequence (possibly empty) of any characters except
'%', '<', '>', '=' and whitespaces,
delimiter is either a sequence of whitespaces (can be empty
if followed by '<' or '>') or a single '=' character.
SINX string is special symbol's data and can in turn consist of all
aforementioned types of elements (including other special symbols).
It can be empty.
Examples:
<a example>
<$!#@ example with weird special name>
<a example with composite data: <xxx%verbatim run%xxx>, <b nested
special>>
<a=example with delimiter of '='>
<a
example with delimiter of line feed (assumed it is a whitespace in
the encoding used)>
< example of special with an empty name>
<=another one>
<xxx> (example of special with empty data)
<xxx=> (example of special with empty data and delimiter of
'=')
<xxx<%yyy%>> (example of special with empty whitespace-type
delimiter which is possible due to its data starting with '<')
<> (example of special with empty name and data)
<=> (another one)
< > (another one)
<a =the '=' character is part of data here as the delimiter is
a single space>
<a= the delimiter here is '=' while the space after it is part
of data>
Characters of the delimiter and opening and closing '<' and '>' are
not included in name or in data of the special. Nevertheless, they
are considered a part of its containing SINX string and are included
in it if we speak of the SINX string as a whole.
Special symbol is interpreted as, well, a special symbol with a given
name and data. There are some issues about the name:
if the name starts with '.' and/or ends with ':', these characters
are not included in the name, but the parser MUST keep extra flags
to remember the fact they were present for this special. Leading
'.' of the name is an attribute trait and trailing ':' is
a compound trait.
For example, all these specials have the same name:
<a example of special named 'a'>
<.a example of special named 'a' with an attribute trait>
<a: example of special named 'a' with a compound trait>
<.a: example of special named 'a' with an attribute and a compound
traits>
also, the parser MUST keep an extra flag to remember the type of
delimiter used in this special, '=' or whitespaces run. Delimiter
of '=' is a terminal trait.
Traits have no syntax payload by themselves (apart from the fact they
must be stripped off the name and noted separately by the parser).
Their purpose is metasyntactic. The user of a derived format can assign
any meaning to them at their own discretion, or ignore them entirely.
SINX level 1 (see below) contains some recommendations about interpretation
of traits but these are not requirements and are not enforced at
the SINX format level.
If the encoding used does not provide characters for '.', ':' or '=',
all rules that involve them can be ignored.
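Stripping the traits off a raw name is a small mechanical step. A possible sketch follows; the names ParsedName and split_traits are hypothetical, chosen just for this example.

```python
from typing import NamedTuple


class ParsedName(NamedTuple):
    name: str        # the name with trait characters stripped
    attribute: bool  # the raw name had a leading '.'
    compound: bool   # the raw name had a trailing ':'
    terminal: bool   # the delimiter was '='


def split_traits(raw_name: str, delimiter: str) -> ParsedName:
    """Separate a raw special name into the bare name and its trait flags."""
    attribute = raw_name.startswith(".")
    if attribute:
        raw_name = raw_name[1:]
    compound = raw_name.endswith(":")
    if compound:
        raw_name = raw_name[:-1]
    return ParsedName(raw_name, attribute, compound, delimiter == "=")
```

All four spellings from the example above (a, .a, a: and .a:) come out with the same name and differ only in their flags.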
What is the meaning of the phrase "considered a part of its containing SINX
string and are included in it if we speak of it as a whole"? It is
very simple. In a per-symbol reading of a SINX string, these syntax
characters are not counted as symbols. If the SINX string is taken as a
whole (for example, as data of a special), all of its characters
are considered, including the ones with special syntax meaning.
Example: 1235<xx%abcdef%xx>678<a <%x%>> is a sequence of symbols:
'1', '2', '3', '5', 'a', 'b', 'c', 'd', 'e', 'f', '6', '7', '8',
'special named a with data of <%x%>'. At the same time,
as a whole SINX-string, it is 1235<xx%abcdef%xx>678<a <%x%>>.
The same way, data of the special named a of this sequence
is SINX string <%x%>, which is a sequence of a single symbol
'x' from standpoint of per-symbol reading.
That's all about SINX syntax. Here is a grammar if you want one:
any-character ::= any character possible in encoding used
character-not-bracket ::= any character possible in encoding used except '<' and '>'
name-character ::= any character possible in encoding used except '<', '>', '%', '=' and whitespaces
whitespace ::= any whitespace from encoding used
name ::= name-character*
delimiter ::= '=' | whitespace*
(if there is no '=' in the encoding used then delimiter ::= whitespace*)
SINX-string ::= SINX-symbol*
SINX-symbol ::= character-not-bracket | verbatim-data-run | special
verbatim-data-run ::= '<' name '%' any-character* '%' name '>' (2nd name must be binary identical to the first one)
special ::= '<' name delimiter SINX-string '>'
As you can see though, the logic is very simple and the parser can
be written without any grammar in mind.
The syntax is designed so that, as far as possible, any sequence of
characters is a valid SINX string. A random sequence can still be invalid.
However, a syntax error is not an excuse for not returning
a result. The following are the possible types of syntax errors, which MUST
be handled by the parser using one of the methods offered for each,
with the choice of a particular method left up to the parser implementation:
1) Character '>' with no corresponding opening '<'. Example: abc>de
Options:
treat it as end of input stream, not inclusive (abc),
ignore it (handle the stream as if the problematic character was
missing) (abcde).
2) Unterminated special. Example: abc<de <f><g h
Options:
assume that all specials not terminated explicitly are terminated
implicitly in corresponding order as the input stream ends (abc<de
<f><g h>>),
ignore opening fragments of unterminated specials including their
delimiters (abc<f>h).
3) Unterminated verbatim data run. Example: <abc%def
It MUST be always assumed implicitly terminated as the input stream
ends (<abc%def%abc>).
4) A sequence of '<' name (end of input stream) MUST be considered
an opening fragment of a special with an empty whitespace-type delimiter
and therefore a particular case of error 2).
The parser MAY report syntax errors it found in some additional way
but MUST always parse to the end, recovering according to error handling
options chosen for the implementation.
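Putting the grammar and the error rules together, a level 0 parser might look roughly like the sketch below. It is not the author's library: the names Special, parse and the helper functions are invented here, Python's str.isspace stands in for "whitespace of the encoding used", and two choices are assumptions on my part: a whitespace delimiter is consumed greedily, and of the offered error options the sketch ignores a stray '>' and closes unterminated specials implicitly.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Special:
    name: str                                  # trait characters stripped
    data: List["Symbol"] = field(default_factory=list)
    attribute: bool = False                    # leading '.' in the raw name
    compound: bool = False                     # trailing ':' in the raw name
    terminal: bool = False                     # '=' delimiter


Symbol = Union[str, Special]


def parse(text: str) -> List[Symbol]:
    """Parse a SINX level 0 string into a list of symbols."""
    symbols, _ = _parse_string(text, 0, nested=False)
    return symbols


def _parse_string(text: str, pos: int, nested: bool) -> Tuple[List[Symbol], int]:
    out: List[Symbol] = []
    while pos < len(text):
        ch = text[pos]
        if ch == ">":
            if nested:
                return out, pos + 1            # closes the enclosing special
            pos += 1                           # error 1: stray '>' is ignored
        elif ch == "<":
            items, pos = _parse_bracket(text, pos)
            out.extend(items)
        else:
            out.append(ch)                     # plain terminal symbol
            pos += 1
    return out, pos                            # error 2: implicit termination


def _parse_bracket(text: str, pos: int) -> Tuple[List[Symbol], int]:
    i = pos + 1
    while i < len(text) and text[i] not in "<>%=" and not text[i].isspace():
        i += 1
    raw = text[pos + 1:i]
    if i < len(text) and text[i] == "%":       # verbatim data run
        closing = "%" + raw + ">"
        end = text.find(closing, i + 1)
        if end == -1:                          # error 3: runs to end of input
            return list(text[i + 1:]), len(text)
        return list(text[i + 1:end]), end + len(closing)
    terminal = i < len(text) and text[i] == "="
    if terminal:
        i += 1                                 # a single '=' is the delimiter
    else:
        while i < len(text) and text[i].isspace():
            i += 1                             # assumption: greedy whitespace
    attribute = raw.startswith(".")
    if attribute:
        raw = raw[1:]
    compound = raw.endswith(":")
    if compound:
        raw = raw[:-1]
    data, i = _parse_string(text, i, nested=True)
    return [Special(raw, data, attribute, compound, terminal)], i
```

The sketch keeps only the per-symbol reading; a real implementation would probably also remember source offsets so that the data of a special can be recovered as a whole SINX string.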
The description of syntax and syntax error handling options is named "SINX
format level 0". It is actually sufficient for practical use. However,
there are some problems typical for markup languages that you will
likely have to solve for your format. SINX level 1 covers the
most common things you may need, see the next chapter for more.
2.3. Basic metasyntax recommendations (SINX level 1)
While SINX level 0 specifies format syntax, SINX level 1 establishes
several conventions about interpretation of special symbols from
a certain set in your target SINX-based format.
1) Characters specified by numeric code.
Specials with name of '#' run-of-decimal-digits or '#' '0'
x|X run-of-hexadecimal-digits MUST be interpreted as a character
with corresponding numeric code in encoding used. Examples:
<#33> (ASCII character '!')
<#0x33> (ASCII character '3')
<#0XDeAdBeEf> (character with hexadecimal code of DEADBEEF)
Data of such specials MUST be ignored. E. g., mail<#0x40 gg>to
is interpreted identically to mail<#0x40>to and mail@to.
If name of the special starts with '#' but does not form a valid decimal
or a hexadecimal number, the special is not translated into character
and is considered as a general special with this name.
Additionally, there are two specials that MUST be interpreted as '<'
and '>':
<lt> – '<',
<gt> – '>'.
The l, g and t letters can be in any case if the encoding used provides
them in different cases.
E. g.: <example here are some quoted specials: <lt><gt> <lT>spec1
content1<Gt> <Lt>spec2 content2<gT>> – is interpreted identically
to <example <%here are some quoted specials: <> <spec1 content1>
<spec2 content2>%>>
This rule MAY be dropped if the encoding used does not provide characters
for hexadecimal digits and the characters '#', 'l', 'g' and 'x'.
2) Remarks/comments.
Specials with name of '--' MUST be treated as remarks and ignored
entirely.
E. g.: a<-- b>cd – identical to acd
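The two conventions above amount to a small post-processing pass over parsed symbols. Here is a sketch under some assumptions: the Special shape and the function names are hypothetical, the encoding is taken to be Unicode, and a numeric code that Unicode cannot represent is simply left as a general special (the text does not say what to do with it).

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Special:
    name: str
    data: List["Symbol"] = field(default_factory=list)


Symbol = Union[str, Special]

_DECIMAL = re.compile(r"#([0-9]+)")
_HEX = re.compile(r"#0[xX]([0-9A-Fa-f]+)")


def name_to_char(name: str) -> Optional[str]:
    """Return the character a level 1 special name stands for, or None."""
    if name.lower() == "lt":
        return "<"
    if name.lower() == "gt":
        return ">"
    match, base = _HEX.fullmatch(name), 16
    if match is None:
        match, base = _DECIMAL.fullmatch(name), 10
    if match is None:
        return None                  # e.g. '#abc' stays a general special
    try:
        return chr(int(match.group(1), base))
    except (ValueError, OverflowError):
        return None                  # assumption: not representable, keep as is


def apply_level1(symbols: List[Symbol]) -> List[Symbol]:
    """Replace character-code specials with characters and drop remarks."""
    out: List[Symbol] = []
    for symbol in symbols:
        if isinstance(symbol, str):
            out.append(symbol)
            continue
        if symbol.name == "--":
            continue                 # remark: ignored entirely
        char = name_to_char(symbol.name)
        if char is not None:
            out.append(char)         # data of such specials is ignored
        else:
            out.append(Special(symbol.name, apply_level1(symbol.data)))
    return out
```

Keeping this pass separate from the level 0 parser means a target format that wants different rules can simply skip or replace it.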
Two conventions above are compulsory for conformance to SINX level
1. Additionally, SINX level 1 provides some optional conventions
about usage of traits in target user format.
3) Terminal trait ('=' delimiter in a special) SHOULD be applied to
a special if its whole data is assumed terminal within the target
user format, that is, its format is assumed opaque at the SINX level
and is to be passed verbatim to its own dedicated parser. Note however
that the data is still required to be a syntactically valid SINX string.
Depending on conventions of the target user format, SINX specials
inside the terminal data SHOULD be either interpreted as characters
(according to convention 1 of the current chapter, if they satisfy
the requirements), ignored, or emitted per-character binary verbatim.
Verbatim data runs inside the terminal data SHOULD also be either
interpreted as the corresponding character sequence or emitted per-character
binary verbatim.
Examples:
<url=http://mikle33.narod.ru>
<url=http:<#47><#47>mikle33.<-- remark -->narod.<%ru%>> (depending
on target user format conventions, the data can be interpreted in
one of many ways: http://mikle33.narod.ru, http:<#47><#47>mikle33.narod.ru,
http://mikle33.narod.<%ru%>, http:<#47><#47>mikle33.narod.<%ru%>,
http:<#47><#47>mikle33.<-- remark -->narod.<%ru%>, etc.)
The '=' delimiter MAY be omitted if the special has empty data.
4) Attribute trait SHOULD be applied to a special if the target user
format interprets it as an attribute of its immediate containing
special (or of the whole stream if it is not contained anywhere).
A special is assumed to contain no more than one immediate child
special of the same name with an attribute trait. (That is, specials
with an attribute trait are recommended as equivalent to XML/HTML
attributes.) If there are several attribute specials with same name,
the target user format may consider only the first one or only the
last one.
Example:
<error <.locations <line=11><line=22><line=100500>at these lines>An
error has been encountered>
5) Compound trait SHOULD be applied to specials whose purpose in the
target user format is to store structured data and that are only
assumed to have meaningful data inside contained specials (the convention
is not recursively applied to these contained specials unless they
have compound traits too). All other characters, including comments
and character code specifiers (according to conventions 1 and 2 of
the current chapter) SHOULD be treated as remarks or formatting helpers
and ignored as such.
Example:
<url: <scheme=http> <domain=mikle33> <domain=narod> <domain=ru>>
<url:
(url decomposed into components and expressed as a compound SINX special)
<scheme=http>
<domain=mikle33>
<domain=narod>
<domain=ru>>
Both versions should be interpreted identically to <url: <scheme=http><domain=mikle33><domain=narod><domain=ru>>
Traits can be combined if that fits their role in the target user
format. Example:
<table: <.width=100%><.height=50%>
<tr:
<td <b>Cell11</b>> <td <i>Cell12</i>>
>
<tr:
<td <u>Cell21</u>> <td Cell22>
>
<tr:
<td <.colspan=2>Note: <b>, </b>, <i>, </i>, <u> and </u> are not opening/closing
tags here, they are individual SINX specials.>
>>
Terminal and compound traits are obviously incompatible, but it's
syntactically possible to have them at the same time. SINX level
1 does not provide any convention for use of such a combination and
intentionally leaves it to the target user format.
If the encoding used does not provide a character for some trait and
corresponding syntax rule is dropped, the corresponding convention
from 3-5 of the current chapter is also dropped.
2.4. Is that all?
Yes, that's all. Note, not "all you have to know for starters",
that's exactly all. This chapter describes both levels of the SINX format
exhaustively.
3. Why SINX and not XML?
First of all, it makes sense to ask the reverse – why XML and not
SINX?
If you are not free to choose the format due to your task specifications,
or take part in a project where it is not up to you to decide on
everything and where familiarity of the technologies used is important
for collaboration, or you are using pre-existing sophisticated tools
that already have their own fixed formats, or are otherwise constrained,
well, it can't be helped then, use XML and accept my deepest condolences
on the sad occasion.
However, if you are doing something on your own and are free to choose
tools and technologies, most of your arguments pro XML likely stem
from established stereotypes and lemming instinct. These are bad
things, you should learn to break them in favor of reason. Here are
some considerations for you on why SINX is better than XML for homebrew
and ad hoc needs.
+1. SINX is simpler. It possibly isn't worth mentioning separately, but
SINX is really much simpler. It is nearly as simple as is possible
at all (it could be even simpler if it weren't for some aesthetic
and auxiliary considerations). A level 0 parser can be written from
scratch in about an hour, and upgrading it to level 1 will take about
one more. At the same time SINX provides you with exactly the same
capabilities for expressing data and markup as XML does. You probably
noticed the pseudo-HTML example in the previous chapter. It gives a good
picture of the idea of cross-conversion between XML/HTML and SINX. The idea
is so primitive and obvious (and can just as obviously be automated)
that I shall not explain it in detail, if you don't mind.
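For the curious, one possible XML to SINX mapping can be sketched with Python's standard xml.etree.ElementTree. The mapping is my assumption, not something the format prescribes: each element becomes a special, each XML attribute becomes a child special with an attribute trait and an '=' delimiter, and text containing angle brackets is wrapped in a verbatim data run. The function names are hypothetical, and the sketch does not try to preserve leading whitespace of text or to handle namespaces.

```python
import xml.etree.ElementTree as ET


def quote(text: str) -> str:
    """Return text as SINX data, using a verbatim run if it has brackets."""
    if "<" not in text and ">" not in text:
        return text
    # Pick a delimiter whose ending marker does not occur in the text.
    delimiter, n = "", 0
    while "%" + delimiter + ">" in text:
        n += 1
        delimiter = "q" + str(n)
    return "<" + delimiter + "%" + text + "%" + delimiter + ">"


def element_to_sinx(elem: ET.Element) -> str:
    """Convert one XML element (with its subtree) into a SINX special."""
    body = []
    for key, value in elem.attrib.items():
        body.append("<." + key + "=" + quote(value) + ">")
    if elem.text:
        body.append(quote(elem.text))
    for child in elem:
        body.append(element_to_sinx(child))
        if child.tail:
            body.append(quote(child.tail))
    if not body:
        return "<" + elem.tag + ">"
    return "<" + elem.tag + " " + "".join(body) + ">"
```

For instance, element_to_sinx(ET.fromstring('<td><b>Cell11</b></td>')) produces <td <b Cell11>>.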
Moreover, SINX provides the same capabilities in easier and less ambiguous
ways. E. g., in XML, you know how much you have to change in both the XML
document and the code that reads it in order to convert an attribute to
a nested tag or vice versa. In SINX, you simply don't distinguish
between nested tags and attributes. Your problem is basically an
aesthetic choice of whether to use an attribute trait or not.
+2. SINX is more compact. It isn't just about syntax overhead, which
is ~2 times smaller in a SINX string than in a structurally identical XML
document. The format also affects the text itself to a lesser degree.
Have you ever considered printing a message for the human
user to stdout using XML formatting? Unlikely. Common sense suggests
that the actual text will be lost among service characters and syntax
extras. With SINX, it is possible to do this so that human readability
hardly suffers compared to plain text. Check it out:
<?xml version='1.0' encoding='UTF-8' ?>
<message file="file.txt" line="14">
This message concerns line &lt;14&gt; of file "file.txt".
</message>
vs:
<message <.file=file.txt> <.line=14>
<% This message concerns line <14> of file "file.txt". %>>
+3. SINX is more versatile. Let's consider an XML piece <tag attr="some-value"/>
and a similar SINX string <tag <.attr some-value>>. What can
we say about some-value? It is probably some text. But why exactly
text? Why can't it happen to be a nested node with its own structure?
SINX provides this capability by design: <tag <.attr <.attr-of-attr=100%><component1=xxx><component2=yyy>>>
(note that the data of the special is also a SINX string and therefore
can be extracted and treated as an independent document). In XML,
even if we laboriously worked out a counterpart using nested tags:
<tag><attr>
<attr-of-attr>100%</attr-of-attr>
<component1>xxx</component1>
<component2>yyy</component2>
</attr></tag>
we would have the insides of <tag>, which are generally not even a valid
XML document that can be parsed outside its containing document.
This advantage is much more powerful than it may seem. Why? A couple
of illustrative examples of modern data formats:
SVG:
...
<path fill="none" stroke="black" d="M 227 239 L 328 90 L 346 250 L
201 124 L 410 150 L 228 238" />
...
CSS:
...
<style type="text/css">
body { color: red; }
h1 { color: white; background-color: orange !important; }
h2 { color: white; background-color: green !important; }
</style>
...
These are good illustrations of the anecdotal situation in so-called XML
based formats. Ideally, 'XML based' means that we would just take a random
XML parser, set it on the document, and our problems are solved.
But it frequently happens that XML's actual role in a format
is limited to outlining a major lump of data that has its own entirely
different format and requires a dedicated parser. But what's so special
about these data? Do they have an exotic structure that makes it impossible
to represent them as a subtree inside the document's general structure
and to parse them uniformly? It may happen of course (javascript
fragments, for example), but what we have here is definitely not
that case. What we have here is a bunch of nodes with properties, similar
to those in the rest of the document. What prevents us from representing
the same CSS as:
...
<style ...>
<body color="red"/>
<h1 color="white" orange="!important"/>
<h2 color="white" green="!important"/>
</style>
...
?
Yes, it would take a bit more space, but uniformity of the format
would yield advantages in... oh wait, can't CSS elements be nested?
And can't they have fields of compound format?? And can't the style
be specified both in a dedicated section of a document and in the subject
tag's style attribute??? Oh gosh. It really looks like a better idea to
make up a dedicated format.
A cheapskate pays twice, that is.
But if you used SINX instead of X/HTML, you wouldn't have the
problem.
<style ...>
body {
background-color: #111111;
background-image: url("caulk_sucker.jpg");
}
p {
}
h1 {
color: #222222;
background-color: #333333;
}
</style>
...
<h1 style="color: #FFFFFF; background-color: #000000;">want it directly?</h1>
===>
<style:
<body:
<.background-color=#111111>
<.background-image: <.url=caulk_sucker.jpg>>
>
<h1:
<.color=#222222>
<.background-color=#333333>
>
>
<h1 <.style: <.color=#FFFFFF> <.background-color=#000000>>can do directly!>
(Note: the result is nearly the same and is still compliant with SINX.)
Unfortunately, XML/HTML+CSS became widespread in the form they are
and it is too late to cure their genetic defects. But not everything
is lost yet for your homebrew formats: you can still choose SINX to base
them on and thereby dramatically reduce the chance of having to make up
a language-in-a-language.
+4. SINX can even be used for binary data. The rule of binary verbatim
interpretation of the input stream is harsh and unambiguous, and a verbatim
data run is more universal than XML's CDATA and makes it possible to host
truly arbitrary binary data in one chunk.
Also, SINX level 0 is perfectly fit for in-place parsing.
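The binary data point in +4 rests on one trick: choose a delimiter whose ending marker does not occur in the payload. A minimal sketch follows, with hypothetical function names and an arbitrary 'd1', 'd2', ... delimiter scheme that is purely my choice.

```python
def wrap_verbatim(payload: bytes) -> bytes:
    """Wrap arbitrary bytes into a single verbatim data run."""
    delimiter, n = b"", 0
    # Try delimiters until '%' delimiter '>' is absent from the payload.
    # This terminates because a finite payload holds finitely many markers.
    while b"%" + delimiter + b">" in payload:
        n += 1
        delimiter = b"d" + str(n).encode("ascii")
    return b"<" + delimiter + b"%" + payload + b"%" + delimiter + b">"


def unwrap_verbatim(run: bytes) -> bytes:
    """Extract the payload from a run produced by wrap_verbatim."""
    head = run.index(b"%")               # end of the opening delimiter
    delimiter = run[1:head]
    closing = b"%" + delimiter + b">"
    end = run.index(closing, head + 1)   # first ending marker after the head
    return run[head + 1:end]
```

Unlike CDATA, where a payload containing the closing sequence has to be split into several sections, here the payload always stays in one piece and only the delimiter changes.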
An unnumbered bonus. SINX is quite gentle in terms of migration from
XML. As said before, cross conversion between XML and SINX is a simple
and transparent task, and a universal converter can easily be written.
SINX even uses the so familiar angle brackets. :)
Joking aside, a curious fact: most simple XML/HTML documents with no
jabascripts, CDATA-s and sophisticated comment tags are also valid SINX
strings, with each opening and closing tag corresponding to a special.
Ok, we got it about the advantages. What about drawbacks? Just for
fairness, let's consider several typical objections.
–1. "There are many technologies developed for XML, such as XSLT,
Xpath and so on, it has dtd/dsd/schema. Your format has none of these.
And, if it all is to be made up, the result won't be essentially
better. So what's the point?"
Well, how often do you use schemas even in production, let alone your
everyday garage practice? Did you ever need one? What for? To validate
a document? You'd think your program is not aware of its own internal
data structure and doesn't know what keys and values to check in
an input DOM tree without any extra tips.
Arguably, there are some exotic cases where a schema can be of use.
But there is a 99% chance that your case is not exotic. That is about the
percentage of XML use cases where you would not need any schemas.
An average XML parsing library is not even educated in this science.
So, really, what's the point?
I shall be even more brief about the rest of "many technologies".
A major share of them are overspecialized crutches invented with
sole purpose of alleviating XML clumsiness in certain applicatons.
Most of these "technologies" are likely known to you only by name
and are essentially foreign to your everyday needs.
There are actually quite few things that you need from a base text
format, and all of these things are provided by SINX in a better way
than they are in XML (see the list of advantages).
–2. "XML has many libraries to be worked with, while your format is
only known to one person and a half, and it has no legacy code."
Take these "many libraries" and enjoy manual iteration through DOM
trees via polished interfaces. Have fun with updates to your data
files as your target format extends or mutates. And, after you realize
you are fed up with the fun, read chapter 2 once more. SINX is so
simple that writing your own parser for it will take less time from
you than comprehension of some XML handling libraries would. And
even this loss will pay back as you save much more time later on
minor ad hoc issues due to simpler and more reasonable structure
of the format.
For a starting idea of how a SINX handling library might look, you can
check the example from chapter 4.
–3. "If it is about invention of a custom wheel, why don't I make
up my own one instead of your one? It will have all blackjack and
whores I need for sure."
Go forth and try. Thinking with your own thinker instead of worshipping
authorities is a useful occupation that often yields valuable fruit.
One forum guy considered SINX too complicated and ideologically incorrect
altogether, so he invented his own wheel right on the spot in order to show
the imperfection of my idea and its competitive disadvantage on the wheel
market.
And what he invented turned out to be an absolute copy of SINX level 0,
with just a slightly more verbose syntax for delimiting special symbols.
So, the skeptical guy proved with his own hands that SINX is extremely
close to perfect minimalism.
You can try and make up something of your own, but is it really worth
it when you have a decent chance of coming up with nearly the same solution?
:)
4. Library
According to common courtesy, we'll supplement the format description with a
sample C++ library for reading and writing and an example of its
use:
sinx.zip (27 kb). The library is not specifically compact or
spherical, as compactness and sphericality were not objectives. Besides
directly format related things, it includes some extras and commodities
and also uses some STL items (namely, std::map, std::deque and std::string).
On one hand, it was intended as a reference SINX parser and generator
implementation. On the other hand, the author intended to use it
in his own real projects and occasionally had the word "portability"
in mind. (Nevertheless, even with all its incompactness, non-sphericality
and extras, it is (you can check how many times) more compact than
RapidXml (the most minimalistic XML library existing, of roughly
the same feature set), and still implements SINX level 1 entirely,
with no "unsupported features". Isn't it indicative?)
There is no detailed documentation because: a) I'm lazy :), b) everything
an end user needs to be happy is shown in the example and is quite
obvious in its simplicity, c) everything is doxygen-commented so
you can use the power of doxygen. I shall only provide a brief description
of the package and of its basic idea, and some brief instructions
on its use.
The package contains several files that you should unzip into one
single folder:
sinx_pltfm.h – platform dependent concepts for a parser and
some general purpose definitions,
sinx_read_l0.h, sinx_read_l1.h – implementations of level 0
and 1 parsers respectively,
sinx_write.h – implementation of a generator (uses the above
items),
sinx.h, sinx.cpp – convenience wrappers for the above items,
test.cpp – main program file of the example,
test.sinx, testpr.sinx, example_gen.sinx – samples of SINX
files. The last one is generated by the example (as a generator demo),
the one before the last is a straightforward conversion from an XML file
(a stub C++ Builder project), mostly to show the unnumbered bonus
from chapter 3.
No project files, makefiles or other garbage are found in the package.
The program is meant to be built by a trivial console command. For
C++ Builder 5 and above (most likely, it works for lower versions too) it is
done by this command:
bcc32 test.cpp sinx.cpp (builds test.exe)
In M$ Vi$ual $tudio 2005 and later (assuming you have already configured
the $tudio to react to console commands) –
cl /EHsc test.cpp sinx.cpp
The example is known to compile on gcc from WxDev++ (gcc test.cpp
sinx.cpp) but it didn't want to see the standard library at the linking
stage, and I had no time and need to RTFM. Therefore, if you have
gcc or some other marginal compiler (or even an operating system,
who knows), take the examples above as an illustration of the idea and figure
your case out yourselves.
Actual use of the library won't be described here either; typical
use cases can be taken from the same example (test.cpp) and they
are quite simple. But some explanations about the internals and logic
of some stuff are nevertheless worth giving.
So, some more words about each file, in order of their logical dependency.
sinx_pltfm.h. The initial plan was to make the parser absolutely platform
independent and parametrized via some concept class where the
user would define his own platform specific parts. One such class,
SINX0_Platform, is provided (meant as an example) in sinx_pltfm.h.
It defines the following essential items: SINX0_Platform::TChar
(a character from the encoding used), syntactically essential character
codes, dynamic memory (de)allocation functions and the SINX0_Platform::Buffer
class that implements a buffer containing the SINX string to be parsed.
In this implementation, it uses a memory buffer that contains the
string in one piece, but nothing restricts you from rewriting its
internals to work with partially buffered data from an external storage.
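The idea of a platform concept class can be illustrated with a minimal sketch. This is not the code from sinx_pltfm.h: DemoPlatform, its members and CountOpeners are hypothetical names made up for illustration, and the memory (de)allocation functions are left out. It only shows how a parser-like template can take its character type, its syntactically essential characters and its buffer type from a single template parameter.

```cpp
#include <cstddef>
#include <string>

// Hypothetical platform class (not SINX0_Platform itself): it bundles
// everything a parser template needs to know about characters and input.
struct DemoPlatform
{
    typedef char TChar;                  // a character of the encoding used
    static const TChar kOpen  = '<';     // syntactically essential characters
    static const TChar kClose = '>';

    // Input buffer that holds the whole string in one piece.
    class Buffer
    {
    public:
        explicit Buffer(const std::basic_string<TChar>& text) : text_(text) {}
        std::size_t Size() const { return text_.size(); }
        TChar At(std::size_t i) const { return text_[i]; }
    private:
        std::basic_string<TChar> text_;
    };
};

// A parser-like template. It depends on the platform only through the
// names Platform::kOpen and Platform::Buffer, so another class with the
// same interface could be substituted without touching this code.
template <class Platform>
std::size_t CountOpeners(const typename Platform::Buffer& buffer)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < buffer.Size(); ++i)
        if (buffer.At(i) == Platform::kOpen)
            ++count;
    return count;
}
```

Calling CountOpeners<DemoPlatform> on a DemoPlatform::Buffer counts the opening brackets in it. A buffer that reads from external storage would only need to provide the same Size and At members.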
sinx_l0.h (depends on sinx_pltfm.h). Unfortunately I
failed to achieve everything that was planned. This slippery way
led me to the land of sophisticated templates where compilers from different
vendors began to disagree up to complete incompatibility. The goal
was reached though for the SINX level 0 parser, and file sinx_l0.h is responsible
for it. It contains the following classes:
SINX0_String<Platform> – binary verbatim substring of a SINX
string. In other words, a fragment of a buffer (Platform::Buffer)
treated as a string. It can be compared to other strings (including
C-style strings), and substrings can be extracted from it.
SINX0_SuxxElement<Platform>, SINX0_SuxxParser<Platform> – provide
serial extraction of elements from a SINX string (represented with
SINX0_String<Platform>) in the manner of a SAX-style parser. (The name of the classes
reveals my attitude to the methodology of SAX parsing, but, with all
its practical flaws, this method is convenient as a foundation.)
I guess there is no need to specify which of them implements the parser
and which implements the element extracted. An element can be one of
the following:
a data run,
a special symbol opening fragment (containing name and traits),
a special symbol closing fragment,
end of string marker.
SINX0_Symbol<Platform>, SINX0_SymbolIterator<Platform> – provide
iteration through a SINX string that conforms to level 0 more directly,
that is, in terms of single characters and specials. The first SINX0_Symbol<Platform>
can be obtained via assignment from SINX0_String<Platform> (it will
be a "virtual" special that has an empty name and the SINX string assigned
as its data); its components are then extracted using SINX0_SymbolIterator<Platform>.
The parser keeps to the following strategy of recovery from syntax errors
(see end of 2.2):
'>' character with no matching opening '<' is ignored,
opening fragments of specials that have no matching '>' are ignored.
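To make the element kinds and the recovery strategy more tangible, here is a rough sketch of a tokenizer. It is not the code from the library, and it rests on loud assumptions: a simplified grammar where a name ends at the first space or bracket, exactly one space separating the name from the data, no traits, no escapes and no verbatim data runs. All names (Element, Tokenize and so on) are hypothetical, and the result is collected into a vector instead of being extracted serially. Pairing brackets with a stack is just one possible reading of the two recovery rules.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum ElementKind { kDataRun, kSpecialOpen, kSpecialClose, kEndOfString };

struct Element
{
    ElementKind kind;
    std::string text;   // data for kDataRun, name for kSpecialOpen, else empty
};

// Assumption of this sketch: a name ends at the first space or bracket.
static bool EndsName(char c) { return c == ' ' || c == '<' || c == '>'; }

static void Emit(std::vector<Element>& out, ElementKind kind, const std::string& text)
{
    Element e = { kind, text };
    out.push_back(e);
}

std::vector<Element> Tokenize(const std::string& s)
{
    // Pass 1: pair brackets using a stack of '<' positions.
    std::vector<bool> paired(s.size(), false);
    std::vector<std::size_t> open;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        if (s[i] == '<')
            open.push_back(i);
        else if (s[i] == '>' && !open.empty())
        {
            paired[open.back()] = true;
            paired[i] = true;
            open.pop_back();
        }
    }

    // Pass 2: emit elements, silently skipping unpaired brackets.
    std::vector<Element> out;
    std::string run;                     // pending data run
    std::size_t i = 0;
    while (i < s.size())
    {
        const char c = s[i];
        if (c != '<' && c != '>') { run += c; ++i; continue; }
        if (c == '>' && !paired[i]) { ++i; continue; }   // stray '>' is ignored

        if (c == '<')
        {
            std::size_t j = i + 1;
            while (j < s.size() && !EndsName(s[j])) ++j; // name is s[i+1, j)
            if (paired[i])
            {
                if (!run.empty()) { Emit(out, kDataRun, run); run.clear(); }
                Emit(out, kSpecialOpen, s.substr(i + 1, j - i - 1));
            }
            // An unpaired opening fragment is skipped together with its name.
            i = (j < s.size() && s[j] == ' ') ? j + 1 : j; // drop one separator
            continue;
        }

        // A paired '>' closes the innermost open special.
        if (!run.empty()) { Emit(out, kDataRun, run); run.clear(); }
        Emit(out, kSpecialClose, "");
        ++i;
    }
    if (!run.empty()) Emit(out, kDataRun, run);
    Emit(out, kEndOfString, "");
    return out;
}
```

With this sketch, the trailing '>' in "<a <b blah>>x>" is dropped, and in "<a <b x>" the opening fragment of 'a' is dropped while 'b' is reported normally.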
All aforementioned classes are templates whose parameter must be a
class conforming to the platform concept, that is, a class that implements
the same interface as SINX0_Platform from sinx_pltfm.h (which will
likely be the one and only class you will ever use in this capacity).
Additionally, they are all made in such a way that they don't use
dynamic memory allocation, except for the case of assignment from SINX0_String
to SINX0_Symbol (one extra allocation), and rely on the buffer object
for any data storage. That is, the buffer is assumed to stay alive
for no less time than any SINX0_... classes using it (its interface
provides reference counting for this purpose). Take this into consideration
if you ever come to make your own implementation of a buffer.
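The lifetime rule above can be sketched as follows. This is not how the library does it: BufferSlice is a hypothetical name, and std::shared_ptr is swapped in for the reference counting that the real buffer interface provides on its own. The point is only that a substring can be a (buffer, offset, length) triple that copies no characters and keeps the buffer alive for as long as any slice refers to it.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical stand-in for "a fragment of a buffer treated as a string".
// It owns no characters of its own: it points into a shared buffer and
// keeps that buffer alive through reference counting.
class BufferSlice
{
public:
    explicit BufferSlice(const std::string& text)
        : buffer_(std::make_shared<const std::string>(text)),
          begin_(0), length_(text.size()) {}

    // Substring extraction copies no characters; out-of-range
    // arguments are clipped to the slice bounds.
    BufferSlice Sub(std::size_t pos, std::size_t len) const
    {
        BufferSlice r(*this);
        if (pos > length_) pos = length_;
        if (len > length_ - pos) len = length_ - pos;
        r.begin_ = begin_ + pos;
        r.length_ = len;
        return r;
    }

    std::size_t Length() const { return length_; }

    // Byte-by-byte comparison with a C-style string.
    bool Equals(const char* s) const
    {
        return std::strlen(s) == length_ &&
               std::memcmp(buffer_->data() + begin_, s, length_) == 0;
    }

    // Number of slices currently sharing the buffer (for illustration only).
    long Owners() const { return buffer_.use_count(); }

private:
    std::shared_ptr<const std::string> buffer_;
    std::size_t begin_;
    std::size_t length_;
};
```

In this sketch the only allocation happens when the first slice is created from a string; every Sub call afterwards just bumps the reference count.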
If you have such a wish, sinx_l0.h and its riches can be used on their
own; they are sufficient to handle SINX level 0. But the interface
and application of all this stuff are not too convenient (approximately
the same as XML parsing libraries have), so you are unlikely to deal with
SINX0_... directly.
sinx_l1.h (depends on sinx_l0.h). SINX level 1 parser.
I failed to make it customizable (for the reasons mentioned above, it
uses strictly SINX0_Platform and SINX0_... classes parametrized with
SINX0_Platform only) but it has a more decent interface. Besides SINX
level 1 related things, it includes several useful helpers, for convenience
and to show some advanced features feasible with SINX. The classes
of primary interest here are the following:
SINX1_Symbol – represents a SINX special symbol. It allows you to
extract its data as a string or as an integer (if it doesn't contain
specials not reducible to characters according to SINX level 1 rules),
obtain nested specials (individually by name or in an array), obtain
its whole data as a raw SINX string, etc. SINX1_Symbol uses std::string
and std::deque (and, internally, std::map) to return data and results,
but it does so lazily, so that no dynamic memory allocation
related operations are made until you explicitly request the data.
So SINX1_Symbol can be copied and returned with no fear of exceptions.
Syntax error handling follows the same strategy as described above
for SINX0_Symbol.
An indicative feature worth a special mention is extraction of a
special from a subtree by path specification (OpenPathL function).
What is it? It is easier to show by an example. In the SINX string:
<a:
  <b blah>
  <c:
    <e blah-blah>
    <d <.x=1>>
    <d <.x=2>>
    <d <.x=3>>
>>
<x>
'b' special has path of <a <b>> (or, which is the same, <a
<b><.index=0>> (index can only be specified by an integer number)),
'c' special has path of <a <c>>,
first 'd' special has path of <a <c <d>>> (or, which is the
same, <a <c <d><.index=0>>>)
second 'd' special has path of <a <c <d><.index=1>>>,
third 'd' special has path of <a <c <d><.index=2>>>,
<.x=2> special has path of <a <c <d <x>><.index=1>>>,
...
Well, you got the idea.
The SINX string to search in is the data of the subject SINX1_Symbol, and the path
is passed to the subject's OpenPathL (as a raw string); the response
returned is the special you looked for. If no special with
such "coordinates" is found, the result returned is an empty special
symbol whose IsEof () flag is set to true. This feature is
useful if you have some heavily compound SINX file and your program
is interested in only a small subtree whose location inside the structure
is specified via input data (for example, as a console command parameter).
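The lookup logic itself can be sketched without the parser. The code below is not OpenPathL: Node, Step, FindByPath and MakeExampleTree are hypothetical names, the tree is built by hand instead of being parsed, and the path is passed already decomposed into (name, index) steps rather than as a raw SINX string. The innermost specials are named "x" here, as in the path notation above, and a null pointer plays the role of the IsEof () result.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Node
{
    std::string name;
    std::string data;             // plain data, if any
    std::vector<Node> children;   // nested specials
};

// One step of a path: the name of a nested special and its index
// among same-named siblings (0 when the path gives no index).
struct Step { std::string name; std::size_t index; };

// Returns the special addressed by the path, or a null pointer if
// there is no special with such "coordinates".
const Node* FindByPath(const Node& root, const std::vector<Step>& path)
{
    const Node* current = &root;
    for (std::size_t s = 0; s < path.size(); ++s)
    {
        const Node* next = nullptr;
        std::size_t seen = 0;     // same-named children met so far
        for (std::size_t c = 0; c < current->children.size(); ++c)
        {
            const Node& child = current->children[c];
            if (child.name == path[s].name && seen++ == path[s].index)
            {
                next = &child;
                break;
            }
        }
        if (!next) return nullptr;
        current = next;
    }
    return current;
}

Node MakeNode(const std::string& name, const std::string& data = "")
{
    Node n;
    n.name = name;
    n.data = data;
    return n;
}

// Hand-built equivalent of the example string from the text.
Node MakeExampleTree()
{
    Node root = MakeNode("");     // "virtual" special holding the whole string
    Node a = MakeNode("a");
    a.children.push_back(MakeNode("b", "blah"));
    Node c = MakeNode("c");
    c.children.push_back(MakeNode("e", "blah-blah"));
    for (int i = 1; i <= 3; ++i)
    {
        Node d = MakeNode("d");
        d.children.push_back(MakeNode("x", std::to_string(i)));
        c.children.push_back(d);
    }
    a.children.push_back(c);
    root.children.push_back(a);
    root.children.push_back(MakeNode("x"));
    return root;
}
```

For the steps a, c, d (index 1), x the function returns the node whose data is "2"; asking for a fourth 'd' returns a null pointer.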
What is indicative here is that, as you can see, the path to the symbol is
also specified in SINX format (and is parsed by the very same parser,
as you may easily guess). Such an ad hoc additional wheel, with no
extra tools required. Compare the ease and minimal expense with which we got
it to XPath, a similar utility for XML. Yes, I know that XPath
has many more features, a handier syntax, etc., etc. But just one
fact: it has a different syntax incompatible with XML and requires
a dedicated parser. Think of it. Even if the "technology" had
exactly the same primitive capabilities as ours, it would still
have a different syntax incompatible with XML and require a dedicated
parser. Our ad hoc "SINXPath" required nothing extra. It could, but
why would it if we could make a nice use of improvised means?
What prevented the inventors of XPath from taking a similar way? In theory, nothing
did. In practice... I think their reasons are clear.
sinx_write.h (depends on sinx_l1.h). Implements data
output in SINX level 1 (including possibility of comments, automatic
escaping of characters via <#...> specials, following conventions
about traits where possible and automatic formatting inside specials
that user specifies to be compound).
Idea of use:
The user must define his own output stream class, which must be derived
from the pure virtual class SINX_WriteDocumentBase and define several
virtual methods for writing bytes into the stream. Then, an instance
of this class represents an open stream and can be
written to via methods of the instance (or, more exactly, methods
of its earlier ancestor, SINX_WriteNode).
In order to output plain data (strings and specials with plain string
data), methods of the SINX_WriteNode class are sufficient. In order to
output a special of compound structure, you need to: a) open a new special
by creating a local variable instance of class SINX_WriteSpecialOpen
(you can also specify desired traits and formatting tips for the symbol
opened), b) write its component data (it is done in the same way,
as SINX_WriteSpecialOpen is derived from SINX_WriteNode too), c)
close the special (it is done automatically as the SINX_WriteSpecialOpen
instance goes out of scope).
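The open/write/close scheme from steps a) to c) is the usual scope guard idiom, and a bare-bones sketch of it might look like this. It is not the interface of sinx_write.h: ScopedSpecial and WriteDemo are hypothetical names, std::ostream is swapped in for a class derived from SINX_WriteDocumentBase, and escaping, traits and formatting are left out entirely.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical scope guard: it opens a special when constructed and
// closes it when destroyed, so the closing '>' cannot be forgotten.
class ScopedSpecial
{
public:
    ScopedSpecial(std::ostream& out, const std::string& name)
        : out_(out)
    {
        out_ << '<' << name << ' ';
    }

    ~ScopedSpecial() { out_ << '>'; }

    // Non-copyable: one guard must produce exactly one '>'.
    ScopedSpecial(const ScopedSpecial&) = delete;
    ScopedSpecial& operator=(const ScopedSpecial&) = delete;

    // Writes a nested special with plain string data (no escaping here).
    void WriteSpecial(const std::string& name, const std::string& data)
    {
        out_ << '<' << name << ' ' << data << '>';
    }

private:
    std::ostream& out_;
};

std::string WriteDemo()
{
    std::ostringstream out;
    {
        ScopedSpecial a(out, "a");            // writes "<a "
        a.WriteSpecial("b", "blah");          // writes "<b blah>"
        {
            ScopedSpecial c(out, "c");        // writes "<c "
            c.WriteSpecial("e", "blah-blah"); // writes "<e blah-blah>"
        }                                     // c leaves scope: writes ">"
    }                                         // a leaves scope: writes ">"
    return out.str();
}
```

The nesting of C++ scopes mirrors the nesting of specials in the output, which is the main attraction of doing it this way.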
sinx.h, sinx.cpp (depends on sinx_l1.h and sinx_write.h)
– wrapper for classes and constants of sinx_l1.h and sinx_write.h
into unified namespace SINX, definition of some helpers and implementation
of wrapper classes for the most natural case of reading/writing from/to
a file (classes SINX_ReadFileOpen/SINX::ReadFileOpen and SINX_WriteFileOpen/SINX::WriteFileOpen).
Usage of the helpers is quite obvious from their names and parameters.
The only not so obvious ones are the SINX::ReadSpecialEnumL<Enum>
and SINX::WriteSpecialEnumL<Enum> methods intended for (de)serializing
an enum type value (or another integral value with a predefined set
of possible values) from/into a special. They are templates whose
parameter must be exactly the enum type you want to (de)serialize.
You also have to provide an instance of the special class SINX::EnumHelper<Enum>
(also a template that must be parametrized with the same enum type)
that describes the bijection between values of the set and their string
representation. See the example program for construction of such
an object.
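What such a bijection object boils down to can be sketched in a few lines. This is not SINX::EnumHelper (see test.cpp for the real thing): EnumNames, Color and MakeColorNames are hypothetical names, and the sketch simply keeps two std::map instances, one for each direction of the mapping.

```cpp
#include <map>
#include <string>

// Hypothetical helper that stores a two-way mapping between the values
// of an enum type and their string representations.
template <class Enum>
class EnumNames
{
public:
    // Registers one value/name pair; returns *this so calls can be chained.
    EnumNames& Add(Enum value, const std::string& name)
    {
        toName_[value] = name;
        toValue_[name] = value;
        return *this;
    }

    // Both lookups return false when there is nothing to map to,
    // leaving the output argument untouched.
    bool ToName(Enum value, std::string& name) const
    {
        typename std::map<Enum, std::string>::const_iterator it = toName_.find(value);
        if (it == toName_.end()) return false;
        name = it->second;
        return true;
    }

    bool ToValue(const std::string& name, Enum& value) const
    {
        typename std::map<std::string, Enum>::const_iterator it = toValue_.find(name);
        if (it == toValue_.end()) return false;
        value = it->second;
        return true;
    }

private:
    std::map<Enum, std::string> toName_;
    std::map<std::string, Enum> toValue_;
};

enum Color { kRed, kGreen, kBlue };

EnumNames<Color> MakeColorNames()
{
    EnumNames<Color> names;
    names.Add(kRed, "red").Add(kGreen, "green").Add(kBlue, "blue");
    return names;
}
```

A reader would call ToValue on the string data of a special and a writer would call ToName before output; an unknown string such as "purple" is reported as a failure instead of being silently mapped to some value.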
test.cpp – the actual example. It is a console program. Launched
with no parameters, it generates file example_gen.sinx (it is included
in the package as well but you can remove it, it will be rewritten),
then reads some data from it. The file is relatively large and structurally
complex (an example of a simple and primitive file would be of less
interest, wouldn't it?) so don't be put off by the amount of
code in the writing sample: there is much of it because it writes
much.
You can also run the program with one parameter, the name of a SINX file
(see other files with .sinx extensions in the package); it will then
dump the file on a per-element basis.
That's essentially all. Comprehend and use.
5. User agreement
Our harsh copyracist time requires dropping in a couple of words about
this subject as well. Here is the couple of words:
===
User agreement.
Code included in sinx.zip mentioned in this document can be used and
modified with no fees and for any purpose, with only the following
restriction applied: no copyright or license marks must be added
to it. The code is provided as is with no warranty of anything.
Chapter 2 of this document, with exclusion of subchapter 2.4, is description
of SINX format. The description can be used with no fees and for
any purpose, provided that its text is used with no modifications.
SINX format can be used freely with no fees and for any purpose, provided
that following requirements are met:
1. SINX format is considered public domain and must not be a subject
of any patent under any circumstances.
2. A new technology or data format that involves use of SINX format
in any capacity can only be subject of any patent if the patent only
covers that particular technology and format and does not cover any
objects besides ones explicitly named in the patent. Usage of SINX
format in any patent that covers a generic set of technologies and/or
formats is not permitted.
3. Integration of SINX format into existing and patented technology
or format (for example, as an alternative low level format) is allowed
only under condition that the new element will not be covered by
the original patent. If it is not possible to do so without modifying
original patent, new patent can be issued for modified format or
technology, provided that it satisfies requirements of 2.
By using SINX format and other materials from this web page you thereby
agree to terms of this user agreement.
===
The couple of words is not yet legally rigorous enough, yet it outlines
the main ideas. This section may be subject to extension and more detailed
specification in future as needed.
Why is this chapter needed at a homebrew and unknown format home page,
you may ask? Isn't it ridiculous in the light of homebrewness and
unknownity?
As a homebrew and unknown format, SINX has one more non-obvious advantage:
it is patent free. It can be used for any purpose with no fear of
situations like this. The need for this chapter is of course questionable;
common sense tells us that SINX is unlikely to gain wide fame and
supersede unholy XML in the immediate future, or that anyone will ever
care about it. But, being a paranoid person, I can't ignore any possible
scenario. It's better to write some ridiculous letters now than
to find out some day that it's too late.
6. Example of software
As a proof that I indeed used SINX format for my own needs, here is
an example of an application:
the game Operation I.T.C.H. (5 mb,
.zip) (works on M$ Window$, 2000 and above in theory, checked on
XP, 7 and, according to witnesses, in Wine). Finding the places where
SINX is used is left to your curiosity. (Tip: the application is in Russian
but includes an English locale; to enable it, change <.lang=ru>
to <.lang=en> in locale.cfg.)
7. Extras
My continued use of SINX format keeps producing various handy
extras. I will post them here.
sinxxml.zip (187 kb) – SINX<->XML conversion tool, a console utility.
Source code included, run the program for instructions. RapidXML is
used for XML handling, therefore there are a couple of minor defects.
Conversion from XML generally loses indentation, and conversion back
to XML yields XML that is not 100% valid: you have to add <?xml version="1.0"
encoding="utf-8"?> manually (I told you XML handling libraries
are flawed!), and it may fail if your SINX consists of several entities
at root level, as strict XML only allows a single root node (I told
you XML format is flawed!:).
sinx_cs.zip (4 kb) – a minimalistic SINX helper library for C# 3+,
usable on Window$ Phone (7) SDK (which was actually the primary target).
C# documentation comments used, so you will get it quickly. Class
Test in the end of file is a sort of example of use (strip it before
using for your own needs). In "public SinxSpecial this[string pName,int
i]", you can change to "int i=0" in C# 4+ so that you could use less
verbose xxx["special_name"] instead of xxx["special_name",0].
sinx_php.zip (7 kb) – a SINX helper library for PHP 4/5. No examples,
but all functions and classes are heavily documented (even a brief
description of the format itself is included) so you won't get lost.
(C) 2011 Mikle[Subterranean Devil]
If you quote text from this page you must also provide link to it