Mark Cave-Ayland wrote:
Thanks for the patch! With the patch applied, "see" no longer crashes on those particular routines. Interestingly enough, digging further into the BIOS I still see some discrepancies between the source code and the detokenized version:
0 > see find-device
: find-device 2dup " .." strcmp 0= if 2drop active-package dup if >dn.parent @ dup 0= if (lit) throw active-package! exit 0 -rot path-resolution 0= if false exit active-package swap true path-res-cleanup active-package!
In this case, (lit) should be "..". And also:
-22 even.
And all the "then"s are missing. That's a bit more complicated to implement though.
0 > see (find-dev)
: (find-dev) active-package -rot (lit) catch if 3drop false exit active-package swap active-package! true ; ok
And here (lit) should be "[']".
Hm..
Am I right in thinking that it should be possible to reconstruct any source exactly (minus formatting) from a tokenized input?
Not exactly.
Forgive me for some nit-picking: the forth dictionary is not "tokenized" forth code, like the stuff toke produces. A tokenizer just produces a binary representation of the source code ("FCode"), similar to what some BASIC dialects did in ancient times to reduce file size. It's still "source code", and in order to execute it, it still needs to be "compiled", just like forth source code.
Now, if source code (or FCode for that matter) is compiled, the forth engine can keep things simple.
Example:
: some-new-word ( -- xt-of-find-device ) ['] find-device ;
['] and ' will put the execution token (xt) of a word on the stack. That execution token could then be executed with "execute". It's like a function pointer in C.
['] is executed as an immediate word, which means it will not start looking for "find-device" when some-new-word is executed, but rather when it is "compiled into the dictionary" (aka when it is defined).
So at the time some-new-word is executed, all it really does is put a cell-sized integer on the stack. Just as if you had typed -22.
The primitive word to achieve this is (lit). When a (lit) is executed, it reads the cell after the execution token of (lit) and pushes it onto the stack. It no longer has any knowledge of what that number originally stood for.
So when a number is compiled into the dictionary, it looks like this:
| xt-of-(lit) | number | next-word's-xt | ...
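To illustrate the layout above, here is a rough toy model in Python. It is only a sketch, not the OpenBIOS kernel: the cell list, the `dictionary` contents, the xt value 0x1234 and the helper names `compile_words` and `run` are all made up for this example. It shows that ['] does its lookup at compile time, and that a number and an xt end up as the same kind of cell behind (lit).

```python
# Toy model of compiling and running a word body as a flat list of cells.
# Everything here (names, xt values) is hypothetical, for illustration only.

dictionary = {"find-device": 0x1234}  # word name -> made-up execution token


def compile_words(source):
    """Compile a whitespace-separated definition body into a list of cells."""
    words = source.split()
    body = []
    i = 0
    while i < len(words):
        w = words[i]
        i += 1
        if w == "[']":
            # Immediate word: the lookup happens now, at compile time,
            # and only the resulting number is stored in the body.
            name = words[i]
            i += 1
            body += ["(lit)", dictionary[name]]
        elif w.lstrip("-").isdigit():
            # A plain number is compiled the same way.
            body += ["(lit)", int(w)]
        else:
            body.append(w)
    return body


def run(body):
    """Execute a compiled body and return the data stack."""
    stack = []
    ip = 0  # index of the next cell to execute
    while ip < len(body):
        cell = body[ip]
        ip += 1
        if cell == "(lit)":
            # (lit) pushes the cell that follows it, then skips over it.
            stack.append(body[ip])
            ip += 1
        else:
            raise ValueError(f"word not modelled in this sketch: {cell!r}")
    return stack


print(compile_words("['] find-device"))  # ['(lit)', 4660]
print(compile_words("-22"))              # ['(lit)', -22]
print(run(compile_words("['] find-device")))  # [4660]
```

Once compiled, nothing in the body records whether 4660 came from ['] or was typed as a number, which is why a decompiler can only print (lit) there.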
Formerly, when a string was put on the stack with " it looked like this:
| xt-of-(lit) | pointer-to-string | xt-of-(lit) | length-of-string | xt-of-dobranch | offset-behind-string | cell-aligned-string | ...
So it's not easy to recognize from two (lit) and a dobranch that the above is a string. Which is why at some point we started hiding that magic behind another word called ("), which basically puts two numbers on the stack, but is only used for string handling. So we can recognize strings in see. For some odd reason s" was using (") but " was not.
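A rough sketch of why the dedicated word helps the decompiler, again as a toy model in Python rather than actual OpenBIOS code. The cell lists, the address 0x2000 and the function name `see` are hypothetical, and the assumption that (") is followed directly by the string is made only for this sketch; the real layout may differ.

```python
# Toy decompiler over a flat list of cells. All names and layouts are
# illustrative assumptions, not the real OpenBIOS dictionary format.

def see(body):
    """Return a list of printable tokens for a compiled body."""
    out = []
    ip = 0
    while ip < len(body):
        cell = body[ip]
        ip += 1
        if cell == "(lit)":
            # Only a bare cell follows: number, address and xt look alike.
            out.append(f"(lit) {body[ip]}")
            ip += 1
        elif cell == "dobranch":
            # A branch offset follows; the decompiler cannot tell that the
            # branch merely jumps over inline string data.
            out.append(f"dobranch {body[ip]}")
            ip += 1
        elif cell == '(")':
            # Assumed for this sketch: the string follows (") directly,
            # so it can be printed back the way it was written.
            out.append(f'" {body[ip]}"')
            ip += 1
        elif isinstance(cell, str):
            out.append(cell)
        else:
            out.append("<data>")
    return out


# Old style: pointer, length, and a branch over the inline string bytes.
old_style = ["(lit)", 0x2000, "(lit)", 2, "dobranch", 1, b"..", "strcmp"]
# New style: one dedicated word that is only ever used for strings.
new_style = ['(")', "..", "strcmp"]

print(see(old_style))  # ['(lit) 8192', '(lit) 2', 'dobranch 1', '<data>', 'strcmp']
print(see(new_style))  # ['" .."', 'strcmp']
```

The same trick (one dedicated runtime word per source construct) is what would be needed for other cases where see currently prints a bare (lit).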
We can do this kind of thing for other words, too, in order to improve the reversibility of forth words. Suggestions and patches are most welcome!
Stefan