Jump to content

User:Marius851000

From Wikifunctions

Creating a Rust interpreter, take 2

I’ve been trying to create a Rust interpreter for Wikifunctions. I already have the beginning of a first one, but I decided to start again anew to better implement handling of types.

I’ve decided to use a lazy evaluation strategy, which means type checking only need to happen when constructing new values (such as when running a function call). Even thought I eventioned the idea of using strict compile-time type checking, this appear to be impossible in the strict sense of the term, as there are function that return objects, which require a certain amount of run-time type checking. I’ll so use run-time type checking everywhere.

It works by having an enum that can contain ZObjects. They might low level representation of native type (such as a String or a Boolean), other kind of special value (such as Reference or Function Call), a non-native type or just arbitrary unparsed data (still expected to hold a Z1K1 for when it’ll be parsed).

Those different kind of ZObject are their own struct. I’ve opted for using an enum that list all possible possibility instead of dynamic reference for a few reason, including reducing the amount of reference (each requiring an allocation) and also because the dynamic reference feature of Rust is quite restrictive (All those entry still implement a trait, but its role is much less important).

That enum expose a few function. Among them a bunch of "parse_xxx(self) -> Result<xxx, EvalError)>", such as parse_boolean(self) -> Result<WfBoolean, EvalError>. It then first check if the result is already a boolean, and if not, convert it into a BTreeMap<Zid, WfData> that will then be provided to the WfBoolean to parse. That function that convert this WfData into its inner BTreeMap<Zid, WfData>, called into_map, will take ownership of the data, and will then call the into_map function of the inner data. For untyped data, this is simply returning the inner storage, but data like WfBoolean would have to construct a new map containing what is expected from a WfBoolean.

It’ll also run function calls. follow references and parse function parameter (hence why those function take an ExecutionContext as parameter, that contain both the function parameters, eventual function override to force a specific implementation for a function, and a reference to the struct that allow to access persistant object).

This result in into_map never returning anything else than other WfData. So it’ll never return references or function call. I might eventually add an exception in the future, but for now, I have a .get_reference() structure, that allow my WfDataType (via WfData) to return the Zid this data reference. It might lead to some oddities, such as all booleans having a reference even if it is not initially a reference, acting as an involuntary form of canonization (hopefully not being source of problem, but it might potentially lead to a few different behavior.

This get_reference solve the problem for boolean, but strings still cause a problem. String object contain a String, which makes it a recursive definition. I’ll have to be careful around that.

Speaking of different behavior, I also intend for reify to be fully (and infinitely) recursive, unlike the current implementation on Wikifunctions. Might cause some trouble with Strings. I’ll have to think about it. And also if I want to return a reified data as json or something (I suspect I will, in that case, return unevaluated reified data as a function call to reify).


It might be noted I am still in the initial phase. I have the WfData and a few WfDataType object (most notably, the boolean), but not much. Among other, I haven’t implemented parsing of the dump yet (unlike my previous attempt when this was the first thing I did).

I’ll probably convert them into a bunch of WfUntypedData, WfTypedList (yet to create), and, the most complicated, strings, which might either be reference or String depending on context. I’m still a bit uncertain of how I’ll do that, but I’ll likely put it into a WfStringOrReference that will return a string if a String is expected, or a Reference if it want anything else. Which might cause problem for reference to string. Luckily, that appears pretty uncommon. Oh! Another thing I noticed, but it appears that persistent object reference can’t be create, so that’ll reduce the case of infinite recursion (thought my parser will still evaluate those).


Mhhh... Error handling... It would be nice to trace errors... I have had this idea:

function that consumer self (or parse owned data) have such signature as -> Err(_, (EvalError, Self)) (self or WfData depending on the context). functions inside then map the received error with .map_err(|(e, object)| (e.inside(zid!(something)), (... reconstruct the data ...)) or something like that. (hmm... Do I put the object into EvalError? I don’t know yet. I’ll do some test).

That way, once the error has fully unwrapped, the full path to the error will be known, and all the function call on the way.

After some implementation, that appears to be (as expected) super verbose. New idea: just store the execution path that lead to there. If needed, we might just replay up to this point... That would require support for partial evaluation.

The execution context store, among a few runtime informtion, function override, which allow to force the use of a specific implementation for a function. This is meant to be used for test cases, thought it could override multiple function at once.

It might make sense to disable the ability to make use of this feature. In such case, constant propagation might be possible. A possible type would be a state known to always be erroneous. It would store the error until some point of divergence is reached (such as try/catch).

Owned data or Rc?

I wonder if putting (almost) everything behind an Rc would be appropriate.

For now, most of my function take ownership of the WfData, and return another one (whether there is an error or not, it still return one). This works fine, until you consider that this means every data passed as input much be owned, which should be quite performance hit for Reference, as it need to clone the whole referenced data. Here is a table measuring some benefit and disadvantage of each:

Owned Rc (and Arc)
performance hit inside function that consume it Presumably none? Maybe none if done well enought?

Example idea of api:

a. parse(WfData, context) -> Result<Self, _> // hmm... That seems to share some similarity with point 3 of the map reflection.

b. parse(&WfData, context) -> Result<Self, _> // that might avoid a Rc.clone(), but might create an extra dereference if WfData is already an Rc or a Copy type. Which it always is.

c. parse(Rc<BtreeMap<_>>, context) -> Result<Self, _> // no. Bad original idea. I’ll go with WfData having the .get and .keys function

struct will store raw data as owned, like

SomeStruct {

inner_data: WfData,

inner_data_2: WfData

}.


Not using reference cause lifetime.

SomeStruct will itself be behind a Rc, meaning it isn’t mutable, and so can’t take the inner data out of it (thought it can be cloned).

meaning that passing a reference will effectivelly avoid a clone.

Clone performance when accessed via Reference from a cached value Need to clone everything, including BTreeMaps and String Very cheap, more than 4 times cheaper than cloning a String (8 for simple Rc). Probably negligeable compared to looking up the map.
performance of analysing something as something else (such as interpreting a WfUntyped as a WfBoolean) Extract the BTreeMap from WfUntyped and parse it with WfBoolean. Naively, would need to clone the map inside the WfUntyped, or create one like for the others.

Altenative:

  1. into_map return a reference to the map
  2. no map are passed. WfDataType generate them on the fly with a ".get(Zid) -> Option<WfData>" (or Option<&WfData>) and a ".keys(Zid) -> Iter<Zid>" (or just a Vec)

I decided to go with the Rc approach.

Argument reference

Previous notes about how argument reference couldn’t be easily implement were incorrect. They are indeed not easy, but not as much as originally thought due to misunderstanding.

Reflexion:

(FC if for function call)

We have a function F(x): if x == 0 then 0 else x^2 + FC(F, x-1)

We want to call it with a FC(F, 2)?

We know that x, an argument of F, should be replaced with 2

if 2 == 0 then 0 else 2^2 + FC(F, 2-1) -> 4 + FC(F, 1)

notice how I didn’t had to evalute the FC for now. But that is due to its nice intermediate representation, I think.


Note that Function Call always stay function call. They aren’t substituted with its implementation. It’s only substituted with its result once fully computed, which is only feasable during evaluation.

Solution: Add a function such as:

<WfDataType>.resolve_arguments_reference(args: BTreeMap<KeyIndex, WfData>) -> WfData

It replace the arguments reference recursively, but does not resolve function call (instead running on its argument. And the function to call. Hmm... I think I made the error of assuming the function is always resolvable during parsing. Is that wrong? Maybe. I don’t know. I’ll need to check). Care should be taken to not recurse substitution into Quote (Z99) and Implementation (Z14).

Should I follow references? No. We will assume all the function reference is inside the same object, for performance reason, and cause I saw no case that is like that.

Should I evaluate function call? neither should I. If I evaluate before the substitution, it will mix the different function arguments reference. If I do it afterward, I may evaluate data that might not be relevant to the final result (such as the else branch if an If that has true)

This will still perform substitution for unused data.

Is there another approach? Such as doing is as part of the evaluation? TODO: study that

Older notes


I also have some focus on speed. Its not my main priority, but I still expect my implementation to be decently fast. Thought it still won’t prevent stack overflow, given that running a function propably will add two or three stack. Maybe I’ll rework that later, but it should proably still be fine for things that run less than 1000 recursive call deep. Could probably climb to 10k or maybe even 100k if I adjust the stack size manually. A cache might also be usefull. Oh! And the heap memory use shouldn’t have much an impact, given how everything is consumed as soon as needed, which is probably half the requirement for tail call optimisation. But again, that’s not a priority.

Some interesting observation: String and Boolean are quite unique type as they are self-referencing. That’s not much of a challenge for boolean, I’ve opted to recurse into the identity until I get a Z41/Z42 (I don’t know if a boolean with anything else is valid. So I presumed it is and follow the reference).

It might be usefull to comment that the function that convert data to ZObject only transform the first layer, so I can transform the identity type into a copy of itself, and it will dereference infinitly just fine.

For String... I’ll probably need some special handling outside of just the string type. It’s more complicated...