Better search for fmax goals #218

JulianKemmerer · 2021-11-20T19:45:33Z

JulianKemmerer
Nov 20, 2021
Maintainer

Relating to, but not the same as Faster timing estimates #46

After having some mechanism to get timing estimates^ - aka able to do circuit input -> timing info out - how does PipelineC then use that info to search for a particular fmax goal in the fastest way possible?

JulianKemmerer · 2021-11-20T19:48:04Z

JulianKemmerer
Nov 20, 2021
Maintainer Author

Both @bartokon and @suarezvictor have said things along the lines of do a binary search.

bartokon has some experimental plots of the #stages vs fmax curves we would be "searching-along" and its not clear a binary search is going to work?

0 replies

JulianKemmerer · 2021-11-20T19:51:16Z

JulianKemmerer
Nov 20, 2021
Maintainer Author

It is important to note the search is not always get to the absolute fmax possible. We should have a mode for that. But also more typical get to this specific fmax goal searching towards a specific fmax (at lowest latency possible implied) is also needed. You can even just set the specific fmax goal to your device spec'd fmax by the manufacturer to approximate 'as fast as possible'. Ill comment on what the basic search tool currently does soon.

0 replies

bartokon · 2021-11-20T20:11:31Z

bartokon
Nov 20, 2021

Decision trees can get stuck in local minimum. We could make some border cases for search and try to offset the problem.

0 replies

bartokon · 2021-11-20T20:18:09Z

bartokon
Nov 20, 2021

For now, I think, we should analyze pyrtl feedback with critical times and group wires to distinct speed categories (time delay categories) and put registers there for example: https://machinelearningmastery.com/clustering-algorithms-with-python/
K-Meas? Two categories, fast and slow? Some other methods to deduct what wires need to be sliced to achieve better fmax?

0 replies

JulianKemmerer · 2021-11-20T20:21:43Z

JulianKemmerer
Nov 20, 2021
Maintainer Author

I dont think I understand analyze pyrtl feedback with critical times and group wires to distinct speed categories and put registers there

I think that assumes way too much of how well names through GHDL->Yosys->BLIF-pyrtl are preserved at the moment. But maybe not / could imagine that not being an issue...

Ill explain how the tool currently places registers...

0 replies

JulianKemmerer · 2021-11-20T20:29:18Z

JulianKemmerer
Nov 20, 2021
Maintainer Author

Say you have a my_pipeline that is reported to have a total delay of 100ns, thats comb. logic without pipeline registers, etc

Say your device fmax is 500MHz(2ns period). And you want to reach that absolute fmax.
Great says the tool. To meet an fmax of 2ns period, we need to divide our 100ns of delay 100ns/2ns = 50 equal stages.

Then it goes at the task of divide my_pipeline into 50 stages. How does it do that? It looks at the delay of my_pipeline submodules/function calls. Using those delays, and the dataflow of the C code, it makes a directed acyclic graph (DAG). Given the 'layout' of the DAG it becomes apparent that something like 'cut my_pipeline here' equates to 'cut specifically that submodule there'. That process repeats for all submodules, all the way down the call stack until 'cut some_pipeline here' is just rendering a chunk of VHDL for a small some_pipeline operation (think add/sub etc)

0 replies

JulianKemmerer · 2021-11-20T20:34:18Z

JulianKemmerer
Nov 20, 2021
Maintainer Author

Then it gets feedback from the syn+pnr tool saying. Your slicing of 50 stages actually got to ~ 350MHz. What do we do next? It roughly (some hand crafted stuff here) says: If 50 stages actually works out to 350MHz(2.85 ns period) that would seem the actual total comb. delay in this circuit is ~50 stages * 2.85ns = 142ns. Repeat the above steps as if the circuit had 142ns of total delay from the start as better estimate.

0 replies

JulianKemmerer · 2021-11-20T20:37:04Z

JulianKemmerer
Nov 20, 2021
Maintainer Author

And its that kind of iterative looping - finding out you were off, and trying to re-adjust that exists.

Important: this method, and anything to immediately replace it works just off the single number what is the fmax of the circuit (i.e. max path delay in ns) from the syn+pnr tool. That makes it super easy to add support for other tools / timing models.

More detailed feedback of 'the critical path was in this specific submodule, stage 3' is possible but not for all tools/flows easily.

0 replies

bartokon · 2021-11-20T21:08:10Z

bartokon
Nov 20, 2021

Aaaahhh, okay now I know how it works. I thought that you analyze on netlist level, and if you are pipelineing a netlist it could be directly inserted into fpga and resynth (pipeline not C model but netlist). So if some net connection in other words path is loooong. You could slice it in half, put register in the middle, that way we could aim at critical paths + if netlist is supported all other tools and even encrypted ip are supported IF you have a netlist.

So my proposal is too low level for now, I think. So the best we can do is detect the worst module and slice it N times. We could slice each module separately with multiprocessing and detect optimal slice level with for example binary search (or as I would like to have some fun with some ML models and predict as much slicing parameters as I posibly can :D).

0 replies

JulianKemmerer · 2021-11-21T03:37:16Z

JulianKemmerer
Nov 21, 2021
Maintainer Author

Yeah thats good thinking - perhaps what actually occurs for larger non---coarse designs. In that mode, ex. for our 2ns period target the tool does a slighty different sweep: it asks what are the smallest submodules are larger than 2ns in delay? It might say "your 32b multiplier is smallest submodule larger than 2ns in comb delay". Then instead of saying 'slice my_pipeline' it says, slice all 32b multipliers like so to meet fmax. And then given that slicing of smaller submodules rebuild the entire top level my_pipeline out of those pieces that supposedly meet timing.

Regardless the 'slice a module N times' is roughly the same process whether its to the top level or not (its that 'down the call stack' slicing until its raw VHDL you are manipulating)

0 replies

JulianKemmerer · 2021-11-21T03:39:18Z

JulianKemmerer
Nov 21, 2021
Maintainer Author

Definitely down to have some fun ML experiments to see what better predictions can be done :D

0 replies

suarezvictor · 2021-11-21T23:40:43Z

suarezvictor
Nov 21, 2021

Better not to reinvent the wheel https://github.com/MattRighetti/leiserson-retiming/wiki

0 replies

bartokon · 2021-11-23T22:27:45Z

bartokon
Nov 23, 2021

Better not to reinvent the wheel https://github.com/MattRighetti/leiserson-retiming/wiki

Haha can't wait for new pipelinec v2 :D

0 replies

JulianKemmerer · 2021-11-24T00:15:43Z

JulianKemmerer
Nov 24, 2021
Maintainer Author

Oh gosh v2 ⏳

We model a circuit as a graph in which the vertex set V is a collection of combinational logic elements and the edge set E is the set of interconnections

this connects to #45 as well.

If I were modeling at the LUT level - then each LUT becomes a comb. element in the network that the re-timing tool is analyzing. Doesnt need to be LUT level - but just some fixed combinatorial primitives seem to be needed.

I dont have those really? My comb. primitives can be split / sliced. So I dont know what the smallest comb. chunks are for the tool to move around?

Maybe I need to ask - does that retiming tool have a way to say 'split this comb. element delay in half'?

At the moment sounds like version 2 for sure - I need to think about how the current way things works would map to some fixed comb. prim network you could throw into some other tool.

0 replies

bartokon · 2021-11-27T22:37:44Z

bartokon
Nov 27, 2021

artix_times.txt

nextpnr_times.txt

0 replies

JulianKemmerer · 2022-11-02T23:28:01Z

JulianKemmerer
Nov 2, 2022
Maintainer Author

Can try to follow up on this work here
https://github.com/JulianKemmerer/PipelineC/blob/master/src/DEVICE_MODELS.py

0 replies

suarezvictor · 2022-11-03T12:08:27Z

suarezvictor
Nov 3, 2022

Let me know if you agree with what I musing, in regards to estimate the larger delay of a path in netlist:
We'll have an estimation, then when you send the netlist to synthesis, you'll have the "real" value. Let's assume that both our model and synth produce a strictly monotonic (that mean "always increasing") delay result in function of the number of path length and stages. So our model may be different than reality, but since there would be a search of the optimal point (using strictly monotonic functions in both cases), we can stick to the "real results" for the search, and correct the model with the difference from reality. In that way, we'll using a differential function, that is, an estimation of "how much more or less delay we get if we increase or decrease, respectively, the netlist". It seems that would work, with the errors in the model just affecting how fast we can find the sweet point. A good property of this, if it holds true, is that the naive "sum all individual delays" on the path, will keep the function monotonic, as required.
UPDATE: note that to get the "differential" version of the estimator (i.e how a change produces a change), the absolute version can be called with the "old" and "new" values and them simply substracted

0 replies

JulianKemmerer · 2022-11-10T04:02:16Z

JulianKemmerer
Nov 10, 2022
Maintainer Author

Please see #46 for recent new caching stuff of period values for pipelined built in functions
ex.
pipeline_min_period_cache/vivado/xc7a100tcsg324-1/syn/BIN_OP_PLUS_float_8_14_t_float_8_14_t_0.25.delay

0 replies

suarezvictor · 2023-01-10T17:27:05Z

suarezvictor
Jan 10, 2023

the RWRoute paper has timing model equations used in RapidWright (see Page 13) http://www.rapidwright.io/docs/Papers.html

0 replies

JulianKemmerer · 2023-01-10T22:10:39Z

JulianKemmerer
Jan 10, 2023
Maintainer Author

Ooo nice - I think trying to use RapidWright ~instead of Vivado is a option that could be ~easily explored for getting faster timing modeling with Xilinx FPGAs

Using their equations or something like it to derive our own timing models for all/any fpga architecture also sounds possible - but is indeed a more difficult/experimental long term goal

0 replies

Better search for fmax goals #218

JulianKemmerer Nov 20, 2021 Maintainer

Replies: 20 comments

JulianKemmerer Nov 20, 2021 Maintainer Author

JulianKemmerer Nov 20, 2021 Maintainer Author

JulianKemmerer Nov 20, 2021 Maintainer Author

JulianKemmerer Nov 20, 2021 Maintainer Author

JulianKemmerer Nov 20, 2021 Maintainer Author

JulianKemmerer Nov 20, 2021 Maintainer Author

JulianKemmerer Nov 21, 2021 Maintainer Author

JulianKemmerer Nov 21, 2021 Maintainer Author

JulianKemmerer Nov 24, 2021 Maintainer Author

JulianKemmerer Nov 2, 2022 Maintainer Author

JulianKemmerer Nov 10, 2022 Maintainer Author

JulianKemmerer Jan 10, 2023 Maintainer Author

JulianKemmerer
Nov 20, 2021
Maintainer

JulianKemmerer
Nov 20, 2021
Maintainer Author

JulianKemmerer
Nov 20, 2021
Maintainer Author

JulianKemmerer
Nov 20, 2021
Maintainer Author

JulianKemmerer
Nov 20, 2021
Maintainer Author

JulianKemmerer
Nov 20, 2021
Maintainer Author

JulianKemmerer
Nov 20, 2021
Maintainer Author

JulianKemmerer
Nov 21, 2021
Maintainer Author

JulianKemmerer
Nov 21, 2021
Maintainer Author

JulianKemmerer
Nov 24, 2021
Maintainer Author

JulianKemmerer
Nov 2, 2022
Maintainer Author

JulianKemmerer
Nov 10, 2022
Maintainer Author

JulianKemmerer
Jan 10, 2023
Maintainer Author