GitHub repository: https://github.com/Humanity-Unleashed/pretraining/tree/main/benchmarking
Benchmarking runs on several models, both with and without context.
Model inference results are stored at /workspace/pretraining/benchmarks/ and can be read into a dictionary using humun_benchmark.metrics.read_results, which is then used to calculate metrics. For an example, see:
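The exact dictionary layout returned by read_results isn't documented here, so as an illustrative sketch only (the nested keys "forecast"/"actual" and the metric set below are assumptions, not the actual format), metric calculation over such a dictionary might look like:

```python
import math

# Hypothetical layout: {model_name: {"forecast": [...], "actual": [...]}}.
# The real dictionary from humun_benchmark.metrics.read_results may differ.
def compute_metrics(results):
    metrics = {}
    for model, run in results.items():
        errors = [f - a for f, a in zip(run["forecast"], run["actual"])]
        mae = sum(abs(e) for e in errors) / len(errors)
        rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
        metrics[model] = {"mae": mae, "rmse": rmse}
    return metrics

example = {"llama-3-8b": {"forecast": [1.0, 2.0, 3.0], "actual": [1.0, 2.5, 2.0]}}
print(compute_metrics(example))
```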
HuggingFace chat template - investigate whether this improves performance for instruct LLMs, since many have been fine-tuned with this template.
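A minimal sketch of what this could look like: build the prompt as the chat-message structure that HuggingFace tokenizers expect, then apply the model's template. The prompt wording here is a placeholder, not the benchmark's actual prompt.

```python
# Wrap a forecasting prompt in HuggingFace-style chat messages.
# The system/user text below is illustrative only.
def build_chat_messages(series_text, horizon):
    system = "You are a time-series forecasting assistant."
    user = (
        f"Here is a time series:\n{series_text}\n"
        f"Forecast the next {horizon} values, one per line."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_chat_messages("2024-01: 10\n2024-02: 12", horizon=3)

# With a transformers tokenizer, the model's own template is then applied via:
#   prompt = tokenizer.apply_chat_template(
#       messages, tokenize=False, add_generation_prompt=True
#   )
```

Comparing scores with and without `apply_chat_template` on the same instruct model would answer the question directly.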
More models - classical statistical models such as ARIMA and DHR-ARIMA; multimodal models such as UniTime.
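To show how a classical baseline could slot into the benchmark alongside LLM forecasters, here is a minimal AR(p) model fitted by least squares. This is a stand-in sketch, not a full ARIMA implementation (in practice something like statsmodels' ARIMA would be used); all names and defaults are illustrative.

```python
import numpy as np

# Minimal AR(p) baseline: y[t] = c + phi_1*y[t-1] + ... + phi_p*y[t-p],
# fitted with ordinary least squares, then iterated forward for the forecast.
def ar_forecast(series, p=2, horizon=3):
    y = np.asarray(series, dtype=float)
    n = len(y)
    # Design matrix: each row is [1, y[t-1], ..., y[t-p]] for t = p..n-1.
    X = np.column_stack(
        [np.ones(n - p)] + [y[p - k : n - k] for k in range(1, p + 1)]
    )
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    history = list(y)
    preds = []
    for _ in range(horizon):
        lags = [history[-k] for k in range(1, p + 1)]
        nxt = float(coef[0] + np.dot(coef[1:], lags))
        preds.append(nxt)
        history.append(nxt)
    return preds

print(ar_forecast([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], p=2, horizon=3))
```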
Uncertainty bounds in outputs - configure a probabilistic prompt so models return prediction intervals rather than only point forecasts.
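The exact prompt and output format are open design choices; one option, sketched below, is to ask the model to emit quantiles as JSON and parse intervals from the reply. The quantile levels (q10/q50/q90) and key names are illustrative, not a fixed spec.

```python
import json

# Illustrative instruction appended to the forecasting prompt.
PROMPT_SUFFIX = (
    "Return your forecast as JSON: a list of objects with keys "
    '"q10", "q50", and "q90" for the 10th, 50th, and 90th percentiles.'
)

def parse_intervals(reply):
    """Extract (lower, median, upper) tuples from a model reply
    that contains a JSON list, possibly surrounded by extra text."""
    start, end = reply.find("["), reply.rfind("]") + 1
    rows = json.loads(reply[start:end])
    return [(r["q10"], r["q50"], r["q90"]) for r in rows]

reply = 'Sure! [{"q10": 8.1, "q50": 10.0, "q90": 12.3}]'
print(parse_intervals(reply))  # [(8.1, 10.0, 12.3)]
```

Interval coverage (how often the truth falls inside the q10-q90 band) could then be reported alongside the point-forecast metrics.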