GitHub repository: https://github.com/Humanity-Unleashed/pretraining/tree/main/benchmarking
Benchmarking runs on several models, both with and without context.
Model inference results are stored at /workspace/pretraining/benchmarks/ and can be read into a dictionary using humun_benchmark.metrics.read_results, which is then used to calculate metrics. For an example, see:
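The exact dictionary layout returned by read_results isn't documented here, so as an illustrative sketch only (the nested keys "forecast"/"actual" and the metric set below are assumptions, not the actual format), metric calculation over such a dictionary might look like:

```python
import math

# Hypothetical layout: {model_name: {"forecast": [...], "actual": [...]}}.
# The real dictionary from humun_benchmark.metrics.read_results may differ.
def compute_metrics(results):
    metrics = {}
    for model, run in results.items():
        errors = [f - a for f, a in zip(run["forecast"], run["actual"])]
        mae = sum(abs(e) for e in errors) / len(errors)
        rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
        metrics[model] = {"mae": mae, "rmse": rmse}
    return metrics

example = {"llama-3-8b": {"forecast": [1.0, 2.0, 3.0], "actual": [1.0, 2.5, 2.0]}}
print(compute_metrics(example))
```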
HuggingFace chat template - investigate whether this improves performance for instruct LLMs, since many have been fine-tuned with this template.
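A minimal sketch of what this could look like: build the prompt as the chat-message structure that HuggingFace tokenizers expect, then apply the model's template. The prompt wording here is a placeholder, not the benchmark's actual prompt.

```python
# Wrap a forecasting prompt in HuggingFace-style chat messages.
# The system/user text below is illustrative only.
def build_chat_messages(series_text, horizon):
    system = "You are a time-series forecasting assistant."
    user = (
        f"Here is a time series:\n{series_text}\n"
        f"Forecast the next {horizon} values, one per line."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_chat_messages("2024-01: 10\n2024-02: 12", horizon=3)

# With a transformers tokenizer, the model's own template is then applied via:
#   prompt = tokenizer.apply_chat_template(
#       messages, tokenize=False, add_generation_prompt=True
#   )
```

Comparing scores with and without `apply_chat_template` on the same instruct model would answer the question directly.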
More models - classical statistical models such as ARIMA and DHR-ARIMA; multimodal models such as UniTime.
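To show how a classical baseline could slot into the benchmark alongside LLM forecasters, here is a minimal AR(p) model fitted by least squares. This is a stand-in sketch, not a full ARIMA implementation (in practice something like statsmodels' ARIMA would be used); all names and defaults are illustrative.

```python
import numpy as np

# Minimal AR(p) baseline: y[t] = c + phi_1*y[t-1] + ... + phi_p*y[t-p],
# fitted with ordinary least squares, then iterated forward for the forecast.
def ar_forecast(series, p=2, horizon=3):
    y = np.asarray(series, dtype=float)
    n = len(y)
    # Design matrix: each row is [1, y[t-1], ..., y[t-p]] for t = p..n-1.
    X = np.column_stack(
        [np.ones(n - p)] + [y[p - k : n - k] for k in range(1, p + 1)]
    )
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    history = list(y)
    preds = []
    for _ in range(horizon):
        lags = [history[-k] for k in range(1, p + 1)]
        nxt = float(coef[0] + np.dot(coef[1:], lags))
        preds.append(nxt)
        history.append(nxt)
    return preds

print(ar_forecast([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], p=2, horizon=3))
```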
Uncertainty bounds in outputs - configure a probabilistic prompt so models return prediction intervals rather than only point forecasts.
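The exact prompt and output format are open design choices; one option, sketched below, is to ask the model to emit quantiles as JSON and parse intervals from the reply. The quantile levels (q10/q50/q90) and key names are illustrative, not a fixed spec.

```python
import json

# Illustrative instruction appended to the forecasting prompt.
PROMPT_SUFFIX = (
    "Return your forecast as JSON: a list of objects with keys "
    '"q10", "q50", and "q90" for the 10th, 50th, and 90th percentiles.'
)

def parse_intervals(reply):
    """Extract (lower, median, upper) tuples from a model reply
    that contains a JSON list, possibly surrounded by extra text."""
    start, end = reply.find("["), reply.rfind("]") + 1
    rows = json.loads(reply[start:end])
    return [(r["q10"], r["q50"], r["q90"]) for r in rows]

reply = 'Sure! [{"q10": 8.1, "q50": 10.0, "q90": 12.3}]'
print(parse_intervals(reply))  # [(8.1, 10.0, 12.3)]
```

Interval coverage (how often the truth falls inside the q10-q90 band) could then be reported alongside the point-forecast metrics.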