The speedup would not be that high in practice for folks already using speculative sampling[1]. ANPD appears to be similar but uses a simpler, faster, and less accurate drafting approach. Since both techniques do the same job of drafting tokens for the main model to verify, the two enhancements can't be meaningfully stacked.
[1] https://github.com/ggerganov/llama.cpp/pull/2926
The HuggingFace transformers library already supports a similar method called prompt lookup decoding, which builds its n-gram model from the existing context: https://github.com/huggingface/transformers/issues/27722
I don't think it would be that hard to switch it out for a pretrained ngram model.
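For anyone unfamiliar with the idea, here is a rough sketch of the lookup step in plain Python. It is not the transformers implementation; the function name `prompt_lookup_draft` and the parameters `max_ngram` and `num_draft` are made up for illustration. It takes the trailing n-gram of the context, finds its most recent earlier occurrence, and proposes the tokens that followed it as the draft. The main model would then verify that draft as in ordinary speculative sampling.

```python
def prompt_lookup_draft(tokens, max_ngram=3, num_draft=5):
    """Propose draft tokens by matching the trailing n-gram earlier in the context.

    Hypothetical helper for illustration only, not a library API.
    """
    # Try the longest n-gram first, then fall back to shorter ones.
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan right to left so the most recent match wins; the starting
        # index excludes the trailing n-gram itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                continuation = tokens[start + n:start + n + num_draft]
                if continuation:
                    return continuation
    # No earlier occurrence found: no draft, decode normally.
    return []


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 1, 2, 3]
    print(prompt_lookup_draft(context, max_ngram=3, num_draft=2))
```

Swapping in a pretrained n-gram model would presumably mean replacing the scan over `tokens` with a lookup in a precomputed table, leaving the verification step unchanged.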