With the rise of open-source large language models (LLMs) such as Alpaca, Vicuna, and Falcon, we are witnessing boundary-pushing possibilities in this field. Certain models demonstrate superior overall performance on leaderboards such as AlpacaEval and Chatbot Arena. However, is it reasonable to rely on a single top-performing LLM for all user inputs? The answer may not be as straightforward as one might think.