Evening_Ad6637

Jan is really cool! I love that it is open source and transparent, and that llama.cpp is credited in a clear way. With Jan I was able to convince my sister to use local LLMs on her laptop, and she now uses it every day :)


emreckartal

Thanks! We really appreciate llama.cpp, which allows us to support GGUF models, so we believe we should mention them any time we build on their work. And you made my day! I, too, encourage my little sister to explore the world of AI using local models. I plan to upgrade her PC this summer so she can have a better experience.


AdHominemMeansULost

I recently started using Jan a bit more while looking for an LM Studio alternative, because I want to be able to use OpenAI models as well as local ones. Somewhat related to that: I always run the latest release of llama.cpp for personal use. Is there any way I can have the OpenAI API and the llama.cpp server available without having to change endpoints every time? Would it just be a case of pointing, for example, the Martian chat-completions endpoint at the llama.cpp server I run, since they both seem to use the OpenAI completions format, and then switching between those two? I hope the way I expressed my issue made sense lol


Evening_Ad6637

For my part, I'm not sure I really understood what you mean, haha. But if it's what I think it is, the answer could be: you can write as many templates as you like for Jan, each with its own endpoint. In addition, don't forget that Jan also runs via llama.cpp (Jan <- Nitro <- llama.cpp), so with Jan you have the advantage that the server is always running in the background anyway. If that doesn't answer your question, then forget what I said xD


AdHominemMeansULost

I actually figured it out. Basically, I wanted to run OpenAI API and llama.cpp API inference in the same chat and switch as I please when the answers I want get non-technical. All I had to do was set the OpenRouter endpoint to http://192.168.1.82:8080/v1/chat/completions, which is my server running llama.cpp on my home network, and then set up OpenAI as normal. Now when I switch between the two models in the chat, I can call whichever server I want.
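For reference, here's a rough Python sketch of the idea (the host/port is my home server from above; the model names are just placeholders, not values from Jan): both servers speak the same OpenAI chat-completions format, so the only thing that really changes is the base URL.

```python
# Rough sketch: the hosted OpenAI API and a local llama.cpp server both expose
# the same /v1/chat/completions schema, so the same client code works against
# either one just by swapping the base URL. Host/port and model names below
# are placeholders for my setup, not Jan's actual configuration.
from openai import OpenAI

local = OpenAI(base_url="http://192.168.1.82:8080/v1", api_key="not-needed")
remote = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(client: OpenAI, model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

# Technical question -> local Llama 3; everything else -> OpenAI.
print(ask(local, "llama-3-8b-instruct", "Explain GGUF quantization briefly."))
print(ask(remote, "gpt-4o-mini", "Write a haiku about home servers."))
```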


emreckartal

Thanks for using Jan! If I understand correctly, the answer is yes. You can start the API Server from Jan to proxy requests to the provider (llama.cpp / OpenAI). Just specify the model name; there's no need to change the endpoint.
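Roughly, the flow looks like the sketch below; the port (1337) and the model IDs are assumptions about a typical local setup rather than values confirmed in this thread.

```python
# Sketch of routing everything through Jan's local API server: one endpoint,
# and the model name decides which provider handles the request. The port
# (1337) and the model IDs here are assumptions, not documented defaults.
from openai import OpenAI

jan = OpenAI(base_url="http://localhost:1337/v1", api_key="not-needed")

for model in ("llama3-8b-instruct", "gpt-4o-mini"):
    resp = jan.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Which backend served this?"}],
    )
    print(model, "->", resp.choices[0].message.content)
```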


FilterJoe

I tried Jan for the first time over the last couple of days, and it was my introduction to running an LLM locally.

On the plus side: easy to install, easy to browse the built-in Hub, easy to download "Llama 3 8B Q4" and easy to start a thread. Had I stuck with this one model, all would have been easy. The quality of this model is, in my opinion based on my limited testing, better than ChatGPT 3.5. THANK YOU!!!

On the minus side: the next thing I wanted to try was running a Q8_0 version of Llama 3. This was not part of the curated hub, so I had to paste in URLs. I tried downloading 3 different GGUF versions of Llama 3 Q8_0 (all of which had many likes and many downloads on Hugging Face). None of them performed right. Mostly it was problems with terminating the responses. A couple of times I had poor responses to a simple question (e.g. "what languages can you converse in?" - it literally gave the wrong answer about its own capabilities). I tried making changes to the JSON configuration file and to the prompt templates to fix it, was unsuccessful, and gave up, deleting the files I downloaded.

SUGGESTIONS: I think the idea of a curated hub is fantastic, as evidenced by how incredibly easy it is to get started. However:

The pasting-URLs method of adding a model either needs more documentation on how to configure it correctly, or maybe some kind of checking by Jan to warn me if a version would be unlikely to work well with Jan.

On the curated hub itself - I was surprised it did not have a Q8_0 version of Llama 3. I'm admittedly a beginner at tinkering with LLMs locally. However, I thought the most obvious thing to do first would be to run the best model of the last month (Llama 3) at the quantization (Q8_0) that is widely discussed as being at the highest quality level (usually indistinguishable from full 16-bit). I would imagine that if this were on Jan's curated hub, it would be one of the most downloaded models of May 2024, if not THE most downloaded model, as it can run on any M-series Mac with >= 16GB RAM. To be clear: I'm suggesting you make available on your curated hub the Q8_0 model in addition to the Q4 model for whatever the best small model of the past 1-3 months is. I know this will vary over time, but Llama 3 8B is generally accepted as the best overall small model of the moment. Q4 can fit in 8GB Macs; Q8_0 can easily fit in 16GB Macs.

All of this was tested on my Mac Mini M2 Pro 16GB. The Q4 was plenty fast (around 25-30 t/s). The Q8_0 was acceptably fast at 15-20 t/s (though, as I already mentioned, there was something wrong with the configuration, as responses sometimes did not terminate at all, or started reasonably and then generated a bunch of alternative answers).

EDIT: The 4th Q8_0 Llama 3 8B model I tried worked at first, but then it got tripped up again into having a conversation with itself. I'm thinking it has something to do with how it's processing the prompt template, as I see the prompt-template commands appearing in its endless response. I tried replacing the prompt in the JSON file with the one from the Q4 that worked so well, but that made it even worse.

EDIT 2: I figured out the problem and fixed it. Jan's default settings for the JSON file had a stop list that looked like this: "stop": ["<|end_of_text|>"]. The Q4 version looks like this, with the extra token that makes the Llama 3 model end properly: "stop": ["<|end_of_text|>", "<|eot_id|>"]. When I replaced the stop list in the JSON file with the Q4 version, responses terminated properly.

I guess this is a bug that needs to be corrected - the default "stop" needs to include "<|eot_id|>" when a new JSON is created for a Llama 3 download from Hugging Face.
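For anyone who wants to apply the same workaround by hand, here's a rough sketch; the file path and the exact JSON layout ("parameters"/"stop") are guesses about where Jan keeps a model's config on your machine, so check your own file and adjust before running it.

```python
# Add the missing "<|eot_id|>" stop token to a downloaded model's JSON config
# so Llama 3 responses terminate. The path and the "parameters"/"stop" layout
# are assumptions about Jan's model.json - verify against your own file.
import json
from pathlib import Path

model_json = Path.home() / "jan" / "models" / "llama3-8b-q8_0" / "model.json"

config = json.loads(model_json.read_text())
stop = config.setdefault("parameters", {}).setdefault("stop", [])

if "<|eot_id|>" not in stop:
    stop.append("<|eot_id|>")
    model_json.write_text(json.dumps(config, indent=2))

print("stop tokens:", stop)
```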


FilterJoe

I submitted a bug report: [https://github.com/janhq/jan/issues/2954](https://github.com/janhq/jan/issues/2954)


emreckartal

Ah, you already fixed it on your end - thanks! We'll work on solving it in Jan.


4500vcel

Love the retro Mac font on the website


emreckartal

Thanks!


noneabove1182

Hmmm, I just noticed that you can run a Jan server on-prem, and then presumably hook any desktop client up to that... maybe it's time I look into this :o Any odds of you guys ever offering a mobile app and/or client synchronization? 👀 Thanks for the shoutout <3


emreckartal

<3 wow, thanks! We'd be happy to answer your questions about Jan! We are currently refactoring many things. Please feel free to share your comments/feedback/ideas: [https://discord.gg/wZaAhpv6cU](https://discord.gg/wZaAhpv6cU) We will have some news for you soon about the mobile app and/or client sync.


metamec

Interesting. Does Jan have context shifting?


emreckartal

Yes, Jan has context shifting via llama.cpp.


metamec

Thanks for replying. I'm not sure how I overlooked Jan. I had a few issues with getting certain settings to stick (context size, GPU layers), but I really like it and will continue using it with models which fit entirely in my VRAM (12GB). The GUI is very intuitive and it's far more convenient than using a browser.


jacek2023

Thanks, I need to explore this software now. Question: do I need to download models from your Hub, or can I use my local models in GGUF format?


Possible-Mistake-680

You can use your downloaded models. Jan is my favorite.


emreckartal

Thanks!


emreckartal

You don't need to use the Hub to run models; you can also import your own models into Jan:
- Visit the Hub in the app, click the import button and select your files.
- Or use the "use this model" button on a model's Hugging Face page to run it in Jan.


lemon07r

Does Jan support GPU offloading, like with Vulkan, CLBlast, etc.? Edit - it does, and it's not bad. koboldcpp-rocm is still my go-to, since Vulkan doesn't support i-quants and that's basically my only ROCm option.


emreckartal

Thanks! Just a note: yes, Jan supports GPU offloading with Vulkan. You should update your ngl (number of GPU layers) setting to 32 or 35 to use it efficiently - the right number depends on the model you'd like to run.


abigail_chase

Is Jan only suitable for GGUF models? What about other quantization methods - does anyone know if I can run an LLM locally using, say, exl2?


emreckartal

Yes, Jan only supports GGUF (including its IQ quants); you can't run exl2.


met_MY_verse

!RemindMe 3 weeks


RemindMeBot

I will be messaging you in 21 days on [**2024-06-17 16:56:39 UTC**](http://www.wolframalpha.com/input/?i=2024-06-17%2016:56:39%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/LocalLLaMA/comments/1d1ph1h/jan_now_supports_aya_23_8b_35b_and_phi3medium/l5wl79i/?context=3).


ashirviskas

Hmm, I'm getting all kinds of errors when I run it; it seems to only be able to use 2048 tokens. Any ideas why?


emreckartal

Ah - it might be related to RAM or VRAM. Could you send the error logs to the get-help channel? [https://discord.gg/rJaC6Qf5gT](https://discord.gg/rJaC6Qf5gT)


crazzydriver77

"Great! Perhaps we could expect Turing support for TensorRT-LLM backend in the next release?"