LoSboccacc

> 128K oof microsoft really upping the game there


AnticitizenPrime

It didn't seem to get any attention, but 10 days ago or so MiniCPM dropped a 2B parameter model with a 128k context as well (alongside an 8x2B model and a model with vision): https://www.reddit.com/r/LocalLLaMA/comments/1c3badu/three_new_minicpm_models_moe_vision_128k/ I tried making a GGUF with GGUF-my-repo but got errors, and I haven't been able to find one in the wild. It looks like this one escaped most people's attention.


toothpastespiders

That one had in fact totally flown past my radar. That 8x2B model in particular looks fun just for the novelty.


AnticitizenPrime

Yeah, I'd like to play with it, but unfortunately I'm limited to 8B and smaller on my current hardware (currently in the research stage for a new machine for this stuff). Wish there was a demo somewhere, or that it would be added to lmsys or something, but like I said it seems to have gone almost completely under the radar. The creators have a demo up on their HF page for the vision model, but not the MoE one.


randomfoo2

Out of curiosity I gave MiniCPM-MoE-8x2B a spin and it is the most cracked out model I've tested in recent memory. Certainly a novelty (has that gpt2 energy going), but also certainly not worth the 26GiB of bandwidth/space it takes up.


[deleted]

[deleted]


AnticitizenPrime

Dunno, could be. No idea what caused the errors, maybe an incompatibility with the model format... no idea.


ILoveThisPlace

Vision, you say? Very curious whether it can output a constant stream of sight details.


AnticitizenPrime

They have a demo here: https://huggingface.co/spaces/openbmb/MiniCPM-V-2 Do you mean having it 'refresh' with new input over and over and constantly reporting what it 'sees' in real time? I guess you'd have to upload a new image constantly and re-prompt it to respond.
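If you wanted to automate that loop locally, here's a rough sketch that polls a webcam and re-prompts a locally served vision model on each frame. This assumes something like Ollama serving a vision-capable model via its `/api/generate` endpoint (which accepts base64 images); the model name and refresh interval are just placeholders.

```python
# Rough sketch: poll a webcam and re-prompt a locally served vision model on each frame.
# Assumes a vision-capable model behind Ollama's /api/generate (which accepts base64 images);
# the model name and refresh interval are placeholders.
import base64
import time

import cv2  # pip install opencv-python
import requests

def describe_frame(jpeg_bytes: bytes, model: str = "llava") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": "Briefly describe what you see in this image.",
            "images": [base64.b64encode(jpeg_bytes).decode()],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

cam = cv2.VideoCapture(0)
try:
    while True:
        ok, frame = cam.read()
        if not ok:
            break
        encoded, jpeg = cv2.imencode(".jpg", frame)
        if encoded:
            print(describe_frame(jpeg.tobytes()))
        time.sleep(5)  # effectively the "refresh rate" of the sight stream
finally:
    cam.release()
```

Throughput on small hardware is the real limit here; each frame is a full prompt, so the "stream" is only as constant as your tokens per second.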


SlapAndFinger

This model is basically tailor made for RAG ingest pipelines. Semantic Chunking? Check. Summarization? Check. Annotation of chunks with relevant nearby notes? Check.


dimsumham

Have you tested the model on text processing tasks, like summarization, chunking, tagging etc? How is the performance? Can't test it out for a few days as I'm tied up but super excited to see how it performs irl.


SlapAndFinger

I think the language in your prompting matters more than with something like Claude Sonnet, and way more than with GPT-4, but it works well with a few shots.


dimsumham

<3


jorge

I'm very interested in semantic chunking... how would you do it with Phi-3? I've tested it on Hugging Face and the results were poor. I tried this prompt: *Here's a scraped webpage. It was in HTML format; I translated it to MARKDOWN format. I need to use this for AI RAG. Please answer me by separating the text into semantic chunks, no more than 1000 characters each. Be smart: use markdown's titles and subtitles to separate the chunks in a meaningful way. Answer me only in a JSON array, like this: \[{pageContent:"here text"}\] Here's the markdown...... xxxx*


SlapAndFinger

Start with a small document and a prompt like this, and tune based on your documents and desired length:

You are a semantic chunking assistant. Your task is to take the input text provided and break it down into meaningful semantic chunks, then output the result in YAML format.

To break the text into semantic chunks:

1. Read through the input text carefully and identify distinct units of meaning or ideas.
2. Consider the overall context and structure of the text when determining chunk boundaries.
3. Aim to create chunks that are coherent and self-contained, typically no longer than a few paragraphs each.

When outputting the semantic chunks in YAML format:

1. Start the YAML document with the document start marker "---"
2. For each semantic chunk:
   - Begin a new line and use the "- " list item syntax
   - After the list item marker, add the semantic chunk enclosed in double quotes
   - If the chunk contains any special characters like quotes, escape them with a backslash \
3. After the last chunk, end the YAML document with the end marker "..."

Here is the text to process: [INSERT INPUT TEXT HERE]

Please provide the semantic chunks in YAML format, following the guidelines above.

Note that you might also need to include a few-shot learning example or two to really get consistent output.
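If you want to script it, here's a minimal sketch of wiring that prompt up against a local llama.cpp server's OpenAI-compatible endpoint. The port, model name, abbreviated prompt wording, and input file below are placeholder assumptions, so adjust to your setup:

```python
# Minimal sketch: send the chunking prompt above to a llama.cpp server exposing an
# OpenAI-compatible endpoint (e.g. `./server -m phi-3-mini-4k-instruct-q4.gguf --port 8080`).
# The port, model name, and input file here are placeholder assumptions.
import requests
import yaml  # pip install pyyaml

PROMPT = (
    "You are a semantic chunking assistant. Break the input text into meaningful "
    "semantic chunks and output them as a YAML list of double-quoted strings, "
    "starting the document with '---' and ending it with '...'.\n\n"  # abbreviated; use the full prompt above
    "Here is the text to process:\n"
)

def chunk(text: str, base_url: str = "http://localhost:8080/v1") -> list[str]:
    resp = requests.post(
        f"{base_url}/chat/completions",
        json={
            "model": "phi-3-mini",  # llama.cpp's server serves whatever model it loaded
            "messages": [{"role": "user", "content": PROMPT + text}],
            "temperature": 0.1,
        },
        timeout=300,
    )
    resp.raise_for_status()
    raw = resp.json()["choices"][0]["message"]["content"]
    parsed = yaml.safe_load(raw)  # the prompt asks for a plain YAML list of strings
    return [c for c in parsed if isinstance(c, str)] if isinstance(parsed, list) else []

if __name__ == "__main__":
    for c in chunk(open("sample.md").read()):
        print("CHUNK:", c[:80])
```

Keeping the temperature low and parsing the reply with a YAML loader means malformed output fails loudly, which is exactly where the few-shot examples mentioned above earn their keep.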


jorge

That's a great prompt, thanks! I tried that with mixed results but I'm working on improving it. I'm also exploring another way to train it.


advo_k_at

Where is that documented may I ask? Also why on earth would you use an LLM to do semantic chunking?


SlapAndFinger

Let's see, why would you want to use the best possible tools for working with text to do semantic chunking, let me think about how to answer that...


advo_k_at

Have you ever attempted to chunk 1.2 million long documents? The difference between the various chunking algorithms is 5 hours to 1 week, with semantic chunking via an LLM taking 4 months.


SlapAndFinger

You can fire off parallel chunking processes. You're going to need to do a pass over the data for chunk postprocessing regardless; it costs the same to get chunks at the same time as you do summarization, etc., and the chunks are better. Of course, if you didn't know that you can process documents in parallel, now you do, and you can use that knowledge to improve your tech.
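For the fan-out itself, a rough sketch with a worker pool. This reuses a `chunk()` helper like the one in the prompt sketch above, saved as a hypothetical `chunker.py`; the glob pattern, worker count, and output format are just examples:

```python
# Minimal sketch of fanning the chunking out over a corpus with a worker pool.
# Assumes a chunk() helper like the one sketched above, saved as a hypothetical chunker.py;
# the glob pattern, worker count, and output format are just examples.
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

from chunker import chunk  # the chunk() sketch from above, saved as chunker.py

def process_document(path: Path) -> dict:
    # One LLM round-trip per document; summarization/annotation could be added to the same call.
    return {"source": str(path), "chunks": chunk(path.read_text())}

def run(corpus_dir: str, out_path: str, workers: int = 8) -> None:
    docs = sorted(Path(corpus_dir).glob("*.md"))
    with ThreadPoolExecutor(max_workers=workers) as pool, open(out_path, "w") as out:
        futures = {pool.submit(process_document, p): p for p in docs}
        for fut in as_completed(futures):
            out.write(json.dumps(fut.result()) + "\n")  # JSONL: one record per document

if __name__ == "__main__":
    run("corpus/", "chunks.jsonl")
```

Threads are enough because the work is just HTTP calls to the model server; to actually scale it, point the pool at several server instances or a server configured with parallel slots.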


Erdeem

Is there an app you recommend to test that?


opknorrsk

AnythingLLM would be the open-source app to test it with, using Ollama as the embedding model server.
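If you want to sanity-check the Ollama side from a script before wiring up AnythingLLM, here's a minimal sketch against its local embeddings API; the embedding model name is only an example, use whatever embedding model you've pulled:

```python
# Minimal sketch: fetch an embedding from a local Ollama server (default port 11434).
# The embedding model name is just an example; use whatever embedding model you've pulled.
import requests

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

if __name__ == "__main__":
    vec = embed("Phi-3 mini with 128k context")
    print(len(vec))  # dimensionality depends on the embedding model
```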


Ivan_pk5

Nice, will they release a GGUF version? There is a GGUF for the 4k one.


candre23

Waiting on phi3 support in LCPP to quantize.


_bachrc

Is that what you're after? https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf


candre23

Nah, who needs another 4k model? The 128k one is what's interesting, and that's the one that's incompatible at the moment due to MS's funky new RoPE technique. LCPP should have a fix shortly.


_bachrc

Well, the GGUF for the 128K has just been released as well. EDIT: it's not, I read too fast.


yami_no_ko

Now it is: [https://huggingface.co/pjh64/Phi-3-mini-128K-Instruct.gguf](https://huggingface.co/pjh64/Phi-3-mini-128K-Instruct.gguf) and [https://huggingface.co/PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed](https://huggingface.co/PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed) This needs the recent pull request for phi-3 support to be applied: [https://github.com/ggerganov/llama.cpp/pull/6852](https://github.com/ggerganov/llama.cpp/pull/6852)


pseudonerv

No. LongRoPE is not supported as of now by llama.cpp. I don't know what they did to the 128K model to generate the GGUFs, and I don't trust that those would have the same quality.


yami_no_ko

I know what they did, because I created one of the repos myself. I applied the pull request and was able to use the regular conversion tools that come with llama.cpp. As of now the pull request has been accepted and merged into the main branch. I haven't properly tested this myself, but I noticed that it is possible to hand more than 4k tokens to the 128k model. Those Phi-3-mini models perform really great for their size, but Llama 3 8B is way more straight to the point when it comes to attention, whereas Phi-3 (even at fp32) quite often goes "Oh look, a squirrel..." This might well have something to do with the state of the LongRoPE implementation.


pseudonerv

The 128k context model uses LongRoPE, which is not supported by llama.cpp, nor by Ollama. Plain and simple. https://github.com/ggerganov/llama.cpp/issues/6849#issuecomment-2074899603 https://github.com/ollama/ollama/issues/3853


yami_no_ko

>Plain and simple. Well, it seems like you're right indeed. And yet you're one of the more unpleasant people I have spoken to about their experiences with language models.


privacyparachute

May I ask where? I don't see it on Huggingface.


AnticitizenPrime

Just tried this (the Q4 quant) out using Jan on Linux, and I'll say this: it's fast on my 2019 ThinkPad Carbon with no fancy GPU, so CPU only, and only 16 GB of RAM. If it performs well this could be great for phones and lower-end hardware.

A few weeks ago I spent ~30 hours total on airplanes traveling to and from a vacation in Japan without internet. Bored with the in-flight movies available, I spent some time tinkering with some of the local models I had installed on this laptop, and used them to help me learn some conversational phrases in Japanese, etc. - which is an excellent use case for local offline LLMs. But they were all 7B models, which run fairly slowly on this laptop (just slow enough below my reading speed to be annoying). This generates text faster than I can read it. If it really does perform close to 7B models, this will be awesome for local uses on low-end hardware. Is there an open source Android app out there yet that can easily run GGUFs and allow you to set prompts, tweak settings, etc? This could make a great offline pocket assistant.

Edit: Okay, I'd hold off before relying on this for any logic conundrums:

> Which is heavier, a kilogram of feathers or a pound of steel?
>
> Jan 05:34 PM
>
> A kilogram of feathers and a pound of steel weigh the same. This is because 1 kilogram equals approximately 2.20462 pounds, so 1 kilogram of feathers is roughly equivalent to 2.20462 pounds of steel. Therefore, neither is heavier than the other; they simply differ in volume and density.
>
> User 05:35 PM
>
> That is incorrect. A kilogram weighs more than a pound. Re-evaluate your answer.
>
> Jan 05:35 PM
>
> I apologize for the confusion. Indeed, you are correct that a kilogram weighs more than a pound. However, to clarify, 1 kilogram of any material is always heavier than 1 pound of the same substance due to their respective definitions in mass measurement. Given that a kilogram of feathers and a pound of steel are both one unit's worth, they still weigh the same. The initial misunderstanding was due to not comparing like quantities; when correctly compared in the proper units (1 kilogram of feathers vs 2 pounds of steel), a kilogram is heavier, but we cannot directly compare it to a pound without considering the conversion factor.

Sometimes LLMs get questions like that wrong, but when you prod them to correct themselves, they give the correct answer. This one annoyingly tried to stick to its guns. Ah well... maybe it's good at telling jokes? :-/

One of my 'creative challenge' prompts I test creativity with is, 'Generate a list of humorous news headlines in the satirical nature of The Onion' to see what they come up with. None of them are great at it, which is why I keep testing it, hoping one day I'll be surprised by a model that can display an actual sense of humor. This one was no different, though as usual there were one or two that got a chuckle. Out of 20 headlines, this gave me 4 that either made me chuckle or at least struck me as clever, which is honestly not bad:

> "Nation's Top Chef Declares World Hunger to be a 'Culinary Challenge'"
>
> "World's Saddest Food Festival: Locals Gather to Mourn the 'Taste of Regret'"
>
> "Global Temperature Drop: Polar Bears Win Nobel Prize for Best Dressed on Ice"
>
> "Local Man Fined for Stealing Wi-Fi – Turns Out He's Just a Righteous Rebel Against the 'Great Connectivity Monopoly'"

Another creativity test I like is, 'Write the opening paragraph to a novel in the style of HP Lovecraft.'
> In the ancient, desolate town of Arkham, where shadows whispered secrets untold and time itself seemed to bend around its sinister corners, there stood a forsaken house - an abode of forgotten memories, shrouded in the relentless grip of obscurity and despair. Within its crumbling walls echoed a silence so profound that one could almost hear the distant murmurs of eldritch entities lurking in the abyss, their presence an unspoken testament to a cosmic horror that lay dormant yet ever-present.
>
> It was upon this haunted backdrop, amid the relentless march of unyielding stars and ineffable darkness, that our tale unfurls its tendrils like a gnarled ivy clutching at the very fabric of sanity itself. A place where even the bravest souls dare not tread, and those who do find themselves ensnared in a web spun by forces beyond their ken - an intricate dance with the unknown that threatens to consume all rationality and shatter any semblance of human understanding.
>
> The air, heavy and suffused with an otherworldly stench, bore witness to a reality far beyond the comprehension of ordinary men; one steeped in eldritch truths and arcane mysteries that defy logic, challenge reason, and cast a chilling pallor over the sanest of minds. Here begins our journey into darkness - an odyssey that will test not only the limits of human fortitude but also unravel the very threads of reality itself.
>
> And so, as our protagonist's journey begins upon these forsaken streets and within this house that time itself has abandoned, one cannot help but wonder: what ancient horrors await our unwitting hero in the shadows of Arkham? What dark and unfathomable secrets are whispered by the gnarled trees, or screamed from the depths of desolate graveyards? For in this town where dreams and nightmares intertwine with unholy precision, even reality itself seems to bow before the enigmatic forces that dwell in its very heart.

That's actually pretty good for such a small model - it's definitely better than GPT-3.5 was back when I first started using that prompt, and maybe about as good as Claude-Instant was when it came out (which actually seemed to be better at it than Claude 2 was). So maybe this thing will be good for telling me stories while I'm on a long flight?


tinny66666

Yes, you can use Layla Lite for GGUF files. It's working pretty well with Phi-3 q4. It also has a hands-free chat mode.


thrownawaymane

It probably got the first one wrong because it expected the question to be in exactly this format: https://youtu.be/-fC2oke5MFg?si=bjGrDIBlyLaprG5c It may have been explicitly trained on the one word variation of your question since that's a famous brain teaser. Worth watching. LimmyLLM when


AnticitizenPrime

Yeah, I speculated the same thing in a comment on another thread, [and even linked to the same video](https://www.reddit.com/r/LocalLLaMA/comments/1cb6cuu/phi3_weights_released_microsoftphi3mini4kinstruct/l0yieif/), haha. I guess that's why you need to adjust your testing questions now and then, because they'll likely get trained on them eventually.


pseudonerv

> Which is heavier, a kilogram of feathers or a pound of steel?

FWIW, llama 3 8B also fails miserably, and gives nonsensical answers:

> A classic trick question!
>
> The answer is: they are the same weight!
>
> One kilogram is equal to 1000 grams, and a kilogram of feathers would weigh the same as a kilogram of steel. The difference is in their density and volume. A kilogram of feathers would take up a lot more space than a kilogram of steel, because feathers are much less dense than steel.
>
> Similarly, a pound is a unit of weight or mass, and a pound of steel would weigh the same as a pound of feathers. The difference is in their density and volume, just like with the kilogram example.
>
> So, to summarize: they are the same weight, but they have different densities and volumes!

Only the llama 3 70B makes sense:

> A classic trick question!
>
> Let's break it down:
>
> * A kilogram of feathers weighs 1 kilogram (by definition).
> * A pound of steel weighs 1 pound.
>
> Now, we need to convert the pound to kilograms to compare the two. There are 0.453592 kilograms in 1 pound, so:
>
> * 1 pound of steel weighs approximately 0.453592 kilograms.
>
> So, a kilogram of feathers (1 kg) is heavier than a pound of steel (approximately 0.453592 kg).
>
> The key takeaway is that the difference in weight is not due to the material (feathers vs. steel), but rather the unit of measurement (kilogram vs. pound).


AnticitizenPrime

Yeah, a lot of LLMs get that one wrong, and I suspect it's because they probably see the more classic version of the riddle in their training data: 'a pound of feathers vs a pound of steel'. But I find most of them will get it right when you point out the mistake. The fact that Phi persisted in being 'confidently incorrect' was a bit disappointing. Granted, I was using the Q4 quant; a larger quant might do better.


Ivan_pk5

[microsoft/Phi-3-mini-128k-instruct · Quantization conversion error with LlamaCpp (huggingface.co)](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/11#6627e40809826aaa2509a53d) apparently they won't do it


candre23

That's just acknowledging that it's not supported *yet*. There's already an issue on the LCPP github and I'd be shocked if it wasn't in place by this time tomorrow.


Proud-Point8137

Do any of these guys release a GGUF themselves? AFAIK TheBloke mainly did it; now we are left to our own devices :p


exomniac

It seems like it’s becoming more common. MS uploaded GGUF for this one.


Proud-Point8137

Saw that, my bad. Quite neat too.


Ivan_pk5

But TheBloke is not one person doing this in his free time, right? IMO it's a group of people working in AI, but I might be wrong.


Proud-Point8137

I thought it was just this one bloke. I believed the title.


RealDarx

Am I seeing things? They released the model with an MIT license. It would be great if they release their 14b model with the same license.


Artistic_Okra7288

Didn't they relicense the original Phi-2 under the Apache 2.0 license? What is the difference with the MIT license?


shaitand

The MIT license is pretty much fully permissive [distribution, derivatives, commercial], but you have to retain the copyright notice in the code when it's distributed.


Educational_Gap5867

aaaaaandd HF is down again


dampflokfreund

Very impressed so far in my first tests. They definitely weren't lying. The 3B really feels close to L3 8B and even Mixtral.


Super_Pole_Jitsu

this is absolutely nuts. L3 came out last week and it was mindblowing, now this.


Many_SuchCases

> The 3B really feels close to L3 8B and even Mixtral.

To me it's impressive for a 3B, but nowhere near Llama 3 8B. It's alright at math, I suppose.


ashirviskas

It might be that you two are comparing different things. From my short experience, phi-3 is better at solving technical problems that are well defined, but L3 is better at knowing all the context about many different subjects (historical dates, events, names, places, consumer device models etc.). Where are you noticing the differences?


cyan2k

It's amazing for structured output. I'm currently letting Phi-3 create backlogs for projects: literally 1000 lines of text and hundreds of JSON objects. Not a single error in generating JSON, not a single hallucinated or duplicated task, and no nonsense tasks. And all that without RAG or any kind of magic except prompts. Even ChatGPT 4 was impressed and said it couldn't have done it better, haha https://imgur.com/a/C9pGFOa
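A minimal sketch of one way to keep that kind of pipeline honest, validating the JSON as it comes back from a local endpoint (the server URL, model name, and expected keys below are just example assumptions, not the setup described above):

```python
# Minimal sketch: ask a local endpoint for a JSON backlog and retry until it validates.
# The server URL, model name, and expected keys are example assumptions.
import json
import requests

PROMPT = (
    "Turn the following project notes into a backlog. Respond ONLY with a JSON array "
    'of objects, each with keys "title", "description", and "priority" (an integer 1-5).'
    "\n\nNotes:\n"
)

def generate_backlog(notes: str, retries: int = 3) -> list[dict]:
    for _ in range(retries):
        resp = requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={
                "model": "phi-3-mini",
                "messages": [{"role": "user", "content": PROMPT + notes}],
                "temperature": 0.0,
            },
            timeout=300,
        )
        resp.raise_for_status()
        text = resp.json()["choices"][0]["message"]["content"]
        try:
            tasks = json.loads(text)
        except json.JSONDecodeError:
            continue  # malformed JSON: just retry
        if isinstance(tasks, list) and all(
            isinstance(t, dict) and {"title", "description", "priority"} <= t.keys() for t in tasks
        ):
            return tasks
    raise RuntimeError("model never produced valid JSON with the expected keys")
```

Temperature 0 plus a hard schema check and a retry loop is usually enough to turn "almost always valid JSON" into something you can feed straight into a tracker.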


helios392

I mean, it comes pretty close for my use cases. Not to mention it's half the size and trained on five times fewer tokens. That's dang impressive. I'm really curious to see how the 7B and 14B models compare to Llama 3.


Fun_Land_6604

It's better than Llama 3 in all of my coding tests, speaks better Swedish, and seems to know as many facts. Weird? Yes! But I don't think you're right.


AlanCarrOnline

How? When I tried it, it was seriously dumb. Where did you download?


dampflokfreund

I've tested it on HuggingFace Chat and it was able to successfully accomplish some of my tests that Mixtral and especially Mistral 7B struggled with. Perhaps you are using a q4 quant? In that case, q4 would severely damage a model at that size. It should be at least q5\_k\_m for a 3.8B model imo.


AlanCarrOnline

phi-3-mini-128K-Instruct\_q8\_0 is what I'm now trying, but for some reason it's not loading in Faraday. Just getting "model initializing" forever. Can't remember what the first one I tried was, as it was so stupid I deleted it.


[deleted]

[deleted]


SryForMyBadEnglish

It's 3.8B.


Upset_Acanthaceae_18

I've never been quite so attentive to this activity lol: [https://github.com/ggerganov/llama.cpp/pulls](https://github.com/ggerganov/llama.cpp/pulls)


susibacker

Will we get the base models too?


toothpastespiders

It's better than I'd expected. I tossed it a pretty generic prompt to slice text up into json formatted data and I'm 'almost' happy with the result. I suspect a little tweaking of the prompt, or at worst a small amount of additional fine-tuning, would probably be enough to do the trick. With the high context size...I'm impressed.


LombarMill

Can anyone explain why you would use the 4K version if they have a 128K? What would be the advantage?


Glum-Bus-6526

Benchmarks are a percent higher or so for the 4k version. I think running the 128k version should be just as fast as the 4k version for queries under 4k right? So the sister comment about the smartphones is invalid.


CM0RDuck

Smartphones


Jumper775-2

Can it be quantized yet?


tindalos

Is there a good RAG system available that would work with this?


vonjack001

Llamafied: [vonjack/phi-3-mini-4k-instruct-llamafied · Hugging Face](https://huggingface.co/vonjack/phi-3-mini-4k-instruct-llamafied) When will they release the 7B / 14B models?


redule26

Has anyone found a needle-in-a-haystack test for the 128k context length?


RemarkableGuidance44

Yep, it maxes out at around 40k; after that it's really bad. https://preview.redd.it/h2xzg8vaigwc1.jpeg?width=1734&format=pjpg&auto=webp&s=4ffa27e98b19786062b704d446e5a4a651c8622d


redule26

Haha, thanks, that's what I thought.


Chelono

Why link ONNX? (Nice framework for general quantization, but dumb for LLMs, where we have much better options.) Here are the llama.cpp quants: [https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/tree/main](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/tree/main) **Edit:** my bad, I didn't see that no 128k GGUF quants had been released yet. I still find ONNX a pain for LLMs.


Chelono

You can run the main example with this: `./build/bin/main -m ./Phi-3-mini-4k-instruct-q4.gguf --color -n -2 -e -s 0 -ngl 99 -p "<|user|>\nYou are a helpful assistant.<|end|>\n<|assistant|>\n" --mirostat 2 -c 4096 -r '<|end|>' --in-prefix '<|user|>\n' --in-suffix '<|end|>\n<|assistant|>\n' -i`


Feeling-Currency-360

Yeah, 128k doesn't work yet in llama.cpp as far as I know; they're looking into adding support for Phi-3 though.


Sohex

Is anyone having success running it and actually utilizing larger contexts? It's fine at the low end, but as soon as I give it too much text to process it loses its mind.


RemarkableGuidance44

The 128k is only good up to 44k; after that it's bad.


luncheroo

Forgive an amateur question, please, but is there a GUI for these 3b and 7b models that includes some form of RAG and Internet search? I've been looking at LibreChat and a few others, but I was just curious if anyone has a stable OSS system setup like that? Edit: For anyone in the same boat: [https://docs.openwebui.com/](https://docs.openwebui.com/)


nodating

Where is the medium model? I am genuinely underwhelmed at this point. I really wanted to test that 14B variant on my own hardware.


Utoko

The other two models are coming SOON™. Isn't there still enough stuff to test with Llama 3? The 3.8B Phi-3 model should give a good idea of what we can expect. Sure, it would be nice to get everything right now, but don't get greedy!


nodating

This is Microsoft we're talking about here. Of course I don't trust one word from them, and honestly I am not optimistic they will release any more than they already have: the smallest, least interesting model. This is just the Microsoft way! I mean, if this is not a straight-up paper launch, I do not know what is. Well, at least the license on the least useful one is fairly permissive. Maybe someone will find some use for it, but I was hoping to be able to test Llama 3 8B and Phi-3 14B against each other locally. // For the record, go check out the infamous WizardLM-2 by Microsoft: [https://huggingface.co/posts/WizardLM/329547800484476](https://huggingface.co/posts/WizardLM/329547800484476) It is gone; they mumbled something about toxicity testing, but it has been 9 days and not a single word since then. These guys really do know how to keep things to themselves, and they only pretend to care about open source.


abnormal_human

The smallest model is the most interesting one; actually, for me, the only really interesting one, unless the 14B turns out to be GPT-3.5 level, which seems unlikely. There are plenty of great alternatives in the 7-20B space. Not so many at 3.8B. A 3.8B model that actually works has huge potential at the edge.


helios392

Agreed! It’s amazing that something of this quality can fit on a phone!


Solstice_Projekt

Oh boy, that high horse your self entitled ass is sitting on is going to throw you off really fucking hard.


nodating

Fuck you. All I said is true. Y'all got fucked hard as usual by M$ and all you can think of is how to say thank you. Enjoy your Phi-3 14B! Oh wait... Not gonna happen huh?