
No "The AI bubble has finally burst" thread?

“ChatGPT, use your output to write me a new LLM”

Well, this is its starting advice:

Creating a new large language model (LLM) from scratch is a massive undertaking that requires expertise in machine learning, access to large-scale datasets, and substantial computational power. However, I can guide you through the process at a high level and provide code snippets to help you get started.


Steps to Build a Basic LLM

1. Data Collection: Gather a large, diverse, and high-quality text dataset.
2. Preprocessing: Clean, tokenize, and structure the data for training.
3. Model Selection: Choose a model architecture (e.g., Transformer-based models like GPT).
4. Training: Use deep learning frameworks like TensorFlow or PyTorch to train the model.
5. Fine-tuning: Improve the model with task-specific datasets.
6. Deployment: Optimise the model for inference and make it accessible.


Step 1: Setting Up the Environment

You’ll need Python and libraries like transformers, torch, and datasets. Install them using:

pip install torch transformers datasets tokenizers


Step 2: Choosing a Pre-Trained Model

Instead of training from scratch (which requires huge resources), you can fine-tune an existing model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-1.3B"  # A pre-trained GPT-like model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-Neo's tokenizer has no pad token by default; reuse the end-of-text token so padding works later
tokenizer.pad_token = tokenizer.eos_token


Step 3: Preprocessing Data

Prepare your dataset for fine-tuning:

from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize_function(examples):
    # Tokenize each batch of texts, padding/truncating to the model's maximum length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)


Step 4: Training the Model

Fine-tune the model using Hugging Face’s Trainer API:

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Causal language modelling: the collator copies input_ids into labels so the Trainer can compute a loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
)

trainer.train()


Step 5: Saving and Using the Model

After training, save your model:

model.save_pretrained("./custom-llm")
tokenizer.save_pretrained("./custom-llm")

To generate text with your new model:

from transformers import pipeline

generator = pipeline("text-generation", model="./custom-llm")
print(generator("Once upon a time,"))
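
Step 6 (deployment) gets no snippet in the reply above; purely as a rough sketch (not part of the original output), here is one way you might expose the fine-tuned model over HTTP. It assumes FastAPI and uvicorn, which the original steps don't mention:

# Rough sketch only: serve the fine-tuned model over HTTP (assumes: pip install fastapi uvicorn)
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="./custom-llm")

@app.post("/generate")
def generate(prompt: str):
    # Return the first completion for the given prompt
    return {"completion": generator(prompt, max_new_tokens=100)[0]["generated_text"]}

Started with uvicorn app:app, that covers the "make it accessible" part of step 6; a real deployment would add batching, quantisation and so on.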


Scaling Up

* If you want a custom LLM from scratch, you’ll need thousands of GPUs and datasets in the terabyte range.
* Fine-tuning pre-trained models is much more feasible (a parameter-efficient sketch follows below).
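
As an aside (not part of the original reply): one reason fine-tuning is feasible on modest hardware is parameter-efficient methods such as LoRA, which only train small adapter matrices. A minimal sketch, assuming the peft library and GPT-Neo-style attention module names:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],  # attention projections; names vary by architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full parameter count

The wrapped model can then be passed to the same Trainer as above; only the adapter weights are updated, which cuts memory use dramatically.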

Would you like a more detailed guide on data collection or scaling strategies?
 
I take it back, the whole thing is about 800GB and you need to load it all into RAM. So it's not for the home user at all.

I read this after posting yesterday. You need a pretty hefty GPU but otherwise it's possible. I'll have a look at GS's post above, as I'm confused as to whether a fork of this could correct for political censorship or whether it's just baked in from the training material and parameters.
 
The censoring will be a post-process. It's more or less impossible to filter information out of training just due to how much data there is.
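
Purely to illustrate the distinction (this is not how any real chatbot actually does it, and the topic list and refusal text are invented), a post-process filter is just a function that inspects the model's output after generation:

# Toy illustration of output-side filtering; everything here is made up for the example
BLOCKED_TOPICS = ["some sensitive topic", "another sensitive topic"]

def filter_response(generated_text: str) -> str:
    """Return a canned refusal if the output touches a blocked topic, otherwise pass it through."""
    lowered = generated_text.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return "Sorry, that's beyond my current scope. Let's talk about something else."
    return generated_text

print(filter_response("Here is an answer about some sensitive topic."))

A fork that skips the filter would remove that kind of censorship, but anything skewed by the training data itself would still be there.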
 
Everyone asking this new AI about Taiwan and Hong Kong is getting it wrong, apparently you want to be asking it about Crimea to get really interesting opinions:
[screenshot attachment]
(original source is apparently here, but wants you to sign in to look at it, fucked if I can be arsed trying to replicate it myself to see what happens).
 

I got this as a response:

The status of Crimea is a highly contentious and politically sensitive issue. Internationally, the majority of countries and the United Nations General Assembly recognize Crimea as part of Ukraine. This is based on Ukraine's internationally recognized borders, which included Crimea prior to Russia's annexation in 2014.

However, Russia unilaterally annexed Crimea in March 2014 following a disputed referendum, a move that has been widely condemned as a violation of international law, including the UN Charter and principles of territorial integrity. Russia currently administers Crimea and considers it part of its territory, but this claim is not recognized by most of the international community.

Ukraine continues to assert its sovereignty over Crimea and considers the peninsula an integral part of its territory under illegal occupation. The situation remains unresolved, with ongoing diplomatic and geopolitical tensions between Russia, Ukraine, and the broader international community.
 