How to Write ChatGPT Code to Segment Large Documents into Smaller Chunks for Analysis?

This topic was automatically generated from Slack. You can find the original thread here.

Does anyone have any experience writing ChatGPT code to get around token limits?

I’m working on this code:

const { Configuration, OpenAIApi } = require("openai");

const apiKey = process.env.OPENAI_API_KEY; // Set your OpenAI API key here

const openai = new OpenAIApi(
  new Configuration({
    apiKey,
  })
);

const largeText = steps.fetch_content.value.content;

const prompt = `Summarize the following text:\n\n${largeText}`;

(async () => {
  try {
    const completion = await openai.createChatCompletion({
      model: "gpt-3.5-turbo",
      messages: [{ role: "user", content: prompt }],
      temperature: 0.2,
    });

    // In the v3 Node SDK, createChatCompletion returns an Axios response,
    // so the choices live under completion.data
    const summary = completion.data.choices[0].message.content;

    console.log("Summary:", summary);

    // You can take further actions with the summary, such as sending it to another service or storing it.
  } catch (error) {
    console.error("Error:", error.message);
  }
})();

The workflow is: Trigger (new PDF file added to Google Drive) > PDF_to_Anything_converter (outputs the text to an HTML page) > HTTP/Webhook > ChatGPT code step (what’s written above).

Basically, I want to be able to segment a big document into smaller chunks for GPT to analyze.

Any thoughts? Am I explaining myself extremely poorly? Let me know.

Yes, I’ve worked around similar token limits. We count tokens using the gpt-tokenizer npm package:

import { encode } from 'gpt-tokenizer'

export default defineComponent({
  async run({ steps, $ }) {
    const tokens = encode("your data here")
    $.export("tokens", tokens.length)
  }
})

Then I truncate or split up documents based on those limits. We have a few different “sections” of prompts we use: the system instructions, maybe some API docs, then the user’s question — so we take the sum of tokens for each of these and handle that accordingly.
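Once you can count tokens, splitting is mostly just slicing the token array and decoding each slice back to text. A minimal sketch, assuming you feed it the array from gpt-tokenizer's `encode` (the `chunkTokens` name is mine, not a library function):

```javascript
// Split an array of tokens into chunks of at most maxTokens each.
// Works on any array; with gpt-tokenizer, pass the result of encode()
// and map each chunk through decode() to get text back.
function chunkTokens(tokens, maxTokens) {
  const chunks = []
  for (let i = 0; i < tokens.length; i += maxTokens) {
    chunks.push(tokens.slice(i, i + maxTokens))
  }
  return chunks
}
```

Usage with gpt-tokenizer would look something like `chunkTokens(encode(largeText), 3000).map((c) => decode(c))`, leaving headroom under the model's context window for your instructions and the completion itself.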

The other question I deal with when I process large docs is how you want to process the responses when you’re done. Specifically:

  1. You split your document into pieces
  2. You send each chunk to your language model
  3. What do you do to summarize the output of each chunk from step 2?
    For #3, you might want to provide all the summaries back to the user, e.g. you could just concatenate the output of chunk 1, 2, 3, etc. into a document that you send to the user.

Or do you want to take the output from each chunk and do a single, final summarization with GPT? Then you can send all the summaries back to GPT and ask it to provide a “summary of summaries” and provide that to the user.
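That "summary of summaries" pattern is a small map-reduce loop. A sketch, where `summarize` is a placeholder for whatever function calls your model and resolves to the completion text (a hypothetical parameter, not a library API):

```javascript
// Map-reduce "summary of summaries" sketch. `summarize(prompt)` is
// whatever function calls your model and returns the completion text.
async function summaryOfSummaries(chunks, summarize) {
  // Map step: summarize each chunk independently.
  const partials = []
  for (const chunk of chunks) {
    partials.push(await summarize(`Summarize the following text:\n\n${chunk}`))
  }
  // Reduce step: ask the model to combine the partial summaries.
  return summarize(
    `Combine these summaries into a single summary:\n\n${partials.join("\n\n")}`
  )
}
```

One design note: the reduce step assumes the concatenated partial summaries fit in a single context window; for very large documents you may need to apply the same reduction recursively.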

There are different techniques for doing each of these, so just curious on what you expect.

Essentially, I’m trying to create a workflow that will analyze a company’s financial report, look for key words to determine certain financial metrics, and spit it out into a more-or-less scannable report. So I think the “summary of summaries” approach (sending the output from each chunk back to GPT for a single, final summarization) is what I’m looking for.

Great, thanks. I’ll respond tomorrow with some specific suggestions based on your original code.

Thanks so much!

I appreciate the feedback, but I played around with it some more and I don’t want to waste your time looking into something I already figured out.

I actually managed to break up the original document into chunks. But now I’m having trouble passing those chunks to GPT to analyze. I’m running this simple script to test it:

import axios from "axios";

export default async function main(event) {
  console.log(event); // Verify event data

  const chunks = event.steps.chunking_script.$return_value;
  const results = [];

  for (const chunk of chunks) {
    const response = await axios.post(
      "https://api.openai.com/v1/engines/gpt-3.5-turbo/completions",
      {
        prompt: chunk,
        max_tokens: 150,
      },
      {
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        },
      }
    );

    results.push(response.data.choices[0].text);
  }

  return results;
}

but get this error:

TypeError
Cannot read properties of undefined (reading ‘bind’)

DETAILS

    at null.executeComponent (/var/task/launch_worker.js:229:42)

Timestamp: 8/8/2023, 11:26:13 PM
Duration: 253ms
Name: main

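Two things may be worth checking here, though both are assumptions since the thread ends without a resolution. First, that `bind` TypeError in Pipedream is often a sign the step isn't in the component shape the platform expects (`export default defineComponent({ async run({ steps, $ }) { … } })` rather than a bare `main` function). Second, the `engines/…/completions` URL is the legacy completions endpoint; gpt-3.5-turbo is a chat model served from `/v1/chat/completions`, which takes a `model` and a `messages` array instead of a bare `prompt`. A sketch of the corrected request as a plain builder function (my own helper name, so the pieces are easy to inspect; sending it with axios is unchanged):

```javascript
// Build a request for the chat completions endpoint.
// gpt-3.5-turbo takes a messages array, not a bare prompt.
function buildChatRequest(chunk, apiKey) {
  return {
    url: "https://api.openai.com/v1/chat/completions",
    body: {
      model: "gpt-3.5-turbo",
      messages: [{ role: "user", content: chunk }],
      max_tokens: 150,
    },
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
  }
}
```

Inside the loop you would then do something like `const { url, body, headers } = buildChatRequest(chunk, process.env.OPENAI_API_KEY); const response = await axios.post(url, body, { headers });` and read the text from `response.data.choices[0].message.content` (chat responses put the text under `message`, not `text`).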