This walkthrough provides you with deep, hands-on experience with the Unstructured user interface (UI). As you follow along, you will learn how to use many of Unstructured’s features for partitioning, enriching, chunking, and embedding. These features are optimized for turning your source documents and data into information that is well-tuned for retrieval-augmented generation (RAG), agentic AI, and model fine-tuning. This walkthrough uses two sample files to demonstrate how Unstructured identifies and processes content such as images, graphs, complex tables, non-English characters, and handwriting. These files, which are available for you to download to your local machine, include:
  • Wang, Z., Liu, X., & Zhang, M. (2022, November 23). Breaking the Representation Bottleneck of Chinese Characters: Neural Machine Translation with Stroke Sequence Modeling. arXiv.org. https://arxiv.org/pdf/2211.12781. This 12-page PDF file features English and non-English characters, images, graphs, and complex tables. Throughout this walkthrough, this file’s title is shortened to “Chinese Characters” for brevity.
  • United States Central Security Service. (2012, January 27). National Cryptologic Museum Opens New Exhibit on Dr. John Nash. United States National Security Agency. https://courses.csail.mit.edu/6.857/2012/files/H03-Cryptosystem-proposed-by-Nash.pdf. This PDF file features English handwriting and scanned images of documents. Throughout this walkthrough, this file’s title is shortened to “Nash letters” for brevity.
If you are not able to complete any of the following steps, contact Unstructured Support at support@unstructured.io.
What are these green boxes? As you move through this walkthrough, you will notice tips like this one. These tips are designed to help expand your knowledge about Unstructured as you go. Feel free to skip these tips for now if you are in a hurry. You can always return to them later to learn more.

Step 1: Sign up and sign in to Unstructured

Let’s get started!
  1. If you do not already have an Unstructured account, sign up for free. After you sign up, you are automatically signed in to your new Unstructured Starter account, at https://platform.unstructured.io.
    To sign up for a Team or Enterprise account instead, contact Unstructured Sales, or learn more.
  2. If you have an Unstructured Starter or Team account and are not already signed in, sign in to your account at https://platform.unstructured.io.
    For an Enterprise account, see your Unstructured account administrator for instructions, or email Unstructured Support at support@unstructured.io.

Step 2: Create a custom workflow

In this step, you create a custom workflow in your Unstructured account. Workflows are defined sequences of processes that automate the flow of data from your source documents and data into Unstructured for processing. Unstructured then sends its processed data to your destination file storage locations, databases, and vector stores. Your RAG apps, agents, and models can then use this processed data in those destinations to do things such as answering users’ questions, automating business processes, and expanding your organization’s available body of knowledge, more quickly and accurately.
Which kinds of sources and destinations does Unstructured support? Unstructured can connect to many types of sources and destinations including file storage services such as Amazon S3 and Google Cloud Storage; databases such as PostgreSQL; and vector storage and database services such as MongoDB Atlas and Pinecone. See the full list of supported source and destination connectors.
Which kinds of files does Unstructured support? Unstructured can process a wide variety of file types including PDFs, word processing documents, spreadsheets, slide decks, HTML, image files, emails, and more. See the full list of supported file types.
Let’s get going!
  1. After you are signed in to your Unstructured account, on the sidebar, click Workflows.
    What do the other buttons on the sidebar do?
    • Start takes you to the UI home page.
    • Connectors allows you to create and manage your source and destination connectors.
    • Jobs allows you to see the results of your workflows that are run manually (on-demand) and automatically (on a regular time schedule). Learn more.
    • API Keys allows you to use code to create and manage connectors, workflows, and jobs programmatically instead of by using the UI. Learn more.
    • Your user icon at the bottom of the sidebar allows you to manage your Unstructured account. You can also sign out of your account from here. Learn more.
  2. Click New Workflow.
  3. With Build it Myself already selected, click Continue.
    What does Build it For Me do? The Build it For Me option creates an automatic workflow with sensible default settings to enable you to get good-quality results faster. However, this option requires that you first have an existing remote source and destination connector to add to the workflow. To speed things up here and keep things simple, this walkthrough only processes files from your local machine and skips the use of connectors. To learn how to use connectors later, see the next steps at the end of this walkthrough.
  4. The workflow designer appears.
    What are all the parts of the workflow designer? The middle portion of the workflow designer is the workflow directed acyclic graph (DAG), which contains a collection of nodes connected by edges that go in only one direction. You can think of the DAG as similar to a flowchart for a process. Directed means the arrows show the flow from one step to the next, and acyclic means you cannot follow the arrows and end up back at the starting point. The workflow settings pane on the right includes the following tabs:
    • The settings on the Details tab allow you to change this workflow’s name. You can also see when this workflow was created and which jobs were run for this workflow.
    • Schedule allows you to set up a schedule for this workflow to run automatically (on a regular time schedule).
    • Settings allows you to specify whether Unstructured’s results overwrite any previous results in the destination location every time this workflow runs. To turn on this behavior, check the Overwrite existing results box. To turn it off, uncheck the box. Note that this setting works only for blob storage destination connectors such as the ones for Amazon S3, Azure Blob Storage, and Google Cloud Storage.
    • FAQ contains additional information about how to use the workflow designer.
    If the workflow settings pane is not visible, click the Settings button near the bottom to show it. There are also buttons near the bottom to undo or redo recent edits to the workflow DAG, zoom in and out of the workflow designer, re-center the DAG within the designer, and add a new node to the DAG.
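
To make the DAG idea from the preceding tip more concrete, here is a minimal sketch in Python that models a workflow as nodes with one-way edges and derives a valid processing order. The node names mirror the ones in this walkthrough, but the dictionary layout and the processing_order helper are assumptions made for illustration only; they do not describe how Unstructured represents workflows internally.

```python
from graphlib import CycleError, TopologicalSorter


def processing_order(dag):
    """Return the nodes in an order where every node follows its predecessors.

    `dag` maps each node name to the set of nodes that must run before it.
    Raises ValueError if the graph contains a cycle (so it is not a DAG).
    """
    try:
        return list(TopologicalSorter(dag).static_order())
    except CycleError as error:
        raise ValueError("workflow graph contains a cycle") from error


# Hypothetical workflow: Source -> Partitioner -> Destination.
workflow = {
    "Source": set(),
    "Partitioner": {"Source"},
    "Destination": {"Partitioner"},
}

print(processing_order(workflow))
```

Because the edges go in only one direction and never loop back, there is always at least one order in which every node can run after the nodes it depends on. Adding an edge from Destination back to Source would make the helper raise an error.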

Step 3: Experiment with partitioning

In this step, you use your new workflow to partition the sample PDF files that you downloaded earlier onto your local machine. Partitioning is the process where Unstructured identifies and extracts content from your source documents and then outputs this content as a series of contextually-rich document elements and metadata, which are well-tuned for RAG, agentic AI, and model fine-tuning. This step shows how well Unstructured’s High Res partitioning strategy identifies and extracts content, and how well Unstructured’s VLM partitioning strategy handles more complex content such as complex tables, multilanguage characters, and handwriting.
  1. With the workflow designer active from the previous step, at the bottom of the Source node, click Drop file to test.
  2. Browse to and select the “Chinese Characters” PDF file that you downloaded earlier.
  3. Click the Partitioner node and then, in the node’s settings pane’s Details tab, select High Res.
    When would I choose Auto, Fast, High Res, or VLM?
    • Auto is recommended in most cases. It lets Unstructured figure out the best strategy to switch over to for each incoming file (and even for each page if the incoming file is a PDF), so you don’t have to!
    • Fast is only for when you know for certain that none of your files have tables, images, or multilanguage, scanned, or handwritten content in them. It’s optimized for partitioning text-only content and is the fastest of all the strategies. It can recognize the text for only a few languages other than English.
    • High Res is only for when you know for certain that at least one of your files has images or simple tables in them, and that none of your files also have scanned or handwritten content in them. It can recognize the text for more languages than Fast but not as many as VLM.
    • VLM is great for any file, but it is best when you know for certain that some of your files have a combination of tables (especially complex ones), images, and multilanguage, scanned, or handwritten content. It’s the highest quality but slowest of all the strategies.
    In this walkthrough, you switch between High Res and VLM strategies only to see how each of these strategies works with a combination of complex tables, images, and multilanguage, scanned, and handwritten content. In practice, for these kinds of files you would likely just want to choose Auto.
  4. Immediately above the Source node, click Test.
  5. The PDF file appears in a pane on the left side of the screen, and Unstructured’s output appears in a Test output pane on the right side of the screen.
    What am I looking at in the output here?
    • Unstructured outputs its results in industry-standard JSON format, which is ideal for RAG, agentic AI, and model fine-tuning.
    • Each object in the JSON is called a document element and contains a text representation of the content that Unstructured detected for the particular portion of the document that was analyzed.
    • The type is the kind of document element that Unstructured categorizes it as, such as whether it is a title (Title), a table (Table), an image (Image), a series of well-formulated sentences (NarrativeText), some kind of free text (UncategorizedText), a part of a list (ListItem), and so on. Learn more.
    • The element_id is a unique identifier that Unstructured generates to refer to each document element. Learn more.
    • metadata contains supporting details about each document element, such as the page number it occurred on, the file it occurred in, and so on. Learn more.
    What else can I do here?
    • You can scroll through the original file on the left or, where supported for a given file type, click the up and down arrows to page through the file one page at a time.
    • You can scroll through Unstructured’s JSON output on the right, and you can click Search JSON to search for specific text in the JSON output. You will do this next.
    • Download Full JSON allows you to download the full output to your local machine as a JSON file.
    • View JSON at this step allows you to view the JSON output at each step in the workflow as it was further processed. There’s only one step right now (the Partitioner step), but as you add more nodes to the workflow DAG, this can be a useful tool to see how the JSON output changes along the way.
    • The close (X) button returns you to the workflow designer.
  6. Some interesting portions of the output include the following, which you can get to by clicking Search JSON above the output:
    • The Chinese characters on page 3. Search for the text In StrokeNet, the corresponding. Notice that the Chinese characters are not interpreted correctly.
    • The formula on page 5. Search for the text L= LL + Ln. Notice that the formula’s output diverges quite a bit from the original content.
    • Table 2 on page 6. Search for the text Model Parameters Performance (BLEU). Notice that the text_as_html output diverges slightly from the original content.
    • Figure 4 on page 8. Search for the text 50 45 40 35. Notice that the output is not that informative about the original image’s content.
    These quality issues will be addressed later in this step when you change the partitioning strategy to VLM, and later in Step 4 when you add enrichments alongside High Res partitioning.
  7. Now try changing the partitioning strategy to VLM and see how the output changes. To do this:
    a. Click the close (X) button above the output on the right side of the screen.
    b. In the workflow designer, click the Partitioner node and then, in the node’s settings pane’s Details tab, select VLM.
    c. Under Select VLM Model, under Anthropic, select Claude Sonnet 4.
    d. Click Test.
    When would I choose one of these models over another? A vision language model (VLM) is designed to use sophisticated AI techniques and logic to combine advanced image and text understanding, resulting in more accurate and contextually-rich output. As VLMs are constantly being released and improved, Unstructured is always adding to and updating its list of supported VLMs. If you aren’t getting consistent results with one VLM for a particular set of files, switching over to another one might improve your results, depending on that VLM’s capabilities and the sample data that it was trained on.
  8. Notice how the quality of the output changes, now that you are using the VLM strategy:
    • The Chinese characters on page 3. Search for the text In StrokeNet, the corresponding. Notice that the Chinese characters are interpreted correctly.
    • The formula on page 5. Search for the text match class. Notice that the formula’s output is closer to the original content.
    • Table 2 on page 6. Search for the text Model Parameters Performance (BLEU). Notice that the text_as_html output is closer to the original content.
    • Figure 4 on page 8. Search for the text Graph showing BLEU scores comparison. Notice the informative description about the figure.
  9. Now try looking at the “Nash letters” PDF file’s output. To do this:
    a. Click the close (X) button above the output on the right side of the screen.
    b. In the workflow designer, click the Partitioner node and then, in the node’s settings pane’s Details tab, select High Res.
    c. At the bottom of the Source node, click the existing PDF’s file name.
    d. Browse to and select the “Nash letters” file that you downloaded earlier to your local machine.
    e. Click Test.
  10. Some interesting portions of the High Res output against this handwritten and scanned content include the following:
    • The handwriting on page 3. Search for the text Deo Majr. Notice that the handwriting is not recognized correctly.
    • The mimeograph on page 11. Search for the text Technicans at this Agency (note the typo Technicans). Notice that the mimeograph contains 18 January 1955, but the output contains only January 1955.
    • The handwritten diagrams on page 13. Search for the text "page_number": 13. Notice that no output is generated for the diagrams.
  11. Now try changing the partitioning strategy to VLM and see how the quality of the output changes. To do this:
    a. Click the close (X) button above the output on the right side of the screen.
    b. In the workflow designer, click the Partitioner node and then, in the node’s settings pane’s Details tab, select VLM.
    c. Under Select VLM Model, under Anthropic, select Claude Sonnet 4.
    d. Click Test.
  12. Notice how the output changes, now that you are using the VLM strategy:
    • The handwriting on page 3. Search for the text Dear Major Grosjean. Notice that the handwriting is now recognized correctly.
    • The mimeograph on page 11. Search for the text Technicians at this Agency (note the corrected typo Technicians). Notice that the mimeograph contains 18 January 1955, and the output now also contains 18 January 1955.
    • The handwritten diagrams on page 13. Search for the text graph LR. Notice that Mermaid representations of the handwritten diagrams are output.
  13. When you are done, be sure to click the close (X) button above the output on the right side of the screen, to return to the workflow designer for the next step.
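
If you download the full JSON output from this step, a short script can help you compare what the High Res and VLM strategies produced. The following is a minimal Python sketch that counts document elements by type and by page. It assumes that the output is a JSON array of objects, that each object has a type field, and that page_number sits inside metadata. The sample data, the text key, and the summarize_elements helper are hypothetical and are here only to keep the sketch self-contained.

```python
import json
from collections import Counter


def summarize_elements(elements):
    """Count document elements by type and by page number."""
    by_type = Counter(element.get("type", "Unknown") for element in elements)
    by_page = Counter(
        element.get("metadata", {}).get("page_number") for element in elements
    )
    return by_type, by_page


# Hypothetical sample shaped like the output described in this step.
sample_json = """
[
  {"type": "Title", "element_id": "a1", "text": "Example title",
   "metadata": {"page_number": 1}},
  {"type": "NarrativeText", "element_id": "b2", "text": "Example paragraph.",
   "metadata": {"page_number": 1}},
  {"type": "Table", "element_id": "c3", "text": "Example table text",
   "metadata": {"page_number": 2}}
]
"""

elements = json.loads(sample_json)
by_type, by_page = summarize_elements(elements)
print(dict(by_type))
print(dict(by_page))
```

To try this against real output, replace the sample string with the contents of the file that Download Full JSON saves. A page that is missing from the page counts, such as page 13 of the “Nash letters” file under High Res, indicates that no elements were generated for it.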

Step 4: Experiment with enriching

In this step, you add several enrichments to your workflow, such as generating summary descriptions of detected images and tables, HTML representations of detected tables, and detected entities (such as people and organizations) and the inferred relationships among these entities.
  1. With the workflow designer active from the previous step, change the Partitioner node to use High Res.
  2. Between the Partitioner and Destination nodes, click the add (+) icon, and then click Enrich > Enrichment.
  3. In the node’s settings pane’s Details tab, select Image under Input Type, and then click OpenAI (GPT-4o) under Model.
    The image description enrichment generates a summary description of each detected image. This can help you to more quickly and easily understand what each image is all about without having to stop to manually visualize and interpret the image’s content yourself. This also provides additional helpful context about the image for your RAG apps, agents, and models. Learn more.
  4. Repeat this process to add three more nodes between the Partitioner and Destination nodes. To do this, click the add (+) icon, and then click Enrich > Enrichment, as follows:
    a. Add a Table (under Input Type) enrichment node with OpenAI (GPT-4o) (under Model) and Table Description (under Task) selected.

    The table description enrichment generates a summary description of each detected table. This can help you to more quickly and easily understand what each table is all about without having to stop to manually read through the table’s content yourself. This also provides additional helpful context about the table for your RAG apps, agents, and models. Learn more.
    b. Add another Table (under Input Type) enrichment node with OpenAI (GPT-4o) (under Model) and Table to HTML (under Task) selected.

    The table to HTML enrichment generates an HTML representation of each detected table. This can help you to more quickly and accurately recreate the table’s content elsewhere later as needed. This also provides additional context about the table’s structure for your RAG apps, agents, and models. Learn more.
    c. Add a Text (under Input Type) enrichment node with OpenAI (GPT-4o) (under Model) selected.

    The named entity recognition (NER) enrichment generates a list of detected entities (such as people and organizations) and the inferred relationships among these entities. This provides additional context about these entities’ types and their relationships for your graph databases, RAG apps, agents, and models. Learn more.
    The workflow designer should now show four Enrichment nodes between the Partitioner and Destination nodes.
  5. Change the Source node to use the “Chinese Characters” PDF file, and then click Test.
  6. In the Test output pane, make sure that Enrichment (5 of 5) is showing. If not, click the right arrow (>) until Enrichment (5 of 5) appears, which will show the output from the last node in the workflow.
  7. Some interesting portions of the output include the following:
    • The figures on pages 3, 7, and 8. Search for the seven instances of the text "type": "Image". Notice the summary description for each image.
    • The tables on pages 6, 7, 8, 9, and 12. Search for the seven instances of the text "type": "Table". Notice the summary description for each of these tables. Also notice the text_as_html field for each of these tables.
    • The identified entities and inferred relationships among them. Search for the text Zhijun Wang. Of the eight instances of this name, notice the author’s identification as a PERSON three times, the author’s published relationship twice, and the author’s affiliated_with relationship twice.
  8. When you are done, be sure to click the close (X) button above the output on the right side of the screen, to return to the workflow designer for the next step.
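
As a rough illustration of how a downstream script might use the enriched output, the following Python sketch collects the HTML representation of each detected table along with its page number. It assumes that text_as_html and page_number are stored inside each element’s metadata; the sample elements and the tables_with_html helper are hypothetical, so check the field locations against your own downloaded JSON.

```python
def tables_with_html(elements):
    """Return (page_number, html) pairs for Table elements that carry HTML."""
    results = []
    for element in elements:
        if element.get("type") != "Table":
            continue
        metadata = element.get("metadata", {})
        html = metadata.get("text_as_html")  # assumed location of the field
        if html:
            results.append((metadata.get("page_number"), html))
    return results


# Hypothetical elements shaped like the enriched output in this step.
sample_elements = [
    {"type": "Image", "text": "Example image description",
     "metadata": {"page_number": 3}},
    {"type": "Table", "text": "Example table description",
     "metadata": {"page_number": 6,
                  "text_as_html": "<table><tr><td>1</td></tr></table>"}},
    {"type": "Table", "text": "Table without HTML",
     "metadata": {"page_number": 7}},
]

for page_number, html in tables_with_html(sample_elements):
    print(page_number, html)
```

Tables that have no HTML representation are skipped rather than reported with an empty value, which keeps the result safe to pass directly to an HTML renderer or parser.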

Step 5: Experiment with chunking

In this step, you apply chunking to your workflow. Chunking is the process where Unstructured rearranges the resulting document elements’ text content into manageable “chunks” to stay within the limits of an AI model and to improve retrieval precision.
What kind of chunking strategy should I use, and how big should my chunks be? Unfortunately, there is no one-size-fits-all answer to this question. However, there are some general considerations and guidelines that can help you to determine the best chunking strategy and chunk size for your specific use case. Be sure of course to also consult the documentation for your target AI model and downstream application toolsets. Is your content primarily organized by title, by page, by interrelated subject matter, or none of these? This can help you determine whether a by-title, by-page, by-similarity, or basic (by-character) chunking strategy is best. (You’ll experiment with each of these strategies here later.) If your chunks are too small, they might lose necessary context, leading to the model providing inaccurate, irrelevant, or hallucinated results. On the other hand, if your chunks are too large, the model can struggle with the sheer volume of information, leading to information overload, diluted meaning, and potentially higher processing costs. You should aim to find a balance between chunks that are big enough to contain meaningful information, while small enough to enable performant applications and low latency responses. For example, smaller chunks of 128 or 256 tokens might be sufficient for capturing more granular semantic information, while larger chunks of 512 or 1024 tokens might be better for retaining more context. It’s important here to note that tokens and characters are not the same thing! In terms of characters, for English text, a common approximation is 1 token being equal to about 3 or 4 characters or three-quarters of a word.
Many AI model providers publish their own token-to-character calculators online that you can use for estimation purposes. You should experiment with a variety of chunk sizes, taking into account the kinds of content, the length and complexity of user queries and agent tasks, the intended end use, and of course the limits of the models you are using. Try different chunking strategies and sizes with your models and evaluate the results for yourself.
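
The rule of thumb in the preceding tip can be turned into a quick estimate. The following Python sketch converts a character count into a rough token range by using the 3-to-4-characters-per-token approximation for English text. The estimate_token_range helper is hypothetical, and the result is only a ballpark figure; for real limits, use the tokenizer or calculator that your model provider publishes.

```python
import math


def estimate_token_range(num_characters):
    """Estimate (low, high) token counts for English text.

    Uses the approximation of 4 characters per token for the low estimate
    and 3 characters per token for the high estimate.
    """
    if num_characters < 0:
        raise ValueError("num_characters must not be negative")
    low = math.ceil(num_characters / 4)
    high = math.ceil(num_characters / 3)
    return low, high


# A 500-character chunk, which is the Max Characters value used in this step.
print(estimate_token_range(500))  # (125, 167)
```

By this estimate, the 500-character chunks used in this step land near the smaller end of the token sizes mentioned in the tip.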
  1. With the workflow designer active from the previous step, just before the Destination node, click the add (+) icon, and then click Enrich > Chunker.
  2. In the node’s settings pane’s Details tab, select Chunk by Character.
  3. Under Chunk by Character, specify the following settings:
    • Check the box labelled Include Original Elements.
    • Set Max Characters to 500.
    • Set New After N Characters to 400.
    • Set Overlap to 50.
    • Leave Contextual Chunking turned off and Overlap All unchecked.
    What do each of these chunking settings do?
    • Contextual Chunking prepends chunk-specific explanatory context to each chunk, which has been shown to yield significant improvements in downstream retrieval accuracy. Learn more.
    • Include Original Elements outputs into each chunk’s metadata field’s orig_elements value the elements that were used to form that particular chunk. Learn more.
    • Max Characters is the “hard” or maximum number of characters that any one chunk can contain. Unstructured cannot exceed this number when forming chunks. Learn more.
    • New After N Characters is the “soft” or approximate number of characters that any one chunk can contain. Unstructured can exceed this number if needed when forming chunks (but still cannot exceed the Max Characters setting). Learn more.
    • Overlap, when applied (see Overlap All), prepends to the current chunk the specified number of characters from the previous chunk, which can help provide additional context about this chunk relative to the previous chunk. Learn more.
    • Overlap All applies the Overlap setting (if greater than zero) to all chunks. Otherwise, unchecking this box means that the Overlap setting (if greater than zero) is applied only in edge cases where “normal” chunks cannot be formed by combining whole elements. Check this box with caution as it can introduce noise into otherwise clean semantic units. Learn more.
  4. With the “Chinese Characters” PDF file still selected in the Source node, click Test.
  5. In the Test output pane, make sure that Chunker (6 of 6) is showing. If not, click the right arrow (>) until Chunker (6 of 6) appears, which will show the output from the last node in the workflow.
  6. To explore the chunker’s results, search for the text "type": "CompositeElement".
    In the chunked output, where did all of the document elements I saw before, such as Title, Image, and Table, go? During chunking, the document elements that were generated during partitioning are rearranged into chunks. Because some of these document elements can be split into multiple chunks or combined with other elements, the chunked output uses the types CompositeElement and TableChunk instead. You can have Unstructured also output the original document elements that these chunks were derived from by putting them into each chunk’s metadata. To have Unstructured do this, use the Include Original Elements setting, as described in the preceding tip.
  7. Try running this workflow again with the Chunk by Title strategy, as follows:
    a. Click the close (X) button above the output on the right side of the screen.
    b. In the workflow designer, click the Chunker node and then, in the node’s settings pane’s Details tab, select Chunk by Title.
    c. Under Chunk by Title, specify the following settings:
    • Check the box labelled Include Original Elements.
    • Set Max Characters to 500.
    • Set New After N Characters to 400.
    • Set Overlap to 50.
    • Leave Contextual Chunking turned off, leave Combine Text Under N Characters blank, and leave Multipage Sections and Overlap All unchecked.
    What do each of the chunking settings here that were not already described in the preceding tip do?
    • Combine Text Under N Characters combines elements from a section into a chunk until a section reaches a length of this many characters. Learn more.
    • Multipage Sections, when checked, allows sections to span multiple pages. Learn more.
    d. Click Test.
    e. In the Test output pane, make sure that Chunker (6 of 6) is showing. If not, click the right arrow (>) until Chunker (6 of 6) appears, which will show the output from the last node in the workflow.
    f. To explore the chunker’s results, search for the text "type": "CompositeElement". Notice that the lengths of some of the chunks that immediately precede titles might be shortened due to the presence of the title impacting the chunk’s size.
  8. Try running this workflow again with the Chunk by Page strategy, as follows:
    a. Click the close (X) button above the output on the right side of the screen.
    b. In the workflow designer, click the Chunker node and then, in the node’s settings pane’s Details tab, select Chunk by Page.
    c. Under Chunk by Page, specify the following settings:
    • Check the box labelled Include Original Elements.
    • Set Max Characters to 500.
    • Set New After N Characters to 400.
    • Set Overlap to 50.
    • Leave Contextual Chunking turned off, and leave Overlap All unchecked.
    d. Click Test.
    e. In the Test output pane, make sure that Chunker (6 of 6) is showing. If not, click the right arrow (>) until Chunker (6 of 6) appears, which will show the output from the last node in the workflow.
    f. To explore the chunker’s results, search for the text "type": "CompositeElement". Notice that the lengths of some of the chunks that immediately precede page breaks might be shortened due to the presence of the page break impacting the chunk’s size.
  9. Try running this workflow again with the Chunk by Similarity strategy, as follows:
    a. Click the close (X) button above the output on the right side of the screen.
    b. In the workflow designer, click the Chunker node and then, in the node’s settings pane’s Details tab, select Chunk by Similarity.
    c. Under Chunk by Similarity, specify the following settings:
    • Check the box labelled Include Original Elements.
    • Set Max Characters to 500.
    • Set Similarity Threshold to 0.99.
    • Leave Contextual Chunking turned off.
    What does Similarity Threshold mean?
    • The Similarity Threshold is a number between 0 and 1 exclusive (0.01 to 0.99 inclusive).
    • 0.01 means that any two segments of text that are being compared to each other and are considered least identical in semantic meaning to each other are more likely to be combined into the same chunk together, when such combining must occur.
    • 0.99 means that any two segments of text that are being compared to each other and are considered almost identical in semantic meaning to each other are more likely to be combined into the same chunk together, when such combining must occur.
    • Numbers toward 0.01 bias toward least-identical semantic matches, while numbers toward 0.99 bias toward near-identical semantic matches.
    d. Click Test.
    e. In the Test output pane, make sure that Chunker (6 of 6) is showing. If not, click the right arrow (>) until Chunker (6 of 6) appears, which will show the output from the last node in the workflow.
    f. To explore the chunker’s results, search for the text "type": "CompositeElement". Notice that the lengths of many of the chunks fall well short of the Max Characters limit. This is because a similarity threshold of 0.99 means that only sentences or text segments with a near-perfect semantic match will be grouped together into the same chunk. This is an extremely high threshold, resulting in very short, highly specific chunks of text.
    g. If you change Similarity Threshold to 0.01 and run the workflow again, searching for the text "type": "CompositeElement", many of the chunks will now come closer to the Max Characters limit. This is because a similarity threshold of 0.01 provides an extreme tolerance of differences between pieces of text, grouping almost anything together.
  10. When you are done, be sure to click the close (X) button above the output on the right side of the screen, to return to the workflow designer for the next step.
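
To build intuition for how the Max Characters and Overlap settings interact, here is a deliberately simplified Python sketch. It splits one long string into chunks that never exceed a hard maximum and prepends the tail of the previous chunk to each later chunk. This is not Unstructured’s chunking algorithm: the real chunker works on document elements, prefers to keep whole elements together, and also honors settings such as New After N Characters that are not modeled here. The chunk_by_character function and its parameter names are assumptions made for illustration.

```python
def chunk_by_character(text, max_characters=500, overlap=50):
    """Split text into chunks of at most max_characters characters.

    Each chunk after the first starts with the last `overlap` characters
    of the previous chunk.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and less than max_characters")

    chunks = []
    start = 0
    step = max_characters - overlap  # how far each new chunk advances
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


# 1,200 characters of placeholder text.
sample_text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_by_character(sample_text, max_characters=500, overlap=50)
print([len(chunk) for chunk in chunks])  # [500, 500, 300]
```

Note that overlap means the chunk lengths add up to more than the length of the original text, which is one reason that overlapping every chunk can increase storage and embedding costs.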

Step 6: Experiment with embedding

In this step, you generate embeddings for your workflow. Embeddings are vectors of numbers that represent various aspects of the text that is extracted by Unstructured. These vectors are stored or “embedded” next to the text itself in a vector store or vector database. Chatbots, agents, and other AI solutions can use these vector embeddings to more efficiently and effectively find, analyze, and use the associated text. These vector embeddings are generated by an embedding model that is provided by an embedding provider. For the best embedding model to apply to your use case, see the documentation for your target downstream application toolsets.
  1. With the workflow designer active from the previous step, just before the Destination node, click the add (+) icon, and then click Transform > Embedder.
  2. In the node’s settings pane’s Details tab, under Select Embedding Model, for Azure OpenAI, select Text Embedding 3 Small [dim 1536].
  3. With the “Chinese Characters” PDF file still selected in the Source node, click Test.
  4. In the Test output pane, make sure that Embedder (7 of 7) is showing. If not, click the right arrow (>) until Embedder (7 of 7) appears, which will show the output from the last node in the workflow.
  5. To explore the embeddings, search for the text "embeddings".
    What do all of these numbers mean? By themselves, the numbers in the embeddings field of the output have no human-interpretable meaning. However, when combined with the specific text that these numbers are associated with, and the embedding model’s logic that was used to generate these numbers, the numbers in the embeddings field are extremely powerful when leveraged by downstream chatbots, agents, and other AI solutions. These numbers typically represent complex, abstract attributes about the text that are known only to the embedding model that generated these numbers. These attributes can be about the text’s overall sentiment, intent, subject, semantic meaning, grammatical function, relationships between words, or any number of other things that the model is good at figuring out. This is why the embedding model you choose here must be the exact same embedding model that you use in any related chatbot, agent, or other AI solution that relies on these numbers. Otherwise, the numbers that are generated here will not have the same meaning downstream. Also, the number of dimensions (or the number of numbers in the embeddings field) you choose here must be the exact same number of dimensions that you use downstream. To repeat, the name and number of dimensions for the embedding model you choose here must be the exact same name and number of dimensions for the embedding model you use in your related downstream chatbots, agents, and other AI solutions that rely on this particular text and its associated embeddings that were generated here.
  6. When you are done, be sure to click the close (X) button above the output on the right side of the screen, to return to the workflow designer so that you can continue designing things later as you see fit.
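
To show why the number of dimensions has to match downstream, here is a minimal Python sketch of cosine similarity, which is one common way that vector stores and retrieval code compare two embeddings. The source walkthrough does not prescribe a particular comparison method, so treat the choice of cosine similarity as an assumption; the toy three-dimensional vectors are also hypothetical stand-ins for real embeddings such as the 1536-dimension ones selected in this step.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Return the cosine similarity of two embedding vectors.

    Raises ValueError if the vectors have different numbers of dimensions,
    which is what happens conceptually when embedding models are mixed.
    """
    if len(vector_a) != len(vector_b):
        raise ValueError(
            f"dimension mismatch: {len(vector_a)} versus {len(vector_b)}"
        )
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the embeddings of a chunk and of a query.
chunk_embedding = [1.0, 0.0, 0.0]
query_embedding = [1.0, 0.0, 0.0]
print(cosine_similarity(chunk_embedding, query_embedding))  # 1.0
```

Even when two different models happen to produce vectors of the same length, the comparison would run without an error but the score would be meaningless, because each model assigns its own meaning to each position in the vector.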

Next steps

Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, and embeds your source documents, producing context-rich data that is ready for retrieval-augmented generation (RAG), agentic AI, and model fine-tuning. Right now, your workflow only accepts one local file at a time for input. Your workflow also only sends Unstructured’s processed data to your screen or to be saved locally as a JSON file. You can modify your workflow to accept multiple files and data from one or more file storage locations, databases, and vector stores, and to send Unstructured’s processed data to them. To learn how to do this, try one or more of the Unstructured quickstarts. Unstructured also offers an API and SDKs, which allow you to use code to work with Unstructured programmatically instead of only with the UI. For details, see the Unstructured API and SDK documentation. If you are not able to complete any of these quickstarts, contact Unstructured Support at support@unstructured.io.