Overview

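This tutorial will familiarize you with LangChain's embedding and vector store abstractions. These abstractions are designed to support retrieval of data from (vector) databases and other sources, and to integrate it into LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, as in retrieval-augmented generation, or RAG. Here we build a search engine over a PDF document, which lets us retrieve passages in the PDF that are similar to an input query. The guide also includes a minimal RAG implementation on top of the search engine.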

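Concepts

This guide focuses on retrieval of text data. It covers the following concepts: documents, text splitting, embeddings, vector stores, and retrievers.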

Setup

Installation

This tutorial uses the pypdf package to read PDFs:
pip install pypdf
For more details, see the Installation guide.

LangSmith

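Many of the applications you build with LangChain will contain multiple steps with multiple LLM calls. As these applications get more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with LangSmith. After signing up for LangSmith, set the following environment variables to start logging traces: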
export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="..."
Or, if you are in a notebook, you can set them like this:
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

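1. Documents

LangChain implements a Document abstraction, which represents a unit of text and its associated metadata. It has three attributes:
  • page_content: a string representing the content;
  • metadata: a dict containing arbitrary metadata;
  • id: (optional) a string identifier for the document.
The metadata attribute can capture the source of the document, its relationship to other documents, and other information. Note that an individual Document object often represents a chunk of a larger document.
Sample documents can be generated when needed: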
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]
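To make the shape of this abstraction concrete, here is a minimal sketch of a document record with the same three attributes. `SimpleDocument` is a hypothetical stand-in written for illustration, not LangChain's `Document` class, and the `report.txt` source and `chunk` key are made-up metadata.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for illustration; not LangChain's Document class."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None


# Two chunks that came from the same (made-up) source file.
chunks = [
    SimpleDocument("First part of the text.", {"source": "report.txt", "chunk": 0}),
    SimpleDocument("Second part of the text.", {"source": "report.txt", "chunk": 1}),
]

# Metadata lets you trace each chunk back to where it came from.
sources = {doc.metadata["source"] for doc in chunks}
print(sources)
```

Keeping the source in metadata rather than in the text is what later allows a search result to point back to a specific file and page.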

2. Embeddings

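Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use a vector similarity metric (such as cosine similarity) to identify related text. LangChain supports embeddings from dozens of providers. These models specify how text should be converted into a numeric vector. Let's select a model: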
pip install -U "langchain-openai"
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector_1 = embeddings.embed_query(documents[0].page_content)
vector_2 = embeddings.embed_query(documents[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])
Generated vectors of length 1536

[-0.008586574345827103, -0.03341241180896759, -0.008936782367527485, -0.0036674530711025, 0.010564599186182022, 0.009598285891115665, -0.028587326407432556, -0.015824200585484505, 0.0030416189692914486, -0.012899317778646946]
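Cosine similarity, the metric mentioned above, can be computed directly from two vectors of equal length. The following is a minimal pure-Python sketch for illustration; the function name `cosine_similarity` and the short example vectors are assumptions, and real vector stores typically use optimized numeric libraries instead.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

With real embeddings you would pass in vectors such as `vector_1` and `vector_2` from the example above.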
Armed with a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.

3. Vector stores

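LangChain VectorStore objects contain methods for adding text and Document objects to the store, and for querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors. LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider and require specific credentials to use; some run in separate infrastructure that can be run locally or via a third party; others can run in memory for lightweight workloads. Let's select a vector store: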
pip install -U "langchain-core"
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)
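Conceptually, an in-memory store keeps each text next to its vector and compares a query vector against all of them. The sketch below illustrates that idea in pure Python. It is not LangChain's implementation: `TinyVectorStore`, `toy_embed` (a letter-count "embedding"), and `cosine` are hypothetical names invented for this example.

```python
import math
from collections import Counter


def toy_embed(text: str) -> list[float]:
    """Hypothetical embedding: counts of the letters a-z (illustration only)."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return [float(counts.get(chr(ord("a") + i), 0)) for i in range(26)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Brute-force store: embeds texts on insert, scans all entries on search."""

    def __init__(self, embed):
        self.embed = embed
        self.entries: list[tuple[str, list[float]]] = []

    def add_texts(self, texts: list[str]) -> None:
        for text in texts:
            self.entries.append((text, self.embed(text)))

    def similarity_search(self, query: str, k: int = 4) -> list[tuple[str, float]]:
        query_vec = self.embed(query)
        scored = [(text, cosine(query_vec, vec)) for text, vec in self.entries]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
        return scored[:k]


store = TinyVectorStore(toy_embed)
store.add_texts(["cat cat cat", "dog dog dog"])
top = store.similarity_search("cat", k=1)
print(top)
```

A linear scan like this is fine for small collections; dedicated vector databases add indexing so that search stays fast as the collection grows.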

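Populating the vector store

Let's populate the store with the contents of a PDF. The example PDF is Nike's 10-K filing for 2023. A small helper reads the PDF directly, and the pages are then split into smaller chunks before indexing.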
import pypdf
from langchain_core.documents import Document


# Below is a minimal helper for demonstration purposes.
def load_pdf_pages(file_path: str) -> list[Document]:
    reader = pypdf.PdfReader(file_path)
    return [
        Document(
            page_content=page.extract_text() or "",
            metadata={"source": file_path, "page": i},
        )
        for i, page in enumerate(reader.pages)
    ]


file_path = "../example_data/nke-10k-2023.pdf"
docs = load_pdf_pages(file_path)
print(len(docs))
107
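Page-level granularity may be too coarse for retrieval and downstream question answering. Splitting further helps ensure that the meaning of relevant portions of the document is not "washed out" by surrounding text. Here we use RecursiveCharacterTextSplitter, which recursively splits the document using common separators such as new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases. We set add_start_index=True so that the character index where each split Document starts within the initial Document is preserved as the metadata attribute start_index.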
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

print(len(all_splits))
516
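To see how chunk size, overlap, and start indices relate, here is a simplified sketch that cuts text into fixed-size windows. It is deliberately not the recursive, separator-aware algorithm that RecursiveCharacterTextSplitter uses; `split_with_overlap` is a hypothetical helper that only illustrates the roles of `chunk_size`, `chunk_overlap`, and `start_index`.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append({"start_index": start, "text": text[start : start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


pieces = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
for piece in pieces:
    print(piece)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.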
The chunks can now be indexed into the vector store.
ids = vector_store.add_documents(documents=all_splits)
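Note that most vector store implementations allow you to connect to an existing vector store, for example by providing a client, index name, or other information. See the documentation for a specific integration for more detail. Once a VectorStore containing documents has been instantiated, it can be queried. VectorStore includes methods for querying:
  • synchronously and asynchronously;
  • by string query and by vector;
  • with and without returning similarity scores;
  • by similarity and by maximum marginal relevance (to balance similarity to the query with diversity in the retrieved results).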
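The last item in the list, maximum marginal relevance (MMR), can be sketched as a greedy selection that rewards similarity to the query and penalizes similarity to results already chosen. The code below is an illustrative pure-Python version, not a library implementation; `mmr_select`, `lambda_mult`, and the tiny 2-D vectors are assumptions made for this example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices.

    lambda_mult=1.0 ranks purely by similarity to the query;
    smaller values put more weight on diversity.
    """
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = candidates[0], -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.2], [1.0, -1.0]]  # docs 0 and 1 are near-duplicates
by_similarity = mmr_select(query, docs, k=2, lambda_mult=1.0)
by_mmr = mmr_select(query, docs, k=2, lambda_mult=0.5)
print(by_similarity, by_mmr)
```

In this toy data, pure similarity returns the two near-duplicates, while the MMR setting swaps the second one for the less similar but more distinct document.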
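The output of these methods generally includes a list of Document objects.

Usage

Embeddings typically represent text as a "dense" vector, such that texts with similar meanings are geometrically close. This lets us retrieve relevant information just by passing in a question, without knowing any specific keywords used in the document.

Return documents based on similarity to a string query: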
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(results[0])
page_content='direct to consumer operations sell products through the following number of retail stores in the United States:
U.S. RETAIL STORES NUMBER
NIKE Brand factory stores 213
NIKE Brand in-line stores (including employee-only stores) 74
Converse stores (including factory stores) 82
TOTAL 369
In the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.
2023 FORM 10-K 2' metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}
Async query:
results = await vector_store.asimilarity_search("When was Nike incorporated?")

print(results[0])
page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}
Return scores:
# Note that providers implement different scores; the score here
# is a distance metric that varies inversely with similarity.

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)
Score: 0.23699893057346344

page_content='Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS
The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.
The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,
2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This
increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale
equivalent basis.' metadata={'page': 35, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}
Return documents based on similarity to an embedded query:
embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])
page_content='Table of Contents
GROSS MARGIN
FISCAL 2023 COMPARED TO FISCAL 2022
For fiscal 2023, our consolidated gross profit increased 4% to $22,292 million compared to $21,479 million for fiscal 2022. Gross margin decreased 250 basis points to
43.5% for fiscal 2023 compared to 46.0% for fiscal 2022 due to the following:
*Wholesale equivalent
The decrease in gross margin for fiscal 2023 was primarily due to:
•Higher NIKE Brand product costs, on a wholesale equivalent basis, primarily due to higher input costs and elevated inbound freight and logistics costs as well as
product mix;
•Lower margin in our NIKE Direct business, driven by higher promotional activity to liquidate inventory in the current period compared to lower promotional activity in
the prior period resulting from lower available inventory supply;
•Unfavorable changes in net foreign currency exchange rates, including hedges; and
•Lower off-price margin, on a wholesale equivalent basis.
This was partially offset by:' metadata={'page': 36, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}

4. Retrievers

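LangChain VectorStore objects do not subclass Runnable. LangChain Retrievers are Runnables, so they implement a standard set of methods (for example, synchronous and asynchronous invoke and batch operations). Although retrievers can be constructed from vector stores, they can also interface with non-vector-store sources of data (such as external APIs). We can create a simple version of this ourselves, without subclassing Retriever. Once we choose which method to use for retrieving documents, creating a runnable is straightforward. Below we build one around the similarity_search method: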
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)
[[Document(metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')],
 [Document(metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}, page_content='Table of Contents\nPART I\nITEM 1. BUSINESS\nGENERAL\nNIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.\nOur principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is\nthe largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores\nand sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales')]]
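The value of wrapping a function as a runnable is the uniform interface: the same object answers one query with invoke and many with batch. The following sketch shows that idea in isolation. `MiniRunnable` and `fake_search` are hypothetical names for illustration, the two-sentence corpus is made up from facts in the outputs above, and none of this reflects how LangChain's `@chain` decorator is implemented.

```python
from typing import Callable, Generic, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


class MiniRunnable(Generic[In, Out]):
    """Hypothetical wrapper giving any one-argument function invoke() and batch()."""

    def __init__(self, func: Callable[[In], Out]):
        self.func = func

    def invoke(self, value: In) -> Out:
        return self.func(value)

    def batch(self, values: list[In]) -> list[Out]:
        # One result per input, in the same order as the inputs.
        return [self.invoke(value) for value in values]


CORPUS = [
    "NIKE, Inc. was incorporated in 1967.",
    "NIKE has eight significant distribution centers.",
]


def fake_search(keyword: str) -> list[str]:
    """Stand-in for a real search call: plain keyword match over a tiny corpus."""
    return [text for text in CORPUS if keyword.lower() in text.lower()]


mini_retriever = MiniRunnable(fake_search)
single = mini_retriever.invoke("incorporated")
many = mini_retriever.batch(["incorporated", "distribution"])
print(single)
print(many)
```

Because callers only depend on invoke and batch, the search function behind the wrapper can be swapped for a vector store, an external API, or anything else without changing the calling code.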
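Vector stores implement an as_retriever method that generates a Retriever, specifically a VectorStoreRetriever. These retrievers include specific search_type and search_kwargs attributes that identify which methods of the underlying vector store to call and how to parameterize them. For instance, we can replicate the behavior above with the following: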
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)
[[Document(metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')],
 [Document(metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}, page_content='Table of Contents\nPART I\nITEM 1. BUSINESS\nGENERAL\nNIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.\nOur principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is\nthe largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores\nand sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales')]]
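VectorStoreRetriever supports search types of "similarity" (default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold". The latter can be used to threshold the documents output by the retriever by similarity score. Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. To learn more about building such an application, see the RAG tutorial.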
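The two ideas in the paragraph above, thresholding by score and combining retrieved context with a question, can be sketched without any model call. The example assumes scores where higher means more similar; `filter_by_score`, `build_prompt`, the prompt wording, and the score values are all made up for illustration and are not part of LangChain.

```python
def filter_by_score(scored_chunks, score_threshold):
    """Keep (text, score) pairs whose score meets the threshold (higher = more similar)."""
    return [(text, score) for text, score in scored_chunks if score >= score_threshold]


def build_prompt(question, scored_chunks):
    """Join the retrieved texts into a context block and append the question."""
    context = "\n\n".join(text for text, _ in scored_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Illustrative scores only; real values depend on the embedding model and store.
scored = [
    ("NIKE, Inc. was incorporated in 1967.", 0.82),
    ("Gross margin decreased 250 basis points.", 0.31),
]
kept = filter_by_score(scored, score_threshold=0.5)
prompt = build_prompt("When was Nike incorporated?", kept)
print(prompt)
```

The resulting string is what a RAG application would send to the LLM; the threshold keeps weakly related chunks out of the context.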

Next steps

You now know how to build a semantic search engine over a PDF document. For more, see the documentation on embeddings, vector stores, and RAG.