From Biblical Texts to AI Systems: Unexpected Lessons in Structure

ISCOL 2025 Conference, Bar-Ilan University, 18-Dec-2025, Poster Presentation

TwoHillsLab: A Scalable Platform for Quantitative Analysis of Biblical Hebrew Text Revealing Structures Relevant to Retrieval-Augmented Generation (RAG)

Guy Shaked

Bar-Ilan University – Dept. of Jewish History and Contemporary Jewry; Two Hills Lab

Abstract

TwoHillsLab (TwoHillsLab.org) is a free web-based platform for large-scale quantitative analysis of the Hebrew Bible and related corpora. We explore chapter-initial “opening-word” regularities as segmentation cues. Our findings suggest that biblical segmentation conventions may be analogous to certain text-chunking and boundary-setting steps used in retrieval-augmented generation (RAG) workflows for LLM-based chatbots.

Motivation

Digitized biblical texts invite computational linguistics approaches to complement close reading with scalable, interpretable statistics. The goal is not to replace philology, but to provide global signals that (i) highlight structure, (ii) surface anomalies worth investigating, and (iii) support reproducible comparisons across books, genres, and traditions.

A Surprising New Result: Chapter-Opening Words as Segmentation Cues

Using TwoHillsLab for exploratory analysis, we extracted chapter-initial words and observed three recurring general patterns:

Strong repetition runs: many consecutive chapters open with the same formula, sometimes with small variations.

Example (Leviticus/Vayikra 11–27): repeated וַיְדַבֵּר (“and [He] spoke”) with occasional variants.

וַיְדַבֵּ֧ר וַיְדַבֵּ֥ר וַיְדַבֵּ֣ר וַיְדַבֵּ֥ר וַיְדַבֵּ֣ר וַיְדַבֵּ֤ר וַיְדַבֵּ֥ר וַיְדַבֵּ֥ר וַיְדַבֵּ֥ר וַיְדַבֵּ֥ר וַיֹּ֤אמֶר וַיְדַבֵּ֥ר וַיְדַבֵּ֥ר וַיְדַבֵּ֥ר וַיְדַבֵּ֤ר לֹֽא־תַעֲשׂ֨וּ וַיְדַבֵּ֥ר

Repetition with interruptions: a dominant opening word recurs, but is periodically interrupted by a different opener.

Example (Isaiah 13–24): recurrent מַשָּׂא (“oracle/burden”) with intermittent alternations.

מַשָּׂ֖א כִּי֩ מַשָּׂ֖א שִׁלְחוּ־כַ֥ר מַשָּׂ֖א ה֥וֹי מַשָּׂ֖א בִּשְׁנַ֨ת מַשָּׂ֖א מַשָּׂ֖א מַשָּׂ֖א הִנֵּ֧ה

Low repetition / no visible opener pattern: chapter openings vary widely with no obvious repeated cue.

Example (2 Chronicles 1–11):

וַיִּתְחַזֵּ֛ק וַיִּסְפֹּ֨ר וַיָּ֣חֶל וַיַּ֙עַשׂ֙ וַתִּשְׁלַם֙ אָ֖ז וּכְכַלּ֤וֹת וַיְהִ֞י וּמַֽלְכַּת־שְׁבָ֗א וַיֵּ֥לֶךְ וַיָּבֹ֣א

These opening-word regularities can be treated computationally as boundary markers and compared across books, genres, and editorial strata.
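
As a minimal sketch of what treating openers as boundary markers could look like, the Python below strips vowel points and cantillation accents so that accent variants of the same word compare equal, then reports the dominant opener, its share of the chapters, and the longest uninterrupted run. The function names and the 0.8 / 0.4 thresholds are illustrative assumptions chosen for this example; they are not taken from TwoHillsLab.

```python
import unicodedata
from collections import Counter
from itertools import groupby


def strip_marks(word):
    """Drop vowel points and cantillation accents (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def opener_stats(openers):
    """Return (dominant opener, its share, longest run of identical openers)."""
    if not openers:
        raise ValueError("need at least one opener")
    forms = [strip_marks(w) for w in openers]
    dominant, count = Counter(forms).most_common(1)[0]
    longest_run = max(len(list(group)) for _, group in groupby(forms))
    return dominant, count / len(forms), longest_run


def classify(openers, strong=0.8, weak=0.4):
    """Map a sequence onto the three patterns (hypothetical thresholds)."""
    _, share, _ = opener_stats(openers)
    if share >= strong:
        return "strong repetition run"
    if share >= weak:
        return "repetition with interruptions"
    return "low repetition"


# Chapter openers of Leviticus 11-27, copied from the example above.
LEVITICUS_11_27 = (
    "וַיְדַבֵּ֧ר וַיְדַבֵּ֥ר וַיְדַבֵּ֣ר וַיְדַבֵּ֥ר וַיְדַבֵּ֣ר וַיְדַבֵּ֤ר "
    "וַיְדַבֵּ֥ר וַיְדַבֵּ֥ר וַיְדַבֵּ֥ר וַיְדַבֵּ֥ר וַיֹּ֤אמֶר וַיְדַבֵּ֥ר "
    "וַיְדַבֵּ֥ר וַיְדַבֵּ֥ר וַיְדַבֵּ֤ר לֹֽא־תַעֲשׂ֨וּ וַיְדַבֵּ֥ר"
).split()

print(opener_stats(LEVITICUS_11_27))
print(classify(LEVITICUS_11_27))
```

Removing the marks matters because the same opener carries different accents from chapter to chapter, so a raw string comparison would undercount repetitions.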

This observation can be framed as:

Biblical composition and redaction often employ human-legible boundary cues, for example sequences of chapters marked by recurring opening words. These cues parallel the modern NLP practice of segmenting long documents into chunks for retrieval-augmented generation (RAG).
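
One possible reading of this parallel, sketched in Python under stated assumptions: consecutive text units are grouped into a single retrieval chunk, and a new chunk begins whenever a unit opens with a known cue word. The function name chunk_by_opener, the (label, text) unit format, and the choice of cue words are hypothetical and serve only to illustrate the idea.

```python
import unicodedata


def strip_marks(word):
    """Drop vowel points and cantillation accents (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def chunk_by_opener(units, cue_words):
    """Group consecutive (label, text) units into chunks.

    A unit whose first word (marks removed) is a cue word starts a new chunk;
    every other unit is appended to the chunk currently being built.
    """
    cues = {strip_marks(w) for w in cue_words}
    chunks = []
    for label, text in units:
        words = text.split()
        first = strip_marks(words[0]) if words else ""
        if first in cues or not chunks:
            chunks.append([])
        chunks[-1].append((label, text))
    return chunks


# Placeholder units: only the opening word is meaningful here, the rest of
# each text is dummy filler rather than real verse content.
units = [
    ("ch1", "ORACLE first unit"),
    ("ch2", "for second unit"),
    ("ch3", "ORACLE third unit"),
    ("ch4", "ORACLE fourth unit"),
    ("ch5", "behold fifth unit"),
]
for chunk in chunk_by_opener(units, ["ORACLE"]):
    print([label for label, _ in chunk])
```

In a RAG pipeline each resulting chunk would then be embedded and indexed as one retrieval unit; whether such cue-driven chunks actually help retrieval is an open question rather than a result reported here.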

Limitations

Not all books or passages align cleanly with global patterns; deviations are expected and often the most informative.

Chapter boundaries are convenient but not always linguistically optimal.

Ongoing & Future Work

Segmentation research: quantify opener-based boundary cues and test whether they improve retrieval quality.
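
A hedged sketch of how such a test might be set up: the Python below compares two chunkings of the same toy text with recall@k, using a deliberately simple token-overlap scorer in place of a real retriever. The scorer, the function names, and the toy data are all assumptions made for illustration; an actual study would use real passages, real queries, and an embedding-based retriever.

```python
def overlap_score(query, chunk):
    """Toy lexical scorer: number of distinct query tokens found in the chunk."""
    return len(set(query.split()) & set(chunk.split()))


def recall_at_k(chunks, queries, k=1):
    """Fraction of queries whose gold passage lies inside a top-k chunk.

    queries is a list of (query_text, gold_passage) pairs; a hit means the
    gold passage is a substring of one of the k best-scoring chunks.
    """
    if not queries:
        raise ValueError("need at least one query")
    hits = 0
    for query, gold in queries:
        ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
        if any(gold in c for c in ranked[:k]):
            hits += 1
    return hits / len(queries)


# Toy document "A1 a b c A2 d e f", where A1 and A2 stand in for opener cues.
opener_chunks = ["A1 a b c", "A2 d e f"]    # split at the cues
fixed_chunks = ["A1 a b", "c A2 d", "e f"]  # split every three tokens
queries = [("a c", "a b c")]

print(recall_at_k(opener_chunks, queries))  # 1.0
print(recall_at_k(fixed_chunks, queries))   # 0.0
```

In this toy case the fixed-size split cuts the gold passage in two, so no single chunk contains it; the cue-based split keeps it whole. This shows only how the comparison could be scored and says nothing about real corpora.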

Conclusion

TwoHillsLab shows that minimalist, interpretable statistics can uncover large-scale literary structure in the Hebrew Bible and provide quantitative perspectives on genre and authorship-related questions. The platform also highlights how editorial boundary cues (e.g., recurring chapter openers in the Bible) may be leveraged in modern NLP workflows, especially retrieval-oriented pipelines such as RAG that are used by chatbots.

Finally, these initial parallels suggest a broader research agenda: if one structural parameter aligns with practices in AI, it is worth testing whether additional features of biblical composition and redaction might likewise inspire, refine, or help evaluate how RAG systems and chatbot architectures are designed.

TwoHillsLab: Digital Humanities Platform

"Beauty and Meaning Through Structure"
