Quick Start
Installation
To get started with ChemDataWriter, install it with pip. It is currently compatible only with Python versions below 3.10:
pip install chemdatawriter
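Because of the version constraint above, it can help to check the interpreter before installing. The following is a minimal sketch; the helper name `is_supported` is hypothetical and not part of ChemDataWriter.

```python
import sys


def is_supported(version_info=sys.version_info):
    """Return True when the interpreter is older than Python 3.10."""
    # Compare only (major, minor), e.g. (3, 9) < (3, 10).
    return tuple(version_info[:2]) < (3, 10)


if not is_supported():
    print("Warning: this interpreter is Python 3.10 or newer; "
          "consider a virtual environment with an older Python.")
```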
Introduction
ChemDataWriter provides two main scripts for book generation:
run_corpus.py: Generates the corpus of the book.
run_cdw.py: Generates the book itself.
Corpus Retrieval
To generate the corpus, use the run_corpus.py script:
python run_corpus.py --input_path <path_to_XML_files> --save_path <path_to_save_corpus>
Arguments:
input_path: Directory containing the XML files.
save_path: Folder in which the corpus is saved in JSON format.
The resulting JSON file contains entries, each depicting a paper:
title: Title of the paper.
abstract: Abstract of the paper.
introduction: Introduction section of the paper.
conclusion: Conclusion section of the paper.
reference: Formatted reference of the paper.
Note
The script looks for sections whose titles contain “Introduction” and “Conclusion”. A paper without these sections is excluded from the results. Each paper’s details are appended to the output JSON in the order in which the papers are processed.
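The selection rule in the note can be illustrated with a short sketch. This is not the code of run_corpus.py: the function names, the `sections` mapping from section title to text, the case-insensitive title match, and the requirement that both sections be present are all assumptions made for illustration. The sample papers are made up.

```python
from typing import Optional


def find_section(sections: dict, keyword: str) -> Optional[str]:
    """Return the text of the first section whose title contains `keyword`."""
    for title, text in sections.items():
        if keyword in title.lower():
            return text
    return None


def build_entry(paper: dict) -> Optional[dict]:
    """Build a corpus entry, or return None if a required section is missing."""
    intro = find_section(paper["sections"], "introduction")
    conclusion = find_section(paper["sections"], "conclusion")
    if intro is None or conclusion is None:
        return None  # paper is excluded from the corpus
    return {
        "title": paper["title"],
        "abstract": paper["abstract"],
        "introduction": intro,
        "conclusion": conclusion,
        "reference": paper["reference"],
    }


# Hypothetical parsed papers: the second one has no conclusion section.
papers = [
    {"title": "Paper A", "abstract": "Abstract A", "reference": "Ref A",
     "sections": {"1. Introduction": "Intro A", "4. Conclusions": "End A"}},
    {"title": "Paper B", "abstract": "Abstract B", "reference": "Ref B",
     "sections": {"1. Introduction": "Intro B", "2. Methods": "Methods B"}},
]

# Entries are appended in processing order; excluded papers are skipped.
corpus = [entry for entry in map(build_entry, papers) if entry is not None]
print([entry["title"] for entry in corpus])
```

Only "Paper A" survives here, since "Paper B" has no section with “Conclusion” in its title.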
Book Generation
To generate the book, use the run_cdw.py script:
python run_cdw.py --input_path <path_to_JSON_files> --keywords <keyword1 keyword2> --topic_words <topic_word1 topic_word2> --chapter_size <int> --save_path <path_to_save_corpus> --cache_path <path_to_cache> --title_generator_hf <hf_model_name>
Arguments:
input_path: Path to the JSON files of papers.
keywords: Keywords to filter and screen papers.
topic_words: Topic words that define each chapter.
chapter_size: Number of papers included per chapter.
save_path: Directory where the generated corpus will be saved.
cache_path: Cache directory for storing papers to be summarized.
title_generator_hf: Name of the model used for title generation.
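When the script is driven from another Python program, the command line can be assembled as an argument list. The sketch below uses only the flags documented above; the helper name `build_cdw_command` and all values in the example call are hypothetical, and passing several keywords or topic words as separate space-delimited tokens is an assumption based on the placeholder syntax in the command.

```python
import shlex
import sys


def build_cdw_command(input_path, keywords, topic_words, chapter_size,
                      save_path, cache_path, title_generator_hf):
    """Return the run_cdw.py invocation as a list of arguments."""
    return [
        sys.executable, "run_cdw.py",
        "--input_path", str(input_path),
        "--keywords", *keywords,        # one token per keyword (assumed)
        "--topic_words", *topic_words,  # one token per topic word (assumed)
        "--chapter_size", str(chapter_size),
        "--save_path", str(save_path),
        "--cache_path", str(cache_path),
        "--title_generator_hf", title_generator_hf,
    ]


# Placeholder values for illustration only.
cmd = build_cdw_command(
    input_path="corpus/",
    keywords=["battery", "cathode"],
    topic_words=["lithium", "sodium"],
    chapter_size=10,
    save_path="book/",
    cache_path="cache/",
    title_generator_hf="example-org/example-model",
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids shell quoting problems when a keyword or path contains spaces.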
The output JSON file consists of entries that represent individual sections of research books. Each section contains:
id: Unique identifier.
ref: Formatted reference of the paper.
intro: Introduction of the paper.
sum_intro: Summarized introduction.
para_intro: Paraphrased summary of the introduction.
short_title: Generated short title for the paper.
para_title: Paraphrased short title.
intros: All the introductions from the paper.
sum_intros: Summaries of all introductions.
para_intros: Paraphrased summaries of all introductions.
conclusion: Conclusion of the paper.
sum_conclusion: Summarized conclusion.
para_conclusion: Paraphrased summary of the conclusion.
abstract: Abstract of the paper.
sum_abstract: Summarized abstract.
para_abstract: Paraphrased summary of the abstract.
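Most of the fields listed above follow one naming pattern: a raw field plus `sum_` and `para_` variants. A minimal sketch of a completeness check built on that pattern follows; the helper `missing_fields` and the sample entries are hypothetical, and the sketch assumes each entry is available as a Python dict after loading the output JSON.

```python
# Raw fields that each have a summarized and a paraphrased variant.
BASE_FIELDS = ("intro", "intros", "conclusion", "abstract")

# Fields that do not follow the raw/sum_/para_ triple pattern.
EXPECTED = {"id", "ref", "short_title", "para_title"}
for base in BASE_FIELDS:
    EXPECTED.update({base, f"sum_{base}", f"para_{base}"})


def missing_fields(entry: dict) -> set:
    """Return the documented field names that are absent from `entry`."""
    return EXPECTED.difference(entry)


# Hypothetical entries: one complete, one lacking a single field.
complete = {name: "placeholder" for name in EXPECTED}
partial = {k: v for k, v in complete.items() if k != "para_abstract"}

print(sorted(missing_fields(complete)))
print(sorted(missing_fields(partial)))
```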