Quick Start

Installation

Install ChemDataWriter with pip. It currently supports Python versions below 3.10:

pip install chemdatawriter

Introduction

ChemDataWriter provides two main scripts for book generation:

run_corpus.py: Generates the corpus of the book.

run_cdw.py: Generates the book itself.

Corpus Retrieval

To generate the corpus, run the run_corpus.py script:

python run_corpus.py --input_path <path_to_XML_files> --save_path <path_to_save_corpus>

Arguments:

input_path: Directory containing the XML files.

save_path: Folder where the corpus is saved in JSON format.

The resulting JSON file contains one entry per paper, each with the following fields:

title: Title of the paper.

abstract: Abstract of the paper.

introduction: Introduction section of the paper.

conclusion: Conclusion section of the paper.

reference: Formatted reference of the paper.

Note

The script looks for sections whose titles contain “Introduction” and “Conclusion”. Papers without these sections are excluded from the results. The details of each remaining paper are appended to the output JSON in the order the papers are processed.
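As a minimal sketch of how a generated corpus might be inspected, the snippet below checks that an entry carries the five fields listed above. It assumes the corpus file is a JSON list of objects, which this page does not confirm. The helper names `load_corpus` and `missing_fields` are hypothetical and are not part of ChemDataWriter.

```python
import json
from pathlib import Path

# Field names taken from the corpus description above.
CORPUS_FIELDS = ("title", "abstract", "introduction", "conclusion", "reference")


def missing_fields(entry):
    """Return the expected field names that are absent from one corpus entry."""
    return [name for name in CORPUS_FIELDS if name not in entry]


def load_corpus(path):
    """Load a corpus file. Assumption: the file holds a JSON list of paper objects."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


# Illustrative entry only; real entries are produced by run_corpus.py.
sample = {
    "title": "Example title",
    "abstract": "Example abstract",
    "introduction": "Example introduction",
    "conclusion": "Example conclusion",
    "reference": "Example reference",
}
print(missing_fields(sample))
```

If the real file turns out to be structured differently (for example, a mapping keyed by paper), only `load_corpus` would need to change.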

Book Generation

To generate the book, run the run_cdw.py script:

python run_cdw.py --input_path <path_to_XML_files> --keywords <keyword1 keyword2> --topic_words <topic_word1 topic_word2> --chapter_size <int> --save_path <path_to_save_corpus> --cache_path <path_to_cache> --title_generator_hf <hf_model_name>

Arguments:

input_path: Path to the JSON files of papers.

keywords: Keywords to filter and screen papers.

topic_words: Topic words that define each chapter.

chapter_size: Number of papers included per chapter.

save_path: Directory where the generated corpus will be saved.

cache_path: Cache directory for storing papers to be summarized.

title_generator_hf: Name of the model used for title generation.
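When the script is launched from another Python program, the command can be assembled as an argument list. The sketch below uses only the flags documented above. `build_cdw_command` is a hypothetical helper, all example values are placeholders, and it assumes that multiple keywords or topic words are passed as separate tokens after their flag, as the space-separated placeholders in the command suggest.

```python
import sys


def build_cdw_command(input_path, keywords, topic_words, chapter_size,
                      save_path, cache_path, title_generator_hf):
    """Assemble the argument list for run_cdw.py from the documented flags."""
    return [
        sys.executable, "run_cdw.py",
        "--input_path", str(input_path),
        # Assumption: each keyword or topic word is its own token after the flag.
        "--keywords", *keywords,
        "--topic_words", *topic_words,
        "--chapter_size", str(chapter_size),
        "--save_path", str(save_path),
        "--cache_path", str(cache_path),
        "--title_generator_hf", str(title_generator_hf),
    ]


# Placeholder values for illustration; substitute real paths and a real model name.
cmd = build_cdw_command(
    input_path="papers",
    keywords=["battery"],
    topic_words=["cathode", "anode"],
    chapter_size=5,
    save_path="book_output",
    cache_path="cache",
    title_generator_hf="placeholder-model-name",
)
print(" ".join(cmd))
# To launch the script: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single shell string avoids quoting problems when paths contain spaces.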

The output JSON file consists of entries that represent individual sections of research books. Each section contains:

id: Unique identifier.

ref: Formatted reference of the paper.

intro: Introduction of the paper.

sum_intro: Summarized introduction.

para_intro: Paraphrased summary of the introduction.

short_title: Generated short title for the paper.

para_title: Paraphrased short title.

intros: All the introductions from the paper.

sum_intros: Summaries of all introductions.

para_intros: Paraphrased summaries of all introductions.

conclusion: Conclusion of the paper.

sum_conclusion: Summarized conclusion.

para_conclusion: Paraphrased summary of the conclusion.

abstract: Abstract of the paper.

sum_abstract: Summarized abstract.

para_abstract: Paraphrased summary of the abstract.
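To illustrate how these fields might be consumed, the sketch below joins the paraphrased fields of one section into plain text. It assumes every field value is a string, which this page does not state, and `render_section` is a hypothetical helper rather than part of ChemDataWriter.

```python
def render_section(section):
    """Join the paraphrased fields of one book section into plain text.

    Assumption: every field value is a string. Missing or empty fields are skipped.
    """
    # Prefer the paraphrased title and fall back to the generated short title.
    title = section.get("para_title") or section.get("short_title", "")
    parts = [
        title,
        section.get("para_abstract", ""),
        section.get("para_intro", ""),
        section.get("para_conclusion", ""),
        section.get("ref", ""),
    ]
    return "\n\n".join(part for part in parts if part)


# Illustrative section only; real sections are produced by run_cdw.py.
example = {
    "id": 0,
    "short_title": "Example short title",
    "para_intro": "Example paraphrased introduction.",
    "ref": "Example reference",
}
print(render_section(example))
```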