I require a Java command-line program that automatically extracts information (content/sentences) between bookmarks in a PDF.
The program should use either Apache PDFBox or Apache Tika.
The program should do the following:
a) java -jar extractContent PDFName Bookmark
Extract the content between bookmarks and print it to the screen. A Bookmark name will be provided on the command line (e.g. Background), and the program should extract the text between that Bookmark and the next Bookmark. Note: Bookmarks may have several levels, so the text must be extracted between Bookmarks on the same level.
b) java -jar extractContent PDFName Bookmark Keyword
If a keyword is provided, the program should extract only the paragraph (or paragraphs) within that Bookmark section that contain the keyword.
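To illustrate what "the next Bookmark on the same level" means, here is a minimal sketch in plain Java. It does not use PDFBox or Tika: OutlineNode is a hypothetical stand-in for whatever outline item type the chosen library provides, and the sample tree (with Study Population and Data nested under Methods) is only my assumption about how such a PDF might be bookmarked. Matching the name case-insensitively is also an assumption, and what should happen when the requested Bookmark is the last one on its level is left open here (the sketch returns null).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a PDF outline entry; not a PDFBox or Tika class. */
class OutlineNode {
    final String title;
    final List<OutlineNode> children = new ArrayList<>();
    OutlineNode parent;

    OutlineNode(String title) {
        this.title = title;
    }

    /** Appends a child bookmark and returns this node so calls can be chained. */
    OutlineNode add(OutlineNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }
}

class BookmarkSectionSketch {
    /** Depth-first search for a bookmark by title (case-insensitive, an assumption). */
    static OutlineNode find(OutlineNode node, String title) {
        if (node.title.equalsIgnoreCase(title)) {
            return node;
        }
        for (OutlineNode child : node.children) {
            OutlineNode hit = find(child, title);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    /** Returns the next sibling of the given bookmark, or null if it is the last on its level. */
    static OutlineNode nextSameLevel(OutlineNode node) {
        if (node.parent == null) {
            return null;
        }
        List<OutlineNode> siblings = node.parent.children;
        int index = siblings.indexOf(node);
        return index + 1 < siblings.size() ? siblings.get(index + 1) : null;
    }

    /** Illustrative outline only; the real bookmark structure comes from the PDF. */
    static OutlineNode sampleOutline() {
        OutlineNode methods = new OutlineNode("Methods")
                .add(new OutlineNode("Study Population"))
                .add(new OutlineNode("Data"));
        return new OutlineNode("root")
                .add(new OutlineNode("Background"))
                .add(methods)
                .add(new OutlineNode("Results"));
    }

    public static void main(String[] args) {
        OutlineNode root = sampleOutline();
        OutlineNode start = find(root, "Methods");
        OutlineNode end = nextSameLevel(start);
        System.out.println(start.title + " .. " + (end == null ? "(end of level)" : end.title));
    }
}
```

In the real program the start and end bookmarks found this way would bound the text that is extracted; the nested bookmarks under the start bookmark are not treated as the end of the section.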
The type of PDF that I am interested in can be found at: [login to view URL]
Deliverables include the following:
1. Source code with documentation
2. Jar file
Test Cases using the PDF at [login to view URL]:
a) java -jar extractContent PDFName Methods (extracts the text between the bookmarks Methods and Results, i.e. same level)
b) java -jar extractContent PDFName "Study Population" (extracts the text between the bookmarks Study Population and Data, i.e. same level)
c) java -jar extractContent PDFName Methods SPSS (extracts the paragraph in the Methods section that contains the keyword SPSS. Note: there may be more than one paragraph that contains the keyword, in which case all of them should be printed).
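For test case c), the keyword step could look roughly like the sketch below. It works on a plain String, so it assumes the section text has already been extracted and that paragraphs are separated by blank lines, which may not hold for every PDF. The class and method names are hypothetical, the case-insensitive match is my assumption, and the sample text is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class KeywordParagraphSketch {
    /**
     * Returns every paragraph of sectionText that contains the keyword.
     * Paragraphs are assumed to be separated by one or more blank lines.
     */
    static List<String> paragraphsContaining(String sectionText, String keyword) {
        List<String> hits = new ArrayList<>();
        String needle = keyword.toLowerCase(Locale.ROOT);
        // Split where two line breaks are separated only by whitespace.
        for (String paragraph : sectionText.split("\\R\\s*\\R")) {
            String trimmed = paragraph.trim();
            if (!trimmed.isEmpty() && trimmed.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(trimmed);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for an extracted Methods section.
        String section = "Participants were recruited by mail.\n\n"
                + "Data were analysed with SPSS.\nMissing values were excluded.\n\n"
                + "Ethics approval was obtained.\n\n"
                + "Figures were produced with spss macros.";
        for (String paragraph : paragraphsContaining(section, "SPSS")) {
            System.out.println(paragraph);
            System.out.println();
        }
    }
}
```

Returning a list rather than a single String covers the case where more than one paragraph contains the keyword.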