pdf_parser

All the code related to PDF parsing.

The following elements are to be parsed from documents.

Documents
1. Extracting dates from documents
2. Classification Tags
3. Extracting Key Entities from documents
  1. Patents
  2. References
  3. Entities
    1. Names
    2. Addresses
    3. Law Firms
    4. Contact Numbers
    5. Emails
4. Association with Cases
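
As a rough illustration of two items from the list above (dates and emails), the following is a minimal sketch using Python's `re` module. The patterns, the function names `extract_dates` and `extract_emails`, and the sample sentence are assumptions made for illustration; they are not taken from this repository's parsers. The date pattern only covers English dates written like "March 5, 2021" and relies on an English locale for month names.

```python
import re
from datetime import datetime

# Illustrative patterns only; they are assumptions, not the repository's actual rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches dates written like "March 5, 2021".
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)


def extract_emails(text):
    """Return every e-mail-like string in the text, in order of appearance."""
    return EMAIL_RE.findall(text)


def extract_dates(text):
    """Return dates found in the text as ISO strings (YYYY-MM-DD)."""
    dates = []
    for match in DATE_RE.findall(text):
        try:
            parsed = datetime.strptime(match, "%B %d, %Y")
        except ValueError:
            # The pattern can match impossible dates such as "February 31, 2020".
            continue
        dates.append(parsed.date().isoformat())
    return dates


sample = "Filed on March 5, 2021. Contact: jane.doe@example.com."
print(extract_dates(sample))
print(extract_emails(sample))
```

Real documents will need broader patterns (other date formats, phone numbers, addresses), so this should be read as a starting point rather than a complete extractor.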

Setting up the workspace environment

Creating a virtual environment.

1. Launch the terminal.
2. Go to the home directory and make a new directory named Installs:
```
cd ~
mkdir Installs
```
3. Create a virtual environment named venv inside Installs, then activate it:
```
python3 -m venv ~/Installs/venv
source ~/Installs/venv/bin/activate
```
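
To confirm that the activation step worked, a short check like the one below can be run with the `python` from the activated shell. This is a minimal sketch; the helper name `in_virtualenv` is hypothetical and not part of this repository.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

When the environment is active, the printed prefix should end in `Installs/venv`.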

Setting up apache-tika

1. Launch the terminal.
2. Go to the home directory and download the apache-tika server:
```
cd ~
wget https://www.apache.org/dyn/closer.lua/tika/1.28.4/tika-server-1.28.4.jar
```
3. Check whether Java is installed on your machine:
```
java --version
```
   If Java is not installed on your local machine, please refer to this documentation.

Running the tika server

1. Create a new screen session called apache-tika:
```
screen -S apache-tika
```
2. Inside that session, run the tika server: `java -jar {your tika server}`
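
Once the server is running, a document can be sent to it over HTTP. The sketch below uses only the Python standard library and assumes the server listens on Tika's default address, `http://localhost:9998`, and that the `/tika` endpoint is used to get plain text back. The function names `build_tika_request` and `extract_text` are hypothetical and are not part of this repository.

```python
import urllib.request

# Assumption: the tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes, url=TIKA_URL):
    """Build a PUT request that asks tika-server for the plain text of a PDF."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def extract_text(pdf_path, url=TIKA_URL):
    """Send the file at pdf_path to the running tika server and return its text."""
    with open(pdf_path, "rb") as handle:
        request = build_tika_request(handle.read(), url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")


# Building a request does not contact the server, so this runs without tika.
request = build_tika_request(b"%PDF-1.4")
print(request.get_method(), request.full_url)
```

With the server up, calling `extract_text("sample.pdf")` (a hypothetical file name) should return the extracted text, which can then be passed to the parsers.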

Setting up the code base.

1. Launch the terminal.
2. Go to the home directory and make a new directory named Code:
```
cd ~
mkdir Code
```
3. Clone the current repository into the Code directory:
```
cd ~/Code
git clone gogs@git.fafadiatech.com:harsh/pdf_parser.git
```

TODO LIST:
1. Implementing OCR on tika.
2. Dockerising the whole apache-tika setup with OCR.
3. Testing the regular expressions (re) on scanned PDFs.