# pdf_parser

## All the code related to PDF parsing

### The following elements are to be parsed from documents.

1. Documents
    1. Extracting dates from documents
    1. Classification Tags
    1. Extracting Key Entities from documents
    1. Patents
    1. References
1. Entities
    1. Names
    1. Addresses
    1. Law Firms
    1. Contact Numbers
    1. Emails
    1. Association with Cases

### Setting up the workspace environment

1. Creating a `virtual environment`.
    1. Launch the terminal.
    1. Go to the home directory by typing the following command.
        ```
        cd ~
        ```
    1. Make a new directory `Installs` using the following command.
        ```
        mkdir Installs
        ```
    1. Create a virtual environment `venv` inside `Installs`.
        ```
        python3 -m venv ~/Installs/venv
        ```
    1. Activate the virtual environment:
        ```
        source ~/Installs/venv/bin/activate
        ```
1. Installing `Java`
    1. Check if Java is installed on your system. If the command below throws an error, refer to this [documentation](https://www.java.com/download/ie_manual.jsp).
        ```
        java -version
        ```

### Setting up the code base.

1. Launch the terminal.
1. Enter the following command to go to the base directory:
    ```
    cd ~
    ```
1. Make a new directory `Code` by using the following command.
    ```
    mkdir Code
    ```
1. Enter the `Code` directory by using the following command.
    ```
    cd Code
    ```
1. Clone the current repository by entering the following command.
    ```
    git clone gogs@git.fafadiatech.com:harsh/pdf_parser.git
    ```

### Running the docker file:

1. Launch the terminal.
1. Change to the directory containing the docker file using the following command.
    ```
    cd ~/Code/pdf_parser/docker/
    ```
1. Check if ["docker"](https://docs.docker.com/engine/install/) is installed on your machine using the command below.
    ```
    docker ps
    ```
1. Pull the required images using the following command.
    ```
    docker-compose pull
    ```
1. Build the Docker images using the following command.
    ```
    docker-compose build
    ```
1. Start a new `screen` session and start the containers using the following commands.
    ```
    screen -S docker
    docker-compose up
    ```

### TODO LIST:

- [ ] Implementing OCR on tika.
- [x] Dockerising apache-tika.
- [ ] Testing the `re` patterns on scanned PDFs.
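
### Checking the prerequisites

The setup steps above rely on several command-line tools being present. As a minimal sketch (not part of the repository), the script below reports which of them are available on `PATH`. The function name `check_tool` and the exact list of tools are assumptions made for illustration, inferred from the commands used in this README.

```shell
#!/bin/sh
# Hypothetical helper (not part of pdf_parser): report whether a
# command is available on PATH. Returns 0 if found, 1 otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
        return 0
    else
        echo "missing: $1"
        return 1
    fi
}

# Tools referenced by the steps in this README (assumed list).
# Each one is reported without aborting on the first missing tool.
missing=0
for tool in python3 java git docker docker-compose screen; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "missing tools: $missing"
```

If any tool is reported as missing, install it before continuing with the corresponding section. Note that `docker ps` additionally requires the Docker daemon to be running, which this sketch does not check.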