Harsh Parikh 4024f01882 fixed formatting issue		2 years ago
complaints	b6df27f4f7 fixed import position	2 years ago
docker	cd80ead15d dockerised apache tika	2 years ago
expert_resume	6a5cac4a92 added parsers for elements of expert resume	2 years ago
.gitignore	deff3db35b Updated parser for acronyms	2 years ago
LICENSE	24a0e3093a updated license	2 years ago
README.md	4024f01882 fixed formatting issue	2 years ago

pdf_parser

All code related to PDF parsing.

The following elements are to be parsed from documents.

Documents
1. Extracting dates from documents
2. Classification Tags
3. Extracting Key Entities from documents
  1. Patents
  2. References
  3. Entities
    1. Names
    2. Addresses
    3. Law Firms
    4. Contact Numbers
    5. Emails
4. Association with Cases
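
As a rough illustration of the entity-extraction items above, the following sketch pulls emails and ISO-style dates out of already-extracted text with regular expressions. The patterns, the `extract_entities` name, and the sample text are assumptions made for this example and are not taken from the repository's parsers.

```python
import re

# Illustrative patterns only (assumptions); the repository's parsers may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # matches YYYY-MM-DD only


def extract_entities(text: str) -> dict:
    """Return the emails and dates found in plain text, in order of appearance."""
    return {
        "emails": EMAIL_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


# Hypothetical sample text standing in for the output of a PDF text extractor.
sample = "Filed 2021-03-15. Contact jane.doe@example.com or counsel@lawfirm.example."
print(extract_entities(sample))
```

Real documents use many more date formats than `YYYY-MM-DD`, so a production parser would need a broader pattern or a dedicated date-parsing library.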

Setting up the workspace environment

Creating a virtual environment.
1. Launch the terminal.
2. Go to the home directory by typing the following command.
```
cd ~
```
3. Make a new directory Installs using the following command:
```
mkdir Installs
```
4. Create a virtual environment named venv inside the Installs directory:
```
python3 -m venv ~/Installs/venv
```
5. Activate the virtual environment:
```
source ~/Installs/venv/bin/activate
```
Setting up apache-tika
1. Launch terminal
2. Enter the following command to go to the base directory:
```
cd ~
```
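3. Download the apache-tika server using the following command. The `closer.lua` link returns a mirror-selection page unless `?action=download` is appended:
```
wget -O tika-server-1.28.4.jar "https://www.apache.org/dyn/closer.lua/tika/1.28.4/tika-server-1.28.4.jar?action=download"
```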
4. Check if Java is installed on your machine by running the following command:
```
java -version
```
5. If Java is not installed on your local machine, please refer to this documentation.
6. Running the tika server
  1. Create a new screen called apache-tika
```
screen -S apache-tika
```
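  2. Enter the following command to run the tika server, replacing `{your tika server}` with the path to the downloaded jar (for example `~/tika-server-1.28.4.jar`):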
```
java -jar {your tika server}
```
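
Once the server is running, text can be extracted from a PDF by sending the file to it over HTTP. The sketch below uses only the Python standard library and assumes the server listens on Tika's default port 9998 on the same machine. The function names are hypothetical and not part of this repository.

```python
import urllib.request

# Assumption: the Tika server runs locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks the Tika server for plain text."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def extract_text(pdf_path: str, url: str = TIKA_URL) -> str:
    """Send a PDF file to the Tika server and return the extracted text."""
    with open(pdf_path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `extract_text("some_document.pdf")` (a placeholder path) requires the server to be up; otherwise `urlopen` raises `urllib.error.URLError`.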

Setting up the code base.

1. Launch the terminal.
2. Enter the following command to go to the base directory:
```
cd ~
```
3. Make a new directory Code by using the following command:
```
mkdir Code
```
4. Clone the repository into the Code directory by entering the following commands:
```
cd ~/Code
git clone gogs@git.fafadiatech.com:harsh/pdf_parser.git
```

Running the docker file:

1. Launch the terminal.
2. Change to the docker directory using the following command:
```
cd ~/Code/pdf_parser/docker/
```
3. Check if docker is installed on your machine using the command below. If it throws an error, refer to this [documentation](https://docs.docker.com/engine/install/).
```
docker ps
```
4. Pull the required images using the following command:
```
docker-compose pull
```
5. Build the docker images using the following command:
```
docker-compose build
```
6. Start a new screen session and bring up the containers using the following commands:
```
screen -S docker
docker-compose up
```
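
After the containers come up, it can help to confirm that the service is reachable before sending documents to it. The following is a minimal sketch of such a check. The `is_port_open` helper is hypothetical, and the host and port in the usage line are assumptions (Tika's default port 9998, published on localhost), since the compose file's port mapping is not shown here.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: the container publishes Tika on localhost:9998.
    print("tika reachable:", is_port_open("localhost", 9998))
```

An open port only shows that something is listening; it does not confirm that the listener is the Tika server.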

TODO LIST:
- Implementing OCR on tika.
- Dockerising apache-tika.
- Testing the regular expressions on scanned PDFs.
