All code related to PDF parsing.

Harsh Parikh 4caa9d66a4 Merge branch 'master' of git.fafadiatech.com:harsh/pdf_parser 2 years ago
complaints b6df27f4f7 fixed import position 2 years ago
docker 6e7ca468a4 updated docker settings for ubuntu 2 years ago
document_download_from_server 92e83c5949 added a script to collect documents from the postgres server 2 years ago
expert_report 44058d12b9 updated new code 2 years ago
expert_resume 1de14ff7d7 added parser elements for extraction of names from expert resume 2 years ago
.gitignore 0277ca0712 ref #1:fixed merge conflict 2 years ago
LICENSE 24a0e3093a updated license 2 years ago
README.md 4105ec854e fixed formating error in README.md 2 years ago

README.md

pdf_parser

All code related to PDF parsing.

The following elements are to be parsed from documents.

  1. Documents
    1. Extracting dates from documents
    2. Classification Tags
    3. Extracting Key Entities from documents
      1. Patents
      2. References
      3. Entities
        1. Names
        2. Addresses
        3. Law Firms
        4. Contact Numbers
        5. Emails
    4. Association with Cases
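
As a rough illustration of the entity extraction listed above, the sketch below pulls emails, contact numbers, and dates out of plain text with Python's `re` module. The patterns and the names `EMAIL_RE`, `PHONE_RE`, `DATE_RE`, and `extract_entities` are hypothetical and are not taken from this repository; real documents will likely need broader patterns.

```python
import re

# Hypothetical patterns for illustration only; the repository's parsers may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumes US-style numbers such as (212) 555-0100 or 212-555-0100.
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")

_MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Matches "March 3, 2019" style dates and numeric dates such as 04/15/2020.
DATE_RE = re.compile(
    r"\b(?:" + _MONTHS + r") \d{1,2}, \d{4}\b"
    r"|\b\d{1,2}/\d{1,2}/\d{4}\b"
)


def extract_entities(text):
    """Return the emails, contact numbers, and dates found in `text`."""
    return {
        "emails": EMAIL_RE.findall(text),
        "contact_numbers": PHONE_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


if __name__ == "__main__":
    sample = (
        "Filed on March 3, 2019 and amended 04/15/2020. "
        "Contact jane.doe@example.com or (212) 555-0100."
    )
    print(extract_entities(sample))
```

The text passed to `extract_entities` is assumed to have been extracted from the PDF already; names, addresses, and law firms are left out because simple regular expressions rarely capture them reliably.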

Setting up the workspace environment

  1. Creating a virtual environment.
    1. Launch the terminal.
    2. Go to the home directory by typing the following command.

      cd ~
      
    3. Make a new directory Installs using the following command

      mkdir Installs
      
    4. Create a virtual environment venv inside the Installs directory

      python3 -m venv ~/Installs/venv
      
    5. Activate the virtual environment:

      source ~/Installs/venv/bin/activate
      
  2. Installing Java

    1. Check whether Java is installed on your system. If the command below returns an error, Java needs to be installed first; refer to the Java installation documentation.

      java -version
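
The two environment checks above (an active virtual environment and a working Java installation) can also be verified from Python before running any parser. This is a minimal sketch using only the standard library; the function names `in_virtualenv`, `java_available`, and `preflight` are hypothetical and not part of this repository.

```python
import shutil
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def java_available():
    """Return True when a `java` executable is found on PATH."""
    return shutil.which("java") is not None


def preflight():
    """Collect human-readable descriptions of any missing prerequisites."""
    problems = []
    if not in_virtualenv():
        problems.append("No virtual environment is active.")
    if not java_available():
        problems.append("No `java` executable was found on PATH.")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```

An empty list from `preflight()` means both prerequisites were detected. The check only confirms that `java` is on PATH, not that its version is suitable.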
      

Setting up the code base

  1. Launch the terminal.

  2. Enter the following command to go to the home directory:

    cd ~
    
  3. Make a new directory Code by using the following command

    mkdir Code
    
  4. Enter the Code directory by using the following command

    cd Code
    
  5. Clone the repository by entering the following command

    git clone gogs@git.fafadiatech.com:harsh/pdf_parser.git
    

Running the docker file:

  1. Launch the terminal.
  2. Change to the docker directory using the following command.

    cd ~/Code/pdf_parser/docker/
    
  3. Check that Docker is installed and its daemon is running using the command below.

    docker ps
    
  4. Pull the required images using the following command.

    docker-compose pull
    
  5. Build the docker images using the following command.

    docker-compose build
    
  6. Start a new screen session and start the containers using the following commands.

    screen -S docker
    docker-compose up
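
For repeated runs, the pull, build, and up sequence above could be scripted. The sketch below is one possible approach, not something this repository provides: `COMPOSE_DIR`, `COMPOSE_STEPS`, and `run_compose_steps` are hypothetical names, and the directory is assumed from the `cd` step above.

```python
import subprocess
from pathlib import Path

# Assumed location, taken from the `cd` step; adjust if the repository lives elsewhere.
COMPOSE_DIR = Path.home() / "Code" / "pdf_parser" / "docker"

# The compose commands in the order they should run.
COMPOSE_STEPS = [
    ["docker-compose", "pull"],
    ["docker-compose", "build"],
    ["docker-compose", "up"],
]


def run_compose_steps(compose_dir=COMPOSE_DIR, dry_run=False):
    """Run each compose step in order and return the commands handled.

    With dry_run=True nothing is executed; the commands are only listed.
    check=True raises CalledProcessError on the first failing step.
    """
    handled = []
    for command in COMPOSE_STEPS:
        if not dry_run:
            subprocess.run(command, cwd=compose_dir, check=True)
        handled.append(" ".join(command))
    return handled


if __name__ == "__main__":
    # Dry run by default so the sketch is safe to execute anywhere.
    for line in run_compose_steps(dry_run=True):
        print(line)
```

Because `docker-compose up` stays in the foreground, a real run would still be started inside a `screen` session as described above.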
    

TODO LIST:

  • Implementing OCR on tika.
  • Dockerising apache-tika.
  • Testing the regular expressions (re) on scanned PDFs.