MapReduce in Beam (Python) 2.5

Check project permissions

Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu (Navigation menu icon), select IAM & Admin > IAM.

  2. Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud Overview > Dashboard.

[Image: Compute Engine default service account name and editor status highlighted on the Permissions tabbed page]

Note: If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.

  1. In the Google Cloud console, on the Navigation menu, click Cloud Overview > Dashboard.
  2. Copy the project number (e.g. 729328892908).
  3. On the Navigation menu, select IAM & Admin > IAM.
  4. At the top of the roles table, below View by Principals, click Grant Access.
  5. For New principals, type:
CODE...

 

  6. Replace {project-number} with your project number.
  7. For Role, select Project (or Basic) > Editor.
  8. Click Save.
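
The principal name used above is the project number followed by a fixed suffix. As a minimal sketch of that substitution (the function name is hypothetical, and the project number is the example value from the steps above), it can be written as:

```python
# Fixed suffix of the default compute service account, as given in the text above.
SUFFIX = "-compute@developer.gserviceaccount.com"


def default_compute_service_account(project_number: str) -> str:
    """Build the principal name by substituting the project number (hypothetical helper)."""
    # Project numbers are numeric, so reject anything that looks like a project ID.
    if not project_number.isdigit():
        raise ValueError("project number must contain only digits")
    return f"{project_number}{SUFFIX}"


# Example project number taken from the steps above.
print(default_compute_service_account("729328892908"))
```

The digit check is there because it is easy to paste the project ID (a string such as a name with dashes) where the project number is expected.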

Task 1. Lab preparations

Complete the following steps before you run the lab.

Open the SSH terminal and connect to the training VM

You will be running all code from a curated training VM.

  1. In the Console, on the Navigation menu (Navigation menu icon), click Compute Engine > VM instances.

  2. Locate the line with the instance called training-vm.

  3. On the far right, under Connect, click on SSH to open a terminal window.

  4. In this lab, you will enter CLI commands on the training-vm.

Clone the training GitHub repository

  • In the training-vm SSH terminal, enter the following command:
CODE...

 

Task 2. Identify map and reduce operations

  • Return to the training-vm SSH terminal, navigate to the courses/data_analysis/lab2/python directory of the cloned training-data-analyst repository, and open the file is_popular.py in Nano. Do not make any changes to the code. When you have finished reading, press Ctrl+X to exit Nano.
CODE...

 

Can you answer these questions about the file is_popular.py?

  • What custom arguments are defined?
  • What is the default output prefix?
  • How is the variable output_prefix in main() set?
  • How are the pipeline arguments such as --runner set?
  • What are the key steps in the pipeline?
  • Which of these steps happen in parallel?
  • Which of these steps are aggregations?
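
The last three questions come down to telling per-element steps apart from aggregations. The sketch below is not the contents of is_popular.py. It is a plain-Python illustration, with hypothetical function names and made-up sample data, of a map step that handles each line on its own (and so could run in parallel) followed by a reduce step that needs every value for a key before it can produce a result.

```python
from collections import defaultdict


def map_line(line):
    """Map step: emit (word, 1) pairs for one line, independent of all other lines."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce step: an aggregation that combines every value sharing a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def top_n(totals, n):
    """Second aggregation: keep the n largest counts, ties broken alphabetically."""
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical input; a real pipeline would read these lines from files.
lines = ["beam map reduce", "map then reduce", "map"]
pairs = [pair for line in lines for pair in map_line(line)]
totals = reduce_counts(pairs)
print(top_n(totals, 2))  # [('map', 3), ('reduce', 2)]
```

When reading a pipeline, the same test applies: a step that looks at one element at a time is a map, and a step that must see a whole group (a sum per key, a top N) is a reduce.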

Task 3. Execute the pipeline

  1. In the training-vm SSH terminal, run the pipeline locally:
CODE...

 

  2. Identify the output file. It should be output<suffix> and could be a sharded file:
CODE...

  3. Examine the output file, replacing '-*' with the appropriate suffix:
CODE...
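
A sharded output means the results may be split across several files that share a prefix, with Beam's text sink by default appending a suffix of the form -00000-of-00001. As a rough sketch of the "find the shards, then read them" step in Python (the helper names, file names, and contents are made up for the demonstration and are not the lab's output), assuming all shards share one prefix:

```python
import glob
import os
import tempfile


def find_shards(prefix):
    """Return every file whose path starts with the prefix, in sorted order."""
    return sorted(glob.glob(glob.escape(prefix) + "*"))


def read_all(paths):
    """Concatenate the lines of all shards, in shard order."""
    lines = []
    for path in paths:
        with open(path) as handle:
            lines.extend(line.rstrip("\n") for line in handle)
    return lines


# Demonstration with made-up shard files in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    prefix = os.path.join(workdir, "output")
    for name, text in [("output-00000-of-00002", "a\n"),
                       ("output-00001-of-00002", "b\n")]:
        with open(os.path.join(workdir, name), "w") as handle:
            handle.write(text)
    shards = find_shards(prefix)
    shard_names = [os.path.basename(path) for path in shards]
    contents = read_all(shards)

print(shard_names)
print(contents)
```

Matching on the prefix rather than on an exact file name keeps the check working whether the run produced one shard or many.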

 

Task 4. Use command line parameters

  1. In the training-vm SSH terminal, change the output prefix from the default value:
CODE...

 

  2. What will be the name of the new file that is written out?
  3. Note that there is now a new file in the /tmp directory:
CODE...
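
One common way for a script to accept its own flags alongside flags meant for the pipeline is argparse's parse_known_args, which returns the declared arguments plus a list of everything it did not recognize. Whether is_popular.py does exactly this is something to confirm when reading the file in Task 2. In the sketch below, the flag name --output_prefix and its default value are assumptions chosen for illustration.

```python
import argparse


def parse_args(argv):
    """Split argv into a custom output prefix and leftover pipeline arguments."""
    parser = argparse.ArgumentParser(description="Hypothetical pipeline launcher")
    # Hypothetical custom flag and default; check the real script for its own.
    parser.add_argument("--output_prefix", default="/tmp/example_output",
                        help="Prefix for the output files")
    # Unrecognized flags such as --runner are returned untouched in the second value.
    known, pipeline_args = parser.parse_known_args(argv)
    return known.output_prefix, pipeline_args


# No flags: the default prefix is used.
default_prefix, _ = parse_args([])
# Custom prefix plus a flag the parser does not declare.
custom_prefix, rest = parse_args(
    ["--output_prefix", "/tmp/myoutput", "--runner", "DirectRunner"])

print(default_prefix)
print(custom_prefix, rest)
```

With this pattern, changing the output prefix on the command line changes only where the files are written, while the leftover list can be handed to the pipeline as its options.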

 

 

...