Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.
- Yash Kalathiya ([email protected])
The Header Cleanser module is a versatile tool designed to remove license and copyright headers from code files. It supports over 90 programming languages and utilizes the ScanCode Toolkit to identify license and copyright information within the codebase.
The set of dictionary keys holding configuration for values are as follows:
- contents_column_name - used to define input column name. Default value is 'contents'.
- license - write 'true' to remove license from input data else 'false'. By default set as 'true'.
- copyright - write 'true' to remove copyright from input data else 'false'. by default set as 'true'.
You can run the header_cleanser_local.py (python-only implementation) or header_cleanser_local_ray.py (ray-based implementation) to transform the test1.parquet
file in test input data to an output
directory. The directory will contain both the new annotated test1.parquet
file and the metadata.json
file.
When running the transform with the Ray launcher (i.e. TransformLauncher), the following command line arguments are available in addition to the launcher.
- --header_cleanser_contents_column_name - set the contents_column_name configuration key.
- --header_cleanser_document_id_column_name - set the document_id_column_name configuration key.
- --header_cleanser_license - set the license configuration key.
- --header_cleanser_copyright - set the copyright configuration key.
- --header_cleanser_n_processes - set the n_processes configuration key.
- --header_cleanser_tmp_dir - set the tmp_dir configuration key.
- --header_cleanser_timeout - set the timeout configuration key.
- --header_cleanser_skip_timeout - set the skip_timeout configuration key.
- File Format: Parquet file containing code.
- Input Column: The code should be in a column named
content
. - Sample Input:
Sample Input File
- File Format: Parquet file with the updated code in the same column.
- Sample Output:
Sample Output File
When invoking the CLI, use the following syntax for these parameters:
--header_cleanser_<parameter_name>
For example:
--header_cleanser_content_column_name='content'
/*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.jstevenperry.intro;
import java.util.logging.Logger;
// This is the main public class representing a Person
public class Person {
private static final Logger logger = Logger.getLogger(Person.class.getName());
private String name;
private int age;
private int height;
private int weight;
private String eyeColor;
private String gender;
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public int getAge() {
return age;
}
public void setAge(int age) {
this.age = age;
}
public int getHeight() {
return height;
}
public void setHeight(int height) {
this.height = height;
}
public int getWeight() {
return weight;
}
public void setWeight(int weight) {
this.weight = weight;
}
public String getEyeColor() {
return eyeColor;
}
public void setEyeColor(String eyeColor) {
this.eyeColor = eyeColor;
}
public String getGender() {
return gender;
}
public void setGender(String gender) {
this.gender = gender;
}
public Person(String name, int age, int height, int weight, String eyeColor, String gender) {
super();
this.name = name;
this.age = age;
this.height = height;
this.weight = weight;
this.eyeColor = eyeColor;
this.gender = gender;
logger.info("Created Person object with name '" + getName() + "'");
}
}
package com.jstevenperry.intro;
import java.util.logging.Logger;
/// This is the main public class representing a Person
public class Person {
private static final Logger logger = Logger.getLogger(Person.class.getName());
private String name;
private int age;
private int height;
private int weight;
private String eyeColor;
private String gender;
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public int getAge() {
return age;
}
public void setAge(int age) {
this.age = age;
}
public int getHeight() {
return height;
}
public void setHeight(int height) {
this.height = height;
}
public int getWeight() {
return weight;
}
public void setWeight(int weight) {
this.weight = weight;
}
public String getEyeColor() {
return eyeColor;
}
public void setEyeColor(String eyeColor) {
this.eyeColor = eyeColor;
}
public String getGender() {
return gender;
}
public void setGender(String gender) {
this.gender = gender;
}
public Person(String name, int age, int height, int weight, String eyeColor, String gender) {
super();
this.name = name;
this.age = age;
this.height = height;
this.weight = weight;
this.eyeColor = eyeColor;
this.gender = gender;
logger.info("Created Person object with name '" + getName() + "'");
}
}
Check out the example notebook for further details.
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.