pandas - Extracting data in a certain format from Excel using Python

After loading the data into a DataFrame, it looks like this:

data 1
id	datum
2000	2024.09.02
2903	2024.09.02
data 2
id	datum
4000	2024.09.02
4001	2024.09.02

The expected output should look like this:

id	datum	data
2000	2024.09.02	1
2903	2024.09.02	1
4000	2024.09.02	2
4001	2024.09.02	2

import pandas as pd
import numpy as np

# Step 1: Read the Excel file
# Replace 'your_file_path.xlsx' with the actual path to your Excel file
file_path = 'your_file_path.xlsx'

# Load the Excel file
xls = pd.ExcelFile(file_path)

# Assuming 'Datei2' is the name of the sheet we want to read
df = pd.read_excel(xls, sheet_name='Datei2')
df['Column1 - Copy'] = df['Column1']
split_columns = df['Column1 - Copy'].str.split(' ', expand=True)
split_columns.columns = [f'Column1 - Copy.{i+1}' for i in range(split_columns.shape[1])]
df = pd.concat([df, split_columns], axis=1)

# Step 6: Fill Down Values
# fillna(method='ffill') is deprecated in recent pandas; ffill() is the equivalent call
df['Column1 - Copy.16'] = df['Column1 - Copy.16'].ffill()

asked Nov 18, 2024 at 20:46 by niki (edited Nov 18, 2024 at 20:55)

Please take some time to read How to Ask and How to create a Minimal, Reproducible Example. – LMC, Nov 18, 2024 at 20:47
What is the point of the added code? Your sample doesn't have any of these columns. Please edit your question to fix the code so that it meaningfully fits your sample data, and explain how it relates to either the input or the desired output. – ouroboros1, Nov 18, 2024 at 21:29

2 Answers

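Assuming your Excel file looks exactly as shown, you can use cumsum to assign every row to a group: each "data N" marker row has an empty datum cell, so a running count of the empty cells increases by one at the start of every block. You can then filter out the marker rows and the repeated header rows: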

# Replace 'your_file_path.xlsx' with the actual path to your Excel file
df = pd.read_excel('your_file_path.xlsx', header=None, names=['id', 'datum'])

       id       datum
0  data 1         NaN
1      id       datum
2    2000  2024.09.02
3    2903  2024.09.02
4  data 2         NaN
5      id       datum
6    4000  2024.09.02
7    4001  2024.09.02

df['data'] = (df['datum'].isna()).cumsum()

       id       datum  data
0  data 1         NaN     1
1      id       datum     1
2    2000  2024.09.02     1
3    2903  2024.09.02     1
4  data 2         NaN     2
5      id       datum     2
6    4000  2024.09.02     2
7    4001  2024.09.02     2
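
To see why the counter changes exactly at the block boundaries, here is a small sketch of the cumsum step in isolation. The `datum` Series is a hand-built stand-in for the column above, not data read from a real file.

```python
import pandas as pd

# Hypothetical 'datum' column: None marks the start of each "data N" block.
datum = pd.Series([None, 'datum', '2024.09.02', '2024.09.02',
                   None, 'datum', '2024.09.02', '2024.09.02'])

is_marker = datum.isna()            # True only on the block-marker rows
block_counter = is_marker.cumsum()  # True counts as 1, False as 0

print(block_counter.tolist())  # [1, 1, 1, 1, 2, 2, 2, 2]
```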

df = df[~df['datum'].eq('datum')].dropna(how='any')

     id       datum  data
2  2000  2024.09.02     1
3  2903  2024.09.02     1
6  4000  2024.09.02     2
7  4001  2024.09.02     2
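
To try these steps without an Excel file, the sketch below builds the same raw frame in memory and wraps the logic in a function. The name `label_blocks_by_cumsum` and the hard-coded rows are illustrative assumptions, not part of the answer above. Keep in mind that the counter only matches the number in "data N" when the blocks are numbered 1, 2, 3, ... in order.

```python
import numpy as np
import pandas as pd


def label_blocks_by_cumsum(raw: pd.DataFrame) -> pd.DataFrame:
    """Tag each row with a running block counter, then keep only data rows.

    Assumes `raw` has columns 'id' and 'datum', and that every block starts
    with a marker row whose 'datum' cell is empty (NaN).
    """
    out = raw.copy()
    # Each marker row ("data N") has NaN in 'datum', so the cumulative sum
    # of the NaN mask increases by one at the start of every block.
    out['data'] = out['datum'].isna().cumsum()
    # Drop the repeated "id / datum" header rows, then the marker rows.
    out = out[out['datum'].ne('datum')].dropna(how='any')
    out['id'] = out['id'].astype(int)
    return out.reset_index(drop=True)


# Hypothetical stand-in for pd.read_excel(..., header=None, names=['id', 'datum'])
raw = pd.DataFrame({
    'id': ['data 1', 'id', 2000, 2903, 'data 2', 'id', 4000, 4001],
    'datum': [np.nan, 'datum', '2024.09.02', '2024.09.02',
              np.nan, 'datum', '2024.09.02', '2024.09.02'],
})

result = label_blocks_by_cumsum(raw)
print(result)
```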

Alternatively, extract the block number from the "data X" marker rows with a regular expression and forward-fill it:

import pandas as pd

# Step 1: Read the Excel file (replace 'your_file_path.xlsx' with the actual file path)
file_path = 'your_file_path.xlsx'

# Assuming the data spans multiple blocks in a single sheet
df = pd.read_excel(file_path, header=None)

# Step 2: Identify blocks of data and label them
# A block starts with "data X" where X is the block number
df['block'] = df[0].str.extract(r'data (\d+)', expand=False).ffill()

# Step 3: Remove rows that are not part of the main data:
# the "data X" marker rows and the repeated "id / datum" header rows
df = df[~df[0].str.contains('data', na=False) & df[0].ne('id')].copy()

# Step 4: Rename columns and clean up the DataFrame
df.columns = ['id', 'datum', 'block']
df = df.dropna(subset=['id', 'datum'])

# Step 5: Clean and format the columns
df['id'] = df['id'].astype(int)
df['datum'] = pd.to_datetime(df['datum'], errors='coerce').dt.strftime('%Y.%m.%d')
df['block'] = df['block'].astype(int)
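
The regex approach can also be checked on an in-memory frame. The sketch below is a hedged variant, not the answerer's exact code: the function name `label_blocks_by_marker` is made up, the output column is called `data` to match the expected output in the question, and the sample markers are deliberately numbered 5 and 9 to show that the label comes from the marker text rather than from a running counter.

```python
import pandas as pd


def label_blocks_by_marker(raw: pd.DataFrame) -> pd.DataFrame:
    """Read the block number out of each "data N" marker row and forward-fill it.

    Assumes `raw` has two positional columns (0 and 1), as produced by
    pd.read_excel(..., header=None).
    """
    first = raw[0].astype(str)
    # Marker rows yield the captured digits; all other rows yield NaN,
    # which ffill() replaces with the most recent block number.
    block = first.str.extract(r'^data (\d+)$', expand=False).ffill()
    # Keep rows that are neither markers nor repeated "id / datum" headers.
    keep = ~first.str.startswith('data') & first.ne('id')
    out = raw.loc[keep].copy()
    out.columns = ['id', 'datum']
    out['data'] = block[keep].astype(int)
    out['id'] = out['id'].astype(int)
    return out.reset_index(drop=True)


# Hypothetical sample; the block numbers 5 and 9 are chosen for illustration.
raw = pd.DataFrame({
    0: ['data 5', 'id', 2000, 2903, 'data 9', 'id', 4000, 4001],
    1: [None, 'datum', '2024.09.02', '2024.09.02',
        None, 'datum', '2024.09.02', '2024.09.02'],
})

result = label_blocks_by_marker(raw)
print(result)
```

With the cumsum approach the same input would be labelled 1 and 2, so the regex variant is the safer choice when the numbers in the marker rows matter.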

Source: pandas - Extracting data in a certain format from Excel using Python - Stack Overflow