pandas - Extracting data in a certain format from Excel using Python - Stack Overflow

# After loading the data into a data frame, it looks like this:

data 1
id datum
2000 2024.09.02
2903 2024.09.02
data 2
id datum
4000 2024.09.02
4001 2024.09.02

# The expected result should look like this:

id datum data
2000 2024.09.02 1
2903 2024.09.02 1
4000 2024.09.02 2
4001 2024.09.02 2

import pandas as pd
import numpy as np

# Step 1: Read the Excel file
# Replace 'your_file_path.xlsx' with the actual path to your Excel file
file_path = 'your_file_path.xlsx'

# Load the Excel file
xls = pd.ExcelFile(file_path)

# Assuming 'Datei2' is the name of the sheet we want to read
df = pd.read_excel(xls, sheet_name='Datei2')
df['Column1 - Copy'] = df['Column1']
split_columns = df['Column1 - Copy'].str.split(' ', expand=True)
split_columns.columns = [f'Column1 - Copy.{i+1}' for i in range(split_columns.shape[1])]
df = pd.concat([df, split_columns], axis=1)

# Step 6: Fill Down Values
df['Column1 - Copy.16'] = df['Column1 - Copy.16'].ffill()
asked Nov 18, 2024 at 20:46 by niki, edited Nov 18, 2024 at 20:55
  • Please take some time to read How to Ask and How to create a Minimal, Reproducible Example. – LMC, commented Nov 18, 2024 at 20:47
  • What is the point of the added code? Your sample doesn't have any of these columns. Please edit your question to fix the code so that it meaningfully fits your sample data, and explain how it relates to either the input or the desired output. – ouroboros1, commented Nov 18, 2024 at 21:29

2 Answers

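Assuming your Excel file looks exactly like your sample, you can use cumsum to assign each row to a group and then filter out the rows that are not data. Each "data N" title row has an empty second cell, so a running count of the missing datum values increases by one at the start of every block: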

df = pd.read_excel(<your file>, header=None, names=['id', 'datum'])
       id       datum
0  data 1         NaN
1      id       datum
2    2000  2024.09.02
3    2903  2024.09.02
4  data 2         NaN
5      id       datum
6    4000  2024.09.02
7    4001  2024.09.02

df['data'] = (df['datum'].isna()).cumsum()
        id       datum  data
0  data 1         NaN     1
1      id       datum     1
2    2000  2024.09.02     1
3    2903  2024.09.02     1
4  data 2         NaN     2
5      id       datum     2
6    4000  2024.09.02     2
7    4001  2024.09.02     2

df = df[~df['datum'].eq('datum')].dropna(how='any')
     id       datum  data
2  2000  2024.09.02     1
3  2903  2024.09.02     1
6  4000  2024.09.02     2
7  4001  2024.09.02     2
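
The steps above can be tried without an Excel file. What follows is only a minimal sketch, assuming the sheet loads exactly as shown (title rows with an empty second cell, repeated id/datum header rows). The `raw` frame is hand-built stand-in data, not the asker's file, and the names `raw` and `result` are made up for illustration.

```python
import pandas as pd

# Hand-built stand-in for pd.read_excel(..., header=None, names=['id', 'datum']).
raw = pd.DataFrame(
    {
        "id": ["data 1", "id", 2000, 2903, "data 2", "id", 4000, 4001],
        "datum": [None, "datum", "2024.09.02", "2024.09.02",
                  None, "datum", "2024.09.02", "2024.09.02"],
    }
)

# Each "data N" title row has no datum, so a running count of the
# missing values gives one number per block.
raw["data"] = raw["datum"].isna().cumsum()

# Drop the repeated header rows first, then the title rows (their datum is missing).
result = (
    raw[raw["datum"].ne("datum")]
    .dropna(subset=["datum"])
    .reset_index(drop=True)
)
print(result)
```

Note that the running count matches the label in "data N" only when the blocks are numbered consecutively from 1 and no real data row has an empty datum cell.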

import pandas as pd

# Step 1: Read the Excel file (replace 'your_file_path.xlsx' with the actual file path)
file_path = 'your_file_path.xlsx'

# Assuming the data spans multiple blocks in a single sheet
df = pd.read_excel(file_path, header=None)

# Step 2: Identify blocks of data and label them
# A block starts with "data X" where X is the block number
df['block'] = df[0].str.extract(r'data (\d+)', expand=False).ffill()

# Step 3: Remove rows that are not part of the main data
# (the "data X" title rows and the repeated "id datum" header rows)
df = df[~df[0].str.contains('data', na=False) & df[0].ne('id')].copy()

# Step 4: Rename columns and clean up the DataFrame
df.columns = ['id', 'datum', 'block']
df = df.dropna(subset=['id', 'datum'])

# Step 5: Clean and format the columns
df['id'] = df['id'].astype(int)
df['datum'] = pd.to_datetime(df['datum'], errors='coerce').dt.strftime('%Y.%m.%d')
df['block'] = df['block'].astype(int)
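
The label-extraction idea can also be checked on in-memory data. This is a sketch under stated assumptions, not a definitive implementation: the frame below is hand-built to mimic what `pd.read_excel(file_path, header=None)` might return for the sample, and the names `first`, `keep`, and `out` are made up for illustration. The block column is named `data` here to match the expected output in the question. The repeated id/datum header rows have to be dropped along with the title rows, since the string 'id' cannot be cast to int.

```python
import pandas as pd

# Hand-built stand-in for pd.read_excel(file_path, header=None).
df = pd.DataFrame(
    [
        ["data 1", None],
        ["id", "datum"],
        [2000, "2024.09.02"],
        [2903, "2024.09.02"],
        ["data 2", None],
        ["id", "datum"],
        [4000, "2024.09.02"],
        [4001, "2024.09.02"],
    ]
)

# Work on a string view of the first column so the .str methods see text only.
first = df[0].astype(str)

# Pull N out of each "data N" title row and carry it down over the block.
df["block"] = first.str.extract(r"^data (\d+)$", expand=False).ffill()

# Keep only real data rows: drop the titles and the repeated headers.
keep = ~first.str.startswith("data") & first.ne("id")
out = df[keep].copy()
out.columns = ["id", "datum", "data"]

# Cast the numeric columns once only data rows remain.
out["id"] = out["id"].astype(int)
out["data"] = out["data"].astype(int)
out = out.reset_index(drop=True)
print(out)
```

Unlike a running count, this reads the number from the title itself, so it would still label the blocks correctly if they were numbered out of order.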